Introducing The PlacePicker: What you can build with Iggy
Deciding where to live is really hard. It doesn’t need to be.
How many tabs did you have open, or how many Google searches did you do, the last time you moved to a new place? If you’re like us, the answer is probably simple: “way too many”. When it comes to location, we often have preferences but no way to articulate them in the places where we make decisions. Want to look for apartments near coffee shops? Airbnbs in the woods? Good luck. You need to leave the site, open maps in various tabs, read blogs and listicles, and maybe you’ll have energy left to cross-reference and book that listing before someone else does. (✋ if it’s happened to you!)
We built PlacePicker to demonstrate how Iggy can be used by developers to help solve this kind of decision problem. The idea is simple: people have preferences that play a role in where they live or travel. Let them tell you those preferences (and how much those preferences matter), collect data that represent those preferences, and use that data to help them choose a place to live or travel.
For PlacePicker, we started with some things that mattered: access to coffee shops, libraries, big trees, and economic vitality. Using 50k randomly sampled residential addresses across all of SF, we calculated each address’s ‘score’ on each preference and created a simple algorithm to combine preferences into an overall score. We then plotted the scores on an interactive map, making the choice of where to live or travel in SF a much more intuitive and fun experience.
Below we’ll walk you through how we did this so you can build something similar.
A few months ago, we had a prototype for the PlacePicker. Lindsay (former data scientist, definitely not a developer!) built a very basic UI with Dash in Python that allowed users to select a single preference for where they live and see how different census tracts compared to one another on that preference. It also had the option of showing a summary score calculated by weighing all preferences equally and… adding them up.
We wanted to expand this concept to:
- Allow users to select multiple preferences, AND ‘over’ or ‘under’ weight them compared to other preferences
- Calculate a score that best represented the ‘sum’ of their preferences
- Present the data in a form that was easy to understand and consume
- Be performant while displaying data for the entirety of San Francisco
Since we wanted a map-based single-page app, we decided to go with the Mapbox GL JS library with the NextJS React framework. This allowed us to get a functional page up and running pretty quickly with minimal effort so we could focus on working with the data.
Parsing the Data
We started with about 50k randomly sampled addresses in SF and enriched them using the IggyEnrich API endpoints.
- Lookup API for raster data (gridded images, a.k.a. a really annoying format to work with). This means we identified the value in the raster grid/pixel/cell containing each address and added it to the dataframe. We used this for determining tree cover and noise levels at each address.
- Proximity API to find the features nearest to each address. We used this for determining the distance to schools, coffee shops, grocery stores, and libraries.
- Buffer API for determining whether a feature fell within a predetermined distance of our addresses. We used this for determining whether bike lanes and transit stops were within a specified distance of each address, as well as for building our clean streets metric.
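As a rough sketch of how the three enrichment types map onto requests, the snippet below builds one request spec per type for a single address. The endpoint names, parameter names, and dataset/feature labels here are invented for illustration and are not the actual IggyEnrich API; consult the Iggy docs for the real interface.

```python
def build_enrichment_requests(lat, lng):
    """Build one hypothetical request spec per enrichment type
    for a single address (all field names are illustrative)."""
    return [
        # Lookup: read the raster cell value at the address
        {"endpoint": "lookup", "lat": lat, "lng": lng,
         "dataset": "tree_cover"},
        # Proximity: distance to the nearest feature of a given type
        {"endpoint": "proximity", "lat": lat, "lng": lng,
         "feature": "coffee_shop"},
        # Buffer: does a feature fall within a fixed radius?
        {"endpoint": "buffer", "lat": lat, "lng": lng,
         "feature": "transit_stop", "radius_m": 400},
    ]

specs = build_enrichment_requests(37.7749, -122.4194)
```

In the real pipeline, each address row in the dataframe gets one new column per enrichment result.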
If you’ve ever tried to do raster analysis (or any geospatial analysis, really) without training, you’ll appreciate how easy our APIs are. You don’t need to know anything about raster files or vector files or projections or coordinate reference systems. You don’t need to do spatial overlays or create buffers or use gdal. You just have to tell us how you want to enrich your lat/lngs.
After the enrichments, we ended up with a dataframe full of data on different scales. This isn’t great, because as soon as you try to ‘add up’ the different datapoints, you’ll skew toward the preferences that happen to be on the largest scale. To solve this, we transformed each of the values to a decile scale.
Lindsay learned that as much as she may like floating point precision, Ben didn’t, because it can blow up processing times on the front end. It’s nice to know that this kind of granularity is there if we need it, but for our app we rounded these values to the nearest hundredth to make the data easier to process.
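Both steps can be sketched in a few lines of pure Python (the real pipeline used a dataframe; this just illustrates the idea, with no tied values in the sample data):

```python
import math

def to_deciles(values):
    """Rescale raw values (any units) onto a 1-10 decile scale,
    so metrics on different scales become comparable."""
    ranked = sorted(values)
    n = len(values)
    deciles = []
    for v in values:
        pct = (ranked.index(v) + 1) / n   # percentile rank of v
        deciles.append(min(10, math.ceil(pct * 10)))
    return deciles

# Distances in meters would otherwise swamp a 1-10 index when summed.
distances = [120, 450, 90, 2000, 300, 75, 640, 1500, 220, 980]
print(to_deciles(distances))

# Rounding scores to the nearest hundredth keeps payloads small
# without giving up meaningful precision.
print(round(0.73921858, 2))  # 0.74
```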
Generating the Visual Layer
The first iteration of the PlacePicker displayed scores aggregated to census tract boundaries. If you’ve worked with sub-city-level data in the US before, you’ve probably come across data aggregated this way. But just because it’s the default doesn’t mean we should accept it: most people have no idea what a census tract is. We decided that if we have highly granular data, we should strive to maintain that granularity as much as possible while still allowing a performant app. To be clear, we’re not biasing toward granularity to be contrarian. We’re doing it because aggregating to an arbitrary shape like a census tract obscures the very variation and local detail that we want to call attention to.
We chose hexbins to group the data points because they let us group in a way that is sensitive to closeness. The hexbin grid is generated by the server, using Turf.js (turfjs.org), whenever the user sets or changes a preference, and subsequent requests for the same preferences are cached so that we’re not generating the same set of data repeatedly.
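Turf.js handles the hex grid for us on the server, but the underlying idea, snapping each point to its nearest hexagon center, can be sketched in pure Python using standard axial-coordinate rounding for a pointy-top grid (coordinates here are arbitrary planar units, not lon/lat degrees):

```python
import math

def hex_round(q, r):
    """Round fractional axial coordinates (q, r) to the nearest hex."""
    s = -q - r
    rq, rr, rs = round(q), round(r), round(s)
    dq, dr, ds = abs(rq - q), abs(rr - r), abs(rs - s)
    # fix the component with the largest rounding error so q + r + s == 0
    if dq > dr and dq > ds:
        rq = -rr - rs
    elif dr > ds:
        rr = -rq - rs
    return (rq, rr)

def point_to_hex(x, y, size):
    """Assign a planar (x, y) point to a pointy-top hexagon of the
    given size, returning the hexagon's axial (q, r) coordinates."""
    q = (math.sqrt(3) / 3 * x - y / 3) / size
    r = (2 / 3 * y) / size
    return hex_round(q, r)

# Nearby points land in the same bin, so their scores get aggregated.
bins = {}
for x, y in [(0.1, 0.05), (-0.1, 0.02), (1.7, 0.1)]:
    bins.setdefault(point_to_hex(x, y, size=1.0), []).append((x, y))
```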
[Here’s a really good write-up on why hexbins are ideal for this kind of data visualization], as well as how they compare to other binning methods.
Weighing User-Selected Priorities
After settling on hexbins as a way of displaying the values on the map, we needed to calculate how each hexbin is scored and how those scores are distributed across a scale.
Users are able to set a priority for each metric (tree cover, access to coffeeshops, etc.), which can be set to a value of 0, 1, or 2. These priority values are then factored into each corresponding metric’s value for each geographic point in a hexbin. Prioritized metric values are then summed up to generate a score for each hexbin (non-prioritized metrics are discarded), which is then normalized into a 0-1 scale.
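A minimal sketch of that linear scoring, assuming decile-scaled metric values and made-up metric names:

```python
def hexbin_score(metric_values, priorities):
    """Combine per-metric decile values (1-10) for a hexbin into one
    score. Metrics with priority 0 are discarded; priority 2 counts
    double. The result is normalized into a 0-1 scale."""
    total = 0.0
    max_total = 0.0
    for name, value in metric_values.items():
        weight = priorities.get(name, 0)  # user-set priority: 0, 1, or 2
        total += weight * value
        max_total += weight * 10          # deciles top out at 10
    return total / max_total if max_total else 0.0

priorities = {"tree_cover": 2, "coffee_shops": 1, "noise": 0}
print(hexbin_score({"tree_cover": 8, "coffee_shops": 3, "noise": 9},
                   priorities))  # 19/30 ≈ 0.63; noise is ignored
```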
Doing the Maths
Scoring each hexbin with a linear function seems simple enough, but in practice, when combining preferences, a single metric could pull a hexbin’s score to the very top of the scale even when a secondary priority scored very low (figure 1). This becomes apparent when setting certain metrics to a higher priority (figure 2). Ideally, we want to give results that are not too prescriptive, scaling the score for each metric so that higher values are still a factor but don’t dominate the results.
Using a base-2 logarithmic function to scale the score values should help damp the effect of higher metric values, but it flattened the range of values too much across the scale. This made it hard to see any variation between high and low ratings.
We needed a scale function whose output grows at a diminishing rate as the input approaches 100. Raising the initial log function to the power of 2 gave us a more desirable scale: it extended the threshold for higher-ranking areas while de-emphasizing low values.
The following charts show the output scores (y) of the three functions for input value x. As x gets closer to 100, the resulting value still increases, but with diminishing returns. This is especially apparent when a, the metric’s priority value, is set to 2: without scaling, those metrics tend to get scored much higher than intended compared to metrics with a priority value of 1.
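As a sketch of the three candidate curves, assuming metric scores x in 0–100 and a priority multiplier a (the exact constants and offsets in the app may differ):

```python
import math

def linear(x, a=1):
    """Raw linear scoring: high metrics dominate the sum."""
    return a * x

def log_scale(x, a=1):
    """Base-2 log damps high values, but flattens the range too much."""
    return a * math.log2(1 + x)

def log_squared(x, a=1):
    """Squaring the log restores spread at the top of the range
    while still yielding diminishing returns near 100."""
    return a * math.log2(1 + x) ** 2

for x in (10, 50, 90, 100):
    print(x, round(log_squared(x), 1))
```

The `1 + x` offset is just a guard so the log is defined at x = 0.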
Wrapping it up
Because we’re handling a sizable amount of data, it’s important to consider how we manage it on both the server and the client side.
Deciding the Hexbin Size
When we generate the hexbins on the server, points are compiled and binned into hexagons. The final size of the dataset sent to the client can be controlled by setting the size of each hexagon.
Larger hexagons mean fewer data points, which can result in a smaller payload, but the rendered layer becomes less useful because it has lower resolution at the neighborhood level. Smaller hexagons provide more granular information, but at the expense of more data points, a larger payload, and more resources required to render.
We ultimately decided on a hexbin size of 0.15km, which is close to the average length of a block in San Francisco (about 0.144km on the long side) and is performant, weighing in at around 200kb per request. This struck a good balance between performance and usability – a hexbin resolution below block level starts to become less useful in this context while incurring larger costs in payload size.
Because users can change the priority values of metrics, we need to generate and return a new hexbin dataset to be rendered every time these parameters change. We don’t want to regenerate hexbin data that the server has already produced, so we cache these responses. Caching saves the client from having to download data it’s previously downloaded, and saves the server from processing a request it’s already served.
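The real server is Node, but the caching idea can be modeled with Python’s `functools.lru_cache`: key the cache on the full set of priority values, so an identical preference selection is served without regenerating hexbins. The function body here is a stand-in for the expensive scoring-and-binning work.

```python
from functools import lru_cache

calls = {"count": 0}  # instrumentation so we can see cache hits

@lru_cache(maxsize=128)
def hexbin_layer(priorities):
    """priorities: a hashable, sorted tuple of (metric, weight) pairs,
    which doubles as the cache key."""
    calls["count"] += 1
    # ... stand-in for the expensive part: score and bin ~50k points ...
    return f"geojson for {dict(priorities)}"

key = tuple(sorted({"tree_cover": 2, "coffee_shops": 1}.items()))
hexbin_layer(key)   # computed
hexbin_layer(key)   # served from cache, no recomputation
```

Sorting the pairs before building the key matters: it makes `{a: 1, b: 2}` and `{b: 2, a: 1}` hit the same cache entry.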
Like what you’ve read here? This is just the beginning. At Iggy we want to own the future of location data and products. We’d love to have you join us.
What will you create with Iggy?