At Iggy, we process information about locations from hundreds of datasets in all shapes and sizes. From relatively small datasets, like the FAA’s directory of airports, to USGS’s Geodatabase of all the protected areas in the United States, to OpenStreetMap’s open source project of a billion planet-wide features, we’ve done it all.
Iggy's vision is to make data about the world easily accessible so that our customers can build great products, models and analyses. But the world is constantly changing and there is no single best source of all geographic information, so how do we do this? We consider checking and improving data quality a core part of our work.
Here’s how it works
It starts with a big question, like “Where are all the lakes and rivers in the US?” Then we look for possible sources. A quick Google search might turn up this USGS source, but as a team that’s been into geo data for ages, we also know we can always cross-check OpenStreetMap, or maybe another free project like NaturalEarth. But wait, actually there is this much higher resolution file from USGS. Why didn’t they mention that it was available in the original link? Good thing we didn’t go with the first link, because while it looks pretty reasonable:
You start to notice some issues when you zoom in and compare it (foreground) against a basemap (background) of satellite imagery:
Not only are the lines of the main lake shown super rough when we use data from the first USGS link, but there also seem to be a ton of lakes that just aren’t provided in the first file at all! We always evaluate multiple options for data sources and pick the one with the best and most accurate coverage for the questions we are trying to answer. We also look for sources that don’t just provide locations, but also include information about the locations, like categories and tags that we can use to provide more context.
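To make that kind of comparison a bit more concrete, here is a minimal sketch of how two candidate sources could be summarized side by side. It is not our actual tooling: the function name `summarize_source` and the input format (each feature as a plain list of coordinate pairs) are assumptions made for illustration. Feature count hints at coverage, and vertices per feature is only a rough proxy for how detailed the outlines are.

```python
def summarize_source(features):
    """Summarize one candidate source.

    `features` is assumed to be a list of outlines, where each outline
    is a list of (lon, lat) tuples. Both the name and the format are
    hypothetical and chosen only for this example.
    """
    feature_count = len(features)
    total_vertices = sum(len(outline) for outline in features)
    # Guard against an empty source so we never divide by zero.
    mean_vertices = total_vertices / feature_count if feature_count else 0.0
    return {"feature_count": feature_count, "mean_vertices": mean_vertices}


def compare_sources(sources):
    """Return a summary per source name so candidates can be eyeballed together."""
    return {name: summarize_source(features) for name, features in sources.items()}
```

Numbers like these only narrow down the candidates. Checking the shapes against imagery, as in the screenshots above, is still what tells you whether the extra features and vertices are actually right.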
Fortunately, sources often DO provide additional metadata about their locations. These can be extremely helpful, like how USGS’s dataset tags protected areas with various categories, like “National Park”, “Agricultural Easement”, “Marine Protected Area”, and so on. Actually they included more than 60 different categories, but we thought that was overwhelming, so we combined them into a handful of categories with much broader coverage, making downstream use much more straightforward.
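As a rough illustration of that consolidation step, the sketch below maps fine-grained source categories onto a handful of broad ones. The three source categories come from the examples above, but the broad group names (`park`, `easement`, `marine`, `other`) and the function names are made up for this example and are not our real taxonomy.

```python
from collections import Counter

# Hypothetical mapping from source categories to broad groups.
# The real dataset has 60+ categories; only the three named above appear here.
BROAD_CATEGORY = {
    "National Park": "park",
    "Agricultural Easement": "easement",
    "Marine Protected Area": "marine",
}


def consolidate(category, default="other"):
    """Return the broad group for a source category, or `default` if unmapped."""
    return BROAD_CATEGORY.get(category.strip(), default)


def summarize(categories):
    """Count how many features land in each broad group."""
    return Counter(consolidate(c) for c in categories)
```

Routing unmapped values to an explicit `other` bucket, rather than dropping them, keeps new or misspelled source categories visible so they can be reviewed later.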
Unfortunately, like the shapes themselves, the metadata is also of highly variable quality. For instance, one of our data partners, who shall remain nameless, categorized a number of Japanese steakhouses as ‘services for the elderly’, rather than restaurants (don’t worry, we flagged it for them). Our point is not to criticize our sources – we couldn’t do what we do without them – but rather to acknowledge that data management is very hard. So we always approach data with skepticism and strive to improve it.
Yep, you read that correctly. We always assume our sources are incorrect or incomplete, and we lean on outside sources like satellite and other mapping imagery, along with statistical analysis, local knowledge, and good old-fashioned data sleuthing, to find and correct issues. For instance, we noticed the issue with the Japanese steakhouses while looking at examples of business names in each of our business POI categories. We always spend hours examining row-level and aggregated data from each of our sources so you don’t have to.
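The kind of spot check that surfaced the steakhouse problem can be sketched in a few lines. This is a simplified stand-in rather than our internal tooling: the function name `sample_names_by_category`, the `(name, category)` row format, and the default sample size are all assumptions for illustration.

```python
import random
from collections import defaultdict


def sample_names_by_category(rows, k=5, seed=0):
    """Pick up to `k` example names from each category for manual review.

    `rows` is assumed to be an iterable of (name, category) pairs.
    A fixed seed makes the sample reproducible between runs.
    """
    rng = random.Random(seed)
    grouped = defaultdict(list)
    for name, category in rows:
        grouped[category].append(name)
    # Sample without replacement; small categories are returned in full.
    return {
        category: rng.sample(names, min(k, len(names)))
        for category, names in sorted(grouped.items())
    }
```

Reading a handful of names under each category heading is often enough for a human to notice that something like a steakhouse does not belong under services for the elderly.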
Finally, we evaluate how each source and category fits into our existing model of the world. Should these features exist in a new category of their own? Do we already have a category these new features fit well into? If so, should we replace the existing data or layer the two together? There aren’t definitive answers to these questions; ontology has been debated since ancient times and is very much a philosophy and not a science.
Once we’ve done all the work here, we run data through our in-house modeling suite. This is an internal tool we’ve built to measure the impact of data quality updates on predictive models. A future blog post will cover this tool in detail, so stay tuned!