Want to use off-the-shelf OSM data for ML models in Real Estate? Hold your horses, for now

Getting geospatial data into ML models is hard. One reason is that there are few “canonical” sources of geospatial data at scale. OpenStreetMap (OSM) is understood as one such source for rideshare, but its potential in real estate use cases is largely unexplored. In this post we discuss some of the quality control (QC) work we do to improve off-the-shelf OSM data, and measure the impact of that work via a real estate pricing model benchmark.

Context

Water features (rivers, lakes, ponds, reservoirs and almost everything in between!) are valuable to our partners, and we recently released a dataset with updates to these features.

We decided to make the update after our quality checks revealed a large number of water bodies missing from the OSM dataset. Investigating further, we found that many water bodies in our incoming data were incompletely or incorrectly tagged. As a result, we decided to invest in improving the quality of our water features data.

What changed?

We expanded the range of tags we treat as relevant water features, began matching water feature names with regexes, and started actively excluding invalid or irrelevant features (a routine challenge with user-generated content!).
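The filtering logic described above can be sketched roughly as follows. This is a minimal illustration, not Iggy's actual pipeline: the tag sets and regex patterns here are invented for the example, and real OSM tagging is far messier.

```python
import re

# Hypothetical tag sets and name patterns -- illustrative only, not Iggy's actual rules.
WATER_TAGS = {
    ("natural", "water"),
    ("water", "pond"),
    ("water", "lake"),
    ("water", "reservoir"),
    ("waterway", "riverbank"),
}
NAME_PATTERN = re.compile(r"\b(lake|pond|reservoir|creek|river)\b", re.IGNORECASE)
EXCLUDE_PATTERN = re.compile(r"\b(pool|fountain|tank)\b", re.IGNORECASE)

def is_water_feature(tags: dict) -> bool:
    """Return True if an OSM feature's tags mark it as a relevant water body."""
    name = tags.get("name", "")
    # Actively exclude invalid or irrelevant features first.
    if EXCLUDE_PATTERN.search(name):
        return False
    # Check the expanded set of relevant tag key/value pairs.
    if any((k, v) in WATER_TAGS for k, v in tags.items()):
        return True
    # Fall back to a name-based regex for under-tagged features.
    return bool(NAME_PATTERN.search(name))

print(is_water_feature({"natural": "water", "name": "Shadow Cliffs Lake"}))        # True
print(is_water_feature({"name": "Heather Farm Pond"}))                             # True (name regex catches an untagged pond)
print(is_water_feature({"leisure": "swimming_pool", "name": "Community Pool"}))    # False (excluded)
```

The name-regex fallback is what recovers water bodies that are named like lakes or ponds but were never tagged as such; the exclusion pattern runs first so that obviously irrelevant features never slip through on name alone.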

The final result was a water features dataset that covers far more of the smaller ponds and lakes we were previously missing, and includes far fewer irrelevant water bodies.

The animation below compares water body coverage in the San Francisco Bay Area before and after the QC.

Figure 1: Before and after – water features in the San Francisco Bay Area pre and post water QC

To give some examples of updates: many smaller water bodies have been added in the East Bay, in areas including Concord and Pleasanton, and some river polygons in the East Bay have been restored as well. Partners that include proximity to water in their home price models will now have a better representation of water features in those models.

What’s the impact? QC improvements improve models!

After updating our features, we wanted to explore what kind of impact this had on a benchmark modeling task: real estate sales price prediction using data from Pinellas County, FL. We have used this dataset extensively in our internal benchmarking, and it is well suited to this analysis because this part of the country has many water bodies and is surrounded by coastline on three sides. We therefore hypothesized that models trained on the updated features would outperform those trained on the original features.

As we suspected, our water feature update had a significant impact on machine learning model performance. Interestingly, while the improvement held across model types, its magnitude varied by model type.

Figure 2: Average change in model performance by model type

| Model Type        | Change in MSE after Water Update |
|-------------------|----------------------------------|
| XGBoost           | -14%                             |
| Random Forest     | -4.23%                           |
| Linear Regression | -3.84%                           |
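For clarity, the "Change in MSE" values above are relative changes, where negative values mean the updated features reduced error. A minimal sketch of how such a figure is computed, assuming MSE is measured on the same held-out test set before and after the update (the example values are hypothetical):

```python
def mse_change_pct(mse_before: float, mse_after: float) -> float:
    """Relative change in mean squared error, as a percentage.

    Negative values mean the updated features reduced error.
    """
    return 100.0 * (mse_after - mse_before) / mse_before

# e.g. a hypothetical error drop from 100.0 to 86.0 is a -14% change
print(round(mse_change_pct(100.0, 86.0), 1))  # -14.0
```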

As you can see in Figure 2, all of the model types showed an improvement in performance when trained using the newly updated water features compared to using the original water features.

Here is an example of the input features that were used in one of the experiments:

Features = [
    "water_intersects_isochrone_walk_10m_True",
    "river_intersects_isochrone_walk_10m_False",
    "river_intersects_isochrone_walk_10m_True",
    "coast_intersecting_length_in_km_isochrone_walk_10m",
    "lake_pct_area_intersecting_boundary_isochrone_walk_10m",
    "poi_is_restaurant_count_per_capita_isochrone_walk_10m",
    "conservation_area_intersecting_area_in_sqkm_isochrone_walk_10m",
    "heated_area",
    "year_built",
    "log_acreage",
    "poi_count_isochrone_walk_10m",
    "perimeter_km_isochrone_walk_10m",
    "public_park_count_per_sqkm_isochrone_walk_10m",
    "fixtures"
]

The models were trained on a mixture of property metadata features and geospatial features, with an emphasis on water features such as `water_intersects_isochrone_walk_10m_True` and `river_intersects_isochrone_walk_10m_False`.
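To make the benchmark protocol concrete, here is a minimal sketch using synthetic data, assuming scikit-learn is available. The column names and price coefficients are invented for illustration and do not reflect the Pinellas County data; the point is simply that the same model is trained with and without a water feature and compared on held-out MSE.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the benchmark: property metadata plus a binary
# "water within a 10-minute walk" feature. All values are invented.
n = 2000
heated_area = rng.normal(1800, 400, n)
year_built = rng.integers(1950, 2020, n).astype(float)
near_water = rng.integers(0, 2, n).astype(float)
price = (100 * heated_area + 500 * (year_built - 1950)
         + 40_000 * near_water + rng.normal(0, 20_000, n))

def benchmark(feature_cols):
    """Train on a feature set and return held-out MSE."""
    X = np.column_stack(feature_cols)
    X_tr, X_te, y_tr, y_te = train_test_split(X, price, random_state=0)
    model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_tr, y_tr)
    return mean_squared_error(y_te, model.predict(X_te))

mse_without = benchmark([heated_area, year_built])
mse_with = benchmark([heated_area, year_built, near_water])
print(f"MSE change with water feature: {100 * (mse_with - mse_without) / mse_without:.1f}%")
```

Holding the model, split, and random seed fixed across the two runs isolates the contribution of the water feature itself, which mirrors the before/after comparison in Figure 2.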

To be frank, we did not expect to see improvements of this magnitude, so these results are very encouraging. We hope to continue performing these kinds of analyses as we do further quality assurance and update other OSM features in our data.

About Iggy

Iggy is a toolkit for data teams. It is used by data scientists, analysts and machine learning engineers to understand and leverage data about place to build better models and user-facing products. Its flagship datasets and tools provide instant access to hundreds of geospatial features used to improve location search, pricing, and home/listing recommendation ML models, and to develop unique customer-facing products. Iggy has been proven to speed up experimentation and time to value, and to improve ML models and core products for industry-leading companies in the real estate, private equity and travel sectors. Dive into some demos at https://github.com/askiggy/iggy-enrich-demos, check out our blog post on improving real estate pricing models with location data at https://medium.com/@acocos/faster-experimentation-for-location-data-iggy-metaflow-c3dd40c4f5e4, or contact us at https://www.askiggy.com/contact.
