Exploring Google Open Buildings data with Geowrangler.

Fork this project on GitHub here



  • 2023-Feb-28 JCP: Some colleagues have informed me that the “ghost neighborhoods” I reported are actually recent, existing neighborhoods that were not yet present in the default basemap used by Google in the embedded web app (although the footnote claims it’s (c) 2023). Google Open Buildings footprints detected these neighborhoods quite accurately! I apologize for the wrong results and have added edits to correct this. Thank you so much for your feedback!

Looking at maps is part of my daily job as a geospatial data scientist. One of our recurring project themes involves finding objects in satellite imagery: farms, roads, houses, etc. It may be easy for human eyes to identify these at a glance, but the same cannot be said for a computer. To do this, it needs a good set of images with annotations that tell which pixels are the objects and which ones are background.

Compiling a good training set is very tedious manual work. However, in recent years, wide-scale detection of building footprints has been made possible by the efforts of leading companies and recent advances in computing power. We can already use these footprints for model training if we can validate them over specific areas of interest.

In this blog post, I would like to present Google’s Open Buildings dataset and see how it performs over my home province, Rizal.

I also invite you to check how well it performs in your area of choice using the widget below:



What is the Open Buildings dataset?

On Oct 11 2022, Google announced in their blog that their Open Buildings product would now expand to cover South and Southeast Asia, and that includes the Philippines.

According to their documentation, it contains 817 million building detections, across a total area of 39.1 M sq km. Aside from the polygon of the footprint, they also provide (1) the building area, (2) a confidence score indicating the level of certainty that the footprint is a building, and (3) a Plus Code corresponding to the center of the building.
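To make the record layout concrete, here is a minimal sketch that parses one row of that shape using only the Python standard library. The column names are an assumption based on the published CSV schema (`area_in_meters` and `confidence` also appear in the snippets later in this post), the sample row is made up for illustration, and the Plus Code is a placeholder, so please check the official documentation before relying on it.

```python
import csv
import io

# Made-up sample row in the assumed column layout of an Open Buildings CSV.
# The Plus Code below is a placeholder, not a real location.
SAMPLE = """latitude,longitude,area_in_meters,confidence,geometry,full_plus_code
14.58,121.17,42.5,0.81,"POLYGON((121.17 14.58, 121.1701 14.58, 121.1701 14.5801, 121.17 14.5801, 121.17 14.58))",XXXXXXXX+XX
"""


def parse_buildings(text):
    """Parse CSV text into a list of dicts with typed fields."""
    rows = []
    for rec in csv.DictReader(io.StringIO(text)):
        rows.append(
            {
                # Centroid of the footprint as (latitude, longitude)
                "centroid": (float(rec["latitude"]), float(rec["longitude"])),
                "area_in_meters": float(rec["area_in_meters"]),
                "confidence": float(rec["confidence"]),
                # Footprint polygon kept as a WKT string
                "geometry_wkt": rec["geometry"],
                "full_plus_code": rec["full_plus_code"],
            }
        )
    return rows


buildings_sample = parse_buildings(SAMPLE)
print(buildings_sample[0]["area_in_meters"], buildings_sample[0]["confidence"])
```

For real analysis you would normally load the WKT column into a GeoDataFrame instead; the sketch only shows which fields come with each footprint.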

You may view the specs and download your copy by following the instructions in their blog.

Exploring the data using Geowrangler

For this blog post, I was eager to use Thinking Machines’ Geowrangler Python package for my geospatial analysis. I was part of the team that tested and enhanced some features of the current version.

With Geowrangler, it is very easy to generate statistics over geospatial data types (vector and raster). What I wrote in multiple lines of code before can now be done quickly in convenient one-liners. Please look at the repo notebooks for complete examples, but I’ll also show you some snippets in the following sections.

To get a high-level understanding of the dataset, we could answer some of these preliminary questions.

Q: How large are the buildings detected in Rizal?

The area distribution of the detected footprints is shown below. There were footprints detected with areas as small as 2 m² and as large as 7800 m². Here I show only up to 300 m², which covers 99% of all detected footprints.

Distribution of Open Buildings footprint areas in Rizal Province, Philippines

Half of the footprints are below 35 m², which seems too small a size for a majority chunk of footprints (presumably residential areas). Furthermore, about three-quarters of the footprints have area smaller than 64 m². Could these be misidentified buildings?
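Summary figures like the median, the three-quarters mark, and the 99% cutoff are percentiles of the area column. As a rough sketch (not the code from the repo notebooks), they can be computed with the standard library; `area_summary` is a hypothetical helper and the input list stands in for the `area_in_meters` values.

```python
import statistics


def area_summary(areas):
    """Return min, median, 75th and 99th percentile, and max of a list of areas."""
    # quantiles(n=100) returns 99 cut points; cuts[k - 1] is the k-th percentile
    cuts = statistics.quantiles(areas, n=100, method="inclusive")
    return {
        "min": min(areas),
        "median": cuts[49],
        "p75": cuts[74],
        "p99": cuts[98],
        "max": max(areas),
    }


# Toy input (NOT the Rizal data): the integers 1..101
print(area_summary(list(range(1, 102))))
```

With the real data, `p99` is the value I used to decide where to cut off the histogram.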

Distribution of confidence across building footprint areas

When we plot confidence across footprint area ranges, we see that the two are positively related: confidence tends to increase as footprint size increases. Footprints with area <30 m² typically have low to medium confidence, which suggests that many of them may be false positives. Conversely, the detection model assigned high confidence to larger building footprints (>100 m²).
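The comparison above boils down to grouping footprints into area bands and averaging the confidence in each band. Here is a minimal sketch of that idea; the band edges (30 m² and 100 m²) come from the text, while the function name and the toy records are made up for illustration.

```python
import bisect
from collections import defaultdict


def mean_confidence_by_band(records, edges=(30, 100)):
    """Average confidence per area band.

    records: iterable of (area_m2, confidence) pairs.
    edges: ascending band boundaries; band 0 is area < edges[0],
           band 1 is edges[0] <= area < edges[1], and so on.
    """
    sums = defaultdict(lambda: [0.0, 0])  # band -> [confidence total, count]
    for area, confidence in records:
        band = bisect.bisect_right(edges, area)
        sums[band][0] += confidence
        sums[band][1] += 1
    return {band: total / n for band, (total, n) in sorted(sums.items())}


# Toy records (NOT the Rizal data)
toy = [(10, 0.6), (20, 0.7), (50, 0.8), (150, 0.9)]
print(mean_confidence_by_band(toy))
```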

Q: Which city/municipality had the most identified buildings?

For this analysis, we use the Geowrangler one-liner vector_zonal_stats to obtain statistics using footprints that fall within the boundaries of Rizal’s cities and municipalities.

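import geowrangler.vector_zonal_stats as vzs

# mun_bounds and buildings are GeoDataFrames loaded earlier in the notebook
data_gdf = vzs.create_zonal_stats(
    mun_bounds,  # admin boundaries
    buildings,  # building footprints
    overlap_method="intersects",  # method for determining overlap
    aggregations=[
        # number of footprints per city/municipality
        dict(func="count", output="bldg_count", fillna=True),
        # mean confidence and mean footprint area per city/municipality
        dict(column="confidence", func="mean", output="bldg_confidence_mean", fillna=True),
        dict(column="area_in_meters", func="mean", output="bldg_area_mean", fillna=True),
    ],
)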

This generates a GeoDataFrame with the relevant statistic per city/municipality. Let’s look at the results below.
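Before looking at the results, it may help to see what an “intersects” zonal statistic computes conceptually. The sketch below is not Geowrangler’s implementation: it uses plain bounding boxes instead of real polygons, and `boxes_intersect` and `zonal_stats` are hypothetical helpers. It only illustrates the count and mean aggregations, with empty zones filled with 0.

```python
def boxes_intersect(a, b):
    """True if two (minx, miny, maxx, maxy) boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def zonal_stats(zones, footprints):
    """Count footprints and average their attributes per zone.

    zones: {name: box}
    footprints: list of dicts with 'box', 'confidence', 'area_in_meters'
    """
    out = {}
    for name, zone_box in zones.items():
        hits = [f for f in footprints if boxes_intersect(zone_box, f["box"])]
        n = len(hits)
        out[name] = {
            "bldg_count": n,
            # Zones without any footprint get 0.0 (a fillna-like default)
            "bldg_confidence_mean": sum(f["confidence"] for f in hits) / n if n else 0.0,
            "bldg_area_mean": sum(f["area_in_meters"] for f in hits) / n if n else 0.0,
        }
    return out


# Toy zones and footprints (made up for illustration)
toy_zones = {"A": (0, 0, 10, 10), "B": (10, 0, 20, 10), "C": (50, 50, 60, 60)}
toy_footprints = [
    {"box": (1, 1, 2, 2), "confidence": 0.8, "area_in_meters": 40.0},
    {"box": (9, 1, 11, 2), "confidence": 0.6, "area_in_meters": 100.0},
    {"box": (15, 5, 16, 6), "confidence": 0.9, "area_in_meters": 20.0},
]
print(zonal_stats(toy_zones, toy_footprints))
```

Note that in this sketch a footprint straddling a boundary is counted in both zones, which is worth keeping in mind when summing counts across zones under an intersects rule.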

Building footprint counts per city/municipality in Rizal Province

The City of Antipolo leads with 175,000+ detected footprints, which is twice the count of the next-in-rank municipalities: Rodriguez, Cainta, Taytay and Binangonan. The municipalities with the fewest detected footprints are Pililla, Cardona and Jala-Jala, which are also the least populous in Rizal.

Building footprint counts per area for each city/municipality in Rizal Province

However, when we plot the number of building footprints per unit area, Cainta now leads with more than 4,000 footprints per km². Taytay and Binangonan are still among the top, while the City of Antipolo drops to 7th because of its large land area. The municipality with the fewest detected footprints per unit area is now Tanay, with <100 footprints per km².
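The re-ranking is simply the count divided by land area. A minimal sketch with made-up zone names and numbers (not the Rizal figures) shows how a zone with the most footprints can fall behind a smaller, denser one:

```python
def footprints_per_km2(counts, areas_km2):
    """Rank zones by footprint density (footprints per square kilometer)."""
    density = {name: counts[name] / areas_km2[name] for name in counts}
    # Highest density first
    return sorted(density.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical zones: "Big" has more footprints but far more land
toy_counts = {"Big": 1000, "Small": 400}
toy_areas_km2 = {"Big": 100.0, "Small": 10.0}
print(footprints_per_km2(toy_counts, toy_areas_km2))
```

With real boundaries, the areas should be computed after reprojecting the polygons to a projected coordinate system in meters, since areas computed directly on latitude/longitude degrees are not meaningful.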

Q: Which locales have the most identified buildings?

Map of building footprint counts per city/municipality in Rizal Province

When plotted on a map, we see that western Rizal has more detected footprints than eastern Rizal, which makes sense since the western areas are more densely built up because they are closer to Metro Manila.

To check in better detail, we can also get statistics for areas smaller than city/municipal boundaries.

This is also very easy to do in Geowrangler. First, we set up a BingTileGridGenerator at zoom level 15, and then use that instance to generate the desired grid within the boundaries using generate_grid.

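from geowrangler import grids

# Set up a generator for Bing tiles at zoom level 15
bing_tile_grid_generator = grids.BingTileGridGenerator(15)
# Generate the grid cells that cover the admin boundaries
bing_tile_gdf = bing_tile_grid_generator.generate_grid(mun_bounds)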
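For intuition on what a zoom level 15 Bing tile is, the sketch below implements the standard Bing Maps tile-system math in plain Python: mapping a latitude/longitude to tile coordinates, encoding them as a quadkey, and estimating the tile width on the ground. This is not Geowrangler’s internal code, and the function names are hypothetical.

```python
import math

EARTH_RADIUS_M = 6378137.0  # WGS84 equatorial radius used by Web Mercator
MAX_LAT = 85.05112878  # latitude limit of the Web Mercator projection


def latlon_to_tile(lat, lon, zoom):
    """Return the (x, y) tile that contains a point at the given zoom level."""
    lat = max(min(lat, MAX_LAT), -MAX_LAT)
    n = 2 ** zoom  # number of tiles along each axis
    x = (lon + 180.0) / 360.0
    sin_lat = math.sin(math.radians(lat))
    y = 0.5 - math.log((1 + sin_lat) / (1 - sin_lat)) / (4 * math.pi)
    tile_x = min(n - 1, max(0, int(x * n)))
    tile_y = min(n - 1, max(0, int(y * n)))
    return tile_x, tile_y


def tile_to_quadkey(tile_x, tile_y, zoom):
    """Interleave the bits of x and y into a base-4 quadkey string."""
    digits = []
    for i in range(zoom, 0, -1):
        mask = 1 << (i - 1)
        digit = 0
        if tile_x & mask:
            digit += 1
        if tile_y & mask:
            digit += 2
        digits.append(str(digit))
    return "".join(digits)


def tile_width_km(lat, zoom):
    """Approximate east-west ground width of one tile at a given latitude."""
    circumference_m = 2 * math.pi * EARTH_RADIUS_M
    return math.cos(math.radians(lat)) * circumference_m / (2 ** zoom) / 1000.0


# Example point (coordinates chosen only for illustration)
tx, ty = latlon_to_tile(14.6, 121.2, 15)
print(tile_to_quadkey(tx, ty, 15), round(tile_width_km(14.6, 15), 2))
```

At zoom level 15 a tile is a little over a kilometer wide near the equator, which is fine enough to separate neighborhoods but coarse enough to keep the grid manageable for a whole province.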

Gridded statistics of building footprint counts in Rizal Province

The map plots of footprint count, median area, and median confidence per grid cell are shown above.

  • The locales with the highest counts are concentrated along Rizal’s western border with Metro Manila, the densest being the neighborhoods near the old Taytay Market.

  • The median area map reflects the predominance of small (<30 m²) footprints that we saw in the bar plots. The cells with higher values are gated subdivisions in Antipolo, and the bright spot in the middle is the location of Robina Farms.

  • Most footprints have confidence in the 0.75-0.85 range.

Performance and spot-checking over Rizal province

I inspected a few areas to see how well the footprint captures the actual building boundaries as seen in the satellite images.

Click on the links to get redirected to an interactive map showing the footprints overlaid on Google Maps.

Strengths

  1. I’m amazed by the very accurate outlines in this urban Antipolo subdivision. This is as good as a human’s annotation!

  2. It does an excellent job in this Taytay area, which has a variety of building sizes and confidence values.

  3. This San Mateo neighborhood, with smaller structures built of light materials, was also captured well, although with less confidence.

  4. It also has some skill in low-density rural neighborhoods like this one in Tanay, Rizal, and was able to avoid tree canopy and vegetation.

  5. I’m also impressed that it detected fisherfolk shelters at the fish ponds along Laguna de Bay in Cardona. Pretty robust!

Weaknesses

  1. Entire ghost neighborhoods were detected over this part of Teresa, where the model assigned high confidence to footprints detected over barren land. There are signs that new subdivisions are being built around the area, but none for this particular one. I don’t know how many neighborhoods like this exist in the whole dataset.

2023-Feb-28 EDIT: This satellite image is older than the data used by the footprint model, hence the mismatch. When we look at this in Bing Maps (see image below), it shows that this neighborhood actually exists and its footprints are actually well-captured.

  2. It had some skill in detecting row houses in this part of Cardona, but it still detects whole blocks on barren land. It seems like the model has learned to recognize the shape of roads in the area and, in turn, infers that buildings of this type must exist there, although the image says otherwise. What do you think?

2023-Feb-28 EDIT: Same case here as above. Bing Maps (see image below) contains this neighborhood and the footprints are accurate.

  3. Dense and unordered neighborhoods without separating roads are understandably difficult to capture, such as this place in Binangonan.

  4. Informal settlements are also not well-detected, like this area in Antipolo.

I’m interested in Open Buildings and Geowrangler! How can I learn more?

What I did here is by no means a comprehensive check of this dataset. It is a very promising dataset, but before you use it as training data, please do a quick visual inspection of the locale you intend to use. This way, you can avoid training your model on false detections.

You may refer to the Google Open Buildings FAQ for more details.

I also encourage you to try Geowrangler. Aside from computing statistics and generating grids as I did here, it can also validate geometries, analyze raster data, and even download the most commonly used datasets in the industry (OpenStreetMap, Ookla, and Nightlights (soon!)).

Please check out these other features, and if you want to raise a discussion or contribute to the code, you are more than welcome to write to us here.

I hope this was useful! Thank you so much and see you in the next blog!