lxbarth's Diary

Importing 1 million New York City buildings and addresses

Posted by lxbarth on 21 August 2014 in English.

As of June, New York City buildings and addresses have been fully imported to OpenStreetMap. While we are tackling remaining cleanup tasks I wanted to share a full recap of the effort. I am very happy with the overall result. There are lessons to be learned here from what went well but also where we could have done better - read on for the details.

More than 20 people - volunteers and members of the Mapbox team - spent more than 1,500 hours writing proposals, discussing, programming, uploading, processing and reviewing. Between September 2013 and June 2014 we imported 1 million buildings and over 900,000 addresses. We fixed over 5,000 unrelated map issues along the way.

Here are screenshots of the resulting work:

Building coverage on Manhattan island, the southern tip of the Bronx to the northwest and Wards island to the right.

====

JFK airport buildings in Queens, bordering on the Hamilton Beach neighborhood to the left and South Ozone Park to the north.

====

Coverage around Battery Park and Wall Street in Manhattan. This is an area that already had many buildings. We filled in the gaps and replaced buildings where the New York City data set was clearly better.

====

We imported over 900,000 addresses. Here is an example of the Park Slope neighborhood in Brooklyn.

====

Buildings contain height information and render nicely as seen here on this example of downtown Brooklyn on Fmap.

====

The import covers all of New York City’s five boroughs.

====

Overview

This is a full writeup sharing my experience with the New York City import in the hope that there is a valuable lesson, good idea, or line of code for you to walk away with. Note that this post is very specific to the work in New York City. If you’re planning to do an import, make sure to check out the Import Guidelines for a more universal checklist of how to go about imports.

If you’re looking for the 30-second version, I’d summarize my takeaways like this:

Importing is a lot of work. Make sure you have the time to commit.
Be prepared to continuously improve your conversion scripts and already uploaded data throughout the import.
Importing is a skill. It looks easy at first, but everyone involved in uploading will need proper support, advanced knowledge of mapping practices and data validation by peers.
Involve the community where possible. Clear and frequent communication is clutch.
Invest in your tools.

Read on for the deep dive.

OpenStreetMap as a collaboration space for citizens and government

Using New York City’s data for OpenStreetMap became possible thanks to then-mayor Michael Bloomberg’s open data policy. Local Law 11 of 2012 releases all New York City government data “without any registration requirement, license requirement or restrictions on their use” (23-502 d). This effectively puts the data in the public domain, making it compatible with OpenStreetMap’s contributor terms.

Both address point data and building data fall under this law and are available for download on New York City’s open data web site.

The way we used this data in OpenStreetMap is an illustration of how Bloomberg’s plan to stimulate the economy with open data is starting to pay off. This data in OpenStreetMap is now benefiting everyone using OpenStreetMap and this includes the New York City based startup Foursquare which is using OpenStreetMap data on its Mapbox powered maps.

But the relationship between OpenStreetMap and New York City should ideally be a two-way street. How can the creator and maintainer of the building and address datasets - New York City’s GIS department - benefit directly from their work being imported in OpenStreetMap? The vision of edits in OpenStreetMap directly helping improve a crucial government dataset is very promising. OpenStreetMap is a unique data collaboration platform while datasets like building or address catalogs are incredibly hard to maintain - even for a large municipal government like New York’s. How can government become a part of OpenStreetMap?

OpenStreetMap’s share-alike license means that OpenStreetMap data can’t be taken over directly into New York City public domain datasets, but we can use OpenStreetMap to find out where changes happened. We set up a daily change feed flagging modifications to buildings and addresses to subscribers. Here’s a copy of a change notification email as New York City GIS receives it every day:

Daily change notifications from OpenStreetMap, flagging building and address changes to New York City government.

The notification contains a list of relevant changesets from the previous day with a link to each modified building and address. We are currently assessing the utility of these emails. Another way of leveraging OpenStreetMap as a change signal would be to periodically extract all building and address data and identify all changes in a certain time frame at once.

All code powering the change feed is available as open source on GitHub. If you’d like to receive the New York City change feed notifications, please let me know. Happy to subscribe you.

Import procedure

To import New York City data we had to convert it to OpenStreetMap format first and cut it into bite-size chunks so we could review and import it manually, piece by piece. Once it was imported, a person other than the original importer would validate the data. This means reviewing it for errors and cleaning it up where needed.

Selecting a task on the tasking manager, opening existing OpenStreetMap data and opening the import data in JOSM.

Each participant would set up their workspace according to documentation we provided on GitHub. In the same document we laid out the actual import procedure. Some of the key items of the import procedure were:

Use a separate import account
Run full JOSM validation, fix all conflicts with existing data
But also fix all existing unrelated issues in the area
Spot check data - for instance, do street names line up?
Merge POIs where appropriate
In case of duplicate data, keep the best data if there is a clear difference. In case of any doubt, keep the local data.
Add a note where a local mapper could solve a problem

As we imported, we ran into a series of recurring issues that we shared in a common issues guide - a useful resource for training new mappers and agreeing on fixes for unclear situations.

Community import or not?

From the beginning, the import was planned as a community import. There is no standing definition of this practice, but the rough idea is that uploads to the map would be done predominantly by members of the local community familiar with the areas uploaded. Once we had started the import, we quickly ran into a series of issues.

For Mapbox data team members participating in the import full time it was very easy to outpace local volunteers by a huge factor. In addition, I underestimated the complexity of the actual review and upload work. While not hard, there was a certain learning curve, which meant that every new individual joining required significant training and support to get started - time that, plain and simple, someone had to spend. Add to this that the individual time commitment is huge. I estimate we spent about 1,500 hours among everyone involved - and this is on the conservative side. Assuming 20 people work on the import, each one of them would be looking at 75 hours on this project. Very few people spend this much time on OpenStreetMap in a year.

The pace of uploads turned out to be a key friction point. At the same time a series of data quality issues arose. This is why, a couple of months into the import, the loosely formed group around the project, including community members and myself, decided to pause the import and, when we restarted a month later, to slow it down and stop billing it as a community import. This would allow everyone to participate better and it would set expectations straight as to who was doing the uploading work. I think this adjustment was a good one. Overall it took us 10 months to get the job done - longer than I thought but still a pace that I was comfortable committing to in order to help finish the job. In the end a vast majority of uploads, validations and programmatic updates were done by the Mapbox team and I’m glad we had the opportunity to contribute.

Still, community involvement was clutch. The incredible input everyone gave, the many reviews, advice and personal time people invested was crucial to make this import a success. Everyone weighing in has helped make the resulting map better.

Sh** happens

We dealt with data corruption and conversion script bugs using GitHub issues throughout. Over the course of the import, we opened and closed 120 issues flagging suspicious data found in data reviews, sometimes working through protracted problems with New York City’s head of GIS directly chiming in and helping interpret data correctly.

Some of the issues we discovered required updates to data we had already imported. Once we were even a couple of days into the import, updating existing data manually was no longer an option. This is where automated edits came in, updating OpenStreetMap data programmatically. We captured all scripts for automated edits in the same code repository as the data conversion scripts. Some examples of programmatic updates are:

We fixed wrong tagging on school buildings where we tagged amenity=school instead of building=school.
We added ordinal suffixes like “th” in “4th”.
We expanded abbreviations we had overlooked like “Ft” to “Fort”.
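
To give a feel for the last two fixes, here is a minimal sketch of street name normalization. It is not the code from our repository: the function names and the small abbreviation table are made up for illustration, and a real table would need to be much longer and deal with ambiguous cases.

```python
# Hypothetical abbreviation table for illustration only.
ABBREVIATIONS = {"Ft": "Fort", "Ave": "Avenue"}


def ordinal(number):
    """Return the number with its English ordinal suffix, e.g. 4 -> '4th'."""
    if 10 <= number % 100 <= 20:
        suffix = "th"  # 11th, 12th, 13th and friends
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(number % 10, "th")
    return f"{number}{suffix}"


def normalize_street(name):
    """Expand known abbreviations and add ordinal suffixes to bare numbers."""
    words = []
    for word in name.split():
        if word.isdigit():
            words.append(ordinal(int(word)))
        else:
            # Look up the word without a trailing period; keep it unchanged
            # if it is not in the table.
            words.append(ABBREVIATIONS.get(word.rstrip("."), word))
    return " ".join(words)


print(normalize_street("4 Ave"))
print(normalize_street("Ft Hamilton Parkway"))
```

An automated edit would apply a function like this to the addr:street tag of already uploaded objects and upload only the objects whose value actually changed.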

We prepared this import well and we had good peer reviews on the imports list running up to the first uploads. We could head off many issues before we started importing. But in the end, the number of issues we encountered after we started was still an unpleasant surprise. Having gained a lot more experience with this import I am sure the next time we can avoid a series of pitfalls - but the need for being able to programmatically update data after it’s been uploaded is crucial for a successful import. You simply cannot plan for all eventualities and you need to be prepared to apply fixes as you go.

From this perspective, the next time I would want us to write data integrity tests from the get go. These tests would assert data quality on data before it is uploaded. This would allow us to be much more agile in updating and refactoring conversion scripts as we go.

Another set of tests would assert data quality of already uploaded data. This would help to identify existing systematic problems and catch data issues due to negligent uploads fast.
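
As a rough idea of what such a test could look like, here is a sketch that checks a chunk of OSM XML before upload. The two rules it enforces (building ways must be closed, a house number needs a street) and the name check_ways are assumptions picked for illustration, not checks we actually ran.

```python
import xml.etree.ElementTree as ET

# Tiny made-up sample: the first way is fine, the second one is broken.
SAMPLE = """<osm version="0.6">
  <way id="-1">
    <nd ref="-1"/><nd ref="-2"/><nd ref="-3"/><nd ref="-1"/>
    <tag k="building" v="yes"/>
    <tag k="addr:housenumber" v="12"/>
    <tag k="addr:street" v="4th Avenue"/>
  </way>
  <way id="-2">
    <nd ref="-4"/><nd ref="-5"/><nd ref="-6"/>
    <tag k="building" v="yes"/>
    <tag k="addr:housenumber" v="7"/>
  </way>
</osm>"""


def check_ways(osm_xml):
    """Return a list of (way id, problem) tuples for a chunk of OSM XML."""
    problems = []
    for way in ET.fromstring(osm_xml).iter("way"):
        tags = {tag.get("k"): tag.get("v") for tag in way.findall("tag")}
        refs = [nd.get("ref") for nd in way.findall("nd")]
        # A closed way needs at least 4 node refs, first equal to last.
        if "building" in tags and (len(refs) < 4 or refs[0] != refs[-1]):
            problems.append((way.get("id"), "building way is not closed"))
        if "addr:housenumber" in tags and "addr:street" not in tags:
            problems.append((way.get("id"), "house number without street"))
    return problems


for way_id, problem in check_ways(SAMPLE):
    print(way_id, problem)
```

The same function could run against freshly converted files and against data downloaded back from OpenStreetMap, which would cover both kinds of tests described above.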

So far, we have a rudimentary directory with validation scripts we started to build up during the import. There is a real need across the OpenStreetMap community to further develop and share easy to use tools to test and validate data. What if we could reuse the validators available in JOSM from the command line on arbitrary portions of OpenStreetMap data?

Data processing

To get source data ready for upload, a conversion script would download the data, split it, convert it and store the resulting files in OSM XML format on Amazon S3. We set up a tasking manager job that would expose each file as a task for people to import. To upload a dataset, a mapper would select a task, download the existing OpenStreetMap data and load the import data. We used the excellent JOSM editor to merge and review data before uploading to OpenStreetMap.

The entire data processing script is captured in a Makefile and can be run from download to upload to Amazon S3 with a single command. In sequence, the processing script would perform the following actions:

Download and unpack buildings (polygon data in shapefile format)
Download and unpack addresses (point data in shapefile format)
Reproject and simplify building geometries
Reproject addresses
Split buildings and addresses into bite-size chunks
Merge: Where only a single address is available for a building, merge the address attributes onto the building polygon.
Convert: Map attributes to OpenStreetMap tags, convert street name formatting and house number formatting and export in OSM XML format
Put to S3

All code is open source under a permissive BSD license - feel free to lift where convenient.

Repeatable conversion

The conversion script is repeatable with a single command and it is organized in stages: Each significant processing step creates files on disk and can be run separately. All that’s needed are the output files of the previous processing stage. Running the entire script would take on the order of several hours on an extra large Amazon EC2 instance. Being able to run steps like the merge stage or the convert stage separately saved significant debugging time. Throughout the import, we wound up reprocessing the data countless times as we fixed issues.

# Download, convert and push to s3
make && ./puts3.sh

# Download and expand all files, reproject
make download

# Chunk address and building files by district
make chunks

# Generate importable .osm files.
# This will populate the osm/ directory with one .osm file per
# NYC election district.
make osm

# Clean up all intermediary files:
make clean

# Put to s3
./puts3.sh

# For testing it's useful to convert just a single district.
# For instance, convert election district 65001:
make merged # Will take a while
python convert.py merged/buildings-addresses-65001.geojson # Very fast

Reprojecting and simplifying

New York City data comes in its own special projection and it is way too detailed for OpenStreetMap, so we reprojected and simplified it using ogr2ogr:

ogr2ogr -simplify 0.2 -t_srs EPSG:4326 -overwrite buildings/buildings.shp buildings/building_0913.shp

Splitting into bite-size chunks

We couldn’t upload all data in one go; it had to be cut into bite-size chunks for manual review and upload. For splitting up the data we used New York City voting districts. This was an arbitrary choice; it just so happens that New York City voting districts are of a manageable size for manual uploads. There are 5,285 voting districts, and the processing script generated an OSM file for manual upload for each one of them. The script chunk.py uses the great Shapely and Fiona libraries for doing this. It is nicely reusable for any task where you need to split up one geospatial dataset by the polygons of another geospatial dataset.
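
The core idea of splitting one dataset by the polygons of another can be sketched without any dependencies. This is not chunk.py: it only handles points, uses a plain ray casting test where the real script relies on Shapely and Fiona, and the function names and district shapes are made up for illustration.

```python
def point_in_polygon(point, polygon):
    """Ray casting test; polygon is a list of (x, y) vertices."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge cross the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def chunk_by_district(points, districts):
    """Assign each point to the first district polygon containing it."""
    chunks = {district_id: [] for district_id in districts}
    for point in points:
        for district_id, polygon in districts.items():
            if point_in_polygon(point, polygon):
                chunks[district_id].append(point)
                break
    return chunks


# Two made-up square "districts" side by side.
districts = {
    "65001": [(0, 0), (1, 0), (1, 1), (0, 1)],
    "65002": [(1, 0), (2, 0), (2, 1), (1, 1)],
}
addresses = [(0.5, 0.5), (1.5, 0.5), (3.0, 3.0)]
print(chunk_by_district(addresses, districts))
```

Points that fall outside every district are silently dropped in this sketch; a real script would want to report them. Checking every point against every polygon also gets slow quickly, which is where a spatial index comes in.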

Merging

In OpenStreetMap, addresses tend to be merged onto building polygons where only one address is available for the building. We wanted to follow this convention and thus merged addresses where only one was available onto the corresponding building. The Python script merge.py uses Shapely, Fiona and Rtree to do this. The script also converts data into GeoJSON format - which was extremely useful for debugging as we could inspect the files in any text editor. Here is an example output file of the merge stage.

Most of our fixes during the import happened in later stages so we could always work off of the merged files, saving about 50% of the total processing time.
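
The merge rule itself is simple once you know which building each address falls into. The sketch below assumes that spatial lookup has already happened (in merge.py that is the job of Shapely and Rtree) and that every address carries a hypothetical building_id field, which is None when the address is outside any building. The data layout and names are assumptions for illustration.

```python
from collections import defaultdict


def merge_single_addresses(buildings, addresses):
    """Copy address tags onto buildings that contain exactly one address.

    Returns the merged buildings and the addresses that stay separate points.
    """
    by_building = defaultdict(list)
    leftovers = []
    for address in addresses:
        if address["building_id"] is None:
            leftovers.append(address)  # not inside any building
        else:
            by_building[address["building_id"]].append(address)

    merged = []
    for building in buildings:
        matches = by_building.get(building["id"], [])
        properties = dict(building["properties"])
        if len(matches) == 1:
            properties.update(matches[0]["properties"])
        else:
            # Zero or several addresses: keep them as separate points.
            leftovers.extend(matches)
        merged.append({"id": building["id"], "properties": properties})
    return merged, leftovers


buildings = [
    {"id": 1, "properties": {"height": "12.5"}},
    {"id": 2, "properties": {"height": "30"}},
]
addresses = [
    {"building_id": 1, "properties": {"addr:housenumber": "12"}},
    {"building_id": 2, "properties": {"addr:housenumber": "14"}},
    {"building_id": 2, "properties": {"addr:housenumber": "16"}},
    {"building_id": None, "properties": {"addr:housenumber": "18"}},
]
merged, leftovers = merge_single_addresses(buildings, addresses)
print(merged)
print(len(leftovers))
```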

Conversion

This is where most of the actual conversion is happening - this is also the part of the script that was the most significant time investment. It captures the full complexity of the conversion and handles hairy problems like house number conversion, street name conversion, cleanly merging geometries, generating multipolygons and more. The script convert.py uses Shapely and lxml for attribute mapping and exporting data in OSM XML format. OSM XML is directly readable by JOSM, so the resulting files of this stage could be opened and directly uploaded to OpenStreetMap with JOSM.

One tricky problem we’re solving in this stage is merging T-intersections. OpenStreetMap’s data model is unique in that it allows for sharing vertices between polygons. In the picture below, you see a typical T-intersection. The node with the arrow is supposed to be part of the two ways describing the corner of one building but also part of the ways describing the straight walls of the other building.

It took us a while into the import to notice unmerged T-intersections. What makes this issue vexing is that OpenStreetMap’s native decimal precision is lower than that of our source data. The result was that data we uploaded to OpenStreetMap looked fine, but once we downloaded it again it came back with truncated precision, moving nodes just far enough to place some within neighboring buildings.

Nodes on T-intersections between buildings need to be part of both buildings.

Our conversion script merges all instances of T-intersections. This requires truncating coordinate precision to OpenStreetMap’s native 7 decimal places and buffering - the technique of testing not only whether a point sits on a line, but whether a point is in the close vicinity of a line. Read up on appendBuilding() in convert.py for details.
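
Here is a stripped-down sketch of the idea, not the code from appendBuilding(). It rounds coordinates to 7 decimal places as a stand-in for truncation, measures plain Euclidean distance in degrees, and uses a made-up tolerance; the function names are hypothetical too.

```python
import math

PRECISION = 7     # decimal places kept, matching the 7 positions mentioned above
TOLERANCE = 1e-6  # hypothetical "close enough" distance, in degrees


def snap(point):
    """Round a coordinate pair to the precision OpenStreetMap keeps."""
    return (round(point[0], PRECISION), round(point[1], PRECISION))


def distance_to_segment(p, a, b):
    """Distance from point p to the line segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0:
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment and clamp to its end points.
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / length_sq))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def insert_shared_node(ring, node, tolerance=TOLERANCE):
    """Add node to a closed ring if it sits on or very near one of its walls."""
    node = snap(node)
    ring = [snap(point) for point in ring]
    if node in ring:
        return ring  # already shared
    for i in range(len(ring) - 1):
        if distance_to_segment(node, ring[i], ring[i + 1]) <= tolerance:
            return ring[: i + 1] + [node] + ring[i + 1 :]
    return ring


# The corner of a neighboring building almost touches the left wall.
wall_building = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, 0.0), (0.0, 0.0)]
corner = (0.0000004, 0.5)
print(insert_shared_node(wall_building, corner))
```

The tolerance check is the buffering part: the corner does not sit exactly on the wall, but it is close enough that both buildings should share the node.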

Pushing to S3 and exposing the data in the tasking manager

For exposing tasks to mappers we used the OSM Tasking Manager - a great tool for coordinating mapping tasks among large groups of individuals. We used a patched version that allows for tasks shaped as arbitrary polygons - instead of the usual squares. Each task polygon pointed to the file we’d made available on S3, and the tasking manager exposed two buttons: one for loading OpenStreetMap data into JOSM, the other one for loading the import data into JOSM. We labeled those buttons “JOSM” and “.osm” which doesn’t make all too much sense, but hey!

Loading data into JOSM from the tasking manager.

Reusing and the elusive import toolchain

Writing these scripts we avoided overthinking the problem. Creating generalized solutions for these functionalities is hard and we simply didn’t have enough data points to do so. Now having gone through this import, I see a couple of opportunities to solidify a toolchain for import:

Generalize a command line script for splitting data (like a properly abstracted chunk.py)
Generalize a library for converting Simple Features to the OpenStreetMap data model, including XML export
Consider using PostGIS - I avoided it intentionally here, but built in spatial operations and indexing is appealing
Identify a pattern for reusable validation scripts that can be used to assert data quality before and after uploads

Continuously improving the map

Here is the full timeline of the import:

July 2013 Started programming the conversion script
September 2013 Proposed import on imports list
September 2013 First test import
October 2013 New York City community import session
December 2013 Paused import after multiple issues arose
February 2014 Restarted import after fixing all critical issues, going at slower upload pace after community feedback
June 2014 Finalized uploads and tasking manager level validation

We are not done yet. While all data has been imported to OpenStreetMap, there are final cleanup tasks we are tackling as we speak. Help us further improve the map: if you find a building or address related issue on the New York City map, please let us know by filing an issue on GitHub. As soon as new data is available from New York City, we will also take a look at updating OpenStreetMap where it makes sense.

Thank you

Huge thanks to all who have helped make this import happen. Through your work reviewing, coding, organizing mapping parties and doing data uploads you have helped make this import better than it would have been without you: Serge Wroclawski, Liz Barry, Eric Brelsford, Toby Murray, Ian Dees, Paul Norman, Frederick Ramm, Chris MacNally, and many others. A special thanks to Colin Reilly from New York City GIS who on many occasions helped us fully understand the source data and find the best decision translating it to OpenStreetMap. A big shout out to my colleagues who’ve put a ton of work into this endeavour: Ruben Lopez, Edith Quispe, Aaron Lidman, Matt Greene, and Tom Macwright among others. Say hello if you bump into them on the internet, or maybe at one of the next conferences.

Cheers to making the best map in the world.

Location: Manhattan Community Board 3, Manhattan, New York County, New York, United States

Discussion

Comment from pnorman on 21 August 2014 at 15:06

Now that we’ve got it imported, when will NYC be releasing new data, and how will we handle updating it?

Comment from Rps333 on 21 August 2014 at 15:24

Great Job. Thanks for the info

Comment from lxbarth on 21 August 2014 at 15:47

@pnorman - we’re still focused on clean up tasks, but with the next significant building or address import we should run a diff against OSM data and see whether there are worthwhile updates to go after. How exactly that’s handled best I think depends a lot on the quality and the quantity of the specific changes.

Comment from ColinReilly on 21 August 2014 at 16:00

@pnorman We (NYC DoITT GIS) have been releasing the buildings and address data on a quarterly basis all along. This was not a one-time data release.

We are also investigating other means of publishing our data to better establish two-way open communication. One of which is piloting the use of GeoGig (formerly known as GeoGit). And we are always open to new ideas to keep the flow of data open.

In terms of data currency, I brought up that during the import. Certainly interested in hearing how that can be best accomplished. And to the extent we can, how NYC can enable that.

Current data: https://data.cityofnewyork.us/Housing-Development/Building-Footprints/tb92-6tj8 https://data.cityofnewyork.us/City-Government/NYC-Address-Points/4iq4-tuhq

Colin Reilly creilly@doitt.nyc.gov