A History of TIGER in OSM
TIGER data serves as the base for much of the US map data used by all the major US map providers, including Google, Nokia and TomTom. Much of the OpenStreetMap data for the US is also based on the 2005 version of TIGER, which was imported between 2007 and 2008. Here is an animation of the import process, thanks to Scurio.
Unfortunately, TIGER was never designed to be an accurate map of the US that could be used reliably for things like GPS routing; it was a Census project with more limited objectives. The consensus is that major improvements were made to TIGER between 2000 and 2010. Because the OSM import used the 2005 data, however, it "caught TIGER halfway through the update cycle" (ref).
What this means is that we have quite a mess. Everyone knows that we have bad data for a LOT of the US, but the problem is that we don't know where, and we don't know what's wrong with the data. Further, many of these errors have probably been fixed by contributors, but there are certainly regions that have not been touched. That mix makes it hard to replace the old TIGER data with the new TIGER data using a wholesale technique like an import.
The response to this problem has been a number of projects by the community to perform "TIGER Fixups". The idea is to come up with a metric that guides contributors to places where the old TIGER data is most likely to be out of date / incorrect and get them to fix it.
I wanted to come up with a map / dataset of OSM routing data "quality" and realized that there has been a whole host of approaches, some overlapping and some not. I've been studying these approaches carefully, and thought I would summarize them here. The results will hopefully be useful in carrying this important work forward.
Community Efforts to Measure "TIGER DESERTS" and equivalents
Toby Murray's Analysis
He started with a current version of the OSM map, filtered for ways. He then counted the ways that were last edited by someone other than DaveHansenTiger (original import), balrog-kun (expanding street names), NHD edits, or woodpeck_fixbot; he mentions that adding NE2 (who did highways) to this list is also probably a good idea. One subtle point is that a node could have been a TIGER node, edited by user Y and then edited by balrog-kun, causing this algorithm to treat the node as purely TIGER when in fact it has been touched. Toby's final map "takes this into consideration" using version numbers, but I'm not sure exactly how.
The result is a county-level map with a number associated with each county: what percent of ways in this county likely come directly from TIGER? Here is a browseable heatmap and here is a screenshot.
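The counting step can be sketched in a few lines of Python. This is a minimal illustration, not Toby's actual code: the way records, the `county` and `user` fields, and the exact spelling of the account names are assumptions, and it looks only at the last editor of each way, so it shares the blind spot described above.

```python
from collections import defaultdict

# Bulk/import accounts named in the text; exact username spellings are assumed.
BULK_ACCOUNTS = {"DaveHansenTiger", "balrog-kun", "woodpeck_fixbot"}


def tiger_share_by_county(ways, bulk_accounts=BULK_ACCOUNTS):
    """Percent of ways per county whose last editor is a bulk account.

    `ways` is an iterable of dicts with hypothetical keys "county" and
    "user" (the last editor of the way).
    """
    totals = defaultdict(int)
    untouched = defaultdict(int)
    for way in ways:
        totals[way["county"]] += 1
        if way["user"] in bulk_accounts:
            untouched[way["county"]] += 1
    return {county: 100.0 * untouched[county] / totals[county] for county in totals}


# Made-up sample records, for illustration only.
sample_ways = [
    {"county": "County A", "user": "DaveHansenTiger"},
    {"county": "County A", "user": "balrog-kun"},
    {"county": "County A", "user": "local_mapper"},
    {"county": "County A", "user": "DaveHansenTiger"},
    {"county": "County B", "user": "local_mapper"},
]
print(tiger_share_by_county(sample_ways))
```

A higher percentage means more of the county's ways were probably never touched by a human editor after the import.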
Martijn Van Exel’s TIGER deserts
Martijn's analysis not only built upon Toby's analysis at a much finer level for the state of Florida, but also coined the term "Tiger Desert", a region where TIGER data has been untouched by anyone else.
Martijn's methodology relied purely on version numbers, but was the first to take into account a region's "importance" by considering way density. Tiger deserts by his definition are 5km X 5km grid cells whose "predominant way" has a version number of 1 or 2 and whose way density is higher than 1.8 (I think! the post is not super clear on this). This results in a picture of TIGER ghost towns for Florida which looks as follows:
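As a rough sketch of that cell test, assuming that "predominant way" means the most common version number among the ways in a cell (the post does not spell this out) and taking the 1.8 density threshold at face value without knowing its units:

```python
from collections import Counter


def is_tiger_desert(way_versions, way_density, density_threshold=1.8):
    """Return True if a grid cell looks like a "Tiger Desert".

    way_versions: version numbers of the ways in the cell.
    way_density: way density of the cell (units as in the original post,
        which are not clear; treated here as an opaque number).
    The mode of the versions stands in for the "predominant way"; that
    reading is an assumption, not something the post confirms.
    """
    if not way_versions:
        return False
    predominant_version, _count = Counter(way_versions).most_common(1)[0]
    return predominant_version in (1, 2) and way_density > density_threshold


# Hypothetical cells: (versions of ways in the cell, way density).
print(is_tiger_desert([1, 1, 2, 5], 2.4))
print(is_tiger_desert([3, 3, 1], 2.4))
```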
Mike Migurski's "Green Means Go"
In Jan 2013, Mike Migurski released Green Means Go, which considerably expanded the scope of Martijn's analysis. The first improvement is that the cells are 1km X 1km and the coverage is national.
What Mike did was a three-way comparison. First he identified places where there was "scope for improvement" by comparing TIGER 2012 to 2007, drawing a darker green where the highway length was substantially greater. Then he counted the total OSM editors by block (ignoring the bulk edits) and overlaid them as white blocks on top of the shades-of-green map. This is great because it helps focus attention on where new TIGER is most likely to be beneficial and where it's not likely to interfere with the local community. However, this map does not take into account whether old TIGER nodes have been edited, or information like that. Deletes of old TIGER nodes are also problematic.
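The per-cell logic of that comparison might look something like the sketch below. It is not Mike's implementation: the function names, the 25% growth threshold, and the three labels are all hypothetical, chosen only to show how a length comparison and an editor count can be combined.

```python
def length_growth(len_old_km, len_new_km):
    """Relative growth in highway length between two TIGER editions."""
    if len_old_km == 0:
        return float("inf") if len_new_km > 0 else 0.0
    return (len_new_km - len_old_km) / len_old_km


def classify_cell(len_old_km, len_new_km, human_editors, growth_threshold=0.25):
    """Label a grid cell (labels and threshold are illustrative assumptions).

    "active":    at least one non-bulk editor has worked here (a white block).
    "candidate": no human editors and the newer TIGER has noticeably more road.
    "unchanged": no human editors and little difference between editions.
    """
    if human_editors > 0:
        return "active"
    if length_growth(len_old_km, len_new_km) >= growth_threshold:
        return "candidate"
    return "unchanged"


# Hypothetical cells: (old length km, new length km, human editor count).
for cell in [(10.0, 15.0, 0), (10.0, 15.0, 3), (10.0, 10.5, 0)]:
    print(cell, classify_cell(*cell))
```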
Mapbox efforts (Alex Barth, Ian Villeda, Ruben Mendoza, Eric Fischer)
There have been two recent efforts by folks at Mapbox to develop tools for bringing in new TIGER data to fix the map. First, they developed a map for Vermont that, considering only highways, measures two things for each 1km X 1km cell: "(1) the average version number of all ways in each grid cell and (2) the percent of version=1 ways per grid cell." The bluer the cell, the more likely it is to be untouched TIGER.
A related effort came courtesy of blog posts by Eric Fischer in June and Dec 2013. Eric's work was all about comparing TIGER 2012 and TIGER 2007. The first map simply compares TIGER 2013 and TIGER 2007. According to him, "changes in the 2013 edition are in yellow, changes between 2010 and 2012 are in cyan, changes made in the accuracy improvement push between 2006 and 2010 are in magenta, and data that hasn’t changed since 2006 is in blue." The main thing to note in the national picture below is the magenta: these are the wholesale changes that were missed in the OSM import.
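The two per-cell numbers from the Vermont map are simple to compute once each way has been assigned to a grid cell. The sketch below assumes that assignment has already been done and that the input is a list of hypothetical `(cell_id, version)` pairs; it is not the Mapbox code.

```python
from collections import defaultdict


def cell_version_stats(ways):
    """Compute average version and percent of version=1 ways per grid cell.

    `ways` is an iterable of (cell_id, version) pairs, one per highway way.
    How ways are assigned to cells is outside the scope of this sketch.
    """
    versions_by_cell = defaultdict(list)
    for cell_id, version in ways:
        versions_by_cell[cell_id].append(version)

    stats = {}
    for cell_id, versions in versions_by_cell.items():
        count = len(versions)
        stats[cell_id] = {
            "avg_version": sum(versions) / count,
            "pct_version_1": 100.0 * sum(1 for v in versions if v == 1) / count,
        }
    return stats


# Made-up data: cell (0, 0) is mostly untouched, cell (0, 1) has been edited.
sample = [((0, 0), 1), ((0, 0), 1), ((0, 0), 1), ((0, 0), 5), ((0, 1), 3)]
print(cell_version_stats(sample))
```

A low average version together with a high share of version=1 ways suggests a cell that is still mostly raw import data.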
Martijn's Battle Grid
The basic idea is to compare TIGER 2013 to OSM to highlight cells with large changes. The innovation here is the addition of data from Telenav on actual driving patterns in these different cells. This helps prioritize cleanup work by focusing attention on places where people are likely using the map for routing. The Grid also follows MapRoulette conventions and allows users to directly "check out" cells for fixing in their preferred editor. Very nice!
There are three things going on:
1. TIGER 2013 and TIGER 2007 are different, but not everywhere.
2. OSM has made changes to TIGER 2006, but not everywhere.
3. The "corrections" are important, but far more important in some places than in others.
The different approaches highlighted above combine different aspects of these three points. One is to identify the TIGERness of existing data, which gets at #2 above (and can be done using a combination of username and version analysis); another is to compute simple diffs between TIGER 2013 and either current OSM data or TIGER 2006.
I think more work could be done on #3 above. The two approaches so far have used "way density" or Telenav data, but surprisingly none has used gridded population data. This is something that I plan to do in the future, as a remix of some of these previous efforts:
- Find raw data for the Mapbox maps which calculate "new TIGER" areas
- Use raw data on user count by cell (http://openstreetmap.us/~migurski/TIGER-Raster/nodes/)
- Use population data and look into getting access to Telenav "usage" data
Calculate these three metrics at the cell level, and combine them to come up with a final "quality" map.
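One possible way to combine the three inputs is sketched below. Everything about it is an assumption made for illustration: the field names, the min-max normalization, the equal default weights, and the choice to score "needs attention" (much new TIGER, few editors, many people) rather than quality directly.

```python
def min_max(values):
    """Scale a list of numbers to the range [0, 1]; constant lists map to 0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def attention_scores(cells, weights=(1.0, 1.0, 1.0)):
    """Combine three per-cell metrics into one score in [0, 1].

    `cells` maps a cell id to a dict with hypothetical keys:
      "tiger_change": how much newer TIGER differs from the old data,
      "editor_count": number of non-bulk OSM editors in the cell,
      "population":   people living in the cell.
    Editor count is inverted so that untouched cells score higher.
    """
    ids = list(cells)
    change = min_max([cells[i]["tiger_change"] for i in ids])
    editors = min_max([cells[i]["editor_count"] for i in ids])
    population = min_max([cells[i]["population"] for i in ids])
    w_change, w_editors, w_pop = weights
    total = w_change + w_editors + w_pop
    return {
        cell_id: (w_change * c + w_editors * (1.0 - e) + w_pop * p) / total
        for cell_id, c, e, p in zip(ids, change, editors, population)
    }


# Made-up cells, for illustration only.
sample_cells = {
    "a": {"tiger_change": 90, "editor_count": 0, "population": 5000},
    "b": {"tiger_change": 10, "editor_count": 12, "population": 200},
    "c": {"tiger_change": 50, "editor_count": 6, "population": 2600},
}
print(attention_scores(sample_cells))
```

Min-max scaling keeps any one metric from dominating purely because of its units, and the weights leave room to favor, say, population over editor count once real data shows which matters more.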