
dalek2point3's Diary

Recent diary entries

Who uses OpenStreetMap?

Posted by dalek2point3 on 24 October 2014 in English (English).

I have recently been interested in measuring how OpenStreetMap is being used in different services around the world. Obviously, this is a very hard question to answer: being an open project, OSM data can be downloaded at any time, and you can start playing around with it. We don't require any permission for this, and while the ODbL license does require attribution if you use the data in production, such attribution is hard to track. OpenStreetMap data can be found on planes and in disaster relief – not to mention the thousands of web and mobile applications that use it for different purposes.

Alright, having convinced you that it's quite hard to track all possible uses of OpenStreetMap, perhaps it is possible to track usage of OSM tiles in web applications online? While still difficult, this is easier to accomplish, because at the very least the question is well defined and, in theory, answerable. If we could survey every website out there and see whether it uses tiles from an OpenStreetMap server (or a Mapbox server), we might be able to say something about OSM usage. This still would not cover cases where folks have set up their own tileserver with OSM data – which, one might argue, is quite a common way to use OSM data.

Either way, I recently discovered HTTPArchive and thought it would be a cool project to track the usage of different mapping APIs online, including Mapbox and sites using OpenStreetMap tiles directly (which you're not supposed to do for heavy usage!). What HTTPArchive does is crawl roughly the top million websites and, for each website, record the HTTP requests that the site makes. This, it turns out, is a great way to measure the usage of JavaScript frameworks, Google Analytics and so on – except no one so far has used it to look at mapping APIs and OpenStreetMap in particular!

So, I thought I would do that! Now, the HTTPArchive data is quite large (petabytes, I hear), but fortunately it's all available on Google BigQuery, which makes it a cinch to query. Results from some of my explorations are below.

HTTPArchive data is stored in two important tables (two for each "run") – pages and requests – and the latest versions can always be found at latest_pages and latest_requests. The `pages` table contains information like the URL scraped, number of bytes, etc. Let's see if the main OpenStreetMap website is in the data. The following query does the job:

SELECT pageid, url, urlShort FROM [httparchive:runs.latest_pages]
WHERE REGEXP_MATCH(urlShort, r'openstreetmap.org');

Yes it is! This query returns:

Row  pageid    url                            urlShort
1    17330926  http://www.openstreetmap.org/  http://www.openstreetmap.org/

Seems like the pageid is 17330926. Now, the `requests` table is where all the juicy information is contained. Let's see what requests the OSM website makes:

SELECT * FROM [httparchive:runs.latest_requests]
WHERE pageid = 17330926;

And this is the response that you get – about 35 requests. That data is here. As you can see, this includes a number of requests for *.tile.openstreetmap.org, OSM’s public tileserver.

Which other websites make similar requests? This is where the HTTPArchive really shines. After some experimentation, the following SQL query does the trick:

  SELECT urlShort FROM [httparchive:runs.latest_pages] AS pages JOIN (
    SELECT pageid, REGEXP_EXTRACT(url, r'(tile.openstreetmap.org)') AS link2
    FROM [httparchive:runs.latest_requests] AS requests
    WHERE REGEXP_MATCH(url, r'tile.openstreetmap.org')
  ) AS lib ON pages.pageid = lib.pageid
  GROUP BY urlShort;

Not many websites seem to be hitting the tileserver directly – which is reassuring. That data is here.
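If you would rather run these queries from a script than from the BigQuery web console, something like the following should work. This is just a sketch using the google-cloud-bigquery Python client (my addition, not part of the analysis above); it assumes you have a Google Cloud project and credentials configured, and it flags the query as legacy SQL since that is the dialect used in the queries above.

    from google.cloud import bigquery

    # Pages that request tiles directly from the OSM tileserver (legacy SQL dialect)
    QUERY = """
    SELECT urlShort FROM [httparchive:runs.latest_pages] AS pages JOIN (
      SELECT pageid FROM [httparchive:runs.latest_requests]
      WHERE REGEXP_MATCH(url, r'tile.openstreetmap.org')
    ) AS lib ON pages.pageid = lib.pageid
    GROUP BY urlShort
    """

    client = bigquery.Client()  # uses your default project and credentials
    job_config = bigquery.QueryJobConfig(use_legacy_sql=True)
    rows = client.query(QUERY, job_config=job_config).result()

    for row in rows:
        print(row.urlShort)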

A final interesting thing to run would be a similar analysis for Google Maps and Mapbox. Queries for Mapbox and Google Maps are available on GitHub. The data from these queries is here – Mapbox for now; I'm still working on getting the data on Google Maps API usage. That is for another post!

Hope you will find HTTPArchive a useful tool to analyze data from the web. It certainly seems easy to use, with lots of interesting data for analysis! Happy exploring!

Location: Lechmere Square, East Cambridge, Cambridge, Middlesex County, Massachusetts, 02141, United States

NOTE: This post is mostly for my reference and might not make a lot of sense for other folks – but I thought I'd put this out there in case you're interested in the intricacies of TIGER data in the US.

Quick post to highlight some charts as I’m doing research on this topic:

  • Number of ways with a highway tag, counting a way once per unique "name" tag. TIGER counties are the ones that got complete, "good" TIGER data, while control counties are those that got missing TIGER data.

Imgur

  • Same chart, but only for highway class = 1 (i.e. motorway and trunk). Seems like someone added a lot of new class 1 highways in 2010q4

Imgur

  • Same chart, but only for highway class = 4 (i.e. cycleways etc). Note that TIGER was not a source for many highways of this type. Note how the control regions got a spike while the TIGER regions did not.

Imgur

  • Same chart, but only for highway class = 3 (i.e. residential / tertiary etc).

Imgur

Location: Lechmere Square, East Cambridge, Cambridge, Middlesex County, Massachusetts, 02141, United States

The Missing Mappers Problem?

Posted by dalek2point3 on 2 July 2014 in English (English). Last updated on 3 July 2014.

While playing around with the changeset data I noticed an interesting pattern. There were some users who had made a lot of contributions to the map, who were nowhere to be seen. A lot of the talk in the community has been on attracting new mappers, but if we’re losing existing mappers that is surely a problem, no?

Wondering if this was a big problem, I decided to dig in. Here is what I found – it doesn't seem to be a HUGE problem at the very top, but there are many mappers who made 10 or 100 changesets and never came back. Note that this is only using data from the USA at the moment.

missingmappers

The first thing to notice is the blue line – it shows how many users have at least a given minimum number of changesets. 44k users have more than zero changesets, 25k have more than one changeset, 100k have more than 5 changesets and so on. There are only about 273 who have more than 1000 changesets, and I've done my best to get rid of imports, so these are likely to be real people, but I'm sure a few of these still include imports.

Now, how many of these users seem to be lost to OSM? The orange, yellow and green bars plot the number of users who have not made a single edit within the US since 2012, 2011 and 2010 respectively. So, for example, looking at the extreme right, of the 273 users who made at least 1000 changesets, 22 have not been seen since 2012, 12 since 2011 and 4 since 2010. The problem is a little more acute if you look at those who made at least 500 changesets. Of this group, about 50 have not been seen since 2012.
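For reference, the computation behind those bars is simple enough to sketch in pandas. This is a rough reconstruction, not my actual script; it assumes a CSV of US changesets with user and created_at columns.

    import pandas as pd

    # One row per US changeset, with the user name and the changeset timestamp
    cs = pd.read_csv("changesets.csv", parse_dates=["created_at"])

    per_user = cs.groupby("user").agg(
        n_changesets=("user", "size"),
        last_edit=("created_at", "max"),
    )

    # Heavy mappers (>= 1000 changesets) who have not edited since the start of 2012
    heavy = per_user[per_user["n_changesets"] >= 1000]
    missing_since_2012 = heavy[heavy["last_edit"] < "2012-01-01"]
    print(len(heavy), len(missing_since_2012))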

Who are these users? And why did we lose them? Was it due to contributor terms or is it just bots which are no longer being deployed?

Here is the list of the top missing mappers that I could find in my data – heavy mappers who have not been seen since 2012! Chime in in the comments if you have any theories about where these lost souls might be and what we can do to bring them back!

Here are the top 10 users who have at least 500 changesets and have not been seen since 2012:

list

Click here to check out the full list!

Location: Lechmere Square, East Cambridge, Cambridge, Middlesex County, Massachusetts, 02141, United States

Some Notes from Analyzing OSM US History Data

Posted by dalek2point3 on 16 June 2014 in English (English). Last updated on 28 June 2014.

The Data

Today, I’ve finally gotten a chance to play with data from the OSM History extract that I created using the parser that I wrote about last time.

This is what the data contains:

  • Every node that contains either an amenity, addr:housenumber or place tag.
  • For each node, I record basic metadata, “name” and gnis* tags
  • Every way that contains either an amenity, highway, building or parking tag.
  • For each way, I also record basic metadata and the following tags:
    • name
    • tiger:cfcc, tiger:county, tiger:reviewed
    • access
    • oneway
    • maxspeed
    • lanes

The resulting flatfiles are large, partly because I'm parsing the "history" data, so I'm including every past version of every node and way in addition to the most current version. There are about 6.3 million node entries and about 48.7 million way entries. I ran each of these nodes and ways through my point-in-poly program to code the county and the MSA that it lies in.
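The point-in-poly step itself is nothing exotic. Here is a minimal sketch of the idea using geopandas (recent versions); the file and column names are placeholders, not the ones my program actually uses.

    import geopandas as gpd

    # Features with a lat/lon, and a county polygon layer (placeholder paths/columns)
    points = gpd.read_file("features.geojson")                    # point geometries
    counties = gpd.read_file("us_counties.shp")[["GEOID", "geometry"]]

    # Spatial join: attach the county that each feature falls inside
    coded = gpd.sjoin(points, counties, how="left", predicate="within")
    coded.to_file("features_with_county.geojson", driver="GeoJSON")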

The next big step was to drop imported data. I really don't care about imports for this analysis – this obviously includes data from the TIGER import, but also many other major edits in the US. Interpret the numbers below as the contribution of OSM editors, net of major national-level imports. I've not removed smaller county-level imports, partly because I see them as relevant to my analyses, but also because they're harder to pin down. So the data includes any way or node with the relevant tags that was touched by an account other than the TIGER import account (and a few other known importer accounts).

Some Highlights

Way data

How many items do we have for each of the 4 types?

  • Highway: 15.9 million versions, 7.9 million uniques
  • Building: 6 million versions, 4.9 million uniques
  • Amenity: 460k versions, 331k uniques
  • Parking: 66k versions, 52.6k uniques
  • All data: 22.3 million versions, 13.1 million uniques

Imgur

Other notes:

  • About 5.8 million of the 15.9 million way versions have a tiger:cfcc / tiger:county tag.

Node data (pending)

NOTE: this post is under construction

Location: Lechmere Square, East Cambridge, Cambridge, Middlesex County, Massachusetts, 02141, United States

One of the most prominent users of OpenStreetMap is Craigslist. Craigslist users often use OpenStreetMap to indicate the location of the house or item they are selling. When they don't find the street they're looking for, Craigslist users have the option to submit a note to the OSM.org notes system.

I've found these notes to be useful; they quite often contain information about subdivisions that are missing from the map. I wanted to visualize all the notes submitted through this system. Craigslist does not submit notes using a dedicated URL (although I think they should!), but they use a peculiar notes format, and notes from Craigslist almost always look like this one:

bounds: (38.0118,-121.943 - 37.9966,-121.9013) http://www.openstreetmap.org/?box=yes&notes=yes&bbox=-121.943%2C37.9966%2C-121.9013%2C38.0118 Map is missing data here. Freshwater Court in Pittsburg CA is not showing up

Notice how they begin with "bounds". This suggests that using the OSM Notes API to search for the text "bounds" should give a reasonably accurate picture of notes from the CL system.

I wrote a script that uses the API to get this data and parse it into a CSV ready for visualization – you can check out the code on GitHub – and visualized the 2980 notes that I found using CartoDB. Each dot contains a link that will take you to the notes page where you can read the full text of the comment.
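The script itself is on GitHub, but at its core it is just a call to the Notes search endpoint. Here is a minimal sketch of that call (the limit and closed parameters are only illustrative, not the values the actual script uses):

    import requests

    # Search the OSM Notes API for notes whose text contains "bounds"
    resp = requests.get(
        "https://api.openstreetmap.org/api/0.6/notes/search.json",
        params={"q": "bounds", "limit": 1000, "closed": -1},  # closed=-1: include closed notes
    )
    resp.raise_for_status()

    for feature in resp.json()["features"]:
        lon, lat = feature["geometry"]["coordinates"]
        text = feature["properties"]["comments"][0]["text"]
        print(lat, lon, text.splitlines()[0])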

And this is what we get! Click on the image to be taken to the CartoDB page (I can't figure out how to embed iframes in diary entries). You can even download the raw data here.

Click here to explore this map map

Location: Lechmere Square, East Cambridge, Cambridge, Middlesex County, Massachusetts, 02141, United States

Welcome to OpenStreetMap Telangana!

Posted by dalek2point3 on 2 June 2014 in English (English). Last updated on 6 June 2014.

June 2 is a monumental day for residents of the new state of Telangana in southern India. After decades of struggle, the government of India decided to create an independent state of Telangana, separate from the state of Andhra Pradesh. The region has historically been poorer and less developed than the rest of Andhra Pradesh, and the hope is that a more empowered state can bring development to the people of Telangana.

Meanwhile, in the digital world, OpenStreetMap also welcomed Telangana with open arms. User PlaneMad created the state boundary relation, and it went live today – exactly the day that Telangana came into existence on the ground! Before it went live, the OpenStreetMap community had a chance to talk about and discuss this change and prepare for the impending arrival of the state!

Here are the two state boundaries on OpenStreetMap today – Telangana

Imgur

and the new, smaller Andhra Pradesh

Imgur

Meanwhile, in proprietary-maps-world, Andhra Pradesh still remains in its traditional form – uncorrected, and denying the people of Telangana their existence for a few more weeks! Another win for community-crafted local maps!

Imgur

Image source: Google Maps. (Used under fair-use exceptions to copyright law. Please file a DMCA request in case you disagree)

Update (June 6)

Thanks to help from Wambacher and OSM Boundaries, I was able to create this overlay which shows both states in one image.

osm

Location: Lechmere Square, East Cambridge, Cambridge, Middlesex County, Massachusetts, 02141, United States

Today, while playing with data from all changesets in the US, I found an interesting fact. When I plot the number of monthly new users by county over time, it's heartening to see that it's been growing – but what happened in December 2012?

Osm

Location: Lechmere Square, East Cambridge, Cambridge, Middlesex County, Massachusetts, 02141, United States

One of the mysteries of OpenStreetMap not known to the new user is the issue of imports. I've been pondering for a while what the best way is to identify which user accounts are related to imports, what they have been importing, where the data comes from, and what portion of the data comes from imports versus "purely" from contributors.

Now, my sense is that initially there was a lot of importing going on informally, until someone instituted the formal process. The Import Catalogue, where all imports are supposed to be documented, is sorely in need of some cleaning up and fixing – there are many imports that are not recorded there. Hopefully we can use the data to fix the page as well.

In my own research, I'm interested in identifying imports so as to get rid of them! I want to understand contributor activity, and your analysis can get seriously skewed if you include imports. One example of this is Dennis' SOTM-US 2014 talk, where they found that there was a lot of activity in North Dakota, but most of it was coming from imports (or so we think!).

Here, I wanted to write some notes about what I've found to be the best way to identify imports in the changesets data. The changesets data contains a field called "num_changes" that records the number of changes in any given changeset. A feature of most imports is that they cram as many features as they can into one changeset (the max is 50000). So what you can do is look at all the changesets for a given user, and if an extraordinarily high number of them (say 80%) have more than 5000 changes, then it's likely that the account is being used for imports.
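To make that concrete, here is a rough sketch of the heuristic in pandas. It assumes a CSV of changesets with user and num_changes columns; the thresholds are the ones described in this post, not anything canonical.

    import pandas as pd

    cs = pd.read_csv("changesets.csv")  # columns: user, num_changes, ...

    cs["is_big"] = cs["num_changes"] > 5000
    by_user = cs.groupby("user").agg(
        mean=("is_big", "mean"),   # share of changesets with more than 5000 changes
        N=("is_big", "size"),      # total changesets for that user
    )

    # Likely import accounts: >= 50% big changesets and at least 50 changesets overall
    import_accounts = by_user[(by_user["mean"] >= 0.5) & (by_user["N"] >= 50)]
    print(import_accounts.sort_values("mean", ascending=False))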

Using this method, I flagged "import accounts" (at least 50% of their changesets have more than 5000 changes, and overall they have at least 50 changesets) to get this list of large import accounts in the US. Here "mean" is the percent of changesets that have more than 5000 changes, and N is the total number of changesets for that user.

import account

This is by no means perfect, and there are many other types of imports that I think I'm missing – and perhaps there are some false positives as well? I would love to get your reactions, or hear if you have other suggestions on better ways to do this!

I've been processing changesets inside the bounding box {-125, 24.34, -66.9, 49.4} (see code).

I then ran the center of each changeset through a point-in-polygon algorithm, and that produces this rather interesting map of changesets in and around the US, but not actually IN the continental US. That's about 300k changesets.

changeset map

Location: Lechmere Square, East Cambridge, Cambridge, Middlesex County, Massachusetts, 02141, United States

TL;DR Version

I wrote a small script to get some data out of OpenStreetMap history files using Osmium. You can find it here: https://github.com/dalek2point3/osmium-tools/

The Story So Far …

I'm working on a project to analyze OSM contribution history in the United States. One way to do this is to use the changeset dumps – the changeset dumps contain fields for the username of the contributor, the time of the contribution and the bounding box of all the edits made in a particular changeset. Using this data, and approximating the location of the edit as the center of the bounding box, it is possible to do a lot of analysis about how contribution activity has changed and evolved in different regions around the country. My previous diary entry is one example of such an analysis.

History Files Here We Come!

However, the approach of analyzing changesets quickly comes to a dead end if you want to understand the type of contributions. What are people actually doing when they edit in a certain area? Are they adding new subdivisions (likely to be lots of ways and "highway" tags here), are they adding useful metadata to existing streets (lots of maxspeed and oneway tags here), or are they adding POIs (amenity tags) and natural features like water bodies, hiking trails, etc.?

The changeset files are absolutely useless for understanding this kind of activity. If you want to do such an analysis, you have to look to the history files. History files are exactly like planet XML files, but with every past version of a particular node, way or relation also recorded. Like the planet XML, each feature comes with a changeset id, so you can reconstruct changesets from this file and then look into what is actually going on in the data.

Great! So how do we do this exactly?

The one problem with using history files is that a number of traditional tools do not work with this file format. Osmfilter, which is great for filtering certain feature types, does not work with .osh files, and neither does Osmosis – both tools I was using extensively in my previous work.

One way to go is to use Osmconvert to convert the entire .osh file to CSV, and then manipulate this file using standard command-line tools like sed. Unfortunately, this approach scales poorly – I was using the history file for all of the US, and the CSV version of this file gets pretty large, pretty quickly.

So, how do we get there? Enter Osmium! Osmium is a wonderful tool written by Jochen Topf that provides C++ libraries and header files to work with OSM data. Yes, you read that right: C++. Finally it was time for me to leave my comfortable world of Python and dynamic typing and try to remember the C++ I learned back in engineering school a few years ago!

Fortunately, Osmium ships with some nice examples and has very helpful community members, which makes it not terribly hard to pick up! What's more, Osmium has recently been redesigned and now comes with a nice shiny new website and some nice documentation.

Exact Steps to Osmium Glory!

So using all those as helpers, this is how I proceeded:

  1. First, thanks to wonderful work by MaZderMind, there are a number of "history extracts" available, so you don't have to begin with the planet history file. I downloaded the North America extract to start with.

  2. Then, using MaZderMind's wonderful OSM history splitter (which is based on Osmium) and a bounding box for the continental USA ((-124.848974, 24.396308) - (-66.885444, 49.384358)), I created an extract for my analysis. Installing the OSM history splitter is fairly straightforward if you follow the instructions on the GitHub page.

  3. Then, in order to extract relevant elements from the history file, I turned to Osmium. The original plan was to use the OSM history importer, which is another tool based on Osmium, to import all of the data into a PostGIS database and then run queries on this data – but given the nature of my requirements, I thought this was overkill. Installing Osmium was fairly straightforward for me (although I've read that others have had trouble) – I installed the Debian-packaged versions of all the requirements listed here, then git cloned the repo, and I was set!

  4. And voila! Here is the script that extracts amenities and highways from the history file and records the lat / lon for each (for highways it records the lat / lon of the first node).

  5. It took me a while to understand how Osmium works, so I thought I would make a few notes to help others out.

  • First, you must remember that Osmium is a header-only library. There are no executables that come with libosmium, but you should definitely play around with osmium-contrib – for my particular requirement I found this program to be very helpful.

  • Second, Osmium comes with a program called osmium-tool that can do a limited number of things. Your requirement might actually be satisfied by one of these pre-coded tools, in which case you are all set! Look at the usage here.

  • In order to understand my script, you need to understand a few things. Osmium can read a large number of OSM-related file formats. So I create a reader object that reads the file I'm interested in. Then osmium::apply iterates through all the objects in the file (i.e. nodes, ways and relations) and for each object calls the "handlers". In my case, I have a "location handler" that reads in the locations of all the nodes and associates them with the ways (in OSM, nodes have coordinates, while ways only have node references) – and after the location handler has run, I call the "names handler". The names handler calls the "node" function for all nodes and the "way" function for all ways. In these functions I include the logic for what I want to do with the data, in this case extracting features that have the relevant tags and writing them to stdout. (A rough Python sketch of the same flow appears after this list.)

  • Here are links to some more documentation to help with your Osmium Project.
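For readers who would rather stay in Python, the same flow can be sketched with the pyosmium bindings. This is only an illustration of the handler pattern, not my actual C++ script, and the file name is a placeholder.

    import osmium

    class NamesHandler(osmium.SimpleHandler):
        """Print amenity nodes and highway ways, roughly mirroring the C++ handler."""

        def node(self, n):
            if "amenity" in n.tags and n.location.valid():
                print("node", n.id, n.tags.get("name", ""), n.location.lon, n.location.lat)

        def way(self, w):
            if "highway" in w.tags and len(w.nodes) > 0:
                loc = w.nodes[0].location  # filled in by the location cache below
                if loc.valid():
                    print("way", w.id, w.tags.get("name", ""), loc.lon, loc.lat)

    # locations=True plays the role of the location handler: it caches node
    # coordinates so that ways can see the location of their member nodes.
    NamesHandler().apply_file("us-extract.osh.pbf", locations=True)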

Osmium is an extremely powerful (and fast!) way to do lots of amazing things with OSM data. In fact, my favorite tool, Taginfo, uses Osmium on the backend. I'd highly recommend it for any heavy-duty history file processing!

Location: Lechmere Square, East Cambridge, Cambridge, Middlesex County, Massachusetts, 02141, United States

Battlegrid -- What still needs to be fixed?

Posted by dalek2point3 on 12 March 2014 in English (English). Last updated on 13 March 2014.

Martijn Van Exel's Battlegrid has been a fun resource for me to fix TIGER errors. The one problem, however, is that it was hard for me to identify regions that really needed some love, because Battlegrid does not let you zoom out and get a visual overview of which cities might have the most potential issues with the TIGER data.

For example, Chicago seems to have had a lot of errors, but these errors seem to have been mostly resolved:

Chicago

While there are still regions, like this part of Charleston, SC, that need a lot of work:

Charleston

But Martijn has kindly agreed to share the raw data behind Battlegrid, and using this data I've produced some analyses that can hopefully be useful both for understanding the current state of TIGER and for helping guide future fixup work.

Here is a view of Battlegrid for all of the US. Each dot represents something that is bad, from small errors to big ones.

USA Grid

And this is the same picture counting only tiles that have more than 200 misaligned nodes.

severe

The counties in pink are the ones that originally got the old TIGER data (the white counties are likely to have more aligned and up-to-date data) – notice how Battlegrid picks up this variation. But it's not perfect: you can see the clear lines near Alabama (because all the states around Alabama got bad data), but you can also see that large cities are changing quickly, so all the dots in Birmingham are likely to be from new roads.

alabama

Anyway, I wanted to systematically analyze which regions still need fixing. Using the Battlegrid data, I came up with a quick ranking of MSAs that still have a lot of disagreements with TIGER 2012. I counted all the tiles that have at least 200 corrections in them, and then considered only those tiles in the top 80 MSAs in the country. Then, for each MSA, I simply summed up the total number of errors and calculated a "density" (number of errors divided by the area of the MSA).
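In pandas terms this ranking is a simple filter-and-aggregate. Here is a sketch, assuming each grid tile already carries an error count and an MSA id, plus a separate table of MSA areas; the file and column names are placeholders, not my actual ones.

    import pandas as pd

    tiles = pd.read_csv("battlegrid_tiles.csv")   # columns: msa, errors, ...
    msas = pd.read_csv("msa_areas.csv")           # columns: msa, area_km2, population_rank

    # Keep only tiles with at least 200 corrections, in the top 80 MSAs
    bad = tiles[tiles["errors"] >= 200]
    top80 = msas.nsmallest(80, "population_rank")

    totals = bad.groupby("msa")["errors"].sum().reset_index()
    ranked = totals.merge(top80, on="msa")
    ranked["density"] = ranked["errors"] / ranked["area_km2"]
    print(ranked.sort_values("density", ascending=False).head(20))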

These are places that are all fairly large MSAs and where the error density is high, so your Battlegrid work is likely to bring you some joy. Eventually I want to put up a live map, linked to the relevant place on Battlegrid, so that you can get started correcting these places, but here is the list of the top MSAs that currently have a high density of poor TIGER data. (A high score indicates a high density of bad TIGER.)

  1. Asheville, NC 8.5
  2. Charleston–North Charleston, SC 6.9
  3. Baton Rouge, LA 4.6
  4. Knoxville, TN 3.6
  5. Columbia, SC 3.5
  6. Winston-Salem, NC 3.1
  7. Atlanta, GA 2.6
  8. Birmingham, AL 2
  9. Phoenix–Mesa, AZ 1.8
  10. Orlando, FL 1.6
  11. Dallas–Fort Worth–Arlington, TX 1.6
  12. Pittsburgh, PA 1.6
  13. Tampa–St. Petersburg, FL 1.5
  14. Charlotte, NC–SC 1.5
  15. Las Vegas–Henderson, NV 1.5
  16. Bridgeport–Stamford, CT–NY 1.4
  17. McAllen, TX 1.4
  18. Riverside–San Bernardino, CA 1.4
  19. Tucson, AZ 1.4
  20. New Haven, CT 1.4

(you can find a downloadable version of all the top 78 MSAs here)

And here is the same data as a map : map

I see the map as a garden. It needs love and care from local mappers who know the area and can look after it regularly. This means editing regularly in the area, as opposed to armchair mapping – of which I do quite a bit (!) – but there is nothing like a community of locals taking care of an area.

What areas of the US receive such community love and what areas do not? Digging into the raw changeset dump produces some interesting results.

Methodology

This is what I did:

  1. For all changesets, approximate the location of the changeset as the center of the bounding box and geocode that point to the country level. Throw out all changesets that are not geocoded to the US, as well as any that span more than 2 degrees of latitude or longitude (these are mostly programmatic edits).

  2. Remove changesets by special users: davehansentiger, milenko, OSMFRedactionbot, bot-mode, woodpeck_fixbot and nhd-import (these were the big import accounts I identified – did I miss any?).

  3. Divide up the country into 0.1 x 0.1 latitude / longitude squares (approximately an 11km x 11km box) and treat each box as a little garden of its own. For each of these gardens, ask the following questions (a rough sketch of this gridding step appears after the list):

a) How many unique users have committed a changeset in this area?
b) How many unique users have made more than 5 changesets in this area?
c) How many unique users have made more than 10 changesets in this area?

  4. Once I had calculated these three metrics, I was ready to generate some pretty maps and analyses! I'm presenting some results below.
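Here is a rough sketch of the gridding step, purely for illustration; it assumes a CSV of US changesets with user, lon and lat columns for the bounding-box centers (the column names are placeholders).

    import pandas as pd

    cs = pd.read_csv("us_changesets.csv")  # columns: user, lon, lat (bbox centers)

    # 0.1 x 0.1 degree cells: round each center down to its cell corner
    cs["cell_x"] = (cs["lon"] // 0.1) * 0.1
    cs["cell_y"] = (cs["lat"] // 0.1) * 0.1

    per_cell_user = cs.groupby(["cell_x", "cell_y", "user"]).size().rename("n")

    def users_with_at_least(k):
        # For each cell, how many distinct users have at least k changesets there?
        return (per_cell_user[per_cell_user >= k]
                .reset_index()
                .groupby(["cell_x", "cell_y"])["user"].nunique())

    gardens = pd.DataFrame({
        "any": users_with_at_least(1),
        "five_plus": users_with_at_least(5),
        "ten_plus": users_with_at_least(10),
    }).fillna(0)
    print(gardens.sort_values("ten_plus", ascending=False).head(10))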

Findings

  1. Here are the top OSM gardens in the US – the ones that are well taken care of:
  • Bay Area 64 contributors with 10+ changesets Bay Area
  • DC 53 contributors with 10+ changesets DC
  • Seattle 42 contributors with 10+ changesets Seattle

The complete top 10 list can be seen here: http://pastie.org/pastes/8897892/text

  2. And here is a map of places with at least five contributors committing more than 10 changesets:

usamap

  3. Here is a similar map, with major cities marked as airplanes. All the major cities seem to meet this bar of having at least 5 members who have made more than 10 changes. But at the second tier there is a lot of variation. Looking in the South, Huntsville and Montgomery in Alabama both seem to have at least one area that is well taken care of, while cities in Louisiana (Lafayette) or Arkansas seem to have no real communities to speak of. Memphis, TN and Knoxville, TN seem to have very little activity, while there seems to be a decent community in Nashville. I think there is a lot of interesting variation here that goes beyond simply saying that places with people are places with contributors.

cities

Going forward

This is very much a work in progress. If there is interest, I could compile lists stating more clearly which (large) cities could do with community engagement and which seem to have plenty going on. I should also figure out how to make these maps in TileMill and embed them so that people can play around with them.

Fixing Tiger Deserts : The Progress So Far ...

Posted by dalek2point3 on 1 March 2014 in English (English). Last updated on 2 March 2014.

A History of TIGER in OSM

[TIGER](http://www.census.gov/geo/www/tiger/) data serves as the base data for much of the US map data for all the major US map providers, including Google, Nokia and TomTom. Much of OpenStreetMap data for the US is also based off of the 2005 version of TIGER data, and the import was completed between 2007 and 2008. Here is an animation of the import process, thanks to Scurio:

TIGER import animation

TODO: Go through the notes on the TIGER/Line website and figure out the major changes in the data collection process. The Wikipedia page is also helpful.

Unfortunately, however, TIGER data was never designed to be an accurate map of the US that could be relied upon for things like GPS routing – it was a Census project with more limited objectives. The consensus is that major improvements were made to TIGER between 2000 and 2010 – but because the OSM import was made with the 2005 data, it "caught TIGER halfway through the update cycle" (ref).

What this means is that we have quite a mess. Everyone knows that we have bad data for a LOT of the US, but the problem is that we don't know where, and we don't know what's wrong with the data. Further, a lot of these errors have probably been fixed by people, but we're also certain that there are regions that have not been touched, making it hard to replace the old TIGER data with the new TIGER data using a wholesale technique like an import.

The response to this problem has been a number of community projects to perform ["TIGER fixups"](http://wiki.openstreetmap.org/wiki/TIGER_fixup). The idea is to come up with a metric that guides contributors to places where the old TIGER data is most likely to be out of date or incorrect, and get them to fix it.

I wanted to come up with a map / dataset of OSM routing data "quality" and realized that there have been a whole host of approaches, some overlapping and some not. I've been studying these approaches carefully and thought I would summarize them here. The results will hopefully be useful in carrying this important work forward.

Community Efforts to Measure “TIGER DESERTS” and equivalents

Toby Murray’s Analysis

This is one of the first comprehensive looks at TIGER editing that I found (apparently there was something called the "TIGER edited map" by MapQuest, but it's no longer online where it's supposed to be).

He started with a current version of the OSM map, filtering for ways. Then you count the total number of ways that were last edited by someone other than DaveHansenTiger (the original import), balrog-kun (expanding street names), NHD edits, or woodpeck_fixbot – he mentions that adding NE2 (who did highways) to this list is probably a good idea too. Another subtle point is that a way could have been a TIGER way, edited by user Y and then edited by balrog-kun, causing this algorithm to treat it as purely TIGER when in fact it has been touched. Toby's final map "takes this into consideration" using version numbers, but I'm not sure how exactly.
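In spirit, the per-county metric is just a group-by over ways. Here is a rough sketch of it in pandas – my reconstruction, not Toby's actual code – assuming a table of current ways with their last editor and county (all the file and column names are placeholders).

    import pandas as pd

    ways = pd.read_csv("ways_current.csv")  # columns: way_id, last_user, county_fips

    IMPORT_ACCOUNTS = {"DaveHansenTiger", "balrog-kun", "woodpeck_fixbot", "nhd-import"}

    ways["untouched_tiger"] = ways["last_user"].isin(IMPORT_ACCOUNTS)
    pct_tiger = (ways.groupby("county_fips")["untouched_tiger"]
                     .mean()        # share of ways last touched by an import account
                     .mul(100)
                     .sort_values(ascending=False))
    print(pct_tiger.head(20))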

The result is a county-level map with a number associated with each county: what percent of ways in this county likely comes directly from TIGER? Here is a browseable heatmap and here is a screenshot.

screenshot

Martijn Van Exel’s TIGER deserts

Martijn's analysis not only built upon Toby's analysis at a much finer level for the state of Florida, but also coined the term "TIGER desert": a region where TIGER data has been untouched by anyone else.

Martijn's methodology relied purely on version numbers, but was the first to take into account a region's "importance" by considering way density. TIGER deserts, by his definition, are 5km x 5km grid cells where the predominant way has a version number of either 1 or 2 and where the way density is higher than 1.8 (I think! the post is not super clear on this). This results in a picture of TIGER ghost towns for Florida which looks as follows:

TIGER Ghost towns

Mike Migurski’s “Green Means Go”

In January 2013, Mike Migurski's Green Means Go was released, which considerably expanded the scope of Martijn's analysis. The first improvement is that the cells are 1km x 1km and the coverage is national.

Green means go

What Mike did was a three-way comparison. First, he generated places where there was "scope for improvement" by comparing TIGER 2012 to 2007, using a darker green for places where highway length was substantially greater. Then, he counted up total OSM editors by block (ignoring the bulk edits) and overlaid them as white blocks on top of the shades-of-green map. This is great because it helps focus attention on where new TIGER is most likely to be beneficial and where it is not likely to interfere with the local community. However, this map does not take into account whether old TIGER nodes have been edited, or information like that. Deletes of old TIGER nodes are also problematic.

See all the maps and additional patterns here.

Mapbox efforts (Alex Barth, Ian Villeda, Ruben Mendoza, Eric Fischer)

There have been two recent efforts by folks at Mapbox to develop tools to bring in new TIGER data to fix the map. First, they developed a map for Vermont that measures, for each 1km x 1km cell of highways, "(1) the average version number of all ways in each grid cell and (2) the percent of version=1 ways per grid cell" – the bluer the cell, the more likely it is to be untouched TIGER.

virginia

A related effort came courtesy of blog posts by Eric Fischer in June and December 2013. Eric's work was all about comparing newer TIGER releases to TIGER 2007. The first map simply compares TIGER 2013 and TIGER 2007 – according to him, "changes in the 2013 edition are in yellow, changes between 2010 and 2012 are in cyan, changes made in the accuracy improvement push between 2006 and 2010 are in magenta, and data that hasn't changed since 2006 is in blue." The main thing to note in the national picture below is the magenta: these are the wholesale changes that were missed in the OSM import.

june2013

Martijn’s Battle Grid

And the final, and perhaps the most actionable, tool has been the development of the Battle Grid by Martijn Van Exel. Check it out here, along with the blog post describing it.

The basic idea is to compare TIGER 2013 to OSM and highlight cells with large changes; the innovation here is the addition of data from Telenav on actual driving patterns in these cells. This helps prioritize cleanup work by focusing attention on places where people are actually using the map for routing. The Grid also follows MapRoulette conventions and allows users to directly "check out" cells for fixing in their preferred editor. Very nice!

battlegrid

General Lessons

There are three things going on:

  1. TIGER 2013 and TIGER 2007 are different, but not everywhere.
  2. OSM has made changes to TIGER 2006, but not everywhere.
  3. The "corrections" are important, but far more important in some places than in others.

The different approaches highlighted above all combine different aspects of these three objectives. The first is to identify the TIGER-ness of existing data, which gets at #2 above (and can be done using a combination of username and version analysis); the second is to compute simple diffs between TIGER 2013 and either current OSM data or TIGER 2006.

I think more work could be done on #3 above. The two approaches so far have been "way density" and Telenav data, but surprisingly none use gridded population data. This is something that I plan to do in the future, as a remix of some of these previous efforts –

  1. Find raw data for the Mapbox maps which calculate “new TIGER” areas
  2. Use raw data on user count by cell (http://openstreetmap.us/~migurski/TIGER-Raster/nodes/)
  3. Use population data and look into getting access to Telenav “usage” data

Calculate these three metrics at the cell level, and combine them to come up with a final “quality” map.

Code and Other Resources (TODO: very incomplete)

References