I have recently been interested in measuring how OpenStreetMap is being used in different services around the world. This is a very hard question to answer: because OSM is an open project, anyone can download the data at any time and start playing around with it. We don’t require any permission for this, and while the ODbL license does require attribution if you use the data in production, such attribution is hard to track. OpenStreetMap data can be found on planes and in disaster relief, not to mention the thousands of web and mobile applications that use it for all sorts of purposes.
Alright, having convinced you that it’s quite hard to track all possible uses of OpenStreetMap, perhaps it is possible to track usage of OSM tiles in web applications online? While still difficult, this is easier to accomplish, because at the very least the question is well defined and, in theory, answerable. If we could survey each and every website out there and see whether it uses tiles from an OpenStreetMap server (or a Mapbox server), we might be able to say something about OSM usage. This still would not cover cases where folks have set up their own tile server with OSM data, which one might argue is quite a common way to use OSM data.
Either way, I recently discovered HTTPArchive and thought it would be a cool project to track the usage of different mapping APIs online, including Mapbox and folks using OpenStreetMap tiles (which you’re not supposed to do for heavy usage!). What HTTPArchive does is crawl about the top million websites, and for each website it records the HTTP requests that the site makes. This, it turns out, is a great way to measure the usage of JavaScript frameworks, Google Analytics etc. Except no one so far has used it to look at mapping APIs, and OpenStreetMap in particular!
So, I thought I would do that! The HTTPArchive data is quite large (petabytes, I hear), but fortunately it’s all available on Google BigQuery, which makes it a cinch to query. Results from some of my explorations are below.
HTTPArchive data is stored in two important tables (two for each “run”): pages and requests. The latest versions can always be found at latest_pages and latest_requests. The pages table contains information like the URL scraped, number of bytes etc. Let’s see if the main OpenStreetMap website is in the data. The following query does the job:
SELECT pageid, url, urlShort FROM [httparchive:runs.latest_pages]
WHERE REGEXP_MATCH(urlShort, r'openstreetmap\.org');
Yes it is! This query returns:
Row  pageid    url                            urlShort
1    17330926  http://www.openstreetmap.org/  http://www.openstreetmap.org/
Seems like the pageid is 17330926. Now, the requests table is where all the juicy information is contained. Let’s see what requests the OSM website makes:
SELECT * FROM [httparchive:runs.latest_requests]
WHERE pageid = 17330926;
And this is the response that you get – about 35 requests. That data is here. As you can see, this includes a number of requests for *.tile.openstreetmap.org, OSM’s public tileserver.
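To get a feel for what that list of requests contains, it can help to group the request URLs by hostname. Here is a minimal Python sketch of that idea. The sample URLs are made up for illustration (they are not the actual rows returned by the query), and the helper names `count_hosts` and `tile_requests` are my own.

```python
from collections import Counter
from urllib.parse import urlsplit

# Hypothetical sample of request URLs for one page; the real rows come from
# the requests table and are not reproduced here.
sample_requests = [
    "http://www.openstreetmap.org/",
    "http://a.tile.openstreetmap.org/5/15/10.png",
    "http://b.tile.openstreetmap.org/5/16/10.png",
    "http://c.tile.openstreetmap.org/5/15/11.png",
    "http://www.openstreetmap.org/assets/application.js",
]

TILE_DOMAIN = "tile.openstreetmap.org"


def count_hosts(urls):
    """Count how many requests went to each hostname."""
    return Counter(urlsplit(u).hostname for u in urls)


def tile_requests(urls):
    """Keep only URLs whose host is tile.openstreetmap.org or a subdomain of it."""
    kept = []
    for u in urls:
        host = urlsplit(u).hostname or ""
        if host == TILE_DOMAIN or host.endswith("." + TILE_DOMAIN):
            kept.append(u)
    return kept


host_counts = count_hosts(sample_requests)
tiles = tile_requests(sample_requests)
print(host_counts.most_common())
print(len(tiles), "tile requests")
```

Matching on the parsed hostname rather than on the whole URL avoids counting a request that merely mentions the tile server somewhere in its path or query string.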
Which other websites make similar requests? This is where the HTTPArchive really shines. After some experimentation, the following SQL query does the trick:
SELECT urlShort FROM [httparchive:runs.latest_pages] AS pages JOIN (
SELECT pageid, REGEXP_EXTRACT(url, r'(tile\.openstreetmap\.org)') AS link2
FROM [httparchive:runs.latest_requests] AS requests
WHERE REGEXP_MATCH(url, r'tile\.openstreetmap\.org')
) AS lib ON pages.pageid = lib.pageid
GROUP BY urlShort;
Not many websites seem to be hitting the tileserver directly – which is reassuring. That data is here.
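If the join in that query is hard to read, the same logic can be spelled out on a few in-memory rows. This is only a sketch in Python, not something that runs against BigQuery. The rows below are invented stand-ins for latest_pages and latest_requests, and `pages_using_tiles` is a hypothetical helper name.

```python
import re

# Invented rows standing in for latest_pages and latest_requests.
pages = [
    {"pageid": 1, "urlShort": "http://example.org/"},
    {"pageid": 2, "urlShort": "http://example.com/"},
    {"pageid": 3, "urlShort": "http://example.net/"},
]
requests = [
    {"pageid": 1, "url": "http://a.tile.openstreetmap.org/3/4/2.png"},
    {"pageid": 1, "url": "http://b.tile.openstreetmap.org/3/5/2.png"},
    {"pageid": 2, "url": "http://example.com/style.css"},
    {"pageid": 3, "url": "http://c.tile.openstreetmap.org/7/60/40.png"},
]

# Same pattern as the WHERE clause, with the dots escaped.
TILE_RE = re.compile(r"tile\.openstreetmap\.org")


def pages_using_tiles(pages, requests):
    """Return the distinct urlShort values of pages with a tile request."""
    # Inner SELECT: page ids of requests whose URL matches the tile server.
    matching_ids = {r["pageid"] for r in requests if TILE_RE.search(r["url"])}
    # JOIN on pageid, then GROUP BY urlShort (a set removes duplicates).
    return sorted({p["urlShort"] for p in pages if p["pageid"] in matching_ids})


result = pages_using_tiles(pages, requests)
print(result)
```

A page that loads many tiles shows up only once in the result, which is what the GROUP BY achieves in the SQL version.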
A final interesting thing to run would be a similar analysis for Google Maps and Mapbox. Queries for Mapbox and Google Maps are available on GitHub. The data from the Mapbox queries is here; I’m still working on getting the data on Google Maps API usage. That is for another post!
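The same pattern-matching idea extends to several providers at once. The sketch below is in Python and is not the set of queries published on GitHub. The Mapbox and Google Maps host patterns are my assumptions about what such a check might look for, the sample requests are invented, and `providers_by_page` is a hypothetical helper.

```python
import re
from collections import defaultdict

# Assumed host patterns per provider; only the OpenStreetMap one is taken
# from the query in this post. The others are illustrative guesses.
PROVIDER_PATTERNS = {
    "openstreetmap": re.compile(r"tile\.openstreetmap\.org"),
    "mapbox": re.compile(r"tiles\.mapbox\.com"),
    "google_maps": re.compile(r"maps\.googleapis\.com|maps\.google\.com"),
}


def providers_by_page(requests):
    """Map each pageid to the set of mapping providers it requests from."""
    found = defaultdict(set)
    for pageid, url in requests:
        for name, pattern in PROVIDER_PATTERNS.items():
            if pattern.search(url):
                found[pageid].add(name)
    return dict(found)


# Invented (pageid, url) pairs for illustration.
sample = [
    (1, "http://a.tiles.mapbox.com/v3/example/1/0/0.png"),
    (1, "http://maps.googleapis.com/maps/api/js"),
    (2, "http://b.tile.openstreetmap.org/1/0/0.png"),
    (3, "http://example.com/app.js"),
]

usage = providers_by_page(sample)
print(usage)
```

Pages that make no matching request, like page 3 here, simply do not appear in the result.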
Hope you will find HTTPArchive a useful tool to analyze data from the web. It certainly seems easy to use, and it has lots of interesting data for analysis! Happy exploring!
Discussion
Comment from Stalfur on 25 October 2014 at 00:25
Very nice. I’m running Mapping Botswana at [www.openstreetmap.org.bw] which is basically just a few pages with a wrap around tile.openstreetmap.org. It is only a few months old so I guess it hasn’t arrived yet in the published data.
Comment from dalek2point3 on 25 October 2014 at 15:08
I think another reason might be that the domain is not (yet) in the top million domains for Alexa and so they might not be indexing it.
Comment from karussell on 25 October 2014 at 19:04
How can it intercept an API call or HTTP request technically?