I have recently been interested in measuring how OpenStreetMap is being used in different services around the world. Obviously, this is a very hard question to answer: OSM is an open project, so anyone can download the data at any time and start playing around with it. We don't require any permission for this, and while the ODbL does require attribution if you use the data in production, such attribution is hard to track. OpenStreetMap data can be found on planes and in disaster relief, not to mention the thousands of web and mobile applications that use it for all sorts of purposes.
Alright, having convinced you that it's quite hard to track all possible uses of OpenStreetMap, perhaps it is possible to track usage of OSM tiles in web applications? While still difficult, this is easier to accomplish, because at the very least the question is well defined and, in theory, answerable. If we could survey each and every website out there and see whether it uses tiles from an OpenStreetMap server (or a Mapbox server), we might be able to say something about OSM usage. This still would not cover cases where folks have set up their own tileserver with OSM data, which one might argue is quite a common way to use OSM data.
So, I thought I would do that, using the data from HTTPArchive! The HTTPArchive data is quite large (petabytes, I hear), but fortunately it's all available on Google BigQuery, which makes it a cinch to query. Results from some of my explorations are below.
HTTPArchive data is stored in two important tables (two for each "run"): pages and requests. The latest versions can always be found at latest_pages and latest_requests. The pages table contains information like the URL scraped, the number of bytes, etc. Let's see if the main OpenStreetMap website is in the data. The following query does the job:
SELECT pageid, url, urlShort FROM [httparchive:runs.latest_pages] WHERE REGEXP_MATCH(urlShort, r'openstreetmap\.org');
Yes, it is! From the query result, it seems the pageid is 17330926. Now, the requests table is where all the juicy information is contained. Let's see what requests the OSM website makes:
SELECT * FROM [httparchive:runs.latest_requests] WHERE pageid = 17330926;
And this is the response that you get: about 35 requests. That data is here. As you can see, this includes a number of requests for *.tile.openstreetmap.org, OSM's public tileserver.
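To get a feel for what "a number of requests for *.tile.openstreetmap.org" looks like once the rows are exported, here is a minimal Python sketch that counts requests per tile host. The URLs below are made up for illustration and are not the actual HTTPArchive rows, and the helper name `tile_hosts` is hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical request URLs for a single page; illustrative only,
# not copied from the HTTPArchive result.
request_urls = [
    "http://www.openstreetmap.org/assets/application.js",
    "http://a.tile.openstreetmap.org/5/15/10.png",
    "http://b.tile.openstreetmap.org/5/16/10.png",
    "http://c.tile.openstreetmap.org/5/15/11.png",
    "http://a.tile.openstreetmap.org/5/16/11.png",
]


def tile_hosts(urls, suffix=".tile.openstreetmap.org"):
    """Count requests per hostname, keeping only hosts ending in `suffix`."""
    hosts = (urlparse(u).hostname or "" for u in urls)
    return Counter(h for h in hosts if h.endswith(suffix))


counts = tile_hosts(request_urls)
for host, n in sorted(counts.items()):
    print(host, n)
```

Matching on the parsed hostname rather than on the whole URL avoids counting a page that merely mentions the tileserver somewhere in a query string.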
Which other websites make similar requests? This is where the HTTPArchive really shines. After some experimentation, the following SQL query does the trick:
SELECT urlShort FROM [httparchive:runs.latest_pages] AS pages JOIN ( SELECT pageid, REGEXP_EXTRACT(url, r'(tile\.openstreetmap\.org)') AS link2 FROM [httparchive:runs.latest_requests] AS requests WHERE REGEXP_MATCH(url, r'tile\.openstreetmap\.org') ) AS lib ON pages.pageid = lib.pageid GROUP BY urlShort;
Not many websites seem to be hitting the tileserver directly – which is reassuring. That data is here.
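If you want to convince yourself of what the join is doing without touching BigQuery, the same logic can be tried locally. The sketch below uses Python's built-in sqlite3 as a stand-in, so the SQL dialect differs from BigQuery's, and the pages and requests rows are invented for illustration. SQLite has no regex support out of the box, so a small `regexp` function is registered first.

```python
import re
import sqlite3

conn = sqlite3.connect(":memory:")

# SQLite evaluates "X REGEXP Y" by calling regexp(Y, X), so the pattern
# arrives as the first argument. re.search gives a partial match.
conn.create_function(
    "regexp", 2,
    lambda pattern, value: re.search(pattern, value or "") is not None,
)

conn.executescript("""
    CREATE TABLE pages (pageid INTEGER, urlShort TEXT);
    CREATE TABLE requests (pageid INTEGER, url TEXT);
""")

# Invented sample rows; they only mimic the shape of the real tables.
conn.executemany("INSERT INTO pages VALUES (?, ?)", [
    (1, "http://www.example-one.test/"),
    (2, "http://www.example-two.test/"),
    (3, "http://www.example-three.test/"),
])
conn.executemany("INSERT INTO requests VALUES (?, ?)", [
    (1, "http://a.tile.openstreetmap.org/1/0/0.png"),
    (1, "http://b.tile.openstreetmap.org/1/1/0.png"),
    (2, "http://www.example-two.test/style.css"),
    (3, "http://c.tile.openstreetmap.org/2/1/1.png"),
])

# Keep requests that hit the tileserver, join them back to their page,
# and collapse to one row per site.
rows = conn.execute(r"""
    SELECT pages.urlShort
    FROM pages
    JOIN requests ON pages.pageid = requests.pageid
    WHERE requests.url REGEXP 'tile\.openstreetmap\.org'
    GROUP BY pages.urlShort
    ORDER BY pages.urlShort
""").fetchall()

sites = [row[0] for row in rows]
print(sites)
```

Page 1 makes two tile requests but shows up once, which is the job the GROUP BY does in the original query.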
A final interesting thing to run would be a similar analysis for Google Maps and Mapbox. Queries for Mapbox and Google Maps are available on GitHub. The data from these queries is here for Mapbox, and I'm still working on getting the data on Google Maps API usage. That is for another post!
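Since the query for each provider differs only in the hostname pattern, one way to avoid copy-pasting is to generate it. This is a rough sketch, not the queries from the GitHub repository: the helper `build_query` is hypothetical, the Mapbox and Google hostnames are my assumptions about which hosts to look for, and the generated SQL has not been run against BigQuery.

```python
import re

# Same shape as the tileserver query, with the host pattern left open.
QUERY_TEMPLATE = (
    "SELECT urlShort "
    "FROM [httparchive:runs.latest_pages] AS pages "
    "JOIN ( "
    "SELECT pageid FROM [httparchive:runs.latest_requests] AS requests "
    "WHERE REGEXP_MATCH(url, r'{pattern}') "
    ") AS lib ON pages.pageid = lib.pageid "
    "GROUP BY urlShort;"
)


def build_query(host):
    """Return the query text for one host, escaping regex metacharacters."""
    return QUERY_TEMPLATE.format(pattern=re.escape(host))


# Hostnames are assumptions for illustration, not taken from the post.
providers = {
    "osm": "tile.openstreetmap.org",
    "mapbox": "tiles.mapbox.com",
    "google": "maps.googleapis.com",
}

queries = {name: build_query(host) for name, host in providers.items()}
print(queries["osm"])
```

Escaping the hostname matters because an unescaped "." in a regex matches any character, so the pattern would be looser than intended.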
Hope you will find HTTPArchive a useful tool to analyze data from the web. It certainly seems easy to use, and it has lots of interesting data for analysis! Happy exploring!