
Jennings Anderson's diary


State of the Map US 2018: OpenStreetMap Data Analysis Workshop

Posted by Jennings Anderson on 5 December 2018 in English (English)

(This is a description of a workshop Seth Fitzsimmons and I put on at State of the Map US 2018 in Detroit, Michigan. Cross-posting from this repository)

Workshop: October 2018

Workshop Abstract

With an overflowing Birds-of-a-Feather session on “OSM Data Analysis” the past few years at State of the Map US, we’d like to leave the nest as a flock. Many SotM-US attendees build and maintain various OSM data analysis systems, many of which have been and will be presented in independent sessions. Further, better analysis systems have yet to be built, and OSM analysis discussions often end with what is left to be built and how it can be done collaboratively. Our goal is to bring the data-analysis back into the discussion through an interactive workshop. Utilizing web-based interactive computation notebooks such as Zeppelin and Jupyter, we will step through the computation and visualization of various OpenStreetMap metrics.

tl;dr:

We skip the messy data-wrangling parts of OSM data analysis by pre-processing a number of datasets with osm-wayback and osmesa. This creates a series of CSV files with editing histories for a variety of US cities which workshop participants can immediately load into example analysis notebooks to quickly visualize OSM edits without ever having to touch raw OSM data.

1. Background

OpenStreetMap is more than an open map of the world: it is the cumulative product of billions of edits by nearly 1M active contributors (and another 4M registered users). Each object on the map can be edited multiple times. Each time the major attributes of an object are changed in OSM, the version number is incremented. To get a general idea of how many major changes exist in the current map, we can count the version numbers for every object in the latest osm-qa-tiles. This isn’t every single object in OSM, but includes nearly all roads, POIs, and buildings.

 Histogram of Object Versions from OSM-QA-Tiles

OSM object versions by type. 475M objects in OSM have only been edited once, meaning they were created and haven't been subsequently edited in a major way. However, more than 200M have been edited more than once. Note: less than 10% of these edits are from bots or imports.
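The tally behind a histogram like this is simple once the data is in hand. Here is a sketch against a hypothetical array of GeoJSON features carrying the `@version` property that osm-qa-tiles exposes (the sample features are made up):

```javascript
// Tally object versions from a list of osm-qa-tiles features.
// Each feature carries an '@version' property with its major edit count.
function versionHistogram(features) {
  var histogram = {};
  features.forEach(function (feat) {
    var v = feat.properties['@version'];
    histogram[v] = (histogram[v] || 0) + 1;
  });
  return histogram;
}

// Hypothetical sample features:
var sample = [
  { properties: { '@version': 1 } },
  { properties: { '@version': 1 } },
  { properties: { '@version': 3 } }
];
console.log(versionHistogram(sample)); // { '1': 2, '3': 1 }
```

In practice this runs inside a tile-reduce map script over every tile of the planet, with the per-tile histograms merged in the reduce step.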

Furthermore, when a contributor edits the map, the effect of their edit depends on the type of OSM element that was modified. Moving nodes may also affect the geometry of ways and relations (lines and polygons) without those elements being touched directly. Thus, a contributor's edits may have an indirect effect elsewhere (we track these as "minor versions"). Conversely, when editing a way or relation's tags, no geometries are modified, so counts within defined geographical boundaries often don't incorporate these edits. Therefore, to better understand the evolution of the map, we need analysis tools that can expose and account for these rich and nuanced editing histories. There is a plethora of community-maintained tools to help parse and process the massive OSM database, though none of them currently handles the full history and the relationships between every object on the map. Questions such as "how many contributors have been active in this particular area?" are therefore very difficult to answer at scale. As we should expect, this number also varies drastically around the globe:

 Map of 2015 users: areas with more than 10 active contributors in 2015 (source). The euro-centric editing focus doesn't surprise us, but this map also shows another area with an unprecedented number of active contributors in 2015: Nepal. This was in response to the April 2015 Nepal Earthquake. It is just one of many examples of OSM editing history being situational, complex, and often difficult to conceptualize at scale.

Putting on a Workshop

The purpose of this workshop was two-fold: first, we wanted to take the OSM data analysis discussion past "how do we best handle the data?" to actual data analysis. The complicated and often messy editing history of objects in OSM makes simply transforming the data into something readable by common data-science tools an exceedingly difficult task (described next). Second, we hoped that providing such an environment to explore the data would in turn generate more questions around the data: What is it that people want to measure? What are the insightful analytics?

2. Preparing the Data: What is Available?

This was the most hand-wavey part of the workshop, and intentionally so. Seth and I have been tackling the problems of historical OpenStreetMap data representation independently for a few years now. Preparing for this workshop was one of the first times we had a chance to compare some of the numbers produced by OSMesa and OSM-Wayback, the respective full-history analysis infrastructures that we're building. As expected, there were some differences in our results based on how we count objects and measure history, so this was a fantastic opportunity to sit down and talk through these differences and validate our measures. In short, there are many ways that people can edit the map, and it's important to distinguish between the following edit types:

  1. Creating a new object
  2. Slightly editing an existing object's geometry (moving the nodes around in a way)
  3. Majorly editing an existing object's geometry (deleting or adding nodes in a way)
  4. Editing an existing object's attributes (tag changes)
  5. Deleting an existing object

All but edit type 2 increase the version number of the OSM object. This makes the edit easier to identify at the OSM element level, because the version number reflects the number of times the object has been edited. Edit type 2, however, a slight change to an object's geometry, is a common edit that is often overlooked because it is not reflected in the version number. Moving the corners of a building to "square it up" or correcting a road to align better with aerial imagery are just two examples of edit type 2. We call these changes minor versions. To account for them, we add a metadata field to an object called minor version that is 0 for newly created objects and > 0 for any number of minor-version changes between major versions. When a new major version is created, the minor version resets to 0.
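The bookkeeping described above can be sketched as a tiny state machine. The function name and shape here are illustrative, not osm-wayback's actual API:

```javascript
// Track major and minor versions of an object through a sequence of edits.
// Edit types follow the numbered list above: only type 2 (a slight geometry
// change) increments the minor version; every other type bumps the major
// version and resets the minor version to 0.
function applyEdit(state, editType) {
  if (editType === 2) {
    return { version: state.version, minorVersion: state.minorVersion + 1 };
  }
  return { version: state.version + 1, minorVersion: 0 };
}

// A building is created, squared up twice (type 2), then retagged (type 4):
var state = { version: 1, minorVersion: 0 }; // newly created
state = applyEdit(state, 2); // { version: 1, minorVersion: 1 }
state = applyEdit(state, 2); // { version: 1, minorVersion: 2 }
state = applyEdit(state, 4); // { version: 2, minorVersion: 0 }
```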

Quantifying Edits

Each of the above edit types refers to a single map object. In this context, we consider map objects to be OSM objects that carry some level of descriptive attributes. As opposed to OSM elements (nodes, ways, or relations), an object is the logical representation of a real-world object: a road, building, or POI. This is an important distinction to make when talking about OSM data because the relationship is not one-to-one: a single OSM element does not necessarily correspond to a single map object. A rectangular building object, for example, comprises at minimum 5 OSM elements: at least 4 nodes (likely untagged) that define the corners and the way that references these nodes with an attribute of building=*. An edit to any one of these elements is then considered an edit to the building.
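To make the element-vs-object distinction concrete, here is a small sketch (with made-up IDs) of counting the OSM elements that make up one building object:

```javascript
// A rectangular building: one tagged way plus the untagged corner nodes it
// references. The closed ring repeats the first node, so we count distinct
// node IDs rather than the length of the reference list.
var buildingWay = {
  id: 100,
  tags: { building: 'yes' },
  nodeRefs: [1, 2, 3, 4, 1] // closed ring: first node repeated at the end
};

function elementCount(way) {
  var distinctNodes = new Set(way.nodeRefs);
  return 1 + distinctNodes.size; // the way itself + its distinct nodes
}

console.log(elementCount(buildingWay)); // 5
```

An edit that moves any one of nodes 1-4 changes this building's geometry without touching way 100 at all, which is exactly the "minor version" case described above.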

This may seem obvious when thinking about editing OpenStreetMap and how the map gets made, but reconstructing this version of OSM editing history from the database is difficult and largely remains an unsolved (unimplemented) problem at the global scale: i.e., there does not yet exist a single (public, production) API end-point to reconstruct the history of any arbitrary object with regards to all 5 types of edits mentioned above.

Working towards such an API, another important piece of infrastructure to mention here is the ohsome project, built on the oshdb. This is another approach to working with OSM full-history data that can ingest full-history files and handle each of these edit types.

Making the Data Available

For this workshop then, we pre-computed a number of statistics for various cities that describe the historical OSM editing record at per-edit, per-changeset, and per-user granularities (further described below).

3. Interactive Analysis Environment

Jupyter notebooks allowed us to host a single analysis environment for the workshop such that each participant did not have to install or run any analysis software on their own machines. This saved a lot of time and allowed participants to jump right into analysis. For the workshop, we used a single machine operated by ChameleonCloud.org and funded by the National Science Foundation to host the environment. I hope to provide this type of service again at other conferences or workshops. Please be in touch if you are interested in hosting a similar workshop and I can see if hosting a similar environment for a short duration is possible!

Otherwise, it is possible to recreate the analysis environment locally with the following steps:

  1. Download Jupyter
  2. Clone this repository: jenningsanderson/sotmus-analysis
  3. Run Jupyter and navigate to sotmus-analysis/analysis/ for the notebook examples.

4. Available Notebooks & Datasets

We pre-processed data for a variety of regions with the following resolution:

1. Per User Stats

A comprehensive summary of editing statistics (new buildings, edited buildings, km of new roads, edited roads, number of sidewalks, etc.; see the full list here), totaled for each user active in the area of interest. This dataset is ideal for comparing editing activity among users: Who has edited the most? Who is creating the most buildings? It is great for building leaderboards and getting a general idea of how many users are active in an area and what the distribution of work per user looks like.
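As a sketch of the kind of leaderboard this dataset enables, here is the ranking logic in JavaScript to match the other examples in these posts. The rows and column names are made up for illustration; the real columns are in the linked list:

```javascript
// Build a "most buildings created" leaderboard from per-user stat rows.
// A real notebook would load these rows from one of the pre-processed CSVs.
var perUserStats = [
  { user: 'mapper_a', newBuildings: 120, newRoadKM: 4.2 },
  { user: 'mapper_b', newBuildings: 310, newRoadKM: 0.8 },
  { user: 'mapper_c', newBuildings: 45,  newRoadKM: 19.5 }
];

function leaderboard(rows, column, topN) {
  return rows
    .slice() // copy so the input array is not mutated by sort
    .sort(function (a, b) { return b[column] - a[column]; })
    .slice(0, topN)
    .map(function (row) { return row.user; });
}

console.log(leaderboard(perUserStats, 'newBuildings', 2)); // [ 'mapper_b', 'mapper_a' ]
```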

2. Per Changeset Stats

The same editing statistics as above (see full list of columns here) but with higher resolution: grouped by the changeset. A changeset is a very logical unit of analysis for looking at the evolution of the map in a given area. Since each changeset can only be from one user, this is the next level of detail from user summaries. Since changeset IDs are sequential, this is a great dataset for time-series analysis. Unfortunately, due to a lack of changeset extracts for the selected regions (time constraints, fun!), OSMesa-generated roll-ups do not include actual timestamps. This caused some confusion for a group looking at Chicago, as visualization of their building import did not show the condensed timeframe during which many changesets were made when using changeset ID as the x-axis.

3. Per Edit Stats

This dataset records each individual edit to the map. This dataset is best for understanding exactly what changed on the map with each edit. Each edit tracks the tags changed as well as the geometry changes (if any). This dataset is significantly larger than the other two.

What cities are available?

Detroit is currently available in this repository. See this list in the readme for a handful of North American cities available for download.

5. Example Notebooks

  1. Per User Stats
  2. Per Changeset Stats
  3. Per Edit Stats

Editing heatmap Example heatmap from building edits in Detroit

If you’re interested in more of this type of analysis, directions on setting up this analysis environment locally can be found in this repository. Furthermore, much of this is my current dissertation work, so I’m always happy to chat more about it. Thanks!

Location: The Hill, Boulder, Boulder County, Colorado, 80802, USA

Watching the Map Grow: State of the Map US Presentation

Posted by Jennings Anderson on 27 November 2017 in English (English)

SOTMUS Logo

At State of the Map US last month, I presented my latest OSM analysis work. This is work done in collaboration between the University of Colorado Boulder and Mapbox. You can watch the whole presentation here or read on for a summary followed by extra details on the methods with some code examples.

OpenStreetMap is Constantly Improving

At the root of this work is the notion that OSM is constantly growing. This makes OSM uniquely different from other comparable sources of geographic information. To this extent, static assessments of quality notions such as completeness or accuracy are limited. For a more holistic perspective on the constantly evolving project, this work focuses on the growth of the map over time.

Intrinsic Data Quality Assessment

Intrinsic quality assessment relies only on internal attributes of the target data, not on external datasets as points of reference for comparison. In contrast, extrinsic data quality assessments of projects like OSM and Wikipedia compare the data directly to external, often authoritative, reference datasets. For many parts of the world, however, such datasets do not exist, making extrinsic analysis impossible.

Here we look at features of the OSM database over time. By comparing attributes like the number of contributors, the density of buildings, and the kilometers of roads, we can learn how the map grows and ultimately improves over time.

Specifically, we aim to explore the following:

Contributors

  • How many?
  • How recent?
  • What type of edits?

Objects

  • What types?
  • How many?
  • Relative Density?
  • Object version?

The bulk of this work involves designing a data pipeline to better allow us to ask these types of questions of the OSM database. This next section takes a deep dive into these methods. The final section, Visualizing, has a series of gifs that show the results to-date.

The interactive version of the dashboard in these GIFS can be found here: http://mapbox.github.io/osm-analysis-dashboard


Methods: Vector Tiles

Specifically, zoom level 15 vector tiles are the base of this work. Zoom level 15 is chosen because (depending on latitude) most tiles have an area of roughly 1 square kilometer. For scale, a zoom 15 tile looks like this:

z-15-vector-tile
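A quick back-of-the-envelope check of that ~1 km² figure: a zoom-z tile spans 360/2^z degrees of longitude, and the ground distance that covers shrinks with the cosine of latitude. This is an approximation (mercator tiles are not exactly square on the ground), but it shows why zoom 15 was a good fit:

```javascript
// Approximate edge length (km) of a web-mercator tile at a given
// zoom level and latitude.
var EARTH_CIRCUMFERENCE_KM = 40075;

function tileEdgeKM(zoom, latitudeDeg) {
  var latRad = latitudeDeg * Math.PI / 180;
  return EARTH_CIRCUMFERENCE_KM * Math.cos(latRad) / Math.pow(2, zoom);
}

// At the equator a z15 tile is about 1.22 km wide...
console.log(tileEdgeKM(15, 0).toFixed(2));  // 1.22
// ...and near 35 degrees latitude it is almost exactly 1 km:
console.log(tileEdgeKM(15, 35).toFixed(2)); // 1.00
```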

Vector Tiles are chosen primarily for three reasons:

  1. Vector Tiles (specifically OSM data in the .mbtiles format) are standalone sqlite databases. This means very little overhead to maintain (no running database), and they are very easy to transfer and move around on disk.

  2. They are inherently a spatial datastore. With good compression, the file sizes are not dramatically larger than standard OSM PBF files, but they can be loaded onto a map with no processing (mostly done with mbview).

  3. Vector Tiles can be processed efficiently with the tile-reduce framework.

In sum, at any point in the process, a single file exists that can easily be visualized spatially.

Quarterly Historic Snapshots

To capture the growth of the map over time, we create historical snapshots of the map: OSM-QA-Tiles that represent the map at a given point in history. You can read more about OSM-QA-Tiles here.

Boulder Map Growth

This image shows the growth of Boulder, CO over the last decade. The top row shows the road network rapidly filling in over 9 months during the TIGER import, and the bottom row shows the densification of the road and trail network along with the addition of buildings over the last 5 years.

The global-scale quarterly snapshots we created are available for download here: osmlab.github.io/osm-qa-tiles/historic.html.

While quarterly snapshots can teach us about the map at a specific point in history, they do not contain enough information to tell us how the map has changed: the edits that happen between the quarters. To really answer questions such as "how many users edited the map?", "how many kilometers of roads were edited?", or "how many buildings were added?", we need the full editing history of the map.

Historical Tilesets

The full editing history of the map is made available in various formats on a weekly basis. Known as the full history dump, this massive file can be processed in a variety of ways to help reconstruct the exact process of building the map.

Since OSM objects are defined by their tags, we focus on the tagging history of objects. To do this, we define a new schema for historical osm-qa-tiles. The new vector tiles extend the current osm-qa-tiles by including an additional attribute, @history.

Currently, these are built with the OSM-Wayback utility. Still in development, this utility uses rocksdb to build a historical tag index for every OSM object. It does this by parsing a full-history file and saving each individual version of each object to a large index (Note: Currently only saves objects with tags, and does not save geometries). This can be thought of as creating an expanded OSM history file that is optimized for lookups. For the full planet, this can create indexes up to 600GB in size.

Once the index is built, the utility can ingest a ‘stream’ of the latest OSM features (such as those produced by minjur or osmium-export). If the incoming object version is greater than 1, then it performs a lookup for each previous version of the object in this index.

The incoming object is then augmented to have an additional @history property. The augmented features are then re-encoded with tippecanoe to create a full-historical tileset.

Tag History

Here is an example of a tennis court that is currently at version 3 in the database. The @history property contains a list of each version with details about which tags were added or deleted in each version.
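An augmented feature looks roughly like the following object. This is an illustrative sketch of the idea, not osm-wayback's exact output schema; field names and the precise shape of each history entry may differ:

```javascript
// Sketch of a feature augmented with an '@history' property: each entry
// describes one major version and the tags added or deleted in it.
var augmentedFeature = {
  type: 'Feature',
  properties: {
    '@id': 123456,          // hypothetical way id
    '@version': 3,          // current version in the database
    leisure: 'pitch',
    sport: 'tennis',
    '@history': [
      { version: 1, tagsAdded: { leisure: 'pitch' } },
      { version: 2, tagsAdded: { sport: 'tennis' } },
      { version: 3, tagsAdded: { surface: 'asphalt' }, tagsDeleted: ['fixme'] }
    ]
  },
  geometry: { type: 'Polygon', coordinates: [] } // geometry omitted for brevity
};

console.log(augmentedFeature.properties['@history'].length); // 3
```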

A Note on Scale & Performance

Full history tilesets are rendered at zoom level 15. OSM-QA-Tiles are typically rendered only at zoom level 12, but we found zoom 15 to be better, not only for the higher resolution: since many features are now much larger because they contain multiple versions, the finer tiling lowers the number of features per tile, keeping tile-reduce processing efficient.

One downside, however, is that at zoom 15, the total number of tiles required to render the entire planet can be problematically large (depending on the language/library reading the file). For this reason, historical tilesets should be broken into multiple regions.

Processing 1: Create Summary Tilesets

The first step in processing these tiles is to ensure that the data are at the same resolution. Historical tilesets are created at zoom 15 resolution while osm-qa-tiles exist at zoom 12 resolution. Zoom 12 is the highest resolution at which the entire planet should be rendered to osm-qa-tiles to ensure efficiency in processing. Therefore, we start by summarizing zoom 15 resolution into zoom 12 tiles.

Summarizing Zoom 15 Resolution at Zoom 12

A zoom-12 tile contains 64 child zoom-15 tiles (64 tiles = 4^(15-12), resulting in an 8x8 grid). To create summary tilesets for data initially rendered at zoom 12 (like the snapshot osm-qa-tiles), we calculate statistics about each child zoom-15 tile inside of a zoom-12 tile. This is done with a tile-reduce script that first bins each feature into the appropriate one of the 64 child zoom-15 tiles and then computes various statistics for each of them, such as "total kilometers of named highway" or "density of buildings".
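The 64-children arithmetic can be verified directly by enumerating quadkeys. This mirrors what tilebelt's child-tile helpers do, implemented here from scratch so the sketch is self-contained:

```javascript
// Enumerate all descendant quadkeys of a tile at a deeper zoom level.
// A quadkey has one quadrant digit (0-3) per zoom level, so each extra
// level multiplies the number of descendants by 4.
function descendantQuadkeys(quadkey, targetZoom) {
  var levels = targetZoom - quadkey.length; // quadkey length equals its zoom
  var keys = [quadkey];
  for (var i = 0; i < levels; i++) {
    var next = [];
    keys.forEach(function (key) {
      ['0', '1', '2', '3'].forEach(function (digit) {
        next.push(key + digit);
      });
    });
    keys = next;
  }
  return keys;
}

// A zoom-12 tile (12-character quadkey) has 4^3 = 64 zoom-15 children:
var z12 = '023112023112'; // arbitrary example quadkey
console.log(descendantQuadkeys(z12, 15).length); // 64
```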

Since each of these attributes pertains to the zoom-15 tile and not individual features, individual object geometries are ignored. Instead, these statistics are represented by a single feature: a point at the center of the zoom-15 tile that it represents. Each feature then looks like:

geometry: <Point geometry representing center of zoom-15 tile>
properties: {
   quadkey:            <unique quadkey for zoom-15 tile>,
   highwayLength:      <total kilometers of highways>,
   namedHighwayLength: <kilometers of named highways>,
   buildingCount:      <number of buildings>,
   buildingArea:       <total area of building footprints>,
   ...
}

These features are encoded into zoom-12 tiles, each with no more than 64 features. The result is a lightweight summary tileset (only point-geometries) rendered at zoom-12.

Summarizing Editing Histories

The summarization of the editing histories is very similar, except that the input tiles are already at zoom 15. Therefore, we skip the binning process and just summarize the features in each tile. Similarly, up to 64 individual features that each represent a zoom-15 tile are re-encoded into a single zoom-12 tile. Each feature includes editing statistics per-user for the zoom-15 tile it represents:

geometry: <Point geometry representing center of zoom-15 tile>
properties: {
  quadkey: <unique quadkey for zoom-15 tile>,
  users: [
    {
      name: <user name>,
      uid: <user id>,
      editCount: <total number of edits>,
      newFeatureCount: <number of edits where version=1>,
      newBuildings: <number of buildings created>,
      editedBuildings: <number of buildings edited>,
      newHighwayKM: <kilometers of highways created>,
      editedHighwayKM: <kilometers of highways edited>,
      addedHighwayNames: <number of `name` tags added to highways>,
      modHighwayNames: <number of existing `name` tags modified on highways>
    },
    { ... }
  ],
  usersEver: <array of all user ids ever to edit on this tile>
}

Why go through all of this effort to tile it?

Keeping these data in the mbtiles format enables spatial organization of the editing summaries in a single file. Encoding zoom-15 summaries into zoom-12 tiles yields the ideal tile size for the mbtiles format, and the result can be efficiently processed with tile-reduce.

Processing 2: Calculate & Aggregate

With the above summarization, we have two tilesets each rendered at zoom 12 with zoom 15 level resolution. We can now pass both tilesets into a tile-reduce script. This is done by specifying multiple sources when initializing the tile-reduce job:

var tileReduce = require('@mapbox/tile-reduce');

tileReduce({
  zoom: 12,
  map: path.join(__dirname, '/map-tileset-aggregator.js'),
  sources: [{
    name: 'histories',
    mbtiles: 'historical-tileset-2010-Q4.mbtiles',
    raw: false
  },{
    name: 'quarterly-snapshot',
    mbtiles: 'snapshot-2010-Q4.mbtiles',
    raw: false
  }]
  ...

In processing, the map script can then access attributes of both tilesets like this:

module.exports = function(data, tile, writeData, done) {  
  var quarterlySnapshots = data['quarterly-snapshot']
  var histories = data['histories']

For performance, the script builds a Map() object for each layer, indexing by zoom-15 quadkey. Next, the script iterates over the (up to 64) features of one tile and looks up the corresponding quadkey in the other tile to combine, compare, contrast, or calculate new attributes. Here is an example of combining and aggregating across two tilesets, writing out single features with attributes from both input tilesets:

features.forEach(function(feat){

  // Create a single export feature to represent each z15 tile:
  var exportFeature = {
    type      : 'Feature',
    tippecanoe: {minzoom: 10, maxzoom: 12}, // Only render this feature at these zoom levels.
    properties: {
      quadkey : feat.properties.quadkey, // The z15 quadkey
      // Initialize running totals so the += accumulation below starts from 0:
      editCount        : 0,
      newFeatureCount  : 0,
      newBuildings     : 0,
      newHighwayKM     : 0,
      editedHighwayKM  : 0,
      addedHighwayNames: 0,
      modHighwayNames  : 0
    },
    geometry: tilebelt.tileToGeoJSON(tilebelt.quadkeyToTile(feat.properties.quadkey)) // Reconstruct the polygon representing the zoom-15 tile.
  }

  // Look up values from the quarterly-snapshot tileset, e.g.:
  // exportFeature.properties.buildingCount_normAggArea      = <buildings on this zoom-15 tile, normalized by area>
  // exportFeature.properties.namedHighwayLength_normAggArea = <km of named highway on this zoom-15 tile, normalized by area>

  // Access the contributor history information for this zoom-15 tile.
  var tileHistory = histories.get(feat.properties.quadkey)
  var users = JSON.parse(tileHistory.users) // Get user array back from string

  // Sum attributes across users for simple data-driven styling
  users.forEach(function(user){
    exportFeature.properties.editCount         += user.editCount;
    exportFeature.properties.newFeatureCount   += user.newFeatureCount;
    exportFeature.properties.newBuildings      += user.newBuildings;
    exportFeature.properties.newHighwayKM      += user.newHighwayKM;
    exportFeature.properties.editedHighwayKM   += user.editedHighwayKM;
    exportFeature.properties.addedHighwayNames += user.addedHighwayNames;
    exportFeature.properties.modHighwayNames   += user.modHighwayNames;
  });
  writeData( JSON.stringify( exportFeature ) ) // Write out zoom-15 tile summary with information combined from both tilesets.
})

This script produces two types of output:

  1. (Up to 64) polygons per zoom-12 tile that represent the child zoom-15 tiles. Matching the editing-history format, these features contain per-editor statistics, such as kilometers of roads.

  2. A single zoom-12 summary of all the editing activity.

Processing 3: The Reduce Phase

When the summary zoom-12 tile is delivered to the reduce script, it is first written out to a file (z12.geojson) and then passed to a downscaling, aggregation function, described next.

Downscaling & Aggregation

Last year I made a series of similar visualizations of osm-qa-tiles. I only worked with the data at zoom 12 and kept the features very simple in hopes that tippecanoe could coalesce similar features to display at lower zooms. While this worked, there were a lot of visual artifacts in busy parts of the map, and the individual geometries had to be low detail to fit:

Last Year's Example

To address this, we rely heavily on downscaling and aggregation in the current workflow to successively bin and summarize child tiles into a single parent tile. Each zoom level is then written to disk separately and tiled only at specific zoom levels. Unfortunately, this requires holding tiles in memory. Fortunately, however, with a known quantity of (4) child tiles per parent zoom level, we can design the aggregation to continually free up memory once all child tiles of a given parent tile are processed.

Pseudocode:

tiles_at_zoom_11 = {
   'tile1' : [],
    ...
   'tileN' : []
 }

processTile( incomingTile /* tile at zoom 12 */ ){
  parentTile = incomingTile.getParent()
  tiles_at_zoom_11[parentTile].push(incomingTile)
  if (tiles_at_zoom_11[parentTile].length == 4){

    // Aggregate, sum, and average attributes
    // of the four zoom-12 tiles as appropriate
    // to create a single summary zoom-11 tile.

    // Write the aggregated, summarized zoom-11
    // tile to disk and delete it from memory.
  }
}

In reality, these are not done for every zoom level, but instead for zoom levels 12, 10, and 8.

To ensure this function works as designed, the order of tiles being processed by the entire tile-reduce job is modified to be a stream of tiles grouped at zoom 10. While we cannot ensure that tiles finish processing in a specific order, by controlling the order of the input stream, we can create reasonable expectations that groups of tiles finish processing at similar times and are therefore appropriately aggregated and subsequently freed from memory.

Processing 4: Tiling

The final result of the tile-reduce job(s) is a series of geojsonl files (line-delimited) representing different zoom levels. Using tippecanoe, we create a single tileset that is optimized for rendering in the browser. Recall that each geometry is a polygon representing a vector tile. The attributes of each feature are consistent among zoom levels to allow for data-driven styling in mapbox-gl.

tippecanoe -Z0 -z12 -Pf --no-duplication -b0 \
  --named-layer=z15:z15-res.geojsonl \
  --named-layer=z12:z12-res.geojsonl \
  --named-layer=z10:z10-res.geojsonl \
  --named-layer=z8:z8-res.geojsonl   \
  -o Output.mbtiles

Visualizing: Mapbox-GL

Loading the resulting tileset into MapboxGL allows for data driven styling across any of the calculated attributes. An interactive dashboard to explore the North America Tileset is available here: mapbox.github.io/osm-analysis-dashboard

Downscaling across Zoom Levels

This first gif shows the different layers (the results of the downscale & aggregation):

Since everything is aggregated per-quarter, we can easily compare between two quarters. This gif compares the number of active users in mid 2012 to mid 2017. Users active Per Quarter: 2012 vs. 2017

New Building Activity

Here is a high level overview of where buildings were being added to the map in the second quarter of both 2015 (left) and 2016 (right). We can see a few major building imports taking place between these times as well as more general coverage of the map.

New Building Activity: 2015 vs. 2016

If we zoom in on Los Angeles and visualize the “building density” as calculated in July 2015 and July 2016, we see the impact of LA building import at zoom 15 resolution:

LA Building Import

Users

The 2010 Haiti Earthquake:

This slider shows the number users active in Haiti during the last quarter of 2009 (just before the earthquake) and then the first quarter of 2010 (when the earthquake struck): Users active during the Haiti Earthquake

We can see the work done by comparing the building density of the map at the end of 2009 and then at the end of the first quarter of 2010:

Building Density increase in Haiti (Quarter 1: 2010)

Ultimately, the number of (distinct) contributors active to date in North America has grown impressively in the last 5 years. This animation shows the difference between mid 2012 and mid 2017:

5 Year Growth

Looking Forward: Geometric Histories

So far, when discussing full editing history, we've only been talking about the history of a map object as told through changes to its tags over time. This is a decent proxy for total editing, and it can certainly help us understand how objects grow and change over time. The geometries of these objects, however, also change over time. Whether it's the release of better satellite imagery that prompts a contributor to re-align or enhance a feature, or just generally squaring up building outlines, a big part of editing OpenStreetMap involves changing existing geometries.

Many times, geometry changes to objects like roads or buildings do not propagate to the feature itself. That is, if only the nodes underlying a way are changed, the version of the way is not incremented. Learning that an object has had a geometry change requires a more involved approach, something we are currently exploring in addition to just the tag history.

With full geometry history, we could compare individual objects at two points in time. Here is an example from a proof-of-concept for historic geometries. Note that many of the buildings initially in red "square up" when they turn turquoise. These are geometry changes after the 2015 Nepal Earthquake: the buildings were initially created non-square, and a little while later another mapper came through and updated the geometries:

5 Year Growth

Location: The Hill, Boulder, Boulder County, Colorado, 80802, USA

Analysis Walk-thru: How many contributors are editing in each Country?

Posted by Jennings Anderson on 29 June 2017 in English (English)

How many contributors are active in each Country?

I recently put together this visualization of users editing per Country, along with some other basic statistics. This analysis is done with tile-reduce and osm-qa-tiles. I'm sharing my code and the procedure here.

Users by Country

This interactive map depicts the number of contributors editing in each Country. The Country geometries are in a fill-extrusion layer, allowing for 3D interaction. Both the heights of the Countries and the color scale correspond to the number of editors. Additional Country-level statistics such as the number of buildings and kilometers of roads are also computed.

Procedure

These numbers are all calculated with OSM-QA-Tiles and tile-reduce. I started with the current planet tiles and used this Countries geojson file for the Country geometries to act as boundaries.

Starting tile reduce:

tileReduce({
  map: path.join(__dirname, '/map-user-count.js'),
  sources: [{name: 'osm', mbtiles: path.join("latest.planet.mbtiles"), raw: false}],
  geojson: country.geometry,
  zoom: 12
})

In this case, country is a geojson feature from the countries.geo.json file. I ran tile-reduce separately for each Country in the file, creating individual geojson files per Country.

The map function:

// map-user-count.js — runs once per tile
var distance = require('@turf/line-distance')

module.exports = function(data, tile, writeData, done) {
  var layer = data.osm.osm;

  var buildings = 0;
  var hwy_km    = 0;
  var users = [];

  layer.features.forEach(function(feat) {
    // Count buildings
    if (feat.properties.building) buildings++;

    // Track distinct user ids seen on this tile
    if (users.indexOf(feat.properties['@uid']) < 0) {
      users.push(feat.properties['@uid']);
    }

    // Sum road length in kilometers
    if (feat.properties.highway && feat.geometry.type === "LineString") {
      hwy_km += distance(feat, 'kilometers');
    }
  });

  done(null, {'users': users, 'hwy_km': hwy_km, 'buildings': buildings});
};

The map function runs on every tile and returns a single object with the summary stats for that tile. For every object on the tile, the script first checks whether it is a building and increments the building counter accordingly. Next, it checks whether the user who made this edit has been recorded yet for this tile; if not, it adds their user id to the list. Finally, the script checks whether the object has the highway tag and is indeed a LineString. If so, it uses turfjs to calculate the length of the highway and adds it to a running counter of total road kilometers on the tile.

After doing this for all objects on the tile (Nodes and Ways in the current osm-qa-tiles), it returns an object with an array of user ids and total counts for both road kilometers and buildings.

Back in the main script, the instructions for reduce are as follows:

.on('reduce', function(res) {
  users = users.concat(res.users)
  buildings += res.buildings;
  hwy_km += res.hwy_km;
})

The list of unique users active on any given tile is concatenated onto the users array, which tracks users across all tiles. Users who have edited on more than one tile will appear multiple times in this array; we’ll deal with this later.
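An alternative to concatenating and deduplicating at the end (this is a sketch of a design variation, not what the original script does) is to accumulate ids in a Set, which stays duplicate-free as tiles stream in:

```javascript
// Sketch: accumulate distinct user ids across tiles with a Set instead of
// concatenating arrays and running _.uniq() once at the end.
var allUsers = new Set();

function reduceTile(result) {
  result.users.forEach(function(uid) { allUsers.add(uid); });
}

// e.g. two tiles sharing user 42:
reduceTile({users: [42, 7]});
reduceTile({users: [42, 99]});
// allUsers now holds three distinct ids: 42, 7, 99
```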

The running building and kilometers of road counts are then updated with the totals from each tile.

Ultimately, the last stage of the main script writes the results to a file.

.on('end', function() {
  // requires at the top of the script: var fs = require('fs'), _ = require('lodash')
  var numUsers = _.uniq(users).length;

  fs.writeFile('/data/countries/' + country.id + '.geojson', JSON.stringify({
    type: "Feature",
    geometry: country.geometry,
    properties: {
      uCount: numUsers,
      hwy_km: hwy_km,
      buildings: buildings,
      name: country.properties.name,
      id: country.id
    }
  }), function(err) { if (err) throw err; });
});

Once all tiles have been processed, this function uses lodash to remove all duplicate entries from the users array. The length of the deduplicated array is the number of distinct users with visible edits on any tile in this country.

Using JSON.stringify and the original country geometry that served as the bounds for tile-reduce, this function creates a new geojson file for every country with a properties object containing all of the calculated values.

Visualizing

Once the individual country geojson files are created, the following Python code iterates through the directory and creates a single geojson FeatureCollection with each country as a feature (the same structure as the countries.geo.json file we started with, but now with more properties).

import json
import os

countries = []

for file in os.listdir('/data/countries'):
  country = json.load(open('/data/countries/' + file))
  countries.append(country)

json.dump({"type": "FeatureCollection",
           "features": countries}, open('/data/www/countries.geojson', 'w'))

Once this single geojson FeatureCollection is created, I uploaded it to Mapbox and used mapbox-gl-js with fill-extrusion layers and data-driven styling to make the countries with more contributors appear taller and redder, while those with fewer contributors are shorter and closer to yellow/white in color.

Here is a sample of that code:

map.addSource('country-data', {
  'type': 'vector',
  'url': 'mapbox://jenningsanderson.b7rpo0sf'
})

map.addLayer({
  'id': "country-layer",
  'type': "fill-extrusion",
  'source': 'country-data',
  'source-layer': 'countries_1-1l5fxc',
  'paint': {
    'fill-extrusion-color': {
      'property':'uCount',
      'stops':[
        [10, 'white'],
        [100, 'yellow'],
        [1000, 'orange'],
        [10000, 'orangered'],
        [50000, 'red'],
        [100000, 'maroon']
      ]
    },
    'fill-extrusion-opacity': 0.8,
    'fill-extrusion-base': 0,
    'fill-extrusion-height': {
      'property': 'uCount',
      'stops': [
        [10, 6],
        [100, 60],
        [1000, 600],
        [10000, 6000],
        [50000, 30000],
        [100000, 65000]
      ]
    }
  }
})

The current implementation uses two visual channels (height and color) for the user count, which is redundant. The data-driven styling could easily be modified to represent the number of buildings or kilometers of roads instead by changing the stops array and setting the property value to buildings or hwy_km.
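For example, a road-kilometers variant of the color ramp might look like the following. The stop values here are made up for the sketch and would need tuning against the real hwy_km distribution:

```javascript
// Illustrative variant: style fill-extrusion color by hwy_km instead of
// uCount. Stop thresholds are placeholders, not tuned values.
var hwyPaint = {
  'fill-extrusion-color': {
    'property': 'hwy_km',
    'stops': [
      [100, 'white'],
      [1000, 'yellow'],
      [10000, 'orange'],
      [100000, 'red'],
      [1000000, 'maroon']
    ]
  },
  'fill-extrusion-opacity': 0.8
};
```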

To show more information about a country on click (and a pointer cursor on hover), the following handlers are added:

map.on('mousemove', function(e){
  var features = map.queryRenderedFeatures(e.point, {layers: ['country-layer']})
  map.getCanvas().style.cursor = (features.length > 0) ? 'pointer' : '';
});

map.on('click', function(e){
  var features = map.queryRenderedFeatures(e.point, {layers: ['country-layer']})

  if (!features.length) return;
  var props = features[0].properties

  new mapboxgl.Popup()
    .setLngLat(e.lngLat)
    .setHTML(`<table>
      <tr><td>Country</td><td>${props.name}</td></tr>
      <tr><td>ShortCode</td><td>${props.id}</td></tr>
      <tr><td>Users</td><td>${props.uCount}</td></tr>
      <tr><td>Highways</td><td>${props.hwy_km.toFixed(2)} km</td></tr>
      <tr><td>Buildings</td><td>${props.buildings}</td></tr></table>`)
    .addTo(map);
});

Much of this code is based on these examples.

Location: The Hill, Boulder, Boulder County, Colorado, 80802, USA

OSM Contributor Analysis - Entry 2: Annual Summaries of User Edits

Posted by Jennings Anderson on 6 July 2016 in English (English)

Over the past two weeks I have been trying out some new methods to uncover user focus on the map. Investigating this idea of user focus includes questions like:

  • Are there areas where a specific user edits more frequently or regularly?
  • Are there multiple contributors who focus on the same areas?
  • Do these activities correlate to “map gardening”?

To answer these questions, I’ve put together an interactive map, similar to How Did You Contribute to OSM by Pascal Neis, but with the addition of being able to compare multiple users through the years.

Check it out Here: OSM Annual User Summary Map

Please Note: Requires recent versions of Google Chrome (recommended) or Firefox (>=35).

How does it work?

Using the annual snapshot osm-qa-tiles, I have calculated the following statistics for each user’s visible edits at the end of each year, on a per-tile basis:

  • # of total edits
  • # of buildings
  • # of amenities
  • kilometers of roads

With this information, we can look at areas of specific focus for a given user by applying minimum thresholds. For example, here are most of the tiles edited by seven different users in 2011: 7 Users No Filter

When we increase the threshold for the minimum percent of edits, we see that although this particular user has thousands of edits all over the country, 70% of his edits are on this one tile! 7 Users Filtered
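That percent-of-edits filter can be sketched as follows, assuming per-tile stats for one user in the form {tile, edits}. This data shape is an assumption for illustration, not the map's actual internal format:

```javascript
// Sketch: keep only tiles that hold at least `minShare` (a fraction, e.g.
// 0.7) of a user's total edits.
function focusTiles(tileStats, minShare) {
  var total = tileStats.reduce(function(sum, t) { return sum + t.edits; }, 0);
  return tileStats.filter(function(t) { return t.edits / total >= minShare; });
}

// A user with 1000 edits, 700 of them on one tile:
var stats = [
  {tile: '12/850/1550', edits: 700},
  {tile: '12/851/1550', edits: 200},
  {tile: '12/850/1551', edits: 100}
];
// focusTiles(stats, 0.7) keeps only the first tile
```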

Just by playing around with this map, it seems that even users with millions of edits have a handful of tiles where they are significantly more active. Of course this raises the question, “is this the user’s hometown?” or, perhaps more importantly, “is this user contributing local knowledge to these particular tiles?”

When you zoom in close, you can click on any given tile and get a list of the top 100 contributors on that tile for the year. Clicking on any user in that list will load their edits onto the map. List of Users

What’s Next?

This is just the first step of many to come in doing community detection in OSM through social network analysis!

More to come! Jennings

Location: The Hill, Boulder, Boulder County, Colorado, 80802, USA

OpenStreetMap Data Analysis: Entry 1

Posted by Jennings Anderson on 20 June 2016 in English (English)

Howdy OpenStreetMap, I am excited to share that I am working as a Research Fellow with Mapbox this summer! As a research fellow, I am looking to better understand contributions to OSM.

For my first project, I have been using the tile-reduce framework to summarize per-tile visible edits from the Historical OSM-QA-Tiles. These historical tiles are a snapshot of what the map looked like at the time listed on the link.

With this annual resolution, we can visualize the edits (those edits that were visible at the end of that year) that happened on each tile. So far, I’ve summarized them as a) number of editors, b) number of objects, and c) recency of the latest edit (relative to that year).

The osm-qa-tiles are all generated at zoom level 12, which divides the world into 5 million+ tiles. Some tiles have few objects while others have more than ten thousand.
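For context, zoom 12 means 2^12 = 4096 tiles along each axis, or roughly 16.8 million possible tiles worldwide; the 5 million+ figure presumably counts only tiles that actually contain data. The standard slippy-map formula (not specific to osm-qa-tiles) tells you which tile contains a given coordinate:

```javascript
// Standard slippy-map tile math: which tile contains a given lon/lat at a
// given zoom? x comes from a linear split of longitude; y from the Web
// Mercator projection of latitude.
function lonLatToTile(lon, lat, zoom) {
  var n = Math.pow(2, zoom);
  var x = Math.floor((lon + 180) / 360 * n);
  var latRad = lat * Math.PI / 180;
  var y = Math.floor(
    (1 - Math.log(Math.tan(latRad) + 1 / Math.cos(latRad)) / Math.PI) / 2 * n
  );
  return {x: x, y: y};
}

// Boulder, CO (approx. lon -105.27, lat 40.01) falls on tile 12/850/1550
```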

So far I have created two interactive maps to investigate OpenStreetMap editing behavior at this tile-level analysis:

1. Editor Density (Number of editors active on a tile)

2. Edit Recency (Time since last edit on the tile)

Editor Density

This map highlights tiles where multiple editors have been active. The most active editors in most cases are automated bots, especially in the more recent years. For best results, moving the slider in the bottom left for Minimum Users Per Tile to 2 or 3 will exclude most of these automated edits.

Examples

2007: European Hotspots

By increasing the minimum object and minimum user thresholds, areas of heavy editing activity pop out: 2007 european hotspots

2007: US Tiger Import - Automated Edits

This image of the activity in the US in 2007 has no threshold on the number of objects or users per tile, so you can see all of the tiles affected by the 2007 TIGER import. If you increase the threshold, the picture changes dramatically: tiger import

Edit Recency

This map shows the recency of edits to a tile, relative to the year of analysis. At first it is surprising how many tiles were edited at the very end of the year, but that is most likely a function of automated bots. Again, if you adjust the threshold for the number of editors or objects per tile, interesting patterns pop out across the world where users may have been active early in the year and less active later. The 2010 Haiti Earthquake is a good example, as it occurred in January of 2010.

2007: The stages of the Tiger Import

If we view by latest edit date, relative to the year, we see the state-by-state import in the US:

2008: North Eastern Hemisphere

2008 recency

More to come! -Jennings

Location: Logan Circle/Shaw, Washington, District of Columbia, USA