
HOT Summit & State of the Map 2019

Posted by Jennings Anderson on 26 September 2019 in English (English)

This past week, the 2019 HOT Summit was followed by State of the Map in Heidelberg, Germany. First, a big thank you and congratulations on a job well done to all of the organizing committee and folks in Heidelberg that made these events possible!

I had the opportunity to both lead a workshop at the HOT Summit on Thursday and participate in the academic track at State of the Map on Sunday. I’m writing this post to share a few resources and results from these talks, compiled all in one place.

1. HOT Workshop: Hands On Experience Extracting Meaningful OSM Data by Using Amazon Athena with AWS Public Datasets

This workshop was designed to show the analytical power of Amazon Athena with a large dataset like OSM. The workshop description was as follows:

Learn how to use Amazon Athena with AWS Public Datasets to query large amounts of OSM data and extract meaningful results. We will explore the maintenance behavior of contributors after HOT mapping activations and learn how the map gets maintained, what happens after validation, whether the data grows stale, and whether a local community emerges. This 200-level workshop is hands-on and requires familiarity with SQL. Familiarity with data science tools such as Python and Jupyter Notebooks is helpful, but not required. Sample code will be made available at the start so that participants can modify it and ask their own questions of the data.

Grace Kitzmiller (AWS) & Jennings Anderson (University of Colorado Boulder)

The workshop included 10 prepared Jupyter Notebooks that contained all of the code to parse the results of an Athena query and generate a number of graphs and maps, such as the following graph which shows the cumulative number of users who have edited in Tacloban, Philippines.


This shows that since 2012, there has been steady growth (a fairly consistent slope) in the number of editors; the overall count, however, jumped by a nearly 400-person 'step' as a result of the disaster mapping for Typhoon Haiyan.
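In sketch form, a cumulative-editor curve like this can be computed from (date, user) pairs, e.g. parsed out of an Athena result set. The function and the sample data here are illustrative, not the workshop's actual code:

```python
def cumulative_editors(edits):
    """Given (date, user) pairs, return [(date, cumulative unique editors)]."""
    seen = set()
    curve = {}
    for date, user in sorted(edits):
        seen.add(user)
        curve[date] = len(seen)  # distinct editors seen up to this date
    return list(curve.items())

# A hypothetical stream of edits around a disaster activation:
edits = [
    ("2013-11-01", "alice"),
    ("2013-11-01", "bob"),
    ("2013-11-08", "alice"),  # repeat editor: the curve stays flat
    ("2013-11-08", "carol"),  # new disaster mappers create the 'step'
]
print(cumulative_editors(edits))  # [('2013-11-01', 2), ('2013-11-08', 3)]
```

Plotting the resulting pairs with any charting library produces the step curve shown above.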

As another example, here is a visualization built with KeplerGL showing the impact on the map in Puerto Rico from disaster mapping for Hurricane Maria (a sample of 10,000 edits):

Sample of edits in NW Puerto Rico

These are just two examples of the many figures and maps featured in the workshop that can be generated for most of the regions where humanitarian mapping has occurred.

You can find detailed instructions on how to recreate this workshop and run the material locally here.




2. SOTM Presentation: Corporate Editors in the Evolving Landscape of OpenStreetMap: A Close Investigation of the Impact to the Map and the Community

This marked the second year of the Academic Track at State of the Map. Thanks to the hard work of the OSM Science community, the proceedings of this track have been published here. Included is an abstract discussing my latest research on organized editing—specifically corporate editing—in the map. You can watch the full presentation here.

Visual Abstract

Last spring, my coauthors (Dipto Sarkar and Leysia Palen) and I wrote an article investigating the quantities and characteristics of corporate editing teams in OpenStreetMap. The visualization above shows an aggregate summary of this activity.

My current research investigates more deeply the impact of corporate editors (and other organized editing groups) and their interactions with other mappers. This requires examining the complete history of the map and breaking it down into individual edits, as visualized below:

Kaart editing in Jamaica

Edits from non-paid editors (pink) and paid editors, primarily Kaart (green & yellow).

Or this visualization of Facebook’s activity in Thailand:

Facebook Editing in Thailand

If we zoom in on a particular area, we can see that Facebook’s edits are filling in the map between two previously mapped areas (in pink).

Image of side-by-side editing in Thailand

This graph shows consistent editing activity from Facebook in 2018, followed by a few major events from non-paid editors in Eastern Thailand. This may lend credence to the notion of corporate map-seeding, where data teams start the map in an area and non-corporate editors then fill it in.

Graph of edits in Thailand

Here’s another (quite different) example showing how Amazon Logistics is editing the map in Dallas, Texas. Presumably, they are adding valuable, ground-truthed, navigation-oriented data from their delivery network to the map:

Amazon in Dallas

There are a few more examples in the presentation that I talk through, identifying potential interaction patterns between organized editing groups and other mappers. Please leave a comment on this post if you have any questions.


Extra: Preparing for OSM Geo Week

OSM Geography Awareness Week will be here before we know it! I did not present this at the conference, but find it interesting nonetheless. This is a visualization showing the impact of this event, derived from OSM changesets:

OSM Geo Week

This particular visualization technique is a recreation of results from this paper by Daniel Bégin et al.

How to read this:

  • The yellow along the steep diagonal represents all one-time contributors.
  • Faint vertical lines represent geoweeks that resulted in mappers sticking around.
  • Horizontal lines represent geoweeks where mappers who had previously edited OSM made their last edit during a geoweek.
  • The purple at the top represents mappers with a significant amount of editing experience who edited during an osmgeoweek and continue to edit frequently.

Thanks for reading, please leave a comment with any questions you may have.

Location: Neuenheimer Feld, Neuenheim, Heidelberg, Regierungsbezirk Karlsruhe, Baden-Württemberg, 69120, Germany

PostCards from the Edge: A Tour of OSM Data Analyses + Visualizations (SOTMUS 2019)

Posted by Jennings Anderson on 19 September 2019 in English (English)

At State of the Map US a few weeks ago in Minneapolis, Minnesota, Seth and I presented a session titled:

PostCards from the Edge: A Tour of OSM Data Analyses + Visualizations

The recording and description of the presentation are available here.

Our goal was to curate a collection of OSM data visualizations from over the years that tell the story of OSM’s evolution, both as a map and a community, as well as highlight a few innovative data visualizations that show new ways to interact with OSM data to learn more about an area of the map.

We produced this spreadsheet (same as the table below) with links and author information for each of the visualizations that we showed and discussed in the talk. Since many of them are interactive, we chose to link to the original source:

| Visualization | Author | Year |
| --- | --- | --- |
| 2 weeks of bicycle courier data in London | Tom Carden / eCourier | 2005 |
| OSM Node Density | Martin Raifer | 2013–present |
| Man-made vs. Natural feature density | Jennings Anderson | 2016 |
| Object Density | Jennings Anderson | 2019 |
| Non-diverse Mapping Density | Jennings Anderson | 2019 |
| Haiti Earthquake Response | Mikel Maron | 2010 |
| Edits with HOT | Jennings Anderson | 2019 |
| HOT Project Activity Timeline | Martin Dittus | 2015 |
| The life cycle of contributors in collaborative online communities—The case of OpenStreetMap | Daniel Bégin et al. | 2018 |
| Timespan of OSM Contributor Engagement | Jennings Anderson | 2019 |
| Cartographers of North Korea | Wonyoung So | 2019 |
| Pipelines | Tim Meko, Washington Post | 2016 |
| City Street Network Orientations | Geoff Boeing | 2018 |
| OpenStreetMap past(s), OpenStreetMap future(s) | Alan McConchie | 2016 |
| Optimal Routes by Car from the Geographic Center of the Contiguous United States to all Counties | Topi Tjukanov | 2017 |

A few of the visualizations were from my OSM research work, so I’m compiling them here:

Man Made & Natural Features in OSM

Man made and natural features in OSM

Made with tile-reduce & datamaps, this rendering of OSM data shows natural features (such as ways tagged as natural=coastline) in blue and all other features in orange. Do you know what those large orange rectangles in the Barents and Kara Seas are? View them on OSM.

Object Densities at Zoom level 12

OSM object densities

Also made with tile-reduce, this visualization shows the density of objects in OSM, calculated as the number of objects in each zoom-level-12 osm-qa-tile.* At first glance, this figure suggests that few parts of the map have no data. This is misleading, however: the color scheme is diverging, and areas that appear blue or purple are effectively unmapped, containing only 0–100 objects across areas of more than 60 square kilometers. In reality, these purple dots show us where we know something is there (such as the name of a town, a road, a river, etc.), but it has yet to be mapped more completely.

*Zoom-level-12 tiles each cover about the area of a small city. Their area decreases at higher latitudes, so normalizing against this would absolve cartographic sin. However, having done this and seen little effect on the message being conveyed here, I present the raw, non-normalized numbers.
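For reference, the east-west extent of a Web Mercator tile shrinks with the cosine of latitude. A quick sketch of the (spherical-approximation) arithmetic:

```python
import math

EQUATORIAL_CIRCUMFERENCE_KM = 40075.017

def tile_width_km(zoom, lat_deg):
    """Approximate east-west width of a Web Mercator tile at a given latitude."""
    return EQUATORIAL_CIRCUMFERENCE_KM * math.cos(math.radians(lat_deg)) / 2 ** zoom

# A zoom-12 tile spans ~9.8 km at the equator but only ~4.9 km at 60° latitude,
# so raw per-tile counts understate the true density toward the poles:
for lat in (0, 45, 60):
    print(f"zoom-12 tile width at {lat} degrees: {tile_width_km(12, lat):.2f} km")
```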

Object Densities Broken Down by Contributor Count

Less than 10 mappers since 2018

More than 10 mappers since 2018

These two visualizations show the same density counts as the previous map, but include only tiles where more than (or fewer than) 10 mappers have been active since 2018-01-01. For many parts of the world, these resemble a population density map (as many maps do). The takeaway, however, is that while there may not be a lot of contributors active everywhere, there are at least a few contributors active almost everywhere.

Contributor Lifespans

These charts are recreations of a chart first presented in Bégin et al. 2018. They are all derived from data obtained by querying the history of all OSM changesets (just under 70M) in the OSM public dataset on Amazon AWS with Amazon Athena.

Both axes represent time and each dot represents one user. Users that fall along the x = y diagonal are one-time contributors, meaning their first edit and their last edit occurred on the same day. The vertical lines that begin to appear represent times when many users made their first edit (x-axis), some of whom continued to contribute for days, weeks, months, and years, creating the line.

Users along the top are still active, meaning their most recent edit in OSM was near the time we downloaded the data. The thick line across the top means that many users frequently edit the map, regardless of when they made their first edit.
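Deriving the dots is a simple reduction over changeset metadata: each user maps to their (first edit, last edit) pair. A sketch with illustrative field names:

```python
def contributor_lifespans(changesets):
    """Reduce (user, timestamp) changeset records to {user: (first_edit, last_edit)}."""
    spans = {}
    for user, ts in changesets:
        first, last = spans.get(user, (ts, ts))
        spans[user] = (min(first, ts), max(last, ts))
    return spans

changesets = [
    ("alice", "2015-03-01"),
    ("bob",   "2015-03-01"),
    ("alice", "2019-09-01"),  # still active near the data-download date
]
spans = contributor_lifespans(changesets)
# bob sits on the x = y diagonal: first edit == last edit
print(spans)
```

Scattering first-edit dates on the x-axis against last-edit dates on the y-axis reproduces the chart structure described above.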

All contributors

Contributor Lifespans

Contributors with at least 1 changeset with the text osmgeoweek

OSM Geo Week

Contributors whose first edit was in 2015.

Contributors whose first edit was in 2015

The impact of HOT editing on the growth of OSM

Edits associated with HOT and not

This figure shows the number of changes to the map per day, as calculated from all of the changesets in OSM. The area between the blue and orange lines represents edits in changesets that include the term “hotosm” in the comment.
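A sketch of the underlying aggregation, assuming changeset metadata is available as (day, comment, num_changes) tuples (illustrative shapes, not the exact Athena output):

```python
from collections import defaultdict

def daily_edit_totals(changesets, keyword="hotosm"):
    """Return {day: (total_changes, keyword_changes)} from changeset metadata."""
    totals = defaultdict(lambda: [0, 0])
    for day, comment, num_changes in changesets:
        totals[day][0] += num_changes
        if keyword in comment.lower():
            totals[day][1] += num_changes
    return {day: tuple(pair) for day, pair in totals.items()}

changesets = [
    ("2017-09-21", "#hotosm-project-3409 buildings", 120),
    ("2017-09-21", "fixed road names", 30),
    ("2017-09-22", "added some POIs", 10),
]
print(daily_edit_totals(changesets))
# {'2017-09-21': (150, 120), '2017-09-22': (10, 0)}
```

The gap between the two daily series is the HOT-attributed editing volume shown in the figure.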

State of the Map US 2018: OpenStreetMap Data Analysis Workshop

Posted by Jennings Anderson on 5 December 2018 in English (English)

(This is a description of a workshop Seth Fitzsimmons and I put on at State of the Map US 2018 in Detroit, Michigan. Cross-posting from this repository)

Workshop: October 2018

Workshop Abstract

With an overflowing Birds-of-a-Feather session on “OSM Data Analysis” the past few years at State of the Map US, we’d like to leave the nest as a flock. Many SotM-US attendees build and maintain various OSM data analysis systems, many of which have been and will be presented in independent sessions. Further, better analysis systems have yet to be built, and OSM analysis discussions often end with what is left to be built and how it can be done collaboratively. Our goal is to bring the data-analysis back into the discussion through an interactive workshop. Utilizing web-based interactive computation notebooks such as Zeppelin and Jupyter, we will step through the computation and visualization of various OpenStreetMap metrics.

tl;dr:

We skip the messy data-wrangling parts of OSM data analysis by pre-processing a number of datasets with osm-wayback and osmesa. This creates a series of CSV files with editing histories for a variety of US cities which workshop participants can immediately load into example analysis notebooks to quickly visualize OSM edits without ever having to touch raw OSM data.

1. Background

OpenStreetMap is more than an open map of the world: it is the cumulative product of billions of edits by nearly 1M active contributors (and another 4M registered users). Each object on the map can be edited multiple times. Each time the major attributes of an object are changed in OSM, the version number is incremented. To get a general idea of how many major changes exist in the current map, we can count the version numbers for every object in the latest osm-qa-tiles. This isn’t every single object in OSM, but includes nearly all roads, POIs, and buildings.

 Histogram of Object Versions from OSM-QA-Tiles

OSM object versions by type. 475M objects in OSM have been edited only once, meaning they were created and haven’t since been edited in a major way. However, more than 200M have been edited more than once. Note: less than 10% of these edits are from bots or imports.
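The histogram above amounts to counting the `@version` property across all features in the tiles. A minimal sketch, with a toy feature list standing in for real osm-qa-tiles features:

```python
from collections import Counter

def version_histogram(features):
    """Count how many objects sit at each version number."""
    return Counter(f["properties"]["@version"] for f in features)

# Toy features mimicking the @version property carried by osm-qa-tiles:
features = [
    {"properties": {"@version": 1}},
    {"properties": {"@version": 1}},
    {"properties": {"@version": 3}},
]
hist = version_histogram(features)
print(hist[1])  # 2: objects created and never majorly edited since
```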

Furthermore, when a contributor edits the map, the effect that their edit has depends on the type of OSM element that was modified. Moving nodes may also affect the geometry of ways and relations (lines and polygons) without those elements needing to be touched. Thus, a contributor’s edits may have an indirect effect elsewhere (we track these as “minor versions”). Conversely, when editing a way or relation’s tags, no geometries are modified, so counts within defined geographical boundaries often don’t incorporate these edits. Therefore, to better understand the evolution of the map, we need analysis tools that can expose and account for these rich and nuanced editing histories. There is a plethora of community-maintained tools to help parse and process the massive OSM database, though none of them currently handles the full history and the relationships between every object on the map. Questions such as “how many contributors have been active in this particular area?” are therefore very difficult to answer at scale. As we should expect, this number also varies drastically around the globe:

Map of areas with more than 10 active contributors in 2015 (source). The euro-centric editing focus doesn’t surprise us, but this map also shows another area with an unprecedented number of active contributors in 2015: Nepal. This was in response to the April 2015 Nepal Earthquake. This is just one of many examples of the OSM editing history being situational, complex, and often difficult to conceptualize at scale.

Putting on a Workshop

The purpose of this workshop was two-fold. First, we wanted to take the OSM data analysis discussion past “how do we best handle the data?” to actual data analysis. The complicated and often messy editing history of objects in OSM makes simply transforming the data into something readable by common data-science tools an exceedingly difficult task (described next). Second, we hoped that providing such an environment to explore the data would in turn generate more questions around the data: What is it that people want to measure? What are the insightful analytics?

2. Preparing the Data: What is Available?

This was the most hand-wavey part of the workshop, and intentionally so. Seth and I have been tackling the problems of historical OpenStreetMap data representation independently for a few years now. Preparing for this workshop was one of the first times we had a chance to compare some of the numbers produced by OSMesa and OSM-Wayback, the respective full-history analysis infrastructures that we’re building. As expected, there were some differences in our results based on how we count objects and measure history, so this was a fantastic opportunity to sit down, talk through these differences, and validate our measures. In short, there are many ways that people can edit the map, and it’s important to distinguish between the following edit types:

  1. Creating a new object
  2. Slightly editing an existing object’s geometry (move the nodes around in a way)
  3. Majorly editing an existing object’s geometry (delete or add nodes in a way)
  4. Edit an existing object’s attributes (tag changes)
  5. Delete an existing object

All but edit type 2 result in an increase in the version number of the OSM object. This makes identifying the edit easier at the OSM element level because the version number reflects the number of times the object has been edited. Edit type 2, however (a slight change to an object’s geometry), is a common edit that is often overlooked because it is not reflected in the version number. Moving the corners of a building to “square it up” or correcting a road to align better with aerial imagery are just two examples of edit type 2. We call these changes minor versions. To account for these edits, we add a metadata field called minor version that is 0 for newly created objects and > 0 for any number of minor changes between major versions. When another major version is created, the minor version is reset to 0.
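The version bookkeeping described above can be sketched as follows (the edit-type labels are illustrative):

```python
MAJOR = {"create", "major_geometry", "tag_change", "delete"}  # edit types 1, 3, 4, 5
MINOR = {"minor_geometry"}                                    # edit type 2

def apply_edits(edit_types):
    """Replay edits, returning the (version, minor_version) pair after each one."""
    version, minor = 0, 0
    history = []
    for edit in edit_types:
        if edit in MAJOR:
            version += 1
            minor = 0  # minor version resets with every new major version
        elif edit in MINOR:
            minor += 1  # geometry nudge: the version number does not change
        history.append((version, minor))
    return history

# Create a building, square up its corners twice, then fix a tag:
print(apply_edits(["create", "minor_geometry", "minor_geometry", "tag_change"]))
# [(1, 0), (1, 1), (1, 2), (2, 0)]
```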

Quantifying Edits

Each of the above edit types refers to a single map object. In this context, we consider map objects to be OSM objects with some level of descriptive attributes. As opposed to OSM elements (nodes, ways, or relations), an object is the logical representation of a real-world object: a road, building, or POI. This is an important distinction to make when talking about OSM data because the relationship is not 1-to-1: OSM elements do not individually represent map objects. A rectangular building object, for example, comprises at minimum 5 OSM elements: at least 4 nodes (likely untagged) that define the corners, and the way that references these nodes with a building=* attribute. An edit to any one of these elements is then considered an edit to the building.

This may seem obvious when thinking about editing OpenStreetMap and how the map gets made, but reconstructing this version of OSM editing history from the database is difficult and largely remains an unsolved (unimplemented) problem at the global scale: i.e., there does not yet exist a single (public, production) API end-point to reconstruct the history of any arbitrary object with regards to all 5 types of edits mentioned above.

Working towards such an API, another important infrastructure to mention here is the ohsome project, built with the oshdb. This is another approach to working with OSM full-history data that can ingest full-history files and handle each of these edit types.

Making the Data Available

For this workshop then, we pre-computed a number of statistics for various cities that describe the historical OSM editing record at per-edit, per-changeset, and per-user granularities (further described below).

3. Interactive Analysis Environment

Jupyter notebooks allowed us to host a single analysis environment for the workshop so that participants did not have to install or run any analysis software on their own machines. This saved a lot of time and let participants jump right into analysis. For the workshop, we used a single machine operated by ChameleonCloud.org and funded by the National Science Foundation to host the environment. I hope to provide this type of service again at other conferences or workshops; please get in touch if you are interested in putting on a similar workshop, and I can see whether hosting such an environment for a short duration is possible!

Otherwise, it is possible to recreate the analysis environment locally with the following steps:

  1. Download Jupyter
  2. Clone this repository: jenningsanderson/sotmus-analysis
  3. Run Jupyter and navigate to sotmus-analysis/analysis/ for the notebook examples.

4. Available Notebooks & Datasets

We pre-processed data for a variety of regions with the following resolution:

1. Per User Stats

A comprehensive summary of editing statistics (new buildings, edited buildings, km of new roads, edited roads, number of sidewalks, etc.; see the full list here), totaled for each user active in the area of interest. This dataset is ideal for comparing editing activity among users. Who has edited the most? Who is creating the most buildings? It is great for building leaderboards and getting a general idea of how many users are active in an area and what the distribution of work per user looks like.

2. Per Changeset Stats

The same editing statistics as above (see the full list of columns here), but at higher resolution: grouped by changeset. A changeset is a very logical unit of analysis for looking at the evolution of the map in a given area; since each changeset can only come from one user, it is the next level of detail down from user summaries. Because changeset IDs are sequential, this is also a great dataset for time-series analysis. Unfortunately, due to a lack of changeset extracts for the selected regions (time constraints, fun!), the OSMesa-generated roll-ups do not include actual timestamps. This caused some confusion for a group looking at Chicago: with changeset ID as the x-axis, the visualization of their building import did not show the condensed timeframe during which many of those changesets were made.

3. Per Edit Stats

This dataset records each individual edit to the map. This dataset is best for understanding exactly what changed on the map with each edit. Each edit tracks the tags changed as well as the geometry changes (if any). This dataset is significantly larger than the other two.

What cities are available?

Detroit is currently available in this repository. See this list in the readme for a handful of North American cities available for download.

5. Example Notebooks

  1. Per User Stats
  2. Per Changeset Stats
  3. Per Edit Stats

Example heatmap from building edits in Detroit

If you’re interested in more of this type of analysis, directions on setting up this analysis environment locally can be found in this repository. Furthermore, much of this is my current dissertation work, so I’m always happy to chat more about it. Thanks!

Location: The Hill, Boulder, Boulder County, Colorado, 80802, United States

Watching the Map Grow: State of the Map US Presentation

Posted by Jennings Anderson on 27 November 2017 in English (English)

SOTMUS Logo

At State of the Map US last month, I presented my latest OSM analysis work. This is work done in collaboration between the University of Colorado Boulder and Mapbox. You can watch the whole presentation here or read on for a summary followed by extra details on the methods with some code examples.

OpenStreetMap is Constantly Improving

At the root of this work is the notion that OSM is constantly growing, which sets it apart from other comparable sources of geographic information. As a result, static assessments of quality notions such as completeness or accuracy are limited. For a more holistic perspective on the constantly evolving project, this work focuses on the growth of the map over time.

Intrinsic Data Quality Assessment

Intrinsic quality assessment relies only on internal attributes of the target data, not on external datasets as points of reference for comparison. In contrast, extrinsic data quality assessments of projects like OSM and Wikipedia involve comparing the data directly to external, often authoritative, reference datasets. For many parts of the world, however, such datasets do not exist, making extrinsic analysis impossible.

Here we look at features of the OSM database over time. By comparing attributes like the number of contributors, the density of buildings, and the length of roads, we can learn how the map grows and ultimately improves over time.

Specifically, we aim to explore the following:

Contributors

  • How many?
  • How recent?
  • What type of edits?

Objects

  • What types?
  • How many?
  • Relative Density?
  • Object version?

The bulk of this work involves designing a data pipeline that better allows us to ask these types of questions of the OSM database. The next section takes a deep dive into these methods. The final section, Visualizing, has a series of GIFs that show the results to date.

The interactive version of the dashboard shown in these GIFs can be found here: http://mapbox.github.io/osm-analysis-dashboard


Methods: Vector Tiles

Specifically, zoom-level-15 vector tiles are the base of this work. Zoom level 15 is chosen because (depending on latitude) most tiles have an area of about 1 square kilometer. For scale, a zoom-15 tile looks like this:

z-15-vector-tile

Vector Tiles are chosen primarily for three reasons:

  1. Vector Tiles (specifically OSM data in the .mbtiles format) are standalone SQLite databases. This means very little maintenance overhead (no running database), and the files are very easy to transfer and move around on disk.

  2. They are inherently a spatial datastore. With good compression, the file sizes are not dramatically larger than standard OSM PBF files, but they can be loaded onto a map with no processing. This is mostly done with mbview.

  3. Vector Tiles can be processed efficiently with the tile-reduce framework.

In sum, at any point in the process, a single file exists that can easily be visualized spatially.

Quarterly Historic Snapshots

To capture the growth of the map over time, we create historical snapshots of the map: OSM-QA-Tiles that represent the map at a given point in history. You can read more about OSM-QA-Tiles here.

Boulder Map Growth

This image shows the growth of Boulder, CO over the last decade. The top row shows the road network rapidly filling in over 9 months during the TIGER import, and the bottom row shows the densification of the road and trail network along with the addition of buildings over the last 5 years.

The global-scale quarterly snapshots we created are available for download here: osmlab.github.io/osm-qa-tiles/historic.html.

While quarterly snapshots can teach us about the map at a specific point in history, they do not contain enough information to tell us how the map has changed: the edits that happen between the quarters. To really answer questions such as “How many users edited the map?”, “How many kilometers of roads were edited?”, or “How many buildings were added?”, we need the full editing history of the map.

Historical Tilesets

The full editing history of the map is made available in various formats on a weekly basis. Known as the full history dump, this massive file can be processed in a variety of ways to help reconstruct the exact process of building the map.

Since OSM objects are defined by their tags, we focus on the tagging history of objects. To do this, we define a new schema for historical osm-qa-tiles. The new vector tiles extend the current osm-qa-tiles by including an additional attribute, @history.

Currently, these are built with the OSM-Wayback utility. Still in development, this utility uses RocksDB to build a historical tag index for every OSM object. It does this by parsing a full-history file and saving each individual version of each object to a large index (note: it currently only saves objects with tags, and does not save geometries). This can be thought of as an expanded OSM history file that is optimized for lookups. For the full planet, this index can grow to 600GB in size.

Once the index is built, the utility can ingest a ‘stream’ of the latest OSM features (such as those produced by minjur or osmium-export). If the incoming object version is greater than 1, then it performs a lookup for each previous version of the object in this index.

The incoming object is then augmented with an additional @history property. The augmented features are then re-encoded with tippecanoe to create a full-historical tileset.
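In outline, the lookup-and-augment step works roughly like this. An in-memory dict stands in for the on-disk index, and the layout shown is illustrative, not osm-wayback's actual schema:

```python
def augment_with_history(feature, index):
    """Attach an @history property listing all prior versions of a feature.

    `index` maps (element_type, id, version) -> stored tags, standing in
    for the on-disk historical tag index.
    """
    props = feature["properties"]
    if props["@version"] > 1:
        props["@history"] = [
            index[(props["@type"], props["@id"], v)]
            for v in range(1, props["@version"])
        ]
    return feature

index = {
    ("way", 42, 1): {"@version": 1, "leisure": "pitch"},
    ("way", 42, 2): {"@version": 2, "leisure": "pitch", "sport": "tennis"},
}
feature = {"properties": {"@type": "way", "@id": 42, "@version": 3,
                          "leisure": "pitch", "sport": "tennis"}}
augment_with_history(feature, index)
print(len(feature["properties"]["@history"]))  # the two prior versions
```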

Tag History

Here is an example of a tennis court that is currently at version 3 in the database. The @history property contains a list of each version with details about which tags were added or deleted in each version.

A Note on Scale & Performance

Full-history tilesets are rendered at zoom level 15. OSM-QA-Tiles are typically rendered only at zoom level 12, but we found zoom 15 to be better not only for its higher resolution: because many features are now much larger (they contain multiple versions), spreading them across more tiles also lowers the number of features per tile, keeping tile-reduce processing efficient.

One downside, however, is that at zoom 15, the total number of tiles required to render the entire planet can be problematically large (depending on the language/library reading the file). For this reason, historical tilesets should be broken into multiple regions.

Processing 1: Create Summary Tilesets

The first step in processing these tiles is to ensure that the data are at the same resolution. Historical tilesets are created at zoom-15 resolution while osm-qa-tiles exist at zoom-12 resolution. Zoom 12 is the highest resolution at which the entire planet should be rendered to osm-qa-tiles to ensure efficient processing. Therefore, we start by summarizing zoom-15 resolution into zoom-12 tiles.

Summarizing Zoom 15 Resolution at Zoom 12

A zoom-12 tile contains 64 child zoom-15 tiles (64 tiles = 4^(15-12), an 8x8 grid). To create summary tilesets for data initially rendered at zoom 12 (like the snapshot osm-qa-tiles), we calculate statistics about each child zoom-15 tile inside a zoom-12 tile. This is done with a tile-reduce script that first bins each feature into the appropriate one of the 64 child zoom-15 tiles and then computes various statistics for each of them, such as “total kilometers of named highway” or “density of buildings”.

Since each of these attributes pertains to the zoom-15 tile and not individual features, individual object geometries are ignored. Instead, these statistics are represented by a single feature: a point at the center of the zoom-15 tile that it represents. Each feature then looks like:

geometry: <Point geometry representing the center of the zoom-15 tile>
properties: {
  quadkey:            <unique quadkey for the zoom-15 tile>,
  highwayLength:      <total length of highways>,
  namedHighwayLength: <kilometers of named highways>,
  buildingCount:      <number of buildings>,
  buildingArea:       <total area of building footprints>,
  ...
}
These features are encoded into zoom-12 tiles, each with no more than 64 features. The result is a lightweight summary tileset (only point-geometries) rendered at zoom-12.
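The 8x8 binning works because of the quadkey prefix property: every zoom-15 quadkey begins with the quadkey of its parent zoom-12 tile. Here is a sketch of the standard Web Mercator quadkey math (not the actual tile-reduce script):

```python
import math

def quadkey(lon, lat, zoom):
    """Quadkey of the Web Mercator tile containing (lon, lat) at `zoom`."""
    n = 2 ** zoom
    x = int((lon + 180.0) / 360.0 * n)
    y = int((1.0 - math.asinh(math.tan(math.radians(lat))) / math.pi) / 2.0 * n)
    key = ""
    for z in range(zoom, 0, -1):
        mask = 1 << (z - 1)
        digit = (1 if x & mask else 0) + (2 if y & mask else 0)
        key += str(digit)
    return key

# A zoom-15 quadkey is its zoom-12 parent's key plus three more digits,
# which is what lets features be binned into the 64 child tiles:
k12 = quadkey(-105.27, 40.01, 12)  # Boulder, CO
k15 = quadkey(-105.27, 40.01, 15)
assert k15.startswith(k12) and len(k15) == 15
```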

Summarizing Editing Histories

The summarization of the editing histories is very similar, except that the input tiles are already at zoom 15. Therefore, we skip the binning process and just summarize the features in each tile. Similarly, up to 64 individual features that each represent a zoom-15 tile are re-encoded into a single zoom-12 tile. Each feature includes editing statistics per-user for the zoom-15 tile it represents:

geometry: <Point geometry representing the center of the zoom-15 tile>
properties: {
  quadkey: <unique quadkey for the zoom-15 tile>,
  users: [
    {
      name: <user name>,
      uid: <user id>,
      editCount: <total number of edits>,
      newFeatureCount: <number of edits where version=1>,
      newBuildings: <number of buildings created>,
      editedBuildings: <number of buildings edited>,
      newHighwayKM: <kilometers of highways created>,
      editedHighwayKM: <kilometers of highways edited>,
      addedHighwayNames: <number of `name` tags added to highways>,
      modHighwayNames: <number of existing `name` tags modified on highways>
    },
    { ... }
  ],
  usersEver: <array of all user ids ever to edit on this tile>
}

Why go through all of this effort to tile it?

Keeping these data in the mbtiles format enables spatial organization of the editing summaries in a single file. Encoding zoom-15 summaries into zoom-12 tiles yields tiles of an ideal size for the mbtiles format that can be efficiently processed with tile-reduce.

Processing 2: Calculate & Aggregate

With the above summarization, we have two tilesets, each rendered at zoom 12 with zoom-15 resolution. We can now pass both tilesets into a tile-reduce script by specifying multiple sources when initializing the tile-reduce job:

var path = require('path');
var tileReduce = require('@mapbox/tile-reduce');

tileReduce({
  zoom: 12,
  map: path.join(__dirname, '/map-tileset-aggregator.js'),
  sources: [{
    name: 'histories',
    mbtiles: 'historicalTileset-2010-Q4.mbtiles',
    raw: false
  },{
    name: 'quarterly-snapshot',
    mbtiles: 'snapshot-2010-Q4.mbtiles',
    raw: false
  }],
  ...

In processing, the map script can then access attributes of both tilesets like this:

module.exports = function(data, tile, writeData, done) {  
  var quarterlySnapshots = data['quarterly-snapshot']
  var histories = data['histories']

For performance, the script builds a Map() object for each layer, indexed by zoom-15 quadkey. Next, the script iterates over the (up to 64) features of one tile and looks up the corresponding quadkey in the other tile to combine, compare, contrast, or calculate new attributes. Here is an example of combining and aggregating across the two tilesets, writing out single features with attributes from both inputs:
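That per-layer index might look like the following minimal sketch with toy features (`indexByQuadkey` is an illustrative name, not from the actual script):

```javascript
// Build a Map keyed by zoom-15 quadkey for O(1) lookups when joining layers.
function indexByQuadkey(features) {
  var index = new Map();
  features.forEach(function (feat) {
    index.set(feat.properties.quadkey, feat.properties);
  });
  return index;
}

// Toy example:
var snapshotIndex = indexByQuadkey([
  { properties: { quadkey: '300000000000001', buildingCount: 12 } },
  { properties: { quadkey: '300000000000002', buildingCount: 3 } }
]);
// snapshotIndex.get('300000000000001').buildingCount is 12
```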

features.forEach(function(feat){

  // Create a single export feature to represent each z15 tile:
  var exportFeature = {
    type      : 'Feature',
    tippecanoe: {minzoom: 10, maxzoom: 12}, // Only render this feature at these zoom levels.
    properties: {
      quadkey: feat.properties.quadkey, // The z15 quadkey
      // Initialize the aggregate counters so the += sums below start from zero:
      editCount: 0, newFeatureCount: 0, newBuildings: 0, newHighwayKM: 0,
      editedHighwayKM: 0, addedHighwayNames: 0, modHighwayNames: 0
    },
    geometry: tilebelt.tileToGeoJSON(tilebelt.quadkeyToTile(feat.properties.quadkey)) // Reconstruct the polygon representing the zoom-15 tile.
  }

  // Look up snapshot statistics for this zoom-15 tile (pseudocode placeholders):
  exportFeature.properties.buildingCount_normAggArea      = <number of buildings on this zoom-15 tile, normalized by area>
  exportFeature.properties.namedHighwayLength_normAggArea = <kilometers of named highway on this zoom-15 tile, normalized by area>

  // Access the contributor history information for this zoom-15 tile
  // (contributorHistories is the Map built from the 'histories' layer):
  var tileHistory = contributorHistories.get(feat.properties.quadkey)
  var users = JSON.parse(tileHistory.users) // Get the user array back from its string encoding.

  // Sum attributes across users for simple data-driven styling:
  users.forEach(function(user){
    exportFeature.properties.editCount         += user.editCount;
    exportFeature.properties.newFeatureCount   += user.newFeatureCount;
    exportFeature.properties.newBuildings      += user.newBuildings;
    exportFeature.properties.newHighwayKM      += user.newHighwayKM;
    exportFeature.properties.editedHighwayKM   += user.editedHighwayKM;
    exportFeature.properties.addedHighwayNames += user.addedHighwayNames;
    exportFeature.properties.modHighwayNames   += user.modHighwayNames;
  });

  // Write out the zoom-15 tile summary with information combined from both tilesets:
  writeData( JSON.stringify( exportFeature ) )
})

This script produces two types of output:

  1. (Up to 64) polygons per zoom-12 tile that represent the child zoom-15 tiles. Matching the editing-history format, these features contain per-editor statistics, such as kilometers of roads.

  2. A single zoom-12 summary of all the editing activity.

Processing 3: The Reduce Phase

When the summary zoom-12 tile is delivered to the reduce script, it is first written out to a file (z12.geojson) and then passed to a downscaling, aggregation function, described next.

Downscaling & Aggregation

Last year I made a series of similar visualizations of osm-qa-tiles. I only worked with the data at zoom 12 and kept the features very simple in the hope that tippecanoe could coalesce similar features to display at lower zooms. While this worked, there were a lot of visual artifacts in busy parts of the map, and the individual geometries had to be low-detail to fit in the tiles:

Last Year's Example

To address this, the current workflow relies heavily on downscaling and aggregation to successively bin and summarize child tiles into a single parent tile. Each zoom level is then written to disk separately and tiled only at specific zoom levels. Unfortunately, this requires holding tiles in memory. Fortunately, with a known quantity of four child tiles per parent, we can design the aggregation to continually free memory once all child tiles of a given parent have been processed.

Pseudocode:

zoom_11_tiles = {
  'tile1' : [],
   ...
  'tileN' : []
}

processTile( incomingTile /* a tile at zoom 12 */ ){
  z11_parent = incomingTile.getParent()
  zoom_11_tiles[z11_parent].push(incomingTile)
  if (zoom_11_tiles[z11_parent].length == 4){

    // Aggregate, sum, and average the attributes
    // of the four zoom-12 tiles as appropriate to
    // create a single summary zoom-11 tile.

    // Write the aggregated, summarized zoom-11
    // tile to disk and delete it from memory.
  }
}

In reality, this aggregation is not performed at every zoom level, but only at zoom levels 12, 10, and 8.
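The bookkeeping in the pseudocode above can be made concrete as follows; the names and the single summed attribute are illustrative, not the actual implementation:

```javascript
var pending = new Map(); // parent tile key -> child properties collected so far

// Derive the zoom-11 parent key of a zoom-12 tile [x, y, z].
function parentKey(tile) {
  return [tile[0] >> 1, tile[1] >> 1, tile[2] - 1].join('/');
}

// Collect zoom-12 tiles under their parent; once all four children have
// arrived, aggregate them, flush the summary, and free the memory.
function processTile(tile, properties, flush) {
  var key = parentKey(tile);
  if (!pending.has(key)) pending.set(key, []);
  pending.get(key).push(properties);
  if (pending.get(key).length === 4) {
    var summed = pending.get(key).reduce(function (a, b) {
      return { editCount: a.editCount + b.editCount }; // sum just one attribute here
    });
    pending.delete(key); // free the children from memory
    flush(key, summed);  // e.g. write the zoom-11 summary to disk
  }
}
```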

To ensure this function works as designed, the order of tiles being processed by the entire tile-reduce job is modified to be a stream of tiles grouped at zoom 10. While we cannot ensure that tiles finish processing in a specific order, by controlling the order of the input stream, we can create reasonable expectations that groups of tiles finish processing at similar times and are therefore appropriately aggregated and subsequently freed from memory.
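The input-stream grouping can be sketched as a sort on each zoom-12 tile's zoom-10 ancestor (an illustrative helper, not the actual tile-reduce internals):

```javascript
// Order zoom-12 tiles so that tiles sharing a zoom-10 ancestor are adjacent,
// making it likely that a parent's children finish processing close together.
function sortByZ10Parent(tiles) { // tiles = [[x, y, 12], ...]
  return tiles.slice().sort(function (a, b) {
    var ka = [a[0] >> 2, a[1] >> 2].join('/'); // zoom-10 ancestor of a
    var kb = [b[0] >> 2, b[1] >> 2].join('/'); // zoom-10 ancestor of b
    return ka < kb ? -1 : ka > kb ? 1 : 0;
  });
}
```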

Processing 4: Tiling

The final result of the tile-reduce job(s) is a series of geojsonl files (line-delimited) representing different zoom levels. Using tippecanoe, we create a single tileset that is optimized for rendering in the browser. Recall that each geometry is a polygon representing a vector tile. The attributes of each feature are consistent among zoom levels to allow for data-driven styling in mapbox-gl.

tippecanoe -Z0 -z12 -Pf --no-duplication -b0 \
  --named-layer=z15:z15-res.geojsonl \
  --named-layer=z12:z12-res.geojsonl \
  --named-layer=z10:z10-res.geojsonl \
  --named-layer=z8:z8-res.geojsonl   \
  -o Output.mbtiles

Visualizing: Mapbox-GL

Loading the resulting tileset into Mapbox GL allows for data-driven styling across any of the calculated attributes. An interactive dashboard to explore the North America Tileset is available here: mapbox.github.io/osm-analysis-dashboard

Downscaling across Zoom Levels

This first gif shows the different layers (the results of the downscale & aggregation):

Since everything is aggregated per-quarter, we can easily compare between two quarters. This gif compares the number of active users in mid 2012 to mid 2017. Users active Per Quarter: 2012 vs. 2017

New Building Activity

Here is a high level overview of where buildings were being added to the map in the second quarter of both 2015 (left) and 2016 (right). We can see a few major building imports taking place between these times as well as more general coverage of the map.

New Building Activity: 2015 vs. 2016

If we zoom in on Los Angeles and visualize the “building density” as calculated in July 2015 and July 2016, we see the impact of the LA building import at zoom-15 resolution:

LA Building Import

Users

The 2010 Haiti Earthquake:

This slider shows the number of users active in Haiti during the last quarter of 2009 (just before the earthquake) and then the first quarter of 2010 (when the earthquake struck): Users active during the Haiti Earthquake

We can see the work done by comparing the building density of the map at the end of 2009 and then at the end of the first quarter of 2010:

Building Density increase in Haiti (Quarter 1: 2010)

Ultimately, the number of (distinct) contributors active to date in North America has grown impressively in the last 5 years. This animation shows the difference between mid 2012 and mid 2017:

5 Year Growth

Looking Forward: Geometric Histories

So far, when discussing full editing history, we’ve only been talking about the history of a map object as told through changes to its tags over time. This is a decent proxy for total editing activity and can certainly help us understand how objects grow and change over time. The geometries of these objects, however, also change over time. Whether it’s the release of better satellite imagery prompting a contributor to re-align or enhance a feature, or just generally squaring up building outlines, a big part of editing OpenStreetMap involves changing existing geometries.

Many times, geometry changes to objects like roads or buildings do not propagate to the feature itself. That is, if only the nodes underlying a way are changed, the version of the way is not incremented. Learning that an object has had a geometry change requires a more involved approach, something we are currently exploring in addition to just the tag history.
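At its simplest, detecting a geometry-only change means comparing a way's node coordinates between two snapshots. This is just a sketch; the real problem also has to handle node insertion, deletion, and reordering:

```javascript
// Compare two coordinate lists; any moved, added, or removed node counts
// as a geometry change even though the way's version did not increment.
function geometryChanged(coordsA, coordsB) {
  if (coordsA.length !== coordsB.length) return true;
  return coordsA.some(function (pt, i) {
    return pt[0] !== coordsB[i][0] || pt[1] !== coordsB[i][1];
  });
}
```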

With full geometry history, we could compare individual objects at two points in time. Here is an example from a proof-of-concept for historic geometries. Note many of the buildings initially in red “square up” when they turn turquoise. These are geometry changes after the 2015 Nepal Earthquake. The buildings were initially created non-square and just a little while later, another mapper came through and updated the geometries:

5 Year Growth

Location: The Hill, Boulder, Boulder County, Colorado, 80802, United States

Analysis Walk-thru: How many contributors are editing in each Country?

Posted by Jennings Anderson on 29 June 2017 in English (English)

How many contributors are active in each Country?

I recently put together this visualization of users editing per Country, along with some other basic statistics. This analysis is done with tile-reduce and osm-qa-tiles. I’m sharing my code and the procedure here.

Users by Country

This interactive map depicts the number of contributors editing in each Country. The Country geometries are in a fill-extrusion layer, allowing for 3D interaction. Both the height and the color of each Country scale with the number of editors. Additional Country-level statistics, such as the number of buildings and kilometers of roads, are also computed.

Procedure

These numbers are all calculated with OSM-QA-Tiles and tile-reduce. I started with the current planet tiles and used this Countries geojson file for the Country geometries to act as boundaries.

Starting tile reduce:

tileReduce({
  map: path.join(__dirname, '/map-user-count.js'),
  sources: [{name: 'osm', mbtiles: path.join("latest.planet.mbtiles"), raw: false}],
  geojson: country.geometry,
  zoom: 12
})

In this case, country is a geojson feature from the countries.geo.json file. I ran tile-reduce separately for each Country in the file, creating individual geojson files per Country.

The map function:

var distance = require('@turf/line-distance')

module.exports = function(data, tile, writeData, done) {
  var layer = data.osm.osm;

  var buildings = 0;
  var hwy_km    = 0;
  var users     = [];

  layer.features.forEach(function(feat){

    if (feat.properties.building) buildings++;

    if (users.indexOf(feat.properties['@uid']) < 0){
      users.push(feat.properties['@uid'])
    }

    if (feat.properties.highway && feat.geometry.type === "LineString"){
      hwy_km += distance(feat, 'kilometers')
    }
  });
  done(null, {'users': users, 'hwy_km': hwy_km, 'buildings': buildings});
};

The map function runs on every tile and returns a single object with the summary stats for that tile. For every object on the tile, the script first checks if it is a building and increments the building counter appropriately. Next, it checks if the user who made this edit has been recorded yet for this tile; if not, it adds their user id to the list. Finally, the script checks if the object has the highway tag and is indeed a LineString object. If so, it uses turfjs to calculate the length of the highway and adds it to a running counter of total road kilometers on the tile.

After doing this for all objects on the tile (Nodes and Ways in the current osm-qa-tiles), it returns an object with an array of user ids and total counts for both road kilometers and buildings.

Back in the main script, the instructions for reduce are as follows:

.on('reduce', function(res) {
  users = users.concat(res.users)
  buildings += res.buildings;
  hwy_km += res.hwy_km;
})

The list of unique users active on any given tile is added to the users array keeping track of users across all tiles. If users have edited on more than one tile, they will be replicated in this array. We’ll deal with this later.

The running building and kilometers of road counts are then updated with the totals from each tile.

Ultimately, the last stage of the main script writes the results to a file.

.on('end', function() {
  var numUsers = _.uniq(users).length;

  fs.writeFile('/data/countries/'+country.id+'.geojson', JSON.stringify(
    {type: "Feature",
     geometry: country.geometry,
     properties: {
       uCount: numUsers,
       hwy_km: hwy_km,
       buildings: buildings,
       name: country.properties.name,
       id: country.id
      }
    })
   )
});

Once all tiles have been processed, this function uses lodash to remove all duplicate entries in the users array. The length of this array now represents the number of distinct users with visible edits on any of the tiles in this Country.

Using JSON.stringify and the original geometry of this Country that was used as the bounds for tile-reduce, this function creates a new geojson file for every Country with a properties object of all the calculated values.

Visualizing

Once the individual Country geojson files are created, the following python code iterates through the directory and creates a single geojson FeatureCollection with each Country as a feature (the same as the countries.geo.json file we started with, but now with more properties).

countries = []

for file in os.listdir('/data/countries'):
  country = json.load(open('/data/countries/'+file))
  countries.append(country)

json.dump({"type":"FeatureCollection",
           "features" : countries}, open('/data/www/countries.geojson','w'))

Once this single geojson FeatureCollection is created, I uploaded it to Mapbox and then used mapbox-gl-js with fill-extrusion and a data-driven color scheme to make the Countries with more contributors appear taller and redder, while those with fewer contributors are shorter and closer to yellow/white in color.

Here is a sample of that code:

map.addSource('country-data', {
  'type': 'vector',
  'url': 'mapbox://jenningsanderson.b7rpo0sf'
})

map.addLayer({
  'id': "country-layer",
  'type': "fill-extrusion",
  'source': 'country-data',
  'source-layer': 'countries_1-1l5fxc',
  'paint': {
    'fill-extrusion-color': {
      'property':'uCount',
      'stops':[
        [10, 'white'],
        [100, 'yellow'],
        [1000, 'orange'],
        [10000, 'orangered'],
        [50000, 'red'],
        [100000, 'maroon']
      ]
    },
    'fill-extrusion-opacity': 0.8,
    'fill-extrusion-base': 0,
    'fill-extrusion-height': {
      'property': 'uCount',
      'stops': [
        [10, 6],
        [100, 60],
        [1000, 600],
        [10000, 6000],
        [50000, 30000],
        [100000, 65000]
      ]
    }
  }
})

This current implementation uses two visual channels (height and color) for the user count, which is redundant; the data-driven styling could easily be modified to represent the number of buildings or kilometers of roads instead by changing the stops array and the property value to buildings or hwy_km.
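For example, re-targeting the extrusion height at road kilometers might look like this; the stop values here are made up for illustration, not tuned to the real data:

```javascript
// Illustrative paint override: drive extrusion height from hwy_km instead of
// uCount. Stop values are invented for the sketch.
var hwyHeightPaint = {
  'property': 'hwy_km',
  'stops': [
    [1000, 600],
    [100000, 6000],
    [1000000, 60000]
  ]
};
```

Passing this object as the layer's 'fill-extrusion-height' value (and similarly re-pointing 'fill-extrusion-color') would scale Countries by road kilometers instead of user count.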

To show more information about a Country on click, the following is added:

map.on('mousemove', function(e){
  var features = map.queryRenderedFeatures(e.point, {layers: ['country-layer']})
  map.getCanvas().style.cursor = (features.length > 0) ? 'pointer' : '';
});

map.on('click', function(e){
  var features = map.queryRenderedFeatures(e.point, {layers: ['country-layer']})

  if(!features.length){return};
  var props = features[0].properties

  new mapboxgl.Popup()
    .setLngLat(e.lngLat)
    .setHTML(`<table>
      <tr><td>Country</td><td>${props.name}</td></tr>
      <tr><td>ShortCode</td><td>${props.id}</td></tr>
      <tr><td>Users</td><td>${props.uCount}</td></tr>
      <tr><td>Highways</td><td>${props.hwy_km.toFixed(2)} km</td></tr>
      <tr><td>Buildings</td><td>${props.buildings}</td></tr></table>`)
    .addTo(map);
});

Much of this code is based on these examples

Location: The Hill, Boulder, Boulder County, Colorado, 80802, United States

OSM Contributor Analysis - Entry 2: Annual Summaries of User Edits

Posted by Jennings Anderson on 6 July 2016 in English (English)

Over the past two weeks I have been trying out some new methods to uncover user focus on the map. Investigating this idea of user focus includes questions like:

  • Are there areas where a specific user edits more frequently or regularly?
  • Are there multiple contributors who focus on the same areas?
  • Do these activities correlate to “map gardening”?

To answer these questions, I’ve put together an interactive map, similar to How Did You Contribute to OSM by Pascal Neis, but with the addition of being able to compare multiple users through the years.

Check it out Here: OSM Annual User Summary Map

Please Note: Requires recent versions of Google Chrome (recommended) or Firefox (>=35).

How does it work?

Using the annual snapshots osm-qa tiles, I have calculated the following statistics for each user’s visible edits at the end of each year on a per-tile basis:

  • # of total edits
  • # of buildings
  • # of amenities
  • kilometers of roads

With this information, we can look at areas of specific focus for a given user by applying minimum thresholds. For example, here are most of the tiles edited by seven different users in 2011:

7 Users No Filter

When we increase the threshold for minimum percent of edits, we see that though this particular user has thousands of edits all over the Country, 70% of his edits are on this one tile!

7 Users Filtered

Just by playing around with this map, it seems that even users with millions of edits always have a handful of tiles where they seem to be significantly more active. Of course this begs the question, “is this the user’s hometown?” or perhaps even more importantly, “is this user contributing local knowledge to these particular tiles?”

When you zoom in close, you can click on any given tile and get a list of the top 100 contributors on that tile for the year. Clicking on any user in that list will load their edits onto the map. List of Users

What’s Next?

This is just the first step of many to come in doing community detection in OSM through social network analysis!

More to come! Jennings

Location: The Hill, Boulder, Boulder County, Colorado, 80802, United States

OpenStreetMap Data Analysis: Entry 1

Posted by Jennings Anderson on 20 June 2016 in English (English)

Howdy OpenStreetMap, I am excited to share that I am working as a Research Fellow with Mapbox this summer! As a research fellow, I am looking to better understand contributions to OSM.

For my first project, I have been using the tile-reduce framework to summarize per-tile visible edits from the Historical OSM-QA-Tiles. These historical tiles are a snapshot of what the map looked like at the time listed on the link.

With this annual resolution, we can visualize the edits (those edits that were visible at the end of that year) that happened on each tile. So far, I’ve summarized them as a) number of editors, b) number of objects, and c) recency of the latest edit (relative to that year).

The OSM-QA-Tiles are all generated at zoom level 12, which separates the world into 5 million+ tiles. Some tiles have few objects while others have ten thousand or more.

So far I have created two interactive maps to investigate OpenStreetMap editing behavior at this tile-level analysis:

1. Editor Density (Number of editors active on a tile)

2. Edit Recency (Time since last edit on the tile)

Editor Density

This map highlights tiles where multiple editors have been active. The most active editors in most cases are automated bots, especially in the more recent years. For best results, moving the slider in the bottom left for Minimum Users Per Tile to 2 or 3 will exclude most of these automated edits.

Examples

2007: European Hotspots

By increasing the minimum object and minimum user thresholds, areas of heavy editing activity pop out:

2007 european hotspots

2007: US Tiger Import - Automated Edits

This image of the activity in the US in 2007 has no threshold on the minimum number of objects or users per tile, so you can see all of the tiles affected by the 2007 import. If you increase the threshold, the picture changes dramatically:

tiger import

Edit Recency

This map shows the recency of edits to a tile, relative to the year of analysis. It looks surprising at first how many tiles are edited at the end of the year, but that is most likely a function of automated bots. Again, if you move the threshold for number of editors or objects per tile, interesting patterns pop out across the world where users may have been active early in the year and then are less active later. The 2010 Haiti Earthquake is a good example, as it occurred in January of 2010.

2007: The stages of the Tiger Import

If we view by latest edit date, relative to the year, we see the state-by-state import in the US:

2008: North Eastern Hemisphere

2008 recency

More to come! -Jennings

Location: Logan Circle/Shaw, Chinatown, Washington, Washington, D.C., 2005, United States of America