OpenStreetMap Contributor LifeSpans - Revisiting and expanding on 2018 research paper
Posted by Jennings Anderson on 13 November 2021 in English.
In 2018, researchers Daniel Bégin, Rodolphe Devillers, and Stéphane Roche published a paper titled, The life cycle of contributors in collaborative online communities - the case of OpenStreetMap. A key takeaway from this paper was this density plot of a contributor’s first and last edit:
Plotted this way, temporal trends emerge as lines: vertical lines mark dates when many users started mapping, and horizontal lines mark dates when many users stopped. The paper also published this table to describe the events in OSM history that were being captured:
At the time of publication, the authors used data from mid-2005 through mid-2014.
Adding New Data
I find this density plot to be one of the best visualizations of OSM contributor patterns, so I recently remade the figure with data through 2021. In this post, I will share the new figures and the code I used to generate them.
First, I used the OSM public dataset on Amazon Athena to query the OSM changeset history (registry.opendata.aws/osm/). What once involved downloading and parsing >100M changesets can now be reduced to a 5-line SQL query:
SELECT uid,
min(date(created_at)) as _first,
max(date(created_at)) as _latest
FROM changesets
GROUP BY uid
Next, using Pandas and Matplotlib, we read in the CSV and create the following plot:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Read in the CSV exported from Athena
df = pd.read_csv('~/Downloads/05b0fce8-8318-4c9e-b658-a8677cbed877.csv', parse_dates=['_first','_latest'])

# Create the plot: one dot per mapper
fig, ax = plt.subplots(1, figsize=(15,15))
df.plot.scatter(x='_first', y='_latest', s=0.1, color='k', alpha=0.2, ax=ax)

# Add labels
ax.set_title("OSM Contributor Lifespans (Remake of Bégin et al. 2018)\n({:,} mappers)".format(len(df)), fontsize=20)
ax.set_ylabel("Latest Edit", fontsize=18)
ax.set_xlabel("First Edit", fontsize=18)
For all of OSM:
We see the same features highlighted in the 2018 paper (so we know it worked!), but also many new vertical lines. Most notably, the density of the plot increases considerably in mid-2016. Recall that each of these dots represents a single mapper. This denser upper corner represents users who made their first edit in 2016 or after. Looking at when mappers made their first edit, we can see that the average number of daily new mappers in OSM jumped from about 300 in 2015 to nearly 550 in 2016:
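The per-year averages quoted above can be derived from the same `_first` column. Here is a minimal sketch, assuming the DataFrame layout from the Athena export; the function name `mean_daily_new_mappers` and the toy dates are made up for illustration, and this is not necessarily how the original figure was produced:

```python
import pandas as pd

def mean_daily_new_mappers(first_edits: pd.Series) -> pd.Series:
    """Average number of first-time mappers per calendar day, grouped by year."""
    days = pd.to_datetime(first_edits).dt.normalize()
    per_day = days.value_counts().sort_index()
    # Include days on which nobody started, so they count as zero
    all_days = pd.date_range(per_day.index.min(), per_day.index.max(), freq="D")
    per_day = per_day.reindex(all_days, fill_value=0)
    return per_day.groupby(per_day.index.year).mean()

# Toy data standing in for df['_first'] (hypothetical values)
toy_first = pd.Series(pd.to_datetime([
    "2015-12-30", "2015-12-31", "2015-12-31",
    "2016-01-01", "2016-01-01", "2016-01-02", "2016-01-02",
]))
yearly = mean_daily_new_mappers(toy_first)
```

With the real export, `mean_daily_new_mappers(df['_first'])` would give one value per year. Note that the first and last years are averaged only over the days the data actually covers.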
What caused this spike?
Distinguishing which software was used for each of these first-edits, we can see that this spike was due to the launch of editing within Maps.ME:
*Showing 95% of first edits to OSM with most popular mapping tools; remaining 5% were made with > 750 other software libraries. The query:
WITH mappers AS (
  SELECT uid,
    min(id) as _first_changeset,
    min(date(created_at)) as _first
  FROM changesets
  GROUP BY uid
)
SELECT mappers.uid,
  _first,
  split(tags['created_by'],' ')[1] as _editor
FROM mappers
LEFT JOIN changesets ON mappers._first_changeset = changesets.id
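Once the query result is in Pandas, the "95% of first edits" cut can be reproduced by ranking editors and keeping those whose cumulative share reaches the threshold. This is a sketch under assumptions: `top_editors` is a hypothetical helper, and the `created_by` strings below are invented examples rather than real data.

```python
import pandas as pd

def top_editors(created_by: pd.Series, share: float = 0.95) -> pd.Series:
    """Most common editors that together cover at least `share` of first edits."""
    # Same idea as split(tags['created_by'], ' ')[1] in the query: keep the first token
    editor = created_by.fillna("unknown").str.split(" ").str[0]
    counts = editor.value_counts()  # sorted, most common first
    cum_share = counts.cumsum() / counts.sum()
    # Keep every editor up to and including the one that crosses the threshold
    n_keep = int((cum_share < share).sum()) + 1
    return counts.iloc[:n_keep]

# Invented created_by values, for illustration only
toy_created_by = pd.Series(
    ["iD 2.20.2"] * 6 + ["MAPS.ME android 10.3"] * 3 + ["JOSM/1.5 (18303 en)"]
)
top = top_editors(toy_created_by, share=0.8)
```

In the toy data, two editors are enough to cover 80% of first edits, so the third is left in the "remaining" group.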
Recent Years
Given the density of the plot in recent years, we can discern more if we focus only on mappers starting since 2015:
A few observations:
- The thick diagonal line at y=x shows that for most mappers, the first and last days of editing are very close if not the same day. This could be from attending a mapathon once, for example.
- The diagonal stripes indicate that for some mappers, their last day of editing is exactly 1 year after their first day of editing.
- The darker horizontal line at the top of the plot shows the thousands of mappers that started in the last 7 years and continue to be active.
- The vertical lines represent specific days when many new mappers started, such as the vertical line appearing in early-mid 2015 describing mappers that likely started mapping in response to the April 25, 2015 Nepal Earthquake.
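The diagonal stripes can also be checked numerically by measuring how close each mapper's lifespan is to a whole number of years. A rough sketch, assuming the `_first` and `_latest` columns from the export; the helper name, the 3-day tolerance, and the toy rows are all illustrative choices, and using multiples of 365 days ignores leap-year drift:

```python
import pandas as pd

def ends_near_anniversary(df: pd.DataFrame, tolerance_days: int = 3) -> pd.Series:
    """Flag mappers whose last edit falls near a yearly anniversary of their first."""
    lifespan = (df["_latest"] - df["_first"]).dt.days
    offset = lifespan % 365
    # Distance (in days) to the nearest whole multiple of 365
    distance = offset.where(offset <= 182, 365 - offset)
    # Require roughly one full year so same-day mappers are not counted
    return (lifespan >= 365 - tolerance_days) & (distance <= tolerance_days)

# Hypothetical mappers: one-year returner, single-day mapper, six-month mapper
toy = pd.DataFrame({
    "_first": pd.to_datetime(["2019-03-01", "2019-03-01", "2019-03-01"]),
    "_latest": pd.to_datetime(["2020-03-01", "2019-03-01", "2019-09-01"]),
})
flags = ends_near_anniversary(toy)
```

Summing `flags` on the full dataset would give a count of mappers sitting on the stripes.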
Incorporating Color
While the diagonal stripes in the previous scatterplot show mappers whose first and last editing days were 1 year apart, we do not know how many days they may have been mapping in between those two dates. If we add count(distinct(date(created_at))) to our query, we can use this mapping_days attribute to color the dots:
If the mappers along these diagonal lines were active for much of the year, we would expect their dots to appear pink to orange. Instead, the majority of the dots forming these diagonal lines are purple, meaning that these mappers were only active a few days within their first year of mapping, but they did return on the one-year anniversary of their first edit to make their last edit.
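For readers working from raw changeset rows rather than Athena, the count(distinct(date(created_at))) aggregate has a close Pandas analogue in `nunique`. A small sketch on invented data (the `changesets` frame and its values are hypothetical):

```python
import pandas as pd

# Hypothetical per-changeset rows: one row per changeset
changesets = pd.DataFrame({
    "uid": [1, 1, 1, 2],
    "created_at": pd.to_datetime([
        "2019-05-01 09:00", "2019-05-01 17:30",
        "2020-05-01 12:00", "2021-02-10 08:15",
    ]),
})

# Truncate timestamps to the day, like date(created_at) in the query
changesets["day"] = changesets["created_at"].dt.normalize()

# One row per mapper: first day, latest day, and number of distinct mapping days
summary = changesets.groupby("uid")["day"].agg(
    _first="min", _latest="max", mapping_days="nunique"
)
```

Mapper 1 has three changesets but only two distinct mapping days. A column like `mapping_days` can then be passed to the `c` argument of `DataFrame.plot.scatter` to color the dots.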
Another View - Humanitarian Mapping
As a whole, this density plot exhibits interesting patterns, but subsetting it further highlights other distinct behaviors. For example, if we look at only the 236k mappers who included the text #hotosm in the comment of their first OSM changeset (perhaps implying that they were introduced to OSM via humanitarian mapping), we see a different pattern:
One thing to note is the many groups of dots in November. This is likely the effect of mappers joining during an OSM geo-week event at some point and then contributing again (for the last time) at another OSM geo-week in November of a later year. We should also note the orange and yellow dots at the top of the plot, showing the many mappers that started mapping in OSM via a HOT task and have continued to map consistently since.
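A subset like this can be selected with a simple text match on the first changeset's comment. The sketch below assumes a hypothetical `first_changesets` frame with a `comment` column; the column name and rows are illustrative, and the actual filter used for the figure may differ.

```python
import pandas as pd

# Hypothetical first changeset per mapper, with its comment text
first_changesets = pd.DataFrame({
    "uid": [10, 11, 12],
    "comment": ["#hotosm-project-1234 buildings", "Added a cafe", None],
})

# Plain substring match, case-insensitive; missing comments count as no match
is_hot = first_changesets["comment"].str.contains(
    "#hotosm", case=False, regex=False, na=False
)
hot_uids = set(first_changesets.loc[is_hot, "uid"])
```

Filtering the lifespan DataFrame to `hot_uids` before plotting would restrict the scatterplot to these mappers.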
These density plots offer a convenient, interpretable visualization of hundreds of thousands of OSM contributors. This conversation on the OpenStreetMap US slack prompted me to recreate these figures (and finally solve a longstanding question about the bump in new mappers since 2016). What also came out of this thread was an interest in visualizing the daily mapping activity to see if new density patterns might emerge.
Daily Mapping Activity
The previous density plots use one dot to represent one mapper. If we focus instead only on a subset of top contributors, say mappers that have mapped for more than 100 days since 2018, we can dig a little deeper into their temporal patterns. In the following figures, each dot represents 1 mapper mapping on 1 day. Each row, then, represents a single mapper.
To find which mappers were active on which days, we use the following query:
SELECT uid,
date(changesets.created_at) as _day,
sum(num_changes) as _edits
FROM changesets
WHERE changesets.created_at > date '2018-01-01'
GROUP BY uid, date(changesets.created_at)
ORDER BY uid DESC, _day DESC
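From this per-day result, the subset of very active mappers and a row position for each one can be derived in Pandas. This is a sketch under assumptions: the helper name is made up, the toy threshold is tiny so the example stays readable, and ordering rows by number of active days is my guess at a sensible layout.

```python
import pandas as pd

def rows_for_active_mappers(daily: pd.DataFrame, min_days: int) -> pd.DataFrame:
    """Keep mappers active on more than `min_days` days and assign each a plot row."""
    days_active = daily.groupby("uid")["_day"].nunique()
    keep = days_active[days_active > min_days].sort_values(ascending=False)
    # Row 0 goes to the most active mapper, row 1 to the next, and so on
    row_of = pd.Series(range(len(keep)), index=keep.index)
    out = daily[daily["uid"].isin(keep.index)].copy()
    out["row"] = out["uid"].map(row_of)
    return out

# Invented stand-in for the query result
toy_daily = pd.DataFrame({
    "uid": [1, 1, 1, 2, 3, 3],
    "_day": pd.to_datetime([
        "2021-01-01", "2021-01-02", "2021-01-03",
        "2021-01-01", "2021-01-01", "2021-01-05",
    ]),
    "_edits": [5, 2, 9, 1, 4, 4],
})
plot_rows = rows_for_active_mappers(toy_daily, min_days=1)
```

Plotting `_day` against `row` then gives one dot per mapper per day, with each row belonging to a single mapper. For the figure described here, `min_days` would be 100.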
This plot is sort of interesting, highlighting a few light spots around the holidays, when even the most ardent mappers are less active. We also see many very active mappers picking up activity or joining in 2021. What if we subset this data one more level?
Daily Mapping Activity with Paid Editors
If we expand our criteria to include only mappers active for more than 50 days since 2018, we find 23k mappers (23k rows) where the mappers at the very bottom were active for up to 1,385 days (nearly every day), which continually decreases as you go up, to mappers in the top rows who were active for at least 50 days since 2018. I have highlighted known paid editors in orange on this plot (known because they disclose their affiliation in their OSM profile). Notice the heavy concentration of paid editors between 300 and 750 days, especially after mid-2018 (700s), mid-2019 (500s), and early 2020 (400s). For reference, there are about 250 working days in a given calendar year. Someone mapping consistently on working days since mid-2018 would have mapped more than 750 days by late 2021. Likewise, someone mapping consistently during the work week since mid-2019 would have mapped more than 500 days by late 2021. It is subtle, but I think this pattern is discernible in the graph:
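The working-day estimates can be checked with a few lines of arithmetic. The sketch below assumes 1 July for "mid" year and the date of this post for "late 2021"; both dates and the helper name are my own choices for illustration.

```python
from datetime import date

WORKING_DAYS_PER_YEAR = 250  # round figure used in the text

def expected_working_days(start: date, end: date,
                          per_year: int = WORKING_DAYS_PER_YEAR) -> float:
    """Rough count of working days between two dates at `per_year` days per year."""
    years = (end - start).days / 365.25
    return years * per_year

end = date(2021, 11, 13)  # assumed meaning of "late 2021"
since_mid_2018 = expected_working_days(date(2018, 7, 1), end)
since_mid_2019 = expected_working_days(date(2019, 7, 1), end)
```

Under these assumptions the estimates come out at roughly 840 and 590 days, consistent with the "more than 750" and "more than 500" figures above.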
Conclusion
These density plots make it possible to quickly visualize thousands of OSM contributors and their daily editing patterns. The lifecycle plots show platform-wide trends, such as many mappers starting or stopping, while the daily mapping plots elucidate nuanced temporal patterns of continuous editing behaviors. Visualizing all of OSM is always a tedious task, but finding ways to subset the data (say by hashtag or known paid editors) adds new dimensions to these plots.
Leave a comment with any questions or other visualizations you’d like to see and I will try to post more examples.
Cheers! Jennings