On a Skype call today, Kate Chapman said that in an analysis after the earthquake in Haiti she found that '40 people did 90% of the work' within the community.
Is the workload more evenly spread throughout the community when it comes to Missing Maps tasks as opposed to HOT tasks? Is it more evenly spread during non-emergencies?
I thought I could look at this quickly because I'd done similar work around participation inequality in the context of OSM; in the end it took much longer than expected, and I can't say that I found a simple answer. If anything it serves as a good reminder of why it's challenging to produce meaningful statistics for social spaces: the devil is in the many nuances. This writeup can probably give you some impression of that.
Unfortunately I don't have contributor statistics for Haiti since it predates the tasking manager. Instead I will compare Missing Maps with other large HOT initiatives, most importantly Typhoon Haiyan in the Philippines in 2013, but also the Ebola activation in 2014, and Nepal in 2015.
The impatient can skip the more in-depth discussion and jump to the conclusion section at the bottom. Note that this is just a quick exploration, not a thorough statistical analysis. I'm sure I've overlooked things, so please give feedback.
As usual I'm looking at labour hours as a measure of work. The results are probably not that different from what I'd get with map edits, however I find labour hours are a better reflection of the effort spent on contributing. Time moves at the same pace for everyone, while the same number of clicks could yield a different number of edits depending on what you're doing. Edit counts are also a potentially confusing measure because there's no standard way of counting them: as the number of version increases of geometries, or the number of changesets? Etc. So here's a key limitation of these stats: I'm not actually looking at map impact, instead I'm looking at a measure of individual effort.
You're of course welcome to do your own analyses and compare, the raw data is linked below. Including edit counts!
Group sizes and average labour hours
Just to get a first impression: how much work do people do in each group, on average?
I apologise for the messy table, this is quite unreadable, but useful for reference later. According to my contributor database (which atm has data up to early August 2015) there were about 6,400 contributors to Nepal, 2,800 to Missing Maps, 650 to Haiyan, etc. In other words, Missing Maps has more than 4 times as many contributors as Haiyan.
Here are the median labour hours per group as a plot:
The median contributor effort looks comparable between Missing Maps and Haiyan. However bear in mind that we're looking at a data set that is long-tail distributed, as this histogram suggests:
When looking at long-tail distributions we have to pick our aggregate measures carefully: the mean and even median are likely heavily skewed by outliers. There is no general measure of central tendency for long-tail distributions, nor can there be one. Repeat after me: "There is no average user".
Instead we should compute measures of distribution: how is work distributed among the group?
The Gini index as a basic inequality measure
The Gini index is a classic measure in economics used to describe inequality in groups, usually income inequality in societies. It's typically a number between 0 and 100 (sometimes 0 and 1), and a higher number means "more unequal". According to the CIA World Factbook, the US has a Gini index of around 45 while the UK's is around 33, and Germany is at 27.
The Gini index is also sometimes used to describe participation inequalities in online communities such as HOT. Online communities tend to be highly unequal, with a small share of highly active users; we will come back to that in a bit. It's important to know that we can't compare Gini scores across different kinds of social systems, e.g. we couldn't fairly compare Wikipedia scores with HOT scores unless we're sure they've both been measured in the same way. We can however simply use it to compare different subgroups within a community. Here: different HOT initiatives.
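For reference, here is a minimal sketch of how a Gini index on the 0 to 100 scale can be computed from a list of per-contributor labour hours. This is the standard textbook formula on sorted values, not necessarily the exact computation behind the chart; the function name `gini` is just an illustrative choice.

```python
def gini(hours):
    """Gini index on a 0-100 scale for a list of non-negative values."""
    values = sorted(hours)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        raise ValueError("need at least one non-zero value")
    # Standard formula on ascending-sorted data, with ranks i = 1..n:
    # G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n
    weighted = sum(i * x for i, x in enumerate(values, start=1))
    return 100 * (2 * weighted / (n * total) - (n + 1) / n)


# Everyone does the same amount of work: perfectly equal.
print(gini([1, 1, 1, 1]))   # 0.0
# One person does everything: the maximum for a group of four.
print(gini([0, 0, 0, 10]))  # 75.0
```

Note that the maximum depends on group size ((n - 1) / n), which is one more reason to be careful when comparing scores between very differently sized groups.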
Observations: Missing Maps & Haiyan seem fairly similar. Haiyan might even be a little bit more fairly distributed, however these kinds of "social" statistics tend to be messy in all kinds of ways, with a high degree of measurement error, so for practical purposes I would consider them equal. Work for Nepal and particularly Ebola on the other hand is more unequally distributed -- with either a smaller number of hardcore contributors, or a larger number of people who do very little.
Distribution of work: the bottom end
Ok so let's look at the actual distribution of contributions -- as a first step, let's see how many people do a minimum amount of hours in each group.
This plot shows the distribution of work in absolute terms: how many contributors work for x hours? For example we can see that in the Nepal and Ebola groups, a large number of people contribute very little: many already stop within the first 30 minutes. Missing Maps on the other hand has a nice bump: many people contribute for up to 2h. Is this the mapathon bump?
The Haiyan group is too small to be easily discernible in this plot, so let's look at relative numbers...
... it's somewhere in between. Not an extreme spike of early leavers, but also no mapathon bump. Otoh it likely has a longer tail: a larger number of highly prolific contributors who each do loads of work.
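The two views above (absolute counts and relative shares per time bucket) can be sketched roughly like this. The half-hour bin width and the function name `hours_histogram` are assumptions for illustration; the bin size used for the original plots is not stated.

```python
from collections import Counter


def hours_histogram(hours, bin_width=0.5):
    """Return {bin_start: (count, share_of_group)} for a list of labour hours.

    bin_width is in hours; 0.5 (an assumed value) gives 30-minute buckets.
    """
    counts = Counter(int(h // bin_width) for h in hours)
    n = len(hours)
    return {b * bin_width: (c, c / n) for b, c in sorted(counts.items())}


# Four contributors: two stop within the first 30 minutes.
print(hours_histogram([0.1, 0.2, 0.6, 1.9]))
# {0.0: (2, 0.5), 0.5: (1, 0.25), 1.5: (1, 0.25)}
```

The count is what the absolute plot shows; the share is what makes a small group like Haiyan comparable to a large one like Nepal.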
Based on these charts you could say Missing Maps manages to raise the lower threshold of participation above the bare minimum, which is an achievement in itself. However this doesn't yet answer Kate's question: how much work do the top 40 contributors do?
Impact of highly prolific contributors
As we've seen in the summary table above, the initiatives have very different sizes. Based on that alone we can expect that the top 40 contributors in Missing Maps are likely to have had a smaller impact on the overall output, because it's a larger overall group. Let's check:
Hah, unexpected: The top 40 contributors in Missing Maps and Haiyan had about the same impact on their groups, both carry around 50% of the total effort! Intriguing. If I may speculate about a cause: Missing Maps is a larger project, but also has been running for much longer, so while the overall output is larger, the top contributors also have more time to do their share. (There's only so much time in the day a person has available to do mapping.)
[NOTE: Kate said 90% for Haiti, I got 50% for Haiyan/MM for the same number of people. Why? Don't know -- don't have data on Haiti, and would also need to compare how each statistic was computed. Stats are hard.]
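To make the "top 40" number concrete, here is a small sketch of one way to compute it, assuming a plain list of labour hours per contributor for one initiative. The function name `top_n_share` is hypothetical, and a statistic like Kate's 90% for Haiti may well have been computed differently (e.g. on edits rather than hours).

```python
def top_n_share(hours, n=40):
    """Share of total labour hours contributed by the n most active people."""
    ranked = sorted(hours, reverse=True)
    total = sum(ranked)
    if total == 0:
        raise ValueError("no work recorded")
    return sum(ranked[:n]) / total


# Toy example: the single most active of four contributors does half the work.
print(top_n_share([10, 5, 3, 2], n=1))  # 0.5
```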
Let's look at it in relative numbers instead -- the impact of top 6.25% contributors (the equivalent of 40 in 639 contributors for Haiyan, according to my records).
As expected: we're now covering a larger absolute number of Missing Maps contributors, and of course they collectively account for a larger share of the work at almost 70%. I.e., the "core" contributor group in Missing Maps is larger and does more work than in Haiyan, but only because Missing Maps involves many more people.
Let's also look at the impact of the top 20% contributors, just because that's a classic number people tend to use.
Here we see a classic 80-20 distribution: 20% of users are responsible for 80% of the work. Aka the Pareto principle. Widely observed among online communities. Interesting that it seems to approximately apply for every single one of the HOT initiatives shown here.
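The relative version of the same question (top 6.25%, top 20%) needs one extra step: turning a fraction into a head count for each group. A hedged sketch, with `top_fraction_share` as a made-up name and rounding to the nearest whole contributor as an assumed convention:

```python
def top_fraction_share(hours, fraction=0.2):
    """Return (head_count, share_of_work) for the top `fraction` of contributors."""
    ranked = sorted(hours, reverse=True)
    # Round to the nearest whole person, but always include at least one.
    k = max(1, round(fraction * len(ranked)))
    return k, sum(ranked[:k]) / sum(ranked)


# 6.25% of a group of 639 is the 40 people used for Haiyan above.
k, _ = top_fraction_share([1] * 639, fraction=0.0625)
print(k)  # 40

# Toy group of ten where the top 20% (two people) do 90% of the work.
print(top_fraction_share([8, 1, 1, 0, 0, 0, 0, 0, 0, 0], fraction=0.2))  # (2, 0.9)
```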
Conclusion
As in all things relating to people it's complex, there are different ways of looking at the question, and likely many contributing effects: how were the initiatives promoted, were they executed by a core community or lots of one-off contributors, to what extent did they attract hardcore OSM experts, how long did activity last, etc. I think we barely scratched the surface here. It also serves as a good reminder of why we should be sceptical of simple analytics when looking at online communities.
Based on the charts here we could say that:
- Work is always "unfairly" distributed in HOT -- that's also a well-known empirical finding in many other social settings.
- Missing Maps and Haiyan (and other initiatives) are comparable in terms of participation inequality in some respects, however there are also differences.
- E.g. most have a similar 80-20 split, where 20% of highly active contributors do about 80% of the work -- typical for online communities. This appears to be independent of group size, length of activity period, and other factors.
- It gets even more extreme at the top. A very small number of the most active contributors might be responsible for a surprisingly large share of the work -- e.g. we found for both Haiyan and Missing Maps that 40 people are responsible for about half the work.
- Ebola and Nepal have a higher Gini index compared to Missing Maps or Haiyan, which means work is more unequally distributed in these groups. We find that in these two groups, a larger share of contributors drop out within the first 30 minutes -- more people do less.
- Missing Maps on the other hand appears to have raised the bar in terms of minimum participation. Compared to the other groups we looked at, people don't tend to drop out right away, and instead many stay active for 2h or more. This might be a result of the regular mapathons organised by Missing Maps teams around the world, or of the fact that it's a long-running effort so people contribute more over time.
My inner academic would further argue that in order to gain confidence in these claims we'd have to do actual statistical analyses, and not just look at charts and summary statistics. For long-tailed distributions we might use non-parametric tests for independent samples, such as the Mann–Whitney U test (also known as the Wilcoxon rank-sum test), to determine whether these distributions of labour are actually statistically different across the different groups. That's for another time -- or maybe someone else wants to take it on? The data is linked below.
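As a starting point for anyone who wants to pick this up, here is a minimal pure-Python sketch of the Mann–Whitney U statistic for two groups of labour hours. It only computes the statistic and a simple effect size, not a p-value; for an actual significance test a library routine such as `scipy.stats.mannwhitneyu` would be the more sensible choice. The function name and the toy numbers are illustrative only.

```python
from bisect import bisect_left, bisect_right


def mann_whitney_u(a, b):
    """U statistic for sample a versus sample b.

    Counts the pairs (x from a, y from b) where x > y; ties count as half.
    """
    b_sorted = sorted(b)
    u = 0.0
    for x in a:
        below = bisect_left(b_sorted, x)          # values in b strictly smaller than x
        ties = bisect_right(b_sorted, x) - below  # values in b equal to x
        u += below + 0.5 * ties
    return u


group_a = [3, 4, 5]  # toy values, not real data
group_b = [1, 2, 3]
u = mann_whitney_u(group_a, group_b)
print(u)  # 8.5
# Share of pairs in which the contributor from group_a logged more hours:
print(u / (len(group_a) * len(group_b)))
```

A share near 0.5 would suggest the two groups look alike; values near 0 or 1 suggest one group systematically logs more hours than the other.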
Do any differences relate to a sense of urgency? Not sure we have looked at enough evidence to answer this; of the four groups we could say Haiyan & Nepal are "urgent", Missing Maps is "not urgent", and Ebola may be somewhere in between. I believe the data we've looked at so far won't easily accommodate simple interpretations. A study for another day, or another person :)
Other suggestions for analyses not shown here?
The data used for these analyses:
- hot_contributors_20150810.csv (682kB), a list of all contributors per initiative, along with their total number of labour hours and edits.
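If you want to reproduce the group sizes and median labour hours from this file, something along these lines should work. Big caveat: the column names `initiative` and `labour_hours` are guesses for illustration, so check the actual CSV header and adjust. The sample rows in the example are invented, not taken from the data.

```python
import csv
import io
from collections import defaultdict
from statistics import median


def summarise(csv_file, group_col="initiative", hours_col="labour_hours"):
    """Return {initiative: (contributor_count, median_hours)}.

    group_col and hours_col are hypothetical column names; adapt them
    to whatever the real file uses.
    """
    hours_by_group = defaultdict(list)
    for row in csv.DictReader(csv_file):
        hours_by_group[row[group_col]].append(float(row[hours_col]))
    return {g: (len(h), median(h)) for g, h in hours_by_group.items()}


# Invented sample rows, just to show the expected shape of the input.
sample = io.StringIO("initiative,labour_hours\nA,1\nA,3\nB,2\n")
print(summarise(sample))  # {'A': (2, 2.0), 'B': (1, 2.0)}
```

For the real file, pass an open file handle instead, e.g. `summarise(open("hot_contributors_20150810.csv", newline=""))`.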