While reading the manifestos of the seven candidates of this year’s OSMF board elections, I though that getting some numbers about their activity could add a little bit more information to my personal decision whom to vote.
Numbers? Many readers will think of the number of changesets which is the most visible number of an OSM account. However, this number should be interpreted with care because changesets can contain between 0 and 50,000 (nowadays 10,000) changes. Users of different editors create changesets of different size. To get a better picture of the activity of an OSM user, I recommend to use Pascal Neis’s HDYC instead.
The number of data edits or the HDYC level (e.g. “super mapper (very active)”) is only one information. There are a few more contributions which can be counted:
- Users of the OSM Forum (you have to be logged in but you don’t have to be a moderator) can search other users’ accounts by their names. It’s the menu item “members”. It will show the number of postings of a user.
- The OSMF runs a StackOverflow-like system where people can ask questions, e.g. how to use OSM or how to map things. It’s called OSM Help. Answers can be up- or downvoted. This leads to an reputation. You can search for a user names and get their reputation. The forum and OSM Help have a a 1:1 mapping of Help/Forum and OSM API accounts but users who rename their accounts might get a new one (at least on the forum).
- I haven’t found where to get the “editcount” using the MediaWiki front end but you can query the API (you might have to be logged in). The URL is
https://wiki.openstreetmap.org/w/api.php?action=query&list=users&usprop=editcount&format=json&ususers=USERNAME
whereUSERNAME
is the display name of the user.
That was the easy part. The most difficult follows – our Mailman mailing lists.
The archives of all public mailing lists can be download as GZIP compressed plain text files. I used mailman-download by Matt Hicks, a Ruby script without a proper license, for that purpose. Afterwards I had a set of directories, each containing the archives of one mailing list. As the next step, I loaded all of them into a PostgreSQL database to run queries against them. I thought it would be easy (I am used to work with Python and the Psycopg2, its PostgreSQL library) but it was more difficult. Here are the problems I encountered:
- Archives prior to November 2014 are not encoded in UTF-8 but in a encoding which is used in the language of the mailing list, e.g. Latin1 for Talk-de or a Japanese encoding for Talk-ja. Later archives are encoded in UTF-8. This does not affect the
Subject
headers and mail bodies but theFrom
headers as well if the name of the sender contains non-ASCII characters. However, the email address is readable if your used encoding has the same character mapping for character positions 1–127. - An email begins with two lines matching the regular expression
^From
. But it is more difficult that it sounds. If people use the TOFU quoting style, this regular expression matches for embedded, quoted emails as well. Use the regular expessions^From +[^: ]
for the first line and^From:
the next line instead to be sure not to count an email twice. The header of all emails ends with aMessage-ID
header. - The separator of the local and the domain part of email addresses is not an
@
but a string which is different between languages (at
ora
for most mailing lists buten
on Talk-ar or a @-like character on Talk-jp).
Here are some results I got:
mails=# SELECT COUNT(1) AS count FROM mails;
count
--------
738498
mails=# SELECT list_id, COUNT(1) AS message_count FROM mails GROUP BY list_id ORDER BY message_count DESC;
list_id | message_count
-----------------------+---------------
talk-de | 115377
talk-fr | 90995
talk | 81529
talk-it | 64945
tagging | 40989
dev | 30267
talk-gb | 22084
talk-cz | 20141
talk-us | 18959
talk-es | 16322
talk-nl | 15435
rails-dev | 14652
hot | 14544
talk-br | 12375
talk-au | 12055
talk-be | 10473
talk-ja | 10237
talk-at | 9670
talk-ca | 8784
legal-talk | 8532
josm-dev | 8045
talk-ph | 6201
talk-lv | 5956
tile-serving | 5919
imports | 5735
talk-dk | 5043
talk-co | 4311
talk-se | 3376
merkaartor | 3302
talk-ro | 3248
talk-in | 3100
talk-pl | 2845
dev-fr | 2541
talk-hr | 2510
talk-gb-westmidlands | 2219
talk-fr-bzh | 2204
talk-lt | 2171
potlatch-dev | 2101
talk-ie | 2078
talk-transit | 1940
geocoding | 1851
osmosis-dev | 1755
talk-cl | 1752
talk-ee | 1679
osrm-talk | 1675
talk-ar | 1619
talk-pt | 1521
talk-it-fvg | 1486
talk-ht | 1386
routing | 1319
talk-tw | 1284
talk-it-trentino | 1227
hot-francophone | 1214
talk-cat | 1174
talk-za | 1141
historic | 1119
talk-is | 939
talk-it-piemonte | 841
imports-us | 836
talk-ni | 820
talk-it-liguria | 795
talk-latam | 756
talk-it-lazio | 749
talk-pe | 735
talk-mx | 709
talk-ve | 621
talk-sn | 611
strategic | 597
talk-bo | 575
talk-it-cai | 509
talk-us-massachusetts | 399
mapcss | 397
talk-cu | 378
talk-tr | 377
talk-gb-midanglia | 368
diversity-talk | 357
rebuild | 353
talk-rs | 346
talk-ko | 338
talk-gb-oxoncotswolds | 333
dev-italia | 333
talk-pr | 332
accessibility | 324
talk-si | 301
local-chapters | 291
talk-np | 276
talk-scotland | 272
umap | 244
taginfo-dev | 241
talk-africa | 232
talk-gh | 226
talk-it-sardinia | 211
talk-tn | 198
talk-bf | 192
legal-general | 188
talk-ml | 176
talk-uy | 166
talk-fi | 165
talk-id | 164
talk-mg | 162
talk-cm | 153
talk-it-southtyrol | 151
talk-lu | 149
talk-it-bici | 134
talk-us-nps | 131
learnosm-coord | 128
talk-il | 121
talk-gb-london | 115
talk-bj | 106
talk-cn | 103
talk-ru | 102
talk-bd | 101
maproulette | 98
talk-tg | 98
photon | 98
talk-ug | 97
talk-baltics | 96
talk-ci | 91
sotm-asia | 90
talk-ke | 87
talk-it-veneto | 84
talk-kosovo | 78
party | 76
moderation | 74
talk-gb-thenorth | 72
teachosm | 66
talk-vi | 65
tagging-fr | 63
osmf-membership | 61
announce | 61
talk-blr | 55
talk-cd | 52
talk-py | 47
talk-us-pugetsound | 36
talk-us-newyork | 35
talk-ma | 35
talk-it-sicilia | 34
talk-lk | 31
talk-ba | 28
talk-al | 26
talk-do | 23
welcomewg | 23
talk-nc | 19
talk-ne | 19
osm-professional | 19
talk-mw | 18
talk-ng | 17
talk-cr | 17
osmcha-dev | 15
talk-it-marche | 14
talk-tz | 12
talk-ps | 12
talk-mm | 12
talk-et | 10
openhiking | 9
science | 9
lorodux-dev | 9
talk-gr | 7
talk-us-sfbay | 7
talk-hn | 7
talk-it-appulo-lucana | 7
hot-announce | 6
talk-kg | 6
talk-my | 5
osm2world | 5
talk-sc | 3
talk-lb | 3
talk-carib | 3
outdoor-natural | 3
talk-mn | 2
talk-iq | 1
talk-bi | 1
talk-asia | 1
talk-zm | 1
(174 Zeilen)
mails=# SELECT row_number() OVER (ORDER BY message_count DESC) AS rank, address, message_count
FROM (
SELECT
from_short AS address,
COUNT(1) AS message_count
FROM mails WHERE from_short NOT LIKE '%@github.com'
GROUP BY address
) AS a
ORDER BY message_count DESC
LIMIT 20;
-- email addresses have been manually replace by names or nicknames in the following listing. People who have used multiple email addresses on the mailing lists are treated as multiple persons.
rank | address | message_count
------+---------------------------------+---------------
1 | Martin Koppenhoefer | 25634
2 | Frederik Ramm | 12274
3 | Pieren | 6537
4 | Philippe Verdy | 5731
5 | Christian Quest | 4956
6 | John Smith | 4582
7 | Tobias Wendorff | 3494
8 | Luca Delucchi | 3171
9 | Sven Geggus | 3113
10 | Jan Tappenbeck | 3025
11 | Steve Coast | 3024
12 | Maning Sambale | 2849
13 | Polyglot | 2831
14 | Tom Hughes | 2668
15 | Paul Johnson | 2559
16 | Vincent Pottier | 2516
17 | Markus | 2506
18 | Simone Cortesi | 2415
19 | Richard Fairhurst | 2393
20 | Bernd Wurst | 2347
I will publish more results of more queries soon. If you have any suggestions what to query, write a comment to this user diary entry.
The Python script I used is quite ugly. That’s why I have not published its source code. However, if you are interested in doing similar analysis, write me an email and I will send it to you. The same applies for the dump of the database.
Discussion
Comment from ImreSamu on 2 December 2018 at 17:49
Thank you for sharing information.
Just a note for the future mailing list processing:
There are at least 2 google-groups based country mailing lists:
OFF:
it is not a big problem - for the simple statistics, but can be a “real problem” for contacting local OSM community - in emergency cases.
Comment from Heather Leson on 2 December 2018 at 17:54
Thanks for this information. It is important to note that many community members do not use the mailing lists but opt for other communication channels. There are many reasons for this, but when we consider the breadth and activity of the community it is always key to see metrics in terms of the reality.
Comment from imagico on 2 December 2018 at 18:23
I would suggest to focus on recent years and not the full duration of the archive since that could lead to a bias towards the old timers. A long time experience is a valid argument for a candidate but should not be mixed with recent activity.
I think it would be good to include other channels than mailing lists - you mentioned the forum and wiki activities already - diary entries and comments to them are another venue, so are changeset discussions. I think it is fine to exclude proprietary platforms because their use for OSM community discourse is problematic anyway.
And it is always important to keep in mind that in communication quantity does not necessarily say something about quality. Some people comment a lot of things while others contribute more rarely but provide more thoughtful messages.