OpenStreetMap

Some numbers about mailing lists

Posted by Nakaner on 2 December 2018 in English.

While reading the manifestos of the seven candidates of this year’s OSMF board elections, I though that getting some numbers about their activity could add a little bit more information to my personal decision whom to vote.

Numbers? Many readers will think of the number of changesets which is the most visible number of an OSM account. However, this number should be interpreted with care because changesets can contain between 0 and 50,000 (nowadays 10,000) changes. Users of different editors create changesets of different size. To get a better picture of the activity of an OSM user, I recommend to use Pascal Neis’s HDYC instead.

The number of data edits or the HDYC level (e.g. “super mapper (very active)”) is only one information. There are a few more contributions which can be counted:

  • Users of the OSM Forum (you have to be logged in but you don’t have to be a moderator) can search other users’ accounts by their names. It’s the menu item “members”. It will show the number of postings of a user.
  • The OSMF runs a StackOverflow-like system where people can ask questions, e.g. how to use OSM or how to map things. It’s called OSM Help. Answers can be up- or downvoted. This leads to an reputation. You can search for a user names and get their reputation. The forum and OSM Help have a a 1:1 mapping of Help/Forum and OSM API accounts but users who rename their accounts might get a new one (at least on the forum).
  • I haven’t found where to get the “editcount” using the MediaWiki front end but you can query the API (you might have to be logged in). The URL is https://wiki.openstreetmap.org/w/api.php?action=query&list=users&usprop=editcount&format=json&ususers=USERNAME where USERNAME is the display name of the user.

That was the easy part. The most difficult follows – our Mailman mailing lists.

The archives of all public mailing lists can be download as GZIP compressed plain text files. I used mailman-download by Matt Hicks, a Ruby script without a proper license, for that purpose. Afterwards I had a set of directories, each containing the archives of one mailing list. As the next step, I loaded all of them into a PostgreSQL database to run queries against them. I thought it would be easy (I am used to work with Python and the Psycopg2, its PostgreSQL library) but it was more difficult. Here are the problems I encountered:

  • Archives prior to November 2014 are not encoded in UTF-8 but in a encoding which is used in the language of the mailing list, e.g. Latin1 for Talk-de or a Japanese encoding for Talk-ja. Later archives are encoded in UTF-8. This does not affect the Subject headers and mail bodies but the From headers as well if the name of the sender contains non-ASCII characters. However, the email address is readable if your used encoding has the same character mapping for character positions 1–127.
  • An email begins with two lines matching the regular expression ^From. But it is more difficult that it sounds. If people use the TOFU quoting style, this regular expression matches for embedded, quoted emails as well. Use the regular expessions ^From +[^: ] for the first line and ^From: the next line instead to be sure not to count an email twice. The header of all emails ends with a Message-ID header.
  • The separator of the local and the domain part of email addresses is not an @ but a string which is different between languages (at or a for most mailing lists but en on Talk-ar or a @-like character on Talk-jp).

Here are some results I got:

mails=# SELECT COUNT(1) AS count FROM mails;
 count  
--------
 738498

mails=# SELECT list_id, COUNT(1) AS message_count FROM mails GROUP BY list_id ORDER BY message_count DESC;
        list_id        | message_count 
-----------------------+---------------
 talk-de               |        115377
 talk-fr               |         90995
 talk                  |         81529
 talk-it               |         64945
 tagging               |         40989
 dev                   |         30267
 talk-gb               |         22084
 talk-cz               |         20141
 talk-us               |         18959
 talk-es               |         16322
 talk-nl               |         15435
 rails-dev             |         14652
 hot                   |         14544
 talk-br               |         12375
 talk-au               |         12055
 talk-be               |         10473
 talk-ja               |         10237
 talk-at               |          9670
 talk-ca               |          8784
 legal-talk            |          8532
 josm-dev              |          8045
 talk-ph               |          6201
 talk-lv               |          5956
 tile-serving          |          5919
 imports               |          5735
 talk-dk               |          5043
 talk-co               |          4311
 talk-se               |          3376
 merkaartor            |          3302
 talk-ro               |          3248
 talk-in               |          3100
 talk-pl               |          2845
 dev-fr                |          2541
 talk-hr               |          2510
 talk-gb-westmidlands  |          2219
 talk-fr-bzh           |          2204
 talk-lt               |          2171
 potlatch-dev          |          2101
 talk-ie               |          2078
 talk-transit          |          1940
 geocoding             |          1851
 osmosis-dev           |          1755
 talk-cl               |          1752
 talk-ee               |          1679
 osrm-talk             |          1675
 talk-ar               |          1619
 talk-pt               |          1521
 talk-it-fvg           |          1486
 talk-ht               |          1386
 routing               |          1319
 talk-tw               |          1284
 talk-it-trentino      |          1227
 hot-francophone       |          1214
 talk-cat              |          1174
 talk-za               |          1141
 historic              |          1119
 talk-is               |           939
 talk-it-piemonte      |           841
 imports-us            |           836
 talk-ni               |           820
 talk-it-liguria       |           795
 talk-latam            |           756
 talk-it-lazio         |           749
 talk-pe               |           735
 talk-mx               |           709
 talk-ve               |           621
 talk-sn               |           611
 strategic             |           597
 talk-bo               |           575
 talk-it-cai           |           509
 talk-us-massachusetts |           399
 mapcss                |           397
 talk-cu               |           378
 talk-tr               |           377
 talk-gb-midanglia     |           368
 diversity-talk        |           357
 rebuild               |           353
 talk-rs               |           346
 talk-ko               |           338
 talk-gb-oxoncotswolds |           333
 dev-italia            |           333
 talk-pr               |           332
 accessibility         |           324
 talk-si               |           301
 local-chapters        |           291
 talk-np               |           276
 talk-scotland         |           272
 umap                  |           244
 taginfo-dev           |           241
 talk-africa           |           232
 talk-gh               |           226
 talk-it-sardinia      |           211
 talk-tn               |           198
 talk-bf               |           192
 legal-general         |           188
 talk-ml               |           176
 talk-uy               |           166
 talk-fi               |           165
 talk-id               |           164
 talk-mg               |           162
 talk-cm               |           153
 talk-it-southtyrol    |           151
 talk-lu               |           149
 talk-it-bici          |           134
 talk-us-nps           |           131
 learnosm-coord        |           128
 talk-il               |           121
 talk-gb-london        |           115
 talk-bj               |           106
 talk-cn               |           103
 talk-ru               |           102
 talk-bd               |           101
 maproulette           |            98
 talk-tg               |            98
 photon                |            98
 talk-ug               |            97
 talk-baltics          |            96
 talk-ci               |            91
 sotm-asia             |            90
 talk-ke               |            87
 talk-it-veneto        |            84
 talk-kosovo           |            78
 party                 |            76
 moderation            |            74
 talk-gb-thenorth      |            72
 teachosm              |            66
 talk-vi               |            65
 tagging-fr            |            63
 osmf-membership       |            61
 announce              |            61
 talk-blr              |            55
 talk-cd               |            52
 talk-py               |            47
 talk-us-pugetsound    |            36
 talk-us-newyork       |            35
 talk-ma               |            35
 talk-it-sicilia       |            34
 talk-lk               |            31
 talk-ba               |            28
 talk-al               |            26
 talk-do               |            23
 welcomewg             |            23
 talk-nc               |            19
 talk-ne               |            19
 osm-professional      |            19
 talk-mw               |            18
 talk-ng               |            17
 talk-cr               |            17
 osmcha-dev            |            15
 talk-it-marche        |            14
 talk-tz               |            12
 talk-ps               |            12
 talk-mm               |            12
 talk-et               |            10
 openhiking            |             9
 science               |             9
 lorodux-dev           |             9
 talk-gr               |             7
 talk-us-sfbay         |             7
 talk-hn               |             7
 talk-it-appulo-lucana |             7
 hot-announce          |             6
 talk-kg               |             6
 talk-my               |             5
 osm2world             |             5
 talk-sc               |             3
 talk-lb               |             3
 talk-carib            |             3
 outdoor-natural       |             3
 talk-mn               |             2
 talk-iq               |             1
 talk-bi               |             1
 talk-asia             |             1
 talk-zm               |             1
(174 Zeilen)

mails=# SELECT row_number() OVER (ORDER BY message_count DESC) AS rank, address, message_count
  FROM (
    SELECT
        from_short AS address,
        COUNT(1) AS message_count
      FROM mails WHERE from_short NOT LIKE '%@github.com'
      GROUP BY address
  ) AS a
  ORDER BY message_count DESC
  LIMIT 20;
  -- email addresses have been manually replace by names or nicknames in the following listing. People who have used multiple email addresses on the mailing lists are treated as multiple persons.
 rank |             address             | message_count
------+---------------------------------+---------------
    1 | Martin Koppenhoefer             |         25634
    2 | Frederik Ramm                   |         12274
    3 | Pieren                          |          6537
    4 | Philippe Verdy                  |          5731
    5 | Christian Quest                 |          4956
    6 | John Smith                      |          4582
    7 | Tobias Wendorff                 |          3494
    8 | Luca Delucchi                   |          3171
    9 | Sven Geggus                     |          3113
   10 | Jan Tappenbeck                  |          3025
   11 | Steve Coast                     |          3024
   12 | Maning Sambale                  |          2849
   13 | Polyglot                        |          2831
   14 | Tom Hughes                      |          2668
   15 | Paul Johnson                    |          2559
   16 | Vincent Pottier                 |          2516
   17 | Markus                          |          2506
   18 | Simone Cortesi                  |          2415
   19 | Richard Fairhurst               |          2393
   20 | Bernd Wurst                     |          2347

I will publish more results of more queries soon. If you have any suggestions what to query, write a comment to this user diary entry.

The Python script I used is quite ugly. That’s why I have not published its source code. However, if you are interested in doing similar analysis, write me an email and I will send it to you. The same applies for the dump of the database.

Discussion

Comment from ImreSamu on 2 December 2018 at 17:49

“The most difficult follows – our Mailman mailing lists.”

Thank you for sharing information.

Just a note for the future mailing list processing:

There are at least 2 google-groups based country mailing lists:


OFF:

it is not a big problem - for the simple statistics, but can be a “real problem” for contacting local OSM community - in emergency cases.

Comment from Heather Leson on 2 December 2018 at 17:54

Thanks for this information. It is important to note that many community members do not use the mailing lists but opt for other communication channels. There are many reasons for this, but when we consider the breadth and activity of the community it is always key to see metrics in terms of the reality.

Comment from imagico on 2 December 2018 at 18:23

I would suggest to focus on recent years and not the full duration of the archive since that could lead to a bias towards the old timers. A long time experience is a valid argument for a candidate but should not be mixed with recent activity.

I think it would be good to include other channels than mailing lists - you mentioned the forum and wiki activities already - diary entries and comments to them are another venue, so are changeset discussions. I think it is fine to exclude proprietary platforms because their use for OSM community discourse is problematic anyway.

And it is always important to keep in mind that in communication quantity does not necessarily say something about quality. Some people comment a lot of things while others contribute more rarely but provide more thoughtful messages.

Log in to leave a comment