In my previous diary entry I showed you some of the problems that I see with the number of keys in use for the tagging of objects in OSM (54382 at the time of my research: 25 July 2015).
In a reaction I got from user Hedaja he pointed to an interesting blog I wasn’t aware of, by the maintainer of the Taginfo database, Jochen Topf.
Jochen - in this blog - also mentions the “one-time-only” use of keys and calls for action in an attempt to lower the number of keys back to a healthy 40.000.
I did some research and I downloaded the taginfo database on 25 July 2015.
It has a table “keys” with 54382 keys that I used for the next statistics.
I consider keys used 10 times or fewer suspect: they probably result from some mistake in the use of the key (e.g. a misspelling of a regular key).
Let’s consider any key that is used 10.000 or more times a “trusted” key. How many are there?
In between we have a group of 17.516 keys that are used between 11 and 9999 times.
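The three groups above can be recomputed from the downloaded database. Here is a minimal sketch in Python, assuming the taginfo SQLite file has a table `keys` with one row per key and a usage column; the column name `count_all` and the function names are my assumptions for illustration, and the demo rows are made up (only the count for `source` comes from this entry) so that the snippet runs without the real file.

```python
import sqlite3


def bucket_keys(conn, low=10, high=10_000):
    """Count keys per usage bucket.

    Assumes a table keys(key, count_all); the column name is an assumption.
    """
    row = conn.execute(
        """
        SELECT
            SUM(CASE WHEN count_all <= :low THEN 1 ELSE 0 END),
            SUM(CASE WHEN count_all > :low AND count_all < :high THEN 1 ELSE 0 END),
            SUM(CASE WHEN count_all >= :high THEN 1 ELSE 0 END),
            COUNT(*)
        FROM keys
        """,
        {"low": low, "high": high},
    ).fetchone()
    # SUM() yields NULL on an empty table, so fall back to 0.
    suspect, middle, trusted, total = (n or 0 for n in row)
    return {"suspect": suspect, "middle": middle, "trusted": trusted, "total": total}


def demo_connection():
    """Build a tiny in-memory stand-in for the real download (made-up rows)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE keys (key TEXT, count_all INTEGER)")
    conn.executemany(
        "INSERT INTO keys VALUES (?, ?)",
        [
            ("nitrox", 1),           # one-time-only key
            ("example_a", 10),       # upper edge of the suspect group
            ("example_b", 11),       # lower edge of the middle group
            ("example_c", 9999),     # upper edge of the middle group
            ("example_d", 10000),    # lower edge of the trusted group
            ("source", 162428193),   # most used key
        ],
    )
    return conn
```

To run it against the real download, replace `demo_connection()` with `sqlite3.connect(...)` pointing at your local copy of the file, and adjust the column name if it differs.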
By themselves these numbers do not mean very much, because what counts more is which values a key has. A key that is used once can only have one value. E.g. the key “nitrox” is a one-time-only key and it can be found here.
A key that is used twice can have at most 2 different values and a key that is used 100 times can have at most 100 different values.
The key that is used most on OSM is the key: source, it appears 162.428.193 times with 143.491 different values (one of them is Bing and another is bing).
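Values such as Bing and bing differ only in capitalisation. As a rough sketch (the function name is hypothetical, and the input is simply a list of value strings you have already pulled out of taginfo), this is one way to group such variants in Python:

```python
from collections import defaultdict


def case_variants(values):
    """Group values that are identical apart from upper/lower case."""
    groups = defaultdict(set)
    for value in values:
        groups[value.casefold()].add(value)
    # Keep only the groups that really contain more than one spelling.
    return {
        folded: sorted(spellings)
        for folded, spellings in groups.items()
        if len(spellings) > 1
    }
```

Calling `case_variants(["Bing", "bing", "survey"])` reports the two spellings of Bing as one group and leaves survey out, because it has only one spelling.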
Now, then, how can we use all this information to get rid of all those keys that shouldn’t be there because the mapper added them by accident or by ignorance?
Sometimes a mapper adds a trailing space at the end of a key, simply by hitting the spacebar instead of the return key. You don’t see it on your screen, but it gets recorded in the database:
We see that this happened only twice with the name key, but the same error happens much more often. I heard that a bot runs regularly to fix all those invalid spaces, but I’m not sure.
And if you are one of the mappers who created the keys above and happen to read this, please fix it!
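Keys with a stray space are easy to find mechanically. A minimal sketch in Python, assuming you already have the list of keys as strings (the function name is hypothetical, and this is not how any existing bot works, as far as I know):

```python
def whitespace_fixes(keys):
    """Map each key with leading/trailing whitespace to its cleaned form."""
    return {key: key.strip() for key in keys if key != key.strip()}
```

For the list `["name", "name ", " source"]` this returns a mapping for the second and third entries only. The result is just a list of candidates: a mapper still has to check whether the cleaned key is really the one that was meant.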
Do you want to know the values of the correctly spelled name key?
Here is the first page (of more than a million) of taginfo about that key:
Now, let’s look at a “rare” key. What about: gauge:1879-1934?
Here it is (screenshot with openpoimap):
It’s about the track width of this railway track between 1879 and 1934.
According to the wiki the gauge=* tag is supposed to hold the track width, like gauge=1435. But because there are no instructions on how to handle the situation where the track width changed at some point, the mapper chose to add the time span to the key. Is it wrong? I’m not sure, but it is definitely a key that is not easy to re-use. How many other tracks changed their gauge in the same period (1879-1934)?
And what happened between 1906 and 1934? Did they use both track widths?
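To show why such a key is hard to re-use: anyone who wants to work with it has to take the key itself apart first. Here is a small sketch in Python that handles only the simple key:YYYY-YYYY form seen here; the function name and the regular expression are my own illustration, not an official parsing rule.

```python
import re

# Matches only keys of the form base:YYYY-YYYY (an assumption for this sketch).
_YEAR_RANGE_KEY = re.compile(r"^(?P<base>[^:]+):(?P<start>\d{4})-(?P<end>\d{4})$")


def split_year_range_key(key):
    """Return (base key, start year, end year), or None if the key has no year range."""
    match = _YEAR_RANGE_KEY.match(key)
    if match is None:
        return None
    return match["base"], int(match["start"]), int(match["end"])
```

`split_year_range_key("gauge:1879-1934")` gives the base key gauge with the years 1879 and 1934, while a plain `gauge` gives `None`.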
On the other hand, why include historical data in OSM? We have other OSM datasets that are meant to collect historical data. OSM is supposed to “map what is on the ground”, but a railway from more than 100 years ago, is it still there?
There are many more examples to be found that are questionable, but removing all those tags and replacing them with more “valid” ones is not an easy task and needs to be done with care.
If you want to see more examples yourself, the best way to do that is to go to taginfo and select the page with all the keys. Currently it contains 3218 pages. Click on the second column (Objects) so that it is sorted low to high and then scroll a few pages to see the keys that have a count of 1. Take your pick and see the results in taginfo. Please leave your comments or recommendations here.
I have one more question: what about the keys in the database (121 in number) that do not appear at all?
Comment from Jochen Topf on 5 August 2015 at 19:10
Keys or tags that have wiki pages but do not appear in the database will still appear in taginfo, but with a count of 0. Some of those are just typos etc.
Note that a key or tag that appears only a few times is not necessarily wrong. Say, for instance, a tag to mark a nation’s capital. There are only on the order of 200 nations in the world, so that tag will not appear more often than that and that’s totally fine. Also all data entry has to begin somewhere, so there might be a rather new tag that doesn’t appear often but has good potential. So the count itself can only be one hint that a key or tag might be something we could look at. A mapper still has to look at each and every case and figure out if the key/tag is okay or whether it needs fixing and, in that case, how best to fix it.
Comment from Alan Bragg on 5 August 2015 at 19:42
I’ve enjoyed reading your diary entries. Thanks for pointing to Jochen’s blog. He’s a very clever guy and gave a well received talk at the SOTM US 2015 which you can watch at https://www.youtube.com/watch?v=_2L5wzv8DHw
Comment from Hendric Stattmann on 6 August 2015 at 14:05
My suggestion to fix the issue manually, at least partly: Create a new maproulette task.
Would this be a conceivable solution?
Comment from SimonPoole on 6 August 2015 at 21:52
Well, a different way of looking at it would be to say: we have roughly 40’000 mistagged objects out of a good 80’000’000 (just counting nodes with tags; in reality the number is larger).
In other words: we have the incredibly large error rate of 0.05% aka “not a problem”.
Comment from marczoutendijk on 9 August 2015 at 16:02
I have never used Maproulette before, so I’ll give it a try, but have to find out how!
Maybe it is something that might help.
But as @SimonPoole noted, the problem is not the most important task we have to solve…
Comment from Hendrikklaas on 12 August 2015 at 21:55
@simonpoole, problem or not, if a mapper wants to add or change an item there’s nothing wrong with it; we’re all doing it our own way, aren’t we? According to, and as well as we can following, the Wiki.
Comment from Jojo4u on 14 August 2015 at 07:43
The date namespace is described here: http://wiki.openstreetmap.org/wiki/Date_namespace. It is not approved or recommended anywhere, but it is a guideline for a mapper who wishes to integrate this information.
Comment from marczoutendijk on 10 August 2016 at 18:44
A year later:
Total number of keys: 59953
Number of keys used at most 10 times: 39179 (65%)
Number of keys used at least 10000 times: 1401 (2.3%)
The problem with the “suspicious” keys remains….