Finding non-English key names for cleanup while only speaking English
Posted by watmildon on 22 November 2022 in English (English).
The Goal
While working through some edits in Indonesia I noticed an object with the key “nama”. A quick search revealed that this is Indonesian for “name”, and the objects can very likely be modified to use the standard English key. I wondered, how common is this? Is it easy enough to track down?
The Plan
As a test run I picked four keys: name, building, source, and type. These show up in TagInfo in abundance, and I’m sure there are lots of other good candidates.
The next step is to get usable translations. It turns out Google Sheets has a GOOGLETRANSLATE function that takes a word and returns its translation into a given language. I pulled in the two-letter language code list and built my sheet. After eliminating all languages that Google Translate didn’t support and all languages with non-Latin characters, I was left with ~80 languages to check.
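The non-Latin filtering step can also be done in code rather than by eye. The following is a minimal Python sketch, not what was used for this post: the function name `is_latin_script` and the small sample of translations of “name” are illustrative assumptions, typed in by hand rather than pulled from the sheet.

```python
import unicodedata


def is_latin_script(word: str) -> bool:
    """Return True if every alphabetic character in word is a Latin letter."""
    for ch in word:
        # Unicode character names for Latin letters start with "LATIN",
        # including accented ones such as "é".
        if ch.isalpha() and not unicodedata.name(ch, "").startswith("LATIN"):
            return False
    return True


# Hypothetical sample: language code -> translation of "name".
candidates = {
    "id": "nama",
    "hu": "név",
    "ru": "имя",
    "el": "όνομα",
    "es": "nombre",
}

# Keep only the translations that could plausibly be typed as an OSM key
# by someone using a Latin-script keyboard.
latin_only = {lang: word for lang, word in candidates.items() if is_latin_script(word)}
```

Here `latin_only` keeps the Indonesian, Hungarian, and Spanish entries and drops the Russian and Greek ones.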
The last step was to pull usage information. Fortunately for me TagInfo has an exceptionally well documented REST API. Fifty lines of C# later and I had my results.
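For anyone who wants to try the same lookup, here is a rough Python equivalent of that last step. It is not the original C# program. It assumes the TagInfo `key/stats` endpoint under `/api/4/` and a response whose `data` list contains one row per object type, with the overall figure in the row where `type` is `all`. That shape is my reading of the API documentation, so check it before relying on it. The helper names and the User-Agent string are made up for the example.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint; see the TagInfo API documentation for the details.
TAGINFO_KEY_STATS = "https://taginfo.openstreetmap.org/api/4/key/stats"


def total_count(payload: dict) -> int:
    """Pull the all-objects usage count out of a key/stats response."""
    for row in payload.get("data", []):
        if row.get("type") == "all":
            return int(row.get("count", 0))
    # The key is unknown or the response had an unexpected shape.
    return 0


def fetch_key_count(key: str) -> int:
    """Ask TagInfo how many objects carry the given key."""
    url = TAGINFO_KEY_STATS + "?" + urllib.parse.urlencode({"key": key})
    request = urllib.request.Request(
        url,
        # Hypothetical User-Agent; use something that identifies you.
        headers={"User-Agent": "key-translation-check (example script)"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return total_count(json.load(response))
```

Calling `fetch_key_count("nama")` for each candidate translation, with a short pause between requests to be polite to the server, should give a table like the one in the results.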
The Results
Clicking through a few of these in TagInfo reveals some more likely candidates for cleanup.
Key | Uses |
name | 92528930 |
nome | 166 |
Name | 7 |
Nom | 62 |
Nome | 31 |
non | 6 |
név | 1 |
nama | 133 |
nombre | 207 |
building | 537924316 |
bangunan | 4 |
Bangunan | 1 |
budynek | 2 |
source | 242170152 |
bron | 13 |
fonte | 16 |
Source | 66914 |
fuente | 57 |
kaynak | 8 |
type | 10603620 |
tip | 382 |
tipo | 375 |
typ | 65 |
Typ | 6 |
genus | 902468 |
tipas | 2 |
Type | 283 |
tur | 8 |
Comment from n76 on 22 November 2022 at 23:29
Sounds like you have found some good things to clean up.
I see you have “genus” as a translation of “type” but you should be aware that “genus” is a valid tag for tagging plants so you may want to refine that one a bit more. Maybe if you find a genus=* tag without natural=* and/or other associated tags like species=* they could be candidates.
Comment from watmildon on 23 November 2022 at 03:26
Oh absolutely. I was curious about “genus” and it’s only in the list because Latin was one of the “languages” that happened to survive my sorting. I highly doubt anyone is actually accidentally submitting Latin into the database. Casting a wide net means you’ll very often find false positives!
Comment from Mateusz Konieczny on 26 May 2023 at 08:15
Note that (based on my own experience) it is easy to fall into the trap of finding a lot of things to fix and then not managing to fix even a small part of them.
So I would encourage balancing finding things to fix with actually fixing them.
(BTW, I have obviously_mistagged_tags_using_trivial_tag_fixes.py and obviously_mistagged_tags_using_wrong_keys.py )
Comment from Mateusz Konieczny on 26 May 2023 at 08:18
The target of the first link was supposed to be https://codeberg.org/matkoniecz/OpenStreetMap_cleanup_scripts/src/branch/master/script_assisted_cleanup/obviously_mistagged_tags_using_trivial_tag_fixes.py
Comment from Mateusz Konieczny on 26 May 2023 at 08:20
Oh, and there is also the danger of being too edit-happy and then damaging data more than improving it (or ending up with a case where others think that happened, due to lacking or missing communication). This also happened to me recently and I am still fixing it.
See also https://wiki.openstreetmap.org/wiki/Automated_Edits_code_of_conduct
Comment from watmildon on 26 May 2023 at 20:56
Oh absolutely! This was definitely meant more as a demonstration of “there’s work out there that’s easy to go get at” than as a recommendation to “go mass retag things”. As always, the tools are powerful but must be used cautiously.
We have an infinite sea of work, just need to find the right little inspirations for folks to go to it. (in a collaborative and cooperative manner of course!)