OSM processes over 3.1 million feature changes per day. With the help of OSMCha and the ever-vigilant OSM community, many low-quality changes are found and corrected. However, sometimes low-quality labels are not detected and slip into the map.
Towards ensuring a high-quality map, I have been applying natural language processing (NLP) techniques to detect low-quality labels. Part of applying these NLP techniques requires:
First, to define a low-quality label: we typically think of labels that include profanity, such as the following. (Also 🙂: all of the screenshots you see below are examples of low-quality labels that I found with my filtering logic.)
Profane label says: “The Shit”
Profane label says: “Too Damn Far Rd”
However, labels can also be low-quality but not profane, such as the following:
Assertive label says: “Arguably the best location to see evening Manhattan Henge”
Humorous/assertive label says: “I’ve replaced the batteries but my smoke detector won’t stop beeping!”
Flag planting label says: “Jen is here”
Through inspecting millions of labels, I found that the three main motivators of intentional vandalism are:
On the other hand, the only motivation of unintentional vandalism seems to be naiveté.
Specifically, I inspected over 11 million map labels from the United States of America. With filtering logic applied, I was able to narrow down the potential list of poor-quality labels to just 256 labels. Of these 256 labels, approximately 30% were true positives. As a result of these findings, I submitted my first OSM edits! https://www.openstreetmap.org/changeset/59951250
However, these are only preliminary results and I am excited to continue expanding my filtering logic! 🙂
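For readers curious what rule-based label filtering of this general kind might look like, here is a minimal sketch. It is not the filtering logic used for the results above, which this post does not describe in detail. The word lists, the function names (`tokenize`, `flag_label`, `estimated_true_positives`), and the two rule categories are all hypothetical and chosen only to match the example labels shown earlier. The last function simply works through the arithmetic of the reported numbers: roughly 30% of 256 flagged labels.

```python
import re

# Hypothetical word list, for illustration only (not the list used in this post).
PROFANE_WORDS = {"shit", "damn"}

# Hypothetical first-person markers that may hint at a personal or assertive label.
PERSONAL_MARKERS = {"i", "i've", "my", "me"}


def tokenize(label):
    """Lowercase the label, normalize curly apostrophes, and split into word tokens."""
    normalized = label.lower().replace("’", "'")
    return re.findall(r"[a-z']+", normalized)


def flag_label(label):
    """Return the list of reasons a label looks suspicious (empty if none apply)."""
    tokens = set(tokenize(label))
    reasons = []
    if tokens & PROFANE_WORDS:
        reasons.append("profanity")
    if tokens & PERSONAL_MARKERS:
        reasons.append("personal")
    return reasons


def estimated_true_positives(flagged_count, precision):
    """Estimate how many flagged labels are truly low quality."""
    return round(flagged_count * precision)


if __name__ == "__main__":
    examples = [
        "The Shit",
        "Too Damn Far Rd",
        "I’ve replaced the batteries but my smoke detector won’t stop beeping!",
        "Jen is here",
        "Central Park",
    ]
    for label in examples:
        print(label, "->", flag_label(label))

    # Numbers reported above: 256 flagged labels, approximately 30% true positives.
    print(estimated_true_positives(256, 0.30))
```

Note that a word-list approach like this misses "Jen is here" entirely and cannot tell a vandalized label from a legitimately named trail or road, which is why flagged labels still need human review.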
Comment from iandees on 20 July 2018 at 13:50
This is really interesting. How are you applying this work?
Comment from iandees on 20 July 2018 at 13:51
By that I mean: are you going to keep running this? Will you work with the OSMCha team to flag changesets that introduce new low-quality data?
Comment from Jennifer_Cats on 20 July 2018 at 13:54
@iandees Yes! I plan to continue running, improving, and working with other teams to better flag low-quality labels. :)
Comment from GinaroZ on 20 July 2018 at 22:08
I don’t want to discourage your useful work, however I would say that in this case “The Shit” might be a valid name. :-)
It seems to be part of a mountain bike trail in a large forest, and there’s various other names that could be termed as low quality. E.g. “Meth Lab”, “Unemployment Line”, “Ewok Village” etc. In fact, it appears on this map - http://www.galbraithmt.com/paid/trails/evolution.html - so I’d say it should be kept on the OSM map.
Comment from Jennifer_Cats on 20 July 2018 at 22:12
@GinaroZ Totally fair! I think there’s a lot of complexity with detecting anomalous low-quality labels, and I envision my techniques as something to augment or flag human review, such as what you’ve done. I’ve noticed that a lot of labels in hiking areas have a lot more variability/flexibility. Thanks for the due diligence!
Comment from Carnildo on 21 July 2018 at 01:40
How does your system deal with https://www.openstreetmap.org/node/154309316 or https://www.openstreetmap.org/node/160368207?
Comment from Chetan_Gowda on 21 July 2018 at 15:53
Awesome, Jennifer! In my validation experience, vandalism is very rare, and if it happens the community fixes it as soon as possible. The other likely source of low-quality labels on the map is newbies. I’d care more about the “time” gap between detecting and fixing low-quality labels or bad edits on the map. It’d be good to automate these things. Thanks for sharing!
Comment from Jennifer_Cats on 21 July 2018 at 21:33
Hi @Carnildo! Looking strictly at labels, the ratio of false positives to true positives for the examples you listed was too high for me to consider including the feature that surfaces such labels. In the future, I want to find some combination of features that can discern a better decision boundary between poor and high quality. I think this may include, as others have suggested, bringing in other data sources/indicators of reputation.
Comment from Jennifer_Cats on 21 July 2018 at 21:34
@Chetan_Gowda Hi! Thanks for the read. :) I’m glad you liked it, and I think your idea is great! (see comment above to @Carnildo). :D Thanks again!
Comment from Adamant1 on 23 July 2018 at 04:49
Hi there. Your post is really interesting. I have been going through the more obscure name tags manually for a while now to find things that are tagged wrongly, and there are definitely some interesting names being used. So far the best one I have found is Sick Wife Creek in British Columbia. Aside from that, there is a surprising number of things named as what they are, like name=park or name=building. There are also a lot of more personal names being used, like name=parents house or name=meeting place. In one town, someone had tagged a few trees and car ports with name=shelter. My guess is that they were probably homeless and using those places to get out of the sun or something.
I wish OpenStreetMap had a way to integrate things like that and names given to places by locals that are not official. Although I think low-quality, non-official, or profanity-laced names don’t serve the purpose of this project well, labeling a name as “low quality” seems rather subjective, and I feel like something is lost by not integrating the tagging practices of locals into the map, especially when it comes to names. Even if they are low quality, personal, or non-verifiable, I feel like they still serve a purpose somehow and are worth preserving. Anyway, thanks for the post. It’s nice to read about projects that are similar to what I am working on.
Comment from Piskvor on 23 July 2018 at 09:18
Nice work - could this take non-English data into account, or does it only work for one language currently?
@Adamant1: there’s loc_name for local names - e.g. this tunnel has an unofficial but widely recognized local nickname, unrelated to the official one: https://www.openstreetmap.org/way/340126832
On the other hand, although personal mapping does serve a purpose, it’s often a very different purpose than the rest of OSM data: POIs named “I left my car here”, “my appartment” and “change here to go to Piskvor’s place” are transient and not useful beyond the mapper who added them. (As a resident of a tourist destination, I see these all the time.) Perhaps an app that allows for personal, non-shared markers would serve those better (e.g. OsmAnd, Locus); I think that the users who submit those (e.g. in Maps.Me) have no idea they’re adding them to the public database and no intention to do so (as opposed to the other additions, e.g. restaurants).
With better tagging tools, I’m seeing a decline in name=this is an X during the past few years - that at least is a solvable problem, methinks.
Comment from Jennifer_Cats on 23 July 2018 at 15:51
Thanks for reading, @Adamant1, and bringing up some interesting points! I will try to consider them in my next iteration. :)
Comment from Jennifer_Cats on 23 July 2018 at 15:52
@Piskvor, thanks for taking the time to read! Currently, my logic only handles the English language. Expansion to other languages is my top priority! Hopefully I will publish another blog post when that is done. :D
Comment from Viajero Perdido on 24 July 2018 at 16:20
Bike trails often have “edgy” names, sometimes denoted with little signs. If it’s on a sign, I’m mapping it.
Comment from Adamant1 on 24 July 2018 at 21:02
@Piskvor, I am aware of the loc_name tag and I mostly agree with your points. Unfortunately, it doesn’t seem like any of the apps I have used take advantage of it, and the wiki is a little ambiguous on name usage, as it says “Names recorded in name=* tag are ones that are locally used,” but then also mentions loc_name further down in a table that any amateur mapper probably won’t look at. If they visit the wiki in the first place.
Applications like OSMAnd or Maps.me (which is a big offender when it comes to bad mapping) don’t make it clear what will be made public either, let alone provide good tag definitions or alternative tags for people who need them. Things like that don’t help.
After messaging a lot of people who make mistakes though, I think a lot of this comes from the general vagueness in Wiki articles, not just in relation to names. Parks, recreation grounds, and many other things have the same issues. If we aren’t very clear about what we mean by a name versus a local name, the tags will be used in ways that are low quality. The same goes for all the bad park and recreation ground mapping. It’s on us to define our terms well or accept a certain amount of “bad” mapping as a consequence. To me, things like this are more an indicator of those types of things than anything having to do with the individual mapper.
Finally, I would say what constitutes a local name versus not is pretty debatable and comes down to what official source acknowledges the name as being legitimate or not, and whether that source should be considered legitimate itself. A lot of which probably can’t be known by people deleting “bad” names from half a world away without major research, or at all. For instance, there was a lively discussion on the forums once about houses in Pakistan that were mapped with the names of the residents because they don’t have addresses there, and whether the names should stay on the map or not. I would call those things an authoritative, useful, local source, and yet they still get removed. I don’t know what the answer is, but I would at least like to see some more thought put into it before this all becomes automated and names are altered on a mass scale. It would be good if some names were converted to local names and if other measures were taken to make sure name changes are generally treated with care and respect for the locals. Not that I think they won’t be.
I still think this is a really interesting project and I’d like to see where it leads. I don’t want to muddy an otherwise good diary entry and discussion with my long winded thoughts about the subject either. So I’ll leave it at that.
Comment from SK53 on 26 July 2018 at 14:06
The image of Too Damn Far Road seemed very familiar. It is a site in Butler County, PA where the Pennsic War event takes place. Much of the infrastructure mapped there is presumably transitory during the time of the festival. However, I imagine the names of roads are those used during the event.
Comment from GRUBERND on 31 July 2018 at 18:49
oh, the fun with language we can have. how does your system work with these 100% legit towns in Germany and Austria?
this just as a friendly reminder, because what one culture deems inappropriate, another might think nothing of, because the same word has a different meaning there. other brilliant examples from this category are the Mitsubishi Pajero or DJ Bobo.
Comment from Jennifer_Cats on 1 August 2018 at 20:32
Great call out, @GRUBERND! I am actually working on creating filters for each language. Your examples absolutely highlight the obstacle of a strict, one-size-fits-all filter; a profane word in one culture or language may be benign in another, or vice versa. :) Thanks for reading.
Comment from Jennifer_Cats on 1 August 2018 at 20:33
Thanks for reading @SK53! I appreciate the due diligence, and I will consider it in my current work moving forward! :)