OSM processes over 3.1 million feature changes per day. With the help of OSMCha and the ever-vigilant OSM community, many low-quality changes are found and corrected. However, sometimes low-quality labels are not detected and slip into the map.

To help keep the map high quality, I have been applying natural language processing (NLP) techniques to detect low-quality labels. Applying these techniques requires:

  1. Defining low-quality labels and understanding their properties
  2. Determining the motivation of a user who submits a low-quality label

First, when defining a low-quality label, we typically think of labels that include profanity, such as the following. (All of the screenshots below are examples of low-quality labels that I found with my filtering logic. 🙂)

Profane label says: “The Shit”

Profane label says: “Too Damn Far Rd”

However, labels can also be low-quality but not profane, such as the following:

Assertive label says: “Arguably the best location to see evening Manhattan Henge”

Humorous/assertive label says: “I’ve replaced the batteries but my smoke detector won’t stop beeping!”

Flag planting label says: “Jen is here”

Through inspecting millions of labels, I found that the three main motivators of intentional vandalism are:

  1. Flag planting, or the idea that someone wants to assert that they were present in a place
  2. Assertion of quality, or the desire to express an opinion about the quality of someone or something
  3. Humor, such as meme-based labels

On the other hand, the only motivation of unintentional vandalism seems to be naiveté.
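To make these categories concrete, here is a minimal sketch of how a rule-based pass might tag labels. It is not the filtering logic used for the results in this post. The word list, the regular expressions, and the `flag_label` helper are all hypothetical, chosen only so that they match the example labels shown above. Humor is hard to capture with rules, so the sketch uses first-person wording as a rough stand-in.

```python
import re

# Hypothetical, deliberately tiny rules for illustration only.
# A real filter would need a proper lexicon and far more careful patterns.
PROFANITY = {"shit", "damn"}
FLAG_PLANTING = re.compile(r"\b(is|was) here\b", re.IGNORECASE)
ASSERTION = re.compile(r"\b(best|worst|arguably)\b", re.IGNORECASE)
FIRST_PERSON = re.compile(r"\b(i|i've|my)\b", re.IGNORECASE)


def flag_label(label: str) -> list[str]:
    """Return the heuristic categories a label trips; an empty list means no flag."""
    # Normalize curly apostrophes so "I’ve" and "I've" are treated the same.
    normalized = label.replace("’", "'")
    words = set(re.findall(r"[a-z']+", normalized.lower()))

    reasons = []
    if words & PROFANITY:
        reasons.append("profanity")
    if FLAG_PLANTING.search(normalized):
        reasons.append("flag_planting")
    if ASSERTION.search(normalized):
        reasons.append("assertion_of_quality")
    if FIRST_PERSON.search(normalized):
        reasons.append("first_person")
    return reasons


if __name__ == "__main__":
    for example in ["The Shit", "Jen is here", "Main Street"]:
        print(example, "->", flag_label(example))
```

Rules this simple will produce many false positives (a road named "I-95" trips the first-person pattern, for example), so anything flagged this way is best treated as a candidate for human review rather than a verdict.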

Specifically, I inspected over 11 million map labels from the United States of America. With filtering logic applied, I was able to narrow the list of potentially poor-quality labels down to just 256. Of these 256 labels, approximately 30% were true positives. As a result of these findings, I submitted my first OSM edits!
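For a sense of scale, the figures above can be turned into a quick back-of-the-envelope calculation. The variable names are only for illustration, and because the inputs are rounded ("over 11 million", "approximately 30%"), the outputs are rough estimates rather than exact counts.

```python
# Figures quoted in the post; both are approximate.
labels_inspected = 11_000_000  # "over 11 million", so treat this as a lower bound
candidates_flagged = 256       # labels left after filtering
precision = 0.30               # share of flagged labels that were true positives

# Estimated number of genuinely low-quality labels among the flagged candidates.
estimated_true_positives = round(candidates_flagged * precision)

# Fraction of all inspected labels that the filter surfaced for review.
flag_rate = candidates_flagged / labels_inspected

print(f"Estimated true positives: {estimated_true_positives}")
print(f"Share of labels flagged: {flag_rate:.4%}")
```

In other words, roughly 77 of the 256 flagged labels were genuinely low quality, and the filter surfaced only about 0.0023% of the inspected labels for manual review.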

However, these are only preliminary results and I am excited to continue expanding my filtering logic! 🙂

Comment from iandees on 20 July 2018 at 13:50

This is really interesting. How are you applying this work?

Comment from iandees on 20 July 2018 at 13:51

By that I mean: are you going to keep running this? Will you work with the OSMCha team to flag changesets that introduce new low-quality data?

Comment from Jennifer_Cats on 20 July 2018 at 13:54

@iandees Yes! I plan to continue running, improving, and working with other teams to better flag low-quality labels. :)

Comment from GinaroZ on 20 July 2018 at 22:08

I don’t want to discourage your useful work, however I would say that in this case “The Shit” might be a valid name. :-)

It seems to be part of a mountain bike trail in a large forest, and there’s various other names that could be termed as low quality. E.g. “Meth Lab”, “Unemployment Line”, “Ewok Village” etc. In fact, it appears on this map - - so I’d say it should be kept on the OSM map.

Comment from Jennifer_Cats on 20 July 2018 at 22:12

@GinaroZ Totally fair! I think there’s a lot of complexity with detecting anomalous low-quality labels, and I envision my techniques as something to augment or flag human review, such as what you’ve done. I’ve noticed that a lot of labels in hiking areas have a lot more variability/flexibility. Thanks for the due diligence!

Comment from Chetan_Gowda on 21 July 2018 at 15:53

Awesome Jennifer! In my validation experience, vandalism is very rare; if it happens, the community will fix it as soon as possible. The other likely source of low-quality labels on the map is newbies. I’d care more about the “time” gap for detecting and fixing low-quality labels or bad edits on the map. It’d be good to automate these things. Thanks for sharing!

Comment from Jennifer_Cats on 21 July 2018 at 21:33

Hi @Carnildo! Looking strictly at labels, the ratio of false positives to true positives for the examples you listed was too high for me to consider including the feature that surfaces such labels. In the future, I want to find some combination of features that gives a better decision boundary between poor-quality and high-quality labels. I think this may include, as others have suggested, bringing in other data sources/indicators of reputation.

Comment from Jennifer_Cats on 21 July 2018 at 21:34

@Chetan_Gowda Hi! Thanks for the read. :) I’m glad you liked it, and I think your idea is great! (see comment above to @Carnildo). :D Thanks again!

Comment from Piskvor on 23 July 2018 at 09:18

Nice work - could this take non-English data into account, or does it only work for one language currently?

@Adamant1: there’s loc_name for local names - e.g. this tunnel has an unofficial but widely recognized local nickname, unrelated to the official one:

On the other hand, although personal mapping does serve a purpose, it’s often a very different purpose than the rest of OSM data: POIs named “I left my car here”, “my appartment” and “change here to go to Piskvor’s place” are transient and not useful beyond the mapper who added them. (As a resident of a tourist destination, I see these all the time.) Perhaps an app that allows for personal, non-shared markers would serve those better (e.g. OsmAnd, Locus); I think that the users who submit those (e.g. in Maps.Me) have no idea they’re adding them to the public database and no intention to do so (as opposed to the other additions, e.g. restaurants).

With better tagging tools, I’m seeing a decline in name=this is an X during the past few years - that at least is a solvable problem, methinks.

Comment from Jennifer_Cats on 23 July 2018 at 15:51

Thanks for reading, @Adamant1, and bringing up some interesting points! I will try to consider them in my next iteration. :)

Comment from Jennifer_Cats on 23 July 2018 at 15:52

@Piskvor, thanks for taking the time to read! Currently, my logic only handles the English language. Expansion to other languages is my top priority! Hopefully I will publish another blog post when that is done. :D

Comment from Viajero Perdido on 24 July 2018 at 16:20

Bike trails often have “edgy” names, sometimes denoted with little signs. If it’s on a sign, I’m mapping it.

Comment from SK53 on 26 July 2018 at 14:06

The image of Too Damn Far Road seemed very familiar. It is a site in Butler County, PA where the Pennsic War event takes place. Much of the infrastructure mapped there is presumably transitory during the time of the festival. However, I imagine the names of roads are those used during the event.

Comment from GRUBERND on 31 July 2018 at 18:49

oh, the fun with language we can have. how does your system work with these 100% legit towns in Germany and Austria?

this just as a friendly reminder, because one culture deems something inappropriate, another might think of nothing about the same word because it has a different meaning. other brilliant examples from this category are the Mitsubishi Pajero or DJ Bobo.

Comment from Jennifer_Cats on 1 August 2018 at 20:32

Great call out, @GRUBERND! I am actually working on creating filters for each language. Your examples absolutely highlight the obstacle of a strict, one-size-fits-all filter; a profane word in one culture or language may be benign in another, or vice versa. :) Thanks for reading.

Comment from Jennifer_Cats on 1 August 2018 at 20:33

Thanks for reading @SK53! I appreciate the due diligence, and I will consider it in my current work moving forward! :)
