Validating the map - Part 1

Posted by PlaneMad on 6 April 2016 in English (English)

Hello Kitty

Quality assurance in OSM is a pretty hot topic. The first question people ask when faced with the concept of a map that anyone can edit is: how can one guarantee the quality of the data?

In a constantly changing world, it's impossible to guarantee that any map is accurate unless it was physically surveyed recently. The reason OpenStreetMap has such high-quality map data for many parts of the world is precisely that the project allows anyone to update the map so easily. To ensure quality, we just need more users of the map for each location, and to empower them to update anything that does not match reality.

Recent incidents

For this to work effectively, it's obvious we need to make it as simple as possible for map users to spot mistakes and make an update. Some interesting data incidents that were caught by the community recently:

This is just a small list of examples that the data team at Mapbox stumbled into while investigating issues reported in our map feedback system. We have been using an amazing tool by willemarcel called osmcha, which acts as a handy dashboard to review and investigate changesets. By looking for words like "reverted" or "vandalism" in the changeset comments, it's possible to identify changesets that corrected a previous mistake.
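The keyword search described above can be sketched in a few lines of Python. This is only an illustration, not how osmcha works internally: the `Changeset` class, the `find_corrections` helper, the keyword list and the sample data are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical keyword list; "revert" also matches "reverted" and "reverting".
KEYWORDS = ("revert", "vandalism")


@dataclass
class Changeset:
    """Minimal stand-in for a changeset record (illustrative fields only)."""
    id: int
    user: str
    comment: str


def looks_like_correction(changeset, keywords=KEYWORDS):
    """Return True if the changeset comment mentions any keyword (case-insensitive)."""
    comment = changeset.comment.lower()
    return any(keyword in comment for keyword in keywords)


def find_corrections(changesets, keywords=KEYWORDS):
    """Keep only the changesets whose comment suggests a fix of an earlier edit."""
    return [c for c in changesets if looks_like_correction(c, keywords)]


# Made-up sample data, for illustration only.
sample = [
    Changeset(1, "mapper_a", "Added buildings from survey"),
    Changeset(2, "mapper_b", "Reverted accidental deletion of city label"),
    Changeset(3, "mapper_c", "Fixing vandalism in park area"),
]

corrections = find_corrections(sample)
print([c.id for c in corrections])
```

A plain substring match like this will miss corrections that were described in other words or other languages, so it is best seen as a first filter before manual review.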

The interesting observation here is that many of these issues were accidental changes by the mapper, which could easily have been avoided had they been more careful or had there been more sanity checks in the software. Most of the incidents have been fixed, usually by alerting the mapper and undoing their changes.

Something to be concerned about is the time it takes for an issue to be detected, which ranges up to a month for something as significant as the deletion of the label for a major city like Buffalo. But once the mistake is spotted, action and recovery are swift, usually taking a few minutes. This raises an interesting question: why does it take so long for issues to be identified? And what happens to issues that are invisible in the map style?

Identifying data issues

While many mistakes are easy to identify visually in the map style, especially if they involve large features like cities, forests, lakes or motorways, there are probably thousands of issues on smaller features, involving inconsistent access tags, broken geometries or relations, that would never be spotted or fixed unless an expert mapper stumbled on them. And there will be numerous cases of incorrect data intentionally added to the map that go unnoticed until a local mapper finds them.

Find what's wrong in the map universe with osmose

This is where tools for quality assurance of the OSM data come into play to address the issue. Some of the more popular ones are Osmose, Maproulette, Improveosm, OSM Inspector and keepright. These tools have greatly helped to bring in some sense of quality monitoring to the map, and could become a lot more powerful when part of an integrated validation environment.

This was one of the reasons that prompted us to create osmlint, which makes it possible to run a global analysis of OSM data in a few minutes. Anyone want to know how many buildings in the world are shaped like cats? Give it a try ;)

Identifying suspicious behaviour

Another aspect to consider while identifying data issues is mapper behaviour. Currently it's common for mappers to check the edit count of a contributor to evaluate how experienced or trusted the user is. Pascal Neis has done some amazing research on user behaviour, and the popular tool HDYC provides a detailed contribution profile for any mapper.
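As a rough sketch of the edit-count check mentioned above, the snippet below buckets a contributor by number of edits. The function name and the thresholds (50 and 1000) are hypothetical values chosen for illustration; they do not come from HDYC or any other tool.

```python
def experience_level(edit_count, newcomer_max=50, regular_max=1000):
    """Bucket a mapper by edit count using assumed, illustrative thresholds."""
    if edit_count < 0:
        raise ValueError("edit_count cannot be negative")
    if edit_count <= newcomer_max:
        return "newcomer"
    if edit_count <= regular_max:
        return "regular"
    return "experienced"


# Made-up contributors, for illustration only.
for name, edits in [("mapper_a", 3), ("mapper_b", 420), ("mapper_c", 25000)]:
    print(name, experience_level(edits))
```

Edit count alone is a crude proxy: a high count says nothing about the quality of the edits, and a first-time mapper may know their neighbourhood better than anyone else.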

An open question is whether aspects such as user reputation and experience should play a part in validation workflows, and to what extent they can be relied upon to catch problematic behaviour like spamming and vandalism.

The future of validation

Now is a great time to get involved in the OSM data validation puzzle, with a growing number of people using the map outside the OSM website through services like Strava and GitHub. There have also been some major technology leaps that allow us to process more data much faster than before, opening up new technical approaches for solving the puzzle.

One of the big questions for our team at Mapbox is: how can we make it as simple as possible for anyone who uses the map to validate it? Reaching this goal would greatly enhance the accuracy of the map data, keeping it as close to ground reality as practical.

Have you thought of OSM validation before? What other big questions should we all be thinking about?

Smile, you are on OSM

Comment from robert on 7 April 2016 at 20:25

Couldn't agree more. Suspicious behaviour detection is particularly something we could do with.

Comment from RicoElectrico on 7 April 2016 at 20:28

Have you thought of OSM validation before? What other big questions should we all be thinking about?

I think we should deal with errors/vandalism right as they happen. Too often we see questionable edits only to find out the user is inactive.

It would make sense to come up with some recommended practices on how to communicate with users (wrt. wording, content and call to action - which anecdotally seems to increase chances of response / fixing). Then sometimes I don't know what to do if a user who is still editing doesn't respond - after all it seems to be a little over-the-top to report every non-communicating user to the DWG. Maybe we could look into giving more people the ability to make until-read-message user blocks.

Another thing is to especially monitor countries with very little established community (in practice everything except Europe, NA, Japan and a few more countries). The number of sloppy edits that go under the radar is too high for the quality map we strive to be. And, in the long term, we should help these countries to grow an autonomous, self-healing community - which doesn't really need many resources besides people's time and Internet access.

Finally, the problems should be eliminated at the source. Many of them seem to stem from shortcomings of iD. I always feel awkward when making a changeset comment that boils down to "hey, you made a mistake, but you couldn't have known that because it was caused by deficiency of iD" (real-world example: 1) no support of addr:place 2) people adding names to addresses instead of making correctly tagged POIs). While I've seen bhousel (the maintainer) change his stance a little bit, there's still too little awareness on how iD could do a better job to guide users and be more foolproof.

Comment from SomeoneElse on 9 April 2016 at 12:13

As well as "detecting problems" we also need to ask why a problematic edit was made.

For example, "doodle the dog" in the example above - was that drawn by a bored college student who'd been dragged into a HOT mapping session against their will? If so, perhaps we could try and engage with the person running the class?
