OSMCha is an open source changeset exploration tool originally created by Wille Marcel. In early 2016, a few of us at Mapbox became interested in using this tool to try out validation at the changeset level. Over the course of 2016, we made several improvements to the tool. As of this morning, we have reviewed more than 23,000 changesets and found 1,150 of them to be harmful to the map. The OSMCha database contains useful changeset metadata such as changeset ID, username, editor used, changeset comment, source, imagery used, and timestamp.
You can download a CSV of all the reviewed changesets here. For community members who are interested in validating the map using OSMCha, our validation guide is a good starting point for understanding the tool, how we use it, and how to validate your own neighborhood.
A few things to note
OSMCha does not parse all changesets from OSM. A few go unparsed each day because of various edge cases that we are working on fixing. So treat the numbers on OSMCha as close estimates rather than absolute counts.
Some of the mapping activity marked as harmful in OSMCha is not necessarily harmful in itself. Undiscussed, unannounced imports in OSM are constantly tracked and reverted by the DWG. These edits do not necessarily contain mapping mistakes, but they are turned away to uphold the data import protocol, accuracy on the map and local community accord.
Accordingly, the mass deletions of such imports in revert changesets by DWG cleanup accounts like Woodpeck_repair are marked as good edits. These can be ignored by filtering out repair accounts.
The reviewed changesets come from random places on the map and are not specific to any place. For area-specific filtering, we can use the bbox filter in OSMCha or filter manually, as the CSV contains the bbox information for each changeset.
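The manual filtering described above can be sketched in a few lines of Python. This is only an illustration: the column names (`id`, `user`, `harmful`, `min_lon`, `min_lat`, `max_lon`, `max_lat`), the sample rows and the area of interest are all assumptions, and the real CSV export may name or encode its bbox differently.

```python
import csv
import io

# Hypothetical sample rows; the real CSV's columns and bbox encoding may differ.
SAMPLE_CSV = """id,user,harmful,min_lon,min_lat,max_lon,max_lat
1,alice,False,77.55,12.90,77.65,13.00
2,Woodpeck_repair,False,-0.20,51.45,-0.05,51.55
3,bob,True,77.58,12.95,77.60,12.97
4,carol,True,2.30,48.80,2.40,48.90
"""

# Cleanup accounts to skip; extend this set as needed.
REPAIR_ACCOUNTS = {"Woodpeck_repair"}


def bboxes_intersect(a, b):
    """Return True if two (min_lon, min_lat, max_lon, max_lat) boxes overlap."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def filter_changesets(rows, area, exclude_users=REPAIR_ACCOUNTS):
    """Keep rows whose bbox intersects `area` and whose user is not excluded."""
    kept = []
    for row in rows:
        if row["user"] in exclude_users:
            continue
        box = tuple(
            float(row[key]) for key in ("min_lon", "min_lat", "max_lon", "max_lat")
        )
        if bboxes_intersect(box, area):
            kept.append(row)
    return kept


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
area_of_interest = (77.4, 12.8, 77.8, 13.2)  # illustrative bbox, not from the post
local_ids = [row["id"] for row in filter_changesets(rows, area_of_interest)]
print(local_ids)
```

The intersection test is a plain rectangle overlap, so it does not handle changesets whose bbox crosses the antimeridian.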
Since we have a big dataset of reviewed changesets, we can look for correlations among harmful changesets to find patterns of vandalism on OSM. I did a basic analysis using a recently added metadata filter on the OSMCha stats page, which gave the estimates below.
Editor-wise breakdown of changesets marked as harmful
Editor-wise breakdown of changesets reviewed
Filters we found to be successful
These are the percentages of reviewed changesets matching each filter that were marked harmful.
iD+suspect word : 14.1%
iD+mass deletion : 7.9%
Potlatch+mass deletion : 5.8%
JOSM+suspect word : 5.8%
JOSM+mass deletion : 4.9%
Maps.me : 3.7%
- The suspect word filter flags changesets that contain any of the words apple, google, nokia, here, waze, tomtom, import or wikimapia in the changeset comment or source.
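To make the suspect word filter and the percentages above concrete, here is a minimal sketch of how such a flag and a per-editor harmful rate could be computed. It is not OSMCha's actual implementation: the whole-word, case-insensitive matching rule, the field names (`editor`, `comment`, `source`, `harmful`) and the sample changesets are assumptions made for illustration.

```python
import re
from collections import defaultdict

# Word list taken from the post; the matching rule below is an assumption.
SUSPECT_WORDS = [
    "apple", "google", "nokia", "here", "waze", "tomtom", "import", "wikimapia",
]
SUSPECT_RE = re.compile(r"\b(" + "|".join(SUSPECT_WORDS) + r")\b", re.IGNORECASE)


def has_suspect_word(changeset):
    """Return True if the comment or source contains a suspect word."""
    text = f"{changeset.get('comment', '')} {changeset.get('source', '')}"
    return SUSPECT_RE.search(text) is not None


def harmful_rate_by_editor(changesets):
    """Percentage of flagged, reviewed changesets marked harmful, per editor."""
    reviewed = defaultdict(int)
    harmful = defaultdict(int)
    for cs in changesets:
        if not has_suspect_word(cs):
            continue
        key = f"{cs['editor']}+suspect word"
        reviewed[key] += 1
        harmful[key] += int(cs["harmful"])
    return {key: round(100 * harmful[key] / reviewed[key], 1) for key in reviewed}


# Hypothetical reviewed changesets, invented for this example.
sample = [
    {"editor": "iD", "comment": "copied from Google Maps", "source": "", "harmful": True},
    {"editor": "iD", "comment": "added a shop here", "source": "survey", "harmful": False},
    {"editor": "iD", "comment": "fix road geometry", "source": "Bing", "harmful": False},
    {"editor": "JOSM", "comment": "building import", "source": "", "harmful": False},
]
rates = harmful_rate_by_editor(sample)
print(rates)
```

In this sample the second changeset is flagged only because "here" is also an ordinary English word, which suggests one reason a flagged changeset still needs a human review before it is marked harmful.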
Having a database of OSM edits classified as good or harmful can help future efforts to build smart anomaly detection tools and machine learning algorithms that better protect the map.
We look forward to continuing validation with OSMCha, refining its changeset flagging heuristics, and collaborating with the community on better open tools to protect the map.
Let us know your thoughts on how this can be taken forward, and share your insights for improving feature-level detection.