OpenStreetMap

Validating Wikidata tags on OpenStreetMap

Posted by Amisha Singla on 13 February 2017 in English (English)

The OpenStreetMap database is being enriched with Wikidata tags on a daily basis, with over 500,000 features tagged to date. Tagging is generally done by matching the name and location of a popular map feature to its corresponding Wikidata item, if one exists. See the OSM Wiki page on Wikidata for more information.

This is currently done manually and requires local knowledge to avoid connecting unrelated features between the two databases. The most common mix-ups are:

  • Features with the same name that lie in entirely different geographical areas, e.g. cities named Salem in the US and in India.

  • Features with the same name but of a different type in the same location, e.g. a railway station matched to a nearby landmark of the same name.

In such cases, there is a high chance of linking the wrong Wikidata item to an OSM feature if one doesn't carefully compare the locations of both. Apart from this, many errors come from copy-pasting the wrong Wikidata QID. This post introduces a validator tool for reviewing such mismatches based purely on location.

Validating Wikidata tags on OSM features using the wikidata-osm validator

wikidata-osm is a visual validator tool that spots possible Wikidata tag mismatches by comparing the locations of the OSM feature and the Wikidata item, and highlights those where the distance between the two is greater than a threshold distance set by the user.

Validator tool: Wikidata-tagged map features highlighted based on the distance between their locations in the OSM and Wikidata databases.
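The distance comparison described above can be sketched in a few lines. This is only an illustration and not the tool's actual code: the function names `haversine_km` and `exceeds_threshold` are made up for this example, and the great-circle (haversine) formula is an assumption about how the distance could be computed.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; an assumption of this sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two WGS84 points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def exceeds_threshold(osm_latlon, wikidata_latlon, threshold_km):
    """True when the OSM and Wikidata locations are further apart than the threshold."""
    return haversine_km(*osm_latlon, *wikidata_latlon) > threshold_km


# Approximate coordinates for the two Salems mentioned earlier (illustrative values).
salem_us = (44.94, -123.04)
salem_india = (11.66, 78.15)
print(exceeds_threshold(salem_us, salem_india, threshold_km=100))
```

A flagged pair is only a candidate for review: a large distance does not by itself prove that the tag is wrong.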

Using the tool

Each circle on the map represents an OSM feature that has a Wikidata tag. The color and size of the circle depend on the distance between the OSM feature and the Wikidata item it is tagged with. Larger red circles represent features with a higher chance of being erroneous, while smaller green circles represent features with a lower chance.

The threshold distance in the top-left pane has to be set by the user. It varies depending on the type of place being reviewed. For example, when reviewing Wikidata tags for large countries, one can set the threshold to ~100 km, because the Wikidata coordinates of a large country can plausibly be 100 km away from its OSM coordinates. For small countries or neighbourhood-level places, this value can be lowered.
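As a rough sketch of how a distance and a user-chosen threshold could be turned into a circle style: the function `circle_style`, its radius range, and the linear scaling are all assumptions made for illustration, and the tool's real styling rules may differ.

```python
def circle_style(distance_km, threshold_km, min_radius=4.0, max_radius=20.0):
    """Map a match distance to a hypothetical circle colour and radius.

    Distances above the threshold are drawn red, the rest green. The radius
    grows linearly with distance and is capped once the threshold is reached.
    """
    if threshold_km <= 0:
        raise ValueError("threshold_km must be positive")
    ratio = min(distance_km / threshold_km, 1.0)
    radius = min_radius + ratio * (max_radius - min_radius)
    color = "red" if distance_km > threshold_km else "green"
    return {"color": color, "radius": radius}


# A large country reviewed with a 100 km threshold versus a neighbourhood with 1 km.
print(circle_style(distance_km=60, threshold_km=100))
print(circle_style(distance_km=60, threshold_km=1))
```

The same 60 km discrepancy is unremarkable under the country-level threshold and clearly flagged under the neighbourhood-level one, which is why the threshold is left to the reviewer.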

[Screenshot: using the tool]

Clicking on any circle opens the view shown above, which displays the locations of both the Wikidata item and the OSM feature. The right panel also lists the tags, the Wikidata item URL and the OSM feature URL, which can help in validating the mismatch.

The tool was made to help various communities review and improve Wikidata tagging in their local areas, since there were no existing tools for this. The features displayed on the map are a static snapshot from December 2016, but clicking any feature will fetch the latest location information from OSM and Wikidata.

Feel free to play with the code on GitHub and make any improvements that would make it easier to validate Wikidata tags.

Other Wikidata validation tools

Yurik has done a tremendous amount of work in the last few months to bring OSM and Wikidata closer. He recently spoke about using the instanceOf property of the linked Wikidata items to flag potentially incorrectly matched features in this OSM-talk thread, and shared an open list of questions to evolve best practices for linking features between the two databases.

If you have any feedback or ideas on how to improve such processes, or know of other tools that were missed, it would be great to hear from you.
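A type-based check of this kind could look roughly like the sketch below. It is not Yurik's implementation: the `EXPECTED_CLASSES` table and the `type_mismatch` function are hypothetical, and the sketch assumes the "instance of" (P31) values of the linked item have already been fetched.

```python
# Hypothetical mapping from an OSM tag to acceptable Wikidata classes.
# Q515 is "city" and Q55488 is "railway station" on Wikidata.
EXPECTED_CLASSES = {
    ("place", "city"): {"Q515"},
    ("railway", "station"): {"Q55488"},
}


def type_mismatch(osm_tags, instance_of_qids):
    """Compare OSM tags with the P31 values of the linked Wikidata item.

    Returns False if any applicable rule is satisfied, True if rules apply
    but none is satisfied, and None if no rule covers this feature.
    """
    found = set(instance_of_qids)
    rule_applied = False
    for tag in osm_tags.items():
        allowed = EXPECTED_CLASSES.get(tag)
        if allowed is None:
            continue
        rule_applied = True
        if found & allowed:
            return False
    return True if rule_applied else None


# A railway station linked to an item that is an instance of "city".
print(type_mismatch({"railway": "station", "name": "Salem"}, ["Q515"]))
```

Real Wikidata classes form a hierarchy through "subclass of" (P279), so a practical check would also need to accept subclasses of the expected class. This sketch only compares direct P31 values.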

Location: Indiranagar 1st Stage, Indiranagar, East Zone, Bengaluru, Bangalore Urban, Karnataka, 560038, India

Comment from Glassman on 13 February 2017 at 16:52

wikidata-osm is a nice visual tool but it's not suitable to fixing problems or even marking which problems have been fixed or marking ones as valid/invalid. That would save time by not having people look at a tag that someone else has already investigated. I would suggest using maproulette.

Maproulette could be used to mark the osm wikidata tag as valid - such as Unimak Island in Alaska. According to wikidata-osm the two nodes are both on the island but separated by a few kilometers. Or Sanak Island, also in Alaska. OSM has the node on the correct island while it looks like wikipedia has the node in the ocean.

By using Maproulette, nodes could be either marked valid, invalid or fixed. This way even Wikipedia could be improved.

Comment from BushmanK on 13 February 2017 at 18:27

I'd even say that "circle of error" representation together with text located in the top left block could promote "mapping for Wikidata". It says:

Large red circles mean the distance is greater than 10 kilometers, which indicates a higher error rate

So, anyone who got used to other validators (where errors are detected by it and presented for confirmation and correction by mappers) could get an idea, that OSM feature is misplaced when there is a large red circle. While it only shows a distance that could probably mean there is an error (but not necessarily).

And there are a lot of false errors related to long linear objects such as rivers because Wikidata works with points only. So, a map gets cluttered and unreadable. Same applies to relatively large areas - there are several ways to define a center of an area. OSM could use one method, Wikidata - another one.

The whole tool only gives you a vague hint on what could be referenced to Wikidata by mistake, it is not a validator by any means.

Comment from Warin61 on 14 February 2017 at 08:01

At least one more of the location errors is on the Wikidata database.

As for "Large red circles mean the distance is greater than 10 kilometers, which indicates a higher error rate" .. errr no not an error rate but a large discrepancy between the two.

So .. interesting .. but needs work as detailed by the other comments above.

There could be a lot of OSM objects that have Wikidata that are not referenced. That too might be a useful tool for some.

Comment from SK53 on 14 February 2017 at 17:08

Not mentioned is that wikidata co-ordinates will many times represent data added to Wikipedia from sources which are not compatible with the OSM licence.

Equally I presume that there are many objects both in wikidata and OSM (notably imported GNIS nodes) which may agree in location but in fact are wrong.

Some wikidata objects are a very long way from their actual location: see http://www.openstreetmap.org/node/1244940858. Likely because the originator of a wikipedia article copied over information from another article & forgot to change the lat/long information.

The Italian community had a tool for comparing wikipedia and OSM items (sorry, I don't have a link), but it was based on administrative geography (usually represented in the wikipedia articles & wikidata) and on classes of objects (places, historic buildings ...). There were a number of advantages in my view: one could focus both on topic & area; missing or incorrect data showed up both ways. Fixing things could be done purely by searching for a putative missing object in OSM, adding it to OSM & then potentially fixing wikipedia. In this way the only information I used from wikipedia was a) such and such a place exists; and b) it's located within a given admin geography.

In summary there are multiple reasons why a straight comparison of locations may not only be misleading but may encourage use of undesirable sources for edits which may not improve OSM.

Comment from Amisha Singla on 15 February 2017 at 14:01

wikidata-osm is a nice visual tool but it's not suitable to fixing problems

The whole tool only gives you a vague hint on what could be referenced to Wikidata by mistake, it is not a validator by any means.

As for "Large red circles mean the distance is greater than 10 kilometers, which indicates a higher error rate" .. errr no not an error rate but a large discrepancy between the two.

Some wikidata objects are a very long way from their actual location: see http://www.openstreetmap.org/node/1244940858. Likely because the originator of a wikipedia article copied over information from another article & forgot to change the lat/long information.

Thanks everyone for the feedback. Yes, it is correct to say that larger distances don't necessarily indicate an error. To avoid confusing users, the tool has been renamed to 'Wikidata-OSM Distance Visualizer' to focus on the distance-based visualization, and all the possible cases of a large match distance have been documented in the readme with examples.

It would be up to the local OSM or Wikidata community to analyze the reason for a large distance, based on the context in which the tool is used.

By using Maproulette, nodes could be either marked valid, invalid or fixed.

That sounds like a great idea to take forward and integrate into a tasking tool. It would be great to hear suggestions on a practical way to get started on this.
