New quality checks in the Osmose QA tool for links from OpenStreetMap to Wikidata
Posted by Geonick on 7 July 2022 in English. Last updated on 11 July 2022.
Wikidata is a free knowledge base for linked open data designed to support Wikipedia and its sister projects, such as Wikivoyage. It contains over 97 million entries structured as a “Labeled Property Graph,” which is more powerful than RDF-based graphs. Like OpenStreetMap (OSM), Wikidata (WD) is an open crowdsourcing project with a large and active community.
Since 2014, OSM objects can be linked to WD items through OSM tags. Currently, there are about 5.5 million such Wikidata tags, and their number is growing steadily. These links can be used to create interesting products, for example a map of castles enriched with factual data from WD. However, the quality of these manually entered links in OSM has so far been unknown and untested. Note also that in the opposite direction, from WD to OSM, the preferred way to link is to use only coordinates (WD property P625). WD properties such as P402 should not be used for this purpose, because P402 covers only OSM relations.
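To make the linking mechanism concrete, here is a minimal Python sketch of how the value of a Wikidata tag on an OSM object could be parsed and checked for a well-formed item ID. This is only an illustration and not the code of the tool described below. The function name `parse_wikidata_tag` and the choice to accept several IDs separated by semicolons are assumptions made for this example.

```python
import re

# A Wikidata item ID is the letter "Q" followed by a positive integer
# without leading zeros, e.g. "Q64".
QID_PATTERN = re.compile(r"Q[1-9][0-9]*")


def parse_wikidata_tag(value: str) -> list[str]:
    """Return the item IDs contained in a Wikidata tag value.

    Hypothetical helper: the value is split on ';' so that a tag holding
    several IDs is handled too. A ValueError is raised as soon as one
    part is not a well-formed item ID.
    """
    parts = [part.strip() for part in value.split(";")]
    for part in parts:
        if not QID_PATTERN.fullmatch(part):
            raise ValueError(f"not a Wikidata item ID: {part!r}")
    return parts
```

A syntactically valid ID says nothing about whether the item exists or fits the OSM object. It is only the cheapest check one can run before looking anything up in WD.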
Now, two computer science students, Jari Elmer and Timon Erhart, from the University of Applied Sciences of Eastern Switzerland (OST), with the help of Sascha Brawer - a young software engineer in “un-retirement” and Wikipedian - have developed an application called “osm wikidata quality checker”. The goal was to check the existing links from OSM to WD. The errors found - for example invalid WD entries in OSM - are also sent, together with a suggested correction, to Osmose, a quality assurance tool for detecting problems in OSM data. A further goal was for the application to become an integral part of OSM’s quality assurance ecosystem. It handles the large amounts of data in the two databases (about 1.5 TB each).
The successful result of the thesis is a data processing pipeline capable of finding diverse types of erroneous Wikidata links in OSM with an accuracy of over 95%. By using multiprocessing and a purpose-built database model into which only the relevant data is extracted, the tool is able to handle the large amount of data and check the whole world on a weekly basis. The difficulties of dealing with crowdsourced data, where unforeseen data errors are to be expected, were also mastered, resulting in robust software. Documentation and an easy-to-understand architecture allow the tool to be extended and additional checks to be implemented. The optional configuration provides the necessary flexibility in operation and helps with further development.
Currently, a total of over 30,000 errors are found in the following nine categories:
- Incorrect value for Wikidata tag
- Wikidata item does not exist
- Redirected value for Wikidata tag
- The distance between OSM object and linked Wikidata item is unusually large
- Characteristics of the OSM tags and linked Wikidata item do not match
- The secondary Wikidata tag and the linked Wikidata item do not match
- The OSM object is linked to an unpermitted Wikidata item
- Unpermitted link to an instance of living organism on Wikidata
- The OSM object does not match the Wikidata item
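To give an idea of what one of these checks involves, the distance category can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the implementation used in the tool. It assumes that both the OSM object and the Wikidata item (via P625) are reduced to a single latitude/longitude pair, and the 10 km threshold as well as the names `haversine_km` and `is_unusually_far` are purely hypothetical.

```python
from math import asin, cos, radians, sin, sqrt

# Mean Earth radius in kilometres (spherical approximation).
EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def is_unusually_far(osm_coord, wd_coord, threshold_km=10.0):
    """Flag a link whose two (lat, lon) coordinates lie far apart.

    threshold_km is an assumed value for this example only; a real check
    would probably need different limits for different kinds of objects.
    """
    return haversine_km(*osm_coord, *wd_coord) > threshold_km
```

A single fixed threshold is the simplest possible design. Large objects such as rivers or countries can legitimately have coordinates far from any one point on their OSM geometry, which is one reason such a check reports "unusually large" distances for review rather than declaring the link wrong.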
We are happy that these categories have already been incorporated into Osmose (see e.g. this Tweet) and are also ready to be integrated into other tools, e.g. the iD editor.
This is the OSM Wiki page of the tool. We are now searching for a permanent place to host this data processing pipeline.
Discussion
Comment from Claudius Henrichs on 18 July 2022 at 12:11
Thanks for creating this report 👏 Great QA tool to use in combination with https://osm.wikidata.link/ to complete the mapping OSM->WD
Comment from Mateusz Konieczny on 23 December 2022 at 19:18
https://wiki.openstreetmap.org/wiki/OpenStreetMap_Wikidata_Quality_Checker
Can you publish it?
Comment from Geonick on 26 December 2022 at 13:24
Hi Mateusz: The code repo has been published in the mentioned Wiki page https://wiki.openstreetmap.org/wiki/OpenStreetMap_Wikidata_Quality_Checker now also as “Project repository”.
Comment from Mateusz Konieczny on 26 December 2022 at 16:39
Thanks! I will look at it!
I wonder would it be feasible for you to detect problem of “wikipedia/wikidata linked multiple times and it is not waterway/pipeline/railway/roads/etc[1] feature where linking multiple times may be correct” (or maybe you detect this kind of issue already?)
[1] I have list of such features and Ruby script trying to detect features invalidly linked multiple times.
BTW, I tried editing using Osmose and run into https://github.com/osm-fr/osmose-frontend/issues/437
Comment from Geonick on 26 December 2022 at 17:20
Hi Mateusz: I’ll forward your comments to Timon, the maintainer of the OSM Wikidata Quality Checker, who is now working in my institute IFS at OST.
Comment from turbotimon on 26 December 2022 at 22:43
@Geonick: thanks for this post and also for adding repo link to the osm wiki page!
@Mateusz: “detect problem of “wikipedia/wikidata linked multiple times”: Interesting idea, I created an issue for that. Please feel free to add more details, like your list there!