Wikidata is a free knowledge base for linked open data designed to support Wikipedia and its sister projects, such as Wikivoyage. It contains over 97 million items structured much like a “Labeled Property Graph,” in which statements can themselves carry qualifiers and references, something plain RDF triples do not express directly. Like OpenStreetMap (OSM), Wikidata (WD) is an open crowdsourcing project with a large and active community.
Since 2014, OSM objects can be linked to WD items through the wikidata tag. Currently, there are about 5.5 million such Wikidata tags, and their number is growing steadily. These links can be used to create interesting products, for example a map of castles enriched with factual data from WD. However, the quality of these manually entered links in OSM has so far been unknown and untested. Note also that in the other direction, from WD to OSM, the preferred way to link is through coordinates only (WD property P625); WD properties such as P402 should not be used for this, because P402 covers only OSM relations.
Now, two computer science students, Jari Elmer and Timon Erhart, from the University of Applied Sciences of Eastern Switzerland (OST), with the help of Sascha Brawer, a young software engineer in “un-retirement” and Wikipedian, have developed an application called “osm wikidata quality checker”. Its purpose is to check the existing links from OSM to WD. The errors found, for example invalid WD entries in OSM, are also sent to Osmose with a suggested correction. Osmose is a quality assurance tool for detecting problems in OSM data. A further goal was for the application to become an integral part of OSM’s quality assurance ecosystem. It handles the large amounts of data in the two databases (about 1.5 TB each).
The result of the thesis is a data processing pipeline that finds diverse types of erroneous Wikidata links in OSM with an accuracy of more than 95%. By using multiprocessing and a purpose-built database model into which only the relevant data is extracted, the tool can handle the large amount of data and check the whole world on a weekly basis. The difficulties of dealing with crowdsourced data, where unforeseen data errors are to be expected, were also mastered, resulting in robust software. Documentation and an easy-to-understand architecture allow the tool to be extended and additional checks to be implemented. The optional configuration provides the necessary flexibility in operation and helps with further development.
Currently, a total of over 30,000 errors are found in the following nine categories:
- Incorrect value for Wikidata tag
- Wikidata item does not exist
- Redirected value for Wikidata tag
- The distance between OSM object and linked Wikidata item is unusually large
- Characteristics of the OSM tags and linked Wikidata item do not match
- The secondary Wikidata tag and the linked Wikidata item do not match
- The OSM object is linked to an unpermitted Wikidata item
- Unpermitted link to an instance of living organism on Wikidata
- The OSM object does not match the Wikidata item
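
To make two of these categories more concrete, here is a minimal Python sketch of how an "incorrect value" check and an "unusually large distance" check could look. It is not the code of the osm wikidata quality checker. The function names, the ID pattern (a "Q" followed by a positive integer without leading zeros), and the 50 km threshold are assumptions chosen for illustration only.

```python
import math
import re

# Assumption: a well-formed Wikidata item ID is "Q" followed by a
# positive integer without leading zeros (e.g. "Q72").
QID_PATTERN = re.compile(r"Q[1-9][0-9]*")

# Mean Earth radius in kilometres, used by the haversine formula.
EARTH_RADIUS_KM = 6371.0088


def is_valid_qid(value):
    """Return True if the tag value looks like a single Wikidata item ID."""
    return QID_PATTERN.fullmatch(value) is not None


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two WGS84 points (degrees)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def is_unusually_far(osm_coord, wd_coord, threshold_km=50.0):
    """Flag a link whose OSM position and WD coordinate (P625) are far apart.

    Both coordinates are (lat, lon) tuples in degrees. The threshold is a
    hypothetical default; a real check would likely depend on object type.
    """
    return haversine_km(*osm_coord, *wd_coord) > threshold_km


if __name__ == "__main__":
    print(is_valid_qid("Q72"))    # well-formed ID
    print(is_valid_qid("q72"))    # lower-case prefix is rejected
    # Zurich vs. Bern, coordinates rounded for the example
    print(is_unusually_far((47.3769, 8.5417), (46.9480, 7.4474)))
```

A fixed threshold is the simplest possible rule. For large objects such as rivers or countries, a point coordinate on the Wikidata side can legitimately lie far from any single point of the OSM geometry, so a production check would need to take the object's extent into account.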
We are happy that these categories have already been incorporated into Osmose (see e.g. this Tweet) and are ready to be integrated elsewhere as well, e.g. in the iD editor.
This is the OSM Wiki page of the tool. We are now searching for a permanent place to host this data processing pipeline.