Scaling multilingual name tags with Wikidata

Posted by PlaneMad on 8 November 2016 in English (English). Last updated on 9 November 2016.

Wikidata, the crowdsourced database of structured knowledge run by the Wikimedia movement, has grown to over 24 million entries and now has structured information for every major settlement on earth. This includes extremely useful properties like multilingual labels, statistics like population and GDP, and other related information about the politics, history and media of the place (see London, New York City, Timbuktu).

Geolocated articles on Wikidata, with those added in the last year highlighted in pink. Source: Wikidata Map

Current state of multilingual tags

One of the great strengths of OSM is that its data can be used to create multilingual maps, which make the map accessible to many more readers than just the local population. Since the beginning of the project, the community has been adding various name:code tags for this purpose, which has resulted in map features with an ever-growing list of multilingual names, e.g. the node for London has 171 properties, of which 155 (90%) are name tags in various languages.

A more scalable approach would be to leverage the Wikidata entry for London, which has the translated name in 248 languages, a number that grows automatically with every Wikipedia page about the city that is created in a new language.

This would also enable the translation of a map into languages that would be considered non-local on OSM and not worth adding to the map, e.g. Ukrainian labels for cities and towns in the UK.

The first step to start leveraging the power of Wikidata from OSM is adding a simple wikidata property to the feature on OSM, with the QID of the corresponding concept on Wikidata as its value, e.g. wikidata=Q84 for London. Check out this video by user:polyglot on doing this via the JOSM Wikipedia plugin or via the iD editor.
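To make the idea concrete, here is a minimal sketch of how the labels of a Wikidata item could be turned into name:code style tags once a feature carries wikidata=Q84. The helper names `entity_data_url` and `labels_to_name_tags` are made up for this illustration, the sample entity is a hand-written excerpt rather than a real API response, and the sketch assumes Wikidata language codes can be used directly as OSM name:code suffixes, which does not hold for every code.

```python
from typing import Dict


def entity_data_url(qid: str) -> str:
    """Build the URL of the JSON export for a Wikidata item (e.g. Q84)."""
    return f"https://www.wikidata.org/wiki/Special:EntityData/{qid}.json"


def labels_to_name_tags(entity: Dict) -> Dict[str, str]:
    """Convert an entity's labels into OSM-style name:<code> tags.

    Assumes the entity dict has a "labels" mapping of
    language code -> {"language": ..., "value": ...}.
    """
    labels = entity.get("labels", {})
    return {f"name:{code}": label["value"] for code, label in sorted(labels.items())}


# Hand-written excerpt for illustration only, not a full Wikidata response.
sample_entity = {
    "id": "Q84",
    "labels": {
        "en": {"language": "en", "value": "London"},
        "fr": {"language": "fr", "value": "Londres"},
        "uk": {"language": "uk", "value": "Лондон"},
    },
}

name_tags = labels_to_name_tags(sample_entity)
for key, value in name_tags.items():
    print(f"{key}={value}")
```

A map renderer could do this lookup at build time instead of storing every translation as a tag on the OSM feature.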

Matching Wikidata items to OSM

Just like OSM, Wikidata items for places have tags describing the feature, and coordinates that make it possible to automatically match a feature on OSM to the corresponding feature on Wikidata. Unfortunately, the geographical accuracy of Wikidata entries cannot be trusted, as many of the coordinates are derived from Wikipedia pages, which in turn are usually derived from Google Maps. Moreover, entries for lesser-known places may not be tagged correctly on Wikidata and might result in ambiguous matches to an OSM feature. For this reason, manual confirmation of a match is necessary.

On the Mapbox data team, we have been experimenting with adding Wikidata tags to cities and towns on OSM based on an exact name and location match. The possible matches were loaded into a spreadsheet with the match distance and the Wikidata description of the corresponding item. During a manual review, it's easy to confirm the match with a very high degree of confidence based on the name, distance and description of the match. With this approach, we have found that just an exact name and location match can give a 99% success rate for places.
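The review spreadsheet described above could be produced along these lines. This is only a sketch under stated assumptions, not the actual tooling used: the record layout, the function names, the placeholder IDs `node/demo` and `Q_EXAMPLE`, the descriptions and the approximate coordinates are all invented for illustration, and the distance is a plain haversine great-circle distance.

```python
import math
from typing import Dict, List

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two WGS84 points, in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def candidate_rows(osm_place: Dict, wikidata_items: List[Dict]) -> List[Dict]:
    """Return one review row per Wikidata item whose label equals the OSM name."""
    rows = []
    for item in wikidata_items:
        if item["label"] != osm_place["name"]:
            continue  # exact name match only
        distance = haversine_km(osm_place["lat"], osm_place["lon"], item["lat"], item["lon"])
        rows.append({
            "osm_id": osm_place["id"],
            "qid": item["qid"],
            "distance_km": round(distance, 1),
            "description": item["description"],
        })
    return sorted(rows, key=lambda row: row["distance_km"])


# Illustrative records: IDs other than Q84, descriptions and coordinates are placeholders.
osm_london = {"id": "node/demo", "name": "London", "lat": 51.5074, "lon": -0.1278}
wikidata_items = [
    {"qid": "Q_EXAMPLE", "label": "London", "description": "city in Ontario (illustrative)",
     "lat": 42.98, "lon": -81.25},
    {"qid": "Q84", "label": "London", "description": "capital city (illustrative)",
     "lat": 51.5072, "lon": -0.1275},
    {"qid": "Q_OTHER", "label": "Londonderry", "description": "different name, skipped",
     "lat": 55.0, "lon": -7.3},
]

rows = candidate_rows(osm_london, wikidata_items)
for row in rows:
    print(row)
```

Each row carries the three things the reviewer looks at: the name match, the distance and the description.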

Over 5,300 cities and towns have been updated with corresponding wikidata tags in the last two weeks

There are two cases when the name matching happens:
- Unique matches: One OSM feature matches to one Wikidata feature
- Duplicate matches: One OSM feature matches to multiple Wikidata features with the same name

Unique matches

In most cases, the matched feature on Wikidata is located within a few kilometres of the OSM feature, and by confirming from the description that it is also a city or town, it's possible to confirm that the match is correct. It is important to be careful about the feature description, as in some cases Wikidata may have ambiguous entries that represent multiple concepts, like both a city and a province with the same name, as one object.

For unique matches with a large match distance (>10 km), it is likely that the match is to another place with the same name and is incorrect. In a few rare cases, the match was actually correct and it was the Wikidata location that was found to be wrong.

Duplicate matches

When an OSM feature matches to multiple Wikidata entries with the same name, it is considered a duplicate match. In most cases, a distance filter of around 10 km enables a unique match, and a further look at the description can confirm the match is correct.
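The distance filter can be sketched as a small pre-sorting step before manual review. The function name, the status labels and the exact 10 km cut-off below are assumptions made for this example; the candidate dictionaries are placeholders with a `qid` and a precomputed `distance_km`. Whatever the outcome, the description still needs a human look.

```python
from typing import Dict, List, Optional

MAX_MATCH_DISTANCE_KM = 10.0  # assumed cut-off, based on the "around 10 km" above


def classify_match(candidates: List[Dict], max_km: float = MAX_MATCH_DISTANCE_KM) -> Dict[str, Optional[str]]:
    """Suggest a Wikidata item for one OSM feature from its same-name candidates."""
    if not candidates:
        return {"status": "no_match", "qid": None}
    nearby = [c for c in candidates if c["distance_km"] <= max_km]
    if len(nearby) == 1:
        # Exactly one candidate survives the filter: propose it for review.
        status = "unique" if len(candidates) == 1 else "duplicate_resolved"
        return {"status": status, "qid": nearby[0]["qid"]}
    if not nearby:
        # Probably another place with the same name, or a wrong Wikidata location.
        return {"status": "too_far", "qid": None}
    # Several same-name items close by: only the description can decide.
    return {"status": "ambiguous", "qid": None}


# Placeholder candidates for illustration.
print(classify_match([{"qid": "Q_A", "distance_km": 0.4}]))
print(classify_match([{"qid": "Q_A", "distance_km": 0.4}, {"qid": "Q_B", "distance_km": 5900.0}]))
print(classify_match([{"qid": "Q_B", "distance_km": 5900.0}]))
```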

In a few rare cases multiple OSM features with the same name and location match to a single Wikidata feature. These are places with duplicate nodes on OSM itself and need to be merged.

What next?

Large-scale map features like countries, cities, towns and water bodies are great candidates to start matching with Wikidata, as they are fairly well defined on both projects and can be matched without ambiguity. Doing this will allow us to better understand the value that Wikidata can add to OSM, and help pave the way for more interesting map services that can be built on open data.

There’s been some amazing work from EdwardBetts on matching all of Wikidata to OSM. You can see the results, and this can be a good push to the efforts of contributors like User:Pigsonthewing to bring the two biggest crowdsourced open data projects in humanity closer together.

Comment from d1g on 8 November 2016 at 13:17

Thank you to everyone involved in Wikidata and OpenStreetMap integration!

Currently, Russian administrative divisions (relations) are well curated in OpenStreetMap, but often we don’t have a link back to the OSM relation in Wikidata items. We would appreciate bots in this area.

Comment from DenisCarriere on 8 November 2016 at 14:42

Awesome work! Keep it up!

Comment from pavanvijjapu on 9 November 2016 at 07:57

Good analysis, and the location matching limitations are rightly pointed out.

Comment from ff5722 on 9 November 2016 at 09:32

Good initiative! I will add wikidata links manually when I come across a place that lacks them.

Comment from SimonPoole on 10 November 2016 at 13:25

The tiny weeny issue with this is naturally that there is the underlying assumption that wikidata is correct and that the data meets our quality criteria (as in actually being in use and not invented).

Comment from PlaneMad on 10 November 2016 at 13:54

Since the matching is based on the name, location and description on two databases being coherent, the chances of having invented data being added is really low, unless of course the same invented data made it to both the databases, and we found this did happen with the GNIS place data in the US. Check out this discussion

Still figuring out what the scale of the issue is, since it looks like nobody really reviewed if all these towns were tagged correctly on the map in the last 9 years.

Comment from LogicalViolinist on 14 November 2016 at 16:58

Canada should be mostly complete… those that don’t have wikidata tags don’t have a Wikidata page

Comment from pigsonthewing on 14 November 2016 at 17:30

Great post; and great work! Thank you for the namecheck.

Comment from Skippern on 25 November 2016 at 13:59

I like the idea of using Wikidata to link different platforms of information, but I miss information about APIs and development tools to properly make use of it. The problem is obviously not on the OpenStreetMap side, but rather on the Wikidata side. A usage example: a tool that gets border relations from OSM and collects the names of the city/state/country from name:* tags could also call a Wikidata API for the same purpose. It would then be able to get the names from a broader selection and be less prone to missing names due to limited tagging, e.g. several Chinese provinces have no latinised name tags, though these might exist in Wikidata.

Comment from gorn on 26 November 2016 at 11:10

Great initiative! I see one danger though. Usually a city or village is also connected to a surrounding region. Both the city and the region can be mapped in OSM (and in Czechia they always are), having the same or a similar name. One then needs to be careful to attach the Wikidata label to the right area. If needed, I can easily find an example.
