Digging into Wikidata

Posted by tchaddad on 30 July 2019 in English (English).

The last post explained some of the background on the use of Wikipedia page links and other information in Nominatim. This post looks into Wikidata as another source of information that may be of use.

Wikidata is the knowledge base of the Wikimedia Foundation. It was founded in 2012 with the goal of collecting factual data used in Wikipedia across all languages. The project is maintained by over 20,000 active community contributors.

The Wikidata repository consists mainly of items, each one having a label, a description, and any number of aliases. Items are uniquely identified by a Q followed by a number, such as:

Statements describe detailed characteristics of an Item and consist of a property and a value. Properties in Wikidata have a P followed by a number, such as:

Properties can point to values, such as:

Or properties can point to values that represent other concepts, such as:

And so on.

The numbers used in entity identifier codes are assigned in the order in which the items and properties were created in Wikidata. By using identifier codes, Wikidata is able to support multilingual data entry without favoring any one language. Most item identifiers have labels attached that can be used to title the page for the entity in the language specified by the user's preference. If an identifier does not have a label available in a user's language of interest, the option exists to become a Wikidata contributor and add the missing label.
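To make the identifier scheme concrete, here is a minimal Python sketch of splitting an entity identifier into its kind and number, and of picking a label by language preference with a fallback. The function names, the fallback order, and the small label table are assumptions made for illustration; they are not part of any Wikidata API.

```python
import re
from typing import Dict, Tuple

# Q<number> identifies an item, P<number> identifies a property.
ENTITY_ID = re.compile(r"^([QP])([1-9][0-9]*)$")


def parse_entity_id(entity_id: str) -> Tuple[str, int]:
    """Split an identifier such as 'Q243' into ('item', 243)."""
    match = ENTITY_ID.match(entity_id)
    if match is None:
        raise ValueError("not a Wikidata entity identifier: %r" % entity_id)
    kind = "item" if match.group(1) == "Q" else "property"
    return kind, int(match.group(2))


def label_for(labels: Dict[str, str], entity_id: str,
              language: str, fallback: str = "en") -> str:
    """Pick the label in the preferred language, then the fallback
    language, and finally the bare identifier if no label exists."""
    return labels.get(language) or labels.get(fallback) or entity_id


# Illustrative labels for the Eiffel Tower item (Q243).
EIFFEL_LABELS = {"en": "Eiffel Tower", "fr": "Tour Eiffel", "de": "Eiffelturm"}

print(parse_entity_id("Q243"))
print(label_for(EIFFEL_LABELS, "Q243", "fr"))
```

Because the identifier itself carries no language, the same Q number can be shown under whichever label suits the reader.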

How is Wikidata potentially of use to Nominatim?

As with Wikipedia tags, OSM users have been entering tag links to Wikidata items for some time now. These connections may be only of passing interest to the average OSM user, but have some very interesting potential.

Entities in Wikidata are inter-related, and the connections between them are a large part of what is stored in the database. The information can be visualized as a directed labeled graph in which entities are connected by edges that are labeled by properties. To understand what this means, it is helpful to look at an example.

An example of a Wikidata item that represents a well-known entity is the Eiffel Tower (Q243).

This item has several properties that link to further items of interest.

Of these, the P31 property, or “instance of”, is an extremely useful property that connects to other item concepts of interest for the Eiffel Tower:

This kind of information can be visualized as follows:

[Figure: Wikidata example graphic]

Obviously, in the universe of knowledge that Wikidata is trying to cover, most items are not inherently geospatial. However, many items in Wikidata do represent locations, and for those that do, the statements that describe properties and relationships between items can be valuable additional knowledge for Nominatim. Knowing that an OSM object is a landmark is helpful, and could be used in calculating Nominatim importance scores, for example. Where real-world items have identical names, the ability to differentiate between OSM locations using properties not covered by OSM could help return more relevant geocoding results.
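The directed labeled graph idea can be sketched in a few lines of Python. Only Q243 and P31 are taken from this post; the object values ("tower", "landmark", "Paris") and the "located in" property label are placeholder names chosen for illustration, not real Wikidata identifiers.

```python
from collections import defaultdict

# Each edge is (subject, property, object). Q243 and P31 come from the post;
# the remaining values are hypothetical placeholder labels.
EDGES = [
    ("Q243", "P31", "tower"),
    ("Q243", "P31", "landmark"),
    ("Q243", "located in", "Paris"),
]


def build_graph(edges):
    """Index edges as graph[subject][property] -> set of objects."""
    graph = defaultdict(lambda: defaultdict(set))
    for subject, prop, obj in edges:
        graph[subject][prop].add(obj)
    return graph


def values(graph, subject, prop):
    """Return the sorted objects reachable from subject via prop."""
    return sorted(graph.get(subject, {}).get(prop, set()))


def is_instance_of(graph, subject, concept):
    """True if subject has an 'instance of' (P31) edge to concept."""
    return concept in values(graph, subject, "P31")


GRAPH = build_graph(EDGES)
print(values(GRAPH, "Q243", "P31"))
print(is_instance_of(GRAPH, "Q243", "landmark"))
```

A check such as `is_instance_of` is the kind of test a geocoder could use to treat landmarks differently from other objects that share a name.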

Extracting Wikidata

As with Wikipedia data, there are a multitude of ways that users can access Wikidata via various dump formats, tools, and services. The most powerful methods make use of graph databases and utilize the SPARQL query language.

For now, this project has only scratched the surface of these options by focusing on a few Wikidata extracts of initial interest, importing them into Postgres tables, and querying them using SQL, which is an easier entry point than SPARQL for the purposes of a quick introduction.

All table sizes and query values discussed below reflect the Wikipedia dumps dated 20190620:

We can use the Eiffel Tower example to navigate the above Wikidata tables as follows:

  • Page table → contains 60,216,116 records about all pages in the collection, including past versions. To find the most current Eiffel Tower (Wikidata Q243) record in the page table: [Figure: Wikidata page table example]
  • Geo-tags table → An interesting feature of this table is that it contains lat/lon information for locations on Earth and also for locations on other planets. It also mixes locations that are the primary subject of a page with other locations mentioned within a page. There are 7,411,618 total records in geo_tags, of which 7,404,767 are on Earth; 7,120,320 are primary-subject locations, versus 284,447 locations merely mentioned within the pages. To find the lat/lon of the Eiffel Tower record in the geo-tags table: [Figure: Wikidata geo-tags table example]
  • Wb_items_per_site table → contains 68,771,106 records. This table links each Wikidata page_title (Q number minus Q) to the various language wikis and to the site_page name in the local language. To find the various Wikipedia pages about the Eiffel Tower: [Figure: Wikidata wb_items_per_site table example] There are in fact 155 named Wikipedia pages for the Eiffel Tower in Wikidata, originating across the various Wikipedia language sites. This contrasts with the OSM record for the Eiffel Tower, which contains 44 translations of the landmark’s name, and illustrates that a single Wikidata link in OSM can open up access to a wide array of additional relevant information.
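The three lookups above can be approximated as follows. This is a sketch only: it uses an in-memory SQLite database instead of Postgres so that it runs on its own, the column names follow the MediaWiki-style schema as an assumption about how the dumps were imported, and the page_id and row values are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_namespace INTEGER,
                   page_title TEXT);
CREATE TABLE geo_tags (gt_id INTEGER PRIMARY KEY, gt_page_id INTEGER,
                       gt_globe TEXT, gt_primary INTEGER,
                       gt_lat REAL, gt_lon REAL);
CREATE TABLE wb_items_per_site (ips_row_id INTEGER PRIMARY KEY,
                                ips_item_id INTEGER, ips_site_id TEXT,
                                ips_site_page TEXT);
-- Toy rows: the page_id and coordinates are illustrative values only.
INSERT INTO page VALUES (1001, 0, 'Q243');
INSERT INTO geo_tags VALUES (1, 1001, 'earth', 1, 48.8583, 2.2945);
INSERT INTO wb_items_per_site VALUES (1, 243, 'enwiki', 'Eiffel Tower');
INSERT INTO wb_items_per_site VALUES (2, 243, 'frwiki', 'Tour Eiffel');
INSERT INTO wb_items_per_site VALUES (3, 243, 'dewiki', 'Eiffelturm');
""")


def find_page_id(conn, qid):
    """Look up the page row for an item such as 'Q243'."""
    row = conn.execute(
        "SELECT page_id FROM page WHERE page_namespace = 0 AND page_title = ?",
        (qid,)).fetchone()
    return row[0] if row else None


def primary_location(conn, page_id):
    """Return (lat, lon) of the primary, on-Earth location of a page."""
    return conn.execute(
        "SELECT gt_lat, gt_lon FROM geo_tags "
        "WHERE gt_page_id = ? AND gt_globe = 'earth' AND gt_primary = 1",
        (page_id,)).fetchone()


def sitelinks(conn, qid):
    """Return (site, page name) pairs; the item id is the Q number minus Q."""
    return conn.execute(
        "SELECT ips_site_id, ips_site_page FROM wb_items_per_site "
        "WHERE ips_item_id = ? ORDER BY ips_site_id",
        (int(qid[1:]),)).fetchall()


page_id = find_page_id(conn, "Q243")
print(page_id, primary_location(conn, page_id), sitelinks(conn, "Q243"))
```

Filtering on the globe and primary flags matters because, as noted above, the geo-tags table mixes planets and mixes primary locations with ones that are only mentioned on a page.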

Overall, the Eiffel Tower example shows that using information extracted from the above Wikidata tables, OSM items with Wikipedia page tags can now be linked to their corresponding Wikidata information, and vice versa.

How does this help?

Circling back to Nominatim, recall that Nominatim is already able to calculate and use importance scores based on Wikipedia in-links. One issue with this approach has been that while Nominatim can calculate importance scores for all Wikipedia pages, these scores can only be used if there is a credible link between a Wikipedia page and an OSM item.

Up until this point, there have been two ways in which links between OSM items and Wikipedia pages have been made in Nominatim:

Now that Wikidata information can be used in the Nominatim database, the benefits of the importance scores calculated using Wikipedia in-links can be extended to more OSM items as follows:

  • Wikidata to Wikipedia links - OSM items with Wikidata tags but no Wikipedia tags can be directly connected to their relevant Wikipedia page and importance scores

  • Additional Mapper supplied links - improved geocoding could be an incentive to OSM mappers to add Wikidata identifiers to OSM items. See below for a tool that can help with this process.

  • Improved best guess links - there is a potential that Wikidata properties can help discriminate between similarly named items and more definitively connect the correct Wikipedia page and importance score to the relevant OSM item

These improvements should increase the number of OSM items that can benefit from Wikipedia importance scoring. They also mean that in the future, Nominatim can potentially begin to capitalize on the relationship information stored in Wikidata to further enhance importance scoring.
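As a rough illustration of the first kind of link, the sketch below resolves an importance score for an OSM object from either its wikipedia tag or its wikidata tag. It is not Nominatim's actual implementation: the function name, the choice of the highest-scoring language page, and the score values are all assumptions made for this example.

```python
def resolve_importance(osm_tags, sitelinks, importance):
    """Return (language, title, score) for an OSM object, or None.

    osm_tags   -- dict of OSM tags, e.g. {"wikidata": "Q243"}
    sitelinks  -- {qid: {language: Wikipedia page title}}
    importance -- {(language, title): score}
    """
    # 1. Prefer an explicit wikipedia tag of the form "lang:Title".
    wikipedia_tag = osm_tags.get("wikipedia")
    if wikipedia_tag and ":" in wikipedia_tag:
        language, title = wikipedia_tag.split(":", 1)
        score = importance.get((language, title))
        if score is not None:
            return (language, title, score)

    # 2. Otherwise follow the wikidata tag to its Wikipedia pages and
    #    take the highest-scoring one (an assumption of this sketch).
    pages = sitelinks.get(osm_tags.get("wikidata"), {})
    candidates = [
        (importance[(language, title)], language, title)
        for language, title in pages.items()
        if (language, title) in importance
    ]
    if not candidates:
        return None
    score, language, title = max(candidates)
    return (language, title, score)


# Hypothetical data: the scores are invented for illustration.
SITELINKS = {"Q243": {"en": "Eiffel Tower", "fr": "Tour Eiffel"}}
IMPORTANCE = {("en", "Eiffel Tower"): 0.75, ("fr", "Tour Eiffel"): 0.70}

print(resolve_importance({"name": "Tour Eiffel", "wikidata": "Q243"},
                         SITELINKS, IMPORTANCE))
```

An object that carries only a wikidata tag still ends up with a Wikipedia-derived score, which is the extension described in the list above.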

Curious about linking Wikidata to OSM entities in your area?

If the above discussion has piqued your curiosity about the utility of Wikidata links, and you would like to experiment with contributing some Wikidata identifiers to items you are mapping in OpenStreetMap, take a look at the OSM to Wikidata Matcher Tool by Edward Betts. The author explains the tool in the following talk from FOSDEM 2019: Linking OpenStreetMap and Wikidata - A semi-automated, user-assisted editing tool.
