Nominatim is the geocoder / search engine that powers the search box on the main OpenStreetMap site. The software indexes all named features in OSM and certain points of interest, and assigns each a rank between 0 and 30 (where 0 is most important). To improve the accuracy of these rankings, Nominatim also uses a ranking of Wikipedia pages to help indicate the relative importance of OSM features. This is done by calculating an importance score between 0 and 1 based on the number of inlinks to the Wikipedia article for a location. If two places have the same name and one is more important than the other, the Wikipedia score often points to the correct place.
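To make the idea of an inlink-based score concrete, here is a minimal sketch. It is not Nominatim's actual formula: the log scaling, the function names and the inlink counts in the usage note are all assumptions made for illustration. It only shows how a raw inlink count could be mapped onto a 0 to 1 range and then used to choose between two places with the same name.

```python
import math


def importance_from_inlinks(inlinks: int, max_inlinks: int) -> float:
    """Map an inlink count to a score in [0, 1].

    Assumed formula: log-scaled count divided by the log-scaled maximum.
    Nominatim's real calculation may differ.
    """
    if inlinks <= 0 or max_inlinks <= 0:
        return 0.0
    score = math.log(1 + inlinks) / math.log(1 + max_inlinks)
    return min(score, 1.0)


def pick_most_important(candidates: list) -> dict:
    """Return the candidate with the highest importance score."""
    return max(candidates, key=lambda c: c["importance"])
```

With made-up inlink counts, a well-linked capital city would win over a small town of the same name:

```python
max_inlinks = 500_000  # hypothetical count for the most-linked article
places = [
    {"name": "Paris", "where": "France",
     "importance": importance_from_inlinks(120_000, max_inlinks)},
    {"name": "Paris", "where": "Texas",
     "importance": importance_from_inlinks(900, max_inlinks)},
]
best = pick_most_important(places)  # the entry for France
```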
This summer I’ll be participating in a GSoC project to add Wikidata to Nominatim. The primary goal of this project is to extend the existing Nominatim Wikipedia extraction scripts to also take Wikidata into account. Wikidata is a database that exists to support Wikipedia and Wikimedia Commons. The Wikidata project contains structured data about items, and statements about the relationships between items. In recent years, OpenStreetMap has gained a large number of Wikidata tags, and the data and relationships from the Wikidata database should provide information useful for improving the importance rankings and search results from Nominatim. To accomplish this, it will be necessary to process a Wikidata dump and extract the information that would be useful to Nominatim, then import that data into the Nominatim database and put it to use. The Wikidata information can be used in a number of ways:
Processing of OSM objects that do not have a Wikipedia link, but that do have a Wikidata link: by using the Wikidata information, it should be possible to infer the relevant Wikipedia link for use with the current Nominatim ranking process. According to Taginfo, there are currently over 450,000 nodes with Wikipedia tags, and over 600,000 nodes with Wikidata tags, so this represents the scale of opportunity for this particular improvement.
Taking into account the type of the object (museum, library, tower, castle, etc.) as indicated by Wikidata, and using this to correctly infer the relevant Wikipedia link and to prevent or eliminate links to similarly named objects of conflicting types.
Using Wikidata information to exclude certain kinds of detrimental Wikipedia links (such as links to brands) from the calculation of search ranking importance.
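The three uses above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's implementation: it assumes the Wikidata JSON dump layout (one entity per line inside a JSON array, with "sitelinks" and "claims" keys, and P31 as the "instance of" property), and the function names, the brand item ID (Q431289, worth verifying before use) and the example entity are hypothetical choices made for this sketch.

```python
import json

# Assumed Wikidata item ID for "brand"; verify before relying on it.
BRAND_QID = "Q431289"


def parse_dump_line(line: str):
    """Parse one line of a Wikidata JSON dump, or return None for the
    array brackets and blank lines."""
    line = line.strip().rstrip(",")
    if line in ("", "[", "]"):
        return None
    return json.loads(line)


def instance_of(entity: dict) -> set:
    """Collect the item IDs from the entity's P31 (instance of) claims."""
    qids = set()
    for claim in entity.get("claims", {}).get("P31", []):
        snak = claim.get("mainsnak", {})
        if snak.get("snaktype") != "value":
            continue  # skip "somevalue" / "novalue" statements
        value = snak.get("datavalue", {}).get("value", {})
        if isinstance(value, dict) and "id" in value:
            qids.add(value["id"])
    return qids


def wikipedia_link(entity, lang="en", expected_types=None,
                   excluded_types=frozenset({BRAND_QID})):
    """Infer an OSM-style wikipedia tag value ("en:Title") from an entity.

    Returns None when the entity has an excluded type, has none of the
    expected types (if any were given), or has no article in `lang`.
    """
    types = instance_of(entity)
    if types & set(excluded_types):
        return None
    if expected_types and not (types & set(expected_types)):
        return None
    sitelink = entity.get("sitelinks", {}).get(f"{lang}wiki")
    if sitelink is None:
        return None
    return f"{lang}:{sitelink['title']}"
```

For example, with a hand-written entity (not real dump data) typed as a museum (Q33506 is used here as the assumed ID for "museum"):

```python
line = ('{"id": "Q0", "sitelinks": {"enwiki": {"site": "enwiki", '
        '"title": "Example Museum"}}, "claims": {"P31": [{"mainsnak": '
        '{"snaktype": "value", "datavalue": {"value": {"id": "Q33506"}}}}]}},')
entity = parse_dump_line(line)
wikipedia_link(entity, expected_types={"Q33506"})
```

An OSM object carrying only a wikidata tag could then be given the inferred link before the existing importance calculation runs, while entities typed as brands, or typed inconsistently with the OSM object, are skipped.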
This first diary post covers general background and proposed areas of improvement for the Adding Wikidata to Nominatim project. Any feedback, comments or questions are welcome. Future posts will cover progress, and lessons learned along the way.