tchaddad's Diary

Recent diary entries

End of Project Summary

Posted by tchaddad on 2 September 2019 in English.

Summer has come to an end, so this post wraps up the progress made over the course of the “Add Wikidata to Nominatim” project. The main contributions are documented in the four preceding diary posts, and in:

updated steps for extracting Wikipedia data and calculating importance scores
a new script for extracting Wikidata items and place types

These new processes have made big improvements in several OSM-to-Wikipedia comparison metrics as compared to equivalent numbers from 2013 (when the previous Wikipedia snapshot was taken).

Improved Numbers

For context, the number of Wikipedia articles in the top 40 languages in 2013 was 80,007,141, and the number of Wikipedia articles for the same 40 languages in 2019 was 142,620,084 - an increase of ~78%.

Within these article records, in 2013 it was possible under the old processing steps to attach latitude and longitude numbers to 692,541 articles, while in 2019 it was possible to enrich 7,755,392 records with location information - an increase of ~1,020%. This particular statistic largely reflects an improvement in the source Wikipedia / Wikidata projects.

More exciting, with the old method of linking Wikipedia articles to osm_ids, it was possible to link 313,606 Wikipedia article importance scores to osm_ids. With the new method, which uses Wikidata item ids and Wikipedia pages together, the number of Wikipedia article importance scores that can be linked has risen to 4,730,972 - an increase of ~1,409%. This increase is due both to the large number of Wikipedia and Wikidata tags added by OSM contributors since 2013, and to the inclusion of Wikidata item ids in the linking process for the first time via this project.
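The percentages quoted in this section can be re-derived from the raw counts. The following is a small sketch of that arithmetic; the function name and the dictionary labels are illustrative only and are not part of the project scripts.

```python
def percent_increase(old: int, new: int) -> float:
    """Return the percentage increase going from `old` to `new`."""
    return (new - old) / old * 100.0


# (2013 value, 2019 value) pairs quoted in the text above.
comparisons = {
    "article records": (80_007_141, 142_620_084),
    "records with coordinates": (692_541, 7_755_392),
    "importance scores linked to osm_ids": (313_606, 4_730_972),
}

for label, (old, new) in comparisons.items():
    print(f"{label}: ~{percent_increase(old, new):,.0f}% increase")
```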

Future Work

Although the project technically concludes today, there are obviously always areas of future work where more gains can be made. These include:

Continuing to add new Wikidata links to OSM objects (possibly using tools such as the example provided by Edward Betts)
Increasing the number of place types accounted for in the scripts. Currently the top 200 place types are being used, and there are many more that could be added.
Possibly increasing the number of languages covered. Currently 40 languages are processed.

In addition, for items where no direct database link can be made, a process could be developed that uses radius searches, names, and place type information to make educated guesses about which remaining items could be linked. While this enhancement would be similar to the process used in 2013, it may not be as productive as encouraging work in the first two areas mentioned.

Acknowledgements

I’d like to thank the project mentors lonvia and mtmail for suggesting this project, and for all their good advice given throughout. I very much appreciate the time it takes to take on mentorship of a remote project such as this, and hope that the results will be of use. I look forward to continuing to contribute on the Wikidata aspects of Nominatim as the ideas on how to utilize the data continue to evolve.

Wikidata Queries

Posted by tchaddad on 30 August 2019 in English.

The last post introduced Wikidata, and covered how to extract basic information from Wikidata dumps using a Postgres database and regular SQL queries. This post explains some simple SPARQL queries, the Wikidata Query Service and its related API, and how they relate to the improvements desired in Nominatim. The following examples, along with a few other tools from the Wikidata ecosystem, are a good start for any OSM contributor who is Wikidata-curious.

[Image: SPARQL sur Wikidata. CC-BY, Wikimedia Commons, Jorge Abellán, Berlekemp]

Explaining SPARQL

For a new user, probably the most foreign aspect of using Wikidata is the need to know something about SPARQL. So what is SPARQL, and why do you need it?

As covered previously, entities in Wikidata are inter-related, and the connections between them are a large part of what is stored in the database. SPARQL is a query language for databases such as Wikidata that store information as statements in the form: “subject–predicate–object”. To understand what “subject–predicate–object” means, it is helpful to look at specifics.

Going back to the Wikidata page for the Eiffel Tower (Q243), we can see that the Eiffel Tower is listed as an instance of a “lattice tower” (Q1440476), and also listed as an instance of a “tourist destination” (Q1200957).

These statements:

“Eiffel Tower” “is an instance of” “lattice tower”
“Eiffel Tower” “is an instance of” “tourist destination”

are examples of the “subject–predicate–object” concept, where the subject is the Eiffel Tower, the predicate is “instance of”, and the object is “lattice tower” or “tourist destination”.
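One way to picture these statements is as plain three-part records. The sketch below is only an illustration of the idea; the `Statement` type and `objects_of` helper are hypothetical names and do not reflect how Wikidata stores its data internally. The identifiers are the ones quoted above.

```python
from typing import NamedTuple


class Statement(NamedTuple):
    """One subject-predicate-object statement, using Wikidata identifiers."""
    subject: str
    predicate: str
    obj: str


statements = [
    # "Eiffel Tower" "is an instance of" "lattice tower"
    Statement("Q243", "P31", "Q1440476"),
    # "Eiffel Tower" "is an instance of" "tourist destination"
    Statement("Q243", "P31", "Q1200957"),
]


def objects_of(subject, predicate, data):
    """Return every object attached to `subject` through `predicate`."""
    return [s.obj for s in data if s.subject == subject and s.predicate == predicate]


print(objects_of("Q243", "P31", statements))
```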

Wikidata Query Service Examples

OK, so how can we use these relationships? The Wikidata Project provides a Wikidata Query Service with a web-based front end that lets us experiment with SPARQL to learn more.

Say, for example, you want to know about other instances of lattice tower. Here is a simple SPARQL query that asks for the Wikidata item id of every Wikidata item that is marked as an instance of a lattice tower:

[Screenshot of Wikidata Query Service]

Notice that line 5 of the query is where the key parameters are: the “instance of” property (P31), and the object “lattice tower” (Q1440476).
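Since the screenshot is not reproduced as text, here is a sketch of what such a query can look like, built as a Python string so it can be pasted into the Wikidata Query Service or sent to its API. It relies on the `wd:` and `wdt:` prefixes that the Wikidata Query Service predefines. The function name is hypothetical, and the line layout may differ from the original screenshot.

```python
def instances_of_query(class_qid: str) -> str:
    """Build a SPARQL query listing items that are an instance of (P31) `class_qid`."""
    return (
        "SELECT ?item\n"
        "WHERE\n"
        "{\n"
        f"  ?item wdt:P31 wd:{class_qid} .\n"
        "}"
    )


LATTICE_TOWER = "Q1440476"
print(instances_of_query(LATTICE_TOWER))
```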

You can try the Wikidata Query Service web interface for the above query at this link. At the time of this post, running this SPARQL query returns 10 instances of lattice towers from Wikidata. The data can be downloaded in a variety of formats if using the web interface, or returned as JSON or XML if using the API.
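When the API returns JSON, the results follow the standard SPARQL JSON results layout, with one entry per row under `results.bindings`. The sketch below extracts item ids from such a response. The sample response is hand-written for illustration (the second id is a made-up placeholder, not real query output), and `item_ids` is a hypothetical helper name.

```python
import json

# Hand-written sample in the SPARQL JSON results layout; not real query output.
# Q243 is the Eiffel Tower; Q999999999 is a made-up placeholder id.
sample_response = """
{
  "head": {"vars": ["item"]},
  "results": {"bindings": [
    {"item": {"type": "uri", "value": "http://www.wikidata.org/entity/Q243"}},
    {"item": {"type": "uri", "value": "http://www.wikidata.org/entity/Q999999999"}}
  ]}
}
"""


def item_ids(response_text, var="item"):
    """Return the Q ids found in column `var` of a SPARQL JSON response."""
    data = json.loads(response_text)
    ids = []
    for row in data["results"]["bindings"]:
        uri = row[var]["value"]
        # The item id is the last path segment of the entity URI.
        ids.append(uri.rsplit("/", 1)[-1])
    return ids


print(item_ids(sample_response))
```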

Great, but of course most of us cannot “read” a Wikidata item id, so it would be nice to have a label in the language of our choosing. The Wikidata Query Service provides a way to also ask for a label in a specified language:

[Screenshot of Wikidata Query Service]

With the addition of the request for the label, the query now returns the same 10 results, this time with 2 columns: Wikidata item id, and label. Try the improved query here.
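A sketch of a label-aware query is shown below, again as a Python string. It uses the label service offered by the Wikidata Query Service (`SERVICE wikibase:label`), which fills in `?itemLabel` for the requested language. The function name and its default language are assumptions made for this example, and the exact layout may differ from the screenshot.

```python
def labelled_instances_query(class_qid, language="en"):
    """Build a SPARQL query returning items and their labels in `language`."""
    return f"""SELECT ?item ?itemLabel
WHERE
{{
  ?item wdt:P31 wd:{class_qid} .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "{language}" . }}
}}"""


print(labelled_instances_query("Q1440476", language="fr"))
```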

Excellent, so what if we also want to return the latitude and longitude of these lattice towers? We can do this by asking for the coordinate location property (P625):

[Screenshot of Wikidata Query Service]

Now the results include 3 columns: Wikidata item id, itemLabel, and a geo column containing the coordinate location as a WKT string (try the query here). If you prefer separate latitude and longitude columns, you can drill down into this last element and extract them (query):

[Screenshot of Wikidata Query Service]

And so on.
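If you download the results with the WKT column, the latitude and longitude can also be split out afterwards on the client side. The sketch below assumes the WKT string has the form `Point(longitude latitude)`, with longitude first as is usual for WKT. The sample value is an approximate Eiffel Tower position written by hand for illustration, and `parse_wkt_point` is a hypothetical helper.

```python
import re

# Matches strings such as "Point(2.2945 48.8583)"; longitude comes first in WKT.
_POINT = re.compile(
    r"^\s*POINT\s*\(\s*(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)\s*\)\s*$",
    re.IGNORECASE,
)


def parse_wkt_point(wkt):
    """Return (latitude, longitude) from a WKT point string."""
    match = _POINT.match(wkt)
    if match is None:
        raise ValueError(f"not a simple WKT point: {wkt!r}")
    lon, lat = float(match.group(1)), float(match.group(2))
    return lat, lon


lat, lon = parse_wkt_point("Point(2.2945 48.8583)")
print(lat, lon)
```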

One of the nice things about trying examples in the Wikidata Query Service web interface is that there is also an option to render the list of results as a map. To do this, simply add “#defaultView:Map” to the top of the query block:

[Wikidata Query Service example map output]

With this one addition, we can now quickly see the results of the query in map form.

Last but not least, we can try an example that will also report instances of subclasses (P279) of the Wikidata item for “tower” (Q12518):

[Screenshot of Wikidata Query Service]

Because “towers” is a much more generic concept than “lattice towers”, we expect that there are both many instances of towers, and also many instances of subclasses of towers. And indeed, when this query is run, there are over 21,000 results. What is interesting is that we did not need to know what the subclasses of “tower” were in order to return them in the query - and this begins to give a glimpse of the power inherent in SPARQL.
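To make the subclass idea concrete, the sketch below shows one common way to write such a query, using the SPARQL property path `wdt:P31/wdt:P279*` ("instance of", followed by zero or more "subclass of" steps), together with a toy in-memory version of the same traversal. The toy data assumes, for illustration only, that lattice tower (Q1440476) is recorded as a subclass of tower (Q12518); all function and variable names are hypothetical.

```python
def instances_including_subclasses_query(class_qid):
    """Build a SPARQL query for instances of `class_qid` or of any of its subclasses."""
    return f"""SELECT ?item
WHERE
{{
  ?item wdt:P31/wdt:P279* wd:{class_qid} .
}}"""


# Toy statements (assumed for illustration): item -> classes, class -> parent classes.
INSTANCE_OF = {"Q243": ["Q1440476"]}       # Eiffel Tower is an instance of lattice tower
SUBCLASS_OF = {"Q1440476": ["Q12518"]}     # lattice tower is a subclass of tower


def is_instance_including_subclasses(item, target):
    """Follow one P31 step, then any number of P279 steps, looking for `target`."""
    stack = list(INSTANCE_OF.get(item, []))
    seen = set()
    while stack:
        cls = stack.pop()
        if cls == target:
            return True
        if cls in seen:
            continue
        seen.add(cls)
        stack.extend(SUBCLASS_OF.get(cls, []))
    return False


print(instances_including_subclasses_query("Q12518"))
print(is_instance_including_subclasses("Q243", "Q12518"))
```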

Hopefully the above examples are enough to give a new user a basic understanding of SPARQL. As with any new language, it takes some time to get used to the syntax and best practices, but once a few basics are mastered, the sky’s the limit.

Relevance to Nominatim

Knowing a bit about SPARQL is nice, but most OSM contributors are not necessarily interested in becoming experts in SPARQL, or even in Wikidata. The main benefit this project hopes to bring to the OSM community is that by adding Wikidata information into the process used by Nominatim, the service can return more relevant search results.

To do this, we are using a combination of Wikidata extracts from table dumps (via SQL) and API calls (using SPARQL) to selectively retrieve instances of “place types” from the Wikidata ecosystem. These diary posts capture examples of the sorts of queries that have been useful. Once extracted, the Wikidata items are used to enrich the wikipedia_article table in the Nominatim database, where they are cross-referenced with their relevant Wikipedia importance scores by language. Once the Wikidata information is linked in this way, Nominatim can use the importance scoring in its search result logic.

More on Place Types

Wikipedia seeks to be a “compendium that contains information on all branches of knowledge”. As a result, Nominatim is not concerned with all of Wikidata, but rather with the small subset of items that are instances of place types.

Unfortunately there is no source that presents the range of place types covered by Wikipedia, and Wikidata does not have any official ontologies. However, the DBpedia project has created an ontology that covers place types, and so this can be used as a starting point for building a list of place types that Nominatim can use. By using place types to identify instances of places, and not just those Wikidata items that have geographic coordinates, it is possible to identify a broader pool of Wikidata items that might have links to OSM items. In addition, as the curated list of place types is built over time, the potential for improved links grows accordingly.

Digging into Wikidata

Posted by tchaddad on 30 July 2019 in English.

The last post explained some of the background on the use of Wikipedia page links and other information in Nominatim. This post looks into Wikidata as another source of information that may be of use. Wikidata is the knowledge base of the Wikimedia Foundation. It was founded in 2012 with the goal of collecting factual data used in Wikipedia across all languages. The project is maintained by over 20,000 active community contributors.

The Wikidata repository consists mainly of items, each one having a label, a description, and any number of aliases. Items are uniquely identified by a Q followed by a number, such as:

Q936 → OpenStreetMap

Statements describe detailed characteristics of an item and consist of a property and a value. Properties in Wikidata are identified by a P followed by a number.

Properties can point to literal values, such as:

P2124 → member count

Or properties can point to values that represent other concepts, such as:

P31 → instance of, which for the OpenStreetMap item (Q936) points to values such as:
Q4505959 → digital map
Q7094076 → online database
Q6576792 → online community
Q933625 → volunteered geographic information

And so on.

The numbers used in entity identifier codes are assigned in the order the items and properties were created in Wikidata. By using identifier codes, Wikidata is able to support multilingual data entry without favoring any one language. Most item identifiers will have labels attached that can be used to label the page for the entity in the language specified by the user's preference. If an identifier does not have a label available in a user's language of interest, the option exists to become a Wikidata contributor and add the missing label.

How is Wikidata potentially of use to Nominatim?

As with Wikipedia tags, OSM users have been entering tag links to Wikidata items for some time now. These connections may be only of passing interest to the average OSM user, but have some very interesting potential.

Entities in Wikidata are inter-related, and the connections between them are a large part of what is stored in the database. The information can be visualized as a directed labeled graph where entities are connected by edges that are labeled by properties. To understand what this means, it is helpful to look at an example:

An example of a Wikidata item that represents a well known entity:

Q243 → Eiffel Tower

This item has several properties that link to other items of interest

Of these, the P31 property, or “instance of”, is an extremely useful property that connects to other item concepts of interest for the Eiffel Tower:

This kind of information can be visualized as follows:

[Wikidata example graphic]

Obviously, in the universe of knowledge that Wikidata is trying to cover, most items are not inherently geospatial. However, many items in Wikidata do represent locations, and for those that do, the statements in the Wikidata database that represent properties and relationships between items can potentially be valuable “additional” knowledge that can enhance Nominatim. Knowing that an OSM object is a landmark is helpful, and could be used in calculations of Nominatim importance scores, for example. In cases where real world items might have identical names, the ability to differentiate between OSM locations using properties not covered by OSM could help with returning more relevant geocoding results, etc.

Extracting Wikidata

As with Wikipedia data, there are a multitude of ways that users can access Wikidata via various dump formats, tools, and services. The most powerful methods make use of graph databases and utilize the SPARQL query language.

For now, this project has only scratched the surface of these options, by focusing on a few Wikidata extracts of initial interest, importing them into Postgres tables, and querying them using SQL, which is an easier entry point than SPARQL for the purposes of a quick introduction.

All table sizes and query values discussed below reflect the Wikidata dumps dated 20190620:

Page table - wikidatawiki-latest-page.sql.gz: The page table can be considered the “core of the wiki”. Each page in a MediaWiki installation has an entry here which identifies it by title and contains some essential metadata.
Geo_tags table - wikidatawiki-latest-geo_tags.sql.gz: Stores information about geographical coordinates in articles.
Wb_items_per_site table - wikidatawiki-latest-wb_items_per_site.sql.gz: This table holds links from Wikidata items to Wikipedia articles.

We can use the Eiffel Tower example to navigate the above Wikidata tables as follows:

Page table → contains 60,216,116 records about all pages in the collection, including past versions. To find the most current Eiffel Tower (Wikidata Q243) record in the page table:
Geo_tags table → An interesting feature of this table is that it contains lat/lon information for locations on Earth, and also locations on other planets. It also mixes locations that are the primary subject of a page with other locations mentioned within a page. There are 7,411,618 total records in geo_tags, of which 7,404,767 are on Earth. Of those, 7,120,320 are primary-subject locations and 284,447 are locations merely mentioned within pages. To find the lat/lon of the Eiffel Tower record in the geo_tags table:
Wb_items_per_site table → contains 68,771,106 records. This table contains the link between each Wikidata page_title (Q number minus Q), the various language wikis, and the site_page name in the local language. To find the various Wikipedia pages about the Eiffel Tower: There are in fact 155 named Wikipedia pages for the Eiffel Tower in Wikidata, originating across the various Wikipedia language sites. This contrasts with the OSM record for the Eiffel Tower, which contains 44 translations of the landmark’s name, and illustrates that a single Wikidata link in OSM can open up access to a wide array of additional relevant information.
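The lookups described in the list above can be sketched against a tiny stand-in database. The example uses Python's built-in sqlite3 module rather than Postgres so that it runs anywhere, and the tables are cut down to a few columns. The column names (page_id, page_namespace, page_title, gt_page_id, gt_globe, gt_primary, gt_lat, gt_lon, ips_item_id, ips_site_id, ips_site_page) are assumed from the MediaWiki, GeoData and Wikibase table layouts and should be checked against the real dumps; the page id and coordinates are made-up sample values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE page (page_id INTEGER, page_namespace INTEGER, page_title TEXT);
CREATE TABLE geo_tags (gt_page_id INTEGER, gt_globe TEXT, gt_primary INTEGER,
                       gt_lat REAL, gt_lon REAL);
CREATE TABLE wb_items_per_site (ips_item_id INTEGER, ips_site_id TEXT,
                                ips_site_page TEXT);
-- Made-up sample rows standing in for the real dump contents.
INSERT INTO page VALUES (1001, 0, 'Q243');
INSERT INTO geo_tags VALUES (1001, 'earth', 1, 48.8583, 2.2945);
INSERT INTO wb_items_per_site VALUES (243, 'enwiki', 'Eiffel Tower');
INSERT INTO wb_items_per_site VALUES (243, 'frwiki', 'Tour Eiffel');
""")


def primary_coords(conn, qid):
    """Return (lat, lon) of the primary Earth coordinate for a Wikidata item."""
    return conn.execute(
        "SELECT g.gt_lat, g.gt_lon "
        "FROM page p JOIN geo_tags g ON g.gt_page_id = p.page_id "
        "WHERE p.page_namespace = 0 AND p.page_title = ? "
        "AND g.gt_globe = 'earth' AND g.gt_primary = 1",
        (qid,),
    ).fetchone()


def wikipedia_pages(conn, qid):
    """Return (site, page title) pairs linked to a Wikidata item."""
    numeric_id = int(qid[1:])  # the table stores the Q number minus the Q
    return conn.execute(
        "SELECT ips_site_id, ips_site_page FROM wb_items_per_site "
        "WHERE ips_item_id = ? ORDER BY ips_site_id",
        (numeric_id,),
    ).fetchall()


print(primary_coords(conn, "Q243"))
print(wikipedia_pages(conn, "Q243"))
```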

Overall, the Eiffel Tower example shows that using information extracted from the above Wikidata tables, OSM items with Wikipedia page tags can now be linked to their corresponding Wikidata information, and vice versa.

How does this help?

Circling back to Nominatim, recall that Nominatim is already able to calculate and use importance scores based on Wikipedia in-links. One issue with this approach has been that while Nominatim can calculate importance scores for all Wikipedia pages, these scores can only be used if there is a credible link between a Wikipedia page and an OSM item.

Up until this point, there have been two ways in which links between OSM items and Wikipedia pages have been made in Nominatim:

Mapper supplied links - OSM mappers have added Wikipedia tags to some OSM items, and so these are directly accessible
Best guess links - Nominatim has a linking script that attempts to create links using logic that compares item names and locations

Now that Wikidata information can be used in the Nominatim database, the benefits of the importance scores calculated using Wikipedia inlinks can be extended to more OSM items as follows:

Wikidata to Wikipedia links - OSM items with Wikidata tags but no Wikipedia tags can be directly connected to their relevant Wikipedia page and importance scores
Additional Mapper supplied links - improved geocoding could be an incentive to OSM mappers to add Wikidata identifiers to OSM items. See below for a tool that can help with this process.
Improved best guess links - there is a potential that Wikidata properties can help discriminate between similarly named items and more definitively connect the correct Wikipedia page and importance score to the relevant OSM item
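As a rough illustration of the first item in the list above, the sketch below resolves a Wikipedia page for an OSM object, preferring a mapper-supplied wikipedia tag and falling back to the wikidata tag. It is not the Nominatim implementation; the function name, the lookup dictionary and the sample tags are all hypothetical.

```python
def resolve_wikipedia_link(osm_tags, wikidata_to_wikipedia, language="en"):
    """Return a 'lang:Title' Wikipedia reference for an OSM object, or None.

    A mapper-supplied wikipedia tag wins; otherwise the wikidata tag is used
    to look up the page title for the requested language.
    """
    if "wikipedia" in osm_tags:
        return osm_tags["wikipedia"]
    qid = osm_tags.get("wikidata")
    titles = wikidata_to_wikipedia.get(qid, {})
    if language in titles:
        return f"{language}:{titles[language]}"
    return None


# Hypothetical lookup built from a Wikidata-to-Wikipedia extract.
wikidata_to_wikipedia = {"Q243": {"en": "Eiffel Tower", "fr": "Tour Eiffel"}}

only_wikidata = {"name": "Tour Eiffel", "wikidata": "Q243"}
print(resolve_wikipedia_link(only_wikidata, wikidata_to_wikipedia))
```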

These improvements should lead to a larger number of OSM items that can benefit from Wikipedia importance scoring. It also means that in the future, Nominatim can potentially begin to capitalize on the relationship information stored in Wikidata to further enhance importance scoring.

Curious about linking Wikidata to OSM entities in your area?

If the above discussion has piqued your curiosity about the utility of Wikidata links, and you would like to experiment with contributing some Wikidata identifiers to items you are mapping in OpenStreetMap, take a look at the following interesting tool from Edward Betts: The tool is explained by the author in the following video from FOSSDEM 2019:

Wikipedia Deep Dive

Posted by tchaddad on 29 June 2019 in English.

The GSoC project to add Wikidata to Nominatim is now well underway. This post will focus on the first phases that have centered on updating Wikipedia extraction scripts in order to build a modern Wikipedia extract for use in Nominatim.

A little background

Why is Wikipedia data important to Nominatim? What many OpenStreetMap users might not know is that Wikipedia can be used as an optional auxiliary data source to help indicate the importance of OSM features. Nominatim will work without this information, but installing it improves the quality of the results. To augment the accuracy of geocoded rankings, Nominatim uses page ranking of Wikipedia pages to help indicate the relative importance of OSM features. This is done by calculating an importance score between 0 and 1 based on the number of inlinks to a Wikipedia article for a given location. If two places have the same name and one is more important than the other, the Wikipedia score often points to the correct place.

Users who want to set up their own instance of Nominatim have long had the option of including the Wikipedia-based improvements in their instances at installation time. This is done by the optional inclusion of a purpose-built extract of the primary Wikipedia information required to calculate the Nominatim importance scores.
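To give a feel for how an inlink count can become a score between 0 and 1, here is a minimal sketch. The logarithmic normalisation against the largest inlink count is an assumption made for this example, not a formula stated in this post, and the inlink counts are invented.

```python
import math


def importance(inlinks, max_inlinks):
    """Map an inlink count onto 0..1 using a log scale (assumed normalisation)."""
    if inlinks < 1 or max_inlinks < 2:
        return 0.0
    return min(1.0, math.log(inlinks) / math.log(max_inlinks))


# Invented inlink counts for two places that share a name.
candidates = {"Paris, France": 250_000, "Paris, Texas": 900}
max_inlinks = 1_000_000

scores = {name: importance(count, max_inlinks) for name, count in candidates.items()}
best = max(scores, key=scores.get)
print(best, round(scores[best], 2))
```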

Problem I

Up until this point, the Wikipedia extract used in Nominatim installations has been a one-time extract dating back to 2012-2013. Of course, in the years since this extract was made available, the OpenStreetMap database has come to contain many more items linked to Wikipedia articles, and Wikipedia itself has grown tremendously as well. This all means that there are now many more sites that can benefit from the Wikipedia-derived importance scores, provided that Nominatim users can operate from an up-to-date Wikipedia extract.

Problem II

Unfortunately, the process for creating a Wikipedia extract was developed before the current Nominatim stewards were involved in the project, and it was poorly documented. Ideally the new process for creating a Wikipedia extract would be easy to understand, and streamlined enough to run on an annual basis similar to how other optional Nominatim packages such as the TIGER data extracts are generated.

Where are we starting from?

The first step of the project involved excavating the historic Wikipedia extraction scripts, documenting the process, and identifying and eliminating inefficiencies. For those curious about the database specifics, the Wikipedia-derived data in Nominatim is stored in two Postgres tables:

[Wikipedia table structures used in Nominatim database]

In the original extract preparation process, these two tables were created by running a collection of language-specific Wikipedia dump files through a series of processing steps. These steps created some intermediate Wikipedia Postgres tables which were then augmented with geo-coordinates from a third source (DBpedia), and data from the Wikipedia pages themselves, such as population and website links. A couple of final steps condensed the data into simpler summary tables from which Nominatim importance calculations could be made.

At the time that this process was developed, far fewer OSM objects were tagged with Wikipedia links. As a result, it was thought to be necessary to go through an artificial linking process to plot the locations of the Wikipedia articles in relation to the locations of the OSM objects, and to use the proximity of similarly named objects to make educated links between OSM ids and Wikipedia articles, so that the calculated importance scores could be used. This linking process turns out to have been the only use of the lat/lon data in the Wikipedia tables. In addition, items such as population and website were thought to be potentially useful, but in fact have turned out not to be used.

Where are we heading?

As a result of the improvements in OSM since 2013, it is thought that the artificial linking process described above is no longer necessary, and that dropping it will cut down on the number of processing steps. Since population and website information has similarly not been useful, these items also no longer need to be parsed out of the full page Wikipedia dumps, cutting down on processing time. So cutting these 2 steps is already a major streamlining of the old process. This turns out to be very beneficial, because the size of the Wikipedia data being processed has grown fairly dramatically and processing an extract composed of data from the top 40 Wikipedia language sites takes quite a lot of time and disk space.

Results thus far

After processing the same 40 languages worth of Wikipedia data as the 2013 dump, we have arrived at a substantially increased number of wikipedia_article records with importance calculations:

Wikipedia 2013 Extract –> 80,007,141 records –> 6.4 GB table size
Wikipedia 2019 Extract –> 142,620,084 records –> 11 GB table size

While doing the processing, it was noted that some Wikipedia language dumps contain errors which cause their import and processing for Nominatim to fail. To correct for this, a Perl script used to parse the raw dump files has been improved to account for the most common errors. This means that the new process can now be reliably repeated with whatever new set of Wikipedia languages is specified, whenever necessary.

Users are warned that processing the top 40 Wikipedia languages can take over a day, and will add over 200 tables, and nearly 1TB to the processing database. As of June 2019, the final two summary tables that will finally be imported into the Nominatim database are 11GB and 2GB in size.

Next Steps

The large size of the 2019 wikipedia_article table can potentially be streamlined further by running a process to exclude records that do not have any relationship to locations. Wikidata should be able to help us identify the records that should be kept and those that could be discarded, so this will be looked at in the next phase of the project.

Another area of possible improvement is that we should consider which languages we want to process. Should we focus on the top 40 by number of Wikipedia pages? This is the historic approach. Or should we consider the geographic footprint of languages in the calculus? For example, adding a language like Maori might improve links in the New Zealand region, and yet the Maori language is just outside the top 40 languages based on number of Wikipedia pages. Should such facts be considered in language selection? Other languages may have artificially high numbers of Wikipedia pages due to the actions of bots - should that somehow be taken into account? If we can be effective with streamlining the size and relevance of the contents of the wikipedia_article table, the space savings might mean that we could also process more languages, allowing us to be more inclusive.

Feedback welcome!

Hopefully this explanation of the role of Wikipedia data in Nominatim has been informative. Any feedback, comments or questions are welcome. The next diary post will take a closer look at Wikidata, so any questions or suggestions on that topic are also welcome.

Adding Wikidata to Nominatim

Posted by tchaddad on 24 May 2019 in English.

Nominatim is the geocoder / search engine that powers the search box on the main OpenStreetMap site. The software indexes all named features in OSM and certain points of interest, and assigns ranks of importance scored between 0 and 30 (where 0 is most important). To augment the accuracy of rankings, Nominatim also uses page ranking of Wikipedia pages to help indicate the relative importance of OSM features. This is done by calculating an importance score between 0 and 1 based on the number of inlinks to an article for a location. If two places have the same name and one is more important than the other, the Wikipedia score often points to the correct place.

This summer I’ll be participating in a GSoC project to add Wikidata to Nominatim. The primary goal of this project is to extend the existing Nominatim Wikipedia extraction scripts to also take into account Wikidata. Wikidata is a database that exists to support Wikipedia and Wikimedia Commons. The Wikidata project contains structured data about items, and statements about the relationships between items. In recent years, OpenStreetMap has gained a large number of Wikidata tags, and the data and relationships from the Wikidata database should provide information useful to improving the importance rankings and search results from Nominatim. To accomplish this, it will be necessary to process a Wikidata dump and extract the information that would be useful to Nominatim, then import it into the Nominatim database and put it to use. The Wikidata information can be used in a number of ways:

Processing of OSM objects that do not have a Wikipedia link, but that do have a Wikidata link: by using the Wikidata information, it should be possible to infer the relevant Wikipedia link for use with the current Nominatim ranking process. According to Taginfo, there are currently over 450,000 nodes with Wikipedia tags, and over 600,000 nodes with Wikidata tags, so this represents the scale of opportunity for this particular improvement.
Taking into account the type of the object (museum, library, tower, castle, etc) as indicated by Wikidata, and using this to correctly infer the relevant Wikipedia link, and to prevent or eliminate possible links to similarly named objects of conflicting types.
Using Wikidata information to eliminate certain kinds of detrimental Wikipedia links (such as links to brands) from use in calculation of search ranking importance.

This first diary post covers general background and proposed areas of improvement for the Adding Wikidata to Nominatim project. Any feedback, comments or questions are welcome. Future posts will cover progress, and lessons learned along the way.