Summer has come to an end, so this post wraps up the progress made over the course of the “Add Wikidata to Nominatim” project. Overall, the main contributions are documented in the four preceding diary posts, and in:
- updated steps for extracting Wikipedia data and calculating importance scores
- a new script for extracting Wikidata items and place types
These new processes have substantially improved several OSM-to-Wikipedia comparison metrics compared with the equivalent numbers from 2013 (when the previous Wikipedia snapshot was taken).
For context, the number of Wikipedia articles in the top 40 languages in 2013 was 80,007,141, and the number of Wikipedia articles for the same 40 languages in 2019 was 142,620,084 - an increase of ~78%.
Within these article records, in 2013 it was possible under the old processing steps to attach latitude and longitude numbers to 692,541 articles, while in 2019 it was possible to enrich 7,755,392 records with location information - an increase of ~1,020%. This particular statistic largely reflects an improvement in the source Wikipedia / Wikidata projects.
More exciting, with the old method of linking Wikipedia articles to osm_ids, it was possible to link 313,606 Wikipedia article importance scores to osm_ids. With the new method, which uses Wikidata item ids and Wikipedia pages together, that number has risen to 4,730,972 - an increase of ~1,409%. This increase is due both to the large number of Wikipedia and Wikidata tags added by OSM contributors since 2013 and to the inclusion of Wikidata item ids in the linking process for the first time via this project.
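For readers who want to check the percentages quoted above, here is a minimal sketch that recomputes them from the 2013 and 2019 counts. The function name and dictionary labels are only illustrative and are not part of the project scripts.

```python
def percent_increase(old: int, new: int) -> float:
    """Return the relative growth from old to new, as a percentage."""
    return (new - old) / old * 100.0


# (2013 count, 2019 count) pairs, copied from the figures in this post.
figures = {
    "articles in the top 40 languages": (80_007_141, 142_620_084),
    "articles with coordinates": (692_541, 7_755_392),
    "importance scores linked to osm_ids": (313_606, 4_730_972),
}

for label, (count_2013, count_2019) in figures.items():
    print(f"{label}: ~{percent_increase(count_2013, count_2019):,.0f}%")
```

Rounded to whole percentages, this gives roughly 78%, 1,020%, and 1,409%, matching the numbers in the text.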
Although the project technically concludes today, there are always areas of future work where more gains can be made. These include:
- Continuing to add new Wikidata links to OSM objects (possibly using tools such as the example provided by Edward Betts)
- Increasing the number of place types accounted for in the scripts. Currently the top 200 place types are being used, and there are many more that could be added.
- Possibly increasing the number of languages covered. Currently 40 languages are processed.
In addition, for items that cannot be linked directly in the database, a process could be developed that uses radius searches, names, and place type information to make educated guesses about which remaining items could be linked. While this enhancement would be similar to the process used in 2013, it may not be as productive as encouraging work in the first two areas mentioned.
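To make the radius / name / place type idea a little more concrete, here is a rough sketch of what such a guess could look like. Everything in it is an assumption for illustration: the `Place` record, the `guess_link` function, the 1 km default radius, and the exact-match rules are hypothetical and are not taken from Nominatim or from the 2013 process.

```python
import math
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Place:
    """Hypothetical record for either an OSM object or a Wikidata item."""
    ident: str
    name: str
    place_type: str
    lat: float
    lon: float


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two lat/lon points."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def guess_link(osm: Place, candidates: Iterable[Place], radius_km: float = 1.0) -> Optional[Place]:
    """Return the nearest candidate within radius_km that has the same
    place type and the same (case-insensitive) name, or None."""
    best, best_dist = None, radius_km
    for cand in candidates:
        if cand.place_type != osm.place_type:
            continue
        if cand.name.casefold() != osm.name.casefold():
            continue
        dist = haversine_km(osm.lat, osm.lon, cand.lat, cand.lon)
        if dist <= best_dist:
            best, best_dist = cand, dist
    return best


# Made-up example data; the identifiers are placeholders, not real ids.
osm_obj = Place("osm-example", "Example Castle", "castle", 52.5163, 13.3777)
items = [
    Place("item-near", "example castle", "castle", 52.5165, 13.3780),
    Place("item-far", "Example Castle", "castle", 48.0000, 11.0000),
    Place("item-wrong-type", "Example Castle", "museum", 52.5163, 13.3777),
]
match = guess_link(osm_obj, items)
```

A real process would likely need fuzzier name matching and a radius that depends on the place type, which is part of why this route may be less productive than adding direct links.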
I’d like to thank the project mentors lonvia and mtmail for suggesting this project, and for all the good advice they gave throughout. I very much appreciate the time it takes to mentor a remote project such as this, and hope that the results will be of use. I look forward to continuing to contribute to the Wikidata aspects of Nominatim as the ideas on how to utilize the data continue to evolve.