Nominatim QA Analyser Tool - GSoC'21 Final Report
Posted by AntoJvlt on 22 August 2021 in English (English). Last updated on 23 August 2021.
Introduction
As we are reaching the end of the Google Summer of Code 2021, I would like to share with you the work done on my project “Nominatim QA Analyser”.
Previous diary entries
Project summary
Nominatim is OSM’s main geocoding software, used to process geocoding requests on OpenStreetMap data. The software uses its own database schema, which differs from the one used by the main OSM database. As a result, Nominatim processes OSM data in a way that makes it possible to discover a lot of inconsistencies. The idea of the project was to build a Quality Assurance tool which analyzes the Nominatim database and extracts errors from the OSM data based on a set of rules.
These extracted errors are then made available to everyone through a visual map, so that OSM mappers can correct them easily.
Running example
If you are curious about the result of the project, you can check a running instance here: https://gsoc2021-qa.nominatim.org/front/. It has up-to-date data, generated by the backend analyser directly on the Nominatim production server.
The web application will be set up on the official nominatim.org website later this month or in early September. I will update the link above in this diary entry once that is done.
What has been done
The project is divided into two parts. The first one is the main Data Analyser tool, which runs on the Nominatim server. The second one is the web application used to visualize the data on a map.
Data Analyser
The GitHub repository of the Data Analyser can be found here: osm-search/Nominatim-Data-Analyser. All commits in this repository are my own work. You are free to contribute and help make the tool better.
The development of this analyser includes:
- A pipe-based architecture developed in Python, with a YAML Rule Specification system used to easily add and customize QA rules. Learn more about this in the Mid-term project update I wrote.
- The QA rules defined by @lonvia that are checked off in Nominatim issue #1848 have been implemented in the analyser.
- Documentation of the tool, available here.
- About 95% test coverage for the main Python code.
- Continuous integration set up in the GitHub repository with GitHub Actions. It builds the tool and runs the tests for every push and pull request.
- As a bonus, I had time to develop a custom C++ module which generates clusters from a set of points (in GeoJSON format) and then generates vector tiles from these clusters in the Mapbox vector tile format. This module is named clustering-vt and the code can be found in the Nominatim Data Analyser GitHub repository. I used mapbox/supercluster.hpp and mapbox/vtzero as the main libraries for this module, and I can say that they are amazing libraries. We initially used Tippecanoe to generate the vector tiles, but I wasn’t satisfied with the results we had. I wanted to use Supercluster, so I developed clustering-vt.
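To give an idea of what a pipe-based architecture driven by a rule specification can look like, here is a minimal Python sketch. It is not the analyser's actual code: the `Pipe` class, the step names (`FakeQuery`, `KeepUnnamed`, `ToFeatures`) and the shape of `RULE_SPEC` are all hypothetical, and the rule is written as a plain dict standing in for what a parsed YAML file might contain.

```python
from typing import Any, Callable, Dict, List, Optional


class Pipe:
    """One processing step that hands its result to the next pipe, if any."""

    def __init__(self, name: str, func: Callable[[Any], Any]) -> None:
        self.name = name
        self.func = func
        self.next_pipe: Optional["Pipe"] = None

    def plug(self, other: "Pipe") -> "Pipe":
        """Connect `other` after this pipe and return it to allow chaining."""
        self.next_pipe = other
        return other

    def run(self, data: Any = None) -> Any:
        """Apply this step, then pass the result down the chain."""
        result = self.func(data)
        if self.next_pipe is None:
            return result
        return self.next_pipe.run(result)


# Hypothetical step implementations, keyed by the type name used in a rule.
# "FakeQuery" returns hard-coded rows instead of querying a real database.
STEP_TYPES: Dict[str, Callable[[Any], Any]] = {
    "FakeQuery": lambda _: [
        {"osm_id": 1, "name": ""},
        {"osm_id": 2, "name": "Main St"},
    ],
    "KeepUnnamed": lambda rows: [row for row in rows if not row["name"]],
    "ToFeatures": lambda rows: [
        {"type": "Feature", "properties": row} for row in rows
    ],
}

# Hypothetical rule, shaped like the result of parsing a small YAML file.
RULE_SPEC: Dict[str, Any] = {
    "name": "unnamed_objects",
    "steps": ["FakeQuery", "KeepUnnamed", "ToFeatures"],
}


def build_rule(spec: Dict[str, Any]) -> Pipe:
    """Turn a rule specification into a chain of plugged pipes."""
    steps: List[str] = spec["steps"]
    if not steps:
        raise ValueError("a rule needs at least one step")
    first = Pipe(steps[0], STEP_TYPES[steps[0]])
    last = first
    for step in steps[1:]:
        last = last.plug(Pipe(step, STEP_TYPES[step]))
    return first


if __name__ == "__main__":
    print(build_rule(RULE_SPEC).run())
```

In a design like this, adding a QA rule only means writing a new specification that recombines existing step types, which is the kind of flexibility a rule specification system is meant to provide.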
Web application
Initially, we used osmoscope-ui as a website to display the data extracted by the analyser. Later in the project, I wanted to switch to a custom web application which allows us to have a custom UI, custom features and side information. This was made a lot easier because we kept the layer definition of osmoscope.
This web application is developed with React. The code can be found here: osm-search/Nominatim-Data-Analyser-Frontend. All the code in this repository is my own work. You are free to contribute and help make the tool better.
Please see the Running example section above to get a link to a running instance of this web application.
What I have learned
I have learned a lot about the open source world. It was my first real open source experience. I discovered how many open source projects there are out there, the communities around them, and how people work together to produce such good projects. I really enjoyed that and I definitely want to continue this journey.
I also learned many things related to OpenStreetMap/Mapbox and the many projects built around them. More generally, I have learned a lot about map systems: how maps work, how vector tiles work, what tile servers are, how maps are rendered, etc.
I worked with C++ without much prior experience. It was needed when I developed the clustering-vt module. It was not an easy step, as I used libraries from the Mapbox C++ ecosystem and I was not familiar with it at all. I dived into the source code of these libraries and learned many things. It made me want to use C++ more and get better at it, and I am really happy that it happened.
It was the first time that I set up continuous integration and the first time I used GitHub Actions. This is so helpful.
And of course I learned a lot more about Python development, PostgreSQL, YAML, etc.
What to do next
I want to keep maintaining the Nominatim Data Analyser and Nominatim Data Analyser Frontend.
Issues have been opened on the repositories, especially for the analyser, describing what currently needs to be done to improve the tool.
The main things to do currently are the following:
- Add a multithreading feature to make the analyser run much faster, since most of the execution time is spent on the PostgreSQL queries and on the vector tile generation (external clustering-vt module).
- Add tests and documentation to the clustering-vt module and to the frontend code.
- Optimize the clustering-vt algorithm to make it run faster.
- Add a “report as false positive” feature to the tool.
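As an illustration of the multithreading item above, here is a minimal Python sketch that runs several rules concurrently with a thread pool. It is only one possible approach, not a planned implementation: `run_rules_concurrently` and `make_fake_rule` are hypothetical names, and the fake rules simply sleep to stand in for time spent waiting on a database query or on tile generation. It also assumes that the rules are independent of each other.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict


def run_rules_concurrently(
    rules: Dict[str, Callable[[], int]], max_workers: int = 4
) -> Dict[str, int]:
    """Run every rule in a worker thread and collect results by rule name."""
    results: Dict[str, int] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each submitted future back to the name of its rule.
        futures = {pool.submit(func): name for name, func in rules.items()}
        for future in as_completed(futures):
            # result() re-raises any exception raised inside the rule.
            results[futures[future]] = future.result()
    return results


def make_fake_rule(duration: float, error_count: int) -> Callable[[], int]:
    """Build a stand-in rule that waits, then reports a number of errors."""

    def rule() -> int:
        time.sleep(duration)  # simulates waiting on a query or on tiling
        return error_count

    return rule


if __name__ == "__main__":
    fake_rules = {
        "rule_a": make_fake_rule(0.2, 3),
        "rule_b": make_fake_rule(0.2, 5),
    }
    start = time.perf_counter()
    print(run_rules_concurrently(fake_rules))
    print(f"elapsed: {time.perf_counter() - start:.2f}s")
```

Threads mainly help when the time is spent waiting outside the Python interpreter, which matches the two bottlenecks named above. Whether this fits the analyser in practice would depend on how database connections and the clustering-vt module are handled per thread.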
Acknowledgements
I would like to thank my mentors, Sarah Hoffmann (lonvia) and Marc Tobias (mtmail), who have helped me a lot with this project. It was amazing working with them and I hope we can continue like this in the future, for the QA tool but also for Nominatim and other projects.
I would also like to thank Jochen Topf (joto), who developed osmoscope-ui, which we used a lot initially and which was very helpful in developing our own web app. Jochen is also one of the authors of mapbox/vtzero, the library I used to encode vector tiles to the .mvt/pbf format in the clustering-vt module.
Of course I would also like to thank the whole OpenStreetMap and Mapbox community for their work.
The end
Thank you for reading this post to the end. I hope you enjoyed it and I hope you will enjoy the new Nominatim QA Tool!