As we reach the end of Google Summer of Code 2021, I would like to share with you the work done on my project “Nominatim QA Analyser”.
Nominatim is OSM’s main geocoding software, used to process geocoding requests on OpenStreetMap data. The software uses its own database schema, which differs from the one used by the main OSM database. As a result, Nominatim processes OSM data in a way that makes it possible to discover a lot of inconsistencies. The idea of the project was to build a Quality Assurance tool which analyses the Nominatim database and extracts errors from the OSM data based on a set of rules.
These extracted errors are then made available to everyone through a visual map, so that OSM mappers can correct them easily.
If you are curious about the result of the project, you can check out a running instance here: https://gsoc2021-qa.nominatim.org/front/. It serves up-to-date data generated by the backend analyser directly on the production server of Nominatim.
The web application will be set up on the official nominatim.org website later this month or in early September. I will update the previously provided link in this diary entry once that is done.
What has been done
The project is divided into two parts: the first is the main Data Analyser tool, which runs on the Nominatim server, and the second is the web application used to visualize the data on a map.
The GitHub repository of the Data Analyser can be found here: osm-search/Nominatim-Data-Analyser. All commits in this repository are my own work. You are free to contribute and help make the tool better.
The development of this analyser includes:
- A pipe-based architecture developed in Python, with a YAML Rule Specification system used to easily add and customize QA rules. Learn more about this in the mid-term project update I wrote.
- The QA rules defined by @lonvia and checked off in Nominatim issue #1848, which were implemented in the analyser.
- Good documentation of the tool, available here.
- About 95% test coverage for the main Python code.
- Continuous integration, set up in the GitHub repository with GitHub Actions; it builds the tool and runs the tests for every push/pull request.
As a bonus, I had time to develop a custom C++ module which generates clusters from a set of point data (in GeoJSON format) and then generates vector tiles from these clusters in the Mapbox Vector Tiles format. This module is named clustering-vt and the code can be found in the Nominatim Data Analyser GitHub repository. I used mapbox/supercluster.hpp and mapbox/vtzero as the main libraries for this module, and I can say that they are amazing libraries. We initially used Tippecanoe to generate the vector tiles, but I wasn’t satisfied with the results we had. I wanted to use Supercluster, so I developed clustering-vt.
Initially, we used osmoscope-ui as the website to display the data extracted by the analyser. Later in the project, I wanted to switch to a custom web application, which allows us to have a custom UI, features and side information. This was made a lot easier because we kept the layer definitions of Osmoscope.
This web application is developed with React. The code can be found here: osm-search/Nominatim-Data-Analyser-Frontend. All the code in this repository is my own work. You are free to contribute and help make the tool better.
Please see the link at the top of this post for a running instance of this web application.
What have I learned?
I have learned a lot about the open source world. It was my first real open source experience. I discovered how many open source projects there are out there, the communities around them, and how people work together to produce such good projects. I really enjoyed that and I definitely want to continue this journey.
I also learned many things related to OpenStreetMap/Mapbox and the many projects built around them. More generally, I have learned a lot about map systems: how maps work, how vector tiles work, what tile servers are, how maps are rendered, etc.
I worked in C++ without much prior experience with it. It was needed when I developed the clustering-vt module. It was not an easy step, as I used libraries from the Mapbox C++ ecosystem and I was not familiar with it at all. I dived into the source code of these libraries and learned many things. It made me want to use C++ more and get better at it; I am really happy that it happened.
It was the first time I set up continuous integration and the first time I used GitHub Actions. It is so helpful.
And of course I learned a lot more about Python development, PostgreSQL, YAML, etc.
What to do next
Issues have been opened on the repositories, especially for the analyser, covering what currently needs to be done to improve the tool.
The main things to do currently are the following:
- Add multithreading to make the analyser run much faster, as what takes the most time to execute are the PostgreSQL queries and the vector tile generation (external clustering-vt module).
- Add tests and documentation to the clustering-vt module and to the frontend code.
- Optimize the clustering-vt algorithm to make it run faster.
- Add a “report as false positive” feature to the tool.
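As a sketch of the multithreading point above, rules could be dispatched to a thread pool. The `execute_rule` function and rule names below are placeholders, not the analyser's real API; threads are a reasonable fit here because most of the waiting happens outside Python, in PostgreSQL and the external clustering-vt process:

```python
from concurrent.futures import ThreadPoolExecutor

def execute_rule(rule_name):
    # Placeholder: in the real analyser this would run the rule's SQL
    # query and generate its vector tiles, both of which mostly wait on
    # external work (the database server, the clustering-vt process).
    return f"{rule_name}: done"

# Hypothetical rule names, for illustration only.
rules = ["rule-1", "rule-2"]

# Run every rule concurrently; results come back in the input order.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(execute_rule, rules))
```

Since each rule is an independent pipeline, no shared state needs to be coordinated between threads.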
I would like to thank my mentors, Sarah Hoffmann (lonvia) and Marc Tobias (mtmail), who have helped me a lot with this project. It was amazing working with them and I hope we can continue like this in the future, for the QA tool but also for Nominatim and other projects too.
I would also like to thank Jochen Topf (joto), who developed osmoscope-ui, which we used a lot initially and which was very helpful for developing our own web app. Jochen is also one of the authors of mapbox/vtzero, the library I used to encode vector tiles in the .mvt/pbf format in the clustering-vt module.
Of course I would also like to thank the whole OpenStreetMap and Mapbox community for their work.
Thank you for reading this post to the end. I hope you enjoyed it and I hope you will enjoy the new Nominatim QA Tool!
Hello everyone! I would like to give an update on my project “Nominatim QA Analyser Tool”, which is progressing very well.
As a recap, this project aims to build a tool capable of analysing the Nominatim database to extract suspicious data from it. This data is then presented to mappers through a graphical interface so that they can correct it.
The tool is still under development; it lacks tests, documentation, configuration, etc. However, you can access the GitHub repository here if you are interested: https://github.com/AntoJvlt/Nominatim-Data-Analyser
Suspicious data public visualization
We chose to use Osmoscope as the main visualization tool for the data we extract with the Nominatim QA Analyser.
I have set up an instance of Osmoscope on the development server which was provided to me for this GSoC project. This instance is publicly available here: https://gsoc2021-qa.nominatim.org/osmoscope. You are free to look at it and start fixing some data errors around you!
/!\ Here is some important information to know about this public instance /!\
- The OSM data on the development server are not regularly updated. The current OSM data were imported around May 24th 2021.
- This is the Osmoscope instance that I use for development, which means it can be down or the data can be part of some tests at any point, so it is not fully reliable.
- If you want to follow the evolution of this Osmoscope instance throughout the development, don’t forget to refresh your browser’s cache for this webpage to get the latest changes.
- The QA rules are not final. If you find data that you think is not actually wrong, or if you want to discuss the QA rules, please come to the discussions section of the Nominatim GitHub page. QA rule suggestions are also welcome.
How the Nominatim QA Analyser tool works
In this section, I will cover some technical aspects of the Nominatim QA Analyser tool, focusing on the most important points.
In order to have a flexible architecture with reusable components, I went for a pipe structure: one rule is represented as a pipeline where each pipe is a processing task which sends its result to the next pipe.
The most used pipes that we currently have are the following:
- SQLProcessor, which is responsible for executing an SQL query, converting the results into data objects and returning them.
- GeoJSONFeatureConverter, which converts the input data into GeoJSON features from the geojson Python library.
- GeoJSONFormatter, which takes a list of GeoJSON features as input and creates a GeoJSON file (again using the geojson Python library).
- LayerFormatter, which creates an Osmoscope layer file with the right metadata inside.
With this set of pipes, as an example, we can have a rule with pipes plugged in this order: SQLProcessor -> GeoJSONFeatureConverter -> GeoJSONFormatter -> LayerFormatter.
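A minimal Python sketch of this pipe chain might look like the following; the class names mirror the pipes above, but the code is illustrative and not the tool's actual implementation:

```python
# Illustrative sketch of the pipe-based architecture; not the real classes.

class Pipe:
    """A processing task that forwards its result to the next pipe."""
    def __init__(self):
        self.next_pipe = None

    def plug(self, next_pipe):
        self.next_pipe = next_pipe
        return next_pipe  # allows chaining: a.plug(b).plug(c)

    def process(self, data):
        raise NotImplementedError

    def run(self, data=None):
        result = self.process(data)
        if self.next_pipe:
            return self.next_pipe.run(result)
        return result

class SQLProcessor(Pipe):
    def process(self, data):
        # In the real tool this executes a query against the Nominatim
        # database; here we fake a single result row.
        return [{"osm_id": 1, "lon": 2.35, "lat": 48.85}]

class GeoJSONFeatureConverter(Pipe):
    def process(self, rows):
        # Convert each database row into a GeoJSON Feature dict.
        return [{"type": "Feature",
                 "geometry": {"type": "Point",
                              "coordinates": [r["lon"], r["lat"]]},
                 "properties": {"osm_id": r["osm_id"]}}
                for r in rows]

# Plug the pipes in order and run the pipeline.
head = SQLProcessor()
head.plug(GeoJSONFeatureConverter())
features = head.run()
```

Each pipe only knows about its successor, which is what makes the components reusable across rules.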
YAML Rule Specification
In order to reduce the amount of code needed to add a new QA rule and to make it easier, I introduced the YAML rule specification.
Each rule is defined inside a YAML file. This YAML file follows a tree structure where each node is a pipe, and each node can have one or multiple children defined in the “out” property of the node. Here is an example for the QA rule “boundary=administrative without admin_level”:
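A hypothetical specification illustrating this tree structure could look like the fragment below; the node names, keys and query are illustrative, not the tool's exact schema:

```yaml
# Hypothetical spec for "boundary=administrative without admin_level";
# keys, node names and the query are illustrative only.
QUERY:
  type: SQLProcessor
  query: >
    SELECT osm_id, ST_X(centroid) AS lon, ST_Y(centroid) AS lat
    FROM placex
    WHERE class = 'boundary' AND type = 'administrative'
      AND admin_level IS NULL
  out:
    CONVERT:
      type: GeoJSONFeatureConverter
      out:
        FORMAT:
          type: GeoJSONFormatter
          out:
            LAYER:
              type: LayerFormatter
```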
When executing a rule, the QA analyser takes the corresponding YAML specification file and parses it. The parsing is done by the deconstructor module, which walks through the tree structure and sends events when it reaches a new node and when it backtracks through the tree to an upper node.
The assembler module subscribes to the deconstructor and is responsible for assembling the nodes: instantiating the right pipes and plugging them together in the right order. All of that works smoothly because the deconstructor sends nodes following the tree structure, so they arrive in the right order.
All of this YAML specification is made possible by the pipe structure I set up before.
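A simplified sketch of this deconstructor/assembler interplay is shown below; the function, class and event names are illustrative, not the project's real modules:

```python
# Illustrative sketch: a depth-first walk over the parsed YAML tree
# that emits events, and a listener that records the visit order.

def deconstruct(name, node, listener):
    """Walk the rule tree depth-first, notifying the listener of
    new nodes and of backtracking steps."""
    listener.on_new_node(name, node)
    for child_name, child in node.get("out", {}).items():
        deconstruct(child_name, child, listener)
    listener.on_backtrack(name)

class Assembler:
    """Receives nodes in tree order; the real assembler would
    instantiate the pipe named in node["type"] and plug it in."""
    def __init__(self):
        self.order = []

    def on_new_node(self, name, node):
        self.order.append(name)

    def on_backtrack(self, name):
        pass  # the real assembler moves back up to the parent pipe

# A tiny parsed spec (as a plain dict) with one child under "out".
spec = {"QUERY": {"type": "SQLProcessor",
                  "out": {"CONVERT": {"type": "GeoJSONFeatureConverter"}}}}

assembler = Assembler()
for name, node in spec.items():
    deconstruct(name, node, assembler)
```

Because the walk is depth-first, the assembler always sees a parent before its children, so plugging pipes in arrival order yields the right pipeline.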
Some of the rules return a lot of results, so in order to display them properly through the Osmoscope instance without killing the browser, I had to add a vector tiles output to the tool. This was done by implementing the VectorTileConverter pipe.
I decided to use Tippecanoe from Mapbox because it is very efficient and very easy to use for converting a GeoJSON file into vector tiles. For now, the VectorTileConverter pipe takes a GeoJSON file as input and calls Tippecanoe from the command line to convert it to vector tiles automatically.
This is probably not the most efficient way to do this but it works well for now.
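Such a command-line invocation could look like the sketch below; the file paths and the exact flag set are illustrative, and the pipe's real invocation may differ:

```python
import subprocess

def tippecanoe_command(geojson_path, out_dir):
    """Build an illustrative Tippecanoe command line that writes a
    directory of vector tiles from a GeoJSON file."""
    return ["tippecanoe",
            "-e", out_dir,                # output a directory of tiles
            "-zg",                        # let Tippecanoe guess max zoom
            "--drop-densest-as-needed",   # thin dense areas at low zooms
            "-f",                         # overwrite existing output
            geojson_path]

def geojson_to_vector_tiles(geojson_path, out_dir):
    # Run Tippecanoe and raise if the conversion fails.
    subprocess.run(tippecanoe_command(geojson_path, out_dir), check=True)

# Hypothetical paths, for illustration only.
cmd = tippecanoe_command("layer.geojson", "tiles/")
```

Shelling out like this keeps the pipe simple at the cost of a per-rule process launch, which is part of why it is "probably not the most efficient way".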
What should be done next
Here is a list of things that need to be done in the second part of this project:
- Add test coverage for the tool (without testing each rule independently).
- Add documentation.
- Add configuration files.
- Make each rule execute in its own thread to parallelize query execution.
- Finish implementing all the rules already defined.
What might be done next
Here is a list of things that might be done next depending on the direction we want to take for this project:
- Keep working on the Osmoscope project to make it better, for example by cleaning up the code and upgrading the UI design.
- Add other data outputs, such as MapRoulette challenges.
- Keep upgrading the analyser tool framework.
I would like to thank my mentors, Sarah Hoffmann (lonvia) and Marc Tobias (mtmail), who have helped me a lot with this project. Special thanks to Sarah Hoffmann, who helped me a lot with writing the Nominatim database queries for the rules and with understanding the OSM data better.
I will be sharing my GSoC’21 journey with OpenStreetMap through this diary. Feel free to ask me questions if you have any, I would be delighted to answer them!
Who am I?
I am Antonin Jolivat, a French student at an engineering school specializing in computer science.
Passionate about software engineering, I spend a lot of time discovering new domains and creating personal projects related to them. In the rest of my time, I like to read about what other people are doing, entertain myself with various things, and work out to stay in shape.
I came across OpenStreetMap while working on a project involving maps, and it stayed in my mind. I recently got a lot more involved when I chose OSM as my organization for GSoC’21, and I am very happy to have made this choice! :)
My GSoC Project: QA Reports Extraction Tool for Nominatim
This project mainly concerns Nominatim, but it will help to increase overall OSM data quality.
Nominatim is OSM’s main geocoding software, used to process geocoding requests on OpenStreetMap data. The software uses its own database schema, which differs from the one used by the main OSM database. As a result, Nominatim processes OSM data in a way that makes it possible to discover a lot of inconsistencies. The idea of the project is to build a Quality Assurance tool which will analyse the Nominatim database and extract errors from the OSM data.
Those errors would be made available to everyone through a map, so that OSM mappers will be able to correct them easily.
In the future, we could even imagine sharing the data errors through other tools, for example in the form of challenges suitable for https://maproulette.org/.
The main goals of the project are:
- Supporting quality assurance rules like the ones already presented there.
- The tool should be modular so that it is easy to add a new rule.
- It should run fast, because it will be executed once per day while regular updates on the Nominatim database are stopped.
- We should have an Osmoscope instance accessible to everyone. This instance should show the data errors previously extracted by the QA tool.
I think that this project will be very beneficial for the whole OSM community and not only to Nominatim as it will help to increase OSM data quality.
I will add new entries to this diary as the project goes along. I hope this topic is interesting for you. Thank you for reading!