Hello everyone! I would like to give an update on my project “Nominatim QA Analyser Tool”, which is progressing very well.
As a recap, this project aims to build a tool capable of analysing the Nominatim database to extract suspicious data from it. These data should then be presented to mappers through a graphical interface so that they can correct them.
The tool is still under development: it lacks tests, documentation, configuration, etc. However, if you are interested, you can access the GitHub repository here: https://github.com/AntoJvlt/Nominatim-Data-Analyser
Suspicious data public visualization
We chose to use Osmoscope as the main visualization tool for the data we extract with the Nominatim QA Analyser.
I have set up an instance of Osmoscope on the development server which was provided to me for this GSoC project. This instance is publicly available here: https://gsoc2021-qa.nominatim.org/osmoscope You are free to look at it and start fixing some data errors around you!
/!\ Here is some important information to know about this public instance /!\
- The OSM data on the development server are not regularly updated. The current OSM data were imported around May 24th 2021.
- This is the Osmoscope instance that I use for development. This means that it can be down or that the data can be in a test state at times, so it should not be fully trusted.
- If you want to follow the evolution of this Osmoscope instance throughout the development, don’t forget to refresh your browser’s cache for this webpage to get the latest changes.
- The QA rules are not final. If you find data that you think are not wrong, or if you want to discuss the QA rules, please come to the discussions section of Nominatim’s GitHub page. Suggestions for QA rules are also welcome.
How does the Nominatim QA Analyser tool work
In this section, I will talk about some technical aspects of the Nominatim QA Analyser Tool and I will focus on the most important points.
In order to have a flexible architecture and reusable components, I went for a pipe structure. Therefore, one rule is represented as a pipeline where each pipe is a processing task which sends its result to the next pipe.
The most used pipes that we currently have are the following:
- SQLProcessor which is responsible for executing an SQL query, converting the results into data objects and returning them.
- GeoJSONFeatureConverter which converts the input data into GeoJSON features from the geojson python library.
- GeoJSONFormatter which takes a list of GeoJSON features as input and creates a GeoJSON file (again using the geojson python library).
- LayerFormatter which creates an Osmoscope layer file with the right metadata inside.
With this set of pipes, as an example, we can have a rule with pipes plugged in this order: SQLProcessor -> GeoJSONFeatureConverter -> GeoJSONFormatter -> LayerFormatter.
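To make the pipe idea more concrete, here is a minimal sketch of how such a chain could be wired. It is not the tool's actual code: the `Pipe` base class, its `plug`, `process` and `run` methods, and the three example pipes are names assumed for illustration only. The SQL step is replaced by hard-coded rows, and plain dicts stand in for the geojson library's objects.

```python
import json


class Pipe:
    """A processing step that forwards its result to the next plugged pipe."""

    def __init__(self):
        self.next_pipe = None

    def plug(self, pipe):
        """Plug `pipe` after this one and return it so calls can be chained."""
        self.next_pipe = pipe
        return pipe

    def process(self, data):
        """Transform the data; subclasses override this."""
        return data

    def run(self, data=None):
        """Process the data, then hand the result to the next pipe, if any."""
        result = self.process(data)
        if self.next_pipe is None:
            return result
        return self.next_pipe.run(result)


class FakeSQLProcessor(Pipe):
    """Stands in for SQLProcessor: returns hard-coded rows instead of querying."""

    def process(self, data):
        return [{"osm_id": 1, "lon": 2.35, "lat": 48.85}]


class FeatureConverter(Pipe):
    """Turns each row into a GeoJSON-style Point feature (plain dicts)."""

    def process(self, rows):
        return [
            {
                "type": "Feature",
                "geometry": {"type": "Point", "coordinates": [row["lon"], row["lat"]]},
                "properties": {"osm_id": row["osm_id"]},
            }
            for row in rows
        ]


class CollectionFormatter(Pipe):
    """Serialises the features as a GeoJSON FeatureCollection string."""

    def process(self, features):
        return json.dumps({"type": "FeatureCollection", "features": features})


# Plug the pipes in order, then push data through from the first one.
first = FakeSQLProcessor()
first.plug(FeatureConverter()).plug(CollectionFormatter())
output = first.run()
print(output)
```

Because every pipe only knows about the next one, a new rule can reuse the same converters and formatters and only swap the first step.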
YAML Rule Specification
In order to reduce the amount of code needed to add a new QA rule and to make it easier, I introduced the YAML rule specification.
Each rule is defined inside a YAML file. This YAML file follows a tree structure where each node is a pipe and each node can have one or multiple children defined in the “out” property of the node. Here is an example for the QA rule “boundary=administrative without admin_level”:
When executing a rule, the QA analyser takes the corresponding YAML specification file and parses it. The parsing is done by the deconstructor module, which goes through the tree structure and sends events when it reaches a new node and when it backtracks through the tree to an upper node.
The assembler module subscribes to the deconstructor and is responsible for assembling the nodes, instantiating the right pipes, and plugging them in the right order. This works smoothly because the deconstructor sends the nodes by following the tree structure, so they arrive in the right order.
All of this YAML specification is made possible because of the pipe structure that I have set up before.
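The interplay between the deconstructor and the assembler can be sketched as follows. This is only an illustration under assumptions: the class and method names (`subscribe`, `deconstruct`, `on_new_node`, `on_backtrack`) are hypothetical, the rule tree is a made-up example written as the nested dict a YAML parser would return, and the assembler only records which node gets plugged into which instead of instantiating real pipes.

```python
class Deconstructor:
    """Walks a rule tree depth-first and notifies subscribers of each step."""

    def __init__(self):
        self.subscribers = []

    def subscribe(self, subscriber):
        self.subscribers.append(subscriber)

    def deconstruct(self, name, node):
        # Announce the node, visit its children, then announce the backtrack.
        for subscriber in self.subscribers:
            subscriber.on_new_node(name, node)
        for child_name, child in node.get("out", {}).items():
            self.deconstruct(child_name, child)
        for subscriber in self.subscribers:
            subscriber.on_backtrack()


class Assembler:
    """Rebuilds parent -> child links from the deconstructor's events."""

    def __init__(self):
        self.stack = []  # names on the path from the root to the current node
        self.links = []  # (parent name, child name) pairs, in plugging order

    def on_new_node(self, name, node):
        if self.stack:
            self.links.append((self.stack[-1], name))
        self.stack.append(name)

    def on_backtrack(self):
        self.stack.pop()


# Hypothetical rule tree; node names and layout are assumptions.
rule = {
    "type": "SQLProcessor",
    "out": {
        "FEATURES": {
            "type": "GeoJSONFeatureConverter",
            "out": {
                "GEOJSON": {
                    "type": "GeoJSONFormatter",
                    "out": {
                        "LAYER": {"type": "LayerFormatter"},
                        "TILES": {"type": "VectorTileConverter"},
                    },
                }
            },
        }
    },
}

deconstructor = Deconstructor()
assembler = Assembler()
deconstructor.subscribe(assembler)
deconstructor.deconstruct("QUERY", rule)
print(assembler.links)
```

The stack is what lets the assembler know the current parent: a new node is always plugged into whatever sits on top of the stack, and a backtrack event simply pops it.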
Some of the rules return a lot of results, so in order to display them properly through the osmoscope instance without killing the browser, I had to add a vector tiles output to the tool. This was done by implementing the VectorTileConverter pipe.
I decided to use Tippecanoe from Mapbox because it is very efficient and very easy to use for converting a GeoJSON file into vector tiles. For now, the VectorTileConverter pipe takes a GeoJSON file as input and calls Tippecanoe from the command line to convert the file to vector tiles automatically.
This is probably not the most efficient way to do this but it works well for now.
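For readers curious about what calling Tippecanoe from Python can look like, here is a small sketch. The function names, file paths and the particular set of flags are my assumptions for illustration and not necessarily what the VectorTileConverter pipe uses. Running the conversion requires Tippecanoe to be installed, so the example only builds and prints the command.

```python
import subprocess
from pathlib import Path


def build_tippecanoe_command(geojson_path, output_dir):
    """Build the Tippecanoe command line without running it."""
    return [
        "tippecanoe",
        "--force",                    # overwrite output from a previous run
        "--output-to-directory", str(output_dir),  # write tiles as a directory tree
        "-zg",                        # let Tippecanoe guess a suitable maximum zoom
        "--drop-densest-as-needed",   # thin out dense areas to keep tiles small
        str(geojson_path),
    ]


def convert_to_vector_tiles(geojson_path, output_dir):
    """Run Tippecanoe; raises CalledProcessError if the conversion fails."""
    subprocess.run(build_tippecanoe_command(geojson_path, output_dir), check=True)


# Hypothetical input and output paths.
command = build_tippecanoe_command(Path("rule.geojson"), Path("tiles") / "rule")
print(" ".join(command))
```

Passing the command as a list rather than a single string avoids shell quoting problems with file names.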
What should be done next
Here is a list of things that need to be done in the second part of this project:
- Add test coverage for the tool (without testing each rule independently).
- Add documentation.
- Add configuration files.
- Make each rule execute in its own thread to parallelize query execution.
- Finish implementing all the rules already defined.
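As a rough idea of what the threading item in the list above could look like, here is a sketch using Python's standard `concurrent.futures` module. Nothing is implemented yet, so `execute_rule` and the rule names are placeholders, and the sleep stands in for a slow database query.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def execute_rule(rule_name):
    """Placeholder for running one rule; the sleep stands in for a slow query."""
    time.sleep(0.1)
    return f"{rule_name}: done"


rule_names = ["rule_a", "rule_b", "rule_c"]  # hypothetical rule names

with ThreadPoolExecutor(max_workers=len(rule_names)) as executor:
    # map() runs the calls concurrently and yields results in input order.
    results = list(executor.map(execute_rule, rule_names))

print(results)
```

Threads fit this case because the rules spend most of their time waiting for the database, so the waiting can overlap.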
What might be done next
Here is a list of things that might be done next depending on the direction we want to take for this project:
- Keep working on the Osmoscope project to make it better, maybe also by cleaning up the code and upgrading the UI design.
- Add other data outputs, like Maproulette challenges for example.
- Keep upgrading the analyser tool framework.
I would like to thank my mentors, Sarah Hoffmann (lonvia) and Marc Tobias (mtmail), who have helped me a lot with this project. Special thanks to Sarah Hoffmann, who has helped me a lot with writing the Nominatim database queries for the rules and with understanding the OSM data better.