Nominatim QA Analyser Tool - GSoC'21 Update
Posted by AntoJvlt on 11 July 2021 in English.

Hello everyone! I would like to give an update on my project “Nominatim QA Analyser Tool”, which is progressing very well.
As a recap, this project aims to build a tool capable of analysing Nominatim’s database to extract suspicious data from it. These data should then be presented to mappers through a graphical interface so that they can correct them.
GitHub repository
The tool is still under development; it lacks tests, documentation, configuration, etc. However, if you are interested, you can access the GitHub repository here: https://github.com/AntoJvlt/Nominatim-Data-Analyser
Public visualization of suspicious data
We chose to use Osmoscope as the main visualization tool for the data we extract with the Nominatim QA Analyser.
I have set up an instance of Osmoscope on the development server which was provided to me for this GSoC project. This instance is publicly available here: https://gsoc2021-qa.nominatim.org/osmoscope
You are free to look at it and start fixing some data errors around you!
/!\ Here is some important information to know about this public instance /!\
- The OSM data on the development server are not regularly updated; the current data were imported around May 24th, 2021.
- This is the Osmoscope instance that I use for development, which means it can be down or the data can be under test at any point, so it is not fully reliable.
- If you want to follow the evolution of this Osmoscope instance throughout the development, don’t forget to refresh your browser’s cache for this webpage to get the latest changes.
- The QA rules are not definitive. If you find data that you think are not wrong, or if you want to discuss the QA rules, please come to the discussions section of Nominatim’s GitHub page. Suggestions for new QA rules are also welcome.
How the Nominatim QA Analyser tool works
In this section, I will cover some technical aspects of the Nominatim QA Analyser Tool, focusing on the most important points.
Pipe structure
In order to have a flexible architecture and reusable components, I went for a pipe structure: each rule is represented as a pipeline where each pipe is a processing task which sends its result to the next pipe.
The most used pipes that we currently have are the following:
- SQLProcessor, which is responsible for executing an SQL query, converting the results into data objects and returning them.
- GeoJSONFeatureConverter, which converts the input data into GeoJSON features using the geojson Python library.
- GeoJSONFormatter, which takes a list of GeoJSON features as input and creates a GeoJSON file (again using the geojson Python library).
- LayerFormatter, which creates an Osmoscope layer file with the right metadata inside.
With this set of pipes, as an example, we can have a rule with pipes plugged in this order: SQLProcessor -> GeoJSONFeatureConverter -> GeoJSONFormatter -> LayerFormatter.
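To illustrate the idea, here is a minimal sketch of such a pipe in Python. The class and method names are my own for illustration and do not necessarily match the project’s actual code:

```python
# Minimal sketch of the pipe idea; names are illustrative, not the
# project's real API.
class Pipe:
    def __init__(self, next_pipe=None):
        self.next_pipe = next_pipe

    def process(self, data):
        """Run this pipe's task and forward the result to the next pipe."""
        result = self.on_process(data)
        if self.next_pipe is not None:
            return self.next_pipe.process(result)
        return result

    def on_process(self, data):
        """Overridden by concrete pipes such as an SQL processor."""
        raise NotImplementedError
```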
YAML Rule Specification
In order to reduce the amount of code needed to add a new QA rule and to make the process easier, I introduced the YAML rule specification.
Each rule is defined inside a YAML file. This YAML file follows a tree structure where each node is a pipe, and each node can have one or multiple children defined in the “out” property of the node. Here is an example for the QA rule “boundary=administrative without admin_level”:
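A sketch of what such a specification might look like; the exact keys, pipe names and SQL query below are assumptions for illustration based on the description above:

```yaml
# Hypothetical sketch of a rule specification; the real files may use
# different keys and queries.
QUERY:
  type: SQLProcessor
  query: >
    SELECT osm_id, geometry FROM placex
    WHERE class = 'boundary' AND type = 'administrative'
      AND admin_level IS NULL
  out:
    CONVERTER:
      type: GeoJSONFeatureConverter
      out:
        GEOJSON:
          type: GeoJSONFormatter
          out:
            LAYER:
              type: LayerFormatter
```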
When executing a rule, the QA analyser takes the corresponding YAML specification file and parses it. The parsing is done by the deconstructor module, which goes through the tree structure and sends events when it reaches a new node and when it backtracks through the tree to an upper node.
The assembler module subscribes to the deconstructor and is responsible for assembling the nodes, instantiating the right pipes and plugging them together in the right order. All of this works smoothly because the deconstructor sends the nodes by following the tree structure, so they arrive in the right order.
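As a rough illustration of this event flow, here is a sketch, again with made-up names and simplified to one child per node:

```python
# Hedged sketch of the deconstructor/assembler interplay; class, method
# and event names are assumptions, not the project's real API.
class Deconstructor:
    """Walks the parsed YAML tree depth-first and emits events."""

    def __init__(self):
        self.subscribers = []

    def subscribe(self, subscriber):
        self.subscribers.append(subscriber)

    def deconstruct(self, node):
        for sub in self.subscribers:
            sub.on_new_node(node)
        # Children live under the "out" property of each node.
        for child in node.get('out', {}).values():
            self.deconstruct(child)
        for sub in self.subscribers:
            sub.on_backtrack()


class Assembler:
    """Instantiates pipes and plugs them together as events arrive."""

    def __init__(self, pipe_factory):
        self.pipe_factory = pipe_factory  # maps a node to a Pipe instance
        self.stack = []
        self.first_pipe = None

    def on_new_node(self, node):
        pipe = self.pipe_factory(node)
        if self.stack:
            # Plug the new pipe after its parent (single-child case).
            self.stack[-1].next_pipe = pipe
        else:
            self.first_pipe = pipe
        self.stack.append(pipe)

    def on_backtrack(self):
        self.stack.pop()
```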
All of this YAML specification is made possible by the pipe structure that I set up before.
Vector Tiles
Some of the rules return a lot of results, so in order to display them properly through the Osmoscope instance without killing the browser, I had to add a vector tiles output to the tool. This was done by implementing the VectorTileConverter pipe.
I decided to use Tippecanoe from Mapbox because it is very efficient and very easy to use for converting a GeoJSON file into vector tiles. For now, the VectorTileConverter pipe takes a GeoJSON file as input and calls Tippecanoe from the command line to convert it to vector tiles automatically.
This is probably not the most efficient way to do this but it works well for now.
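Conceptually, the conversion step boils down to something like the following sketch; the paths and flags are illustrative, and the actual pipe may invoke Tippecanoe differently:

```python
# Hedged sketch of shelling out to Tippecanoe; the real pipe may use
# different flags and paths.
import subprocess

def convert_to_vector_tiles(geojson_path, output_dir):
    subprocess.run(
        ['tippecanoe',
         '-e', output_dir,  # write tiles into a directory
         '-f',              # overwrite any existing output
         geojson_path],
        check=True)
```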
What should be done next
Here is a list of things that need to be done in the second part of this project:
- Add test coverage for the tool (without testing each rule independently).
- Add documentation.
- Add configuration files.
- Make each rule execute in its own thread to parallelize query execution.
- Finish implementing all the rules already defined.
What might be done next
Here is a list of things that might be done next depending on the direction we want to take for this project:
- Keep working on the Osmoscope project to make it better, maybe also by cleaning up the code and upgrading the UI design.
- Add other data outputs, such as MapRoulette challenges.
- Keep upgrading the analyser tool framework.
Acknowledgments
I would like to thank my mentors, Sarah Hoffmann (lonvia) and Marc Tobias (mtmail), who help me a lot with this project. A special thanks to Sarah Hoffmann, who helps me a lot with writing the Nominatim database queries for the rules and also helps me to better understand the OSM data.
Comment from tordans on 26 July 2021 at 08:12
Hi AntoJvlt, thanks for sharing!
How can I resolve an issue on https://gsoc2021-qa.nominatim.org/osmoscope/#map=14.618837772215969/13.40799/52.49792&l=https://gsoc2021-qa.nominatim.org/QA-data/same_wikidata/osmoscope-layer/layer.json as false positive? Example: https://www.openstreetmap.org/node/473867813 (State Berlin) and https://www.openstreetmap.org/node/240109189 (City Berlin) both reference https://www.wikidata.org/entity/Q64?uselang=de which represents State + City. Being able to mark cases like this as false positive will make it easier to work with the QA Tool.
Comment from AntoJvlt on 23 August 2021 at 00:44
Hi tordans,
I am sorry for the late answer.
We don’t have a feature to report false positives yet, but it is planned for the near future. We now have our own web app, which will make it easier to add this feature.
However, if you find a big chunk of data which should be considered as false positives, and if you think we could have a good way to identify these data, please do not hesitate to come talk about it in the issues section of the repository. Maybe we could add some logic to the whole rule in order to avoid fetching such data.
Comment from tordans on 26 August 2021 at 17:58
Thanks AntoJvlt!