fititnt's Diary

Early feedback welcomed: open source tool for Spatial Data Matching with OpenStreetMap Schema

Posted by fititnt on 11 June 2024 in English. Last updated on 16 June 2024.

The public version is at https://sdm.etica.ai/v/0.5/. I made an effort to make it easy and very cheap to host (it is currently a client-side static vanilla JavaScript + HTML app) and, as a side effect, your data stays private.

Since I joined OpenStreetMap in 2022, I’ve made some tools without a graphical interface. For this one, I’d love to receive feedback from potential users, even though the topic is very niche.

Already in its early versions (I started a prototype in 2023 as a mere debugging aid for the real conflation, which was done non-interactively before loading the data into OSM editors), I genuinely tried to think of how to make it a plugin for JOSM or how to extend iD, instead of keeping it side by side with iD or alt-tabbing with JOSM.

The good news: it has basic support for using one or more files to match, by distance and/or by tagging, against one or more target files, and then you download the GeoJSON. Okay, addr:street would need language- and country-level comparison (because of misspellings), and addr:postcode could also get logic to tolerate near matches. If you know enough vanilla JavaScript to code a function for your country, then the matching could be more forgiving.
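As an illustration of what such a country-specific function could look like, here is a minimal sketch for Portuguese street names. The function names and the abbreviation table are my assumptions for this example, not part of the app: it strips accents, lowercases, and expands a few common street-type abbreviations before comparing.

```javascript
// Hypothetical abbreviation table (assumption): extend it for your country.
const STREET_ABBREVIATIONS_PT = new Map([
  ["r", "rua"],
  ["av", "avenida"],
  ["trav", "travessa"],
  ["rod", "rodovia"],
]);

// Normalize an addr:street value so that small spelling differences compare equal.
function normalizeStreet(value) {
  return value
    .normalize("NFD")                // split letters from their accents
    .replace(/[\u0300-\u036f]/g, "") // drop the combining accent marks
    .toLowerCase()
    .replace(/[.,]/g, " ")           // punctuation becomes whitespace
    .split(/\s+/)
    .filter(Boolean)
    // Only the first word is treated as a street type that may be abbreviated.
    .map((word, index) =>
      index === 0 && STREET_ABBREVIATIONS_PT.has(word)
        ? STREET_ABBREVIATIONS_PT.get(word)
        : word)
    .join(" ");
}

// Hypothetical comparison function: true when both values normalize to the same text.
function sameStreet(a, b) {
  return normalizeStreet(a) === normalizeStreet(b);
}
```

With this sketch, "R. Catarino Andreatta" and "Rua Catarino Andreatta" compare as the same street. Real misspellings would need something more tolerant, such as an edit-distance threshold.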

The bad news: for points of interest, the so-called “edgematch links”, “rubber sheeting links” or whatever term is used for an export file saying “these 0-N items in dataset A match these 0-N items in dataset B” necessarily need a human in the loop, and this happens in unpredictable ways. Links which aren’t an obvious 1-to-1 (while there’s room for suggestions) require human input. The app started as the “typical Leaflet” map plus a text-only view, but we might need a way to visualize N:M links (unless any of you has a UI suggestion to plot such links directly over pins on a map!).

This diary is less about one implementation targeting a topic and more about gathering suggestions, including realistic feedback on failed attempts. I love the human creativity involved in merging different information into something that could be given to OpenStreetMap or Wikidata and/or given back to the open data providers whose data needs review.

1. Quick overview of other tools and how this initial release fits in

Context: by citing other tools (which, trust me, not only have different approaches but also focus on different challenges), I hope to be helpful in case your use case is already better served by them. Or, as in the title “Early feedback welcomed”, this could help others suggest improvements here, such as how to present the interface.

I’d assume those more likely to be interested in this topic already have some knowledge of OpenStreetMap Conflation or Wikidata Imports.

One blog post comparing some tools that is really worth reading is https://junglebus.io/MobilityData/benchmarks/Benchmark%20of%20existing%20open%20source%20solutions%20for%20conflating%20structured,%20geographical%20and%20transit%20data.html, for which I would TL;DR how this tool would fit in:

table 1

Tool	Ecosystem	Object Type
SDM	OpenStreetMap, Wikidata	Point

table 2

	SDM
need to match the dataset with OSM model	yes
use an identifier existing in both dataset	possible, not mandatory
investigate each output element	needed
collaborative review	no *
visualization of the conflation output	+ **
visualization of each output element	+ **
language	JavaScript
user interface	dedicated webapp, client-side, works offline
License	AGPL-3 ***

*: if there’s interest, it would eventually be feasible to export JSON or GeoJSON with additional information for tools that are collaborative. OSM Conflate and (as a preparation step for PoIs) MapRoulette seem decent choices.

**: while I’m already looking for inspiration in other tools (v0.5.0 does not have something as basic as a diff per item), visualization is likely to be a core functionality.

***: I might change it to public domain if it makes it more likely to get collaboration.

On conflation in general, there are other tools besides the ones listed in this blog post. I will quickly comment on some of them.

ArcGIS Pro (paid) gives me the impression (thinking from a user’s perspective, not a software developer’s perspective) of having a “single button” for the typical actions users want, doing in one go what in open source alternatives such as QGIS would take several steps plus a custom script.
- Example of documentation https://desktop.arcgis.com/en/arcmap/latest/tools/editing-toolbox/an-overview-of-the-conflation-toolset.htm
QGIS (if you don’t have it installed already) is good to have around, even if only to save you the trouble of learning how to use GDAL or GRASS directly to convert files from/to GeoJSON / GeoJSON Lines (which is the main format used by the tool I’m presenting).
MapRoulette is not cited there, but it actually works as some kind of conflation tool.
RapiD (when enabled with datasets from authoritative sources or generated by machine learning) also works as some sort of conflation tool.
- Maybe this is intentional (since doing it differently could make RapiD less likely to eventually be added as an additional editor on OpenStreetMap.org), but other than the very specific list of supported datasets, RapiD has no changes at all over iD for loading a custom data layer (e.g. the GeoJSON you could get as an export).
  - There’s no way to add more than one data layer, nor to customize colours. I would consider this really important, and not really hard to implement.
    - (Actually, this also seems not possible in either JOSM or QGIS.) For data layers, there is no quick filter to display part of them by attribute, so if a PoI (even with the right addr:housenumber) is not close, this becomes very manual labour, clicking one by one.
And obviously, Hootenanny, which is likely the most feature-rich for interactive conflation, although the OSM Wiki page for Conflation rightfully notes that it is complex to install.
- It started as a fork of a (now older) version of iD. RapiD also started as a fork of iD and has some built-in support for conflating data, but it is very basic compared to Hootenanny.

2. Screenshots with context of the implementation sdm.etica.ai

2.1 Kind of “co-pilot” for an OSM editor (iD example)

Some mappers already look at official websites to add more metadata to OpenStreetMap. When these sources already publish such data in something you can convert to GeoJSON with tagging close to what you would use in OSM, you can do the following:

Load one or more of these datasets into the app
Divide the screen between the iD editor and this app. I put it on the right side because it is close to the panel of iD.
When I find an OpenStreetMap element lacking data, I copy existing attributes from the element and paste them into the search box of the app.
Sometimes, you may need to filter by addr:street (copy from nearby roads the name=, alt_name= and old_name=). If you find the data, copy and paste it from the app into the iD free-text tag editor.

While the external dataset had over 36,000 items, by selecting with

addr:housenumber=155
addr:street=Rua Catarino Andreatta

the match was 1 of the 9 results. It’s a manual process, but you can copy the tags from the text area.

The preview in map mode also has the same keys, which can be copied and pasted.

2.2 Using links from items displayed on the app to add into JOSM, iD, etc

iD (documentation at develop/API.md) allows the creation of direct links, and a lot of other software has something similar for which we could add shortcuts. JOSM, however, has Remote Control; notably, it can reuse the same JOSM instance and make changesets with more edits than iD. In this use case, you use the app in full screen to find what you can edit on OpenStreetMap in the default editor.

Feedback is also welcome on how to optimize the space taken by the links in the map. While writing this diary, I noticed a link to the Level0 editor.
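To make the idea of editor links concrete, here is a small sketch that builds one link for iD and one for JOSM Remote Control from a position. The helper names, the default zoom and the margin are my assumptions. The URL patterns are the public ones (the `#map=zoom/lat/lon` fragment on openstreetmap.org and the `load_and_zoom` command on JOSM's local port 8111), but this is not necessarily how the app builds its own links.

```javascript
// Hypothetical helper: link that opens the iD editor centred on a position.
function idEditLink(lat, lon, zoom = 19) {
  return `https://www.openstreetmap.org/edit?editor=id#map=${zoom}/${lat}/${lon}`;
}

// Hypothetical helper: link that asks a running JOSM (Remote Control enabled)
// to download and zoom to a small bounding box around the position.
function josmRemoteControlLink(lat, lon, marginDegrees = 0.001) {
  const params = new URLSearchParams({
    left: (lon - marginDegrees).toFixed(5),
    right: (lon + marginDegrees).toFixed(5),
    top: (lat + marginDegrees).toFixed(5),
    bottom: (lat - marginDegrees).toFixed(5),
  });
  return `http://127.0.0.1:8111/load_and_zoom?${params}`;
}
```

Because the JOSM link targets an already running instance, several links clicked in a row end up in the same JOSM session.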

2.3 Display OpenStreetMap data along with other data into the app

As you will notice, the webapp does not load OpenStreetMap data itself (at least not yet, but it is viable to implement), so having OpenStreetMap-carto as the default base map helps to compare with the pins.

However, you can use Overpass Turbo and select its output as one of the inputs: just use the export button and save as GeoJSON. (A later example uses conflation at import between 2 external datasets, but the same could be done to compare what’s on OSM with what’s nearby in an external dataset.)

In my tests since last year, a dataset prepared to the OpenStreetMap schema may have many more fields than we would use. This explains why there’s a field where you need to choose which tags are imported into the app.

Unless you untick the option, by default, if the GeoJSON seems to be an OpenStreetMap export, it will bypass the selection.

2.4 Working with very large datasets

As the idea of the app is to help you match data, you may have one or more smaller datasets that need to be matched against one big one.

Currently there are 2 strategies:

At the import stage: you prefilter 1+ subject datasets using 1+ reference datasets. Filtering both by distance and by matching attributes (such as addr:housenumber) is possible.
At the live filter stage: all datasets are already loaded in memory, and can even be exported, but at some point the preview will not show everything.

The main file format used is GeoJSON, but very large datasets need to be pre-converted to GeoJSON Text Sequences (see the formal specification in RFC 8142), also known as “GeoJSON Lines”. (By the way, if you are generating it from scratch, do it with RS+LF, not just LF.)
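For those generating such files from scratch, this is a minimal sketch of the RS+LF framing: each record is the RS character (0x1E), then the feature as compact JSON, then LF. The function names are mine, and the parser is deliberately tolerant of LF-only files. It reads the whole text at once, so it only illustrates the format, not the chunked reading a very large file would need.

```javascript
// Record Separator (0x1E): RFC 8142 puts it before every record.
const RS = "\x1e";

// Serialize an array of GeoJSON features as RS + JSON + LF per record.
function toGeoJSONSeq(features) {
  return features.map((feature) => `${RS}${JSON.stringify(feature)}\n`).join("");
}

// Parse a sequence back into features. If no RS is present, assume the
// LF-only variant with one compact JSON object per line.
function parseGeoJSONSeq(text) {
  const records = text.includes(RS) ? text.split(RS) : text.split("\n");
  return records
    .map((record) => record.trim())
    .filter((record) => record.length > 0)
    .map((record) => JSON.parse(record));
}
```

The advantage of RS as a delimiter is that a reader can resynchronize at the next RS after a truncated or corrupted record, which a plain LF does not guarantee if the JSON itself contains line breaks.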

2.4.1 Example at import stage (use items from 1+ datasets to find matching items from other datasets)

The exact position may change in future versions, but currently you:

Define the distance and (if relevant) also the matching key. Then, load 1+ reference datasets.

After that, just select 1+ datasets in the main file input.

At the end, you can just export the file (and potentially reuse it in a later session).

The speed of this process is greatly affected by the number of items in the reference dataset. However, note that you can export the result and keep the file on your disk, so you can start a new session with only the precomputed data.
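The prefilter described above can be sketched as follows. This is a simplified illustration under my own assumptions (point features only, exact comparison of one key, a brute-force loop over every reference item), not the app's actual code. The brute-force loop also shows why the number of reference items weighs so much on speed: each subject item may be compared against all of them.

```javascript
const EARTH_RADIUS_M = 6371008.8; // mean Earth radius in meters

// Great-circle distance in meters between two [lon, lat] GeoJSON positions.
function haversineMeters([lon1, lat1], [lon2, lat2]) {
  const toRad = (deg) => (deg * Math.PI) / 180;
  const dLat = toRad(lat2 - lat1);
  const dLon = toRad(lon2 - lon1);
  const a =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRad(lat1)) * Math.cos(toRad(lat2)) * Math.sin(dLon / 2) ** 2;
  return 2 * EARTH_RADIUS_M * Math.asin(Math.sqrt(a));
}

// Keep only subject features that have at least one reference feature within
// maxMeters and (when matchKey is given) with the same value for that key.
function prefilter(subjects, references, { maxMeters, matchKey = null }) {
  return subjects.filter((subject) =>
    references.some((reference) => {
      if (matchKey !== null) {
        const a = subject.properties?.[matchKey];
        const b = reference.properties?.[matchKey];
        if (a === undefined || a !== b) return false;
      }
      const distance = haversineMeters(
        subject.geometry.coordinates,
        reference.geometry.coordinates
      );
      return distance <= maxMeters;
    })
  );
}
```

A spatial index (for example a simple grid keyed by rounded coordinates) would avoid comparing every pair, at the cost of more code.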

Quick comment about these examples:

While there may be some last-minute bug in the UI (which is why I would recommend using https://sdm.etica.ai/v/0.5/, not https://sdm.etica.ai/, which I might be changing faster), a filter that only reduces 6M to 1M would be too forgiving. But the real filters are heavily dependent on the reference datasets and target datasets.

One reason the input dataset ends up at less than 1/6 is also the choice of which keys are allowed to load into memory.

While the time will vary greatly with how powerful the user’s CPU is, on a recent 6-core / 12-thread CPU, conflating all house addresses surveyed in the last Brazilian Census for one province (this one https://www.openstreetmap.org/relation/242620, population: 11,322,895) took around 55 seconds (around 50% of this is merely reading the GeoJSON-Seq in chunks, not the comparison with items from the reference datasets). This kind of processing time will necessarily increase with proper fine-tuning. For example, implementing forgiving matches, such as non-exact addr:street (which varies by country and language and would need to be programmed in JavaScript), will increase CPU use.

While this may not seem like much, if such processing were done “in the cloud”, making it free by opening access to the OpenStreetMap community would be expensive.

2.4.2 Example at live filtering stage

It would be trivial to copy the same logic (dataset vs dataset) used with reference files at the import stage over to the filtering stage; however, full recalculation would lead to a bad user experience (for a province-level dataset like in the previous step, think >1 minute). With over a million points waiting in background memory, trying to match one or a few items might still be fast (just “not instantaneous”).

The current version doesn’t have an “auto suggestion”, but I guess this could be implemented with some defaults, exploiting the fact that the datasets will already be using the OpenStreetMap schema. Suggestions are welcome, and maybe after that, proofs of concept to try them, but I can say upfront that:

Instead of a “yes/no” match, there should be some numeric result (even if only to sort results).
Sometimes either the source or the target may lack one field. This is different from a false match; it’s an unknown case.
Some datasets may have no position at all, so the match is fully by address alone (which may need an intermediary dataset). Also, plotting these cases on Null Island makes for a poor experience.
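The first two points can be sketched together. This is only one possible design, with names and the scoring formula chosen by me: each compared key yields "match", "mismatch" or "unknown", and the score is the share of matches among the keys that could actually be compared, so a missing field neither helps nor hurts.

```javascript
// Compare one tag between two tag objects, with a third "unknown" outcome
// when either side lacks the field.
function compareTag(source, target, key) {
  const a = source[key];
  const b = target[key];
  if (a === undefined || a === "" || b === undefined || b === "") return "unknown";
  const same = String(a).trim().toLowerCase() === String(b).trim().toLowerCase();
  return same ? "match" : "mismatch";
}

// Numeric score in [0, 1], or null when no key could be compared at all.
function scoreCandidate(source, target, keys) {
  const counts = { match: 0, mismatch: 0, unknown: 0 };
  for (const key of keys) counts[compareTag(source, target, key)] += 1;
  const compared = counts.match + counts.mismatch;
  const score = compared === 0 ? null : counts.match / compared;
  return { score, ...counts };
}

// Sort candidates by score, best first; candidates without a score go last.
function rankCandidates(source, candidates, keys) {
  return candidates
    .map((candidate) => ({ candidate, ...scoreCandidate(source, candidate, keys) }))
    .sort((x, y) => (y.score ?? -1) - (x.score ?? -1));
}
```

Keeping the "unknown" count next to the score lets the interface show that a perfect score based on a single comparable key is weaker evidence than one based on several.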

How the live filtering may be used really depends on the dataset (for sparse points we could use kilometers, but for very near points, something like 100 meters); however, this more manual strategy still works as a fallback.

The “Position” field can accept latitude/longitude values (such as -29.92420 -51.17002); it can also accept a temporary identifier of any element inside the dataset, or even a URL like https://www.openstreetmap.org/#map=18/-29.92421/-51.17002 (the regex will extract -51.17002 and -29.92421).
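As a rough idea of how such an input could be parsed, here is a sketch covering two of the three cases: a plain "lat lon" pair and an OpenStreetMap-style URL with a `#map=zoom/lat/lon` fragment. The function name and the regular expressions are my assumptions (the app's own regex may differ), and the temporary-identifier case is left out because it depends on the loaded datasets.

```javascript
// Parse a position typed by the user. Returns { lat, lon } or null.
function parsePosition(input) {
  const text = input.trim();
  const number = "(-?\\d+(?:\\.\\d+)?)";
  // OpenStreetMap-style URL fragment: #map=<zoom>/<lat>/<lon>
  const fromUrl = text.match(new RegExp(`#map=\\d+(?:\\.\\d+)?/${number}/${number}`));
  // Plain pair: "<lat> <lon>", also accepting a comma or semicolon separator
  const fromPair = text.match(new RegExp(`^${number}[\\s,;]+${number}$`));
  const found = fromUrl || fromPair;
  if (!found) return null;
  const lat = Number(found[1]);
  const lon = Number(found[2]);
  // Reject values outside the valid coordinate ranges.
  if (Math.abs(lat) > 90 || Math.abs(lon) > 180) return null;
  return { lat, lon };
}
```

Note the order: both inputs put latitude first, while GeoJSON coordinates are stored as [lon, lat], so a conversion is needed before comparing with the loaded features.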

Quick comment on this example:

With 6M items in the background, and without any more advanced checks implemented yet, the parsing takes between 500 ms and 800 ms. Of these milliseconds, most are likely spent not on the raw calculation but on updating the user interface.

2.5 No restriction on number of “layers/files”

At some point, the images used on the map for pin colours will start to get reused, but other than that, how users organize the files is quite flexible.

Currently the colours of the pins are based on the order of upload. In live filtering (all data already in memory), users can also select a dataset as the focus. While I’m unsure of a better way to differentiate them, this is an example.

While by default there is a maximum number of data points to show, if this limit has already been reached but the app knows there is a dataset in focus, it will show 2x the limit. So, if you are working with a dataset of >1 million items, the smaller ones you may be interested in are more likely to still be displayed in the preview.
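One way to read this rule is the sketch below. It is an assumption-heavy illustration: the `dataset` property, the function name, and especially the choice to place focused items first are mine, and the app may select items differently.

```javascript
// Choose which items go to the preview.
// items: array of objects with a `dataset` property (assumed shape).
function selectForPreview(items, { limit, focusDataset = null }) {
  if (items.length <= limit) return items;
  if (focusDataset === null) return items.slice(0, limit);
  // With a dataset in focus, allow twice the limit and (assumption) put the
  // focused items first so a large unfocused dataset cannot crowd them out.
  const focused = items.filter((item) => item.dataset === focusDataset);
  const others = items.filter((item) => item.dataset !== focusDataset);
  return [...focused, ...others].slice(0, limit * 2);
}
```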

3. Other performance comments

The memory usage tends to be around the same size as, or lower than, the uncompressed size of the files on disk. There’s room for improvement not done yet, but by limiting how many items are displayed (for example 10,000), this will use less memory than JOSM and have a UI with faster feedback than QGIS.
- Memory usage tends to grow only at the import stage (or, if you export a very large dataset, when you save the file). This (together with simplifying the logic) explains why, as soon as files are loaded, they are locked for editing. To work with different datasets, you need to refresh. To work with different sessions at the same time, just open 2 or more tabs.
- If you notice it using more RAM than this, consider opening a new tab instead of reusing the tab from a previous import (no need to close the browser, just the tab). I noticed that after a refresh / hard refresh the browser may not release the memory (potentially by assuming you will use a lot of RAM again).

Here is one example with 6 files (uncompressed size in disk around 2.8 GB).

Baseline (using a WebKit-based browser): around 30 MB (but for smaller datasets, which still display all data, this will likely be around 100 MB when actually using the app)

Loading all the files (using a WebKit-based browser): around 1.4 GB

Here, using the same datasets (I had to use GeoJSON instead of GeoJSON Lines). JOSM can load the CNEFE 2022 dataset for the city of Porto Alegre, but without optimisations, in my test it eventually failed before finishing importing a province (the computer had free RAM, but likely the JVM was not configured to allow it).

And here QGIS, which is quite impressive at around 340 MB of RAM.

Obviously, QGIS and JOSM have different purposes. JOSM is already optimized for editing. QGIS (without the need to use the command line) seems a good choice to convert files. GeoJSON parsing may be one of the worst cases (because of the likelihood of code loading the entire file as a single string, not in chunks). I also noticed that JOSM (the plugins for GeoJSON) seems to merge points at the very same position with the same (or similar) tagging, which actually seems a good default.

I don’t have the test here, but using GeoPackage, JOSM would use less RAM on import. Something similar could be achieved by converting the large datasets into a single file on disk with vector tiles. (The link for the big file is below, and I’m truly curious how to optimize JOSM for the bare minimum.)

4. Files used in the tutorial (do NOT upload these to OpenStreetMap)

For the sake of testing the application (or checking whether errors come from the custom data you may be using), I will share a copy of the files used in the screenshots.

~~Smaller files are available at https://gist.github.com/fititnt/01a5b660013b54743989759c4a9b5f18/archive/fda8376c6bb6c9eb040233e9cf2235ec4155db97.zip~~
- Smaller files are available at https://gist.github.com/fititnt/01a5b660013b54743989759c4a9b5f18/archive/a03752c04337a0e952ed0a4613b2a3880264c6f0.zip
The 2.8 GB file (may be deleted later) ~~is at https://osm-cdn.etica.ai/cnefe2022/tutorial/43_RS.zip.~~
- The big GeoJSON file tested in this diary (bug in the delimiter): https://osm-cdn.etica.ai/cnefe2022/tutorial/43_RS_old.zip
  - The big GeoJSON file (for testing in future versions): https://osm-cdn.etica.ai/cnefe2022/tutorial/43_RS.zip
- The original CSV: https://osm-cdn.etica.ai/cnefe2022/tutorial/43_RS_original-csv.zip

These files have OpenStreetMap data plus 2 different “official” datasets (which can have conflicting information between themselves, such as imprecise positioning). One has a list of addr:housenumber values plus some extra non-detailed metadata surveyed around 2022. The other is related to points of interest (fire stations, although for some the actual tagging could be office), which may share address and phone but not email or name (the name is suggested by the reference dataset and is not the typically used one) with what could be mapped on OSM. The actual number of focused things is around 200, not > 6,000,000.

v0.5.0-beta still does not make better groups between sources that may be about the same subject (sometimes files can be from the same provider). However, this might help readers understand that, while most solutions tend to break conflation into 1 vs 1, my idea is to do this too while also attempting to be more flexible. This is merely a 1 + 2 example, but some kinds of schools focused on learning disabilities could be > 1+5 (OSM, Wikidata, ref:vatin, ref healthcare, ref education). Not only this, but consider that the ref:vatin open data sources do not have exact positions, and the text representation of addresses is a f**ng nightmare.

5. End comments

I hope this initial version can be a reasonable start. It doesn’t require an expensive server side to keep it running, which helps it avoid being shut down because of excessive memory and CPU usage.

Post-edits

Edit 1 (2024-06-16): oops, the link for the smaller files was wrong. Updated the link in the text.
Edit 2 (2024-06-16): discovered a bug in the test files generated by another script. The file used in this tutorial is renamed to 43_RS_old.zip, and the fixed one is 43_RS.zip. The file used to generate these is a big CSV file, 43_RS_original-csv.zip, but the original source (which may be available long term) is https://ftp.ibge.gov.br/Cadastro_Nacional_de_Enderecos_para_Fins_Estatisticos/Censo_Demografico_2022/Arquivos_CNEFE/UF/43_RS.zip
