
The link for the public version is https://sdm.etica.ai/v/0.5/. I made an effort to make it easy and very cheap to host (currently it is a client-side static vanilla JavaScript+HTML app) and, as a side effect, the privacy of your data is kept.

Since I joined OpenStreetMap in 2022, I’ve made some tools without a graphical interface; for this one, I’d love to receive feedback from potential users on such a very niche topic.

Already in its early versions (I started a prototype in 2023 as a mere debug view for the real conflation, which was done non-interactively before loading into OSM editors), I genuinely tried to think of how to make it a plugin for JOSM, or how to extend iD, instead of keeping it side-by-side with iD or alt-tabbing with JOSM.

The good news: it does have basic support for using one or more files to match by distance and/or by tagging against one or more target files, after which you download the GeoJSON. Granted, addr:street would need language- and country-level comparison (because of misspellings), while addr:postcode may already have logic to tolerate near matches. If you know enough vanilla JavaScript to code a function for your country, the matching could be more forgiving, as in the sketch below.
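A minimal sketch, assuming Brazilian Portuguese street names, of the kind of country-specific function that could make addr:street comparison more forgiving (the function names and rules here are illustrative, not the app’s actual code):

```javascript
// Normalize a street name before comparing: strip diacritics and case,
// expand common abbreviations, and collapse whitespace.
function normalizeStreet(name) {
  return name
    .normalize('NFD')
    .replace(/[\u0300-\u036f]/g, '') // strip diacritics: "José" → "Jose"
    .toLowerCase()
    .replace(/\br\.?(?=\s)/g, 'rua') // expand "R." → "rua"
    .replace(/\bav\.?(?=\s)/g, 'avenida') // expand "Av." → "avenida"
    .replace(/\s+/g, ' ')
    .trim();
}

function sameStreet(a, b) {
  return normalizeStreet(a) === normalizeStreet(b);
}

// sameStreet('R. Catarino Andreatta', 'Rua Catarino Andreatta') → true
```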

The bad news: for points of interest, the so-called “edgematch links”, “rubbersheeting links”, or whatever term is used for an export file saying “these 0-N items in dataset A match these 0-N items in dataset B”, necessarily need a human in the loop, and that happens in unpredictable ways. Links which aren’t obvious 1-to-1 (while there’s room for suggestion) require human input. It started as the “typical Leaflet map” plus a text-only view, but we might need a way to visualize N:M links (unless any of you have a UI suggestion to plot such links over pins already on a map!).
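To make the idea concrete, here is a hypothetical export shape (there is no committed format yet; all field names are assumptions for illustration) for one link saying “these 0-N items in dataset A match these 0-N items in dataset B”:

```javascript
const exampleLink = {
  from: ['datasetA/poi-17'],                // one feature on side A
  to: ['datasetB/poi-3', 'datasetB/poi-9'], // two features on side B (a 1:2 link)
  score: 0.8,      // a suggestion strength, not a yes/no verdict
  reviewed: false, // marks the pending human-in-the-loop decision
};
```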

This diary is less about one implementation targeting a topic and more about suggestions, including realistic feedback on failed attempts. I love the human creativity involved in merging different information into something that could be given to OpenStreetMap, Wikidata, and/or given back to the open data providers whose data needs review.

1. Quick overview of other tools and how this initial release fits in

Context: by citing other tools (which, trust me, don’t just have different approaches, but focus on different challenges), I hope to be helpful if any of your use cases are already better served by them, or, as in the title “early feedback welcomed”, this could help others suggest improvements here, such as how to present the interface.

I’d assume those more likely to be interested in this topic already have some knowledge of OpenStreetMap conflation or Wikidata imports.

One blog post with a comparison of some tools that is really worth reading is https://junglebus.io/MobilityData/benchmarks/Benchmark%20of%20existing%20open%20source%20solutions%20for%20conflating%20structured,%20geographical%20and%20transit%20data.html , to which I would add a TL;DR of how this tool would fit:


Table 1

  Tool   Ecosystem                 Object Type
  SDM    OpenStreetMap, Wikidata   Point

Table 2

  Criterion                                     SDM
  need to match the dataset with OSM model      yes
  use an identifier existing in both datasets   possible, not mandatory
  investigate each output element               needed
  collaborative review                          no *
  visualization of the conflation output        + **
  visualization of each output element          + **
  language                                      JavaScript
  user interface                                dedicated webapp, client-side, works offline
  license                                       AGPL-3 ***

*: if there’s interest, it would eventually be feasible to export JSON or GeoJSON with additional information for tools that are collaborative. OSM Conflate and (as a preparation step for PoIs) MapRoulette seem decent choices.

**: while I’m already looking for inspiration in other tools (v0.5.0 does not have something as basic as a diff per item), visualization is likely to become a core functionality.

***: I might change it to public domain if it makes it more likely to get collaboration.

On conflation in general, there are other tools beyond the ones listed in that blog post. I will quickly comment on some of them.

  • ArcGIS Pro (paid) gives me the impression (from a user’s perspective, not a software developer’s) of having a “single button” for the typical actions users want, where open source alternatives such as QGIS would require several steps plus a custom script.
  • QGIS (if you don’t already have it installed) is good to have around, even if only to save you the trouble of using GDAL or GRASS directly to convert files from/to GeoJSON / GeoJSON Lines (the main formats used by the tool I’m presenting).
  • MapRoulette is not cited there, but it actually works as some kind of conflation tool.
  • RapiD (when enabled with datasets from authoritative sources or generated by machine learning) also works as some sort of conflation tool
    • Maybe this is intentional (since doing otherwise could make RapiD less likely to eventually be added as an additional editor on OpenStreetMap.org), but other than the very specific list of curated datasets, RapiD has no changes at all over iD for loading a data layer (e.g. the GeoJSON you could get as an export).
      • There’s no way to add more than one data layer, nor to customize colours. I would consider this really important, and not really hard to implement.
        • (Actually, this also seems not possible in both JOSM and QGIS.) For data layers, there is no quick filter to display part of them by attribute, so if a PoI (even with the right addr:housenumber) is not close, this makes for very manual labour, clicking one by one.
  • And obviously Hootenanny, which, while likely the most feature-rich for interactive conflation, is complex to install, as the OSM Wiki page on conflation rightfully notes.
    • It started as a fork of a (now older) version of iD. RapiD also started as a fork of iD, and has some built-in support for conflating data, but very basic compared to Hootenanny.

2. Screenshots with context of the implementation at sdm.etica.ai

2.1 Kind of “co-pilot” for an OSM editor (iD example)

Some mappers already look at official websites to add more metadata to OpenStreetMap. When these sources already publish such data in something you can convert to GeoJSON with tagging close to what you would use in OSM, you can do the following:

  1. Load one or more of these datasets into the app
  2. Divide the screen between the iD editor and this app. I put it on the right side because it is close to the panel of iD.
  3. When I find an OpenStreetMap element lacking richer data, I copy existing attributes from the element and paste them into the search box of the app.
  4. Sometimes you may need to filter by addr:street (copying from nearby roads the name=, alt_name=, and old_name= values). If you find the data, copy and paste from the app into iD’s free-text tagging editor.

https://gist.githubusercontent.com/fititnt/01a5b660013b54743989759c4a9b5f18/raw/cc48697b0d47a1280ea86238094d5bc5df2b47d2/1-example-co-pilot-id.png

While the external dataset had over 36,000 items, by selecting with

addr:housenumber=155
addr:street=Rua Catarino Andreatta

the match was 1 of only 9 results. It’s a manual process, but you can copy the tags from the text area.

https://gist.githubusercontent.com/fititnt/01a5b660013b54743989759c4a9b5f18/raw/cc48697b0d47a1280ea86238094d5bc5df2b47d2/2-example-co-pilot-id.png

The preview in map mode also shows the same keys, which can be copy-pasted.

iD (documentation at develop/API.md) allows the creation of direct links, and a lot of other software has something similar for which we could add shortcuts. JOSM has Remote Control, which is notable in that it can reuse the same JOSM instance and make changesets with more than one edit, unlike iD. In this use case, you use the app in full screen to find what you can edit on OpenStreetMap in your default editor.
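As a sketch of what such shortcuts involve: both URL schemes below are documented (iD’s develop/API.md and JOSM Remote Control on port 8111), though the helper itself is illustrative, not the app’s actual code.

```javascript
// Build editor deep links for a pin at (lat, lon).
function editorLinks(lat, lon, zoom = 18, delta = 0.001) {
  return {
    iD: `https://www.openstreetmap.org/edit?editor=id#map=${zoom}/${lat}/${lon}`,
    // JOSM must be running with Remote Control enabled for this to work.
    josm:
      'http://127.0.0.1:8111/load_and_zoom' +
      `?left=${lon - delta}&right=${lon + delta}` +
      `&top=${lat + delta}&bottom=${lat - delta}`,
  };
}

console.log(editorLinks(-29.92421, -51.17002));
```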

https://gist.githubusercontent.com/fititnt/01a5b660013b54743989759c4a9b5f18/raw/cc48697b0d47a1280ea86238094d5bc5df2b47d2/3-pin-mode.png


https://gist.githubusercontent.com/fititnt/01a5b660013b54743989759c4a9b5f18/raw/cc48697b0d47a1280ea86238094d5bc5df2b47d2/5-iD-link.png


https://gist.githubusercontent.com/fititnt/01a5b660013b54743989759c4a9b5f18/raw/cc48697b0d47a1280ea86238094d5bc5df2b47d2/6-josm-link.png

Feedback is also welcome on how to optimize the space used by the links in the map. While writing this diary, I noticed a link to the Level0 editor.

2.3 Displaying OpenStreetMap data along with other data in the app

As you will notice, the webapp does not load OpenStreetMap data itself (at least not yet, though it is viable to implement), so OpenStreetMap Carto as the default base map helps to compare with the pins.

However, you can use Overpass Turbo and select its result as one of the inputs; just use the export button and save as GeoJSON. (A later example uses conflation on import between 2 external datasets, but the same could be done to compare what’s in OSM with what’s in an external dataset.)

https://gist.githubusercontent.com/fititnt/01a5b660013b54743989759c4a9b5f18/raw/18408532791aed56abd7494845d45b0a12b3cb7e/7-overpass.png

In my tests since last year, datasets prepared for the OpenStreetMap schema may have many more fields than we would use. This explains why there’s a field where you define which tags are imported into the app.

https://gist.githubusercontent.com/fititnt/01a5b660013b54743989759c4a9b5f18/raw/cc48697b0d47a1280ea86238094d5bc5df2b47d2/8-restriction-by-key.png

Unless you untick it, by default GeoJSON that seems to be an OpenStreetMap export will bypass this selection.

2.4 Working with very large datasets

As the idea of the app is to help you match data, you may have one or more smaller datasets that need to be matched against one big one.

Currently there are two strategies:

  1. At the import stage: you prefilter 1+ subject datasets using 1+ reference datasets. Filtering both by distance and by matching attributes (such as addr:housenumber) is possible.
  2. At the live filter stage: all datasets are already loaded in memory, and can even be exported, but at some point the preview will not show everything.

The main file format used is GeoJSON, but for very large datasets you need to pre-convert to GeoJSON Text Sequences (see the formal specification in RFC 8142), also known as “GeoJSON Lines”. By the way, if you are generating it from scratch, do it with RS+LF, not just LF, as in the sketch below.
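A minimal sketch (Node.js assumed) of generating a compliant file: per RFC 8142, each record is the ASCII Record Separator (0x1E), the JSON text, then a line feed.

```javascript
const fs = require('fs');

const RS = '\u001e'; // ASCII Record Separator (0x1E)

function writeGeoJsonSeq(path, features) {
  const out = fs.createWriteStream(path);
  for (const feature of features) {
    out.write(RS + JSON.stringify(feature) + '\n'); // RS + JSON + LF
  }
  out.end();
}

// Hypothetical usage with a single point feature:
writeGeoJsonSeq('example.geojsonl', [
  {
    type: 'Feature',
    geometry: { type: 'Point', coordinates: [-51.17002, -29.9242] },
    properties: { 'addr:housenumber': '155' },
  },
]);
```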

2.4.1 Example at the import stage (use items from 1+ datasets to find matching items in other datasets)

The exact placement may change in future versions, but currently you:

  1. Define the distance and (if relevant) also a matching key. Then, load 1+ reference datasets.

https://gist.githubusercontent.com/fititnt/01a5b660013b54743989759c4a9b5f18/raw/cc48697b0d47a1280ea86238094d5bc5df2b47d2/9-conflation-at-impoirt-reference-files.png

After that, just select 1+ datasets in the main file input.

https://gist.githubusercontent.com/fititnt/01a5b660013b54743989759c4a9b5f18/raw/a03752c04337a0e952ed0a4613b2a3880264c6f0/10-conflation-at-import-now-the-files.png

At the end, you can just export the file (potentially to reuse it again in a next session).

https://gist.githubusercontent.com/fititnt/01a5b660013b54743989759c4a9b5f18/raw/cc48697b0d47a1280ea86238094d5bc5df2b47d2/11-conflation-at-import-downloading-without-going-live-filter.png

The speed of this process is greatly affected by the number of items in the reference dataset. However, note that you can export the result to a file on your disk, so you can start a new session with only the precomputed data.
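For intuition, here is a sketch of the distance + matching-key prefilter idea (not the app’s actual code). Note it compares every subject against every reference, which is why the size of the reference dataset dominates the processing time:

```javascript
const EARTH_RADIUS_M = 6371000;

// Haversine distance between two [lon, lat] pairs, in meters.
function distanceMeters([lon1, lat1], [lon2, lat2]) {
  const toRad = (deg) => (deg * Math.PI) / 180;
  const dLat = toRad(lat2 - lat1);
  const dLon = toRad(lon2 - lon1);
  const a =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRad(lat1)) * Math.cos(toRad(lat2)) * Math.sin(dLon / 2) ** 2;
  return 2 * EARTH_RADIUS_M * Math.asin(Math.sqrt(a));
}

// Keep subject features that sit within maxMeters of at least one
// reference feature sharing the same value for matchKey.
function prefilter(subjects, references, maxMeters, matchKey) {
  return subjects.filter((s) =>
    references.some(
      (r) =>
        s.properties[matchKey] !== undefined &&
        s.properties[matchKey] === r.properties[matchKey] &&
        distanceMeters(s.geometry.coordinates, r.geometry.coordinates) <= maxMeters
    )
  );
}
```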

Quick comment about these examples:

  1. While there may be some last-minute bug in the UI (which is why I would recommend using https://sdm.etica.ai/v/0.5/, not https://sdm.etica.ai/, which I might be changing faster), a filter that reduces 6M to 1M would be too forgiving. The real filters are heavily dependent on the reference and target datasets.

  2. One reason for the input dataset ending up smaller than 1/6 is also the restriction on which keys are allowed to be loaded into memory.

While the time will vary greatly with how powerful the user’s CPU is, with a recent 6-core/12-thread CPU, conflating all house addresses surveyed in the last Brazilian census for one province (this one: osm.org/relation/242620, population 11,322,895) took around 55 seconds (around 50% of this is merely reading the GeoJSON-Seq in chunks, not the comparison with items from the reference datasets). This kind of processing time will necessarily increase with proper fine-tuning. For example, as soon as forgiving matches are implemented, such as non-exact addr:street (which varies by country and language, and would need to be programmed in JavaScript), CPU use will increase.
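Since about half of that time is the chunked reading itself, here is a sketch (assuming a browser File object; the app’s internals may differ) of reading GeoJSON Text Sequences in chunks instead of parsing the whole file as one string:

```javascript
async function* readGeoJsonSeq(file) {
  const reader = file.stream().getReader();
  const decoder = new TextDecoder();
  let buffer = '';
  while (true) {
    const { done, value } = await reader.read();
    if (value) buffer += decoder.decode(value, { stream: true });
    const records = buffer.split('\u001e'); // split on Record Separator
    buffer = done ? '' : records.pop(); // keep the trailing partial record
    for (const record of records) {
      const text = record.trim();
      if (text) yield JSON.parse(text); // one GeoJSON feature per record
    }
    if (done) break;
  }
}
```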

While 55 seconds may not seem like much, if such processing were done “in the cloud”, making it free for OpenStreetMap users would be expensive.

2.4.2 Example at the live filtering stage

It would be trivial to copy the same logic (dataset vs. dataset, using reference files) from the import stage to the filtering stage; however, a full recalculation would lead to a bad user experience (for a province-level dataset like the previous step, think >1 minute). With over a million points waiting in background memory, trying to match one or a few items might still be fast (just “not instantaneous”).

The current version doesn’t have “auto suggestions”, but I guess this could be implemented with some defaults, exploiting the fact that the datasets will already be using the OpenStreetMap schema. Suggestions are welcome, and maybe after that, proofs of concept to try them out, but I can say upfront that:

  1. Instead of a “yes/no” match, there should be some numeric score (even if only used to sort results); see the sketch after this list.
  2. Sometimes either the source or the target may not have a given field. This is different from a false match; it’s an unknown case.
  3. Some datasets may have no position at all, so the match is done by address alone (which may need an intermediary dataset). Also, plotting these cases on Null Island makes for a poor experience.
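A sketch of points 1 and 2 above, assuming OSM-style keys on both sides (names are illustrative): a comparison that returns a numeric score instead of yes/no, and null when a key is absent on either side.

```javascript
function compareKey(a, b, key) {
  const va = a.properties[key];
  const vb = b.properties[key];
  if (va === undefined || vb === undefined) return null; // unknown, not false
  return va === vb ? 1 : 0; // a forgiving comparator could return values in 0..1
}

// Candidates could then be ranked by their average score over the known
// keys, while keeping track of how many comparisons were unknown.
```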

https://gist.githubusercontent.com/fititnt/01a5b660013b54743989759c4a9b5f18/raw/cc48697b0d47a1280ea86238094d5bc5df2b47d2/12-conflation-single-item-live-filtering.png

How the live filtering may be used really depends on the dataset (for sparse points we could use kilometers; for very near points, something like 100 meters); however, this more manual strategy still works as a fallback.

The “Position” field can accept latitude/longitude values (such as -29.92420 -51.17002); it can also accept a temporary identifier of any element inside the dataset, or even a URL like https://www.openstreetmap.org/#map=18/-29.92421/-51.17002 (the regex will extract -51.17002 and -29.92421).
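A sketch of that extraction (the app’s actual regex may differ): handle a plain “lat lon” pair and an osm.org-style #map= URL.

```javascript
function parsePosition(text) {
  // osm.org-style URL: #map=zoom/lat/lon
  const url = text.match(/#map=\d+\/(-?\d+\.?\d*)\/(-?\d+\.?\d*)/);
  if (url) return { lat: Number(url[1]), lon: Number(url[2]) };
  // plain "lat lon" (or "lat, lon") pair
  const pair = text.match(/(-?\d+\.\d+)[ ,]+(-?\d+\.\d+)/);
  if (pair) return { lat: Number(pair[1]), lon: Number(pair[2]) };
  return null;
}

// parsePosition('https://www.openstreetmap.org/#map=18/-29.92421/-51.17002')
// → { lat: -29.92421, lon: -51.17002 }
```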

Quick comment on this example:

  1. With 6M items in the background, and without yet implementing any more advanced check, the parsing takes between 500 ms and 800 ms. Of these milliseconds, most are likely spent not on the raw calculation, but on updating the user interface.

2.5 No restriction on number of “layers/files”

At some point, the images used on the map for pin colours will start to get reused, but other than that, it is quite flexible how users organize the files.

https://gist.githubusercontent.com/fititnt/01a5b660013b54743989759c4a9b5f18/raw/cc48697b0d47a1280ea86238094d5bc5df2b47d2/13-loaded-datasets.png

Currently the colours of the pins are based on upload order. In live filtering (all data already in memory), users can also select one dataset as the focus. While I’m unsure of a better way to differentiate them, here is an example.

https://gist.githubusercontent.com/fititnt/01a5b660013b54743989759c4a9b5f18/raw/cc48697b0d47a1280ea86238094d5bc5df2b47d2/14-focus-dataset.png

While by default there is a maximum number of data points to show, if this limit has already been reached but the app knows a dataset is in focus, it will show 2x the limit; so, when working with a >1 million item dataset, the smaller ones you may be interested in are more likely to still be displayed in the preview.

3. Other performance comments

  • Memory usage tends to be around the same as, or lower than, the uncompressed size of the files on disk. There’s room for improvement not done yet, but by limiting how many items are displayed (for example, 10,000), this will use less memory than JOSM and have a UI with faster feedback than QGIS.
    • Memory usage tends to grow only at the import stage (or when you export a very large dataset, i.e. when you save a file). This (and also simplifying the logic) explains why, as soon as files are loaded, they are locked against editing. To work with different datasets, you need to refresh. To work with different sessions at the same time, just open 2 or more tabs.
    • If you notice it using more RAM than this, consider opening a new tab instead of reusing the tab from the previous import (no need to close the browser, just the tab). I noticed a browser refresh / hard refresh may not release the memory (potentially because the browser assumes you will use a lot of RAM again).

Here is one example with 6 files (uncompressed size on disk around 2.8 GB).

Baseline (using a WebKit-based browser): around ~30 MB (but for smaller datasets that still display all data, this will likely be around 100 MB when actually using the app).

https://gist.githubusercontent.com/fititnt/01a5b660013b54743989759c4a9b5f18/raw/cc48697b0d47a1280ea86238094d5bc5df2b47d2/z-1-memory-management_baseline.png

Loading all the files (using a WebKit-based browser): around 1.4 GB.

https://gist.githubusercontent.com/fititnt/01a5b660013b54743989759c4a9b5f18/raw/cc48697b0d47a1280ea86238094d5bc5df2b47d2/z-1-memory-management_chromium.png

Here, using the same datasets (I had to use GeoJSON instead of GeoJSON Lines): JOSM can load the CNEFE 2022 dataset for the city of Porto Alegre, but without optimisations it did not, in my test, finish importing a whole province (the computer had free RAM, but likely the JVM was not configured to use it).

https://gist.githubusercontent.com/fititnt/01a5b660013b54743989759c4a9b5f18/raw/cc48697b0d47a1280ea86238094d5bc5df2b47d2/z-3-memory-management_JOSM.png

And here QGIS, which is quite impressive at around 340 MB of RAM.

https://gist.githubusercontent.com/fititnt/01a5b660013b54743989759c4a9b5f18/raw/cc48697b0d47a1280ea86238094d5bc5df2b47d2/z-4-memory-management_QGIS.png

Obviously, QGIS and JOSM have different purposes. JOSM is already optimized for editing. QGIS (without needing the command line) seems a good choice to convert files. GeoJSON parsing may be one of the worst cases (because the code likely loads the entire file as a single string, not in chunks). I also noticed that the GeoJSON plugin for JOSM seems to merge points at the very same position with the same (or similar) tagging, which actually seems a good default.

I don’t have the test here, but using GeoPackage, JOSM would use less RAM on import. Something similar could be achieved by converting the large datasets into a single file on disk with vector tiles. (The link for the big file is below, and I’m truly curious how one would optimize JOSM down to the bare minimum.)

4. Files used in the tutorial (do NOT upload this to OpenStreetMap)

For the sake of testing the application (or checking whether errors are in the custom data you may be using), I will share a copy of the files used in the screenshots.

These files have OpenStreetMap data + 2 different “official” datasets (which can have conflicting information between themselves, such as imprecise positioning): one has a list of addr:housenumber values plus some extra non-detailed metadata surveyed around 2022, and the other is related to points of interest (fire stations, though for some the actual tagging could be office=*), sharing address and phone, but not email, and a name, suggested by the reference dataset rather than the typically used one, of what could be mapped on OSM. The actual number of focused things is around 200, not >6,000,000.

The v0.5.0-beta is still not making better groups between sources that may be about the same subject (sometimes files can be from the same provider). However, this might help readers understand that, while most solutions tend to break conflation into 1 vs. 1, my idea is to do this too, but also attempt to be more flexible. This is merely a 1 + 2 example, but some kinds of schools focused on learning disabilities could be > 1+5 (OSM, Wikidata, ref:vatin, a healthcare ref, an education ref). Not only this, but consider that the ref:vatin open data source does not have exact positions, and the text representation of addresses is a f**ng nightmare.

5. End comments

I hope this initial version can be a reasonable start. It doesn’t require an expensive server side to keep it running, which helps it avoid being shut down because of excessive memory and CPU usage.




Discussion

Comment from Mateusz Konieczny on 30 July 2024 at 07:51

Hello! I got linked here as I started to build a sort-of similar tool (link in https://codeberg.org/matkoniecz/improving_openstreetmap_using_alltheplaces_dataset/issues/8 ).

I definitely prefer not to reinvent the wheel, so I went here to review what exists already and hopefully use this tool or contribute to it.

Quick feedback, from opening website:

  • I failed to find a link to the source code or a project website at https://sdm.etica.ai/v/0.5/, and given the AGPL license I expected it next to the mention of AGPL
  • Help text on pressing “help” is not in English while other page elements are in English
  • “itens” should be “items” right? Unless it is some specialized term?
  • I see some other typos/non-English - is it useful to report them? Where can I just submit a patch to fix this?

Comment from Mateusz Konieczny on 30 July 2024 at 07:52

BTW, I see that my comment is the first one - have you submitted this to OSM Weekly/created OSM forums thread/maybe posted on imports mailing list?

Not sure whether I managed to miss it in all such places or have you not posted about this tool there.

Comment from fititnt on 6 August 2024 at 16:18

Oh, sorry for the delay. I missed the notification by mail. The license already was AGPL (trivia: inspired by Overpass and some repos from Mateusz) and the all-in-one HTML page can simply be downloaded; however, I also added a dedicated repository at https://github.com/fititnt/spatial-data-maching .

You can either use GitHub issues or this diary (in particular the next few days I will check for updates here).

By the way, I discovered a bug in the asynchronous loading process (not with the total number of elements, but when they are divided into many files, like the AllThePlaces dump). I will see if I can fix this soon, then reply to your other comments.

Some quick comments upfront

I definitely prefer not to reinvent the wheel, so I went here to review what exists already and hopefully use this tool or contribute to it.

(From https://codeberg.org/matkoniecz/improving_openstreetmap_using_alltheplaces_dataset/issues/8#issuecomment-2121177 ) I suspect that it is not ideal as ATP-OSM matching needs to be more advanced than merely based on location or 1:1 name matching to work acceptably

At a quick look at AllThePlaces, it’s less likely to have detected non-1:1 relationships than the datasets I’ve been using (I really got stuck on how to design a preview of complex links). I believe in the next weeks I will do another round of updates.

Considering the title of issue #8 (“review data for cases where Maproulette is potentially viable”), I’m openly interested in making it easier to be compatible with other tools (especially ones already related to data import/conflation). However (even in the long term), making the same codebase able to edit in some sort of “microtasking mode” (e.g. deciding concepts one by one) would be better done by other tools (even if something specific does not exist yet). The closest to this that I think may be interesting to do in the future would be to export files with documented conventions on how to read/write them (generic GeoJSON alone is insufficient), and if the apps don’t know how to upload the final result, I could make a converter to .osm/.osc files.

The text of the first topic on #8:

(…) (say, convenience shops are not plausible to map based on aerial imagery - but maybe there is way to detect cases where there is recent Bing Street Side or Mapillary imagery for them…)

I thought of something similar (maybe inspired by an issue on the iD repo that mentions this); however, I didn’t investigate how to get locations that have street-level imagery. I agree that this is highly reusable. Please ping me if you find a dataset with this kind of information. I could implement a filter (or document the existing one better) to pre-select values that are near something (just another specific file with this data). So, let’s say, the matches for data not already on OpenStreetMap (but near positions from a dataset that represents the existence of street-level imagery) should be displayed/exportable for manual human check, to help with focus.

The feasibility of automated suggestions is not a given: unless the issue follows some predictable pattern and happens in a lot of repeated cases, the time spent on automating may be higher than fixing manually. I found myself using previous versions of this visual tool for quick searches far more often than thinking about specific hardcoded strategies where users change the algorithm (also, different data might require different optimizations).

Comment from fititnt on 7 August 2024 at 00:21

I attempted to load the entire AllThePlaces dataset into the app, and it worked. 2024-08-06_alltheplaces-run.png

Some of the GeoJSON files from the dump were ignored (I mention a bit more at https://github.com/fititnt/spatial-data-maching/issues/1#issuecomment-2272324591) because they could not be parsed.

BTW, I see that my comment is the first one - have you submitted this to OSM Weekly/created OSM forums thread/maybe posted on imports mailing list?

Yes, it was mentioned in one of the OSM Weekly issues (likely they read the diaries). But other than that, since I was busy at the time I published, I didn’t mention it anywhere.

Also, this kind of subject is very, very niche, so I’m not surprised. Also, note the following: while (at least in theory) the interface doesn’t require knowledge of a programming language, creating the files to load into it does require some help (for example, having them created on demand and available somewhere, like you already do with not only the HTML versions).

If we ignore GeoJSON that can be exported from Overpass, almost any dataset that could be compared/conflated with OSM needs preprocessing. The approach I was trying to make friendlier was to implement a CSV importer; however, even this would need the user to name the columns as close as possible to the column names of the other files (in the case of comparing with Overpass, the user would need to convert it to the OSM schema).

In the big picture, I think this SDM tool (or any fork of it, if I become inactive) could be improved over time to allow an easy-to-use preview (without, for example, needing to install complex tools). However, it necessarily needs both other tools (or publicly generated datasets) as input, and whatever it is programmed to export MUST be usable in editors.
