
fititnt's Diary


The link for the public version is https://sdm.etica.ai/v/0.5/. I made an effort to make it easy and very cheap to host (currently it is a client-side, static, vanilla JavaScript+HTML app) and, as a side effect, the privacy of your data is preserved.

Since I joined OpenStreetMap in 2022, I’ve built some tools without a graphical interface; for this one, I’d love to receive feedback from potential users on such a niche topic.

Already in its early versions (I started a prototype in 2023 merely to debug the real conflation, which was done non-interactively before loading into OSM editors), I seriously considered making it a JOSM plugin or extending iD, instead of keeping it side by side with iD or alt-tabbing with JOSM.

The good news: it has basic support for using one or more reference files to match against one or more target files, by distance and/or by tagging, and then you download the GeoJSON. Granted, addr:street would need language- and country-level comparison (because of misspellings), and addr:postcode may already have logic to tolerate near matches. If you know vanilla JavaScript, you could code a function for your country to make the comparison more forgiving.
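
As an illustration of what such a country-specific function might look like (this is not part of the app’s code; the normalization rules below are assumptions for Brazilian Portuguese street names):

// Minimal sketch, not the app’s real matching code: a forgiving,
// country-specific comparison for addr:street values in Brazilian Portuguese.
function normalizeStreetBR(value) {
  return value
    .toLowerCase()
    .normalize('NFD').replace(/[\u0300-\u036f]/g, '') // drop accents
    .replace(/\br\.\s*/g, 'rua ')                     // expand "R." to "rua"
    .replace(/\bav\.\s*/g, 'avenida ')                // expand "Av." to "avenida"
    .replace(/\s+/g, ' ')
    .trim();
}

function sameStreetBR(a, b) {
  return normalizeStreetBR(a) === normalizeStreetBR(b);
}

// sameStreetBR('R. Catarino Andreatta', 'Rua Catarino  Andreatta') === true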

The bad news: for points of interest, the so-called “edgematch links”, “rubber sheeting links”, or whatever term is used for exporting a file saying “these 0-N items in dataset A match these 0-N items in dataset B”, necessarily need a human in the loop, and this happens in unpredictable ways. Links which aren’t obvious 1-to-1 (while there’s room for suggestions) require human input. It started as the “typical Leaflet map” plus a text-only view, but we might need a way to visualize N:M links (unless you have a UI suggestion for plotting such links over pins on a map!).

This diary is less about one implementation targeting a topic and more about suggestions, including realistic feedback on failed attempts. I love the human creativity involved in merging different information into something that could be given to OpenStreetMap, Wikidata and/or given back to the open data providers whose data needs review.

1. Quick overview of other tools and how this initial release fits in

Context: by citing other tools (which, trust me, don’t just have different approaches, but focus on different challenges), I hope to be helpful if your use case is already better served by one of them, or, as the title “Early feedback welcomed” suggests, this could help others suggest improvements here, such as how to present the interface.

I’d assume those most likely to be interested in this topic already have some knowledge of OpenStreetMap Conflation or Wikidata Imports.

One blog post with a comparison of some tools that is really worth reading is https://junglebus.io/MobilityData/benchmarks/Benchmark%20of%20existing%20open%20source%20solutions%20for%20conflating%20structured,%20geographical%20and%20transit%20data.html ; as a TL;DR, here is how this tool would fit:


Table 1

  Tool   Ecosystem                 Object type
  SDM    OpenStreetMap, Wikidata   Point

Table 2

                                                 SDM
  Need to match the dataset with the OSM model   yes
  Use an identifier existing in both datasets    possible, not mandatory
  Investigate each output element                needed
  Collaborative review                           no *
  Visualization of the conflation output         + **
  Visualization of each output element           + **
  Language                                       JavaScript
  User interface                                 dedicated webapp, client-side, works offline
  License                                        AGPL-3 ***

*: if there’s interest, it would eventually be feasible to export JSON or GeoJSON with additional information for tools that are collaborative. OSM Conflate and (as a preparation step for PoIs) MapRoulette seem like decent choices.

**: while I’m already looking for inspiration in other tools (v0.5.0 does not even have something basic such as a diff per item), visualization is likely to become a core functionality.

***: I might change it to public domain if it makes it more likely to get collaboration.

On conflation in general, there are other tools than the ones listed in that blog post. I will quickly comment on some of them.

  • ArcGIS Pro (paid) gives me the impression (thinking from a user’s perspective, not a software developer’s perspective) of having a “single button” for typical actions users want, which in open source alternatives such as QGIS would take several steps plus a custom script.
  • QGIS (if you don’t already have it installed) is good to have around, even if only to save you the trouble of using GDAL or GRASS directly to convert files from/to GeoJSON / GeoJSON Lines (which is the main format used by the tool I’m presenting).
  • MapRoulette is not cited there, but it actually works as some kind of conflation tool.
  • RapiD (when enabled with datasets from authoritative sources or generated by machine learning) also works as some sort of conflation tool.
    • Maybe this is intentional (since doing otherwise could make RapiD less likely to eventually be added as an additional editor on OpenStreetMap.org), but other than the very specific list of allowed datasets, RapiD has no changes at all over iD for loading a data layer (e.g. the GeoJSON you could get as an export).
      • There’s no way to add more than one data layer, nor to customize colours. I would consider this really important, and not really hard to implement.
        • (Actually this also seems not possible in JOSM and QGIS.) For data layers, there is no quick filter to display a subset of them by attribute, so if a PoI (even with the right addr:housenumber) is not close, it becomes very manual labour to click one by one.
  • And obviously Hootenanny, which, while likely the most feature-rich option for interactive conflation, the OSM Wiki page on Conflation rightfully cites as complex to install.
    • It started as a fork of a (now older) version of iD. RapiD also started as a fork of iD and has some built-in support for conflating data, but it is very basic compared to Hootenanny.

2. Screenshots with context of the implementation at sdm.etica.ai

2.1 Kind of “co-pilot” for an OSM editor (iD example)

Some mappers already look at official websites to enhance metadata on OpenStreetMap. When these sources publish such data in something you can convert to GeoJSON, with tagging close to what you would do in OSM, you can do the following:

  1. Load one or more of these datasets into the app
  2. Divide the screen between the iD editor and this app. I put it on the right side because it is close to the panel of iD.
  3. When I find an OpenStreetMap element without more data, I copy and paste existing attributes from the element and place them into the search box of the app.
  4. Sometimes, you may need to filter by addr:street (copied from nearby roads: the name=, alt_name= and old_name= values). If you find the data, copy and paste it from the app into iD’s free-text tag editor.

https://gist.githubusercontent.com/fititnt/01a5b660013b54743989759c4a9b5f18/raw/cc48697b0d47a1280ea86238094d5bc5df2b47d2/1-example-co-pilot-id.png

While the external dataset had over 36,000 items, by selecting with

addr:housenumber=155
addr:street=Rua Catarino Andreatta

the match was 1 of the 9 results. It’s a manual process, but you can copy the tags from the text area.

https://gist.githubusercontent.com/fititnt/01a5b660013b54743989759c4a9b5f18/raw/cc48697b0d47a1280ea86238094d5bc5df2b47d2/2-example-co-pilot-id.png

The preview in the map mode also has the same keys, which can be copied and pasted.

iD (documented at develop/API.md) allows the creation of direct links, and a lot of other software has something similar for which we could add shortcuts. JOSM has Remote Control, which is notable because it can reuse the same JOSM instance and make changesets with more edits than iD. In this use case you use the app in full screen to find what you can edit on OpenStreetMap in your default editor.
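
As a rough sketch of how such shortcut links can be built for a pin (the URL patterns follow openstreetmap.org and the JOSM Remote Control documentation; the bounding-box size is an arbitrary choice for illustration):

// Sketch only: building editor shortcut links for a pin at (lat, lon).
function editorLinks(lat, lon) {
  const d = 0.0005; // half-size of the bounding box sent to JOSM, in degrees
  return {
    iD: `https://www.openstreetmap.org/edit?editor=id#map=19/${lat}/${lon}`,
    josm: 'http://127.0.0.1:8111/load_and_zoom' +
      `?left=${lon - d}&right=${lon + d}&bottom=${lat - d}&top=${lat + d}`,
  };
}

// editorLinks(-29.92421, -51.17002).josm works while JOSM is running locally
// with Remote Control enabled.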

https://gist.githubusercontent.com/fititnt/01a5b660013b54743989759c4a9b5f18/raw/cc48697b0d47a1280ea86238094d5bc5df2b47d2/3-pin-mode.png


https://gist.githubusercontent.com/fititnt/01a5b660013b54743989759c4a9b5f18/raw/cc48697b0d47a1280ea86238094d5bc5df2b47d2/5-iD-link.png


https://gist.githubusercontent.com/fititnt/01a5b660013b54743989759c4a9b5f18/raw/cc48697b0d47a1280ea86238094d5bc5df2b47d2/6-josm-link.png

Feedback is also welcome on how to optimize the space used by the links on the map. While writing this diary I noticed a link to the Level0 editor.

2.3 Display OpenStreetMap data along with other data in the app

As you will notice, the webapp does not load OpenStreetMap data itself (at least not yet, but it is viable to implement), so OpenStreetMap Carto as the default base map helps to compare with the pins.

However, you can use Overpass Turbo and select its result as one of the inputs: just use the export button and save as GeoJSON. (A later example uses conflation at import between 2 external datasets, but the same could be done to compare what’s on OSM with what’s in an external dataset nearby.)

https://gist.githubusercontent.com/fititnt/01a5b660013b54743989759c4a9b5f18/raw/18408532791aed56abd7494845d45b0a12b3cb7e/7-overpass.png

In my tests since last year, a dataset prepared to the OpenStreetMap schema may have many more fields than we would use. This explains why there’s a field where you select which tags are imported into the app.

https://gist.githubusercontent.com/fititnt/01a5b660013b54743989759c4a9b5f18/raw/cc48697b0d47a1280ea86238094d5bc5df2b47d2/8-restriction-by-key.png

Unless you uncheck it, by default if the GeoJSON seems to be an OpenStreetMap export, it will bypass this selection.

2.4 Working with very large datasets

As the idea of the app is to help you match data, you may have one or more smaller datasets that need to be matched against one big one.

Currently there are 2 strategies:

  1. At the import stage: you prefilter 1+ subject datasets using 1+ reference datasets. Filtering both by distance and by matching attributes (such as addr:housenumber) is possible (a rough sketch of the distance part follows this list).
  2. At the live filter stage: all datasets are already loaded in memory, and can even be exported, but at some point the preview will not show everything.
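
A minimal sketch of the distance part of that prefilter, assuming plain GeoJSON point features (function names and structure are illustrative, not the app’s internal API):

// Keep only subject features that have at least one reference feature
// within maxMeters (GeoJSON coordinates are [longitude, latitude]).
function haversineMeters([lon1, lat1], [lon2, lat2]) {
  const R = 6371000; // Earth radius in meters
  const toRad = (deg) => (deg * Math.PI) / 180;
  const dLat = toRad(lat2 - lat1);
  const dLon = toRad(lon2 - lon1);
  const a = Math.sin(dLat / 2) ** 2 +
    Math.cos(toRad(lat1)) * Math.cos(toRad(lat2)) * Math.sin(dLon / 2) ** 2;
  return 2 * R * Math.asin(Math.sqrt(a));
}

function prefilterByDistance(subjectFeatures, referenceFeatures, maxMeters) {
  return subjectFeatures.filter((s) =>
    referenceFeatures.some((r) =>
      haversineMeters(s.geometry.coordinates, r.geometry.coordinates) <= maxMeters
    )
  );
}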

The main format used is GeoJSON, but with very large datasets you need to pre-convert to GeoJSON Text Sequences (see the formal specification in RFC 8142), also known as “GeoJSON Lines”. (By the way, if you are generating it from scratch, do it with RS+LF, not just LF.)
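
A minimal sketch of generating such a file from features already in memory (RS is the ASCII record separator, 0x1E, required by RFC 8142 before each item):

// Write GeoJSON Text Sequences: each feature prefixed with RS and ended with LF.
const RS = '\x1e';

function toGeoJSONSeq(features) {
  return features.map((f) => RS + JSON.stringify(f) + '\n').join('');
}

// For very large datasets, stream instead of building one big string, e.g. in Node.js:
// for (const f of features) outStream.write(RS + JSON.stringify(f) + '\n');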

2.4.1 Example at import stage (use items from 1+ datasets to find matching items from other datasets)

The exact position of the controls may change in future versions, but currently you:

  1. Define the distance and (if relevant) also the matching key. Then load 1+ reference datasets.

https://gist.githubusercontent.com/fititnt/01a5b660013b54743989759c4a9b5f18/raw/cc48697b0d47a1280ea86238094d5bc5df2b47d2/9-conflation-at-impoirt-reference-files.png

After that, just select 1+ datasets in the main file input.

https://gist.githubusercontent.com/fititnt/01a5b660013b54743989759c4a9b5f18/raw/a03752c04337a0e952ed0a4613b2a3880264c6f0/10-conflation-at-import-now-the-files.png

At the end, you can just export the file (potentially to reuse it again in a next session).

https://gist.githubusercontent.com/fititnt/01a5b660013b54743989759c4a9b5f18/raw/cc48697b0d47a1280ea86238094d5bc5df2b47d2/11-conflation-at-import-downloading-without-going-live-filter.png

The speed of this process is greatly affected by the number of items in the reference dataset. However, note that you can export the result to a file on your disk, so you can start a new session with only precomputed data.

Quick comments about these examples:

  1. While there may be some last-minute bug in the UI (which is why I would recommend using https://sdm.etica.ai/v/0.5/, not https://sdm.etica.ai/, which I might be changing faster), a filter that reduces 6M to 1M would be too forgiving. The real filters are heavily dependent on the reference datasets and the target datasets.

  2. One reason for the input dataset ending up smaller than 1/6 is also the selection of which keys are allowed to be loaded into memory.

While the time will vary greatly with how powerful the user’s CPU is, with a recent 6-core / 12-thread CPU, conflating all house addresses surveyed in the last Brazilian Census for one province (this one: https://www.openstreetmap.org/relation/242620, population 11,322,895) took around 55 seconds (around 50% of this is merely reading the GeoJSON-Seq into chunks, not the comparison with items from the reference datasets). This kind of processing time will necessarily increase with proper fine-tuning. For example, as soon as forgiving matches are implemented, such as non-exact addr:street (which varies by country and language and would need to be programmed in JavaScript), CPU use will increase.

While this may not seem like much, if such processing were done “in the cloud”, making it free for everyone on OpenStreetMap would be expensive.

2.4.2. Example at live filtering stage

It would be trivial to copy the same logic (dataset vs dataset, using reference files) from the import stage to the live filtering stage; however, full recalculation would lead to a bad user experience (for a province-level dataset like the previous step, think >1 minute). With over a million points waiting in memory in the background, trying to match one or a few items might still be fast (just “not instantaneous”).

The current version doesn’t have an “auto suggestion”, but I guess this could be implemented with some defaults exploiting the fact that the datasets will already be using the OpenStreetMap schema. Suggestions are welcome, and maybe after that, proofs of concept to try them, but I can say upfront (a rough sketch follows this list) that:

  1. Instead of a “yes/no” match, there should be some numeric result (even if only to sort results).
  2. Sometimes either the source or the target may not have a field. This is different from a false match; it’s an unknown case.
  3. Some datasets may have no position at all, so the match is done by address alone (which may need an intermediary dataset). Also, plotting these cases on Null Island makes for a poor experience.
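
A rough sketch of the kind of numeric scoring hinted at in items 1 and 2 (this is not the app’s real logic; keys that are missing on either side are treated as unknown rather than as mismatches):

function matchScore(candidateTags, targetTags, keys) {
  let matched = 0;
  let compared = 0;
  for (const key of keys) {
    const a = candidateTags[key];
    const b = targetTags[key];
    if (a === undefined || b === undefined) continue; // unknown, not a mismatch
    compared += 1;
    if (a === b) matched += 1;
  }
  // null means "nothing comparable", which is different from a score of 0
  return compared === 0 ? null : matched / compared;
}

// matchScore({ 'addr:housenumber': '155' },
//            { 'addr:housenumber': '155', phone: '+55 51 0000 0000' },
//            ['addr:housenumber', 'phone']) === 1  // phone is unknown on one side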

https://gist.githubusercontent.com/fititnt/01a5b660013b54743989759c4a9b5f18/raw/cc48697b0d47a1280ea86238094d5bc5df2b47d2/12-conflation-single-item-live-filtering.png

How the live filtering may be used really depends on the dataset (for sparse points we could use kilometers, but for very near points, something like 100 meters); however, this more manual strategy still works as a fallback.

The “Position” field can accept latitude/longitude values (such as -29.92420 -51.17002), a temporary identifier of any element inside the dataset, or even a URL like https://www.openstreetmap.org/#map=18/-29.92421/-51.17002 (the regex will extract -29.92421 and -51.17002).
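
For illustration, extracting the coordinates from such a URL can be as simple as the following (the app’s actual regex may differ):

function coordsFromOsmUrl(url) {
  // Matches fragments like #map=18/-29.92421/-51.17002 (zoom/lat/lon)
  const m = url.match(/#map=\d+\/(-?\d+\.?\d*)\/(-?\d+\.?\d*)/);
  return m ? { lat: parseFloat(m[1]), lon: parseFloat(m[2]) } : null;
}

// coordsFromOsmUrl('https://www.openstreetmap.org/#map=18/-29.92421/-51.17002')
// returns { lat: -29.92421, lon: -51.17002 }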

Quick comment on this example:

  1. With 6M items in the background, and without yet implementing any more advanced checks, the processing takes between 500 ms and 800 ms. Of these milliseconds, most are likely spent not on the raw calculation, but on updating the user interface.

2.5 No restriction on number of “layers/files”

At some point, the images used on the map for pin colours will start to get reused, but other than that, it is quite flexible how users can organize the files.

https://gist.githubusercontent.com/fititnt/01a5b660013b54743989759c4a9b5f18/raw/cc48697b0d47a1280ea86238094d5bc5df2b47d2/13-loaded-datasets.png

Currently the colours of the pins are based on the order of upload. In live filtering (all data already in memory), users can also select a dataset as the focus. While I am unsure of a better way to differentiate, this is an example.

https://gist.githubusercontent.com/fititnt/01a5b660013b54743989759c4a9b5f18/raw/cc48697b0d47a1280ea86238094d5bc5df2b47d2/14-focus-dataset.png

While by default there is a maximum number of data points to show, if this was already reached but the app knows there is a dataset in focus, it will show 2x the limit, so when working with a >1 million item dataset, the smaller ones you may be interested in are more likely to still be displayed in the preview.

3. Other performance comments

  • The memory usage tends to be around the same as, or lower than, the uncompressed size of the files on disk. There’s room for improvement not done yet, but by limiting how many items are displayed (for example 10,000), this will use less memory than JOSM and have a UI with faster feedback than QGIS.
    • Memory usage tends to only grow at the import stage (or when you export a very large dataset, i.e. when you save a file). This (and also simplifying the logic) explains why, as soon as files are loaded, they are locked for editing. To work with different datasets, you need to refresh. To work with different sessions at the same time, just open 2 or more tabs.
    • If you notice it using more RAM than this, consider opening a new tab instead of reusing the tab from a previous import (no need to close the browser, just the tab). I noticed a browser refresh / hard refresh may not release it (potentially because the browser assumes you will use a lot of RAM again).

Here is one example with 6 files (uncompressed size on disk around 2.8 GB).

Baseline (using a WebKit-based browser): around 30 MB (but for smaller datasets that still display all data, this will likely be around 100 MB when actually using the app).

https://gist.githubusercontent.com/fititnt/01a5b660013b54743989759c4a9b5f18/raw/cc48697b0d47a1280ea86238094d5bc5df2b47d2/z-1-memory-management_baseline.png

Loading all the files (using a WebKit-based browser): around 1.4 GB.

https://gist.githubusercontent.com/fititnt/01a5b660013b54743989759c4a9b5f18/raw/cc48697b0d47a1280ea86238094d5bc5df2b47d2/z-1-memory-management_chromium.png

Here, using the same datasets (I had to use GeoJSON instead of GeoJSON Lines): JOSM can load the CNEFE 2022 dataset for the city of Porto Alegre, but without optimisations it eventually gave up on my machine before finishing the import of a province (the computer had free RAM, but likely the JVM was not configured to allow it).

https://gist.githubusercontent.com/fititnt/01a5b660013b54743989759c4a9b5f18/raw/cc48697b0d47a1280ea86238094d5bc5df2b47d2/z-3-memory-management_JOSM.png

And here QGIS, which is quite impressive at around 340 MB of RAM.

https://gist.githubusercontent.com/fititnt/01a5b660013b54743989759c4a9b5f18/raw/cc48697b0d47a1280ea86238094d5bc5df2b47d2/z-4-memory-management_QGIS.png

Obviously, QGIS and JOSM have different purposes. JOSM is already optimized for editing. QGIS (without needing to use the command line) seems a good choice to convert files. GeoJSON parsing may be one of the worst cases (because code will likely load the entire file as a single string, not in chunks). I also noticed that the GeoJSON plugin for JOSM seems to merge points at the very same position with the same (or similar) tagging, which actually seems a good default.

I don’t have that test here, but using GeoPackage, JOSM would use less RAM on import. Something similar could be achieved by converting the large datasets into a single file on disk with vector tiles. (The link to the big file is below, and I’m truly curious how one would optimize JOSM down to the bare minimum.)

4. Files used in the tutorial (do NOT use to upload this to OpenStreetMap)

For the sake of testing the application (or checking whether errors are in the custom data you may be using), I will share a copy of the files used in the screenshots.

These files have OpenStreetMap data + 2 different “official” datasets (which can have conflicting information between themselves, such as imprecise positioning): one which has a list of addr:housenumber values plus some extra non-detailed metadata surveyed around 2022, and another which is related to points of interest (fire stations, though for some the actual tagging could be office=*, despite sharing address and phone, but not email, and a name suggested by the reference dataset rather than the typically used name of what could be mapped on OSM). The actual number of focused things is around 200, not >6,000,000.

The v0.5.0-beta still does not make better groupings between sources that may be about the same subject (sometimes files can come from the same provider). However, this might help readers understand that, while most solutions tend to break conflation into 1 vs 1, my idea is to do this too, yet also attempt to be more flexible. This is merely a 1+2 example, but some kinds of schools focused on learning disabilities could be >1+5 (OSM, Wikidata, ref:vatin, a healthcare ref, an education ref). Not only this, but consider that the ref:vatin open data source does not have exact positions, and the text representation of addresses is a f**ng nightmare.

5. End comments

I hope this initial version can be a reasonable start. It doesn’t require an expensive server side to keep it running, which helps it not to shut down because of excessive memory and CPU usage.


Post-edits

This text is a continuation of my previous diary and does what the title says. The draft already existed 6 months ago, but only today am I publishing this diary. Anyway, there’s a comment I heard about this on the @Wikimaps Telegram:

“You seem to be doing what we were doing 10+ years ago before Wikidata existed” – Maarten Dammers’ opinion on what this approach is doing

Well, he’s right… but there’s a reason for that. This diary has 4 parts; the examples are in part 3.

1. Preface

This extension could be perceived as one approach to general-purpose data extraction from the OpenStreetMap Wiki, which, in an ABox vs TBox dichotomy, is the closest thing to a TBox for OpenStreetMap (*).

*: if we ignore id-tagging-schema and, obviously, other custom strategies to explain the meaning of OpenStreetMap data, which could include the CartoCSS used to explain how to render it as an image. I do have a rudimentary draft trying to make sense of all these encodings, but it is not ready for today.

1.1 Wikibase is not a consensus even among what would be the ontologists of OpenStreetMap

Tip: for those wanting to view/review some past discussions, check https://wiki.openstreetmap.org/wiki/User:Minh_Nguyen/Wikidata_discussions#Wikidata_link_in_wiki_infoboxes. The same page from Minh Nguyen has other links, such as discussions to remove the entire Wikibase extension from OSM.wiki at https://github.com/openstreetmap/operations/issues/764.

On the surface, it may appear that the partial opposition to Wikibase was because of some minor user interface issues. But reading the old discussions (for those not aware, I’m new to OpenStreetMap, having joined in October 2022), that would be insufficient to understand why a stricter, overcentralized approach is rejected. I suspect that part of the complaints (which are reflected in very early criticisms, including from the Taginfo developer on the mailing list, although he himself was criticized years before when he attempted to improve standardization, at least for parsing the wiki, which is very relevant here; I’m seeing a trend: innovators today, conservatives tomorrow) is that attempting to encode a strictly logically consistent TBox in a single storage would not be feasible: some definitions might contradict each other.

One fact is that OpenStreetMap data is used successfully in production, and there is a significant number of tools focused on its data. Wikidata may be better known as a community-contributed linked data repository; however OpenStreetMap, while its RDF representation is less standardized today, is known to be used in production with little to no transformation other than data repacking. In other words, mass rewrites of OSM data can easily break a lot of applications. Note my focus here: “production use” means the developers who also consume the data are focused on keeping it usable, not wanting to break things unless there is a valid reason for it. One impact is that proposals wanting to refactor tagging already used on data will likely be refused.

However, similar to how Wikidata has proposals for properties, OpenStreetMap does have a formal process for tagging (in addition to tags simply being “de facto” or “in use”). This alone is proof that, while some might not call themselves ontologists, and defend the idea of Any tags you like, they actually play the role of ontologists. The mere fact that they don’t call themselves this, or don’t use a popular strategy to encode ontologies, e.g. RDF, doesn’t make their criticism invalid, because they may simply not be complaining about the standards (or even Wikibase itself) but about the idea of how these are used to solve problems on OpenStreetMap.

I’m trying to keep this topic short, but my current hypothesis for why the TBox of OpenStreetMap cannot be fully centralized is that, while developers might have several points in common and be willing to integrate them in their software (both id-tagging-schema and editor-layer-index are examples of this), they have technical reasons not to agree 100%, so strategies to make this easier make sense. For example, either some tags can contradict each other (which even for semantic reasoning is a blocker, because a tag cannot just “be fixed” if it has to stay realistic with implementations) or their definition might be too complex for a production implementation.

In this respect, the current deliverable of this diary might seem a step backwards compared to how Wikibase works, but in addition to trying to further formalize and help with data mining of OSM.wiki infoboxes, it starts the idea of getting even more data from wiki pages. And yes, in the future it could be used by other tools to help synchronize OSM infoboxes with a Wikibase instance such as Data Items again, even if that only means detecting differences so humans can act. Even knowing it is impossible to reach 100%, we could try to work on a baseline which could help others consume not just OpenStreetMap data (the ABox) but also part of its tagging, which is part (but not all) of the TBox; in this journey, though, well before that, it might be necessary to help understand inconsistencies.

1.2 Wikibase is not even the only approach using MediaWiki for structured content

Wikibase, while it powers Wikidata, is not the only extension which can be used with MediaWiki. A good link for a general overview is probably https://www.mediawiki.org/wiki/Manual:Managing_data_in_MediaWiki. These MediaWiki extensions focused on structured data are server-side, a centralized approach (which assumes others agree with how to implement it from the start). Since a text field holding the wikitext of all pages in the MediaWiki database wouldn’t be queryable, these extensions actually use MediaWiki as permanent, versioned storage, but take on the responsibility of synchronizing such data with some more specialized database engine (or at least the same database, but with additional tables). Even Wikibase still relies on an external RDF triplestore to allow running SPARQL; its user interface (the one humans edit on sites like Wikidata) is an abstraction that stores the data like a page in MediaWiki (the Wikibase extension actually uses undocumented JSON, not wikitext).

One (to the author’s knowledge) unique feature of the implementation this diary presents is the following: it doesn’t require installation on the MediaWiki server. One side effect is that it can also, out of the box, parse data from multiple MediaWiki wikis, and I’m not only talking about mixing OSM.wiki and the OpenStreetMap Foundation wiki; it could extract data from Wikipedias. You are free to decide which pages in the selected wiki should contain the data, without any specific URL pattern (like prefixes with Qs or Ps), and this aspect is more similar to other MediaWiki alternatives to Wikibase.

1.3 Then, what does a decentralized approach, without a particular database, mean?

I’m very sure, especially for ontologists (the ones less aware of the diverse ecosystem of data consumers on OpenStreetMap), that the very idea of not optimizing for centralized storage will be perceived as an anti-pattern. However, while it requires more work, those interested could still ingest the data into a single database. The command-line implementation does not dictate how the data should be consumed, because it has other priorities.

“Make each program do one thing well.” – (part of) The UNIX Philosophy


What all these MediaWiki extensions have in common is parsing wikitext (Wikibase is JSON), and this one does that specific part. For the sake of making things easier for the user (and making wiki admins less likely to have an incentive to block data mining with this tool), it actually caches the data locally in a SQLite database, which makes it somewhat friendly for repeated use (maybe even for offline/backup use if you set a higher expiration date). But unless you work directly with its default outputs (explained in the next section), if you want a full solution you will still need to choose which storage to save the data in, optimized for your use cases. So, this implementation could help synchronize OSM infoboxes with the OSM Data Items, but its use is actually an abstraction for generic use cases. In the OpenStreetMap world, Taginfo is known to parse the wiki, and Nominatim also uses the wiki to extract some information.

2. The “data model” (…of exported data handcrafted in the Wiki)

With all the explanation in the preface in mind, the implementation optimizes the result of the data mining for a dump-like, interoperable file format.

I do have some experience generating and documenting groups of files in the Humanitarian eXchange Language standard, so at least for extracted tabular data, if there is sufficient interest, instead of custom JSON-LD the packaging could be one of those standards highly optimized for traditional SQL databases, instead of what could be achieved if the data inside this top-level JSON-LD could be directly usable as RDF. But let’s focus for now.

Some technical decisions, at the moment, of the generic approach:

  1. The exported data is JSON where the individual parts of the page are inside a list/array under the top-level “data” key. This is a popular convention in REST APIs; another is to use a top-level “error” key.
  2. The alternative is JSON-seq (RFC 7464), which makes it friendly to work with continuous streaming or to merge different datasets by… just concatenating the files. This approach could also, in the future, be highly optimized for massive datasets with low memory use. (A small consumption sketch follows this list.)
  3. The fields are documented with JSON-LD and JSON Schema, and everything else (from standards to tooling) able to work with these. The working draft is available at https://wtxt.etica.ai/
  4. As another alternative, the implementation also allows materializing the individual items extracted from the pages as files, both with global file names (unique even when merging different wikis) and with optional customized file names. The output is a zip file with a predictable default directory structure.
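
To make the two output shapes concrete, here is a minimal consumption sketch in JavaScript; only the top-level “data” array, the one-item-per-line streaming layout and the presence of @type are assumed from the description above, and the filenames reuse the ones from the examples in section 3:

import { readFileSync } from 'node:fs';

// 1. Plain JSON-LD: individual items live in the top-level "data" array
const doc = JSON.parse(readFileSync('Category:References.jsonld', 'utf8'));
for (const item of doc.data) {
  console.log(item['@type']);
}

// 2. JSON text sequences: one JSON object per line (strip the RS prefix if present)
const lines = readFileSync('merged-cat-after.jsonl', 'utf8').split('\n');
for (const line of lines) {
  const text = line.replace(/^\x1e/, '').trim();
  if (text) console.log(JSON.parse(text)['@type']);
}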

One known limitation of this overgeneralization is that only the top level of the JSON-LD and the @types are strictly documented by default. Sorry.

Would it be possible to allow customization of the internal parts in the future? “Yes”. However (and this comes from someone who has already made CLI tools with a massive number of options), it doesn’t seem like a good usability idea to add way too many command-line configurations instead of expressing them as some kind of file (which could potentially itself be extracted from wikis). For those thinking about it, let me say upfront that to fill this gap, MediaWiki templates (a.k.a. the infoboxes) and the tabular data could have at least per-wiki profiles for whatever becomes a consensus. Tables, and the subset of syntaxhighlight code blocks relevant for reuse (or some kinds of templates with SPARQL / OverpassQL, which are also example code), could have additional hidden comments giving hints about how they are exported, at minimum their suggested file names. To maximize such an approach, every MediaWiki would require some sort of global dictionary (for things which already have a global meaning, not varying by context) to give hints on how to convert, for example, {{yes}} into something machine-readable like true. Another missing point would be conversion tables which might depend on context (such as “inuse” -> “in use” in OSM infoboxes), so that, as much as possible, humans are spared from rewriting hundreds of pages with misspellings or synonyms, as long as some page on the wiki can centralize these profiles.
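
As a purely hypothetical sketch of such a per-wiki profile (nothing like this exists in v0.5.10; the structure and names are invented for illustration):

// Hypothetical only: a per-wiki conversion profile as proposed above.
const osmwikiProfile = {
  // context-free conversions with a global meaning
  global: {
    '{{yes}}': true,
    '{{no}}': false,
  },
  // context-dependent conversions, e.g. status values in OSM infoboxes
  'Template:ValueDescription': {
    status: { inuse: 'in use' },
  },
};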

3. Practical part

With all this said, let’s go through the examples not already covered in the README.md and the --help option of the tool.

3.0. Requirements for installation of the wiki_as_base CLI tool

pip install wiki_as_base==0.5.10
# or, for the latest version, use
# pip install wiki_as_base --upgrade

3.1. JSON-LD version of a single page on OSM.wiki

By default the tool assumes you want to parse the OpenStreetMap Wiki and are OK with a cache of 23 hours (which would be similar to parsing the wiki dump).

The following example will download OSM.wiki Tag:highway=residential

wiki_as_base --titles 'Tag:highway=residential'

3.2. “Just give me the code example files” of a single page on OSM.wiki

The parser tries its best to detect what’s on the Wikitext without any customization. For example, if the wikitext is using the right syntaxhighlight codes, it tries to use that as a suggestion for which file extension that code would have.
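
As an illustration only (the real mapping inside the tool may differ), the kind of hint derived from the syntaxhighlight language could look like this:

// Illustration only: mapping a <syntaxhighlight lang="..."> hint to a
// suggested file extension, with a plain-text fallback.
const langToExtension = {
  python: '.py',
  bash: '.sh',
  sparql: '.rq',
  json: '.json',
  yaml: '.yaml',
};

function suggestedExtension(lang) {
  return langToExtension[lang] ?? '.txt';
}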

Let’s use this example page, User:EmericusPetro/sandbox/Wiki-as-base. A different parameter will export a zip file instead of JSON-LD:

# JSON-LD output
wiki_as_base --titles 'User:EmericusPetro/sandbox/Wiki-as-base'

# Files (inside a zip)
wiki_as_base --titles 'User:EmericusPetro/sandbox/Wiki-as-base' --output-zip-file sandbox-Wiki-as-base.zip

Wikitext parsing (the kind done by this implementation) can benefit from receiving more explicit suggestions for the preferred exported filenames. So for pages written as technical guides, a proxy of this could offer a download link for a tutorial with predictable filenames, while other wiki contributors could still improve it over time.

3.3. Download all parseable information of pages in a small category on OSM.wiki

Let’s say you want to fetch OSM.wiki Category:OSM_best_practice: not merely one article from it, like Relations_are_not_categories, but all pages in the respective category.

# JSON-LD output
wiki_as_base --input-autodetect 'Category:OSM_best_practice'

# Files (inside a zip)
wiki_as_base --input-autodetect 'Category:OSM_best_practice' --output-zip-file Category:OSM_best_practice.zip

Trivia: this request is done with only 2 background fetches: one to learn the pages of the category and one for all the pages.

3.4. Download all parseable information of pages in a well-used category on OSM.wiki

Let’s say you want Category:References. Now the CLI tool will behave differently, as it assumes it can fetch at most 50 pages in one step (the default most MediaWikis allow non-admins/non-bots to request). This means it will paginate, save to the local cache, and ultimately just output the final result.

# The --verbose argument will output more information,
# in this case hints about looping, if have cache, etc.
# It will take 50 seconds plus server delay plus internal time to compute the result
wiki_as_base --input-autodetect 'Category:References' --verbose --output-zip-file Category:References.zip
# (print to stderr)
#    loop... Cached: [False] Expired: [False] delay if not cached [10]
#    loop... Cached: [False] Expired: [False] delay if not cached [10]
#    loop... Cached: [False] Expired: [False] delay if not cached [10]
#    loop... Cached: [False] Expired: [False] delay if not cached [10]
#    loop... Cached: [False] Expired: [False] delay if not cached [10]

# Now, let's run it again. Since the raw requests are cached for 23 hours,
# it will reuse the cache.
wiki_as_base --input-autodetect 'Category:References' --verbose > Category:References.jsonld
# (print to stderr)
#    loop... Cached: [True] Expired: [False] delay if not cached [10]
#    loop... Cached: [True] Expired: [False] delay if not cached [10]
#    loop... Cached: [True] Expired: [False] delay if not cached [10]
#    loop... Cached: [True] Expired: [False] delay if not cached [10]
#    loop... Cached: [True] Expired: [False] delay if not cached [10]


# Current directory
ls -lh | awk '{print $5, $9}'
# 
#    668K Category:References.jsonld
#    315K Category:References.zip
#    540K wikiasbase.sqlite

3.4.1 Controlling the delay for pagination requests

By default, not only does the tool cache, but the CLI will intentionally add a delay 10 times longer if you don’t customize the user-agent hint and it detects that it must paginate additional background requests. Currently, 10 times means 10 x 1 second (requests are only sequential, never parallel), but if this gets heavier usage, it could be increased.

The logic behind delaying non-customized user agents more is to have fewer users leaving the contact information unchanged (otherwise complaints would point to the developer of the tool). Here is the behavior once you customize its contact information:

## change the contact information on the next line
# export WIKI_AS_BASE_BOT_CONTACT='https://github.com/fititnt/wiki_as_base-py; generic@example.org'
export WIKI_AS_BASE_BOT_CONTACT='https://wiki.openstreetmap.org/wiki/User:MyUsername; mycontact@gmail.com'

# time will output the real time taken to finish the command. In this case, 5 x 1 s are artificial delay,
# and the rest is download time (which is not instantaneous) plus internal computation
time wiki_as_base --input-autodetect 'Category:References' --verbose > Category:References.jsonld
#    loop... Cached: [False] Expired: [False] delay if not cached [1]
#    loop... Cached: [False] Expired: [False] delay if not cached [1]
#    loop... Cached: [False] Expired: [False] delay if not cached [1]
#    loop... Cached: [False] Expired: [False] delay if not cached [1]
#    loop... Cached: [False] Expired: [False] delay if not cached [1]
#
#    real	0m15,170s
#    user	0m1,518s
#    sys	0m0,041s

However, if you do want to identify yourself but believe that a 1-second additional delay between sequential requests is too low (which might be the case for a bot without human supervision), the next example will use 30 seconds.


export WIKI_AS_BASE_BOT_CONTACT='https://wiki.openstreetmap.org/wiki/User:MyUsername; mycontact@gmail.com'
export WIKI_AS_BASE_BOT_CUSTOM_DELAY='30'
time wiki_as_base --input-autodetect 'Category:References' --verbose > Category:References.jsonld
#    loop... Cached: [False] Expired: [False] delay if not cached [30]
#    loop... Cached: [False] Expired: [False] delay if not cached [30]
#    loop... Cached: [False] Expired: [False] delay if not cached [30]
#    loop... Cached: [False] Expired: [False] delay if not cached [30]
#    loop... Cached: [False] Expired: [False] delay if not cached [30]
#
#    real	2m40,390s
#    user	0m1,565s
#    sys	0m0,036s

3.5. Download all parseable information of a known, exact list of wiki pages on OSM.wiki

The initial command used to fetch a single page actually accepts multiple pages: just separate them with |.

In this example we’re already using another parameter, --pageids, instead of fetching by name.

## Uses curl, tr and jq <https://jqlang.github.io> as one example of how to get the pageids of some pages.
# curl --silent 'https://wiki.openstreetmap.org/w/api.php?action=query&cmtitle=Category:Overpass_API&list=categorymembers&cmlimit=500&format=json' | jq '.query.categorymembers | .[] | .pageid' |  tr -s "\n" "|"


# Manually set up the pageids, without using categories
wiki_as_base --pageids '35322|253043|104140|100013|156642|96046|141055|101307|72215|98438|89410|250961|133391|242270|85360|97208|181541|90307|150883|98210|254719|137435|99030|163708|241349|305815|74105|104139|162633|170198|160054|150897|106651|180544|92605|78244|187965|187964|105268' --verbose > My-custom-list.jsonld

If the number of explicitly listed pages is greater than the pagination limit (which is 50), then the CLI, similarly to how it deals with wiki pages from large categories, will paginate.

3.6. “Just give me the example files” of a single page on a different wiki than OSM.wiki

Note: for Wikimedia-related websites, the prefix used follows the database naming logic of the dumps at https://dumps.wikimedia.org/backup-index.html, e.g. wikidata.org = wikidatawiki.

This is the same as in 3.2; however, the same content exists both at https://wiki.openstreetmap.org/wiki/User:EmericusPetro/sandbox/Wiki-as-base and at https://www.wikidata.org/wiki/User:EmericusPetro/sandbox/Wiki-as-base.

The idea here is to explain how to target a different wiki. This is done with two environment variables.

wiki_as_base --titles 'User:EmericusPetro/sandbox/Wiki-as-base' > Wiki-as-base_from-osm.jsonld


# If you just want to change environment variables for a single command without affecting the next commands, then prepend them on that single line
WIKI_NS='osmwiki' WIKI_API='https://wiki.openstreetmap.org/w/api.php' wiki_as_base --titles 'User:EmericusPetro/sandbox/Wiki-as-base' > Wiki-as-base_from-osmwiki.jsonld
WIKI_NS='wikidatawiki' WIKI_API='https://www.wikidata.org/w/api.php' wiki_as_base --titles 'User:EmericusPetro/sandbox/Wiki-as-base' > Wiki-as-base_from-wikidatawiki.jsonld



# If your focus is a single wiki, but the default of OpenStreetMap Wiki makes the commands longer, then define them as environment variables
export WIKI_NS='wikidatawiki'
export WIKI_API='https://www.wikidata.org/w/api.php'

wiki_as_base --titles 'User:EmericusPetro/sandbox/Wiki-as-base' > Wiki-as-base_from-wikidatawiki.jsonld

Note the irony: Using Wikidata (wiki) but parsing wikitext of generic Wiki Pages, not Wikibase 🙃! Anyway, could you guess what wiki_as_base --titles 'Item:Q5043|Property:P12' returns on osmwiki?

3.7. Merge content of several pages in different wikis, and the --output-streaming option

Here things start to get interesting, and this might explain why all unique filenames are namespaced by the wiki prefix: you might at some point want to store them in the same folder, and maybe also match the same kind of content across different wikis.

Also, this is when the exported file is not JSON-LD with the individual items inside a top-level “data” key, but JSON text sequences, where each individual item is on its own line. This format allows users to merge files with simpler tools.

#### merging the files at creation time

echo "" > merged-same-file-before.jsonl

# the ">" creates the file and replaces any previous content, if it existed
# the ">>" only appends content at the end of the file, but creates it if it does not exist
WIKI_NS='osmwiki' WIKI_API='https://wiki.openstreetmap.org/w/api.php' wiki_as_base --titles 'User:EmericusPetro/sandbox/Wiki-as-base' --output-streaming >> merged-same-file-before.jsonl
WIKI_NS='wikidatawiki' WIKI_API='https://www.wikidata.org/w/api.php' wiki_as_base --titles 'User:EmericusPetro/sandbox/Wiki-as-base' --output-streaming >> merged-same-file-before.jsonl

#### dumping file by file, but then merge files at the end
mkdir temp-output/
WIKI_NS='osmwiki' WIKI_API='https://wiki.openstreetmap.org/w/api.php' wiki_as_base --titles 'User:EmericusPetro/sandbox/Wiki-as-base' --output-streaming > temp-output/osmwiki-page-1.jsonl
WIKI_NS='wikidatawiki' WIKI_API='https://www.wikidata.org/w/api.php' wiki_as_base --titles 'User:EmericusPetro/sandbox/Wiki-as-base' --output-streaming > temp-output/wikidatawiki-page-1.jsonl
cat temp-output/*.jsonl > merged-cat-after.jsonl

And that was the last practical example. Other MediaWikis (which may or may not be up to date, meaning this tool might not understand their API version) are listed at https://wikiindex.org/.

4. What’s next: feedback (especially from OSM.wiki editors) is welcome in the next months

That’s it. This approach is very niche, so the ones likely to be interested are heavy wiki editors, especially early adopters who could benefit from moving some of their software’s options for contributors into the wiki, and who do not yet have a parsing strategy like Nominatim/Taginfo have.

In the case of OpenStreetMap, since the infoboxes for tags and tag values are very important, I’m especially interested in suggestions on how we could use the wiki page itself to at least give hints about expected values, and maybe further hints on how to normalize them. Currently, the implementation does not have an option to initialize with such extra information (it is still a bit hardcoded), but even if each wiki could have some page where most people agree on how to define the parser, I believe the tool should let the user customize it (so someone could use a customized version from their own user namespace, potentially “proving” how the final result would work). This might explain why I took some time to reply to ChrisMap when asked to “design an example”, but to not delay further, I just posted this diary today with how to at least do the basics of extracting data. I had had a draft for this since January 2023, and asked about some parts of it on the Talk Wiki, but after today this part is somewhat officially released for feedback, either here in the comments, on the Wiki, or in the GitHub issues.

This is my first attempt at the subject of the title, divided into 6 topics. Sorry for the long text (it could be far longer).

Disclaimer: low experience as an OSM mapper!

While I do have prior advanced experience in other areas, as you can see from my account, I’m so new to the project that, as a newbie user of iD left after the tutorial in India, I got scared that if someone touches something, the validators will afterwards assume that person is responsible for errors in that something. In my case it was “Mapbox: Fictional mapping” from OSMCha.

So assume that this text is written by someone who one day ignored iD warnings for something they touched, and who is still not sure how to fix changeset 127073124 😐

Some parts of this post, such as the reference to notability (from this discussion: https://wiki.openstreetmap.org/wiki/Talk:Wiki#Use_Wikibase_to_document_OSM_software) and some hints at unexplored potential which not even the current OpenStreetMap Data Items are realizing (from this discussion: Remove Wikibase extension from all OSM wikis #764), are the reason for the demystifying part of the title.

1. Differences in notability of Wikidata, Wikipedia, and Commons make what is acceptable different in each project

I tried to find how OpenStreetMap defines notability, but the closest I found was this:

For the sake of this post:

What I discovered is that Commons is already used as a suggested place to host, for example, images, in particular those that would go on the OpenStreetMap Wiki.

Wikipedia is likely to be far better known than Wikidata, and (I suppose) people know that Wikipedias tend to be quite strict about what goes there.

And Wikidata? Well, without explaining too much, its notability is more flexible than Wikipedia’s; however (and this is important), it is not as flexible as the notability rule on OpenStreetMap, if we assume there isn’t explicitly one.

In other words: as flexible as Wikidata is, there are things that do exist in the real world (let’s say, an individual tree in someone’s backyard) that are notable enough to be on OpenStreetMap, but not to be on Wikidata. And, unless there is some attachment (something worth putting on Commons, like a 3D file), I would assume uploading low-level micromapping data of some building (creating huge amounts of unique Wikidata Qs) might be considered vandalism there.

1.1 When to use Wikidata?

I think I agree with what others have sometimes said about preferring to keep concepts that are worth being on Wikidata, on Wikidata.

But with this in mind, it is still relevant to have Listeria (which is a bot, not an installable extension) on the OpenStreetMap Wiki. It might not be a short-term priority, but Wikidata already has relevant information related to OpenStreetMap.

2. Differences in how data is structured make it hard for RDF triplestores (like Wikidata) to store less structured content

In an ideal world, I would summarize how an RDF data store works. RDF is quite simple once someone understands the basics, like the sum (+) and subtraction (-) operations of RDF; the problem is that users will often jump not just to multiplication, but to differential equations. SPARQL is more powerful than SQL, and the principles behind Wikidata have existed for over 2 decades. However, most people will use someone else’s example, ready to run.

Without getting into low-level details of data storage, it might be better to just cite as an example that Wikidata recommends storing administrative boundaries as files on Commons. For example, the entry for the country of Brazil (Q155) links to https://commons.wikimedia.org/wiki/Data:Brazil.map. OpenStreetMap doesn’t require Commons for this (because it stores all the information and can still be very efficient); however RDF, even with extensions such as GeoSPARQL, does not provide low-level access to things such as what would be a node in OpenStreetMap (at least the nodes without any extra metadata, which only exist because they are part of something else).

A question against RDF: if the RDF triplestore is so flexible and powerful, why not make it able to store EVERY detail, so it becomes a 1-to-1 copy of OpenStreetMap? Well, it is possible; however, storing such data in an RDF triplestore would take more disk space. Sophox already avoids some types of content.

One way to be able to use SPARQL would, in fact, be an abstraction over another storage, with R2RML and an implementation such as Ontop VKG rewriting SPARQL queries into SQL queries, so that in the worst-case scenario it would at least always be using up-to-date data. But this is not the focus of this post.

In other words: it is overkill to store low-level details in RDF triplestores, even if we could do it if we could afford the hardware. They’re not a replacement for OpenStreetMap.

3. Advantage of RDF triplestores (Wikidata, Wikibase, …): welcoming concepts without a geographic reference

Something where OpenStreetMap cannot compete with Wikidata: relationships between things, and storage of things without a geographic reference. Actually, most, if not all, tools that deal with OpenStreetMap data don’t know how to deal with an abstract concept which cannot be plotted on the map. This is not an exclusive issue, because it happens with most GIS tools. They will break.

In my journey to understand OpenStreetMap with a Wikidata school of thought, after some questions in my local Telegram group about how to map OpenStreetMap back to Wikidata, I received this link:

https://wiki.openstreetmap.org/wiki/Relations_are_not_categories

Truth be told, I loved this explanation! But, without making this post overly long, to make an analogy between Wikidata and OpenStreetMap:

  1. OpenStreetMap can store a reference to something such as the individual buildings of the firefighters’ stations of a province ProvinceAA in a country CountryA.
  2. Wikidata can store the abstract concept that represents the organization that coordinates all firefighting stations in ProvinceAA, and also the fact that this organization is part of the Civil Defense of CountryA. Both concepts might even be notable enough to have dedicated pages on Wikipedia and photos on Commons.

This is where, without off-the-wire agreements or custom protocols, the tools which handle OpenStreetMap data are not designed to handle the concepts that explain things which OpenStreetMap will happily store from its users. Someone can plot a building for an organization, but not the structural notion of what the organization that such a building belongs to actually is.

Truth be told, such uses of Wikidata concepts already exist in the wild. However, this seems very rudimentary, mostly to allow translations and images, such as for the brands used by the Name Suggestion Index in tools such as the iD editor, not what these brands represent. But for everything already tagged with Wikidata Qs or Ps, it is already viable to download this extra meaning.

The discussions about API changes (such as https://wiki.openstreetmap.org/wiki/API_v1.0) are somewhat more low-level. What is today in the database schema (https://wiki.openstreetmap.org/wiki/Rails_port/Database_schema) doesn’t need to change (it’s quite efficient already, and the previous point admitted the limitations of RDF triplestores for low-level details).

In the best-case scenario, this might help understand existing data and enable stronger validation, because it could make it easier to find patterns, without requiring changes to the underlying database; the validation rules would become sort of cross-platform. For simpler things (like knowing if something is acceptable or not) no semantic reasoning is needed; automated rule generation in SHACL (https://en.wikipedia.org/wiki/SHACL) could be used, so if today someone is importing several items, but some of them clash with existing ones, it could be simple for the person to just click “ignore the errors for me” and SHACL would only allow the things that validate.

But this SHACL work could take years. I mean, if some countries wanted to make very strict rules, it could be possible that in those regions these things become enforced.

4. RDF/OWL allow state of the art semantic reasoning (and shared public identifiers from Wikidata are a good thing)

In an ideal world, and with enough time, behind the idea of ontology engineering I would introduce mereology, the idea of universals vs particulars, and the fact that when designing reusable ontologies, best practice is not a mere translation of the words people use, but of underlying concepts that may not even have a formal name, so giving them numbers makes things simpler.

Socrates and Plato in “The School of Athens”, by Raphael

The foundations for mimicking human thinking from rules are far older than RDF.

RDF provides the sums and subtractions; it’s very simple, but an early attempt, RDFS (RDF Schema), was insufficient for developers to implement semantic reasoning. OWL 1, sort of inspired by a DARPA project (DAML, later DAML+OIL), aimed to allow such semantic reasoning; however, computability was accidentally not in scope. This means that, by design, a computation could run forever without this being knowable upfront, so it failed. Then, after all this saga, OWL 2 was designed from the ground up to avoid the mistakes of OWL 1 and stay in the realm of computability (not just be a project to call others’ attention, but actually be implementable by tools). So today a user, without resorting to the command line, can use Protégé and know upfront whether the triplestore has logical errors. However, since semantic reasoning can be computationally expensive, it is often not enabled by default on public endpoints (think: Wikidata and Sophox), but anyone could download all the required data (e.g. instead of an .osm file, some flavor of .rdf file, or convert .osm to RDF after downloading it) and turn the thing on.

Example of inference

For example, when 2 facts are created, <CityAAA "located_in" ProvinceAA> and <ProvinceAA "located_in" CountryA>, the way “located_in” is encoded could state that its inverse is “location_of”, so the reasoner could infer that <CountryA "location_of" CityAAA> is true. At a minimum, even without a semantic reasoner turned on (it is not on Wikidata; this is why the interface warns users to be more explicit), it is possible to validate errors with very primitive rules, but it also means that dumps of OSM data for regions (or worldwide, but with a subset of features), if converted to RDF and loaded in memory with reasoning turned on, allow deducing things very fast.
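
As a toy illustration of that inference in plain JavaScript (no RDF library; the property names follow the example above, and located_in is also treated as transitive):

const facts = [
  ['CityAAA', 'located_in', 'ProvinceAA'],
  ['ProvinceAA', 'located_in', 'CountryA'],
];
const inverseOf = { located_in: 'location_of' };

const inferred = [];
// located_in is transitive: CityAAA located_in CountryA
for (const [s1, p1, o1] of facts) {
  if (p1 !== 'located_in') continue;
  for (const [s2, p2, o2] of facts) {
    if (p2 === 'located_in' && s2 === o1) inferred.push([s1, 'located_in', o2]);
  }
}
// every located_in fact, explicit or inferred, also yields its inverse
for (const [s, p, o] of [...facts, ...inferred]) {
  if (inverseOf[p]) inferred.push([o, inverseOf[p], s]);
}

console.log(inferred);
// [ [ 'CityAAA', 'located_in', 'CountryA' ],
//   [ 'ProvinceAA', 'location_of', 'CityAAA' ],
//   [ 'CountryA', 'location_of', 'ProvinceAA' ],
//   [ 'CountryA', 'location_of', 'CityAAA' ] ]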

This example of “located_in” / “location_of” is simplistic; however, with or without a reasoner turned on, RDF makes data interoperable with other domains even if individual rules are simple. Also, rules can depend on other rules, so there is a viable chain effect. It is possible to teach machines not merely the “part_of” or “subclass_of” most people learn from diagrams used only for business, but cause and effect. And the language used to encode these meanings is already a standard.

One major reason to consider using Wikidata is to have well-defined, uniquely identified abstract concepts notable enough to be there. At a minimum (as it is used today), it helps with having labels in up to 200 languages; however, the tendency would be for both Wikidata contributors and OpenStreetMap contributors working on taxonomy to help each other.

Trivia: tools such as Apache Jena even allow running, via the command line, queries (such as the SPARQL queries you would ask Sophox) against a static dump file locally or a pre-processed file on a remote server.

5. Relevance to Overpass Turbo, Nominatim, and creators of data validators

As explained before, the OpenStreetMap data model doesn’t handle structural concepts that can’t be plotted on a map. With the way the so-called semantic web works, it would be possible to either A) rely fully on Wikidata (even for internal properties; this is what the OpenStreetMap Wikibase does with Data Items, but this is not the discussion today) or B) use it just for things that are notable enough to be there, and interlink them from some RDF triplestore on the OpenStreetMap side.

Such abstract concepts, even if they could be added as tags on the things OpenStreetMap can plot on a map, would take too much space. If someone has a less powerful tool (one that really needs explicit tags; think of some JavaScript rendering library), then semantic reasoners can expand that implicit knowledge on the fly, filling in what is missing, and the tools use this expanded version.

Something such as Overpass turbo doesn’t need to also accept SPARQL as an additional flavor of query (though maybe with Ontop and live data it could, but that is not the discussion here). The advantage of a more well-defined ontology is that overpass turbo can get smarter: a user could search for an abstract concept that represents a group of different tags (and these tags vary per region), and Overpass Turbo could preprocess/rewrite such advanced queries into the lower-level queries it already knows how to run today, without the user needing to care about this.
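
A hand-wavy sketch of the idea (every IRI below is hypothetical): the abstract concept is one class, and the reasoner or query rewriter, not the user, knows which concrete classes and tags fall under it:

```sparql
# Hypothetical: ask for every feature that is some kind of healthcare facility.
# A rewriter could expand ex:HealthcareFacility into the concrete tagging
# patterns (amenity=hospital, amenity=clinic, ...) and then into the low-level
# Overpass QL queries that already work today.
PREFIX ex:   <https://example.org/osm-ontology#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?feature
WHERE {
  ?feature a ?class .
  ?class rdfs:subClassOf* ex:HealthcareFacility .
}
```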

Existing tools can understand the concept of “near me” (physical distance) but they can’t cope with things that are not an obvious tag. Actually, the current version of Nominatim doesn’t seem aware when it is asked for a category (let’s say, “hospital”), so it relies too much on the name of the feature; even though it is trivial to get translations of “hospital” (Q16917, full RDF link: http://www.wikidata.org/wiki/Special:EntityData/Q16917.ttl) from Wikidata, tools such as Nominatim don’t know what hospital means. In this text, I’m arguing that semantic reasoning would let a user asking for a generic category also get back related abstract concepts, such as 911 (or whatever the emergency numbers are for police etc. in your region), in addition to the objects on the map. OpenStreetMap Relations are the closest thing to this (but I think it would be better if such abstractions did not need to be in the same database; the closest to that are the Data Items Qs).
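
The multilingual half of this already works today: the labels of Q16917 can be fetched from the public Wikidata endpoint with a few lines of SPARQL; what is missing is tooling on the OpenStreetMap side that knows what to do with the concept itself. For example:

```sparql
# All labels of "hospital" (Q16917), runnable on query.wikidata.org.
PREFIX wd:   <http://www.wikidata.org/entity/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?lang ?label
WHERE {
  wd:Q16917 rdfs:label ?label .
  BIND(LANG(?label) AS ?lang)
}
ORDER BY ?lang
```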

And what is the advantage for current strategies to validate/review existing data? Well, while the idea of making Nominatim aware of categories in free text is very specific to one use case, the abstract concepts would allow searching things by abstract meaning and (as Overpass already allows) by recursion. A unique opaque identifier (e.g. numeric, not resembling real tags) can by itself carry the meaning (acting as an alias for several tagging patterns, both old and new, and even varying by region of the world), so the questions become simpler.

6. On the OpenStreetMap Data Items (Wikibase extension on OpenStreetMap Wiki) and SPARQL access to data

Like I said at the start, I’m new to OpenStreetMap, and despite knowing other areas, my opinion might evolve after this text is written, in the face of more evidence.

6.1. I support (meaning: willing to help with tooling) the idea of an OWL-like approach to encode the taxonomy, and I consider multilingualism important

I do like the idea of a place to centralize more semantic versions of OpenStreetMap metadata. The Data items do use Wikibase (which is used by Wikidata), so they’re one way to do it. It has fewer user gadgets than Wikidata, but the basics are there.

However, as long as it works, the way to edit the rules could even be editing files by hand. Most ontologies are maintained this way (sometimes with Protégé). However, OpenStreetMap has a massive user base, and the Data Items translations already cover far more languages than the Wiki pages for the same tags.

Even if the rules could be moved into some centralized GitHub repository (like today’s Name Suggestion Index, although there would be fewer pull requests, because it would be mostly the semantic rules), without a user interface like the one Wikibase provides it would be very hard to keep the collaboration that was already happening on the translations.

6.2. I don’t think the criticism of Wikibase Q identifiers, or the complaint about not being able to use full text as identifiers, makes sense

There is some criticism of the Wikibase interface, and those points might even be trivial to deal with. But making persistent identifiers as opaque as possible, to discourage users’ desire to change them in the future, is good practice. This is actually the only one of those criticisms I really disagree with.

DOIs and ARKs have a whole discussion about this. DOIs, for example, despite being designed to persist for something like a century, break mostly because of customized prefixes. So, as much as someone would like a custom prefix, having OSM123 instead of Q124 is unlikely to persist for more than a decade or two.

Also, the idea of allowing fully customizable IDs, such as using addr:street instead of Q123, is even more prone to lead to inconsistencies, either misleading users or breaking systems when users stop liking the older name. So Q123, as ugly as it may seem, is likely to be deprecated only because of serious errors rather than because of the naming choice itself.

Note that I’m not arguing against the addr:street tag; this obviously is a property (and such a property itself needs to be defined). The argument is that structural codes should be as opaque as possible so they only change in the worst cases. If the tag addr:street is (inside OpenStreetMap) notable enough, it can receive a code such as Q123. Then OWL semantics could even deal with deprecation, treat two tags as aliases for each other, etc., because it was designed from the ground up to help with this. That’s the logic behind opaque codes.
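
A sketch of what that could look like (the codes, labels and the osm: prefix are invented for illustration):

```turtle
@prefix osm:  <https://example.org/osm-structural#> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

# Q123 is the opaque, persistent code; the human-readable tag is just a label.
osm:Q123 a owl:DatatypeProperty ;
    rdfs:label "addr:street" .

# A hypothetical older spelling, marked as deprecated and equivalent to Q123,
# so data using either spelling can still be interpreted the same way.
osm:Q456 a owl:DatatypeProperty, owl:DeprecatedProperty ;
    rdfs:label "addr_street" ;
    owl:equivalentProperty osm:Q123 .
```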

If someone doesn’t know what Q123 means, we add contextual information about it on the interfaces.

6.3. Wiki infobox issues

I guess more than one tool already does data mining from OpenStreetMap Wiki infoboxes. Whatever the strategy to synchronize a semantic version of the taxonomy, it is important that it keeps running even when users are not editing there directly. From time to time things may break (like a bot refusing to override a human edit), and then relevant reports of what is failing are needed.

I don’t have a strong opinion on how to do this, just that out-of-sync information is bad.

6.4. Interest in getting realistic opinions from the Name Suggestion Index, Taginfo, Geofabrik (e.g. its data dictionary), and open source initiatives with heavy use of taxonomy

Despite my bias toward “making things semantic”, just to say it here (no need to write it in the comments, just making my view public): I’m genuinely interested in knowing why the Data Items were not used to their full potential. I might not agree, but that doesn’t mean I’m not interested in hearing it.

Wikidata is heavily used by major companies (Google, Facebook, Apple, Microsoft,…) because it is useful, so I’m a bit surprised that OpenStreetMap Data Items are less well known.

If the problem is how to export the data into other formats, I could document such queries. Also, for things which already have public IDs (such as the Geofabrik numeric codes in http://download.geofabrik.de/osm-data-in-gis-formats-free.pdf), it would make sense for the Data Items to carry such properties, similar to how Wikidata allows external identifiers. The more people already make use of it, the more likely it is to be well cared for.

6.5. Strategies to allow running SPARQL against up-to-date data

While I’m mostly interested in having some place that is always up to date with the translations and semantic relationships of taxonomic concepts, at minimum I’m personally interested in having some way to convert data dumps to RDF/OWL. And for clients that already export slices of OpenStreetMap data (such as overpass-turbo), it is feasible to export RDF triples as an additional format. Understanding RDF or SPARQL is hard, but exporting it is far easier.
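
To give an idea of what the export side could look like (a hypothetical mapping; the prefixes are stand-ins, not necessarily the ones a real converter uses), a single node with two tags becomes only a handful of triples:

```turtle
@prefix osmnode: <https://example.org/osm/node/> .
@prefix osmt:    <https://example.org/osm/tag/> .
@prefix geo:     <http://www.opengis.net/ont/geosparql#> .

# One OSM node exported as RDF: one triple per tag, plus one for the geometry.
osmnode:123456789
    osmt:amenity "hospital" ;
    osmt:name    "Hospital Example" ;
    geo:asWKT    "POINT(-51.23 -30.03)"^^geo:wktLiteral .
```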

However, running a full public SPARQL service with data for the entire world (while maybe no worse than what the OpenStreetMap API and overpass-turbo already are) is CPU intensive. But if it becomes relevant enough (for example, for people to find potential errors with more advanced queries), then any public server ideally should have no significant lag. This is something I would personally like to help with. One alternative to R2RML+Ontop could be (after a first global import) to have some strategy to convert the differences from the live services since the last known state, and apply these differences not as SQL but as SPARQL UPDATE / DELETE queries.
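
A sketch of one such change expressed as SPARQL instead of SQL (same made-up prefixes as before), for a node whose amenity tag changed in a minutely diff:

```sparql
# Apply one change from a diff: replace the old amenity value with the new one.
PREFIX osmnode: <https://example.org/osm/node/>
PREFIX osmt:    <https://example.org/osm/tag/>

DELETE { osmnode:123456789 osmt:amenity "clinic" . }
INSERT { osmnode:123456789 osmt:amenity "hospital" . }
WHERE  { osmnode:123456789 osmt:amenity "clinic" . }
```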

I’m open to opinions on how important it is to others to have some public endpoint with a small lag. It might take some time for me to learn more about the OSM software stack, but scripts to synchronize from the main repository data seem a win-win to create and make public for anyone to use.

That’s it for my long post!

This was my original question on the Wiki:

The OpenStreetMap Foundation ("OSMF") has already had discussions and even a committee on takeover mitigation, and this question focuses on that topic. The Humanitarian OpenStreetMap Team United States Inc ("HOTUSI"), which has grown to over 100x the OSMF budget (using 2020 as the year: 26,562,141 USD vs 226,273 GBP), in its board minutes dated 2022-01-24 (archived version here) already admitted interest in a trademark agreement "with clear, irrevocable rights to the name" as an option to "Ensure that the HOT Brand name is not in danger and is formerly in HOT’s hands"; however, this explicitly requires OpenStreetMap Foundation approval at least once in its history. Already before this election, on the new Discourse community, which is publicly known to have received support from HOTUSI, a paid HOTUSI employee closed a discussion about HOTUSI which also asked why the site redesign was still being delayed to the point that it is known it will not happen before the OpenStreetMap Foundation election, even though this had already been asked on OSMF mailing lists, and the incident sparked a discussion on handling conflicts of interest in moderation channels. At this very moment in the history of OpenStreetMap, the majority of candidates in this election have links with HOTUSI, so it is plausible that the result will allow a single corporation to make decisions in its self-interest against the OSMF, an election in which you, hopefully, will win as a candidate. So the question to you is: how will you handle conflicts of interest in the OpenStreetMap Foundation board itself under this challenging context?

Regardless of this, I’m actually very okay with the set of official questions proposed for candidates to answer, since common themes were grouped. And pointing to the Trademark Policy was better than the references I used to contextualize. Fantastic!

New absurd events

Because the last date to send questions was 2022-11-01, sadly I was not able to cite a real-world example (like one more sentence with links in my original question) of how absurd things can get when an organization has so much money that it can simply focus on whoever is willing to be bought, disrupting regional groups without remorse.

As a sort of public response to a complaint by Mario on 2022-11-02T17:48 (that I wasn’t aware of the #communities:latam category), the following happened at 2022-11-05T02:31: a moderator of the forum, despite the use of euphemisms, actually wrote there that the Humanitarian OpenStreetMap Team United States Inc was willing to pay money for projects in the LATAM region to get more support. While I wasn’t expecting much based on what happened in the Philippines, there was no discussion at all of the bigger issue Mario was raising in Spanish in several threads. Just this.

Let me repeat: the response to perceived conflicts of interest by moderators in a subforum of community.openstreetmap.org was one of those moderators going to the same subforum and offering money from the very same organization while in the role of moderator.

“That is not only not right; it is not even wrong!” – Wolfgang Pauli

The original of this post is on Discourse https://community.openstreetmap.org/t/what-happens-if-let-others-keep-sponsoring-against-openstreetmap/4343?u=fititnt .


[Image: the ad (circa 2017); source: https://twitter.com/sp8962/status/838676848301260800]

[Image: finances (2020, around 100x difference without the need to do any core function)]
