Demystifying Wikidata and a standards-compliant semantic approach to tags on OpenStreetMap, to make tooling smarter in the medium to long term

Posted by fititnt on 12 November 2022 in English (English).
This is my first attempt at the subject of the title, divided into 6 topics. Sorry for the long text (it could be far longer).
Disclaimer: low experience as OSM mapper!
While I do have prior advanced experience in other areas, as you can see from my account, I’m so new to the project that, as a newbie iD user left after the tutorial in India, I got scared that if someone touches something, validators will afterwards assume that person is responsible for errors in it. In my case it was “Mapbox: Fictional mapping” from OSMCha.
So assume that this text is written by someone who one day ignored iD warnings for something they touched, and is still not sure how to fix changeset 127073124 😐
Some parts of this post, such as the reference to notability (from this discussion: https://wiki.openstreetmap.org/wiki/Talk:Wiki#Use_Wikibase_to_document_OSM_software) and some hints of unexplored potential which not even the current OpenStreetMap Data Items exploit (from the discussion “Remove Wikibase extension from all OSM wikis #764”), are the reason for the “demystifying” part of the title.
1. Differences in notability of Wikidata, Wikipedia, and Commons make what is acceptable different in each project
I tried to find how OpenStreetMap defines notability, but did not find an explicit definition. For the sake of this post:
- Commons Notability: https://commons.wikimedia.org/wiki/Commons:Notability
- Wikipedia (EN) Notability: https://en.wikipedia.org/wiki/Wikipedia:Notability
- Wikidata Notability: https://www.wikidata.org/wiki/Wikidata:Notability
What I discovered is that Commons is already used as a suggested place to host, for example, images, in particular those that would otherwise go on the OpenStreetMap Wiki.
Wikipedia is likely far better known than Wikidata, and (I suppose) people know that Wikipedias tend to be quite strict about what goes there.
And Wikidata? Well, without explaining too much: its notability is more flexible than Wikipedia’s; however (and this is important), it is not as flexible as the notability rule on OpenStreetMap, if we assume there isn’t explicitly one.
In other words: as flexible as Wikidata is, there are things that exist in the real world (say, an individual tree in someone’s backyard) that are notable enough to be on OpenStreetMap, but not to be on Wikidata. And, unless there is some attachment (something worth putting on Commons, like a 3D file), I would assume that uploading low-level micromapping data of some building (creating huge numbers of unique Wikidata Qs) might be considered vandalism there.
1.1 When to use Wikidata?
I think I agree with what others have sometimes said about preferring to keep concepts that are worth being on Wikidata, on Wikidata.
But with this in mind, it is still relevant to have Listeria (which is a bot, not an installable extension) on the OpenStreetMap Wiki. It might not be a short-term priority, but Wikidata already has relevant information related to OpenStreetMap.
2. Differences in how data is structured make it hard for RDF triplestores (like Wikidata) to store less structured content
In an ideal world, I would summarize how an RDF data store works. RDF is quite simple once someone understands the basics, like the sum (+) and subtraction (-) operations of arithmetic; the problem is that users will often jump not just to multiplication, but to differential equations. SPARQL is more powerful than SQL, and the principles behind Wikidata have existed for over two decades. However, most people will just use someone else’s ready-to-run example.
Without getting into low-level details of data storage, it might be better to just cite as an example that Wikidata recommends storing administrative boundaries as files on Commons. For example, the item for the country of Brazil (Q155) links to https://commons.wikimedia.org/wiki/Data:Brazil.map. OpenStreetMap doesn’t require Commons for this (because it stores all the information itself and is still very efficient); however RDF, even with extensions such as GeoSPARQL, does not provide low-level access to things such as what would be a node in OpenStreetMap (at least the nodes without any extra metadata, which only exist because they are part of something else).
A question against RDF: if the RDF triplestore is so flexible and powerful, why not make it able to store EVERY detail, so it becomes a 1-to-1 copy of OpenStreetMap? Well, it is possible; however, storing such data in an RDF triplestore would take much more disk space. Sophox already avoids some types of content for this reason.
One way to still use SPARQL would, in fact, be an abstraction over another storage backend, using R2RML and an implementation such as Ontop VKG to rewrite SPARQL queries into SQL queries, so in the worst-case scenario it could at least always be using up-to-date data. But this is not the focus of this post.
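The gist of the SPARQL-to-SQL rewriting idea can be sketched very naively. Assuming a hypothetical `tags(element_id, k, v)` table (not the real Rails port schema), a single triple-like pattern becomes a SELECT:

```python
def triple_pattern_to_sql(key, value=None):
    """Naive sketch of the R2RML/Ontop idea: rewrite one
    SPARQL-like pattern '?element :key value' into SQL over a
    hypothetical tags(element_id, k, v) table."""
    sql = "SELECT element_id FROM tags WHERE k = ?"
    params = [key]
    if value is not None:
        sql += " AND v = ?"
        params.append(value)
    return sql, params

sql, params = triple_pattern_to_sql("amenity", "hospital")
print(sql)     # SELECT element_id FROM tags WHERE k = ? AND v = ?
print(params)  # ['amenity', 'hospital']
```

Real implementations do far more than this (joins between several patterns, mapping files, query optimization); this only shows why the underlying SQL database never needs to change.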
In other words: it is overkill to store low-level details in RDF triplestores, even if we could afford the hardware to do it. They’re not a replacement for OpenStreetMap.
3. Advantage of RDF triplestores (Wikidata, Wikibase, …): welcoming concepts without a geographic reference
Something where OpenStreetMap cannot compete with Wikidata: relationships between things, and storage of things without a geographic reference. Actually, most if not all tools that deal with OpenStreetMap data don’t know how to deal with an abstract concept which cannot be plotted on the map: they will break. This is not an exclusive issue, because it happens with most GIS tools.
In my journey to understand OpenStreetMap with a Wikidata school of thought, after some questions in my local Telegram group about how to map OpenStreetMap back to Wikidata, I received this link:
Truth be told, I loved that explanation! But, without making this post overly long, to draw an analogy between Wikidata and OpenStreetMap:
- OpenStreetMap can store references to concrete things, such as the individual buildings of the firefighting stations of a province ProvinceAA in a country CountryA
- Wikidata can store the abstract concept that represents the organization that coordinates all firefighting stations in ProvinceAA, and also the fact that this organization is part of the Civil Defense of CountryA. Both concepts might even be notable enough to have dedicated pages on Wikipedia, and photos on Commons.
This is where, without off-the-wire agreements or custom protocols, the tools which handle OpenStreetMap data are not designed to handle the concepts that explain the things OpenStreetMap will happily store from its users. Someone can plot a building for an organization, but not the structural fact of what that organization is and that the building is part of it.
Truth be told, such uses of Wikidata concepts are already found in the wild. However, this seems very rudimentary: mostly to allow translations and images, such as the brandings used by the Name Suggestion Index in tools like the iD editor, not what these brands represent. But everything already tagged with Wikidata Qs or Ps is already viable for downloading this extra meaning.
3.1 This is NOT related at all to changes in the underlying API
The discussions about API changes (such as https://wiki.openstreetmap.org/wiki/API_v1.0) are at a lower level. What is today in the database schema https://wiki.openstreetmap.org/wiki/Rails_port/Database_schema doesn’t need to change (it’s quite efficient already, and the previous point admitted the limitations of RDF triplestores for low-level details).
In the best-case scenario, this might help understand existing data and enable stronger validations, because it could make it easier to find patterns; it does not require changing the underlying database, and the validation rules become sort of cross-platform. For simpler things (like knowing if something is acceptable or not) no semantic reasoning is needed; automated rule generation in SHACL (https://en.wikipedia.org/wiki/SHACL) could be used, so if today someone is importing several items, but some of them clash with existing ones, it could be as simple as the person clicking “ignore the errors for me”, with SHACL only allowing through the things that validate.

But this SHACL approach could take years. I mean, if some countries wanted to make very strict rules, it is possible that in those regions these things become enforced.
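SHACL itself is a W3C standard expressed in RDF, but the core idea of a shape (“features matching X must have properties Y”) can be sketched in plain Python. The concrete rule below is invented purely for illustration, not a real OSM validation rule:

```python
# A SHACL-like "shape": for features matching a condition,
# certain tags are required. The concrete rule is hypothetical.
shapes = [
    {"when": {"amenity": "hospital"},       # target of the shape
     "require": ["name", "emergency"]},     # minCount-1 style constraints
]

def validate(feature_tags, shapes):
    """Return a list of violation messages; empty means the feature conforms."""
    violations = []
    for shape in shapes:
        if all(feature_tags.get(k) == v for k, v in shape["when"].items()):
            for required in shape["require"]:
                if required not in feature_tags:
                    violations.append(f"missing required tag: {required}")
    return violations

print(validate({"amenity": "hospital", "name": "Hospital AAA"}, shapes))
# ['missing required tag: emergency']
```

The point of doing this in SHACL rather than ad-hoc code is exactly the cross-platform aspect: the shapes are data, so every editor and validator could consume the same rules.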
4. RDF/OWL allows state-of-the-art semantic reasoning (and shared public identifiers from Wikidata are a good thing)
In an ideal world, with enough time, following the idea of ontology engineering, I would introduce mereology, the idea of universals vs. particulars, and the fact that when designing reusable ontologies, best practice is not the mere translation of the words people use, but the underlying concepts, which may not even have a formal name; so giving them numbers makes things simpler.
The foundations for mimicking human thinking from rules is far older than RDF.
RDF provides the sums and subtractions; it’s very simple. But an early attempt, RDFS (RDF Schema), was insufficient for developers to implement semantic reasoning. OWL 1, sort of inspired by a DARPA project (DAML, later DAML+OIL), aimed to allow such semantic reasoning; however, computability was accidentally left out of scope. This means that, by design, a computation could run forever without this being knowable upfront, so it failed. Then, after all this saga, OWL 2 was designed from the ground up to avoid the mistakes of OWL 1 and stay within the realm of computability (not just be a project to attract attention, but actually be implementable by tools). So today a user, without resorting to the command line, can use Protégé and know upfront whether the triplestore has logical errors. However, since semantic reasoning can be computationally expensive, it is often not enabled by default on public endpoints (think: Wikidata and Sophox), but anyone could download all the required data (e.g. instead of an .osm file, some flavor of .rdf file, or convert .osm to RDF after downloading it) and turn the thing on.
Example of inference
For example, when 2 rules are created, <CityAAA "located_in" ProvinceAA> and <ProvinceAA "located_in" CountryA>, the way “located_in” is encoded could say that its inverse is “location_of”, so the reasoner could infer that <CountryA "location_of" CityAAA> is true. At minimum, even without a semantic reasoner turned on (it is not on Wikidata; this is why the interface warns users to be more explicit), it is possible to validate errors with very primitive rules. But it also means that dumps of OSM data for regions (or worldwide, but a subset of features), if converted to RDF and loaded in memory with reasoning turned on, allow deducing things very fast.
This example of “located_in” / “location_of” is simplistic; however, with or without a reasoner turned on, RDF makes data interoperable across domains even if individual rules are simple. Also, rules can depend on other rules, so there is a viable chain effect. It is possible to teach machines not merely the “part_of” or “subclass_of” most people learn from diagrams used only for business, but cause and effect. And the language used to encode these meanings is already a standard.
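The located_in / location_of example can be mimicked with a few lines of forward chaining. The property names follow the example above; a real reasoner derives the same conclusions from OWL axioms such as owl:inverseOf and owl:TransitiveProperty instead of hard-coded Python dictionaries:

```python
triples = {
    ("CityAAA", "located_in", "ProvinceAA"),
    ("ProvinceAA", "located_in", "CountryA"),
}

INVERSE = {"located_in": "location_of"}   # like owl:inverseOf
TRANSITIVE = {"located_in"}               # like owl:TransitiveProperty

def infer(store):
    """Apply inverse and transitivity rules until nothing new is derived."""
    inferred = set(store)
    changed = True
    while changed:
        changed = False
        new = set()
        for s, p, o in inferred:
            if p in INVERSE:
                new.add((o, INVERSE[p], s))
            if p in TRANSITIVE:
                for s2, p2, o2 in inferred:
                    if p2 == p and s2 == o:
                        new.add((s, p, o2))
        if not new <= inferred:
            inferred |= new
            changed = True
    return inferred

facts = infer(triples)
print(("CountryA", "location_of", "CityAAA") in facts)  # True
```

Note that the fact <CountryA "location_of" CityAAA> was never stated: it needs both the transitivity rule (to get CityAAA into CountryA) and the inverse rule, which is exactly the chain effect described above.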
One major reason to consider using Wikidata is to have well-defined, uniquely identified abstract concepts that are notable enough to be there. At minimum (as it is used today) it helps with having labels in up to 200 languages; the tendency, however, would be for Wikidata contributors and OpenStreetMap taxonomy contributors to help each other.
Trivia: tools such as Apache Jena even allow running queries from the command line (such as the SPARQL queries you would ask Sophox) against a static dump file locally, or against a pre-processed file on a remote server.
5. Relevance to Overpass Turbo, Nominatim, and creators of data validators
As explained before, the OpenStreetMap data model doesn’t handle structural concepts that can’t be plotted on a map. Given the way the so-called semantic web works, it would be possible to either A) rely fully on Wikidata (even for internal properties; this is what the OpenStreetMap Wikibase does with the Data Items, but this is not the discussion today) or B) use Wikidata just for things that are notable enough to be there, and interlink from some RDF triplestore on the OpenStreetMap side.
Something such as Overpass Turbo doesn’t need to also allow SPARQL as an additional query flavor (though maybe with Ontop it could, and with live data, but that is not the discussion here); the advantage of a more well-defined ontological definition is that Overpass Turbo could get smarter: a user could search for an abstract concept, which could represent a group of different tags (tags that vary per region), and Overpass Turbo could preprocess/rewrite such advanced queries into the more low-level queries it already knows today, without the user needing to care about this.
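The rewrite step could look roughly like this. The concept identifier and tag mappings below are entirely invented; in practice they would come from the ontology:

```python
# Hypothetical mapping from an abstract concept to the concrete
# tagging patterns it covers, possibly varying by region.
CONCEPT_TO_TAGS = {
    "Q_healthcare_facility": {
        "default": [{"amenity": "hospital"}, {"amenity": "clinic"}],
        "GB":      [{"amenity": "hospital"}, {"amenity": "doctors"}],
    },
}

def expand_concept(concept, region="default"):
    """Rewrite an abstract concept into the low-level tag queries
    an Overpass-style tool already understands."""
    mappings = CONCEPT_TO_TAGS[concept]
    return mappings.get(region, mappings["default"])

print(expand_concept("Q_healthcare_facility"))
# [{'amenity': 'hospital'}, {'amenity': 'clinic'}]
```

The user only ever sees the abstract concept; the regional tagging differences are resolved before the query reaches the existing query engine.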
Existing tools can understand the concept of “near me” (physical distance), but they can’t cope with things that are not an obvious tag. Actually, the current version of Nominatim seems not to be aware when asked about a category (say, “hospital”), so it relies too much on the name of the feature; even though it is trivial to get translations of “hospital” (Q16917, full RDF link: http://www.wikidata.org/wiki/Special:EntityData/Q16917.ttl) from Wikidata, tools such as Nominatim don’t know the meaning of “hospital”. In this text, I’m arguing that semantic reasoning would allow a user asking about a generic category to get back abstract concepts, such as 911 (or whatever the number for police etc. is in your region), in addition to the objects on the map. OpenStreetMap relations are the closest to this (but I think it would be better if such abstracts did not need to be in the same database; the closest to this are the Data Items Qs).
And what is the advantage for current strategies to validate/review existing data? Well, while the idea of making Nominatim aware of text categories is specific to one use case, the abstract concepts would allow searching for things by abstract meaning and (like Overpass already allows) recursion. A unique opaque identifier (e.g. numeric, not resembling real tags) can by itself carry the meaning (like being an alias for several tagging patterns, both old and new, and even varying by region of the world), so the questions become simpler.
6. On the OpenStreetMap Data Items (Wikibase extension on OpenStreetMap Wiki) and SPARQL access to data
Like I said at the start, I’m new to OpenStreetMap, and despite knowing other areas, my opinion might evolve after this text is written, in the face of more evidence.
6.1. I support (meaning: willing to help with tooling) the idea of having an OWL-like approach to encode taxonomy, and consider multilingualism important
I do like the idea of a place to centralize more semantic versions of OpenStreetMap metadata. The Data Items use Wikibase (which is what powers Wikidata), so they’re one way to do it. It has fewer user gadgets than Wikidata, but the basics are there.
However, as long as it works, the way to edit the rules could even be editing files by hand; most ontologies are made this way (sometimes with Protégé). However, OpenStreetMap has a massive user base, and the Data Items already have far more translations than the Wiki pages for the same tags.
Even if the rules could be broken out into some centralized GitHub repository (like today with the Name Suggestion Index, but with fewer pull requests, because it would be mostly the semantic rules), without a user interface like the one Wikibase provides, it would be very hard to allow the collaboration that was already happening on the translations.
6.2. I don’t think criticism against customization of Wikibase Q or complain about not be able to use full text as identifiers makes sense
There’s some criticism about the Wikibase interface and those might even be trivial to deal with. But the idea of persistent identifiers being as opaque as possible, to disencourage users’ desire to change then in the future is a good practice. This actually is the only one I really disagree with.
DOIs and ARKs have a whole discussion on this. DOIs, for example, despite being designed to persist for something like a century, have customized prefixes as the major reason people break systems. So as much as someone would like a custom prefix instead of OSM123, it is unlikely to persist for more than a decade or two.
Also, the idea of allowing fully customizable IDs, such as addr:street, is even more prone to lead to inconsistencies, either misleading users or breaking systems because users didn’t like the older name. So Q123, as ugly as it may seem, is likely to be deprecated only because of serious errors, rather than because of the naming itself.
Note that I’m not arguing against the
addr:street tag, this obviously is a property (and such property itself needs to be defined). *The argument is that structural codes should be as opaque as possible to only change in worst cases. If tag
addr:street is (inside OpenStreetMap) notable enough, it can receive a code such as
Q123. Then OWL semantics could even deal with depreciated, have two tags as aliases for each other etc, because it was designed from the ground to help with this. That’s the logic behind opaque codes.
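How opaque codes plus alias/deprecation semantics could work can be sketched as a simple lookup table. All the codes and mappings here are hypothetical; OWL would express the same idea with axioms and deprecation annotations rather than a Python dictionary:

```python
# Hypothetical registry: opaque codes never change, while the
# human-facing tags they describe can be renamed or deprecated.
REGISTRY = {
    "Q123": {"tag": "addr:street"},
    "Q124": {"tag": "addr:housenumber"},
    "Q900": {"replaced_by": "Q123"},  # an old code kept as an alias
}

def resolve(code):
    """Follow alias chains until a live entry is found."""
    seen = set()
    while "replaced_by" in REGISTRY[code]:
        if code in seen:               # guard against circular aliases
            raise ValueError("circular alias chain")
        seen.add(code)
        code = REGISTRY[code]["replaced_by"]
    return code, REGISTRY[code]["tag"]

print(resolve("Q900"))  # ('Q123', 'addr:street')
```

Old data tagged with the deprecated code still resolves to the current meaning, without anyone having to rewrite the data itself.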
If someone doesn’t know what
Q123 means, we add contextual information about it on the interfaces.
6.3. Wiki infobox issues
I guess more than one tool already does data mining from OpenStreetMap Wiki infoboxes. Whatever the strategy to synchronize a semantic version of the taxonomy, it is important that it keeps running even when users are not editing there directly. From time to time things may break (like a bot refusing to override a human edit), and then relevant reports of what is failing are needed.
I don’t have an opinion on this, just that out-of sync Information is bad.
6.4. Interest in getting realistic opinions from the Name Suggestion Index, Taginfo, Geofabrik (e.g. its data dictionary), and open-source initiatives with heavy use of taxonomy
Despite my bias toward “making things semantic”, just to say it here (no need to write it in the comments; I just want to make my view public): I’m genuinely interested in knowing why the Data Items were not used to their full potential. I might not agree, but that doesn’t mean I’m not interested in hearing it.
Wikidata is heavily used by major companies (Google, Facebook, Apple, Microsoft, …) because it is useful, so I’m a bit surprised that the OpenStreetMap Data Items are less well known.
If the problem is how to export data into other formats, I could document such queries. Also, for things which have public IDs (such as the Geofabrik numeric codes in http://download.geofabrik.de/osm-data-in-gis-formats-free.pdf), similar to how Wikidata allows external identifiers, it would make sense for the Data Items to have such properties. The more people already make use of it, the more likely it is to be well cared for.
6.5 Strategies to allow running SPARQL against up-to-date data
While I’m mostly interested in having some place always in real time with translations and semantic relationships of taxonomic concepts, at minimum I’m personally interested in having some way to convert data dumps to RDF/OWL. But for clients that already export slices from OpenStreetMap data (such as overpass-turbo) it is feasible to export RDF triples as an additional format. Is hard to understand RDF or SPARQL, but it is far easier to export it.
However, running a full public SPARQL service with data for the entire world (while maybe not worse than what the OpenStreetMap API and overpass-turbo already are) is CPU-intensive. But if it becomes relevant enough (for example, for people to find potential errors with more advanced queries), then any public server should ideally have no significant lag. This is something I would personally like to help with. One alternative to R2RML+Ontop could be (after a first global import) to have some strategy to compute the differences from live services since the last state, and then apply these differences not as SQL, but as SPARQL UPDATE / DELETE queries.
I’m open to opinions of how important it is to others to have some public endpoint with small lag. Might take some time to know more about OSM software stack, but scripts to synchronize from main repository data seems a win-win to create and let it public for anyone to use.
That’s it for my long post!