Demystifying Wikidata and a standards-compliant semantic approach to tags on OpenStreetMap, to make tooling smarter in the medium to long term

Posted by fititnt on 12 November 2022 in English (English).
This is my first attempt at the subject of the title, divided into 6 topics. Sorry for the long text (it could be far longer).
Disclaimer: low experience as OSM mapper!
While I do have prior advanced experience in other areas, as you can see from my account, I’m so new to the project that, as a newbie iD user left after the tutorial in India, I got scared that if someone touches something, validators will afterwards assume that person is responsible for errors in it. In my case it was “Mapbox: Fictional mapping” from OSMCha.
So assume that this text is written by someone who one day ignored iD warnings for something they touched, and is still not sure how to fix changeset 127073124 😐
Some parts of this post, such as the reference to notability (from this discussion: https://wiki.openstreetmap.org/wiki/Talk:Wiki#Use_Wikibase_to_document_OSM_software) and some hints of unexplored potential which not even the current OpenStreetMap Data Items exploit (from the discussion “Remove Wikibase extension from all OSM wikis #764”), are the reason for the “demystifying” part of the title.
1. Differences in notability of Wikidata, Wikipedia, and Commons make what is acceptable different in each project
I tried to find how OpenStreetMap defines notability, but did not find an explicit definition. For the sake of this post:
- Commons Notability: https://commons.wikimedia.org/wiki/Commons:Notability
- Wikipedia (EN) Notability: https://en.wikipedia.org/wiki/Wikipedia:Notability
- Wikidata Notability: https://www.wikidata.org/wiki/Wikidata:Notability
What I discovered is that Commons is already used as a suggested place to host, for example, images, in particular those that would otherwise go on the OpenStreetMap Wiki.
Wikipedia is likely far better known than Wikidata, and (I suppose) people know that Wikipedias tend to be quite strict about what goes there.
And Wikidata? Well, without explaining too much: its notability is more flexible than Wikipedia’s; however (and this is important), it is not as flexible as the notability rule on OpenStreetMap, if we assume there isn’t explicitly one.
In other words: as flexible as Wikidata is, there are things that exist in the real world (say, an individual tree in someone’s backyard) that are notable enough to be on OpenStreetMap, but not to be on Wikidata. And, unless there is some attachment (something worth putting on Commons, like a 3D file), I would assume that uploading low-level micromapping data of some building (creating huge numbers of unique Wikidata Qs) might be considered vandalism there.
1.1 When to use Wikidata?
I think I agree with what others have sometimes said about preferring to keep concepts that are worth being on Wikidata, on Wikidata.
But with this in mind, it is still relevant to have Listeria (which is a bot, not an installable extension) on the OpenStreetMap Wiki. It might not be a short-term priority, but Wikidata already has relevant information related to OpenStreetMap.
2. Differences in how data is structured make it hard for RDF triplestores (like Wikidata) to store less structured content
In an ideal world, I would summarize how an RDF data store works. RDF is quite simple once someone understands the basics, like the sum (+) and subtraction (-) operations of arithmetic; the problem is that users will often jump not just to multiplication, but to differential equations. SPARQL is more powerful than SQL, and the principles behind Wikidata have existed for over two decades. However, most people will just use someone else’s ready-to-run example.
Without getting into low-level details of data storage, it might be better to just cite as an example that Wikidata recommends storing administrative boundaries as files on Commons. For example, the item for the country of Brazil (Q155) links to https://commons.wikimedia.org/wiki/Data:Brazil.map. OpenStreetMap doesn’t require Commons for this (because it stores all the information itself and is still very efficient); however RDF, even with extensions such as GeoSPARQL, does not provide low-level access to things such as what would be a node in OpenStreetMap (at least the nodes without any extra metadata, which only exist because they are part of something else).
A question against RDF: if the RDF triplestore is so flexible and powerful, why not make it able to store EVERY detail, so it becomes a 1-to-1 copy of OpenStreetMap? Well, it is possible; however, storing such data in an RDF triplestore would take much more disk space. Sophox already avoids some types of content for this reason.
One way to still use SPARQL would, in fact, be an abstraction over another storage backend, using R2RML and an implementation such as Ontop VKG to rewrite SPARQL queries into SQL queries, so in the worst-case scenario it could at least always be using up-to-date data. But this is not the focus of this post.
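The gist of the SPARQL-to-SQL rewriting idea can be sketched very naively. Assuming a hypothetical `tags(element_id, k, v)` table (not the real Rails port schema), a single triple-like pattern becomes a SELECT:

```python
def triple_pattern_to_sql(key, value=None):
    """Naive sketch of the R2RML/Ontop idea: rewrite one
    SPARQL-like pattern '?element :key value' into SQL over a
    hypothetical tags(element_id, k, v) table."""
    sql = "SELECT element_id FROM tags WHERE k = ?"
    params = [key]
    if value is not None:
        sql += " AND v = ?"
        params.append(value)
    return sql, params

sql, params = triple_pattern_to_sql("amenity", "hospital")
print(sql)     # SELECT element_id FROM tags WHERE k = ? AND v = ?
print(params)  # ['amenity', 'hospital']
```

Real implementations do far more than this (joins between several patterns, mapping files, query optimization); this only shows why the underlying SQL database never needs to change.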
In other words: it is overkill to store low-level details in RDF triplestores, even if we could afford the hardware to do it. They’re not a replacement for OpenStreetMap.
3. Advantage of RDF triplestores (Wikidata, Wikibase, …): welcoming concepts without a geographic reference
Something where OpenStreetMap cannot compete with Wikidata: relationships between things, and storage of things without a geographic reference. Actually, most if not all tools that deal with OpenStreetMap data don’t know how to deal with an abstract concept which cannot be plotted on the map: they will break. This is not an exclusive issue, because it happens with most GIS tools.
In my journey to understand OpenStreetMap with a Wikidata school of thought, after some questions in my local Telegram group about how to map OpenStreetMap back to Wikidata, I received this link:
Truth be told, I loved that explanation! But, without making this post overly long, to draw an analogy between Wikidata and OpenStreetMap:
- OpenStreetMap can store references to concrete things, such as the individual buildings of the firefighting stations of a province ProvinceAA in a country CountryA
- Wikidata can store the abstract concept that represents the organization that coordinates all firefighting stations in ProvinceAA, and also the fact that this organization is part of the Civil Defense of CountryA. Both concepts might even be notable enough to have dedicated pages on Wikipedia, and photos on Commons.
This is where, without off-the-wire agreements or custom protocols, the tools which handle OpenStreetMap data are not designed to handle the concepts that explain the things OpenStreetMap will happily store from its users. Someone can plot a building for an organization, but not the structural fact of what that organization is and that the building is part of it.
Truth be told, such uses of Wikidata concepts are already found in the wild. However, this seems very rudimentary: mostly to allow translations and images, such as the brandings used by the Name Suggestion Index in tools like the iD editor, not what these brands represent. But everything already tagged with Wikidata Qs or Ps is already viable for downloading this extra meaning.
3.1 This is NOT related at all to changes in the underlying API
The discussions about API changes (such as https://wiki.openstreetmap.org/wiki/API_v1.0) are at a lower level. What is today in the database schema https://wiki.openstreetmap.org/wiki/Rails_port/Database_schema doesn’t need to change (it’s quite efficient already, and the previous point admitted the limitations of RDF triplestores for low-level details).
In the best-case scenario, this might help understand existing data and enable stronger validations, because it could make it easier to find patterns; it does not require changing the underlying database, and the validation rules become sort of cross-platform. For simpler things (like knowing if something is acceptable or not) no semantic reasoning is needed; automated rule generation in SHACL (https://en.wikipedia.org/wiki/SHACL) could be used, so if today someone is importing several items, but some of them clash with existing ones, it could be as simple as the person clicking “ignore the errors for me”, with SHACL only allowing through the things that validate.

But this SHACL approach could take years. I mean, if some countries wanted to make very strict rules, it is possible that in those regions these things become enforced.
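SHACL itself is a W3C standard expressed in RDF, but the core idea of a shape (“features matching X must have properties Y”) can be sketched in plain Python. The concrete rule below is invented purely for illustration, not a real OSM validation rule:

```python
# A SHACL-like "shape": for features matching a condition,
# certain tags are required. The concrete rule is hypothetical.
shapes = [
    {"when": {"amenity": "hospital"},       # target of the shape
     "require": ["name", "emergency"]},     # minCount-1 style constraints
]

def validate(feature_tags, shapes):
    """Return a list of violation messages; empty means the feature conforms."""
    violations = []
    for shape in shapes:
        if all(feature_tags.get(k) == v for k, v in shape["when"].items()):
            for required in shape["require"]:
                if required not in feature_tags:
                    violations.append(f"missing required tag: {required}")
    return violations

print(validate({"amenity": "hospital", "name": "Hospital AAA"}, shapes))
# ['missing required tag: emergency']
```

The point of doing this in SHACL rather than ad-hoc code is exactly the cross-platform aspect: the shapes are data, so every editor and validator could consume the same rules.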
4. RDF/OWL allows state-of-the-art semantic reasoning (and shared public identifiers from Wikidata are a good thing)
In an ideal world, with enough time, following the idea of ontology engineering, I would introduce mereology, the idea of universals vs. particulars, and the fact that when designing reusable ontologies, best practice is not the mere translation of the words people use, but the underlying concepts, which may not even have a formal name; so giving them numbers makes things simpler.
The foundations for mimicking human thinking from rules is far older than RDF.
RDF provides the sums and subtractions; it’s very simple. But an early attempt, RDFS (RDF Schema), was insufficient for developers to implement semantic reasoning. OWL 1, sort of inspired by a DARPA project (DAML, later DAML+OIL), aimed to allow such semantic reasoning; however, computability was accidentally left out of scope. This means that, by design, a computation could run forever without this being knowable upfront, so it failed. Then, after all this saga, OWL 2 was designed from the ground up to avoid the mistakes of OWL 1 and stay within the realm of computability (not just be a project to attract attention, but actually be implementable by tools). So today a user, without resorting to the command line, can use Protégé and know upfront whether the triplestore has logical errors. However, since semantic reasoning can be computationally expensive, it is often not enabled by default on public endpoints (think: Wikidata and Sophox), but anyone could download all the required data (e.g. instead of an .osm file, some flavor of .rdf file, or convert .osm to RDF after downloading it) and turn the thing on.
Example of inference
For example, when 2 rules are created, <CityAAA "located_in" ProvinceAA> and <ProvinceAA "located_in" CountryA>, the way “located_in” is encoded could say that its inverse is “location_of”, so the reasoner could infer that <CountryA "location_of" CityAAA> is true. At minimum, even without a semantic reasoner turned on (it is not on Wikidata; this is why the interface warns users to be more explicit), it is possible to validate errors with very primitive rules. But it also means that dumps of OSM data for regions (or worldwide, but a subset of features), if converted to RDF and loaded in memory with reasoning turned on, allow deducing things very fast.
This example of “located_in” / “location_of” is simplistic; however, with or without a reasoner turned on, RDF makes data interoperable across domains even if individual rules are simple. Also, rules can depend on other rules, so there is a viable chain effect. It is possible to teach machines not merely the “part_of” or “subclass_of” most people learn from diagrams used only for business, but cause and effect. And the language used to encode these meanings is already a standard.
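The located_in / location_of example can be mimicked with a few lines of forward chaining. The property names follow the example above; a real reasoner derives the same conclusions from OWL axioms such as owl:inverseOf and owl:TransitiveProperty instead of hard-coded Python dictionaries:

```python
triples = {
    ("CityAAA", "located_in", "ProvinceAA"),
    ("ProvinceAA", "located_in", "CountryA"),
}

INVERSE = {"located_in": "location_of"}   # like owl:inverseOf
TRANSITIVE = {"located_in"}               # like owl:TransitiveProperty

def infer(store):
    """Apply inverse and transitivity rules until nothing new is derived."""
    inferred = set(store)
    changed = True
    while changed:
        changed = False
        new = set()
        for s, p, o in inferred:
            if p in INVERSE:
                new.add((o, INVERSE[p], s))
            if p in TRANSITIVE:
                for s2, p2, o2 in inferred:
                    if p2 == p and s2 == o:
                        new.add((s, p, o2))
        if not new <= inferred:
            inferred |= new
            changed = True
    return inferred

facts = infer(triples)
print(("CountryA", "location_of", "CityAAA") in facts)  # True
```

Note that the fact <CountryA "location_of" CityAAA> was never stated: it needs both the transitivity rule (to get CityAAA into CountryA) and the inverse rule, which is exactly the chain effect described above.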
One major reason to consider using Wikidata is to have well-defined, uniquely identified abstract concepts that are notable enough to be there. At minimum (as it is used today) it helps with having labels in up to 200 languages; the tendency, however, would be for Wikidata contributors and OpenStreetMap taxonomy contributors to help each other.
Trivia: tools such as Apache Jena even allow running queries from the command line (such as the SPARQL queries you would ask Sophox) against a static dump file locally, or against a pre-processed file on a remote server.
5. Relevance to Overpass Turbo, Nominatim, and creators of data validators
As explained before, the OpenStreetMap data model doesn’t handle structural concepts that can’t be plotted on a map. Given the way the so-called semantic web works, it would be possible to either A) rely fully on Wikidata (even for internal properties; this is what the OpenStreetMap Wikibase does with the Data Items, but this is not the discussion today) or B) use Wikidata just for things that are notable enough to be there, and interlink from some RDF triplestore on the OpenStreetMap side.
Something such as Overpass Turbo doesn’t need to also allow SPARQL as an additional query flavor (though maybe with Ontop it could, and with live data, but that is not the discussion here); the advantage of a more well-defined ontological definition is that Overpass Turbo could get smarter: a user could search for an abstract concept, which could represent a group of different tags (tags that vary per region), and Overpass Turbo could preprocess/rewrite such advanced queries into the more low-level queries it already knows today, without the user needing to care about this.
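The rewrite step could look roughly like this. The concept identifier and tag mappings below are entirely invented; in practice they would come from the ontology:

```python
# Hypothetical mapping from an abstract concept to the concrete
# tagging patterns it covers, possibly varying by region.
CONCEPT_TO_TAGS = {
    "Q_healthcare_facility": {
        "default": [{"amenity": "hospital"}, {"amenity": "clinic"}],
        "GB":      [{"amenity": "hospital"}, {"amenity": "doctors"}],
    },
}

def expand_concept(concept, region="default"):
    """Rewrite an abstract concept into the low-level tag queries
    an Overpass-style tool already understands."""
    mappings = CONCEPT_TO_TAGS[concept]
    return mappings.get(region, mappings["default"])

print(expand_concept("Q_healthcare_facility"))
# [{'amenity': 'hospital'}, {'amenity': 'clinic'}]
```

The user only ever sees the abstract concept; the regional tagging differences are resolved before the query reaches the existing query engine.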
Existing tools can understand the concept of “near me” (physical distance), but they can’t cope with things that are not an obvious tag. Actually, the current version of Nominatim seems not to be aware when asked about a category (say, “hospital”), so it relies too much on the name of the feature; even though it is trivial to get translations of “hospital” (Q16917, full RDF link: http://www.wikidata.org/wiki/Special:EntityData/Q16917.ttl) from Wikidata, tools such as Nominatim don’t know the meaning of “hospital”. In this text, I’m arguing that semantic reasoning would allow a user asking about a generic category to get back abstract concepts, such as 911 (or whatever the number for police etc. is in your region), in addition to the objects on the map. OpenStreetMap relations are the closest to this (but I think it would be better if such abstracts did not need to be in the same database; the closest to this are the Data Items Qs).
And what is the advantage for current strategies to validate/review existing data? Well, while the idea of making Nominatim aware of text categories is specific to one use case, the abstract concepts would allow searching for things by abstract meaning and (like Overpass already allows) recursion. A unique opaque identifier (e.g. numeric, not resembling real tags) can by itself carry the meaning (like being an alias for several tagging patterns, both old and new, and even varying by region of the world), so the questions become simpler.
6. On the OpenStreetMap Data Items (Wikibase extension on OpenStreetMap Wiki) and SPARQL access to data
Like I said at the start, I’m new to OpenStreetMap, and despite knowing other areas, my opinion might evolve after this text is written, in the face of more evidence.
6.1. I support (meaning: willing to help with tooling) the idea of having an OWL-like approach to encode taxonomy, and consider multilingualism important
I do like the idea of a place to centralize more semantic versions of OpenStreetMap metadata. The Data Items use Wikibase (which is what powers Wikidata), so they’re one way to do it. It has fewer user gadgets than Wikidata, but the basics are there.
However, as long as it works, the way to edit the rules could even be editing files by hand; most ontologies are made this way (sometimes with Protégé). However, OpenStreetMap has a massive user base, and the Data Items already have far more translations than the Wiki pages for the same tags.
Even if the rules could be broken out into some centralized GitHub repository (like today with the Name Suggestion Index, but with fewer pull requests, because it would be mostly the semantic rules), without a user interface like the one Wikibase provides, it would be very hard to allow the collaboration that was already happening on the translations.
6.2. I don’t think criticism against customization of Wikibase Q or complain about not be able to use full text as identifiers makes sense
There’s some criticism about the Wikibase interface and those might even be trivial to deal with. But the idea of persistent identifiers being as opaque as possible, to disencourage users’ desire to change then in the future is a good practice. This actually is the only one I really disagree with.
DOIs and ARKs have a whole discussion on this. DOIs, for example, despite being designed to persist for something like a century, have customized prefixes as the major reason people break systems. So as much as someone would like a custom prefix instead of OSM123, it is unlikely to persist for more than a decade or two.
Also, the idea of allowing fully customizable IDs, such as addr:street, is even more prone to lead to inconsistencies, either misleading users or breaking systems because users didn’t like the older name. So Q123, as ugly as it may seem, is likely to be deprecated only because of serious errors, rather than because of the naming itself.
Note that I’m not arguing against the
addr:street tag, this obviously is a property (and such property itself needs to be defined). *The argument is that structural codes should be as opaque as possible to only change in worst cases. If tag
addr:street is (inside OpenStreetMap) notable enough, it can receive a code such as
Q123. Then OWL semantics could even deal with depreciated, have two tags as aliases for each other etc, because it was designed from the ground to help with this. That’s the logic behind opaque codes.
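How opaque codes plus alias/deprecation semantics could work can be sketched as a simple lookup table. All the codes and mappings here are hypothetical; OWL would express the same idea with axioms and deprecation annotations rather than a Python dictionary:

```python
# Hypothetical registry: opaque codes never change, while the
# human-facing tags they describe can be renamed or deprecated.
REGISTRY = {
    "Q123": {"tag": "addr:street"},
    "Q124": {"tag": "addr:housenumber"},
    "Q900": {"replaced_by": "Q123"},  # an old code kept as an alias
}

def resolve(code):
    """Follow alias chains until a live entry is found."""
    seen = set()
    while "replaced_by" in REGISTRY[code]:
        if code in seen:               # guard against circular aliases
            raise ValueError("circular alias chain")
        seen.add(code)
        code = REGISTRY[code]["replaced_by"]
    return code, REGISTRY[code]["tag"]

print(resolve("Q900"))  # ('Q123', 'addr:street')
```

Old data tagged with the deprecated code still resolves to the current meaning, without anyone having to rewrite the data itself.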
If someone doesn’t know what
Q123 means, we add contextual information about it on the interfaces.
6.3. Wiki infobox issues
I guess more than one tool already does data mining from OpenStreetMap Wiki infoboxes. Whatever the strategy to synchronize a semantic version of the taxonomy, it is important that it keeps running even when users are not editing there directly. From time to time things may break (like a bot refusing to override a human edit), and then relevant reports of what is failing are needed.
I don’t have an opinion on this, just that out-of sync Information is bad.
6.4. Interest in getting realistic opinions from the Name Suggestion Index, Taginfo, Geofabrik (e.g. its data dictionary), and open-source initiatives with heavy use of taxonomy
Despite my bias toward “making things semantic”, just to say it here (no need to write it in the comments; I just want to make my view public): I’m genuinely interested in knowing why the Data Items were not used to their full potential. I might not agree, but that doesn’t mean I’m not interested in hearing it.
Wikidata is heavily used by major companies (Google, Facebook, Apple, Microsoft, …) because it is useful, so I’m a bit surprised that the OpenStreetMap Data Items are less well known.
If the problem is how to export data into other formats, I could document such queries. Also, for things which have public IDs (such as the Geofabrik numeric codes in http://download.geofabrik.de/osm-data-in-gis-formats-free.pdf), similar to how Wikidata allows external identifiers, it would make sense for the Data Items to have such properties. The more people already make use of it, the more likely it is to be well cared for.
6.5 Strategies to allow running SPARQL against up-to-date data
While I’m mostly interested in having some place always in real time with translations and semantic relationships of taxonomic concepts, at minimum I’m personally interested in having some way to convert data dumps to RDF/OWL. But for clients that already export slices from OpenStreetMap data (such as overpass-turbo) it is feasible to export RDF triples as an additional format. Is hard to understand RDF or SPARQL, but it is far easier to export it.
However, running a full public SPARQL service with data for the entire world (while maybe not worse than what the OpenStreetMap API and overpass-turbo already are) is CPU-intensive. But if it becomes relevant enough (for example, for people to find potential errors with more advanced queries), then any public server should ideally have no significant lag. This is something I would personally like to help with. One alternative to R2RML+Ontop could be (after a first global import) to have some strategy to compute the differences from live services since the last state, and then apply these differences not as SQL, but as SPARQL UPDATE / DELETE queries.
I’m open to opinions of how important it is to others to have some public endpoint with small lag. Might take some time to know more about OSM software stack, but scripts to synchronize from main repository data seems a win-win to create and let it public for anyone to use.
That’s it for my long post!