
This text is a continuation of my previous diary and does what the title says. The draft already existed 6 months ago, but only today am I publishing it. Anyway, there’s a comment I heard about this on the @Wikimaps Telegram:

“You seem to be doing what we were doing 10+ years ago before Wikidata existed” – Maarten Dammers’ opinion on what this approach is doing

Well, he’s right… but there’s a reason for that. This diary has 4 parts; the examples are in part 3.

1. Preface

This extension could be perceived as one approach to general-purpose data extraction from the OpenStreetMap Wiki, which, in an ABox vs TBox dichotomy, is the closest thing to a TBox for OpenStreetMap (*).

*: if we ignore id-tagging-schema and, obviously, other custom strategies to explain the meaning of OpenStreetMap data, which could include the CartoCSS used to explain how to render it as an image. I do have a rudimentary draft trying to make sense of all these encodings, but it is not ready for today.

1.1 Wikibase is not a consensus even among what would be the ontologists of OpenStreetMap

Tip: for those wanting to view/review some past discussions, check https://wiki.openstreetmap.org/wiki/User:Minh_Nguyen/Wikidata_discussions#Wikidata_link_in_wiki_infoboxes. The same page from Minh Nguyen has other links, such as discussions to remove the entire Wikibase extension from OSM.wiki at https://github.com/openstreetmap/operations/issues/764.

On the surface, it may appear that the partial opposition to Wikibase was because of some minor user interface issues. But reading old discussions (for those not aware, I’m new to OpenStreetMap, having joined in October 2022), this would be insufficient to understand why a stricter, overcentralized approach is rejected. I suspect part of the complaints (which are reflected in very early criticisms, including from the Taginfo developer on the mailing list, even though he himself was criticized years before when he attempted to improve standardization, at least of parsing the wiki, which is very relevant here; I’m seeing a trend: innovators today, conservatives tomorrow) is that attempting to encode a strictly logically consistent TBox in a single storage would not be feasible: some definitions might contradict each other.

One fact is that OpenStreetMap data is used successfully in production, and there’s a significant number of tools focused on its data. Wikidata may be better known as a community-contributed linked data repository; however OpenStreetMap, while its RDF form is less standardized today, is known to be used in production with little to no transformation beyond data repacking. In other words, mass rewrites of OSM data can easily break a lot of applications. Note my focus here: “production use” means the developers who also consume the data are focused on keeping it usable, not wanting to break things unless there’s a valid reason for it. One impact is that proposals wanting to refactor tagging already used on data will likely be refused.

However, similar to how Wikidata has proposals for properties, OpenStreetMap does have a formal process for tagging (in addition to tags simply being “de facto” or “in use”). This alone is proof that, while some might not call themselves ontologists, and defend the idea of Any tags you like, they actually have the role of ontologists. The mere fact that they don’t call themselves this, or don’t use a popular strategy to encode ontologies, e.g. RDF, doesn’t make their criticism invalid, because they may simply not be complaining about the standards (or even about Wikibase itself) but about the idea of how these are used to solve problems on OpenStreetMap.

I’m trying to keep this topic short, but my current hypothesis is that the reason the TBox of OpenStreetMap cannot be fully centralized is that, while developers might have several points in common and would be willing to integrate them in their software (both id-tagging-schema and editor-layer-index are examples of this), they have technical reasons not to agree 100%, so strategies that make partial agreement easier make sense. For example, either some tags can contradict each other (which even for semantic reasoning is a blocker, because a tag cannot simply “be fixed” if it has to stay realistic with existing implementations) or their definition might be too complex for a production implementation.

In this aspect, the current deliverable of this diary might seem a step backwards compared to how Wikibase works, but in addition to trying to further formalize and help with data mining of OSM.wiki infoboxes, it starts the idea of getting even more data from wiki pages. And yes, in the future it could be used by other tools to help synchronize OSM infoboxes with a Wikibase instance such as Data Items again, even if that only means detecting differences so humans could act. Even knowing it is impossible to reach 100%, we could try to work on a baseline which could help others consume not just OpenStreetMap data, the ABox, but also part of its tagging, which is part (but not all) of the TBox; yet on this journey, far before that, it might be necessary to help understand inconsistencies.

1.2 Wikibase is not even the only approach using MediaWiki for structured content

Wikibase, while it powers Wikidata, is not the only extension which can be used with MediaWiki. A good link for a general overview is likely this one: https://www.mediawiki.org/wiki/Manual:Managing_data_in_MediaWiki. These MediaWiki extensions focused on structured data are server side, a centralized approach (which assumes others agree with how to implement it from the start). Since a text field with the wikitext of all pages in the MediaWiki database wouldn’t be queryable, these extensions actually use MediaWiki as permanent, versioned storage, but they take the responsibility of synchronizing such data with some more specialized database engine (or at least use the same database, but with additional tables). Even Wikibase still relies on an external RDF triplestore to allow running SPARQL; its user interface (the one humans edit on sites like Wikidata) is an abstraction to store the data like a page in the MediaWiki (the Wikibase extension actually uses undocumented JSON, not wikitext).
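
As a quick, hedged illustration of that last point (this is plain Wikidata, nothing to do with the tool of this diary): the JSON that Wikibase keeps per item is exposed publicly, so you can see for yourself that it is not wikitext.

# Fetch the stored JSON of item Q42 (Douglas Adams) straight from Wikidata
# and pick one field with jq; the .entities.<id> layout is Wikibase's own format.
curl --silent 'https://www.wikidata.org/wiki/Special:EntityData/Q42.json' | jq '.entities.Q42.labels.en'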

One (to the author’s knowledge) unique feature of the implementation this diary presents to you is the following: it doesn’t require installation on the MediaWiki server. One side effect is that it can also, out of the box, parse data from multiple MediaWiki wikis, and I’m not only talking about mixing OSM.wiki and the OpenStreetMap Foundation wiki; it could extract data from the Wikipedias. You are free to decide which pages in the selected wiki should contain the data, without any specific URL pattern (like prefixes with Qs or Ps), and in this aspect it is more similar to other MediaWiki alternatives to Wikibase.

1.3 Then, what does a decentralized approach, without a particular database, mean?

I’m very sure, especially for ontologists (the ones less aware of the diverse ecosystem of data consumers of OpenStreetMap), that the very idea of not optimizing for a centralized storage would be perceived as an anti-pattern. However, while requiring more work, those interested could still ingest the data into a single database. The command line implementation does not dictate how the data should be consumed, because it has other priorities.

“Make each program do one thing well.” – (part of) The UNIX Philosophy


What all these MediaWiki extensions have in common is parsing wikitext (Wikibase, JSON), and this one does this specific part. For the sake of making it easier for the user (and making wiki admins less likely to have an incentive to block data mining with this tool), it actually caches the data locally in a SQLite database, so it is somewhat friendly for repeated use (maybe even offline/backup use if you set a higher expiration date). But unless you work directly with its default outputs (explained in the next section), if you want a full solution you will still need to choose which storage to save the data in, optimized for your use cases. So, this implementation could help synchronize OSM infoboxes with the OSM Data Items, but its use actually is an abstraction for generic use cases. In the OpenStreetMap world, Taginfo is known to parse the wiki, and Nominatim also uses the wiki to extract some information.
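
If you are curious about the cache itself, it is just a regular SQLite file (named wikiasbase.sqlite in the examples later in this diary), so any SQLite client can peek at it; the table layout is an internal detail of the tool, so treat whatever you see as subject to change.

# List the internal tables of the local cache (names are implementation details)
sqlite3 wikiasbase.sqlite '.tables'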

2. The “data model” (…of exported data handcrafted in the Wiki)

With all the explanation in the preface: the implementation optimizes the result of the data mining for a dump-like, interoperable file format.

I do have some experience generating and documenting groups of files in the Humanitarian eXchange Language standard, so at least for extracted tabular data, if there’s sufficient interest, instead of custom JSON-LD it could be one of these packaging standards, highly optimized for traditional SQL databases, instead of what could be achieved if the data inside this top-level JSON-LD could be directly usable as RDF. But let’s focus for now.

Some technical decisions, at the moment, for the generic approach:

  1. The exported data is JSON where the individual parts of the page are inside a list/array under the top-level “data” key (see the short jq sketch after this list). This is a popular convention in REST APIs; another would be to use a top-level “error” key.
  2. The alternative is JSON-seq (RFC 7464), which makes it friendly to work with continuous streaming or to merge different datasets by… just concatenating the files. This approach could also, in the future, be highly optimized for massive datasets with low memory use.
  3. The fields are documented in JSON-LD and JSON Schema and everything else (from standards to tooling) able to work with these. The working draft is available at https://wtxt.etica.ai/
  4. As one alternative, the implementation also allows materializing the individual items extracted from the pages as files, both with global (unique even if merging different wikis) and optionally customized file names. The output is a zip file with a predictable default directory structure.
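
As a minimal sketch of item 1 (assuming only the top-level “data” key and the per-item “@type”, which are the parts documented by default; any other field is an implementation detail), jq is enough for a first inspection:

# Count the extracted items and list which @types appear on one page
wiki_as_base --titles 'Tag:highway=residential' > highway-residential.jsonld
jq '.data | length' highway-residential.jsonld
jq '[.data[]["@type"]] | unique' highway-residential.jsonld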

One known limitation of the overgeneralization is that only the top level of the JSON-LD and the @types are, by default, strictly documented. Sorry.

Would it be possible to allow customization of the internal parts in the future? “Yes”. However (and this is from someone who has already made CLI tools with a massive number of options), it doesn’t seem a good usability idea to have way too many command line configurations instead of expressing them as some kind of file (which could potentially be extracted from the wikis themselves). But for those thinking about it, let me say upfront that, to fill this gap, MediaWiki templates (aka the infoboxes) and the tabular data could have at least per-wiki profiles for what becomes consensus. And tables, and the subset of syntaxhighlight codes relevant for reuse (or some kinds of templates with SPARQL / OverpassQL which are also example codes), could have additional hidden comments to give hints on how they’re exported, at minimum their suggested file names. To maximize such an approach, every MediaWiki would require some sort of global dictionary (for things which already have global meaning, not varying by context) to give hints on how to convert, for example, {{yes}} to something machine readable like true. Another missing point would be conversion tables which might depend on context (such as “inuse” -> “in use” in OSM infoboxes), so that, as much as possible, the generated templates avoid humans having to rewrite hundreds of pages with misspellings or synonyms, as long as some page on the wiki can centralize these profiles.

3. Practical part

With all this said, let’s go through the examples not already in the README.md and the --help option of the tool.

3.0. Requirements for installation of wiki_as_base cli tool

pip install wiki_as_base==0.5.10
# for the latest version, use
# pip install wiki_as_base --upgrade
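
To confirm the installation worked, the --help option (mentioned earlier) lists every parameter used in the examples below:

wiki_as_base --help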

3.1. JSON-LD version of a single page on OSM.wiki

By default the tool assumes you want to parse the OpenStreetMap Wiki and are OK with a cache of 23 hours (which would be similar to what you would get if parsing the wiki dump).

The following example will download the OSM.wiki page Tag:highway=residential:

wiki_as_base --titles 'Tag:highway=residential'

3.2. “Just give me the code example files” of a single page on OSM.wiki

The parser tries its best to detect what’s in the wikitext without any customization. For example, if the wikitext is using the right syntaxhighlight codes, it tries to use that as a suggestion for which file extension that code would have.
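
As a hedged way to see what the parser has to work with (this is not the tool itself, just the vanilla MediaWiki “raw” view of a page), you can check which syntaxhighlight language hints the sandbox page used in the next example already carries:

# Show the <syntaxhighlight lang="..."> hints present in the raw wikitext;
# the exact matches depend on the current content of that page.
curl --silent 'https://wiki.openstreetmap.org/w/index.php?title=User:EmericusPetro/sandbox/Wiki-as-base&action=raw' | grep --only-matching '<syntaxhighlight lang="[a-zA-Z0-9+-]*"'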

Let’s use this example page, User:EmericusPetro/sandbox/Wiki-as-base. A different parameter will export a zip file instead of JSON-LD.

# JSON-LD output
wiki_as_base --titles 'User:EmericusPetro/sandbox/Wiki-as-base'

# Files (inside a zip)
wiki_as_base --titles 'User:EmericusPetro/sandbox/Wiki-as-base' --output-zip-file sandbox-Wiki-as-base.zip

Wikitext parsing (the one done by this implementation) can benefit from receiving more explicit suggestions of the preferred exported filename. So, for pages written as technical guides, a proxy of this could offer a download link to a tutorial with predictable filenames, while other wiki contributors could still improve it over time.
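
Since the zip layout is meant to be predictable, a quick hedged check of what was exported (filenames and directories are the tool’s defaults, so they may differ between versions) is:

# List the contents of the zip produced in the previous example
unzip -l sandbox-Wiki-as-base.zip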

3.3. Download all parseable information of pages in a small category on OSM.wiki

Let’s say you want to fetch the OSM.wiki Category:OSM_best_practice, not merely one article in it, like Relations_are_not_categories, but all pages with the respective category.

# JSON-LD output
wiki_as_base --input-autodetect 'Category:OSM_best_practice'

# Files (inside a zip)
wiki_as_base --input-autodetect 'Category:OSM_best_practice' --output-zip-file Category:OSM_best_practice.zip

Trivia: this request is done with only 2 background fetches: one to learn the pages of the category and one for the content of all those pages.
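
For the curious, a hedged sketch of the first of those two fetches, using the vanilla MediaWiki API directly (the tool’s exact parameters may differ); it is also roughly what the example in section 3.5 does to collect page ids:

# First background request: which pages belong to the category?
curl --silent 'https://wiki.openstreetmap.org/w/api.php?action=query&list=categorymembers&cmtitle=Category:OSM_best_practice&cmlimit=500&format=json' | jq '.query.categorymembers | length'
# The second request then asks for the wikitext of all those pages at once.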

3.4. Download all parseable information of pages in a well-used category on OSM.wiki

Let’s say you want Category:References. Now the CLI tool will behave differently, as it assumes it can take at maximum 50 pages in one step (the default maximum most MediaWikis allow non-admin/non-bot accounts to request). This means it will paginate, save to the local cache, and ultimately just output the final result.

# The --verbose argument will output more information,
# in this case hints about looping, whether there is a cache, etc.
# It will take 50 seconds plus server delay plus internal time to compute the result
wiki_as_base --input-autodetect 'Category:References' --verbose --output-zip-file Category:References.zip
# (print to stderr)
#    loop... Cached: [False] Expired: [False] delay if not cached [10]
#    loop... Cached: [False] Expired: [False] delay if not cached [10]
#    loop... Cached: [False] Expired: [False] delay if not cached [10]
#    loop... Cached: [False] Expired: [False] delay if not cached [10]
#    loop... Cached: [False] Expired: [False] delay if not cached [10]

# Now, let's run it again. However, since the raw requests are cached for 23 hours,
# it will reuse the cache.
wiki_as_base --input-autodetect 'Category:References' --verbose > Category:References.jsonld
# (print to stderr)
#    loop... Cached: [True] Expired: [False] delay if not cached [10]
#    loop... Cached: [True] Expired: [False] delay if not cached [10]
#    loop... Cached: [True] Expired: [False] delay if not cached [10]
#    loop... Cached: [True] Expired: [False] delay if not cached [10]
#    loop... Cached: [True] Expired: [False] delay if not cached [10]


# Current directory
ls -lh | awk '{print $5, $9}'
# 
#    668K Category:References.jsonld
#    315K Category:References.zip
#    540K wikiasbase.sqlite

3.4.1 Controlling the delay for pagination requests

By default, not only does the tool cache, but the CLI will intentionally apply a delay 10 times longer if you don’t customize the user-agent hint and it detects it must paginate over more background requests. Currently, 10 times means 10 x 1 second (requests are only sequential, not parallel), but if this gets heavier usage it could be increased.

The logic of the CLI delaying non-customized user agents more is to have fewer users leaving the contact information unchanged. So, for wiki admins: if you detect this behavior and the user agent did not customize its contact information, then point complaints to the developer of the tool.

## change the contact information on the next line
# export WIKI_AS_BASE_BOT_CONTACT='https://github.com/fititnt/wiki_as_base-py; generic@example.org'
export WIKI_AS_BASE_BOT_CONTACT='https://wiki.openstreetmap.org/wiki/User:MyUsername; mycontact@gmail.com'

# time will output the real time to finish the command. In this case, 5 x 1 s is artificial delay,
# and roughly 10 s is download time (which is not instantaneous) plus internal computation
time wiki_as_base --input-autodetect 'Category:References' --verbose > Category:References.jsonld
#    loop... Cached: [False] Expired: [False] delay if not cached [1]
#    loop... Cached: [False] Expired: [False] delay if not cached [1]
#    loop... Cached: [False] Expired: [False] delay if not cached [1]
#    loop... Cached: [False] Expired: [False] delay if not cached [1]
#    loop... Cached: [False] Expired: [False] delay if not cached [1]
#
#    real	0m15,170s
#    user	0m1,518s
#    sys	0m0,041s

However, if you do want to identify yourself, but believe the additional 1 second delay between sequential requests is too low (which might be the case for a bot without human supervision), the next example will use 30 seconds.


export WIKI_AS_BASE_BOT_CONTACT='https://wiki.openstreetmap.org/wiki/User:MyUsername; mycontact@gmail.com'
export WIKI_AS_BASE_BOT_CUSTOM_DELAY='30'
time wiki_as_base --input-autodetect 'Category:References' --verbose > Category:References.jsonld
#    loop... Cached: [False] Expired: [False] delay if not cached [30]
#    loop... Cached: [False] Expired: [False] delay if not cached [30]
#    loop... Cached: [False] Expired: [False] delay if not cached [30]
#    loop... Cached: [False] Expired: [False] delay if not cached [30]
#    loop... Cached: [False] Expired: [False] delay if not cached [30]
#
#    real	2m40,390s
#    user	0m1,565s
#    sys	0m0,036s

3.5. Download all parseable information of a known exact list of wiki pages on OSM.wiki

The initial command used to fetch a single page actually accepts multiple ones: just divide them with |.
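
For example, a minimal sketch reusing two pages already cited in this diary:

wiki_as_base --titles 'Tag:highway=residential|User:EmericusPetro/sandbox/Wiki-as-base'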

In the next example we’re already using another parameter, --pageids, i.e. selecting pages by numeric id, not by name.

## Uses curl, tr and jq <https://jqlang.github.io> as one example of how to get the page ids of some pages.
# curl --silent 'https://wiki.openstreetmap.org/w/api.php?action=query&cmtitle=Category:Overpass_API&list=categorymembers&cmlimit=500&format=json' | jq '.query.categorymembers | .[] | .pageid' |  tr -s "\n" "|"


# Manually setup the pageids, without use of categories
wiki_as_base --pageids '35322|253043|104140|100013|156642|96046|141055|101307|72215|98438|89410|250961|133391|242270|85360|97208|181541|90307|150883|98210|254719|137435|99030|163708|241349|305815|74105|104139|162633|170198|160054|150897|106651|180544|92605|78244|187965|187964|105268' --verbose > My-custom-list.jsonld

If the number of explicitly listed pages is greater than the pagination limit (which is 50), then the CLI, similarly to how it deals with wiki pages from large categories, will paginate.

3.6. “Just give me the example files” of a single page on a different wiki than OSM.wiki

Note: for Wikimedia-related websites, the prefix used follows the logic of the database names of the dumps at https://dumps.wikimedia.org/backup-index.html, e.g. wikidata.org = wikidatawiki.

This is the same as example 3.2; however, the same content exists both at https://wiki.openstreetmap.org/wiki/User:EmericusPetro/sandbox/Wiki-as-base and https://www.wikidata.org/wiki/User:EmericusPetro/sandbox/Wiki-as-base.

The idea here is to explain how to target a different wiki. This is done with two environment variables.

wiki_as_base --titles 'User:EmericusPetro/sandbox/Wiki-as-base' > Wiki-as-base_from-osm.jsonld


# If you just want to change the environment variables for a single command, without affecting the next commands, then prepend them on that single line
WIKI_NS='osmwiki' WIKI_API='https://wiki.openstreetmap.org/w/api.php' wiki_as_base --titles 'User:EmericusPetro/sandbox/Wiki-as-base' > Wiki-as-base_from-osmwiki.jsonld
WIKI_NS='wikidatawiki' WIKI_API='https://www.wikidata.org/w/api.php' wiki_as_base --titles 'User:EmericusPetro/sandbox/Wiki-as-base' > Wiki-as-base_from-wikidatawiki.jsonld



# If your focus is a single wiki, but the default being the OpenStreetMap Wiki makes the commands longer, then define them as environment variables
export WIKI_NS='wikidatawiki'
export WIKI_API='https://www.wikidata.org/w/api.php'

wiki_as_base --titles 'User:EmericusPetro/sandbox/Wiki-as-base' > Wiki-as-base_from-wikidatawiki.jsonld

Note the irony: using Wikidata (the wiki), but parsing wikitext of generic wiki pages, not Wikibase 🙃! Anyway, could you guess what wiki_as_base --titles 'Item:Q5043|Property:P12' returns on osmwiki?

3.7. Merge content of several pages in different Wikis and the --output-streaming

Here things start to get interesting, and this might explain why all unique filenames are namespaced by the wiki prefix: you might at some point want to store them in the same folder, or maybe also match the same kind of content across different wikis.

Also, this time the exported file is not JSON-LD with the individual items inside the data key at the top level of the object, but JSON text sequences, where each individual item is on its own line. This format allows the user to merge the files with simpler tools.

#### merging the files at creation time

echo "" > merged-same-file-before.jsonl

# the ">" create file and replacey any previous content, if existed
# the ">>" only append content at the end of file, but create if not exist
WIKI_NS='osmwiki' WIKI_API='https://wiki.openstreetmap.org/w/api.php' wiki_as_base --titles 'User:EmericusPetro/sandbox/Wiki-as-base' --output-streaming >> merged-same-file-before.jsonl
WIKI_NS='wikidatawiki' WIKI_API='https://www.wikidata.org/w/api.php' wiki_as_base --titles 'User:EmericusPetro/sandbox/Wiki-as-base' --output-streaming >> merged-same-file-before.jsonl

#### dumping file by file, but then merging the files at the end
mkdir temp-output/
WIKI_NS='osmwiki' WIKI_API='https://wiki.openstreetmap.org/w/api.php' wiki_as_base --titles 'User:EmericusPetro/sandbox/Wiki-as-base' --output-streaming > temp-output/osmwiki-page-1.jsonl
WIKI_NS='wikidatawiki' WIKI_API='https://www.wikidata.org/w/api.php' wiki_as_base --titles 'User:EmericusPetro/sandbox/Wiki-as-base' --output-streaming > temp-output/wikidatawiki-page-1.jsonl
cat temp-output/*.jsonl > merged-cat-after.jsonl
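
A hedged sanity check of the merged result (assuming the one-JSON-object-per-line layout described above; if the tool emits strict RFC 7464 record separators instead, jq would need its --seq flag):

# How many items ended up in the merged file, and which @types they have
wc -l merged-cat-after.jsonl
jq -c '."@type"?' merged-cat-after.jsonl | sort | uniq -c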

And that was the last practical example. A list of other MediaWiki sites (which may or may not be up to date, meaning this tool might not understand their API version) is available at https://wikiindex.org/.

4. What’s next: feedback (especially from OSM.wiki editors) is welcome in the next months

That’s it. This approach is very niche, so likely the ones who may be interested are heavy wiki editors, especially early adopters who could benefit from moving some of their software options so that contributors use the wiki to make changes, and who do not yet have a parsing strategy like Nominatim/Taginfo have.

In the case of OpenStreetMap, since the infoboxes for tags and tag values are very important, I’m especially interested in suggestions on how we could use the wiki page itself to at least give hints on expected values, and maybe further hints on how to normalize the values. Currently, the implementation does not have an option to initialize with such extra information (it is still a bit hardcoded), but even if each wiki could have some page where most people would agree on how to define the parser, I believe the tool should allow the user to customize it (so someone could use a customized version from their own user namespace, potentially “proving” how the final result would work). This might explain why I took some time to reply to ChrisMap when asked to “design an example”, but to not delay further I just posted this diary today with how to at least do the basics of extracting data. I have had a draft for this since January 2023, and asked about some parts of it on the wiki Talk pages, but after today this part is somewhat officially released for feedback, either here in the comments, on the wiki or in the GitHub issues.

Discussion

Comment from UndueMarmot on 4 August 2023 at 12:09

Huh. This actually explains a lot about Wikibase tbh.

Wikibase, while it powers Wikidata, is not the only extension which can be used with MediaWiki. [..] Since a text field with the wikitext of all pages in the MediaWiki database wouldn’t be queryable, these extensions actually use MediaWiki as permanent, versioned storage, but they take the responsibility of synchronizing such data with some more specialized database engine (or at least use the same database, but with additional tables). Even Wikibase still relies on an external RDF triplestore to allow running SPARQL; its user interface (the one humans edit on sites like Wikidata) is an abstraction to store the data like a page in the MediaWiki (the Wikibase extension actually uses undocumented JSON, not wikitext). ^(new emphasis mine)

One of the mysteries to me was “how does this thing work in the first place??”, in the sense that you edit them with a UI that looks like a MediaWiki page, is rendered similarly to a MediaWiki page, and with an editor that sort of..? looks like a glorified VisualEditor, but doesn’t function as one?

But it isn’t a duck.

It’s JSON, which explains just how disconnected it actually is to the MediaWiki experience. That’s why it feels so foreign and disorienting, and functions like the completely tacked-onto experience it provides.

Comment from fititnt on 6 August 2023 at 02:14

Humm, interesting that the first comment is about how Wikibase abstracts the data! And yes, I found it relevant to explain this internal part, because it really depends on external storage to add the “true” linked data storage.

I mean, when Wikibase stores an item’s data as JSON on a single page, these pages are in a big text field, so by default the SQL database cannot really understand their internals. And I mean, it is not even MongoDB, where there’s native support for JSON fields.

Another bit of trivia is that when trying to do data mining using the MediaWiki Wikibase API, it’s likely to be item by item (maybe it allows pre-fetching related items, so still very useful); however, if you somewhat brute force the wikitext (which will be JSON) with the vanilla MediaWiki API, then even without a special user account (admin or bot) it is possible to fetch 50 pages at once. I know this may sound a bit low level, but it matters if we’re talking about synchronizing content, as long as the content stored on the MediaWiki can be exported without the need to always work with the full wiki dumps available here: https://wiki.openstreetmap.org/dump/.
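
For example, a hedged sketch of that brute-force fetch with the vanilla MediaWiki API (standard query/revisions parameters, nothing specific to this tool; here only 2 titles, but up to 50 can be joined the same way):

# Fetch the raw content of two pages in a single request
curl --silent 'https://wiki.openstreetmap.org/w/api.php?action=query&prop=revisions&rvprop=content&rvslots=main&format=json&titles=Tag:highway%3Dresidential%7CKey:highway' | jq '.query.pages | keys'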

About your comment comparing it with the VisualEditor, I guess the Wikibase interface is more of a form-like interface (it enforces some structure; not a very advanced integrity check, but it does some checks). I have not fully tested the alternatives, but I’m sure there are other MediaWiki extensions which could enforce a form-like entry, to restrict what users can do. So, the analogy with the VisualEditor is not perfect, because the VisualEditor from the link you passed still allows more freedom for the user, with higher challenges to parse (compared to any form-like interface).

Maybe a closer analogy than the MediaWiki VisualEditor: the Wikibase editing page is similar to how the iD editor allows users to edit an already well-detailed item (depending on the tag, the field changes appearance, suggests different values, etc.).

(…) It’s JSON, which explains just how disconnected it actually is to the MediaWiki experience. That’s why it feels so foreign and disorienting, and functions like the completely tacked-onto experience it provides.

I think that, from the perspective of a “MediaWiki experience”, even when trying not to break the mental flow of editing as text (while still being fully machine readable), at least some types of trade-offs are necessary. Wikibase (and any other MediaWiki with a form-like UI) explicitly enforces (sometimes too much, or sometimes not allowing a fully strict validation; I know, both ideals are contradictory) how to add/edit data, but even if we parse wikitext directly, the parser could still benefit from hints (such as the suggested filename of a code sample) which might not be worth showing to the user who only cares about the visual text, not the metadata.

This part is briefly commented on in the diary, but both conversion tables (e.g. {{yes}} => true) and explanations of what the parameters of the most important infoboxes mean may not be on the same page (that would also be too redundant), but would still be in some place (preferably on the same wiki). And where these instructions cannot be expressed in any syntax, only natural language remains.

But it isn’t a duck.

Yes, I also liked the analogy! But again, “Wikidata” is the project (formal explanation: https://www.wikidata.org/wiki/Wikidata:Introduction), while “Wikibase” is an extension for MediaWiki (see also: https://wikiba.se/). So Wikidata as a project actually is full linked data. Another interesting fact is that Wikibase (even without a triplestore) is still somewhat linked data, because it does expose persistent URLs and is still fast. So the self-description on wikiba.se, “Wikibase is open-source software for creating collaborative knowledge bases, opening the door to the Linked Open Data web.”, is still very true.

The diary could get more complex, but in theory a future proxy for each page on OSM.wiki could still be somewhat linked data as soon as the person requests some format like RDF/Turtle. The same principle could apply to the main API, which today returns XML, but a pull request started in 2019 by the Overpass main developer added JSON output (link: https://github.com/openstreetmap/openstreetmap-website/pull/2485), so in theory even the main API, the rails-port, could also be explicitly “linked data”. I started an early draft for that in 2022 here https://wiki.openstreetmap.org/wiki/User:EmericusPetro/sandbox/RFC_OpenStreetMap_RDF_2023_conventions_(proof_of_concept) which does a very rudimentary conversion from the XML to RDF/Turtle as a proxy (if the person requests XML or JSON, it does nothing, just outputs the true output of the de facto API). That was the very easy part; the true challenge (beyond the slow process of agreeing on a schema good enough) would be to start building the endpoints for every tag, so that if a tool tries to fetch PREFIX osmt: <https://wiki.openstreetmap.org/wiki/Key:>, it would work.

Comment from Minh Nguyen on 6 August 2023 at 16:47

There’s a lot to unpack here, but just for awareness:

It’s JSON, which explains just how disconnected it actually is to the MediaWiki experience. That’s why it feels so foreign and disorienting, and functions like the completely tacked-onto experience it provides.

Wikitext is only one of the page content models that MediaWiki supports. For example, the Module: namespace is in Lua, and every user can personalize their wiki experience via personal subpages in JavaScript and CSS. Pages in the template namespace can also be in JSON, irrespective of Wikibase. Though this isn’t currently enabled on the OSM Wiki, we did consider it for event listings and such until OSMCal came along.

For all its warts, I appreciate the fact that Wikibase is intended for structured data. We can of course make wikitext look like structured data by convention and build custom tooling around it, but ultimately that results in a different kind of subpar experience for anyone who attempts to edit the wiki: you can write a wiki page using simple wikitext syntax as long as you avoid breaking several lightly documented tools that place arbitrary constraints on exactly how you write (e.g., whitespace and capitalization) it due to assumptions they make. Writing for the renderer, in other words.

I appreciate your efforts at data mining the OSM Wiki, to the extent that you find the output useful. I also appreciate your emphasis on reusing existing content without creating extra maintenance overhead. However, we should view this kind of tooling as being complementary to structured data, not in competition with it.

Comment from fititnt on 7 August 2023 at 01:52

Wikitext is only one of the page content models that MediaWiki supports.

Good to know about the other content models! Maybe I’ll also create some syntax sugar (e.g. instead of a raw string, return something else).

But for data-like content beyond wikitext (in special the tabular data), the Wikibase JSON could be abstracted to return at least the labels, which could later be used, for example, for translations. Note that it is complex to convert from/to an RDF-like dataset and other datasets, but the translation part of items might be so common that it could be worth an abstraction.

(…) you can write a wiki page using simple wikitext syntax as long as you avoid breaking several lightly documented tools that place arbitrary constraints on exactly how you write (e.g., whitespace and capitalization) it due to assumptions they make. Writing for the renderer, in other words. ^(new emphasis mine)

Yes, the challenging part of parsing wikitext is exactly this. This is one of the reasons (at least when using the tool to extract data) it is more forgiving with whoever writes, and strict about the output it generates.

(…) I also appreciate your emphasis on reusing existing content without creating extra maintenance overhead. However, we should view this kind of tooling as being complementary to structured data, not in competition with it.

I agree with the complementary view. In fact, the implementation cited here is intentionally lower level, without the database part (it does have an SQLite file, but only to cache requests). The fun part is done outside.

On reusing existing content without creating extra maintenance overhead: this really is a focus. While the tool is not self-sufficient as a full solution, keeping it focused on the parsing (and allowing it to be reusable with other wikis) could encourage some extra conventions where wikitext alone is insufficient.
