OpenStreetMap

bdon's Diary Comments

OpenStreetMap Isn't Unicode

For anyone still following this thread, I’ve made a proof-of-concept Zawgyi detector over planet.osm.pbf. Here’s a view of potential problem tags with their scores (could be up to 10% of Burmese text in OSM):

https://bdon.github.io/OpenStreetMap-BurmeseEncoding/

The GitHub repository with the program that runs the classifier and generates the CSV file:

https://github.com/bdon/OpenStreetMap-BurmeseEncoding

OpenStreetMap Isn't Unicode

Thanks @bryceco for the investigation into editors! I wonder if the easiest path is to create something like a http://maproulette.org task for Zawgyi-to-Unicode conversion. One caveat: a Burmese speaker has already reached out to let me know that the Zawgyi classifier is trained on, and designed for, long text, and may not give good results for short text like place names. We should therefore look into https://github.com/myanmartools/ng-zawgyi-detector, which is regex-based.

OpenStreetMap Isn't Unicode

I agree that this is out of scope of the API, and don’t think language-specific logic belongs in the Rails port, unless there is already precedent for that.

So it remains an open question whether resolving this class of issue, which does currently break the display of text in the near-orbit OSM ecosystem (iD, OSM Carto), falls under the purview of EWG at all.

OpenStreetMap Isn't Unicode

Thanks, this is really useful background on CGIMap! Given that, it’s worth doing a scan over the dataset to determine if there are any pre-2019 strings that don’t pass that check.

Unfortunately I don’t think the mbsrtowcs function is good enough for Zawgyi, because a Zawgyi string at the bit level is still a valid UTF-8 sequence of bytes, but in a nonsensical arrangement - kind of how “Ybaqba” is a valid string of Latin characters but is the rot13 encoding of “London”. Because Burmese is a shaped language, the consequence is more severe in that placeholder marks appear if the wrong encoding is used. Some language-level analysis (the ML model) is necessary for doing the classification of Zawgyi vs. Myanmar Unicode.
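To make the point above concrete, here is a minimal Python sketch (not the CGIMap check itself) showing that a byte-level UTF-8 validity test accepts both arrangements of the same Myanmar code points. The helper name is_valid_utf8 is hypothetical, and the "visual order" string is only an illustration of a valid-but-rearranged sequence, on the assumption that Zawgyi-style text stores the vowel sign before the consonant.

```python
import codecs

# rot13 keeps every character a valid Latin letter, yet the word is unreadable.
scrambled = codecs.encode("London", "rot13")

# The same two Myanmar-block code points in two arrangements:
# U+1000 (letter KA) and U+1031 (vowel sign E).
logical_order = "\u1000\u1031"  # Unicode stores the vowel sign after the consonant
visual_order = "\u1031\u1000"   # illustrative rearrangement (assumed Zawgyi-style)


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    print(scrambled)
    # Both arrangements pass, so byte-level validation cannot tell them apart.
    print(is_valid_utf8(logical_order.encode("utf-8")))
    print(is_valid_utf8(visual_order.encode("utf-8")))
```

Because both strings pass, any check that only validates the byte sequence would let Zawgyi-ordered text through unchanged.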

OpenStreetMap Isn't Unicode

I think what I said stands: the bits at rest in the OSM dataset, whether that’s the planet XML or PBF file, are not guaranteed to be UTF-8. The tooling around OSM like an editor or renderer usually assumes UTF-8, yes, just as tooling often assumes closed ways with certain tags are areas.

I am merely a new member of EWG and this is a reflection of what I’m personally interested in and what I think Working Groups / OSMF should prioritize in making the project more global. My goal right now is to get a high-level understanding of other places where this class of problem exists. I work with two written languages on a daily basis, so am blissfully unaware of most of the world’s text encoding details. Whether or not it is a top priority for EWG depends on factors such as:

  • how high impact it is - in terms of mapping applications or mappers affected
  • how complex the remedies are - for example, a solution to the Unihan problem I linked above would involve significant changes to the tagging standards

For Zawgyi and editors specifically: one approach would be to first identify whether most Zawgyi comes from specific places (iD on web? mobile editors?) and what simple solutions could be admitted (e.g. a regex detecting the Myanmar code range plus validation against an HTTP endpoint, provided the classifier is accurate).
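A rough sketch of the "regex first, classifier second" idea, in Python. Everything named here is an assumption for illustration: contains_myanmar and check_tag_value are hypothetical helpers, and the classifier is passed in as a plain callable so that it could be a local model or a wrapper around an HTTP endpoint. Only the main Myanmar block (U+1000 to U+109F) is matched.

```python
import re
from typing import Callable, Optional

# Main Unicode Myanmar block; the extended blocks are deliberately left out.
MYANMAR_RE = re.compile(r"[\u1000-\u109F]")


def contains_myanmar(text: str) -> bool:
    """Cheap pre-filter: does the string use any Myanmar-block code point?"""
    return MYANMAR_RE.search(text) is not None


def check_tag_value(
    value: str, classify: Callable[[str], float]
) -> Optional[float]:
    """Return a Zawgyi score for Myanmar-script values, else None.

    `classify` is a placeholder for whatever classifier is trusted,
    for example a function that calls a remote scoring service.
    """
    if not contains_myanmar(value):
        return None  # nothing to validate, skip the expensive call
    return classify(value)


if __name__ == "__main__":
    # Stub classifier standing in for a real model or endpoint.
    def fake_classifier(s: str) -> float:
        return 0.5

    print(check_tag_value("London", fake_classifier))
    print(check_tag_value("\u101b\u1014\u103a", fake_classifier))
```

The pre-filter keeps the expensive call off the vast majority of tag values, which matters if the check runs inside an editor on every keystroke or save.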

Vector Map Bundles

Thanks, I fixed the link.

Deep Dive: natural=coastline

One area that I haven’t attempted to fix is the area around this delta: https://www.openstreetmap.org/way/677293895#map=19/2.54713/-78.40208 which requires tracing new coastlines. It doesn’t appear in the OSM inspector view.

Deep Dive: natural=coastline

@giggls I don’t believe that the Coastline checker tool you link catches this class of error right now, though it may be possible to add an automated check for it by doing point-in-polygon tests.
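As a sketch of what such a point-in-polygon test might look like, here is a standard ray-casting implementation in Python. It is not taken from the Coastline checker; the function name and the flat (x, y) treatment of coordinates are assumptions, and it ignores holes, the antimeridian, and points lying exactly on an edge.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def point_in_polygon(point: Point, ring: List[Point]) -> bool:
    """Ray casting: count edge crossings of a ray going in +x from the point."""
    x, y = point
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]  # wrap around to close the ring
        # Only edges that straddle the horizontal line through the point count.
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


if __name__ == "__main__":
    land = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0)]
    print(point_in_polygon((2.0, 2.0), land))
    print(point_in_polygon((5.0, 2.0), land))
```

An automated check could, for instance, sample points known to be ocean and flag any that fall inside a generated land polygon.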

Deep Dive: natural=coastline

Yes, in this case “land” should be interpreted as non-ocean shapes; most freshwater areas are and should be mapped on top of this “land”. I’ll edit the post above to clarify this.