When working with OSM it’s generally fair to assume that textual data, like tag values, are encoded in UTF-8. Without this assumption, multilingual mapmaking would be almost impossible - custom fonts or browser settings would need to be specified for every language when displaying geocoding results, routing directions or map labels.
As part of the newly resurrected Engineering Working Group, I’m investigating ways to improve OSM’s software ecosystem. Localization is one of the EWG’s top tasks, and standardized text encoding is a prerequisite for it - yet OSM does not enforce any particular encoding as policy.
Where is the non-Unicode data?
The most obvious instance of non-Unicode in OSM is the Zawgyi encoding for Burmese text. For background on Zawgyi, see this post on the civil war between fonts in Myanmar.
The default Mapnik-based rendering on OpenStreetMap.org, openstreetmap-carto, uses Unicode fonts. Zawgyi-encoded tags appear obviously garbled on the map, with the dotted-circle mark ◌ (displayed when a combining mark lacks a valid base character) visible:
Myanmar officially adopted Unicode in 2019, but the migration requires both digital services and end user devices to adopt the new standard. OSM still has mixed encodings; this significantly limits its usefulness as a dataset, not only for mappers using Burmese, but for any global-scale data product - such as geocoders and basemaps - that touches Burmese text.
Zawgyi reuses the same Myanmar code point range as Unicode, so detecting Zawgyi-encoded text is not trivial. Google and Facebook have open sourced an ML-based model for this detection: see Facebook’s path from Zawgyi to Unicode - which estimates the probability that an input string is Zawgyi. I have created a list of all OSM name tags with >90% probability according to this model here:
The Osmium script to generate this list from a PBF extract is on GitHub.
Next Steps
- A high-quality conversion of non-Unicode data requires users proficient in the written script, ideally native speakers/readers. If you’re a Burmese reader and are interested in this task, please leave a reply.
- The ML model for Zawgyi detection is trained on longer text. Evaluate whether it is accurate enough for classifying short strings such as the place names in OSM.
- Identify what, if anything, should be done at the editor level to detect encodings. For a mapper whose device is set to Zawgyi system-wide, text encoding conflicts will be invisible.
- Does your language have text encoding problems in OSM? Another, less critical area is the issue of Han Unification (Unihan) characters, but solutions to that lie outside the design of Unicode.