Recent diary entries
When working with OSM it’s generally fair to assume that textual data, like tag values, are encoded in UTF-8. Without this assumption, multilingual mapmaking would be almost impossible - custom fonts or browser settings would need to be specified for every language when displaying geocoding results, routing directions or map labels.
As part of the newly resurrected Engineering Working Group, I’m investigating ways to improve OSM’s software ecosystem. One of the top tasks for the EWG is localization, and standardized text encoding is a prerequisite for this, but OSM does not enforce any particular encoding as policy.
Where is the non-Unicode data?
The default Mapnik-based rendering on OpenStreetMap.org, openstreetmap-carto, uses Unicode fonts. Zawgyi-encoded tags appear obviously garbled on the map, with the combining mark ◌ visible:
Myanmar officially adopted Unicode in 2019, but the migration requires both digital services and end user devices to adopt the new standard. OSM still has mixed encodings; this significantly limits its usefulness as a dataset, not only for mappers using Burmese, but also for any global-scale data products, such as geocoders and basemaps, that touch Burmese text.
Zawgyi shares a similar space of code points with Unicode, so detecting Zawgyi-encoded text is not trivial. Google and Facebook have open-sourced an ML-based model for this detection (see Facebook’s path from Zawgyi to Unicode), which estimates the probability that an input string is Zawgyi. I have created a list of all OSM name tags with >90% probability according to this model here:
The Osmium script to generate this list from a PBF extract is on GitHub.
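As a rough illustration of the filtering step (not the Osmium script itself), here is a minimal Python sketch. It assumes the name tags have already been read into a list of strings, and that `score` is any callable returning a Zawgyi probability between 0 and 1. The function names, and the pre-filter on the Unicode Myanmar block (U+1000 to U+109F), are my own assumptions for illustration.

```python
from typing import Callable, Iterable, List, Tuple

# Unicode "Myanmar" block; Zawgyi reuses these same code points.
MYANMAR_BLOCK = range(0x1000, 0x10A0)


def has_myanmar_codepoints(text: str) -> bool:
    """True if any character falls in the Myanmar block."""
    return any(ord(ch) in MYANMAR_BLOCK for ch in text)


def probable_zawgyi(
    names: Iterable[str],
    score: Callable[[str], float],  # hypothetical scorer: P(text is Zawgyi)
    threshold: float = 0.9,
) -> List[Tuple[str, float]]:
    """Return (name, probability) pairs whose score exceeds the threshold."""
    flagged = []
    for name in names:
        # Skip names with no Burmese-range characters: nothing to classify.
        if not has_myanmar_codepoints(name):
            continue
        p = score(name)
        if p > threshold:
            flagged.append((name, p))
    return flagged
```

If the `myanmartools` Python package is installed, `ZawgyiDetector().get_zawgyi_probability` could plausibly be passed as `score`. That wiring is an assumption on my part and has not been checked against the list above.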
- A high-quality conversion of non-Unicode data requires users proficient in the written script, ideally native speakers/readers. If you’re a Burmese reader and are interested in this task, please leave a reply.
- The ML model for Zawgyi detection is trained on longer text. Evaluate if it is sufficient for classifying short place names like in OSM.
- Identify what, if anything, should be done at the editor level to detect encodings. For a mapper with Zawgyi set at a device-level, text encoding conflicts will be invisible.
- Does your language have text encoding problems in OSM? Another, less critical area is the issue of Han Unification (Unihan) characters, but the solutions to this are outside the design of Unicode.
You can download map bundles for up to 500,000 nodes at protomaps.com/bundles.
Self-Contained, Batteries Included
Map Bundles are cartographic basemap tile pyramids built from a minutely-updated snapshot of OpenStreetMap. That means within sixty seconds of uploading your changeset, you can download a new Map Bundle and have a self-hostable or offline-friendly zoomable map with those changes.
Run python3 demo.py after you’ve unzipped the ZIP archive. You can even interact with individual elements on the map to view their tags or OSM identifiers.
Note that this is different from a “minutely-updated” tileset, in which changes from OSM appear without intervention. The goal of Map Bundles is to shorten the feedback loop for creating interactive maps from OSM.
Consider Wikipedia as a point of comparison. An edit made to an article via the Wikimedia editor appears on the Wikipedia site for all users quite quickly. For OSM, mappers use editors like iD or JOSM to upload changes, but a primary way to consume OSM is via slippy maps. OSM Carto can take a long time to reflect changes, especially if caches are being accessed. Other tile providers may take weeks or months to incorporate changes! Map Bundles hopefully enable a Wikipedia-like experience for vector cartography.
Map Bundles are built off the OSM Express database format, and downloaded MBTiles or files remain ODbL-licensed. Be sure to attribute OpenStreetMap contributors in your slippy maps - and a link back to protomaps.com/bundles is also appreciated!
Minutely Extracts (protomaps.com/extracts) is a new service for on-demand OpenStreetMap extracts in .osm.pbf format.
- Data is updated once per minute from the main OpenStreetMap replication diffs. Your changes on OSM.org can be consumed almost immediately, making editing more rewarding.
- Select a bounding box or draw a polygonal region up to 100 million nodes.
- Small areas can be extracted in seconds.
- I am running Minutely Extracts as a public service to help improve the mapping experience.
- Code is available at github.com/protomaps/OSMExpress: this is a command line utility and serialization format specific to planet-scale, spatially indexed OSM data.
- Extracts do not include metadata such as version numbers, changeset IDs, timestamps or usernames.
- Extracts are reference-complete for ways and multipolygon relations.
- Please contact me at email@example.com if you’re interested in creating extracts via API. Example: automating daily extracts to ingest updated OSM data.
- Since OSM Express is a simple serialization format, it can support other use cases such as minute-frequency editing statistics.
I have made several edits around the world related to the OSM coastline. My goal is to enable small-scale derivation of land and ocean polygons without resorting to global preprocessed continent geometries assembled from programs like OSMCoastline.
As a primer, the coastline should be mapped as ways with natural=coastline, with land on the left side and ocean on the right side. This is specified on the OSM wiki: Tag:natural=coastline. “Land” in this instance is defined as the non-ocean parts of the world, not solid ground in general; for example, the Great Lakes are represented not by coastlines but as water body features inside “land”.
There are a few implications to this design:
The ocean should be one polygon in the OGC Simple Features sense: it has one outer ring in the clockwise (CW) direction, and a counter-clockwise (CCW) inner ring for every continent and island. The Caspian Sea is the one exception to this single polygon as stated on the Wiki.
The complement of the global ocean polygon is thousands of land polygons representing continents and islands. Each polygon has a single CCW outer ring and zero inner rings. Again, the one exception is that the Eurasian continent polygon has a CW inner ring representing the Caspian Sea.
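The orientation rule above can be checked mechanically. Below is a minimal sketch, assuming a closed coastline way has already been resolved to a list of (lon, lat) tuples. It uses the planar shoelace formula, which is an approximation of my own choosing: it is reasonable for small rings, but not for rings that cross the antimeridian or enclose a pole. The function names are illustrative.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y), for example (lon, lat)


def signed_area(ring: List[Point]) -> float:
    """Shoelace formula: positive for CCW rings, negative for CW rings."""
    total = 0.0
    # Pair each vertex with the next one, wrapping around at the end.
    for (x1, y1), (x2, y2) in zip(ring, ring[1:] + ring[:1]):
        total += x1 * y2 - x2 * y1
    return total / 2.0


def encloses_land(ring: List[Point]) -> bool:
    """With land on the left of the way direction, a CCW ring encloses land
    (an island or continent) and a CW ring encloses ocean."""
    return signed_area(ring) > 0
```

For example, `encloses_land([(0, 0), (1, 0), (1, 1), (0, 1)])` is true because the square is traced counter-clockwise, while the same ring reversed encloses ocean, as the Caspian Sea ring does.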
In theory, this specification should be enough to infer land and ocean polygons from orientation even within a small geographic extract. I discovered and corrected several dozen places where a violation of this specification arises. The image below is an example of data that needs to be corrected:
These problems are sometimes simply mistagged islands within inland water bodies (Changeset 78960812), but can arise even when mappers are mindful of oceans and way orientation. From edit histories this seems to happen because:
A mapper correctly adds island A based on satellite imagery. Years later, a mapper has access to higher resolution imagery and adds island B (way 677293895). However, B is wholly contained within A: the ocean defined by B’s boundary contradicts the land defined by A’s boundary.
Changeset 78947512: A case where a hydrological feature has been changed from part of the Single Ocean Polygon to a more detailed object, such as a river delta, estuary or bay multipolygon. The outer ring of the Single Ocean Polygon recedes, but now its islands are no longer contained.
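The first case, an island ring wholly contained within another island ring, can be searched for with a point-in-polygon test. The sketch below is one possible approach and not necessarily how I found these cases. It assumes the CCW coastline rings are held in a dict keyed by way ID, and it only tests vertices, so a concave outer ring could produce false positives. The function names are hypothetical.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]  # (x, y), for example (lon, lat)


def point_in_ring(pt: Point, ring: List[Point]) -> bool:
    """Ray casting: count edge crossings of a ray going in the +x direction."""
    x, y = pt
    inside = False
    for (x1, y1), (x2, y2) in zip(ring, ring[1:] + ring[:1]):
        # Only edges that straddle the horizontal line through pt can cross.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def nested_island_pairs(islands: Dict[int, List[Point]]) -> List[Tuple[int, int]]:
    """Return (inner_id, outer_id) pairs where every vertex of the inner
    ring lies inside the outer ring. Quadratic in the number of rings."""
    pairs = []
    for inner_id, inner in islands.items():
        for outer_id, outer in islands.items():
            if inner_id != outer_id and all(point_in_ring(p, outer) for p in inner):
                pairs.append((inner_id, outer_id))
    return pairs
```

Each reported pair is a candidate for manual review: the inner ring implies ocean just outside its boundary, where the outer ring says there is land.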
I have corrected most occurrences of this problem as of December 2019, which should enable more flexible land and ocean mapping from raw OSM extracts! Inevitably these issues will crop up again as the global coastline is better digitized, so I hope this post can help mappers fix them.