
Finding Vandals and Language Hotspots with Unicode

Posted by mapmeld on 23 March 2019 in English (English)

I’ve long wanted to see a true map of the world’s languages. We know where languages are supposed to be spoken, but where are the real borders, where are the little enclaves? Recently I finally got the server space to download the global OSM data and look for myself!

For this project, I take the primary ‘name’ tag of each node and, using Jan Lelis’s unicode-blocks Ruby gem, determine which Unicode blocks its characters fall into. This blog post is written in the blocks most familiar to English speakers (“Basic Latin”, “Latin-1 Supplement”, and “General Punctuation”), which are common enough that I’ve filtered them out of this map.
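
As a rough illustration of that classification step (in Python, and not the actual script behind the map), the sketch below looks up each character of a name in a table of block ranges and drops the filtered blocks. The `BLOCKS` table is only a small hand-copied subset of the Unicode block list, and `block_of`, `blocks_in_name`, and `IGNORED` are hypothetical names chosen for this example.

```python
# Small, hand-copied subset of Unicode block ranges: (start, end, name).
# A real run would need the full block list; this is an assumption for brevity.
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0080, 0x00FF, "Latin-1 Supplement"),
    (0x0180, 0x024F, "Latin Extended-B"),
    (0x0250, 0x02AF, "IPA Extensions"),
    (0x0400, 0x04FF, "Cyrillic"),
    (0x0600, 0x06FF, "Arabic"),
    (0x07C0, 0x07FF, "NKo"),
    (0x0900, 0x097F, "Devanagari"),
    (0x1400, 0x167F, "Unified Canadian Aboriginal Syllabics"),
    (0x1E00, 0x1EFF, "Latin Extended Additional"),
    (0x2000, 0x206F, "General Punctuation"),
    (0x2D30, 0x2D7F, "Tifinagh"),
]

# Blocks common enough to be filtered out of the map.
IGNORED = {"Basic Latin", "Latin-1 Supplement", "General Punctuation"}


def block_of(char):
    """Return the block name for one character, or "Other" if the
    character is outside the subset listed above."""
    code_point = ord(char)
    for start, end, name in BLOCKS:
        if start <= code_point <= end:
            return name
    return "Other"


def blocks_in_name(name):
    """Return the set of non-filtered blocks used by a name tag."""
    return {block_of(char) for char in name} - IGNORED


print(blocks_in_name("Ulu\u1E5Fu"))  # {'Latin Extended Additional'}
print(blocks_in_name("Paris"))       # set()
```

A name written entirely in the filtered blocks yields an empty set, so it would not produce a marker on the map.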

Explore

You can view the map at mapmeld.com/osm-unicode-coverage/; the data and source code are on GitHub. I think Europe and Asia, being the largest files, may have been undercounted or only partially read by my script, but the map still shows all of the expected language coverage.

Local Language Hotspots

Tifinagh (ⵜⵉⴼⵉⵏⴰⵖ) is used alongside Arabic across North Africa. In the past five years, OpenStreetMap users started labeling all Moroccan cities in Latin, Arabic, and Tifinagh script. You can see a handful of other locations in Algeria and Libya.

Morocco and Algeria

The letters in “Latin Extended-B” and “IPA Extensions” are common in a small section of the Guyana-Brazil border. This seems to overlap with a local language known as Wayampi. There is another cluster in the Tizi Ouzou region of Algeria. That doesn’t mean that these languages are related at all; it only means that when new sounds and symbols were added to the Unicode standard, the letters for both were included in the same batch of updates.

rainforest collection

Similarly, “Latin Extended-D” appears only on Easter Island. There is a small cluster of N’Ko-labeled cities and rivers in Guinea.

Canadian Aboriginal Syllabics are used over a wide area, but don’t appear to ‘cluster’ as much. I’d like to see some initiatives to get this script used more often.

Canada

Outliers

Seeing a solitary marker for the “Latin Extended Additional” block, we find that Australia’s Uluṟu includes the letter ṟ.

Uluru map

I found India’s Antarctic base because it’s labeled in Devanagari script.

Antarctica

A Greek hostel in Vila Velha, Brazil? A Chinese bank in the Bahamas? There were other unusual outliers which I couldn’t fully identify or verify on Google Street View.

Vandals

A “Canadian Aboriginal Syllabics” point in Colombia caught my attention. This user stylized the shop name as ᗰI ᑕᗩᔕᗩ, but Mapnik had some issues rendering it.

Mi Casa

A seemingly harmless extra bus stop, named in Lao script in a neighborhood outside of Adelaide, Australia, got scraped into the content generator OpeningHoursAU.com.

bus stop

I found attempts to add ‘flag emojis’, and a handful of points in Tenerife labeled in Glagolitic (an old Slavic script, historically used in Croatia, that is no longer in everyday use). The reason for this type of vandalism is unclear, but these labels should be removed.
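
One way to surface candidates like these for manual review is to scan names for characters that would be surprising in a place name. The sketch below is an illustration under stated assumptions, not the method used for this map: `is_regional_indicator`, `has_flag_emoji`, and `has_glagolitic` are hypothetical helpers, and the sample names are made up. It relies on two ranges: the Regional Indicator Symbols (U+1F1E6 to U+1F1FF), which pair up to form flag emojis, and the Glagolitic block (U+2C00 to U+2C5F).

```python
def is_regional_indicator(char):
    """True for the Regional Indicator Symbols (U+1F1E6..U+1F1FF)."""
    return 0x1F1E6 <= ord(char) <= 0x1F1FF


def has_flag_emoji(name):
    """True if two Regional Indicator Symbols sit side by side,
    which is how flag emojis are encoded."""
    return any(
        is_regional_indicator(first) and is_regional_indicator(second)
        for first, second in zip(name, name[1:])
    )


def has_glagolitic(name):
    """True if any character falls in the Glagolitic block."""
    return any(0x2C00 <= ord(char) <= 0x2C5F for char in name)


# Made-up names for illustration only; not taken from OSM data.
names = ["Caf\u00e9 Central", "\U0001F1E8\U0001F1E6 Shop", "\u2C00\u2C01"]
flagged = [n for n in names if has_flag_emoji(n) or has_glagolitic(n)]
```

A match is only a hint: a human still has to decide whether a flagged name is vandalism or a legitimate local label.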

Caveats

  • OpenStreetMap data is © OpenStreetMap and contributors, and was downloaded from https://download.geofabrik.de/
  • By grouping names by Unicode block, I miss the distinction between languages that share a script, such as Russian, Ukrainian, Abkhazian, and Mongolian, which use different parts of the Cyrillic alphabet.
  • By using node names, I missed names used on ways, particularly roads, rivers, and buildings where I’ve seen local languages used in the past.
  • I haven’t checked whether Guyana and Easter Island have the ‘correct’ Latin extended letters for their names; I’m just noting a common pattern.
  • Because I checked only the primary name=__ tag, and not the alternate names (name:en=, name:es=), I’m leaving out a lot of multilingual names. This was OK with me because cities often have dozens of alternate Wikidata names, and the main name tag is the primary, most-seen one.
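
To illustrate the last caveat, here is a hedged sketch of how the language-specific name tags could be gathered alongside the primary one. The `tags` dictionary is made-up sample data, and `localized_names` is a hypothetical helper, not part of this project’s code.

```python
def localized_names(tags):
    """Map language code -> name for every `name:<code>` tag on an element."""
    return {
        key.split(":", 1)[1]: value
        for key, value in tags.items()
        if key.startswith("name:")
    }


# Made-up sample tags for illustration; not copied from OSM.
tags = {
    "place": "city",
    "name": "Example City",
    "name:en": "Example City",
    "name:es": "Ciudad Ejemplo",
}

print(localized_names(tags))  # {'en': 'Example City', 'es': 'Ciudad Ejemplo'}
```

Each value returned here could then be classified by Unicode block in the same way as the primary name.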

Finding gaps in local language coverage

There are many areas which aren’t highlighted by the map because they were Anglicized by map editors. As an example, places in the Marshall Islands aren’t labeled in their Marshallese names (e.g. Mājro, Arņo). Unicode may not be done adding new codepoints for Marshallese (ņ here is repurposed from the Latvian alphabet).

It would be interesting to reach out to the editors who have added ~5 places in the N’Ko alphabet in Guinea, or in similar local scripts, to expand the number of local scripts used on OSM.

Adapted from my Medium post for non-OSM audiences

Location: Goose Island, Washington, District of Columbia, 22314-5352, USA

Comment from Rostranimin on 1 April 2019 at 20:57

Just a thought. When looking at how ‘clustered’ points are, have you remembered that you may need to change the map projection before comparing two images at what is the same ‘zoom level’? On looking at your Canadian image (above) I was immediately struck by the fact that you’d stuck with the standard ‘Pseudo Mercator’ coordinate reference system / projection, which will have wildly distorted this map in comparison to clusters in other geographical locations. This wouldn’t have mattered necessarily, except that the maps sit in a sequence, appearing as comparisons.
