The world is messy and human languages more so. Recently the talk@ mailing list erupted in discussion over a proposal to shunt the vast majority of name:* tags over to Wikidata. But most of the discussion has centered on rather Eurocentric examples and concerns. I worry that the discussion will lead to a policy change based on overgeneralizations. Having done a fair amount of multilingual name-tagging in the past, I want to point out just a few of the complications that monolingual mappers may be unaware of.
Translation versus transliteration
The top 20 languages are each natively spoken by about one percent of the world’s population. Twelve of them are in scripts other than Latin, and at least three are in non-alphabetic scripts, requiring transliteration just to produce a name that monolingual English speakers can recognize as text, let alone type.
Some have argued that translations are preferable to transliterations. Others have argued that transliterations should be omitted entirely from OSM, as an exercise to the reader or a job for third-party services. But what’s the difference between translation and transliteration? The wiki offers this simplistic explanation:
Transliteration is the process of taking a name in one language, and simply changing letters from one script to another.
This definition is a gross oversimplification, downplaying what it takes to adapt a foreign word to something you can use in your own language. There are three ways to go about it:
- Transcription from another language gets you the original word’s pronunciation respelled in a very literal phonetic alphabet (or a language-neutral alphabet like the IPA), without regard for etymology. Except for cases involving ideographic scripts, as we’ll see below, pure transcription is almost never the right answer for a
- Transliteration from another script to a Roman alphabet gets you the original word, but respelled as if English had borrowed the word, often taking liberties with the pronunciation in order to look “native” or respect the original etymology. Transliteration is the most reliable method for producing a usable name in your language.
- Translation from another language to English gets you a word that refers to the same thing in English but may have a completely different pronunciation and etymology. Translation is only appropriate in a limited number of cases for historical reasons. Words like “north” and “city” are often translated while the rest of the name is transliterated.
I don’t speak Russian; perhaps one can get Абергавенни from “Abergavenny” by performing a simple one-to-one mapping from Latin letters to Cyrillic letters. But Russian has varying transliteration schemes, each with its own exceptions, and that’s a relatively easy case considering that the Roman and Cyrillic scripts share a common ancestor.
A counterexample: transliterating Chinese to Vietnamese
Shanghai has a Vietnamese name. You’ll never see it on signage in Shanghai, but no Vietnamese speaker refers to the city by its Chinese name. (Photo: Immanuel Giel / CC BY-SA 4.0)
Over the last seven years, I’ve added tens of thousands of name:vi tags by hand, the vast majority of them to place POIs and relations in mainland China. One of these POIs is Shanghai, called 上海 in Chinese. English-language literature calls it “Shanghai”, after the Pinyin transcription Shànghǎi. Shanghai is just a name to English speakers; it retains the pronunciation, more or less, but not the meaning. A literal translation would be “High Sea” or, more poetically, “Upon the Sea”. You’d never put “Upon the Sea” into OpenStreetMap because no one has ever called it that. You’d set name:en=Shanghai because English has no special name for the city.
Vietnamese is very different when it comes to Chinese names. Vietnam has had millennia of intense contact with China (much of it adversarial). As a result, every Chinese character has a Sino-Vietnamese reading: a word that was borrowed from Middle Chinese into Old Vietnamese, retaining the meaning but not the pronunciation (owing to changes in both spoken Vietnamese and spoken Chinese over the centuries). For Shanghai, I set name:vi=Thượng Hải, using Sino-Vietnamese for 上海. It literally means “high sea”, but in words that are only used for terms and names borrowed from Chinese.
As it happens, 上 has multiple readings corresponding to different meanings: thưởng (award), thượng (high), thướng (rise). Choosing between them is the task of a translator, not a SQL transform. So how does a translator like me choose the right Sino-Vietnamese words? Sometimes the answer is obvious: I simply learned long ago that Shanghai is called Thượng Hải in the course of learning Vietnamese, and most Vietnamese learn that just by living in Vietnam for a time. For more obscure names, there are plenty of places to look up individual characters. My sources have included an out-of-copyright dictionary and a Sino-Vietnamese database that comes with no restrictions according to its author. (For the record, Unihan is TIGER bad when it comes to Vietnamese.) When I’m on the fence about a transliteration, I double-check it against sites like the Vietnamese Wikipedia. And when a character really has me stumped, I leave the POI alone.
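To make the ambiguity concrete, here is a minimal Python sketch of what a pure table lookup can and cannot do. The three readings of 上 are the ones listed above, and hải for 海 is taken from the name Thượng Hải; the READINGS table and the candidate_names function are hypothetical stand-ins for a real dictionary, not anything that exists in OSM tooling.

```python
from itertools import product

# Readings of 上 are the three listed above; "hải" for 海 comes from the
# name Thượng Hải. This tiny table is a hypothetical stand-in for a real
# Sino-Vietnamese dictionary.
READINGS = {
    "上": ["thưởng", "thượng", "thướng"],
    "海": ["hải"],
}

def candidate_names(name):
    """Return every combination of readings that the table alone allows."""
    per_char = [READINGS[char] for char in name]
    return [" ".join(combo) for combo in product(*per_char)]

candidates = candidate_names("上海")
print(candidates)
```

The lookup yields one candidate per reading of 上 and has no basis for preferring any of them. Picking thượng takes knowledge of the place, which is exactly what the table lacks.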
If I were to actually translate “Shanghai” into “plain” Vietnamese, the result would be either Trên Biển if I transliterate at the same time or something like 𨕭𣛟 if I don’t. (The Vietnamese language also used ideographic characters until the 20th century, just a different set of characters than Chinese.) No one would ever use the “plain” Vietnamese name, though; Thượng Hải is the only correct way to render this particular city’s name in Vietnamese.
This is just one language out of many that have rich histories of dealing with multiple writing systems. You can imagine that other languages have their own unique considerations.
Machine transliteration is impractical
If we rely on software to localize place names for us, some languages can hope for no better than hack jobs, akin to this humorous map in “English”. (Illustration: imkharn)
There has been plenty of handwaving about renderers and geocoders that are smart enough to transliterate between different writing systems. But consider that Google Translate, with all its NLP might and a corpus the size of the Internet, fares poorly at interpreting Chinese place names. It doesn’t know that 红寺堡 is Hồng Tự Bảo in Vietnamese or “Hongsibao” in English. Your average mapmaker can’t afford that kind of technology anyways.
Software developers have much more experience converting between metric and imperial units than between human languages. Even though Sino-Vietnamese words aren’t “plain”, modern Vietnamese words, their meanings are often not lost on Vietnamese speakers today. Any schoolchild could tell you that thượng hải means trên biển (upon the sea), an apt name for a major port city. But a multilingual software client, burdened with the knowledge that thượng could also mean 㐀 = “hill”, or 㠪 = “five”, or 尙 = “yet”, would need a lot of resources to make a decision:
- Natural language processing (NLP), a form of artificial intelligence
- Context about the city and common naming practices
- A decent, machine-readable, suitably licensed dictionary for that particular language pair
- Possibly even dedicated logic for each character, multiplied by the number of transliteration schemes
Then there are suggestions that IPA transcriptions could be tagged as an intermediate step. But IPA comes with its own headaches, like whether to transcribe broadly or narrowly. Consider the number of valid English pronunciations of “north”, then consider that the same Chinese script is used by a host of mutually-unintelligible language varieties.
It wouldn’t be possible to derive the Sino-Vietnamese name from an IPA or Pinyin transcription, anyways, because they have different many-to-many mappings between characters and words. Shàng (Pinyin) doesn’t just correspond to 上; it also corresponds to the following characters, as would an IPA transcription based on Mandarin: 上姠尙尚蠰銄鑜. On the other hand, thượng (Sino-Vietnamese) corresponds to a very different set of Chinese characters: 㐀㠪丄仩上鞜妴尙尚鞝躺𠄞. Spoken Mandarin and Vietnamese have evolved so much over the centuries that, if a system like Sino-Vietnamese were invented today based on modern Mandarin pronunciation instead of Middle Chinese, it would employ a completely different set of words for each character.
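The mismatch between the two mappings can be checked with a few set operations. This is only a sketch: the character lists are copied from the paragraph above, and the variable names are my own.

```python
# Characters read as shàng in Pinyin, copied from the text above.
PINYIN_SHANG = set("上姠尙尚蠰銄鑜")
# Characters read as thượng in Sino-Vietnamese, copied from the text above.
SINO_VIET_THUONG = set("㐀㠪丄仩上鞜妴尙尚鞝躺𠄞")

# Characters on which the two readings happen to agree.
shared = PINYIN_SHANG & SINO_VIET_THUONG
# Characters that are shàng in Pinyin but not thượng in Sino-Vietnamese.
only_pinyin = PINYIN_SHANG - SINO_VIET_THUONG
# Characters that are thượng in Sino-Vietnamese but not shàng in Pinyin.
only_sino_viet = SINO_VIET_THUONG - PINYIN_SHANG

print("shared:", "".join(sorted(shared)))
print("shàng only:", "".join(sorted(only_pinyin)))
print("thượng only:", "".join(sorted(only_sino_viet)))
```

Only a handful of characters fall into both sets, so a syllable in one system does not identify a syllable in the other. A converter would have to go back to the original characters, and even then it would face the choice between multiple readings described earlier.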
There is a consensus at least that automatic transliteration does not belong in OSM, because it cannot be verified for accuracy. But excluding handcrafted transliterations from OSM forces data consumers to foist those same automatic, unverified algorithms upon their users. The result is the worst of both worlds: poor support due to the effort required and poor quality due to a lack of context.
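By contrast, a data consumer that can rely on handcrafted name:* tags needs very little logic. The following Python sketch assumes a feature's tags arrive as a plain dictionary; the display_name function is hypothetical, and the name=上海 value is my assumption about how the feature is tagged locally.

```python
def display_name(tags, lang):
    """Prefer the hand-tagged name:<lang>, then fall back to the local name."""
    return tags.get(f"name:{lang}") or tags.get("name")

# Tags for Shanghai as discussed above; name=上海 is assumed here.
shanghai = {
    "name": "上海",
    "name:en": "Shanghai",
    "name:vi": "Thượng Hải",
}

print(display_name(shanghai, "vi"))  # uses the hand-tagged Vietnamese name
print(display_name(shanghai, "fr"))  # no name:fr here, so the local name is used
```

Without the name:vi tag, the only options left are to show the local name as is or to run it through the kind of automatic, unverified conversion described above.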