Minh Nguyen's Diary

Globalizing the name translation debate

Posted by Minh Nguyen on 5 June 2015 in English. Last updated on 17 February 2022.

The world is messy and human languages more so. Recently the talk@ mailing list erupted in discussion over a proposal to shunt the vast majority of name:* tags over to Wikidata. But most of the discussion has centered around rather Eurocentric examples and concerns. I worry that the discussion will lead to a policy change based on overgeneralizations. Having done a fair amount of multilingual name-tagging in the past, I want to point out just a few of the complications that monolingual mappers may be unaware of.

Translation versus transliteration

The top 20 languages are each natively spoken by roughly one percent or more of the world’s population. Twelve of them are in scripts other than Latin, and at least three are in non-alphabetic scripts, requiring transliteration just to produce a name that monolingual English speakers can recognize as text, let alone type.

Some have argued that translations are preferable to transliterations. Others have argued that transliterations should be omitted entirely from OSM, as an exercise to the reader or a job for third-party services. But what’s the difference between translation and transliteration? The wiki offers this simplistic explanation:

Transliteration is the process of taking a name in one language, and simply changing letters from one script to another.

This definition is a gross oversimplification, downplaying what it takes to adapt a foreign word to something you can use in your own language. There are three ways to go about it:

Transcription from another language gets you the original word’s pronunciation respelled in a very literal phonetic alphabet (or a language-neutral alphabet like the IPA), without regard for etymology. Except for cases involving ideographic scripts, as we’ll see below, pure transcription is almost never the right answer for a name:* tag.
Transliteration from another script to a Roman alphabet gets you the original word, but respelled as if English had borrowed the word, often taking liberties with the pronunciation in order to look “native” or respect the original etymology. Transliteration is the most reliable method for producing a usable name in your language.
Translation from another language to English gets you a word that refers to the same thing in English but may have a completely different pronunciation and etymology. Translation is only appropriate in a limited number of cases for historical reasons. Words like “north” and “city” are often translated while the rest of the name is transliterated.

I don’t speak Russian; perhaps one can get Абергавенни from “Abergavenny” by performing a simple one-to-one mapping from Latin letters to Cyrillic letters. But Russian has varying transliteration schemes, each with its own exceptions, and that’s a relatively easy task considering that the Roman and Cyrillic scripts share a common ancestor.

A counterexample: transliterating Chinese to Vietnamese

Shanghai has a Vietnamese name. You’ll never see it on signage in Shanghai, but no Vietnamese speaker refers to the city by its Chinese name. (Photo: Immanuel Giel / CC BY-SA 4.0)

Over the last seven years, I’ve added tens of thousands of name:vi tags by hand, the vast majority of them to place POIs and relations in mainland China. One of these POIs is Shanghai, called 上海 in Chinese. English-language literature calls it “Shanghai”, after the Pinyin transcription Shànghǎi. Shanghai is just a name to English speakers; it retains the pronunciation, more or less, but not the meaning. A literal translation would be “High Sea” or, more poetically, “Upon the Sea”. You’d never put “Upon the Sea” into OpenStreetMap because no one has ever called it that. You’d set name:en=Shanghai because English has no special name for the city.

Vietnamese is very different when it comes to Chinese names. Vietnam has had millennia of intense contact with China (much of it adversarial). As a result, every Chinese character has a Sino-Vietnamese reading: a word that was borrowed from Middle Chinese into Old Vietnamese, retaining the meaning but not the pronunciation (owing to changes in both spoken Vietnamese and spoken Chinese over the centuries). For Shanghai, I set name:vi=Thượng Hải, using Sino-Vietnamese for 上海. It literally means “high sea”, but in words that are only used for terms and names borrowed from Chinese.

As it happens, 上 has multiple readings corresponding to different meanings: thưởng (award), thượng (high), thướng (rise). Choosing between them is the task of a translator, not a SQL transform. So how does a translator like me know how to choose the right Sino-Vietnamese words? Sometimes the answer is obvious: I simply learned long ago that Shanghai is called Thượng Hải in the course of learning Vietnamese, and most Vietnamese learn that just by living in Vietnam for a time. For more obscure names, there are plenty of places to look up individual characters. My sources have included an out-of-copyright dictionary and a Sino-Vietnamese database that comes with no restrictions according to its author. (For the record, Unihan is TIGER bad when it comes to Vietnamese.) When I’m on the fence about a transliteration, I double-check it against sites like the Vietnamese Wikipedia. And when a character really has me stumped, I leave the POI alone.
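To make the ambiguity concrete, here is a minimal Python sketch of what a naive character-by-character lookup would produce. It is only an illustration under stated assumptions: the table holds just the readings mentioned in this post (a real dictionary would hold far more), and the names SINO_VIETNAMESE_READINGS and candidate_names are made up for this example rather than taken from any real tool.

```python
from itertools import product

# Toy reading table (an assumption for illustration): it contains only the
# readings mentioned in this post, not a complete Sino-Vietnamese dictionary.
SINO_VIETNAMESE_READINGS = {
    "上": ["thưởng", "thượng", "thướng"],
    "海": ["hải"],
}


def candidate_names(name):
    """Return every combination of readings for a name, character by character."""
    per_character = [SINO_VIETNAMESE_READINGS[ch] for ch in name]
    return [" ".join(combo) for combo in product(*per_character)]


candidates = candidate_names("上海")
for candidate in candidates:
    print(candidate)
```

Even with this tiny table, the lookup returns three candidates for a two-character name, and nothing in the table says which one is right. The number of candidates multiplies with every ambiguous character, which is why the final choice still falls to someone who knows the language.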

If I were to actually translate “Shanghai” into “plain” Vietnamese, the result would be either Trên Biển if I transliterate at the same time or something like 𨕭𣛟 if I don’t. (The Vietnamese language also used ideographic characters until the 20th century, just a different set of characters than Chinese.) No one would ever use the “plain” Vietnamese name, though; Thượng Hải is the only correct way to render this particular city’s name in Vietnamese.

This is just one language out of many that have rich histories of dealing with multiple writing systems. You can imagine that other languages have their own unique considerations.

Machine transliteration is impractical

If we rely on software to localize place names for us, some languages can hope for no better than hack jobs, akin to this humorous map in “English”. (Illustration: imkharn)

There has been plenty of handwaving about renderers and geocoders that are smart enough to transliterate between different writing systems. But consider that Google Translate, with all its NLP might and a corpus the size of the Internet, fares poorly at interpreting Chinese place names. It doesn’t know that 红寺堡 is Hồng Tự Bảo in Vietnamese or “Hongsibao” in English. Your average mapmaker can’t afford that kind of technology anyways.

Software developers have much more experience converting between metric and imperial units than between human languages. Even though Sino-Vietnamese words aren’t “plain”, modern Vietnamese words, their meanings are often not lost on Vietnamese speakers today. Any schoolchild could tell you that thượng hải means trên biển (upon the sea), an apt name for a major port city. But a multilingual software client, burdened with the knowledge that thượng could also mean 㐀 = “hill”, or 㠪 = “five”, or 尙 = “yet”, would need a lot of resources to make a decision:

Natural language processing (NLP), a form of artificial intelligence
Context about the city and common naming practices
A decent, machine-readable, suitably licensed dictionary for that particular language pair
Possibly even dedicated logic for each character, multiplied by the number of transliteration schemes

Then there are suggestions that IPA transcriptions could be tagged as an intermediate step. But IPA comes with its own headaches, like whether to transcribe broadly or narrowly. Consider the number of valid English pronunciations of “north”, then consider that the same Chinese script is used by a host of mutually-unintelligible language varieties.

It wouldn’t be possible to derive the Sino-Vietnamese name from an IPA or Pinyin transcription, anyways, because they have different many-to-many mappings between characters and words. Shàng (Pinyin) doesn’t just correspond to 上; it also corresponds to the following characters, as would an IPA transcription based on Mandarin: 上姠尙尚蠰銄鑜. On the other hand, thượng (Sino-Vietnamese) corresponds to a very different set of Chinese characters: 㐀㠪丄仩上鞜妴尙尚鞝躺𠄞. Spoken Mandarin and Vietnamese have evolved so much over the centuries that, if a system like Sino-Vietnamese were invented today based on modern Mandarin pronunciation instead of Middle Chinese, it would employ a completely different set of words for each character.
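The mismatch between the two mappings can be checked with a few lines of Python. This is only a sketch: the two character sets are copied from the paragraph above and are not claimed to be exhaustive, and the variable names are invented for this example.

```python
# Character sets copied from the paragraph above (not claimed to be complete).
SHANG_PINYIN = set("上姠尙尚蠰銄鑜")
THUONG_SINO_VIETNAMESE = set("㐀㠪丄仩上鞜妴尙尚鞝躺𠄞")

# Characters that both readings can stand for.
shared = SHANG_PINYIN & THUONG_SINO_VIETNAMESE
# Characters read "shàng" in Pinyin that are not read "thượng".
only_pinyin = SHANG_PINYIN - THUONG_SINO_VIETNAMESE
# Characters read "thượng" that are not read "shàng" in Pinyin.
only_sino_vietnamese = THUONG_SINO_VIETNAMESE - SHANG_PINYIN

print("shared:", "".join(sorted(shared)))
print("shàng only:", "".join(sorted(only_pinyin)))
print("thượng only:", "".join(sorted(only_sino_vietnamese)))
```

Only a few characters fall in the overlap. A converter that starts from “shàng” cannot tell whether “thượng” applies, and one that starts from “thượng” cannot recover “shàng”, so neither transcription can stand in for the original characters.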

There is a consensus at least that automatic transliteration does not belong in OSM, because it cannot be verified for accuracy. But excluding handcrafted transliterations from OSM forces data consumers to foist those same automatic, unverified algorithms upon their users. The result is the worst of both worlds: poor support due to the effort required and poor quality due to a lack of context.

Discussion

Comment from Stalfur on 6 June 2015 at 00:15

Very good analysis. This is a very strange movement to purge non-regional names, instead of maintaining accurate knowledge we suddenly need to depend on super-smart AI that doesn’t exist yet, all to save a few kilobytes in a database of terabytes.

Comment from seav on 6 June 2015 at 08:10

I think the original point of the mailing list discussion is, why not put your Vietnamese translations into Wikidata instead of in OSM?

Comment from imagico on 6 June 2015 at 08:23

These are very useful insights into Far East language relations and particularities. You do not however address the problem of verifiability of handcrafted transliterations. This is rarely an issue with two neighboring countries with a long history of cultural relations like China and Vietnam but the main question is how to handle this in other cases. How do you decide if and how to tag name:vi for European/African/American places?

Comment from Stalfur on 7 June 2015 at 16:19

@seav And do what once it is in Wikidata? There is no ready made solution that makes it better for anyone to put it there.

Comment from seav on 7 June 2015 at 19:49

@Stalfur, I think that the original point also includes developing tools to use Wikidata with OSM data. There is no suggestion that we wholesale remove some name tags in OSM without having a concurrent usable tool for obtaining the data in those removed name tags.

Comment from Vincent de Phily on 8 June 2015 at 10:01

Thanks for the nice writeup. There are still people on the mailing list arguing that this is “nonsense” but I hope that they’ll open up, or at least agree to disagree and carry on mapping.

@imagico for this kind of data, verification is done by native speakers. The more eyes we have on OSM data the better (and providing the initial name:CC coverage should help). If two contributors disagree on Abergavenny’s name:vi then we’ll look more closely at more verifiable sources. In the meantime there’s no point in depriving ourselves of useful data.

In Ireland we recently got a lot of farmers mapping their fields’ names (http://osm.org/go/esz5oiyo– for example). It’s even less verifiable because these names are only used by a few individuals from one family, yet there’s nobody in the community arguing for deletion. I think the difference is that the contributors are locals and therefore the other mappers don’t feel “threatened by invasion” like they did with name:ru=Абергавенни. And it’s a shame that such a feeling arises in a global mapping project.

Comment from woodpeck on 13 June 2015 at 12:55

“I simply learned long ago that Shanghai is called Thượng Hải in the course of learning Vietnamese, and most Vietnamese learn that just by living in Vietnam for a time.”

I think these kinds of multi-lingual names are the ones we’re looking for and that are ok to have. (Whether in OSM or Wikidata, is another question.) There will, however, be no such name for Abergavenny, because Wales and Vietnam lack the millennia of common history. And that’s why Abergavenny should not have a name:vi tag even if you could conceivably come up with one by applying some rules and looking up characters in books.

I agree that we must be careful not to throw the baby out with the bathwater, but in order to record something as a name in a different language, it must actually be “alive”, used, known to people. I don’t see room for theoretical constructs from the desk of a translator, and I strongly disagree with Vincent de Phily when he suggests that whoever is the first to invent a name can have that stand until someone else notices. Adding a name:vi tag to Abergavenny would require proper sources demonstrating that this name is being used.

Assuming that local people know their area best is not a shame, it is a bedrock of OSM. That doesn’t make the project any less global.