mariotomo's Diary

textual/orthographic fixes to names

Posted by mariotomo on 26 May 2022 in English. Last updated on 30 May 2022.

(I’m publishing it now, I will review, and might add screenshots.)

rationale

Here in Panama I’ve occasionally noticed bursts of name contributions by foreigners who do not know the language, by local people who copy from official information where the accents have been purged, or dating from long ago, when non-ASCII characters were still a problem.

Whatever the case, we have textual mistakes in the database, so that when from time to time I browse the database, I still stumble on “Simon Bolivar”, “San Jose”, “Canaveral”, “la Compania”, and the commonplace terms Panaderia, Pasteleria, Fruteria, etc.

If you’re not familiar with Spanish, you haven’t noticed anything wrong, or have you?

Those familiar with the language and/or fixated on formal correctness like linguists, philologists, and mathematicians, have felt pain, because the correct Spanish form of the above words is “Simón Bolívar”, “San José”, “Cañaveral”, “la Compañía”, “Panadería”, “Pastelería”, “Frutería”, etc.

Just for reference, at the time of writing, “Simon Bolivar” in the Bolivarian Republic of Venezuela returns 32 objects, while “Simón Bolívar” gives 200, possibly giving a measure of some national pride. Colombian “Fruterías”, however, score 56 with the accent against 64 for “Fruteria” without it.

the task

Here I only focus on what are obvious orthographic mistakes, skipping any data that might look wrong only because it’s in a different language, and I want to fix the mistakes only looking at the textual information, not at the geographic or geometric characteristics of the object.

This spares me downloading object coordinates and the base map, but I still advise checking every single instance we’re fixing: no automatic blanket fix.

the tools

I suggest using overpass-turbo, level0, and the standard unix stream editor sed. The data is then corrected in the level0 html interface.

the procedure by example

imagine you want to fix all “Fruteria” in Colombia, placing the missing accent as in “Frutería”.

first of all, we gather all object ids from overpass-turbo:

https://overpass-turbo.eu/

activate the wizard

have the query built, but not run.

replace the trailing code:

// print results
out body;
>;
out skel qt;

with the lighter:

// print object ids
out ids;

also increase the timeout from 25 to, say, 125.

running it will give you a result in the right-hand-side pane. select and copy to the clipboard.

now open a terminal, and give the command:

sed -n -e 's/^.."type": "\(.\).*",/\1/p' -e 's/.*"id": \([0-9]*\)[,]*/\1/p' | 
   sed -e ':a; N; s/\n//; N; s/\n/, /; ba'

when you run the above, sed will read from the standard input. paste the code you copied, then end the stream, hitting Ctrl-D (that’s press and hold Ctrl, press D, release both).

the output will be on a single line, looking like

n805012123, n846533218, n2692985140, n3413518765, n3750009415, 
[…]
n3196435415

triple-click on it, copy to the clipboard, then activate the level0 page, and paste the longish line with the object ids.

have level0 grab the information from OSM, decide what you want to fix, and whatever you don’t feel sure about, remove it from the text area, so you won’t be submitting it to level0.
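if you would rather not paste into the terminal, the same extraction can be run on a saved file. the following is only a sketch of a variant, not a replacement for the command above: it assumes the overpass-turbo output (produced with out ids;) keeps each "type" and "id" on its own line, and it joins the lines with a small awk step instead of the second sed, so it does not rely on what sed does when it runs out of input. the function name ids_for_level0 and the file name ids.json are made up for this example.

```shell
#!/bin/sh
# Sketch: convert overpass "out ids;" JSON into an id list for level0.
# ids_for_level0 is a hypothetical helper name, not part of any tool.
#
# sed keeps the first letter of each "type" and the digits of each "id";
# awk glues each letter to the id that follows it, separated by ", ".
ids_for_level0() {
  sed -n -e 's/^ *"type": "\(.\).*/\1/p' \
         -e 's/^ *"id": \([0-9][0-9]*\).*/\1/p' |
    awk 'NR % 2 { kind = $0; next }
         { printf "%s%s%s", sep, kind, $0; sep = ", " }
         END { print "" }'
}

# usage, assuming the export was saved as ids.json (hypothetical name):
#   ids_for_level0 < ids.json
# a tiny hand-made sample in the same layout:
ids_for_level0 <<'EOF'
{
  "elements": [
{
  "type": "node",
  "id": 805012123
},
{
  "type": "way",
  "id": 42
}
  ]
}
EOF
```

for the sample above this prints n805012123, w42, ready to be pasted into level0.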

caveats

I strongly advise you to double check all you change, and I therefore advise against grabbing more than, say, 50 items at a time.

if the overpass query returned more than 50 items of data, consider choosing a single uniform type of discrepancy among the data, and fix only that.

do not mass change anything other than the name. if you see another textual mistake that is clearly a mistake, like “heladeria” (again missing the same ´ on the i), fixing it should be fine.

for the above example, a “frutería” might be tagged as cafe, or fast_food, or greengrocer, or even ice_cream, generally depending on the presence, or not, of other keywords like “Cafetería” and “Heladería”. you might possibly even notice that it depends on which group of corporate mappers contributed the information.

a Frutería y Heladería tagged as shop=convenience is doubtful, and it might be useful to tag it with a fixme, clearly explaining what doesn’t make sense to you. a shop=greengrocer named “Frutería” might be just fine, in particular if the name also mentions veggies.

please never assume anything: if you have doubts, stay away from the object, or tag it with a fixme expressing your doubt.

otherwise

Another approach, a bit more paranoid maybe, would be to download all named objects for a given area, then look for inspiration on what to correct. My hint here would be to use JOSM to perform the overpass query, save the objects in a (dot)osm file, then look at the names with something like this:

cat yourfile.osm | 
  sed -n -e "s/.*k='name' v='\(.*\)' .>/\1/p" | 
  tr A-Z a-z | 
  tr ÁÉÍÓÚÜ áéíóúü | 
  sed -e "s/&apos;/'/g" -e "s/&gt;/>/g" -e 's/&quot;/"/g' -e "s/&amp;/\&/g" | 
  grep -o '\<[^. -]*\>' | 
  sort | 
  uniq -c | less

(I know it’s a “useless” use of “cat”, but I find it practical because it lets me shuffle the different filters, which are all independent of the filename.)
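One possible way to turn such a word list into candidates, offered as a sketch and not as part of the procedure above, is to look for words that occur both with and without their diacritics, such as “fruteria” next to “frutería”. The function names strip_accents and suspect_words are made up for this example, and the input is assumed to be one lower-case word per line, as the pipeline above produces before the sort and uniq steps. Every hit is only a candidate: pairs where both spellings are valid words will show up as well.

```shell
#!/bin/sh
# strip_accents and suspect_words are hypothetical helper names.

# replace the accented lower-case Spanish letters with their plain form
strip_accents() {
  sed -e 's/á/a/g' -e 's/é/e/g' -e 's/í/i/g' -e 's/ó/o/g' \
      -e 's/ú/u/g' -e 's/ü/u/g' -e 's/ñ/n/g'
}

# stdin:  one lower-case word per line
# stdout: unaccented words that also occur in an accented spelling
suspect_words() {
  words=$(mktemp)
  plain=$(mktemp)
  sort -u > "$words"
  # collect the plain forms of the words that do carry a diacritic
  strip_accents < "$words" | paste -d ' ' "$words" - |
    awk '$1 != $2 { print $2 }' | sort -u > "$plain"
  # print the words spelled exactly like one of those plain forms;
  # finding none is not an error
  grep -Fx -f "$plain" "$words" || true
  rm -f "$words" "$plain"
}

printf '%s\n' frutería fruteria pan josé fruteria | suspect_words
```

With the sample words above this prints fruteria, the only plain spelling that has an accented sibling in the list.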

once you have decided which words are worth having a look at, you can either use overpass or, easier, use your local file again:

cat yourfile.osm | 
  grep "\(id='\|'name'\)" | 
  grep -B1 -i "\(técnología\|tecnólogico\|técnológicos\)"  | 
  sed -n -e "s/  <\(.\)[^ ]* id='\([0-9]*\)'.*/\1\2/p" | 
  paste -sd, -
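to see what this filter does, here is the same pipeline wrapped in a function and run on a small hand-made file. everything in the sample (ids, names, the function name ids_for_words) is invented for illustration, and the layout is assumed to be the one JOSM writes, with single-quoted attributes and two spaces before each element. the only changes to the pipeline are grep -E for the alternatives and an explicit - for paste, which some paste implementations require.

```shell
#!/bin/sh
# ids_for_words is a hypothetical wrapper around the pipeline above.
# $1: an .osm file as saved by JOSM (single-quoted attributes)
# $2: an extended regular expression listing the spellings to look for
ids_for_words() {
  grep -e "id='" -e "'name'" "$1" |
    grep -B1 -i -E "$2" |
    sed -n -e "s/  <\(.\)[^ ]* id='\([0-9]*\)'.*/\1\2/p" |
    paste -sd, -
}

# hand-made sample: two misspelled names and one correct one
sample=$(mktemp)
cat > "$sample" <<'EOF'
<osm version='0.6' generator='JOSM'>
  <node id='101' version='1' lat='9.0' lon='-79.5'>
    <tag k='amenity' v='college' />
    <tag k='name' v='Instituto Tecnólogico' />
  </node>
  <way id='202' version='3'>
    <nd ref='101' />
    <tag k='name' v='Panadería La Espiga' />
  </way>
  <way id='303' version='2'>
    <nd ref='101' />
    <tag k='name' v='Centro de Técnología' />
  </way>
</osm>
EOF
ids_for_words "$sample" 'técnología|tecnólogico|técnológicos'
rm -f "$sample"
```

this prints n101,w303: the first grep keeps only the element and name lines, so the line before each matching name is always the element that carries it, and the bakery is skipped because its name matches none of the spellings.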

Discussion

Comment from Minh Nguyen on 3 June 2022 at 20:12

(There is also a related discussion about typographical errors in the names of churches in Spanish.)

Thank you for thoroughly documenting your process here.

I’ve also encountered a lot of similar spelling mistakes in Spanish-speaking neighborhoods of San José, California. The signs of taquerías, panaderías, and carnicerías are usually posted in ALL CAPS, so the diacritics are omitted for convenience.¹ Non–Spanish speakers either don’t know that there should be diacritics or don’t know which ones to use. Sometimes people even remove the diacritics, thinking that’s more faithful to the on-the-ground principle.²

The same problem affects the city’s Vietnamese-speaking neighborhoods so much that I added a short tip to the wiki about how to tag name:vi. Maybe there needs to be a page about name:es to raise awareness of this issue and, eventually, facilitate the development of QA validation rules to catch these problems earlier.

¹ Even though language authorities like the RAE and academic institutions like the Library of Congress have ended this practice, it persists in signmaking.
² Incidentally, the name of this city has been the subject of a slow-moving edit war for years, reflecting a real-world dispute about whether it should include the acute mark.

Comment from mariotomo on 3 June 2022 at 23:43

Hi Minh Nguyen!

about San Jose in California, my guess is limited to name:en and name:es, respectively San Jose and San José. IIUC, the city council decided that the official city name should be in Spanish. in this situation it does not surprise me that the value for name is seen as debatable.

about accents, I’ve noticed that Italians have a better ear for them, funny enough: we don’t write them except on the last syllable.

Comment from Minh Nguyen on 4 June 2022 at 06:10

It’s officially San José in English, based on the Spanish name, so the debate is about which English name is the main one. As the wiki page suggests, it’s pretty complicated, but currently the unaccented name is the name in OSM. Fortunately, the names you’re looking at would be uncontroversial, so your tips will probably come in handy for me. Thanks again!
