(I’m publishing it now, I will review, and might add screenshots.)
Here in Panama I’ve occasionally noticed bursts of name contributions by foreigners who do not know the language, by local people who copy from official sources where the accents have been stripped, or dating from long ago, when non-ASCII characters were still problematic.
Whatever the cause, we have textual mistakes in the database, and when I browse it from time to time I still stumble on “Simon Bolivar”, “San Jose”, “Canaveral”, “la Compania”, and the commonplace terms Panaderia, Pasteleria, Fruteria, etc.
If you’re not familiar with Spanish, you haven’t noticed anything wrong, or have you?
Those familiar with the language and/or fixated on formal correctness, like linguists, philologists, and mathematicians, will have felt pain, because the correct Spanish forms of the above words are “Simón Bolívar”, “San José”, “Cañaveral”, “la Compañía”, “Panadería”, “Pastelería”, “Frutería”, etc.
Just for reference, at the time of writing, “Simon Bolivar” in the Bolivarian Republic of Venezuela returns 32 objects, while “Simón Bolívar” returns 200, possibly a measure of some national pride. In Colombia, however, “Frutería” scores only 56 against 64 for “Fruteria” without the accent.
Here I only focus on obvious orthographic mistakes, skipping any data that might look wrong only because it’s in a different language, and I want to fix the mistakes looking only at the textual information, not at the geographic or geometric characteristics of the object.
This spares me downloading object coordinates and the base map, but I still advise checking every single instance we’re fixing: no automatic blanket fixes.
I suggest using overpass-turbo, level0, and the standard Unix stream editor sed, then correcting the data in the level0 HTML interface.
the procedure by example
imagine you want to fix all “Fruteria” in Colombia, placing the missing accent as in “Frutería”.
first of all, we gather all object ids from overpass-turbo:
activate the wizard
have the query built, but not run.
replace the trailing code:
// print results
out body;
>;
out skel qt;
with the lighter:
// print object ids
out ids;
also increase the timeout from 25 to, say, 125.
running it will give you a result in the right-hand-side pane. select and copy to the clipboard.
now open a terminal, and give the command:
sed -n -e 's/^.."type": "\(.\).*",/\1/p' -e 's/.*"id": \([0-9]*\)[,]*/\1/p' | sed -e ':a; N; s/\n//; N; s/\n/, /; ba'
when you run the above, sed will read from the standard input. paste the text you copied, then end the stream by hitting Ctrl-D (that’s press and hold Ctrl, press D, release both).
the output will be on a single line, looking like
n805012123, n846533218, n2692985140, n3413518765, n3750009415, […] n3196435415
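if you want to try the command before pasting real data, here is a minimal sketch. the function name ids_for_level0 and the sample input are my own inventions: the sample only imitates the shape I expect overpass-turbo to show after "out ids;". I am also assuming GNU sed, since other sed flavours may not accept the ":a; ... ba" one-liner or may drop the last line.

```shell
# hypothetical helper name; it only wraps the two sed commands shown above
ids_for_level0() {
  sed -n -e 's/^.."type": "\(.\).*",/\1/p' -e 's/.*"id": \([0-9]*\)[,]*/\1/p' |
    sed -e ':a; N; s/\n//; N; s/\n/, /; ba'
}

# made-up sample, shaped like the JSON produced by "out ids;"
ids_for_level0 <<'EOF'
{
  "version": 0.6,
  "elements": [

{
  "type": "node",
  "id": 805012123
},
{
  "type": "way",
  "id": 123456789
}

  ]
}
EOF
```

with this sample the command prints n805012123, w123456789: the first sed keeps the initial letter of each type and the bare id, the second one glues them together in pairs.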
triple-click on the output line, copy it to the clipboard, then switch to the level0 page and paste the longish line with the object ids.
have level0 grab the information from OSM, decide what you want to fix, and whatever you don’t feel sure about, remove it from the text area, so you won’t be submitting it to level0.
I strongly advise you to double-check everything you change; I therefore advise against grabbing more than, say, 50 items at a time.
if the overpass query returned more than 50 items of data, consider choosing a single uniform type of discrepancy among the data, and fix only that.
do not mass change anything other than the name. if you see another textual mistake, clearly a mistake, like “heladeria” (again missing the same ´ on the i), fixing that should be fine.
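one possible way to keep to that limit is to cut the long id line into smaller ones and paste them into level0 one at a time. the sketch below is only a suggestion: batch_ids is a name I made up, and 50 is just the figure mentioned above.

```shell
# hypothetical helper: split one long id line into lines of at most N ids
# (N is the first argument, 50 when omitted)
batch_ids() {
  tr -d ' ' | tr ',' '\n' | xargs -n "${1:-50}" | sed 's/ /, /g'
}

# example with made-up ids and a batch size of 2
echo 'n1, n2, w3, n4, r5' | batch_ids 2
```

the example prints three lines, "n1, n2", "w3, n4" and "r5", each in the same comma-separated form as the original line.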
for the above example, a “frutería” might be tagged as café, or fast_food, or greengrocer, or even ice_cream, generally depending on the presence, or not, of other keywords like “Cafetería” and “Heladería”. you might possibly even notice that it depends on which group of corporate mappers contributed the information.
a Frutería y Heladería tagged as shop=convenience is doubtful, and it might be useful to tag it with a fixme, clearly explaining what doesn’t make sense to you. a shop=greengrocer named “Frutería” might be just fine, in particular if the name also mentions veggies.
please never assume anything: if you have doubts, stay away from the object, or tag it with a fixme expressing your doubt.
Another approach, a bit more paranoid maybe, would be to download all named objects for a given area, then look for inspiration on what to correct. My hint here would be to use JOSM to perform the overpass query, save the objects in a (dot)osm file, then look at the names with something like this:
cat yourfile.osm | sed -n -e "s/.*k='name' v='\(.*\)' .>/\1/p" | tr A-Z a-z | tr ÁÉÍÓÚÜ áéíóúü | sed -e "s/&apos;/'/g" -e "s/&gt;/>/g" -e 's/&quot;/"/g' -e "s/&amp;/\&/g" | grep -o '\<[^. -]*\>' | sort | uniq -c | less
(I know it’s a “useless” use of “cat”, but I find it practical because it lets me shuffle the different filters around, since none of them carries the filename.)
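If the word list is long, one way to narrow it down is to show only the words that occur both with and without an accent. The following is a rough sketch, not a polished tool: accent_variants is a hypothetical name, it expects one lowercase word per line, and it treats ñ like an accented n, which is my own assumption.

```shell
# hypothetical helper: reads one lowercase word per line and prints the groups
# of spellings that become identical once the accents are stripped
accent_variants() {
  LC_ALL=C sort -u | awk '
    {
      key = $0
      # plain substitutions on a copy of the word; the word itself is kept
      gsub(/á/, "a", key); gsub(/é/, "e", key); gsub(/í/, "i", key)
      gsub(/ó/, "o", key); gsub(/ú/, "u", key); gsub(/ü/, "u", key)
      gsub(/ñ/, "n", key)
      if (key in forms) { forms[key] = forms[key] " " $0; n[key]++ }
      else              { forms[key] = $0; n[key] = 1 }
    }
    # only keys with more than one distinct spelling are worth a look
    END { for (k in n) if (n[k] > 1) print forms[k] }
  ' | LC_ALL=C sort
}

# example with made-up words
printf '%s\n' frutería fruteria panadería simon simón josé | accent_variants
```

The example prints “fruteria frutería” and “simon simón”. The function could take the place of the final “sort | uniq -c | less” in the pipeline above. What it prints are candidates only: some pairs are both legitimate words (“más” and “mas”, for instance), so every hit still needs a human look.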
once you have decided which words are worth having a look at, you can either use overpass or, more easily, use your local file again:
cat yourfile.osm | grep "\(id='\|'name'\)" | grep -B1 -i "\(técnología\|tecnólogico\|técnológicos\)" | sed -n -e "s/ <\(.\)[^ ]* id='\([0-9]*\)'.*/\1\2/p" | paste -sd,