Dirty datasets

Posted by Cascafico on 14 November 2018 in Italian (Italiano)

Mini abstract

I’ve found a 600+ rows Bed&Breakfast dataset, available opendata by RAFVG. No geo coordinates. Since hiousenumbers were recently imported from RAFVG dataset, I decided to go for geocoding. To get reliable coordinates I used csvgeocode script attached to nominatim geocoding service. Nominatim requires almost perfect (standardized) odonyms, hence I started openrefine and a reconcile service which comes in a separate jar. Reconcile service needs a csv with authoritative names, which I get from overpass-turbo and some filtering.


Bed and Breakfast is a rather new dataset (Oct 17) with more than 600 POIs. Many useful fields such as

  • name and operator
  • phone
  • email
  • site
  • opening hours
  • category (standard, comfort, superior)

Cleaninig data

Such duty has been accomplished by OpenRefine and Reconcilie plugin, connected as a reconciliation service,

In order to standardize messy B&B addresses (entered by B&B operators theirselves) I had to provide Reconcile with an authoritative set of highway names, which I got from overpass-turbo (see Strade d’Italia diary entry).


Just happened for other projects, I choose csvgecode which features pretty simple usage.

Here is a run using mapbox service:

$ csvgeocode input.csv output.csv –handler mapbox –delay 1000 –verbose –url “{{INDIRIZZO}},{{CAP}} {{COMUNE}}.json?access_token="

here using nominatim, instead:

$ csvgeocode input.csv output.csv –handler osm –delay 1000 –verbose –url “{{INDIRIZZO 1}}, {{COMUNE}}&format=json” Rows geocoded: 468
Rows failed: 114
Time elapsed: 879.4 seconds

114 rows not geocoded expose a geocoder problem with apostrophes in city field. Workaround to bypass such not escapable apostrophe is both removing it (ie: Farra d’Isonzo » Farra disonzo) or use postcode instead. Same problem for address, which only remove solution (ie: San Francesco d’Assisi » San Francescao dassisi). Of course above edits are for geocoding sake only. Besides, part of “success” geocoding rows could have been geocoded even with missing housenumber, resulting in highway centroid coordinates. To limit these false positives, I had to check which municipality were imported, faceting out rows belonging to municipalities in red


Conflation matched just 11 POIs, so If you want to collaborate in B&B import, here you can find the audit map to review mostly new nodes.

Login to leave a comment