OpenStreetMap logo OpenStreetMap

Italian schools geocoding

Posted by Cascafico on 1 October 2020 in English. Last updated on 10 October 2020.

Intro

Italian Ministry for Education (MIUR) publishes yearly a nationwide detailed dataset about public schools (isced:level 0 to 3). Guess which OSM-essential data is missing.

No reliable geocoding is feasible if we don’t have a homogeneous address base and I think it’s unlikely large nations (say, Italy, 60 million people is one of these) feature such quality. So we need a tool to filter approximate geocodings (typically street centroid) out from good ones.

Tools

I tried csvgeocode, a light script to insert lat and lon fields in input dataset, but it lacks in grabbing important geocoder responses like accuracy.
then I ran into Openrefine (OR) which has been created to fix “messy data”, but can be happily used to compose geocoders requests, filter responses adequately and structure output in whatever format you want.

Data

  • School buildings dataset, where address is well structured (updated 2018)
  • School details dataset 2020, where all the other stuff is stored (updated 2020)

Buildings dataset (EDIANAGRAFESTA)

Here you’ll find fields for composing geocoding request: in separate columns street type (Via, Viale, Piazza, etc), street name, postalcode, municipality. And of course, the reference field (Codice Meccanografico) needed to link the school details dataset. Using buildings dataset, Openrefine will compose columns to generate Nominatim geocoder requests, fetch URLs responses and filter out those OSM elements which “type” is not punctual (typically highway=* or blank).

School details dataset (SCUANAGRAFESTAT)

Here you’ll find stuff like school name, unique reference code (Codice meccanografico), contacts, description (for ISCED), etc. Openrefine will manage several “messy data” like accents, abbreviations, typos, titlecase etc.

Install Openrefine

OR can be installed in few steps. School datasets feature 55k+ records and OR run nice and easy on my rasperry +1Gbyte RAM board.

Importing Buildings dataset in Openrefine

Fire up OR and your browser will open at local address 127.0.0.1:3333; click “Create project” and copy&paste the link of the buildings dataset in the “Web Addresses (URLs)” box. Right pane will be populated with paged rows (say 1-10 out of 55k or so).

Composing geocoding requests

Click on one of the column names and drop-down to select “edit column, add column based on this column”. On the top, choose the name of the newly create column (ie: nominatim_request, nr or whatever), then as expression, paste the ugly rows below (including double quotes)

https://nominatim.openstreetmap.org/search/?format=json&limit=1&street=”+
cells[“NUMEROCIVICO”].value
+” “+
escape(cells[“TIPOLOGIAINDIRIZZO”].value, “url”)
+” “+
escape(cells[“DENOMINAZIONEINDIRIZZO”].value, “url”)+
“&city=”+
escape(cells[“DESCRIZIONECOMUNE”].value,”url”)+
“&postalcode=”+cells[“CAP”].value

as you may note, we have composed a query using 5 fields (those in uppercase) which for instance could be:

12%20via%20Emilia&city=Verona&postalcode=12345&limit=1

which should be compatible with Nominatim API search parameters. Please, note that still there are some weak points here: the street name “Emilia” can be ambiguous for Nominatim, since OSM highway name can be “emilia Parmense”, but the rest of the request is strong enough: better starting with housenumber (UK & US addr notation, in Italy we put at the end of street name), separate city and adding postcode in case of city duplicity.

Fetching geo coords

So we have a column filled with Nominatim search requests. For each row, OR can ask Nominatim for coordinates and store them in a new column, simply clicking on request column header and “edit column, add column by fetching URLs”.
As defined in request URL, response will be in json format and you will get something like the following example:

[{“place_id”:55354740,”licence”:”Data © OpenStreetMap contributors, ODbL 1.0. osm.org/copyright”,”osm_type”:”node”,”osm_id”:4553758334,”boundingbox”:[“45.445432”,”45.445532”,”10.9452022”,”10.9453022”],”lat”:”45.445482”,”lon”:”10.9452522”,”display_name”:”11, Via Milone, Chievo, Ovest, Verona, Veneto, 37139, Italia”,”class”:”place”,”type”:”house”,”importance”:0.5309999999999999}]

Then, we need to filter out inaccurate results using the “type” tag. In the example above, since “type” is “house”, tags “lat” and “lon” will be OK.
Best way to accomplish this task is creating another column (call it accuracy, type or whatever) that will parse json tag “type”: so click Nominatim response column header and “edit column, add column based on this column”, pasting as “expression” the following:

parseJson(cells[“nr”].value)[0].type

This will fill the new column with the type of OSM element fetched by Nominatim.
One of the OR main features is “faceting”. Click on column header that contains OSM types (house, residential…) and “Facet, text facet”: on the left pane will appear a summary of values with count; as a sample, you could find:
* house 111 OK
* residential 86 N/A
* school 35 OK
* tertiary 18 N/A
* unclassified 7 N/A
* secondary 6 N/A
* museum 4 OK * swimming_pool 2 OK
* …

Sadly, some results are not applicable (I noted “N/A”), since their coordinates will be highway centroids, which could be several hundred meters away from our school. So, let’s keep only punctual elements clicking “include” OK rows in the left pane.

Last operation is creating lat and lon columns in the same way we extracted OSM type values. Once again clicking “Edit column, add column based on this column”, paste as expression:

parseJson(cells[“nr”].value)[0].lat
parseJson(cells[“nr”].value)[0].lon

Importing School details dataset in Openrefine

Just replay the same steps as we did for “Importing Buildings dataset in Openrefine” using the appropriate URL.

As mentioned, “Codice Meccanografico” field is unique for each school and is stored in both datasets, so we are going to use it in order to “cross” records.

In the newly created project we are going to grab lat and lon fields from the first project. Click on the “Codice Meccanografico” header (the one with values like ABCD1234A) then “Edit column, add column based on this column” and use the expression:
cell.cross(“EDIANAGRAFE”, “ref”)[0].cells[“lon”].value

when “ref” value of active Openrefine project is equal to “ref” in the “EDIANAGRAFE” project, “lon” value will be stored in new column.

To facilitate filling in expressions, Openrefine has an useful history register, so use it to repeat last operation, replacing “lon” with “lat” and you are done :-)

Refining

There are several operations that can improve school details making them nice and correct. Just explore expressions and GREL language help. …To be continued

Exporting data

Whenever you are done, if just need to display data in datawrapper, umap etc. you can export in plain csv: click “Export” button and “Custom tabular exporter” then follow the simple procedure.
In case you need json, “Export” then “Templating” will give you maximum flexibility.

Discussion

Log in to leave a comment