OpenStreetMap

Preface

User Watmildon recently posted their conflation workflow with the new hospitals dataset in the HIFLD, and as such I thought I’d share my own process of converting these datasets into an OSM-readable format.

Workflow

With any good QGIS project, we start by adding the dataset we’re working with, (This particular dataset is accessible here on the HIFLD) and adding openstreetmap as a background layer.

Once that’s done, you map should look something like this:

Now that we’ve got our data loaded, we can go ahead and start converting it over to OSM schema. QGIS has a great set of tools for working with this data, and for this kind of work, Field Calculator and Attribute table are our friends.

The work we will be doing will require a lot of regex to work. I have no formal education in any kind of computer programming, but I do have a good friend that is very good at computer programming, that goes by the name of ChatGPT.

ChatGPT can happily crank out QGIS regex queries all day, as long as you have a grasp on what exactly you need to ask it to get what you want.

Here’s a query that I’ve gotten it to generate me:

“I have US database with a field in QGIS for phone numbers in the format (XXX) XXX-XXXX. I need to convert it to the format +1-XXX-XXX-XXXX using the regexp function”

regexp_replace("TELEPHONE", '^\\((\\d{3})\\) (\\d{3})-(\\d{4})$', '+1-\\1-\\2-\\3')

Do note that if you don’t explicitly tell ChatGPT the name of your field, it will make something up and you will need to fill your own field.

Now one of the most annoying parts of the HIFLD is that they bundle the street and house number into one field, generating something like ‘809 UNIVERSITY BOULEVARD EAST’. This is nothing ChatGPT can’t handle if we’re willing to get a little creative in what we ask it.

“Can you make a query that draws the first word of a sentence as long as it is a number in QGIS regex syntax?”

CASE 
  WHEN regexp_match(LEFT("sentence", strpos("sentence", ' ')-1), '^[0-9]+$') 
  THEN LEFT("sentence", strpos("sentence", ' ')-1) 
  ELSE NULL 
END

“Ok, now I need a qgis regex query that gets every word in a sentence besides the first word, unless the first word is not a number.”

CASE
    WHEN "ADDRESS" ~ '^[0-9]+\\s' THEN regexp_replace("ADDRESS", '^\\S+\\s(.*)$', '\\1')
    ELSE regexp_replace("ADDRESS", '^\\S+\\s+(.*)$', '\\1')
END

Make sure that you’re saving your generated fields in the field calculator correctly. QGIS defaults to 10 character integer, but in most cases it’s best to set it to text, and then set the length to 128 characters (OSM’s maximum)

It’s important to remember that if you’re working with a .shp file, the field name has a maximum of 10 characters, so for fields that are longer than 10 characters, such as addr:housenumber or addr:street, you will have to fill in the whole key when you load your data into JOSM. Other data structures such as geojson don’t have these limitations, but may have performance issues with very large amounts of data.

You won’t have to use ChatGPT for everything. QGIS has a function called ‘title’ that does what you might expect, formatting all the all capital names into first letter capitalization.

This is good for many of your fields, namely the street name, hospital name, etc.

We’ve gotten through the hard part, now to address the rest of the data. This means going back to the attribute table, and seeing what metadata has been provided to us.

To make things a little easier, we can delete the fields that we won’t be using for OSM

I won’t walk through how to interpret every single tag here, since I’ve already covered the processes for using regex and title function to get what you need from most of them.

For each field listed, you just generate the OSM key from your own research on the wiki.

For this dataset, this is the process that would be used for each field, and the OSM key it corresponds to.

Start by adding amenity=hospital to all objects.

NAME- title function, name=*

ADDRESS- regex covered above, addr:street=*

addr:housenumber=*

CITY- title function, addr:city=*

STATE- no function needed, addr:state=*

ZIP- no function needed, addr:postcode=*

TELEPHONE- regex covered above, phone=*

TYPE- This is the hard one, you will need to parse into the OSM healthcare scheme, [1][2]

STATUS- use this field to purge any data that isn’t operational, or is abandoned/closed.

POPULATION- you’ll have to filter out anything with null value (-999) then use capacity=*

WEBSITE- filter out the ‘NOT AVAILABLE’, website=*

STATE_ID- filter out the ‘NOT AVAILABLE’, ref=*

OWNER- no OSM key for this, but may be worth making one.

TRAUMA- currently represented on OSM by healthcare:specialty=trauma

HELIPAD- optionally, you can leave yourself a note to map the helipad of the hospital you’re working on.

It takes a good bit of time to do all this, but once it’s done you’ve got a nationwide dataset of good quality with much more metadata with most data already on OSM.

Discussion

Comment from Anton Khorev on 20 April 2023 at 08:15

What do you think is the difference between THEN and ELSE branches in this expression?

CASE
    WHEN "ADDRESS" ~ '^[0-9]+\\s' THEN regexp_replace("ADDRESS", '^\\S+\\s(.*)$', '\\1')
    ELSE regexp_replace("ADDRESS", '^\\S+\\s+(.*)$', '\\1')
END

Log in to leave a comment