With the new release of more than 59 million points of interest (POIs) from Overture, consisting of Microsoft and Meta POI datasets combined, the natural question arises: how can this be useful for OpenStreetMap?
Challenges to consider
The most important challenge in getting this data into OSM is making sure the place labels in Overture have an equivalent in OSM. This is mostly doable with automation, but many cases require context.
Validating the POIs themselves is the next challenge: street-level imagery from Mapillary will be especially helpful, but being there in person to verify is still a big advantage. Setting that aside, even if the data is added to OSM one by one (not imported) with validation, the tags first need to be in a proper OSM format.
Loading up the data to analyze
I got started by referencing Feye Andal’s great and succinct guide on viewing the data in AWS Athena. One point in the instructions could be clearer: your Athena instance and the S3 bucket where query results are saved both need to be in the us-west-2 region, the same region as the Overture dataset, unless you first copy the dataset to a bucket in another region. Keep the regions the same and the instructions work flawlessly!
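For reference, here is a minimal sketch of what running such a query from Python with boto3 can look like; the database name, table name, and results bucket below are placeholders for whatever you set up while following the guide, not the exact names it uses:

import time
import boto3

# Both the Athena client and the results bucket are in us-west-2,
# the same region as the Overture dataset.
athena = boto3.client("athena", region_name="us-west-2")

execution = athena.start_query_execution(
    QueryString="SELECT * FROM places LIMIT 10",  # table name is a placeholder
    QueryExecutionContext={"Database": "overture"},  # database name is a placeholder
    ResultConfiguration={"OutputLocation": "s3://my-athena-results-us-west-2/"},
)

# Poll until the query finishes; the CSV result lands in the output bucket.
while True:
    state = athena.get_query_execution(
        QueryExecutionId=execution["QueryExecutionId"]
    )["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)
print(state)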
Analyzing the data
Exploring the dataset, there are 1037 unique place labels in it. More than 86,000 POIs are labeled structure_and_geography, which can refer to a wide range of natural geography or built structures in OSM and is difficult to match with any specific tag without context. Others translate directly, such as a laundromat.
Some example labels include: "forest", "stadium_arena", "farm", "professional_services", "baptist_church", "park", "print_media", "spas", "passport_and_visa_services", "restaurant", "dentist"
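Counting those labels is straightforward once the query results are exported. A minimal sketch, assuming the results were saved as a CSV and that the label column is called category (both of those are assumptions on my part):

import pandas as pd

# Hypothetical CSV exported from the Athena query results
df = pd.read_csv("overture_places.csv")

labels = df["category"].dropna()        # column name is an assumption
print(labels.nunique())                 # number of unique place labels (1037 here)
print(labels.value_counts().head(20))   # most frequent labels, e.g. structure_and_geography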
To get most of the labels matched, I used Python with the OpenAI module, connected to my OpenAI account; each request costs a few fractions of a penny.
I set a system message, which defines the role the AI should play or assume. My message was:
system_msg = 'You are a helpful assistant who understands data structures, place and map data labeling ontology, and OpenStreetMap tagging. I will give you single labels of a POI category, and you will give me back the single OSM equivalent tag that most makes sense in the format of list with a single string like ["key=value"] unless it has multiple tags such as a mexican restaurant, then give the list of multiple like ["amenity=restaurant","cuisine=mexican"] or if there is no good match you will write back in all caps, ["UNKNOWN"]. Only include a list of tags or the list with unknown value, do not include any dialogue.'
I made an empty dictionary:
overture_osm_dict = { }
Then I made a list of all the unique Overture labels and looped through it. My code looks like:
import openai  # openai.api_key is assumed to be set elsewhere

for tag in overture_tags:
    if tag not in overture_osm_dict:  # skip labels already matched
        user_msg = tag
        response = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "system", "content": system_msg},
                      {"role": "user", "content": user_msg}])
        # The assistant replies with a string like '["key=value"]'
        osm_tag = response["choices"][0]["message"]["content"]
        overture_osm_dict[tag] = osm_tag
It is recommended to add a sleep timer, or a handler for timeout responses: running through the 1037 items came with probably 10 timeouts.
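A minimal sketch of that kind of handling, wrapping the same call (reusing system_msg and user_msg from the loop above) in a retry loop with a pause; the retry count and sleep length are arbitrary choices, and you may prefer to catch the library's specific timeout/rate-limit exceptions rather than a bare Exception:

import time

response = None
for attempt in range(3):  # arbitrary number of retries
    try:
        response = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "system", "content": system_msg},
                      {"role": "user", "content": user_msg}])
        break
    except Exception:  # ideally catch the specific timeout/rate-limit errors
        time.sleep(5)  # arbitrary pause before retrying
if response is None:
    print(f"giving up on {user_msg}, revisit it later")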
In the end I had a few tags that were unknown, and I made manual fixes as needed. Running the loop multiple times yielded different results, so it is good to be aware that the AI is not consistent.
I made various fixes to the JSON structure, including stray line breaks, quotation marks in the wrong place, and badly formatted tags. Some tags were also simply invented, it seemed, such as amenity=water_supplier for Overture’s water_supplier, which I changed to office=water_utility, though that could be quite wrong depending on the POI.
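A quick way to surface those problems is to try parsing every response and flag anything that is not a clean JSON list of strings. A rough sketch of that check (the validation rules are my own assumptions about what a well-formed entry should look like):

import json

bad_entries = {}
for label, raw in overture_osm_dict.items():
    try:
        tags = json.loads(raw)
        # Expect a JSON list of strings like ["key=value"] or ["UNKNOWN"]
        assert isinstance(tags, list) and all(isinstance(t, str) for t in tags)
        overture_osm_dict[label] = tags  # keep the parsed list instead of the raw string
    except (json.JSONDecodeError, AssertionError):
        bad_entries[label] = raw  # needs a manual fix

print(len(bad_entries), "entries need manual fixes")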
There are other debatable tags that came out as unknown, so I added tags in manually (a short sketch of applying these overrides follows the list):
- "personal_assistant": ["office=administrative"]
- "kids_recreation_and_party": ["shop=party"]
- "sewing_and_alterations": ["shop=tailor"] instead of "craft=sewing"
- "sports_bar": ["amenity=bar", "sport="], dropping "sport=" so it is just a bar
There are many more up for review.
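Applying these manual overrides on top of the AI-generated dictionary is then just an update (the values are the ones from the list above, with sports_bar reduced to a plain bar):

# Manual overrides for labels the AI marked UNKNOWN or got wrong
manual_fixes = {
    "personal_assistant": ["office=administrative"],
    "kids_recreation_and_party": ["shop=party"],
    "sewing_and_alterations": ["shop=tailor"],
    "sports_bar": ["amenity=bar"],
}
overture_osm_dict.update(manual_fixes)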
In my final version, the dictionary is something like:
{ "forest": [
"landuse=forest" ], "stadium_arena": [
"leisure=stadium" ], "farm": [
"landuse=farm" ], "professional_services": [
"office" ], "baptist_church": [
"amenity=place_of_worship",
"religion=baptist" ], "park": [
"leisure=park" ], "print_media": [
"amenity=newspaper" ], "spas": [
"amenity=spa" ], "passport_and_visa_services": [
"office=government",
"office=visa",
"office=passport" ], "restaurant": [
"amenity=restaurant" ], "dentist": [
"amenity=dentist" ], "sports_club_and_league": [
"sport=club" ], "thai_restaurant": [
"amenity=restaurant",
"cuisine=thai" ], "clothing_store": [
"shop=clothes" ], "insurance_agency": [
"office=insurance" ], "barber": [
"shop=hairdresser" ], "bar": [
"amenity=bar" ], "agriculture": [
"landuse=farmland" ], "accommodation": [
"amenity=hotel" ], "event_planning": [
"amenity=event_planning" ], "non_governmental_association": [
"amenity=community_centre" ], "elementary_school": [
"amenity=school",
"education=primary" ], "landmark_and_historical_building": [
"historic=yes" ], "gym": [
"leisure=sports_centre" ], "pilates_studio": [
"amenity=gym",
"sport=pilates" ], "hotel": [
"tourism=hotel" ], "advertising_agency": [
"office=advertising_agency" ], "educational_research_institute": [
"amenity=school",
"research_institute=yes" ], "furniture_store": [
"shop=furniture" ], ....
The full dictionary is available to download as a GitHub gist, and I hope to get feedback on it so we can arrive at a more officially agreed-upon translation of the tags.
Conclusion
These POIs offer a lot of opportunity to improve one of the categories most often cited as lacking in OSM. The quality is not perfect, whether in location accuracy, proper tagging, and so on, but it is at least professionally curated. Nothing is better than crowdsourcing (which is how many POIs sourced from Facebook business pages or Foursquare check-ins are generated), and OSM is the best spatial crowdsourcing platform in the world.
Some data needs special analysis. For example, I asked the AI to help with a structure_and_geography POI that I could not verify without context; the AI noticed that the Turkish name contains the Turkish word for “harbor” and recommended the tag “natural=harbor”.
Before we can start finding ways to validate the data and ingest it into the map on a case by case basis, we need to have a good basis for the tagging. The user can always modify this to be more appropriate before confirming and sending an OSM changeset, but getting a good first guess to present to users helps reduce the friction and increase the success rate.