Composite keys in OpenStreetMap: ref:highway, highway:ref or highway_ref?

Posted by Zverik on 3 October 2017 in English.

We have all used building:levels and alt_name without giving them a second thought. Why are these keys built that way? Why not levels:building? To me, it looks like there is a rule for building composite keys.

Suffixes

ref is the basic tag for storing a reference number. For a reference number in some third-party table, we add a suffix: ref:third_party. The new tag still contains a reference number, so all such numbers are kept in ref* keys. The rule of thumb is that the meaning of a value is defined by the basic tag before the suffix: ref:third_party is still a ref, and source:maxspeed is still a source.
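The rule of thumb can be sketched in a few lines of Python. This is only an illustration: the helper name base_tag is made up for this post, and the sketch covers plain suffixes only, not namespaces such as addr:.

```python
def base_tag(key: str) -> str:
    """Return the part of a key before the first colon.

    Under the suffix rule, this part defines the meaning of the value.
    Hypothetical helper, not part of any OSM library.
    """
    return key.split(":", 1)[0]


for key in ("ref", "ref:third_party", "source:maxspeed"):
    print(key, "->", base_tag(key))
```

A key without a colon is its own basic tag, so the function returns it unchanged.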

Sometimes we cannot use suffixes for historical reasons. That is the case with name: the suffix is already taken by name:en and similar keys for names in other languages. For that reason, we build composite keys by prepending a qualifier with an underscore: local_name or place_name. These are still names, with the order reversed compared to the colon notation.

Of course, an underscore is also used for multi-word keys: public_transport and admin_level.

Namespaces

Then, there are namespaces. The best known is addr:, with addr:housenumber and so on. Without a suffix, the addr key has no meaning. The same goes for contact: and turn:. Namespaces mark a group of tags that share a meaning, have similar value formats, and are usually described on a single wiki page.

Some namespaces are used for tying properties to a part of the object described by the main tag, and for adding more specific properties of it. For example, building:* tags describe attributes of a building, and we also have roof:type and fire_hydrant:type. These words are most often put on the same object as a key or a value, e.g. building=yes or amenity=fire_hydrant, but also can mean a part of a structure denoted by these tags, like how buildings almost always have a roof.

The definition of namespaces is very vague, and some people mistake basic tags for namespaces. For example, we have 2.6k addr tags in the database. Sometimes people try to impose a prefix on a set of established and well-used tags to group them: it improves sorting in editors and allows for introducing many more similarly named keys without “polluting” the namespace-less set of tags. That is what happened with the “contact:” prefix: it is rare to see imports using “phone” and “website” tags without it.

Suffix or a namespace?

Telling a basic tag with a suffix from a namespace of the second type is harder. For example, what would be correct, building:height or height:building? roof:height or height:roof? This depends on four things:

1. Which of the basic tags for each of the parts is used more often, and hence is expected to come first? In this case, building is used 28 times more than height. The roof key is virtually unused.
2. Which of these parts is more commonly used as a namespace? height: is used as a namespace for only three popular (more than a hundred usages) keys, none of which is globally spread. For building:, the number of prefixed keys with more than a hundred usages is around 120; for roof:, around 30.
3. When removing the suffix, will the value be meaningful for the basic tag? It definitely won’t be for building=100 m and roof=100 m, but will be for height=100 m.
4. Will the basic tag without a suffix have the same meaning for the kind of objects with other similarly namespaced keys? In the case of buildings, height would be enough without a suffix, and these tags are pretty widespread. But roofs are parts of buildings, so you would need either a suffix or a namespace.

So, for building height you would use a plain height key because of the fourth point. But for roof height, you would choose roof:height because roof: is commonly used as a prefix, as per the second point, unlike height:.
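The reasoning in the previous paragraph can be written down as a small Python sketch. It is only an illustration under stated assumptions: the function name choose_key and the is_whole_object flag are invented here, the counts are the rough figures quoted in this post rather than live taginfo data, and only the second and fourth points are modelled.

```python
# Approximate number of popular keys (more than a hundred usages) under
# each prefix, as quoted in this post. Illustrative figures only.
NAMESPACE_KEYS = {"building": 120, "roof": 30, "height": 3}


def choose_key(feature: str, prop: str, is_whole_object: bool) -> str:
    """Pick a key for a property of a feature (hypothetical helper)."""
    # Fourth point: a property of the whole object needs no prefix.
    if is_whole_object:
        return prop
    # Second point: the part more commonly used as a namespace goes first.
    first, second = sorted(
        (feature, prop),
        key=lambda part: NAMESPACE_KEYS.get(part, 0),
        reverse=True,
    )
    return f"{first}:{second}"


print(choose_key("building", "height", is_whole_object=True))
print(choose_key("roof", "height", is_whole_object=False))
```

With these figures the sketch gives height for a building and roof:height for a roof, matching the choices above.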

A case against brand:wikipedia

The reason for this post is the recent import of thousands of brand:wikipedia and brand:wikidata tags. I argue that the better choice would be wikipedia:brand and wikidata:brand, for the same reason as source:maxspeed and ref:whatever.

I accept the introduction of separate tags for an object and its brand: we can have two links for the McDonalds brand and a single notable restaurant under that brand. That covers item 4 in the above list, and item 2 is not applicable, since neither the wikipedia key nor the brand key has been used as a namespace. But points 1 and 3 are in favour of wikipedia:brand: the value is still a wikipedia article, and it is processed similarly to the value of the wikipedia tag. And we have four times more wikipedia keys than brand keys.

To conclude, I suggest we do a mass-retagging of these imported or automatically processed keys before this mistake creeps into the wiki. Either wikipedia:brand or brand_wikipedia would be better options.

Past mistakes

In some cases we failed to notice composite keys in proposals that are built contrary to the norm described here. Now you have to do some non-obvious tagging, which requires looking for the correct keys in the wiki:

bridge:name instead of bridge_name (like old_name)
source:ref, though the correct key source_ref is used 10 times more often. Note that ref:source would not be entirely correct, since you should be more specific in the suffix. source=tmnt with ref:tmnt=1 would be the correct choice, better than source_ref=1.
This whole section on *:wikipedia was prompted by this edit. Thankfully, we have only 20k of these keys, including the imported brand:wikipedia, so there is still time to fix this.

Discussion

Comment from escada on 3 October 2017 at 13:05

Does the fact that wikipedia: exists have any impact on your reasoning? Is there a potential problem where wikipedia:brand could be taken for the Wikipedia article in the "brand" language? I think you cannot use wikipedia: due to the wikipedia: syntax. What is your opinion?

Comment from escada on 3 October 2017 at 13:06

“due to the wikipedia: syntax” should be “due to the wikipedia: syntax"

Comment from escada on 3 October 2017 at 13:07

problems with markdown. sorry. wikipedia: should be wikipedia:language.

Comment from dieterdreist on 3 October 2017 at 13:15

hierarchy vs flat

IMHO there is a difference between “alt_name” and “building:levels”.

colon stands for hierarchy

If you use the colon, the information is structured (building: and then all the properties referring to building).

underscore doesn’t imply hierarchy

The underscore replaces the space and means the term has to be read “as one term”, although it is several actual words. Like in “tourist_bus”, “public_transport”, “alt_name”, “start_date”. Yes, a start date is a date, a tourist bus is a bus and public transport is a part of transport, but there is no hierarchy, you have to understand the term to make sense of it, order of words doesn’t imply which part is more important and which is the qualifier. “man_made” might even be a part of everything “made”, but for example “leaf_type” is not a part of all “types” (well, only if you twist your brain). Similarly “is_in”, “opening_hours” or “passenger_lines”.

Comment from dieterdreist on 3 October 2017 at 13:33

The story of bridge:name is another particularity of the history of OSM. Basically it stems from our reluctance to introduce a bridge object. It took us 15 years to get a tag for bridges into OSM (talking about man_made=bridge here). Until then, the only way to find bridges was indirectly, by looking for things on a bridge and inferring that there must be a bridge somewhere below (or many, because this system didn’t tell you how many nearby bridges there were: separate carriageways could be on the same bridge or on parallel bridges). If man_made=bridge had been introduced earlier, we likely wouldn’t have gotten a bridge:name at all. If we had waited even longer, we could have gotten bridge:wikipedia and more. We don’t use street_name for things that are on a street, do we? (Well, we might, because until there are area highways we also can’t tell reliably whether something is on a street or only close to it.)

Comment from Baloo Uriza on 3 October 2017 at 16:41

I’d argue that using ref=* on highways to describe road routes was a big mistake in the first place and a legacy of not having relations. This further complicates things in that Oregon has separate highway numbers and route numbers that don’t usually match up; the highway number belongs to the road, not the route; and the route number might cross several different highways.

Comment from TheSwavu on 10 October 2017 at 08:42

source_ref and source:ref mean two different things.

source_ref is the reference for the source of data or as taginfo puts it “used to link external source of information:”. If you look at the tag values over 90% of them are from two imports and point to the url describing the source.

source:ref on the other hand tells you where the value of the ref tag came from. So:

name=Something
source:name=survey
source:name:date=2014-05
ref=256
source:ref=Some dataset
source:geometry=bing
highway=trunk
source:highway=Some other dataset
...
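Read this way, each source:<key> tag can be paired with the tag it documents. The following Python sketch shows that pairing on the tags above; the function name tag_sources is made up for illustration and is not part of any OSM tool.

```python
from typing import Dict, Optional


def tag_sources(tags: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Map every non-source key to the value of its source:<key> tag, if any."""
    return {
        key: tags.get(f"source:{key}")
        for key in tags
        if not key.startswith("source:")
    }


tags = {
    "name": "Something",
    "source:name": "survey",
    "source:name:date": "2014-05",
    "ref": "256",
    "source:ref": "Some dataset",
    "source:geometry": "bing",
    "highway": "trunk",
    "source:highway": "Some other dataset",
}
print(tag_sources(tags))
```

Here ref is paired with "Some dataset", while source:geometry is skipped because geometry is not itself a tag on the object.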
