OpenStreetMap

Ongoing voting on the Education 2.0 proposal has demonstrated that some OSM members don’t understand the situation with multiple properties. I was somewhat surprised by this lack of understanding, since it is already explained in the Wiki, so I want to explain it one more time.

Definitions:

  1. Single value stands for the plain key=value case, for example: man_made=tower
  2. Value list stands for a semicolon-separated list of values, for example: sport=basketball;volleyball
  3. Separate keys stand for multiple binary keys, one for each property, for example: sport:basketball=yes, sport:volleyball=yes

Case (1) is the simplest and most traditional in OSM. However, if a feature uses this tagging style, its value becomes exclusive, meaning there can be only one value. For certain features that is completely acceptable because, for example, a man-made structure can’t be a mast and a tower at the same time.

Case (2) was probably the first workaround for non-exclusive values. It is a straightforward approach for cases where properties are non-exclusive, meaning an object can have several properties of one feature. As in the example above, a pitch could be used for playing both basketball and volleyball.

Case (3) is, at first glance, similar to (2), and some people think it is only a different tagging style. Indeed, it represents the same situation, but with separate keys (tags) instead of a value list in one tag, so semantically it has no advantage. It even looks more complicated, because its structure consists of three elements rather than two: namespace:subkey=boolean_value, where namespace corresponds to the key in case (2), subkey corresponds to each value from the list in case (2), and boolean_value is just yes or no (in the case of no, the whole tag is omitted).
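The mapping between cases (2) and (3) can be sketched in a few lines of Python. This is only an illustration under assumptions: representing an object's tags as a plain dict and the function name value_list_to_separate_keys are hypothetical choices made for this example, not part of any OSM tool.

```python
def value_list_to_separate_keys(tags, key):
    """Return a copy of tags where the semicolon list stored under `key`
    (case 2) is rewritten as namespace:subkey=yes tags (case 3)."""
    # Keep every tag except the one holding the value list.
    result = {k: v for k, v in tags.items() if k != key}
    for value in tags.get(key, "").split(";"):
        value = value.strip()
        if value:  # skip empty items, e.g. from a trailing semicolon
            result[f"{key}:{value}"] = "yes"
    return result


pitch = {"leisure": "pitch", "sport": "basketball;volleyball"}
converted = value_list_to_separate_keys(pitch, "sport")
print(converted)
```

Both dicts describe the same pitch, which is the sense in which the two styles are semantically equivalent.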

Technical side

But the thing is, OSM is a database, which means the tags we are using are actually data structures, and data structures should be usable. There is a whole applied discipline on that topic: data architecture. So if you’ve never heard of it, or you don’t know exactly what it stands for, please read the Wikipedia article about it (link is above).

From the general public’s point of view, method (3) has no semantic advantage over method (2). But there is a huge advantage from the point of view of data architecture. That is why method (2) is strongly discouraged in most cases, with only a few exceptions, as is more or less clearly described in the Semi-colon value separator article. In simple terms, the semicolon separator is usually acceptable where the value list contains strings (portions of free text) used for labels or descriptions, not properties. In more technical terms, it is acceptable where the strings in the list are not used for querying objects for any purpose, including subsetting data, applying a rendering style, etc. The exceptions also include complex text structures such as lane tagging or opening hours, which are meant for deeper parsing by design.

Why are value lists so bad for querying? It’s simple: it’s all about performance and having known, predictable data structures. To work with something like sport=basketball;volleyball from case (2), software first needs to break it down and store the resulting list somewhere in memory. Before it starts doing that, it has to determine how many elements are in the list. In cases (1) and (3), the number of operations is smaller and nothing is unpredictable, since namespace:subkey always breaks into a known number of strings in a predictable order (even if the subkey is compound, like subkey:sub-subkey, we usually don’t have to care, because it can be processed as a single string).
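To make the difference in the number of operations concrete, here is a minimal Python sketch of both checks. The helper names has_all_value_list and has_all_separate_keys and the dict-based tag representation are assumptions made for this illustration; real data consumers will differ, and no performance numbers are claimed here.

```python
def has_all_value_list(tags, key, wanted):
    # Case (2): the value must first be split into a list of unknown
    # length, and only then can each wanted item be looked up in it.
    items = [item.strip() for item in tags.get(key, "").split(";")]
    return all(value in items for value in wanted)


def has_all_separate_keys(tags, namespace, wanted):
    # Case (3): one exact key lookup per property, no string parsing.
    return all(tags.get(f"{namespace}:{value}") == "yes" for value in wanted)


list_style = {"leisure": "pitch", "sport": "basketball;volleyball"}
key_style = {
    "leisure": "pitch",
    "sport:basketball": "yes",
    "sport:volleyball": "yes",
}
wanted = ["basketball", "volleyball"]

print(has_all_value_list(list_style, "sport", wanted))
print(has_all_separate_keys(key_style, "sport", wanted))
```

Note that skipping the split and testing for a substring instead would be wrong for case (2): "volleyball" is a substring of a value such as "beachvolleyball", so the list really does have to be parsed.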

There is another technological reason: OSM uses XML-style “attribute=value” constructs, so it is quite logical to be able to use tools intended for working with XML, such as query languages, frameworks and so on. Usually these tools are not intended for dealing with values made up of lists (for the technical reasons explained above), while methods (1) and (3) are perfectly compatible with the general XML ecosystem.
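As a small illustration of the XML point, the sketch below uses the limited XPath support in Python's standard xml.etree.ElementTree module. The two-way document is made-up sample data in the shape of an OSM XML extract, trimmed to the elements needed here.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data: way 1 uses separate keys, way 2 a value list.
OSM_XML = """
<osm>
  <way id="1">
    <tag k="leisure" v="pitch"/>
    <tag k="sport:basketball" v="yes"/>
    <tag k="sport:volleyball" v="yes"/>
  </way>
  <way id="2">
    <tag k="leisure" v="pitch"/>
    <tag k="sport" v="basketball;volleyball"/>
  </way>
</osm>
"""

root = ET.fromstring(OSM_XML)

# Case (3): plain attribute-equality predicates are enough.
separate = [
    way.get("id")
    for way in root.findall("way")
    if way.find("tag[@k='sport:basketball'][@v='yes']") is not None
    and way.find("tag[@k='sport:volleyball'][@v='yes']") is not None
]

# Case (2): the same kind of predicate only matches the whole string,
# so looking for a single list item finds nothing...
exact = root.findall("way/tag[@k='sport'][@v='basketball']")

# ...and the list has to be split outside of the XML tooling.
value_list = [
    way.get("id")
    for way in root.findall("way")
    for tag in way.findall("tag[@k='sport']")
    if {"basketball", "volleyball"} <= set(tag.get("v").split(";"))
]

print(separate, exact, value_list)
```

Fuller XPath implementations offer string functions that could dig into the list, but that is extra parsing on top of the plain attribute comparison that covers cases (1) and (3).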

Anyone can easily get some experience and compare methods (2) and (3) by writing an Overpass API query (Overpass is a significant part of the OSM ecosystem) for both cases given in the definitions section above: sport=basketball;volleyball and sport:basketball=yes, sport:volleyball=yes. Imagine that you need to select all pitches with both of these properties. I can guarantee that method (2) will require CPU-hungry regular expressions, which will inevitably reduce query performance.
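A rough sketch of that comparison might look like the following. The two Overpass QL queries are kept as Python strings purely for illustration: they are one possible way to write the selection, they have no bounding box (one should be added before running them), and nothing here was benchmarked. The function matches_value_list is a hypothetical local stand-in that applies the same pattern with Python's re module, whose regex dialect is not identical to the one Overpass uses.

```python
import re

# Method (3): exact-match filters only, no regular expressions.
QUERY_SEPARATE_KEYS = """
way["leisure"="pitch"]["sport:basketball"="yes"]["sport:volleyball"="yes"];
out geom;
"""

# Method (2): each wanted value needs its own regex filter, anchored on
# the start/end of the string or on a semicolon.
QUERY_VALUE_LIST = """
way["leisure"="pitch"]["sport"~"(^|;)basketball(;|$)"]["sport"~"(^|;)volleyball(;|$)"];
out geom;
"""


def matches_value_list(value, wanted):
    """Apply the list-item pattern from QUERY_VALUE_LIST to one tag value."""
    return all(
        re.search(rf"(^|;){re.escape(item)}(;|$)", value) is not None
        for item in wanted
    )


wanted = ["basketball", "volleyball"]
print(matches_value_list("basketball;volleyball", wanted))
print(matches_value_list("volleyball;basketball", wanted))
print(matches_value_list("basketball;beachvolleyball", wanted))
```

The anchoring is what keeps a value such as beachvolleyball from matching volleyball. A list written with a space after the semicolon would still slip through this pattern, which is one more detail the separate keys query never has to consider.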

Objections

Voters on the Education 2.0 proposal have expressed certain objections to method (3) that I want to address directly.

… semicolon-separated lists are much more concise and therefore easier.

Easier for what? For reading, maybe (if the list is very short), but not for any technical purpose, including tagging preset development, editor interface development and so on.

… it’s just an aversion to semicolon-separated values. I understand that this is a proposal by Russians, and some Russians such as XXzme have expressed their aversion to semicolon-separated values. I respect their opinion, but I have the opposite opinion, sorry.

No, it’s logic, not an aversion. Read everything above. And it’s not “another opinion”. An opinion is a view that is not necessarily based on facts or logic, whereas the logic behind preferring method (3) is explained both in the Semi-colon value separator article and in this diary entry. If someone finds it false, feel free to point that out. Otherwise, you have an aversion based on personal preference, which doesn’t have to be respected, since it contradicts reasonable requirements for tagging schemes.

… you cannot model the whole world in a key:*=yes ontology, …

That’s actually classic demagogy: claiming that your opponent said something false (which he didn’t) and then proving it false to discredit him. Nobody claimed that it’s possible and/or necessary to use method (3) to model the whole world (that is another piece of demagogy: incorrect universal quantification). If someone thinks that a collision of exclusive properties will never occur for a certain feature, it’s okay to point that out and explain why method (1) is acceptable. But using demagogy usually reveals a lack of real arguments.

… that ridicules the key=value scheme in OSM

How exactly? Again, method (3) is actually recommended by the OSM documentation in the Semi-colon value separator article. And technologically it is much closer to method (1) than method (2) is.

Conclusion

The tagging method with separate subkeys and Boolean values is an effective part of the tagging system. It doesn’t replace other methods completely, and nobody claims that it does. For cases where any chance of collision between exclusive values exists, it’s the only effective method. The value list method is equivalent to it only semantically; it is inferior in the technical aspects of data architecture, performance and usability. Usability for mappers strongly depends on tools: complex schemes are rarely used in a fully manual manner, since plugins, presets and other tools help people use them, so it doesn’t really matter how exactly these tags look.

And one last thing. Having been part of the OSM project for several years, I’ve heard many references to national features, like “in this country, we do it this way”. But I’ve never seen anyone make general references to any nation with a negative connotation, as happened in one of the comments on an opposing vote on the Education 2.0 proposal. Indeed, there are several members of the Russian community, including myself and Xxzme (notorious for his Wiki activity, but with no connection to this proposal), mentioned in that comment, who are trying to promote the separate keys method for cases where a collision of properties can occur, and who from time to time criticize legacy tagging methods.

But shouldn’t judgment be based on what is said and its logic, instead of who said it or other personal features? I really hope the opposite will never happen again.

Comment from GinaroZ on 25 July 2016 at 21:48

I notice that the iD editor recently introduced an easy to use multicombo UI for selecting the types of material accepted by a recycling node. Perhaps the same could be done for the sport tag, since that is one of the main places I see the semi-colon being used?

Comment from BushmanK on 25 July 2016 at 22:15

@GinaroZ,

First of all, I’m not discussing any specific tags here. I’m also not looking for a solution for dealing with existing schemes that use value lists in any particular software or service.

For sure, working with lists isn’t impossible, and I didn’t say it is. However, if you count all the technical costs caused by this scheme, you will see its bad side.

Comment from Minh Nguyen on 25 July 2016 at 23:58

In the past, the OSM community made a lot of tagging decisions that were largely driven by a need to make it comfortable and efficient to type out raw tags by hand. These days, those considerations are less important, now that the most popular editors all rely heavily on presets to maximize discoverability.

Parallels can be found in other aspects of tagging too. For a long time, proponents of route relations met quite a bit of resistance from mappers who felt that they were tedious and redundant to information already found on each way. But as we increasingly need to express additional information about routes, and as ways get split with more and more granularity for lane tagging, resistance to route relations has become less vocal – at least in the U.S., where there’s a demonstrated need for them.

The multicombo UI in iD produces exactly the type of tags that BushmanK is arguing in favor of (multiple tags with colons in the keys). But visually, it’s all on one line, so it’s more convenient to work with this UI than it would be to type a semicolon-delimited list (since the control can autocomplete each item in the list). I would hope that everyone involved in discussions about the OSM data model keep in mind the needs and benefits of intelligent editor software.

Comment from BushmanK on 26 July 2016 at 00:42

@Minh Nguyen,

As a bottom line to what you just said, I’d like to add that data structures have objective criteria of technical usability and accessibility, while usability for people is very subjective and varies from person to person. Therefore, with limited resources, it’s easier to meet objective criteria than to please everyone’s personal taste.

Someone could say that adding an abstraction layer between OSM data and the end user (I mean the developer of a rendering engine, map style or spatial data product) can solve any data structure issue by pre-processing the data. And it is currently required anyway to handle some list values. But there are several things to keep in mind:

  • It’s a fact that only a few application developers were actually brave enough to dive into developing value list support. This shows that the practice is undesirable for developers and end users of the data, and OSM doesn’t exist for itself: it exists precisely for these users.
  • An abstraction layer can be added not only between OSM data and its users, but also between mappers and the data. I mean editor presets, a “tag-less” interface similar to iD, and so on. And, ironically, having better data structures in the database makes it easier for editor developers to create and support this kind of abstraction layer.
  • The complexity of a real-world entity and of its description (model) can’t differ significantly. A complex entity always has a complex model if there is no way to drop some of its features. So there are, and there will be more, complex data structures (tagging schemes) that require a smart editor GUI to make mapping easier.

For all these reasons, the technological quality of data structures (tagging schemes) is more important than how appealing they are to mappers who’d like to type them manually. Without a better editor interface, only a few people will use such a scheme anyway, regardless of its style.

