Natural language vs. abstract tags
Posted by BushmanK on 1 January 2016 in English. Last updated on 10 January 2016.
This article was originally published in Russian in the collaborative IT magazine Habrahabr.ru, but I believe it would also be interesting for those who can’t read Russian. So here is the translated version (I have permission to translate and publish it here). My own notes and additions are enclosed in brackets and set in italics, like [this].
Those who are familiar with OpenStreetMap should also be familiar with a couple of the project’s core principles: «any tags you like», and the primacy of the geodatabase over «updating a map on the front page of osm.org». But, keeping the first principle in mind, is everything really that nice and perfect with the semantic structure of that geodatabase? After reading the Russian section of the OSM forum, I decided to investigate.
Here are some historical facts. The OSM project was born in the UK, so British English is used for tags. [But you can find this fact only in discussions or in certain key descriptions; there is nothing about it in the recommendations for tag proposals.] Hence leisure=sports_centre and place=neighbourhood. Sometimes you can find German words, for example for a certain type of megalithic object (although this tag is not recommended): megalith_type=grosssteingrab, from Großsteingrab (dolmen).
A tag usually consists of a key (to the left of the equals sign) and a value (to the right of it). This hints that the key reflects some class of objects or properties, while the value stands for a single type of object or an exact value of a property. Sometimes keys and values use so-called namespaces. This usually happens when several tags belong to one scheme in which the root tag can have additional properties. For example:
social_facility=day_care
social_facility:for=senior
This set of tags tells us that it’s a «social facility, day care, for seniors». This principle of tagging is close to perfect, because we can have any number of additional properties associated with the root key.
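As a rough illustration of why the namespace convention is easy to work with, here is a minimal Python sketch that groups namespaced keys under their root key. The function name group_by_root and the layout of the resulting dictionary are made up for this example; they are not part of any OSM tool.

```python
def group_by_root(tags):
    """Group namespaced keys (root:suffix) under their root key."""
    grouped = {}
    for key, value in tags.items():
        # "social_facility:for" -> root "social_facility", suffix "for"
        root, _, suffix = key.partition(":")
        entry = grouped.setdefault(root, {"value": None, "properties": {}})
        if suffix:
            entry["properties"][suffix] = value
        else:
            entry["value"] = value
    return grouped


tags = {"social_facility": "day_care", "social_facility:for": "senior"}
print(group_by_root(tags)["social_facility"])
# {'value': 'day_care', 'properties': {'for': 'senior'}}
```

Any further property of the facility would simply become one more entry under "properties", without touching the root tag.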
Another, even more common, case is to have refinement tags without a namespace. For example:
barrier=bollard
bollard=removable
It means «artificial obstacle, bollard, removable». From the point of view of natural language, bollard=removable sounds odd, but keep in mind that OSM uses these words as abstract entities. Ideally, these entities should be thoroughly defined in the OSM Wiki (which doesn’t always happen). The downside of this way of making refining tags is that it’s impossible to have more than one refinement, because you can’t assign multiple tags with the same key to an object. But it’s okay to have any number of non-specific refining tags. For example, this bollard could have tags for material and size: material=concrete, height=0.7.
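The one-value-per-key limitation described above can be seen in a short Python sketch that models an object’s tags as a dictionary, which, like an OSM object, holds exactly one value per key. The helper split_tags is hypothetical; it only illustrates the convention that the value of the root tag doubles as the key of the refinement tag.

```python
def split_tags(tags, root_key):
    """Split tags into the main value, its refinement, and the other tags."""
    main_value = tags.get(root_key)  # e.g. "bollard"
    # The refinement lives under a key named after the main value (bollard=*).
    refinement = tags.get(main_value) if main_value else None
    generic = {k: v for k, v in tags.items() if k not in (root_key, main_value)}
    return main_value, refinement, generic


tags = {
    "barrier": "bollard",
    "bollard": "removable",
    "material": "concrete",
    "height": "0.7",
}
print(split_tags(tags, "barrier"))
# ('bollard', 'removable', {'material': 'concrete', 'height': '0.7'})

# A second refinement under the same key cannot coexist with the first one:
# it simply overwrites it ("rising" is just an illustrative second value).
tags["bollard"] = "rising"
print(split_tags(tags, "barrier")[1])
# rising
```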
Until now, everything looks good and totally understandable. But it’s always easy to break something. Obviously, any database intended for storing structured data, querying it, subsetting it, filtering it and so on should have a more or less uniform semantic architecture. Otherwise, it simply turns into text. But the OSM database, being intended for making derivative maps, has to store data about the real world, which many people perceive just «as is», in the form of whole objects, without any attempt to describe them with formal properties. People are simply used to talking about the things they see.
A typical scenario for large databases in complex projects (online stores, for example) is querying data for a particular user. In some cases it’s just a parametric search, while in others it’s a so-called «smart» search, which associates sets of formal properties with free-form search queries.
OSM is in the opposite situation: contributors are submitting data, not querying it. And every contributor does that as well as he can, particularly when it comes to his ability to pick out the key features of objects. But when this is combined with the «any tags you like» principle (intended to keep the tagging scheme extensible and adaptable to any purpose) and with the use of natural-language terms, it can lead to almost catastrophic consequences, or at least to a highly indeterminate situation.
Please answer this question: when you want to buy a certain thing, do you look for a store that should have it, or for a store where it might be found? Usually, people don’t want to spend their time running from store to store. But there are a lot of tags in OSM standing for stores selling nobody-knows-what. For example, shop=kiosk. If you read the tag description, you’ll find out that a kiosk usually sells everything from newspapers to tobacco products [and it varies from country to country]. So, looking for a pack of cigarettes, you may or may not find it there. The only more or less clear feature of a kiosk is its size: it’s relatively small. And you can’t even tell whether it’s a separate small building or something built into a bigger one. In some countries, the term is used for small shops in general.
This tag was copied directly from natural language. You can thank Etric Celine for it. On his Wiki page, he clearly states that he is an anarchist and that he disrespects the formal procedure of proposing and discussing new tags. But at the same time, he thinks it’s very important for everybody to «do something». He actually did: he introduced a tag that stands for nobody-knows-what exactly. There are almost 50000 «shops of nobody knows what» in OSM. And lots of people who have only just started to contribute, and who don’t really understand how important it is to keep everything as structured as possible, immediately start using this tag for every small store, even if there is a better tag for a news agent, ice cream shop, tobacco shop and so on. Why are they doing that? Because they don’t care about semantic structure, and they are used to calling these shops «kiosk».
For an experienced developer or database architect, it looks completely ridiculous that there is no way to tell what classes of goods a particular store sells, while there is a way to record its natural-language name, such as «supermarket», «kiosk» or «mall». But that’s how it works in OSM. Yes, there are tags for book stores, DIY stores and so on. At the same time, you can’t tell anything about stores with a more or less universal range of goods, such as a «supermarket» or a «mall». By the way, it’s quite hard to tell the difference between a «supermarket» and a «mall». [And this raises the question of whether these tags can be verified, which is an important principle of OSM.] Yes, we know that a mall is a «large building, where many shops, restaurants and entertainment facilities are located», but a supermarket can also rent out some of its space to owners of smaller shops. So where is the border between a supermarket and a mall? Moreover, is it important at all to have different tags for them?
The majority of widely used tags have blurred boundaries of usage. For example, it is really hard to define a formal difference between a cafe and a restaurant. The difference is not in size, not in having or not having waiters, not in the menu, not in working hours, not in the seating arrangement, not in a reservation requirement, not in price or anything else. These are just words. Sure, in certain cases it’s hard to call a place a «cafe» if it offers high-end service and so on. But where is the border? There is no border. That’s why the usage of amenity=cafe and amenity=restaurant is indeterminate, and it goes against the requirement of being verifiable. This principle says that every object submitted to OSM should be tagged in such a way that any other contributor can confirm it using the same source. The presence of the word «restaurant» in a name is not a good criterion: there are similar words in certain languages (like the borrowed «кафе» and «ресторан» in Russian), but what about other languages? What about Czech hospoda or Polish tawerna? This road leads nowhere, because it makes no sense (and is also impossible) to create a tag for every natural-language term.
It is important to use abstraction as a mental instrument to look for meaningful properties and to tag these properties, regardless of one’s own habits. Only this approach makes it possible to offer valuable and well-defined information to OSM data users. Only then will the user not have to guess what the idea was behind tagging a particular place as a restaurant or a cafe. Parametric search, or offering a list of properties, is a much more user-friendly solution than showing a whole list of restaurants, cafes and other places without any descriptions.
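To make the contrast concrete, here is a small Python sketch of a parametric search over property tags. The places and the property keys used here (table_service, takeaway, seating) are invented purely for illustration; they are not an existing or proposed OSM scheme.

```python
def find(places, **required):
    """Return names of places whose tags contain every required key=value pair."""
    return [
        place["name"]
        for place in places
        if all(place["tags"].get(k) == v for k, v in required.items())
    ]


# Hypothetical sample data: the same label can hide very different places.
places = [
    {"name": "A", "tags": {"amenity": "cafe", "table_service": "yes",
                           "takeaway": "no", "seating": "yes"}},
    {"name": "B", "tags": {"amenity": "restaurant", "table_service": "yes",
                           "takeaway": "no", "seating": "yes"}},
    {"name": "C", "tags": {"amenity": "cafe", "table_service": "no",
                           "takeaway": "yes", "seating": "no"}},
]

print(find(places, amenity="cafe"))                      # ['A', 'C']
print(find(places, table_service="yes", seating="yes"))  # ['A', 'B']
```

The query by label returns two quite different places, while the query by properties returns exactly what was asked for, whatever the places happen to be called.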
Sometimes you can see an attempt at a kind of scientific classification, but natural language and everyday knowledge can affect it negatively too. Almost from the beginning of OSM, there were two tags: wood=coniferous and wood=deciduous. People [English-speakers] got used to thinking of these words as opposites. But actually, there are deciduous and evergreen trees, and there are trees with leaves and trees with needles. So there is the European larch, which is deciduous but has needles and cones. There are also various laurels, which are evergreen trees with leaves. Recently, the old tagging scheme was replaced with a new one, which clearly reflects the shape of leaves and the seasonal changes of foliage.
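The gap in the old scheme can be shown with a short Python sketch that treats leaf shape and seasonal behaviour as two independent properties. It assumes that the replacement keys are leaf_type and leaf_cycle, as documented on the OSM Wiki (the article itself does not name them). The mapping back to the old wood= values reflects how those values were commonly read, not an official conversion table, and the Norway spruce is added here only for contrast.

```python
# Old single-key values expressed as (leaf_type, leaf_cycle) pairs.
# Assumption: this is the everyday reading of the old values.
OLD_WOOD_VALUES = {
    ("needleleaved", "evergreen"): "coniferous",
    ("broadleaved", "deciduous"): "deciduous",
}


def old_wood_value(leaf_type, leaf_cycle):
    """Return the old wood= value for this combination, or None if none fits."""
    return OLD_WOOD_VALUES.get((leaf_type, leaf_cycle))


trees = {
    "European larch": ("needleleaved", "deciduous"),
    "laurel": ("broadleaved", "evergreen"),
    "Norway spruce": ("needleleaved", "evergreen"),
}
for name, (leaf_type, leaf_cycle) in trees.items():
    print(name, leaf_type, leaf_cycle, old_wood_value(leaf_type, leaf_cycle))
```

With two independent keys, the larch and the laurel are described without any contradiction, while neither of them fits either of the two old values.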
Another situation where lack of knowledge gave us unclear and contradictory definitions is the case of masts and towers. These objects are tagged with values of the man_made=* key. In structural engineering (the field covering all types of man-made structures), towers are free-standing narrow vertical structures supported only by their foundations, while masts are narrow vertical structures supported by guys and anchors. That’s quite simple and, in addition, these terms are international. But other areas of technical knowledge are less strict about it, so power company workers might call everything a mast. As a result, these two tags are used inconsistently. [The pictures given as examples on the Wiki page are ridiculously non-descriptive.]
The most ridiculous situation (and, if it spreads, the most harmful one, because of semantic divergence, that is, the coexistence of several different readings of the same tag) is when a word used for a key or value has a completely different meaning in different languages. There was a case recently when a Russian-speaking contributor tried to propose a new property for restaurants to indicate that you can have a «business lunch» there. What makes this ridiculous is that only in Russia does this term (coined in the 1990s) mean «lunch for a fixed price». In the rest of the world, the terms table d’hôte and prix fixe are used for it, and business lunch means having lunch and a conversation with business partners, not a type of service. Certainly, the words used for keys and values are abstract. But these words should be understandable to other contributors, at least enough to give a correct general idea of what they mean. [There is another case of a similar nature: one Russian contributor invented a new value for the office= tag, office=administrative, but his English skills didn’t allow him to translate his definition properly, so we have completely opposite definitions in the Russian and English Wiki.]
It works the opposite way even more often. Foreign words are rarely adopted without any change in semantics. Therefore, people who are not familiar with British or American culture quite often make mistakes because they misread a word or are misled by a similar-sounding word in their own language. For example, some Russian-speaking contributors get confused by highway=service with service=alley. The English word alley sounds similar to French allée and Russian аллея. The Russian word was borrowed from French, and it stands for a street lined with rows of trees. An English alley is a narrow passage between or behind buildings. [There are no alleys in the Russian/Soviet architectural tradition.] This is why some Russian contributors try to use highway=service with service=alley for promenades.
Even within the English-speaking community, certain contradictions can arise. Just think about drugstore and pharmacy.
Another case is the use of the words cabin and hut as values of building=. These tags should denote types of buildings, but there is no clear difference between them. And there are associations. Americans will likely associate a cabin with a small isolated house available for rent. Norwegians will associate it with a hytte of Den Norske Turistforening, or a private one. The same goes for people from Switzerland and the mountain huts of the Schweizer Alpen Club. I mean, people may compensate for the lack of clarity with associations. Russian contributors used these tags for traditional log houses, although you can tag those clearly with building=yes, building:material=log.
Certainly, the semantic chaos of natural language doesn’t affect the whole project, but its effect is significant. There are successful cases of good classifications replacing old unclear schemes. One of them is Healthcare 2.0, created to describe all features of medical facilities and medical care and to get rid of amenity=doctors, which means nothing. [Ironically, this scheme didn’t make it through the formal proposal procedure.] A remarkable job has been done by one of the Russian community members, who created a forestry management scheme. Unfortunately, it doesn’t even have its own Wiki page; it currently shares one with the deprecated wood= tags.
New, well-designed schemes usually do not depend on language or cultural context. To adapt such a scheme to another context, it may be enough to add just a couple of values. For example, Healthcare 2.0 makes it possible to describe medical facilities specific to Russia, even though the authors of the scheme were completely unaware of these facilities. That’s the power of using elementary properties that you can combine in any required manner.
The saddest aspect of this situation is that even many experienced contributors don’t understand the problem: they think it’s unimportant, or claim that solving it would create more barriers and deter newcomers (a pretty common argument, because nobody knows anything for sure about newcomers’ patterns of behavior, so it’s easy to speculate).
But actually, if there is a new, clearer scheme, it’s easy to start using it to map objects that were impossible or inconvenient to map before. OSM exists to create free geospatial data, not to be a nice club. So if someone finds it easier to map things carelessly using unclear, ineffective schemes, and too hard to use clear and effective ones, the quality of his contribution is highly questionable. At the same time, people who don’t like the current schemes because of semantic issues might contribute more if better ones were available.
Discussion
Comment from Sanderd17 on 2 January 2016 at 09:40
Very interesting post. I’m from Belgium, so not a native English speaker, but close enough to England to share a lot of the culture.
We personally have no problem with kiosks at all. But we do have problems with other tags, mostly the range between amenity=restaurant and amenity=pub.
In our language, a “café” is where you drink beer, so that’s a pub. Then we have “taverne”, “tearoom” and “brasserie”, which is more to visit in the afternoon, so more like a “cafe” in OSM terms.
Distinguishing between our fast_food and restaurants is also hard. Certainly when it comes to our “fritures” ( https://en.wikipedia.org/wiki/Friture ). A friture is a place where mostly fries and fried snacks are served. Ranging from take-away to having full service. However, they’re not restaurants, as you need a license to have a restaurant title in Belgium. And IMO, it’s very inappropriate to describe those fritures with full service as a fast_food amenity.
So we would like to introduce our custom tag amenity=friture, but nobody wants to tag objects like that, as it would remove all fritures from the visible map (while they are important, every Belgian village has at least one friture). And we can’t get amenity=friture to render in mapnik before it gets used.
Some of the tags you mention also don’t matter a lot. Like building=*, if you don’t understand the tag, it’s just a building, and doesn’t matter which type. Most buildings are “yes” anyway.
Comment from Warin61 on 2 January 2016 at 11:02
friture
I think any place that sells physical objects is a shop .. and food and drink are physical objects that are sold …
So use shop=friture ?? That would coexist with the other tags. And you can track the number of uses with taginfo.
Regarding forestry
The present misuse of landuse=forest where the intention was to indicate the presence of trees needs to be addressed…
I am thinking that the intention of the tag landuse=forest needs to be made clearer .. perhaps a new value of landuse=forestry would assist? And then landuse=forest can be deprecated.
Further; the use of the key landcover= should be encouraged .. landcover=trees has over 10,000 uses… and has a clear meaning for most people?
Description of trees
As you point out .. this leaves (pun) a lot of growth (pun) within OSM tagging. Unfortunately most are not concerned with it.. bigger issues elsewhere.
Buildings
These lack a diversity in description values .. so most people choose to use ‘yes’ rather than use a locally descriptive value like ‘ranch’ or ‘homestead’… And the rendering does not look to take the building value into account anyway.
Nice entry BushmanK.
Comment from marczoutendijk on 2 January 2016 at 12:56
Very nice, interesting reading!
Reminds me of my series of diary entries: Improving the OSM map - why don’t we?
And especially the one regarding language use.
Comment from BushmanK on 2 January 2016 at 17:59
@Sanderd17,
Adding more and more values for amenity to reflect local types is exactly the wrong practice this article is talking about. The proper way of tagging food amenities should look like a single tag with a set of properties. Some sort of (don’t pay too much attention to wording, it’s just an outline):
I repeat, this is not even a draft of a tagging scheme, just an illustration of the principle, so don’t focus on the wording; focus on the idea. Technically, it’s possible to combine this scheme with the old-style one for compatibility. But only a completely new scheme would be able to actually describe food amenities instead of leaving everybody guessing.
@Warin61, There is a problem with building= values. Some people think it should tell us, how this building is used, other people think, it should tell us how it looks/built. This is stupid.
@marczoutendijk,
Thank you, I’ll read your entries today.
Comment from trigpoint on 2 January 2016 at 18:27
Very interesting post, I am a native English speaker.
I will try to clarify a few points as I see it.
wood=coniferous, wood=deciduous
This dates back to the way Ordnance Survey map woodland and how I suppose most British mappers think of woodland. More scientific methods have come into use, but they make it harder for a normal mapper to add detail. Native woodland in England and Wales is, as far as I am aware, deciduous. Conifer will mean it has been planted for forestry.
shop=kiosk
This tag does not really make sense on its own, a kiosk is a type of shop where the customer is served outside. Usually selling newspapers or ice cream. It needs extra tagging to explain what it is, such as kiosk=newsagent / kiosk=ice_cream in the same way as shops.
pharmacy / drugstore
A pharmacy will have a qualified pharmacist on duty who is qualified to dispense prescription medicines and offer advice on non-prescription medicines. There are certain medicines which do not need a prescription, but can only be sold by a pharmacy. Over the counter medicines can be sold almost anywhere, but in strictly limited quantities. In the UK the term pharmacy is interchangeable with chemist, chemist is probably more commonly used for the shop, pharmacist referring to the person. I first came across pharmacy when I was learning French, growing up it was always the chemist.
Drugstore is an American word rather than English. I am not sure exactly what it means.
Supermarket / Mall
Mall is again an American word, the English which I would use in tagging would be shopping_centre.
In the UK the dilemma doesn’t really arise, they are clearly different things and whilst a supermarket can be part of a shopping centre the distinction is obvious. I can see where you are coming from, the big Auchan/Carrefour shops in France could be tagged as either.
Comment from BushmanK on 2 January 2016 at 18:48
@trigpoint, I am aware of these aspects of natural language, but OSM does not use natural language for tagging, which is why many of these entities make no sense. OSM is a database, and any attempt to use it (for making a POI catalog or a map index for a navigation device) should never require anyone to dig into cultural and linguistic details. That’s why all these malls, supermarkets, shopping centres and so on are really bad for tagging: they are highly dependent on cultural context and, at the same time, unclear even to insiders of that culture.
Could you explain how exactly it’s harder for a “normal” mapper to tell whether a tree sheds its foliage in winter and whether it has leaves or needles? This is just a common myth. Even five-year-old kids can easily do that if they don’t have intellectual development problems. “Conifer will mean it has been planted for forestry” is just beyond any limits of logic, by the way.
Comment from joost schouppe on 31 January 2016 at 20:19
Conifer=forestry is actually entirely logical in the UK. There are few examples of where a coniferous forest was not planted for forestry use, as this does not occur naturally in our climate (or most people don’t realize if it doesn’t).
So this is exactly the kind of example you need: people think they know what words mean, but they don’t. The same with kiosk: maybe in England kiosks are either newsagents or ice-cream shops, but in other countries they might sell a range of items, and maybe even specialize in both ice-cream and newspapers.
So both of these examples actually strengthen your point of describing aspects of a thing, not the thing itself. While I would agree that doing this radically might in fact discourage mappers, we will probably have to evolve more and more in this direction, just as we did with woodland tagging.
This will be a gradual evolution though, and not a revolution. That way tools can develop that put a layer of presets over things, in much the way that iD shows a description and JOSM the tags. And that way this necessary complexity is only introduced where it is really necessary.
The most useful thing to do, IMHO, is getting involved in specific discussions about tagging. The theoretical argument may add just the little weight there to tip the scale. But the “simple tagging model” has its worth too, and will not be abandoned just because it’s not logically elegant.
On the friture example, as a Belgian this is important to me :) I’m entirely satisfied with the fast food + cuisine=friture that is common practice. It might have the disadvantages of both tagging styles, but it is queryable - and no subset of tags could ever do a friture justice.
Comment from BushmanK on 31 January 2016 at 20:58
@joost schouppe,
Yes, these are perfect illustrations of assumptions that hold within a certain culture (language, nation) but are completely false on a worldwide scale. Currently, HOT projects cover Africa, South America and Asia, so I wonder how certain tags, born on European soil, work there.
I don’t want to look like a radical person, and I’m not insisting on getting rid of tags which are merely not logically elegant. But there are enough tags which are so meaningless that it is impossible to use them for any real purpose. There are also badly described tags, where the lack of a clear definition in the Wiki has turned them into barely usable ones. Masts and towers are a perfect example: for now, seeing one of these tags, you can only tell for sure that it represents some vertical structure. Whereas, if the author of these tags had been able to make a couple of searches, we would have two nice tags for self-supported and guyed vertical structures. (I’ve added an “Engineering definition” section to the Wiki pages of these tags to encourage people to use this distinctive feature to differentiate them properly, instead of using the subjective, unverifiable criterion of “something bigger”.)
Indeed, discussions of tagging schemes are useful, but from reading both the mailing list and the Wiki discussion pages, I’ve learned that it can be quite a passive process. It looks like the majority of people just prefer to express their view and then be passive observers.
As I said before, “fast food” is okay in obvious cases. Like, you have a food cart with something (and I know what you are talking about - sitting at the Spa Francorchamps circuit, watching racecars pass Eau Rouge and eating frites is an unforgettable memory). But it turns into an issue when you have a place that at least has seats. What makes it “fast food” or, otherwise, a “restaurant” or “cafe”? McDonald’s has seats and calls itself a “restaurant”, but people use the “fast food” tag for it. At least, nobody has taken the time to explain it in the Wiki. And I don’t see any logical, more or less universal criterion for that.