Recent diary entries
What's going on in Valdivia, Chile? Three months of travelling in Chile with OSM data have spoiled me: for almost every road I can see whether it is paved or not, almost all the attractions are there, and there are even many trails inside the national parks. But in Lago Ranco, new asphalt was missing, and after that there was a long, supposedly paved road that was pure dirt. And the same day, a road with not-so-new asphalt, mapped as dirt. I arrive in Curiñanco, with more asphalt errors, and the trail in the reserve is missing.
In general, the quality is really good around university towns. But what's going on in Valdivia?
When I wrote this, I was camped on the road to Oncol park. Within half an hour, three people asked me whether this really was the road to Oncol. This park actually is quite well mapped in OSM, so someone should explain to the people of Valdivia how to use our map :)
Since the State of the Map in Buenos Aires, I've been able to try out some possible indicators on a dataset for my home region, Flanders. Here are some examples of things to measure.
The nodes table contains all POIs defined as nodes, but also all the nodes that make up the lines and closed lines (polygons) of OpenStreetMap. We can reasonably assume that almost all untagged nodes will be part of lines or polygons. Some tagged nodes are also part of lines. For example, a mini-roundabout, a ford, a barrier, etc. should always be part of a line.
The total number of nodes is made up almost completely of nodes that belong to something else. That's to be expected of course.
Over time the number of tagged nodes increases. But the number of tags on these nodes increases faster. In 2009, there were on average only 1.24 tags per tagged node; now it's over twice as many.
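To make the indicator concrete, here is a minimal sketch of how the average number of tags per tagged node could be computed per year. It assumes the nodes are available as (year, tags) records; the function name and the sample records are made up for illustration and are not taken from my actual setup.

```python
from collections import defaultdict


def average_tags_per_year(tagged_nodes):
    """Return {year: mean number of tags per tagged node}.

    tagged_nodes is an iterable of (year, tags_dict) records.
    """
    totals = defaultdict(lambda: [0, 0])  # year -> [tag count, node count]
    for year, tags in tagged_nodes:
        if not tags:
            # Untagged nodes belong to lines or polygons and are skipped.
            continue
        totals[year][0] += len(tags)
        totals[year][1] += 1
    return {year: tag_sum / node_count
            for year, (tag_sum, node_count) in sorted(totals.items())}


# Made-up sample records, for illustration only.
sample = [
    (2009, {"amenity": "bank"}),
    (2009, {"highway": "mini_roundabout"}),
    (2014, {"amenity": "school", "name": "Example school", "wheelchair": "yes"}),
    (2014, {}),
]
print(average_tags_per_year(sample))
```

Skipping untagged nodes matters here: they vastly outnumber the tagged ones, so including them would drag the average far below one.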
What gets tagged? Here's a quick breakdown into some very wide categories. Road info covers all the kinds of tagged nodes you'd expect on highways, the kind that contribute to better routing and safer driving. POIs are things like banks, schools, fuel stations, etc. These two take the top spots, but in 2014 there was a big jump in the first group.
Infrastructure nodes like those belonging to railways and high tension electricity lines are only recently being overtaken by address nodes. The release of open data about addresses in Flanders is probably the cause of the big jump. However, most addresses are tagged on buildings, so they do not show up here. For POI statistics, it would be best to just take the sum of nodes and polygons for the same tag combinations. Two problems arise. One is practical: there seems to be something wrong with the way the history importer handles polygons. It might have to do with the lack of support for relations, but I don't know yet. One more thing for the to-investigate list. The second problem is that sometimes the same POI has both a polygon and a node tagged with the same information. This is not good practice, but it happens. You could remove nodes that geographically fall within polygons if the tags are the same. But I wouldn't know how to do that in my setup. It would take a lot of processing as well, and my available processing power at the moment is way too small as it is.
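For what it's worth, the de-duplication idea could look roughly like the sketch below. It is not something I run in my setup: it assumes nodes and polygons have already been loaded as plain Python dictionaries (the field names are hypothetical), uses a simple ray-casting point-in-polygon test, ignores holes and multipolygons, and compares every node with every polygon, which is exactly the heavy processing mentioned above.

```python
def point_in_polygon(lon, lat, ring):
    """Ray-casting test; ring is a list of (lon, lat) vertices."""
    inside = False
    j = len(ring) - 1
    for i in range(len(ring)):
        xi, yi = ring[i]
        xj, yj = ring[j]
        # Does the edge (j -> i) cross the horizontal line through the point?
        if (yi > lat) != (yj > lat):
            x_cross = xi + (lat - yi) * (xj - xi) / (yj - yi)
            if lon < x_cross:
                inside = not inside
        j = i
    return inside


def drop_duplicate_nodes(nodes, polygons):
    """Keep a node unless a polygon with identical tags contains it."""
    kept = []
    for node in nodes:
        duplicate = any(
            poly["tags"] == node["tags"]
            and point_in_polygon(node["lon"], node["lat"], poly["ring"])
            for poly in polygons
        )
        if not duplicate:
            kept.append(node)
    return kept


# Made-up data: one school polygon and three nodes.
school = {"tags": {"amenity": "school"},
          "ring": [(0, 0), (1, 0), (1, 1), (0, 1)]}
nodes = [
    {"lon": 0.5, "lat": 0.5, "tags": {"amenity": "school"}},  # duplicate
    {"lon": 0.5, "lat": 0.5, "tags": {"amenity": "bank"}},    # other tags
    {"lon": 2.0, "lat": 2.0, "tags": {"amenity": "school"}},  # outside
]
print(len(drop_duplicate_nodes(nodes, [school])))
```

A spatial index or a database-side join would be needed to make this workable for a dense region.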
On to lines. In most cases, the thing to measure is their length. The absolute number of lines is mostly unimportant. A river is a river, whether it consists of 10 or 100 bits and pieces. A nice example of how crowdsourcing works in practice is the evolution of the waterway network. First we see a quick growth of the river network (length in km). As the growth of the rivers winds down and stops, we see the streams taking off. So the crowd has finished mapping all the rivers, and only when that is finished do the smaller streams get more attention. Rivers are sometimes mapped as polygons too. Normally the lines are not deleted when this happens, so this has no impact on network completion. Of course the level of detail does increase. A way to measure the level of detail of the river network could be to count the nodes of all lines and polygons making up this network.
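As an illustration of measuring length rather than counting lines, here is a small sketch that sums great-circle (haversine) lengths per waterway type. It assumes each line is available as a list of (lon, lat) pairs plus its tags; the data structure is hypothetical, and in a PostGIS setup the database would normally do this for you.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance between two points in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def line_length_km(coords):
    """Length of one line given as a list of (lon, lat) pairs."""
    return sum(haversine_km(*a, *b) for a, b in zip(coords, coords[1:]))


def network_length_km(ways, waterway_type):
    """Total length of all lines tagged waterway=<waterway_type>."""
    return sum(line_length_km(way["coords"])
               for way in ways
               if way["tags"].get("waterway") == waterway_type)


# The same made-up river, mapped as one line or as two pieces.
one_piece = [{"tags": {"waterway": "river"}, "coords": [(0, 0), (0, 0.5), (0, 1)]}]
two_pieces = [
    {"tags": {"waterway": "river"}, "coords": [(0, 0), (0, 0.5)]},
    {"tags": {"waterway": "river"}, "coords": [(0, 0.5), (0, 1)]},
]
print(network_length_km(one_piece, "river"))
print(network_length_km(two_pieces, "river"))
```

Both calls give the same total, which is the point: the number of pieces does not matter, only the length.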
A similar picture for roads. Main roads (tertiary to motorway) start off as the largest category. Minor roads (residential, unknown, unclassified) follow but overtake them quickly. Full network completion seems to be achieved by 2013-2014. Other roads (mostly service roads) grow more slowly and steadily. Just as with "slow roads" (mostly footways etc.), the steady growth seems to indicate that completing this network is either more work or lower priority. So these might keep growing for many years to come.
Network completion isn't everything of course. A lot of extra information is needed to have a good, routable map. This kind of information is often mapped as tagged nodes on the map. The history importer does not load relations unfortunately, so the number of turn restrictions can't be counted with my method. In the graph we compare the growth of road info nodes with the evolution of the road network. Again, first the basics get mapped; only as the first priority nears completion is real progress made on the extras.
So why do we need global statistics like this? To learn if these are general patterns. To see if imports disrupt these patterns. Or if they only occur when population density and wealth are high enough. To see how complete maps are - just looking at the graphs, you can often see which features are mapped completely and which aspects of the map need more work. Based on the files generated in the process, it's not very hard to classify mappers: are they local, do they have local knowledge or are they probably remote mappers? The distribution of these is good to know, but beyond that it might give important insights. What happens when remote mappers reach road network completion? Does this increase the chance a good number of local mappers pick up the mapping that needs local knowledge? That might inform if and when remote mapping should be encouraged - or avoided. A lot of these issues give rise to heated arguments. Wouldn't it be nice to have some data to corroborate opinions?
As I said before, there is a lot left to be done. At State of the Map in Buenos Aires I got many tips on how to move ahead. And that has been quite helpful. I could for example never have imagined how incredibly simple it was to add length and area to lines and polygons. As old problems get solved, new ones show up. I just found out that the number of addresses in my polygon analysis is way smaller than other people's results. So there goes another day in finding out what goes wrong.
So even though my set-up is still not really finished for a more complete analysis, it would be nice to have some basic worldwide analysis (see the links at the start of my previous post on the subject) available soon. For those who don't know my little project, the idea is to provide this kind of statistics in an interactive platform, making them available for every region, every country, every continent and the whole world. There's also a video available (which I daren't watch yet) of me mumbling through the idea at State of the Map.
One little detail: my computer can't really handle the denser regions. Flanders was on the limit of what I can do. And there are much larger areas which are just as dense. So if you can spare a little server, I'd be happy to use it :)
I know a lot of people have a problem with OSM objects not having a dependable unique identifier. Of course, a node has an ID which will never change. But a campsite mapped as a node will get a very different ID when someone decides to re-map it as a polygon. This makes life complicated for external applications that would like to link their data to OSM. For example, a fabulous application like iOverlander (which collects data, reviews and ratings on wild/formal campsites) might want to make all the campsites available in OSM rateable in their application. But it would be silly to also copy the geography to their database, as OSM geography is improved upon all the time. Of course, there's a fuzzy way to refer to a specific object, but that's really of no use in this case. Imagine a campsite without a name. Then you could tell OSM to look for a campsite within a certain radius of where you found it. But what if a new campsite has been added? What if the campsite has gotten a better coordinate? What if it has become a caravan site? Or a more complex case: take a bar that has moved locations. Do you give preference to the location or to a bar with the same name somewhere else in town?
This would be an argument to just include much more data within OSM, as that way the link between the thing and its description cannot easily be broken. But considering that even adding some price information is controversial, adding opinions etc. would be unthinkable.
As I've been playing with the idea of using OpenStreetMap as a base for an open alternative to Tripadvisor, I've been thinking about this problem a lot. In a flash of inspiration, I thought of this concept. I would like to hear some opinions about it. Anyone who has a project that requires a thing to have a unique ID can look it up through a query to www.osmdata.org. All objects that have linked external content get an extra tag, for example "osmdata=uniqueid01".
Here's how it could work in practice. Imagine a site where all things vaguely related to tourism are searchable and clickable on the map. Take restaurants as an example. Or generate a list of all restaurants in a city. This list can be updated automatically all the time. But once users start adding untaggable information, like "overpriced" or "what a lovely atmosphere", this data will be saved outside of OSM. Instead of forking the location, the restaurant gets an extra tag in OSM (osmdata=uniqueid22), and the bits of external data saved outside of OSM get this same ID. Now when someone moves the restaurant in OSM (copying tags or dragging the node and deleting the old node) nothing gets messed up. When someone re-maps the restaurant as tags on a building, they copy the osmdata tag too, and again nothing is broken. If a different project wants to use the same thing, they just use the same osmdata unique id. That way, database bloat is minimal.
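A toy sketch of the idea, just to show why the link survives re-mapping. Nothing here is a real service or API: the osmdata tag is only my proposal, the object IDs are invented, and both OSM and the external review store are plain dictionaries.

```python
def find_by_osmdata_id(osm_objects, osmdata_id):
    """Return the (type, id) keys of all objects tagged osmdata=<osmdata_id>."""
    return [key for key, tags in osm_objects.items()
            if tags.get("osmdata") == osmdata_id]


# OSM side: a restaurant mapped as a node (made-up ID).
osm_objects = {
    ("node", 1001): {"amenity": "restaurant", "osmdata": "uniqueid22"},
}

# External side: untaggable opinions, keyed only by the osmdata ID.
reviews = {"uniqueid22": ["what a lovely atmosphere"]}

# Someone re-maps the restaurant as tags on a building and copies the tags.
tags = osm_objects.pop(("node", 1001))
osm_objects[("way", 2002)] = {**tags, "building": "yes"}

# The OSM object ID changed, but the external data still resolves.
print(find_by_osmdata_id(osm_objects, "uniqueid22"))
print(reviews["uniqueid22"])
```

The external project never stores the node or way ID, so it has nothing to repair when the geometry changes. The scheme only breaks if the mapper forgets to copy the osmdata tag.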
Another example would be to rate subjective features of roads, like how scenic they are. The same principle could be applied, and the result could be Michelin-style maps with a green outline for crowd-approved beautiful trips.
Of course, a side-effect will be that external projects like iOverlander would have a much easier time building their project around OSM data. Which would mean that their users would contribute to OSM, instead of just to the external project.
I'm very interested to hear your ideas on how this problem could be solved - or how it is not a problem - or how it has been solved before.
So after 8 months on the road in South America, navigating with Osmand, I'm now number 37 in the world when it comes to opening/closing notes. I make the notes mostly for myself, so when I get the time (and access to good wifi), I fix the problems I spotted.
Twice in Ecuador and once in Peru it happened that local mappers spotted the errors and started fixing them. A big thank you to users giomaussi, Diego Sanguinetti and agranizo! But that means that in large parts of Peru, Bolivia, Chile and Argentina no-one is watching notes.
If you feel like doing some random mapping in South America (mostly Argentina and Chile now), please feel free to correct some of my notes. If something isn't clear, I do respond to questions. Here's a direct link to my notes page
TLDR: click these links to play with South America OSM contributor statistics on a continental level, in detail. It's ready for the world. Or even easier, get a ready made report for a continent, a country or a region.
This is a writeup for the presentation I gave at State of the Map 2014. Slides available here (since it's such a bother to add images to diary entries, you'll have to refer to the slides for pretty pictures). You know those motivational posters saying things like "do one thing every day that scares you"? Well I did, and I wouldn't recommend it. So I'm thinking maybe a written version might be a little more coherent. But if you want to, you can see me talk here.
During my one year road trip through South America, I'm trying to do as many things OSM as possible. Of course, I'm navigating using Osmand, contributing tracks, notes and POIs along the way. I'm trying to convince other roadtrippers to use OSM, which in a lot of cases they're already using anyway. Making contributors out of them is harder: a lot of them seem to know they can, feel like they should, but just "haven't found the time to really look into it". Then recently, I did a presentation about OSM in Carmen Pampa, a village near Coroico, La Paz, Bolivia.
But mostly, I want the world.
The job I'm on a one year break from revolves around generating and providing data in such a way that people can make their own analysis. In a lot of cases, that means taking GIS data or aggregated statistical data and simplifying them to a geographic neighborhood level. A quite literal example: count the number of green pixels within a neighborhood and divide them by the number of people. So here's what I do: a bit of automation, some basic statistics, some self-taught GIS skills, some translating problems back and forth between humans and database querying. I'm great at none of those, but I understand a bit of all these worlds.
At work, the area of interest is just the tiny metropolis of Antwerp. But the tools we use lend themselves to much wider scales.
So I thought, during my trip, why not do the same thing a bit bigger? Antwerp is known for its big egos - and I have to admit I do fit in. So how about the world?
Global Openstreetmap Community Statistics
Slightly obsessed with statistics and with OSM, I felt a lack of mid-level statistics about OSM. Yes, we have some tools telling you how many people edited recently, etc. But there is no "state of the map" for any country, any region. There is a lot of opinion on new contributor mess-ups, or on imports - but few statistics to back it all up.
So here's the one-year plan: make a worldwide tool to see the State of the Map for any region, country and continent in the world.
Minor detail: I wanted to present it at State of the Map Buenos Aires, only half a year away. And it was much more complicated to work from my campervan than I thought. 3G is slow, expensive and often absent from the places we stayed. The amazing 12v-19v converter I found blew up the computer in Ecuador. It would have been a total loss in Europe, but they fixed it for 100 USD in Quito - though there went another month. Also, I'm not a programmer, so I had to learn quite a lot - and have quite a lot to learn still.
I wanted to go beyond the ad hoc analyses you so often see. People are interested in Switzerland, France, South Africa. All these case studies bring interesting insights, but I wanted to provide the basics to all communities. From what profound research has taught us, we know that often it is enough to look at OSM data to know the quality of OSM data. For example: the easiest indicator of map quality is the number of people contributing.
There are some national OSM statistics available, but I wanted to go beyond that. Of course, there are a lot of national communities, but being from Belgium, I decided the national level isn't ideal. And for countries like the US, Brazil or Russia, well, it's just not fair to only give them as much space as Liechtenstein, is it? So I decided to go (with some exceptions) for the highest subdivision of countries.
I decided to use OSM as a base for the regions. I don't quite remember why, but I'm sticking to the theory that it was a matter of principle. The principle being: the more people actually use the data, the better it will become. At the time (say the beginning of 2014), these divisions were very far from complete. I started working on the problem where I could, and even wrote a diary post about my cleaning experience. But of course Wambacher's wonderful boundaries tool had the larger impact. There has been amazing progress in under a year, and now the only larger countries that have severe problems with their top level regions are:
- Panama
- Honduras
- Portugal
- Sri Lanka
- New Zealand
- Malaysia
- Indonesia
Of course, people keep destroying administrative relations. Some of them because they're new and iD doesn't warn you about destroying relations. Rarely some vandalism. And often as well by very experienced users having an off-day I suppose.
It took me quite some time, but now I have a beautiful shapefile of the world with almost all international conflicts resolved and only a few regions claiming their neighbours' territory. Yes, I can share this SHP.
Turning historical OSM data into statistics
I believe you can only understand where we are if you know how we got there. And for a complete view of OpenStreetMap evolution, you do need the history files. These contain every version of every thing that has ever existed in OSM - with some exceptions caused by the license change and redaction work. There is no easy way to work with these files. I had to learn how to translate these data into statistics. That meant learning a whole new world of Virtualbox, Linux, Osmium, History Splitter, PSQL. And I'll probably have to learn some C++ and R yet. I could never have gotten on with this whole project without the help of Ben Abelshausen and especially Peter Mazdermind, whom I've bothered enormously. I wrote a bit about these first steps (with links to Peter's tools) in my diary as well. If you like pretty maps more than stats, you'll probably not make it back here again :)
The workflow so far, as suggested by Peter, is to cut up the world into small pieces, import them into PSQL and then run some queries. To cut up the world, I convert my regions shapefile to poly files using the OSM-to-poly for QGIS 1.8. So far, I have little more than a proof of concept: take all data for an area, dump the unique combinations of users and start dates of objects, and use SPSS to make some simple indicators.
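The last step can be sketched in a few lines. This is a Python stand-in for what I do in SPSS, not the actual syntax I use, and it assumes the dump is a list of (user, start date) pairs with dates as 'YYYY-MM-DD' strings. The user names are invented.

```python
from collections import Counter
from itertools import accumulate


def contributors_by_year(pairs):
    """Turn (user, 'YYYY-MM-DD') pairs into yearly contributor counts.

    Returns three dicts keyed by year: active, new and cumulative contributors.
    """
    active = {}      # year -> set of users who created something that year
    first_year = {}  # user -> first year they show up
    for user, start in pairs:
        year = int(start[:4])
        active.setdefault(year, set()).add(user)
        first_year[user] = min(year, first_year.get(user, year))
    new = Counter(first_year.values())
    years = sorted(active)
    new_per_year = {y: new.get(y, 0) for y in years}
    cumulative = dict(zip(years, accumulate(new_per_year.values())))
    return {y: len(active[y]) for y in years}, new_per_year, cumulative


# Made-up dump for illustration.
dump = [
    ("anna", "2012-03-01"),
    ("ben", "2012-07-15"),
    ("anna", "2013-01-02"),
    ("carl", "2014-05-05"),
]
active, new, cumulative = contributors_by_year(dump)
print(active)
print(new)
print(cumulative)
```

Years without any activity simply do not appear in the output, which is fine for a first look but would need filling in before plotting.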
So here are the first results, a complete basic statistics tool with data on a continental level but also in detail. It's completely interactive and ready for the world. Of course you can compare evolutions, but if you play around with the tool a bit, you'll see the possibilities are endless.
You'll be forgiven for not liking to 'play' with a tool like this, as most normal people don't. To make your life easier, there's a reporting studio which gives you a ready made analysis of the evolution of contributors in a continent, country or region of your choice. This being SOTM Buenos Aires, the obvious examples are South America, Argentina and the city of Buenos Aires.
All the data in the tool is available for re-use: you can download xls or xml for any view you make, WMS services can be provided, you can remotely query a visualization and you can access it through a basic API.
From my experience at State of the Map, I don't feel like I made quite clear what is the importance of a tool like this. I'll try to give some more examples of what could be easily done with just OSM data.
- You don't need any other sources than OSM data to get an idea about road network completeness, and how much is left to be mapped.
- You could make statistics about how many map errors are open
- In more advanced countries, see how quickly landuse mapping is being completed
- Does mapping peter out when the map gets more mature? Or is it the other way around, does more data imply more people using and contributing to even more data? Is there an exponential curve of map development? And dare I say, yes? (LINK)
- How do imports really affect mapping? Is a country which starts off with a large import likely to quickly grow a large community, or will it start to lag behind after a while?
- Is the number of mappers proportional to people or to GDP?
- Do most regions follow the same growth track, but just started off later? Or are there regions that will not ever get properly mapped without special outside attention?
- Or something very specific: "does the probability of a new contributor becoming a recurring contributor increase if we contact all new mappers in our area"?
- What does HOT attention do to local community development? Are people recruited through a HOT project more likely to keep contributing?
Any subject lends itself to the creation of indicators. How quickly do notes get resolved? Simple: count the number of notes still open three months after their creation. Then you can quickly compare the speed of note resolution in different regions. And maybe even adopt a region to watch some notes in. Or some investigator might decide to look into the dynamics of note resolution, and suggest better indicators.
The tool allows thousands of indicators to be easily managed and widely consulted.
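A minimal sketch of that notes indicator, assuming each note is a (created, closed) pair of datetimes with closed set to None while the note is open. Three months is approximated as 90 days, and notes younger than that are left out because we cannot know their fate yet. All dates below are invented.

```python
from datetime import datetime, timedelta


def share_open_after(notes, as_of, days=90):
    """Share of notes still open `days` after creation.

    notes: list of (created, closed) datetimes, closed is None if still open.
    as_of: the moment the data was extracted.
    Returns None when no note is old enough to be judged.
    """
    window = timedelta(days=days)
    eligible = [(created, closed) for created, closed in notes
                if as_of - created >= window]
    if not eligible:
        return None
    still_open = sum(1 for created, closed in eligible
                     if closed is None or closed - created > window)
    return still_open / len(eligible)


# Made-up notes for one region.
notes = [
    (datetime(2014, 1, 1), datetime(2014, 1, 10)),  # closed within days
    (datetime(2014, 2, 1), datetime(2014, 8, 1)),   # closed after half a year
    (datetime(2014, 3, 1), None),                   # still open
    (datetime(2014, 12, 15), None),                 # too recent to count
]
print(share_open_after(notes, as_of=datetime(2015, 1, 1)))
```

Running the same function per region gives a number that can be compared directly between regions.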
A cry for help
As I kept saying at SOTM, I don't really know what I'm doing, and I would like some outside checks. I even admitted on stage that I'm a Potlatch 2 mapper. I'll say it again: I like Potlatch. Apparently, that can earn you free beer. But it does mean I need help. I do think I will get some, but it'll take some more effort from my side. For example, I might get some scripts to get the road length out of a history file. I'm also going to look into some C++ scripts that Abhishek made. And maybe OSM France can set up a history server which might make life a bit easier on my poor computer.
Part of my lack of confidence at SOTM was that my numbers of contributors for a given country were much higher than a colleague investigator found. And after my presentations I saw some more numbers that frightened me. So the last week, I've been trying to figure out what went wrong. It turned out: nothing did. Wille from Brazil pointed out that user naoliv produces some statistics on the number of contributors for Brazil - and mine were much higher. Only after a while was I sure that he didn't use the history files, but a current world snapshot, which is bound to create some difference. But even then the differences were much larger than I would have thought. Here are some basic statistics (taken at a random moment at the beginning of 2014):
- 6936: number in history files
- 5585: number in current world
- 178: known in current world, but not in the history files
- 1529: known in history files, but not in the current world dump
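These four numbers are just set sizes, so they can be cross-checked: history minus "history only" plus "current only" should equal current. Here is a small sketch of that comparison. The function and key names are mine, and the user lists in a real run would come from the two dumps.

```python
def compare_contributors(history_users, current_users):
    """Count contributors per source and those found in only one of them."""
    history_users, current_users = set(history_users), set(current_users)
    return {
        "history": len(history_users),
        "current": len(current_users),
        "current_only": len(current_users - history_users),
        "history_only": len(history_users - current_users),
    }


def counts_are_consistent(counts):
    """history - history_only + current_only must equal current."""
    return (counts["history"] - counts["history_only"]
            + counts["current_only"] == counts["current"])


# The Brazil numbers listed above.
reported = {"history": 6936, "current": 5585,
            "current_only": 178, "history_only": 1529}
print(counts_are_consistent(reported))  # True

# Tiny made-up example of the set comparison itself.
print(compare_contributors({"a", "b", "c"}, {"b", "c", "d"}))
```

The reported numbers add up (6936 - 1529 + 178 = 5585), so at least the bookkeeping is sound.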
How can you be known in the current Brazil map, but not in the history files, as 178 people are? Well, I honestly don't know. Some random checking was in order. Most cases seemed to be people editing very close to the border of Brazil. I use the exact borders, whereas naoliv uses the Geofabrik dump which probably has a tiny buffer to ensure data integrity. But there were also some cases where I have no clue as to what causes someone not to show up in my dumps. Anyway, small differences are bound to arise in databases like this. You'll probably always get some noise in analysis like this - though mostly because of some deeply hidden error or bias.
Another 1529 have contributed to the Brazil map, but their work is not visible anymore at all. I thought this not impossible, but still surprisingly large. Some random checking showed that these people did in fact contribute to Brazil at one time. Here are some statistics I found comforting:
Here we look at the percentage of people found in the history files, lost in the current version of the map. Overall, the number is 22% lost. But when we classify by number of added/touched nodes, you see the number is much higher for people with few edits. Which is exactly what you would expect if the cause of the difference is people's work getting overwritten. If you have more edits, there is less chance that 'all will be lost'.
Percentage lost to current state, by number of added/touched nodes:
- 1-10: 35%
- 11-50: 13%
- 51-250: 5%
- 251+: 1%
The same goes when we look at the last year people have contributed to the map in Brazil. People who last edited in 2008 have a 56% chance of not being visible in the current state of the map. Again, what you would expect if people's edits are overwritten. The longer ago you've contributed, the more probable that your contribution has been lost.
Percentage lost to current state, by last year of contribution:
- 2007: 57%
- 2008: 56%
- 2009: 50%
- 2010: 40%
- 2011: 31%
- 2012: 24%
- 2013: 17%
- 2014: 10%
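Both breakdowns follow the same recipe: group the contributors found in the history files, then count per group how many are missing from the current dump. A sketch of that recipe, with hypothetical field names and invented users (the class boundaries are the ones used above):

```python
def edit_class(attrs):
    """Class of a contributor by number of added/touched nodes."""
    n = attrs["nodes"]
    if n <= 10:
        return "1-10"
    if n <= 50:
        return "11-50"
    if n <= 250:
        return "51-250"
    return "251+"


def percent_lost_by_group(history_users, current_users, group_of):
    """Percentage of history contributors absent from the current dump, per group.

    history_users: {user: attrs}; current_users: set of user names.
    """
    totals, lost = {}, {}
    for user, attrs in history_users.items():
        group = group_of(attrs)
        totals[group] = totals.get(group, 0) + 1
        if user not in current_users:
            lost[group] = lost.get(group, 0) + 1
    return {g: 100.0 * lost.get(g, 0) / totals[g] for g in totals}


# Made-up contributors, for illustration only.
history = {
    "a": {"nodes": 3, "last_year": 2008},
    "b": {"nodes": 7, "last_year": 2013},
    "c": {"nodes": 40, "last_year": 2013},
    "d": {"nodes": 300, "last_year": 2014},
}
current = {"b", "c", "d"}
print(percent_lost_by_group(history, current, edit_class))
print(percent_lost_by_group(history, current, lambda attrs: attrs["last_year"]))
```

Passing a different grouping function is all it takes to switch between the edit-count view and the last-year view.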
This means that when you make contributor statistics, the difference between using history files and current world dumps is pretty large.
With this I'm feeling a lot more confident. I'm thinking to build up more in depth analysis first, and only then try and do the whole world. At least, further worldwide analysis will have to wait till 2014 is completed. That way I can work on history files that include the whole of 2014. I'll have my friends in Belgium download them :)
Here's a list of things I think I can manage, in rough order of how hard it will be, or how far I've gotten. WE could of course manage much more, much better, much sooner. But that means YOUR help. I should stop watching motivational posters.
- cumulative number of contributors, or active contributors by year
- number of nodes, ways, polygons (created, deleted, touched)
- notes resolution
- proportion of data contributed by 'local' contributors
- number of mapped hamlets/villages/towns/cities
- kilometers of roads by type
- proportion of area covered by land use
I'm very interested in other suggestions. Especially if they come with a script that gets the numbers out of an OSHistory file.
Travelling in South America with my own vehicle, I was surprised by the quality of the information. In Chile and Ecuador, it is very clear that there is a community working hard. Peru still needs more work, but thanks to imports most towns have streets with names, even where there is no Bing coverage. What for me was one of the most important gaps is information about the quality of the roads.
In Peru, for example, there are many roads that were paved only recently. Without asphalt they were very difficult; now they are much easier. But since Mapnik is Eurocentric, it does not take this information into account. In Europe, an important road would always be paved. Even if the road is of little importance, it is still unlikely to be a dirt road. In countries like Peru and Bolivia, that is not the case. The not-so-major road between Cajamarca and Chachapoyas has brand-new asphalt, while the important road from Huaraz to the coast via the north has a significant unpaved section.
When you plan a trip, it is not only important that the information is there, but also that it is visualised. Mapnik has two flaws when applied to South America. First, you cannot see the difference between paved and unpaved. And what cannot be seen does not get mapped. Second, the style was designed in small countries with many roads. The concern there is that so much information ends up on the screen that it can no longer be read. In South America there are so few roads that the problem is the reverse: you have to go to very high zoom levels before you can see where the roads are. (Another reason, I believe, why so many roads were tagged as trunk.)
What can we do?
Complete the data, and improve the visualisation.
Map all the surfaces and qualities of the roads we know
I would like to ask the whole Latin American community to map all the surfaces and qualities of the roads we know, starting with the most important roads of the continent. What is obvious to local people often is not to foreigners. What I am learning on my trip, I am already mapping bit by bit. We should take into account not only "surface" but also "smoothness", since there are dirt roads where you can fly along and asphalt roads with so many potholes that you go very, very slowly. Both have a wiki page, although smoothness is defined more for cyclists than for people travelling by car. And a Spanish translation is missing.
We will think about how the visualisation of this information can be improved.
Below is some inspiration. Perhaps there are more applications that already take this information into account. But as far as I know, it seems to me that we should work towards a Latin style, which would serve all countries that are less populated and have a road network that is not 100% paved. As a first step, I have already requested a map view in Osmand. There is also the Humanitarian style, which already takes surface into account. But that map is a background map, not so much a map like Mapnik that wants to be a complete map (as they say themselves). For a first round of mapping, the Itoworld maps can help: http://www.itoworld.com/map/215 and http://www.itoworld.com/map/25 . But I don't know of maps that also take smoothness into account - although that is quite a challenge to visualise. Perhaps the solution lies in routing: from A to B you will pass 100 km of good asphalt, 50 kilometres of good dirt and 25 kilometres of bad asphalt.
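The routing idea boils down to adding up segment lengths per surface and smoothness combination. A minimal sketch, assuming a router hands back the route as (length in km, surface, smoothness) segments. The segment list is made up, and no existing router is claimed to produce exactly this.

```python
from collections import defaultdict


def summarise_route(segments):
    """Total kilometres per (surface, smoothness) along a route."""
    totals = defaultdict(float)
    for length_km, surface, smoothness in segments:
        totals[(surface, smoothness)] += length_km
    return dict(totals)


def describe(totals):
    """Turn the totals into one readable sentence."""
    parts = [f"{km:g} km of {smoothness} {surface}"
             for (surface, smoothness), km in totals.items()]
    return "From A to B: " + ", ".join(parts)


# Made-up route, split into segments the way the ways are tagged.
route = [
    (60, "asphalt", "good"),
    (50, "dirt", "good"),
    (40, "asphalt", "good"),
    (25, "asphalt", "bad"),
]
print(describe(summarise_route(route)))
```

Segments without a surface or smoothness tag would need their own "unknown" class, which would also show at a glance how much of a route still has to be surveyed.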
There is no navigation app like Osmand. But it is quite complicated. So I made this write-up based on what I've learned over the past two years using it. I wrote it with people like myself in mind: navigating overland trips in third world countries.
Feel free to suggest changes, additions or to copy/paste.
EDIT: yeah, so my little hosting package didn't agree with your interest (you consumed 12 times my allotment). Fortunately the nice people at OSM.be came to the rescue and offered me some space. Thank you Ben!
I have a big scheme in my head to do something fun with OSM data. Unfortunately I'm still taking baby steps. Still, here is one step that makes me pretty happy: a map of the evolution of La Paz, Bolivia. EDIT: as I'm a disaster at reading manuals, I didn't add timestamps to the first few tries. I'll re-run them with timestamps when I get the time.
I can make an animation like that for any bounding box with just a couple of minutes work (and some waiting time, depending on the data-density of the area).
In fact, doing this is extremely easy. It still took me two months :) All you have to do is follow the instructions here: https://github.com/MaZderMind/osm-history-renderer/blob/master/TUTORIAL.md (this was very helpful too: https://github.com/MaZderMind/osm-history-splitter)
Only for me, that meant setting up a VirtualBox with Ubuntu, understanding how to install software on Ubuntu and how to fix messed up installations, and getting Ubuntu to read data from my host Windows 8 laptop. A big challenge was also not throwing the laptop out of the window (everyone LOVES Windows 8, right?). I could not have done this without Mazdermind Peter, who didn't just make the data available in a workable format and the tools to work with them, but also gave me personal support. Eternal gratitude and what not. Also free Belgian beer (or chocolate) on any future IRL meetings.
If you want something similar for a bounding box that interests you, let me know. Just draw a bounding box with http://maps.personalwerk.de/tools/bbox-paint.html and send me the "EPSG:4326" line. That way I can stupidly copy-paste the coordinates.
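Before copy-pasting, a quick sanity check on the coordinates doesn't hurt. The sketch below assumes the line boils down to four comma-separated numbers in the order min lon, min lat, max lon, max lat. That order is my assumption, not something checked against the tool, and the sample values are invented.

```python
def parse_bbox(line):
    """Parse 'minlon,minlat,maxlon,maxlat' (assumed order) into four floats."""
    parts = [float(p) for p in line.strip().split(",")]
    if len(parts) != 4:
        raise ValueError("expected four comma-separated numbers")
    min_lon, min_lat, max_lon, max_lat = parts
    if not (-180 <= min_lon < max_lon <= 180):
        raise ValueError("longitudes out of range or in the wrong order")
    if not (-90 <= min_lat < max_lat <= 90):
        raise ValueError("latitudes out of range or in the wrong order")
    return min_lon, min_lat, max_lon, max_lat


# Made-up box somewhere in Bolivia, for illustration only.
print(parse_bbox("-68.20,-16.60,-68.05,-16.45"))
```

Swapped latitude and longitude is the most common mistake, and the range checks catch part of it.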
Next step: creating yearly statistics about the state of the map.
Here are some requests:
EDIT: Peter suggested I make an extract of my setup for him to distribute. That would mean you can install VirtualBox (easy), load up a copy of my VM (should be easy), write three lines of code and you have your thing (easy).
EDIT: There are some bugs visible (Krakow, Kathmandu). It might be I failed to do an update for some of the involved packages. If after a re-run they still show, I'll try and make some bug reports for Peter.
I woke up one morning, and realized I needed a reusable dataset of all the communities in the world. Not just X-Y, but administrative areas. Obviously, I started looking on OSM. With a bit of playing around (and a little help from my friends), I had a nice set of admin areas of various levels from OSM in a shapefile. Then I started noticing holes. If a country is mostly made of holes, you know there is no data. But if there are a few holes, well, something is fishy. What happens is that no extraction tool can make an admin area if there are gaps. A line is not an area, only a closed line is. Borders in OSM are relations. These are collections of lines, joined together in virtual union. Often, someone deletes one of these lines, and replaces it with a more detailed version. That's sort of OK, but then you have to add the new line to -all- the relations the old one was part of. This is of course very exotic to new mappers, and even experienced mappers don't always seem to care.
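Why a single missing line breaks the whole area can be shown with a small check: in a closed border, every end node of a member line is shared with another member line. The sketch below assumes the member ways of a relation are available as lists of node IDs (the IDs are invented) and reports the nodes where the ring is left open. It does not handle every oddity, such as lines that touch in the middle.

```python
from collections import Counter


def open_ends(member_ways):
    """Return the node IDs where a boundary made of these ways is not closed.

    member_ways: list of ways, each a list of node IDs in order.
    """
    ends = Counter()
    for way in member_ways:
        if len(way) < 2:
            continue
        if way[0] == way[-1]:
            continue  # a closed way is a ring of its own
        ends[way[0]] += 1
        ends[way[-1]] += 1
    # An end node used an odd number of times has no partner to connect to.
    return sorted(node for node, count in ends.items() if count % 2 == 1)


# Three lines forming a closed border (made-up node IDs).
closed_border = [[1, 2, 3], [3, 4, 5], [5, 6, 1]]
print(open_ends(closed_border))  # []

# Someone deleted the middle line and did not add the replacement.
broken_border = [[1, 2, 3], [5, 6, 1]]
print(open_ends(broken_border))  # [3, 5]
```

The two reported nodes are exactly where you would zoom in to find the gap.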
Data use = data cleaning
Data will only get fixed if they are used. And even on the forum, I saw people referring end data users to other sources to get their admin areas. It is complicated to extract a dataset with borders from OSM. So why care about this data? It shows up okay on OSM even if it's broken. BUT, user Wambacher made this great tool to download shapefiles by country with all available admin areas. So now that it's easy to use the data, please help maintain it.
Fixing things up
I tried several tools to help with fixing things. I have a global focus, so I've mostly been doing fixups on the admin level right below the country. Often states, sometimes departments, etc. If you're going to fix a specific area, you're going to need other tools (see below).
- Choose a level to fix using layers.openstreetmap.fr
- Check which areas are missing
- Find the relation which is broken (or, rarely, missing). Search doesn't always work. Zoom to a frontier which is OK on one side and broken on the other. Click on the frontier and find the relations it is part of. Copy the ID.
- Visualize the relation with a URL like this: http://www.openstreetmap.org/relation/3624100 This shows obvious defects. If defects are more subtle (little holes, almost-junctions), go to http://analyser.openstreetmap.fr/cgi-bin/index.py and paste the ID.
Causes of trouble
During the fixups I found many different types of errors. Borders are basically always the result of imports. Duh. Messing around with borders is a good way to understand why imports are controversial. It's easy to do it wrong, and hard to do it right. Sometimes there is leftover data from the original shapefiles that were used, like area=123. In some cases, the original polygons are still there. And where it gets really messy is when you add detailed borders from source X on top of general borders from source Y. It takes a lot of time and effort to clean that up, and importers don't always get around to finishing it all. Once the data are there, information may be redacted because the original data wasn't compatible with our licence. Most often: no commercial reuse, the menace of all open data users. And redaction leaves a mess. Another source of trouble is including both the seafront and a maritime border. These should be separated. Apart from that, most errors come from beginners and experienced users alike who delete a line and replace it with something else. Simple solution: use shift-click to improve geometry instead. That makes understanding the history of an area much easier too.
In depth error checking
If you want to go in depth in a certain area, these tools will come in handy: http://keepright.at/report_map.php > Click "none" (bottom left), then only activate the "Boundaries" checks
This even visualizes all the broken relations, but it's just plain depressing to use: http://tools.geofabrik.de/osmi/?view=multipolygon
Using Overpass turbo you can quickly get the IDs for the admin areas in the area you're fixing. That makes it a lot easier to start fixing things. For example, using this query you get a map with all the relations defining level 4 admin areas. Don't zoom out when running, as you will be downloading too much data.
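As an illustration of what such a query can look like, here is a small Python sketch that builds an Overpass QL query for administrative relations of a given level inside a bounding box. This is not necessarily the query referred to above. The function name `admin_relations_query` and the example coordinates are assumptions for illustration only. It uses `out tags;`, so it returns IDs and tags rather than the geometry you would need for a map.

```python
def admin_relations_query(south, west, north, east, admin_level=4):
    """Build an Overpass QL query for admin boundary relations in a bbox.

    Overpass expects bounding boxes as (south, west, north, east).
    `out tags;` returns only IDs and tags, which keeps the download
    small when all you need is the relation IDs.
    """
    bbox = f"{south},{west},{north},{east}"
    return (
        "[out:json][timeout:25];\n"
        f'relation["boundary"="administrative"]'
        f'["admin_level"="{admin_level}"]({bbox});\n'
        "out tags;"
    )


# Arbitrary example box; keep it small, for the same reason as
# "don't zoom out" above.
print(admin_relations_query(51.1, 4.2, 51.4, 4.6))
```

The printed text can be pasted into Overpass turbo. The relation IDs in the result can then be fed to the openstreetmap.org relation URL or the analyser mentioned earlier.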
So, about six months after promising Ben Abelshausen, I'm finally organizing a MeetUp in Antwerpen.
As I know organizing isn't quite my forte, I've been thinking a lot about what other things one can do to make OSM more social today. As opposed to "how could we make it more social if we were programmers with all the time in the world". So, there was one piece of very low-hanging fruit we identified at the last "monthly" "summer" meetup in Gent. We could just let all the new contributors in our area know that OSM is people. So when you join, you wouldn't think OSM is somewhere you just dump some data, but a place where people actually meet and collaborate on a common dream. (In case you wondered: yes, nerds can be romantic.)
So I made a Google Spreadsheet where all new Belgian contributors are listed quite cleanly. You can click a link to their profile and send them a welcome message. After you've sent it, you just add some info to the spreadsheet, so the same new user doesn't get more than one welcome message. I made a draft standard welcome message (in Dutch) that anyone can use, modify, or completely throw away to make something better.
I don't want to be doing this alone. I would like to share both Google Documents with anyone who wants to help out. All you need is a Google Account. We'll obviously need a French translation. And I would like there to be different letters of introduction.
If you're interested, I used (of course) one of neis-one's services. Its RSS feed is read by an IFTTT recipe into Google Docs. The data comes in a bit messy, so I introduced a second worksheet that cleans it up a bit with some basic formulas.