Recent diary entries
TLDR: click these links to play with South America OSM contributor statistics on a continental level, in detail. It's ready for the world. Or even easier, get a ready made report for a continent, a country or a region.
This is a writeup for the presentation I gave at State of the Map 2014. Slides available here (since it's such a bother to add images to diary entries, you'll have to refer to the slides for pretty pictures). You know about these motivationals saying things like "do one thing every day that scares you"? Well I did, and I wouldn't recommend it. So I'm thinking maybe a written version might be a little more coherent.
During my one year road trip through South America, I'm trying to do as many things OSM as possible. Of course, I'm navigating using Osmand, contributing tracks, notes and POI's along the way. I'm trying to convince other roadtrippers to use OSM, which in a lot of cases they're already using anyway. Making contributors out of them is harder: a lot of them seem to know they can, feel like they should, but just "haven't found the time to really look into it". Then recently, I did a presentation about OSM in Carmen Pampa, a village near Coroico, La Paz, Bolivia.
But mostly, I want the world.
The job I'm on a one year break from, revolves around generating and providing data in such a way that people can make their own analysis. In a lot of cases, that means taking GIS data or agregated statistical data and simplify them to a geographic neighborhood level. A quite literal example: count the number of green pixels within a neighborhood and devide them by number of people. So here's what I do: a bit of automation, some basic statistics, some self-thaught GIS skills, some translating problems back and forth between humans and database querying. I'm great at none of those, but I understand a bit of all these worlds.
At work, the area of interest is just the tiny metropolis of Antwerp. But the tools we use lend themselves to much wider scales.
So I though, during my trip, why not do the same thing a bit bigger? Antwerp is known for its big egos - and I have to admit I do fit in. So how about the world.
Global Openstreetmap Community Statistics
Slightly obsessed with statistics and with OSM, I felt a lack of mid-level statistics about OSM. Yes, we have some tools telling you how many people edited recently, etc. But there is no "state of the map" for any country, any region. There is a lot of opinion on new contributor mess-ups, or on imports - but few statistics to back it all up.
So here's the one-year plan: make a worldwide tool to see the State of the Map for any region, country and continent in the world.
Minor detail: I wanted to present it at State of the Map Buenos Aires, only half a year away. And it was much more complicated to work from my campervan than I thought. 3G is slow, expensive and often absent from the places we stayed. The amazing 12v-19v converter I found blew up the computer in Ecuador. A total loss in Europe, they fixed it for 100 USD in Quito - but there went another month. Also, I'm not a programmer, so I had to learn quite a lot - and have quite a lot to learn still.
I wanted to go beyond the ad hoc analyses you so often see. People are interested in Switzerland, France, South Africa. All these case studies bring interesting insights, but I wanted to provide the basics to all communities. From what profound research has tought is, we know that often it is enough to look at OSM data to know the quality of OSM data. For example: the easiest indicator of map quality is the number of people contributing.
There are some national OSM statistics available, I wanted to go beyond that. Of course, there are a lot of national communities, but being from Belgium, I decided the national level isn't ideal. And for countries like the US, Brazil or Russia, well, it's just not fair to only give them as much space as Liechtenstein is it? So I decided to go (with some exceptions) for the highest subdivision of countries.
I decided to use OSM as a base for the regions, I don't quite remember why, but I'm sticking to the theory that it was a matter of principle. The principle being: the more people actually use the data, the better it will become. At the time (say beginning 2014), these devisions were very far from complete. I started working on the problem where I could, even wrote a diary post about my cleaning experience. But of course Wambacher's wonderfull boundaries tool had the larger impact. There has been amazing progress in under a year, and now the only larger countries that have severe problems with their top level regions are:
Panama Honduras Portugal Sri Lanka New Zealand Malaysia Indonesia
Of course, people keep destroying administrative relations. Some of them because they're new and ID doesn't warn you about destroying relations. Rarely some vandalism. And often as well by very experienced users having an off-day I suppose.
It took me quite some time, but now I have a beautiful shapefile of the world with most all international conflicts resolved and anly a few regions claiming their neighbours territory. Yes, I can share this SHP.
Turning historical OSM data into statistics
I believe you can only understand where we are, if you know how we got there. And for a complete view of Openstreetmap evolution, you do need the history files. These contain every version of every thing that has ever existed in OSM - with some exceptions caused by the license change and redaction work. There is no easy way to work with these files. I had to learn how to translate these data into statistics. That meant learning a whole new world of Virtualbox, Linux, Osmium, History Splitter, PSQL. And I'll probably have to learn some C++ and R yet. I could never have gotten on with this whole project without the help of Ben Abelshausen and especially Peter Mazdermind, whom I've bothered enormously. I wrote a bit about these first steps (with links to Peter's tools) in my diary as well. If you like prety maps more than stats, you'll probably not make it back here again :)
The workflow so far, as suggested by Peter, is to cut up the world into small pieces, import them into PSQL and then make some queries. To cut up the world, I convert my regions shapefile to poly files using the OSM-to-poly for qGIS 1.8. So far, I have little more than a proof of concept. Let's take all data for an area, dump unique combination of users and start dates of objects and use SPSS to make some simple indicators.
So here are the first results, a complete basic statistics tool with data on a continental level but also in detail. It's completely interactive and ready for the world. Of course you can compare evolutions, but if you play around with the tool a bit, you'll see the possibilities are endless.
You'll be forgiving for not liking to 'play' with a tool like this, as most normal people don't. To make you're life easier, there's a reporting studio which gives you a ready made analysis of the evolution of contributors in a continent, country or region of your choice. This being SOTM Buenos Aires, the obvious examples are South America, Argentina and the city of Buenos Aires.
All the data in the tool is available for re-use: you can download xls or xml for any view you make, WMS services can be provided, you can remotely query a visualization and you can acces through a basic API.
From my experience at State of the Map, I don't feel like I made quite clear what is the importance of a tool like this. I'll try to give some more examples of what could be easily done with just OSM data.
- You don't need any other sources than OSM data to get an idea about road network completeness, and how much is left to be mapped.
- You could make statistics about how many map errors are open In more advanced countries, see how quickly landuse mapping is being completed
- Does mapping peter out when the map gets more adult? Or is it the other way around, does more data imply more people using and contributing to even more data? Is there an exponential curve of map development. And dare I say, yes? (LINK)
- How do imports really affect mapping? Is a country which starts of with a larg import likely to quickly grow a large community, or will it start to lag behind after a while?
- Is the number of mappers proportional to people or to GDP?
- Do most regions follow the same growth track, but just started of later? Or are there regions that will not ever get properly mapped without special outside attention?
- Or something very specific: "does the probability of a new contributor becoming a recurring contributor increase if we contact all new mappers in our area"?
- What does HOT attention do to local community development? Are people recruited through a HOT project more likely to keep contributing?
Any subject leads itself to the creation of indicators. How quickly do notes get resolved? Simple: count the number of nodes still open, three months after their creation. Then you can quickly compare the speedyness of note resolution in different regions. And maybe even adopt a region to watch some notes in. Or some investigator might decide to look into the dynamics of note resolution, and suggest better indicators.
The tool allows 1000ths of indicators to be easily managed and widely consulted.
A cry for help
As I kept saying at SOTM, I don't really know what I'm doing, and I would like some outside checks. I even admitted on stage that I'm a Potlach2 mapper. I'll say it again: I like Potlach. Aparently, that can earn you free beer. But it does mean I need help. I do think I will get some, but I'll take some more effort from my side. For example, I might get some scripts to get the road length out of a history file. I'm also going to look into some C++ scripts that Abhishek made. And maybe OSM France can set up a history server which might make life a bit easier on my poor computer.
Part of my lack of confidence at SOTM was that my numbers of contributors for a given country were much higher than a colleague investigator found. And after my presentations I saw some more numbers that frightened me. So the last week, I've been trying to figure out what went wrong. It turned out: nothing did. Wille from Brazil pointed out that user naoliv produces some statistics of number of contributors for Brazil - and mine where much higher. Only after a while was I sure that he didn't use the history files, but a current world snapshot, which is bound to creat some difference. But even then the differences were much higher than I would have thought. Here's some basic statistics (taken at a random moment beginnening of 2014):
6936 number in history files 5585 number in current world 178 known in current world, but not in the history files 1529 known in history files, but not in the current world dump
How can you be known in the current Brazil map, but not in the history files, as 178 people are? Well, I honestly don't know. Some random checking was in order. Most cases seemed to be people editing very close to the border of Brazil. I use the exact borders, whereas naoliv uses the Geofabrik dump which probably has a tiny buffer to ensure data integrity. But there were also some cases where I have no clue as to what causes someone not to show up in my dumps. Anyway, small differences are bound to arise in databases like this. You'll probably always get some noise in analysis like this - though mostly because of some deeply hidden error or bias.
Another 1529 have contributed to the Brazil map, but their work is not visible anymore at all. I though this not impossible, but still surprising large. Some random checking learned that these people did in fact contribute to Brazil at one time. Here are some statistics I found comforting:
Here we look at the percentage of people found in the history files, lost in the current version of the map. Overall, the number is 22% lost. But when we classify by number of added/touched nodes, you see the number is much higher for people with few edits. Which is exactly what you would expect if the cause of the difference is people's work getting overwited. If you have more edits, less chance that 'all will be lost'.
Percentage lost to current state 1-10 35% 11-50 13% 51-250 5% 251+ 1%
The same goes when we look at the last year people have contributed to the map in Brazil. People editing in 2008 have 56% of not being visible in the current state of the map. Again, what you would expect if people's edits are overwritten. The longer ago you've contributed, the more probable that you're contribution has been lost.
Percentage lost to current state 2007 57% 2008 56% 2009 50% 2010 40% 2011 31% 2012 24% 2013 17% 2014 10%
This means that when you make contributor statistics, the difference between using history files and current world dumps are pretty high.
With this I'm feeling a lot more confident. I'm thinking to build up more in depth analysis first, and only then try and do the whole world. At least, further worldwide analysis will have to wait till 2014 is completed. That way I can work on history files that include the whole of 2014. I'll have my friends in Belgium download them :)
Here's a list of things I think I can manage, in rough order of how hard it will be, or how far I've gotten. WE could of course manage much more, much better, much sooner. But that means YOUR help. I should stop watching motivational posters.
- cumulative number of contributors, or active contributors by year
- number of nodes, ways, polygons (created, deleted, touched)
- notes resolution
- proportion of data contributed by 'local' contributors
- number of mapped hamlets/villages/towns/cities
- kilometers of roads by type
- proportion of area covered by land use
I'm very interested in other suggestions. Especially if they come with a script that gets the numbers out of a OSHistory file.
Viajando en Sudamerica con movilidad propia, me surprendio la calidad de la informacion. En Chile y Ecuador, esta muy claro que hay una cuminidad trabajando duro. En el Peru falta mas trabajo, pero gracias a imports, la mayoria de los pueblos tiene calles con nombres, aun que ni hay cobertura Bing. Lo que para mi era una de las lacunas mas importantes, es informacion sobre la calidad de las rutas.
En el Peru, por ejemplo, hay muchas carreteras que hace poco se asfaltaron. Sin asfalto, eran muy dificiles, ahora mucho mas facil. Pero, como Mapnik es Eurocentrico, no toma en cuenta esta informacion. Si una carretera es importante, en Europa esto siempre estaria asfaltado. Si es que la carretera es poco importante, todavia poco probable que es camino de tierra. En paises como Peru y Bolivia, no es asi. La carretera no tan grande entre Cajamarca y Chachapoyas se encuentra con asfalto nuevito, mientras la carretera importante de Huaraz hacia la costa por el Norte tiene un parte importante sin asfalto.
Si uno planifica un viaje, no solo es importante que este la informacion, pero tambien que se visualisa. Mapnik tiene dos fallas, aplicandole en Sudamerica. Primero, que no se ve la diferencia entre paved y unpaved. Y lo que no se ve, no se mapea. Segundo, que el estilo es hecho por paisas pequenos con muchas carreteras. La preocupacion es de que no entra tanta informacion en la pantalle que ya no se puede leer. En Sudamerica, hay tan pocos carreteras que el problema es al reves: hay que ir a niveles de zoom muy altos haste que se ve donde estan las carreteras. (otra razon, creo yo, porque tantas carreteras se pusieron como trunk)
Que podemos hacer?
Completar datos, y mejorar la visualizacion.
Mapear todos los surfaces y calidades de la rutas que conocemos
Quisiera pedir a toda la comunidad Latinoamericano de mapear todos los surfaces y calidades de la rutas que conocemos, empezando con las carreteras mas importantes del continente. Lo que es obvia para gente local, muchas veces no lo es para extranjeros. Lo que estoy aprendiendo en mi viaje, ya poco a poco lo estoy mapeando. No solo habria que tomar en cuenta el "surface", pero tambien "smoothness", ya que existen rutas de tierras donde se puede volar y rutas de asfalto que tienen tanto hueco que uno va muy muy lento. Los dos tienen pagina wiki, aun que smoothness no esta definido como para viajeros en caro, mas bien como para ciclistas. Y falta una traduccion al español.
Pensaremos como se puede mejorar la visualizacion de esta informacion.
Abajo algo de inspiracion. Quizas existen mas applicaciones que ya toman en cuenta esta informacion. Pero hasta donde yo lo conozco, me parece que deberiamos de trabajar hacia un estilo latino, que servira para todos los paises menos poblado y can una red de carreteras no 100% asfaltado. Como primer paso, ya pedi un mapview en Osmand. Tambien existe el Humanitarian style ya toma en cuenta surface. Pero esta mapa es un mapa de fondo, no tanto un mapa como Mapnik que quiere ser un mapa completo (como dicen ellos mismos). Para ayudar hacer el primer mapeo, pueden ayudar los mapas de Itoworld: http://www.itoworld.com/map/215 y http://www.itoworld.com/map/25 . Pero no sé de mapas que tambien toman en cuanta smoothness - aun que esto ya es un gran desafio para visualizar. Quizas hay que buscar la solucion en routing: de A hacia B vas a pasar 100 km de asfalto bueno, 50 kilometros de tierra bueno y 25 kilomtros de asfalto malo.
There is no navigation app like Osmand. But it is quite complicated. So I made this write-up based on what I've learned over the past two years using it. I wrote it with people like myself in mind: navigating overland trips in third world countries.
Feel free to suggest changes, additions or to copy/paste.
EDIT: yeah, so my little hosting package didn't agree with your interest (you consumed 12 times my allotment). Fortunately the nice people at OSM.be came to the rescue and offered me some space. Thank you Ben!
I have a big scheme in my head to do somehing fun with OSM data. Unfortunately I'm still taking babysteps. Still, here is one step that makes me pretty happy: a map of the evolution of La Paz, Bolivia. EDIT: as I'm a disaster in reading manuals, I didn't add timestamps to the first few tries. I'll re-run them with timestamps when I get the time.
I can make an animation like that for any bounding box with just a couple of minutes work (and some waiting time, depending on the data-density of the area).
In fact, doing this is extremely easy. It still took me two months :) All you have to do is follow the instructions here: https://github.com/MaZderMind/osm-history-renderer/blob/master/TUTORIAL.md (this was very helpful too: https://github.com/MaZderMind/osm-history-splitter)
Only for me, that meant setting up a VirtualBox with Ubuntu, understanding how to install software on Ubuntu and how to fix messed up installations, getting Ubuntu to be able to read data from my host Windows 8 laptop. A big challenge was also not throwing the laptop out of the window (everyone LOVES Windows 8, right?). I could have not done this without Mazdermind Peter who didn't just make the data available in a workable format and the tools to work them, but also gave me personal support. Eternal gratitude and what not. Also free Belgian beer (or chocolate) on any future IRL meetings.
If you want something similar for a bounding box that interests you, let me know. Just send me a bounding box made with http://maps.personalwerk.de/tools/bbox-paint.html and send me the "EPSG:4326" line. That way I can stupidly copy paste the coordinates.
Next step: creating yearly statistics about the state of the map.
Here are some requests:
EDIT: Peter suggested I make an extract of my setup for him to distribute. That would mean you can install VirtualBox (easy), load up a copy of my VM (should be easy), write three lines of codes and you have your thing (easy).
EDIT: There are some bugs visible (Krakow, Kathmandu). It might be I failed to do an update for some of the involved packages. If after a re-run they still show, I'll try and make some bugreports for Peter.
I woke up one morning, and realized I needed a reusable dataset of all the communities in the world. Not just X-Y, but administrative areas. Obviously, I started looking on OSM. With a bit of playing around (and a little help from my friends), I had a nice set of admin areas of various levels from OSM in a shapefile. Then I started noticing holes. If a country is mostly made of holes, you know there is no data. But if there are a few holes, well, something is fishy. What happens is, there are no extraction tool that can make an admin area if there are gaps. A line is not an area, only a closed line is. Borders in OSM are relations. These are collections of lines, joined together in virtual union. Often, someone deletes one of these lines, and replaces it with a more detailed version. That's sort of OK, but then you have to add the new line to -all- the relations the old one was part of. This is of course very exotic to new mappers, and even experienced mappers don't always seem to care.
Data use = data cleaning
Data will only get fixed, if they are used. And even on the forum, I saw people referring end data users to other sources to get their admin areas. It is complicated to extract a dataset with borders from OSM. So why care about this data? It shows up okay on OSM even if it's broken. BUT, user Wambacher made this great tool to download shapefiles by country with all available admin areas. So now that it's easy to use the data, please help maintain it.
Fixing things up
I tried several tools to help fixing things. I have a global focus, so I've mostly been doing fixups on the admin level right below the country. Often states, sometimes departments, etc. If you're going to fix a certain area, you're going to need other tools (see below) - choose a level to fix using layers.openstreetmap.fr - check which are missing - find the relation which is broken (or rarely, missing). Search doesn't always work. Zoom to a frontier which at one side is ok, at he other isn't. Click on the frontier and find the relations it is part of. Copy the ID. - Vizualize the relation with an url like this: http://www.openstreetmap.org/relation/3624100 This shows obvious defects. If defects are more subtle (little holes, almost-junctions), go to http://analyser.openstreetmap.fr/cgi-bin/index.py and paste the ID.
Causes of trouble
During the fixups I found many different types of errors. Borders are basically always the result of imports. Duh. Messing around with borders are a good way to understand why imports are controversial. It's easy to do it wrong, and hard to do it right. Sometimes there is data from the original shapefiles that were used, like area=123. In some cases, the original polygons are still there. And where it gets really messy, is when you add detailed borders from source X on top of general borders from source Y. It takes a lot of time and effort to clean that up, and importers don't always get around to finish it all up. Once the data are there, information may be redacted, because the original data wasn't compatible with our licence. Most often: no commercial reuse, the menace of all open data users. And redaction leaves a mess. Another source of trouble is including both seafront and a maritime border. These should be separated. Apart from that, most errors come from beginners or experienced users alike who delete a line and replace it with something else. Simple solution: use shift-click to improve geometry instead. That makes understanding the history of an area much easier too.
In depth error checking
If you want to go in depth in a certain area, these tools will come in handy: http://keepright.at/report_map.php > Click "none" (bottom left), then only activate the "Boundaries" checks
This even vizualizes all the broken relations, but it's just plain depressing to use: http://tools.geofabrik.de/osmi/?view=multipolygon
Using Overpass turbo you can quickly get the ID's for the admin areas in the area you're fixing. That makes it a lot easier to start fixing things. For example, using this query you get a map with all the relations defining level 4 admin areas. Don't zoom out when running, as you will be downloading too much data.
So, about six months after promising Ben Abelshausen, I'm finaly organizing a MeetUp in Antwerpen.
As I know organizing isn't quite my forte, I've been thinking a lot about what other thing one can do to make OSM more social today. As opposed to "how could we make it more social if we were programmers with all the time in the world". So, there was one piece of very low hanging fruit we identified on the last "monthly" "summer" meetup in Gent . We could just let all the new contributors in our area know that OSM are people. So when you join, you wouldn't think OSM is somewhere you just dump some data, but a place where people actually meet and collaborate on a common dream. (in case you wondered: yes, nerds can be romantic)
So I made a Google Spreadsheet where all new Belgian contributors are listed quite clean. You can click a link to their profile and send them a welcome message. After you sent it, you just add some info to the spreadsheet, so the same new user wouldn't be getting more than one welcome message. I made a draft standard welcome message (in Dutch) anyone can use or modify of completely throw away and make something better.
I don't want to be doing this alone. I would like to share both Google Documents with anyone who wants to help out. All you need is a Google Account. We'll obviously need a French translation. And I would like there to be different letters of introduction.
If you're interested, I used (of course) one of neis-one's services. This RSS is read by an IFTTT recipe into Google Docs. It's read a bit messy, so I introduced a second worksheet that cleans it up a bit with some basic formulas.