Some months ago, I was looking around OSM to find where the bulk of the noise and inefficiency is. I’m aware of some other efforts (like Toby in 2013), but I actually went so far as to write a C++ app on Osmium which parses PBF extracts, simulates line simplification, and produces a list of the ways that are least efficient.
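The core of that "simulated simplification" can be sketched as a plain Douglas-Peucker pass that just counts how many nodes would survive at a given tolerance; a way's inefficiency is then roughly (original nodes) / (surviving nodes). This is a minimal stand-alone sketch, not the actual Osmium app — the PBF reading, projection, and exact scoring metric are omitted/assumed:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Pt { double x, y; };

// Perpendicular distance from p to the line through a and b.
static double perp_dist(const Pt& p, const Pt& a, const Pt& b) {
    double dx = b.x - a.x, dy = b.y - a.y;
    double len = std::hypot(dx, dy);
    if (len == 0.0) return std::hypot(p.x - a.x, p.y - a.y);
    return std::abs(dy * (p.x - a.x) - dx * (p.y - a.y)) / len;
}

// Douglas-Peucker: mark which interior points survive at `tol`.
static void dp(const std::vector<Pt>& pts, std::size_t lo, std::size_t hi,
               double tol, std::vector<bool>& keep) {
    if (hi <= lo + 1) return;
    double maxd = 0.0;
    std::size_t idx = lo;
    for (std::size_t i = lo + 1; i < hi; ++i) {
        double d = perp_dist(pts[i], pts[lo], pts[hi]);
        if (d > maxd) { maxd = d; idx = i; }
    }
    if (maxd > tol) {            // farthest point matters: keep it, recurse
        keep[idx] = true;
        dp(pts, lo, idx, tol, keep);
        dp(pts, idx, hi, tol, keep);
    }
}

// How many nodes would survive simplifying this way at tolerance `tol`?
std::size_t simplified_size(const std::vector<Pt>& pts, double tol) {
    if (pts.size() < 3) return pts.size();
    std::vector<bool> keep(pts.size(), false);
    keep.front() = keep.back() = true;
    dp(pts, 0, pts.size() - 1, tol, keep);
    std::size_t n = 0;
    for (bool k : keep) n += k;
    return n;
}
```

A 300-node stream that collapses to 15 nodes under a sub-meter tolerance would score 20x here, which is exactly the kind of way the app flags.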
I ran this on various US states and countries worldwide, and the winner is… North Carolina. It is so wildly inefficient that we may as well not bother with the rest of the world until we’ve cleaned up North Carolina first. (Just for comparison, the number of ways flagged: Finland 17k, Colombia 20k, Colorado 30k, England 45k, North Carolina >300k.)
Why is North Carolina (henceforth NC) so obese? There are a handful of bad spots elsewhere (like some of the Corine landuse in Europe, and a waterway import in Cantabria, Spain) but nothing close to NC. It’s due almost entirely to a single import in 2009. The USA’s hydrography dataset, NHD, is truly massive. An account called “jumbanho” imported NHD for NC and apparently applied almost no cleanup (besides a small pass at removing duplicate nodes a few months later). Among the many flaws of that import:
- Topology is mostly missing (features meet but don’t share a node)
- Really out of date (shows swamps that were drained decades ago, streams running through what are now shopping malls).
- Almost all of it is barely or not at all decimated (a stream which is perfectly modeled in 15 nodes is sometimes made of 300 nodes).
As a result, the jumbanho account has noderank #3 with 43 Mnodes (this was rank #2 with 49 Mnodes, but as I’ll explain, I’ve been busy).
This is what the data looks like:
As you can see, a regular set of evenly-spaced nodes, with no decimation. This is all the worse when you consider that the positional accuracy is far lower than that node spacing suggests: here the pond is 7 m off, and the stream is variously 12, 14, or 32 m off:
This inefficiency is bad in a few ways, such as making the planet file balloon with dead weight. But a more relevant issue is this: When a user comes in here to fix the alignment of the data, there is NO WAY they can be expected to move all 200 points by hand. An import with too many points is highly resistant to EVER getting manually fixed. The solution is to simplify first, but by how much? Here we encounter some issues:
- The simplify tool in JOSM defaults to 5 (!) meters, which is brutal and useless for just about any use I can think of (maybe very, very rough old GPS traces?).
- JOSM lets you change the amount, but it is buried deep in the “advanced preferences.”
- Once you find that, knowing how much to simplify each kind of feature is a matter of experience and skill.
After hundreds of hours of manual work on NC, I have learned what values work: general guidelines which I carefully tweak for each area:
- natural=wetland. These are very rough, 1.0-1.2 m.
- waterway=stream, waterway=riverbank, natural=water. They are more delicate, I use 50-80 cm.
- Streams and rivers which are either inside wetlands or are “artificial paths”: these are often notional and don’t correspond closely to any real feature, so >1 m.
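The guidelines above could be encoded as a small lookup; this is just my own toy helper (the tag keys and values are real OSM tagging, but the function, the single representative values, and the `inside_wetland_or_artificial` flag are my assumptions):

```cpp
#include <cassert>
#include <string>

// Rough simplification tolerance, in meters, per feature type.
// Values sit in the middle of the ranges given above; tweak per area.
double suggested_tolerance(const std::string& key, const std::string& value,
                           bool inside_wetland_or_artificial) {
    if (key == "natural" && value == "wetland")
        return 1.1;                     // very rough features: 1.0-1.2 m
    if (key == "waterway" && inside_wetland_or_artificial)
        return 1.2;                     // notional lines: > 1 m
    if ((key == "waterway" && (value == "stream" || value == "riverbank")) ||
        (key == "natural" && value == "water"))
        return 0.7;                     // delicate features: 50-80 cm
    return 0.0;                         // unknown: don't simplify automatically
}
```

The point of returning 0.0 for anything unrecognized is that over-simplifying the wrong feature class is worse than leaving it alone.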
Note that these levels of precision are WAY less than the actual inaccuracy of the data; they cannot harm the value in the data, because they are too small. In fact they could be bigger, but the goal is to leave enough nodes so that human editing won’t have to add or remove many nodes when they align the feature to its correct location.
While I would be happy to just write a bot to do that first step, that would be a “mechanical edit” and I’d have to put up with mailing list arguments to get permission. (I’d also have to write that bot, which I’ve been too lazy to do so far). So instead, I’ve put in the time to do it all manually in JOSM, with steps like:
- Study each area, compare the features to the imagery.
- Do some super careful simplify with appropriate values. (It gets really tiring having to dig into JOSM’s advanced preferences every single time I change the value.)
- Fix the topology by carefully tuning the validator’s precision and allowing it to auto-fix, with manual verification.
- Some manual adding of bridges and culverts.
- Removing/updating non-existent wetlands and streams (one common clue: they intersect buildings).
- Splitting some ways and creating relations, for example for a large riverbank and wetland that share an edge.
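The topology step is essentially endpoint snapping: a stream that ends a few centimeters from the river it flows into should share a node with it. A minimal sketch of the idea — my own toy code, not what JOSM’s validator actually does internally, and it assumes coordinates already projected to meters:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

struct Node { double x, y; long id; };

// If the dangling endpoint `end` lies within `tol` meters of a node of
// `other_way`, return that node's id (i.e. merge them); otherwise return
// the endpoint's own id unchanged.
long snap_endpoint(const Node& end, const std::vector<Node>& other_way,
                   double tol) {
    long best_id = end.id;
    double best = tol;                  // only accept candidates within tol
    for (const Node& n : other_way) {
        double d = std::hypot(n.x - end.x, n.y - end.y);
        if (d <= best) { best = d; best_id = n.id; }
    }
    return best_id;
}
```

The "carefully tuning the validator's precision" part corresponds to choosing `tol`: too small and real junctions stay disconnected, too large and separate features get welded together — hence the manual verification.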
Here is that same area after a simplify to 70cm on the NHD features, then quick manual alignment:
It’s exhausting. In fact, a bot wouldn’t really help that much, since the simplify is only the first step; the topology and the rest still need to be done by a human anyway.
By my rough calculation, if I work hard for 5 hours every night, it would take around 5 months for me to finish cleaning up NC NHD to a decent level.
On the plus side, other NHD imports I’ve seen around the USA (like Oklahoma) don’t seem to be nearly as bad; while they suffer from most of the same quality issues, at least they were already simplified before uploading.