Atakua's Diary

Recent diary entries

Postmortem of Naturvårdsverkets dataimport to Openstreetmap

Posted by Atakua on 9 February 2020 in English (English). Last updated on 17 February 2020.

This post was originally published at my personal page, but I assume more people might read it here. Some formatting may be off as I converted it from HTML.

Around March 2019 I started working on importing a large chunk of open data into Openstreetmap, specifically to improve the land cover coverage of Sweden. It mostly concerned areas and features of forest, farmland, wetland, and highland marshes.

This post continues and somewhat concludes the series of thoughts I’ve documented earlier in 2019:

The import plan documents a lot of technical details. I maintained it to reflect a high-level overview of the project throughout its life.

Why bother

My motivation for the project was that I was tired of tracing forests by hand. My understanding is that it would take a million hours to finish this work by manual labor alone. Being lazy, I always look for ways to automate such work and/or integrate someone else’s already finished results.

And there is a lot of work to integrate. Here’s how the map looked before the start of the project:

Overview of Sweden's unmapped land cover for start of 2019

There is other people’s work worth integrating. Many national agencies around the world now offer the geographical data they have collected and continue to maintain to everyone under liberal licenses, such as CC0 or public domain. To me, it looks strange not to even attempt to make use of this data.

Note on terminology

I will use “land cover” and “land use” as synonyms, even though they are not. I do not care enough to maintain the distinction, and, judging by the current tagging status, de facto almost nobody else does either. People predominantly use “landuse=*”, “natural=*” and “residential=*” tags to convey details of both land use and land cover. At the same time, the almost non-existent presence of “landcover=*” (and its lack of support by renderers) makes it questionable to add this tag to new data. Nobody will see the results of such work.

The main idea of the import

Naturvårdsverket’s land cover dataset consists of a single (huge) GeoTIFF where every 10×10 meter square of Sweden’s surface is classified as one of several predetermined types of land cover: forest, water, roads, settlements etc. So it is raster data, while Openstreetmap uses vector features, such as polygons and multipolygons, to represent land cover. The first step, then, is to convert the raster into vectors. The resulting vectors will certainly carry discretization noise (“90-degree ladders”), so the next required step is to smooth them to look more natural.
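The smoothing/simplification step can be illustrated with the classic Douglas-Peucker algorithm, one of the methods offered by GRASS v.generalize. This is a minimal pure-Python sketch of the idea, not the actual pipeline code:

```python
import math

def perpendicular_distance(p, a, b):
    """Distance from point p to the line through a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    length = math.hypot(dx, dy)
    if length == 0:
        return math.hypot(px - ax, py - ay)
    return abs(dx * (ay - py) - dy * (ax - px)) / length

def douglas_peucker(points, epsilon):
    """Drop points whose removal changes the line by less than epsilon."""
    if len(points) < 3:
        return points
    # Find the point farthest from the chord between the endpoints
    dists = [perpendicular_distance(p, points[0], points[-1])
             for p in points[1:-1]]
    index, dmax = max(enumerate(dists, start=1), key=lambda t: t[1])
    if dmax <= epsilon:
        return [points[0], points[-1]]
    # Keep the farthest point and recurse into both halves
    left = douglas_peucker(points[:index + 1], epsilon)
    right = douglas_peucker(points[index:], epsilon)
    return left[:-1] + right

# A nearly straight "staircase" collapses to its endpoints:
print(douglas_peucker([(0, 0), (1, 0.01), (2, 0)], 0.1))  # [(0, 0), (2, 0)]
```

In practice the threshold (epsilon) had to be tuned per dataset; as noted below, too large a value destroys polygons.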

To make sure newly added polygons did not conflict with already present map features, the two vector datasets had to be merged, or conflated. Conflation meant that some parts of the new vector data had to be adjusted: whole polygons deleted, polygons cut, borders of new and old polygons aligned, certain pieces retagged, and so on.

Because the input dataset is really huge, it would be unreasonable to attempt importing it in a single pass. Thus, I needed a strategy for splitting the input data into chunks. This meant that new, artificial boundaries would appear inside the vector dataset. As shown below, such artificial boundaries add their own unique challenges to the process.

Openstreetmap’s uniquely loose data classification scheme makes it impossible to decide algorithmically whether new data would integrate well enough without duplicating or unnecessarily overlapping anything already on the map. Thus, the final step for each data chunk before upload was to visually inspect the result of merging the two layers and to fix the problems uncovered.

Evolution of the process

There have been several iterations: some of them huge, requiring regeneration of everything from the start, others smaller touch-ups. I can now recall several major decisions that affected the result in a significant way.

  1. Import one kommun at a time. The original plan was to have 290 disjoint chunks made, then edited and uploaded to OSM. Quite a few of the vector OSM files turned out to be larger than 1 GByte of XML. However, the division of the territory into kommuns remained in some form throughout the rest of the project.
  2. Be less aggressive with smoothing. As the Vingåker experiment (see below) demonstrated, it was necessary to try several threshold values and several available smoothing algorithms to find something that removes excessive detail without destroying too much of the polygons.

  3. Cut kommuns into smaller “rectangular” tiles of 0.1×0.1 degrees latitude/longitude. Each tile could then be loaded into JOSM without consuming all the RAM. A tile could still contain tens of thousands of nodes and usually required several changesets to upload. Because of the significant overhead of manually re-sewing tiles (see below), I considered increasing their dimensions to 0.2×0.2, but never did.

  4. Pay special attention that adjacent tiles do not overlap. It was discovered during the Katrineholms kommun import that performing the coordinate system transformation too late resulted in overlaps between adjacent tiles (by up to hundreds of meters). To prevent this, it was made certain that data got converted from the SWEREF99 coordinate system to WGS84 (the one used by OSM) early, at the raster stage.

  5. Employ a second, “negative” raster layer produced from existing OSM land cover data. Its purpose is to mask data points in the import raster as if nothing were known about them. My original intersection detection heuristic only considered bounding boxes of polygons. Being too imprecise, it generated an excessive number of false positive matches, causing a lot of perfectly good new polygons to be rejected at conflation. Masking the new raster data with the “old” rasterized data guaranteed that traced vector polygons could not possibly overlap old ones (except inside a thin “coastline” buffer zone caused by the discretization noise). Of course, using two input raster images required regenerating all the vector data from scratch.

  6. Add buffered roads to the negative layer. After some consideration, it was decided to include existing OSM roads in the negative raster layer. Their position correlated with “road” land cover pixels in the input raster, and excluding areas around them left less noise in the result. “Roads” included railroads, motorways of all sizes, and pedestrian ways down to trails. The inclusion of trails (highway=path) turned out to be a questionable decision.

  7. Add water to the import. Originally I expected even the smallest lakes to be already well mapped in OSM, and therefore did not include water-related polygons, such as lakes or even wetlands. However, visual inspection of conflation results revealed that the situation was much worse than I thought. Even quite large wetlands were missing. Late in the project I decided to retain information about water and convert it into “natural=water” and “natural=wetland” polygons. This also required a modified conflation procedure specific to monolithic areas (see below).
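The effect of the negative layer (decision 5) can be sketched in a few lines of Python; the rasters here are toy nested lists and the class codes are made up:

```python
def mask_with_negative(import_raster, negative_raster, nodata=0):
    """Blank out import pixels wherever the negative layer has data.

    Both rasters are lists of rows of integer class codes; a nonzero
    value in the negative layer means OSM already maps that cell, so
    the corresponding import cell is set to the nodata value.
    """
    return [
        [nodata if neg else cell for cell, neg in zip(new_row, neg_row)]
        for new_row, neg_row in zip(import_raster, negative_raster)
    ]

# 1 = forest, 2 = water in the import; 9 = existing OSM land cover
new = [[1, 1, 2],
       [0, 1, 2]]
old = [[0, 9, 0],
       [0, 0, 9]]
print(mask_with_negative(new, old))  # [[1, 0, 2], [0, 1, 0]]
```

Vectorizing the masked raster then cannot produce polygons overlapping existing OSM features, except along the thin buffer caused by discretization.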

Besides these bigger changes, throughout the project I constantly adjusted a multitude of numerical parameters affecting the conversion process, such as cut-out thresholds, smoothing parameters, and so on. The algorithms and tools used also received numerous fixes and adjustments.

Tools used and made

Of course, the JOSM editor was the main and final tool to process data before uploading. Some adjustments to Java VM memory limits were needed for it to be able to chew through larger chunks of the import. Having a machine with 32 GB RAM also helped.

I initially used QGIS to visualize input data and iteratively apply different hypotheses to it. However, this application turned out to be not very amenable to scripting. After some time struggling with it, I realized that I was essentially using QGIS as a front end to another GIS called GRASS. I ended up using a multitude of GRASS’ individual instruments, such as v.generalize, v.clean etc., to construct data processing pipelines that took raster data and chewed on it multiple times until vector data came out.

Among the libraries for processing, converting and otherwise transforming data, GDAL was of the utmost value to me, both directly and indirectly via all the GRASS tools based on it.

Besides many existing tools and frameworks, I’ve written quite a few lines of Python, Java and Bash scripts to assist with data conversion, filtering, cleanup and conflation. Currently the bulk of this code is on Github and in my other repositories, and I continue to reuse pieces of it for my ongoing projects.


The project resulted in things visible to others, in the form of map improvements, and also in a lot of knowledge for me and hopefully for others.

What required no improvements

Of all the kommuns, quite a few had already been mapped well enough. Adding new data to them would mostly mean a lot of manual cleanup work without significant improvements to the coverage. As expected from the beginning, examples of such well-covered municipalities were the areas around the biggest cities, such as Stockholm, Göteborg, and Malmö.

The farther to the north, the less land cover data was present in OSM, and the more reasonable the import of new data seemed.

What was covered by new data

Kommun borders were selected from the beginning as the top level of the hierarchy determining the import structure. However, this turned out to be impractical: the sizes of such import units varied wildly, the areas of kommuns were often too big to visually review in one sitting, and the geometry of the borders was oftentimes too convoluted (long and twisted, with enclaves and exclaves etc.), with no practical benefit coming from blindly obeying them.

From somewhere in the middle of the project, these boundaries were only used as rough guidelines for splitting data into tiles. All tiles were of fixed size and alignment, and as such could span the boundaries of kommuns.

The following parts of the country were completed, fully or partially.

  • Vingåkers kommun. It was the first one, and the only one converted, conflated and committed as a whole in one go. It being the first, a lot of mistakes were admitted together with the data, such as an overly aggressive Douglas-Peucker simplification.

  • Katrineholms kommun. I had started to map this area manually long before, then tried to employ the scanaerial plugin to assist with tracing forests. Around 50% of the territory was prepared by these means. Finally I finished it with the import data. As the location was adjacent to the just finished Vingåkers kommun, this was my first experience of having to nicely align polygons of adjacent parts of the import.

  • Vadstena and Åstorp. Relatively small subareas which are mostly covered by farmland. Here I tuned my algorithms and learned to expect unusually tagged (multi)polygons to conflict with new data.

  • Linköpings kommun. This was basically the only support I received from someone else during the project. I did not participate much in working on this kommun, and the data used for it, as far as I can tell, was from one of the first batches I provided, so it included none of the improvements present in later iterations.

  • Åre kommun was my biggest effort so far. The result of many laborious evenings, nights and days, this kommun has been mapped the fullest. Besides the territory of the kommun itself, adjacent parts of the country (e.g. parts of Bergs kommun) were also mapped. More details are in my previous post.

  • Ljusdals kommun. Compared to earlier work, here I started to map water areas in addition to forests, farmland and other “ground” cover. I discovered that the situation with the mapping of smaller lakes in Sweden was not as good as I had originally assumed. Imported water polygons included smaller lakes and medium-sized rivers and streams represented with non-zero width. Compared to ground cover, water polygons had to be treated as monolithic (see below), which was reflected in conflation script adjustments. However, I did not finish this kommun. All work essentially halted after that.

What was not covered and why

The remaining kommuns did not receive significant changes. While there were no technical reasons to stop at this point, I was unable to reduce the amount of manual work to a level low enough to allow a single person to finish all tiles in reasonable time.

Work on Åre kommun demonstrated that individual tiles required 30-60 minutes of manual work each to import. This is still faster than one can trace an area of the same size at the same level of detail. Still, the manual operations that had not been offloaded to the scripts were becoming tedious to me.

  1. Fixing geometry warnings such as self-intersections. Despite all attempts to detect and address the majority of self-intersections at the vector simplification phase, it was not uncommon for the JOSM Validator to report up to 100 warnings per tile. The absolute majority of them were trivial short “loops”.
  2. Sewing tiles’ boundaries. Adjacent pixels of the source raster that happened to land in separate adjacent raster tiles were converted to unconnected vector features. This created a tiny but non-zero gap between them. I employed certain heuristic algorithms to close this gap where it was safe. Yet almost every vector tile required manually scrolling along its border in JOSM and sewing the remaining gaps between it and the already uploaded adjacent tiles.
  3. Addressing quality problems of pre-existing data. It was not uncommon to encounter roads traced from offset Landsat imagery, crudely drawn polygons for lakes and forests, etc. Given that this OSM data was also used for the generation of the negative raster layer, these problems could be imprinted into the new vector features to be imported.
  4. Merging boundaries of new and old polygons. Ideally, this should have been the only type of manual work needed for each tile. In reality, it was a relatively small part of it.

I go deeper into technical details behind some of these problems below.


It’s time to complain. A few factors were discovered along the way which I tend to classify as external problems, common to any sort of future import effort attempted within the OSM project.

  1. The OSM community’s inconsistency towards imports in general and land cover imports in particular. Not many arguments of level DH4 or higher were given to me. I won’t delve deeper into the social, humanitarian, technical, economical, legal and political circumstances known to me that lead to this situation. Conversations on the mailing list, while not openly toxic, tend to derail into general unsolved/unsolvable issues of the whole project. This brings little constructive feedback to the initiator of an import effort, while forcing him/her to drown under a disproportionate volume of email exchange. There are deliberately no objective formal/verifiable/measurable criteria of acceptance, meaning everyone judges data from his/her own standpoint of “beauty”. There is no authority to have the final word in a discussion, and there are no procedures for voting for or against a proposal, which gives disproportionate power to the few with louder voices. By the way, barely anyone commented on the import plan topics. There were no comments on the quality assurance sub-topic, for which I had been so hopeful. It seemed that spending so much time on documenting the project was redundant.

  2. Rounding of coordinates done on the server. I was baffled to discover that, after thoroughly making sure that no self-intersections were present in the data and uploading it, I could download it back and see my polygons self-intersect! It looked like nodes were moved a tiny bit when they were returned from the database. Comparing the original data files with the copies downloaded from OSM, I noticed that coordinate precision had been lowered: only 7 digits after the decimal point were kept. At the same time, JOSM (and its tools like the Validator) is perfectly capable of operating on coordinates with 10 or more digits after the decimal point. One needs to take this into account when running geometry checks.

  3. Poor validation tools for multipolygons. Neither offline nor online tools, neither JOSM nor Osmose, seem to report intersections/overlaps of multipolygons. From a practical standpoint, overlapping multipolygons are the same problem as overlapping polygons. Such a check may be technically harder to implement and computationally more costly to run, yes, I do realize that. But at least some basic checks, or even imprecise algorithms (with a reasonably low ratio of false positives), would still be better than nothing. I’ve spent quite some time hunting for rendering problems caused by defects in multipolygons (e.g. those caused by the implicit rounding problem).

  4. No tools for conflation of (multi)polygons. As the OSM community still manages to somehow import zero-dimensional (i.e. POIs) and linear features (i.e. road networks) from time to time, a few tools exist to assist with the conflation of those types of features. This is not the case for import data in the form of closed polygons. Most of my hand-written tools were made because nothing better (or anything at all) existed for my goals. I do understand that polygons, and especially multipolygons, are much harder to work with than e.g. roads or POIs. This, however, does not excuse the fact that nobody has prepared tools for merging, splitting, transforming etc. them.
  5. Obstacles to using GIS software to perform a read-modify-write cycle over OSM contents, especially the “write” part, which should preserve untouched objects in an unmodified state. The rounding problem above is an example of such an issue. Loading OSM data into another GIS format and then immediately exporting it (without any explicit modifications) can still produce a dataset not identical to the original.
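The rounding problem (item 2) suggests a simple pre-upload check: simulate the server’s 7-digit precision and see which nodes would move on a round trip, then re-run geometry validation on the rounded copy. A toy sketch:

```python
def server_rounded(coord):
    """Simulate the server storing coordinates with 7 decimal places."""
    return round(coord, 7)

def roundtrip_moved(nodes):
    """Return the nodes that would move when uploaded and downloaded back."""
    moved = []
    for lat, lon in nodes:
        if (server_rounded(lat), server_rounded(lon)) != (lat, lon):
            moved.append((lat, lon))
    return moved

nodes = [(59.12345678901, 14.98765432109),  # high-precision node, will move
         (59.1234567, 14.9876543)]          # already at 7 digits, stays put
print(len(roundtrip_moved(nodes)))  # 1
```

Running self-intersection checks on the rounded copy, rather than the original, would have caught my problem before upload.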

Useful tricks learned

  • Add comments describing the operations and decisions made over primitives to the primitives themselves, as tags. All warnings issued by scripts should be turned into “fixme” or similar tags. Nodes with these tags can then easily be highlighted in JOSM at the visual inspection phase.
  • Save both “survived” and “dropped” primitives into separate files to simplify debugging and to record why each feature was kept in or deleted from the dataset.
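A sketch of how the two tricks combine in a filtering script; the tag texts, the data layout and the size threshold are all illustrative, not the actual code:

```python
def partition(polygons, min_nodes=4):
    """Split polygons into review-ready and dropped sets, recording
    the reason for every decision as a tag on the feature itself."""
    survived, dropped = [], []
    for poly in polygons:
        tags = dict(poly.get("tags", {}))
        if len(poly["nodes"]) < min_nodes:
            # too small: drop, but keep the reason for later debugging
            tags["note"] = "dropped: fewer than %d nodes" % min_nodes
            dropped.append({**poly, "tags": tags})
        else:
            # keep, but flag for the visual inspection phase in JOSM
            tags["fixme"] = "imported polygon, verify alignment"
            survived.append({**poly, "tags": tags})
    return survived, dropped

polys = [{"nodes": [1, 2, 3, 4, 1], "tags": {"natural": "wood"}},
         {"nodes": [5, 6, 5], "tags": {"natural": "water"}}]
ok, bad = partition(polys)
print(len(ok), len(bad))  # 1 1
```

Writing `ok` and `bad` to separate files gives exactly the two output streams of the data flow shown below.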

The data flow approximately looks as follows:

new data ──┐                        ┌──► review-ready data (with additional fixme tags)
           ├──► conflation scripts ─┤
old data ──┘                        └──► dropped data (with new note tags)

Experience to be used later

There are better tools out there

One excellent tool I discovered too late was ogr2osm. GRASS GIS worked best with vector data in the GML format, while JOSM understands its native OSM XML format best.

JOSM’s GeoJSON and GML support was lacking at that moment, so I ended up writing my own converter from GML to OSM. ogr2osm is excellent for such work, and I plan to use it instead in the future.

Node snapping is treacherous

Often there are two vector features that need to be adjusted to have a common segment. One might think that automatic snapping (moving or merging nodes of a source line that are close enough to the destination line) would do the job in a second. But so many things can go wrong with it.

  • Which nodes to snap. Only nodes closer than a predefined threshold should be selected for modification. The distance threshold cannot be deduced automatically by an algorithm, as it mostly depends on the input data’s resolution, quality etc. It often happens that the threshold has to be adjusted for different parts of the same input.
  • Where to snap. A single node chosen to be snapped may in fact “gravitate” towards multiple destination segments, or towards multiple points on the same line. Suppose the chosen snapping threshold is too big, bigger than the linear dimensions of the destination polygon. In such an extreme situation the source node may end up glued to any position on the destination.
  • In which order to snap nodes. In the end, it is not individual nodes but two lines that we care about. Snapping is expected to preserve the original order of nodes on both lines. For the reasons outlined above, this may not happen automatically. What can happen instead is erratic “jumping” of segments. At the very least, snapping should be accompanied by a cleanup phase that detects and corrects the problems it has created.
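To make the first two decisions concrete, here is a minimal snapping sketch that projects each source node onto the nearest destination segment and only moves it if it lies within the threshold. It deliberately ignores the ordering problem, which is exactly what a real implementation must additionally solve:

```python
import math

def closest_point_on_segment(p, a, b):
    """Project p onto segment a-b, clamped to the segment's ends."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg2 = dx * dx + dy * dy
    if seg2 == 0:
        return a
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / seg2))
    return (ax + t * dx, ay + t * dy)

def snap_nodes(source, dest, threshold):
    """Move every source node within threshold of the dest line onto it."""
    snapped = []
    for p in source:
        candidates = [closest_point_on_segment(p, a, b)
                      for a, b in zip(dest, dest[1:])]
        best = min(candidates, key=lambda c: math.dist(p, c))
        snapped.append(best if math.dist(p, best) <= threshold else p)
    return snapped

line = [(0.0, 0.1), (1.0, 0.5)]    # first node is near dest, second is not
dest = [(-1.0, 0.0), (2.0, 0.0)]   # the line to snap against
print(snap_nodes(line, dest, 0.2))
```

Even this toy version shows the threshold dilemma: raise it and the second node snaps too, possibly destroying the line’s shape.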

To quote v.clean tool=snap manual page:

The type option can have a strong influence on the result. A too large threshold and type=boundary can severely damage area topology, beyond repair.

I have some ideas about an iterative snapping approach. Lines are “stretchy” and “flexible” and are “attracted” towards each other, with attraction forces competing against repulsion forces. As a result, the most “natural” (i.e. minimal potential energy) relative position of the lines is achieved. I do not know, however, how hard it would be to fine-tune the parameters of the attraction/repulsion forces for the process to be stable enough, or whether the performance of such an algorithm would be sufficient for practical use.

Keeping knowledge about sewing points is important

When splitting bigger geographical data files into smaller chunks along arbitrarily chosen (i.e. not dictated by the data itself) borders, try to keep information about the split points to simplify re-gluing the adjacent resulting data. Without such information, split points must either be detected algorithmically, which is neither trivial nor reliable, or specified manually, which is more time-consuming than one would think.

It does not matter whether data is organized into regularly shaped rectangular tiles, or follows less regular but nevertheless arbitrary administrative borders.

Ignoring the problem won’t help either, as it results in unconnected linear features and in gaps and/or overlaps between landcover features.

Of course, the aforementioned problem does not affect zero-dimensional imports, such as POIs. They can be organized in arbitrary subsets without disturbing any sort of (non-existent) relational information.

Here are some ideas on how to handle the re-sewing problem.

Vector input

For vector polygons, maintain the correspondence between the unique node IDs used internally by your source (i.e. negative numbers in OSM XML files) and the unique node IDs assigned to them by the OSM database after they were successfully uploaded. Suppose two polygons are uploaded independently because they lie in different tiles. To avoid uploading duplicates of the nodes shared by them, after the first polygon has been uploaded, the IDs of those nodes (and the refs to them) have to be updated in the second one. This way, the shared points are only uploaded once, and the second polygon refers to their already present instances instead of introducing duplicates with different IDs.
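The bookkeeping can be sketched as a simple ID remapping pass; the data layout is illustrative:

```python
def remap_refs(ways, id_map):
    """Rewrite negative placeholder node refs with the real OSM IDs
    assigned by a previous upload, so shared nodes are reused."""
    return [[id_map.get(ref, ref) for ref in way] for way in ways]

# After uploading tile A, the server assigned real IDs to two of our
# placeholder (negative) node IDs:
id_map = {-1: 101, -2: 102}

# Tile B's polygon shares nodes -1 and -2 on the common tile border;
# after remapping, only -3 and -4 will be uploaded as new nodes:
tile_b_ways = [[-1, -2, -3, -4, -1]]
print(remap_refs(tile_b_ways, id_map))  # [[101, 102, -3, -4, 101]]
```

The `id_map` itself comes for free: the OSM API upload response pairs every placeholder ID with the newly assigned one.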

Alternatively, duplicate nodes can be merged after everything has been uploaded, as detecting them should be easy. However, this is less elegant, and creates some room for mistakes, as we essentially recreate shared borders instead of preserving them in the first place.

Raster input

For raster data, the process is more involved, as the vectorization phase and follow-up simplification passes can easily move nodes and destroy adjacency information close to tile borders. Recovering such information from adjacent vector polygons is even less reliable, as it involves some sort of node snapping to lines. Node snapping moves existing nodes or adds new ones, and is therefore capable of corrupting the geometrical properties of polygons: creating self-intersections, loops etc. I can imagine two approaches to the problem.

  1. Modify vectorization algorithms to treat tile border pixels, and the nodes produced from them, in a special way. The resulting vector features should retain information about which segments were traced from tile boundaries. The same applies to line simplification algorithms: they should neither move nor delete nodes originating from tile borders.

  2. Use natural borders present in the data to determine the form and boundaries of smaller chunks. For example, if exclusively tracing forest areas, linear borders such as water coastlines, (buffered) roads, rivers etc. can be used to delineate where a free-form chunk ends. The problem with this approach is that there are no guarantees about the result’s size, form or run time. A single stretch of forest can basically span the whole country. Inevitably, artificial cuts have to be introduced into the data. Minimizing their length while keeping the chunk size within limits would be an interesting challenge.
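The first approach boils down to flagging nodes that originate from the artificial tile grid so that later passes leave them alone. A toy sketch, assuming the 0.1-degree tile grid used in this import:

```python
TILE = 0.1   # tile size in degrees
EPS = 1e-9   # tolerance for floating-point comparison

def on_tile_border(lon, lat, tile=TILE, eps=EPS):
    """True if the point lies on a line of the artificial tile grid."""
    def on_grid(v):
        frac = v / tile
        return abs(frac - round(frac)) < eps
    return on_grid(lon) or on_grid(lat)

def lock_border_nodes(nodes):
    """Pair each node with a 'locked' flag that a simplifier must honor:
    locked nodes may be neither moved nor deleted."""
    return [((lon, lat), on_tile_border(lon, lat)) for lon, lat in nodes]

nodes = [(14.7, 59.05), (14.73, 59.05)]  # first node lies on a tile edge
print([locked for _, locked in lock_border_nodes(nodes)])  # [True, False]
```

A simplification pass that respects the flag would then leave the artificial borders of two adjacent tiles identical, making re-sewing a trivial node merge.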

I am not the only one who considers tile borders to be an issue for the automatic processing of geometrical features. From Facebook’s RapiD FAQ:

  1. Why does RapiD crop roads at task boundaries? Are you concerned about the risk of creating disconnected ways?

We believe this is a general problem when working on tiled mapping tasks. We learnt from the community that when working on tasks on HOT Tasking Manager, a general guideline for the mappers is to draw roads up to the task boundary to avoid creating dupes across tasks, so RapiD is designed to align with this guideline. When the user is working on the neighbor task later, the close-node check or crossing-way check will have a chance to catch the disconnected ways and help the user fix them.

So far, I have used the “natural borders” approach only in a limited form: roads running through forests cut them into smaller bits. For rectangular tile borders, a lot of manual work was needed to recover common borders late in the process, because the vectorization and simplification tools I used did not care about preserving the required information.

Monolithic and non-monolithic features

Here is another concept I discovered when I decided to import water polygons.

For forest areas it is quite OK to split them along arbitrary borders. Generally, someone who maps forest areas manually starts by drawing a border until he/she becomes tired or hits the limit of at most 2000 nodes per way. The current portion of forest then gets closed with long straight lines going right through the forest mass so that the polygon becomes closed. This polygon then gets uploaded. The next adjacent section of the same forest is traced the same way, combining “real” borders with the previously drawn artificial ones. The picture below illustrates this situation.

Non-monolithic forest

However, closed water objects, such as lakes, even huge ones with borders spanning well over 2000 nodes (and thus represented as multipolygons), are traced and treated differently. People do not tend to create arbitrary internal borders for them.

In other words, there are few examples of a lake being treated like this:

Non-monolithic lake

Instead, a nice single polygon is usually drawn:

Monolithic lake

It seems that land cover classes, among other classes of area-like features, gravitate towards one of the following types.

  • Non-monolithic land cover, for which arbitrary internal borders are allowed and in fact welcome to keep feature size in check. Examples are: forests, farmland, long riverbanks.
  • Monolithic land cover, where feature size does not justify splitting it into arbitrarily delineated chunks. Lakes are the most prominent examples, even ones as complexly shaped as Mälaren. A feature with a defined name has more chance of being treated as monolithic: when there is a single name, a single feature seems reasonable. But this gets impractical for e.g. riverbanks, which may be very long.
  • Land cover with undefined traditions or rules. An example would be wetland. I would risk saying that wetlands are even more mysterious to average mappers (such as myself) than forests. There are so many types of them, and their borders are even less defined than forests’. Sometimes it makes sense to treat them as a more water-like type; other times it is convenient to consider them to behave like forests.

The import raster data already has arbitrarily defined split lines. These lines may split monolithic features into two or more vector pieces. This is undesirable, as it goes against the tradition of mapping such features.

For this reason it was decided to pay special attention to new water polygons lying close to tiles’ borders and, when needed, to merge multiple pieces back into a single object. This, however, had to be done manually.

A good thing about monolithic features, however, is their “all or nothing” nature, which can be used when conflating them against already mapped counterparts. A special algorithm compared the “closeness” ratio of the borders of a new and an old water feature to decide whether or not they corresponded to the same object. Such a comparison would make no sense for non-monolithic features, as parts of their borders may be defined arbitrarily by a user’s will rather than by properties of the physical world.
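Such an “all or nothing” test can be sketched as the fraction of new-border nodes lying close to the old border; the threshold and cutoff values here are made up, not the ones from my scripts:

```python
import math

def closeness_ratio(new_border, old_border, threshold):
    """Fraction of new-border nodes lying within threshold of
    some node of the old border."""
    close = sum(
        1 for p in new_border
        if any(math.dist(p, q) <= threshold for q in old_border)
    )
    return close / len(new_border)

def same_object(new_border, old_border, threshold=0.001, cutoff=0.8):
    """Monolithic conflation: either the borders mostly agree and the
    new feature duplicates an existing one, or it is genuinely new."""
    return closeness_ratio(new_border, old_border, threshold) >= cutoff

old_lake = [(0.0, 0.0), (0.001, 0.0), (0.001, 0.001), (0.0, 0.001)]
new_lake = [(0.0001, 0.0), (0.001, 0.0001), (0.0009, 0.001), (0.0, 0.0009)]
print(same_object(new_lake, old_lake))  # True
```

A production version would measure node-to-segment distance instead of node-to-node, but the binary keep-or-reject decision is the point.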

Additional results

  • A better understanding of Openstreetmap’s intrinsic conflicts and differing views, including the sort of idealistic philosophy some people preach. The anarchy allowed in certain aspects of the project’s existence clashes with the desire for rigid control in other aspects.

  • Tools for operating on OSM files and related vector formats, available on Github. They surely duplicate a lot of functionality already existing in GIS systems. Surely these tools are mostly useless for anyone but myself because…

  • There are much better programming tools available for processing geometrical information, and I need to learn them and start using them. Libraries such as shapely, ogr, (geo)pandas and others could have simplified some of my work, had I known about their existence in advance. Osmosis is an OSM-focused framework I might also need to learn a bit.


This post continues where the previous one left off.

After some time spent on processing and importing land cover data, I have several ideas on how to further improve and streamline both the import process and, more generally, work with land cover features in JOSM.

Suggested tools to help with land cover data

Certain typical tasks arise over and over again when one works with polygons meant to represent land cover, regardless of whether they are imported or manually traced. At the moment there are no adequate tools in JOSM to assist with such tasks.

The trick here is not to find an exact geometric solution to the task at hand, but rather to imitate what a human would reasonably do to finish it. And a human would cut corners, trading some inexactness for speed of completion.

Floodfill tool

A common task is to fill a gap between two or more polygons. An example would be mapping a new farm field situated between several forests or squeezed between several intersecting road segments. Currently one has to carefully trace a new way along the existing borders, either reusing their nodes or leaving a small gap between the new way and the adjacent ones.

The idea here is similar to pouring a bucket of paint into the middle of the empty area and then letting it spread out naturally. The paint spreads until it hits the borders, or until it runs out.

The same approach can be implemented in a tool that starts from a single node (or rather, from a tiny closed way) which then grows in all directions. Its growth stops when a segment of the new way hits a boundary in the form of an existing way. Optionally, the new way can then be snapped to the existing way there.

It sounds simple, but a robust, fast and reliable implementation will require some cleverness. I can already see a couple of details that will need attention.

  1. The resulting polygon does not have to fill precisely the intended area. It is not bound to share all its boundary segments with surrounding features, or reuse all of their nodes. Surely, some nodes and segments may be shared, but solutions that leave a small configurable gap between the new polygon and old polygons are also acceptable.

  2. The resulting polygon should treat itself as a boundary as well. It may happen that a resulting figure has holes in it, and a new polygon should become a multipolygon. The simplest alternative is to keep it as a polygon with thin “bridges” in it, almost wrapping around the inner holes.

  3. Surrounding land cover features are not guaranteed to create a closed perimeter around the empty area one wants to map. There is a risk that the contents of the new area will leak outside through gaps it finds between the surrounding features. To prevent uncontrolled spreading of the new way through such leaks, two strategies may be applied. First, the total area allowed to be covered by the floodfill process should have a hard limit. Second, spreading through such holes can be detected by giving new nodes a non-zero buffer diameter. This effectively makes them thick, so they cannot squeeze through gaps smaller than a predetermined value.
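The spreading idea, including the hard area limit from point 3, can be sketched on a raster grid. This is a deliberate simplification of the vector problem; the grid encoding and names are made up for illustration:

```python
from collections import deque

def flood_fill(grid, seed, max_cells=10_000):
    """Grid-based imitation of the bucket-fill idea: spread from the
    seed cell in four directions, stop at barrier cells (value 1),
    and abort once the hard area limit is reached.

    grid -- list of rows, 0 = empty, 1 = existing boundary
    seed -- (row, col) starting cell
    Returns the set of filled cells, or None if the fill leaked past
    max_cells (i.e. the surrounding perimeter was not closed enough).
    """
    rows, cols = len(grid), len(grid[0])
    filled, queue = {seed}, deque([seed])
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols \
                    and grid[nr][nc] == 0 and (nr, nc) not in filled:
                filled.add((nr, nc))
                queue.append((nr, nc))
                if len(filled) > max_cells:
                    return None  # leaked: area limit exceeded
    return filled
```

The "thick nodes" trick from point 3 would correspond to inflating the barrier cells before filling, so that narrow gaps close up.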

Polygon breaking tool

There are many situations when splitting a closed way that either intersects itself or overlaps with another one makes sense. Possible uses of such functionality include:

  1. Fixing self-intersections of small loops left after simplification or coordinate transformations of polygons.

  2. Assuring new polygons have a nice single common border with old ones by cutting them in two along that boundary and then removing the smaller part as noise.

This is similar to what v.clean [1] does with options tool=break and tool=bpol.

Another useful addition is cleaning up zero angles in polygons, just as tool=rmsa does. These are always artifacts not worth keeping, so it should be possible to remove them.
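A minimal sketch of that rmsa-style cleanup, assuming a ring is given as a list of (x, y) tuples with the first point repeated at the end (the function name and tolerance are my own):

```python
def remove_spikes(ring, eps=1e-9):
    """Remove zero-angle 'spikes' from a closed ring: a vertex B whose
    neighbours coincide (A -> B -> A), the artifact that v.clean's
    tool=rmsa removes. Repeats until no spikes remain."""
    def close(p, q):
        return abs(p[0] - q[0]) < eps and abs(p[1] - q[1]) < eps

    pts = ring[:-1]  # drop the duplicated closing point while editing
    changed = True
    while changed and len(pts) > 3:
        changed = False
        for i in range(len(pts)):
            a, b = pts[i - 1], pts[(i + 1) % len(pts)]
            if close(a, b):
                # delete the spike tip and one of the duplicate neighbours
                for j in sorted({i, (i + 1) % len(pts)}, reverse=True):
                    del pts[j]
                changed = True
                break
    return pts + [pts[0]]  # re-close the ring
```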

Smart and configurable tool to snap nodes

Having nice common borders between land cover features without any under- or overshoot is a hard task requiring a lot of manual labor.

Dragging or replacing nodes to make a common border is called "snapping" them to a new place.

The current built-in JOSM tools (the "N" and "J" actions) lack configurability and cannot be used at scale, although they are very useful for patching something up.

The main problem with snapping is that it can damage geometry quite significantly if done without measure.

To decide which nodes to move and which to keep, a threshold distance is used. However, the nodes to be snapped are usually already organized in ways, and keeping those ways sane after some of their nodes have been moved is a huge task.

Two issues come up often: 1) which target to snap a node to when there are multiple alternatives within the threshold distance, and 2) in which order to snap several nodes onto a way.

If several nodes from the same source way are snapped to different ways in its vicinity, the result is often a mess of self-intersecting ways.

Even if the same target way is chosen for several nodes, the wrong insertion order produces zero-angle segments and annoying overlaps with destination ways within a narrow threshold area around them.

What I think is needed is to treat the problem as an optimization task. An algorithm may iterate over small movements of individual source nodes, which move according to a "force field" or potential function defined by the positions of destination ways. Source nodes far enough from destination segments feel no incentive to move and stay mostly still. Nodes within the threshold distance are attracted to destination ways and eventually get placed on them. To prevent nodes from ending up in the wrong order, only one node per step is allowed to be included into a destination way, which can then be done unambiguously. In subsequent iterations that node is treated as if it had always been there, and incoming snapped nodes are automatically ordered correctly relative to it (XXX is it true? can I prove it? are there counterexamples?)

To prevent source segments from overlapping with destination segments, the force potential should repulse source segments from destination nodes. This way, source nodes are attracted to destination ways, while source segments are kept away from destination nodes.

As pointed out earlier, we are not looking for a 100% correct solution but for one that is good enough without being too destructive. The bad part is that this algorithm is more complex than simple snapping, which could be done in linear time. The number of iterations needed to reach a stable result may be hard to guess in each case.
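The attraction step could look roughly like this for a single destination segment. This is a toy sketch: the step factor, threshold and iteration count are arbitrary assumptions, and a real tool would handle many segments plus the ordering constraint described above:

```python
import math

def snap_iteratively(nodes, segment, threshold=10.0, step=0.3, iters=50):
    """Each iteration moves every node a fraction of the way toward its
    projection on the destination segment, but only if it already lies
    within the threshold distance ('attraction' toward the way).

    segment -- ((x1, y1), (x2, y2)); nodes -- list of (x, y)."""
    (x1, y1), (x2, y2) = segment
    dx, dy = x2 - x1, y2 - y1
    seg_len2 = dx * dx + dy * dy

    def project(p):
        # nearest point on the segment to p
        t = ((p[0] - x1) * dx + (p[1] - y1) * dy) / seg_len2
        t = max(0.0, min(1.0, t))
        return (x1 + t * dx, y1 + t * dy)

    pts = list(nodes)
    for _ in range(iters):
        for i, p in enumerate(pts):
            q = project(p)
            d = math.hypot(p[0] - q[0], p[1] - q[1])
            if 0 < d < threshold:
                # attract: move a fraction of the way toward the segment
                pts[i] = (p[0] + step * (q[0] - p[0]),
                          p[1] + step * (q[1] - p[1]))
    return pts
```

Nodes beyond the threshold never move, which matches the "mostly still" behaviour described above.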

Scanaerial integration into JOSM. Take into account existing object boundaries

Scanaerial [2] is a tool that helps with tracing aerial images. It is especially effective for water surfaces (as they have simpler textures than e.g. forests).

The problem is that it is an external program written in Python. Among the many problems this brings are: suboptimal speed, no progress indication, and awkward configuration of the aerial imagery to use.

Inclusion of the same functionality into JOSM directly as a Java plugin would allow the tool to have a better interface and provide smoother user experience. It could use all imagery that is present as JOSM layers.

Another improvement: it could be made to stop at the boundaries of already traced objects, e.g. by using them as a mask for the raster. That would save a lot of time conflating tracing results into the whole picture.

Not yet solved problems of traced data quality

This section discusses still unsolved problems in the import data I've worked with. Their consequences had to be edited manually, which was the limiting factor for the import. Solving them mechanically, wherever possible, would therefore be very welcome.

Nice parallel borders along roads

Compare the following two pictures. Before cleaning, imported forests along a road:

Forests along road original

After manual cleaning of nodes.

Forests along road cleaned

Usually humans find the second variant preferable, as it has fewer nodes and less clutter.

The task is almost equivalent to snapping nodes to an "invisible" buffer polygon created around the road. If all source nodes are placed on the buffer boundary, the resulting forest borders will run parallel to the road in the middle.
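For a straight road segment, placing nodes on such an invisible buffer boundary amounts to projecting them onto a parallel line at a fixed distance, on whichever side of the road they already lie. A sketch under that simplifying assumption (a real road is a polyline, so each node would first be matched to its nearest segment):

```python
import math

def align_parallel(nodes, road, dist):
    """Snap border nodes onto the line parallel to a straight road at
    the given distance, keeping each node on its own side, so the
    land cover border ends up parallel to the road."""
    (x1, y1), (x2, y2) = road
    dx, dy = x2 - x1, y2 - y1
    length = math.hypot(dx, dy)
    ux, uy = dx / length, dy / length   # unit direction along the road
    nx, ny = -uy, ux                    # unit normal
    out = []
    for px, py in nodes:
        t = (px - x1) * ux + (py - y1) * uy     # position along the road
        side = (px - x1) * nx + (py - y1) * ny  # signed offset from it
        s = dist if side >= 0 else -dist
        out.append((x1 + t * ux + s * nx, y1 + t * uy + s * ny))
    return out
```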

Almost road channels in multipolygons

A road can run through a massif of forest, farmland or similar land cover. Alternatively, it can be placed completely outside of any land cover, effectively cutting it in two. The road then has its own "channel" in which it runs.

The worst case is a mixture of these strategies. A road that runs outside the forest but then jumps into a short chunk of it and emerges back out is confusing. See the picture for an explanation.

Almost a road channel

Here, the outer and inner ways of the surrounding multipolygon create "almost" a channel for the road. Properly removing the "plug" in this channel requires a huge amount of work: split the outer way in two, split the inner way in two and change its role to outer, then sew the matching parts of these halves together to obtain two outer ways placed on both sides of the road.

Deciding what to do with old data of lower resolution

This is kind of self-explanatory.

As we get more and more highly detailed aerial pictures and more and more advanced remote sensing and tracing software, it is often obvious during an import that existing data is of lower resolution than the data that could be imported in its place. But, to play it safe, the old data is considered "golden", and new fine features get mangled at the borders where they contact old coarse features.

In the case of lakes, many were present on the map, but badly traced. If we were to import lakes as well, it would be preferable to replace the old ones with coarse boundaries with new ones. But how to decide reliably?

Is it possible to measure resolution of vector objects?

Lake of lower resolution than surrounding import data

Final thoughts and ideas

A few notes to self to try before importing the next huge chunk of data, or to carry out as completely separate imports.

  1. Create a workflow for producing raster mask layers from OSM XML by converting it to SHP and then to GeoTIFF. Using QGIS and Geofabrik's exports is error-prone, as there are multiple files to combine and they have weird category mappings.

  2. Import new short ways as single nodes when the resolution is not enough. Individual houses are typically represented as a single pixel in the input raster, or four nodes of a 10×10 meter square in its vectorized form. For really remote buildings outside of any residential areas, it is possible to at least record their position as a node.

  3. Replacing the geometry of ways with finer geometry. If there is a way to measure the "resolution" of a vector feature (fractal dimension metrics?), then two features whose centroids are close enough can be compared. Assuming both features represent the same physical object, it can then be decided whether to replace the old geometry with the new one.

  4. So far I have been concerned with filling in the empty places. Defining and implementing a strategy for updating land cover is yet another topic to explore. The key here is to reliably decide when two features from old and new datasets represent the same object, and which representation is worth keeping. A technique for cutting holes in existing multipolygons will become critical in order to e.g. handle the scenario where a section of a forest burned down and should be excluded from the old multipolygon. Differential pixel comparison of new and old import data might come in handy to find all places with "diffs" and act only on them. The prerequisite is to bring the raster inputs to comparable states (identical coordinate systems, spatial extents, pixel resolutions and land cover classifications).

  5. Generate import vector datasets of different "resolution" or level of detail. That is, produce a family of parameterized datasets with different aspects of their generation adjusted one way or another. E.g., have a layer that only contains objects larger than a predetermined value; or only data for forests; or different aggressiveness of smoothing algorithms. Typically, the "resolution" of new data is dictated by the existing data density of the tile. It does not make sense to add a multitude of fine details to a tile that was coarsely outlined without essentially redoing it from scratch. There should be just a few such datasets per tile, though; otherwise one would spend too much time choosing between them and comparing them instead of integrating them into the map.

  6. Use local "rubber band stretch" transformations of the vector data when adjusting positions of individual nodes with respect to positions of existing nodes, just as the pioneers of map conflation did. The potential function idea outlined earlier builds on the ability to stretch things without introducing the new topological errors that so dismay humans.

  7. Reduce the number of inner polygons in multipolygons, leaving only the biggest ones (e.g. more than 5% of the outer way's area). We have too many fine details, but which of them to keep?

  8. Try using ogr2osm [3] instead of my own conversion scripts.
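For the "resolution" question in item 3, one crude stand-in (far simpler than a fractal-dimension metric) is the mean distance between consecutive nodes of a way. The helper names below are invented for the sketch, and coordinates are assumed to be in a projected system:

```python
import math

def mean_segment_length(way):
    """Crude 'resolution' proxy for a vector feature: the mean distance
    between consecutive nodes. Shorter segments suggest finer tracing."""
    segs = list(zip(way, way[1:]))
    total = sum(math.hypot(bx - ax, by - ay)
                for (ax, ay), (bx, by) in segs)
    return total / len(segs)

def finer(way_a, way_b):
    """Of two candidate geometries for the same object, return the one
    presumed more detailed (shorter mean segment length)."""
    if mean_segment_length(way_a) < mean_segment_length(way_b):
        return way_a
    return way_b
```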



This is the third part of summarizing my experience with conflation of land cover data for Sweden. More examples of practical problems and ways to address them follow.

The same or similar problems may or may not arise during other imports of closed (multi)polygons in the future, so tips and tricks to save time will become handy. Note that some points from the previous part may be repeated here, but with more thoughts or ideas on how to address them.

Why bother with land cover import

The general idea of importing any data into OSM is to save time on doing the same by hand.

Classic data sources for the OSM contents are:

  1. Local or public knowledge. This is mostly useful for points of interest, provided one knows an exact position of the point. For needs of land cover, local reconnaissance is critical for establishing secondary details of actual surface. E.g., for a swamp, what type of swamp, for a forest, what species of trees, with what height/diameter and of what age. There are remote sensing methods, however, that allow obtaining some of this information in some cases without visiting the actual place.

  2. GPX traces submitted by users. These are very useful for mapping certain linear objects such as roads, which is natural as people tend to move along roads. But people do many other things for which a map would be helpful, and obtaining accurate linear traces for them is problematic. For closed map features, such as building outlines, the usefulness of GPX traces is limited, as people rarely circle around each and every house. Besides, the baseline GPS resolution of several meters (without a lengthy precision improvement procedure) is often not enough to make out where the actual corners of a building are. Walking around a swamp in a remote location to get its outline is rarely doable in reasonable time.

  3. Aerial and satellite imagery. Provided the pictures are up to date and have no unknown offset relative to reality (which GPX traces can help establish and correct as a cross-referencing source), imagery allows tracing basically everything that is visible on the earth's surface. The problem is that the amount of data this source gives is overwhelming. The only universally accepted way to convert it to vector form is manual tracing, which is not a very efficient way to process data, given that it is done on a computer capable of billions of operations per second.

The benefit of a data import is that instead of tracing aerial images and/or GPX traces, we simply get "ready" data.

This data is also very likely to contain "noise" — all sorts of artifacts, wrong points, extra useless data, etc. The value of new data should always be weighed against the amount of new issues it brings into the database. This is a classical "signal to noise ratio" problem, so determining what kind and level of noise is acceptable is essential. Whether the noise produced by the scripts is comparatively easy to ignore or remove manually determines whether it is worth importing the data and fixing it at the same time. It may turn out that it is faster to simply trace everything manually or to get the data from another source.

An important aspect of land cover features is their inherent redundancy and the inexactness of their geometry. You can often delete a node of a polygon denoting a forest, or move it slightly, without significantly affecting the perceived accuracy of how it reflects reality. This opens possibilities for optimizations such as adjusting common boundaries of features.
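The classic Douglas-Peucker algorithm (mentioned later in the context of tile borders) exploits exactly this redundancy: it drops nodes that lie within a tolerance of the chord between their neighbours. A standard textbook sketch:

```python
import math

def _perp(p, a, b):
    """Perpendicular distance from point p to the line through a and b."""
    (ax, ay), (bx, by), (px, py) = a, b, p
    dx, dy = bx - ax, by - ay
    length = math.hypot(dx, dy)
    if length == 0:
        return math.hypot(px - ax, py - ay)
    return abs(dx * (ay - py) - dy * (ax - px)) / length

def douglas_peucker(pts, eps):
    """Keep the endpoints; if the farthest interior point is within eps
    of the chord, drop all interior points, otherwise recurse around it."""
    if len(pts) < 3:
        return list(pts)
    dmax, idx = 0.0, 0
    for i in range(1, len(pts) - 1):
        d = _perp(pts[i], pts[0], pts[-1])
        if d > dmax:
            dmax, idx = d, i
    if dmax <= eps:
        return [pts[0], pts[-1]]
    left = douglas_peucker(pts[:idx + 1], eps)
    return left[:-1] + douglas_peucker(pts[idx:], eps)
```

The same tolerance that makes this safe for a forest border is what "eats" sharp tile corners, as described later.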

Complexity metric for data batches

It turned out to be useful to have a rough estimate of how hard a particular tile would be to integrate by calculating a complexity metric over it.

Intuition says that the more input data there is in the new data layer or in the old data layer, the harder it would be to make sure that they are nicely aligned. But what is the best mathematical expression to quantify this idea?

Originally, the total number of new and old ways was used as the complexity metric.

It kind of made sense, because much of the conflation work goes into aligning shared borders between new and old ways. It did not matter how many nodes there were, and relations typically did not stand in the way, as mostly outer ways mattered.

Later in the process it was decided that an even simpler metric is more representative of what needs to be done to integrate a tile. Namely, the number of old ways proved proportional to how much time would be spent aligning new ways to them.

Further refinements to the metric are of course possible. Seeing the complexity number next to the list of not yet processed tiles made it possible to decide which tile to open next, so that it could be processed within a limited span of time.
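Under the later metric, ordering the remaining tiles is almost a one-liner. The tile fields below are made up for illustration:

```python
def complexity(tile):
    """The metric that eventually won: the number of existing (old)
    ways in the tile dominates conflation effort."""
    return len(tile["old_ways"])

def next_tiles(tiles, limit):
    """Order unprocessed tiles by complexity, easiest first, and keep
    only those that fit within a limited editing-time budget."""
    todo = sorted((t for t in tiles if not t["done"]), key=complexity)
    return [t for t in todo if complexity(t) <= limit]
```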

Excluded and missing features

The import raster file stored information for the whole country area and a wide assortment of land use classes, including those common for wilderness, residential and water areas, as well as rail- and car roads.

Information about roads was clearly the least useful. As an assortment of pixels, there was no guarantee of proper connectivity for roads, which is the main property to extract and preserve in a vector map meant to support routing.

Residential areas were also excluded as the raster resolution of 10×10 meters did not allow to reconstruct actual footprints of individual buildings. There is however an idea to use this information to partially map remote isolated dwellings, if not as polygons then at least as nodes building=yes.

It was finally decided not to import water bodies, such as lakes, swamps and rivers but to focus on forests, farmland and grass. This is despite the fact that information about water bodies was in fact extractable from the import raster.

The motivation for the decision not to use them was based on the following premises.

  1. Water bodies were assumed to already be well represented in OSM. This turned out not to be true for many parts of the country. As a result, the map now has "white holes" where not yet mapped lakes should be in reality.

  2. The decision to import land cover for islands and islets. This conflicted with the developed approach of using a separate mask raster layer that did not treat water areas as masked. Including mapped lakes in the mask layer would have masked the islets too.

Ideas on how to import water bodies

As we can see now, the OSM map would benefit from a careful import of missing lakes as well.

It should be possible to import water bodies in a separate process from the same input raster. The import data processing scripts will have to be adjusted in many places to take into account different status of lakes. In particular:

  1. The mask layer should include "landuse=water" to prevent the creation of overlapping water bodies. The ocean/sea around the coastline has to receive the same treatment to avoid mapping water inside the ocean. However, areas outside the coastline borders are likely to already be marked as "no data" in the source raster image.

  2. To prevent “slivers” of water in the shore area (similar to noise around already mapped forests), limitations of how “oblong” new water features can become should be set in the scripts. This means that wider rivers will most likely be excluded from the end result.

  3. The naive tiled approach with static extent positions would lead to cutting of new lakes into several adjacent pieces. To avoid this, tiles should dynamically be resized to implement the “all or nothing” approach for water bodies. The idea is that, while a forest can be split into several adjacent pieces that can be individually mapped, a feature for a lake is typically added as a whole area. It may be represented as a multipolygon with multiple shorter outer boundaries if needed, but the result should not look like two separate water bodies with a common border along a tile edge. At least this is not how one usually maps lakes in the OSM. If nothing else, larger tile sizes would decrease amount of cases of clipped water bodies. And in remaining cases two or more pieces should be merged into a single polygon, not just aligned to have a single common border.

  4. Making sure the border between land and water is unified will be a tough problem to automatically solve, given the amount of land cover data already imported.
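For point 2, one plausible way to quantify how "oblong" a candidate water feature is would be the isoperimetric ratio, which is 1.0 for a circle and grows for slivers. The cutoff value here is an arbitrary assumption:

```python
import math

def oblongness(perimeter, area):
    """Isoperimetric ratio perimeter^2 / (4*pi*area): 1.0 for a circle,
    large for thin slivers such as shoreline noise or narrow rivers."""
    return perimeter * perimeter / (4.0 * math.pi * area)

def keep_water_feature(perimeter, area, max_ratio=8.0):
    """Reject features that are too sliver-like; the 8.0 cutoff is an
    assumed tuning parameter, not a value from the actual scripts."""
    return oblongness(perimeter, area) <= max_ratio
```

As noted above, any such limit would also exclude wider rivers, which tend to score like slivers.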

Boundaries between individual tiles

Tiles as rectangular extents of data of the same size and adjacent to each other are one of simplest methods to split huge geodata into smaller chunks that are easier to work on individually. However, the transition from one tile to adjacent ones should still be smooth in the resulting vector data. The most apparent problem that comes to mind is positive or negative gaps between features lying close to common borders of adjacent tiles.

Straight lines between tiles

During this import the following problems with tile boundaries were observed.

  1. A bug in the tile splitting algorithm caused adjacent tiles to overlap significantly (by up to several hundred meters). This defect in the resulting vector data was very cumbersome to fix manually, especially when at least one of the overlapping tiles had already been uploaded to the database. Careful ordering of coordinate system conversions made it possible to avoid this issue and generate further tiles with much better mutual alignment.

  2. Smoothing algorithms tended to move nodes, including tile border nodes. Nodes at the four corner positions of the rectangle suffered most often — they were "eaten" by the Douglas-Peucker algorithm and required manual restoration at the editing phase. In general, the edges of a tile are the area where "sharp" corners in features should be allowed, but curve simplification algorithms tend to replace them with more gradual transitions.

Filtered polygon corner

  3. Rounding errors accumulated at different data processing stages caused boundary nodes to be moved by a small fraction of a degree. This regularly caused a tiny (around 1e-4 degrees) but noticeable (bigger than the 1e-7 resolution that OSM uses internally) overshoot or undershoot against nearby parallel tile borders. To compensate, a separate data processing stage was written to determine which nodes are likely to be "boundary" nodes of the tile. It then made sure that their respective longitude or latitude values lay precisely on the tile border. For example, all nodes with longitude 45.099998 would be moved to 45.1. This improved the situation somewhat, often making at least some nodes on two adjacent tile borders receive identical coordinates and thus allowing automated merging of feature nodes in such cases.

Almost the same coordinates
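The boundary-rounding stage described in point 3 can be sketched as follows; the tolerance and the lists of known border meridians/parallels are assumed parameters:

```python
def snap_to_tile_border(nodes, borders, tol=5e-4):
    """Round coordinates that lie within tol of a known tile border
    onto that border exactly, so nodes on two adjacent tile edges end
    up with identical coordinates and can be merged automatically.

    nodes   -- list of (lon, lat)
    borders -- (list of border longitudes, list of border latitudes)
    """
    def snap(value, lines):
        for b in lines:
            if abs(value - b) <= tol:
                return b
        return value

    lons, lats = borders
    return [(snap(lon, lons), snap(lat, lats)) for lon, lat in nodes]
```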

  4. Small features split across two separate tiles typically did not align well when sewn back together.

Cut features are misaligned in two tiles

It required manual movement of nodes in both lat/lon directions to restore the original shape. Ignoring a minor misalignment of a small remote natural feature was often fine (who cares if this tiny swamp looks cut?), but when the cut line went through a residential area, one would certainly want to restore the original shape.

Despite some implemented measures for automatic tile border merging, a lot of manual adjustment work was still required in many cases. This activity was one of the limiting factors for data import speed. Clearly there is room for improvement for future imports. I see two possible directions.

  1. Improving algorithms for feature reconstruction after they were split and parts were individually traced.
  2. Improving the tile border selection algorithms. Instead of blindly cutting everything without regard to what lies under the knife, tiles can be made to follow more natural shapes in the data at hand, e.g. preferring to cut through water areas and avoiding separating data in residential areas. Such tiles would be of different sizes and not necessarily rectangular.

Even simply using bigger tiles reduces the number of cuts in the original data and the amount of follow-up sewing of tiles. E.g., instead of cutting a country into small tiles one can split it into bigger counties. However, as the chunk size grows, other problems of scale become more prominent, so finding a balance here is critical.

Single and double borders between features

Often it is desirable that two adjacent features share a single border line, not two loose curves intersecting back and forth.

A single border between forest and lake can be seen along the south coast of the lake, while the north coast has a CORINE import forest polygon with its own border running along the lake border.

An example of single and double borders

In my opinion, having a single common border between land cover features is best. It is more compact to store and less visually complex. It does not correspond to reality in the sense that there is often no sharp border between two natural features.

Double separate borders do reflect the fact that one border is not necessarily defined by another feature. However, there are always unanswered questions about what lies between the two features. What is going on in the thin sliver of no data squeezed between the two borders? Is the distance between the features wide enough to reflect a transitional area? What if one border is more detailed than the other — is that just an artifact or a faithful reflection of reality?

Both types of borders are the main problem during conflation. Maintaining a single border between new and old features means modifying existing borders. Creating double borders means dealing with the multiple overshoots and undershoots that are bound to happen, with the result that the borders "interlace" with multiple intersections along their common part.

Current semi-automatic methods to maintain the single border, besides manually editing all individual nodes, are:

  • Snapping nodes to nearby lines.
  • Breaking polygons at mutual intersections and removing smaller overlapping chunks as noise.

The SnapNewNodes plugin [4] was made with the idea of helping with the snapping process. It does help, but has annoying bugs that make it less reliable. SnapNewNodes should be made more robust, and it should report the cases where it cannot snap everything unambiguously.

In the import process, the main source of double borders were existing shore lines of lakes and new imported forests, swamps etc.

An interesting case is when existing islets without land cover received it from the import data:

Double borders for an islet

Double borders for an islet, another example

Often the previously mapped water border was of lower quality/resolution than the newly added forest bordering it. But just as often, both new and old borders were equally accurate.

Still, this resulted in forest partially “sinking” into water and partially remaining on an islet.

Manual solutions include:

  • Replacing geometry for small islets to better reflect land cover and aerial images.
  • Deleting, merging and snapping nodes of new and old ways to make sure land cover stays within the designated borders. ContourMerge [3] is also useful to speed things up.

Roads partially inside a forest, partially outside it

Partial pockets

Such a partial pocket has to be manually cut through with two "split way" actions:

Cut placement for road

This is very laborious and not always possible in the case of multiple inner/outer ways of a multipolygon, as they are not considered a single closed polygon that could be cut through:

Impossible to make a cut for road

Something has to be improved in order to speed up processing of such cases.

Occasional isthmuses between two land use areas separated by a road

When a road goes through a forest or a field, there are often situations where a single node or a small group of nodes belongs to both fields, connecting them briefly, like a waistline or an isthmus, if you will. Ideally, the road should either lie completely outside any field, lie completely on top of a single uninterrupted field, or be part of the border between the fields. See the selected nodes in the two examples below.

Waistline 1

Waistline 2

Basically there should be either three “parallel” lines (two field borders and a road between them), one line (being both a border and a road at the same time), or just one road on top of a monolithic field.

A manual solution can consist of:

  • Delete the connecting node or node group placed close to the road
  • Delete/fill in the empty space under the road, possibly merging land use areas it was separating
  • Merge two boundaries to a single way that is also a road.

An automatic solution would be a plugin or data processing phase that uses existing road ways to cut imported polygons into smaller pieces placed on different sides of the road. The smaller pieces are then thrown away as artifacts.

The current script [9] attempts something similar by simply removing land cover nodes that happen to lie too close to any road segment. This does ensure that isthmuses get deleted in many cases. The disadvantage is that "innocent" nodes that simply were too close to a road often get deleted too, creating several other types of problems: too big a gap between the road and the field, too straight a line for a field border, self-intersections, or even unclosed ways.
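The script's node-dropping approach could be approximated like this (function names and the distance cutoff are my assumptions, not the actual script):

```python
import math

def _dist_point_segment(p, a, b):
    """Distance from point p to the segment a-b."""
    (ax, ay), (bx, by), (px, py) = a, b, p
    dx, dy = bx - ax, by - ay
    len2 = dx * dx + dy * dy
    t = 0.0 if len2 == 0 else max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / len2))
    qx, qy = ax + t * dx, ay + t * dy
    return math.hypot(px - qx, py - qy)

def drop_near_roads(nodes, road, min_dist=5.0):
    """Drop land cover nodes closer than min_dist to any road segment.
    As noted above, this also deletes 'innocent' nodes and can even
    unclose a way, which is exactly the observed downside."""
    segments = list(zip(road, road[1:]))
    return [p for p in nodes
            if min(_dist_point_segment(p, a, b) for a, b in segments) >= min_dist]
```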

Noisy details in residential areas

There is no problem with regions already enclosed by landuse=residential, as they go into the mask layer and efficiently prohibit any overlap with the new land cover data to be imported. However, it is not mandatory to outline each and every settlement with this tag; often only individual houses are outlined. Besides, there are many places where no houses are mapped at all yet.

Previously unmapped areas of small farms, residential areas and similar places with closely spaced man-made features receive a lot of small polygons that try to fill all the empty spaces between buildings, map individual trees, etc.

Noise in residential area

A manual solution is to delete all new polygons covering such an area, as they are not of high value near man-made features. It is worth noticing that in most cases these are "grass" polygons with small individual areas. Selecting with the filter or search functions, and then inspecting or deleting all ways tagged "grass" that are smaller than a certain area, could speed up the manual work.
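Such an area cutoff is easy to express with the shoelace formula, assuming ways are available as closed coordinate rings in projected meters (the data layout and cutoff here are invented for the sketch):

```python
def ring_area(ring):
    """Shoelace formula for the area of a closed ring (planar coords,
    first point repeated at the end)."""
    s = 0.0
    for (x1, y1), (x2, y2) in zip(ring, ring[1:]):
        s += x1 * y2 - x2 * y1
    return abs(s) / 2.0

def small_grass_ways(ways, max_area=500.0):
    """Pick grass polygons below an area cutoff for inspection or
    deletion; each way is assumed to be {'tags': {...}, 'ring': [...]}."""
    return [w for w in ways
            if w["tags"].get("landuse") == "grass"
            and ring_area(w["ring"]) < max_area]
```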

However, it is hard to find all such places, leading to untidy results reported by others [5] [6].

A semi-automatic assistance method would be to buffer [13] already mapped buildings in the mask layer by about 10-20 meters to make sure they are not overlapped by land cover data. This does not solve the problem of unmapped buildings, however.

Risk of editing unrelated existing polygons

Because the OSM data contains a mixture of everything in a single layer, it is very easy to accidentally touch existing features that are not part of the import intent.

Generally, for the land cover import, one is foremost interested in seeing and interacting with features tagged with landuse, natural, water or similar. Certain linear objects, such as highway and power, are also important to work on as they in fact often interact with the land cover polygons.

What is almost never important are all sorts of administrative borders. Examples of such areas are: leisure=nature_reserve, landuse=military, boundary=administrative etc. They rarely correlate with the terrain. One does not want to accidentally move them, as their position is dictated by human agreements, laws, the political situation etc., not by the actual state of the land.

It is recommended to create a filter in JOSM [12] that inactivates all undesirable features while still keeping them vaguely visible in the background. This way it becomes impossible to accidentally select, and thus change, them.

Interaction with CORINE Land Cover data 2006

In general, priority was always given to existing features, even when it was apparent that they provided lower resolution or even notably worse information about the actual state of things. This meant that boundaries of new features were typically adjusted to run along, or reuse, boundaries of old features.

In several cases when correspondence to reality was especially bad, existing features were edited to make sure their borders were in better shape. When it was possible not to have a common border between old and new features, the old ones were left untouched.

As an exception, the CORINE Land Cover 2006 [7] polygons from earlier OSM imports were not always preserved. For example, in mountainous regions they were completely replaced by the new data. They have not been included into the mask layer.

In lower-altitude regions the CLC2006 features had mostly been added for forests. There they were largely preserved, at least their tagging. However, the low precision of such polygons forced me to edit their borders on many occasions: adjusting, adding or removing nodes, or rarely removing the whole feature and replacing it with the import data and/or manually redrawn data.

Removal of small excessive details

These were numerous. Typically the focus was on simplifying excessively detailed or noisy polygons generated by the scripts.

Examples of manual processing include:

  • Removal of small polygons of 12 nodes or fewer. This was only used for certain tiles where “slivers” of polygons filled the “no man’s land” between already mapped land cover polygons with double borders.

  • Smoothing of polygons that follow long roads. It is typically expected that land cover borders adjacent to roads run more or less along them without any jerking unless there is a reason for it.

Because of that, the following fragment needed some manual editing:

Before smoothing

It starts looking much more “human” after a great deal of the newly added nodes has been removed:

After smoothing

Validate, validate, validate

Validating the data after every major editing step and directly before uploading means that one runs the validation steps at least a dozen times per tile.

JOSM validation frame

The base layer often gives quite a few unrelated warnings for issues that were present even before you touched the tile.

The most common ways to fix discovered issues are to delete nodes, merge nodes, snap nodes to a line, or move nodes a bit.

The most common problems that were present in an unmodified open tile import vector layer were:

“Overlapping identical landuses”, “Overlapping identical natural areas” etc. As planned, this happened along the borders of new and old objects.

Duplicate nodes

It is easy to fix them manually by pressing a button in the validator panel.

Duplicate nodes

Unconnected nodes without tags

Typically those are remnants of filtering, snapping or similar actions. If it was you who added them, these nodes can be safely deleted by the validator itself.

In the base layer, take a minute to check that any untagged orphan nodes are indeed old (more than several months old). Freshly added nodes may in fact be a part of a big upload someone else is doing right now. Because of OSM’s specifics, new nodes are added first, but ways start connecting them only after later parts of the same changeset have been uploaded. The window between these two events may be as big as several hours. So do not clean up untagged nodes added by others if they were added just recently.

If you see a lone node added three years ago, then it most definitely can be safely removed — someone forgot to use that node a long time ago.
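The age rule above can be captured in a small helper. The 90-day cutoff is an arbitrary illustration, not a community standard; timestamps use the ISO form found in OSM XML:

```python
from datetime import datetime, timedelta, timezone

def safe_to_delete(node_timestamp, now=None, min_age_days=90):
    """An untagged orphan node is only a cleanup candidate when it is
    clearly old; fresh nodes may belong to an upload still in progress."""
    if now is None:
        now = datetime.now(timezone.utc)
    ts = datetime.strptime(node_timestamp, "%Y-%m-%dT%H:%M:%SZ")
    ts = ts.replace(tzinfo=timezone.utc)
    return now - ts > timedelta(days=min_age_days)

now = datetime(2020, 2, 9, tzinfo=timezone.utc)
print(safe_to_delete("2017-05-01T12:00:00Z", now))  # True: ~3 years old
print(safe_to_delete("2020-02-08T23:59:00Z", now))  # False: added yesterday
```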

Overlapping landuse of different types, e.g. water and forest.

A small new feature is likely to be a “sliver” of land cover squeezed between old and new data. Those can be safely removed:

Overlapping slivers

Larger new features would require creating a single shared border with an old existing feature they overlap with.

An example before:

Before having a common border

The same border after it has been unified:

After making a common border

Often it is possible (and reasonable) to merge two identically tagged overlapping polygons into one. Press “Shift-J” in JOSM to achieve that. This may turn out to be easier than trying to create a common border between them.


Before merging two polygons


After merging two polygons

Unclosed multipolygons

This was a rather awkward manifestation of a bug in the conversion script. Specifically, when a node connecting two outer ways of a long multipolygon outline was chosen for deletion, the script did not re-close the resulting multi-way.

It did close regular ways if their start/end nodes were removed, but the script did not track what the start/end nodes were in more complex situations.

Unclosed way

Luckily, there were not many such situations, and the JOSM validator could always detect them. It was always trivially easy to manually restore the broken geometry by reconnecting the ways.

New islets in lakes

New islands

Islets should be marked as inner ways of lakes typed as multipolygons. However, as this import was specifically not about water objects, it was decided to leave such new islets as simple polygons floating in the water. This saved time on editing lakes, many of which were not multipolygons. Their transformation from a single way into a relation can be done in a separate step.

Finding an overlapping object

The JOSM validator allows you to jump to the position of most warnings, which makes addressing them efficient when the bounding box of the problem is small. That is not always the case, however.

It is sometimes hard to figure out how and where exactly an intersection between two polygons happens.

Where is the conflict?

In the picture there is one huge way and several small ways which overlap with it. It is close to impossible to figure out the individual intersections.

However, in the warnings panel you can select a pair of conflicting polygons by their warning entry. Then select one of them (preferably the smallest one) by clicking on it, and zoom to it by pressing 3 on your keyboard (“View — Zoom to Selection”).

Data uploading

Uploading the resulting data starts as a regular JOSM upload dialog. Follow the general guidelines [10].

You are very likely to have more than 1000 objects to add/modify/delete, which will cause the upload to be split into smaller chunks. You are also likely to exceed the hard limit of 10000 objects per changeset, meaning your modifications will end up in separately numbered changesets. None of this matters in practice, as the project offers neither atomicity nor even the ability to transparently roll back failed transactions. All open changesets are immediately visible to others. You cannot easily control the order in which new data is packed into the changesets either, so most likely your initial changesets will contain only new untagged nodes; the following changesets will add ways and relations between them.

Problems during the upload

For huge OSM uploads there is always a risk that something goes wrong. Do not panic, everything is solvable, and this is the way huge distributed systems operate anyway.

Here are some of the problems that happened to me.

  • A conflict with data on the server. This is reported to you through a conflict resolution dialog. Just use your common sense when choosing which of “your” and which of “their” changes will stay. It is often the case that “they” means “you” as well, only with data you have uploaded earlier. I do not yet understand in which specific cases this happens, but it rarely does. Before resuming the upload, re-run the validation one more time.

  • A hung upload. Again, it is unclear why this happens, but it largely correlates with network problems between you and the server. JOSM does not issue any relevant messages to the UI or the log; it just stays still and nothing happens. If you suspect this has happened, abort the upload and re-initiate it. Do not download new data in between. Sometimes your current changeset on the server gets closed and JOSM reports that to you. In this case, open a new one and continue. There is nothing you can do about it anyway, so why even bother reporting this to the user?

  • JOSM or your whole computer crashes and reboots. You’ve saved your data file before starting the upload, right? Open that file and continue.

If any such problems have happened during the uploading, do a paranoid check after you are done pushing your changes through. Download the same region again (as a new layer) and validate it. It could have happened that multiple nodes or ways with identical positions were added to the database. In this case, use the validator buttons to automatically resolve this and then upload the correcting changeset as the last step. It is typically not needed, but checking that no extra thousands of objects are present is the right thing to do.

It is possible to “roll back” your complete changeset in JOSM by committing a new changeset [11], but it is usually not needed unless your original upload was a complete mistake.



This was originally posted here:

The whole premise of the land cover import for Sweden [1] is based on the idea of taking a raster map of land cover and converting it into the OSM format. This results in new map features that are essentially closed (multi)polygons with tags. These new features are then integrated with the old features already in the database during the conflation step.

This post is about the first steps of this process, everything around the vectorization of raster.

Data flow overview

It is hard to describe all the programmatic and manual actions needed to convert the input data. A lot of it is described in the OSM wiki page [1]. The best way to learn the details is to look into the source code of the scripts written to achieve the goal. However, the general data processing flow will definitely contain most of the following phases, and maybe more. The order of certain steps, especially the filtering phases, can differ. Coordinate system transformations are only needed if the input data is not in the WGS 84 system used by the OSM database; they can also be done later in the process.

  1. Change coordinate system of data
  2. Filter the input raster file to remove small “noise”
  3. Remap input raster to reduce number of pixel classes
  4. Mask the input raster with a mask raster generated from existing OSM data
  5. Split the single raster file into smaller chunks, i.e. tiles
  6. Vectorize the raster data into vector data
  7. Assign OSM tags to vector features, drop uninteresting features
  8. Smooth the features to hide the rasterization artifacts
  9. Simplify the features to keep size of data in check
  10. Do automatic conflation steps that take both new and old vector data into account. Examples: cut roads, snap nodes, delete insignificant features etc.
  11. Do manual conflation steps that could not be automated.

Raster masking approach

Let us recap how the raster and vector layers of new and existing data relate to each other during conflation.

It proved difficult to automatically or even manually make sure that land use (multi)polygons from existing OSM data and new data to be imported do not conflict with each other when both are represented as vector outlines. An even more complex question is how to decide what to do with two conflicting ways. Should one delete one of them? Replace one with the other? Merge them? Create a common border between them?

A simpler approach was developed: address conflicts at the stage when import data can be easily masked, i.e. when it is still represented by raster pixels. The idea behind this approach is that we can generate a second raster image of identical size and resolution for the country. The source for this raster mask image is the existing OSM land cover information. For example, a vector way for an already mapped forest is turned into a group of non-zero pixels. The vectorizing software then uses this mask to prevent new vector ways from being created from the import data raster. It looks as if no data for those areas were available. As a result, vectors generated from the masked raster never enter “forbidden” areas where previously mapped OSM data is known to be present.

By restricting new data to be created only for not yet mapped areas we reduce the problem of finding intersections between multipolygons to the problem of aligning borders between new and old polygons.

As import data is masked at the very first stage, while it is still in raster form, it is expected that areas “touching” (sharing a common border with) pre-mapped land cover data will require careful examination and merging of individual way borders. All cases of overlap between identical land uses should be fixed.
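The masking step itself is conceptually simple. A minimal sketch, assuming both rasters are loaded as plain 2D arrays of equal size (the real pipeline operated on GeoTIFFs):

```python
def apply_mask(import_raster, osm_mask, nodata=0):
    """Pixels covered by existing OSM land cover (mask True) are set
    to nodata, so the vectorizer never creates features there."""
    return [
        [nodata if masked else value
         for value, masked in zip(src_row, mask_row)]
        for src_row, mask_row in zip(import_raster, osm_mask)
    ]

# 1 = forest pixels from the import; right column already mapped in OSM
src  = [[1, 1], [1, 1]]
mask = [[False, True], [False, True]]
print(apply_mask(src, mask))  # [[1, 0], [1, 0]]
```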

Issues with tagging

A major issue that everyone is wary of is that new features generated from the import data will be incorrectly tagged. The issue here is that the OSM tagging approach does not encourage using a fixed, predetermined number of data classes for land cover, while the majority of raster sources by definition provide a limited number of pixel values and associated land use classes. Mapping the latter to the former is considered by some to be the most unreliable task.

Sure, there were situations when a misclassification of a feature was found during cross-inspection of new data, old data and aerial imagery. But they were really rare compared to numerous other problems to deal with. The most common (but still rare) situation of this class was wrong marking of “bushes” under a power line as “forest”.

The real problem was the need to adjust the tag correspondence map from input raster values to resulting OSM tags. That is, in different regions of the country the same pixel value might correspond to different OSM tags.

The most problematic class of land cover to tag correctly turned out to be “grass”. The same original raster pixel value may correspond to different concepts in OSM, ranging from a golf course, through cultivated grass to wild or cultivated meadow and even heathland. Because of that, manual inspection of all areas tagged as “grass” was constantly needed.
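A region-dependent tag correspondence map can be sketched as a base table plus per-region overrides. The class codes and region names below are made up for illustration; they are not the actual codes of the source dataset:

```python
# Hypothetical raster class codes -- the real source codes differ.
BASE_TAGS = {
    1: {"landuse": "forest"},
    2: {"landuse": "farmland"},
    3: {"natural": "wetland"},
    4: {"landuse": "grass"},   # the unreliable class discussed above
}

# Per-region overrides: the same pixel value maps to different tags
# in the mountains than in the south of the country.
REGION_OVERRIDES = {
    "mountain": {4: {"natural": "heath"}},
}

def tags_for(pixel_class, region="default"):
    override = REGION_OVERRIDES.get(region, {})
    return override.get(pixel_class, BASE_TAGS[pixel_class])

print(tags_for(4))              # {'landuse': 'grass'}
print(tags_for(4, "mountain"))  # {'natural': 'heath'}
```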

Excessively detailed ways

Often more nodes than a human would place are used on a way. The original data may have a node every 10 meters; additionally, using the Chaikin [2] filter to smooth 90-degree corners in vector data can create as many nodes again. See an example:

Excessive details
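The Chaikin-style corner-cutting pass mentioned above is compact enough to sketch; this is a plain Python illustration, not the actual script used in the import:

```python
def chaikin(points, iterations=1, closed=True):
    """One pass replaces each edge with two points at 1/4 and 3/4,
    cutting corners -- and doubling the node count, which is why a
    simplification pass is needed afterwards."""
    for _ in range(iterations):
        smoothed = []
        n = len(points)
        last = n if closed else n - 1
        for i in range(last):
            (x1, y1), (x2, y2) = points[i], points[(i + 1) % n]
            smoothed.append((0.75 * x1 + 0.25 * x2, 0.75 * y1 + 0.25 * y2))
            smoothed.append((0.25 * x1 + 0.75 * x2, 0.25 * y1 + 0.75 * y2))
        points = smoothed
    return points

# The 90-degree corners of a square ring get cut into diagonals
square = [(0, 0), (10, 0), (10, 10), (0, 10)]
print(len(chaikin(square)))  # 8 nodes after one pass
```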

A manual solution is to delete undesired nodes, and/or use the Simplify Way [3] tool to do so.

An automatic solution would be to apply the Douglas-Peucker filter to the ways of the import file. The issue is finding the best threshold values for the simplification algorithm. Excessively aggressive automatic removal of nodes leads to losing important details of certain polygons. Typically it can be expected that up to 50% of the import data set’s nodes can be removed without losing much detail.

It seems that an extra pass with v.generalize douglas threshold = 0.00005 does a good enough job without chewing away too many details. It does, however, chew some important details, especially of bigger polygons (such as those at a tile border), and also fails to clean up segments that are shared between several ways. Because of the last issue, manual phases of detecting and smoothing remaining 90-degree angles and close pairs of nodes that are strictly horizontal or vertical had to be implemented to clean up suspicious geometry patterns left after the vectorization, smoothing and simplifying phases.
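For reference, the classic Douglas-Peucker recursion is small enough to sketch here. A plain Python illustration with planar coordinates; the actual import used GRASS v.generalize instead:

```python
def point_line_dist(p, a, b):
    """Perpendicular distance from p to the line through a and b (planar)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == dy == 0:
        return ((px - ax) ** 2 + (py - ay) ** 2) ** 0.5
    return abs(dy * (px - ax) - dx * (py - ay)) / (dx * dx + dy * dy) ** 0.5

def douglas_peucker(points, epsilon):
    """Keep the endpoints; recurse on the farthest point if it deviates
    more than epsilon from the chord, otherwise drop everything between."""
    if len(points) < 3:
        return points
    index, dmax = 0, 0.0
    for i in range(1, len(points) - 1):
        d = point_line_dist(points[i], points[0], points[-1])
        if d > dmax:
            index, dmax = i, d
    if dmax > epsilon:
        left = douglas_peucker(points[:index + 1], epsilon)
        right = douglas_peucker(points[index:], epsilon)
        return left[:-1] + right
    return [points[0], points[-1]]

# Nodes almost on a straight line collapse to its two endpoints
way = [(0, 0), (1, 0.01), (2, -0.02), (3, 0.01), (4, 0)]
print(douglas_peucker(way, 0.1))  # [(0, 0), (4, 0)]
```

The hard part, as noted above, is not the algorithm but picking an epsilon that removes rasterization noise without destroying genuine detail.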

General notes on manually vs machine traced polygons

Pros of machine traced land cover.

  • Consistency in making decisions while tracing. Humans’ decisions during tracing are largely affected by many random factors, including the level of concentration, tiredness, smoothness of mouse/pointer operation etc. A person may skip a whole section or oversimplify it because he feels like it right now. Another human, or even the same one at a different time of day, would produce a completely different result. A computer algorithm is consistent and uniform in both its good and bad tracing decisions. Given that we can improve the algorithms, the ratio of good to bad machine tracing can be monotonically improved.
  • A computer won’t get tired before the tracing is done. The time needed to trace the complete border of a forest section is often underestimated, as it just goes on and on. The tracing line just won’t return to the beginning, even if it temptingly comes close to it several times. A human will sooner or later decide to (literally) draw a line and end the current polygon prematurely, even if it is clear that more could be included in it. The next part will have to be traced separately as a new polygon. A computer, on the other hand, is fast enough to trace everything in reasonable time. It won’t stop until it is done. Often the only limits for it are the physical borders of the current map tile, not the length or complexity of the resulting polygon.

Pros of human traced results.

  • “Important” features get traced first. It is hard to algorithmically define what is important at the current map position, as this largely depends on the context. A common approach for a computer is to process everything at hand, which it can afford because it is so fast. Humans can “feel” the context and use it efficiently to prioritize things. For example, one typically starts by working on bigger features that are close to populated areas and are the most likely to be looked at in the end.
  • Tag choice is more consistent with reality. It is a well-known problem in remote land cover sensing that no 100% correct matching can be achieved. Again, this is in part because not everything can be tagged with a limited number of classes and corresponding tag combinations. Humans tend to be more conservative in this regard and provide more reliable results. If a person cannot tell from looking at the image what sort of land cover should be assigned to a polygon, he is likely to skip it altogether or at least express his doubts in a comment to the tag set. The machine is rarely tasked with recording its confidence level for the chosen tags.

  • Humans tend to choose vector resolution (distance between nodes) dynamically based on current context. For example, a forest around a long straight highway tends to be mapped with few nodes along that road. More nodes are needed when a forest border is more “open” and does not contact anything else. Machine algorithms are currently not taking the context into account and basically use a fixed resolution coming from the underlying raster image. Different smoothing or simplification algorithms do not change this much as they only take into account the current curve itself, not adjacent data. Because of that both situations are possible with machine processed data: a) a lot of extra nodes along a straight line that could have been mapped with just a few of them; b) lost important nodes where a line takes a sudden turn.

    The same applies to making decisions about which features to keep. Machine-traced data commonly produced many small patches of grass along highways facing large forests. A human would have ignored these patches and drawn a single line between the forest and the highway.

Common issues:

  • It is impossible to “naturally” determine where one should stop tracing a polygon, round it up and go for the next one. A machine traces while it has data left to trace, i.e. until it naturally closes the current polygon or reaches an artificial border of the current raster area. A human traces until he gets tired, then simply draws a straight line back to the beginning and calls it fine. Both approaches have their problems. Humans tend to round up in a way that simplifies future additions, e.g. by closing polygons through areas that are of less importance and where no future details are expected to be added. A computer may finish its work right in the middle of a densely populated area, thus creating a lot of complications for someone who wants to complete mapping of the remaining area.

On Land Cover Import

Posted by Atakua on 14 June 2019 in English (English).

Originally posted here: . Reposted here as it might be easier for some people to find it in the diaries.

Land cover geographic data is what is mostly represented as landuse=* in the OSM database. Other tagging schemes e.g. landcover=* also exist.

During the ongoing land cover import for Sweden [1] I learned several things that were not documented anywhere in the OSM wiki or elsewhere, as far as I know. Below are my observations on what pitfalls and issues arise during all stages of the import, from data conversion to conflation and tag choice.

Imports of zero-dimensional (separate points of interest) and linear (roads) objects are regularly done in OSM. Some documentation and tools to assist with such imports exist. Compared to them, importing polygons and multipolygons poses unique challenges.

Please make sure you read the OSM import guidelines [2], [7] before you start working on an import. Then document as much of the process as possible for your own and others’ reference, as it is likely to take more time to finish than you originally estimated.

General lessons learned

  1. Start small. Choose a small subset of import data and experiment with it. You are likely to improve your data processing pipeline several times, and it is faster to evaluate what an end result would look like on a small piece of data. When your approach starts looking good enough to be applied to a larger scale, still try to keep individual pieces of uploaded data reasonably small as this reduces risks of data conflicts, incomplete uploads, and generally helps with parallelizing the work.
  2. Prepare for a lot of opposition. The OSM community has had many cases of badly carried out imports. A small mistake in data that may be forgivable and easily correctable when made once becomes a major problem when replicated at scale by a script. People are especially defensive against land cover imports. Just be persistent, listen to constructive feedback, ignore non-constructive feedback, improve your tools and data, and do not blindly trust your tools.
  3. Keep track of what is done. Document everything. Your import will take quite some time, you are likely to forget what you did in the beginning. As more than one person may work on importing, situations when two or more people attempt to process the same geographical extent in parallel are possible. Write it down to keep track of what is done and by whom. Besides, there are not many success stories about land cover imports, and we need more documented experience on the topic.
  4. Find those who are interested in the same thing. Depending on your ambitions, you will have to import hundreds of thousands of new objects. Find some like-minded individuals to parallelize as much of the manual work as possible.
  5. Learn your tools and write new tools. A lot of data processing is done in JOSM, including final touch-ups and often the uploading and conflict resolution. Learning and knowing the most effective ways to accomplish everyday tasks helps to speed the process up. Learn the editor’s shortcut key combinations, or change them to match your taste. Some good key combinations helping to work with multiple data layers that I did not know before starting the import are mentioned in [11]. Programming skills are also a must for at least one person in your group who develops an automated data conversion pipeline or adjusts existing tools for the purpose. Always look for ways to automate tedious work. And always look for ways to improve your data processing pipeline, as you will keep learning new patterns in your new or existing OSM data.

  6. Think about coordinate systems early. It is not enough to assume that everything comes in WGS 84. Convert data to the same coordinate system before you start merging or otherwise processing it. I discovered that, depending on the moment when a dataset is finally brought to WGS 84, there may or may not be weird effects such as overlapping of adjacent tiles, loss of precision resulting in self-intersections of polygons, or other avoidable problems.

On Source data

The source data for your import may be in either raster or vector form. If you have a choice, prefer importing vector data, as it saves you one step in data processing and avoids troubles related to raster-to-vector conversion.

Data actuality

It goes without saying that the source of the import should be up to date. Cross-checking it against available aerial imagery throughout the import process helps with judging how current the offered data is.

Objects classification

Take notice of what classes of features are present in the import data and how well they can be translated into the OSM tagging scheme. You may well want to throw away, reclassify or merge several feature classes present in the import data.

Even after you have started the import, continuously cross-check that the classification is consistently applied in the source and correctly translated to the OSM tagging schema, and estimate how often misclassification mistakes requiring manual correction occur.

As an example, an area classified as “grass” in a source dataset may in fact represent a golf course, a park, farmland, heath etc. in different parts of a country. These are tagged differently in OSM, and, if possible, this difference should be preserved.

Check regularly for misclassification of features, especially when you switch between areas with largely different biotopes. Tag choice for tundra is likely to differ from that used for more southern areas, and will require adjustments to the tag correspondence mapping used in your data processing routines.

Be aware that there are also unsolved issues with the tagging of e.g. natural=wood versus landuse=forest that might affect your decisions on tag choice. Very good points on the complexity of properly tagging land cover are presented in [3].

Data resolution

For linear objects it is important to adequately reflect the shape and position of the actual natural features they represent.

Data resolution is also important, as certain types of objects are worth importing only if they are represented with fine enough resolution. Conversely, having objects with too many details will increase the amount of raw data to process without giving any benefit to the end result.

It is easy to tell the resolution of raster data, as it is defined by the pixel size. For vector data, an estimate of how well newly added polygons align with existing ones can be used as a rough indication of data resolution.

Consider the following example. For a forest, having its details drawn on a map in the range of 1 to 10 meters should be just enough for practical uses. A forest with a unit resolution of 100 meters is of less use, e.g. for pedestrians. But mapping a forest with a resolution of 10 centimeters basically means outlining every tree, which is of little practical use for larger territories.

Similarly, trying to create an outline of regular buildings from data with a resolution of one meter or worse will not succeed in capturing their true shape. Data with 10 cm details can be used to correctly detect all 90-degree angles of buildings.

Converting data to OSM format

Most likely the import data will not be in a format directly acceptable to the OSM database, that is, OSM XML or equivalent binary formats. Additional processing steps will be needed to convert the data with one or more third-party or custom-written tools.

Several freely available GIS applications, libraries and frameworks can help you with data processing: QGIS, GRASS GIS, GDAL, OGR etc. However, knowing how to program is a must at this stage, as it is often simpler to write a small converter in Python (or a comparable scripting language) than to try to do the same work in a GUI tool. Moreover, many steps have to be applied to many files, which is also best automated through scripting.

Raster data is often available in the GeoTIFF format, which is a TIFF image with additional metadata describing the coordinate system, bounding box, meaning of pixels etc. Vector data can come in many forms, from simple CSV files to ESRI shapefiles, XML-based GML, JSON-based GeoJSON files, or even a geospatial database.

Once data is converted into the OSM XML format, a few tools are available to process it as well, such as command-line tools osmconvert, osmfilter, and tools and plug-ins for the main OSM editor JOSM.
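As a small illustration of the target format, a hypothetical helper that emits one tagged, closed way as OSM XML with JOSM-style negative placeholder ids might look like this (a sketch for illustration, not a tool used in this import):

```python
import xml.etree.ElementTree as ET

def polygon_to_osm(coords, tags):
    """Emit one closed way as OSM XML. New objects get negative
    placeholder ids, which the server replaces at upload time."""
    root = ET.Element("osm", version="0.6", generator="import-sketch")
    for i, (lat, lon) in enumerate(coords, start=1):
        ET.SubElement(root, "node", id=str(-i), action="modify",
                      lat=str(lat), lon=str(lon))
    way = ET.SubElement(root, "way", id=str(-(len(coords) + 1)),
                        action="modify")
    for i in range(len(coords)):
        ET.SubElement(way, "nd", ref=str(-(i + 1)))
    ET.SubElement(way, "nd", ref="-1")  # repeat the first node: close the ring
    for k, v in tags.items():
        ET.SubElement(way, "tag", k=k, v=v)
    return ET.tostring(root, encoding="unicode")

xml = polygon_to_osm([(59.0, 18.0), (59.0, 18.01), (59.01, 18.01)],
                     {"landuse": "forest"})
print("landuse" in xml)  # True
```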

Importing a single feature

The whole process of importing can be described as repeated addition, modification or deletion of features, in the land cover case represented by individual units of forest, farmland, residential areas etc. Such features have to be extracted from the source data and then inserted into the OSM database. Many decisions have to be made to ensure that enough useful information is extracted while not too much noise is introduced, so that a new feature won’t create more trouble than the good it brings.

The following tasks have to be solved for every import feature considered for manipulation.

  1. Tracing vector boundaries of a feature. For vector data, the boundaries should be already in vector format. For raster data, individual pixels with the same classification have to be grouped into bigger vector outlines of (multi)polygons. Certain tools exist that can assist with solving this [4].

  2. Assigning correct tags to features. The tagging scheme of OSM has unique properties, and deciding what tags a (multi)polygon should have is very important. At the very least, the tagging for new features should match what has already been used for tagging objects in the same area.

  3. Assuring correct mutual boundaries between old and new features. The OSM project supports only a single data layer, and everything has to be nicely organized in that only layer. Although different types of land cover may overlap in reality, it is not the common case. Often no sharp border can be defined between two natural or artificial areas either, but maps usually simplify this to a single border. Certain types of overlaps are definitely considered erroneous, e.g. two forests overlapping by a large part, or a forest sliding into a lake. Note that this task is affected by how accurately boundaries were specified for pre-mapped features. Sometimes it is feasible to delete old objects and replace them with new ones, provided that there is enough evidence that the new features do not lose any information present in the old objects. See further discussion on the subject below.

  4. Assuring correct borders between adjacent imported data pieces. As a dataset for land cover is rarely imported in a single go for the whole planet, it is bound to be split into more or less arbitrarily sized and shaped chunks. The data itself does not necessarily dictate on what principle such splitting is to be made. Borders for these chunks may be chosen based on an administrative principle (import by country, municipality, city, region etc.) and/or by data size limitations (rectangular tiles several kilometers wide etc.). Regardless of the chosen strategy, artificial borders will be imposed upon the data; e.g. what in reality is a single farmland can be split into several parts. It is often important to hide such seams in the end result by carefully “sewing” the features back together. In certain cases of bugs in the splitting process, new data pieces may even start overlapping with adjacent pieces, which only adds extra manual work without any value.

  5. Finding a balance between data density and usefulness. Even if the import data resolution looks optimal, it is often worth further filtering, smoothing, removing small details or otherwise pre- and postprocessing the resulting features. Let us consider a few examples. a) For the case of raster data, it is worth removing lone “forest” pixels marking individual trees standing in farmland or in a residential area. b) Rasterization noise is always an issue to deal with: imported data should not look “pixelated”, with suspicious-looking 90-degree corners where there should be none. c) Lastly, many nodes lying on a straight line can be safely removed, reducing the size of the vector data without losing accuracy. A lot of filters exist for both simplification and smoothing [4] [5], but most of them require some experimentation to find the optimal parameters to run them with. Too aggressive filtering can destroy essential parts of imported features.

  6. Keeping feature size under control. Artificial splitting of import data surprisingly has a positive side effect: it keeps the size/area of natural features in check. An automatically traced forest can turn into a multipolygon that spans tens of thousands of nodes and hundreds of inner ways. In practice, several smaller, more simply organized adjacent polygons covering the same area are better. Other means of keeping features’ size in check can be used as well. For example, roads crossing forests can effectively cut them into smaller parts that are then represented as more contained features.

Conflation with existing data

Conflation is merging data from two or more sources into a single consistent representation. For us, it means merging two layers of vector data: “new” with import features and “old” with existing features — into a single layer to be then uploaded to the main OSM database.

Let us assume that both the “old” data already present in the OSM and “new” data to be imported are self-consistent: no overlapping happens, no broken polygons are present etc. Always make sure that it is true for both layers before you start merging them, and fix discovered problems early.

When these data layers are self-consistent, new inconsistencies can only arise from the interaction of old and new features. For the land cover case, that means (multi)polygons intersecting and overlapping each other in all possible ways.

Start thinking about how you are going to address these problems early in the import process. Solving them efficiently is a critical component of a successful import.

Algorithmically, finding the exact shape of overlaps, points of intersection, common borders etc. of two or more multipolygons is far from trivial. Solutions to such problems tend to be computationally complex, meaning that applying them to a huge number of features with many nodes each may take unreasonable time to finish.

Whenever possible, the conflation task should be simplified. Compromises between speed, accuracy and data loss/simplification have to be made. Improve your conflation algorithms as you progress through the early data chunks and learn from the issues arising in them. If you see that the same tasks arise over and over and take a lot of human time to resolve manually, integrate a solution for them into your algorithms.

For situations when old and new features overlap, intersect or otherwise happen to be in a conflict, define a consistent decision making strategy on which modifications to conflicting features will be applied. For example, one should decide in which situations old or new nodes are to be removed, moved or added, whether conflicting features are to be merged, or if there are conditions for some of them to be thrown away.

Be on the lookout for common patterns in the data that can be easily solved by a computer. More complex cases can be marked by the computer for manual resolution.

Do not leave too much work for humans, however. Humans are bad at tedious work and will quickly start making mistakes. Everything that is reasonable to do by machine during conflation should be done by machine.

It is easiest to solve conflicts when no conflicts can arise. For features, if no two features can overlap, they cannot conflict. In this sense, undershoot in the data is better than overshoot; but again, make an informed decision that best fits your import.

Consider that making sure features’ borders are aligned is easier than deciding what to do with two arbitrarily overlapping polygons. This means that ensuring no old/new feature pairs overlap helps greatly with conflation. Simply put, importing features only into areas where there is “white space” on the map is easier than importing features into already tightly mapped areas. You might clean up the space first by removing old features (according to some strategy), or just leave those areas alone.

Decide how to verify that the conflation is correct. The JOSM validator helps with detecting, e.g., two ways with identical “landuse” tags overlapping. But it does not check for everything, and certain cases that are obviously wrong to a human are skipped by the validator. Use it and other similar tools, as well as visual inspection of the result, to verify that no (obvious) errors have slipped into the end result.

Let us consider different strategies to approaching the data merging task.

Raster vs Raster

This is arguably the simplest case of data conflation, as the task can be reduced to comparing values of individual pixels: two pixels either fully overlap or do not overlap at all (provided that the two rasters are brought to the same resolution and extent). There is no problem with detecting intersections of (multi)polygons. Algorithmic complexity is proportional to the area of the raster map in pixels.

Existing OSM data is vector, not raster. You can, however, rasterize it [6] into a matrix of values that can then be processed together with the imported raster data layer. A decision about every pixel of import data can then be made based on two data sources: old data for the pixel, and new data from the import. For example, pixels for which there already is some data in the OSM database can be made “invisible” to the vectorization process, and tracing will not create any vector features passing through such pixels. This effectively ensures that no “deep” overlapping between old and new features can happen. However, due to data loss in the rasterization process, borders of old and new features may and will intersect somewhat along their common borders. Thus, the task of making sure two features do not overlap is reduced to the task of making sure two features have a common border.
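The masking step can be sketched with NumPy on two toy rasters (the arrays below are invented 3×3 examples, not real data; a real pipeline would get them from a rasterizer):

```python
import numpy as np

# Toy import raster: 1 = "forest" pixel in the new data, 0 = empty.
new_data = np.array([[1, 1, 0],
                     [1, 1, 1],
                     [0, 1, 1]])

# Mask rasterized from existing OSM features: 1 = already mapped.
old_mask = np.array([[0, 0, 0],
                     [1, 1, 0],
                     [1, 1, 0]])

# Pixels already covered by old data become "invisible" to tracing,
# so no new vector features will be generated over them.
to_trace = np.where(old_mask == 1, 0, new_data)
```

After this step, vectorizing `to_trace` can only produce features over previously empty pixels, which is what reduces the conflation problem to aligning common borders.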

This approach was used for the Sweden land cover import [1]. Here are the issues discovered so far and the solutions for them.

  1. Make sure that the raster datasets being compared are completely identical in their extent (bounding box position and dimensions), resolution and coordinate projection systems. Tools that work with geographic data in raster formats typically expect all layers to have exactly the same dimensions, no relative shifting, no variation in projections etc. Not all of them report errors properly when these conditions do not hold. An incorrectly skewed mask layer creates holes for phantom features in the resulting vector layer, while simultaneously leaving a lot of overlap for features that should not have been generated at all.

  2. The quality of existing OSM data starts to affect the results of vectorizing the import data layer. Any errors or simplifications present in old OSM data will be reflected in the eventual vector data to import. For example, a roughly sketched forest patch may partially mask a nearby farmland about to be imported, thus making its shape wrong. A road that was drawn too far off its actual position will create a phantom cutline in your data.
  3. Because there may be multiple ways to draw a border for a forest or another natural patch of land, sliver (long and thin) patches of land cover may appear in the data. They stem from “subtracting” two almost identically shaped old and new polygons.
  4. To effectively prevent new land cover polygons from creeping over roads, include roads in the mask layer. However, because roads usually have no “thickness”, use buffering [12] to create an area around them, effectively turning them into long curved polygons. The same applies to railroads. To prevent houses from being partially covered by small patches of trees, you can try buffering them as well.
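The buffering trick from item 4 looks like this with Shapely (one possible tool; the centerline coordinates and the 5-meter half-width are illustrative assumptions):

```python
from shapely.geometry import LineString

# A road centerline (zero-width) in projected map units, e.g. meters.
road = LineString([(0, 0), (100, 0), (200, 50)])

# Buffer by half the assumed road width: the zero-width line becomes
# a long, thin polygon that can be burned into the mask layer.
road_area = road.buffer(5.0)
```

The resulting polygon, once rasterized into the mask, cuts the imported land cover along the road just like any pre-existing area feature would.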

Vector vs vector conflict solving

There are no known tools working directly with OSM-format data that allow for conflict resolution against another set of vector features. One can, however, load existing OSM data into a GIS application together with the import vector data, do the processing there, and then export the modified import features to an OSM file. More exploration is needed in this direction to see how efficient and accurate it can be, e.g. using vector overlaying [10].

You might consider some old vector features for removal from the OSM dataset if new features located at the same position have better spatial resolution, quality or tagging. For example, data coming from older imports may be considered for replacement by newer, more detailed import data. Be careful, however, with editing OSM XML extract files in your scripts, as simply deleting (or marking for deletion) nodes, ways or relations may leave dangling references from other objects somewhere outside the current map extent. Typically, it is best to avoid removing old features automatically; the final decision should be made manually.

Deleting objects is easier in new data layers, as they have not yet been “observed” in the global database and no external references to them could have been created yet, so you can just drop unnecessary objects from your files. Land cover data is almost always redundant to some extent, so instead of trying to solve a complex conflicting overlap or intersection, it is often easier to remove the new features altogether and replace them with a manually drawn configuration.

If your old and new vector features can overlap in an arbitrary manner, you should explore known algorithms for merging, subtracting and other operations on (multi)polygons. Be sure to measure their performance, however, as they may be too slow for bigger datasets. Simpler and faster spatial strategies to detect common cases of (non)interaction between features should be used cleverly. For example, before calculating the intersection of two polygons, check if their bounding boxes overlap. If they do not, there is no chance of intersection either, and there is no need to run a complex algorithm to discover that. Many spatial indexes have been invented to aid with the general task of telling whether two objects are “close” to each other.
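The bounding-box pre-check can be sketched in a few lines with Shapely (the squares below are made-up toy polygons used only to demonstrate the idea):

```python
from shapely.geometry import Polygon

def bboxes_overlap(a, b):
    # Cheap envelope test: if the bounding boxes are disjoint, the
    # expensive polygon intersection cannot possibly be non-empty.
    ax0, ay0, ax1, ay1 = a.bounds
    bx0, by0, bx1, by1 = b.bounds
    return ax0 <= bx1 and bx0 <= ax1 and ay0 <= by1 and by0 <= ay1

old = Polygon([(0, 0), (10, 0), (10, 10), (0, 10)])       # existing feature
new_far = Polygon([(100, 100), (110, 100), (110, 110), (100, 110)])
new_near = Polygon([(5, 5), (15, 5), (15, 15), (5, 15)])

# Only run the costly intersection when the envelopes overlap.
for cand in (new_far, new_near):
    if bboxes_overlap(old, cand):
        overlap = old.intersection(cand)
```

A spatial index (e.g. an R-tree) generalizes exactly this envelope test to millions of features, so the full intersection algorithm runs only on plausible candidate pairs.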

If there is a guarantee that features may only intersect in a tight area along their common borders, one can try snapping [8] nodes to lines in an attempt to unify the border between the features. This is not universal, however, as there will always be cases when manual post-editing is needed. A JOSM plug-in [9] was developed to assist with this kind of node snapping in JOSM; it still relies on manual adjustment for complex cases.
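Outside of JOSM, the same idea can be prototyped with Shapely's `snap` operation (one possible tool; the borders and the 0.1 tolerance below are illustrative toy values):

```python
from shapely.geometry import LineString
from shapely.ops import snap

# Border of an existing (old) feature.
old_border = LineString([(0, 0), (10, 0)])

# Border of a new feature that nearly, but not exactly, coincides.
new_border = LineString([(0, 0.05), (10, -0.03)])

# Move vertices of the new border onto the old geometry when they lie
# within the tolerance, unifying the shared border between features.
snapped = snap(new_border, old_border, tolerance=0.1)
```

As in the JOSM case, a snapping tolerance that is too large will drag genuinely distinct borders together, so complex spots still need manual review.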

To be continued: practical examples of problems and solutions

There are many specific issues common to land cover data and its conflation. Pictures should definitely help explain them. I plan to talk about these in more detail later in a separate post.