This post was originally published on my personal page, but I assume more people might read it here. Some formatting may be off as I converted it from HTML.
Around March 2019 I started work to import a large chunk of open data into
Openstreetmap. Specifically, to improve the landcover coverage of Sweden. Mostly
it concerned areas and features of forest, farmland, wetland, and highland marshes.
This post continues and somewhat concludes the series of thoughts I’ve
documented earlier in 2019:
The import plan
documents a lot of technical details. I maintained it to reflect the high level
overview of the project throughout its life.
My motivation behind the project was that I was tired of tracing forests by hand.
My understanding is that it would take a million hours to finish this work by
manual labor alone. I, being lazy, always look for ways to automate it
and/or integrate someone else’s already finished work.
And there is a lot of work to integrate. Here’s how the map looked before
the start of the project:
There is other people’s work worth integrating. Many national agencies around
the world now offer the geographical data they have collected and continue to
maintain to everyone under liberal licenses, such as CC0 or public domain.
To me, it looks strange not to even attempt to make use of this data.
I will use “land cover” and “land use” as synonyms, even though they are not
the same thing. But I do not care enough to make that distinction, and
de facto nobody else seems to either, given the current tagging status.
People predominantly use “landuse=*”, “natural=*” and “residential=*” tags to
convey details both for land use and land cover. At the same time, the almost
non-existent use of “landcover=*” (and the lack of support for it in renderers) makes it
questionable to add this tag to new data. Nobody will see the results of such work.
Naturvårdsverket’s land cover dataset consists of a single (huge) GeoTIFF in which
every 10×10 meter square of Sweden’s surface is classified as one of the
predetermined types of land cover: forest, water, roads, settlements etc. So it
is raster data, while Openstreetmap uses vector features, such as polygons
and multipolygons, to represent land cover. The first step, then, is
to convert raster to vector. The resulting vectors will certainly carry discretization
noise (“90-degree ladders”), so the next required step is to smooth them to
look more natural.
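To make the conversion step concrete, here is a minimal sketch of that raster-to-vector idea using GDAL’s Python bindings. It is illustrative only: the file names are made up, the real pipeline ran through GRASS modules described later, and it assumes the source raster is in SWEREF99 TM (EPSG:3006), reprojecting it to WGS84 before tracing polygons.

```python
from osgeo import gdal, ogr, osr

# Reproject the classified raster to WGS84 first, so that all later vector
# steps (tiling, smoothing) happen in the same coordinate system as OSM.
gdal.Warp("landcover_wgs84.tif", "landcover_sweref99.tif", dstSRS="EPSG:4326")

src = gdal.Open("landcover_wgs84.tif")
band = src.GetRasterBand(1)

# Prepare a GML output layer with a "class" attribute per traced polygon.
driver = ogr.GetDriverByName("GML")
dst = driver.CreateDataSource("landcover_raw.gml")
srs = osr.SpatialReference()
srs.ImportFromEPSG(4326)
layer = dst.CreateLayer("landcover", srs=srs, geom_type=ogr.wkbPolygon)
layer.CreateField(ogr.FieldDefn("class", ogr.OFTInteger))

# Trace raster cells into polygons; using the band itself as the mask skips
# "no data" pixels. The result is full of 90-degree ladders and still needs
# smoothing.
gdal.Polygonize(band, band, layer, 0)
dst = None
```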
To make sure newly added polygons do not conflict with features already present
on the map, the two vector datasets have to be merged, or conflated.
Conflation meant that some parts of the new vector data had to be adjusted:
whole polygons deleted, polygons cut, borders of new and old polygons aligned,
certain pieces retagged, and so on.
Because the input dataset is really huge, it would be unreasonable to attempt
importing it in a single pass. Thus, I needed a strategy to split input data
into chunks. This meant that new, artificial boundaries would
start to appear inside the vector dataset. As will be shown below, such
artificial boundaries add their own unique challenges to the process.
Openstreetmap’s uniquely loose data classification scheme makes it impossible to
decide algorithmically whether new data would be integrated well enough,
without duplicating or unnecessarily overlapping anything
already existing on the map. Thus, the final step for each data chunk before it
gets uploaded was to visually inspect the result of merging the two layers and
to fix any problems uncovered.
There have been several iterations, some of them huge, requiring regeneration of
everything from scratch, others regarded as smaller touch-ups. I can now
recall several major decisions that affected the result in a significant way.
Be less aggressive with smoothing. As the Vingåker experiment (see below)
demonstrated, it was necessary to try several threshold values and several
available smoothing algorithms to find something that does not destroy too
much of the polygons but still removes excessive detail.
Cut kommuns into smaller “rectangular” tiles of 0.1×0.1 degrees latitude/longitude.
Each tile could then be loaded into JOSM without consuming all the RAM.
A tile could still contain tens of
thousands of nodes and usually required several changesets to upload everything.
Because of the significant overhead of manually re-sewing tiles (see below),
I considered increasing their dimensions to 0.2×0.2, but never did.
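As an illustration of the tiling scheme (not the exact script I used, and with made-up numbers), the extents of such fixed, grid-aligned tiles can be generated from a bounding box like this:

```python
import math

def tile_extents(min_lon, min_lat, max_lon, max_lat, step=0.1):
    """Yield (west, south, east, north) extents of fixed-size lon/lat tiles.

    Tiles are aligned to multiples of `step`, so adjacent areas produce
    identical tile borders that can later be sewn back together.
    """
    lon0 = math.floor(min_lon / step) * step
    lat0 = math.floor(min_lat / step) * step
    lon = lon0
    while lon < max_lon:
        lat = lat0
        while lat < max_lat:
            yield (round(lon, 6), round(lat, 6),
                   round(lon + step, 6), round(lat + step, 6))
            lat += step
        lon += step

# Illustrative bounding box; each extent becomes one work unit for JOSM.
for extent in tile_extents(15.6, 58.9, 16.2, 59.2):
    print(extent)
```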
Pay special attention that adjacent tiles do not overlap. During the
Katrineholms kommun import it was discovered that a coordinate system transformation
performed too late resulted in overlaps between adjacent tiles (by up to hundreds of meters).
To prevent this, it was made certain that data gets converted from the
SWEREF99 coordinate system to WGS84 (the one used by OSM) early, at the raster stage.
Employ a second, “negative” raster layer produced from existing OSM land cover
data. Its purpose is to mask data points in the import raster as if
nothing were known about them.
My original intersection detection heuristic only considered
bounding boxes of polygons. Being too imprecise, it generated an excessive
amount of false positive matches, causing a lot of perfectly good new polygons
to be rejected during conflation.
Masking the new raster data with “old” rasterized data meant that traced vector
polygons could not possibly overlap old ones (except inside a thin
“coastline” buffer zone caused by discretization noise). Of course, using
two input raster images required regenerating all the vector data from scratch.
Add buffered roads to the negative layer. After some consideration it was
decided to include existing OSM roads in the negative raster layer.
Their position correlated with “road” land cover pixels in the input raster data,
and excluding areas around them reduced noise in the result.
“Roads” here include railroads, car roads of all sizes and also pedestrian ways
down to trails. Inclusion of trails (highway=path) turned out to be
a questionable decision.
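A rough sketch of how such a negative layer can be applied, assuming the existing OSM land cover and the buffered roads have already been exported to a vector file (mask.shp is a made-up name): the mask is burned into a raster aligned with the import raster, and the masked pixels are set to “no data”.

```python
import numpy as np
from osgeo import gdal

# Open the import raster for in-place modification.
src = gdal.Open("landcover_wgs84.tif", gdal.GA_Update)
band = src.GetRasterBand(1)
nodata = band.GetNoDataValue() or 0

# Burn the mask polygons (existing OSM land cover plus buffered roads) into an
# in-memory raster with the same size, geotransform and projection.
mask_ds = gdal.GetDriverByName("MEM").Create(
    "", src.RasterXSize, src.RasterYSize, 1, gdal.GDT_Byte)
mask_ds.SetGeoTransform(src.GetGeoTransform())
mask_ds.SetProjection(src.GetProjection())
gdal.Rasterize(mask_ds, "mask.shp", burnValues=[1], allTouched=True)

# Wherever the mask is set, pretend the import raster knows nothing.
data = band.ReadAsArray()
mask = mask_ds.GetRasterBand(1).ReadAsArray()
data[mask == 1] = nodata
band.WriteArray(data)
band.FlushCache()
```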
Add water to the import. Originally I expected even the smallest lakes to
already be well mapped in OSM, and therefore did not include water-related polygons,
such as lakes and wetlands. However, visual inspection of the conflation results
uncovered that the situation was much worse than I thought: even quite large
wetlands were missing. Late in the project I decided to retain information about
water and to convert it into “natural=water” and “natural=wetland” polygons.
This also required a modified conflation procedure specific to monolithic areas (see below).
Besides these bigger changes, throughout the project I constantly adjusted a
multitude of numerical parameters affecting the conversion process, such as
cut-out thresholds, smoothing parameters, and so on.
The algorithms and tools used have also received numerous fixes and adjustments.
Of course, the JOSM editor was the main and final tool to process data before
uploading. Some adjustments to Java VM memory limits were needed for it to be
able to chew through larger chunks of the import. Having a machine with 32 GB
RAM also helped.
I initially used QGIS to visualize input data and iteratively apply different
hypotheses to it. However, this application turned out to be not very amenable to
scripting. After some time struggling with it, I realized that I essentially used
QGIS as a front end to another GIS called GRASS. I ended up using a multitude of
GRASS’s individual tools such as v.generalize, v.clean etc. to construct
data processing pipelines that take raster data and chew on it multiple times until
vector data comes out.
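For flavor, here is roughly what such a pipeline looks like when driven through GRASS’s Python interface. The module names are real, the parameter values are illustrative, and the script is assumed to run inside a GRASS session (e.g. via grass --exec); it is a sketch, not the exact chain I used.

```python
import grass.script as gs

# Vectorize the masked, classified raster into area features.
gs.run_command("r.to.vect", input="landcover_masked", output="landcover_raw",
               type="area")

# Smooth the 90-degree ladders left by vectorization.
gs.run_command("v.generalize", input="landcover_raw", output="landcover_smooth",
               method="chaiken", threshold=1)

# Remove small-angle artifacts before export.
gs.run_command("v.clean", input="landcover_smooth", output="landcover_clean",
               tool="rmsa")

# Export to GML for the later GML-to-OSM conversion step.
gs.run_command("v.out.ogr", input="landcover_clean", output="landcover.gml",
               format="GML")
```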
Among libraries to process, convert and otherwise transform data, GDAL was of
utmost value to me, both directly and indirectly via all the GRASS tools based on it.
Besides many existing tools and frameworks, I’ve written quite a few lines of
Python, Java and Bash scripts to assist with data conversion, filtering and cleanup.
Currently the bulk of this code is on Github and in
my other repositories, and I continue to reuse some pieces of it for my ongoing projects.
The project resulted both in things visible to others, in the form of map
improvements, and in a lot of knowledge for me and hopefully for others.
Quite a few of the kommuns had already been mapped well enough.
Adding new data for them would mostly mean a lot of manual cleanup work without
significant improvements to the coverage. As expected from the beginning,
examples of such well-covered municipalities were the areas around the biggest
cities, such as Stockholm, Göteborg, and Malmö.
The farther to the north, the less land cover data was present in OSM, and the more
reasonable a data import seemed.
Kommun borders were selected from the beginning as the top level of the hierarchy
determining the import structure. However, this turned out to be impractical: the sizes
of such import units varied wildly, kommun areas were often too big to review visually
in one sitting, and the border geometry oftentimes turned out to be too
convoluted (long and twisted, with enclaves and exclaves etc.), with no
practical benefit coming from blindly obeying it.
From somewhere in the middle of the project on, these boundaries were only used as
rough guidelines for splitting data into tiles. All tiles were of fixed size and
alignment, and as such could span kommun boundaries.
The following parts of the country were completed, fully or partially.
Vingåkers kommun. It was the first one and the only one converted, conflated
and committed as a whole in one go. Being the first, it also absorbed a lot of mistakes
together with the data, such as overly aggressive Douglas-Peucker simplification.
Katrineholms kommun. I started to map this area manually long before, then tried a
plugin to assist with tracing forests. Around 50% of the territory was prepared
by these means. Finally I finished it with the import data. As the area was
adjacent to the just-finished Vingåkers kommun, this was the first time
I had to tackle the requirement of nicely aligning polygons across adjacent parts
of the import.
Vadstena and Åstorp. Relatively small subareas which are mostly covered by
farmland. Here I tuned my algorithms and learned to expect unusually tagged
(multi)polygons to conflict with new data.
Linköpings kommun. This was basically the only support I received from someone else
during this project. I have not participated much in working on this kommun,
and the data used for it, as far as I can tell, was from one of the first batches
that I provided, and as such it included none of the improvements present
in later iterations.
Åre kommun was my biggest effort so far. The result of many laborious
evenings, nights and days, the kommun has been mapped to the fullest. Besides
the territory of the kommun itself, adjacent parts of the country (e.g. parts of
Bergs kommun) were also mapped. More details are in my previous post.
Ljusdals kommun. Compared to earlier work, here I started to map water areas in
addition to forests, farmland and other “ground” cover. I discovered that
the situation with mapping of smaller lakes in Sweden was not as good as I
originally assumed. Imported water polygons included smaller lakes and medium-sized
rivers and streams represented with a non-zero width. Compared to ground cover,
water polygons had to be treated as monolithic (see below), which was reflected
in conflation script adjustments.
However, I did not finish this kommun. All work essentially halted after that.
The remaining kommuns did not receive significant changes. While there were no
technical reasons to stop at this point, I was unable to reduce the amount of manual
work to a level low enough to allow a single person to finish all tiles in a
reasonable time. Work on Åre kommun demonstrated that individual tiles required up to 30-60
minutes of manual work each to import. This is still faster than tracing
an area of the same size with the same level of detail by hand, but the manual operations
that had not been offloaded to the scripts were becoming tedious.
I go deeper into the technical details behind some of these problems below.
It’s time to complain. A few factors were discovered along the way which I tend
to classify as external problems, common to any sort of future import
work attempted for the OSM project.
The OSM community’s inconsistency towards imports in general and land cover
imports in particular. Not many arguments of DH4
or higher were given to me. I won’t delve deeper into the social, humanitarian,
technical, economical, legal and political circumstances known to me that lead
to this situation. Conversations on that mailing list, while not openly toxic, tend to
derail into dwelling on general unsolved/unsolvable issues of the whole
project. This brings little constructive feedback to the initiator of an
import effort, while forcing him/her to drown under a disproportionate volume
of email exchange.
There are deliberately no objective formal/verifiable/measurable criteria
of acceptance, meaning everyone judges data from his/her own standpoint of
“beauty”. There is no authority to have the final word in a discussion, and there
are no procedures for voting for/against a proposal, which gives disproportionate
power to the few with louder voices.
By the way, barely anyone commented on the import plan topics at all.
There were no comments on the quality assurance sub-topic, for which I had been so hopeful.
It seemed that spending so much time on documenting the project was redundant.
Rounding of coordinates done on the server. I was baffled to discover that, after
I had thoroughly made sure that no self-intersections were present in the data and
uploaded it, I could download it back and see my polygons self-intersect!
It looked like nodes were moved a tiny bit when they were returned from the database.
Comparing the original data files with copies downloaded from OSM, I noticed
that coordinate precision had been lowered: only 7 digits after the decimal point were
kept. At the same time, JOSM (and its tools like the Validator) is perfectly capable
of operating on coordinates with 10 or more digits after the decimal point. One needs
to take this into account when running geometry checks.
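A trivial precaution that follows from this: round coordinates to 7 decimal places locally before running geometry checks, which reproduces what the server will actually store.

```python
# OSM stores latitude/longitude as fixed-point numbers with 7 decimal places,
# so rounding locally shows the geometry as it will exist after upload.
def osm_round(lon, lat):
    return round(lon, 7), round(lat, 7)

print(osm_round(16.123456789, 59.987654321))  # (16.1234568, 59.9876543)
```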
Poor validation tools for multipolygons. Neither offline nor online tools (neither
JOSM nor Osmose) seem to report
intersections/overlaps of multipolygons. From a practical standpoint, overlapping
multipolygons are the same problem as overlapping polygons. It may be technically
harder to implement and computationally more costly to run such a check, yes,
I do realize that. But at least some basic checks, or even imprecise algorithms
(with a reasonably low ratio of false positives), would still be better than nothing.
I’ve spent quite some time hunting for rendering problems caused by omissions
made in multipolygons (e.g. because of the implicit rounding problem).
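As a crude stand-in for the missing check, a library such as shapely can at least report overlaps between pairs of (multi)polygons extracted from the data; this is a sketch of the idea, not a validator implementation.

```python
from shapely.geometry import MultiPolygon, Polygon

def overlaps(a, b, tolerance=1e-12):
    """Report whether two (multi)polygons overlap by more than a tiny area."""
    inter = a.intersection(b)
    return (not inter.is_empty) and inter.area > tolerance

old = MultiPolygon([Polygon([(0, 0), (1, 0), (1, 1), (0, 1)])])
new = MultiPolygon([Polygon([(0.5, 0.5), (2, 0.5), (2, 2), (0.5, 2)])])
print(overlaps(old, new))  # True: the two areas genuinely overlap
```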
The data flow looks approximately as follows:
new data → review-ready data (with additional fixme tags)
old data → dropped data (with new note tags)
One excellent tool I discovered too late was ogr2osm.
GRASS GIS worked best with vector data in GML format, while JOSM understands its
native OSM XML format best.
JOSM’s GeoJSON and GML support was lacking at that moment. I ended up writing
my own converter from GML to OSM XML, but ogr2osm is excellent for such work
and I plan to use it instead in the future.
Often there are two vector features that need to be adjusted so that they share a common
segment. One might think that automatic snapping — moving/merging nodes of a
source line that are close enough to the destination line — would do the job in a second.
But so many things can go wrong with it.
To quote the v.clean tool=snap manual page:
The type option can have a strong influence on the result.
A too large threshold and type=boundary can severely damage area topology, beyond repair.
I have some ideas about an iterative snapping approach. Lines are “stretchy” and
“flexible” and are “attracted” towards each other. Attraction forces compete against
repulsion forces, and as a result the most “natural” (i.e. minimal potential energy)
relative position of the lines is achieved. I do not know, however, how hard
it would be to fine-tune the parameters of the attraction/repulsion forces for the process
to be stable enough, nor whether the performance of such an algorithm would be sufficient
for practical use.
When splitting bigger geographical data files into smaller chunks by arbitrarily
chosen (i.e. not dictated by the data itself) borders, try to keep information about
split points to simplify re-gluing of adjacent resulting data.
Without such information, split points either have to be detected algorithmically,
which is neither trivial nor reliable, or specified manually, which is more
time-consuming than one would think.
It does not matter whether data is organized into regularly shaped rectangular
tiles, or follows less regular but nevertheless arbitrary administrative borders.
Ignoring the problem won’t help either as it would result in unconnected linear
features and gaps and/or overlaps for landcover features.
Of course, the aforementioned problem does not affect zero-dimensional imports,
such as POIs. They can be organized in arbitrary subsets without disturbing
any sort of (non-existent) relational information.
Here are some ideas on how to handle the re-sewing problem.
For vector polygons, maintain a correspondence between the unique node IDs used
internally by your source (i.e. negative numbers in OSM XML files) and the unique
node IDs assigned to them by the OSM database after they were successfully uploaded.
Suppose two polygons are uploaded independently because they are in different
tiles. To avoid uploading duplicates of the nodes they share, after the first polygon
has been uploaded, the IDs of those nodes (and the refs pointing at them) have to be
updated in the second polygon. This way, shared points are only uploaded once:
the second polygon refers to their already present instances instead of introducing
duplicates with different IDs.
Alternatively, duplicate nodes can be merged after everything has been uploaded,
as detecting them should be easy. However, this is less elegant, and creates some
room for mistakes, as we essentially recreate shared borders instead of preserving
them in the first place.
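A minimal sketch of the first approach (the id_map would come from the server’s upload response; the file names and IDs are made up):

```python
import xml.etree.ElementTree as ET

def remap_shared_nodes(osm_xml_path, id_map, out_path):
    """Rewrite a not-yet-uploaded tile so that nodes already uploaded with the
    previous tile are referenced by their real OSM IDs instead of being
    duplicated under fresh negative placeholder IDs."""
    tree = ET.parse(osm_xml_path)
    root = tree.getroot()
    for node in list(root.findall("node")):
        if node.get("id") in id_map:
            # The node already exists in the database; drop the local copy.
            root.remove(node)
    for nd in root.iter("nd"):
        ref = nd.get("ref")
        if ref in id_map:
            nd.set("ref", str(id_map[ref]))
    tree.write(out_path)

# id_map maps placeholder IDs to IDs assigned by the server, e.g.:
# remap_shared_nodes("tile_B.osm", {"-101": 7345829101}, "tile_B_remapped.osm")
```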
For raster data, the process is more involved, as the vectorization phase and
follow-up simplification passes can easily move nodes and destroy relational information
close to tile borders. Recovering such information from adjacent vector polygons
is even less reliable as it involves some sort of node snapping to lines.
Node snapping involves moving existing nodes or adding new ones and is therefore
capable of corrupting the geometrical properties of polygons: creating self-intersections,
loops etc. I can imagine two approaches to the problem.
Modify vectorization algorithms to treat tile border pixels and nodes produced
from them in a special way. Resulting vector features should retain information
about which segments were traced from tile boundaries.
The same applies to line simplification algorithms: they should not move nor
delete nodes originating from tile borders.
Use natural borders present in the data to determine the form and boundaries of
smaller chunks. For example, if exclusively tracing forest areas, linear borders
such as water coastlines, (buffered) roads, rivers etc. can be used to
delineate where a free-form chunk ends. The problem with this approach is that
there is no guarantee about the result’s size, form or run time. Basically, a
single stretch of forest can span the whole country. Inevitably,
artificial cuts have to be introduced into the data. Minimizing their
length while keeping chunk sizes within limits would be an interesting challenge.
I am not the only one who considers tile borders to be an issue for automatic
processing of geometrical features.
From Facebook’s RapiD FAQ:
Why does RapiD crop roads at task boundaries? Are you concerned about the risk of creating disconnected ways?
We believe this is a general problem when working on tiled mapping tasks. We learnt from the community that when working on tasks on HOT Tasking Manager, a general guideline for the mappers is to draw roads up to the task boundary to avoid creating dupes across tasks, so RapiD is designed to align with this guideline. When the user is working on the neighbor task later, the close-node check or crossing-way check will have a chance to catch the disconnected ways and help the user fix them.
So far, I have used the “natural” borders approach only in a limited form: roads running
through forests cut them into smaller bits. For rectangular tile borders,
a lot of manual work was needed to recover common borders late in the process,
because the vectorization and simplification tools I used did not care about preserving them.
Here is another concept I discovered when I decided to import water polygons.
For forest areas it is quite OK to split them along arbitrary borders. Generally,
someone mapping a forest manually starts by drawing its border until he/she
becomes tired or hits the limit of 2000 nodes per way.
The current portion of the forest then gets closed with long straight lines going
right through the forest mass so that the polygon becomes closed, and this polygon
gets uploaded. The next adjacent section of the same forest is then traced
the same way, combining “real” borders and the previously drawn artificial border.
The picture below illustrates this situation.
However, closed water objects, such as lakes, even huge ones with borders
spanning well over 2000 nodes (and thus represented as multipolygons), are
traced and treated differently. People do not tend to create arbitrary internal
borders for them.
In other words, there are few examples when a lake gets treated like this:
Instead, a nice single polygon is usually drawn:
It seems that land cover classes, among other classes of area-like features,
gravitate towards one of the following types.
The import raster data already has arbitrarily defined split lines. These lines may
split monolithic features into two or more vector pieces.
This is undesirable, as it goes against the tradition of mapping such features.
For this reason it was decided to pay special attention to new water polygons
lying close to tile borders and, when needed, to merge multiple pieces back into
a single object. This, however, had to be done manually.
A good thing with monolithic features, however, is their “all or nothing” nature,
which can be used when conflating them against already mapped counterparts.
A special algorithm compared a “closeness” ratio of the borders of a new and an old
water feature to decide whether they corresponded to the same object.
Such a comparison would make no sense for non-monolithic features, as parts of their
borders may be defined arbitrarily by a user’s will, not by properties of
the physical world.
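One way to express such a “closeness” ratio (a sketch of the idea, not the exact formula the script used): measure what fraction of the new feature’s border lies within a tolerance of the old feature’s border.

```python
from shapely.geometry import Polygon

def border_closeness(new_poly, old_poly, tol=0.0005):
    """Fraction of the new border lying within `tol` (degrees) of the old one.

    Values close to 1 suggest both polygons describe the same lake; values
    close to 0 suggest they are different objects.
    """
    near = new_poly.exterior.intersection(old_poly.exterior.buffer(tol))
    return near.length / new_poly.exterior.length

old_lake = Polygon([(0, 0), (1, 0), (1, 1), (0, 1)])
new_lake = Polygon([(0.0001, 0), (1, 0.0002), (1.0001, 1), (0, 1.0001)])
print(border_closeness(new_lake, old_lake))  # close to 1.0
```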
My better understanding of Openstreetmap’s intrinsic conflicts and different views,
including the sorts of idealistic philosophy some people preach. The anarchy allowed
in certain aspects of the project’s existence clashes with the desire for
rigid control in other aspects.
Tools for operating on OSM files and related vector formats, available at
Github. They surely
duplicate a lot of functionality already existing in GIS systems. These
tools are mostly useless for anyone but myself because…
There are much better programming tools available for processing geometrical
information, and I need to learn these tools and start using them. Libraries
such as shapely, ogr, (geo)pandas and others exist, and they could have
simplified some of my work had I known about them in advance.
Osmosis is an OSM-focused framework I might also need to learn a bit.
Originally published here: https://atakua.org/w/landcover-conflation-unsolved-issues.html
This post continues where the previous one left off.
After some time spent on processing and importing land cover data, I have
several ideas on how to further improve and streamline both the import process
and work with land cover features in JOSM in general.
Certain typical tasks arise over and over again when one works with polygons
meant to represent land cover, regardless of whether they are imported or manually
traced. At the moment there are no adequate tools in JOSM to assist with such tasks.
The trick here is not to try to find an exact geometric solution to the task at hand,
but rather to imitate what a human would reasonably do to finish such a task. And a
human would cut corners, trading some inexactness for speed of completion.
A common task is to fill a gap between two or several polygons. An example would
be to map a new farm field situated between several forests or crammed between
several intersecting road segments. Currently one has to carefully trace a new way along
the existing borders, either reusing nodes or leaving a small gap between the new
ways and adjacent ones.
The idea here is similar to pouring a bucket of paint into the middle of the empty
area and letting it spread out naturally: the paint spreads until it hits borders,
or until it runs out.
The same approach can be implemented in a tool that starts from a single node
(or rather, from a tiny closed way) which then grows in all directions. Its
growth is stopped when a segment of the new way hits a boundary in the form of an
existing way. Optionally, the new way can then snap to the existing way there.
It sounds simple, but it will require some clever implementation to be robust,
fast and reliable. I can already see a couple of details that
will require attention during implementation.
The resulting polygon does not have to fill precisely the intended area. It is
not bound to share all its boundary segments with surrounding features, or
reuse all of their nodes. Surely, some nodes and segments may be shared,
but solutions that leave a small configurable gap between the new polygon and
old polygons are also acceptable.
The resulting polygon should treat itself as a boundary as well. It may happen
that a resulting figure has holes in it, and a new polygon should become
a multipolygon. The simplest alternative is to keep it as a polygon with thin
“bridges” in it, almost wrapping around the inner holes.
Surrounding land cover features are not guaranteed to form a closed perimeter
around the empty area one wants to map. There is a risk that the contents of the new
area will leak outside through gaps found between these surrounding features.
To prevent uncontrolled spreading of the new way through such leaks, two
strategies may be applied. Firstly, the total area allowed to be covered by the flood-fill
process should have a hard limit. Secondly, spreading through such holes can be
prevented by giving new nodes a non-zero buffer diameter. This basically makes
them thick, so they cannot squeeze through gaps smaller than a predetermined value.
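A very crude approximation of the idea, just to make it tangible (shapely-based, and not how a JOSM plugin would actually do it): grow a disc around a seed point, subtract the surrounding features padded by the desired gap, and keep only the piece that still contains the seed.

```python
from shapely.geometry import Point, Polygon
from shapely.ops import unary_union

def flood_fill(seed, obstacles, max_radius=0.01, gap=0.0001, step=0.001):
    """Grow a buffer around `seed`, clipped by `obstacles`, up to `max_radius`."""
    blocked = unary_union([o.buffer(gap) for o in obstacles])
    radius, result = step, None
    while radius <= max_radius:
        candidate = seed.buffer(radius).difference(blocked)
        # The difference may fall apart into several pieces; keep only the one
        # containing the seed so the "paint" cannot jump over obstacles.
        parts = getattr(candidate, "geoms", [candidate])
        containing = [p for p in parts if p.contains(seed)]
        if not containing:
            break
        result = containing[0]
        radius += step
    return result

wall = Polygon([(0.005, -1), (0.006, -1), (0.006, 1), (0.005, 1)])
print(flood_fill(Point(0, 0), [wall]))
```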
There are many situations when splitting a closed way that either intersects itself
or overlaps with another one makes sense. Possible uses of such functionality include:
Fixing self-intersections of small loops left after simplification or
coordinate transformations of polygons.
Ensuring new polygons have a nice single common border with old ones by
cutting them in two along that boundary and then removing the smaller part as noise.
This is similar to what v.clean does with the options tool=break and tool=bpol.
Another useful addition is cleaning up zero angles in polygons, identically to
what tool=rmsa does. These are always artifacts not worth keeping, and as such
it should be possible to remove them.
Having nice common borders between land cover features without any under- or
overshoot is a hard task requiring a lot of manual labor.
Dragging or replacing nodes to make a common border is called “snapping” them
into place.
The current built-in JOSM tools (the “N” and “J” actions) lack configurability and
cannot be used at scale, although they are very useful for patching something up.
The main problem with snapping is that it can destroy geometry quite significantly
if done without measure.
To decide which nodes to move and which to keep,
a threshold distance value is used.
However, the nodes to be snapped are usually already organized in ways, and
preserving the sanity of these ways after some of their nodes have been moved is a
challenge. There are two recurring issues: 1) which way to snap a node to when there are multiple
alternatives within the threshold distance, and 2) in which order to insert several
snapped nodes into a destination way.
If several nodes from the same source way are snapped to different ways in its vicinity,
the result is often a mess of self-intersecting ways.
Even if the same target way is chosen for several snapped nodes, if the order in which they
are inserted into it is wrong, the result has zero-angled segments and annoying
overlaps with destination ways within a narrow threshold area around them.
What I think is needed is to treat the problem as an optimization task.
An algorithm may iterate over small movements of individual
source nodes, which are moved according to a “force field” or “potential” function
defined by the positions of destination ways. Source nodes that are far enough from
destination segments experience no incentive to move around and stay mostly
still. Nodes closer than the threshold distance are attracted to destination ways,
and eventually get placed on them. To prevent nodes ending up in the wrong order,
only one node per step is allowed to be inserted into a destination way, which
can only be done unambiguously. In the next iterations that node is treated as if
it had always been there, and incoming snapped nodes are automatically ordered correctly
relative to it (XXX is it true? can I prove it? are there counterexamples to that?).
To prevent source segments from overlapping with destination segments, the force potential
should repel source segments from destination nodes. This way, source nodes
are attracted to destination ways while source segments are pushed away from destination nodes.
As pointed out earlier, we are not looking for a 100% correct solution
but for one that is good enough without being too destructive.
The bad part is that this algorithm is more complex than simple snapping, which
could be done in linear time. The number of iterations needed to achieve a stable result
may be hard to guess in each case.
Scanaerial is a tool to help with tracing aerial images. It is especially
effective for water surfaces (as they have simpler textures than e.g. forests).
The problem is, it is an external program written in Python. Among the many problems
this brings are lower speed than possible, no progress indication, and awkward
configuration of which aerial imagery to use.
Including the same functionality directly in JOSM as a Java plugin would
allow the tool to have a better interface and provide a smoother user experience.
It could use all imagery present as JOSM layers.
Another improvement would be to make it stop at boundaries where objects are already
traced, e.g. by using them as a mask for the raster. That would
save a lot of time when conflating the tracing results into the whole picture.
This section discusses not-yet-solved problems in the import data I’ve worked with.
Their consequences had to be edited manually, which was the limiting
factor for the import. Thus it would be very welcome to solve them mechanically wherever
possible.
Compare the following two pictures. Before cleaning imported forests along a road:
After manual cleaning of nodes.
Usually humans find the second variant preferable, as it has fewer nodes and a cleaner geometry.
The task is almost equivalent to snapping nodes to an “invisible” buffer polygon
created around the road. If all source nodes are placed on the buffer boundary,
the resulting forest borders run parallel to the road in the middle.
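A sketch of that idea with shapely (illustrative parameters, planar coordinates): build the buffer corridor around the road and project nearby nodes onto its boundary.

```python
from shapely.geometry import LineString, Point
from shapely.ops import nearest_points

road = LineString([(0, 0), (10, 0)])
corridor = road.buffer(1.5)  # half-width of the "invisible" corridor

def snap_to_corridor(node, max_dist=1.0):
    """Move a node onto the corridor boundary if it is close enough to it."""
    p = Point(node)
    if corridor.exterior.distance(p) <= max_dist:
        snapped, _ = nearest_points(corridor.exterior, p)
        return (snapped.x, snapped.y)
    return node

print(snap_to_corridor((3.0, 1.2)))  # lands on the boundary at y = 1.5
print(snap_to_corridor((3.0, 5.0)))  # too far away, left untouched
```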
A road can run through a massif of forest, farmyard or similar land cover.
Alternatively, it can lie completely outside of any land cover, effectively
cutting it in halves; the road then has its own “channel” in which it runs.
The worst case is a mixture of these strategies: a road that
runs outside the forest, then jumps into a short chunk of it and then
emerges back out of it is confusing. See the picture for an explanation.
Here, the outer and inner ways of the surrounding multipolygon create “almost” a
channel for the road. Properly removing the “plug” in this channel requires
a huge amount of work: split the outer way in two, split the inner way in
two and change its role to outer, then sew the matching parts of these halves together to
obtain two outer ways placed on both sides of the road.
This is kind of self-explanatory.
As we get more and more highly detailed aerial pictures and more and more
advanced remote sensing and tracing software, it is often obvious during an
import that existing data is of lower resolution than the data that could have
been imported in its place. But, to play it safe, the old data is considered
“golden”, and new fine features get mangled at borders where they contact old ones.
In the case of lakes, many of them were present on the map, but badly traced.
If we were to import lakes as well, it would be preferable to replace the
old coarsely drawn ones with new ones. But how to decide reliably?
Is it possible to measure the resolution of vector objects?
A few notes to self to try before importing the next huge chunk of data, or
to do as completely separate imports.
Create a workflow to produce raster mask layers from OSM XML by converting it to
SHP and then to GeoTIFF. Using QGIS and Geofabrik’s exports is error-prone,
as there are multiple files to combine and they have weird category mappings.
Import of new short ways as single nodes when the resolution is not enough. Individual
houses are typically represented as one pixel in the input raster, or a
10×10 meter square of four nodes in its vectorized form. For really remote buildings
outside of any residential areas, it is possible to at least record their position
as a node.
Replacing geometry of ways with finer geometry. If there is a way to measure the “resolution”
of a vector feature (fractal dimension metrics?), then two features whose centroids
are close enough can be compared. Assuming that both features represent
the same physical object, it can then be decided whether to replace the old
geometry with the new one.
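One candidate measure, purely my own assumption: the average segment length (perimeter divided by segment count). A coarsely traced lake has long segments; a finely traced one has short segments.

```python
from shapely.geometry import Polygon

def mean_segment_length(coords):
    """Average length of the exterior segments of a polygon."""
    exterior = Polygon(coords).exterior
    return exterior.length / (len(exterior.coords) - 1)

coarse = [(0, 0), (100, 0), (100, 100), (0, 100)]
fine = [(x, 0) for x in range(0, 101, 5)] + [(100, 100), (0, 100)]
print(mean_segment_length(coarse))  # 100.0
print(mean_segment_length(fine))    # ~17.4, a "higher resolution" outline
```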
So far I have been concerned with filling in the empty places.
Defining and implementing a strategy for updating land cover is yet another
topic to explore. The key here is to reliably decide when two features from the
old and new datasets represent the same object, and which representation is
worth keeping. A technique to cut holes in existing multipolygons will become
critical in order to e.g. handle a scenario where a section of a
forest has burned down and should be excluded from the old multipolygon.
Differential pixel comparison of new and old import data might come in handy to
find all places with “diffs” and act only on them. The key here is to bring the
raster inputs to comparable states (identical coordinate systems, spatial
extents, pixel resolutions and land cover classifications).
Generate import vector datasets of different “resolution” or “detailedness”.
That is, produce a family of parameterized datasets which have different
aspects of their generation adjusted one way or another.
E.g., have a layer that only contains objects larger than a predetermined value;
or only data for forests; or apply different aggressiveness of smoothing
algorithms. Typically, the “resolution” of new data is dictated by the existing
data density for the tile. It does not make sense to add a multitude of fine
details to a tile that was coarsely outlined, without essentially redoing it first.
However, there should be just a few such datasets per tile.
Otherwise one would spend too much time choosing between them and comparing them
instead of integrating them into the map.
Use local “rubber band stretch” transformations for the vector data when
adjusting positions of individual nodes with respect to positions of existing
nodes, just as the pioneers of map conflation did. The potential function
idea outlined earlier builds on the ability to stretch things without
introducing the new topological errors that dismay humans.
Reduce the number of inner polygons in multipolygons, leaving only the biggest ones
(e.g. more than 5% of the outer way’s area). We have too many fine details, but
which of them to keep?
Try using ogr2osm  instead of
my own gml2osm.py.
Originally posted here: https://atakua.org/w/landcover-conflation-practical-issues.html
This is the third part of summarizing my experience with conflation
of land cover data for Sweden. More examples of practical problems and ways to
address them follow.
The same or similar problems may or may not arise during
other imports of closed (multi)polygons in the future, so tips and tricks to save
time may come in handy. Note that some points from the previous part may be
repeated here, but with more thoughts or ideas on how to address them.
The general idea of importing any data into OSM is to save time on doing the same
work manually. Classic data sources for OSM contents are:
Local or public knowledge. This is mostly useful for points of interest,
provided one knows the exact position of the point. For land cover needs,
local reconnaissance is critical for establishing secondary details of the actual surface:
for a swamp, what type of swamp; for a forest, what species of trees,
of what height/diameter and age. There are remote sensing methods,
however, that in some cases allow obtaining some of this information without
visiting the actual place.
GPX traces submitted by users. These are very useful for mapping certain linear objects
such as roads, which is natural, as people tend to move along roads. But
people do many other things for which a map would be helpful, and for those,
obtaining accurate linear traces is problematic.
For closed map features, such as building outlines, the usefulness of GPX traces
is limited, as people rarely circle around each and every house.
Besides, the baseline GPS resolution of several meters (without a lengthy
precision improvement procedure) is often not enough to make out where the actual
corners of a building are. Walking around a swamp in a remote location to get
its outline is rarely doable in reasonable time.
The benefit of a data import is that, instead of tracing aerial images and/or
GPX traces, we simply get “ready” data.
This data is also very likely to contain “noise” — all sorts of artifacts,
wrong points, extra useless data etc. The value of the new data should always be
weighed against the amount of new issues it brings into the database; this is
a classical signal-to-noise problem. It is essential to determine what kind and level of
noise is acceptable. Whether the noise present in the script-generated data is
comparatively easy to ignore or to remove manually determines
whether it is worth importing the data and fixing it at the same time.
It may turn out that it is faster to simply trace everything manually or to get
the data from another source.
An important aspect of land cover features is their inherent redundancy and
the inexactness of their geometry. You can often delete a node of a polygon denoting
a forest, or move it slightly, without significantly
affecting the perceived accuracy of how it reflects reality.
This opens up possibilities for optimizations such as adjusting common boundaries of features.
It turned out to be useful to have a rough estimate of how hard a particular
tile would be to integrate, by calculating a complexity metric over it.
Intuition says that the more data there is in the new data layer or
in the old data layer, the harder it is to make sure that they are nicely
aligned. But what is the best mathematical expression to quantify this idea?
Originally, the total number of new and old ways was used as the metric.
It kind of made sense, because a lot of conflation work goes into aligning the common
shared borders between new and old ways. It did not matter how many nodes there
were, and relations typically did not stand in the way, as mostly the
outer ways were important.
Later in the process it was decided that an even simpler metric is more representative
of what needs to be done to integrate a tile:
the number of old ways proved to be proportional to how much time would be
spent aligning new ways to them.
Further refinements of the metric are of course possible. Seeing the complexity number next to
a list of not-yet-processed tiles made it possible to pick which tile to
open next so that it could be processed within a limited span of time.
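A sketch of that final metric (the file names are illustrative): just count the existing ways in the “old data” extract for the tile.

```python
import xml.etree.ElementTree as ET

def complexity(old_osm_xml):
    """Number of existing ways in a tile extract of current OSM data."""
    root = ET.parse(old_osm_xml).getroot()
    return sum(1 for _ in root.findall("way"))

# Sort the backlog so the cheapest tiles can be picked when time is short.
tiles = ["tile_16.1_59.0.osm", "tile_16.2_59.0.osm"]
for tile in sorted(tiles, key=complexity):
    print(tile, complexity(tile))
```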
The import raster file stored information for the whole country and a wide assortment of
land use classes, including those common for wilderness, residential and water areas,
as well as rail- and car roads.
Information about roads was clearly the least useful. As an assortment of pixels,
roads carried no guarantee of proper connectivity, which is the main property
to extract and preserve in a vector map meant to support routing.
Residential areas were also excluded, as the raster resolution of 10×10 meters did not
allow reconstructing the actual footprints of individual buildings. There is, however,
an idea to use this information to partially map remote isolated dwellings, if not
as polygons then at least as building=yes nodes.
It was finally decided not to import water bodies, such as lakes, swamps and rivers, but
to focus on forests, farmland and grass. This is despite the fact that information
about water bodies was in fact extractable from the import raster.
The motivation for this decision was based on the following premises.
Water bodies were assumed to already be well represented in OSM. This turned out
not to be true for many parts of the country; as a result, the map now has
“white holes” where not-yet-mapped lakes should be.
The decision to import land cover for islands and islets. This conflicted with
the established approach of using a separate mask raster layer that did not
treat water areas as masked: including mapped lakes in the mask layer would
have masked the islets.
As we can see now, the OSM map would benefit from a careful import of missing lakes as well.
It should be possible to import water bodies in a separate process from the same input
raster. The import data processing scripts would have to be adjusted in many places
to take into account the different status of lakes. In particular:
The mask layer should include “natural=water” to prevent the creation of overlapping
water bodies. The ocean/sea around the coastline has to receive the same treatment
to avoid mapping water inside the ocean.
However, areas outside the coastline are likely to already be marked
as “no data” in the source raster image.
To prevent “slivers” of water in the shore area (similar to the noise around already
mapped forests), limits on how “oblong” new water features may become
should be set in the scripts (one possible measure is sketched below).
This means that wider rivers will most likely be excluded from the end result.
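One possible “oblongness” filter, my own assumption rather than what the scripts currently do, is the Polsby-Popper compactness 4πA/P², which is close to 1 for roundish lakes and close to 0 for thin slivers and river-like shapes.

```python
import math
from shapely.geometry import Polygon

def compactness(poly):
    """Polsby-Popper compactness: 4*pi*area / perimeter^2."""
    return 4 * math.pi * poly.area / (poly.length ** 2)

round_lake = Polygon([(0, 0), (10, 0), (10, 10), (0, 10)])
sliver = Polygon([(0, 0), (100, 0), (100, 0.5), (0, 0.5)])
print(compactness(round_lake))  # ~0.785, keep
print(compactness(sliver))      # ~0.016, likely a sliver or a river, drop
```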
The naive tiled approach with static extent positions would lead to new lakes
being cut into several adjacent pieces. To avoid this, tiles should be resized
dynamically to implement the “all or nothing” approach for water bodies.
The idea is that, while a forest can be split into several adjacent pieces that
can be mapped individually, a lake is typically added as a whole
area. It may be represented as a multipolygon with multiple shorter outer
boundaries if needed, but the result should not look like two separate water
bodies with a common border along a tile edge. At least this is not how one
usually maps lakes in OSM. If nothing else, larger tile sizes would
decrease the number of clipped water bodies. In the remaining cases, two
or more pieces should be merged into a single polygon, not just aligned to have
a single common border.
Making sure the border between land and water is unified will be a tough problem
to automatically solve, given the amount of land cover data already imported.
Tiles, i.e. rectangular extents of data of the same size and adjacent to each other,
are one of the simplest methods of splitting huge geodata into smaller chunks that are easier
to work on individually. However, the transition from one tile to adjacent ones
should still be smooth in the resulting vector data. The most apparent
problem that comes to mind is positive or negative gaps between features lying
close to the common borders of adjacent tiles.
During this import the following problems with tile boundaries were observed.
A bug in a tile splitting algorithm caused adjacent tiles to overlap
significantly (by up to several hundred meters). This defect in the resulting vector
data was very cumbersome to fix manually, especially when at least one of the
overlapping tiles had already been uploaded to the database.
Careful ordering of coordinate system conversions avoided this issue
and allowed further tiles to be generated with much better mutual alignment.
Smoothing algorithms tended to move nodes, including tile border nodes. Nodes
at the four corner positions of the rectangle suffered most often — they were
“eaten” by the Douglas-Peucker algorithm and required manual restoration
at the editing phase. In general, the tile edge is an area where
“sharp” corners in features should be allowed, but curve simplification
algorithms tend to replace them with more gradual transitions.
Rounding errors accumulated at different data processing stages caused
boundary nodes to be moved by a small fraction of a degree. This regularly produced a
tiny (around 1e-4 degrees) but noticeable
(bigger than the 1e-7 resolution that OSM uses internally) overshoot or undershoot
against nearby parallel tile borders.
To compensate for this, a separate data processing stage was written to determine
which nodes are likely to be “boundary” nodes of the tile. It then made sure that
their respective longitude or latitude values lie precisely on the tile
border. For example, all nodes with longitude 45.099998 would be moved to 45.1.
This improved the situation somewhat, often making at least some of the nodes on
two adjacent tile borders receive identical coordinates and thus
allowing automated merging of feature nodes in such cases.
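The boundary-fixing pass boils down to something like the following (a simplified sketch; the real script also has to decide which nodes count as boundary nodes in the first place):

```python
def snap_to_grid(value, step=0.1, tol=1e-4):
    """Snap a coordinate onto the nearest 0.1-degree tile edge if it is
    within `tol` of it; otherwise leave it alone."""
    nearest = round(value / step) * step
    if abs(value - nearest) <= tol:
        return round(nearest, 7)
    return value

print(snap_to_grid(45.099998))  # -> 45.1, now identical on both tiles
print(snap_to_grid(45.04321))   # not near an edge, left as-is
```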
Restoring the original shape required manual movement of nodes in both the
latitude and longitude directions. Ignoring a minor misalignment for a small remote natural
feature was often fine (who cares if this tiny swamp looks cut?), but when the
cut line went through a residential area, one would certainly want to restore the
original geometry.
Despite some implemented measures for automatic tile border merging, a lot of
manual adjustment work was still required in many cases. This activity was one of the
limiting factors for data import speed. Clearly there is room for improvement
for future imports. I see two possible directions.
Simply using bigger tiles reduces the number of cuts in the original data and the amount
of follow-up sewing of tiles: e.g. instead of cutting a country into small tiles,
one can split it into bigger counties. However, as the chunk size grows, other problems
of scale become more prominent, so finding a balance here is critical.
Often it is desirable that two adjacent features share a single border line,
rather than having two loose curves intersecting back and forth.
A single border between forest and lake can be seen along the south coast of the
lake, while the north coast has a CORINE import forest polygon with its own
border running along the lake border.
In my opinion, having a single common border between land cover features is
best. It is more compact to store and less visually complex. It does not correspond
to reality in the sense that there is often no sharp border between two natural features.
Double separate borders do reflect the fact that one border is not necessarily
defined by the other feature. However, there are always unanswered questions about
what lies between the two features. What is going on in the thin sliver of
no data squeezed in between the two borders? Is the distance between the two features
wide enough to reflect the transitional area? What if one of the borders is more
detailed than the other — is it just an artifact or a faithful reflection of reality?
Either type of border is a major source of work during conflation.
Maintaining a single border between new and old features means modifying existing
borders. Creating double borders means dealing with the fact that multiple
overshoots and undershoots are about to happen, and as a result these borders
will “interlace” with multiple intersections along their common part.
Current semi-automatic methods to maintain a single border, besides manually editing
all individual nodes, are:
The plugin SnapNewNodes was made with the idea of helping with the snapping process.
It does help, but it has annoying bugs that make it less reliable.
SnapNewNodes should be made more robust, and it should report cases where
it cannot unambiguously snap everything.
In the import process, the main source of double borders was existing shorelines
of lakes meeting newly imported forests, swamps etc.
An interesting case is when existing islets without land cover received it from the import data:
often the previously mapped water border was of lower quality/resolution than the
newly added forest bordering it, but just as often both new and old borders were equally accurate.
Either way, this resulted in forest partially “sinking” into the water and partially remaining
on an islet.
Manual solutions include:
It has to be manually cut through with two “split way” actions:
This is very laborious and not always possible in the case of multiple inner/outer
ways of a multipolygon, as they are not considered a single closed polygon
that could be cut through:
Something has to be improved in order to speed up processing of such cases.
When a road goes through a forest or a field, there are often situations where a
single node or a small group of nodes belongs to both fields, connecting them briefly,
like a waistline or an isthmus, if you will. Ideally, the road should either
lie completely outside any field, lie completely on top of a single
uninterrupted field, or be part of the border between the fields.
See the selected nodes in the two examples below.
Basically there should be either three “parallel” lines (two field borders and a
road between them), one line (being both a border and a road at the same time),
or just one road on top of a monolithic field.
A manual solution can consist of:
An automatic solution would be to have a plugin or data processing phase that
uses existing ways for roads to cut imported polygons into smaller pieces placed
on different sides of that road. Then smaller ways are thrown away as artifacts.
The current script plow-roads.py attempts something similar to this by
simply removing land cover nodes that happen to lie too close to any road segment.
This does ensure that isthmuses get deleted in many cases. A disadvantage is that
“innocent” nodes that simply were too close to a road often get deleted as well, creating
several other kinds of problems, such as too big gaps between the road and the
field, overly straight field borders, self-intersections or even unclosed ways.
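The core of that behavior can be illustrated in a few lines (a simplified sketch, not the actual plow-roads.py): drop every land cover node that lies within a threshold distance of any road segment.

```python
from shapely.geometry import LineString, Point

roads = [LineString([(0, 0), (10, 0)])]
THRESHOLD = 0.5  # map units; illustrative

def plow(way_nodes):
    """Remove nodes too close to any road; the way may need re-closing afterwards."""
    return [n for n in way_nodes
            if all(road.distance(Point(n)) > THRESHOLD for road in roads)]

forest = [(0, 2), (5, 0.2), (10, 2), (5, 4)]
print(plow(forest))  # the node at (5, 0.2) next to the road is removed
```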
There is no problem with regions already enclosed by landuse=residential, as
they go into the mask layer and effectively prevent any overlap with newly imported
landcover data. However, it is not mandatory to outline each and
every settlement with this tag; often only individual houses are outlined.
Besides, there are many places where no houses are mapped at all yet.
Previously unmapped areas of small farms, residential areas and similar areas
with closely placed man-made features receive a lot of small polygons that
try to fill in all empty spaces between buildings, map individual trees, etc.
A manual solution is to delete all new polygons covering the area, as they are
not of high value among man-made features. It is worth noting that in most
cases these are “grass” polygons with small individual areas. Selecting with the
filter or search functions and then inspecting or deleting all ways tagged
“grass” that are smaller than a certain area could speed up the manual work.
However, it is hard to find all such places, which leads to untidy results reported
by others.
A semi-automatic assistance method would be to buffer already mapped buildings
in the mask layer by about 10-20 meters to make sure that they do not get
overlapped by landcover data. This does not solve the problem of unmapped buildings, however.
Because the OSM data contains a mixture of everything in a single layer,
it is very easy to accidentally touch existing features that are not part
of the import intent.
Generally, for the land cover import, one is foremost interested in
seeing and interacting with features tagged with landuse, natural,
water or similar. Certain linear objects, such as highway and power,
are also important to work on as they in fact often interact with the land cover polygons.
What is almost never important are all sorts of administrative borders.
Examples of such areas are: leisure=nature_reserve, landuse=military,
boundary=administrative etc. They rarely correlate with the terrain.
One does not want to move them accidentally, as their position is dictated by
human agreements, laws, the political situation etc., not by the actual state of the land.
It is recommended to create a filter in JOSM that inactivates all
undesirable features while still keeping them vaguely visible in the
background. This way, it becomes impossible to accidentally select and
thus change them.
In general, priority was always given to existing features, even when it was
apparent that they provided lower resolution or even notably worse information
about the actual state of things. This meant that boundaries of new features
were typically adjusted to run along, or reuse, the boundaries of old features.
In several cases when correspondence to reality was especially bad, existing
features were edited to make sure their borders were in better shape. When it
was possible not to have a common border between old and new features, the old ones
were left untouched.
As an exception, the CORINE Land Cover 2006  polygons from earlier OSM imports
were not always preserved.
For example, in mountainous regions they were completely replaced by the new data.
They were not included in the mask layer.
In lower-altitude regions the CLC2006 features mostly existed for forests.
There they were mostly preserved, at least their tagging. However, the low precision of
such polygons forced me to edit their borders on many occasions: adjusting nodes, adding nodes,
removing nodes, or, rarely, removing the whole feature and replacing it with
import data and/or manually redrawn data.
These were numerous. Typically the focus was on simplifying excessively
detailed or noisy polygons generated by the scripts.
Examples of manual processing include:
Removal of small polygons of 12 nodes or fewer. This was only used for
certain tiles where “slivers” of polygons filled the “no man’s land” between
already mapped land cover polygons with double borders.
Smoothing of polygons that follow long roads. It is typically expected that
land cover borders adjacent to roads run more or less along them without
jerking back and forth unless there is a reason for it.
Because of that, the following fragment needed some manual editing:
It starts looking much more “human” after a great deal of the just-added nodes have been removed.
Validating data after major editing steps and directly before uploading
means that one runs the validation steps at least a dozen times per tile.
The base layer often gives quite a few unrelated warnings for issues that were
present even before you touched the tile.
The most common ways to fix discovered issues are to delete nodes, merge nodes,
snap nodes to a line, or move nodes a bit.
The most common problems present in a freshly opened, unmodified import vector tile were the following.
“Overlapping identical landuses”, “Overlapping identical natural areas” etc. As planned, these happened along the borders of new and old objects. They are easy to fix manually by pressing a button in the validator panel.
Untagged, unconnected nodes. Typically these are remnants of filtering, snapping or similar actions. If it was you who added them, these nodes can be safely deleted with the validator’s fix button.
In the base layer, take a minute to check that any untagged orphan nodes are indeed old (more than several months old). Freshly added nodes may in fact be part of a big upload someone else is doing right now. Because of OSM’s specifics, new nodes are added first, and ways start connecting them only after later parts of the same changeset have been uploaded. The window between these two events may be as large as several hours. So do not clean up untagged nodes added by others if they were added just recently.
If you see a lone node added three years ago, then it most definitely can be safely removed: someone forgot to use that node a long time ago.
A small new feature that is likely to be a “sliver” of land cover squeezed between old and new data.
Those can be safely removed:
Larger new features would require creating a single shared border with an old existing
feature they overlap with.
An example before:
The same border after it has been unified:
Often it is possible (and reasonable) to merge two identically tagged overlapping polygons
into one. Press “Shift-J” in JOSM to achieve that. This may turn out to be easier than
trying to create a common border between them.
This was a rather awkward manifestation of a bug in plow-roads.py. Specifically, when a node connecting two outer ways of a long multipolygon outline was chosen to be deleted, the script did not re-close the resulting multi-way. It did close regular ways whose start/end nodes were removed, but it did not track which nodes played the start/end role in more complex situations.
Luckily, there were not many such situations, and the JOSM validator could always detect them. It was always trivially easy to manually restore the broken geometry by reconnecting the ways.
Islets should be marked as inner ways in lakes typed as multipolygons. However, as this import was specifically not about water objects, it was decided to leave such new islets as simple polygons floating in the water. This saved time on editing lakes, many of which were not multipolygons. Their transformation from a single way into a relation can be done in a separate step.
The JOSM validator allows you to jump to the position of most warnings, which helps to address them efficiently. This works well when the bounding box of the problem is small, which is not always the case.
It is sometimes hard to figure out how and where exactly an intersection between two polygons happens.
In the picture there is one huge way and several small ways that overlap with it. It is close to impossible to spot the individual intersections. However, in the warnings panel you can select a pair of conflicting polygons by their warning entry. Then select one of them (preferably the smallest one) by clicking on it, and zoom to it by pressing 3 on your keyboard (“View — Zoom to Selection”).
Uploading the resulting data starts just like a regular JOSM upload. Follow the general guidelines.
You are very likely to have more than 1000 objects to add/modify/delete, which will cause the upload to be split into smaller chunks. You are also likely to exceed the hard limit of 10000 objects per changeset, meaning your modifications will end up in separately numbered changesets. None of this matters in practice, as the project has neither atomicity nor even the ability to transparently roll back failed transactions.
All open changesets are immediately visible to others. You cannot easily control the order in which new data is packed into the changesets either, so most likely your initial changesets will contain only new untagged nodes, and the following changesets will add the ways and relations between them.
For huge OSM uploads there is always a risk that something goes wrong. Do not panic; everything is solvable, and this is simply how huge distributed systems operate. Here are some of the problems that happened to me.
A conflict with data on the server. This is reported to you through a conflict resolution dialog. Just use your common sense when choosing which of “your” and which of “their” changes will stay. It is often the case that “they” means “you” as well, only the data you uploaded earlier. I do not yet understand in which specific cases this happens, but it rarely does. Before resuming the upload, re-run the validation one more time.
A hung upload. Again, it is unclear why this happens, but it largely correlates with network problems between you and the server. JOSM does not print any relevant messages to the UI or to the log; it just sits still and nothing happens. If you suspect this has happened, abort the upload and re-initiate it. Do not download new data in between. Sometimes your current changeset on the server becomes closed and JOSM reports that to you; in this case, open a new one and continue. There is nothing you can do about it anyway, so why even bother reporting this to the user?
JOSM or your whole computer crashes and reboots. You’ve saved your data file before starting the upload, right?
Open that file and continue.
If any such problems happened during the uploading, do a paranoid check after you are done pushing your changes through. Download the same region again (as a new layer) and validate it. It could have happened that multiple nodes or ways with identical positions were added to the database. In this case, use the validator buttons to automatically resolve this and upload the correcting changeset as the last step. It is typically not needed, but checking that no extra thousands of objects are present is the right thing to do.
It is possible to “roll back” your complete changeset in JOSM by committing a new changeset, but it is usually not needed unless your original upload was a complete mistake.
This was originally posted here: https://atakua.org/w/raster-to-vector-landcover.html
The whole premise of the land cover import for Sweden is based on the idea of taking the raster map of land cover and converting it into the OSM format. This results in new map features that are essentially closed (multi)polygons with tags. These new features are then integrated with the old features of the existing database during the conflation step.
This post is about the first steps of this process, everything around the vectorization.
It is hard to describe all the programmatic and manual actions needed to convert the input data.
A lot of it is described in the OSM wiki page .
The best way to learn the details is to look into the source code of scripts written to
achieve the goal.
However, the general data processing flow will definitely contain most of the following phases,
and maybe something more. The order of certain steps, especially filtering phases, can be different.
Coordinate system transformations are only needed if the input data is not in the WGS 84 coordinate system used by the OSM database. This step can also be done later in the process.
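As a minimal sketch, assuming the source data is delivered in SWEREF 99 TM (EPSG:3006), which is common for Swedish national datasets, a reprojection to WGS 84 can be done with pyproj:

    # Sketch: reproject SWEREF 99 TM (EPSG:3006) coordinates to WGS 84.
    # The EPSG code is an assumption about the source data, not a given.
    from pyproj import Transformer

    to_wgs84 = Transformer.from_crs("EPSG:3006", "EPSG:4326", always_xy=True)
    lon, lat = to_wgs84.transform(674032.0, 6580822.0)  # easting, northing of a sample point
    print(lon, lat)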
To recap, here is how the raster and vector layers of new and existing data relate to each other when being conflated.
It proved difficult to automatically, or even manually, make sure that land use (multi)polygons from existing OSM data and the new data to be imported do not conflict with each other when both are represented as vector outlines. An even more complex question is how to decide what to do with two conflicting ways. Should one of them be deleted? Replaced with the other? Merged? Should a common border be created between them?
A simpler approach was developed to address conflicts at the stage when the import data can be easily masked, i.e. when it is still represented by raster pixels. The idea behind this approach is that we can generate a second raster image of identical size and resolution for the country. The source for this raster mask image is the existing OSM land cover information. For example, a vector way for an already mapped forest is turned into a group of non-zero pixels. The vectorizing software then uses this mask to prevent new vector ways from being created from the import data raster. It looks as if no data for those areas were available. As a result, vectors generated from the masked raster never enter “forbidden” areas where previously mapped OSM data is known to be present.
By restricting new data to be created only for not yet mapped areas, we reduce the problem of finding intersections between multipolygons to the problem of aligning borders between new and old polygons.
As the import data is masked at the very first stage, when it is in raster form, it is expected that areas “touching” (sharing a common border with) pre-mapped land cover data will require careful examination and merging of individual way borders. All cases of overlapping identical land uses should be fixed.
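A rough sketch of this masking step with the GDAL Python bindings could look as follows; the file and layer names are made up, and 0 is assumed to be the “unclassified” pixel value:

    # Sketch: burn already mapped OSM land cover into a mask and blank out the
    # corresponding pixels of the import raster before vectorization.
    # File names and the "0 = unclassified" convention are assumptions.
    import numpy as np
    from osgeo import gdal, ogr

    src = gdal.Open("nmd_tile.tif")

    # Mask raster with the same size, geotransform and projection as the import tile.
    mask_ds = gdal.GetDriverByName("MEM").Create("", src.RasterXSize, src.RasterYSize, 1, gdal.GDT_Byte)
    mask_ds.SetGeoTransform(src.GetGeoTransform())
    mask_ds.SetProjection(src.GetProjection())

    osm = ogr.Open("osm_landcover.shp")                      # existing OSM land cover as polygons
    gdal.RasterizeLayer(mask_ds, [1], osm.GetLayer(), burn_values=[1])

    # Copy the import tile and zero out everything that is already mapped.
    out = gdal.GetDriverByName("GTiff").CreateCopy("nmd_tile_masked.tif", src)
    data = out.GetRasterBand(1).ReadAsArray()
    data[mask_ds.GetRasterBand(1).ReadAsArray() == 1] = 0
    out.GetRasterBand(1).WriteArray(data)
    out.FlushCache()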
A major issue that everyone is wary of is that new features generated from the import data will be incorrectly tagged. The issue here is that the OSM tagging approach does not encourage using a fixed, predetermined number of data classes for land cover, while the majority of raster sources by definition provide a limited number of pixel values and associated land use classes. Mapping the former to the latter is considered by some to be the most unreliable task.
Sure, there were situations when a misclassification of a feature was found during cross-inspection of new data, old data and aerial imagery. But they were really rare compared to the numerous other problems to deal with. The most common (but still rare) situation of this class was the wrong marking of “bushes” under a power line as “forest”.
The real problem was the need to adjust the tag correspondence map from input raster values to resulting OSM tags. That is, in different regions of the country the same pixel value might correspond to different OSM tags.
The most problematic class of land cover to tag correctly turned out to be “grass”. The same original raster pixel value may correspond to different concepts in OSM, ranging from a golf course, through cultivated grass, to wild or cultivated meadow and even heathland. Because of that, manual inspection of all areas tagged as “grass” was constantly needed.
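In practice this meant keeping per-region correspondence maps. The sketch below is purely illustrative; the pixel values and tag choices are invented and do not reflect the actual class codes of the source dataset:

    # Hypothetical per-region tag correspondence maps; values are illustrative only.
    TAGS_LOWLAND = {
        42: {"natural": "water"},
        61: {"natural": "wetland"},
        111: {"landuse": "forest"},
        118: {"landuse": "meadow"},   # the troublesome "grass" class, needs manual review
    }

    TAGS_MOUNTAIN = dict(TAGS_LOWLAND)
    TAGS_MOUNTAIN[118] = {"natural": "heath"}  # same pixel value, different tags in the highlands

    def tags_for(pixel_value, region_map):
        """Return the OSM tags for a raster class, or None if the class is skipped."""
        return region_map.get(pixel_value)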
Often more nodes than a human would place end up on a way. The original data may have a node every 10 meters; additionally, the Chaiken filter used to smooth 90-degree corners in the vector data can create as many nodes again. See an example:
A manual solution is to delete undesired nodes and/or use the Simplify Way tool to do so.
An automatic solution is to apply a Douglas-Peucker filter to the ways of the import file. The issue is to find the best threshold values for the simplification algorithm. Excessively aggressive automatic removal of nodes leads to losing important details of certain polygons. Typically it can be expected that up to 50% of the import data set’s nodes can be removed without losing much in quality of details.
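For illustration, a Douglas-Peucker pass as implemented in shapely looks like the sketch below; the tolerance value is exactly the kind of threshold that has to be tuned per dataset:

    # Sketch: Douglas-Peucker simplification of a polygon with shapely.
    # The coordinates and the tolerance are illustrative, in degrees.
    from shapely.geometry import Polygon

    noisy = Polygon([(0, 0), (0.00001, 0.000005), (0.00002, 0), (0.00002, 0.0001),
                     (0.00001, 0.0001), (0, 0.0001)])
    simplified = noisy.simplify(0.00005, preserve_topology=True)
    print(len(noisy.exterior.coords), "->", len(simplified.exterior.coords))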
It seems that an extra pass with v.generalize douglas threshold = 0.00005 does a good enough job without chewing away too many details. It does, however, chew some important details, especially of bigger polygons (such as those at a tile border), and it also fails to clean up segments that are shared between several ways.
Because of this last issue, manual phases of detecting and smoothing remaining 90-degree angles and close pairs of nodes that are strictly horizontal or vertical had to be implemented to clean up suspicious geometry patterns left after the vectorization, smoothing and simplifying phases.
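One of these detection passes can be as simple as the sketch below, which flags nodes where two consecutive segments of a way meet at (almost exactly) 90 degrees; the tolerance is a guess to be tuned:

    # Sketch: flag suspicious right-angle corners in a way given as a list of (x, y) nodes.
    import math

    def right_angle_nodes(coords, tolerance_deg=2.0):
        flagged = []
        for i in range(1, len(coords) - 1):
            ax, ay = coords[i - 1]
            bx, by = coords[i]
            cx, cy = coords[i + 1]
            v1 = (ax - bx, ay - by)
            v2 = (cx - bx, cy - by)
            n1 = math.hypot(*v1)
            n2 = math.hypot(*v2)
            if n1 == 0 or n2 == 0:
                continue
            cos_a = (v1[0] * v2[0] + v1[1] * v2[1]) / (n1 * n2)
            angle = math.degrees(math.acos(max(-1.0, min(1.0, cos_a))))
            if abs(angle - 90.0) < tolerance_deg:
                flagged.append(i)     # candidate for manual smoothing
        return flagged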
Pros of machine-traced land cover.
Pros of human-traced results.
Tag choice is more consistent with reality. It is a well-known problem in remote land cover sensing that no 100% correct matching can be achieved. Again, this is in part because not everything can be tagged with a limited number of classes and corresponding tag combinations. Humans tend to be more conservative in this regard and provide more reliable results. If a person cannot tell from looking at the image what sort of land cover should be assigned to a polygon, he is likely to skip it altogether or at least express his doubts in a comment to the tag set. The machine is rarely tasked with recording its confidence level for the chosen classification.
Humans tend to choose vector resolution (distance between nodes) dynamically based on the current context. For example, a forest around a long straight highway tends to be mapped with few nodes along that road. More nodes are needed when a forest border is more “open” and does not contact anything else.
Machine algorithms currently do not take the context into account and basically use a fixed resolution coming from the underlying raster image. Different smoothing or simplification algorithms do not change this much, as they only take into account the current curve itself, not adjacent data. Because of that, both situations are possible with machine-processed data: a) a lot of extra nodes along a straight line that could have been mapped with just a few of them; b) lost important nodes where a line takes a sudden turn.
The same applies to decisions about which features to keep. It was common to get many small patches of grass along highways facing large forests from machine-traced data. A human would have ignored these patches and drawn a single line between the forest and the highway.
Originally posted here: https://atakua.org/w/landuse-conflation.html . Reposted here as it might be easier for some people to find it in the diaries.
Land cover geographic data is what is mostly represented as landuse=* in
the OSM database. Other tagging schemes e.g. landcover=* also exist.
During the ongoing land cover import for Sweden  I learned several things that
were not documented anywhere in the OSM wiki or elsewhere, as far as I know.
Below are my observations on what pitfalls and issues arise during all stages of
the import, from data conversion to conflation and tag choice.
Imports of zero-dimensional (separate points of interest) and linear (roads) objects are regularly done in OSM. Some documentation and tools to assist with such imports exist.
Compared to them, importing polygons and multipolygons has unique challenges.
Please make sure you read the OSM import guidelines before you start working on an import. Then document as much of the process as possible for your own and others’ reference, as it is likely to take more time to finish than you originally planned.
Learn your tools and write new tools. A lot of data processing is done in JOSM, including final touch-ups and often the uploading and conflict resolution. Learning the most effective ways to accomplish everyday tasks helps to speed the process up. Learn the editor’s shortcut key combinations, or change them to match your taste. Some good key combinations helping to work with multiple data layers that I did not know before starting the import are mentioned in .
Programming skills are also a must for at least one person in your group who
develops an automated data conversion pipeline or adjusts existing tools for
the purpose. Always look for ways to automate tedious work. And always look
for ways to improve your data processing pipeline, as you’ll learn new patterns
in your new or existing OSM data.
The source data for your import may be in either raster or vector form. If you have a choice, prefer importing vector data, as it saves you one step in data processing and avoids troubles related to raster-to-vector conversion.
It goes without saying that the source of the import should be up to date. Cross-checking it against available aerial imagery throughout the import process should help with judging how current the offered data is.
Take notice of what classes of features are present in the import data and how well they can be translated into the OSM tagging scheme. You may well want to throw away, reclassify or merge several feature classes present in the import data.
Even after you have started the import, continuously estimate and cross-check that the classification is consistently applied in the source, that it is correctly translated to the OSM tagging schema, and how often misclassification mistakes requiring manual correction occur.
As an example, an area classified as “grass” in a source dataset may in fact represent a golf course, a park, farmland, a heath etc. in different parts of a country. These are tagged differently in OSM and, if possible, this difference should be preserved.
Check regularly for misclassification of features, especially when you switch between areas with largely different biotopes. Tag choices for tundra are likely to differ from those used for more southern areas and will require adjustments to the tag correspondence mapping used in your data processing routines.
Be aware that there are also unsolved tagging issues, e.g. natural=wood versus landuse=forest, that might affect your decisions on tag choice. Very good points on the complexity of properly tagging land cover have been presented by others.
For linear objects it is important to adequately reflect the shape and position of the actual natural features they represent.
Data resolution is also important, as certain types of objects are worth importing only if they are represented with a fine enough resolution. Conversely, objects with too many details will increase the amount of raw data to process without giving any benefit to the end result.
It is easy to tell the resolution of raster data, as it is defined by the pixel size. For vector data, estimating how well newly added polygons align with existing ones can be used as a rough indication of data resolution.
Consider the following example. For a forest, having its details drawn on a map in the range of 1 to 10 meters should be just enough for practical uses. A forest with a unit resolution of 100 meters is of less use for e.g. pedestrians. But mapping a forest with a resolution of 10 centimeters basically means outlining every tree, which is of little practical use for larger territories.
Similarly, trying to create an outline of regular buildings from data with a resolution of one meter or worse will not capture their true shape. Data with 10 cm details can be used to correctly detect all 90-degree angles of buildings.
Most likely the import data will not be in a format directly acceptable by the OSM database, that is, OSM XML or equivalent binary formats. Additional processing steps will be needed to load the data into one or more third-party or custom-written tools.
Several freely available GIS applications, libraries and frameworks can help you with data processing: QGIS, GRASS GIS, GDAL, OGR etc. However, knowing programming is a must at this stage, as it is often simpler to write a small converter in Python (or a comparable scripting language) than to try to do the same work in a GUI tool. Moreover, many steps have to be applied to many files, which is also best automated through scripting.
Raster data is often available in the GeoTIFF format, which is a TIFF image with additional metadata describing the coordinate system, bounding box, meaning of pixel values etc. Vector data comes in many forms, from simple CSV files to ESRI shapefiles, XML-based GML and JSON-based GeoJSON files, or it may even be stored in a geospatial database.
Once data is converted into the OSM XML format, a few tools are available to process it as well, such as the command-line tools osmconvert and osmfilter, and tools and plug-ins for the main OSM editor JOSM.
The whole process of importing can be described as repeated addition, modification or deletion of features, in the land cover case represented by individual units of forests, farmlands, residential areas etc. Such features have to be extracted from the source data and then inserted into the OSM database. Many decisions have to be made to ensure that enough useful information is extracted and not much noise is introduced at the same time, so that a new feature won’t create more trouble than the good it brings.
The following tasks have to be solved for every import feature considered for manipulation.
Tracing vector boundaries of a feature. For vector data, the boundaries should already be in vector format. For raster data, individual pixels with the same classification have to be grouped into bigger vector outlines of (multi)polygons. Certain tools exist that can assist with this.
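As a minimal illustration of such a tool, the GDAL Python bindings can group identically classified pixels into polygons; the file names below are hypothetical:

    # Sketch: trace raster classes into polygons with gdal.Polygonize.
    from osgeo import gdal, ogr, osr

    src = gdal.Open("landcover_tile.tif")
    band = src.GetRasterBand(1)

    drv = ogr.GetDriverByName("ESRI Shapefile")
    dst = drv.CreateDataSource("landcover_tile.shp")
    srs = osr.SpatialReference(wkt=src.GetProjection())
    layer = dst.CreateLayer("landcover", srs=srs)
    layer.CreateField(ogr.FieldDefn("class", ogr.OFTInteger))

    # Each resulting polygon gets the original pixel value in its "class" field.
    gdal.Polygonize(band, None, layer, 0)
    dst = None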
Assigning correct tags to features. The tagging scheme of OSM has unique properties, and deciding what tags a (multi)polygon should have is very important. At the very least, the tagging of new features should match what has already been used for tagging objects in the same area.
Assuring correct mutual boundaries between old and new features. The OSM project supports only a single data layer, and everything has to be nicely organized in that only layer. Although different types of land cover may overlap in reality, it is not the common case. Often no sharp border can be defined between two natural or artificial areas either, but maps usually simplify this to a single border. Certain types of overlaps are definitely considered erroneous, e.g. two forests overlapping by a large part, or a forest sliding into a lake. Note that this task is affected by how accurately the boundaries of pre-mapped features were specified. Sometimes it is feasible to delete old objects and replace them with new ones, provided that there is enough evidence that the new features do not lose any information present in the old objects. See further discussion on the subject below.
Assuring correct borders between adjacent imported data pieces. As a dataset for land cover is rarely imported in a single go for the whole planet, it is bound to be split into more or less arbitrarily sized and shaped chunks. The data itself does not necessarily dictate on what principle such splitting is to be made. Borders for these chunks may be chosen based on an administrative principle (import by country, municipality, city, region etc.) and/or by data size limitations (rectangular tiles several kilometers wide etc.). Regardless of the chosen strategy, artificial borders will be imposed upon the data; e.g. one can split what in reality is a single farmland into several parts. It is often important to hide such seams in the end result by carefully “sewing” the features back together. In certain cases of bugs in the splitting process, new data pieces may even start overlapping with adjacent pieces, which only adds extra manual work without any value.
Finding balance between data density and usefulness. Even if the import data resolution looks optimal, it is often worth further filtering, smoothing, removing small details or otherwise pre- and postprocessing the resulting features. Let us consider a few examples. a) For the case of raster data, it is worth removing lone “forest” pixels marking individual trees standing in a farmland or in a residential area. b) Rasterization noise is always an issue to deal with: imported data should not look “pixelated”, with suspicious-looking 90-degree corners where there should be none. c) Lastly, many nodes lying on a straight line can be safely removed without losing accuracy of the vector data while reducing its size. A lot of filters exist for both simplification and smoothing, but most of them require some experimentation to find the optimal parameters to run them with. Doing too aggressive filtering can destroy essential parts of the data.
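For example a) above, GDAL’s sieve filter can merge tiny isolated patches into their neighbours before vectorization; this is only a sketch, and the threshold of 4 pixels is a guess to experiment with:

    # Sketch: remove lone pixels / tiny patches from the classified raster in place.
    from osgeo import gdal

    ds = gdal.Open("landcover_tile.tif", gdal.GA_Update)   # hypothetical file name
    band = ds.GetRasterBand(1)

    # Patches smaller than 4 pixels are merged into their largest neighbour
    # (8-connectedness); both values need tuning for the actual dataset.
    gdal.SieveFilter(band, None, band, 4, 8)
    ds.FlushCache()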
Keeping feature size under control. Artificial splitting of the import data surprisingly has a positive side effect of keeping the size and area of natural features in check. An automatically traced forest can turn into a multipolygon that spans many tens of thousands of nodes and hundreds of inner ways. In practice, having several smaller and more simply organized adjacent polygons covering the same area is better. Other means of keeping feature size in check can be used as well. For example, roads crossing forests can effectively cut them into smaller parts that are then represented as more contained features.
Conflation is merging data from two or more sources into a single consistent representation. For us, it means merging two layers of vector data, the “new” layer with import features and the “old” layer with existing features, into a single layer to be then uploaded to the main OSM database.
Let us assume that both the “old” data already present in the OSM and “new” data
to be imported are self-consistent: no overlapping happens, no broken
polygons are present etc. Always make sure that it is true for both layers before
you start merging them, and fix discovered problems early.
When these data layers are self-consistent, new inconsistencies can only arise
from interaction of old and new features. For the land cover case, it is the
(multi)polygons intersecting and overlapping each other in all possible ways.
Start thinking about how you are going to address these problems early in the import process. Solving them efficiently is a critical component of a successful import.
Algorithmically, the problems of finding the exact shape of an overlap, the points of intersection, the common borders etc. of two or several multipolygons are far from trivial. Solutions to such problems tend to be computationally complex, meaning that applying them to a huge number of features with many nodes each may take unreasonable time to finish.
Whenever possible, the conflation task should be simplified. Compromises between speed, accuracy and data loss/simplification have to be made. Improve your conflation algorithms as you progress with the early data chunks and learn the issues arising in them. If you see that the same tasks arise over and over and take a lot of human time to resolve manually, integrate a solution for them into your pipeline.
For situations when old and new features overlap, intersect or otherwise happen to be in conflict, define a consistent decision-making strategy on which modifications to the conflicting features will be applied. For example, one should decide in which situations old or new nodes are to be removed, moved or added, whether conflicting features are to be merged, or whether there are conditions under which some of them are thrown away.
Be on the lookout for common patterns in the data that can be easily solved by a computer. More complex cases can be marked by the computer for manual resolution. Do not leave too much work for humans, however. Humans are bad at tedious work and will quickly start making mistakes. Everything that is reasonable to do by machine for conflation should be done by machine.
It is easiest to solve conflicts when no conflicts can arise: if no features can overlap, they cannot conflict. In this sense, undershoot in the data is better than overshoot, but again, make an informed decision that applies to your import best.
Consider that making sure that features’ borders are aligned is easier than deciding what to do with two arbitrarily overlapping polygons. This means that ensuring no two old/new feature pairs overlap helps greatly with conflation. Simply put, importing features only for areas where there is “white space” on the map is easier than importing features into already tightly mapped areas. You might clean up the space first by removing old features (according to some strategy), or just leave those areas alone.
Decide how to verify that conflation is correct. The JOSM validator helps with detecting when e.g. two ways with identical “landuse” tags overlap. But it does not check for everything, and certain cases that are obviously wrong to a human are skipped by the validator. Use it and other similar tools, as well as visual inspection of the result, to make sure that no (obvious) errors have slipped into the end result.
Let us consider different strategies for approaching the data merging task.
Conflating raster data against raster data is arguably the simplest case, as the task can be reduced to comparing values of individual pixels: two pixels either fully overlap or do not overlap at all (provided that the two rasters are brought to the same resolution and extent). There is no problem with detecting intersections of (multi)polygons. Algorithmic complexity is proportional to the area of the raster map in pixels.
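Reduced to arrays, the whole conflation decision becomes a one-liner, as in this sketch with made-up class codes:

    # Sketch: per-pixel conflation of two aligned rasters with numpy.
    import numpy as np

    import_classes = np.array([[3, 3, 0],
                               [3, 2, 2],
                               [0, 2, 2]])          # land cover classes from the import
    osm_mask = np.array([[0, 1, 0],
                         [0, 1, 1],
                         [0, 0, 0]])                # 1 = already mapped in OSM

    # Keep import data only where nothing is mapped yet; 0 means "no data".
    conflated = np.where(osm_mask == 0, import_classes, 0)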
Existing OSM data is vector, not raster. You can, however, rasterize it into a matrix of values that can then be processed together with the import raster data layer. A decision about every pixel of import data can then be made based on two data sources: the old data for the pixel and the new data from the import. For example, pixels for which there already is some data in the OSM database can be made “invisible” to the vectorization process, so that tracing will not create any vector features passing through such pixels. This effectively makes sure that no “deep” overlapping between old and new features can happen. However, due to data loss in the rasterization process, the borders of old and new features may and will still intersect somewhat along their common borders. Thus, the task of making sure two features do not overlap is reduced to the task of making sure two features have a common border.
This approach was used with the Sweden land cover import .
Issues discovered so far and solutions for them.
Make sure that the raster datasets being compared are completely identical in their extent (bounding box position and dimensions), resolution and coordinate projection systems. Tools that work with geographic data in raster formats typically expect all layers to have exactly the same dimensions, no relative shifting, no variation in projections etc. Not all of them report proper errors when these conditions do not hold. An incorrectly skewed mask layer creates holes for phantom features in the resulting vector layer, while simultaneously leaving a lot of overlaps for features that should not have been generated at all.
There are no known tools working directly with OSM-format data that allow for conflict resolution against another set of vector features. One can, however, import existing OSM data into one of the GIS applications together with the import vector data, do the processing there, and then export the modified import vector features to an OSM file. More exploration is needed in this direction to see how efficient and accurate it can be, e.g. using vector overlaying.
You might consider some old vector features for removal from the OSM dataset if new features located at the same position have better spatial resolution, quality or tagging. For example, data coming from older imports may be considered for replacement by newer, more detailed import data. Be careful, however, with editing OSM XML extract files in your scripts, as simply deleting (or marking for deletion) nodes, ways or relations may leave dangling references from other objects somewhere outside the current map extent. Typically, automatic removal of old features is best avoided; the final decision should be made manually.
Deleting objects is easier in new data layers, as they have not yet been “observed” in the global database and no external references to them could have been created yet, so you can just drop unnecessary objects from your files. Land cover data is almost always redundant to some extent, so instead of trying to solve a complex conflicting overlap or intersection, it is often easier to remove the features altogether and replace them with a manually drawn configuration.
If your old and new vector features can overlap in an arbitrary manner, you should explore the known algorithms for merging, subtracting and other operations on (multi)polygons. Be sure to measure their performance, however, as they may be too slow for bigger datasets. Simpler and faster spatial strategies to detect common cases of (non)interaction between features should be used cleverly. For example, before calculating the intersection of two polygons, check whether their bounding boxes overlap. If they do not, there is no chance of an intersection either, and there is no need to run a complex algorithm to discover that. There are many spatial indexes invented to aid with the general task of telling whether two objects are “close” to each other.
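The bounding-box pre-check mentioned above takes only a few lines; this sketch uses shapely, but the same idea applies to any polygon representation:

    # Sketch: cheap bounding-box rejection before an expensive intersection test.
    from shapely.geometry import Polygon

    old = Polygon([(0, 0), (0, 10), (10, 10), (10, 0)])
    new = Polygon([(20, 20), (20, 25), (25, 25), (25, 20)])

    def bboxes_overlap(a, b):
        ax0, ay0, ax1, ay1 = a.bounds
        bx0, by0, bx1, by1 = b.bounds
        return ax0 <= bx1 and bx0 <= ax1 and ay0 <= by1 and by0 <= ay1

    if bboxes_overlap(old, new):
        print(old.intersection(new))
    else:
        print("bounding boxes are disjoint, skipping the expensive test")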
If there is a guarantee that features may only intersect in a tight area along their common borders, one can try snapping nodes to lines in an attempt to unify the border between the features. This is not universal, however, as there will always be cases when manual post-editing is needed. The following JOSM plug-in was developed to assist with a kind of node snapping in JOSM; it still relies on manual adjustment for complex cases.
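Outside of JOSM, a similar snapping step can be sketched with shapely; the tolerance is in map units and, as above, a value that has to be tuned:

    # Sketch: snap the vertices of a new polygon onto a nearby old one.
    from shapely.geometry import Polygon
    from shapely.ops import snap

    old = Polygon([(0, 0), (0, 10), (10, 10), (10, 0)])
    new = Polygon([(10.3, 0.1), (10.2, 9.9), (20, 10), (20, 0)])

    snapped_new = snap(new, old, tolerance=0.5)   # vertices within 0.5 units jump onto "old"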
There are a lot of common issues specific to land cover data and its conflation. Pictures should definitely help with explaining them. I plan to talk about these in more detail later in a separate post.