Originally posted here: https://atakua.org/w/landuse-conflation.html. Reposted here as it might be easier for some people to find it in the diaries.
Land cover geographic data is mostly represented as landuse=* in the OSM database; other tagging schemes, e.g. landcover=*, also exist.
During the ongoing land cover import for Sweden, I learned several things that were not documented anywhere in the OSM wiki or elsewhere, as far as I know.
Below are my observations on what pitfalls and issues arise during all stages of
the import, from data conversion to conflation and tag choice.
Imports of zero-dimensional (individual points of interest) and linear (roads) objects are regularly done in OSM, and some documentation and tools exist to assist with such imports.
Compared to them, importing polygons and multipolygons poses unique challenges.
Please make sure you read the OSM import guidelines before you start working on an import. Then document as much of the process as possible for your own and others' reference, as it is likely to take more time to finish than you originally expected.
Learn your tools and write new ones. A lot of data processing is done in JOSM, including final touch-ups and often the uploading and conflict resolution, so learning the most effective ways to accomplish everyday tasks helps to speed the process up. Learn the editor’s shortcut key combinations, or change them to match your taste.
There are some good key combinations for working with multiple data layers that I did not know before starting the import.
Programming skills are also a must for at least one person in your group who
develops an automated data conversion pipeline or adjusts existing tools for
the purpose. Always look for ways to automate tedious work. And always look
for ways to improve your data processing pipeline, as you’ll learn new patterns
in your new or existing OSM data.
The source data for your import may be either in raster or vector form.
If you have a choice, prefer vector data, as it saves you one step in data processing and avoids the troubles related to raster-to-vector conversion.
It goes without saying that the source of the import should be up to date. Cross-checking it against available aerial imagery throughout the import process helps with judging how recent the offered data is.
Take notice of what classes of features are present in the import data and how well they can be translated into the OSM tagging scheme. You may well want to throw away, reclassify or merge several of the feature classes present in the import data.
Even after you have started the import, continuously check that the classification is consistently applied in the source and correctly translated to the OSM tagging schema, and estimate how often misclassification mistakes requiring manual correction occur.
As an example, an area classified as “grass” in a source dataset may in fact represent a golf course, a park, farmland, a heath etc. in different parts of a country. These are tagged differently in OSM, and, if possible, this difference should be preserved.
Check regularly for misclassification of features, especially when you switch between areas with largely different biotopes. The choice of tags for tundra is likely to differ from those used for more southern areas, and will require adjustments to the tag correspondence mapping used in your data processing routines.
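In practice it helps to keep this correspondence in a single mapping table that the conversion scripts consult, so that per-region adjustments stay in one place. A minimal sketch in Python, where the source class names, the chosen OSM tags and the regional override are purely illustrative and not taken from any particular dataset:

```python
# Illustrative class-to-tag correspondence table; the source class names and
# the chosen OSM tags are assumptions, not the actual import classification.
TAG_MAP = {
    "coniferous_forest": {"natural": "wood", "leaf_type": "needleleaved"},
    "deciduous_forest":  {"natural": "wood", "leaf_type": "broadleaved"},
    "arable_land":       {"landuse": "farmland"},
    "open_wetland":      {"natural": "wetland"},
    "urban_fabric":      {"landuse": "residential"},
}

def tags_for(source_class, region=None):
    """Return OSM tags for a source class, with a hypothetical per-region
    override, e.g. open land in alpine areas tagged as heath."""
    if region == "alpine" and source_class == "open_land":
        return {"natural": "heath"}
    return TAG_MAP.get(source_class)  # None means: review the class manually
```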
Be aware that there are also unsolved issues around tagging, e.g. natural=wood versus landuse=forest, that might affect your decisions on tag choice. Very good points on the complexity of properly tagging land cover have been made elsewhere.
For linear objects, it is important to adequately reflect the shape and position of the actual natural features they represent.
Data resolution is also important, as certain types of objects are worth importing only if they are represented with fine enough resolution. Conversely, objects with too much detail increase the amount of raw data to process without any benefit to the end result.
It is easy to tell the resolution of raster data, as it is defined by the pixel size. For vector data, how well newly added polygons align with existing ones can be used as a rough indication of data resolution.
Consider the following example. For a forest, having its details drawn on the map with a precision in the range of 1 to 10 meters should be just enough for practical uses. A forest with a unit resolution of 100 meters is of less use for e.g. pedestrians. But mapping a forest with a resolution of 10 centimeters basically means outlining every tree, which is of little practical use for larger territories.
Similarly, trying to outline regular buildings from data with a resolution of one meter or worse will not capture their true shape, whereas data with 10 cm details can be used to correctly detect all 90-degree corners of buildings.
Most likely the import data will not be in a format directly accepted by the OSM database, that is, OSM XML or an equivalent binary format. Additional processing steps will be needed to load the data into one or more third-party or custom-written tools.
Several freely available GIS applications, libraries and frameworks can help with data processing: QGIS, GRASS GIS, GDAL, OGR etc. However, programming knowledge is a must at this stage, as it is often simpler to write a small converter in Python (or a comparable scripting language) than to try to do the same work in a GUI tool. Moreover, many steps have to be applied to many files, which is best automated through scripting.
Raster data is often available in GeoTIFF format, which is a TIFF image with additional metadata describing the coordinate system, bounding box, meaning of pixel values etc.
Vector data can come in many forms, from simple CSV files to ESRI shapefiles, XML-based GML, JSON-based GeoJSON files, or even stored in a geospatial database.
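As an illustration of the kind of scripting involved, the georeferencing metadata of a source GeoTIFF can be inspected with the GDAL Python bindings before any conversion is attempted. A minimal sketch with a hypothetical file name:

```python
# A minimal sketch of reading GeoTIFF georeferencing with the GDAL Python
# bindings; "landcover.tif" is a hypothetical file name.
from osgeo import gdal

ds = gdal.Open("landcover.tif")
gt = ds.GetGeoTransform()               # origin and pixel size
xmin, ymax = gt[0], gt[3]
xmax = xmin + gt[1] * ds.RasterXSize
ymin = ymax + gt[5] * ds.RasterYSize    # gt[5] is negative for north-up rasters

print("projection:", ds.GetProjection())
print("pixel size:", gt[1], gt[5])
print("bounding box:", (xmin, ymin, xmax, ymax))
```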
Once data is converted into the OSM XML format, a few tools are available to process it as well, such as the command-line tools osmconvert and osmfilter, and tools and plug-ins for the main OSM editor JOSM.
The whole process of importing can be described as repeated addition, modification or deletion of features, in the land cover case represented by individual units of forests, farmlands, residential areas etc. Such features have to be extracted from the source data and then inserted into the OSM database. Many decisions have to be made to make sure that enough useful information is extracted while not much noise is introduced at the same time, so that new features do not create more trouble than the good they bring.
The following tasks have to be solved for every import feature considered for manipulation.
Tracing vector boundaries of a feature. For vector data, the boundaries should already be in vector format. For raster data, individual pixels with the same classification have to be grouped into bigger vector outlines of (multi)polygons. Certain tools exist that can assist with solving this.
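For instance, GDAL's polygonize function groups adjacent pixels of the same value into polygons. A minimal sketch with the GDAL/OGR Python bindings, with hypothetical file and layer names (the actual pipeline may use different tools):

```python
# A minimal sketch of tracing raster classes into polygons with gdal.Polygonize;
# file and layer names are hypothetical.
from osgeo import gdal, ogr, osr

src = gdal.Open("landcover.tif")
band = src.GetRasterBand(1)

drv = ogr.GetDriverByName("ESRI Shapefile")
dst = drv.CreateDataSource("landcover_polygons.shp")
srs = osr.SpatialReference(wkt=src.GetProjection())
layer = dst.CreateLayer("landcover", srs=srs)
layer.CreateField(ogr.FieldDefn("class", ogr.OFTInteger))

# Group adjacent pixels with the same value into polygons; the pixel value
# is written into the "class" attribute (field index 0).
gdal.Polygonize(band, band.GetMaskBand(), layer, 0, [], callback=None)
dst = None  # flush the shapefile to disk
```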
Assigning correct tags to features. The tagging scheme of OSM has unique properties, and deciding what tags a (multi)polygon should have is very important. At the very least, the tagging for new features should match what has already been used for tagging objects in the same area.
Assuring correct mutual boundaries between old and new features. The OSM project supports only a single data layer, and everything has to be nicely organized in that single layer. Although different types of land cover may overlap in reality, it is not the common case. Often no sharp border can be defined between two natural or artificial areas either, but maps usually simplify this by representing the transition as a single border.
Certain types of overlaps are definitely considered erroneous, e.g. two forests overlapping by a large part, or a forest sliding into a lake. Note that this task is affected by how accurately boundaries were specified for pre-mapped features. Sometimes it is feasible to delete old objects and replace them with new ones, provided that there is enough evidence that the new features do not lose any information present in the old objects.
See further discussion on the subject below.
Assuring correct borders between adjacent imported data pieces. As a dataset for land cover is rarely imported in a single go for the whole planet, it is bound to be split into more or less arbitrarily sized and shaped chunks. The data itself does not necessarily dictate on what principle such splitting should be done. Borders for these chunks may be chosen on an administrative principle (import by country, municipality, city, region etc.) and/or by data size limitations (rectangular tiles several kilometers wide etc.). Regardless of the chosen strategy, artificial borders will be imposed upon the data: e.g. what in reality is a single farmland may be split into several parts.
It is often important to hide such seams in the end result by carefully “sewing” the features back together. In certain cases of bugs in the splitting process, new data pieces may even start overlapping with adjacent pieces, which only adds extra manual work without any value.
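To make the seam problem concrete, here is a minimal sketch of cutting features into a rectangular tile grid with Shapely; the grid parameters are arbitrary assumptions. Every feature crossing a tile border is clipped, and exactly these cuts later have to be sewn back together:

```python
# A minimal sketch of splitting features into rectangular tiles with Shapely;
# the grid parameters are illustrative. Features crossing a tile border are
# clipped, producing the seams discussed above.
from shapely.geometry import box

def split_into_tiles(features, xmin, ymin, xmax, ymax, tile_size):
    tiles = {}
    x = xmin
    while x < xmax:
        y = ymin
        while y < ymax:
            cell = box(x, y, x + tile_size, y + tile_size)
            clipped = [f.intersection(cell) for f in features if f.intersects(cell)]
            if clipped:
                tiles[(x, y)] = clipped
            y += tile_size
        x += tile_size
    return tiles
```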
Finding balance between data density and usefulness.
Even if the import data resolution looks optimal, it is often worth filtering, smoothing, removing small details or otherwise pre- and postprocessing the resulting features further. Let us consider a few examples.
a) For the case of raster data, it is worth removing lone “forest” pixels marking individual trees standing in a farmland or in a residential area.
b) Rasterization noise is always an issue to deal with: imported data should not look “pixelated”, with suspicious-looking 90-degree corners where there should be none. c) Lastly, many nodes lying on a straight line can be safely removed, reducing the size of the vector data without losing accuracy.
A lot of filters exist for both simplification and smoothing, but most of them require some experimentation to find optimal parameters. Too aggressive filtering can destroy essential parts of the data.
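As a small illustration of item c), Shapely's Douglas-Peucker simplification removes nodes that deviate from a straight line by less than a tolerance; the tolerance value below is an arbitrary assumption and has to be tuned per dataset:

```python
# A minimal sketch of node-count reduction with Shapely's simplify()
# (Douglas-Peucker); the tolerance is an assumption to be tuned per dataset.
from shapely.geometry import Polygon

raw = Polygon([(0, 0), (0.01, 5), (0, 10), (10, 10), (10, 0)])
simplified = raw.simplify(0.05, preserve_topology=True)
print(len(raw.exterior.coords), "->", len(simplified.exterior.coords))  # 6 -> 5
```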
Keeping feature size under control. Artificial splitting of import data surprisingly has its own positive effect of keeping the size and area of natural features in check. An automatically traced forest can turn into a multipolygon that spans many tens of thousands of nodes and hundreds of inner ways. In practice, having several smaller and more simply organized adjacent polygons covering the same area is better. Other means to keep feature size in check can be used as well: for example, roads crossing forests can effectively cut them into smaller parts that are then represented as more contained features.
Conflation is merging data from two or more sources into a single consistent representation. For us, it means merging two layers of vector data, a “new” layer with import features and an “old” layer with existing features, into a single layer that is then uploaded to the main OSM database.
Let us assume that both the “old” data already present in OSM and the “new” data to be imported are self-consistent: no overlapping happens, no broken polygons are present etc. Always make sure that this is true for both layers before you start merging them, and fix discovered problems early.
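One possible sanity check, sketched here with Shapely, is to flag invalid geometries and pairwise overlaps inside a single layer before conflation starts; the quadratic overlap loop is only suitable for small layers, larger ones need a spatial index (see below):

```python
# A minimal sketch of a pre-merge consistency check for one layer of Shapely
# geometries; suitable only for small layers because of the quadratic loop.
from shapely.validation import explain_validity

def check_layer(features):
    problems = []
    for i, geom in enumerate(features):
        if not geom.is_valid:
            problems.append((i, explain_validity(geom)))
    for i, a in enumerate(features):
        for j, b in enumerate(features[i + 1:], start=i + 1):
            if a.overlaps(b):
                problems.append((i, j, "overlap"))
    return problems
```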
When both data layers are self-consistent, new inconsistencies can only arise from the interaction of old and new features. For the land cover case, these are (multi)polygons intersecting and overlapping each other in all possible ways.
Start thinking about how you are going to address these problems early in the import process. Solving them efficiently is a critical component of a successful import.
Algorithmically, finding the exact shape of an overlap, the points of intersection, the common borders etc. of two or more multipolygons is far from trivial. Solutions to such problems tend to be computationally complex, meaning that applying them to a huge number of features with many nodes in each may take an unreasonable time to finish.
Whenever possible, the conflation task should be simplified. Compromises between speed, accuracy and data loss/simplification have to be made. Improve your conflation algorithms as you progress with the early data chunks and learn about the issues arising in them. If you see that the same tasks arise over and over and take a lot of human time to resolve manually, integrate a solution for them into your pipeline.
For situations when old and new features overlap, intersect or otherwise happen to be in conflict, define a consistent decision-making strategy for which modifications will be applied to the conflicting features. For example, one should decide in which situations old or new nodes are to be removed, moved or added, whether conflicting features are to be merged, or whether there are conditions for some of them to be thrown away.
Be on the lookout for common patterns in the data that can be easily solved by a computer. More complex cases can be marked by the computer for manual resolution. Do not leave too much work for humans, however: humans are bad at tedious work and will quickly start making mistakes. Everything in conflation that can reasonably be done by a machine should be done by a machine.
It is easiest to solve conflicts when no conflicts can arise: if no features can overlap, they cannot conflict. In this sense, undershoot in the data is better than overshoot, but again, make an informed decision that applies best to your import.
Consider that making sure that features’ borders are aligned is easier than deciding what to do with two arbitrarily overlapping polygons. This means that ensuring that no old/new feature pairs overlap helps greatly with conflation. Simply put, importing features only for areas where there is “white space” on the map is easier than importing features into already tightly mapped areas. You might clean up the space first by removing old features (according to some strategy), or just leave those areas alone.
Decide how to verify that the conflation is correct. The JOSM validator helps with detecting when e.g. two ways with identical “landuse” tags overlap. But it does not check for everything, and certain cases that are obviously wrong to a human are skipped by the validator.
Use it and other similar tools, as well as visual inspection of the result, to make sure that no (obvious) errors have slipped into the end result.
Let us consider different strategies for approaching the data merging task.
Comparing two raster layers is arguably the simplest case of data conflation, as the task is reduced to comparing values of individual pixels: two pixels either fully overlap or do not overlap at all (provided that the two rasters are brought to the same resolution and extent). There is no problem of detecting intersections of (multi)polygons, and the algorithmic complexity is proportional to the area of the raster map in pixels.
Existing OSM data is vector, not raster. You can, however, rasterize it into a matrix of values that can then be processed together with the imported raster data layer. Now a decision about every pixel of import data can be made based on two data sources: the old data for the pixel and the new data from the import.
For example, pixels for which there already is some data in the OSM database can be made “invisible” to the vectorization process, so tracing will not create any vector features passing through such pixels. This effectively makes sure that no “deep” overlapping between old and new features can happen.
However, due to data loss in the rasterization process, borders of old and new features may and will intersect somewhat along their common edges. Thus, the task of making sure two features do not overlap is reduced to the task of making sure two features have a common border.
This approach was used with the Sweden land cover import.
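As a rough sketch of the masking idea, existing OSM polygons (exported, say, to a shapefile) can be burned into a mask raster that shares the extent, resolution and projection of the import raster; file names here are hypothetical. Such a mask can then be supplied as the mask band to the polygonize step, so that tracing skips already-mapped pixels:

```python
# A minimal sketch of burning existing OSM polygons into a mask aligned with
# the import raster; file and layer names are hypothetical.
from osgeo import gdal, ogr

src = gdal.Open("import_landcover.tif")
osm = ogr.Open("existing_osm_polygons.shp")
layer = osm.GetLayer()

drv = gdal.GetDriverByName("GTiff")
mask = drv.Create("osm_mask.tif", src.RasterXSize, src.RasterYSize, 1, gdal.GDT_Byte)
mask.SetGeoTransform(src.GetGeoTransform())   # identical extent and pixel size
mask.SetProjection(src.GetProjection())       # identical projection

mask_band = mask.GetRasterBand(1)
mask_band.Fill(1)                              # everything visible by default
gdal.RasterizeLayer(mask, [1], layer, burn_values=[0])  # hide already-mapped pixels
mask = None  # flush to disk
```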
Below are the issues discovered so far and solutions for them.
Make sure that the raster datasets being compared are completely identical in their extent (bounding box position and dimensions), resolution and coordinate projection system. Tools that work with geographic data in raster formats typically expect all layers to have exactly the same dimensions, no relative shifting, no variation in projections etc. Not all of them report proper errors when these conditions do not hold. An incorrectly skewed mask layer creates holes for phantom features in the resulting vector layer, while simultaneously leaving a lot of overlap for features that should not have been generated at all.
There are no known tools working directly with OSM-format data that allow for conflict resolution against another set of vector features. One can, however, load existing OSM data into one of the GIS applications together with the import vector data, do the processing there, and then export the modified import features to an OSM file. More exploration is needed in this direction to see how efficient and accurate it can be, e.g. using vector overlays.
You might consider some old vector features for removal from the OSM dataset if new features located at the same position are of better spatial resolution, quality or tagging.
For example, data coming from older imports may be considered for replacement by newer, more detailed import data. Be careful, however, with editing OSM XML extract files in your scripts, as simply deleting (or marking for deletion) nodes, ways or relations may leave dangling references from other objects somewhere outside the current map extent. Typically, automatic removal of old features is best avoided; the final decision should be made manually.
Deleting objects is easier in the new data layer, as its objects have not yet been “observed” in the global database and no external references to them can exist yet, so you can simply drop unnecessary objects from your files. Land cover data is almost always redundant to some extent, so instead of trying to resolve a complex conflicting overlap or intersection it is often easier to remove the features involved altogether and replace them with a manually drawn configuration.
If your old and new vector features can overlap in an arbitrary manner, you should explore the known algorithms for merging, subtracting and other operations on (multi)polygons. Be sure to measure their performance, however, as they may be too slow for bigger datasets. Simpler and faster spatial strategies to detect common cases of (non-)interaction between features should be used cleverly. For example, before calculating the intersection of two polygons, check whether their bounding boxes overlap. If they do not, there is no chance of intersection either, and there is no need to run a complex algorithm to discover that.
There are many spatial indexes invented to aid with the general task of telling
whether two objects are “close” to each other.
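A minimal sketch of this pre-filtering with Shapely 2.x, whose STRtree query compares only bounding boxes and returns candidate indices, so that the expensive exact test runs only on plausible pairs:

```python
# A minimal sketch of a bounding-box pre-filter using Shapely 2.x's STRtree;
# only candidate pairs whose envelopes intersect are tested exactly.
from shapely.strtree import STRtree

def overlapping_pairs(old_features, new_features):
    tree = STRtree(old_features)
    pairs = []
    for new_geom in new_features:
        for idx in tree.query(new_geom):        # cheap envelope test
            old_geom = old_features[idx]
            if new_geom.intersects(old_geom):   # exact, more expensive test
                pairs.append((old_geom, new_geom))
    return pairs
```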
If there is a guarantee that features may only intersect in a tight area along their common borders, one can try snapping nodes to lines in an attempt to unify the border between the features. This is not universal, however, as there will always be cases when manual post-editing is needed. A JOSM plug-in was developed to assist with this kind of node snapping in JOSM; it still relies on manual adjustment for complex cases.
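Outside of JOSM, a similar snapping step can be sketched with shapely.ops.snap, which moves vertices of one geometry onto another within a tolerance; the geometries and the tolerance below are purely illustrative, and the tolerance must stay small or parts far from the common border get distorted:

```python
# A minimal sketch of border unification with shapely.ops.snap; geometries
# and tolerance are illustrative.
from shapely.geometry import Polygon
from shapely.ops import snap

old = Polygon([(0, 0), (10, 0), (10, 10), (0, 10)])
new = Polygon([(10.2, 0.1), (20, 0), (20, 10), (9.9, 9.8)])

snapped = snap(new, old, tolerance=0.5)  # vertices near "old" are moved onto it
print(snapped.wkt)
```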
There are a lot of specific issues common to land cover data and its conflation, and pictures would definitely help with explaining them. I plan to talk about these in more detail later in a separate post.