OpenStreetMap

What is an import?

Posted by jremillard on 17 April 2018 in English (English)

I have been working on some code to detect if a changeset is an import, SPAM, or if it has a tagging error.

https://github.com/jremillard/osm-changeset-classification

Detecting SPAM and tagging errors is pretty straightforward. However, detecting imports is much more challenging. Before I started, I thought I knew what an import was. I was looking for large changesets that only added one or two kinds of data. However, these criteria perform poorly in practice. In OSM, many import changesets are not large, and it is not uncommon for imported data to have some hand editing mixed in.
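For illustration, the original "large changeset, one or two kinds of data" rule could be sketched roughly as follows. This is not the code from the repository above. The function name, the thresholds, and the choice to treat an element's set of tag keys as its "kind" are all assumptions made for this example.

```python
from collections import Counter


def looks_like_import_naive(added_elements, min_size=1000, max_kinds=2):
    """Naive rule: a large changeset that adds only one or two kinds of data.

    added_elements: list of tag dicts, one per element created in the changeset.
    min_size and max_kinds are illustrative thresholds (assumptions).
    """
    if len(added_elements) < min_size:
        return False
    # Assumption: the "kind" of an element is its sorted tuple of tag keys.
    kinds = Counter(tuple(sorted(tags)) for tags in added_elements)
    return len(kinds) <= max_kinds


big_import = [{"building": "yes"} for _ in range(1500)]
small_import = [{"building": "yes"} for _ in range(50)]
hand_edits = [
    {"highway": "residential", "name": "A"},
    {"amenity": "cafe"},
    {"shop": "bakery"},
]

print(looks_like_import_naive(big_import))
print(looks_like_import_naive(small_import))
print(looks_like_import_naive(big_import + hand_edits))
```

With these thresholds, only the first changeset is flagged. The small import and the import with a few hand edits mixed in both slip through, which are the two failure cases described above.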

My new definition of an import

An import is any addition to OSM that directly derives from other digital map sources.

Comment from Glassman on 17 April 2018 at 05:39

If I use the TIGER background image, provided in both iD and JOSM, to determine geometry as well as road name, is this an import?

Comment from DevonshireBoy42 on 17 April 2018 at 08:25

What do you want to do with flagged imports? If I do a small import of one village or town and manually check, conflate and edit every building then it lacks the issue that large or automated imports have.

Comment from Zverik on 17 April 2018 at 10:19

There are no imports. Import is an invented construct made by Germans to try to keep their map in check. That's why no matter what algorithm you choose, you'd get tons of false positives and false negatives.

Comment from Stereo on 17 April 2018 at 15:07

I think it's very interesting that the import guidelines don’t actually define what the term means.

Comment from Nakaner on 17 April 2018 at 17:24

I agree that the size alone is not helpful. I regularly check my OSMCha filters for changesets with more than 9000 additions and many of them are HOT mappers tracing buildings and uploading them after they finished editing.

jremillard wrote:

An import is any addition to OSM that directly derives from other digital map sources.

I would append:

without or with limited use of ground surveys and aerial/satellite imagery.

Otherwise people will try to define Bing imagery as a "digital map source". :-)

However, that criterion is difficult to translate into rules a computer can apply. Here is my personal list of criteria that define a bad import:

  • Use of strange tags
  • uppercase tags
  • coordinates as tags
  • no tag is longer than 10 characters
  • no discussion on imports@ mailing list
  • too short time between first posting to imports@ and start of the import
  • obvious and large copyright violation
  • no entry in the imports catalogue, no documentation on the wiki
  • no usage of a dedicated account for imports
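The second to fourth items in the list can be checked from the tags alone. A minimal sketch, assuming each uploaded element is available as a plain dict of tags. The function name and the set of coordinate-like key names are hypothetical, chosen only for this example:

```python
# Assumption: a small, illustrative set of key names that suggest
# coordinates were copied into tags. A real list would need tuning.
COORD_KEYS = {"lat", "lon", "latitude", "longitude", "x", "y"}


def suspicious_key_signals(elements):
    """Return tag-based signals for a list of tag dicts (one per element)."""
    keys = {key for tags in elements for key in tags}
    return {
        # Keys containing at least one uppercase character.
        "uppercase_keys": sorted(k for k in keys if any(c.isupper() for c in k)),
        # Keys whose lowercased name looks like a coordinate field.
        "coordinate_keys": sorted(k for k in keys if k.lower() in COORD_KEYS),
        # True when there is at least one key and none exceeds 10 characters.
        "all_keys_short": bool(keys) and all(len(k) <= 10 for k in keys),
    }


sample = [{"NAME": "Main St", "LAT": "42.1", "highway": "residential"}]
print(suspicious_key_signals(sample))
```

The remaining items (mailing list discussion, wiki documentation, dedicated account) depend on information outside the changeset and are not covered by this sketch.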

Unfortunately, our rules don't require users to add a tag to the changeset pointing to the documentation and discussion of the import. If they did, we could look for changesets that look like imports but lack those tags. I would call these tags:

  • import:documentation=<page title at wiki>
  • import:discussed:<mailing_list>=<date of first posting on imports@ mailing list>
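If such tags were required, the check could look roughly like this. Note that these are proposed tags, not an existing convention. The function name and the example key import:discussed:imports are hypothetical.

```python
def missing_import_tags(changeset_tags):
    """List which of the two proposed import tags a changeset lacks.

    changeset_tags: dict of changeset tags (key -> value).
    """
    missing = []
    if not changeset_tags.get("import:documentation"):
        missing.append("import:documentation")
    # Any non-empty import:discussed:<mailing_list> tag counts as present.
    has_discussion = any(
        key.startswith("import:discussed:") and value
        for key, value in changeset_tags.items()
    )
    if not has_discussion:
        missing.append("import:discussed:<mailing_list>")
    return missing


documented = {
    "comment": "Import of address points",
    "import:documentation": "Example import wiki page",
    "import:discussed:imports": "2018-04-01",
}
undocumented = {"comment": "added buildings"}

print(missing_import_tags(documented))
print(missing_import_tags(undocumented))
```

A changeset that looks like an import by other signals but returns a non-empty list here would be a candidate for a closer look.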

Comment from Glassman on 17 April 2018 at 18:18

@Nakaner - At a minimum having a tag: import= should be sufficient. Or even the import page url to simplify getting to the page to see details of the import.

I applaud the effort to use software to detect imports. However, we need to be careful. False positives could cause angry comments directed at an editor who did nothing wrong.

Clifford

Comment from Zverik on 17 April 2018 at 18:48

that criterion is difficult to translate into rules a computer can apply

Well, this applies to all but the first four items on your list. And the fourth one is questionable.

And you are starting to discuss imports, not their detection.

Again, I am pretty sure you cannot tell a proper import from a regular edit. Regarding the source criteria, you never know what a mapper used for tracing or tagging, the same as with imports.

Comment from Nakaner on 17 April 2018 at 18:56

Well, this applies to all but the first four items on your list. And the fourth one is questionable.

The fourth item (I should have written "key", not "tag") is an easy way to find users importing shapefiles. As you might know, field names in shapefiles are limited to 10 characters. Sometimes things go completely wrong and people end up uploading objects with uppercase keys or keys ending with ~1.
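The "~1" case is easy to test for with a regular expression. A minimal sketch; the function name is hypothetical, and matching any numeric suffix after the tilde (not only 1) is an assumption:

```python
import re

# Assumption: treat any "~<digits>" at the end of a key as a sign of
# mangled field names, not only "~1".
TILDE_SUFFIX = re.compile(r"~\d+$")


def keys_with_tilde_suffix(tags):
    """Return the keys of one element's tag dict that end with ~<digits>."""
    return sorted(key for key in tags if TILDE_SUFFIX.search(key))


print(keys_with_tilde_suffix({"STREET_N~1": "Main", "name": "Main Street"}))
```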

Again, I am pretty sure you cannot tell a proper import from a regular edit. Regarding the source criteria, you never know what a mapper used for tracing or tagging, the same as with imports.

That's not wrong. I have difficulties and write changeset comments even if I am sure. There are HOT mappers uploading thousands of buildings in one large changeset.

Comment from dieterdreist on 17 April 2018 at 22:22

jremillard wrote:

An import is any addition to OSM that directly derives from other digital map sources.

I think this definition has to be extended, because you can also import other information if you are able to assign positions to it (or relate it to OSM objects).

Comment from dieterdreist on 17 April 2018 at 22:24

For me, an import is adding data from somewhere when you didn’t check every part individually.

Comment from jremillard on 18 April 2018 at 02:20

Thanks for all the comments!

@Zverik - The vast majority of imports (probably over 95%) are detectable. However, a knowledgeable person who wishes to make an import hard to detect certainly can. Obviously, it is impossible to know how often this happens.

@Stereo - I agree that the fact that the term isn't clearly defined is interesting.

@DevonshireBoy42 - I have no plans yet for what to do with the detector; we will see if it goes anywhere useful.

@Glassman - Pulling road names from TIGER is a kind of import, but it doesn't need to follow the import guidelines: we all know that it is OK because TIGER is public domain. However, pulling road names from Google isn't OK. For small imports, we skip the import guidelines and deal with them by reverting them after the fact if they have problems.

Finally, the word "directly" would exclude tracing over an image layer.
