One of the mysteries of OpenStreetMap that new users tend not to know about is the issue of imports. I’ve been pondering for a while the best way to identify which user accounts are related to imports, what they have been importing, where the data is coming from, and what portion of the data comes from imports versus what is “purely” from contributors.
Now, my sense is that initially a lot of importing went on informally, until someone instituted the formal process. The Import Catalogue, where all imports are supposed to be documented, is sorely in need of some cleaning up and fixing: many imports are not recorded there. Hopefully we can use the data to fix the page as well.
In my own research, I’m interested in identifying imports so as to get rid of them! I want to understand contributor activity, and your analysis can get seriously skewed if you include imports. One example of this is Dennis’ SOTM-US 2014 talk, where they found that there was a lot of activity in North Dakota, but most of it was coming from imports (or so we think!).
Here, I wanted to write some notes about the best way I’ve found so far to identify imports in the changesets data. The changesets data contains a field called “num_changes” that records the number of changes in any given changeset. A feature of most imports is that they cram as many features as they can into one changeset (the max is 50000). So what you can do is look at all the changesets for a given user, and if an extraordinarily high share of them (say 80%) have more than 5000 changes, then it’s likely that the account is being used for imports.
Using this method, I flagged “import accounts” (at least 50% of their changesets have more than 5000 changes, and overall they have at least 50 changesets) to get this list of large import accounts in the US. Here “mean” is the percent of a user’s changesets that have more than 5000 changes, and N is the total number of changesets for that user.
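For anyone who wants to try the same rule, here is a minimal Python sketch of it. It is not the script I actually ran. It assumes the changesets have already been parsed into (user, num_changes) pairs, and the function name, parameter names and toy user names are all made up for illustration.

```python
from collections import defaultdict


def flag_import_accounts(changesets, big=5000, min_share=0.5, min_changesets=50):
    """Return {user: (share_of_big_changesets, total_changesets)} for likely import accounts.

    `changesets` is an iterable of (user, num_changes) pairs (an assumed input
    format, not the raw changeset dump). The thresholds default to the rule
    described above: more than 5000 changes counts as "big", and a user is
    flagged with at least 50 changesets of which at least 50% are big.
    """
    totals = defaultdict(int)  # changesets per user
    large = defaultdict(int)   # changesets per user with more than `big` changes

    for user, num_changes in changesets:
        totals[user] += 1
        if num_changes > big:
            large[user] += 1

    flagged = {}
    for user, n in totals.items():
        share = large[user] / n  # this is the "mean" column
        if n >= min_changesets and share >= min_share:
            flagged[user] = (share, n)
    return flagged


# Toy data with hypothetical user names, just to show the call.
rows = (
    [("bulk_uploader", 20000)] * 40   # big changesets
    + [("bulk_uploader", 10)] * 20    # a few small fixes
    + [("regular_mapper", 30)] * 100  # many small changesets
)
for user, (share, n) in flag_import_accounts(rows).items():
    print(user, round(share, 2), n)
```

The thresholds are keyword arguments so that it is easy to see how the list changes with a stricter cutoff, such as the 80% mentioned earlier.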
This is by no means perfect: I think there are many other types of imports that I’m missing, and perhaps there are some false positives as well. I would love to hear your reactions, and any suggestions for better ways to do this!
Comment from SOSM on 28 May 2014 at 06:45
One possible refinement is to ignore or at least list in a class of their own, “bot” accounts (for example both woodpeck accounts don’t actually introduce any new data).
Comment from dalek2point3 on 28 May 2014 at 19:52
That is a good point. I guess I must distinguish between bots and imports. There are bots that import and bots that don’t, and there are people who import and people who don’t.
I want to identify all bots AND import accounts AND individual imports not made from a dedicated account.
Does there exist a list of bots somewhere that you can point me to?