One of the mysteries of OpenStreetMap that new users rarely know about is the issue of imports. I’ve been pondering for a while the best way to identify which user accounts are related to imports, what they have been importing, where the data is coming from, and what portion of the data comes from imports versus “purely” from contributors.
Now, my sense is that initially there was a lot of importing going on informally, until someone instituted the formal process. The Import Catalogue, where all the imports are supposed to be documented, is sorely in need of some cleaning up and fixing. That is, many imports are not recorded there. Hopefully we can use the data to fix the page as well.
In my own research, I’m interested in identifying imports so as to get rid of them! I want to understand contributor activity, and that kind of analysis can get seriously skewed if you include imports. One example of this is Dennis’ SOTM-US 2014 talk, where they found that there was a lot of activity in North Dakota, but most of this was coming from imports (or so we think!).
Here, I wanted to write some notes about what I’ve found to be the best way to identify imports in the changesets data. The changesets data contains a field called “num_changes” that records the number of changes in any given changeset. A feature of most imports is that they cram as many features as they can into one changeset (the max is 50000). So what you can do is look at all the changesets for a given user, and if an extraordinarily high share of them (say 80%) have more than 5000 changes, then it’s likely that the account is being used for imports.
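To make the first step concrete, here is a minimal sketch of pulling the user and “num_changes” out of changeset XML. It assumes each changeset is a `<changeset>` element carrying `user` and `num_changes` attributes, which is how I understand the changesets dump to be laid out; the function name `iter_changesets` and the tiny inline sample are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET


def iter_changesets(source):
    """Yield (user, num_changes) for each <changeset> element.

    `source` is a filename or a binary file object. The attribute names
    `user` and `num_changes` are assumptions about the dump format.
    """
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "changeset":
            user = elem.get("user")  # None if the attribute is missing
            num_changes = int(elem.get("num_changes", 0))
            yield user, num_changes
            elem.clear()  # drop the element's contents to keep memory down


# Hypothetical two-changeset sample, just to show the shape of the output.
sample_xml = (
    b'<osm>'
    b'<changeset id="1" user="a" num_changes="7"/>'
    b'<changeset id="2" user="b" num_changes="50000">'
    b'<tag k="comment" v="x"/></changeset>'
    b'</osm>'
)
pairs = list(iter_changesets(io.BytesIO(sample_xml)))
```

Streaming with `iterparse` instead of loading the whole file seems like the safer choice here, since the full changesets file is large.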
Using this method, I calculated “import accounts” (at least 50% of their changesets have more than 5000 changes, and overall they have at least 50 changesets) to get this list of large import accounts in the US. Here “mean” is the share of a user’s changesets that have more than 5000 changes, and N is the total number of changesets for that user.
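For anyone who wants to try the same rule, here is a rough sketch of it in plain Python. It assumes you already have the changesets as (user, num_changes) pairs; the function name `find_import_accounts`, the parameter names, and the sample accounts are all hypothetical, and this is not the exact code behind the list.

```python
from collections import defaultdict


def find_import_accounts(changesets, big=5000, min_share=0.5, min_changesets=50):
    """Return {user: (mean, n)} for accounts that look like import accounts.

    changesets: iterable of (user, num_changes) pairs.
    mean: share of the user's changesets with more than `big` changes.
    n: total number of changesets for that user.
    """
    totals = defaultdict(int)
    large = defaultdict(int)
    for user, num_changes in changesets:
        totals[user] += 1
        if num_changes > big:
            large[user] += 1

    flagged = {}
    for user, n in totals.items():
        mean = large[user] / n
        if n >= min_changesets and mean >= min_share:
            flagged[user] = (mean, n)
    return flagged


# Made-up data: one bulk account, one regular mapper, one small importer.
sample = (
    [("bulk_loader", 20000)] * 40
    + [("bulk_loader", 10)] * 20
    + [("mapper", 12)] * 100
    + [("small_import", 30000)] * 10
)
flagged = find_import_accounts(sample)
```

In this made-up sample only `bulk_loader` is flagged: `mapper` never crosses 5000 changes, and `small_import` has too few changesets, which is exactly the kind of import this rule misses.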
This is by no means perfect, and there are many other types of imports that I think I’m missing, and perhaps there are some false positives as well? I would love to hear your reactions, and any suggestions for better ways to do this!