
Analysis of Bounding Box Sizes Over the Last Eight Years

Posted by b-jazz on 5 August 2020 in English. Last updated on 16 August 2020.

Impetus

I recently read in one of the weekly OSM newsletters about a discussion thread on the OSM-talk mailing list about limiting, or at least adding a warning to editors about, edits that would result in an unusually large bounding box for the changeset. As someone who has made the mistake of accidentally editing nodes in entirely different parts of the country and been horrified at the massive bounding box that resulted, I was curious how often this happens and what a typical bounding box size is for the average mapper.

Gathering the Data

I set about gathering the data on changesets and bounding boxes, and picked the current month (July at the time) to look at. I found that there was a minutely “feed” of changeset data that also included the computed bounding box, in the replication/changesets directory on the planet.openstreetmap.org site (and luckily mirrored to a single place in the U.S. that I could use). After my internet connection started glowing red from a day of transferring just a week’s worth of July data, I figured that was probably enough to get something useful. I wrote a few lines of Python to uncompress the by-minute files, convert them into SQL statements, and load them into a Postgres/PostGIS database. (With a non-trivial detour to learn just enough about how to actually work with polygons and WKT, and how to calculate the area using the right spatial reference system.)
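The core of the load step was roughly the following. This is a simplified sketch rather than my exact script (it inserts rows directly instead of writing out SQL statements first), and the `changesets` table and its column names are just illustrative:

```python
# Simplified sketch of the load step (not the exact script).
# Assumes a table created roughly as:
#   CREATE TABLE changesets (id bigint, num_changes int, bbox geometry(Polygon, 4326));
import glob
import gzip
import xml.etree.ElementTree as ET

import psycopg2


def bbox_wkt(cs):
    """Build a WKT polygon from a changeset's bounding-box attributes."""
    x1, y1 = float(cs.get("min_lon")), float(cs.get("min_lat"))
    x2, y2 = float(cs.get("max_lon")), float(cs.get("max_lat"))
    return f"POLYGON(({x1} {y1},{x2} {y1},{x2} {y2},{x1} {y2},{x1} {y1}))"


conn = psycopg2.connect("dbname=osm")
with conn, conn.cursor() as cur:
    for path in glob.glob("changesets/*.osm.gz"):    # the minutely replication files
        root = ET.fromstring(gzip.open(path, "rb").read())
        for cs in root.iter("changeset"):
            if cs.get("min_lon") is None:            # empty changesets carry no bbox attributes
                continue                             # (I count those as area 0 elsewhere)
            cur.execute(
                "INSERT INTO changesets VALUES (%s, %s, ST_GeomFromText(%s, 4326))",
                (cs.get("id"), cs.get("num_changes"), bbox_wkt(cs)),
            )
```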

First Look

The first graph I generated was a simple bar chart for the first week in July. I posted the following chart on the OSM US Slack server in a new channel that I created called #data-is-beautiful (after a popular subreddit on the “front page of the internet” website known as reddit.com).

changeset bounding box frequency

It’s important to note the scale of the X-axis: each bucket of the histogram is twice the size of the previous one. I also call out a special size of “exactly 0 sq. meters” because changesets with a single edited node show up as having zero area, as do changesets that are empty (something new that I learned was possible). Other than that first bucket, each bucket N covers areas greater than 2^(N-1) and less than or equal to 2^N (square meters, of course).
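In other words, the bucket for a non-zero area is just the ceiling of its base-2 logarithm. A tiny sketch of the bucketing (not the exact code; tiny areas under one square meter get lumped into the first non-zero bucket here):

```python
import math
from collections import Counter

def bucket(area_sq_m):
    """Bucket 0 is 'exactly 0 sq. meters'; bucket N holds 2**(N-1) < area <= 2**N."""
    if area_sq_m == 0:
        return 0
    return max(1, math.ceil(math.log2(area_sq_m)))

areas = [0.0, 1.0, 300.0, 2**19, 2**19 + 0.5]     # example areas in square meters
print(Counter(bucket(a) for a in areas))           # Counter({0: 1, 1: 1, 9: 1, 19: 1, 20: 1})
```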

Since I had the workflow in place to gather and process the thousands of files necessary, I decided to gather the rest of the month and see if that changed anything. So I let my scripts take over my internet connection and resigned myself to watching degraded Netflix video streams for the next day or two. (I’m kidding, it really wasn’t that bad.) After all of that was processed, the graph for the month of data vs. the week of data looked essentially identical (aside from the scale of the Y-axis).

Quick Executive Summary Sidenote

I’m not sure what the technical name is for the most common bucket size. It’s kind of like the “median” bucket, but I don’t think that’s 100% accurate; we’ll use that term here anyway. The median bucket for this particular sample of data is 2^19 square meters, or roughly half a square kilometer.

An Innocent Comment

I got some great feedback and comments on my post in the Slack channel, and Ian Dees made an innocent comment about how it would be cool to see the data over time, represented as a heatmap. Well, I couldn’t just let that idea hang out there unfulfilled, so I went about trying to figure out how I could make that happen on my slow, rural internet connection and a 7-year-old desktop Linux computer.

My first thought was that sampling was going to be the answer to making this possible. Seeing that the month of data vs. the week of data in my first attempt showed essentially the same results, I decided to grab just one week of data from each month (the first through the seventh). I wanted an entire week in case there was different behavior from weekday mappers (possibly influenced by corporate mappers) vs. weekend mappers (possibly including more “hobby” mappers). I also knew that gathering every single changeset for the week would take longer than I was willing to wait, and would probably take more space than I could store without doing some cleanup and rearranging of data on my computer. So I decided to get every third minute available. That seemed like a reasonable amount of data, and something I could gather and store in a reasonable amount of time and space.
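The sampling itself was easy to script once I knew which sequence numbers covered each first-week window. A rough sketch (not my exact script; the sequence ranges are assumed to have been looked up separately, and the replication directory splits the zero-padded nine-digit sequence number into three groups of three digits):

```python
# Rough sketch of the every-third-minute sampling download.
import os
import urllib.request

BASE = "https://planet.openstreetmap.org/replication/changesets"

def sequence_url(seq):
    s = f"{seq:09d}"                                 # e.g. 4229632 -> 004/229/632.osm.gz
    return f"{BASE}/{s[0:3]}/{s[3:6]}/{s[6:9]}.osm.gz"

def fetch_week(first_seq, last_seq, out_dir="changesets"):
    os.makedirs(out_dir, exist_ok=True)
    for seq in range(first_seq, last_seq + 1, 3):    # every third minutely file
        urllib.request.urlretrieve(sequence_url(seq), f"{out_dir}/{seq}.osm.gz")
```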

Visualizing the Data

After waiting about a week for all of the samples to come over the wire, I was finally able to get them into Postgres and do some collating and dumping to a file that I could process into a heatmap. I did a little research into what software to use, but ended up just writing a simple Python script to generate an SVG file that I could view in a browser.

Early on, I thought there were two ways I wanted to view the heatmap. The first was to treat each column (a single month) separately, setting the color gradient based on the max value for that particular month. But I also thought it might be interesting to see what it would look like scaled to the max value for the entire dataset, so that you could also see the growth in the number of changesets over time.

These heatmaps are what I came up with:

Percent based on max of each month

and…

Percent based on max of all data
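The script that generated these isn’t fancy. Here is a stripped-down sketch of the idea (not the actual script), with the two normalizations selected by a flag:

```python
# Stripped-down sketch of the SVG heatmap generation (not the actual script).
# `grid[month][bucket]` holds changeset counts per month column and size bucket.
CELL = 12  # pixel size of one heatmap cell

def heatmap_svg(grid, per_month=True):
    months = sorted(grid)
    buckets = sorted({b for col in grid.values() for b in col})
    global_max = max(v for col in grid.values() for v in col.values())
    rects = []
    for x, month in enumerate(months):
        col_max = max(grid[month].values()) if per_month else global_max
        for y, b in enumerate(buckets):
            pct = grid[month].get(b, 0) / col_max if col_max else 0
            shade = int(255 * (1 - pct))             # darker cell = more changesets
            rects.append(f'<rect x="{x * CELL}" y="{y * CELL}" width="{CELL}" '
                         f'height="{CELL}" fill="rgb(255,{shade},{shade})"/>')
    w, h = len(months) * CELL, len(buckets) * CELL
    return (f'<svg xmlns="http://www.w3.org/2000/svg" width="{w}" height="{h}">'
            + "".join(rects) + "</svg>")

# e.g. open("heatmap.svg", "w").write(
#          heatmap_svg({"2019-07": {19: 120, 20: 80}, "2019-08": {19: 200, 20: 50}}))
```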

@imagico had a good idea: show the circumference (perimeter) of the bounding box instead, so that a changeset touching two different parts of the world at very similar latitudes doesn’t come out as a very wide but very short box whose area ends up smaller than some normal in-city editing would produce. I ran the numbers again using the PostGIS ST_Perimeter() function and came up with the following heatmap:

Perimeter this time
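The perimeter version was essentially the same aggregation with a different PostGIS expression. A sketch, with the hypothetical table and column names from the earlier sketches:

```python
import psycopg2

# ST_Perimeter on the geography cast returns meters; the same power-of-two
# bucketing as the area version is then applied to the perimeter values.
conn = psycopg2.connect("dbname=osm")
with conn, conn.cursor() as cur:
    cur.execute("SELECT id, ST_Perimeter(bbox::geography) FROM changesets")
    perimeters = {cs_id: p for cs_id, p in cur.fetchall()}
```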

Another view of the data is to look at the num_changes value for each changeset and compare the changeset’s area to the number of objects being edited.

Bounding box area per number of changes
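Pulling that view out of the database is just another expression. A sketch (with the same hypothetical names again, and skipping changesets with no recorded changes):

```python
# Sketch of the area-per-change query (hypothetical table/column names).
AREA_PER_CHANGE_SQL = """
    SELECT id,
           ST_Area(bbox::geography) / num_changes AS sq_m_per_change
    FROM changesets
    WHERE num_changes > 0;
"""
```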

Misc. Notes

  • The changeset data only went back to late 2012. It would be interesting to see it going all the way back to the start, but this is what was easily available to me.
  • Changesets didn’t start including num_changes until late 2013.
  • The source site (planet.osm.org) is missing a swath of data from April 2013.
  • I’m not good at color, so this isn’t as vibrant as it could be. Sorry for that. If you have suggestions on how to pick a better color palette, I’d love to hear them.
  • I’m happy to share my JSON file in case you want to do your own visualization. It is only 24K so just let me know where I can email it or drop it for you.
  • It looks like diary entries with images might not allow you to click on them to see the full size image. If that’s the case, the heatmaps are available at https://i.ibb.co/ynJQtmt/image.png and https://i.ibb.co/qxFm57x/image.png.

Future Direction

Now that I have this 9GB database available to me, I want to poke around with it a little more. I might sample some of the larger changesets and see if there is anything interesting to find there. How many of them are just editing two nodes across a large area? How many are editing a single, very large feature? How does the number of modified objects compare to the size of the changeset for large changesets? Are there particular usernames that tend to make rather large changesets (by accident or on purpose)?

If you have ideas on what else I can do, please let me know. I might as well make use of the disk space that I’ve dedicated to this.

Discussion

Comment from b-jazz on 6 August 2020 at 03:40

Looking at my (sampled) data, there were 760 changesets with only two objects modified that ended up with a changeset bounding box greater than 1,000 square kilometers. The sampling factor is roughly 1:13, so that extrapolates to about 10,000 very large changesets in a year coming from just two changes (likely two nodes, based on a few that I looked at by hand).

Comment from imagico on 6 August 2020 at 13:20

Very interesting.

The obligatory question is of course: How did you calculate the bounding box area? That is non-trivial with large bounding boxes.

I am a bit astonished about the almost complete lack of bounding boxes above 2^40 square meters in your analysis. The whole Earth surface is above 2^48.8 square meters (2^50.5 Mercator square meters for the full Mercator square). A larger bounding box with edits on several continents will usually be on the order of 2^44 to 2^45 square meters, I think.

Comment from tyr_asd on 6 August 2020 at 15:05

The source site (planet.osm.org) is missing a swath of data from April 2013.

Interesting, and good to know. I guess it could have made sense to download and use the full changeset dump from https://planet.openstreetmap.org/planet/changesets-latest.osm.bz2, which doesn’t seem to be missing any changesets and would have been slightly quicker to download. ;)

Comment from b-jazz on 6 August 2020 at 15:35

@imagico: To calculate area, I use PostGIS’s ST_Area() function. Over the past year, my sampling (roughly 1/13th of all changesets) has 829 records that are over 2^40 square meters. I’m not a statistician and can’t speak to how representative sampling is when it comes to rare events (0.06% in this case), but I can see it being inaccurate in either direction.

Comment from b-jazz on 6 August 2020 at 15:43

@tyr_asd: I think I did start with that data but learned that it either didn’t include the bounding box or didn’t organize the changes into changesets. There was some reason I didn’t use it, but it’s not worth another 3.5GB download to figure out why. :)

Comment from imagico on 6 August 2020 at 16:14

ST_Area() on geography will, for large bounding boxes, lead to quite significant errors if you don’t subdivide the long W-E segments before calculating the area (because they will be calculated along the great circle and not along the parallel).
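(For illustration only, and not part of the original analysis: one way to do that subdivision is to densify the plain lon/lat geometry before casting to geography, so the long W-E edges follow the parallel. A sketch using the hypothetical column names from the sketches above:)

```python
# Densify in lon/lat space first, then compute the geodesic area.
# The 0.5 segment length is in degrees, since bbox is geometry(Polygon, 4326).
AREA_SUBDIVIDED_SQL = """
    SELECT id,
           ST_Area(ST_Segmentize(bbox, 0.5)::geography) AS area_sq_m
    FROM changesets;
"""
```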

In any case - since you are after the size of the bounding box you should consider that the area might not be the best measure because an excessively large changeset editing features in America and Europe might be fairly small by that measure if the features are at approximately the same latitude. The circumference of the bounding box might be a better measure.

Regarding changeset size - if I look at the changeset history on the website somewhere in the middle of the Atlantic Ocean, I usually get about 1-2 ocean-spanning changesets per day. Not all of them will be larger than 2^40 square meters, but many of them are. If you analyzed about one week per month over the course of 8 years, you get to about the number you gave. So my intuition was a bit off here, apparently.

Comment from b-jazz on 6 August 2020 at 20:16

@imagico. Yes, yes. I see what you’re saying. Thanks for the comments! I’ll look into doing another heatmap with this idea.

Comment from PierZen on 7 August 2020 at 01:53

From the Planet Changesets dump, I published brief statistics for 2017 in a Twitter Moment called Weight of Continents.

For 2017 I observed: 10 million changesets; 6 million where the BBOX was smaller than 100x100 m; and 11,000 changesets that cover more than a continent.

As some kind of heat map, I could also represent areas mapped over the continents by tracing all BBOXes except the smallest and largest ones.

Comment from b-jazz on 7 August 2020 at 02:47

@imagico: I added a third heatmap with the perimeter length of the bounding box. There are fewer buckets (keeping with the earlier approach of doubling the bucket size on every iteration), so the “heat” looks a little more condensed, but it turns up some slight variations in the pattern. It’s an interesting take. Thanks for the suggestion.

Comment from b-jazz on 7 August 2020 at 02:48

@PierZen: thanks for the comments and your tweet with another view of changeset analysis. Neat stuff.

Comment from imagico on 7 August 2020 at 07:37

Nice. I think I can see a slight U-curve along the timeline in your plots, with larger average changeset sizes at the beginning and the end and smaller ones in between.

Comment from Jennings Anderson on 7 August 2020 at 16:23

Really interesting work! Another source to look at could be the OSM Public Dataset on Amazon where you can use Athena to query all of the changesets in the cloud and then download only the results you’re after.

My next questions (since you have the data :) ) would be to check the density of some of these changesets (num_changes / area) as well as the number of users submitting them. As the number of monthly contributors increases, does the rate of large changesets increase too? Also, how many of these larger changesets are from bots?

Very cool stuff!

Comment from b-jazz on 7 August 2020 at 17:11

@Jennings: oh man, I wish I knew about that Amazon/Athena store before I started this. That might have been a big help. I’ll have to store that one away for any future plans. Thanks!

I was thinking of doing the area / num_changes analysis next, and I’m pondering if that is much different than num_changes / area.

What I’d really like to do is figure out the “empty space” of the bounding box when you consider the bounding boxes of the individual changes in the set. But I’d need to dig further and find the bboxes of the individual objects, which aren’t available in my data. If I were to have my ideal feature in an editor that warns me when my bounding box is too large, I’d really want it to warn me when the empty space is too large. If I edit a single, massive feature (the Bermuda Triangle, for example), the overlapping bboxes of its three ways would pretty much fill the bbox of the changeset. At least that should be one factor to consider when deciding whether or not to “hassle” the user with a warning.

As for bots, do you have a dataset of bot usernames/userids?

Comment from Jennings Anderson on 8 August 2020 at 00:08

Yes, the “empty space” is the real question at hand… and nontrivial to compute, but a few ideas:

If you calculated the bounding box / convex hull of each object in a changeset and subtracted their total area from the changeset area itself, you might get a measure of “empty space” in a changeset. That said, I really have no idea what that measure would mean; it could be kind of meaningless. The “empty space” measure for someone adding a bunch of new buildings versus someone updating the opening hours on various businesses would look extremely different, and it’s unclear what either actually represents.
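(A rough, hypothetical sketch of that computation; it assumes a changeset_objects table with per-object geometries, which the database behind this post doesn’t currently include:)

```python
# Hypothetical "empty space" query: changeset bbox area minus the area of the
# union of the per-object convex hulls.
EMPTY_SPACE_SQL = """
    SELECT c.id,
           ST_Area(c.bbox::geography)
             - ST_Area(ST_Union(ST_ConvexHull(o.geom))::geography) AS empty_sq_m
    FROM changesets c
    JOIN changeset_objects o ON o.changeset_id = c.id
    GROUP BY c.id, c.bbox;
"""
```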

Anyway, as for bots, I usually start with this list: https://wiki.openstreetmap.org/wiki/Bot. I’ve found that searching for ‘bot’ in the username is usually pretty good, as well as _repair, _cleanup, or _import, depending on what you’re after. If focusing on US stuff, especially pre-2010, remember TIGER and TIGER cleanup as terms :) There’s also the mechanical=yes tag on the changeset itself for some of these?

Good luck!

Comment from jtracey on 13 August 2020 at 18:34

I’m not sure what the technical name is for the most common bucket size.

“Mode” is the word you’re looking for. ;)
