OpenStreetMap

Recently I found Streamlit, a pretty cool Python library that makes it easy to create web apps for visualising data.

I converted the changeset dump from planet.osm.org to the Parquet file format and uploaded it to AWS S3 storage. Then I created this Streamlit app in their free cloud, which displays some basic statistics: https://ttomasz-tt-osm-changeset-analyzer-main-apdkpy.streamlit.app/

The app uses DuckDB, a database engine that can query these files over the internet on demand. Parquet, a popular format in modern cloud data lakes, has several advantages over traditional file formats. The files are column-oriented and compressed, and their layout works well with HTTP range requests, so a client can download only the parts of the file that a query needs instead of going through the entire file. This makes processing larger datasets much faster.
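
As a rough sketch of what such a query could look like from Python (this is not the app’s actual code): the URL below is a placeholder, and the `created_at` column name is an assumption about the Parquet schema. Because the query touches a single column, DuckDB only has to fetch that column’s data and the file metadata rather than the whole file.

```python
def build_yearly_query(parquet_url: str) -> str:
    """Return SQL that counts changesets per year in a remote Parquet file."""
    # Double any single quotes so the URL is a valid SQL string literal.
    safe_url = parquet_url.replace("'", "''")
    return (
        # `created_at` is an assumed column name, not taken from the real file.
        "SELECT date_trunc('year', created_at) AS year, count(*) AS changesets "
        f"FROM read_parquet('{safe_url}') "
        "GROUP BY 1 ORDER BY 1"
    )


def run_yearly_query(parquet_url: str):
    """Run the query with DuckDB (needs `pip install duckdb` and network access)."""
    import duckdb  # imported lazily so the query builder works without it

    con = duckdb.connect()  # in-memory, in-process database; no server involved
    con.execute("INSTALL httpfs")  # extension for reading files over HTTP(S)/S3
    con.execute("LOAD httpfs")
    return con.execute(build_yearly_query(parquet_url)).fetchall()


# Placeholder URL: point this at a real Parquet file before calling run_yearly_query.
EXAMPLE_URL = "https://example.com/changesets.parquet"
print(build_yearly_query(EXAMPLE_URL))
```

Calling `run_yearly_query(...)` with a real file URL would return a list of `(year, changesets)` rows.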

DuckDB works similarly to SQLite in that it doesn’t have a dedicated server. You run the queries locally [0]. This makes the setup super simple: you either install the binary or configure a connection in an IDE like DBeaver, and you can run SQL queries.

Running these simple SQL queries over remote Parquet files takes about a minute or two. Doing the same with a custom script on the raw changesets.xml.bz2 file would take longer, not to mention that preparing the code would take much more effort.
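
For comparison, here is a minimal sketch of the “custom script” route, assuming the dump is a bzip2-compressed XML file of `<changeset>` elements with a `created_at` attribute. It streams the file so memory stays bounded, but it still has to decompress and parse every byte just to count changesets per year. The sample data is made up.

```python
import bz2
import io
import xml.etree.ElementTree as ET
from collections import Counter


def changesets_per_year(stream) -> Counter:
    """Count changesets per year from an uncompressed XML byte stream."""
    counts = Counter()
    context = ET.iterparse(stream, events=("start", "end"))
    _, root = next(context)  # first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == "changeset":
            created = elem.get("created_at")
            if created:
                counts[created[:4]] += 1  # first four characters are the year
            root.clear()  # drop parsed elements so memory use stays flat
    return counts


# Tiny made-up sample in the same shape as the changeset dump.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<osm>
 <changeset id="1" created_at="2005-04-09T19:54:13Z" num_changes="2"/>
 <changeset id="2" created_at="2005-04-17T14:45:48Z" num_changes="1">
  <tag k="comment" v="example"/>
 </changeset>
 <changeset id="3" created_at="2006-01-01T00:00:00Z" num_changes="5"/>
</osm>"""

# bz2.open accepts a file object, so the same call works for a path on disk.
with bz2.open(io.BytesIO(bz2.compress(SAMPLE)), "rb") as f:
    print(changesets_per_year(f))
```

Even this simple statistic needs a hand-written parsing loop, while the SQL version is a single `GROUP BY`.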

It would be great if OSM hosted more “consumption ready” data instead of relying on users to do their own coding and parsing.

Let me know if you have some ideas for charts/tables that could be added to the demo.

[0] - well, in this case they are running on Streamlit cloud’s server, but you can easily run the queries locally on the same Parquet files

Comment from bryceco on 2 February 2023 at 15:36

You can also read the changeset data already hosted on AWS rather than uploading it yourself: https://www.openstreetmap.org/user/Jennings%20Anderson/diary/394762

Comment from tomczk on 3 February 2023 at 13:08

Thanks, I knew they had the planet dump converted to ORC in the AWS PD bucket, but I did not know that there was changeset data as well.

Unfortunately it’s all in ORC format, which DuckDB does not support natively at the moment :(

I think I could use e.g. Arrow to read it, but I didn’t delve very deep into their documentation, and at first glance it seems like much more work to get the same result: not having to load the entire file into memory.

Comment from JoLacrampe on 12 February 2023 at 14:36

Hello! Great work :)

I did something similar but converted to JSON instead of Parquet, and then used Datadog to create some charts. You can see the MVP here: osm-monitor.com.

There is also a dashboard focused on editors.

Hope this can give you some ideas.

Happy to sync with you to continue the discussion! Cheers

Comment from tomczk on 15 February 2023 at 00:23

Thanks. Your dashboard looks very nice.

I was curious how you handled assigning countries to changesets. Seems like you get centroid of changeset and reverse geocode it and get country from that. I would have tried either assigning all countries intersecting changeset or only adding countries to changesets that are completely within the borders (maybe with some buffer). Or going the full H3 route. Maybe I’ll try it in the future.
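
To make the three options concrete, here is a toy sketch that uses plain bounding boxes as stand-ins for real country borders. That is an assumption purely for illustration: real borders would need proper polygons and a geometry library. The country names and coordinates are made up.

```python
from typing import NamedTuple


class BBox(NamedTuple):
    min_lon: float
    min_lat: float
    max_lon: float
    max_lat: float


def centroid(box: BBox) -> tuple:
    """Centre point (lon, lat) of a bounding box."""
    return ((box.min_lon + box.max_lon) / 2, (box.min_lat + box.max_lat) / 2)


def contains_point(box: BBox, lon: float, lat: float) -> bool:
    return box.min_lon <= lon <= box.max_lon and box.min_lat <= lat <= box.max_lat


def intersects(a: BBox, b: BBox) -> bool:
    return (a.min_lon <= b.max_lon and a.max_lon >= b.min_lon
            and a.min_lat <= b.max_lat and a.max_lat >= b.min_lat)


def within(inner: BBox, outer: BBox) -> bool:
    return (inner.min_lon >= outer.min_lon and inner.max_lon <= outer.max_lon
            and inner.min_lat >= outer.min_lat and inner.max_lat <= outer.max_lat)


# Hypothetical "countries": two adjacent boxes sharing a border at lon = 10.
COUNTRIES = {"A": BBox(0, 0, 10, 10), "B": BBox(10, 0, 20, 10)}


def by_centroid(changeset: BBox) -> list:
    lon, lat = centroid(changeset)
    return [name for name, box in COUNTRIES.items() if contains_point(box, lon, lat)]


def by_intersection(changeset: BBox) -> list:
    return [name for name, box in COUNTRIES.items() if intersects(changeset, box)]


def by_containment(changeset: BBox) -> list:
    return [name for name, box in COUNTRIES.items() if within(changeset, box)]


# A changeset that straddles the border between A and B.
changeset = BBox(8, 2, 13, 4)
print(by_centroid(changeset))      # ['B']
print(by_intersection(changeset))  # ['A', 'B']
print(by_containment(changeset))   # []
```

The three strategies only disagree for changesets that cross a border, which is exactly the case where the choice matters.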

Comment from JoLacrampe on 16 February 2023 at 14:15

Thanks!

“Seems like you get centroid of changeset and reverse geocode it and get country from that.”

It’s exactly this! I’m taking the center of the “square” defined in the changeset bounding box.

“I would have tried either assigning all countries intersecting changeset or only adding countries to changesets that are completely within the borders (maybe with some buffer)”

I would not overthink this, since most changesets are scoped to a district or city. OSM best practice says that changesets “should be local” (source: https://wiki.openstreetmap.org/wiki/Changeset).

Also, people have asked me for a few things that seem doable with your tool:

  • An overview for a particular country: the idea would be to have the same set of graphs, and just be able to select the country as a variable (like from a dropdown)
  • Same for a particular username
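
One possible way to wire such filters into a query is sketched below, assuming the chosen value comes from something like a dropdown. The column names `user` and `country` are hypothetical (a country column would have to be derived first, since the changeset dump has none), and the URL is a placeholder. Values are passed as `?` parameters rather than pasted into the SQL.

```python
# Hypothetical filterable columns; adjust to the real Parquet schema.
ALLOWED_FILTERS = {"user", "country"}


def build_filtered_query(parquet_url: str, filters: dict):
    """Return (sql, params) for monthly changeset counts with optional filters."""
    clauses, params = [], []
    for column, value in filters.items():
        if column not in ALLOWED_FILTERS:
            raise ValueError(f"unsupported filter column: {column}")
        if value is not None:  # None means "no selection" for that filter
            clauses.append(f'"{column}" = ?')
            params.append(value)
    where = f"WHERE {' AND '.join(clauses)} " if clauses else ""
    safe_url = parquet_url.replace("'", "''")  # escape quotes for the SQL literal
    sql = (
        "SELECT date_trunc('month', created_at) AS month, count(*) AS changesets "
        f"FROM read_parquet('{safe_url}') "
        f"{where}"
        "GROUP BY 1 ORDER BY 1"
    )
    return sql, params


sql, params = build_filtered_query(
    "https://example.com/changesets.parquet",  # placeholder URL
    {"user": "some_mapper", "country": None},
)
print(sql)
print(params)
```

With DuckDB’s Python client the pair could then be run as `con.execute(sql, params)`, and the same set of charts redrawn for whichever value is selected.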

Comment from StefanoCudini on 9 August 2023 at 08:31

I agree with you that OSM dumps should be available in many more pre-digested formats. This would make them easier to use, especially in academic environments.

