OpenStreetMap

GeoGit and GitHub Geo

Posted by mikelmaron on 26 September 2013 in English.

As I’ve been exploring the OSM rails app for other data, Git has hovered in the background of my thoughts, and I’ve been watching GeoGit and GitHub Geo Features closely. The conceptual basis of Git, distributed version control, solves issues we come up against regularly in OpenStreetMap, such as how to keep an “authoritative” data source and community data in sync, or how to support offline editing in areas with bad or non-existent connectivity (something to explore with BRCK perhaps). As Jeff Johnson says, “OSM is a geodata repository with just a single branch”.

GeoGit

Chris Holmes gives a thorough recap of BoundlessGeo’s rationale and work so far with GeoGit (part 1, part 2), including experiments with using Git itself. Git is built around managing revisions of individual files, and hits performance issues with very large files, or very large numbers of directories (which early GeoGit experimented with, using a directory hierarchy to support quad-tree indexing). So they worked to decouple Git’s set of verbs from its backend, implement those concepts on top of spatial databases, and provide special verbs specifically for interacting with OpenStreetMap (or even perhaps, OSM clones). Perhaps that’s comparable to Git integrating with svn.
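
To make the directory-hierarchy idea concrete, here is a minimal sketch of how a feature could be filed under a quad-tree path, one directory level per quadrant. This is not GeoGit’s actual layout, which isn’t described here; the Web Mercator tiling scheme, the function names and the `.json` file naming are all assumptions for illustration.

```python
import math


def quadkey(lon: float, lat: float, zoom: int) -> str:
    """Return the quad-tree key of the Web Mercator tile containing a point.

    Each digit (0-3) selects one quadrant at one level of the tree.
    """
    # Web Mercator is undefined at the poles, so clamp the latitude.
    lat = max(min(lat, 85.05112878), -85.05112878)
    n = 1 << zoom  # number of tiles per axis at this zoom level
    x = int((lon + 180.0) / 360.0 * n)
    y = int((1.0 - math.asinh(math.tan(math.radians(lat))) / math.pi) / 2.0 * n)
    x = min(max(x, 0), n - 1)
    y = min(max(y, 0), n - 1)

    digits = []
    for level in range(zoom, 0, -1):
        mask = 1 << (level - 1)
        digit = 0
        if x & mask:
            digit += 1  # eastern half of the parent cell
        if y & mask:
            digit += 2  # southern half of the parent cell
        digits.append(str(digit))
    return "".join(digits)


def feature_path(feature_id: str, lon: float, lat: float, zoom: int = 8) -> str:
    """Hypothetical repository path: one directory per quad-tree level."""
    parts = list(quadkey(lon, lat, zoom)) + [f"{feature_id}.json"]
    return "/".join(parts)


if __name__ == "__main__":
    # A point near Nairobi, filed eight levels deep.
    print(feature_path("node-1", 36.82, -1.29))
```

With millions of features, a layout like this produces exactly the very large number of directories that Git handles poorly, which is the performance issue described above.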

The work looks really promising, though they are still working on the internal technical challenges, and they’ve set the bar high: to fork the entirety of OSM, including all history! They’re tremendously talented, so I expect they can get there. But what then? Git without the interface and social features of GitHub is a frustrating experience. Replicating that kind of community space for GeoGit is another tall, tall order, and I expect not the first order of business for a GIS-oriented customer base.

Now, if interacting with OSM clones is a core part of GeoGit, then perhaps we can simply use the OSM application ecosystem for editing and socializing on an individual branch. In that case, it makes sense to invest effort in improving OSM website features for Moabi, and to leverage any future interoperability with GeoGit itself.

GitHub Geo

“GitHub Geo” takes another approach: accept the file limitations of Git, GitHub and browser client displays (split up large files, if needed, etc). For rendering GeoJSON on GitHub, the limits seem to be in the 5-10 MB range. They seem to be loading the data as a GeoJSON layer in Leaflet/MapboxJS, so a performance improvement would be rendering that data into map tiles and UTFGrids. I would guess that there’s already been some thinking about what kind of infrastructure would be needed to support that, and that they’re watching uptake of the Geo features before making that kind of investment.
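
Splitting up a large file is simple enough to sketch. The snippet below is a minimal example, not anything GitHub provides: it breaks a GeoJSON FeatureCollection into smaller collections that each stay under a byte budget. The function name and the 5 MB figure in the usage note are assumptions, the latter taken from the rough limit guessed at above.

```python
import json


def split_feature_collection(collection: dict, max_bytes: int) -> list:
    """Split a FeatureCollection into chunks of roughly max_bytes each.

    The budget counts only the serialized features, not the small
    FeatureCollection wrapper. A single feature larger than the budget
    still gets a chunk of its own rather than being dropped.
    """
    chunks = []
    current = []
    current_size = 0
    for feature in collection["features"]:
        size = len(json.dumps(feature).encode("utf-8"))
        # Start a new chunk when this feature would overflow the current one.
        if current and current_size + size > max_bytes:
            chunks.append({"type": "FeatureCollection", "features": current})
            current, current_size = [], 0
        current.append(feature)
        current_size += size
    if current:
        chunks.append({"type": "FeatureCollection", "features": current})
    return chunks


if __name__ == "__main__":
    points = {
        "type": "FeatureCollection",
        "features": [
            {
                "type": "Feature",
                "properties": {"n": i},
                "geometry": {"type": "Point", "coordinates": [i, i]},
            }
            for i in range(1000)
        ],
    }
    parts = split_feature_collection(points, max_bytes=20_000)
    print(len(parts), "chunks")
```

For real data the call would be something like `split_feature_collection(data, 5 * 1024 * 1024)`, with each chunk written to its own file in the repository. Features are kept in their original order, so the split is stable from one commit to the next as long as the input order is.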

The approach takes full advantage of GitHub’s social functions. There have been some fun examples of this, and it will be great to see if the city of Chicago will accept pull requests. Just as important is the GitHub API, which can be used to build applications that interact with GeoJSON files. This is what MapBox has been doing with Prose.IO to edit GitHub Pages, and their GeoJSON.io provides an excellent editing environment for geodata coupled to GitHub, and would be the first mapping example of this pattern. (And as a mindbender, the entire GeoJSON.io site is hosted with GitHub Pages.) GeoJSON.io also has performance limits, probably similar to what we’ve seen with iD (I assume some of the internals are similar, but I haven’t looked).

Another service matching this pattern, launched this week, is GitSpatial, which selectively syncs GeoJSON files in your repos and provides a simple geospatial query API, for example an API for Kenyan constituency boundaries, or for counties near Nairobi.

This pattern could conceivably be used to build a rendering service that builds those tile sets and grids, or vector tiles, from very large GitHub GeoJSON files. Or a combination of GitSpatial’s indexing and geo API with GeoJSON.io could make it possible to edit a select area of a large file, and commit just those changes. That would, I suppose, require GitSpatial to reassemble the GeoJSON features in the order they were received, or to use some convention for ordering features based on a geographic index, and then commit a diff to that file. Another useful service would be visualizing geographic diffs, something that OSM itself (or anything really) doesn’t do particularly well. Though that feature is something I could expect from GitHub soon, seeing that they are now doing 3D file diffs.
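
Before a geographic diff can be visualized, something has to work out which features changed between two versions of a file. Here is a minimal sketch of that first step, assuming every feature carries a stable top-level `id` (GeoJSON allows this but does not require it); the function names are made up for illustration and this is not how GitHub or GitSpatial do it.

```python
import json


def _index(collection: dict) -> dict:
    """Map feature id -> canonical JSON text, so key order never matters."""
    return {
        feature["id"]: json.dumps(feature, sort_keys=True)
        for feature in collection["features"]
    }


def diff_features(old: dict, new: dict) -> dict:
    """Return the ids of features added, removed or modified between versions.

    Assumes ids are unique within a collection and mutually sortable
    (for example, all strings).
    """
    before, after = _index(old), _index(new)
    return {
        "added": sorted(after.keys() - before.keys()),
        "removed": sorted(before.keys() - after.keys()),
        "modified": sorted(
            fid for fid in before.keys() & after.keys()
            if before[fid] != after[fid]
        ),
    }


def point(fid: str, lon: float, lat: float) -> dict:
    """Small helper for building example features."""
    return {
        "type": "Feature",
        "id": fid,
        "properties": {},
        "geometry": {"type": "Point", "coordinates": [lon, lat]},
    }


if __name__ == "__main__":
    v1 = {"type": "FeatureCollection",
          "features": [point("a", 0, 0), point("b", 1, 1)]}
    v2 = {"type": "FeatureCollection",
          "features": [point("b", 1, 2), point("c", 3, 3)]}
    print(diff_features(v1, v2))
```

Because the comparison is by id rather than by position, it doesn’t care how the features are ordered in the file. A plain text diff does care, which is why a fixed ordering convention matters for keeping commits small.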

Large geographic data collaboration today

This is all so new and untested, and so far not really built with large data sets in mind, unlike the OSM architecture, API and ecosystem, which can pretty solidly handle loads of data and provide lots of services. It’s hard to get this kind of glimpse of the future while still having the needs of today to grapple with. For now, I reckon OSM is a really good place to experiment and build, while we keep a close eye on these other approaches.

Discussion

Comment from robert on 26 September 2013 at 23:16

I’ve previously put a lot of thought into what’s preventing OSM from using a Git/DVCS/DAG-y type model and what makes OSM & large scale geodata fundamentally different from a large software project. The conclusion I came to wasn’t that it was the size of the data that was the important factor, it was the number of contributors. Sure, we know that Git-managed projects can practically scale to hundreds, perhaps thousands of contributors. But tens of thousands? 100,000? Without the process becoming chaotic and necessity of (sometimes complex) “merges” scaring off users?

Because in OSM you essentially have tens of thousands of users editing the same “repository”. And with that you’ll have an awful lot of merges to do, whether they’re semi-automatic or not.

(these really are rhetorical questions - I would love to find out the answers)

Comment from mikelmaron on 27 September 2013 at 12:20

@robert, good point about number of users … can you really have a single Git/GitHub project with 100k contributors? Or a single project with pull requests from 100k forks? Yea, probably not. But that’s not the structure I would expect to see develop. I’d expect to see a bounded number of forks emerge, with admin level merges, and individual users working as collaborators within a repository.

For an individual user, they’d rarely encounter a merge. How often do we encounter edit conflicts in OSM? More than we want clearly! Thankfully rare. It’s not something our tools handle well at all (or any tools really).

So yea, essentially, I would think that the emergent structure wouldn’t necessarily increase the number of merges over what we have to deal with now.

Comment from tmcw on 27 September 2013 at 20:15

Fwiw, I’m pretty bearish on GeoGit, for a number of reasons.

Java is a poor choice of basis - your contributor base is small and userbase has to deal with Java-isms. The ‘multiple backends’ concept makes performance and potential hard to pin down. The military-fund-for-open-source model of project management often produces incomprehensible software (see: GeoPackage). The GeoGit model is, like Git, more or less a glorified hash table whereas OSM is at its core not a list of features but a graph database, and the graph-ness of this graph database makes the idea of versioning it fundamentally different than a shapefile or PostGIS database. The lack of opinion in terms of datatype means that operations will be lossy and non-pure for quite a few datatypes, unless GeoGit’s internal format is a superset of all geospatial data formats. And there’s no real solid answer for how GeoGit will scale to OSM, even if the bottleneck of database throughput is removed.

That’s not to say that I disagree with the intent - nearly everyone agrees with the intent, and wants something-of-this-sort.

Comment from mikelmaron on 27 September 2013 at 20:39

@tmcw: separate from the implementation (and I understand your misgivings), I’m curious about your thoughts on the concept of versioning a graph database, structured as some kind of “OSM JSON”. I don’t see why topological data couldn’t fit in a Git model, with additional checks for consistency when features refer to each other. It’s not Git’s responsibility to ensure your code compiles, nor would data consistency be its responsibility.
