The Failover Issue and Publishing Derived Datasets

Posted by SimonPoole on 28 April 2015 in English. Last updated on 6 May 2015.

It is time that we lay the geocoding-related licence discussion to rest by forming a consensus on a guideline.

It is well known that I support the view that the results of bulk geocoding form a derived database, and that I support the corresponding conclusions on the Geocoding Guideline page.

However, Example 7 glosses over a point that has been raised in the past, for example by Steve Coast: are failed geocoding results really free of OSM intellectual property? For clarity: we are not discussing on-the-fly geocoding, as no database is created and there is nothing to share.

We need to resolve this to move forward on the matter.

I don’t believe there is a clear and conclusive answer to the above, and there is a certain danger of getting into “how many angels can dance on the head of a pin” type discussions. So I believe it boils down to this: what is the OSM community happy with? Naturally, with the backdrop of the ODbL in mind.

I suggest something very simple: that the set of failed addresses (or more generally, the input data) should be shared with the OSM community. I am not saying that the failed addresses are subject to the ODbL SA clauses, just that we should treat them as if they were.

Now you might ask why we would be interested in failed addresses. On the one hand they can be mined for additional information, just like the successfully geocoded ones, for example for house number -> post code relationships. On the other hand the list of failed addresses is obviously helpful for quality assurance.
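As a rough illustration of both uses, here is a minimal Python sketch. The record layout (street, housenumber, postcode, city), the sample addresses and the function names are assumptions made up for this example; they are not a format defined by OSM or by any particular geocoder.

```python
from collections import Counter, defaultdict

# Hypothetical failed-geocoding inputs. The field names and values are
# assumptions for illustration only.
failed = [
    {"street": "Bahnhofstrasse", "housenumber": "12", "postcode": "8001", "city": "Zürich"},
    {"street": "Bahnhofstrasse", "housenumber": "14", "postcode": "8001", "city": "Zürich"},
    {"street": "Seestrasse", "housenumber": "3", "postcode": "8002", "city": "Zürich"},
]


def failures_by_postcode(records):
    """Count failed addresses per post code (a simple quality assurance view)."""
    return Counter(r["postcode"] for r in records if r.get("postcode"))


def housenumber_postcodes(records):
    """Map (city, street, house number) to the set of post codes seen in the input."""
    mapping = defaultdict(set)
    for r in records:
        if r.get("postcode") and r.get("housenumber"):
            key = (r["city"], r["street"], r["housenumber"])
            mapping[key].add(r["postcode"])
    return dict(mapping)


if __name__ == "__main__":
    # Post codes with the most failures come first.
    for postcode, count in failures_by_postcode(failed).most_common():
        print(postcode, count)
```

The counts point mappers at areas where address coverage is thin, and the second mapping yields candidate house number -> post code pairs that would still need checking before anything is added to OSM.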

And I believe that this, particularly the latter point, creates a win-win situation for the organisation doing the geocoding and for OSM. The win for the geocoding organisation is that more of its addresses will be found in OSM and its reliance on third-party datasets will be reduced.

Now, assuming that a consensus forms around the above, there is still a slightly touchy issue: companies may not want to be identified as the source of specific addresses. To resolve this I propose providing a facility through which such input datasets can be submitted to the community and published anonymously (there is at least one system in existence that could simply be cloned to provide this facility).

Note: all of the above only applies to datasets that are being used publicly, so there can’t be an expectation of a high level of data privacy to start with.
