How do we streamline the import proposal/data quality assessment flow?
Posted by CloCkWeRX on 29 July 2024 in English.
The problem
As a society we are generating an increasing amount of data. An unstated goal of OpenStreetMap that many contributors subscribe to is “completeness” or “accuracy”. That works fine when your dataset is small, local and highly detailed, but less so when scaled up to, say, determining whether every traffic light crossing in the world has tactile paving.
So naturally, automation and data imports are where people start to look; and very sensibly there’s a process to propose, review and ingest large datasets.
However, this relies on:
- Expertise and peer review
- Honesty and diligence of the importer to have and execute a QA plan
- A second layer of QA tools and mappers to verify and maintain the data
What could we do differently?
In the semantic web/linked data world, two big concepts emerged. The first is the semantic web layer cake, which describes going from “machine readable” to “schemas” to “query” to “proof” to “trust”. In OSM terms these map roughly to POIs, tags, Overpass, QA tools like KeepRight or Osmose, and, at the moment, human boots-on-the-ground survey.
The second is 5 star open data, which grades a dataset from one star (published on the web under an open licence, in any format) up to five stars (machine-readable, linked data using open standards). It is focused on the idea that we have a lot of data locked up in silos, and while it would be ideal to align it to every standard and have the highest quality data possible, 95% of the time it is better to publish something rather than wait until it is perfect. So long as data consumers have an idea of the limitations, they can apply judgement when attempting to use it.
What is the current state?
A number of open data portals provide basic indicators of “5 star open data” quality.
In our wiki, we maintain documentation which describes the OSM community’s view on data quality of an external dataset.
We have changeset tags describing the source of an edit.
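As a rough illustration of that last point, an import changeset today might carry tags like the following. This is a hypothetical example: `comment`, `source` and `import=yes` are existing conventions, but the values are made up.

```python
# Hypothetical tags on an import changeset. The keys follow existing
# OSM changeset-tag conventions; the values are invented for illustration.
changeset_tags = {
    "comment": "Import pedestrian crossings from the council open data portal, batch 3",
    "source": "Example Council Open Data Portal (2024 extract)",
    "import": "yes",
}
```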
What specifically would we change?
I’m proposing a set of tools, or standard metadata, for annotating external datasets and proposed/approved exports, so that editing and conflation tools can reason about the quality of the data.
For example, if you have a dataset which is derived from OSM, corrects wrong tags, and has been human-verified against a random sample of 5% of the data? That’s a good candidate for letting a maintenance bot operate on it with minimal oversight, and is potentially 5 star quality.
Have a stream of AI-generated shop names from street-level imagery? Tag that as 2/5 and flag it as requiring human verification, even if that is just one-click approval.
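As a minimal sketch of what such metadata could look like, here is a hypothetical `DatasetQuality` record covering the two examples above. Every field name here is my own invention rather than an existing standard; it is only meant to show the kind of information tools could reason about.

```python
from dataclasses import dataclass

@dataclass
class DatasetQuality:
    """Hypothetical quality metadata attached to an external dataset or export proposal."""
    name: str
    open_data_stars: int          # 1-5, per the 5 star open data model
    provenance: str               # e.g. "derived from OSM", "AI-extracted from street imagery"
    human_verified_sample: float  # fraction of records spot-checked by a human (0.0-1.0)
    requires_human_review: bool   # must each change be approved by a mapper?
    license: str

# The two examples from the text, expressed in this hypothetical schema:
osm_derived_fixes = DatasetQuality(
    name="Wrong-tag corrections derived from OSM",
    open_data_stars=5,
    provenance="derived from OSM",
    human_verified_sample=0.05,   # 5% random sample manually checked
    requires_human_review=False,  # candidate for a lightly supervised maintenance bot
    license="ODbL",
)

ai_shop_names = DatasetQuality(
    name="AI-generated shop names from street-level imagery",
    open_data_stars=2,
    provenance="AI-extracted from street imagery",
    human_verified_sample=0.0,
    requires_human_review=True,   # even if review is one-click approval
    license="CC-BY-4.0",
)
```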
What would be the impact?
By having these standards in place, tools that are typically used for bulk imports or conflation can add extra guard rails around the process; and from a community review/import approval perspective it becomes a discussion about the higher-risk aspects of an import.
It also greenlights a degree of automated maintenance activity: once data is imported and mappers are prompted to confirm accuracy on the ground, it becomes lower risk to trust that data source for bots updating existing attributes.
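Continuing the hypothetical `DatasetQuality` sketch above, a conflation or maintenance tool could gate how much automation it allows on that metadata. The thresholds below are placeholders; in practice they would come out of the community import review, not hard-coded values.

```python
def allowed_automation(ds: DatasetQuality) -> str:
    """Hypothetical guard rail: how much automation does this dataset warrant?"""
    if ds.requires_human_review:
        # Every change is queued for a mapper's (possibly one-click) approval.
        return "assisted"
    if ds.open_data_stars >= 4 and ds.human_verified_sample >= 0.05:
        # Trusted enough for a maintenance bot updating existing attributes.
        return "bot"
    # Otherwise treat it like an ordinary manual import.
    return "manual"

print(allowed_automation(osm_derived_fixes))  # -> "bot"
print(allowed_automation(ai_shop_names))      # -> "assisted"
```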