Recently at Mapbox we open sourced RoboSat, our end-to-end pipeline for feature extraction from aerial and satellite imagery. In the following I will show you how to run the full RoboSat pipeline on your own imagery, using drone imagery from the OpenAerialMap project in Tanzania as an example.
Goal
For this step by step guide let’s extract buildings in the area around Dar es Salaam and Zanzibar. I encourage you to check out the amazing Zanzibar Mapping Initiative and OpenAerialMap for context around the drone imagery and where these projects are heading.
High-level steps
To extract buildings from drone imagery we need to run the RoboSat pipeline, which consists of:
- data preparation: creating a dataset for training feature extraction models
- training and modeling: training segmentation models for feature extraction in images
- post-processing: turning segmentation results into cleaned and simple geometries
I will first walk you through creating a dataset based on drone imagery available on OpenAerialMap and corresponding building masks bootstrapped from OpenStreetMap geometries. Then I will show you how to train the RoboSat segmentation model to spot buildings in new drone imagery. And in the last step I will show you how to use the trained model to predict simplified polygons for detected buildings not yet mapped in OpenStreetMap.
Data Preparation
The Zanzibar Mapping Initiative provides their drone imagery through OpenAerialMap.
Here is a map where you can manually navigate this imagery.
To train RoboSat’s segmentation model we need a dataset consisting of Slippy Map tiles for drone images and corresponding building masks. You can think of these masks as binary images which are zero where there is no building and one for building areas.
Let’s give it a try for Dar es Salaam and Zanzibar, fetching a bounding box to start with.
We start by extracting building geometries from OpenStreetMap and figuring out where we need drone imagery for the training dataset. To do this we need to cut out the area we are interested in from OpenStreetMap.
Our friends over at Geofabrik provide convenient and up-to-date extracts we can work with. The osmium-tool then allows us to cut out the area we are interested in.
wget --limit-rate=1M http://download.geofabrik.de/africa/tanzania-latest.osm.pbf
osmium extract --bbox '38.9410400390625,-7.0545565715284955,39.70458984374999,-5.711646879515092' tanzania-latest.osm.pbf --output map.osm.pbf
Perfect, now we have a map.osm.pbf for Dar es Salaam and Zanzibar to extract building geometries from!
RoboSat comes with a tool, rs extract, to extract geometries from an OpenStreetMap base map.
rs extract --type building map.osm.pbf buildings.geojson
Now that we have a buildings.geojson with building geometries, we need to generate all Slippy Map tiles which have buildings in them. For buildings, zoom level 19 or 20 seems reasonable.
rs cover --zoom 20 buildings.geojson buildings.tiles
Based on the generated buildings.tiles file we can then:
- download drone imagery tiles from OpenAerialMap, and
- rasterize the OpenStreetMap geometries into corresponding mask tiles
Here is a preview of what we want to generate and train the segmentation model on.
If you look closely you will notice the masks are not always perfect. Because we will train our model on thousands of images and masks, a slightly noisy dataset will still work fine.
The easiest way for us to create the drone image tiles is through the OpenAerialMap API. We can use its /meta endpoint to query all available drone images within a specific area.
http 'https://api.openaerialmap.org/meta?bbox=38.9410400390625,-7.0545565715284955,39.70458984374999,-5.711646879515092'
The response is a JSON array with metadata for all drone imagery within this bounding box. We can filter these responses by piping them into jq and selecting by attributes, e.g. by acquisition date or by user name.
http 'https://api.openaerialmap.org/meta?bbox=38.9410400390625,-7.0545565715284955,39.70458984374999,-5.711646879515092' | jq '.results[] | select(.user.name == "ZANZIBAR MAPPING INITIATIVE") | {user: .user.name, date: .acquisition_start, uuid: .uuid}'
This will give us one JSON object per geo-referenced and stitched GeoTIFF image.
{
"user": "ZANZIBAR MAPPING INITIATIVE",
"date": "2017-06-07T00:00:00.000Z",
"uuid": "https://oin-hotosm.s3.amazonaws.com/5ac7745591b5310010e0d49a/0/5ac7745591b5310010e0d49b.tif"
}
Now we have two options:
- download the GeoTIFFs and cut out the tiles where there are buildings, or
- query the OpenAerialMap API’s Slippy Map endpoint for the tiles directly
We can tile the GeoTIFFs with a small tool on top of rasterio and rio-tiler (see the sketch after the note below). Or, for the second option, we can download the tiles directly from the OpenAerialMap Slippy Map endpoints (changing the uuids).
rs download https://tiles.openaerialmap.org/5ac626e091b5310010e0d480/0/5ac626e091b5310010e0d481/{z}/{x}/{y}.png buildings.tiles
Note: OpenAerialMap provides multiple Slippy Map endpoints, one for every GeoTIFF.
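For the first option, here is a minimal sketch using GDAL’s gdal2tiles as a stand-in for a custom rasterio/rio-tiler tool; the GeoTIFF URL is the uuid from the /meta response above, and the --xyz flag assumes GDAL >= 3.1 (older versions write TMS-ordered tiles).
# download one of the stitched GeoTIFFs and cut it into 256x256 Slippy Map tiles
# (--xyz writes Slippy Map tile order and needs GDAL >= 3.1)
wget https://oin-hotosm.s3.amazonaws.com/5ac7745591b5310010e0d49a/0/5ac7745591b5310010e0d49b.tif
gdal2tiles.py --zoom=20 --xyz 5ac7745591b5310010e0d49b.tif images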
In both cases the result is the same: a Slippy Map directory with drone image tiles of size 256x256 (by default; you can run the pipeline with 512x512 images for some efficiency gains, too).
To create the corresponding masks we can use the extracted building geometries and the list of tiles they cover to rasterize image tiles.
rs rasterize --dataset dataset-building.toml --zoom 20 --size 256 buildings.geojson buildings.tiles masks
Before rasterizing we need to create a dataset-building.toml; have a look at the parking lot config RoboSat comes with, change the tile size to 256 and the classes to background and building (we only support binary models right now). The other configuration values are not needed right now; we will come back to them later.
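As a minimal sketch, assuming the same layout as the parking lot config (key names may differ between RoboSat versions), creating the config could look roughly like this:
# hypothetical dataset-building.toml, modeled on the parking lot config;
# verify the key names against the config your RoboSat version ships
cat > dataset-building.toml <<'EOF'
[common]
  # directory holding the training/validation/evaluation splits
  dataset = '/path/to/dataset'
  # binary model: background class first, then the feature class
  classes = ['background', 'building']
  colors  = ['denim', 'orange']
EOF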
With downloaded drone imagery and rasterized corresponding masks, our dataset is ready!
Training and modeling
The RoboSat segmentation model is a fully convolutional neural net which we will train on pairs of drone images and corresponding masks. To make sure the model can generalize to images it has never seen before, we need to split our dataset into:
- a training dataset on which we train the model
- a validation dataset on which we calculate metrics after training
- a hold-out evaluation dataset if you want to do hyper-parameter tuning
The recommended ratio is roughly 80/10/10 but feel free to change that slightly.
We can randomly shuffle our buildings.tiles, split it into three files according to our ratio (see the sketch below), and use rs subset to split the Slippy Map directories accordingly.
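As a minimal sketch, here is one way to do the 80/10/10 shuffle and split with GNU coreutils; the training.tiles and evaluation.tiles file names are my choice:
# shuffle the covered tiles, then cut off 80/10/10 slices
shuf buildings.tiles > shuffled.tiles
n=$(wc -l < shuffled.tiles)
n_train=$((n * 80 / 100))
n_val=$((n * 10 / 100))
head -n "$n_train" shuffled.tiles > training.tiles
sed -n "$((n_train + 1)),$((n_train + n_val))p" shuffled.tiles > validation.tiles
sed -n "$((n_train + n_val + 1)),\$p" shuffled.tiles > evaluation.tiles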
rs subset images validation.tiles dataset/validation/images
rs subset masks validation.tiles dataset/validation/labels
Repeat for training and evaluation.
Before training the model we need to calculate the class distribution since background and building pixels are not evenly distributed in our images.
rs weights --dataset dataset-building.toml
Save the weights in the dataset configuration file, which training will then pick up.
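Hypothetically, again assuming the parking lot config’s layout (the section and key names may differ in your version), appending the weights could look like:
# example numbers only; use the values rs weights printed for your dataset
cat >> dataset-building.toml <<'EOF'
[weights]
  values = [1.6, 5.4]
EOF
We can now adapt the model configuration file, e.g. enabling GPUs (CUDA), and then start training.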
rs train --model model-unet.toml --dataset dataset-building.toml
For each epoch the training process saves the current model checkpoint and a history showing the training and validation loss and metrics. We can pick the best checkpoint by looking at the validation plots.
Using a saved checkpoint allows us to predict segmentation probabilities for every pixel in an image. These segmentation probabilities indicate how likely each pixel is background or building. We can then turn these probabilities into discrete segmentation masks.
rs predict --tile_size 256 --model model-unet.toml --dataset dataset-building.toml --checkpoint checkpoint-00038-of-00050.pth images segmentation-probabilities
rs masks segmentation-masks segmentation-probabilities
Note: both rs predict and rs masks transform Slippy Map directories and create .png files with a color palette attached for visual inspection.
These Slippy Map directories can be served via an HTTP server and then visualized directly as a raster layer in a map. We also provide an on-demand tile server with rs serve to do the segmentation on the fly; it is neither efficient nor handles post-processing (tile boundaries, de-noising, vectorization, simplification) and should only be used for debugging purposes.
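Any static file server will do; as a quick sketch, with Python’s built-in server (port 8000 is an arbitrary choice):
# serve the predicted masks; tiles show up at http://localhost:8000/{z}/{x}/{y}.png
cd segmentation-masks && python3 -m http.server 8000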
If you manually check the predictions you will probably notice:
- the segmentation masks already look okay’ish for buildings
- there are false positives where we predict buildings although there are none
The false positives are due to how we created the dataset: we bootstrapped it from tiles with buildings in them. Even though these tiles contain some background pixels, they don’t contain enough background (so-called negative samples) for the model to properly learn what is not a building. If we never show the model a single image of water, it will have a hard time classifying water as background.
There are two ways for us to approach this problem:
- add many randomly sampled background tiles to the training set, re-compute class distribution weights, then train again, or
- use the model we trained on the bootstrapped dataset and predict on tiles where we know there are no buildings; if the model tells us there is a building, put these tiles into the dataset with an all-background mask, then train again
The second option is called “hard-negative mining” and allows us to come up with negative images which contribute most to the model learning about background tiles. We recommend this approach if you want a small, clean, and solid dataset and care about short training time.
For hard-negative mining we can randomly sample tiles which are not in buildings.tiles and predict on them with our trained model.
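A minimal sketch for the sampling, assuming a hypothetical area.tiles file listing all tiles covering the bounding box (the sample size of 1000 is arbitrary):
# keep only tiles without buildings, then sample a batch to predict on
sort area.tiles > area.sorted
sort buildings.tiles > buildings.sorted
comm -23 area.sorted buildings.sorted | shuf -n 1000 > negatives.tiles
Then we can make use of the rs compare tool to create images visualizing the tiles without buildings next to their predictions.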
rs compare visualizations images segmentation-masks
After making sure these are really background images and not just unmapped buildings in OpenStreetMap, we can put the negative samples into our dataset with a corresponding all-background mask. Then run rs weights again, update the dataset config, and re-train.
It is common to do a couple rounds of hard-negative mining and re-training, resulting in a solid and small dataset which helps the model most for learning.
Congratulations, you now have a solid model ready for prediction!
Here are the segmentation probabilities I got after spending a few hours on hard-negative mining.
Interesting to see here: the model is not entirely sure about building construction sites. It is on us to make an explicit decision when creating the dataset and when doing hard-negative mining: do we want to include building construction sites or not?
These edge-cases occur with all features and make up the boundaries of your feature’s visual appearance. Are house-boats still buildings? Are parking lots without parking aisle lane markings still parking lots? Make a call and be consistent.
Finally, the post-processing steps are responsible for turning the segmentation masks into simplified and vectorized GeoJSON features, potentially spanning multiple tiles. We also provide tools for de-duplicating detections against OpenStreetMap to filter out already mapped features.
I won’t go into post-processing details in this guide: the segmentation masks based on this small training dataset are still a bit too rough for it to work well, and the RoboSat post-processing is currently tuned to parking lots on zoom level 18, so I had to make some in-place adaptations when running it here.
Summary
In this step-by-step guide we walked through the RoboSat pipeline from creating a dataset, to training the segmentation model, to predicting buildings in drone imagery. All tools and datasets used in this guide are open source and openly available, respectively.
Give it a try! https://github.com/mapbox/robosat
I’m happy to hear your feedback, ideas and use-cases either here, on a ticket, or by mail.