RoboSat ❤️ Tanzania

Posted by daniel-j-h on 5 July 2018 in English (English)

Recently at Mapbox we open sourced RoboSat, our end-to-end pipeline for feature extraction from aerial and satellite imagery. In the following I will show you how to run the full RoboSat pipeline on your own imagery, using drone imagery from the OpenAerialMap project in Tanzania as an example.


For this step-by-step guide let's extract buildings in the area around Dar es Salaam and Zanzibar. I encourage you to check out the amazing Zanzibar Mapping Initiative and OpenAerialMap for context around the drone imagery and where these projects are heading.

High-level steps

To extract buildings from drone imagery we need to run the RoboSat pipeline consisting of

  • data preparation: creating a dataset for training feature extraction models
  • training and modeling: segmentation models for feature extraction in images
  • post-processing: turning segmentation results into cleaned and simple geometries

I will first walk you through creating a dataset based on drone imagery available on OpenAerialMap and corresponding building masks bootstrapped from OpenStreetMap geometries. Then I will show you how to train the RoboSat segmentation model to spot buildings in new drone imagery. And in the last step I will show you how to use the trained model to predict simplified polygons for detected buildings not yet mapped in OpenStreetMap.

Data Preparation

The Zanzibar Mapping Initiative provides their drone imagery through OpenAerialMap.

Here is a map where you can manually navigate this imagery.

To train RoboSat's segmentation model we need a dataset consisting of Slippy Map tiles for drone images and corresponding building masks. You can think of these masks as binary images: zero where there is no building and one for building areas.

Let's give it a try for Dar es Salaam and Zanzibar, fetching a bounding box to start with.

Let's start by extracting building geometries from OpenStreetMap and figuring out where we need drone imagery for the training dataset. To do this we need to cut out the area we are interested in from OpenStreetMap.

Our friends over at GeoFabrik provide convenient and up-to-date extracts we can work with. The osmium-tool then allows us to cut out the area we are interested in.

wget --limit-rate=1M

osmium extract --bbox '38.9410400390625,-7.0545565715284955,39.70458984374999,-5.711646879515092' tanzania-latest.osm.pbf --output map.osm.pbf

Perfect, now we have a map.osm.pbf for Dar es Salaam and Zanzibar to extract building geometries from!

RoboSat comes with a tool rs extract to extract geometries from an OpenStreetMap base map.

rs extract --type building map.osm.pbf buildings.geojson

Now that we have a buildings.geojson with building geometries we need to generate all Slippy Map tiles which have buildings in them. For buildings zoom level 19 or 20 seems reasonable.

rs cover --zoom 20 buildings.geojson buildings.tiles
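Under the hood, covering geometries with tiles boils down to standard Web Mercator tile math. As a rough sketch (not RoboSat's actual implementation), this is how a longitude/latitude pair maps to a Slippy Map tile at a given zoom level:

```python
import math

def lonlat_to_tile(lon, lat, zoom):
    """Map a WGS84 coordinate to Slippy Map tile coordinates at a zoom level."""
    n = 2 ** zoom
    x = int((lon + 180.0) / 360.0 * n)
    y = int((1.0 - math.asinh(math.tan(math.radians(lat))) / math.pi) / 2.0 * n)
    return x, y

# A point in Dar es Salaam at zoom 20
print(lonlat_to_tile(39.28, -6.81, 20))
```

rs cover runs this kind of computation for every geometry in buildings.geojson and writes out the covering tile ids.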

Based on the generated buildings.tiles file we can then

  • download drone imagery tiles from OpenAerialMap, and
  • rasterize the OpenStreetMap geometries into corresponding mask tiles

Here is a preview of what we want to generate and train the segmentation model on.

If you look closely you will notice the masks are not always perfect. Because we will train our model on thousands of images and masks, a slightly noisy dataset will still work fine.

The easiest way for us to create the drone image tiles is through the OpenAerialMap API. We can use its /meta endpoint to query all available drone images within a specific area.

http ',-7.0545565715284955,39.70458984374999,-5.711646879515092'

The response is a JSON array with metadata for all drone imagery within this bounding box. We can filter these responses with jq by their attributes, e.g. by acquisition date or by user name.

jq '.results[] | select( == "ZANZIBAR MAPPING INITIATIVE") | {user:, date: .acquisition_start, uuid: .uuid}'

Which will give us one JSON object per geo-referenced and stitched GeoTIFF image.

  "date": "2017-06-07T00:00:00.000Z",
  "uuid": ""

Now we have two options:

  • download the GeoTIFFs and cut out the tiles where there are buildings, or
  • query the OpenAerialMap API's Slippy Map endpoint for the tiles directly

We can tile the GeoTIFFs with a small tool on top of rasterio and rio-tiler. Or for the second option we can download the tiles directly from the OpenAerialMap Slippy Map endpoints (changing the uuids).

rs download{z}/{x}/{y}.png buildings.tiles

Note: OpenAerialMap provides multiple Slippy Map endpoints, one for every GeoTIFF.

In both cases the result is the same: a Slippy Map directory with drone image tiles of size 256x256 (by default; you can run the pipeline with 512x512 images for some efficiency gains, too).

To create the corresponding masks we can use the extracted building geometries and the list of tiles they cover to rasterize image tiles.

rs rasterize --dataset dataset-building.toml --zoom 20 --size 256 buildings.geojson buildings.tiles masks

Before rasterizing we need to create a dataset-building.toml; have a look at the parking lot config RoboSat comes with and change the tile size to 256 and the classes to background and building (we only support binary models right now). Other configuration values are not needed right now; we will come back to them later.

With downloaded drone imagery and rasterized corresponding masks, our dataset is ready!

Training and modeling

The RoboSat segmentation model is a fully convolutional neural net which we will train on pairs of drone images and corresponding masks. To make sure these models can generalize to images never seen before we need to split our dataset into:

  • a training dataset on which we train the model
  • a validation dataset on which we calculate metrics after training
  • a hold-out evaluation dataset if you want to do hyper-parameter tuning

The recommended ratio is roughly 80/10/10 but feel free to change that slightly.

We can randomly shuffle our buildings.tiles, split it into three files according to our ratio, and use rs subset to split the Slippy Map directories.

rs subset images validation.tiles dataset/validation/images
rs subset masks validation.tiles dataset/validation/labels

Repeat for training and evaluation.
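The shuffling and splitting itself can be done with shuf and split on the command line, or with a few lines of Python; here is a sketch using a fixed seed so the split is reproducible:

```python
import random

def split_tiles(tiles, ratios=(0.8, 0.1, 0.1), seed=42):
    """Shuffle the tile list and split it into training/validation/evaluation."""
    tiles = list(tiles)
    random.Random(seed).shuffle(tiles)
    n = len(tiles)
    a = int(ratios[0] * n)
    b = a + int(ratios[1] * n)
    return tiles[:a], tiles[a:b], tiles[b:]
```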

Before training the model we need to calculate the class distribution since background and building pixels are not evenly distributed in our images.

rs weights --dataset dataset-building.toml

Save the weights in the dataset configuration file, which training will then pick up. We can now adapt the model configuration file, e.g. enabling GPUs (CUDA) and then start training.
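Conceptually the weights are inverse class frequencies: the rarer a class, the larger its weight in the loss, so the few building pixels are not drowned out by background. A hedged sketch (check the RoboSat source for the exact scheme rs weights uses):

```python
def class_weights(pixel_counts):
    """Inverse-frequency class weights: the rarer a class, the larger its
    weight in the loss. pixel_counts: total pixels per class in the dataset."""
    total = sum(pixel_counts)
    return [total / (len(pixel_counts) * count) for count in pixel_counts]

# e.g. 90% background pixels, 10% building pixels:
# building ends up with roughly nine times the background weight
print(class_weights([900, 100]))
```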

rs train --model model-unet.toml --dataset dataset-building.toml

For each epoch the training process saves the current model checkpoint and a history showing you the training and validation loss and metrics. We can pick the best model checkpoint by looking at the validation plots.

Using a saved checkpoint allows us to predict segmentation probabilities for every pixel in an image. These segmentation probabilities indicate how likely each pixel is background or building. We can then turn these probabilities into discrete segmentation masks.

rs predict --tile_size 256 --model model-unet.toml --dataset dataset-building.toml --checkpoint checkpoint-00038-of-00050.pth images segmentation-probabilities

rs masks segmentation-masks segmentation-probabilities

Note: both rs predict as well as rs masks transform Slippy Map directories and create .png files with a color palette attached for visual inspection.
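Turning per-pixel class probabilities into a discrete mask is conceptually just an argmax over the class channel; rs masks does this for whole Slippy Map directories. A minimal sketch:

```python
import numpy as np

def probs_to_mask(probs):
    """probs: per-pixel class probabilities of shape (classes, height, width);
    returns the most likely class index for every pixel."""
    return np.argmax(probs, axis=0).astype(np.uint8)

# Two classes (background, building) for a 1x2 pixel tile
probs = np.array([[[0.9, 0.2]],
                  [[0.1, 0.8]]])
print(probs_to_mask(probs))  # → [[0 1]]
```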

These Slippy Map directories can be served via an HTTP server and then visualized directly in a map raster layer. We also provide an on-demand tile server with rs serve to do the segmentation on the fly; it's neither efficient nor does it handle post-processing (tile boundaries, de-noising, vectorization, simplification) and should only be used for debugging purposes.

If you manually check the predictions you will probably notice

  • the segmentation masks already look okay'ish for buildings
  • there are false positives where we predict buildings but there are none

The false positives are due to how we created the dataset: we bootstrapped a dataset based on tiles with buildings in them. Even though these tiles have some background pixels they won't contain enough background (so-called negative samples) to properly learn what is not a building. If we never showed the model a single image of water, it has a hard time classifying it as background.

There are two ways for us to approach this problem:

  • add many randomly sampled background tiles to the training set, re-compute class distribution weights, then train again, or
  • use the model we trained on the bootstrapped dataset and predict on tiles where we know there are no buildings; if the model tells us there is a building put these tiles into the dataset with an all-background mask, then train again

The second option is called "hard-negative mining" and allows us to come up with negative images which contribute most to the model learning about background tiles. We recommend this approach if you want a small, clean, and solid dataset and care about short training time.

For hard-negative mining we can randomly sample tiles which are not in buildings.tiles and predict on them with our trained model. Then we can use the rs compare tool to create side-by-side visualizations of the images without buildings in them and the corresponding predictions.

rs compare visualizations images segmentation-masks

After making sure these are really background images and not just unmapped buildings in OpenStreetMap, we can put the negative samples into our dataset with a corresponding all-background mask. Then run rs weights again, update the dataset config, and re-train.

It is common to do a couple rounds of hard-negative mining and re-training, resulting in a solid and small dataset which helps the model most for learning.
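To get candidate tiles for this, something along the following lines works; the x/y tile ranges come from covering your bounding box at zoom 20, and sample_negative_tiles is a hypothetical helper, not a RoboSat command:

```python
import random

def sample_negative_tiles(covered, x_range, y_range, zoom, count, seed=0):
    """Randomly sample tile coordinates inside a tile range that are not
    already covered by building tiles (candidates for hard negatives)."""
    covered = set(covered)
    rng = random.Random(seed)
    samples = set()
    while len(samples) < count:
        tile = (rng.randrange(*x_range), rng.randrange(*y_range), zoom)
        if tile not in covered:
            samples.add(tile)
    return sorted(samples)
```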

Congratulations, you now have a solid model ready for prediction!

Here are the segmentation probabilities I got out after spending a few hours of hard negative mining.

It is interesting to see that the model is not entirely sure about building construction sites. It is on us to make an explicit decision when creating the dataset and when doing hard-negative mining: do we want to include building construction sites or not?

These edge-cases occur with all features and make up the boundaries of your feature’s visual appearance. Are house-boats still buildings? Are parking lots without parking aisle lane markings still parking lots? Make a call and be consistent.

Finally, the post-processing steps are responsible for turning the segmentation masks into simplified and vectorized GeoJSON features potentially spanning multiple tiles. We also provide tools for de-duplicating detections against OpenStreetMap to filter out already mapped features.

I won’t go into post-processing details in this guide: the segmentation masks based on this small training dataset are still a bit too rough for it to work well, the RoboSat post-processing is currently tuned to parking lots on zoom level 18, and I had to make some in-place adaptions when trying it out.


In this step-by-step guide we walked through the RoboSat pipeline from creating a dataset, to training the segmentation model, to predicting buildings in drone imagery. All tools and datasets used in this guide are open source and openly available, respectively.

Give it a try!

I'm happy to hear your feedback, ideas and use-cases either here, on a ticket, or by mail.

Comment from iandees on 5 July 2018 at 20:18

Great walkthrough! Thanks for writing it up.

Comment from PlaneMad on 6 July 2018 at 02:11

Can't believe its so simple, thank you for the tutorial!

Comment from Tomas Straupis on 6 July 2018 at 03:23

Thank you for sharing! How many training images do you think are necessary to have a decently trained model? Are there any plans to release some trained models (e.g. buildings, roads) to the public?

Comment from daniel-j-h on 6 July 2018 at 07:58

The amount of images you will need for training can vary a lot and mostly depends on

  • the imagery quality, and if it's from the same source or not
  • how good and precise the masks for training are
  • the zoom level
  • the variety in the images, and if it's from the same area or totally different
  • the time and processing you can invest

For example, the more hard-negative iterations you do the better the model can distinguish the background class. But hard-negative mining also takes quite a while. Same with the automatically created dataset: you can manually clean it up but it is quite time-intensive.

In addition you could do more data augmentation during training to further artificially embiggen the dataset, you could do test-time augmentation where you predict on the tile and its 90/180/270 degree rotations and then merge the predictions, you could train and predict on multiple zoom levels, and so on.

I would say it also depends on your use-case. For detecting building footprints like in this guide a couple of thousand images are fine to get the rough shapes. It's definitely not great for automatic mapping but that is not my intention in the first place.

Regarding trained models: I recently added an ONNX model exporter to RoboSat which allows for portable model files folks can use with their backend of choice. I could publish the trained ONNX model for this guide since I did it on my own time. The Mapbox models I am not allowed to publish as of writing this.

If there is community interest maybe we can come up with a publicly available model catalogue hosting ONNX models and metadata where folks can easily upload and download models?

Comment from Tomas Straupis on 6 July 2018 at 08:32

Thank you for the answers. I ran through 8000 training and 1000 validation images (no augmentation) and IoU is still below 0.8. Of course images are quite different: buildings in forest, in rural areas, industrial buildings, urban buildings. And I haven't done any hard-negative training. Training this without GPU takes almost a week :-) So yes, it would be useful to have access to trained models at least to compare the results, for others it could be much easier to take a model, run predictions and add missing data to OSM without the need to train and tune the model.

Comment from daniel-j-h on 6 July 2018 at 10:19

IoU is still below 0.8

If you reach an IoU of 0.8 that's pretty amazing to be honest. Here's why. There are two sides contributing to the IoU metric: your predictions can be off but, worse, the OpenStreetMap "ground truth" geometries can be off too. Even with a perfect model you won't reach an IoU of 1.0 since the OpenStreetMap geometries can be - and often are - coarse, or slightly misaligned, or some buildings are not yet mapped in OpenStreetMap, etc.

Here's an interesting experiment: randomly sample from your dataset. Manually annotate the image tiles, generating fine-detailed masks. Now calculate the IoU metric on your manually generated fine-detailed masks and the automatically generated masks from OpenStreetMap. This will be the IoU upper bound you can reach.

Also see this graphic to get a feel for the IoU metric.
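For binary masks the IoU computation itself is tiny; here is a sketch you could use for the experiment above:

```python
import numpy as np

def iou(pred, truth):
    """Intersection over union of two binary masks."""
    pred, truth = np.asarray(pred, bool), np.asarray(truth, bool)
    union = np.logical_or(pred, truth).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as a perfect match
    return np.logical_and(pred, truth).sum() / union

print(iou([[1, 1], [0, 0]], [[1, 0], [0, 0]]))  # → 0.5
```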

Training this without GPU takes almost a week :-)

Agree, without a GPU training will be slow. That said, I made some changes recently which should speed things up considerably:

  • Simplifies the training setup: you no longer have to manually tune the learning rate and, more importantly, the epochs at which to decay it. This should give you great training results without any tuning.

  • We are using an encoder-decoder architecture. Before, we were training both the encoder and the decoder from scratch. This changeset brings in a pre-trained ResNet encoder, resulting in faster training times per epoch, fewer epochs needed, less memory needed, higher validation metrics, and faster prediction runtime.

If you want to give it a try with current master you should see improvements for your use-case.

Comment from imagico on 6 July 2018 at 10:50

What I find fascinating is that you seem to treat the image tiles completely independently - in other words: cut off building parts at a tile edge are treated as if they were whole buildings.

I can see that this affects the algorithm because we see the discontinuities in the results but I wonder how 'local' the method is ultimately when you apply the trained algorithm. I mean if you move the tile edge a tiny bit (a few pixel) would the results change completely across the whole tile potentially or would such a move only affect the results near the edge of the tile and leave the rest unaffected?

Comment from daniel-j-h on 6 July 2018 at 13:59

We train on (tile, mask) pairs, that's right.

But for prediction we buffer the tile on the fly (e.g. with 32px overlap on all sides), predict on the buffered tile which now captures the context from its eight adjacent tiles, and then crop out the probabilities for the original tile again. This results in smooth borders across tiles in the masks. The probabilities you are seeing above might still not match 100% at the borders, but that's fine.

Here is the original tile and how the buffering during prediction affects it.

original tile in the middle, four corners from adjacent tiles added for buffering

fully buffered tile with context from its eight adjacent tiles

Without this buffering approach you would clearly see the tile borders, correct.
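In numbers: with 256x256 tiles and a 32px overlap the model predicts on a 320x320 buffered tile, and we crop the center back out. A sketch of the crop step:

```python
import numpy as np

def crop_overlap(probs, overlap=32):
    """Crop the central tile out of predictions on an overlap-buffered tile,
    discarding the border pixels that came from adjacent tiles."""
    return probs[overlap:-overlap, overlap:-overlap]

buffered = np.zeros((256 + 2 * 32, 256 + 2 * 32))
print(crop_overlap(buffered).shape)  # → (256, 256)
```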

Comment from imagico on 6 July 2018 at 14:14

Ah, that makes much more sense now.

Comment from shakasom on 6 July 2018 at 22:56

Thanks. can this be used in Jupyter Notebooks? Where are all of these commands(RS) run?

Comment from daniel-j-h on 7 July 2018 at 08:22

Not sure why you'd want to run the RoboSat toolchain in Jupyter Notebooks, but I guess there is nothing stopping you from doing that. You have to install the RoboSat tools and their dependencies; then you can use the robosat Python package. The rs commands are really just a small shell script expanding to python3 -m cmd args.

Comment from Tomas Straupis on 7 July 2018 at 13:20

Thank you. ResNet decreases training time by 1/3!

Comment from maning on 8 July 2018 at 04:40

Following this guide, I got the following results. Hard negative mining helped a lot.


Comment from QSMLong on 12 July 2018 at 03:17

Thank you for your detailed introduction to the RoboSat pipeline, but there are still some confusing places for me. Could you please tell me if the online web site which you used to fetch a bounding box is available now? And may I take it that the bounding box has nothing to do with the final process once it covers the study areas?

Comment from daniel-j-h on 12 July 2018 at 09:47

It doesn't matter where you get the bounding box from; I used since it's convenient to use. And yes, it is currently working for me.

The bounding box is only used for cutting out a smaller base map from a larger .osm.pbf and downloading aerial images in that area. After that we work with data extracted from the base map and the aerial imagery as is.

Comment from NewSource on 13 July 2018 at 15:38

Thanks! Some steps are still confusing me when I try to use my own drone imagery. I tried to prepare tiles from my own GeoTIFF (.tif) that covers the whole area I want to work on; as a result I got the right Slippy Map directory structure but it contains garbage images. What did I do wrong?

Comment from daniel-j-h on 16 July 2018 at 09:40

I can't debug statements like "contains garbage images". What does your GeoTIFF look like? Do you have a small self-contained example where I can reproduce this issue? What do the Slippy Map tiles look like? You can give gdal2tiles a try; the tiler script was just a quick way for me to tile my GeoTIFFs.

Comment from QSMLong on 18 July 2018 at 01:30

Sorry to trouble you again. Similarly, I tried to prepare tiles from a TIFF image downloaded from OpenAerialMap. The tiling speed was about 60 tiles/s, and it went down after running for several hours; it would take a lot of time to tile my study area. I'd like to know your tiling speed, and whether there is any library which could accelerate tiling or whether I must write my own code.

Comment from daniel-j-h on 18 July 2018 at 08:36

Agree, for larger tiling jobs I recommend using proper tools like gdal2tiles.

Comment from QSMLong on 20 July 2018 at 02:16

When I tried to create the drone image tiles via the OpenAerialMap API, the robosat.tools download command just returned many warnings like "Warning: Tile(x=857247, y=430855, z=20) failed, skipping". After I debugged the program, it showed "{"message":"Not Authorized - Invalid Token"}". But I did copy the "tms" property from the "http ',30.3833619895642,114.637039919757,30.6950619601414'" request. And I chose "

        "__v": 0, 

        "_id": "59e62c273d6412ef7220d589", 

        "acquisition_end": "2015-08-23T00:00:00.000Z", 

        "acquisition_start": "2015-08-22T00:00:00.000Z", 

        "bbox": [






        "contact": "", 

        "file_size": 174569977, 

        "footprint": "POLYGON((112.64687500000001 31.365344444444446,115.05728611111111 31.365344444444446,115.05728611111111 29.229752777777776,112.64687500000001 29.229752777777776,112.64687500000001 31.365344444444446))", 

        "geojson": {

            "bbox": [






            "coordinates": [
























            "type": "Polygon"


        "gsd": 35.64656047021406, 

        "meta_uri": "", 

        "platform": "satellite", 

        "projection": "PROJCS[\"WGS84/Pseudo-Mercator\",GEOGCS[\"WGS84\",DATUM[\"WGS_1984\",SPHEROID[\"WGS84\",6378137,298.257223563,AUTHORITY[\"EPSG\",\"7030\"]],AUTHORITY[\"EPSG\",\"6326\"]],PRIMEM[\"Greenwich\",0],UNIT[\"degree\",0.0174532925199433],AUTHORITY[\"EPSG\",\"4326\"]],PROJECTION[\"Mercator_1SP\"],PARAMETER[\"central_meridian\",0],PARAMETER[\"scale_factor\",1],PARAMETER[\"false_easting\",0],PARAMETER[\"false_northing\",0],UNIT[\"metre\",1,AUTHORITY[\"EPSG\",\"9001\"]],EXTENSION[\"PROJ4\",\"+proj=merc+a=6378137+b=6378137+lat_ts=0.0+lon_0=0.0+x_0=0.0+y_0=0+k=1.0+units=m+nadgrids=@null+wktext+no_defs\"],AUTHORITY[\"EPSG\",\"3857\"]]", 

        "properties": {
            "sensor": "Landsat 8", 
            "thumbnail": "",

            "tms": "{z}/{x}/{y}.png?access_token=pk.eyJ1IjoiYXN0cm9kaWdpdGFsIiwiYSI6ImNVb1B0ZkEifQ.IrJoULY2VMSBNFqHLrFYew"


        "provider": "Astro Digital", 

        "title": "LC81230392015234LGN00_bands_432.TIF", 

        "uuid": ""
    }, "

to download the tiles. Similarly, I tried to repeat your work in the diary. Sadly I didn't get the tiles either. Did my network have some problem or other reasons?

Comment from maning on 20 July 2018 at 02:25

It looks like you are downloading Landsat imagery - is this what you want to download? Depending on the zoom level you download, this is expected to be low resolution (28m pixel resolution).

Comment from QSMLong on 20 July 2018 at 02:38

Yes, it's what I want now. At present, I just want to repeat the experiment. (By the way, are there any easy-to-use high resolution remote-sensing images?)

Comment from daniel-j-h on 20 July 2018 at 11:45

You will get the warning when the downloader was not able to download tiles from the list of tile coordinates you give it. This is probably due to the tile endpoint not providing imagery for all your tile ids. is a good place to start when looking for imagery sources.
