OpenStreetMap

Google Summer of Code 2022: Phase 1

Posted by tareqpi on 28 July 2022 in English (English). Last updated on 29 July 2022.

Hi everyone, this is an update on my progress in enhancing Nominatim’s search results ranking. For an overview of the project, you can check out my previous diary entry here. I would like to thank my mentors, Sarah Hoffmann (@lonvia) and Marc Tobias (@mtmail), for their guidance throughout the implementation of this project.

Goals of the First Phase

The main goals set for the first phase of the project were:

  • Enabling PostGIS to work with raster files
  • Finding and implementing the most suitable method used to import GeoTIFF files
  • Conducting performance tests on the import functionality
  • Adding unit tests
  • Documenting the new changes

Hardware I Am Using

A full planet import of Nominatim needs a lot of computing resources, so I set up a dedicated server to work on the project. I would like to thank OpenCage for providing it. The server has an 8-core AMD Ryzen™ 7 3700X, 64GB RAM, and a 1TB NVMe disk (900GB usable, 850GB free), and runs Ubuntu 22.04 LTS.

OSM Views Data

As mentioned before, OpenStreetMap publishes logs with the number of successful user requests for each map tile. This information can be found here. The first thing I did was download one of the log files and study its content. This led me to read about the Web Mercator projection to get a lower-level understanding of how tiles work and to better understand the logs.

After that, I started using a GeoTIFF file that stores the same information as the logs. This file, currently 387MB in size, is the data source chosen for loading the map tiles’ access numbers into Nominatim’s database. GeoTIFF is a variation of the TIFF format that adds a set of tags containing geospatial data, which provide internal georeferencing for the raster data in the file. The image below, generated with QGIS, illustrates the map access numbers stored in the GeoTIFF file used in this project.

Figure 1: Illustration of osmviews.tiff loaded in QGIS
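
To make the tile addressing behind these logs concrete, here is a minimal sketch of the standard Web Mercator ("slippy map") conversion from a latitude/longitude to tile x/y indices at a given zoom level. The function name `latlon_to_tile` is only for illustration and is not part of Nominatim.

```python
import math


def latlon_to_tile(lat_deg, lon_deg, zoom):
    """Return the (x, y) index of the Web Mercator tile containing a point."""
    n = 2 ** zoom  # number of tiles along each axis at this zoom level
    lat_rad = math.radians(lat_deg)
    x = int((lon_deg + 180.0) / 360.0 * n)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    # Clamp so that lon = 180 and extreme latitudes stay inside the grid.
    return min(max(x, 0), n - 1), min(max(y, 0), n - 1)


# The point (0, 0) sits at the centre of the map, so at zoom 1 it falls
# into the lower-right of the four tiles (y grows southwards).
print(latlon_to_tile(0.0, 0.0, 1))
```

Each line of a tile log can then be matched to a place by comparing the logged z/x/y with the indices computed for that place's coordinates.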

Import Functionality

PostgreSQL and PostGIS are already used in Nominatim to store and query geographic objects. However, loading the GeoTIFF file also requires raster support. I added this with a new function that creates the “postgis_raster” database extension. I then used raster2pgsql, PostGIS’s standard tool for loading raster data into the database, and integrated it so that Nominatim calls it programmatically.

The tool has various options that affect how the raster data is loaded. One option I set is GiST indexing on the raster column, which makes querying specific raster data much faster. Another is the tile size: the raster is cut into tiles of this size, and each tile is inserted as one table row. The recommended tile size for raster2pgsql is in the range of 32x32 to 100x100. I ran the performance test twice for each of the two tile sizes at the ends of that range, measuring the time to load the GeoTIFF file into the database, the space the raster data takes, and the number of rows in the newly created table. The table below summarizes the results:

Table 1: Performance Test Results Summary

The 100x100 tile size is clearly the better option, so I chose it as the tile size for importing the GeoTIFF file into Nominatim. The image below shows how the raster data looks in its table after the import.

Figure 2: Table sample raster data after importing it from the GeoTIFF file
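
The import step described in this section can be sketched roughly as follows. This is a minimal illustration and not Nominatim’s actual code: the helper name, the table name `osm_views`, the file name, and the database name `nominatim` are all assumptions. The `-I` flag asks raster2pgsql to create a GiST index on the raster column, and `-t` sets the tile size.

```python
import shlex

# SQL that enables raster support. Assumes PostGIS 3, where raster
# lives in its own "postgis_raster" extension.
CREATE_EXTENSION_SQL = "CREATE EXTENSION IF NOT EXISTS postgis_raster"


def build_raster2pgsql_command(tiff_path, table, tile_size=100):
    """Build the raster2pgsql argument list for importing one GeoTIFF."""
    return [
        "raster2pgsql",
        "-I",                              # GiST index on the raster column
        "-t", f"{tile_size}x{tile_size}",  # cut into tiles, one per table row
        tiff_path,
        table,                             # hypothetical target table name
    ]


cmd = build_raster2pgsql_command("osmviews.tiff", "osm_views")
# raster2pgsql writes SQL to stdout, so it is typically piped into psql.
print(shlex.join(cmd) + " | psql -d nominatim")
```

Building the command as a list keeps it easy to pass to `subprocess` without shell quoting problems when the import is triggered programmatically.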

The image below shows another table that I created. It contains the access numbers extracted from the loaded raster data, together with the corresponding places on the map, which come from the “placex” table.

Figure 3: Sample data of place_views table
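
One way such a table could be derived is a point lookup of each place’s centroid in the raster. The sketch below only builds the SQL text. The raster table name `osm_views`, the `views` output column, and the query shape are assumptions for illustration, not the project’s actual schema; `rast` is the default raster column name used by raster2pgsql. The query also assumes that the raster and `placex.centroid` share the same SRID; otherwise the point would first need an `ST_Transform`.

```python
def place_views_query(raster_table="osm_views"):
    """Return SQL that pairs each place with the raster value under its centroid."""
    return f"""
        SELECT p.place_id,
               ST_Value(r.rast, p.centroid) AS views
        FROM placex AS p
        JOIN {raster_table} AS r
          ON ST_Intersects(r.rast, p.centroid)
    """


print(place_views_query())
```

The `ST_Intersects` join condition is what lets the GiST index on the raster column narrow the search to the one tile that contains each centroid.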

Additionally, I implemented refreshing of the map access numbers. It reuses the function that imports the GeoTIFF file, but first drops the raster table if it already exists, so that the new raster data replaces the old.

Unit Tests and Documentation

Finally, I wrote unit tests and documentation covering the functionality that has been added to Nominatim.

What’s Next?

Now that the map access numbers can be loaded into Nominatim, the next main step is to enhance the search ranking algorithm by including the map access numbers in the computation of the places’ importance values. Feel free to ask any questions about my progress so far or the next steps of the project, and I will happily answer them.

Location: Taman Tun Dr Ismail, TTDI, Kuala Lumpur, 60000, Malaysia

Comment from pnorman on 29 July 2022 at 00:41

I’ve been using tiles2image to turn map tile lists into images. Because you only need 1px per tile once the zoom is fixed, the resulting images are small.

To get grayscale I’ve been using something hacked together, diff below

diff --git a/tiles2image.py b/tiles2image.py
index d65cc15..4455f16 100755
--- a/tiles2image.py
+++ b/tiles2image.py
@@ -4,6 +4,7 @@

 import sys
 import argparse
+from math import log
 from PIL import Image

 parser = argparse.ArgumentParser()
@@ -11,16 +12,24 @@ parser.add_argument("zoom", type=int, help = "Zoom level of tiles")
 parser.add_argument("filename", help = "Name of PNG to write to")
 args = parser.parse_args()

-img = Image.new('1', (2**args.zoom, 2**args.zoom), "black")
+img = Image.new('L', (2**args.zoom, 2**args.zoom), "black")
 pixels = img.load()

+max_hits = 0
 for line in sys.stdin.readlines():
+    splitline=line.split(' ',2)
     # Standard z/x/y format
-    splitline=line.split('/',4)
-    if (splitline[0] != str(args.zoom)):
+    tile=splitline[0].split('/',4)
+    if (tile[0] != str(args.zoom)):
         raise ValueError("Line {} does not have zoom {}".format(line, args.zoom))
-    x = int(splitline[1])
-    y = int(splitline[2])
-    pixels[x,y] = 1
+    x = int(tile[1])
+    y = int(tile[2])
+    hits = int(splitline[1])
+    max_hits = max(max_hits, hits)
+    loghits = (log(hits-9,10))/(log(500000,10))
+
+    pixels[x,y] = min(255, int((loghits ** 1.8)*300))
+
+print(max_hits)

 img.save(args.filename)
\ No newline at end of file

For zoom 12, this results in a 612K PNG image, for zoom 15 it is a 3.7 MB PNG.
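
The grayscale mapping in the diff above can be read in isolation as the function below. It is a restatement for clarity, with one added guard that is not in the diff: the `hits - 9` offset only works for counts of at least 10, which is assumed here to be the minimum count that appears in the tile logs.

```python
from math import log


def hits_to_gray(hits):
    """Map a tile's hit count to a 0-255 gray level on a log scale."""
    if hits < 10:
        # log() of zero or a negative number would fail, and a fractional
        # power of a negative base is not a real number.
        raise ValueError("hit counts below 10 are not supported")
    # Normalise so that about 500000 hits maps to 1.0.
    loghits = log(hits - 9, 10) / log(500000, 10)
    # The exponent darkens low counts; the result is capped at white.
    return min(255, int((loghits ** 1.8) * 300))


for sample in (10, 1000, 500009):
    print(sample, hits_to_gray(sample))
```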

Have you decided which zoom level’s tiles will correspond to 1px in your image? Looking at the tiff, it is 262144px wide, which corresponds to z18 tiles. This seems like an unnecessarily high resolution, as z18 tiles are only a few houses wide.
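
The arithmetic behind that observation can be checked with a short sketch. The helper names are illustrative only, and the ground width is an approximation based on the Web Mercator sphere radius.

```python
import math

# Web Mercator treats the Earth as a sphere with the WGS 84 equatorial radius.
EQUATOR_M = 2 * math.pi * 6378137.0


def tiles_per_axis(zoom):
    """Number of tiles (and pixels, at 1px per tile) along each axis."""
    return 2 ** zoom


def tile_width_m(zoom, lat_deg=0.0):
    """Approximate east-west ground width of one tile at a latitude."""
    return EQUATOR_M * math.cos(math.radians(lat_deg)) / tiles_per_axis(zoom)


print(tiles_per_axis(18))          # image width at 1px per z18 tile
print(round(tile_width_m(18), 1))  # metres per tile at the equator
print(round(tile_width_m(12), 1))  # same for a coarser zoom level
```

Each step down in zoom level doubles the tile width and quarters the number of pixels in the image.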

Does the data need to be in the database? If it’s being processed in the PHP application, storing it in a PNG is an option to consider, since getting a pixel value for a <1MB PNG is pretty fast.

“I have conducted performance tests twice on each of the two tile sizes of both ends of the recommended range to understand the time it takes to load the GeoTIFF file into the database, the space the raster data takes, and the number of rows of the newly created table. The table below is the performance test results:”

I would focus on time to get the value for a given coordinate, not on loading time. This is likely to be correlated with table size more than loading time, so sizes larger than 100x100 might be better.
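
As a rough illustration of how tile size drives table size, the sketch below estimates the number of rows for a few tile sizes. It assumes a square 262144px raster in which every tile is stored as a row, so the figures are upper bounds and not measured results.

```python
import math


def estimated_rows(raster_px, tile_px):
    """Upper bound on table rows when a square raster is cut into square tiles."""
    per_axis = math.ceil(raster_px / tile_px)  # partial tiles at the edge count too
    return per_axis * per_axis


for tile_px in (32, 100, 256):
    print(f"{tile_px}x{tile_px}: {estimated_rows(262144, tile_px):,} rows")
```

Under these assumptions, going from 32x32 to 100x100 cuts the row count by roughly a factor of ten, and 256x256 reduces it further, which is why lookup time is worth measuring for larger tile sizes as well.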

