
A New File Format for OSM Data

Posted by kumakyoo on 7 March 2025 in English. Last updated on 21 March 2025.

Whenever I use files containing OSM data, I’m faced with two major problems. These problems are inherent in the OSM data model and inherited by the common file formats (.osm, .o5m, .pbf).

The first of these problems is about accessing the data: As a result of the data model, users are forced to spend either huge amounts of memory or a lot of time. Often even both.

The second problem is even worse: Quite often you have to guess properties of the data, which means using heuristics. But by their very nature, heuristics can lead to wrong results. The most prominent example is probably the question of whether a closed way element represents a linear object or an area.

To overcome these problems I invented a new file format. The main idea: Convert the data once (accepting the drawbacks caused by the two problems) and end up with data that can be processed quickly, using only a small amount of memory. I also tried to keep the file size small and the file format simple.

I called the new format “OMA format”, short for “Open MAp”. It’s accompanied by a human-readable version called “OPA format”, short for “Open PAper”. (Oma and Opa are the German words for grandma and grandpa.)

It took me about a year to design the Oma file format and to write a converter and a library for querying the new file format. I still consider it to be in an experimental state, as I would like to get some feedback before releasing a stable version. Hence this post.

All this is a lot of stuff and I can’t go into all the details in just one blog post. For this reason I’ve decided to keep this article brief and base it on a single example. In the coming weeks, I will be writing more articles about the new file format, which will provide more in-depth information.

 

Scattered Data in OSM Files: The Viktorstraße in Wuppertal

For explanation I set up a simple example: I want to know everything about the street Viktorstraße in the city of Wuppertal. All I’ve got is a dump of the extract of Germany (germany.osm.xml from 1st of January 2025 [1]) and the knowledge that the street is saved as a couple of way elements, each of them containing a highway tag (with unknown value) and a name tag with the value “Viktorstraße”. And of course I know the name of the city it belongs to.

To get started, let’s peek into germany.osm.xml:

[Diagram: data of Viktorstraße in germany.osm.xml]

The file is split into three parts: The first two thirds of the file are occupied by nodes, almost all of the rest is used for ways, and a tiny fraction at the end contains relations.

The Viktorstraße consists of 5 way elements, which refer to 10 different node elements. The places where these elements can be found in germany.osm.xml are marked with red lines. As you can see, they are scattered throughout the file.

In principle, scattering data throughout a file doesn’t do much harm, as long as there are means for fast (that is, direct) access. Unfortunately, this doesn’t apply to OSM files: In almost all cases, you have to read the entire file from the very beginning without being able to skip any parts. [2]

Even worse, the data in the ways section references data in the nodes section, which has already been read by the time the references are encountered. This means that, while reading the nodes section, you don’t know which parts of it you will need later.

Basically, this can be handled in two ways:

  1. Keep the first part of the file completely in main memory (you need gigantic amounts of memory).

  2. Read the file a second time (takes a lot of time). In this case, a new problem arises: You have to keep the references from the ways section in memory while you retrieve the nodes they point to from the nodes section. In the case of the Viktorstraße it’s only 10 references, but many use cases involve millions (or even billions) of references, and again you need huge amounts of memory. (The sketch after this list illustrates this second strategy.)
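
To make the memory problem concrete, here is a minimal sketch of the second strategy in Java (the language of the converter and the library). The element records and the streaming helpers are hypothetical stand-ins for a real OSM reader, not part of any existing tool:

    import java.util.*;

    // A minimal sketch of strategy 2: two passes over the file.
    public class TwoPassSketch {

        record Node(long id, double lon, double lat) {}
        record Way(long id, long[] nodeRefs) {}

        public static void main(String[] args) {
            List<Way> matchingWays = new ArrayList<>();
            Set<Long> neededIds = new HashSet<>();

            // First pass: skip the nodes section, collect the interesting
            // ways and remember which node IDs they reference.
            for (Way way : waysFromFile())
                if (isViktorstrasse(way)) {
                    matchingWays.add(way);
                    for (long ref : way.nodeRefs()) neededIds.add(ref);
                }

            // Second pass: read the file again from the start, keeping only
            // the coordinates of the referenced nodes. For big queries,
            // neededIds itself can grow to millions of entries.
            Map<Long, double[]> coords = new HashMap<>();
            for (Node node : nodesFromFile())
                if (neededIds.contains(node.id()))
                    coords.put(node.id(), new double[] {node.lon(), node.lat()});
        }

        // Hypothetical stand-ins for a streaming OSM reader.
        static Iterable<Way> waysFromFile() { return List.of(); }
        static Iterable<Node> nodesFromFile() { return List.of(); }
        static boolean isViktorstrasse(Way way) { return false; }
    }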

 

A First Step Towards a Solution: Resolving the Node References

OSM ways consist of a series of node references (that is, a list of node IDs). The first thing the converter does is to replace these references with the coordinates of the nodes to which they refer. This leaves a lot of nodes in the nodes section without any further use. Therefore these useless nodes are discarded.
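
As a rough illustration, this is the resolution step in Java with simplified element types; ResolvedWay is just an illustrative name, not the converter’s actual class:

    import java.util.*;

    // Sketch of the resolution step: node references become coordinates.
    public class ResolveRefs {

        record Node(long id, double lon, double lat) {}
        record Way(long id, long[] nodeRefs, Map<String,String> tags) {}
        record ResolvedWay(double[][] coords, Map<String,String> tags) {}

        static List<ResolvedWay> resolve(List<Node> nodes, List<Way> ways) {
            // Index all node coordinates by ID once.
            Map<Long, double[]> byId = new HashMap<>();
            for (Node n : nodes)
                byId.put(n.id(), new double[] {n.lon(), n.lat()});

            // Replace each node reference with the coordinates it points to.
            List<ResolvedWay> out = new ArrayList<>();
            for (Way w : ways) {
                double[][] coords = new double[w.nodeRefs().length][];
                for (int i = 0; i < coords.length; i++)
                    coords[i] = byId.get(w.nodeRefs()[i]);
                out.add(new ResolvedWay(coords, w.tags()));
            }
            // Nodes that only served as way geometry are simply not written
            // to the output; nodes carrying their own tags would be kept.
            return out;
        }
    }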

The same is true for some relations, where not only the node references are replaced by nodes, but also the way references are replaced by ways. (There are more changes to the relations section, but I won’t go into that in this article.)

The result can be seen in the following diagram:

[Diagram: data of Viktorstraße after resolving node references]

The data of the Viktorstraße is now completely contained in the ways section. This makes it possible to retrieve the data in a single pass using only a small amount of memory. However, the data could be anywhere in the ways section, and you still don’t know where this section starts, so almost the whole file still has to be read.

 

Second Step: Dividing into Chunks

If you look more closely at the data in the ways section, you will notice that completely unrelated way elements are located directly beside each other. For example, next to a street in Wuppertal there could be a park in Munich and then a construction area in Pulsnitz.

A better approach would be to sort the data geographically: Things that are close together in the real world should also be close together in the file. Unfortunately the world is not one-dimensional, but the file is.

One solution to this problem is to divide the data into regions and store all the data of one region together. The Oma file format uses chunks for this purpose: The data in a chunk is bounded by a rectangle on the earth’s surface (better known as a bounding box). Only data that is completely within this bounding box is saved in the chunk. [3] A chunk table allowing direct lookup of the chunks is added to the file.
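
The effect of the chunk table can be sketched in a few lines of Java. The record layout shown here is purely illustrative and not the actual binary layout of Oma files:

    import java.util.*;

    // Sketch of a chunk table lookup.
    public class ChunkLookup {

        record Bbox(double minLon, double minLat, double maxLon, double maxLat) {
            boolean overlaps(Bbox o) {
                return minLon <= o.maxLon() && o.minLon() <= maxLon
                    && minLat <= o.maxLat() && o.minLat() <= maxLat;
            }
        }

        // Each entry knows the chunk's bounding box and where the chunk
        // starts in the file, so matching chunks can be read with a direct
        // seek instead of scanning the whole file.
        record ChunkEntry(Bbox bbox, long fileOffset) {}

        static List<ChunkEntry> chunksFor(List<ChunkEntry> table, Bbox query) {
            List<ChunkEntry> hits = new ArrayList<>();
            for (ChunkEntry entry : table)
                if (entry.bbox().overlaps(query)) hits.add(entry);
            return hits;
        }
    }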

If you now want to find the Viktorstraße, you only need to search for chunks that overlap with the bounding box of Wuppertal. With the default bounding boxes there are three of them: A large one (actually the largest chunk in the file, containing Wuppertal, Cologne and the whole Ruhr area) and two small ones at the end of the ways section which are barely visible in the following diagram:

[Diagram: data of Viktorstraße after dividing the data into chunks]

In this diagram, the boundaries of the chunks are marked by black lines, chunks that could contain data of the Viktorstraße are coloured pink. At the end of the file, there is a small chunk table, not visible in the diagram (because it is very small). All in all: Now only about 9% of the file needs to be searched.

 

Third Step: Dividing into Blocks and Separating Areas and Ways

Within a chunk, the data is still unordered. This means that next to a highway there might be a power line or a shop or whatever. Accessing the data could be sped up again if all the highways were put together, and so on. And this is the next thing Oma files do: Within a chunk, all elements of the same type are put into one block.

Unfortunately, the OSM data model does not provide information about the type of an element. So we have to guess. This can be done by scanning a list of keys for a match – if the element contains a highway key, it’s a highway, and so on.

But sometimes there is no match, and sometimes there is more than one match. This is how the Oma file format deals with these cases:

If there is no match, the solution is simple: Oma files put all elements without a type in an extra (typeless) block.

Dealing with several matches is trickier: You could decide to take only the first matching key as the type, or you could repeat the element and put it into several blocks. Since there are pros and cons to both versions, the Oma file format leaves this decision to the user. The default is to add elements to several blocks, and this is what we use in the diagram below.
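
As a sketch, the key scan could look like this; the key list shown is illustrative, not the converter’s actual configuration:

    import java.util.*;

    // Sketch of type detection by scanning an ordered list of keys.
    public class TypeGuess {

        static final List<String> TYPE_KEYS =
            List.of("highway", "building", "landuse", "natural", "waterway");

        // Default behaviour: return every matching key, so the element is
        // written to one block per type. With "first match only" the
        // element appears exactly once, in the block of the first key found.
        static List<String> typesOf(Map<String,String> tags, boolean firstOnly) {
            List<String> types = new ArrayList<>();
            for (String key : TYPE_KEYS)
                if (tags.containsKey(key)) {
                    types.add(key);
                    if (firstOnly) break;
                }
            return types; // an empty list means: typeless block
        }
    }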

But before we look at the diagram, we need to discuss another topic: the question of whether a way element represents a linear object or an area. With the help of the type information this can be decided in most cases, but it still involves some guesswork. Since ways and areas are completely different things, Oma files put them into different chunks.
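
One possible heuristic of this kind is sketched below; these are not the exact rules the converter applies, just the general idea:

    import java.util.Map;

    // Sketch of an area-vs-way heuristic for closed ways.
    public class AreaHeuristic {

        static boolean isArea(Map<String,String> tags, boolean closed) {
            if (!closed) return false;                       // open ways are never areas
            if ("yes".equals(tags.get("area"))) return true; // explicit override
            if ("no".equals(tags.get("area"))) return false;
            if (tags.containsKey("building")) return true;   // closed buildings are areas
            if (tags.containsKey("highway")) return false;   // closed streets are usually linear
            return true; // many other closed ways (landuse etc.) are areas
        }
    }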

Now we are ready for the diagram:

[Diagram: data of Viktorstraße after dividing chunks into blocks]

The way chunks from the last diagram have been split up into area chunks and way chunks (area chunks are coloured blue, way chunks are coloured yellow). At the bottom of the diagram, the largest of the three way chunks that could contain the Viktorstraße has been enlarged to show the block structure.

As the Viktorstraße is made up of ways only, the search for it can be narrowed down to three smaller (way) chunks. In these chunks only the highway blocks need to be searched. That reduces the search to about 1.8% of the file.

 

Fourth Step: Dividing into Slices

There is one last thing to do: Again, the data inside a block is not sorted. We can sort it by the value of the key which defines the type of the element: All highways of type service are put into one slice, all highways of type track are put into the next slice, and so on.

This works well as long as the slices created are large enough. Since the OSM data model is open to new inventions, there are a lot of values that are only used a few times. If we were to put them all into separate slices, this would create a lot of overhead. Therefore, only values which are known to be used frequently are put into separate slices; all other values are put into a special slice with no fixed value.
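
A sketch of this assignment, with an illustrative (not the actual) list of frequent highway values:

    import java.util.Set;

    // Sketch of slice assignment inside a highway block.
    public class SliceAssign {

        static final Set<String> FREQUENT_VALUES = Set.of(
            "residential", "service", "track", "footway", "path",
            "primary", "secondary", "tertiary", "unclassified");

        // Frequent values get a slice of their own; all rare values are
        // collected in a single shared slice with no fixed value.
        static String sliceFor(String value) {
            return FREQUENT_VALUES.contains(value) ? value : "<other>";
        }
    }

So a rare or newly invented value does not get its own slice but lands in the shared one, together with all other infrequent values.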

Our example doesn’t benefit from this additional sorting, because we don’t know what type of highway the Viktorstraße is. But if we knew that it is a residential highway, we could narrow down our search even further.

The following diagram shows the slices in the highway block from the previous diagram:

[Diagram: data of Viktorstraße in the block shown in the last diagram]

 

Finally: Some Numbers

That’s basically all about the Oma file format. You may wonder how long it takes to convert OSM files into Oma files, how much space they use and how long it takes to query the Viktorstraße. Well, I have collected these numbers for you.

I used germany_internal.osm.pbf taken from Geofabrik from 1st of January 2025 (contains full meta information) and converted it into Oma files three times.

For the first conversion I used the default, which means that no meta data was saved (not even the ID) and elements with multiple types were saved multiple times.

For the second conversion I used the command line parameter -p all, which means that all meta data was included and elements with multiple types were saved multiple times. This results in the largest possible Oma file. [4]

For the third conversion I used the command line parameter -1, which means that no meta data was saved (not even the ID) and elements with multiple types were saved only once. This results in the smallest possible Oma file.

The conversion times are the times required on my computer. [5] With a faster CPU and more main memory, the times would decrease significantly. For comparison I added germany.osm.pbf, the publicly available version, which contains restricted meta information.

File                       Size     Conversion time
germany_internal.osm.pbf   4.9 GB   –
germany.osm.pbf            4.1 GB   –
germany.oma                2.9 GB   0:57:14
germany.largest.oma        3.8 GB   1:20:44
germany.smallest.oma       2.7 GB   0:51:47

Querying the Oma file for the Viktorstraße took 5.3 seconds with germany.largest.oma and 4.6 seconds with germany.smallest.oma, including the time to extract the boundaries of Wuppertal from the same file.

Now I’m curious to read your comments.



  1. The use of germany.o5m or germany.pbf will basically lead to the same result. They differ from germany.osm.xml mainly in the way they compress the data. 

  2. With the help of something resembling a binary search algorithm it might be possible to access parts of the file in logarithmic time. Unfortunately, these algorithms are prone to mistakes or even complete failure. 

  3. If you wonder what happens with elements that do not fit completely into a chunk: They will be saved in a larger chunk, if necessary in the last chunk, which spans the whole world. 

  4. Well, that’s not completely true: I used compressed slices in all versions. Without compression, Oma files get much larger. 

  5. 4 GB of main memory, Intel(R) Core(TM) i3-7100T CPU @ 3.40GHz. I used only 3 GB of main memory for the Java Virtual Machine. With more memory, the operating system starts swapping, slowing things down considerably. 

Discussion

Comment from chris_debian on 7 March 2025 at 23:07

Hi,

This sounds like a really interesting idea, and you’ve obviously put a lot of specialist work into developing this far. It will be good to see the constructive feedback from people who know more about this specialist area.

Good work,

Chris

Comment from kumakyoo on 8 March 2025 at 08:06

I hope that at least the library can be used by non-specialists too (only some basic knowledge of Java programming is needed). :-) I plan to give a short introduction on how to use the library in my next post.

Comment from H@mlet on 9 March 2025 at 15:07

Hi.

This looks really well designed. I’m curious: you give the time it takes to query the Oma file for the Viktorstraße, but without a baseline.

Would you share some figures to query the same data from pbf files, and maybe even specialized services like overpass ?

Regards.

Comment from kumakyoo on 11 March 2025 at 16:46

Would you share some figures to query the same data from pbf files, and maybe even specialized services like overpass ?

This is very difficult to answer (which is the reason why I didn’t give any numbers above).

First of all, the idea behind overpass is similar to the idea behind Oma files - do the problematic stuff once. Instead of a file, overpass uses a database. Databases have some advantages over files: For example, they contain indexes to speed up searching. Since I don’t have an instance of overpass on my computer I can only guess, but I’d say that it would be faster. The drawback is that databases are not so easy to share. The dumps tend to get big. And the initialisation may take some time to create the indexes. All in all, I think these two approaches are not easily comparable.

Querying a pbf file with tools like osmium or osmconvert is not easily comparable either. For example, osmconvert cannot search for tags. You can only reduce the amount of data, for example by specifying the bounding box of Wuppertal (wherever you get it from; with Oma files you can easily query it). This step alone took 1:44 minutes.

After that, you can use osmfilter to search for tags. Theoretically, I think, it should be possible to query the Viktorstraße. In practice, I always confuse the --keep and --drop options you need for that, and I do not get what I’m looking for. But if I managed it, I think it would add only a second or so.

Osmium is easier to use, but the problems are similar: You have to run it twice (once with osmium extract to limit the data to Wuppertal, not knowing where to get the bounding polygon from, and once with osmium tags-filter to select the tags). I can’t give times here though: Osmium just crashes, because my computer does not have enough main memory.

Comment from H@mlet on 14 March 2025 at 17:16

Thanks for the detailed answer. :-)

Comment from Geonick on 16 March 2025 at 23:09

Dear @kumakyoo. Interesting project. I haven’t yet understood all your requirements. But have you already looked at GeoParquet?

Comment from cello on 17 March 2025 at 19:24

Very impressive design of essentially a query-optimized database for OSM data. Thanks a lot for your efforts and making it public!

You currently have an option to compress parts of the file (zip_chunks in the source code), which uses Java’s DeflaterOutputStream. Essentially, this uses the compression also used by gzip. While it is simple to use as it is directly included by Java, the deflate-compression algorithm is known to be pretty slow and have a low throughput. More modern compression algorithms would be ZStandard (https://facebook.github.io/zstd/) or LZ4 (https://github.com/lz4/lz4-java), which are much faster for both compression and decompression, and Zstd might even result in better compression than the default deflate.

While the library might lose a bit of its appeal as it will no longer be self-contained but require some dependencies on other code, I think it might be worth it by becoming even faster and creating even smaller files.

So, the general feedback from my lines above might be:

  - include some additional bits or bytes in the header for future needs, just to future-proof your format
  - do not have only 1 bit for compression = true|false, but maybe 3 bits (giving values 0–7): 0=uncompressed, 1=deflate, 2=zstandard, 3=lz4, 4–7=future use

Comment from cello on 17 March 2025 at 19:35

Do you know GeoDesk (https://www.geodesk.com)? I think they are trying to achieve something similar. They also have a custom file format that groups osm-data by region for faster access. But their files are actually bigger than the originating pbf files, whereas yours are smaller.

Comment from kumakyoo on 18 March 2025 at 16:14

More modern compression algorithms would be ZStandard (https://facebook.github.io/zstd/) or LZ4 (https://github.com/lz4/lz4-java), which are much faster for both compression and decompression, and Zstd might even result in better compression than the default deflate.

Sounds like a good idea. I didn’t know these two compression algorithms and I didn’t look for alternatives to deflate. Many thanks for pointing this out. I’ll have a look soon.

I’ll also have a look at GeoParquet and GeoDesk when I find the time. They might contain additional ideas I overlooked. Many thanks too.

Comment from Geonick on 18 March 2025 at 16:28

I’ll also have a look at GeoParquet and GeoDesk when I find the time. They might contain additional ideas I overlooked.

Very good. I would definitely take a look at GeoParquet 1.1 https://geoparquet.org/ - and if necessary contribute to it.

I respect in-house developments like Oma, but the chances of a new format like GeoParquet catching on are much greater.

Take a look at the long list of GeoParquet (and Parquet) software. And see also, for example, these interesting discussions on the subject here: https://github.com/opengeospatial/geoparquet/discussions/251 .

Comment from rayKiddy on 18 March 2025 at 19:45

Really interesting ideas here.

First, I would encourage you to go ahead and put the information in this post into a wiki linked to one of your repos.

Second, geoparquet looks interesting and there is a lot there, but that is, I would suggest, not a reason to give up on this effort. If geoparquet has to compete, that will be all for the better. It seems that it would be a good idea to have a geoparquet<->oma/opa converter.

Third, can oma files be used to generate tiles? This might help make the format’s usefulness more obvious.

Fourth, are you testing your converters so that we can be confident of the round-trip behavior of a conversion into and then out of oma? I do not see a “tests” directory anywhere. :-)

And finally, would you object to things being done in python? I have a lot of experience working in java, but it would be good to have tools in other languages too.

Thanx and I look forward to putting up some PRs. - ray

Comment from kumakyoo on 19 March 2025 at 16:38

First, I would encourage you to go ahead and put the information in this post into a wiki linked to one of your repos.

I’m not sure which wiki you are referring to. I plan to add a page in the OSM Wiki when the format is finalized. I don’t want to do it in advance, because then I would have to change the entry every time I change the format. But maybe you have something else in mind.

I would suggest, not a reason to give up on this effort.

Of course not. I have spent a year on this. I’m not going to give up just because there is something similar out there. But I like the idea of comparing it to my approach. It will help me get a clearer picture of the strengths and weaknesses of my format. And it might bring up some new ideas that I may have overlooked.

Third, can oma files be used to generate tiles? This might help make the format’s usefulness more obvious.

I think so. It’s an all-purpose format that contains almost all the information available from OSM. It might be better suited for vector tiles though, because, I think, Oma files could be used directly, without any additional preparation.

Fourth, are you testing your converters so that we can be confident of the round-trip behavior of a conversion into and then out of oma? I do not see a “tests” directory anywhere. :–)

No, there is currently no automated testing. The main goal so far has been to create a new file format. In my opinion this cannot be tested automatically, because after every change I would have to rewrite all the tests, and then I would have to test the tests, just to run them once…

The converter and the library are a kind of add-on, a prototype to show what is possible. When the format is fixed and a “real” converter/library is created, it should definitely be accompanied by automated tests.

Having said that, I did a lot of testing during the development of the two tools mentioned. It was just nothing automated. :-)

Regarding a roundtrip: That is not possible. You can’t convert Oma files back to OSM files. Some information is lost in the conversion process, for example the IDs and other meta information of the nodes that make up a way, but also how multipolygons have been pieced together and a few other things.

There is only one round trip I know of: From Oma to Opa and back. The resulting Oma file must be identical to the original.

And finally, would you object to things being done in python? I have a lot of experience working in java, but it would be good to have tools in other languages too.

Of course it would be nice to have the library in several languages. But first, the file format needs to be finalized. Python and PHP are languages, where I’ll probably write the library myself (but I don’t mind if someone else volunteers) when the time comes. For other languages other people will have to do the job.

Concerning the converter: I doubt that Python is fast enough for this job. And memory management may also be an issue. Java is (despite its reputation) one of the fastest languages available (but memory is an issue here too - I’ll cover that in my next post in this series) and thus a rewrite in another language may be required sooner or later.

Comment from rayKiddy on 19 March 2025 at 18:14

In a github repo, you can create a wiki. Then “Wiki” appears in the repo next to “Issues” and “Pull Requests” and such as that. Just a suggestion. If you have plans for the docs, that is all good.
