
I’m working on publishing a summary of sites using tile.osm.org and want to know what format would be most useful for people.

The information I’ll be publishing is requests/second, requests/second that were cache misses, and domain. The first two are guaranteed to be numbers, while the last one is a string that will typically be a domain name like www.openstreetmap.org, but could theoretically contain a poisoned value like a space.

The existing logs, which have tiles and the number of requests, are formatted as z/x/y N, where z/x/y are tile coordinates and N is the number of accesses.

My first thought was TPS TPS_MISS DOMAIN, space-separated like the existing logs. This would work, with the downside that it’s not very future-proof. Because the domain can theoretically have a space, it has to be last. This means that any future additions will require re-ordering the columns, breaking existing usage. Additionally, I’d really prefer to have the domain at the start of the line.
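Since the domain comes last, a consumer can still parse a line unambiguously by capping the number of splits. A minimal sketch in Python (the variable names are mine):

line = "1453.99 464.1 www.openstreetmap.org"
# maxsplit=2 keeps everything after the second space in `domain`, so a
# space inside the value cannot shift the two numeric columns.
tps, tps_miss, domain = line.split(" ", 2)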

A couple of options are:

- CSV, with escaping
- tab-delimited

Potential users, what would work well with the languages and libraries you prefer?

An example of the output right now is

1453.99 464.1 www.openstreetmap.org  
310.3 26.29 localhost
136.46 39.68 dro.routesmart.com
123.65 18.54 www.openrailwaymap.org
107.98 0.05 www.ad-production-stage.com
96.64 1.78 r.onliner.by
91.42 0.16 solagro.org
87.83 1.53 tvil.ru
84.88 12.98 eae.opekepe.gov.gr
74.0 2.32 www.mondialrelay.fr
63.44 1.93 www.lightningmaps.org
63.22 14.01 nakarte.me
55.1 0.74 qualp.com.br
52.77 11.25 apps.sentinel-hub.com
46.68 4.07 127.0.0.1
46.3 1.96 www.gites-de-france.com
43.47 1.15 www.anwb.nl
42.46 10.52 dacota.lyft.net
41.13 6.63 www.esri.com
40.84 0.69 busti.me

Discussion

Comment from Mateusz Konieczny on 25 June 2022 at 14:57

I have occasionally used such things, and for me CSV (comma-delimited, with escaping) is a good format.

Tab-delimited is weird, but I guess something can be cobbled together to use it, especially if it can be assumed that domain names contain no tabs, or that tabs are at least escaped.

Providing a total sum across all domains, including ones not listed, would be useful for estimating the long tail, unless every single domain is published.

Something like 14531.99 4641.1 *? Or as a separate entry?

Is it possible for domain to be unknown?

Comment from Jochen Topf on 26 June 2022 at 19:55

CSV is great for really simple data. But once you need to get into escaping and all that, it becomes annoying really fast. How about an SQLite database?

Comment from pnorman on 27 June 2022 at 05:22

I’ve gone with CSV, quoting the string field. On the technical side, it’s using Python’s csv.writer with unix_dialect and QUOTE_NONNUMERIC.
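For anyone wanting to reproduce the output, here is a minimal sketch of that writer setup; the file name, the domain-first field order, and the sample rows are inferred from the examples elsewhere in this post:

import csv

# unix_dialect is comma-separated with '"' quoting and LF line endings.
# QUOTE_NONNUMERIC overrides the dialect's QUOTE_ALL, so the two float
# fields stay bare and only the domain string gets quoted.
with open("hosts-2022-06-26.csv", "w", newline="") as f:
    writer = csv.writer(f, dialect=csv.unix_dialect,
                        quoting=csv.QUOTE_NONNUMERIC)
    writer.writerow(["www.openstreetmap.org", 1453.99, 464.1])
    writer.writerow(["localhost", 310.3, 26.29])

# Produces lines like:
# "www.openstreetmap.org",1453.99,464.1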

sqlite would be overkill for what is 15 kB of data in text form in a single table.

If I were truly reporting website domains, I wouldn’t worry about quoting, since domain names have a restricted set of characters. Instead, what I’m reporting is domains extracted from the Referer header, and there’s nothing that stops an app from sending https://www.foo,bar.com/ as a referer. They’d eventually get blocked for sending a fake header, but that doesn’t rule out such values appearing in the logs.
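As a sketch of the point (not the actual extraction code): Python’s urlsplit will happily hand back a hostname containing a comma, so such a value flows straight through to the output.

from urllib.parse import urlsplit

# A comma is not a delimiter inside the authority part of a URL, so the
# parsed hostname keeps it -- hence the need for quoting in the CSV.
print(urlsplit("https://www.foo,bar.com/").hostname)  # www.foo,bar.com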

That is related to the origin of “” in the hosts log: those entries come from referers that don’t parse as a valid URI.

In practice, for the type of log processing I’m doing, I’m likely to ignore that. If I run a command like

fgrep '"umap.openstreetmap.fr"' hosts* | sed 's/hosts-//' | sed 's/.csv:"umap.openstreetmap.fr"//' > umap.csv

I get a CSV I can open in a spreadsheet and do further stuff with.

The first file is up at https://planet.openstreetmap.org/tile_logs/hosts-2022-06-26.csv. I’m preparing the files to backfill into 2021.

Comment from jleedev on 27 June 2022 at 10:47

Including a header row is something to be expected in a CSV file, especially if you’re aware that the file format may evolve in the future.

DOMAIN,TPS,TPS_MISS
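A header row also lets consumers address columns by name rather than by position, so later additions or re-orderings don’t break them. A minimal sketch (the file name is hypothetical):

import csv

# DictReader keys each field by the header row, so code that reads by
# name keeps working if columns are added or reordered later.
with open("hosts-2022-06-26.csv", newline="") as f:
    for row in csv.DictReader(f):
        print(row["DOMAIN"], row["TPS"], row["TPS_MISS"])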

Comment from mmd on 27 June 2022 at 19:53

Somehow I’m missing the bigger picture here. Should site owners check this information to evaluate their own resource usage and find out how they compare to others?

Comment from o_andras on 28 June 2022 at 12:09

I fail to understand how this question of which separator to use in a DSV file for simple tabular data still comes up, since it’s been solved for decades without subpar “solutions” (e.g. quoting & escaping) that unnecessarily preclude handling the files with simple tools: the *nix shell and related utilities (cut, sed, sort, read, etc.), a simple string split in a programming language, and so on…

Comment from o_andras on 28 June 2022 at 12:15

PS: using tabs instead of comma+quoting+escaping is the opposite of weird, and in general (for what people normally use CSV for) it’s a step in the right direction. Python’s csv module, for example (since it’s been mentioned), supports “arbitrary” separators as an instance parameter, IIRC.
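That recollection is correct as far as I know: the separator is a per-reader/per-writer parameter. A minimal sketch writing the same data tab-delimited (the file name is illustrative):

import csv

# Same module, tab as the delimiter; the sample row comes from the
# example output earlier in this post.
with open("hosts.tsv", "w", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerow(["www.openstreetmap.org", 1453.99, 464.1])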

Comment from jleedev on 28 June 2022 at 12:21

Can you say for sure that nobody has included \x1e, \x1f, \t, \n, or indeed quote and comma, anywhere in the Host: header of a tile request, ever? No tabular data is “simple”.

Comment from o_andras on 28 June 2022 at 13:10

@jleedev

No tabular data is “simple”.

That should be obviously false to anyone…

Can you say for sure that nobody has included (…) anywhere in the Host: header of a tile request, ever?

There are two sides to that coin:

  1. Of course you can’t!
  2. Of course you can!

RE (1): obviously you can’t because it’s user input.

RE (2): like @mmd said, “I’m missing the bigger picture here.” I don’t know what the purpose of these files is, I don’t know who these files are for, I don’t know if they’ll be automatically updated regularly, I don’t know if the files will be published raw (untreated) or not, etc, etc – because these details aren’t explicitly stated anywhere.

For example, if after the raw data is gathered you check that there’s no occurrence of a {tab,space,US,comma,…}, then you can safely use {tab,space,US,comma,…} as the separator of the published file – but this assumes that the data isn’t automatically updated regularly.
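That check is cheap to do. A hypothetical pre-publication guard in Python (the names are mine, not from the actual pipeline):

SEPARATOR = "\t"
domains = ["www.openstreetmap.org", "localhost"]  # sample gathered values

# Refuse to publish an ambiguous file if the chosen separator ever
# occurs inside the data itself.
if any(SEPARATOR in d for d in domains):
    raise ValueError("separator occurs in the data; choose another one")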

Python’s csv.writer with unix_dialect and QUOTE_NONNUMERIC

That’s an implementation detail a user shouldn’t have to think about, and, personally, I have no idea what the unix_dialect is… Additionally, it’s not guaranteed to be compatible across different programming languages, DSV libraries, and programs, even though these quoting+escaping rules may look standard. Before I can use the file, I (as the user) have to make sure that whatever program I want to use correctly supports all these implementation details.

OTOH, @Jochen Topf suggested using a SQLite file, which looks to me like a much better alternative even if it sounds overkill at first sight. Among many other pros over some DSV format:

  • there’s only one SQLite implementation;
  • it’s stable as a rock;
  • pretty much every language has bindings to it;
  • there’s the sqlite CLI program;
  • you get the power of the SQL language;
  • if you don’t like SQL you can still export to whatever DSV dialect (and therefore use *nix tools);
  • there are no choices to make;
  • it’s guaranteed to work for everyone everywhere always!

Hopefully that makes clear what I was trying to express: if you can’t guarantee interoperability through simplicity (i.e. without hacks), then at least guarantee interoperability through actual standards.
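As a minimal sketch of that route (the file name and schema here are hypothetical, not anything actually published):

import sqlite3

# One table mirroring the three published fields.
con = sqlite3.connect("hosts-2022-06-26.sqlite")
con.execute("CREATE TABLE IF NOT EXISTS hosts (domain TEXT, tps REAL, tps_miss REAL)")
con.executemany("INSERT INTO hosts VALUES (?, ?, ?)",
                [("www.openstreetmap.org", 1453.99, 464.1),
                 ("localhost", 310.3, 26.29)])
con.commit()
con.close()

# Consumers then need no quoting rules at all, e.g.:
#   sqlite3 hosts-2022-06-26.sqlite "SELECT * FROM hosts ORDER BY tps DESC"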

Comment from pnorman on 29 June 2022 at 03:39

Somehow I’m missing the bigger picture here. Should site owners check this information to evaluate their own resource usage and find out how they compare to others?

No. The goal is to improve the ability of the community to see how the service is being used and reduce the number of times admins need to run ad-hoc queries to see usage. We’ve done this in the past for user-agents.

In particular, historical usage was exceptionally difficult to query because of the large amount of data (~130GB/day compressed) that needed to be queried.

Comment from Mateusz Konieczny on 29 June 2022 at 07:30

PS: using tabs instead of comma+quoting+escaping is the opposite of weird, and in general (for what people normally use CSV) it’s a step in the right direction.

But calling it CSV is weird. Tab-delimited “CSV” is not CSV.

Comment from Mateusz Konieczny on 29 June 2022 at 07:32

I fail to understand how this question of which separator to use in a DSV file for simple tabular data still comes up, since it’s been solved for decades

Files which are not human-readable are obnoxious for various reasons, so I am not surprised that ASCII delimited text (https://en.m.wikipedia.org/wiki/Delimiter#ASCII_delimited_text) is rarely used in practice.

Comment from o_andras on 29 June 2022 at 18:20

Totally agree that calling tab-delimited CSV “CSV” is silly!

About the human-readable part: that could be a reason, but I don’t think that’s it… There’s so much that can be said about “human-readable-ness” I don’t even know where to begin, apart from the fact that “human-readable” is ambiguous.

In the sense of being (or not) “text”: both less and vim display these characters, so they’re definitely readable. Example file of two records, one of values a1, a2, a3, and another of values b1, b2, b3:

$ xxd test.dsv
00000000: 6131 1f61 321f 6133 1e62 311f 6232 1f62  a1.a2.a3.b1.b2.b
00000010: 331e 1c0a                                3...
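Reading that file back needs no quoting or escaping rules, just two nested splits; a sketch in Python:

# Strip the trailing FS (0x1C) and newline, then split records on RS
# (0x1E) and fields on US (0x1F).
with open("test.dsv", "rb") as f:
    data = f.read().rstrip(b"\x1c\n")
records = [rec.split(b"\x1f") for rec in data.split(b"\x1e") if rec]
# records == [[b'a1', b'a2', b'a3'], [b'b1', b'b2', b'b3']]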

In the sense of being easy(-ish) to read: I wouldn’t call CSV files that look like this readable:

foo,"bar,baz,zaz",,,,,asdf

How many columns does this record have? How many (non-)empty columns does this record have? What’s the value of the 5th column? You can’t answer any of these questions at a glance, so this CSV file isn’t significantly more readable than the file above.

So if your only reason to use CSV (w/ quoting+escaping) is that it’s human-readable, just don’t. Such files aren’t really human-readable anyway; they’re meant to be consumed/processed by a program, and for that there are better alternatives.

Finally, we use non-human-readable files all the time (images, audio, video, zips, etc, etc)! I’ve never seen anyone complain their images/audio/etc aren’t human-readable.
