I’m working on publishing a summary of sites using tile.osm.org and want to know what format would be most useful for people.
The information I’ll be publishing is requests/second, requests/second that were cache misses, and domain. The first two are guaranteed to be numbers, while the last one is a string that will typically be a domain name like www.openstreetmap.org, but could theoretically contain a poisoned value like a space.
The existing logs which have tiles and number of requests are formatted as z/x/y N
where z/x/y are tile coordinates and N is the number of accesses.
My first thought was TPS TPS_MISS DOMAIN, space-separated like the existing logs. This would work, with the downside that it’s not very future-proof. Because the domain can theoretically contain a space, it has to be the last column, which means any future additions will require re-ordering the columns, breaking existing usage. Additionally, I’d really prefer to have the domain at the start of the line.
A couple of options are:
- CSV, with escaping
- tab-delimited
Potential users, what would work well with the languages and libraries you prefer?
An example of the output right now is
1453.99 464.1 www.openstreetmap.org
310.3 26.29 localhost
136.46 39.68 dro.routesmart.com
123.65 18.54 www.openrailwaymap.org
107.98 0.05 www.ad-production-stage.com
96.64 1.78 r.onliner.by
91.42 0.16 solagro.org
87.83 1.53 tvil.ru
84.88 12.98 eae.opekepe.gov.gr
74.0 2.32 www.mondialrelay.fr
63.44 1.93 www.lightningmaps.org
63.22 14.01 nakarte.me
55.1 0.74 qualp.com.br
52.77 11.25 apps.sentinel-hub.com
46.68 4.07 127.0.0.1
46.3 1.96 www.gites-de-france.com
43.47 1.15 www.anwb.nl
42.46 10.52 dacota.lyft.net
41.13 6.63 www.esri.com
40.84 0.69 busti.me
Discussion
Comment from Mateusz Konieczny on 25 June 2022 at 14:57
I occasionally used such things, and for me CSV (with escaping and comma delimited) is a good format.
Tab-delimited is weird, but I guess something can be cobbled together to use it, especially if it can be assumed that domain names contain no tabs, or at least that tabs are escaped.
Providing a total sum for everything, including domains not listed, would be useful if possible to estimate the long tail, unless every single domain is published. Something like
14531.99 4641.1 *
? Or as a separate entry? Is it possible for the domain to be unknown?
Comment from Jochen Topf on 26 June 2022 at 19:55
CSV is great for really simple data, but once you need to get into escaping and all that, it becomes annoying really fast. How about an SQLite database?
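For comparison, a minimal sketch of what the SQLite route might look like using Python’s built-in sqlite3 module. The table and column names here are made up for illustration; the thread doesn’t specify a schema.

```python
import sqlite3

# Hypothetical schema for illustration only.
con = sqlite3.connect(":memory:")  # would be a file on disk in practice
con.execute("CREATE TABLE hosts (domain TEXT, tps REAL, tps_miss REAL)")
con.execute(
    "INSERT INTO hosts VALUES (?, ?, ?)",
    ("www.openstreetmap.org", 1453.99, 464.1),
)

# A domain containing a comma, tab, or space needs no escaping at all here.
row = con.execute(
    "SELECT tps, tps_miss FROM hosts WHERE domain = ?",
    ("www.openstreetmap.org",),
).fetchone()
print(row)  # (1453.99, 464.1)
```

New columns can be added later with ALTER TABLE without breaking existing readers, which is the future-proofing concern raised above.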
Comment from pnorman on 27 June 2022 at 05:22
I’ve gone with CSV, quoting the string field. On the technical side, it’s using Python’s csv.writer with unix_dialect and QUOTE_NONNUMERIC.
SQLite would be overkill for what is 15 kB of data in text form in a single table.
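For reference, a minimal sketch of that writer setup. The column order ("domain",tps,tps_miss) is inferred from the published hosts file rather than stated outright:

```python
import csv
import io

buf = io.StringIO()
# unix_dialect uses '\n' line endings and '"' quoting; QUOTE_NONNUMERIC
# quotes every non-numeric field, so only the domain string gets quotes.
writer = csv.writer(buf, dialect="unix", quoting=csv.QUOTE_NONNUMERIC)
writer.writerow(["www.openstreetmap.org", 1453.99, 464.1])
print(buf.getvalue(), end="")  # "www.openstreetmap.org",1453.99,464.1
```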
If I were truly reporting website domains, I wouldn’t worry about quoting, since domain names have a restricted set of characters. Instead, what I’m reporting is extracted domains from the referer header, and there’s nothing that stops an app from sending
https://www.foo,bar.com/
as a referer header. They’d eventually get blocked for sending a fake header, but that doesn’t rule out such values appearing in the logs. That is also related to the origin of “” in the hosts log: those entries come from referers that don’t parse as a valid URI.
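Because the domain field is quoted, a comma smuggled in through the referer stays inside one column when the file is read back. A sketch (the sample line is made up for illustration):

```python
import csv
import io

# Made-up line in the published format, with a comma inside the domain.
sample = '"www.foo,bar.com",1.0,0.5\n'
reader = csv.reader(io.StringIO(sample), dialect="unix",
                    quoting=csv.QUOTE_NONNUMERIC)
row = next(reader)
# Quoting keeps the comma inside the first field, and QUOTE_NONNUMERIC
# converts the unquoted fields back to float on the way in.
print(row)  # ['www.foo,bar.com', 1.0, 0.5]
```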
In practice, for the type of log processing I’m doing, I’m likely to ignore that. If I run a command like
fgrep '"umap.openstreetmap.fr"' hosts* | sed 's/hosts-//' | sed 's/.csv:"umap.openstreetmap.fr"//' > umap.csv
I get a CSV I can open in a spreadsheet and do further work with. The first file is up at https://planet.openstreetmap.org/tile_logs/hosts-2022-06-26.csv. I’m preparing the files to backfill into 2021.
Comment from jleedev on 27 June 2022 at 10:47
Including a header row is something to be expected in a CSV file, especially if you’re aware that the file format may evolve in the future.
DOMAIN,TPS,TPS_MISS
Comment from mmd on 27 June 2022 at 19:53
Somehow I’m missing the bigger picture here. Should site owners check this information to evaluate their own resource usage and find out how they compare to others?
Comment from o_andras on 28 June 2022 at 12:09
I fail to understand how this question of which separator to use in a DSV file for simple tabular data still comes up, since it’s been solved for decades without resorting to subpar “solutions” (e.g. quoting & escaping) that unnecessarily preclude handling the files with simple tools: the *nix shell and related tools (cut, sed, sort, read, etc.), a simple string split in a programming language, and so on…
Comment from o_andras on 28 June 2022 at 12:15
PS: using tabs instead of comma+quoting+escaping is the opposite of weird, and in general (for what people normally use CSV for) it’s a step in the right direction. Python’s csv module, for example (since it’s been mentioned), supports “arbitrary” separators as a parameter, IIRC.
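As a sketch of that parameter (delimiter is the csv module’s name for the separator), switching from commas to tabs is a one-argument change:

```python
import csv
import io

buf = io.StringIO()
# Same writer API as before; only the separator differs.
writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
writer.writerow(["www.openstreetmap.org", 1453.99, 464.1])
print(repr(buf.getvalue()))  # 'www.openstreetmap.org\t1453.99\t464.1\n'
```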
Comment from jleedev on 28 June 2022 at 12:21
Can you say for sure that nobody has included \x1e, \x1f, \t, \n, or indeed quote and comma, anywhere in the Host: header of a tile request, ever? No tabular data is “simple”.
Comment from o_andras on 28 June 2022 at 13:10
@jleedev
That should be obviously false to anyone…
There are two sides to that coin:
RE (1): obviously you can’t because it’s user input.
RE (2): like @mmd said, “I’m missing the bigger picture here.” I don’t know what the purpose of these files is, I don’t know who these files are for, I don’t know if they’ll be automatically updated regularly, I don’t know if the files will be published raw (untreated) or not, etc, etc – because these details aren’t explicitly stated anywhere.
For example, if after the raw data is gathered you check that there’s no occurrence of a {tab,space,US,comma,…}, then you can safely use {tab,space,US,comma,…} as the separator of the published file – but this assumes that the data isn’t automatically updated regularly.
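That check could be as simple as scanning the gathered values before committing to a separator. A sketch with made-up sample data:

```python
# Made-up sample of gathered domain values.
domains = ["www.openstreetmap.org", "nakarte.me", "127.0.0.1"]

sep = "\t"
# Only use sep if it provably never occurs in the data.
if any(sep in d for d in domains):
    raise ValueError("separator occurs in data; pick another")
line = sep.join(["www.openstreetmap.org", "1453.99", "464.1"])
print(repr(line))  # 'www.openstreetmap.org\t1453.99\t464.1'
```

As noted above, this only holds for a one-off export; a regularly regenerated file could grow a poisoned value later.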
That’s an implementation detail a user shouldn’t have to think about, and, personally, I have no idea what the unix_dialect is… Additionally, it’s not guaranteed to be compatible across different programming languages/DSV libraries/programs, even though these quoting+escaping rules may look standard. Before I can use the file, I (as the user) have to make sure whatever program I want to use correctly supports all these implementation details.
OTOH, @Jochen Topf suggested using an SQLite file, which looks to me like a much better alternative even if it sounds overkill at first sight. Among its many pros over some DSV format, it can be inspected and queried with the standard sqlite CLI program.
Hopefully that makes clear what I was trying to express: if you can’t guarantee interoperability through simplicity (i.e. without hacks), then at least guarantee interoperability through actual standards.
Comment from pnorman on 29 June 2022 at 03:39
No. The goal is to improve the ability of the community to see how the service is being used and reduce the number of times admins need to run ad-hoc queries to see usage. We’ve done this in the past for user-agents.
In particular, historical usage was exceptionally difficult to query because of the large amount of data (~130GB/day compressed) that needed to be queried.
Comment from Mateusz Konieczny on 29 June 2022 at 07:30
But calling it CSV is weird. Tab-delimited “CSV” is not CSV.
Comment from Mateusz Konieczny on 29 June 2022 at 07:32
Files which are non-human readable are obnoxious for various reasons, so I am not surprised that https://en.m.wikipedia.org/wiki/Delimiter#ASCII_delimited_text are rarely used in practice.
Comment from o_andras on 29 June 2022 at 18:20
Totally agree that calling tab-delimited CSV “CSV” is silly!
About the human-readable part: that could be a reason, but I don’t think that’s it… There’s so much that can be said about “human-readable-ness” I don’t even know where to begin, apart from the fact that “human-readable” is ambiguous.
In the sense of being (or not) “text”: both less and vim display these characters, so they’re definitely readable. Example file of two records, one with values a1, a2, a3, and another with values b1, b2, b3:
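The example file itself is not reproduced here; a sketch of what ASCII-delimited text for those two records could look like, using the unit separator (0x1F) between fields and the record separator (0x1E) between records:

```python
US, RS = "\x1f", "\x1e"  # ASCII unit separator and record separator

records = [["a1", "a2", "a3"], ["b1", "b2", "b3"]]
data = RS.join(US.join(fields) for fields in records)

# Parsing is a plain string split; no quoting or escaping machinery needed.
parsed = [rec.split(US) for rec in data.split(RS)]
print(parsed)  # [['a1', 'a2', 'a3'], ['b1', 'b2', 'b3']]
```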
In the sense of being easy(-ish) to read: I wouldn’t call heavily quoted and escaped CSV files readable either. How many columns does this record have? How many (non-)empty columns does it have? What’s the value of the 5th column? You can’t answer any of these questions at a glance, so such a CSV file isn’t significantly more readable than the file above.
So if your only reason to use CSV (with quoting+escaping) is that it’s human-readable, just don’t: it isn’t anyway. These files are meant to be consumed and processed by a program, and for that there are better alternatives.
Finally, we use non-human-readable files all the time (images, audio, video, zips, etc, etc)! I’ve never seen anyone complain their images/audio/etc aren’t human-readable.