
Heavy Usage on OSM Sites

Posted by alexkemp on 4 November 2016 in English. Last updated on 6 November 2016.

At 09:45 UTC the OSM sites came back up after showing ‘too many people accessing’ notices for ~45 minutes. Do we have any info on what was happening, please?

The Existential Question

08:45 5 November
Having slept on this, I’ve come to realise that it is far more important than I first thought, since it concerns the very existence of OSM: in what form is it going to continue? (All life on this planet grows, withers or transmutes; what is OSM going to do?).

In short (there is fuller info with links at bottom):

  1. 28 Oct: zerebubuth (Matt Amos) opened a discussion on GitHub: the tile-servers are hitting capacity; what policy is OSM going to follow in the future? Restrict to open-source apps, or restrict to publicly-accessible sites (notice that both options involve denial of access)?
     
    There are 59 machines internationally (including 20 globally distributed tile-caches) serving up OSM tiles to anybody who asks for them, funded entirely by donations. Outbound traffic on tile.openstreetmap.org (served from cache) peaks at 1 Gbit/s (6,800 GB/day; 528 million requests at 13.6 kB/request), but a recent measurement indicates that just 11% of this is supplied to OSM websites; the rest goes to 3rd-party sites and apps. (A back-of-envelope check of these figures follows this list.)
     
  2. 4 Nov: the system seized up tight.
     
    ironbelly had to be restarted (look at ‘uptime’) and, at a guess, re-initialised on the ramoth network (look at ‘memory’ and also ‘network’): the server system became saturated with FIN_WAIT & TIME_WAIT connections for the whole of 2 Nov, possibly indicating TIME-WAIT loading problems. (A sketch for counting these socket states also follows this list.)
     
  3. As soon as you provide something good at zero cost there is large demand. How is OSM going to respond? 
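
The figures in point 1 hang together. As a back-of-envelope check (my own arithmetic, not an official OSM measurement), a few lines of Python reproduce the daily volume and the average rate implied by 528 million requests at 13.6 kB each:

```python
# Back-of-envelope check of the tile.openstreetmap.org figures quoted above:
# ~528 million cached requests/day at ~13.6 kB each is roughly 7,000 GB/day,
# and the implied average rate sits well below the quoted 1 Gbit/s peak.

requests_per_day = 528e6      # requests served from cache per day
bytes_per_request = 13.6e3    # average response size (13.6 kB)
seconds_per_day = 86_400

daily_bytes = requests_per_day * bytes_per_request
average_bits_per_second = daily_bytes * 8 / seconds_per_day

print(f"daily volume : {daily_bytes / 1e9:,.0f} GB/day")             # ~7,181
print(f"average rate : {average_bits_per_second / 1e9:.2f} Gbit/s")  # ~0.66
print("peak rate    : ~1 Gbit/s (quoted), i.e. roughly 1.5x the average")
```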
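
The FIN_WAIT/TIME_WAIT saturation mentioned in point 2 is easy to look for on any Linux box. The sketch below is only illustrative (it is not the OSM sysadmins’ tooling): it counts TCP socket states by reading /proc/net/tcp and /proc/net/tcp6, the same data that ss and netstat summarise.

```python
# Count TCP socket states on a Linux host by reading /proc/net/tcp{,6}.
# Illustrative only -- not the OSM admins' tooling. A box drowning in
# TIME_WAIT / FIN_WAIT sockets shows very large counts for those states.
from collections import Counter
from pathlib import Path

# Kernel TCP state codes (hex), as used in the 'st' column of /proc/net/tcp.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT",   "03": "SYN_RECV",
    "04": "FIN_WAIT1",   "05": "FIN_WAIT2",  "06": "TIME_WAIT",
    "07": "CLOSE",       "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN",      "0B": "CLOSING",
}

def count_states() -> Counter:
    counts = Counter()
    for table in ("/proc/net/tcp", "/proc/net/tcp6"):
        path = Path(table)
        if not path.exists():
            continue
        for line in path.read_text().splitlines()[1:]:  # skip header row
            fields = line.split()
            if len(fields) > 3:
                counts[TCP_STATES.get(fields[3], fields[3])] += 1
    return counts

if __name__ == "__main__":
    for state, n in count_states().most_common():
        print(f"{state:12} {n}")
```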

Added at 24:00 4 November:

Zero assistance from OSM folk, so I had to find an answer myself. It took a long while.

Found it on the Munin OSM site. In particular: a critical error where the Chef node status on ironbelly went AWOL between 18:00 Thu 3 Nov & 09:45 Fri 4 Nov (whoops). In addition, the ironbelly + ramoth disks show a gap at ~9am; presumably the server was stopped, fixed & restarted, with a 45-minute gap as I experienced this morning. As further proof of an issue, the Squid Client Requests began to show gaps in reporting on 31 October & the following days, then reporting died completely on Thursday & has not resumed.

Of course, the important point is why & not what (that has also been my point to Harry in the comments). The answer is probably here, as discussed on GitHub.

PS
The tile-server peaks at >1 Gbit/s outbound, serving from cache. Phew!

No Direct Connection Between This Recent Fail & the Tileserver Issue

Andy Allan has left a comment pointing out that there is zero connection between the recent issues & Matt’s discussion.

Andy actually works on the physical servers & therefore knows a touch more about them than I do. He points out that what I detailed above were two separate issues, each of which is unconnected with the tileserver discussion:

  1. Extra Disk Drives for ironbelly
    Fri 4 Nov: extra disk drives were added to ironbelly, and the extra load from the subsequent RAID rebuild caused it to fail. ironbelly provides services that other servers depend upon, including the servers that power the website and the editing API (which each also failed). The server guys had to make some config changes so that ironbelly continued to provide essential services whilst the RAID rebuild continued (the first sketch after this list illustrates the kind of throttling knob involved).
  2. DB Server Index Rebuild
    Wed 2 Nov: the databases are arranged in primary (katla) & slave fashion; the indexes were rebuilt on katla (making them smaller & faster), and the vast network traffic was those rebuilt indexes being transmitted out to the slaves (such as ramoth); see the second sketch after this list.
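
For readers (like me) wondering what those “config changes” might look like: on Linux software RAID (md) the rebuild rate can be capped with the dev.raid.speed_limit_* sysctls, and rebuild progress is visible in /proc/mdstat. The sketch below is purely illustrative, my guess at the kind of knob involved, not a record of what was actually done on ironbelly:

```python
# Illustrative only: cap a Linux md RAID rebuild so it doesn't starve other
# services. This shows one standard kernel knob for slowing a rebuild, not
# what was actually done on ironbelly. Writing the sysctl requires root.
from pathlib import Path

SPEED_LIMIT_MAX = Path("/proc/sys/dev/raid/speed_limit_max")  # KB/s per device
MDSTAT = Path("/proc/mdstat")

def show_rebuild_status() -> None:
    """Print the kernel's view of any ongoing md resync/rebuild."""
    if MDSTAT.exists():
        print(MDSTAT.read_text())

def cap_rebuild_speed(kb_per_sec: int) -> None:
    """Lower the maximum rebuild rate (the default is usually 200000 KB/s)."""
    print(f"current limit: {SPEED_LIMIT_MAX.read_text().strip()} KB/s")
    SPEED_LIMIT_MAX.write_text(f"{kb_per_sec}\n")  # needs root
    print(f"new limit    : {kb_per_sec} KB/s")

if __name__ == "__main__":
    show_rebuild_status()
    # e.g. throttle the rebuild to ~20 MB/s so the machine stays responsive
    cap_rebuild_speed(20_000)
```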
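
The second sketch covers point 2: a minimal illustration of rebuilding indexes on the primary, assuming a PostgreSQL primary with streaming replication (the connection string and table names are made up). The rebuilt index pages reach the replicas through the replication stream, which is why a rebuild on the primary shows up as heavy outbound network traffic to the slaves:

```python
# Minimal sketch: rebuild indexes on the PRIMARY database server. With
# PostgreSQL streaming replication the rebuilt index pages travel to the
# replicas via the WAL stream, so the rebuild appears as heavy outbound
# network traffic to the slaves. DSN and table names here are invented.
import psycopg2

def rebuild_indexes(dsn: str, tables: list) -> None:
    conn = psycopg2.connect(dsn)
    conn.autocommit = True          # run each REINDEX as its own statement
    try:
        with conn.cursor() as cur:
            for table in tables:
                print(f"reindexing {table} ...")
                cur.execute(f"REINDEX TABLE {table}")
    finally:
        conn.close()

if __name__ == "__main__":
    # hypothetical connection string and table names
    rebuild_indexes("dbname=osm_core user=maintenance", ["nodes", "ways"])
```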

Discussion

Comment from Harry Wood on 4 November 2016 at 09:55

Why did you post this question as a user diary entry?

Comment from alexkemp on 4 November 2016 at 11:01

Hi Harry

Because I do not know any quicker or better way to get the info. I take it that you do not know either?

Comment from Harry Wood on 4 November 2016 at 14:59

It might be a “quick” way to the info, but a “better” way to get the info would be the talk mailing list, the osmf-talk mailing list, IRC, the OWG contact email, a GitHub issue, or …actually pretty much any of the contact channels would have been more appropriate than a diary entry.

Comment from alexkemp on 4 November 2016 at 18:19

You are big on declamation but short on reason. Why are they better channels? And why not a diary entry? What’s the big deal that causes you to come over so proscriptive so quickly?

Comment from Andy Allan on 6 November 2016 at 13:18

Hi Alex - the question of tileserver policy and the outage on Friday are unrelated. Friday’s outage was related to the addition of more disks to our “services” machine Ironbelly, and while the RAID array was rebuilding, the extra load caused the machine to become overloaded. Other machines depend on ironbelly, including the servers that power the website and the editing API. We slowed the RAID rebuild, and stopped some non-essential services (including chef runs, log analysis etc.), to get the website back online as quickly as possible.

The network traffic you see on the 2nd November to ramoth is unrelated to the above. In this case, we were rebuilding indexes on the primary database server katla, to compact them, which saves space and makes things work faster. The large amount of network traffic is the main server (katla) distributing the rebuilt indexes out to the slave database servers, including ramoth.

These are all isolated from the tileservers, which have their own servers and databases, and weren’t affected by either the ironbelly issues or the coredb index rebuilding. The policy discussion on them is a long-term thing, and the amount of traffic to the tileservers doesn’t have any impact on the traffic to the website and/or the API.

I hope this provides some more information to you!

Comment from alexkemp on 6 November 2016 at 19:21

Thanks Andy, I’ll add a Coda to the Diary pointing out your comments.
