At a guess, you exhausted all mbufs. FreeBSD 10 has much better defaults for
these, so I'd recommend updating.
If you can get in via IPMI or something similar you should be able to
confirm.
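For what it's worth, here is a rough sketch of that check, assuming you can
capture the output of "netstat -m" from the console. It assumes the usual
FreeBSD line formats ("... mbuf clusters in use (current/cache/total/max)"
and "... requests for mbufs denied ..."). The function names, the 90%
threshold and the sample numbers are made up for illustration, and comparing
"total" against "max" is my own rule of thumb.

```python
import re

# Illustrative sample only: these numbers are invented, not from a real host.
SAMPLE = """\
25600/0/25600/25600 mbuf clusters in use (current/cache/total/max)
0/1532/1532 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
"""

def parse_netstat_m(text):
    """Pull cluster usage and the denied-request flag out of `netstat -m` text."""
    stats = {}
    for line in text.splitlines():
        m = re.match(r"\s*(\d+)/(\d+)/(\d+)/(\d+) mbuf clusters in use", line)
        if m:
            _current, _cache, total, limit = map(int, m.groups())
            stats["clusters_total"] = total
            stats["clusters_max"] = limit
        m = re.match(r"\s*(\d+)/(\d+)/(\d+) requests for mbufs denied", line)
        if m:
            # True if any of the three denied counters is non-zero.
            stats["denied"] = any(int(g) > 0 for g in m.groups())
    return stats

def looks_exhausted(stats, threshold=0.9):
    """Heuristic: clusters near the configured max, or any denied requests."""
    limit = stats.get("clusters_max", 0)
    near_limit = limit > 0 and stats["clusters_total"] >= threshold * limit
    return near_limit or stats.get("denied", False)

print(looks_exhausted(parse_netstat_m(SAMPLE)))
```

Non-zero "denied" counters are the stronger hint; the near-the-limit test is
only there to catch a box that is about to run out.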
A trick I've used in the past to recover from such an issue is to hard
bounce the NIC ports on the switch, which seemed to free enough mbufs to
let me ssh in.
On 05/11/2014 11:49, Matthew Seaman wrote:
> Dear all,
>
> We had an unfortunate set of circumstances which resulted in several
> million people all trying to download about 1.5MB worth of images from
> our servers over the course of a few hours. Or, at least, it would have
> been a few hours, except that our three varnish proxies just crumbled
> under the load within 10 minutes.
>
> Now, that's bad enough, but we could have just about coped if the
> proxies stopped serving requests for a few minutes. What actually
> happened was that all three servers went catatonic on the network *and
> stayed that way*: even when we shunted the traffic away from one, we
> still couldn't access it via ssh or any network protocol. And it stayed
> like that for sufficiently long time that we had no recourse other than
> to get the servers rebooted.
>
> Can anyone explain what was happening here? Not having the servers
> recover accessibility for an extended period even after the excess
> traffic was stopped is unacceptable. We're also struggling to recreate
> the effect in the lab: any clues about how to do so, and any suggestions
> about how to prevent the 'going catatonic' response would be greatly
> appreciated.
>
> Servers are amd64 running FreeBSD 9.1 or 9.2 and Varnish 3.0.5.
>
>
> Cheers,
>
> Matthew
>
>
>