Karl Denninger
2013-Mar-04 22:48 UTC
ZFS "stalls" -- and maybe we should be talking about defaults?
Well now this is interesting.

I have converted a significant number of filesystems to ZFS over the last week or so and have noted a few things. A couple of them aren't so good.

The subject machine in question has 12GB of RAM and dual Xeon 5500-series processors. It also has an ARECA 1680ix in it with 2GB of local cache and the BBU for it. The ZFS spindles are all exported as JBOD drives. I set up four disks under GPT, added a single freebsd-zfs partition to each, labeled them, and the providers were then geli-encrypted and added to the pool. When the same disks were running UFS filesystems they were set up as a 0+1 RAID array under the ARECA adapter, exported as a single unit, GPT-labeled as a single pack and then gpart-sliced and newfs'd under UFS+SU.

Since I previously ran UFS filesystems on this config I know what performance level I achieved with that, and the entire system had been running flawlessly set up that way for the last couple of years. Presently the machine is running 9.1-STABLE, r244942M.

Immediately after the conversion I set up a second pool to play with backup strategies to a single drive and ran into a problem. The disk I used for that testing is one that was previously in the rotation and is also known good. I began to get EXTENDED stalls with zero I/O going on, some lasting for 30 seconds or so. The system was not frozen, but anything that touched I/O would lock until it cleared. Dedup is off, incidentally.

My first thought was that I had a bad drive, cable or other physical problem. However, searching for that proved fruitless -- there was nothing being logged anywhere: not in the SMART data, not by the adapter, not by the OS. Nothing. Sticking a digital storage scope on the +5V and +12V rails didn't disclose anything interesting with the power in the chassis; it's stable. Further, swapping the only disk that had changed (the new backup volume) with a different one didn't change behavior either.

The last straw was when I was able to reproduce the stalls WITHIN the original pool, against the same four disks that had been running flawlessly for two years under UFS, and still couldn't find any evidence of a hardware problem (not even ECC-corrected data returns). All the disks involved are completely clean -- zero sector reassignments, the drive-specific log is clean, etc.

Attempting to cut back the ARECA adapter's aggressiveness (buffering, etc.) on the theory that I was tickling something in its cache management algorithm that was pissing it off proved fruitless as well, even when I shut off ALL caching and NCQ options. I also set vfs.zfs.prefetch_disable=1 to no effect. Hmmmm...

Last night, after reading the ZFS Tuning wiki for FreeBSD, I went on a lark and limited the ARC cache to 2GB (vfs.zfs.arc_max=2000000000), set vfs.zfs.write_limit_override to 1024000000 (1GB) and rebooted.

The problem instantly disappeared, and I cannot provoke its return even with multiple full-bore snapshot and rsync filesystem copies running while a scrub is being done.

I'm now alternating between being I/O-limited and processor (geli) limited in normal operation, and slamming the I/O channel during a scrub. Performance appears to be roughly equivalent to, maybe a bit less than, what it was with UFS+SU -- but it's fairly close.
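For reference, a minimal sketch of where those two knobs live (the values are the ones quoted above; whether write_limit_override needs a reboot or can also be set live via sysctl on 9.x is not established in this thread):

    # /boot/loader.conf -- read at boot
    vfs.zfs.arc_max="2000000000"                # cap the ARC at ~2GB
    vfs.zfs.write_limit_override="1024000000"   # cap each transaction
                                                # group's write burst at ~1GB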
The operating theory I have at the moment is that the ARC cache was in some way getting into a near-deadlock with other memory demands on the system (there IS a Postgres server running on this hardware; although it's a replication server and not taking queries, it does grab a chunk of RAM), leading to the stalls. Limiting its grab of RAM appears to have resolved the contention issue. I was unable to catch it actually running out of free memory, although it was consistently into the low five-digit free page count, and the kernel never garfed on the console about resource exhaustion -- other than a bitch about swap stalling (the infamous "more than 20 seconds" message). Page space in use near the time in question (I could not get a display while locked, as it went to I/O and froze) was not zero, but pretty close to it (a few thousand blocks). That the system was driven into light paging does appear to be significant and indicative of some sort of memory contention issue, as under operation with UFS filesystems this machine has never been observed to allocate page space.

Has anyone seen anything like this before, and if so... is this a case of bad defaults, or some bad behavior between various kernel memory allocation contention sources?

This isn't exactly a resource-constrained machine: x64 code, 12GB of RAM and two quad-core processors!

--
Karl Denninger
The Market Ticker <http://market-ticker.org>
Cuda Systems LLC
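[A minimal sketch of logging that could confirm the free-memory theory above. The sysctl names are standard FreeBSD; the one-second interval and log path are arbitrary choices, not from this thread.]

    #!/bin/sh
    # Record free pages, ARC size and swap use once a second so a stall
    # window can be correlated with memory pressure after the fact.
    while :; do
        free=$(sysctl -n vm.stats.vm.v_free_count)      # free pages
        arc=$(sysctl -n kstat.zfs.misc.arcstats.size)   # ARC size in bytes
        swap=$(swapinfo | awk 'END { print $3 }')       # swap 1K-blocks used
        echo "$(date +%T) free_pages=$free arc_bytes=$arc swap_used=$swap"
        sleep 1
    done >> /var/tmp/stall-memlog.txt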
Steven Hartland
2013-Mar-05 00:33 UTC
ZFS "stalls" -- and maybe we should be talking about defaults?
What does zfs-stats -a show when you're having the stall issue?

You can also use zpool iostat to show per-disk I/O statistics, which may help identify a single failing disk, e.g.:

    zpool iostat -v 1

Also, have you investigated which of the two sysctls you changed fixed it, or does it require both?

Regards,
Steve
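A sketch of how that might be captured during a stall (zfs-stats comes from the sysutils/zfs-stats port; gstat, not mentioned above, is a base-system GEOM monitor that is also useful here):

    zfs-stats -a > /var/tmp/zfs-stats.txt   # ARC/L2ARC/tunables snapshot
    zpool iostat -v 1                       # per-vdev and per-disk I/O, 1s interval
    gstat -p                                # per-disk latency and queue depth

To answer the which-sysctl question, one tunable would be reverted at a time: leave vfs.zfs.arc_max in /boot/loader.conf, comment out vfs.zfs.write_limit_override, reboot and try to reproduce the stall -- then swap.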
Dennis Glatting
2013-Mar-05 02:07 UTC
ZFS "stalls" -- and maybe we should be talking about defaults?
I get stalls with 256GB of RAM with arc_max=64G (my limit is usually 25%) on a 64-core system with 20 new 3TB Seagate disks under LSI2008 chips, without much load. Interestingly, pbzip2 consistently created a problem on a volume whereas gzip did not.

Here, stalls happen across several systems; however, I have had fewer problems under 8.3 than under 9.1. If I go to hardware RAID5 (LSI2008 -- same chips, IR vs. IT firmware) I don't have a problem.
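[A sketch of the kind of A/B load that distinction suggests. pbzip2 is from the archivers/pbzip2 port; the path and size here are illustrative placeholders, not from Dennis's setup.]

    # Build a test file, then compare a parallel compressor (bursty,
    # many-threaded writes) against a single-threaded one on the same pool.
    dd if=/dev/urandom of=/pool/test/blob bs=1m count=4096   # ~4GB input
    time pbzip2 -k /pool/test/blob   # parallel bzip2 -- the reported trigger
    time gzip -k /pool/test/blob     # single-threaded -- reportedly fine
    # -k keeps the input file, so both runs compress identical data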
Peter Jeremy
2013-Mar-07 07:21 UTC
ZFS "stalls" -- and maybe we should be talking about defaults?
On 2013-Mar-04 16:48:18 -0600, Karl Denninger <karl at denninger.net> wrote:
> The subject machine in question has 12GB of RAM and dual Xeon
> 5500-series processors. It also has an ARECA 1680ix in it with 2GB of
> local cache and the BBU for it. The ZFS spindles are all exported as
> JBOD drives. I set up four disks under GPT, have a single freebsd-zfs
> partition added to them, are labeled and the providers are then
> geli-encrypted and added to the pool.

What sort of disks? SAS or SATA?

> also known good. I began to get EXTENDED stalls with zero I/O going on,
> some lasting for 30 seconds or so. The system was not frozen but
> anything that touched I/O would lock until it cleared. Dedup is off,
> incidentally.

When the system has stalled:
- Do you see very low free memory?
- What happens to all the different CPU utilisation figures? Do they all
  go to zero? Do you get high system or interrupt CPU (including going
  to 1 core's worth)?
- What happens to interrupt load? Do you see any disk controller
  interrupts?

Would you be able to build a kernel with WITNESS (and WITNESS_SKIPSPIN)
and see if you get any errors when stalls happen?

On 2013-Mar-05 14:09:36 -0800, Jeremy Chadwick <jdc at koitsu.org> wrote:
> On Tue, Mar 05, 2013 at 01:09:41PM +0200, Andriy Gapon wrote:
>> Completely unrelated to the main thread:
>>
>> on 05/03/2013 07:32 Jeremy Chadwick said the following:
>>> That said, I still do not recommend ZFS for a root filesystem
>>
>> Why?
>
> Too long a history of problems with it and weird edge cases (keep
> reading); the last thing an administrator wants to deal with is a
> system where the root filesystem won't mount/can't be used. It makes
> recovery or problem-solving (i.e. the server is not physically
> accessible given geographic distances) very difficult.

I've had lots of problems with a gmirrored UFS root as well. The biggest
issue is that gmirror has no audit functionality, so you can't verify
that both sides of a mirror really do have the same data.

> My point/opinion: UFS for a root filesystem is guaranteed to work
> without any fiddling about and, barring drive failures or controller
> issues, is (again, my opinion) a lot more risk-free than ZFS-on-root.

AFAIK, you can't boot from anything other than a single disk (ie no
graid).

--
Peter Jeremy
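[A minimal sketch of the WITNESS kernel Peter asks for; MYKERNEL is a placeholder config name.]

    # /usr/src/sys/amd64/conf/MYKERNEL
    include GENERIC
    ident   MYKERNEL
    options WITNESS            # report lock-order reversals on the console
    options WITNESS_SKIPSPIN   # skip spin-lock checks to keep overhead sane

    # Build, install and reboot, then watch the console/dmesg during a stall:
    # cd /usr/src && make buildkernel KERNCONF=MYKERNEL && \
    #     make installkernel KERNCONF=MYKERNEL && shutdown -r now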