Rince
2009-May-13 02:10 UTC
[zfs-discuss] With RAID-Z2 under load, machine stops responding to local or remote login
Hi world,

I have a 10-disk RAID-Z2 system with 4 GB of DDR2 RAM and a 3 GHz Core 2 Duo. It's exporting ~280 filesystems over NFS to about half a dozen machines.

Under some loads (in particular, any attempt to rsync between another machine and this one over SSH), the machine's load average sometimes goes insane (27+), and it appears to all be in kernel-land: nothing in userland reports more than 5% CPU usage, while top reports 50%+ CPU usage.

I say 27+ because when the load spikes this high, the machine stops responding to any meaningful commands. Console login will take a username and password, then hang forever without printing anything. SSH login will block forever without prompting for a username or password. The machine responds to ping. NFS drops.

This is snv_113, and the problem has occurred ever since the RAID-Z2 was created (b102).

I have no idea how to instrument this. It doesn't appear to be panicking or running out of RAM (as far as I can see from the last responses of top and prstat), and I don't know how to ask dtrace where I'm mostly spending my time. I read one or two guides, but I don't follow how their output is meaningful.

I'm sending this to zfs-discuss because I can't replicate this problem unless I'm doing heavy I/O on ZFS.

(Final note: this 10-disk pool is serviced by an ARC-1280ML, and while the kernel is heavily under load, zpool iostat -v reports no more than 1 MB/s per disk, and almost always on the order of 128 KB/s.)

- Rich

--
The generation of random numbers is too important to be left to chance.
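For the "how do I ask dtrace where the kernel is spending its time" question, a minimal sketch using the stock profile provider (assumes root on the OpenSolaris box; the sampling rate and 30-second window are arbitrary choices, not anything from this thread):

```
# Sample kernel stacks ~997 times/sec on CPUs executing in the kernel
# (/arg0/ is non-zero when the interrupted PC was in kernel context),
# then print only the 10 hottest stacks after 30 seconds.
dtrace -n '
profile-997
/arg0/
{
        @[stack()] = count();
}
tick-30s
{
        trunc(@, 10);
        exit(0);
}' > hotstacks.txt
```

Run it while reproducing the rsync load; the stacks that dominate the output are where the kernel time is going.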
James C. McPherson
2009-May-13 04:37 UTC
[zfs-discuss] With RAID-Z2 under load, machine stops responding to local or remote login
On Tue, 12 May 2009 22:10:58 -0400 Rince <rincebrain at gmail.com> wrote:

> [original message trimmed]
>
> (Final note - this 10-disk pool is serviced by an ARC 1280ML, and
> during the time the kernel is heavily under load, zpool iostat -v is
> reporting no more than 1 MB/s per disk, and almost always to the tune
> of 128 KB/s.)

Ah, this last snippet of information is interesting (to me at least, since I integrated the arcmsr driver).

Is the ARC-1280ML in RAID or JBOD mode? Are you using the Sun-supplied arcmsr(7d) driver, or the Areca version?

You might want to try running the attached D script, dumping the output to a file.

James C. McPherson
--
Senior Kernel Software Engineer, Solaris
Sun Microsystems
http://blogs.sun.com/jmcp http://www.jmcp.homeunix.com/blog
Kernel Conference Australia - http://au.sun.com/sunnews/events/2009/kernel

[Attachment: arcmsr.d.all, application/octet-stream, 2659 bytes - <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20090513/4b4dfb5b/attachment.obj>]
Scott Duckworth
2009-May-15 18:58 UTC
[zfs-discuss] With RAID-Z2 under load, machine stops responding to local or remote login
Are you running compression on the filesystems you're rsync'ing to? That will drive the load average up pretty high, and it's in the kernel (from what I can tell). In particular, I've seen gzip compression on ZFS filesystems push the load average over 60 when running multiple parallel rsyncs over SSH, with prstat/top showing little userland CPU usage.

We're running on 2 cores (8 threads per core) of an UltraSPARC T2 (using LDOMs) and it handles the load nicely - the domain stays acceptably responsive. I can see how a dual-core x86 machine would get swamped by such a load.

We're running Solaris 10, not OpenSolaris, so it could also be that there is a regression somewhere in there.

Scott Duckworth, Systems Programmer II
Clemson University School of Computing

On Tue, May 12, 2009 at 10:10 PM, Rince <rincebrain at gmail.com> wrote:

> [original message trimmed]
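Checking whether compression is in play on the receiving datasets is a one-liner; a sketch, where `tank/export` stands in for whatever the actual pool/dataset is called (the name is hypothetical, not from this thread):

```
# Show the compression setting for the dataset tree and where each
# value is inherited from; look for gzip or gzip-N on the rsync targets.
zfs get -r compression tank/export

# If compression is wanted but gzip is too CPU-hungry, lzjb is the
# much cheaper default algorithm.
zfs set compression=lzjb tank/export
```

Note that changing the property only affects newly written blocks; existing data keeps whatever compression it was written with.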