Rince
2009-May-13 02:10 UTC
[zfs-discuss] With RAID-Z2 under load, machine stops responding to local or remote login
Hi world,

I have a 10-disk RAID-Z2 system with 4 GB of DDR2 RAM and a 3 GHz Core 2 Duo. It's exporting ~280 filesystems over NFS to about half a dozen machines.

Under some loads (in particular, any attempt to rsync between another machine and this one over SSH), the machine's load average sometimes goes insane (27+), and it appears to all be in kernel-land: nothing in userland reports more than 5% CPU usage, while top reports 50%+ CPU usage.

I say 27+ because when the load spikes this high, the machine stops responding to any meaningful commands. Console login will take a username and password, then hang forever without printing anything. SSH login will block forever without prompting for a username or password. The machine responds to ping. NFS drops.

This is snv_113, and the problem has occurred ever since the RAID-Z2 was created (b102).

I have no idea how to instrument this. It doesn't appear to be panicking or running out of RAM (as far as I can see from the last responses of top and prstat), and I don't know how to ask dtrace where I'm mostly spending my time. I read one or two guides, but I don't follow how their output is meaningful.

I'm sending this to zfs-discuss because I can't replicate this problem unless I'm doing heavy I/O on ZFS.

(Final note: this 10-disk pool is serviced by an ARC-1280ML, and while the kernel is heavily under load, zpool iostat -v reports no more than 1 MB/s per disk, and almost always on the order of 128 KB/s.)

- Rich

--
The generation of random numbers is too important to be left to chance.
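For the "how do I ask dtrace where the kernel is spending its time" question, a minimal sketch using the stock profile provider (assumes root on the OpenSolaris box; the sampling rate and 30-second window are arbitrary choices, not anything from this thread):

```
# Sample kernel stacks ~997 times/sec on CPUs executing in the kernel
# (/arg0/ is non-zero when the interrupted PC was in kernel context),
# then print only the 10 hottest stacks after 30 seconds.
dtrace -n '
profile-997
/arg0/
{
        @[stack()] = count();
}
tick-30s
{
        trunc(@, 10);
        exit(0);
}' > hotstacks.txt
```

Run it while reproducing the rsync load; the stacks that dominate the output are where the kernel time is going.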
James C. McPherson
2009-May-13 04:37 UTC
[zfs-discuss] With RAID-Z2 under load, machine stops responding to local or remote login
On Tue, 12 May 2009 22:10:58 -0400 Rince <rincebrain at gmail.com> wrote:

> [original message trimmed]
>
> (Final note - this 10-disk pool is serviced by an ARC 1280ML, and
> during the time the kernel is heavily under load, zpool iostat -v is
> reporting no more than 1 MB/s per disk, and almost always to the tune
> of 128 KB/s.)

Ah, this last snippet of information is interesting (to me at least, since I integrated the arcmsr driver).

Is the ARC-1280ML in RAID or JBOD mode? Are you using the Sun-supplied arcmsr(7d) driver, or the Areca version?

You might want to try running the attached D script, dumping the output to a file.

James C. McPherson
--
Senior Kernel Software Engineer, Solaris
Sun Microsystems
http://blogs.sun.com/jmcp http://www.jmcp.homeunix.com/blog
Kernel Conference Australia - http://au.sun.com/sunnews/events/2009/kernel

[Attachment: arcmsr.d.all, application/octet-stream, 2659 bytes - <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20090513/4b4dfb5b/attachment.obj>]
Scott Duckworth
2009-May-15 18:58 UTC
[zfs-discuss] With RAID-Z2 under load, machine stops responding to local or remote login
Are you running compression on the filesystems you're rsync'ing to? That will drive the load average up pretty high, and it's in the kernel (from what I can tell). In particular, I've seen gzip compression on ZFS filesystems push the load average over 60 when running multiple parallel rsyncs over SSH, with prstat/top showing little userland CPU usage.

We're running on 2 cores (8 threads per core) of an UltraSPARC T2 (using LDOMs) and it handles the load nicely - the domain stays acceptably responsive. I can see how a dual-core x86 machine would get swamped by such a load.

We're running Solaris 10, not OpenSolaris, so it could also be that there is a regression somewhere in there.

Scott Duckworth, Systems Programmer II
Clemson University School of Computing

On Tue, May 12, 2009 at 10:10 PM, Rince <rincebrain at gmail.com> wrote:

> [original message trimmed]
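Checking whether compression is in play on the receiving datasets is a one-liner; a sketch, where `tank/export` stands in for whatever the actual pool/dataset is called (the name is hypothetical, not from this thread):

```
# Show the compression setting for the dataset tree and where each
# value is inherited from; look for gzip or gzip-N on the rsync targets.
zfs get -r compression tank/export

# If compression is wanted but gzip is too CPU-hungry, lzjb is the
# much cheaper default algorithm.
zfs set compression=lzjb tank/export
```

Note that changing the property only affects newly written blocks; existing data keeps whatever compression it was written with.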