Hi...

Here's my system:

2 Intel 3 GHz 5160 dual-core CPUs
10 SATA 750 GB disks running as a ZFS RAIDZ2 pool
8 GB memory
SunOS 5.11 snv_79a on a separate UFS mirror
ZFS pool version 10
No separate ZIL or ARC cache

I ran into a problem today where the ZFS pool jammed for an extended period of time. During that time, it seemed read-bound, doing only read I/Os (as observed with "zpool iostat 1"), and I saw 100% misses while running arcstat.pl (for "miss%", "dm%", "pm%" and "mm%"). Processes accessing the pool were jammed, including remote NFS mounts. At the time, I was: 1) running a scrub, 2) writing tens of MB/sec of data onto the pool as well as reading from the pool, and 3) deleting a large number of files on the pool. I tried killing one of the jammed "rm" processes and it eventually died. The number of misses seen in arcstat.pl eventually dropped back down to the 20-40% range ("miss%"). A while later, writes began occurring to the pool again, remote NFS access also freed up, and overall system behaviour seemed to normalize. This all occurred over the course of approximately an hour.

Does this kind of problem sound familiar to anyone? Is it a ZFS problem, or have I hit some sort of ZFS load maximum and this is the response? Any suggestions for ways to avoid this are welcome...

Thanks...
Art

Arthur A. Person
Research Assistant, System Administrator
Penn State Department of Meteorology
email: person at meteo.psu.edu, phone: 814-863-1563
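For reference, the observations described above amount to something like the following, with "tank" standing in for the real pool name:

    # per-second pool-wide I/O; during the jam the write columns sat at zero
    zpool iostat tank 1

    # ARC statistics each second; miss%, dm%, pm% and mm% were pegged at 100
    arcstat.pl 1

    # scrub progress and per-device state
    zpool status -v tank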
> Does this kind of problem sound familiar to anyone? Is it a ZFS problem,
> or have I hit some sort of ZFS load maximum and this is the response?
> Any suggestions for ways to avoid this are welcome...

Hi Art,

I have seen a similar problem that has been happening on several servers since a recent upgrade from b70 to b86/b87. For no obvious reason, the servers will stop writing to the pool for long periods of time. Watching "zpool iostat", I can see that 0 writes are being done for up to a minute at a time. Meanwhile, a large number of small (~3K) reads are happening. The servers behave like this for an hour or more at a time.

The server configuration is:

Dual-core Opteron 2212 HE
4 GB ECC DDR2 RAM
15 1 TB SATA drives in a RAID-Z2 pool
2 Supermicro SAT2-MV8 controllers
SunOS 5.11 snv_86
UFS root and swap are on their own disk

Have you made any progress with this problem? Has anyone else seen this behavior?
Scott,

On Sun, 4 May 2008, Scott wrote:

> Have you made any progress with this problem? Has anyone else seen this
> behavior?

I haven't seen it happen again, but I haven't hammered the system as I did above to try and make it fail either. Since then, I have also added two RiDATA 16 GB SSDs, one as a log device and one as a cache device, to see if I can improve performance into and out of the array. Writing data to the array has definitely improved with the log device, but I'm still having performance issues reading large numbers of small files off the array.

I'm curious about your array configuration above... did you create your RAIDZ2 as one vdev or multiple vdevs? If multiple, how many? On mine, I have all 10 disks set up as one RAIDZ2 vdev, which is supposed to be near the performance limit... I'm wondering how much I would gain by splitting it into two vdevs for the price of losing 1.5 TB (2 disks) worth of storage.

Art

Arthur A. Person
Research Assistant, System Administrator
Penn State Department of Meteorology
email: person at meteo.psu.edu, phone: 814-863-1563
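For anyone wanting to replicate the SSD experiment described above, attaching log and cache devices is a one-liner each. A sketch with placeholder names ("tank" and the c4 targets are illustrative, not the actual devices):

    # dedicate one SSD as a separate intent log (slog)
    zpool add tank log c4t0d0

    # dedicate the other SSD as an L2ARC cache device
    zpool add tank cache c4t1d0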
This sounds like an important problem.
On Tue, Apr 22, 2008 at 8:24 AM, Arthur Person <ap60 at meteo.psu.edu> wrote:

> Does this kind of problem sound familiar to anyone? Is it a ZFS problem,
> or have I hit some sort of ZFS load maximum and this is the response?
> Any suggestions for ways to avoid this are welcome...

I think I've seen reports of similar problems on the zfs list, but I don't know if there was any resolution. One suggestion was that a SATA drive could be attempting to correct a read error and that was causing ZFS to block on I/O.

-B

--
Brandon High bhigh at freaks.com
"The good is the enemy of the best." - Nietzsche
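If a single drive stuck in error recovery is the culprit, it generally shows up as one device with a far higher service time than its peers. A quick check using stock Solaris iostat:

    # extended device statistics at 1-second intervals; look for one disk
    # whose asvc_t (average service time) is wildly higher than the rest
    iostat -xn 1

    # per-device error counters, including hard, soft and transport errors
    iostat -En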
Hi Arthur,

I've seen a lockup-type situation which might possibly be similar to what you've described. I didn't wait an hour to see if it resolved itself, though, so I had to reboot. I have described the saga here:

http://www.opensolaris.org/jive/thread.jspa?threadID=59201&tstart=0

I haven't managed to debug it yet with DTrace, as I didn't get the time to learn DTrace, and I wasn't sure which version of the SATA driver source code was the correct one for snv_b87.

Simon
> I'm curious about your array configuration above... did you create your
> RAIDZ2 as one vdev or multiple vdevs? If multiple, how many? On mine, I
> have all 10 disks set up as one RAIDZ2 vdev, which is supposed to be near
> the performance limit... I'm wondering how much I would gain by splitting
> it into two vdevs for the price of losing 1.5 TB (2 disks) worth of
> storage.
>
> Art

I have all 15 drives under a single raidz2. In my case, capacity is more important than speed. I'm sure others can comment on any potential speed tradeoffs in your setup.

I'm still having this problem, and have been playing around with DTrace for the last few days. I downgraded my b87 servers to b86 in order to cut the new write-throttling code from the equation. That seems to have improved the performance, but hasn't completely eliminated the problem.
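Since the stalls coincide with the pool's transaction-group syncs, one starting point for that DTrace work is timing spa_sync(), the routine that writes out each txg. A minimal sketch, assuming the fbt provider exposes spa_sync on this build:

    # distribution of spa_sync() durations; long tails should line up with
    # the windows where "zpool iostat" shows zero writes
    dtrace -n '
    fbt::spa_sync:entry  { self->ts = timestamp; }
    fbt::spa_sync:return /self->ts/ {
            @["spa_sync (ns)"] = quantize(timestamp - self->ts);
            self->ts = 0;
    }'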
ap60 at meteo.psu.edu said:

> I'm curious about your array configuration above... did you create your
> RAIDZ2 as one vdev or multiple vdevs? If multiple, how many? On mine, I
> have all 10 disks set up as one RAIDZ2 vdev which is supposed to be near
> the performance limit... I'm wondering how much I would gain by splitting
> it into two vdevs for the price of losing 1.5 TB (2 disks) worth of
> storage.

You've probably already seen/heard this, but I haven't seen it mentioned in this thread. The consensus is, and measurements seem to confirm, that splitting it into two vdevs will double your available IOPS for small, random read loads on raidz/raidz2. Here are some references and examples:

http://blogs.sun.com/roch/entry/when_to_and_not_to
http://blogs.sun.com/relling/entry/zfs_raid_recommendations_space_performance
http://blogs.sun.com/relling/entry/zfs_raid_recommendations_space_performance1
http://acc.ohsu.edu/~hakansom/thumper_bench.html

Regards,

Marion
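Concretely, the 10-disk split under discussion means rebuilding the pool as two 5-disk raidz2 vdevs instead of one 10-disk vdev. A sketch with placeholder device names (note this requires destroying and recreating the pool; existing vdevs cannot be restructured in place):

    # two 5-disk raidz2 vdevs; ZFS stripes across both, roughly doubling
    # small random-read IOPS at the cost of two more parity disks
    zpool create tank \
        raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 \
        raidz2 c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0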
On Tue, 20 May 2008, Marion Hakanson wrote:

> You've probably already seen/heard this, but I haven't seen it mentioned
> in this thread. The consensus is, and measurements seem to confirm, that
> splitting it into two vdevs will double your available IOPS for small,
> random read loads on raidz/raidz2. Here are some references and examples:
>
> http://blogs.sun.com/roch/entry/when_to_and_not_to
> http://blogs.sun.com/relling/entry/zfs_raid_recommendations_space_performance
> http://blogs.sun.com/relling/entry/zfs_raid_recommendations_space_performance1
> http://acc.ohsu.edu/~hakansom/thumper_bench.html

The upshot of all this analysis is that mirroring offers the best multi-user performance with excellent reliability. A system comprised of mirrors and big, fat SATA-II disks will almost certainly beat a system using raidz and small, fast SAS disks in multi-user situations. A system comprised of mirrors with small, fast SAS disks will of course be fastest, but will be more expensive for the same storage.

Note that in the Roch blog, load-sharing across 40 4-disk raidzs only achieved 4,000 random I/Os per second, but mirrors achieved 20,000. Switching over to the Ranch Ramblings, we see that the fancy drives are only 78% faster than the big, fat SATA drives. Using the fancy SAS drives preferred at the Ranch only improves the raidz random I/Os to something like 7,120, which is still far less than 20,000.

It seems that it pains people to "waste" disk space. For example, the cost increase to use 1 TB disks may not be all that much compared to 500 GB disks, but it somehow seems like a huge cost not to maximize use of the available media space. This perception of cost and waste is completely irrational: there is more waste in crippling your investment.

The Roch "WHEN TO (AND NOT TO) USE RAID-Z" blog posting contains an error, since it blames ZFS mirroring for doubling the write IOPS. This is not actually true, since IOPS are measured at the per-disk level and each mirror disk sees the same IOPS. It is true that the host system needs to send twice as many transactions when using mirroring, but the transactions go to different disks.

In order to improve read performance further, triple mirroring can be used, with added write cost at the host level and more wasted disk space.

Bob

=====================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
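For comparison with the raidz2 layouts above, the mirrored configuration Bob describes would look something like this for the same 10 drives (placeholder device names; usable capacity drops to five disks' worth):

    # five 2-way mirrors; each mirror is an independent vdev, and reads can
    # be serviced by either side, which is where the random-read IOPS win
    # comes from
    zpool create tank \
        mirror c1t0d0 c2t0d0 \
        mirror c1t1d0 c2t1d0 \
        mirror c1t2d0 c2t2d0 \
        mirror c1t3d0 c2t3d0 \
        mirror c1t4d0 c2t4d0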