Howdy,

We're having a ZFS performance issue over here that I was hoping you guys could help me troubleshoot. We have a ZFS pool made up of 24 disks, arranged into 7 raid-z devices of 4 disks each. We're using it as an iSCSI back-end for VMware and some Oracle RAC clusters.

Under normal circumstances performance is very good, both in benchmarks and under real-world use. Every couple of days, however, I/O seems to hang for anywhere between several seconds and several minutes. The hang seems to be a complete stop of all write I/O. The following zpool iostat output illustrates:

pool0       2.47T  5.13T    120      0   293K      0
pool0       2.47T  5.13T    127      0   308K      0
pool0       2.47T  5.13T    131      0   322K      0
pool0       2.47T  5.13T    144      0   347K      0
pool0       2.47T  5.13T    135      0   331K      0
pool0       2.47T  5.13T    122      0   295K      0
pool0       2.47T  5.13T    135      0   330K      0

While this is going on our VMs all hang, as do any "zfs create" commands or attempts to touch/create files in the ZFS pool from the local system. After several minutes the system "un-hangs" and we see very high write rates before things return to normal across the board.

Some more information about our configuration: We're running OpenSolaris snv_134. ZFS is at version 22. Our disks are 15K RPM 300GB Seagate Cheetahs, mounted in Promise J610S Dual enclosures, hanging off a Dell SAS 5/E controller. We'd tried out most of this configuration previously on OpenSolaris 2009.06 without running into this problem. The only thing that's new, aside from the newer OpenSolaris/ZFS, is a set of four SSDs configured as log disks.

At first we blamed de-dupe, but we've disabled that. Next we suspected the SSD log disks, but we've seen the problem with those removed as well.

Has anyone seen anything like this before? Are there any tools we can use to gather information during the hang which might be useful in determining what's going wrong?

Thanks for any insights you may have.

-Charles
Charles,

Did you check for any HW issues reported during the hangs? fmdump -ev and the like?

..Remco

On 8/30/10 6:02 PM, Charles J. Knipe wrote:
> [snip]
Charles,

Is it just ZFS hanging (or what appears to be ZFS slowing down or blocking), or does the whole system hang?

A couple of questions:

What does iostat show during the time period of the slowdown?
What does mpstat show during the time of the slowdown?

You can look at the metadata statistics by running the following:

  echo ::arc | mdb -k

When looking at a ZFS problem, I usually also like to gather:

  echo ::spa | mdb -k
  echo ::zio_state | mdb -k

I suspect you could drill down more with dtrace or lockstat to see where the slowdown is happening; a rough starting point follows below.

Dave

On 08/30/10 11:02, Charles J. Knipe wrote:
> [snip]
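As a concrete dtrace starting point, something like this untested sketch should show how long each transaction group sync is taking. It assumes the fbt provider can see spa_sync on your build; adjust the probe names if not:

  # rough sketch: histogram of spa_sync() duration per call; buckets
  # in the multi-second range would confirm the stall is in the sync
  # pipeline rather than in the disks themselves
  dtrace -n '
  fbt::spa_sync:entry { self->ts = timestamp; }
  fbt::spa_sync:return /self->ts/ {
      @["spa_sync (ns)"] = quantize(timestamp - self->ts);
      self->ts = 0;
  }'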
David,

Thanks for your reply. Answers to your questions are below.

> Is it just ZFS hanging (or what appears to be ZFS slowing down or
> blocking), or does the whole system hang?

Only the ZFS storage is affected. Any attempt to write to it blocks until the issue passes. Other than that the system behaves normally. I have not, as far as I remember, tried writing to the root pool while this is going on; I'll have to check that next time. I suspect the problem is likely limited to a single pool.

> What does iostat show during the time period of the slowdown?
> What does mpstat show during the time of the slowdown?
>
> You can look at the metadata statistics by running the following:
>   echo ::arc | mdb -k
> When looking at a ZFS problem, I usually also like to gather:
>   echo ::spa | mdb -k
>   echo ::zio_state | mdb -k

I will plan to dump information from all of these sources next time I can catch it in the act. Any other diag commands you think might be useful?

> I suspect you could drill down more with dtrace or lockstat to see
> where the slowdown is happening.

I'm brand new to DTrace. I'm doing some reading now toward being in a position to ask intelligent questions.

-Charles
Charles,

As with most things in UNIX, there are several ways to drill down on the problem. I would probably start with a live crash dump (savecore -L) when you see the problem. Another method would be to grab multiple "stats" commands during the problem so you can see where to drill down later (a rough capture loop is sketched below); I would use this method if the problem lasts for a while, and then drill down with dtrace based on what I saw. Each method is going to depend on your skill when looking at the problem.

Dave

On 08/30/10 16:15, Charles J. Knipe wrote:
> [snip]
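If the problem lasts long enough to react to, a crude capture loop saves you from typing commands during the event. A minimal sketch (untested; run as root, and adjust the output directory and interval to taste):

  #!/bin/sh
  # grab a timestamped bundle of stats every few seconds while the
  # hang is in progress; kill the loop once the pool un-hangs
  while true; do
      out=/var/tmp/hangstats.`date +%Y%m%d%H%M%S`
      {
          date
          iostat -xn 1 2            # per-device I/O, two 1-second samples
          mpstat 1 2                # per-CPU activity
          echo ::arc | mdb -k
          echo ::spa | mdb -k
          echo ::zio_state | mdb -k
      } > $out 2>&1
      sleep 5
  done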
Hi Charles,

You might want to rule out hardware issues first... You can review iostat -En output or the /var/adm/messages file to see if any driver-related error messages correlate with the hangs, like this:

c4t40d0 Soft Errors: 7 Hard Errors: 0 Transport Errors: 0
Vendor: SUN Product: StorEdge 3510 Revision: 327P Serial No:
Size: 48.20GB <48201990144 bytes>

In addition, FMA will report disk issues in fmdump output. For example, you could grep for some of the devices in the pool like this:

# fmdump -eV | grep c1t9d0
        vdev_path = /dev/dsk/c1t9d0s0
        vdev_path = /dev/dsk/c1t9d0s0

If you get output like the above, then take a closer look at the fmdump -eV output to see what is happening at the disk level.

Thanks,

Cindy

On 08/30/10 10:02, Charles J. Knipe wrote:
> [snip]
> Charles,
>
> As with most things in UNIX, there are several ways to drill down on
> the problem. I would probably start with a live crash dump (savecore
> -L) when you see the problem. Another method would be to grab multiple
> "stats" commands during the problem so you can see where to drill down
> later. [snip]

Dave,

After running clean since my last post, the problem occurred again today. This time I was able to gather some data while it was going on. The only thing that jumps out at me so far is the output of echo ::zio_state | mdb -k.

Under normal operations this usually looks like this:

ADDRESS          TYPE  STAGE            WAITER

ffffff090eb69328 NULL  OPEN             -
ffffff090eb69c88 NULL  OPEN             -

Here are a couple of samples while the issue was happening:

ADDRESS          TYPE  STAGE            WAITER

ffffff0bfe8c59b0 NULL  CHECKSUM_VERIFY  ffffff003e2f2c60
ffffff090eb69328 NULL  OPEN             -
ffffff090eb69c88 NULL  OPEN             -

ADDRESS          TYPE  STAGE            WAITER

ffffff09bb12a040 NULL  CHECKSUM_VERIFY  ffffff003d6acc60
ffffff0bfe8c59b0 NULL  CHECKSUM_VERIFY  ffffff003e2f2c60
ffffff090eb69328 NULL  OPEN             -
ffffff090eb69c88 NULL  OPEN             -

Operating under the assumption that the waiter column is referencing kernel threads, I went looking for those addresses in the thread list. Here are the threadlist entries for ffffff003d6acc60 and ffffff003e2f2c60 from the example directly above, taken at about the same time as that output:

ffffff003d6acc60 ffffff0930d8c700 ffffff09172f9de0   2   0 ffffff09bb12a348
  PC: _resume_from_idle+0xf1    CMD: zpool-pool0
  stack pointer for thread ffffff003d6acc60: ffffff003d6ac360
  [ ffffff003d6ac360 _resume_from_idle+0xf1() ]
    swtch+0x145()
    cv_wait+0x61()
    zio_wait+0x5d()
    dbuf_read+0x1e8()
    dmu_buf_hold+0x93()
    zap_get_leaf_byblk+0x56()
    zap_deref_leaf+0x78()
    fzap_length+0x42()
    zap_length_uint64+0x84()
    ddt_zap_lookup+0x4b()
    ddt_object_lookup+0x6d()
    ddt_lookup+0x115()
    zio_ddt_free+0x42()
    zio_execute+0x8d()
    taskq_thread+0x248()
    thread_start+8()

ffffff003e2f2c60 fffffffffbc2dbb0                 0   0  60 ffffff0bfe8c5cb8
  PC: _resume_from_idle+0xf1    THREAD: txg_sync_thread()
  stack pointer for thread ffffff003e2f2c60: ffffff003e2f2a40
  [ ffffff003e2f2a40 _resume_from_idle+0xf1() ]
    swtch+0x145()
    cv_wait+0x61()
    zio_wait+0x5d()
    spa_sync+0x40c()
    txg_sync_thread+0x24a()
    thread_start+8()

Not sure if any of that sheds any light on the problem. I also have a live dump from the period when the problem was happening, plus a bunch of iostats, mpstats, and ::arc, ::spa, ::zio_state, and ::threadlist -v output from mdb -k at several points during the issue.

If you have any advice on how to proceed from here in debugging this issue I'd greatly appreciate it. So you know, I'm generally very comfortable with unix, but dtrace and the solaris kernel are unfamiliar territory.

In any event, thanks again for all the help thus far.

-Charles
> At first we blamed de-dupe, but we've disabled that. Next we suspected
> the SSD log disks, but we've seen the problem with those removed, as
> well.

Did you have dedup enabled and then disabled it? If so, data can (or will) still be deduplicated on the drives. Currently the only way of de-deduping it is to recopy it after disabling dedup.

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
roy at karlsbakk.net
http://blogg.karlsbakk.net/
--
In all pedagogy it is essential that the curriculum be presented intelligibly. It is an elementary imperative for all pedagogues to avoid excessive use of idioms of foreign origin. In most cases, adequate and relevant synonyms exist in Norwegian.
> > At first we blamed de-dupe, but we've disabled that. Next we
> > suspected the SSD log disks, but we've seen the problem with those
> > removed, as well.
>
> Did you have dedup enabled and then disabled it? If so, data can (or
> will) still be deduplicated on the drives. Currently the only way of
> de-deduping it is to recopy it after disabling dedup.

That's a good point. There is deduplicated data still present on disk. Do you think the issue we're seeing may be related to the existing deduped data? I'm not against copying the contents of the pool over to a new pool, but considering the effort/disruption I'd want to make sure it's not just a shot in the dark. If I don't have a good theory in another week, that's when I start shooting in the dark...

-Charles
So, I'm still having problems with intermittent hangs on write with my ZFS pool. Details are in my original post, quoted below. Since posting that, I've gone back and forth with a number of you and gotten a lot of useful advice, but I'm still trying to get to the root of the problem so I can correct it. Since the original post I have:

- Gathered a great deal of information in the form of kernel thread dumps, zio_state dumps, and live crash dumps while the problem is happening.
- Been advised that my ruling out of dedupe was probably premature, as I still likely have a good deal of deduplicated data on-disk.
- Checked just about every log and counter that might indicate a hardware error, without finding one.

I was wondering at this point if someone could give me some pointers on the following:

1. Given the dumps and diagnostic data I've gathered so far, is there a way I can determine for certain where in the ZFS driver I'm spending so much time hanging? At the very least I'd like to try to determine whether it is, in fact, a deduplication issue.

2. If it is, in fact, a deduplication issue, would my only recourse be a new pool and a send/receive operation? The data we're storing is VMFS volumes for ESX. We're tossing around the idea of creating new volumes in the same pool (now that dedupe is off) and migrating VMs over in small batches. The theory is that we would be writing non-deduped data this way, and when we were done we could remove the deduplicated volumes. Is this sound?

Thanks again for all the help!

-Charles

> [original post snipped]
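One partial answer to my own question #1, from my reading since then: zdb can apparently dump the dedup table statistics for a pool, which should at least tell me how much deduplicated data is still on disk. Something like this (syntax per my reading of the docs; I haven't run it against the production pool yet):

  # print DDT statistics for the pool, including a histogram of
  # table entries; a large table here would support the theory
  # that the old deduped data is still in play
  zdb -DD pool0

If anyone knows whether that's safe to run against a live pool, I'd appreciate a sanity check.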
Hi Charles,

There are quite a few bugs in b134 that can lead to this. Alas, due to the new regime, there was a period of time where the distributions were not being delivered. If I were in your shoes, I would upgrade to OpenIndiana b147, which has 26 weeks of maturity and bug fixes over b134.

http://www.openindiana.org
 -- richard

On Sep 23, 2010, at 2:48 PM, Charles J. Knipe wrote:
> [snip]

--
OpenStorage Summit, October 25-27, Palo Alto, CA
http://nexenta-summit2010.eventbrite.com
ZFS and performance consulting
http://www.RichardElling.com
If one was sticking with OpenSolaris for the short term, is something older than 134 more stable/less buggy? Not using de-dupe.

-J

On Thu, Sep 23, 2010 at 6:04 PM, Richard Elling <richard.elling at gmail.com> wrote:
> Hi Charles,
> There are quite a few bugs in b134 that can lead to this. Alas, due to
> the new regime, there was a period of time where the distributions
> were not being delivered. If I were in your shoes, I would upgrade to
> OpenIndiana b147, which has 26 weeks of maturity and bug fixes over
> b134.
>
> http://www.openindiana.org
> -- richard
>
> [snip]
I also have this problem on my system, which consists of an AMD Phenom II X4 with system pools on various hard drives connected to the SB750 controller, and a larger raidz2 storage pool connected to an LSI 1068E controller (using IT mode). The storage pool is also used to share files using CIFS. The OSOL build is 134 and the system pools use zpool v22, but the storage pool still uses zpool v14 from the original 2009.06 release of OSOL.

I had not encountered any errors or intermittent freezes in any of these pools until I started using VirtualBox 4.2.8. The virtual machines have been set up to access the storage pool via vboxsrv. When doing various activities in a virtual machine that involve heavier disk access (such as playing/streaming video, using bittorrent, etc.) via vboxsrv, I also get this intermittent freeze of the zpool stack. The rest of the system works fine (including the virtual machine) as long as it doesn't involve access to the storage pool (I have not yet tried accessing other pools during this freeze). It is easily seen in the system monitor when the network access suddenly drops to zero for a few minutes. Issuing commands such as 'zpool status' makes the ssh/bash shell freeze until the zpool wakes up again. No errors are reported when issuing the 'zpool status' command, and all pools are ONLINE.

I could give more exhaustive details from hardware scans as suggested in this thread if that is desired...
On 28/09/10 09:22 PM, Robin Axelsson wrote:
> [snip]

Two possibilities:

(1) You're using ZFS vols for swap - these had some performance improvements in builds after 134. If you can reconfigure to use a raw slice (or two) instead, that would help; a sketch of the swap change is below.

(2) 6908360 49.3% snv_129 vdb407_nvSeqWriteBs128kFs1g_zfs-raidz performance regression x86, which was fixed in snv_135.

James C. McPherson
--
Oracle
http://www.jmcp.homeunix.com/blog
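For the swap change in (1), the mechanics are short, assuming a spare slice exists. A rough sketch (the slice name here is only an example; substitute one that actually exists on your system, and update /etc/vfstab so the change persists across reboots):

  # remove the zvol-backed swap device
  swap -d /dev/zvol/dsk/rpool/swap
  # add a raw disk slice as swap instead (example slice name)
  swap -a /dev/dsk/c7d0s1
  # verify the active swap devices
  swap -l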
I have now run some hardware tests as suggested by Cindy. 'iostat -En' indicates no errors, i.e. after carefully checking the output from this command, all error counters are zero. The only messages found in /var/adm/messages are the following:

<timestamp> opensolaris scsi: [ID 365881 kern.info] /pci at 0,0/pci1002,5980 at b/pci1000,3140 at 0 (mpt0):
<timestamp> opensolaris        Log info 0x31080000 received for target 3.
<timestamp> opensolaris        scsi_status=0x0, ioc_status=0x804b, scsi_state=0x0
<timestamp> opensolaris scsi: [ID 365881 kern.info] /pci at 0,0/pci1002,5980 at b/pci1000,3140 at 0 (mpt0):
<timestamp> opensolaris        Log info 0x31080000 received for target 3.
<timestamp> opensolaris        scsi_status=0x0, ioc_status=0x804b, scsi_state=0x1

and

<timestamp> opensolaris scsi: [ID 365881 kern.info] /pci at 0,0/pci1002,5980 at b/pci1000,3140 at 0 (mpt0):
<timestamp> opensolaris        Log info 0x31080000 received for target 3.
<timestamp> opensolaris        scsi_status=0x0, ioc_status=0x804b, scsi_state=0x2

where the difference between them is "scsi_state=0x0", "scsi_state=0x1" and "scsi_state=0x2" respectively. These messages are generated almost every second in the log, and the most common one is the "scsi_state=0x0" message.

Grepping for any of the disks in the pools returns nothing from the 'fmdump -eV' command. Also note that I have never used de-dup in any of these pools.
I am using a zvol for swap that is located in the rpool (i.e. not in the storage pool). The system disk contains four primary partitions, where the first contains the system volume (c7d0s0), two are Windows partitions (c7d0p2 and c7d0p3), and the fourth (c7d0p4) is a zfs pool dedicated to VirtualBox. The swap is located in c7d0s0 as rpool/swap. 'swap -l' returns:

swapfile                  dev    swaplo  blocks    free
/dev/zvol/dsk/rpool/swap  182,2       8 4192248 4169344

I'm not sure how to make a raw slice for the swap file; it doesn't look like it is possible unless I get a separate hard drive for it. In any case, the swap area is not located in the storage pool, so I'm not sure how it could interfere with access to that pool.

I will upgrade to OpenIndiana (b147 or later) as soon as I get a confirmation that it is not less stable than b134.
I have now upgraded to OpenIndiana b148, which should fix those bugs that you mentioned. I lost the picture on the monitor, but judging by ssh from another computer the system seems to be running fine.

The problems have become worse now: I get a freeze every time I try to access the 8-disk raidz2 tank (which uses no dedup and never has). It also takes considerably longer than before to mount the storage pool during boot-up. No errors are reported by zpool status, but there is one significant difference since the update: the "iostat -En" command now reports errors, and here's what it looks like:

c7d0    Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Model: SAMSUNG HD103SJ Revision: Serial No: ##### Size: 1000.20GB <1000202305536 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0
c9t0d0  Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA Product: SAMSUNG HD154UI Revision: 1118 Serial No:
Size: 1500.30GB <1500301910016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 2 Predictive Failure Analysis: 0
c9t1d0  Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA Product: SAMSUNG HD154UI Revision: 1118 Serial No:
Size: 1500.30GB <1500301910016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 2 Predictive Failure Analysis: 0
c9t2d0  Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA Product: SAMSUNG HD154UI Revision: 1118 Serial No:
Size: 1500.30GB <1500301910016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 2 Predictive Failure Analysis: 0
c9t3d0  Soft Errors: 0 Hard Errors: 35 Transport Errors: 21
Vendor: ATA Product: SAMSUNG HD154UI Revision: 1118 Serial No:
Size: 1500.30GB <1500301910016 bytes>
Media Error: 30 Device Not Ready: 0 No Device: 5 Recoverable: 0
Illegal Request: 5 Predictive Failure Analysis: 0
c9t4d0  Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA Product: SAMSUNG HD154UI Revision: 1118 Serial No:
Size: 1500.30GB <1500301910016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 2 Predictive Failure Analysis: 0
c9t5d0  Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA Product: SAMSUNG HD154UI Revision: 1118 Serial No:
Size: 1500.30GB <1500301910016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 2 Predictive Failure Analysis: 0
c9t6d0  Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA Product: SAMSUNG HD154UI Revision: 1118 Serial No:
Size: 1500.30GB <1500301910016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 2 Predictive Failure Analysis: 0
c9t7d0  Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA Product: SAMSUNG HD154UI Revision: 1118 Serial No:
Size: 1500.30GB <1500301910016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 2 Predictive Failure Analysis: 0

To conclude (in case you don't view this message in a monospace font): all drives in the affected storage pool (c9t0d0 - c9t7d0) report 2 Illegal Requests, except c9t3d0, which reports 5. c9t3d0 is also the black sheep in that it reports 35 Hard Errors, 21 Transport Errors, and 30 Media Errors. Does this mean that the disk is about to give up and should be replaced? zpool status indicates that it is in the online state and reports no failures.

Any suggestions on how to proceed with this would be much appreciated.
On Sun, 19 Dec 2010, Robin Axelsson wrote:
> To conclude (in case you don't view this message in a monospace font):
> all drives in the affected storage pool (c9t0d0 - c9t7d0) report 2
> Illegal Requests, except c9t3d0, which reports 5. c9t3d0 is also the
> black sheep in that it reports 35 Hard Errors, 21 Transport Errors,
> and 30 Media Errors. Does this mean that the disk is about to give up
> and should be replaced? zpool status indicates that it is in the
> online state and reports no failures.

I agree that it is best to attend to the "black sheep". First make sure that there is nothing odd about its mechanics, such as a loose mount which might allow it to vibrate.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
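It may also be worth watching the error counters accumulate in real time while the pool is under load, rather than only checking the totals afterwards. For example (a simple sketch; the trailing number is the sampling interval in seconds):

  # extended per-device statistics with error columns, refreshed
  # every 5 seconds; watch whether c9t3d0's counts keep climbing
  iostat -xne 5

If the counts only grow during heavy activity on that one drive, that is another vote for a failing disk rather than cabling or controller trouble.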
There's nothing odd about the physical mounting of the hard drives. All drives are firmly attached and secured in their casings, no loose connections, etc. There is some dust, but not more than the hardware should be able to handle.

I replaced the hard drive with another one of the same size. I find it a little disturbing that the new drive doesn't have the same device name as the old one (i.e. c9t8d0 instead of c9t3d0). If the device names change, it gets much more difficult to locate the physical position of the drives. Maybe there is a way to change the name so that the logical associations are maintained.

At the end of the resilvering process I noticed that c9t0d0 started to get resilvered too, which I found quite disturbing. After running "iostat -En" I saw that it now has errors. I no longer get freezes when accessing the pool (I have yet to test this more thoroughly, though) and it isn't as slow as it used to be, but it seems that I need to replace yet another drive. I really hope Samsung will accept the RMA.

zpool reports no errors in its status output whereas iostat does, and it's disturbing that the two disagree. If the errors are that well concealed, a regular diagnostic may not detect them, and Samsung may decline the RMA if these errors don't show up in the diagnostic procedures which I believe they run on any RMA before acceptance is considered.
On Tue, 21 Dec 2010, Robin Axelsson wrote:
> There's nothing odd about the physical mounting of the hard drives.
> All drives are firmly attached and secured in their casings, no loose
> connections, etc.
>
> I replaced the hard drive with another one of the same size. I find it
> a little disturbing that the new drive doesn't have the same device
> name as the old one (i.e. c9t8d0 instead of c9t3d0). [snip]

In most cases, the device name is tied to the slot rather than the purpose of the drive. Was the replacement drive inserted in the same slot or a different slot?

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
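One way to check what a device name actually points at is the physical device path behind it. For example (using the name from your listing):

  # the symlink target is the /devices path, which encodes the
  # controller and target the drive sits on
  ls -l /dev/dsk/c9t8d0s0

Comparing that path against what the old c9t3d0 entry pointed to should tell you whether the controller renumbered the target or the drive really landed somewhere else.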
I didn't hot-swap the drive, but yes, the new drive is in the same "slot" as the old one was (i.e. using the same connector/channel on the fan-out cable). What I did was turn off the system and boot it up after disconnecting the physical drive that I suspected was c9t3d0. My guess was right, so I turned the system off once again and replaced it with the new drive. Upon boot I saw that the LSI firmware listed the new drive last among the detected drives, but I thought it was merely presenting them in alphabetical order (WD comes after Samsung). When I issued the "zpool replace <mytank> c9t3d0" command it didn't work. After some examination of what had happened, I saw that the new drive was actually c9t8d0.
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Charles J. Knipe
>
> Some more information about our configuration: We're running
> OpenSolaris snv_134. ZFS is at version 22. Our disks are 15K RPM 300GB
> Seagate Cheetahs, mounted in Promise J610S Dual enclosures, hanging
> off a Dell SAS 5/E controller. We'd tried out most of this
> configuration previously on [snip]

Sorry I didn't read or mention this before, but I am consistently finding that Solaris doesn't work quite right on Dell hardware (or non-Sun hardware in general), and THAT is when it's a supported configuration. You're running OpenSolaris: unsupported, and non-Sun hardware.

I'm not saying there's no solution. For the various weird problems that people have had on this list, people often find weird solutions, like disabling C-states, or upgrading or downgrading firmware levels...

For me personally, on my only Dell which runs Solaris, which is a certified and "supported" configuration, I had to disable the on-board NIC and buy an add-on Intel NIC. End result, we eliminated the once-per-week crash, but I would still say the system is rather unstable. Uptime of 1 month is the next goal...

Other people on this list have precisely the same setup, repeated on dozens of systems... Some of them act normal, and some of them don't. It's weird and unexplained.

I'm basically concluding that if you need it to work right, you need the Sun hardware, or you need good luck on non-Sun hardware, because the developers don't bother testing much on non-Sun hardware. Yes, they try to avoid nonstandard and hardware-specific system calls or whatever... Yes, it's likely to work on a lot of non-Sun hardware. But the odds are highly stacked.
On Dec 21, 2010, at 4:41 PM, Robin Axelsson wrote:
> I replaced the hard drive with another one of the same size. I find it
> a little disturbing that the new drive doesn't have the same device
> name as the old one (i.e. c9t8d0 instead of c9t3d0). If the device
> names change, it gets much more difficult to locate the physical
> position of the drives. Maybe there is a way to change the name so
> that the logical associations are maintained.

There are some option settings for HBAs that can do this. From your case, I think we can presume that Dell's HBA ships this way. NB: SAS and FC disks are recognized by their wwn, not slot.

> At the end of the resilvering process I noticed that c9t0d0 started to
> get resilvered too, which I found quite disturbing. [snip] I no longer
> get freezes when accessing the pool [snip] but it seems that I need to
> replace yet another drive.

Freezes are caused by the disk not responding. When the disk doesn't respond for a long time (60 seconds), sd will, by default on many systems, issue a bus reset.

> I really hope Samsung will accept the RMA. [snip]

Good luck.
 -- richard
The HBA I use is an LSI MegaRAID 1038E-R, but I guess it doesn't really matter, as most OEM manufacturers such as Dell, Intel, HP, and IBM use the LSI 1068E/1078E or the newer 2008E/2018E MegaRAID chips, which I believe run pretty much the same firmware. So I guess I could change these settings in the firmware so that the new WD drive comes 4th in the list of identified drives. The question is whether zfs will buy into that: it may still identify the drive as c9t8d0, or it might think that the pool is corrupt.

I've had these freezes for a bit over three months by now, and it wasn't until I upgraded to oi_b148 and ran 'iostat -En' that I found that one drive had errors. If the disk only takes longer than usual to respond, I shouldn't get media errors according to iostat, right? Still, zpool returned that everything is OK when issuing 'zpool status' for that pool.

I will keep in the back of my head that c9t0d0 produced some errors during the resilvering process. I even saw that zpool did some resilvering on that drive too at the end of the process (why is that?!). However, when I did a scrub there were no errors, and no new errors have been found by iostat since the resilvering. I have not encountered any freezes since, even with the pool occasionally under load. But if the freezes start to occur again, I will assume that c9t0d0 is the likely culprit, and that I won't be able to count on zpool to report the problem.
I'm not sure if they still apply to b134, but your symptoms seem similar to problems caused by transaction group (txg) issues in the past. Have you looked at the threads involving setting zfs:zfs_write_limit_override, zfs:zfs_vdev_max_pending, or zfs:zfs_txg_timeout in /etc/system? A sketch of what those entries look like is below.

Paul
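For reference, the /etc/system entries from those threads look roughly like this. The values shown are purely illustrative; the right numbers depend on RAM, workload, and pool layout, and each change needs a reboot and careful testing before you rely on it:

  * cap how much dirty data a transaction group may accumulate (bytes),
  * so each txg sync has less to flush at once
  set zfs:zfs_write_limit_override = 0x20000000
  * limit the number of I/Os queued to each vdev
  set zfs:zfs_vdev_max_pending = 10
  * force a txg to sync at least every N seconds
  set zfs:zfs_txg_timeout = 5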