Karthik Krishnamoorthy
2008-Oct-16  04:38 UTC
[zfs-discuss] zfs cp hangs when the mirrors are removed ..
Hello All,
  Summary:
  ~~~~~~~~
  The cp command to a mirrored ZFS pool hung when all the disks in the
  mirrored pool became unavailable.
  
  Detailed description:
  ~~~~~~~~~~~~~~~~~~~~~
  
  The cp command (copy a 1GB file from nfs to zfs) hung when all the disks
  in the mirrored pool (both c1t0d9 and c2t0d9) were removed physically.
  
         NAME        STATE     READ WRITE CKSUM
         test        ONLINE      0     0     0
           mirror    ONLINE      0     0     0
             c1t0d9  ONLINE      0     0     0
             c2t0d9  ONLINE      0     0     0
  
  We think that if all the disks in the pool are unavailable, the cp
  command should fail with an error rather than hang.
  
  Our request:
  ~~~~~~~~~~~~
  Please investigate the root cause of this issue.
 
  How to reproduce:
  ~~~~~~~~~~~~~~~~~
  1. Create a ZFS mirrored pool.
  2. Run a cp command from somewhere into the ZFS mirrored pool.
  3. Physically remove both disks while the cp command is running.
    => A hang happens (the cp command never returns and we can't kill it).
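
The symptom in step 3 (a cp that never returns and cannot be killed) can be checked for from a script. The following is a minimal Python sketch, not part of the original report: the function name run_with_deadline and the paths in the usage note are hypothetical. It polls the child instead of waiting on it, on the assumption that a process blocked in kernel I/O may ignore the kill signal, in which case a blocking wait would hang the watcher too.

```python
import subprocess
import time


def run_with_deadline(cmd, deadline_s, poll_s=0.1):
    """Start cmd and report whether it finished within deadline_s seconds.

    Returns ("exited", returncode) or ("timeout", None). On timeout the
    child is signalled but never waited on, because a process blocked in
    uninterruptible kernel I/O may not die and wait() would block as well.
    """
    proc = subprocess.Popen(cmd)
    end = time.monotonic() + deadline_s
    while time.monotonic() < end:
        rc = proc.poll()
        if rc is not None:
            return ("exited", rc)
        time.sleep(poll_s)
    # One last check in case the child exited during the final sleep.
    rc = proc.poll()
    if rc is not None:
        return ("exited", rc)
    proc.kill()  # best effort; may have no effect on a blocked process
    return ("timeout", None)
```

A call such as run_with_deadline(["cp", "/net/server/bigfile", "/test/bigfile"], 600) (both paths are placeholders) would then return ("timeout", None) if the copy is still stuck after ten minutes.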
One engineer pointed me to this page
http://opensolaris.org/os/community/arc/caselog/2007/567/onepager/ and
indicated that if all the mirrors are removed, ZFS enters a hang-like
state to prevent the kernel from going into a panic mode, and that this
type of feature would be an RFE.
My questions are:

Is there any documentation of the "mirror" configuration of ZFS that
explains what happens when the underlying drivers detect problems in
one of the mirror devices?

It seems that the traditional view of a "mirror" or RAID-1 would expect
that the mirror is able to proceed without interruption, and that does
not seem to be the case in ZFS.
What is the purpose of the mirror, in zfs?  Is it more like an instant
backup?  If so, what can the user do to recover, when there is an
IO error on one of the devices?
Appreciate any pointers and help,
Thanks and regards,
Karthik
Neil Perrin
2008-Oct-16  05:03 UTC
[zfs-discuss] zfs cp hangs when the mirrors are removed ..
Karthik,

The pool failmode property as implemented governs the behaviour when all
the devices needed are unavailable. The default behaviour is to wait
(block) until the IO can continue, perhaps by re-enabling the device(s).
The behaviour you expected can be achieved by
"zpool set failmode=continue <pool>", as shown in the one-pager you
linked.

Neil.
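
As an illustration of the failmode setting described above, here is a small Python sketch that builds the zpool commands to set the property and read it back. The helper names failmode_commands and apply_failmode are hypothetical; the three accepted values (wait, continue, panic) are the ones documented for the pool failmode property. Actually running the commands assumes root privileges on a host with ZFS.

```python
import subprocess

# The three values documented for the pool-level failmode property.
VALID_FAILMODES = ("wait", "continue", "panic")


def failmode_commands(pool, mode):
    """Return the zpool commands that set failmode and then read it back."""
    if mode not in VALID_FAILMODES:
        raise ValueError(f"unknown failmode {mode!r}")
    return [
        ["zpool", "set", f"failmode={mode}", pool],
        ["zpool", "get", "failmode", pool],
    ]


def apply_failmode(pool, mode):
    """Run the commands; needs root and a host with ZFS installed."""
    for cmd in failmode_commands(pool, mode):
        subprocess.run(cmd, check=True)
```

Keeping command construction separate from execution lets the exact command line be inspected or logged before anything touches the pool.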
Karthik Krishnamoorthy
2008-Oct-16  05:12 UTC
[zfs-discuss] zfs cp hangs when the mirrors are removed ..
Neil,

Thanks for the quick suggestion. The hang seems to happen even with the
zpool set failmode=continue <pool> option.

Is there any other way to recover from the hang?

Thanks and regards,
Karthik
Neil Perrin
2008-Oct-16  05:27 UTC
[zfs-discuss] zfs cp hangs when the mirrors are removed ..
On 10/15/08 23:12, Karthik Krishnamoorthy wrote:
> Thanks for the quick suggestion, the hang seems to happen even with the
> zpool set failmode=continue <pool> option.
>
> Any other way to recover from the hang ?

You should set the property before you remove the devices. This should
prevent the hang. It isn't used to recover from it.

If you did do that, then it seems like a bug somewhere in ZFS or the IO
stack below it, in which case you should file a bug.

Neil.
Karthik Krishnamoorthy
2008-Oct-16  05:35 UTC
[zfs-discuss] zfs cp hangs when the mirrors are removed ..
We did try both the zpool set failmode=continue <pool> option and the
wait option, setting each before running the cp command and pulling out
the mirrors. In both cases there was a hang, and I have a core dump of
the hang as well.

Any pointers to the bug opening process?

Thanks,
Karthik
Richard Elling
2008-Oct-16  14:54 UTC
[zfs-discuss] zfs cp hangs when the mirrors are removed ..
Karthik Krishnamoorthy wrote:
> We did try with this
>
> zpool set failmode=continue <pool> option
>
> and the wait option before running the cp command and pulling
> out the mirrors and in both cases there was a hang and I have a core
> dump of the hang as well.

You have to wait for the I/O drivers to declare that the device is dead.
This can take up to several minutes, depending on the driver.

> Any pointers to the bug opening process ?

http://bugs.opensolaris.org, or bugster if you have an account.
Be sure to indicate which drivers you are using, as this is not likely
a ZFS bug, per se. At a minimum, include the output from prtconf -D.

 -- richard
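
The information requested for the bug report can be gathered in one pass. The Python sketch below is an assumption about what a useful bundle might contain, not a required format: it collects the pool status, the current failmode value, and the prtconf -D output mentioned above. The function names and the injectable runner parameter are hypothetical. It is probably best run before the disks are pulled, since zpool commands may themselves block once the pool's devices are gone.

```python
import subprocess


def diagnostic_commands(pool):
    """Commands whose output is worth attaching to the bug report."""
    return [
        ["zpool", "status", "-v", pool],
        ["zpool", "get", "failmode", pool],
        ["prtconf", "-D"],  # shows which driver is bound to each device
    ]


def collect(pool, runner=subprocess.run):
    """Run each command and return {command string: combined output}."""
    report = {}
    for cmd in diagnostic_commands(pool):
        result = runner(cmd, capture_output=True, text=True, timeout=60)
        report[" ".join(cmd)] = result.stdout + result.stderr
    return report
```

The runner argument exists only so the function can be exercised on a machine without ZFS by passing a stand-in for subprocess.run.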
Karthik Krishnamoorthy
2008-Oct-20  15:40 UTC
[zfs-discuss] zfs cp hangs when the mirrors are removed ..
Hi Richard,

Richard Elling wrote:
> You have to wait for the I/O drivers to declare that the device is
> dead. This can be up to several minutes, depending on the driver.

Okay. The customer indicated they didn't see a hang when they ran the
same test with UFS.

> http://bugs.opensolaris.org, or bugster if you have an account.
> Be sure to indicate which drivers you are using, as this is not likely
> a ZFS bug, per se. Output from prtconf -D should be a minimum.

I have the core dump of the hang. Will make that available as well.

Thanks and regards,
Karthik