We have a Sun Fire X4500 (Thumper) with 48 750GB SATA drives being used
as an NFS server. My original plan was to reinstall Linux on it, but
after getting it and playing around with ZFS I decided to give Solaris a
try. I have created over 30 ZFS filesystems so far and exported them via
NFS, and this has been working fine. Well, almost. A couple of weeks ago
I discovered clients could no longer mount, and I logged into the Thumper
and found mountd was not running. I could not figure out how to properly
get it restarted (nothing I did with svcadm seemed to work), so I just
rebooted and then everything was fine again. Nothing in the logs gave any
indication of what the problem was (the logs are awfully sparse on
Solaris).

Anyway, today I logged in to make a new ZFS filesystem and the zfs create
command has just hung and is unkillable, even via kill -9. I ran:

  zfs create -o quota=131G -o reserv=131G -o recsize=8K zpool1/itgroup_001

and this is still running now. truss on the process shows nothing. I
don't know how to debug it beyond that. I thought I would ask for any
info from this list before I just reboot.

  # uname -a
  SunOS raidsrv03 5.10 Generic_127112-05 i86pc i386 i86pc

And 'hd -c' shows all the disks as operating normally. There is nothing
relevant I can find in dmesg, /var/adm/messages, or /var/log/syslog.

-- 
---------------------------------------------------------------
Paul Raines                  email: raines at nmr.mgh.harvard.edu
MGH/MIT/HMS Athinoula A. Martinos Center for Biomedical Imaging
149 (2301) 13th Street       Charlestown, MA 02129     USA
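(For reference: on Solaris 10, mountd is managed by SMF as part of the
NFS server service, so a sketch of the usual recovery, assuming the
service had simply dropped into the maintenance state, would be:

  # svcs -xv                                       # explain any services that are down
  # svcadm clear svc:/network/nfs/server:default   # clear the maintenance state
  # svcadm restart svc:/network/nfs/server:default # restart mountd and friends

A bare "svcadm restart" has no effect on a service stuck in maintenance
until it is cleared first, which may be why svcadm appeared to do
nothing.)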
On Fri, 7 Mar 2008, Paul Raines wrote:

> zfs create -o quota=131G -o reserv=131G -o recsize=8K zpool1/itgroup_001
>
> and this is still running now. truss on the process shows nothing. I
> don't know how to debug it beyond that. I thought I would ask for any
> info from this list before I just reboot.

What does pstack show?


Regards,
markm
Mark J Musante wrote:
> On Fri, 7 Mar 2008, Paul Raines wrote:
>
>> zfs create -o quota=131G -o reserv=131G -o recsize=8K zpool1/itgroup_001
>>
>> and this is still running now. truss on the process shows nothing. I
>> don't know how to debug it beyond that. I thought I would ask for any
>> info from this list before I just reboot.
>
> What does pstack show?

If truss shows nothing, it's either looping at user level or hung in the
kernel. Try

  echo ::threadlist -v | mdb -k

and see what the stack trace looks like for the zfs process in the
kernel.

max
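(On a busy machine, ::threadlist -v output can be huge; an untested
narrowing of the same idea, using standard mdb dcmds, is to walk only the
threads of the hung process and print their kernel stacks:

  # echo "::pgrep zfs | ::walk thread | ::findstack -v" | mdb -k

Here "zfs" is the process name to match, and ::findstack -v includes
function arguments in each stack frame.)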
Well, as is probably obvious, I am pretty new to Solaris and don't really
know these tools.

  root at raidsrv03 # ps -f -p 3056
       UID   PID  PPID   C    STIME TTY         TIME CMD
      root  3056  3041   0 12:05:08 pts/1       0:00 zfs create -o quota=131G -o reserv=131G -o recsize=8K zpool1/itgroup_001

  root at raidsrv03 # pstack 3056
  pstack: cannot examine 3056: unanticipated system error

Another subscriber says I am out of date with a big ZFS patch from a week
or two ago, so I am doing updates and will reboot.

On Fri, 7 Mar 2008, Mark J Musante wrote:

> On Fri, 7 Mar 2008, Paul Raines wrote:
>
>> zfs create -o quota=131G -o reserv=131G -o recsize=8K zpool1/itgroup_001
>>
>> and this is still running now. truss on the process shows nothing. I
>> don't know how to debug it beyond that. I thought I would ask for any
>> info from this list before I just reboot.
>
> What does pstack show?
>
>
> Regards,
> markm

-- 
---------------------------------------------------------------
Paul Raines                  email: raines at nmr.mgh.harvard.edu
MGH/MIT/HMS Athinoula A. Martinos Center for Biomedical Imaging
149 (2301) 13th Street       Charlestown, MA 02129     USA
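(If the hang recurs, one option, assuming a dump device is configured, is
to capture a live crash dump before rebooting so the stuck kernel threads
can be inspected afterwards:

  # dumpadm        # confirm the dump device and savecore directory
  # savecore -L    # write a live dump of the running system

The resulting unix.N/vmcore.N files in the savecore directory can later
be examined with mdb.)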
Well, I ran updatemanager and started applying about 64 updates. After
the progress meter got about half way, it seemed to hang, not moving for
hours. I finally gave up and did a reboot. But the machine would not
reboot. I went into the ILOM and tried 'stop /SYS', but after a few
minutes would get back an error on the console saying something like
"shutdown failed". So I finally just hard power cycled the box. Luckily,
it came back up seemingly okay and I was able to rerun updatemanager and
get all updates installed. However, after rebooting I now note the
following error messages on the console:

Mar  9 03:22:16 raidsrv03 sata: NOTICE: /pci@0,0/pci1022,7458@1/pci11ab,11ab@1:
Mar  9 03:22:16 raidsrv03        port 6: device reset
Mar  9 03:22:16 raidsrv03 sata: NOTICE: /pci@0,0/pci1022,7458@1/pci11ab,11ab@1:
Mar  9 03:22:16 raidsrv03        port 6: link lost
Mar  9 03:22:16 raidsrv03 sata: NOTICE: /pci@0,0/pci1022,7458@1/pci11ab,11ab@1:
Mar  9 03:22:16 raidsrv03        port 6: link established
Mar  9 03:22:16 raidsrv03 scsi: WARNING: /pci@0,0/pci1022,7458@1/pci11ab,11ab@1/disk@6,0 (sd46):
Mar  9 03:22:16 raidsrv03        Error for Command: write(10)    Error Level: Retryable
Mar  9 03:22:16 raidsrv03 scsi:  Requested Block: 68158362       Error Block: 68158362
Mar  9 03:22:16 raidsrv03 scsi:  Vendor: ATA                     Serial Number:
Mar  9 03:22:16 raidsrv03 scsi:  Sense Key: No Additional Sense
Mar  9 03:22:16 raidsrv03 scsi:  ASC: 0x0 (no additional sense info), ASCQ: 0x0, FRU: 0x0

The above repeated a few times but now seems to have stopped. Running
'hd -c' shows all disks as ok. But it seems like I do have a disk
problem. Since everything is redundant (raidz), I don't understand why a
failed disk should lock up the machine like I saw, unless there is some
bigger issue.

Any advice?

Thanks
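(It may be worth checking whether ZFS and FMA saw these errors at all; a
sketch with stock Solaris 10 tools:

  # zpool status -xv       # prints "all pools are healthy" if ZFS logged no errors
  # iostat -En             # per-device soft/hard/transport error counters
  # fmdump -eV | tail -50  # raw FMA error telemetry, e.g. disk ereports

Error counts or ereports against sd46 would implicate that disk or its
SATA link even though 'hd -c' reports it ok.)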
Paul Raines <raines <at> nmr.mgh.harvard.edu> writes:
>
> Mar  9 03:22:16 raidsrv03 sata: NOTICE:
> /pci@0,0/pci1022,7458@1/pci11ab,11ab@1:
> Mar  9 03:22:16 raidsrv03        port 6: device reset
> [...]
>
> The above repeated a few times but now seems to have stopped.
> Running 'hd -c' shows all disks as ok. But it seems like I do have
> a disk problem. Since everything is redundant (raidz), I don't
> understand why a failed disk should lock up the machine like I saw,
> unless there is some bigger issue.

It looks like your Solaris 10U4 install on a Thumper is affected by:
http://bugs.opensolaris.org/view_bug.do?bug_id=6587133

Which was discussed here:
http://opensolaris.org/jive/thread.jspa?messageID=189256
http://opensolaris.org/jive/thread.jspa?messageID=163460

Apply T-PATCH 127871-02, or upgrade to snv_73, or wait for 10U5.

-- 
Marc Bevand
Paul Raines wrote:
> Well, I ran updatemanager and started applying about 64 updates. After
> the progress meter got about half way, it seemed to hang, not moving
> for hours. I finally gave up and did a reboot. But the machine would
> not reboot. I went into the ILOM and tried 'stop /SYS', but after a few
> minutes would get back an error on the console saying something like
> "shutdown failed". So I finally just hard power cycled the box.
> Luckily, it came back up seemingly okay and I was able to rerun
> updatemanager and get all updates installed. However, after rebooting I
> now note the following error messages on the console:
>
> Mar  9 03:22:16 raidsrv03 sata: NOTICE: /pci@0,0/pci1022,7458@1/pci11ab,11ab@1:
> Mar  9 03:22:16 raidsrv03        port 6: device reset
> Mar  9 03:22:16 raidsrv03 sata: NOTICE: /pci@0,0/pci1022,7458@1/pci11ab,11ab@1:
> Mar  9 03:22:16 raidsrv03        port 6: link lost
> Mar  9 03:22:16 raidsrv03 sata: NOTICE: /pci@0,0/pci1022,7458@1/pci11ab,11ab@1:
> Mar  9 03:22:16 raidsrv03        port 6: link established
> Mar  9 03:22:16 raidsrv03 scsi: WARNING: /pci@0,0/pci1022,7458@1/pci11ab,11ab@1/disk@6,0 (sd46):
> Mar  9 03:22:16 raidsrv03        Error for Command: write(10)    Error Level: Retryable
> Mar  9 03:22:16 raidsrv03 scsi:  Requested Block: 68158362       Error Block: 68158362
> Mar  9 03:22:16 raidsrv03 scsi:  Vendor: ATA                     Serial Number:
> Mar  9 03:22:16 raidsrv03 scsi:  Sense Key: No Additional Sense
> Mar  9 03:22:16 raidsrv03 scsi:  ASC: 0x0 (no additional sense info), ASCQ: 0x0, FRU: 0x0
>
> The above repeated a few times but now seems to have stopped. Running
> 'hd -c' shows all disks as ok. But it seems like I do have a disk
> problem. Since everything is redundant (raidz), I don't understand why
> a failed disk should lock up the machine like I saw, unless there is
> some bigger issue.
>
> Any advice?
>
> Thanks

It is unclear what you are talking about. Do you have any evidence
connecting these retryable write errors with the previous hang, or were
they two independent events? The retried write error would appear to be
normal behavior with a bad sector. If the sector is actually bad, there
would be the initial write attempt followed by five retries. The last
retry would have "Error Level: Fatal" as opposed to "Error Level:
Retryable"; otherwise one of the retries was successful and everything
would move on.

Regards,
Lida
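(A quick way to test Lida's point, i.e. whether any retry chain actually
exhausted its retries, is to search the log for a fatal entry:

  # grep "Error Level: Fatal" /var/adm/messages

No matches would mean every retried write eventually succeeded.)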
On Mon, 10 Mar 2008, Lida Horn wrote:

> Paul Raines wrote:
>> Well, I ran updatemanager and started applying about 64 updates.
>> After the progress meter got about half way, it seemed to hang, not
>> moving for hours. I finally gave up and did a reboot. But the machine
>> would not reboot. I went into the ILOM and tried 'stop /SYS', but
>> after a few minutes would get back an error on the console saying
>> something like "shutdown failed". So I finally just hard power cycled
>> the box. Luckily, it came back up seemingly okay and I was able to
>> rerun updatemanager and get all updates installed. However, after
>> rebooting I now note the following error messages on the console:
>>
>> Mar  9 03:22:16 raidsrv03 sata: NOTICE: /pci@0,0/pci1022,7458@1/pci11ab,11ab@1:
>> Mar  9 03:22:16 raidsrv03        port 6: device reset
>> Mar  9 03:22:16 raidsrv03 sata: NOTICE: /pci@0,0/pci1022,7458@1/pci11ab,11ab@1:
>> Mar  9 03:22:16 raidsrv03        port 6: link lost
>> Mar  9 03:22:16 raidsrv03 sata: NOTICE: /pci@0,0/pci1022,7458@1/pci11ab,11ab@1:
>> Mar  9 03:22:16 raidsrv03        port 6: link established
>> Mar  9 03:22:16 raidsrv03 scsi: WARNING: /pci@0,0/pci1022,7458@1/pci11ab,11ab@1/disk@6,0 (sd46):
>> Mar  9 03:22:16 raidsrv03        Error for Command: write(10)    Error Level: Retryable
>> Mar  9 03:22:16 raidsrv03 scsi:  Requested Block: 68158362       Error Block: 68158362
>> Mar  9 03:22:16 raidsrv03 scsi:  Vendor: ATA                     Serial Number:
>> Mar  9 03:22:16 raidsrv03 scsi:  Sense Key: No Additional Sense
>> Mar  9 03:22:16 raidsrv03 scsi:  ASC: 0x0 (no additional sense info), ASCQ: 0x0, FRU: 0x0
>>
>> The above repeated a few times but now seems to have stopped. Running
>> 'hd -c' shows all disks as ok. But it seems like I do have a disk
>> problem. Since everything is redundant (raidz), I don't understand
>> why a failed disk should lock up the machine like I saw, unless there
>> is some bigger issue.
>>
>> Any advice?
>>
> It is unclear what you are talking about. Do you have any evidence
> connecting these retryable write errors with the previous hang, or
> were they two independent events? The retried write error would appear
> to be normal behavior with a bad sector. If the sector is actually
> bad, there would be the initial write attempt followed by five
> retries. The last retry would have "Error Level: Fatal" as opposed to
> "Error Level: Retryable"; otherwise one of the retries was successful
> and everything would move on.
>
> Regards,
> Lida

No, I cannot connect the two events. When the 'zfs create' hang
happened, and the hang on applying updates, there were no error messages
at all that I could find. The above only happened after the reboot. So
it is circumstantial.
On Sun, 9 Mar 2008, Marc Bevand wrote:

> Paul Raines <raines <at> nmr.mgh.harvard.edu> writes:
>>
>> Mar  9 03:22:16 raidsrv03 sata: NOTICE:
>> /pci@0,0/pci1022,7458@1/pci11ab,11ab@1:
>> Mar  9 03:22:16 raidsrv03        port 6: device reset
>> [...]
>>
>> The above repeated a few times but now seems to have stopped.
>> Running 'hd -c' shows all disks as ok. But it seems like I do have
>> a disk problem. Since everything is redundant (raidz), I don't
>> understand why a failed disk should lock up the machine like I saw,
>> unless there is some bigger issue.
>
> It looks like your Solaris 10U4 install on a Thumper is affected by:
> http://bugs.opensolaris.org/view_bug.do?bug_id=6587133
> Which was discussed here:
> http://opensolaris.org/jive/thread.jspa?messageID=189256
> http://opensolaris.org/jive/thread.jspa?messageID=163460
>
> Apply T-PATCH 127871-02, or upgrade to snv_73, or wait for 10U5.

I don't find 127871-02 on the normal "Patches and Updates" website.
Does someone have to go some place special for that? Also, where do I
find info on updating to snv_73?

thanks
Paul Raines wrote:
> On Sun, 9 Mar 2008, Marc Bevand wrote:
>
>> Paul Raines <raines <at> nmr.mgh.harvard.edu> writes:
>>> Mar  9 03:22:16 raidsrv03 sata: NOTICE:
>>> /pci@0,0/pci1022,7458@1/pci11ab,11ab@1:
>>> Mar  9 03:22:16 raidsrv03        port 6: device reset
>>> [...]
>>>
>>> The above repeated a few times but now seems to have stopped.
>>> Running 'hd -c' shows all disks as ok. But it seems like I do have
>>> a disk problem. Since everything is redundant (raidz), I don't
>>> understand why a failed disk should lock up the machine like I saw,
>>> unless there is some bigger issue.
>>
>> It looks like your Solaris 10U4 install on a Thumper is affected by:
>> http://bugs.opensolaris.org/view_bug.do?bug_id=6587133
>> Which was discussed here:
>> http://opensolaris.org/jive/thread.jspa?messageID=189256
>> http://opensolaris.org/jive/thread.jspa?messageID=163460
>>
>> Apply T-PATCH 127871-02, or upgrade to snv_73, or wait for 10U5.
>
> I don't find 127871-02 on the normal "Patches and Updates" website.
> Does someone have to go some place special for that? Also, where do I
> find info on updating to snv_73?
>
> thanks

Hi,

Unfortunately, the 127871-* patches are currently feature patches used
in Update 5 builds; they won't be released until U5 ships, which may not
be for another month. It is very dangerous to apply these to a pre-U5
system before they ship. There are no sustaining patches for issue
6579855 (the only CR fixed in 127871-02), but the CR 6587133 mentioned
above is fixed in generic patch 125205-07, available on SunSolve.

Enda
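(A sketch of checking for and applying that fix with the standard patch
tools, assuming the patch has been downloaded from SunSolve into the
current directory:

  # showrev -p | grep 125205   # show any installed revision of the patch
  # patchadd 125205-07         # apply the fix for CR 6587133

patchadd will report any prerequisite patches it needs, and a
kernel/driver patch like this typically requires a reboot to take
effect.)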
Marc Bevand wrote:
> Paul Raines <raines <at> nmr.mgh.harvard.edu> writes:
>
>> Mar  9 03:22:16 raidsrv03 sata: NOTICE:
>> /pci@0,0/pci1022,7458@1/pci11ab,11ab@1:
>> Mar  9 03:22:16 raidsrv03        port 6: device reset
>> [...]
>>
>> The above repeated a few times but now seems to have stopped.
>> Running 'hd -c' shows all disks as ok. But it seems like I do have
>> a disk problem. Since everything is redundant (raidz), I don't
>> understand why a failed disk should lock up the machine like I saw,
>> unless there is some bigger issue.
>
> It looks like your Solaris 10U4 install on a Thumper is affected by:
> http://bugs.opensolaris.org/view_bug.do?bug_id=6587133
> Which was discussed here:
> http://opensolaris.org/jive/thread.jspa?messageID=189256
> http://opensolaris.org/jive/thread.jspa?messageID=163460
>
> Apply T-PATCH 127871-02, or upgrade to snv_73, or wait for 10U5.

I think you jumped to a conclusion that is probably not warranted.
First, he said that the machine was hung and there were no messages
associated with the hang. Later, after rebooting, he saw a few messages
about an (apparently) single bad sector; the system was not hung and
recovered from the error in a reasonable amount of time. When asked, he
replied that he had no evidence to connect the two events. At no time
did he report anything about DMA timeouts. Please don't jump to
conclusions.

Regards,
Lida
Lida Horn <Lida.Horn <at> Sun.COM> writes:
>
> I think you jumped to a conclusion that is probably not warranted.

You are right. I read his error message too hastily and thought I
recognized a pattern (I have been a victim of bug 6587133 myself). And
to top this off, I gave him the wrong patch number.

To answer Paul's question about how to upgrade to snv_73 (if you still
want to upgrade for another reason): actually, I would recommend the
latest SXDE (Solaris Express Developer Edition 1/08, based on build 79).
Boot from the install disc and choose the "Upgrade Install" option.

-- 
Marc Bevand