We have a Sun Fire X4500 (Thumper) with 48 750GB SATA drives being used
as an NFS server. My original plan was to reinstall Linux on it, but
after getting it and playing around with ZFS I decided to give Solaris a
try. I have created over 30 ZFS filesystems so far and exported them via
NFS, and this has been working fine. Well, almost. A couple of weeks ago
I discovered clients could no longer mount, and I logged into the Thumper
and found mountd was not running. I could not figure out how to properly
get it restarted (nothing I did with svcadm seemed to work), so I just
rebooted and then everything was fine again. Nothing in the logs gave any
indication of what the problem was (the logs are awfully sparse on
Solaris).

Anyway, today I logged in to make a new ZFS filesystem and the zfs create
command has just hung and is unkillable, even via kill -9. I ran:

  zfs create -o quota=131G -o reserv=131G -o recsize=8K zpool1/itgroup_001

and this is still running now. truss on the process shows nothing. I
don't know how to debug it beyond that. I thought I would ask for any
info from this list before I just reboot.

  # uname -a
  SunOS raidsrv03 5.10 Generic_127112-05 i86pc i386 i86pc

And 'hd -c' shows all the disks as operating normally. There is nothing
relevant I can find in dmesg, /var/adm/messages, or /var/log/syslog.

-- 
---------------------------------------------------------------
Paul Raines                  email: raines at nmr.mgh.harvard.edu
MGH/MIT/HMS Athinoula A. Martinos Center for Biomedical Imaging
149 (2301) 13th Street       Charlestown, MA 02129     USA
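(For reference: on Solaris 10, mountd is managed by SMF as part of the
NFS server service, so a sketch of the usual recovery, assuming the
service had simply dropped into the maintenance state, would be:

  # svcs -xv                                       # explain any services that are down
  # svcadm clear svc:/network/nfs/server:default   # clear the maintenance state
  # svcadm restart svc:/network/nfs/server:default # restart mountd and friends

A bare "svcadm restart" has no effect on a service stuck in maintenance
until it is cleared first, which may be why svcadm appeared to do
nothing.)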
On Fri, 7 Mar 2008, Paul Raines wrote:

> zfs create -o quota=131G -o reserv=131G -o recsize=8K zpool1/itgroup_001
>
> and this is still running now. truss on the process shows nothing. I
> don't know how to debug it beyond that. I thought I would ask for any
> info from this list before I just reboot.

What does pstack show?


Regards,
markm
Mark J Musante wrote:
> On Fri, 7 Mar 2008, Paul Raines wrote:
>
>> zfs create -o quota=131G -o reserv=131G -o recsize=8K zpool1/itgroup_001
>>
>> and this is still running now. truss on the process shows nothing. I
>> don't know how to debug it beyond that. I thought I would ask for any
>> info from this list before I just reboot.
>
> What does pstack show?

If truss shows nothing, it's either looping at user level or hung in the
kernel. Try

  echo ::threadlist -v | mdb -k

and see what the stack trace looks like for the zfs process in the
kernel.

max
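(On a busy machine, ::threadlist -v output can be huge; an untested
narrowing of the same idea, using standard mdb dcmds, is to walk only the
threads of the hung process and print their kernel stacks:

  # echo "::pgrep zfs | ::walk thread | ::findstack -v" | mdb -k

Here "zfs" is the process name to match, and ::findstack -v includes
function arguments in each stack frame.)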
Well, as is probably obvious, I am pretty new to Solaris and don't really
know these tools.

  root at raidsrv03 # ps -f -p 3056
       UID   PID  PPID   C    STIME TTY         TIME CMD
      root  3056  3041   0 12:05:08 pts/1       0:00 zfs create -o quota=131G -o reserv=131G -o recsize=8K zpool1/itgroup_001

  root at raidsrv03 # pstack 3056
  pstack: cannot examine 3056: unanticipated system error

Another subscriber says I am out of date with a big ZFS patch from a week
or two ago, so I am doing updates and will reboot.

On Fri, 7 Mar 2008, Mark J Musante wrote:

> On Fri, 7 Mar 2008, Paul Raines wrote:
>
>> zfs create -o quota=131G -o reserv=131G -o recsize=8K zpool1/itgroup_001
>>
>> and this is still running now. truss on the process shows nothing. I
>> don't know how to debug it beyond that. I thought I would ask for any
>> info from this list before I just reboot.
>
> What does pstack show?
>
>
> Regards,
> markm

-- 
---------------------------------------------------------------
Paul Raines                  email: raines at nmr.mgh.harvard.edu
MGH/MIT/HMS Athinoula A. Martinos Center for Biomedical Imaging
149 (2301) 13th Street       Charlestown, MA 02129     USA
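(If the hang recurs, one option, assuming a dump device is configured, is
to capture a live crash dump before rebooting so the stuck kernel threads
can be inspected afterwards:

  # dumpadm        # confirm the dump device and savecore directory
  # savecore -L    # write a live dump of the running system

The resulting unix.N/vmcore.N files in the savecore directory can later
be examined with mdb.)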
Well, I ran updatemanager and started applying about 64 updates. After
the progress meter got about half way, it seemed to hang, not moving for
hours. I finally gave up and did a reboot. But the machine would not
reboot. I went into the ILOM and tried 'stop /SYS', but after a few
minutes would get back an error on the console saying something like
"shutdown failed". So I finally just hard power cycled the box. Luckily,
it came back up seemingly okay and I was able to rerun updatemanager and
get all updates installed. However, after rebooting I now note the
following error messages on the console:

Mar  9 03:22:16 raidsrv03 sata: NOTICE: /pci@0,0/pci1022,7458@1/pci11ab,11ab@1:
Mar  9 03:22:16 raidsrv03        port 6: device reset
Mar  9 03:22:16 raidsrv03 sata: NOTICE: /pci@0,0/pci1022,7458@1/pci11ab,11ab@1:
Mar  9 03:22:16 raidsrv03        port 6: link lost
Mar  9 03:22:16 raidsrv03 sata: NOTICE: /pci@0,0/pci1022,7458@1/pci11ab,11ab@1:
Mar  9 03:22:16 raidsrv03        port 6: link established
Mar  9 03:22:16 raidsrv03 scsi: WARNING: /pci@0,0/pci1022,7458@1/pci11ab,11ab@1/disk@6,0 (sd46):
Mar  9 03:22:16 raidsrv03        Error for Command: write(10)    Error Level: Retryable
Mar  9 03:22:16 raidsrv03 scsi:  Requested Block: 68158362       Error Block: 68158362
Mar  9 03:22:16 raidsrv03 scsi:  Vendor: ATA                     Serial Number:
Mar  9 03:22:16 raidsrv03 scsi:  Sense Key: No Additional Sense
Mar  9 03:22:16 raidsrv03 scsi:  ASC: 0x0 (no additional sense info), ASCQ: 0x0, FRU: 0x0

The above repeated a few times but now seems to have stopped. Running
'hd -c' shows all disks as ok. But it seems like I do have a disk
problem. Since everything is redundant (raidz), I don't understand why a
failed disk should lock up the machine like I saw, unless there is some
bigger issue.

Any advice?

Thanks
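(It may be worth checking whether ZFS and FMA saw these errors at all; a
sketch with stock Solaris 10 tools:

  # zpool status -xv       # prints "all pools are healthy" if ZFS logged no errors
  # iostat -En             # per-device soft/hard/transport error counters
  # fmdump -eV | tail -50  # raw FMA error telemetry, e.g. disk ereports

Error counts or ereports against sd46 would implicate that disk or its
SATA link even though 'hd -c' reports it ok.)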
Paul Raines <raines <at> nmr.mgh.harvard.edu> writes:
>
> Mar  9 03:22:16 raidsrv03 sata: NOTICE:
> /pci@0,0/pci1022,7458@1/pci11ab,11ab@1:
> Mar  9 03:22:16 raidsrv03        port 6: device reset
> [...]
>
> The above repeated a few times but now seems to have stopped.
> Running 'hd -c' shows all disks as ok. But it seems like I do have
> a disk problem. Since everything is redundant (raidz), I don't
> understand why a failed disk should lock up the machine like I saw,
> unless there is some bigger issue.

It looks like your Solaris 10U4 install on a Thumper is affected by:
http://bugs.opensolaris.org/view_bug.do?bug_id=6587133

Which was discussed here:
http://opensolaris.org/jive/thread.jspa?messageID=189256
http://opensolaris.org/jive/thread.jspa?messageID=163460

Apply T-PATCH 127871-02, or upgrade to snv_73, or wait for 10U5.

-- 
Marc Bevand
Paul Raines wrote:
> Well, I ran updatemanager and started applying about 64 updates. After
> the progress meter got about half way, it seemed to hang, not moving
> for hours. I finally gave up and did a reboot. But the machine would
> not reboot. I went into the ILOM and tried 'stop /SYS', but after a few
> minutes would get back an error on the console saying something like
> "shutdown failed". So I finally just hard power cycled the box.
> Luckily, it came back up seemingly okay and I was able to rerun
> updatemanager and get all updates installed. However, after rebooting I
> now note the following error messages on the console:
>
> Mar  9 03:22:16 raidsrv03 sata: NOTICE: /pci@0,0/pci1022,7458@1/pci11ab,11ab@1:
> Mar  9 03:22:16 raidsrv03        port 6: device reset
> Mar  9 03:22:16 raidsrv03 sata: NOTICE: /pci@0,0/pci1022,7458@1/pci11ab,11ab@1:
> Mar  9 03:22:16 raidsrv03        port 6: link lost
> Mar  9 03:22:16 raidsrv03 sata: NOTICE: /pci@0,0/pci1022,7458@1/pci11ab,11ab@1:
> Mar  9 03:22:16 raidsrv03        port 6: link established
> Mar  9 03:22:16 raidsrv03 scsi: WARNING: /pci@0,0/pci1022,7458@1/pci11ab,11ab@1/disk@6,0 (sd46):
> Mar  9 03:22:16 raidsrv03        Error for Command: write(10)    Error Level: Retryable
> Mar  9 03:22:16 raidsrv03 scsi:  Requested Block: 68158362       Error Block: 68158362
> Mar  9 03:22:16 raidsrv03 scsi:  Vendor: ATA                     Serial Number:
> Mar  9 03:22:16 raidsrv03 scsi:  Sense Key: No Additional Sense
> Mar  9 03:22:16 raidsrv03 scsi:  ASC: 0x0 (no additional sense info), ASCQ: 0x0, FRU: 0x0
>
> The above repeated a few times but now seems to have stopped. Running
> 'hd -c' shows all disks as ok. But it seems like I do have a disk
> problem. Since everything is redundant (raidz), I don't understand why
> a failed disk should lock up the machine like I saw, unless there is
> some bigger issue.
>
> Any advice?
>
> Thanks

It is unclear what you are talking about. Do you have any evidence
connecting these retryable write errors with the previous hang, or were
they two independent events? The retried write error would appear to be
normal behavior with a bad sector. If the sector is actually bad, there
would be the initial write attempt followed by five retries. The last
retry would have "Error Level: Fatal" as opposed to "Error Level:
Retryable"; otherwise one of the retries was successful and everything
would move on.

Regards,
Lida
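(A quick way to test Lida's point, i.e. whether any retry chain actually
exhausted its retries, is to search the log for a fatal entry:

  # grep "Error Level: Fatal" /var/adm/messages

No matches would mean every retried write eventually succeeded.)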
On Mon, 10 Mar 2008, Lida Horn wrote:

> Paul Raines wrote:
>> Well, I ran updatemanager and started applying about 64 updates.
>> After the progress meter got about half way, it seemed to hang, not
>> moving for hours. I finally gave up and did a reboot. But the machine
>> would not reboot. I went into the ILOM and tried 'stop /SYS', but
>> after a few minutes would get back an error on the console saying
>> something like "shutdown failed". So I finally just hard power cycled
>> the box. Luckily, it came back up seemingly okay and I was able to
>> rerun updatemanager and get all updates installed. However, after
>> rebooting I now note the following error messages on the console:
>>
>> Mar  9 03:22:16 raidsrv03 sata: NOTICE: /pci@0,0/pci1022,7458@1/pci11ab,11ab@1:
>> Mar  9 03:22:16 raidsrv03        port 6: device reset
>> Mar  9 03:22:16 raidsrv03 sata: NOTICE: /pci@0,0/pci1022,7458@1/pci11ab,11ab@1:
>> Mar  9 03:22:16 raidsrv03        port 6: link lost
>> Mar  9 03:22:16 raidsrv03 sata: NOTICE: /pci@0,0/pci1022,7458@1/pci11ab,11ab@1:
>> Mar  9 03:22:16 raidsrv03        port 6: link established
>> Mar  9 03:22:16 raidsrv03 scsi: WARNING: /pci@0,0/pci1022,7458@1/pci11ab,11ab@1/disk@6,0 (sd46):
>> Mar  9 03:22:16 raidsrv03        Error for Command: write(10)    Error Level: Retryable
>> Mar  9 03:22:16 raidsrv03 scsi:  Requested Block: 68158362       Error Block: 68158362
>> Mar  9 03:22:16 raidsrv03 scsi:  Vendor: ATA                     Serial Number:
>> Mar  9 03:22:16 raidsrv03 scsi:  Sense Key: No Additional Sense
>> Mar  9 03:22:16 raidsrv03 scsi:  ASC: 0x0 (no additional sense info), ASCQ: 0x0, FRU: 0x0
>>
>> The above repeated a few times but now seems to have stopped. Running
>> 'hd -c' shows all disks as ok. But it seems like I do have a disk
>> problem. Since everything is redundant (raidz), I don't understand
>> why a failed disk should lock up the machine like I saw, unless there
>> is some bigger issue.
>>
>> Any advice?
>>
> It is unclear what you are talking about. Do you have any evidence
> connecting these retryable write errors with the previous hang, or
> were they two independent events? The retried write error would appear
> to be normal behavior with a bad sector. If the sector is actually
> bad, there would be the initial write attempt followed by five
> retries. The last retry would have "Error Level: Fatal" as opposed to
> "Error Level: Retryable"; otherwise one of the retries was successful
> and everything would move on.
>
> Regards,
> Lida

No, I cannot connect the two events. When the 'zfs create' hang
happened, and the hang on applying updates, there were no error messages
at all that I could find. The above only happened after the reboot. So
it is circumstantial.
On Sun, 9 Mar 2008, Marc Bevand wrote:

> Paul Raines <raines <at> nmr.mgh.harvard.edu> writes:
>>
>> Mar  9 03:22:16 raidsrv03 sata: NOTICE:
>> /pci@0,0/pci1022,7458@1/pci11ab,11ab@1:
>> Mar  9 03:22:16 raidsrv03        port 6: device reset
>> [...]
>>
>> The above repeated a few times but now seems to have stopped.
>> Running 'hd -c' shows all disks as ok. But it seems like I do have
>> a disk problem. Since everything is redundant (raidz), I don't
>> understand why a failed disk should lock up the machine like I saw,
>> unless there is some bigger issue.
>
> It looks like your Solaris 10U4 install on a Thumper is affected by:
> http://bugs.opensolaris.org/view_bug.do?bug_id=6587133
> Which was discussed here:
> http://opensolaris.org/jive/thread.jspa?messageID=189256
> http://opensolaris.org/jive/thread.jspa?messageID=163460
>
> Apply T-PATCH 127871-02, or upgrade to snv_73, or wait for 10U5.

I don't find 127871-02 on the normal "Patches and Updates" website.
Does someone have to go some place special for that? Also, where do I
find info on updating to snv_73?

thanks
Paul Raines wrote:
> On Sun, 9 Mar 2008, Marc Bevand wrote:
>
>> Paul Raines <raines <at> nmr.mgh.harvard.edu> writes:
>>> Mar  9 03:22:16 raidsrv03 sata: NOTICE:
>>> /pci@0,0/pci1022,7458@1/pci11ab,11ab@1:
>>> Mar  9 03:22:16 raidsrv03        port 6: device reset
>>> [...]
>>>
>>> The above repeated a few times but now seems to have stopped.
>>> Running 'hd -c' shows all disks as ok. But it seems like I do have
>>> a disk problem. Since everything is redundant (raidz), I don't
>>> understand why a failed disk should lock up the machine like I saw,
>>> unless there is some bigger issue.
>>
>> It looks like your Solaris 10U4 install on a Thumper is affected by:
>> http://bugs.opensolaris.org/view_bug.do?bug_id=6587133
>> Which was discussed here:
>> http://opensolaris.org/jive/thread.jspa?messageID=189256
>> http://opensolaris.org/jive/thread.jspa?messageID=163460
>>
>> Apply T-PATCH 127871-02, or upgrade to snv_73, or wait for 10U5.
>
> I don't find 127871-02 on the normal "Patches and Updates" website.
> Does someone have to go some place special for that? Also, where do I
> find info on updating to snv_73?
>
> thanks

Hi,

Unfortunately, the 127871-* patches are currently feature patches used
in Update 5 builds; they won't be released until U5 ships, which may not
be for another month. It is very dangerous to apply these to a pre-U5
system before they ship. There are no sustaining patches for issue
6579855 (the only CR fixed in 127871-02), but the CR 6587133 mentioned
above is fixed in generic patch 125205-07, available on SunSolve.

Enda
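(A sketch of checking for and applying that fix with the standard patch
tools, assuming the patch has been downloaded from SunSolve into the
current directory:

  # showrev -p | grep 125205   # show any installed revision of the patch
  # patchadd 125205-07         # apply the fix for CR 6587133

patchadd will report any prerequisite patches it needs, and a
kernel/driver patch like this typically requires a reboot to take
effect.)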
Marc Bevand wrote:
> Paul Raines <raines <at> nmr.mgh.harvard.edu> writes:
>
>> Mar  9 03:22:16 raidsrv03 sata: NOTICE:
>> /pci@0,0/pci1022,7458@1/pci11ab,11ab@1:
>> Mar  9 03:22:16 raidsrv03        port 6: device reset
>> [...]
>>
>> The above repeated a few times but now seems to have stopped.
>> Running 'hd -c' shows all disks as ok. But it seems like I do have
>> a disk problem. Since everything is redundant (raidz), I don't
>> understand why a failed disk should lock up the machine like I saw,
>> unless there is some bigger issue.
>
> It looks like your Solaris 10U4 install on a Thumper is affected by:
> http://bugs.opensolaris.org/view_bug.do?bug_id=6587133
> Which was discussed here:
> http://opensolaris.org/jive/thread.jspa?messageID=189256
> http://opensolaris.org/jive/thread.jspa?messageID=163460
>
> Apply T-PATCH 127871-02, or upgrade to snv_73, or wait for 10U5.

I think you jumped to a conclusion that is probably not warranted.
First, he said that the machine was hung and there were no messages
associated with the hang. Later, after rebooting, he saw a few messages
about an (apparently) single bad sector; the system was not hung and
recovered from the error in a reasonable amount of time. When asked, he
replied that he had no evidence to connect the two events. At no time
did he report anything about DMA timeouts. Please don't jump to
conclusions.

Regards,
Lida
Lida Horn <Lida.Horn <at> Sun.COM> writes:
>
> I think you jumped to a conclusion that is probably not warranted.

You are right. I read his error message too hastily and thought I
recognized a pattern (I have been a victim of bug 6587133 myself). And
to top this off, I gave him the wrong patch number.

To answer Paul's question about how to upgrade to snv_73 (if you still
want to upgrade for another reason): actually, I would recommend the
latest SXDE (Solaris Express Developer Edition 1/08, based on build 79).
Boot from the install disc and choose the "Upgrade Install" option.

-- 
Marc Bevand