[Initial version of this message originally sent to zfs-interest by
mistake.  Sorry if this appears anywhere as a duplicate.]
I was noodling around with creating a backup script for my home
system, and I ran into a problem that I'm having a little trouble
diagnosing.  Has anyone seen anything like this or have any debug
advice?
I did a "zfs create -r" to set a snapshot on all of the members of a
given pool.  Later, for reasons that are probably obscure, I wanted to
rename that snapshot.  There's no "zfs rename -r" function, so I tried
to write a crude one on my own:
zfs list -rHo name -t filesystem pool |
while read name; do
	zfs rename $name@foo $name@bar
done
The results were disappointing.  The system was extremely busy for a
moment and then went completely catatonic.  Most network traffic
appeared to stop, though I _think_ network driver interrupts were
still working.  The keyboard and mouse (traditional PS/2 types; not
USB) went dead -- not even keyboard lights were working (nothing from
Caps Lock).  The disk light stopped flashing and went dark.  The CPU
temperature started to climb (as measured by an external sensor).  No
messages were written to /var/adm/messages or dmesg on reboot.
The system turned into an increasingly warm brick.  As all of my
inputs to the system were gone, I really had no good way immediately
available to debug the problem.  Thinking this was just a fluke or
perhaps something induced by hardware, I shut everything down, cooled
off, and tried again.  Three times.  The same thing happened each
time.
System details:
  - snv_55
  - Tyan 2885 motherboard with 4GB RAM (four 1GB modules) and one
    Opteron 246 (model 5 step 8).
  - AMI BIOS version 080010, dated 06/14/2005.  No tweaks applied,
    system is always on; no power management.
  - Silicon Image 3114 SATA controller configured for legacy (not
    RAID) mode.
  - Three SATA disks in the system, no IDE as they've gone to the
    great bit-bucket in the sky.  The SATA drives are one WDC
    WD740GD-32F (not part of this ZFS pool), and a pair of
    ST3250623NS.
  - The two Seagate drives are partitioned like this:
  0       root    wm       3 -   655        5.00GB    (653/0/0)    10490445
  1       swap    wm     656 -   916        2.00GB    (261/0/0)     4192965
  2     backup    wu       0 - 30397      232.86GB    (30398/0/0) 488343870
  3   reserved    wm     917 -   917        7.84MB    (1/0/0)         16065
  4 unassigned    wu       0                0         (0/0/0)             0
  5 unassigned    wu       0                0         (0/0/0)             0
  6 unassigned    wu       0                0         (0/0/0)             0
  7       home    wm     918 - 30397      225.83GB    (29480/0/0) 473596200
  8       boot    wu       0 -     0        7.84MB    (1/0/0)         16065
  9 alternates    wm       1 -     2       15.69MB    (2/0/0)         32130
  - For both disks: slice 0 is for an SVM mirrored root, slice 1 has
    swap, slice 3 has the SVM metadata, and slice 7 is in the ZFS pool
    named "pool" as a mirror.  No, I'm not using whole-disk or EFI.
  - Zpool status:
  pool: pool
 state: ONLINE
 scrub: none requested
config:
        NAME        STATE     READ WRITE CKSUM
        pool        ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c4d0s7  ONLINE       0     0     0
            c4d1s7  ONLINE       0     0     0
  - 'zfs list -rt filesystem pool | wc -l' says 37.
  - Iostat -E doesn't show any errors of any kind on the drives.
  - I read through CR 6421427, but that seems to be SPARC-only.
Next step will probably be to set the 'snooping' flag and maybe hack
the bge driver to do an abort_sequence_enter() call on a magic packet
so that I can wrest control back.  Before I do something that drastic,
does anyone else have ideas?
-- 
James Carlson, Solaris Networking              <james.d.carlson at sun.com>
Sun Microsystems / 1 Network Drive         71.232W   Vox +1 781 442 2084
MS UBUR02-212 / Burlington MA 01803-2757   42.496N   Fax +1 781 442 1677
Wade.Stuart at fallon.com
2007-Jan-08  20:54 UTC
[zfs-discuss] hard-hang on snapshot rename
> I did a "zfs create -r" to set a snapshot on all of the members of a
> given pool.  Later, for reasons that are probably obscure, I wanted to
> rename that snapshot.  There's no "zfs rename -r" function, so I tried
> to write a crude one on my own:

Do you mean "zfs snapshot -r <fsname>@foo" instead of the create?

> zfs list -rHo name -t filesystem pool |
> while read name; do
> 	zfs rename $name@foo $name@bar
> done

Hmm, just to verify sanity, can you show the output of:

zfs list -rHo name -t filesystem pool

and

zfs list -rHo name -t filesystem pool |
while read name; do
	echo zfs rename $name@foo $name@bar
done

(note the echo inserted above)
Wade.Stuart at fallon.com writes:
> do you mean "zfs snapshot -r <fsname>@foo" instead of the create?

Yes; sorry.  A bit of a typo there.

> hmm, just to verify sanity, can you show the output of:
>
> zfs list -rHo name -t filesystem pool
>
> and
>
> zfs list -rHo name -t filesystem pool |
> while read name; do
> 	echo zfs rename $name@foo $name@bar
> done
>
> (note the echo inserted above)

Sure, but it's not a shell problem.  I should have mentioned that when
I brought the system back up, *most* of the renames had actually taken
place, but not *all* of them.  I ended up with mostly pool...@foo, but
with a handful of stragglers near the end of the list pool...@bar.

The output looks a bit like this (not _all_ file systems shown, but
representative ones):

pool
pool/HTSData
pool/apache
pool/client
pool/csw
pool/home
pool/home/benjamin
pool/home/beth
pool/home/carlsonj
pool/home/ftp
pool/laptop
pool/local
pool/music
pool/photo
pool/sys
pool/sys/core
pool/sys/dhcp
pool/sys/mail
pool/sys/named

And then:

zfs rename pool@foo pool@bar
zfs rename pool/HTSData@foo pool/HTSData@bar
zfs rename pool/apache@foo pool/apache@bar
zfs rename pool/client@foo pool/client@bar
zfs rename pool/csw@foo pool/csw@bar
zfs rename pool/home@foo pool/home@bar
zfs rename pool/home/benjamin@foo pool/home/benjamin@bar
zfs rename pool/home/beth@foo pool/home/beth@bar
zfs rename pool/home/carlsonj@foo pool/home/carlsonj@bar
zfs rename pool/home/ftp@foo pool/home/ftp@bar
zfs rename pool/laptop@foo pool/laptop@bar
zfs rename pool/local@foo pool/local@bar
zfs rename pool/music@foo pool/music@bar
zfs rename pool/photo@foo pool/photo@bar
zfs rename pool/sys@foo pool/sys@bar
zfs rename pool/sys/core@foo pool/sys/core@bar
zfs rename pool/sys/dhcp@foo pool/sys/dhcp@bar
zfs rename pool/sys/mail@foo pool/sys/mail@bar
zfs rename pool/sys/named@foo pool/sys/named@bar

It's not a matter of the shell script not working; it's a matter of
something inside the kernel (perhaps not even ZFS but instead a driver
related to SATA?) experiencing vapor-lock.

Other heavy load on the system, though, doesn't cause this to happen.
This one operation does cause the lock-up.

-- 
James Carlson, Solaris Networking              <james.d.carlson at sun.com>
Wade.Stuart at fallon.com
2007-Jan-08  22:06 UTC
[zfs-discuss] hard-hang on snapshot rename
James Carlson <james.d.carlson at sun.com> wrote on 01/08/2007 03:26:14 PM:
> It's not a matter of the shell script not working; it's a matter of
> something inside the kernel (perhaps not even ZFS but instead a driver
> related to SATA?) experiencing vapor-lock.
>
> Other heavy load on the system, though, doesn't cause this to happen.
> This one operation does cause the lock-up.

Understood.  Two things: does the rename loop hit any of the fs in
question, and does putting a " sort -r | " before the while make any
difference?
Wade.Stuart at fallon.com
2007-Jan-08  22:11 UTC
[zfs-discuss] hard-hang on snapshot rename
zfs-discuss-bounces at opensolaris.org wrote on 01/08/2007 04:06:46 PM:
> James Carlson <james.d.carlson at sun.com> wrote on 01/08/2007 03:26:14 PM:
> > Sure, but it's not a shell problem.  I should have mentioned that when
> > I brought the system back up, *most* of the renames had actually taken
> > place, but not *all* of them.  I ended up with mostly pool...@foo, but
> > with a handful of stragglers near the end of the list pool...@bar.

Sorry, I missed this; ignore my first question.

> > It's not a matter of the shell script not working; it's a matter of
> > something inside the kernel (perhaps not even ZFS but instead a driver
> > related to SATA?) experiencing vapor-lock.
> >
> > Other heavy load on the system, though, doesn't cause this to happen.
> > This one operation does cause the lock-up.
>
> Understood.  Two things, does the rename loop hit any of the fs in
> question, and does putting a " sort -r | " before the while make any
> difference?

The reason I ask is that I had a similar issue running through batch
renames (from epoch -> human) of my snapshots.  It seemed to cause a
system lock unless I did the batch depth first (sort -r).
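
The depth-first ordering suggested above can be tried as a dry run before
anything is renamed.  The following is only a sketch: plan_renames is a
hypothetical helper name, a POSIX shell is assumed, and the hand-written
list stands in for the output of "zfs list -rHo name -t filesystem pool".
Because it only echoes the commands, it does not touch the pool, and
nothing in this thread establishes that the ordering avoids the hang.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) the rename commands, children first.
# Reads dataset names on stdin, one per line; $1 = old snapshot name,
# $2 = new snapshot name.
plan_renames() {
    old=$1
    new=$2
    # A reverse sort in the C locale lists every child before its parent.
    LC_ALL=C sort -r |
    while IFS= read -r name; do
        # echo keeps this a dry run; drop it only once the plan looks right.
        echo "zfs rename $name@$old $name@$new"
    done
}

# Sample input standing in for the real file system list.
printf '%s\n' pool pool/home pool/home/beth pool/sys | plan_renames foo bar
```

With the sample input, the plan starts at pool/sys and ends with the
top-level pool, so each parent is handled after all of its children.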
Wade.Stuart at fallon.com writes:
> > Other heavy load on the system, though, doesn't cause this to happen.
> > This one operation does cause the lock-up.
>
> Understood.  Two things, does the rename loop hit any of the fs in
> question,

No; the loop you saw is essentially what I ran.  (Other than that it was
"level0new" and "level0" instead of "foo" and "bar.")  Thinking it was
some locking issue, I did try saving off the list in a file (on tmpfs),
and then running it through the while loop -- that produced the same
result.

> and does putting a " sort -r | " before the while make any
> difference?

I'll give it a try tonight and see.  It's a "production" system, so I
have to wait until all of the users are asleep or otherwise occupied by
"Two And A Half Men" reruns to try something hazardous like that.

-- 
James Carlson, Solaris Networking              <james.d.carlson at sun.com>
Wade.Stuart at fallon.com writes:
> > Understood.  Two things, does the rename loop hit any of the fs in
> > question, and does putting a " sort -r | " before the while make any
> > difference?
>
> The reason I ask is because I had a similar issue running through batch
> renames (from epoch -> human) of my snapshots.  It seemed to cause a
> system lock unless I did the batch depth first (sort -r).

Well, it still hung, but the test itself revealed an interesting clue to
the problem.

In the previous trials, the last two file systems were left unchanged
(snapshots were unrenamed) after rebooting.  In the sort -r trial, only
the last one was changed.

This means that the penultimate entry in the list is 'magical' somehow.
It's not shared via NFS, nor used for Zones, nor does it have
compression enabled.  The only thing special about that one file system
is that it has over 100K tiny files on it.

I'll do some more digging as time (and other users) permit.

-- 
James Carlson, Solaris Networking              <james.d.carlson at sun.com>
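
Working out which snapshots were left unrenamed after a reboot, as
described above, can be done mechanically from the snapshot list.  This
is a minimal sketch, assuming a POSIX shell; stragglers is a hypothetical
helper name, and the sample names stand in for the output of
"zfs list -H -o name -t snapshot".

```shell
#!/bin/sh
# Hypothetical helper: read snapshot names on stdin and print the datasets
# whose snapshot still carries the old name given as $1.
stragglers() {
    old=$1
    while IFS= read -r snap; do
        case $snap in
            # Matches only names ending in "@<old>"; strip the suffix.
            *"@$old") echo "${snap%@*}" ;;
        esac
    done
}

# Sample input: only pool/sys/mail still has the old snapshot name.
printf '%s\n' pool@bar pool/sys/mail@foo pool/sys/named@bar | stragglers foo
```

Comparing this list against the order in which the renames were issued
shows where in the sequence the system stopped making progress.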