I had a strange ZFS problem this morning. The entire system would hang when mounting the ZFS filesystems. After trial and error I determined that the problem was with one of the 2500 ZFS filesystems. When mounting that user's home the system would hang and need to be rebooted. After I removed the snapshots (9 of them) for that filesystem everything was fine.

I don't know how to reproduce this and didn't get a crash dump. I don't remember seeing anything about this before, so I wanted to report it and see if anyone has any ideas.

The system is a Sun Fire 280R with 3GB of RAM running SXCR b40. The pool looks like this (I'm running a scrub currently):

    # zpool status pool1
      pool: pool1
     state: ONLINE
     scrub: scrub in progress, 78.61% done, 0h18m to go
    config:

            NAME         STATE     READ WRITE CKSUM
            pool1        ONLINE       0     0     0
              raidz      ONLINE       0     0     0
                c1t8d0   ONLINE       0     0     0
                c1t9d0   ONLINE       0     0     0
                c1t10d0  ONLINE       0     0     0
                c1t11d0  ONLINE       0     0     0

    errors: No known data errors

Ben

This message posted from opensolaris.org
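The workaround above boils down to listing the snapshots under the affected dataset and destroying them one at a time. A minimal sketch; pool1 is the real pool name, but home/user1 and the snapshot placeholder are hypothetical, since the thread does not show the real dataset names:

    # zfs list -t snapshot -r pool1/home/user1
    # zfs destroy pool1/home/user1@<snapshot-name>

Repeat the destroy for each of the snapshots listed.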
Ben Miller wrote:
> I had a strange ZFS problem this morning. The entire system would
> hang when mounting the ZFS filesystems. After trial and error I
> determined that the problem was with one of the 2500 ZFS filesystems.
> When mounting that user's home the system would hang and need to be
> rebooted. After I removed the snapshots (9 of them) for that
> filesystem everything was fine.
>
> I don't know how to reproduce this and didn't get a crash dump. I
> don't remember seeing anything about this before so I wanted to
> report it and see if anyone has any ideas.

Hmm, that sounds pretty bizarre, since I don't think that mounting a filesystem really interacts with snapshots at all. Unfortunately, I don't think we'll be able to diagnose this without a crash dump or reproducibility. If it happens again, force a crash dump while the system is hung and we can take a look at it.

--matt
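On a hung SPARC box like this 280R, forcing a crash dump generally means dropping to the OpenBoot prompt and syncing; the "sync initiated" panic message that appears later in this thread is exactly what that produces. Roughly, assuming a dump device has already been configured with dumpadm:

    ok sync

Stop-A on a Sun keyboard (or a break sequence on the serial console) reaches the ok prompt; sync panics the kernel and writes the dump, and savecore stores it under /var/crash/<hostname> on the next boot.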
Robert Milkowski
2006-Sep-12 20:16 UTC
[zfs-discuss] System hang caused by a "bad" snapshot
Hello Matthew,

Tuesday, September 12, 2006, 7:57:45 PM, you wrote:

MA> Ben Miller wrote:
>> I had a strange ZFS problem this morning. The entire system would
>> hang when mounting the ZFS filesystems. After trial and error I
>> determined that the problem was with one of the 2500 ZFS filesystems.
>> When mounting that user's home the system would hang and need to be
>> rebooted. After I removed the snapshots (9 of them) for that
>> filesystem everything was fine.
>>
>> I don't know how to reproduce this and didn't get a crash dump. I
>> don't remember seeing anything about this before so I wanted to
>> report it and see if anyone has any ideas.

MA> Hmm, that sounds pretty bizarre, since I don't think that mounting a
MA> filesystem really interacts with snapshots at all.
MA> Unfortunately, I don't think we'll be able to diagnose this without a
MA> crash dump or reproducibility. If it happens again, force a crash dump
MA> while the system is hung and we can take a look at it.

Maybe it wasn't hung after all. I've seen similar behavior here sometimes. Were the disks in your pool actually working?

Sometimes it takes a lot of time (30-50 minutes) to mount a file system - it's rare, but it happens, and while it does ZFS reads from the disks in the pool. I reported it here some time ago.

-- 
Best regards,
 Robert                          mailto:rmilkowski at task.gda.pl
                                 http://milek.blogspot.com
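A quick way to tell a genuinely hung system from a slow mount is to watch whether the pool's disks are still doing I/O. A rough sketch, reusing the pool name from the first message:

    # iostat -xn 5
    # zpool iostat pool1 5

If reads keep ticking over on c1t8d0 through c1t11d0 the mount is probably still making progress; if everything flatlines, as in the follow-up below, it is a real hang.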
Ben Miller
2006-Sep-13 13:42 UTC
[zfs-discuss] Re: Re[2]: System hang caused by a "bad" snapshot
> Hello Matthew,
>
> Tuesday, September 12, 2006, 7:57:45 PM, you wrote:
>
> MA> Ben Miller wrote:
> >> [...]
>
> MA> Hmm, that sounds pretty bizarre, since I don't think that mounting a
> MA> filesystem really interacts with snapshots at all.
> MA> Unfortunately, I don't think we'll be able to diagnose this without a
> MA> crash dump or reproducibility. If it happens again, force a crash dump
> MA> while the system is hung and we can take a look at it.
>
> Maybe it wasn't hung after all. I've seen similar behavior here
> sometimes. Were the disks in your pool actually working?
>

There was lots of activity on the disks (iostat and status LEDs) until it got to this one filesystem and everything stopped. 'zpool iostat 5' stopped running, the shell wouldn't respond and activity on the disks stopped. This fs is relatively small (175M used of a 512M quota).

> Sometimes it takes a lot of time (30-50 minutes) to mount a file system
> - it's rare, but it happens, and while it does ZFS reads from the disks
> in the pool. I reported it here some time ago.
>

In my case the system crashed during the evening and it was left hung up when I came in during the morning, so it was hung for a good 9-10 hours.

Ben

This message posted from opensolaris.org
Ben Miller
2006-Sep-27 12:33 UTC
[zfs-discuss] Re: Re[2]: System hang caused by a "bad" snapshot
> > Hello Matthew,
> >
> > [...]
> >
> > Maybe it wasn't hung after all. I've seen similar behavior here
> > sometimes. Were the disks in your pool actually working?
> >
> There was lots of activity on the disks (iostat and status LEDs) until
> it got to this one filesystem and everything stopped. 'zpool iostat 5'
> stopped running, the shell wouldn't respond and activity on the disks
> stopped. This fs is relatively small (175M used of a 512M quota).
>
> > Sometimes it takes a lot of time (30-50 minutes) to mount a file system
> > - it's rare, but it happens, and while it does ZFS reads from the disks
> > in the pool. I reported it here some time ago.
> >
> In my case the system crashed during the evening and it was left hung up
> when I came in during the morning, so it was hung for a good 9-10 hours.

The problem happened again last night, but for a different user's filesystem. I took a crash dump with it hung and the back trace looks like this:

    > ::status
    debugging crash dump vmcore.0 (64-bit) from hostname
    operating system: 5.11 snv_40 (sun4u)
    panic message: sync initiated
    dump content: kernel pages only

    > ::stack
    0xf0046a3c(f005a4d8, 2a100047818, 181d010, 18378a8, 1849000, f005a4d8)
    prom_enter_mon+0x24(2, 183c000, 18b7000, 2a100046c61, 1812158, 181b4c8)
    debug_enter+0x110(0, a, a, 180fc00, 0, 183e000)
    abort_seq_softintr+0x8c(180fc00, 18abc00, 180c000, 2a100047d98, 1, 1859800)
    intr_thread+0x170(600019de0e0, 0, 6000d7bfc98, 600019de110, 600019de110, 600019de110)
    zfs_delete_thread_target+8(600019de080, ffffffffffffffff, 0, 600019de080, 6000d791ae8, 60001aed428)
    zfs_delete_thread+0x164(600019de080, 6000d7bfc88, 1, 2a100c4faca, 2a100c4fac8, 600019de0e0)
    thread_start+4(600019de080, 0, 0, 0, 0, 0)

In single user I set the mountpoint for that user to be none and then brought the system up fine. Then I destroyed the snapshots for that user and their filesystem mounted fine. In this case the quota was reached with the snapshots and 52% used without.

Ben

This message posted from opensolaris.org
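The single-user recovery described there, expressed as commands. The dataset name and mountpoint path are hypothetical; the thread does not show the real ones:

    # zfs set mountpoint=none pool1/home/user1
    (finish booting to multi-user, then:)
    # zfs destroy pool1/home/user1@<each snapshot>
    # zfs set mountpoint=/export/home/user1 pool1/home/user1
    # zfs mount pool1/home/user1    (if it is not mounted automatically)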
> [...]
>
> The problem happened again last night, but for a different user's
> filesystem. I took a crash dump with it hung and the back trace looks
> like this:
>
> [...]
>
> In single user I set the mountpoint for that user to be none and then
> brought the system up fine. Then I destroyed the snapshots for that
> user and their filesystem mounted fine. In this case the quota was
> reached with the snapshots and 52% used without.
>
> Ben

Hate to re-open something from a year ago, but we just had this problem happen again. We have been running Solaris 10u3 on this system for a while. I searched the bug reports, but couldn't find anything on this. I also think I understand what happened a little more. We take snapshots at noon and the system hung up during that time. When trying to reboot, the system would hang on the ZFS mounts. After I booted into single user mode and removed the snapshot from the filesystem causing the problem, everything was fine. The filesystem in question was at 100% use with snapshots included.

Here's the back trace for the system when it was hung:

    > ::stack
    0xf0046a3c(f005a4d8, 2a10004f828, 0, 181c850, 1848400, f005a4d8)
    prom_enter_mon+0x24(0, 0, 183b400, 1, 1812140, 181ae60)
    debug_enter+0x118(0, a, a, 180fc00, 0, 183d400)
    abort_seq_softintr+0x94(180fc00, 18a9800, 180c000, 2a10004fd98, 1, 1857c00)
    intr_thread+0x170(2, 30007b64bc0, 0, c001ed9, 110, 60002400000)
    0x985c8(300adca4c40, 0, 0, 0, 0, 30007b64bc0)
    dbuf_hold_impl+0x28(60008cd02e8, 0, 0, 0, 7b648d73, 2a105bb57c8)
    dbuf_hold_level+0x18(60008cd02e8, 0, 0, 7b648d73, 0, 0)
    dmu_tx_check_ioerr+0x20(0, 60008cd02e8, 0, 0, 0, 7b648c00)
    dmu_tx_hold_zap+0x84(60011fb2c40, 0, 0, 0, 30049b58008, 400)
    zfs_rmnode+0xc8(3002410d210, 2a105bb5cc0, 0, 60011fb2c40, 30007b3ff58, 30007b56ac0)
    zfs_delete_thread+0x168(30007b56ac0, 3002410d210, 600009a4778, 30007b56b28, 2a105bb5aca, 2a105bb5ac8)
    thread_start+4(30007b56ac0, 0, 0, 489a4800000000, d83a10bf28, 5000000000386)

Has this been fixed in more recent code? I can make the crash dump available.

Ben

This message posted from opensolaris.org
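Since the common factor seems to be filesystems whose quota is exhausted once snapshot space is counted, it may be worth checking for datasets in that state before the noon snapshots run. A rough sketch with a hypothetical dataset name:

    # zfs get used,referenced,quota pool1/home/user1
    # zfs list -t snapshot -r pool1/home/user1

used counts space held by snapshots while referenced covers only the live data, so a dataset whose used is at its quota but whose referenced is well below it matches the 100%-with-snapshots situation described above.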
Ben,

Much of this code has been revamped as a result of:

6514331 in-memory delete queue is not needed

Although this may not fix your issue, it would be good to try this test with more recent bits.

Thanks,
George

Ben Miller wrote:
> Hate to re-open something from a year ago, but we just had this problem
> happen again. We have been running Solaris 10u3 on this system for a
> while. I searched the bug reports, but couldn't find anything on this.
> I also think I understand what happened a little more. We take snapshots
> at noon and the system hung up during that time. When trying to reboot,
> the system would hang on the ZFS mounts. After I booted into single user
> mode and removed the snapshot from the filesystem causing the problem,
> everything was fine. The filesystem in question was at 100% use with
> snapshots included.
>
> Here's the back trace for the system when it was hung:
> [...]
>
> Has this been fixed in more recent code? I can make the crash dump
> available.
>
> Ben
>
> This message posted from opensolaris.org
> _______________________________________________
> zfs-discuss mailing list
> zfs-discuss at opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
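For anyone retesting on more recent bits, these standard Solaris commands confirm what a system is currently running (the update release and the kernel build/patch level respectively):

    # cat /etc/release
    # uname -v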