I had a strange ZFS problem this morning. The entire system would hang when mounting the ZFS filesystems. After trial and error I determined that the problem was with one of the 2500 ZFS filesystems. When mounting that user's home the system would hang and need to be rebooted. After I removed the snapshots (9 of them) for that filesystem everything was fine.

I don't know how to reproduce this and didn't get a crash dump. I don't remember seeing anything about this before, so I wanted to report it and see if anyone has any ideas.

The system is a Sun Fire 280R with 3GB of RAM running SXCR b40. The pool looks like this (I'm running a scrub currently):

    # zpool status pool1
      pool: pool1
     state: ONLINE
     scrub: scrub in progress, 78.61% done, 0h18m to go
    config:

            NAME         STATE     READ WRITE CKSUM
            pool1        ONLINE       0     0     0
              raidz      ONLINE       0     0     0
                c1t8d0   ONLINE       0     0     0
                c1t9d0   ONLINE       0     0     0
                c1t10d0  ONLINE       0     0     0
                c1t11d0  ONLINE       0     0     0

    errors: No known data errors

Ben

This message posted from opensolaris.org
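The workaround above boils down to listing the snapshots under the affected dataset and destroying them one at a time. A minimal sketch; pool1 is the real pool name, but home/user1 and the snapshot placeholder are hypothetical, since the thread does not show the real dataset names:

    # zfs list -t snapshot -r pool1/home/user1
    # zfs destroy pool1/home/user1@<snapshot-name>

Repeat the destroy for each of the snapshots listed.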
Ben Miller wrote:
> I had a strange ZFS problem this morning. The entire system would
> hang when mounting the ZFS filesystems. After trial and error I
> determined that the problem was with one of the 2500 ZFS filesystems.
> When mounting that user's home the system would hang and need to be
> rebooted. After I removed the snapshots (9 of them) for that
> filesystem everything was fine.
>
> I don't know how to reproduce this and didn't get a crash dump. I
> don't remember seeing anything about this before so I wanted to
> report it and see if anyone has any ideas.

Hmm, that sounds pretty bizarre, since I don't think that mounting a filesystem really interacts with snapshots at all. Unfortunately, I don't think we'll be able to diagnose this without a crash dump or reproducibility. If it happens again, force a crash dump while the system is hung and we can take a look at it.

--matt
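On a hung SPARC box like this 280R, forcing a crash dump generally means dropping to the OpenBoot prompt and syncing; the "sync initiated" panic message that appears later in this thread is exactly what that produces. Roughly, assuming a dump device has already been configured with dumpadm:

    ok sync

Stop-A on a Sun keyboard (or a break sequence on the serial console) reaches the ok prompt; sync panics the kernel and writes the dump, and savecore stores it under /var/crash/<hostname> on the next boot.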
Robert Milkowski
2006-Sep-12 20:16 UTC
[zfs-discuss] System hang caused by a "bad" snapshot
Hello Matthew,

Tuesday, September 12, 2006, 7:57:45 PM, you wrote:

MA> Ben Miller wrote:
>> I had a strange ZFS problem this morning. The entire system would
>> hang when mounting the ZFS filesystems. After trial and error I
>> determined that the problem was with one of the 2500 ZFS filesystems.
>> When mounting that user's home the system would hang and need to be
>> rebooted. After I removed the snapshots (9 of them) for that
>> filesystem everything was fine.
>>
>> I don't know how to reproduce this and didn't get a crash dump. I
>> don't remember seeing anything about this before so I wanted to
>> report it and see if anyone has any ideas.

MA> Hmm, that sounds pretty bizarre, since I don't think that mounting a
MA> filesystem really interacts with snapshots at all.
MA> Unfortunately, I don't think we'll be able to diagnose this without a
MA> crash dump or reproducibility. If it happens again, force a crash dump
MA> while the system is hung and we can take a look at it.

Maybe it wasn't hung after all. I've seen similar behavior here sometimes. Were the disks in your pool actually working?

Sometimes it takes a lot of time (30-50 minutes) to mount a file system - it's rare, but it happens, and while it does ZFS reads from the disks in the pool. I reported it here some time ago.

-- 
Best regards,
 Robert                          mailto:rmilkowski at task.gda.pl
                                 http://milek.blogspot.com
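A quick way to tell a genuinely hung system from a slow mount is to watch whether the pool's disks are still doing I/O. A rough sketch, reusing the pool name from the first message:

    # iostat -xn 5
    # zpool iostat pool1 5

If reads keep ticking over on c1t8d0 through c1t11d0 the mount is probably still making progress; if everything flatlines, as in the follow-up below, it is a real hang.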
Ben Miller
2006-Sep-13 13:42 UTC
[zfs-discuss] Re: Re[2]: System hang caused by a "bad" snapshot
> Hello Matthew,
>
> Tuesday, September 12, 2006, 7:57:45 PM, you wrote:
>
> MA> Ben Miller wrote:
> >> [...]
>
> MA> Hmm, that sounds pretty bizarre, since I don't think that mounting a
> MA> filesystem really interacts with snapshots at all.
> MA> Unfortunately, I don't think we'll be able to diagnose this without a
> MA> crash dump or reproducibility. If it happens again, force a crash dump
> MA> while the system is hung and we can take a look at it.
>
> Maybe it wasn't hung after all. I've seen similar behavior here
> sometimes. Were the disks in your pool actually working?
>

There was lots of activity on the disks (iostat and status LEDs) until it got to this one filesystem and everything stopped. 'zpool iostat 5' stopped running, the shell wouldn't respond and activity on the disks stopped. This fs is relatively small (175M used of a 512M quota).

> Sometimes it takes a lot of time (30-50 minutes) to mount a file system
> - it's rare, but it happens, and while it does ZFS reads from the disks
> in the pool. I reported it here some time ago.
>

In my case the system crashed during the evening and it was left hung up when I came in during the morning, so it was hung for a good 9-10 hours.

Ben

This message posted from opensolaris.org
Ben Miller
2006-Sep-27 12:33 UTC
[zfs-discuss] Re: Re[2]: System hang caused by a "bad" snapshot
> > Hello Matthew,
> >
> > [...]
> >
> > Maybe it wasn't hung after all. I've seen similar behavior here
> > sometimes. Were the disks in your pool actually working?
> >
> There was lots of activity on the disks (iostat and status LEDs) until
> it got to this one filesystem and everything stopped. 'zpool iostat 5'
> stopped running, the shell wouldn't respond and activity on the disks
> stopped. This fs is relatively small (175M used of a 512M quota).
>
> > Sometimes it takes a lot of time (30-50 minutes) to mount a file system
> > - it's rare, but it happens, and while it does ZFS reads from the disks
> > in the pool. I reported it here some time ago.
> >
> In my case the system crashed during the evening and it was left hung up
> when I came in during the morning, so it was hung for a good 9-10 hours.

The problem happened again last night, but for a different user's filesystem. I took a crash dump with it hung and the back trace looks like this:

    > ::status
    debugging crash dump vmcore.0 (64-bit) from hostname
    operating system: 5.11 snv_40 (sun4u)
    panic message: sync initiated
    dump content: kernel pages only

    > ::stack
    0xf0046a3c(f005a4d8, 2a100047818, 181d010, 18378a8, 1849000, f005a4d8)
    prom_enter_mon+0x24(2, 183c000, 18b7000, 2a100046c61, 1812158, 181b4c8)
    debug_enter+0x110(0, a, a, 180fc00, 0, 183e000)
    abort_seq_softintr+0x8c(180fc00, 18abc00, 180c000, 2a100047d98, 1, 1859800)
    intr_thread+0x170(600019de0e0, 0, 6000d7bfc98, 600019de110, 600019de110, 600019de110)
    zfs_delete_thread_target+8(600019de080, ffffffffffffffff, 0, 600019de080, 6000d791ae8, 60001aed428)
    zfs_delete_thread+0x164(600019de080, 6000d7bfc88, 1, 2a100c4faca, 2a100c4fac8, 600019de0e0)
    thread_start+4(600019de080, 0, 0, 0, 0, 0)

In single user I set the mountpoint for that user to be none and then brought the system up fine. Then I destroyed the snapshots for that user and their filesystem mounted fine. In this case the quota was reached with the snapshots and 52% used without.

Ben

This message posted from opensolaris.org
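The single-user recovery described there, expressed as commands. The dataset name and mountpoint path are hypothetical; the thread does not show the real ones:

    # zfs set mountpoint=none pool1/home/user1
    (finish booting to multi-user, then:)
    # zfs destroy pool1/home/user1@<each snapshot>
    # zfs set mountpoint=/export/home/user1 pool1/home/user1
    # zfs mount pool1/home/user1    (if it is not mounted automatically)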
> [...]
>
> The problem happened again last night, but for a different user's
> filesystem. I took a crash dump with it hung and the back trace looks
> like this:
>
> [...]
>
> In single user I set the mountpoint for that user to be none and then
> brought the system up fine. Then I destroyed the snapshots for that
> user and their filesystem mounted fine. In this case the quota was
> reached with the snapshots and 52% used without.
>
> Ben

Hate to re-open something from a year ago, but we just had this problem happen again. We have been running Solaris 10u3 on this system for a while. I searched the bug reports, but couldn't find anything on this. I also think I understand what happened a little more. We take snapshots at noon and the system hung up during that time. When trying to reboot, the system would hang on the ZFS mounts. After I booted into single user mode and removed the snapshot from the filesystem causing the problem, everything was fine. The filesystem in question was at 100% use with snapshots included.

Here's the back trace for the system when it was hung:

    > ::stack
    0xf0046a3c(f005a4d8, 2a10004f828, 0, 181c850, 1848400, f005a4d8)
    prom_enter_mon+0x24(0, 0, 183b400, 1, 1812140, 181ae60)
    debug_enter+0x118(0, a, a, 180fc00, 0, 183d400)
    abort_seq_softintr+0x94(180fc00, 18a9800, 180c000, 2a10004fd98, 1, 1857c00)
    intr_thread+0x170(2, 30007b64bc0, 0, c001ed9, 110, 60002400000)
    0x985c8(300adca4c40, 0, 0, 0, 0, 30007b64bc0)
    dbuf_hold_impl+0x28(60008cd02e8, 0, 0, 0, 7b648d73, 2a105bb57c8)
    dbuf_hold_level+0x18(60008cd02e8, 0, 0, 7b648d73, 0, 0)
    dmu_tx_check_ioerr+0x20(0, 60008cd02e8, 0, 0, 0, 7b648c00)
    dmu_tx_hold_zap+0x84(60011fb2c40, 0, 0, 0, 30049b58008, 400)
    zfs_rmnode+0xc8(3002410d210, 2a105bb5cc0, 0, 60011fb2c40, 30007b3ff58, 30007b56ac0)
    zfs_delete_thread+0x168(30007b56ac0, 3002410d210, 600009a4778, 30007b56b28, 2a105bb5aca, 2a105bb5ac8)
    thread_start+4(30007b56ac0, 0, 0, 489a4800000000, d83a10bf28, 5000000000386)

Has this been fixed in more recent code? I can make the crash dump available.

Ben

This message posted from opensolaris.org
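Since the common factor seems to be filesystems whose quota is exhausted once snapshot space is counted, it may be worth checking for datasets in that state before the noon snapshots run. A rough sketch with a hypothetical dataset name:

    # zfs get used,referenced,quota pool1/home/user1
    # zfs list -t snapshot -r pool1/home/user1

used counts space held by snapshots while referenced covers only the live data, so a dataset whose used is at its quota but whose referenced is well below it matches the 100%-with-snapshots situation described above.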
Ben,

Much of this code has been revamped as a result of:

6514331 in-memory delete queue is not needed

Although this may not fix your issue, it would be good to try this test with more recent bits.

Thanks,
George

Ben Miller wrote:
> Hate to re-open something from a year ago, but we just had this problem
> happen again. We have been running Solaris 10u3 on this system for a
> while. I searched the bug reports, but couldn't find anything on this.
> I also think I understand what happened a little more. We take snapshots
> at noon and the system hung up during that time. When trying to reboot,
> the system would hang on the ZFS mounts. After I booted into single user
> mode and removed the snapshot from the filesystem causing the problem,
> everything was fine. The filesystem in question was at 100% use with
> snapshots included.
>
> Here's the back trace for the system when it was hung:
> [...]
>
> Has this been fixed in more recent code? I can make the crash dump
> available.
>
> Ben
>
> This message posted from opensolaris.org
> _______________________________________________
> zfs-discuss mailing list
> zfs-discuss at opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
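For anyone retesting on more recent bits, these standard Solaris commands confirm what a system is currently running (the update release and the kernel build/patch level respectively):

    # cat /etc/release
    # uname -v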