No Guarantees
2009-Aug-21 10:52 UTC
[zfs-discuss] zpool import hangs with indefinite writes
Every time I attempt to import a particular RAID-Z pool, my system hangs. Specifically, if I open a GNOME terminal and input '$ pfexec zpool import mypool', the process never completes and I never return to the prompt. If I open another terminal, I can input 'zpool status' and see that the pool has been imported, but the filesystem has not been mounted. In other words, there is no /mypool in the tree. If I issue 'zpool iostat 1', I can see that there are constant writes to the pool, but NO reads. If I halt the zpool import and then do a 'zpool scrub', it completes with no errors after about 12-17 hours (it's a 5TB pool).

I have looked through this forum and found many examples where people can't import due to hardware failure and lack of redundancy, but none where they had a redundant setup, everything appears okay, and they STILL can't import. I can export the pool without any problems. I need to do this before rebooting, otherwise the system hangs on reboot, probably while trying to import the pool. I've looked around for troubleshooting info, but the only thing that gives me any hint of what is wrong is a core dump after issuing 'zdb -v mypool'. It fails with:

Assertion failed: object_count == usedobjs (0x7 == 0x6), file ../zdb.c, line 1215

Any suggestions?
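For anyone hitting the same wall, it can help to look at a pool while it is still exported, so nothing gets mounted or written. A minimal sketch using zdb's exported-pool mode (read-only inspection; exact option behavior may differ between builds, so treat this as an illustration, not a recipe):

  # Examine the exported pool's labels/uberblocks without importing it.
  pfexec zdb -e -u mypool
  # Walk the datasets recorded in the exported pool.
  pfexec zdb -e -d mypool

If these complete cleanly while the import still hangs, that points at something in the final, mount-time stage of import rather than at the on-disk data.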
Okay, I'm trying to do whatever I can NONDESTRUCTIVELY to fix this. I have almost 5TB of data that I can't afford to lose (baby pictures and videos, etc.). Since no one has seen this problem before, maybe someone can tell me what I need to do to make a backup of what I have now, so I can try other methods to recover this data. Just about everything I do starts writing to these drives, and that worries me. For example: 'zfs list' writes, 'zfs volinit' writes. I am hoping that I have not already ruined the existing data on these drives, but I do not know enough about ZFS to troubleshoot or test.

I'm a little frustrated about this, because I think I have everything I need, but I still can't access anything: 64-bit - check, ECC RAM - check, redundancy (RAID-Z) - check. I don't know if I'm not following the proper protocol for posting and that is why I am not getting any responses, or what. At this point, I'm open to any suggestions (right or wrong). If the only way I'll get any help is to pay for OpenSolaris support, let me know.

-- This message posted from opensolaris.org
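Since the immediate worry is writes, one blunt but nondestructive option is a raw, device-level copy of each disk in the pool before experimenting further. A sketch, with the pool exported first; the device name is a placeholder, and you need scratch space large enough for an image per disk:

  # Image one pool disk; repeat for each member of the RAID-Z.
  # c8t1d0 is a placeholder -- check 'format' or 'zpool status' for real names.
  pfexec dd if=/dev/rdsk/c8t1d0p0 of=/backup/c8t1d0.img bs=1024k conv=noerror

With images of every member disk saved, any later recovery attempt can be retried from a known-good starting point.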
Victor Latushkin
2009-Aug-26 08:30 UTC
[zfs-discuss] zpool import hangs with indefinite writes
On 21.08.09 14:52, No Guarantees wrote:
> Every time I attempt to import a particular RAID-Z pool, my system hangs.
> Specifically, if I open a GNOME terminal and input '$ pfexec zpool import
> mypool', the process never completes and I never return to the prompt. If
> I open another terminal, I can input 'zpool status' and see that the
> pool has been imported, but the directory has not been mounted.

This suggests that the import is partially done; it is just unable to perform some final stage of the process, so 'zpool import' never returns. So you need to do something like this to see where 'zpool import' is stuck:

1. Find the PID of the hanging 'zpool import', e.g. with 'ps -ef | grep zpool'.

2. Substitute PID with the actual number in the command below:

echo "0tPID::pid2proc|::walk thread|::findstack -v" | mdb -k

3. Do:

echo "::spa" | mdb -k

4. Find the address of your pool in the output of step 3 and replace ADDR with it in the command below (it is a single line):

echo "ADDR::print spa_t spa_dsl_pool->dp_tx.tx_sync_thread|::findstack -v" | mdb -k

5. Run the command in step 4 several times.

This could be the first step. Another option may be to force a crash dump.

> In other words, there is no /mypool in the tree. If I issue 'zpool iostat 1'
> I can see that there are constant writes to the pool, but NO reads. If I halt
> the zpool import,

What do you mean by halt here? Are you able to interrupt 'zpool import' with CTRL-C?

> and then do a 'zpool scrub', it will complete with no errors after about
> 12-17 hours (it's a 5TB pool).

That sounds promising. Does 'zfs list' provide any output?

Apparently, as you have 5TB of data there, it worked fine some time ago. What happened to the pool before this issue was noticed?

regards,
victor
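Steps 1 through 4 above can also be collapsed into one small script. A rough sketch (untested; it assumes exactly one hanging zpool process and that the pool shows up as mypool in the ::spa output):

  #!/bin/sh
  # Step 1: PID of the stuck zpool command.
  PID=`pgrep -x zpool`
  # Step 2: stacks of the zpool process's threads.
  echo "0t${PID}::pid2proc|::walk thread|::findstack -v" | mdb -k
  # Steps 3-4: find the pool's spa address, then dump the sync thread's stack.
  ADDR=`echo ::spa | mdb -k | awk '/mypool/ {print $1}'`
  echo "${ADDR}::print spa_t spa_dsl_pool->dp_tx.tx_sync_thread|::findstack -v" | mdb -k

Running the whole thing a few times in a row, as step 5 asks, shows whether the sync thread is making progress or parked on the same stack.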
Thank you so much for your reply! Here are the outputs:

> 1. Find the PID of the hanging 'zpool import', e.g. with 'ps -ef | grep zpool'.

root at mybox:~# ps -ef | grep zpool
    root   915   908   0 03:34:46 pts/3   0:00 grep zpool
    root   901   874   1 03:34:09 pts/2   0:00 zpool import drowning

> 2. Substitute PID with the actual number in the command below:
> echo "0tPID::pid2proc|::walk thread|::findstack -v" | mdb -k

root at mybox:~# echo "0t901::pid2proc|::walk thread|::findstack -v" | mdb -k
stack pointer for thread ffffff02ed8c7880: ffffff0010191a10
[ ffffff0010191a10 _resume_from_idle+0xf1() ]
  ffffff0010191a40 swtch+0x147()
  ffffff0010191a70 cv_wait+0x61(ffffff02eb010dda, ffffff02eb010d98)
  ffffff0010191ac0 txg_wait_synced+0x7f(ffffff02eb010c00, 31983c5)
  ffffff0010191b00 dsl_sync_task_group_wait+0xee(ffffff02f1d11bd8)
  ffffff0010191b80 dsl_sync_task_do+0x65(ffffff02eb010c00, fffffffff78be1f0, fffffffff78be250, ffffff02edc38400, ffffff0010191b98, 0)
  ffffff0010191bd0 dsl_dataset_rollback+0x53(ffffff02edc38400, 2)
  ffffff0010191c00 dmu_objset_rollback+0x46(ffffff02eb674b20)
  ffffff0010191c40 zfs_ioc_rollback+0x10d(ffffff02f2b58000)
  ffffff0010191cc0 zfsdev_ioctl+0x10b(b600000000, 5a1a, 803e240, 100003, ffffff02ee813338, ffffff0010191de4)
  ffffff0010191d00 cdev_ioctl+0x45(b600000000, 5a1a, 803e240, 100003, ffffff02ee813338, ffffff0010191de4)
  ffffff0010191d40 spec_ioctl+0x83(ffffff02df6a7480, 5a1a, 803e240, 100003, ffffff02ee813338, ffffff0010191de4, 0)
  ffffff0010191dc0 fop_ioctl+0x7b(ffffff02df6a7480, 5a1a, 803e240, 100003, ffffff02ee813338, ffffff0010191de4, 0)
  ffffff0010191ec0 ioctl+0x18e(3, 5a1a, 803e240)
  ffffff0010191f10 _sys_sysenter_post_swapgs+0x14b()

> 3. Do:
> echo "::spa" | mdb -k

root at mybox:~# echo "::spa" | mdb -k
ADDR                 STATE NAME
ffffff02f2b8b800    ACTIVE mypool
ffffff02d5890000    ACTIVE rpool

> 4. Find the address of your pool in the output of step 3 and replace ADDR with it
> in the command below (it is a single line):
> echo "ADDR::print spa_t spa_dsl_pool->dp_tx.tx_sync_thread|::findstack -v" | mdb -k

root at mybox:~# echo "ffffff02f2b8b800::print spa_t spa_dsl_pool->dp_tx.tx_sync_thread|::findstack -v" | mdb -k
mdb: spa_t is not a struct or union type

So I decided to remove "spa_t" to see what would happen:

root at mybox:~# echo "ffffff02f2b8b800::print spa_dsl_pool->dp_tx.tx_sync_thread|::findstack -v" | mdb -k
mdb: failed to look up type spa_dsl_pool->dp_tx.tx_sync_thread: no symbol corresponds to address

> What do you mean by halt here? Are you able to interrupt 'zpool import' with CTRL-C?

Yes.

> Does 'zfs list' provide any output?

JACKPOT!!!!! When I ran 'zfs list', the import completed! But now 'zfs list' hangs just like 'zpool import' did.
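When ::print rejects the chained expression, as it did here, one workaround is to follow the pointers one dcmd at a time, substituting each printed address into the next command. A sketch (the spa address is the one from the output above; DSL_POOL_ADDR and THREAD_ADDR are placeholders for values each step prints):

  # Step 1: pull the dsl_pool pointer out of the spa_t.
  echo "ffffff02f2b8b800::print spa_t spa_dsl_pool" | mdb -k
  # Step 2: DSL_POOL_ADDR is a placeholder for the address printed by step 1.
  echo "DSL_POOL_ADDR::print dsl_pool_t dp_tx.tx_sync_thread" | mdb -k
  # Step 3: THREAD_ADDR is a placeholder for the thread pointer from step 2.
  echo "THREAD_ADDR::findstack -v" | mdb -k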
root at mybox:~# ps -ef | grep zfs
    root   940   874   0 03:49:15 pts/2   0:00 grep zfs
    root   936   908   0 03:44:28 pts/3   0:01 zfs list

root at mybox:~# echo "0t936::pid2proc|::walk thread|::findstack -v" | mdb -k
stack pointer for thread ffffff02d72ea020: ffffff000fdeaa10
[ ffffff000fdeaa10 _resume_from_idle+0xf1() ]
  ffffff000fdeaa40 swtch+0x147()
  ffffff000fdeaa70 cv_wait+0x61(ffffff02eb010dda, ffffff02eb010d98)
  ffffff000fdeaac0 txg_wait_synced+0x7f(ffffff02eb010c00, 31990da)
  ffffff000fdeab00 dsl_sync_task_group_wait+0xee(ffffff02f1d11bd8)
  ffffff000fdeab80 dsl_sync_task_do+0x65(ffffff02eb010c00, fffffffff78be1f0, fffffffff78be250, ffffff02f1d0ce00, ffffff000fdeab98, 0)
  ffffff000fdeabd0 dsl_dataset_rollback+0x53(ffffff02f1d0ce00, 2)
  ffffff000fdeac00 dmu_objset_rollback+0x46(ffffff02eb3322a8)
  ffffff000fdeac40 zfs_ioc_rollback+0x10d(ffffff02ebf4e000)
  ffffff000fdeacc0 zfsdev_ioctl+0x10b(b600000000, 5a1a, 8043a20, 100003, ffffff02ee813e78, ffffff000fdeade4)
  ffffff000fdead00 cdev_ioctl+0x45(b600000000, 5a1a, 8043a20, 100003, ffffff02ee813e78, ffffff000fdeade4)
  ffffff000fdead40 spec_ioctl+0x83(ffffff02df6a7480, 5a1a, 8043a20, 100003, ffffff02ee813e78, ffffff000fdeade4, 0)
  ffffff000fdeadc0 fop_ioctl+0x7b(ffffff02df6a7480, 5a1a, 8043a20, 100003, ffffff02ee813e78, ffffff000fdeade4, 0)
  ffffff000fdeaec0 ioctl+0x18e(3, 5a1a, 8043a20)
  ffffff000fdeaf10 _sys_sysenter_post_swapgs+0x14b()

> Apparently, as you have 5TB of data there, it worked fine some time ago. What
> happened to the pool before this issue was noticed?

A reboot? This box acts as network storage for all of my computers. All of the PCs in the house are set to back up to it daily, and it is like an extra hard drive for my wife's netbook and laptop. We dump all of the pictures off the camera there, as well as any HD video we capture. I NEVER reboot this box unless I am prompted to. I'm running OpenSolaris (uname -a: SunOS mybox 5.11 snv_111b i86pc i386 i86pc Solaris), and if I remember right, I was prompted to update. I did so and needed to reboot. I rebooted, and the box would not start. I used another PC to find out how to start in single-user mode and tried that. No dice. I had to physically remove the drives to get to a login prompt.

BTW, I just stopped the 'zfs list' after about 30 minutes running, and it was constantly writing to my drives (I used 'zpool iostat 1' to check). I am by no means an expert, but whatever 'zfs list' is trying to do, it is hanging. Right now, my goal is to back up all of my important data. Once I do that, I will delete this pool and start over from scratch. My biggest concern is to keep this from happening again. Any suggestions?

-- This message posted from opensolaris.org
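Both stack traces bottom out in dsl_dataset_rollback, which is at least consistent with something snapshot-related rather than damage to the data itself. Before deleting anything, it may be worth counting the snapshots; a sketch, assuming 'zfs list -t snapshot' returns at all once the hanging operation has finished:

  # Count every snapshot in the pool (daily auto-snapshots add up fast).
  zfs list -H -t snapshot -o name | wc -l
  # Show the newest snapshots, sorted by creation time.
  zfs list -t snapshot -o name,creation -s creation | tail -20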
I used the GUI to delete all my snapshots, and after that, 'zfs list' worked without hanging. I did a 'zpool scrub' and will wait to see what happens with that. I DID have automatic snapshots enabled before; they are disabled now. I don't know how the snapshots work, to be honest, so maybe I ran into some upper limit on the number of snapshots? I AM running daily backups on all computers: Windows, Linux, and Mac OS.

-- This message posted from opensolaris.org
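For the record, the Time Slider auto-snapshots are SMF services, so they can be audited and disabled from a shell as well as from the GUI. A sketch (instance names as shipped with OpenSolaris 2008.11/2009.06; verify with svcs first):

  # List the auto-snapshot service instances and their current state.
  svcs -a | grep auto-snapshot
  # Disable one schedule, e.g. the frequent (15-minute) one.
  pfexec svcadm disable svc:/system/filesystem/zfs/auto-snapshot:frequent

Keeping a schedule or two enabled but pruning old snapshots regularly is the usual middle ground between no snapshots at all and the unbounded accumulation that seems to have triggered this hang.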