Hi everyone,

I have a backup server with a considerably large pool (over 40TB) running Solaris 10 5/09 s10x_u7wos_08 (x86_64). All disks are 2TB SATA hard drives. Here's how it is configured:

# zpool import
  pool: backup
    id: 9395034695502046623
 state: ONLINE
action: The pool can be imported using its name or numeric identifier.
config:

        backup          ONLINE
          raidz1        ONLINE
            c3t102d0    ONLINE
            c3t103d0    ONLINE
            c3t104d0    ONLINE
            c3t105d0    ONLINE
            c3t106d0    ONLINE
            c3t107d0    ONLINE
            c3t108d0    ONLINE
          raidz1        ONLINE
            c3t109d0    ONLINE
            c3t110d0    ONLINE
            c3t111d0    ONLINE
            c3t112d0    ONLINE
            c3t113d0    ONLINE
            c3t114d0    ONLINE
            c3t115d0    ONLINE
          raidz1        ONLINE
            c3t87d0     ONLINE
            c3t88d0     ONLINE
            c3t89d0     ONLINE
            c3t90d0     ONLINE
            c3t91d0     ONLINE
            c3t92d0     ONLINE
            c3t93d0     ONLINE
          raidz1        ONLINE
            c3t94d0     ONLINE
            c3t95d0     ONLINE
            c3t96d0     ONLINE
            c3t97d0     ONLINE
            c3t98d0     ONLINE
            c3t99d0     ONLINE
            c3t100d0    ONLINE
        spares
          c3t116d0
          c3t101d0

This server suffered an unexpected crash, and after a power cycle it refused to mount all ZFS filesystems during the boot procedure, which kept Solaris from starting network services (such as SSH). On every attempt to mount all filesystems it stops at the same point: the boot progress indicator shows "190/339" (I have 339 ZFS filesystems within the only pool on that system).

I rebooted the server in failsafe mode, but that did not load my JBOD controller's drivers, so I couldn't access the pool. To allow Solaris to boot normally, I moved /etc/zfs/zpool.cache away and restarted the OS. After that change, no ZFS filesystems were mounted at boot, and "zpool import" (with no arguments) showed me the information above.

So I went ahead and started importing the pool with "zpool import backup"; however, that hangs and never completes. While the process is hanging, the server is still responsive to all non-ZFS commands and to some ZFS commands as well. For example, I can run "zfs list" and "zpool status" and both show the entire pool. The process refuses to die with a traditional "kill -9", so I rebooted the server and repeated the command under "truss". It was clear that most ZFS filesystems were mounting cleanly and the import was hanging on one specific filesystem (even if I reboot everything and start again, the exact same behavior is seen). Attached you will find the "zpool-import-output.txt" file. The last filesystem it tries to mount is "backup/insightiq".

It's interesting to note that while the "zpool import backup" command is running, I can access all the ZFS filesystems it has already mounted, and I can manually mount others using "zfs mount". Only that one filesystem causes "zfs mount" / "zpool import" to hang.

Because the filesystem in question (backup/insightiq) is being accessed by the hanging "zpool import" command, I cannot simply try to destroy it. I also cannot use "zfs set canmount=noauto" on that filesystem (or any other one) while "zpool import" is running.

Additionally, I'm attaching "threads-list.txt", which is the output of: echo "::threadlist -v" | mdb -k -- and also "zdb-output.txt", which is the output of: zdb

Does anyone know what my next step should be here? I need to restore the pool as soon as possible (I didn't want to, but at this point I'm fine with destroying the problematic filesystem if that saves the entire pool).
Attachments:
- zpool-import-output.txt: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20100507/0d99e3dd/attachment.txt>
- threads-list.txt: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20100507/0d99e3dd/attachment-0001.txt>
- zdb-output.txt: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20100507/0d99e3dd/attachment-0002.txt>

Any help will be greatly appreciated.

Thanks in advance,
Eduardo Bragatto.
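For reference, this is roughly the sequence I used to get the box bootable again and to capture where the import hangs. I am reconstructing it from memory, so treat it as a sketch (the backup file name and truss output path are just what I happened to use) rather than an exact transcript:

    # keep the old cache file around instead of deleting it, then reboot
    mv /etc/zfs/zpool.cache /etc/zfs/zpool.cache.bak
    reboot

    # after the reboot, retry the import under truss so the hang point is visible
    truss -f -o /var/tmp/zpool-import-output.txt zpool import backup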
Additionally, I would like to mention that the only ZFS filesystem not mounting -- the one causing the entire "zpool import backup" command to hang -- is also the only filesystem configured to be exported via NFS:

    backup/insightiq  sharenfs  root=*  local

Is there any chance the NFS share is the culprit here? If so, how can I avoid it?

Thanks,
Eduardo Bragatto
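In case it is useful, these are the checks I can run from the live system while the import is hung. They are listed here only as a sketch (I have not pasted their output anywhere yet); the property line above is simply what "zfs get sharenfs" prints for that filesystem:

    zfs get sharenfs backup/insightiq          # confirm how the filesystem is shared
    svcs -l svc:/network/nfs/server:default    # is the NFS server service online?
    share                                      # list what is currently being exported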
Hi again,

As for the NFS issue I mentioned before, I made sure the NFS server was working and able to export before I attempted to import anything, and then I started a new "zpool import backup". My hope was that the NFS share was causing the issue, since the only shared filesystem is the one causing the problem, but that doesn't seem to be the case.

I've done a lot of research and could not find a case similar to mine. The most similar one I've found was this one from 2008:

http://opensolaris.org/jive/thread.jspa?threadID=70205&tstart=15

I simply cannot import the pool, although ZFS reports it as ONLINE.

In that old thread, the user was also having the "zpool import" hang issue; however, he was able to run these two commands (his pool was named data1):

    zdb -e -bb data1
    zdb -e -dddd data1

While my system returns:

    # zdb -e -bb backup
    zdb: can't open backup: File exists
    # zdb -e -ddd backup
    zdb: can't open backup: File exists

All the documentation assumes you will be able to run "zpool import" before troubleshooting, but my problem is exactly with that command. I don't even know where to find more detailed documentation.

I believe there are very knowledgeable people on this list. Could someone be kind enough to take a look and at least point me in the right direction?

Thanks,
Eduardo Bragatto.
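One thing I have not tried yet, and I mention it only as an idea in case the syntax or behavior is not quite right on this release: zdb should be able to dump just the pool configuration without traversing any datasets, which would at least tell me whether zdb can see the pool at all:

    # hypothetical: print only the configuration of the exported pool, no dataset walk
    zdb -e -C backup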
Howdy Eduardo,

Recently I had a similar issue where the pool wouldn't import and attempting to import it would essentially lock the server up. Finally I used:

    pfexec zpool import -F pool1

and simply let it do its thing. After almost 60 hours the import finished and all has been well since (except my backup procedures have improved!).

Good luck!

John
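One caveat: the -F recovery option depends on the ZFS version, and older releases don't have it. Where it is supported, my understanding is that you can also do a dry run first to see whether a recovery would succeed before committing to it (a sketch from memory; double-check the zpool man page on your release):

    # hypothetical dry run: check whether "import -F" could recover the pool,
    # without actually performing the recovery
    pfexec zpool import -Fn pool1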
Hey John,

thanks a lot for answering -- I already allowed the "zpool import" command to run from Friday to Monday and it did not complete. I also made sure to start it under "truss", and literally nothing happened during that time (the truss output file has nothing new in it).

While the "zpool import" command runs, I don't see any CPU or disk I/O usage. "zpool iostat" shows very little I/O too:

# zpool iostat -v
                 capacity     operations    bandwidth
pool           used  avail   read  write   read  write
------------  -----  -----  -----  -----  -----  -----
backup        31.4T  19.1T     11      2  29.5K  11.8K
  raidz1      11.9T   741G      2      0  3.74K  3.35K
    c3t102d0      -      -      0      0  23.8K  1.99K
    c3t103d0      -      -      0      0  23.5K  1.99K
    c3t104d0      -      -      0      0  23.0K  1.99K
    c3t105d0      -      -      0      0  21.3K  1.99K
    c3t106d0      -      -      0      0  21.5K  1.98K
    c3t107d0      -      -      0      0  24.2K  1.98K
    c3t108d0      -      -      0      0  23.1K  1.98K
  raidz1      12.2T   454G      3      0  6.89K  3.94K
    c3t109d0      -      -      0      0  43.7K  2.09K
    c3t110d0      -      -      0      0  42.9K  2.11K
    c3t111d0      -      -      0      0  43.9K  2.11K
    c3t112d0      -      -      0      0  43.8K  2.09K
    c3t113d0      -      -      0      0  47.0K  2.08K
    c3t114d0      -      -      0      0  42.9K  2.08K
    c3t115d0      -      -      0      0  44.1K  2.08K
  raidz1      3.69T  8.93T      3      0  9.42K    610
    c3t87d0       -      -      0      0  43.6K  1.50K
    c3t88d0       -      -      0      0  43.9K  1.48K
    c3t89d0       -      -      0      0  44.2K  1.49K
    c3t90d0       -      -      0      0  43.4K  1.49K
    c3t91d0       -      -      0      0  42.5K  1.48K
    c3t92d0       -      -      0      0  44.5K  1.49K
    c3t93d0       -      -      0      0  44.8K  1.49K
  raidz1      3.64T  8.99T      3      0  9.40K  3.94K
    c3t94d0       -      -      0      0  31.9K  2.09K
    c3t95d0       -      -      0      0  31.6K  2.09K
    c3t96d0       -      -      0      0  30.8K  2.08K
    c3t97d0       -      -      0      0  34.2K  2.08K
    c3t98d0       -      -      0      0  34.4K  2.08K
    c3t99d0       -      -      0      0  35.2K  2.09K
    c3t100d0      -      -      0      0  34.9K  2.08K
------------  -----  -----  -----  -----  -----  -----

Also, the third raidz1 entry shows much less write bandwidth (610). This is actually the first time it has shown a non-zero value there.

My last attempt to import the pool was with this command:

    zpool import -o failmode=panic -f -R /altmount backup

However, it did not panic. As I mentioned in the first message, it mounts 189 filesystems and hangs on #190. While the command is hanging, I can use "zfs mount" to mount filesystems #191 and above (only one filesystem does not mount, and it causes the import procedure to hang).

Before trying the command above, I was using only "zpool import backup", and "zpool iostat" was showing ZERO write bandwidth for the third raidz1 in the list above (not sure if that means something, but it does look odd).

I'm really at a dead end here; any help is appreciated.

Thanks,
Eduardo Bragatto.
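Since the numbers above are averages accumulated since boot, I am also going to watch the pool with a sampling interval to see whether anything is really moving. This is just a sketch of what I plan to run:

    zpool iostat -v backup 5    # pool/vdev I/O sampled every 5 seconds
    iostat -xn 5                # per-device service times from the OS side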
Hi Eduardo,

Please use the following steps to collect more information:

1. Use the following command to get the PID of the zpool import process, like this:

    # ps -ef | grep zpool

2. Use the actual <PID of zpool import> found in step 1 in the following command, like this:

    # echo "0t<PID of zpool import>::pid2proc|::walk thread|::findstack" | mdb -k

Then, send the output.

Thanks,
Cindy
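If it is more convenient, the two steps can be combined into a one-liner. This is only a convenience sketch and assumes a single zpool process is running; the two separate steps above are equivalent:

    # find the PID of the running zpool command and dump its kernel thread stacks
    pid=`pgrep -x zpool`
    echo "0t${pid}::pid2proc|::walk thread|::findstack" | mdb -k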
Hi Cindy,

first of all, thank you for taking the time to answer my question. Here's the output of the command you requested:

# echo "0t733::pid2proc|::walk thread|::findstack" | mdb -k
stack pointer for thread ffffffff94e4db40: fffffe8000d3e5b0
[ fffffe8000d3e5b0 _resume_from_idle+0xf8() ]
  fffffe8000d3e5e0 swtch+0x12a()
  fffffe8000d3e600 cv_wait+0x68()
  fffffe8000d3e640 txg_wait_open+0x73()
  fffffe8000d3e670 dmu_tx_wait+0xc5()
  fffffe8000d3e6a0 dmu_tx_assign+0x38()
  fffffe8000d3e700 dmu_free_long_range_impl+0xe6()
  fffffe8000d3e740 dmu_free_long_range+0x65()
  fffffe8000d3e790 zfs_trunc+0x77()
  fffffe8000d3e7e0 zfs_freesp+0x66()
  fffffe8000d3e830 zfs_space+0xa9()
  fffffe8000d3e850 zfs_shim_space+0x15()
  fffffe8000d3e890 fop_space+0x2e()
  fffffe8000d3e910 zfs_replay_truncate+0xa8()
  fffffe8000d3e9b0 zil_replay_log_record+0x1ec()
  fffffe8000d3eab0 zil_parse+0x2ff()
  fffffe8000d3eb30 zil_replay+0xde()
  fffffe8000d3eb50 zfsvfs_setup+0x93()
  fffffe8000d3ebc0 zfs_domount+0x2e4()
  fffffe8000d3ecc0 zfs_mount+0x15d()
  fffffe8000d3ecd0 fsop_mount+0xa()
  fffffe8000d3ee00 domount+0x4d7()
  fffffe8000d3ee80 mount+0x105()
  fffffe8000d3eec0 syscall_ap+0x97()
  fffffe8000d3ef10 _sys_sysenter_post_swapgs+0x14b()

The first message in this thread has three files attached with information from truss (tracing "zpool import"), the zdb output, and the entire list of threads taken from 'echo "::threadlist -v" | mdb -k'.

Thanks,
Eduardo Bragatto
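Reading that stack, the mount seems to be stuck replaying a ZIL truncate record for backup/insightiq (zil_replay -> zfs_replay_truncate) and then waiting for a transaction group to open (txg_wait_open). If it helps, I can also dump the kernel's overall view of the pool; a possible command would be the following (hypothetical, I have not run it yet):

    # print the SPA/pool state and vdev tree as the kernel sees it
    echo "::spa -v" | mdb -k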
Hi,

I fixed this problem a couple of weeks ago but haven't found the time to report it until now.

Cindy Swearingen was very kind in contacting me to resolve this issue, and I would like to take this opportunity to express my gratitude to her.

We have not found the root cause of the error. Cindy suspected some known bugs in release 5/09 that have been fixed in 10/09, but we could not confirm that as the real cause of the problem.

Anyway, I went ahead and re-installed the operating system with the latest Solaris release (10/09), and "zpool import" worked as if there had been nothing wrong. I have scrubbed the pool and no errors were found. I've been using the system since the OS was re-installed (exactly 10 days now) without any problems.

If you find yourself in a situation where "zpool import" hangs and never finishes because it gets stuck mounting some of the ZFS filesystems, make sure you try to import the pool on the newest stable release before wasting too much time debugging the problem.

Thanks,
Eduardo Bragatto.
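For anyone who ends up in the same situation, the verification I did after the re-install boils down to the commands below (reconstructed from memory, with my pool name, so take it as a sketch):

    zpool import backup          # on the freshly installed OS this completed normally
    zpool scrub backup
    zpool status -v backup       # wait until the scrub completes with 0 errors
    zfs list -r backup | wc -l   # sanity-check that all the filesystems are back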