Hernan Freschi
2008-May-23 14:18 UTC
[zfs-discuss] help with a BIG problem, can't import my zpool anymore
Hello, I'm having a big problem here, disastrous maybe. I have a zpool consisting of 4x500GB SATA drives. This pool was born on S10U4 and was recently upgraded to snv_85 because of iSCSI issues with some initiator.

Last night I was doing housekeeping, deleting old snapshots. One snapshot failed to delete because it had a dependent clone, so I tried to destroy that clone. Everything went wrong from there. The deletion was taking an excessively long time (over 40 minutes). zpool status hangs when I call it; zfs list too. zpool iostat showed disk activity. Other services not dependent on the pool were running, but the iSCSI this machine was serving was unbearably slow. At one point I lost iSCSI, SSH, web, and all other services. Ping still worked.

So I go to the server and notice that the fans are running at 100%. I try to get a console (local VGA + keyboard) but the monitor shows "no signal". No disk activity seemed to be happening at that moment. So I do the standard procedure (reboot). Solaris boots but stops at "hostname: blah". I see disk activity from the pool disks, so I let it boot. 30 minutes later it still hadn't finished. I thought (correctly) that the system was waiting to mount the ZFS filesystems before booting, but for some reason it never does. I call it a day and let the machine do its thing.

8 hours later I return. CPU is cold, disks are idle and... Solaris is still sitting at the same "hostname: blah". Time to reboot again, this time in failsafe mode. zpool import shows that the devices are detected and online. I delete /etc/zfs/zpool.cache and reboot. Solaris starts normally with all services running, but of course no ZFS. zpool import shows the available pool, no errors. I do zpool import -f on the pool... 20 minutes later I'm still waiting for the pool to mount. zpool iostat shows activity:

               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
tera        1.51T   312G    274      0  1.61M  2.91K
tera        1.51T   312G    308      0  1.82M      0
tera        1.51T   312G    392      0  2.31M      0
tera        1.51T   312G    468      0  2.75M      0

but the mountpoint /tera is still not populated (and zpool import still doesn't exit). zpool status shows:

  pool: tera
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tera        ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c1d0    ONLINE       0     0     0
            c2d0    ONLINE       0     0     0
            c3d0    ONLINE       0     0     0
            c4d0    ONLINE       0     0     0

errors: No known data errors

What's going on? Why is it taking so long to import?

Thanks in advance,
Hernan
Hernan Freschi
2008-May-23 17:08 UTC
[zfs-discuss] help with a BIG problem, can't import my zpool anymore
I got more info. I can run zpool history and this is what I get:

2008-05-23.00:29:40 zfs destroy tera/sparsevol@20080522
2008-05-23.00:29:47 [internal destroy_begin_sync txg:3890809] dataset = 152
2008-05-23.01:28:38 [internal destroy_begin_sync txg:3891101] dataset = 152
2008-05-23.07:01:36 zpool import -f tera
2008-05-23.07:01:40 [internal destroy_begin_sync txg:3891106] dataset = 152
2008-05-23.10:52:56 zpool import -f tera
2008-05-23.10:52:58 [internal destroy_begin_sync txg:3891112] dataset = 152
2008-05-23.12:17:49 [internal destroy_begin_sync txg:3891114] dataset = 152
2008-05-23.12:27:48 zpool import -f tera
2008-05-23.12:27:50 [internal destroy_begin_sync txg:3891120] dataset = 152
2008-05-23.13:03:07 [internal destroy_begin_sync txg:3891122] dataset = 152
2008-05-23.13:56:52 zpool import -f tera
2008-05-23.13:56:54 [internal destroy_begin_sync txg:3891128] dataset = 152

Apparently it starts destroying dataset #152 on every import, which is the parent snapshot of the clone I issued the command to destroy. I'm not sure how it works, but I ordered the deletion of the CLONE, not the snapshot (which I was going to destroy anyway). The question is still: why does it hang the machine? Why can't I access the filesystems? Isn't it supposed to import the zpool, mount the ZFS filesystems and then do the destroy in the background?
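For reference, the bracketed [internal ...] entries above are the events that zpool history only prints when asked for internal/long-format output; a minimal invocation for this pool (pool name assumed from the thread) would be:

  zpool history -il tera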
Hernan Freschi
2008-May-23 23:03 UTC
[zfs-discuss] help with a BIG problem, can't import my zpool anymore
I let it run for about 4 hours. When I returned, still the same: I can ping the machine but I can't SSH to it or use the console. Please, I need urgent help with this issue!
Hernan Freschi
2008-May-24 00:42 UTC
[zfs-discuss] help with a BIG problem, can't import my zpool anymore
I let it run while watching top, and this is what I got just before it hung. Look at free mem. Is this memory allocated to the kernel? Can I allow the kernel to swap?

last pid:  7126;  load avg:  3.36, 1.78, 1.11;  up 0+01:01:11    21:16:49
88 processes: 78 sleeping, 9 running, 1 on cpu
CPU states: 22.4% idle, 0.4% user, 77.2% kernel, 0.0% iowait, 0.0% swap
Memory: 3072M phys mem, 31M free mem, 2055M swap, 1993M free swap

   PID USERNAME LWP PRI NICE  SIZE   RES STATE   TIME   CPU COMMAND
  7126 root       9  58    0   45M 4188K run     0:00  0.71% java
  4821 root       1  59    0 5124K 1724K run     0:03  0.46% zfs
  5096 root       1  59    0 5124K 1724K run     0:03  0.45% zfs
  2470 root       1  59    0 4956K 1660K sleep   0:06  0.45% zfs
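Since the question is whether that missing memory is held by the kernel, one standard way to get a breakdown (a stock Solaris mdb dcmd, run as root; not something captured in this thread) is:

  echo ::memstat | mdb -k

which splits physical memory into kernel, anon, exec/libs, page cache and free pages.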
Rob at Logan.com
2008-May-24 01:10 UTC
[zfs-discuss] help with a BIG problem, can't import my zpool anymore
> Memory: 3072M phys mem, 31M free mem, 2055M swap, 1993M free swap

perhaps this might help..

  mkfile -n 4g /usr/swap
  swap -a /usr/swap

http://blogs.sun.com/realneel/entry/zfs_arc_statistics

Rob
I forgot to post arcstat.pl's output:

    Time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c
22:32:37  556K  525K     94  515K   94    9K   98  515K   97     1G    1G
22:32:38    63    63    100    63  100     0    0    63  100     1G    1G
22:32:39    74    74    100    74  100     0    0    74  100     1G    1G
22:32:40    76    76    100    76  100     0    0    76  100     1G    1G
<State Changed>
22:32:41    75    75    100    75  100     0    0    75  100     1G    1G
22:32:42    77    77    100    77  100     0    0    77  100     1G    1G
22:32:43    72    72    100    72  100     0    0    72  100     1G    1G
22:32:44    80    80    100    80  100     0    0    80  100     1G    1G
<State Changed>
22:32:45    98    98    100    98  100     0    0    98  100     1G    1G

Sometimes "c" is 2G.

I tried the mkfile and swap, but I get:

[root@solaris:/]# mkfile -n 4g /export/swap
[root@solaris:/]# swap -a /export/swap
"/export/swap" may contain holes - can't swap on it.

/export is the only place where I have enough free space. I could add another drive if needed.
Hernan Freschi wrote:
> I tried the mkfile and swap, but I get:
> [root@solaris:/]# mkfile -n 4g /export/swap
> [root@solaris:/]# swap -a /export/swap
> "/export/swap" may contain holes - can't swap on it.

You should not use -n when creating files for additional swap. This is mentioned in the mkfile man page.

Wbr,
Victor
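So the same sequence without -n should be accepted; a minimal sketch, assuming /export really has 4GB of free space:

  mkfile 4g /export/swap       # writes out the full 4GB, so it takes a while
  swap -a /export/swap
  swap -l                      # verify the new swap file is listed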
Hi, Hernan,

You should not use '-n' with mkfile; that will make swap complain.

Hernan Freschi wrote:
> I tried the mkfile and swap, but I get:
> [root@solaris:/]# mkfile -n 4g /export/swap
> [root@solaris:/]# swap -a /export/swap
> "/export/swap" may contain holes - can't swap on it.
>
> /export is the only place where I have enough free space. I could add another drive if needed.
Oops, replied too fast. I ran it without -n, and the swap space was added successfully... but it didn't work. It died out of memory again.
Mattias Pantzare
2008-May-24 05:43 UTC
[zfs-discuss] help with a BIG problem, can't import my zpool anymore
2008/5/24 Hernan Freschi <hjf at hjf.com.ar>:
> I let it run while watching top, and this is what I got just before it hung. Look at free mem. Is this memory allocated to the kernel? Can I allow the kernel to swap?

No, the kernel will not use swap for this. But most of the memory used by the kernel is probably in caches that should release memory when needed.

Is this a 32- or 64-bit system? ZFS will sometimes use all kernel address space on a 32-bit system. You can give the kernel more address space with this command (only on a 32-bit system):

  eeprom kernelbase=0x50000000
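(For what it's worth, a quick way to check which kernel is running — a standard Solaris command, not taken from this thread:)

  isainfo -kv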
No, this is a 64-bit system (athlon64) with 64-bit kernel of course.
So, I think I've narrowed it down to two things:

* ZFS tries to destroy the dataset every time the pool is imported, because the last time it didn't finish destroying it.
* In the process, ZFS makes the kernel run out of memory and die.

So I thought of two options, but I'm not sure if I'm right:

Option 1: "destroy" is an atomic operation. If destroy is atomic, then I guess what it's trying to do is look up all the blocks that need to be deleted/unlinked/released/freed (not sure which is the word). After it has that list, it will write it to the ZIL (remember, this is just what I suppose; correct me if I'm wrong!) and start to physically delete the blocks, until the operation is done and finally committed. If this is the case, then the process restarts from scratch every time the system is rebooted. But I read that apparently, in previous versions, rebooting while a clone destroy was taking too long made the clone reappear intact the next time. This, and the fact that zpool iostat shows only reads and no or very few writes, is what leads me to think this is how it works. If so, I'd like to abort this destroy. After importing the pool I would have everything as it was; maybe I could then delete the snapshots before the clone's parent snapshot, which might speed up the destroy, or I could just leave the clone alone.

Option 2: destroy is not atomic. By this I don't mean "not atomic" as in "if the operation is cancelled it will finish in an incomplete state", but as in "if the system is rebooted, the operation will RESUME at the point it was when it died". If this is the case, maybe I can write a script to reboot the computer after a fixed amount of time and run it on boot, roughly (a fuller shell sketch appears after this message):

  zpool import xx &
  sleep 20
  rm /etc/zfs/zpool.cache
  sleep 1800
  reboot

This would work under the assumption that the list of blocks to be deleted is flushed to the ZIL or somewhere before the reboot, allowing the operation to restart at the same point. This is a very nasty hack, but it may do the trick, only in a very slow fashion: zpool iostat shows 1MB/s read while it's doing the destroy. The dataset in question has 450GB, which means the operation will take 5 days to finish if it needs to read the whole dataset to destroy it, or 7 days if it also needs to go through the other snapshots (600GB total).

So my only viable option seems to be to "abort" this. How can I do that? Disable the ZIL, maybe? Delete the ZIL? Scrub after this?

Thanks,
Hernán
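A literal shell rendering of that reboot-loop idea, strictly as a sketch: it assumes the destroy really does resume across imports (exactly what is in doubt here), and the pool name, timings and cache-file handling are taken straight from the pseudocode above.

  #!/bin/sh
  # Hack: import the pool, let the stuck destroy run for ~30 minutes,
  # then reboot before the kernel runs out of memory.
  POOL=tera                       # assumption: the pool from this thread
  zpool import -f "$POOL" &       # this import blocks while the destroy runs
  sleep 20
  rm -f /etc/zfs/zpool.cache      # so the next boot doesn't hang trying to mount it
  sleep 1800                      # let the destroy make what progress it can
  reboot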
Robert Milkowski
2008-May-25 17:02 UTC
[zfs-discuss] help with a BIG problem, can't import my zpool anymore
Hello Hernan,

Friday, May 23, 2008, 6:08:34 PM, you wrote:

HF> The question is still, why does it hang the machine? Why can't I
HF> access the filesystems? Isn't it supposed to import the zpool,
HF> mount the ZFSs and then do the destroy, in background?

Try to limit the ARC size to 1/4 of your physical memory size via /etc/system and reboot the box. Then wait and see if it finishes and if it still freezes the box.

--
Best regards,
Robert Milkowski            mailto:milek at task.gda.pl
                            http://milek.blogspot.com
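A minimal /etc/system sketch of that suggestion for a 3GB box (the exact value is a judgment call; the setting only takes effect after a reboot):

  * cap the ZFS ARC at 768MB, roughly 1/4 of 3GB of RAM
  set zfs:zfs_arc_max = 0x30000000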
Hello, thanks for your suggestion. I tried setting zfs_arc_max to 0x30000000 (768MB, out of 3GB). The system ran for almost 45 minutes before it froze. Here's an interesting piece of arcstat.pl output, which I noticed just as it was passing by:

    Time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c
15:17:41   152   152    100   152  100     0    0   152  100     2G  805M
15:17:42   139   139    100   139  100     0    0   139  100     2G  805M
<State Changed>
15:17:43   188   188    100   188  100     0    0   188  100     2G  805M
15:17:44   150   150    100   150  100     0    0   150  100     2G  805M
15:17:45   151   151    100   151  100     0    0   151  100     2G  805M
15:17:46   149   149    100   149  100     0    0   149  100     2G  805M
15:17:47   161   161    100   161  100     0    0   161  100     2G  805M
15:17:48   153   153    100   153  100     0    0   153  100     2G  219M
15:17:49   140   140    100   140  100     0    0   140  100     2G  100M
15:17:50   143   143    100   143  100     0    0   143  100     2G  100M
15:17:51   145   145    100   145  100     0    0   145  100     2G  100M

Notice how "c" suddenly drops from 805M to 100M in 2 seconds. Also, "arcsz" is 2G, which is weird because it shouldn't grow beyond 0x30000000 (768M), right? And it's also weird to get a 100% miss ratio.

Here's top just before it froze:

last pid:  5253;  load avg:  0.47, 0.37, 0.33;  up 0+00:44:53    15:20:14
77 processes: 75 sleeping, 1 running, 1 on cpu
CPU states: 57.5% idle, 1.0% user, 41.6% kernel, 0.0% iowait, 0.0% swap
Memory: 3072M phys mem, 28M free mem, 2055M swap, 1994M free swap

   PID USERNAME LWP PRI NICE  SIZE   RES STATE   TIME   CPU COMMAND
  1248 root       1  59    0 5940K 2736K sleep   0:14  0.82% arcstat.pl
  5206 root       9  59    0   47M 4892K sleep   0:01  0.35% java
   855 root       2  59    0 5076K 1588K sleep   0:09  0.33% apcupsd
  3134 root       1  59    0 5152K 1764K sleep   0:02  0.26% zpool
  1261 root       1  59    0 4104K  588K cpu     0:03  0.22% top
  3125 root       1  59    0 6352K 1536K sleep   0:00  0.06% sshd
  1151 root       1  59    0 6352K 1504K sleep   0:00  0.05% sshd
    62 root       1  59    0 1832K  540K sleep   0:01  0.05% powernowd
   849 root       1  59    0   11M 1100K sleep   0:00  0.05% snmpd
   465 proxy      1  59    0   15M 2196K run     0:00  0.04% squid
   271 daemon     1  59    0 6652K  264K sleep   0:00  0.03% rcapd
  1252 root       1  59    0 6352K 1292K sleep   0:00  0.02% sshd
     7 root      14  59    0   12M 5412K sleep   0:04  0.02% svc.startd
   880 root       1  59    0 6276K 2076K sleep   0:00  0.02% httpd
   847 root       1  59    0 2436K 1148K sleep   0:00  0.02% dhcpagent

And finally, zpool iostat 1:

tera        1.51T   312G    207      0  1.22M      0
tera        1.51T   312G    141      0   854K      0
tera        1.51T   312G     70      0   427K      0
tera        1.51T   312G    204      0  1.20M      0
tera        1.51T   312G    187      0  1.10M      0
tera        1.51T   312G    179      0  1.05M      0
tera        1.51T   312G    120      0   743K      0
tera        1.51T   312G     94      0   580K      0
tera        1.51T   312G     77      0   471K      0
tera        1.51T   312G    115      0   696K      0

which shows very poor read performance for a 4x SATA2 array (this array usually saturates my gigabit ethernet). And it's not that the kernel is processing that much data, because the CPU is 57% idle and I THINK powernowd is making it run at 900MHz.

Hernán
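One way to cross-check what the ARC itself reports for its size and cap, independent of arcstat.pl (standard arcstats kstat names; not output captured from this machine):

  kstat -p zfs:0:arcstats:size zfs:0:arcstats:c zfs:0:arcstats:c_max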
So, does anyone have any ideas? I'm obviously hitting a bug here. I'm happy to help anyone solve this; I DESPERATELY need this data. I can post dtrace results if you send the scripts to me. I wish I could solve this myself, but I'm not a C programmer, and I don't know how to program filesystems, much less an advanced filesystem like ZFS.

Also, this is not a weird configuration, or alpha software, or some hack I'm making. This is a "production" server with proven software, on which I executed a regular maintenance operation, just a clone destroy, which should take just a few seconds, and I found so many wrong things:

* a clone destroy takes more than a few seconds or minutes
* to destroy the clone it apparently tries to read through the whole dataset
* while reading through the whole dataset, it does so at 1MB/s (dataset size is 450GB without snapshots)
* it loads these reads into the ARC, filling it up and freezing the machine
* the ARC doesn't help in this case, as all reads have a 100% miss ratio
* it doesn't respect the ARC maximum size
* after a reboot, it won't work in the background; it keeps the pool locked, preventing mounts
* disabling the ZIL doesn't prevent zpool from trying to complete the destroy

This is very, very wrong.

I don't know anything about dtrace. I wrote this; I don't really know if it's correct. I hope it's of use to someone.

#!/usr/sbin/dtrace -s

fbt:zfs::
{
        @count_table[probefunc] = count();
}

tick-300s
{
        exit(0);
}

The output for the 5-minute run:

  dnode_buf_pageout                  6
  zfs_sync                          12
  dmu_zfetch_rele                   18
  dnode_destroy                     18
  bplist_close                      78
  bplist_open                       78
  bplist_sync                       78
  dbuf_hold                         78
  dmu_buf_get_user                  78
  dmu_object_info                   78
  dmu_object_info_from_dnode        78
  dmu_zfetch                        78
  dmu_zfetch_find                   78
  dnode_hold                        78
  dnode_hold_impl                   78
  dsl_pool_sync                     78
  dsl_pool_zil_clean                78
  spa_async_dispatch                78
  spa_errlog_sync                   78
  spa_scrub_resume                  78
  spa_scrub_suspend                 78
  spa_sync                          78
  spa_sync_config_object            78
  txg_quiesce                       78
  uberblock_update                  78
  vdev_config_sync                  78
  dmu_tx_commit                    156
  dmu_tx_create_assigned           156
  dmu_tx_create_dd                 156
  spa_sync_aux_dev                 156
  txg_list_empty                   234
  arc_do_user_evicts               368
  txg_list_remove                  390
  txg_thread_wait                  390
  add_reference                    486
  arc_buf_add_ref                  486
  spa_config_enter                 566
  spa_config_exit                  566
  zio_null                         566
  zio_root                         566
  zio_wait                         566
  vdev_queue_agg_io_done          1954
  dbuf_hold_level                62208
  dmu_tx_check_ioerr             62208
  dnode_next_offset              62208
  dnode_next_offset_level        62208
  dbuf_whichblock                62286
  dbuf_do_evict                  62436
  dbuf_hash_remove               62436
  dbuf_clear                     62440
  dbuf_destroy                   62440
  dbuf_evict                     62440
  dbuf_evict_user                62440
  arc_space_return               62462
  arc_cksum_compute              62510
  arc_read_done                  62510
  arc_set_callback               62510
  dbuf_read_done                 62510
  dbuf_set_data                  62510
  fletcher_4_native              62510
  lzjb_decompress                62510
  vdev_mirror_child_done         62510
  vdev_mirror_io_done            62510
  vdev_mirror_map_free           62510
  vdev_raidz_io_done             62510
  vdev_raidz_map_free            62510
  zio_checksum_error             62510
  zio_checksum_verified          62510
  zio_decompress_data            62510
  zio_read_decompress            62510
  dnode_rele                     62514
  arc_buf_destroy                62694
  arc_cksum_verify               62694
  arc_evict                      62694
  arc_adapt                      62696
  arc_buf_alloc                  62696
  arc_evict_needed               62696
  arc_get_data_buf               62696
  arc_read                       62696
  buf_hash_insert                62696
  dbuf_create                    62696
  dbuf_findbp                    62696
  dbuf_hash_insert               62696
  dbuf_read_impl                 62696
  hdr_cons                       62696
  spa_get_failmode               62696
  vdev_mirror_child_select       62696
  vdev_mirror_io_start           62696
  vdev_mirror_map_alloc          62696
  vdev_raidz_io_start            62696
  vdev_raidz_map_alloc           62696
  zio_read                       62696
  zio_read_init                  62696
  arc_space_consume              62698
  spa_get_random                 62774
  arc_access                     62996
  arc_buf_freeze                 63002
  arc_buf_remove_ref             63002
  remove_reference               63002
  arc_reclaim_needed             63080
  zio_ready                      63262
  zio_wait_for_children_ready    63262
  arc_change_state              125204
  buf_hash_find                 125206
  vdev_lookup_top               125392
  dbuf_add_ref                  125470
  arc_released                  125512
  vdev_cache_fill               153525
  vdev_cache_evict              154090
  zio_vdev_io_reissue           155192
  vdev_disk_io_intr             163696
  zio_interrupt                 163696
  vdev_disk_io_done             163708
  vdev_queue_io_done            163708
  vdev_error_inject             163924
  vdev_queue_io_remove          166100
  vdev_cache_allocate           166298
  vdev_queue_io_add             166434
  vdev_cache_hit                175179
  zio_vdev_io_bypass            179874
  dbuf_find                     187190
  dbuf_hold_impl                187190
  dbuf_read                     187190
  vdev_raidz_child_done         187536
  dbuf_update_data              249700
  zio_buf_free                  281068
  zio_buf_alloc                 281438
  zio_vdev_io_done              288728
  dbuf_rele                     312214
  dbuf_hash                     312322
  vdev_dtl_contains             313480
  buf_hash                      314084
  vdev_queue_io                 323580
  vdev_stat_update              405526
  zio_vdev_child_io             406828
  vdev_readable                 414708
  vdev_queue_io_to_issue        437894
  zio_should_retry              468036
  zio_vdev_io_assess            468036
  zio_done                      468601
  zio_assess                    468602
  zio_clear_transform_stack     468602
  zio_nowait                    469524
  zio_create                    470090
  vdev_is_dead                  477404
  vdev_cache_read               499324
  vdev_disk_io_start            499324
  zio_notify_parent             530730
  zio_pop_transform             531112
  zio_push_transform            532786
  zio_wait_for_children_done    593994
  spa_state                     624716
  zio_vdev_io_start             624716
  zio_wait_for_children         657256
  vdev_queue_deadline_compare   906304
  zio_execute                  1247615
  vdev_queue_offset_compare    1694848
  vdev_cache_lastused_compare  2472113
  vdev_cache_offset_compare    3076216
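If the goal is just a per-function call count inside the zfs module, a slightly tighter one-liner (a sketch only: it counts entry probes so each call is counted once rather than once on entry and once on return, and it runs for 60 seconds) would be:

  dtrace -n 'fbt:zfs:::entry { @calls[probefunc] = count(); } tick-60s { exit(0); }'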
it would be helpful to know what's filling your ram..

  echo ::kmastat | mdb -k
> vdev_cache_offset_compare    3076216

vdev_cache changed to metadata-only in snv_70:
http://bugs.opensolaris.org/view_bug.do?bug_id=6437054

but without knowing what's using the kmem we are kinda guessing.

Rob
I'm using snv_85, upgraded from S10U4. Here is echo ::kmastat | mdb -k:

After 4 minutes of uptime (when it's OK): http://pastebin.com/f112d1cae
After 40 minutes (when free memory starts dropping): http://pastebin.com/f7076d96f
Is there a way to know the progress of the process? I obviously don't have any choice other than to let it do whatever it's doing and hope that it will finish soon (according to my calculations, within a week). So far the only thing I'm doing is letting it work and rebooting every 20 minutes, before memory fills up.