Hernan Freschi
2008-May-23 14:18 UTC
[zfs-discuss] help with a BIG problem, can't import my zpool anymore
Hello, I'm having a big problem here, possibly a disastrous one.
I have a zpool consisting of 4x 500GB SATA drives. This pool was born on S10U4
and was recently upgraded to snv_85 because of iSCSI issues with a particular initiator.
Last night I was doing housekeeping, deleting old snapshots. One snapshot failed
to delete because it had a dependent clone, so I tried to destroy that clone.
Everything went wrong from there.
The deletion was taking an excessively long time (over 40 minutes). zpool status
hung when I called it, and so did zfs list. zpool iostat showed disk activity. Other
services not dependent on the pool were still running, but the iSCSI service this
machine was providing was unbearably slow.
At one point I lost iSCSI, SSH, web, and all other services. Ping still
worked. So I went to the server and noticed that the fans were running at 100%. I
tried to get a console (local VGA + keyboard), but the monitor showed "no
signal". No disk activity seemed to be happening at that point. So I did
the standard procedure (reboot). Solaris booted but stopped at "hostname:
blah". I saw disk activity on the pool disks, so I let it keep going. Thirty
minutes later it still hadn't finished. I figured (correctly) that the
system was waiting to mount the ZFS filesystems before finishing the boot, but for some
reason it never did. I called it a day and let the machine do its thing.
Eight hours later I returned. The CPU was cold, the disks were idle and... Solaris was
still sitting at the same "hostname: blah". Time to reboot again, this time in failsafe mode.
zpool import showed that the devices were detected and online. I deleted
/etc/zfs/zpool.cache and rebooted. Solaris started normally with all services
running, but of course no ZFS filesystems. zpool import showed the pool available, with no errors.
I ran zpool import -f tera... 20 minutes later I was still waiting for the
pool to mount. zpool iostat showed activity:
capacity operations bandwidth
pool used avail read write read write
---------- ----- ----- ----- ----- ----- -----
tera 1.51T 312G 274 0 1.61M 2.91K
tera 1.51T 312G 308 0 1.82M 0
tera 1.51T 312G 392 0 2.31M 0
tera 1.51T 312G 468 0 2.75M 0
but the mountpoint /tera is still not populated (and zpool import still
doesn't exit).
zpool status shows:
pool: tera
state: ONLINE
scrub: none requested
config:
        NAME        STATE     READ WRITE CKSUM
        tera        ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c1d0    ONLINE       0     0     0
            c2d0    ONLINE       0     0     0
            c3d0    ONLINE       0     0     0
            c4d0    ONLINE       0     0     0
errors: No known data errors
What's going on? Why is it taking so long to import?
Thanks in advance,
Hernan
Hernan Freschi
2008-May-23 17:08 UTC
[zfs-discuss] help with a BIG problem, can't import my zpool anymore
I got some more info. I can run zpool history, and this is what I get:

2008-05-23.00:29:40 zfs destroy tera/sparsevol@20080522
2008-05-23.00:29:47 [internal destroy_begin_sync txg:3890809] dataset = 152
2008-05-23.01:28:38 [internal destroy_begin_sync txg:3891101] dataset = 152
2008-05-23.07:01:36 zpool import -f tera
2008-05-23.07:01:40 [internal destroy_begin_sync txg:3891106] dataset = 152
2008-05-23.10:52:56 zpool import -f tera
2008-05-23.10:52:58 [internal destroy_begin_sync txg:3891112] dataset = 152
2008-05-23.12:17:49 [internal destroy_begin_sync txg:3891114] dataset = 152
2008-05-23.12:27:48 zpool import -f tera
2008-05-23.12:27:50 [internal destroy_begin_sync txg:3891120] dataset = 152
2008-05-23.13:03:07 [internal destroy_begin_sync txg:3891122] dataset = 152
2008-05-23.13:56:52 zpool import -f tera
2008-05-23.13:56:54 [internal destroy_begin_sync txg:3891128] dataset = 152

Apparently it restarts destroying dataset #152 on every import. That dataset is the parent snapshot of the clone I issued the destroy command for. I'm not sure how this works internally, but I ordered the deletion of the CLONE, not the snapshot (which I was going to destroy anyway).

The question is still: why does it hang the machine? Why can't I access the filesystems? Isn't it supposed to import the zpool, mount the ZFS filesystems, and then do the destroy in the background?
Hernan Freschi
2008-May-23 23:03 UTC
[zfs-discuss] help with a BIG problem, can't import my zpool anymore
I let it run for about 4 hours. When I returned, still the same: I can ping the machine but I can't SSH to it or use the console. Please, I need urgent help with this issue!
Hernan Freschi
2008-May-24 00:42 UTC
[zfs-discuss] help with a BIG problem, can't import my zpool anymore
I let it run while watching top, and this is what I got just before it hung. Look at free mem. Is this memory allocated to the kernel? Can I allow the kernel to swap?

last pid:  7126;  load avg:  3.36,  1.78,  1.11;  up 0+01:01:11   21:16:49
88 processes: 78 sleeping, 9 running, 1 on cpu
CPU states: 22.4% idle,  0.4% user, 77.2% kernel,  0.0% iowait,  0.0% swap
Memory: 3072M phys mem, 31M free mem, 2055M swap, 1993M free swap

   PID USERNAME LWP PRI NICE  SIZE   RES STATE    TIME    CPU COMMAND
  7126 root       9  58    0   45M 4188K run      0:00  0.71% java
  4821 root       1  59    0 5124K 1724K run      0:03  0.46% zfs
  5096 root       1  59    0 5124K 1724K run      0:03  0.45% zfs
  2470 root       1  59    0 4956K 1660K sleep    0:06  0.45% zfs
Rob at Logan.com
2008-May-24 01:10 UTC
[zfs-discuss] help with a BIG problem, can't import my zpool anymore
> Memory: 3072M phys mem, 31M free mem, 2055M swap, 1993M free swap

Perhaps this might help:

mkfile -n 4g /usr/swap
swap -a /usr/swap

http://blogs.sun.com/realneel/entry/zfs_arc_statistics

Rob
I forgot to post arcstat.pl's output:
Time read miss miss% dmis dm% pmis pm% mmis mm% arcsz c
22:32:37 556K 525K 94 515K 94 9K 98 515K 97 1G 1G
22:32:38 63 63 100 63 100 0 0 63 100 1G 1G
22:32:39 74 74 100 74 100 0 0 74 100 1G 1G
22:32:40 76 76 100 76 100 0 0 76 100 1G 1G
<State Changed>
22:32:41 75 75 100 75 100 0 0 75 100 1G 1G
22:32:42 77 77 100 77 100 0 0 77 100 1G 1G
22:32:43 72 72 100 72 100 0 0 72 100 1G 1G
22:32:44 80 80 100 80 100 0 0 80 100 1G 1G
<State Changed>
22:32:45 98 98 100 98 100 0 0 98 100 1G 1G
sometimes "c" is 2G.
I tried the mkfile and swap, but I get:
[root@solaris:/]# mkfile -n 4g /export/swap
[root@solaris:/]# swap -a /export/swap
"/export/swap" may contain holes - can't swap on it.
/export is the only place where I have enough free space. I could add another
drive if needed.
Hernan Freschi wrote:
> I tried the mkfile and swap, but I get:
> [root@solaris:/]# mkfile -n 4g /export/swap
> [root@solaris:/]# swap -a /export/swap
> "/export/swap" may contain holes - can't swap on it.

You should not use -n when creating files for additional swap. This is mentioned in the mkfile man page.

Wbr,
Victor
Hi, Hernan,

You should not pass '-n' to mkfile; that is what makes swap complain.

Hernan Freschi wrote:
> I tried the mkfile and swap, but I get:
> [root@solaris:/]# mkfile -n 4g /export/swap
> [root@solaris:/]# swap -a /export/swap
> "/export/swap" may contain holes - can't swap on it.
>
> /export is the only place where I have enough free space. I could add another drive if needed.
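For reference, a minimal sketch of the corrected sequence (assuming /export really has 4 GB free; without -n, mkfile preallocates the blocks, so swap can use the file):

# preallocate a 4 GB swap file (no -n, so the blocks are actually written)
mkfile 4g /export/swap
# add it as swap space and verify
swap -a /export/swap
swap -l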
Oops, replied too fast. I ran mkfile without -n and the swap space was added successfully... but it didn't help. The machine died out of memory again.
Mattias Pantzare
2008-May-24 05:43 UTC
[zfs-discuss] help with a BIG problem, can't import my zpool anymore
2008/5/24 Hernan Freschi <hjf at hjf.com.ar>:
> I let it run while watching top, and this is what I got just before it hung. Look at free mem. Is this memory allocated to the kernel? Can I allow the kernel to swap?

No, the kernel will not use swap for this. But most of the memory used by the kernel is probably in caches that should release memory when needed.

Is this a 32- or 64-bit system? ZFS will sometimes use all kernel address space on a 32-bit system. You can give the kernel more address space with this command (32-bit systems only):

eeprom kernelbase=0x50000000
No, this is a 64-bit system (Athlon 64) running the 64-bit kernel, of course.
So, I think I've narrowed it down to two things:
* ZFS tries to destroy the dataset every time the pool is imported, because the destroy didn't finish the last time
* In this process, ZFS makes the kernel run out of memory and die

So I thought of two options, but I'm not sure if I'm right:

Option 1: "Destroy" is an atomic operation
If destroy is atomic, then I guess what it's trying to do is look up all the blocks that need to be deleted/unlinked/released/freed (not sure which is the right word). After it has that list, it writes it to the ZIL (remember, this is just what I suppose, correct me if I'm wrong!) and starts to physically delete the blocks until the operation is done and finally committed. If this is the case, then the process restarts from scratch every time the system is rebooted. But I read that, apparently, in previous versions rebooting while a clone destroy was taking too long made the clone reappear intact the next time. This, and the fact that zpool iostat shows only reads and no (or very few) writes, is what leads me to think this is how it works. If so, I'd like to abort this destroy. After importing the pool I would have everything as it was, and maybe I could delete the snapshots older than the clone's parent snapshot to speed up the destroy, or just keep the clone.

Option 2: Destroy is "not" atomic
By this I don't mean "not atomic" as in "if the operation is cancelled, it finishes in an incomplete state", but as in "if the system is rebooted, the operation resumes at the point where it died". If this is the case, maybe I can write a script to reboot the computer after a fixed amount of time and run it on boot:

zpool import xx &
sleep 20 seconds
rm /etc/zfs/zpool.cache
sleep 1800 seconds
reboot

This would work under the assumption that the list of blocks to be deleted is flushed to the ZIL (or somewhere) before the reboot, so the operation can resume at the same point. This is a very nasty hack, and it would only do the trick very slowly: zpool iostat shows 1 MB/s of reads while it's doing the destroy. The dataset in question has 450 GB, which means the operation will take 5 days to finish if it needs to read the whole dataset to destroy it, or 7 days if it also needs to go through the other snapshots (600 GB total).

So my only viable option seems to be to "abort" this. How can I do that? Disable the ZIL, maybe? Delete the ZIL? Scrub after this?

Thanks,
Hernán
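To make the idea concrete, here is a minimal sketch of that reboot-loop hack as an actual script (purely hypothetical, built only from the steps described above; the pool name tera and the 20-second/30-minute delays are taken from this thread, and whether the destroy really resumes where it left off is exactly the open question):

#!/bin/sh
# Hypothetical watchdog: start the import in the background, remove
# zpool.cache so a hung import cannot block the next boot, then
# reboot before the kernel runs out of memory.
zpool import -f tera &
sleep 20
rm -f /etc/zfs/zpool.cache
sleep 1800
reboot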
Robert Milkowski
2008-May-25 17:02 UTC
[zfs-discuss] help with a BIG problem, can't import my zpool anymore
Hello Hernan,
Friday, May 23, 2008, 6:08:34 PM, you wrote:
HF> The question is still, why does it hang the machine? Why can't I
HF> access the filesystems? Isn't it supposed to import the zpool,
HF> mount the ZFS filesystems and then do the destroy in the background?
HF>
Try to limit ARC size to 1/4 of your physical memory size via
/etc/system and reboot the box. Then wait and see if it finishes and
if it still freezes the box.
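A minimal sketch of such an /etc/system entry (assuming the 3 GB of physical memory reported earlier, so roughly a 768 MB cap; zfs_arc_max takes a value in bytes):

* Cap the ZFS ARC at 768 MB (0x30000000 bytes)
set zfs:zfs_arc_max = 0x30000000

The new limit only takes effect after a reboot.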
--
Best regards,
Robert Milkowski mailto:milek at task.gda.pl
http://milek.blogspot.com
Hello, thanks for your suggestion. I tried setting zfs_arc_max to 0x30000000
(768MB, out of 3GB). The system ran for almost 45 minutes before it froze.
Here's an interesting piece of arcstat.pl output, which I noticed just as it
was passing by:
Time read miss miss% dmis dm% pmis pm% mmis mm% arcsz c
15:17:41 152 152 100 152 100 0 0 152 100 2G 805M
15:17:42 139 139 100 139 100 0 0 139 100 2G 805M
<State Changed>
15:17:43 188 188 100 188 100 0 0 188 100 2G 805M
15:17:44 150 150 100 150 100 0 0 150 100 2G 805M
15:17:45 151 151 100 151 100 0 0 151 100 2G 805M
15:17:46 149 149 100 149 100 0 0 149 100 2G 805M
15:17:47 161 161 100 161 100 0 0 161 100 2G 805M
15:17:48 153 153 100 153 100 0 0 153 100 2G 219M
15:17:49 140 140 100 140 100 0 0 140 100 2G 100M
15:17:50 143 143 100 143 100 0 0 143 100 2G 100M
15:17:51 145 145 100 145 100 0 0 145 100 2G 100M
Notice how "c" suddenly drops from 805M to 100M in 2 seconds. Also,
"arcsz" is 2G, which is weird because it shouldn't grow
beyond 0x30000000 (768M), right? And it's also weird to get a 100%
miss ratio.
Here's "top" just before it froze:
last pid:  5253;  load avg:  0.47,  0.37,  0.33;  up 0+00:44:53   15:20:14
77 processes: 75 sleeping, 1 running, 1 on cpu
CPU states: 57.5% idle, 1.0% user, 41.6% kernel, 0.0% iowait, 0.0% swap
Memory: 3072M phys mem, 28M free mem, 2055M swap, 1994M free swap
PID USERNAME LWP PRI NICE SIZE RES STATE TIME CPU COMMAND
1248 root 1 59 0 5940K 2736K sleep 0:14 0.82% arcstat.pl
5206 root 9 59 0 47M 4892K sleep 0:01 0.35% java
855 root 2 59 0 5076K 1588K sleep 0:09 0.33% apcupsd
3134 root 1 59 0 5152K 1764K sleep 0:02 0.26% zpool
1261 root 1 59 0 4104K 588K cpu 0:03 0.22% top
3125 root 1 59 0 6352K 1536K sleep 0:00 0.06% sshd
1151 root 1 59 0 6352K 1504K sleep 0:00 0.05% sshd
62 root 1 59 0 1832K 540K sleep 0:01 0.05% powernowd
849 root 1 59 0 11M 1100K sleep 0:00 0.05% snmpd
465 proxy 1 59 0 15M 2196K run 0:00 0.04% squid
271 daemon 1 59 0 6652K 264K sleep 0:00 0.03% rcapd
1252 root 1 59 0 6352K 1292K sleep 0:00 0.02% sshd
7 root 14 59 0 12M 5412K sleep 0:04 0.02% svc.startd
880 root 1 59 0 6276K 2076K sleep 0:00 0.02% httpd
847 root 1 59 0 2436K 1148K sleep 0:00 0.02% dhcpagent
and finally, zpool iostat 1:
tera 1.51T 312G 207 0 1.22M 0
tera 1.51T 312G 141 0 854K 0
tera 1.51T 312G 70 0 427K 0
tera 1.51T 312G 204 0 1.20M 0
tera 1.51T 312G 187 0 1.10M 0
tera 1.51T 312G 179 0 1.05M 0
tera 1.51T 312G 120 0 743K 0
tera 1.51T 312G 94 0 580K 0
tera 1.51T 312G 77 0 471K 0
tera 1.51T 312G 115 0 696K 0
This shows very poor read performance for a 4x SATA2 array (this array usually
saturates my gigabit ethernet). And it's not that the kernel is
processing that much data, because the CPU is 57% idle and I THINK powernowd is
making it run at 900MHz.
Hernán
So, anyone have any ideas? I'm obviously hitting a bug here.
I'm happy to help anyone solve this; I DESPERATELY need this data. I
can post dtrace results if you send the scripts to me. I wish I could solve this
myself, but I'm not a C programmer, and I don't know how to program
filesystems, much less an advanced filesystem like ZFS.
Also, this is not a weird configuration, or alpha software, or some hack
I'm attempting. This is a "production" server running proven
software, on which I executed a routine maintenance operation, just a clone
destroy, which should take a few seconds, and I found so many things wrong:
* a clone destroy takes far more than a few seconds or minutes
* to destroy the clone, it apparently tries to read through the whole dataset
* while reading through the whole dataset, it does so at 1MB/s (dataset size is
450GB without snapshots)
* it loads these reads into the ARC, filling it up and freezing the machine
* the ARC doesn't help in this case, as all reads have a 100% miss ratio
* it doesn't respect the ARC maximum size
* after a reboot, it won't work in the background; it keeps the
pool locked, preventing mounts
* disabling the ZIL doesn't stop zpool from trying to complete the
destroy
This is very, very wrong.
I don't know anything about dtrace. I wrote this; I don't
really know if it is correct. I hope it's of use to someone.
#!/usr/sbin/dtrace -s

/* Count how many times each function in the zfs kernel module fires.
   (fbt:zfs:: with no probe name matches both entry and return probes.) */
fbt:zfs::
{
        @count_table[probefunc] = count();
}

/* Stop after 5 minutes; the aggregation is printed automatically on exit. */
tick-300s { exit(0); }
The output for the 5 minute run:
dnode_buf_pageout 6
zfs_sync 12
dmu_zfetch_rele 18
dnode_destroy 18
bplist_close 78
bplist_open 78
bplist_sync 78
dbuf_hold 78
dmu_buf_get_user 78
dmu_object_info 78
dmu_object_info_from_dnode 78
dmu_zfetch 78
dmu_zfetch_find 78
dnode_hold 78
dnode_hold_impl 78
dsl_pool_sync 78
dsl_pool_zil_clean 78
spa_async_dispatch 78
spa_errlog_sync 78
spa_scrub_resume 78
spa_scrub_suspend 78
spa_sync 78
spa_sync_config_object 78
txg_quiesce 78
uberblock_update 78
vdev_config_sync 78
dmu_tx_commit 156
dmu_tx_create_assigned 156
dmu_tx_create_dd 156
spa_sync_aux_dev 156
txg_list_empty 234
arc_do_user_evicts 368
txg_list_remove 390
txg_thread_wait 390
add_reference 486
arc_buf_add_ref 486
spa_config_enter 566
spa_config_exit 566
zio_null 566
zio_root 566
zio_wait 566
vdev_queue_agg_io_done 1954
dbuf_hold_level 62208
dmu_tx_check_ioerr 62208
dnode_next_offset 62208
dnode_next_offset_level 62208
dbuf_whichblock 62286
dbuf_do_evict 62436
dbuf_hash_remove 62436
dbuf_clear 62440
dbuf_destroy 62440
dbuf_evict 62440
dbuf_evict_user 62440
arc_space_return 62462
arc_cksum_compute 62510
arc_read_done 62510
arc_set_callback 62510
dbuf_read_done 62510
dbuf_set_data 62510
fletcher_4_native 62510
lzjb_decompress 62510
vdev_mirror_child_done 62510
vdev_mirror_io_done 62510
vdev_mirror_map_free 62510
vdev_raidz_io_done 62510
vdev_raidz_map_free 62510
zio_checksum_error 62510
zio_checksum_verified 62510
zio_decompress_data 62510
zio_read_decompress 62510
dnode_rele 62514
arc_buf_destroy 62694
arc_cksum_verify 62694
arc_evict 62694
arc_adapt 62696
arc_buf_alloc 62696
arc_evict_needed 62696
arc_get_data_buf 62696
arc_read 62696
buf_hash_insert 62696
dbuf_create 62696
dbuf_findbp 62696
dbuf_hash_insert 62696
dbuf_read_impl 62696
hdr_cons 62696
spa_get_failmode 62696
vdev_mirror_child_select 62696
vdev_mirror_io_start 62696
vdev_mirror_map_alloc 62696
vdev_raidz_io_start 62696
vdev_raidz_map_alloc 62696
zio_read 62696
zio_read_init 62696
arc_space_consume 62698
spa_get_random 62774
arc_access 62996
arc_buf_freeze 63002
arc_buf_remove_ref 63002
remove_reference 63002
arc_reclaim_needed 63080
zio_ready 63262
zio_wait_for_children_ready 63262
arc_change_state 125204
buf_hash_find 125206
vdev_lookup_top 125392
dbuf_add_ref 125470
arc_released 125512
vdev_cache_fill 153525
vdev_cache_evict 154090
zio_vdev_io_reissue 155192
vdev_disk_io_intr 163696
zio_interrupt 163696
vdev_disk_io_done 163708
vdev_queue_io_done 163708
vdev_error_inject 163924
vdev_queue_io_remove 166100
vdev_cache_allocate 166298
vdev_queue_io_add 166434
vdev_cache_hit 175179
zio_vdev_io_bypass 179874
dbuf_find 187190
dbuf_hold_impl 187190
dbuf_read 187190
vdev_raidz_child_done 187536
dbuf_update_data 249700
zio_buf_free 281068
zio_buf_alloc 281438
zio_vdev_io_done 288728
dbuf_rele 312214
dbuf_hash 312322
vdev_dtl_contains 313480
buf_hash 314084
vdev_queue_io 323580
vdev_stat_update 405526
zio_vdev_child_io 406828
vdev_readable 414708
vdev_queue_io_to_issue 437894
zio_should_retry 468036
zio_vdev_io_assess 468036
zio_done 468601
zio_assess 468602
zio_clear_transform_stack 468602
zio_nowait 469524
zio_create 470090
vdev_is_dead 477404
vdev_cache_read 499324
vdev_disk_io_start 499324
zio_notify_parent 530730
zio_pop_transform 531112
zio_push_transform 532786
zio_wait_for_children_done 593994
spa_state 624716
zio_vdev_io_start 624716
zio_wait_for_children 657256
vdev_queue_deadline_compare 906304
zio_execute 1247615
vdev_queue_offset_compare 1694848
vdev_cache_lastused_compare 2472113
vdev_cache_offset_compare 3076216
This message posted from opensolaris.org
It would be helpful to know what's filling your RAM:

echo ::kmastat | mdb -k
> vdev_cache_offset_compare 3076216

vdev_cache changed to metadata only in snv_70:
http://bugs.opensolaris.org/view_bug.do?bug_id=6437054

But without knowing what's using the kmem, we are kinda guessing.

Rob
I'm using snv_85, upgraded from S10U4. Here is the output of echo ::kmastat | mdb -k:

After 4 minutes of uptime (when things are OK): http://pastebin.com/f112d1cae
After 40 minutes (when free memory starts dropping): http://pastebin.com/f7076d96f
Is there a way to know the progress of the destroy? I obviously don't have any choice other than to let it do whatever it's doing and hope that it finishes soon (according to my calculations, within a week). So far the only thing I'm doing is letting it work and rebooting every 20 minutes, before memory fills up.