Can anyone comment on whether the on-boot "Reading ZFS config:" stage is
any slower/better/whatever than deleting zpool.cache, rebooting and
manually importing?

I've been waiting more than 30 hours for this system to come up. There
is a pool with 13TB of data attached. The system locked up whilst
destroying a 934GB dedup'd dataset, and I was forced to reboot it. I can
hear hard drive activity at the moment - i.e. it's doing *something* -
but I'm really hoping there is a better way :)

Thanks
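For reference, the manual route being asked about looks roughly like the
following on OpenSolaris. This is only a sketch of the procedure taemun
reports using later in the thread (boot with milestone=none, delete the
cache file, import by hand); the exact boot-argument syntax may vary by
release, and "tank" is the pool name used later in the thread:

    # boot without importing any pools by passing "-m milestone=none" to
    # the kernel (e.g. on the GRUB kernel line, or via reboot)
    reboot -- -m milestone=none

    # remove the cached pool configuration so boot stops auto-importing it
    rm /etc/zfs/zpool.cache

    # bring the system the rest of the way up, then import by hand so the
    # long-running import happens in a shell you can watch
    svcadm milestone all
    zpool import tank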
On 02/11/10 08:15, taemun wrote:
> Can anyone comment on whether the on-boot "Reading ZFS config:" stage
> is any slower/better/whatever than deleting zpool.cache, rebooting and
> manually importing?
>
> I've been waiting more than 30 hours for this system to come up. There
> is a pool with 13TB of data attached. The system locked up whilst
> destroying a 934GB dedup'd dataset, and I was forced to reboot it. I
> can hear hard drive activity at the moment - i.e. it's doing
> *something* - but I'm really hoping there is a better way :)
>
> Thanks

I think that this is a consequence of 6924390:

<http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6924390>
ZFS destroy on de-duped dataset locks all I/O

This bug is closed as a dup of another bug which is not readable from
the opensolaris site (I'm not clear what makes some bugs readable and
some not).

While trying to reproduce 6924390 (or its equivalent) yesterday, my
system hung as yours did, and when I rebooted, it hung at "Reading ZFS
config".

Someone who knows more about the root cause of this situation (i.e.,
the bug named above) might be able to tell you what's going on and how
to recover. It might be that the destroy has resumed and you have to
wait for it to complete, which I think it will, but it might take a
long time.

Lori
Bill Sommerfeld
2010-Feb-11 19:08 UTC
[zfs-discuss] Reading ZFS config for an extended period
On 02/11/10 10:33, Lori Alt wrote:
> This bug is closed as a dup of another bug which is not readable from
> the opensolaris site (I'm not clear what makes some bugs readable and
> some not).

The other bug in question was opened yesterday and probably hasn't had
time to propagate.

					- Bill
Do you think that more RAM would help this progress faster? We've just
hit 48 hours. No visible progress (although that doesn't really mean
much).

It is presently in a system with 8GB of ram. I could try to move the
pool across to a system with 20GB of ram, if that is likely to expedite
the process. Of course, if it isn't going to make any difference, I'd
rather not restart this process.

Thanks

On 12 February 2010 06:08, Bill Sommerfeld <sommerfeld at sun.com> wrote:
> On 02/11/10 10:33, Lori Alt wrote:
>>
>> This bug is closed as a dup of another bug which is not readable from
>> the opensolaris site (I'm not clear what makes some bugs readable and
>> some not).
>
> The other bug in question was opened yesterday and probably hasn't had
> time to propagate.
>
>					- Bill
After around four days the process appeared to have stalled (no audible
hard drive activity). I restarted with milestone=none, deleted
/etc/zfs/zpool.cache, restarted, and ran zpool import tank. (I also
allowed root login over ssh, so I could open new ssh sessions if
required.) Now I can watch the process from the machine itself.

My present question is: how is the DDT stored? I believe the DDT has
around 10M entries for this dataset, as per:

DDT-sha256-zap-duplicate: 400478 entries, size 490 on disk, 295 in core
DDT-sha256-zap-unique: 10965661 entries, size 381 on disk, 187 in core

(taken just prior to the attempt to destroy the dataset)

A sample from iopattern shows:

%RAN %SEQ  COUNT    MIN    MAX    AVG     KR
 100    0    195    512    512    512     97
 100    0    414    512  65536    895    362
 100    0    261    512    512    512    130
 100    0    273    512    512    512    136
 100    0    247    512    512    512    123
 100    0    297    512    512    512    148
 100    0    292    512    512    512    146
 100    0    250    512    512    512    125
 100    0    274    512    512    512    137
 100    0    302    512    512    512    151
 100    0    294    512    512    512    147
 100    0    308    512    512    512    154
  98    2    286    512    512    512    143
 100    0    270    512    512    512    135
 100    0    390    512    512    512    195
 100    0    269    512    512    512    134
 100    0    251    512    512    512    125
 100    0    254    512    512    512    127
 100    0    265    512    512    512    132
 100    0    283    512    512    512    141

As the pool is comprised of 2x 8-disk raidz vdevs, I presume that each
element is stored twice (for the raidz redundancy). So at around 280
512-byte read ops/s, that's 140 entries per second.

Is the import of a semi-broken pool:

1> Reading all the DDT markers for the dataset; or
2> Reading all the DDT markers for the pool; or
3> Reading all of the block markers for the dataset; or
4> Reading all of the block markers for the pool

prior to actually finalising what it needs to do to fix the pool?

I'd like to be able to estimate the length of time likely before the
import finishes. Or should I tell it to roll back to the last valid
txg - i.e. before the zfs destroy <dataset> command was issued - via
zpool import -F? Or is this likely to take as long as, or longer than,
the present import/fix?

Cheers.
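For anyone attempting the same estimate: taking the DDT entry counts
above at face value, a rough lower bound on the walk time is simply
entries divided by the observed read rate. This is a back-of-envelope
sketch only - it assumes the import touches each DDT entry for the
dataset exactly once, which is exactly the open question above:

    awk 'BEGIN {
        entries = 400478 + 10965661;   # duplicate + unique entries quoted above
        rate    = 140;                 # entries/second inferred from iopattern
        printf("%d entries, ~%.1f hours\n", entries, entries / rate / 3600);
    }'

That works out to roughly 22-23 hours for a single pass over this
dataset's DDT, so a multi-day import would imply either more than one
pass or a lot of other metadata being walked as well.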
Just thought I'd chime in for anyone who had read this - the import
operation completed this time, after 60 hours of disk grinding.

:)
The DDT is stored within the pool, IIRC, but there is an RFE open to
allow you to store it on a separate top-level VDEV, like a SLOG.

The other thing I've noticed with all of the "destroyed a large dataset
with dedup enabled and it's taking forever to import/destroy/<insert
function here>" questions is that the process runs so so so much faster
with 8+ GiB of RAM. Almost to a man, everyone who reports these 3, 4,
or more day destroys has < 8 GiB of RAM on the storage server.

Just some observations/thoughts.

On Mon, Feb 15, 2010 at 23:14, taemun <taemun at gmail.com> wrote:
> Just thought I'd chime in for anyone who had read this - the import
> operation completed this time, after 60 hours of disk grinding.
>
> :)

--
"You can choose your friends, you can choose the deals." - Equity Private
"If Linux is faster, it's a Solaris bug." - Phil Harman
Blog - http://whatderass.blogspot.com/
Twitter - @khyron4eva
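One way to put a number on that observation, using the per-entry
"in core" sizes from the zdb output quoted earlier in this thread (a
sketch only; it assumes those figures are average bytes per entry held
in memory):

    awk 'BEGIN {
        bytes = 400478 * 295 + 10965661 * 187;   # entries * in-core bytes, per table
        printf("~%.1f GiB of ARC just for this one DDT\n", bytes / 2^30);
    }'

That comes to about 2 GiB for a single 934GB dataset's dedup table,
which makes it easy to see why a box with less than 8 GiB of RAM, also
trying to cache everything else, struggles.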
> RFE open to allow you to store [DDT] on a separate top level VDEV

Hmm, add to this spare, log and cache vdevs, and it's getting to the
point of making another pool and thinly provisioning volumes to
maintain partitioning flexibility.

taemun: hey, thanks for closing the loop!

Rob
The system in question has 8GB of ram. It never paged during the import
(unless I was asleep at that point, but anyway).

It ran for 52 hours, then started doing 47% kernel cpu usage. At this
stage, dtrace stopped responding, and so iopattern died, as did iostat.
It was also increasing ram usage rapidly (15mb / minute). After an hour
of that, the cpu went up to 76%. An hour later, CPU usage stopped. Hard
drives were churning throughout all of this (albeit at a rate that
looks like each vdev is being controlled by a single-threaded
operation).

I'm guessing that if you don't have enough ram, it gets stuck in the
use-lots-of-cpu phase, and just dies from too much paging. Of course, I
have absolutely nothing to back that up.

Personally, I think that if L2ARC devices were persistent, we would
already have the mechanism in place for storing the DDT as a "separate
vdev". The problem is, there is nothing you can run at boot time to
populate the L2ARC, so the dedup writes are ridiculously slow until the
cache is warm. If the cache stayed warm, or there was an option to
forcibly warm up the cache, this could be somewhat alleviated.

Cheers
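There is indeed no supported way to pre-warm the cache, but a crude
workaround sometimes used is simply to read the data of interest once
after boot, so it lands in the ARC/L2ARC as a side effect. A sketch
(the dataset and zvol names are made-up examples); note this warms data
blocks and only indirectly the DDT, so it is at best a partial answer
to the problem described above:

    # read a filesystem once and discard the output; the ARC/L2ARC fills
    # as a side effect ("tank/data" mounted at /tank/data is hypothetical)
    tar cf /dev/null /tank/data

    # for a zvol, stream the raw device through dd instead
    dd if=/dev/zvol/rdsk/tank/vol0 of=/dev/null bs=1024k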
Markus Kovero
2010-Feb-16 06:16 UTC
[zfs-discuss] Reading ZFS config for an extended period
> The other thing I've noticed with all of the "destroyed a large dataset with dedup
> enabled and it's taking forever to import/destroy/<insert function here>" questions
> is that the process runs so so so much faster with 8+ GiB of RAM. Almost to a man,
> everyone who reports these 3, 4, or more day destroys has < 8 GiB of RAM on the
> storage server.

I've witnessed destroys that take several days on 24GB+ systems
(dataset over 30TB). I guess it's just a matter of how large the
dataset is vs. how much ram you have.

Yours
Markus Kovero
Ugh! If you received a direct response from me instead of via the list,
apologies for that.

Rob: I'm just reporting the news. The RFE is out there. Just like
SLOGs, I happen to think it a good idea, personally, but that's my
personal opinion. If it makes dedup more usable, I don't see the harm.

Taemun: The issue, as I understand it, is not "use-lots-of-cpu" or
"just dies from paging". I believe it is more to do with all of the
small, random reads/writes involved in updating the DDT. Remember, the
DDT is stored within the pool, just as the ZIL is if you don't have a
SLOG. (The S in SLOG standing for "separate".) So all the DDT updates
are in competition for I/O with the actual data deletion.

If the DDT could be stored as a separate VDEV already, I'm sure a way
would have been hacked together by someone (likely someone on this
list). Hence the need for the RFE to create this functionality where it
does not currently exist. The DDT is separate from the ARC or L2ARC.
Here's the bug:

http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6913566

If I'm incorrect, someone please let me know.

Markus: Yes, the issue would appear to be dataset size vs. RAM size.
Sounds like an area ripe for testing, much like RAID-Z3 performance.

Cheers all!

On Tue, Feb 16, 2010 at 00:20, taemun <taemun at gmail.com> wrote:
> The system in question has 8GB of ram. It never paged during the
> import (unless I was asleep at that point, but anyway).
>
> It ran for 52 hours, then started doing 47% kernel cpu usage. At this
> stage, dtrace stopped responding, and so iopattern died, as did
> iostat. It was also increasing ram usage rapidly (15mb / minute).
> After an hour of that, the cpu went up to 76%. An hour later, CPU
> usage stopped. Hard drives were churning throughout all of this
> (albeit at a rate that looks like each vdev is being controlled by a
> single-threaded operation).
>
> I'm guessing that if you don't have enough ram, it gets stuck in the
> use-lots-of-cpu phase, and just dies from too much paging. Of course,
> I have absolutely nothing to back that up.
>
> Personally, I think that if L2ARC devices were persistent, we already
> have the mechanism in place for storing the DDT as a "separate vdev".
> The problem is, there is nothing you can run at boot time to populate
> the L2ARC, so the dedup writes are ridiculously slow until the cache
> is warm. If the cache stayed warm, or there was an option to forcibly
> warm up the cache, this could be somewhat alleviated.
>
> Cheers

--
"You can choose your friends, you can choose the deals." - Equity Private
"If Linux is faster, it's a Solaris bug." - Phil Harman
Blog - http://whatderass.blogspot.com/
Twitter - @khyron4eva
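For anyone weighing whether dedup is paying its way on their own pool,
the table sizes and the overall dedup ratio can be inspected with zdb
(a sketch; "tank" is a placeholder pool name and the exact output
varies between builds):

    # one -D prints the pool-wide dedup summary, including the overall ratio
    zdb -D tank

    # -DD adds per-table entry counts and on-disk / in-core sizes, i.e. the
    # "DDT-sha256-zap-*" lines quoted earlier in this thread
    zdb -DD tank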
Miles Nordin
2010-Feb-17 20:07 UTC
[zfs-discuss] Reading ZFS config for an extended period
>>>>> "k" == Khyron <khyron4eva at gmail.com> writes:k> The RFE is out there. Just like SLOGs, I happen to think it a k> good idea, personally, but that''s my personal opinion. If it k> makes dedup more usable, I don''t see the harm. slogs and l2arcs, modulo the current longstanding ``cannot import pool with attached missing slog'''' bug, are disposeable: You will lose either a little data or no data if the device goes away (once the bug is finally fixed). This makes them less ponderous because these days we are looking for raidz2 or raidz3 amount of redundancy, so in a seperate device that wasn''t disposeable we''d need a 3- or 4-way mirror. It also makes their seperateness more seperable since they can go away at any time, so maybe they do deserve to be seperate. The two together make the complexity more bearable. Would an sddt be disposeable, or would it be a critical top-level vdev needed for import? If it''s critical, well, that''s kind of annoying, because now we need 3-way mirrors of sddt to match the minimum best-practice redundancy of the rest of the pool''s redundancy, and my first reaction would be ``can you spread it throughout the normal raidz{,2,3} vdevs at least in backup form?'''' once I say a copy should be kept in the main pool even afer it becomes an sddt, well, what''s that imply? * In the read case it means cacheing, so it could go in the l2arc. How''s DDT different from anything else in the l2arc? * In the write case it means sometimes commiting it quickly without waiting on the main pool so we can release some lock or answer some RPC and continue. Why not write it to the slog? Then if we lose the slog we can do what we always do without the slog and roll back to the last valid txg, losing whatever writes were associated with that lost ddt update. The two cases fit fine with the types of SSD''s we''re using for each role and the type of special error recovery we have if we lose the device. Why litter a landscape so full of special cases and tricks (like the ``import pool with missing slog'''' bug that is taking so long to resolve) with yet another kind of vdev that will take 1 year to discover special cases and a halfdecade to address them? Maybe there is a reason. Are DDT write patterns different than slog write patterns? Is it possible to make a DDT read cache using less ARC for pointers than the l2arc currently uses? Is the DDT particularly hard on the l2arc by having small block sizes? Will the sddt be delivered with a separate offline ``not an fsck!!!'''' tool for slowly regenerating it from pool data if it''s lost, or maybe after an sddt goes bad the pool can be mounted space-wastingly as in like no dedup is done and deletes do not free space, with an empty DDT, and the sddt regenerated by a scrub? If the performance or recovery behavior is different than what we''re working towards with optional-slog and persistent-l2arc then maybe sddt does deserve to be antoher vdev type. so....i dunno. On one hand I''m clearly nowhere near informed enough to weigh in on an architectural decision like this and shouldn''t even be discussing it, and the same applies to you Khyron, to my view, since our input seems obvious at best and misinformed at worst. 
On the other hand, another major architectural change (the slog) was
delivered incomplete in a cripplingly bad and silly, trivial way for,
AFAICT, nothing but lack of sufficient sysadmin bitching and moaning,
leaving heaps of multi-terabyte naked pools out there for half a decade
with fancy triple redundancy that will be totally lost if a single SSD
+ zpool.cache goes bad. So apparently thinking things through even at
this trivial level might have some value to the ones actually doing the
work.
Anton Pomozov
2010-Oct-12 14:43 UTC
[zfs-discuss] Reading ZFS config for an extended period
I have a 1TB mirrored, deduplicated pool.
snv_134 running on an x86 i7 PC with 8GB RAM.
I destroyed a 30GB zfs volume and am now trying to import that pool
from a LiveUSB-booted osol.
It has been running for >2h already; I'm waiting ...
How can I see a progress bar or other signs of the current import job?
--
This message posted from opensolaris.org
Roy Sigurd Karlsbakk
2010-Oct-12 16:27 UTC
[zfs-discuss] Reading ZFS config for an extended period
----- Original Message -----
> I have a 1TB mirrored, deduplicated pool.
> snv_134 running on an x86 i7 PC with 8GB RAM.
> I destroyed a 30GB zfs volume and am now trying to import that pool
> from a LiveUSB-booted osol.
> It has been running for >2h already; I'm waiting ...

It may even take longer. I've seen this take a while. It's a known bug.
The fix is not to use dedup...

> How can I see a progress bar or other signs of the current import job?

You can't. If you reboot, the system will likely hang until the volume
is removed. It should be possible to bring the system up in single user
mode, but you should probably just wait.

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
roy at karlsbakk.net
http://blogg.karlsbakk.net/
--
In all pedagogy it is essential that the curriculum be presented
intelligibly. It is an elementary imperative for all pedagogues to
avoid excessive use of idioms of foreign origin. In most cases,
adequate and relevant synonyms exist in Norwegian.
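While there is no real progress indicator for an import that is still
finishing off a dedup destroy, you can at least confirm from another
shell that the pool is doing work rather than hanging (a sketch; the
pool name is a placeholder):

    # per-device I/O statistics every 5 seconds; a steady stream of small
    # random reads suggests the import is still walking metadata, not hung
    iostat -xn 5

    # once the import finally returns, the pool will show up here
    zpool list
    zpool status tank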