Can anyone comment on whether the on-boot "Reading ZFS config:" stage is
any slower/better/whatever than deleting zpool.cache, rebooting and
manually importing?

I've been waiting more than 30 hours for this system to come up. There
is a pool with 13TB of data attached. The system locked up whilst
destroying a 934GB dedup'd dataset, and I was forced to reboot it. I can
hear hard drive activity at the moment - i.e. it's doing *something* -
but I'm really hoping there is a better way :)

Thanks
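For reference, the manual route being asked about looks roughly like the
following on OpenSolaris. This is only a sketch of the procedure taemun
reports using later in the thread (boot with milestone=none, delete the
cache file, import by hand); the exact boot-argument syntax may vary by
release, and "tank" is the pool name used later in the thread:

    # boot without importing any pools by passing "-m milestone=none" to
    # the kernel (e.g. on the GRUB kernel line, or via reboot)
    reboot -- -m milestone=none

    # remove the cached pool configuration so boot stops auto-importing it
    rm /etc/zfs/zpool.cache

    # bring the system the rest of the way up, then import by hand so the
    # long-running import happens in a shell you can watch
    svcadm milestone all
    zpool import tank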
On 02/11/10 08:15, taemun wrote:
> Can anyone comment on whether the on-boot "Reading ZFS config:" stage
> is any slower/better/whatever than deleting zpool.cache, rebooting and
> manually importing?
>
> I've been waiting more than 30 hours for this system to come up. There
> is a pool with 13TB of data attached. The system locked up whilst
> destroying a 934GB dedup'd dataset, and I was forced to reboot it. I
> can hear hard drive activity at the moment - i.e. it's doing
> *something* - but I'm really hoping there is a better way :)
>
> Thanks

I think that this is a consequence of 6924390:

<http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6924390>
ZFS destroy on de-duped dataset locks all I/O

This bug is closed as a dup of another bug which is not readable from
the opensolaris site (I'm not clear what makes some bugs readable and
some not).

While trying to reproduce 6924390 (or its equivalent) yesterday, my
system hung as yours did, and when I rebooted, it hung at "Reading ZFS
config".

Someone who knows more about the root cause of this situation (i.e.,
the bug named above) might be able to tell you what's going on and how
to recover. It might be that the destroy has resumed and you have to
wait for it to complete, which I think it will, but it might take a
long time.

Lori
Bill Sommerfeld
2010-Feb-11 19:08 UTC
[zfs-discuss] Reading ZFS config for an extended period
On 02/11/10 10:33, Lori Alt wrote:
> This bug is closed as a dup of another bug which is not readable from
> the opensolaris site (I'm not clear what makes some bugs readable and
> some not).

The other bug in question was opened yesterday and probably hasn't had
time to propagate.

					- Bill
Do you think that more RAM would help this progress faster? We've just
hit 48 hours. No visible progress (although that doesn't really mean
much).

It is presently in a system with 8GB of ram. I could try to move the
pool across to a system with 20GB of ram, if that is likely to expedite
the process. Of course, if it isn't going to make any difference, I'd
rather not restart this process.

Thanks

On 12 February 2010 06:08, Bill Sommerfeld <sommerfeld at sun.com> wrote:
> On 02/11/10 10:33, Lori Alt wrote:
>>
>> This bug is closed as a dup of another bug which is not readable from
>> the opensolaris site (I'm not clear what makes some bugs readable and
>> some not).
>
> The other bug in question was opened yesterday and probably hasn't had
> time to propagate.
>
>					- Bill
After around four days the process appeared to have stalled (no audible
hard drive activity). I restarted with milestone=none, deleted
/etc/zfs/zpool.cache, restarted, and ran zpool import tank. (I also
allowed root login over ssh, so I could open new ssh sessions if
required.) Now I can watch the process from the machine itself.

My present question is: how is the DDT stored? I believe the DDT has
around 10M entries for this dataset, as per:

DDT-sha256-zap-duplicate: 400478 entries, size 490 on disk, 295 in core
DDT-sha256-zap-unique: 10965661 entries, size 381 on disk, 187 in core

(taken just prior to the attempt to destroy the dataset)

A sample from iopattern shows:

%RAN %SEQ  COUNT    MIN    MAX    AVG     KR
 100    0    195    512    512    512     97
 100    0    414    512  65536    895    362
 100    0    261    512    512    512    130
 100    0    273    512    512    512    136
 100    0    247    512    512    512    123
 100    0    297    512    512    512    148
 100    0    292    512    512    512    146
 100    0    250    512    512    512    125
 100    0    274    512    512    512    137
 100    0    302    512    512    512    151
 100    0    294    512    512    512    147
 100    0    308    512    512    512    154
  98    2    286    512    512    512    143
 100    0    270    512    512    512    135
 100    0    390    512    512    512    195
 100    0    269    512    512    512    134
 100    0    251    512    512    512    125
 100    0    254    512    512    512    127
 100    0    265    512    512    512    132
 100    0    283    512    512    512    141

As the pool is comprised of 2x 8-disk raidz vdevs, I presume that each
element is stored twice (for the raidz redundancy). So at around 280
512-byte read ops/s, that's 140 entries per second.

Is the import of a semi-broken pool:

1> Reading all the DDT markers for the dataset; or
2> Reading all the DDT markers for the pool; or
3> Reading all of the block markers for the dataset; or
4> Reading all of the block markers for the pool

prior to actually finalising what it needs to do to fix the pool?

I'd like to be able to estimate the length of time likely before the
import finishes. Or should I tell it to roll back to the last valid
txg - i.e. before the zfs destroy <dataset> command was issued - via
zpool import -F? Or is this likely to take as long as, or longer than,
the present import/fix?

Cheers.
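For anyone attempting the same estimate: taking the DDT entry counts
above at face value, a rough lower bound on the walk time is simply
entries divided by the observed read rate. This is a back-of-envelope
sketch only - it assumes the import touches each DDT entry for the
dataset exactly once, which is exactly the open question above:

    awk 'BEGIN {
        entries = 400478 + 10965661;   # duplicate + unique entries quoted above
        rate    = 140;                 # entries/second inferred from iopattern
        printf("%d entries, ~%.1f hours\n", entries, entries / rate / 3600);
    }'

That works out to roughly 22-23 hours for a single pass over this
dataset's DDT, so a multi-day import would imply either more than one
pass or a lot of other metadata being walked as well.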
Just thought I'd chime in for anyone who had read this - the import
operation completed this time, after 60 hours of disk grinding.

:)
The DDT is stored within the pool, IIRC, but there is an RFE open to
allow you to store it on a separate top-level VDEV, like a SLOG.

The other thing I've noticed with all of the "destroyed a large dataset
with dedup enabled and it's taking forever to import/destroy/<insert
function here>" questions is that the process runs so so so much faster
with 8+ GiB of RAM. Almost to a man, everyone who reports these 3, 4,
or more day destroys has < 8 GiB of RAM on the storage server.

Just some observations/thoughts.

On Mon, Feb 15, 2010 at 23:14, taemun <taemun at gmail.com> wrote:
> Just thought I'd chime in for anyone who had read this - the import
> operation completed this time, after 60 hours of disk grinding.
>
> :)

--
"You can choose your friends, you can choose the deals." - Equity Private
"If Linux is faster, it's a Solaris bug." - Phil Harman
Blog - http://whatderass.blogspot.com/
Twitter - @khyron4eva
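One way to put a number on that observation, using the per-entry
"in core" sizes from the zdb output quoted earlier in this thread (a
sketch only; it assumes those figures are average bytes per entry held
in memory):

    awk 'BEGIN {
        bytes = 400478 * 295 + 10965661 * 187;   # entries * in-core bytes, per table
        printf("~%.1f GiB of ARC just for this one DDT\n", bytes / 2^30);
    }'

That comes to about 2 GiB for a single 934GB dataset's dedup table,
which makes it easy to see why a box with less than 8 GiB of RAM, also
trying to cache everything else, struggles.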
> RFE open to allow you to store [DDT] on a separate top level VDEV

Hmm, add to this spare, log and cache vdevs, and it's getting to the
point of making another pool and thinly provisioning volumes to
maintain partitioning flexibility.

taemun: hey, thanks for closing the loop!

Rob
The system in question has 8GB of ram. It never paged during the import
(unless I was asleep at that point, but anyway).

It ran for 52 hours, then started doing 47% kernel cpu usage. At this
stage, dtrace stopped responding, and so iopattern died, as did iostat.
It was also increasing ram usage rapidly (15mb / minute). After an hour
of that, the cpu went up to 76%. An hour later, CPU usage stopped. Hard
drives were churning throughout all of this (albeit at a rate that
looks like each vdev is being controlled by a single-threaded
operation).

I'm guessing that if you don't have enough ram, it gets stuck in the
use-lots-of-cpu phase, and just dies from too much paging. Of course, I
have absolutely nothing to back that up.

Personally, I think that if L2ARC devices were persistent, we would
already have the mechanism in place for storing the DDT as a "separate
vdev". The problem is, there is nothing you can run at boot time to
populate the L2ARC, so the dedup writes are ridiculously slow until the
cache is warm. If the cache stayed warm, or there was an option to
forcibly warm up the cache, this could be somewhat alleviated.

Cheers
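There is indeed no supported way to pre-warm the cache, but a crude
workaround sometimes used is simply to read the data of interest once
after boot, so it lands in the ARC/L2ARC as a side effect. A sketch
(the dataset and zvol names are made-up examples); note this warms data
blocks and only indirectly the DDT, so it is at best a partial answer
to the problem described above:

    # read a filesystem once and discard the output; the ARC/L2ARC fills
    # as a side effect ("tank/data" mounted at /tank/data is hypothetical)
    tar cf /dev/null /tank/data

    # for a zvol, stream the raw device through dd instead
    dd if=/dev/zvol/rdsk/tank/vol0 of=/dev/null bs=1024k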
Markus Kovero
2010-Feb-16 06:16 UTC
[zfs-discuss] Reading ZFS config for an extended period
> The other thing I've noticed with all of the "destroyed a large dataset with dedup
> enabled and it's taking forever to import/destroy/<insert function here>" questions
> is that the process runs so so so much faster with 8+ GiB of RAM. Almost to a man,
> everyone who reports these 3, 4, or more day destroys has < 8 GiB of RAM on the
> storage server.

I've witnessed destroys that take several days on 24GB+ systems
(dataset over 30TB). I guess it's just a matter of how large the
dataset is vs. how much ram you have.

Yours
Markus Kovero
Ugh! If you received a direct response from me instead of via the list,
apologies for that.

Rob: I'm just reporting the news. The RFE is out there. Just like
SLOGs, I happen to think it a good idea, personally, but that's my
personal opinion. If it makes dedup more usable, I don't see the harm.

Taemun: The issue, as I understand it, is not "use-lots-of-cpu" or
"just dies from paging". I believe it is more to do with all of the
small, random reads/writes involved in updating the DDT. Remember, the
DDT is stored within the pool, just as the ZIL is if you don't have a
SLOG. (The S in SLOG standing for "separate".) So all the DDT updates
are in competition for I/O with the actual data deletion.

If the DDT could be stored as a separate VDEV already, I'm sure a way
would have been hacked together by someone (likely someone on this
list). Hence the need for the RFE to create this functionality where it
does not currently exist. The DDT is separate from the ARC or L2ARC.
Here's the bug:

http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6913566

If I'm incorrect, someone please let me know.

Markus: Yes, the issue would appear to be dataset size vs. RAM size.
Sounds like an area ripe for testing, much like RAID-Z3 performance.

Cheers all!

On Tue, Feb 16, 2010 at 00:20, taemun <taemun at gmail.com> wrote:
> The system in question has 8GB of ram. It never paged during the
> import (unless I was asleep at that point, but anyway).
>
> It ran for 52 hours, then started doing 47% kernel cpu usage. At this
> stage, dtrace stopped responding, and so iopattern died, as did
> iostat. It was also increasing ram usage rapidly (15mb / minute).
> After an hour of that, the cpu went up to 76%. An hour later, CPU
> usage stopped. Hard drives were churning throughout all of this
> (albeit at a rate that looks like each vdev is being controlled by a
> single-threaded operation).
>
> I'm guessing that if you don't have enough ram, it gets stuck in the
> use-lots-of-cpu phase, and just dies from too much paging. Of course,
> I have absolutely nothing to back that up.
>
> Personally, I think that if L2ARC devices were persistent, we already
> have the mechanism in place for storing the DDT as a "separate vdev".
> The problem is, there is nothing you can run at boot time to populate
> the L2ARC, so the dedup writes are ridiculously slow until the cache
> is warm. If the cache stayed warm, or there was an option to forcibly
> warm up the cache, this could be somewhat alleviated.
>
> Cheers

--
"You can choose your friends, you can choose the deals." - Equity Private
"If Linux is faster, it's a Solaris bug." - Phil Harman
Blog - http://whatderass.blogspot.com/
Twitter - @khyron4eva
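For anyone weighing whether dedup is paying its way on their own pool,
the table sizes and the overall dedup ratio can be inspected with zdb
(a sketch; "tank" is a placeholder pool name and the exact output
varies between builds):

    # one -D prints the pool-wide dedup summary, including the overall ratio
    zdb -D tank

    # -DD adds per-table entry counts and on-disk / in-core sizes, i.e. the
    # "DDT-sha256-zap-*" lines quoted earlier in this thread
    zdb -DD tank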
Miles Nordin
2010-Feb-17 20:07 UTC
[zfs-discuss] Reading ZFS config for an extended period
>>>>> "k" == Khyron <khyron4eva at gmail.com> writes:k> The RFE is out there. Just like SLOGs, I happen to think it a k> good idea, personally, but that''s my personal opinion. If it k> makes dedup more usable, I don''t see the harm. slogs and l2arcs, modulo the current longstanding ``cannot import pool with attached missing slog'''' bug, are disposeable: You will lose either a little data or no data if the device goes away (once the bug is finally fixed). This makes them less ponderous because these days we are looking for raidz2 or raidz3 amount of redundancy, so in a seperate device that wasn''t disposeable we''d need a 3- or 4-way mirror. It also makes their seperateness more seperable since they can go away at any time, so maybe they do deserve to be seperate. The two together make the complexity more bearable. Would an sddt be disposeable, or would it be a critical top-level vdev needed for import? If it''s critical, well, that''s kind of annoying, because now we need 3-way mirrors of sddt to match the minimum best-practice redundancy of the rest of the pool''s redundancy, and my first reaction would be ``can you spread it throughout the normal raidz{,2,3} vdevs at least in backup form?'''' once I say a copy should be kept in the main pool even afer it becomes an sddt, well, what''s that imply? * In the read case it means cacheing, so it could go in the l2arc. How''s DDT different from anything else in the l2arc? * In the write case it means sometimes commiting it quickly without waiting on the main pool so we can release some lock or answer some RPC and continue. Why not write it to the slog? Then if we lose the slog we can do what we always do without the slog and roll back to the last valid txg, losing whatever writes were associated with that lost ddt update. The two cases fit fine with the types of SSD''s we''re using for each role and the type of special error recovery we have if we lose the device. Why litter a landscape so full of special cases and tricks (like the ``import pool with missing slog'''' bug that is taking so long to resolve) with yet another kind of vdev that will take 1 year to discover special cases and a halfdecade to address them? Maybe there is a reason. Are DDT write patterns different than slog write patterns? Is it possible to make a DDT read cache using less ARC for pointers than the l2arc currently uses? Is the DDT particularly hard on the l2arc by having small block sizes? Will the sddt be delivered with a separate offline ``not an fsck!!!'''' tool for slowly regenerating it from pool data if it''s lost, or maybe after an sddt goes bad the pool can be mounted space-wastingly as in like no dedup is done and deletes do not free space, with an empty DDT, and the sddt regenerated by a scrub? If the performance or recovery behavior is different than what we''re working towards with optional-slog and persistent-l2arc then maybe sddt does deserve to be antoher vdev type. so....i dunno. On one hand I''m clearly nowhere near informed enough to weigh in on an architectural decision like this and shouldn''t even be discussing it, and the same applies to you Khyron, to my view, since our input seems obvious at best and misinformed at worst. 
On the other hand, another major architectural change (the slog) was
delivered incomplete in a cripplingly bad and silly, trivial way for,
AFAICT, nothing but lack of sufficient sysadmin bitching and moaning,
leaving heaps of multi-terabyte naked pools out there for half a decade
with fancy triple redundancy that will be totally lost if a single SSD
+ zpool.cache goes bad. So apparently thinking things through even at
this trivial level might have some value to the ones actually doing the
work.
Anton Pomozov
2010-Oct-12 14:43 UTC
[zfs-discuss] Reading ZFS config for an extended period
I have a 1TB mirrored, deduplicated pool.
snv_134 running on an x86 i7 PC with 8GB RAM.
I destroyed a 30GB zfs volume and am now trying to import that pool
from a LiveUSB-booted osol.
It has been running for >2h already; I'm waiting ...
How can I see a progress bar or other signs of the current import job?
--
This message posted from opensolaris.org
Roy Sigurd Karlsbakk
2010-Oct-12 16:27 UTC
[zfs-discuss] Reading ZFS config for an extended period
----- Original Message -----
> I have a 1TB mirrored, deduplicated pool.
> snv_134 running on an x86 i7 PC with 8GB RAM.
> I destroyed a 30GB zfs volume and am now trying to import that pool
> from a LiveUSB-booted osol.
> It has been running for >2h already; I'm waiting ...

It may even take longer. I've seen this take a while. It's a known bug.
The fix is not to use dedup...

> How can I see a progress bar or other signs of the current import job?

You can't. If you reboot, the system will likely hang until the volume
is removed. It should be possible to bring the system up in single user
mode, but you should probably just wait.

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
roy at karlsbakk.net
http://blogg.karlsbakk.net/
--
In all pedagogy it is essential that the curriculum be presented
intelligibly. It is an elementary imperative for all pedagogues to
avoid excessive use of idioms of foreign origin. In most cases,
adequate and relevant synonyms exist in Norwegian.
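While there is no real progress indicator for an import that is still
finishing off a dedup destroy, you can at least confirm from another
shell that the pool is doing work rather than hanging (a sketch; the
pool name is a placeholder):

    # per-device I/O statistics every 5 seconds; a steady stream of small
    # random reads suggests the import is still walking metadata, not hung
    iostat -xn 5

    # once the import finally returns, the pool will show up here
    zpool list
    zpool status tank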