Stephen Oberholtzer
2007-Dec-18 20:54 UTC
[zfs-code] Questions about ZIO subsystem (and some others)
I apologize if this is the wrong place to ask this. I looked at the archives
for both zfs-code and zfs-discuss, and this seemed like the more appropriate
list to post my query.

I recently read about ZFS and it seems to be a very cool thing. I've been
reading various webpages and looking through the source code, and I think I
have a pretty good handle on the basics -- the object directory, the whole
vdev mirror/stripe/raidz setup, snapshots and clones, etc. I believe I even
have a handle on the metaslab allocator, to a limited degree. Most of this
stuff is apparent from various blogs, and
http://www.opensolaris.org/os/community/zfs/source/, but there are a few
things that aren't fully clear to me, mostly to do with the ZIO subsystem.
I can fully appreciate the 'the source is the documentation' rule, but the
lack of comments sometimes makes it really hard to figure out what's going on.

1. zio.c has functions like "zio_rewrite" and "zio_rewrite_gang_members".
ZFS is copy-on-write, so it should never be rewriting anything, right?
Also, zio_write_compress makes a cryptic reference to spa_sync.

2. Gang Blocks: While not explicitly spelled out anywhere (except maybe the
source code), it seems to me that the behavior is this: system needs to
write a 128KB block, but can't allocate a contiguous 128KB (in which case,
you've got issues), so it allocates two 64KB blocks and a 'gang block' to
point to them. When somebody tries to read back the original 128KB block,
the ZIO subsystem reads the two 64KB halves and pieces them back together
-- and the upper layers of code are none the wiser. Is this correct?

3. Gang Blocks II: Can a gang block point to other gang blocks? My guess is no.

4. Gang Blocks III: If a gang block contains up to 3 pointers (according to
the 'on-disk format' doc) and it *cannot* point to other gang blocks, does
that mean that ZIO can split a block into at most 3 pieces?

5. spa_sync has a loop with the comment "Iterate to convergence". I was
under the impression that the sync operation just made sure all outstanding
writes were committed to disk. How is committing that data to disk going to
change that data?

-- 
Stevie-O
Real programmers use COPY CON PROGRAM.EXE
Jeff Bonwick
2007-Dec-19 01:08 UTC
[zfs-code] Questions about ZIO subsystem (and some others)
> 1. zio.c has functions like "zio_rewrite" and
> "zio_rewrite_gang_members". ZFS is copy-on-write, so it should never
> be rewriting anything, right? Also, zio_write_compress makes a
> cryptic reference to spa_sync.

Right. The only time we rewrite an existing block is when the block was
allocated in the same transaction group we're currently syncing. This can
happen during "sync to convergence", which I'll describe in a moment.

> 2. Gang Blocks: While not explicitly spelled out anywhere (except
> maybe the source code), it seems to me that the behavior is this:
> system needs to write a 128KB block, but can't allocate a contiguous
> 128KB (in which case, you've got issues), so it allocates two 64KB
> blocks and a 'gang block' to point to them. When somebody tries to
> read back the original 128KB block, the ZIO subsystem reads the two
> 64KB halves and pieces them back together -- and the upper layers of
> code are none the wiser. Is this correct?

Exactly right.

> 3. Gang Blocks II: Can a gang block point to other gang blocks?

Yes. It's not likely to come up outside of testing, but it works.

> 4. Gang Blocks III: If a gang block contains up to 3 pointers
> (according to the 'on-disk format' doc) and it *cannot* point to other
> gang blocks, does that mean that ZIO can split a block into at most 3
> pieces?

No -- worst case, they can be nested as mentioned above.
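To make the nesting concrete, here is a rough, self-contained sketch of the
idea in C. The types are invented for illustration -- this is not the real
blkptr_t or on-disk gang header layout -- but the reassemble-on-read shape
is the same:

    /* Toy illustration of gang-block reassembly -- not real ZFS code. */
    #include <stdio.h>
    #include <string.h>

    #define GANG_NPTRS 3              /* a gang header holds a few pointers */

    typedef struct blk {
        int         is_gang;          /* does this point at a gang header? */
        size_t      size;             /* size of the data behind it */
        const char *data;             /* leaf block: the data itself */
        struct blk *child[GANG_NPTRS]; /* gang: the pieces (may themselves
                                          be gangs, hence the recursion) */
    } blk_t;

    /*
     * Read 'bp' into 'buf'.  A plain block is copied directly; a gang
     * block is reassembled by reading each child in turn.
     */
    static size_t blk_read(const blk_t *bp, char *buf)
    {
        if (!bp->is_gang) {
            memcpy(buf, bp->data, bp->size);
            return (bp->size);
        }
        size_t off = 0;
        for (int i = 0; i < GANG_NPTRS && bp->child[i] != NULL; i++)
            off += blk_read(bp->child[i], buf + off);
        return (off);
    }

    int main(void)
    {
        /* Pretend we couldn't allocate "helloworld" contiguously. */
        blk_t a    = { 0, 5, "hello", { NULL } };
        blk_t b    = { 0, 5, "world", { NULL } };
        blk_t gang = { 1, 10, NULL, { &a, &b, NULL } };

        char buf[16] = { 0 };
        size_t n = blk_read(&gang, buf);
        printf("reassembled %zu bytes: %s\n", n, buf);  /* "helloworld" */
        return (0);
    }

The real gang header holds block pointers (DVAs) rather than in-core
pointers, but the read path pieces the children back together in the same
way, so the layers above ZIO never notice.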
> 5. spa_sync has a loop with the comment "Iterate to convergence". I
> was under the impression that the sync operation just made sure all
> outstanding writes were committed to disk. How is committing that
> data to disk going to change that data?

With the lone exception of the uberblock, *everything* in ZFS is stored in
transactional datasets. This includes not just user data, but also metadata,
including pool-wide metadata like space maps.

On the first pass of spa_sync() we write to disk every block that was
modified in that transaction group -- this is dsl_pool_sync(). As a side
effect, we allocate and free a bunch of blocks, which we record in our
in-core space map structures. The next thing we do is propagate these
in-core space map changes to their on-disk counterparts by writing to the
space map objects (vdev_sync() -> metaslab_sync() -> space_map_sync() ->
dmu_write()). The act of doing this marks the space map objects as
modified. This is fundamentally no different than modifying any other
object. However, we have a chicken-and-egg problem: we now have a modified
dataset (the pool's MOS, or meta-object set) that has to be synced. So, we
now enter the second pass of spa_sync(). Here we do the exact same thing as
we did on the first pass, but of course there's a lot less data this time.
We keep doing this until the pool stops wiggling.

The thing is, this iterative process would never converge if we allocated
new blocks on every pass. So on each pass, when writing to a particular
block, we first ask whether that block was born in the same transaction
group that we're currently syncing. If so, then since it's not part of any
prior transaction group, and the current transaction group is not yet
committed, we can safely overwrite the existing block rather than freeing
it and allocating a new one -- the important implication being that no
space maps are modified in the process. That's what allows spa_sync() to
converge.

There are a few twists worth noting.

First, compression adds a wrinkle: if the data compresses to a different
size, we have to allocate a new block of that size. In theory, this could
go on forever. To guard against it, we stop compressing after the first few
passes.

Second, it's generally faster to write to newly-allocated space, which
tends to be physically contiguous, than to be forced to rewrite at a
specific location on disk. That's one of the benefits of a copy-on-write
approach in general. So we have the option of continuing to allocate new
blocks for the first few passes, then switching to rewrites when there are
only a few blocks left. In practice, however, we haven't found this to be a
net win, because there just aren't many blocks after the first pass. The
benefits of copy-on-write locality seem to be neutralized by the cost of
additional sync passes.

Third, there are many space maps in a large pool -- a few hundred per
device, times as many devices as you've got. The more space maps you touch,
the longer it takes for spa_sync() to converge. We can make block
allocation as localized as we want -- typically touching just one space map
-- but we have no control over the locality of frees. Therefore, after the
first few passes we start recording frees not in their space maps, but in a
single deferred-free list (zio_free() -> bplist_enqueue_deferred()). We
then process the deferred frees at the beginning of the next transaction
group (spa_sync_deferred_frees()).

For testing purposes, there are tunables that govern the thresholds for
each of these three behaviors -- see zio_sync_pass.

Jeff
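P.S. To see how those three thresholds fit together, here is a toy sketch
of the per-pass decisions in C. It is not the actual spa_sync()/zio code
path, and the names and numbers below are invented for illustration (the
real knobs are the fields of zio_sync_pass), but it captures the shape of
the logic:

    /* Toy model of the "sync to convergence" decisions -- not real ZFS
     * code; thresholds and names are invented for illustration. */
    #include <stdio.h>
    #include <stdbool.h>
    #include <stdint.h>

    #define PASS_DONT_COMPRESS 8  /* stop compressing after this pass     */
    #define PASS_DEFER_FREE    2  /* then record frees in a deferred list */
    #define PASS_REWRITE       2  /* then overwrite blocks born this txg  */

    /* How one dirty block would be handled on sync pass 'pass' of 'txg'. */
    static void plan_write(uint64_t blk_birth, uint64_t txg, int pass)
    {
        /* Twist 1: compression can change the block's size from pass to
         * pass, forcing fresh allocations forever, so give up on it. */
        bool compress = pass < PASS_DONT_COMPRESS;

        if (pass >= PASS_REWRITE && blk_birth == txg) {
            /* Born in this still-uncommitted txg: nothing committed can
             * refer to it, so overwrite it in place.  No allocation, no
             * free, no space map touched -- this is what lets the loop
             * converge. */
            printf("pass %d: rewrite in place\n", pass);
            return;
        }

        /* Ordinary copy-on-write: allocate a new block, drop the old one.
         * Twist 3: after the first couple of passes, record the free in
         * the deferred list instead of dirtying yet another space map. */
        printf("pass %d: allocate new%s block, %s the old one\n",
            pass,
            compress ? " (compressed)" : "",
            pass < PASS_DEFER_FREE ? "free" : "defer-free");
    }

    int main(void)
    {
        uint64_t txg = 42;

        plan_write(17,  txg, 1);  /* data last written in an old txg      */
        plan_write(txg, txg, 1);  /* early pass: prefer fresh allocation  */
        plan_write(txg, txg, 3);  /* later pass: rewrite in place         */
        plan_write(17,  txg, 9);  /* compression is off by now            */
        return (0);
    }

The key point is the birth-txg check: a block born in the still-open txg
can be overwritten without touching any space map, so later passes dirty
less and less metadata until the pool stops wiggling.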