Stephen Oberholtzer
2007-Dec-18 20:54 UTC
[zfs-code] Questions about ZIO subsystem (and some others)
I apologize if this is the wrong place to ask this. I looked at the archives
for both zfs-code and zfs-discuss, and this seemed like the more appropriate
list to post my query.

I recently read about ZFS and it seems to be a very cool thing. I've been
reading various webpages and looking through the source code, and I think I
have a pretty good handle on the basics -- the object directory, the whole
vdev mirror/stripe/raidz setup, snapshots and clones, etc. I believe I even
have a handle on the metaslab allocator, to a limited degree. Most of this
stuff is apparent from various blogs, and
http://www.opensolaris.org/os/community/zfs/source/, but there are a few
things that aren't fully clear to me, mostly to do with the ZIO subsystem.
I can fully appreciate the 'the source is the documentation' rule, but the
lack of comments sometimes makes it really hard to figure out what's going on.

1. zio.c has functions like "zio_rewrite" and "zio_rewrite_gang_members".
ZFS is copy-on-write, so it should never be rewriting anything, right?
Also, zio_write_compress makes a cryptic reference to spa_sync.

2. Gang Blocks: While not explicitly spelled out anywhere (except maybe the
source code), it seems to me that the behavior is this: system needs to
write a 128KB block, but can't allocate a contiguous 128KB (in which case,
you've got issues), so it allocates two 64KB blocks and a 'gang block' to
point to them. When somebody tries to read back the original 128KB block,
the ZIO subsystem reads the two 64KB halves and pieces them back together
-- and the upper layers of code are none the wiser. Is this correct?

3. Gang Blocks II: Can a gang block point to other gang blocks? My guess is no.

4. Gang Blocks III: If a gang block contains up to 3 pointers (according to
the 'on-disk format' doc) and it *cannot* point to other gang blocks, does
that mean that ZIO can split a block into at most 3 pieces?

5. spa_sync has a loop with the comment "Iterate to convergence". I was
under the impression that the sync operation just made sure all outstanding
writes were committed to disk. How is committing that data to disk going to
change that data?

-- 
Stevie-O
Real programmers use COPY CON PROGRAM.EXE
Jeff Bonwick
2007-Dec-19 01:08 UTC
[zfs-code] Questions about ZIO subsystem (and some others)
> 1. zio.c has functions like "zio_rewrite" and
> "zio_rewrite_gang_members". ZFS is copy-on-write, so it should never
> be rewriting anything, right? Also, zio_write_compress makes a
> cryptic reference to spa_sync.

Right. The only time we rewrite an existing block is when the block was
allocated in the same transaction group we're currently syncing. This can
happen during "sync to convergence", which I'll describe in a moment.

> 2. Gang Blocks: While not explicitly spelled out anywhere (except
> maybe the source code), it seems to me that the behavior is this:
> system needs to write a 128KB block, but can't allocate a contiguous
> 128KB (in which case, you've got issues), so it allocates two 64KB
> blocks and a 'gang block' to point to them. When somebody tries to
> read back the original 128KB block, the ZIO subsystem reads the two
> 64KB halves and pieces them back together -- and the upper layers of
> code are none the wiser. Is this correct?

Exactly right.

> 3. Gang Blocks II: Can a gang block point to other gang blocks?

Yes. It's not likely to come up outside of testing, but it works.

> 4. Gang Blocks III: If a gang block contains up to 3 pointers
> (according to the 'on-disk format' doc) and it *cannot* point to other
> gang blocks, does that mean that ZIO can split a block into at most 3
> pieces?

No -- worst case, they can be nested as mentioned above.
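To make the nesting concrete, here is a rough, self-contained sketch of the
idea in C. The types are invented for illustration -- this is not the real
blkptr_t or on-disk gang header layout -- but the reassemble-on-read shape
is the same:

    /* Toy illustration of gang-block reassembly -- not real ZFS code. */
    #include <stdio.h>
    #include <string.h>

    #define GANG_NPTRS 3              /* a gang header holds a few pointers */

    typedef struct blk {
        int         is_gang;          /* does this point at a gang header? */
        size_t      size;             /* size of the data behind it */
        const char *data;             /* leaf block: the data itself */
        struct blk *child[GANG_NPTRS]; /* gang: the pieces (may themselves
                                          be gangs, hence the recursion) */
    } blk_t;

    /*
     * Read 'bp' into 'buf'.  A plain block is copied directly; a gang
     * block is reassembled by reading each child in turn.
     */
    static size_t blk_read(const blk_t *bp, char *buf)
    {
        if (!bp->is_gang) {
            memcpy(buf, bp->data, bp->size);
            return (bp->size);
        }
        size_t off = 0;
        for (int i = 0; i < GANG_NPTRS && bp->child[i] != NULL; i++)
            off += blk_read(bp->child[i], buf + off);
        return (off);
    }

    int main(void)
    {
        /* Pretend we couldn't allocate "helloworld" contiguously. */
        blk_t a    = { 0, 5, "hello", { NULL } };
        blk_t b    = { 0, 5, "world", { NULL } };
        blk_t gang = { 1, 10, NULL, { &a, &b, NULL } };

        char buf[16] = { 0 };
        size_t n = blk_read(&gang, buf);
        printf("reassembled %zu bytes: %s\n", n, buf);  /* "helloworld" */
        return (0);
    }

The real gang header holds block pointers (DVAs) rather than in-core
pointers, but the read path pieces the children back together in the same
way, so the layers above ZIO never notice.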
> 5. spa_sync has a loop with the comment "Iterate to convergence". I
> was under the impression that the sync operation just made sure all
> outstanding writes were committed to disk. How is committing that
> data to disk going to change that data?

With the lone exception of the uberblock, *everything* in ZFS is stored in
transactional datasets. This includes not just user data, but also metadata,
including pool-wide metadata like space maps.

On the first pass of spa_sync() we write to disk every block that was
modified in that transaction group -- this is dsl_pool_sync(). As a side
effect, we allocate and free a bunch of blocks, which we record in our
in-core space map structures. The next thing we do is propagate these
in-core space map changes to their on-disk counterparts by writing to the
space map objects (vdev_sync() -> metaslab_sync() -> space_map_sync() ->
dmu_write()). The act of doing this marks the space map objects as
modified. This is fundamentally no different than modifying any other
object. However, we have a chicken-and-egg problem: we now have a modified
dataset (the pool's MOS, or meta-object set) that has to be synced. So, we
now enter the second pass of spa_sync(). Here we do the exact same thing as
we did on the first pass, but of course there's a lot less data this time.
We keep doing this until the pool stops wiggling.

The thing is, this iterative process would never converge if we allocated
new blocks on every pass. So on each pass, when writing to a particular
block, we first ask whether that block was born in the same transaction
group that we're currently syncing. If so, then since it's not part of any
prior transaction group, and the current transaction group is not yet
committed, we can safely overwrite the existing block rather than freeing
it and allocating a new one -- the important implication being that no
space maps are modified in the process. That's what allows spa_sync() to
converge.

There are a few twists worth noting.

First, compression adds a wrinkle: if the data compresses to a different
size, we have to allocate a new block of that size. In theory, this could
go on forever. To guard against it, we stop compressing after the first few
passes.

Second, it's generally faster to write to newly-allocated space, which
tends to be physically contiguous, than to be forced to rewrite at a
specific location on disk. That's one of the benefits of a copy-on-write
approach in general. So we have the option of continuing to allocate new
blocks for the first few passes, then switching to rewrites when there are
only a few blocks left. In practice, however, we haven't found this to be a
net win, because there just aren't many blocks after the first pass. The
benefits of copy-on-write locality seem to be neutralized by the cost of
additional sync passes.

Third, there are many space maps in a large pool -- a few hundred per
device, times as many devices as you've got. The more space maps you touch,
the longer it takes for spa_sync() to converge. We can make block
allocation as localized as we want -- typically touching just one space map
-- but we have no control over the locality of frees. Therefore, after the
first few passes we start recording frees not in their space maps, but in a
single deferred-free list (zio_free() -> bplist_enqueue_deferred()). We
then process the deferred frees at the beginning of the next transaction
group (spa_sync_deferred_frees()).

For testing purposes, there are tunables that govern the thresholds for
each of these three behaviors -- see zio_sync_pass.

Jeff
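P.S. To see how those three thresholds fit together, here is a toy sketch
of the per-pass decisions in C. It is not the actual spa_sync()/zio code
path, and the names and numbers below are invented for illustration (the
real knobs are the fields of zio_sync_pass), but it captures the shape of
the logic:

    /* Toy model of the "sync to convergence" decisions -- not real ZFS
     * code; thresholds and names are invented for illustration. */
    #include <stdio.h>
    #include <stdbool.h>
    #include <stdint.h>

    #define PASS_DONT_COMPRESS 8  /* stop compressing after this pass     */
    #define PASS_DEFER_FREE    2  /* then record frees in a deferred list */
    #define PASS_REWRITE       2  /* then overwrite blocks born this txg  */

    /* How one dirty block would be handled on sync pass 'pass' of 'txg'. */
    static void plan_write(uint64_t blk_birth, uint64_t txg, int pass)
    {
        /* Twist 1: compression can change the block's size from pass to
         * pass, forcing fresh allocations forever, so give up on it. */
        bool compress = pass < PASS_DONT_COMPRESS;

        if (pass >= PASS_REWRITE && blk_birth == txg) {
            /* Born in this still-uncommitted txg: nothing committed can
             * refer to it, so overwrite it in place.  No allocation, no
             * free, no space map touched -- this is what lets the loop
             * converge. */
            printf("pass %d: rewrite in place\n", pass);
            return;
        }

        /* Ordinary copy-on-write: allocate a new block, drop the old one.
         * Twist 3: after the first couple of passes, record the free in
         * the deferred list instead of dirtying yet another space map. */
        printf("pass %d: allocate new%s block, %s the old one\n",
            pass,
            compress ? " (compressed)" : "",
            pass < PASS_DEFER_FREE ? "free" : "defer-free");
    }

    int main(void)
    {
        uint64_t txg = 42;

        plan_write(17,  txg, 1);  /* data last written in an old txg      */
        plan_write(txg, txg, 1);  /* early pass: prefer fresh allocation  */
        plan_write(txg, txg, 3);  /* later pass: rewrite in place         */
        plan_write(17,  txg, 9);  /* compression is off by now            */
        return (0);
    }

The key point is the birth-txg check: a block born in the still-open txg
can be overwritten without touching any space map, so later passes dirty
less and less metadata until the pool stops wiggling.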