There's been some talk about alignment lately, both for flash and WD
disks.

What's missing, at least from my perspective, is a clear and
unambiguous test so users can verify that their zfs pools are aligned
correctly. This should be a test that sees through all the layers of
BIOS and SMI/EFI and zfs labels and their accumulated offsets, and
lets us ascertain where things land in terms of addresses that matter
to the storage.

I can think of two methods to achieve this, but lack information to
complete either. I'd appreciate some help - or a better way.

#1. Use xxd (or similar) to examine the contents of the raw disk
device that ignores partitioning (e.g. c0d0p0). Search for a known
label magic number or similar content, and determine its address.
Apply arithmetic as necessary.

This relies on knowing what to look for, and how that is aligned to
the start of the partition and to the metaslab addresses and offsets
that determine the writes we actually care about.

#2. Use dtrace to watch the actual sector addresses being accessed
when examining a pool (e.g. zdb -l). Apply arithmetic as necessary.

This relies on dtrace clue.

As for the arithmetic.. I'm not certain I've seen, for example, a
definitive statement of what the alignment offset is between
start-of-partition and zfs data blocks, once various preamble header
sectors are allowed for.

--
Dan.
On Mar 28, 2010, at 6:21 PM, Daniel Carosone wrote:
> There's been some talk about alignment lately, both for flash and WD
> disks.
>
> What's missing, at least from my perspective, is a clear and
> unambiguous test so users can verify that their zfs pools are aligned
> correctly. This should be a test that sees through all the layers of
> BIOS and SMI/EFI and zfs labels and their accumulated offsets, and
> lets us ascertain where things land in terms of addresses that matter
> to the storage.
>
> I can think of two methods to achieve this, but lack information to
> complete either. I'd appreciate some help - or a better way.
>
> #1. Use xxd (or similar) to examine the contents of the raw disk
> device that ignores partitioning (e.g. c0d0p0). Search for a known
> label magic number or similar content, and determine its address.
> Apply arithmetic as necessary.
>
> This relies on knowing what to look for, and how that is aligned to
> the start of the partition and to the metaslab addresses and offsets
> that determine the writes we actually care about.
>
> #2. Use dtrace to watch the actual sector addresses being accessed
> when examining a pool (e.g. zdb -l). Apply arithmetic as necessary.
>
> This relies on dtrace clue.
>
> As for the arithmetic.. I'm not certain I've seen, for example, a
> definitive statement of what the alignment offset is between
> start-of-partition and zfs data blocks, once various preamble header
> sectors are allowed for.

This is documented in the ZFS on disk format doc. Uberblocks are
aligned to 256KB offsets from the beginning of the slice. The first
metaslab starts 4MB from the start of the device.

Use prtvtoc or format to see the beginning of the slice relative to the
beginning of the partition. I dunno how you tell the start of the
partition relative to the physical device.

To get an idea of the objects, "zdb -dd poolname" will show object
properties. Do not be surprised that the metadata is compressed and
small.
Other zdb options will show allocations, DVAs, and pretty much
anything else.

The attached dtrace script is a basic analysis of 4KB aligned starts
(relative to the start of the slice) and 4KB size alignment. Don't be
surprised to see very little alignment due to compression, metadata,
and other oddities.
 -- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com

-------------- next part --------------
A non-text attachment was scrubbed...
Name: align4k.d
Type: application/octet-stream
Size: 605 bytes
Desc: not available
URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20100328/1f208f74/attachment.obj>
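The align4k.d script was scrubbed from the archive, so what follows is
not that script. It is only a minimal Python sketch of the kind of tally
the message describes: for each I/O, is the start offset (relative to
the start of the slice) a multiple of 4KB, and is the size a multiple of
4KB? The function names and the sample I/O records are hypothetical; in
practice the (offset, size) pairs would have to come from a trace.

```python
from collections import Counter

ALIGN = 4096  # 4KB, the granularity the message says the script checks


def classify(offset, size, align=ALIGN):
    """Return (start_aligned, size_aligned) for one I/O, both in bytes."""
    return (offset % align == 0, size % align == 0)


def tally(ios, align=ALIGN):
    """Count I/Os by their (start_aligned, size_aligned) classification."""
    return Counter(classify(off, size, align) for off, size in ios)


# Hypothetical sample: (byte offset from start of slice, byte size).
sample_ios = [
    (4 * 1024 * 1024, 131072),     # aligned start, aligned size
    (4 * 1024 * 1024 + 512, 512),  # unaligned start, unaligned size
    (8192, 1536),                  # aligned start, unaligned size
]

counts = tally(sample_ios)
for (start_ok, size_ok), n in sorted(counts.items()):
    print(f"start aligned={start_ok} size aligned={size_ok}: {n}")
```

Small, compressed metadata writes would show up in the "size not
aligned" buckets, which matches the warning not to expect much
alignment.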
On Mon, Mar 29, 2010 at 12:21:39PM +1100, Daniel Carosone wrote:
> #1. Use xxd (or similar) to examine the contents of the raw disk
>
> This relies on knowing what to look for, and how that is aligned to
> the start of the partition and to the metaslab addresses and offsets
> that determine the writes we actually care about.

Ok, from looking at (what seems to be a draft of..) the ZFS on disk
format document:

 - There don't seem to be easily identified magic numbers until the
   uberblock replicas. Until then there's various kinds of reserved
   space, and an nvlist that doesn't seem to have an explicit marker.
 - The uberblocks start at 128k into each label, or 128k and 384k
   into the partition.
 - The uberblocks start with a magic number that should be easily
   identified.

And, indeed, I can find them:

0420000: 0cb1ba00 00000000 16000000 00000000  ................
0420400: 0cb1ba00 00000000 16000000 00000000  ................
0420800: 0cb1ba00 00000000 16000000 00000000  ................
0420c00: 0cb1ba00 00000000 16000000 00000000  ................
[...]
043f800: 0cb1ba00 00000000 16000000 00000000  ................
043fc00: 0cb1ba00 00000000 16000000 00000000  ................
[...]
0460000: 0cb1ba00 00000000 16000000 00000000  ................
0460400: 0cb1ba00 00000000 16000000 00000000  ................
0460800: 0cb1ba00 00000000 16000000 00000000  ................
0460c00: 0cb1ba00 00000000 16000000 00000000  ................
[...]
047f800: 0cb1ba00 00000000 16000000 00000000  ................
047fc00: 0cb1ba00 00000000 16000000 00000000  ................

Alas, it also means I can't blame alignment issues for making this
cruddy netbook SSD any slower; it's just this slow really.

PS: It sure seems odd to use a 32-bit constant (0x00bab10c) as a magic
for a 64-bit quantity, especially when stored in native byte order..

> As for the arithmetic.. [..]
> the alignment offset is between
> start-of-partition and zfs data blocks, once various preamble header
> sectors are allowed for.

Data starts at 4M from the start of partition, so for all intents and
purposes it keeps alignment with the partition. At granularities
<=128k, it also keeps alignment with the first uberblock replicas.

I expected as much but the confirmation is nice.

--
Dan.
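For readers who want to repeat the search described in this message,
here is a rough Python sketch of it. It makes several assumptions: the
pool was written on a little-endian host, so the magic appears as the
bytes 0c b1 ba 00 00 00 00 00 as in the dump; sectors are 512 bytes;
and the first hit is the first uberblock of the first label, 128k into
the partition. The function names and the synthetic test image are
hypothetical. Scanning a real disk would mean reading the raw device in
chunks with suitable privileges, which is not shown here.

```python
import struct

# 0x00bab10c stored as a little-endian 64-bit value, as seen in the dump.
# Assumption: little-endian pool; a big-endian pool would store it reversed.
UBERBLOCK_MAGIC = struct.pack("<Q", 0x00BAB10C)
SECTOR = 512                            # assumed sector size, scan step
UBERBLOCK_OFFSET_IN_LABEL = 128 * 1024  # "128k into each label"


def find_uberblocks(data, step=SECTOR):
    """Return every multiple of `step` at which the magic appears in `data`."""
    n = len(UBERBLOCK_MAGIC)
    return [off for off in range(0, len(data) - n + 1, step)
            if data[off:off + n] == UBERBLOCK_MAGIC]


def partition_start_from_first_hit(first_hit):
    """Infer the partition start, assuming the hit is label 0's first uberblock."""
    return first_hit - UBERBLOCK_OFFSET_IN_LABEL


# Synthetic 512KB image (hypothetical data, not a real disk) with two
# uberblock slots planted where a label at offset 0 would put them.
image = bytearray(512 * 1024)
for slot in (0, 1):
    off = UBERBLOCK_OFFSET_IN_LABEL + slot * 1024
    image[off:off + len(UBERBLOCK_MAGIC)] = UBERBLOCK_MAGIC
hits = find_uberblocks(bytes(image))

# The same arithmetic applied to the first address in the dump above.
start = partition_start_from_first_hit(0x0420000)
print(hex(start), start % 4096 == 0)
```

If the dump was taken from the whole-disk device and 0x0420000 really
is the first uberblock, the partition would begin at 0x400000, a
multiple of 4KB, which is consistent with the conclusion that alignment
is not to blame.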
On Sun, Mar 28, 2010 at 09:32:02PM -0700, Richard Elling wrote:
> This is documented in the ZFS on disk format doc.

Yep, I've been there in the meantime.. ;-)

> Use prtvtoc or format to see the beginning of the slice relative to the
> beginning of the partition. I dunno how you tell the start of the partition
> relative to the physical device.

Therein lay my issue. Much discussion of the problem, no simple
guidance for end users to see if they are affected, and a bunch of
variables (partition types and nesting) to get in the way.

At least I now have the basic info that could be scripted into a
simple method.

> The attached dtrace script is a basic analysis of 4KB aligned starts
> (relative to the start of the slice) and 4KB size alignment. Don't be
> surprised to see very little alignment due to compression, metadata,
> and other oddities.

Thanks .. this is useful info for the next level of detail - getting
and verifying the alignment of the pool device overall is a first step
to make sure these io's don't start out behind the 8-ball.

--
Dan.
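As one possible shape for the "simple method" mentioned above, the
arithmetic can be written down in a few lines of Python. This is only a
sketch under stated assumptions: 512-byte sectors, a slice start given
relative to the partition (as prtvtoc reports it, per Richard), and ZFS
data beginning 4MB into the slice, as stated in this thread. The
function names and the example sector numbers are hypothetical, not
taken from any real disk.

```python
SECTOR = 512                        # assumed sector size in bytes
ZFS_DATA_OFFSET = 4 * 1024 * 1024   # data starts 4MB into the slice


def data_start_bytes(partition_start_sector, slice_start_sector,
                     sector=SECTOR):
    """Absolute byte offset on the physical device where ZFS data begins.

    partition_start_sector: partition start relative to the device.
    slice_start_sector: slice start relative to the partition.
    """
    slice_start = (partition_start_sector + slice_start_sector) * sector
    return slice_start + ZFS_DATA_OFFSET


def is_aligned(offset, granularity):
    """True if `offset` is a whole multiple of `granularity` (both bytes)."""
    return offset % granularity == 0


# Hypothetical layouts: a partition at sector 63 versus one at sector 2048,
# each with the slice at the very start of the partition.
for part_start in (63, 2048):
    off = data_start_bytes(part_start, 0)
    print(part_start, off, is_aligned(off, 4096))
```

Because the 4MB data offset is itself a multiple of any power-of-two
granularity up to 4MB, the result depends only on where the partition
and slice start, which is the point made earlier in the thread.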