Hi

After a clean database load, a database would (should?) look like this, if a random stab at the data is taken...

[8KB-m][8KB-n][8KB-o][8KB-p]...

The data should be fairly (100%) sequential in layout... after some days, though, that same spot (using ZFS) would probably look like:

[8KB-m][ ][8KB-o][ ]

Is this "pseudo logical-physical" view correct (if blocks n and p were updated and with COW relocated somewhere else)?

Could a utility be constructed to show the level of "fragmentation"? (50% in the above example)

IF the above theory is flawed... how would fragmentation "look/be observed/calculated" under ZFS with large Oracle tablespaces?

Does it even matter what the "fragmentation" is from a performance perspective?
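A minimal sketch of the fragmentation metric described above, assuming you could obtain a file's logical-to-physical block map (ZFS does not export one through a public interface; in practice it would mean parsing zdb output, which is version-specific and not shown here). The layout list below is the hypothetical example from the post: blocks m and o still in place, n and p relocated by COW.

    def fragmentation_pct(phys_offsets, block_size=8192):
        # Fraction of blocks that no longer sit where a fully sequential
        # layout (anchored at the first block) would have put them.
        start = phys_offsets[0]
        moved = sum(1 for i, off in enumerate(phys_offsets)
                    if off != start + i * block_size)
        return 100.0 * moved / len(phys_offsets)

    # m in place, n relocated, o in place, p relocated (offsets in bytes)
    layout = [0, 10_000_000, 16_384, 20_000_000]
    print(fragmentation_pct(layout))   # 50.0, matching the example above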
Louwtjie Burger writes:

> After a clean database load, a database would (should?) look like this,
> if a random stab at the data is taken...
>
> [8KB-m][8KB-n][8KB-o][8KB-p]...
>
> The data should be fairly (100%) sequential in layout... after some
> days, though, that same spot (using ZFS) would probably look like:
>
> [8KB-m][ ][8KB-o][ ]
>
> Is this "pseudo logical-physical" view correct (if blocks n and p were
> updated and with COW relocated somewhere else)?

That's the proper view if the ZFS recordsize is tuned to be 8KB. That's a best practice that might need to be qualified in the future.

> Could a utility be constructed to show the level of "fragmentation"?
> (50% in the above example)

That would need to dive into the internals of ZFS, but anything is possible. It's been done for UFS before.

> IF the above theory is flawed... how would fragmentation "look/be
> observed/calculated" under ZFS with large Oracle tablespaces?
>
> Does it even matter what the "fragmentation" is from a performance perspective?

It matters to table scans and how those scans will impact OLTP workloads. Good blog topic. Stay tuned.
This question triggered some silly questions in my mind:

Lots of folks are determined that the whole COW-to-different-locations thing is a Bad Thing(tm), and in some cases, I guess it might actually be...

What if ZFS had a pool/filesystem property that caused ZFS to do a journaled, but non-COW, update so the data's relative location for databases is always the same?

Or - what if it did a double update: one to a staged area, and another immediately after that to the 'old' data blocks? You still always have on-disk consistency etc., at the cost of double the I/Os...

Of course, both of these would require non-sparse file creation for the DB etc., but would it be plausible?

For very read-intensive and position-sensitive applications, I guess this sort of capability might make a difference?

Just some stabs in the dark...

Cheers!

Nathan.
Nathan Kroenert wrote:

> This question triggered some silly questions in my mind:
>
> Lots of folks are determined that the whole COW-to-different-locations
> thing is a Bad Thing(tm), and in some cases, I guess it might actually be...

There is a lot of speculation about this, but no real data. I've done some experiments on long seeks and didn't see much of a performance difference, but I wasn't using a database workload. Note that the many caches and optimizations in the path between the database and physical medium will make this very difficult to characterize for a general case. Needless to say, you'll get better performance on a device which can handle multiple outstanding I/Os -- avoid PATA disks.

> What if ZFS had a pool/filesystem property that caused ZFS to do a
> journaled, but non-COW, update so the data's relative location for
> databases is always the same?
>
> Or - what if it did a double update: one to a staged area, and another
> immediately after that to the 'old' data blocks? You still always have
> on-disk consistency etc., at the cost of double the I/Os...

This is a non-starter. Two I/Os is worse than one.

> Of course, both of these would require non-sparse file creation for the
> DB etc., but would it be plausible?
>
> For very read-intensive and position-sensitive applications, I guess
> this sort of capability might make a difference?

We are all anxiously awaiting data...
 -- richard
> This question triggered some silly questions in my mind:

Actually, they're not silly at all.

> Lots of folks are determined that the whole COW-to-different-locations
> thing is a Bad Thing(tm), and in some cases, I guess it might actually be...
>
> What if ZFS had a pool/filesystem property that caused ZFS to do a
> journaled, but non-COW, update so the data's relative location for
> databases is always the same?

That's just what a conventional file system (no need even for a journal, when you're updating in place) does when it's not guaranteeing write atomicity (you address the latter below).

> Or - what if it did a double update: one to a staged area, and another
> immediately after that to the 'old' data blocks? You still always have
> on-disk consistency etc., at the cost of double the I/Os...

It only requires an extra disk access if the new data is too large to dump right into the journal itself (which guarantees that the subsequent in-place update can complete). Whether the new data is dumped into the log or into a temporary location the pointer to which is logged, the subsequent in-place update can be deferred until it's convenient (e.g., until after any additional updates to the same data have also been accumulated, activity has cooled off, and the modified blocks are getting ready to be evicted from the system cache - and, optionally, until the target disks are idle or have their heads positioned conveniently near the target location).

ZFS's small-synchronous-write log can do something similar as long as the writes aren't too large to place in it. However, data that's only persisted in the journal isn't accessible via the normal snapshot mechanisms (well, if an entire file block was dumped into the journal I guess it could be, at the cost of some additional complexity in journal space reuse), so I'm guessing that ZFS writes back any dirty data that's in the small-update journal whenever a snapshot is created.

And if you start actually updating in place as described above, then you can't use ZFS-style snapshotting at all: instead of capturing the current state as the snapshot with the knowledge that any subsequent updates will not disturb it, you have to capture the old state that you're about to overwrite and stuff it somewhere else - and then figure out how to maintain appropriate access to it while the rest of the system moves on.

Snapshots make life a lot more complex for file systems than it used to be, and COW techniques make snapshotting easy at the expense of normal run-time performance - not just because they make update-in-place infeasible for preserving on-disk contiguity but because of the significant increase in disk bandwidth (and snapshot storage space) required to write back changes all the way up to whatever root structure is applicable: I suspect that ZFS does this on every synchronous update save for those that it can leave temporarily in its small-update journal, and it *has* to do it whenever a snapshot is created.

> Of course, both of these would require non-sparse file creation for the
> DB etc., but would it be plausible?

Update-in-place files can still be sparse: it's only data that already exists that must be present (and updated in place to preserve sequential access performance to it).

> For very read-intensive and position-sensitive applications, I guess
> this sort of capability might make a difference?

No question about it.
And sequential table scans in databases are among the most significant examples, because (unlike things like streaming video files which just get laid down initially and non-synchronously in a manner that at least potentially allows ZFS to accumulate them in large, contiguous chunks - though ISTR some discussion about just how well ZFS managed this when it was accommodating multiple such write streams in parallel) the tables are also subject to fine-grained, often-random update activity.

Background defragmentation can help, though it generates a boatload of additional space overhead in any applicable snapshot.

- bill
can you guess? wrote:

>> For very read-intensive and position-sensitive applications, I guess
>> this sort of capability might make a difference?
>
> No question about it. And sequential table scans in databases
> are among the most significant examples, because (unlike things
> like streaming video files which just get laid down initially
> and non-synchronously in a manner that at least potentially
> allows ZFS to accumulate them in large, contiguous chunks -
> though ISTR some discussion about just how well ZFS managed
> this when it was accommodating multiple such write streams in
> parallel) the tables are also subject to fine-grained,
> often-random update activity.
>
> Background defragmentation can help, though it generates a
> boatload of additional space overhead in any applicable snapshot.

The reason that this is hard to characterize is that there are really two very different configurations used to address different performance requirements: cheap and fast. It seems that when most people first consider this problem, they do so from the cheap perspective: the single-disk view. Anyone who strives for database performance will choose the fast perspective: stripes. Note: data redundancy isn't really an issue for this analysis, but consider it done in real life.

When you have a striped storage device under a file system, then the database or file system's view of contiguous data is not contiguous on the media. There are many different ways to place the data on the media, and we would typically strive for a diverse stochastic spread. Hmm... one could theorize that COW will also result in a diverse stochastic spread.

The complexity of the characterization is then caused by the large number of variables which the systems use to spread the data (interlace size, block size, prefetch, caches, cache policies, etc.) and the feasibility of understanding the interdependent relationships these will have on performance. Real data would be greatly appreciated.
 -- richard
> can you guess? wrote:
>>> For very read-intensive and position-sensitive applications, I guess
>>> this sort of capability might make a difference?
>>
>> No question about it. And sequential table scans in databases
>> are among the most significant examples, because (unlike things
>> like streaming video files which just get laid down initially
>> and non-synchronously in a manner that at least potentially
>> allows ZFS to accumulate them in large, contiguous chunks -
>> though ISTR some discussion about just how well ZFS managed
>> this when it was accommodating multiple such write streams in
>> parallel) the tables are also subject to fine-grained,
>> often-random update activity.
>>
>> Background defragmentation can help, though it generates a
>> boatload of additional space overhead in any applicable snapshot.
>
> The reason that this is hard to characterize is that there are
> really two very different configurations used to address different
> performance requirements: cheap and fast. It seems that when most
> people first consider this problem, they do so from the cheap
> perspective: the single-disk view. Anyone who strives for database
> performance will choose the fast perspective: stripes.

And anyone who *really* understands the situation will do both.

> Note: data redundancy isn't really an issue for this analysis,
> but consider it done in real life. When you have a striped storage
> device under a file system, then the database or file system's view
> of contiguous data is not contiguous on the media.

The best solution is to make the data piece-wise contiguous on the media at the appropriate granularity - which is largely determined by disk access characteristics (the following assumes that the database table is large enough to be spread across a lot of disks at moderately coarse granularity, since otherwise it's often small enough to cache in the generous amounts of RAM that are inexpensively available today).

A single chunk on an (S)ATA disk today (the analysis is similar for high-performance SCSI/FC/SAS disks) needn't exceed about 4 MB in size to yield over 80% of the disk's maximum possible (fully-contiguous layout) sequential streaming performance (after the overhead of an 'average' - 1/3-stroke - initial seek and partial rotation are figured in: the latter could be avoided by using a chunk size that's an integral multiple of the track size, but on today's zoned disks that's a bit awkward). A 1 MB chunk yields around 50% of the maximum streaming performance. ZFS's maximum 128 KB 'chunk size', if effectively used as the disk chunk size as you seem to be suggesting, yields only about 15% of the disk's maximum streaming performance (leaving aside an additional degradation to a small fraction of even that should you use RAID-Z). And if you match the ZFS block size to a 16 KB database block size and use that as the effective unit of distribution across the set of disks, you'll obtain a mighty 2% of the potential streaming performance (again, we'll be charitable and ignore the further degradation if RAID-Z is used).
Now, if your system is doing nothing else but sequentially scanning this one database table, this may not be so bad: you get truly awful disk utilization (2% of its potential in the last case, ignoring RAID-Z), but you can still read ahead through the entire disk set and obtain decent sequential scanning performance by reading from all the disks in parallel. But if your database table scan is only one small part of a workload which is (perhaps the worst case) performing many other such scans in parallel, your overall system throughput will be only around 4% of what it could be had you used 1 MB chunks (and the individual scan performances will also suck commensurately, of course).

Using 1 MB chunks still spreads out your database admirably for parallel random-access throughput: even if the table is only 1 GB in size (eminently cachable in RAM, should that be preferable), that'll spread it out across 1,000 disks (2,000, if you mirror it and load-balance to spread out the accesses), and for much smaller database tables, if they're accessed sufficiently heavily for throughput to be an issue, they'll be wholly cache-resident. Or another way to look at it is in terms of how many disks you have in your system: if it's less than the number of MB in your table size, then the table will be spread across all of them regardless of what chunk size is used, so you might as well use one that's large enough to give you decent sequential scanning performance (and if your table is too small to spread across all the disks, then it may well all wind up in cache anyway).

ZFS's problem (well, the one specific to this issue, anyway) is that it tries to use its 'block size' to cover two different needs: performance for moderately fine-grained updates (though its need to propagate those updates upward to the root of the applicable tree significantly compromises this effort), and decent disk utilization (I'm using that term to describe throughput as a fraction of potential streaming throughput: just 'keeping the disks saturated' only describes where the system hits its throughput wall, not how well its design does in pushing that wall back as far as possible). The two requirements conflict, and in ZFS's case the latter one loses - badly.

Which is why background defragmentation could help, as I previously noted: it could rearrange the table such that multiple virtually-sequential ZFS blocks were placed contiguously on each disk (to reach 1 MB total, in the current example) without affecting the ZFS block size per se. But every block so rearranged (and every tree ancestor of each such block) would then leave an equal-sized residue in the most recent snapshot if one existed, which gets expensive fast in terms of snapshot space overhead (which then is proportional to the amount of reorganization performed as well as to the amount of actual data updating).

- bill
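As a rough cross-check of the percentages quoted above, a back-of-the-envelope model; the disk parameters are assumptions (a generic 7200 rpm SATA drive: ~8.5 ms average 1/3-stroke seek, ~4.2 ms average rotational latency, ~80 MB/s media rate), not measurements, so the exact figures will differ from drive to drive:

    SEEK_MS = 8.5          # assumed average (1/3-stroke) seek
    ROTATE_MS = 4.2        # assumed half a revolution at 7200 rpm
    MEDIA_MB_PER_S = 80.0  # assumed media transfer rate

    def streaming_efficiency(chunk_bytes):
        # Fraction of time spent transferring rather than positioning,
        # for one chunk read per seek.
        transfer_ms = chunk_bytes / (MEDIA_MB_PER_S * 1e6) * 1000.0
        return transfer_ms / (SEEK_MS + ROTATE_MS + transfer_ms)

    for label, size in [("16 KB", 16 << 10), ("128 KB", 128 << 10),
                        ("1 MB", 1 << 20), ("4 MB", 4 << 20)]:
        print(f"{label:>7}: {streaming_efficiency(size):.0%} of streaming bandwidth")

    # Prints roughly 2%, 11%, 51% and 80% respectively -- in the same
    # ballpark as the figures above.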
can you guess? wrote:

>> The reason that this is hard to characterize is that there are
>> really two very different configurations used to address different
>> performance requirements: cheap and fast. It seems that when most
>> people first consider this problem, they do so from the cheap
>> perspective: the single-disk view. Anyone who strives for database
>> performance will choose the fast perspective: stripes.
>
> And anyone who *really* understands the situation will do both.

I'm not sure I follow. Many people who do high-performance databases use hardware RAID arrays which often do not expose single disks.

>> Note: data redundancy isn't really an issue for this analysis,
>> but consider it done in real life. When you have a striped storage
>> device under a file system, then the database or file system's view
>> of contiguous data is not contiguous on the media.
>
> The best solution is to make the data piece-wise contiguous on the media at the appropriate granularity - which is largely determined by disk access characteristics (the following assumes that the database table is large enough to be spread across a lot of disks at moderately coarse granularity, since otherwise it's often small enough to cache in the generous amounts of RAM that are inexpensively available today).
>
> A single chunk on an (S)ATA disk today (the analysis is similar for high-performance SCSI/FC/SAS disks) needn't exceed about 4 MB in size to yield over 80% of the disk's maximum possible (fully-contiguous layout) sequential streaming performance (after the overhead of an 'average' - 1/3-stroke - initial seek and partial rotation are figured in: the latter could be avoided by using a chunk size that's an integral multiple of the track size, but on today's zoned disks that's a bit awkward). A 1 MB chunk yields around 50% of the maximum streaming performance. ZFS's maximum 128 KB 'chunk size', if effectively used as the disk chunk size as you seem to be suggesting, yields only about 15% of the disk's maximum streaming performance (leaving aside an additional degradation to a small fraction of even that should you use RAID-Z). And if you match the ZFS block size to a 16 KB database block size and use that as the effective unit of distribution across the set of disks, you'll obtain a mighty 2% of the potential streaming performance (again, we'll be charitable and ignore the further degradation if RAID-Z is used).

You do not seem to be considering the track cache, which for modern disks is 16-32 MBytes.
If those disks are in a RAID array, then there are often larger read caches as well. Expecting a seek and read for each iop is a bad assumption.

> Now, if your system is doing nothing else but sequentially scanning this one database table, this may not be so bad: you get truly awful disk utilization (2% of its potential in the last case, ignoring RAID-Z), but you can still read ahead through the entire disk set and obtain decent sequential scanning performance by reading from all the disks in parallel. But if your database table scan is only one small part of a workload which is (perhaps the worst case) performing many other such scans in parallel, your overall system throughput will be only around 4% of what it could be had you used 1 MB chunks (and the individual scan performances will also suck commensurately, of course).
>
> Using 1 MB chunks still spreads out your database admirably for parallel random-access throughput: even if the table is only 1 GB in size (eminently cachable in RAM, should that be preferable), that'll spread it out across 1,000 disks (2,000, if you mirror it and load-balance to spread out the accesses), and for much smaller database tables, if they're accessed sufficiently heavily for throughput to be an issue, they'll be wholly cache-resident. Or another way to look at it is in terms of how many disks you have in your system: if it's less than the number of MB in your table size, then the table will be spread across all of them regardless of what chunk size is used, so you might as well use one that's large enough to give you decent sequential scanning performance (and if your table is too small to spread across all the disks, then it may well all wind up in cache anyway).
>
> ZFS's problem (well, the one specific to this issue, anyway) is that it tries to use its 'block size' to cover two different needs: performance for moderately fine-grained updates (though its need to propagate those updates upward to the root of the applicable tree significantly compromises this effort), and decent disk utilization (I'm using that term to describe throughput as a fraction of potential streaming throughput: just 'keeping the disks saturated' only describes where the system hits its throughput wall, not how well its design does in pushing that wall back as far as possible). The two requirements conflict, and in ZFS's case the latter one loses - badly.

Real data would be greatly appreciated. In my tests, I see reasonable media bandwidth speeds for reads.

> Which is why background defragmentation could help, as I previously noted: it could rearrange the table such that multiple virtually-sequential ZFS blocks were placed contiguously on each disk (to reach 1 MB total, in the current example) without affecting the ZFS block size per se. But every block so rearranged (and every tree ancestor of each such block) would then leave an equal-sized residue in the most recent snapshot if one existed, which gets expensive fast in terms of snapshot space overhead (which then is proportional to the amount of reorganization performed as well as to the amount of actual data updating).

This comes up often from people who want > 128 kByte block sizes for ZFS. And yet we can demonstrate media bandwidth limits relatively easily. How would you reconcile the differences?
 -- richard
>> Nathan Kroenert wrote:
>> ... What if it did a double update: one to a staged area, and another
>> immediately after that to the 'old' data blocks? You still always have
>> on-disk consistency etc., at the cost of double the I/Os...
>
> This is a non-starter. Two I/Os is worse than one.

Well, that attitude may be supportable for a write-only workload, but then so is the position that you really don't even need *one* I/O (since no one will ever need to read the data and you might as well just drop it on the floor). In the real world, data (especially database data) does usually get read after being written, and the entire reason the original poster raised the question was because sometimes it's well worth taking on some additional write overhead to reduce read overhead.

In such a situation, if you need to protect the database from partial-block updates as well as to keep it reasonably laid out for sequential table access, then performing the two writes described is about as good a solution as one can get (especially if the first of them can be logged - even better, logged in NVRAM - such that its overhead can be amortized across multiple such updates by otherwise independent processes, and even more especially if, as is often the case, the same data gets updated multiple times in sufficiently close succession that instead of 2N writes you wind up only needing to perform N+1 writes, the last being the only one that updates the data in place after the activity has cooled down).

>> Of course, both of these would require non-sparse file creation for the
>> DB etc., but would it be plausible?
>>
>> For very read-intensive and position-sensitive applications, I guess
>> this sort of capability might make a difference?
>
> We are all anxiously awaiting data...

Then you might find it instructive to learn more about the evolution of file systems on Unix:

In The Beginning there was the block, and the block was small, and it was isolated from its brethren, and darkness was upon the face of the deep because any kind of sequential performance well and truly sucked.

Then (after an inexcusably lengthy period of such abject suckage lasting into the '80s) there came into the world FFS, and while there was still only the block the block was at least a bit larger, and it was at least somewhat less isolated from its brethren, and once in a while it actually lived right next to them, and while sequential performance still usually sucked at least it sucked somewhat less.

And then the disciples Kleiman and McVoy looked upon FFS and decided that mere proximity was still insufficient, and they arranged that blocks should (at least when convenient) be aggregated into small groups (56 KB actually not being all that small at the time, given the disk characteristics back then), and the Great Sucking Sound of Unix sequential-access performance was finally reduced to something at least somewhat quieter than a dull roar.

But other disciples had (finally) taken a look at commercial file systems that had been out in the real world for decades and that had had sequential performance down pretty well pat for nearly that long.
And so it came to pass that corporations like Veritas (VxFS), and SGI (EFS & XFS), and IBM (JFS) imported the concept of extents into the Unix pantheon, and the Gods of Throughput looked upon it, and it was good, and (at least in those systems) Unix sequential performance no longer sucked at all, and even non-corporate developers whose faith was strong nearly to the point of being blind could not help but see the virtues revealed there, and began incorporating extents into their own work, yea, even unto ext4.

And the disciple Hitz (for it was he, with a few others) took a somewhat different tack, and came up with a 'write anywhere file layout' but had the foresight to recognize that it needed some mechanism to address sequential performance (not to mention parity-RAID performance). So he abandoned general-purpose approaches in favor of the Appliance, and gave it most uncommodity-like but yet virtuous NVRAM to allow many consecutive updates to be aggregated into not only stripes but adjacent stripes before being dumped to disk, and the Gods of Throughput smiled upon his efforts, and they became known throughout the land.

Now comes back Sun with ZFS, apparently ignorant of the last decade-plus of Unix file system development (let alone development in other systems dating back to the '60s). Blocks, while larger (though not necessarily proportionally larger, due to dramatic increases in disk bandwidth), are once again often isolated from their brethren. True, this makes the COW approach a lot easier to implement, but (leaving aside the debate about whether COW as implemented in ZFS is a good idea at all) there is *no question whatsoever* that it returns a significant degree of suckage to sequential performance - especially for data subject to small, random updates.

Here ends our lesson for today.

- bill
> When you have a striped storage device under a
> file system, then the database or file system's view
> of contiguous data is not contiguous on the media.

Right. That's a good reason to use fairly large stripes. (The primary limiting factor for stripe size is efficient parallel access; using a 100 MB stripe size means that an average 100 MB file gets less than two disks' worth of throughput.)

ZFS, of course, doesn't have this problem, since it's handling the layout on the media; it can store things as contiguously as it wants.

> There are many different ways to place the data on the media and we would
> typically strive for a diverse stochastic spread.

Err ... why?

A random distribution makes reasonable sense if you assume that future read requests are independent, or that they are dependent in unpredictable ways. Now, if you've got sufficient I/O streams, you could argue that requests *are* independent, but in many other cases they are not, and they're usually predictable (particularly after a startup period). Optimizing for the predicted access cases makes sense. (Optimizing for observed access may make sense in some cases as well.)

-- Anton
> ... But every block so rearranged
> (and every tree ancestor of each such block) would
> then leave an equal-sized residue in the most recent
> snapshot if one existed, which gets expensive fast in
> terms of snapshot space overhead (which then is
> proportional to the amount of reorganization
> performed as well as to the amount of actual data
> updating).

Actually, it's not *quite* as bad as that, since the common parent block of multiple children should appear only once in the snapshot, not once for each child moved. Still, it does drive up snapshot overhead, and if you start trying to use snapshots to simulate 'continuous data protection' rather than more sparingly, the problem becomes more significant (because each snapshot will catch any background defragmentation activity at a different point, such that common parent blocks may appear in more than one snapshot even if no child data has actually been updated).

Once you introduce CDP into the process (and it's tempting to, since the file system is in a better position to handle it efficiently than some add-on product), rethinking how one approaches snapshots (and COW in general) starts to make more sense.

- bill
> We are all anxiously awaiting data...
> -- richard

Would it be worthwhile to build a test case?

- Build a PostgreSQL database and import 1 000 000 (or more) lines of data.
- Run single and multiple large table scan queries ... and watch the system, then
- Update a column of each row in the database, run the same queries and watch the system.

Continue updating more columns (to get more "defrag") until you notice something.

I personally believe that since most people will have hardware LUNs (with underlying RAID) and cache, it will be difficult to notice anything, given that those hardware LUNs might be busy with their own wizardry ;) You will also have to minimize the effect of the database cache...

It will be a tough assignment ... maybe someone has already done this?

Thinking about this (very abstract) ... does it really matter?

[8KB-a][8KB-b][8KB-c]

So what if 8KB-b gets updated and moved somewhere else? If the DB gets a request to read 8KB-a, it needs to do an I/O (eliminate all caching). If it gets a request to read 8KB-b, it needs to do an I/O. Does it matter that b is somewhere else ... it still needs to go get it ... only in a very abstract world with read-ahead (both hardware or db) would 8KB-b be in cache after 8KB-a was read.

Hmmm... the only way is to get some data :) *hehe*
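Not the PostgreSQL experiment itself, but a minimal file-level stand-in for it (the path, sizes and update fraction are arbitrary choices): lay a file down sequentially on a small-recordsize ZFS filesystem, time a full sequential read, randomly overwrite 8 KB records synchronously as a crude stand-in for OLTP updates, then time the sequential read again. Caches need to be defeated between timings (e.g. export/import the pool) for the comparison to mean anything.

    import os, random, time

    PATH = "/tank/db/testfile"   # hypothetical dataset, e.g. with recordsize=8k
    SIZE = 1 << 30               # 1 GiB test file
    REC = 8192

    def seq_read_seconds(path):
        t0 = time.time()
        with open(path, "rb", buffering=0) as f:
            while f.read(1 << 20):           # read in 1 MB slices
                pass
        return time.time() - t0

    # 1. Lay the file down sequentially.
    buf = os.urandom(1 << 20)
    with open(PATH, "wb") as f:
        for _ in range(SIZE // len(buf)):
            f.write(buf)
        os.fsync(f.fileno())

    print("fresh layout :", seq_read_seconds(PATH))

    # 2. Randomly overwrite individual 8 KB records, synchronously,
    #    the way a database would.
    with open(PATH, "r+b", buffering=0) as f:
        for _ in range(SIZE // REC // 4):    # rewrite ~25% of the records
            f.seek(random.randrange(SIZE // REC) * REC)
            f.write(os.urandom(REC))
            os.fsync(f.fileno())

    print("after updates:", seq_read_seconds(PATH))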
Anton B. Rang wrote:

>> There are many different ways to place the data on the media and we would
>> typically strive for a diverse stochastic spread.
>
> Err ... why?
>
> A random distribution makes reasonable sense if you assume that future read requests are independent, or that they are dependent in unpredictable ways. Now, if you've got sufficient I/O streams, you could argue that requests *are* independent, but in many other cases they are not, and they're usually predictable (particularly after a startup period). Optimizing for the predicted access cases makes sense. (Optimizing for observed access may make sense in some cases as well.)

For modern disks, media bandwidths are now getting to be > 100 MBytes/s. If you need 500 MBytes/s of sequential read, you'll never get it from one disk. You can get it from multiple disks, so the questions are:

1. How to avoid other bottlenecks, such as a shared fibre channel path? Diversity.
2. How to predict the data layout such that you can guarantee a wide spread? This is much more difficult. But you can use random distribution to reduce the probability (stochastic) that you'll be reading all blocks from one disk.

There are pathological cases, especially for block-aligned data, but those tend to be rather easy to identify when you look at the performance data.
 -- richard
...

> I personally believe that since most people will have hardware LUNs
> (with underlying RAID) and cache, it will be difficult to notice
> anything, given that those hardware LUNs might be busy with their own
> wizardry ;) You will also have to minimize the effect of the database
> cache ...

By definition, once you've got the entire database in cache, none of this matters (though filling up the cache itself takes some added time if the table is fragmented). Most real-world databases don't manage to fit all or even mostly in cache, because people aren't willing to dedicate that much RAM to running them. Instead, they either use a lot less RAM than the database size or share the system with other activity that shares use of the RAM. In other words, they use a cost-effective rather than a money-is-no-object configuration, but then would still like to get the best performance they can from it.

> It will be a tough assignment ... maybe someone has already done this?
>
> Thinking about this (very abstract) ... does it really matter?
>
> [8KB-a][8KB-b][8KB-c]
>
> So what if 8KB-b gets updated and moved somewhere else? If the DB gets
> a request to read 8KB-a, it needs to do an I/O (eliminate all
> caching). If it gets a request to read 8KB-b, it needs to do an I/O.
>
> Does it matter that b is somewhere else ...

Yes, with any competently-designed database.

> it still needs to go get it ... only in a very abstract world with
> read-ahead (both hardware or db) would 8KB-b be in cache after 8KB-a
> was read.

1. If there's no other activity on the disk, then the disk's track cache will acquire the following data when the first block is read, because it has nothing better to do. But if all the disks are just sitting around waiting for this table scan to get to them, then if ZFS has a sufficiently intelligent read-ahead mechanism it could help out a lot here as well: the differences become greater when the system is busier.

2. Even a moderately smart disk will detect a sequential access pattern if one exists and may read ahead at least modestly after having detected that pattern, even if it *does* have other requests pending.

3. But in any event any competent database will explicitly issue prefetches when it knows (and it *does* know) that it is scanning a table sequentially - and will also have taken pains to try to ensure that the table data is laid out such that it can be scanned efficiently. If it's using disks that support tagged command queuing it may just issue a bunch of single-database-block requests at once, and the disk will organize them such that they can all be satisfied by a single streaming access; with disks that don't support queuing, the database can elect to issue a single large I/O request covering many database blocks, accomplishing the same thing as long as the table is in fact laid out contiguously on the medium (the database knows this if it's handling the layout directly, but when it's using a file system as an intermediary it usually can only hope that the file system has minimized file fragmentation).

> Hmmm... the only way is to get some data :) *hehe*

Data is good, as long as you successfully analyze what it actually means: it either tends to confirm one's understanding or to refine it.

- bill
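To make point 3 above concrete, a small sketch of database-style explicit prefetching at the file level, using posix_fadvise(WILLNEED) to ask the OS for the next ~1 MB run of a table file up front so it can be fetched in large sequential I/Os rather than one 16 KB read at a time. The path and block size are illustrative; a real database would use its own scatter/gather or asynchronous I/O machinery instead.

    import os

    TABLE = "/tank/db/orders.tbl"   # illustrative path
    DB_BLOCK = 16384                # 16 KB database blocks
    RUN = 64                        # prefetch ~1 MB at a time

    def table_scan(path, process_block):
        fd = os.open(path, os.O_RDONLY)
        try:
            size = os.fstat(fd).st_size
            offset = 0
            while offset < size:
                # Hint the whole next run to the OS (where posix_fadvise
                # exists), so it can be read ahead in large sequential I/Os.
                if hasattr(os, "posix_fadvise"):
                    os.posix_fadvise(fd, offset, RUN * DB_BLOCK,
                                     os.POSIX_FADV_WILLNEED)
                for _ in range(RUN):
                    block = os.pread(fd, DB_BLOCK, offset)
                    if not block:
                        return
                    process_block(block)
                    offset += DB_BLOCK
        finally:
            os.close(fd)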
Richard Elling wrote:

>>> The reason that this is hard to characterize is that there are
>>> really two very different configurations used to address different
>>> performance requirements: cheap and fast. It seems that when most
>>> people first consider this problem, they do so from the cheap
>>> perspective: the single-disk view. Anyone who strives for database
>>> performance will choose the fast perspective: stripes.
>>
>> And anyone who *really* understands the situation will do both.
>
> I'm not sure I follow. Many people who do high-performance
> databases use hardware RAID arrays which often do not
> expose single disks.

They don't have to expose single disks: they just have to use reasonable chunk sizes on each disk, as I explained later.

Only very early (or very low-end) RAID used very small per-disk chunks (up to 64 KB max). Before the mid-'90s, chunk sizes had grown to 128 - 256 KB per disk on mid-range arrays in order to improve disk utilization in the array. From talking with one of its architects years ago, my impression is that HP's (now somewhat aging) EVA series uses 1 MB as its chunk size (the same size I used as an example, though today one could argue for as much as 4 MB and soon perhaps even more).

The array chunk size is not the unit of update, just the unit of distribution across the array: RAID-5 will happily update a single 4 KB file block within a given array chunk and the associated 4 KB of parity within the parity chunk. But the larger chunk size does allow files to retain the option of using logical contiguity to attain better streaming sequential performance, rather than splintering that logical contiguity at fine grain across multiple disks.

>> A single chunk on an (S)ATA disk today (the analysis is similar for
>> high-performance SCSI/FC/SAS disks) needn't exceed about 4 MB in size
>> to yield over 80% of the disk's maximum possible (fully-contiguous
>> layout) sequential streaming performance (after the overhead of an
>> 'average' - 1/3-stroke - initial seek and partial rotation are figured
>> in: the latter could be avoided by using a chunk size that's an
>> integral multiple of the track size, but on today's zoned disks that's
>> a bit awkward). A 1 MB chunk yields around 50% of the maximum
>> streaming performance. ZFS's maximum 128 KB 'chunk size', if
>> effectively used as the disk chunk size as you seem to be suggesting,
>> yields only about 15% of the disk's maximum streaming performance
>> (leaving aside an additional degradation to a small fraction of even
>> that should you use RAID-Z). And if you match the ZFS block size to a
>> 16 KB database block size and use that as the effective unit of
>> distribution across the set of disks, you'll obtain a mighty 2% of the
>> potential streaming performance (again, we'll be charitable and ignore
>> the further degradation if RAID-Z is used).
>
> You do not seem to be considering the track cache, which for
> modern disks is 16-32 MBytes. If those disks are in a RAID array,
> then there are often larger read caches as well.

Are you talking about hardware RAID in that last comment? I thought ZFS was supposed to eliminate the need for that.

> Expecting a seek and read for each iop is a bad assumption.

The bad assumption is that the disks are otherwise idle and therefore have the luxury of filling up their track caches - especially when I explicitly assumed otherwise in the following paragraph in that post.
If the system is heavily loaded, the disks will usually have other requests queued up (even if the next request comes in immediately rather than being queued at the disk itself, an even half-smart disk will abort any current read-ahead activity so that it can satisfy the new request).

Not that it would necessarily do much good for the case currently under discussion even if the disks weren't otherwise busy and they did fill up the track caches: ZFS's COW policies tend to encourage data that's updated randomly at fine grain (as a database table often is) to be splattered across the storage rather than neatly arranged such that the next data requested from a given disk will just happen to reside right after the previous data requested from that disk.

>> Now, if your system is doing nothing else but sequentially scanning
>> this one database table, this may not be so bad: you get truly awful
>> disk utilization (2% of its potential in the last case, ignoring
>> RAID-Z), but you can still read ahead through the entire disk set and
>> obtain decent sequential scanning performance by reading from all the
>> disks in parallel. But if your database table scan is only one small
>> part of a workload which is (perhaps the worst case) performing many
>> other such scans in parallel, your overall system throughput will be
>> only around 4% of what it could be had you used 1 MB chunks (and the
>> individual scan performances will also suck commensurately, of course).

...

> Real data would be greatly appreciated. In my tests, I see
> reasonable media bandwidth speeds for reads.

You already said that you hadn't been studying databases (the source of the kind of random-update/streaming-read access mix specifically under consideration here). But while they may be one of the worst cases in this respect (especially given their tendency to want to perform synchronous rather than lazy writes), the underlying problem is hardly unique: didn't I see a reference recently to streaming read performance issues with data that had been laid down by multiple concurrent sequential write streams?

>> Which is why background defragmentation could help, as I previously
>> noted: it could rearrange the table such that multiple
>> virtually-sequential ZFS blocks were placed contiguously on each disk
>> (to reach 1 MB total, in the current example) without affecting the
>> ZFS block size per se. But every block so rearranged (and every tree
>> ancestor of each such block) would then leave an equal-sized residue
>> in the most recent snapshot if one existed, which gets expensive fast
>> in terms of snapshot space overhead (which then is proportional to the
>> amount of reorganization performed as well as to the amount of actual
>> data updating).
>
> This comes up often from people who want > 128 kByte block sizes
> for ZFS.

Using larger block sizes to solve this problem would just be piling one kludge on top of another. Block size is not the right answer to streaming performance - you achieve it by arranging *multiple* blocks sensibly on the media, so that you can then use a block size that's otherwise appropriate for the application (e.g., 16 KB for a database that uses 16 KB blocks itself).

> And yet we can demonstrate media bandwidth limits
> relatively easily. How would you reconcile the differences?

Perhaps the difference is that you're happier talking about workloads that make ZFS look good rather than actively looking for workloads that give ZFS fits.
Start looking for what ZFS is *not* good at and you'll find it (and then be able to start thinking about how to fix it).

- bill
...

> For modern disks, media bandwidths are now getting to be > 100 MBytes/s.
> If you need 500 MBytes/s of sequential read, you'll never get it from
> one disk.

And no one here even came remotely close to suggesting that you should try to.

> You can get it from multiple disks, so the questions are:
> 1. How to avoid other bottlenecks, such as a shared fibre channel
>    path? Diversity.
> 2. How to predict the data layout such that you can guarantee a wide
>    spread?

You've missed at least one more significant question:

3. How to lay out the data such that this 500 MB/s drain doesn't cripple *other* concurrent activity going on in the system. (That's what increasing the amount laid down on each drive to around 1 MB accomplishes - otherwise, you can easily wind up using all the system's disk resources to satisfy that one application, or even fall short if you have fewer than 50 disks available, since if you spread the data out relatively randomly in 128 KB chunks on a system with disks reasonably well-filled with data you'll only be obtaining around 10 MB/s from each disk, whereas with 1 MB chunks similarly spread about each disk can contribute more like 35 MB/s and you'll need only 14 - 15 disks to meet your requirement.)

Use smaller ZFS block sizes and/or RAID-Z and things get rapidly worse.

- bill
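The disk-count arithmetic behind those figures, using the same assumed drive parameters as the earlier sketch (the exact per-disk rates shift with whatever seek and media-rate numbers you plug in):

    import math

    SEEK_MS, ROTATE_MS = 8.5, 4.2   # assumed positioning overhead per chunk
    MEDIA_MB_PER_S = 80.0           # assumed media transfer rate
    TARGET = 500.0                  # MB/s of sequential read wanted

    def per_disk_rate(chunk_bytes):
        chunk_mb = chunk_bytes / 1e6
        return chunk_mb / ((SEEK_MS + ROTATE_MS) / 1000.0 + chunk_mb / MEDIA_MB_PER_S)

    for label, size in [("128 KB", 128 << 10), ("1 MB", 1 << 20)]:
        rate = per_disk_rate(size)
        print(f"{label}: ~{rate:.0f} MB/s per busy disk, "
              f"~{math.ceil(TARGET / rate)} disks for {TARGET:.0f} MB/s")

    # Roughly 9 MB/s (~55 disks) for 128 KB chunks versus ~41 MB/s
    # (~13 disks) for 1 MB chunks -- in the same ballpark as the 10 MB/s
    # and 35 MB/s figures above.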
Anton B. Rang writes:

>> When you have a striped storage device under a
>> file system, then the database or file system's view
>> of contiguous data is not contiguous on the media.
>
> Right. That's a good reason to use fairly large stripes. (The
> primary limiting factor for stripe size is efficient parallel access;
> using a 100 MB stripe size means that an average 100 MB file gets less
> than two disks' worth of throughput.)
>
> ZFS, of course, doesn't have this problem, since it's handling the
> layout on the media; it can store things as contiguously as it wants.

It can do what it wants, but currently what it does is to maintain files subject to small random writes contiguous to the level of the ZFS recordsize. After a significant run of random writes, the file ends up with a scattered on-disk layout. This should work well for the transactional parts of the workload.

But the implication of using a small recordsize is that large sequential scans of files will make the disk heads very busy fetching or prefetching recordsize-sized chunks. Get more spindles and a good prefetch algorithm and you can reach whatever throughput you need. The problem is that your scanning ops will create heavy competition at the spindle level, thus impacting the transactional response time (once you have 150 IOPS on every spindle just prefetching data for your full table scans, the OLTP will suffer).

Yes, we do need data to characterise this, but the physics are fairly clear. The BP suggesting a small recordsize needs to be updated. We need to strike a balance between random writes and sequential reads, which does imply using records greater than 8K/16K DB blocks.

-r
> ... currently what it does is to maintain files subject to small
> random writes contiguous to the level of the ZFS recordsize. After a
> significant run of random writes, the file ends up with a scattered
> on-disk layout. This should work well for the transactional parts of
> the workload.

Absolutely (save for the fact that every database block write winds up writing all the block's ancestors as well, but that's a different discussion and one where ZFS looks only somewhat sub-optimal rather than completely uncompetitive when compared with different approaches).

> But the implication of using a small recordsize is that large
> sequential scans of files will make the disk heads very busy fetching
> or prefetching recordsize-sized chunks.

Well, no: that's only an implication if you *choose* not to arrange the individual blocks on the disk to support sequential access better - and that choice can have devastating implications for the kind of workload being discussed here (another horrendous example would be a simple array of fixed-size records in a file accessed - and in particular updated - randomly by ordinal record number converted to a file offset but also scanned sequentially for bulk processing).

Yes, choosing to reorganize files does have the kinds of snapshot implications that I've mentioned, but in most installations (save those entirely dedicated to databases) the situation under discussion here will typically involve only a small subset of the total data stored, and thus reorganization shouldn't have severe consequences.

> Get more spindles and a good prefetch algorithm and you can reach
> whatever throughput you need.

At the cost of using about 25x (or 50x - 200x, if using RAID-Z) as many disks as you'd need for the same throughput if the blocks were laid out in 1 MB chunks rather than the 16 KB chunks in the example database.

> The problem is that your scanning ops will create heavy competition
> at the spindle level, thus impacting the transactional response time
> (once you have 150 IOPS on every spindle just prefetching data for
> your full table scans, the OLTP will suffer).

And this effect on OLTP performance would be dramatically reduced as well if you pulled the sequential scan off the disk in large chunks rather than in randomly-distributed individual blocks.

> Yes, we do need data to characterise this, but the physics are fairly
> clear.

Indeed they are - thanks for recognizing this better than some here have managed to.

> The BP suggesting a small recordsize needs to be updated.
>
> We need to strike a balance between random writes and sequential
> reads, which does imply using records greater than 8K/16K DB blocks.

As Jesus Cea observed in the recent "ZFS + DB + default blocksize" discussion, this then requires that every database block update first read in the larger ZFS record before performing the update, rather than allowing the database block to be written directly. The ZFS block *might* still be present in ZFS's cache, but if most RAM is dedicated to the database cache (which for many workloads makes more sense) the chances of this are reduced (not only is ZFS's cache smaller but the larger database cache will hold the small database blocks a *lot* longer than ZFS's cache will hold any associated large ZFS blocks, so a database block update can easily occur long after the associated ZFS block has been evicted).
Even ignoring the deleterious effect on random database update performance, you still can't get *good* performance for sequential scans this way, because maximum-size 128 KB ZFS blocks laid out randomly are still a factor of about 4 less efficient at this than 1 MB chunks would be (i.e., you'd need about 4x as many disks - or 8x - 32x as many disks if using RAID-Z - to achieve comparable performance).

And it just doesn't have to be that way - as other modern Unix file systems recognized long ago. You don't even need to embrace extent-based allocation as they did, but just rearrange your blocks sensibly - and to at least some degree you could do that while they're still cache-resident if you captured updates for a while in the ZIL (what's the largest update that you're willing to stuff in there now?) before batch-writing them back to less temporary locations. RAID-Z could be fixed as well, which would help a much wider range of workloads.

- bill
Hello All,

Here's a possibly-silly proposal from a non-expert.

Summarising the problem:
- there's a conflict between small ZFS record size, for good random update performance, and large ZFS record size for good sequential read performance
- COW probably makes that conflict worse
- re-packing (~= defragmentation) would make it better, but cause problems with the snapshot mechanism

Proposed solution:
- keep COW
- create a new operation that combines snapshots and cloning
- when you're cloning, always write a tidy, re-packed layout of the data
- if you're using the new operation, keep the existing layout as the clone, and give the new layout to the running file-system

Things that have to be done to make this work:
- sort out the semantics, because the clone will be in the existing zpool, and the file-system will move to a new zpool (not sure if I have the terminology right)
- sort out the transactional properties; the changes made since the start of the operation will have to be copied across into the new layout

Regards,
James.
James Cone wrote:

> Hello All,
>
> Here's a possibly-silly proposal from a non-expert.
>
> Summarising the problem:
> - there's a conflict between small ZFS record size, for good random
>   update performance, and large ZFS record size for good sequential
>   read performance

Poor sequential read performance has not been quantified.

> - COW probably makes that conflict worse

This needs to be proven with a reproducible, real-world workload before it makes sense to try to solve it. After all, if we cannot measure where we are, how can we prove that we've improved?

Note: some block devices will not exhibit the phenomenon which people seem to be worried about in this thread. There are more options than just re-architecting ZFS. I'm not saying there aren't situations where there may be a problem; I'm just observing that nobody has brought data to this party.
 -- richard
Regardless of the merit of the rest of your proposal, I think you have put your finger on the core of the problem.

Aside from some apparent reluctance on the part of some of the ZFS developers to believe that any problem exists here at all (and leaving aside the additional monkey wrench that using RAID-Z here would introduce, because one could argue that files used in this manner are poor candidates for RAID-Z anyway, hence that there's no need to consider reorganizing RAID-Z files), the *only* downside (other than a small matter of coding) to defragmenting files in the background in ZFS is the impact that would have on run-time performance (which should be minimal if the defragmentation is performed at lower priority) and the impact it would have on the space consumed by a snapshot that existed while the defragmentation was being done.

One way to eliminate the latter would be simply not to reorganize while any snapshot (or clone) existed: no worse than the situation today, and better whenever no snapshot or clone is present. That would change the perceived 'expense' of a snapshot, though, since you'd know you were potentially giving up some run-time performance whenever one existed - and it's easy to imagine installations which might otherwise like to run things such that a snapshot was *always* present.

Another approach would be just to accept any increased snapshot space overhead. So many sequentially-accessed files are just written once and read-only thereafter that a lot of installations might not see any increased snapshot overhead at all. Some files are never accessed sequentially (or are accessed sequentially only in situations where performance is unimportant), and if they could be marked "Don't reorganize" then they wouldn't contribute any increased snapshot overhead either. One could introduce controls to limit the times when reorganization was done, though my inclination is to suspect that such additional knobs ought to be unnecessary.

One way to eliminate almost completely the overhead of the additional disk accesses consumed by background defragmentation would be to do it as part of the existing background scrubbing activity, but for actively-updated files one might want to defragment more often than one needed to scrub.

In any event, background defragmentation should be a relatively easy feature to introduce and try out if suitable multi-block contiguous allocation mechanisms already exist to support ZFS's existing batch writes. Use of the ZIL to perform opportunistic defragmentation while updated data was still present in the cache might be a bit more complex, but could still be worth investigating.

- bill
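For what it's worth, the effect of such a reorganization pass can be approximated today from userland, at the cost of exactly the snapshot residue described above: rewriting a file over itself in large sequential runs makes ZFS, being copy-on-write, allocate fresh and (space permitting) contiguous blocks for it. A crude sketch only, not the in-kernel background defragmenter being proposed; the path is illustrative, and it is not safe to run while the database has the file open for writing.

    import os

    CHUNK = 1 << 20   # rewrite in 1 MB runs, the granularity discussed earlier

    def repack(path):
        size = os.path.getsize(path)
        with open(path, "r+b", buffering=0) as f:
            offset = 0
            while offset < size:
                f.seek(offset)
                buf = f.read(CHUNK)
                if not buf:
                    break
                f.seek(offset)
                f.write(buf)            # same bytes, newly allocated location
                offset += len(buf)
            os.fsync(f.fileno())

    # repack("/tank/db/orders.tbl")     # illustrative path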
> Poor sequential read performance has not been quantified.
>
>> - COW probably makes that conflict worse
>
> This needs to be proven with a reproducible, real-world workload before
> it makes sense to try to solve it. After all, if we cannot measure where
> we are, how can we prove that we've improved?

I agree, let's first find a reproducible example where "updates" negatively impact large table scans ... one that is rather simple (if there is one) to reproduce and then work from there. I might be able to help with such an example during the month of December/January :)

I do have a question, though (I should probably ask in database-discuss):

Q: Does a full online backup of a DB (let's say Legato Networker with the Oracle plugin) constitute "large sequential table scans" in the way that it reads database data? It probably is not as simple as that...
My initial thought was that this whole thread may be irrelevant - anybody wanting to run such a database is likely to use a specialised filesystem optimised for it. But then I realised that for a database admin the integrity checking and other benefits of ZFS would be very tempting, but only if ZFS can guarantee equivalent performance to other filesystems.

So, let me see if I understand this right:

- Louwtjie is concerned that ZFS will fragment databases, potentially leading to read performance issues for some databases.

- Nathan appears to have suggested a good workaround. Could ZFS be updated to have a 'contiguous' setting where blocks are kept together? This sacrifices write performance for read.

- Richard isn't convinced there's a problem, as he's not seen any data supporting this. I can see his point, but I don't agree that this is a non-starter. For certain situations it could be very useful, and balancing read and write performance is an integral part of the choice of storage configuration.

- Bill seems to understand the issue, and added some useful background (although in an entertaining but rather condescending way).

Richard then went into a little more detail. I think he's pointing out here that while contiguous data is fastest if you consider a single disk, it is not necessarily the fastest approach when your data is spread across multiple disks. Instead he feels a 'diverse stochastic spread' is needed. I guess that means you want the data spread so all the disks can be used in parallel.

I think I'm now seeing why Richard is asking for real data. I think he believes that ZFS may already be faster than or equal to a standard contiguous filesystem in this scenario. Richard seems to be using a random or statistical approach to this: if data is saved randomly, you're likely to be using all disks when reading data.

I do see the point, and yes, data would be useful, but I think I agree with Bill on this. For reading data, while random locations are likely to be fast in terms of using multiple disks, that data is also likely to be spread and so is almost certain to result in more disk seeks. Whereas if you have contiguous data you can guarantee that it will be striped across the maximum possible number of disks, with the minimum number of seeks.

As a database admin I would take guaranteed performance over probable performance any day of the week. Especially if I can be sure that performance will be consistent and will not degrade as the database ages.

One point that I haven't seen raised yet: I believe most databases will have had years of tuning based around the assumption that their data is saved contiguously on disk. They will be optimising their disk access based on that, and this is not something we should ignore.

Yes, until we have data to demonstrate the problem it's just theoretical. However, that may be hard to obtain, and in the meantime I think the theory is sound, and the solution easy enough that it is worth tackling.

I definitely don't think defragmentation is the solution (although that is needed in ZFS for other scenarios). If your database is under enough read strain to need the fix suggested here, your disks definitely do not have the time needed to scan and defrag the entire system.

It would seem to me that Nathan's suggestion right at the start of the thread is the way to go. It guarantees read performance for the database, and would seem to be relatively easy to implement at the zpool level.
Yes, it adds considerable overhead to writes, but that is a decision database administrators can make given the expected load.

If I'm understanding Nathan right, saving a block of data would mean (a rough sketch of this write path follows this message):

- Reading the original block (may be cached if we're lucky)
- Saving that block to a new location
- Saving the new data to the original location

So you've got a 2-3x slowdown in write performance, but you guarantee that read performance will at least match existing filesystems (with ZFS caching, it may exceed it). ZFS then works much better with all the existing optimisations done within the database software, and you still keep all the benefits of ZFS - full data integrity, snapshots, clones, etc...

For many database admins, I think that would be an option they would like to have.

Taking it a stage further, I wonder if this would work well with the prioritized write feature request (caching writes to a solid state disk)?
http://www.genunix.org/wiki/index.php/OpenSolaris_Storage_Developer_Wish_List

That could potentially mean there's very little slowdown:

- Read the original block
- Save that to solid state disk
- Write the new block in the original location
- Periodically stream writes from the solid state disk to the main storage

In theory there's no need for the drive head to move at all between the read and the write, so this should only be fractionally slower than traditional ZFS writes. Yes, the data needs to be flushed from the solid state store from time to time, but those writes can be batched together for improved performance and streamed to contiguous free space on the disk.

That would appear to then give you the best of both worlds.
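To make the I/O cost of that proposed write path concrete, here is a minimal sketch in Python. It is purely illustrative and assumes nothing about ZFS internals: the Disk class, allocate_block() and the block addresses are invented for the example; it just counts the I/Os of a plain COW update versus the preserve-then-overwrite update described above.

# Illustrative model only: NOT ZFS code. It makes the extra I/O cost of
# the "preserve old copy, then overwrite in place" scheme explicit.

class Disk:
    def __init__(self):
        self.blocks = {}       # block address -> data
        self.io_count = 0
        self.next_free = 1000  # pretend addresses >= 1000 are free space

    def read(self, addr):
        self.io_count += 1
        return self.blocks.get(addr)

    def write(self, addr, data):
        self.io_count += 1
        self.blocks[addr] = data

    def allocate_block(self):
        addr = self.next_free
        self.next_free += 1
        return addr

def cow_update(disk, addr, new_data):
    """Plain COW: new data goes to a fresh location (1 write)."""
    new_addr = disk.allocate_block()
    disk.write(new_addr, new_data)
    return new_addr                      # the file now points at new_addr

def overwrite_in_place_update(disk, addr, new_data):
    """Proposed scheme: old data is relocated, new data lands in place
    (1 read + 2 writes), so the file's physical layout never changes."""
    old_data = disk.read(addr)           # free if the block is still cached
    preserved = disk.allocate_block()
    disk.write(preserved, old_data)      # old version kept for snapshots
    disk.write(addr, new_data)           # logical block keeps its address
    return addr

disk = Disk()
disk.write(42, b"original")
disk.io_count = 0
overwrite_in_place_update(disk, 42, b"updated")
print("I/Os for in-place scheme:", disk.io_count)   # 3, versus 1 for plain COW

The extra read disappears whenever the old block is still cached, which is why the slowdown is quoted above as 2-3x rather than a flat 3x.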
...
> - Nathan appears to have suggested a good workaround.
>   Could ZFS be updated to have a 'contiguous' setting
>   where blocks are kept together? This sacrifices
>   write performance for read.

I had originally thought that this would be incompatible with ZFS's snapshot mechanism, but with a minor tweak it may not be.

...
> - Bill seems to understand the issue, and added some
>   useful background (although in an entertaining but
>   rather condescending way).

There is a bit of nearby history that led to that.

...
> One point that I haven't seen raised yet: I believe
> most databases will have had years of tuning based
> around the assumption that their data is saved
> contiguously on disk. They will be optimising their
> disk access based on that and this is not something
> we should ignore.

Ah - nothing like real, experienced user input. I tend to agree with ZFS's general philosophy of attempting to minimize the number of knobs that need tuning, but this can lead to forgetting that higher-level software may have knobs of its own. My original assumption was that databases automatically attempted to leverage on-disk contiguity (which the more evolved ones certainly do when they're controlling the on-disk layout themselves, and one might suspect they try to do even when running on top of files by assuming that the file system is trying to preserve on-disk contiguity), but of course admins play a major role as well (e.g., in determining which indexes need not be created because sequential table scans can get the job done efficiently).

...
> I definitely don't think defragmentation is the
> solution (although that is needed in ZFS for other
> scenarios). If your database is under enough read
> strain to need the fix suggested here, your disks
> definitely do not have the time needed to scan and
> defrag the entire system.

Well, it's only this kind of randomly-updated/sequentially-scanned data that needs much defragmentation in the first place. Data that's written once and then only read at worst needs a single defragmentation pass (if the original writes got interrupted by a lot of other update activity), data that's not read sequentially (e.g., indirect blocks) needn't be defragmented at all, nor need data that's seldom read and/or not very fragmented in the first place.

> It would seem to me that Nathan's suggestion right at
> the start of the thread is the way to go. It
> guarantees read performance for the database, and
> would seem to be relatively easy to implement at the
> zpool level. Yes, it adds considerable overhead to
> writes, but that is a decision database
> administrators can make given the expected load.
>
> If I'm understanding Nathan right, saving a block of
> data would mean:
> - Reading the original block (may be cached if we're lucky)
> - Saving that block to a new location
> - Saving the new data to the original location

1. You'd still need an initial defragmentation pass to ensure that the file was reasonably piece-wise contiguous to begin with.

2. You can't move the old version of the block without updating all its ancestors (since the pointer to it changes). When you update this path to the old version, you need to suppress the normal COW behavior if a snapshot exists, because it would otherwise maintain the old path pointing to the old data location that you're just about to over-write below.
This presumably requires establishing the entire new path and deallocating the entire old path in a single transaction, but this may just be equivalent to a normal data block 'update' (that just doesn't happen to change any data in the block) when no snapshot exists. I don't *think* that there should be any new issues raised with other updates that may be combined in the same 'transaction', even if they may affect some of the same ancestral blocks.

3. You can't just slide in the new version of the block using the old version's existing set of ancestors because a) you just deallocated that path above (introducing additional mechanism to preserve it temporarily almost certainly would not be wise), b) the data block checksum changed, and c) in any event this new path should be *newer* than the path to the old version's new location that you just had to establish (if a snapshot exists, that's the path that should be propagated to it by the COW mechanism). However, this is just the normal situation whenever you update a data block: all the *additional* overhead occurred in the previous steps.

Given that doing the update twice, as described above, only adds to the bandwidth consumed (steps 2 and 3 should be able to be combined in a single transaction), the only additional disk seek would be that required to re-read the original data if it wasn't cached. So you may well be correct that this approach would likely consume fewer resources than background defragmentation would (though, as noted above, you'd still need an initial defrag pass to establish initial contiguity), and while the additional resources would be consumed at normal rather than reduced priority, the file would be kept contiguous all the time rather than just returned to contiguity whenever there was time to do so.

...
> Taking it a stage further, I wonder if this would
> work well with the prioritized write feature request
> (caching writes to a solid state disk)?
> http://www.genunix.org/wiki/index.php/OpenSolaris_Storage_Developer_Wish_List
>
> That could potentially mean there's very little slowdown:
> - Read the original block
> - Save that to solid state disk
> - Write the new block in the original location
> - Periodically stream writes from the solid state disk to the main storage

I don't think this applies (nor would it confer any benefit) if things in fact need to be handled as I described above.

- bill
In that case, this may be a much tougher nut to crack than I thought. I'll be the first to admit that other than having seen a few presentations I don't have a clue about the details of how ZFS works under the hood, however...

You mention that moving the old block means updating all its ancestors. I had naively assumed moving a block would be relatively simple, and would also update all the ancestors.

My understanding of ZFS (in short: an upside-down tree) is that each block is referenced by its parent. So regardless of how many snapshots you take, each block is only ever referenced by one other, and I'm guessing that the pointer and checksum are both stored there.

If that's the case, to move a block it's just a case of:
- read the data
- write to the new location
- update the pointer in the parent block

Please let me know if I'm mis-understanding ZFS here.

The major problem with this is that I don't know if there's any easy way to identify the parent block from the child, or an efficient way to do this move. However, thinking about it, there must be. ZFS intelligently moves data if it detects corruption, so there must already be tools in place to do exactly what we need here. In which case, this is still relatively simple and much of the code already exists.
...
> My understanding of ZFS (in short: an upside-down
> tree) is that each block is referenced by its
> parent. So regardless of how many snapshots you take,
> each block is only ever referenced by one other, and
> I'm guessing that the pointer and checksum are both
> stored there.
>
> If that's the case, to move a block it's just a case of:
> - read the data
> - write to the new location
> - update the pointer in the parent block

Which changes the contents of the parent block (the change in the data checksum changed it as well), and thus requires that this parent also be rewritten (using COW), which changes the pointer to it (and of course its checksum as well) in *its* parent block, which thus also must be re-written... and finally a new copy of the superblock is written to reflect the new underlying tree structure - all this in a single batch-written 'transaction'. The old version of each of these blocks need only be *saved* if a snapshot exists and it hasn't previously been updated since that snapshot was created. But all the blocks need to be COWed even if no snapshot exists (in which case the old versions are simply discarded). (A toy model of this copy-on-write cascade is sketched after this message.)

...
> PS.
>
>> 1. You'd still need an initial defragmentation pass
>> to ensure that the file was reasonably piece-wise
>> contiguous to begin with.
>
> No, not necessarily. If you were using a zpool
> configured like this I'd hope you were planning on
> creating the file as a contiguous block in the first
> place :)

I'm not certain that you could ensure this if other updates in the system were occurring concurrently. Furthermore, the file may be extended dynamically as new data is inserted, and you'd like to have some mechanism that could restore reasonable contiguity to the result (which can be difficult to accomplish in the foreground if, for example, free space doesn't happen to exist on the disk right after the existing portion of the file).

...
> Any zpool with this option would probably be
> dedicated to the database file and nothing else. In
> fact, even with multiple databases I think I'd have a
> single pool per database.

It's nice if you can afford such dedicated resources, but it seems a bit cavalier to ignore users who just want decent performance from a database that has to share its resources with other activity.

Your prompt response is probably what prevented me from editing my previous post after I re-read it and realized I had overlooked the fact that over-writing the old data complicates things. So I'll just post the revised portion here:

3. Now you must make the above transaction persistent, and then randomly over-write the old data block with the new data (since that data must be in place before you update the path to it below, and unfortunately, since its location is not arbitrary, you can't combine this update with either the transaction above or the transaction below).

4. You can't just slide in the new version of the block using the old version's existing set of ancestors because a) you just deallocated that path above (introducing additional mechanism to preserve it temporarily almost certainly would not be wise), b) the data block checksum changed, and c) in any event this new path should be *newer* than the path to the old version's new location that you just had to establish (if a snapshot exists, that's the path that should be propagated to it by the COW mechanism).
However, this is just the normal situation whenever you update a data block (save for the fact that the block itself was already written above): all the *additional* overhead occurred in the previous steps.

So instead of a single full-path update that fragments the file, you have two full-path updates, a random write, and possibly a random read initially to fetch the old data. And you still need an initial defrag pass to establish initial contiguity. Furthermore, these additional resources are consumed at normal rather than the reduced priority at which a background reorg can operate. On the plus side, though, the file would be kept contiguous all the time rather than just returned to contiguity whenever there was time to do so.

...
> Taking it a stage further, I wonder if this would
> work well with the prioritized write feature request
> (caching writes to a solid state disk)?
> http://www.genunix.org/wiki/index.php/OpenSolaris_Storage_Developer_Wish_List
>
> That could potentially mean there's very little slowdown:
> - Read the original block
> - Save that to solid state disk
> - Write the new block in the original location
> - Periodically stream writes from the solid state disk to the main storage

I'm not sure this would confer much benefit if things in fact need to be handled as I described above. In particular, if a snapshot exists you almost certainly must establish the old version in its new location in the snapshot rather than just capture it in the log; if no snapshot exists you could capture the old version in the log and then discard it as soon as the new version becomes persistent, but I'm not sure how easily that (and especially recovering should a crash occur before the new version becomes persistent) could be integrated with the existing COW facilities.

- bill
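Here is a toy model in Python of why a single block change cascades all the way to the root. It is not ZFS's actual on-disk structure (the Block class, put() and the address counter are invented for the example); it only illustrates that because each parent stores its child's address and checksum, copy-on-writing a leaf forces a fresh copy of every ancestor, while the old copies remain available for a snapshot.

# Toy COW tree: changing one leaf rewrites the whole ancestor path.

import hashlib

def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()[:8]

class Block:
    def __init__(self, data=b"", children=None):
        self.data = data
        self.children = children or []   # list of (address, checksum)

    def contents(self) -> bytes:
        return self.data + b"".join(
            f"{a}:{c}".encode() for a, c in self.children)

storage = {}          # address -> Block; never overwritten (COW)
next_addr = [0]

def put(block: Block):
    addr = next_addr[0]; next_addr[0] += 1
    storage[addr] = block
    return addr, checksum(block.contents())

def cow_write_leaf(path_of_parents, new_leaf_data):
    """Write a new leaf plus a new copy of every ancestor; return new root."""
    addr, csum = put(Block(new_leaf_data))
    # Walk back up: each parent gets a fresh copy pointing at its new child.
    for parent, child_index in reversed(path_of_parents):
        kids = list(parent.children)
        kids[child_index] = (addr, csum)
        addr, csum = put(Block(parent.data, kids))
    return addr

# Build leaf -> parent -> root, then update the leaf once.
leaf_addr, leaf_sum = put(Block(b"old data"))
parent = Block(children=[(leaf_addr, leaf_sum)])
p_addr, p_sum = put(parent)
root = Block(children=[(p_addr, p_sum)])
r_addr, _ = put(root)

new_root = cow_write_leaf([(root, 0), (parent, 0)], b"new data")
# Three new blocks (leaf, parent, root) were written; the old three still
# sit untouched in 'storage', which is all a snapshot needs to keep them.
print("old root:", r_addr, "new root:", new_root)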
Hmm... that's a pain if updating the parent also means updating the parent's checksum too. I guess the functionality is there for moving bad blocks, but since that's likely to be a rare occurrence, it wasn't something that would need to be particularly efficient.

With regards to sharing the disk resources with other programs, obviously it's down to the individual admins how they would configure this, but I would suggest that if you have a database with heavy enough requirements to be suffering noticeable read performance issues due to fragmentation, then that database really should have its own dedicated drives and shouldn't be competing with other programs.

I'm not saying defrag is bad (it may be the better solution here), just that if you're looking at performance in this kind of depth, you're probably experienced enough to have created the database in a contiguous chunk in the first place :-)

I do agree that doing these writes now sounds like a lot of work. I'm guessing that needing two full-path updates to achieve this means you're talking about a much greater write penalty. And that means you can probably expect a significant read penalty if you have any significant volume of writes at all, which would rather defeat the point. After all, if you have a low enough amount of writes to not suffer from this penalty, your database isn't going to be particularly fragmented.

However, I'm now in over my depth. This needs somebody who knows the internal architecture of ZFS to decide whether it's feasible or desirable, and whether defrag is a good enough workaround. It may be that ZFS is not a good fit for this kind of use, and that if you're really concerned about this kind of performance you should be looking at other file systems.
...
> With regards to sharing the disk resources with other
> programs, obviously it's down to the individual
> admins how they would configure this,

Only if they have an unconstrained budget.

> but I would
> suggest that if you have a database with heavy enough
> requirements to be suffering noticeable read
> performance issues due to fragmentation, then that
> database really should have its own dedicated drives
> and shouldn't be competing with other programs.

You're not looking at it from a whole-system viewpoint (which, if you're accustomed to having your own dedicated storage devices, is understandable). Even if your database performance is acceptable, if it's performing 50x as many disk seeks as it would otherwise need to when scanning a table, that's affecting the performance of *other* applications.

> Also, I'm not saying defrag is bad (it may be the
> better solution here), just that if you're looking at
> performance in this kind of depth, you're probably
> experienced enough to have created the database in a
> contiguous chunk in the first place :-)

As I noted, ZFS may not allow you to ensure that, and in any event if the database grows that contiguity may need to be reestablished. You could grow the db in separate files, each of which was preallocated in full (though again ZFS may not allow you to ensure that each is created contiguously on disk), but while databases may include such facilities as a matter of course, it would still (all other things being equal) be easier to manage everything if it could just extend a single existing file (or one file per table, if they needed to be kept separate) as it needed additional space.

> I do agree that doing these writes now sounds like a
> lot of work. I'm guessing that needing two full-path
> updates to achieve this means you're talking about a
> much greater write penalty.

Not all that much. Each full-path update is still only a single write request to the disk, since all the path blocks (again, possibly excepting the superblock) are batch-written together, thus mostly increasing only streaming bandwidth consumption.

...
> It may be that ZFS is not a good fit for this kind of
> use, and that if you're really concerned about this
> kind of performance you should be looking at other
> file systems.

I suspect that while it may not be a great fit now, with relatively minor changes it could be at least an acceptable one.

- bill
Louwtjie Burger wrote:> Richard Elling wrote: > > > > > - COW probably makes that conflict worse > > > > > > > > > > This needs to be proven with a reproducible, real-world > workload before it > > makes sense to try to solve it. After all, if we cannot > measure where > > we are, > > how can we prove that we''ve improved? > > I agree, let''s first find a reproducible example where "updates" > negatively impacts large table scans ... one that is rather simple (if > there is one) to reproduce and then work from there.I''d say it would be possible to define a reproducible workload that demonstrates this using the Filebench tool... I haven''t worked with it much (maybe over the holidays I''ll be able to do this), but I think a workload like: 1) create a large file (bigger than main memory) on an empty ZFS pool. 2) time a sequential scan of the file 3) random write i/o over say, 50% of the file (either with or without matching blocksize) 4) time a sequential scan of the file The difference between times 2 and 4 are the "penalty" that COW block reordering (which may introduce seemingly-random seeks between "sequential" blocks) imposes on the system. It would be interesting to watch seeksize.d''s output during this run too. --Joe
>> doing these writes now sounds like a
>> lot of work. I'm guessing that needing two full-path
>> updates to achieve this means you're talking about a
>> much greater write penalty.
>
> Not all that much. Each full-path update is still
> only a single write request to the disk, since all
> the path blocks (again, possibly excepting the
> superblock) are batch-written together, thus mostly
> increasing only streaming bandwidth consumption.

Ok, that took some thinking about. I'm pretty new to ZFS, so I've only just gotten my head around how CoW works, and I'm not used to thinking about files at this kind of level. I'd not considered that path blocks would be batch-written close together, but of course that makes sense.

What I'd been thinking was that ordinarily files would get fragmented as they age, which would make these updates slower as blocks would be scattered over the disk, so a full-path update would take some time. I'd forgotten that the whole point of doing this is to prevent fragmentation...

So a nice side effect of this approach is that if you use it, it makes itself more efficient :D
Rats - I was right the first time: there's a messy problem with snapshots.

The problem is that the parent of the child that you're about to update in place may *already* be in one or more snapshots, because one or more of its *other* children was updated since each snapshot was created. If so, then each snapshot copy of the parent is pointing to the location of the existing copy of the child you now want to update in place, and unless you change the snapshot copy of the parent (as well as the current copy of the parent) the snapshot will point to the *new* copy of the child you are now about to update (with an incorrect checksum to boot). With enough snapshots, enough children, and bad enough luck, you might have to change the parent (and of course all its ancestors...) in every snapshot.

In other words, Nathan's approach is pretty much infeasible in the presence of snapshots. Background defragmentation works as long as you move the entire region (which often has a single common parent) to a new location, which if the source region isn't excessively fragmented may not be all that expensive; it's probably not something you'd want to try at normal priority *during* an update to make Nathan's approach work, though, especially since you'd then wind up moving the entire region on every such update rather than in one batch in the background.

- bill
On Nov 19, 2007 10:08 PM, Richard Elling <Richard.Elling at sun.com> wrote:
> James Cone wrote:
>> Hello All,
>>
>> Here's a possibly-silly proposal from a non-expert.
>>
>> Summarising the problem:
>> - there's a conflict between small ZFS record size, for good random
>>   update performance, and large ZFS record size for good sequential
>>   read performance
>
> Poor sequential read performance has not been quantified.

I think this is a good point. A lot of solutions are being thrown around, and the problems are only theoretical at the moment. Conventional solutions may not even be appropriate for something like ZFS.

The point that makes me skeptical is this: blocks do not need to be logically contiguous to be (nearly) physically contiguous. As long as you reallocate the blocks close to the originals, chances are that a scan of the file will end up being mostly physically contiguous reads anyway (a toy illustration of this follows this message). ZFS's intelligent prefetching along with the disk's track cache should allow for good performance even in this case. ZFS may or may not already do this, I haven't checked. Obviously, you won't want to keep a year's worth of snapshots, or run the pool near capacity. With a few minor tweaks, though, it should work quite well.

Talking about fundamental ZFS design flaws at this point seems unnecessary, to say the least.

Chris
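The following toy Python sketch illustrates that idea; it is not how ZFS's metaslab allocator actually works, and the NearbyAllocator class and addresses are invented for the example. It just shows that a COW reallocation policy preferring free space near a block's previous address keeps a logically sequential file physically local, so a sequential scan stays seek-friendly.

# Toy "allocate near the old address" policy; purely illustrative.

import bisect

class NearbyAllocator:
    def __init__(self, free_blocks):
        self.free = sorted(free_blocks)        # sorted free block addresses

    def alloc_near(self, old_addr):
        """Return the free block closest to old_addr."""
        i = bisect.bisect_left(self.free, old_addr)
        candidates = []
        if i < len(self.free):
            candidates.append(self.free[i])
        if i > 0:
            candidates.append(self.free[i - 1])
        if not candidates:
            raise RuntimeError("no free space")
        best = min(candidates, key=lambda a: abs(a - old_addr))
        self.free.remove(best)
        return best

# A file occupying blocks 0..99; the nearest free space starts at 100.
alloc = NearbyAllocator(range(100, 200))
layout = list(range(100))                      # logical block -> physical addr

for logical in (10, 11, 12, 13):               # COW-update four adjacent blocks
    layout[logical] = alloc.alloc_near(layout[logical])

# The relocated blocks stay clustered together, as close to the original
# run as the free space allows, so the scan needs only one extra seek.
print(layout[8:16])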
But the whole point of snapshots is that they don't take up extra space on the disk. If a file (and hence a block) is in every snapshot, it doesn't mean you've got multiple copies of it. You only have one copy of that block; it's just referenced by many snapshots.

The thing is, the location of that block isn't saved separately in every snapshot either - the location is just stored in its parent. So moving a block is just a case of updating one parent. So regardless of how many snapshots the parent is in, you only have to update one parent to point it at the new location for the *old* data. Then you save the new data to the old location and ensure the current tree points to that.

If you think about it, that has to work for the old data since, as I said before, ZFS already has this functionality. If ZFS detects a bad block, it moves it to a new location on disk. If it can already do that without affecting any of the existing snapshots, there's no reason to think we couldn't use the same code for a different purpose.

Ultimately, your old snapshots get fragmented, but the live data stays contiguous.
> But the whole point of snapshots is that they don't
> take up extra space on the disk. If a file (and
> hence a block) is in every snapshot, it doesn't mean
> you've got multiple copies of it. You only have one
> copy of that block; it's just referenced by many
> snapshots.

I used the wording "copies of a parent" loosely to mean "previous states of the parent that also contain pointers to the current state of the child about to be updated in place".

> The thing is, the location of that block isn't saved
> separately in every snapshot either - the location is
> just stored in its parent.

And in every earlier version of the parent that was updated for some *other* reason and still contains a pointer to the current child that someone using that snapshot must be able to follow correctly.

> So moving a block is
> just a case of updating one parent.

No: every version of the parent that points to the current version of the child must be updated.

...
> If you think about it, that has to work for the old
> data since, as I said before, ZFS already has this
> functionality. If ZFS detects a bad block, it moves
> it to a new location on disk. If it can already do
> that without affecting any of the existing snapshots,
> there's no reason to think we couldn't use the same
> code for a different purpose.

Only if it works the way you think it works, rather than, say, by using a look-aside list of moved blocks (there shouldn't be that many of them), or by just leaving the bad block in the snapshot (if it's mirrored or parity-protected, it'll still be usable there unless a second failure occurs; if not, then it was lost anyway).

- bill
On Nov 20, 2007 5:33 PM, can you guess? <billtodd at metrocast.net> wrote:
>> But the whole point of snapshots is that they don't
>> take up extra space on the disk. If a file (and
>> hence a block) is in every snapshot, it doesn't mean
>> you've got multiple copies of it. You only have one
>> copy of that block; it's just referenced by many
>> snapshots.
>
> I used the wording "copies of a parent" loosely to mean "previous
> states of the parent that also contain pointers to the current state of
> the child about to be updated in place".

But children are never updated in place. When a new block is written to a leaf, new blocks are used for all the ancestors back to the superblock, and then the old ones are either freed or held on to by the snapshot.

> And in every earlier version of the parent that was updated for some
> *other* reason and still contains a pointer to the current child that
> someone using that snapshot must be able to follow correctly.

The snapshot doesn't get the 'current' child - it gets the one that was there when the snapshot was taken.

> No: every version of the parent that points to the current version of
> the child must be updated.

Even with clones, the 'parent' and the 'clone' are allowed to diverge - they contain different data.

Perhaps I'm missing something. Excluding ditto blocks, when in ZFS would two parents point to the same child and need to both be updated when the child is updated?

Will
On Tue, 20 Nov 2007, Ross wrote:
>>> doing these writes now sounds like a
>>> lot of work. I'm guessing that needing two full-path
>>> updates to achieve this means you're talking about a
>>> much greater write penalty.
>>
>> Not all that much. Each full-path update is still
>> only a single write request to the disk, since all
>> the path blocks (again, possibly excepting the
>> superblock) are batch-written together, thus mostly
>> increasing only streaming bandwidth consumption.
>
... reformatted ...
> Ok, that took some thinking about. I'm pretty new to ZFS, so I've
> only just gotten my head around how CoW works, and I'm not used to
> thinking about files at this kind of level. I'd not considered that
> path blocks would be batch-written close together, but of course
> that makes sense.
>
> What I'd been thinking was that ordinarily files would get
> fragmented as they age, which would make these updates slower as
> blocks would be scattered over the disk, so a full-path update would
> take some time. I'd forgotten that the whole point of doing this is
> to prevent fragmentation...
>
> So a nice side effect of this approach is that if you use it, it
> makes itself more efficient :D

Here's a couple of resources that'll help you get up to speed with ZFS internals:

a) From the London OpenSolaris User Group (LOSUG) session, presented by Jarod Nash, TSC Systems Engineer, entitled "ZFS: Under The Hood":

   ZFS-UTH_3_v1.1_LOSUG.pdf
   zfs_data_structures_for_single_file.pdf

   also referred to as "ZFS Internals Lite".

and

b) the ZFS on-disk Specification:

   ondiskformat0822.pdf

Regards,

Al Hopper
Logical Approach Inc, Plano, TX.
...
> just rearrange your blocks sensibly - and to at least some degree
> you could do that while they're still cache-resident

Lots of discussion has passed under the bridge since that observation above, but it may have contained the core of a virtually free solution: let your table become fragmented, but each time that a sequential scan is performed on it, determine whether the region that you're currently scanning is *sufficiently* fragmented that you should retain the sequential blocks that you've just had to access anyway in cache until you've built up around 1 MB of them, and then (in a background thread) flush the result contiguously back to a new location in a single bulk 'update' that changes only their location rather than their contents. (A sketch of this heuristic follows this message.)

1. You don't incur any extra reads, since you were reading sequentially anyway and already have the relevant blocks in cache. Yes, if you had reorganized earlier in the background the current scan would have gone faster, but if scans occur sufficiently frequently for their performance to be a significant issue then the *previous* scan will probably not have left things *all* that fragmented. This is why you choose a fragmentation threshold to trigger reorg rather than just do it whenever there's any fragmentation at all, since the latter would probably not be cost-effective in some circumstances; conversely, if you only perform sequential scans once in a blue moon, every one may be completely fragmented, but it probably wouldn't have been worth defragmenting constantly in the background to avoid this, and the occasional reorg triggered by the rare scan won't constitute enough additional overhead to justify heroic efforts to avoid it.

Such a 'threshold' is a crude but possibly adequate metric; a better but more complex one would perhaps nudge up the threshold value every time a sequential scan took place without an intervening update, such that rarely-updated but frequently-scanned files would eventually approach full contiguity, and an even finer-grained metric would maintain such information about each individual *region* in a file, but absent evidence that the single, crude, unchanging threshold (probably set to defragment moderately aggressively - e.g., whenever it takes more than 3 or 5 disk seeks to inhale a 1 MB region) is inadequate, these sound a bit like over-kill.

2. You don't defragment data that's never sequentially scanned, avoiding unnecessary system activity and snapshot space consumption.

3. You still incur additional snapshot overhead for data that you do decide to defragment, for each block that hadn't already been modified since the most recent snapshot, but performing the local reorg as a batch operation means that only a single copy of all affected ancestor blocks will wind up in the snapshot due to the reorg (rather than potentially multiple copies in multiple snapshots if snapshots were frequent and movement was performed one block at a time).

- bill
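Here is a hedged sketch, in Python, of the scan-triggered reorg heuristic just described. The names, the 1 MB region size and the seek-count threshold are illustrative only, and a real implementation would of course live inside the filesystem rather than in user code; the sketch just shows the decision being made per region during a sequential scan.

# Sketch: queue a 1 MB region for background contiguous rewrite whenever
# reading it during a sequential scan needed more seeks than the threshold.

SEEK_THRESHOLD    = 4                 # reorg if > 4 seeks per 1 MB region
REGION_BYTES      = 1 * 1024 * 1024
BLOCK_BYTES       = 8 * 1024
BLOCKS_PER_REGION = REGION_BYTES // BLOCK_BYTES

def count_seeks(physical_addrs):
    """A 'seek' here is any jump between physically non-adjacent blocks."""
    seeks = 1                          # the first block always costs one
    for prev, cur in zip(physical_addrs, physical_addrs[1:]):
        if cur != prev + 1:
            seeks += 1
    return seeks

def sequential_scan(file_map, reorg_queue):
    """file_map: logical block -> physical address, as seen during the scan.
    Regions worse than the threshold are queued for a background rewrite,
    reusing the copies of the blocks the scan just pulled into cache."""
    for start in range(0, len(file_map), BLOCKS_PER_REGION):
        region = file_map[start:start + BLOCKS_PER_REGION]
        if count_seeks(region) > SEEK_THRESHOLD:
            reorg_queue.append((start, len(region)))   # defer to background

# Example: a file whose third region became fragmented by random updates.
import random
file_map = list(range(256)) + random.sample(range(1000, 2000), 128)
queue = []
sequential_scan(file_map, queue)
print("regions queued for contiguous rewrite:", queue)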
Moore, Joe writes:
> Louwtjie Burger wrote:
>> Richard Elling wrote:
>>>> - COW probably makes that conflict worse
>>>
>>> This needs to be proven with a reproducible, real-world workload
>>> before it makes sense to try to solve it. After all, if we cannot
>>> measure where we are, how can we prove that we've improved?
>>
>> I agree, let's first find a reproducible example where "updates"
>> negatively impact large table scans ... one that is rather simple (if
>> there is one) to reproduce and then work from there.
>
> I'd say it would be possible to define a reproducible workload that
> demonstrates this using the Filebench tool... I haven't worked with it
> much (maybe over the holidays I'll be able to do this), but I think a
> workload like:
>
> 1) create a large file (bigger than main memory) on an empty ZFS pool.
> 2) time a sequential scan of the file
> 3) random write i/o over, say, 50% of the file (either with or without
>    a matching blocksize)
> 4) time a sequential scan of the file
>
> The difference between times 2 and 4 is the "penalty" that COW block
> reordering (which may introduce seemingly-random seeks between
> "sequential" blocks) imposes on the system.

But it's not the only thing. The difference between 2 and 4 is the COW penalty that one can hide under prefetching and many spindles. The other thing is to see what the impact (throughput and response time) of the file scan operation is on the ongoing random write load. Third is the impact on CPU cycles required to do the filescans.

-r

> It would be interesting to watch seeksize.d's output during this run
> too.
>
> --Joe
In order to be reasonably representative of a real-world situation, I'd suggest the following additions:

> 1) create a large file (bigger than main memory) on an empty ZFS pool.

1a. The pool should include entire disks, not small partitions (else seeks will be artificially short).

1b. The file needs to be a *lot* bigger than the cache available to it, else caching effects on the reads will be non-negligible.

1c. Unless the file fills up a large percentage of the pool, the rest of the pool needs to be fairly full (else the seeks that updating the file generates will, again, be artificially short ones).

> 2) time a sequential scan of the file
> 3) random write i/o over, say, 50% of the file (either with or without
>    a matching blocksize)

3a. Unless the file itself fills up a large percentage of the pool, do this while significant other updating activity is also occurring in the pool, so that the local holes in the original file layout created by some of its updates don't get favored for use by subsequent updates to the same file (again, artificially shortening seeks).

- bill
BillTodd wrote:
> In order to be reasonably representative of a real-world
> situation, I'd suggest the following additions:

Your suggestions (make the benchmark big enough so seek times are really noticed) are good. I'm hoping that over the holidays I'll get to play with an extra server... If I'm lucky, I'll have 2x36GB drives (in a 1-2GB memory server) that I can dedicate to their own mirrored zfs pool. I figure a 30GB test file should make the seek times interesting.

There's also a needed:

5) Run the same microbenchmark against a UFS filesystem, to compare the step2/step4 ratio with what a non-COW filesystem offers. In theory, the UFS ratio "should" be 1:1; that is, sequential read performance should not be affected by the intervening random writes. (In the case of my test server, I'll make it an SVM mirror of the same 2 drives.)

--Joe
...
> This needs to be proven with a reproducible, real-world workload before
> it makes sense to try to solve it. After all, if we cannot measure where
> we are, how can we prove that we've improved?

Ah - Tests & Measurements types: you've just gotta love 'em.

Wife: "Darling, is there really supposed to be that much water in the bottom of our boat?"

T&M: "There's almost always a little water in the bottom of a boat, Love."

Wife: "But I think it's getting deeper!"

T&M: "I suppose you *could* be right: I'll just put this mark where the water is now, and then after a few minutes we can see if it really has gotten deeper and, if so, just how much we really may need to worry about it."

Wife: "I think I'll use this bucket to get rid of some of it, just in case."

T&M: "No, don't do that: then we won't be able to see how bad the problem is!"

Wife: "But -"

T&M: "And try not to rock the boat: it changes the level of the water at the mark that I just made."

Wife: "I'm really not a very good swimmer, dear: let's just head for shore."

T&M: "That would be silly if there turns out not to be any problem, wouldn't it?"

(Wife hits T&M over head with bucket, grabs oars, and starts rowing.)

- bill
OK, I'll bite; it's not like I'm getting an answer to my other question.

Bill, please explain why deciding what to do about sequential scan performance in ZFS is urgent?

i.e., why it's urgent rather than important (I agree that if it's bad then it's going to be important eventually).

i.e., why it's too urgent to work out, first, how to measure whether we're succeeding.

Regards,
James.

can you guess? wrote:
<snip>
> Ah - Tests & Measurements types: you've just gotta love 'em.
>
> Wife: "Darling, is there really supposed to be that much water in the
> bottom of our boat?"
<snip>
> Wife: "I'm really not a very good swimmer, dear: let's just head for shore."
>
> T&M: "That would be silly if there turns out not to be any problem,
> wouldn't it?"
>
> (Wife hits T&M over head with bucket, grabs oars, and starts rowing.)
>
> - bill
> OK, I'll bite; it's not like I'm getting an answer to
> my other question.

Did I miss one somewhere?

> Bill, please explain why deciding what to do about
> sequential scan performance in ZFS is urgent?

It's not so much that it's 'urgent' (anyone affected by it simply won't use ZFS) as that it's a no-brainer.

> i.e., why it's urgent rather than important (I agree
> that if it's bad then it's going to be important eventually).

It's bad, and it's important now for anyone who cares whether ZFS is viable for such workloads.

> i.e., why it's too urgent to work out, first, how to
> measure whether we're succeeding.

You don't have to measure the *rate* at which the depth of the water in the boat is rising in order to know that you've got a problem that needs addressing. You don't have to measure *just how bad* sequential performance in a badly-fragmented file is to know that you've got a problem that needs addressing (see both Anton's and Roch's comments if you don't find mine convincing). *After* you've tried to fix things, *then* it makes sense to measure just how close you got to ideal streaming-sequential disk bandwidth in order to see whether you need to work some more.

Right now, the only reason to measure precisely how awful sequential scanning performance can get after severely fragmenting a file by updating it randomly in small chunks is to be able to hand out "Attaboy!"s for how much changing it improved things - even though this by itself *still* won't say anything about whether the result attained offered reasonable performance in comparison with what's attainable (which is what should *really* be the basis for handing out any "Attaboy!"s).

Rather than make a politically-incorrect comment about the Special Olympics here, I'll just ask whether common sense is no longer considered an essential attribute in an engineer: given the nature of the discussions about this and about RAID-Z, I've really got to wonder.

- bill