Roland Mainz
2006-May-30 01:16 UTC
[zfs-discuss] Re: [osol-discuss] Re: I wish Sun would open-source "QFS"... / was:Re: Re: Distributed File System for Solaris
UNIX admin wrote:
> > > There's still an opening in the shared filesystem
> > > space (multi-reader and multi-writer). Fix QFS, or extend ZFS?
>
> That one's a no-brainer, innit? Extend ZFS and plough on.

Uhm... I think this is not that easy. Based on IRC feedback I think it
may be difficult to implement the intended features, e.g. storing inodes
and data on separate disks. We had several projects in the past where
this was the only way to guarantee good performance for realtime data
collection and processing, and due to the lack of such a feature in ZFS
we still need QFS...

----

Bye,
Roland

P.S.: Reply-To: set to ZFS filesystem discussion list
<zfs-discuss at opensolaris.org>

-- 
  __ .  . __
 (o.\ \/ /.o) roland.mainz at nrubsig.org
  \__\/\/__/  MPEG specialist, C&&JAVA&&Sun&&Unix programmer
  /O /==\ O\  TEL +49 641 7950090
 (;O/ \/ \O;)
Casper.Dik at Sun.COM
2006-May-30 04:19 UTC
[zfs-discuss] Re: [osol-discuss] Re: I wish Sun would open-source "QFS"... / was:Re: Re: Distributed File System for Solaris
>UNIX admin wrote:
>> > There's still an opening in the shared filesystem
>> > space (multi-reader and multi-writer). Fix QFS, or extend ZFS?
>>
>> That one's a no-brainer, innit? Extend ZFS and plough on.
>
>Uhm... I think this is not that easy. Based on IRC feedback I think it
>may be difficult to implement the intended features, e.g. storing inodes
>and data on separate disks. We had several projects in the past where
>this was the only way to guarantee good performance for realtime data
>collection and processing, and due to the lack of such a feature in ZFS
>we still need QFS...

I'm assuming this means you've measured the performance and found ZFS
wanting?

I don't get it; zfs is a copy-on-write filesystem, so there should be no
hotspotting of disks and, theoretically, write performance could be
maxed out.

The requirement is not that inodes and data are separate; the
requirement is a specific upper bound on disk transactions. The question
therefore is not "when will ZFS be able to separate inodes and data";
the question is when ZFS will meet the QoS criteria.

Casper
Anton B. Rang
2006-May-30 15:13 UTC
[zfs-discuss] Re: [osol-discuss] Re: I wish Sun would open-source"QFS"... / was:Re: Re: Distributed File System for Solaris
Well, I don't know about his particular case, but many QFS clients have
found the separation of data and metadata to be invaluable. The primary
reason is that it avoids disk seeks. We have QFS customers who are
running at over 90% of theoretical bandwidth on a medium-sized set of
FibreChannel controllers and need to maintain that streaming rate.
Taking a seek to update the on-disk inodes once a minute or so slowed
down transfers enough that QFS was invented. ;-)

QFS uses an allocate-forward policy, which means that the disk head is
always moving in one direction and, for new file creation (the data
capture case), we issue large writes that are always sequential. (And
when multiple files are being captured simultaneously, they can be
directed onto different physical disk arrays within the same file
system, to avoid interference.)

ZFS will be a great file system for transactional work (small
reads/writes) and its data integrity should be unmatched. But for large
streaming, it's hard to beat QFS. (And it will take some cleverness to
figure out a multi-host ZFS.)

(For what it's worth, the current 128K-per-I/O policy of ZFS really
hurts its performance for large writes. I imagine this would not be too
difficult to fix if we allowed multiple 128K blocks to be allocated as
a group.)

This message posted from opensolaris.org
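For illustration, a toy sketch of the allocate-forward policy described
above; this is not QFS code, and the class and parameter names are
invented:

# Toy allocate-forward allocator: the write cursor only advances, so
# back-to-back allocations for streaming writes stay contiguous and the
# head never seeks backwards. Illustration of the policy, not real code.

class AllocateForward:
    def __init__(self, device_blocks):
        self.device_blocks = device_blocks
        self.cursor = 0                    # next free block, always increasing

    def allocate(self, nblocks):
        """Return the starting block of a contiguous extent, or None if full."""
        if self.cursor + nblocks > self.device_blocks:
            return None                    # wrap/reclaim not modeled here
        start = self.cursor
        self.cursor += nblocks             # head keeps moving in one direction
        return start

alloc = AllocateForward(device_blocks=1 << 20)
extents = [alloc.allocate(256) for _ in range(4)]  # four 1 MB writes of 4K blocks
print(extents)  # [0, 256, 512, 768] -- strictly increasing, no back-seeks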
Casper.Dik at Sun.COM
2006-May-30 15:36 UTC
[zfs-discuss] Re: [osol-discuss] Re: I wish Sun would open-source"QFS"... / was:Re: Re: Distributed File System for Solaris
>Well, I don't know about his particular case, but many QFS clients
>have found the separation of data and metadata to be invaluable. The
>primary reason is that it avoids disk seeks. We have QFS customers who
>are running at over 90% of theoretical bandwidth on a medium-sized set
>of FibreChannel controllers and need to maintain that streaming rate.
>Taking a seek to update the on-disk inodes once a minute or so slowed
>down transfers enough that QFS was invented. ;-)

That does not answer the question I asked; since ZFS is a copy-on-write
filesystem, there's no fixed inode location and streaming writes should
always be possible.

So, in theory ZFS can do this and mix metadata and data. That's why I
asked for any practical input into this matter.

There are, I think, four different outcomes possible of such an
experiment and subsequent analysis:

	ZFS does just fine, thank you
	ZFS doesn't measure up but can be fixed without splitting meta data.
	ZFS doesn't measure up and can only be fixed by allowing a logical split
	ZFS doesn't measure up and cannot be fixed

My money is on #2.

>ZFS will be a great file system for transactional work (small
>reads/writes) and its data integrity should be unmatched. But for large
>streaming, it's hard to beat QFS. (And it will take some cleverness to
>figure out a multi-host ZFS.)

I think ZFS should do fine in streaming mode also, though there are
currently some shortcomings, such as the mentioned 128K I/O size.

Casper
Nicolas Williams
2006-May-30 16:16 UTC
[zfs-discuss] Re: [osol-discuss] Re: I wish Sun would open-source "QFS"... / was:Re: Re: Distributed File System for Solaris
On Tue, May 30, 2006 at 06:19:16AM +0200, Casper.Dik at Sun.COM wrote:
> The requirement is not that inodes and data are separate; the
> requirement is a specific upper bound on disk transactions. The
> question therefore is not "when will ZFS be able to separate inodes
> and data"; the question is when ZFS will meet the QoS criteria.

And if it were a requirement, surely ZFS/pools could be hacked on to
support a notion of meta-data vdevs, and dnodes/dnode-file/directory
blocks could be allocated on meta-data vdevs.

But I don't see it as a requirement either.
Anton Rang
2006-May-30 16:23 UTC
[zfs-discuss] Re: [osol-discuss] Re: I wish Sun would open-source"QFS"... / was:Re: Re: Distributed File System for Solaris
On May 30, 2006, at 10:36 AM, Casper.Dik at Sun.COM wrote:

> That does not answer the question I asked; since ZFS is a copy-on-write
> filesystem, there's no fixed inode location and streaming writes should
> always be possible.

The überblock still must be updated, however. This may not be an issue
if its updates don't have to be done on the data devices, but I believe
the current design has a copy (several actually) on each device for
redundancy.

> I think ZFS should do fine in streaming mode also, though there are
> currently some shortcomings, such as the mentioned 128K I/O size.

It may eventually. The lack of direct I/O may also be an issue, since
some of our systems don't have enough main memory bandwidth to support
data being extensively touched by the CPU between capture (DMA in) and
writing (DMA out).

-- Anton
Nicolas Williams
2006-May-30 16:25 UTC
[zfs-discuss] Re: [osol-discuss] Re: I wish Sun would open-source"QFS"... / was:Re: Re: Distributed File System for Solaris
On Tue, May 30, 2006 at 08:13:56AM -0700, Anton B. Rang wrote:
> Well, I don't know about his particular case, but many QFS clients
> have found the separation of data and metadata to be invaluable. The
> primary reason is that it avoids disk seeks. We have QFS customers who
                                                ^^^^^^^^^^^^^^^^^^^^

Are you talking about reads or writes?

Anyways, for reads separating data and meta-data helps, sure, but so
would adding mirrors. And anyways, separating meta-data/data _caching_
may make as much difference.

> are running at over 90% of theoretical bandwidth on a medium-sized set
> of FibreChannel controllers and need to maintain that streaming rate.
> Taking a seek to update the on-disk inodes once a minute or so slowed
> down transfers enough that QFS was invented. ;-)

So we're talking about writes then, in which case ZFS should not seek
because there are no fixed inode locations (there are fixed root block
locations though).

> (For what it's worth, the current 128K-per-I/O policy of ZFS really
> hurts its performance for large writes. I imagine this would not be
> too difficult to fix if we allowed multiple 128K blocks to be
> allocated as a group.)

I've been following the thread on this and that's not clear yet.

Sure, the block size may be 128KB, but ZFS can bundle more than one
per-file/transaction, so that the block size shouldn't matter so much --
it may be a meta-data and read I/O trade-off, but should not have much
impact on write performance. It may be that implementation-wise the
128KB block size does affect write performance, but design-wise I don't
see why it should.
Anton Rang
2006-May-30 16:43 UTC
[zfs-discuss] Re: [osol-discuss] Re: I wish Sun would open-source"QFS"... / was:Re: Re: Distributed File System for Solaris
On May 30, 2006, at 11:25 AM, Nicolas Williams wrote:

> On Tue, May 30, 2006 at 08:13:56AM -0700, Anton B. Rang wrote:
>> Well, I don't know about his particular case, but many QFS clients
>> have found the separation of data and metadata to be invaluable. The
>> primary reason is that it avoids disk seeks. We have QFS customers who
> ^^^^^^^^^^^^^^^^^^^^
>
> Are you talking about reads or writes?

Writes -- that's what's important for data capture, which is where I
entered this thread. ;-)  Sorry for the confusion.

> So we're talking about writes then, in which case ZFS should not seek
> because there are no fixed inode locations (there are fixed root block
> locations though).

There are actually three separate issues here.

The first is the fixed root block. This one may be a problem, but it
may be easy enough to mark certain logical units in a pool as "no root
block on this device."

The second is the allocation policy. If ZFS used an allocate-forward
policy, as QFS does, it should be able to avoid seeks. Note that this
is optimal for data capture but not for most other workloads, as it
tends to spread data across the whole disk over time, rather than
keeping it concentrated in a smaller region (with concomitant faster
seek times).

The third is the write scheduling policy. QFS, when used in data capture
applications, uses direct I/O and hence issues writes in sequential
block order. ZFS should do the same to get peak performance from its
devices for streaming (though intelligent devices can absorb some
misordering, it is usually at some performance penalty).

>> (For what it's worth, the current 128K-per-I/O policy of ZFS really
>> hurts its performance for large writes. I imagine this would not be
>> too difficult to fix if we allowed multiple 128K blocks to be
>> allocated as a group.)
>
> I've been following the thread on this and that's not clear yet.
>
> Sure, the block size may be 128KB, but ZFS can bundle more than one
> per-file/transaction

But it doesn't right now, as far as I can tell. I never see ZFS issuing
a 16 MB write, for instance. You simply can't get the same performance
from a disk array issuing 128 KB writes that you can with 16 MB writes.
It's physically impossible because of protocol overhead, even if the
controller itself were infinitely fast. (There's also the issue that at
128 KB, most disk arrays will choose to cache rather than stream the
data, since it's less than a single RAID stripe, which slows you down.)

-- Anton
Nicolas Williams
2006-May-30 17:23 UTC
[zfs-discuss] Re: [osol-discuss] Re: I wish Sun would open-source"QFS"... / was:Re: Re: Distributed File System for Solaris
On Tue, May 30, 2006 at 11:43:41AM -0500, Anton Rang wrote:
> There are actually three separate issues here.
>
> The first is the fixed root block. This one may be a problem, but it
> may be easy enough to mark certain logical units in a pool as "no root
> block on this device."

I don't think that's very creative.

Another way is to have lots of pre-allocated next-uberblock locations,
so that seek-to-one-uberblock times are always small. Each uberblock
can point to its predecessor and its copies and list the pre-allocated
possible locations of its successors. You'd still need some well-known,
non-COWed uber-uberblocks, but these would need to be updated
infrequently -- less frequently than once per transaction, the trade-off
being the time to find the latest set of uberblocks on mount.

Data/meta-data on-disk separation doesn't seem to be the answer for
write performance. It may make a big difference to separate memory
allocations for caching data vs. meta-data though, and there must be a
reason why this is being pursued by the IETF NFSv4 WG (see pNFS). But
for local write performance it makes no sense to me.

It could be that transactions are a problem though, for all I know,
since transacting may mean punctuating physical writes. But this seems
like a matter of trade-offs, and clearly it's better to have
transactions than not.

> The second is the allocation policy. If ZFS used an allocate-forward
> policy, as QFS does, it should be able to avoid seeks. Note that this
> is optimal for data capture but not for most other workloads, as it
> tends to spread data across the whole disk over time, rather than
> keeping it concentrated in a smaller region (with concomitant faster
> seek times).

The on-disk layout of ZFS does not dictate block allocation policies.

> The third is the write scheduling policy. QFS, when used in data
> capture applications, uses direct I/O and hence issues writes in
> sequential block order. ZFS should do the same to get peak performance
> from its devices for streaming (though intelligent devices can absorb
> some misordering, it is usually at some performance penalty).

Again. So far we're talking about potential improvements to the
implementation, not the on-disk layout, with the possible exception of
fixed well-known uberblock locations.

> >> (For what it's worth, the current 128K-per-I/O policy of ZFS really
> >> hurts its performance for large writes. I imagine this would not be
> >> too difficult to fix if we allowed multiple 128K blocks to be
> >> allocated as a group.)
> >
> > I've been following the thread on this and that's not clear yet.
> >
> > Sure, the block size may be 128KB, but ZFS can bundle more than one
> > per-file/transaction
>
> But it doesn't right now, as far as I can tell. I never see ZFS issuing
> a 16 MB write, for instance. You simply can't get the same performance
> from a disk array issuing 128 KB writes that you can with 16 MB writes.
> It's physically impossible because of protocol overhead, even if the
> controller itself were infinitely fast. (There's also the issue that at
> 128 KB, most disk arrays will choose to cache rather than stream the
> data, since it's less than a single RAID stripe, which slows you down.)

I'll leave this to Ron, you et al. to hash out, but nothing in the
on-disk layout prevents ZFS from bundling _most_ (i.e., excluding
updates of uberblocks) of each transaction as one large write, AFAICT.

Nico
--
Richard Elling
2006-May-30 19:16 UTC
[zfs-discuss] Re: [osol-discuss] Re: I wish Sun would open-source"QFS"... / was:Re: Re: Distributed File System for Solaris
[assuming we're talking about disks and not "hardware RAID arrays"...]

On Tue, 2006-05-30 at 11:43 -0500, Anton Rang wrote:
> > Sure, the block size may be 128KB, but ZFS can bundle more than one
> > per-file/transaction
>
> But it doesn't right now, as far as I can tell.

The protocol overhead is still orders of magnitude faster than a rev.
Sure, there are pathological cases such as FC-AL over 200kms with 100+
nodes, but most folks won't hurt themselves like that.

For modern disks, multiple 128kByte transfers will spend a long time
in the disk's buffer cache waiting to be written to media.

> I never see ZFS issuing
> a 16 MB write, for instance. You simply can't get the same performance
> from a disk array issuing 128 KB writes that you can with 16 MB writes.
> It's physically impossible because of protocol overhead, even if the
> controller itself were infinitely fast. (There's also the issue that at
> 128 KB, most disk arrays will choose to cache rather than stream the
> data, since it's less than a single RAID stripe, which slows you down.)

Very few disks have 16MByte write buffer caches, so if you want to send
such a large iop down the wire (DAS please, otherwise you kill the SAN),
then you'll be waiting on the media anyway. The disk interconnect is
faster than the media speed. I don't see how you could avoid blowing a
rev in that case. Surely there is a more generally applicable blocksize
which is appropriate.

Since many disks today do support queued commands, I don't see the
128kByte iop as a large, inherent limitation. OTOH, the jury is still
out...
 -- richard
Anton Rang
2006-May-30 19:26 UTC
[zfs-discuss] Re: [osol-discuss] Re: I wish Sun would open-source"QFS"... / was:Re: Re: Distributed File System for Solaris
On May 30, 2006, at 12:23 PM, Nicolas Williams wrote:

> Another way is to have lots of pre-allocated next-uberblock locations,
> so that seek-to-one-uberblock times are always small. Each uberblock
> can point to its predecessor and its copies and list the pre-allocated
> possible locations of its successors.

That's a possibility, though it could be difficult to distinguish an
uberblock from a data block after a crash (in the worst case), since now
you're writing both into the same arena. You'd also need to skip past
some disk areas (to get to the next uberblock) at each transaction,
which will cost some small amount of bandwidth.

> The on-disk layout of ZFS does not dictate block allocation policies.

Precisely, which is why I broke the issues apart. Two of them, at least,
can be attacked through simple code changes. The uberblock update may or
may not be an issue. It would be interesting to test this, by changing
the implementation in the other areas and seeing whether we can succeed
in matching the streaming performance of other file systems, and where
the bottlenecks are.

It's worth pointing out (maybe?) that having an uberblock (or, for that
matter, an indirect block) stored in the "middle" of your data may be a
problem, if it results in issuing a short read to the disk. Performance
is better if you read 4 MB from disk and throw out a small piece in the
middle than if you do a 2 MB read followed by a slightly shorter read to
skip the piece you don't want. Again, this does not require an on-disk
layout change.

Honestly, I'm not sure that focusing on latency-sensitive streaming
applications is worth it until we can get the bandwidth issues of ZFS
nailed down. There's some work yet to reach the 95%-of-device-speed
mark. How close does ZFS get to writing at 8 GB/sec on an F15K?

It's also worth noting that the customers for whom streaming is a real
issue tend to be those who are willing to spend a lot of money for
reliability (think replicating the whole system+storage) rather than
compromising performance; for them, simply the checksumming overhead and
lack of direct I/O in (today's) ZFS may be unacceptable. Is it worth the
effort to change ZFS to satisfy the requirements of that relative
handful of customers? I'd rather see us focus on adding functionality
that we can use to sell Solaris to large numbers of customers, and thus
building our customer base. We have a solution for streaming already,
while we're just entering the reliability and ease-of-administration
space, where the real opportunity lies.

-- Anton
Nicolas Williams
2006-May-30 19:51 UTC
[zfs-discuss] Re: [osol-discuss] Re: I wish Sun would open-source"QFS"... / was:Re: Re: Distributed File System for Solaris
On Tue, May 30, 2006 at 02:26:07PM -0500, Anton Rang wrote:
> On May 30, 2006, at 12:23 PM, Nicolas Williams wrote:
>
> > Another way is to have lots of pre-allocated next-uberblock locations,
> > so that seek-to-one-uberblock times are always small. Each uberblock
> > can point to its predecessor and its copies and list the pre-allocated
> > possible locations of its successors.
>
> That's a possibility, though it could be difficult to distinguish an
> uberblock from a data block after a crash (in the worst case), since now
> you're writing both into the same arena.

I don't agree. ZFS already has to deal with this (root blocks have to be
self-checksumming). Basically the new uberblocks would reference their
predecessors and would include: a checksum of the predecessor and a
self-checksum. The likelihood that some non-uberblock data in a block
that was once pre-allocated as a possible uberblock could look like a
valid uberblock can be kept vanishingly small.

OTOH, this would present an attack vector, so such blocks should not be
freed for normal filesystem use until uber-uberblocks have been updated
and this attack vector has been closed.

> You'd also need to skip past
> some disk areas (to get to the next uberblock) at each transaction,
> which will cost some small amount of bandwidth.

Yes and no. You're doing transactions, which already means you're
punctuating writes. And if you pre-allocate enough potential next
uberblocks you can make this cost very small, even when you have
back-to-back transactions in the pipeline.

> Honestly, I'm not sure that focusing on latency-sensitive streaming
> applications is worth it until we can get the bandwidth issues of ZFS
> nailed down. There's some work yet to reach the 95%-of-device-speed
> mark. How close does ZFS get to writing at 8 GB/sec on an F15K?

Sure.

> It's also worth noting that the customers for whom streaming is a real
> issue tend to be those who are willing to spend a lot of money for
> reliability (think replicating the whole system+storage) rather than
> compromising performance; for them, simply the checksumming overhead
> and lack of direct I/O in (today's) ZFS may be unacceptable.

Which is it? They want reliability, or they don't? It's not always a
reliability vs. performance trade-off. RAID-Z is a performance
improvement (for writes) over RAID-5 precisely because of the
COW/transactional + the ZFS block-checksums-in-pointers approach to
integrity protection.

The choice to use ZFS or not will depend on specific requirements that
one has identified. If you require a system where multiple cluster nodes
can write to the same filesystems concurrently then ZFS won't do for
you, for example. OTOH, if you want protection against bit rot then ZFS
is for you. Etcetera, etcetera.

> Is it
> worth the effort to change ZFS to satisfy the requirements of that
> relative handful of customers? I'd rather see us focus on adding
> functionality that we can use to sell Solaris to large numbers of
> customers, and thus building our customer base. We have a solution
> for streaming already, while we're just entering the reliability and
> ease-of-administration space, where the real opportunity lies.

I'm pretty sure that the ZFS team has the right set of priorities and
will adjust as necessary. I just don't buy your arguments about design
:-)

Nico
--
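A rough sketch of the recovery rule proposed above, with invented field
names and a SHA-256 stand-in for whatever checksum would actually be
used; it only illustrates how a candidate block in a pre-allocated slot
could be accepted or rejected as the next uberblock:

# Sketch only: a candidate in a pre-allocated slot is accepted as the next
# uberblock if its self-checksum verifies and it correctly checksums its
# predecessor. Field names are invented for illustration.

import hashlib, json

def digest(payload: dict) -> str:
    return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()

def make_uberblock(txg, prev):
    payload = {
        "txg": txg,
        "prev_checksum": prev["self_checksum"] if prev else None,
        "next_candidates": [txg + 1, txg + 2],   # pre-allocated successor slots
    }
    return {"payload": payload, "self_checksum": digest(payload)}

def looks_like_valid_successor(candidate, current):
    p = candidate.get("payload", {})
    return (candidate.get("self_checksum") == digest(p)              # self-checksum
            and p.get("prev_checksum") == current["self_checksum"])  # chains back

u1 = make_uberblock(1, None)
u2 = make_uberblock(2, u1)
garbage = {"payload": {"txg": 2}, "self_checksum": "not-a-real-checksum"}
print(looks_like_valid_successor(u2, u1),
      looks_like_valid_successor(garbage, u1))    # True False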
Anton Rang
2006-May-30 19:59 UTC
[zfs-discuss] Re: [osol-discuss] Re: I wish Sun would open-source"QFS"... / was:Re: Re: Distributed File System for Solaris
On May 30, 2006, at 2:16 PM, Richard Elling wrote:

> [assuming we're talking about disks and not "hardware RAID arrays"...]

It'd be interesting to know how many customers plan to use raw disks,
and how their performance relates to hardware arrays. (My gut feeling
is that a lot of disks on FC probably isn't too bad, though on parallel
SCSI the negotiation overhead and lack of fairness was awful, but I
haven't tested this.)

> On Tue, 2006-05-30 at 11:43 -0500, Anton Rang wrote:
>>> Sure, the block size may be 128KB, but ZFS can bundle more than one
>>> per-file/transaction
>>
>> But it doesn't right now, as far as I can tell.
>
> The protocol overhead is still orders of magnitude faster than a
> rev. Sure, there are pathological cases such as FC-AL over
> 200kms with 100+ nodes, but most folks won't hurt themselves like
> that.

OK. Let's take 4 Gb FC (e.g. array hardware). Sending 128 KB will take
roughly 330 microseconds. If we're going to achieve 95% of theoretical
rate, then each transaction can have no more than 5% of that for
overhead, or 16 microseconds. That's pretty darn fast. For that matter,
the Solaris host would have to initiate 3,000 writes per second to keep
the channel busy. For each channel. And a host might well have 20
channels. Can our FC stack do that? Not yet, though it's been looked
at....

At 16 MB [why 16? because we can't do 32 MB in a WRITE(10) command] we
have some more leeway. Sending 16 MB will take roughly 42 ms. Each
transaction can take 5% of that, or 2 ms, for overhead, and still reach
the 95% mark. And we only need to issue 24 commands per second to keep
the channel saturated. No problem....

Single disks still run FC at 2 Gb, so the numbers above are roughly
halved, and since it takes 2-4 disks to max out a channel, you can also
multiply the allowable overhead time on the disk by a factor of 2-4.
That gives the disk about 16*2*4 = 128 microseconds to process a
command. The disk might be able to do that. Solaris (and the HBA) still
need to push out 1500 writes per second (per channel), though. A good
HBA may be able to do that....

> For modern disks, multiple 128kByte transfers will spend a long time
> in the disk's buffer cache waiting to be written to media.

They shouldn't spend that long, really. Today's Cheetah has a 200 MB/sec
interface and a 59-118 MB/sec transfer rate to media, so at best we can
fill the cache a little over twice as fast as it empties. (Once we put
multiple disks on the channel, it's easy to have the cache empty faster
than we fill it -- this is actually the desirable case, so that we're
not waiting on the media.)

> Very few disks have 16MByte write buffer caches, so if you want to
> send such a large iop down the wire (DAS please, otherwise you kill
> the SAN), then you'll be waiting on the media anyway. The disk
> interconnect is faster than the media speed. I don't see how you could
> avoid blowing a rev in that case.

Yes, we'll wait on the media. We'll never lose a rev, though. Each track
on a Cheetah holds an average of 400 KB (1.6 MB/cylinder), so each time
that we change tracks, we'll likely have the buffer full with all the
data for the track. Even if we don't, FC transfers data out of order, so
the drive can re-order if it deems necessary (in the desirable
cache-empty case).

But until we have a well-configured test system to benchmark, this is
rather academic. :-) I suspect our customers will quickly tell us how
well ZFS works in their environments. Hopefully the answer will be "very
well" for the 95% of customers who are in the median; for those on the
"radical fringe" of I/O requirements, there will likely be more work to
do.

I'll wander off to wait for some real data. ;-)

-- Anton
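The arithmetic in the message above can be reproduced with a quick model
(assuming a 400 MB/s payload rate for 4 Gb FC and the 95% efficiency
target it states; a rough model, not a measurement):

# Rough model of the per-command overhead budget on a 4 Gb FC channel,
# following the numbers in the message above.

LINK_BYTES_PER_SEC = 400e6       # assumed payload rate of a 4 Gb FC channel
EFFICIENCY_TARGET = 0.95         # "95% of theoretical rate"

def overhead_budget(io_bytes):
    """Return (transfer time, allowed per-command overhead, commands/sec)."""
    transfer = io_bytes / LINK_BYTES_PER_SEC
    overhead = transfer * (1.0 - EFFICIENCY_TARGET)
    return transfer, overhead, 1.0 / transfer

for size in (128 * 1024, 16 * 1024 * 1024):
    t, o, rate = overhead_budget(size)
    print("%8d KB: transfer %6.0f us, overhead budget %5.0f us, %5.0f cmds/s"
          % (size // 1024, t * 1e6, o * 1e6, rate))

# Prints roughly:
#      128 KB: transfer    328 us, overhead budget    16 us,  3052 cmds/s
#    16384 KB: transfer  41943 us, overhead budget  2097 us,    24 cmds/s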
Richard Elling
2006-May-30 21:06 UTC
[zfs-discuss] Re: [osol-discuss] Re: I wish Sun would open-source"QFS"... / was:Re: Re: Distributed File System for Solaris
On Tue, 2006-05-30 at 14:59 -0500, Anton Rang wrote:
> On May 30, 2006, at 2:16 PM, Richard Elling wrote:
>
> > [assuming we're talking about disks and not "hardware RAID arrays"...]
>
> It'd be interesting to know how many customers plan to use raw disks,
> and how their performance relates to hardware arrays. (My gut feeling
> is that a lot of disks on FC probably isn't too bad, though on parallel
> SCSI the negotiation overhead and lack of fairness was awful, but I
> haven't tested this.)

FC-AL has much greater arbitration overhead than parallel SCSI, though
FC-AL is arguably more fair for targets. However, parallel SCSI is at
its end. SAS/SATA is taking over, relegating FC to the non-IP SAN.

> OK. Let's take 4 Gb FC (e.g. array hardware). Sending 128 KB will take
> roughly 330 microseconds. If we're going to achieve 95% of theoretical
> rate, then each transaction can have no more than 5% of that for
> overhead, or 16 microseconds. That's pretty darn fast. For that matter,
> the Solaris host would have to initiate 3,000 writes per second to keep
> the channel busy. For each channel. And a host might well have 20
> channels. Can our FC stack do that? Not yet, though it's been looked
> at....
>
> At 16 MB [why 16? because we can't do 32 MB in a WRITE(10) command] we
> have some more leeway. Sending 16 MB will take roughly 42 ms. Each
> transaction can take 5% of that, or 2 ms, for overhead, and still reach
> the 95% mark. And we only need to issue 24 commands per second to keep
> the channel saturated. No problem....
>
> Single disks still run FC at 2 Gb, so the numbers above are roughly
> halved, and since it takes 2-4 disks to max out a channel, you can also
> multiply the allowable overhead time on the disk by a factor of 2-4.
> That gives the disk about 16*2*4 = 128 microseconds to process a
> command. The disk might be able to do that. Solaris (and the HBA) still
> need to push out 1500 writes per second (per channel), though. A good
> HBA may be able to do that....
>
> > For modern disks, multiple 128kByte transfers will spend a long time
> > in the disk's buffer cache waiting to be written to media.
>
> They shouldn't spend that long, really. Today's Cheetah has a 200
> MB/sec interface and a 59-118 MB/sec transfer rate to media, so at best
> we can fill the cache a little over twice as fast as it empties. (Once
> we put multiple disks on the channel, it's easy to have the cache empty
> faster than we fill it -- this is actually the desirable case, so that
> we're not waiting on the media.)
>
> > Very few disks have 16MByte write buffer caches, so if you want to
> > send such a large iop down the wire (DAS please, otherwise you kill
> > the SAN), then you'll be waiting on the media anyway. The disk
> > interconnect is faster than the media speed. I don't see how you
> > could avoid blowing a rev in that case.
>
> Yes, we'll wait on the media. We'll never lose a rev, though. Each
> track on a Cheetah holds an average of 400 KB (1.6 MB/cylinder), so
> each time that we change tracks, we'll likely have the buffer full with
> all the data for the track.

Right. For a ST3300007LW (300 GByte UltraSCSI 320) a 16 MByte iop takes
approximately 50 ms to transfer over the SCSI bus. The media speed is
59-118 MBytes/s (270-136 ms). The default cache size is 8 MBytes. Since
the transfer is too big to fit in cache, you have to stall the transfer
or you overrun the buffer. For reads, the disk won't be able to keep the
bus busy either. Using such big block sizes doesn't gain you anything.

I suspect that the full size of the buffer is not available, since they
might want to use some of the space for the read cache, too.

> Even if we don't, FC transfers data out of order, so the drive can
> re-order if it deems necessary (in the desirable cache-empty case).

A single iop (even a 16 MByte one) will not be re-ordered.

> But until we have a well-configured test system to benchmark, this is
> rather academic. :-) I suspect our customers will quickly tell us how
> well ZFS works in their environments. Hopefully the answer will be
> "very well" for the 95% of customers who are in the median; for those
> on the "radical fringe" of I/O requirements, there will likely be more
> work to do.

Agree 100% :-)
 -- richard
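The ST3300007LW example above works out the same way (figures are the
ones quoted in the message; again a rough model, not a measurement):

# Rough model of the ST3300007LW example: how long a 16 MB iop spends on
# the Ultra320 bus vs. how long the media needs to absorb it.

IO_MB = 16.0
BUS_MB_PER_SEC = 320.0            # Ultra320 SCSI burst rate
MEDIA_MB_PER_SEC = (59.0, 118.0)  # inner to outer zone, from the message
CACHE_MB = 8.0                    # default write buffer size quoted above

bus_time = IO_MB / BUS_MB_PER_SEC
print("bus transfer: %.0f ms" % (bus_time * 1000))        # ~50 ms

for media in MEDIA_MB_PER_SEC:
    media_time = IO_MB / media
    print("media at %3.0f MB/s: %.0f ms" % (media, media_time * 1000))
    # ~271 ms and ~136 ms -- far longer than the bus transfer, so with an
    # 8 MB buffer the transfer must stall once the cache fills.

print("fits in cache: %s" % (IO_MB <= CACHE_MB))          # False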
Robert Milkowski
2006-May-30 22:37 UTC
[zfs-discuss] Re: [osol-discuss] Re: I wish Sun would open-source"QFS"... / was:Re: Re: Distributed File System for Solaris
Hello Anton,

Tuesday, May 30, 2006, 9:59:09 PM, you wrote:

AR> On May 30, 2006, at 2:16 PM, Richard Elling wrote:

>> [assuming we're talking about disks and not "hardware RAID arrays"...]

AR> It'd be interesting to know how many customers plan to use raw disks,
AR> and how their performance relates to hardware arrays. (My gut feeling
AR> is that a lot of disks on FC probably isn't too bad, though on
AR> parallel SCSI the negotiation overhead and lack of fairness was
AR> awful, but I haven't tested this.)

>> On Tue, 2006-05-30 at 11:43 -0500, Anton Rang wrote:
>>>> Sure, the block size may be 128KB, but ZFS can bundle more than one
>>>> per-file/transaction
>>>
>>> But it doesn't right now, as far as I can tell.
>>
>> The protocol overhead is still orders of magnitude faster than a
>> rev. Sure, there are pathological cases such as FC-AL over
>> 200kms with 100+ nodes, but most folks won't hurt themselves like
>> that.

AR> OK. Let's take 4 Gb FC (e.g. array hardware). Sending 128 KB will
AR> take roughly 330 microseconds. If we're going to achieve 95% of
AR> theoretical rate, then each transaction can have no more than 5% of
AR> that for overhead, or 16 microseconds. That's pretty darn fast. For
AR> that matter, the Solaris host would have to initiate 3,000 writes per
AR> second to keep the channel busy. For each channel. And a host might
AR> well have 20 channels. Can our FC stack do that? Not yet, though it's
AR> been looked at....

AR> At 16 MB [why 16? because we can't do 32 MB in a WRITE(10) command]
AR> we have some more leeway. Sending 16 MB will take roughly 42 ms. Each
AR> transaction can take 5% of that, or 2 ms, for overhead, and still
AR> reach the 95% mark. And we only need to issue 24 commands per second
AR> to keep the channel saturated. No problem....

AR> Single disks still run FC at 2 Gb, so the numbers above are roughly
AR> halved, and since it takes 2-4 disks to max out a channel, you can
AR> also multiply the allowable overhead time on the disk by a factor of
AR> 2-4. That gives the disk about 16*2*4 = 128 microseconds to process
AR> a command. The disk might be able to do that. Solaris (and the HBA)
AR> still need to push out 1500 writes per second (per channel), though.
AR> A good HBA may be able to do that....

>> For modern disks, multiple 128kByte transfers will spend a long time
>> in the disk's buffer cache waiting to be written to media.

AR> They shouldn't spend that long, really. Today's Cheetah has a 200
AR> MB/sec interface and a 59-118 MB/sec transfer rate to media, so at
AR> best we can fill the cache a little over twice as fast as it empties.
AR> (Once we put multiple disks on the channel, it's easy to have the
AR> cache empty faster than we fill it -- this is actually the desirable
AR> case, so that we're not waiting on the media.)

>> Very few disks have 16MByte write buffer caches, so if you want to
>> send such a large iop down the wire (DAS please, otherwise you kill
>> the SAN), then you'll be waiting on the media anyway. The disk
>> interconnect is faster than the media speed. I don't see how you
>> could avoid blowing a rev in that case.

AR> Yes, we'll wait on the media. We'll never lose a rev, though. Each
AR> track on a Cheetah holds an average of 400 KB (1.6 MB/cylinder), so
AR> each time that we change tracks, we'll likely have the buffer full
AR> with all the data for the track. Even if we don't, FC transfers data
AR> out of order, so the drive can re-order if it deems necessary (in the
AR> desirable cache-empty case).

AR> But until we have a well-configured test system to benchmark, this is
AR> rather academic. :-) I suspect our customers will quickly tell us how
AR> well ZFS works in their environments. Hopefully the answer will be
AR> "very well" for the 95% of customers who are in the median; for those
AR> on the "radical fringe" of I/O requirements, there will likely be
AR> more work to do.

AR> I'll wander off to wait for some real data. ;-)

Well, see my earlier posts here about some basic testing of sequential
writes with dd using different block sizes. It looks like using an 8MB
IO size gives much better real throughput (to UFS or a raw disk) than to
ZFS, where it's actually written using 128KB IO sizes. And the
difference is actually quite big.

It was noticed by Roch that it could be related to another issue with
ZFS which is being addressed -- however, I still feel that with
sequential writing, large IOs can actually give much better throughput
than just using 128KB.

ps. I was using FC disks directly connected to the host (JBOD) without
HW RAID.

-- 
Best regards,
 Robert                        mailto:rmilkowski at task.gda.pl
                               http://milek.blogspot.com
Darren J Moffat
2006-May-31 08:43 UTC
[zfs-discuss] Re: [osol-discuss] Re: I wish Sun would open-source"QFS"... / was:Re: Re: Distributed File System for Solaris
Anton Rang wrote:
> It's also worth noting that the customers for whom streaming is a real
> issue tend to be those who are willing to spend a lot of money for
> reliability (think replicating the whole system+storage) rather than
> compromising performance; for them, simply the checksumming overhead
> and lack of direct I/O in (today's) ZFS may be unacceptable. Is it

That statement to me is inconsistent. The customers want reliability,
but the way that ZFS provides it (in a way no other filesystem does
today) is too much of an overhead; sigh. So do they really want
reliability of their data or not?

What proof is there that the checksumming in ZFS is actually hurting
performance? We know that the implementation of the fletcher algorithms
can be improved on some systems (a test implementation exists for
UltraSPARC T1).

-- 
Darren J Moffat
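For reference, a simplified sketch of a Fletcher-4-style running-sum
checksum over 32-bit words. It shows why the per-word cost is only a few
additions, but it is not the actual ZFS implementation, which (as noted
above) can be optimized further per platform:

# Simplified Fletcher-4-style checksum: four running sums over 32-bit words.
# Illustrative only -- not the exact ZFS code or its byte-order handling.

import struct

def fletcher4(data: bytes):
    data = data + b"\x00" * (-len(data) % 4)   # pad to a word multiple
    a = b = c = d = 0
    for (w,) in struct.iter_unpack("<I", data):
        a = (a + w) & 0xFFFFFFFFFFFFFFFF       # 64-bit accumulators
        b = (b + a) & 0xFFFFFFFFFFFFFFFF
        c = (c + b) & 0xFFFFFFFFFFFFFFFF
        d = (d + c) & 0xFFFFFFFFFFFFFFFF
    return (a, b, c, d)

print(fletcher4(b"ZFS checksums every block"))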
Roch Bourbonnais - Performance Engineering
2006-May-31 13:56 UTC
[zfs-discuss] Re: [osol-discuss] Re: I wish Sun would open-source"QFS"... / was:Re: Re: Distributed File System for Solaris
Anton wrote:

  (For what it's worth, the current 128K-per-I/O policy of ZFS really
  hurts its performance for large writes. I imagine this would not be
  too difficult to fix if we allowed multiple 128K blocks to be
  allocated as a group.)

I'm not taking a stance on this, but if I keep a controller full of
128K I/Os, and assuming they are targeting contiguous physical blocks,
how different is that from issuing a very large I/O?

-r
Roch Bourbonnais - Performance Engineering
2006-May-31 14:03 UTC
[zfs-discuss] Re: [osol-discuss] Re: I wish Sun would open-source"QFS"... / was:Re: Re: Distributed File System for Solaris
> I think ZFS should do fine in streaming mode also, though there are
> currently some shortcomings, such as the mentioned 128K I/O size.

  It may eventually. The lack of direct I/O may also be an issue, since
  some of our systems don't have enough main memory bandwidth to support
  data being extensively touched by the CPU between capture (DMA in) and
  writing (DMA out).

Is the notion that 128K is too small based on directio results? Clearly
with directio, a single thread streaming data will see terrible
performance.

-r
Anton Rang
2006-May-31 14:48 UTC
[zfs-discuss] Re: [osol-discuss] Re: I wish Sun would open-source"QFS"... / was:Re: Re: Distributed File System for Solaris
On May 31, 2006, at 8:56 AM, Roch Bourbonnais - Performance
Engineering wrote:

> I'm not taking a stance on this, but if I keep a controller full of
> 128K I/Os, and assuming they are targeting contiguous physical blocks,
> how different is that from issuing a very large I/O?

There are differences at the host, the HBA, the disk or RAID
controller, and on the wire.

At the host:

The SCSI/FC/ATA stack is run once for each I/O. This takes a bit of
CPU. We generally take one interrupt for each I/O (if the CPU is fast
enough), so instead of taking one interrupt for 8 MB (for instance), we
take 64.

We run through the IOMMU or page translation code once per page, but
the overhead of initially setting up the IOMMU or starting the
translation loop happens once per I/O.

At the HBA:

There is some overhead each time that the controller switches
processing from one I/O to another. This isn't too large on a modern
system, but it does add up.

There is overhead on the PCI (or other) bus for the small transfers
that make up the command block and scatter/gather list for each I/O.
Again, it adds up (faster than you might expect, since PCI Express can
move 128 KB very quickly).

There is a limit on the maximum number of outstanding I/O requests, but
we're unlikely to hit this in normal use; it is typically at least 256
and more often 1024 or more on newer hardware. (This is shared for the
whole channel in the FC and SCSI case, and may be shared between
multiple channels for SAS or multi-port FC cards.)

There is often a small cache of commands which can be handled quickly;
commands outside of this cache (which may hold 4 to 16 or so) are much
slower to "context-switch" in when their data is needed; in particular,
the scatter/gather list may need to be read again.

At the disk or RAID:

There is a fixed overhead for processing each command. This can be
fairly readily measured, and roughly reflects the difference between
delivered 512-byte IOPs and bandwidth for a large I/O. Some of it is
related to parsing the CDB and starting command execution; some of it
is related to cache management.

There is some overhead for switching between data transfers for each
command. A typical track on a disk may hold 400K or so of data, and a
full-track transfer is optimal (runs at platter speed). A partial-track
transfer immediately followed by another may take enough time to switch
that we sometimes lose one revolution (particularly on disks which do
not have sector headers). Write caching should nearly eliminate this as
a concern, however.

There is a fixed-size window of commands that can be reordered on the
device. Data transfer within a command can be reordered arbitrarily
(for parallel SCSI and FC, though not for ATA or SAS). It's good to
have lots of outstanding commands, but if they are all sequential,
there's not much point (no reason to reorder them, except perhaps if
you're going backwards, and FC/SCSI can handle this anyway).

On the wire:

Sending a command and its completion takes time that could be spent
moving data instead; but for most protocols this probably isn't
significant.

You can actually see most of this with a PCI and protocol analyzer.

-- Anton
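To make the "it adds up" point concrete, a toy model comparing 64 x 128
KB commands against one 8 MB command on the 4 Gb FC channel discussed
earlier; the 20-microsecond per-command cost is an invented placeholder
for the host, HBA, and target overheads listed above, not a measured
value:

# Toy model: total time to move 8 MB as many small commands vs. one large
# command, given an assumed fixed per-command cost. Placeholder numbers.

LINK_BYTES_PER_SEC = 400e6      # assumed 4 Gb FC payload rate (as above)
PER_COMMAND_OVERHEAD_S = 20e-6  # hypothetical host+HBA+target cost per command

def total_time(total_bytes, io_size):
    n_commands = total_bytes // io_size
    transfer = total_bytes / LINK_BYTES_PER_SEC
    return transfer + n_commands * PER_COMMAND_OVERHEAD_S

eight_mb = 8 * 1024 * 1024
small = total_time(eight_mb, 128 * 1024)   # 64 commands
large = total_time(eight_mb, eight_mb)     # 1 command
wire = eight_mb / LINK_BYTES_PER_SEC
print("64 x 128 KB: %.2f ms, 1 x 8 MB: %.2f ms (%.0f%% vs %.0f%% of link rate)"
      % (small * 1e3, large * 1e3, 100 * wire / small, 100 * wire / large))
# Roughly: 64 x 128 KB: 22.25 ms, 1 x 8 MB: 20.99 ms (94% vs 100% of link rate)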
Bill Sommerfeld
2006-May-31 15:21 UTC
[zfs-discuss] Re: [osol-discuss] Re: I wish Sun would open-source"QFS"... / was:Re: Re: Distributed File System for Solaris
On Wed, 2006-05-31 at 10:48, Anton Rang wrote:
> We generally take one interrupt for each I/O
> (if the CPU is fast enough), so instead of taking one
> interrupt for 8 MB (for instance), we take 64.

Hunh. Gigabit ethernet devices typically implement some form of
interrupt blanking or coalescing so that the host cpu can batch I/O
completion handling. That doesn't exist in FC controllers?

Under continuous heavy load it can be more efficient to do polling
instead of interrupt-driven I/O.

					- Bill
Anton Rang
2006-May-31 16:00 UTC
[zfs-discuss] Re: [osol-discuss] Re: I wish Sun would open-source"QFS"... / was:Re: Re: Distributed File System for Solaris
On May 31, 2006, at 10:21 AM, Bill Sommerfeld wrote:
> Hunh. Gigabit ethernet devices typically implement some form of
> interrupt blanking or coalescing so that the host cpu can batch I/O
> completion handling. That doesn't exist in FC controllers?

Not in quite the same way, AFAIK. Usually there is a queue of completed
I/O operations, and one interrupt is generated each time the queue
becomes non-empty. If the host is relatively slow, you'll process a
number of I/O completions for one interrupt. If it's relatively fast,
you'll get one interrupt per completion (unless you poll in the driver
before re-enabling interrupts, which is reasonable if you believe the
load is heavy).

> Under continuous heavy load it can be more efficient to do polling
> instead of interrupt-driven I/O.

Yes.

-- Anton
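A toy simulation of the completion-queue behaviour described above: one
interrupt fires when the queue goes non-empty, so a slow host naturally
batches completions while a fast host takes one interrupt per I/O.
Purely illustrative; no real driver interface is modeled:

# Toy model: an interrupt is raised when a completion arrives while the
# host is idle; completions arriving during the service window are drained
# in the same interrupt (batched).

def count_interrupts(completion_times, service_time):
    interrupts = 0
    busy_until = float("-inf")
    for t in completion_times:
        if t >= busy_until:
            interrupts += 1               # queue went non-empty: new interrupt
            busy_until = t + service_time # host drains the queue until then
    return interrupts

arrivals = [i * 10.0 for i in range(64)]            # 64 completions, 10 us apart
print(count_interrupts(arrivals, service_time=2.0))   # fast host: 64 interrupts
print(count_interrupts(arrivals, service_time=45.0))  # slower host: ~13 interrupts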
Roland Mainz
2006-Jun-01 02:48 UTC
[zfs-discuss] Re: [osol-discuss] Re: I wish Sun would open-source "QFS"... /was:Re: Re: Distributed File System for Solaris
Casper.Dik at Sun.COM wrote:
> >UNIX admin wrote:
> >> > There's still an opening in the shared filesystem
> >> > space (multi-reader and multi-writer). Fix QFS, or extend ZFS?
> >>
> >> That one's a no-brainer, innit? Extend ZFS and plough on.
> >
> >Uhm... I think this is not that easy. Based on IRC feedback I think it
> >may be difficult to implement the intended features, e.g. storing
> >inodes and data on separate disks. We had several projects in the past
> >where this was the only way to guarantee good performance for realtime
> >data collection and processing, and due to the lack of such a feature
> >in ZFS we still need QFS...
>
> I'm assuming this means you've measured the performance and found ZFS
> wanting?

No, I did not test ZFS yet; I only discussed the matter on IRC so far.
But based on the original problems (see below) I do not think that ZFS
can deliver something (without the inode+data separation) that neither
IBM nor QFS without the inode+data split could deliver a few years ago
(even when backed with a giant RAID+caches (which caused more trouble
than expected, see below, too)).

> I don't get it; zfs is a copy-on-write filesystem, so there should
> be no hotspotting of disks and, theoretically, write performance
> could be maxed out.

What about read performance? And interactive users who are MAD and run
their stuff during data capture?

> The requirement is not that inodes and data are separate; the
> requirement is a specific upper bound on disk transactions. The
> question therefore is not "when will ZFS be able to separate inodes
> and data"; the question is when ZFS will meet the QoS criteria.

Uhm... that's the point where you are IMO slightly wrong. The exact
requirement is that inodes and data need to be separated. In this
specific case (and the setup was copied several times, so Sun made a
considerable amount of money with it :-) ) the inode data+log were put
on separate solid-state disks on separate SCSI controllers (which have
nearly zero seek time). The problem was that a high amount of inode
activity could starve the data recording and playback, something which
was unacceptable since running the matching experiment costs around
>=40000 Euro/minute.

Similar proposed setups provided by IBM failed (even with giant RAID
caches (which mainly were able to flatten the problems out, but
sometimes suffered from second-long "hiccups" where reads and writes
were stalled)) to deliver the requested performance (much data and much
inode traffic and tons of scripts and (to make it worse and even more
unpredictable) interactive users). The only working solution in this
case was to move inodes+log to the separate solid-state disks with their
own path (e.g. SCSI controller) for these data, freeing the data RAID
from such operations. The only alternative was to move everything to
solid-state disks - but that was considered to be far too expensive (or
better: we had already wasted too much money elsewhere... ;-( ).

----

Bye,
Roland

-- 
  __ .  . __
 (o.\ \/ /.o) roland.mainz at nrubsig.org
  \__\/\/__/  MPEG specialist, C&&JAVA&&Sun&&Unix programmer
  /O /==\ O\  TEL +49 641 7950090
 (;O/ \/ \O;)
Roland Mainz
2006-Jun-01 02:51 UTC
[zfs-discuss] Re: [osol-discuss] Re: I wish Sun wouldopen-source"QFS"... / was:Re: Re: Distributed File System for Solaris
Casper.Dik at Sun.COM wrote:
> >Well, I don't know about his particular case, but many QFS clients
> >have found the separation of data and metadata to be invaluable. The
> >primary reason is that it avoids disk seeks. We have QFS customers who
> >are running at over 90% of theoretical bandwidth on a medium-sized set
> >of FibreChannel controllers and need to maintain that streaming rate.
> >Taking a seek to update the on-disk inodes once a minute or so slowed
> >down transfers enough that QFS was invented. ;-)
>
> That does not answer the question I asked; since ZFS is a copy-on-write
> filesystem, there's no fixed inode location and streaming writes should
> always be possible.
>
> So, in theory ZFS can do this and mix metadata and data. That's why
> I asked for any practical input into this matter.
>
> There are, I think, four different outcomes possible of such an
> experiment and subsequent analysis:
>
> 	ZFS does just fine, thank you
> 	ZFS doesn't measure up but can be fixed without splitting meta data.
> 	ZFS doesn't measure up and can only be fixed by allowing a logical split
> 	ZFS doesn't measure up and cannot be fixed
>
> My money is on #2.

I strongly bet on #3 (assuming "logical split" means "data+inode" split)
based on real-world experience with other products from other vendors.

----

Bye,
Roland

-- 
  __ .  . __
 (o.\ \/ /.o) roland.mainz at nrubsig.org
  \__\/\/__/  MPEG specialist, C&&JAVA&&Sun&&Unix programmer
  /O /==\ O\  TEL +49 641 7950090
 (;O/ \/ \O;)
Roch
2006-Jun-02 13:54 UTC
[zfs-discuss] Re: [osol-discuss] Re: I wish Sun would open-source"QFS"... / was:Re: Re: Distributed File System for Solaris
Anton Rang writes:

 > On May 31, 2006, at 8:56 AM, Roch Bourbonnais - Performance
 > Engineering wrote:
 >
 > > I'm not taking a stance on this, but if I keep a controller full of
 > > 128K I/Os, and assuming they are targeting contiguous physical
 > > blocks, how different is that from issuing a very large I/O?
 >
 > There are differences at the host, the HBA, the disk or RAID
 > controller, and on the wire.
 >
 > At the host:
 >
 > The SCSI/FC/ATA stack is run once for each I/O. This takes a bit of
 > CPU. We generally take one interrupt for each I/O (if the CPU is fast
 > enough), so instead of taking one interrupt for 8 MB (for instance),
 > we take 64.
 >
 > We run through the IOMMU or page translation code once per page, but
 > the overhead of initially setting up the IOMMU or starting the
 > translation loop happens once per I/O.
 >
 > At the HBA:
 >
 > There is some overhead each time that the controller switches
 > processing from one I/O to another. This isn't too large on a modern
 > system, but it does add up.
 >
 > There is overhead on the PCI (or other) bus for the small transfers
 > that make up the command block and scatter/gather list for each I/O.
 > Again, it adds up (faster than you might expect, since PCI Express
 > can move 128 KB very quickly).
 >
 > There is a limit on the maximum number of outstanding I/O requests,
 > but we're unlikely to hit this in normal use; it is typically at
 > least 256 and more often 1024 or more on newer hardware. (This is
 > shared for the whole channel in the FC and SCSI case, and may be
 > shared between multiple channels for SAS or multi-port FC cards.)
 >
 > There is often a small cache of commands which can be handled
 > quickly; commands outside of this cache (which may hold 4 to 16 or
 > so) are much slower to "context-switch" in when their data is
 > needed; in particular, the scatter/gather list may need to be read
 > again.
 >
 > At the disk or RAID:
 >
 > There is a fixed overhead for processing each command. This can be
 > fairly readily measured, and roughly reflects the difference between
 > delivered 512-byte IOPs and bandwidth for a large I/O. Some of it is
 > related to parsing the CDB and starting command execution; some of
 > it is related to cache management.
 >
 > There is some overhead for switching between data transfers for each
 > command. A typical track on a disk may hold 400K or so of data, and
 > a full-track transfer is optimal (runs at platter speed). A
 > partial-track transfer immediately followed by another may take
 > enough time to switch that we sometimes lose one revolution
 > (particularly on disks which do not have sector headers). Write
 > caching should nearly eliminate this as a concern, however.
 >
 > There is a fixed-size window of commands that can be reordered on
 > the device. Data transfer within a command can be reordered
 > arbitrarily (for parallel SCSI and FC, though not for ATA or SAS).
 > It's good to have lots of outstanding commands, but if they are all
 > sequential, there's not much point (no reason to reorder them,
 > except perhaps if you're going backwards, and FC/SCSI can handle
 > this anyway).
 >
 > On the wire:
 >
 > Sending a command and its completion takes time that could be spent
 > moving data instead; but for most protocols this probably isn't
 > significant.
 >
 > You can actually see most of this with a PCI and protocol analyzer.
 >
 > -- Anton

So the main question: does any of this cause a full flush of the
pipelined operations?

If it is just extra busy-ness of the individual components, all
operating concurrently, and if we don't saturate anybody because of the
extra work, then it seems to me that we are fine. So clearly there may
be a few extra bubbles that find their way into the pipe and we can lose
the last few bleeding-edge percent of throughput. Those guys are on QFS
and delighted to be (and they should be, QFS is outstanding in that
market).

-r