Roland Mainz
2006-May-30 01:16 UTC
[zfs-discuss] Re: [osol-discuss] Re: I wish Sun would open-source "QFS"... / was:Re: Re: Distributed File System for Solaris
UNIX admin wrote:
> > > There's still an opening in the shared filesystem
> > > space (multi-reader and multi-writer). Fix QFS, or extend ZFS?
>
> That one's a no-brainer, innit? Extend ZFS and plough on.

Uhm... I think this is not that easy. Based on IRC feedback I think it
may be difficult to implement the intended features, e.g. storing inodes
and data on separate disks. We had several projects in the past where
this was the only way to guarantee good performance for realtime data
collection and processing, and due to the lack of such a feature in ZFS
we still need QFS...

----

Bye,
Roland

P.S.: Reply-To: set to ZFS filesystem discussion list
<zfs-discuss at opensolaris.org>

-- 
  __ .  . __
 (o.\ \/ /.o) roland.mainz at nrubsig.org
  \__\/\/__/  MPEG specialist, C&&JAVA&&Sun&&Unix programmer
  /O /==\ O\  TEL +49 641 7950090
 (;O/ \/ \O;)
Casper.Dik at Sun.COM
2006-May-30 04:19 UTC
[zfs-discuss] Re: [osol-discuss] Re: I wish Sun would open-source "QFS"... / was:Re: Re: Distributed File System for Solaris
>UNIX admin wrote:
>> > There's still an opening in the shared filesystem
>> > space (multi-reader and multi-writer). Fix QFS, or extend ZFS?
>>
>> That one's a no-brainer, innit? Extend ZFS and plough on.
>
>Uhm... I think this is not that easy. Based on IRC feedback I think it
>may be difficult to implement the intended features, e.g. storing inodes
>and data on separate disks. We had several projects in the past where
>this was the only way to guarantee good performance for realtime data
>collection and processing, and due to the lack of such a feature in ZFS
>we still need QFS...

I'm assuming this means you've measured the performance and found ZFS
wanting?

I don't get it; zfs is a copy-on-write filesystem, so there should be no
hotspotting of disks and, theoretically, write performance could be
maxed out.

The requirement is not that inodes and data are separate; the
requirement is a specific upper bound on disk transactions. The question
therefore is not "when will ZFS be able to separate inodes and data";
the question is when ZFS will meet the QoS criteria.

Casper
Anton B. Rang
2006-May-30 15:13 UTC
[zfs-discuss] Re: [osol-discuss] Re: I wish Sun would open-source"QFS"... / was:Re: Re: Distributed File System for Solaris
Well, I don't know about his particular case, but many QFS clients have
found the separation of data and metadata to be invaluable. The primary
reason is that it avoids disk seeks. We have QFS customers who are
running at over 90% of theoretical bandwidth on a medium-sized set of
FibreChannel controllers and need to maintain that streaming rate.
Taking a seek to update the on-disk inodes once a minute or so slowed
down transfers enough that QFS was invented. ;-)

QFS uses an allocate-forward policy, which means that the disk head is
always moving in one direction and, for new file creation (the data
capture case), we issue large writes that are always sequential. (And
when multiple files are being captured simultaneously, they can be
directed onto different physical disk arrays within the same file
system, to avoid interference.)

ZFS will be a great file system for transactional work (small
reads/writes) and its data integrity should be unmatched. But for large
streaming, it's hard to beat QFS. (And it will take some cleverness to
figure out a multi-host ZFS.)

(For what it's worth, the current 128K-per-I/O policy of ZFS really
hurts its performance for large writes. I imagine this would not be too
difficult to fix if we allowed multiple 128K blocks to be allocated as
a group.)

This message posted from opensolaris.org
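For illustration, a toy sketch of the allocate-forward policy described
above; this is not QFS code, and the class and parameter names are
invented:

# Toy allocate-forward allocator: the write cursor only advances, so
# back-to-back allocations for streaming writes stay contiguous and the
# head never seeks backwards. Illustration of the policy, not real code.

class AllocateForward:
    def __init__(self, device_blocks):
        self.device_blocks = device_blocks
        self.cursor = 0                    # next free block, always increasing

    def allocate(self, nblocks):
        """Return the starting block of a contiguous extent, or None if full."""
        if self.cursor + nblocks > self.device_blocks:
            return None                    # wrap/reclaim not modeled here
        start = self.cursor
        self.cursor += nblocks             # head keeps moving in one direction
        return start

alloc = AllocateForward(device_blocks=1 << 20)
extents = [alloc.allocate(256) for _ in range(4)]  # four 1 MB writes of 4K blocks
print(extents)  # [0, 256, 512, 768] -- strictly increasing, no back-seeks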
Casper.Dik at Sun.COM
2006-May-30 15:36 UTC
[zfs-discuss] Re: [osol-discuss] Re: I wish Sun would open-source"QFS"... / was:Re: Re: Distributed File System for Solaris
>Well, I don't know about his particular case, but many QFS clients
>have found the separation of data and metadata to be invaluable. The
>primary reason is that it avoids disk seeks. We have QFS customers who
>are running at over 90% of theoretical bandwidth on a medium-sized set
>of FibreChannel controllers and need to maintain that streaming rate.
>Taking a seek to update the on-disk inodes once a minute or so slowed
>down transfers enough that QFS was invented. ;-)

That does not answer the question I asked; since ZFS is a copy-on-write
filesystem, there's no fixed inode location and streaming writes should
always be possible.

So, in theory ZFS can do this and mix metadata and data. That's why I
asked for any practical input into this matter.

There are, I think, four different outcomes possible of such an
experiment and subsequent analysis:

	ZFS does just fine, thank you
	ZFS doesn't measure up but can be fixed without splitting meta data.
	ZFS doesn't measure up and can only be fixed by allowing a logical split
	ZFS doesn't measure up and cannot be fixed

My money is on #2.

>ZFS will be a great file system for transactional work (small
>reads/writes) and its data integrity should be unmatched. But for large
>streaming, it's hard to beat QFS. (And it will take some cleverness to
>figure out a multi-host ZFS.)

I think ZFS should do fine in streaming mode also, though there are
currently some shortcomings, such as the mentioned 128K I/O size.

Casper
Nicolas Williams
2006-May-30 16:16 UTC
[zfs-discuss] Re: [osol-discuss] Re: I wish Sun would open-source "QFS"... / was:Re: Re: Distributed File System for Solaris
On Tue, May 30, 2006 at 06:19:16AM +0200, Casper.Dik at Sun.COM wrote:
> The requirement is not that inodes and data are separate; the
> requirement is a specific upper bound on disk transactions. The
> question therefore is not "when will ZFS be able to separate inodes
> and data"; the question is when ZFS will meet the QoS criteria.

And if it were a requirement, surely ZFS/pools could be hacked on to
support a notion of meta-data vdevs, and dnodes/dnode-file/directory
blocks could be allocated on meta-data vdevs.

But I don't see it as a requirement either.
Anton Rang
2006-May-30 16:23 UTC
[zfs-discuss] Re: [osol-discuss] Re: I wish Sun would open-source"QFS"... / was:Re: Re: Distributed File System for Solaris
On May 30, 2006, at 10:36 AM, Casper.Dik at Sun.COM wrote:

> That does not answer the question I asked; since ZFS is a copy-on-write
> filesystem, there's no fixed inode location and streaming writes should
> always be possible.

The überblock still must be updated, however. This may not be an issue
if its updates don't have to be done on the data devices, but I believe
the current design has a copy (several actually) on each device for
redundancy.

> I think ZFS should do fine in streaming mode also, though there are
> currently some shortcomings, such as the mentioned 128K I/O size.

It may eventually. The lack of direct I/O may also be an issue, since
some of our systems don't have enough main memory bandwidth to support
data being extensively touched by the CPU between capture (DMA in) and
writing (DMA out).

-- Anton
Nicolas Williams
2006-May-30 16:25 UTC
[zfs-discuss] Re: [osol-discuss] Re: I wish Sun would open-source"QFS"... / was:Re: Re: Distributed File System for Solaris
On Tue, May 30, 2006 at 08:13:56AM -0700, Anton B. Rang wrote:
> Well, I don't know about his particular case, but many QFS clients
> have found the separation of data and metadata to be invaluable. The
> primary reason is that it avoids disk seeks. We have QFS customers who
                                                ^^^^^^^^^^^^^^^^^^^^

Are you talking about reads or writes?

Anyways, for reads separating data and meta-data helps, sure, but so
would adding mirrors. And anyways, separating meta-data/data _caching_
may make as much difference.

> are running at over 90% of theoretical bandwidth on a medium-sized set
> of FibreChannel controllers and need to maintain that streaming rate.
> Taking a seek to update the on-disk inodes once a minute or so slowed
> down transfers enough that QFS was invented. ;-)

So we're talking about writes then, in which case ZFS should not seek
because there are no fixed inode locations (there are fixed root block
locations though).

> (For what it's worth, the current 128K-per-I/O policy of ZFS really
> hurts its performance for large writes. I imagine this would not be
> too difficult to fix if we allowed multiple 128K blocks to be
> allocated as a group.)

I've been following the thread on this and that's not clear yet.

Sure, the block size may be 128KB, but ZFS can bundle more than one
per-file/transaction, so that the block size shouldn't matter so much --
it may be a meta-data and read I/O trade-off, but should not have much
impact on write performance. It may be that implementation-wise the
128KB block size does affect write performance, but design-wise I don't
see why it should.
Anton Rang
2006-May-30 16:43 UTC
[zfs-discuss] Re: [osol-discuss] Re: I wish Sun would open-source"QFS"... / was:Re: Re: Distributed File System for Solaris
On May 30, 2006, at 11:25 AM, Nicolas Williams wrote:

> On Tue, May 30, 2006 at 08:13:56AM -0700, Anton B. Rang wrote:
>> Well, I don't know about his particular case, but many QFS clients
>> have found the separation of data and metadata to be invaluable. The
>> primary reason is that it avoids disk seeks. We have QFS customers who
> ^^^^^^^^^^^^^^^^^^^^
>
> Are you talking about reads or writes?

Writes -- that's what's important for data capture, which is where I
entered this thread. ;-)  Sorry for the confusion.

> So we're talking about writes then, in which case ZFS should not seek
> because there are no fixed inode locations (there are fixed root block
> locations though).

There are actually three separate issues here.

The first is the fixed root block. This one may be a problem, but it
may be easy enough to mark certain logical units in a pool as "no root
block on this device."

The second is the allocation policy. If ZFS used an allocate-forward
policy, as QFS does, it should be able to avoid seeks. Note that this
is optimal for data capture but not for most other workloads, as it
tends to spread data across the whole disk over time, rather than
keeping it concentrated in a smaller region (with concomitant faster
seek times).

The third is the write scheduling policy. QFS, when used in data capture
applications, uses direct I/O and hence issues writes in sequential
block order. ZFS should do the same to get peak performance from its
devices for streaming (though intelligent devices can absorb some
misordering, it is usually at some performance penalty).

>> (For what it's worth, the current 128K-per-I/O policy of ZFS really
>> hurts its performance for large writes. I imagine this would not be
>> too difficult to fix if we allowed multiple 128K blocks to be
>> allocated as a group.)
>
> I've been following the thread on this and that's not clear yet.
>
> Sure, the block size may be 128KB, but ZFS can bundle more than one
> per-file/transaction

But it doesn't right now, as far as I can tell. I never see ZFS issuing
a 16 MB write, for instance. You simply can't get the same performance
from a disk array issuing 128 KB writes that you can with 16 MB writes.
It's physically impossible because of protocol overhead, even if the
controller itself were infinitely fast. (There's also the issue that at
128 KB, most disk arrays will choose to cache rather than stream the
data, since it's less than a single RAID stripe, which slows you down.)

-- Anton
Nicolas Williams
2006-May-30 17:23 UTC
[zfs-discuss] Re: [osol-discuss] Re: I wish Sun would open-source"QFS"... / was:Re: Re: Distributed File System for Solaris
On Tue, May 30, 2006 at 11:43:41AM -0500, Anton Rang wrote:
> There are actually three separate issues here.
>
> The first is the fixed root block. This one may be a problem, but it
> may be easy enough to mark certain logical units in a pool as "no root
> block on this device."

I don't think that's very creative.

Another way is to have lots of pre-allocated next-uberblock locations,
so that seek-to-one-uberblock times are always small. Each uberblock
can point to its predecessor and its copies and list the pre-allocated
possible locations of its successors. You'd still need some well-known,
non-COWed uber-uberblocks, but these would need to be updated
infrequently -- less frequently than once per transaction, the trade-off
being the time to find the latest set of uberblocks on mount.

Data/meta-data on-disk separation doesn't seem to be the answer for
write performance. It may make a big difference to separate memory
allocations for caching data vs. meta-data though, and there must be a
reason why this is being pursued by the IETF NFSv4 WG (see pNFS). But
for local write performance it makes no sense to me.

It could be that transactions are a problem though, for all I know,
since transacting may mean punctuating physical writes. But this seems
like a matter of trade-offs, and clearly it's better to have
transactions than not.

> The second is the allocation policy. If ZFS used an allocate-forward
> policy, as QFS does, it should be able to avoid seeks. Note that this
> is optimal for data capture but not for most other workloads, as it
> tends to spread data across the whole disk over time, rather than
> keeping it concentrated in a smaller region (with concomitant faster
> seek times).

The on-disk layout of ZFS does not dictate block allocation policies.

> The third is the write scheduling policy. QFS, when used in data
> capture applications, uses direct I/O and hence issues writes in
> sequential block order. ZFS should do the same to get peak performance
> from its devices for streaming (though intelligent devices can absorb
> some misordering, it is usually at some performance penalty).

Again. So far we're talking about potential improvements to the
implementation, not the on-disk layout, with the possible exception of
fixed well-known uberblock locations.

> >> (For what it's worth, the current 128K-per-I/O policy of ZFS really
> >> hurts its performance for large writes. I imagine this would not be
> >> too difficult to fix if we allowed multiple 128K blocks to be
> >> allocated as a group.)
> >
> > I've been following the thread on this and that's not clear yet.
> >
> > Sure, the block size may be 128KB, but ZFS can bundle more than one
> > per-file/transaction
>
> But it doesn't right now, as far as I can tell. I never see ZFS issuing
> a 16 MB write, for instance. You simply can't get the same performance
> from a disk array issuing 128 KB writes that you can with 16 MB writes.
> It's physically impossible because of protocol overhead, even if the
> controller itself were infinitely fast. (There's also the issue that at
> 128 KB, most disk arrays will choose to cache rather than stream the
> data, since it's less than a single RAID stripe, which slows you down.)

I'll leave this to Ron, you et al. to hash out, but nothing in the
on-disk layout prevents ZFS from bundling _most_ (i.e., excluding
updates of uberblocks) of each transaction as one large write, AFAICT.

Nico
--
Richard Elling
2006-May-30 19:16 UTC
[zfs-discuss] Re: [osol-discuss] Re: I wish Sun would open-source"QFS"... / was:Re: Re: Distributed File System for Solaris
[assuming we're talking about disks and not "hardware RAID arrays"...]

On Tue, 2006-05-30 at 11:43 -0500, Anton Rang wrote:
> > Sure, the block size may be 128KB, but ZFS can bundle more than one
> > per-file/transaction
>
> But it doesn't right now, as far as I can tell.

The protocol overhead is still orders of magnitude faster than a rev.
Sure, there are pathological cases such as FC-AL over 200kms with 100+
nodes, but most folks won't hurt themselves like that.

For modern disks, multiple 128kByte transfers will spend a long time
in the disk's buffer cache waiting to be written to media.

> I never see ZFS issuing
> a 16 MB write, for instance. You simply can't get the same performance
> from a disk array issuing 128 KB writes that you can with 16 MB writes.
> It's physically impossible because of protocol overhead, even if the
> controller itself were infinitely fast. (There's also the issue that at
> 128 KB, most disk arrays will choose to cache rather than stream the
> data, since it's less than a single RAID stripe, which slows you down.)

Very few disks have 16MByte write buffer caches, so if you want to send
such a large iop down the wire (DAS please, otherwise you kill the SAN),
then you'll be waiting on the media anyway. The disk interconnect is
faster than the media speed. I don't see how you could avoid blowing a
rev in that case. Surely there is a more generally applicable blocksize
which is appropriate.

Since many disks today do support queued commands, I don't see the
128kByte iop as a large, inherent limitation. OTOH, the jury is still
out...
 -- richard
Anton Rang
2006-May-30 19:26 UTC
[zfs-discuss] Re: [osol-discuss] Re: I wish Sun would open-source"QFS"... / was:Re: Re: Distributed File System for Solaris
On May 30, 2006, at 12:23 PM, Nicolas Williams wrote:

> Another way is to have lots of pre-allocated next-uberblock locations,
> so that seek-to-one-uberblock times are always small. Each uberblock
> can point to its predecessor and its copies and list the pre-allocated
> possible locations of its successors.

That's a possibility, though it could be difficult to distinguish an
uberblock from a data block after a crash (in the worst case), since now
you're writing both into the same arena. You'd also need to skip past
some disk areas (to get to the next uberblock) at each transaction,
which will cost some small amount of bandwidth.

> The on-disk layout of ZFS does not dictate block allocation policies.

Precisely, which is why I broke the issues apart. Two of them, at least,
can be attacked through simple code changes. The uberblock update may or
may not be an issue. It would be interesting to test this, by changing
the implementation in the other areas and seeing whether we can succeed
in matching the streaming performance of other file systems, and where
the bottlenecks are.

It's worth pointing out (maybe?) that having an uberblock (or, for that
matter, an indirect block) stored in the "middle" of your data may be a
problem, if it results in issuing a short read to the disk. Performance
is better if you read 4 MB from disk and throw out a small piece in the
middle than if you do a 2 MB read followed by a slightly shorter read to
skip the piece you don't want. Again, this does not require an on-disk
layout change.

Honestly, I'm not sure that focusing on latency-sensitive streaming
applications is worth it until we can get the bandwidth issues of ZFS
nailed down. There's some work yet to reach the 95%-of-device-speed
mark. How close does ZFS get to writing at 8 GB/sec on an F15K?

It's also worth noting that the customers for whom streaming is a real
issue tend to be those who are willing to spend a lot of money for
reliability (think replicating the whole system+storage) rather than
compromising performance; for them, simply the checksumming overhead and
lack of direct I/O in (today's) ZFS may be unacceptable. Is it worth the
effort to change ZFS to satisfy the requirements of that relative
handful of customers? I'd rather see us focus on adding functionality
that we can use to sell Solaris to large numbers of customers, and thus
building our customer base. We have a solution for streaming already,
while we're just entering the reliability and ease-of-administration
space, where the real opportunity lies.

-- Anton
Nicolas Williams
2006-May-30 19:51 UTC
[zfs-discuss] Re: [osol-discuss] Re: I wish Sun would open-source"QFS"... / was:Re: Re: Distributed File System for Solaris
On Tue, May 30, 2006 at 02:26:07PM -0500, Anton Rang wrote:
> On May 30, 2006, at 12:23 PM, Nicolas Williams wrote:
>
> > Another way is to have lots of pre-allocated next-uberblock locations,
> > so that seek-to-one-uberblock times are always small. Each uberblock
> > can point to its predecessor and its copies and list the pre-allocated
> > possible locations of its successors.
>
> That's a possibility, though it could be difficult to distinguish an
> uberblock from a data block after a crash (in the worst case), since now
> you're writing both into the same arena.

I don't agree. ZFS already has to deal with this (root blocks have to be
self-checksumming). Basically the new uberblocks would reference their
predecessors and would include: a checksum of the predecessor and a
self-checksum. The likelihood that some non-uberblock data in a block
that was once pre-allocated as a possible uberblock could look like a
valid uberblock can be kept vanishingly small.

OTOH, this would present an attack vector, so such blocks should not be
freed for normal filesystem use until uber-uberblocks have been updated
and this attack vector has been closed.

> You'd also need to skip past
> some disk areas (to get to the next uberblock) at each transaction,
> which will cost some small amount of bandwidth.

Yes and no. You're doing transactions, which already means you're
punctuating writes. And if you pre-allocate enough potential next
uberblocks you can make this cost very small, even when you have
back-to-back transactions in the pipeline.

> Honestly, I'm not sure that focusing on latency-sensitive streaming
> applications is worth it until we can get the bandwidth issues of ZFS
> nailed down. There's some work yet to reach the 95%-of-device-speed
> mark. How close does ZFS get to writing at 8 GB/sec on an F15K?

Sure.

> It's also worth noting that the customers for whom streaming is a real
> issue tend to be those who are willing to spend a lot of money for
> reliability (think replicating the whole system+storage) rather than
> compromising performance; for them, simply the checksumming overhead
> and lack of direct I/O in (today's) ZFS may be unacceptable.

Which is it? They want reliability, or they don't? It's not always a
reliability vs. performance trade-off. RAID-Z is a performance
improvement (for writes) over RAID-5 precisely because of the
COW/transactional + the ZFS block-checksums-in-pointers approach to
integrity protection.

The choice to use ZFS or not will depend on specific requirements that
one has identified. If you require a system where multiple cluster nodes
can write to the same filesystems concurrently then ZFS won't do for
you, for example. OTOH, if you want protection against bit rot then ZFS
is for you. Etcetera, etcetera.

> Is it
> worth the effort to change ZFS to satisfy the requirements of that
> relative handful of customers? I'd rather see us focus on adding
> functionality that we can use to sell Solaris to large numbers of
> customers, and thus building our customer base. We have a solution
> for streaming already, while we're just entering the reliability and
> ease-of-administration space, where the real opportunity lies.

I'm pretty sure that the ZFS team has the right set of priorities and
will adjust as necessary. I just don't buy your arguments about design
:-)

Nico
--
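A rough sketch of the recovery rule proposed above, with invented field
names and a SHA-256 stand-in for whatever checksum would actually be
used; it only illustrates how a candidate block in a pre-allocated slot
could be accepted or rejected as the next uberblock:

# Sketch only: a candidate in a pre-allocated slot is accepted as the next
# uberblock if its self-checksum verifies and it correctly checksums its
# predecessor. Field names are invented for illustration.

import hashlib, json

def digest(payload: dict) -> str:
    return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()

def make_uberblock(txg, prev):
    payload = {
        "txg": txg,
        "prev_checksum": prev["self_checksum"] if prev else None,
        "next_candidates": [txg + 1, txg + 2],   # pre-allocated successor slots
    }
    return {"payload": payload, "self_checksum": digest(payload)}

def looks_like_valid_successor(candidate, current):
    p = candidate.get("payload", {})
    return (candidate.get("self_checksum") == digest(p)              # self-checksum
            and p.get("prev_checksum") == current["self_checksum"])  # chains back

u1 = make_uberblock(1, None)
u2 = make_uberblock(2, u1)
garbage = {"payload": {"txg": 2}, "self_checksum": "not-a-real-checksum"}
print(looks_like_valid_successor(u2, u1),
      looks_like_valid_successor(garbage, u1))    # True False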
Anton Rang
2006-May-30 19:59 UTC
[zfs-discuss] Re: [osol-discuss] Re: I wish Sun would open-source"QFS"... / was:Re: Re: Distributed File System for Solaris
On May 30, 2006, at 2:16 PM, Richard Elling wrote:

> [assuming we're talking about disks and not "hardware RAID arrays"...]

It'd be interesting to know how many customers plan to use raw disks,
and how their performance relates to hardware arrays. (My gut feeling
is that a lot of disks on FC probably isn't too bad, though on parallel
SCSI the negotiation overhead and lack of fairness was awful, but I
haven't tested this.)

> On Tue, 2006-05-30 at 11:43 -0500, Anton Rang wrote:
>>> Sure, the block size may be 128KB, but ZFS can bundle more than one
>>> per-file/transaction
>>
>> But it doesn't right now, as far as I can tell.
>
> The protocol overhead is still orders of magnitude faster than a
> rev. Sure, there are pathological cases such as FC-AL over
> 200kms with 100+ nodes, but most folks won't hurt themselves like
> that.

OK. Let's take 4 Gb FC (e.g. array hardware). Sending 128 KB will take
roughly 330 microseconds. If we're going to achieve 95% of theoretical
rate, then each transaction can have no more than 5% of that for
overhead, or 16 microseconds. That's pretty darn fast. For that matter,
the Solaris host would have to initiate 3,000 writes per second to keep
the channel busy. For each channel. And a host might well have 20
channels. Can our FC stack do that? Not yet, though it's been looked
at....

At 16 MB [why 16? because we can't do 32 MB in a WRITE(10) command] we
have some more leeway. Sending 16 MB will take roughly 42 ms. Each
transaction can take 5% of that, or 2 ms, for overhead, and still reach
the 95% mark. And we only need to issue 24 commands per second to keep
the channel saturated. No problem....

Single disks still run FC at 2 Gb, so the numbers above are roughly
halved, and since it takes 2-4 disks to max out a channel, you can also
multiply the allowable overhead time on the disk by a factor of 2-4.
That gives the disk about 16*2*4 = 128 microseconds to process a
command. The disk might be able to do that. Solaris (and the HBA) still
need to push out 1500 writes per second (per channel), though. A good
HBA may be able to do that....

> For modern disks, multiple 128kByte transfers will spend a long time
> in the disk's buffer cache waiting to be written to media.

They shouldn't spend that long, really. Today's Cheetah has a 200 MB/sec
interface and a 59-118 MB/sec transfer rate to media, so at best we can
fill the cache a little over twice as fast as it empties. (Once we put
multiple disks on the channel, it's easy to have the cache empty faster
than we fill it -- this is actually the desirable case, so that we're
not waiting on the media.)

> Very few disks have 16MByte write buffer caches, so if you want to
> send such a large iop down the wire (DAS please, otherwise you kill
> the SAN), then you'll be waiting on the media anyway. The disk
> interconnect is faster than the media speed. I don't see how you could
> avoid blowing a rev in that case.

Yes, we'll wait on the media. We'll never lose a rev, though. Each track
on a Cheetah holds an average of 400 KB (1.6 MB/cylinder), so each time
that we change tracks, we'll likely have the buffer full with all the
data for the track. Even if we don't, FC transfers data out of order, so
the drive can re-order if it deems necessary (in the desirable
cache-empty case).

But until we have a well-configured test system to benchmark, this is
rather academic. :-) I suspect our customers will quickly tell us how
well ZFS works in their environments. Hopefully the answer will be "very
well" for the 95% of customers who are in the median; for those on the
"radical fringe" of I/O requirements, there will likely be more work to
do.

I'll wander off to wait for some real data. ;-)

-- Anton
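The arithmetic in the message above can be reproduced with a quick model
(assuming a 400 MB/s payload rate for 4 Gb FC and the 95% efficiency
target it states; a rough model, not a measurement):

# Rough model of the per-command overhead budget on a 4 Gb FC channel,
# following the numbers in the message above.

LINK_BYTES_PER_SEC = 400e6       # assumed payload rate of a 4 Gb FC channel
EFFICIENCY_TARGET = 0.95         # "95% of theoretical rate"

def overhead_budget(io_bytes):
    """Return (transfer time, allowed per-command overhead, commands/sec)."""
    transfer = io_bytes / LINK_BYTES_PER_SEC
    overhead = transfer * (1.0 - EFFICIENCY_TARGET)
    return transfer, overhead, 1.0 / transfer

for size in (128 * 1024, 16 * 1024 * 1024):
    t, o, rate = overhead_budget(size)
    print("%8d KB: transfer %6.0f us, overhead budget %5.0f us, %5.0f cmds/s"
          % (size // 1024, t * 1e6, o * 1e6, rate))

# Prints roughly:
#      128 KB: transfer    328 us, overhead budget    16 us,  3052 cmds/s
#    16384 KB: transfer  41943 us, overhead budget  2097 us,    24 cmds/s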
Richard Elling
2006-May-30 21:06 UTC
[zfs-discuss] Re: [osol-discuss] Re: I wish Sun would open-source"QFS"... / was:Re: Re: Distributed File System for Solaris
On Tue, 2006-05-30 at 14:59 -0500, Anton Rang wrote:
> On May 30, 2006, at 2:16 PM, Richard Elling wrote:
>
> > [assuming we're talking about disks and not "hardware RAID arrays"...]
>
> It'd be interesting to know how many customers plan to use raw disks,
> and how their performance relates to hardware arrays. (My gut feeling
> is that a lot of disks on FC probably isn't too bad, though on parallel
> SCSI the negotiation overhead and lack of fairness was awful, but I
> haven't tested this.)

FC-AL has much greater arbitration overhead than parallel SCSI, though
FC-AL is arguably more fair for targets. However, parallel SCSI is at
its end. SAS/SATA is taking over, relegating FC to the non-IP SAN.

> OK. Let's take 4 Gb FC (e.g. array hardware). Sending 128 KB will take
> roughly 330 microseconds. If we're going to achieve 95% of theoretical
> rate, then each transaction can have no more than 5% of that for
> overhead, or 16 microseconds. That's pretty darn fast. For that matter,
> the Solaris host would have to initiate 3,000 writes per second to keep
> the channel busy. For each channel. And a host might well have 20
> channels. Can our FC stack do that? Not yet, though it's been looked
> at....
>
> At 16 MB [why 16? because we can't do 32 MB in a WRITE(10) command] we
> have some more leeway. Sending 16 MB will take roughly 42 ms. Each
> transaction can take 5% of that, or 2 ms, for overhead, and still reach
> the 95% mark. And we only need to issue 24 commands per second to keep
> the channel saturated. No problem....
>
> Single disks still run FC at 2 Gb, so the numbers above are roughly
> halved, and since it takes 2-4 disks to max out a channel, you can also
> multiply the allowable overhead time on the disk by a factor of 2-4.
> That gives the disk about 16*2*4 = 128 microseconds to process a
> command. The disk might be able to do that. Solaris (and the HBA) still
> need to push out 1500 writes per second (per channel), though. A good
> HBA may be able to do that....
>
> > For modern disks, multiple 128kByte transfers will spend a long time
> > in the disk's buffer cache waiting to be written to media.
>
> They shouldn't spend that long, really. Today's Cheetah has a 200
> MB/sec interface and a 59-118 MB/sec transfer rate to media, so at best
> we can fill the cache a little over twice as fast as it empties. (Once
> we put multiple disks on the channel, it's easy to have the cache empty
> faster than we fill it -- this is actually the desirable case, so that
> we're not waiting on the media.)
>
> > Very few disks have 16MByte write buffer caches, so if you want to
> > send such a large iop down the wire (DAS please, otherwise you kill
> > the SAN), then you'll be waiting on the media anyway. The disk
> > interconnect is faster than the media speed. I don't see how you
> > could avoid blowing a rev in that case.
>
> Yes, we'll wait on the media. We'll never lose a rev, though. Each
> track on a Cheetah holds an average of 400 KB (1.6 MB/cylinder), so
> each time that we change tracks, we'll likely have the buffer full with
> all the data for the track.

Right. For a ST3300007LW (300 GByte UltraSCSI 320) a 16 MByte iop takes
approximately 50 ms to transfer over the SCSI bus. The media speed is
59-118 MBytes/s (270-136 ms). The default cache size is 8 MBytes. Since
the transfer is too big to fit in cache, you have to stall the transfer
or you overrun the buffer. For reads, the disk won't be able to keep the
bus busy either. Using such big block sizes doesn't gain you anything.

I suspect that the full size of the buffer is not available, since they
might want to use some of the space for the read cache, too.

> Even if we don't, FC transfers data out of order, so the drive can
> re-order if it deems necessary (in the desirable cache-empty case).

A single iop (even a 16 MByte one) will not be re-ordered.

> But until we have a well-configured test system to benchmark, this is
> rather academic. :-) I suspect our customers will quickly tell us how
> well ZFS works in their environments. Hopefully the answer will be
> "very well" for the 95% of customers who are in the median; for those
> on the "radical fringe" of I/O requirements, there will likely be more
> work to do.

Agree 100% :-)
 -- richard
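The ST3300007LW example above works out the same way (figures are the
ones quoted in the message; again a rough model, not a measurement):

# Rough model of the ST3300007LW example: how long a 16 MB iop spends on
# the Ultra320 bus vs. how long the media needs to absorb it.

IO_MB = 16.0
BUS_MB_PER_SEC = 320.0            # Ultra320 SCSI burst rate
MEDIA_MB_PER_SEC = (59.0, 118.0)  # inner to outer zone, from the message
CACHE_MB = 8.0                    # default write buffer size quoted above

bus_time = IO_MB / BUS_MB_PER_SEC
print("bus transfer: %.0f ms" % (bus_time * 1000))        # ~50 ms

for media in MEDIA_MB_PER_SEC:
    media_time = IO_MB / media
    print("media at %3.0f MB/s: %.0f ms" % (media, media_time * 1000))
    # ~271 ms and ~136 ms -- far longer than the bus transfer, so with an
    # 8 MB buffer the transfer must stall once the cache fills.

print("fits in cache: %s" % (IO_MB <= CACHE_MB))          # False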
Robert Milkowski
2006-May-30 22:37 UTC
[zfs-discuss] Re: [osol-discuss] Re: I wish Sun would open-source"QFS"... / was:Re: Re: Distributed File System for Solaris
Hello Anton,

Tuesday, May 30, 2006, 9:59:09 PM, you wrote:

AR> On May 30, 2006, at 2:16 PM, Richard Elling wrote:

>> [assuming we're talking about disks and not "hardware RAID arrays"...]

AR> It'd be interesting to know how many customers plan to use raw disks,
AR> and how their performance relates to hardware arrays. (My gut feeling
AR> is that a lot of disks on FC probably isn't too bad, though on
AR> parallel SCSI the negotiation overhead and lack of fairness was
AR> awful, but I haven't tested this.)

>> On Tue, 2006-05-30 at 11:43 -0500, Anton Rang wrote:
>>>> Sure, the block size may be 128KB, but ZFS can bundle more than one
>>>> per-file/transaction
>>>
>>> But it doesn't right now, as far as I can tell.
>>
>> The protocol overhead is still orders of magnitude faster than a
>> rev. Sure, there are pathological cases such as FC-AL over
>> 200kms with 100+ nodes, but most folks won't hurt themselves like
>> that.

AR> OK. Let's take 4 Gb FC (e.g. array hardware). Sending 128 KB will
AR> take roughly 330 microseconds. If we're going to achieve 95% of
AR> theoretical rate, then each transaction can have no more than 5% of
AR> that for overhead, or 16 microseconds. That's pretty darn fast. For
AR> that matter, the Solaris host would have to initiate 3,000 writes per
AR> second to keep the channel busy. For each channel. And a host might
AR> well have 20 channels. Can our FC stack do that? Not yet, though it's
AR> been looked at....

AR> At 16 MB [why 16? because we can't do 32 MB in a WRITE(10) command]
AR> we have some more leeway. Sending 16 MB will take roughly 42 ms. Each
AR> transaction can take 5% of that, or 2 ms, for overhead, and still
AR> reach the 95% mark. And we only need to issue 24 commands per second
AR> to keep the channel saturated. No problem....

AR> Single disks still run FC at 2 Gb, so the numbers above are roughly
AR> halved, and since it takes 2-4 disks to max out a channel, you can
AR> also multiply the allowable overhead time on the disk by a factor of
AR> 2-4. That gives the disk about 16*2*4 = 128 microseconds to process
AR> a command. The disk might be able to do that. Solaris (and the HBA)
AR> still need to push out 1500 writes per second (per channel), though.
AR> A good HBA may be able to do that....

>> For modern disks, multiple 128kByte transfers will spend a long time
>> in the disk's buffer cache waiting to be written to media.

AR> They shouldn't spend that long, really. Today's Cheetah has a 200
AR> MB/sec interface and a 59-118 MB/sec transfer rate to media, so at
AR> best we can fill the cache a little over twice as fast as it empties.
AR> (Once we put multiple disks on the channel, it's easy to have the
AR> cache empty faster than we fill it -- this is actually the desirable
AR> case, so that we're not waiting on the media.)

>> Very few disks have 16MByte write buffer caches, so if you want to
>> send such a large iop down the wire (DAS please, otherwise you kill
>> the SAN), then you'll be waiting on the media anyway. The disk
>> interconnect is faster than the media speed. I don't see how you
>> could avoid blowing a rev in that case.

AR> Yes, we'll wait on the media. We'll never lose a rev, though. Each
AR> track on a Cheetah holds an average of 400 KB (1.6 MB/cylinder), so
AR> each time that we change tracks, we'll likely have the buffer full
AR> with all the data for the track. Even if we don't, FC transfers data
AR> out of order, so the drive can re-order if it deems necessary (in the
AR> desirable cache-empty case).

AR> But until we have a well-configured test system to benchmark, this is
AR> rather academic. :-) I suspect our customers will quickly tell us how
AR> well ZFS works in their environments. Hopefully the answer will be
AR> "very well" for the 95% of customers who are in the median; for those
AR> on the "radical fringe" of I/O requirements, there will likely be
AR> more work to do.

AR> I'll wander off to wait for some real data. ;-)

Well, see my earlier posts here about some basic testing of sequential
writes with dd using different block sizes. It looks like using an 8MB
IO size gives much better real throughput (to UFS or a raw disk) than to
ZFS, where it's actually written using 128KB IO sizes. And the
difference is actually quite big.

It was noticed by Roch that it could be related to another issue with
ZFS which is being addressed -- however, I still feel that with
sequential writing, large IOs can actually give much better throughput
than just using 128KB.

ps. I was using FC disks directly connected to the host (JBOD) without
HW RAID.

-- 
Best regards,
 Robert                        mailto:rmilkowski at task.gda.pl
                               http://milek.blogspot.com
Darren J Moffat
2006-May-31 08:43 UTC
[zfs-discuss] Re: [osol-discuss] Re: I wish Sun would open-source"QFS"... / was:Re: Re: Distributed File System for Solaris
Anton Rang wrote:
> It's also worth noting that the customers for whom streaming is a real
> issue tend to be those who are willing to spend a lot of money for
> reliability (think replicating the whole system+storage) rather than
> compromising performance; for them, simply the checksumming overhead
> and lack of direct I/O in (today's) ZFS may be unacceptable. Is it

That statement to me is inconsistent. The customers want reliability,
but the way that ZFS provides it (in a way no other filesystem does
today) is too much of an overhead; sigh. So do they really want
reliability of their data or not?

What proof is there that the checksumming in ZFS is actually hurting
performance? We know that the implementation of the fletcher algorithms
can be improved on some systems (a test implementation exists for
UltraSPARC T1).

-- 
Darren J Moffat
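For reference, a simplified sketch of a Fletcher-4-style running-sum
checksum over 32-bit words. It shows why the per-word cost is only a few
additions, but it is not the actual ZFS implementation, which (as noted
above) can be optimized further per platform:

# Simplified Fletcher-4-style checksum: four running sums over 32-bit words.
# Illustrative only -- not the exact ZFS code or its byte-order handling.

import struct

def fletcher4(data: bytes):
    data = data + b"\x00" * (-len(data) % 4)   # pad to a word multiple
    a = b = c = d = 0
    for (w,) in struct.iter_unpack("<I", data):
        a = (a + w) & 0xFFFFFFFFFFFFFFFF       # 64-bit accumulators
        b = (b + a) & 0xFFFFFFFFFFFFFFFF
        c = (c + b) & 0xFFFFFFFFFFFFFFFF
        d = (d + c) & 0xFFFFFFFFFFFFFFFF
    return (a, b, c, d)

print(fletcher4(b"ZFS checksums every block"))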
Roch Bourbonnais - Performance Engineering
2006-May-31 13:56 UTC
[zfs-discuss] Re: [osol-discuss] Re: I wish Sun would open-source"QFS"... / was:Re: Re: Distributed File System for Solaris
Anton wrote:

  (For what it's worth, the current 128K-per-I/O policy of ZFS really
  hurts its performance for large writes. I imagine this would not be
  too difficult to fix if we allowed multiple 128K blocks to be
  allocated as a group.)

I'm not taking a stance on this, but if I keep a controller full of
128K I/Os, and assuming they are targeting contiguous physical blocks,
how different is that from issuing a very large I/O?

-r
Roch Bourbonnais - Performance Engineering
2006-May-31 14:03 UTC
[zfs-discuss] Re: [osol-discuss] Re: I wish Sun would open-source"QFS"... / was:Re: Re: Distributed File System for Solaris
> I think ZFS should do fine in streaming mode also, though there are
> currently some shortcomings, such as the mentioned 128K I/O size.

  It may eventually. The lack of direct I/O may also be an issue, since
  some of our systems don't have enough main memory bandwidth to support
  data being extensively touched by the CPU between capture (DMA in) and
  writing (DMA out).

Is the notion that 128K is too small based on directio results? Clearly
with directio, a single thread streaming data will see terrible
performance.

-r
Anton Rang
2006-May-31 14:48 UTC
[zfs-discuss] Re: [osol-discuss] Re: I wish Sun would open-source"QFS"... / was:Re: Re: Distributed File System for Solaris
On May 31, 2006, at 8:56 AM, Roch Bourbonnais - Performance
Engineering wrote:

> I'm not taking a stance on this, but if I keep a controller full of
> 128K I/Os, and assuming they are targeting contiguous physical blocks,
> how different is that from issuing a very large I/O?

There are differences at the host, the HBA, the disk or RAID
controller, and on the wire.

At the host:

The SCSI/FC/ATA stack is run once for each I/O. This takes a bit of
CPU. We generally take one interrupt for each I/O (if the CPU is fast
enough), so instead of taking one interrupt for 8 MB (for instance), we
take 64.

We run through the IOMMU or page translation code once per page, but
the overhead of initially setting up the IOMMU or starting the
translation loop happens once per I/O.

At the HBA:

There is some overhead each time that the controller switches
processing from one I/O to another. This isn't too large on a modern
system, but it does add up.

There is overhead on the PCI (or other) bus for the small transfers
that make up the command block and scatter/gather list for each I/O.
Again, it adds up (faster than you might expect, since PCI Express can
move 128 KB very quickly).

There is a limit on the maximum number of outstanding I/O requests, but
we're unlikely to hit this in normal use; it is typically at least 256
and more often 1024 or more on newer hardware. (This is shared for the
whole channel in the FC and SCSI case, and may be shared between
multiple channels for SAS or multi-port FC cards.)

There is often a small cache of commands which can be handled quickly;
commands outside of this cache (which may hold 4 to 16 or so) are much
slower to "context-switch" in when their data is needed; in particular,
the scatter/gather list may need to be read again.

At the disk or RAID:

There is a fixed overhead for processing each command. This can be
fairly readily measured, and roughly reflects the difference between
delivered 512-byte IOPs and bandwidth for a large I/O. Some of it is
related to parsing the CDB and starting command execution; some of it
is related to cache management.

There is some overhead for switching between data transfers for each
command. A typical track on a disk may hold 400K or so of data, and a
full-track transfer is optimal (runs at platter speed). A partial-track
transfer immediately followed by another may take enough time to switch
that we sometimes lose one revolution (particularly on disks which do
not have sector headers). Write caching should nearly eliminate this as
a concern, however.

There is a fixed-size window of commands that can be reordered on the
device. Data transfer within a command can be reordered arbitrarily
(for parallel SCSI and FC, though not for ATA or SAS). It's good to
have lots of outstanding commands, but if they are all sequential,
there's not much point (no reason to reorder them, except perhaps if
you're going backwards, and FC/SCSI can handle this anyway).

On the wire:

Sending a command and its completion takes time that could be spent
moving data instead; but for most protocols this probably isn't
significant.

You can actually see most of this with a PCI and protocol analyzer.

-- Anton
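To make the "it adds up" point concrete, a toy model comparing 64 x 128
KB commands against one 8 MB command on the 4 Gb FC channel discussed
earlier; the 20-microsecond per-command cost is an invented placeholder
for the host, HBA, and target overheads listed above, not a measured
value:

# Toy model: total time to move 8 MB as many small commands vs. one large
# command, given an assumed fixed per-command cost. Placeholder numbers.

LINK_BYTES_PER_SEC = 400e6      # assumed 4 Gb FC payload rate (as above)
PER_COMMAND_OVERHEAD_S = 20e-6  # hypothetical host+HBA+target cost per command

def total_time(total_bytes, io_size):
    n_commands = total_bytes // io_size
    transfer = total_bytes / LINK_BYTES_PER_SEC
    return transfer + n_commands * PER_COMMAND_OVERHEAD_S

eight_mb = 8 * 1024 * 1024
small = total_time(eight_mb, 128 * 1024)   # 64 commands
large = total_time(eight_mb, eight_mb)     # 1 command
wire = eight_mb / LINK_BYTES_PER_SEC
print("64 x 128 KB: %.2f ms, 1 x 8 MB: %.2f ms (%.0f%% vs %.0f%% of link rate)"
      % (small * 1e3, large * 1e3, 100 * wire / small, 100 * wire / large))
# Roughly: 64 x 128 KB: 22.25 ms, 1 x 8 MB: 20.99 ms (94% vs 100% of link rate)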
Bill Sommerfeld
2006-May-31 15:21 UTC
[zfs-discuss] Re: [osol-discuss] Re: I wish Sun would open-source"QFS"... / was:Re: Re: Distributed File System for Solaris
On Wed, 2006-05-31 at 10:48, Anton Rang wrote:
> We generally take one interrupt for each I/O
> (if the CPU is fast enough), so instead of taking one
> interrupt for 8 MB (for instance), we take 64.

Hunh. Gigabit ethernet devices typically implement some form of
interrupt blanking or coalescing so that the host cpu can batch I/O
completion handling. That doesn't exist in FC controllers?

Under continuous heavy load it can be more efficient to do polling
instead of interrupt-driven I/O.

					- Bill
Anton Rang
2006-May-31 16:00 UTC
[zfs-discuss] Re: [osol-discuss] Re: I wish Sun would open-source"QFS"... / was:Re: Re: Distributed File System for Solaris
On May 31, 2006, at 10:21 AM, Bill Sommerfeld wrote:
> Hunh. Gigabit ethernet devices typically implement some form of
> interrupt blanking or coalescing so that the host cpu can batch I/O
> completion handling. That doesn't exist in FC controllers?

Not in quite the same way, AFAIK. Usually there is a queue of completed
I/O operations, and one interrupt is generated each time the queue
becomes non-empty. If the host is relatively slow, you'll process a
number of I/O completions for one interrupt. If it's relatively fast,
you'll get one interrupt per completion (unless you poll in the driver
before re-enabling interrupts, which is reasonable if you believe the
load is heavy).

> Under continuous heavy load it can be more efficient to do polling
> instead of interrupt-driven I/O.

Yes.

-- Anton
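A toy simulation of the completion-queue behaviour described above: one
interrupt fires when the queue goes non-empty, so a slow host naturally
batches completions while a fast host takes one interrupt per I/O.
Purely illustrative; no real driver interface is modeled:

# Toy model: an interrupt is raised when a completion arrives while the
# host is idle; completions arriving during the service window are drained
# in the same interrupt (batched).

def count_interrupts(completion_times, service_time):
    interrupts = 0
    busy_until = float("-inf")
    for t in completion_times:
        if t >= busy_until:
            interrupts += 1               # queue went non-empty: new interrupt
            busy_until = t + service_time # host drains the queue until then
    return interrupts

arrivals = [i * 10.0 for i in range(64)]            # 64 completions, 10 us apart
print(count_interrupts(arrivals, service_time=2.0))   # fast host: 64 interrupts
print(count_interrupts(arrivals, service_time=45.0))  # slower host: ~13 interrupts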
Roland Mainz
2006-Jun-01 02:48 UTC
[zfs-discuss] Re: [osol-discuss] Re: I wish Sun would open-source "QFS"... /was:Re: Re: Distributed File System for Solaris
Casper.Dik at Sun.COM wrote:
> >UNIX admin wrote:
> >> > There's still an opening in the shared filesystem
> >> > space (multi-reader and multi-writer). Fix QFS, or extend ZFS?
> >>
> >> That one's a no-brainer, innit? Extend ZFS and plough on.
> >
> >Uhm... I think this is not that easy. Based on IRC feedback I think it
> >may be difficult to implement the intended features, e.g. storing
> >inodes and data on separate disks. We had several projects in the past
> >where this was the only way to guarantee good performance for realtime
> >data collection and processing, and due to the lack of such a feature
> >in ZFS we still need QFS...
>
> I'm assuming this means you've measured the performance and found ZFS
> wanting?

No, I did not test ZFS yet; I only discussed the matter on IRC so far.
But based on the original problems (see below) I do not think that ZFS
can deliver something (without the inode+data separation) that neither
IBM nor QFS without the inode+data split could deliver a few years ago
(even when backed with a giant RAID+caches (which caused more trouble
than expected, see below, too)).

> I don't get it; zfs is a copy-on-write filesystem, so there should
> be no hotspotting of disks and, theoretically, write performance
> could be maxed out.

What about read performance? And interactive users who are MAD and run
their stuff during data capture?

> The requirement is not that inodes and data are separate; the
> requirement is a specific upper bound on disk transactions. The
> question therefore is not "when will ZFS be able to separate inodes
> and data"; the question is when ZFS will meet the QoS criteria.

Uhm... that's the point where you are IMO slightly wrong. The exact
requirement is that inodes and data need to be separated. In this
specific case (and the setup was copied several times, so Sun made a
considerable amount of money with it :-) ) the inode data+log were put
on separate solid-state disks on separate SCSI controllers (which have
nearly zero seek time). The problem was that a high amount of inode
activity could starve the data recording and playback, something which
was unacceptable since running the matching experiment costs around
>=40000 Euro/minute.

Similar proposed setups provided by IBM failed (even with giant RAID
caches (which mainly were able to flatten the problems out, but
sometimes suffered from second-long "hiccups" where reads and writes
were stalled)) to deliver the requested performance (much data and much
inode traffic and tons of scripts and (to make it worse and even more
unpredictable) interactive users). The only working solution in this
case was to move inodes+log to the separate solid-state disks with their
own path (e.g. SCSI controller) for these data, freeing the data RAID
from such operations. The only alternative was to move everything to
solid-state disks - but that was considered to be far too expensive (or
better: we had already wasted too much money elsewhere... ;-( ).

----

Bye,
Roland

-- 
  __ .  . __
 (o.\ \/ /.o) roland.mainz at nrubsig.org
  \__\/\/__/  MPEG specialist, C&&JAVA&&Sun&&Unix programmer
  /O /==\ O\  TEL +49 641 7950090
 (;O/ \/ \O;)
Roland Mainz
2006-Jun-01 02:51 UTC
[zfs-discuss] Re: [osol-discuss] Re: I wish Sun wouldopen-source"QFS"... / was:Re: Re: Distributed File System for Solaris
Casper.Dik at Sun.COM wrote:
> >Well, I don't know about his particular case, but many QFS clients
> >have found the separation of data and metadata to be invaluable. The
> >primary reason is that it avoids disk seeks. We have QFS customers who
> >are running at over 90% of theoretical bandwidth on a medium-sized set
> >of FibreChannel controllers and need to maintain that streaming rate.
> >Taking a seek to update the on-disk inodes once a minute or so slowed
> >down transfers enough that QFS was invented. ;-)
>
> That does not answer the question I asked; since ZFS is a copy-on-write
> filesystem, there's no fixed inode location and streaming writes should
> always be possible.
>
> So, in theory ZFS can do this and mix metadata and data. That's why
> I asked for any practical input into this matter.
>
> There are, I think, four different outcomes possible of such an
> experiment and subsequent analysis:
>
> 	ZFS does just fine, thank you
> 	ZFS doesn't measure up but can be fixed without splitting meta data.
> 	ZFS doesn't measure up and can only be fixed by allowing a logical split
> 	ZFS doesn't measure up and cannot be fixed
>
> My money is on #2.

I strongly bet on #3 (assuming "logical split" means "data+inode" split)
based on real-world experience with other products from other vendors.

----

Bye,
Roland

-- 
  __ .  . __
 (o.\ \/ /.o) roland.mainz at nrubsig.org
  \__\/\/__/  MPEG specialist, C&&JAVA&&Sun&&Unix programmer
  /O /==\ O\  TEL +49 641 7950090
 (;O/ \/ \O;)
Roch
2006-Jun-02 13:54 UTC
[zfs-discuss] Re: [osol-discuss] Re: I wish Sun would open-source"QFS"... / was:Re: Re: Distributed File System for Solaris
Anton Rang writes:

 > On May 31, 2006, at 8:56 AM, Roch Bourbonnais - Performance
 > Engineering wrote:
 >
 > > I'm not taking a stance on this, but if I keep a controller full of
 > > 128K I/Os, and assuming they are targeting contiguous physical
 > > blocks, how different is that from issuing a very large I/O?
 >
 > There are differences at the host, the HBA, the disk or RAID
 > controller, and on the wire.
 >
 > At the host:
 >
 > The SCSI/FC/ATA stack is run once for each I/O. This takes a bit of
 > CPU. We generally take one interrupt for each I/O (if the CPU is fast
 > enough), so instead of taking one interrupt for 8 MB (for instance),
 > we take 64.
 >
 > We run through the IOMMU or page translation code once per page, but
 > the overhead of initially setting up the IOMMU or starting the
 > translation loop happens once per I/O.
 >
 > At the HBA:
 >
 > There is some overhead each time that the controller switches
 > processing from one I/O to another. This isn't too large on a modern
 > system, but it does add up.
 >
 > There is overhead on the PCI (or other) bus for the small transfers
 > that make up the command block and scatter/gather list for each I/O.
 > Again, it adds up (faster than you might expect, since PCI Express
 > can move 128 KB very quickly).
 >
 > There is a limit on the maximum number of outstanding I/O requests,
 > but we're unlikely to hit this in normal use; it is typically at
 > least 256 and more often 1024 or more on newer hardware. (This is
 > shared for the whole channel in the FC and SCSI case, and may be
 > shared between multiple channels for SAS or multi-port FC cards.)
 >
 > There is often a small cache of commands which can be handled
 > quickly; commands outside of this cache (which may hold 4 to 16 or
 > so) are much slower to "context-switch" in when their data is
 > needed; in particular, the scatter/gather list may need to be read
 > again.
 >
 > At the disk or RAID:
 >
 > There is a fixed overhead for processing each command. This can be
 > fairly readily measured, and roughly reflects the difference between
 > delivered 512-byte IOPs and bandwidth for a large I/O. Some of it is
 > related to parsing the CDB and starting command execution; some of
 > it is related to cache management.
 >
 > There is some overhead for switching between data transfers for each
 > command. A typical track on a disk may hold 400K or so of data, and
 > a full-track transfer is optimal (runs at platter speed). A
 > partial-track transfer immediately followed by another may take
 > enough time to switch that we sometimes lose one revolution
 > (particularly on disks which do not have sector headers). Write
 > caching should nearly eliminate this as a concern, however.
 >
 > There is a fixed-size window of commands that can be reordered on
 > the device. Data transfer within a command can be reordered
 > arbitrarily (for parallel SCSI and FC, though not for ATA or SAS).
 > It's good to have lots of outstanding commands, but if they are all
 > sequential, there's not much point (no reason to reorder them,
 > except perhaps if you're going backwards, and FC/SCSI can handle
 > this anyway).
 >
 > On the wire:
 >
 > Sending a command and its completion takes time that could be spent
 > moving data instead; but for most protocols this probably isn't
 > significant.
 >
 > You can actually see most of this with a PCI and protocol analyzer.
 >
 > -- Anton

So the main question: does any of this cause a full flush of the
pipelined operations?

If it is just extra busy-ness of the individual components, all
operating concurrently, and if we don't saturate anybody because of the
extra work, then it seems to me that we are fine. So clearly there may
be a few extra bubbles that find their way into the pipe and we can lose
the last few bleeding-edge percent of throughput. Those guys are on QFS
and delighted to be (and they should be, QFS is outstanding in that
market).

-r