Recently, I've been reading through the ZIL/slog discussion and have the impression that a lot of folks here are (like me) interested in getting a viable solution for a cheap, fast and reliable ZIL device. I think I can provide such a solution for about $200, but it involves a lot of development work.

The basic idea: the main problem when using a HDD as a ZIL device is the cache flushes in combination with the linear write pattern of the ZIL. This leads to a whole rotation of the platter after each write, because after the first write returns, the head is already past the sector that will be written next. My idea goes as follows: don't write linearly. Track the rotation and write to the position the head will hit next. This might be done by a re-mapping layer or integrated into ZFS. This works only because ZIL devices are basically write-only. Reads from this device will be horribly slow.

I have done some testing and am quite enthusiastic. If I take a decent SAS disk (like the Hitachi Ultrastar C10K300), I can raise the synchronous write performance from 166 writes/s to about 2000 writes/s (!). 2000 IOPS is more than sufficient for our production environment.

Currently I'm implementing a re-mapping driver for this. The reason I'm writing to this list is that I'd like to find support from the zfs team, find sparring partners to discuss implementation details and algorithms and, most important, find testers!

If there is interest, it would be great to build an official project around it. I'd be willing to contribute most of the code, but any help will be more than welcome.

So, anyone interested? :)

--
Arne Jansen
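A rough sketch of the arithmetic behind those two numbers (assuming a 10,000 rpm drive, i.e. about 6 ms per revolution): if the ZIL allocates the next log block directly behind the one it just flushed, the head has to wait almost a full revolution per synchronous write, so the ceiling is about 1 / 6 ms, roughly 166 writes/s. Writing instead to whatever sector the head will reach next (allowing for the drive's command setup time) brings the observed service time down to about 0.5 ms per write, roughly 1/12 of a revolution, which is where the ~2000 writes/s figure comes from.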
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of sensille
>
> The basic idea: the main problem when using a HDD as a ZIL device
> is the cache flushes in combination with the linear write pattern
> of the ZIL. This leads to a whole rotation of the platter after
> each write, because after the first write returns, the head is
> already past the sector that will be written next.
> My idea goes as follows: don't write linearly. Track the rotation
> and write to the position the head will hit next. This might be done
> by a re-mapping layer or integrated into ZFS. This works only because
> ZIL devices are basically write-only. Reads from this device will be
> horribly slow.

This is a really interesting idea, but I think you've hurt yourself in the way you described the problem - and additionally, I was recently corrected for misusing the terms you just misused (saying "ZIL" != saying "ZIL on dedicated log device"). So I'll try to clarify what you just said:

The reason why hard drives are less effective as ZIL dedicated log devices compared to such things as SSDs is the rotation of the hard drives: the physical time to seek a random block. There may be a possibility to use hard drives as dedicated log devices, cheaper than SSDs with possibly comparable latency, if you can intelligently eliminate the random seek - if you have a way to tell the hard drive "Write this data to whatever block happens to be available at minimum seek time."

For rough estimates: assume the drive is using Zone Density Recording, like this:
http://www.dewassoc.com/kbase/hard_drives/hard_disk_sector_structures.htm
Suppose you're able to keep your hard drive head on the outer sectors. Suppose 1000 sectors per track (I have no idea if that's accurate, but at least according to the above article it was ballpark realistic in the year 2000). Suppose 10krpm. Then the physical seek time could theoretically be brought down to as low as a few microseconds (one revolution takes 6 ms, so one sector passes roughly every 6 microseconds). Of course, that's not realistic - some sectors may already be used, the electronics themselves could be a factor - but the point remains, the physical seek time can be effectively eliminated. At least in theory. And that was the year 2000.

> I have done some testing and am quite enthusiastic. If I take a
> decent SAS disk (like the Hitachi Ultrastar C10K300), I can raise
> the synchronous write performance from 166 writes/s to about
> 2000 writes/s (!). 2000 IOPS is more than sufficient for our
> production environment.

Um ... careful there. There are many apples, oranges, and bananas to be compared inaccurately against each other. When I measure IOPS of physical disks, with all the caches disabled, I get anywhere from 200 to 2400 for a single spindle disk (SAS 10k), and I get anywhere from 2000 to 6000 with an SSD (SATA), just depending on the benchmark configuration, because ZFS is doing all sorts of acceleration behind the scenes which make the results vary *immensely* from some IOPS number that you look up online. You've got to be sure you measure something, then change *only one thing* and measure again, to get a good measurement. You've got to toggle back and forth a few times, and see that the results are repeatable. And *only* then do you have a solid result.

> Currently I'm implementing a re-mapping driver for this. The
> reason I'm writing to this list is that I'd like to find support
> from the zfs team, find sparring partners to discuss implementation
> details and algorithms and, most important, find testers!

So you believe you can know the drive geometry, the instantaneous head position, and the next available physical block address in software? No need for special hardware? That's cool. I hope there aren't any "gotchas" as-yet undiscovered.
Edward Ned Harvey wrote:

> The reason why hard drives are less effective as ZIL dedicated log devices
> compared to such things as SSDs is the rotation of the hard drives:
> the physical time to seek a random block. There may be a possibility to
> use hard drives as dedicated log devices, cheaper than SSDs with possibly
> comparable latency, if you can intelligently eliminate the random seek -
> if you have a way to tell the hard drive "Write this data to whatever
> block happens to be available at minimum seek time."

Thanks for rephrasing my idea :) The only thing I'd like to point out is that ZFS doesn't do random writes on a slog, but nearly linear writes. This might even be hurting performance more than random writes, because you always hit the worst case of one full rotation.

> For rough estimates: assume the drive is using Zone Density Recording ...
> Suppose 1000 sectors per track ... Suppose 10krpm. ... the point remains,
> the physical seek time can be effectively eliminated. At least in theory.

The mentioned Hitachi disk (at least the one I have in my test machine) has 1764 sectors on head 1 and 1680 sectors on head 2 in the first zone, which spans 50 tracks. I'm quite sure the limiting factor is the electronics. This disk needs the write command about 140 sectors in advance. It may be that the servo information on the platters also has to be taken into account. Other disks don't behave that well: I tried with 1TB SATA disks, but they don't seem to have any predictable timing.

> Um ... careful there. There are many apples, oranges, and bananas to be
> compared inaccurately against each other. When I measure IOPS of physical
> disks, with all the caches disabled, I get anywhere from 200 to 2400 for a
> single spindle disk (SAS 10k), and I get anywhere from 2000 to 6000 with an
> SSD (SATA), just depending on the benchmark configuration.

The measurement is simple: disable the write cache, write one sector, when that write returns, calculate the next optimal sector to write to, write, calculate again... This gives a quite stable result of about 2000 writes/s or 0.5 ms average service time, single threaded. No ZFS involved, just pure disk performance.

> So you believe you can know the drive geometry, the instantaneous head
> position, and the next available physical block address in software? No
> need for special hardware? That's cool. I hope there aren't any "gotchas"
> as-yet undiscovered.

Yes, I already did a mapping of several drives. I measured at least the track length, the interleave needed between two writes and the interleave if a track-to-track seek is involved. Of course you can always learn more about a disk, but that's a good starting point.

--
Arne
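A minimal sketch of such a measurement loop (hypothetical code, not the actual driver; it assumes a raw disk device with the write cache disabled, 512-byte sectors, writes confined to a single track, previously measured values for rotation period and command lead, and that the phase - which sector passes the head at time zero - has already been calibrated):

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/time.h>

    #define SECTOR            512
    #define SECTORS_PER_TRACK 1764    /* measured, outer zone, head 1 */
    #define ROTATION_US       6000    /* 10k rpm -> 6 ms per revolution */
    #define LEAD_SECTORS      140     /* measured command setup lead */

    static uint64_t now_us(void)
    {
            struct timeval tv;
            gettimeofday(&tv, NULL);
            return ((uint64_t)tv.tv_sec * 1000000 + tv.tv_usec);
    }

    int main(int argc, char **argv)
    {
            char buf[SECTOR];
            uint64_t start, t;
            uint32_t head_pos, sector = 0;
            int i, n = 2000;
            int fd = open(argv[1], O_WRONLY | O_SYNC);

            if (fd < 0)
                    return (1);
            memset(buf, 0, sizeof (buf));
            start = now_us();
            for (i = 0; i < n; i++) {
                    pwrite(fd, buf, SECTOR, (off_t)sector * SECTOR);
                    /*
                     * Estimate the current angular position of the head
                     * from elapsed wall-clock time, then aim far enough
                     * ahead that the command arrives before the target
                     * sector passes under the head.
                     */
                    t = (now_us() - start) % ROTATION_US;
                    head_pos = (uint32_t)(t * SECTORS_PER_TRACK / ROTATION_US);
                    sector = (head_pos + LEAD_SECTORS) % SECTORS_PER_TRACK;
            }
            printf("%.0f writes/s\n",
                (double)n * 1000000 / (now_us() - start));
            return (0);
    }

Replacing the rotation-aware target with sector = (sector + 1) % SECTORS_PER_TRACK should reproduce roughly the ~166 writes/s linear worst case for comparison.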
On 26 May, 2010 - sensille sent me these 4,5K bytes:

> Thanks for rephrasing my idea :) The only thing I'd like to point out is that
> ZFS doesn't do random writes on a slog, but nearly linear writes. This might
> even be hurting performance more than random writes, because you always hit
> the worst case of one full rotation.

A simple test would be to change "write block X", "write block X+1", "write block X+2" into "write block X", "write block X+4", "write block X+8" or something, so it might manage to send the command before the head has travelled over to block X+4 etc.. I guess basically, you want to do something like TCQ/NCQ, but without the Q.. placing writes optimally..

> > So you believe you can know the drive geometry, the instantaneous head
> > position, and the next available physical block address in software? No
> > need for special hardware? That's cool. I hope there aren't any "gotchas"
> > as-yet undiscovered.
>
> Yes, I already did a mapping of several drives. I measured at least the track
> length, the interleave needed between two writes and the interleave if a
> track-to-track seek is involved. Of course you can always learn more about a
> disk, but that's a good starting point.

Since X, X+1, X+2 seems to be the worst case, try just skipping over a few blocks.. Doubling (or so) the performance with a single software tweak would surely be welcome.

/Tomas
--
Tomas Ögren, stric at acc.umu.se, http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se
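A quick way to try that suggestion (hypothetical, reusing the measurement loop sketched above): replace the rotation-aware target with a fixed stride and sweep the stride value to find the sweet spot, e.g.

    /* fixed-stride variant: no head tracking, just skip ahead */
    sector = (sector + stride) % SECTORS_PER_TRACK;

A stride somewhat larger than the number of sectors the head travels during one command setup should already avoid the full-rotation worst case, without needing any timing knowledge at all.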
On Wed, 26 May 2010, sensille wrote:

> My idea goes as follows: don't write linearly. Track the rotation
> and write to the position the head will hit next. This might be done
> by a re-mapping layer or integrated into ZFS. This works only because
> ZIL devices are basically write-only. Reads from this device will be
> horribly slow.

I like your idea. It would require a profiling application to learn the physical geometry and timing of a given disk drive in order to save the configuration data for it. The timing could vary under heavy system load, so the data needs to be sent early enough that it will always be there when needed. The profiling application might need to drive a disk for several hours (or a day) in order to fully understand how it behaves. Remapped failed sectors would cause this micro-timing to fail, but only for the remapped sectors.

Bob

--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Bob Friesenhahn wrote:

> I like your idea. It would require a profiling application to learn the
> physical geometry and timing of a given disk drive in order to save the
> configuration data for it. The timing could vary under heavy system load,
> so the data needs to be sent early enough that it will always be there
> when needed. The profiling application might need to drive a disk for
> several hours (or a day) in order to fully understand how it behaves.

A day is a good landmark. Currently the application runs several hours just to map the tracks. But there's lots of room for algorithms that measure and fine-tune on the fly. Every write is also a measurement.

> Remapped failed sectors would cause this micro-timing to fail,
> but only for the remapped sectors.

Of course you could detect those remapped sectors because of the failed timing and stop using them in the future :)

--
Arne
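A sketch of how such a profiling pass could estimate the command lead and track length (hypothetical, reusing the now_us() helper and setup from the measurement loop above): write sector 0 synchronously, then sector k, and record the service time of the second write for growing k. The time drops sharply once k exceeds the command lead, and jumps back up by a full rotation once k passes the end of the track.

    /* probe relative timing of sector 0 followed by sector k */
    int k;
    for (k = 1; k < 4000; k++) {
            pwrite(fd, buf, SECTOR, 0);
            t = now_us();
            pwrite(fd, buf, SECTOR, (off_t)k * SECTOR);
            printf("%d %llu\n", k, (unsigned long long)(now_us() - t));
    }

Repeating the same probe across a track boundary would give the track-to-track interleave mentioned above.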
On 05/26/10 07:10, sensille wrote:

> The basic idea: the main problem when using a HDD as a ZIL device
> is the cache flushes in combination with the linear write pattern
> of the ZIL. This leads to a whole rotation of the platter after
> each write, because after the first write returns, the head is
> already past the sector that will be written next.
> My idea goes as follows: don't write linearly. Track the rotation
> and write to the position the head will hit next. This might be done
> by a re-mapping layer or integrated into ZFS. This works only because
> ZIL devices are basically write-only. Reads from this device will be
> horribly slow.

Yes, I agree this seems very appealing. I have investigated and observed similar results: just allocating larger intent log blocks but only writing to, say, the first half of them shows the same effect. Despite the impressive results, we have not pursued this further, mainly because of its maintainability. There is quite a variance between drives so, as mentioned, feedback profiling of the device is needed in the working system. The layering of the Solaris IO subsystem doesn't provide the feedback necessary, and the ZIL code is layered on the SPA/DMU. Still, it should be possible. Good luck!

Neil.
On May 26, 2010, at 8:38 AM, Neil Perrin wrote:

> Yes, I agree this seems very appealing. I have investigated and observed
> similar results: just allocating larger intent log blocks but only writing
> to, say, the first half of them shows the same effect. Despite the
> impressive results, we have not pursued this further, mainly because of
> its maintainability. There is quite a variance between drives so, as
> mentioned, feedback profiling of the device is needed in the working
> system. The layering of the Solaris IO subsystem doesn't provide the
> feedback necessary, and the ZIL code is layered on the SPA/DMU.
> Still, it should be possible. Good luck!

I agree. If you search the literature, you will find many cases where people have tried to optimize file systems based on device geometry, and all have ended up as roadkill. File systems last much longer than the hardware, and writing hardware-specific optimizations into the file system just doesn't make good sense.

Meanwhile, though there are doubters, Intel's datasheet for the X25-V clearly states support for the ATA FLUSH CACHE feature. These can be bought for around $120 and can do 2,500 random write IOPS.
http://download.intel.com/design/flash/nand/value/datashts/322736.pdf
Similarly, for the X25-E:
http://download.intel.com/design/flash/nand/extreme/319984.pdf

I think the effort is better spent making sure the SSD vendors do the right thing.
 -- richard

--
Richard Elling
richard at nexenta.com   +1-760-896-4422
ZFS and NexentaStor training, Rotterdam, July 13-15, 2010
http://nexenta-rotterdam.eventbrite.com/
Richard Elling wrote:

> I agree. If you search the literature, you will find many cases where
> people have tried to optimize file systems based on device geometry,
> and all have ended up as roadkill. File systems last much longer than
> the hardware, and writing hardware-specific optimizations into the file
> system just doesn't make good sense.

I see the point that the filesystem itself is not the right place for this kind of optimization.

> Meanwhile, though there are doubters, Intel's datasheet for the X25-V
> clearly states support for the ATA FLUSH CACHE feature. These can
> be bought for around $120 and can do 2,500 random write IOPS.
> http://download.intel.com/design/flash/nand/value/datashts/322736.pdf
> Similarly, for the X25-E:
> http://download.intel.com/design/flash/nand/extreme/319984.pdf

The datasheet states that they understand the command, yes. I haven't tested it myself, but there are many indications on the net that they do not honor it properly, at least for the X25-E. As to the 2,500 writes/s, the datasheet says "up to", using a queue depth of 32 and utilizing the write cache. Similarly, I just tested a Hitachi 15k disk to see how many linear 4k writes I can issue, and it can handle approx. 20000 writes/s. This is a completely useless number, because as soon as I insert cache flushes it drops down to 250/s (or 15k/minute, of course). Don't get me wrong, I would be glad if SSDs lived up to their promises, it would save us a lot of trouble, but I don't see that they are there yet.

> I think the effort is better spent making sure the SSD vendors do the
> right thing.

That might be true if I had any influence with Intel. I think this is the responsibility of big companies like Oracle and NetApp. All I can do is not buy broken hardware.

--
Arne
Neil Perrin wrote:

> Yes, I agree this seems very appealing. I have investigated and observed
> similar results: just allocating larger intent log blocks but only writing
> to, say, the first half of them shows the same effect. Despite the
> impressive results, we have not pursued this further, mainly because of
> its maintainability. There is quite a variance between drives so, as
> mentioned, feedback profiling of the device is needed in the working
> system. The layering of the Solaris IO subsystem doesn't provide the
> feedback necessary, and the ZIL code is layered on the SPA/DMU.
> Still, it should be possible. Good luck!

Thanks :) Though I hoped to get a different answer. An integration into the ZFS code would be much more elegant, but of course in a few years the necessity for this optimization will be gone, when SSDs are cheap, fast and reliable.

There seems to be some interest in this idea here. Would it make sense to start a project for it? Currently I'm implementing a driver as a proof of concept, but I'm in need of a lot of discussions about algorithms and concepts, and maybe some code reviews. Can I count on some support from here?

--Arne
Edward Ned Harvey wrote:

>> From: sensille [mailto:sensille at gmx.net]
>>
>> The only thing I'd like to point out is that
>> ZFS doesn't do random writes on a slog, but nearly linear writes. This
>> might even be hurting performance more than random writes, because you
>> always hit the worst case of one full rotation.
>
> Um ... I certainly have a doubt about this. My understanding is that hard
> disks are already optimized for sustained sequential throughput. I have a
> really hard time believing Seagate, WD, etc, designed their drives such that
> you read/write one track, then pause and wait for a full rotation, then
> read/write one track, and wait again, and so forth. This would limit the
> drive to approx 50% duty cycle, and the market is very competitive.
>
> Yes, I am really quite sure, without any knowledge at all, that the drive
> mfgrs are intelligent enough to map the logical blocks in such a way that
> sequential reads/writes which are larger than a single track will not suffer
> such a huge penalty. Just a small penalty to jump up one track, and wait
> for a few degrees of rotation, not 360 degrees.

I'm afraid you got me wrong here. Of course the drives are optimized for sequential reads/writes. If you give the drive a single read or write that is larger than one track, the drive acts exactly as you described. The same holds if you give the drive multiple smaller consecutive reads/writes in advance (NCQ/TCQ), so that the drive can coalesce them into one big op.

But this is not what happens in the case of ZFS/ZIL with a single application. The application requests a synchronous op. This request goes down into ZFS, which in turn allocates a ZIL block, writes it to the disk and issues a cache flush. Only after the cache flush completes can ZFS acknowledge the op to the application. Now the application can issue the next op, for which ZFS will again allocate a ZIL block, probably immediately after the previous one. It writes the block and issues a flush. But in the meantime the head has traveled some sectors down the track. To physically write the block, the drive of course has to wait until the sector is under the head again, which means waiting nearly one full rotation. If ZFS had chosen a block appropriately further down the track, chances are the head would not have passed it yet, and the drive could write it without a big rotational delay.
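A minimal sketch of the core of such a re-mapping layer (hypothetical names and interface, assuming <stdint.h> types and that rotation period, sectors per track and command lead were measured beforehand): instead of placing the next ZIL block at the logically next sector, the layer picks the physical sector the command can still catch.

    typedef struct rotlog {
            uint32_t spt;       /* sectors per track in the current zone */
            uint32_t lead;      /* command setup lead, in sectors        */
            uint64_t rot_us;    /* one revolution, in microseconds       */
            uint64_t phase_us;  /* wall-clock time sector 0 last passed  */
    } rotlog_t;

    /* pick the physical sector a write issued at 'now_us' can still catch */
    static uint32_t
    rotlog_next_sector(const rotlog_t *rl, uint64_t now_us)
    {
            uint64_t angle = (now_us - rl->phase_us) % rl->rot_us;
            uint32_t head  = (uint32_t)(angle * rl->spt / rl->rot_us);

            return ((head + rl->lead) % rl->spt);
    }

A real driver would additionally have to skip sectors still occupied by live log blocks and handle track-to-track seeks, which is where the measured track-to-track interleave comes in.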
(resent because of received bounce)

Edward Ned Harvey wrote:

> So this brings me back to the question I indirectly asked in the middle of a
> much longer previous email -
>
> Is there some way, in software, to detect the current position of the head?
> If not, then I only see two possibilities:
>
> Either you have some previous knowledge (or assumptions) about the drive
> geometry, rotation speed, and wall clock time passed since the last write
> completed, and use this (possibly vague or inaccurate) info to make your
> best guess what available blocks are accessible with minimum latency next
> ...

That is my approach currently, and it works quite well. I obtain the prior knowledge through a special measuring process run before first using the disk. To keep the driver in sync with the disk during idle times, it issues dummy ops at regular intervals, say 20 per second.

> or else some sort of new hardware behavior would be necessary. Possibly a
> "special" type of drive, which always assumes a command to write to a
> magical block number actually means "write to the next available" block or
> something like that ... or reading from a magical block actually tells you
> the position of the head or something like that...

That would be nice. But what would be much nicer is a drive with an extremely small setup time. Current drives need the command 0.4-0.7 ms in advance, depending on manufacturer and drive type.
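A sketch of how those dummy ops could keep the phase estimate from drifting (hypothetical, building on the rotlog_t structure sketched earlier): every completed write, real or dummy, tells us that its target sector was under the head at completion time, so it can re-anchor the phase.

    /* re-anchor the phase estimate from a write that just completed */
    static void
    rotlog_observe(rotlog_t *rl, uint32_t sector, uint64_t done_us)
    {
            /* wall-clock time at which sector 0 last passed the head */
            rl->phase_us = done_us -
                (uint64_t)sector * rl->rot_us / rl->spt;
    }

During idle periods, calling this from a dummy single-sector write every ~50 ms keeps the estimate fresh and absorbs small spindle-speed drift.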
On 5/27/2010 10:33 AM, sensille wrote:

> That is my approach currently, and it works quite well. I obtain the prior
> knowledge through a special measuring process run before first using the
> disk. To keep the driver in sync with the disk during idle times, it issues
> dummy ops at regular intervals, say 20 per second.
>
> That would be nice. But what would be much nicer is a drive with an
> extremely small setup time. Current drives need the command 0.4-0.7 ms in
> advance, depending on manufacturer and drive type.

Technology like the DDRdrive X1 (which is well beyond $200) doesn't have this problem. The setup times for that kind of hardware are measured in usec. (I.e. measured in PCI cycles.)

 - Garrett
I have a Sun A5000, 22x 73GB 15K disks in split-bus configuration, two dual 2Gb HBAs and four fibre cables from server to array, all for just under $200. The array gives 4Gb of aggregate throughput in each direction across two 11-disk buses. Right now it is the main array, but when we outgrow its storage it will become a multiple external ZIL / L2ARC array for a slow SATA array. Admittedly, it is rare for all of the pieces to come together at the right price like this, and since it is unsupported no one would seriously consider it for production. At the same time, it makes blistering main storage today and will provide for amazing IOPS against slow storage later.