Recently, I've been reading through the ZIL/slog discussion and have the impression that a lot of folks here are (like me) interested in getting a viable solution for a cheap, fast and reliable ZIL device. I think I can provide such a solution for about $200, but it involves a lot of development work.

The basic idea: the main problem when using a HDD as a ZIL device is the cache flushes in combination with the linear write pattern of the ZIL. This leads to a whole rotation of the platter after each write, because after the first write returns, the head is already past the sector that will be written next. My idea goes as follows: don't write linearly. Track the rotation and write to the position the head will hit next. This might be done by a re-mapping layer or integrated into ZFS. This works only because ZIL devices are basically write-only. Reads from this device will be horribly slow.

I have done some testing and am quite enthusiastic. If I take a decent SAS disk (like the Hitachi Ultrastar C10K300), I can raise the synchronous write performance from 166 writes/s to about 2000 writes/s (!). 2000 IOPS is more than sufficient for our production environment.

Currently I'm implementing a re-mapping driver for this. The reason I'm writing to this list is that I'd like to find support from the zfs team, find sparring partners to discuss implementation details and algorithms and, most important, find testers!

If there is interest, it would be great to build an official project around it. I'd be willing to contribute most of the code, but any help will be more than welcome.

So, anyone interested? :)

--
Arne Jansen
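A rough sketch of the arithmetic behind those two numbers (assuming a 10,000 rpm drive, i.e. about 6 ms per revolution): if the ZIL allocates the next log block directly behind the one it just flushed, the head has to wait almost a full revolution per synchronous write, so the ceiling is about 1 / 6 ms, roughly 166 writes/s. Writing instead to whatever sector the head will reach next (allowing for the drive's command setup time) brings the observed service time down to about 0.5 ms per write, roughly 1/12 of a revolution, which is where the ~2000 writes/s figure comes from.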
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of sensille
>
> The basic idea: the main problem when using a HDD as a ZIL device
> is the cache flushes in combination with the linear write pattern
> of the ZIL. This leads to a whole rotation of the platter after
> each write, because after the first write returns, the head is
> already past the sector that will be written next.
> My idea goes as follows: don't write linearly. Track the rotation
> and write to the position the head will hit next. This might be done
> by a re-mapping layer or integrated into ZFS. This works only because
> ZIL devices are basically write-only. Reads from this device will be
> horribly slow.

This is a really interesting idea, but I think you've hurt yourself in the way you described the problem - and additionally, I was recently corrected for misusing the terms you just misused (saying "ZIL" != saying "ZIL on dedicated log device"). So I'll try to clarify what you just said:

The reason why hard drives are less effective as ZIL dedicated log devices compared to such things as SSDs is the rotation of the hard drives: the physical time to seek a random block. There may be a possibility to use hard drives as dedicated log devices, cheaper than SSDs with possibly comparable latency, if you can intelligently eliminate the random seek - if you have a way to tell the hard drive "Write this data to whatever block happens to be available at minimum seek time."

For rough estimates: assume the drive is using Zone Density Recording, like this:
http://www.dewassoc.com/kbase/hard_drives/hard_disk_sector_structures.htm
Suppose you're able to keep your hard drive head on the outer sectors. Suppose 1000 sectors per track (I have no idea if that's accurate, but at least according to the above article it was ballpark realistic in the year 2000). Suppose 10krpm. Then the physical seek time could theoretically be brought down to as low as a few microseconds (one revolution takes 6 ms, so one sector passes roughly every 6 microseconds). Of course, that's not realistic - some sectors may already be used, the electronics themselves could be a factor - but the point remains, the physical seek time can be effectively eliminated. At least in theory. And that was the year 2000.

> I have done some testing and am quite enthusiastic. If I take a
> decent SAS disk (like the Hitachi Ultrastar C10K300), I can raise
> the synchronous write performance from 166 writes/s to about
> 2000 writes/s (!). 2000 IOPS is more than sufficient for our
> production environment.

Um ... careful there. There are many apples, oranges, and bananas to be compared inaccurately against each other. When I measure IOPS of physical disks, with all the caches disabled, I get anywhere from 200 to 2400 for a single spindle disk (SAS 10k), and I get anywhere from 2000 to 6000 with an SSD (SATA), just depending on the benchmark configuration, because ZFS is doing all sorts of acceleration behind the scenes which make the results vary *immensely* from some IOPS number that you look up online. You've got to be sure you measure something, then change *only one thing* and measure again, to get a good measurement. You've got to toggle back and forth a few times, and see that the results are repeatable. And *only* then do you have a solid result.

> Currently I'm implementing a re-mapping driver for this. The
> reason I'm writing to this list is that I'd like to find support
> from the zfs team, find sparring partners to discuss implementation
> details and algorithms and, most important, find testers!

So you believe you can know the drive geometry, the instantaneous head position, and the next available physical block address in software? No need for special hardware? That's cool. I hope there aren't any "gotchas" as-yet undiscovered.
Edward Ned Harvey wrote:

> The reason why hard drives are less effective as ZIL dedicated log devices
> compared to such things as SSDs is the rotation of the hard drives:
> the physical time to seek a random block. There may be a possibility to
> use hard drives as dedicated log devices, cheaper than SSDs with possibly
> comparable latency, if you can intelligently eliminate the random seek -
> if you have a way to tell the hard drive "Write this data to whatever
> block happens to be available at minimum seek time."

Thanks for rephrasing my idea :) The only thing I'd like to point out is that ZFS doesn't do random writes on a slog, but nearly linear writes. This might even be hurting performance more than random writes, because you always hit the worst case of one full rotation.

> For rough estimates: assume the drive is using Zone Density Recording ...
> Suppose 1000 sectors per track ... Suppose 10krpm. ... the point remains,
> the physical seek time can be effectively eliminated. At least in theory.

The mentioned Hitachi disk (at least the one I have in my test machine) has 1764 sectors on head 1 and 1680 sectors on head 2 in the first zone, which spans 50 tracks. I'm quite sure the limiting factor is the electronics. This disk needs the write command about 140 sectors in advance. It may be that the servo information on the platters also has to be taken into account. Other disks don't behave that well: I tried with 1TB SATA disks, but they don't seem to have any predictable timing.

> Um ... careful there. There are many apples, oranges, and bananas to be
> compared inaccurately against each other. When I measure IOPS of physical
> disks, with all the caches disabled, I get anywhere from 200 to 2400 for a
> single spindle disk (SAS 10k), and I get anywhere from 2000 to 6000 with an
> SSD (SATA), just depending on the benchmark configuration.

The measurement is simple: disable the write cache, write one sector, when that write returns, calculate the next optimal sector to write to, write, calculate again... This gives a quite stable result of about 2000 writes/s or 0.5 ms average service time, single threaded. No ZFS involved, just pure disk performance.

> So you believe you can know the drive geometry, the instantaneous head
> position, and the next available physical block address in software? No
> need for special hardware? That's cool. I hope there aren't any "gotchas"
> as-yet undiscovered.

Yes, I already did a mapping of several drives. I measured at least the track length, the interleave needed between two writes and the interleave if a track-to-track seek is involved. Of course you can always learn more about a disk, but that's a good starting point.

--
Arne
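A minimal sketch of such a measurement loop (hypothetical code, not the actual driver; it assumes a raw disk device with the write cache disabled, 512-byte sectors, writes confined to a single track, previously measured values for rotation period and command lead, and that the phase - which sector passes the head at time zero - has already been calibrated):

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/time.h>

    #define SECTOR            512
    #define SECTORS_PER_TRACK 1764    /* measured, outer zone, head 1 */
    #define ROTATION_US       6000    /* 10k rpm -> 6 ms per revolution */
    #define LEAD_SECTORS      140     /* measured command setup lead */

    static uint64_t now_us(void)
    {
            struct timeval tv;
            gettimeofday(&tv, NULL);
            return ((uint64_t)tv.tv_sec * 1000000 + tv.tv_usec);
    }

    int main(int argc, char **argv)
    {
            char buf[SECTOR];
            uint64_t start, t;
            uint32_t head_pos, sector = 0;
            int i, n = 2000;
            int fd = open(argv[1], O_WRONLY | O_SYNC);

            if (fd < 0)
                    return (1);
            memset(buf, 0, sizeof (buf));
            start = now_us();
            for (i = 0; i < n; i++) {
                    pwrite(fd, buf, SECTOR, (off_t)sector * SECTOR);
                    /*
                     * Estimate the current angular position of the head
                     * from elapsed wall-clock time, then aim far enough
                     * ahead that the command arrives before the target
                     * sector passes under the head.
                     */
                    t = (now_us() - start) % ROTATION_US;
                    head_pos = (uint32_t)(t * SECTORS_PER_TRACK / ROTATION_US);
                    sector = (head_pos + LEAD_SECTORS) % SECTORS_PER_TRACK;
            }
            printf("%.0f writes/s\n",
                (double)n * 1000000 / (now_us() - start));
            return (0);
    }

Replacing the rotation-aware target with sector = (sector + 1) % SECTORS_PER_TRACK should reproduce roughly the ~166 writes/s linear worst case for comparison.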
On 26 May, 2010 - sensille sent me these 4,5K bytes:

> Thanks for rephrasing my idea :) The only thing I'd like to point out is that
> ZFS doesn't do random writes on a slog, but nearly linear writes. This might
> even be hurting performance more than random writes, because you always hit
> the worst case of one full rotation.

A simple test would be to change "write block X", "write block X+1", "write block X+2" into "write block X", "write block X+4", "write block X+8" or something, so it might manage to send the command before the head has travelled over to block X+4 etc.. I guess basically, you want to do something like TCQ/NCQ, but without the Q.. placing writes optimally..

> > So you believe you can know the drive geometry, the instantaneous head
> > position, and the next available physical block address in software? No
> > need for special hardware? That's cool. I hope there aren't any "gotchas"
> > as-yet undiscovered.
>
> Yes, I already did a mapping of several drives. I measured at least the track
> length, the interleave needed between two writes and the interleave if a
> track-to-track seek is involved. Of course you can always learn more about a
> disk, but that's a good starting point.

Since X, X+1, X+2 seems to be the worst case, try just skipping over a few blocks.. Doubling (or so) the performance with a single software tweak would surely be welcome.

/Tomas
--
Tomas Ögren, stric at acc.umu.se, http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se
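A quick way to try that suggestion (hypothetical, reusing the measurement loop sketched above): replace the rotation-aware target with a fixed stride and sweep the stride value to find the sweet spot, e.g.

    /* fixed-stride variant: no head tracking, just skip ahead */
    sector = (sector + stride) % SECTORS_PER_TRACK;

A stride somewhat larger than the number of sectors the head travels during one command setup should already avoid the full-rotation worst case, without needing any timing knowledge at all.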
On Wed, 26 May 2010, sensille wrote:

> My idea goes as follows: don't write linearly. Track the rotation
> and write to the position the head will hit next. This might be done
> by a re-mapping layer or integrated into ZFS. This works only because
> ZIL devices are basically write-only. Reads from this device will be
> horribly slow.

I like your idea. It would require a profiling application to learn the physical geometry and timing of a given disk drive in order to save the configuration data for it. The timing could vary under heavy system load, so the data needs to be sent early enough that it will always be there when needed. The profiling application might need to drive a disk for several hours (or a day) in order to fully understand how it behaves. Remapped failed sectors would cause this micro-timing to fail, but only for the remapped sectors.

Bob

--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Bob Friesenhahn wrote:

> I like your idea. It would require a profiling application to learn the
> physical geometry and timing of a given disk drive in order to save the
> configuration data for it. The timing could vary under heavy system load,
> so the data needs to be sent early enough that it will always be there
> when needed. The profiling application might need to drive a disk for
> several hours (or a day) in order to fully understand how it behaves.

A day is a good landmark. Currently the application runs several hours just to map the tracks. But there's lots of room for algorithms that measure and fine-tune on the fly. Every write is also a measurement.

> Remapped failed sectors would cause this micro-timing to fail,
> but only for the remapped sectors.

Of course you could detect those remapped sectors because of the failed timing and stop using them in the future :)

--
Arne
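A sketch of how such a profiling pass could estimate the command lead and track length (hypothetical, reusing the now_us() helper and setup from the measurement loop above): write sector 0 synchronously, then sector k, and record the service time of the second write for growing k. The time drops sharply once k exceeds the command lead, and jumps back up by a full rotation once k passes the end of the track.

    /* probe relative timing of sector 0 followed by sector k */
    int k;
    for (k = 1; k < 4000; k++) {
            pwrite(fd, buf, SECTOR, 0);
            t = now_us();
            pwrite(fd, buf, SECTOR, (off_t)k * SECTOR);
            printf("%d %llu\n", k, (unsigned long long)(now_us() - t));
    }

Repeating the same probe across a track boundary would give the track-to-track interleave mentioned above.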
On 05/26/10 07:10, sensille wrote:

> The basic idea: the main problem when using a HDD as a ZIL device
> is the cache flushes in combination with the linear write pattern
> of the ZIL. This leads to a whole rotation of the platter after
> each write, because after the first write returns, the head is
> already past the sector that will be written next.
> My idea goes as follows: don't write linearly. Track the rotation
> and write to the position the head will hit next. This might be done
> by a re-mapping layer or integrated into ZFS. This works only because
> ZIL devices are basically write-only. Reads from this device will be
> horribly slow.

Yes, I agree this seems very appealing. I have investigated and observed similar results: just allocating larger intent log blocks but only writing to, say, the first half of them shows the same effect. Despite the impressive results, we have not pursued this further, mainly because of its maintainability. There is quite a variance between drives so, as mentioned, feedback profiling of the device is needed in the working system. The layering of the Solaris IO subsystem doesn't provide the feedback necessary, and the ZIL code is layered on the SPA/DMU. Still, it should be possible. Good luck!

Neil.
On May 26, 2010, at 8:38 AM, Neil Perrin wrote:

> Yes, I agree this seems very appealing. I have investigated and observed
> similar results: just allocating larger intent log blocks but only writing
> to, say, the first half of them shows the same effect. Despite the
> impressive results, we have not pursued this further, mainly because of
> its maintainability. There is quite a variance between drives so, as
> mentioned, feedback profiling of the device is needed in the working
> system. The layering of the Solaris IO subsystem doesn't provide the
> feedback necessary, and the ZIL code is layered on the SPA/DMU.
> Still, it should be possible. Good luck!

I agree. If you search the literature, you will find many cases where people have tried to optimize file systems based on device geometry, and all have ended up as roadkill. File systems last much longer than the hardware, and writing hardware-specific optimizations into the file system just doesn't make good sense.

Meanwhile, though there are doubters, Intel's datasheet for the X25-V clearly states support for the ATA FLUSH CACHE feature. These can be bought for around $120 and can do 2,500 random write IOPS.
http://download.intel.com/design/flash/nand/value/datashts/322736.pdf
Similarly, for the X25-E:
http://download.intel.com/design/flash/nand/extreme/319984.pdf

I think the effort is better spent making sure the SSD vendors do the right thing.
 -- richard

--
Richard Elling
richard at nexenta.com   +1-760-896-4422
ZFS and NexentaStor training, Rotterdam, July 13-15, 2010
http://nexenta-rotterdam.eventbrite.com/
Richard Elling wrote:

> I agree. If you search the literature, you will find many cases where
> people have tried to optimize file systems based on device geometry,
> and all have ended up as roadkill. File systems last much longer than
> the hardware, and writing hardware-specific optimizations into the file
> system just doesn't make good sense.

I see the point that the filesystem itself is not the right place for this kind of optimization.

> Meanwhile, though there are doubters, Intel's datasheet for the X25-V
> clearly states support for the ATA FLUSH CACHE feature. These can
> be bought for around $120 and can do 2,500 random write IOPS.
> http://download.intel.com/design/flash/nand/value/datashts/322736.pdf
> Similarly, for the X25-E:
> http://download.intel.com/design/flash/nand/extreme/319984.pdf

The datasheet states that they understand the command, yes. I haven't tested it myself, but there are many indications on the net that they do not honor it properly, at least for the X25-E. As to the 2,500 writes/s, the datasheet says "up to", using a queue depth of 32 and utilizing the write cache. Similarly, I just tested a Hitachi 15k disk to see how many linear 4k writes I can issue, and it can handle approx. 20000 writes/s. This is a completely useless number, because as soon as I insert cache flushes it drops down to 250/s (or 15k/minute, of course). Don't get me wrong, I would be glad if SSDs lived up to their promises, it would save us a lot of trouble, but I don't see that they are there yet.

> I think the effort is better spent making sure the SSD vendors do the
> right thing.

That might be true if I had any influence with Intel. I think this is the responsibility of big companies like Oracle and NetApp. All I can do is not buy broken hardware.

--
Arne
Neil Perrin wrote:

> Yes, I agree this seems very appealing. I have investigated and observed
> similar results: just allocating larger intent log blocks but only writing
> to, say, the first half of them shows the same effect. Despite the
> impressive results, we have not pursued this further, mainly because of
> its maintainability. There is quite a variance between drives so, as
> mentioned, feedback profiling of the device is needed in the working
> system. The layering of the Solaris IO subsystem doesn't provide the
> feedback necessary, and the ZIL code is layered on the SPA/DMU.
> Still, it should be possible. Good luck!

Thanks :) Though I hoped to get a different answer. An integration into the ZFS code would be much more elegant, but of course in a few years the necessity for this optimization will be gone, when SSDs are cheap, fast and reliable.

There seems to be some interest in this idea here. Would it make sense to start a project for it? Currently I'm implementing a driver as a proof of concept, but I'm in need of a lot of discussions about algorithms and concepts, and maybe some code reviews. Can I count on some support from here?

--Arne
Edward Ned Harvey wrote:

>> From: sensille [mailto:sensille at gmx.net]
>>
>> The only thing I'd like to point out is that
>> ZFS doesn't do random writes on a slog, but nearly linear writes. This
>> might even be hurting performance more than random writes, because you
>> always hit the worst case of one full rotation.
>
> Um ... I certainly have a doubt about this. My understanding is that hard
> disks are already optimized for sustained sequential throughput. I have a
> really hard time believing Seagate, WD, etc, designed their drives such that
> you read/write one track, then pause and wait for a full rotation, then
> read/write one track, and wait again, and so forth. This would limit the
> drive to approx 50% duty cycle, and the market is very competitive.
>
> Yes, I am really quite sure, without any knowledge at all, that the drive
> mfgrs are intelligent enough to map the logical blocks in such a way that
> sequential reads/writes which are larger than a single track will not suffer
> such a huge penalty. Just a small penalty to jump up one track, and wait
> for a few degrees of rotation, not 360 degrees.

I'm afraid you got me wrong here. Of course the drives are optimized for sequential reads/writes. If you give the drive a single read or write that is larger than one track, the drive acts exactly as you described. The same holds if you give the drive multiple smaller consecutive reads/writes in advance (NCQ/TCQ), so that the drive can coalesce them into one big op.

But this is not what happens in the case of ZFS/ZIL with a single application. The application requests a synchronous op. This request goes down into ZFS, which in turn allocates a ZIL block, writes it to the disk and issues a cache flush. Only after the cache flush completes can ZFS acknowledge the op to the application. Now the application can issue the next op, for which ZFS will again allocate a ZIL block, probably immediately after the previous one. It writes the block and issues a flush. But in the meantime the head has traveled some sectors down the track. To physically write the block, the drive of course has to wait until the sector is under the head again, which means waiting nearly one full rotation. If ZFS had chosen a block appropriately further down the track, chances are the head would not have passed it yet, and the drive could write it without a big rotational delay.
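A minimal sketch of the core of such a re-mapping layer (hypothetical names and interface, assuming <stdint.h> types and that rotation period, sectors per track and command lead were measured beforehand): instead of placing the next ZIL block at the logically next sector, the layer picks the physical sector the command can still catch.

    typedef struct rotlog {
            uint32_t spt;       /* sectors per track in the current zone */
            uint32_t lead;      /* command setup lead, in sectors        */
            uint64_t rot_us;    /* one revolution, in microseconds       */
            uint64_t phase_us;  /* wall-clock time sector 0 last passed  */
    } rotlog_t;

    /* pick the physical sector a write issued at 'now_us' can still catch */
    static uint32_t
    rotlog_next_sector(const rotlog_t *rl, uint64_t now_us)
    {
            uint64_t angle = (now_us - rl->phase_us) % rl->rot_us;
            uint32_t head  = (uint32_t)(angle * rl->spt / rl->rot_us);

            return ((head + rl->lead) % rl->spt);
    }

A real driver would additionally have to skip sectors still occupied by live log blocks and handle track-to-track seeks, which is where the measured track-to-track interleave comes in.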
(resent because of received bounce)

Edward Ned Harvey wrote:

> So this brings me back to the question I indirectly asked in the middle of a
> much longer previous email -
>
> Is there some way, in software, to detect the current position of the head?
> If not, then I only see two possibilities:
>
> Either you have some previous knowledge (or assumptions) about the drive
> geometry, rotation speed, and wall clock time passed since the last write
> completed, and use this (possibly vague or inaccurate) info to make your
> best guess what available blocks are accessible with minimum latency next
> ...

That is my approach currently, and it works quite well. I obtain the prior knowledge through a special measuring process run before first using the disk. To keep the driver in sync with the disk during idle times, it issues dummy ops at regular intervals, say 20 per second.

> or else some sort of new hardware behavior would be necessary. Possibly a
> "special" type of drive, which always assumes a command to write to a
> magical block number actually means "write to the next available" block or
> something like that ... or reading from a magical block actually tells you
> the position of the head or something like that...

That would be nice. But what would be much nicer is a drive with an extremely small setup time. Current drives need the command 0.4-0.7 ms in advance, depending on manufacturer and drive type.
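A sketch of how those dummy ops could keep the phase estimate from drifting (hypothetical, building on the rotlog_t structure sketched earlier): every completed write, real or dummy, tells us that its target sector was under the head at completion time, so it can re-anchor the phase.

    /* re-anchor the phase estimate from a write that just completed */
    static void
    rotlog_observe(rotlog_t *rl, uint32_t sector, uint64_t done_us)
    {
            /* wall-clock time at which sector 0 last passed the head */
            rl->phase_us = done_us -
                (uint64_t)sector * rl->rot_us / rl->spt;
    }

During idle periods, calling this from a dummy single-sector write every ~50 ms keeps the estimate fresh and absorbs small spindle-speed drift.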
On 5/27/2010 10:33 AM, sensille wrote:

> That is my approach currently, and it works quite well. I obtain the prior
> knowledge through a special measuring process run before first using the
> disk. To keep the driver in sync with the disk during idle times, it issues
> dummy ops at regular intervals, say 20 per second.
>
> That would be nice. But what would be much nicer is a drive with an
> extremely small setup time. Current drives need the command 0.4-0.7 ms in
> advance, depending on manufacturer and drive type.

Technology like the DDRdrive X1 (which is well beyond $200) doesn't have this problem. The setup times for that kind of hardware are measured in usec. (I.e. measured in PCI cycles.)

 - Garrett
I have a Sun A5000, 22x 73GB 15K disks in split-bus configuration, two dual 2Gb HBAs and four fibre cables from server to array, all for just under $200. The array gives 4Gb of aggregate throughput in each direction across two 11-disk buses. Right now it is the main array, but when we outgrow its storage it will become a multiple external ZIL / L2ARC array for a slow SATA array. Admittedly, it is rare for all of the pieces to come together at the right price like this, and since it is unsupported no one would seriously consider it for production. At the same time, it makes blistering main storage today and will provide for amazing IOPS against slow storage later.