A couple of ZFS questions:

1. ZFS dynamic striping will automatically use newly added devices when
there are write requests. A customer has a *mostly read-only* application
with an I/O bottleneck; they wonder if there is a ZFS command or mechanism
to manually rebalance ZFS data when adding new drives to an existing pool?

2. Will ZFS automatically/proactively seek out bad blocks (self-healing)
when there are idle CPU cycles? I don't think so, but would like to get
confirmation. We are aware of 'zpool scrub', a manual way to verify
checksums and correct bad blocks. We also know that bad blocks will be
self-healed when there's an access request to the bad block.

3. Can zpool determine and alert if server2 is attempting to import a ZFS
pool that is currently imported by server1? Can server2 force an import in
case server1 crashes - a manual failover scenario?

4. When S10 ZFS boot is available, will Sun offer a migration strategy
(commands, processes, etc.) to convert/migrate root devices from SVM/VxVM
to a ZFS root file system?

Best regards,
Kimberly
Kimberly Chang wrote:
> A couple of ZFS questions:
>
> 1. ZFS dynamic striping will automatically use newly added devices when
> there are write requests. A customer has a *mostly read-only* application
> with an I/O bottleneck; they wonder if there is a ZFS command or mechanism
> to manually rebalance ZFS data when adding new drives to an existing pool?

cp :-)
If you copy the file then the new writes will be spread across the newly
added drives. It doesn't really matter how you do the copy, though.

> 2. Will ZFS automatically/proactively seek out bad blocks (self-healing)
> when there are idle CPU cycles? I don't think so, but would like to get
> confirmation. We are aware of 'zpool scrub', a manual way to verify
> checksums and correct bad blocks. We also know that bad blocks will be
> self-healed when there's an access request to the bad block.

You can set up periodic scrubs with cron (I'm unsure if there is also a
built-in timer -- to have one wouldn't be very UNIX-like). The ZFS
scheduler puts scrubs at a low priority, so they should have minimal
impact on real work.

> 3. Can zpool determine and alert if server2 is attempting to import a ZFS
> pool that is currently imported by server1? Can server2 force an import
> in case server1 crashes - a manual failover scenario?

The manual failover scenario is also how the automated failover scenario
will work with Sun Cluster. However, I do not believe there is a way for
server1 to know that server2 is *attempting* the import without a cluster
infrastructure. Normally, for Sun Cluster, the disks will be fenced from
the other node.

> 4. When S10 ZFS boot is available, will Sun offer a migration strategy
> (commands, processes, etc.) to convert/migrate root devices from
> SVM/VxVM to a ZFS root file system?

cp :-)
cpio more likely, but it may be easier to use LiveUpgrade or reinstall,
depending on how the legacy system is configured. Conversion in place is
just not worth the effort.
 -- richard
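For reference, a minimal sketch of both suggestions above -- the
cron-driven scrub and the forced import on a surviving node -- assuming a
pool named 'tank' (a hypothetical name):

    # root crontab entry: scrub the pool every Sunday at 02:00
    0 2 * * 0 /usr/sbin/zpool scrub tank

    # manual failover: on server2, after server1 has crashed
    zpool import -f tank

'zpool status tank' shows scrub progress and reports any blocks that were
repaired.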
On Jun 16, 2006, at 11:40 PM, Richard Elling wrote:
> Kimberly Chang wrote:
>> A couple of ZFS questions:
>> 1. ZFS dynamic striping will automatically use newly added devices
>> when there are write requests. A customer has a *mostly read-only*
>> application with an I/O bottleneck; they wonder if there is a ZFS
>> command or mechanism to manually rebalance ZFS data when adding new
>> drives to an existing pool?
>
> cp :-)
> If you copy the file then the new writes will be spread across the
> newly added drives. It doesn't really matter how you do the copy,
> though.

She raises an interesting point, though. The concept of shifting blocks
in a zpool around in the background as part of a scrubbing process,
and/or on the order of an explicit command to populate newly added
devices, seems like it could be right up ZFS's alley. Perhaps it could
also be done with volume-level granularity.

Off the top of my head, an area where this would be useful is performance
management - e.g. relieving load on a particular FC interconnect or an
overburdened RAID array controller/cache, thus allowing total
no-downtime-to-cp-data-around flexibility when one is horizontally
scaling storage performance.

/dale
On 6/17/06, Dale Ghent <daleg at elemental.org> wrote:
> The concept of shifting blocks in a zpool around in the background as
> part of a scrubbing process, and/or on the order of an explicit command
> to populate newly added devices, seems like it could be right up ZFS's
> alley. Perhaps it could also be done with volume-level granularity.
>
> Off the top of my head, an area where this would be useful is
> performance management - e.g. relieving load on a particular FC
> interconnect or an overburdened RAID array controller/cache, thus
> allowing total no-downtime-to-cp-data-around flexibility when one is
> horizontally scaling storage performance.

Another good use would be to migrate blocks that are rarely accessed to
slow storage (750 GB drives with RAID-Z) while very active blocks are
kept on fast storage (solid state disk). Presumably writes would go to
relatively fast storage, with idle I/O cycles used to migrate blocks that
don't see "a lot" of reads to slower storage. Blocks that are very active
and reside on slow storage could be migrated (mirrored?) to fast storage.
Presumably fast storage vs. slow storage would be determined by measured
performance, leading to automatic balancing across the disks.

Mike

--
Mike Gerdts
http://mgerdts.blogspot.com/
Mike Gerdts wrote:
> On 6/17/06, Dale Ghent <daleg at elemental.org> wrote:
>
>> The concept of shifting blocks in a zpool around in the background as
>> part of a scrubbing process, and/or on the order of an explicit command
>> to populate newly added devices, seems like it could be right up ZFS's
>> alley. Perhaps it could also be done with volume-level granularity.
>>
>> Off the top of my head, an area where this would be useful is
>> performance management - e.g. relieving load on a particular FC
>> interconnect or an overburdened RAID array controller/cache, thus
>> allowing total no-downtime-to-cp-data-around flexibility when one is
>> horizontally scaling storage performance.
>
> Another good use would be to migrate blocks that are rarely accessed
> to slow storage (750 GB drives with RAID-Z) while very active blocks
> are kept on fast storage (solid state disk). Presumably writes would
> go to relatively fast storage, with idle I/O cycles used to migrate
> blocks that don't see "a lot" of reads to slower storage. Blocks that
> are very active and reside on slow storage could be migrated
> (mirrored?) to fast storage.

Solid state disk often has a higher failure rate than normal disk and a
limited write cycle. Hence it is often desirable to try and redesign the
filesystem to do fewer writes when it is on (for example) compact flash,
so moving "hot blocks" to fast storage can have consequences.

But then there is also this new storage paradigm in the e-rags where a
hard drive also has some amount of solid state storage to speed up the
boot time. It'll be interesting to see how that plays out, but I suspect
the idea is that in the relevant market (PCs), it'll be used for things
like drivers and OS core image files that do not change very often.

Darren
Darren Reed wrote:
> Solid state disk often has a higher failure rate than normal disk and a
> limited write cycle. Hence it is often desirable to try and redesign the
> filesystem to do fewer writes when it is on (for example) compact flash,
> so moving "hot blocks" to fast storage can have consequences.

Solid state storage does not necessarily mean flash. For example, I have
recently performed some testing of Sun's Directory Server in conjunction
with solid state disks from two different vendors. Both of these used
standard DRAM, so there's no real limit to the number of writes that can
be performed. They have lots of internal redundancy features (e.g., ECC
memory with chipkill, redundant power supplies, internal UPSes, and
internal hard drives to protect against extended power outages), but both
vendors said that customers often use other forms of redundancy (e.g.,
mirror to traditional disk, or RAID across multiple solid-state devices).

One of the vendors mentioned that both SVM and VxVM have the ability to
designate one disk in a mirror as "write only" (unless the other has
failed), which can be good for providing redundancy with cheaper,
traditional storage. All reads would still come from the solid state
storage, so they would be very fast, and as long as the write rate
doesn't exceed that of the traditional disk then there wouldn't be much
adverse performance impact from the slower disk in the mirror. I don't
believe that ZFS has this capability, but it could be something worth
looking into. The original suggestion provided in this thread would
potentially work well in that kind of setup.

ZFS with compression can also provide a notable win because the
compression can significantly reduce the amount of storage required,
which can help cut down on the costs. Solid state disks like this are
expensive (both of the 32GB disks that I tested list at around $60K), so
controlling costs is important.

Neil
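As a side note, turning on compression is a one-line property change per
dataset; 'tank/ldap' below is just a hypothetical dataset name:

    zfs set compression=on tank/ldap
    zfs get compressratio tank/ldap

Only blocks written after the property is set get compressed, and the
'compressratio' property shows how much space is actually being saved.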
On 6/17/06, Neil A. Wilson <Neil.A.Wilson at sun.com> wrote:
> Darren Reed wrote:
>> Solid state disk often has a higher failure rate than normal disk and a
>> limited write cycle. Hence it is often desirable to try and redesign the
>> filesystem to do fewer writes when it is on (for example) compact flash,
>> so moving "hot blocks" to fast storage can have consequences.

I mentioned solid state (assuming DRAM-based) and 750 GB drives as the
two ends of the available spectrum. Most people will find that their
extremes are each closer to the middle of the spectrum. Possibly a
multi-tier approach including 73 GB FC, 300 GB FC, and 500 GB SATA would
be more likely in most shops.

> Solid state disks like this are expensive (both of the 32GB disks that
> I tested list at around $60K), so controlling costs is important.

If you remove "enterprise" from the solid state disk equation, consider
this at $150 plus the cost of four 1 GB DDR DIMMs. I suppose you could
mirror across a pair of them and still have a pretty fast, small 4 GB of
space for less than $1k.

http://www.anandtech.com/storage/showdoc.aspx?i=2480

FWIW, Google gives plenty of hits for "solid state disk terabyte".

Mike

--
Mike Gerdts
http://mgerdts.blogspot.com/
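A minimal sketch of the "mirror across a pair of them" idea with ZFS --
the pool and device names here are hypothetical placeholders for the two
solid state devices:

    zpool create fastpool mirror c2t0d0 c3t0d0

A mirror vdev in 'zpool create' is all it takes; everything written to
fastpool is then duplicated across both devices.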
Saying "solid state disk" in the storage arena means battery-backed DRAM
(or, rarely, NVRAM). It does NOT include the various forms of solid-state
memory (compact flash, SD, MMC, etc.); "flash disk" is reserved for those
kinds of devices. This is historical, since flash disk hasn't been
functionally usable in the enterprise storage arena until the last year
or so. Battery-backed DRAM as a "disk" has been around for a very long
time, though. :-)

We've all talked about adding the ability to change read/write policy on
a pool's vdevs for a while. There are a lot of good reasons why this is
desirable. However, I'd like to try to separate this request from HSM,
and not immediately muddy the waters by trying to lump too many things
together.

That is, start out by adding the ability to differentiate between access
policies within a vdev. Generally, we're talking only about mirror vdevs
right now. Later on, we can consider the ability to migrate data based on
performance, but a lot of that has to take snapshot capability and such
into consideration, so it is a bit less straightforward.

And, on a not completely tangential side note: WTF is up with the costs
for solid state disks? I mean, prices well over $1k per GB are typical,
which is absolutely ludicrous. The DRAM itself is under $100/GB, and
these devices are idiot-simple to make. In the minimalist case, it's
simply DIMM slots, a NiCad battery and trickle charger, and a
SCSI/SATA/FC interface chip. Even in the fancy case, where you provide a
backup drive to copy the DRAM contents to in case of power failure, it's
a trivial engineering exercise. I realize there is (currently) a small
demand for these devices, but honestly, I'm pretty sure that if they
reduced the price by a factor of 3, they'd see 10x or maybe even 100x the
volume, because these little buggers are just so damned useful.

Oh, and the newest thing in the consumer market is called "hybrid
drives", which is a melding of a flash drive with a Winchester drive.
It's originally targeted at the laptop market - think a 1 GB flash memory
welded to a 40 GB 2.5" hard drive in the same form factor. You don't
replace the DRAM cache on the HD - it's still there for fast write
response. But all the "frequently used" blocks get scheduled to be placed
on the flash part of the drive, while the mechanical part actually holds
a copy of everything. The flash portion is there for power efficiency as
well as performance.

-Erik
Erik Trimble wrote:
> That is, start out by adding the ability to differentiate between access
> policies within a vdev. Generally, we're talking only about mirror vdevs
> right now. Later on, we can consider the ability to migrate data based
> on performance, but a lot of that has to take snapshot capability and
> such into consideration, so it is a bit less straightforward.

The policy is implemented on the read side, since you still need to
commit writes to all mirrors. The implementation shouldn't be difficult;
deciding on the administrative interface will be the hardest part.

> Oh, and the newest thing in the consumer market is called "hybrid
> drives", which is a melding of a flash drive with a Winchester drive.
> It's originally targeted at the laptop market - think a 1 GB flash
> memory welded to a 40 GB 2.5" hard drive in the same form factor. You
> don't replace the DRAM cache on the HD - it's still there for fast
> write response. But all the "frequently used" blocks get scheduled to
> be placed on the flash part of the drive, while the mechanical part
> actually holds a copy of everything. The flash portion is there for
> power efficiency as well as performance.

Flash is (can be) a bit more sophisticated. The problem is that they
have a limited write endurance -- typically spec'ed at 100k writes to
any single bit. The good flash drives use block relocation, spares, and
write spreading to avoid write hot spots. For many file systems, the
place to worry is the block(s) containing your metadata. ZFS inherently
spreads and mirrors its metadata, so it should be more appropriate for
flash devices than FAT or UFS. Similarly, the disk drive manufacturers
make extensive use of block sparing, so applying that technique to the
hybrid drives is expected.
 -- richard
On Tue, Jun 20, 2006 at 09:32:58AM -0700, Richard Elling wrote:
> Flash is (can be) a bit more sophisticated. The problem is that they
> have a limited write endurance -- typically spec'ed at 100k writes to
> any single bit. The good flash drives use block relocation, spares, and
> write spreading to avoid write hot spots. For many file systems, the
> place to worry is the block(s) containing your metadata. ZFS inherently
> spreads and mirrors its metadata, so it should be more appropriate for
> flash devices than FAT or UFS.

What about the UberBlock? It's written each time a transaction group
commits.

Cheers,
- jonathan

--
Jonathan Adams, Solaris Kernel Development
On Tue, Jun 20, 2006 at 11:17:42AM -0700, Jonathan Adams wrote:
> On Tue, Jun 20, 2006 at 09:32:58AM -0700, Richard Elling wrote:
>> Flash is (can be) a bit more sophisticated. The problem is that they
>> have a limited write endurance -- typically spec'ed at 100k writes to
>> any single bit. The good flash drives use block relocation, spares,
>> and write spreading to avoid write hot spots. For many file systems,
>> the place to worry is the block(s) containing your metadata. ZFS
>> inherently spreads and mirrors its metadata, so it should be more
>> appropriate for flash devices than FAT or UFS.
>
> What about the UberBlock? It's written each time a transaction group
> commits.

Yes, but this is only written once every 5 seconds, and we store to 256
different locations in a ring buffer. So you have (256*100000*5)
seconds, or about 100 years.

- Eric

--
Eric Schrock, Solaris Kernel Development    http://blogs.sun.com/eschrock
On Tue, Jun 20, 2006 at 11:17:42AM -0700, Jonathan Adams wrote:
> On Tue, Jun 20, 2006 at 09:32:58AM -0700, Richard Elling wrote:
>> Flash is (can be) a bit more sophisticated. The problem is that they
>> have a limited write endurance -- typically spec'ed at 100k writes to
>> any single bit. The good flash drives use block relocation, spares,
>> and write spreading to avoid write hot spots. For many file systems,
>> the place to worry is the block(s) containing your metadata. ZFS
>> inherently spreads and mirrors its metadata, so it should be more
>> appropriate for flash devices than FAT or UFS.
>
> What about the UberBlock? It's written each time a transaction group
> commits.

Right. But we rotate the uberblock over 128 positions in the device
label. This helps with write-leveling. Furthermore, a lot of flash
devices are starting to incorporate write-leveling in HW, since a lot of
software just doesn't deal with it.

--Bill
Jonathan Adams wrote:
> On Tue, Jun 20, 2006 at 09:32:58AM -0700, Richard Elling wrote:
>
>> Flash is (can be) a bit more sophisticated. The problem is that they
>> have a limited write endurance -- typically spec'ed at 100k writes to
>> any single bit. The good flash drives use block relocation, spares,
>> and write spreading to avoid write hot spots. For many file systems,
>> the place to worry is the block(s) containing your metadata. ZFS
>> inherently spreads and mirrors its metadata, so it should be more
>> appropriate for flash devices than FAT or UFS.
>
> What about the UberBlock? It's written each time a transaction group
> commits.

Also, options such as "-nomtime" and "-noctime" have been introduced
alongside "-noatime" in some free operating systems to limit the amount
of metadata that gets written back to disk.

Darren
> Also, options such as "-nomtime" and "-noctime" have been introduced
> alongside "-noatime" in some free operating systems to limit the amount
> of metadata that gets written back to disk.

Those seem rather pointless. (mtime and ctime generally imply other
changes, often to the inode; atime does not.)

Casper
Wouldn't that be:

   5 seconds per write = 86400/5 = 17280 writes per day
   256 rotated locations: 17280/256 = 67 writes per location per day

Resulting in (100000/67) ~1492 days, or 4.08 years before failure?

That's still a long time, but it's not 100 years.

On Jun 20, 2006, at 12:47 PM, Eric Schrock wrote:
> On Tue, Jun 20, 2006 at 11:17:42AM -0700, Jonathan Adams wrote:
>> On Tue, Jun 20, 2006 at 09:32:58AM -0700, Richard Elling wrote:
>>> Flash is (can be) a bit more sophisticated. The problem is that they
>>> have a limited write endurance -- typically spec'ed at 100k writes to
>>> any single bit. The good flash drives use block relocation, spares,
>>> and write spreading to avoid write hot spots. For many file systems,
>>> the place to worry is the block(s) containing your metadata. ZFS
>>> inherently spreads and mirrors its metadata, so it should be more
>>> appropriate for flash devices than FAT or UFS.
>>
>> What about the UberBlock? It's written each time a transaction group
>> commits.
>
> Yes, but this is only written once every 5 seconds, and we store to 256
> different locations in a ring buffer. So you have (256*100000*5)
> seconds, or about 100 years.
>
> - Eric

-----
Gregory Shaw, IT Architect
Phone: (303) 673-8273        Fax: (303) 673-2773
ITCTO Group, Sun Microsystems Inc.
1 StorageTek Drive ULVL4-382           greg.shaw at sun.com (work)
Louisville, CO 80028-4382              shaw at fmsoft.com (home)
"When Microsoft writes an application for Linux, I've Won." - Linus Torvalds
On Tue, Jun 20, 2006 at 02:18:34PM -0600, Gregory Shaw wrote:
> Wouldn't that be:
>
>    5 seconds per write = 86400/5 = 17280 writes per day
>    256 rotated locations: 17280/256 = 67 writes per location per day
>
> Resulting in (100000/67) ~1492 days, or 4.08 years before failure?
>
> That's still a long time, but it's not 100 years.

Yes, I goofed on the math. It's still (256*100000*5) seconds, but somehow
I botched the conversion. I tried it again and came up with 1,481 days.

- Eric

--
Eric Schrock, Solaris Kernel Development    http://blogs.sun.com/eschrock
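Spelling out that arithmetic with the figures quoted above (100k-write
endurance, one uberblock write per 5-second transaction group, 256
rotated locations):

   256 locations x 100,000 writes x 5 s = 128,000,000 s
   128,000,000 s / 86,400 s per day    ~= 1,481 days  (~4.06 years)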
Richard Elling wrote:
> Erik Trimble wrote:
>> Oh, and the newest thing in the consumer market is called "hybrid
>> drives", which is a melding of a flash drive with a Winchester drive.
>> It's originally targeted at the laptop market - think a 1 GB flash
>> memory welded to a 40 GB 2.5" hard drive in the same form factor. You
>> don't replace the DRAM cache on the HD - it's still there for fast
>> write response. But all the "frequently used" blocks get scheduled to
>> be placed on the flash part of the drive, while the mechanical part
>> actually holds a copy of everything. The flash portion is there for
>> power efficiency as well as performance.
>
> Flash is (can be) a bit more sophisticated. The problem is that they
> have a limited write endurance -- typically spec'ed at 100k writes to
> any single bit. The good flash drives use block relocation, spares, and
> write spreading to avoid write hot spots. For many file systems, the
> place to worry is the block(s) containing your metadata. ZFS inherently
> spreads and mirrors its metadata, so it should be more appropriate for
> flash devices than FAT or UFS.

What I do not know yet is exactly how the flash portion of these hybrid
drives is administered. I rather expect that a non-hybrid-aware OS may
not actually exercise the flash storage on these drives by default; or
should I say, the flash storage will only be available to a hybrid-aware
OS. Has anyone reading this seen a command-set reference for one of these
drives?

Dana
Eric Schrock wrote:
> On Tue, Jun 20, 2006 at 11:17:42AM -0700, Jonathan Adams wrote:
>> On Tue, Jun 20, 2006 at 09:32:58AM -0700, Richard Elling wrote:
>>> Flash is (can be) a bit more sophisticated. The problem is that they
>>> have a limited write endurance -- typically spec'ed at 100k writes to
>>> any single bit. The good flash drives use block relocation, spares,
>>> and write spreading to avoid write hot spots. For many file systems,
>>> the place to worry is the block(s) containing your metadata. ZFS
>>> inherently spreads and mirrors its metadata, so it should be more
>>> appropriate for flash devices than FAT or UFS.
>> What about the UberBlock? It's written each time a transaction group
>> commits.
>
> Yes, but this is only written once every 5 seconds, and we store to 256
> different locations in a ring buffer. So you have (256*100000*5)
> seconds, or about 100 years.

100k writes is the de facto minimum. In looking at some SSD (yes, they
are marketing them as solid state disks) drives with IDE or SATA
interfaces, at least one vendor specs 5,000,000 writes and sizes up to
128 GBytes. It will be a while before these are really inexpensive,
though.
 -- richard
Dana H. Myers wrote:
> What I do not know yet is exactly how the flash portion of these hybrid
> drives is administered. I rather expect that a non-hybrid-aware OS may
> not actually exercise the flash storage on these drives by default; or
> should I say, the flash storage will only be available to a
> hybrid-aware OS.

Samsung describes their hybrid drives as using flash for the boot block
and as a write cache.
 -- richard
And, this is a worst case, no?

If the device itself also does some funky stuff under the covers, and ZFS
only writes an update if there is *actually* something to write, then it
could be much, much longer than 4 years.

Actually - that's an interesting point. I assume ZFS only writes
something when there is actually data?

:)

Nathan.

On Wed, 2006-06-21 at 06:25, Eric Schrock wrote:
> On Tue, Jun 20, 2006 at 02:18:34PM -0600, Gregory Shaw wrote:
>> Wouldn't that be:
>>
>>    5 seconds per write = 86400/5 = 17280 writes per day
>>    256 rotated locations: 17280/256 = 67 writes per location per day
>>
>> Resulting in (100000/67) ~1492 days, or 4.08 years before failure?
>>
>> That's still a long time, but it's not 100 years.
>
> Yes, I goofed on the math. It's still (256*100000*5) seconds, but
> somehow I botched the conversion. I tried it again and came up with
> 1,481 days.
>
> - Eric
> I assume ZFS only writes something when there is actually data?

Right.

Jeff
Casper.Dik at Sun.COM wrote:
>> Also, options such as "-nomtime" and "-noctime" have been introduced
>> alongside "-noatime" in some free operating systems to limit the amount
>> of metadata that gets written back to disk.
>
> Those seem rather pointless. (mtime and ctime generally imply other
> changes, often to the inode; atime does not.)

Well, operating systems that *do* get used to build devices *do* have
these mount options for this purpose, so I imagine that someone who does
this kind of thing thinks they're worthwhile.

Darren
Richard Elling wrote:
> Dana H. Myers wrote:
>> What I do not know yet is exactly how the flash portion of these hybrid
>> drives is administered. I rather expect that a non-hybrid-aware OS may
>> not actually exercise the flash storage on these drives by default; or
>> should I say, the flash storage will only be available to a
>> hybrid-aware OS.
>
> Samsung describes their hybrid drives as using flash for the boot block
> and as a write cache.
>  -- richard

Here's Seagate's take on the hybrid HD:

http://www.seagate.com/docs/pdf/marketing/po_momentus_5400_psd.pdf

My understanding of the general design of hybrids is described in the PDF
above: flash is being used for a READ cache, though I'm not certain about
write caching (whether that too goes through the flash RAM, or not) - my
assumption is that it does NOT, at least in the laptop space. And there
is no need for OS-level drivers - this is simply a plug-in SATA drive,
treated like any other drive. Now, I expect there might be some
optimizations possible should the OS know that the drive is a hybrid, but
the drive will still work well (that is, provide better performance and
lower power draw) without any OS modifications.

I do expect that the flash cache will get larger (the current default
seems to be 8-16 MB, or about the same as a normal RAM cache on a
standard non-hybrid drive) as the designers figure out what makes a good
mix for the expected environment: that is, I'd estimate that for the
single-drive laptop space, a goodly cache (perhaps enough to cache the
most-common OS libraries, say in the 100 MB or so range) is likely, while
for the performance market (say, SAS drives), it may be much less (just
enough to keep some frequent metadata around).

-Erik
> Well, operating systems that *do* get used to build devices *do* have
> these mount options for this purpose, so I imagine that someone who
> does this kind of thing thinks they're worthwhile.

Thinking that something is worthwhile and having done the analysis to
prove that it is worthwhile are two different things. Intuition and
performance analysis generally do not match.

Casper
Actually, while Seagate's little white paper doesn't explicitly say so,
the flash is used for a write cache, and that provides one of the major
benefits: writes to the disk rarely need to spin up the motor. Probably
90+% of all writes to disk will fit into the cache in a typical laptop
environment (no, compiling OpenSolaris isn't typical usage?).

My guess from reading between the lines of the Samsung/Microsoft press
release is that there is a mechanism for the operating system to "pin"
particular blocks into the cache (e.g. to speed boot) and the rest of the
cache is used for write buffering. (Using it as a read cache doesn't buy
much compared to using the normal drive cache RAM for that, and might
also contribute to wear, which is why read caching appears to be under OS
control rather than automatic.)

Incidentally, there's a nice overview of some algorithms (including file
systems) optimized for the characteristics of flash memory that was
published by ACM last year, for the curious (who happen to have access to
either the online version or their local library).

<http://doi.acm.org/10.1145/1089733.1089735>

Anton
Anton B. Rang wrote:
> Actually, while Seagate's little white paper doesn't explicitly say so,
> the flash is used for a write cache, and that provides one of the major
> benefits: writes to the disk rarely need to spin up the motor. Probably
> 90+% of all writes to disk will fit into the cache in a typical laptop
> environment (no, compiling OpenSolaris isn't typical usage?).

On OpenSolaris laptops with enough RAM, we need to think about fitting
mappings of libc, cron, and all of its work into the buffer cache and
then maybe the flash cache on the drive. Each time you execute a program,
that's an atime update of its file...

I've known people to wear out laptop hard drives in a frighteningly short
period of time because of the drive being spun up and down to service
cron, sendmail queue runs, syslog messages...

Darren
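For what it's worth, ZFS already lets you suppress those atime updates on
a per-dataset basis, which helps with exactly this kind of spin-up and
flash wear ('tank/home' is just a hypothetical dataset name):

    zfs set atime=off tank/home
    zfs get atime tank/home

Reads and executions then no longer dirty metadata just to record the
access time.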
So, based on the below, there should be no reason why a flash-based ZFS
filesystem should need to do anything special to avoid problems. That's
a Good Thing.

I think that using flash as the system disk will be the way to go. Using
flash as read-only with a disk or memory for read-write would result in a
very fast system with fewer points of failure...

On Jun 20, 2006, at 6:23 PM, Nathan Kroenert wrote:
> And, this is a worst case, no?
>
> If the device itself also does some funky stuff under the covers, and
> ZFS only writes an update if there is *actually* something to write,
> then it could be much, much longer than 4 years.
>
> Actually - that's an interesting point. I assume ZFS only writes
> something when there is actually data?
>
> :)
>
> Nathan.
>
> On Wed, 2006-06-21 at 06:25, Eric Schrock wrote:
>> On Tue, Jun 20, 2006 at 02:18:34PM -0600, Gregory Shaw wrote:
>>> Wouldn't that be:
>>>
>>>    5 seconds per write = 86400/5 = 17280 writes per day
>>>    256 rotated locations: 17280/256 = 67 writes per location per day
>>>
>>> Resulting in (100000/67) ~1492 days, or 4.08 years before failure?
>>>
>>> That's still a long time, but it's not 100 years.
>>
>> Yes, I goofed on the math. It's still (256*100000*5) seconds, but
>> somehow I botched the conversion. I tried it again and came up with
>> 1,481 days.
>>
>> - Eric

-----
Gregory Shaw, IT Architect
Phone: (303) 673-8273        Fax: (303) 673-2773
ITCTO Group, Sun Microsystems Inc.
1 StorageTek Drive ULVL4-382           greg.shaw at sun.com (work)
Louisville, CO 80028-4382              shaw at fmsoft.com (home)
"When Microsoft writes an application for Linux, I've Won." - Linus Torvalds
On Jun 21, 2006, at 11:05, Anton B. Rang wrote:
> My guess from reading between the lines of the Samsung/Microsoft press
> release is that there is a mechanism for the operating system to "pin"
> particular blocks into the cache (e.g. to speed boot) and the rest of
> the cache is used for write buffering. (Using it as a read cache
> doesn't buy much compared to using the normal drive cache RAM for that,
> and might also contribute to wear, which is why read caching appears to
> be under OS control rather than automatic.)

Actually, Microsoft has been posting a bit about this for the upcoming
Vista release. WinHEC '06 had a few interesting papers, and it looks like
Microsoft is going to be introducing SuperFetch, ReadyBoost, and
ReadyDrive, mentioned here:

http://www.microsoft.com/whdc/system/sysperf/accelerator.mspx

The ReadyDrive paper seems to outline their strategy on the industry
hybrid drive push and the recent t13.org adoption of the ATA-ACS8 command
set:

http://www.microsoft.com/whdc/device/storage/hybrid.mspx

It also looks like they're aiming at some sort of driver-level PriorityIO
scheme, which should play nicely into lower-level tiered hardware in an
attempt at more intelligent read/write caching:

http://www.microsoft.com/whdc/driver/priorityio.mspx

---
.je