The decade-old paper "The Rio File Cache: Surviving Operating System Crashes" at http://www.eecs.umich.edu/Rio/papers/chen96.pdf explains how to make a filesystem write-back cache as reliable as write-through. It seems that this mechanism could be used to eliminate the need to flush the ZIL to disk when performing synchronous writes, and all filesystem writes could therefore be made synchronous with essentially no performance loss.

This message posted from opensolaris.org
Jeff Bonwick
2006-Jan-14 07:09 UTC
[zfs-discuss] Would Rio be practical for making ZIL go fast?
> It seems that this mechanism could be used to eliminate the need to flush
> the ZIL to disk when performing synchronous writes, and all filesystem
> writes could therefore be made synchronous with essentially no performance
> loss.

Sadly not. The requirement of a synchronous write is that the filesystem cannot return from the write(2) system call until the data is on disk. No caching strategy can circumvent the disk write.

The purpose of the Rio work was different: to survive crashes. ZFS already does that, with or without the ZIL. The only purpose of the ZIL is to reduce the latency of synchronous I/O requests. The ZIL is not required for fsckless operation. If you turned off the ZIL, all it would mean is that in the event of a crash, it would appear that some of the most recent (last few seconds) synchronous system calls never happened. In other words, we wouldn't have met the O_DSYNC specification, but the filesystem would nevertheless still be perfectly consistent on disk.

Jeff
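The synchronous-write contract described here is visible directly from userland. A minimal sketch, assuming a POSIX system that exposes O_DSYNC (the file path is illustrative):

```python
# Minimal sketch of the synchronous-write contract: with O_DSYNC,
# write(2) may not return until the data has reached stable storage.
# The path below is illustrative.
import os
import tempfile

path = os.path.join(tempfile.gettempdir(), "dsync-demo.dat")
fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_DSYNC, 0o644)
try:
    # This write blocks until the data is on the disk -- exactly the
    # latency that the ZIL exists to reduce.
    n = os.write(fd, b"committed\n")
finally:
    os.close(fd)
    os.unlink(path)

print("synchronously wrote", n, "bytes")
```

No cache can make that write return early and still honor O_DSYNC, which is Jeff's point: the ZIL reduces the latency of the commit, but the commit itself must reach stable storage.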
Andrew
2006-Jan-14 17:53 UTC
[zfs-discuss] Re: Would Rio be practical for making ZIL go fast?
Jeff Bonwick wrote:
>> It seems that this mechanism could be used to eliminate the need to flush
>> the ZIL to disk when performing synchronous writes, and all filesystem
>> writes could therefore be made synchronous with essentially no performance
>> loss.
>
> Sadly not. The requirement of a synchronous write is that the
> filesystem cannot return from the write(2) system call until the
> data is on disk. No caching strategy can circumvent the disk write.

The disk on which the data is guaranteed to have been written after a synchronous write could be a conventional magnetic disk, or a flash disk, or a battery-backed DRAM disk. The purpose of Rio is to make software-induced corruption or erasure of a particular section of the system's main memory no more likely than software-induced corruption or erasure of data on the disk (the test results presented in the paper show that this goal was achieved); if a UPS is then attached to the system to protect against power outage, and the kernel is designed to honor the sanctity of that section of memory upon warm reboot, then that section effectively becomes a battery-backed DRAM disk. Therefore the reliability and persistence guarantee for synchronous writes can be honored by writing to that section of memory.

> The purpose of the Rio work was different: to survive crashes.

Yes; to protect a section of memory against corruption and erasure generally, with crash/reboot being one important case (since contemporary systems erase all previous contents of main memory upon reboot by ignoring and overwriting it). In the case of ZFS, the relevant property of Rio is protection against erasure.

> ZFS already does that, with or without the ZIL. The only purpose
> of the ZIL is to reduce the latency of synchronous I/O requests.

It's true that both for synchronous and asynchronous writes with ZFS, the reliability of data on the disk is never in question, and the persistence of synchronous writes is never in question, but achieving persistence for the synchronous writes still requires actually writing to the disk, which is sufficiently slow that making all writes synchronous by default is not feasible. A main-memory-backed ramdisk could be used for the ZIL if the system has a UPS, but in Solaris as currently designed the ramdisk is not persistent across system reboots; Rio would make it persistent.

> The ZIL is not required for fsckless operation. If you turned off
> the ZIL, all it would mean is that in the event of a crash, it would
> appear that some of the most recent (last few seconds) synchronous
> system calls never happened. In other words, we wouldn't have met
> the O_DSYNC specification, but the filesystem would nevertheless
> still be perfectly consistent on disk.

Instead of turning off the ZIL, just putting the ZIL in a main-memory-backed ramdisk would produce the same result. The point of Rio in this case would simply be to make that ramdisk persistent, thus guaranteeing the persistence of all synchronously written data even in the event of a crash and reboot. The same result could be achieved by disconnecting a few of the system's DRAM chips from the processor bus, hooking them to a SATA interface, sticking a battery on it, and using that new disk to hold the ZIL. Using a system-wide UPS eliminates the need for that dedicated battery, and using Rio eliminates the need to move some of the DRAM chips from the processor bus to a SATA interface.
Richard Elling
2006-Jan-14 20:18 UTC
[zfs-discuss] Re: Would Rio be practical for making ZIL go fast?
Andrew writes:
> The disk on which the data is guaranteed to have been written after a
> synchronous write could be a conventional magnetic disk, or a flash
> disk, or a battery-backed dram disk. The purpose of Rio is to make
> software-induced corruption or erasure of a particular section of the
> system's main memory no more likely than software-induced corruption
> or erasure of data on the disk (the test results presented in the
> paper show that this goal was achieved);

Yes. Indeed, Sun has had several hardware products over the years which provide host-based nonvolatile caching for I/O (IIRC, the name PrestoServe jiggles some brain cells). Alas, while they do improve performance, they are not commercially viable. The architectural problem is that an interdependency is created between the data on disk and the data remaining in the host. This breaks the intuitive notion that data is on disks and not in hosts. It rears its ugly head during panics, maintenance, and migration. Further, since persistent state is stored in the host, it is not feasible to create a highly available cluster using such hardware.

> if a UPS is then attached to the system to protect against power
> outage, and the kernel designed to honor the sanctity of that section
> of memory upon warm reboot, then that section effectively becomes a
> battery-backed dram disk. Therefore the reliability and persistence
> guarantee for synchronous writes can be honored by writing to that
> section of memory.

While you're protecting against one failure mode, failure of the mains, you introduce many more: UPS failure, ATS failure, software bugs (!), hardware bugs, DRAM transient faults, and maintenance events, to name a few.

My crystal ball seems to point to hybrid disks as an interim step, where the nonvolatile persistent storage is physically on the disk drive's electronics board. This solves both the speed and data containment problems. In the long term, it will all be solid state and your grandchildren won't know what a disk is :-)

It should be noted that Solaris already supports a ramdisk. Further, modern SPARC-based systems provide a mechanism to preserve the ramdisk between boots. All that is needed now is a persistently nonvolatile memory which is backwards compatible with today's various DIMMs. If you solve this, you will be a billionaire, so there are lots of people trying.
 -- richard
Jeff Bonwick
2006-Jan-15 08:23 UTC
[zfs-discuss] Re: Would Rio be practical for making ZIL go fast?
> Instead of turning off the ZIL, just putting the ZIL in a
> main-memory-backed ramdisk would cause the same result.

If you lose power, you lose the ramdisk. You can survive very brief power outages if you use a UPS or NVRAM, but neither one can seriously be called stable storage. People lost a ton of data in the first (1993) World Trade Center bombing because the batteries ran out before power was restored.

When *true* non-volatile memory (e.g. MRAM or Ovonic Unified Memory) replaces DRAM, we won't put the ZIL there -- we'll turn it off.

MRAM was very much in our minds during the design of ZFS because it's going to happen this decade. Power consumption alone will force the economics -- your 1-terabyte laptop won't have enough battery power to continuously refresh DRAM. We designed the ZIL to be completely separate from the rest of the code because we know that it's really just a workaround for the volatility of present-generation memory.

Jeff
Andrew
2006-Jan-15 16:41 UTC
[zfs-discuss] Re: Re: Would Rio be practical for making ZIL go fast?
Jeff Bonwick wrote:
>> Instead of turning off the ZIL, just putting the ZIL in a
>> main-memory-backed ramdisk would cause the same result.
>
> If you lose power, you lose the ramdisk.

Yes, that's what I meant (with the point being that you don't lose the ramdisk until you lose power, so the only extra hardware necessary is a UPS).

> You can survive very brief power outages if you use a UPS or NVRAM,
> but neither one can seriously be called stable storage. People lost a
> ton of data in the first (1993) World Trade Center bombing because
> the batteries ran out before power was restored.

But using a UPS, the ramdisk doesn't need to be stable storage; it only has to survive for a couple of minutes, at most. This is because for those couple of minutes on the UPS, the processor and the hard disk also still have power, and the system is informed when main power is lost and can therefore deal with the problem. When main power is lost and the UPS's batteries are running low (less than a couple of minutes of guaranteed runtime remaining), the system can switch to using a standard persistent-disk-backed ZIL instead of a main-memory-backed ZIL. If the system ever panics, it can automatically trigger a warm reboot. If the system ever hangs, a watchdog timer can trigger a warm reboot. On reboot, the system can replay the ZIL from the ramdisk to commit the data to the persistent disk. With this design, where is the risk of data loss? I.e., in what circumstance would this design lose data, such that in the same circumstance a design using a persistent-disk-only ZIL would not lose data?

> When *true* non-volatile memory (e.g. MRAM or Ovonic Unified Memory)
> replaces DRAM, we won't put the ZIL there -- we'll turn it off.

Nonvolatile main memory will indeed solve the persistence problem, but in the meantime, it doesn't make sense to wait for nonvolatile memory if the problem can already be solved today in spite of volatile memory. MRAM systems will say, "synchronous writes, fast writes, don't have to buy a UPS: pick three." Contemporary systems say, "synchronous writes, fast writes: pick one." But contemporary systems could say, "synchronous writes, fast writes, don't have to buy a UPS: pick two." (With a UPS, instead of putting the ZIL in the ramdisk, the ZIL could be turned off and the file cache put in the ramdisk, but my point is the same.)
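The power-state policy Andrew describes can be sketched as a small decision function. This is an illustrative model only, not ZFS code; all names and the two-minute threshold are hypothetical, taken from his description:

```python
# Illustrative model of Andrew's proposed fallback policy: keep the
# ZIL on a ramdisk while power is healthy, and switch to the on-disk
# ZIL once the UPS reports low remaining runtime. Hypothetical names;
# the 2-minute threshold is the "couple minutes" from the text.

RAMDISK = "ramdisk"
DISK = "disk"

def choose_zil_backend(on_mains: bool, ups_minutes_left: float) -> str:
    """Return where the ZIL should live, given the power state."""
    if on_mains:
        return RAMDISK                 # normal operation: fast ZIL
    if ups_minutes_left > 2.0:
        return RAMDISK                 # on UPS, but still has headroom
    return DISK                        # batteries low: go persistent

# Walking through the states in the message:
assert choose_zil_backend(True, 0.0) == RAMDISK    # mains power
assert choose_zil_backend(False, 30.0) == RAMDISK  # UPS, plenty left
assert choose_zil_backend(False, 1.0) == DISK      # UPS nearly drained
```

The panic and hang cases in the message are handled separately (warm reboot plus ZIL replay from the ramdisk), which is exactly the part Eric challenges in the next reply.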
Al Hopper
2006-Jan-15 19:38 UTC
[zfs-discuss] Re: Would Rio be practical for making ZIL go fast?
On Sun, 15 Jan 2006, Jeff Bonwick wrote:
> > Instead of turning off the ZIL, just putting the ZIL in a
> > main-memory-backed ramdisk would cause the same result.
>
> If you lose power, you lose the ramdisk. You can survive very brief
> power outages if you use a UPS or NVRAM, but neither one can seriously
> be called stable storage. People lost a ton of data in the first
> (1993) World Trade Center bombing because the batteries ran out
> before power was restored.
>
> When *true* non-volatile memory (e.g. MRAM or Ovonic Unified Memory)
> replaces DRAM, we won't put the ZIL there -- we'll turn it off.
>
> MRAM was very much in our minds during the design of ZFS because it's
> going to happen this decade. Power consumption alone will force the
> economics -- your 1-terabyte laptop won't have enough battery power
> to continuously refresh DRAM. We designed the ZIL to be completely
> separate from the rest of the code because we know that it's really
> just a workaround for the volatility of present-generation memory.

Agreed 100%. RAM volatility is a bug - not a feature! And every attempt to fix this "bug", to date, has been a mere work-around. All mechanical systems are subject to wear & tear - and, regardless of the quality of the implementation, are doomed to (premature) mechanical failure.[1]

I'm delighted to see the ZFS team gifted with such futuristic and focused insight. This is in stark contrast to the quarter-to-quarter "instant gratification" philosophy that plagues most (business) corporations today. The separation of the ZIL code from the other parts of ZFS is a really good long-term decision IMHO. In the short term, the inefficiency of the extra "call" layer is a tough pill to swallow. But that bitter pill tastes better every day as the next generation of faster CPUs is brought to market.

[1] A simple example: you want to store the target coordinates in a Minuteman missile - with no volatility or accuracy/integrity/security issues. For something that sounds, and should be, rather trivial, this is a *very* non-trivial problem.

Regards,

Al Hopper  Logical Approach Inc, Plano, TX.  al at logical-approach.com
           Voice: 972.379.2133 Fax: 972.379.2134  Timezone: US CDT
OpenSolaris.Org Community Advisory Board (CAB) Member - Apr 2005
Eric Schrock
2006-Jan-15 21:01 UTC
[zfs-discuss] Re: Re: Would Rio be practical for making ZIL go fast?
On Sun, Jan 15, 2006 at 08:41:49AM -0800, Andrew wrote:
> If the system ever panics, it can automatically trigger a warm reboot.
> If the system ever hangs, a watchdog timer can trigger a warm reboot.
> On reboot, the system can play the ZIL from the ramdisk to commit the
> data to the persistent disk. With this design, where is the risk of
> data loss? I.e. in what circumstance would this design lose data, such
> that in the same circumstance a design using a persistent-disk-only
> ZIL would not lose data?

The point is that all these scenarios require a warm reboot. Jeff has already given some examples where a system might fail to come up, but if you want some more, here goes:

Imagine the sole CPU in your system goes bad and starts getting persistent UEs so that it cannot possibly boot. The only way to get this box to boot is to replace the CPU, which requires turning off the power. Except of course you can't turn off the power, since you need the data in RAM to maintain data consistency.

What happens with hot-swappable disks? If I yank the disks out of my pool and import them on another system, then all the synchronous data _must_ be on disk. This is not an impractical example - it's how cluster failover fundamentally has to work. No notification or warning; your data must be on disk and available on another host immediately.

This works for maintaining filesystem consistency, but it doesn't help with synchronous writes, since they must be committed to stable storage. As Jeff and others have pointed out, requiring some piece of persistent data on the host is just not an option, since the storage is fundamentally unstable. The data MUST be on-disk even if the system fails to come up through a "warm reboot".

- Eric

--
Eric Schrock, Solaris Kernel Development       http://blogs.sun.com/eschrock
Andrew
2006-Jan-16 00:07 UTC
[zfs-discuss] Re: Re: Re: Would Rio be practical for making ZIL go fast?
Eric Schrock wrote:
> Imagine the sole CPU in your system goes bad and starts getting
> persistent UEs so that it cannot possibly boot. The only way to get
> this box to boot is to replace the CPU, which requires turning off the
> power. Except of course you can't turn off the power, since you need
> the data in RAM to maintain data consistency.

OK, if there's only a single system, then use the standard harddisk-backed ZIL; but if there are two systems connected by gigabit ethernet, each backed by an independent UPS system, with memory and network bandwidth to spare, then each system can use two ramdisks: one for local use, and one to NFS-export to the other system. Each system can then make a dedicated ZFS pool consisting of a mirror of its local ramdisk and the other system's NFS-exported ramdisk, and store the ZIL(s) for its other locally-controlled ZFS pool(s) on that dedicated pool. If either system fails to successfully write ZIL data to the NFS-mounted half of its mirror, then it can assume that the other system has died (or the network is down), and fall back to using the standard harddisk-backed ZIL. And if either system fails to receive a periodic keepalive signal from the other system, then it can assume that the other system has died or the network is down, and flush its own NFS-exported ramdisk to its own harddisk on behalf of the possibly-dead system.

> What happens with hot-swappable disks? If I yank the disks out of my
> pool and import them on another system, then all the synchronous data
> _must_ be on disk. This is not an impractical example - it's how
> cluster failover fundamentally has to work. No notification or
> warning; your data must be on disk and available on another host
> immediately.

This same issue would arise if main memory were persistent (e.g. MRAM) and the ZIL were simply turned off. Yet Jeff already wrote: "When *true* non-volatile memory (e.g. MRAM or Ovonic Unified Memory) replaces DRAM, we won't put the ZIL there -- we'll turn it off."
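The two-node failover protocol sketched above reduces to a small state machine. The following is an illustrative model only, with hypothetical names; it is not ZFS or Sun Cluster code, just the decision logic from the message:

```python
# Illustrative model of the two-node scheme: each node mirrors ZIL
# writes across its local ramdisk and the peer's NFS-exported ramdisk,
# demotes to the on-disk ZIL when the peer is unreachable, and flushes
# the ramdisk it exports when the peer misses a keepalive.
# All names are hypothetical.

class Node:
    def __init__(self, name: str):
        self.name = name
        self.zil_backend = "mirrored-ramdisk"   # normal operation
        self.exported_ramdisk_flushed = False

    def zil_write(self, nfs_write_ok: bool) -> str:
        """Record one ZIL write; fall back to disk if the mirror fails."""
        if not nfs_write_ok:
            # Peer (or network) is down: only local stable storage
            # can now honor the synchronous-write guarantee.
            self.zil_backend = "local-disk"
        return self.zil_backend

    def on_keepalive_timeout(self) -> None:
        """Peer missed its keepalive: preserve its synchronous writes
        by flushing the ramdisk we export to it down to our disk."""
        self.exported_ramdisk_flushed = True

node = Node("a")
assert node.zil_write(nfs_write_ok=True) == "mirrored-ramdisk"
node.on_keepalive_timeout()              # peer presumed dead
assert node.exported_ramdisk_flushed
assert node.zil_write(nfs_write_ok=False) == "local-disk"
```

Note that this model still does not answer Eric's hot-swap objection: yanking the disks gives neither node a chance to run the fallback path.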
Richard Elling
2006-Jan-16 05:43 UTC
[zfs-discuss] Re: Re: Would Rio be practical for making ZIL go fast?
> Agreed 100%. RAM volatility is a bug - not a feature! And every attempt
> to fix this "bug", to date, has been a mere work-around. All mechanical
> systems are subject to wear & tear - and, regardless of the quality of the
> implementation, are doomed to (pre-mature) mechanical failure.[1]

nit: integrated circuits are mechanical systems and prone to wear, tear, and mechanical failure. Fortunately, good designs tend to last for tens of years, given proper design margins.
 -- richard
Richard Elling
2006-Jan-16 05:48 UTC
[zfs-discuss] Re: Re: Re: Would Rio be practical for making ZIL go fast?
> Ok, if there's only a single system, then use the standard
> harddisk-backed ZIL, but if there are two systems connected by gigabit
> ethernet, each backed by an independent UPS system, with memory and
> network bandwidth to spare, then each system can use two ramdisks: one
> for local use, and one to NFS export to the other system. Each system
> can then make a dedicated ZFS pool consisting of a mirror of its local
> ramdisk and the other system's NFS-exported ramdisk, and store the
> ZIL(s) for its other locally-controlled ZFS pool(s) on that dedicated
> pool. If either system fails to successfully write ZIL data to the
> NFS-mounted half of its mirror, then it can assume that the other
> system has died (or the network is down), and fall back to using
> standard harddisk-backed ZIL. And if either system fails to receive a
> periodic keepalive signal from the other system, then it can assume
> that the other system has died or the network is down, and flush its
> own NFS-exported ramdisk to its own harddisk on behalf of the
> possibly-dead system.

You are describing, in some ways, the Sun Cluster Cluster File System (aka GFS aka pxfs). q.v. http://docs.sun.com/app/docs/doc/819-0421
 -- richard
Bart Smaalders
2006-Jan-17 17:58 UTC
[zfs-discuss] Re: Re: Would Rio be practical for making ZIL go fast?
Richard Elling wrote:
>> Agreed 100%. RAM volatility is a bug - not a feature! And every attempt
>> to fix this "bug", to date, has been a mere work-around. All mechanical
>> systems are subject to wear & tear - and, regardless of the quality of the
>> implementation, are doomed to (pre-mature) mechanical failure.[1]
>
> nit: integrated circuits are mechanical systems and prone to wear, tear,
> and mechanical failure. Fortunately, good designs tend to last for tens
> of years, given proper design margins.
> -- richard

There is no such thing as an electrical failure. All failures are mechanical.

- Bart

--
Bart Smaalders            Solaris Kernel Performance
barts at cyber.eng.sun.com http://blogs.sun.com/barts