The decade-old paper "The Rio File Cache: Surviving Operating System Crashes" at http://www.eecs.umich.edu/Rio/papers/chen96.pdf explains how to make a filesystem write-back cache as reliable as write-through. It seems that this mechanism could be used to eliminate the need to flush the ZIL to disk when performing synchronous writes, and all filesystem writes could therefore be made synchronous with essentially no performance loss.

This message posted from opensolaris.org
Jeff Bonwick
2006-Jan-14 07:09 UTC
[zfs-discuss] Would Rio be practical for making ZIL go fast?
> It seems that this mechanism could be used to eliminate the need to flush
> the ZIL to disk when performing synchronous writes, and all filesystem
> writes could therefore be made synchronous with essentially no performance
> loss.

Sadly not. The requirement of a synchronous write is that the filesystem cannot return from the write(2) system call until the data is on disk. No caching strategy can circumvent the disk write.

The purpose of the Rio work was different: to survive crashes. ZFS already does that, with or without the ZIL. The only purpose of the ZIL is to reduce the latency of synchronous I/O requests. The ZIL is not required for fsckless operation. If you turned off the ZIL, all it would mean is that in the event of a crash, it would appear that some of the most recent (last few seconds) synchronous system calls never happened. In other words, we wouldn't have met the O_DSYNC specification, but the filesystem would nevertheless still be perfectly consistent on disk.

Jeff
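The synchronous-write contract described here is visible directly from userland. A minimal sketch, assuming a POSIX system that exposes O_DSYNC (the file path is illustrative):

```python
# Minimal sketch of the synchronous-write contract: with O_DSYNC,
# write(2) may not return until the data has reached stable storage.
# The path below is illustrative.
import os
import tempfile

path = os.path.join(tempfile.gettempdir(), "dsync-demo.dat")
fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_DSYNC, 0o644)
try:
    # This write blocks until the data is on the disk -- exactly the
    # latency that the ZIL exists to reduce.
    n = os.write(fd, b"committed\n")
finally:
    os.close(fd)
    os.unlink(path)

print("synchronously wrote", n, "bytes")
```

No cache can make that write return early and still honor O_DSYNC, which is Jeff's point: the ZIL reduces the latency of the commit, but the commit itself must reach stable storage.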
Andrew
2006-Jan-14 17:53 UTC
[zfs-discuss] Re: Would Rio be practical for making ZIL go fast?
Jeff Bonwick wrote:
>> It seems that this mechanism could be used to eliminate the need to flush
>> the ZIL to disk when performing synchronous writes, and all filesystem
>> writes could therefore be made synchronous with essentially no performance
>> loss.
>
> Sadly not. The requirement of a synchronous write is that the
> filesystem cannot return from the write(2) system call until the
> data is on disk. No caching strategy can circumvent the disk write.

The disk on which the data is guaranteed to have been written after a synchronous write could be a conventional magnetic disk, or a flash disk, or a battery-backed DRAM disk. The purpose of Rio is to make software-induced corruption or erasure of a particular section of the system's main memory no more likely than software-induced corruption or erasure of data on the disk (the test results presented in the paper show that this goal was achieved); if a UPS is then attached to the system to protect against power outage, and the kernel is designed to honor the sanctity of that section of memory upon warm reboot, then that section effectively becomes a battery-backed DRAM disk. Therefore the reliability and persistence guarantee for synchronous writes can be honored by writing to that section of memory.

> The purpose of the Rio work was different: to survive crashes.

Yes; to protect a section of memory against corruption and erasure generally, with crash/reboot being one important case (since contemporary systems erase all previous contents of main memory upon reboot by ignoring and overwriting it). In the case of ZFS, the relevant property of Rio is protection against erasure.

> ZFS already does that, with or without the ZIL. The only purpose
> of the ZIL is to reduce the latency of synchronous I/O requests.

It's true that both for synchronous and asynchronous writes with ZFS, the reliability of data on the disk is never in question, and the persistence of synchronous writes is never in question, but achieving persistence for the synchronous writes still requires actually writing to the disk, which is sufficiently slow that making all writes synchronous by default is not feasible. A main-memory-backed ramdisk could be used for the ZIL if the system has a UPS, but in Solaris as currently designed the ramdisk is not persistent across system reboots; Rio would make it persistent.

> The ZIL is not required for fsckless operation. If you turned off
> the ZIL, all it would mean is that in the event of a crash, it would
> appear that some of the most recent (last few seconds) synchronous
> system calls never happened. In other words, we wouldn't have met
> the O_DSYNC specification, but the filesystem would nevertheless
> still be perfectly consistent on disk.

Instead of turning off the ZIL, just putting the ZIL in a main-memory-backed ramdisk would produce the same result. The point of Rio in this case would simply be to make that ramdisk persistent, thus guaranteeing the persistence of all synchronously written data even in the event of a crash and reboot. The same result could be achieved by disconnecting a few of the system's DRAM chips from the processor bus, hooking them to a SATA interface, sticking a battery on it, and using that new disk to hold the ZIL. Using a system-wide UPS eliminates the need for that dedicated battery, and using Rio eliminates the need to move some of the DRAM chips from the processor bus to a SATA interface.
Richard Elling
2006-Jan-14 20:18 UTC
[zfs-discuss] Re: Would Rio be practical for making ZIL go fast?
Andrew writes:
> The disk on which the data is guaranteed to have been written after a
> synchronous write could be a conventional magnetic disk, or a flash
> disk, or a battery-backed dram disk. The purpose of Rio is to make
> software-induced corruption or erasure of a particular section of the
> system's main memory no more likely than software-induced corruption
> or erasure of data on the disk (the test results presented in the
> paper show that this goal was achieved);

Yes. Indeed, Sun has had several hardware products over the years which provide host-based nonvolatile caching for I/O (IIRC, the name PrestoServe jiggles some brain cells). Alas, while they do improve performance, they are not commercially viable. The architectural problem is that an interdependency is created between the data on disk and the data remaining in the host. This breaks the intuitive notion that data is on disks and not in hosts. It rears its ugly head during panics, maintenance, and migration. Further, since persistent state is stored in the host, it is not feasible to create a highly available cluster using such hardware.

> if a UPS is then attached to the system to protect against power
> outage, and the kernel designed to honor the sanctity of that section
> of memory upon warm reboot, then that section effectively becomes a
> battery-backed dram disk. Therefore the reliability and persistence
> guarantee for synchronous writes can be honored by writing to that
> section of memory.

While you're protecting against one failure mode, failure of the mains, you introduce many more: UPS failure, ATS failure, software bugs (!), hardware bugs, DRAM transient faults, and maintenance events, to name a few.

My crystal ball seems to point to hybrid disks as an interim step, where the nonvolatile persistent storage is physically on the disk drive's electronics board. This solves both the speed and data containment problems. In the long term, it will all be solid state and your grandchildren won't know what a disk is :-)

It should be noted that Solaris already supports a ramdisk. Further, modern SPARC-based systems provide a mechanism to preserve the ramdisk between boots. All that is needed now is a persistently nonvolatile memory which is backwards compatible with today's various DIMMs. If you solve this, you will be a billionaire, so there are lots of people trying.
 -- richard
Jeff Bonwick
2006-Jan-15 08:23 UTC
[zfs-discuss] Re: Would Rio be practical for making ZIL go fast?
> Instead of turning off the ZIL, just putting the ZIL in a
> main-memory-backed ramdisk would cause the same result.

If you lose power, you lose the ramdisk. You can survive very brief power outages if you use a UPS or NVRAM, but neither one can seriously be called stable storage. People lost a ton of data in the first (1993) World Trade Center bombing because the batteries ran out before power was restored.

When *true* non-volatile memory (e.g. MRAM or Ovonic Unified Memory) replaces DRAM, we won't put the ZIL there -- we'll turn it off.

MRAM was very much in our minds during the design of ZFS because it's going to happen this decade. Power consumption alone will force the economics -- your 1-terabyte laptop won't have enough battery power to continuously refresh DRAM. We designed the ZIL to be completely separate from the rest of the code because we know that it's really just a workaround for the volatility of present-generation memory.

Jeff
Andrew
2006-Jan-15 16:41 UTC
[zfs-discuss] Re: Re: Would Rio be practical for making ZIL go fast?
Jeff Bonwick wrote:
>> Instead of turning off the ZIL, just putting the ZIL in a
>> main-memory-backed ramdisk would cause the same result.
>
> If you lose power, you lose the ramdisk.

Yes, that's what I meant (with the point being that you don't lose the ramdisk until you lose power, so the only extra hardware necessary is a UPS).

> You can survive very brief power outages if you use a UPS or NVRAM,
> but neither one can seriously be called stable storage. People lost a
> ton of data in the first (1993) World Trade Center bombing because
> the batteries ran out before power was restored.

But using a UPS, the ramdisk doesn't need to be stable storage; it only has to survive for a couple of minutes, at most. This is because for those couple of minutes on the UPS, the processor and the hard disk also still have power, and the system is informed when main power is lost and can therefore deal with the problem. When main power is lost and the UPS's batteries are running low (less than a couple of minutes of guaranteed runtime remaining), the system can switch to using a standard persistent-disk-backed ZIL instead of a main-memory-backed ZIL. If the system ever panics, it can automatically trigger a warm reboot. If the system ever hangs, a watchdog timer can trigger a warm reboot. On reboot, the system can replay the ZIL from the ramdisk to commit the data to the persistent disk. With this design, where is the risk of data loss? I.e., in what circumstance would this design lose data, such that in the same circumstance a design using a persistent-disk-only ZIL would not lose data?

> When *true* non-volatile memory (e.g. MRAM or Ovonic Unified Memory)
> replaces DRAM, we won't put the ZIL there -- we'll turn it off.

Nonvolatile main memory will indeed solve the persistence problem, but in the meantime, it doesn't make sense to wait for nonvolatile memory if the problem can already be solved today in spite of volatile memory. MRAM systems will say, "synchronous writes, fast writes, don't have to buy a UPS: pick three." Contemporary systems say, "synchronous writes, fast writes: pick one." But contemporary systems could say, "synchronous writes, fast writes, don't have to buy a UPS: pick two." (With a UPS, instead of putting the ZIL in the ramdisk, the ZIL could be turned off and the file cache put in the ramdisk, but my point is the same.)
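The power-state policy Andrew describes can be sketched as a small decision function. This is an illustrative model only, not ZFS code; all names and the two-minute threshold are hypothetical, taken from his description:

```python
# Illustrative model of Andrew's proposed fallback policy: keep the
# ZIL on a ramdisk while power is healthy, and switch to the on-disk
# ZIL once the UPS reports low remaining runtime. Hypothetical names;
# the 2-minute threshold is the "couple minutes" from the text.

RAMDISK = "ramdisk"
DISK = "disk"

def choose_zil_backend(on_mains: bool, ups_minutes_left: float) -> str:
    """Return where the ZIL should live, given the power state."""
    if on_mains:
        return RAMDISK                 # normal operation: fast ZIL
    if ups_minutes_left > 2.0:
        return RAMDISK                 # on UPS, but still has headroom
    return DISK                        # batteries low: go persistent

# Walking through the states in the message:
assert choose_zil_backend(True, 0.0) == RAMDISK    # mains power
assert choose_zil_backend(False, 30.0) == RAMDISK  # UPS, plenty left
assert choose_zil_backend(False, 1.0) == DISK      # UPS nearly drained
```

The panic and hang cases in the message are handled separately (warm reboot plus ZIL replay from the ramdisk), which is exactly the part Eric challenges in the next reply.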
Al Hopper
2006-Jan-15 19:38 UTC
[zfs-discuss] Re: Would Rio be practical for making ZIL go fast?
On Sun, 15 Jan 2006, Jeff Bonwick wrote:
> > Instead of turning off the ZIL, just putting the ZIL in a
> > main-memory-backed ramdisk would cause the same result.
>
> If you lose power, you lose the ramdisk. You can survive very brief
> power outages if you use a UPS or NVRAM, but neither one can seriously
> be called stable storage. People lost a ton of data in the first
> (1993) World Trade Center bombing because the batteries ran out
> before power was restored.
>
> When *true* non-volatile memory (e.g. MRAM or Ovonic Unified Memory)
> replaces DRAM, we won't put the ZIL there -- we'll turn it off.
>
> MRAM was very much in our minds during the design of ZFS because it's
> going to happen this decade. Power consumption alone will force the
> economics -- your 1-terabyte laptop won't have enough battery power
> to continuously refresh DRAM. We designed the ZIL to be completely
> separate from the rest of the code because we know that it's really
> just a workaround for the volatility of present-generation memory.

Agreed 100%. RAM volatility is a bug - not a feature! And every attempt to fix this "bug", to date, has been a mere work-around. All mechanical systems are subject to wear & tear - and, regardless of the quality of the implementation, are doomed to (premature) mechanical failure.[1]

I'm delighted to see the ZFS team gifted with such futuristic and focused insight. This is in stark contrast to the quarter-to-quarter "instant gratification" philosophy that plagues most (business) corporations today. The separation of the ZIL code from the other parts of ZFS is a really good long-term decision IMHO. In the short term, the inefficiency of the extra "call" layer is a tough pill to swallow. But that bitter pill tastes better every day as the next generation of faster CPUs is brought to market.

[1] A simple example: you want to store the target coordinates in a Minuteman missile - with no volatility or accuracy/integrity/security issues. For something that sounds, and should be, rather trivial, this is a *very* non-trivial problem.

Regards,

Al Hopper  Logical Approach Inc, Plano, TX.  al at logical-approach.com
           Voice: 972.379.2133 Fax: 972.379.2134  Timezone: US CDT
OpenSolaris.Org Community Advisory Board (CAB) Member - Apr 2005
Eric Schrock
2006-Jan-15 21:01 UTC
[zfs-discuss] Re: Re: Would Rio be practical for making ZIL go fast?
On Sun, Jan 15, 2006 at 08:41:49AM -0800, Andrew wrote:
> If the system ever panics, it can automatically trigger a warm reboot.
> If the system ever hangs, a watchdog timer can trigger a warm reboot.
> On reboot, the system can play the ZIL from the ramdisk to commit the
> data to the persistent disk. With this design, where is the risk of
> data loss? I.e. in what circumstance would this design lose data, such
> that in the same circumstance a design using a persistent-disk-only
> ZIL would not lose data?

The point is that all these scenarios require a warm reboot. Jeff has already given some examples where a system might fail to come up, but if you want some more, here goes:

Imagine the sole CPU in your system goes bad and starts getting persistent UEs so that it cannot possibly boot. The only way to get this box to boot is to replace the CPU, which requires turning off the power. Except of course you can't turn off the power, since you need the data in RAM to maintain data consistency.

What happens with hot-swappable disks? If I yank the disks out of my pool and import them on another system, then all the synchronous data _must_ be on disk. This is not an impractical example - it's how cluster failover fundamentally has to work. No notification or warning; your data must be on disk and available on another host immediately.

This works for maintaining filesystem consistency, but it doesn't help with synchronous writes, since they must be committed to stable storage. As Jeff and others have pointed out, requiring some piece of persistent data on the host is just not an option, since the storage is fundamentally unstable. The data MUST be on-disk even if the system fails to come up through a "warm reboot".

- Eric

--
Eric Schrock, Solaris Kernel Development       http://blogs.sun.com/eschrock
Andrew
2006-Jan-16 00:07 UTC
[zfs-discuss] Re: Re: Re: Would Rio be practical for making ZIL go fast?
Eric Schrock wrote:
> Imagine the sole CPU in your system goes bad and starts getting
> persistent UEs so that it cannot possibly boot. The only way to get
> this box to boot is to replace the CPU, which requires turning off the
> power. Except of course you can't turn off the power, since you need
> the data in RAM to maintain data consistency.

OK, if there's only a single system, then use the standard harddisk-backed ZIL; but if there are two systems connected by gigabit ethernet, each backed by an independent UPS system, with memory and network bandwidth to spare, then each system can use two ramdisks: one for local use, and one to NFS-export to the other system. Each system can then make a dedicated ZFS pool consisting of a mirror of its local ramdisk and the other system's NFS-exported ramdisk, and store the ZIL(s) for its other locally-controlled ZFS pool(s) on that dedicated pool. If either system fails to successfully write ZIL data to the NFS-mounted half of its mirror, then it can assume that the other system has died (or the network is down), and fall back to using the standard harddisk-backed ZIL. And if either system fails to receive a periodic keepalive signal from the other system, then it can assume that the other system has died or the network is down, and flush its own NFS-exported ramdisk to its own harddisk on behalf of the possibly-dead system.

> What happens with hot-swappable disks? If I yank the disks out of my
> pool and import them on another system, then all the synchronous data
> _must_ be on disk. This is not an impractical example - it's how
> cluster failover fundamentally has to work. No notification or
> warning; your data must be on disk and available on another host
> immediately.

This same issue would arise if main memory were persistent (e.g. MRAM) and the ZIL were simply turned off. Yet Jeff already wrote: "When *true* non-volatile memory (e.g. MRAM or Ovonic Unified Memory) replaces DRAM, we won't put the ZIL there -- we'll turn it off."
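The two-node failover protocol sketched above reduces to a small state machine. The following is an illustrative model only, with hypothetical names; it is not ZFS or Sun Cluster code, just the decision logic from the message:

```python
# Illustrative model of the two-node scheme: each node mirrors ZIL
# writes across its local ramdisk and the peer's NFS-exported ramdisk,
# demotes to the on-disk ZIL when the peer is unreachable, and flushes
# the ramdisk it exports when the peer misses a keepalive.
# All names are hypothetical.

class Node:
    def __init__(self, name: str):
        self.name = name
        self.zil_backend = "mirrored-ramdisk"   # normal operation
        self.exported_ramdisk_flushed = False

    def zil_write(self, nfs_write_ok: bool) -> str:
        """Record one ZIL write; fall back to disk if the mirror fails."""
        if not nfs_write_ok:
            # Peer (or network) is down: only local stable storage
            # can now honor the synchronous-write guarantee.
            self.zil_backend = "local-disk"
        return self.zil_backend

    def on_keepalive_timeout(self) -> None:
        """Peer missed its keepalive: preserve its synchronous writes
        by flushing the ramdisk we export to it down to our disk."""
        self.exported_ramdisk_flushed = True

node = Node("a")
assert node.zil_write(nfs_write_ok=True) == "mirrored-ramdisk"
node.on_keepalive_timeout()              # peer presumed dead
assert node.exported_ramdisk_flushed
assert node.zil_write(nfs_write_ok=False) == "local-disk"
```

Note that this model still does not answer Eric's hot-swap objection: yanking the disks gives neither node a chance to run the fallback path.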
Richard Elling
2006-Jan-16 05:43 UTC
[zfs-discuss] Re: Re: Would Rio be practical for making ZIL go fast?
> Agreed 100%. RAM volatility is a bug - not a feature! And every attempt
> to fix this "bug", to date, has been a mere work-around. All mechanical
> systems are subject to wear & tear - and, regardless of the quality of the
> implementation, are doomed to (pre-mature) mechanical failure.[1]

nit: integrated circuits are mechanical systems and prone to wear, tear, and mechanical failure. Fortunately, good designs tend to last for tens of years, given proper design margins.
 -- richard
Richard Elling
2006-Jan-16 05:48 UTC
[zfs-discuss] Re: Re: Re: Would Rio be practical for making ZIL go fast?
> Ok, if there's only a single system, then use the standard
> harddisk-backed ZIL, but if there are two systems connected by gigabit
> ethernet, each backed by an independent UPS system, with memory and
> network bandwidth to spare, then each system can use two ramdisks: one
> for local use, and one to NFS export to the other system. Each system
> can then make a dedicated ZFS pool consisting of a mirror of its local
> ramdisk and the other system's NFS-exported ramdisk, and store the
> ZIL(s) for its other locally-controlled ZFS pool(s) on that dedicated
> pool. If either system fails to successfully write ZIL data to the
> NFS-mounted half of its mirror, then it can assume that the other
> system has died (or the network is down), and fall back to using
> standard harddisk-backed ZIL. And if either system fails to receive a
> periodic keepalive signal from the other system, then it can assume
> that the other system has died or the network is down, and flush its
> own NFS-exported ramdisk to its own harddisk on behalf of the
> possibly-dead system.

You are describing, in some ways, the Sun Cluster Cluster File System (aka GFS aka pxfs). q.v. http://docs.sun.com/app/docs/doc/819-0421
 -- richard
Bart Smaalders
2006-Jan-17 17:58 UTC
[zfs-discuss] Re: Re: Would Rio be practical for making ZIL go fast?
Richard Elling wrote:
>> Agreed 100%. RAM volatility is a bug - not a feature! And every attempt
>> to fix this "bug", to date, has been a mere work-around. All mechanical
>> systems are subject to wear & tear - and, regardless of the quality of the
>> implementation, are doomed to (pre-mature) mechanical failure.[1]
>
> nit: integrated circuits are mechanical systems and prone to wear, tear,
> and mechanical failure. Fortunately, good designs tend to last for tens
> of years, given proper design margins.
> -- richard

There is no such thing as an electrical failure. All failures are mechanical.

- Bart

--
Bart Smaalders            Solaris Kernel Performance
barts at cyber.eng.sun.com http://blogs.sun.com/barts