Hi All;

One of our customers suffered filesystem corruption after an unattended
shutdown due to a power problem. They want to switch to ZFS.

From what I have read, ZFS will most probably not be corrupted by the same
event. But I am not sure how Oracle will be affected by a sudden power
outage when placed on top of ZFS?

Any comments?

PS: I am aware of UPSs and similar technologies, but the customer is still
asking those "what if ..." questions ...

Mertol

Mertol Ozyoney
Storage Practice - Sales Manager
Sun Microsystems, TR
Istanbul TR
Phone +902123352200
Mobile +905339310752
Fax +902123352222
Email mertol.ozyoney at Sun.COM
From my usage, the first question you should ask your customer is how much
of a performance hit they can spare when switching to ZFS for Oracle. I've
done lots of tweaking (following threads I've read on the mailing list),
but I still can't seem to get enough performance out of any databases on
ZFS. I've tried using zvols, cooked files on top of ZFS filesystems,
everything, but either raw disk devices via the old-style DiskSuite tools
or cooked files on top of the same are far more performant than anything
on ZFS. Your mileage may vary, but so far, that's where I stand.

As for the corrupted filesystem, ZFS is much better, but there are still
no guarantees that your filesystem won't be corrupted during a hard
shutdown. The CoW and checksumming give you a much lower incidence of
corruption, but the customer still needs to be made aware that things like
battery-backed controllers, managed UPSs, redundant power supplies, and
the like are the first thing they need to put into place - not the last.

On Mon, Jun 23, 2008 at 11:56 AM, Mertol Ozyoney <Mertol.Ozyoney at sun.com> wrote:
> From what I have read, ZFS will most probably not be corrupted by the
> same event. But I am not sure how Oracle will be affected by a sudden
> power outage when placed on top of ZFS?

-- 
chris -at- microcozm -dot- net
=== Si Hoc Legere Scis Nimium Eruditionis Habes
Mertol Ozyoney wrote:
> One of our customers suffered filesystem corruption after an unattended
> shutdown due to a power problem. They want to switch to ZFS.
>
> From what I have read, ZFS will most probably not be corrupted by the
> same event. But I am not sure how Oracle will be affected by a sudden
> power outage when placed on top of ZFS?

Most databases have the ability to recover from unscheduled interruptions
without causing corruption. ZFS works in the same way -- you will recover
to a stable point in time. In-flight transactions will not be completed,
as expected. Upon restart, ZFS recovery will happen first, followed by the
database recovery.

> PS: I am aware of UPSs and similar technologies, but the customer is
> still asking those "what if ..." questions ...

UPSs fail, too. When we design highly available services, we expect that
unscheduled interruptions will occur -- that is the only way to design
effective solutions.
 -- richard
>>>>> "mo" == Mertol Ozyoney <Mertol.Ozyoney at Sun.COM> writes:

    mo> One of our customers suffered filesystem corruption after an
    mo> unattended shutdown due to a power problem.

    mo> They want to switch to ZFS.

    mo> From what I have read, ZFS will most probably not be corrupted
    mo> by the same event.

It's not supposed to happen with UFS, either.  Nor XFS, JFS, ext3,
reiserfs, FFS+softdep, plain FFS, mac-HFS+journal.  All filesystems in
popular use for many years, except maybe NTFS, are supposed to obey fsync
and survive kernel crashes and unplanned power outages that happen after
fsync returns, without losing any data written before fsync was called.
The fact that they don't in practice is a warning that ZFS might not,
either, no matter what it promises in theory.

I think many cheap PeeCee RAID setups without batteries suffer from ``the
RAID5 write hole,'' which takes away all the guarantees of
no-power-fail-corruption that the filesystems made, and these broken
no-battery setups seem to be really popular.  If one used ZFS on top of
such a no-battery RAID instead of switching it to JBOD mode, ZFS would be
vulnerable, too.

One interesting part of ZFS's ``in theory'' pitch is that, if you use
redundancy with ZFS, the checksums may somewhat address the problem
described below:

  http://linuxmafia.com/faq/Filesystems/reiserfs.html

-----8<-----
You see, when you yank the power cord out of the wall, not all parts of
the computer stop functioning at the same time.  As the voltage starts
dropping on the +5 and +12 volt rails, certain parts of the system may
last longer than other parts.  For example, the DMA controller, hard
drive controller, and hard drive unit may continue functioning for
several hundred milliseconds, long after the DIMMs, which are very
voltage sensitive, have gone crazy, and are returning total random
garbage.  If this happens while the filesystem is writing critical
sections of the filesystem metadata, well, you get to visit the fun Web
pages at http://You.Lose.Hard/ .  I was actually told about this by an
XFS engineer, who discovered this about the hardware.  Their solution was
to add a power-fail interrupt and bigger capacitors in the power supplies
in SGI hardware; and, in Irix, when the power-fail interrupt triggers,
the first thing the OS does is to run around frantically aborting I/O
transfers to the disk.  Unfortunately, PC-class hardware doesn't have
power-fail interrupts.  Remember, PC-class hardware is cr*p.
-----8<-----

I would suspect a ZFS mirror might have a better shot of coming through
that type of crazy power failure, but I don't know how anything can be
robust to a mysterious force that scribbles randomly all over the disk.
On the downside, there are some things I thought I understood about SVM's
ideas of quorum that I do not yet understand in the ZFS world.

Also, FTR, I use his ext3 rather than XFS myself, but I'm a little
skeptical of Ted Ts'o ranting above because he is defending a shortcut he
took writing his own filesystem.  And I'm not sure the cord-pulling
problem he describes is really universal, and is really a reason for XFS
users losing data that ext3 users don't---it sounds like it could be a
specific-quirk type problem, a blip in history just like ``the 5-volt
rail'' he talks about (+5V? what did they used to run on 5 volts, a disk
motor or a battery charger or something?).  The SGI engineers had the
problem on their specific hardware, and solved it, but it may or may not
exist on present machines.
Maybe current hardware has other equally weird problems when one pulls
the power cord.

-- 
READ CAREFULLY.  By reading this fortune, you agree, on behalf of your
employer, to release me from all obligations and waivers arising from any
and all NON-NEGOTIATED agreements, licenses, terms-of-service, shrinkwrap,
clickwrap, browsewrap, confidentiality, non-disclosure, non-compete and
acceptable use policies ("BOGUS AGREEMENTS") that I have entered into with
your employer, its partners, licensors, agents and assigns, in perpetuity,
without prejudice to my ongoing rights and privileges.  You further
represent that you have the authority to release me from any BOGUS
AGREEMENTS on behalf of your employer.
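For concreteness, the fsync() contract being argued about in this thread
can be reduced to a few lines of C. This is a minimal sketch against the
plain POSIX interfaces, not anything ZFS-specific, and the path name and
record are invented for illustration: the only promise any of these
filesystems makes is about data written before a successful fsync()
return.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
    const char buf[] = "committed record\n";
    /* Hypothetical path, used only for illustration. */
    int fd = open("/tank/app/journal", O_WRONLY | O_CREAT | O_APPEND, 0644);

    if (fd == -1) {
        perror("open");
        return (1);
    }
    if (write(fd, buf, sizeof (buf) - 1) != (ssize_t)(sizeof (buf) - 1)) {
        perror("write");
        return (1);
    }
    /*
     * Until fsync() returns successfully, this record may still sit in
     * the page cache or in a disk's volatile write cache, and a power
     * failure is allowed to eat it.  After a successful return, every
     * filesystem named above is supposed to preserve it across a crash
     * or cord pull.
     */
    if (fsync(fd) == -1) {
        perror("fsync");
        return (1);
    }
    (void) close(fd);
    return (0);
}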
On Jun 23, 2008, at 11:36 AM, Miles Nordin wrote:
> unplanned power outage that happens after fsync returns

Aye, but isn't that the real rub ... when the power fails after the
write but *before* the fsync has occurred ...

-- 
Keith H. Bierman   khbkhb at gmail.com | AIM kbiermank
5430 Nassau Circle East | Cherry Hills Village, CO 80113 | 303-997-2749
<speaking for myself*> Copyright 2008
Miles Nordin wrote:
> It's not supposed to happen with UFS, either.  Nor XFS, JFS, ext3,
> reiserfs, FFS+softdep, plain FFS, mac-HFS+journal.  All filesystems in
> popular use for many years, except maybe NTFS, are supposed to obey
> fsync and survive kernel crashes and unplanned power outages that
> happen after fsync returns, without losing any data written before
> fsync was called.  The fact that they don't in practice is a warning
> that ZFS might not, either, no matter what it promises in theory.

There is a more common failure mode at work here. Most low-cost disks
have their volatile write cache enabled. UFS knows nothing of such caches
and believes the disk has committed data when it acks. In other words,
even with O_DSYNC and friends doing the "right thing" in the OS, the disk
lies about the persistence of the data. ZFS knows disks lie, so it sends
sync commands when necessary to help ensure that the data is flushed to
persistent storage. But even if it is not flushed, the ZFS on-disk format
is such that you can recover to a point in time where the file system is
consistent. This is not the case for UFS, which was designed to trust the
hardware.
 -- richard
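For what it's worth, the "sync commands" described above show up on
Solaris as the DKIOCFLUSHWRITECACHE ioctl from dkio(7I); ZFS issues the
equivalent request from inside the kernel when it needs the drive's
volatile cache emptied. Below is a rough user-level sketch, with a
hypothetical device path, most error handling omitted, and the usual
caveat that opening the raw device needs privilege.

#include <sys/dkio.h>   /* DKIOCFLUSHWRITECACHE */
#include <fcntl.h>
#include <stdio.h>
#include <stropts.h>    /* ioctl() on Solaris */
#include <unistd.h>

int
main(void)
{
    /* Hypothetical raw device path; opening it requires privilege. */
    int fd = open("/dev/rdsk/c0t0d0s0", O_RDONLY);

    if (fd == -1) {
        perror("open");
        return (1);
    }
    /*
     * Ask the drive to flush its volatile write cache to stable media.
     * A NULL argument makes the request synchronous: the ioctl should
     * not return until the cache has been written out.  ZFS does the
     * kernel-level equivalent so that an ack from the disk actually
     * means the data is persistent.
     */
    if (ioctl(fd, DKIOCFLUSHWRITECACHE, NULL) == -1)
        perror("DKIOCFLUSHWRITECACHE");

    (void) close(fd);
    return (0);
}

The same reasoning is why the thread keeps circling back to whether
intermediate layers (iSCSI targets, RAID controllers) actually honor
these flush requests.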
>>>>> "re" == Richard Elling <Richard.Elling at Sun.COM> writes:
>>>>> "kb" == Keith Bierman <khbkhb at gmail.com> writes:

    re> the disk lies about the persistence of the data.  ZFS knows
    re> disks lie, so it sends sync commands when necessary

(1) I don't think ``lie'' is a correct characterization given that the
    sync commands exist, but point taken about the other area of risk.

    I suspect there may be similar problems in ZFS's write path when one
    is using iSCSI targets.  Or it's just common for iSCSI target
    implementations to suck (lie).  Or maybe it's something else I'm
    seeing.

(2) I thought the recommendation that one give ZFS whole disks and let
    it put EFI labels on them came from the Solaris behavior that, only
    in a whole-disk-for-zfs configuration, will the Solaris drivers
    refrain from explicitly disabling the write cache in these
    inexpensive disks.  So the cache shouldn't be a problem for UFS, but
    it might be for non-Solaris operating systems (even for ZFS on
    platforms where ZFS is ported but the SYNCHRONIZE CACHE commands
    don't make it through some mid-layer or CAM or driver).

    kb> Aye, but isn't that the real rub ... when the power fails after
    kb> the write but *before* the fsync has occurred ...

No, there is no rub here, I was only speaking precisely.  A proper DBMS
(anything except MySQL) is also designed to understand that power
failures happen.  It does its writes in a deliberate order such that it
won't return success to the application calling it until it gets the
return from fsync(), and also so that the system will never recover such
that a transaction is half-completed.

    re> the ZFS on-disk format is such that you can recover to a point
    re> in time where the file system is consistent.

Do you mean that, ``after a power outage ZFS will always recover the
filesystem to a state that it passed through in the moments leading up
to the outage,'' while UFS, which logs only metadata, typically recovers
to some state the filesystem never passed through---but it never loses
fsync()ed data nor data that wasn't written ``recently'' before the
crash?

For casual filesystem use, or for applications that weren't designed
with cord-pulling in mind, ZFS's guarantee is larger and more
comforting.  But for databases, I don't think the distinction matters
because they call fsync() at deliberate moments and do their own
copy-on-write logging above the filesystem, so they provide the same
consistency guarantees whether operating on UFS or ZFS.  It would be
fine to feed a database the type of hacked non-CoW zvol that's used for
swap, if fsync could be made to work there, which maybe it can't.
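The "deliberate order" described above is the usual write-ahead-logging
pattern. As a toy illustration only (no real database's code, and the
names are invented): the commit path appends the log record, forces it
with fsync(), and only then reports success, so recovery never finds an
acknowledged transaction that was not durably logged.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Toy write-ahead-log commit.  The point is the ordering: log record
 * first, fsync() second, acknowledgement third.
 */
static int
commit_transaction(int log_fd, const char *record, size_t len)
{
    /* 1. Append the record describing the change. */
    if (write(log_fd, record, len) != (ssize_t)len)
        return (-1);

    /* 2. Force it to stable storage before telling anyone it worked. */
    if (fsync(log_fd) == -1)
        return (-1);

    /*
     * 3. Only now may success be reported.  The table pages themselves
     *    can be written lazily; recovery replays the log to rebuild
     *    anything lost in flight.
     */
    return (0);
}

int
main(void)
{
    /* Hypothetical redo log path, used only for illustration. */
    int log_fd = open("/tank/db/redo.log",
        O_WRONLY | O_CREAT | O_APPEND, 0600);
    const char rec[] = "txn 42: debit A, credit B\n";

    if (log_fd == -1) {
        perror("open");
        return (1);
    }
    if (commit_transaction(log_fd, rec, sizeof (rec) - 1) == 0)
        (void) puts("transaction acknowledged");
    (void) close(log_fd);
    return (0);
}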
Miles Nordin wrote:
>     re> the disk lies about the persistence of the data.  ZFS knows
>     re> disks lie, so it sends sync commands when necessary
>
> (1) I don't think ``lie'' is a correct characterization given that the
>     sync commands exist, but point taken about the other area of risk.

IMNSHO, they lie. Some disks do not disable volatile write caches, even
when you ask them. I've got a scar... right there below the ORA-27062
and next to the FC-disk firmware scars... I think Torrey's is on his
backside... :-)

>     I suspect there may be similar problems in ZFS's write path when
>     one is using iSCSI targets.  Or it's just common for iSCSI target
>     implementations to suck (lie).  Or maybe it's something else I'm
>     seeing.

I hope they aren't making assumptions about volatility...

> (2) I thought the recommendation that one give ZFS whole disks and let
>     it put EFI labels on them came from the Solaris behavior that,
>     only in a whole-disk-for-zfs configuration, will the Solaris
>     drivers refrain from explicitly disabling the write cache in these
>     inexpensive disks.  So the cache shouldn't be a problem for UFS,
>     but it might be for non-Solaris operating systems (even for ZFS on
>     platforms where ZFS is ported but the SYNCHRONIZE CACHE commands
>     don't make it through some mid-layer or CAM or driver).

Close. By default, Solaris will try to disable the write cache,
ostensibly to protect UFS. But if the whole disk is in use by ZFS, then
it will enable the write cache, and ZFS uses the synchronize cache
commands as appropriate. Solaris is a bit conservative here, maybe too
conservative. In some cases you can override it with format -e.

>     kb> Aye, but isn't that the real rub ... when the power fails
>     kb> after the write but *before* the fsync has occurred ...
>
> No, there is no rub here, I was only speaking precisely.  A proper
> DBMS (anything except MySQL) is also designed to understand that power
> failures happen.  It does its writes in a deliberate order such that
> it won't return success to the application calling it until it gets
> the return from fsync(), and also so that the system will never
> recover such that a transaction is half-completed.

ZFS has similar protections. The most interesting is that since it is
COW, the metadata is (almost) never overwritten. The almost applies to
the uberblocks, which use a circular queue.

>     re> the ZFS on-disk format is such that you can recover to a point
>     re> in time where the file system is consistent.
>
> Do you mean that, ``after a power outage ZFS will always recover the
> filesystem to a state that it passed through in the moments leading up
> to the outage,'' while UFS, which logs only metadata, typically
> recovers to some state the filesystem never passed through---but it
> never loses fsync()ed data nor data that wasn't written ``recently''
> before the crash?

The system can lose fsync()ed data if UFS thinks it wrote to persistent
storage but was actually writing to volatile storage. This may be less
common, though. I think the more common symptom is a need to fsck to
rebuild the metadata.

> For casual filesystem use, or for applications that weren't designed
> with cord-pulling in mind, ZFS's guarantee is larger and more
> comforting.
> But for databases, I don't think the distinction matters because they
> call fsync() at deliberate moments and do their own copy-on-write
> logging above the filesystem, so they provide the same consistency
> guarantees whether operating on UFS or ZFS.  It would be fine to feed
> a database the type of hacked non-CoW zvol that's used for swap, if
> fsync could be made to work there, which maybe it can't.

Hacked non-COW zvol? Since COW occurs at the DMU layer, below ZPL or
ZVol, I don't see how to bypass it. AFAIK, the trick to using ZVols for
swap was to just fix some bugs in ZFS and rewrite the pertinent parts of
the installer(s).

The subject of a non-COW volume does come up periodically. I refer to
these as "raw devices" :-) Since many of the features of ZFS depend on
COW, if you get rid of COW then you get rid of the features, and you
might as well use raw devices, no?
 -- richard
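Richard's earlier remark that the uberblocks use a circular queue is the
crux of why ZFS recovery lands on a consistent past state. The sketch
below is a heavy simplification, with invented types and a toy checksum
standing in for the real on-disk structures and SHA-256 verification:
each new uberblock overwrites the oldest slot in the ring, so a write
torn by a power failure can only damage the entry being written, and
import simply picks the newest entry that still verifies.

#include <stdint.h>
#include <stdio.h>

/* Invented stand-ins for the on-disk structures; not ZFS's real types. */
typedef struct uberblock {
    uint64_t ub_txg;        /* transaction group that wrote this entry */
    uint64_t ub_checksum;   /* stand-in for the real 256-bit checksum  */
    /* ... block pointer to the root of the pool's metadata tree ...   */
} uberblock_t;

/* Toy verifier standing in for the real per-uberblock checksum check. */
static int
checksum_ok(const uberblock_t *ub)
{
    return (ub->ub_checksum == (ub->ub_txg ^ 0xdeadbeefULL));
}

/*
 * Pick the active uberblock: the highest-txg entry in the ring that
 * still verifies.  Each new uberblock overwrites the oldest slot, never
 * the most recent good one, so a power failure mid-write damages at
 * most the entry being written and the previous consistent entry wins.
 */
static const uberblock_t *
find_active_uberblock(const uberblock_t ring[], size_t n)
{
    const uberblock_t *best = NULL;
    size_t i;

    for (i = 0; i < n; i++) {
        if (!checksum_ok(&ring[i]))
            continue;               /* torn or damaged entry: skip it */
        if (best == NULL || ring[i].ub_txg > best->ub_txg)
            best = &ring[i];
    }
    return (best);
}

int
main(void)
{
    /* A tiny ring: the entry for txg 102 was torn by the power failure. */
    uberblock_t ring[3] = {
        { 100, 100 ^ 0xdeadbeefULL },
        { 101, 101 ^ 0xdeadbeefULL },
        { 102, 0 },
    };
    const uberblock_t *ub = find_active_uberblock(ring, 3);

    (void) printf("import recovers to txg %llu\n",
        ub == NULL ? 0ULL : (unsigned long long)ub->ub_txg);
    return (0);
}

The real on-disk logic is more involved (multiple labels per device and
a much larger ring), but the selection principle is the same.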
Richard Elling wrote:
> Hacked non-COW zvol? Since COW occurs at the DMU layer, below ZPL or
> ZVol, I don't see how to bypass it. AFAIK, the trick to using ZVols
> for swap was to just fix some bugs in ZFS and rewrite the pertinent
> parts of the installer(s).

Swap just uses a normal ZVOL, which is good because that means I can
swap on an encrypted ZVOL. Dump, however, doesn't -- it preallocates
space and hands off the writes to the ldi_dump routines.

> The subject of a non-COW volume does come up periodically. I refer to
> these as "raw devices" :-) Since many of the features of ZFS depend on
> COW, if you get rid of COW then you get rid of the features, and you
> might as well use raw devices, no?

The currently running ARC case for controlling per-dataset whether
data/metadata is cached (ARC and L2ARC) will hopefully resolve some of
the issues where this comes up, because not all the issues are actually
about COW -- some are about caching when a DB (or similar) is already
doing its own caching.

-- 
Darren J Moffat