Hi All;

One of our customers suffered filesystem corruption after an unattended
shutdown due to a power problem. They want to switch to ZFS.

From what I have read, ZFS will most probably not be corrupted by the same
event. But I am not sure how Oracle will be affected by a sudden power
outage when placed on top of ZFS?

Any comments?

PS: I am aware of UPSs and similar technologies, but the customer is still
asking those "what if ..." questions ...

Mertol

Mertol Ozyoney
Storage Practice - Sales Manager
Sun Microsystems, TR
Istanbul TR
Phone +902123352200
Mobile +905339310752
Fax +902123352222
Email mertol.ozyoney at Sun.COM
From my usage, the first question you should ask your customer is how much
of a performance hit they can spare when switching to ZFS for Oracle. I've
done lots of tweaking (following threads I've read on the mailing list),
but I still can't seem to get enough performance out of any databases on
ZFS. I've tried using zvols, cooked files on top of ZFS filesystems,
everything, but either raw disk devices via the old-style DiskSuite tools
or cooked files on top of the same are far more performant than anything
on ZFS. Your mileage may vary, but so far, that's where I stand.

As for the corrupted filesystem, ZFS is much better, but there are still
no guarantees that your filesystem won't be corrupted during a hard
shutdown. The CoW and checksumming give you a much lower incidence of
corruption, but the customer still needs to be made aware that things like
battery-backed controllers, managed UPSs, redundant power supplies, and
the like are the first thing they need to put into place - not the last.

On Mon, Jun 23, 2008 at 11:56 AM, Mertol Ozyoney <Mertol.Ozyoney at sun.com> wrote:
> From what I have read, ZFS will most probably not be corrupted by the
> same event. But I am not sure how Oracle will be affected by a sudden
> power outage when placed on top of ZFS?

-- 
chris -at- microcozm -dot- net
=== Si Hoc Legere Scis Nimium Eruditionis Habes
Mertol Ozyoney wrote:
> One of our customers suffered filesystem corruption after an unattended
> shutdown due to a power problem. They want to switch to ZFS.
>
> From what I have read, ZFS will most probably not be corrupted by the
> same event. But I am not sure how Oracle will be affected by a sudden
> power outage when placed on top of ZFS?

Most databases have the ability to recover from unscheduled interruptions
without causing corruption. ZFS works in the same way -- you will recover
to a stable point in time. In-flight transactions will not be completed,
as expected. Upon restart, ZFS recovery will happen first, followed by the
database recovery.

> PS: I am aware of UPSs and similar technologies, but the customer is
> still asking those "what if ..." questions ...

UPSs fail, too. When we design highly available services, we expect that
unscheduled interruptions will occur -- that is the only way to design
effective solutions.
 -- richard
>>>>> "mo" == Mertol Ozyoney <Mertol.Ozyoney at Sun.COM> writes:

    mo> One of our customers suffered filesystem corruption after an
    mo> unattended shutdown due to a power problem.

    mo> They want to switch to ZFS.

    mo> From what I have read, ZFS will most probably not be corrupted
    mo> by the same event.

It's not supposed to happen with UFS, either.  Nor XFS, JFS, ext3,
reiserfs, FFS+softdep, plain FFS, mac-HFS+journal.  All filesystems in
popular use for many years, except maybe NTFS, are supposed to obey fsync
and survive kernel crashes and unplanned power outages that happen after
fsync returns, without losing any data written before fsync was called.
The fact that they don't in practice is a warning that ZFS might not,
either, no matter what it promises in theory.

I think many cheap PeeCee RAID setups without batteries suffer from ``the
RAID5 write hole,'' which takes away all the guarantees of
no-power-fail-corruption that the filesystems made, and these broken
no-battery setups seem to be really popular.  If one used ZFS on top of
such a no-battery RAID instead of switching it to JBOD mode, ZFS would be
vulnerable, too.

One interesting part of ZFS's ``in theory'' pitch is that, if you use
redundancy with ZFS, the checksums may somewhat address the problem
described below:

  http://linuxmafia.com/faq/Filesystems/reiserfs.html

-----8<-----
You see, when you yank the power cord out of the wall, not all parts of
the computer stop functioning at the same time.  As the voltage starts
dropping on the +5 and +12 volt rails, certain parts of the system may
last longer than other parts.  For example, the DMA controller, hard
drive controller, and hard drive unit may continue functioning for
several hundred milliseconds, long after the DIMMs, which are very
voltage sensitive, have gone crazy, and are returning total random
garbage.  If this happens while the filesystem is writing critical
sections of the filesystem metadata, well, you get to visit the fun Web
pages at http://You.Lose.Hard/ .  I was actually told about this by an
XFS engineer, who discovered this about the hardware.  Their solution was
to add a power-fail interrupt and bigger capacitors in the power supplies
in SGI hardware; and, in Irix, when the power-fail interrupt triggers,
the first thing the OS does is to run around frantically aborting I/O
transfers to the disk.  Unfortunately, PC-class hardware doesn't have
power-fail interrupts.  Remember, PC-class hardware is cr*p.
-----8<-----

I would suspect a ZFS mirror might have a better shot of coming through
that type of crazy power failure, but I don't know how anything can be
robust to a mysterious force that scribbles randomly all over the disk.
On the downside, there are some things I thought I understood about SVM's
ideas of quorum that I do not yet understand in the ZFS world.

Also, FTR, I use his ext3 rather than XFS myself, but I'm a little
skeptical of Ted Ts'o ranting above because he is defending a shortcut he
took writing his own filesystem.  And I'm not sure the cord-pulling
problem he describes is really universal, and is really a reason for XFS
users losing data that ext3 users don't---it sounds like it could be a
specific-quirk type problem, a blip in history just like ``the 5-volt
rail'' he talks about (+5V? what did they used to run on 5 volts, a disk
motor or a battery charger or something?).  The SGI engineers had the
problem on their specific hardware, and solved it, but it may or may not
exist on present machines.
Maybe current hardware has other equally weird problems when one pulls
the power cord.

-- 
READ CAREFULLY.  By reading this fortune, you agree, on behalf of your
employer, to release me from all obligations and waivers arising from any
and all NON-NEGOTIATED agreements, licenses, terms-of-service, shrinkwrap,
clickwrap, browsewrap, confidentiality, non-disclosure, non-compete and
acceptable use policies ("BOGUS AGREEMENTS") that I have entered into with
your employer, its partners, licensors, agents and assigns, in perpetuity,
without prejudice to my ongoing rights and privileges.  You further
represent that you have the authority to release me from any BOGUS
AGREEMENTS on behalf of your employer.
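For concreteness, the fsync() contract being argued about in this thread
can be reduced to a few lines of C. This is a minimal sketch against the
plain POSIX interfaces, not anything ZFS-specific, and the path name and
record are invented for illustration: the only promise any of these
filesystems makes is about data written before a successful fsync()
return.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
    const char buf[] = "committed record\n";
    /* Hypothetical path, used only for illustration. */
    int fd = open("/tank/app/journal", O_WRONLY | O_CREAT | O_APPEND, 0644);

    if (fd == -1) {
        perror("open");
        return (1);
    }
    if (write(fd, buf, sizeof (buf) - 1) != (ssize_t)(sizeof (buf) - 1)) {
        perror("write");
        return (1);
    }
    /*
     * Until fsync() returns successfully, this record may still sit in
     * the page cache or in a disk's volatile write cache, and a power
     * failure is allowed to eat it.  After a successful return, every
     * filesystem named above is supposed to preserve it across a crash
     * or cord pull.
     */
    if (fsync(fd) == -1) {
        perror("fsync");
        return (1);
    }
    (void) close(fd);
    return (0);
}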
On Jun 23, 2008, at 11:36 AM, Miles Nordin wrote:
> unplanned power outage that happens after fsync returns

Aye, but isn't that the real rub ... when the power fails after the
write but *before* the fsync has occurred ...

-- 
Keith H. Bierman   khbkhb at gmail.com | AIM kbiermank
5430 Nassau Circle East | Cherry Hills Village, CO 80113 | 303-997-2749
<speaking for myself*> Copyright 2008
Miles Nordin wrote:
> It's not supposed to happen with UFS, either.  Nor XFS, JFS, ext3,
> reiserfs, FFS+softdep, plain FFS, mac-HFS+journal.  All filesystems in
> popular use for many years, except maybe NTFS, are supposed to obey
> fsync and survive kernel crashes and unplanned power outages that
> happen after fsync returns, without losing any data written before
> fsync was called.  The fact that they don't in practice is a warning
> that ZFS might not, either, no matter what it promises in theory.

There is a more common failure mode at work here. Most low-cost disks
have their volatile write cache enabled. UFS knows nothing of such caches
and believes the disk has committed data when it acks. In other words,
even with O_DSYNC and friends doing the "right thing" in the OS, the disk
lies about the persistence of the data. ZFS knows disks lie, so it sends
sync commands when necessary to help ensure that the data is flushed to
persistent storage. But even if it is not flushed, the ZFS on-disk format
is such that you can recover to a point in time where the file system is
consistent. This is not the case for UFS, which was designed to trust the
hardware.
 -- richard
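For what it's worth, the "sync commands" described above show up on
Solaris as the DKIOCFLUSHWRITECACHE ioctl from dkio(7I); ZFS issues the
equivalent request from inside the kernel when it needs the drive's
volatile cache emptied. Below is a rough user-level sketch, with a
hypothetical device path, most error handling omitted, and the usual
caveat that opening the raw device needs privilege.

#include <sys/dkio.h>   /* DKIOCFLUSHWRITECACHE */
#include <fcntl.h>
#include <stdio.h>
#include <stropts.h>    /* ioctl() on Solaris */
#include <unistd.h>

int
main(void)
{
    /* Hypothetical raw device path; opening it requires privilege. */
    int fd = open("/dev/rdsk/c0t0d0s0", O_RDONLY);

    if (fd == -1) {
        perror("open");
        return (1);
    }
    /*
     * Ask the drive to flush its volatile write cache to stable media.
     * A NULL argument makes the request synchronous: the ioctl should
     * not return until the cache has been written out.  ZFS does the
     * kernel-level equivalent so that an ack from the disk actually
     * means the data is persistent.
     */
    if (ioctl(fd, DKIOCFLUSHWRITECACHE, NULL) == -1)
        perror("DKIOCFLUSHWRITECACHE");

    (void) close(fd);
    return (0);
}

The same reasoning is why the thread keeps circling back to whether
intermediate layers (iSCSI targets, RAID controllers) actually honor
these flush requests.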
>>>>> "re" == Richard Elling <Richard.Elling at Sun.COM> writes:
>>>>> "kb" == Keith Bierman <khbkhb at gmail.com> writes:

    re> the disk lies about the persistence of the data.  ZFS knows
    re> disks lie, so it sends sync commands when necessary

(1) I don't think ``lie'' is a correct characterization given that the
    sync commands exist, but point taken about the other area of risk.

    I suspect there may be similar problems in ZFS's write path when one
    is using iSCSI targets.  Or it's just common for iSCSI target
    implementations to suck (lie).  Or maybe it's something else I'm
    seeing.

(2) I thought the recommendation that one give ZFS whole disks and let
    it put EFI labels on them came from the Solaris behavior that, only
    in a whole-disk-for-zfs configuration, will the Solaris drivers
    refrain from explicitly disabling the write cache in these
    inexpensive disks.  So the cache shouldn't be a problem for UFS, but
    it might be for non-Solaris operating systems (even for ZFS on
    platforms where ZFS is ported but the SYNCHRONIZE CACHE commands
    don't make it through some mid-layer or CAM or driver).

    kb> Aye, but isn't that the real rub ... when the power fails after
    kb> the write but *before* the fsync has occurred ...

No, there is no rub here, I was only speaking precisely.  A proper DBMS
(anything except MySQL) is also designed to understand that power
failures happen.  It does its writes in a deliberate order such that it
won't return success to the application calling it until it gets the
return from fsync(), and also so that the system will never recover such
that a transaction is half-completed.

    re> the ZFS on-disk format is such that you can recover to a point
    re> in time where the file system is consistent.

Do you mean that, ``after a power outage ZFS will always recover the
filesystem to a state that it passed through in the moments leading up
to the outage,'' while UFS, which logs only metadata, typically recovers
to some state the filesystem never passed through---but it never loses
fsync()ed data nor data that wasn't written ``recently'' before the
crash?

For casual filesystem use, or for applications that weren't designed
with cord-pulling in mind, ZFS's guarantee is larger and more
comforting.  But for databases, I don't think the distinction matters
because they call fsync() at deliberate moments and do their own
copy-on-write logging above the filesystem, so they provide the same
consistency guarantees whether operating on UFS or ZFS.  It would be
fine to feed a database the type of hacked non-CoW zvol that's used for
swap, if fsync could be made to work there, which maybe it can't.
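The "deliberate order" described above is the usual write-ahead-logging
pattern. As a toy illustration only (no real database's code, and the
names are invented): the commit path appends the log record, forces it
with fsync(), and only then reports success, so recovery never finds an
acknowledged transaction that was not durably logged.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Toy write-ahead-log commit.  The point is the ordering: log record
 * first, fsync() second, acknowledgement third.
 */
static int
commit_transaction(int log_fd, const char *record, size_t len)
{
    /* 1. Append the record describing the change. */
    if (write(log_fd, record, len) != (ssize_t)len)
        return (-1);

    /* 2. Force it to stable storage before telling anyone it worked. */
    if (fsync(log_fd) == -1)
        return (-1);

    /*
     * 3. Only now may success be reported.  The table pages themselves
     *    can be written lazily; recovery replays the log to rebuild
     *    anything lost in flight.
     */
    return (0);
}

int
main(void)
{
    /* Hypothetical redo log path, used only for illustration. */
    int log_fd = open("/tank/db/redo.log",
        O_WRONLY | O_CREAT | O_APPEND, 0600);
    const char rec[] = "txn 42: debit A, credit B\n";

    if (log_fd == -1) {
        perror("open");
        return (1);
    }
    if (commit_transaction(log_fd, rec, sizeof (rec) - 1) == 0)
        (void) puts("transaction acknowledged");
    (void) close(log_fd);
    return (0);
}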
Miles Nordin wrote:
>     re> the disk lies about the persistence of the data.  ZFS knows
>     re> disks lie, so it sends sync commands when necessary
>
> (1) I don't think ``lie'' is a correct characterization given that the
>     sync commands exist, but point taken about the other area of risk.

IMNSHO, they lie. Some disks do not disable volatile write caches, even
when you ask them. I've got a scar... right there below the ORA-27062
and next to the FC-disk firmware scars... I think Torrey's is on his
backside... :-)

>     I suspect there may be similar problems in ZFS's write path when
>     one is using iSCSI targets.  Or it's just common for iSCSI target
>     implementations to suck (lie).  Or maybe it's something else I'm
>     seeing.

I hope they aren't making assumptions about volatility...

> (2) I thought the recommendation that one give ZFS whole disks and let
>     it put EFI labels on them came from the Solaris behavior that,
>     only in a whole-disk-for-zfs configuration, will the Solaris
>     drivers refrain from explicitly disabling the write cache in these
>     inexpensive disks.  So the cache shouldn't be a problem for UFS,
>     but it might be for non-Solaris operating systems (even for ZFS on
>     platforms where ZFS is ported but the SYNCHRONIZE CACHE commands
>     don't make it through some mid-layer or CAM or driver).

Close. By default, Solaris will try to disable the write cache,
ostensibly to protect UFS. But if the whole disk is in use by ZFS, then
it will enable the write cache, and ZFS uses the synchronize cache
commands as appropriate. Solaris is a bit conservative here, maybe too
conservative. In some cases you can override it with format -e.

>     kb> Aye, but isn't that the real rub ... when the power fails
>     kb> after the write but *before* the fsync has occurred ...
>
> No, there is no rub here, I was only speaking precisely.  A proper
> DBMS (anything except MySQL) is also designed to understand that power
> failures happen.  It does its writes in a deliberate order such that
> it won't return success to the application calling it until it gets
> the return from fsync(), and also so that the system will never
> recover such that a transaction is half-completed.

ZFS has similar protections. The most interesting is that since it is
COW, the metadata is (almost) never overwritten. The almost applies to
the uberblocks, which use a circular queue.

>     re> the ZFS on-disk format is such that you can recover to a point
>     re> in time where the file system is consistent.
>
> Do you mean that, ``after a power outage ZFS will always recover the
> filesystem to a state that it passed through in the moments leading up
> to the outage,'' while UFS, which logs only metadata, typically
> recovers to some state the filesystem never passed through---but it
> never loses fsync()ed data nor data that wasn't written ``recently''
> before the crash?

The system can lose fsync()ed data if UFS thinks it wrote to persistent
storage but was actually writing to volatile storage. This may be less
common, though. I think the more common symptom is a need to fsck to
rebuild the metadata.

> For casual filesystem use, or for applications that weren't designed
> with cord-pulling in mind, ZFS's guarantee is larger and more
> comforting.
> But for databases, I don't think the distinction matters because they
> call fsync() at deliberate moments and do their own copy-on-write
> logging above the filesystem, so they provide the same consistency
> guarantees whether operating on UFS or ZFS.  It would be fine to feed
> a database the type of hacked non-CoW zvol that's used for swap, if
> fsync could be made to work there, which maybe it can't.

Hacked non-COW zvol? Since COW occurs at the DMU layer, below ZPL or
ZVol, I don't see how to bypass it. AFAIK, the trick to using ZVols for
swap was to just fix some bugs in ZFS and rewrite the pertinent parts of
the installer(s).

The subject of a non-COW volume does come up periodically. I refer to
these as "raw devices" :-) Since many of the features of ZFS depend on
COW, if you get rid of COW then you get rid of the features, and you
might as well use raw devices, no?
 -- richard
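Richard's earlier remark that the uberblocks use a circular queue is the
crux of why ZFS recovery lands on a consistent past state. The sketch
below is a heavy simplification, with invented types and a toy checksum
standing in for the real on-disk structures and SHA-256 verification:
each new uberblock overwrites the oldest slot in the ring, so a write
torn by a power failure can only damage the entry being written, and
import simply picks the newest entry that still verifies.

#include <stdint.h>
#include <stdio.h>

/* Invented stand-ins for the on-disk structures; not ZFS's real types. */
typedef struct uberblock {
    uint64_t ub_txg;        /* transaction group that wrote this entry */
    uint64_t ub_checksum;   /* stand-in for the real 256-bit checksum  */
    /* ... block pointer to the root of the pool's metadata tree ...   */
} uberblock_t;

/* Toy verifier standing in for the real per-uberblock checksum check. */
static int
checksum_ok(const uberblock_t *ub)
{
    return (ub->ub_checksum == (ub->ub_txg ^ 0xdeadbeefULL));
}

/*
 * Pick the active uberblock: the highest-txg entry in the ring that
 * still verifies.  Each new uberblock overwrites the oldest slot, never
 * the most recent good one, so a power failure mid-write damages at
 * most the entry being written and the previous consistent entry wins.
 */
static const uberblock_t *
find_active_uberblock(const uberblock_t ring[], size_t n)
{
    const uberblock_t *best = NULL;
    size_t i;

    for (i = 0; i < n; i++) {
        if (!checksum_ok(&ring[i]))
            continue;               /* torn or damaged entry: skip it */
        if (best == NULL || ring[i].ub_txg > best->ub_txg)
            best = &ring[i];
    }
    return (best);
}

int
main(void)
{
    /* A tiny ring: the entry for txg 102 was torn by the power failure. */
    uberblock_t ring[3] = {
        { 100, 100 ^ 0xdeadbeefULL },
        { 101, 101 ^ 0xdeadbeefULL },
        { 102, 0 },
    };
    const uberblock_t *ub = find_active_uberblock(ring, 3);

    (void) printf("import recovers to txg %llu\n",
        ub == NULL ? 0ULL : (unsigned long long)ub->ub_txg);
    return (0);
}

The real on-disk logic is more involved (multiple labels per device and
a much larger ring), but the selection principle is the same.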
Richard Elling wrote:
> Hacked non-COW zvol? Since COW occurs at the DMU layer, below ZPL or
> ZVol, I don't see how to bypass it. AFAIK, the trick to using ZVols
> for swap was to just fix some bugs in ZFS and rewrite the pertinent
> parts of the installer(s).

Swap just uses a normal ZVOL, which is good because that means I can
swap on an encrypted ZVOL. Dump, however, doesn't -- it preallocates
space and hands off the writes to the ldi_dump routines.

> The subject of a non-COW volume does come up periodically. I refer to
> these as "raw devices" :-) Since many of the features of ZFS depend on
> COW, if you get rid of COW then you get rid of the features, and you
> might as well use raw devices, no?

The currently running ARC case for controlling per-dataset whether
data/metadata is cached (ARC and L2ARC) will hopefully resolve some of
the issues where this comes up, because not all the issues are actually
about COW -- some are about caching when a DB (or similar) is already
doing its own caching.

-- 
Darren J Moffat