Hi all,

I'm trying to evaluate the risks of running an NFS share of a ZFS dataset with the sync=disabled property. The clients are VMware hosts in our environment and the server is a SunFire X4540 "Thor" system. The general recommendation says not to do this, but after testing performance with the default setting and with sync=disabled, the difference is night and day, so it's really tempting to set sync=disabled! Thanks for any suggestions.

Best regards,
On Nov 8, 2011, at 6:38 AM, Evaldas Auryla wrote:

> Hi all,
>
> I'm trying to evaluate the risks of running an NFS share of a ZFS dataset with the sync=disabled property. The clients are VMware hosts in our environment and the server is a SunFire X4540 "Thor" system. The general recommendation says not to do this, but after testing performance with the default setting and with sync=disabled, the difference is night and day, so it's really tempting to set sync=disabled! Thanks for any suggestions.

The risks are: any changes your software clients expect to be written to disk -- after having gotten a confirmation that they were written -- might not actually be on disk if the server crashes or loses power for some reason.

You should consider a high-performance, low-latency SSD (it doesn't have to be very big) as an SLOG; it will do a lot for your performance without giving up the commit guarantees that you lose with sync=disabled.

Of course, if the data isn't precious to you, then running with sync=disabled is probably OK. But if you love your data, don't do it.

- Garrett
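For reference, the property being discussed here is set per dataset. A minimal sketch, assuming a hypothetical pool/dataset called tank/vmware:

    # Check the current setting; the default is sync=standard.
    zfs get sync tank/vmware

    # Honour synchronous write requests (the safe default):
    zfs set sync=standard tank/vmware

    # Ignore synchronous write semantics entirely: fast, but acknowledged
    # writes from the last few seconds can vanish after a crash or power loss.
    zfs set sync=disabled tank/vmware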
On Tue, November 8, 2011 09:38, Evaldas Auryla wrote:

> I'm trying to evaluate the risks of running an NFS share of a ZFS dataset with the sync=disabled property. The clients are VMware hosts in our environment and the server is a SunFire X4540 "Thor" system. The general recommendation says not to do this, but after testing performance with the default setting and with sync=disabled, the difference is night and day, so it's really tempting to set sync=disabled! Thanks for any suggestions.

You may want to look into getting some good SSDs and attaching them as (mirrored?) "slog" devices instead:

http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#Separate_Log_Devices

You probably want zpool version 22 or later to do this, as from that point onward it becomes possible to remove the slog device(s) if desired. Prior to v22, once you add them you're stuck with them. Some interesting benchmarks on offloading the ZIL can be found at:

https://blogs.oracle.com/brendan/entry/slog_screenshots

Your SSDs don't have to be that large either: by default the ZIL can use at most 50% of RAM, so if your server has (say) 48 GB of RAM, then an SSD larger than 24 GB would really be a bit of a waste (though you could use the 'extra' space as L2ARC, perhaps). Given that, it's probably better value to get a smaller, faster SLC SSD rather than a 'cheaper' MLC one that's larger. Past discussions on zfs-discuss have favourably mentioned devices based on the SandForce SF-1500 and SF-2500/2600 chipsets (they often come with supercaps and such). Intel's 311 could be another option.
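A rough sketch of the slog workflow described above, assuming a pool named tank and placeholder SSD device names; check your own zpool status output for the actual vdev names:

    # Confirm the pool version supports log-device removal before committing:
    zpool get version tank
    zpool upgrade -v

    # Attach a mirrored slog pair (device names are placeholders):
    zpool add tank log mirror c4t0d0 c4t1d0

    # On a recent enough pool version, the slog can later be removed by
    # referencing the mirror vdev name shown in 'zpool status':
    zpool remove tank mirror-1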
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Evaldas Auryla
>
> I'm trying to evaluate the risks of running an NFS share of a ZFS dataset with the sync=disabled property. The clients are VMware hosts in our environment and the server is a SunFire X4540 "Thor" system. The general recommendation says not to do this, but after testing performance with the default setting and with sync=disabled, the difference is night and day, so it's really tempting to set sync=disabled! Thanks for any suggestions.

I know a lot of people will say "don't do it," but that's only a partial truth. The real truth is:

At all times, if there's a server crash, ZFS will come back up at the next boot or mount, and the filesystem will be in a consistent state, one that was indeed a valid state which the filesystem actually passed through at some moment in time. So as long as all the applications you're running can accept the possibility of "going back in time" by as much as 30 seconds following an ungraceful ZFS crash, then it's safe to disable the ZIL (set sync=disabled).

In your case, you have VMs inside the ZFS filesystem. In the event ZFS crashes ungracefully, you don't want the VM disks to "go back in time" while the VMs themselves are unaware anything like that happened. If you run with sync=disabled, you want to ensure your ZFS/NFS server doesn't come back up automatically. If ZFS crashes, you want to force the guest VMs to crash too: force power down the VMs, then bring up NFS, remount NFS, and reboot the guest VMs. All the guest VMs will have gone back in time, by as much as 30 seconds.

This is generally acceptable for things like web servers, file servers, and Windows VMs in a virtualized desktop environment, etc. It's also acceptable for things running databases, as long as all the DB clients can go back in time as well (reboot them, whatever). It is NOT acceptable if you're processing credit card transactions, or if you're running a mail server and you're unwilling to silently drop any messages, or ... stuff like that.

Long story short, if you're willing to allow your server and all of the dependent clients to go back in time by as much as 30 seconds, and you're willing/able to reboot everything that depends on it, then you can accept sync=disabled.

That's a lot of thinking. And a lot of faith or uncertainty. And in your case, it's kind of inconvenient, needing to manually start your NFS share every time you reboot your ZFS server.

The safer/easier thing to do is add dedicated log devices to the server instead. It's not as fast as running with the ZIL disabled, but it's much faster than running without a dedicated log. When choosing a log device, focus on FAST. You really don't care about size; even 4 GB is usually all you need.
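One way to implement the "don't let NFS come back up automatically" step on a Solaris-family server is through SMF; a sketch, under the assumption that the stock nfs/server service instance is what exports the share:

    # Keep the NFS server from starting by itself at boot; the disable
    # persists across reboots:
    svcadm disable svc:/network/nfs/server:default

    # After a crash, once the dependent clients/VMs have been powered down,
    # start NFS by hand. The -t (temporary) flag means the service stays
    # disabled again at the next boot:
    svcadm enable -t svc:/network/nfs/server:default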
On 11/ 9/11 01:42 AM, Edward Ned Harvey wrote:

> I know a lot of people will say "don't do it," but that's only a partial truth. The real truth is:
>
> At all times, if there's a server crash, ZFS will come back up at the next boot or mount, and the filesystem will be in a consistent state, one that was indeed a valid state which the filesystem actually passed through at some moment in time. So as long as all the applications you're running can accept the possibility of "going back in time" by as much as 30 seconds following an ungraceful ZFS crash, then it's safe to disable the ZIL (set sync=disabled).

OK, so the risk is about an unexpected ZFS server reboot (crash, power, hardware problem...).

> Long story short, if you're willing to allow your server and all of the dependent clients to go back in time by as much as 30 seconds, and you're willing/able to reboot everything that depends on it, then you can accept sync=disabled.
>
> That's a lot of thinking. And a lot of faith or uncertainty. And in your case, it's kind of inconvenient, needing to manually start your NFS share every time you reboot your ZFS server.

Let's say that, assuming ZFS is stable (we never had problems on any OpenSolaris/OpenIndiana systems, except ones with CIFS services enabled...), and that the server is running in a well power-protected rack on Sun's legendarily reliable hardware (e.g. the X4540), doing the sync=disabled thing could be an acceptable option for test lab environments.

> The safer/easier thing to do is add dedicated log devices to the server instead. It's not as fast as running with the ZIL disabled, but it's much faster than running without a dedicated log.

...and for production use a dedicated ZIL device is required.

> When choosing a log device, focus on FAST. You really don't care about size; even 4 GB is usually all you need.

I was thinking about the STEC ZeusRAM, but unfortunately it's a SAS-only device and it won't fit in the X4540 (SATA ports only), so another option could be the STEC MACH16iops (50 GB SLC SATA SSD). Thanks to all for sharing your ideas and suggestions.
> From: Evaldas Auryla [mailto:evaldas.auryla at edqm.eu]
> Sent: Wednesday, November 09, 2011 8:55 AM
>
> I was thinking about the STEC ZeusRAM, but unfortunately it's a SAS-only device and it won't fit in the X4540 (SATA ports only), so another option could be the STEC MACH16iops (50 GB SLC SATA SSD).

Perhaps you should also consider the DDRdrive.

I work a lot on flash... and I've got to say, DRAM is a lot faster, mostly because of design/implementation limitations in the present flash controller technologies: restrictions about which blocks can be erased and when, how many can be erased in parallel, under which circumstances, etc. etc. blah blah. That's mostly where any delay comes from. Straight out of the box, SSDs perform very well; but in real life, only moderately well.
On 11/ 9/11 03:11 PM, Edward Ned Harvey wrote:

>> I was thinking about the STEC ZeusRAM, but unfortunately it's a SAS-only device and it won't fit in the X4540 (SATA ports only), so another option could be the STEC MACH16iops (50 GB SLC SATA SSD).
>
> Perhaps you should also consider the DDRdrive.
>
> I work a lot on flash... and I've got to say, DRAM is a lot faster, mostly because of design/implementation limitations in the present flash controller technologies. That's mostly where any delay comes from. Straight out of the box, SSDs perform very well; but in real life, only moderately well.

Yes, I was thinking about that, but the X4540 "Thor"s have only low-profile PCIe slots, and the DDRdrive X1 requires a full-height slot, unfortunately.
On 08 November, 2011 - Edward Ned Harvey sent me these 2,9K bytes:

>> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Evaldas Auryla
>>
>> I'm trying to evaluate the risks of running an NFS share of a ZFS dataset with the sync=disabled property. The clients are VMware hosts in our environment and the server is a SunFire X4540 "Thor" system. The general recommendation says not to do this, but after testing performance with the default setting and with sync=disabled, the difference is night and day, so it's really tempting to set sync=disabled! Thanks for any suggestions.
>
> I know a lot of people will say "don't do it," but that's only a partial truth. The real truth is:
>
> At all times, if there's a server crash, ZFS will come back up at the next boot or mount, and the filesystem will be in a consistent state, one that was indeed a valid state which the filesystem actually passed through at some moment in time. So as long as all the applications you're running can accept the possibility of "going back in time" by as much as 30 seconds following an ungraceful ZFS crash, then it's safe to disable the ZIL (set sync=disabled).

Client writes block 0; server says OK and writes it to disk.
Client writes block 1; server says OK and crashes before it's on disk.
Client writes block 2... waaiits... waiits... server comes up, says OK and writes it to disk.

Now, from the view of the client, blocks 0-2 are all OK'd by the server, with no visible errors. On the server, block 1 never arrived on disk and you've got silent corruption. Too bad NFS is resilient against servers coming and going..

/Tomas
--
Tomas Forsman, stric at acc.umu.se, http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se
On Wed, November 9, 2011 10:35, Tomas Forsman wrote:

> Too bad NFS is resilient against servers coming and going..

NFSv4 is stateful, so server reboots are more noticeable. (This has pluses and minuses.)
Fred Liu
2011-Nov-09 23:54 UTC
[zfs-discuss] Oracle releases Solaris 11 for Sparc and x86 servers
[This email is either empty or too large to be displayed at this time]
Fajar A. Nugraha
2011-Nov-10 03:02 UTC
[zfs-discuss] Oracle releases Solaris 11 for Sparc and x86 servers
On Thu, Nov 10, 2011 at 6:54 AM, Fred Liu <Fred_Liu at issi.com> wrote:

> ...

... so when will zfs-related improvements make it to the Solaris derivatives :D ?

--
FAN
Fred Liu
2011-Nov-10 03:34 UTC
[zfs-discuss] Oracle releases Solaris 11 for Sparc and x86 servers
> ... so when will zfs-related improvements make it to the Solaris derivatives :D ?

I am also very curious about Oracle's policy on the source code. ;-)

Fred
"Hung-Sheng Tsao (Lao Tsao 老曹) Ph.D."
2011-Nov-10 13:26 UTC
[zfs-discuss] Oracle releases Solaris 11 for Sparc and x86 servers
AFAIK, there is no change in the open source policy for Oracle Solaris.

On 11/9/2011 10:34 PM, Fred Liu wrote:

>> ... so when will zfs-related improvements make it to the Solaris derivatives :D ?
>
> I am also very curious about Oracle's policy on the source code. ;-)
>
> Fred

--
Hung-Sheng Tsao Ph.D.
Founder & Principal, HopBit GridComputing LLC
cell: 9734950840
http://laotsao.wordpress.com/
http://laotsao.blogspot.com/
On Wed, 9 Nov 2011, Tomas Forsman wrote:

>> At all times, if there's a server crash, ZFS will come back up at the next boot or mount, and the filesystem will be in a consistent state, one that was indeed a valid state which the filesystem actually passed through at some moment in time. So as long as all the applications you're running can accept the possibility of "going back in time" by as much as 30 seconds following an ungraceful ZFS crash, then it's safe to disable the ZIL (set sync=disabled).
>
> Client writes block 0; server says OK and writes it to disk.
> Client writes block 1; server says OK and crashes before it's on disk.
> Client writes block 2... waaiits... waiits... server comes up, says OK and writes it to disk.
>
> Now, from the view of the client, blocks 0-2 are all OK'd by the server, with no visible errors. On the server, block 1 never arrived on disk and you've got silent corruption.

Silent corruption (of zfs itself) does not occur, for the simple reason that all of the block writes are flushed and acknowledged by the disks before a new transaction starts the next transaction group. The previous transaction group is not closed until the next one has been successfully started by writing the previous TXG record to disk. Given properly working hardware, the worst-case scenario is losing the whole transaction group, and no "corruption" occurs.

Loss of data as seen by the client can definitely occur.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
On 10 November, 2011 - Bob Friesenhahn sent me these 1,6K bytes:

> On Wed, 9 Nov 2011, Tomas Forsman wrote:
>
>>> At all times, if there's a server crash, ZFS will come back up at the next boot or mount, and the filesystem will be in a consistent state [...] then it's safe to disable the ZIL (set sync=disabled).
>>
>> Client writes block 0; server says OK and writes it to disk.
>> Client writes block 1; server says OK and crashes before it's on disk.
>> Client writes block 2... waaiits... waiits... server comes up, says OK and writes it to disk.
>>
>> Now, from the view of the client, blocks 0-2 are all OK'd by the server, with no visible errors. On the server, block 1 never arrived on disk and you've got silent corruption.
>
> Silent corruption (of zfs itself) does not occur, for the simple reason that all of the block writes are flushed and acknowledged by the disks before a new transaction starts the next transaction group. The previous transaction group is not closed until the next one has been successfully started by writing the previous TXG record to disk. Given properly working hardware, the worst-case scenario is losing the whole transaction group, and no "corruption" occurs.
>
> Loss of data as seen by the client can definitely occur.

When a client writes something and something else ends up on disk, I call that corruption. It doesn't matter whose fault it is or what the technical details are: the wrong data was stored despite the client being careful when writing.

/Tomas
--
Tomas Forsman, stric at acc.umu.se, http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se
On Thu, Nov 10, 2011 at 14:12, Tomas Forsman <stric at acc.umu.se> wrote:

> On 10 November, 2011 - Bob Friesenhahn sent me these 1,6K bytes:
>> On Wed, 9 Nov 2011, Tomas Forsman wrote:
>>>> At all times, if there's a server crash, ZFS will come back up at the next boot or mount, and the filesystem will be in a consistent state [...] then it's safe to disable the ZIL (set sync=disabled).
>>>
>>> Client writes block 0; server says OK and writes it to disk.
>>> Client writes block 1; server says OK and crashes before it's on disk.
>>> Client writes block 2... waaiits... waiits... server comes up, says OK and writes it to disk.
>
> When a client writes something and something else ends up on disk, I call that corruption. It doesn't matter whose fault it is or what the technical details are: the wrong data was stored despite the client being careful when writing.

If the hardware is behaving itself (actually doing a cache flush when ZFS asks it to, for example), the server won't say OK for block 1 until it's actually on disk. This behavior is what makes NFS over ZFS slow without a slog: NFS does everything O_SYNC by default, so ZFS runs around syncing all the disks all the time. Therefore, you won't lose data in this circumstance.

Will
On 10 November, 2011 - Will Murnane sent me these 1,5K bytes:

> On Thu, Nov 10, 2011 at 14:12, Tomas Forsman <stric at acc.umu.se> wrote:
>> When a client writes something and something else ends up on disk, I call that corruption. It doesn't matter whose fault it is or what the technical details are: the wrong data was stored despite the client being careful when writing.
>
> If the hardware is behaving itself (actually doing a cache flush when ZFS asks it to, for example), the server won't say OK for block 1 until it's actually on disk. This behavior is what makes NFS over ZFS slow without a slog: NFS does everything O_SYNC by default, so ZFS runs around syncing all the disks all the time. Therefore, you won't lose data in this circumstance.

Which is exactly what this thread is about: the consequences of -disabling- sync.

/Tomas
--
Tomas Forsman, stric at acc.umu.se, http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Bob Friesenhahn
>
> Silent corruption (of zfs itself) does not occur, for the simple reason that all of the block writes are flushed and acknowledged by the disks before a new transaction starts the next transaction group. The previous transaction group is not closed until the next one has been successfully started by writing the previous TXG record to disk. Given properly working hardware, the worst-case scenario is losing the whole transaction group, and no "corruption" occurs.
>
> Loss of data as seen by the client can definitely occur.

Tomas is right on this point. If you have a ZFS NFS server running with sync disabled, and the ZFS server reboots ungracefully and starts serving NFS again without the NFS clients dismounting/remounting, then ZFS hasn't been "corrupted" but NFS has, exactly the way Tomas said. The server has lost its mind and gone back into the past, but the clients remember their state (which is/was in the future); after the server comes up again in the past, the clients will simply assume the server hasn't lost its mind and continue as if nothing went wrong, which is precisely the wrong thing to do.

This is why, somewhere higher up in this thread, I said that if you have an NFS server running with sync disabled, you need to ensure NFS services don't automatically start at boot time. If your server crashes ungracefully, you need to crash your clients too (NFS dismount/remount).

Personally, this is how I operate the systems I support. Because running with sync disabled is so DARN fast, and a server crash is so DARN rare, I feel the extra productivity for 500 days in a row outweighs the productivity loss that occurs on that one fateful day, when I have to reboot or dismount/remount all kinds of crap around the office.
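As an illustration of the dismount/remount step above, here is a rough sketch for an ordinary Solaris NFS client; the server name (thor) and paths are made-up placeholders, and VMware hosts would instead power off the affected guests and reattach the datastore through their own tooling:

    # Force the stale mount away, then remount so the client and server
    # agree on state again (placeholder host and paths):
    umount -f /mnt/vmstore
    mount -F nfs thor:/export/vmstore /mnt/vmstore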
> This behavior is what makes NFS over ZFS slow without a slog: NFS does everything O_SYNC by default,

No, it doesn't. However, VMware by default issues all of its writes as SYNC.
On Thu, 10 Nov 2011, Tomas Forsman wrote:

>> Loss of data as seen by the client can definitely occur.
>
> When a client writes something and something else ends up on disk, I call that corruption. It doesn't matter whose fault it is or what the technical details are: the wrong data was stored despite the client being careful when writing.

Unlike many filesystems, zfs does not prioritize sync data over async data when it comes to finally writing the data to the main store. Sync data is written to an intent log, which is replayed (as required) when the server reboots. Disabling sync disables this intent log, so data should be consistently set back in time if sync is disabled and the server does an unclean reboot. From this standpoint, the filesystem does not become "corrupted".

Regardless, data formats like databases could become internally corrupted, because the data written in a zfs transaction group may not be representative of a coherent database transaction.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Bob Friesenhahn
>
> data formats like databases could become internally corrupted, because the data written in a zfs transaction group may not be representative of a coherent database transaction.

Although this is true, it's only true to the extent that "corruption" is a term applicable to power loss. If some database application is performing operations all over the place, unaware of what ZFS or any other filesystem sync policy is in force, then ZFS crashing ungracefully with sync disabled and rewinding to some previous state would be just like ext4 having the power yanked out suddenly. If your application is able to survive power loss, then it's able to survive an ungraceful crash with zfs sync disabled.
Generally, there should not be "corruption", only a roll-back to a previous state. *HOWEVER*, it's possible that an application which has state outside of the filesystem (such as effects on network peers, or even state written to *other* filesystems) will encounter a consistency problem, because the application will not be expecting this potentially "partial" rollback of state. This state *could* be state tracked in remote systems, or in VMs, for example.

Generally, I discourage disabling sync unless you know *exactly* what you are doing. On my build filesystems I do it, because I can regenerate all the data, and a loss of up to 30 seconds of data is no problem for me. But I don't do this on home directories, or on filesystems used for "arbitrary" application storage. And I would *never* do this for a filesystem that is backing a database.

As they say, better safe than sorry.

- Garrett

On Nov 10, 2011, at 11:12 AM, Tomas Forsman wrote:

> On 10 November, 2011 - Bob Friesenhahn sent me these 1,6K bytes:
>
>> Silent corruption (of zfs itself) does not occur, for the simple reason that all of the block writes are flushed and acknowledged by the disks before a new transaction starts the next transaction group. [...] Given properly working hardware, the worst-case scenario is losing the whole transaction group, and no "corruption" occurs.
>>
>> Loss of data as seen by the client can definitely occur.
>
> When a client writes something and something else ends up on disk, I call that corruption. It doesn't matter whose fault it is or what the technical details are: the wrong data was stored despite the client being careful when writing.
>
> /Tomas
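To make the per-filesystem approach in the preceding message concrete, a minimal sketch; the pool and dataset names (tank, tank/build, tank/home) are hypothetical:

    # Relax sync only on expendable scratch/build data:
    zfs set sync=disabled tank/build

    # Make sure nothing precious has a local override; inherit restores
    # the default (sync=standard) from the parent, then review the tree:
    zfs inherit sync tank/home
    zfs get -r sync tank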