Hello,

I wonder if the subject of this email is not self-explanatory? Okay, let's say that it is not. :)

Imagine that I set up a box:
- with Solaris
- with many HDs (directly attached)
- use ZFS as the FS
- export the data with NFS
- on a UPS.

Then, after reading
http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#ZFS_and_Complex_Storage_Considerations
I wonder if there is a way to tell the OS to ignore the fsync flush commands, since the writes are likely to survive a power outage.

Ced.

--
Cedric BRINER
Geneva - Switzerland
On 4/26/07, cedric briner <work at infomaniak.ch> wrote:
> okay, let's say that it is not. :)
> Imagine that I set up a box:
> - with Solaris
> - with many HDs (directly attached)
> - use ZFS as the FS
> - export the data with NFS
> - on a UPS.
>
> Then, after reading
> http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#ZFS_and_Complex_Storage_Considerations
> I wonder if there is a way to tell the OS to ignore the fsync flush
> commands, since they are likely to survive a power outage.

Cedric,

You do not want to ignore syncs from ZFS if your hard disks are directly attached to the server. As the document mentions, that is really for Complex Storage with NVRAM, where flushing is not necessary.

--
Just me,
Wire ...
Blog: <prstat.blogspot.com>
You might set zil_disable to 1 (_then_ mount the fs to be shared). But you're still exposed to OS crashes; those would still corrupt your nfs clients.

-r

cedric briner writes:
> Hello,
>
> I wonder if the subject of this email is not self-explanatory?
> Okay, let's say that it is not. :)
> Imagine that I set up a box:
> - with Solaris
> - with many HDs (directly attached)
> - use ZFS as the FS
> - export the data with NFS
> - on a UPS.
>
> Then, after reading
> http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#ZFS_and_Complex_Storage_Considerations
> I wonder if there is a way to tell the OS to ignore the fsync flush
> commands, since they are likely to survive a power outage.
>
> Ced.
>
> --
> Cedric BRINER
> Geneva - Switzerland
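For readers who want the mechanics behind Roch's suggestion (keeping in mind the warnings that follow in this thread): on Solaris builds of this era zil_disable is a kernel tunable, so it is normally set either in /etc/system or poked live with mdb. A minimal sketch, assuming the tunable exists on your build:

  To set it persistently (takes effect at the next boot), add to /etc/system:

    set zfs:zil_disable = 1

  To set it on a running kernel (then unmount and re-mount the filesystems
  to be shared), and to check the current value:

    # echo zil_disable/W0t1 | mdb -kw
    # echo zil_disable/D | mdb -k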
On 4/26/07, Roch - PAE <Roch.Bourbonnais at sun.com> wrote:
>
> You might set zil_disable to 1 (_then_ mount the fs to be
> shared). But you're still exposed to OS crashes; those would
> still corrupt your nfs clients.
>

For the love of God do NOT do stuff like that.

Just create ZFS on a pile of disks the way that we should, with the write cache disabled on all the disks and with redundancy in the ZPool config .. nothing special :

# zpool create -m legacy -f zpool0 mirror c2t8d0 c3t8d0 mirror c2t9d0 c3t9d0 mirror c2t10d0 c3t10d0 mirror c2t11d0 c3t11d0 mirror c2t12d0 c3t12d0 mirror c2t13d0 c3t13d0
#
# zpool list
NAME                    SIZE    USED   AVAIL    CAP  HEALTH     ALTROOT
zpool0                 1.63T     66K   1.63T     0%  ONLINE     -
#
# zpool status zpool0
  pool: zpool0
 state: ONLINE
 scrub: none requested
config:

        NAME         STATE     READ WRITE CKSUM
        zpool0       ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c2t8d0   ONLINE       0     0     0
            c3t8d0   ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c2t9d0   ONLINE       0     0     0
            c3t9d0   ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c2t10d0  ONLINE       0     0     0
            c3t10d0  ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c2t11d0  ONLINE       0     0     0
            c3t11d0  ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c2t12d0  ONLINE       0     0     0
            c3t12d0  ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c2t13d0  ONLINE       0     0     0
            c3t13d0  ONLINE       0     0     0

Then create some filesystem in there :

# zfs create zpool0/test

Give it a mount point :

# zfs set mountpoint=/test zpool0/test
# zfs list
NAME          USED  AVAIL  REFER  MOUNTPOINT
zpool0         98K  1.60T  1.50K  legacy
zpool0/test  24.5K  1.60T  24.5K  /test

Maybe set a quota :

# zfs set quota=400G zpool0/test
# zfs list
NAME          USED  AVAIL  REFER  MOUNTPOINT
zpool0         98K  1.60T  1.50K  legacy
zpool0/test  24.5K   400G  24.5K  /test
#
# df -F ufs -k
Filesystem            1024-blocks        Used   Available Capacity  Mounted on
/dev/md/dsk/d0            5819164     3862268     1665938    70%    /
/dev/md/dsk/d3            4068216      869404     2995404    23%    /var
/dev/md/dsk/d5           20654976       20504    19601728     1%    /data
# df -F zfs -k
Filesystem            1024-blocks        Used   Available Capacity  Mounted on
zpool0/test             419430400          24   419430375     1%    /test
#
# zfs list
NAME          USED  AVAIL  REFER  MOUNTPOINT
zpool0         98K  1.60T  1.50K  legacy
zpool0/test  24.5K   400G  24.5K  /test
#

For NFS just share it. Nothing special at all :

# zfs set sharenfs=nosub\,nosuid\,rw\=nfsclient\,root\=nfsclient zpool0/test
# zfs get -o property,value,source all zpool0/test
PROPERTY       VALUE                                      SOURCE
type           filesystem                                 -
creation       Sun Apr 15 12:12 2007                      -
used           24.5K                                      -
available      400G                                       -
referenced     24.5K                                      -
compressratio  1.00x                                      -
mounted        yes                                        -
quota          400G                                       local
reservation    none                                       default
recordsize     128K                                       default
mountpoint     /test                                      local
sharenfs       nosub,nosuid,rw=nfsclient,root=nfsclient   local
checksum       on                                         default
compression    off                                        default
atime          on                                         default
devices        on                                         default
exec           on                                         default
setuid         on                                         default
readonly       off                                        default
zoned          off                                        default
snapdir        hidden                                     default
aclmode        groupmask                                  default
aclinherit     secure                                     default
#

That's it .. no big deal. It just works.

Dennis
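As a small addition to Dennis's walkthrough: once sharenfs is set, the export can be checked with the usual Solaris commands and mounted from a client. A sketch, where `nfsserver' stands in for the server's hostname and `nfsclient' matches the host named in the sharenfs options above:

  On the server, confirm the filesystem is being shared:

    # share
    # dfshares localhost

  On the client, mount it like any other NFS export:

    # mkdir -p /mnt/test
    # mount -F nfs nfsserver:/test /mnt/test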
>> okay, let's say that it is not. :)
>> Imagine that I set up a box:
>> - with Solaris
>> - with many HDs (directly attached)
>> - use ZFS as the FS
>> - export the data with NFS
>> - on a UPS.
>>
>> Then, after reading
>> http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#ZFS_and_Complex_Storage_Considerations
>>
>> I wonder if there is a way to tell the OS to ignore the fsync flush
>> commands, since they are likely to survive a power outage.
>
> Cedric,
>
> You do not want to ignore syncs from ZFS if your hard disks are directly
> attached to the server. As the document mentions, that is really for
> Complex Storage with NVRAM, where flushing is not necessary.

This post follows `XServe Raid & Complex Storage Considerations':
http://www.opensolaris.org/jive/thread.jspa?threadID=29276&tstart=0

where we made the assumption (*1) that if the XServe Raid is connected to a UPS, we can treat the RAM in the XServe Raid as if it were NVRAM.

(*1) This assumption is even pointed out by Roch:
http://blogs.sun.com/roch/#zfs_to_ufs_performance_comparison
>> Intelligent Storage
through `the Shenanigans with ZFS flushing and intelligent arrays...':
http://blogs.digitar.com/jjww/?itemid=44
>> Tell your array to ignore ZFS' flush commands

So in this way, when we export it with NFS, we get a boost in bandwidth.

Okay, then is there any difference that I do not catch between:
- the Shenanigans with ZFS flushing and intelligent arrays...
- and my situation?

I mean, I want to have a cheap and reliable NFS service. Why should I buy expensive `Complex Storage with NVRAM' and not just buy a machine with 8 IDE HDs?

Ced.

--
Cedric BRINER
Geneva - Switzerland
> You might set zil_disable to 1 (_then_ mount the fs to be
> shared). But you're still exposed to OS crashes; those would
> still corrupt your nfs clients.
>
> -r

Hello Roch,

I have a few questions.

1)
From `Shenanigans with ZFS flushing and intelligent arrays...'
http://blogs.digitar.com/jjww/?itemid=44
I read:
  Disable the ZIL. The ZIL is the way ZFS maintains _consistency_ until
  it can get the blocks written to their final place on the disk. That's
  why the ZIL flushes the cache. If you don't have the ZIL and a power
  outage occurs, your blocks may go poof in your server's RAM... 'cause
  they never made it to the disk, Kemosabe.

From Eric Kustarz's weblog
http://blogs.sun.com/erickustarz/entry/zil_disable
I read:
  Note: disabling the ZIL does _NOT_ compromise filesystem integrity.
  Disabling the ZIL does NOT cause corruption in ZFS.

Then I don't understand. One says that:
  - we can lose _consistency_
and the other one says that:
  - it does not compromise filesystem integrity
So... which one is right?

2)
From Eric Kustarz's weblog
http://blogs.sun.com/erickustarz/entry/zil_disable
I read:
  Disabling the ZIL is definitely frowned upon and can cause your
  applications much confusion. Disabling the ZIL can cause corruption for
  NFS clients in the case where a reply to the client is done before the
  server crashes, and the server crashes before the data is committed to
  stable storage. If you can't live with this, then don't turn off the ZIL.

Then: the service that we export with ZFS & NFS is not something like a database or a really stressful system, just exported home directories. So it feels to me that we can just disable the ZIL.

3)
From `NFS and ZFS, a fine combination'
http://blogs.sun.com/roch/#zfs_to_ufs_performance_comparison
I read:
  NFS service with risk of corruption of client's side view:

    nfs/ufs :  7   sec (write cache enable)
    nfs/zfs :  4.2 sec (write cache enable, zil_disable=1)
    nfs/zfs :  4.7 sec (write cache disable, zil_disable=1)

  Semantically correct NFS service:

    nfs/ufs : 17   sec (write cache disable)
    nfs/zfs : 12   sec (write cache disable, zil_disable=0)
    nfs/zfs :  7   sec (write cache enable, zil_disable=0)

Then: does this mean that when you just create a UFS filesystem and export it with NFS, you are running a non-semantically-correct NFS service, and that you have to disable the write cache to have a correct NFS server???

4)
So can we say that people used to running NFS with a risk of corruption of the client's-side view can just take ZFS and disable the ZIL?

Thanks in advance for your clarifications.

Ced.
P.-S. Do some of you know the best way to send an email containing many questions? Should I create a thread for each of them next time?

--
Cedric BRINER
Geneva - Switzerland
On 2007-Apr-26 15:34 UTC, Neil.Perrin at Sun.COM wrote:
cedric briner wrote:
>> You might set zil_disable to 1 (_then_ mount the fs to be
>> shared). But you're still exposed to OS crashes; those would still
>> corrupt your nfs clients.
>>
>> -r
>
> hello Roch,
>
> I have a few questions.
>
> 1)
> From `Shenanigans with ZFS flushing and intelligent arrays...'
> http://blogs.digitar.com/jjww/?itemid=44
> I read:
>   Disable the ZIL. The ZIL is the way ZFS maintains _consistency_ until
>   it can get the blocks written to their final place on the disk.

This is wrong. The on-disk format is always consistent. The author of this blog is misinformed and is probably getting confused with traditional journalling.

>   That's why the ZIL flushes the cache.

The ZIL flushes its blocks to ensure that if a power failure/panic occurs, then the data the system guarantees to be on stable storage (due, say, to an fsync or O_DSYNC) is actually on stable storage.

>   If you don't have the ZIL and a power
>   outage occurs, your blocks may go poof in your server's RAM... 'cause
>   they never made it to the disk, Kemosabe.

True, but not blocks, rather system call transactions - as this is what the ZIL handles.

> From Eric Kustarz's weblog
> http://blogs.sun.com/erickustarz/entry/zil_disable
> I read:
>   Note: disabling the ZIL does _NOT_ compromise filesystem integrity.
>   Disabling the ZIL does NOT cause corruption in ZFS.
>
> Then I don't understand. One says that:
>   - we can lose _consistency_
> and the other one says that:
>   - it does not compromise filesystem integrity
> So... which one is right?

Eric's, who works on ZFS!

> 2)
> From Eric Kustarz's weblog
> http://blogs.sun.com/erickustarz/entry/zil_disable
> I read:
>   Disabling the ZIL is definitely frowned upon and can cause your
>   applications much confusion. Disabling the ZIL can cause corruption for
>   NFS clients in the case where a reply to the client is done before the
>   server crashes, and the server crashes before the data is committed to
>   stable storage. If you can't live with this, then don't turn off the ZIL.
>
> Then: the service that we export with ZFS & NFS is not something like a
> database or a really stressful system, just exported home directories.
> So it feels to me that we can just disable the ZIL.
>
> 3)
> From `NFS and ZFS, a fine combination'
> http://blogs.sun.com/roch/#zfs_to_ufs_performance_comparison
> I read:
>   NFS service with risk of corruption of client's side view:
>
>     nfs/ufs :  7   sec (write cache enable)
>     nfs/zfs :  4.2 sec (write cache enable, zil_disable=1)
>     nfs/zfs :  4.7 sec (write cache disable, zil_disable=1)
>
>   Semantically correct NFS service:
>
>     nfs/ufs : 17   sec (write cache disable)
>     nfs/zfs : 12   sec (write cache disable, zil_disable=0)
>     nfs/zfs :  7   sec (write cache enable, zil_disable=0)
>
> Then: does this mean that when you just create a UFS filesystem and
> export it with NFS, you are running a non-semantically-correct NFS
> service, and that you have to disable the write cache to have a correct
> NFS server???

Yes. UFS requires the write cache to be disabled to maintain consistency.

> 4)
> So can we say that people used to running NFS with a risk of corruption
> of the client's-side view can just take ZFS and disable the ZIL?

I suppose, but we aim for better than expected corruption. We (ZFS) recommend not disabling the ZIL. We also recommend not disabling the disk write cache flushing unless the disks are backed by NVRAM or a UPS.

> Thanks in advance for your clarifications.
>
> Ced.
> P.-S. Do some of you know the best way to send an email containing
> many questions? Should I create a thread for each of them next time?

This works. Good questions.

Neil.
cedric briner writes:
> > You might set zil_disable to 1 (_then_ mount the fs to be
> > shared). But you're still exposed to OS crashes; those would
> > still corrupt your nfs clients.
> >
> > -r
>
> hello Roch,
>
> I have a few questions.
>
> 1)
> From `Shenanigans with ZFS flushing and intelligent arrays...'
> http://blogs.digitar.com/jjww/?itemid=44
> I read:
>   Disable the ZIL. The ZIL is the way ZFS maintains _consistency_ until
>   it can get the blocks written to their final place on the disk. That's
>   why the ZIL flushes the cache. If you don't have the ZIL and a power
>   outage occurs, your blocks may go poof in your server's RAM... 'cause
>   they never made it to the disk, Kemosabe.
>

_consistency_ above may not be the right technical term. It's more about data loss. If you fsync, then the data needs to be somewhere; that's in the ZIL.

> From Eric Kustarz's weblog
> http://blogs.sun.com/erickustarz/entry/zil_disable
> I read:
>   Note: disabling the ZIL does _NOT_ compromise filesystem integrity.
>   Disabling the ZIL does NOT cause corruption in ZFS.
>

That is correct. ZIL or not, the pool stays self-consistent from ZFS's standpoint. This means it will be mounted and there will be no error then. But the application may not agree with this statement: without a ZIL, I fsync a file, yet after a reboot the data is not there (data loss).

> Then I don't understand. One says that:
>   - we can lose _consistency_
> and the other one says that:
>   - it does not compromise filesystem integrity
> So... which one is right?
>

ZFS/ZPOOL is consistent with itself (ZIL not required for that), or ZFS/ZPOOL is consistent with the POSIX requirement regarding fsync (ZIL required).

> 2)
> From Eric Kustarz's weblog
> http://blogs.sun.com/erickustarz/entry/zil_disable
> I read:
>   Disabling the ZIL is definitely frowned upon and can cause your
>   applications much confusion. Disabling the ZIL can cause corruption for
>   NFS clients in the case where a reply to the client is done before the
>   server crashes, and the server crashes before the data is committed to
>   stable storage. If you can't live with this, then don't turn off the ZIL.
>
> Then: the service that we export with ZFS & NFS is not something like a
> database or a really stressful system, just exported home directories.
> So it feels to me that we can just disable the ZIL.
>

The problem is dealing with failure (as I explain in the "fine combination" entry). With the ZIL disabled, you would need to unmount, on all clients, every FS that is backed by a ZFS with the ZIL disabled. Might as well have to reboot them. So a reboot of the server implies tracking down all client mounts... NFS was never intended for that...

> 3)
> From `NFS and ZFS, a fine combination'
> http://blogs.sun.com/roch/#zfs_to_ufs_performance_comparison
> I read:
>   NFS service with risk of corruption of client's side view:
>
>     nfs/ufs :  7   sec (write cache enable)
>     nfs/zfs :  4.2 sec (write cache enable, zil_disable=1)
>     nfs/zfs :  4.7 sec (write cache disable, zil_disable=1)
>
>   Semantically correct NFS service:
>
>     nfs/ufs : 17   sec (write cache disable)
>     nfs/zfs : 12   sec (write cache disable, zil_disable=0)
>     nfs/zfs :  7   sec (write cache enable, zil_disable=0)
>
> Then: does this mean that when you just create a UFS filesystem and
> export it with NFS, you are running a non-semantically-correct NFS
> service, and that you have to disable the write cache to have a correct
> NFS server???
>

Yes. That is why all shipping Sun products had the disk write cache disabled. But with ZFS we are safe no matter what the setting.

> 4)
> So can we say that people used to running NFS with a risk of corruption
> of the client's-side view can just take ZFS and disable the ZIL?
>

It's your call. You were running UFS with third-party storage that had the write cache enabled, which means your air bag was disabled. Now you got a new car and want to disable the air bag on that one to gain some performance. It's your data, your life. But if people start doing this on a large scale, we're in for trouble.

-r

> Thanks in advance for your clarifications.
>
> Ced.
> P.-S. Do some of you know the best way to send an email containing
> many questions? Should I create a thread for each of them next time?
>

Since they are related, this format looked fine to me.

> --
> Cedric BRINER
> Geneva - Switzerland
On 26-Apr-07, at 11:57 AM, cedric briner wrote:
>>> okay, let's say that it is not. :)
>>> Imagine that I set up a box:
>>> - with Solaris
>>> - with many HDs (directly attached)
>>> - use ZFS as the FS
>>> - export the data with NFS
>>> - on a UPS.
>>>
>>> Then, after reading
>>> http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#ZFS_and_Complex_Storage_Considerations
>>> I wonder if there is a way to tell the OS to ignore the fsync flush
>>> commands, since they are likely to survive a power outage.
>>
>> Cedric,
>> You do not want to ignore syncs from ZFS if your hard disks are directly
>> attached to the server. As the document mentions, that is really
>> for Complex Storage with NVRAM, where flushing is not necessary.
>
> This post follows `XServe Raid & Complex Storage Considerations':
> http://www.opensolaris.org/jive/thread.jspa?threadID=29276&tstart=0
>
> where we made the assumption (*1) that if the XServe Raid is
> connected to a UPS, we can treat the RAM in the XServe Raid
> as if it were NVRAM.

May not be relevant, but ISTR that the Xserve has an option for battery-backed write cache RAM also.

--Toby

> ...
>
> Ced.
> --
> Cedric BRINER
> Geneva - Switzerland
Hello Wee,

Thursday, April 26, 2007, 4:21:00 PM, you wrote:

WYT> On 4/26/07, cedric briner <work at infomaniak.ch> wrote:
>> okay, let's say that it is not. :)
>> Imagine that I set up a box:
>> - with Solaris
>> - with many HDs (directly attached)
>> - use ZFS as the FS
>> - export the data with NFS
>> - on a UPS.
>>
>> Then, after reading
>> http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#ZFS_and_Complex_Storage_Considerations
>> I wonder if there is a way to tell the OS to ignore the fsync flush
>> commands, since they are likely to survive a power outage.

WYT> Cedric,

WYT> You do not want to ignore syncs from ZFS if your hard disks are directly
WYT> attached to the server. As the document mentions, that is really for
WYT> Complex Storage with NVRAM, where flushing is not necessary.

What??

Setting zil_disable=1 has nothing to do with NVRAM in storage arrays. It disables the ZIL in ZFS, which means that if an application calls fsync() or opens a file with O_DSYNC, etc., then ZFS won't honor it (it returns immediately without committing to stable storage).

Once the txg group closes, data will be written to the disks and SCSI write cache flush commands will be sent.

Setting zil_disable to 1 is not that bad actually, and if someone doesn't mind losing the last N seconds of data in case of a server crash (ZFS itself will stay consistent), it can actually speed up NFS operations a lot.

btw: people accustomed to Linux have, in a way, always had zil_disable set to 1... :)

--
Best regards,
Robert                          mailto:rmilkowski at task.gda.pl
                                http://milek.blogspot.com
Robert,

On 4/27/07, Robert Milkowski <rmilkowski at task.gda.pl> wrote:
> Hello Wee,
>
> Thursday, April 26, 2007, 4:21:00 PM, you wrote:
>
> WYT> On 4/26/07, cedric briner <work at infomaniak.ch> wrote:
> >> okay, let's say that it is not. :)
> >> Imagine that I set up a box:
> >> - with Solaris
> >> - with many HDs (directly attached)
> >> - use ZFS as the FS
> >> - export the data with NFS
> >> - on a UPS.
> >>
> >> Then, after reading
> >> http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#ZFS_and_Complex_Storage_Considerations
> >> I wonder if there is a way to tell the OS to ignore the fsync flush
> >> commands, since they are likely to survive a power outage.
>
> WYT> Cedric,
>
> WYT> You do not want to ignore syncs from ZFS if your hard disks are directly
> WYT> attached to the server. As the document mentions, that is really for
> WYT> Complex Storage with NVRAM, where flushing is not necessary.
>
> What??
>
> Setting zil_disable=1 has nothing to do with NVRAM in storage arrays.
> It disables the ZIL in ZFS, which means that if an application calls
> fsync() or opens a file with O_DSYNC, etc., then ZFS won't honor it
> (it returns immediately without committing to stable storage).

Wait a minute. Are we talking about zil_disable or zfs_noflush (or zfs_nocacheflush)? The article quoted was about configuring the array to ignore flush commands, or the device-specific zfs_noflush, not zil_disable.

I agree that zil_disable is okay from the FS's view (correctness still depends on the application), but zfs_noflush is dangerous.

--
Just me,
Wire ...
Blog: <prstat.blogspot.com>
Cedric,

On 4/26/07, cedric briner <work at infomaniak.ch> wrote:
> >> okay, let's say that it is not. :)
> >> Imagine that I set up a box:
> >> - with Solaris
> >> - with many HDs (directly attached)
> >> - use ZFS as the FS
> >> - export the data with NFS
> >> - on a UPS.
> >>
> >> Then, after reading
> >> http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#ZFS_and_Complex_Storage_Considerations
> >>
> >> I wonder if there is a way to tell the OS to ignore the fsync flush
> >> commands, since they are likely to survive a power outage.
> >
> > Cedric,
> >
> > You do not want to ignore syncs from ZFS if your hard disks are directly
> > attached to the server. As the document mentions, that is really for
> > Complex Storage with NVRAM, where flushing is not necessary.
>
> This post follows `XServe Raid & Complex Storage Considerations':
> http://www.opensolaris.org/jive/thread.jspa?threadID=29276&tstart=0

Ah... I wasn't aware the other thread was started by you :). If your storage device features NVRAM, you should in fact configure it as discussed in that thread. However, if your storage device(s) are directly attached disks (or anything without an NVRAM controller), zfs_noflush=1 is potentially fatal (see link below).

> where we made the assumption (*1) that if the XServe Raid is connected
> to a UPS, we can treat the RAM in the XServe Raid as if it were NVRAM.

I'm not sure about the interaction between the XServe and the UPS, but I'd imagine that the UPS can probably power the XServe for a few minutes after a power outage. That should be enough time for the XServe to drain stuff from its RAM to disk.

> (*1) This assumption is even pointed out by Roch:
> http://blogs.sun.com/roch/#zfs_to_ufs_performance_comparison
> >> Intelligent Storage
> through `the Shenanigans with ZFS flushing and intelligent arrays...':
> http://blogs.digitar.com/jjww/?itemid=44
> >> Tell your array to ignore ZFS' flush commands
>
> So in this way, when we export it with NFS, we get a boost in bandwidth.

Indeed. This is especially true if you consider that expensive storage is likely to be shared by more than one host. A flush command likely flushes the entire cache rather than just the parts relevant to the requesting host.

> Okay, then is there any difference that I do not catch between:
> - the Shenanigans with ZFS flushing and intelligent arrays...
> - and my situation?
>
> I mean, I want to have a cheap and reliable NFS service. Why should I
> buy expensive `Complex Storage with NVRAM' and not just buy a machine
> with 8 IDE HDs?

Your 8 IDE HDs may not benefit much from zfs_noflush=1, since their caches are small anyway, but the potential impact on reliability will be fairly severe.

http://www.opensolaris.org/jive/thread.jspa?messageID=91730

Nothing is stopping you, though, from getting decent performance from 8 IDE HDDs. You just should not treat them like an NVRAM-backed array.

--
Just me,
Wire ...
Blog: <prstat.blogspot.com>
Anyone who has an Xraid should have one (or 2) of these BBC modules. Good mojo.

http://store.apple.com/1-800-MY-APPLE/WebObjects/AppleStore.woa/wa/RSLID?mco=6C04E0D7&nplm=M8941G/B

Can you tell I <3 apple?

-----Original Message-----
From: zfs-discuss-bounces at opensolaris.org
[mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Wee Yeh Tan
Sent: Thursday, April 26, 2007 9:40 PM
To: cedric briner
Cc: zfs at opensolaris
Subject: Re: [zfs-discuss] HowTo: UPS + ZFS & NFS + no fsync

Cedric,

On 4/26/07, cedric briner <work at infomaniak.ch> wrote:
> >> okay, let's say that it is not. :)
> >> Imagine that I set up a box:
> >> - with Solaris
> >> - with many HDs (directly attached)
> >> - use ZFS as the FS
> >> - export the data with NFS
> >> - on a UPS.
> >>
> >> Then, after reading
> >> http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#ZFS_and_Complex_Storage_Considerations
> >>
> >> I wonder if there is a way to tell the OS to ignore the fsync flush
> >> commands, since they are likely to survive a power outage.
> >
> > Cedric,
> >
> > You do not want to ignore syncs from ZFS if your hard disks are
> > directly attached to the server. As the document mentions, that is
> > really for Complex Storage with NVRAM, where flushing is not necessary.
>
> This post follows `XServe Raid & Complex Storage Considerations':
> http://www.opensolaris.org/jive/thread.jspa?threadID=29276&tstart=0

Ah... I wasn't aware the other thread was started by you :). If your storage device features NVRAM, you should in fact configure it as discussed in that thread. However, if your storage device(s) are directly attached disks (or anything without an NVRAM controller), zfs_noflush=1 is potentially fatal (see link below).

> where we made the assumption (*1) that if the XServe Raid is connected
> to a UPS, we can treat the RAM in the XServe Raid as if it were NVRAM.

I'm not sure about the interaction between the XServe and the UPS, but I'd imagine that the UPS can probably power the XServe for a few minutes after a power outage. That should be enough time for the XServe to drain stuff from its RAM to disk.

> (*1) This assumption is even pointed out by Roch:
> http://blogs.sun.com/roch/#zfs_to_ufs_performance_comparison
> >> Intelligent Storage
> through `the Shenanigans with ZFS flushing and intelligent arrays...':
> http://blogs.digitar.com/jjww/?itemid=44
> >> Tell your array to ignore ZFS' flush commands
>
> So in this way, when we export it with NFS, we get a boost in bandwidth.

Indeed. This is especially true if you consider that expensive storage is likely to be shared by more than one host. A flush command likely flushes the entire cache rather than just the parts relevant to the requesting host.

> Okay, then is there any difference that I do not catch between:
> - the Shenanigans with ZFS flushing and intelligent arrays...
> - and my situation?
>
> I mean, I want to have a cheap and reliable NFS service. Why should I
> buy expensive `Complex Storage with NVRAM' and not just buy a machine
> with 8 IDE HDs?

Your 8 IDE HDs may not benefit much from zfs_noflush=1, since their caches are small anyway, but the potential impact on reliability will be fairly severe.

http://www.opensolaris.org/jive/thread.jspa?messageID=91730

Nothing is stopping you, though, from getting decent performance from 8 IDE HDDs. You just should not treat them like an NVRAM-backed array.

--
Just me,
Wire ...
Blog: <prstat.blogspot.com>
Robert Milkowski writes:
> Hello Wee,
>
> Thursday, April 26, 2007, 4:21:00 PM, you wrote:
>
> WYT> On 4/26/07, cedric briner <work at infomaniak.ch> wrote:
> >> okay, let's say that it is not. :)
> >> Imagine that I set up a box:
> >> - with Solaris
> >> - with many HDs (directly attached)
> >> - use ZFS as the FS
> >> - export the data with NFS
> >> - on a UPS.
> >>
> >> Then, after reading
> >> http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#ZFS_and_Complex_Storage_Considerations
> >> I wonder if there is a way to tell the OS to ignore the fsync flush
> >> commands, since they are likely to survive a power outage.
>
> WYT> Cedric,
>
> WYT> You do not want to ignore syncs from ZFS if your hard disks are directly
> WYT> attached to the server. As the document mentions, that is really for
> WYT> Complex Storage with NVRAM, where flushing is not necessary.
>
> What??
>
> Setting zil_disable=1 has nothing to do with NVRAM in storage arrays.
> It disables the ZIL in ZFS, which means that if an application calls
> fsync() or opens a file with O_DSYNC, etc., then ZFS won't honor it
> (it returns immediately without committing to stable storage).
>
> Once the txg group closes, data will be written to the disks and SCSI
> write cache flush commands will be sent.
>
> Setting zil_disable to 1 is not that bad actually, and if someone
> doesn't mind losing the last N seconds of data in case of a server
> crash (ZFS itself will stay consistent), it can actually speed up
> NFS operations a lot.
>

...set zil_disable... speed up nfs... at the expense of a risk of corruption of the NFS clients' view. We must never forget this. zil_disable is really not an option IMO.

-r

> --
> Best regards,
> Robert                          mailto:rmilkowski at task.gda.pl
>                                 http://milek.blogspot.com
Wee Yeh Tan writes:
> Robert,
>
> On 4/27/07, Robert Milkowski <rmilkowski at task.gda.pl> wrote:
> > Hello Wee,
> >
> > Thursday, April 26, 2007, 4:21:00 PM, you wrote:
> >
> > WYT> On 4/26/07, cedric briner <work at infomaniak.ch> wrote:
> > >> okay, let's say that it is not. :)
> > >> Imagine that I set up a box:
> > >> - with Solaris
> > >> - with many HDs (directly attached)
> > >> - use ZFS as the FS
> > >> - export the data with NFS
> > >> - on a UPS.
> > >>
> > >> Then, after reading
> > >> http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#ZFS_and_Complex_Storage_Considerations
> > >> I wonder if there is a way to tell the OS to ignore the fsync flush
> > >> commands, since they are likely to survive a power outage.
> >
> > WYT> Cedric,
> >
> > WYT> You do not want to ignore syncs from ZFS if your hard disks are directly
> > WYT> attached to the server. As the document mentions, that is really for
> > WYT> Complex Storage with NVRAM, where flushing is not necessary.
> >
> > What??
> >
> > Setting zil_disable=1 has nothing to do with NVRAM in storage arrays.
> > It disables the ZIL in ZFS, which means that if an application calls
> > fsync() or opens a file with O_DSYNC, etc., then ZFS won't honor it
> > (it returns immediately without committing to stable storage).
>
> Wait a minute. Are we talking about zil_disable or zfs_noflush (or
> zfs_nocacheflush)? The article quoted was about configuring the array
> to ignore flush commands, or the device-specific zfs_noflush, not
> zil_disable.
>
> I agree that zil_disable is okay from the FS's view (correctness still
> depends on the application), but zfs_noflush is dangerous.
>

For me, both are dangerous.

zil_disable can cause immense pain to applications and NFS clients. I don't see how anyone can recommend it without mentioning the risk of application/NFS corruption.

zfs_nocacheflush is also unsafe. It opens a risk of pool corruption! But if you have *all* of your pooled data on safe, NVRAM-protected storage, and you can't find a way to tell the storage to ignore cache flush requests, you might want to set the variable temporarily until the SYNC_NV thing is sorted out. Then make sure nobody imports the tunable elsewhere without fully understanding it, and make sure no one creates a new pool on non-NVRAM storage. Since those things are not under anyone's control, it's not a good idea to spread this kind of recommendation.

-r

> --
> Just me,
> Wire ...
> Blog: <prstat.blogspot.com>
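For completeness, since two different tunables keep being mixed up in this thread: zil_disable (discussed above) turns off the ZIL, while zfs_noflush / zfs_nocacheflush (the name depends on the build) is the one that suppresses the cache-flush requests ZFS sends to the devices. The latter is also an /etc/system tunable; a sketch, purely as an illustration of where it would go, and only for the all-NVRAM case Roch describes:

  Add to /etc/system and reboot (on builds where the variable is named
  zfs_nocacheflush; older builds use the zfs_noflush name mentioned above):

    set zfs:zfs_nocacheflush = 1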
>> You might set zil_disable to 1 (_then_ mount the fs to be
>> shared). But you're still exposed to OS crashes; those would
>> still corrupt your nfs clients.

Just to understand better (I know that I'm quite slow :( ): when you say _nfs clients_, are you specifically talking about:
- the nfs client programs themselves (lockd, statd), meaning that you can
  get a stale NFS handle or other such things?
- or the host acting as an nfs client, meaning that the nfs client service
  works, but you would have corrupted the data that the software uses on
  the NFS-mounted disk?

If I'm digging and digging into this ZIL, and NFS over UFS with write cache, it is because I do not understand which kind of problems can occur. What I read in general are statements like _corruption_ of the client's point of view... but what does that mean?

Is this the scheme of what can happen:
- the application on the nfs client side writes data to the nfs server
- meanwhile the nfs server crashes, so:
  - the data are not stored
  - the application on the nfs client thinks that the data are stored! :(
- when the server is up again
  - the nfs client sees the data again
  - the application on the nfs client side finds itself with data in the
    state prior to its last writes.

Am I right?

So with the ZIL:
- The application has the ability to do things the right way. So even after
  an nfs-server crash, the application on the nfs-client side can rely on
  its own data.

And without the ZIL:
- The application does not have the ability to do things the right way, and
  we can have corruption of data. That doesn't mean corruption of the FS;
  it means that the data were partially written and some are missing.

> For the love of God do NOT do stuff like that.
>
> Just create ZFS on a pile of disks the way that we should, with the
> write cache disabled on all the disks and with redundancy in the ZPool
> config .. nothing special :

Wooooh!! no.. this is really special to me!! I've read and re-read many times:
- NFS and ZFS, a fine combination
- ZFS Best Practices Guide
and other blogs without noticing such an idea! I even notice the opposite recommendation.

From the ZFS Best Practices Guide >> ZFS Storage Pools Recommendations
http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#ZFS_Storage_Pools_Recommendations
I read:
  - For production systems, consider using whole disks for storage pools
    rather than slices for the following reasons:
    + Allow ZFS to enable the disk's write cache, for those disks that
      have write caches

and from NFS and ZFS, a fine combination >> Comparison with UFS
http://blogs.sun.com/roch/#zfs_to_ufs_performance_comparison
I read:
  Semantically correct NFS service:

    nfs/ufs : 17 sec (write cache disable)
    nfs/zfs : 12 sec (write cache disable, zil_disable=0)
    nfs/zfs :  7 sec (write cache enable, zil_disable=0)

so I can say that nfs/zfs with the write cache enabled and the ZIL enabled is -- in that case -- faster.

So why are you recommending that I disable the write cache?

--
Cedric BRINER
Geneva - Switzerland
cedric briner writes:
> >> You might set zil_disable to 1 (_then_ mount the fs to be
> >> shared). But you're still exposed to OS crashes; those would
> >> still corrupt your nfs clients.
>
> Just to understand better (I know that I'm quite slow :( ):
> when you say _nfs clients_, are you specifically talking about:
> - the nfs client programs themselves (lockd, statd), meaning that you
>   can get a stale NFS handle or other such things?
> - or the host acting as an nfs client, meaning that the nfs client
>   service works, but you would have corrupted the data that the
>   software uses on the NFS-mounted disk?
>

It's rather applications running on the client. Basically, we would have data loss from the perspective of an application running on the client, without any sign of errors. It's a bit like having a disk that would drop a write request and not signal an error.

> If I'm digging and digging into this ZIL, and NFS over UFS with write
> cache, it is because I do not understand which kind of problems can
> occur. What I read in general are statements like _corruption_ of the
> client's point of view... but what does that mean?
>
> Is this the scheme of what can happen:
> - the application on the nfs client side writes data to the nfs server
> - meanwhile the nfs server crashes, so:
>   - the data are not stored
>   - the application on the nfs client thinks that the data are stored! :(
> - when the server is up again
>   - the nfs client sees the data again
>   - the application on the nfs client side finds itself with data in
>     the state prior to its last writes.
>
> Am I right?

The scenario I see would be, on the client: download some software (a tar file), then

  tar x
  make

The tar succeeded with no errors at all. Behind our back, during the tar x, the server rebooted. No big deal normally. But with zil_disable on the server, the make fails, either because some files from the original tar are missing, or parts of files.

> So with the ZIL:
> - The application has the ability to do things the right way. So even
>   after an nfs-server crash, the application on the nfs-client side can
>   rely on its own data.
>
> And without the ZIL:
> - The application does not have the ability to do things the right way,
>   and we can have corruption of data. That doesn't mean corruption of
>   the FS; it means that the data were partially written and some are
>   missing.

Sounds right.

> > For the love of God do NOT do stuff like that.
> >
> > Just create ZFS on a pile of disks the way that we should, with the
> > write cache disabled on all the disks and with redundancy in the ZPool
> > config .. nothing special :
>
> Wooooh!! no.. this is really special to me!! I've read and re-read many
> times:
> - NFS and ZFS, a fine combination
> - ZFS Best Practices Guide
> and other blogs without noticing such an idea! I even notice the
> opposite recommendation.
>
> From the ZFS Best Practices Guide >> ZFS Storage Pools Recommendations
> http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#ZFS_Storage_Pools_Recommendations
> I read:
>   - For production systems, consider using whole disks for storage pools
>     rather than slices for the following reasons:
>     + Allow ZFS to enable the disk's write cache, for those disks that
>       have write caches
>
> and from NFS and ZFS, a fine combination >> Comparison with UFS
> http://blogs.sun.com/roch/#zfs_to_ufs_performance_comparison
> I read:
>   Semantically correct NFS service:
>
>     nfs/ufs : 17 sec (write cache disable)
>     nfs/zfs : 12 sec (write cache disable, zil_disable=0)
>     nfs/zfs :  7 sec (write cache enable, zil_disable=0)
>
> so I can say that nfs/zfs with the write cache enabled and the ZIL
> enabled is -- in that case -- faster.
>
> So why are you recommending that I disable the write cache?
>

For ZFS, it can work either way. Maybe the above was a typo.

> --
> Cedric BRINER
> Geneva - Switzerland
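One last practical note on the write-cache question: on Solaris the volatile write cache of an individual SCSI/SAS disk can usually be inspected and toggled from format's expert mode. Whether the menu is offered depends on the disk and driver (IDE/SATA disks may not expose it), so treat the following only as a sketch of the menu path, assuming it is available on your system:

  # format -e
  (select the disk from the list)
  format> cache
  cache> write_cache
  write_cache> display
  write_cache> disable       (or: enable)
  write_cache> quit
  cache> quit

Remember that when ZFS is given whole disks it manages the disk write cache itself, as the Best Practices Guide passage quoted above points out.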