Just posted: http://blogs.sun.com/roch/entry/nfs_and_zfs_a_fine

____________________________________________________________________________________
Performance, Availability & Architecture Engineering

Roch Bourbonnais, Senior Performance Analyst
Sun Microsystems, ICNC-Grenoble
180, Avenue de l'Europe, 38330 Montbonnot Saint Martin, France
http://icncweb.france/~rbourbon           http://blogs.sun.com/roch
Roch.Bourbonnais at Sun.Com               (+33).4.76.18.83.20
Roch - PAE wrote:
>
> Just posted:
>
> http://blogs.sun.com/roch/entry/nfs_and_zfs_a_fine

Nice article.  Now what about when we do this with more than one disk
and compare UFS/SVM or VxFS/VxVM with ZFS as the back end - all with
JBOD storage?

How then does ZFS compare as an NFS server?

--
Darren J Moffat
> http://blogs.sun.com/roch/entry/nfs_and_zfs_a_fine

So just to confirm: disabling the ZIL *ONLY* breaks the semantics of fsync()
and synchronous writes from the application perspective; it will do *NOTHING*
to lessen the correctness guarantee of ZFS itself, including in the case of a
power outage?

This makes it more reasonable to actually disable the ZIL. But still,
personally I would like to be able to tell the NFS server to simply not be
standards compliant, so that I can keep the correct semantics on the lower
layer (ZFS) and disable the behavior at the level where I actually want it
disabled (the NFS server).

--
/ Peter Schuller, InfiDyne Technologies HB

PGP userID: 0xE9758B7D or 'Peter Schuller <peter.schuller at infidyne.com>'
Key retrieval: Send an E-Mail to getpgpkey at scode.org
E-Mail: peter.schuller at infidyne.com Web: http://www.scode.org
> Roch - PAE wrote:
>>
>> Just posted:
>>
>> http://blogs.sun.com/roch/entry/nfs_and_zfs_a_fine
>
> Nice article.

I still need to read all of it .. closely.

> Now what about when we do this with more than one disk
> and compare UFS/SVM or VxFS/VxVM with ZFS as the back end - all with
> JBOD storage ?

That is precisely what I have here, except for VxFS/VxVM, which I no longer
see a use for.

I thought it was considered to be in poor taste to post performance numbers
on blogs and such?

>
> How then does ZFS compare as an NFS server ?
>

I want to compare ZFS over NFS and UFS over NFS with SVM. Thus I have done
the following between yesterday and today.

Create a small 90GB stripe set with a 256-block stripe depth:

# metainit d19 1 3 /dev/dsk/c0t9d0s0 /dev/dsk/c0t10d0s0 /dev/dsk/c0t11d0s0 -i 256b
d19: Concat/Stripe is setup
# metainit d29 1 3 /dev/dsk/c0t12d0s0 /dev/dsk/c0t13d0s0 /dev/dsk/c0t14d0s0 -i 256b
d29: Concat/Stripe is setup
# metainit d9 -m d19
d9: Mirror is setup
# metattach d9 d29
d9: submirror d29 is attached
metattach: mars: /dev/md/dsk/d0: not a metadevice

On snv_52 I have no idea why that last message pops up, because d0 is a
metadevice and it has nothing to do with d9, d19 or d29.

After the sync is done we have a mirror of two stripes (RAID 0+1).

# newfs -b 8192 -f 8192 -m 5 -i 8192 -a 0 /dev/md/rdsk/d9
newfs: /dev/md/rdsk/d9 last mounted as /export/nfs
newfs: construct a new file system /dev/md/rdsk/d9: (y/n)? y
Warning: 4992 sector(s) in last cylinder unallocated
/dev/md/rdsk/d9: 213369984 sectors in 34729 cylinders of 48 tracks, 128 sectors
        104184.6MB in 2171 cyl groups (16 c/g, 48.00MB/g, 5824 i/g)
super-block backups (for fsck -F ufs -o b=#) at:
 32, 98464, 196896, 295328, 393760, 492192, 590624, 689056, 787488, 885920,
Initializing cylinder groups: ...........................................
super-block backups for last 10 cylinder groups at:
 212441248, 212539680, 212638112, 212736544, 212834976, 212933408,
 213031840, 213130272, 213228704, 213327136

# mount -F ufs -o noatime,nologging /dev/md/dsk/d9 /export/nfs

I'm not sure of the value of nologging.

# share -F nfs -o rw=pluto,root=pluto /export/nfs

At the client (pluto = Solaris 8) I mount thus:

# mount -F nfs -o bg,intr,nosuid mars:/export/nfs /mnt
# mkdir /mnt/snv_54; chown dclarke:csw /mnt/snv_54

Back at the NFS server I see this:

# ls -lap /export/nfs
total 50
drwxr-xr-x   4 root     root         512 Jan  8 02:58 ./
drwxr-xr-x   5 root     sys          512 Nov 24 15:49 ../
drwx------   2 root     root        8192 Jan  8 00:41 lost+found/
drwxr-xr-x   2 dclarke  csw          512 Jan  8 02:58 snv_54/

I then start a wget download of all the snv_54 CDROM files and go to sleep.
This morning I see this:

bash-3.1# ls -lap
total 8710144
drwxr-xr-x   2 dclarke  csw         1536 Jan  8 08:28 ./
drwxr-xr-x   4 root     root         512 Jan  8 02:58 ../
-rwx------   1 root     other       1187 Jan  8 08:28 check.sh
-rwx------   1 root     other       6479 Jan  8 03:12 get.sh
-rw-r--r--   1 root     other       2728 Jan  8 08:26 md5sum
-rw-r--r--   1 root     other       1382 Dec 15 15:48 md5sum_sparc
-rw-r--r--   1 root     other       3297 Jan  8 03:05 md5sum_sparc.log
-rw-r--r--   1 root     other       1346 Dec 15 15:52 md5sum_x86
-rw-r--r--   1 root     other       3266 Jan  8 04:59 md5sum_x86.log
-rw-r--r--   1 root     other      13218 Jan  8 03:29 sol-nv-b54-sparc-v1-iso.log
-rw-r--r--   1 root     other  391595145 Dec 15 17:04 sol-nv-b54-sparc-v1-iso.zip
-rw-r--r--   1 root     other      16147 Jan  8 03:49 sol-nv-b54-sparc-v2-iso.log
-rw-r--r--   1 root     other  507360545 Dec 15 17:13 sol-nv-b54-sparc-v2-iso.zip
-rw-r--r--   1 root     other      16384 Jan  8 04:10 sol-nv-b54-sparc-v3-iso.log
-rw-r--r--   1 root     other  518203340 Dec 15 17:21 sol-nv-b54-sparc-v3-iso.zip
-rw-r--r--   1 root     other      12434 Jan  8 04:24 sol-nv-b54-sparc-v4-iso.log
-rw-r--r--   1 root     other  359044620 Dec 15 17:27 sol-nv-b54-sparc-v4-iso.zip
-rw-r--r--   1 root     other      11085 Jan  8 04:36 sol-nv-b54-sparc-v5-iso.log
-rw-r--r--   1 root     other  305615204 Dec 15 17:33 sol-nv-b54-sparc-v5-iso.zip
-rw-r--r--   1 root     other      12039 Jan  8 04:59 sol-nv-b54-sparc-v6-iso.log
-rw-r--r--   1 root     other  344745281 Dec 15 17:38 sol-nv-b54-sparc-v6-iso.zip
-rw-r--r--   1 root     other      11929 Jan  8 05:13 sol-nv-b54-x86-v1-iso.log
-rw-r--r--   1 root     other  342585923 Dec 15 17:09 sol-nv-b54-x86-v1-iso.zip
-rw-r--r--   1 root     other      13667 Jan  8 05:34 sol-nv-b54-x86-v2-iso.log
-rw-r--r--   1 root     other  411458198 Dec 15 17:16 sol-nv-b54-x86-v2-iso.zip
-rw-r--r--   1 root     other      11376 Jan  8 05:47 sol-nv-b54-x86-v3-iso.log
-rw-r--r--   1 root     other  320073040 Dec 15 17:22 sol-nv-b54-x86-v3-iso.zip
-rw-r--r--   1 root     other      11771 Jan  8 06:00 sol-nv-b54-x86-v4-iso.log
-rw-r--r--   1 root     other  334381815 Dec 15 17:27 sol-nv-b54-x86-v4-iso.zip
-rw-r--r--   1 root     other      10671 Jan  8 06:12 sol-nv-b54-x86-v5-iso.log
-rw-r--r--   1 root     other  290463795 Dec 15 17:32 sol-nv-b54-x86-v5-iso.zip
-rw-r--r--   1 root     other      11698 Jan  8 06:29 sol-nv-b54-x86-v6-iso.log
-rw-r--r--   1 root     other  331409974 Dec 15 17:38 sol-nv-b54-x86-v6-iso.zip

The check.sh script runs gmd5sum against each of those files and greps for
the expected value in the md5sum file.
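The check.sh script itself is not posted in the thread; a minimal sketch of
what it presumably does, with the file names taken from the listing above and
everything else an assumption:

#!/bin/sh
# Hypothetical reconstruction of check.sh: for each downloaded image,
# print the locally computed MD5 next to the expected value recorded
# in the md5sum file, so matching pairs can be eyeballed or grepped.
for f in sol-nv-b54-*-iso.zip
do
    gmd5sum "$f"            # checksum computed from the downloaded file
    grep "$f" md5sum        # expected checksum from the md5sum file
done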
This way I can verify the validity of the download thus:

bash-3.1# ./check.sh
776dde5ab5cefaff451b041b79336555  sol-nv-b54-sparc-v1-iso.zip
776dde5ab5cefaff451b041b79336555  sol-nv-b54-sparc-v1-iso.zip
4e13f28514e0fc5ce7c34b2cd10e1c5a  sol-nv-b54-sparc-v2-iso.zip
4e13f28514e0fc5ce7c34b2cd10e1c5a  sol-nv-b54-sparc-v2-iso.zip
3bc5c292e04858b034eac126b476797c  sol-nv-b54-sparc-v3-iso.zip
3bc5c292e04858b034eac126b476797c  sol-nv-b54-sparc-v3-iso.zip
9b62b4758aaaf6fc0cb56918aee2295c  sol-nv-b54-sparc-v4-iso.zip
9b62b4758aaaf6fc0cb56918aee2295c  sol-nv-b54-sparc-v4-iso.zip
2130fd4c3a9db24e4f2b3dbc111db28f  sol-nv-b54-sparc-v5-iso.zip
2130fd4c3a9db24e4f2b3dbc111db28f  sol-nv-b54-sparc-v5-iso.zip
d907cbcda71059469212c08b70ee5e96  sol-nv-b54-sparc-v6-iso.zip
d907cbcda71059469212c08b70ee5e96  sol-nv-b54-sparc-v6-iso.zip
355dc13b2484d58760a775b3aa9e70e4  sol-nv-b54-x86-v1-iso.zip
355dc13b2484d58760a775b3aa9e70e4  sol-nv-b54-x86-v1-iso.zip
b50978a8091230a188c20c90ff28f475  sol-nv-b54-x86-v2-iso.zip
b50978a8091230a188c20c90ff28f475  sol-nv-b54-x86-v2-iso.zip
fc9f1354dcb0a20c549f07988ae3b335  sol-nv-b54-x86-v3-iso.zip
fc9f1354dcb0a20c549f07988ae3b335  sol-nv-b54-x86-v3-iso.zip
d13b9dd8a9483db5878b50be837b08a9  sol-nv-b54-x86-v4-iso.zip
d13b9dd8a9483db5878b50be837b08a9  sol-nv-b54-x86-v4-iso.zip
3c5f2e92066f191e9ec8427cf3f84fdf  sol-nv-b54-x86-v5-iso.zip
3c5f2e92066f191e9ec8427cf3f84fdf  sol-nv-b54-x86-v5-iso.zip
e82b64ab24b9d192957e4ae66517d97d  sol-nv-b54-x86-v6-iso.zip
e82b64ab24b9d192957e4ae66517d97d  sol-nv-b54-x86-v6-iso.zip

I now have an IO-intensive operation on a set of data. In order to compute
the MD5 signature the client clearly has to read the entire file, chunk by
chunk, and process it. Thus I see this as a valid performance metric to use.

The other, of course, is my crucible code, which generates a pile of small
files on pass 1, appends text on pass 2, appends a fragment of a block on
pass 3, and reports times based on the high-resolution timer. I expect that
to be of some value here also.

bash-3.1$ which cc
/opt/studio/SOS11/SUNWspro/bin/cc
bash-3.1$ cc -xstrconst -xildoff -xarch=v9a -Kpic -xlibmil -Xa -xO4 -c -o crucible.o crucible.c
bash-3.1$ cc -xstrconst -xildoff -xarch=v9a -Kpic -xlibmil -Xa -xO4 -o crucible crucible.o
bash-3.1$
bash-3.1$ strip crucible
bash-3.1$ file crucible
crucible: ELF 64-bit MSB executable SPARCV9 Version 1, UltraSPARC1 Extensions Required, dynamically linked, stripped
bash-3.1$

So I will now run a local test at the server mars on that UFS/SVM based
filesystem. Then I guess I had better do the same thing locally on the ZFS
based filesystem and then share that out to the same client with NFS, with
no change to any config anywhere. So this will be the out-of-the-box config
for snv_52 ZFS and NFS.

I'll post data shortly. I hope.

Dennis

ps: crucible running locally now
-------------------------------------------------------------------------

# ptime ./bin/crucible /export/nfs/local_test

      *****************************************************
        crucible : cru-ci-ble (kroo'se-bel) noun.
          1. A vessel for melting materials at high temperatures.
          2. A severe test, as of patience or belief; a trial.
              [ Dennis Clarke dclarke at blastwave.org ]
      *****************************************************

 TEST 1 ) file write.
          Building file structure at /export/nfs/local_test/
          This test will create 62^3 = 238328 files of exactly 65536 bytes
          each. This amounts to 15,619,063,808 bytes = 14.55 GB of file data.
          - - -  WARNING TO USERS OF ZFS FILESYSTEMS  - - -
          - - - - - - - - - - - - - - - - - - - - - - - - -
          If you have compression enabled on your ZFS based filesystem then
          you will see very high numbers for the final effective compression.
          This is due to the fact that the ZFS compression algorithm is a
          block based algorithm and the data written by this code is largely
          repetitive in nature. Thus it will compress better than any
          non-regular or random data.
          - - - - - - - - - - - - - - - - - - - - - - - - -
          - - -  WARNING TO USERS OF ZFS FILESYSTEMS  - - -

 . . .
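The ZFS half of the comparison is not shown at this point in the thread. A
plausible sketch of how the same six disks could be arranged on the ZFS side
is a stripe of three two-way mirrors, which is how ZFS expresses RAID 1+0;
the pool name, the disk pairing and the share options below are assumptions,
not taken from the posts:

# zpool create tank mirror c0t9d0 c0t12d0 mirror c0t10d0 c0t13d0 mirror c0t11d0 c0t14d0
# zfs create tank/nfs
# zfs set sharenfs='rw=pluto,root=pluto' tank/nfs

With sharenfs set, the dataset is exported by ZFS itself without editing
/etc/dfs/dfstab, which keeps with the "no change to any config anywhere"
spirit of the test.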
Hans-Juergen Schnitzer
2007-Jan-08 16:32 UTC
[zfs-discuss] NFS and ZFS, a fine combination
Roch - PAE wrote:
>
> Just posted:
>
> http://blogs.sun.com/roch/entry/nfs_and_zfs_a_fine
>

What role does network latency play? If I understand you correctly, even a
low-latency network, e.g. InfiniBand, would not increase performance
substantially, since the main bottleneck is that the NFS server always has
to write data to stable storage. Is that correct?

Hans Schnitzer
On Mon, Jan 08, 2007 at 03:47:31PM +0100, Peter Schuller wrote:
> > http://blogs.sun.com/roch/entry/nfs_and_zfs_a_fine
>
> So just to confirm: disabling the ZIL *ONLY* breaks the semantics of fsync()
> and synchronous writes from the application perspective; it will do *NOTHING*
> to lessen the correctness guarantee of ZFS itself, including in the case of a
> power outage?

That is correct. ZFS, with or without the ZIL, will *always* maintain
consistent on-disk state and will *always* preserve the ordering of
events on-disk. That is, if an application makes two changes to the
filesystem, first A, then B, ZFS will *never* show B on-disk without
also showing A.

> This makes it more reasonable to actually disable the ZIL. But still,
> personally I would like to be able to tell the NFS server to simply not be
> standards compliant, so that I can keep the correct semantics on the lower
> layer (ZFS) and disable the behavior at the level where I actually want it
> disabled (the NFS server).

This would be nice, simply to make it easier to do apples-to-apples
comparisons with other NFS server implementations that don't honor the
correct semantics (Linux, I'm looking at you).

--Bill
> On Mon, Jan 08, 2007 at 03:47:31PM +0100, Peter Schuller wrote:
>> > http://blogs.sun.com/roch/entry/nfs_and_zfs_a_fine
>>
>> So just to confirm: disabling the ZIL *ONLY* breaks the semantics of
>> fsync() and synchronous writes from the application perspective; it will
>> do *NOTHING* to lessen the correctness guarantee of ZFS itself, including
>> in the case of a power outage?
>
> That is correct. ZFS, with or without the ZIL, will *always* maintain
> consistent on-disk state and will *always* preserve the ordering of
> events on-disk. That is, if an application makes two changes to the
> filesystem, first A, then B, ZFS will *never* show B on-disk without
> also showing A.
>

So then, this raises the question: why do I want this ZIL animal at all?

>> This makes it more reasonable to actually disable the ZIL. But still,
>> personally I would like to be able to tell the NFS server to simply not be
>> standards compliant, so that I can keep the correct semantics on the lower
>> layer (ZFS) and disable the behavior at the level where I actually want it
>> disabled (the NFS server).
>
> This would be nice, simply to make it easier to do apples-to-apples
> comparisons with other NFS server implementations that don't honor the
> correct semantics (Linux, I'm looking at you).

Is that a glare or a leer or a sneer? :-)

dc
> Roch - PAE wrote:
>>
>> Just posted:
>>
>> http://blogs.sun.com/roch/entry/nfs_and_zfs_a_fine
>
> Nice article. Now what about when we do this with more than one disk
> and compare UFS/SVM or VxFS/VxVM with ZFS as the back end - all with
> JBOD storage ?
>
> How then does ZFS compare as an NFS server ?

The following is just pitiful. SVM stripe and mirror, thus:

# mount -v | grep d9
/dev/md/dsk/d9 on /export/nfs type ufs read/write/setuid/devices/intr/largefiles/xattr/noatime/onerror=panic/dev=154000e on Mon Jan  8 00:44:56 2007
# metastat d9
d9: Mirror
    Submirror 0: d19
      State: Okay
    Submirror 1: d29
      State: Okay
    Pass: 1
    Read option: roundrobin (default)
    Write option: parallel (default)
    Size: 213369984 blocks (101 GB)

d19: Submirror of d9
    State: Okay
    Size: 213369984 blocks (101 GB)
    Stripe 0: (interlace: 256 blocks)
        Device      Start Block  Dbase   State  Reloc  Hot Spare
        c0t9d0s0           0     No      Okay   Yes
        c0t10d0s0       2889     No      Okay   Yes
        c0t11d0s0       2889     No      Okay   Yes

d29: Submirror of d9
    State: Okay
    Size: 213369984 blocks (101 GB)
    Stripe 0: (interlace: 256 blocks)
        Device      Start Block  Dbase   State  Reloc  Hot Spare
        c0t12d0s0          0     No      Okay   Yes
        c0t13d0s0       2889     No      Okay   Yes
        c0t14d0s0       2889     No      Okay   Yes

 TEST 1 ) file write.
          Building file structure at /export/nfs/local_test/
          This test will create 62^3 = 238328 files of exactly 65536 bytes
          each. This amounts to 15,619,063,808 bytes = 14.55 GB of file data.

          - - -  WARNING TO USERS OF ZFS FILESYSTEMS  - - -
          - - - - - - - - - - - - - - - - - - - - - - - - -
          If you have compression enabled on your ZFS based filesystem then
          you will see very high numbers for the final effective compression.
          This is due to the fact that the ZFS compression algorithm is a
          block based algorithm and the data written by this code is largely
          repetitive in nature. Thus it will compress better than any
          non-regular or random data.
          - - - - - - - - - - - - - - - - - - - - - - - - -
          - - -  WARNING TO USERS OF ZFS FILESYSTEMS  - - -

          RT= 3489.821276 sec
          238328 files  avg=0.014639 sec  total=3488.777365 sec
          io_avg=4.269547 MB/sec

 TEST 2 ) file append 2048 bytes.
          Appending to file structure at /export/nfs/local_test/
          This test will append 2048 bytes to the files that were created
          in TEST 1.

          RT= 550.342414 sec
          238328 files  avg=0.002305 sec  total=549.310531 sec
          io_avg=0.847398 MB/sec

 TEST 3 ) file append 749 bytes
          Appending to file structure at /export/nfs/local_test/
          This test will append 749 bytes to the files that were created
          in TEST 1.

          RT = 859.041237 sec
          238328 files  avg=0.003599 sec  total=857.776092 sec
          io_avg=0.198465 MB/sec

RT = 4899.253981 sec

real     1:21:39.309
user       12:53.959
sys        10:02.700
#

That really, really is bad.
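As a sanity check, the TEST 1 figure works out as reported: 238,328 files of
65,536 bytes is roughly 14,896 MB written in 3,488.8 seconds, or about
4.27 MB/s of aggregate write throughput on a six-disk local UFS/SVM
filesystem, consistent with the io_avg line above.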
Peter Schuller wrote:
>> http://blogs.sun.com/roch/entry/nfs_and_zfs_a_fine
>
> So just to confirm: disabling the ZIL *ONLY* breaks the semantics of fsync()
> and synchronous writes from the application perspective; it will do *NOTHING*
> to lessen the correctness guarantee of ZFS itself, including in the case of a
> power outage?

See this blog that Roch pointed to:
http://blogs.sun.com/erickustarz/entry/zil_disable

See the sentence:
"Note: disabling the ZIL does NOT compromise filesystem integrity.
Disabling the ZIL does NOT cause corruption in ZFS."

> This makes it more reasonable to actually disable the ZIL. But still,
> personally I would like to be able to tell the NFS server to simply not be
> standards compliant, so that I can keep the correct semantics on the lower
> layer (ZFS) and disable the behavior at the level where I actually want it
> disabled (the NFS server).

This discussion belongs on the nfs-discuss alias.

eric
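For reference, on builds of this vintage the ZIL was disabled through the
global zil_disable tunable (a system-wide switch, not a per-dataset setting).
A minimal sketch of the usual procedure, for testing only; the exact steps
are a summary from memory, not a quote from the blog above:

# add the line below to /etc/system for a persistent setting (read at boot)
set zfs:zil_disable = 1

# or flip the tunable on a running kernel
# echo zil_disable/W0t1 | mdb -kw

The tunable is consulted when a dataset is mounted, so after a live change
the filesystem in question has to be unmounted and remounted before it takes
effect. Later ZFS releases replaced this tunable with the per-dataset sync
property, which did not exist at the time of this thread.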
Hans-Juergen Schnitzer writes:

 > Roch - PAE wrote:
 > >
 > > Just posted:
 > >
 > > http://blogs.sun.com/roch/entry/nfs_and_zfs_a_fine
 > >
 >
 > What role does network latency play? If I understand you correctly, even a
 > low-latency network, e.g. InfiniBand, would not increase performance
 > substantially, since the main bottleneck is that the NFS server always has
 > to write data to stable storage. Is that correct?
 >
 > Hans Schnitzer

For this load, network latency plays a role as long as it is of the same
order of magnitude as the I/O latency. Once network latency gets much smaller
than I/O latency, network latency becomes pretty much irrelevant. At times
both are of the same order of magnitude and both must be taken into account.
So if your storage is NVRAM based, or is far away, then network latency may
still be very much at play.

-r
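To put rough, illustrative numbers on that (typical figures for hardware of
the era, not taken from the thread): a gigabit Ethernet round trip of about
0.1 to 0.2 ms is negligible next to a 5 to 10 ms synchronous write to a bare
disk, so a faster network barely helps there; against an NVRAM-backed write
of a few hundred microseconds, the same network round trip is a comparable
share of the total, and a low-latency interconnect starts to matter.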
Hans-Juergen Schnitzer wrote:
> Roch - PAE wrote:
>>
>> Just posted:
>>
>> http://blogs.sun.com/roch/entry/nfs_and_zfs_a_fine
>>
>
> What role does network latency play? If I understand you correctly, even a
> low-latency network, e.g. InfiniBand, would not increase performance
> substantially, since the main bottleneck is that the NFS server always has
> to write data to stable storage. Is that correct?

Correct. You can essentially simulate the NFS semantics by doing an fsync
after every file creation and before every close on a local tar extraction.

eric

> Hans Schnitzer
Dennis Clarke writes:

 > > On Mon, Jan 08, 2007 at 03:47:31PM +0100, Peter Schuller wrote:
 > >> > http://blogs.sun.com/roch/entry/nfs_and_zfs_a_fine
 > >>
 > >> So just to confirm: disabling the ZIL *ONLY* breaks the semantics of
 > >> fsync() and synchronous writes from the application perspective; it
 > >> will do *NOTHING* to lessen the correctness guarantee of ZFS itself,
 > >> including in the case of a power outage?
 > >
 > > That is correct. ZFS, with or without the ZIL, will *always* maintain
 > > consistent on-disk state and will *always* preserve the ordering of
 > > events on-disk. That is, if an application makes two changes to the
 > > filesystem, first A, then B, ZFS will *never* show B on-disk without
 > > also showing A.
 > >
 >
 > So then, this raises the question: why do I want this ZIL animal at all?
 >

You said "correctness guarantee"; Bill said "consistent on-disk state".

The ZIL is not necessary for ZFS to keep its on-disk format consistent.
However, the ZIL is essential to provide synchronous semantics to
applications. Without a ZIL, fsync() and the like become no-ops; that is a
very uncommon requirement, although one that does exist. For ZFS to be a
correct filesystem, the ZIL is necessary, and it provides an excellent
service. My article shows that ZFS with the ZIL can be better than UFS
(which uses its own logging scheme).

-r