I just did the following test:

- Create two 10 GB files on a UFS partition

- make them available with lofiadm

- create zfs on one of them and ufs on the other

- mount and extract the FreeDB files:
  freedb-complete-20051104.tar.bz2
  This is ~ 1.9 million files in 11 dirs

- Extraction on UFS:  2:49 real   5280.010 seconds System
- Extraction on ZFS:  9:35 real   2862.890 seconds System

- find is _extremely_ slow on ZFS

- it seems that ZFS causes a lot more I/O than UFS

- sfind . | count   did not finish after 4.5 hours on ZFS

- sfind . | count   did finish after 2:45 minutes on UFS

- the calls to getdents() are _extremely_ slow on ZFS

Jörg

-- 
 EMail: joerg at schily.isdn.cs.tu-berlin.de (home)  Jörg Schilling  D-13353 Berlin
        js at cs.tu-berlin.de                (uni)
        schilling at fokus.fraunhofer.de     (work)  Blog: http://schily.blogspot.com/
 URL:   http://cdrecord.berlios.de/old/private/  ftp://ftp.berlios.de/pub/schily
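For reference, a minimal sketch of how such a lofi-backed comparison can be set
up; the file names, pool name, and mount points below are illustrative
assumptions, not taken from the original posting:

    # back the two test filesystems with 10 GB files on an existing UFS partition
    mkfile 10g /export/bench/zfs.img
    mkfile 10g /export/bench/ufs.img

    # attach them as block devices; lofiadm prints the device names
    lofiadm -a /export/bench/zfs.img     # e.g. /dev/lofi/1
    lofiadm -a /export/bench/ufs.img     # e.g. /dev/lofi/2

    # ZFS pool on the first lofi device (mounted at /benchpool by default),
    # UFS on the second
    zpool create benchpool /dev/lofi/1
    newfs /dev/rlofi/2
    mkdir -p /mnt/ufs && mount /dev/lofi/2 /mnt/ufs

    # time the extraction on each filesystem (star autodetects the bzip2 compression)
    (cd /benchpool && ptime star -x f=/var/tmp/freedb-complete-20051104.tar.bz2)
    (cd /mnt/ufs   && ptime star -x f=/var/tmp/freedb-complete-20051104.tar.bz2)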
On Wed, Nov 23, 2005 at 07:33:05PM +0100, Joerg Schilling wrote:
> I just did the following test:
>
> - Create two 10 GB files on a UFS partition
>
> - make them available with lofiadm
>
> - create zfs on one of them and ufs on the other
>
> - mount and extract the FreeDB files:
>   freedb-complete-20051104.tar.bz2
>   This is ~ 1.9 million files in 11 dirs
>
> - Extraction on UFS:  2:49 real   5280.010 seconds System
> - Extraction on ZFS:  9:35 real   2862.890 seconds System
>
> - find is _extremely_ slow on ZFS
>
> - it seems that ZFS causes a lot more I/O than UFS
>
> - sfind . | count   did not finish after 4.5 hours on ZFS
>
> - sfind . | count   did finish after 2:45 minutes on UFS
>
> - the calls to getdents() are _extremely_ slow on ZFS

You _can't_ be serious -- you're mounting ON TOP OF ANOTHER FILESYSTEM
and using that to draw performance conclusions?  Repeat your experiment
on actual, physical spindles; no one is (or should) take this kind of
analysis seriously until then.  This is not to say that there aren't
issues here (the getdents() issue in particular has been/is being
addressed), just to say that one cannot _assume_ that performance issues
found in this configuration are issues with ZFS.  Please repeat this
test on physical spindles...

	- Bryan

--------------------------------------------------------------------------
Bryan Cantrill, Solaris Kernel Development.       http://blogs.sun.com/bmc
Bryan Cantrill <bmc at eng.sun.com> wrote:

> You _can't_ be serious -- you're mounting ON TOP OF ANOTHER FILESYSTEM
> and using that to draw performance conclusions?  Repeat your experiment
> on actual, physical spindles; no one is (or should) take this kind of
> analysis seriously until then.  This is not to say that there aren't
> issues here (the getdents() issue in particular has been/is being
> addressed), just to say that one cannot _assume_ that performance issues
> found in this configuration are issues with ZFS.  Please repeat this
> test on physical spindles...

As _both_ tests (ufs vs zfs) did have the same constraints, I see no
reason to distrust my results....

If you like to repeat the test and have a real disk to do the tests,
get the archives from freedb.org

Jörg
On Wed, Nov 23, 2005 at 07:56:38PM +0100, Joerg Schilling wrote:
> Bryan Cantrill <bmc at eng.sun.com> wrote:
>
> > You _can't_ be serious -- you're mounting ON TOP OF ANOTHER FILESYSTEM
> > and using that to draw performance conclusions?  Repeat your experiment
> > on actual, physical spindles; no one is (or should) take this kind of
> > analysis seriously until then.  This is not to say that there aren't
> > issues here (the getdents() issue in particular has been/is being
> > addressed), just to say that one cannot _assume_ that performance issues
> > found in this configuration are issues with ZFS.  Please repeat this
> > test on physical spindles...
>
> As _both_ tests (ufs vs zfs) did have the same constraints, I see no
> reason to distrust my results....

The problem is that while it's not _necessarily_ invalid, it's not
necessarily _valid_ either.  UFS is _not_ an accurate simulator of a
physical device: it doesn't have head latency or rotational latency, for
example.  As a result, systems that were designed to perform well with
respect to the physical device (e.g. ZFS) can look disproportionately bad,
while systems that did not have such a design center (e.g. UFS) can look
disproportionately good.  To take an extreme example of this: if you
were to take a simple virtual device that simply took a real device and
scrambled its logical block numbers, ZFS would perform terribly -- almost
certainly worse than other systems.  And again, there may well be valid
results in your data -- but the potential presence of invalid results
discounts them, such as they are.

	- Bryan

--------------------------------------------------------------------------
Bryan Cantrill, Solaris Kernel Development.       http://blogs.sun.com/bmc
> I just did the following test:
>
> - Create two 10 GB files on a UFS partition
>
> - make them available with lofiadm

lofi doesn't have stellar performance (certainly not
disk like) and could be hampering the test results because
of all the interactions.

> - find is _extremely_ slow on ZFS

In the 27a build, both getdents and stat() are very slow;
a number of improvements went into b28/b29.

Casper
> You _can't_ be serious -- you're mounting ON TOP OF ANOTHER FILESYSTEM
> and using that to draw performance conclusions?  Repeat your experiment
> on actual, physical spindles; no one is (or should) take this kind of
> analysis seriously until then.  This is not to say that there aren't
> issues here (the getdents() issue in particular has been/is being
> addressed), just to say that one cannot _assume_ that performance issues
> found in this configuration are issues with ZFS.  Please repeat this
> test on physical spindles...

There are, indeed, many issues here.

E.g., if you don't change ufs_WRITES to 0 you may be hit by ufs
write throttling; and if the ZFS working set is larger than the
ufs working set, it may be severely penalized because of this.

Don't forget that the UFS file the filesystem resides on is
cached; in-place writing as UFS does will appear to cause
fewer writes, as they don't find their way to disk as they
would on a physical filesystem.

Casper
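For reference, a rough sketch of how the ufs_WRITES write-throttling tunable
can be turned off for such a test; the mdb idiom is standard, but treat the
exact steps as an assumption rather than something prescribed in this thread:

    # disable UFS write throttling on the running kernel (reverts on reboot)
    echo 'ufs_WRITES/W 0' | mdb -kw

    # or persistently, via /etc/system and a reboot:
    #   set ufs:ufs_WRITES=0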
Casper.Dik at Sun.COM wrote:

> > I just did the following test:
> >
> > - Create two 10 GB files on a UFS partition
> >
> > - make them available with lofiadm
>
> lofi doesn't have stellar performance (certainly not
> disk like) and could be hampering the test results because
> of all the interactions.

Please read my mail again: Both tests did use lofi.

> > - find is _extremely_ slow on ZFS
>
> In the 27a build, both getdents and stat() are very slow;
> a number of improvements went into b28/b29.

stat() is not a problem, but a single getdents() call takes 3 seconds.

It also seems to be important that getdents causes a lot of I/O in ZFS.

                   extended device statistics                   tty        cpu
device   r/s   w/s   kr/s  kw/s wait actv  svc_t  %w  %b  tin tout  us sy wt id
cmdk0  149.6   0.2 5049.2   0.4  0.0  0.9    6.1   4  70    0  116   0 49  0 50
lofi1   73.4   0.0 4696.7   0.0  0.0  0.7   10.0   1  73

Jörg
> Casper.Dik at Sun.COM wrote:
>
> > > I just did the following test:
> > >
> > > - Create two 10 GB files on a UFS partition
> > >
> > > - make them available with lofiadm
> >
> > lofi doesn't have stellar performance (certainly not
> > disk like) and could be hampering the test results because
> > of all the interactions.
>
> Please read my mail again: Both tests did use lofi.

I understand that; but ufs and zfs behave differently, and where ufs
may get away with it, zfs may not.

> > > - find is _extremely_ slow on ZFS
> >
> > In the 27a build, both getdents and stat() are very slow;
> > a number of improvements went into b28/b29.
>
> stat() is not a problem, but a single getdents() call takes 3 seconds.
>
> It also seems to be important that getdents causes a lot of I/O in ZFS.
>
>                    extended device statistics                   tty        cpu
> device   r/s   w/s   kr/s  kw/s wait actv  svc_t  %w  %b  tin tout  us sy wt id
> cmdk0  149.6   0.2 5049.2   0.4  0.0  0.9    6.1   4  70    0  116   0 49  0 50
> lofi1   73.4   0.0 4696.7   0.0  0.0  0.7   10.0   1  73

I believe it does inode prefetching, which turned out not to be a good
idea in all cases.....

Was ufs_WRITES set to 0?  If not, then any difference in write working set
may have serious repercussions.

Casper
Casper.Dik at Sun.COM wrote:

> There are, indeed, many issues here.
>
> E.g., if you don't change ufs_WRITES to 0 you may be hit by ufs
> write throttling; and if the ZFS working set is larger than the
> ufs working set, it may be severely penalized because of this.
>
> Don't forget that the UFS file the filesystem resides on is
> cached; in-place writing as UFS does will appear to cause
> fewer writes, as they don't find their way to disk as they
> would on a physical filesystem.

Write caching should have been disabled by the sticky bit.

As iostat shows that there are mainly reads, I assume that the problems
are zfs caused.

Note that I am using sfind and not Sun find, so _all_ getdents() calls
are done at once, _followed_ by a stat() loop on the resulting list.

Note that a sfind on the ufs test FS takes less than 3 minutes, and that
I currently estimate the sfind on the ZFS test FS to take approx. 9 hours.

Jörg
Casper.Dik at Sun.COM wrote:

> > stat() is not a problem, but a single getdents() call takes 3 seconds.
> >
> > It also seems to be important that getdents causes a lot of I/O in ZFS.
> >
> >                    extended device statistics                   tty        cpu
> > device   r/s   w/s   kr/s  kw/s wait actv  svc_t  %w  %b  tin tout  us sy wt id
> > cmdk0  149.6   0.2 5049.2   0.4  0.0  0.9    6.1   4  70    0  116   0 49  0 50
> > lofi1   73.4   0.0 4696.7   0.0  0.0  0.7   10.0   1  73
>
> I believe it does inode prefetching, which turned out not to be a good
> idea in all cases.....
>
> Was ufs_WRITES set to 0?  If not, then any difference in write working set
> may have serious repercussions.

As expected, setting ufs_WRITES to 0 did not change anything:

truss -d -p 101937
Base time stamp:  1132781010.2370  [ Wed Nov 23 22:23:30 CET 2005 ]
 3.7073 getdents64(4, 0xC6EA4000, 8192)             = 8192
 7.3568 getdents64(4, 0xC6EA4000, 8192)             = 8192
11.3254 getdents64(4, 0xC6EA4000, 8192)             = 8192
15.7433 getdents64(4, 0xC6EA4000, 8192)             = 8192
19.3317 getdents64(4, 0xC6EA4000, 8192)             = 8192
23.1085 getdents64(4, 0xC6EA4000, 8192)             = 8192
26.5186 getdents64(4, 0xC6EA4000, 8192)             = 8192
30.2388 getdents64(4, 0xC6EA4000, 8192)             = 8192

ufs_WRITES was set to 0 at the 11.3254 timestamp.

Jörg
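For anyone who wants the getdents64() latency distribution directly rather than
eyeballing truss timestamps, a dtrace one-liner along these lines should work
(a sketch, assuming a running sfind process; adjust the process name as needed):

    dtrace -p $(pgrep -x sfind) -n '
        syscall::getdents64:entry  /pid == $target/ { self->ts = timestamp; }
        syscall::getdents64:return /self->ts/ {
            @["getdents64 latency (ns)"] = quantize(timestamp - self->ts);
            self->ts = 0;
        }'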
Bryan Cantrill <bmc at eng.sun.com> wrote:

> > As _both_ tests (ufs vs zfs) did have the same constraints, I see no
> > reason to distrust my results....
>
> The problem is that while it's not _necessarily_ invalid, it's not
> necessarily _valid_ either.  UFS is _not_ an accurate simulator of a
> physical device: it doesn't have head latency or rotational latency, for
> example.  As a result, systems that were designed to perform well with
> respect to the physical device (e.g. ZFS) can look disproportionately bad,
> while systems that did not have such a design center (e.g. UFS) can look

I cannot speak for ZFS as I did not yet look into the sourcecode, but
UFS has been designed to perform well with physical devices.

But note: I am doing my tests because I would like to find out whether it
would make sense to use ZFS on a SchilliX CD.  CD drives behave
substantially differently from hard disk drives.  If you do the wrong
things, you blow up the read cache of the drive, and seeks are extremely
expensive on a CD.

> disproportionately good.  To take an extreme example of this: if you
> were to take a simple virtual device that simply took a real device and
> scrambled its logical block numbers, ZFS would perform terribly -- almost
> certainly worse than other systems.  And again, there may well be valid
> results in your data -- but the potential presence of invalid results
> discounts them, such as they are.

As I observe extremely high read rates on the backing media with ZFS, I am
confident in the current estimate that ZFS is 200 times slower than UFS
with getdents().  I am sure this ratio will never go down to 1 : 1 on a
different physical background storage.

My observation is:

- at the beginning of a directory, getdents() takes 0.1 .. 1 seconds
- when the directory offset reaches some limit, a single getdents()
  reaches a constant value of 3 seconds.

Jörg
Joerg Schilling wrote:
> I cannot speak for ZFS as I did not yet look into the sourcecode, but
> UFS has been designed to perform well with physical devices.

Physical devices circa 1980 ;-)
 -- richard
On Wed, Nov 23, 2005 at 07:33:05PM +0100, Joerg Schilling wrote:
> I just did the following test:
>
> - Create two 10 GB files on a UFS partition

Right, so how is this testing ZFS?

How about with a slightly more realistic situation - filesystems directly
on disk!  (Novel concept, I know....)

In all cases the hardware is a T3 disk array, with a 4-disk stripe in
hardware (no mirroring/RAID-5).  For ZFS, this is then imported as a
single filesystem in the pool.  For UFS, as a single filesystem, with
logging enabled.

In all cases the source file was stored in /tmp (tmpfs), and was dd'ed
to /dev/null before starting so that any caching was the same between
each run.  Both the ZFS and UFS filesystems were re-created between runs.

With the T3's cache disabled:
  ZFS - 17 minutes, 47 seconds real (5:35 user, 0:11 system)
  UFS - 48 minutes, 28 seconds real (5:38 user, 0:13 system)

With the T3's cache enabled:
  ZFS - 15 minutes, 30 seconds real (5:49 user, 0:13 system)
  UFS - 24 minutes, 29 seconds real (5:39 user, 0:13 system)

So realistically ZFS is _significantly_ faster than UFS (for the untar at
least).  It also seems far less reliant on the speed of the underlying
disk (as is seen by the minimal difference the hardware cache makes).

ZFS layered on top of lofi layered on top of UFS may be slower, but
that's not exactly the use case it was designed for!

> - find is _extremely_ slow on ZFS

"find", or "sfind" ?

Running "find . | wc -l" (Solaris find).  In both cases the filesystem
cache was cleared before running (zfs export/import, ufs unmount/mount).
  ZFS - 3 minutes, 39 seconds (0:07 user, 2:58 system)
  UFS - 1 minute, 16 seconds (0:06 user, 0:50 system)

As others have said, there are reasons this is slow, and it's being worked
on, but it's certainly not a matter of it taking "hours".  This is either
something wrong with sfind, or an artifact of running ZFS on lofi on UFS.

Full results below.

  Scott

## T3 cache disabled.  New ZFS and UFS filesystems created ##

# cd /zfs
# dd if=/tmp/freedb-complete-20051104.tar.bz2 of=/dev/null bs=1048576 >/dev/null
# ptime bzcat /tmp/freedb-complete-20051104.tar.bz2 | tar xf -

real    17:46.932
user     5:34.820
sys        10.825

# cd /ufs
# dd if=/tmp/freedb-complete-20051104.tar.bz2 of=/dev/null bs=1048576 >/dev/null
# ptime bzcat /tmp/freedb-complete-20051104.tar.bz2 | tar xf -

real    46:28.445
user     5:37.799
sys        12.962

## T3 cache enabled.  New ZFS and UFS filesystems created ##

# cd /zfs
# dd if=/tmp/freedb-complete-20051104.tar.bz2 of=/dev/null bs=1048576 >/dev/null
# ptime bzcat /tmp/freedb-complete-20051104.tar.bz2 | tar xf -

real    15:29.661
user     5:49.150
sys        12.808

# cd /ufs
# dd if=/tmp/freedb-complete-20051104.tar.bz2 of=/dev/null bs=1048576 >/dev/null
# ptime bzcat /tmp/freedb-complete-20051104.tar.bz2 | tar xf -

real    24:29.378
user     5:38.632
sys        12.845

## T3 cache enabled.  ZFS and UFS caches cleared ##

# cd /zfs
# ptime find . | wc -l

real     3:39.084
user        7.437
sys      2:57.592
 1872972

# cd /ufs
# ptime find . | wc -l

real     1:16.667
user        5.532
sys        49.787
 1872971
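A note on the cache-clearing step mentioned above - a rough sketch of what
"zfs export/import, ufs unmount/mount" between runs can look like (pool name
and UFS device are assumptions, not Scott's actual configuration):

    # drop cached ZFS data and metadata by re-importing the pool
    zpool export testpool && zpool import testpool

    # drop cached UFS data by remounting the filesystem
    umount /ufs && mount /dev/dsk/c1t1d0s0 /ufs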
Richard Elling - PAE wrote:
> Joerg Schilling wrote:
> > I cannot speak for ZFS as I did not yet look into the sourcecode, but
> > UFS has been designed to perform well with physical devices.
>
> Physical devices circa 1980 ;-)
>  -- richard

If your assumptions are older than your car ... :-)
Scott Howard wrote:
> On Wed, Nov 23, 2005 at 07:33:05PM +0100, Joerg Schilling wrote:
>
> > I just did the following test:
> >
> > - Create two 10 GB files on a UFS partition
>
> Right, so how is this testing ZFS?
>
> How about with a slightly more realistic situation - filesystems directly
> on disk!  (Novel concept, I know....)

I was part way through this when I saw Scott's email with some real world
test results.  I thought I'd still post, as I did things slightly
differently...

First try: Because Joerg's tests were presumably hosted on a filesystem
that was on a single physical device, I thought I'd try a one-slice ZFS
pool (/test), and a one-slice ufs filesystem (/a).  They are on the first
13GB of their respective, identical, disks, and the disks hang off the
same PCI SATA controller.  The disks may have write caching enabled, I
suspect.  (Is there an easy way to tell from within Solaris?)

I tried star-1.4.3 and Solaris tar, sfind-1.0 and Solaris find.
Differences between sfind and find performance were not interesting
(<10%), but star and tar behave very differently.

I got bored of extracting the whole database to ZFS, so I created a
~111,000 file fragment of the whole freedb .tar.bz2 (basically, I hit ^C
and tar/bzipped what I had extracted by that time. :-)  Even with the
smaller dataset there are some marked differences between ZFS and UFS
performance.  The source .tar.bz2 file is only 22MB, so I didn't worry
about repeatable caching of the source - it's down in the noise.
Filesystems were recreated in between extraction tests, and remounted
between traversal tests.

Firstly, star and tar perform very, very differently across ZFS and UFS
[all times in seconds, hope the formatting is preserved]:

        UFS     ZFS
star     70    1086
tar     402      28

For this extreme "create zillions of tiny files in two directories"
workload, Joerg's star extracts 38 times slower to ZFS than Solaris tar.
But it extracts almost 6 times faster to UFS than Solaris tar.

I got bored/puzzled during the tar extract to UFS, so I ran iostat in
another window:

# iostat -xn 2
[first report omitted as it's always silly]
                    extended device statistics
    r/s    w/s   kr/s   kw/s    wait actv  wsvc_t asvc_t  %w  %b device
    0.0    0.0    0.0    0.0     0.0  0.0     0.0    0.0   0   0 c1d0
    0.0    0.0    0.0    0.0     0.0  0.0     0.0    0.0   0   0 c2d0
    0.0  247.7    0.0  331.4 16022.4  2.0 64686.5    8.1 100 100 c3d0
    0.0    0.0    0.0    0.0     0.0  0.0     0.0    0.0   0   0 c4d0

Note the wait service time for c3d0. :-)  There was more of the same,
with wsvc_t dropping slowly.  Clearly star and tar do things very very
differently!

As for sfind versus find, the comparison is much less interesting:

        UFS     ZFS
sfind   1.43    3.16
find    1.41    3.10

Just as Scott found, ZFS as delivered in NV27a is definitely slower for
this test, and the difference between sfind and find is negligible.  But
it's not pathologically slow like Joerg's ZFS-over-lofi-over-UFS test
setup would suggest.

Cheers,
 Jason =:^)

ps to Joerg: imho, the only reason to ever extract that freedb tarfile on
something other than a ReiserFS filesystem would be to ingest it into a
database.  You wouldn't design such a data organisation unless you were
assuming something like ReiserFS in any case.  Surely something less
pathological (like, say, the SchilliX live CD filesystem?) would be more
appropriate as a performance test of ZFS?  Especially when that's
apparently what you were thinking of using ZFS for anyway...
Raw command output included below for completeness (filesystem fiddling
not shown).

star extract to UFS:

$ (cd /a/jao ; time -p star xf /mnt/jao/freedb-piece.bz2 )
star: WARNING: Archive is bzip2 compressed, trying to use the -bz option.
star: 18635 blocks + 0 bytes (total of 190822400 bytes = 186350.00k).
real 70.04
user 15.80
sys 16.58

star extract to ZFS:

$ (cd /test/jao ; time -p star xf /mnt/jao/freedb-piece.bz2 )
star: WARNING: Archive is bzip2 compressed, trying to use the -bz option.
star: current './' newer.
star: 18635 blocks + 0 bytes (total of 190822400 bytes = 186350.00k).
real 1086.52
user 15.19
sys 16.25

tar extract to UFS:

$ (cd /a/jao ; time -p bzcat /mnt/jao/freedb-piece.bz2 | tar xf - )
real 402.70
user 16.23
sys 14.31

tar extract to ZFS:

$ (cd /test/jao ; time -p bzcat /mnt/jao/freedb-piece.bz2 | tar xf - )
real 28.06
user 16.13
sys 9.01

sfind on UFS:

$ (cd /a/jao ; time -p sfind . | wc )
  110966  110966 1775424
real 1.43
user 0.06
sys 0.64

sfind on ZFS:

$ (cd /test/jao ; time -p sfind . | wc )
  110966  110966 1775424
real 3.16
user 0.07
sys 2.07

find on UFS:

$ (cd /a/jao ; time -p find . | wc )
  110966  110966 1775424
real 1.41
user 0.08
sys 0.65

find on ZFS:

$ (cd /test/jao ; time -p find . | wc )
  110966  110966 1775424
real 3.10
user 0.12
sys 2.00
I did some tests on my laptop (2.2GHz P4, 5400RPM ATA) running Nexenta
with ZFS on a partition.  Extracting the tarball took 71m31.297s,
"find . >/dev/null" took 85m8.983s, and, possibly most alarming,
"rm -rf *" took 270m13.031s.

It isn't exactly a fair comparison, but I tried the same with reiser4 on
my Gentoo desktop (2.4GHz amd64, 7200RPM ATA with 8MB cache).  The
tarball extracted in 30m15.974s, find took 1m49.597s, and rm, which is
normally a weakness for reiser4, took 14m29.083s.

Hardware differences might be responsible for extracting twice as fast,
but find is over 46 times as fast, and rm is over 18 times as fast on
reiser4.

I'll test build 29 on my amd64 machine for a fair comparison when there's
a SchilliX or Nexenta build of it.
Richard Elling - PAE <Richard.Elling at Sun.COM> wrote:
> Joerg Schilling wrote:
> > I cannot speak for ZFS as I did not yet look into the sourcecode, but
> > UFS has been designed to perform well with physical devices.
>
> Physical devices circa 1980 ;-)

For supporting devices from 1980, you need extra effort that could be
omitted today.

If there is documentation on how ZFS does device optimization, please
send me a pointer.  I will read and comment....

Jörg
Jason Ozolins <Jason.Ozolins at anu.edu.au> wrote:

> First try: Because Joerg's tests were presumably hosted on a filesystem that
> was on a single physical device, I thought I'd try a one-slice ZFS pool
> (/test), and a one-slice ufs filesystem (/a).  They are on the first 13GB of
> their respective, identical, disks, and the disks hang off the same PCI
> SATA controller.  The disks may have write caching enabled, I suspect.  (Is
> there an easy way to tell from within Solaris?)  I tried star-1.4.3 and

star-1.4.3 is _extremely_ old.  I would recommend to use star-1.5a70.

> Solaris tar, sfind-1.0 and Solaris find.  Differences between sfind and find
> performance were not interesting (<10%), but star and tar behave very
> differently.

Sun find is done the ancient way, while sfind uses recent technologies to
operate.  Sfind is typically 10% faster than Sun find.

Sun find spends slightly less user CPU time than sfind.  This is caused by
the fact that treewalk() in sfind calls the walk callback function, which
does some checks and then calls the pure find expression interpreter.  The
callback function in Sun find directly calls one big and unstructured
function... making it impossible to create a find library out of Sun find,
as was done with sfind for star.

However, as sfind reads all directories in one big chunk _before_ starting
to work on the list (as all modern software does), the tree walker from
sfind is faster than nftw() used by Sun find.  This causes sfind to need
less system CPU time than Sun find, and this is why sfind is a bit faster
than Sun find.

The differences between Sun tar and star are caused by the fact that star
operates in a secure and comprehensible way, while Sun tar cannot even
tell whether there have been specific problems during the extract
operation.

In order to tell whether star was able to extract a file correctly, it
calls fsync(f) and close(f) for every file and checks the return codes.
This causes a 10-20% performance penalty on UFS (depending on file sizes).
If you like to find the performance penalty caused by the fact that star
is able to tell you whether it did work correctly, compare the time that
you need to run:

	star -xp f=xxx
and
	star -xp f=xxx -no-fsync

BTW: On Linux and ext2, I did see a 400% performance penalty when running
in secure (default) fsync mode.  This is caused by the fact that ext2 +
Linux starts disk transfers very late and needs a long time to finally
sync the FS cache to the disk.

> I got bored of extracting the whole database to ZFS, so I created a ~111,000
> file fragment of the whole freedb .tar.bz2 (basically, I hit ^C and
> tar/bzipped what I had extracted by that time. :-)  Even with the smaller
> dataset there are some marked differences between ZFS and UFS performance.
> The source .tar.bz2 file is only 22MB, so I didn't worry about repeatable
> caching of the source - it's down in the noise.  Filesystems were recreated
> in between extraction tests, and remounted between traversal tests.
>
> Firstly, star and tar perform very, very differently across ZFS and UFS [all
> times in seconds, hope the formatting is preserved]:
>
>         UFS     ZFS
> star     70    1086
> tar     402      28

This is _really_ interesting!

But note first: you cannot compare these times without checking the state
of the FS at the time when the untar operation did finish.  You should at
least run two sync(1) calls after the tar extract and include the sync
time in the time of the untar operation.  Otherwise you just compared
times for two completely unknown and different tasks.

If you like to compare ZFS vs. UFS, I recommend to run 4 tests:

	star -x on UFS, star -x -no-fsync on UFS,
	star -x on ZFS, star -x -no-fsync on ZFS

If you like to compare Sun tar vs. star, you need to understand what both
programs do.  A Sun tar test without syncing at the end of the operation
is worthless.

[snipped part]

> As for sfind versus find, the comparison is much less interesting:
>
>         UFS     ZFS
> sfind   1.43    3.16
> find    1.41    3.10

My tests look different (currently I may only give UFS results for a real
disk device).  This is the whole FreeDB tree:

sfind . > /dev/null
2:11.517r       3.590u  69.330s 55%     0M 0+0k 0st 0+0io 0pf+0w
find . > /dev/null
2:17.228r       4.680u  73.300s 56%     0M 0+0k 0st 0+0io 0pf+0w

This shows:

Sfind in this case needs 30% less USER CPU time than Sun find.
Sfind in this case needs 6% less SYSTEM CPU time than Sun find.

Jörg
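A sketch of how the four recommended runs might be driven so the timings are
comparable (mount points are assumptions; recreate each filesystem before its
runs, as done elsewhere in this thread):

    for fs in /mnt/ufs /benchpool; do
        for opt in "" "-no-fsync"; do
            ptime sh -c "cd $fs && star -x $opt f=/var/tmp/freedb-complete-20051104.tar.bz2 && sync && sync"
        done
    done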
Scott Howard <Scott.Howard at Sun.COM> wrote:

> > - Create two 10 GB files on a UFS partition
>
> Right, so how is this testing ZFS?

The way I did describe....  It shows e.g. that ZFS does a lot more I/O
than UFS.

> How about with a slightly more realistic situation - filesystems directly
> on disk!  (Novel concept, I know....)

????

> In all cases the hardware is a T3 disk array, with a 4-disk stripe in
> hardware (no mirroring/RAID-5).  For ZFS, this is then imported as a
> single filesystem in the pool.  For UFS, as a single filesystem, with
> logging enabled.
> In all cases the source file was stored in /tmp (tmpfs), and was dd'ed
> to /dev/null before starting so that any caching was the same between
> each run.  Both the ZFS and UFS filesystems were re-created between runs.
>
> With the T3's cache disabled:
>   ZFS - 17 minutes, 47 seconds real (5:35 user, 0:11 system)
>   UFS - 48 minutes, 28 seconds real (5:38 user, 0:13 system)
>
> With the T3's cache enabled:
>   ZFS - 15 minutes, 30 seconds real (5:49 user, 0:13 system)
>   UFS - 24 minutes, 29 seconds real (5:39 user, 0:13 system)

I am not sure whether this could be called realistic.  It uses extremely
expensive hardware.

> > - find is _extremely_ slow on ZFS
>
> "find", or "sfind" ?

Of course I use sfind, because Sun find is using an ancient method of
looping over getdents() ... stat() ... getdents() on the same directory,
while sfind uses the modern and POSIX 2001 compliant approach of first
reading in the whole list of names from a directory and then looping over
the names.

> Running "find . | wc -l" (Solaris find).  In both cases the filesystem
> cache was cleared before running (zfs export/import, ufs unmount/mount).
>   ZFS - 3 minutes, 39 seconds (0:07 user, 2:58 system)
>   UFS - 1 minute, 16 seconds (0:06 user, 0:50 system)

So the only thing you could prove with this test is that a find on ZFS
is still slower on extremely expensive HW than on emulated cheap HW
using UFS.

I am currently running a test on real hardware, but my impression is that
this will not give significantly different results than my first test.
I'll report later.

Jörg
Joerg Schilling wrote:
> Jason Ozolins <Jason.Ozolins at anu.edu.au> wrote:

> star-1.4.3 is _extremely_ old.  I would recommend to use star-1.5a70.

I built 1.5a70, and I don't think it made much difference to the timings.
It is what I have used for the tests below.

> In order to tell whether star was able to extract a file correctly, it
> calls fsync(f) and close(f) for every file and checks the return codes.
> This causes a 10-20% performance penalty on UFS (depending on file sizes).

And obviously much more on ZFS...

> > Firstly, star and tar perform very, very differently across ZFS and UFS
> > [all times in seconds, hope the formatting is preserved]:
> >
> >         UFS     ZFS
> > star     70    1086
> > tar     402      28
>
> This is _really_ interesting!
>
> But note first: you cannot compare these times without checking the state
> of the FS at the time when the untar operation did finish.  You should at
> least run two sync(1) calls after the tar extract and include the sync
> time in the time of the untar operation.  Otherwise you just compared
> times for two completely unknown and different tasks.
>
> If you like to compare ZFS vs. UFS, I recommend to run 4 tests:
>
> 	star -x on UFS, star -x -no-fsync on UFS,
> 	star -x on ZFS, star -x -no-fsync on ZFS

These new timings include both the extraction and a umount/zfs export to
totally ensure all data made it to the disk.  When the filesystems are
idle, the umount or export is basically instant, so I've assumed that any
time taken by the umount/export in this case is just to push data out to
disk.

For my ~111,000 file fragment of the freedb archive:

                UFS     ZFS
star             72    1080
tar             420      33
star -no-fsync   57      32

Times rounded to the nearest second because I make no representation of
their exact repeatability anyway.  The umount/export made some
difference, but not a whole heap.

Strangeness did happen once though: the tar extract to UFS ran > 3 times
faster...  From two runs:

timing tar extract to /ufs...
real 133.29
user 16.33
sys 12.69

timing tar extract to /ufs...
real 419.92
user 16.29
sys 12.76

Note that the user time and system time are essentially equal across the
two runs.  The filesystem is recreated before each extraction, so data
placement can't be part of the effect.  iostat on one of the slow runs
showed the tar extraction provoking extremely long ( > 30,000 ) pending
I/O queues.  Unfortunately I didn't have iostat going during the fast
run. :-(  Any ideas, Sun folks?

Anyway, looks like (for this extreme case of zillions of tiny files)
fsync is the cause of the huge blowout in extraction time.

Given that fsync is the issue, I just extracted the whole database to my
poky 6GB one-slice ZFS volume with star -no-fsync:

real 564.54
user 244.09
sys 181.85

And I just ran find over the whole thing after export/import of the pool:

# time -p find . | wc
 1872970 1872970 31616653
real 338.16
user 2.21
sys 53.77

Not as fast as Scott Howard's test, but then again it's a single consumer
disk.

Sfind is indeed faster than find on the whole freedb archive:

# (cd /zfs/freedb ; time -p sfind . | wc)
 1872970 1872970 31616653
real 246.00
user 1.34
sys 55.52

Certainly ZFS doesn't seem to be pathologically slow on this real world
disk.

I'm now extracting with star -no-fsync to my UFS volume, and by contrast
the thing is taking lots longer, and the filesystem's gone completely
unresponsive (huge I/O queue again; ls -al of the filesystem root is
stalled in an uninterruptible system call for many seconds on end when I
hit ^C).  And star is actually stuck in something quite noninterruptible
right now too, based on what happens when I try to truss it.  So I'm kind
of happy with ZFS right now. :-)

-Jason
(going home, not waiting for the UFS extract to complete)

ps: I still think that this is a totally crazy workload/data organisation
to give to any filesystem other than ReiserFS.  Do many users really
put > 100,000 files in a directory?
[ ... ]
> ps: I still think that this is a totally crazy workload/data organisation
> to give to any filesystem other than ReiserFS.  Do many users really
> put > 100,000 files in a directory?

You've never worked in Services for any of the major system vendors,
have you ?  ;-)

Complaints about "bad performance" when doing this are _really_ frequent.
We tend to get around ten requests a year about ufs' limit of 32765
subdirectories.  And more about (bad) performance with "manymany" files
in one directory.

I agree with you in the sense that I don't understand _why_ people want
to do this (and I wouldn't).  But I recognize that they are.  Whatever
reasons they have, if we fail to do this well then we lose standing.
Hence the test is a valid one.

Best regards,
FrankH.
Jason Ozolins <Jason.Ozolins at anu.edu.au> wrote:

> > star-1.4.3 is _extremely_ old.  I would recommend to use star-1.5a70.
>
> I built 1.5a70, and I don't think it made much difference to the timings.
> It is what I have used for the tests below.

Parts did become faster and other parts did become slower.  As recent
star versions include validity checking for the correctness of file meta
data, in theory it should be slower than implementations like Sun tar
that do not care.

> > In order to tell whether star was able to extract a file correctly, it
> > calls fsync(f) and close(f) for every file and checks the return codes.
> > This causes a 10-20% performance penalty on UFS (depending on file sizes).
>
> And obviously much more on ZFS...

This is something that should be worked on!
Using fsync() should not cause a performance penalty of 34x;
this points to deficits in the buffer/cache implementation.

> > If you like to compare ZFS vs. UFS, I recommend to run 4 tests:
> >
> > 	star -x on UFS, star -x -no-fsync on UFS,
> > 	star -x on ZFS, star -x -no-fsync on ZFS
>
> These new timings include both the extraction and a umount/zfs export to
> totally ensure all data made it to the disk.  When the filesystems are
> idle, the umount or export is basically instant, so I've assumed that any
> time taken by the umount/export in this case is just to push data out to
> disk.
>
> For my ~111,000 file fragment of the freedb archive:
>
>                 UFS     ZFS
> star             72    1080
> tar             420      33
> star -no-fsync   57      32
>
> Times rounded to the nearest second because I make no representation of
> their exact repeatability anyway.  The umount/export made some
> difference, but not a whole heap.

Interesting to see that star is still a bit faster than Sun tar, because
the speed optimizations in star have all been made for the create mode
and not for the extract mode.

> Strangeness did happen once though: the tar extract to UFS ran > 3 times
> faster...  From two runs:
>
> timing tar extract to /ufs...
> real 133.29
> user 16.33
> sys 12.69
>
> timing tar extract to /ufs...
> real 419.92
> user 16.29
> sys 12.76

This is really strange.
The user/system timings are identical.

> Anyway, looks like (for this extreme case of zillions of tiny files)
> fsync is the cause of the huge blowout in extraction time.

And this is why I have real respect for the UFS implementation....

> Given that fsync is the issue, I just extracted the whole database to my
> poky 6GB one-slice ZFS volume with star -no-fsync:
>
> real 564.54
> user 244.09
> sys 181.85

The whole 1.9 million files?  Could you name numbers for your machine?
RAM size, CPU type/speed, disk type/speed?

> And I just ran find over the whole thing after export/import of the pool:
> # time -p find . | wc
>  1872970 1872970 31616653
> real 338.16
> user 2.21
> sys 53.77
>
> Not as fast as Scott Howard's test, but then again it's a single consumer
> disk.
>
> Sfind is indeed faster than find on the whole freedb archive:
> # (cd /zfs/freedb ; time -p sfind . | wc)
>  1872970 1872970 31616653
> real 246.00
> user 1.34
> sys 55.52

If this is both ZFS, I would like to know how much RAM you were using.
It seems that ZFS does not behave nicely on internal ZFS meta data cache
misses.

> Certainly ZFS doesn't seem to be pathologically slow on this real world
> disk.

It is for me...  Tested on a machine with 1280 megabytes of RAM.

> I'm now extracting with star -no-fsync to my UFS volume, and by contrast
> the thing is taking lots longer, and the filesystem's gone completely
> unresponsive (huge I/O queue again; ls -al of the filesystem root is
> stalled in an uninterruptible system call for many seconds on end when I
> hit ^C).

This is something I also noted yesterday.  It seems that otherwise the
buffer cache gets full and makes the system sticky.

> And star is actually stuck in something quite noninterruptible right now
> too, based on what happens when I try to truss it.  So I'm kind of happy
> with ZFS right now. :-)

Did it stick forever?

Jörg
Joerg Schilling wrote:
> Jason Ozolins <Jason.Ozolins at anu.edu.au> wrote:

> > > In order to tell whether star was able to extract a file correctly, it
> > > calls fsync(f) and close(f) for every file and checks the return codes.
> > > This causes a 10-20% performance penalty on UFS (depending on file sizes).
> >
> > And obviously much more on ZFS...
>
> This is something that should be worked on!
> Using fsync() should not cause a performance penalty of 34x;
> this points to deficits in the buffer/cache implementation.

The 34x penalty is in an extreme case of extracting lots of very small
files.  For larger files, the latency enforced by the fsync would not be
so apparent.

I'm trying to think of the use cases where fsync is valuable for tar
extraction:

1. the system dies during the extraction, and you were relying on verbose
   output to tell you exactly what's been extracted, and that output is
   preserved for you to refer to
2. the system dies shortly after the extraction completes and you have
   taken the completion of the command as a signal that all I/O associated
   with the command is complete
3. you want to detect any I/O errors resulting from the extraction
4. ?

Case 1 seems pretty marginal to me.

I'd really like a mechanism to enforce case 2 at the user level though,
like "please synchronously checkpoint all my outstanding I/O".  Sync does
too much (an ordinary user shouldn't be able to mess with I/O policy for
the whole machine) and too little (you can't tell whether the data's
really made it to stable storage when the sync system call returns).
This would be good for scripting where you don't have control over the
behaviour of individual programs, like Solaris tar for instance.

Case 3 is problematic anyway, because in a logging filesystem the data
might be written to a temporary log and not to its eventual destination
at the time the command completes.  The fact that it has reached stable
storage is not enough to guarantee that nothing else can go wrong... :-)

> > Strangeness did happen once though: the tar extract to UFS ran > 3 times
> > faster...  From two runs:
> >
> > timing tar extract to /ufs...
> > real 133.29
> > user 16.33
> > sys 12.69
> >
> > timing tar extract to /ufs...
> > real 419.92
> > user 16.29
> > sys 12.76
>
> This is really strange.
> The user/system timings are identical.

Indeed.  My vague guess is that there's a kernel thread that periodically
flushes dirty pages (buffers? showing my ignorance here) to disk, and
that the start time of the extraction has to be in a certain phase range
of that kernel thread's cycle for the extraction to happen quickly.  The
/ufs file system was mounted with logging enabled, BTW.

> > Given that fsync is the issue, I just extracted the whole database to my
> > poky 6GB one-slice ZFS volume with star -no-fsync:
> >
> > real 564.54
> > user 244.09
> > sys 181.85
>
> The whole 1.9 million files?  Could you name numbers for your machine?
> RAM size, CPU type/speed, disk type/speed?

Yes, the whole lot, as reported by the find/sfind runs below.  It's a
pretty boring machine by modern desktop standards, except for the number
of disks:

Athlon 64 3200 (socket 754, 2GHz, 1MB L2 cache)
Asus K8N-E motherboard, nForce3-250 chipset, 1GB of RAM
4 * Seagate 160GB 7200RPM SATA disks
  - 2 on nForce SATA controller
  - 2 on Silicon Image 3114 SATA controller
  - disks are NCQ capable, but the controllers aren't

Both the /ufs and /zfs test filesystems were single slices from the start
of the disks attached to the Silicon Image controller.

bzcat may like having 1MB of L2 cache:

-bash-3.00# time -p bzcat ~jao900/freedb-complete-20051104.tar.bz2 > /dev/null
real 226.55
user 219.49
sys 0.94

So the tar extraction part actually takes 338 seconds real time.  (My
star extraction timing listed above is for the whole pipeline.)

> > And I just ran find over the whole thing after export/import of the pool:
> > # time -p find . | wc
> >  1872970 1872970 31616653
> > real 338.16
> > user 2.21
> > sys 53.77
> >
> > Not as fast as Scott Howard's test, but then again it's a single consumer
> > disk.
> >
> > Sfind is indeed faster than find on the whole freedb archive:
> > # (cd /zfs/freedb ; time -p sfind . | wc)
> >  1872970 1872970 31616653
> > real 246.00
> > user 1.34
> > sys 55.52
>
> If this is both ZFS, I would like to know how much RAM you were using.
> It seems that ZFS does not behave nicely on internal ZFS meta data cache
> misses.

1GB of RAM, no tweaks to any memory management policy tunables that might
exist.  The filesystems were exported/imported before each run of
find/sfind, so the cache was clear of any ZFS metadata.

> > Certainly ZFS doesn't seem to be pathologically slow on this real world
> > disk.
>
> It is for me...  Tested on a machine with 1280 megabytes of RAM.

Now that's odd.  You've got more RAM than me...  I wasn't running
anything except a couple of SSH sessions and the dtlogin window on the
test machine at the time, so it should have had maybe 850-900MB of
available memory.

> > I'm now extracting with star -no-fsync to my UFS volume, and by contrast
> > the thing is taking lots longer, and the filesystem's gone completely
> > unresponsive (huge I/O queue again; ls -al of the filesystem root is
> > stalled in an uninterruptible system call for many seconds on end when I
> > hit ^C).
>
> This is something I also noted yesterday.  It seems that otherwise the
> buffer cache gets full and makes the system sticky.
>
> > And star is actually stuck in something quite noninterruptible right now
> > too, based on what happens when I try to truss it.  So I'm kind of happy
> > with ZFS right now. :-)
>
> Did it stick forever?

No.  I came back this morning to find that my 6GB partition didn't have
enough inodes (d'oh!).  I'll try again tonight when I don't need the
machine to be responsive.

-Jason
If I remember correctly, sync(1) and fsync(3c) actually work as
advertised on ZFS, and guarantee that data has been written to disk
before the syscall returns.  Thus, there will be more overhead than on
UFS (where the pages are flushed, but may or may not have reached stable
storage).

I guess the question (as Jason points out) is what are you trying to
accomplish?  If you really want the data for each file on disk between
each extraction, then you are getting what you asked for on ZFS, unlike
UFS.  I don't see any value to the (broken) UFS semantics.

That being said, there is always performance to be gained.  But I
wouldn't expect a 34x improvement any time soon.

- Eric

On Mon, Nov 28, 2005 at 12:09:16PM +1100, Jason Ozolins wrote:
> Joerg Schilling wrote:
>
> > This is something that should be worked on!
> > Using fsync() should not cause a performance penalty of 34x;
> > this points to deficits in the buffer/cache implementation.
>
> The 34x penalty is in an extreme case of extracting lots of very small
> files.  For larger files, the latency enforced by the fsync would not be
> so apparent.
>
> I'm trying to think of the use cases where fsync is valuable for tar
> extraction:
> 1. the system dies during the extraction, and you were relying on verbose
>    output to tell you exactly what's been extracted, and that output is
>    preserved for you to refer to
> 2. the system dies shortly after the extraction completes and you have
>    taken the completion of the command as a signal that all I/O associated
>    with the command is complete
> 3. you want to detect any I/O errors resulting from the extraction
> 4. ?
>
> Case 1 seems pretty marginal to me.
>
> I'd really like a mechanism to enforce case 2 at the user level though,
> like "please synchronously checkpoint all my outstanding I/O".  Sync does
> too much (an ordinary user shouldn't be able to mess with I/O policy for
> the whole machine) and too little (you can't tell whether the data's
> really made it to stable storage when the sync system call returns).
> This would be good for scripting where you don't have control over the
> behaviour of individual programs, like Solaris tar for instance.
>
> Case 3 is problematic anyway, because in a logging filesystem the data
> might be written to a temporary log and not to its eventual destination
> at the time the command completes.  The fact that it has reached stable
> storage is not enough to guarantee that nothing else can go wrong... :-)

--
Eric Schrock, Solaris Kernel Development       http://blogs.sun.com/eschrock
Eric is right that sync() for ZFS guarantees all outstanding
data and meta data has been written to stable storage.  UFS
has indeed only ever scheduled the IO but not waited for its
completion.  However, fsync() has always been synchronous for
UFS (and ZFS).  So I'm concerned about the severe performance
degradation noticed.  I haven't seen that in my own benchmarking
of fsync().  I'll try to reproduce this.

Neil.

Eric Schrock wrote on 11/27/05 18:48:
> If I remember correctly, sync(1) and fsync(3c) actually work as
> advertised on ZFS, and guarantee that data has been written to disk
> before the syscall returns.  Thus, there will be more overhead than on
> UFS (where the pages are flushed, but may or may not have reached stable
> storage).
>
> I guess the question (as Jason points out) is what are you trying to
> accomplish?  If you really want the data for each file on disk between
> each extraction, then you are getting what you asked for on ZFS, unlike
> UFS.  I don't see any value to the (broken) UFS semantics.
>
> That being said, there is always performance to be gained.  But I
> wouldn't expect a 34x improvement any time soon.
>
> - Eric
>
> On Mon, Nov 28, 2005 at 12:09:16PM +1100, Jason Ozolins wrote:
>
> > Joerg Schilling wrote:
> >
> > > This is something that should be worked on!
> > > Using fsync() should not cause a performance penalty of 34x;
> > > this points to deficits in the buffer/cache implementation.
> >
> > The 34x penalty is in an extreme case of extracting lots of very small
> > files.  For larger files, the latency enforced by the fsync would not be
> > so apparent.
> >
> > I'm trying to think of the use cases where fsync is valuable for tar
> > extraction:
> > 1. the system dies during the extraction, and you were relying on verbose
> >    output to tell you exactly what's been extracted, and that output is
> >    preserved for you to refer to
> > 2. the system dies shortly after the extraction completes and you have
> >    taken the completion of the command as a signal that all I/O associated
> >    with the command is complete
> > 3. you want to detect any I/O errors resulting from the extraction
> > 4. ?
> >
> > Case 1 seems pretty marginal to me.
> >
> > I'd really like a mechanism to enforce case 2 at the user level though,
> > like "please synchronously checkpoint all my outstanding I/O".  Sync does
> > too much (an ordinary user shouldn't be able to mess with I/O policy for
> > the whole machine) and too little (you can't tell whether the data's
> > really made it to stable storage when the sync system call returns).
> > This would be good for scripting where you don't have control over the
> > behaviour of individual programs, like Solaris tar for instance.
> >
> > Case 3 is problematic anyway, because in a logging filesystem the data
> > might be written to a temporary log and not to its eventual destination
> > at the time the command completes.  The fact that it has reached stable
> > storage is not enough to guarantee that nothing else can go wrong... :-)

--
Neil
On Sun, Nov 27, 2005 at 09:28:30PM -0700, Neil Perrin wrote:
> Eric is right that sync() for ZFS guarantees all outstanding
> data and meta data has been written to stable storage.  UFS
> has indeed only ever scheduled the IO but not waited for its
> completion.  However, fsync() has always been synchronous for
> UFS (and ZFS).  So I'm concerned about the severe performance
> degradation noticed.  I haven't seen that in my own benchmarking
> of fsync().  I'll try to reproduce this.
>
> Neil.

Obviously I didn't remember quite right ;-)  What's an extra 'f', anyway?
Thanks for the clarification.

- Eric
> If I remember correctly, sync(1) and fsync(3c) actually work as
> advertised on ZFS, and guarantee that data has been written to disk
> before the syscall returns.  Thus, there will be more overhead than on
> UFS (where the pages are flushed, but may or may not have reached stable
> storage).
>
> I guess the question (as Jason points out) is what are you trying to
> accomplish?  If you really want the data for each file on disk between
> each extraction, then you are getting what you asked for on ZFS, unlike
> UFS.  I don't see any value to the (broken) UFS semantics.
>
> That being said, there is always performance to be gained.  But I
> wouldn't expect a 34x improvement any time soon.

fsync() does guarantee that it won't return until the
data is on stable storage, unlike sync().

However, UFS fsync is less than useful because it doesn't guarantee
that the inode and the directory entry and the directory inode have
reached stable storage.

Not sure what zfs guarantees?  Does it guarantee that the ueberblock
which points to the newly created file and its data is on disk?

Casper
Casper.Dik at sun.com wrote on 11/28/05 02:19:
>
> fsync() does guarantee that it won't return until the
> data is on stable storage, unlike sync().
>
> However, UFS fsync is less than useful because it doesn't guarantee
> that the inode and the directory entry and the directory inode have
> reached stable storage.

With UFS logging (now the default) the directory entry and inode are on
stable storage (in the log) by return from the fsync.  This may not have
been so without logging.

> Not sure what zfs guarantees?  Does it guarantee that the ueberblock
> which points to the newly created file and its data is on disk?

ZFS will push all data and meta data to the intent log by return from
fsync().  The uber block will not be updated until the DMU transaction
group commits - potentially seconds later.  However, on power loss or
panic the intent log is replayed, which will recreate the transactions
and flush them by forcing the DMU transaction group to commit.  So the
uber block will contain the tree with the newly created file.

Neil
Jason Ozolins <Jason.Ozolins at anu.edu.au> wrote:> > This is something that shouild been worked on! > > Using fsync() should not cause a performance penalty of 34x, > > this leads to deficits in the buffer/cache implementation. > > The 34x penalty is in an extreme case of extracting lots of very small files. > For larger files, the latency enforced by the fsync would not be so apparent.I get the following numbers: UFS ZFS star -x 1:55:20 14:55:20 star -x -no-fsync 1:57:16 3:57:08 So comparing the fsync() case for UFS vs. ZFS, I get a factor of ~ 7.8x This is something that could be worked on. rm -rf is taking 27 minutes on UFS and 5 hours on ZFS. This is something that needs to be worked on.> I''m trying to think of the use cases where fsync is valuable for tar extraction: > 1. the system dies during the extraction, and you were relying on verbose > output to tell you exactly what''s been extracted, and that output is > preserved for you to refer to > 2. the system dies shortly after the extraction completes and you have taken > the completion of the command as a signal that all I/O associated with the > command is complete > 3. you want to detect any i/o errors resulting from the extraction > 4. ?I like usrs of star to be able to evaluate the exit code and take exit code 0 as a signal that extraction was definitely OK.> I''d really like a mechanism to enforce case 2 at the user level though, like > "please synchronously checkpoint all my outstanding I/O". Sync does too much > (an ordinary user shouldn''t be able to mess with I/O policy for the whole > machine) and too little (you can''t tell whether the data''s really made it to > stable storage when the sync system call returns). This would be good for > scripting where you don''t have control over the behaviour of individual > programs, like Solaris tar for instance.The problem is that you cannot ask the system to report whether the whole star -x run was successful at the end by calling something magical. The only way is the ask this for every file and this could only be done via fsync() calls at the end of the extraction of every file.> Case 3 is problematic anyway, because in a logging filesystem the data might > be written to a temporary log and not to its eventual destination at the time > the command completes. The fact that it has reached stable storage is not > enough to guarantee that nothing else can go wrong... :-)In this case, the logging system would need to signal an exception and keep the log data.> > The whole 1.9 million files? Could you name numbers for yout machine? > > RAM size, CPU type/speed, Disk type/speed? > > Yes, the whole lot, as reported by the find/sfind runs below. It''s a pretty > boring machine by modern desktop standards, except for the number of disks: > Athlon 64 3200 (socket 754, 2GHz, 1MB L2 cache) > Asus K8N-E motherboard, nForce3-250 chipset, 1GB of RAM > 4 * Seagate 160GB 7200RPM SATA disks > - 2 on nForce SATA controller > - 2 on Silicon Image 3114 SATA controller > - disks are NCQ capable, but the controllers aren''tSo it is not a really fast machine with plenty of RAM.> -bash-3.00# time -p bzcat ~jao900/freedb-complete-20051104.tar.bz2 > /dev/null > real 226.55 > user 219.49 > sys 0.94 > > So the tar extraction part actually takes 338 seconds real time. (my star > extraction timing listed above is for the whole pipeline).Did you susbtract numbers? Star always runs in two processes (a background FIFO process and a foreground extract process). 
In the case of decompression, there is another process (the decompressor)
that may run simultaneously, e.g. during the I/O wait times.

> >> And I just ran find over the whole thing after export/import of the pool:
> >> # time -p find . | wc
> >>  1872970 1872970 31616653
> >> real 338.16
> >> user 2.21
> >> sys 53.77
> >>
> >> Not as fast as Scott Howard's test, but then again it's a single
> >> consumer disk.
> >>
> >> Sfind is indeed faster than find on the whole freedb archive:
> >> # (cd /zfs/freedb ; time -p sfind . | wc)
> >>  1872970 1872970 31616653
> >> real 246.00
> >> user 1.34
> >> sys 55.52
> >
> > If this is both ZFS, I would like to know how much RAM you were using.
> > It seems that ZFS does not behave nicely on internal ZFS metadata
> > cache misses.
>
> 1GB of RAM, no tweaks to any memory management policy tunables that
> might exist. The filesystems were exported/imported before each run of
> find/sfind, so the cache was clear of any ZFS metadata.

It is unclear to me why I see the opposite behavior.

Jörg
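(For what it's worth, below is the sort of script I would use to keep such
runs comparable. The pool name, mountpoint and archive path are
placeholders, the star options simply mirror the ones quoted earlier in
the thread, and ptime is only there for per-command timing - a sketch,
not a recipe.)

    #!/bin/sh
    # Sketch of a repeatable comparison run on a ZFS pool.
    POOL=testpool                    # placeholder names
    FS=/testpool/freedb
    ARCHIVE=/var/tmp/freedb-complete-20051104.tar.bz2

    coldstart() {
            # Export/import so each run starts with a cold metadata
            # cache, as in the find/sfind runs above.  The cwd must be
            # outside the pool while it is exported.
            cd /
            zpool export $POOL || exit 1
            zpool import $POOL || exit 1
            cd $FS || exit 1
    }

    coldstart
    ptime sh -c "bzcat $ARCHIVE | star -x"             # per-file fsync()
    echo "extraction pipeline exit code: $?"

    # For a strict comparison, destroy and recreate the filesystem here
    # instead of extracting over the existing tree.
    coldstart
    ptime sh -c "bzcat $ARCHIVE | star -x -no-fsync"   # no per-file fsync()

    coldstart
    ptime sh -c "find . | wc -l"                       # traversal cost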
Eric Schrock <eric.schrock at sun.com> wrote:
> If I remember correctly, sync(1) and fsync(3c) actually work as
> advertised on ZFS, and guarantee that data has been written to disk
> before the syscall returns. Thus, there will be more overhead than on
> UFS (where the pages are flushed, but may or may not have reached stable
> storage).

Could you explain what you understand by stable storage?

> I guess the question (as Jason points out) is what are you trying to
> accomplish? If you really want the data for each file on disk between
> each extraction, then you are getting what you asked for on ZFS, unlike
> UFS. I don't see any value to the (broken) UFS semantics.

What does UFS?

Jörg
On Mon, Nov 28, 2005 at 05:38:41PM +0100, Joerg Schilling wrote:
> Eric Schrock <eric.schrock at sun.com> wrote:
>
> > If I remember correctly, sync(1) and fsync(3c) actually work as
> > advertised on ZFS, and guarantee that data has been written to disk
> > before the syscall returns. Thus, there will be more overhead than on
> > UFS (where the pages are flushed, but may or may not have reached
> > stable storage).
>
> Could you explain what you understand by stable storage?

The backing store for the filesystem, typically a disk. Note that Neil
and Casper clarified some of my (invalid) assumptions. For fsync(), UFS
does guarantee that pending writes will reach stable storage, but not
for sync(). ZFS makes this guarantee for both.

> > I guess the question (as Jason points out) is what are you trying to
> > accomplish? If you really want the data for each file on disk between
> > each extraction, then you are getting what you asked for on ZFS, unlike
> > UFS. I don't see any value to the (broken) UFS semantics.
>
> What does UFS?

I don't know what this means, but it's probably based on my invalid
assumption as described above. Neil is looking into the fsync()
performance issues.

- Eric
Casper.Dik at Sun.COM wrote:
> fsync() does guarantee that it won't return until the
> data is on stable storage, unlike sync().
>
> However, UFS fsync is less than useful because it doesn't guarantee
> that the inode and the directory entry and the directory inode have
> reached stable storage.

With logging, this seems to be sufficient - or did I miss something?

Jörg
> > It is unclear to me why I see the opposite behavior.

Just to be clear, you _are_ running this on non-DEBUG SX:CR 27a bits,
correct? You'll get rather different results when running on
OpenSolaris/BFU DEBUG bits thanks to our liberal use of kmem caches and
some ZFS debugging features which are turned on for DEBUG builds[1].

- Eric

[1] You can minimize these effects by setting 'kmem_flags' and
'zfs_flags' to zero, but you will still have various overhead.
Performance testing with DEBUG bits is never really valid.
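(For the record, one way to set those two. It assumes the variables
behave as the footnote describes: zfs_flags can be poked on a live DEBUG
kernel with mdb, while kmem_flags is only consulted at boot and so has to
go through /etc/system.)

    # Disable the ZFS debug features on a running DEBUG kernel:
    echo "zfs_flags/W 0" | mdb -kw

    # kmem debugging is decided at boot time, so turn it off in
    # /etc/system and reboot:
    echo "set kmem_flags=0x0" >> /etc/system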
> ZFS will push all data and metadata to the intent log by return
> from fsync(). The uber block will not be updated until the DMU
> transaction group commits - potentially seconds later.
> However, on power loss or panic, the intent log is replayed, which
> will recreate the transactions and flush them by forcing the
> DMU transaction group to commit. So the uber block will contain
> the tree with the newly created file.

So what's the preferred spelling for "uberblock"? Is it "über", "ueber"
(both correct German forms) or "uber" (a hitherto unknown word)? The
pronunciation is rather different.

Casper
On Mon, Nov 28, 2005 at 07:17:21PM +0100, Casper.Dik at sun.com wrote:
>
> So what's the preferred spelling for "uberblock"? Is it "über", "ueber"
> (both correct German forms) or "uber" (a hitherto unknown word)? The
> pronunciation is rather different.

The spelling without the umlaut is acceptable; the 'ue' isn't. While a
bastardized English form, it's hardly "hitherto unknown", see:

http://en.wikipedia.org/wiki/%C3%9Cber

In particular, this phrase:

    Über is commonly misspelled as uber in English, although the
    correct substitute for the 'ü'-Umlaut would be ue, not just 'u'

In particular, the non-umlaut spelling is common in many informal/slang
settings, such as "ubercool". See:

http://www.urbandictionary.com/define.php?term=uber

In ZFS, the non-umlaut version is the common form, since source code
doesn't easily allow us the use of the real thing ;-)

- Eric
> The spelling without the umlaut is acceptable; the 'ue' isn't. While a
> bastardized English form, it's hardly "hitherto unknown", see:

Why isn't "ue" acceptable? "ue" is the correct form, as the Wikipedia
entry says....

"Uber" is pronounced "oober" and writing it makes you look silly and
uneducated.

Casper
On Mon, 2005-11-28 at 14:05, Casper.Dik at Sun.COM wrote:
> "Uber" is pronounced "oober" and writing it makes you look silly
> and uneducated.

You expect otherwise from an English slang form, with a life of its own
distinct from the original German word?

To quote someone or other: "We don't just borrow words; on occasion,
English has pursued other languages down alleyways to beat them
unconscious and rifle their pockets for new vocabulary."

- Bill
An ultimate irony is that English conjugations and declensions resemble
the ones German uses in the few cases where it imports words from foreign
languages, except for some 1000 core English words (which, I believe,
follow German's native rules). Thus English is like German would be if it
rifled other languages for its words. (This according to a linguist who
gave a guest lecture to a biochem faculty at a prominent research
university.)

Bill Ross

Bill Sommerfeld wrote:
> On Mon, 2005-11-28 at 14:05, Casper.Dik at Sun.COM wrote:
>
> > "Uber" is pronounced "oober" and writing it makes you look silly
> > and uneducated.
>
> You expect otherwise from an English slang form, with a life of its own
> distinct from the original German word?
>
> To quote someone or other: "We don't just borrow words; on occasion,
> English has pursued other languages down alleyways to beat them
> unconscious and rifle their pockets for new vocabulary."
>
> - Bill
Eric Schrock wrote:
> The spelling without the umlaut is acceptable; the 'ue' isn't. While a
> bastardized English form, it's hardly "hitherto unknown", see:
>
> http://en.wikipedia.org/wiki/%C3%9Cber
>
> In particular, this phrase:
>
>     Über is commonly misspelled as uber in English, although the
>     correct substitute for the 'ü'-Umlaut would be ue, not just 'u'
>
> In particular, the non-umlaut spelling is common in many informal/slang
> settings, such as "ubercool". See:
>
> http://www.urbandictionary.com/define.php?term=uber
>
> In ZFS, the non-umlaut version is the common form, since source code
> doesn't easily allow us the use of the real thing ;-)

I still think "Dr. Feelgood" is their best album.

Wait....am I on the wrong list again?

--
Torrey McMahon
Sun Microsystems Inc.
Eric Schrock <eric.schrock at sun.com> wrote:
> Just to be clear, you _are_ running this on non-DEBUG SX:CR 27a bits,
> correct? You'll get rather different results when running on

As long as there is no official way to run non-debug kernels....

> OpenSolaris/BFU DEBUG bits thanks to our liberal use of kmem caches and
> some ZFS debugging features which are turned on for DEBUG builds[1].
>
> [1] You can minimize these effects by setting 'kmem_flags' and
> 'zfs_flags' to zero, but you will still have various overhead.
> Performance testing with DEBUG bits is never really valid.

Thank you for the hint. I did set zfs_flags to 0 and this caused a slight
speedup.

The interesting result is that the extraction was _really_ fast until the
first big directory ('misc') reached 440,000 entries. Then it became as
slow as before. As a result, the star -xp -no-fsync extract did speed up
from 3:57 h to 3:00 h.

If there is a way to prevent the extreme slowdown after 440,000 entries,
ZFS would be noticeably faster than UFS!

Jörg
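(If someone wants to chase that directory-size knee without re-extracting
freedb each time, a crude reproducer along these lines might do. The path
and batch size are made up; all it does is create a million empty files
in one directory and report per-batch timing.)

    #!/bin/sh
    # Create 1,000,000 empty files in a single directory, 50,000 at a
    # time, and time each batch to see whether create cost jumps once
    # the directory holds a few hundred thousand entries.
    DIR=/zfs/bigdir                  # placeholder path on the ZFS pool
    mkdir -p $DIR || exit 1

    b=0
    while [ $b -lt 20 ]; do
            count=`expr $b \* 50000`
            echo "batch $b (roughly $count entries so far)"
            nawk -v dir=$DIR -v b=$b 'BEGIN {
                    for (i = 0; i < 50000; i++)
                            printf("%s/f.%d.%d\n", dir, b, i)
            }' < /dev/null | ptime xargs touch
            b=`expr $b + 1`
    done

On the run described above, the interesting batches would be the ones
around 400,000-500,000 entries, where the 'misc' directory apparently hit
the wall.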
Eric Schrock <eric.schrock at sun.com> wrote:
> In particular, the non-umlaut spelling is common in many informal/slang
> settings, such as "ubercool". See:
>
> http://www.urbandictionary.com/define.php?term=uber

Something I would write and pronounce "obercool".

Jörg
On Tue, Nov 29, 2005 at 03:02:50PM +0100, Joerg Schilling wrote:
>
> As long as there is no official way to run non-debug kernels....

There is. It's called Solaris Express. If you're going to do any future
performance testing for ZFS, please use these non-DEBUG bits or we'll
have no way to objectively analyze your results (unless, of course,
you're trying to measure the DEBUG overhead). Or just wait for non-DEBUG
opensolaris bits (should be coming shortly).

> If there is a way to prevent the extreme slowdown after 440,000
> entries, ZFS would be noticeably faster than UFS!

Noel is looking at some ZAP oddities with large directories (there are
some large performance jumps at various sizes). But this could also be an
artifact of running DEBUG bits - nothing's for certain when you're doing
performance testing on DEBUG bits.

- Eric
Roch Bourbonnais - Performance Engineering wrote on 2005-Nov-29 16:35 UTC
([zfs-discuss] ZFS extremely slow):
One hypothesis for the performance delta would be that ZFS actually
reclaims buffers too aggressively (and thus after a while would need to
go to disk). I am not sure how to reduce this aggressiveness; can anyone
provide guidance here? Is adding swap space sufficient?

-r
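(Not an answer, but one way to see whether general memory pressure is
what is evicting those buffers would be to watch where physical memory
goes while the test runs. This assumes an mdb recent enough to have the
::memstat dcmd.)

    # Rough breakdown of physical memory (kernel, anon, page cache,
    # free); run it a few times during the extraction.
    echo "::memstat" | mdb -k

    # Free memory and page scanner activity over time; a sustained
    # non-zero "sr" column means reclaim is being forced.
    vmstat 5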
Joerg Schilling wrote:
> Eric Schrock <eric.schrock at sun.com> wrote:
>
> > In particular, the non-umlaut spelling is common in many informal/slang
> > settings, such as "ubercool". See:
> >
> > http://www.urbandictionary.com/define.php?term=uber
>
> Something I would write and pronounce "obercool"
>
> Jörg

Just one step from almost the original meaning: overcool

Bill