I had followed with interest the "turn off NV cache flushing" thread, in regard to doing ZFS-backed NFS on our low-end Hitachi array:

  http://www.mail-archive.com/zfs-discuss at opensolaris.org/msg05000.html

In short, if you have non-volatile cache, you can configure the array to ignore the ZFS cache-flush requests. This is reported to improve the really terrible performance of ZFS-backed NFS systems. Feel free to correct me if I'm misremembering....

Anyway, I've also read that if ZFS notices it's using "slices" instead of whole disks, it will not enable/use the write cache. So I thought I'd be clever and configure a ZFS pool on our array with a slice of a LUN instead of the whole LUN, and "fool" ZFS into not issuing cache-flushes, rather than having to change config of the array itself.

Unfortunately, it didn't make a bit of difference in my little NFS benchmark, namely extracting a small 7.6MB tar file (C++ source code, 500 files/dirs).

I used three test zpools and a UFS filesystem (not all were in play at the same time):

  pool: bulk_sp1
 state: ONLINE
 scrub: none requested
config:

        NAME                                            STATE     READ WRITE CKSUM
        bulk_sp1                                        ONLINE       0     0     0
          c6t4849544143484920443630303133323230303230d0 ONLINE       0     0     0

errors: No known data errors

  pool: bulk_sp1s
 state: ONLINE
 scrub: none requested
config:

        NAME                                              STATE     READ WRITE CKSUM
        bulk_sp1s                                         ONLINE       0     0     0
          c6t4849544143484920443630303133323230303230d0s0 ONLINE       0     0     0

errors: No known data errors

  pool: int01
 state: ONLINE
 scrub: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        int01         ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c0t0d0s5  ONLINE       0     0     0
            c0t1d0s5  ONLINE       0     0     0

errors: No known data errors

# prtvtoc -s /dev/rdsk/c6t4849544143484920443630303133323230303230d0
*                            First       Sector        Last
* Partition  Tag  Flags      Sector       Count       Sector  Mount Directory
       0      4    00            34  4294879232  4294879265
       1      4    00    4294879266       67517  4294946782
       8     11    00    4294946783       16384  4294963166
#

Both NFS client and server are Sun T2000's, 16GB RAM, switched gigabit ethernet, Solaris-10U3 patched as of 12-Jan-2007, doing nothing else at the time of the tests. The "bulk_sp1*" pools were both on the same Hitachi 9520V RAID-5 SATA group that I ran my bonnie++ tests on yesterday. The "int01" pool is mirrored on two slice-5's of the server T2000's internal 2.5" SAS 73GB drives.

ZFS on whole-disk FC-SATA LUN via NFS:
    real  968.13   user  0.33   sys  0.04      7.9 KB/sec overall
ZFS on partial slice-0 of FC-SATA LUN via NFS:
    real  950.77   user  0.33   sys  0.04      8.0 KB/sec overall
ZFS on slice-5 mirror of internal SAS drives via NFS:
    real   17.48   user  0.32   sys  0.03    438.8 KB/sec overall
UFS on partial slice-0 of FC-SATA LUN via NFS:
    real    6.13   user  0.32   sys  0.03   1251.4 KB/sec overall

I'm not willing to disable the ZIL. I think I'd settle for the 400KB/sec range in this test from NFS on ZFS, if I could get that on our FC-SATA Hitachi array. As things are now, ZFS just won't work for us, and I'm not sure how to make it go faster.

Thoughts & suggestions are welcome....

Marion
Marion Hakanson
2007-Feb-02 01:50 UTC
[zfs-discuss] Re: ZFS vs NFS vs array caches, revisited
Adding to my own post, I said earlier:

> Anyway, I've also read that if ZFS notices it's using "slices" instead of
> whole disks, it will not enable/use the write cache.  So I thought I'd be
> clever and configure a ZFS pool on our array with a slice of a LUN instead of
> the whole LUN, and "fool" ZFS into not issuing cache-flushes, rather than
> having to change config of the array itself.
>
> Unfortunately, it didn't make a bit of difference in my little NFS benchmark,
> namely extracting a small 7.6MB tar file (C++ source code, 500 files/dirs).

I was checking the write-cache settings via the "cache" submenu of "format -e". All LUN's on this array appear (to "format") to have write cache disabled. Trying to enable it yields:

    Write cache setting is not changeable

Re-creating a zpool with whole-disk devices does not change the setting reported by format, either.

Given that format can't control the cache settings, can one assume that ZFS isn't trying to flush the cache either? My question here is, how can one tell if ZFS is trying to flush the write caches? Dtrace to the rescue?

Regards,

Marion
Marion, this is a common misinterpretation:

  "Anyway, I've also read that if ZFS notices it's using "slices" instead
   of whole disks, it will not enable/use the write cache."

The reality is that ZFS turns on the write cache when it owns the whole disk. _Independently_ of that, ZFS flushes the write cache when ZFS needs to ensure that data reaches stable storage.

The point is that the flushes occur whether or not ZFS turned the caches on (caches might be turned on by some other means outside the visibility of ZFS).

The problem is that the flush-cache command means 2 different things to the 2 components:

    To ZFS:      "put on stable storage"
    To Storage:  "flush the cache"

Until we get this house in order, storage needs to ignore the requests.

-r
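To make the "put on stable storage" side concrete: on the host it is synchronous operations (fsync(2), O_DSYNC writes, an NFS COMMIT arriving at the server) that force a ZIL commit, and that commit is what ultimately reaches the array as a flush-cache request. Below is a minimal C sketch of the application-side contract; the file path is made up and this is only an illustration, not code from the thread:

    /*
     * Illustration only: the caller asks for durability; whether that
     * turns into a SCSI SYNCHRONIZE CACHE at the array is up to ZFS
     * and the storage.  The file path is hypothetical.
     */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int
    main(void)
    {
            const char *path = "/zp1/testfile";   /* hypothetical ZFS-backed file */
            const char buf[] = "must survive a power loss\n";
            int fd;

            if ((fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644)) < 0) {
                    perror("open");
                    return (1);
            }
            if (write(fd, buf, sizeof (buf) - 1) < 0) {
                    perror("write");
                    return (1);
            }
            /*
             * fsync() is the application-level "put on stable storage"
             * request.  On ZFS it commits the intent log (ZIL), and that
             * commit is what issues the cache flush the array sees; an
             * NFS COMMIT from a client has the same effect on the server.
             */
            if (fsync(fd) < 0) {
                    perror("fsync");
                    return (1);
            }
            (void) close(fd);
            return (0);
    }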
Hi All,

In my test setup, I have one zpool of size 1000 MB. On this zpool, my application writes 100 files, each of size 10 MB.

The first 96 files were written successfully without any problem. But the 97th file was not written successfully; only 5 MB were written (the return value of the write() call). Since it was a short write, my application tried to truncate the file to 5 MB, but ftruncate failed with an error message saying there is no space on the device.

Have you ever seen this kind of error message?

After the ftruncate failure I checked the size of the 97th file, and it is strange: the size is 7 MB, but the expected size is only 5 MB.

Your help is appreciated.

Thanks & Regards
Mastan
Roch.Bourbonnais at Sun.Com said:
> The reality is that ZFS turns on the write cache when it owns the
> whole disk.  _Independently_ of that, ZFS flushes the write cache
> when ZFS needs to ensure that data reaches stable storage.
>
> The point is that the flushes occur whether or not ZFS turned the caches on
> (caches might be turned on by some other means outside the visibility
> of ZFS).

Thanks for taking the time to clear this up for us (assuming others than just me had this misunderstanding :-). Yet today I measured something that leaves me puzzled again. How can we explain the following results?

# zpool status -v
  pool: bulk_zp1
 state: ONLINE
 scrub: none requested
config:

        NAME                                                STATE     READ WRITE CKSUM
        bulk_zp1                                            ONLINE       0     0     0
          raidz1                                            ONLINE       0     0     0
            c6t4849544143484920443630303133323230303230d0s0 ONLINE       0     0     0
            c6t4849544143484920443630303133323230303230d0s1 ONLINE       0     0     0
            c6t4849544143484920443630303133323230303230d0s2 ONLINE       0     0     0
            c6t4849544143484920443630303133323230303230d0s3 ONLINE       0     0     0
            c6t4849544143484920443630303133323230303230d0s4 ONLINE       0     0     0
            c6t4849544143484920443630303133323230303230d0s5 ONLINE       0     0     0
            c6t4849544143484920443630303133323230303230d0s6 ONLINE       0     0     0

errors: No known data errors

# prtvtoc -s /dev/rdsk/c6t4849544143484920443630303133323230303230d0
*                            First       Sector        Last
* Partition  Tag  Flags      Sector       Count       Sector  Mount Directory
       0      4    00            34   613563821   613563854
       1      4    00     613563855   613563821  1227127675
       2      4    00    1227127676   613563821  1840691496
       3      4    00    1840691497   613563821  2454255317
       4      4    00    2454255318   613563821  3067819138
       5      4    00    3067819139   613563821  3681382959
       6      4    00    3681382960   613563821  4294946780
       8     11    00    4294946783       16384  4294963166
#

And, at a later time:

# zpool status -v bulk_sp1s
  pool: bulk_sp1s
 state: ONLINE
 scrub: none requested
config:

        NAME                                                STATE     READ WRITE CKSUM
        bulk_sp1s                                           ONLINE       0     0     0
          c6t4849544143484920443630303133323230303230d0s0   ONLINE       0     0     0
          c6t4849544143484920443630303133323230303230d0s1   ONLINE       0     0     0
          c6t4849544143484920443630303133323230303230d0s2   ONLINE       0     0     0
          c6t4849544143484920443630303133323230303230d0s3   ONLINE       0     0     0
          c6t4849544143484920443630303133323230303230d0s4   ONLINE       0     0     0
          c6t4849544143484920443630303133323230303230d0s5   ONLINE       0     0     0
          c6t4849544143484920443630303133323230303230d0s6   ONLINE       0     0     0

errors: No known data errors
#

The storage is that same "single 2TB LUN" I used yesterday, except I've used "format" to slice it up into 7 equal chunks, and made a raidz (and later a simple striped) pool across all of them. My "tar over NFS" benchmark on these goes pretty fast. If ZFS is making the flush-cache call, it sure works faster than in the whole-LUN case:

ZFS on whole-disk FC-SATA LUN via NFS, yesterday:
    real  968.13   user  0.33   sys  0.04     7.9 KB/sec overall
ZFS on whole-disk FC-SATA LUN via NFS, ssd_max_throttle=32 today:
    real  664.78   user  0.33   sys  0.04    11.4 KB/sec overall
ZFS raidz on 7 slices of FC-SATA LUN via NFS today:
    real   12.32   user  0.32   sys  0.03   620.2 KB/sec overall
ZFS striped on 7 slices of FC-SATA LUN via NFS today:
    real    6.51   user  0.32   sys  0.03  1178.3 KB/sec overall

Not that I'm complaining, mind you. I appear to have stumbled across a way to get NFS over ZFS to work at a reasonable speed, without making changes to the array (nor resorting to giving ZFS SVM soft partitions instead of "real" devices). Suboptimal, mind you, but it's workable if our Hitachi folks don't turn up a way to tweak the array.

Guess I should go read the ZFS source code (though my 10U3 surely lags the OpenSolaris stuff).

Thanks and regards,

Marion
Hi All,

No one has any idea on this?

-Masthan

dudekula mastan <d_mastan at yahoo.com> wrote:
> Hi All,
>
> In my test setup, I have one zpool of size 1000 MB. On this zpool, my
> application writes 100 files, each of size 10 MB.
>
> The first 96 files were written successfully without any problem. But the
> 97th file was not written successfully; only 5 MB were written (the return
> value of the write() call). Since it was a short write, my application
> tried to truncate the file to 5 MB, but ftruncate failed with an error
> message saying there is no space on the device.
>
> Have you ever seen this kind of error message?
>
> After the ftruncate failure I checked the size of the 97th file, and it is
> strange: the size is 7 MB, but the expected size is only 5 MB.
>
> Your help is appreciated.
>
> Thanks & Regards
> Mastan
Masthan,

dudekula mastan <d_mastan at yahoo.com> wrote:
> In my test setup, I have one zpool of size 1000 MB.

Is this the size given by zfs list? Or is it the amount of disk space that you had? The reason I ask is that ZFS/zpool takes up some amount of space for its housekeeping. So, if you add 1G worth of disk space to the pool, the effective space available is a little less (a few MB) than 1G.

> On this zpool, my application writes 100 files, each of size 10 MB.
>
> The first 96 files were written successfully without any problem.

Here you are filling the filesystem to the brim. This is a border case, and the copy-on-write nature of ZFS could lead to the behaviour that you are seeing.

> But the 97th file was not written successfully; only 5 MB were written
> (the return value of the write() call).
>
> Since it was a short write, my application tried to truncate the file to
> 5 MB, but ftruncate failed with an error message saying there is no space
> on the device.

This is expected because of the copy-on-write nature of ZFS. During the truncate it is trying to allocate new disk blocks, probably to write the new metadata, and fails to find them.

> Have you ever seen this kind of error message?

Yes, there are others who have seen these errors.

> After the ftruncate failure I checked the size of the 97th file, and it is
> strange: the size is 7 MB, but the expected size is only 5 MB.

Is there any particular reason that you are pushing the filesystem to the brim? Is this part of some test? Please help us understand what you are trying to test.

Thanks and regards,
Sanjeev.

-- 
Solaris Revenue Products Engineering,
India Engineering Center,
Sun Microsystems India Pvt Ltd.
Tel: x27521 +91 80 669 27521
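For what it's worth, the application-side pattern being discussed looks roughly like the sketch below: check every write() return value, and expect that on a nearly full copy-on-write filesystem even the cleanup step (ftruncate, here) can fail with ENOSPC, in which case removing the partial file is the fallback. The helper name and path handling are invented for illustration; this is not Masthan's application:

    /*
     * Sketch: write a buffer to a file on a nearly full filesystem,
     * handling short writes and a possible ENOSPC from ftruncate.
     */
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/types.h>
    #include <unistd.h>

    int
    write_file(const char *path, const char *buf, size_t len)
    {
            size_t done = 0;
            int fd;

            if ((fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644)) < 0)
                    return (-1);

            while (done < len) {
                    ssize_t n = write(fd, buf + done, len - done);

                    if (n > 0) {                    /* full or short write */
                            done += (size_t)n;
                            continue;
                    }
                    if (n < 0 && errno == EINTR)    /* interrupted: retry */
                            continue;
                    break;                          /* ENOSPC or other error */
            }

            if (done < len) {
                    /*
                     * Trim the file to the bytes known to be written.  On a
                     * COW filesystem this can itself fail with ENOSPC, so
                     * fall back to removing the partial file.
                     */
                    if (ftruncate(fd, (off_t)done) < 0) {
                            fprintf(stderr, "ftruncate: %s; removing partial file\n",
                                strerror(errno));
                            (void) close(fd);
                            (void) unlink(path);
                            return (-1);
                    }
            }
            (void) close(fd);
            return (done == len ? 0 : -1);
    }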
Matthew Ahrens
2007-Feb-09 21:21 UTC
[zfs-discuss] ENOSPC on full FS (was: Meta data corruptions on ZFS.)
dudekula mastan wrote:
> Hi All,
>
> In my test setup, I have one zpool of size 1000 MB.
>
> On this zpool, my application writes 100 files, each of size 10 MB.
>
> The first 96 files were written successfully without any problem.
>
> But the 97th file was not written successfully; only 5 MB were written
> (the return value of the write() call).
>
> Since it was a short write, my application tried to truncate the file to
> 5 MB, but ftruncate failed with an error message saying there is no space
> on the device.

Try removing one of the larger files.

Alternatively, upgrade to a more recent version of Solaris Express / Nevada / OpenSolaris, where this problem is much less severe.

--matt

ps. subject changed, not sure what this had to do with corruption.
Marion asked the community:

> How can we explain the following results?

and nobody replied, so I ask this question again because it's very important to me:

How did ZFS striped across 7 slices of an FC-SATA LUN via NFS work 146 times faster than ZFS on 1 slice of the same LUN via NFS?

I'll appreciate your inputs.

-- leon
Jeff Bonwick
2007-Feb-11 08:03 UTC
[zfs-discuss] Re: ZFS vs NFS vs array caches, revisited
> How did ZFS striped across 7 slices of an FC-SATA LUN via NFS work 146 times
> faster than ZFS on 1 slice of the same LUN via NFS?

Without knowing more I can only guess, but most likely it's a simple matter of working set. Suppose the benchmark in question has a 4G working set, and suppose that each LUN is fronted by a 1G cache. With a single LUN, only 1/4 of your working set fits in cache, so you're doing a fair amount of actual disk I/O. With 7 LUNs, you've got 7G of cache, so the entire benchmark fits in cache -- no disk I/O.

The factor of >100x is what tells me this is almost certainly a working-set effect.

Jeff
Leon Koll
2007-Feb-11 16:53 UTC
[zfs-discuss] Re: Re: ZFS vs NFS vs array caches, revisited
Jeff,

Thank you for the explanation, but it's hard for me to accept it because:

1. You described a different configuration: 7 LUNs. Marion's post was about 7 slices of the same LUN.
2. I have never seen a storage controller with a cache-per-LUN setting. Cache size doesn't depend on the number of LUNs IMHO; it's a fixed size per controller or per FC port. SAN experts, please fix me if I'm wrong.

-- leon
Robert Milkowski
2007-Feb-11 23:20 UTC
[zfs-discuss] Re: Re: ZFS vs NFS vs array caches, revisited
Hello Leon,

Sunday, February 11, 2007, 5:53:48 PM, you wrote:

LK> Jeff,
LK> Thank you for the explanation, but it's hard for me to accept it because:
LK> 1. You described a different configuration: 7 LUNs. Marion's post
LK> was about 7 slices of the same LUN.
LK> 2. I have never seen a storage controller with a cache-per-LUN setting.
LK> Cache size doesn't depend on the number of LUNs IMHO; it's a fixed
LK> size per controller or per FC port.
LK> SAN experts, please fix me if I'm wrong.

IIRC, Symmetrix boxes used to reserve at least a minimum amount of cache on a per-LUN basis. However, it's not relevant to your case, as you are comparing an entire LUN vs. 7 slices of that LUN in a striped pool.

-- 
Best regards,
Robert                          mailto:rmilkowski at task.gda.pl
                                http://milek.blogspot.com
Marion Hakanson
2007-Feb-13 00:36 UTC
[zfs-discuss] Re: Re: ZFS vs NFS vs array caches, revisited
leon.is.here at gmail.com said:
> How did ZFS striped across 7 slices of an FC-SATA LUN via NFS work 146 times
> faster than ZFS on 1 slice of the same LUN via NFS?

Well, I do have more info to share on this issue, though how it worked faster in that test still remains a mystery. Folks may recall that I said:

> Not that I'm complaining, mind you.  I appear to have stumbled across a
> way to get NFS over ZFS to work at a reasonable speed, without making changes
> to the array (nor resorting to giving ZFS SVM soft partitions instead of
> "real" devices).  Suboptimal, mind you, but it's workable if our Hitachi
> folks don't turn up a way to tweak the array.

Unfortunately, I was wrong. I _don't_ know how to make it go fast. While I _have_ been able to reproduce the result on a couple different LUN/slice configurations, I don't know what triggers the "fast" behavior. All I can say for sure is that a little dtrace one-liner that counts sync-cache calls turns up no such calls (for both local ZFS and remote NFS extracts) when things are going fast on a particular filesystem.

By comparison, a local ZFS tar-extraction triggers 12 sync-cache calls, and one hits 288 such calls during an NFS extraction before interrupting the run after 30 seconds (est. 1/100th of the way through) when things are working in the "slow" mode. Oh yeah, here's the one-liner (type in the command, run your test in another session, then hit ^C on this one):

    dtrace -n fbt::ssd_send_scsi_SYNCHRONIZE_CACHE:entry'{@a[probefunc] = count()}'

This is my first ever use of dtrace, so please be gentle with me (:-).

hakansom at ohsu.edu said:
> Guess I should go read the ZFS source code (though my 10U3 surely lags the
> OpenSolaris stuff).

I did go read the source code, for my own edification. To reiterate what was said earlier:

Roch.Bourbonnais at Sun.Com said:
> The point is that the flushes occur whether or not ZFS turned the caches on
> (caches might be turned on by some other means outside the visibility
> of ZFS).

My limited reading of the ZFS code (on the opensolaris.org site) so far has turned up no obvious way to make ZFS skip the sync-cache call. However my dtrace test, unless it's flawed, shows that on some filesystems the call is made, and on other filesystems the call is not made.

leon.is.here at gmail.com said:
> 2. I have never seen a storage controller with a cache-per-LUN setting.
> Cache size doesn't depend on the number of LUNs IMHO; it's a fixed size per
> controller or per FC port.  SAN experts, please fix me if I'm wrong.

Robert has already mentioned array cache being reserved on a per-LUN basis in Symmetrix boxes. Our low-end HDS unit also has cache pre-fetch settings on a per-LUN basis (defaults according to the number of disks in the RAID group).

Regards,

Marion
Roch - PAE
2007-Feb-13 10:58 UTC
[zfs-discuss] Re: Re: ZFS vs NFS vs array caches, revisited
The only obvious thing would be if the exported ZFS filesystems were initially mounted at a point in time when zil_disable was non-null.

The stack trace that is relevant is:

    sd_send_scsi_SYNCHRONIZE_CACHE
    sd`sdioctl+0x1770
    zfs`vdev_disk_io_start+0xa0
    zfs`zil_flush_vdevs+0x108
    zfs`zil_commit_writer+0x2b8
    ...

You might want to try in turn:

    dtrace -n 'sd_send_scsi_SYNCHRONIZE_CACHE:entry{@a[stack(20)]=count()}'
    dtrace -n 'sdioctl:entry{@a[stack(20)]=count()}'
    dtrace -n 'zil_flush_vdevs:entry{@a[stack(20)]=count()}'
    dtrace -n 'zil_commit_writer:entry{@a[stack(20)]=count()}'

And see if you lose your footing along the way.

-r

Marion Hakanson writes:
 > Unfortunately, I was wrong.  I _don't_ know how to make it go fast.  While
 > I _have_ been able to reproduce the result on a couple different LUN/slice
 > configurations, I don't know what triggers the "fast" behavior.  All I can
 > say for sure is that a little dtrace one-liner that counts sync-cache calls
 > turns up no such calls (for both local ZFS and remote NFS extracts) when
 > things are going fast on a particular filesystem.
 > [...]
 > My limited reading of the ZFS code (on the opensolaris.org site) so far has
 > turned up no obvious way to make ZFS skip the sync-cache call.  However my
 > dtrace test, unless it's flawed, shows that on some filesystems the call is
 > made, and on other filesystems the call is not made.
Leon Koll
2007-Feb-13 14:35 UTC
[zfs-discuss] Re: Re: Re: ZFS vs NFS vs array caches, revisited
Hi Marion,

Your one-liner works only on SPARC and doesn't work on x86:

# dtrace -n fbt::ssd_send_scsi_SYNCHRONIZE_CACHE:entry'{@a[probefunc] = count()}'
dtrace: invalid probe specifier fbt::ssd_send_scsi_SYNCHRONIZE_CACHE:entry{@a[probefunc] = count()}: probe description fbt::ssd_send_scsi_SYNCHRONIZE_CACHE:entry does not match any probes

What's wrong with it?

Thanks,
-- leon
Roch - PAE
2007-Feb-13 14:51 UTC
[zfs-discuss] Re: Re: Re: ZFS vs NFS vs array caches, revisited
On x86 try with

    sd_send_scsi_SYNCHRONIZE_CACHE

Leon Koll writes:
 > Hi Marion,
 > Your one-liner works only on SPARC and doesn't work on x86:
 > # dtrace -n fbt::ssd_send_scsi_SYNCHRONIZE_CACHE:entry'{@a[probefunc] = count()}'
 > dtrace: invalid probe specifier fbt::ssd_send_scsi_SYNCHRONIZE_CACHE:entry{@a[probefunc] = count()}: probe description fbt::ssd_send_scsi_SYNCHRONIZE_CACHE:entry does not match any probes
 >
 > What's wrong with it?
> This is expected because of the copy-on-write nature of ZFS.  During
> truncate it is trying to allocate new disk blocks, probably to write the
> new metadata, and fails to find them.

I realize there is a fundamental issue with copy-on-write, but does this mean ZFS does not maintain some kind of reservation to guarantee you can always remove data? If so, I would consider this a major issue for general-purpose use, and if nothing else it should most definitely be clearly documented.

Accidentally filling up space is not at *all* uncommon in many situations, be it home use or medium-sized business use. Yes, you should avoid it, but shit (always) happens.

-- 
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller <peter.schuller at infidyne.com>'
Key retrieval: Send an E-Mail to getpgpkey at scode.org
E-Mail: peter.schuller at infidyne.com Web: http://www.scode.org
Roch.Bourbonnais at Sun.Com said:
> The only obvious thing would be if the exported ZFS filesystems were
> initially mounted at a point in time when zil_disable was non-null.

No changes have been made to zil_disable. It's 0 now, and we've never changed the setting. Export/import doesn't appear to change the behavior.

Roch.Bourbonnais at Sun.Com said:
> You might want to try in turn:
>     dtrace -n 'sd_send_scsi_SYNCHRONIZE_CACHE:entry{@a[stack(20)]=count()}'
>     dtrace -n 'sdioctl:entry{@a[stack(20)]=count()}'
>     dtrace -n 'zil_flush_vdevs:entry{@a[stack(20)]=count()}'
>     dtrace -n 'zil_commit_writer:entry{@a[stack(20)]=count()}'
> And see if you lose your footing along the way.

I've included below the complete list of dtrace output. This system has two zpools, one that goes "fast" for NFS and one that goes "slow". You can see the details of the pools' configs below. Let me re-state that at times in the past, the "fast" pool has gone "slow", and I don't know what made it start going "fast" again.

To summarize, the first dtrace above gives no output on the fast pool, and lists 6, 7, 12, or 14 calls for the slow pool. The second dtrace above counts 6 or 7 calls on both pools. The third dtrace above gives no output for either pool, but zil_flush_vdevs isn't in the stack trace for the earlier trace on my machine (SPARC, Sol-10U3). The last dtrace doesn't find a matching probe here.

================================================================
# echo "zil_disable/D" | mdb -k
zil_disable:
zil_disable:    0
# zpool list
NAME                    SIZE    USED   AVAIL    CAP  HEALTH     ALTROOT
bulk_zp1               2.14T    160K   2.14T     0%  ONLINE     -
bulk_zp2               2.14T    346K   2.14T     0%  ONLINE     -
int01                  48.2G   1.94G   46.3G     4%  ONLINE     -
# cd
# zpool export bulk_zp1
# zpool export bulk_zp2
# zpool import
  pool: bulk_zp2
    id: 803252704584693135
 state: ONLINE
action: The pool can be imported using its name or numeric identifier.
config:

        bulk_zp2                                              ONLINE
          raidz1                                              ONLINE
            c6t4849544143484920443630303133323230303330d0s0   ONLINE
            c6t4849544143484920443630303133323230303330d0s1   ONLINE
            c6t4849544143484920443630303133323230303331d0s0   ONLINE
            c6t4849544143484920443630303133323230303331d0s1   ONLINE
            c6t4849544143484920443630303133323230303332d0s0   ONLINE
            c6t4849544143484920443630303133323230303332d0s1   ONLINE

  pool: bulk_zp1
    id: 14914295292657419291
 state: ONLINE
action: The pool can be imported using its name or numeric identifier.
config:

        bulk_zp1                                              ONLINE
          raidz1                                              ONLINE
            c6t4849544143484920443630303133323230303230d0s0   ONLINE
            c6t4849544143484920443630303133323230303230d0s1   ONLINE
            c6t4849544143484920443630303133323230303231d0s0   ONLINE
            c6t4849544143484920443630303133323230303231d0s1   ONLINE
            c6t4849544143484920443630303133323230303232d0s0   ONLINE
            c6t4849544143484920443630303133323230303232d0s1   ONLINE
            c6t4849544143484920443630303133323230303232d0s2   ONLINE
# zpool import bulk_zp1
# zpool import bulk_zp2
# zfs list bulk_zp1
NAME       USED  AVAIL  REFER  MOUNTPOINT
bulk_zp1   123K  1.79T  53.6K  /zp1
# zfs list bulk_zp2
NAME       USED  AVAIL  REFER  MOUNTPOINT
bulk_zp2   193K  1.75T  63.9K  /zp2

# dtrace -n 'ssd_send_scsi_SYNCHRONIZE_CACHE:entry{@a[stack(20)]=count()}' \
> -n 'sd_send_scsi_SYNCHRONIZE_CACHE:entry{@a[stack(20)]=count()}'
dtrace: description 'ssd_send_scsi_SYNCHRONIZE_CACHE:entry' matched 1 probe
dtrace: description 'sd_send_scsi_SYNCHRONIZE_CACHE:entry' matched 1 probe
^C
#
: no output from zp1 test.

# dtrace -n 'ssd_send_scsi_SYNCHRONIZE_CACHE:entry{@a[stack(20)]=count()}' \
> -n 'sd_send_scsi_SYNCHRONIZE_CACHE:entry{@a[stack(20)]=count()}'
dtrace: description 'ssd_send_scsi_SYNCHRONIZE_CACHE:entry' matched 1 probe
dtrace: description 'sd_send_scsi_SYNCHRONIZE_CACHE:entry' matched 1 probe
^C

              ssd`ssdioctl+0x17a8
              zfs`vdev_disk_io_start+0xa0
              zfs`zio_ioctl+0xec
              zfs`vdev_config_sync+0xe0
              zfs`spa_sync+0x2ec
              zfs`txg_sync_thread+0x134
              unix`thread_start+0x4
               12

              ssd`ssdioctl+0x17a8
              zfs`vdev_disk_io_start+0xa0
              zfs`zio_ioctl+0xec
              zfs`vdev_config_sync+0x258
              zfs`spa_sync+0x2ec
              zfs`txg_sync_thread+0x134
              unix`thread_start+0x4
               12
#
: above output from zp2 test.

# dtrace -n 'ssdioctl:entry{@a[stack(20)]=count()}' -n 'sdioctl:entry{@a[stack(20)]=count()}'
dtrace: description 'ssdioctl:entry' matched 1 probe
dtrace: description 'sdioctl:entry' matched 1 probe
^C

              zfs`vdev_disk_io_start+0xa0
              zfs`zio_ioctl+0xec
              zfs`vdev_config_sync+0xe0
              zfs`spa_sync+0x2ec
              zfs`txg_sync_thread+0x134
              unix`thread_start+0x4
                6
#
: above is from zp2 test.

# dtrace -n 'vdev_config_sync:entry{@a[stack(20)]=count()}'
dtrace: description 'vdev_config_sync:entry' matched 1 probe
^C

              zfs`spa_sync+0x2ec
              zfs`txg_sync_thread+0x134
              unix`thread_start+0x4
               12
#
: above is from zp2 test.

# dtrace -n 'vdev_config_sync:entry{@a[stack(20)]=count()}'
dtrace: description 'vdev_config_sync:entry' matched 1 probe
^C

              zfs`spa_sync+0x2ec
              zfs`txg_sync_thread+0x134
              unix`thread_start+0x4
                6
#
: above is from zp1 test.

# dtrace -n 'ssdioctl:entry{@a[stack(20)]=count()}' -n 'sdioctl:entry{@a[stack(20)]=count()}'
dtrace: description 'ssdioctl:entry' matched 1 probe
dtrace: description 'sdioctl:entry' matched 1 probe
^C

              zfs`vdev_disk_io_start+0xa0
              zfs`zio_ioctl+0xec
              zfs`vdev_config_sync+0xe0
              zfs`spa_sync+0x2ec
              zfs`txg_sync_thread+0x134
              unix`thread_start+0x4
               14

              zfs`vdev_disk_io_start+0xa0
              zfs`zio_ioctl+0xec
              zfs`vdev_config_sync+0x258
              zfs`spa_sync+0x2ec
              zfs`txg_sync_thread+0x134
              unix`thread_start+0x4
               14
#
: above is from zp1 test.

# dtrace -n 'ssd_send_scsi_SYNCHRONIZE_CACHE:entry{@a[stack(20)]=count()}' \
> -n 'sd_send_scsi_SYNCHRONIZE_CACHE:entry{@a[stack(20)]=count()}'
dtrace: description 'ssd_send_scsi_SYNCHRONIZE_CACHE:entry' matched 1 probe
dtrace: description 'sd_send_scsi_SYNCHRONIZE_CACHE:entry' matched 1 probe
^C
#
: above is from zp1 test, i.e. no sync-cache calls happened.
================================================================

Regards,

Marion
dudekula mastan
2007-Feb-15 10:08 UTC
[zfs-discuss] Is ZFS file system supports short writes ?
Hi all,

Please let me know whether ZFS supports short writes.

Thanks & Regards
Masthan
Hello dudekula,

Thursday, February 15, 2007, 11:08:26 AM, you wrote:

> Hi all,
>
> Please let me know whether ZFS supports short writes.

And what are short writes?

-- 
Best regards,
Robert                          mailto:rmilkowski@task.gda.pl
                                http://milek.blogspot.com
Torrey McMahon
2007-Feb-15 18:36 UTC
[zfs-discuss] Is ZFS file system supports short writes ?
Robert Milkowski wrote:
> Hello dudekula,
>
> Thursday, February 15, 2007, 11:08:26 AM, you wrote:
>
>> Hi all,
>>
>> Please let me know whether ZFS supports short writes.
>
> And what are short writes?

http://www.pittstate.edu/wac/newwlassignments.html#ShortWrites

:-P
dudekula mastan
2007-Feb-17 09:42 UTC
[zfs-discuss] Is ZFS file system supports short writes ?
If a write call attempts to write X bytes of data but actually writes only x bytes (where x < X), then we call that write a short write.

-Masthan

Torrey McMahon <tmcmahon2 at yahoo.com> wrote:
> Robert Milkowski wrote:
>> And what are short writes?
>
> http://www.pittstate.edu/wac/newwlassignments.html#ShortWrites
>
> :-P
dudekula mastan writes:
 > If a write call attempts to write X bytes of data but actually writes only
 > x bytes (where x < X), then we call that write a short write.
 >
 > -Masthan

What kind of support do you want/need?

-r
So, that would be an "error", and, other than reporting it accurately, what would you want ZFS to do to "support" it?

dudekula mastan wrote:
> If a write call attempts to write X bytes of data but actually writes only
> x bytes (where x < X), then we call that write a short write.
>
> -Masthan

>> Please let me know whether ZFS supports short writes.
Frank Hofmann
2007-Feb-23 14:13 UTC
[zfs-discuss] Is ZFS file system supports short writes ?
On Fri, 23 Feb 2007, Dan Mick wrote:

> So, that would be an "error", and, other than reporting it accurately, what
> would you want ZFS to do to "support" it?

It's not an error for write(2) to return with fewer bytes written than requested. In some situations that's pretty much expected, for example when writing to network sockets.

But filesystems may also decide to do short writes, e.g. when the write would extend the file but the filesystem runs out of space before all of the write completes; it's up to the implementation whether it returns ENOSPC for the whole write or returns the number of bytes successfully written. The same applies if you exceed the rlimits or quota allocations, or if the write is interrupted before completion.

> dudekula mastan wrote:
>> If a write call attempts to write X bytes of data but actually writes only
>> x bytes (where x < X), then we call that write a short write.
>> -Masthan
>
>>> Please let me know whether ZFS supports short writes.

In the sense that it does them? Well, it's UNIX/POSIX standard to do them; the write(2) manpage puts it like this:

     If a write() requests that more bytes be written than there
     is room for - for example, if the write would exceed the
     process file size limit (see getrlimit(2) and ulimit(2)),
     the system file size limit, or the free space on the device -
     only as many bytes as there is room for will be written.
     For example, suppose there is space for 20 bytes more in a
     file before reaching a limit. A write() of 512 bytes returns
     20. The next write() of a non-zero number of bytes gives a
     failure return (except as noted for pipes and FIFO below).

I.e., you get a partial write before a failing write. ZFS behaves like this (on quota, definitely; "filesystem full" on ZFS is a bit different due to the space needs for COW), just as other filesystems do.

Where have you encountered a filesystem _NOT_ supporting this behaviour?

FrankH.
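Given those semantics, portable code simply loops until everything is written or a real error comes back. Here is a minimal C sketch of that retry loop; the helper name is made up, and this is generic POSIX usage rather than anything ZFS-specific:

    /*
     * Sketch of a write loop that tolerates short writes and EINTR,
     * per the write(2) semantics quoted above.
     */
    #include <errno.h>
    #include <sys/types.h>
    #include <unistd.h>

    ssize_t
    write_all(int fd, const void *buf, size_t len)
    {
            const char *p = buf;
            size_t left = len;

            while (left > 0) {
                    ssize_t n = write(fd, p, left);

                    if (n < 0) {
                            if (errno == EINTR)
                                    continue;       /* interrupted: retry */
                            return (-1);            /* real error (ENOSPC, EDQUOT, ...) */
                    }
                    if (n == 0)
                            break;                  /* no progress; give up */
                    p += n;                         /* short write: advance and retry */
                    left -= (size_t)n;
            }
            return ((ssize_t)(len - left));         /* bytes actually written */
    }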