Yi Zhang
2011-Feb-05 16:10 UTC
[zfs-discuss] Understanding directio, O_DSYNC and zfs_nocacheflush on ZFS
Hi all,

I'm trying to achieve the same effect as UFS directio on ZFS, and here is
what I did:

1. Set primarycache of the zfs filesystem to metadata and secondarycache to
   none, and recordsize to 8K (to match the unit size of my writes).
2. Run my test program (code below) with different options and measure the
   running time.
   a) open the file without the O_DSYNC flag: 0.11s.
      This doesn't look like directio is in effect, because the same test on
      UFS took 2s. So I went on with more experiments with the O_DSYNC flag
      set. I know that directio and O_DSYNC are two different things, but I
      thought the flag would force synchronous writes and achieve what
      directio does (and more).
   b) open the file with the O_DSYNC flag: 147.26s
   c) same as b) but with zfs_nocacheflush enabled: 5.87s

My questions are:
1. With my primarycache and secondarycache settings, the FS shouldn't buffer
   reads and writes anymore. Wouldn't that be equivalent to O_DSYNC? Why are
   a) and b) so different?
2. My understanding is that zfs_nocacheflush essentially removes the sync
   command sent to the device, which cancels out the O_DSYNC flag. Why are
   b) and c) so different?
3. Does the ZIL have anything to do with these results?

Thanks in advance for any suggestion/insight!
Yi

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/time.h>

int main(int argc, char **argv)
{
    struct timeval tim;
    gettimeofday(&tim, NULL);
    double t1 = tim.tv_sec + tim.tv_usec/1000000.0;

    char a[8192];
    int fd = open(argv[1], O_RDWR|O_CREAT|O_TRUNC, 0660);
    //int fd = open(argv[1], O_RDWR|O_CREAT|O_TRUNC|O_DSYNC, 0660);
    if (argv[2][0] == '1')
        directio(fd, DIRECTIO_ON);   /* Solaris directio(3C); no effect on ZFS */

    int i;
    for (i = 0; i < 10000; ++i)
        pwrite(fd, a, sizeof(a), i*8192);
    close(fd);

    gettimeofday(&tim, NULL);
    double t2 = tim.tv_sec + tim.tv_usec/1000000.0;
    printf("%f\n", t2-t1);
    return 0;
}
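For reference, the property settings in step 1 amount to roughly the
following (the dataset name tank/test is just a placeholder for your own):

    zfs set primarycache=metadata tank/test
    zfs set secondarycache=none tank/test
    zfs set recordsize=8k tank/test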
Richard Elling
2011-Feb-07 05:25 UTC
[zfs-discuss] Understanding directio, O_DSYNC and zfs_nocacheflush on ZFS
On Feb 5, 2011, at 8:10 AM, Yi Zhang wrote:
> I'm trying to achieve the same effect as UFS directio on ZFS, and here
> is what I did:

Solaris UFS directio has three functions:
        1. improved async code path
        2. multiple concurrent writers
        3. no buffering

Of the three, #1 and #2 were designed into ZFS from day 1, so there is
nothing to set or change to take advantage of the feature.

> 1. Set primarycache of the zfs filesystem to metadata and secondarycache
>    to none, and recordsize to 8K (to match the unit size of my writes).
> 2. Run my test program (code below) with different options and measure
>    the running time.
>    a) open the file without the O_DSYNC flag: 0.11s. [...]
>    I know that directio and O_DSYNC are two different things, but I
>    thought the flag would force synchronous writes and achieve what
>    directio does (and more).

Directio and O_DSYNC are two different features.

> b) open the file with the O_DSYNC flag: 147.26s

ouch

> c) same as b) but with zfs_nocacheflush enabled: 5.87s

Is your pool created from a single HDD?

> My questions are:
> 1. With my primarycache and secondarycache settings, the FS shouldn't
>    buffer reads and writes anymore. Wouldn't that be equivalent to
>    O_DSYNC? Why are a) and b) so different?

No. O_DSYNC deals with when the I/O is committed to media.

> 2. My understanding is that zfs_nocacheflush essentially removes the sync
>    command sent to the device, which cancels out the O_DSYNC flag. Why
>    are b) and c) so different?

No. Disabling the cache flush means that the volatile write buffer in the
disk is not flushed. In other words, disabling the cache flush is in direct
conflict with the semantics of O_DSYNC.

> 3. Does the ZIL have anything to do with these results?

Yes. The ZIL is used for meeting the O_DSYNC requirements. This has nothing
to do with buffering. More details are in the ZFS Best Practices Guide.
 -- richard
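For anyone reproducing result c): zfs_nocacheflush is a kernel tunable, not
a dataset property. On Solaris it is typically toggled roughly like this
(only safe when every device backing the pool has a non-volatile write
cache):

    # persistent: add to /etc/system and reboot
    set zfs:zfs_nocacheflush = 1

    # or live, via mdb (does not survive a reboot)
    echo zfs_nocacheflush/W0t1 | mdb -kw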
Yi Zhang
2011-Feb-07 14:15 UTC
[zfs-discuss] Understanding directio, O_DSYNC and zfs_nocacheflush on ZFS
On Mon, Feb 7, 2011 at 12:25 AM, Richard Elling
<richard.elling at gmail.com> wrote:
> Solaris UFS directio has three functions:
>        1. improved async code path
>        2. multiple concurrent writers
>        3. no buffering
>
> Of the three, #1 and #2 were designed into ZFS from day 1, so there is
> nothing to set or change to take advantage of the feature.

Thanks for the comments, Richard. All I want is to achieve #3 on ZFS. But as
I said, apparently 2.a) below didn't give me that. Do you have any
suggestion?

>> c) same as b) but with zfs_nocacheflush enabled: 5.87s
>
> Is your pool created from a single HDD?

Yes, it is. Do you have an explanation for the b) case? I also tried O_DSYNC
AND directio on UFS; the time is on the same order as directio without
O_DSYNC on UFS (see below). This dramatic difference between UFS and ZFS is
puzzling me...

UFS:  directio=on, no O_DSYNC -> 2s
      directio=on, O_DSYNC    -> 5s
ZFS:  no caching, no O_DSYNC  -> 0.11s
      no caching, O_DSYNC     -> 147s
Roch
2011-Feb-07 15:26 UTC
[zfs-discuss] Understanding directio, O_DSYNC and zfs_nocacheflush on ZFS
On Feb 7, 2011, at 06:25, Richard Elling wrote:
>> b) open the file with the O_DSYNC flag: 147.26s
>
> ouch

How big a file?
Does the result hold if you don't truncate?

-r
Yi Zhang
2011-Feb-07 16:08 UTC
[zfs-discuss] Understanding directio, O_DSYNC and zfs_nocacheflush on ZFS
On Mon, Feb 7, 2011 at 10:26 AM, Roch <roch.bourbonnais at oracle.com> wrote:
>>> b) open the file with the O_DSYNC flag: 147.26s
>>
>> ouch
>
> How big a file?
> Does the result hold if you don't truncate?

The file is 8K * 10000, about 80MB. I removed the O_TRUNC flag and the
results stayed the same...
Brandon High
2011-Feb-07 18:06 UTC
[zfs-discuss] Understanding directio, O_DSYNC and zfs_nocacheflush on ZFS
On Mon, Feb 7, 2011 at 6:15 AM, Yi Zhang <yizhang84 at gmail.com> wrote:
> Thanks for the comments, Richard. All I want is to achieve #3 on ZFS. But
> as I said, apparently 2.a) below didn't give me that. Do you have any
> suggestion?

Don't. Use a ZIL, which will meet the requirements for synchronous IO.
Set primarycache to metadata to prevent caching reads.

ZFS is a very different beast than UFS and doesn't require the same tuning.

-B

--
Brandon High : bhigh at freaks.com
Yi Zhang
2011-Feb-07 18:29 UTC
[zfs-discuss] Understanding directio, O_DSYNC and zfs_nocacheflush on ZFS
On Mon, Feb 7, 2011 at 1:06 PM, Brandon High <bhigh at freaks.com> wrote:
> Don't. Use a ZIL, which will meet the requirements for synchronous IO.
> Set primarycache to metadata to prevent caching reads.
>
> ZFS is a very different beast than UFS and doesn't require the same tuning.

I already set primarycache to metadata, and I'm not concerned about caching
reads, but about caching writes. It appears writes are indeed cached,
judging from the time of 2.a) compared to UFS+directio. More specifically,
80MB/2s = 40MB/s (UFS+directio) looks realistic, while 80MB/0.11s = ~730MB/s
(ZFS with primarycache=metadata) doesn't.
Brandon High
2011-Feb-07 18:51 UTC
[zfs-discuss] Understanding directio, O_DSYNC and zfs_nocacheflush on ZFS
On Mon, Feb 7, 2011 at 10:29 AM, Yi Zhang <yizhang84 at gmail.com> wrote:
> I already set primarycache to metadata, and I'm not concerned about
> caching reads, but about caching writes. It appears writes are indeed
> cached, judging from the time of 2.a) compared to UFS+directio.

You're trying to force a solution that isn't relevant for the situation.
ZFS is not UFS, and solutions that are required for UFS to work correctly
are not needed with ZFS.

Yes, writes are cached, but all the POSIX requirements for synchronous IO
are met by the ZIL. As long as your storage devices, be they SAN, DAS or
somewhere in between, respect cache flushes, you're fine. If you need more
performance, use a slog device that respects cache flushes. You don't need
to worry about whether writes are being cached, because any data that is
written synchronously will be committed to stable storage before the write
returns.

-B

--
Brandon High : bhigh at freaks.com
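To make the slog suggestion concrete, adding a separate log device is a
one-line zpool operation; the pool and device names below are placeholders,
and the device should be a low-latency SSD that honors cache flushes (or has
supercap/battery protection):

    # dedicate a fast, flush-respecting device as a separate log
    zpool add tank log c4t2d0

    # or mirror the slog so a log-device failure cannot lose sync writes
    zpool add tank log mirror c4t2d0 c4t3d0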
Yi Zhang
2011-Feb-07 19:17 UTC
[zfs-discuss] Understanding directio, O_DSYNC and zfs_nocacheflush on ZFS
On Mon, Feb 7, 2011 at 1:51 PM, Brandon High <bhigh at freaks.com> wrote:
> Yes, writes are cached, but all the POSIX requirements for synchronous IO
> are met by the ZIL. [...] You don't need to worry about whether writes are
> being cached, because any data that is written synchronously will be
> committed to stable storage before the write returns.

Maybe I didn't make my intention clear. UFS with directio is reasonably
close to a raw disk from my application's perspective: when the app writes
to a file location, no buffering happens. My goal is to find a way to
duplicate this on ZFS.

Setting primarycache didn't eliminate the buffering, and using O_DSYNC
(whose side effects include eliminating buffering) made it ridiculously
slow: none of the things I tried eliminated buffering, and just buffering,
on ZFS.

From the discussion so far my feeling is that ZFS is so different from UFS
that there's simply no way to achieve this goal...
Roy Sigurd Karlsbakk
2011-Feb-07 19:29 UTC
[zfs-discuss] Understanding directio, O_DSYNC and zfs_nocacheflush on ZFS
> Maybe I didn't make my intention clear. UFS with directio is reasonably
> close to a raw disk from my application's perspective: when the app writes
> to a file location, no buffering happens. My goal is to find a way to
> duplicate this on ZFS.

There really is no need to do this on ZFS. Using an SLOG device (ZIL on an
SSD) will allow ZFS to do its caching transparently to the application.
Successive read operations will read from the cache if that's available,
and writes will go to the SLOG _and_ the ARC for successive commits. As
long as the SLOG device supports cache flush, or has a supercap/BBU, your
data will be safe.

> Setting primarycache didn't eliminate the buffering, and using O_DSYNC
> (whose side effects include eliminating buffering) made it ridiculously
> slow: none of the things I tried eliminated buffering, and just buffering,
> on ZFS.
>
> From the discussion so far my feeling is that ZFS is so different from
> UFS that there's simply no way to achieve this goal...

See above - ZFS is quite safe to use for this, given a good hardware
configuration.

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
roy at karlsbakk.net
http://blogg.karlsbakk.net/
Yi Zhang
2011-Feb-07 19:39 UTC
[zfs-discuss] Understanding directio, O_DSYNC and zfs_nocacheflush on ZFS
On Mon, Feb 7, 2011 at 2:21 PM, Brandon High <bhigh at freaks.com> wrote:
> On Mon, Feb 7, 2011 at 11:17 AM, Yi Zhang <yizhang84 at gmail.com> wrote:
>> Maybe I didn't make my intention clear. UFS with directio is reasonably
>> close to a raw disk from my application's perspective: when the app
>> writes to a file location, no buffering happens. My goal is to find a
>> way to duplicate this on ZFS.
>
> Step back and consider *why* you need no buffering.

I'm writing a database-like application which manages its own page buffer,
so I want to disable the buffering at the OS/FS level. UFS with directio
suits my need perfectly, but I also want to try it on ZFS because ZFS
doesn't directly overwrite a page that is being modified (it allocates a
new page instead), and thus it represents a different category of FS. I
want to measure the performance difference of my app on UFS and ZFS and
show how FS-dependent my app is.

>> From the discussion so far my feeling is that ZFS is so different from
>> UFS that there's simply no way to achieve this goal...
>
> ZFS is not UFS, and solutions that are required for UFS to work correctly
> are not needed with ZFS.
>
> -B
Nico Williams
2011-Feb-07 19:42 UTC
[zfs-discuss] Understanding directio, O_DSYNC and zfs_nocacheflush on ZFS
On Mon, Feb 7, 2011 at 1:17 PM, Yi Zhang <yizhang84 at gmail.com> wrote:
> Maybe I didn't make my intention clear. UFS with directio is reasonably
> close to a raw disk from my application's perspective: when the app writes
> to a file location, no buffering happens. My goal is to find a way to
> duplicate this on ZFS.

You're still mixing directio and O_DSYNC.

O_DSYNC is like calling fsync(2) after every write(2). fsync(2) is useful
for obtaining some limited transactional semantics, as well as durability
semantics. In ZFS you don't need to call fsync(2) to get those
transactional semantics, but you do need to call fsync(2) to get the
durability semantics.

Now, in ZFS fsync(2) implies a synchronous I/O operation involving
significantly more than just the data blocks you wrote to, which means that
O_DSYNC on ZFS is significantly slower than on UFS. You can address this in
one of two ways: a) you might realize that you don't need every write(2) to
be durable, and stop using O_DSYNC, or b) you might get a fast ZIL device.

I'm betting that if you look carefully at your application's requirements
you'll probably conclude that you don't need O_DSYNC at all. Perhaps you
can tell us more about your application.

> Setting primarycache didn't eliminate the buffering, and using O_DSYNC
> (whose side effects include eliminating buffering) made it ridiculously
> slow: none of the things I tried eliminated buffering, and just buffering,
> on ZFS.
>
> From the discussion so far my feeling is that ZFS is so different from
> UFS that there's simply no way to achieve this goal...

You've not really stated your application's requirements. You may be
convinced that you need O_DSYNC, but chances are that you don't. And yes,
it's possible that you'd need O_DSYNC on UFS but not on ZFS.

Nico
--
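To make that concrete, here is a minimal sketch of the pattern Nico
describes: open without O_DSYNC and call fsync(2) only at the points where
the application actually needs durability. The function name and the batch
size of 100 are arbitrary choices for illustration:

    #include <fcntl.h>
    #include <unistd.h>

    /* Write 10000 8K pages, forcing durability only every 100 pages
     * instead of on every single write. */
    int write_pages(const char *path)
    {
        char page[8192] = {0};
        int i;
        int fd = open(path, O_RDWR | O_CREAT, 0660);  /* note: no O_DSYNC */
        if (fd < 0)
            return -1;

        for (i = 0; i < 10000; ++i) {
            pwrite(fd, page, sizeof(page), (off_t)i * 8192);
            if (i % 100 == 99)
                fsync(fd);      /* commit point chosen by the application */
        }
        fsync(fd);              /* final commit before close */
        close(fd);
        return 0;
    }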
Yi Zhang
2011-Feb-07 19:49 UTC
[zfs-discuss] Understanding directio, O_DSYNC and zfs_nocacheflush on ZFS
On Mon, Feb 7, 2011 at 2:42 PM, Nico Williams <nico at cryptonector.com> wrote:
> I'm betting that if you look carefully at your application's requirements
> you'll probably conclude that you don't need O_DSYNC at all. Perhaps you
> can tell us more about your application.

Please see my previous email for a high-level discussion of my application.
I know that I don't really need O_DSYNC. The reason why I tried it is to
get the side effect of no buffering, which is my ultimate goal.
Roch
2011-Feb-07 19:59 UTC
[zfs-discuss] Understanding directio, O_DSYNC and zfs_nocacheflush on ZFS
On Feb 7, 2011, at 17:08, Yi Zhang wrote:
>> How big a file?
>> Does the result hold if you don't truncate?
>
> The file is 8K * 10000, about 80MB. I removed the O_TRUNC flag and the
> results stayed the same...

OK, if it had been a 2TB file I could have seen an opening; not for 80MB
though. So it's baffling... unless! It's not just the open that takes 147s,
it's the whole run, 10000 writes.

10000 sync writes without an SSD would take about 150 seconds at 68 IO/s.
Without the O_DSYNC flag, all writes go to memory, so it's expected to take
0.11s to transfer 80000K at ~750MB/sec (memcopy speed). O_DSYNC +
zfs_nocacheflush is in between: every write transfers data to an unstable
disk cache but then does not flush it. At some point the cache may overflow,
and some writes will see high latency while data is transferred from the
disk cache to the disk platter. So those results are in line with what
everybody has been seeing before.

Note that to compare with UFS, since UFS does not issue a cache flush after
every sync write like ZFS correctly does, you have to compare UFS with the
write cache disabled to ZFS (with or without the write cache). After
deleting a ZFS pool, the disk write cache is left on, so a UFS filesystem
will appear inordinately fast unless you turn off the write cache with
"format -e; cache; write_cache; disable".

-r
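For completeness, the write-cache toggle Roch mentions lives in the
interactive expert menu of format(1M); the session looks roughly like this
(disk-selection prompts omitted, and the exact prompt text may vary):

    # format -e
    format> cache
    cache> write_cache
    write_cache> display
    write_cache> disable
    write_cache> quit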
Bill Sommerfeld
2011-Feb-07 20:14 UTC
[zfs-discuss] Understanding directio, O_DSYNC and zfs_nocacheflush on ZFS
On 02/07/11 11:49, Yi Zhang wrote:
> The reason why I tried it is to get the side effect of no buffering,
> which is my ultimate goal.

ultimate = "final". You must have a goal beyond the elimination of
buffering in the filesystem.

If the writes are made durable by zfs when you need them to be durable, why
does it matter that it may buffer data while it is doing so?

                                        - Bill
Yi Zhang
2011-Feb-07 20:39 UTC
[zfs-discuss] Understanding directio, O_DSYNC and zfs_nocacheflush on ZFS
On Mon, Feb 7, 2011 at 2:54 PM, Nico Williams <nico at cryptonector.com> wrote:
> ZFS cannot not buffer. The reason is that ZFS likes to batch transactions
> into as large a contiguous write to disk as possible. The ZIL exists to
> support fsync(2) operations that must commit before the rest of a ZFS
> transaction. In other words: there's always some amount of buffering of
> writes in ZFS.

In that case, ZFS doesn't suit my needs.

> As to read buffering, why would you want to disable that?

My application manages its own buffer, and reads/writes go through that
buffer first. I don't want double buffering.

> You still haven't told us what your application does. Or why you want to
> get close to the metal. Simply telling us that you need "no buffering"
> doesn't really help us help you -- with that approach you'll simply end
> up believing that ZFS is not appropriate for your needs, even though it
> well might be.

It's like Berkeley DB at a high level, though it doesn't require
transaction support, durability, etc. I'm measuring its performance and
don't want the FS buffer to pollute my results (hence directio).
Yi Zhang
2011-Feb-07 20:49 UTC
[zfs-discuss] Understanding directio, O_DSYNC and zfs_nocacheflush on ZFS
On Mon, Feb 7, 2011 at 3:14 PM, Bill Sommerfeld <sommerfeld at alum.mit.edu> wrote:
> ultimate = "final". You must have a goal beyond the elimination of
> buffering in the filesystem.
>
> If the writes are made durable by zfs when you need them to be durable,
> why does it matter that it may buffer data while it is doing so?

If buffering is on, the running time of my app doesn't reflect the actual
I/O cost. My goal is to accurately measure the time of the I/O. With
buffering on, ZFS would batch up a bunch of writes and change both the
original I/O activity and the time.
David Dyer-Bennet
2011-Feb-07 21:01 UTC
[zfs-discuss] Understanding directio, O_DSYNC and zfs_nocacheflush on ZFS
On Mon, February 7, 2011 14:49, Yi Zhang wrote:
> If buffering is on, the running time of my app doesn't reflect the actual
> I/O cost. My goal is to accurately measure the time of the I/O. With
> buffering on, ZFS would batch up a bunch of writes and change both the
> original I/O activity and the time.

I'm not sure I understand what you're trying to measure (which seems to be
your top priority). Achievable performance with ZFS would be better using
suitable caching; normally that's the benchmark statistic people would care
about.

--
David Dyer-Bennet, dd-b at dd-b.net; http://dd-b.net/
Yi Zhang
2011-Feb-07 21:10 UTC
[zfs-discuss] Understanding directio, O_DSYNC and zfs_nocacheflush on ZFS
On Mon, Feb 7, 2011 at 3:47 PM, Nico Williams <nico at cryptonector.com> wrote:
> So your concern is that you don't want to pay twice the memory cost for
> buffering?
>
> If so, set primarycache as described earlier and drop the O_DSYNC flag.
>
> ZFS will then buffer your writes, but only for a little while, and you
> should want it to, because ZFS will almost certainly do a better job of
> batching transactions than your application would. With ZFS you'll benefit
> from: advanced volume management, snapshots/clones, dedup, Merkle hash
> trees (i.e., corruption detection), encryption, and so on. You'll almost
> certainly not be implementing any of those in your application...
>
> You should do three things: a) set primarycache=metadata, b) set
> recordsize to whatever your application's page size is (e.g., 8KB),
> c) stop using O_DSYNC.
>
> Tell us how that goes. I suspect the performance will be much better.

This is actually what I did for 2.a) in my original post. My concern there
is that ZFS's internal write buffering makes it hard to get a grip on my
application's behavior. I want to present my application's "raw" I/O
performance without too many outside factors... UFS plus directio gives me
exactly (or close to) that, but ZFS doesn't...

Of course, in the final deployment it would be great to be able to take
advantage of ZFS's advanced features such as I/O optimization.
Bill Sommerfeld
2011-Feb-07 21:17 UTC
[zfs-discuss] Understanding directio, O_DSYNC and zfs_nocacheflush on ZFS
On 02/07/11 12:49, Yi Zhang wrote:
> If buffering is on, the running time of my app doesn't reflect the actual
> I/O cost. My goal is to accurately measure the time of the I/O. With
> buffering on, ZFS would batch up a bunch of writes and change both the
> original I/O activity and the time.

if batching main pool writes improves the overall throughput of the system
over a more naive i/o scheduling model, don't you want your users to see
the improvement in performance from that batching?

why not set up a steady-state sustained workload that will run for hours,
and measure how long it takes the system to commit each 1000 or 10000
transactions in the middle of the steady-state workload?
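A minimal sketch of that kind of steady-state measurement, timing each batch
of synchronous commits rather than the whole run; the page size, batch size,
and the endless loop are assumptions for illustration:

    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/time.h>

    #define PAGE   8192
    #define BATCH  1000                 /* commits per timed interval */

    static double now(void)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec / 1000000.0;
    }

    int main(int argc, char **argv)
    {
        char page[PAGE] = {0};
        long i = 0;
        int fd = open(argv[1], O_RDWR | O_CREAT, 0660);
        double t0 = now();

        for (;;) {                              /* run until interrupted */
            pwrite(fd, page, PAGE, (off_t)(i % 10000) * PAGE);
            fsync(fd);                          /* one synchronous commit */
            if (++i % BATCH == 0) {
                double t1 = now();
                printf("%ld commits; last %d took %.3f s\n", i, BATCH, t1 - t0);
                t0 = t1;
            }
        }
    }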
Erik Trimble
2011-Feb-07 22:38 UTC
[zfs-discuss] Understanding directio, O_DSYNC and zfs_nocacheflush on ZFS
On 2/7/2011 1:10 PM, Yi Zhang wrote:
[snip]
> This is actually what I did for 2.a) in my original post. My concern there
> is that ZFS's internal write buffering makes it hard to get a grip on my
> application's behavior. I want to present my application's "raw" I/O
> performance without too many outside factors... UFS plus directio gives me
> exactly (or close to) that, but ZFS doesn't...
>
> Of course, in the final deployment it would be great to be able to take
> advantage of ZFS's advanced features such as I/O optimization.

And there's your answer.

You seem to care about doing "bare-metal" I/O for tuning of your
application, so you can make consistent measurements, not for actual usage
in production.

Therefore, do what's implied above: develop your app on UFS with directio
to work out the application issues and tune it. When you deploy it, use ZFS
and its caching techniques to get maximum (though not absolutely
consistently measurable) performance for the already-tuned app.

--
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
Richard Elling
2011-Feb-08 02:05 UTC
[zfs-discuss] Understanding directio, O_DSYNC and zfs_nocacheflush on ZFS
On Feb 7, 2011, at 1:10 PM, Yi Zhang wrote:
> This is actually what I did for 2.a) in my original post. My concern there
> is that ZFS's internal write buffering makes it hard to get a grip on my
> application's behavior. I want to present my application's "raw" I/O
> performance without too many outside factors... UFS plus directio gives me
> exactly (or close to) that, but ZFS doesn't...

In the bad old days when processors only had one memory controller, one
could make an argument that not copying was an important optimization.
Today, with the fast memory controllers (plural) we have, memory copies
don't hurt very much; other factors will dominate. Of course, with dtrace
it should be relatively easy to measure the copy.

> Of course, in the final deployment it would be great to be able to take
> advantage of ZFS's advanced features such as I/O optimization.

Nice save :-) otherwise we would wonder why you don't just use raw disk if
you are so concerned about memory copies :-)
 -- richard
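One rough way to get at the per-write cost (including the in-kernel copy)
while the test runs is a DTrace latency histogram over pwrite(2); a sketch,
with the command line and file path as placeholders:

    # histogram of time spent in each pwrite(2) call by the test program
    dtrace -n '
      syscall::pwrite:entry /pid == $target/ { self->ts = timestamp; }
      syscall::pwrite:return /self->ts/ {
          @["pwrite latency (ns)"] = quantize(timestamp - self->ts);
          self->ts = 0;
      }' -c './a.out /tank/test/file 0'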