It wouldn't be proper to start my first post here without congratulations and thanks to the ZFS team for such an impressive piece of work.

Anyway, on to my query. I've been trying out ZFS, with a particular focus on reducing latency in a specific application. This application has a fair amount of random writing going on in the background (which, of course, ZFS will make sequential), but the latency-critical step is syncing a transaction into a transaction log. This is a sequentially written file, and for persistence we must guarantee the transaction to the physical media before acknowledging it. The theory is that the "all writes are sequential" rule of ZFS will make a big difference to the worst-case and average latency here, although obviously the best case won't change.

The "guarantee to physical media" bit is done using msync(3C), which of course results in a call to memcntl(2) passing the MC_SYNC flag. What I'm finding is that, on ZFS, this doesn't actually synchronously write the passed pages to disk before returning - it returns far too quickly (in a few microseconds) for the vanilla SCSI disk I'm using. A quick prod with DTrace shows the IO happening after the memcntl call has returned. In case it's helpful, here's the call flow generated by my D script, showing what's happening in the kernel in response to the memcntl syscall:

CPU FUNCTION
 17  -> memcntl
 17    -> valid_usr_range
 17    <- valid_usr_range
 17    -> as_ctl
 17      -> as_segat
 17      <- as_segat
 17      -> segvn_sync
 17        -> fop_putpage
 17        <- fop_putpage
 17        -> zfs_putpage
 17          -> page_lookup
 17          <- page_lookup
 17          -> page_lookup_create
 17          <- page_lookup_create
 17        <- zfs_putpage
 17      <- segvn_sync
 17    <- as_ctl
 17  <- memcntl

A search turned up bug ID 6281075, category kernel:zfs, with synopsis "Support memcntl() requests (like MC_INVALIDATE)". It sounds like this means what I'm trying to do isn't implemented yet - is this correct, or have I found a bug? Presumably it will be implemented in the future?

[As a side note, you're probably wondering why we don't just use O_DSYNC when opening the file, and then just write(2) to it. The reason is that it's very slow on UFS for large buffers - effectively linear in the number of pages crossed, or about 6ms per 8KBytes on an otherwise idle SCSI disk on SPARC. The good news is that this is very fast on ZFS - on the same disk, about 7ms constant for up to 64KBytes, and about 20ms constant for 256KBytes. However, this is still a bit slower than the msync(3C) approach on UFS.]

Thanks in advance.

--

Philip Beevers
mailto:philip.beevers at ntlworld.com
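For reference, a minimal self-contained sketch of the O_DSYNC alternative described in the side note above. This is not code from the thread; the file name and transaction size are illustrative only.

/*
 * Sketch of the O_DSYNC write path: each write(2) is not acknowledged
 * until the data has reached stable storage.  "txlog" and the chunk
 * size are hypothetical.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
        const size_t chunk = 8192;              /* one transaction record */
        char *buf = malloc(chunk);
        int fd;

        if (buf == NULL) {
                perror("malloc");
                exit(1);
        }
        memset(buf, 0xbc, chunk);

        fd = open("txlog", O_WRONLY | O_CREAT | O_DSYNC, 0666);
        if (fd < 0) {
                perror("open");
                exit(1);
        }

        /* write() returns only once the data is on the media */
        if (write(fd, buf, chunk) != (ssize_t)chunk) {
                perror("write");
                exit(1);
        }

        (void) close(fd);
        free(buf);
        return (0);
}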
Roch Bourbonnais - Performance Engineering
2005-Nov-28 08:50 UTC
[zfs-discuss] ZFS and memcntl(..., MC_SYNC, ...)
A few questions. Between the write(2) and the msync() do you re-mmap the file? Or is the file pre-allocated (say by writing zeros at initialization)?

Otherwise, I wonder if you have a valid address space that maps to the newly written portion and, if not, what is actually being msync-ed.

Just making sure: do you check the msync() return code?

-r

Philip Beevers writes:
 > It wouldn't be proper to start my first post here without congratulations
 > and thanks to the ZFS team for such an impressive piece of work.
[...]
philip.beevers at ntlworld.com
2005-Nov-28 12:49 UTC
[zfs-discuss] ZFS and memcntl(..., MC_SYNC, ...)
Hi Roch,

Thanks for your response.

> A few questions. Between the write(2) and the msync() do you
> re-mmap the file? Or is the file pre-allocated (say by
> writing zeros at initialization)?

My initial prog does this (in a loop):

  write
  mmap
  msync
  munmap

I guess this relies on a unified VM behaviour.

I changed the prog so it worked like this:

  write the whole file out
  In a loop
    mmap
    memcpy to dirty the mapped pages
    msync
    munmap

Unfortunately this still results in the IO happening _after_ the program has exited, although the calls to msync do take significantly longer (a few hundred microseconds, rather than a few tens of microseconds). Another quick DTrace shows that in this case zfs_putapage is getting called (at line 2839 of zfs_vnops.c, if my reading of the source is correct), whereas in the previous case it isn't (as the previous call flow showed).

> Otherwise, I wonder if you have a valid address space that
> maps to the newly written portion and, if not, what is
> actually being msync-ed.

I'm fairly confident that I've mapped the right thing - I've run through my test program in dbx and it seems OK. Here's the source:

#include <sys/mman.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char** argv)
{
  int i = 0;
  int fd = open("foo", O_TRUNC | O_CREAT | O_RDWR, 0666);
  int chunk = 0;
  int trials = 0;
  char* buf = NULL;

  if (argc < 3)
  {
    printf("Usage is: %s <chunk size> <trials>\n", argv[0]);
    exit(1);
  }

  chunk = atoi(argv[1]);
  buf = (char*)malloc(chunk);
  memset(buf, 0xbc, chunk);
  trials = atoi(argv[2]);

  for (i = 0; i < trials; i++)
  {
    void* ptr = NULL;

    /* Append one chunk to the file... */
    if (write(fd, buf, chunk) < 0)
    {
      perror("write");
      exit(1);
    }

    /* ...map the newly written region... */
    ptr = mmap(NULL, chunk, PROT_READ | PROT_WRITE, MAP_SHARED, fd, chunk * i);
    if (ptr == MAP_FAILED)
    {
      perror("mmap");
      exit(1);
    }

    /* ...and synchronously flush it to the media. */
    if (msync(ptr, chunk, MS_SYNC) < 0)
    {
      perror("msync");
      exit(1);
    }

    munmap(ptr, chunk);
  }

  return 0;
}

> Just making sure: do you check the msync() return code?

Of course :-)

Thanks again for your help,

Phil.
philip.beevers at ntlworld.com wrote:

> Hi Roch,
>
> Thanks for your response.
>
[...]
> Unfortunately this still results in the IO happening _after_ the
> program has exited, although the calls to msync do take significantly
> longer (a few hundred microseconds, rather than a few tens of
> microseconds). Another quick DTrace shows that in this case zfs_putapage
> is getting called (at line 2839 of zfs_vnops.c, if my reading of the
> source is correct), whereas in the previous case it isn't (as the
> previous call flow showed).

The MC_MSYNC API doesn't work yet on ZFS; right now the only way to flush ZFS pages is to unmount the filesystem being tested. Yes, it's a pain; the ZFS folks know this needs to be fixed.

- Bart

--
Bart Smaalders                  Solaris Kernel Performance
barts at cyber.eng.sun.com         http://blogs.sun.com/barts
Hi Bart,

> The MC_MSYNC API doesn't work yet on ZFS; right now the only
> way to flush ZFS pages is to unmount the filesystem being
> tested. Yes, it's a pain; the ZFS folks know this needs to be fixed.

Many thanks for confirming that. For now, I'll stick with using O_DSYNC.

Regards,

Phil.
On Mon, Nov 28, 2005 at 09:34:03AM -0800, Bart Smaalders wrote:

> The MC_MSYNC API doesn't work yet on ZFS; right now the only way
> to flush ZFS pages is to unmount the filesystem being tested.
> Yes, it's a pain; the ZFS folks know this needs to be fixed.

Actually, it takes an export/import. See:

	6347986 need CLI and programmatic interface to flush ZFS cache

The first step is to introduce the ARC interfaces to flush all data associated with a filesystem/file/page - currently all it understands are DVAs. Once that's in place, we can make the following changes:

1. Update MC_SYNC to correctly flush cached ZFS data

2. Have unmount flush cached ZFS data for the filesystem

3. Optionally introduce a new CLI option like 'zfs flush' to explicitly flush the cache for a given dataset.

- Eric

--
Eric Schrock, Solaris Kernel Development       http://blogs.sun.com/eschrock
Eric Schrock wrote:

> On Mon, Nov 28, 2005 at 09:34:03AM -0800, Bart Smaalders wrote:
>
>> The MC_MSYNC API doesn't work yet on ZFS; right now the only way
>> to flush ZFS pages is to unmount the filesystem being tested.
>> Yes, it's a pain; the ZFS folks know this needs to be fixed.
>
> Actually, it takes an export/import. See:
>
> 6347986 need CLI and programmatic interface to flush ZFS cache
>
> The first step is to introduce the ARC interfaces to flush all data
> associated with a filesystem/file/page - currently all it understands
> are DVAs. Once that's in place, we can make the following changes:
[...]
> 3. Optionally introduce a new CLI option like 'zfs flush' to explicitly
> flush the cache for a given dataset.

That would be a useful primitive to have for non-root folks who need to do repeatable performance measurements. While you're at it, 'zfs sync' to sync an individual dataset would be great too - it would let owners of a dataset ensure that outstanding writes had reached stable storage without having to push out everybody else's data to disk with it, like sync(2) will apparently do with ZFS. :-)

-Jason
On Wed, 2005-11-30 at 14:19 +1100, Jason Ozolins wrote:

> While you're at it, 'zfs sync'
> to sync an individual dataset would be great too - it would let owners
> of a dataset ensure that outstanding writes had reached stable storage
> without having to push out everybody else's data to disk with it, like
> sync(2) will apparently do with ZFS. :-)

"lockfs -f <path>" has approximately those semantics right now for UFS, but as a command it's otherwise annoyingly UFS-specific. In an ideal world we'd get a generic command rather than another filesystem-specific command; for the most part, applications written in the shell of your choice shouldn't have to know what sort of filesystem is backing their storage...
> The MC_MSYNC API doesn't work yet on ZFS; right now
> the only way to flush ZFS pages is to unmount the filesystem
> being tested. Yes, it's a pain; the ZFS folks know this needs
> to be fixed.

Does fsync(2) work?
David Phillips wrote:

>> The MC_MSYNC API doesn't work yet on ZFS; right now
>> the only way to flush ZFS pages is to unmount the filesystem
>> being tested. Yes, it's a pain; the ZFS folks know this needs
>> to be fixed.
>
> Does fsync(2) work?

Yes... but if you're doing read tests rather than write tests, this won't help, of course :-).

- Bart
David Phillips wrote On 11/29/05 21:10,:

>> The MC_MSYNC API doesn't work yet on ZFS; right now
>> the only way to flush ZFS pages is to unmount the filesystem
>> being tested. Yes, it's a pain; the ZFS folks know this needs
>> to be fixed.
>
> Does fsync(2) work?

Yes.

Neil
Neil Perrin wrote:

> David Phillips wrote On 11/29/05 21:10,:
>
>>> The MC_MSYNC API doesn't work yet on ZFS; right now
>>> the only way to flush ZFS pages is to unmount the filesystem
>>> being tested. Yes, it's a pain; the ZFS folks know this needs
>>> to be fixed.
>>
>> Does fsync(2) work?
>
> Yes.
>
> Neil

But if you're doing lots of small-file creation, then the latency of issuing fsync()s per file written may slow you down a _lot_. (See the "zfs extremely slow" thread - Joerg Schilling's star program by default does an fsync per file created, which leads to a massive slowdown relative to UFS when extracting an archive containing lots of very small files.)

I was after a primitive with larger granularity, i.e. do lots of file ops on lots of files, then push all outstanding I/O to disk; now we know all that stuff got to disk and can record that fact somewhere.

It also means that you can have some decent control when you are using programs that you don't write yourself and don't have the source for. This would be really great for batch-style processing on a multiuser machine, so that you can maintain a reliable transactional view of what stage processing has reached without the sledgehammer approach of messing with everyone else's I/O, i.e. sync().

For a moment I thought lockfs -f was just what I wanted, until I saw it was UFS only. :-(

-Jason

--
Jason.Ozolins at anu.edu.au                  ANU Supercomputer Facility
APAC Data Grid Program                    Leonard Huxley Bldg 56, Mills Road
Ph:  +61 2 6125 5449                      Australian National University
Fax: +61 2 6125 8199                      Canberra, ACT, 0200, Australia
Jason Ozolins wrote On 11/29/05 22:39,:

> Neil Perrin wrote:
>
>> David Phillips wrote On 11/29/05 21:10,:
>>
>>>> The MC_MSYNC API doesn't work yet on ZFS; right now
>>>> the only way to flush ZFS pages is to unmount the filesystem
>>>> being tested. Yes, it's a pain; the ZFS folks know this needs
>>>> to be fixed.
>>>
>>> Does fsync(2) work?
>>
>> Yes.
>>
>> Neil
>
> But if you're doing lots of small-file creation, then the latency of
> issuing fsync()s per file written may slow you down a _lot_. (See the
> "zfs extremely slow" thread - Joerg Schilling's star program by default
> does an fsync per file created, which leads to a massive slowdown relative
> to UFS when extracting an archive containing lots of very small files.)

In my testing of fsync() I haven't seen this slowdown. In fact it's generally been faster than UFS, but I'd like to investigate it a bit more. Certainly we must fix the speed of fsync() on ZFS if it is a problem. We don't want user programs having to work around any slowness.

> I was after a primitive with larger granularity, i.e. do lots of file
> ops on lots of files, then push all outstanding I/O to disk; now we
> know all that stuff got to disk and can record that fact somewhere.
>
> It also means that you can have some decent control when you are using
> programs that you don't write yourself and don't have the source for.
> This would be really great for batch-style processing on a multiuser
> machine, so that you can maintain a reliable transactional view of what
> stage processing has reached without the sledgehammer approach of
> messing with everyone else's I/O, i.e. sync().
>
> For a moment I thought lockfs -f was just what I wanted, until I saw it
> was UFS only. :-(

I don't think "lockfs -f" is meant to be specific to UFS, although there certainly are UFS-specific comments in it (and the rest of lockfs). I also don't think we should invent yet another interface peculiar to ZFS for flushing a file system.

Personally, I think it best for applications to use the standard interfaces (fsync(), sync(), etc.) and speed them up if they are too slow (easy to say), but I can certainly see the need for a "flush just this filesystem" interface.

> -Jason

--
Neil
Jason Ozolins wrote:

> But if you're doing lots of small-file creation, then the latency of
> issuing fsync()s per file written may slow you down a _lot_. (See the
> "zfs extremely slow" thread - Joerg Schilling's star program by default
> does an fsync per file created, which leads to a massive slowdown relative
> to UFS when extracting an archive containing lots of very small files.)

As far as I can tell, this happens only on Joerg's unusual zfs on ufs configuration w/ debug kernels... not exactly the design center.

- Bart
Joerg Schilling
2005-Dec-01 11:47 UTC
[zfs-discuss] Re: ZFS and memcntl(..., MC_SYNC, ...)
Bart Smaalders <bart.smaalders at Sun.COM> wrote:

> Jason Ozolins wrote:
>
> > But if you're doing lots of small-file creation, then the latency of
> > issuing fsync()s per file written may slow you down a _lot_. (See the
> > "zfs extremely slow" thread - Joerg Schilling's star program by default
> > does an fsync per file created, which leads to a massive slowdown relative
> > to UFS when extracting an archive containing lots of very small files.)
>
> As far as I can tell, this happens only on Joerg's unusual
> zfs on ufs configuration w/ debug kernels... not exactly
> the design center.

As already noted, this happens the same way if the FS is on a real disk.

Jörg

--
EMail: joerg at schily.isdn.cs.tu-berlin.de (home)  Jörg Schilling  D-13353 Berlin
       js at cs.tu-berlin.de                 (uni)
       schilling at fokus.fraunhofer.de      (work)  Blog: http://schily.blogspot.com/
URL:   http://cdrecord.berlios.de/old/private/  ftp://ftp.berlios.de/pub/schily
Matthew Simmons
2005-Dec-01 16:29 UTC
[zfs-discuss] Re: ZFS and memcntl(..., MC_SYNC, ...)
>>>>> "JS" == Joerg Schilling <schilling at fokus.fraunhofer.de> writes:JS> As already noted, this happens the same way if the FS is on a real JS> disk. Which still doesn''t subtract the DEBUG kernel from the equation. Matt -- Matt Simmons - simmonmt at eng.sun.com | Solaris Kernel - New York Despite the cost of living, have you noticed how it remains so popular?
Joerg Schilling
2005-Dec-01 19:54 UTC
[zfs-discuss] Re: ZFS and memcntl(..., MC_SYNC, ...)
Matthew Simmons <simmonmt at eng.sun.com> wrote:

> >>>>> "JS" == Joerg Schilling <schilling at fokus.fraunhofer.de> writes:
>
> JS> As already noted, this happens the same way if the FS is on a real
> JS> disk.
>
> Which still doesn't subtract the DEBUG kernel from the equation.

Well, I am working for SchilliX and I don't have the time to set up two test systems (one running SX). So I use for my tests what I use for my work.

Jörg
Matthew Simmons
2005-Dec-01 20:02 UTC
[zfs-discuss] Re: ZFS and memcntl(..., MC_SYNC, ...)
>>>>> "JS" == Joerg Schilling <schilling at fokus.fraunhofer.de> writes:JS> Well, I am working for SchilliX and I don''t have the time to set up two JS> testsystems (one runnig SX). So I use for my tests what I use for my JS> work. OK, well, then you will need to understand that the performance numbers you come up with are likely to be invalid. We don''t optimize the DEBUG kernel. All the complaints in the world about how program/subsystem x interacts differently with the debugging features than does program/subsystem y won''t change that. I don''t know how we can make this any clearer. It may well be that you''ll have to wait until OpenSolaris supports the construction of non-DEBUG kernels before you can draw useful conclusions from performance tests. Matt -- Matt Simmons - simmonmt at eng.sun.com | Solaris Kernel - New York History records no more gallant struggle than that of humanity against the truth.
Roch Bourbonnais - Performance Engineering
2005-Dec-02 10:21 UTC
[zfs-discuss] Re: ZFS and memcntl(..., MC_SYNC, ...)
Joerg Schilling writes:
 > Bart Smaalders <bart.smaalders at Sun.COM> wrote:
 >
 > > Jason Ozolins wrote:
 > >
 > > > But if you're doing lots of small-file creation, then the latency of
 > > > issuing fsync()s per file written may slow you down a _lot_. (See the
 > > > "zfs extremely slow" thread - Joerg Schilling's star program by default
 > > > does an fsync per file created, which leads to a massive slowdown relative
 > > > to UFS when extracting an archive containing lots of very small files.)
 > >
 > > As far as I can tell, this happens only on Joerg's unusual
 > > zfs on ufs configuration w/ debug kernels... not exactly
 > > the design center.
 >
 > As already noted, this happens the same way if the FS is on a real disk.
 >
 > Jörg

Absolutely; I'm working with a disk and see interesting issues when working with these > 10K-entry directories (especially on rm). I'm looking at how the spa_sync is impacting progress made by rm; it looks like the in-memory cache of the directory needs to be partly refreshed regularly (on my system, my test, my memory size) and this slows down progress. Ongoing study...

About the fsync() performance: fsync() after write() must wait for at least one I/O, whatever the FS is. So yes, fsync() slows you down, and this is the desired effect; this is not a problem. However, calling fsync() when it is not helpful or required could be called a problem.

-r

____________________________________________________________________________________
Roch Bourbonnais                     Sun Microsystems, Icnc-Grenoble
Senior Performance Analyst           180, Avenue De L'Europe, 38330,
Performance & Availability Engineering    Montbonnot Saint Martin, France
http://icncweb.france/~rbourbon           http://blogs.sun.com/roller/page/roch
Roch.Bourbonnais at Sun.Com             (+33).4.76.18.83.20
Joerg Schilling
2005-Dec-02 10:23 UTC
[zfs-discuss] Re: ZFS and memcntl(..., MC_SYNC, ...)
Matthew Simmons <simmonmt at eng.sun.com> wrote:

> >>>>> "JS" == Joerg Schilling <schilling at fokus.fraunhofer.de> writes:
>
> JS> As already noted, this happens the same way if the FS is on a real
> JS> disk.
>
> Which still doesn't subtract the DEBUG kernel from the equation.

From the tests I did so far with OpenSolaris, I can tell that for all other test cases I did not see any real difference to non-DEBUG kernels that would become obvious without closely looking at the times. Why should ZFS on a debug kernel give results that are worse by much more than a factor of two?

Jörg
On Fri 02 Dec 2005 at 11:23AM, Joerg Schilling wrote:

> Matthew Simmons <simmonmt at eng.sun.com> wrote:
>
> > >>>>> "JS" == Joerg Schilling <schilling at fokus.fraunhofer.de> writes:
> >
> > JS> As already noted, this happens the same way if the FS is on a real
> > JS> disk.
> >
> > Which still doesn't subtract the DEBUG kernel from the equation.
>
> From the tests I did so far with OpenSolaris, I can tell that for all other
> test cases I did not see any real difference to non-DEBUG kernels that would
> become obvious without closely looking at the times. Why should ZFS on a
> debug kernel give results that are worse by much more than a factor of two?

I disagree:

http://www.opensolaris.org/jive/servlet/JiveServlet/download/26-1785-7608-130/debug_nondebug_comparo.html

The differences can be pretty severe. In particular, ZFS is kmem-happy. If kmem flags are on, you're gonna pay and pay for that.

-dp

--
Daniel Price - Solaris Kernel Engineering - dp at eng.sun.com - blogs.sun.com/dp
On Fri 02 Dec 2005 at 02:39AM, Dan Price wrote:

> On Fri 02 Dec 2005 at 11:23AM, Joerg Schilling wrote:
>
[...]
> > From the tests I did so far with OpenSolaris, I can tell that for all other
> > test cases I did not see any real difference to non-DEBUG kernels that would
> > become obvious without closely looking at the times. Why should ZFS on a
> > debug kernel give results that are worse by much more than a factor of two?
>
> I disagree:
>
> http://www.opensolaris.org/jive/servlet/JiveServlet/download/26-1785-7608-130/debug_nondebug_comparo.html
>
> The differences can be pretty severe. In particular, ZFS is kmem-happy.
> If kmem flags are on, you're gonna pay and pay for that.

I should add: this isn't to say that you haven't found a legitimate issue - just that we lack meaningful data to support that conclusion.

I spoke to Mike Kupfer at lunch today about the eventual availability of non-DEBUG bits, and he expressed that he thought it would be the next thing he worked on following the split-tree work he is doing.

-dp
Joerg Schilling
2005-Dec-02 11:52 UTC
[zfs-discuss] Re: ZFS and memcntl(..., MC_SYNC, ...)
Roch Bourbonnais - Performance Engineering <Roch.Bourbonnais at Sun.COM> wrote:

> About the fsync() performance: fsync() after write() must
> wait for at least one I/O, whatever the FS is. So yes,
> fsync() slows you down, and this is the desired effect; this is
> not a problem. However, calling fsync() when it is not
> helpful or required could be called a problem.

Would you call the fsync() call from star before the close(f) call to be "not helpful"? It is the only official way to tell whether the extraction did work...

Jörg
Joerg Schilling
2005-Dec-02 11:54 UTC
[zfs-discuss] Re: ZFS and memcntl(..., MC_SYNC, ...)
Dan Price <dp at eng.sun.com> wrote:

> > From the tests I did so far with OpenSolaris, I can tell that for all other
> > test cases I did not see any real difference to non-DEBUG kernels that would
> > become obvious without closely looking at the times. Why should ZFS on a
> > debug kernel give results that are worse by much more than a factor of two?
>
> I disagree:
>
> http://www.opensolaris.org/jive/servlet/JiveServlet/download/26-1785-7608-130/debug_nondebug_comparo.html
>
> The differences can be pretty severe. In particular, ZFS is kmem-happy.
> If kmem flags are on, you're gonna pay and pay for that.

Well, kmem_flags are set to 0 on SchilliX... Were the numbers at that URL retrieved with kmem_flags != 0?

Jörg
Joerg Schilling
2005-Dec-02 11:55 UTC
[zfs-discuss] Re: ZFS and memcntl(..., MC_SYNC, ...)
Dan Price <dp at eng.sun.com> wrote:

> I spoke to Mike Kupfer at lunch today about the eventual availability
> of non-DEBUG bits, and he expressed that he thought it would be
> the next thing he worked on following the split-tree work he is doing.

Thank you!

Jörg
Roch Bourbonnais - Performance Engineering wrote:

> Joerg Schilling writes:
>  > Bart Smaalders <bart.smaalders at Sun.COM> wrote:
>  >
>  > > Jason Ozolins wrote:
>  > >
>  > > > But if you're doing lots of small-file creation, then the latency of
>  > > > issuing fsync()s per file written may slow you down a _lot_. (See the
>  > > > "zfs extremely slow" thread - Joerg Schilling's star program by default
>  > > > does an fsync per file created, which leads to a massive slowdown relative
>  > > > to UFS when extracting an archive containing lots of very small files.)
>  > >
>  > > As far as I can tell, this happens only on Joerg's unusual
>  > > zfs on ufs configuration w/ debug kernels... not exactly
>  > > the design center.
>  >
>  > As already noted, this happens the same way if the FS is on a real disk.
>  >
>  > Jörg

If the Community Express NV27a kernel is non-DEBUG, then I can say that the fsync performance penalty I was talking about does apply to non-DEBUG, real disk.

> About the fsync() performance: fsync() after write() must
> wait for at least one I/O, whatever the FS is. So yes,
> fsync() slows you down, and this is the desired effect; this is
> not a problem. However, calling fsync() when it is not
> helpful or required could be called a problem.

Well, if you are creating lots of small files, and you really want to know when each file's been committed to stable storage, then you'll be calling fsync once for each created file. This is the default behaviour of Joerg's star. Is it helpful to do this? I'm not sure. But there is an extreme difference between star's behaviour on UFS and ZFS. From my earlier message in the "ZFS extremely slow" thread, some times to extract tar archives to filesystems on real disks, for my ~111,000 file fragment of the freedb archive:

                 UFS    ZFS
star              72   1080
tar              420     33
star-no-fsync     57     32

To me this implies that if UFS respects fsync semantics (as stated by some Sun folks in recent messages to zfs-discuss), it's managing to do the requisite logging with much less latency than ZFS, which corresponds to a much lower wall time for this fsync-heavy operation on UFS than on ZFS. This is what I was thinking of when I wrote the text quoted by Bart [see top of this message].

I don't want to sound like a broken record on this... just saying that there are some ZFS slownesses concerning fsync() which show up on a vanilla (native disk, non-DEBUG kernel) setup. Not to mention that sequential reads go faster on a 4-disk RAID-Z pool than a 4-disk striped pool... see the last message in the "old-style sequential read performance" thread if you're interested.

-Jason
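To make the access pattern concrete, here is a minimal sketch of the fsync-per-file extraction loop being measured above. This is not star's actual code; the file names, counts and contents are hypothetical.

/*
 * Sketch of the fsync-per-file pattern: each small file costs at
 * least one synchronous I/O before the archiver moves on to the
 * next entry.
 */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static int
extract_one(const char *name, const char *data, size_t len)
{
        int fd = open(name, O_WRONLY | O_CREAT | O_TRUNC, 0644);

        if (fd < 0)
                return (-1);
        if (write(fd, data, len) != (ssize_t)len ||
            fsync(fd) != 0) {           /* wait for stable storage */
                (void) close(fd);
                return (-1);
        }
        return (close(fd));
}

int
main(void)
{
        char name[32];
        int i;

        for (i = 0; i < 1000; i++) {    /* many small files */
                (void) snprintf(name, sizeof (name), "file%05d", i);
                if (extract_one(name, "x", 1) != 0) {
                        perror(name);
                        return (1);
                }
        }
        return (0);
}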
Neil Perrin wrote:

> Jason Ozolins wrote On 11/29/05 22:39,:
>
>> I was after a primitive with larger granularity, i.e. do lots of file
>> ops on lots of files, then push all outstanding I/O to disk; now we
>> know all that stuff got to disk and can record that fact somewhere.
>>
>> It also means that you can have some decent control when you are using
>> programs that you don't write yourself and don't have the source for.
>> This would be really great for batch-style processing on a multiuser
>> machine, so that you can maintain a reliable transactional view of what
>> stage processing has reached without the sledgehammer approach of
>> messing with everyone else's I/O, i.e. sync().
>>
>> For a moment I thought lockfs -f was just what I wanted, until I saw it
>> was UFS only. :-(
>
> I don't think "lockfs -f" is meant to be specific to UFS, although there
> certainly are UFS-specific comments in it (and the rest of lockfs).
> I also don't think we should invent yet another interface peculiar
> to ZFS for flushing a file system.

Yeah, I'm wrong again - the magic source browser shows that lockfs -f causes an _FIOFFS ioctl, and _FIOFFS is indeed handled in zfs_ioctl(). So it's all good, except that the man page for lockfs could be more explicit about which options aren't UFS-specific.

> Personally, I think it best for applications to use the standard
> interfaces (fsync(), sync(), etc.) and speed them up if they are
> too slow (easy to say), but I can certainly see the need for a
> "flush just this filesystem" interface.

The tricky part is if you don't write the apps, but still want some way for a script to checkpoint any state that they changed, so you know which stage to roll back to if there's a crash.

-Jason
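Given the _FIOFFS handling noted above, a per-filesystem flush could in principle be issued directly, roughly the way lockfs -f does it. The following is only a sketch under that assumption: the mount point is hypothetical, _FIOFFS is not a documented application interface, and the call may require appropriate privileges.

/*
 * Sketch only: issue the same _FIOFFS ("flush file system") ioctl
 * that lockfs -f uses, against an illustrative mount point.
 * Assumes _FIOFFS is handled by the filesystem, as described above
 * for zfs_ioctl().
 */
#include <sys/types.h>
#include <sys/filio.h>
#include <fcntl.h>
#include <stdio.h>
#include <stropts.h>
#include <unistd.h>

int
main(void)
{
        const char *mntpt = "/tank/fs";         /* hypothetical mount point */
        int fd = open(mntpt, O_RDONLY);

        if (fd < 0) {
                perror("open");
                return (1);
        }
        if (ioctl(fd, _FIOFFS, 0) != 0) {       /* flush the whole filesystem */
                perror("ioctl(_FIOFFS)");
                (void) close(fd);
                return (1);
        }
        (void) close(fd);
        return (0);
}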
Bill Sommerfeld
2005-Dec-02 16:26 UTC
[zfs-discuss] Re: ZFS and memcntl(..., MC_SYNC, ...)
On Fri, 2005-12-02 at 06:52, Joerg Schilling wrote:

> Would you call the fsync() call from star before the close(f)
> call to be "not helpful"?

It's helpful in the sense that it will guarantee to whatever started tar that the files will be there after a crash if tar exits with a nonzero status. But if you call fsync() on file N before proceeding on to file N+1, you hide the amount of inherent parallelism actually present in the task at hand.

> It is the only official way to tell whether the extraction did work...

As an underlying primitive, yes, but see also aio_fsync(). Or you could do the fsync & close in a different thread.

- Bill
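A minimal sketch of the aio_fsync() route suggested above: queue the flush for one file and keep working while it completes. The file name, the polling-based completion check, and the single outstanding request are all simplifications; a real archiver would track many aiocbs and close each descriptor only after its flush completes. Link with -lrt.

/*
 * Sketch of aio_fsync(): the flush is queued and the caller is free
 * to continue; completion is checked later with aio_error().
 */
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
        static struct aiocb cb;         /* must stay valid until the flush completes */
        int fd = open("foo", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        int err;

        if (fd < 0 || write(fd, "x", 1) != 1) {
                perror("setup");
                return (1);
        }

        cb.aio_fildes = fd;
        cb.aio_sigevent.sigev_notify = SIGEV_NONE;      /* we will poll */
        if (aio_fsync(O_DSYNC, &cb) != 0) {             /* queue the flush */
                perror("aio_fsync");
                return (1);
        }

        /* ... extract the next file here while the flush proceeds ... */

        while ((err = aio_error(&cb)) == EINPROGRESS)
                (void) usleep(1000);                    /* poll for completion */
        if (err != 0) {
                (void) fprintf(stderr, "flush of foo failed: %s\n", strerror(err));
                return (1);
        }
        (void) aio_return(&cb);
        (void) close(fd);
        return (0);
}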
Joerg Schilling
2005-Dec-05 11:44 UTC
[zfs-discuss] Re: ZFS and memcntl(..., MC_SYNC, ...)
Bill Sommerfeld <sommerfeld at sun.com> wrote:

> On Fri, 2005-12-02 at 06:52, Joerg Schilling wrote:
>
>> Would you call the fsync() call from star before the close(f)
>> call to be "not helpful"?
>
> It's helpful in the sense that it will guarantee to whatever started tar
> that the files will be there after a crash if tar exits with a nonzero
> status.

I don't understand. Check out 'vi': it is doing the same as star does. Before vi started calling fsync() (I believe this was with SunOS-4.0), many users of vi did lose their files in certain situations.

If you like to have anything meaningful in the exit code of star, I see no way other than calling fsync(). Or do you have a better proposal?

> But if you call fsync() on file N before proceeding on to file N+1, you
> hide the amount of inherent parallelism actually present in the task at
> hand.
>
>> It is the only official way to tell whether the extraction did work...
>
> As an underlying primitive, yes, but see also aio_fsync(). Or you could
> do the fsync & close in a different thread.

I'll look at this, but note that this would cause the file descriptors to become a scarce resource that needs control.

Jörg