It wouldn't be proper to start my first post here without congratulations and thanks to the ZFS team for such an impressive piece of work.

Anyway, on to my query. I've been trying out ZFS, with a particular focus on reducing latency in a specific application. This application has a fair amount of random writing going on in the background (which, of course, ZFS will make sequential), but the latency-critical step is syncing a transaction into a transaction log. This is a sequentially written file, and for persistence we must guarantee the transaction to the physical media before acknowledging it. The theory is that the "all writes are sequential" rule of ZFS will make a big difference to the worst-case and average latency here, although obviously the best case won't change.

The "guarantee to physical media" bit is done using msync(3C), which of course results in a call to memcntl(2) passing the MC_SYNC flag. What I'm finding is that, on ZFS, this doesn't actually synchronously write the passed pages to disk before returning - it returns far too quickly (in a few microseconds) for the vanilla SCSI disk I'm using. A quick prod with DTrace shows the IO happening after the memcntl call has returned. In case it's helpful, here's the call flow generated by my D script, showing what's happening in the kernel in response to the memcntl syscall:

CPU FUNCTION
 17  -> memcntl
 17    -> valid_usr_range
 17    <- valid_usr_range
 17    -> as_ctl
 17      -> as_segat
 17      <- as_segat
 17      -> segvn_sync
 17        -> fop_putpage
 17        <- fop_putpage
 17        -> zfs_putpage
 17          -> page_lookup
 17          <- page_lookup
 17          -> page_lookup_create
 17          <- page_lookup_create
 17        <- zfs_putpage
 17      <- segvn_sync
 17    <- as_ctl
 17  <- memcntl

A search turned up bug ID 6281075, category kernel:zfs, with synopsis "Support memcntl() requests (like MC_INVALIDATE)". It sounds like this means what I'm trying to do isn't implemented yet - is this correct, or have I found a bug? Presumably it will be implemented in the future?

[As a side note, you're probably wondering why we don't just use O_DSYNC when opening the file, and then just write(2) to it. The reason is that it's very slow on UFS for large buffers - effectively linear in the number of pages crossed, or about 6ms per 8KBytes on an otherwise idle SCSI disk on SPARC. The good news is that this is very fast on ZFS - on the same disk, about 7ms constant for up to 64KBytes, and about 20ms constant for 256KBytes. However, this is still a bit slower than the msync(3C) approach on UFS.]

Thanks in advance.

--

Philip Beevers
mailto:philip.beevers at ntlworld.com
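For reference, a minimal self-contained sketch of the O_DSYNC alternative described in the side note above. This is not code from the thread; the file name and transaction size are illustrative only.

/*
 * Sketch of the O_DSYNC write path: each write(2) is not acknowledged
 * until the data has reached stable storage.  "txlog" and the chunk
 * size are hypothetical.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
        const size_t chunk = 8192;              /* one transaction record */
        char *buf = malloc(chunk);
        int fd;

        if (buf == NULL) {
                perror("malloc");
                exit(1);
        }
        memset(buf, 0xbc, chunk);

        fd = open("txlog", O_WRONLY | O_CREAT | O_DSYNC, 0666);
        if (fd < 0) {
                perror("open");
                exit(1);
        }

        /* write() returns only once the data is on the media */
        if (write(fd, buf, chunk) != (ssize_t)chunk) {
                perror("write");
                exit(1);
        }

        (void) close(fd);
        free(buf);
        return (0);
}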
Roch Bourbonnais - Performance Engineering
2005-Nov-28 08:50 UTC
[zfs-discuss] ZFS and memcntl(..., MC_SYNC, ...)
A few questions. Between the write(2) and the msync() do you re-mmap the file? Or is the file pre-allocated (say by writing zeros at initialization)?

Otherwise, I wonder if you have a valid address space that maps to the newly written portion and, if not, what is actually being msync-ed.

Just making sure: do you check the msync() return code?

-r

Philip Beevers writes:
 > It wouldn't be proper to start my first post here without congratulations
 > and thanks to the ZFS team for such an impressive piece of work.
[...]
philip.beevers at ntlworld.com
2005-Nov-28 12:49 UTC
[zfs-discuss] ZFS and memcntl(..., MC_SYNC, ...)
Hi Roch,

Thanks for your response.

> A few questions. Between the write(2) and the msync() do you
> re-mmap the file? Or is the file pre-allocated (say by
> writing zeros at initialization)?

My initial prog does this (in a loop):

  write
  mmap
  msync
  munmap

I guess this relies on a unified VM behaviour.

I changed the prog so it worked like this:

  write the whole file out
  In a loop
    mmap
    memcpy to dirty the mapped pages
    msync
    munmap

Unfortunately this still results in the IO happening _after_ the program has exited, although the calls to msync do take significantly longer (a few hundred microseconds, rather than a few tens of microseconds). Another quick DTrace shows that in this case zfs_putapage is getting called (at line 2839 of zfs_vnops.c, if my reading of the source is correct), whereas in the previous case it isn't (as the previous call flow showed).

> Otherwise, I wonder if you have a valid address space that
> maps to the newly written portion and, if not, what is
> actually being msync-ed.

I'm fairly confident that I've mapped the right thing - I've run through my test program in dbx and it seems OK. Here's the source:

#include <sys/mman.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char** argv)
{
  int i = 0;
  int fd = open("foo", O_TRUNC | O_CREAT | O_RDWR, 0666);
  int chunk = 0;
  int trials = 0;
  char* buf = NULL;

  if (argc < 3)
  {
    printf("Usage is: %s <chunk size> <trials>\n", argv[0]);
    exit(1);
  }

  chunk = atoi(argv[1]);
  buf = (char*)malloc(chunk);
  memset(buf, 0xbc, chunk);
  trials = atoi(argv[2]);

  for (i = 0; i < trials; i++)
  {
    void* ptr = NULL;

    /* Append one chunk to the file... */
    if (write(fd, buf, chunk) < 0)
    {
      perror("write");
      exit(1);
    }

    /* ...map the newly written region... */
    ptr = mmap(NULL, chunk, PROT_READ | PROT_WRITE, MAP_SHARED, fd, chunk * i);
    if (ptr == MAP_FAILED)
    {
      perror("mmap");
      exit(1);
    }

    /* ...and synchronously flush it to the media. */
    if (msync(ptr, chunk, MS_SYNC) < 0)
    {
      perror("msync");
      exit(1);
    }

    munmap(ptr, chunk);
  }

  return 0;
}

> Just making sure: do you check the msync() return code?

Of course :-)

Thanks again for your help,

Phil.
philip.beevers at ntlworld.com wrote:

> Hi Roch,
>
> Thanks for your response.
>
[...]
> Unfortunately this still results in the IO happening _after_ the
> program has exited, although the calls to msync do take significantly
> longer (a few hundred microseconds, rather than a few tens of
> microseconds). Another quick DTrace shows that in this case zfs_putapage
> is getting called (at line 2839 of zfs_vnops.c, if my reading of the
> source is correct), whereas in the previous case it isn't (as the
> previous call flow showed).

The MC_MSYNC API doesn't work yet on ZFS; right now the only way to flush ZFS pages is to unmount the filesystem being tested. Yes, it's a pain; the ZFS folks know this needs to be fixed.

- Bart

--
Bart Smaalders                  Solaris Kernel Performance
barts at cyber.eng.sun.com         http://blogs.sun.com/barts
Hi Bart,

> The MC_MSYNC API doesn't work yet on ZFS; right now the only
> way to flush ZFS pages is to unmount the filesystem being
> tested. Yes, it's a pain; the ZFS folks know this needs to be fixed.

Many thanks for confirming that. For now, I'll stick with using O_DSYNC.

Regards,

Phil.
On Mon, Nov 28, 2005 at 09:34:03AM -0800, Bart Smaalders wrote:

> The MC_MSYNC API doesn't work yet on ZFS; right now the only way
> to flush ZFS pages is to unmount the filesystem being tested.
> Yes, it's a pain; the ZFS folks know this needs to be fixed.

Actually, it takes an export/import. See:

	6347986 need CLI and programmatic interface to flush ZFS cache

The first step is to introduce the ARC interfaces to flush all data associated with a filesystem/file/page - currently all it understands are DVAs. Once that's in place, we can make the following changes:

1. Update MC_SYNC to correctly flush cached ZFS data

2. Have unmount flush cached ZFS data for the filesystem

3. Optionally introduce a new CLI option like 'zfs flush' to explicitly flush the cache for a given dataset.

- Eric

--
Eric Schrock, Solaris Kernel Development       http://blogs.sun.com/eschrock
Eric Schrock wrote:

> On Mon, Nov 28, 2005 at 09:34:03AM -0800, Bart Smaalders wrote:
>
>> The MC_MSYNC API doesn't work yet on ZFS; right now the only way
>> to flush ZFS pages is to unmount the filesystem being tested.
>> Yes, it's a pain; the ZFS folks know this needs to be fixed.
>
> Actually, it takes an export/import. See:
>
> 6347986 need CLI and programmatic interface to flush ZFS cache
>
> The first step is to introduce the ARC interfaces to flush all data
> associated with a filesystem/file/page - currently all it understands
> are DVAs. Once that's in place, we can make the following changes:
[...]
> 3. Optionally introduce a new CLI option like 'zfs flush' to explicitly
> flush the cache for a given dataset.

That would be a useful primitive to have for non-root folks who need to do repeatable performance measurements. While you're at it, 'zfs sync' to sync an individual dataset would be great too - it would let owners of a dataset ensure that outstanding writes had reached stable storage without having to push out everybody else's data to disk with it, like sync(2) will apparently do with ZFS. :-)

-Jason
On Wed, 2005-11-30 at 14:19 +1100, Jason Ozolins wrote:

> While you're at it, 'zfs sync'
> to sync an individual dataset would be great too - it would let owners
> of a dataset ensure that outstanding writes had reached stable storage
> without having to push out everybody else's data to disk with it, like
> sync(2) will apparently do with ZFS. :-)

"lockfs -f <path>" has approximately those semantics right now for UFS, but as a command it's otherwise annoyingly UFS-specific. In an ideal world we'd get a generic command rather than another filesystem-specific command; for the most part, applications written in the shell of your choice shouldn't have to know what sort of filesystem is backing their storage...
> The MC_MSYNC API doesn't work yet on ZFS; right now
> the only way to flush ZFS pages is to unmount the filesystem
> being tested. Yes, it's a pain; the ZFS folks know this needs
> to be fixed.

Does fsync(2) work?
David Phillips wrote:

>> The MC_MSYNC API doesn't work yet on ZFS; right now
>> the only way to flush ZFS pages is to unmount the filesystem
>> being tested. Yes, it's a pain; the ZFS folks know this needs
>> to be fixed.
>
> Does fsync(2) work?

Yes... but if you're doing read tests rather than write tests, this won't help, of course :-).

- Bart
David Phillips wrote On 11/29/05 21:10,:

>> The MC_MSYNC API doesn't work yet on ZFS; right now
>> the only way to flush ZFS pages is to unmount the filesystem
>> being tested. Yes, it's a pain; the ZFS folks know this needs
>> to be fixed.
>
> Does fsync(2) work?

Yes.

Neil
Neil Perrin wrote:

> David Phillips wrote On 11/29/05 21:10,:
>
>>> The MC_MSYNC API doesn't work yet on ZFS; right now
>>> the only way to flush ZFS pages is to unmount the filesystem
>>> being tested. Yes, it's a pain; the ZFS folks know this needs
>>> to be fixed.
>>
>> Does fsync(2) work?
>
> Yes.
>
> Neil

But if you're doing lots of small-file creation, then the latency of issuing fsync()s per file written may slow you down a _lot_. (See the "zfs extremely slow" thread - Joerg Schilling's star program by default does an fsync per file created, which leads to a massive slowdown relative to UFS when extracting an archive containing lots of very small files.)

I was after a primitive with larger granularity, i.e. do lots of file ops on lots of files, then push all outstanding I/O to disk; now we know all that stuff got to disk and can record that fact somewhere.

It also means that you can have some decent control when you are using programs that you don't write yourself and don't have the source for. This would be really great for batch-style processing on a multiuser machine, so that you can maintain a reliable transactional view of what stage processing has reached without the sledgehammer approach of messing with everyone else's I/O, i.e. sync().

For a moment I thought lockfs -f was just what I wanted, until I saw it was UFS only. :-(

-Jason

--
Jason.Ozolins at anu.edu.au                  ANU Supercomputer Facility
APAC Data Grid Program                    Leonard Huxley Bldg 56, Mills Road
Ph:  +61 2 6125 5449                      Australian National University
Fax: +61 2 6125 8199                      Canberra, ACT, 0200, Australia
Jason Ozolins wrote On 11/29/05 22:39,:

> Neil Perrin wrote:
>
>> David Phillips wrote On 11/29/05 21:10,:
>>
>>>> The MC_MSYNC API doesn't work yet on ZFS; right now
>>>> the only way to flush ZFS pages is to unmount the filesystem
>>>> being tested. Yes, it's a pain; the ZFS folks know this needs
>>>> to be fixed.
>>>
>>> Does fsync(2) work?
>>
>> Yes.
>>
>> Neil
>
> But if you're doing lots of small-file creation, then the latency of
> issuing fsync()s per file written may slow you down a _lot_. (See the
> "zfs extremely slow" thread - Joerg Schilling's star program by default
> does an fsync per file created, which leads to a massive slowdown relative
> to UFS when extracting an archive containing lots of very small files.)

In my testing of fsync() I haven't seen this slowdown. In fact it's generally been faster than UFS, but I'd like to investigate it a bit more. Certainly we must fix the speed of fsync() on ZFS if it is a problem. We don't want user programs having to work around any slowness.

> I was after a primitive with larger granularity, i.e. do lots of file
> ops on lots of files, then push all outstanding I/O to disk; now we
> know all that stuff got to disk and can record that fact somewhere.
>
> It also means that you can have some decent control when you are using
> programs that you don't write yourself and don't have the source for.
> This would be really great for batch-style processing on a multiuser
> machine, so that you can maintain a reliable transactional view of what
> stage processing has reached without the sledgehammer approach of
> messing with everyone else's I/O, i.e. sync().
>
> For a moment I thought lockfs -f was just what I wanted, until I saw it
> was UFS only. :-(

I don't think "lockfs -f" is meant to be specific to UFS, although there certainly are UFS-specific comments in it (and the rest of lockfs). I also don't think we should invent yet another interface peculiar to ZFS for flushing a file system.

Personally, I think it best for applications to use the standard interfaces (fsync(), sync(), etc.) and speed them up if they are too slow (easy to say), but I can certainly see the need for a "flush just this filesystem" interface.

> -Jason

--
Neil
Jason Ozolins wrote:

> But if you're doing lots of small-file creation, then the latency of
> issuing fsync()s per file written may slow you down a _lot_. (See the
> "zfs extremely slow" thread - Joerg Schilling's star program by default
> does an fsync per file created, which leads to a massive slowdown relative
> to UFS when extracting an archive containing lots of very small files.)

As far as I can tell, this happens only on Joerg's unusual zfs on ufs configuration w/ debug kernels... not exactly the design center.

- Bart
Joerg Schilling
2005-Dec-01 11:47 UTC
[zfs-discuss] Re: ZFS and memcntl(..., MC_SYNC, ...)
Bart Smaalders <bart.smaalders at Sun.COM> wrote:

> Jason Ozolins wrote:
>
> > But if you're doing lots of small-file creation, then the latency of
> > issuing fsync()s per file written may slow you down a _lot_. (See the
> > "zfs extremely slow" thread - Joerg Schilling's star program by default
> > does an fsync per file created, which leads to a massive slowdown relative
> > to UFS when extracting an archive containing lots of very small files.)
>
> As far as I can tell, this happens only on Joerg's unusual
> zfs on ufs configuration w/ debug kernels... not exactly
> the design center.

As already noted, this happens the same way if the FS is on a real disk.

Jörg

--
EMail: joerg at schily.isdn.cs.tu-berlin.de (home)  Jörg Schilling  D-13353 Berlin
       js at cs.tu-berlin.de                 (uni)
       schilling at fokus.fraunhofer.de      (work)  Blog: http://schily.blogspot.com/
URL:   http://cdrecord.berlios.de/old/private/  ftp://ftp.berlios.de/pub/schily
Matthew Simmons
2005-Dec-01 16:29 UTC
[zfs-discuss] Re: ZFS and memcntl(..., MC_SYNC, ...)
>>>>> "JS" == Joerg Schilling <schilling at fokus.fraunhofer.de> writes:JS> As already noted, this happens the same way if the FS is on a real JS> disk. Which still doesn''t subtract the DEBUG kernel from the equation. Matt -- Matt Simmons - simmonmt at eng.sun.com | Solaris Kernel - New York Despite the cost of living, have you noticed how it remains so popular?
Joerg Schilling
2005-Dec-01 19:54 UTC
[zfs-discuss] Re: ZFS and memcntl(..., MC_SYNC, ...)
Matthew Simmons <simmonmt at eng.sun.com> wrote:

> >>>>> "JS" == Joerg Schilling <schilling at fokus.fraunhofer.de> writes:
>
> JS> As already noted, this happens the same way if the FS is on a real
> JS> disk.
>
> Which still doesn't subtract the DEBUG kernel from the equation.

Well, I am working for SchilliX and I don't have the time to set up two test systems (one running SX). So I use for my tests what I use for my work.

Jörg
Matthew Simmons
2005-Dec-01 20:02 UTC
[zfs-discuss] Re: ZFS and memcntl(..., MC_SYNC, ...)
>>>>> "JS" == Joerg Schilling <schilling at fokus.fraunhofer.de> writes:JS> Well, I am working for SchilliX and I don''t have the time to set up two JS> testsystems (one runnig SX). So I use for my tests what I use for my JS> work. OK, well, then you will need to understand that the performance numbers you come up with are likely to be invalid. We don''t optimize the DEBUG kernel. All the complaints in the world about how program/subsystem x interacts differently with the debugging features than does program/subsystem y won''t change that. I don''t know how we can make this any clearer. It may well be that you''ll have to wait until OpenSolaris supports the construction of non-DEBUG kernels before you can draw useful conclusions from performance tests. Matt -- Matt Simmons - simmonmt at eng.sun.com | Solaris Kernel - New York History records no more gallant struggle than that of humanity against the truth.
Roch Bourbonnais - Performance Engineering
2005-Dec-02 10:21 UTC
[zfs-discuss] Re: ZFS and memcntl(..., MC_SYNC, ...)
Joerg Schilling writes:
 > Bart Smaalders <bart.smaalders at Sun.COM> wrote:
 >
 > > Jason Ozolins wrote:
 > >
 > > > But if you're doing lots of small-file creation, then the latency of
 > > > issuing fsync()s per file written may slow you down a _lot_. (See the
 > > > "zfs extremely slow" thread - Joerg Schilling's star program by default
 > > > does an fsync per file created, which leads to a massive slowdown relative
 > > > to UFS when extracting an archive containing lots of very small files.)
 > >
 > > As far as I can tell, this happens only on Joerg's unusual
 > > zfs on ufs configuration w/ debug kernels... not exactly
 > > the design center.
 >
 > As already noted, this happens the same way if the FS is on a real disk.
 >
 > Jörg

Absolutely; I'm working with a disk and see interesting issues when working with these > 10K-entry directories (especially on rm). I'm looking at how the spa_sync is impacting progress made by rm; it looks like the in-memory cache of the directory needs to be partly refreshed regularly (on my system, my test, my memory size) and this slows down progress. Ongoing study...

About the fsync() performance: fsync() after write() must wait for at least one I/O, whatever the FS is. So yes, fsync() slows you down, and this is the desired effect; this is not a problem. However, calling fsync() when it is not helpful or required could be called a problem.

-r

____________________________________________________________________________________
Roch Bourbonnais                     Sun Microsystems, Icnc-Grenoble
Senior Performance Analyst           180, Avenue De L'Europe, 38330,
Performance & Availability Engineering    Montbonnot Saint Martin, France
http://icncweb.france/~rbourbon           http://blogs.sun.com/roller/page/roch
Roch.Bourbonnais at Sun.Com             (+33).4.76.18.83.20
Joerg Schilling
2005-Dec-02 10:23 UTC
[zfs-discuss] Re: ZFS and memcntl(..., MC_SYNC, ...)
Matthew Simmons <simmonmt at eng.sun.com> wrote:

> >>>>> "JS" == Joerg Schilling <schilling at fokus.fraunhofer.de> writes:
>
> JS> As already noted, this happens the same way if the FS is on a real
> JS> disk.
>
> Which still doesn't subtract the DEBUG kernel from the equation.

From the tests I did so far with OpenSolaris, I can tell that for all other test cases I did not see any real difference to non-DEBUG kernels that would become obvious without closely looking at the times. Why should ZFS on a debug kernel give results that are worse by much more than a factor of two?

Jörg
On Fri 02 Dec 2005 at 11:23AM, Joerg Schilling wrote:

> Matthew Simmons <simmonmt at eng.sun.com> wrote:
>
> > >>>>> "JS" == Joerg Schilling <schilling at fokus.fraunhofer.de> writes:
> >
> > JS> As already noted, this happens the same way if the FS is on a real
> > JS> disk.
> >
> > Which still doesn't subtract the DEBUG kernel from the equation.
>
> From the tests I did so far with OpenSolaris, I can tell that for all other
> test cases I did not see any real difference to non-DEBUG kernels that would
> become obvious without closely looking at the times. Why should ZFS on a
> debug kernel give results that are worse by much more than a factor of two?

I disagree:

http://www.opensolaris.org/jive/servlet/JiveServlet/download/26-1785-7608-130/debug_nondebug_comparo.html

The differences can be pretty severe. In particular, ZFS is kmem-happy. If kmem flags are on, you're gonna pay and pay for that.

-dp

--
Daniel Price - Solaris Kernel Engineering - dp at eng.sun.com - blogs.sun.com/dp
On Fri 02 Dec 2005 at 02:39AM, Dan Price wrote:

> On Fri 02 Dec 2005 at 11:23AM, Joerg Schilling wrote:
>
[...]
> > From the tests I did so far with OpenSolaris, I can tell that for all other
> > test cases I did not see any real difference to non-DEBUG kernels that would
> > become obvious without closely looking at the times. Why should ZFS on a
> > debug kernel give results that are worse by much more than a factor of two?
>
> I disagree:
>
> http://www.opensolaris.org/jive/servlet/JiveServlet/download/26-1785-7608-130/debug_nondebug_comparo.html
>
> The differences can be pretty severe. In particular, ZFS is kmem-happy.
> If kmem flags are on, you're gonna pay and pay for that.

I should add: this isn't to say that you haven't found a legitimate issue - just that we lack meaningful data to support that conclusion.

I spoke to Mike Kupfer at lunch today about the eventual availability of non-DEBUG bits, and he expressed that he thought it would be the next thing he worked on following the split-tree work he is doing.

-dp
Joerg Schilling
2005-Dec-02 11:52 UTC
[zfs-discuss] Re: ZFS and memcntl(..., MC_SYNC, ...)
Roch Bourbonnais - Performance Engineering <Roch.Bourbonnais at Sun.COM> wrote:

> About the fsync() performance: fsync() after write() must
> wait for at least one I/O, whatever the FS is. So yes,
> fsync() slows you down, and this is the desired effect; this is
> not a problem. However, calling fsync() when it is not
> helpful or required could be called a problem.

Would you call the fsync() call from star before the close(f) call to be "not helpful"? It is the only official way to tell whether the extraction did work...

Jörg
Joerg Schilling
2005-Dec-02 11:54 UTC
[zfs-discuss] Re: ZFS and memcntl(..., MC_SYNC, ...)
Dan Price <dp at eng.sun.com> wrote:

> > From the tests I did so far with OpenSolaris, I can tell that for all other
> > test cases I did not see any real difference to non-DEBUG kernels that would
> > become obvious without closely looking at the times. Why should ZFS on a
> > debug kernel give results that are worse by much more than a factor of two?
>
> I disagree:
>
> http://www.opensolaris.org/jive/servlet/JiveServlet/download/26-1785-7608-130/debug_nondebug_comparo.html
>
> The differences can be pretty severe. In particular, ZFS is kmem-happy.
> If kmem flags are on, you're gonna pay and pay for that.

Well, kmem_flags are set to 0 on SchilliX... Were the numbers at that URL retrieved with kmem_flags != 0?

Jörg
Joerg Schilling
2005-Dec-02 11:55 UTC
[zfs-discuss] Re: ZFS and memcntl(..., MC_SYNC, ...)
Dan Price <dp at eng.sun.com> wrote:

> I spoke to Mike Kupfer at lunch today about the eventual availability
> of non-DEBUG bits, and he expressed that he thought it would be
> the next thing he worked on following the split-tree work he is doing.

Thank you!

Jörg
Roch Bourbonnais - Performance Engineering wrote:

> Joerg Schilling writes:
>  > Bart Smaalders <bart.smaalders at Sun.COM> wrote:
>  >
>  > > Jason Ozolins wrote:
>  > >
>  > > > But if you're doing lots of small-file creation, then the latency of
>  > > > issuing fsync()s per file written may slow you down a _lot_. (See the
>  > > > "zfs extremely slow" thread - Joerg Schilling's star program by default
>  > > > does an fsync per file created, which leads to a massive slowdown relative
>  > > > to UFS when extracting an archive containing lots of very small files.)
>  > >
>  > > As far as I can tell, this happens only on Joerg's unusual
>  > > zfs on ufs configuration w/ debug kernels... not exactly
>  > > the design center.
>  >
>  > As already noted, this happens the same way if the FS is on a real disk.
>  >
>  > Jörg

If the Community Express NV27a kernel is non-DEBUG, then I can say that the fsync performance penalty I was talking about does apply to non-DEBUG, real disk.

> About the fsync() performance: fsync() after write() must
> wait for at least one I/O, whatever the FS is. So yes,
> fsync() slows you down, and this is the desired effect; this is
> not a problem. However, calling fsync() when it is not
> helpful or required could be called a problem.

Well, if you are creating lots of small files, and you really want to know when each file's been committed to stable storage, then you'll be calling fsync once for each created file. This is the default behaviour of Joerg's star. Is it helpful to do this? I'm not sure. But there is an extreme difference between star's behaviour on UFS and ZFS. From my earlier message in the "ZFS extremely slow" thread, some times to extract tar archives to filesystems on real disks, for my ~111,000 file fragment of the freedb archive:

                 UFS    ZFS
star              72   1080
tar              420     33
star-no-fsync     57     32

To me this implies that if UFS respects fsync semantics (as stated by some Sun folks in recent messages to zfs-discuss), it's managing to do the requisite logging with much less latency than ZFS, which corresponds to a much lower wall time for this fsync-heavy operation on UFS than on ZFS. This is what I was thinking of when I wrote the text quoted by Bart [see top of this message].

I don't want to sound like a broken record on this... just saying that there are some ZFS slownesses concerning fsync() which show up on a vanilla (native disk, non-DEBUG kernel) setup. Not to mention that sequential reads go faster on a 4-disk RAID-Z pool than a 4-disk striped pool... see the last message in the "old-style sequential read performance" thread if you're interested.

-Jason
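To make the access pattern concrete, here is a minimal sketch of the fsync-per-file extraction loop being measured above. This is not star's actual code; the file names, counts and contents are hypothetical.

/*
 * Sketch of the fsync-per-file pattern: each small file costs at
 * least one synchronous I/O before the archiver moves on to the
 * next entry.
 */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static int
extract_one(const char *name, const char *data, size_t len)
{
        int fd = open(name, O_WRONLY | O_CREAT | O_TRUNC, 0644);

        if (fd < 0)
                return (-1);
        if (write(fd, data, len) != (ssize_t)len ||
            fsync(fd) != 0) {           /* wait for stable storage */
                (void) close(fd);
                return (-1);
        }
        return (close(fd));
}

int
main(void)
{
        char name[32];
        int i;

        for (i = 0; i < 1000; i++) {    /* many small files */
                (void) snprintf(name, sizeof (name), "file%05d", i);
                if (extract_one(name, "x", 1) != 0) {
                        perror(name);
                        return (1);
                }
        }
        return (0);
}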
Neil Perrin wrote:

> Jason Ozolins wrote On 11/29/05 22:39,:
>
>> I was after a primitive with larger granularity, i.e. do lots of file
>> ops on lots of files, then push all outstanding I/O to disk; now we
>> know all that stuff got to disk and can record that fact somewhere.
>>
>> It also means that you can have some decent control when you are using
>> programs that you don't write yourself and don't have the source for.
>> This would be really great for batch-style processing on a multiuser
>> machine, so that you can maintain a reliable transactional view of what
>> stage processing has reached without the sledgehammer approach of
>> messing with everyone else's I/O, i.e. sync().
>>
>> For a moment I thought lockfs -f was just what I wanted, until I saw it
>> was UFS only. :-(
>
> I don't think "lockfs -f" is meant to be specific to UFS, although there
> certainly are UFS-specific comments in it (and the rest of lockfs).
> I also don't think we should invent yet another interface peculiar
> to ZFS for flushing a file system.

Yeah, I'm wrong again - the magic source browser shows that lockfs -f causes an _FIOFFS ioctl, and _FIOFFS is indeed handled in zfs_ioctl(). So it's all good, except that the man page for lockfs could be more explicit about which options aren't UFS-specific.

> Personally, I think it best for applications to use the standard
> interfaces (fsync(), sync(), etc.) and speed them up if they are
> too slow (easy to say), but I can certainly see the need for a
> "flush just this filesystem" interface.

The tricky part is if you don't write the apps, but still want some way for a script to checkpoint any state that they changed, so you know which stage to roll back to if there's a crash.

-Jason
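Given the _FIOFFS handling noted above, a per-filesystem flush could in principle be issued directly, roughly the way lockfs -f does it. The following is only a sketch under that assumption: the mount point is hypothetical, _FIOFFS is not a documented application interface, and the call may require appropriate privileges.

/*
 * Sketch only: issue the same _FIOFFS ("flush file system") ioctl
 * that lockfs -f uses, against an illustrative mount point.
 * Assumes _FIOFFS is handled by the filesystem, as described above
 * for zfs_ioctl().
 */
#include <sys/types.h>
#include <sys/filio.h>
#include <fcntl.h>
#include <stdio.h>
#include <stropts.h>
#include <unistd.h>

int
main(void)
{
        const char *mntpt = "/tank/fs";         /* hypothetical mount point */
        int fd = open(mntpt, O_RDONLY);

        if (fd < 0) {
                perror("open");
                return (1);
        }
        if (ioctl(fd, _FIOFFS, 0) != 0) {       /* flush the whole filesystem */
                perror("ioctl(_FIOFFS)");
                (void) close(fd);
                return (1);
        }
        (void) close(fd);
        return (0);
}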
Bill Sommerfeld
2005-Dec-02 16:26 UTC
[zfs-discuss] Re: ZFS and memcntl(..., MC_SYNC, ...)
On Fri, 2005-12-02 at 06:52, Joerg Schilling wrote:

> Would you call the fsync() call from star before the close(f)
> call to be "not helpful"?

It's helpful in the sense that it will guarantee to whatever started tar that the files will be there after a crash if tar exits with a nonzero status. But if you call fsync() on file N before proceeding on to file N+1, you hide the amount of inherent parallelism actually present in the task at hand.

> It is the only official way to tell whether the extraction did work...

As an underlying primitive, yes, but see also aio_fsync(). Or you could do the fsync & close in a different thread.

- Bill
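A minimal sketch of the aio_fsync() route suggested above: queue the flush for one file and keep working while it completes. The file name, the polling-based completion check, and the single outstanding request are all simplifications; a real archiver would track many aiocbs and close each descriptor only after its flush completes. Link with -lrt.

/*
 * Sketch of aio_fsync(): the flush is queued and the caller is free
 * to continue; completion is checked later with aio_error().
 */
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
        static struct aiocb cb;         /* must stay valid until the flush completes */
        int fd = open("foo", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        int err;

        if (fd < 0 || write(fd, "x", 1) != 1) {
                perror("setup");
                return (1);
        }

        cb.aio_fildes = fd;
        cb.aio_sigevent.sigev_notify = SIGEV_NONE;      /* we will poll */
        if (aio_fsync(O_DSYNC, &cb) != 0) {             /* queue the flush */
                perror("aio_fsync");
                return (1);
        }

        /* ... extract the next file here while the flush proceeds ... */

        while ((err = aio_error(&cb)) == EINPROGRESS)
                (void) usleep(1000);                    /* poll for completion */
        if (err != 0) {
                (void) fprintf(stderr, "flush of foo failed: %s\n", strerror(err));
                return (1);
        }
        (void) aio_return(&cb);
        (void) close(fd);
        return (0);
}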
Joerg Schilling
2005-Dec-05 11:44 UTC
[zfs-discuss] Re: ZFS and memcntl(..., MC_SYNC, ...)
Bill Sommerfeld <sommerfeld at sun.com> wrote:

> On Fri, 2005-12-02 at 06:52, Joerg Schilling wrote:
>
>> Would you call the fsync() call from star before the close(f)
>> call to be "not helpful"?
>
> It's helpful in the sense that it will guarantee to whatever started tar
> that the files will be there after a crash if tar exits with a nonzero
> status.

I don't understand. Check out 'vi': it is doing the same as star does. Before vi started calling fsync() (I believe this was with SunOS-4.0), many users of vi did lose their files in certain situations.

If you like to have anything meaningful in the exit code of star, I see no way other than calling fsync(). Or do you have a better proposal?

> But if you call fsync() on file N before proceeding on to file N+1, you
> hide the amount of inherent parallelism actually present in the task at
> hand.
>
>> It is the only official way to tell whether the extraction did work...
>
> As an underlying primitive, yes, but see also aio_fsync(). Or you could
> do the fsync & close in a different thread.

I'll look at this, but note that this would cause the file descriptors to become a scarce resource that needs control.

Jörg