Jürgen Keil
2006-Jan-25 12:10 UTC
[zfs-discuss] bad lofi performance with zfs file backend / bad mmap write performance
Yesterday I tried to reproduce a pcfs mount problem on a current OpenSolaris release,
when I noticed some bad performance problems with lofi devices attached to a file on a
zfs filesystem.

The performance issue can be reproduced like this:
(I'm using snv_28, bfu'ed to b31 release bits, compiled from current OpenSolaris sources)

- /files/tmp is a zfs (compressed or uncompressed) filesystem

- mkfile -n 80g /files/tmp/pcfs.img

- lofiadm -a /files/tmp/pcfs.img

- fdisk /dev/rlofi/1
  --> create 100% FAT32LBA partition, make it the active partition, write the MBR

- truss -D -f mkfs -F pcfs -o fat=32 /dev/rlofi/1:c
  ...
  755:  0.1165 write(3, "EB <90 M S D O S 4 . 0\0".., 512)       = 512
  755:  0.0292 write(3, " R R a A\0\0\0\0\0\0\0\0".., 512)       = 512
  755:  0.0003 llseek(3, 4918272, SEEK_SET)                      = 4918272
  755:  0.0288 write(3, "EB <90 M S D O S 4 . 0\0".., 512)       = 512
  755:  0.0291 write(3, " R R a A\0\0\0\0\0\0\0\0".., 512)       = 512
  755:  0.0002 llseek(3, 4931584, SEEK_SET)                      = 4931584
  755: 41.5346 write(3, "F8FFFF0FFFFFFF0FF8FFFF0F".., 10484224)  = 10484224
  755: 47.3175 write(3, "F8FFFF0FFFFFFF0FF8FFFF0F".., 10484224)  = 10484224
  755:  0.1473 write(3, "\0\0\0\0\0\0\0\0\0\0\0\0".., 32768)     = 32768
  755:  0.0002 close(3)                                          = 0
  755:  0.0001 _exit(0)

Note the write system calls: one of them needs 41.5 seconds to write 10 MByte of data.
That's 10484224 / 41.5 = 252631 bytes/sec.  Why is this so slow?

When I run exactly the same test using a lofi device attached to a file on a UFS
filesystem, the same 10 MByte write completes in ~0.5 seconds.  Why is the test 80x
slower on ZFS, when compared to UFS?

---------------------------------------------------------------------------------------------------------------------------
#!/usr/sbin/dtrace -s

#pragma D option flowindent

fbt::lofi_strategy:entry
{
	self->t = timestamp;
}

fbt::lofi_strategy:return
{
	self->t = 0;
}

fbt::lofi_strategy_task:entry
{
	self->ls_task = timestamp;
}

fbt::segmap_fault:entry,
fbt::segmap_fault:return,
fbt::segmap_release:entry,
fbt::segmap_release:return,
fbt::segmap_getmapflt:entry,
fbt::segmap_getmapflt:return
/self->ls_task/
{
	printf("%llu", timestamp - self->ls_task);
}

fbt::lofi_strategy_task:return
/self->ls_task/
{
	printf("%llu", timestamp - self->ls_task);
	self->ls_task = 0;
}

fbt::taskq_dispatch:entry
/self->t/
{
	self->t_disp = timestamp;
}

fbt::taskq_dispatch:return
/self->t_disp/
{
	printf("%llu", timestamp - self->t_disp);
	self->t_disp = 0;
}

fbt:::
/self->t/
{
}
---------------------------------------------------------------------------------------------------------------------------

Using the above dtrace script, I've narrowed down the slowness to the segmap_release()
call in lofi_strategy_task():

http://cvs.opensolaris.org/source/xref/on/usr/src/uts/common/io/lofi.c#lofi_strategy_task
http://cvs.opensolaris.org/source/xref/on/usr/src/uts/common/io/lofi.c#370

lofi seems to split big block I/O requests into 8 KByte chunks: it maps each 8K chunk
(from a zfs file) into kernel memory, modifies it, and calls segmap_release() for the
8K chunk.  The segmap_release() call needs ~28 milliseconds to complete, for every 8K
chunk of data.
The above dtrace reports the following nanosecond times for function entry/returns for
functions called from lofi_strategy_task(), relative to the entry time into
lofi_strategy_task():

  0  -> segmap_getmapflt      390212
  0  <- segmap_getmapflt      397780
  0  -> segmap_fault          400686
  0  <- segmap_fault          471288
  0  -> segmap_fault          483743
  0  <- segmap_fault          494756
  0  -> segmap_release        497002    <<<<<<<<<<<<<<<<
  0  <- segmap_release      28477764    <<<<<<<<<<<<<<<<
  0  -> segmap_getmapflt    28482182
  0  <- segmap_getmapflt    28489342
  0  -> segmap_fault        28491860
  0  <- segmap_fault        28556055
  0  -> segmap_fault        28568419
  0  <- segmap_fault        28579688
  0  -> segmap_release      28581889    <<<<<<<<<<<<<<<<
  0  <- segmap_release      57585125    <<<<<<<<<<<<<<<<
  0  -> segmap_getmapflt    57588452
  0  <- segmap_getmapflt    57593437
  0  -> segmap_fault        57595443
  0  <- segmap_fault        57653024
  0  -> segmap_fault        57666526
  0  <- segmap_fault        57675899
  0  -> segmap_release      57677756    <<<<<<<<<<<<<<<<
  0  <- segmap_release      86721131    <<<<<<<<<<<<<<<<
  0  -> segmap_getmapflt    86726222
  0  <- segmap_getmapflt    86732854
  0  -> segmap_fault        86735832
  0  <- segmap_fault        86812762
  0  -> segmap_fault        86825782
  0  <- segmap_fault        86838545
  0  -> segmap_release      86840904    <<<<<<<<<<<<<<<<
  0  <- segmap_release     115792547    <<<<<<<<<<<<<<<<
  0  -> segmap_getmapflt   115797538
  0  <- segmap_getmapflt   115804102
  0  -> segmap_fault       115806849

It seems as if zfs creates transactions for each segmap_release() call and waits until
the transaction is written to disk.  The call stack for these transactions would be
something like this:

  0  | cv_block:sleep
              genunix`cv_wait+0x27
              zfs`zio_wait+0x46
              zfs`zil_flush_vdevs+0x197
              zfs`zil_commit+0x2ff
              zfs`zfs_putapage+0x266
              zfs`zfs_putpage+0x1a2
              genunix`fop_putpage+0x21
              genunix`segmap_release+0x2b3
              lofi`lofi_strategy_task+0x1eb
              genunix`taskq_thread+0x16c
              unix`thread_start+0x8

  0  | cv_block:sleep
              genunix`cv_wait+0x27
              zfs`zil_commit+0x2da
              zfs`zfs_putapage+0x266
              zfs`zfs_putpage+0x1a2
              genunix`fop_putpage+0x21
              genunix`segmap_release+0x2b3
              lofi`lofi_strategy_task+0x1eb
              genunix`taskq_thread+0x16c
              unix`thread_start+0x8

  0  | cv_block:sleep
              genunix`cv_wait+0x27
              zfs`zio_wait+0x46
              zfs`zil_flush_vdevs+0x197
              zfs`zil_commit+0x2ff
              zfs`zfs_putapage+0x266
              zfs`zfs_putpage+0x1a2
              genunix`fop_putpage+0x21
              genunix`segmap_release+0x2b3
              lofi`lofi_strategy_task+0x1eb
              genunix`taskq_thread+0x16c

-------------------------------------------------------------------------------------------------------------------------------

Another, possibly related, problem: using mmap to write a 10 MByte file on a zfs
filesystem.  Again, this is *much* faster on a logging UFS filesystem than on ZFS.
/*
 * similar issue as "zfs-lofi", writing file using mmap is slow on zfs
 * (ufs is ~ 10x-40x faster)
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <fcntl.h>
#include <sys/mman.h>

int
main(int argc, char **argv)
{
	char *filename = "data";
	int size = 10*1024*1024;
	int fd;
	void *addr;

	if (argv[1])
		filename = argv[1];

	fd = open(filename, O_RDWR|O_CREAT, 0777);
	if (fd < 0) {
		perror(filename);
		exit(1);
	}

	if (lseek(fd, size-1, SEEK_SET) != size-1) {
		perror("lseek");
		exit(1);
	}
	if (write(fd, "", 1) < 0) {
		perror("write");
		exit(1);
	}

	addr = mmap(NULL, size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
	if (addr == (void*)-1) {
		perror("mmap");
		exit(1);
	}
	close(fd);

	memset(addr, 0, size);
	msync(addr, size, MS_SYNC);
	munmap(addr, size);
	exit(0);
}

-------------------------------------------------------------------------------------------------------------------------------

% time zfs-lofi-mapw /var/tmp/data      (<-- logging ufs)
0.02u 0.15s 0:00.57 29.8%

% time zfs-lofi-mapw /files/tmp/data    (<-- zfs)
0.02u 0.73s 0:23.36 3.2%

This message posted from opensolaris.org
Joerg Schilling
2006-Jan-25 16:50 UTC
[zfs-discuss] bad lofi performance with zfs file backend / bad mmap write performance
Jürgen Keil <jk at tools.de> wrote:

> Yesterday I tried to reproduce a pcfs mount problem on a current OpenSolaris release,
> when I noticed some bad performance problems with lofi devices attached to a file on a
> zfs filesystem.
>
> The performance issue can be reproduced like this:
> (I'm using snv_28, bfu'ed to b31 release bits, compiled from current OpenSolaris sources)
>
> - /files/tmp is a zfs (compressed or uncompressed) filesystem
>
> - mkfile -n 80g /files/tmp/pcfs.img
>
> - lofiadm -a /files/tmp/pcfs.img
>
> - fdisk /dev/rlofi/1
>   --> create 100% FAT32LBA partition, make it the active partition, write the MBR

Did you call "mkfile" as root and have the "sticky" bit set?

Jörg

-- 
EMail: joerg at schily.isdn.cs.tu-berlin.de (home)  Jörg Schilling  D-13353 Berlin
       js at cs.tu-berlin.de                 (uni)
       schilling at fokus.fraunhofer.de      (work)  Blog: http://schily.blogspot.com/
URL:   http://cdrecord.berlios.de/old/private/  ftp://ftp.berlios.de/pub/schily
Juergen Keil
2006-Jan-25 18:31 UTC
[zfs-discuss] bad lofi performance with zfs file backend / bad mmap write performance
> > The performance issue can be reproduced like this:
> > (I'm using snv_28, bfu'ed to b31 release bits, compiled from current OpenSolaris sources)
> >
> > - /files/tmp is a zfs (compressed or uncompressed) filesystem
> >
> > - mkfile -n 80g /files/tmp/pcfs.img

> Did you call "mkfile" as root and have the "sticky" bit set?

Yes, all commands were run as user root; the sticky bit is set.
Joerg Schilling
2006-Jan-26 10:14 UTC
[zfs-discuss] bad lofi performance with zfs file backend / bad mmap write performance
Juergen Keil <jk at tools.de> wrote:

> > > - mkfile -n 80g /files/tmp/pcfs.img
>
> > Did you call "mkfile" as root and have the "sticky" bit set?
>
> Yes, all commands were run as user root; the sticky bit is set.

Then it seems that either ZFS ignores the sticky bit (could someone from Sun comment
on this please?) or the caching strategy of ZFS is counterproductive in your case.

Jörg

-- 
EMail: joerg at schily.isdn.cs.tu-berlin.de (home)  Jörg Schilling  D-13353 Berlin
       js at cs.tu-berlin.de                 (uni)
       schilling at fokus.fraunhofer.de      (work)  Blog: http://schily.blogspot.com/
URL:   http://cdrecord.berlios.de/old/private/  ftp://ftp.berlios.de/pub/schily
Casper.Dik at Sun.COM
2006-Jan-26 10:19 UTC
[zfs-discuss] bad lofi performance with zfs file backend / bad mmap write performance
>Juergen Keil <jk at tools.de> wrote:
>
>> > > - mkfile -n 80g /files/tmp/pcfs.img
>>
>> > Did you call "mkfile" as root and have the "sticky" bit set?
>>
>> Yes, all commands were run as user root; the sticky bit is set.
>
>Then it seems that either ZFS ignores the sticky bit (could someone from Sun
>comment on this please?) or the caching strategy of ZFS is counterproductive in
>your case.

I've seen similar issues: building a Solaris miniroot on a tmpfs filesystem is many
orders of magnitude faster than on ZFS.

However:

     If the sticky bit is set on a regular file and no execute
     bits are set, the system's page cache will not be used to
     hold the file's data.  This bit is normally set on swap files
     of diskless clients so that accesses to these files do not
     flush more valuable data from the system's cache.  Moreover,
     by default such files are treated as swap files, whose inode
     modification times may not necessarily be correctly recorded
     on permanent storage.

so it may be that you do not want the sticky bit set on such files when creating a
filesystem in them, as it will cause the data not to be cached.

Casper
Juergen Keil
2006-Jan-26 13:00 UTC
[zfs-discuss] bad lofi performance with zfs file backend / bad mmap write performance
> I've seen similar issues: building a Solaris miniroot on a tmpfs
> filesystem is many orders of magnitude faster than on ZFS.
>
> However:
>
>      If the sticky bit is set on a regular file and no execute
>      bits are set, the system's page cache will not be used to
>      hold the file's data.  This bit is normally set on swap files
>      of diskless clients so that accesses to these files do not
>      flush more valuable data from the system's cache.  Moreover,
>      by default such files are treated as swap files, whose inode
>      modification times may not necessarily be correctly recorded
>      on permanent storage.
>
> so it may be that you do not want the sticky bit set on such files
> when creating a filesystem in them, as it will cause the data
> not to be cached.

Repeating the test procedure with a "chmod 600 ..." thrown in after the mkfile (to
clear the sticky bit) shows no difference.  It's still slow on zfs.

Part of the reason why this test is slow on zfs appears to be this:

lofi releases modified pages with:

http://cvs.opensolaris.org/source/xref/on/usr/src/uts/common/io/lofi.c#362

	segmap_release(segkmap, mapaddr, SM_WRITE);

(= SYNC WRITE), so that zfs_putpage is called with flags == 0.  zfs_putpage creates a
transaction, and waits for the transaction to be written to disk.

http://cvs.opensolaris.org/source/xref/on/usr/src/uts/common/fs/zfs/zfs_vnops.c#2796

If lofi would use:

	segmap_release(segkmap, mapaddr, SM_WRITE|SM_ASYNC);

the "mkfs -F pcfs" would run *much* faster on zfs.  With SM_WRITE|SM_ASYNC, zfs_putpage
is called with flags = 0x400 (B_ASYNC), and zfs wouldn't *wait* for the new transaction
for every "8K putpage" to be flushed to disk.

In my b31 compiled release kernel, when I patch the "pushl $0x1" (SM_WRITE) at
lofi_strategy_task+0x1dd ...

	lofi_strategy_task+0x1dd:  pushl  $0x1
	lofi_strategy_task+0x1df:  pushl  %esi
	lofi_strategy_task+0x1e0:  pushl  0xfec00b8c
	lofi_strategy_task+0x1e6:  call   +0x695c3af   <segmap_release>

... to a "pushl $0x3" (SM_WRITE|SM_ASYNC), the 10 MByte writes in
"mkfs -F pcfs ... /dev/rlofi/1:c" complete in less than 0.5 seconds.
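At the source level, the binary patch above corresponds to roughly the following change
in lofi_strategy_task() (a sketch only, not a tested diff, using the variable names
from the code excerpt quoted later in this thread; the error-propagation concern with
SM_ASYNC is discussed in the follow-ups):

	/*
	 * Sketch: release the segmap mapping for each 8K write chunk
	 * asynchronously, so zfs_putpage() sees B_ASYNC and does not
	 * wait for a ZIL commit per chunk.
	 */
	smflags = 0;
	if (isread) {
		bcopy(mapaddr + mapoffset, bufaddr, xfersize);
	} else {
		/* was: smflags |= SM_WRITE; */
		smflags |= SM_WRITE | SM_ASYNC;
		bcopy(bufaddr, mapaddr + mapoffset, xfersize);
	}
	error = segmap_release(segkmap, mapaddr, smflags);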
Casper.Dik at Sun.COM
2006-Jan-26 13:49 UTC
[zfs-discuss] bad lofi performance with zfs file backend / bad mmap write performance
>Part of the reason why this test is slow on zfs appears to be this:
>
>lofi releases modified pages with:
>
>http://cvs.opensolaris.org/source/xref/on/usr/src/uts/common/io/lofi.c#362
>
>	segmap_release(segkmap, mapaddr, SM_WRITE);
>
>(= SYNC WRITE), so that zfs_putpage is called with flags == 0.  zfs_putpage
>creates a transaction, and waits for the transaction to be written to disk.
>
>http://cvs.opensolaris.org/source/xref/on/usr/src/uts/common/fs/zfs/zfs_vnops.c#2796
>
>If lofi would use:
>
>	segmap_release(segkmap, mapaddr, SM_WRITE|SM_ASYNC);

It is unclear to me that this is always the proper behaviour; the strategy routine is
assumed to be synchronous unless the "B_ASYNC" flag is passed.

So perhaps the fix should take that into account?
(And then the gains may evaporate)

Also, "ASYNC" will limit upward error propagation; that may lead to panics more
quickly when write errors in, e.g., UFS filesystems are ignored.

Casper
Nicolas Williams
2006-Jan-26 16:26 UTC
[zfs-discuss] bad lofi performance with zfs file backend / bad mmap write performance
On Thu, Jan 26, 2006 at 02:49:49PM +0100, Casper.Dik at Sun.COM wrote:
> >If lofi would use:
> >
> >	segmap_release(segkmap, mapaddr, SM_WRITE|SM_ASYNC);
>
> It is unclear to me that this is always the proper behaviour;
> the strategy routine is assumed to be synchronous unless the
> "B_ASYNC" flag is passed.
>
> So perhaps the fix should take that into account?
> (And then the gains may evaporate)

It might be an option, so mkfs can go fast, but you really could not mount an async
lofi:

> Also, "ASYNC" will limit upward error propagation; that may lead
> to panics more quickly when write errors in, e.g., UFS filesystems are
> ignored.

Nico
-- 
Casper.Dik at Sun.COM
2006-Jan-26 16:38 UTC
[zfs-discuss] bad lofi performance with zfs file backend / bad mmap write performance
>On Thu, Jan 26, 2006 at 02:49:49PM +0100, Casper.Dik at Sun.COM wrote:
>> >If lofi would use:
>> >
>> >	segmap_release(segkmap, mapaddr, SM_WRITE|SM_ASYNC);
>>
>> It is unclear to me that this is always the proper behaviour;
>> the strategy routine is assumed to be synchronous unless the
>> "B_ASYNC" flag is passed.
>>
>> So perhaps the fix should take that into account?
>> (And then the gains may evaporate)
>
>It might be an option, so mkfs can go fast, but you really could not
>mount an async lofi:
>
>> Also, "ASYNC" will limit upward error propagation; that may lead
>> to panics more quickly when write errors in, e.g., UFS filesystems are
>> ignored.

If the strategy routine requests async writes then there's no reason not to release
the segmap with async.  Don't know how often that would be.

Casper
Jürgen Keil
2006-Jan-27 11:27 UTC
[zfs-discuss] Re: bad lofi performance with zfs file backend / bad mmap write performan
> >Part of the reason why this test is slow on zfs appears to be this:
> >
> >lofi releases modified pages with:
> >
> >	segmap_release(segkmap, mapaddr, SM_WRITE);
...
> >If lofi would use:
> >
> >	segmap_release(segkmap, mapaddr, SM_WRITE|SM_ASYNC);

> Also, "ASYNC" will limit upward error propagation;

Yes, changing lofi to use SM_ASYNC appears to be a bad idea: it'll skip a biowait()
when used with a file backend in a UFS filesystem, so that we'll miss error return
values.

==

Another thing that I've noticed is that a lot of the Solaris filesystem modules use
"pvn_write_kluster()" somewhere inside the "VOP_PUTPAGE()" implementation, apparently
to replace single-page writes with bigger writes that include dirty pages immediately
before and after the page.

e.g. UFS, in ufs_putapage():
http://cvs.opensolaris.org/source/xref/on/usr/src/uts/common/fs/ufs/ufs_vnops.c#5236

or TMPFS, in tmp_putapage():
http://cvs.opensolaris.org/source/xref/on/usr/src/uts/common/fs/tmpfs/tmp_vnops.c#1912

zfs doesn't appear to implement "klustered page writes" inside zfs_putpage().

Shouldn't zfs try to write klusters of pages, instead of single pages (see the sketch
below)?  That might also help with the mmap() write test program that I included in
the initial message (which doesn't use lofi at all).

Btw. the performance problem that is shown by the mmap() write test program can also
be observed with the linker /usr/ccs/bin/ld, which appears to use mmap() to write its
output file.  Building a big C++ shared library (19 MBytes, libQtGui_debug.so.4.0.1)
with the output file on UFS is fast, and is *much* slower when the output file is
written to a ZFS filesystem.  tcsh's time function reports:

shlib output file on UFS, object input files from ZFS (I repeated the link 3x):
1.50u 0.75s 0:04.11 54.7%
1.51u 0.77s 0:03.91 58.3%
1.51u 0.76s 0:04.39 51.7%

output file on ZFS, object input files from ZFS (repeated 3x):
1.51u 0.93s 0:31.22 7.8%
1.54u 0.97s 0:38.39 6.5%
1.52u 0.95s 0:36.30 6.8%

This message posted from opensolaris.org
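For reference, the klustering idiom used by ufs_putapage() and tmp_putapage() looks
roughly like the sketch below.  This is not ZFS code and not a proposed patch;
fs_blocksize() and fs_write_range() are placeholder names for whatever a filesystem
would use to size the kluster window and issue the combined write.

	static int
	fs_putapage(vnode_t *vp, page_t *pp, u_offset_t *offp, size_t *lenp,
	    int flags, cred_t *cr)
	{
		u_offset_t	io_off, kluster_off;
		size_t		io_len, kluster_len;
		int		err;

		/*
		 * Kluster window: e.g. one filesystem block around the page
		 * (window assumed to be a power of two here).
		 */
		kluster_len = MAX(PAGESIZE, fs_blocksize(vp));	/* placeholder */
		kluster_off = pp->p_offset & ~((u_offset_t)kluster_len - 1);

		/*
		 * Gather adjacent dirty pages within the window into one page
		 * list; io_off/io_len come back describing the combined range.
		 */
		pp = pvn_write_kluster(vp, pp, &io_off, &io_len,
		    kluster_off, kluster_len, flags);

		/* One larger write instead of a separate write per page. */
		err = fs_write_range(vp, pp, io_off, io_len, flags, cr);  /* placeholder */

		/* Unlock the pages and mark the I/O done (or in flight, if async). */
		pvn_write_done(pp, ((err) ? B_ERROR : 0) | B_WRITE | flags);

		if (offp != NULL)
			*offp = io_off;
		if (lenp != NULL)
			*lenp = io_len;
		return (err);
	}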
Jürgen Keil
2006-Jan-30 12:38 UTC
[zfs-discuss] Re: bad lofi performance with zfs file backend / bad mmap write performan
I wrote:

> Part of the reason why this test is slow on zfs appears to be this:
>
> lofi releases modified pages with:
>
> http://cvs.opensolaris.org/source/xref/on/usr/src/uts/common/io/lofi.c#362
>
>	segmap_release(segkmap, mapaddr, SM_WRITE);
>
> (= SYNC WRITE), so that zfs_putpage is called with flags == 0.  zfs_putpage
> creates a transaction, and waits for the transaction to be written to disk.
>
> http://cvs.opensolaris.org/source/xref/on/usr/src/uts/common/fs/zfs/zfs_vnops.c#2796

After some more dtracing, I found out that the performance killers are the frequent
DKIOCFLUSHWRITECACHE ioctls sent to the notebook's 4200 rpm P-ATA HDD.

This happens in zil_flush_vdevs():

http://cvs.opensolaris.org/source/xref/on/usr/src/uts/common/fs/zfs/zil.c#zil_flush_vdevs

Although the I/O write requests are for sequential blocks, flushing the disk's write
cache after every chunk of data written results in a data rate of 8K (or 4K?) written
per disk revolution.

The attached dtrace script produces output like this (note, this time I'm using a
7200 rpm P-ATA drive; it needs ~8 milliseconds per DKIOCFLUSHWRITECACHE ioctl during
this test):

  0   9311          bdev_strategy:entry  edev 1980047, flags 1080101, bn 3c34d00, count 2000
  0  34086  vdev_disk_ioctl_done:entry  DKIOCFLUSHWRITECACHE time: 7890593 nsec, error 0
  0   9311          bdev_strategy:entry  edev 1980047, flags 1080101, bn 3c34d10, count 2000
  0  34086  vdev_disk_ioctl_done:entry  DKIOCFLUSHWRITECACHE time: 7872219 nsec, error 0
  0   9311          bdev_strategy:entry  edev 1980047, flags 1080101, bn 3c34d20, count 2000
  0  34086  vdev_disk_ioctl_done:entry  DKIOCFLUSHWRITECACHE time: 7698369 nsec, error 0
  0   9311          bdev_strategy:entry  edev 1980047, flags 1080101, bn 3c34d30, count 2000
  0  34086  vdev_disk_ioctl_done:entry  DKIOCFLUSHWRITECACHE time: 7575891 nsec, error 0
  0   9311          bdev_strategy:entry  edev 1980047, flags 1080101, bn 3c34d40, count 2000
  0  34086  vdev_disk_ioctl_done:entry  DKIOCFLUSHWRITECACHE time: 7907461 nsec, error 0
  0   9311          bdev_strategy:entry  edev 1980047, flags 1080101, bn 3c34d50, count 2000
  0  34086  vdev_disk_ioctl_done:entry  DKIOCFLUSHWRITECACHE time: 7792415 nsec, error 0
  0   9311          bdev_strategy:entry  edev 1980047, flags 1080101, bn 3c34d60, count 2000
  0  34086  vdev_disk_ioctl_done:entry  DKIOCFLUSHWRITECACHE time: 8039159 nsec, error 0
  0   9311          bdev_strategy:entry  edev 1980047, flags 1080101, bn 3c34d70, count 2000
  0  34086  vdev_disk_ioctl_done:entry  DKIOCFLUSHWRITECACHE time: 7537145 nsec, error 0

On disk devices that do not support DKIOCFLUSHWRITECACHE (USB, Firewire), lofi and
mmap write performance on zfs is much better.

On USB, I saw DKIOCFLUSHWRITECACHE failing with errno 48 (ENOTSUP); the sd driver has
the flag "un_f_sync_cache_unsupported" set.

And on Firewire, DKIOCFLUSHWRITECACHE is accepted without error, but finishes in ~6
microseconds.  On Firewire, sd thinks that the device doesn't have its write cache
enabled (un_f_sync_cache_unsupported = 0, un_f_write_cache_enabled = 0), so that
DKIOCFLUSHWRITECACHE turns into a no-op.

This message posted from opensolaris.org
-------------- next part --------------
A non-text attachment was scrubbed...
Name: lofi8.d
Type: application/octet-stream
Size: 580 bytes
Desc: not available
URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20060130/f82279e2/attachment.obj>
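The lofi8.d attachment did not survive the archive.  A D script along the following
lines would produce output of the shape shown above; this is an approximate
reconstruction, not the original script (which also printed the block number), it
assumes the flushes are issued one at a time, and the DKIOCFLUSHWRITECACHE value is
hard-coded from sys/dkio.h.

#!/usr/sbin/dtrace -qs

/*
 * Approximate reconstruction of lofi8.d: print each block write queued
 * via bdev_strategy(), and time every write-cache flush from the
 * ldi_ioctl() call in the ZFS vdev code until its completion callback
 * vdev_disk_ioctl_done() runs.
 */

fbt::bdev_strategy:entry
{
	printf("bdev_strategy: edev %x, flags %x, count %x\n",
	    args[0]->b_edev, args[0]->b_flags, args[0]->b_bcount);
}

/* DKIOCFLUSHWRITECACHE == (0x04 << 8) | 34 == 0x422, from sys/dkio.h */
fbt::ldi_ioctl:entry
/arg1 == 0x422/
{
	flush_start = timestamp;
}

fbt::vdev_disk_ioctl_done:entry
/flush_start != 0/
{
	printf("DKIOCFLUSHWRITECACHE time: %d nsec, error %d\n",
	    timestamp - flush_start, arg1);
	flush_start = 0;
}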
Bill Moore
2006-Jan-30 16:52 UTC
[zfs-discuss] Re: bad lofi performance with zfs file backend / bad mmap write performan
On Mon, Jan 30, 2006 at 04:38:31AM -0800, Jürgen Keil wrote:
> After some more dtracing, I found out that the performance killers are
> the frequent DKIOCFLUSHWRITECACHE ioctls sent to the notebook's 4200
> rpm P-ATA HDD.
>
> This happens in zil_flush_vdevs():
>
> http://cvs.opensolaris.org/source/xref/on/usr/src/uts/common/fs/zfs/zil.c#zil_flush_vdevs
>
> Although the I/O write requests are for sequential blocks, flushing
> the disk's write cache after every chunk of data written results in a
> data rate of 8K (or 4K?) written per disk revolution.

I don't think the ioctl() is doing what you think it's doing.  The naming is a bit
unfortunate, being copied from the ATA command we send to the disk.  It doesn't really
flush the write cache, it is more like a write barrier.  The spec says that the
command will not complete until all previous writes have actually been written to the
drive's media.  The data in the drive's cache is preserved, not flushed.

> The attached dtrace script produces output like this (note, this time
> I'm using a 7200 rpm P-ATA drive; it needs ~8 milliseconds per
> DKIOCFLUSHWRITECACHE ioctl during this test):

That's not unreasonable.  That's really saying that an I/O to your disk takes about
8ms, which is about right for a laptop drive, though I'd think that a 7200rpm disk
should be a bit quicker.  Perhaps there are multiple cached writes that need to be
written to media.

> On disk devices that do not support DKIOCFLUSHWRITECACHE (USB,
> Firewire), lofi and mmap write performance on zfs is much better.

The problem is that there are some very thin windows where the drive's write caching
can mess with on-disk consistency during the final phases of a transaction group
commit if a power outage were to happen at the same time.  Not terribly likely, but
from a correctness point of view, the window is either open or closed.  We're working
on some changes to further mitigate this risk, but any drive containing important
data should really support this command.

> And on Firewire, DKIOCFLUSHWRITECACHE is accepted without error, but
> finishes in ~6 microseconds.  On Firewire, sd thinks that the device
> doesn't have its write cache enabled (un_f_sync_cache_unsupported = 0,
> un_f_write_cache_enabled = 0), so that DKIOCFLUSHWRITECACHE turns
> into a no-op.

An interesting question would be, is the drive's write cache *really* disabled?  If
you run "format -e" on the drive, and turn on the write cache, can you see any
difference in performance?  I've found that on ATA disks without NCQ, write caching
is necessary for non-sucky performance.  On SCSI devices with tagged queuing, turning
on the write cache helps, but not to the same degree.

BTW, thanks for the investigation and analysis - this is great stuff.

--Bill
Jürgen Keil
2006-Jan-31 10:39 UTC
[zfs-discuss] Re: Re: bad lofi performance with zfs file backend / bad mmap write perfo
> On Mon, Jan 30, 2006 at 04:38:31AM -0800, Jürgen Keil wrote:
> > After some more dtracing, I found out that the performance killers are
> > the frequent DKIOCFLUSHWRITECACHE ioctls sent to the notebook's 4200
> > rpm P-ATA HDD.
> >
> > Although the I/O write requests are for sequential blocks, flushing
> > the disk's write cache after every chunk of data written results in a
> > data rate of 8K (or 4K?) written per disk revolution.
>
> I don't think the ioctl() is doing what you think it's doing.  The
> naming is a bit unfortunate, being copied from the ATA command we send
> to the disk.

The ATA "FLUSH CACHE" command, command code 0xE7?  (I can't check the
DKIOCFLUSHWRITECACHE implementation for the ata driver, because "ata" is closed
source.)

> It doesn't really flush the write cache, it is more like a
> write barrier.  The spec says that the command will not complete until
> all previous writes have actually been written to the drive's media.
> The data in the drive's cache is preserved, not flushed.

Yep, and that has exactly been my understanding of what happens when
DKIOCFLUSHWRITECACHE is sent to an ATA device.

I think what happens is something like this:

1. 8K (16 sectors) of data is written to sectors x ... x+15.
   The HDD puts the data into its write cache; the data is not yet written to disk.

2. A "FLUSH CACHE" command is submitted to the HDD.  The HDD writes the sectors
   x ... x+15 from its cache to the disk.  When the "FLUSH CACHE" command completes,
   we're after the end of sector x+15; probably sector x+16 already passes under the
   HDD's read/write head.

3. zfs sends the next chunk of data, 8K for sectors x+16 ... x+31.  The HDD stores
   this data in its write cache.  The disk keeps spinning.  (Maybe sector x+17 or one
   of the following sectors is currently passing under the read/write head.)

4. Another "FLUSH CACHE" command is submitted to the HDD.  We've missed the address
   field for sector x+16; to write the modified contents for sector x+16 from the
   disk's write cache, the HDD has to wait a complete revolution of the disk until it
   can write sector x+16.

This is where the huge DKIOCFLUSHWRITECACHE delay is coming from.

> > The attached dtrace script produces output like this (note, this time
> > I'm using a 7200 rpm P-ATA drive; it needs ~8 milliseconds per
> > DKIOCFLUSHWRITECACHE ioctl during this test):
>
> That's not unreasonable.  That's really saying that an I/O to your disk
> takes about 8ms, which is about right for a laptop drive, though I'd
> think that a 7200rpm disk should be a bit quicker.

Note that I've switched computer and disks here:

- laptop, 4200 RPM drive:  60 seconds / 4200 RPM = 14.3 milliseconds
- desktop, 7200 RPM drive: 60 seconds / 7200 RPM =  8.3 milliseconds

> Perhaps there are multiple cached writes that need to be written to media.

No, I don't think so.  I've traced all bdev_strategy calls, and all I can see is a
single 8K (sequential) write request between the DKIOCFLUSHWRITECACHE ioctls.

> > On disk devices that do not support DKIOCFLUSHWRITECACHE (USB,
> > Firewire), lofi and mmap write performance on zfs is much better.
>
> The problem is that there are some very thin windows where the drive's
> write caching can mess with on-disk consistency during the final phases
> of a transaction group commit if a power outage were to happen at the
> same time.  Not terribly likely, but from a correctness point of view,
> the window is either open or closed.  We're working on some changes to
> further mitigate this risk, but any drive containing important data
> should really support this command.

Ok.  So one has to be careful when using USB or Firewire devices with zfs, because I
doubt that these devices have a working "flush disk write cache" command or have
their write cache completely disabled.

> > And on Firewire, DKIOCFLUSHWRITECACHE is accepted without error, but
> > finishes in ~6 microseconds.  On Firewire, sd thinks that the device
> > doesn't have its write cache enabled (un_f_sync_cache_unsupported = 0,
> > un_f_write_cache_enabled = 0), so that DKIOCFLUSHWRITECACHE turns
> > into a no-op.
>
> An interesting question would be, is the drive's write cache *really*
> disabled?

The HDD that I've used with the USB / Firewire test is a standard IDE P-ATA drive
(250 GB Hitachi), behind a USB<->IDE bridge (or Firewire<->IDE bridge).  I don't know
for sure, but I'm assuming the ATA drive is sold with its write cache enabled.

> If you run "format -e" on the drive, and turn on the write
> cache, can you see any difference in performance?

This assumes the SCSI "MODE SELECT" / "MODE SENSE" commands used for manipulating the
disk's write cache setting are supported by the (USB / Firewire)<->IDE bridge chip.
Somehow I doubt that.  I'll double check that.

> I've found that on ATA disks without NCQ, write caching is necessary for
> non-sucky performance.  On SCSI devices with tagged queuing,
> turning on the write cache helps, but not to the same degree.

Hmm, disabling the ATA disk's write cache might be an interesting experiment, because
on a disk with a disabled write cache the DKIOCFLUSHWRITECACHE ioctl turns into a nop.
Unfortunately there is no Solaris support for changing an ATA disk's write cache
setting in format -e.

This message posted from opensolaris.org
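A rough consistency check of the one-chunk-per-revolution theory, using only the
figures already quoted in this thread (whether a flush happens per 4K page or per 8K
chunk is not pinned down above, so both cases are shown):

    4200 rpm = 70 revolutions/sec
    one 8K chunk per revolution:  8192 * 70 = ~573 KB/s
    one 4K page  per revolution:  4096 * 70 = ~287 KB/s

    observed in the first message (4200 rpm laptop drive):
    10484224 / 41.5 s = ~252 KB/s

so the measured rate is in the same ballpark as writing one small chunk per disk
revolution.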
Jürgen Keil
2006-Feb-01 18:11 UTC
[zfs-discuss] Re: Re: bad lofi performance with zfs file backend / bad mmap write perfo
> > And on Firewire, DKIOCFLUSHWRITECACHE is accepted without error, but
> > finishes in ~6 microseconds.  On Firewire, sd thinks that the device
> > doesn't have its write cache enabled (un_f_sync_cache_unsupported = 0,
> > un_f_write_cache_enabled = 0), so that DKIOCFLUSHWRITECACHE turns
> > into a no-op.
>
> An interesting question would be, is the drive's write cache *really*
> disabled?

It isn't disabled; I removed the IDE HDD from the external Firewire enclosure and
installed it directly on the P-ATA bus.  Printing the "struct dadk" for the disk shows
that ata has found the write cache to be enabled:

    dad_noflush = 0
    dad_wce = 0x1

> If you run "format -e" on the drive, and turn on the write
> cache, can you see any difference in performance?

The SCSI MODE SENSE command to query the write caching parameter page fails on the
firewire device; it's impossible to query or change the write caching parameters on
this firewire device through format -e:

# format -eM
Searching for disks...

AVAILABLE DISK SELECTIONS:
       0. c0d0 <DEFAULT cyl 24379 alt 2 hd 64 sec 63>
          /pci@0,0/pci-ide@1f,1/ide@0/cmdk@0,0
       1. c2t0d0 <drive type unknown>
          /pci@0,0/pci1179,ff10@1d,7/storage@1/disk@0,0
       2. c4t0d0 <DEFAULT cyl 59536 alt 2 hd 16 sec 63>
          /pci@0,0/pci8086,2448@1e/pci1179,ff10@9,1/pci574,86c@0/unit@0001a30000063ee8/disk@0,0
Specify disk (enter its number): 2
selecting c4t0d0
[disk formatted]
...

WRITE_CACHE MENU:
        display    - display current setting of write cache
        enable     - enable write cache
        disable    - disable write cache
        !<cmd>     - execute <cmd>, then return
        quit

write_cache> display

Mode sense page 0x8: incorrect page code 0x0
Mode sense: 03 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
            00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
            00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
            00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
            00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
            00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
            00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
            00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
            00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
            00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
            00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
            00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
            00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
            00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
            00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
            00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Mode sense failed.

write_cache> enable

Mode sense page 0x8: incorrect page code 0x0
Mode sense: 03 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
            00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
            00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
            00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
            00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
            00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
            00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
            00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
            00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
            00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
            00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
            00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
            00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
            00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
            00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
            00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

This firewire device has an "interesting" implementation of the MODE SENSE command
(for an ATA disk installed behind the Firewire<->IDE bridge): it seems the firewire
device returns a block of zero bytes (according to the MODE SENSE command's requested
data buffer length), with the exception of the first byte, which is set to 3.  So we
get back a minimum "struct mode_header" of four bytes, and no block_descriptor and no
mode page data.

format is clever enough to check for this behaviour, and detects that the device
hasn't returned valid page 8 mode sense data when the device was asked to return page
8 data.

The sd driver's DKIOCFLUSHWRITECACHE implementation in sd_get_write_cache_enabled()
isn't that clever: it doesn't verify that the device has returned the expected caching
page #8 mode page data.  It simply checks the mode_caching_page->wce bit, which isn't
set because the firewire device is returning 0 bytes, so that sd believes that this
firewire device has its write cache disabled.

sd's sd_get_write_cache_enabled() could do a better job validating the mode sense data
returned by the device.

This message posted from opensolaris.org
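A minimal sketch of the kind of validation sd_get_write_cache_enabled() could add
before trusting the WCE bit.  Struct and constant names follow sys/scsi/generic/mode.h;
the surrounding sd code is elided, and check_write_cache_enabled() is a hypothetical
helper, not an existing sd function.

	#include <sys/types.h>
	#include <sys/errno.h>
	#include <sys/scsi/generic/mode.h>

	/*
	 * Given the raw MODE SENSE data in buf/buflen, set *enabled from the
	 * WCE bit only if the device really returned the caching page.
	 */
	static int
	check_write_cache_enabled(uchar_t *buf, size_t buflen, int *enabled)
	{
		struct mode_header	*hdr = (struct mode_header *)buf;
		struct mode_caching	*cpage;
		size_t			need;

		if (buflen < sizeof (struct mode_header))
			return (EIO);

		need = sizeof (struct mode_header) + hdr->bdesc_length +
		    sizeof (struct mode_caching);
		if (buflen < need)
			return (EIO);	/* too short to contain the caching page */

		cpage = (struct mode_caching *)
		    (buf + sizeof (struct mode_header) + hdr->bdesc_length);

		/* The firewire bridge above returns no page data at all here. */
		if (cpage->mode_page.code != MODEPAGE_CACHING)
			return (EIO);	/* some other (or no) page was returned */

		*enabled = (cpage->wce != 0);
		return (0);
	}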
Richard Elling
2006-Feb-01 20:25 UTC
[zfs-discuss] Re: Re: bad lofi performance with zfs file backend / bad mmap write perfo
AFAIK, write cache disable is a nop on some or many IDE disks.  You might need an
empirical test to verify.

 -- richard

This message posted from opensolaris.org
Joerg Schilling
2006-Feb-02 10:07 UTC
[zfs-discuss] Re: Re: bad lofi performance with zfs file backend / bad mmap write perfo
Jürgen Keil <jk at tools.de> wrote:

> WRITE_CACHE MENU:
>         display    - display current setting of write cache
>         enable     - enable write cache
>         disable    - disable write cache
>         !<cmd>     - execute <cmd>, then return
>         quit
>
> write_cache> display
>
> Mode sense page 0x8: incorrect page code 0x0
> Mode sense: 03 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>             00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>             00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>             00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>             00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>             00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>             00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>             00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>             00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>             00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>             00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>             00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>             00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>             00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>             00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>             00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> Mode sense failed.
>
> write_cache> enable

...

> This firewire device has an "interesting" implementation of the MODE SENSE
> command (for an ATA disk installed behind the Firewire<->IDE bridge): it seems
> the firewire device returns a block of zero bytes (according to the MODE SENSE
> command's requested data buffer length), with the exception of the first byte,
> which is set to 3.

Mode sense is the most misunderstood part of SCSI :-(

See the mode page related code in cdrecord.  What you observe is one of many
well-known implementation bugs.

Jörg

-- 
EMail: joerg at schily.isdn.cs.tu-berlin.de (home)  Jörg Schilling  D-13353 Berlin
       js at cs.tu-berlin.de                 (uni)
       schilling at fokus.fraunhofer.de      (work)  Blog: http://schily.blogspot.com/
URL:   http://cdrecord.berlios.de/old/private/  ftp://ftp.berlios.de/pub/schily
Jeff Bonwick
2006-Feb-07 08:33 UTC
[zfs-discuss] bad lofi performance with zfs file backend / bad mmap write performance
> If lofi would use:
>
>	segmap_release(segkmap, mapaddr, SM_WRITE|SM_ASYNC);
>
> [ stuff would go faster ]

This seems reasonable -- certainly if the incoming I/O is B_ASYNC.  Looking at the
lofi code a little deeper, it appears that this is a sin of omission.  Here's the
surrounding code and commentary, streamlined to illustrate the point:

	/*
	 * We used to always use vn_rdwr here, but we cannot do that because
	 * we might decide to read or write from the underlying
	 * file during this call, which would be a deadlock because
	 * we have the rw_lock.  So instead we page, unless it's not
	 * mapable or it's a character device.
	 */
	if (we should use segmap) {
		smflags = 0;
		if (isread) {
			bcopy(mapaddr + mapoffset, bufaddr, xfersize);
		} else {
			smflags |= SM_WRITE;
			bcopy(bufaddr, mapaddr + mapoffset, xfersize);
		}
		error = segmap_release(segkmap, mapaddr, smflags);
	} else {
		error = vn_rdwr(rw, lsp->ls_vp, bufaddr, bp->b_bcount,
		    offset, UIO_SYSSPACE, 0, RLIM64_INFINITY, kcred, &resid);
	}

Notice, in the 'else' clause, that the argument after UIO_SYSSPACE is 0; that's the
ioflag argument, which should be 'FSYNC' or 'FDSYNC' if the original I/O request was
not B_ASYNC.  In other words, lofi was always broken for synchronous I/O.  The new
segmap-based code makes the opposite mistake: it treats *everything* as synchronous
I/O.  In both cases, it is the absence of a non-zero flag that makes the bug so easy
to write.

It seems to me that the logic we really want here is something like this:

	if (using segmap) {
		if (bp->b_flags & B_ASYNC)
			smflags |= SM_ASYNC;
	} else {
		error = vn_rdwr(..., bp->b_flags & B_ASYNC ? 0 : FDSYNC, ...);
	}

Alternatively, because segmap operates on 8k chunks and there's no good reason to
issue back-to-back synchronous I/O, we could make all of the I/O async and follow it
with a VOP_FSYNC() when necessary.

Mike (Sullivan, that is), what do you think?

Jeff
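A rough sketch of that second alternative (an illustration only, not Jeff's code; it
reuses the variable names from the excerpt above and the three-argument VOP_FSYNC()
form of that era):

	/* Per chunk: release every 8K mapping asynchronously. */
	smflags = 0;
	if (isread) {
		bcopy(mapaddr + mapoffset, bufaddr, xfersize);
	} else {
		smflags |= SM_WRITE | SM_ASYNC;
		bcopy(bufaddr, mapaddr + mapoffset, xfersize);
	}
	error = segmap_release(segkmap, mapaddr, smflags);

	/*
	 * Once per buf, after the last chunk has been released: make the
	 * data durable only if the original request was synchronous.
	 */
	if (error == 0 && !isread && !(bp->b_flags & B_ASYNC))
		error = VOP_FSYNC(lsp->ls_vp, FSYNC, kcred);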
Mike Sullivan
2006-Feb-07 19:14 UTC
[zfs-discuss] bad lofi performance with zfs file backend / bad mmap write performance
Jeff Bonwick wrote:
>> If lofi would use:
>>
>>	segmap_release(segkmap, mapaddr, SM_WRITE|SM_ASYNC);
>>
>> [ stuff would go faster ]
>
> This seems reasonable -- certainly if the incoming I/O is B_ASYNC.
> Looking at the lofi code a little deeper, it appears that this is
> a sin of omission.  Here's the surrounding code and commentary,
> streamlined to illustrate the point:

> Alternatively, because segmap operates on 8k chunks and there's no good
> reason to issue back-to-back synchronous I/O, we could make all of the
> I/O async and follow it with a VOP_FSYNC() when necessary.
>
> Mike (Sullivan, that is), what do you think?

I think I always hated that section of code, and at the time certainly wasn't too
worried about performance since that wasn't the point of lofi :)

If it can be made faster/better/simpler go for it :)

Mike