Jürgen Keil
2006-Jan-25 12:10 UTC
[zfs-discuss] bad lofi performance with zfs file backend / bad mmap write performance
Yesterday I tried to reproduce a pcfs mount problem on a current OpenSolaris release,
when I noticed some bad performance problems with lofi devices attached to a file on a
zfs filesystem.

The performance issue can be reproduced like this:
(I'm using snv_28, bfu'ed to b31 release bits, compiled from current OpenSolaris sources)

- /files/tmp is a zfs (compressed or uncompressed) filesystem

- mkfile -n 80g /files/tmp/pcfs.img

- lofiadm -a /files/tmp/pcfs.img

- fdisk /dev/rlofi/1
  --> create 100% FAT32LBA partition, make it the active partition, write the MBR

- truss -D -f mkfs -F pcfs -o fat=32 /dev/rlofi/1:c
  ...
  755:  0.1165 write(3, "EB <90 M S D O S 4 . 0\0".., 512)       = 512
  755:  0.0292 write(3, " R R a A\0\0\0\0\0\0\0\0".., 512)       = 512
  755:  0.0003 llseek(3, 4918272, SEEK_SET)                      = 4918272
  755:  0.0288 write(3, "EB <90 M S D O S 4 . 0\0".., 512)       = 512
  755:  0.0291 write(3, " R R a A\0\0\0\0\0\0\0\0".., 512)       = 512
  755:  0.0002 llseek(3, 4931584, SEEK_SET)                      = 4931584
  755: 41.5346 write(3, "F8FFFF0FFFFFFF0FF8FFFF0F".., 10484224)  = 10484224
  755: 47.3175 write(3, "F8FFFF0FFFFFFF0FF8FFFF0F".., 10484224)  = 10484224
  755:  0.1473 write(3, "\0\0\0\0\0\0\0\0\0\0\0\0".., 32768)     = 32768
  755:  0.0002 close(3)                                          = 0
  755:  0.0001 _exit(0)

Note the write system calls: one of them needs 41.5 seconds to write 10 MByte of data.
That's 10484224 / 41.5 = 252631 bytes/sec.  Why is this so slow?

When I run exactly the same test using a lofi device attached to a file on a UFS
filesystem, the same 10 MByte write completes in ~0.5 seconds.  Why is the test 80x
slower on ZFS, when compared to UFS?

---------------------------------------------------------------------------------------------------------------------------
#!/usr/sbin/dtrace -s

#pragma D option flowindent

fbt::lofi_strategy:entry
{
	self->t = timestamp;
}

fbt::lofi_strategy:return
{
	self->t = 0;
}

fbt::lofi_strategy_task:entry
{
	self->ls_task = timestamp;
}

fbt::segmap_fault:entry,
fbt::segmap_fault:return,
fbt::segmap_release:entry,
fbt::segmap_release:return,
fbt::segmap_getmapflt:entry,
fbt::segmap_getmapflt:return
/self->ls_task/
{
	printf("%llu", timestamp - self->ls_task);
}

fbt::lofi_strategy_task:return
/self->ls_task/
{
	printf("%llu", timestamp - self->ls_task);
	self->ls_task = 0;
}

fbt::taskq_dispatch:entry
/self->t/
{
	self->t_disp = timestamp;
}

fbt::taskq_dispatch:return
/self->t_disp/
{
	printf("%llu", timestamp - self->t_disp);
	self->t_disp = 0;
}

fbt:::
/self->t/
{
}
---------------------------------------------------------------------------------------------------------------------------

Using the above dtrace script, I've narrowed down the slowness to the segmap_release()
call in lofi_strategy_task():

http://cvs.opensolaris.org/source/xref/on/usr/src/uts/common/io/lofi.c#lofi_strategy_task
http://cvs.opensolaris.org/source/xref/on/usr/src/uts/common/io/lofi.c#370

lofi seems to split big block I/O requests into 8 KByte chunks: it maps each 8K chunk
(from a zfs file) into kernel memory, modifies it, and calls segmap_release() for the
8K chunk.  The segmap_release() call needs ~28 milliseconds to complete, for every 8K
chunk of data.
The above dtrace reports the following nanosecond times for function entry/returns for
functions called from lofi_strategy_task(), relative to the entry time into
lofi_strategy_task():

  0  -> segmap_getmapflt      390212
  0  <- segmap_getmapflt      397780
  0  -> segmap_fault          400686
  0  <- segmap_fault          471288
  0  -> segmap_fault          483743
  0  <- segmap_fault          494756
  0  -> segmap_release        497002    <<<<<<<<<<<<<<<<
  0  <- segmap_release      28477764    <<<<<<<<<<<<<<<<
  0  -> segmap_getmapflt    28482182
  0  <- segmap_getmapflt    28489342
  0  -> segmap_fault        28491860
  0  <- segmap_fault        28556055
  0  -> segmap_fault        28568419
  0  <- segmap_fault        28579688
  0  -> segmap_release      28581889    <<<<<<<<<<<<<<<<
  0  <- segmap_release      57585125    <<<<<<<<<<<<<<<<
  0  -> segmap_getmapflt    57588452
  0  <- segmap_getmapflt    57593437
  0  -> segmap_fault        57595443
  0  <- segmap_fault        57653024
  0  -> segmap_fault        57666526
  0  <- segmap_fault        57675899
  0  -> segmap_release      57677756    <<<<<<<<<<<<<<<<
  0  <- segmap_release      86721131    <<<<<<<<<<<<<<<<
  0  -> segmap_getmapflt    86726222
  0  <- segmap_getmapflt    86732854
  0  -> segmap_fault        86735832
  0  <- segmap_fault        86812762
  0  -> segmap_fault        86825782
  0  <- segmap_fault        86838545
  0  -> segmap_release      86840904    <<<<<<<<<<<<<<<<
  0  <- segmap_release     115792547    <<<<<<<<<<<<<<<<
  0  -> segmap_getmapflt   115797538
  0  <- segmap_getmapflt   115804102
  0  -> segmap_fault       115806849

It seems as if zfs creates transactions for each segmap_release() call and waits until
the transaction is written to disk.  The call stack for these transactions would be
something like this:

  0  | cv_block:sleep
              genunix`cv_wait+0x27
              zfs`zio_wait+0x46
              zfs`zil_flush_vdevs+0x197
              zfs`zil_commit+0x2ff
              zfs`zfs_putapage+0x266
              zfs`zfs_putpage+0x1a2
              genunix`fop_putpage+0x21
              genunix`segmap_release+0x2b3
              lofi`lofi_strategy_task+0x1eb
              genunix`taskq_thread+0x16c
              unix`thread_start+0x8

  0  | cv_block:sleep
              genunix`cv_wait+0x27
              zfs`zil_commit+0x2da
              zfs`zfs_putapage+0x266
              zfs`zfs_putpage+0x1a2
              genunix`fop_putpage+0x21
              genunix`segmap_release+0x2b3
              lofi`lofi_strategy_task+0x1eb
              genunix`taskq_thread+0x16c
              unix`thread_start+0x8

  0  | cv_block:sleep
              genunix`cv_wait+0x27
              zfs`zio_wait+0x46
              zfs`zil_flush_vdevs+0x197
              zfs`zil_commit+0x2ff
              zfs`zfs_putapage+0x266
              zfs`zfs_putpage+0x1a2
              genunix`fop_putpage+0x21
              genunix`segmap_release+0x2b3
              lofi`lofi_strategy_task+0x1eb
              genunix`taskq_thread+0x16c

-------------------------------------------------------------------------------------------------------------------------------

Another, possibly related, problem: using mmap to write a 10 MByte file on a zfs
filesystem.  Again, this is *much* faster on a logging UFS filesystem than on ZFS.
/*
 * similar issue as "zfs-lofi", writing file using mmap is slow on zfs
 * (ufs is ~ 10x-40x faster)
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <fcntl.h>
#include <sys/mman.h>

int
main(int argc, char **argv)
{
	char *filename = "data";
	int size = 10*1024*1024;
	int fd;
	void *addr;

	if (argv[1])
		filename = argv[1];

	fd = open(filename, O_RDWR|O_CREAT, 0777);
	if (fd < 0) {
		perror(filename);
		exit(1);
	}

	if (lseek(fd, size-1, SEEK_SET) != size-1) {
		perror("lseek");
		exit(1);
	}
	if (write(fd, "", 1) < 0) {
		perror("write");
		exit(1);
	}

	addr = mmap(NULL, size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
	if (addr == (void*)-1) {
		perror("mmap");
		exit(1);
	}
	close(fd);

	memset(addr, 0, size);
	msync(addr, size, MS_SYNC);
	munmap(addr, size);
	exit(0);
}

-------------------------------------------------------------------------------------------------------------------------------

% time zfs-lofi-mapw /var/tmp/data      (<-- logging ufs)
0.02u 0.15s 0:00.57 29.8%

% time zfs-lofi-mapw /files/tmp/data    (<-- zfs)
0.02u 0.73s 0:23.36 3.2%

This message posted from opensolaris.org
Joerg Schilling
2006-Jan-25 16:50 UTC
[zfs-discuss] bad lofi performance with zfs file backend / bad mmap write performance
Jürgen Keil <jk at tools.de> wrote:

> Yesterday I tried to reproduce a pcfs mount problem on a current OpenSolaris release,
> when I noticed some bad performance problems with lofi devices attached to a file on a
> zfs filesystem.
>
> The performance issue can be reproduced like this:
> (I'm using snv_28, bfu'ed to b31 release bits, compiled from current OpenSolaris sources)
>
> - /files/tmp is a zfs (compressed or uncompressed) filesystem
>
> - mkfile -n 80g /files/tmp/pcfs.img
>
> - lofiadm -a /files/tmp/pcfs.img
>
> - fdisk /dev/rlofi/1
>   --> create 100% FAT32LBA partition, make it the active partition, write the MBR

Did you call "mkfile" as root and have the "sticky" bit set?

Jörg

-- 
EMail: joerg at schily.isdn.cs.tu-berlin.de (home)  Jörg Schilling  D-13353 Berlin
       js at cs.tu-berlin.de                 (uni)
       schilling at fokus.fraunhofer.de      (work)  Blog: http://schily.blogspot.com/
URL:   http://cdrecord.berlios.de/old/private/  ftp://ftp.berlios.de/pub/schily
Juergen Keil
2006-Jan-25 18:31 UTC
[zfs-discuss] bad lofi performance with zfs file backend / bad mmap write performance
> > The performance issue can be reproduced like this:
> > (I'm using snv_28, bfu'ed to b31 release bits, compiled from current OpenSolaris sources)
> >
> > - /files/tmp is a zfs (compressed or uncompressed) filesystem
> >
> > - mkfile -n 80g /files/tmp/pcfs.img

> Did you call "mkfile" as root and have the "sticky" bit set?

Yes, all commands were run as user root; the sticky bit is set.
Joerg Schilling
2006-Jan-26 10:14 UTC
[zfs-discuss] bad lofi performance with zfs file backend / bad mmap write performance
Juergen Keil <jk at tools.de> wrote:

> > > - mkfile -n 80g /files/tmp/pcfs.img
>
> > Did you call "mkfile" as root and have the "sticky" bit set?
>
> Yes, all commands were run as user root; the sticky bit is set.

Then it seems that either ZFS ignores the sticky bit (could someone from Sun comment
on this please?) or the caching strategy of ZFS is counterproductive in your case.

Jörg

-- 
EMail: joerg at schily.isdn.cs.tu-berlin.de (home)  Jörg Schilling  D-13353 Berlin
       js at cs.tu-berlin.de                 (uni)
       schilling at fokus.fraunhofer.de      (work)  Blog: http://schily.blogspot.com/
URL:   http://cdrecord.berlios.de/old/private/  ftp://ftp.berlios.de/pub/schily
Casper.Dik at Sun.COM
2006-Jan-26 10:19 UTC
[zfs-discuss] bad lofi performance with zfs file backend / bad mmap write performance
>Juergen Keil <jk at tools.de> wrote:
>
>> > > - mkfile -n 80g /files/tmp/pcfs.img
>>
>> > Did you call "mkfile" as root and have the "sticky" bit set?
>>
>> Yes, all commands were run as user root; the sticky bit is set.
>
>Then it seems that either ZFS ignores the sticky bit (could someone from Sun
>comment on this please?) or the caching strategy of ZFS is counterproductive in
>your case.

I've seen similar issues: building a Solaris miniroot on a tmpfs filesystem is many
orders of magnitude faster than on ZFS.

However:

     If the sticky bit is set on a regular file and no execute
     bits are set, the system's page cache will not be used to
     hold the file's data.  This bit is normally set on swap files
     of diskless clients so that accesses to these files do not
     flush more valuable data from the system's cache.  Moreover,
     by default such files are treated as swap files, whose inode
     modification times may not necessarily be correctly recorded
     on permanent storage.

so it may be that you do not want the sticky bit set on such files when creating a
filesystem in them, as it will cause the data not to be cached.

Casper
Juergen Keil
2006-Jan-26 13:00 UTC
[zfs-discuss] bad lofi performance with zfs file backend / bad mmap write performance
> I've seen similar issues: building a Solaris miniroot on a tmpfs
> filesystem is many orders of magnitude faster than on ZFS.
>
> However:
>
>      If the sticky bit is set on a regular file and no execute
>      bits are set, the system's page cache will not be used to
>      hold the file's data.  This bit is normally set on swap files
>      of diskless clients so that accesses to these files do not
>      flush more valuable data from the system's cache.  Moreover,
>      by default such files are treated as swap files, whose inode
>      modification times may not necessarily be correctly recorded
>      on permanent storage.
>
> so it may be that you do not want the sticky bit set on such files
> when creating a filesystem in them, as it will cause the data
> not to be cached.

Repeating the test procedure with a "chmod 600 ..." thrown in after the mkfile (to
clear the sticky bit) shows no difference.  It's still slow on zfs.

Part of the reason why this test is slow on zfs appears to be this:

lofi releases modified pages with:

http://cvs.opensolaris.org/source/xref/on/usr/src/uts/common/io/lofi.c#362

	segmap_release(segkmap, mapaddr, SM_WRITE);

(= SYNC WRITE), so that zfs_putpage is called with flags == 0.  zfs_putpage creates a
transaction, and waits for the transaction to be written to disk.

http://cvs.opensolaris.org/source/xref/on/usr/src/uts/common/fs/zfs/zfs_vnops.c#2796

If lofi would use:

	segmap_release(segkmap, mapaddr, SM_WRITE|SM_ASYNC);

the "mkfs -F pcfs" would run *much* faster on zfs.  With SM_WRITE|SM_ASYNC, zfs_putpage
is called with flags = 0x400 (B_ASYNC), and zfs wouldn't *wait* for the new transaction
for every "8K putpage" to be flushed to disk.

In my b31 compiled release kernel, when I patch the "pushl $0x1" (SM_WRITE) at
lofi_strategy_task+0x1dd ...

	lofi_strategy_task+0x1dd:  pushl  $0x1
	lofi_strategy_task+0x1df:  pushl  %esi
	lofi_strategy_task+0x1e0:  pushl  0xfec00b8c
	lofi_strategy_task+0x1e6:  call   +0x695c3af   <segmap_release>

... to a "pushl $0x3" (SM_WRITE|SM_ASYNC), the 10 MByte writes in
"mkfs -F pcfs ... /dev/rlofi/1:c" complete in less than 0.5 seconds.
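At the source level, the binary patch above corresponds to roughly the following change
in lofi_strategy_task() (a sketch only, not a tested diff, using the variable names
from the code excerpt quoted later in this thread; the error-propagation concern with
SM_ASYNC is discussed in the follow-ups):

	/*
	 * Sketch: release the segmap mapping for each 8K write chunk
	 * asynchronously, so zfs_putpage() sees B_ASYNC and does not
	 * wait for a ZIL commit per chunk.
	 */
	smflags = 0;
	if (isread) {
		bcopy(mapaddr + mapoffset, bufaddr, xfersize);
	} else {
		/* was: smflags |= SM_WRITE; */
		smflags |= SM_WRITE | SM_ASYNC;
		bcopy(bufaddr, mapaddr + mapoffset, xfersize);
	}
	error = segmap_release(segkmap, mapaddr, smflags);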
Casper.Dik at Sun.COM
2006-Jan-26 13:49 UTC
[zfs-discuss] bad lofi performance with zfs file backend / bad mmap write performance
>Part of the reason why this test is slow on zfs appears to be this:
>
>lofi releases modified pages with:
>
>http://cvs.opensolaris.org/source/xref/on/usr/src/uts/common/io/lofi.c#362
>
>	segmap_release(segkmap, mapaddr, SM_WRITE);
>
>(= SYNC WRITE), so that zfs_putpage is called with flags == 0.  zfs_putpage
>creates a transaction, and waits for the transaction to be written to disk.
>
>http://cvs.opensolaris.org/source/xref/on/usr/src/uts/common/fs/zfs/zfs_vnops.c#2796
>
>If lofi would use:
>
>	segmap_release(segkmap, mapaddr, SM_WRITE|SM_ASYNC);

It is unclear to me that this is always the proper behaviour; the strategy routine is
assumed to be synchronous unless the "B_ASYNC" flag is passed.

So perhaps the fix should take that into account?
(And then the gains may evaporate)

Also, "ASYNC" will limit upward error propagation; that may lead to panics more
quickly when write errors in, e.g., UFS filesystems are ignored.

Casper
Nicolas Williams
2006-Jan-26 16:26 UTC
[zfs-discuss] bad lofi performance with zfs file backend / bad mmap write performance
On Thu, Jan 26, 2006 at 02:49:49PM +0100, Casper.Dik at Sun.COM wrote:
> >If lofi would use:
> >
> >	segmap_release(segkmap, mapaddr, SM_WRITE|SM_ASYNC);
>
> It is unclear to me that this is always the proper behaviour;
> the strategy routine is assumed to be synchronous unless the
> "B_ASYNC" flag is passed.
>
> So perhaps the fix should take that into account?
> (And then the gains may evaporate)

It might be an option, so mkfs can go fast, but you really could not mount an async
lofi:

> Also, "ASYNC" will limit upward error propagation; that may lead
> to panics more quickly when write errors in, e.g., UFS filesystems are
> ignored.

Nico
-- 
Casper.Dik at Sun.COM
2006-Jan-26 16:38 UTC
[zfs-discuss] bad lofi performance with zfs file backend / bad mmap write performance
>On Thu, Jan 26, 2006 at 02:49:49PM +0100, Casper.Dik at Sun.COM wrote:
>> >If lofi would use:
>> >
>> >	segmap_release(segkmap, mapaddr, SM_WRITE|SM_ASYNC);
>>
>> It is unclear to me that this is always the proper behaviour;
>> the strategy routine is assumed to be synchronous unless the
>> "B_ASYNC" flag is passed.
>>
>> So perhaps the fix should take that into account?
>> (And then the gains may evaporate)
>
>It might be an option, so mkfs can go fast, but you really could not
>mount an async lofi:
>
>> Also, "ASYNC" will limit upward error propagation; that may lead
>> to panics more quickly when write errors in, e.g., UFS filesystems are
>> ignored.

If the strategy routine requests async writes then there's no reason not to release
the segmap with async.  Don't know how often that would be.

Casper
Jürgen Keil
2006-Jan-27 11:27 UTC
[zfs-discuss] Re: bad lofi performance with zfs file backend / bad mmap write performan
> >Part of the reason why this test is slow on zfs appears to be this:
> >
> >lofi releases modified pages with:
> >
> >	segmap_release(segkmap, mapaddr, SM_WRITE);
...
> >If lofi would use:
> >
> >	segmap_release(segkmap, mapaddr, SM_WRITE|SM_ASYNC);

> Also, "ASYNC" will limit upward error propagation;

Yes, changing lofi to use SM_ASYNC appears to be a bad idea: it'll skip a biowait()
when used with a file backend in a UFS filesystem, so that we'll miss error return
values.

==

Another thing that I've noticed is that a lot of the Solaris filesystem modules use
"pvn_write_kluster()" somewhere inside the "VOP_PUTPAGE()" implementation, apparently
to replace single-page writes with bigger writes that include dirty pages immediately
before and after the page.

e.g. UFS, in ufs_putapage():
http://cvs.opensolaris.org/source/xref/on/usr/src/uts/common/fs/ufs/ufs_vnops.c#5236

or TMPFS, in tmp_putapage():
http://cvs.opensolaris.org/source/xref/on/usr/src/uts/common/fs/tmpfs/tmp_vnops.c#1912

zfs doesn't appear to implement "klustered page writes" inside zfs_putpage().

Shouldn't zfs try to write klusters of pages, instead of single pages (see the sketch
below)?  That might also help with the mmap() write test program that I included in
the initial message (which doesn't use lofi at all).

Btw. the performance problem that is shown by the mmap() write test program can also
be observed with the linker /usr/ccs/bin/ld, which appears to use mmap() to write its
output file.  Building a big C++ shared library (19 MBytes, libQtGui_debug.so.4.0.1)
with the output file on UFS is fast, and is *much* slower when the output file is
written to a ZFS filesystem.  tcsh's time function reports:

shlib output file on UFS, object input files from ZFS (I repeated the link 3x):
1.50u 0.75s 0:04.11 54.7%
1.51u 0.77s 0:03.91 58.3%
1.51u 0.76s 0:04.39 51.7%

output file on ZFS, object input files from ZFS (repeated 3x):
1.51u 0.93s 0:31.22 7.8%
1.54u 0.97s 0:38.39 6.5%
1.52u 0.95s 0:36.30 6.8%

This message posted from opensolaris.org
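For reference, the klustering idiom used by ufs_putapage() and tmp_putapage() looks
roughly like the sketch below.  This is not ZFS code and not a proposed patch;
fs_blocksize() and fs_write_range() are placeholder names for whatever a filesystem
would use to size the kluster window and issue the combined write.

	static int
	fs_putapage(vnode_t *vp, page_t *pp, u_offset_t *offp, size_t *lenp,
	    int flags, cred_t *cr)
	{
		u_offset_t	io_off, kluster_off;
		size_t		io_len, kluster_len;
		int		err;

		/*
		 * Kluster window: e.g. one filesystem block around the page
		 * (window assumed to be a power of two here).
		 */
		kluster_len = MAX(PAGESIZE, fs_blocksize(vp));	/* placeholder */
		kluster_off = pp->p_offset & ~((u_offset_t)kluster_len - 1);

		/*
		 * Gather adjacent dirty pages within the window into one page
		 * list; io_off/io_len come back describing the combined range.
		 */
		pp = pvn_write_kluster(vp, pp, &io_off, &io_len,
		    kluster_off, kluster_len, flags);

		/* One larger write instead of a separate write per page. */
		err = fs_write_range(vp, pp, io_off, io_len, flags, cr);  /* placeholder */

		/* Unlock the pages and mark the I/O done (or in flight, if async). */
		pvn_write_done(pp, ((err) ? B_ERROR : 0) | B_WRITE | flags);

		if (offp != NULL)
			*offp = io_off;
		if (lenp != NULL)
			*lenp = io_len;
		return (err);
	}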
Jürgen Keil
2006-Jan-30 12:38 UTC
[zfs-discuss] Re: bad lofi performance with zfs file backend / bad mmap write performan
I wrote:

> Part of the reason why this test is slow on zfs appears to be this:
>
> lofi releases modified pages with:
>
> http://cvs.opensolaris.org/source/xref/on/usr/src/uts/common/io/lofi.c#362
>
>	segmap_release(segkmap, mapaddr, SM_WRITE);
>
> (= SYNC WRITE), so that zfs_putpage is called with flags == 0.  zfs_putpage
> creates a transaction, and waits for the transaction to be written to disk.
>
> http://cvs.opensolaris.org/source/xref/on/usr/src/uts/common/fs/zfs/zfs_vnops.c#2796

After some more dtracing, I found out that the performance killers are the frequent
DKIOCFLUSHWRITECACHE ioctls sent to the notebook's 4200 rpm P-ATA HDD.

This happens in zil_flush_vdevs():

http://cvs.opensolaris.org/source/xref/on/usr/src/uts/common/fs/zfs/zil.c#zil_flush_vdevs

Although the I/O write requests are for sequential blocks, flushing the disk's write
cache after every chunk of data written results in a data rate of 8K (or 4K?) written
per disk revolution.

The attached dtrace script produces output like this (note, this time I'm using a
7200 rpm P-ATA drive; it needs ~8 milliseconds per DKIOCFLUSHWRITECACHE ioctl during
this test):

  0   9311          bdev_strategy:entry  edev 1980047, flags 1080101, bn 3c34d00, count 2000
  0  34086  vdev_disk_ioctl_done:entry  DKIOCFLUSHWRITECACHE time: 7890593 nsec, error 0
  0   9311          bdev_strategy:entry  edev 1980047, flags 1080101, bn 3c34d10, count 2000
  0  34086  vdev_disk_ioctl_done:entry  DKIOCFLUSHWRITECACHE time: 7872219 nsec, error 0
  0   9311          bdev_strategy:entry  edev 1980047, flags 1080101, bn 3c34d20, count 2000
  0  34086  vdev_disk_ioctl_done:entry  DKIOCFLUSHWRITECACHE time: 7698369 nsec, error 0
  0   9311          bdev_strategy:entry  edev 1980047, flags 1080101, bn 3c34d30, count 2000
  0  34086  vdev_disk_ioctl_done:entry  DKIOCFLUSHWRITECACHE time: 7575891 nsec, error 0
  0   9311          bdev_strategy:entry  edev 1980047, flags 1080101, bn 3c34d40, count 2000
  0  34086  vdev_disk_ioctl_done:entry  DKIOCFLUSHWRITECACHE time: 7907461 nsec, error 0
  0   9311          bdev_strategy:entry  edev 1980047, flags 1080101, bn 3c34d50, count 2000
  0  34086  vdev_disk_ioctl_done:entry  DKIOCFLUSHWRITECACHE time: 7792415 nsec, error 0
  0   9311          bdev_strategy:entry  edev 1980047, flags 1080101, bn 3c34d60, count 2000
  0  34086  vdev_disk_ioctl_done:entry  DKIOCFLUSHWRITECACHE time: 8039159 nsec, error 0
  0   9311          bdev_strategy:entry  edev 1980047, flags 1080101, bn 3c34d70, count 2000
  0  34086  vdev_disk_ioctl_done:entry  DKIOCFLUSHWRITECACHE time: 7537145 nsec, error 0

On disk devices that do not support DKIOCFLUSHWRITECACHE (USB, Firewire), lofi and
mmap write performance on zfs is much better.

On USB, I saw DKIOCFLUSHWRITECACHE failing with errno 48 (ENOTSUP); the sd driver has
the flag "un_f_sync_cache_unsupported" set.

And on Firewire, DKIOCFLUSHWRITECACHE is accepted without error, but finishes in ~6
microseconds.  On Firewire, sd thinks that the device doesn't have its write cache
enabled (un_f_sync_cache_unsupported = 0, un_f_write_cache_enabled = 0), so that
DKIOCFLUSHWRITECACHE turns into a no-op.

This message posted from opensolaris.org
-------------- next part --------------
A non-text attachment was scrubbed...
Name: lofi8.d
Type: application/octet-stream
Size: 580 bytes
Desc: not available
URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20060130/f82279e2/attachment.obj>
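The lofi8.d attachment did not survive the archive.  A D script along the following
lines would produce output of the shape shown above; this is an approximate
reconstruction, not the original script (which also printed the block number), it
assumes the flushes are issued one at a time, and the DKIOCFLUSHWRITECACHE value is
hard-coded from sys/dkio.h.

#!/usr/sbin/dtrace -qs

/*
 * Approximate reconstruction of lofi8.d: print each block write queued
 * via bdev_strategy(), and time every write-cache flush from the
 * ldi_ioctl() call in the ZFS vdev code until its completion callback
 * vdev_disk_ioctl_done() runs.
 */

fbt::bdev_strategy:entry
{
	printf("bdev_strategy: edev %x, flags %x, count %x\n",
	    args[0]->b_edev, args[0]->b_flags, args[0]->b_bcount);
}

/* DKIOCFLUSHWRITECACHE == (0x04 << 8) | 34 == 0x422, from sys/dkio.h */
fbt::ldi_ioctl:entry
/arg1 == 0x422/
{
	flush_start = timestamp;
}

fbt::vdev_disk_ioctl_done:entry
/flush_start != 0/
{
	printf("DKIOCFLUSHWRITECACHE time: %d nsec, error %d\n",
	    timestamp - flush_start, arg1);
	flush_start = 0;
}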
Bill Moore
2006-Jan-30 16:52 UTC
[zfs-discuss] Re: bad lofi performance with zfs file backend / bad mmap write performan
On Mon, Jan 30, 2006 at 04:38:31AM -0800, Jürgen Keil wrote:
> After some more dtracing, I found out that the performance killers are
> the frequent DKIOCFLUSHWRITECACHE ioctls sent to the notebook's 4200
> rpm P-ATA HDD.
>
> This happens in zil_flush_vdevs():
>
> http://cvs.opensolaris.org/source/xref/on/usr/src/uts/common/fs/zfs/zil.c#zil_flush_vdevs
>
> Although the I/O write requests are for sequential blocks, flushing
> the disk's write cache after every chunk of data written results in a
> data rate of 8K (or 4K?) written per disk revolution.

I don't think the ioctl() is doing what you think it's doing.  The naming is a bit
unfortunate, being copied from the ATA command we send to the disk.  It doesn't really
flush the write cache, it is more like a write barrier.  The spec says that the
command will not complete until all previous writes have actually been written to the
drive's media.  The data in the drive's cache is preserved, not flushed.

> The attached dtrace script produces output like this (note, this time
> I'm using a 7200 rpm P-ATA drive; it needs ~8 milliseconds per
> DKIOCFLUSHWRITECACHE ioctl during this test):

That's not unreasonable.  That's really saying that an I/O to your disk takes about
8ms, which is about right for a laptop drive, though I'd think that a 7200rpm disk
should be a bit quicker.  Perhaps there are multiple cached writes that need to be
written to media.

> On disk devices that do not support DKIOCFLUSHWRITECACHE (USB,
> Firewire), lofi and mmap write performance on zfs is much better.

The problem is that there are some very thin windows where the drive's write caching
can mess with on-disk consistency during the final phases of a transaction group
commit if a power outage were to happen at the same time.  Not terribly likely, but
from a correctness point of view, the window is either open or closed.  We're working
on some changes to further mitigate this risk, but any drive containing important
data should really support this command.

> And on Firewire, DKIOCFLUSHWRITECACHE is accepted without error, but
> finishes in ~6 microseconds.  On Firewire, sd thinks that the device
> doesn't have its write cache enabled (un_f_sync_cache_unsupported = 0,
> un_f_write_cache_enabled = 0), so that DKIOCFLUSHWRITECACHE turns
> into a no-op.

An interesting question would be, is the drive's write cache *really* disabled?  If
you run "format -e" on the drive, and turn on the write cache, can you see any
difference in performance?  I've found that on ATA disks without NCQ, write caching
is necessary for non-sucky performance.  On SCSI devices with tagged queuing, turning
on the write cache helps, but not to the same degree.

BTW, thanks for the investigation and analysis - this is great stuff.

--Bill
Jürgen Keil
2006-Jan-31 10:39 UTC
[zfs-discuss] Re: Re: bad lofi performance with zfs file backend / bad mmap write perfo
> On Mon, Jan 30, 2006 at 04:38:31AM -0800, Jürgen Keil wrote:
> > After some more dtracing, I found out that the performance killers are
> > the frequent DKIOCFLUSHWRITECACHE ioctls sent to the notebook's 4200
> > rpm P-ATA HDD.
> >
> > Although the I/O write requests are for sequential blocks, flushing
> > the disk's write cache after every chunk of data written results in a
> > data rate of 8K (or 4K?) written per disk revolution.
>
> I don't think the ioctl() is doing what you think it's doing.  The
> naming is a bit unfortunate, being copied from the ATA command we send
> to the disk.

The ATA "FLUSH CACHE" command, command code 0xE7?  (I can't check the
DKIOCFLUSHWRITECACHE implementation for the ata driver, because "ata" is closed
source.)

> It doesn't really flush the write cache, it is more like a
> write barrier.  The spec says that the command will not complete until
> all previous writes have actually been written to the drive's media.
> The data in the drive's cache is preserved, not flushed.

Yep, and that has exactly been my understanding of what happens when
DKIOCFLUSHWRITECACHE is sent to an ATA device.

I think what happens is something like this:

1. 8K (16 sectors) of data is written to sectors x ... x+15.
   The HDD puts the data into its write cache; the data is not yet written to disk.

2. A "FLUSH CACHE" command is submitted to the HDD.  The HDD writes the sectors
   x ... x+15 from its cache to the disk.  When the "FLUSH CACHE" command completes,
   we're after the end of sector x+15; probably sector x+16 already passes under the
   HDD's read/write head.

3. zfs sends the next chunk of data, 8K for sectors x+16 ... x+31.  The HDD stores
   this data in its write cache.  The disk keeps spinning.  (Maybe sector x+17 or one
   of the following sectors is currently passing under the read/write head.)

4. Another "FLUSH CACHE" command is submitted to the HDD.  We've missed the address
   field for sector x+16; to write the modified contents for sector x+16 from the
   disk's write cache, the HDD has to wait a complete revolution of the disk until it
   can write sector x+16.

This is where the huge DKIOCFLUSHWRITECACHE delay is coming from.

> > The attached dtrace script produces output like this (note, this time
> > I'm using a 7200 rpm P-ATA drive; it needs ~8 milliseconds per
> > DKIOCFLUSHWRITECACHE ioctl during this test):
>
> That's not unreasonable.  That's really saying that an I/O to your disk
> takes about 8ms, which is about right for a laptop drive, though I'd
> think that a 7200rpm disk should be a bit quicker.

Note that I've switched computer and disks here:

- laptop, 4200 RPM drive:  60 seconds / 4200 RPM = 14.3 milliseconds
- desktop, 7200 RPM drive: 60 seconds / 7200 RPM =  8.3 milliseconds

> Perhaps there are multiple cached writes that need to be written to media.

No, I don't think so.  I've traced all bdev_strategy calls, and all I can see is a
single 8K (sequential) write request between the DKIOCFLUSHWRITECACHE ioctls.

> > On disk devices that do not support DKIOCFLUSHWRITECACHE (USB,
> > Firewire), lofi and mmap write performance on zfs is much better.
>
> The problem is that there are some very thin windows where the drive's
> write caching can mess with on-disk consistency during the final phases
> of a transaction group commit if a power outage were to happen at the
> same time.  Not terribly likely, but from a correctness point of view,
> the window is either open or closed.  We're working on some changes to
> further mitigate this risk, but any drive containing important data
> should really support this command.

Ok.  So one has to be careful when using USB or Firewire devices with zfs, because I
doubt that these devices have a working "flush disk write cache" command or have
their write cache completely disabled.

> > And on Firewire, DKIOCFLUSHWRITECACHE is accepted without error, but
> > finishes in ~6 microseconds.  On Firewire, sd thinks that the device
> > doesn't have its write cache enabled (un_f_sync_cache_unsupported = 0,
> > un_f_write_cache_enabled = 0), so that DKIOCFLUSHWRITECACHE turns
> > into a no-op.
>
> An interesting question would be, is the drive's write cache *really*
> disabled?

The HDD that I've used with the USB / Firewire test is a standard IDE P-ATA drive
(250 GB Hitachi), behind a USB<->IDE bridge (or Firewire<->IDE bridge).  I don't know
for sure, but I'm assuming the ATA drive is sold with its write cache enabled.

> If you run "format -e" on the drive, and turn on the write
> cache, can you see any difference in performance?

This assumes the SCSI "MODE SELECT" / "MODE SENSE" commands used for manipulating the
disk's write cache setting are supported by the (USB / Firewire)<->IDE bridge chip.
Somehow I doubt that.  I'll double check that.

> I've found that on ATA disks without NCQ, write caching is necessary for
> non-sucky performance.  On SCSI devices with tagged queuing,
> turning on the write cache helps, but not to the same degree.

Hmm, disabling the ATA disk's write cache might be an interesting experiment, because
on a disk with a disabled write cache the DKIOCFLUSHWRITECACHE ioctl turns into a nop.
Unfortunately there is no Solaris support for changing an ATA disk's write cache
setting in format -e.

This message posted from opensolaris.org
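A rough consistency check of the one-chunk-per-revolution theory, using only the
figures already quoted in this thread (whether a flush happens per 4K page or per 8K
chunk is not pinned down above, so both cases are shown):

    4200 rpm = 70 revolutions/sec
    one 8K chunk per revolution:  8192 * 70 = ~573 KB/s
    one 4K page  per revolution:  4096 * 70 = ~287 KB/s

    observed in the first message (4200 rpm laptop drive):
    10484224 / 41.5 s = ~252 KB/s

so the measured rate is in the same ballpark as writing one small chunk per disk
revolution.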
Jürgen Keil
2006-Feb-01 18:11 UTC
[zfs-discuss] Re: Re: bad lofi performance with zfs file backend / bad mmap write perfo
> > And on Firewire, DKIOCFLUSHWRITECACHE is accepted without error, but
> > finishes in ~6 microseconds.  On Firewire, sd thinks that the device
> > doesn't have its write cache enabled (un_f_sync_cache_unsupported = 0,
> > un_f_write_cache_enabled = 0), so that DKIOCFLUSHWRITECACHE turns
> > into a no-op.
>
> An interesting question would be, is the drive's write cache *really*
> disabled?

It isn't disabled; I removed the IDE HDD from the external Firewire enclosure and
installed it directly on the P-ATA bus.  Printing the "struct dadk" for the disk shows
that ata has found the write cache to be enabled:

    dad_noflush = 0
    dad_wce = 0x1

> If you run "format -e" on the drive, and turn on the write
> cache, can you see any difference in performance?

The SCSI MODE SENSE command to query the write caching parameter page fails on the
firewire device; it's impossible to query or change the write caching parameters on
this firewire device through format -e:

# format -eM
Searching for disks...

AVAILABLE DISK SELECTIONS:
       0. c0d0 <DEFAULT cyl 24379 alt 2 hd 64 sec 63>
          /pci@0,0/pci-ide@1f,1/ide@0/cmdk@0,0
       1. c2t0d0 <drive type unknown>
          /pci@0,0/pci1179,ff10@1d,7/storage@1/disk@0,0
       2. c4t0d0 <DEFAULT cyl 59536 alt 2 hd 16 sec 63>
          /pci@0,0/pci8086,2448@1e/pci1179,ff10@9,1/pci574,86c@0/unit@0001a30000063ee8/disk@0,0
Specify disk (enter its number): 2
selecting c4t0d0
[disk formatted]
...

WRITE_CACHE MENU:
        display    - display current setting of write cache
        enable     - enable write cache
        disable    - disable write cache
        !<cmd>     - execute <cmd>, then return
        quit

write_cache> display

Mode sense page 0x8: incorrect page code 0x0
Mode sense: 03 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
            00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
            00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
            00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
            00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
            00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
            00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
            00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
            00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
            00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
            00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
            00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
            00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
            00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
            00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
            00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Mode sense failed.

write_cache> enable

Mode sense page 0x8: incorrect page code 0x0
Mode sense: 03 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
            00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
            00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
            00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
            00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
            00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
            00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
            00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
            00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
            00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
            00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
            00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
            00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
            00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
            00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
            00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

This firewire device has an "interesting" implementation of the MODE SENSE command
(for an ATA disk installed behind the Firewire<->IDE bridge): it seems the firewire
device returns a block of zero bytes (according to the MODE SENSE command's requested
data buffer length), with the exception of the first byte, which is set to 3.  So we
get back a minimum "struct mode_header" of four bytes, and no block_descriptor and no
mode page data.

format is clever enough to check for this behaviour, and detects that the device
hasn't returned valid page 8 mode sense data when the device was asked to return page
8 data.

The sd driver's DKIOCFLUSHWRITECACHE implementation in sd_get_write_cache_enabled()
isn't that clever: it doesn't verify that the device has returned the expected caching
page #8 mode page data.  It simply checks the mode_caching_page->wce bit, which isn't
set because the firewire device is returning 0 bytes, so that sd believes that this
firewire device has its write cache disabled.

sd's sd_get_write_cache_enabled() could do a better job validating the mode sense data
returned by the device.

This message posted from opensolaris.org
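A minimal sketch of the kind of validation sd_get_write_cache_enabled() could add
before trusting the WCE bit.  Struct and constant names follow sys/scsi/generic/mode.h;
the surrounding sd code is elided, and check_write_cache_enabled() is a hypothetical
helper, not an existing sd function.

	#include <sys/types.h>
	#include <sys/errno.h>
	#include <sys/scsi/generic/mode.h>

	/*
	 * Given the raw MODE SENSE data in buf/buflen, set *enabled from the
	 * WCE bit only if the device really returned the caching page.
	 */
	static int
	check_write_cache_enabled(uchar_t *buf, size_t buflen, int *enabled)
	{
		struct mode_header	*hdr = (struct mode_header *)buf;
		struct mode_caching	*cpage;
		size_t			need;

		if (buflen < sizeof (struct mode_header))
			return (EIO);

		need = sizeof (struct mode_header) + hdr->bdesc_length +
		    sizeof (struct mode_caching);
		if (buflen < need)
			return (EIO);	/* too short to contain the caching page */

		cpage = (struct mode_caching *)
		    (buf + sizeof (struct mode_header) + hdr->bdesc_length);

		/* The firewire bridge above returns no page data at all here. */
		if (cpage->mode_page.code != MODEPAGE_CACHING)
			return (EIO);	/* some other (or no) page was returned */

		*enabled = (cpage->wce != 0);
		return (0);
	}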
Richard Elling
2006-Feb-01 20:25 UTC
[zfs-discuss] Re: Re: bad lofi performance with zfs file backend / bad mmap write perfo
AFAIK, write cache disable is a nop on some or many IDE disks.  You might need an
empirical test to verify.

 -- richard

This message posted from opensolaris.org
Joerg Schilling
2006-Feb-02 10:07 UTC
[zfs-discuss] Re: Re: bad lofi performance with zfs file backend / bad mmap write perfo
Jürgen Keil <jk at tools.de> wrote:

> WRITE_CACHE MENU:
>         display    - display current setting of write cache
>         enable     - enable write cache
>         disable    - disable write cache
>         !<cmd>     - execute <cmd>, then return
>         quit
>
> write_cache> display
>
> Mode sense page 0x8: incorrect page code 0x0
> Mode sense: 03 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>             00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>             00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>             00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>             00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>             00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>             00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>             00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>             00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>             00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>             00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>             00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>             00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>             00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>             00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>             00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> Mode sense failed.
>
> write_cache> enable

...

> This firewire device has an "interesting" implementation of the MODE SENSE
> command (for an ATA disk installed behind the Firewire<->IDE bridge): it seems
> the firewire device returns a block of zero bytes (according to the MODE SENSE
> command's requested data buffer length), with the exception of the first byte,
> which is set to 3.

Mode sense is the most misunderstood part of SCSI :-(

See the mode page related code in cdrecord.  What you observe is one of many
well-known implementation bugs.

Jörg

-- 
EMail: joerg at schily.isdn.cs.tu-berlin.de (home)  Jörg Schilling  D-13353 Berlin
       js at cs.tu-berlin.de                 (uni)
       schilling at fokus.fraunhofer.de      (work)  Blog: http://schily.blogspot.com/
URL:   http://cdrecord.berlios.de/old/private/  ftp://ftp.berlios.de/pub/schily
Jeff Bonwick
2006-Feb-07 08:33 UTC
[zfs-discuss] bad lofi performance with zfs file backend / bad mmap write performance
> If lofi would use:
>
>	segmap_release(segkmap, mapaddr, SM_WRITE|SM_ASYNC);
>
> [ stuff would go faster ]

This seems reasonable -- certainly if the incoming I/O is B_ASYNC.  Looking at the
lofi code a little deeper, it appears that this is a sin of omission.  Here's the
surrounding code and commentary, streamlined to illustrate the point:

	/*
	 * We used to always use vn_rdwr here, but we cannot do that because
	 * we might decide to read or write from the underlying
	 * file during this call, which would be a deadlock because
	 * we have the rw_lock.  So instead we page, unless it's not
	 * mapable or it's a character device.
	 */
	if (we should use segmap) {
		smflags = 0;
		if (isread) {
			bcopy(mapaddr + mapoffset, bufaddr, xfersize);
		} else {
			smflags |= SM_WRITE;
			bcopy(bufaddr, mapaddr + mapoffset, xfersize);
		}
		error = segmap_release(segkmap, mapaddr, smflags);
	} else {
		error = vn_rdwr(rw, lsp->ls_vp, bufaddr, bp->b_bcount,
		    offset, UIO_SYSSPACE, 0, RLIM64_INFINITY, kcred, &resid);
	}

Notice, in the 'else' clause, that the argument after UIO_SYSSPACE is 0; that's the
ioflag argument, which should be 'FSYNC' or 'FDSYNC' if the original I/O request was
not B_ASYNC.  In other words, lofi was always broken for synchronous I/O.  The new
segmap-based code makes the opposite mistake: it treats *everything* as synchronous
I/O.  In both cases, it is the absence of a non-zero flag that makes the bug so easy
to write.

It seems to me that the logic we really want here is something like this:

	if (using segmap) {
		if (bp->b_flags & B_ASYNC)
			smflags |= SM_ASYNC;
	} else {
		error = vn_rdwr(..., bp->b_flags & B_ASYNC ? 0 : FDSYNC, ...);
	}

Alternatively, because segmap operates on 8k chunks and there's no good reason to
issue back-to-back synchronous I/O, we could make all of the I/O async and follow it
with a VOP_FSYNC() when necessary.

Mike (Sullivan, that is), what do you think?

Jeff
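A rough sketch of that second alternative (an illustration only, not Jeff's code; it
reuses the variable names from the excerpt above and the three-argument VOP_FSYNC()
form of that era):

	/* Per chunk: release every 8K mapping asynchronously. */
	smflags = 0;
	if (isread) {
		bcopy(mapaddr + mapoffset, bufaddr, xfersize);
	} else {
		smflags |= SM_WRITE | SM_ASYNC;
		bcopy(bufaddr, mapaddr + mapoffset, xfersize);
	}
	error = segmap_release(segkmap, mapaddr, smflags);

	/*
	 * Once per buf, after the last chunk has been released: make the
	 * data durable only if the original request was synchronous.
	 */
	if (error == 0 && !isread && !(bp->b_flags & B_ASYNC))
		error = VOP_FSYNC(lsp->ls_vp, FSYNC, kcred);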
Mike Sullivan
2006-Feb-07 19:14 UTC
[zfs-discuss] bad lofi performance with zfs file backend / bad mmap write performance
Jeff Bonwick wrote:
>> If lofi would use:
>>
>>	segmap_release(segkmap, mapaddr, SM_WRITE|SM_ASYNC);
>>
>> [ stuff would go faster ]
>
> This seems reasonable -- certainly if the incoming I/O is B_ASYNC.
> Looking at the lofi code a little deeper, it appears that this is
> a sin of omission.  Here's the surrounding code and commentary,
> streamlined to illustrate the point:

> Alternatively, because segmap operates on 8k chunks and there's no good
> reason to issue back-to-back synchronous I/O, we could make all of the
> I/O async and follow it with a VOP_FSYNC() when necessary.
>
> Mike (Sullivan, that is), what do you think?

I think I always hated that section of code, and at the time certainly wasn't too
worried about performance since that wasn't the point of lofi :)

If it can be made faster/better/simpler go for it :)

Mike