Jeff Bonwick
2006-Feb-03 08:59 UTC
[zfs-discuss] Re: Re: bad lofi performance with zfs file backend / bad mmap write perfo
> > An interesting question would be, is the drive's write cache *really*
> > disabled?
>
> It isn't disabled;

A few of us discussed this today. The write cache is indeed enabled,
so for correctness it needs to be flushed before returning from any
synchronous I/O. The reason you're seeing such a large difference in
performance between ufs and zfs is that zfs does the flush, while ufs
(incorrectly) does not.

Of course, the real problem is that lofi is generating synchronous I/O.
That could be fixed, but it's not as simple as it sounds, because
segmap_release() doesn't have an async callback mechanism.

But lofi atop a zfs file is not an efficient way to get what you want.
To get device/volume semantics using space in a zfs storage pool,
you're much better off using zvol (see zfs create -V). This will
create a volume with whatever size and blocksize you want, consuming
space from the pool just like a zfs filesystem.

Quick example: if your pool is named tank, and you want to create a
10G volume named "neo" using 8k blocks, say this:

    zfs create -V 10g -b 8k tank/neo

The volume will appear in /dev/zvol/{dsk,rdsk}/tank/neo. Like any
other zfs dataset, zfs volumes are checksummed, you can take snapshots
of them, and so on. To ensure that there's enough space in the pool to
write to any block in the volume, it will be created with (in this
example) a 10g reservation against tank. If you want a sparse volume
(aka thin provisioning) with no reservation, use -s.

Jeff
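In concrete terms, a zvol session might look like the sketch below,
assuming a pool named tank already exists; the dataset names are
illustrative:

    # Create a 10G volume with an 8k blocksize, as in the example above.
    zfs create -V 10g -b 8k tank/neo

    # The block and character device nodes appear under /dev/zvol:
    ls -l /dev/zvol/dsk/tank/neo /dev/zvol/rdsk/tank/neo

    # Sparse (thin-provisioned) variant: no reservation is charged to tank.
    zfs create -s -V 10g -b 8k tank/neo2

    # zvols are ordinary datasets, so snapshots work as usual.
    zfs snapshot tank/neo@today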
Darren J Moffat
2006-Feb-03 09:26 UTC
[zfs-discuss] Re: Re: bad lofi performance with zfs file backend / bad mmap write perfo
On Fri, 2006-02-03 at 08:59, Jeff Bonwick wrote:
> But lofi atop a zfs file is not an efficient way to get what you want.
> To get device/volume semantics using space in a zfs storage pool,
> you're much better off using zvol (see zfs create -V).

So how would I use a zvol to mount an iso file as an hsfs filesystem
that I just downloaded?  Or, in the write case, creating an iso file
that I'm going to burn to disc?

99% of my use cases for lofi are mounting iso images; the other 1% is
my swap space, because I am testing the encrypted lofi bits we are
hoping to put up on opensolaris.org really soon.

--
Darren J Moffat
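The lofi workflow in question goes roughly like this; a sketch with an
illustrative image path (lofiadm prints the block device it assigns,
typically /dev/lofi/1 for the first mapping):

    # Map the downloaded image to a lofi block device.
    lofiadm -a /export/tmp/image.iso

    # Mount it read-only as an hsfs filesystem.
    mount -F hsfs -o ro /dev/lofi/1 /mnt

    # Tear down when finished.
    umount /mnt
    lofiadm -d /dev/lofi/1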
Jürgen Keil
2006-Feb-03 10:56 UTC
[zfs-discuss] Re: Re: Re: bad lofi performance with zfs file backend / bad mmap write p
> > > An interesting question would be, is the drive's write cache *really*
> > > disabled?
> >
> > It isn't disabled;
>
> A few of us discussed this today. The write cache is indeed enabled,
> so for correctness it needs to be flushed before returning from any
> synchronous I/O.

Getting a working "write cache flush" for USB or Firewire HDD devices
could be quite difficult, because the firmware in the
{USB,Firewire}<->ATA bridges seems to implement only a limited set of
SCSI commands when translating between the SCSI protocol and ATA
commands:

- READ & WRITE
- INQUIRY
- READ CAPACITY
- with a bit of luck: MODE SENSE / MODE SELECT, but often you get only
  a "dummy" MODE SENSE implementation

When ATAPI devices (optical drives, dvd writers, ...) are installed
behind such a bridge, SCSI command support is much better, because the
USB/Firewire firmware can pass the SCSI commands (more or less)
unmodified to the ATAPI device.

Back to my Firewire enclosure with an ATA HDD: the SYNCHRONIZE CACHE
command for flushing the write cache is rejected by my Firewire
external enclosure (when an HDD is installed). I tested this by
patching the sd driver's "un_f_write_cache_enabled" bit to 1, so that
"sd" tried a SYNCHRONIZE CACHE command on the next
DKIOCFLUSHWRITECACHE ioctl. The answer from the Firewire enclosure was
a "sense key/asc/ascq 5 24 0" reply (ILLEGAL REQUEST / INVALID FIELD
IN CDB).

While "sd" can and probably should be fixed to correctly detect that
certain {Firewire,USB} devices are unable to report whether their
write cache is enabled or disabled, flushing the write cache (when it
is enabled) might be impossible.

> The reason you're seeing such a large difference in performance
> between ufs and zfs is that zfs does the flush, while ufs
> (incorrectly) does not.

Exactly.

> Of course, the real problem is that lofi is generating synchronous
> I/O. That could be fixed, but it's not as simple as it sounds,
> because segmap_release() doesn't have an async callback mechanism.

Would it be possible to work around the problem by increasing the
amount of I/O done between the write cache flush ioctls, that is, by
writing bigger chunks?

And note that the problem isn't limited to lofi: an
"msync(addr, size, MS_SYNC);" on a memory mapped file on zfs
(/usr/ccs/bin/ld, output file on zfs) seems to trigger the same
performance issue. The kernel stack for the DKIOCFLUSHWRITECACHE
ioctls issued for the msync is this:

      0  34868                  zio_ioctl:entry
                  zfs`zil_flush_vdevs+0x144
                  zfs`zil_commit+0x2ff
                  zfs`zfs_putapage+0x266
                  zfs`zfs_putpage+0x1a2
                  genunix`fop_putpage+0x21
                  genunix`segvn_sync+0xb6
                  genunix`as_ctl+0x187
                  genunix`memcntl+0x57a
                  unix`sys_sysenter+0x104

      0  34086      vdev_disk_ioctl_done:entry
                  DKIOCFLUSHWRITECACHE time: 7968324 nsec, error 0
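A stack trace like the one above can be gathered with a DTrace
one-liner along these lines; a sketch assuming the fbt probes shown in
the output are available on your build:

    # Count write-cache-flush zio_ioctl() calls and record the kernel
    # stacks that trigger them (run as root; Ctrl-C prints the results).
    dtrace -n 'fbt::zio_ioctl:entry { @[stack()] = count(); }'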
Jürgen Keil
2006-Feb-03 11:18 UTC
[zfs-discuss] Re: Re: Re: bad lofi performance with zfs file backend / bad mmap write p
> > But lofi atop a zfs file is not an efficient way to get what you want.
> > To get device/volume semantics using space in a zfs storage pool,
> > you're much better off using zvol (see zfs create -V).
>
> So how would I use a zvol to mount an iso file as an hsfs filesystem
> that I just downloaded?

The performance issue only exists when writing to a lofi device. There
should be no performance problem with a read-only iso filesystem mount
using a lofi device and an iso image file on zfs.

> Or, in the write case, creating an iso file that I'm going to burn to
> disc?

Fortunately, "mkisofs" is able to write the iso output image directly
to a file, so neither lofi nor a zvol is needed.
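For example, roughly (directory and file names are illustrative):

    # Write the iso image straight to a file on zfs; no lofi device needed.
    mkisofs -r -o /tank/scratch/backup.iso /export/home/me/project

    # Burn it later, e.g. with cdrecord (the dev= target is illustrative).
    cdrecord dev=1,0,0 /tank/scratch/backup.iso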
Casper.Dik at Sun.COM
2006-Feb-03 11:30 UTC
[zfs-discuss] Re: Re: Re: bad lofi performance with zfs file backend / bad mmap write p
>> > But lofi atop a zfs file is not an efficient way to get what you want.
>> > To get device/volume semantics using space in a zfs storage pool,
>> > you're much better off using zvol (see zfs create -V).
>>
>> So how would I use a zvol to mount an iso file as an hsfs filesystem
>> that I just downloaded?
>
> The performance issue only exists when writing to a lofi device. There
> should be no performance problem with a read-only iso filesystem mount
> using a lofi device and an iso image file on zfs.

And lofi performance is mediocre at best anyway, so any performance
tests involving lofi are suspect.

(Interposing lofi over the swap device has a serious performance
impact, even though it's only a pass-through; encrypting lofi is much
worse, of course, but that's because data copies are needed.)

Casper
Darren J Moffat
2006-Feb-03 11:41 UTC
[zfs-discuss] Re: Re: Re: bad lofi performance with zfs file backend / bad mmap write p
On Fri, 2006-02-03 at 11:30, Casper.Dik at sun.com wrote:
> (Interposing lofi over the swap device has a serious performance
> impact, even though it's only a pass-through; encrypting lofi is much
> worse, of course, but that's because data copies are needed.)

Indeed. I think the better direction for encrypted swap will be
swapping on a zvol, because encryption will be available to zvols as
well as to file systems (or any other future datasets).

--
Darren J Moffat
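Swapping on a zvol looks roughly like this; a sketch with illustrative
pool and volume names, assuming zvol swap is supported on your build:

    # Create a 2G volume and add it as a swap device.
    zfs create -V 2g tank/swapvol
    swap -a /dev/zvol/dsk/tank/swapvol

    # Verify that the new swap device is in use.
    swap -l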
Joerg Schilling
2006-Feb-03 15:52 UTC
[zfs-discuss] Re: Re: Re: bad lofi performance with zfs file backend / bad mmap write p
Jürgen Keil <jk at tools.de> wrote:

> > > > An interesting question would be, is the drive's write cache *really*
> > > > disabled?
> > >
> > > It isn't disabled;
> >
> > A few of us discussed this today. The write cache is indeed enabled,
> > so for correctness it needs to be flushed before returning from any
> > synchronous I/O.
>
> Getting a working "write cache flush" for USB or Firewire HDD devices
> could be quite difficult, because the firmware in the
> {USB,Firewire}<->ATA bridges seems to implement only a limited set of
> SCSI commands when translating between the SCSI protocol and ATA
> commands:
>
> - READ & WRITE
> - INQUIRY
> - READ CAPACITY
> - with a bit of luck: MODE SENSE / MODE SELECT, but often you get only
>   a "dummy" MODE SENSE implementation
>
> When ATAPI devices (optical drives, dvd writers, ...) are installed
> behind such a bridge, SCSI command support is much better, because the
> USB/Firewire firmware can pass the SCSI commands (more or less)
> unmodified to the ATAPI device.

It may help to try to get an ATAPI-enabled HDD. It seems that for DVD
writing purposes, unknown SCSI commands are just passed through to the
real drive.

Jörg

--
EMail: joerg at schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin
       js at cs.tu-berlin.de (uni)
       schilling at fokus.fraunhofer.de (work)
Blog:  http://schily.blogspot.com/
URL:   http://cdrecord.berlios.de/old/private/ ftp://ftp.berlios.de/pub/schily