Guys,

What is the best way to ask for a feature enhancement to ZFS?

To allow ZFS to be useful for DR disk replication, we need to be able to set an option against the pool, the file system, or both, called closesync: when a program closes a file, any outstanding writes are flushed to disk before the close returns to the program. So when a program ends, you are guaranteed that any state information has been saved to disk. (exit() also results in close() being called.)

open(xxx, O_DSYNC) is only good if you can alter the source code. Shell scripts that use awk, head, tail, echo, etc. to create output files do not use O_DSYNC; when the shell script returns 0, you want to know that all the data is on the disk, so that if the system crashes the data is still there.

PS: it would be nice if UFS had closesync as well, instead of using forcedirectio.

Cheers
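PPS: a concrete illustration of the gap. A hypothetical script like this exits 0 while its output may still be sitting only in the OS cache:

    #!/bin/sh
    # none of these tools open their output files with O_DSYNC
    awk -F: '{ print $1 }' /etc/passwd > users.out
    echo "report complete" >> users.out
    exit 0  # the caller assumes users.out is safely on disk; after a
            # crash it may be missing or truncated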
On Thu, 26 Jul 2007, Damon Atkins wrote:

> Guys,
>
> What is the best way to ask for a feature enhancement to ZFS?
>
> To allow ZFS to be useful for DR disk replication, we need to be able to
> set an option against the pool, the file system, or both, called
> closesync: when a program closes a file, any outstanding writes are
> flushed to disk before the close returns to the program. So when a
> program ends, you are guaranteed that any state information has been
> saved to disk. (exit() also results in close() being called.)
>
> open(xxx, O_DSYNC) is only good if you can alter the source code. Shell
> scripts that use awk, head, tail, echo, etc. to create output files do
> not use O_DSYNC; when the shell script returns 0, you want to know that
> all the data is on the disk, so that if the system crashes the data is
> still there.
>
> PS: it would be nice if UFS had closesync as well, instead of using
> forcedirectio.

I'd implement this via an LD_PRELOAD library: implement your own 'close' so that it not only dispatches to libc`close but also does an fsync() call on that file descriptor first.

Or, if you really want to make source-code changes, change it in libc`close() instead and make it depend on an environment variable: if DO_CLOSE_SYNC is set, perform fsync(); close() instead of just the latter.

There's a problem with sync-on-close anyway: mmap for file I/O. Who guarantees you that no file contents are being modified after the close()?

FrankH.
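P.S. A rough, untested sketch of such an interposer, combining the environment-variable idea with the preload. It assumes a runtime linker that supports RTLD_NEXT; the build line below is a guess, adjust for your toolchain:

    /* closesync.c - LD_PRELOAD close() interposer (rough sketch, untested).
     *
     * Hypothetical build/usage:
     *   cc -Kpic -G -o closesync.so closesync.c
     *   LD_PRELOAD=./closesync.so DO_CLOSE_SYNC=1 your_script.sh
     */
    #define _GNU_SOURCE     /* for RTLD_NEXT on Linux; harmless on Solaris */
    #include <dlfcn.h>
    #include <stdlib.h>
    #include <unistd.h>

    int
    close(int fd)
    {
        static int (*real_close)(int);

        if (real_close == NULL)
            real_close = (int (*)(int))dlsym(RTLD_NEXT, "close");

        /*
         * Flush file data to stable storage before the close returns.
         * fsync() will fail on pipes, sockets, etc.; ignore that and
         * close anyway, just as a plain close() would.
         */
        if (getenv("DO_CLOSE_SYNC") != NULL)
            (void) fsync(fd);

        return (real_close(fd));
    }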
> I'd implement this via an LD_PRELOAD library [ ... ]
>
> There's a problem with sync-on-close anyway: mmap for file I/O. Who
> guarantees you that no file contents are being modified after the
> close()?

The latter is actually a good argument for doing this (if it is necessary) in the file system, rather than via a preload library. The file system knows when the file is no longer accessible by a user process (neither opened nor mapped).

That said, I'm not sure exactly what this buys you for disk replication. What's special about files which have been closed? Is the point that applications might close a file and then notify some other process of the file's availability for use?
> Date: Thu, 26 Jul 2007 20:39:09 PDT
> From: "Anton B. Rang"
>
> That said, I'm not sure exactly what this buys you for disk replication.
> What's special about files which have been closed? Is the point that
> applications might close a file and then notify some other process of
> the file's availability for use?

Yes.

E.g. 1: A program starts an output job and completes the job in the OS cache on Server A. Server A tells the batch scheduling software on Server B that the job is complete. Server A crashes; the file no longer exists, or is truncated, because of what was left in the OS cache. Server B schedules the next job on the assumption that the file created on Server A is OK.

E.g. 2: A program starts an output job and completes the job in the OS cache on Server A. A DB on Server A, running in a different ZFS pool, updates a DB record to record the fact that the output is complete (the DB uses O_DSYNC). Server A crashes; the file no longer exists, or is truncated, because of what was left in the OS cache. The DB on Server A contains information saying that the file is complete.

I believe that sync-on-close should be the default. File system integrity should be more than just being able to read a file which has been truncated due to a system crash, power failure, etc.

E.g. 3 (a bit cheeky :-): You vi a file, save it, and the system crashes. You look back at the screen and say "thank god, I saved the file in time", because on your screen is the prompt again:

$ vi xxxxx
$ connection lost

But this was all happening in the OS cache; when the system comes back, the file does not exist. (I am ignoring vi -r.) Therefore users should do:

$ vi xxxxx
$ sleep 5 ; echo file xxxxx now on disk :-)
$ echo add a line > xxxxx
$ sleep 5 ; echo update to xxxxx complete

UFS forcedirectio and VxFS closesync ensure that, whatever happens, your files will always exist if the program completes. Therefore with (sync) disk replication the file exists at the other site at its finished size. When you introduce DR with disk replication, it generally means you cannot afford to lose any saved data. UFS forcedirectio has a larger performance hit than VxFS closesync.

Cheers
On Fri, 3 Aug 2007, Damon Atkins wrote:

[ ... ]

> UFS forcedirectio and VxFS closesync ensure that, whatever happens, your
> files will always exist if the program completes. Therefore with (sync)
> disk replication the file exists at the other site at its finished size.
> When you introduce DR with disk replication, it generally means you
> cannot afford to lose any saved data. UFS forcedirectio has a larger
> performance hit than VxFS closesync.

Hmm, not quite. forcedirectio, at least on UFS, is contingent on the I/O operations meeting certain criteria. These are explained in directio(3C):

     DIRECTIO_ON     The system behaves as though the application is
                     not going to reuse the file data in the near
                     future. In other words, the file data is not
                     cached in the system's memory pages.

                     When possible, data is read or written directly
                     between the application's memory and the device
                     when the data is accessed with read(2) and
                     write(2) operations. When such transfers are not
                     possible, the system switches back to the default
                     behavior, but just for that operation. In general,
                     the transfer is possible when the application's
                     buffer is aligned on a two-byte (short) boundary,
                     the offset into the file is on a device sector
                     boundary, and the size of the operation is a
                     multiple of device sectors.

                     This advisory is ignored while the file associated
                     with fildes is mapped (see mmap(2)).

So it all depends on what exactly your workload looks like. If you're doing non-block-sized writes or writes to non-aligned offsets, and/or mmap access, directio is not being done, the advisory AND (!) the mount option notwithstanding.

As far as hot-backup consistency goes: do a "lockfs -w", then start the BCV copy, then (once that has started) do a "lockfs -u". A write-locked file system is "clean" and does not need to be fsck'ed before it can be mounted.

The disadvantage is that write ops to the fs in question will block while the lockfs -w is active. But then, you don't need to wait until the BCV has finished; you only need the consistent state to start with, and can unlock immediately once the copy has started.

Note that fssnap also write-locks temporarily. So if you have used UFS snapshots in the past, "lockfs -w"; <BCV start>; "lockfs -u" is not going to cause you more impact.

"lockfs -f" is only a best-try-if-I-cannot-writelock. It is no guarantee of consistency, because by the time the command returns, something else can already be writing again.

FrankH.
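P.S. The sequence spelled out; the mount point is a placeholder and the BCV kick-off command is site-specific:

    # lockfs -w /export/home    <- write lock: on-disk state is now consistent
    # <start the BCV copy>      <- your site-specific command
    # lockfs -u /export/home    <- unlock as soon as the copy has started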
> Do a "lockfs -w", then start the BCV copy, then (once that has started)
> do a "lockfs -u". A write-locked file system is "clean" and does not
> need to be fsck'ed before it can be mounted.
>
> The disadvantage is that write ops to the fs in question will block
> while the lockfs -w is active. But then, you don't need to wait until
> the BCV has finished; you only need the consistent state to start with,
> and can unlock immediately once the copy has started.

Unless I'm misunderstanding the terminology, consistency isn't required for the BCV copy (start or otherwise); it's required during the split.

-- 
Darren Dunham                                           ddunham at taos.com
Senior Technical Consultant         TAOS            http://www.taos.com/
Got some Dr Pepper?                           San Francisco, CA bay area
         < This line left intentionally blank to confuse you. >