Here's a patch against the 4.8-RELEASE kernel that allows disk writes on
softupdates-enabled filesystems to be delayed for (theoretically)
arbitrarily long periods of time. The motivation for such an updating
policy is, surprisingly, not purely suicidal: it can allow disks on
laptops to spin down immediately after I/O operations and stay idle for
longer periods of time, thus saving a considerable amount of battery
power.

The patch introduces a new sysctl tunable, vfs.sync_extdelay, which
controls the delay duration in seconds. If the variable is set to 0, the
standard UFS syncing policy is restored. The tunable can be either
modified by hand or controlled by the APM daemon using the attached
rc.syncdelay script.

When enabled, the extended delaying policy introduces some additional
changes:

- fsync() no longer flushes the buffers to disk, but returns immediately
  instead;
- invoking sync() causes flushing of softupdates buffers to follow
  immediately, which was not the case before;
- if one of the mounted filesystems becomes low on free space, which can
  happen if a lot of data is written to the FS but FS metadata buffers
  are not written to disk, flushing of all softupdates buffers is
  scheduled automatically;
- if an I/O operation (typically a read request) is performed on an ATA
  disk, which is likely to cause the disk to be spun up, the pending
  buffers are immediately flushed to the disk, but only if they have
  been pending longer than they would be under the normal updating
  policy.

As I'm virtually clueless about FS concepts and theory, I'm not sure
whether the above model shakes the foundations of UFS operation, so I'd
appreciate comments on the patch from more knowledgeable people.
Nevertheless, my laptop has run without glitches for the last two weeks
with the extra delaying enabled, while happily achieving 5-10% longer
battery-operated periods, depending on disk utilization patterns.

Cheers,

Marko
-------------- next part --------------
--- /usr/src/sys.org/dev/ata/ata-disk.c	Thu Jan 30 08:19:59 2003
+++ dev/ata/ata-disk.c	Sat Apr 12 00:31:26 2003
@@ -294,6 +294,7 @@ adstrategy(struct buf *bp)
     struct ad_softc *adp = bp->b_dev->si_drv1;
     int s;

+    stratcalls++;
     if (adp->device->flags & ATA_D_DETACHING) {
 	bp->b_error = ENXIO;
 	bp->b_flags |= B_ERROR;
--- /usr/src/sys.org/kern/vfs_subr.c	Sun Oct 13 18:19:12 2002
+++ kern/vfs_subr.c	Sat Apr 12 01:56:16 2003
@@ -116,6 +116,10 @@ SYSCTL_INT(_vfs, OID_AUTO, reassignbufme
 static int nameileafonly = 0;
 SYSCTL_INT(_vfs, OID_AUTO, nameileafonly, CTLFLAG_RW, &nameileafonly, 0, "");

+int stratcalls = 0;
+int sync_extdelay = 0;
+SYSCTL_INT(_vfs, OID_AUTO, sync_extdelay, CTLFLAG_RW, &sync_extdelay, 0, "");
+
 #ifdef ENABLE_VFS_IOOPT
 int vfs_ioopt = 0;
 SYSCTL_INT(_vfs, OID_AUTO, ioopt, CTLFLAG_RW, &vfs_ioopt, 0, "");
@@ -137,7 +141,7 @@ static vm_zone_t vnode_zone;
  * The workitem queue.
  */
 #define SYNCER_MAXDELAY 32
-static int syncer_maxdelay = SYNCER_MAXDELAY; /* maximum delay time */
+int syncer_maxdelay = SYNCER_MAXDELAY; /* maximum delay time */
 time_t syncdelay = 30; /* max time to delay syncing data */
 time_t filedelay = 30; /* time to delay syncing files */
 SYSCTL_INT(_kern, OID_AUTO, filedelay, CTLFLAG_RW, &filedelay, 0, "");
@@ -145,7 +149,7 @@ time_t dirdelay = 29; /* time to delay
 SYSCTL_INT(_kern, OID_AUTO, dirdelay, CTLFLAG_RW, &dirdelay, 0, "");
 time_t metadelay = 28; /* time to delay syncing metadata */
 SYSCTL_INT(_kern, OID_AUTO, metadelay, CTLFLAG_RW, &metadelay, 0, "");
-static int rushjob; /* number of slots to run ASAP */
+int rushjob; /* number of slots to run ASAP */
 static int stat_rush_requests; /* number of times I/O speeded up */
 SYSCTL_INT(_debug, OID_AUTO, rush_requests, CTLFLAG_RW, &stat_rush_requests, 0, "");

@@ -177,6 +181,7 @@ vntblinit()
 {

 	desiredvnodes = maxproc + cnt.v_page_count / 4;
+	TUNABLE_INT_FETCH("kern.maxvnodes", &desiredvnodes);
 	minvnodes = desiredvnodes / 4;
 	simple_lock_init(&mntvnode_slock);
 	simple_lock_init(&mntid_slock);
@@ -1119,7 +1124,7 @@ sched_sync(void)
 {
 	struct synclist *slp;
 	struct vnode *vp;
-	long starttime;
+	time_t starttime;
 	int s;
 	struct proc *p = updateproc;

@@ -1127,8 +1132,6 @@ sched_sync(void)
 	    SHUTDOWN_PRI_LAST);

 	for (;;) {
-		kproc_suspend_loop(p);
-
 		starttime = time_second;

 		/*
@@ -1198,8 +1201,25 @@ sched_sync(void)
 		 * matter as we are just trying to generally pace the
 		 * filesystem activity.
 		 */
-		if (time_second == starttime)
+		if (time_second != starttime)
+			continue;
+
+		if (sync_extdelay >= syncer_maxdelay)
+			while (syncer_delayno == 0 && rushjob == 0 &&
+			    abs(time_second - starttime) < sync_extdelay) {
+				stratcalls = 0;
+				tsleep(&lbolt, PPAUSE, "syncer", 0);
+				kproc_suspend_loop(p);
+				if (stratcalls != 0 && syncer_maxdelay <
+				    abs(time_second - starttime)) {
+					rushjob = syncer_maxdelay;
+					break;
+				}
+			}
+		else {
 			tsleep(&lbolt, PPAUSE, "syncer", 0);
+			kproc_suspend_loop(p);
+		}
 	}
 }

--- /usr/src/sys.org/kern/vfs_syscalls.c	Thu Jan  2 18:26:18 2003
+++ kern/vfs_syscalls.c	Sat Apr 12 01:55:48 2003
@@ -563,6 +563,9 @@ sync(p, uap)
 	register struct mount *mp, *nmp;
 	int asyncflag;

+	/* Notify sched_sync() to try flushing syncer_workitem_pending[*] */
+	rushjob += syncer_maxdelay;
+
 	simple_lock(&mountlist_slock);
 	for (mp = TAILQ_FIRST(&mountlist); mp != NULL; mp = nmp) {
 		if (vfs_busy(mp, LK_NOWAIT, &mountlist_slock, p)) {
@@ -2627,6 +2630,10 @@ fsync(p, uap)
 	struct file *fp;
 	vm_object_t obj;
 	int error;
+
+	/* Just return if we are artificially delaying disk syncs */
+	if (sync_extdelay)
+		return (0);

 	if ((error = getvnode(p->p_fd, SCARG(uap, fd), &fp)) != 0)
 		return (error);
--- /usr/src/sys.org/ufs/ffs/ffs_alloc.c	Fri Sep 21 21:15:21 2001
+++ ufs/ffs/ffs_alloc.c	Sat Apr 12 00:06:20 2003
@@ -125,6 +125,10 @@ ffs_alloc(ip, lbn, bpref, size, cred, bn
 #endif /* DIAGNOSTIC */
 	if (size == fs->fs_bsize && fs->fs_cstotal.cs_nbfree == 0)
 		goto nospace;
+	/* Speedup flushing of syncer_workitem_pending[*] if low on freespace */
+	if (rushjob == 0 &&
+	    freespace(fs, fs->fs_minfree + 2) - numfrags(fs, size) < 0)
+		rushjob = syncer_maxdelay;
 	if (cred->cr_uid != 0 &&
 	    freespace(fs, fs->fs_minfree) - numfrags(fs, size) < 0)
 		goto nospace;
@@ -195,6 +199,10 @@ ffs_realloccg(ip, lbprev, bpref, osize,
 	if (cred == NOCRED)
 		panic("ffs_realloccg: missing credential");
 #endif /* DIAGNOSTIC */
+	/* Speedup flushing of syncer_workitem_pending[*] if low on freespace */
+	if (rushjob == 0 &&
+	    freespace(fs, fs->fs_minfree + 2) - numfrags(fs, nsize - osize) < 0)
+		rushjob = syncer_maxdelay;
 	if (cred->cr_uid != 0 &&
 	    freespace(fs, fs->fs_minfree) - numfrags(fs, nsize - osize) < 0)
 		goto nospace;
--- /usr/src/sys.org/sys/buf.h	Sat Jan 25 20:02:23 2003
+++ sys/buf.h	Sat Apr 12 00:30:48 2003
@@ -478,6 +478,7 @@ extern char *buffers; /* The buffer con
 extern int bufpages; /* Number of memory pages in the buffer pool. */
 extern struct buf *swbuf; /* Swap I/O buffer headers. */
 extern int nswbuf; /* Number of swap I/O buffer headers. */
+extern int stratcalls; /* I/O ops since last buffer sync */
 extern TAILQ_HEAD(swqueue, buf) bswlist;
 extern TAILQ_HEAD(bqueues, buf) bufqueues[BUFFER_QUEUES];

--- /usr/src/sys.org/sys/vnode.h	Sun Dec 29 19:19:53 2002
+++ sys/vnode.h	Sat Apr 12 00:06:20 2003
@@ -294,6 +294,9 @@ extern struct vm_zone *namei_zone;
 extern int prtactive; /* nonzero to call vprint() */
 extern struct vattr va_null; /* predefined null vattr structure */
 extern int vfs_ioopt;
+extern int rushjob;
+extern int syncer_maxdelay;
+extern int sync_extdelay;

 /*
  * Macro/function to check for client cache inconsistency w.r.t. leasing.
-------------- next part --------------
# apmd Configuration File
#
# $FreeBSD: src/etc/apmd.conf,v 1.2.2.1 2000/12/12 22:48:18 dannyboy Exp $
#

apm_event POWERSTATECHANGE {
	exec "/etc/rc.syncdelay";
}

apm_event SUSPENDREQ {
	exec "/etc/rc.suspend";
}

apm_event USERSUSPENDREQ {
	exec "sync && sync && sync";
	#exec "sleep 1";
	exec "apm -z";
}

apm_event NORMRESUME, STANDBYRESUME {
	exec "/etc/rc.resume";
	exec "/etc/rc.syncdelay";
}

# resume event configuration for serial mouse users by
# reinitializing a moused(8) connected to a serial port.
#
#apm_event NORMRESUME {
#	exec "kill -HUP `cat /var/run/moused.pid`";
#}

# suspend request event configuration for ATA HDD users:
# execute standby instead of suspend.
#
#apm_event SUSPENDREQ {
#	reject;
#	exec "sync && sync && sync";
#	exec "sleep 1";
#	exec "apm -Z";
#}

# Sample entries for battery state monitoring
#apm_battery 5% discharging {
#	exec "logger -p user.emerg battery status critical!";
#	exec "echo T250L8CE-GE-C >/dev/speaker";
#}
#apm_battery 1% discharging {
#	exec "logger -p user.emerg battery low - emergency suspend";
#	exec "echo T250L16B+BA+AG+GF+FED+DC+CC >/dev/speaker";
#	exec "apm -z";
#}
#apm_battery 99% charging {
#	exec "logger -p user.notice battery fully charged";
#}

# apmd Configuration ends here
-------------- next part --------------
#!/bin/sh
#
# Copyright (c) 2003 Marko Zec
#
#include /usr/share/examples/bsd-style-copyright
#
#
# /etc/rc.syncdelay
#
# Adjust disk syncing policy and delay on battery powered systems.
# Invoked automatically by apmd(8) when power state change or resume
# events occur.
#

AC_DELAY=0	# no delayed syncing
BAT_DELAY=600	# sync every 10 minutes

if [ `apm -a` -eq 1 ]; then
	# AC powered mode
	sysctl vfs.sync_extdelay=$AC_DELAY
else
	# Battery powered mode
	# Allow delayed syncing only if enough battery capacity is available
	if [ `apm -l` -gt 3 ]; then
		sysctl vfs.sync_extdelay=$BAT_DELAY
	else
		sysctl vfs.sync_extdelay=0
	fi
fi

exit 0

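The extended wait that the patch adds to sched_sync() can be easier to follow as a small userland model of the decision taken after each one-second sleep. This is only a sketch: syncer_tick() and the enum are hypothetical names, the parameters mirror the kernel variables used in the patch, and tsleep(), kproc_suspend_loop() and all locking are left out.

```c
#include <assert.h>
#include <stdlib.h>

/* What the syncer does after each one-second sleep (hypothetical enum). */
enum syncer_action {
	SYNCER_WAIT,	/* leave the disk alone for another tick */
	SYNCER_RUSH,	/* the disk is busy anyway: flush everything now */
	SYNCER_RESUME	/* leave the extended wait, resume normal pacing */
};

/*
 * Model of one iteration of the patched wait loop.  "elapsed" stands in
 * for time_second - starttime; the other names are taken from the patch.
 */
enum syncer_action
syncer_tick(int sync_extdelay, int syncer_maxdelay, int syncer_delayno,
    int rushjob, int stratcalls, long elapsed)
{
	assert(syncer_maxdelay > 0);

	/* Any setting below syncer_maxdelay keeps the standard policy. */
	if (sync_extdelay < syncer_maxdelay)
		return (SYNCER_RESUME);

	/* I/O hit the disk and buffers are older than the normal window. */
	if (stratcalls != 0 && labs(elapsed) > syncer_maxdelay)
		return (SYNCER_RUSH);

	/* Nothing forced a flush and the extended delay has not expired. */
	if (syncer_delayno == 0 && rushjob == 0 &&
	    labs(elapsed) < sync_extdelay)
		return (SYNCER_WAIT);

	return (SYNCER_RESUME);
}
```

Read this way, the patch appears to treat every vfs.sync_extdelay value below syncer_maxdelay (32), not just 0, as "standard policy", and a sync() call ends the wait because it raises rushjob.
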
* Marko Zec <zec@tel.fer.hr> [030411 19:01] wrote:
>
> When enabled, the extended delaying policy introduces some additional
> changes:
>
> - fsync() no longer flushes the buffers to disk, but returns immediately
> instead;

This is really the only bad thing I can see here. What about introducing
a slight delay and seeing if one can coalesce the writes? Is this part
really needed?

Making fsync() not work is a good way to make any sort of userland-based
transactional system break badly.

Otherwise, way cool!

-Alfred

On Sat, 12 Apr 2003, Marko Zec wrote:

> When enabled, the extended delaying policy introduces some additional
> changes:
>
> - fsync() no longer flushes the buffers to disk, but returns immediately
> instead;

This is bad; the rest looks very interesting.

--
jan grant, ILRT, University of Bristol. http://www.ilrt.bris.ac.uk/
Tel +44(0)117 9287088 Fax +44 (0)117 9287112 http://ioctl.org/jan/
Rereleasing dolphins into the wild since 1998.

Marko Zec <zec@tel.fer.hr> wrote:
> Here's a patch against the 4.8-RELEASE kernel that allows disk writes on
> softupdates-enabled filesystems to be delayed for (theoretically)
> arbitrarily long periods of time. The motivation for such an updating
> policy is, surprisingly, not purely suicidal: it can allow disks on
> laptops to spin down immediately after I/O operations and stay idle for
> longer periods of time, thus saving a considerable amount of battery
> power.

It would be very cool if you could have different delay settings per
filesystem. That would enable you to have a large delay on /tmp, a
medium delay on /var, and the standard delay (i.e. more safety) on
everything else.

> - fsync() no longer flushes the buffers to disk, but returns immediately
> instead;

I see some issues with that. Better make that tunable separately (and
probably default to off).

> - invoking sync() causes flushing of softupdates buffers to follow
> immediately, which was not the case before;

That's cool. I always disliked the fact I had to type sync several times
and still couldn't be sure that everything was really synced. (Yeah, I
know, it's the way it works, it always worked like that, and it's
documented to work like that ... but I still dislike it.)

> - if one of the mounted filesystems becomes low on free space, which can
> happen if a lot of data is written to the FS but FS metadata buffers are
> not written to disk, flushing of all softupdates buffers is scheduled
> automatically;

That's cool, too. I've been bitten several times by the bogus "no space
left on device", due to soft-updates delaying the freeing of file data.

I assume that buffered data is also flushed to disk when the system runs
low on RAM, right? (I'm not a VFS/VM expert, so that might be a stupid
question.)

> Nevertheless, my laptop has run without glitches for the last two weeks
> with the extra delaying enabled, while happily achieving 5-10% longer
> battery-operated periods, depending on disk utilization patterns.

Awesome. That would mean about 45 minutes more mobility with my
laptop. :)

Regards
Oliver

--
Oliver Fromme, secnetix GmbH & Co KG, Oettingenstr. 2, 80538 München
Any opinions expressed in this message may be personal to the author
and may not necessarily reflect the opinions of secnetix in any way.

"Clear perl code is better than unclear awk code; but NOTHING
comes close to unclear perl code" (taken from comp.lang.awk FAQ)

Marko Zec said:
> Alfred Perlstein wrote:
>
> > * Marko Zec <zec@tel.fer.hr> [030411 19:01] wrote:
> > >
> > > When enabled, the extended delaying policy introduces
> > > some additional changes:
> > >
> > > - fsync() no longer flushes the buffers to disk, but
> > > returns immediately instead;
[...]
> > Making fsync() not work is a good way to make any sort
> > of userland-based transactional system break badly.
[...]
> If the disk would start spinning every now and then,
> the whole patch would then become pointless...

As I feared.

> [...] the fact that the modified fsync() just returns
> without doing anything useful doesn't mean the data will be
> lost - it will simply be delayed until the next coalesced
> updating occurs.

Unless, of course, your system or power happens to fail.

Imagine you have a database program keeping track of banking
transactions. This program uses fsync() to ensure its transaction logs
are committed to reliable storage before indicating the transaction is
completed. Suppose the moment after I withdraw $500 from an ATM, the
operating system or hardware fails at the bank. With your change to
fsync() to not commit to stable storage, I may have just won $500
courtesy of you. That is, the database software did all it could to
ensure the $500 transaction was actually written to disk before
authorizing the ATM to dispense cash, yet fsync() has decided it's not
that important to do right away, so the transaction might well not have
hit the disk before the catastrophe.

For a perspective from the Windows world on the same sort of capability,
check out the Win32 FlushFileBuffers spec:

http://makeashorterlink.com/?E26B12F24

which is an alias for:

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/fileio/base/flushfilebuffers.asp

From that page: "The FlushFileBuffers function writes all
of the buffered information for the specified file to disk."

Such is the world of writing OS code -- optimizing for one situation may
well break other important uses of the same code.

Regards,

Dave Hart
davehart@davehart.com
(who spent more time formatting text than writing, sigh)

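The commit pattern described above, where a transaction is acknowledged only once fsync() has reported success, might look roughly like the following. It is a minimal sketch rather than code from any real database: commit_record() and its log path argument are hypothetical, and only standard POSIX calls are used.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Append one record to a transaction log.  Returns 0 only after fsync()
 * has reported the data as committed, and -1 on any failure.
 * commit_record() is a hypothetical helper used for illustration.
 */
int
commit_record(const char *log_path, const char *record)
{
	size_t left;
	ssize_t n;
	int fd;

	assert(log_path != NULL && record != NULL);

	fd = open(log_path, O_WRONLY | O_CREAT | O_APPEND, 0600);
	if (fd < 0)
		return (-1);

	/* write() may be partial, so loop until the whole record is out. */
	for (left = strlen(record); left > 0; left -= (size_t)n, record += n) {
		n = write(fd, record, left);
		if (n <= 0) {
			close(fd);
			return (-1);
		}
	}

	/* The caller may acknowledge the transaction only if this succeeds. */
	if (fsync(fd) != 0) {
		close(fd);
		return (-1);
	}
	return (close(fd));
}
```

With fsync() reduced to an immediate "return (0)", this function would still report success, which is exactly why the caller could no longer tell a committed record from a buffered one.
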
Marko Zec wrote:

> - fsync() no longer flushes the buffers to disk, but returns immediately
> instead;

Any system that does this should be flushed down the toilet. Softupdates
already breaks transaction semantics of FFS by breaking
link()/unlink()/rename() etc. Don't make it worse. Many programs rely
on fsync().

In message <3E976EBD.C3E66EF8@tel.fer.hr>, Marko Zec writes:
>Here's a patch against the 4.8-RELEASE kernel that allows disk writes on
>softupdates-enabled filesystems to be delayed for (theoretically)
>arbitrarily long periods of time. The motivation for such an updating
>policy is, surprisingly, not purely suicidal: it can allow disks on
>laptops to spin down immediately after I/O operations and stay idle for
>longer periods of time, thus saving a considerable amount of battery
>power.

Looks interesting. A while ago I was reading the spec of some IBM ATA
hard disk, and discovered that there is a "delayed write" feature built
into most ATA disks that is extremely useful for keeping a laptop disk
spun down.

When the feature is enabled, the disk behaves normally until it spins
down due to the standard ATA spindown timeout. Then it enters the
delayed write mode, and all further writes to the disk go just to the
disk cache and the disk is not spun up. Finally, when for any reason the
disk needs to be spun up (cache is full, or a read of an uncached sector
occurs), the cache is flushed as soon as the disk spins up.

Assuming this is what happens (it's mostly based on observation rather
than documentation), you get a much smaller window where the disk is
potentially inconsistent, and automatic triggering of the writes only
when they are necessary.

Below is a simple script I use to turn on the feature when running on
battery power (using ACPI), and the -CURRENT patches that allow the
spindown delay and delayed write features to be controlled with
atacontrol (I mailed the patches to Soren a while ago).

Ian

#!/bin/sh

oacline=""
while :; do
	sleep 5
	acline=`sysctl -n hw.acpi.acline`
	if [ "X$acline" = "X$oacline" ]; then
		continue
	fi
	oacline="$acline";
	case "$acline" in
	1)
		atacontrol standby 0 0 300
		atacontrol delayed_write 0 0 0
		;;
	0)
		atacontrol standby 0 0 20
		atacontrol delayed_write 0 0 1
		;;
	esac
done

Index: sys/sys/ata.h
==================================================================
RCS file: /dump/FreeBSD-CVS/src/sys/sys/ata.h,v
retrieving revision 1.17
diff -u -r1.17 ata.h
--- sys/sys/ata.h	22 Mar 2003 12:18:20 -0000	1.17
+++ sys/sys/ata.h	28 Mar 2003 02:42:27 -0000
@@ -370,6 +370,7 @@
 #define ATARAIDSTATUS 11
 #define ATAENCSTAT 12
 #define ATAGMAXCHANNEL 13
+#define ATACMD 14

     union {
 	struct {
@@ -409,6 +410,20 @@
 	    int v05;
 	    int v12;
 	} enclosure;
+	struct {
+	    int flags; /* info about the request */
+#define ATA_CMD_CTRL 0x00
+#define ATA_CMD_READ 0x01
+#define ATA_CMD_WRITE 0x02
+
+	    u_int8_t command; /* command code */
+	    u_int64_t lba; /* lba address */
+	    u_int16_t count; /* sector count */
+	    u_int8_t feature; /* feature modifier bits */
+
+	    caddr_t databuf; /* I/O data buffer */
+	    int datalen; /* length of data buffer */
+	} ata;
 	struct {
 	    char ccb[16];
 	    caddr_t data;
Index: sys/dev/ata/ata-all.c
==================================================================
RCS file: /dump/FreeBSD-CVS/src/sys/dev/ata/ata-all.c,v
retrieving revision 1.175
diff -u -r1.175 ata-all.c
--- sys/dev/ata/ata-all.c	30 Mar 2003 09:27:59 -0000	1.175
+++ sys/dev/ata/ata-all.c	1 Apr 2003 12:27:07 -0000
@@ -355,6 +355,28 @@
 		  sizeof(struct ata_params));
 	    return 0;

+	case ATACMD: {
+	    struct ata_device *atadev;
+
+	    if (!device || !(ch = device_get_softc(device)))
+		return ENXIO;
+	    if (!(atadev = &ch->device[iocmd->device]) ||
+		!(ch->devices & (iocmd->device == MASTER ?
+				 ATA_ATA_MASTER : ATA_ATA_SLAVE)))
+		return ENXIO;
+	    if (iocmd->u.ata.flags != ATA_CMD_CTRL)
+		return EOPNOTSUPP;
+
+	    error = 0;
+	    ATA_SLEEPLOCK_CH(ch, ATA_CONTROL);
+	    if (ata_command(atadev, iocmd->u.ata.command, iocmd->u.ata.lba,
+			    iocmd->u.ata.count, iocmd->u.ata.feature,
+			    ATA_WAIT_INTR) != 0)
+		error = EIO;
+	    ATA_UNLOCK_CH(ch);
+	    return error;
+	}
+
 	case ATAENCSTAT: {
 	    struct ata_device *atadev;

Index: sbin/atacontrol/atacontrol.8
==================================================================
RCS file: /dump/FreeBSD-CVS/src/sbin/atacontrol/atacontrol.8,v
retrieving revision 1.22
diff -u -r1.22 atacontrol.8
--- sbin/atacontrol/atacontrol.8	23 Dec 2002 15:30:40 -0000	1.22
+++ sbin/atacontrol/atacontrol.8	31 Jan 2003 00:57:52 -0000
@@ -25,7 +25,7 @@
 .\"
 .\" $FreeBSD: src/sbin/atacontrol/atacontrol.8,v 1.22 2002/12/23 15:30:40 ru Exp $
 .\"
-.Dd May 17, 2001
+.Dd August 18, 2002
 .Dt ATACONTROL 8
 .Os
 .Sh NAME
@@ -72,6 +72,21 @@
 .Ar channel device
 .Nm
 .Ic list
+.Nm
+.Ic idle
+.Ar channel device
+.Op seconds
+.Nm
+.Ic standby
+.Ar channel device
+.Op seconds
+.Nm
+.Ic sleep
+.Ar channel device
+.Nm
+.Ic delayed_write
+.Ar channel device
+.Op 0 | 1
 .Sh DESCRIPTION
 The
 .Nm
@@ -208,6 +223,27 @@
 Fan RPM speed, enclosure temperature, 5V and 12V levels are shown.
 .It Ic list
 Show info about all attached devices on all active controllers.
+.It Ic idle
+Set the idle timeout on the specified device.
+If no timeout is given, put the device into the idle state immediately.
+.It Ic standby
+Set the standby timeout on the specified device.
+If no timeout is given, put the device into the standby state immediately.
+.It Ic sleep
+Put the device into sleep mode.
+Since this effectively powers down the device, settings configured by
+the driver are lost, so this should not be used on an active drive.
+Use
+.Nm
+.Ic reinit
+to reinitialize the device for later use.
+.It Ic delayed_write
+Enable or disable the delayed write feature on the specified device.
+When delayed writes are enabled on devices that support this feature,
+writes that occur while the disk is spun down are stored in the
+drive cache only.
+Once the cache becomes full or the disk is spun up (e.g. for a read
+operation), the cached writes are immediately flushed to the disk.
 .El
 .Sh EXAMPLES
 To see the devices' current access modes, use the command line:
Index: sbin/atacontrol/atacontrol.c
==================================================================
RCS file: /dump/FreeBSD-CVS/src/sbin/atacontrol/atacontrol.c,v
retrieving revision 1.20
diff -u -r1.20 atacontrol.c
--- sbin/atacontrol/atacontrol.c	22 Mar 2003 12:18:20 -0000	1.20
+++ sbin/atacontrol/atacontrol.c	1 Apr 2003 13:26:51 -0000
@@ -249,7 +249,7 @@
 main(int argc, char **argv)
 {
 	struct ata_cmd iocmd;
-	int fd, maxunit, unit;
+	int enable, fd, idle, maxunit, unit;

 	if ((fd = open("/dev/ata", O_RDWR)) < 0)
 		err(1, "control device not found");
@@ -427,6 +427,43 @@
 				mode2str(iocmd.u.mode.mode[0]),
 				mode2str(iocmd.u.mode.mode[1]));
 		}
+	}
+	else if ((!strcmp(argv[1], "idle") || !strcmp(argv[1], "standby") ||
+	    !strcmp(argv[1], "sleep")) && argc == 4) {
+		iocmd.cmd = ATACMD;
+		iocmd.device = atoi(argv[3]);
+		iocmd.u.ata.flags = ATA_CMD_CTRL;
+		iocmd.u.ata.command = !strcmp(argv[1], "idle") ? 0xe1 :
+		    !strcmp(argv[1], "standby") ? 0xe0 : 0xe6;
+		if (ioctl(fd, IOCATA, &iocmd) < 0)
+			err(1, "ioctl(ATACMD)");
+	}
+	else if ((!strcmp(argv[1], "idle") || !strcmp(argv[1], "standby")) &&
+	    argc == 5) {
+		idle = atoi(argv[4]);
+		if (idle > 19800)
+			errx(1, "Maximum idle time is 19800 seconds");
+		if (idle <= 240*5)
+			iocmd.u.ata.count = (idle + 4) / 5;
+		else
+			iocmd.u.ata.count = idle / (30*60) + 240;
+
+		iocmd.cmd = ATACMD;
+		iocmd.device = atoi(argv[3]);
+		iocmd.u.ata.flags = ATA_CMD_CTRL;
+		iocmd.u.ata.command = !strcmp(argv[1], "idle") ? 0xe3 : 0xe2;
+		if (ioctl(fd, IOCATA, &iocmd) < 0)
+			err(1, "ioctl(ATACMD)");
+	}
+	else if (!strcmp(argv[1], "delayed_write") && argc == 5) {
+		enable = atoi(argv[4]);
+		iocmd.cmd = ATACMD;
+		iocmd.device = atoi(argv[3]);
+		iocmd.u.ata.feature = enable ? 0x07 : 0x87;
+		iocmd.u.ata.flags = ATA_CMD_CTRL;
+		iocmd.u.ata.command = 0xfa;
+		if (ioctl(fd, IOCATA, &iocmd) < 0)
+			err(1, "ioctl(ATACMD)");
 	}
 	else
 		usage();

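The timeout arithmetic in the atacontrol.c hunk above is compact enough to deserve a standalone look. The sketch below isolates it; ata_timer_encode() and ata_timer_decode() are hypothetical names that do not appear in the patch, and rejecting negative input is an added assumption (the patch only checks the upper bound).

```c
#include <assert.h>

#define ATA_TIMER_MAX_SECONDS	19800	/* limit enforced by the patch */

/*
 * Convert a timeout in seconds to the count value that the patch stores
 * in iocmd.u.ata.count.  Returns -1 for values this sketch rejects.
 */
int
ata_timer_encode(int seconds)
{
	if (seconds < 0 || seconds > ATA_TIMER_MAX_SECONDS)
		return (-1);
	if (seconds <= 240 * 5)
		return ((seconds + 4) / 5);	/* 5-second units, rounded up */
	return (seconds / (30 * 60) + 240);	/* 30-minute units, rounded down */
}

/*
 * Inverse mapping for the counts that ata_timer_encode() can produce.
 * Counts 252-255 are outside what the patch generates, so return -1.
 */
int
ata_timer_decode(int count)
{
	assert(count >= 0 && count <= 255);

	if (count <= 240)
		return (count * 5);
	if (count <= 251)
		return ((count - 240) * 30 * 60);
	return (-1);
}
```

For the two values used in the script, "atacontrol standby 0 0 300" would send a count of 60 and "atacontrol standby 0 0 20" a count of 4. Requests between 1201 and 1799 seconds all end up as 240, i.e. 20 minutes, because the second branch rounds down.
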
* Michael Sierchio <kudzu@tenebras.com> [030412 11:38] wrote:
> Marko Zec wrote:
>
> >- fsync() no longer flushes the buffers to disk, but returns immediately
> >instead;
>
> Any system that does this should be flushed down the toilet. Softupdates
> already breaks transaction semantics of FFS by breaking
> link()/unlink()/rename()
> etc.

How does it do that?

Marko Zec <zec@tel.fer.hr> wrote:
>
>Here's a patch against the 4.8-RELEASE kernel that allows disk writes on
>softupdates-enabled filesystems to be delayed for (theoretically)
>arbitrarily long periods of time. The motivation for such an updating
>policy is, surprisingly, not purely suicidal: it can allow disks on
>laptops to spin down immediately after I/O operations and stay idle for
>longer periods of time, thus saving a considerable amount of battery
>power.

I've used a much simpler patch for a number of years, which allows you
to increase the 30s syncer interval. See
http://dotat.at/prog/buildworld/patches.src/04.syncdelay.patch.disabled

Tony.
--
f.a.n.finch <dot@dotat.at> http://dotat.at/
LANDS END TO ST DAVIDS HEAD INCLUDING THE BRISTOL CHANNEL: EAST OR SOUTHEAST
4, INCREASING 7, LOCALLY GALE 8, EASING SOUTHEAST 4 OR 5 LATER. FAIR, THEN
RAIN, WITH RISK OF MIST. GOOD, BECOMING MODERATE OR POOR. SLIGHT OR MODERATE,
LOCALLY ROUGH LATER.

Jon Hamilton wrote:
> Dave Hart <davehart@davehart.com>, said on Sat Apr 12, 2003 [04:58:13 PM]:
> } Marko Zec said:
> [...]
> } > If the disk would start spinning every now and then,
> } > the whole patch would then become pointless...
> }
> } As I feared.
> }
> } > [...] the fact that the modified fsync() just returns
> } > without doing anything useful doesn't mean the data will be
> } > lost - it will simply be delayed until the next coalesced
> } > updating occurs.
> }
> } Unless, of course, your system or power happens to fail.
> } Imagine you have a database program keeping track of banking
> } transactions. This program uses fsync() to ensure its
> } transaction logs are committed to reliable storage before
> } indicating the transaction is completed. Suppose the moment
> } after I withdraw $500 from an ATM, the operating system or
> } hardware fails at the bank.
>
> Right. So in such a situation, the admin for that system would not
> enable this optional behavior. There probably aren't too many cases
> where mission critical financial transaction systems run on a laptop
> on which the desire is maximal battery life, which is the case from
> which this whole patch/discussion derives. It's a conscious tradeoff.

Despite criticism of Dave's comments, I'd also be a little concerned
about what had been written to the drive prior to unexpected power loss.
I'm saying this as a person who uses a laptop as my primary desktop
machine.

Real world laptop scenario. I just finish downloading my E-Mail. I then
put this machine into suspend mode. Upon awakening, the system glitches
for some reason, forcing an unexpected system shutdown.

(Note: I am having this problem now with a Thinkpad T23)

Did my mail get written to the drive prior to suspending? I'll grant you
that this isn't in the same league as moving cash around, but to me that
mail is absolutely mission critical.

I'd love to get 10% more battery life from my laptop, but not at the
expense of having a file system that loses data on any unclean shutdown.
Be it moving $500, storing E-Mail, or just saving a document I had been
working on. With this patch in play, when I tell an app to save a
document, will it?

Later on,
--
"Outside of a dog, a book is man's best friend.
Inside of a dog, it's too dark to read."
- Groucho Marx

I am of the opinion that fsync should work. Applications like `vi' use
fsync to ensure that the write of the new file is on stable store before
removing the old copy. If that semantic is broken, it would be possible
to have neither the old nor the new copy of your file after a crash. I
do not consider that acceptable behavior.

Further, the fsync call is used to ensure that link/unlink/rename have
been completed. So more than just fsync is being affected by your
change.

Lastly, I often write out a file when I am about to suspend my laptop
(for low battery or other reasons) and I really want that file on the
disk now. I do not want to have to wait for it to decide at some future
time to spin up the disk.

I suggest that you make the disabling of fsync a separate option from
the rest of your change so that people can decide for themselves whether
they want partial savings with working semantics, or greater savings
with broken semantics. I am also intrigued by the changes proposed by
Ian Dowse that may better accomplish the same goals with less breakage.

	Kirk McKusick

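The save sequence described above (new contents on stable storage first, old copy replaced only afterwards) is commonly written as "write a temporary file, fsync() it, then rename() it over the original". The following is a minimal sketch of that general pattern, not vi's actual code; save_file_durably() and both path arguments are hypothetical.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Replace "path" with "len" bytes of "data" without ever leaving the
 * caller with neither the old nor the new copy.  Returns 0 on success
 * and -1 on failure, in which case the old copy is left untouched.
 */
int
save_file_durably(const char *path, const char *tmp_path,
    const char *data, size_t len)
{
	ssize_t n;
	int fd;

	assert(path != NULL && tmp_path != NULL && data != NULL);

	fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
	if (fd < 0)
		return (-1);

	while (len > 0) {
		n = write(fd, data, len);
		if (n <= 0)
			goto fail;
		data += n;
		len -= (size_t)n;
	}

	/* The new copy must be on stable storage before the old one goes. */
	if (fsync(fd) != 0)
		goto fail;
	if (close(fd) != 0) {
		unlink(tmp_path);
		return (-1);
	}

	/* Only now is it safe to replace the old copy. */
	if (rename(tmp_path, path) != 0) {
		unlink(tmp_path);
		return (-1);
	}
	return (0);

fail:
	close(fd);
	unlink(tmp_path);
	return (-1);
}
```

If fsync() returns immediately without flushing, the rename() can reach the disk before the file data does, which is the "neither old nor new copy" outcome after a crash.
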
On Sat, Apr 12, 2003, Marko Zec wrote:
> Here's a patch against the 4.8-RELEASE kernel that allows disk writes on
> softupdates-enabled filesystems to be delayed for (theoretically)
> arbitrarily long periods of time. The motivation for such an updating
> policy is, surprisingly, not purely suicidal: it can allow disks on
> laptops to spin down immediately after I/O operations and stay idle for
> longer periods of time, thus saving a considerable amount of battery
> power.

Very nice! I have been thinking about doing something like this for a
long time, but I never managed to find the time. Some comments:

- As others have mentioned, the fsync-disabling feature is questionable
  and ought to be separate. You can make it somewhat more useful by at
  least guaranteeing transactional consistency, i.e. by treating every
  fsync() call as a write barrier. You would need to ensure this for
  both data and metadata, which I expect would be devilishly hard to do
  within the softupdates framework. However, you might be able to
  accomplish it at the disk buffer level. For instance, you could have
  fsync() push the appropriate dirty buffers out to a separate cache,
  then commit the contents of the cache in the order of the fsyncs when
  the disk is next active.

- The fiddling with rushjob seems rather arbitrary. You can probably
  just let the existing code increment it as necessary and force a sync
  if the value gets too high.

- Patches against -CURRENT would be nice. (Sorry, that will be a
  doozy.)

- It looks like you have a few separate changes in there, such as

  + TUNABLE_INT_FETCH("kern.maxvnodes", &desiredvnodes);

  and

  - long starttime;
  + time_t starttime;

Michael Collette wrote:

> Real world laptop scenario. I just finish downloading my E-Mail. I then take
> and put this machine into a suspend mode. Upon awakening the system glitches
> for some reason forcing an unexpected system shutdown.
>
> (Note: I am having this problem now with a Thinkpad T23)
>
> Did my mail get written to the drive prior to suspending?

It should get synced, since the APM daemon by default calls a global
sync (not fsync!) a couple of times before suspending, to ensure FS
consistency in case the system never resumes successfully.

Marko

David Schultz wrote:

> For instance, you could
> have fsync() push the appropriate dirty buffers out to a separate
> cache, then commit the contents of the cache in the order of the
> fsyncs when the disk is next active.

Huh... such a concept would still break fsync() semantics. Note that the
original patch also ensures dirty buffers get flushed if / when the disk
spins up, even before the delay timer expires.

> - The fiddling with rushjob seems rather arbitrary. You can probably
> just let the existing code increment it as necessary and force a sync
> if the value gets too high.

If rushjob were not used for forcing prompt syncing, the original code
could not guarantee that the sync occurs immediately. Instead, the
syncing could be further delayed for up to 30 seconds, which is not
desirable if our major design goal is to do as much disk I/O as possible
in a small time interval and leave the disk idle otherwise.

Marko

Chris Dillon <cdillon@wolves.k12.mo.us> wrote:
> On Wed, 16 Apr 2003, Terry Lambert wrote:
> > [Flash memory]
> > The life expectancy of these devices is really, really
> > underestimated. In practice, I've seen two million write cycles
> > from some of these in lab machines which get rewritten pretty often.
>
> I realize they have what looks like a really big number of writes on a
> human scale, but to a computer which does things methodically day in
> and day out without stopping, those writes can add up relatively
> quickly. Even with a life of two million write cycles, the
> "occasional" 30-second round of updates that happen to write the same
> bits over and over

The controller in things such as CompactFlash cards will _not_ write the
same physical bits over and over. Those beasts are clever enough to
remap logical blocks to different physical blocks upon each write
access, so that the written-to flash cells are evenly distributed over
the whole physical range.

You can probably update the atime of files 100 million times and more
without any problems, because all of those 100 million writes will end
up on all different flash blocks. Of course, that's provided that there
are also areas in your filesystem which are less frequently written to,
but that's usually the case (how often do you rewrite binaries and
libs?).

So I agree with Terry that the life expectancy of flash devices is
really underestimated.

Regards
Oliver

--
Oliver Fromme, secnetix GmbH & Co KG, Oettingenstr. 2, 80538 München
Any opinions expressed in this message may be personal to the author
and may not necessarily reflect the opinions of secnetix in any way.

"If you do things right, people won't be sure you've done anything
at all." -- God in Futurama season 4 episode 8