I am trying to do some video capture and have been losing occasional fields. After adding some debugging code to the kernel, I've found that the problem is excessive latency between the hardware interrupt and the driver interrupt - the hardware can handle about 1.5 msec of latency. Most of the time, the latency is less than 20 µsec but I'm seeing up to 8 msec occasionally. In virtually all cases where there is a problem, curproc at the time of the hardware interrupt is syncer. (I had one case where there was another process, but it had died by the time I went looking for it). The interrupt is marked INTR_TYPE_AV so it shouldn't be being delayed by other threads. (I can't easily make it INTR_FAST because it needs to call psignal(9)).

The system is an Athlon XP-1800 with 512MB RAM and 2 ATA-100 disks running 5.3-RELEASE-p5. It has a couple of NFS exports but doesn't import anything. There's nothing much running apart from ffmpeg capturing the video and a process capturing my kernel debugging output. Apart from 4 files being sequentially written as part of my capture and cron regularly waking up to go back to sleep, there shouldn't be any filesystem activity. I tried copying a couple of large files and touching lots of files but that didn't cause any problems.

Can anyone suggest why syncer would be occasionally running for up to 8 msec at a time? Overall, it's not clocking up a great deal of CPU time, it just seems to grab it in large chunks.

Peter

On Sat, 26 Feb 2005, Peter Jeremy wrote:

> I am trying to do some video capture and have been losing occasional
> fields. After adding some debugging code to the kernel, I've found that
> the problem is excessive latency between the hardware interrupt and the
> driver interrupt - the hardware can handle about 1.5 msec of latency.
> Most of the time, the latency is less than 20 µsec but I'm seeing up
> to 8 msec occasionally. In virtually all cases where there is a
> problem, curproc at the time of the hardware interrupt is syncer. (I
> had one case where there was another process, but it had died by the
> time I went looking for it). The interrupt is marked INTR_TYPE_AV so it
> shouldn't be being delayed by other threads. (I can't easily make it
> INTR_FAST because it needs to call psignal(9)).
>
> The system is an Athlon XP-1800 with 512MB RAM and 2 ATA-100 disks
> running 5.3-RELEASE-p5. It has a couple of NFS exports but doesn't
> import anything. There's nothing much running apart from ffmpeg
> capturing the video and a process capturing my kernel debugging output.
> Apart from 4 files being sequentially written as part of my capture and
> cron regularly waking up to go back to sleep, there shouldn't be any
> filesystem activity. I tried copying a couple of large files and
> touching lots of files but that didn't cause any problems.
>
> Can anyone suggest why syncer would be occasionally running for up to 8
> msec at a time? Overall, it's not clocking up a great deal of CPU time,
> it just seems to grab it in large chunks.

I don't have too much insight into the syncer (I've CC'd phk to victimize him with more e-mail, as this is an area he takes a great interest in). A couple of questions:

(1) Have you tried turning on options PREEMPTION?
(2) Does the driver code run with Giant at all?
(3) Are you relying on callouts or taskqueues at all for processing?

With PREEMPTION enabled, all driver code running without Giant (and not depending on threads that also acquire Giant), and all related workers running with adequate priority, your driver threads should preempt the syncer. Your user process will have to wait for the syncer to finish running, though. So using preemption and Giant-free code, we should be able to get your driver code in the kernel to run on a short deadline, but getting the syncer to behave better will be necessary to get the user code running on a short deadline.

Robert N M Watson

On 26 Feb, Peter Jeremy wrote:

> I am trying to do some video capture and have been losing occasional
> fields. After adding some debugging code to the kernel, I've found
> that the problem is excessive latency between the hardware interrupt
> and the driver interrupt - the hardware can handle about 1.5 msec of
> latency. Most of the time, the latency is less than 20 µsec but
> I'm seeing up to 8 msec occasionally. In virtually all cases where
> there is a problem, curproc at the time of the hardware interrupt is
> syncer. (I had one case where there was another process, but it had
> died by the time I went looking for it). The interrupt is marked
> INTR_TYPE_AV so it shouldn't be being delayed by other threads. (I
> can't easily make it INTR_FAST because it needs to call psignal(9)).
>
> The system is an Athlon XP-1800 with 512MB RAM and 2 ATA-100 disks
> running 5.3-RELEASE-p5. It has a couple of NFS exports but doesn't
> import anything. There's nothing much running apart from ffmpeg
> capturing the video and a process capturing my kernel debugging
> output. Apart from 4 files being sequentially written as part of my
> capture and cron regularly waking up to go back to sleep, there
> shouldn't be any filesystem activity. I tried copying a couple of
> large files and touching lots of files but that didn't cause any
> problems.
>
> Can anyone suggest why syncer would be occasionally running for
> up to 8 msec at a time? Overall, it's not clocking up a great
> deal of CPU time, it just seems to grab it in large chunks.

You're probably running into the inode timestamp update loop. Each mounted file system has a special "syncer vnode" that remains permanently on the syncer worklist. The syncer will call VOP_FSYNC() on each of these vnodes as it encounters them in the worklist, which it traverses every 32 seconds. This is done so that things like the superblock and other file system metadata are periodically written to disk.

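To make the worklist behaviour above concrete, here is a minimal user-space sketch of a 32-slot wheel on which a permanent item is re-queued a full rotation ahead each time it is processed. It is a toy model, not the kernel's code: `struct wheel`, `struct work_item`, `wheel_add()` and `wheel_tick()` are hypothetical names, one slot per second is an assumption, and the counter increment merely stands in for the VOP_FSYNC() call.

```c
#include <assert.h>
#include <stddef.h>

#define WHEEL_SLOTS 32  /* assumed: one slot per second of the 32-second traversal */

/* Hypothetical work item; stands in for a vnode on the syncer worklist. */
struct work_item {
    int id;
    int permanent;              /* nonzero for a per-mount "syncer vnode" */
    struct work_item *next;
};

struct wheel {
    struct work_item *slot[WHEEL_SLOTS];
    int now;                    /* slot that the next tick will process */
};

/* Queue an item to be processed `delay` ticks from the current slot. */
static void wheel_add(struct wheel *w, struct work_item *it, int delay)
{
    assert(delay > 0 && delay <= WHEEL_SLOTS);
    int idx = (w->now + delay) % WHEEL_SLOTS;
    it->next = w->slot[idx];
    w->slot[idx] = it;
}

/* Process one slot and return how many items were handled. */
static int wheel_tick(struct wheel *w)
{
    struct work_item *it = w->slot[w->now];
    int done = 0;

    w->slot[w->now] = NULL;     /* detach the list before re-queueing */
    while (it != NULL) {
        struct work_item *next = it->next;

        done++;                 /* stand-in for the VOP_FSYNC() call */
        if (it->permanent)
            wheel_add(w, it, WHEEL_SLOTS);  /* back in one full rotation */
        it = next;
    }
    w->now = (w->now + 1) % WHEEL_SLOTS;
    return done;
}
```

In this model an ordinary item is handled once and leaves the wheel, while a permanent item comes around again exactly `WHEEL_SLOTS` ticks later, which is why any expensive work attached to it recurs on a fixed period.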
In the case of ufs, the code that does this is in ffs_sync(). I suspect that the problem that you are running into is that ffs_sync() (and ext2_sync()) also handle inode timestamp updates. Each time they are called, they walk the list of vnodes for the file system and call VOP_FSYNC() for any that have unwritten timestamp updates. As the comment in the loop in ffs_sync() says:

    /*
     * Depend on the mntvnode_slock to keep things stable enough
     * for a quick test.  Since there might be hundreds of
     * thousands of vnodes, we cannot afford even a subroutine
     * call unless there's a good chance that we have work to do.
     */

I noticed a related performance problem a while back. If you are doing something that writes to a lot of files, like untarring the ports tree, there will be large bursts of disk activity every 30 seconds and the system gets very sluggish. Soft updates and the new syncer were supposed to eliminate this behaviour by spreading out the write activity over time, but this loop in ffs_sync() will cause a burst of writes every time it is called. This can also be observed by watching the length of the syncer worklist. When untarring the ports tree, the length of the worklist should increase to a certain, high level, and stabilize. Instead it ramps up over about thirty seconds and then takes a dramatic drop.

In the initial softupdates implementation, some of the work inside the loop was skipped in the MNT_LAZY case, but it was found that timestamp updates were being deferred for too long a time. I talked to Kirk about entirely bypassing this loop in the MNT_LAZY case and moving the timestamp updates to the syncer worklist. Kirk sounded positive on the idea, but I never found the time to work on the implementation, and phk's conversion of the syncer to use bufobjs instead of vnodes complicated things (what do you do about fifos and sockets?).
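The scan pattern that the quoted ffs_sync() comment describes can be sketched in user space as follows. This is only an illustration, assuming a singly linked list of vnodes: `struct toy_vnode`, the `TOY_*` flag bits, `toy_fsync()` and `toy_sync()` are all hypothetical stand-ins, not the kernel's structures or flag names. The point is that every vnode is visited and tested inline on each pass, and the subroutine call is made only for those with pending updates.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag bits; the kernel uses its own inode flag names. */
#define TOY_ACCESS 0x1u
#define TOY_CHANGE 0x2u
#define TOY_UPDATE 0x4u
#define TOY_DIRTY  (TOY_ACCESS | TOY_CHANGE | TOY_UPDATE)

struct toy_vnode {
    unsigned flags;             /* pending timestamp updates, if any */
    struct toy_vnode *next;
};

struct scan_stats {
    size_t visited;             /* vnodes walked on this pass */
    size_t flushed;             /* vnodes that needed the expensive call */
};

/* Stand-in for VOP_FSYNC(): pretend the timestamps were written out. */
static void toy_fsync(struct toy_vnode *vp)
{
    vp->flags &= ~TOY_DIRTY;
    assert((vp->flags & TOY_DIRTY) == 0);
}

/* One pass over the per-mount vnode list. */
static struct scan_stats toy_sync(struct toy_vnode *head)
{
    struct scan_stats st = {0, 0};

    for (struct toy_vnode *vp = head; vp != NULL; vp = vp->next) {
        st.visited++;
        /* Quick inline test: skip the call when there is no work. */
        if ((vp->flags & TOY_DIRTY) == 0)
            continue;
        toy_fsync(vp);
        st.flushed++;
    }
    return st;
}
```

Even when `flushed` is zero, `visited` still equals the total number of vnodes on the list, so in this model the cost of a pass grows with the size of the list rather than with the amount of dirty data. All vnodes that became dirty since the previous pass are also flushed in the same pass, which matches the burst behaviour described above.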