On 2009-Nov-30 19:13:30 +1100, Peter Jeremy <peter@server.vk2pj.dyndns.org> wrote:
>On 2009-Nov-29 08:56:55 +0100, Thomas Backman <serenity@exscape.org> wrote:
>>
>>On Nov 28, 2009, at 10:22 PM, Peter Jeremy wrote:
>>
>>> My main server is running 8.0/amd64 from between RC1 and RC2 and I've
>>> recently had a couple of long-duration hangs on it during which time
>>> processes doing I/O will stop responding.
...
>It actually "hung" again just after I sent the original mail. This
>time I managed to get console access and could check the kernel state.
>This showed that a number of processes were blocked on ZFS locks.
>The most commonly reported state was 'tx->tx_quiesce_done_cv)'.

I've upgraded to 8-STABLE from 30-Nov and the problem is still present,
even after disabling the boinc processes.

This seems to leave race conditions inside ZFS as the only option.

Has anyone else seen anything like this?

-- 
Peter Jeremy
On Sat, Dec 5, 2009 at 3:48 PM, Peter Jeremy <peterjeremy@acm.org> wrote:
> On 2009-Nov-30 19:13:30 +1100, Peter Jeremy <peter@server.vk2pj.dyndns.org> wrote:
> >On 2009-Nov-29 08:56:55 +0100, Thomas Backman <serenity@exscape.org> wrote:
> >>
> >>On Nov 28, 2009, at 10:22 PM, Peter Jeremy wrote:
> >>
> >>> My main server is running 8.0/amd64 from between RC1 and RC2 and I've
> >>> recently had a couple of long-duration hangs on it during which time
> >>> processes doing I/O will stop responding.
> ...
> >It actually "hung" again just after I sent the original mail. This
> >time I managed to get console access and could check the kernel state.
> >This showed that a number of processes were blocked on ZFS locks.
> >The most commonly reported state was 'tx->tx_quiesce_done_cv)'.
>
> I've upgraded to 8-STABLE from 30-Nov and the problem is still present,
> even after disabling the boinc processes.
>
> This seems to leave race conditions inside ZFS as the only option.
>
> Has anyone else seen anything like this?

I have a machine running 7.2 that does the same thing if I don't disable
ZIL and prefetch (probably just one of them triggers the hang, just haven't
had time to see which one). I'll be upgrading it to 8-Stable in the next
week or so and I'll see if the problem persists.

One data point that may or may not be relevant is that the process that
always triggers the hangs is istgt (iSCSI target from ports).

Elliot
Peter Jeremy wrote:
> On 2009-Nov-30 19:13:30 +1100, Peter Jeremy <peter@server.vk2pj.dyndns.org> wrote:
>
>> On 2009-Nov-29 08:56:55 +0100, Thomas Backman <serenity@exscape.org> wrote:
>>
>>> On Nov 28, 2009, at 10:22 PM, Peter Jeremy wrote:
>>>
>>>> My main server is running 8.0/amd64 from between RC1 and RC2 and I've
>>>> recently had a couple of long-duration hangs on it during which time
>>>> processes doing I/O will stop responding.
>>>>
> ...
>
>> It actually "hung" again just after I sent the original mail. This
>> time I managed to get console access and could check the kernel state.
>> This showed that a number of processes were blocked on ZFS locks.
>> The most commonly reported state was 'tx->tx_quiesce_done_cv)'.
>>
>
> I've upgraded to 8-STABLE from 30-Nov and the problem is still present,
> even after disabling the boinc processes.
>
> This seems to leave race conditions inside ZFS as the only option.
>
> Has anyone else seen anything like this?

I have had the same issue since I upgraded to 8.0-RELEASE. It happens
during heavy I/O such as a buildworld. Since I run top in an ssh session,
I can say that just before the hang the [zfskern] process shows high CPU
usage and global system usage is 99%.

Sometimes I can get back to normal by interrupting the build with Ctrl-C;
sometimes I can't. If enabled, the watchdog kicks in and the machine
reboots (otherwise, I just lose ssh control over it).

The machine is low on memory (512MB), with the same tuning I used in 7.2
(ARC reduced to 60M, device cache to 5M, which gave me a stable machine).

I have enabled crashdumps. I can investigate if somebody gives me pointers
on where to look.

Arnaud
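
As one possible starting point for the "where to look" question, the check
Peter describes (seeing which wait state most blocked processes share) can
be approximated from a saved process listing. The following is only a
minimal sketch, not anything posted in this thread: it assumes the listing
was captured with `ps -axo pid,state,mwchan,comm`, the function name
tally_wait_channels is made up for illustration, and the sample rows are
hypothetical, not taken from any of the affected machines. Note that ps
truncates wait-channel names, so they will not match the full
'tx->tx_quiesce_done_cv' string.

```python
from collections import Counter


def tally_wait_channels(ps_output: str) -> Counter:
    """Count processes per wait channel in a saved ps listing.

    Assumes four whitespace-separated columns: PID, STAT, MWCHAN, COMMAND,
    as produced by `ps -axo pid,state,mwchan,comm` (an assumption, see above).
    """
    counts = Counter()
    lines = ps_output.strip().splitlines()
    for line in lines[1:]:  # skip the header row
        fields = line.split(None, 3)
        if len(fields) < 4:
            continue  # ignore malformed or truncated rows
        _pid, _state, wchan, _comm = fields
        if wchan != "-":  # "-" means the process is not waiting on anything
            counts[wchan] += 1
    return counts


# Hypothetical sample listing, for illustration only.
SAMPLE = """\
  PID STAT MWCHAN   COMMAND
  101 D    tx->tx_q rsync
  102 D    tx->tx_q cp
  103 S    select   sshd
  104 R    -        top
  105 D    zio->io_ find
"""

if __name__ == "__main__":
    # Print wait channels from most to least common.
    for wchan, n in tally_wait_channels(SAMPLE).most_common():
        print(f"{n:3d} {wchan}")
```

If one ZFS-related wait channel dominates the tally while the machine is
unresponsive, that would be consistent with what Peter saw on the console,
although it does not by itself identify the cause.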