Felix Palmen
2021-Apr-12 09:44 UTC
Frequent disk I/O stalls while building (poudriere), processes in "zfs tear" state
Hello all,

since following the releng/13.0 branch, I experience stalled disk I/O quite often (ca. once per minute) while building packages with poudriere. What I can see in this case is the CPU going almost idle, and several processes shown in `top` in state "zfs te" (procstat shows "zfs tear" for that). For up to several seconds, no disk I/O completes (even starting a new process is impossible), then it recovers.

Only twice have I seen the system go into a deadlock instead, printing messages similar to this on the serial console:

swap_pager: indefinite wait buffer ...

I have had this behavior since -RC3 (I followed releng/13.0 up to -RELEASE). Before that, I had the vnlru-related problem that was fixed with faa41af1fed350327cc542cb240ca2c6e1e8ba0c.

Some details:

* CPU: Intel(R) Xeon(R) CPU E3-1240L v5 @ 2.10GHz
* RAM: 64GB (ECC)
* Four HDDs (Seagate NAS models), 4TB each
* Swap 16GB, striped over the 4 disks
* Pool: 12TB raid-z on GELI-encrypted partitions. NOT upgraded yet, so I have a way back to 12.2.
* Two bhyve VMs running with 1GB and 8GB RAM, both wired
* Several jails running services like samba, an MTA, nginx...
* Several NFS shares mounted by other machines
* Poudriere running on idprio 22 with 8 parallel build jobs

Reducing the parallel jobs in poudriere also reduces the frequency of the problem, but it doesn't seem to go away completely. Also, I have the impression that these stalls are more likely when a lot of compilation jobs can be satisfied from ccache.

Thanks for any ideas and insight (e.g. what this "zfs tear" status means).

Best regards,
Felix Palmen

-- 
 Dipl.-Inform. Felix Palmen <felix at palmen-it.de>
 {web} http://palmen-it.de
 {jabber} [see email]
 {pgp public key} http://palmen-it.de/pub.txt
 {pgp fingerprint} A891 3D55 5F2E 3A74 3965 B997 3EF2 8B0A BC02 DA2A
Felix Palmen
2021-Apr-15 16:29 UTC
Frequent disk I/O stalls while building (poudriere), processes in "zfs tear" state
After more experimentation, I finally found what's causing these problems for me on 13:

* Felix Palmen <felix at palmen-it.de> [20210412 11:44]:
> * Poudriere running on idprio 22 with 8 parallel build jobs

Running poudriere with normal priority works perfectly fine.

Now, I had poudriere running on idprio because there are several other services on that machine that shouldn't be slowed down by a heavy build, and I still want to use all the CPU resources available for building. Right now, I'm running a test with idprio 0 instead, which still seems to have the desired effect, and so far I haven't had any of these stalls. If this persists, the problem is solved for me!

I'd still be curious about what might be the cause, and what this state "zfs tear" actually means. But that's more of an "academic interest" now.
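
For anyone who wants to watch for the same symptom, here is a minimal sketch (not something used in this thread) that counts processes per wait channel from `ps` output, so a pile-up in one state such as "zfs tear" stands out. The function names `count_wait_channels` and `sample_once` are made up for illustration. It assumes the columns are requested in the order `pid,comm,wchan`, that command names contain no spaces, and that `ps` prints the wait channel in a form comparable to what `procstat` shows, which may not hold on every system.

```python
import subprocess
from collections import Counter


def count_wait_channels(ps_output: str) -> Counter:
    """Count processes per wait channel.

    Expects text shaped like the output of `ps -axo pid,comm,wchan`:
    one header row, then one row per process. The wait channel is the
    last column because it can itself contain a space ("zfs tear").
    Assumption: command names contain no spaces.
    """
    counts = Counter()
    for line in ps_output.splitlines()[1:]:  # skip the header row
        fields = line.strip().split(None, 2)  # pid, command, rest = wchan
        if len(fields) < 3:
            continue  # no wait channel printed for this row
        wchan = fields[2].strip()
        if wchan and wchan != "-":  # "-" marks a process that is not sleeping
            counts[wchan] += 1
    return counts


def sample_once() -> Counter:
    """Run ps once and return the wait-channel counts (hypothetical helper)."""
    out = subprocess.run(
        ["ps", "-axo", "pid,comm,wchan"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return count_wait_channels(out)
```

Calling `sample_once()` about once per second during a build and logging the most common entries would show whether many processes are parked in the same wait channel at the moment a stall occurs.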