I have a system running the Debian package of 3.11.5 with an AMD Opteron 1212 processor (2*64bit cores), 8G of RAM, and an Intel 120G SSD for the root and home subvols. It has a RAID-1 array of 2*3TB disks for bulk storage (movies etc) but that probably isn't relevant to this problem.

On the root filesystem I have cron jobs making daily snapshots of / and /home and additional snapshots of /home every 15 minutes. At midnight a cron job removes older snapshots. For the last 8 days the system has been reliably hanging at about 5 minutes after midnight, and the subvol removal cron job is the only thing that runs at that time.

So it seems clear to me that on my system 3.11.5 crashes a few minutes after ~98 subvols are removed at the same time.

Last night I watched it happen and deleted a few dozen extra subvols to test whether it would repeat. That wasn't such a good idea: I rebooted the system many times before giving up and booting 3.10.11, which is now working correctly.

When running 3.11.5 I was seeing kernel log messages such as the following shortly after boot. After that it got into a state where an ssh session didn't work and the X login prompt didn't even flash its cursor. In that state it could still forward packets (the system in question is an ethernet bridge which I use to connect my workstation to the Internet) but couldn't do much else. The NFS server processes locked up and sshd wouldn't complete the login process for new connection attempts.

[ 68.056003] BUG: soft lockup - CPU#0 stuck for 22s! [btrfs-cleaner:270]
[ 68.144004] BUG: soft lockup - CPU#1 stuck for 22s! [btrfs-transacti:271]

Prior to the lockup those two kernel processes had used the most CPU time. I'm not sure whether they were in some sort of CPU loop before the lockup or whether they were just reading a lot of data from a fast SSD and acting correctly.

As an aside, I ordered a replacement server last week when I wasn't sure if this was a hardware or a software problem. This will allow me to test some things in more detail on the old server after the new one is running. However, I don't own a spare SSD, so if it's an SSD-specific issue then I have limited ability to test.

--
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
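The midnight cleanup described in the report removes roughly 98 subvolumes in one go. Below is a minimal Python sketch of a gentler variant that deletes expired snapshots in small batches with a pause in between. It is not the cron job from the report: the snapshot list, retention period, batch size, and pause length are all hypothetical, and whether spreading the deletions out actually avoids the soft lockup on 3.11.5 is an untested assumption. The only external command used is `btrfs subvolume delete <path>`.

```python
import subprocess
import time
from datetime import datetime, timedelta


def select_expired(snapshots, now, keep):
    """Return paths of snapshots older than `keep`, oldest first.

    `snapshots` is a list of (path, creation_time) pairs. How those pairs
    are obtained (naming scheme, directory layout) is left to the caller.
    """
    expired = [(created, path) for path, created in snapshots
               if now - created > keep]
    return [path for created, path in sorted(expired)]


def batches(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def delete_in_batches(paths, size=10, pause_seconds=60, dry_run=True):
    """Delete subvolumes a few at a time, pausing after each batch.

    The batch size and pause are guesses, not tuned values. With
    dry_run=True the commands are only printed.
    """
    for chunk in batches(paths, size):
        for path in chunk:
            cmd = ["btrfs", "subvolume", "delete", path]
            if dry_run:
                print(" ".join(cmd))
            else:
                subprocess.run(cmd, check=True)
        if not dry_run:
            time.sleep(pause_seconds)
```

Calling `delete_in_batches(select_expired(snaps, datetime.now(), timedelta(days=7)))` with the default `dry_run=True` prints the commands that would be run, which is a safe way to check the selection before letting it touch a real filesystem.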
Russell Coker posted on Wed, 06 Nov 2013 12:52:38 +1100 as excerpted:

> I have a system running the Debian package of 3.11.5 with an AMD Opteron
> 1212 processor (2*64bit cores), 8G of RAM, and an Intel 120G SSD for the
> root and home subvols. It has a RAID-1 array of 2*3TB disks for bulk
> storage (movies etc) but that probably isn't relevant to this problem.
>
> On the root filesystem I have cron jobs making daily snapshots of / and
> /home and additional snapshots of /home every 15 minutes. At midnight a
> cron job removes older snapshots. For the last 8 days the system has
> been reliably hanging at about 5 minutes after midnight and the subvol
> removal cron job is the only thing that has happened then.

I believe there's a btrfs-critical stable-series patch in 3.11.6 that you're probably missing with 3.11.5. (There were unfortunately some crossed signals and the patch was skipped for a couple of weeks after it should have gone in, but it's in now.)

Yes... Just checked the 3.11.6 changelog:

Josef Bacik (1):
      Btrfs: use right root when checking for hash collision

Note that there's another critical patch in flight, fixing a bug triggered by btrfs balance on filesystems with pre-allocated files (like systemd does with its journal and various torrent clients do with their downloads). But this one is currently being held up because stable rules require it to be in current mainline first. 3.12 is out, but the two-week 3.13 commit window that would normally be open now is suspended for a week, as Linus is traveling without a reliable net connection. So the patch can't hit mainline, and thus won't hit stable unless an exception is made, until after Linus' vacation, when the commit window opens and the patch is accepted.

See the previous discussion on this list for details. Until you get that patch, simply don't do any balances if you're running systemd or any other app that pre-allocates files, such as a torrent client.
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
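The advice above boils down to "within the 3.11 stable series, 3.11.6 or later carries the hash-collision patch". A small Python sketch of that check follows. The function names are made up for illustration, and the caller is assumed to supply an upstream version string such as `3.11.5`; how to obtain it is distribution-specific. For any series other than 3.11 the function returns `None`, since the thread says nothing about which other releases carry the patch.

```python
def parse_version(text):
    """Turn '3.11.5' (or '3.11.5-1') into a tuple of ints."""
    numeric = text.split("-", 1)[0]  # drop any packaging suffix
    return tuple(int(part) for part in numeric.split("."))


def has_3_11_6_fix(version_text):
    """True/False for 3.11.x kernels, None when the series is not 3.11.

    Based only on the 3.11.6 changelog entry quoted in this thread.
    """
    major, minor, *rest = parse_version(version_text)
    if (major, minor) != (3, 11):
        return None  # the thread gives no information for other series
    patch = rest[0] if rest else 0
    return patch >= 6
```

For the kernels mentioned in this thread, `has_3_11_6_fix("3.11.5")` is `False` and `has_3_11_6_fix("3.10.11")` is `None`.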
On Wed, 6 Nov 2013 12:37:32 PM Duncan wrote:

> Note that there's another critical patch in-flight, patching a bug
> triggered by btrfs balance on filesystems with pre-allocated files (like
> systemd does with its journal and various torrent clients do with their
> downloads). But this one is currently being held up because stable rules
> require it to be in current mainline first, and 3.12 is out, but the
> two-week 3.13 commit window that would normally be open now is suspended
> for a week, as Linus is traveling without a reliable net connection. So
> the patch can't hit mainline, and thus won't hit stable unless an
> exception is made, until after Linus' vacation, when the commit window
> opens and the patch is accepted.

Greg K-H has said he'll accept stable patches that haven't hit the mainline during this period.

cheers!
Chris
--
 Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC

This email may come with a PGP signature as a file. Do not panic.
For more info see: http://en.wikipedia.org/wiki/OpenPGP