Russell Coker
2014-Jun-11 02:05 UTC
3.13.6-1 user-space freeze on subvol removal - fixed in 3.14.4
Last night I discovered a bug in my subvol removal script on one of my servers. I fixed the bug and ran the script to delete ~1500 subvols, and the system locked up. It would respond to pings and accept TCP connections but nothing else. Existing ssh sessions didn't respond, and connections to port 22 got a TCP connection but not even the start of the ssh handshake (i.e. sshd was hung).

Today I visited the server and connected a keyboard, but I couldn't get a keyboard response to even unblank the screen (it's at a virtual console and X isn't installed). I rebooted the system and found that the bug fix to the subvol removal script was lost; the old version of the file was in place. So filesystem changes made 10+ seconds before the subvol removal started were discarded.

INFO: rcu_sched self detected stall on CPU { 1} (t=5250 jiffies g=7989 c=7988 q=3)
BUG: soft lockup - CPU#0 stuck for 23s! [sync:2424]
BUG: soft lockup - CPU#1 stuck for 23s! [btrfs-transacti:287]
BUG: soft lockup - CPU#0 stuck for 23s! [sync:2424]
BUG: soft lockup - CPU#1 stuck for 23s! [btrfs-transacti:287]
INFO: rcu_sched self detected stall on CPU { 1} (t=21003 jiffies g=7989 c=7988 q=3)
BUG: soft lockup - CPU#0 stuck for 23s! [sync:2424]
BUG: soft lockup - CPU#1 stuck for 23s! [btrfs-transacti:287]
BUG: soft lockup - CPU#0 stuck for 23s! [sync:2424]
BUG: soft lockup - CPU#1 stuck for 23s! [btrfs-transacti:287]

After booting up I ran the new script, told it to delete only ~440 subvols, and got the above messages on the console. Those messages keep repeating, with the only differences being the value of the t= parameter and the number of seconds (which varies between 22 and 23). At that time the keyboard didn't get any response from the system and sshd didn't even start its protocol (user-space seems dead). I had to do a hardware reset, as CTRL-ALT-DEL and briefly pressing the power button had no effect.
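For anyone who wants to tally these console messages from a saved log, here is a minimal Python sketch. It assumes the messages have exactly the two shapes quoted above (other kernel versions may format them differently); the function name summarize and the regular expressions are illustrative only and not part of any kernel or btrfs tool.

```python
import re
from collections import Counter

# These patterns match only the two message shapes quoted in this post.
SOFT_LOCKUP = re.compile(
    r"BUG: soft lockup - CPU#(\d+) stuck for (\d+)s! \[([^\]:]+):(\d+)\]"
)
RCU_STALL = re.compile(
    r"INFO: rcu_sched self detected stall on CPU \{ ?(\d+)\}\s+\(t=(\d+) jiffies"
)


def summarize(log_text):
    """Count soft lockups per (cpu, task, pid) and list RCU stall t= values."""
    lockups = Counter()
    for cpu, _secs, task, pid in SOFT_LOCKUP.findall(log_text):
        lockups[(int(cpu), task, int(pid))] += 1
    stalls = [(int(cpu), int(t)) for cpu, t in RCU_STALL.findall(log_text)]
    return lockups, stalls


# A few lines copied from the console output above.
SAMPLE = """\
INFO: rcu_sched self detected stall on CPU { 1} (t=5250 jiffies g=7989 c=7988 q=3)
BUG: soft lockup - CPU#0 stuck for 23s! [sync:2424]
BUG: soft lockup - CPU#1 stuck for 23s! [btrfs-transacti:287]
INFO: rcu_sched self detected stall on CPU { 1} (t=21003 jiffies g=7989 c=7988 q=3)
"""

if __name__ == "__main__":
    lockups, stalls = summarize(SAMPLE)
    for (cpu, task, pid), count in sorted(lockups.items()):
        print(f"CPU#{cpu} {task}[{pid}]: {count} soft lockup(s)")
    for cpu, t in stalls:
        print(f"CPU {cpu}: RCU stall at t={t} jiffies")
```

Grouping by CPU and task makes it easy to see that the same two tasks (sync and btrfs-transacti) are stuck every time, while only the t= value advances.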
After that I installed kernel 3.14.4 and deleted 442 subvols, then soon afterwards another 1104, without any problems.

I've noticed before that newer kernels have generally been fixing the various crashes related to snapshot creation and removal, but this is the first time I've done a clean, repeatable test showing 3.14 working where 3.13 failed. I had been hesitant to upgrade to 3.14 because I've seen it fail horribly with Xen, but in this case it worked well with Xen.

-- 
My Main Blog         http://etbe.coker.com.au/
My Documents Blog    http://doc.coker.com.au/