Nov 14 00:00:01 workstation /USR/SBIN/CRON[30993]: (root) CMD (/usr/local/sbin/btrfs-make-snapshot minutes /home >> /var/log/snapshot.log) Nov 14 00:00:01 workstation /USR/SBIN/CRON[30994]: (root) CMD (/usr/local/sbin/btrfs-make-snapshot days /home >> /var/log/snapshot.log) Nov 14 00:00:01 workstation /USR/SBIN/CRON[30995]: (root) CMD (/usr/local/sbin/btrfs-remove-snapshots 100 100 /home/ / >> /var/log/snapshot.log) Below is a grep for NMI in my kernel message log. I''m running Debian kernel 3.11.6-2 on a quad-core AMD64 system. Above is the relevant section of the cron log. It appears that running two scripts that create snapshots at the same time as a script that removes maybe 97 snapshots causes an NMI. I''ve attached the kernel message log from today, I can give you logs from other days if they are useful. Nov 11 00:00:58 workstation kernel: [896466.616001] sending NMI to all CPUs: Nov 11 00:00:58 workstation kernel: [896466.616016] NMI backtrace for cpu 0 Nov 11 00:00:58 workstation kernel: [896466.616001] NMI backtrace for cpu 2 Nov 11 00:00:58 workstation kernel: [896466.616479] NMI backtrace for cpu 3 Nov 11 00:00:58 workstation kernel: [896466.616637] NMI backtrace for cpu 1 Nov 12 00:00:58 workstation kernel: [982863.428000] sending NMI to all CPUs: Nov 12 00:00:58 workstation kernel: [982863.428000] NMI backtrace for cpu 1 Nov 12 00:00:58 workstation kernel: [982863.428340] NMI backtrace for cpu 0 Nov 12 00:00:58 workstation kernel: [982863.428647] NMI backtrace for cpu 2 Nov 12 00:00:58 workstation kernel: [982863.428725] NMI backtrace for cpu 3 Nov 13 00:00:57 workstation kernel: [1069265.036001] sending NMI to all CPUs: Nov 13 00:00:57 workstation kernel: [1069265.036018] NMI backtrace for cpu 1 Nov 13 00:00:57 workstation kernel: [1069265.036341] NMI backtrace for cpu 0 Nov 13 00:00:57 workstation kernel: [1069265.036001] NMI backtrace for cpu 2 Nov 13 00:00:57 workstation kernel: [1069265.036723] NMI backtrace for cpu 3 Nov 13 22:21:04 workstation kernel: [1149672.214144] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 6.060 msecs Nov 14 00:00:56 workstation kernel: [1155663.796001] sending NMI to all CPUs: Nov 14 00:00:56 workstation kernel: [1155663.796001] NMI backtrace for cpu 2 Nov 14 00:00:56 workstation kernel: [1155663.796269] NMI backtrace for cpu 3 Nov 14 00:00:56 workstation kernel: [1155663.796007] NMI backtrace for cpu 1 Nov 14 00:00:56 workstation kernel: [1155663.796581] NMI backtrace for cpu 0 -- My Main Blog http://etbe.coker.com.au/ My Documents Blog http://doc.coker.com.au/
Russell Coker posted on Thu, 14 Nov 2013 01:45:29 +1100 as excerpted:> It appears that running two scripts that create snapshots at the same > time as a script that removes maybe 97 snapshots causes an NMI.This is a known bug in 3.11 (and presumably earlier). Snapshot manipulation concurrency evidently wasn''t originally well considered, and either a recent regression triggered the problem or possibly it always existed and only recently was reported (I don''t know either way, tho the btrfs devs might). Either way, there''s a patch both to address the problem and adding a regression test to the xfstests filesystem testing suite to catch future regressions, but it''s too recent to be in 3.11 yet, and indeed, as of my update yesterday (3.12.0-03510-g9b66bfb), there hadn''t yet been a btrfs pull in the 3.12 commit window yet either, so while 3.12 will likely get it, it''s not there yet as of the above git snapshot. If interested in more, check recent list threads for the bug report and proposed patch. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, 14 Nov 2013, Duncan <1i5t5.duncan@cox.net> wrote:> > It appears that running two scripts that create snapshots at the same > > time as a script that removes maybe 97 snapshots causes an NMI. > > This is a known bug in 3.11 (and presumably earlier). Snapshot > manipulation concurrency evidently wasn''t originally well considered, and > either a recent regression triggered the problem or possibly it always > existed and only recently was reported (I don''t know either way, tho the > btrfs devs might). > > Either way, there''s a patch both to address the problem and adding a > regression test to the xfstests filesystem testing suite to catch futureThanks for the information. As an aside, I had it happen on a workstation that I was using to watch a movie over NFS. Strangely the BTRFS hang caused video playback to pause. I didn''t expect a BTRFS issue like that to stop access to an NFS mount from another system. Would it be locking up inside the VFS? -- My Main Blog http://etbe.coker.com.au/ My Documents Blog http://doc.coker.com.au/ -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Nov 18, 2013 at 01:43:51AM +1100, Russell Coker wrote:> On Thu, 14 Nov 2013, Duncan <1i5t5.duncan@cox.net> wrote: > > > It appears that running two scripts that create snapshots at the same > > > time as a script that removes maybe 97 snapshots causes an NMI. > > > > This is a known bug in 3.11 (and presumably earlier). Snapshot > > manipulation concurrency evidently wasn''t originally well considered, andNot super important, but the bug does go farther back in time, I reported it here too some time back (although didn''t hear anything back at the time). But for me I had /var/local/scr/btrfs_snaps: line 23: 26017 Segmentation fault (core dumped) /sbin/btrfs subvolume delete "$sub" with details on the BT pasted in the mail: http://permalink.gmane.org/gmane.comp.file-systems.btrfs/28770 Marc -- "A mouse is a device used to point at the xterm you want to type in" - A.S.R. Microsoft is to operating systems .... .... what McDonalds is to gourmet cooking Home page: http://marc.merlins.org/ | PGP 1024R/763BE901 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Sun, Nov 17, 2013 at 07:16:18AM -0800, Marc MERLIN wrote:> On Mon, Nov 18, 2013 at 01:43:51AM +1100, Russell Coker wrote: > > On Thu, 14 Nov 2013, Duncan <1i5t5.duncan@cox.net> wrote: > > > > It appears that running two scripts that create snapshots at the same > > > > time as a script that removes maybe 97 snapshots causes an NMI. > > > > > > This is a known bug in 3.11 (and presumably earlier). Snapshot > > > manipulation concurrency evidently wasn''t originally well considered, and > > Not super important, but the bug does go farther back in time, I > reported it here too some time back (although didn''t hear anything back > at the time). > > But for me I had > /var/local/scr/btrfs_snaps: line 23: 26017 Segmentation fault (core dumped) /sbin/btrfs subvolume > delete "$sub" > with details on the BT pasted in the mail: > http://permalink.gmane.org/gmane.comp.file-systems.btrfs/28770Kernel 3.2.5 IIRC there was a fix or two regarding concurrency during snapshot creation and deletion, roughly matching the kernel 3.2 times. david -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Nov 14, 2013 at 01:45:29AM +1100, Russell Coker wrote:> Nov 14 00:00:01 workstation /USR/SBIN/CRON[30993]: (root) CMD > (/usr/local/sbin/btrfs-make-snapshot minutes /home >> /var/log/snapshot.log) > Nov 14 00:00:01 workstation /USR/SBIN/CRON[30994]: (root) CMD > (/usr/local/sbin/btrfs-make-snapshot days /home >> /var/log/snapshot.log) > Nov 14 00:00:01 workstation /USR/SBIN/CRON[30995]: (root) CMD > (/usr/local/sbin/btrfs-remove-snapshots 100 100 /home/ / >> > /var/log/snapshot.log) > > Below is a grep for NMI in my kernel message log. I''m running Debian kernel > 3.11.6-2 on a quad-core AMD64 system. Above is the relevant section of the > cron log. It appears that running two scripts that create snapshots at the > same time as a script that removes maybe 97 snapshots causes an NMI. I''ve > attached the kernel message log from today, I can give you logs from other > days if they are useful.Snapshot deletion may touch a lot of data and some operations are slowed down by that. This may trigger the warning INFO: task T blocked for more than 120 seconds If there''s no other crash (that would block the system indefinitely), the warnings disappear. According to the grep output, the warnings happen at the same daytime, so this is likely the system overload reason. You may want to follow the warning''s sugeestion: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message or increase the timeout. david -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html