Kai Krakow
2011-Dec-07 20:40 UTC
WARNING: at fs/btrfs/extent-tree.c:4754 followed by BUG: unable to handle kernel NULL pointer dereference at (null)
Hello btrfs!

Recently I upgraded to 3.2.0-rc4 due to instabilities with my btrfs filesystem in 3.1.1. While my system froze completely with 3.1.1, with 3.2.0-rc4 it stays at least somewhat usable (for some strange reason my xorg screen turns black as soon as this happens, and only ssh keeps working).

Scrubbing reports 1 uncorrectable error. I have had this error since my system froze due to some xorg graphics driver instability (I was trying out SNA acceleration for Sandy Bridge).

The problematic file seems to be in /usr/portage but scrubbing doesn't tell me the filename (I was under the impression 3.2.x adds a patch which should report filenames). Every time I run "emerge" (it is a Gentoo system) my screen goes black after a few seconds and I can only fall back to using ssh.

Problem is: As soon as this happens, some filesystem accesses block the process in disk state, and it cannot be killed. This starts a feedback loop: from now on any other process trying to access the FS freezes. I can only reisub now. It seems to be fine if data comes from cache instead of from disk.

Any chance to fix the filesystem or keep the kernel from getting stuck? I'd hate to recreate the fs from scratch again.

Using Linus' tree from git, tagged v3.2-rc4.

Here's my dmesg output:

[172816.292951] parent transid verify failed on 622147694592 wanted 130733 found 134506
[172816.292957] parent transid verify failed on 622147694592 wanted 130733 found 134506
[172816.292960] parent transid verify failed on 622147694592 wanted 130733 found 134506
[172816.292963] parent transid verify failed on 622147694592 wanted 130733 found 134506
[172816.292965] parent transid verify failed on 622147694592 wanted 130733 found 134506
[172816.292967] ------------[ cut here ]------------
[172816.292972] WARNING: at fs/btrfs/extent-tree.c:4754 __btrfs_free_extent+0x290/0x5c7()
[172816.292974] Hardware name: To Be Filled By O.E.M.
[172816.292975] Modules linked in: zram(C) af_packet fuse snd_seq_oss snd_seq_midi_event snd_seq snd_pcm_oss snd_mixer_oss nls_iso8859_15 nls_cp437 vfat fat reiserfs loop nfs tcp_cubic lockd auth_rpcgss nfs_acl sunrpc sg snd_usb_audio snd_hwdep snd_usbmidi_lib snd_rawmidi snd_seq_device gspca_sonixj gspca_main videodev usb_storage v4l2_compat_ioctl32 uas usbhid hid pcspkr evdev i2c_i801 unix [last unloaded: microcode]
[172816.293004] Pid: 6193, comm: btrfs-delayed-m Tainted: G C 3.2.0-rc4 #2
[172816.293005] Call Trace:
[172816.293010] [<ffffffff8103327e>] ? warn_slowpath_common+0x78/0x8c
[172816.293012] [<ffffffff8111ea5b>] ? __btrfs_free_extent+0x290/0x5c7
[172816.293014] [<ffffffff810b2490>] ? __slab_free+0xd1/0x236
[172816.293016] [<ffffffff81121d68>] ? run_clustered_refs+0x66c/0x6b8
[172816.293018] [<ffffffff81121e7d>] ? btrfs_run_delayed_refs+0xc9/0x173
[172816.293021] [<ffffffff8112faf0>] ? __btrfs_end_transaction+0x90/0x1dd
[172816.293024] [<ffffffff810273b0>] ? should_resched+0x5/0x24
[172816.293027] [<ffffffff81166981>] ? btrfs_async_run_delayed_node_done+0x16c/0x1ca
[172816.293029] [<ffffffff8114f20f>] ? worker_loop+0x170/0x46d
[172816.293031] [<ffffffff8114f09f>] ? btrfs_queue_worker+0x25b/0x25b
[172816.293033] [<ffffffff8114f09f>] ? btrfs_queue_worker+0x25b/0x25b
[172816.293036] [<ffffffff8104883b>] ? kthread+0x7a/0x82
[172816.293040] [<ffffffff81415af4>] ? kernel_thread_helper+0x4/0x10
[172816.293042] [<ffffffff810487c1>] ? kthread_worker_fn+0x135/0x135
[172816.293043] [<ffffffff81415af0>] ? gs_change+0xb/0xb
[172816.293045] ---[ end trace 095cf6945c90cf63 ]---
[172816.293046] btrfs unable to find ref byte nr 1871181426688 parent 0 root 2 owner 0 offset 0
[172816.293050] BUG: unable to handle kernel NULL pointer dereference at (null)
[172816.293054] IP: [<ffffffff81148998>] map_private_extent_buffer+0x9/0xde
[172816.293057] PGD 0
[172816.293058] Oops: 0000 [#1] SMP
[172816.293060] CPU 1
[172816.293061] Modules linked in: zram(C) af_packet fuse snd_seq_oss snd_seq_midi_event snd_seq snd_pcm_oss snd_mixer_oss nls_iso8859_15 nls_cp437 vfat fat reiserfs loop nfs tcp_cubic lockd auth_rpcgss nfs_acl sunrpc sg snd_usb_audio snd_hwdep snd_usbmidi_lib snd_rawmidi snd_seq_device gspca_sonixj gspca_main videodev usb_storage v4l2_compat_ioctl32 uas usbhid hid pcspkr evdev i2c_i801 unix [last unloaded: microcode]
[172816.293078]
[172816.293079] Pid: 6193, comm: btrfs-delayed-m Tainted: G WC 3.2.0-rc4 #2 To Be Filled By O.E.M. To Be Filled By O.E.M./Z68 Pro3
[172816.293083] RIP: 0010:[<ffffffff81148998>] [<ffffffff81148998>] map_private_extent_buffer+0x9/0xde
[172816.293086] RSP: 0018:ffff8801bb847b00 EFLAGS: 00010286
[172816.293088] RAX: 0000000000000067 RBX: ffff8801bb847b40 RCX: ffff8801bb847b40
[172816.293090] RDX: 0000000000000004 RSI: 000000000000007a RDI: 0000000000000000
[172816.293092] RBP: 0000000000000065 R08: ffff8801bb847b38 R09: ffff8801bb847b30
[172816.293103] R10: 0000000000000000 R11: 0000000000000009 R12: 000000000000007a
[172816.293105] R13: 0000000000000000 R14: ffff8802350d0000 R15: 0000000000000000
[172816.293107] FS: 0000000000000000(0000) GS:ffff88023fa80000(0000) knlGS:0000000000000000
[172816.293109] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[172816.293111] CR2: 0000000000000000 CR3: 0000000001805000 CR4: 00000000000406e0
[172816.293113] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[172816.293115] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[172816.293117] Process btrfs-delayed-m (pid: 6193, threadinfo ffff8801bb846000, task ffff8801c73e12f0)
[172816.293119] Stack:
[172816.293120] 0000000000000000 ffffffff814125e0 0000000000000030 0000000000000000
[172816.293123] 0000000000000065 ffffffff81140b49 0000000000000009 000001b3ab1ab000
[172816.293126] 0000000000000000 0000000000000002 ffff880233cbb360 00000000fffffffb
[172816.293129] Call Trace:
[172816.293140] [<ffffffff814125e0>] ? printk+0x40/0x48
[172816.293153] [<ffffffff81140b49>] ? btrfs_item_size+0x2c/0x62
[172816.293155] [<ffffffff8111ea9b>] ? __btrfs_free_extent+0x2d0/0x5c7
[172816.293158] [<ffffffff810b2490>] ? __slab_free+0xd1/0x236
[172816.293160] [<ffffffff81121d68>] ? run_clustered_refs+0x66c/0x6b8
[172816.293162] [<ffffffff81121e7d>] ? btrfs_run_delayed_refs+0xc9/0x173
[172816.293165] [<ffffffff8112faf0>] ? __btrfs_end_transaction+0x90/0x1dd
[172816.293167] [<ffffffff810273b0>] ? should_resched+0x5/0x24
[172816.293170] [<ffffffff81166981>] ? btrfs_async_run_delayed_node_done+0x16c/0x1ca
[172816.293172] [<ffffffff8114f20f>] ? worker_loop+0x170/0x46d
[172816.293175] [<ffffffff8114f09f>] ? btrfs_queue_worker+0x25b/0x25b
[172816.293177] [<ffffffff8114f09f>] ? btrfs_queue_worker+0x25b/0x25b
[172816.293179] [<ffffffff8104883b>] ? kthread+0x7a/0x82
[172816.293182] [<ffffffff81415af4>] ? kernel_thread_helper+0x4/0x10
[172816.293184] [<ffffffff810487c1>] ? kthread_worker_fn+0x135/0x135
[172816.293186] [<ffffffff81415af0>] ? gs_change+0xb/0xb
[172816.293188] Code: 8b 74 24 18 48 8b 7c 24 40 e8 99 cb ff ff 48 81 c4 88 00 00 00 89 e8 5b 5d 41 5c 41 5d 41 5e 41 5f c3 55 53 48 89 cb 48 83 ec 18
[172816.293200] 8b 2f 81 e5 ff 0f 00 00 48 8d 04 2e 48 89 c1 4c 8d 54 10 ff
[172816.293206] RIP [<ffffffff81148998>] map_private_extent_buffer+0x9/0xde
[172816.293209] RSP <ffff8801bb847b00>
[172816.293210] CR2: 0000000000000000
[172816.355504] ---[ end trace 095cf6945c90cf64 ]---

Regards,
Kai

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
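The repeated "parent transid verify failed" lines in the log above all point at the same logical address. A minimal sketch for condensing such lines from a saved dmesg dump follows. The function name and the regular expression are illustrative assumptions based only on the message format shown in this thread; other kernel versions may word the message differently.

```python
import re
from collections import Counter

# Pattern derived from the log lines quoted in this thread (assumption:
# the wording is the same in the dump being analysed).
TRANSID_RE = re.compile(
    r"parent transid verify failed on (\d+) wanted (\d+) found (\d+)"
)


def summarize_transid_failures(dmesg_text):
    """Count failures per (logical address, wanted transid, found transid)."""
    counts = Counter()
    for match in TRANSID_RE.finditer(dmesg_text):
        logical, wanted, found = (int(group) for group in match.groups())
        counts[(logical, wanted, found)] += 1
    return counts


# Two of the lines from the report above, used as sample input.
sample = (
    "[172816.292951] parent transid verify failed on 622147694592 "
    "wanted 130733 found 134506\n"
    "[172816.292957] parent transid verify failed on 622147694592 "
    "wanted 130733 found 134506\n"
)

for (logical, wanted, found), hits in summarize_transid_failures(sample).items():
    print(f"logical {logical}: wanted {wanted}, found {found} ({hits} times)")
```

Grouping by address makes it easier to see whether a flood of messages refers to one damaged metadata block or to several.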
Jan Schmidt
2011-Dec-08 16:03 UTC
Re: WARNING: at fs/btrfs/extent-tree.c:4754 followed by BUG: unable to handle kernel NULL pointer dereference at (null)
On 07.12.2011 21:40, Kai Krakow wrote:
> Scrubbing reports 1 uncorrectable error. I have had this error since my system
> froze due to some xorg graphics driver instability (I was trying out SNA
> acceleration for Sandy Bridge).
>
> The problematic file seems to be in /usr/portage but scrubbing doesn't tell
> me the filename (I was under the impression 3.2.x adds a patch which should
> report filenames).

It should. Did you take a look at dmesg output after scrubbing? If it doesn't contain a hint on the file or block, please paste what you get.

> Every time I run "emerge" (it is a Gentoo system) my
> screen goes black after a few seconds and I can only fall back to using ssh.
>
> Problem is: As soon as this happens, some filesystem accesses block the
> process in disk state, and it cannot be killed. This starts a feedback
> loop: from now on any other process trying to access the FS freezes. I can
> only reisub now. It seems to be fine if data comes from cache instead of from
> disk.

Please try to grab sysrq+w output in this state.

-Jan
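When the keyboard is unusable and only ssh still works, the sysrq+w dump requested above can usually be triggered by writing the letter "w" to /proc/sysrq-trigger as root. The dump of blocked tasks then appears in the kernel log (dmesg). The following is a minimal sketch of that step. The helper name `trigger_sysrq` is hypothetical, and the dry run at the end writes to a temporary file instead of the real trigger so the sketch can be exercised without root.

```python
import os
import tempfile

# Real trigger file on Linux; writing to it requires root.
SYSRQ_TRIGGER = "/proc/sysrq-trigger"


def trigger_sysrq(command="w", trigger_path=SYSRQ_TRIGGER):
    """Write one sysrq command character to the trigger file.

    The default "w" asks the kernel to dump tasks in blocked
    (uninterruptible) state to the kernel log.
    """
    if len(command) != 1:
        raise ValueError("sysrq command must be a single character")
    with open(trigger_path, "w") as trigger:
        trigger.write(command)


# Dry run against a temporary file (illustration only, no kernel involved).
with tempfile.TemporaryDirectory() as tmp_dir:
    fake_trigger = os.path.join(tmp_dir, "sysrq-trigger")
    trigger_sysrq("w", fake_trigger)
    with open(fake_trigger) as check:
        dry_run_result = check.read()

print(dry_run_result)
```

On the affected machine one would call `trigger_sysrq()` with the default path from a root shell and then save the dmesg output somewhere outside the broken filesystem.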
Kai Krakow
2011-Dec-09 13:34 UTC
Re: WARNING: at fs/btrfs/extent-tree.c:4754 followed by BUG: unable to handle kernel NULL pointer dereference at (null)
Hello!

2011/12/8 Jan Schmidt <list.btrfs@jan-o-sch.net>:
> On 07.12.2011 21:40, Kai Krakow wrote:
[...]
>> The problematic file seems to be in /usr/portage but scrubbing doesn't tell
>> me the filename (I was under the impression 3.2.x adds a patch which should
>> report filenames).
>
> It should. Did you take a look at dmesg output after scrubbing? If it
> doesn't contain a hint on the file or block, please paste what you get.

I watched dmesg while scrubbing. Nothing there. To paste what I got I need to find a way to make my 3.2-rc4 system boot again (without freezing due to services and background jobs touching certain parts of the broken filesystem) or create a 3.2 rescue system...

>> Every time I run "emerge" (it is a Gentoo system) my
>> screen goes black after a few seconds and I can only fall back to using ssh.
>>
>> Problem is: As soon as this happens, some filesystem accesses block the
>> process in disk state, and it cannot be killed. This starts a feedback
>> loop: from now on any other process trying to access the FS freezes. I can
>> only reisub now. It seems to be fine if data comes from cache instead of from
>> disk.
>
> Please try to grab sysrq+w output in this state.

I tried, nothing there. I wondered why... This changed between 3.1 and 3.2. There is probably no blocking process because it got killed by the kernel. The next process accessing the filesystem blocks (and does not get killed). I will try to get a sysrq+w from this situation via ssh so I can copy & paste dmesg somewhere, but it will be difficult because ssh communication usually freezes, too.

Maybe related: When the system was still running I sometimes saw it use 100% CPU on one or two cores. Looking at "top" I could not see a process or kernel thread using the CPU, but I saw the CPU usage distributed across SYS%, WA% and USER%... This effect could only be resolved by rebooting. It can be seen in both kernel 3.1 and 3.2, but with much lower likelihood in 3.2. However, even nice'd processes were still able to acquire 100% CPU usage per core, so it didn't have any effect on system performance.

I think I even made my situation worse... In an attempt to get the error fixed, I deleted and recreated the subvolume with /usr/portage (its content is easily restorable from the internet). On the next reboot the btrfs cleaner kernel thread spat out a lot of errors and traces into dmesg, and the system froze some minutes later, so I couldn't save the output.

Now I cannot reliably boot, and btrfs has problems accessing files all over the filesystem, even in subvolumes that worked fine before. I thought subvolumes were clearly separated from each other? Now I have at least 3 different classes of error messages instead of only 1 single error.

Josef's repair program fails an assertion and cannot continue on the volume.

I think in order to stabilize btrfs it is important to make it handle structure errors gracefully, and then invest in some repair utility. I'd like to contribute, but at some point I will need to get my system back into a stable state and will recreate my filesystem from scratch.

Mounting the fs read-only allows me to access all parts of the filesystem without problems. I still see errors in dmesg but no kernel bugs or warnings with traces.

Regards,
Kai
Kai Krakow
2011-Dec-15 04:11 UTC
Re: WARNING: at fs/btrfs/extent-tree.c:4754 followed by BUG: unable to handle kernel NULL pointer dereference at (null)
Hello,

I managed to mount my broken btrfs partition in read-only mode, clone my rootfs subvolume to an ext4 partition and boot from that, so I now have the original system bootable.

Jan Schmidt wrote:
> On 07.12.2011 21:40, Kai Krakow wrote:
[...]
>> The problematic file seems to be in /usr/portage but scrubbing doesn't
>> tell me the filename (I was under the impression 3.2.x adds a patch which
>> should report filenames).
>
> It should. Did you take a look at dmesg output after scrubbing? If it
> doesn't contain a hint on the file or block, please paste what you get.

[ 187.136485] device fsid 311dda08-f33f-4cb9-9d59-6eac6026b1b1 devid 2 transid 146954 /dev/sda3
[ 187.136776] btrfs: use lzo compression
[ 187.136777] btrfs: disk space caching is enabled
[ 190.874110] zcache: created ephemeral tmem pool, id=2, client=65535
[ 243.659298] checksum error at logical 622147694592 on dev /dev/sda3, sector 301624: metadata leaf (level 0) in tree 2
[ 243.659302] checksum error at logical 622147694592 on dev /dev/sda3, sector 301624: metadata leaf (level 0) in tree 2
[ 243.725126] btrfs: unable to fixup (regular) error at logical 622147694592
[ 306.023952] parent transid verify failed on 622147694592 wanted 130733 found 134506
[ 306.023960] parent transid verify failed on 622147694592 wanted 130733 found 134506
[ 306.023963] parent transid verify failed on 622147694592 wanted 130733 found 134506
[ 306.023966] parent transid verify failed on 622147694592 wanted 130733 found 134506
[ 306.023968] parent transid verify failed on 622147694592 wanted 130733 found 134506

Here's the last scrub status:

scrub status for 311dda08-f33f-4cb9-9d59-6eac6026b1b1
scrub started at Sat Dec 10 10:34:57 2011 and was aborted after 2711 seconds
total bytes scrubbed: 318.77GB with 3 errors
error details: read=1 verify=2
corrected errors: 0, uncorrectable errors: 1, unverified errors: 0

I'm not sure what "read" and "verify" mean in this context.

This happens with 3.2.0-rc4... I'm switching to rc5 soon. But as you (@Jan) can see: no file paths are printed.

Regards,
Kai
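The scrub status quoted above can be turned into numbers for comparison between runs. The following is a minimal sketch, assuming the status text has the same wording as in this message and that each field sits on its own line (the line layout is an assumption). The function name `parse_scrub_status` and the dictionary keys are hypothetical. It does not interpret what "read" and "verify" mean; it only extracts the counters.

```python
import re


def parse_scrub_status(text):
    """Extract the error counters from a scrub status report."""
    summary = {}

    # "... with 3 errors"
    total = re.search(r"with (\d+) errors", text)
    if total:
        summary["total_errors"] = int(total.group(1))

    # "error details: read=1 verify=2"
    details = re.search(r"error details:(.*)", text)
    if details:
        summary["details"] = {
            name: int(value)
            for name, value in re.findall(r"(\w+)=(\d+)", details.group(1))
        }

    # "corrected errors: 0, uncorrectable errors: 1, unverified errors: 0"
    for label, count in re.findall(r"(\w+) errors: (\d+)", text):
        summary[label] = int(count)

    return summary


# Values copied from the scrub status in this message.
sample_status = (
    "total bytes scrubbed: 318.77GB with 3 errors\n"
    "error details: read=1 verify=2\n"
    "corrected errors: 0, uncorrectable errors: 1, unverified errors: 0\n"
)

parsed = parse_scrub_status(sample_status)
print(parsed)
```

For the report above, the per-category details (1 read, 2 verify) add up to the 3 errors given in the total line.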