Vincent Vanackere
2012-Jan-19 14:42 UTC
[BUG - btrfs] kernel oops in extent_range_uptodate
Hi,

With the most current git kernel (90a4c0f51e8e44111a926be6f4c87af3938a79c3) I'm still getting the same reproducible kernel panic when trying to read a particular file stored on a btrfs filesystem (as seen in the log, there are indeed disk media errors on this disk).
I'd like the "software" part of this to be fixed - btrfs should definitely not oops even in case of media error - before sending the disk to RMA. Is there anything I can do to make progress on this?

Regards,
Vincent

--------------------------------
ata6.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
ata6.00: BMDMA stat 0x24
ata6.00: failed command: READ DMA EXT
ata6.00: cmd 25/00:08:5f:dc:2f/00:00:70:00:00/e0 tag 0 dma 4096 in
         res 51/40:00:61:dc:2f/40:00:70:00:00/e0 Emask 0x9 (media error)
ata6.00: status: { DRDY ERR }
ata6.00: error: { UNC }
ata6.00: configured for UDMA/133
sd 5:0:0:0: [sdd] Unhandled sense code
sd 5:0:0:0: [sdd] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
sd 5:0:0:0: [sdd] Sense Key : Medium Error [current] [descriptor]
Descriptor sense data with sense descriptors (in hex):
        72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
        70 2f dc 61
sd 5:0:0:0: [sdd] Add. Sense: Unrecovered read error - auto reallocate failed
sd 5:0:0:0: [sdd] CDB: Read(10): 28 00 70 2f dc 5f 00 00 08 00
end_request: I/O error, dev sdd, sector 1882184801
ata6: EH complete
BUG: unable to handle kernel NULL pointer dereference at (null)
IP: [<ffffffffa0191b09>] extent_range_uptodate+0x59/0xe0 [btrfs]
PGD 221bf8067 PUD 222864067 PMD 0
Oops: 0000 [#1] SMP
CPU 1
Modules linked in: ip6table_filter ip6_tables ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT xt_CHECKSUM iptable_mangle xt_tcpudp iptable_filter ip_tables x_tables bridge stp kvm_intel kvm parport_pc ppdev dm_crypt nfsd nfs lockd fscache auth_rpcgss nfs_acl binfmt_misc sunrpc snd_usb_audio joydev snd_usbmidi_lib snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_pcm snd_seq_midi snd_rawmidi snd_seq_midi_event snd_seq snd_timer psmouse snd_seq_device serio_raw snd soundcore snd_page_alloc lp parport btrfs zlib_deflate libcrc32c hid_logitech ff_memless usbhid hid i915 r8169 drm_kms_helper drm pata_jmicron i2c_algo_bit video

Pid: 1003, comm: btrfs-endio-met Not tainted 3.2.0-custom-9429-g90a4c0f #3 Gigabyte Technology Co., Ltd. G33-DS3R/G33-DS3R
RIP: 0010:[<ffffffffa0191b09>] [<ffffffffa0191b09>] extent_range_uptodate+0x59/0xe0 [btrfs]
RSP: 0018:ffff88022191dde0 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 000000df57385000 RCX: 0000000000000000
RDX: 0000000000000001 RSI: 000000000df57385 RDI: 0000000000000000
RBP: ffff88022191de00 R08: 0000000000000000 R09: ffff8801da949ae0
R10: ffff8801fda37010 R11: 0000000000001000 R12: ffff88021b4487f0
R13: 000000df573853ff R14: ffff88022191de98 R15: ffff880221ac2ae8
FS: 0000000000000000(0000) GS:ffff88022fc80000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 0000000221bf9000 CR4: 00000000000406e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process btrfs-endio-met (pid: 1003, threadinfo ffff88022191c000, task ffff880221b9db80)
Stack:
 0000000000000000 ffff8801fdb64eb8 ffff8802221be840 ffff880220b3e000
 ffff88022191de30 ffffffffa016ad89 ffff880221ac2ae0 ffff8801fdb64ee0
 ffff880221ac2ae0 ffff880221ac2af8 ffff88022191dee0 ffffffffa019c18f
Call Trace:
 [<ffffffffa016ad89>] end_workqueue_fn+0x119/0x140 [btrfs]
 [<ffffffffa019c18f>] worker_loop+0x16f/0x5d0 [btrfs]
 [<ffffffffa019c020>] ? btrfs_queue_worker+0x310/0x310 [btrfs]
 [<ffffffff81070193>] kthread+0x93/0xa0
 [<ffffffff81636f24>] kernel_thread_helper+0x4/0x10
 [<ffffffff81070100>] ? kthread_freezable_should_stop+0x70/0x70
 [<ffffffff81636f20>] ? gs_change+0x13/0x13
Code: 01 f0 48 09 f0 a9 ff 0f 00 00 75 4e 49 39 dd b8 01 00 00 00 72 36 0f 1f 40 00 49 8b 7c 24 18 48 89 de 48 c1 ee 0c e8 e7 36 f8 e0 <48> 8b 10 83 e2 08 74 5f 48 89 c7 48 81 c3 00 10 00 00 e8 40 00
RIP [<ffffffffa0191b09>] extent_range_uptodate+0x59/0xe0 [btrfs]
 RSP <ffff88022191dde0>
CR2: 0000000000000000
---[ end trace 4c48da444d2270f0 ]---
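For reference, the failing sector can be cross-checked against the raw bytes in the log above. The following is a minimal sketch, assuming the standard SCSI layouts: a Read(10) CDB carries a big-endian LBA in bytes 2..5 and a block count in bytes 7..8, and descriptor-format sense data (response code 0x72) carries the failing LBA in an "information" descriptor. The helper names (`load_be`, `read10_lba`, `read10_blocks`, `sense_failed_lba`) are made up for this illustration and are not kernel functions.

```c
#include <stddef.h>
#include <stdint.h>

/* Big-endian load of n bytes (n <= 8). */
static uint64_t load_be(const uint8_t *p, size_t n)
{
	uint64_t v = 0;

	for (size_t i = 0; i < n; i++)
		v = (v << 8) | p[i];
	return v;
}

/* First LBA requested by a Read(10) CDB (bytes 2..5). */
static uint64_t read10_lba(const uint8_t *cdb)
{
	return load_be(cdb + 2, 4);
}

/* Number of blocks requested by a Read(10) CDB (bytes 7..8). */
static unsigned read10_blocks(const uint8_t *cdb)
{
	return (unsigned)load_be(cdb + 7, 2);
}

/*
 * Failing LBA from descriptor-format sense data: an 8-byte header
 * (byte 7 = additional length) followed by descriptors. Descriptor
 * type 0x00 is the information descriptor; its bytes 4..11 hold the
 * LBA when the VALID bit (0x80 in byte 2) is set.
 * Returns 0 on success, -1 if no valid information descriptor is found.
 */
static int sense_failed_lba(const uint8_t *s, size_t len, uint64_t *lba)
{
	size_t end;

	if (len < 8 || (s[0] & 0x7f) != 0x72)
		return -1;
	end = 8 + (size_t)s[7];
	if (end > len)
		end = len;
	for (size_t off = 8; off + 2 <= end; off += 2 + (size_t)s[off + 1]) {
		if (s[off] == 0x00 && off + 12 <= end && (s[off + 2] & 0x80)) {
			*lba = load_be(s + off + 4, 8);
			return 0;
		}
	}
	return -1;
}

/* Bytes copied from the log in the report above. */
static const uint8_t log_sense[] = {
	0x72, 0x03, 0x11, 0x04, 0x00, 0x00, 0x00, 0x0c,
	0x00, 0x0a, 0x80, 0x00, 0x00, 0x00, 0x00, 0x00,
	0x70, 0x2f, 0xdc, 0x61,
};
static const uint8_t log_cdb[] = {
	0x28, 0x00, 0x70, 0x2f, 0xdc, 0x5f, 0x00, 0x00, 0x08, 0x00,
};
```

With the bytes from the log, the sense data decodes to LBA 1882184801, which matches the "end_request: I/O error, dev sdd, sector 1882184801" line, and the CDB asks for 8 blocks starting at LBA 1882184799, so the unreadable sector would be the third one of the request.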
On Thu, Jan 19, 2012 at 8:42 AM, Vincent Vanackere <vincent.vanackere@gmail.com> wrote:
> I'd like the "software" part of this to be fixed - btrfs should definitely
> not oops even in case of media error - before sending the disk to RMA. Is
> there anything I can do to make progress on this?

Is this kernel compiled with "Compile the kernel with debug info" (in the "Kernel hacking --->" configuration section)?

It would be nice to have the specific line of code passing the NULL pointer.
Vincent Vanackere
2012-Jan-20 16:48 UTC
Re: [BUG - btrfs] kernel oops in extent_range_uptodate
On 01/19/2012 05:24 PM, Mitch Harder wrote:
> Is this kernel compiled with "Compile the kernel with debug info" (in
> the "Kernel hacking --->" configuration section)?
>
> It would be nice to have the specific line of code passing the NULL pointer.

The kernel was compiled with debug information, but modern Linux distributions make it really hard to keep your debug information, it seems :-(
I even had to compile btrfs built in to keep the line numbers...
Anyway, thanks to kexec / kdump I finally managed to get this, hope it helps:

crash> bt -l
PID: 939 TASK: ffff880218a4adc0 CPU: 0 COMMAND: "btrfs-endio-met"
 #0 [ffff88022316b9e0] machine_kexec at ffffffff810366aa
    /usr/src/linux/arch/x86/kernel/machine_kexec_64.c: 339
 #1 [ffff88022316ba50] crash_kexec at ffffffff810b2df8
    /usr/src/linux/kernel/kexec.c: 1101
 #2 [ffff88022316bb20] oops_end at ffffffff816afdd8
    /usr/src/linux/arch/x86/kernel/dumpstack.c: 228
 #3 [ffff88022316bb50] no_context at ffffffff816a3141
    /usr/src/linux/arch/x86/mm/fault.c: 690
 #4 [ffff88022316bbb0] __bad_area_nosemaphore at ffffffff816a3321
    /usr/src/linux/arch/x86/mm/fault.c: 767
 #5 [ffff88022316bc10] bad_area_nosemaphore at ffffffff816a3353
    /usr/src/linux/arch/x86/mm/fault.c: 775
 #6 [ffff88022316bc20] do_page_fault at ffffffff816b29b6
    /usr/src/linux/arch/x86/mm/fault.c: 1122
 #7 [ffff88022316bd30] page_fault at ffffffff816af235
    /usr/src/linux/arch/x86_64/kernel/entry.S
    [exception RIP: extent_range_uptodate+89]
    RIP: ffffffff812c7239 RSP: ffff88022316bde0 RFLAGS: 00010246
    RAX: 0000000000000000 RBX: 000000df57385000 RCX: 0000000000000000
    RDX: 0000000000000001 RSI: 000000000df57385 RDI: 0000000000000000
    RBP: ffff88022316be00 R8: 0000000000000000 R9: ffff88021f2823c0
    R10: ffff88022034d010 R11: 0000000000001000 R12: ffff880222908410
    R13: 000000df573853ff R14: ffff88022316be98 R15: ffff88021a3e72a8
    ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
 #8 [ffff88022316be08] end_workqueue_fn at ffffffff812a04b9
    /usr/src/linux/fs/btrfs/disk-io.c: 1564
 #9 [ffff88022316be38] worker_loop at ffffffff812d18bf
    /usr/src/linux/arch/x86/include/asm/atomic.h: 107
#10 [ffff88022316bee8] kthread at ffffffff81070193
    /usr/src/linux/kernel/kthread.c: 121
#11 [ffff88022316bf48] kernel_thread_helper at ffffffff816b8124
    /usr/src/linux/arch/x86/kernel/entry_64.S: 1163
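A quick sanity check on the register dump above, offered as a hedged sketch: "extent_range_uptodate+89" in the crash output is the same offset as "+0x59" in the original oops, and RSI (0xdf57385) is exactly RBX (0xdf57385000) shifted right by 12 bits, which is what a page index derived from a byte offset looks like with 4 KiB pages. Treating RBX as a byte offset and RSI as a page index is an assumption drawn from the values, not something the dump states. The names `ASSUMED_PAGE_SHIFT`, `page_index` and `symbol_start` are hypothetical helpers for this illustration.

```c
#include <stdint.h>

/* Assumption: 4 KiB pages, as is usual on x86-64. */
#define ASSUMED_PAGE_SHIFT 12

/* Page index containing a given byte offset. */
static uint64_t page_index(uint64_t byte_offset)
{
	return byte_offset >> ASSUMED_PAGE_SHIFT;
}

/* Start address of a function, given a RIP reported as "symbol+offset". */
static uint64_t symbol_start(uint64_t rip, uint64_t offset)
{
	return rip - offset;
}
```

Applied to the dump, `page_index(0xdf57385000)` gives 0xdf57385, and `symbol_start(0xffffffff812c7239, 89)` gives 0xffffffff812c71e0 for the built-in copy of the function.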
On Fri, Jan 20, 2012 at 10:48 AM, Vincent Vanackere <vincent.vanackere@gmail.com> wrote:
> The kernel was compiled with debug information but modern linux distribution
> make it really hard to keep your debug information it seems :-(

I see that the find_get_page(...) function called in extent_range_uptodate has the potential to return a NULL value.

Could you try the following patch? If it solves your oops and shows the included warning in your dmesg log, I'll simplify the patch to drop the printk and submit it to the list.

I only included the printk because your current error log is ambiguous about the specific point where we're getting the NULL pointer dereference; I'll pull it out if the patch works.
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 9d09a4f..35c3a2a 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3909,6 +3909,13 @@ int extent_range_uptodate(struct extent_io_tree *tree,
 	while (start <= end) {
 		index = start >> PAGE_CACHE_SHIFT;
 		page = find_get_page(tree->mapping, index);
+		if (unlikely(!page)) {
+			if (printk_ratelimit())
+				printk(KERN_WARNING
+				       "btrfs: NULL page in "
+				       "extent_range_uptodate()\n");
+			return 1;
+		}
 		uptodate = PageUptodate(page);
 		page_cache_release(page);
 		if (!uptodate) {
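To make the effect of the guard easier to see outside the kernel, here is a small user-space model of the loop. It is only a sketch under stated assumptions: `struct page`, `struct mapping`, `lookup_page` and `range_uptodate` are hypothetical stand-ins for the kernel's page cache, `find_get_page()` and `extent_range_uptodate()`, and reference counting and the printk are left out. It mirrors the patch's choice of returning 1 when no page is found; whether 1 or 0 is the better answer for a missing page is not settled by the patch itself.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: 4 KiB pages. */
#define MODEL_PAGE_SHIFT 12
#define MODEL_PAGE_SIZE  (1ULL << MODEL_PAGE_SHIFT)

/* Hypothetical stand-ins for the kernel's struct page and page cache. */
struct page {
	int uptodate;
};

struct mapping {
	struct page **pages;	/* slot is NULL when nothing is cached */
	size_t nr_pages;
};

/* Like find_get_page(): returns NULL when no page is cached at index. */
static struct page *lookup_page(const struct mapping *m, uint64_t index)
{
	return index < m->nr_pages ? m->pages[index] : NULL;
}

/*
 * Model of the patched loop: walk [start, end] one page at a time.
 * A missing page ends the walk with 1 (the patch's behaviour) instead
 * of being dereferenced; a cached page that is not up to date gives 0.
 */
static int range_uptodate(const struct mapping *m, uint64_t start, uint64_t end)
{
	while (start <= end) {
		struct page *page = lookup_page(m, start >> MODEL_PAGE_SHIFT);

		if (!page)
			return 1;	/* guard added by the patch */
		if (!page->uptodate)
			return 0;
		start += MODEL_PAGE_SIZE;
	}
	return 1;
}
```

Without the `if (!page)` check, the model would dereference a NULL pointer at `page->uptodate`, which corresponds to the oops reported earlier in the thread.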
Vincent Vanackere
2012-Jan-24 16:24 UTC
Re: [BUG - btrfs] kernel oops in extent_range_uptodate
On 01/20/2012 09:54 PM, Mitch Harder wrote:
> Could you try the following patch, and if it solves your oops and
> shows the included warning in your dmesg log, I'll simplify the patch
> to drop the printk and submit it to the list.

Indeed your patch helps. No kernel panic any more... but it looks like the task doesn't finish and there's another problem to solve now:

sd 5:0:0:0: [sdd] Unhandled sense code
sd 5:0:0:0: [sdd] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
sd 5:0:0:0: [sdd] Sense Key : Medium Error [current] [descriptor]
Descriptor sense data with sense descriptors (in hex):
        72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
        70 2f dc 61
sd 5:0:0:0: [sdd] Add. Sense: Unrecovered read error - auto reallocate failed
sd 5:0:0:0: [sdd] CDB: Read(10): 28 00 70 2f dc 5f 00 00 08 00
end_request: I/O error, dev sdd, sector 1882184801
ata6: EH complete
btrfs: NULL page in extent_range_uptodate()
btrfs: NULL page in extent_range_uptodate()
btrfs bad tree block start 959241011200 959241011200
INFO: task cat:3099 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
cat D ffffffff8180c600 0 3099 3002 0x00000000
 ffff8801f2b0f618 0000000000000086 ffff8801f2b0f5d8 ffff880221018770
 ffff880222c65b80 ffff8801f2b0ffd8 ffff8801f2b0ffd8 ffff8801f2b0ffd8
 ffff8802241816e0 ffff880222c65b80 ffff8801f2b0f5e8 ffff88022fd13e88
Call Trace:
 [<ffffffff81114260>] ? __lock_page+0x70/0x70
 [<ffffffff8162c93f>] schedule+0x3f/0x60
 [<ffffffff8162c9ef>] io_schedule+0x8f/0xd0
 [<ffffffff8111426e>] sleep_on_page+0xe/0x20
 [<ffffffff8162b1ff>] __wait_on_bit+0x5f/0x90
 [<ffffffff811143d8>] wait_on_page_bit+0x78/0x80
 [<ffffffff81070c40>] ? autoremove_wake_function+0x40/0x40
 [<ffffffffa0192161>] read_extent_buffer_pages+0x471/0x4d0 [btrfs]
 [<ffffffffa01697b0>] ? verify_parent_transid+0x160/0x160 [btrfs]
 [<ffffffffa016a13a>] btree_read_extent_buffer_pages.isra.99+0x8a/0xc0 [btrfs]
 [<ffffffffa016c1e1>] read_tree_block+0x41/0x60 [btrfs]
 [<ffffffffa01526a3>] read_block_for_search.isra.34+0xf3/0x3d0 [btrfs]
 [<ffffffffa0154930>] btrfs_search_slot+0x300/0x8a0 [btrfs]
 [<ffffffffa0166ab4>] btrfs_lookup_csum+0x74/0x170 [btrfs]
 [<ffffffffa0166d5f>] __btrfs_lookup_bio_sums+0x1af/0x3b0 [btrfs]
 [<ffffffffa0166fb6>] btrfs_lookup_bio_sums+0x16/0x20 [btrfs]
 [<ffffffffa0173650>] btrfs_submit_bio_hook+0x140/0x170 [btrfs]
 [<ffffffffa01755d0>] ? btrfs_real_readdir+0x720/0x720 [btrfs]
 [<ffffffffa018c17a>] submit_one_bio+0x6a/0xa0 [btrfs]
 [<ffffffffa0190e34>] extent_readpages+0xe4/0x100 [btrfs]
 [<ffffffffa01755d0>] ? btrfs_real_readdir+0x720/0x720 [btrfs]
 [<ffffffffa0173ebf>] btrfs_readpages+0x1f/0x30 [btrfs]
 [<ffffffff81120a0f>] __do_page_cache_readahead+0x1af/0x250
 [<ffffffff81120e11>] ra_submit+0x21/0x30
 [<ffffffff81120f35>] ondemand_readahead+0x115/0x230
 [<ffffffff81137cd9>] ? __do_fault+0x419/0x530
 [<ffffffff81121131>] page_cache_sync_readahead+0x31/0x50
 [<ffffffff811165f8>] generic_file_aio_read+0x438/0x780
 [<ffffffff81173bb2>] do_sync_read+0xd2/0x110
 [<ffffffff81293e73>] ? security_file_permission+0x93/0xb0
 [<ffffffff81174031>] ? rw_verify_area+0x61/0xf0
 [<ffffffff81174510>] vfs_read+0xb0/0x180
 [<ffffffff8117462a>] sys_read+0x4a/0x90
 [<ffffffff81635ae9>] system_call_fastpath+0x16/0x1b
On Tue, Jan 24, 2012 at 10:24 AM, Vincent Vanackere <vincent.vanackere@gmail.com> wrote:
> Indeed your patch helps. No kernel panic any more... but it looks like the
> task doesn't finish and there's another problem to solve now :

Good, looks like we're making progress.
We appear to be stuck now at wait_on_page_locked(page) in the read_extent_buffer_pages(...) function in extent_io.c:

	for (i = start_i; i < num_pages; i++) {
		page = extent_buffer_page(eb, i);
		wait_on_page_locked(page);
		if (!PageUptodate(page))
			ret = -EIO;
	}

I tried looking around the kernel for how others have handled error checking when using wait_on_page_locked(...), but I could not find many examples.

http://lxr.free-electrons.com/ident?i=wait_on_page_locked

I believe I'll have to ask for help from the others on the list at this point for how to handle this issue.

Do you still have data you are trying to recover from this disk?
Vincent Vanackere
2012-Jan-25 08:29 UTC
Re: [BUG - btrfs] kernel oops in extent_range_uptodate
On Wed, Jan 25, 2012 at 04:30, Mitch Harder <mitch.harder@sabayonlinux.org> wrote:
> Do you still have data you are trying to recover from this disk?

I already recovered all the interesting data; I'm only keeping this disk until I'm confident btrfs will be able to deal with this particular I/O error...

Thanks for your help so far!

Vincent