When doing a ''rm -r /var/tmp/portage/sys-devel'' I get the following Oops: ------------[ cut here ]------------ kernel BUG at fs/btrfs/extent-tree.c:1353! invalid opcode: 0000 [#1] PREEMPT SMP last sysfs file: /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0C0A:00/power_supply/BAT0/charge_full Modules linked in: snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss nfs lockd nfs_acl auth_rpcgss sunrpc sco rfcomm bnep l2cap crc16 xts gf128mul usb_storage dm_crypt dm_mod coretemp hwmon acpi_cpufreq mperf snd_hda_codec_realtek uvcvideo iwl3945 snd_hda_intel snd_hda_codec iwlcore videodev r8169 snd_hwdep btusb snd_pcm v4l1_compat mac80211 snd_timer bluetooth snd mii cfg80211 soundcore sg rfkill ac i2c_i801 snd_page_alloc uhci_hcd battery [last unloaded: microcode] Pid: 2358, comm: rm Not tainted 2.6.35-rc4 #32 M912/M912 EIP: 0060:[<c10c383b>] EFLAGS: 00010202 CPU: 1 EIP is at lookup_inline_extent_backref+0xf2/0x406 EAX: 00000001 EBX: 00000007 ECX: 00000000 EDX: 00000000 ESI: 00000004 EDI: f7268150 EBP: 00000004 ESP: f5aa5d08 DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 Process rm (pid: 2358, ti=f5aa4000 task=f6f0fa70 task.ti=f5aa4000) Stack: f702f8c0 f744e080 f665f380 000000b0 00000000 00000000 ffffffff f6c80f00 <0> f744e080 c10ec226 e98acfff f6c98000 00001001 0e987000 00000004 00000000 <0> 00000850 040e9870 a8000000 00001000 00000000 00000007 00000000 0e987000 Call Trace: [<c10ec226>] ? set_extent_dirty+0x19/0x1d [<c10c5081>] ? __btrfs_free_extent+0xda/0x675 [<c10c88bf>] ? run_clustered_refs+0x699/0x6d7 [<c10d239f>] ? btrfs_mark_buffer_dirty+0xa3/0xef [<c1101454>] ? btrfs_find_ref_cluster+0xf9/0x13a [<c10c89bc>] ? btrfs_run_delayed_refs+0xbf/0x155 [<c10d3a73>] ? __btrfs_end_transaction+0x53/0x16c [<c10db480>] ? btrfs_delete_inode+0x166/0x17e [<c102280d>] ? get_parent_ip+0x8/0x19 [<c108fe5c>] ? generic_delete_inode+0x6f/0xbd [<c108f5b3>] ? iput+0x46/0x48 [<c10893a8>] ? do_unlinkat+0xc7/0x109 [<c102280d>] ? get_parent_ip+0x8/0x19 [<c10822e3>] ? fput+0x12/0x15c [<c10a2f30>] ? dnotify_flush+0x41/0xc2 [<c107fe85>] ? filp_close+0x4c/0x52 [<c107feed>] ? sys_close+0x62/0x9b [<c1002550>] ? sysenter_do_call+0x12/0x26 Code: 80 4e 68 02 8d 4c 24 43 89 f8 6a 01 ff 74 24 1c ff 74 24 08 8b 54 24 38 e8 01 c2 ff ff 83 c4 0c 83 f8 00 0f 8c e1 02 00 00 74 02 <0f> 0b 8b 04 24 8b 34 24 8b 00 8b 56 20 89 44 24 08 e8 2e fa ff EIP: [<c10c383b>] lookup_inline_extent_backref+0xf2/0x406 SS:ESP 0068:f5aa5d08 ---[ end trace d97601f0b455ca72 ]--- note: rm[2358] exited with preempt_count 2 BUG: scheduling while atomic: rm/2358/0x10000003 Modules linked in: snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss nfs lockd nfs_acl auth_rpcgss sunrpc sco rfcomm bnep l2cap crc16 xts gf128mul usb_storage dm_crypt dm_mod coretemp hwmon acpi_cpufreq mperf snd_hda_codec_realtek uvcvideo iwl3945 snd_hda_intel snd_hda_codec iwlcore videodev r8169 snd_hwdep btusb snd_pcm v4l1_compat mac80211 snd_timer bluetooth snd mii cfg80211 soundcore sg rfkill ac i2c_i801 snd_page_alloc uhci_hcd battery [last unloaded: microcode] Pid: 2358, comm: rm Tainted: G D 2.6.35-rc4 #32 Call Trace: [<c12de6b3>] ? schedule+0x88/0x332 [<c10237c1>] ? __cond_resched+0xf/0x19 [<c12de9e2>] ? _cond_resched+0x12/0x18 [<c106ceec>] ? unmap_vmas+0x4e7/0x534 [<c1070c8f>] ? exit_mmap+0x64/0xa4 [<c1026089>] ? mmput+0x21/0x96 [<c102938e>] ? exit_mm+0xe7/0xf0 [<c12dfa28>] ? _raw_spin_unlock_irqrestore+0x1a/0x24 [<c103aaa1>] ? hrtimer_try_to_cancel+0x31/0x3a [<c102a42e>] ? do_exit+0x17b/0x57d [<c1028e78>] ? kmsg_dump+0x81/0xf9 [<c1002d06>] ? do_invalid_op+0x0/0x76 [<c1004fa0>] ? oops_end+0x72/0x75 [<c1002d6f>] ? do_invalid_op+0x69/0x76 [<c10c383b>] ? lookup_inline_extent_backref+0xf2/0x406 [<c10bdc9a>] ? generic_bin_search.clone.0+0x145/0x150 [<c10bcf30>] ? btrfs_cow_block+0x106/0x112 [<c10bdcdc>] ? bin_search+0x37/0x3d [<c10bfe33>] ? btrfs_search_slot+0x405/0x477 [<c12e031a>] ? error_code+0x66/0x6c [<c1002d06>] ? do_invalid_op+0x0/0x76 [<c10c383b>] ? lookup_inline_extent_backref+0xf2/0x406 [<c10ec226>] ? set_extent_dirty+0x19/0x1d [<c10c5081>] ? __btrfs_free_extent+0xda/0x675 [<c10c88bf>] ? run_clustered_refs+0x699/0x6d7 [<c10d239f>] ? btrfs_mark_buffer_dirty+0xa3/0xef [<c1101454>] ? btrfs_find_ref_cluster+0xf9/0x13a [<c10c89bc>] ? btrfs_run_delayed_refs+0xbf/0x155 [<c10d3a73>] ? __btrfs_end_transaction+0x53/0x16c [<c10db480>] ? btrfs_delete_inode+0x166/0x17e [<c102280d>] ? get_parent_ip+0x8/0x19 [<c108fe5c>] ? generic_delete_inode+0x6f/0xbd [<c108f5b3>] ? iput+0x46/0x48 [<c10893a8>] ? do_unlinkat+0xc7/0x109 [<c102280d>] ? get_parent_ip+0x8/0x19 [<c10822e3>] ? fput+0x12/0x15c [<c10a2f30>] ? dnotify_flush+0x41/0xc2 [<c107fe85>] ? filp_close+0x4c/0x52 [<c107feed>] ? sys_close+0x62/0x9b [<c1002550>] ? sysenter_do_call+0x12/0x26 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Neither Yan nor I have been able to reproduce this locally, but a few people have now hit it. Johannes, are you available to try out a debugging kernel to try and track this down? -chris On Thu, Jul 08, 2010 at 04:27:23PM +0200, Johannes Hirte wrote:> When doing a ''rm -r /var/tmp/portage/sys-devel'' I get the following Oops: > > ------------[ cut here ]------------ > kernel BUG at fs/btrfs/extent-tree.c:1353! > invalid opcode: 0000 [#1] PREEMPT SMP > last sysfs file: > /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0C0A:00/power_supply/BAT0/charge_full > Modules linked in: snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq > snd_seq_device snd_pcm_oss snd_mixer_oss nfs lockd nfs_acl auth_rpcgss sunrpc > sco rfcomm bnep l2cap crc16 xts gf128mul usb_storage dm_crypt dm_mod coretemp > hwmon acpi_cpufreq mperf snd_hda_codec_realtek uvcvideo iwl3945 snd_hda_intel > snd_hda_codec iwlcore videodev r8169 snd_hwdep btusb snd_pcm v4l1_compat > mac80211 snd_timer bluetooth snd mii cfg80211 soundcore sg rfkill ac i2c_i801 > snd_page_alloc uhci_hcd battery [last unloaded: microcode] > > Pid: 2358, comm: rm Not tainted 2.6.35-rc4 #32 M912/M912 > EIP: 0060:[<c10c383b>] EFLAGS: 00010202 CPU: 1 > EIP is at lookup_inline_extent_backref+0xf2/0x406 > EAX: 00000001 EBX: 00000007 ECX: 00000000 EDX: 00000000 > ESI: 00000004 EDI: f7268150 EBP: 00000004 ESP: f5aa5d08 > DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 > Process rm (pid: 2358, ti=f5aa4000 task=f6f0fa70 task.ti=f5aa4000) > Stack: > f702f8c0 f744e080 f665f380 000000b0 00000000 00000000 ffffffff f6c80f00 > <0> f744e080 c10ec226 e98acfff f6c98000 00001001 0e987000 00000004 00000000 > <0> 00000850 040e9870 a8000000 00001000 00000000 00000007 00000000 0e987000 > Call Trace: > [<c10ec226>] ? set_extent_dirty+0x19/0x1d > [<c10c5081>] ? __btrfs_free_extent+0xda/0x675 > [<c10c88bf>] ? run_clustered_refs+0x699/0x6d7 > [<c10d239f>] ? btrfs_mark_buffer_dirty+0xa3/0xef > [<c1101454>] ? btrfs_find_ref_cluster+0xf9/0x13a > [<c10c89bc>] ? btrfs_run_delayed_refs+0xbf/0x155 > [<c10d3a73>] ? __btrfs_end_transaction+0x53/0x16c > [<c10db480>] ? btrfs_delete_inode+0x166/0x17e > [<c102280d>] ? get_parent_ip+0x8/0x19 > [<c108fe5c>] ? generic_delete_inode+0x6f/0xbd > [<c108f5b3>] ? iput+0x46/0x48 > [<c10893a8>] ? do_unlinkat+0xc7/0x109 > [<c102280d>] ? get_parent_ip+0x8/0x19 > [<c10822e3>] ? fput+0x12/0x15c > [<c10a2f30>] ? dnotify_flush+0x41/0xc2 > [<c107fe85>] ? filp_close+0x4c/0x52 > [<c107feed>] ? sys_close+0x62/0x9b > [<c1002550>] ? sysenter_do_call+0x12/0x26 > Code: 80 4e 68 02 8d 4c 24 43 89 f8 6a 01 ff 74 24 1c ff 74 24 08 8b 54 24 38 e8 > 01 c2 ff ff 83 c4 0c 83 f8 00 0f 8c e1 02 00 00 74 02 <0f> 0b 8b 04 24 8b 34 24 > 8b 00 8b 56 20 89 44 24 08 e8 2e fa ff > EIP: [<c10c383b>] lookup_inline_extent_backref+0xf2/0x406 SS:ESP 0068:f5aa5d08 > ---[ end trace d97601f0b455ca72 ]--- > note: rm[2358] exited with preempt_count 2 > BUG: scheduling while atomic: rm/2358/0x10000003 > Modules linked in: snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq > snd_seq_device snd_pcm_oss snd_mixer_oss nfs lockd nfs_acl auth_rpcgss sunrpc > sco rfcomm bnep l2cap crc16 xts gf128mul usb_storage dm_crypt dm_mod coretemp > hwmon acpi_cpufreq mperf snd_hda_codec_realtek uvcvideo iwl3945 snd_hda_intel > snd_hda_codec iwlcore videodev r8169 snd_hwdep btusb snd_pcm v4l1_compat > mac80211 snd_timer bluetooth snd mii cfg80211 soundcore sg rfkill ac i2c_i801 > snd_page_alloc uhci_hcd battery [last unloaded: microcode] > Pid: 2358, comm: rm Tainted: G D 2.6.35-rc4 #32 > Call Trace: > [<c12de6b3>] ? schedule+0x88/0x332 > [<c10237c1>] ? __cond_resched+0xf/0x19 > [<c12de9e2>] ? _cond_resched+0x12/0x18 > [<c106ceec>] ? unmap_vmas+0x4e7/0x534 > [<c1070c8f>] ? exit_mmap+0x64/0xa4 > [<c1026089>] ? mmput+0x21/0x96 > [<c102938e>] ? exit_mm+0xe7/0xf0 > [<c12dfa28>] ? _raw_spin_unlock_irqrestore+0x1a/0x24 > [<c103aaa1>] ? hrtimer_try_to_cancel+0x31/0x3a > [<c102a42e>] ? do_exit+0x17b/0x57d > [<c1028e78>] ? kmsg_dump+0x81/0xf9 > [<c1002d06>] ? do_invalid_op+0x0/0x76 > [<c1004fa0>] ? oops_end+0x72/0x75 > [<c1002d6f>] ? do_invalid_op+0x69/0x76 > [<c10c383b>] ? lookup_inline_extent_backref+0xf2/0x406 > [<c10bdc9a>] ? generic_bin_search.clone.0+0x145/0x150 > [<c10bcf30>] ? btrfs_cow_block+0x106/0x112 > [<c10bdcdc>] ? bin_search+0x37/0x3d > [<c10bfe33>] ? btrfs_search_slot+0x405/0x477 > [<c12e031a>] ? error_code+0x66/0x6c > [<c1002d06>] ? do_invalid_op+0x0/0x76 > [<c10c383b>] ? lookup_inline_extent_backref+0xf2/0x406 > [<c10ec226>] ? set_extent_dirty+0x19/0x1d > [<c10c5081>] ? __btrfs_free_extent+0xda/0x675 > [<c10c88bf>] ? run_clustered_refs+0x699/0x6d7 > [<c10d239f>] ? btrfs_mark_buffer_dirty+0xa3/0xef > [<c1101454>] ? btrfs_find_ref_cluster+0xf9/0x13a > [<c10c89bc>] ? btrfs_run_delayed_refs+0xbf/0x155 > [<c10d3a73>] ? __btrfs_end_transaction+0x53/0x16c > [<c10db480>] ? btrfs_delete_inode+0x166/0x17e > [<c102280d>] ? get_parent_ip+0x8/0x19 > [<c108fe5c>] ? generic_delete_inode+0x6f/0xbd > [<c108f5b3>] ? iput+0x46/0x48 > [<c10893a8>] ? do_unlinkat+0xc7/0x109 > [<c102280d>] ? get_parent_ip+0x8/0x19 > [<c10822e3>] ? fput+0x12/0x15c > [<c10a2f30>] ? dnotify_flush+0x41/0xc2 > [<c107fe85>] ? filp_close+0x4c/0x52 > [<c107feed>] ? sys_close+0x62/0x9b > [<c1002550>] ? sysenter_do_call+0x12/0x26
Am Donnerstag 08 Juli 2010, 16:31:09 schrieb Chris Mason:> Neither Yan nor I have been able to reproduce this locally, but a few > people have now hit it. Johannes, are you available to try out a > debugging kernel to try and track this down?Sure, just tell me what to do. Is it enough to recompile the kernel with debug options enabled or are special debug patches necessary? regards, Johannes
It''s getting worse. The /home partition is now affected too. I get the Oops on simple unmounting the fs. btrfsck gives me this output on this fs: btrfsck /dev/mapper/sdb3 leaf 123780497408 items 49 free space 271 generation 62207 owner 2 fs uuid 7f013285-88d8-452f-a139-7d44bffd14b6 chunk uuid 365526c9-e209-46a1-8963-3157306d9e05 item 0 key (123780108288 EXTENT_ITEM 4096) itemoff 3944 itemsize 51 extent refs 1 gen 45309 flags 2 tree block key (836797 6c 0) level 1 tree block backref root 5 item 1 key (123780112384 EXTENT_ITEM 4096) itemoff 3893 itemsize 51 extent refs 1 gen 45309 flags 2 tree block key (836955 c 416051) level 0 tree block backref root 5 item 2 key (123780116480 EXTENT_ITEM 4096) itemoff 3842 itemsize 51 extent refs 1 gen 24599 flags 2 tree block key (18446744073709551606 80 122869407744) level 0 tree block backref root 7 item 3 key (123780120576 EXTENT_ITEM 4096) itemoff 3791 itemsize 51 extent refs 1 gen 49958 flags 2 tree block key (979249 1 0) level 0 tree block backref root 5 item 4 key (123780124672 EXTENT_ITEM 4096) itemoff 3740 itemsize 51 extent refs 1 gen 62191 flags 2 tree block key (1220 1 0) level 0 tree block backref root 5 item 5 key (123780128768 EXTENT_ITEM 4096) itemoff 3689 itemsize 51 extent refs 1 gen 54817 flags 2 tree block key (1001168 c 455590) level 0 tree block backref root 5 item 6 key (123780132864 EXTENT_ITEM 4096) itemoff 3638 itemsize 51 extent refs 1 gen 62201 flags 2 tree block key (28712 1 0) level 1 tree block backref root 5 item 7 key (123780136960 EXTENT_ITEM 4096) itemoff 3587 itemsize 51 extent refs 1 gen 62191 flags 2 tree block key (34645 c 33037) level 0 tree block backref root 5 item 8 key (123780141056 EXTENT_ITEM 4096) itemoff 3536 itemsize 51 extent refs 1 gen 50007 flags 2 tree block key (31007 60 4044) level 0 tree block backref root 5 item 9 key (123780145152 EXTENT_ITEM 4096) itemoff 3485 itemsize 51 extent refs 1 gen 62202 flags 2 tree block key (123644329984 a8 4096) level 0 tree block backref root 2 item 10 key (123780149248 EXTENT_ITEM 4096) itemoff 3434 itemsize 51 extent refs 1 gen 62202 flags 2 tree block key (123644854272 a8 4096) level 0 tree block backref root 2 item 11 key (123780153344 EXTENT_ITEM 4096) itemoff 3383 itemsize 51 extent refs 1 gen 62202 flags 2 tree block key (123645849600 a8 4096) level 0 tree block backref root 2 item 12 key (123780157440 EXTENT_ITEM 4096) itemoff 3332 itemsize 51 extent refs 1 gen 62207 flags 2 tree block key (123411308544 a8 4096) level 2 tree block backref root 2 item 13 key (123780161536 EXTENT_ITEM 4096) itemoff 3281 itemsize 51 extent refs 1 gen 62200 flags 2 tree block key (1325101 c 1264) level 0 tree block backref root 5 item 14 key (123780165632 EXTENT_ITEM 4096) itemoff 3230 itemsize 51 extent refs 1 gen 56401 flags 2 tree block key (59621 1 0) level 1 tree block backref root 5 item 15 key (123780169728 EXTENT_ITEM 4096) itemoff 3179 itemsize 51 extent refs 1 gen 24613 flags 2 tree block key (18446744073709551606 80 125996056576) level 0 tree block backref root 7 item 16 key (123780173824 EXTENT_ITEM 4096) itemoff 3128 itemsize 51 extent refs 1 gen 62189 flags 2 tree block key (1324334 1 0) level 0 tree block backref root 5 item 17 key (123780177920 EXTENT_ITEM 4095) itemoff 3077 itemsize 51 extent refs 1 gen 62207 flags 2 tree block key (123682791424 a8 4096) level 1 tree block backref root 2 item 18 key (123780182016 EXTENT_ITEM 4096) itemoff 3026 itemsize 51 extent refs 1 gen 62202 flags 2 tree block key (123648741376 a8 4096) level 0 tree block backref root 2 item 19 key (123780186112 EXTENT_ITEM 4096) itemoff 2975 itemsize 51 extent refs 1 gen 62201 flags 2 tree block key (28854 54 1781866506) level 0 tree block backref root 5 item 20 key (123780190208 EXTENT_ITEM 4096) itemoff 2924 itemsize 51 extent refs 1 gen 62192 flags 2 tree block key (1456 6c 3231744) level 0 tree block backref root 5 item 21 key (123780194304 EXTENT_ITEM 4096) itemoff 2873 itemsize 51 extent refs 1 gen 62189 flags 2 tree block key (123646390272 a8 4096) level 0 tree block backref root 2 item 22 key (123780198400 EXTENT_ITEM 4096) itemoff 2822 itemsize 51 extent refs 1 gen 58732 flags 2 tree block key (145006825472 a8 8192) level 0 tree block backref root 2 item 23 key (123780202496 EXTENT_ITEM 4096) itemoff 2771 itemsize 51 extent refs 1 gen 62191 flags 2 tree block key (123656515584 a8 4096) level 0 tree block backref root 2 item 24 key (123780206592 EXTENT_ITEM 4096) itemoff 2720 itemsize 51 extent refs 1 gen 59061 flags 2 tree block key (34873 1 0) level 0 tree block backref root 5 item 25 key (123780210688 EXTENT_ITEM 4096) itemoff 2669 itemsize 51 extent refs 1 gen 62207 flags 2 tree block key (123778146304 a8 4096) level 0 tree block backref root 2 item 26 key (123780214784 EXTENT_ITEM 4096) itemoff 2618 itemsize 51 extent refs 1 gen 62201 flags 2 tree block key (28848 c 2786) level 0 tree block backref root 5 item 27 key (123780218880 EXTENT_ITEM 4096) itemoff 2567 itemsize 51 extent refs 1 gen 62189 flags 2 tree block key (123657785344 a8 4096) level 0 tree block backref root 2 item 28 key (123780222976 EXTENT_ITEM 4096) itemoff 2516 itemsize 51 extent refs 1 gen 62201 flags 2 tree block key (133719777280 a8 4096) level 1 tree block backref root 2 item 29 key (123780227072 EXTENT_ITEM 4096) itemoff 2465 itemsize 51 extent refs 1 gen 62189 flags 2 tree block key (123659431936 a8 4096) level 0 tree block backref root 2 item 30 key (123780231168 EXTENT_ITEM 4096) itemoff 2414 itemsize 51 extent refs 1 gen 62202 flags 2 tree block key (123658563584 a8 4096) level 1 tree block backref root 2 item 31 key (123780235264 EXTENT_ITEM 4096) itemoff 2363 itemsize 51 extent refs 1 gen 46988 flags 2 tree block key (34911 6c 282624) level 0 tree block backref root 5 item 32 key (123780239360 EXTENT_ITEM 4096) itemoff 2312 itemsize 51 extent refs 1 gen 58714 flags 2 tree block key (63797821440 a8 4096) level 0 tree block backref root 2 item 33 key (123780243456 EXTENT_ITEM 4096) itemoff 2261 itemsize 51 extent refs 1 gen 62201 flags 2 tree block key (133728186368 a8 4096) level 0 tree block backref root 2 item 34 key (123780247552 EXTENT_ITEM 4096) itemoff 2210 itemsize 51 extent refs 1 gen 56401 flags 2 tree block key (59654 60 7) level 0 tree block backref root 5 item 35 key (123780251648 EXTENT_ITEM 4096) itemoff 2159 itemsize 51 extent refs 1 gen 62207 flags 2 tree block key (123778404352 a8 4096) level 0 tree block backref root 2 item 36 key (123780255744 EXTENT_ITEM 4096) itemoff 2108 itemsize 51 extent refs 1 gen 62202 flags 2 tree block key (123659759616 a8 4096) level 0 tree block backref root 2 item 37 key (123780259840 EXTENT_ITEM 4096) itemoff 2057 itemsize 51 extent refs 1 gen 62204 flags 2 tree block key (145014075392 a8 12288) level 0 tree block backref root 2 item 38 key (123780263936 EXTENT_ITEM 4096) itemoff 2006 itemsize 51 extent refs 1 gen 62193 flags 2 tree block key (145016651776 a8 8192) level 0 tree block backref root 2 item 39 key (123780268032 EXTENT_ITEM 4096) itemoff 1955 itemsize 51 extent refs 1 gen 62207 flags 2 tree block key (123778617344 a8 4096) level 0 tree block backref root 2 item 40 key (123780272128 EXTENT_ITEM 4096) itemoff 1904 itemsize 51 extent refs 1 gen 59985 flags 2 tree block key (18446744073709551606 80 62746009600) level 0 tree block backref root 7 item 41 key (123780276224 EXTENT_ITEM 4096) itemoff 1853 itemsize 51 extent refs 1 gen 62202 flags 2 tree block key (123660300288 a8 4096) level 0 tree block backref root 2 item 42 key (123780280320 EXTENT_ITEM 4096) itemoff 1802 itemsize 51 extent refs 1 gen 22901 flags 2 tree block key (66305400832 a8 4096) level 0 tree block backref root 2 item 43 key (123780284416 EXTENT_ITEM 4096) itemoff 1751 itemsize 51 extent refs 1 gen 62207 flags 2 tree block key (123778818048 a8 4096) level 0 tree block backref root 2 item 44 key (123780288512 EXTENT_ITEM 4096) itemoff 1700 itemsize 51 extent refs 1 gen 51582 flags 2 tree block key (455590 54 3842052003) level 0 tree block backref root 5 item 45 key (123780292608 EXTENT_ITEM 4096) itemoff 1649 itemsize 51 extent refs 1 gen 62200 flags 2 tree block key (1283 54 2984231802) level 0 tree block backref root 5 item 46 key (123780296704 EXTENT_ITEM 4096) itemoff 1598 itemsize 51 extent refs 1 gen 56398 flags 2 tree block key (54879 c 54735) level 0 tree block backref root 5 item 47 key (123780300800 EXTENT_ITEM 4096) itemoff 1547 itemsize 51 extent refs 1 gen 62201 flags 2 tree block key (2786 54 1989163538) level 1 tree block backref root 5 item 48 key (123780304896 EXTENT_ITEM 4096) itemoff 1496 itemsize 51 extent refs 1 gen 59040 flags 2 tree block key (123812081664 a8 4096) level 0 tree block backref root 2 failed to find block number 123780177920 Aborted Mounting the fs read-only an reading the content works fine so far. I''ve three systems running with btrfs, a dual Opteron (252), a Pentium 4 system and a netbook with N270 Atom. The netbook is the only one that shows the errors. It''s also the only system where I''m using gcc-4.5. Perhaps it''s related, but I doubt it''s the only reason as I''m using gcc-4.5 since May. regards, Johannes
Am Sonntag 11 Juli 2010, 14:28:09 schrieb Johannes Hirte: ...> I''ve three systems running with btrfs, a dual Opteron (252), a Pentium 4 > system and a netbook with N270 Atom. The netbook is the only one that shows > the errors. It''s also the only system where I''m using gcc-4.5. Perhaps it''s > related, but I doubt it''s the only reason as I''m using gcc-4.5 since May.On the Opteron system I got now csum errors. I''ve synced some data from the netbook to the Opteron yesteray. After hitting ENOSPC with 4GB free, I''ve run ''btrfs-vol -b'' on this fs in hope to get some more free space. It worked but the command failed and I found in dmesg: btrfs csum failed ino 339 off 935280640 csum 337776576 private 337776575 btrfs csum failed ino 339 off 935280640 csum 337776576 private 337776575 btrfs csum failed ino 339 off 935280640 csum 337776576 private 337776575 btrfs csum failed ino 339 off 935280640 csum 337776576 private 337776575 So I''ve tested the new synced data by syncing them to another disk on the Optoern system (XFS). As I''ve expected (or better feared), some data wasn''t readable and I found more csum errors in dmesg: btrfs csum failed ino 1849137 off 368640 csum 3354885689 private 3354885688 btrfs csum failed ino 1849137 off 368640 csum 3354885689 private 3354885688 btrfs csum failed ino 1849137 off 368640 csum 3354885689 private 3354885688 btrfs csum failed ino 1849137 off 368640 csum 3354885689 private 3354885688 btrfs csum failed ino 1849137 off 368640 csum 3354885689 private 3354885688 btrfs csum failed ino 1849137 off 368640 csum 3354885689 private 3354885688 btrfs csum failed ino 1849137 off 368640 csum 3354885689 private 3354885688 btrfs csum failed ino 1849137 off 368640 csum 3354885689 private 3354885688 btrfs csum failed ino 1849137 off 368640 csum 3354885689 private 3354885688 btrfs csum failed ino 1849137 off 368640 csum 3354885689 private 3354885688 btrfs csum failed ino 1912210 off 5095424 csum 847944548 private 847944547 btrfs csum failed ino 1912210 off 5095424 csum 847944548 private 847944547 btrfs csum failed ino 1912210 off 5095424 csum 847944548 private 847944547 btrfs csum failed ino 1912210 off 5095424 csum 847944548 private 847944547 btrfs csum failed ino 1912210 off 5095424 csum 847944548 private 847944547 btrfs csum failed ino 1912210 off 5095424 csum 847944548 private 847944547 btrfs csum failed ino 1912210 off 5095424 csum 847944548 private 847944547 btrfs csum failed ino 1912210 off 5095424 csum 847944548 private 847944547 btrfs csum failed ino 1912210 off 5095424 csum 847944548 private 847944547 btrfs csum failed ino 1912210 off 5095424 csum 847944548 private 847944547 btrfs csum failed ino 1912210 off 5095424 csum 847944548 private 847944547 btrfs csum failed ino 1912210 off 5095424 csum 847944548 private 847944547 btrfs csum failed ino 1959333 off 252362752 csum 686735346 private 686735345 btrfs csum failed ino 1959333 off 252362752 csum 686735346 private 686735345 btrfs csum failed ino 1959333 off 252362752 csum 686735346 private 686735345 btrfs csum failed ino 1959333 off 252362752 csum 686735346 private 686735345 btrfs csum failed ino 1959333 off 252362752 csum 686735346 private 686735345 btrfs csum failed ino 1959333 off 252362752 csum 686735346 private 686735345 btrfs csum failed ino 1959333 off 651108352 csum 2851505977 private 2851505976 btrfs csum failed ino 1959333 off 651108352 csum 2851505977 private 2851505976 btrfs csum failed ino 1959333 off 651108352 csum 2851505977 private 2851505976 btrfs csum failed ino 1959333 off 651108352 csum 2851505977 private 2851505976 btrfs csum failed ino 1959333 off 651108352 csum 2851505977 private 2851505976 btrfs csum failed ino 1959333 off 651108352 csum 2851505977 private 2851505976 btrfs csum failed ino 1959333 off 898342912 csum 4271223884 private 4271223883 btrfs csum failed ino 1959333 off 898342912 csum 4271223884 private 4271223883 btrfs csum failed ino 1959333 off 898342912 csum 4271223884 private 4271223883 btrfs csum failed ino 1959333 off 898342912 csum 4271223884 private 4271223883 btrfs csum failed ino 1959333 off 898342912 csum 4271223884 private 4271223883 btrfs csum failed ino 1959333 off 898342912 csum 4271223884 private 4271223883 I suspect something goes horribly wrong on writing to disc within btrfs. On the netbook I got missing blocks, on the Opteron System bad csums. Both systems are running linux-2.6.35-rc4, the netbook with gcc-4.5.0 the Opteron system with gcc-4.4.4. I''ll test the P4 system later, if there are similar errors too. regards, Johannes -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Am Donnerstag 08 Juli 2010, 16:31:09 schrieb Chris Mason:> Neither Yan nor I have been able to reproduce this locally, but a few > people have now hit it. Johannes, are you available to try out a > debugging kernel to try and track this down? > > -chris > > On Thu, Jul 08, 2010 at 04:27:23PM +0200, Johannes Hirte wrote: > > When doing a ''rm -r /var/tmp/portage/sys-devel'' I get the following Oops: > > > > ------------[ cut here ]------------ > > kernel BUG at fs/btrfs/extent-tree.c:1353! > > invalid opcode: 0000 [#1] PREEMPT SMP > > last sysfs file: > > /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0C0A:00/power_supply/BAT0/charge_ > > full Modules linked in: snd_seq_dummy snd_seq_oss snd_seq_midi_event > > snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss nfs lockd nfs_acl > > auth_rpcgss sunrpc sco rfcomm bnep l2cap crc16 xts gf128mul usb_storage > > dm_crypt dm_mod coretemp hwmon acpi_cpufreq mperf snd_hda_codec_realtek > > uvcvideo iwl3945 snd_hda_intel snd_hda_codec iwlcore videodev r8169 > > snd_hwdep btusb snd_pcm v4l1_compat mac80211 snd_timer bluetooth snd mii > > cfg80211 soundcore sg rfkill ac i2c_i801 snd_page_alloc uhci_hcd battery > > [last unloaded: microcode] > > > > Pid: 2358, comm: rm Not tainted 2.6.35-rc4 #32 M912/M912 > > EIP: 0060:[<c10c383b>] EFLAGS: 00010202 CPU: 1 > > EIP is at lookup_inline_extent_backref+0xf2/0x406 > > EAX: 00000001 EBX: 00000007 ECX: 00000000 EDX: 00000000 > > ESI: 00000004 EDI: f7268150 EBP: 00000004 ESP: f5aa5d08 > > > > DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 > > > > Process rm (pid: 2358, ti=f5aa4000 task=f6f0fa70 task.ti=f5aa4000) > > > > Stack: > > f702f8c0 f744e080 f665f380 000000b0 00000000 00000000 ffffffff f6c80f00 > > > > <0> f744e080 c10ec226 e98acfff f6c98000 00001001 0e987000 00000004 > > 00000000 <0> 00000850 040e9870 a8000000 00001000 00000000 00000007 > > 00000000 0e987000 > > > > Call Trace: > > [<c10ec226>] ? set_extent_dirty+0x19/0x1d > > [<c10c5081>] ? __btrfs_free_extent+0xda/0x675 > > [<c10c88bf>] ? run_clustered_refs+0x699/0x6d7 > > [<c10d239f>] ? btrfs_mark_buffer_dirty+0xa3/0xef > > [<c1101454>] ? btrfs_find_ref_cluster+0xf9/0x13a > > [<c10c89bc>] ? btrfs_run_delayed_refs+0xbf/0x155 > > [<c10d3a73>] ? __btrfs_end_transaction+0x53/0x16c > > [<c10db480>] ? btrfs_delete_inode+0x166/0x17e > > [<c102280d>] ? get_parent_ip+0x8/0x19 > > [<c108fe5c>] ? generic_delete_inode+0x6f/0xbd > > [<c108f5b3>] ? iput+0x46/0x48 > > [<c10893a8>] ? do_unlinkat+0xc7/0x109 > > [<c102280d>] ? get_parent_ip+0x8/0x19 > > [<c10822e3>] ? fput+0x12/0x15c > > [<c10a2f30>] ? dnotify_flush+0x41/0xc2 > > [<c107fe85>] ? filp_close+0x4c/0x52 > > [<c107feed>] ? sys_close+0x62/0x9b > > [<c1002550>] ? sysenter_do_call+0x12/0x26 > > > > Code: 80 4e 68 02 8d 4c 24 43 89 f8 6a 01 ff 74 24 1c ff 74 24 08 8b 54 > > 24 38 e8 01 c2 ff ff 83 c4 0c 83 f8 00 0f 8c e1 02 00 00 74 02 <0f> 0b > > 8b 04 24 8b 34 24 8b 00 8b 56 20 89 44 24 08 e8 2e fa ff > > EIP: [<c10c383b>] lookup_inline_extent_backref+0xf2/0x406 SS:ESP > > 0068:f5aa5d08 ---[ end trace d97601f0b455ca72 ]--- > > note: rm[2358] exited with preempt_count 2 > > BUG: scheduling while atomic: rm/2358/0x10000003 > > Modules linked in: snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq > > snd_seq_device snd_pcm_oss snd_mixer_oss nfs lockd nfs_acl auth_rpcgss > > sunrpc sco rfcomm bnep l2cap crc16 xts gf128mul usb_storage dm_crypt > > dm_mod coretemp hwmon acpi_cpufreq mperf snd_hda_codec_realtek uvcvideo > > iwl3945 snd_hda_intel snd_hda_codec iwlcore videodev r8169 snd_hwdep > > btusb snd_pcm v4l1_compat mac80211 snd_timer bluetooth snd mii cfg80211 > > soundcore sg rfkill ac i2c_i801 snd_page_alloc uhci_hcd battery [last > > unloaded: microcode] > > Pid: 2358, comm: rm Tainted: G D 2.6.35-rc4 #32 > > > > Call Trace: > > [<c12de6b3>] ? schedule+0x88/0x332 > > [<c10237c1>] ? __cond_resched+0xf/0x19 > > [<c12de9e2>] ? _cond_resched+0x12/0x18 > > [<c106ceec>] ? unmap_vmas+0x4e7/0x534 > > [<c1070c8f>] ? exit_mmap+0x64/0xa4 > > [<c1026089>] ? mmput+0x21/0x96 > > [<c102938e>] ? exit_mm+0xe7/0xf0 > > [<c12dfa28>] ? _raw_spin_unlock_irqrestore+0x1a/0x24 > > [<c103aaa1>] ? hrtimer_try_to_cancel+0x31/0x3a > > [<c102a42e>] ? do_exit+0x17b/0x57d > > [<c1028e78>] ? kmsg_dump+0x81/0xf9 > > [<c1002d06>] ? do_invalid_op+0x0/0x76 > > [<c1004fa0>] ? oops_end+0x72/0x75 > > [<c1002d6f>] ? do_invalid_op+0x69/0x76 > > [<c10c383b>] ? lookup_inline_extent_backref+0xf2/0x406 > > [<c10bdc9a>] ? generic_bin_search.clone.0+0x145/0x150 > > [<c10bcf30>] ? btrfs_cow_block+0x106/0x112 > > [<c10bdcdc>] ? bin_search+0x37/0x3d > > [<c10bfe33>] ? btrfs_search_slot+0x405/0x477 > > [<c12e031a>] ? error_code+0x66/0x6c > > [<c1002d06>] ? do_invalid_op+0x0/0x76 > > [<c10c383b>] ? lookup_inline_extent_backref+0xf2/0x406 > > [<c10ec226>] ? set_extent_dirty+0x19/0x1d > > [<c10c5081>] ? __btrfs_free_extent+0xda/0x675 > > [<c10c88bf>] ? run_clustered_refs+0x699/0x6d7 > > [<c10d239f>] ? btrfs_mark_buffer_dirty+0xa3/0xef > > [<c1101454>] ? btrfs_find_ref_cluster+0xf9/0x13a > > [<c10c89bc>] ? btrfs_run_delayed_refs+0xbf/0x155 > > [<c10d3a73>] ? __btrfs_end_transaction+0x53/0x16c > > [<c10db480>] ? btrfs_delete_inode+0x166/0x17e > > [<c102280d>] ? get_parent_ip+0x8/0x19 > > [<c108fe5c>] ? generic_delete_inode+0x6f/0xbd > > [<c108f5b3>] ? iput+0x46/0x48 > > [<c10893a8>] ? do_unlinkat+0xc7/0x109 > > [<c102280d>] ? get_parent_ip+0x8/0x19 > > [<c10822e3>] ? fput+0x12/0x15c > > [<c10a2f30>] ? dnotify_flush+0x41/0xc2 > > [<c107fe85>] ? filp_close+0x4c/0x52 > > [<c107feed>] ? sys_close+0x62/0x9b > > [<c1002550>] ? sysenter_do_call+0x12/0x26I''m not sure if btrfs is to blame for this error. After the errors I switched to XFS on this system and got now this error: ls -l .kde4/share/apps/akregator/data/ ls: cannot access .kde4/share/apps/akregator/data/feeds.opml: Structure needs cleaning total 4 ?????????? ? ? ? ? ? feeds.opml xfs_check is showing this: xfs_check /dev/sda3 link count mismatch for inode 219998792 (name ?), nlink 0, counted 1 disconnected inode 220064328, nlink 1 So this is the second FS I''ve got suddenly errors, so I think the problem lies deeper. Adding some CCs for this. regards, Johannes
On Wed, Jul 14, 2010 at 05:25:23PM +0200, Johannes Hirte wrote:> Am Donnerstag 08 Juli 2010, 16:31:09 schrieb Chris Mason: > I''m not sure if btrfs is to blame for this error. After the errors I switched > to XFS on this system and got now this error: > > ls -l .kde4/share/apps/akregator/data/ > ls: cannot access .kde4/share/apps/akregator/data/feeds.opml: Structure needs > cleaning > total 4 > ?????????? ? ? ? ? ? feeds.opmlWhat is the error reported in dmesg when the XFS filesytem shuts down? Cheers, Dave. -- Dave Chinner david@fromorbit.com
Am Donnerstag 15 Juli 2010, 02:11:04 schrieb Dave Chinner:> On Wed, Jul 14, 2010 at 05:25:23PM +0200, Johannes Hirte wrote: > > Am Donnerstag 08 Juli 2010, 16:31:09 schrieb Chris Mason: > > I''m not sure if btrfs is to blame for this error. After the errors I > > switched to XFS on this system and got now this error: > > > > ls -l .kde4/share/apps/akregator/data/ > > ls: cannot access .kde4/share/apps/akregator/data/feeds.opml: Structure > > needs cleaning > > total 4 > > ?????????? ? ? ? ? ? feeds.opml > > What is the error reported in dmesg when the XFS filesytem shuts down?Nothing. I double checked the logs. There are only the messages when mounting the filesystem. No other errors are reported than the inaccessible file and the output from xfs_check. regards, Johannes -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Am Dienstag 13 Juli 2010, 14:23:58 schrieb Johannes Hirte:> On the Opteron system I got now csum errors. I''ve synced some data from the > netbook to the Opteron yesteray. After hitting ENOSPC with 4GB free, I''ve > run ''btrfs-vol -b'' on this fs in hope to get some more free space. It > worked but the command failed and I found in dmesg: > > btrfs csum failed ino 339 off 935280640 csum 337776576 private 337776575 > btrfs csum failed ino 339 off 935280640 csum 337776576 private 337776575 > btrfs csum failed ino 339 off 935280640 csum 337776576 private 337776575 > btrfs csum failed ino 339 off 935280640 csum 337776576 private 337776575 > > So I''ve tested the new synced data by syncing them to another disk on the > Optoern system (XFS). As I''ve expected (or better feared), some data wasn''t > readable and I found more csum errors in dmesg: > > btrfs csum failed ino 1849137 off 368640 csum 3354885689 private 3354885688 > btrfs csum failed ino 1849137 off 368640 csum 3354885689 private 3354885688 > btrfs csum failed ino 1849137 off 368640 csum 3354885689 private 3354885688 > btrfs csum failed ino 1849137 off 368640 csum 3354885689 private 3354885688 > btrfs csum failed ino 1849137 off 368640 csum 3354885689 private 3354885688 > btrfs csum failed ino 1849137 off 368640 csum 3354885689 private 3354885688 > btrfs csum failed ino 1849137 off 368640 csum 3354885689 private 3354885688 > btrfs csum failed ino 1849137 off 368640 csum 3354885689 private 3354885688 > btrfs csum failed ino 1849137 off 368640 csum 3354885689 private 3354885688 > btrfs csum failed ino 1849137 off 368640 csum 3354885689 private 3354885688 > btrfs csum failed ino 1912210 off 5095424 csum 847944548 private 847944547 > btrfs csum failed ino 1912210 off 5095424 csum 847944548 private 847944547 > btrfs csum failed ino 1912210 off 5095424 csum 847944548 private 847944547 > btrfs csum failed ino 1912210 off 5095424 csum 847944548 private 847944547 > btrfs csum failed ino 1912210 off 5095424 csum 847944548 private 847944547 > btrfs csum failed ino 1912210 off 5095424 csum 847944548 private 847944547 > btrfs csum failed ino 1912210 off 5095424 csum 847944548 private 847944547 > btrfs csum failed ino 1912210 off 5095424 csum 847944548 private 847944547 > btrfs csum failed ino 1912210 off 5095424 csum 847944548 private 847944547 > btrfs csum failed ino 1912210 off 5095424 csum 847944548 private 847944547 > btrfs csum failed ino 1912210 off 5095424 csum 847944548 private 847944547 > btrfs csum failed ino 1912210 off 5095424 csum 847944548 private 847944547 > btrfs csum failed ino 1959333 off 252362752 csum 686735346 private > 686735345 btrfs csum failed ino 1959333 off 252362752 csum 686735346 > private 686735345 btrfs csum failed ino 1959333 off 252362752 csum > 686735346 private 686735345 btrfs csum failed ino 1959333 off 252362752 > csum 686735346 private 686735345 btrfs csum failed ino 1959333 off > 252362752 csum 686735346 private 686735345 btrfs csum failed ino 1959333 > off 252362752 csum 686735346 private 686735345 btrfs csum failed ino > 1959333 off 651108352 csum 2851505977 private 2851505976 btrfs csum failed > ino 1959333 off 651108352 csum 2851505977 private 2851505976 btrfs csum > failed ino 1959333 off 651108352 csum 2851505977 private 2851505976 btrfs > csum failed ino 1959333 off 651108352 csum 2851505977 private 2851505976 > btrfs csum failed ino 1959333 off 651108352 csum 2851505977 private > 2851505976 btrfs csum failed ino 1959333 off 651108352 csum 2851505977 > private 2851505976 btrfs csum failed ino 1959333 off 898342912 csum > 4271223884 private 4271223883 btrfs csum failed ino 1959333 off 898342912 > csum 4271223884 private 4271223883 btrfs csum failed ino 1959333 off > 898342912 csum 4271223884 private 4271223883 btrfs csum failed ino 1959333 > off 898342912 csum 4271223884 private 4271223883 btrfs csum failed ino > 1959333 off 898342912 csum 4271223884 private 4271223883 btrfs csum failed > ino 1959333 off 898342912 csum 4271223884 private 4271223883I think, this is a different error. I''ve only seen them on filesystems from my Opteron system. It seems that the recorded csums are wrong and it looks to me like rounding errors. The data itself should be correct, as I''ve tested one affected file via md5sum against the original on another filesystem. Any ideas what is going wrong here? regards, Johannes -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Jul 15, 2010 at 08:30:17PM +0200, Johannes Hirte wrote:> Am Dienstag 13 Juli 2010, 14:23:58 schrieb Johannes Hirte: > > ino 1959333 off 898342912 csum 4271223884 private 4271223883 > > I think, this is a different error. I''ve only seen them on filesystems from my > Opteron system. It seems that the recorded csums are wrong and it looks to me > like rounding errors. The data itself should be correct, as I''ve tested one > affected file via md5sum against the original on another filesystem. > Any ideas what is going wrong here?Are you doing data mirroring? We can map that block and do a raw read off the device to see what the data blocks actually contain. -chris -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Am Donnerstag 15 Juli 2010, 21:03:09 schrieb Chris Mason:> On Thu, Jul 15, 2010 at 08:30:17PM +0200, Johannes Hirte wrote: > > Am Dienstag 13 Juli 2010, 14:23:58 schrieb Johannes Hirte: > > > ino 1959333 off 898342912 csum 4271223884 private 4271223883 > > > > I think, this is a different error. I''ve only seen them on filesystems > > from my Opteron system. It seems that the recorded csums are wrong and > > it looks to me like rounding errors. The data itself should be correct, > > as I''ve tested one affected file via md5sum against the original on > > another filesystem. Any ideas what is going wrong here? > > Are you doing data mirroring?No, I don''t.> We can map that block and do a raw read off the device to see what the > data blocks actually contain.I''ve modified the btrfs-source a little to get the data. In inode.c I''ve changed the code to: csum = btrfs_csum_data(root, kaddr + offset, csum, end - start + 1); btrfs_csum_final(csum, (char *)&csum); if (csum != private) if (printk_ratelimit()) { printk(KERN_INFO "csum != private; ino %lu off %llu " "csum %u private %llu\n", page->mapping->host->i_ino, (unsigned long long)start, csum, (unsigned long long)private); } // goto zeroit; kunmap_atomic(kaddr, KM_USER0); This way I could read the files with wrong csum too. As I wrote, I''ve compared the md5sum from one file with a copy on an other filesystem. As they are the same, at least for this file the data should be correct. The big question is, why do the csums differ? regards, Johannes
On Thu, Jul 15, 2010 at 09:32:12PM +0200, Johannes Hirte wrote:> Am Donnerstag 15 Juli 2010, 21:03:09 schrieb Chris Mason: > > On Thu, Jul 15, 2010 at 08:30:17PM +0200, Johannes Hirte wrote: > > > Am Dienstag 13 Juli 2010, 14:23:58 schrieb Johannes Hirte: > > > > ino 1959333 off 898342912 csum 4271223884 private 4271223883 > > > > > > I think, this is a different error. I''ve only seen them on filesystems > > > from my Opteron system. It seems that the recorded csums are wrong and > > > it looks to me like rounding errors. The data itself should be correct, > > > as I''ve tested one affected file via md5sum against the original on > > > another filesystem. Any ideas what is going wrong here? > > > > Are you doing data mirroring? > > No, I don''t. > > > We can map that block and do a raw read off the device to see what the > > data blocks actually contain. > > I''ve modified the btrfs-source a little to get the data. In inode.c I''ve > changed the code to:Great. The bad csums are all just one bit off, that can''t be an accident. When were they written (which kernel?). Did you boot a 32 bit kernel on there at any time? -chris -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Am Donnerstag 15 Juli 2010, 21:35:51 schrieb Chris Mason:> On Thu, Jul 15, 2010 at 09:32:12PM +0200, Johannes Hirte wrote: > > Am Donnerstag 15 Juli 2010, 21:03:09 schrieb Chris Mason: > > > On Thu, Jul 15, 2010 at 08:30:17PM +0200, Johannes Hirte wrote: > > > > Am Dienstag 13 Juli 2010, 14:23:58 schrieb Johannes Hirte: > > > > > ino 1959333 off 898342912 csum 4271223884 private 4271223883 > > > > > > > > I think, this is a different error. I''ve only seen them on > > > > filesystems from my Opteron system. It seems that the recorded csums > > > > are wrong and it looks to me like rounding errors. The data itself > > > > should be correct, as I''ve tested one affected file via md5sum > > > > against the original on another filesystem. Any ideas what is going > > > > wrong here? > > > > > > Are you doing data mirroring? > > > > No, I don''t. > > > > > We can map that block and do a raw read off the device to see what the > > > data blocks actually contain. > > > > I''ve modified the btrfs-source a little to get the data. In inode.c I''ve > > > changed the code to: > Great. The bad csums are all just one bit off, that can''t be an > accident. When were they written (which kernel?). Did you boot a 32 > bit kernel on there at any time?No, I don''t have a bootable 32bit installation on this system. I''ve tested it now with a 32bit system by dumping the whole filesystem to an external drive and mounting this to a 32bit system. The result was the same. The affected files were written by different kernels. I think at least 2.6.34, 2.6.35-rc3 and 2.6.35-rc4 should be involved, perhaps 2.6.33 too. I''ll try to figure it out more exactly. regards, Johannes -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Am Donnerstag 15 Juli 2010, 20:14:51 schrieb Johannes Hirte:> Am Donnerstag 15 Juli 2010, 02:11:04 schrieb Dave Chinner: > > On Wed, Jul 14, 2010 at 05:25:23PM +0200, Johannes Hirte wrote: > > > Am Donnerstag 08 Juli 2010, 16:31:09 schrieb Chris Mason: > > > I''m not sure if btrfs is to blame for this error. After the errors I > > > switched to XFS on this system and got now this error: > > > > > > ls -l .kde4/share/apps/akregator/data/ > > > ls: cannot access .kde4/share/apps/akregator/data/feeds.opml: Structure > > > needs cleaning > > > total 4 > > > ?????????? ? ? ? ? ? feeds.opml > > > > What is the error reported in dmesg when the XFS filesytem shuts down? > > Nothing. I double checked the logs. There are only the messages when > mounting the filesystem. No other errors are reported than the > inaccessible file and the output from xfs_check.I''m running now a kernel with more debug options enabled and got this: [ 6794.810935] [ 6794.810941] ================================[ 6794.810955] [ INFO: inconsistent lock state ] [ 6794.810966] 2.6.35-rc4-btrfs-debug #7 [ 6794.810975] --------------------------------- [ 6794.810984] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage. [ 6794.810996] kswapd0/361 [HC0[0]:SC0[0]:HE1:SE1] takes: [ 6794.811006] (&(&ip->i_iolock)->mr_lock#2){++++?+}, at: [<c10fa82d>] xfs_ilock+0x22/0x67 [ 6794.811039] {RECLAIM_FS-ON-W} state was registered at: [ 6794.811046] [<c104ebc1>] mark_held_locks+0x42/0x5e [ 6794.811046] [<c104f1f7>] lockdep_trace_alloc+0x99/0xb0 [ 6794.811046] [<c10740b8>] __alloc_pages_nodemask+0x6a/0x4a1 [ 6794.811046] [<c106edc2>] __page_cache_alloc+0x11/0x13 [ 6794.811046] [<c106fb43>] grab_cache_page_write_begin+0x47/0x81 [ 6794.811046] [<c10b2050>] block_write_begin_newtrunc+0x2e/0x9c [ 6794.811046] [<c10b233a>] block_write_begin+0x23/0x5d [ 6794.811046] [<c1114a9d>] xfs_vm_write_begin+0x26/0x28 [ 6794.811046] [<c106f15d>] generic_file_buffered_write+0xb5/0x1bd [ 6794.811046] [<c1117e31>] xfs_file_aio_write+0x40e/0x66d [ 6794.811046] [<c10950b4>] do_sync_write+0x8b/0xc6 [ 6794.811046] [<c109568b>] vfs_write+0x77/0xa4 [ 6794.811046] [<c10957f3>] sys_write+0x3c/0x5e [ 6794.811046] [<c1002690>] sysenter_do_call+0x12/0x36 [ 6794.811046] irq event stamp: 141369 [ 6794.811046] hardirqs last enabled at (141369): [<c13639d2>] _raw_spin_unlock_irqrestore+0x36/0x5b [ 6794.811046] hardirqs last disabled at (141368): [<c13634c5>] _raw_spin_lock_irqsave+0x14/0x68 [ 6794.811046] softirqs last enabled at (141300): [<c1032d69>] __do_softirq+0xfe/0x10d [ 6794.811046] softirqs last disabled at (141295): [<c1032da7>] do_softirq+0x2f/0x47 [ 6794.811046] [ 6794.811046] other info that might help us debug this: [ 6794.811046] 2 locks held by kswapd0/361: [ 6794.811046] #0: (shrinker_rwsem){++++..}, at: [<c10774db>] shrink_slab+0x25/0x13f [ 6794.811046] #1: (&xfs_mount_list_lock){++++.-}, at: [<c111cc78>] xfs_reclaim_inode_shrink+0x2a/0xe8 [ 6794.811046] [ 6794.811046] stack backtrace: [ 6794.811046] Pid: 361, comm: kswapd0 Not tainted 2.6.35-rc4-btrfs-debug #7 [ 6794.811046] Call Trace: [ 6794.811046] [<c13616c0>] ? printk+0xf/0x17 [ 6794.811046] [<c104e988>] valid_state+0x134/0x142 [ 6794.811046] [<c104ea66>] mark_lock+0xd0/0x1e9 [ 6794.811046] [<c104e2a7>] ? check_usage_forwards+0x0/0x5f [ 6794.811046] [<c105003d>] __lock_acquire+0x374/0xc80 [ 6794.811046] [<c1044942>] ? sched_clock_local+0x12/0x121 [ 6794.811046] [<c1044c0b>] ? sched_clock_cpu+0x122/0x133 [ 6794.811046] [<c1050d4d>] lock_acquire+0x5f/0x76 [ 6794.811046] [<c10fa82d>] ? xfs_ilock+0x22/0x67 [ 6794.811046] [<c1043974>] down_write_nested+0x32/0x63 [ 6794.811046] [<c10fa82d>] ? xfs_ilock+0x22/0x67 [ 6794.811046] [<c10fa82d>] xfs_ilock+0x22/0x67 [ 6794.811046] [<c10faa48>] xfs_ireclaim+0x98/0xbb [ 6794.811046] [<c1043a1e>] ? up_write+0x16/0x2b [ 6794.811046] [<c111c78c>] xfs_reclaim_inode+0x1a7/0x1b1 [ 6794.811046] [<c111cafe>] xfs_inode_ag_walk+0x77/0xbc [ 6794.811046] [<c111c5e5>] ? xfs_reclaim_inode+0x0/0x1b1 [ 6794.811046] [<c111cc07>] xfs_inode_ag_iterator+0x52/0x99 [ 6794.811046] [<c111cc78>] ? xfs_reclaim_inode_shrink+0x2a/0xe8 [ 6794.811046] [<c111c5e5>] ? xfs_reclaim_inode+0x0/0x1b1 [ 6794.811046] [<c111cc99>] xfs_reclaim_inode_shrink+0x4b/0xe8 [ 6794.811046] [<c1077588>] shrink_slab+0xd2/0x13f [ 6794.811046] [<c1078cef>] kswapd+0x37d/0x4e9 [ 6794.811046] [<c104028f>] ? autoremove_wake_function+0x0/0x2f [ 6794.811046] [<c1078972>] ? kswapd+0x0/0x4e9 [ 6794.811046] [<c103ffbc>] kthread+0x60/0x65 [ 6794.811046] [<c103ff5c>] ? kthread+0x0/0x65 [ 6794.811046] [<c1002bba>] kernel_thread_helper+0x6/0x10 Don''t know if this is related to the problem. regards, Johannes -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 07/15/2010 12:35 PM, Chris Mason wrote:> On Thu, Jul 15, 2010 at 09:32:12PM +0200, Johannes Hirte wrote: > >> Am Donnerstag 15 Juli 2010, 21:03:09 schrieb Chris Mason: >> >>> On Thu, Jul 15, 2010 at 08:30:17PM +0200, Johannes Hirte wrote: >>> >>>> Am Dienstag 13 Juli 2010, 14:23:58 schrieb Johannes Hirte: >>>> >>>>> ino 1959333 off 898342912 csum 4271223884 private 4271223883 > Great. The bad csums are all just one bit off, that can''t be an > accident. When were they written (which kernel?). Did you boot a 32 > bit kernel on there at any time? >I''ve seen this as well, with three files. In all instances, csum == *private + 1. Here are the unique lines from dmesg: [32700.980806] btrfs csum failed ino 320113 off 55889920 csum 2415136266 private 2415136265 [32735.751112] btrfs csum failed ino 1731630 off 24776704 csum 1385284137 private 1385284136 [32738.777624] btrfs csum failed ino 2495707 off 171790336 csum 1385781806 private 1385781805 All three files are from when I first transitioned to btrfs (or more accurately, they are clones of those files I made to hold onto a copy of the corrupted version). Since the vast majority of my disk usage comes from the transition anyway, I can''t be sure this is due to a problem only present at that time. I believe I was running 2.6.34 when I copied my files over to my new btrfs partition, but I''m going from memory here. My btrfs partition has never been touched by a 32-bit kernel. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, 15 Jul 2010 20:14:51 +0200, Johannes Hirte wrote:> Am Donnerstag 15 Juli 2010, 02:11:04 schrieb Dave Chinner: >> On Wed, Jul 14, 2010 at 05:25:23PM +0200, Johannes Hirte wrote: >>> Am Donnerstag 08 Juli 2010, 16:31:09 schrieb Chris Mason: >>> I''m not sure if btrfs is to blame for this error. After the errors I >>> switched to XFS on this system and got now this error: >>> >>> ls -l .kde4/share/apps/akregator/data/ >>> ls: cannot access .kde4/share/apps/akregator/data/feeds.opml: Structure >>> needs cleaning >>> total 4 >>> ?????????? ? ? ? ? ? feeds.opml >> >> What is the error reported in dmesg when the XFS filesytem shuts down? > > Nothing. I double checked the logs. There are only the messages when mounting > the filesystem. No other errors are reported than the inaccessible file and the > output from xfs_check.Is there anything wrong with your disks or memory? Sometimes the bad memory can break the filesystem. I have met this kind of problem some time ago. If there is no problem with your disk and memory, Could you tell us the parameter of mkfs.btrfs and mount? Thanks Miao Xie -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Am Montag 19 Juli 2010, 10:01:46 schrieb Miao Xie:> On Thu, 15 Jul 2010 20:14:51 +0200, Johannes Hirte wrote: > > Am Donnerstag 15 Juli 2010, 02:11:04 schrieb Dave Chinner: > >> On Wed, Jul 14, 2010 at 05:25:23PM +0200, Johannes Hirte wrote: > >>> Am Donnerstag 08 Juli 2010, 16:31:09 schrieb Chris Mason: > >>> I''m not sure if btrfs is to blame for this error. After the errors I > >>> switched to XFS on this system and got now this error: > >>> > >>> ls -l .kde4/share/apps/akregator/data/ > >>> ls: cannot access .kde4/share/apps/akregator/data/feeds.opml: Structure > >>> needs cleaning > >>> total 4 > >>> ?????????? ? ? ? ? ? feeds.opml > >> > >> What is the error reported in dmesg when the XFS filesytem shuts down? > > > > Nothing. I double checked the logs. There are only the messages when > > mounting the filesystem. No other errors are reported than the > > inaccessible file and the output from xfs_check. > > Is there anything wrong with your disks or memory? > Sometimes the bad memory can break the filesystem. I have met this kind of > problem some time ago.I don''t think that''s the case. I''ve checked the RAM with memtest86+ and got no errors. I got the errors with two different disks, the first one with btrfs the second one now with XFS. Before changing to the second disk, I''ve run badblocks on it to be sure it has no errors.> > If there is no problem with your disk and memory, Could you tell us the > parameter of mkfs.btrfs and mount?I''m not sure what parameters I''ve used for mkbtrfs. It was either none ore ''-m single''. mount parameters are only noatime. Some time ago I''ve played a little with max_inline. On the actual disk with XFS I got now some more errors on my root-fs. Similar error on one file: ls: cannot access /var/tmp/portage/app- office/krita-2.2.1/work/krita-2.2.1/krita/image/tiles3/tests/dm_consistancy_test/dm_consistancy_test.pr: Invalid argument xfs_check shows on this fs: localhost ~ # xfs_check /dev/sda1 agi unlinked bucket 10 is 7279754 in ag 0 (inode=7279754) agi unlinked bucket 11 is 7279755 in ag 0 (inode=7279755) dir 91466358 entry dm_consistancy_test.pr bad inode number 1862628266 dir 91466358 size is 36, should be 35 agi unlinked bucket 48 is 11677104 in ag 2 (inode=78785968) agi unlinked bucket 49 is 11677105 in ag 2 (inode=78785969) agi unlinked bucket 50 is 11677106 in ag 2 (inode=78785970) agi unlinked bucket 51 is 11677107 in ag 2 (inode=78785971) agi unlinked bucket 52 is 11677108 in ag 2 (inode=78785972) agi unlinked bucket 53 is 11677109 in ag 2 (inode=78785973) agi unlinked bucket 54 is 11677110 in ag 2 (inode=78785974) agi unlinked bucket 55 is 11677111 in ag 2 (inode=78785975) agi unlinked bucket 58 is 11677114 in ag 2 (inode=78785978) agi unlinked bucket 59 is 11677115 in ag 2 (inode=78785979) agi unlinked bucket 60 is 11677116 in ag 2 (inode=78785980) agi unlinked bucket 61 is 11677117 in ag 2 (inode=78785981) allocated inode 7279754 has 0 link count allocated inode 7279755 has 0 link count disconnected inode 91466360, nlink 1 allocated inode 78785968 has 0 link count allocated inode 78785969 has 0 link count allocated inode 78785970 has 0 link count allocated inode 78785971 has 0 link count allocated inode 78785972 has 0 link count allocated inode 78785973 has 0 link count allocated inode 78785974 has 0 link count allocated inode 78785975 has 0 link count allocated inode 78785978 has 0 link count allocated inode 78785979 has 0 link count allocated inode 78785980 has 0 link count allocated inode 78785981 has 0 link count And again I don''t find any related message in dmesg. regards, Johannes
On 22 July 2010 19:07, Johannes Hirte <johannes.hirte@fem.tu-ilmenau.de> wrote:> Am Montag 19 Juli 2010, 10:01:46 schrieb Miao Xie: >> On Thu, 15 Jul 2010 20:14:51 +0200, Johannes Hirte wrote: >> > Am Donnerstag 15 Juli 2010, 02:11:04 schrieb Dave Chinner: >> >> On Wed, Jul 14, 2010 at 05:25:23PM +0200, Johannes Hirte wrote: >> >>> Am Donnerstag 08 Juli 2010, 16:31:09 schrieb Chris Mason: >> >>> I''m not sure if btrfs is to blame for this error. After the errors I >> >>> switched to XFS on this system and got now this error: >> >>> >> >>> ls -l .kde4/share/apps/akregator/data/ >> >>> ls: cannot access .kde4/share/apps/akregator/data/feeds.opml: Structure >> >>> needs cleaning >> >>> total 4 >> >>> ?????????? ? ? ? ? ? feeds.opml >> >> >> >> What is the error reported in dmesg when the XFS filesytem shuts down? >> > >> > Nothing. I double checked the logs. There are only the messages when >> > mounting the filesystem. No other errors are reported than the >> > inaccessible file and the output from xfs_check. >> >> Is there anything wrong with your disks or memory? >> Sometimes the bad memory can break the filesystem. I have met this kind of >> problem some time ago. > > I don''t think that''s the case. I''ve checked the RAM with memtest86+ and got no > errors. I got the errors with two different disks, the first one with btrfs the > second one now with XFS. Before changing to the second disk, I''ve run > badblocks on it to be sure it has no errors.There are some known-buggy chipsets also. One still around is the Nvidia CK804/MCP55, under certain patterns of spatially-local pending reads and writes to the memory controller, a 64-byte request would occasionally be returned with the wrong offset. I was hitting it with some 27-Gbit adapters and managed to capture it on a PCI-e protocol analyser. Rsync between network and local disk would hit sometimes too. -- Daniel J Blueman -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Jul 22, 2010 at 2:07 PM, Johannes Hirte <johannes.hirte@fem.tu-ilmenau.de> wrote:>> Is there anything wrong with your disks or memory? >> Sometimes the bad memory can break the filesystem. I have met this kind of >> problem some time ago. > > I don''t think that''s the case. I''ve checked the RAM with memtest86+ and got no > errors. I got the errors with two different disks, the first one with btrfs the > second one now with XFS. Before changing to the second disk, I''ve run > badblocks on it to be sure it has no errors.You might also try kmemcheck. There''s a good chance that a bug that scribbles random memory shows up as FS corruption. I have had ext4 corruption due to an inotify bug, which kmemcheck found on the first try. -- Bob Copeland %% www.bobcopeland.com
Am Donnerstag 22 Juli 2010, 20:07:23 schrieb Johannes Hirte:> Am Montag 19 Juli 2010, 10:01:46 schrieb Miao Xie: > > On Thu, 15 Jul 2010 20:14:51 +0200, Johannes Hirte wrote: > > > Am Donnerstag 15 Juli 2010, 02:11:04 schrieb Dave Chinner: > > >> On Wed, Jul 14, 2010 at 05:25:23PM +0200, Johannes Hirte wrote: > > >>> Am Donnerstag 08 Juli 2010, 16:31:09 schrieb Chris Mason: > > >>> I''m not sure if btrfs is to blame for this error. After the errors I > > >>> switched to XFS on this system and got now this error: > > >>> > > >>> ls -l .kde4/share/apps/akregator/data/ > > >>> ls: cannot access .kde4/share/apps/akregator/data/feeds.opml: > > >>> Structure needs cleaning > > >>> total 4 > > >>> ?????????? ? ? ? ? ? feeds.opml > > >> > > >> What is the error reported in dmesg when the XFS filesytem shuts down? > > > > > > Nothing. I double checked the logs. There are only the messages when > > > mounting the filesystem. No other errors are reported than the > > > inaccessible file and the output from xfs_check. > > > > Is there anything wrong with your disks or memory? > > Sometimes the bad memory can break the filesystem. I have met this kind > > of problem some time ago. > > I don''t think that''s the case. I''ve checked the RAM with memtest86+ and got > no errors. I got the errors with two different disks, the first one with > btrfs the second one now with XFS. Before changing to the second disk, > I''ve run badblocks on it to be sure it has no errors.I think I''ve found it. The bug was introduced by commit 7f0e7bed936a0c422641a046551829a01341dd80 Author: Christoph Hellwig <hch@lst.de> Date: Tue Jun 8 18:14:34 2010 +0200 writeback: fix writeback completion notifications The code dealing with bdi_work->state and completion of a bdi_work is a major mess currently. This patch makes sure we directly use one set of flags to deal with it, and use it consistently, which means: - always notify about completion from the rcu callback. We only ever wait for it from on-stack callers, so this simplification does not even cause a theoretical slowdown currently. It also makes sure we don''t miss out on the notification if we ever add other callers to wait for it. - make earlier completion notification depending on the on-stack allocation, not the sync mode. If we introduce new callers that want to do WB_SYNC_NONE writeback from on-stack callers this will be nessecary. Also rename bdi_wait_on_work_clear to bdi_wait_on_work_done and inline a few small functions into their only caller to make the code understandable. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com> and seems to be fixed by commit 83ba7b071f30f7c01f72518ad72d5cd203c27502 Author: Christoph Hellwig <hch@lst.de> Date: Tue Jul 6 08:59:53 2010 +0200 writeback: simplify the write back thread queue First remove items from work_list as soon as we start working on them.This means we don''t have to track any pending or visited state and can get rid of all the RCU magic freeing the work items - we can simply free them once the operation has finished. Second use a real completion for tracking synchronous requests - if the caller sets the completion pointer we complete it, otherwise use it as a boolean indicator that we can free the work item directly. Third unify struct wb_writeback_args and struct bdi_work into a single data structure, wb_writeback_work. Previous we set all parameters into a struct wb_writeback_args, copied it into struct bdi_work, copied it again on the stack to use it there. Instead of just allocate one structure dynamically or on the stack and use it all the way through the stack. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com> I was able to reproduce the bug by unpacking a big tar-file and deleting this files multiple times. Normally with btrfs the kernel crashed within 20 runs. After commit 83ba7b071f30f7c01f72518ad72d5cd203c27502 it survived more than 500 runs. regards, Johannes -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 07/29/2010 07:09 PM, Johannes Hirte wrote:> Am Donnerstag 22 Juli 2010, 20:07:23 schrieb Johannes Hirte: >> Am Montag 19 Juli 2010, 10:01:46 schrieb Miao Xie: >>> On Thu, 15 Jul 2010 20:14:51 +0200, Johannes Hirte wrote: >>>> Am Donnerstag 15 Juli 2010, 02:11:04 schrieb Dave Chinner: >>>>> On Wed, Jul 14, 2010 at 05:25:23PM +0200, Johannes Hirte wrote: >>>>>> Am Donnerstag 08 Juli 2010, 16:31:09 schrieb Chris Mason: >>>>>> I''m not sure if btrfs is to blame for this error. After the errors I >>>>>> switched to XFS on this system and got now this error: >>>>>> >>>>>> ls -l .kde4/share/apps/akregator/data/ >>>>>> ls: cannot access .kde4/share/apps/akregator/data/feeds.opml: >>>>>> Structure needs cleaning >>>>>> total 4 >>>>>> ?????????? ? ? ? ? ? feeds.opml >>>>> >>>>> What is the error reported in dmesg when the XFS filesytem shuts down? >>>> >>>> Nothing. I double checked the logs. There are only the messages when >>>> mounting the filesystem. No other errors are reported than the >>>> inaccessible file and the output from xfs_check. >>> >>> Is there anything wrong with your disks or memory? >>> Sometimes the bad memory can break the filesystem. I have met this kind >>> of problem some time ago. >> >> I don''t think that''s the case. I''ve checked the RAM with memtest86+ and got >> no errors. I got the errors with two different disks, the first one with >> btrfs the second one now with XFS. Before changing to the second disk, >> I''ve run badblocks on it to be sure it has no errors. > > I think I''ve found it. The bug was introduced by > > commit 7f0e7bed936a0c422641a046551829a01341dd80 > Author: Christoph Hellwig <hch@lst.de> > Date: Tue Jun 8 18:14:34 2010 +0200 > > writeback: fix writeback completion notifications > > The code dealing with bdi_work->state and completion of a bdi_work is a > major mess currently. This patch makes sure we directly use one set of > flags to deal with it, and use it consistently, which means: > > - always notify about completion from the rcu callback. We only ever > wait for it from on-stack callers, so this simplification does not > even cause a theoretical slowdown currently. It also makes sure we > don''t miss out on the notification if we ever add other callers to > wait for it. > - make earlier completion notification depending on the on-stack > allocation, not the sync mode. If we introduce new callers that > want to do WB_SYNC_NONE writeback from on-stack callers this will > be nessecary. > > Also rename bdi_wait_on_work_clear to bdi_wait_on_work_done and inline > a few small functions into their only caller to make the code > understandable. > > Signed-off-by: Christoph Hellwig <hch@lst.de> > Signed-off-by: Jens Axboe <jaxboe@fusionio.com> > > and seems to be fixed by > > commit 83ba7b071f30f7c01f72518ad72d5cd203c27502 > Author: Christoph Hellwig <hch@lst.de> > Date: Tue Jul 6 08:59:53 2010 +0200 > > writeback: simplify the write back thread queue > > First remove items from work_list as soon as we start working on them.This > means we don''t have to track any pending or visited state and can get > rid of all the RCU magic freeing the work items - we can simply free > them once the operation has finished. Second use a real completion for > tracking synchronous requests - if the caller sets the completion pointer > we complete it, otherwise use it as a boolean indicator that we can free > the work item directly. Third unify struct wb_writeback_args and struct > bdi_work into a single data structure, wb_writeback_work. Previous we > set all parameters into a struct wb_writeback_args, copied it into > struct bdi_work, copied it again on the stack to use it there. Instead > of just allocate one structure dynamically or on the stack and use it > all the way through the stack. > > Signed-off-by: Christoph Hellwig <hch@lst.de> > Signed-off-by: Jens Axboe <jaxboe@fusionio.com> > > I was able to reproduce the bug by unpacking a big tar-file and > deleting this files multiple times. Normally with btrfs the kernel > crashed within 20 runs. After commit > 83ba7b071f30f7c01f72518ad72d5cd203c27502 it survived more than 500 > runs.Makes sense, that first commit would potentially pass in stack cruft as the wbc arg. So I think we can safely consider it fixed now. -- Jens Axboe
On 17 July 2010 06:55, Brian Rogers <brian@xyzw.org> wrote:> On 07/15/2010 12:35 PM, Chris Mason wrote: >> >> On Thu, Jul 15, 2010 at 09:32:12PM +0200, Johannes Hirte wrote: >> >>> >>> Am Donnerstag 15 Juli 2010, 21:03:09 schrieb Chris Mason: >>> >>>> >>>> On Thu, Jul 15, 2010 at 08:30:17PM +0200, Johannes Hirte wrote: >>>> >>>>> >>>>> Am Dienstag 13 Juli 2010, 14:23:58 schrieb Johannes Hirte: >>>>> >>>>>> >>>>>> ino 1959333 off 898342912 csum 4271223884 private 4271223883 >> >> Great. The bad csums are all just one bit off, that can''t be an >> accident. When were they written (which kernel?). Did you boot a 32 >> bit kernel on there at any time? >> > > I''ve seen this as well, with three files. In all instances, csum == *private > + 1. Here are the unique lines from dmesg: > > [32700.980806] btrfs csum failed ino 320113 off 55889920 csum 2415136266 > private 2415136265 > [32735.751112] btrfs csum failed ino 1731630 off 24776704 csum 1385284137 > private 1385284136 > [32738.777624] btrfs csum failed ino 2495707 off 171790336 csum 1385781806 > private 1385781805 > > All three files are from when I first transitioned to btrfs (or more > accurately, they are clones of those files I made to hold onto a copy of the > corrupted version). Since the vast majority of my disk usage comes from the > transition anyway, I can''t be sure this is due to a problem only present at > that time. I believe I was running 2.6.34 when I copied my files over to my > new btrfs partition, but I''m going from memory here. > > My btrfs partition has never been touched by a 32-bit kernel. > > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html >I am also getting this now: btrfs csum failed ino 288 off 799268864 csum 4054934499 private 4054934498 btrfs csum failed ino 288 off 799268864 csum 4054934499 private 4054934498 btrfs csum failed ino 288 off 799268864 csum 4054934499 private 4054934498 btrfs csum failed ino 288 off 799268864 csum 4054934499 private 4054934498 A bit unrelated, but I was doing this while doing a rebalance across my drives. RAID-0. I think it''s a different issue though, but I''ll go ahead and post my dmesg output anyway: ------------[ cut here ]------------ kernel BUG at fs/btrfs/volumes.c:1980! invalid opcode: 0000 [#1] PREEMPT SMP last sysfs file: /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq CPU 1 Modules linked in: crc32c_intel ipv6 microcode cifs cpufreq_ondemand hwmon_vid coretemp fuse ext2 mbcache raid1 md_mod usbhid hid rndis_wlan cfg80211 rfkill rndis_host cdc_ether usbnet snd_hda_codec_realtek snd_usb_audio snd_usbmidi_lib snd_hda_intel snd_rawmidi snd_seq_dummy snd_seq_oss snd_hda_codec snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss nvidia(P) snd_hwdep snd_pcm snd_timer snd uhci_hcd i2c_i801 acpi_cpufreq soundcore freq_table ehci_hcd tpm_tis i2c_core tpm i7core_edac r8169 snd_page_alloc mperf iTCO_wdt mii iTCO_vendor_support wmi tpm_bios usbcore edac_core processor fan thermal pcspkr button evdev rtc_cmos rtc_core rtc_lib btrfs zlib_deflate crc32c libcrc32c sr_mod sg cdrom sd_mod ata_piix pata_acpi ata_generic libata scsi_mod Pid: 3480, comm: btrfs Tainted: P 2.6.35-ARCH #1 121-BL-E756/OEM RIP: 0010:[<ffffffffa018158a>] [<ffffffffa018158a>] btrfs_balance+0x24a/0x260 [btrfs] RSP: 0018:ffff8801d5675d48 EFLAGS: 00010282 RAX: 00000000fffffffb RBX: ffff8801d8b60ab0 RCX: ffffea00054814a8 RDX: 0000000000000004 RSI: ffffffff81530df0 RDI: 0000000000000282 RBP: ffff8801d5675dc8 R08: 0000000000000002 R09: 0000000000000002 R10: ffff88000001d240 R11: 0000000000000000 R12: ffff8801d9eec800 R13: 0000000000000000 R14: ffff8801d5675d78 R15: 00000659f9210000 FS: 00007f7c3aef8740(0000) GS:ffff880001820000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 00007ff55a388000 CR3: 00000001d8d26000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process btrfs (pid: 3480, threadinfo ffff8801d5674000, task ffff8801d77ce120) Stack: ffff8801d9eed000 0000000000000100 0000000000000100 000659f9210000e4 <0> 00007f7c3a84a200 0000000000000000 0000000000000100 00065a7920ffffe4 <0> ffff8801d5675e00 ffffffff810f8ea3 0000000000000000 ffff8801bebd6e40 Call Trace: [<ffffffff810f8ea3>] ? handle_mm_fault+0x1c3/0xab0 [<ffffffffa0188520>] btrfs_ioctl+0x2c0/0x940 [btrfs] [<ffffffff811334ac>] vfs_ioctl+0x3c/0xd0 [<ffffffff81133aac>] do_vfs_ioctl+0x7c/0x520 [<ffffffff81133fd1>] sys_ioctl+0x81/0xa0 [<ffffffff81009e82>] system_call_fastpath+0x16/0x1b Code: 6d 62 fb ff 48 8b 45 80 48 8b b8 28 01 00 00 48 81 c7 80 1c 00 00 e8 a6 fd 1e e1 e9 fe fd ff ff 45 31 ed eb d7 0f 0b 85 c0 74 a7 <0f> 0b 0f 0b 0f 0b 0f 0b 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 RIP [<ffffffffa018158a>] btrfs_balance+0x24a/0x260 [btrfs] RSP <ffff8801d5675d48> ---[ end trace 7501e28b9295d333 ]--- # btrfs-show Label: none uuid: 405f0d0b-ee4d-4426-9826-d2580d0c8d6c Total devices 2 FS bytes used 178.06GB devid 5 size 232.55GB used 91.63GB path /dev/sde3 devid 7 size 189.82GB used 91.63GB path /dev/sda2 Btrfs Btrfs v0.19 Regards, Sebastian J. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, 2010-07-08 at 10:31 -0400, Chris Mason wrote:> Neither Yan nor I have been able to reproduce this locally, but a few > people have now hit it. Johannes, are you available to try out a > debugging kernel to try and track this down?I have a machine you can log into, with a spare file system that causes this. It''s 200GiB so non-trivial to upload, unless you have a tool like the xfs one which extracts just the important data structures and leaves all the data extents and free space as zero. It was created with the MeeGo 2.6.33.6 kernel -- which I don''t think contains commit 7f0e7bed which is later alleged to be the cause. I''ve also tried booting this machine with a current kernel from git, which includes commit 83ba7b07 which is alleged to be a fix -- and btrfs still oopses just the same. btrfsck and btrfs-dump-tree both abort with: extent-tree.c:1755: update_space_info: Assertion `!(found->total_bytes < found->bytes_used)'' failed. I can mount it read-only though and read certain things out of it. But when I boot from it, I hit the BUG(). -- David Woodhouse Open Source Technology Centre David.Woodhouse@intel.com Intel Corporation
On 08/10/2010 02:06 PM, Sebastian ''gonX'' Jensen wrote:> On 17 July 2010 06:55, Brian Rogers<brian@xyzw.org> wrote: >> On 07/15/2010 12:35 PM, Chris Mason wrote: >>> On Thu, Jul 15, 2010 at 09:32:12PM +0200, Johannes Hirte wrote: >>> >>>> Am Donnerstag 15 Juli 2010, 21:03:09 schrieb Chris Mason: >>>> >>>>> On Thu, Jul 15, 2010 at 08:30:17PM +0200, Johannes Hirte wrote: >>>>> >>>>>> Am Dienstag 13 Juli 2010, 14:23:58 schrieb Johannes Hirte: >>>>>> >>>>>>> ino 1959333 off 898342912 csum 4271223884 private 4271223883 >>> Great. The bad csums are all just one bit off, that can''t be an >>> accident. When were they written (which kernel?). Did you boot a 32 >>> bit kernel on there at any time? >>> >> I''ve seen this as well, with three files. In all instances, csum == *private >> + 1. Here are the unique lines from dmesg: >> >> [32700.980806] btrfs csum failed ino 320113 off 55889920 csum 2415136266 >> private 2415136265 >> [32735.751112] btrfs csum failed ino 1731630 off 24776704 csum 1385284137 >> private 1385284136 >> [32738.777624] btrfs csum failed ino 2495707 off 171790336 csum 1385781806 >> private 1385781805 >> >> All three files are from when I first transitioned to btrfs (or more >> accurately, they are clones of those files I made to hold onto a copy of the >> corrupted version). Since the vast majority of my disk usage comes from the >> transition anyway, I can''t be sure this is due to a problem only present at >> that time. I believe I was running 2.6.34 when I copied my files over to my >> new btrfs partition, but I''m going from memory here. >> >> My btrfs partition has never been touched by a 32-bit kernel. > I am also getting this now: > > btrfs csum failed ino 288 off 799268864 csum 4054934499 private 4054934498 > btrfs csum failed ino 288 off 799268864 csum 4054934499 private 4054934498 > btrfs csum failed ino 288 off 799268864 csum 4054934499 private 4054934498 > btrfs csum failed ino 288 off 799268864 csum 4054934499 private 4054934498 > > A bit unrelated, but I was doing this while doing a rebalance across > my drives. RAID-0.I get this as well on single-drive btrfs. I cleaned out all the files that produce a csum error when read normally, but I still get the error during a rebalance. I can read all the files on any subvolume with the matching inode number just fine. If I delete the mentioned files or replace them with new copies and do a rebalance again, I''ll get the same error again on a different inode number. I did two rebalance runs in a row (with a reboot between each) without deleting the problem inode to see if it would fail in the same place each time. The inode number varied, but the block group, offset, and checksums were the same: Run 1: [63978.519791] btrfs: relocating block group 511130468352 flags 1 [63980.401249] btrfs csum failed ino 418 off 9949184 csum 1385781806 private 1385781805 [63980.499024] btrfs csum failed ino 418 off 9949184 csum 1385781806 private 1385781805 [63980.535384] btrfs csum failed ino 418 off 9949184 csum 1385781806 private 1385781805 [63980.570196] btrfs csum failed ino 418 off 9949184 csum 1385781806 private 1385781805 Run 2: [51317.967011] btrfs: relocating block group 511130468352 flags 1 [51321.298448] btrfs csum failed ino 415 off 9949184 csum 1385781806 private 1385781805 [51321.807357] btrfs csum failed ino 415 off 9949184 csum 1385781806 private 1385781805 [51322.707362] btrfs csum failed ino 415 off 9949184 csum 1385781806 private 1385781805 [51323.318478] btrfs csum failed ino 415 off 9949184 csum 1385781806 private 1385781805 These files should have different contents (unfortunately I already deleted them by now), so I don''t know what they''re doing at the same offset, sharing the same checksum... Could these files both be inlined in the same chunk of metadata, or does this mean something else? Also, I wonder if the miscalculated checksum is something that happens non-deterministically, or if it''s just that the inodes were processed in a different order the second time... It certainly seems significant that the inode number is always low. The balance always runs for quite a while before hitting a problem, and since it appears to start from the end of the disk, it seems that only the earliest and lowest-numbered inodes at the beginning of the disk can cause this problem. Complete crash from dmesg: [51317.967011] btrfs: relocating block group 511130468352 flags 1 [51321.298448] btrfs csum failed ino 415 off 9949184 csum 1385781806 private 1385781805 [51321.807357] btrfs csum failed ino 415 off 9949184 csum 1385781806 private 1385781805 [51322.707362] btrfs csum failed ino 415 off 9949184 csum 1385781806 private 1385781805 [51323.318478] btrfs csum failed ino 415 off 9949184 csum 1385781806 private 1385781805 [51327.954315] ------------[ cut here ]------------ [51327.954322] kernel BUG at /build/buildd/linux-2.6.35/fs/btrfs/volumes.c:1980! [51327.954326] invalid opcode: 0000 [#1] SMP [51327.954330] last sysfs file: /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A08:00/device:1f/PNP0C0A:00/power_supply/BAT1/charge_full [51327.954334] CPU 0 [51327.954336] Modules linked in: ip6table_filter ip6_tables hidp hid binfmt_misc rfcomm parport_pc ppdev sco bnep l2cap ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge stp microcode joydev i915 snd_hda_codec_si3054 snd_hda_codec_realtek drm_kms_helper drm i2c_algo_bit snd_hda_intel snd_hda_codec arc4 snd_hwdep uinput snd_pcm iwl3945 video snd_seq_midi snd_rawmidi snd_seq_midi_event iwlcore snd_seq snd_timer snd_seq_device lp snd mac80211 soundcore output psmouse btusb intel_agp serio_raw cfg80211 bluetooth snd_page_alloc parport btrfs zlib_deflate firewire_ohci firewire_core ahci crc_itu_t sdhci_pci sdhci led_class tg3 crc32c libahci libcrc32c [51327.954396] [51327.954400] Pid: 15426, comm: btrfs Not tainted 2.6.35-15-generic #21-Ubuntu IFT01 /N/A [51327.954404] RIP: 0010:[<ffffffffa00cc25f>] [<ffffffffa00cc25f>] btrfs_balance+0x24f/0x260 [btrfs] [51327.954425] RSP: 0018:ffff88012eb95dc8 EFLAGS: 00010282 [51327.954428] RAX: 00000000fffffffb RBX: ffff880037c78480 RCX: 0200000000004081 [51327.954431] RDX: 0000000000000003 RSI: ffffea0003ea1640 RDI: 0000000000000282 [51327.954434] RBP: ffff88012eb95e48 R08: 0000000000000000 R09: 0000000000000000 [51327.954437] R10: 0000000000000069 R11: 0000000000000001 R12: ffff880138da6800 [51327.954439] R13: 0000000000000000 R14: 0000007701c00000 R15: ffff88012eb95df8 [51327.954443] FS: 00007fbea8710740(0000) GS:ffff880001e00000(0000) knlGS:0000000000000000 [51327.954446] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [51327.954449] CR2: 00007f99c0088cc1 CR3: 0000000114bee000 CR4: 00000000000006f0 [51327.954452] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [51327.954455] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [51327.954458] Process btrfs (pid: 15426, threadinfo ffff88012eb94000, task ffff88003fb6adc0) [51327.954460] Stack: [51327.954462] ffff880138da7000 0000000000000100 0000000000000100 00007701c00000e4 [51327.954467] <0> ffff880100001c00 ffff88013fc31400 0000000000000100 0000e15b3fffffe4 [51327.954473] <0> ffff88012eb95e00 ffffffff811280f5 ffff8801315f5038 ffff880115d35600 [51327.954478] Call Trace: [51327.954486] [<ffffffff811280f5>] ? page_add_new_anon_rmap+0x95/0xa0 [51327.954500] [<ffffffffa00d44b0>] btrfs_ioctl+0x2c0/0x4c0 [btrfs] [51327.954505] [<ffffffff811615ad>] vfs_ioctl+0x3d/0xd0 [51327.954509] [<ffffffff81161e81>] do_vfs_ioctl+0x81/0x340 [51327.954514] [<ffffffff8158c8ae>] ? do_page_fault+0x15e/0x350 [51327.954517] [<ffffffff811621c1>] sys_ioctl+0x81/0xa0 [51327.954523] [<ffffffff8100a0f2>] system_call_fastpath+0x16/0x1b [51327.954525] Code: fb ff 48 8b 45 80 48 8b b8 28 01 00 00 48 81 c7 20 1c 00 00 e8 e3 b0 4b e1 e9 00 fe ff ff 45 31 ed eb d7 0f 0b eb fe 85 c0 74 a5 <0f> 0b eb fe 0f 0b eb fe 0f 0b eb fe 0f 0b eb fe 90 55 48 89 e5 [51327.954567] RIP [<ffffffffa00cc25f>] btrfs_balance+0x24f/0x260 [btrfs] [51327.954580] RSP <ffff88012eb95dc8> [51327.954583] ---[ end trace 0bf81e832fde7349 ]--- -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 14 August 2010 09:05, Brian Rogers <brian@xyzw.org> wrote:> On 08/10/2010 02:06 PM, Sebastian ''gonX'' Jensen wrote: >> >> On 17 July 2010 06:55, Brian Rogers<brian@xyzw.org> wrote: >>> >>> On 07/15/2010 12:35 PM, Chris Mason wrote: >>>> >>>> On Thu, Jul 15, 2010 at 09:32:12PM +0200, Johannes Hirte wrote: >>>> >>>>> Am Donnerstag 15 Juli 2010, 21:03:09 schrieb Chris Mason: >>>>> >>>>>> On Thu, Jul 15, 2010 at 08:30:17PM +0200, Johannes Hirte wrote: >>>>>> >>>>>>> Am Dienstag 13 Juli 2010, 14:23:58 schrieb Johannes Hirte: >>>>>>> >>>>>>>> ino 1959333 off 898342912 csum 4271223884 private 4271223883 >>>> >>>> Great. The bad csums are all just one bit off, that can''t be an >>>> accident. When were they written (which kernel?). Did you boot a 32 >>>> bit kernel on there at any time? >>>> >>> I''ve seen this as well, with three files. In all instances, csum =>>> *private >>> + 1. Here are the unique lines from dmesg: >>> >>> [32700.980806] btrfs csum failed ino 320113 off 55889920 csum 2415136266 >>> private 2415136265 >>> [32735.751112] btrfs csum failed ino 1731630 off 24776704 csum 1385284137 >>> private 1385284136 >>> [32738.777624] btrfs csum failed ino 2495707 off 171790336 csum >>> 1385781806 >>> private 1385781805 >>> >>> All three files are from when I first transitioned to btrfs (or more >>> accurately, they are clones of those files I made to hold onto a copy of >>> the >>> corrupted version). Since the vast majority of my disk usage comes from >>> the >>> transition anyway, I can''t be sure this is due to a problem only present >>> at >>> that time. I believe I was running 2.6.34 when I copied my files over to >>> my >>> new btrfs partition, but I''m going from memory here. >>> >>> My btrfs partition has never been touched by a 32-bit kernel. >> >> I am also getting this now: >> >> btrfs csum failed ino 288 off 799268864 csum 4054934499 private 4054934498 >> btrfs csum failed ino 288 off 799268864 csum 4054934499 private 4054934498 >> btrfs csum failed ino 288 off 799268864 csum 4054934499 private 4054934498 >> btrfs csum failed ino 288 off 799268864 csum 4054934499 private 4054934498 >> >> A bit unrelated, but I was doing this while doing a rebalance across >> my drives. RAID-0. > > I get this as well on single-drive btrfs. I cleaned out all the files that > produce a csum error when read normally, but I still get the error during a > rebalance. I can read all the files on any subvolume with the matching inode > number just fine. If I delete the mentioned files or replace them with new > copies and do a rebalance again, I''ll get the same error again on a > different inode number. > > I did two rebalance runs in a row (with a reboot between each) without > deleting the problem inode to see if it would fail in the same place each > time. The inode number varied, but the block group, offset, and checksums > were the same: > > Run 1: > [63978.519791] btrfs: relocating block group 511130468352 flags 1 > [63980.401249] btrfs csum failed ino 418 off 9949184 csum 1385781806 private > 1385781805 > [63980.499024] btrfs csum failed ino 418 off 9949184 csum 1385781806 private > 1385781805 > [63980.535384] btrfs csum failed ino 418 off 9949184 csum 1385781806 private > 1385781805 > [63980.570196] btrfs csum failed ino 418 off 9949184 csum 1385781806 private > 1385781805 > > Run 2: > [51317.967011] btrfs: relocating block group 511130468352 flags 1 > [51321.298448] btrfs csum failed ino 415 off 9949184 csum 1385781806 private > 1385781805 > [51321.807357] btrfs csum failed ino 415 off 9949184 csum 1385781806 private > 1385781805 > [51322.707362] btrfs csum failed ino 415 off 9949184 csum 1385781806 private > 1385781805 > [51323.318478] btrfs csum failed ino 415 off 9949184 csum 1385781806 private > 1385781805 > > These files should have different contents (unfortunately I already deleted > them by now), so I don''t know what they''re doing at the same offset, sharing > the same checksum... Could these files both be inlined in the same chunk of > metadata, or does this mean something else? > > Also, I wonder if the miscalculated checksum is something that happens > non-deterministically, or if it''s just that the inodes were processed in a > different order the second time... > > It certainly seems significant that the inode number is always low. The > balance always runs for quite a while before hitting a problem, and since it > appears to start from the end of the disk, it seems that only the earliest > and lowest-numbered inodes at the beginning of the disk can cause this > problem. > > Complete crash from dmesg: > > [51317.967011] btrfs: relocating block group 511130468352 flags 1 > [51321.298448] btrfs csum failed ino 415 off 9949184 csum 1385781806 private > 1385781805 > [51321.807357] btrfs csum failed ino 415 off 9949184 csum 1385781806 private > 1385781805 > [51322.707362] btrfs csum failed ino 415 off 9949184 csum 1385781806 private > 1385781805 > [51323.318478] btrfs csum failed ino 415 off 9949184 csum 1385781806 private > 1385781805 > [51327.954315] ------------[ cut here ]------------ > [51327.954322] kernel BUG at > /build/buildd/linux-2.6.35/fs/btrfs/volumes.c:1980! > [51327.954326] invalid opcode: 0000 [#1] SMP > [51327.954330] last sysfs file: > /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A08:00/device:1f/PNP0C0A:00/power_supply/BAT1/charge_full > [51327.954334] CPU 0 > [51327.954336] Modules linked in: ip6table_filter ip6_tables hidp hid > binfmt_misc rfcomm parport_pc ppdev sco bnep l2cap ipt_MASQUERADE > iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack > ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge stp microcode > joydev i915 snd_hda_codec_si3054 snd_hda_codec_realtek drm_kms_helper drm > i2c_algo_bit snd_hda_intel snd_hda_codec arc4 snd_hwdep uinput snd_pcm > iwl3945 video snd_seq_midi snd_rawmidi snd_seq_midi_event iwlcore snd_seq > snd_timer snd_seq_device lp snd mac80211 soundcore output psmouse btusb > intel_agp serio_raw cfg80211 bluetooth snd_page_alloc parport btrfs > zlib_deflate firewire_ohci firewire_core ahci crc_itu_t sdhci_pci sdhci > led_class tg3 crc32c libahci libcrc32c > [51327.954396] > [51327.954400] Pid: 15426, comm: btrfs Not tainted 2.6.35-15-generic > #21-Ubuntu IFT01 /N/A > [51327.954404] RIP: 0010:[<ffffffffa00cc25f>] [<ffffffffa00cc25f>] > btrfs_balance+0x24f/0x260 [btrfs] > [51327.954425] RSP: 0018:ffff88012eb95dc8 EFLAGS: 00010282 > [51327.954428] RAX: 00000000fffffffb RBX: ffff880037c78480 RCX: > 0200000000004081 > [51327.954431] RDX: 0000000000000003 RSI: ffffea0003ea1640 RDI: > 0000000000000282 > [51327.954434] RBP: ffff88012eb95e48 R08: 0000000000000000 R09: > 0000000000000000 > [51327.954437] R10: 0000000000000069 R11: 0000000000000001 R12: > ffff880138da6800 > [51327.954439] R13: 0000000000000000 R14: 0000007701c00000 R15: > ffff88012eb95df8 > [51327.954443] FS: 00007fbea8710740(0000) GS:ffff880001e00000(0000) > knlGS:0000000000000000 > [51327.954446] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b > [51327.954449] CR2: 00007f99c0088cc1 CR3: 0000000114bee000 CR4: > 00000000000006f0 > [51327.954452] DR0: 0000000000000000 DR1: 0000000000000000 DR2: > 0000000000000000 > [51327.954455] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: > 0000000000000400 > [51327.954458] Process btrfs (pid: 15426, threadinfo ffff88012eb94000, task > ffff88003fb6adc0) > [51327.954460] Stack: > [51327.954462] ffff880138da7000 0000000000000100 0000000000000100 > 00007701c00000e4 > [51327.954467] <0> ffff880100001c00 ffff88013fc31400 0000000000000100 > 0000e15b3fffffe4 > [51327.954473] <0> ffff88012eb95e00 ffffffff811280f5 ffff8801315f5038 > ffff880115d35600 > [51327.954478] Call Trace: > [51327.954486] [<ffffffff811280f5>] ? page_add_new_anon_rmap+0x95/0xa0 > [51327.954500] [<ffffffffa00d44b0>] btrfs_ioctl+0x2c0/0x4c0 [btrfs] > [51327.954505] [<ffffffff811615ad>] vfs_ioctl+0x3d/0xd0 > [51327.954509] [<ffffffff81161e81>] do_vfs_ioctl+0x81/0x340 > [51327.954514] [<ffffffff8158c8ae>] ? do_page_fault+0x15e/0x350 > [51327.954517] [<ffffffff811621c1>] sys_ioctl+0x81/0xa0 > [51327.954523] [<ffffffff8100a0f2>] system_call_fastpath+0x16/0x1b > [51327.954525] Code: fb ff 48 8b 45 80 48 8b b8 28 01 00 00 48 81 c7 20 1c > 00 00 e8 e3 b0 4b e1 e9 00 fe ff ff 45 31 ed eb d7 0f 0b eb fe 85 c0 74 a5 > <0f> 0b eb fe 0f 0b eb fe 0f 0b eb fe 0f 0b eb fe 90 55 48 89 e5 > [51327.954567] RIP [<ffffffffa00cc25f>] btrfs_balance+0x24f/0x260 [btrfs] > [51327.954580] RSP <ffff88012eb95dc8> > [51327.954583] ---[ end trace 0bf81e832fde7349 ]--- > >Sorry to burst your bubble, but it''s definitely not only on the low-numbered inodes. It just segfaults when it comes across the first checksum error on the inodes while balancing, and this is something copying files won''t do. I''m in the process of moving most of my storage stored on my 2-drive RAID-0 btrfs array, and I get Input/Output errors from time to time. It''s a bit annoying, but I can live with it since I''ve kept backup of most of those files. The backup drive isn''t on btrfs, so the errors aren''t from that drive. This is what my dmesg looks like after having it all copied over: btrfs csum failed ino 388772 off 5146492928 csum 169329396 private 169329395 btrfs csum failed ino 388772 off 5146492928 csum 169329396 private 169329395 btrfs csum failed ino 388772 off 5146492928 csum 169329396 private 169329395 btrfs csum failed ino 388772 off 5146492928 csum 169329396 private 169329395 btrfs csum failed ino 388772 off 5146492928 csum 169329396 private 169329395 btrfs csum failed ino 388772 off 5146492928 csum 169329396 private 169329395 btrfs csum failed ino 388822 off 910540800 csum 2930498218 private 2930498217 btrfs csum failed ino 388822 off 910540800 csum 2930498218 private 2930498217 btrfs csum failed ino 388822 off 910540800 csum 2930498218 private 2930498217 btrfs csum failed ino 388822 off 910540800 csum 2930498218 private 2930498217 btrfs csum failed ino 388822 off 910540800 csum 2930498218 private 2930498217 btrfs csum failed ino 388822 off 910540800 csum 2930498218 private 2930498217 btrfs csum failed ino 388823 off 563412992 csum 3815793104 private 771608454 btrfs csum failed ino 388823 off 2966323200 csum 2865511448 private 2865511447 btrfs csum failed ino 388823 off 2966323200 csum 2865511448 private 2865511447 btrfs csum failed ino 388823 off 2966323200 csum 2865511448 private 2865511447 btrfs csum failed ino 388823 off 2966323200 csum 2865511448 private 2865511447 btrfs csum failed ino 389762 off 114298880 csum 350178036 private 350178035 btrfs csum failed ino 389762 off 114298880 csum 350178036 private 350178035 btrfs csum failed ino 389762 off 114298880 csum 350178036 private 350178035 btrfs csum failed ino 389762 off 114298880 csum 350178036 private 350178035 btrfs csum failed ino 389762 off 114298880 csum 350178036 private 350178035 btrfs csum failed ino 389762 off 114298880 csum 350178036 private 350178035 btrfs csum failed ino 389807 off 97808384 csum 558603966 private 313438526 btrfs csum failed ino 389807 off 97812480 csum 934061497 private 2103939083 btrfs csum failed ino 4290176 off 119853056 csum 2643001855 private 2643001854 btrfs csum failed ino 4290176 off 119853056 csum 2643001855 private 2643001854 btrfs csum failed ino 4290176 off 119853056 csum 2643001855 private 2643001854 btrfs csum failed ino 4290176 off 119853056 csum 2643001855 private 2643001854 btrfs csum failed ino 4290176 off 119853056 csum 2643001855 private 2643001854 btrfs csum failed ino 4290176 off 119853056 csum 2643001855 private 2643001854 btrfs csum failed ino 4290280 off 389795840 csum 257184388 private 2989030598 btrfs csum failed ino 4290334 off 22917120 csum 1416109476 private 4240914906 btrfs csum failed ino 4290415 off 171986944 csum 2046579276 private 2822934100 btrfs csum failed ino 409660 extent 40120546816 csum 4161269104 wanted 4161269103 mirror 0 btrfs csum failed ino 409660 extent 92755793920 csum 4161269104 wanted 4161269103 mirror 1 btrfs csum failed ino 409660 extent 92755793920 csum 4161269104 wanted 4161269103 mirror 1 btrfs csum failed ino 409660 extent 92755793920 csum 4161269104 wanted 4161269103 mirror 1 btrfs csum failed ino 409660 extent 92755793920 csum 4161269104 wanted 4161269103 mirror 1 btrfs csum failed ino 409660 extent 92755793920 csum 4161269104 wanted 4161269103 mirror 1 btrfs csum failed ino 409660 extent 92755793920 csum 4161269104 wanted 4161269103 mirror 1 btrfs csum failed ino 409660 extent 92755793920 csum 4161269104 wanted 4161269103 mirror 1 btrfs csum failed ino 409660 extent 92755793920 csum 4161269104 wanted 4161269103 mirror 0 btrfs csum failed ino 409660 extent 92755793920 csum 4161269104 wanted 4161269103 mirror 1 btrfs csum failed ino 409660 extent 92755793920 csum 4161269104 wanted 4161269103 mirror 1