I tried creating a multi-device btrfs filesystem for the first time (on Fedora 18 with 3.7.2-204.fc18.x86_64), and I ran into some problems. I had heard that btrfs is now reasonably stable, and though I expected to possibly see a problem here or there, I was a little surprised at just how many problems I encountered in such a short period of time. I now have about a thousand error messages in my kernel logs related to several different problems. Is this roughly the expected level of stability for btrfs with multiple devices, or am I just particularly lucky? :)

Am I correct in assuming that I'll need to switch to md for a few months and try btrfs again later, or are there known problems in the specific kernel I'm running that I could avoid by trying a different version?

For the sake of being specific, I'll detail a few of the problems I've hit:

These two may have been caused by a possibly faulty disk (I'm still trying to determine whether it was faulty or whether the bug was purely in btrfs):

https://bugzilla.redhat.com/show_bug.cgi?id=903794
https://bugzilla.redhat.com/show_bug.cgi?id=904143

This one was triggered when I tried to remove a possibly faulty disk:

https://bugzilla.redhat.com/show_bug.cgi?id=904197

With a freshly created filesystem, I got a kernel bug, associated with a hang in most filesystem operations. This occurred in the middle of ordinary operation and without any sort of hardware-related errors in the kernel logs.

https://bugzilla.redhat.com/show_bug.cgi?id=904223

I've noticed that a lot of the reports in the Fedora bugzilla and kernel bugzilla don't seem to include much discussion; is there any specific type of information that bug submitters should try to include to make the reports more helpful?

Thanks.
--
Andrew McNabb
http://www.mcnabbs.org/andrew/
PGP Fingerprint: 8A17 B57C 6879 1863 DE55 8012 AB4D 6098 8826 6868
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Jan 25, 2013 at 01:05:14PM -0700, Andrew McNabb wrote:
> I tried creating a multi-device btrfs filesystem for the first time (on
> Fedora 18 with 3.7.2-204.fc18.x86_64), and I ran into some problems. I
> had heard that btrfs is now reasonably stable, and though I expected to
> possibly see a problem here or there, I was a little surprised at just
> how many problems I encountered in such a short period of time. I now
> have about a thousand error messages in my kernel logs related to
> several different problems. Is this roughly the expected level of
> stability for btrfs with multiple devices, or am I just particularly
> lucky? :)
>
> Am I correct in assuming that I'll need to switch to md for a few months
> and try btrfs again later, or are there known problems in the specific
> kernel I'm running that I could avoid by trying a different version?
>
> For the sake of being specific, I'll detail a few of the problems I've
> hit:
>
> These two may have been caused by a possibly faulty disk (I'm still
> trying to determine whether it was faulty or whether the bug was purely
> in btrfs):
>
> https://bugzilla.redhat.com/show_bug.cgi?id=903794

This one is just an allocator warning because the relocator doesn't do the right accounting for relocation. It's just complaining; we need to fix it, but it won't keep it from working.

> https://bugzilla.redhat.com/show_bug.cgi?id=904143

This one, I'm almost certain (I have to check), was just a result of me making fsync faster and forgetting to remove this WARN_ON. It's fixed upstream. Again, nothing to worry about, but annoying.

>
> This one was triggered when I tried to remove a possibly faulty disk:
>
> https://bugzilla.redhat.com/show_bug.cgi?id=904197
>

OK, this is a bug, and I can fix it. Basically we tried to read from the faulty disk, it failed, we read from the other copy, and then we tried to write the good copy back to the failed disk. When we saw that the IO wasn't actually going to go to the bad disk, we panicked. Silly, but easy enough to understand and fix.

> With a freshly created filesystem, I got a kernel bug, associated with a
> hang in most filesystem operations. This occurred in the middle of
> ordinary operation and without any sort of hardware-related errors in
> the kernel logs.
>
> https://bugzilla.redhat.com/show_bug.cgi?id=904223
>

So this is from the fsync stuff, and I'm sure I fixed this somewhere, but I can't account for where I did it. Can you give btrfs-next a try and see if you can still reproduce it? Thanks,

Josef
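
The failure sequence described above for bug 904197 (a read from one copy fails, the other copy is read, and the good copy is written back) can be illustrated with a toy model. This is only a sketch of the general read-repair idea, not btrfs code: the `Device` class, the `missing` flag, and `read_with_repair` are hypothetical names invented for the illustration, and the choice to skip the write-back for a missing device is an assumption about what a fix might look like, not the actual patch.

```python
# Toy model of mirrored read-repair. NOT btrfs code; all names are hypothetical.
from dataclasses import dataclass, field


@dataclass
class Device:
    name: str
    blocks: dict = field(default_factory=dict)  # block number -> bytes
    missing: bool = False  # True once the device is gone from the array

    def read(self, block):
        # A missing device, or one that lacks the block, reports an I/O error.
        if self.missing or block not in self.blocks:
            raise IOError(f"{self.name}: read error on block {block}")
        return self.blocks[block]

    def write(self, block, data):
        if self.missing:
            raise IOError(f"{self.name}: write error on block {block}")
        self.blocks[block] = data


def read_with_repair(mirrors, block):
    """Return the block from the first readable mirror.

    Mirrors that failed before a good copy was found get the good copy
    written back, unless they are marked missing.
    """
    bad = []
    for dev in mirrors:
        try:
            data = dev.read(block)
        except IOError:
            bad.append(dev)
            continue
        for victim in bad:
            if victim.missing:
                continue  # device is gone: skip the repair instead of crashing
            try:
                victim.write(block, data)
            except IOError:
                pass  # repair is best effort; the read itself succeeded
        return data
    raise IOError(f"block {block}: no readable copy")


if __name__ == "__main__":
    gone = Device("disk-a", missing=True)
    good = Device("disk-b", {7: b"payload"})
    print(read_with_repair([gone, good], 7))
```

In this model the read succeeds from the second copy, and the write-back to the missing device is simply skipped rather than treated as a fatal condition.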
On Fri, Jan 25, 2013 at 01:05:14PM -0700, Andrew McNabb wrote:
> I tried creating a multi-device btrfs filesystem for the first time (on
> Fedora 18 with 3.7.2-204.fc18.x86_64), and I ran into some problems. I
> had heard that btrfs is now reasonably stable, and though I expected to
> possibly see a problem here or there, I was a little surprised at just
> how many problems I encountered in such a short period of time. I now
> have about a thousand error messages in my kernel logs related to
> several different problems. Is this roughly the expected level of
> stability for btrfs with multiple devices, or am I just particularly
> lucky? :)
>
> Am I correct in assuming that I'll need to switch to md for a few months
> and try btrfs again later, or are there known problems in the specific
> kernel I'm running that I could avoid by trying a different version?
>
> For the sake of being specific, I'll detail a few of the problems I've
> hit:
>
> These two may have been caused by a possibly faulty disk (I'm still
> trying to determine whether it was faulty or whether the bug was purely
> in btrfs):
>
> https://bugzilla.redhat.com/show_bug.cgi?id=903794
> https://bugzilla.redhat.com/show_bug.cgi?id=904143
>
> This one was triggered when I tried to remove a possibly faulty disk:
>
> https://bugzilla.redhat.com/show_bug.cgi?id=904197

Actually, for this one, how did you remove the disk? Did you just yank it out while the box was running? Did you mount -o degraded, then delete the device, and then remove it? How exactly did you get to this situation? Thanks,

Josef
On Fri, Jan 25, 2013 at 03:37:17PM -0500, Josef Bacik wrote:
> > https://bugzilla.redhat.com/show_bug.cgi?id=903794
>
> This one is just an allocator warning because the relocator doesn't do the right
> accounting for relocation. It's just complaining, we need to fix it but it won't
> keep it from working.

I won't worry about this one, then.

> > https://bugzilla.redhat.com/show_bug.cgi?id=904143
>
> This I'm almost certain (I have to check) was just a result of me making fsync
> faster and forgetting to remove this warn on. It's fixed upstream. Again,
> nothing to worry about, but annoying.

Sounds good.

> > This one was triggered when I tried to remove a possibly faulty disk:
> >
> > https://bugzilla.redhat.com/show_bug.cgi?id=904197
> >
>
> Ok this is a bug, I can fix this. Basically we tried to read from the faulty
> disk, it failed, we read from the other copy, and then tried to write the good
> copy back to the failed disk and when we saw that the IO wasn't actually going
> to go to the bad disk we panic'ed. Silly but easy enough to understand/fix.

I was a little surprised that this happened after I had already done a "btrfs dev delete". Is there a way to tell btrfs that a disk really is gone?

> > With a freshly created filesystem, I got a kernel bug, associated with a
> > hang in most filesystem operations. This occurred in the middle of
> > ordinary operation and without any sort of hardware-related errors in
> > the kernel logs.
> >
> > https://bugzilla.redhat.com/show_bug.cgi?id=904223
> >
>
> So this is from the fsync stuff, and I'm sure I fixed this somewhere but I can't
> account for where I did it.

Would this also be the cause of the hangs that I'm seeing? In the end, a hang with the load rising to 260.10 is the most serious problem. It's happened a few times, and it gets temporarily fixed by a reboot, but then tends to recur fairly soon.

> Can you give btrfs-next a try and see if you can
> still reproduce. Thanks,

Is there a pre-built RPM for btrfs-next, or what's the best way to try it out in Fedora without breaking other things?

Thanks for your quick response, and sorry for not responding sooner (I've been interrupted by a few phone calls).

--
Andrew McNabb
http://www.mcnabbs.org/andrew/
PGP Fingerprint: 8A17 B57C 6879 1863 DE55 8012 AB4D 6098 8826 6868
On Fri, Jan 25, 2013 at 03:53:22PM -0500, Josef Bacik wrote:
>
> Actually for this one, how did you remove the disk? Did you just yank it out
> while the box was running? Did you mount -o degraded and then delete the device
> and then remove it? How exactly did you get to this situation. Thanks,

I've moved my answer over to IRC to reduce the latency in the conversation. Thanks again for all the help.

--
Andrew McNabb
http://www.mcnabbs.org/andrew/
PGP Fingerprint: 8A17 B57C 6879 1863 DE55 8012 AB4D 6098 8826 6868
Here's an update. I tried the new kernel, and I seem to be having some new (possibly worse) problems. In my ssh session, I'm seeing many errors of this sort:

Message from syslogd@guru at Jan 26 13:13:14 ...
 kernel:[ 308.223834] BUG: soft lockup - CPU#0 stuck for 23s! [btrfs-endio-wri:2073]

Message from syslogd@guru at Jan 26 13:13:14 ...
 kernel:[ 308.248754] BUG: soft lockup - CPU#2 stuck for 23s! [btrfs-delalloc-:594]

In the logs, I'm seeing several warnings and bugs, including:

WARNING: at fs/btrfs/extent_map.c:78 free_extent_map+0x79/0x90 [btrfs]()
WARNING: at lib/list_debug.c:62 __list_del_entry+0x82/0xd0()
BUG: unable to handle kernel NULL pointer dereference at (null)
BUG: soft lockup - CPU#0 stuck for 22s! [btrfs-endio-wri:1489]
BUG: soft lockup - CPU#1 stuck for 22s! [btrfs-delalloc-:607]

Kernel logs (across a few reboots) are at:

http://students.cs.byu.edu/~amcnabb/messages2

--
Andrew McNabb
http://www.mcnabbs.org/andrew/
PGP Fingerprint: 8A17 B57C 6879 1863 DE55 8012 AB4D 6098 8826 6868
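
When a log runs to about a thousand messages, it can help to tally the soft-lockup reports by kernel thread before attaching them to a bug report. The following is a minimal sketch, assuming the message format shown in the excerpts above; the regular expression and the `count_soft_lockups` helper are illustrative names, not part of any existing tool, and the sample text is copied from the lines quoted in this message.

```python
import re
from collections import Counter

# Matches lines such as:
#   BUG: soft lockup - CPU#0 stuck for 23s! [btrfs-endio-wri:2073]
# \s+ also tolerates the task name being wrapped onto the next line.
SOFT_LOCKUP = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s!\s+"
    r"\[(?P<task>[^:\]]+):(?P<pid>\d+)\]"
)


def count_soft_lockups(log_text):
    """Count soft-lockup reports per kernel thread name."""
    return Counter(m.group("task") for m in SOFT_LOCKUP.finditer(log_text))


sample = """\
kernel:[ 308.223834] BUG: soft lockup - CPU#0 stuck for 23s! [btrfs-endio-wri:2073]
kernel:[ 308.248754] BUG: soft lockup - CPU#2 stuck for 23s! [btrfs-delalloc-:594]
BUG: soft lockup - CPU#0 stuck for 22s! [btrfs-endio-wri:1489]
BUG: soft lockup - CPU#1 stuck for 22s! [btrfs-delalloc-:607]
"""

if __name__ == "__main__":
    for task, n in count_soft_lockups(sample).most_common():
        print(f"{task}: {n}")
```

To run it against a real log, read the file into a string and pass it to `count_soft_lockups` in place of `sample`.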
On Sat, Jan 26, 2013 at 01:27:11PM -0700, Andrew McNabb wrote:
> Here's an update. I tried the new kernel, and I seem to be having some
> new (possibly worse) problems. In my ssh session, I'm seeing many errors
> of this sort:
>
> Message from syslogd@guru at Jan 26 13:13:14 ...
> kernel:[ 308.223834] BUG: soft lockup - CPU#0 stuck for 23s!
> [btrfs-endio-wri:2073]
>
> Message from syslogd@guru at Jan 26 13:13:14 ...
> kernel:[ 308.248754] BUG: soft lockup - CPU#2 stuck for 23s!
> [btrfs-delalloc-:594]
>
> In the logs, I'm seeing several warnings and bugs, including:
>
> WARNING: at fs/btrfs/extent_map.c:78 free_extent_map+0x79/0x90 [btrfs]()
> WARNING: at lib/list_debug.c:62 __list_del_entry+0x82/0xd0()
> BUG: unable to handle kernel NULL pointer dereference at (null)
> BUG: soft lockup - CPU#0 stuck for 22s! [btrfs-endio-wri:1489]
> BUG: soft lockup - CPU#1 stuck for 22s! [btrfs-delalloc-:607]
>
> Kernel logs (across a few reboots) are at:
>
> http://students.cs.byu.edu/~amcnabb/messages2
>

Hrm, well, I didn't expect that. I will look into this and see what I can come up with. Thanks,

Josef
On Sat, Jan 26, 2013 at 01:27:11PM -0700, Andrew McNabb wrote:
> Here's an update. I tried the new kernel, and I seem to be having some
> new (possibly worse) problems. In my ssh session, I'm seeing many errors
> of this sort:
>
> Message from syslogd@guru at Jan 26 13:13:14 ...
> kernel:[ 308.223834] BUG: soft lockup - CPU#0 stuck for 23s!
> [btrfs-endio-wri:2073]
>
> Message from syslogd@guru at Jan 26 13:13:14 ...
> kernel:[ 308.248754] BUG: soft lockup - CPU#2 stuck for 23s!
> [btrfs-delalloc-:594]
>
> In the logs, I'm seeing several warnings and bugs, including:
>
> WARNING: at fs/btrfs/extent_map.c:78 free_extent_map+0x79/0x90 [btrfs]()
> WARNING: at lib/list_debug.c:62 __list_del_entry+0x82/0xd0()
> BUG: unable to handle kernel NULL pointer dereference at (null)
> BUG: soft lockup - CPU#0 stuck for 22s! [btrfs-endio-wri:1489]
> BUG: soft lockup - CPU#1 stuck for 22s! [btrfs-delalloc-:607]
>
> Kernel logs (across a few reboots) are at:
>
> http://students.cs.byu.edu/~amcnabb/messages2
>

OK, I think I figured it out. Can you give this a whirl? Let me know when you get tester's fatigue ;)

http://koji.fedoraproject.org/koji/taskinfo?taskID=4908932

Thanks,

Josef