I've got a 3TB SATA disk that is known to have problems (it failed in a zpool for one of my clients). For test purposes I'm running a BTRFS RAID-1 on two partitions on that disk: bad for performance and not something you'd normally do, but good for testing.

BTRFS recovers from read errors quite well and gives informative log messages.

But it doesn't seem possible to get a count of the number of errors. I think that at the minimum I should be able to get a count of the number of errors from a device since it was attached to the system. I think that the ideal would be to have an error count stored on the device and available to the sysadmin.

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/
-- 
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Hi Russell,

a sufficiently up-to-date kernel and btrfs tool will provide the 'btrfs device stats' command, which should give you the info you want.

Regards,

Bart

On Sun, Aug 4, 2013 at 1:42 PM, Russell Coker <russell@coker.com.au> wrote:
> But it doesn't seem possible to get a count of the number of errors.
On Sun, 4 Aug 2013 03:37:22 PM Bart Noordervliet wrote:
> a sufficiently up-to-date kernel and btrfs tool will provide the
> 'btrfs device stats' command, which should give you the info you want.

This is what it looks like:

chris@quad:~/Downloads/Linux/FileSystems/BtrFS/btrfs-progs$ sudo ./btrfs device stats /srv/DR
[/dev/sdb3].write_io_errs 0
[/dev/sdb3].read_io_errs 0
[/dev/sdb3].flush_io_errs 0
[/dev/sdb3].corruption_errs 0
[/dev/sdb3].generation_errs 0

All the best,
Chris
-- 
 Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC

This email may come with a PGP signature as a file. Do not panic.
For more info see: http://en.wikipedia.org/wiki/OpenPGP
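Because each counter is printed as one '[device].counter value' line, the output is easy to filter from a script. A minimal sketch, assuming that line format stays the same across btrfs-progs versions; the function name nonzero_btrfs_stats is hypothetical:

```shell
#!/bin/sh
# Print every non-zero counter from 'btrfs device stats' output read on stdin.
# Exit status: 0 if all counters are zero, 1 if any counter is non-zero.
# ASSUMPTION: counters use the "[device].counter value" format shown above;
# any other line (e.g. "failed to open /dev/sr0" warnings) is ignored.
nonzero_btrfs_stats() {
    awk '
        /^\[.*\]\.[a-z_]+ [0-9]+$/ && $2 != 0 { print; bad = 1 }
        END { exit bad }
    '
}
```

For example, 'sudo btrfs device stats /srv/DR | nonzero_btrfs_stats' would print nothing and exit 0 for the all-zero output above.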
On Sat, 10 Aug 2013, Chris Samuel <chris@csamuel.org> wrote:
> [/dev/sdb3].write_io_errs 0
> [/dev/sdb3].read_io_errs 0
> [/dev/sdb3].flush_io_errs 0
> [/dev/sdb3].corruption_errs 0
> [/dev/sdb3].generation_errs 0

Thanks Chris and Bart.

Would it be possible to get the man page updated to include a brief description of those errors? The first three are somewhat obvious in meaning (although not obvious in how they would happen) and the fourth is very obvious. But what does generation_errs mean? I'm seeing some on one system. Should I be concerned? If I write a Nagios check, which ones should be warnings and which ones errors?

Also, why does it give the following errors about trying to open /dev/sr0 when using a BTRFS RAID-1 filesystem? Below is for a RAID-1 over /dev/sdb and /dev/sdc.

# btrfs device stats /dev/sdb
failed to open /dev/sr0: No medium found
failed to open /dev/sr0: No medium found
[/dev/sdb].write_io_errs 0
[/dev/sdb].read_io_errs 0
[/dev/sdb].flush_io_errs 0
[/dev/sdb].corruption_errs 0
[/dev/sdb].generation_errs 0
# btrfs device stats /dev/sdc
failed to open /dev/sr0: No medium found
failed to open /dev/sr0: No medium found
[/dev/sdc].write_io_errs 0
[/dev/sdc].read_io_errs 0
[/dev/sdc].flush_io_errs 0
[/dev/sdc].corruption_errs 0
[/dev/sdc].generation_errs 0

Why is it even searching for the other RAID-1 member when only a single device is specified? And why can't it give stats for a single device without checking /dev/sr0, when it can do so while checking all devices?
# btrfs device stats /dev/sdd1
failed to open /dev/sr0: No medium found
failed to open /dev/sr0: No medium found
[/dev/sdd1].write_io_errs 0
[/dev/sdd1].read_io_errs 136
[/dev/sdd1].flush_io_errs 0
[/dev/sdd1].corruption_errs 0
[/dev/sdd1].generation_errs 0
# btrfs device stats /dev/sdd2
failed to open /dev/sr0: No medium found
failed to open /dev/sr0: No medium found
[/dev/sdd2].write_io_errs 0
[/dev/sdd2].read_io_errs 0
[/dev/sdd2].flush_io_errs 0
[/dev/sdd2].corruption_errs 0
[/dev/sdd2].generation_errs 0
# btrfs device stats /mnt/backup/
[/dev/sdd1].write_io_errs 0
[/dev/sdd1].read_io_errs 136
[/dev/sdd1].flush_io_errs 0
[/dev/sdd1].corruption_errs 0
[/dev/sdd1].generation_errs 0
[/dev/sdd2].write_io_errs 0
[/dev/sdd2].read_io_errs 0
[/dev/sdd2].flush_io_errs 0
[/dev/sdd2].corruption_errs 0
[/dev/sdd2].generation_errs 0

Thanks.
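The Nagios question is not answered in this thread, so the severity split below is purely an illustrative assumption: read_io_errs is treated as a WARNING (on the reasoning that a redundant profile such as RAID-1 can often recover from read errors, as Russell observed) and every other non-zero counter as CRITICAL. The exit codes 0 to 3 follow the standard Nagios plugin convention; the function name check_btrfs_stats is hypothetical.

```shell
#!/bin/sh
# Hypothetical Nagios-style check for 'btrfs device stats' output on stdin.
# Exit codes follow the Nagios plugin convention:
#   0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN.
# ASSUMPTION: treating read_io_errs as WARNING and all other counters as
# CRITICAL is an arbitrary example policy, not an official recommendation.
check_btrfs_stats() {
    awk '
        /^\[.*\]\.[a-z_]+ [0-9]+$/ {
            seen = 1
            if ($2 == 0) next
            # Strip everything up to the last dot to get the counter name.
            name = $1; sub(/^.*\./, "", name)
            if (name == "read_io_errs") { warn = warn " " $1 "=" $2 }
            else                        { crit = crit " " $1 "=" $2 }
        }
        END {
            if (!seen)      { print "UNKNOWN - no btrfs device stats lines"; exit 3 }
            if (crit != "") { print "CRITICAL -" crit warn; exit 2 }
            if (warn != "") { print "WARNING -" warn; exit 1 }
            print "OK - all btrfs error counters are zero"; exit 0
        }
    '
}
```

Fed the /mnt/backup/ output above, it would print "WARNING - [/dev/sdd1].read_io_errs=136" and exit 1. Passing the mount point rather than a single device also avoids the /dev/sr0 messages, although the filter ignores such lines either way.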
On Sat, 10 Aug 2013 07:19:27 PM Russell Coker wrote:
> But what does generation_errs mean? I'm seeing some on one system.
> Should I be concerned? If I write a Nagios check which ones should be
> warnings and which ones errors?

All I know is that ioctl.h says:

        BTRFS_DEV_STAT_GENERATION_ERRS, /* an indication that blocks have not
                                         * been written */

Looking at the kernel code, that only seems to get incremented during a scrub. The code that does that says:

        } else if (generation != le64_to_cpu(h->generation)) {
                sblock->header_error = 1;
                sblock->generation_error = 1;
        }

The generation there is from the btrfs inode structure; the header says:

        /* full 64 bit generation number, struct vfs_inode doesn't have a big
         * enough field for this.
         */
        u64 generation;

The wiki says:

https://btrfs.wiki.kernel.org/index.php/Glossary

# generation
#   An internal counter which updates for each transaction. When a
#   metadata block is written (using copy on write), current generation
#   is stored in the block, so that blocks which are too new (and hence
#   possibly inconsistent) can be identified.

and:

https://btrfs.wiki.kernel.org/index.php/Btrfs_design

# Everything that points to a btree block also stores the generation
# field it expects that block to have. This allows Btrfs to detect
# phantom or misplaced writes on the media.

HTH!

> Also why does it give the following errors about trying to open /dev/sr0
> when using a BTRFS RAID-1 filesystem? Below is for a RAID-1 over /dev/sdb
> and /dev/sdc.

I don't get that here. I'm building btrfs-progs from git at commit 194aa4a1bd6447bb545286d0bcb0b0be8204d79f (July 5th), aka:

btrfs-progs$ git describe --tags
v0.20-rc1-358-g194aa4a

cheers!
Chris
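If, as Chris reads the kernel code, generation_errs is only updated by scrub, then running a fresh scrub is how to get a current value for it. A minimal sketch under that assumption, using 'btrfs scrub start -B' (which keeps the scrub in the foreground until it finishes) followed by 'btrfs device stats'; the function name scrub_then_generation_errs is hypothetical, and it needs root and a mounted btrfs filesystem:

```shell
#!/bin/sh
# Hypothetical helper: run a foreground scrub, then print the
# generation_errs counter for every device in the filesystem.
# ASSUMPTION: generation_errs is only updated by scrub, so it is read
# after the scrub completes.
scrub_then_generation_errs() {
    mnt=$1
    # -B: do not background the scrub, so the counters are final on return.
    # The scrub exit status is deliberately not checked, so the counters
    # are still reported when the scrub itself finds errors.
    btrfs scrub start -B "$mnt" >/dev/null
    btrfs device stats "$mnt" | awk '$1 ~ /\.generation_errs$/ { print $1, $2 }'
}
```

Using the mount point here, rather than a device node, reports every member device in one call.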