I just did a search and couldn't find any probe for btrfs RAID status.

The "check_raid" plugin seems to recognise mdadm and various other types
of RAID, but not btrfs.

Has anybody seen a plugin for Nagios, or could anybody comment on how it
should work if somebody wants to make one?

For example, would the command

    btrfs filesystem show --all-devices

give a non-zero error status or some other clue if any of the devices
are at risk?

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Daniel Pocock posted on Fri, 22 Nov 2013 14:47:49 +0100 as excerpted:

> I just did a search and couldn't find any probe for btrfs RAID status
>
> The "check_raid" plugin seems to recognise mdadm and various other types
> of RAID but not btrfs
>
> Has anybody seen a plugin for Nagios or could anybody comment on how it
> should work if somebody wants to make one?
>
> For example, would the command
>
> btrfs filesystem show --all-devices
>
> give a non-zero error status or some other clue if any of the devices
> are at risk?

[btrfs personal user/sysadmin, not a dev, not anything large enough to
have personal nagios experience...]

AFAIK, btrfs raid modes currently switch the filesystem to read-only on
any device-drop error. That has been deemed the simplest/safest policy
during development, tho at some point as stable approaches the behavior
could theoretically be made optional.

So detection could watch for read-only and act accordingly, either
switching back to read-write or rebooting or simply logging the event,
as deemed appropriate.

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman

> For example, would the command
>
> btrfs filesystem show --all-devices
>
> give a non-zero error status or some other clue if any of the devices
> are at risk?

No, there isn't any good way as of now. That's something to fix.

Thanks, Anand
On 23/11/13 04:59, Anand Jain wrote:
>
>
>> For example, would the command
>>
>> btrfs filesystem show --all-devices
>>
>> give a non-zero error status or some other clue if any of the devices
>> are at risk?
>
> No there isn't any good way as of now. that's something to fix.

Does it require kernel/driver code changes, or should it be possible to
implement in the user space utility?

It would be useful for people testing the filesystem to know when they
get into trouble so they can investigate more quickly (and before the
point of no return).

> [btrfs personal user/sysadmin, not a dev, not anything large enough to
> have personal nagios experience...]
>
> AFAIK, btrfs raid modes currently switch the filesystem to read-only on
> any device-drop error. That has been deemed the simplest/safest policy
> during development, tho at some point as stable approaches the behavior
> could theoretically be made optional.

None of the warnings about btrfs's experimental status hint at that;
some people may be surprised by it.

> So detection could watch for read-only and act accordingly, either
> switching back to read-write or rebooting or simply logging the event,
> as deemed appropriate.

It would be relatively trivial to implement a Nagios check for
read-only; Nagios probes are just shell scripts.

What about when btrfs detects a bad block checksum and recovers data
from the equivalent block on another disk? The wiki says there will be
a syslog event. Does btrfs keep any stats on the number of blocks that
it considers unreliable, and can this be queried from user space?
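For illustration, the read-only check discussed in this message could look roughly like the sketch below. It is only a sketch under stated assumptions: it assumes the mount table is available in the usual /proc/mounts layout (device, mount point, filesystem type, options), and it assumes that a read-only btrfs mount should map to the conventional Nagios CRITICAL exit code (2). The function names are hypothetical and are not taken from any existing plugin.

```python
OK, CRITICAL = 0, 2  # conventional Nagios plugin exit codes


def readonly_btrfs_mounts(mounts_text):
    """Return the mount points of btrfs filesystems mounted with 'ro'.

    mounts_text is expected in /proc/mounts layout:
    device mount_point fs_type options dump pass
    """
    readonly = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point, fs_type, options = fields[1], fields[2], fields[3]
        if fs_type == "btrfs" and "ro" in options.split(","):
            readonly.append(mount_point)
    return readonly


def check(mounts_text):
    """Return (exit_code, message) in the style of a Nagios probe."""
    readonly = readonly_btrfs_mounts(mounts_text)
    if readonly:
        return CRITICAL, "CRITICAL: btrfs mounted read-only: " + ", ".join(readonly)
    return OK, "OK: no btrfs filesystem is mounted read-only"
```

A real probe would read /proc/mounts, pass its contents to check(), print the message and exit with the returned code. Note that a filesystem deliberately mounted read-only would also trigger this, so a site-specific exclusion list may be needed.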
On 23/11/13 09:37, Daniel Pocock wrote:
>
>
> On 23/11/13 04:59, Anand Jain wrote:
>>
>>
>>> For example, would the command
>>>
>>> btrfs filesystem show --all-devices
>>>
>>> give a non-zero error status or some other clue if any of the devices
>>> are at risk?
>>
>> No there isn't any good way as of now. that's something to fix.
>
> Does it require kernel/driver code changes or it should be possible to
> implement in the user space utility?
>
> It would be useful for people testing the filesystem to know when they
> get into trouble so they can investigate more quickly (and before the
> point of no return)
>
>> [btrfs personal user/sysadmin, not a dev, not anything large enough to
>> have personal nagios experience...]
>>
>> AFAIK, btrfs raid modes currently switch the filesystem to read-only on
>> any device-drop error. That has been deemed the simplest/safest policy
>> during development, tho at some point as stable approaches the behavior
>> could theoretically be made optional.
>
> None of the warnings about btrfs's experimental status hint at that,
> some people may be surprised by it.
>
>> So detection could watch for read-only and act accordingly, either
>> switching back to read-write or rebooting or simply logging the event,
>> as deemed appropriate.
>
> It would be relatively trivial to implement a Nagios check for
> read-only, Nagios probes are just shell scripts

Just checked, it already exists, so we are halfway there:

http://exchange.nagios.org/directory/Plugins/Operating-Systems/Linux/check_ro_mounts/details

>
> What about when btrfs detects a bad block checksum and recovers data
> from the equivalent block on another disk? The wiki says there will be
> a syslog event. Does btrfs keep any stats on the number of blocks that
> it considers unreliable and can this be queried from user space?
Daniel Pocock posted on Sat, 23 Nov 2013 09:37:50 +0100 as excerpted:

> What about when btrfs detects a bad block checksum and recovers data
> from the equivalent block on another disk? The wiki says there will be
> a syslog event. Does btrfs keep any stats on the number of blocks that
> it considers unreliable and can this be queried from user space?

The way you phrased that question is strange to me (considers unreliable?
does that mean ones that it had to fix, or ones that it had to fix more
than once, or...), so I'm not sure this answers it, but from the btrfs
manpage...

>>>>

btrfs device stats [-z] {<path>|<device>}

Read and print the device IO stats for all devices of the filesystem
identified by <path> or for a single <device>.

Options

-z     Reset stats to zero after reading them.

<<<<

Here's the output for my (dual device btrfs raid1) rootfs, here:

btrfs dev stat /
[/dev/sdc5].write_io_errs   0
[/dev/sdc5].read_io_errs    0
[/dev/sdc5].flush_io_errs   0
[/dev/sdc5].corruption_errs 0
[/dev/sdc5].generation_errs 0
[/dev/sda5].write_io_errs   0
[/dev/sda5].read_io_errs    0
[/dev/sda5].flush_io_errs   0
[/dev/sda5].corruption_errs 0
[/dev/sda5].generation_errs 0

As you can see, for multi-device filesystems it gives the stats per
component device. Any errors accumulate until a reset using -z, so you
can easily see if the numbers are increasing over time and by how much.

--
Duncan - List replies preferred. No HTML msgs.
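As a rough illustration of how a probe might consume that output, the sketch below parses text in the "[device].counter value" layout shown above into a nested dictionary and lists the counters that are not zero. It assumes the output format is exactly as in the sample. Treating every non-zero counter as noteworthy is an assumption made here for illustration; the thread does not settle which counters matter. The function names are hypothetical.

```python
import re

# Matches lines such as "[/dev/sdc5].write_io_errs   0"
STAT_LINE = re.compile(r"^\[(?P<device>[^\]]+)\]\.(?P<counter>\w+)\s+(?P<value>\d+)$")


def parse_device_stats(output):
    """Turn 'btrfs device stats' style text into {device: {counter: value}}."""
    stats = {}
    for line in output.splitlines():
        match = STAT_LINE.match(line.strip())
        if match is None:
            continue  # ignore the command echo, blank lines, anything unexpected
        device = match.group("device")
        counter = match.group("counter")
        stats.setdefault(device, {})[counter] = int(match.group("value"))
    return stats


def nonzero_counters(stats):
    """List (device, counter, value) for every counter that is not zero."""
    return [
        (device, counter, value)
        for device, counters in stats.items()
        for counter, value in counters.items()
        if value != 0
    ]
```

The text to parse could be captured with the standard subprocess module by running the same command shown above (btrfs device stats followed by the mount point), which typically needs root privileges.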
On 23/11/13 11:35, Duncan wrote:
> Daniel Pocock posted on Sat, 23 Nov 2013 09:37:50 +0100 as excerpted:
>
>> What about when btrfs detects a bad block checksum and recovers data
>> from the equivalent block on another disk? The wiki says there will be
>> a syslog event. Does btrfs keep any stats on the number of blocks that
>> it considers unreliable and can this be queried from user space?
>
> The way you phrased that question is strange to me (considers unreliable?
> does that mean ones that it had to fix, or ones that it had to fix more
> than once, or...), so I'm not sure this answers it, but from the btrfs
> manpage...

Let me clarify: when I said unreliable, I was referring to those blocks
where the block device driver reads the block without reporting any
error, but where btrfs has decided the checksum is bad and has not used
the data from the block.

Such blocks definitely exist. Sometimes the data was corrupted at the
moment of writing, and no matter how many times you read the block, you
always get a bad checksum.

> >>>>
>
> btrfs device stats [-z] {<path>|<device>}
>
> Read and print the device IO stats for all devices of the filesystem
> identified by <path> or for a single <device>.
>
> Options
>
> -z     Reset stats to zero after reading them.
>
> <<<<
>
> Here's the output for my (dual device btrfs raid1) rootfs, here:
>
> btrfs dev stat /
> [/dev/sdc5].write_io_errs   0
> [/dev/sdc5].read_io_errs    0
> [/dev/sdc5].flush_io_errs   0
> [/dev/sdc5].corruption_errs 0
> [/dev/sdc5].generation_errs 0
> [/dev/sda5].write_io_errs   0
> [/dev/sda5].read_io_errs    0
> [/dev/sda5].flush_io_errs   0
> [/dev/sda5].corruption_errs 0
> [/dev/sda5].generation_errs 0
>
> As you can see, for multi-device filesystems it gives the stats per
> component device. Any errors accumulate until a reset using -z, so you
> can easily see if the numbers are increasing over time and by how much.
>

That looks interesting - are these explained anywhere?

Should a Nagios plugin just look for any non-zero value, or just focus
on some of those?

Are they runtime stats (since system boot), or are they maintained in
the filesystem on disk?

My own version of the btrfs utility doesn't have that command though; I
am using a Debian stable system. I tried a newer version and it gives

    ERROR: ioctl(BTRFS_IOC_GET_DEV_STATS)

so I probably need to update my kernel too.
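One possible answer to the "any non-zero value" question, sketched here purely as an assumption rather than as established practice: since the quoted reply says the errors accumulate until a reset with -z, a probe could remember the previous reading and alert only when a counter has grown. The sketch below compares two readings given as nested dictionaries of the form {device: {counter: value}}. How the readings are obtained and where the previous one is stored are left out, and the function name is hypothetical.

```python
def increased_counters(previous, current):
    """Return {(device, counter): delta} for every counter that grew.

    Both arguments are {device: {counter: value}} dictionaries. A counter
    or device missing from the previous reading is treated as having been 0.
    """
    increases = {}
    for device, counters in current.items():
        before = previous.get(device, {})
        for counter, value in counters.items():
            delta = value - before.get(counter, 0)
            if delta > 0:
                increases[(device, counter)] = delta
    return increases
```

A counter that went down (for example after a reset with -z) is simply ignored here; a real probe might want to treat that as a signal to refresh its stored baseline.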
Daniel Pocock posted on Sat, 23 Nov 2013 12:44:25 +0100 as excerpted:

>> [btrfs manpage quote]
>> btrfs device stats [-z] {<path>|<device>}
>>
>> Read and print the device IO stats for all devices of the filesystem
>> identified by <path> or for a single <device>.

>> -z Reset stats to zero after reading them.

>> Here's the output for my (dual device btrfs raid1) rootfs, here:
>>
>> btrfs dev stat /
>> [/dev/sdc5].write_io_errs   0
>> [/dev/sdc5].read_io_errs    0
>> [/dev/sdc5].flush_io_errs   0
>> [/dev/sdc5].corruption_errs 0
>> [/dev/sdc5].generation_errs 0
>> [/dev/sda5].write_io_errs   0
>> [/dev/sda5].read_io_errs    0
>> [/dev/sda5].flush_io_errs   0
>> [/dev/sda5].corruption_errs 0
>> [/dev/sda5].generation_errs 0
>>
>> As you can see, for multi-device filesystems it gives the stats per
>> component device. Any errors accumulate until a reset using -z, so you
>> can easily see if the numbers are increasing over time and by how much.

> That looks interesting - are these explained anywhere?

I'd guess in the sources... There's nothing more in the manpage about
them, and nothing on the wiki. Some weeks ago I scanned some of the
whitepapers listed on the wiki, and found most of them frustratingly
"big picture" vague on such details as well. =:^(

There was one that had a bit of detail, but it covered only about half
of what I was looking for at the time (the difference between leafsize,
sectorsize and nodesize, three option knobs available on the mkfs.btrfs
commandline, what they actually tuned, and while I was at it, how they
related to btrfs chunks), and even that wasn't explained very clearly.

So it seems a lot of the documentation is sources-only at this point.
=:^(

> Should a Nagios plugin just look for any non-zero value or just focus on
> some of those?

I could guess at what some of them are and their significance based on
what I've seen here, but I'm afraid my guesses wouldn't rate well in SNR
terms, so I'll abstain...

> Are they runtime stats (since system boot) or are they maintained in the
> filesystem on disk?

The records are maintained across mounts/boots, so they must be stored
on-disk. Only the -z switch zeroes them.

> My own version of the btrfs utility doesn't have that command though, I
> am using a Debian stable system. I tried a newer version and it gives
>
> ERROR: ioctl(BTRFS_IOC_GET_DEV_STATS)
>
> so I probably need to update my kernel too.

You've likely read it before, but btrfs remains a filesystem under heavy
development, with every kernel bringing fixes for known bugs and
userspace tools developed in tandem, and every btrfs user at this point
is by definition a development filesystem tester. While there are
reasons one may wish to be conservative and stick with a known stable
system, they really tend to be antithetical to the reasons one would
have for testing something as development edge as btrfs at this point.

Thus, upgrading to a current kernel (3.12.x at this point, if not the
3.13 development kernel, as rc1 just came out) and a current btrfs-progs
is very strongly recommended if you're testing btrfs, in any case. (You
can keep the rest of the system stable Debian if you like.)

(For btrfs-progs, development happens in git branches, with merges to
the master branch only when changes are considered release-ready. So
current git-master btrfs-progs is always the reference. FWIW, here's
what btrfs --version outputs here, btrfs-progs from git updated as of
yesterday as it happens, tho I usually keep within a week or two:
Btrfs v0.20-rc1-598-g8116550.)

See the btrfs wiki for more: https://btrfs.wiki.kernel.org.

--
Duncan - List replies preferred. No HTML msgs.