thr3ads.net - Btrfs devel - Nagios probe for btrfs RAID status? [Nov 2013]

Daniel Pocock

2013-Nov-22 13:47 UTC

Nagios probe for btrfs RAID status?

I just did a search and couldn't find any probe for btrfs RAID status.

The "check_raid" plugin seems to recognise mdadm and various other types
of RAID but not btrfs.

Has anybody seen a plugin for Nagios or could anybody comment on how it
should work if somebody wants to make one?

For example, would the command

    btrfs filesystem show --all-devices

give a non-zero error status or some other clue if any of the devices
are at risk?


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Duncan

2013-Nov-22 17:52 UTC

Re: Nagios probe for btrfs RAID status?

Daniel Pocock posted on Fri, 22 Nov 2013 14:47:49 +0100 as excerpted:
> I just did a search and couldn't find any probe for btrfs RAID status
> 
> The "check_raid" plugin seems to recognise mdadm and various other types
> of RAID but not btrfs
> 
> Has anybody seen a plugin for Nagios or could anybody comment on how it
> should work if somebody wants to make one?
> 
> For example, would the command
> 
>     btrfs filesystem show --all-devices
> 
> give a non-zero error status or some other clue if any of the devices
> are at risk?
[btrfs personal user/sysadmin, not a dev, not anything large enough to 
have personal nagios experience...]

AFAIK, btrfs raid modes currently switch the filesystem to read-only on 
any device-drop error.  That has been deemed the simplest/safest policy 
during development, tho at some point as stable approaches the behavior 
could theoretically be made optional.

So detection could watch for read-only and act accordingly, either 
switching back to read-write or rebooting or simply logging the event, as 
deemed appropriate.
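
A minimal sketch of the "watch for read-only" idea, assuming the probe reads the kernel's mount table from /proc/mounts and treats any btrfs filesystem carrying the `ro` option as a problem. The function names are made up for illustration, and a filesystem that is mounted read-only on purpose would also trip this check, so a real probe would probably need an exclusion list. The exit codes follow the usual Nagios plugin convention (0 OK, 2 CRITICAL).

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def readonly_btrfs_mounts(mounts_text):
    """Return mount points of btrfs filesystems mounted read-only.

    mounts_text uses the /proc/mounts layout:
    device mountpoint fstype options dump pass
    """
    readonly = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        _device, mountpoint, fstype, options = fields[:4]
        if fstype == "btrfs" and "ro" in options.split(","):
            readonly.append(mountpoint)
    return readonly


def check(mounts_text):
    """Return (exit_code, message) in the usual Nagios plugin style."""
    readonly = readonly_btrfs_mounts(mounts_text)
    if readonly:
        return CRITICAL, "CRITICAL: btrfs mounted read-only: " + ", ".join(readonly)
    return OK, "OK: no read-only btrfs mounts"
```

A plugin built on this would read /proc/mounts, print the message and exit with the returned code.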

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

Anand Jain

2013-Nov-23 03:59 UTC

Re: Nagios probe for btrfs RAID status?

> For example, would the command
>
>      btrfs filesystem show --all-devices
>
> give a non-zero error status or some other clue if any of the devices
> are at risk?
No, there isn't any good way as of now. That's something to fix.

Thanks, Anand

Daniel Pocock

2013-Nov-23 08:37 UTC

Re: Nagios probe for btrfs RAID status?

On 23/11/13 04:59, Anand Jain wrote:
> 
>> For example, would the command
>>
>>      btrfs filesystem show --all-devices
>>
>> give a non-zero error status or some other clue if any of the devices
>> are at risk?
> 
> No, there isn't any good way as of now. That's something to fix.
Does it require kernel/driver code changes, or should it be possible to
implement in the user space utility?

It would be useful for people testing the filesystem to know when they
get into trouble so they can investigate more quickly (and before the
point of no return)
> [btrfs personal user/sysadmin, not a dev, not anything large enough to
> have personal nagios experience...]
> 
> AFAIK, btrfs raid modes currently switch the filesystem to read-only on
> any device-drop error. That has been deemed the simplest/safest policy
> during development, tho at some point as stable approaches the behavior
> could theoretically be made optional.
None of the warnings about btrfs's experimental status hint at that;
some people may be surprised by it.
> So detection could watch for read-only and act accordingly, either
> switching back to read-write or rebooting or simply logging the event,
> as deemed appropriate.
It would be relatively trivial to implement a Nagios check for
read-only; Nagios probes are just shell scripts.

What about when btrfs detects a bad block checksum and recovers data
from the equivalent block on another disk?  The wiki says there will be
a syslog event.  Does btrfs keep any stats on the number of blocks that
it considers unreliable and can this be queried from user space?

Daniel Pocock

2013-Nov-23 09:20 UTC

Re: Nagios probe for btrfs RAID status?

On 23/11/13 09:37, Daniel Pocock wrote:
> 
> On 23/11/13 04:59, Anand Jain wrote:
>>
>>
>>> For example, would the command
>>>
>>>      btrfs filesystem show --all-devices
>>>
>>> give a non-zero error status or some other clue if any of the devices
>>> are at risk?
>>
>>  No there isn't any good way as of now. that's something to fix.
> 
> Does it require kernel/driver code changes or it should be possible to
> implement in the user space utility?
> 
> It would be useful for people testing the filesystem to know when they
> get into trouble so they can investigate more quickly (and before the
> point of no return)
> 
>> [btrfs personal user/sysadmin, not a dev, not anything large enough to
>> have personal nagios experience...]
>>
>> AFAIK, btrfs raid modes currently switch the filesystem to read-only on
>> any device-drop error. That has been deemed the simplest/safest policy
>> during development, tho at some point as stable approaches the behavior
>> could theoretically be made optional.
> 
> None of the warnings about btrfs's experimental status hint at that,
> some people may be surprised by it.
> 
>> So detection could watch for read-only and act accordingly, either
>> switching back to read-write or rebooting or simply logging the event,
>> as deemed appropriate.
> 
> It would be relatively trivial to implement a Nagios check for
> read-only, Nagios probes are just shell scripts
Just checked, it already exists, so we are half way there:

http://exchange.nagios.org/directory/Plugins/Operating-Systems/Linux/check_ro_mounts/details

> 
> What about when btrfs detects a bad block checksum and recovers data
> from the equivalent block on another disk?  The wiki says there will be
> a syslog event.  Does btrfs keep any stats on the number of blocks that
> it considers unreliable and can this be queried from user space?
> 

Duncan

2013-Nov-23 10:35 UTC

Re: Nagios probe for btrfs RAID status?

Daniel Pocock posted on Sat, 23 Nov 2013 09:37:50 +0100 as excerpted:
> What about when btrfs detects a bad block checksum and recovers data
> from the equivalent block on another disk?  The wiki says there will be
> a syslog event.  Does btrfs keep any stats on the number of blocks that
> it considers unreliable and can this be queried from user space?
The way you phrased that question is strange to me (considers unreliable?
does that mean ones that it had to fix, or ones that it had to fix more 
than once, or...), so I'm not sure this answers it, but from the btrfs 
manpage...
>>>>
btrfs device stats [-z] {<path>|<device>}

Read and print the device IO stats for all devices of the filesystem 
identified by <path> or for a single <device>.

Options

-z   Reset stats to zero after reading them.

<<<<

Here's the output for my (dual device btrfs raid1) rootfs:

btrfs dev stat /
[/dev/sdc5].write_io_errs   0
[/dev/sdc5].read_io_errs    0
[/dev/sdc5].flush_io_errs   0
[/dev/sdc5].corruption_errs 0
[/dev/sdc5].generation_errs 0
[/dev/sda5].write_io_errs   0
[/dev/sda5].read_io_errs    0
[/dev/sda5].flush_io_errs   0
[/dev/sda5].corruption_errs 0
[/dev/sda5].generation_errs 0

As you can see, for multi-device filesystems it gives the stats per 
component device.  Any errors accumulate until a reset using -z, so you 
can easily see if the numbers are increasing over time and by how much.
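
As a rough illustration of how a probe might consume that output, the sketch below parses text in the `[device].counter value` layout shown above and reports any counter above zero. The function names are hypothetical, and treating every non-zero counter as CRITICAL is only an assumption made for the example; the thread does not settle which counters deserve an alert.

```python
import re

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Matches lines such as "[/dev/sdc5].write_io_errs   0".
STAT_LINE = re.compile(
    r"^\[(?P<device>[^\]]+)\]\.(?P<counter>\w+)\s+(?P<value>\d+)$"
)


def parse_device_stats(text):
    """Map (device, counter) -> int for each recognised line."""
    stats = {}
    for line in text.splitlines():
        match = STAT_LINE.match(line.strip())
        if match:
            key = (match.group("device"), match.group("counter"))
            stats[key] = int(match.group("value"))
    return stats


def check_stats(text):
    """Return (exit_code, message); any non-zero counter is CRITICAL here."""
    stats = parse_device_stats(text)
    if not stats:
        return UNKNOWN, "UNKNOWN: no device stats found in output"
    errors = {key: value for key, value in stats.items() if value > 0}
    if errors:
        detail = ", ".join(
            "{} {}={}".format(device, counter, value)
            for (device, counter), value in sorted(errors.items())
        )
        return CRITICAL, "CRITICAL: " + detail
    return OK, "OK: {} counters, all zero".format(len(stats))
```

Fed the output above, `check_stats` would return the OK status with ten counters.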



-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

Daniel Pocock

2013-Nov-23 11:44 UTC

Re: Nagios probe for btrfs RAID status?

On 23/11/13 11:35, Duncan wrote:
> Daniel Pocock posted on Sat, 23 Nov 2013 09:37:50 +0100 as excerpted:
> 
>> What about when btrfs detects a bad block checksum and recovers data
>> from the equivalent block on another disk?  The wiki says there will be
>> a syslog event.  Does btrfs keep any stats on the number of blocks that
>> it considers unreliable and can this be queried from user space?
> 
> The way you phrased that question is strange to me (considers unreliable?
> does that mean ones that it had to fix, or ones that it had to fix more 
> than once, or...), so I'm not sure this answers it, but from the btrfs
> manpage...

Let me clarify: when I said unreliable, I was referring to those blocks
where the block device driver reads the block without reporting any
error but where btrfs has decided the checksum is bad and not used the
data from the block.

Such blocks definitely exist. Sometimes the data was corrupted at the
moment of writing and no matter how many times you read the block, you
always get a bad checksum.

> >>>>
> 
> btrfs device stats [-z] {<path>|<device>}
> 
> Read and print the device IO stats for all devices of the filesystem 
> identified by <path> or for a single <device>.
> 
> Options
> 
> -z   Reset stats to zero after reading them.
> 
> <<<<
> 
> Here's the output for my (dual device btrfs raid1) rootfs, here:
> 
> btrfs dev stat /
> [/dev/sdc5].write_io_errs   0
> [/dev/sdc5].read_io_errs    0
> [/dev/sdc5].flush_io_errs   0
> [/dev/sdc5].corruption_errs 0
> [/dev/sdc5].generation_errs 0
> [/dev/sda5].write_io_errs   0
> [/dev/sda5].read_io_errs    0
> [/dev/sda5].flush_io_errs   0
> [/dev/sda5].corruption_errs 0
> [/dev/sda5].generation_errs 0
> 
> As you can see, for multi-device filesystems it gives the stats per 
> component device.  Any errors accumulate until a reset using -z, so you 
> can easily see if the numbers are increasing over time and by how much.
> 

That looks interesting - are these explained anywhere?

Should a Nagios plugin just look for any non-zero value or just focus on
some of those?

Are they runtime stats (since system boot) or are they maintained in the
filesystem on disk?

My own version of the btrfs utility doesn't have that command though, I
am using a Debian stable system.  I tried a newer version and it gives

ERROR: ioctl(BTRFS_IOC_GET_DEV_STATS)

so I probably need to update my kernel too.
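
A failure like this suggests a probe should report UNKNOWN rather than OK when the stats cannot be read at all. The sketch below wraps the `btrfs device stats <path>` call and hands back either the output or a reason for failure. It assumes the tool exits with a non-zero status when the ioctl fails, which is not confirmed in this thread; `read_device_stats` and its `runner` parameter are hypothetical names, the latter there only so the call can be faked in tests.

```python
import subprocess


def read_device_stats(path, runner=subprocess.run):
    """Run `btrfs device stats <path>`.

    Returns (output, None) on success or (None, problem) on failure, so the
    caller can map a failure to the Nagios UNKNOWN status (exit code 3).
    """
    try:
        result = runner(
            ["btrfs", "device", "stats", path],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
        )
    except FileNotFoundError:
        # btrfs-progs is not installed or not on PATH.
        return None, "btrfs command not found"
    if result.returncode != 0:
        # Assumption: an old kernel or unsupported ioctl ends up here.
        problem = result.stderr.strip()
        return None, problem or "btrfs exited with status {}".format(result.returncode)
    return result.stdout, None
```

The caller would print `UNKNOWN: <problem>` and exit 3 whenever the second element is not `None`.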

Duncan

2013-Nov-23 16:32 UTC

Re: Nagios probe for btrfs RAID status?

Daniel Pocock posted on Sat, 23 Nov 2013 12:44:25 +0100 as excerpted:
>> [btrfs manpage quote]
>> btrfs device stats [-z] {<path>|<device>}
>> 
>> Read and print the device IO stats for all devices of the filesystem
>> identified by <path> or for a single <device>.
>> -z   Reset stats to zero after reading them.
>> Here's the output for my (dual device btrfs raid1) rootfs, here:
>> 
>> btrfs dev stat /
>> [/dev/sdc5].write_io_errs   0
>> [/dev/sdc5].read_io_errs    0
>> [/dev/sdc5].flush_io_errs   0
>> [/dev/sdc5].corruption_errs 0
>> [/dev/sdc5].generation_errs 0
>> [/dev/sda5].write_io_errs   0
>> [/dev/sda5].read_io_errs    0
>> [/dev/sda5].flush_io_errs   0
>> [/dev/sda5].corruption_errs 0
>> [/dev/sda5].generation_errs 0
>> 
>> As you can see, for multi-device filesystems it gives the stats per
>> component device.  Any errors accumulate until a reset using -z, so you
>> can easily see if the numbers are increasing over time and by how much.
> That looks interesting - are these explained anywhere?
I'd guess in the sources...  There's nothing more in the manpage about
them, and nothing on the wiki.  Some weeks ago I scanned some of the 
whitepapers listed on the wiki, and found most of them frustratingly "big 
picture" vague on such details as well. =:^(  There was one that had a 
bit of detail, but only about half of what I was looking for at the time 
(the difference between leafsize, sectorsize and nodesize, three option 
knobs available on the mkfs.btrfs commandline, and what they actually 
tuned, and while I was at it, how they related to btrfs chunks) was there, 
and even then it was not really explained very clearly.  So it seems a 
lot of the documentation is sources-only at this point. =:^(
> Should a Nagios plugin just look for any non-zero value or just focus on
> some of those?
I could guess at what some of them are and their significance based on 
what I've seen here, but I'm afraid my guesses wouldn't rate well in SNR
terms, so I'll abstain...
> Are they runtime stats (since system boot) or are they maintained in the
> filesystem on disk?
The records are maintained across mounts/boots so must be stored on-
disk.  Only the -z switch zeroes.
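
Because the counters persist until someone runs `-z`, a probe may want to alert on growth since its last run rather than on the absolute value. The sketch below is one way to do that under stated assumptions: snapshots are kept as JSON in a state file whose location is up to the caller, keys are `device.counter` strings, and a counter that has dropped is taken to have been reset. None of these names or choices come from btrfs itself.

```python
import json


def increased_counters(previous, current):
    """Return {key: delta} for counters that grew since the previous snapshot.

    Both arguments map a "device.counter" string to an int. A counter that is
    new, or lower than before (for example after `btrfs device stats -z`), is
    compared against zero so errors recorded after a reset still show up.
    """
    deltas = {}
    for key, value in current.items():
        baseline = previous.get(key, 0)
        if value < baseline:
            baseline = 0  # counters were reset since the last snapshot
        if value > baseline:
            deltas[key] = value - baseline
    return deltas


def save_snapshot(path, stats):
    """Write the current counters to the state file."""
    with open(path, "w") as handle:
        json.dump(stats, handle, sort_keys=True)


def load_snapshot(path):
    """Read the previous counters; an absent state file means a first run."""
    try:
        with open(path) as handle:
            return json.load(handle)
    except FileNotFoundError:
        return {}
```

On a first run every non-zero counter counts as an increase, which seems the safer default for a monitoring check.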
> My own version of the btrfs utility doesn't have that command though, I
> am using a Debian stable system.  I tried a newer version and it gives
> 
> ERROR: ioctl(BTRFS_IOC_GET_DEV_STATS)
> 
> so I probably need to update my kernel too.
You've likely read it before, but btrfs remains a filesystem under heavy
development, with every kernel bringing fixes for known bugs and 
userspace tools developed in tandem, and every btrfs user at this point 
is by definition a development filesystem tester.  While there are 
reasons one may wish to be conservative and stick with a known stable 
system, they really tend to be antithetical with the reasons one would 
have for testing something as development edge as btrfs at this point.  
Thus, upgrading to a current kernel (3.12.x at this point, if not 3.13 
development kernel as rc1 just came out) and btrfs-progs (at least, you 
can keep the rest of the system stable Debian if you like) is very 
strongly recommended if you''re testing btrfs, in any case.

(For btrfs-progs, development happens in git branches, with merges to the 
master branch only when changes are considered release-ready.  So current 
git-master btrfs-progs is always the reference.  FWIW, here's what
btrfs --version outputs here, btrfs-progs from git updated as of yesterday
as it happens, tho I usually keep within a week or two:
Btrfs v0.20-rc1-598-g8116550.)

See the btrfs wiki for more:  https://btrfs.wiki.kernel.org.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman
