thr3ads.net - Btrfs devel - Status of self-healing features in btrfs [Jan 2014]

Kai Krakow

2014-Jan-11 11:32 UTC

Status of self-healing features in btrfs

Hello list!

What is the status of btrfs' self-healing capabilities?

On my backup btrfs device I am currently facing back-reference errors that
btrfs cannot deal with online and that btrfsck cannot repair either (it
bails out with an assertion). I'm going to report this in a separate post.

So this post comes down to the question of self-healing.

I imagine btrfs should support the following features (just a few thoughts;
I have not done a rigorous analysis, and I am no filesystem dev):

  * error resilience: fix simple structure errors silently but inform the
    admin about the problem (by logging to klog so it can be watched by
    a process to send a notification to the admin)

  * data integrity: I think this is almost done already: check for a good
    copy of the data block on the other devices using checksums,
    automatically rebuild the bad blocks during this process, and notify
    the admin

  * zero-downtime guarantees: Try to avoid taking the volume offline or
    forcing a read-only remount; instead, log the error within the fs
    structures, deny access to the data being accessed, and inform the
    admin that an offline/mount-time check/repair is needed

  * transparent self-healing: If the data being accessed is currently being
    repaired or scanned (because this has been triggered by some process),
    block the accessing process until that fixing has finished. Of course
    this means we either need to guarantee really short time periods for
    self-healing or have a timeout on blocking the process, so that
    processes do not pile up and can bail out with an error if blocked for
    too long. Actually, this is similar to auto-remount ro, with the
    difference that it only affects parts of the filesystem.

  * fast automatic scan and repair: At mount-time, check for logged errors
    and run a scanning/fixing thread (either offline or online) to check
    only the device areas marked as faulty. Ideally, it should only take a
    few seconds so downtime is minimized

  * Forced full online filesystem scan: Allow running a full online scan for
    errors, triggering the above-mentioned methods if errors are found

  * Forced full offline filesystem scan: As a last resort, or when errors
    cannot be fixed with any of the above methods, allow running a full
    offline scan and fix for errors. The above methods can notify the admin
    that this scan is actually needed because the errors found are too
    extensive to be fixed online or by self-healing
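
To make the data-integrity and zero-downtime points a bit more concrete,
here is a rough sketch in Python. It is only an illustration of the idea, not
how btrfs actually implements it: the function name, the in-memory "devices"
(plain lists of blocks), the log list standing in for klog, and the use of
zlib.crc32 as the checksum are all assumptions made for this example.

```python
import zlib


def read_with_self_heal(copies, index, expected_crc, log):
    """Return a verified block, repairing bad mirror copies on the way.

    copies       -- one list of blocks per (hypothetical) device
    index        -- block number to read
    expected_crc -- checksum recorded for that block
    log          -- list standing in for the kernel log (assumption)
    """
    good = None
    bad_devices = []
    for dev, blocks in enumerate(copies):
        if zlib.crc32(blocks[index]) == expected_crc:
            if good is None:
                good = blocks[index]  # first copy that passes the checksum
        else:
            bad_devices.append(dev)

    if good is None:
        # No good copy anywhere: deny access to this block only and tell
        # the admin, instead of taking the whole volume read-only.
        log.append(f"block {index}: no good copy, offline check needed")
        raise OSError(f"block {index} is unreadable")

    for dev in bad_devices:
        copies[dev][index] = good  # rebuild the bad copy from the good one
        log.append(f"block {index}: repaired copy on device {dev}")
    return good
```

The reader gets good data whenever at least one copy passes its checksum,
every repair leaves a log entry an admin process could watch for, and an
unrecoverable block fails only the access that touched it.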

I think all these capabilities are really needed with today's device sizes (a
full filesystem scan can take hours, maybe even days) and the requirements of
production systems (desktops and servers).

Regards,
Kai

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
