Hello list!
What is the status of btrfs' self-healing capabilities?
On my backup btrfs device I am currently facing back-reference errors that
btrfs cannot deal with online and that btrfsck cannot repair either (it
bails out with an assertion). I will report those in a separate post.
This post is only about the general question of self-healing.
I imagine btrfs should support the following features (just a few thoughts;
I have not analyzed this rigorously, and I am no filesystem dev):
* error resilience: fix simple structure errors silently but inform the
admin about the problem (by logging to klog so it can be watched by
a process to send a notification to the admin)
* data integrity: I think this is almost done already: use checksums to
find a good copy of the data block on the other devices, automatically
rebuild the bad blocks during this process, and notify the admin
* zero-downtime guarantees: Try to avoid taking the volume offline or
forcing a read-only remount; instead log the error within the fs
structures, deny access to the affected data, and inform the admin that
an offline/mount-time check/repair is needed
* transparent self-healing: If the data being accessed is currently being
repaired or scanned (because this has been triggered by some process),
block the accessing process until the fix has finished. Of course this
means we either need to guarantee that self-healing takes only a very
short time, or put a timeout on the blocking so that processes do not
pile up and can bail out with an error if blocked for too long.
Actually, this is similar to auto-remount ro, with the difference that it
only affects parts of the filesystem.
* fast automatic scan and repair: At mount-time, check for logged errors
and run a scanning/fixing thread (either offline or online) to check
only the device areas marked as faulty. Ideally, it should only take a
few seconds so downtime is minimized
* forced full online filesystem scan: Allow running a full online scan for
errors, triggering the above-mentioned methods when errors are found
* forced full offline filesystem scan: As a last resort, or when errors
cannot be fixed with any of the above methods, allow running a full
offline scan and fix for errors. The above methods can notify the admin
that this scan is needed because the errors found are too extensive to
be fixed online or by self-healing
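To make the "data integrity" point above a bit more concrete, here is a
rough sketch in Python of what I have in mind. It is only an illustration
and not how btrfs implements it: the mirror layout (one bytearray per
device), the function names and the log list are all made up for this
example, and CRC32 from zlib merely stands in for the real checksum.

```python
import zlib


def checksum(data: bytes) -> int:
    # CRC32 is only a stand-in for the filesystem's real checksum.
    return zlib.crc32(data)


def read_with_repair(copies, expected_sum, log):
    """Return a good copy of a block and rewrite bad copies in place.

    copies       -- list of bytearray, one per device (hypothetical layout)
    expected_sum -- checksum recorded when the block was written
    log          -- list collecting messages meant for the admin
    """
    # Look for the first copy whose checksum matches the recorded one.
    good = None
    for data in copies:
        if checksum(bytes(data)) == expected_sum:
            good = bytes(data)
            break

    if good is None:
        # Nothing to heal from: this is the case that needs a real repair.
        log.append("unrecoverable: no copy matches the checksum")
        raise IOError("no good copy of block")

    # Overwrite every bad copy with the good one and tell the admin.
    for dev, data in enumerate(copies):
        if checksum(bytes(data)) != expected_sum:
            data[:] = good
            log.append(f"repaired copy on device {dev}")

    return good
```

A reader would call read_with_repair() instead of reading one copy
directly; the log list plays the role of the kernel log that a
notification process could watch.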
I think all these capabilities are really needed with today's device sizes
(a full filesystem scan can take hours, maybe even days) and the
requirements of production systems (desktops and servers).
Regards,
Kai