I've been experimenting lately with the btrfs RAID1 implementation and have to
say that it is performing quite well, but there are a few problems:
* when I purposefully damage partitions on which btrfs stores data (for
example, by changing the case of letters), it reads the other copy and
returns correct data. It doesn't report this in dmesg every time, but it
does correct the copy with the wrong checksum
* when both copies are damaged, it returns the damaged block as it is
written(!) and only adds a warning in dmesg with exactly the same wording as
for the single-copy corruption(!!)
* from what I could find, btrfs doesn't remember anywhere the number of
detected and fixed corruptions
I don't know if this is the final design. While the first and last points are
minor inconveniences, the second one is quite major: at this time btrfs doesn't
prevent silent corruption from going unnoticed. I think that reading from such
blocks should return EIO (unless mounted nodatasum) or at least trigger a
broadcast message noting that a corrupted block is being returned to userspace.
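To make the proposed behaviour concrete, here is a minimal sketch of the read
path I have in mind. It is not btrfs code: MirroredBlock, MirrorStats and
read_block are hypothetical names, and zlib.crc32 only stands in for the
checksum btrfs really uses. It tries each copy in turn, repairs the bad copies
it passed once a good one is found, counts what happened, and fails with EIO
when no copy verifies.

```python
import errno
import zlib
from dataclasses import dataclass


@dataclass
class MirrorStats:
    """Hypothetical counters for detected and fixed corruptions."""
    detected: int = 0
    repaired: int = 0
    unrecoverable: int = 0


@dataclass
class MirroredBlock:
    """One logical block: its copies and the checksum stored at write time."""
    copies: list   # one bytes object per mirror
    checksum: int  # expected checksum of the block


def checksum(data: bytes) -> int:
    # Assumption: zlib.crc32 is a stand-in, not the btrfs checksum.
    return zlib.crc32(data)


def read_block(block: MirroredBlock, stats: MirrorStats) -> bytes:
    """Return verified data, repairing bad copies; raise EIO if none verify."""
    bad = []
    for index, copy in enumerate(block.copies):
        if checksum(copy) == block.checksum:
            # Overwrite every bad copy seen so far with the good data.
            for bad_index in bad:
                block.copies[bad_index] = copy
                stats.repaired += 1
            return copy
        bad.append(index)
        stats.detected += 1
    # No copy matched: refuse to hand damaged data to userspace.
    stats.unrecoverable += 1
    raise OSError(errno.EIO, "all copies failed checksum verification")
```

With one damaged copy the caller gets correct data and the counters record one
detection and one repair; with both copies damaged the caller gets an error
instead of the damaged block.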
I've also been thinking about tiered storage (meaning 2+ tiers, not only
two-tiered) and have some ideas about it.
I think that there need to be 3 different mechanisms working together to
achieve high performance:
* ability to store all metadata on selected volumes (probably read optimised
SSDs)
* ability to store all newly written data on selected volumes (write optimised
SSDs)
* ability to differentiate between often written, often read and infrequently
accessed data (and based on this information, ability to move this data to
fast SSDs, slow SSDs, fast RAID, slow RAID or MAID)
While the first two are rather straightforward, the third one needs some
explanation. I think that for this to work, we should save not only the time
of the last access to a file and its last change time but also a few past
values (I think that at least 8 to 16 ctimes and atimes are necessary, but this
will need testing). I'm not sure how and exactly when to move this data around
to keep the arrays balanced, but a userspace daemon would be the most flexible.
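As a rough illustration of how a daemon could use such a history, here is a
small sketch. Everything in it is an assumption made for the example: the
AccessHistory type, the history length of 8, the one-hour window and the rule
that "often" means the whole history fits inside that window. Which class of
data goes to which kind of storage is left open on purpose.

```python
from collections import deque
from dataclasses import dataclass, field

HISTORY_LEN = 8  # assumption: lower end of the 8 to 16 range, needs testing


def _history() -> deque:
    # A bounded deque drops the oldest timestamp once it is full.
    return deque(maxlen=HISTORY_LEN)


@dataclass
class AccessHistory:
    """Hypothetical per-file record of the last few atimes and ctimes."""
    atimes: deque = field(default_factory=_history)
    ctimes: deque = field(default_factory=_history)

    def record_read(self, now: float) -> None:
        self.atimes.append(now)

    def record_write(self, now: float) -> None:
        self.ctimes.append(now)


def is_frequent(times: deque, now: float, window: float) -> bool:
    """True if the history is full and its oldest entry is within the window."""
    return len(times) == times.maxlen and now - times[0] <= window


def classify(history: AccessHistory, now: float, window: float = 3600.0) -> str:
    """Sort a file into one of the three classes from the list above."""
    if is_frequent(history.ctimes, now, window):
        return "often-written"
    if is_frequent(history.atimes, now, window):
        return "often-read"
    return "infrequent"
```

A single timestamp cannot tell a file read once a minute from a file read once
a month that happened to be touched just now; a short history can.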
This solution won't work well for file systems with a few very large files of
which very few parts change often; in other words, it won't be doing
block-level tiered storage. From what I know, databases would benefit most from
such a configuration, but then most databases can already partition tables into
different files based on access rate. As such, keeping the granularity at file
level would make this mechanism easy to implement while still useful.
On second thought: it won't be exactly file-level granular. If we introduce
snapshots into the mix, the new version can have its data regularly accessed
while the old snapshot won't; this way the obsolete blocks can be moved to slow
storage.
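A toy model of the snapshot case, assuming each file version is just a set of
block numbers and that we already know which versions are hot (the function
and the names are hypothetical, not anything btrfs provides):

```python
def blocks_to_demote(file_blocks: dict, hot_files: set) -> set:
    """Return the blocks that no hot file version references."""
    all_blocks = set()
    hot_blocks = set()
    for name, blocks in file_blocks.items():
        all_blocks |= set(blocks)
        if name in hot_files:
            hot_blocks |= set(blocks)
    # Blocks shared with a hot version stay; the rest can go to slow storage.
    return all_blocks - hot_blocks


# The live file rewrote block 3 as block 5 after the snapshot was taken.
versions = {
    "db": {1, 2, 5},           # current version, accessed regularly
    "snapshot/db": {1, 2, 3},  # old snapshot, rarely accessed
}
demote = blocks_to_demote(versions, hot_files={"db"})
```

Here only block 3, which is referenced by the old snapshot alone, ends up in
demote, while blocks 1 and 2 stay on fast storage because the live file still
uses them.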
--
Hubert Kario
QBS - Quality Business Software
02-656 Warszawa, ul. Ksawerów 30/85
tel. +48 (22) 646-61-51, 646-74-24
qbs.com.pl