I've been experimenting lately with the btrfs RAID1 implementation and have
to say that it is performing quite well, but there are a few problems:
* when I purposefully damage the partitions on which btrfs stores data (for 
  example, by changing the case of letters) it will read the other copy and 
  return correct data. It doesn't report this fact in dmesg every time, but 
  it does correct the copy with the wrong checksum
* when both copies are damaged it returns the damaged block as it is
  written(!) and only adds a warning in dmesg with the exact same wording as 
  for single-copy corruption(!!)
* from what I could find, btrfs doesn't record anywhere the number of
  detected and fixed corruptions
I don't know if this is the final design, and while the first and last points
are minor inconveniences, the second one is quite major: as it stands, it
doesn't prevent silent corruption from going unnoticed. I think that reading 
such blocks should return EIO (unless mounted with nodatasum), or at least 
trigger a broadcast message noting that a corrupted block is being returned 
to userspace.
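The read path behaviour I have in mind can be sketched in Python (all names
hypothetical; this is only an illustration of the logic, not the actual btrfs
code, and I'm using CRC32 as a stand-in checksum):

```python
import errno
import zlib

def read_mirrored_block(copies, stored_crc):
    """Sketch of the RAID1 read behaviour I would expect:
    try each mirror, repair silently-corrupted copies from a
    good one, and refuse to hand bad data to userspace."""
    good = None
    bad = []
    for i, data in enumerate(copies):
        if zlib.crc32(data) == stored_crc:
            good = data
        else:
            bad.append(i)
    if good is None:
        # Both copies damaged: return EIO instead of silently
        # handing corrupted data back to the caller.
        raise OSError(errno.EIO, "checksum error on all mirrors")
    for i in bad:
        copies[i] = good  # scrub the damaged mirror in place
    return good
```

With one corrupted copy this returns the good data and rewrites the bad
mirror; only when every copy fails its checksum does the read fail with EIO.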
I've also been thinking about tiered storage (meaning 2+ tiers, not only two)
and have some ideas about it.
I think that there need to be 3 different mechanisms working together to 
achieve high performance:
* ability to store all metadata on selected volumes (probably read optimised 
  SSDs)
* ability to store all newly written data on selected volumes (write optimised 
  SSDs)
* ability to differentiate between often written, often read and infrequently 
  accessed data (and based on this information, ability to move this data to 
  fast SSDs, slow SSDs, fast RAID, slow RAID or MAID)
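A minimal sketch of how the three mechanisms could combine into one placement
policy (all names and thresholds are hypothetical, chosen just to make the
idea concrete; nothing here reflects actual btrfs interfaces):

```python
# Hypothetical volume classes; a real implementation would live
# in the allocator, not in userspace Python.
READ_SSD, WRITE_SSD, FAST_RAID, SLOW_RAID, MAID = range(5)

def pick_tier(is_metadata, is_new_write,
              reads_per_day=0.0, writes_per_day=0.0):
    """Combine the three mechanisms:
    1. all metadata on read-optimised SSDs,
    2. all newly written data on write-optimised SSDs,
    3. existing data placed by its observed access rates."""
    if is_metadata:
        return READ_SSD
    if is_new_write:
        return WRITE_SSD
    if writes_per_day > 10:              # often written
        return WRITE_SSD
    if reads_per_day > 10:               # often read
        return READ_SSD
    if reads_per_day + writes_per_day > 1:
        return FAST_RAID
    if reads_per_day + writes_per_day > 0.1:
        return SLOW_RAID
    return MAID                          # practically never touched
```

The thresholds would obviously have to be tunable, or learned by the daemon
from how full each tier is.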
While the first two are rather straightforward, the third one needs some 
explanation. I think that for this to work, we should save not only the time 
of last access and last change, but also a few past values (I think that at 
least 8 to 16 ctimes and atimes are necessary, but this will need testing). 
I'm not sure how and exactly when to move this data around to keep the
arrays balanced, but a userspace daemon would be most flexible.
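The timestamp history could be a small per-file ring buffer; here is a sketch
of how a userspace daemon might estimate access frequency from the last N
atimes (N=8, at the low end of the 8-16 range suggested above; all names are
hypothetical):

```python
from collections import deque

HISTORY = 8  # within the suggested 8-16 range; needs testing

class FileHeat:
    """Keep the last few atimes/ctimes instead of only the latest,
    so access *frequency* can be estimated, not just recency."""
    def __init__(self):
        self.atimes = deque(maxlen=HISTORY)  # old entries fall off
        self.ctimes = deque(maxlen=HISTORY)

    def record_read(self, t):
        self.atimes.append(t)

    def record_write(self, t):
        self.ctimes.append(t)

    def reads_per_day(self, now):
        # Frequency over the window covered by the retained atimes.
        if len(self.atimes) < 2:
            return 0.0
        span = now - self.atimes[0]
        if span <= 0:
            return float("inf")
        return len(self.atimes) * 86400.0 / span
```

The daemon would periodically walk these records and hand cold files to a
relocation mechanism; keeping only a bounded history keeps the per-file
metadata cost constant.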
This solution won't work well for file systems with a few very large files of
which very few parts change often; in other words, it won't be doing block-
level tiered storage. From what I know, databases would benefit most from
such a configuration, but then most databases can already partition tables
into different files based on access rate. As such, making its granularity
file-level would keep this mechanism easy to implement while still useful.
On second thought, it won't be exactly file-level granular once we introduce 
snapshots into the mix: the new version can have its data regularly accessed 
while the old snapshot won't, so the obsolete blocks can be moved to slow 
storage.
-- 
Hubert Kario
QBS - Quality Business Software
02-656 Warszawa, ul. Ksawerów 30/85
tel. +48 (22) 646-61-51, 646-74-24
www.qbs.com.pl