Here is a short proposal for the hybrid storage cache idea, with an
introduction/motivation and a bird's eye view of an approach to implementing
a hybrid storage cache for btrfs. Please note that there are currently no
patches available. We would like to get as much input as possible before we
start designing and implementing a solution.

1. Introduction

The emergence of Solid State Drives (SSDs) changes how data is stored for
fast access. Their high throughput and low latency make them a good choice
for applications that traditionally require many hard drives.

SSDs are still more expensive per GB, making traditional hard drives a good
and affordable solution for storing bulk amounts of data. Often, the working
set of a filesystem is smaller than the full capacity of a drive. We can
exploit this by keeping the frequently used parts of the bulk data on SSDs.
We prioritize data that is often accessed randomly, while larger bulk
operations are kept on bulk storage.

Recent developments in Linux SSD caching use a block IO approach. That
approach assumes that data is stable on disk and evicts data based on LRU,
temperature, etc. This is great for read-only IO patterns and in-place
writes. However, btrfs uses a copy-on-write approach, which reduces the
benefits of block IO caching: the block caches are unable to track updates
(this would require extensive hints back and forth between the cache layer
and the filesystem), and data and metadata look the same to the block layer.

The internal filesystem information available within btrfs allows these
types of updates to be separated and enables fine-grained control of a
to-be-implemented cache.

2. Overview

The design space for a cache is divided into reads and writes. For both read
and write caches, we divide them into caching metadata (tree) accesses or
user data. Writes are further divided into either write-back or
write-through.

2.1 Overview

Any device attached to the storage pool should be allowed to be used as a
cache. It is natural to store the cache in the already implemented chunk
architecture (as cache chunks). Each allocated cache chunk may(?) be
available to one or more subvolumes.

2.2 Caching hierarchy

By adding an extra layer, we have the following access pattern:
host memory -> SSD or disk cache -> disk.

- Host memory caches lookup paths, transactions, free space information,
  etc.
- The SSD/disk cache holds frequently used data, or writes of data that
  cannot stay in host memory.
- Traditional hard drives store the largest amount of data and hold a
  complete copy of all data.

2.3 Hotness tracking

The data to cache is defined by some hotness algorithm, which can be applied
at different layers of btrfs:

- Inode level
  The recently implemented VFS hot-track patches enable us to detect hotness
  for files within any given VFS filesystem implementation. For reads that
  are within a tunable cache size, we either cache the tree leaf or its
  extent. For writes, we track the tree updates and write them to the SSD.
  If the file is larger and receives a considerable amount of reads or
  writes, its hot subparts should be cached.

- Tree level
  Within the fs, we can track the hotness of each tree. If a tree is read or
  updated frequently, it should be served by the SSD cache.

- Extent level
  Whole extents or parts of extents should be tracked and cached if needed.

2.4 Cache opportunities

- Hotness tracking for random reads

  Define a threshold for when to cache reads. A back-of-the-envelope
  calculation tells us to cache when the IO size is below 1.5MB. This
  assumes 100 IO/s and a read speed of 150MB/s from the traditional drives.
  This should be tunable (a minimal sketch of the check follows below).

  If data is updated, we should "follow" the newly written data and evict
  the "old" data from the cache. As such, if the old data was cached, we
  make sure to also cache the new data.

  Implementation details:
  - Use the hot-track patches for VFS to track when an inode is hot and
    then cache the reads.
  - Track CoW actions and pre-warm the cache.
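  To make the admission rule concrete, here is a minimal sketch of the
  threshold check, assuming the numbers above; the struct, its fields and
  should_cache_read() are hypothetical names for illustration, not btrfs
  code:

  #include <stdbool.h>
  #include <stdint.h>

  struct cache_tunables {
          uint64_t hdd_seq_bw;  /* sequential read bandwidth, bytes/s */
          uint32_t hdd_iops;    /* random operations per second */
  };

  /* Break-even request size: below this the drive is seek-bound. */
  static uint64_t read_cache_threshold(const struct cache_tunables *t)
  {
          return t->hdd_seq_bw / t->hdd_iops; /* 150 MB/s / 100 IO/s = 1.5 MB */
  }

  /* Admit a read into the SSD cache only if it is small and the inode is hot. */
  static bool should_cache_read(const struct cache_tunables *t,
                                uint64_t io_size, bool inode_is_hot)
  {
          return inode_is_hot && io_size < read_cache_threshold(t);
  }

  With different drives the break-even point moves, which is why the
  threshold should be a tunable rather than a constant.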
- Write-back cache

  * Tree updates

    Updates to trees are batched and flushed every 30 seconds. Flush the
    updates to the cache layer first, and then flush them to bulk storage
    later.

    When updates are flushed to bulk storage, we reorder IOs to be as
    sequential as possible. This optimization gives us higher throughput at
    the cost of sorting the writes at flush time.

    The technique requires that we track tree updates between the disk
    cache and the disk. As our trees are append-only, we can track the
    current generation and apply the difference at timed intervals or at
    mount/unmount time (a rough sketch of this bookkeeping follows at the
    end of this mail).

    Implementation details:
    - This can be implemented on a per-tree basis, e.g. for specific extent
      trees, the checksum tree or other frequently updated trees.

  * Data updates

    Data is placed in one of two places: if small, directly inside the tree
    leaves; if larger, within extents. If an inode is known to be hot, we
    cache its updates.

- Write-through cache for user data

  If the cache isn't "safe", i.e. there are no duplicate copies, it can
  still be used in write-through mode and cache subsequent reads. This is
  similar to a pure disk block-based cache approach.

2.5 Other

- Warm up the cache at mount time

  Reread the SSD cache on mount to enjoy a preheated cache of the bulk
  storage. This can be achieved by storing information about the cache and
  reconstructing the cache tree.

- (By David Sterba) Implement appropriate debugfs/sysfs hooks for monitoring
  the cache and getting information about the size of trees. This is useful
  for deciding whether a tree should be cached on an SSD or not. E.g. the
  checksum tree might always be in memory, but if it isn't, it should be
  cached on the SSD storage to lower the number of checksum tree seeks on
  the bulk storage.

2.6 Summary

The following list of items has to be addressed for the first full patchset:

- Cache lookup
- Cache type (write-through, write-back, hot tracking, etc.)
- Data structure for the lookup cache
- Allowing prioritized storage (e.g. PCM > SSD > HDD)
- Eviction strategy: LRU, LFU, FIFO, temperature-based (VFS hot track)
- Disk layout for cache storage

Here we presented our design space for a hybrid drive solution, as well as
what it would take to carry it out. We are very much open to any kind of
input, feedback or new ideas you might have.

Sincerely,
Matias & Jesper
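A rough sketch of the generation-delta write-back described under "Tree
updates" in section 2.4. Every structure and helper below is hypothetical
(btrfs itself only provides the per-tree generation numbers); it is a sketch
of the idea, not an actual implementation:

#include <stdint.h>
#include <stdlib.h>

struct tree_block {
        uint64_t bytenr;        /* target offset on bulk storage */
        uint64_t generation;    /* transaction that produced the block */
        /* ... block payload ... */
};

struct cached_tree {
        uint64_t cache_gen;             /* newest generation on the SSD */
        uint64_t bulk_gen;              /* newest generation on the HDD */
        struct tree_block **dirty;      /* blocks not yet on bulk storage */
        size_t nr_dirty;
};

/* Stub: a real implementation would submit a write to the bulk device. */
static void write_block_to_bulk(struct tree_block *blk)
{
        (void)blk;
}

static int by_bytenr(const void *a, const void *b)
{
        const struct tree_block *x = *(struct tree_block * const *)a;
        const struct tree_block *y = *(struct tree_block * const *)b;

        return (x->bytenr > y->bytenr) - (x->bytenr < y->bytenr);
}

/*
 * Run at timed intervals or at mount/unmount: replay everything the bulk
 * copy is missing, sorted by target offset so the HDD sees writes that are
 * as sequential as possible.
 */
static void flush_tree_to_bulk(struct cached_tree *t)
{
        size_t i;

        qsort(t->dirty, t->nr_dirty, sizeof(*t->dirty), by_bytenr);

        for (i = 0; i < t->nr_dirty; i++)
                if (t->dirty[i]->generation > t->bulk_gen)
                        write_block_to_bulk(t->dirty[i]);

        t->bulk_gen = t->cache_gen;
        t->nr_dirty = 0;
}

Tracking only the generation delta keeps the bookkeeping proportional to the
amount of dirty tree data rather than to the size of the tree.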
Matias Bjorling wrote:
> Here is a short proposal for the hybrid storage cache idea, with an
> introduction/motivation and a bird's eye view of an approach to
> implementing a hybrid storage cache for btrfs. Please note that there are
> currently no patches available. We would like to get as much input as
> possible before we start designing and implementing a solution.

Just to toss this out there: aside from the option of 'cache chunks' there
is the option of using normal chunks, and instead placing extents on chunks
based on the performance characteristics of the underlying device.

As an analogy, if the above proposal is closer to the 'bcache' approach of
treating the SSD purely as a performance enhancer that does not contribute
to capacity, performance-based allocation is closer to the 'btier' [1]
approach in which the SSD does contribute to capacity, but has 'dibs' on
performance-sensitive data.

[1] Btier sourceforge page: http://sourceforge.net/projects/tier/
    Lessfs homepage, with btier info: http://www.lessfs.com/wordpress/
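As a minimal illustration of the difference, a tier-style allocator might
prefer the fastest chunk with free space for hot extents while still
counting every chunk toward usable capacity. Everything below (the device
classes, struct chunk_info and pick_chunk()) is a hypothetical sketch, not
existing btrfs code:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum device_class { DEV_HDD = 0, DEV_SSD = 1, DEV_PCM = 2 };

struct chunk_info {
        enum device_class class;
        uint64_t free_bytes;
};

/*
 * Pick a target chunk for a new extent: hot extents take the fastest chunk
 * with room, cold extents take the slowest. Unlike a pure cache, every
 * chunk still contributes to capacity.
 */
static struct chunk_info *pick_chunk(struct chunk_info *chunks, size_t n,
                                     uint64_t len, bool hot)
{
        struct chunk_info *best = NULL;
        size_t i;

        for (i = 0; i < n; i++) {
                if (chunks[i].free_bytes < len)
                        continue;
                if (!best ||
                    (hot && chunks[i].class > best->class) ||
                    (!hot && chunks[i].class < best->class))
                        best = &chunks[i];
        }
        return best;
}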
On Thu, Feb 21, 2013 at 3:46 AM, Matias Bjorling <mabj@itu.dk> wrote:
> Recent developments in Linux SSD caching use a block IO approach. That
> approach assumes that data is stable on disk and evicts data based on LRU,
> temperature, etc. This is great for read-only IO patterns and in-place
> writes. However, btrfs uses a copy-on-write approach, which reduces the
> benefits of block IO caching: the block caches are unable to track updates
> (this would require extensive hints back and forth between the cache layer
> and the filesystem), and data and metadata look the same to the block
> layer.

Another great reason for this to be implemented in btrfs is that knowledge
of redundancy can be applied. Chunks that are mirrored should be unlikely to
need both copies on SSD devices unless they are very heavily used (probably
true for some of the metadata but not for data). Conversely, there is little
benefit to putting one stripe of a raid0/5/6 into the SSD device without the
rest of that data reaching the same level.

Not that additional reasons to do this work in btrfs were needed, but it
does need to be thought about how this implementation interacts with those
features.

--
Gareth Pye
Level 2 Judge, Melbourne, Australia
Australian MTG Forum: mtgau.com
gareth@cerberos.id.au - www.rockpaperdynamite.wordpress.com
"Dear God, I would like to file a bug report"
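A rough sketch of what such redundancy-aware cache admission could look
like. The profile enum, the hotness cut-off and both helpers are made up for
illustration; they are not btrfs code:

#include <stdbool.h>

enum bg_profile {
        BG_SINGLE, BG_DUP, BG_RAID1, BG_RAID10,
        BG_RAID0, BG_RAID5, BG_RAID6,
};

#define VERY_HOT 200    /* hypothetical hotness score cut-off */

/* How many copies of a cached extent to keep on the SSD tier. */
static int cached_copies(enum bg_profile profile, unsigned int hotness)
{
        switch (profile) {
        case BG_DUP:
        case BG_RAID1:
        case BG_RAID10:
                /* Only very hot data (often metadata) earns both copies. */
                return hotness > VERY_HOT ? 2 : 1;
        default:
                return 1;
        }
}

/* Striped/parity profiles: promote the whole stripe set or nothing. */
static bool worth_caching(enum bg_profile profile, bool whole_stripe_hot)
{
        if (profile == BG_RAID0 || profile == BG_RAID5 || profile == BG_RAID6)
                return whole_stripe_hot;
        return true;
}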
Gareth Pye wrote:
> On Thu, Feb 21, 2013 at 3:46 AM, Matias Bjorling <mabj@itu.dk> wrote:
>> Recent developments in Linux SSD caching use a block IO approach. [...]
>
> Another great reason for this to be implemented in btrfs is that knowledge
> of redundancy can be applied. Chunks that are mirrored should be unlikely
> to need both copies on SSD devices unless they are very heavily used
> (probably true for some of the metadata but not for data).

On the other hand, there's the option of doing something that bcache has had
on its roadmap for a while (but I'm not sure was ever implemented): striping
clean data and mirroring dirty data across the SSDs. The intent for bcache
was to maximize both redundancy (for dirty data, which can't be regenerated)
and space (for clean data, where if you lose some you can just shrug and say
"oh well") in the case of one SSD failing.

> Conversely, there is little benefit to putting one stripe of a raid0/5/6
> into the SSD device without the rest of that data reaching the same level.
>
> Not that additional reasons to do this work in btrfs were needed, but it
> does need to be thought about how this implementation interacts with those
> features.

Agreed.
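The policy itself is small enough to state as code. This is only a sketch of
the "mirror dirty, stripe clean" idea for a multi-SSD cache; the enum and
helper are hypothetical:

#include <stdbool.h>

enum ssd_cache_layout { CACHE_STRIPE, CACHE_MIRROR };

/*
 * Dirty (write-back) data exists only in the cache and cannot be
 * regenerated, so it is mirrored; clean data can be re-read from bulk
 * storage, so it is striped for space.
 */
static enum ssd_cache_layout cache_layout_for(bool dirty)
{
        return dirty ? CACHE_MIRROR : CACHE_STRIPE;
}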
On 02/20/2013 08:19 PM, Alex Elsayed wrote:
> Just to toss this out there: aside from the option of 'cache chunks' there
> is the option of using normal chunks, and instead placing extents on
> chunks based on the performance characteristics of the underlying device.
>
> As an analogy, if the above proposal is closer to the 'bcache' approach of
> treating the SSD purely as a performance enhancer that does not contribute
> to capacity, performance-based allocation is closer to the 'btier' [1]
> approach in which the SSD does contribute to capacity, but has 'dibs' on
> performance-sensitive data.
>
> [1] Btier sourceforge page: http://sourceforge.net/projects/tier/
>     Lessfs homepage, with btier info: http://www.lessfs.com/wordpress/

Good point. I think both approaches are relatively easy to combine.
Hi,

It's a bit long, so I haven't read the whole of it, but I want to know if it
has any collision with my ongoing feature "btrfs hot relocation/migration"?

On Thu, Feb 21, 2013 at 12:46 AM, Matias Bjorling <mabj@itu.dk> wrote:
> Here is a short proposal for the hybrid storage cache idea, with an
> introduction/motivation and a bird's eye view of an approach to
> implementing a hybrid storage cache for btrfs. [...]
> [rest of the proposal snipped]
--
Regards,

Zhi Yong Wu
On 02/26/2013 12:04 PM, Zhi Yong Wu wrote:
> Hi,
>
> It's a bit long, so I haven't read the whole of it, but I want to know if
> it has any collision with my ongoing feature "btrfs hot
> relocation/migration"?

It will utilize the hot-track patch set you have been creating for VFS. I
think the methods are complementary. Do you have a description of the
relocation/migration approach?
On Tue, Feb 26, 2013 at 9:08 PM, Matias Bjørling <mabj@itu.dk> wrote:
> On 02/26/2013 12:04 PM, Zhi Yong Wu wrote:
>> Hi,
>>
>> It's a bit long, so I haven't read the whole of it, but I want to know
>> if it has any collision with my ongoing feature "btrfs hot
>> relocation/migration"?
>
> It will utilize the hot-track patch set you have been creating for VFS. I
> think the methods are complementary. Do you have a description of the
> relocation/migration approach?

Sorry, not currently. I will post them out after they are completed.

--
Regards,

Zhi Yong Wu