bchociej@gmail.com
2010-Jul-27 22:00 UTC
[RFC PATCH 0/5] Btrfs: Add hot data tracking functionality
INTRODUCTION:

This patch series adds experimental support for tracking data temperature in Btrfs. Essentially, this means maintaining some key stats (like number of reads/writes, last read/write time, frequency of reads/writes), then distilling those numbers down to a single "temperature" value that reflects what data is "hot." The long-term goal of these patches, as discussed in the Motivation section below, is to enable Btrfs to perform automagic relocation of hot data to fast media like SSD. This goal has been motivated by the Project Ideas page on the Btrfs wiki.

Of course, users are warned not to run this code outside of development environments. These patches are EXPERIMENTAL, and as such they might eat your data and/or memory.

MOTIVATION:

The overall goal of enabling hot data relocation to SSD has been motivated by the Project Ideas page on the Btrfs wiki at https://btrfs.wiki.kernel.org/index.php/Project_ideas. It is hoped that this initial patchset will eventually mature into a usable hybrid storage feature set for Btrfs.

This is essentially the traditional cache argument: SSD is fast and expensive; HDD is cheap but slow. ZFS, for example, can already take advantage of SSD caching. Btrfs should also be able to take advantage of hybrid storage without any broad, sweeping changes to existing code.

With Btrfs's COW approach, an external cache (where data is *moved* to SSD, rather than just cached there) makes a lot of sense. Though these patches don't enable any relocation yet, they do lay an essential foundation for enabling that functionality in the near future. We plan to roll out an additional patchset introducing some of the automatic migration functionality in the next few weeks.
SUMMARY:

- Hooks in existing Btrfs functions to track data access frequency (btrfs_direct_IO, btrfs_readpages, and extent_write_cache_pages)
- New rbtrees for tracking access frequency of inodes and sub-file ranges (hotdata_map.c)
- A hash list for indexing data by its temperature (hotdata_hash.c)
- A debugfs interface for dumping data from the rbtrees (debugfs.c)
- A foundation for relocating data to faster media based on temperature (future patchset)
- Mount options for enabling temperature tracking (-o hotdatatrack, -o hotdatamove; move implies track; both default to disabled)
- An ioctl to retrieve the frequency information collected for a certain file
- Ioctls to enable/disable frequency tracking per inode

DIFFSTAT:

 fs/btrfs/Makefile       |    5 +-
 fs/btrfs/ctree.h        |   42 +++
 fs/btrfs/debugfs.c      |  500 +++++++++++++++++++++++++++++++++++
 fs/btrfs/debugfs.h      |   57 ++++
 fs/btrfs/disk-io.c      |   29 ++
 fs/btrfs/extent_io.c    |   18 ++
 fs/btrfs/hotdata_hash.c |  111 ++++++++
 fs/btrfs/hotdata_hash.h |   89 +++++++
 fs/btrfs/hotdata_map.c  |  660 +++++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/hotdata_map.h  |  118 +++++++++
 fs/btrfs/inode.c        |   29 ++-
 fs/btrfs/ioctl.c        |  146 +++++++++++-
 fs/btrfs/ioctl.h        |   21 ++
 fs/btrfs/super.c        |   48 ++++-
 14 files changed, 1867 insertions(+), 6 deletions(-)

IMPLEMENTATION (in a nutshell):

Hooks have been added to various functions (btrfs_writepage(s), btrfs_readpages, btrfs_direct_IO, and extent_write_cache_pages) in order to track data access patterns. Each of these hooks calls a new function, btrfs_update_freqs, that records each access to an inode, possibly including some sub-file-level information as well. A data structure containing various frequency metrics gets updated with the latest access information.

From there, a hash list takes over the job of figuring out a total "temperature" value for the data and indexing that temperature for fast lookup in the future.
The function that does the temperature distillation is rather sensitive and can be tuned/tweaked by altering various #defined values in hotdata_hash.h.

Aside from the core functionality, there is a debugfs interface to spit out some of the data that is collected, and ioctls are also introduced to manipulate the new functionality on a per-inode basis.

Signed-off-by: Ben Chociej <bcchocie@us.ibm.com>
Signed-off-by: Matt Lupfer <mrlupfer@us.ibm.com>
Signed-off-by: Conor Scott <crscott@us.ibm.com>
Reviewed-by: Mingming Cao <cmm@us.ibm.com>
Reviewed-by: Steve French <sfrench@us.ibm.com>
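To illustrate the temperature-indexing scheme described above, here is a hypothetical user-space sketch (names are illustrative, not the kernel structs): one list head per temperature bucket, with an item re-linked only when a recomputed temperature lands it in a different bucket. The patchset itself uses per-bucket hlist heads protected by rwlocks.

```c
#include <assert.h>
#include <stddef.h>

#define NBUCKETS 256	/* HEAT_HASH_SIZE in the patchset */

/* Hypothetical user-space model of a heat hash list entry */
struct item {
	struct item *prev, *next;
	int bucket;	/* current temperature bucket, -1 if unlisted */
};

static struct item *buckets[NBUCKETS];

static void bucket_del(struct item *it)
{
	if (it->bucket < 0)
		return;
	if (it->prev)
		it->prev->next = it->next;
	else
		buckets[it->bucket] = it->next;
	if (it->next)
		it->next->prev = it->prev;
	it->prev = it->next = NULL;
	it->bucket = -1;
}

/* Move an item to the list matching its newly computed temperature */
static void bucket_update(struct item *it, int temp)
{
	if (it->bucket == temp)
		return;		/* already indexed correctly, nothing to do */
	bucket_del(it);
	it->next = buckets[temp];	/* push onto the new bucket's list */
	if (it->next)
		it->next->prev = it;
	it->prev = NULL;
	buckets[temp] = it;
	it->bucket = temp;
}
```

In the kernel code the analogous movement happens when the heat index is updated, with each bucket's list protected by its own rwlock rather than this single-threaded model.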
bchociej@gmail.com
2010-Jul-27 22:00 UTC
[RFC PATCH 1/5] Btrfs: Add experimental hot data hash list index
From: Ben Chociej <bcchocie@us.ibm.com>

Adds a hash table structure to efficiently look up the data temperature of a file. Also adds a function to calculate that temperature based on some metrics kept in custom frequency data structs.

Signed-off-by: Ben Chociej <bcchocie@us.ibm.com>
Signed-off-by: Matt Lupfer <mrlupfer@us.ibm.com>
Signed-off-by: Conor Scott <crscott@us.ibm.com>
Reviewed-by: Mingming Cao <cmm@us.ibm.com>
Reviewed-by: Steve French <sfrench@us.ibm.com>
---
 fs/btrfs/hotdata_hash.c | 111 +++++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/hotdata_hash.h |  89 +++++++++++++++++++++++++++++++++++++
 2 files changed, 200 insertions(+), 0 deletions(-)
 create mode 100644 fs/btrfs/hotdata_hash.c
 create mode 100644 fs/btrfs/hotdata_hash.h

diff --git a/fs/btrfs/hotdata_hash.c b/fs/btrfs/hotdata_hash.c
new file mode 100644
index 0000000..a0de853
--- /dev/null
+++ b/fs/btrfs/hotdata_hash.c
@@ -0,0 +1,111 @@
+#include <linux/list.h>
+#include <linux/module.h>
+#include <linux/spinlock.h>
+#include <linux/hash.h>
+#include "hotdata_map.h"
+#include "hotdata_hash.h"
+#include "async-thread.h"
+#include "ctree.h"
+
+/* set thread to update temperatures every 5 minutes */
+#define HEAT_UPDATE_DELAY (HZ * 60 * 5)
+
+struct heat_hashlist_node *alloc_heat_hashlist_node(gfp_t mask)
+{
+	struct heat_hashlist_node *node;
+
+	node = kmalloc(sizeof(struct heat_hashlist_node), mask);
+	if (!node || IS_ERR(node))
+		return node;
+	INIT_HLIST_NODE(&node->hashnode);
+	node->freq_data = NULL;
+	node->hlist = NULL;
+
+	return node;
+}
+
+void free_heat_hashlists(struct btrfs_root *root)
+{
+	int i;
+
+	/* Free node/range heat hash lists */
+	for (i = 0; i < HEAT_HASH_SIZE; i++) {
+		struct hlist_node *pos = NULL, *pos2 = NULL;
+		struct heat_hashlist_node *heatnode = NULL;
+
+		hlist_for_each_safe(pos, pos2,
+				    &root->heat_inode_hl[i].hashhead) {
+			heatnode = hlist_entry(pos, struct heat_hashlist_node,
+					       hashnode);
+			hlist_del(pos);
+			kfree(heatnode);
+		}
+
+		hlist_for_each_safe(pos, pos2,
+				    &root->heat_range_hl[i].hashhead) {
+			heatnode = hlist_entry(pos, struct heat_hashlist_node,
+					       hashnode);
+			hlist_del(pos);
+			kfree(heatnode);
+		}
+	}
+}
+
+/*
+ * Function that converts btrfs_freq_data structs to integer temperature
+ * values, determined by some constants in .h.
+ *
+ * This is not very calibrated, though we've gotten it in the ballpark.
+ */
+int btrfs_get_temp(struct btrfs_freq_data *fdata)
+{
+	u32 result = 0;
+
+	struct timespec ckt = current_kernel_time();
+	u64 cur_time = timespec_to_ns(&ckt);
+
+	u32 nrr_heat = fdata->nr_reads << NRR_MULTIPLIER_POWER;
+	u32 nrw_heat = fdata->nr_writes << NRW_MULTIPLIER_POWER;
+
+	u64 ltr_heat = (cur_time - timespec_to_ns(&fdata->last_read_time))
+			>> LTR_DIVIDER_POWER;
+	u64 ltw_heat = (cur_time - timespec_to_ns(&fdata->last_write_time))
+			>> LTW_DIVIDER_POWER;
+
+	u64 avr_heat = (((u64) -1) - fdata->avg_delta_reads)
+			>> AVR_DIVIDER_POWER;
+	u64 avw_heat = (((u64) -1) - fdata->avg_delta_writes)
+			>> AVW_DIVIDER_POWER;
+
+	if (ltr_heat >= ((u64) 1 << 32))
+		ltr_heat = 0;
+	else
+		ltr_heat = ((u64) 1 << 32) - ltr_heat;
+	/* ltr_heat is now guaranteed to be u32 safe */
+
+	if (ltw_heat >= ((u64) 1 << 32))
+		ltw_heat = 0;
+	else
+		ltw_heat = ((u64) 1 << 32) - ltw_heat;
+	/* ltw_heat is now guaranteed to be u32 safe */
+
+	if (avr_heat >= ((u64) 1 << 32))
+		avr_heat = (u32) -1;
+	/* avr_heat is now guaranteed to be u32 safe */
+
+	if (avw_heat >= ((u64) 1 << 32))
+		avw_heat = (u32) -1;
+	/* avw_heat is now guaranteed to be u32 safe */
+
+	nrr_heat = nrr_heat >> (3 - NRR_COEFF_POWER);
+	nrw_heat = nrw_heat >> (3 - NRW_COEFF_POWER);
+	ltr_heat = ltr_heat >> (3 - LTR_COEFF_POWER);
+	ltw_heat = ltw_heat >> (3 - LTW_COEFF_POWER);
+	avr_heat = avr_heat >> (3 - AVR_COEFF_POWER);
+	avw_heat = avw_heat >> (3 - AVW_COEFF_POWER);
+
+	result = nrr_heat + nrw_heat + (u32) ltr_heat +
+		 (u32) ltw_heat + (u32) avr_heat + (u32) avw_heat;
+
+	return result >> (32 - HEAT_HASH_BITS);
+}
diff --git a/fs/btrfs/hotdata_hash.h b/fs/btrfs/hotdata_hash.h
new file mode 100644
index 0000000..46bf61e
--- /dev/null
+++ b/fs/btrfs/hotdata_hash.h
@@ -0,0 +1,89 @@
+#ifndef __HOTDATAHASH__
+#define __HOTDATAHASH__
+
+#include <linux/list.h>
+#include <linux/hash.h>
+
+#define HEAT_HASH_BITS 8
+#define HEAT_HASH_SIZE (1 << HEAT_HASH_BITS)
+#define HEAT_HASH_MASK (HEAT_HASH_SIZE - 1)
+#define HEAT_MIN_VALUE 0
+#define HEAT_MAX_VALUE (HEAT_HASH_SIZE - 1)
+#define HEAT_HOT_MIN (HEAT_HASH_SIZE - 50)
+
+/*
+ * The following comments explain what exactly comprises a unit of heat.
+ *
+ * Each of six values of heat are calculated and combined in order to form an
+ * overall temperature for the data:
+ *
+ * NRR - number of reads since mount
+ * NRW - number of writes since mount
+ * LTR - time elapsed since last read (ns)
+ * LTW - time elapsed since last write (ns)
+ * AVR - average delta between recent reads (ns)
+ * AVW - average delta between recent writes (ns)
+ *
+ * These values are divided (right-shifted) according to the *_DIVIDER_POWER
+ * values defined below to bring the numbers into a reasonable range. You can
+ * modify these values to fit your needs. However, each heat unit is a u32 and
+ * thus maxes out at 2^32 - 1. Therefore, you must choose your dividers quite
+ * carefully or else they could max out or be stuck at zero quite easily.
+ *
+ * (E.g., if you chose AVR_DIVIDER_POWER = 0, nothing less than 4s of atime
+ * delta would bring the temperature above zero, ever.)
+ *
+ * Finally, each value is added to the overall temperature between 0 and 8
+ * times, depending on its *_COEFF_POWER value. Note that the coefficients are
+ * also actually implemented with shifts, so take care to treat these values
+ * as powers of 2. (I.e., 0 means we'll add it to the temp once; 1 = 2x, etc.)
+ */
+
+#define NRR_MULTIPLIER_POWER 23
+#define NRR_COEFF_POWER 0
+#define NRW_MULTIPLIER_POWER 23
+#define NRW_COEFF_POWER 0
+#define LTR_DIVIDER_POWER 30
+#define LTR_COEFF_POWER 1
+#define LTW_DIVIDER_POWER 30
+#define LTW_COEFF_POWER 1
+#define AVR_DIVIDER_POWER 40
+#define AVR_COEFF_POWER 0
+#define AVW_DIVIDER_POWER 40
+#define AVW_COEFF_POWER 0
+
+/* TODO a kmem cache for entry structs */
+
+struct btrfs_root;
+
+/* Hash list heads for heat hash table */
+struct heat_hashlist_entry {
+	struct hlist_head hashhead;
+	rwlock_t rwlock;
+	u32 temperature;
+};
+
+/* Nodes stored in each hash list of hash table */
+struct heat_hashlist_node {
+	struct hlist_node hashnode;
+	struct btrfs_freq_data *freq_data;
+	struct heat_hashlist_entry *hlist;
+};
+
+struct heat_hashlist_node *alloc_heat_hashlist_node(gfp_t mask);
+void free_heat_hashlists(struct btrfs_root *root);
+
+/*
+ * Returns a value from 0 to HEAT_MAX_VALUE indicating the temperature of the
+ * file (and consequently its bucket number in hashlist)
+ */
+int btrfs_get_temp(struct btrfs_freq_data *fdata);
+
+/*
+ * recalculates temperatures for inode or range
+ * and moves around in heat hash table based on temp
+ */
+void btrfs_update_heat_index(struct btrfs_freq_data *fdata,
+			     struct btrfs_root *root);
+
+#endif /* __HOTDATAHASH__ */
--
1.7.1
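To make the tunables above concrete, here is a hypothetical user-space model of the btrfs_get_temp() calculation using the same constants. The struct and helper names are illustrative only, timestamps are plain nanosecond counters rather than timespecs, and u32 overflow in the final sum wraps exactly as it would in the kernel version.

```c
#include <assert.h>
#include <stdint.h>

/* Tunables copied from hotdata_hash.h */
#define HEAT_HASH_BITS		8
#define NRR_MULTIPLIER_POWER	23
#define NRR_COEFF_POWER		0
#define NRW_MULTIPLIER_POWER	23
#define NRW_COEFF_POWER		0
#define LTR_DIVIDER_POWER	30
#define LTR_COEFF_POWER		1
#define LTW_DIVIDER_POWER	30
#define LTW_COEFF_POWER		1
#define AVR_DIVIDER_POWER	40
#define AVR_COEFF_POWER		0
#define AVW_DIVIDER_POWER	40
#define AVW_COEFF_POWER		0

/* Illustrative stand-in for struct btrfs_freq_data */
struct freq_data {
	uint64_t last_read_ns;
	uint64_t last_write_ns;
	uint32_t nr_reads;
	uint32_t nr_writes;
	uint64_t avg_delta_reads;
	uint64_t avg_delta_writes;
};

/* Returns a bucket number 0..255, hotter data scoring higher */
static uint32_t get_temp(const struct freq_data *fd, uint64_t now_ns)
{
	uint32_t nrr = fd->nr_reads << NRR_MULTIPLIER_POWER;
	uint32_t nrw = fd->nr_writes << NRW_MULTIPLIER_POWER;
	uint64_t ltr = (now_ns - fd->last_read_ns) >> LTR_DIVIDER_POWER;
	uint64_t ltw = (now_ns - fd->last_write_ns) >> LTW_DIVIDER_POWER;
	uint64_t avr = (UINT64_MAX - fd->avg_delta_reads) >> AVR_DIVIDER_POWER;
	uint64_t avw = (UINT64_MAX - fd->avg_delta_writes) >> AVW_DIVIDER_POWER;

	/* recency terms: the longer since the last access, the colder */
	ltr = (ltr >= (1ULL << 32)) ? 0 : (1ULL << 32) - ltr;
	ltw = (ltw >= (1ULL << 32)) ? 0 : (1ULL << 32) - ltw;

	/* clamp the inter-access-delta terms into u32 range */
	if (avr >= (1ULL << 32))
		avr = UINT32_MAX;
	if (avw >= (1ULL << 32))
		avw = UINT32_MAX;

	/* weight each term by its coefficient power, then sum */
	uint32_t sum = (nrr >> (3 - NRR_COEFF_POWER)) +
		       (nrw >> (3 - NRW_COEFF_POWER)) +
		       (uint32_t)(ltr >> (3 - LTR_COEFF_POWER)) +
		       (uint32_t)(ltw >> (3 - LTW_COEFF_POWER)) +
		       (uint32_t)(avr >> (3 - AVR_COEFF_POWER)) +
		       (uint32_t)(avw >> (3 - AVW_COEFF_POWER));

	return sum >> (32 - HEAT_HASH_BITS);	/* bucket 0..255 */
}
```

In this model, recency dominates the score (the two LTR/LTW terms each contribute up to 2^30), while access counts and inter-access deltas nudge the bucket up or down; a file accessed hundreds of times in the last millisecond lands tens of buckets above one touched once, many minutes ago.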
bchociej@gmail.com
2010-Jul-27 22:00 UTC
[RFC PATCH 2/5] Btrfs: Add data structures for hot data tracking
From: Ben Chociej <bcchocie@us.ibm.com>

Adds hot_inode_tree and hot_range_tree structs to keep track of frequently accessed files and ranges within files. Trees contain hot_{inode,range}_items representing those files and ranges, each of which contains a btrfs_freq_data struct with its frequency-of-access metrics (number of {reads,writes}, last {read,write} time, and frequency of {reads,writes}).

Having these trees means that Btrfs can quickly determine the temperature of some data by doing some calculations on the btrfs_freq_data struct that hangs off of the tree item.

Also, since it isn't entirely obvious, the "frequency" of reads or writes is determined by taking a kind of generalized average of the last few (2^N for some tunable N) reads or writes.

Signed-off-by: Ben Chociej <bcchocie@us.ibm.com>
Signed-off-by: Matt Lupfer <mrlupfer@us.ibm.com>
Signed-off-by: Conor Scott <crscott@us.ibm.com>
Reviewed-by: Mingming Cao <cmm@us.ibm.com>
Reviewed-by: Steve French <sfrench@us.ibm.com>
---
 fs/btrfs/hotdata_map.c | 660 ++++++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/hotdata_map.h | 118 +++++++++
 2 files changed, 778 insertions(+), 0 deletions(-)
 create mode 100644 fs/btrfs/hotdata_map.c
 create mode 100644 fs/btrfs/hotdata_map.h

diff --git a/fs/btrfs/hotdata_map.c b/fs/btrfs/hotdata_map.c
new file mode 100644
index 0000000..77a560e
--- /dev/null
+++ b/fs/btrfs/hotdata_map.c
@@ -0,0 +1,660 @@
+#include <linux/err.h>
+#include <linux/slab.h>
+#include <linux/module.h>
+#include <linux/spinlock.h>
+#include <linux/hardirq.h>
+#include "ctree.h"
+#include "hotdata_map.h"
+#include "hotdata_hash.h"
+#include "btrfs_inode.h"
+
+/* kmem_cache pointers for slab caches */
+static struct kmem_cache *hot_inode_item_cache;
+static struct kmem_cache *hot_range_item_cache;
+
+struct hot_inode_item *btrfs_update_inode_freq(struct btrfs_inode *inode,
+					       int create);
+struct hot_range_item *btrfs_update_range_freq(struct hot_inode_item *he,
+					       u64 off, u64 len, int create,
+					       struct btrfs_root *root);
+
+/* init hot_inode_item kmem cache */
+int __init hot_inode_item_init(void)
+{
+	hot_inode_item_cache = kmem_cache_create("hot_inode_item",
+			sizeof(struct hot_inode_item), 0,
+			SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD, NULL);
+	if (!hot_inode_item_cache)
+		return -ENOMEM;
+	return 0;
+}
+
+/* init hot_range_item kmem cache */
+int __init hot_range_item_init(void)
+{
+	hot_range_item_cache = kmem_cache_create("hot_range_item",
+			sizeof(struct hot_range_item), 0,
+			SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD, NULL);
+	if (!hot_range_item_cache)
+		return -ENOMEM;
+	return 0;
+}
+
+void hot_inode_item_exit(void)
+{
+	if (hot_inode_item_cache)
+		kmem_cache_destroy(hot_inode_item_cache);
+}
+
+void hot_range_item_exit(void)
+{
+	if (hot_range_item_cache)
+		kmem_cache_destroy(hot_range_item_cache);
+}
+
+/* Initialize the inode tree */
+void hot_inode_tree_init(struct hot_inode_tree *tree)
+{
+	tree->map = RB_ROOT;
+	rwlock_init(&tree->lock);
+}
+
+/* Initialize the hot range tree */
+void hot_range_tree_init(struct hot_range_tree *tree)
+{
+	tree->map = RB_ROOT;
+	rwlock_init(&tree->lock);
+}
+
+/* Allocate a new hot_inode_item structure. The new structure is
+ * returned with a reference count of one and needs to be
+ * freed using free_hot_inode_item() */
+struct hot_inode_item *alloc_hot_inode_item(unsigned long ino)
+{
+	struct hot_inode_item *he;
+
+	he = kmem_cache_alloc(hot_inode_item_cache, GFP_KERNEL | GFP_NOFS);
+	if (!he || IS_ERR(he))
+		return he;
+
+	atomic_set(&he->refs, 1);
+	he->in_tree = 0;
+	he->i_ino = ino;
+	he->heat_node = alloc_heat_hashlist_node(GFP_KERNEL | GFP_NOFS);
+	he->freq_data.avg_delta_reads = (u64) -1;
+	he->freq_data.avg_delta_writes = (u64) -1;
+	he->freq_data.nr_reads = 0;
+	he->freq_data.nr_writes = 0;
+	he->freq_data.flags = FREQ_DATA_TYPE_INODE;
+	hot_range_tree_init(&he->hot_range_tree);
+
+	spin_lock_init(&he->lock);
+
+	return he;
+}
+
+/* Allocate a new hot_range_item structure. The new structure is
+ * returned with a reference count of one and needs to be
+ * freed using free_hot_range_item() */
+struct hot_range_item *alloc_hot_range_item(u64 start, u64 len)
+{
+	struct hot_range_item *hr;
+
+	hr = kmem_cache_alloc(hot_range_item_cache, GFP_KERNEL | GFP_NOFS);
+	if (!hr || IS_ERR(hr))
+		return hr;
+
+	atomic_set(&hr->refs, 1);
+	hr->in_tree = 0;
+	hr->start = start & RANGE_SIZE_MASK;
+	hr->len = len;
+	hr->heat_node = alloc_heat_hashlist_node(GFP_KERNEL | GFP_NOFS);
+	hr->heat_node->freq_data = &hr->freq_data;
+	hr->freq_data.avg_delta_reads = (u64) -1;
+	hr->freq_data.avg_delta_writes = (u64) -1;
+	hr->freq_data.nr_reads = 0;
+	hr->freq_data.nr_writes = 0;
+	hr->freq_data.flags = FREQ_DATA_TYPE_RANGE;
+
+	spin_lock_init(&hr->lock);
+
+	return hr;
+}
+
+/* Drops one reference on the hot_inode_item and frees the structure
+ * if the reference count hits zero */
+void free_hot_inode_item(struct hot_inode_item *he)
+{
+	if (!he)
+		return;
+	if (atomic_dec_and_test(&he->refs)) {
+		WARN_ON(he->in_tree);
+		kmem_cache_free(hot_inode_item_cache, he);
+	}
+}
+
+/* Drops one reference on the hot_range_item and frees the structure
+ * if the reference count hits zero */
+void free_hot_range_item(struct hot_range_item *hr)
+{
+	if (!hr)
+		return;
+	if (atomic_dec_and_test(&hr->refs)) {
+		WARN_ON(hr->in_tree);
+		kmem_cache_free(hot_range_item_cache, hr);
+	}
+}
+
+/* Frees the entire hot_inode_tree. Called by free_fs_root */
+void free_hot_inode_tree(struct btrfs_root *root)
+{
+	struct rb_node *node, *node2;
+	struct hot_inode_item *he;
+	struct hot_range_item *hr;
+
+	/* Free hot inode and range trees on fs root */
+	node = rb_first(&root->hot_inode_tree.map);
+
+	while (node) {
+		he = rb_entry(node, struct hot_inode_item,
+			      rb_node);
+
+		node2 = rb_first(&he->hot_range_tree.map);
+
+		while (node2) {
+			hr = rb_entry(node2, struct hot_range_item,
+				      rb_node);
+			remove_hot_range_item(&he->hot_range_tree, hr);
+			free_hot_range_item(hr);
+			node2 = rb_first(&he->hot_range_tree.map);
+		}
+
+		remove_hot_inode_item(&root->hot_inode_tree, he);
+		free_hot_inode_item(he);
+		node = rb_first(&root->hot_inode_tree.map);
+	}
+}
+
+static struct rb_node *tree_insert_inode_item(struct rb_root *root,
+					      unsigned long inode_num,
+					      struct rb_node *node)
+{
+	struct rb_node **p = &root->rb_node;
+	struct rb_node *parent = NULL;
+	struct hot_inode_item *entry;
+
+	/* walk tree to find insertion point */
+	while (*p) {
+		parent = *p;
+		entry = rb_entry(parent, struct hot_inode_item, rb_node);
+
+		if (inode_num < entry->i_ino)
+			p = &(*p)->rb_left;
+		else if (inode_num > entry->i_ino)
+			p = &(*p)->rb_right;
+		else
+			return parent;
+	}
+
+	entry = rb_entry(node, struct hot_inode_item, rb_node);
+	entry->in_tree = 1;
+	rb_link_node(node, parent, p);
+	rb_insert_color(node, root);
+	return NULL;
+}
+
+static u64 range_map_end(struct hot_range_item *hr)
+{
+	if (hr->start + hr->len < hr->start)
+		return (u64)-1;
+	return hr->start + hr->len;
+}
+
+static struct rb_node *tree_insert_range_item(struct rb_root *root,
+					      u64 start,
+					      struct rb_node *node)
+{
+	struct rb_node **p = &root->rb_node;
+	struct rb_node *parent = NULL;
+	struct hot_range_item *entry;
+
+	/* walk tree to find insertion point */
+	while (*p) {
+		parent = *p;
+		entry = rb_entry(parent, struct hot_range_item, rb_node);
+
+		if (start < entry->start)
+			p = &(*p)->rb_left;
+		else if (start >= range_map_end(entry))
+			p = &(*p)->rb_right;
+		else
+			return parent;
+	}
+
+	entry = rb_entry(node, struct hot_range_item, rb_node);
+	entry->in_tree = 1;
+	rb_link_node(node, parent, p);
+	rb_insert_color(node, root);
+	return NULL;
+}
+
+/* Add a hot_inode_item to a hot_inode_tree. If the tree already contains
+ * an item with the index given, return -EEXIST */
+int add_hot_inode_item(struct hot_inode_tree *tree,
+		       struct hot_inode_item *he)
+{
+	int ret = 0;
+	struct rb_node *rb;
+	struct hot_inode_item *exist;
+
+	exist = lookup_hot_inode_item(tree, he->i_ino);
+	if (exist) {
+		free_hot_inode_item(exist);
+		ret = -EEXIST;
+		goto out;
+	}
+	rb = tree_insert_inode_item(&tree->map, he->i_ino, &he->rb_node);
+	if (rb) {
+		ret = -EEXIST;
+		goto out;
+	}
+	atomic_inc(&he->refs);
+out:
+	return ret;
+}
+
+/* Add a hot_range_item to a hot_range_tree. If the tree already contains
+ * an item with the index given, return -EEXIST
+ * Also optionally aggressively merge ranges (currently disabled) */
+int add_hot_range_item(struct hot_range_tree *tree,
+		       struct hot_range_item *hr)
+{
+	int ret = 0;
+	struct rb_node *rb;
+	struct hot_range_item *exist;
+	/* struct hot_range_item *merge = NULL; */
+
+	exist = lookup_hot_range_item(tree, hr->start);
+	if (exist) {
+		free_hot_range_item(exist);
+		ret = -EEXIST;
+		goto out;
+	}
+	rb = tree_insert_range_item(&tree->map, hr->start, &hr->rb_node);
+	if (rb) {
+		ret = -EEXIST;
+		goto out;
+	}
+
+	atomic_inc(&hr->refs);
+
+out:
+	return ret;
+}
+
+/* Lookup a hot_inode_item in the hot_inode_tree with the given index
+ * (inode_num) */
+struct hot_inode_item *lookup_hot_inode_item(struct hot_inode_tree *tree,
+					     unsigned long inode_num)
+{
+	struct rb_node **p = &(tree->map.rb_node);
+	struct rb_node *parent = NULL;
+	struct hot_inode_item *entry;
+
+	while (*p) {
+		parent = *p;
+		entry = rb_entry(parent, struct hot_inode_item, rb_node);
+
+		if (inode_num < entry->i_ino)
+			p = &(*p)->rb_left;
+		else if (inode_num > entry->i_ino)
+			p = &(*p)->rb_right;
+		else {
+			atomic_inc(&entry->refs);
+			return entry;
+		}
+	}
+
+	return NULL;
+}
+
+/* Lookup a hot_range_item in a hot_range_tree with the given index
+ * (start, offset) */
+struct hot_range_item *lookup_hot_range_item(struct hot_range_tree *tree,
+					     u64 start)
+{
+	struct rb_node **p = &(tree->map.rb_node);
+	struct rb_node *parent = NULL;
+	struct hot_range_item *entry;
+
+	/* ensure start is on a range boundary */
+	start = start & RANGE_SIZE_MASK;
+
+	while (*p) {
+		parent = *p;
+		entry = rb_entry(parent, struct hot_range_item, rb_node);
+
+		if (start < entry->start)
+			p = &(*p)->rb_left;
+		else if (start >= range_map_end(entry))
+			p = &(*p)->rb_right;
+		else {
+			atomic_inc(&entry->refs);
+			return entry;
+		}
+	}
+	return NULL;
+}
+
+int remove_hot_inode_item(struct hot_inode_tree *tree,
+			  struct hot_inode_item *he)
+{
+	int ret = 0;
+	rb_erase(&he->rb_node, &tree->map);
+	he->in_tree = 0;
+	return ret;
+}
+
+int remove_hot_range_item(struct hot_range_tree *tree,
+			  struct hot_range_item *hr)
+{
+	int ret = 0;
+	rb_erase(&hr->rb_node, &tree->map);
+	hr->in_tree = 0;
+	return ret;
+}
+
+/* main function to update access frequency from read/writepage(s) hooks */
+inline void btrfs_update_freqs(struct inode *inode, u64 start,
+			       u64 len, int create)
+{
+	struct hot_inode_item *he;
+	struct hot_range_item *hr;
+	struct btrfs_inode *btrfs_inode = BTRFS_I(inode);
+
+	he = btrfs_update_inode_freq(btrfs_inode, create);
+
+	WARN_ON(!he || IS_ERR(he));
+
+	if (he && !IS_ERR(he)) {
+		hr = btrfs_update_range_freq(he, start, len,
+					     create, btrfs_inode->root);
+		WARN_ON(!hr || IS_ERR(hr));
+
+		/*
+		 * drop refcounts on inode/range items:
+		 */
+		free_hot_inode_item(he);
+
+		if (hr && !IS_ERR(hr))
+			free_hot_range_item(hr);
+	}
+}
+
+/* Update inode frequency struct */
+struct hot_inode_item *btrfs_update_inode_freq(struct btrfs_inode *inode,
+					       int create)
+{
+	struct hot_inode_tree *hitree = &inode->root->hot_inode_tree;
+	struct hot_inode_item *he;
+	struct btrfs_root *root = inode->root;
+
+	read_lock(&hitree->lock);
+	he = lookup_hot_inode_item(hitree, inode->vfs_inode.i_ino);
+	read_unlock(&hitree->lock);
+
+	if (!he) {
+		he = alloc_hot_inode_item(inode->vfs_inode.i_ino);
+
+		if (!he || IS_ERR(he))
+			goto out;
+
+		write_lock(&hitree->lock);
+		add_hot_inode_item(hitree, he);
+		write_unlock(&hitree->lock);
+	}
+
+	spin_lock(&he->lock);
+	btrfs_update_freq(&he->freq_data, create);
+	/*
+	 * printk(KERN_DEBUG "btrfs_update_inode_freq avd_r: %llu,"
+	 *	  " avd_w: %llu\n",
+	 *	  he->freq_data.avg_delta_reads,
+	 *	  he->freq_data.avg_delta_writes);
+	 */
+	spin_unlock(&he->lock);
+
+	/* will get its own lock(s) */
+	btrfs_update_heat_index(&he->freq_data, root);
+
+out:
+	return he;
+}
+
+/* Update range frequency struct */
+struct hot_range_item *btrfs_update_range_freq(struct hot_inode_item *he,
+					       u64 off, u64 len, int create,
+					       struct btrfs_root *root)
+{
+	struct hot_range_tree *hrtree = &he->hot_range_tree;
+	struct hot_range_item *hr = NULL;
+	u64 start_off = off & RANGE_SIZE_MASK;
+	u64 end_off = (off + len - 1) & RANGE_SIZE_MASK;
+	u64 cur;
+
+	/*
+	 * Align ranges on RANGE_SIZE boundary to prevent proliferation
+	 * of range structs
+	 */
+	for (cur = start_off; cur <= end_off; cur += RANGE_SIZE) {
+		read_lock(&hrtree->lock);
+		hr = lookup_hot_range_item(hrtree, cur);
+		read_unlock(&hrtree->lock);
+
+		if (!hr) {
+			hr = alloc_hot_range_item(cur, RANGE_SIZE);
+
+			if (!hr || IS_ERR(hr))
+				goto out;
+
+			write_lock(&hrtree->lock);
+			add_hot_range_item(hrtree, hr);
+			write_unlock(&hrtree->lock);
+		}
+
+		spin_lock(&hr->lock);
+		btrfs_update_freq(&hr->freq_data, create);
+		/*
+		 * printk(KERN_DEBUG "btrfs_update_range_freq avd_r: %llu,"
+		 *	  " avd_w: %llu\n",
+		 *	  he->freq_data.avg_delta_reads,
+		 *	  he->freq_data.avg_delta_writes);
+		 */
+		spin_unlock(&hr->lock);
+
+		/* will get its own locks */
+		btrfs_update_heat_index(&hr->freq_data, root);
+	}
+out:
+	return hr;
+}
+
+/*
+ * This function does the actual work of updating the frequency numbers,
+ * whatever they turn out to be. BTRFS_FREQ_POWER determines how many atime
+ * deltas we keep track of (as a power of 2). So, setting it to anything above
+ * 16ish is probably overkill. Also, the higher the power, the more bits get
+ * right shifted out of the timestamp, reducing precision, so take note of that
+ * as well.
+ *
+ * The caller (which is probably btrfs_update_freq) should have already locked
+ * fdata's parent's spinlock.
+ */
+#define BTRFS_FREQ_POWER 4
+void btrfs_update_freq(struct btrfs_freq_data *fdata, int create)
+{
+	struct timespec old_atime;
+	struct timespec current_time;
+	struct timespec delta_ts;
+	u64 new_avg;
+	u64 new_delta;
+
+	if (unlikely(create)) {
+		old_atime = fdata->last_write_time;
+		fdata->nr_writes += 1;
+		new_avg = fdata->avg_delta_writes;
+	} else {
+		old_atime = fdata->last_read_time;
+		fdata->nr_reads += 1;
+		new_avg = fdata->avg_delta_reads;
+	}
+
+	current_time = current_kernel_time();
+	delta_ts = timespec_sub(current_time, old_atime);
+	new_delta = timespec_to_ns(&delta_ts) >> BTRFS_FREQ_POWER;
+
+	new_avg = (new_avg << BTRFS_FREQ_POWER) - new_avg + new_delta;
+	new_avg = new_avg >> BTRFS_FREQ_POWER;
+
+	if (unlikely(create)) {
+		fdata->last_write_time = current_time;
+		fdata->avg_delta_writes = new_avg;
+	} else {
+		fdata->last_read_time = current_time;
+		fdata->avg_delta_reads = new_avg;
+	}
+}
+
+/*
+ * Get a new temperature and, if necessary, move the heat_node corresponding
+ * to this inode or range to the proper hashlist with the new temperature
+ */
+void btrfs_update_heat_index(struct btrfs_freq_data *fdata,
+			     struct btrfs_root *root)
+{
+	int temp = 0;
+	int moved = 0;
+	struct heat_hashlist_entry *buckets, *current_bucket = NULL;
+	struct hot_inode_item *he;
+	struct hot_range_item *hr;
+
+	if (fdata->flags & FREQ_DATA_TYPE_INODE) {
+		he = freq_data_get_he(fdata);
+		buckets = root->heat_inode_hl;
+
+		spin_lock(&he->lock);
+		temp = btrfs_get_temp(fdata);
+		spin_unlock(&he->lock);
+
+		if (he == NULL)
+			return;
+
+		if (he->heat_node->hlist == NULL) {
+			current_bucket = buckets + temp;
+			moved = 1;
+		} else {
+			current_bucket = he->heat_node->hlist;
+			if (current_bucket->temperature != temp) {
+				write_lock(&current_bucket->rwlock);
+				hlist_del(&he->heat_node->hashnode);
+				write_unlock(&current_bucket->rwlock);
+				current_bucket = buckets + temp;
+				moved = 1;
+			}
+		}
+
+		if (moved) {
+			write_lock(&current_bucket->rwlock);
+			hlist_add_head(&he->heat_node->hashnode,
+				       &current_bucket->hashhead);
+			he->heat_node->hlist = current_bucket;
+			write_unlock(&current_bucket->rwlock);
+		}
+
+	} else if (fdata->flags & FREQ_DATA_TYPE_RANGE) {
+		hr = freq_data_get_hr(fdata);
+		buckets = root->heat_range_hl;
+
+		spin_lock(&hr->lock);
+		temp = btrfs_get_temp(fdata);
+		spin_unlock(&hr->lock);
+
+		if (hr == NULL)
+			return;
+
+		if (hr->heat_node->hlist == NULL) {
+			current_bucket = buckets + temp;
+			moved = 1;
+		} else {
+			current_bucket = hr->heat_node->hlist;
+			if (current_bucket->temperature != temp) {
+				write_lock(&current_bucket->rwlock);
+				hlist_del(&hr->heat_node->hashnode);
+				write_unlock(&current_bucket->rwlock);
+				current_bucket = buckets + temp;
+				moved = 1;
+			}
+		}
+
+		if (moved) {
+			write_lock(&current_bucket->rwlock);
+			hlist_add_head(&hr->heat_node->hashnode,
+				       &current_bucket->hashhead);
+			hr->heat_node->hlist = current_bucket;
+			write_unlock(&current_bucket->rwlock);
+		}
+	}
+}
+
+/* Walk the hot_inode_tree, locking as necessary */
+struct hot_inode_item *find_next_hot_inode(struct btrfs_root *root,
+					   u64 objectid)
+{
+	struct rb_node *node;
+	struct rb_node *prev;
+	struct hot_inode_item *entry;
+
+	read_lock(&root->hot_inode_tree.lock);
+
+	node = root->hot_inode_tree.map.rb_node;
+	prev = NULL;
+	while (node) {
+		prev = node;
+		entry = rb_entry(node, struct hot_inode_item, rb_node);
+
+		if (objectid < entry->i_ino)
+			node = node->rb_left;
+		else if (objectid > entry->i_ino)
+			node = node->rb_right;
+		else
+			break;
+	}
+	if (!node) {
+		while (prev) {
+			entry = rb_entry(prev, struct hot_inode_item, rb_node);
+			if (objectid <= entry->i_ino) {
+				node = prev;
+				break;
+			}
+			prev = rb_next(prev);
+		}
+	}
+	if (node) {
+		entry = rb_entry(node, struct hot_inode_item, rb_node);
+		/* increase reference count to prevent pruning while
+		 * caller is using the hot_inode_item */
+		atomic_inc(&entry->refs);
+
+		read_unlock(&root->hot_inode_tree.lock);
+		return entry;
+	}
+
+	read_unlock(&root->hot_inode_tree.lock);
+	return NULL;
+}
diff --git a/fs/btrfs/hotdata_map.h b/fs/btrfs/hotdata_map.h
new file mode 100644
index 0000000..46ae1d6
--- /dev/null
+++ b/fs/btrfs/hotdata_map.h
@@ -0,0 +1,118 @@
+#ifndef __HOTDATAMAP__
+#define __HOTDATAMAP__
+
+#include <linux/rbtree.h>
+
+/* values for btrfs_freq_data flags */
+#define FREQ_DATA_TYPE_INODE 1		/* freq data struct is for an inode */
+#define FREQ_DATA_TYPE_RANGE (1 << 1)	/* freq data struct is for a range */
+#define FREQ_DATA_HEAT_HOT (1 << 2)	/* freq data struct is for hot data */
+					/* (not implemented) */
+#define RANGE_SIZE (1<<12)
+#define RANGE_SIZE_MASK (~((u64)(RANGE_SIZE - 1)))
+
+/* macros to wrap container_of()'s for hot data structs */
+#define freq_data_get_he(x) (struct hot_inode_item *) container_of(x, \
+		struct hot_inode_item, freq_data)
+#define freq_data_get_hr(x) (struct hot_range_item *) container_of(x, \
+		struct hot_range_item, freq_data)
+#define rb_node_get_he(x) (struct hot_inode_item *) container_of(x, \
+		struct hot_inode_item, rb_node)
+#define rb_node_get_hr(x) (struct hot_range_item *) container_of(x, \
+		struct hot_range_item, rb_node)
+
+/* A frequency data struct holds values that are used to
+ * determine temperature of files and file ranges. These structs
+ * are members of hot_inode_item and hot_range_item */
+struct btrfs_freq_data {
+	struct timespec last_read_time;
+	struct timespec last_write_time;
+	u32 nr_reads;
+	u32 nr_writes;
+	u64 avg_delta_reads;
+	u64 avg_delta_writes;
+	u8 flags;
+};
+
+/* A tree that sits on the fs_root */
+struct hot_inode_tree {
+	struct rb_root map;
+	rwlock_t lock;
+};
+
+/* A tree of ranges for each inode in the hot_inode_tree */
+struct hot_range_tree {
+	struct rb_root map;
+	rwlock_t lock;
+};
+
+/* An item representing an inode and its access frequency */
+struct hot_inode_item {
+	struct rb_node rb_node;		/* node for hot_inode_tree rb_tree */
+	unsigned long i_ino;		/* inode number, copied from vfs_inode */
+	struct hot_range_tree hot_range_tree;	/* tree of ranges in this
+						   inode */
+	struct btrfs_freq_data freq_data;	/* frequency data for this inode */
+	spinlock_t lock;		/* protects freq_data, i_ino, in_tree */
+	atomic_t refs;			/* prevents kfree */
+	u8 in_tree;			/* used to check for errors in ref counting */
+	struct heat_hashlist_node *heat_node;	/* hashlist node for this
+						   inode */
+};
+
+/* An item representing a range inside of an inode whose frequency
+ * is being tracked */
+struct hot_range_item {
+	struct rb_node rb_node;		/* node for hot_range_tree rb_tree */
+	u64 start;			/* starting offset of this range */
+	u64 len;			/* length of this range */
+	struct btrfs_freq_data freq_data;	/* frequency data for this range */
+	spinlock_t lock;		/* protects freq_data, start, len, in_tree */
+	atomic_t refs;			/* prevents kfree */
+	u8 in_tree;			/* used to check for errors in ref counting */
+	struct heat_hashlist_node *heat_node;	/* hashlist node for this
+						   range */
+};
+
+struct btrfs_root;
+struct inode;
+
+void hot_inode_tree_init(struct hot_inode_tree *tree);
+void hot_range_tree_init(struct hot_range_tree *tree);
+
+struct hot_range_item *lookup_hot_range_item(struct hot_range_tree *tree,
+					     u64 start);
+
+struct hot_inode_item *lookup_hot_inode_item(struct hot_inode_tree *tree,
+					     unsigned long inode_num);
+
+int add_hot_inode_item(struct hot_inode_tree *tree,
+		       struct hot_inode_item *he);
+int add_hot_range_item(struct hot_range_tree *tree,
+		       struct hot_range_item *hr);
+
+int remove_hot_inode_item(struct hot_inode_tree *tree,
+			  struct hot_inode_item *he);
+int remove_hot_range_item(struct hot_range_tree *tree,
+			  struct hot_range_item *hr);
+
+struct hot_inode_item *alloc_hot_inode_item(unsigned long ino);
+struct hot_range_item *alloc_hot_range_item(u64 start, u64 len);
+
+void free_hot_inode_item(struct hot_inode_item *he);
+void free_hot_range_item(struct hot_range_item *hr);
+void free_hot_inode_tree(struct btrfs_root *root);
+
+int __init hot_inode_item_init(void);
+int __init hot_range_item_init(void);
+
+void hot_inode_item_exit(void);
+void hot_range_item_exit(void);
+
+struct hot_inode_item *find_next_hot_inode(struct btrfs_root *root,
+					   u64 objectid);
+void btrfs_update_freq(struct btrfs_freq_data *fdata, int create);
+void btrfs_update_freqs(struct inode *inode, u64 start, u64 len,
+			int create);
+
+#endif
--
1.7.1
bchociej@gmail.com
2010-Jul-27 22:00 UTC
[RFC PATCH 3/5] Btrfs: 3 new ioctls related to hot data features
From: Ben Chociej <bcchocie@us.ibm.com> BTRFS_IOC_GET_HEAT_INFO: return a struct containing the various metrics collected in btrfs_freq_data structs, and also return a calculated data temperature based on those metrics. Optionally, retrieve the temperature from the hot data hash list instead of recalculating it. BTRFS_IOC_GET_HEAT_OPTS: return an integer representing the current state of hot data tracking and migration: 0 = do nothing 1 = track frequency of access 2 = migrate data to fast media based on temperature (not implemented) BTRFS_IOC_SET_HEAT_OPTS: change the state of hot data tracking and migration, as described above. Signed-off-by: Ben Chociej <bcchocie@us.ibm.com> Signed-off-by: Matt Lupfer <mrlupfer@us.ibm.com> Signed-off-by: Conor Scott <crscott@us.ibm.com> Reviewed-by: Mingming Cao <cmm@us.ibm.com> Reviewed-by: Steve French <sfrench@us.ibm.com> --- fs/btrfs/ioctl.c | 146 +++++++++++++++++++++++++++++++++++++++++++++++++++++- fs/btrfs/ioctl.h | 21 ++++++++ 2 files changed, 166 insertions(+), 1 deletions(-) diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index 4dbaf89..be7aba2 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -49,6 +49,8 @@ #include "print-tree.h" #include "volumes.h" #include "locking.h" +#include "hotdata_map.h" +#include "hotdata_hash.h" /* Mask out flags that are inappropriate for the given type of inode. */ static inline __u32 btrfs_mask_flags(umode_t mode, __u32 flags) @@ -1869,7 +1871,7 @@ static long btrfs_ioctl_default_subvol(struct file *file, void __user *argp) return 0; } -long btrfs_ioctl_space_info(struct btrfs_root *root, void __user *arg) +static long btrfs_ioctl_space_info(struct btrfs_root *root, void __user *arg) { struct btrfs_ioctl_space_args space_args; struct btrfs_ioctl_space_info space; @@ -1974,6 +1976,142 @@ long btrfs_ioctl_trans_end(struct file *file) return 0; } +/* + * Retrieve information about access frequency for the given file. 
Return it in + a userspace-friendly struct for btrfsctl (or another tool) to parse. + * + * The temperature that is returned can be "live" -- that is, recalculated when + * the ioctl is called -- or it can be returned from the hashtable, reflecting + * the (possibly old) value that the system will use when considering files + * for migration. This behavior is determined by heat_info->live. + */ +static long btrfs_ioctl_heat_info(struct file *file, void __user *argp) +{ + struct inode *mnt_inode = fdentry(file)->d_inode; + struct inode *file_inode; + struct file *file_filp; + struct btrfs_root *root = BTRFS_I(mnt_inode)->root; + struct btrfs_ioctl_heat_info *heat_info; + struct hot_inode_tree *hitree; + struct hot_inode_item *he; + int ret; + + heat_info = kmalloc(sizeof(struct btrfs_ioctl_heat_info), + GFP_KERNEL | GFP_NOFS); + if (!heat_info) + return -ENOMEM; + + if (copy_from_user((void *) heat_info, + argp, + sizeof(struct btrfs_ioctl_heat_info)) != 0) { + ret = -EFAULT; + goto err; + } + + file_filp = filp_open(heat_info->filename, O_RDONLY, 0); + + if (IS_ERR(file_filp)) { + ret = PTR_ERR(file_filp); + goto err; + } + + file_inode = file_filp->f_dentry->d_inode; + + hitree = &root->hot_inode_tree; + read_lock(&hitree->lock); + he = lookup_hot_inode_item(hitree, file_inode->i_ino); + read_unlock(&hitree->lock); + + if (!he || IS_ERR(he)) { + /* we don't have any info on this file yet */ + ret = -ENODATA; + goto err; + } + + spin_lock(&he->lock); + + heat_info->avg_delta_reads = (__u64) he->freq_data.avg_delta_reads; + heat_info->avg_delta_writes = (__u64) he->freq_data.avg_delta_writes; + heat_info->last_read_time = (__u64) timespec_to_ns(&he->freq_data.last_read_time); + heat_info->last_write_time = (__u64) timespec_to_ns(&he->freq_data.last_write_time); + heat_info->num_reads = (__u32) he->freq_data.nr_reads; + heat_info->num_writes = (__u32) he->freq_data.nr_writes; + + if (heat_info->live > 0) { + /* got a request for live temperature, + * call btrfs_get_temp to recalculate */ + 
heat_info->temperature = btrfs_get_temp(&he->freq_data); + } else { + /* not live temperature, get it from the hashlist */ + read_lock(&he->heat_node->hlist->rwlock); + heat_info->temperature = he->heat_node->hlist->temperature; + read_unlock(&he->heat_node->hlist->rwlock); + } + + spin_unlock(&he->lock); + free_hot_inode_item(he); + + if (copy_to_user(argp, (void *) heat_info, + sizeof(struct btrfs_ioctl_heat_info))) { + ret = -EFAULT; + goto err; + } + + kfree(heat_info); + return 0; + +err: + kfree(heat_info); + return ret; +} + +static long btrfs_ioctl_heat_opts(struct file *file, void __user *argp, int set) +{ + struct inode *inode = fdentry(file)->d_inode; + int arg, ret = 0; + + if (!set) { + arg = ((BTRFS_I(inode)->flags & BTRFS_INODE_NO_HOTDATA_TRACK) + ? 0 : 1) + + ((BTRFS_I(inode)->flags & BTRFS_INODE_NO_HOTDATA_MOVE) + ? 0 : 1); + + if (copy_to_user(argp, (void *) &arg, sizeof(int)) != 0) + ret = -EFAULT; + } else if (copy_from_user((void *) &arg, argp, sizeof(int)) != 0) + ret = -EFAULT; + else + switch (arg) { + case 0: /* track nothing, move nothing */ + /* set both flags */ + BTRFS_I(inode)->flags |= BTRFS_INODE_NO_HOTDATA_TRACK | + BTRFS_INODE_NO_HOTDATA_MOVE; + break; + case 1: /* do tracking, don't move anything */ + /* clear NO_HOTDATA_TRACK, set NO_HOTDATA_MOVE */ + BTRFS_I(inode)->flags &= ~BTRFS_INODE_NO_HOTDATA_TRACK; + BTRFS_I(inode)->flags |= BTRFS_INODE_NO_HOTDATA_MOVE; + break; + case 2: /* track and move */ + /* clear both flags */ + BTRFS_I(inode)->flags &= ~(BTRFS_INODE_NO_HOTDATA_TRACK | + BTRFS_INODE_NO_HOTDATA_MOVE); + break; + default: + ret = -EINVAL; + } + + return ret; +} + long btrfs_ioctl(struct file *file, unsigned int cmd, unsigned long arg) { @@ -2021,6 +2159,12 @@ long btrfs_ioctl(struct file *file, unsigned int return btrfs_ioctl_ino_lookup(file, argp); case BTRFS_IOC_SPACE_INFO: return btrfs_ioctl_space_info(root, argp); + case BTRFS_IOC_GET_HEAT_INFO: + return btrfs_ioctl_heat_info(file, argp); + case 
BTRFS_IOC_GET_HEAT_OPTS: + return btrfs_ioctl_heat_opts(file, argp, 0); + case BTRFS_IOC_SET_HEAT_OPTS: + return btrfs_ioctl_heat_opts(file, argp, 1); case BTRFS_IOC_SYNC: btrfs_sync_fs(file->f_dentry->d_sb, 1); return 0; diff --git a/fs/btrfs/ioctl.h b/fs/btrfs/ioctl.h index 424694a..8ba775e 100644 --- a/fs/btrfs/ioctl.h +++ b/fs/btrfs/ioctl.h @@ -138,6 +138,18 @@ struct btrfs_ioctl_space_args { struct btrfs_ioctl_space_info spaces[0]; }; +struct btrfs_ioctl_heat_info { + __u64 avg_delta_reads; + __u64 avg_delta_writes; + __u64 last_read_time; + __u64 last_write_time; + __u32 num_reads; + __u32 num_writes; + char filename[BTRFS_PATH_NAME_MAX + 1]; + int temperature; + __u8 live; +}; + #define BTRFS_IOC_SNAP_CREATE _IOW(BTRFS_IOCTL_MAGIC, 1, \ struct btrfs_ioctl_vol_args) #define BTRFS_IOC_DEFRAG _IOW(BTRFS_IOCTL_MAGIC, 2, \ @@ -178,4 +190,13 @@ struct btrfs_ioctl_space_args { #define BTRFS_IOC_DEFAULT_SUBVOL _IOW(BTRFS_IOCTL_MAGIC, 19, u64) #define BTRFS_IOC_SPACE_INFO _IOWR(BTRFS_IOCTL_MAGIC, 20, \ struct btrfs_ioctl_space_args) + +/* + * Hot data tracking ioctls: + */ +#define BTRFS_IOC_GET_HEAT_INFO _IOWR(BTRFS_IOCTL_MAGIC, 21, \ + struct btrfs_ioctl_heat_info) +#define BTRFS_IOC_SET_HEAT_OPTS _IOW(BTRFS_IOCTL_MAGIC, 22, int) +#define BTRFS_IOC_GET_HEAT_OPTS _IOR(BTRFS_IOCTL_MAGIC, 23, int) + #endif -- 1.7.1 -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
bchociej@gmail.com
2010-Jul-27 22:00 UTC
[RFC PATCH 4/5] Btrfs: Add debugfs interface for hot data stats
From: Ben Chociej <bcchocie@us.ibm.com> Adds a ./btrfs_data/<device_name>/ directory in the debugfs directory for each volume. The directory contains two files. The first, `inode_data', contains the heat information for inodes that have been brought into the hot data map structures. The second, `range_data', contains similar information about subfile ranges. Signed-off-by: Ben Chociej <bcchocie@us.ibm.com> Signed-off-by: Matt Lupfer <mrlupfer@us.ibm.com> Signed-off-by: Conor Scott <crscott@us.ibm.com> Reviewed-by: Mingming Cao <cmm@us.ibm.com> Reviewed-by: Steve French <sfrench@us.ibm.com> --- fs/btrfs/debugfs.c | 500 ++++++++++++++++++++++++++++++++++++++++++++++++++++ fs/btrfs/debugfs.h | 57 ++++++ 2 files changed, 557 insertions(+), 0 deletions(-) create mode 100644 fs/btrfs/debugfs.c create mode 100644 fs/btrfs/debugfs.h diff --git a/fs/btrfs/debugfs.c b/fs/btrfs/debugfs.c new file mode 100644 index 0000000..a0e7bb7 --- /dev/null +++ b/fs/btrfs/debugfs.c @@ -0,0 +1,500 @@ +#include <linux/debugfs.h> +#include <linux/fs.h> +#include <linux/module.h> +#include <linux/types.h> +#include <linux/vmalloc.h> +#include <linux/limits.h> +#include "ctree.h" +#include "hotdata_map.h" +#include "hotdata_hash.h" +#include "debugfs.h" + +/* + * debugfs.c contains the code to interface with the btrfs debugfs. + * The debugfs outputs range- and file-level access frequency + * statistics for each mounted volume. + */ + +static int copy_msg_to_log(struct debugfs_vol_data *data, char *msg, int len) +{ + struct lstring *debugfs_log = data->debugfs_log; + uint new_log_alloc_size; + char *new_log; + + if (len >= data->log_alloc_size - debugfs_log->len) { + /* Not enough room in the log buffer for the new message. */ + /* Allocate a bigger buffer. 
*/ + new_log_alloc_size = data->log_alloc_size + LOG_PAGE_SIZE; + new_log = vmalloc(new_log_alloc_size); + + if (new_log) { + memcpy(new_log, debugfs_log->str, + debugfs_log->len); + memset(new_log + debugfs_log->len, 0, + new_log_alloc_size - debugfs_log->len); + vfree(debugfs_log->str); + debugfs_log->str = new_log; + data->log_alloc_size = new_log_alloc_size; + } else { + WARN_ON(1); + if (data->log_alloc_size - debugfs_log->len) { + #define err_msg "No more memory!\n" + strlcpy(debugfs_log->str + + debugfs_log->len, + err_msg, data->log_alloc_size - + debugfs_log->len); + debugfs_log->len += min((typeof(debugfs_log->len)) + sizeof(err_msg), + ((typeof(debugfs_log->len)) + data->log_alloc_size - + debugfs_log->len)); + } + return 0; + } + } + + memcpy(debugfs_log->str + debugfs_log->len, + data->log_work_buff, len); + debugfs_log->len += (unsigned long) len; + + return len; +} + +/* Returns the number of bytes written to the log. */ +static int debugfs_log(struct debugfs_vol_data *data, const char *fmt, ...) 
+{ + struct lstring *debugfs_log = data->debugfs_log; + va_list args; + int len; + + if (debugfs_log->str == NULL) + return -1; + + spin_lock(&data->log_lock); + + va_start(args, fmt); + len = vsnprintf(data->log_work_buff, sizeof(data->log_work_buff), fmt, + args); + va_end(args); + + if (len >= sizeof(data->log_work_buff)) { + #define truncate_msg "The next message has been truncated.\n" + copy_msg_to_log(data, truncate_msg, sizeof(truncate_msg)); + } + + len = copy_msg_to_log(data, data->log_work_buff, len); + spin_unlock(&data->log_lock); + + return len; +} + +/* initialize a log corresponding to a btrfs volume */ +static int debugfs_log_init(struct debugfs_vol_data *data) +{ + int err = 0; + struct lstring *debugfs_log = data->debugfs_log; + + spin_lock(&data->log_lock); + debugfs_log->str = vmalloc(INIT_LOG_ALLOC_SIZE); + + if (debugfs_log->str) { + memset(debugfs_log->str, 0, INIT_LOG_ALLOC_SIZE); + data->log_alloc_size = INIT_LOG_ALLOC_SIZE; + } else { + err = -ENOMEM; + } + + spin_unlock(&data->log_lock); + return err; +} + +/* free a log corresponding to a btrfs volume */ +static void debugfs_log_exit(struct debugfs_vol_data *data) +{ + struct lstring *debugfs_log = data->debugfs_log; + spin_lock(&data->log_lock); + vfree(debugfs_log->str); + debugfs_log->str = NULL; + debugfs_log->len = 0; + spin_unlock(&data->log_lock); +} + +/* fops to override for printing range data */ +static const struct file_operations btrfs_debugfs_range_fops = { + .read = __btrfs_debugfs_range_read, + .open = __btrfs_debugfs_open, +}; + +/* fops to override for printing inode data */ +static const struct file_operations btrfs_debugfs_inode_fops = { + .read = __btrfs_debugfs_inode_read, + .open = __btrfs_debugfs_open, +}; + +/* initialize debugfs for btrfs at module init */ +int btrfs_init_debugfs(void) +{ + debugfs_root_dentry = debugfs_create_dir(DEBUGFS_ROOT_NAME, NULL); + /* init list of per-volume debugfs data */ + INIT_LIST_HEAD(&debugfs_vol_data_list); + /* init lock protecting the 
debugfs data list */ + spin_lock_init(&data_list_lock); + if (!debugfs_root_dentry) + goto debugfs_error; + return 0; + +debugfs_error: + return -EIO; +} + +/* + * on each volume mount, initialize the debugfs dentries and associated + * structures (debugfs_vol_data and debugfs_log) + */ +int btrfs_init_debugfs_volume(const char *uuid, struct super_block *sb) +{ + struct dentry *debugfs_volume_entry = NULL; + struct dentry *debugfs_range_entry = NULL; + struct dentry *debugfs_inode_entry = NULL; + struct debugfs_vol_data *range_data = NULL; + struct debugfs_vol_data *inode_data = NULL; + size_t dev_name_length = strlen(uuid); + char dev[NAME_MAX]; + + if (!debugfs_root_dentry) + goto debugfs_error; + + /* create debugfs folder for this volume by mounted dev name */ + memcpy(dev, uuid + DEV_NAME_CHOP, dev_name_length - + DEV_NAME_CHOP + 1); + debugfs_volume_entry = debugfs_create_dir(dev, debugfs_root_dentry); + + if (!debugfs_volume_entry) + goto debugfs_error; + + /* malloc and initialize debugfs_vol_data for range_data */ + range_data = kmalloc(sizeof(struct debugfs_vol_data), + GFP_KERNEL | GFP_NOFS); + memset(range_data, 0, sizeof(struct debugfs_vol_data)); + range_data->debugfs_log = NULL; + range_data->sb = sb; + spin_lock_init(&range_data->log_lock); + range_data->log_alloc_size = 0; + + /* malloc and initialize debugfs_vol_data for inode_data */ + inode_data = kmalloc(sizeof(struct debugfs_vol_data), + GFP_KERNEL | GFP_NOFS); + memset(inode_data, 0, sizeof(struct debugfs_vol_data)); + inode_data->debugfs_log = NULL; + inode_data->sb = sb; + spin_lock_init(&inode_data->log_lock); + inode_data->log_alloc_size = 0; + + /* add debugfs_vol_data for inode data and range data for + * volume to list */ + range_data->de = debugfs_volume_entry; + inode_data->de = debugfs_volume_entry; + spin_lock(&data_list_lock); + list_add(&range_data->node, &debugfs_vol_data_list); + list_add(&inode_data->node, &debugfs_vol_data_list); + spin_unlock(&data_list_lock); + + /* create 
debugfs range_data file */ + debugfs_range_entry = debugfs_create_file("range_data", + S_IFREG | S_IRUSR | S_IWUSR | + S_IRUGO, + debugfs_volume_entry, + (void *) range_data, + &btrfs_debugfs_range_fops); + if (!debugfs_range_entry) + goto debugfs_error; + + /* create debugfs inode_data file */ + debugfs_inode_entry = debugfs_create_file("inode_data", + S_IFREG | S_IRUSR | S_IWUSR | + S_IRUGO, + debugfs_volume_entry, + (void *) inode_data, + &btrfs_debugfs_inode_fops); + + if (!debugfs_inode_entry) + goto debugfs_error; + + return 0; + +debugfs_error: + + kfree(range_data); + kfree(inode_data); + + return -EIO; +} + +/* find volume mounted (match by superblock) and remove + * debugfs dentry + */ +void btrfs_exit_debugfs_volume(struct super_block *sb) +{ + struct list_head *head; + struct list_head *pos; + struct debugfs_vol_data *data; + spin_lock(&data_list_lock); + head = &debugfs_vol_data_list; + /* must clean up memory associated with superblock */ + list_for_each(pos, head) + { + data = list_entry(pos, struct debugfs_vol_data, node); + if (data->sb == sb) { + list_del(pos); + debugfs_remove_recursive(data->de); + kfree(data); + data = NULL; + break; + } + } + spin_unlock(&data_list_lock); +} + +/* clean up memory and remove dentries for debugfs */ +void btrfs_exit_debugfs(void) +{ + /* first iterate through debugfs_vol_data_list and free memory */ + struct list_head *head; + struct list_head *pos; + struct list_head *cur; + struct debugfs_vol_data *data; + + spin_lock(&data_list_lock); + head = &debugfs_vol_data_list; + list_for_each_safe(pos, cur, head) { + data = list_entry(pos, struct debugfs_vol_data, node); + if (data && pos != head) + kfree(data); + } + spin_unlock(&data_list_lock); + + /* remove all debugfs entries recursively from the root */ + debugfs_remove_recursive(debugfs_root_dentry); +} + +/* debugfs open file override from fops table */ +int __btrfs_debugfs_open(struct inode *inode, struct file *file) +{ + if (inode->i_private) + 
file->private_data = inode->i_private; + + return 0; +} + +/* debugfs read file override from fops table */ +ssize_t __btrfs_debugfs_range_read(struct file *file, char __user *user, + size_t count, loff_t *ppos) +{ + int err = 0; + struct super_block *sb; + struct btrfs_root *root; + struct btrfs_root *fs_root; + struct hot_inode_item *current_hot_inode; + struct hot_inode_item *next_hot_inode; + struct debugfs_vol_data *data; + struct lstring *debugfs_log; + + data = (struct debugfs_vol_data *) file->private_data; + sb = data->sb; + root = btrfs_sb(sb); + fs_root = (struct btrfs_root *) root->fs_info->fs_root; + + if (!data->debugfs_log) { + /* initialize debugfs log corresponding to this volume*/ + debugfs_log = kmalloc(sizeof(struct lstring), + GFP_KERNEL | GFP_NOFS); + debugfs_log->str = NULL; + debugfs_log->len = 0; + data->debugfs_log = debugfs_log; + debugfs_log_init(data); + } + + if ((unsigned long) *ppos > 0) { + /* caller is continuing a previous read, don't walk tree */ + if ((unsigned long) *ppos >= data->debugfs_log->len) + goto clean_up; + + goto print_to_user; + } + + /* walk the inode tree */ + + current_hot_inode = find_next_hot_inode(fs_root, 0); + + while (current_hot_inode) { + /* walk ranges, print data to debugfs log */ + __walk_range_tree(current_hot_inode, data); + + /* look up the next inode before dropping our reference */ + next_hot_inode = find_next_hot_inode(fs_root, + (u64) current_hot_inode->i_ino + 1); + free_hot_inode_item(current_hot_inode); + current_hot_inode = next_hot_inode; + } + +print_to_user: + + if (data->debugfs_log->len) { + err = simple_read_from_buffer(user, count, ppos, + data->debugfs_log->str, + data->debugfs_log->len); + } + + return err; + +clean_up: + + /* reader has finished the file */ + /* clean up */ + + debugfs_log_exit(data); + kfree(data->debugfs_log); + data->debugfs_log = NULL; + + return 0; +} + +/* debugfs read file override from fops table */ +ssize_t __btrfs_debugfs_inode_read(struct file *file, char __user *user, + size_t count, loff_t *ppos) +{ + int err = 0; + struct super_block *sb; + struct btrfs_root *root; + struct btrfs_root 
*fs_root; + struct hot_inode_item *current_hot_inode; + struct hot_inode_item *next_hot_inode; + struct debugfs_vol_data *data; + struct lstring *debugfs_log; + + data = (struct debugfs_vol_data *) file->private_data; + sb = data->sb; + root = btrfs_sb(sb); + fs_root = (struct btrfs_root *) root->fs_info->fs_root; + + if (!data->debugfs_log) { + /* initialize debugfs log corresponding to this volume */ + debugfs_log = kmalloc(sizeof(struct lstring), + GFP_KERNEL | GFP_NOFS); + debugfs_log->str = NULL; + debugfs_log->len = 0; + data->debugfs_log = debugfs_log; + debugfs_log_init(data); + } + + if ((unsigned long) *ppos > 0) { + /* caller is continuing a previous read, don't walk tree */ + if ((unsigned long) *ppos >= data->debugfs_log->len) + goto clean_up; + + goto print_to_user; + } + + /* walk the inode tree */ + + current_hot_inode = find_next_hot_inode(fs_root, 0); + + while (current_hot_inode) { + /* print inode frequency data to debugfs log */ + __print_inode_freq_data(current_hot_inode, data); + + /* look up the next inode before dropping our reference */ + next_hot_inode = find_next_hot_inode(fs_root, + (u64) current_hot_inode->i_ino + 1); + free_hot_inode_item(current_hot_inode); + current_hot_inode = next_hot_inode; + } + +print_to_user: + + if (data->debugfs_log->len) { + err = simple_read_from_buffer(user, count, ppos, + data->debugfs_log->str, + data->debugfs_log->len); + } + + return err; + +clean_up: + + /* reader has finished the file */ + /* clean up */ + debugfs_log_exit(data); + kfree(data->debugfs_log); + data->debugfs_log = NULL; + + return 0; +} + +/* + * Take the inode, find ranges associated with inode + * and print each range data struct + */ +void __walk_range_tree(struct hot_inode_item *hot_inode, + struct debugfs_vol_data *data) +{ + struct hot_range_tree *inode_range_tree; + struct rb_node *node; + struct hot_range_item *current_range; + + inode_range_tree = &hot_inode->hot_range_tree; + read_lock(&inode_range_tree->lock); + node = rb_first(&inode_range_tree->map); + + /* Walk the hot_range_tree for inode */ + while (node) { + current_range = rb_entry(node, struct 
hot_range_item, rb_node); + __print_range_freq_data(hot_inode, current_range, data); + node = rb_next(node); + } + read_unlock(&inode_range_tree->lock); +} + +/* Print frequency data for each range to log */ +void __print_range_freq_data(struct hot_inode_item *hot_inode, + struct hot_range_item *hot_range, + struct debugfs_vol_data *data) +{ + struct btrfs_freq_data *freq_data; + int temp; + freq_data = &hot_range->freq_data; + read_lock(&hot_range->heat_node->hlist->rwlock); + temp = hot_range->heat_node->hlist->temperature; + read_unlock(&hot_range->heat_node->hlist->rwlock); + + /* Always lock hot_inode_item first */ + spin_lock(&hot_inode->lock); + spin_lock(&hot_range->lock); + debugfs_log(data, "inode #%lu, range start " + "%llu (range len %llu) reads %u, writes %u, temp %u\n", + hot_inode->i_ino, + hot_range->start, + hot_range->len, + freq_data->nr_reads, + freq_data->nr_writes, + temp); + spin_unlock(&hot_range->lock); + spin_unlock(&hot_inode->lock); +} + +/* Print frequency data for each inode to log */ +void __print_inode_freq_data(struct hot_inode_item *hot_inode, + struct debugfs_vol_data *data) +{ + struct btrfs_freq_data *freq_data; + int temp; + freq_data = &hot_inode->freq_data; + + read_lock(&hot_inode->heat_node->hlist->rwlock); + temp = hot_inode->heat_node->hlist->temperature; + read_unlock(&hot_inode->heat_node->hlist->rwlock); + + spin_lock(&hot_inode->lock); + debugfs_log(data, "inode #%lu, reads %u, writes %u, temp %u\n", + hot_inode->i_ino, + freq_data->nr_reads, + freq_data->nr_writes, + temp); + spin_unlock(&hot_inode->lock); +} + diff --git a/fs/btrfs/debugfs.h b/fs/btrfs/debugfs.h new file mode 100644 index 0000000..bdd4938 --- /dev/null +++ b/fs/btrfs/debugfs.h @@ -0,0 +1,57 @@ +#ifndef __BTRFS_DEBUGFS__ +#define __BTRFS_DEBUGFS__ + +/* size of log to vmalloc */ +#define INIT_LOG_ALLOC_SIZE (PAGE_SIZE * 10) +#define LOG_PAGE_SIZE (PAGE_SIZE * 10) + +/* number of chars of device name to chop off for making debugfs folder + * e.g. 
/dev/sda -> sda */ +#define DEV_NAME_CHOP 5 + +/* list to keep track of each mounted volumes debugfs_vol_data */ +static struct list_head debugfs_vol_data_list; +/* lock for debugfs_vol_data_list */ +static spinlock_t data_list_lock; + +/* + * Name for BTRFS data in debugfs directory + * e.g. /sys/kernel/debug/btrfs_data + */ +#define DEBUGFS_ROOT_NAME "btrfs_data" +/* pointer to top level debugfs dentry */ +static struct dentry *debugfs_root_dentry; + +/* log to output to userspace in debugfs files */ +struct lstring { + char *str; + unsigned long len; +}; + +/* + * debugfs_vol_data is a struct of items that is passed to the debugfs + */ +struct debugfs_vol_data { + struct list_head node; /* protected by data_list_lock */ + struct lstring *debugfs_log; + struct super_block *sb; + struct dentry *de; + spinlock_t log_lock; /* protects debugfs_log */ + char log_work_buff[1024]; + uint log_alloc_size; +}; + +ssize_t __btrfs_debugfs_range_read(struct file *file, char __user *user, + size_t size, loff_t *len); +ssize_t __btrfs_debugfs_inode_read(struct file *file, char __user *user, + size_t size, loff_t *len); +int __btrfs_debugfs_open(struct inode *inode, struct file *file); +void __walk_range_tree(struct hot_inode_item *hot_inode, + struct debugfs_vol_data *data); +void __print_range_freq_data(struct hot_inode_item *hot_inode, + struct hot_range_item *hot_range, + struct debugfs_vol_data *data); +void __print_inode_freq_data(struct hot_inode_item *hot_inode, + struct debugfs_vol_data *data); + +#endif -- 1.7.1
bchociej@gmail.com
2010-Jul-27 22:00 UTC
[RFC PATCH 5/5] Btrfs: Add hooks to enable hot data tracking
From: Ben Chociej <bcchocie@us.ibm.com> Miscellaneous changes that enable hot data tracking, open the door for future hot data migration to faster media, and generally make the hot data functions a bit more friendly. ctree.h: Add the root hot_inode_tree and heat hashlists. Defines some mount options and inode flags for turning all of the hot data functionality on and off globally and per file. Defines some guard macros that enforce the mount options and inode flags. disk-io.c: Initialization and freeing of various structures. extent_io.c: Add hook into extent_write_cache_pages to enable hot data tracking functionality. Actual IO tracking is done here (and in inode.c). inode.c: Add hooks into btrfs_direct_IO and btrfs_readpages to enable hot data tracking functionality. Actual IO tracking is done here (and in extent_io.c). super.c: Implements aforementioned mount options, does various initializing and freeing. Signed-off-by: Ben Chociej <bcchocie@us.ibm.com> Signed-off-by: Matt Lupfer <mrlupfer@us.ibm.com> Signed-off-by: Conor Scott <crscott@us.ibm.com> Reviewed-by: Mingming Cao <cmm@us.ibm.com> Reviewed-by: Steve French <sfrench@us.ibm.com> --- fs/btrfs/Makefile | 5 ++++- fs/btrfs/ctree.h | 42 ++++++++++++++++++++++++++++++++++++++++++ fs/btrfs/disk-io.c | 29 +++++++++++++++++++++++++++++ fs/btrfs/extent_io.c | 18 ++++++++++++++++++ fs/btrfs/inode.c | 27 +++++++++++++++++++++++++++ fs/btrfs/super.c | 48 +++++++++++++++++++++++++++++++++++++++++++++--- 6 files changed, 165 insertions(+), 4 deletions(-) diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile index a35eb36..8bc70ba 100644 --- a/fs/btrfs/Makefile +++ b/fs/btrfs/Makefile @@ -7,4 +7,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \ extent_map.o sysfs.o struct-funcs.o xattr.o ordered-data.o \ extent_io.o volumes.o async-thread.o ioctl.o locking.o orphan.o \ export.o tree-log.o acl.o free-space-cache.o zlib.o \ - compression.o delayed-ref.o relocation.o + 
compression.o delayed-ref.o relocation.o hotdata_map.o \ + hotdata_hash.o + +btrfs-$(CONFIG_DEBUG_FS) += debugfs.o diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index e9bf864..7284cb5 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -31,6 +31,8 @@ #include "extent_io.h" #include "extent_map.h" #include "async-thread.h" +#include "hotdata_map.h" +#include "hotdata_hash.h" struct btrfs_trans_handle; struct btrfs_transaction; @@ -877,6 +879,7 @@ struct btrfs_fs_info { struct mutex cleaner_mutex; struct mutex chunk_mutex; struct mutex volume_mutex; + /* * this protects the ordered operations list only while we are * processing all of the entries on it. This way we make @@ -950,6 +953,7 @@ struct btrfs_fs_info { struct btrfs_workers endio_meta_write_workers; struct btrfs_workers endio_write_workers; struct btrfs_workers submit_workers; + /* * fixup workers take dirty pages that didn''t properly go through * the cow mechanism and make them safe to write. It happens @@ -958,6 +962,7 @@ struct btrfs_fs_info { struct btrfs_workers fixup_workers; struct task_struct *transaction_kthread; struct task_struct *cleaner_kthread; + int thread_pool_size; struct kobject super_kobj; @@ -1092,6 +1097,15 @@ struct btrfs_root { /* red-black tree that keeps track of in-memory inodes */ struct rb_root inode_tree; + /* red-black tree that keeps track of fs-wide hot data */ + struct hot_inode_tree hot_inode_tree; + + /* hash map of inode temperature */ + struct heat_hashlist_entry heat_inode_hl[HEAT_HASH_SIZE]; + + /* hash map of range temperature */ + struct heat_hashlist_entry heat_range_hl[HEAT_HASH_SIZE]; + /* * right now this just gets used so that a root has its own devid * for stat. 
It may be used for more later @@ -1192,6 +1206,8 @@ struct btrfs_root { #define BTRFS_MOUNT_NOSSD (1 << 9) #define BTRFS_MOUNT_DISCARD (1 << 10) #define BTRFS_MOUNT_FORCE_COMPRESS (1 << 11) +#define BTRFS_MOUNT_HOTDATA_TRACK (1 << 12) +#define BTRFS_MOUNT_HOTDATA_MOVE (1 << 13) #define btrfs_clear_opt(o, opt) ((o) &= ~BTRFS_MOUNT_##opt) #define btrfs_set_opt(o, opt) ((o) |= BTRFS_MOUNT_##opt) @@ -1211,6 +1227,24 @@ struct btrfs_root { #define BTRFS_INODE_NODUMP (1 << 8) #define BTRFS_INODE_NOATIME (1 << 9) #define BTRFS_INODE_DIRSYNC (1 << 10) +#define BTRFS_INODE_NO_HOTDATA_TRACK (1 << 11) +#define BTRFS_INODE_NO_HOTDATA_MOVE (1 << 12) + +/* Hot data tracking -- guard macros */ +#define BTRFS_TRACKING_HOT_DATA(btrfs_root) \ +(btrfs_test_opt(btrfs_root, HOTDATA_TRACK)) + +#define BTRFS_MOVING_HOT_DATA(btrfs_root) \ +((btrfs_test_opt(btrfs_root, HOTDATA_MOVE)) && \ +!(btrfs_root->fs_info->sb->s_flags & MS_RDONLY)) + +#define BTRFS_TRACK_THIS_INODE(btrfs_inode) \ +((BTRFS_TRACKING_HOT_DATA(btrfs_inode->root)) && \ +!(btrfs_inode->flags & BTRFS_INODE_NO_HOTDATA_TRACK)) + +#define BTRFS_MOVE_THIS_INODE(btrfs_inode) \ +((BTRFS_MOVING_HOT_DATA(btrfs_inode->root)) && \ +!(btrfs_inode->flags & BTRFS_INODE_NO_HOTDATA_MOVE)) /* some macros to generate set/get funcs for the struct fields. 
This * assumes there is a lefoo_to_cpu for every type, so lets make a simple @@ -2457,6 +2491,14 @@ int btrfs_sysfs_add_root(struct btrfs_root *root); void btrfs_sysfs_del_root(struct btrfs_root *root); void btrfs_sysfs_del_super(struct btrfs_fs_info *root); +#ifdef CONFIG_DEBUG_FS +/* debugfs.c */ +int btrfs_init_debugfs(void); +void btrfs_exit_debugfs(void); +int btrfs_init_debugfs_volume(const char *, struct super_block *); +void btrfs_exit_debugfs_volume(struct super_block *); +#endif + /* xattr.c */ ssize_t btrfs_listxattr(struct dentry *dentry, char *buffer, size_t size); diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 34f7c37..8f9c866 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -39,6 +39,7 @@ #include "locking.h" #include "tree-log.h" #include "free-space-cache.h" +#include "hotdata_hash.h" static struct extent_io_ops btree_extent_io_ops; static void end_workqueue_fn(struct btrfs_work *work); @@ -893,11 +894,32 @@ int clean_tree_block(struct btrfs_trans_handle *trans, struct btrfs_root *root, return 0; } +static inline void __setup_hotdata(struct btrfs_root *root) +{ + int i; + + hot_inode_tree_init(&root->hot_inode_tree); + + memset(&root->heat_inode_hl, 0, sizeof(root->heat_inode_hl)); + memset(&root->heat_range_hl, 0, sizeof(root->heat_range_hl)); + for (i = 0; i < HEAT_HASH_SIZE; i++) { + INIT_HLIST_HEAD(&root->heat_inode_hl[i].hashhead); + INIT_HLIST_HEAD(&root->heat_range_hl[i].hashhead); + + rwlock_init(&root->heat_inode_hl[i].rwlock); + rwlock_init(&root->heat_range_hl[i].rwlock); + + root->heat_inode_hl[i].temperature = i; + root->heat_range_hl[i].temperature = i; + } +} + static int __setup_root(u32 nodesize, u32 leafsize, u32 sectorsize, u32 stripesize, struct btrfs_root *root, struct btrfs_fs_info *fs_info, u64 objectid) { + root->node = NULL; root->commit_root = NULL; root->sectorsize = sectorsize; @@ -945,6 +967,10 @@ static int __setup_root(u32 nodesize, u32 leafsize, u32 sectorsize, memset(&root->root_item, 0, 
sizeof(root->root_item)); memset(&root->defrag_progress, 0, sizeof(root->defrag_progress)); memset(&root->root_kobj, 0, sizeof(root->root_kobj)); + + if (BTRFS_TRACKING_HOT_DATA(root)) + __setup_hotdata(root); + root->defrag_trans_start = fs_info->generation; init_completion(&root->kobj_unregister); root->defrag_running = 0; @@ -2324,6 +2350,9 @@ static void free_fs_root(struct btrfs_root *root) down_write(&root->anon_super.s_umount); kill_anon_super(&root->anon_super); } + + free_heat_hashlists(root); + free_hot_inode_tree(root); free_extent_buffer(root->node); free_extent_buffer(root->commit_root); kfree(root->name); diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index a4080c2..8fa2820 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -2468,8 +2468,10 @@ static int extent_write_cache_pages(struct extent_io_tree *tree, int ret = 0; int done = 0; int nr_to_write_done = 0; + int nr_written = 0; struct pagevec pvec; int nr_pages; + pgoff_t start; pgoff_t index; pgoff_t end; /* Inclusive */ int scanned = 0; @@ -2486,6 +2488,7 @@ static int extent_write_cache_pages(struct extent_io_tree *tree, range_whole = 1; scanned = 1; } + start = index << PAGE_CACHE_SHIFT; retry: while (!done && !nr_to_write_done && (index <= end) && (nr_pages = pagevec_lookup_tag(&pvec, mapping, &index, @@ -2547,6 +2550,7 @@ retry: * at any time */ nr_to_write_done = wbc->nr_to_write <= 0; + nr_written += 1; } pagevec_release(&pvec); cond_resched(); @@ -2560,6 +2564,20 @@ retry: index = 0; goto retry; } + + /* + * i_ino = 1 appears to come from metadata operations, ignore + * those writes + */ + if (BTRFS_TRACK_THIS_INODE(BTRFS_I(mapping->host)) && + mapping->host->i_ino > 1) { + printk(KERN_INFO "btrfs recorded a write %lu, %lu, %lu\n", + mapping->host->i_ino, start, nr_written * + PAGE_CACHE_SIZE); + btrfs_update_freqs(mapping->host, start, + nr_written * PAGE_CACHE_SIZE, 1); + } + return ret; } diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index f08427c..010eb29 
100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -37,6 +37,7 @@ #include <linux/posix_acl.h> #include <linux/falloc.h> #include <linux/slab.h> +#include <linux/pagevec.h> #include "compat.h" #include "ctree.h" #include "disk-io.h" @@ -50,6 +51,7 @@ #include "tree-log.h" #include "compression.h" #include "locking.h" +#include "hotdata_map.h" struct btrfs_iget_args { u64 ino; @@ -4515,6 +4517,10 @@ static struct inode *btrfs_new_inode(struct btrfs_trans_handle *trans, BTRFS_I(inode)->flags |= BTRFS_INODE_NODATASUM; if (btrfs_test_opt(root, NODATACOW)) BTRFS_I(inode)->flags |= BTRFS_INODE_NODATACOW; + if (!btrfs_test_opt(root, HOTDATA_TRACK)) + BTRFS_I(inode)->flags |= BTRFS_INODE_NO_HOTDATA_TRACK; + if (!btrfs_test_opt(root, HOTDATA_MOVE)) + BTRFS_I(inode)->flags |= BTRFS_INODE_NO_HOTDATA_MOVE; } insert_inode_hash(inode); @@ -5781,6 +5787,10 @@ static ssize_t btrfs_direct_IO(int rw, struct kiocb *iocb, lockstart = offset; lockend = offset + count - 1; + if (BTRFS_TRACK_THIS_INODE(BTRFS_I(inode)) && count > 0) + btrfs_update_freqs(inode, lockstart, (u64) count, + writing); + if (writing) { ret = btrfs_delalloc_reserve_space(inode, count); if (ret) @@ -5860,7 +5870,15 @@ static int btrfs_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo, int btrfs_readpage(struct file *file, struct page *page) { struct extent_io_tree *tree; + u64 start; + tree = &BTRFS_I(page->mapping->host)->io_tree; + start = (u64) page->index << PAGE_CACHE_SHIFT; + + if (BTRFS_TRACK_THIS_INODE(BTRFS_I(page->mapping->host))) + btrfs_update_freqs(page->mapping->host, start, + PAGE_CACHE_SIZE, 0); + return extent_read_full_page(tree, page, btrfs_get_extent); } @@ -5892,7 +5910,16 @@ btrfs_readpages(struct file *file, struct address_space *mapping, struct list_head *pages, unsigned nr_pages) { struct extent_io_tree *tree; + u64 start, len; + tree = &BTRFS_I(mapping->host)->io_tree; + start = (u64) (list_entry(pages->prev, struct page, lru)->index) + << PAGE_CACHE_SHIFT; + len = 
nr_pages * PAGE_CACHE_SIZE; + + if (len > 0 && BTRFS_TRACK_THIS_INODE(BTRFS_I(mapping->host))) + btrfs_update_freqs(mapping->host, start, len, 0); + return extent_readpages(tree, mapping, pages, nr_pages, btrfs_get_extent); } diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c index 859ddaa..db91b38 100644 --- a/fs/btrfs/super.c +++ b/fs/btrfs/super.c @@ -51,6 +51,8 @@ #include "version.h" #include "export.h" #include "compression.h" +#include "hotdata_map.h" +#include "hotdata_hash.h" static const struct super_operations btrfs_super_ops; @@ -59,6 +61,9 @@ static void btrfs_put_super(struct super_block *sb) struct btrfs_root *root = btrfs_sb(sb); int ret; + if (BTRFS_TRACKING_HOT_DATA(root)) + btrfs_exit_debugfs_volume(sb); + ret = close_ctree(root); sb->s_fs_info = NULL; } @@ -68,7 +73,7 @@ enum { Opt_nodatacow, Opt_max_inline, Opt_alloc_start, Opt_nobarrier, Opt_ssd, Opt_nossd, Opt_ssd_spread, Opt_thread_pool, Opt_noacl, Opt_compress, Opt_compress_force, Opt_notreelog, Opt_ratio, Opt_flushoncommit, - Opt_discard, Opt_err, + Opt_discard, Opt_hotdatatrack, Opt_hotdatamove, Opt_err, }; static match_table_t tokens = { @@ -92,6 +97,8 @@ static match_table_t tokens = { {Opt_flushoncommit, "flushoncommit"}, {Opt_ratio, "metadata_ratio=%d"}, {Opt_discard, "discard"}, + {Opt_hotdatatrack, "hotdatatrack"}, + {Opt_hotdatamove, "hotdatamove"}, {Opt_err, NULL}, }; @@ -235,6 +242,18 @@ int btrfs_parse_options(struct btrfs_root *root, char *options) case Opt_discard: btrfs_set_opt(info->mount_opt, DISCARD); break; + case Opt_hotdatamove: + printk(KERN_INFO "btrfs: turning on hot data " + "migration\n"); + printk(KERN_INFO " (implies hotdatatrack, " + "no ssd_spread)\n"); + btrfs_set_opt(info->mount_opt, HOTDATA_MOVE); + btrfs_clear_opt(info->mount_opt, SSD_SPREAD); + case Opt_hotdatatrack: + printk(KERN_INFO "btrfs: turning on hot data" + " tracking\n"); + btrfs_set_opt(info->mount_opt, HOTDATA_TRACK); + break; case Opt_err: printk(KERN_INFO "btrfs: unrecognized mount option " 
"''%s''\n", p); @@ -457,6 +476,7 @@ static int btrfs_fill_super(struct super_block *sb, printk("btrfs: open_ctree failed\n"); return PTR_ERR(tree_root); } + sb->s_fs_info = tree_root; disk_super = &tree_root->fs_info->super_copy; @@ -659,6 +679,9 @@ static int btrfs_get_sb(struct file_system_type *fs_type, int flags, mnt->mnt_sb = s; mnt->mnt_root = root; + if (btrfs_test_opt(btrfs_sb(s), HOTDATA_TRACK)) + btrfs_init_debugfs_volume(dev_name, s); + kfree(subvol_name); return 0; @@ -846,18 +869,30 @@ static int __init init_btrfs_fs(void) if (err) goto free_sysfs; - err = extent_io_init(); + err = btrfs_init_debugfs(); if (err) goto free_cachep; + err = extent_io_init(); + if (err) + goto free_debugfs; + err = extent_map_init(); if (err) goto free_extent_io; - err = btrfs_interface_init(); + err = hot_inode_item_init(); if (err) goto free_extent_map; + err = hot_range_item_init(); + if (err) + goto free_hot_inode_item; + + err = btrfs_interface_init(); + if (err) + goto free_hot_range_item; + err = register_filesystem(&btrfs_fs_type); if (err) goto unregister_ioctl; @@ -867,10 +902,16 @@ static int __init init_btrfs_fs(void) unregister_ioctl: btrfs_interface_exit(); +free_hot_range_item: + hot_range_item_exit(); +free_hot_inode_item: + hot_inode_item_exit(); free_extent_map: extent_map_exit(); free_extent_io: extent_io_exit(); +free_debugfs: + btrfs_exit_debugfs(); free_cachep: btrfs_destroy_cachep(); free_sysfs: @@ -886,6 +927,7 @@ static void __exit exit_btrfs_fs(void) btrfs_interface_exit(); unregister_filesystem(&btrfs_fs_type); btrfs_exit_sysfs(); + btrfs_exit_debugfs(); btrfs_cleanup_fs_uuids(); btrfs_zlib_exit(); } -- 1.7.1 -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Tracy Reed
2010-Jul-27 22:29 UTC
Re: [RFC PATCH 0/5] Btrfs: Add hot data tracking functionality
On Tue, Jul 27, 2010 at 05:00:18PM -0500, bchociej@gmail.com spake thusly:
> The long-term goal of these patches, as discussed in the Motivation
> section at the end of this message, is to enable Btrfs to perform
> automagic relocation of hot data to fast media like SSD. This goal has
> been motivated by the Project Ideas page on the Btrfs wiki.

With disks being so highly virtualized away these days, is there any way for btrfs to know which are the fast outer tracks vs. the slower inner tracks of a physical disk? If so, not only could this benefit SSD owners, but it could also benefit the many more spinning platters out there. If not (which wouldn't be surprising), then disregard. Even just having that sort of functionality for SSD would be excellent.

If I understand correctly, not only would this work for SSD, but if I have a SAN full of many large 7200rpm disks and a few 15k SAS disks, I could effectively utilize those disks by allowing btrfs to place hot data on the 15k SAS. I understand Compellent does this as well.

--
Tracy Reed
http://tracyreed.org
Diego Calleja
2010-Jul-27 23:10 UTC
Re: [RFC PATCH 0/5] Btrfs: Add hot data tracking functionality
On Wednesday, 28 July 2010 00:00:18, bchociej@gmail.com wrote:
> With Btrfs's COW approach, an external cache (where data is moved to
> SSD, rather than just cached there) makes a lot of sense. Though these

As I understand it, your project intends to move "hot" data to an SSD which would be part of a Btrfs pool, and not do any kind of SSD caching, as bcache (http://lwn.net/Articles/394672/) does?
Ben Chociej
2010-Jul-27 23:18 UTC
Re: [RFC PATCH 0/5] Btrfs: Add hot data tracking functionality
On Tue, Jul 27, 2010 at 6:10 PM, Diego Calleja <diegocg@gmail.com> wrote:
> On Wednesday, 28 July 2010 00:00:18, bchociej@gmail.com wrote:
>> With Btrfs's COW approach, an external cache (where data is moved to
>> SSD, rather than just cached there) makes a lot of sense. Though these
>
> As I understand it, your project intends to move "hot" data to an SSD
> which would be part of a Btrfs pool, and not do any kind of SSD
> caching, as bcache (http://lwn.net/Articles/394672/) does?

Yes, that's correct. It's likely not going to be a cache in the traditional sense, since the entire capacity of both HDD and SSD would be available.

BC
Chris Samuel
2010-Jul-28 12:28 UTC
Re: [RFC PATCH 0/5] Btrfs: Add hot data tracking functionality
On Wed, 28 Jul 2010 09:18:23 am Ben Chociej wrote:
> Yes, that's correct. It's likely not going to be a cache in the
> traditional sense, since the entire capacity of both HDD and SSD would
> be available.

To me that sounds like an HSM-type arrangement, with the most frequently used data on the highest-performing media and less frequently touched data getting shunted down the chain to SAS, SATA, and then tape and/or MAID-type devices.

Certainly interesting from my HPC point of view, in that I can see it being useful to parallel filesystems like Ceph if this "just happens".

I guess real HSM devotees would want policies for migration downstream... ;-)

cheers,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC

This email may come with a PGP signature as a file. Do not panic.
For more info see: http://en.wikipedia.org/wiki/OpenPGP
Mingming Cao
2010-Jul-28 21:22 UTC
Re: [RFC PATCH 0/5] Btrfs: Add hot data tracking functionality
On Tue, 2010-07-27 at 15:29 -0700, Tracy Reed wrote:
> On Tue, Jul 27, 2010 at 05:00:18PM -0500, bchociej@gmail.com spake thusly:
> > The long-term goal of these patches, as discussed in the Motivation
> > section at the end of this message, is to enable Btrfs to perform
> > automagic relocation of hot data to fast media like SSD. This goal has
> > been motivated by the Project Ideas page on the Btrfs wiki.
>
> With disks being so highly virtualized away these days is there any
> way for btrfs to know which are the fast outer-tracks vs the slower
> inner-tracks of a physical disk? If so not only could this benefit SSD
> owners but it could also benefit the many more spinning platters out
> there. If not (which wouldn't be surprising) then disregard. Even just
> having that sort of functionality for SSD would be excellent. If I
> understand correctly not only would this work for SSD but if I have a
> SAN full of many large 7200rpm disks and a few 15k SAS disks I could
> effectively utilize that disk by allowing btrfs to place hot data on
> the 15k SAS. I understand Compellent does this as well.

This is certainly possible. The disk used to store hot data does not have to be limited to SSDs, though the current implementation detects a "fast" device by checking the SSD rotation flag. This could easily be extended if btrfs were able to detect other relatively fast devices.