From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> NOTE: The patchset is currently post out mainly to make sure it is going in the correct direction and hope to get some helpful comments from other guys. For more infomation, please check hot_tracking.txt in Documentation TODO List: 1.) Need to do scalability or performance tests. 2.) Turn some Micro into be tunable TIME_TO_KICK, and HEAT_UPDATE_DELAY 3.) Rafactor hot_hash_is_aging() If you just made the timeout value a timespec and compared the _timespecs_, you would be doing a lot fewer conversions. 4.) Cleanup some unnecessary lock protect 5.) Add more comments to explain how to calc temperature How to "read" the avg read/write time (nanoseconds, microseconds, jiffies....??) 6.) Make updating tempreture more parallel 7.) How to save the file tempreture among the umount to be able to preserve the file tempreture after reboot 8.) Add one new ioctl inteface to set temperature value. Ben Chociej, Matt Lupfer and Conor Scott originally wrote this code to be very btrfs-specific. I''ve taken their code and attempted to make it more generic and integrate it at the VFS level. Changelog from v1: 1.) Reduce new files and put all in fs/hot_tracking.[ch] [Dave Chinner] 2.) The first three patches can probably just be flattened into one. [Marco Stornelli , Dave Chinner] Zhi Yong Wu (10): vfs: introduce private rb structures vfs: add support for updating access frequency vfs: add one new mount option ''-o hottrack'' vfs: add init and exit support vfs: introduce one hash table vfs: enable hot data tracking vfs: fork one kthread to update data temperature vfs: add 3 new ioctl interfaces vfs: add debugfs support vfs: add documentation Documentation/filesystems/hot_tracking.txt | 106 ++ fs/Makefile | 2 +- fs/compat_ioctl.c | 8 + fs/dcache.c | 2 + fs/direct-io.c | 10 + fs/hot_tracking.c | 1563 ++++++++++++++++++++++++++++ fs/hot_tracking.h | 163 +++ fs/ioctl.c | 130 +++ fs/namespace.c | 10 + fs/super.c | 11 + include/linux/fs.h | 15 + include/linux/hot_tracking.h | 164 +++ mm/filemap.c | 8 + mm/page-writeback.c | 21 + mm/readahead.c | 9 + 15 files changed, 2221 insertions(+), 1 deletions(-) create mode 100644 Documentation/filesystems/hot_tracking.txt create mode 100644 fs/hot_tracking.c create mode 100644 fs/hot_tracking.h create mode 100644 include/linux/hot_tracking.h -- 1.7.6.5 -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
zwu.kernel@gmail.com
2012-Sep-23 12:56 UTC
[RFC v2 01/10] vfs: introduce private rb structures
From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> One root structure hot_info is defined, is hooked up in super_block, and will be used to hold rb trees root, hash list root and some other information, etc. Adds hot_inode_tree struct to keep track of frequently accessed files, and be keyed by {inode, offset}. Trees contain hot_inode_items representing those files and ranges. Having these trees means that vfs can quickly determine the temperature of some data by doing some calculations on the hot_freq_data struct that hangs off of the tree item. Define two items hot_inode_item and hot_range_item, one of them represents one tracked file to keep track of its access frequency and the tree of ranges in this file, while the latter represents a file range of one inode. Each of the two structures contains a hot_freq_data struct with its frequency of access metrics (number of {reads, writes}, last {read,write} time, frequency of {reads,writes}). Also, each hot_inode_item contains one hot_range_tree struct which is keyed by {inode, offset, length} and used to keep track of all the ranges in this file. Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> --- fs/Makefile | 2 +- fs/dcache.c | 2 + fs/hot_tracking.c | 116 ++++++++++++++++++++++++++++++++++++++++++ fs/hot_tracking.h | 27 ++++++++++ include/linux/fs.h | 4 ++ include/linux/hot_tracking.h | 96 ++++++++++++++++++++++++++++++++++ 6 files changed, 246 insertions(+), 1 deletions(-) create mode 100644 fs/hot_tracking.c create mode 100644 fs/hot_tracking.h create mode 100644 include/linux/hot_tracking.h diff --git a/fs/Makefile b/fs/Makefile index 2fb9779..9d29618 100644 --- a/fs/Makefile +++ b/fs/Makefile @@ -11,7 +11,7 @@ obj-y := open.o read_write.o file_table.o super.o \ attr.o bad_inode.o file.o filesystems.o namespace.o \ seq_file.o xattr.o libfs.o fs-writeback.o \ pnode.o drop_caches.o splice.o sync.o utimes.o \ - stack.o fs_struct.o statfs.o + stack.o fs_struct.o statfs.o hot_tracking.o ifeq ($(CONFIG_BLOCK),y) obj-y += buffer.o bio.o block_dev.o direct-io.o mpage.o ioprio.o diff --git a/fs/dcache.c b/fs/dcache.c index 8086636..92470a1 100644 --- a/fs/dcache.c +++ b/fs/dcache.c @@ -37,6 +37,7 @@ #include <linux/rculist_bl.h> #include <linux/prefetch.h> #include <linux/ratelimit.h> +#include "hot_tracking.h" #include "internal.h" #include "mount.h" @@ -3164,6 +3165,7 @@ void __init vfs_caches_init(unsigned long mempages) inode_init(); files_init(mempages); mnt_init(); + hot_track_cache_init(); bdev_cache_init(); chrdev_init(); } diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c new file mode 100644 index 0000000..173054b --- /dev/null +++ b/fs/hot_tracking.c @@ -0,0 +1,116 @@ +/* + * fs/hot_tracking.c + * + * Copyright (C) 2012 IBM Corp. All rights reserved. + * Written by Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> + * Ben Chociej <bchociej@gmail.com> + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public + * License v2 as published by the Free Software Foundation. + */ + +#include <linux/list.h> +#include <linux/err.h> +#include <linux/slab.h> +#include <linux/module.h> +#include <linux/spinlock.h> +#include <linux/hardirq.h> +#include <linux/fs.h> +#include <linux/blkdev.h> +#include <linux/types.h> +#include "hot_tracking.h" + +/* kmem_cache pointers for slab caches */ +static struct kmem_cache *hot_inode_item_cache; +static struct kmem_cache *hot_range_item_cache; + +/* + * Initialize the inode tree. Should be called for each new inode + * access or other user of the hot_inode interface. + */ +static void hot_rb_inode_tree_init(struct hot_inode_tree *tree) +{ + tree->map = RB_ROOT; + rwlock_init(&tree->lock); +} + +/* + * Initialize the hot range tree. Should be called for each new inode + * access or other user of the hot_range interface. + */ +void hot_rb_range_tree_init(struct hot_range_tree *tree) +{ + tree->map = RB_ROOT; + rwlock_init(&tree->lock); +} + +/* + * Initialize a new hot_inode_item structure. The new structure is + * returned with a reference count of one and needs to be + * freed using free_inode_item() + */ +void hot_rb_inode_item_init(void *_item) +{ + struct hot_inode_item *he = _item; + + memset(he, 0, sizeof(*he)); + kref_init(&he->refs); + spin_lock_init(&he->lock); + he->hot_freq_data.avg_delta_reads = (u64) -1; + he->hot_freq_data.avg_delta_writes = (u64) -1; + he->hot_freq_data.flags = FREQ_DATA_TYPE_INODE; + hot_rb_range_tree_init(&he->hot_range_tree); +} + +/* + * Initialize a new hot_range_item structure. The new structure is + * returned with a reference count of one and needs to be + * freed using free_range_item() + */ +static void hot_rb_range_item_init(void *_item) +{ + struct hot_range_item *hr = _item; + + memset(hr, 0, sizeof(*hr)); + kref_init(&hr->refs); + spin_lock_init(&hr->lock); + hr->hot_freq_data.avg_delta_reads = (u64) -1; + hr->hot_freq_data.avg_delta_writes = (u64) -1; + hr->hot_freq_data.flags = FREQ_DATA_TYPE_RANGE; +} + +/* init hot_inode_item and hot_range_item kmem cache */ +static int __init hot_rb_item_cache_init(void) +{ + hot_inode_item_cache = kmem_cache_create("hot_inode_item", + sizeof(struct hot_inode_item), 0, + SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD, + hot_rb_inode_item_init); + if (!hot_inode_item_cache) + goto inode_err; + + hot_range_item_cache = kmem_cache_create("hot_range_item", + sizeof(struct hot_range_item), 0, + SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD, + hot_rb_range_item_init); + if (!hot_range_item_cache) + goto range_err; + + return 0; + +range_err: + kmem_cache_destroy(hot_inode_item_cache); +inode_err: + return -ENOMEM; +} + +/* + * Initialize kmem cache for hot_inode_item + * and hot_range_item + */ +void __init hot_track_cache_init(void) +{ + if (hot_rb_item_cache_init()) + return; +} diff --git a/fs/hot_tracking.h b/fs/hot_tracking.h new file mode 100644 index 0000000..269b67a --- /dev/null +++ b/fs/hot_tracking.h @@ -0,0 +1,27 @@ +/* + * fs/hot_tracking.h + * + * Copyright (C) 2012 IBM Corp. All rights reserved. + * Written by Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> + * Ben Chociej <bchociej@gmail.com> + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public + * License v2 as published by the Free Software Foundation. + */ + +#ifndef __HOT_TRACKING__ +#define __HOT_TRACKING__ + +#include <linux/rbtree.h> +#include <linux/hot_tracking.h> + +/* values for hot_freq_data flags */ +/* freq data struct is for an inode */ +#define FREQ_DATA_TYPE_INODE (1 << 0) +/* freq data struct is for a range */ +#define FREQ_DATA_TYPE_RANGE (1 << 1) + +void __init hot_track_cache_init(void); + +#endif /* __HOT_TRACKING__ */ diff --git a/include/linux/fs.h b/include/linux/fs.h index aa11047..db1a144 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -415,6 +415,7 @@ struct inodes_stat_t { #include <linux/migrate_mode.h> #include <linux/uidgid.h> #include <linux/lockdep.h> +#include <linux/hot_tracking.h> #include <asm/byteorder.h> @@ -1578,6 +1579,9 @@ struct super_block { /* Being remounted read-only */ int s_readonly_remount; + + /* Hot data tracking info*/ + struct hot_info s_hotinfo; }; /* superblock cache pruning functions */ diff --git a/include/linux/hot_tracking.h b/include/linux/hot_tracking.h new file mode 100644 index 0000000..a566f91 --- /dev/null +++ b/include/linux/hot_tracking.h @@ -0,0 +1,96 @@ +/* + * include/linux/hot_tracking.h + * + * This file has definitions for VFS hot data tracking + * structures etc. + * + * Copyright (C) 2012 IBM Corp. All rights reserved. + * Written by Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> + * Ben Chociej <bchociej@gmail.com> + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public + * License v2 as published by the Free Software Foundation. + */ + +#ifndef _LINUX_HOTTRACK_H +#define _LINUX_HOTTRACK_H + +#include <linux/types.h> +#include <linux/rbtree.h> +#include <linux/kref.h> + +/* A tree that sits on the hot_info */ +struct hot_inode_tree { + struct rb_root map; + rwlock_t lock; +}; + +/* A tree of ranges for each inode in the hot_inode_tree */ +struct hot_range_tree { + struct rb_root map; + rwlock_t lock; +}; + +/* A frequency data struct holds values that are used to + * determine temperature of files and file ranges. These structs + * are members of hot_inode_item and hot_range_item + */ +struct hot_freq_data { + struct timespec last_read_time; + struct timespec last_write_time; + u32 nr_reads; + u32 nr_writes; + u64 avg_delta_reads; + u64 avg_delta_writes; + u8 flags; + u32 last_temperature; +}; + +/* An item representing an inode and its access frequency */ +struct hot_inode_item { + /* node for hot_inode_tree rb_tree */ + struct rb_node rb_node; + /* tree of ranges in this inode */ + struct hot_range_tree hot_range_tree; + /* frequency data for this inode */ + struct hot_freq_data hot_freq_data; + /* inode number, copied from inode */ + unsigned long i_ino; + /* used to check for errors in ref counting */ + u8 in_tree; + /* protects hot_freq_data, i_no, in_tree */ + spinlock_t lock; + /* prevents kfree */ + struct kref refs; +}; + +/* + * An item representing a range inside of an inode whose frequency + * is being tracked + */ +struct hot_range_item { + /* node for hot_range_tree rb_tree */ + struct rb_node rb_node; + /* frequency data for this range */ + struct hot_freq_data hot_freq_data; + /* the hot_inode_item associated with this hot_range_item */ + struct hot_inode_item *hot_inode; + /* starting offset of this range */ + u64 start; + /* length of this range */ + u64 len; + /* used to check for errors in ref counting */ + u8 in_tree; + /* protects hot_freq_data, start, len, and in_tree */ + spinlock_t lock; + /* prevents kfree */ + struct kref refs; +}; + +struct hot_info { + /* red-black tree that keeps track of fs-wide hot data */ + struct hot_inode_tree hot_inode_tree; +}; + +#endif /* _LINUX_HOTTRACK_H */ -- 1.7.6.5 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
zwu.kernel@gmail.com
2012-Sep-23 12:56 UTC
[RFC v2 02/10] vfs: add support for updating access frequency
From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> Add some utils helpers to update access frequencies for one file or its range. Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> --- fs/hot_tracking.c | 359 +++++++++++++++++++++++++++++++++++++++++++++++++++++ fs/hot_tracking.h | 15 +++ 2 files changed, 374 insertions(+), 0 deletions(-) diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c index 173054b..52ed926 100644 --- a/fs/hot_tracking.c +++ b/fs/hot_tracking.c @@ -106,6 +106,365 @@ inode_err: } /* + * Drops the reference out on hot_inode_item by one and free the structure + * if the reference count hits zero + */ +void hot_rb_free_hot_inode_item(struct hot_inode_item *he) +{ + if (!he) + return; + + if (atomic_dec_and_test(&he->refs.refcount)) { + WARN_ON(he->in_tree); + kmem_cache_free(hot_inode_item_cache, he); + } +} + +/* + * Drops the reference out on hot_range_item by one and free the structure + * if the reference count hits zero + */ +static void hot_rb_free_hot_range_item(struct hot_range_item *hr) +{ + if (!hr) + return; + + if (atomic_dec_and_test(&hr->refs.refcount)) { + WARN_ON(hr->in_tree); + kmem_cache_free(hot_range_item_cache, hr); + } +} + +static struct rb_node *hot_rb_insert_hot_inode_item(struct rb_root *root, + unsigned long inode_num, + struct rb_node *node) +{ + struct rb_node **p = &root->rb_node; + struct rb_node *parent = NULL; + struct hot_inode_item *entry; + + /* walk tree to find insertion point */ + while (*p) { + parent = *p; + entry = rb_entry(parent, struct hot_inode_item, rb_node); + + if (inode_num < entry->i_ino) + p = &(*p)->rb_left; + else if (inode_num > entry->i_ino) + p = &(*p)->rb_right; + else + return parent; + } + + entry = rb_entry(node, struct hot_inode_item, rb_node); + entry->in_tree = 1; + rb_link_node(node, parent, p); + rb_insert_color(node, root); + + return NULL; +} + +static u64 hot_rb_range_end(struct hot_range_item *hr) +{ + if (hr->start + hr->len < hr->start) + return (u64)-1; + + return hr->start + hr->len - 1; +} + +static struct rb_node *hot_rb_insert_hot_range_item(struct rb_root *root, + u64 start, + struct rb_node *node) +{ + struct rb_node **p = &root->rb_node; + struct rb_node *parent = NULL; + struct hot_range_item *entry; + + /* ensure start is on a range boundary */ + start = start & RANGE_SIZE_MASK; + /* walk tree to find insertion point */ + while (*p) { + parent = *p; + entry = rb_entry(parent, struct hot_range_item, rb_node); + + if (start < entry->start) + p = &(*p)->rb_left; + else if (start >= hot_rb_range_end(entry)) + p = &(*p)->rb_right; + else + return parent; + } + + entry = rb_entry(node, struct hot_range_item, rb_node); + entry->in_tree = 1; + rb_link_node(node, parent, p); + rb_insert_color(node, root); + + return NULL; +} + +/* + * Add a hot_inode_item to a hot_inode_tree. If the tree already contains + * an item with the index given, return -EEXIST + */ +static int hot_rb_add_hot_inode_item(struct hot_inode_tree *tree, + struct hot_inode_item *he) +{ + int ret = 0; + struct rb_node *rb; + + rb = hot_rb_insert_hot_inode_item( + &tree->map, he->i_ino, &he->rb_node); + if (rb) { + ret = -EEXIST; + goto out; + } + + kref_get(&he->refs); + +out: + return ret; +} + +/* + * Add a hot_range_item to a hot_range_tree. If the tree already contains + * an item with the index given, return -EEXIST + * + * Also optionally aggresively merge ranges (currently disabled) + */ +static int hot_rb_add_hot_range_item(struct hot_range_tree *tree, + struct hot_range_item *hr) +{ + int ret = 0; + struct rb_node *rb; + + rb = hot_rb_insert_hot_range_item( + &tree->map, hr->start, &hr->rb_node); + if (rb) { + ret = -EEXIST; + goto out; + } + + kref_get(&hr->refs); + +out: + return ret; +} + +/* + * Lookup a hot_inode_item in the hot_inode_tree with the given index + * (inode_num) + */ +struct hot_inode_item +*hot_rb_lookup_hot_inode_item(struct hot_inode_tree *tree, + unsigned long inode_num) +{ + struct rb_node **p = &(tree->map.rb_node); + struct rb_node *parent = NULL; + struct hot_inode_item *entry; + + while (*p) { + parent = *p; + entry = rb_entry(parent, struct hot_inode_item, rb_node); + + if (inode_num < entry->i_ino) + p = &(*p)->rb_left; + else if (inode_num > entry->i_ino) + p = &(*p)->rb_right; + else { + kref_get(&entry->refs); + return entry; + } + } + + return NULL; +} + +/* + * Lookup a hot_range_item in a hot_range_tree with the given index + * (start, offset) + */ +static struct hot_range_item +*hot_rb_lookup_hot_range_item(struct hot_range_tree *tree, + u64 start) +{ + struct rb_node **p = &(tree->map.rb_node); + struct rb_node *parent = NULL; + struct hot_range_item *entry; + + /* ensure start is on a range boundary */ + start = start & RANGE_SIZE_MASK; + while (*p) { + parent = *p; + entry = rb_entry(parent, struct hot_range_item, rb_node); + + if (start < entry->start) + p = &(*p)->rb_left; + else if (start > hot_rb_range_end(entry)) + p = &(*p)->rb_right; + else { + kref_get(&entry->refs); + return entry; + } + } + + return NULL; +} + +/* + * This function does the actual work of updating the frequency numbers, + * whatever they turn out to be. FREQ_POWER determines how many atime + * deltas we keep track of (as a power of 2). So, setting it to anything above + * 16ish is probably overkill. Also, the higher the power, the more bits get + * right shifted out of the timestamp, reducing precision, so take note of that + * as well. + * + * The caller should have already locked freq_data''s parent''s spinlock. + * + * FREQ_POWER, defined immediately below, determines how heavily to weight + * the current frequency numbers against the newest access. For example, a value + * of 4 means that the new access information will be weighted 1/16th (ie 2^-4) + * as heavily as the existing frequency info. In essence, this is a kludged- + * together version of a weighted average, since we can''t afford to keep all of + * the information that it would take to get a _real_ weighted average. + */ +static void hot_rb_update_freq(struct hot_freq_data *freq_data, int rw) +{ + struct timespec old_atime; + struct timespec current_time; + struct timespec delta_ts; + u64 new_avg; + u64 new_delta; + + if (unlikely(rw)) { + old_atime = freq_data->last_write_time; + freq_data->nr_writes += 1; + new_avg = freq_data->avg_delta_writes; + } else { + old_atime = freq_data->last_read_time; + freq_data->nr_reads += 1; + new_avg = freq_data->avg_delta_reads; + } + + current_time = current_kernel_time(); + delta_ts = timespec_sub(current_time, old_atime); + new_delta = timespec_to_ns(&delta_ts) >> FREQ_POWER; + + new_avg = (new_avg << FREQ_POWER) - new_avg + new_delta; + new_avg = new_avg >> FREQ_POWER; + + if (unlikely(rw)) { + freq_data->last_write_time = current_time; + freq_data->avg_delta_writes = new_avg; + } else { + freq_data->last_read_time = current_time; + freq_data->avg_delta_reads = new_avg; + } +} + +/* Update inode frequency struct */ +static struct hot_inode_item *hot_rb_update_inode_freq(struct inode *inode, + int rw) +{ + struct hot_info *root = &(inode->i_sb->s_hotinfo); + struct hot_inode_tree *hitree = &(root->hot_inode_tree); + struct hot_inode_item *he; + + read_lock(&hitree->lock); + he = hot_rb_lookup_hot_inode_item(hitree, inode->i_ino); + read_unlock(&hitree->lock); + + if (!he) { + he = kmem_cache_alloc(hot_inode_item_cache, + GFP_KERNEL | GFP_NOFS); + if (!he) + goto out; + + write_lock(&hitree->lock); + hot_rb_inode_item_init(he); + he->i_ino = inode->i_ino; + hot_rb_add_hot_inode_item(hitree, he); + write_unlock(&hitree->lock); + } + + spin_lock(&he->lock); + hot_rb_update_freq(&he->hot_freq_data, rw); + spin_unlock(&he->lock); + +out: + return he; +} + +/* Update range frequency struct */ +static bool hot_rb_update_range_freq(struct hot_inode_item *he, + u64 off, u64 len, int rw, + struct hot_info *root) +{ + struct hot_range_tree *hrtree = &(he->hot_range_tree); + struct hot_range_item *hr = NULL; + u64 start_off = off & RANGE_SIZE_MASK; + u64 end_off = (off + len - 1) & RANGE_SIZE_MASK; + u64 cur; + int ret = true; + + if (len == 0) + return false; + + /* + * Align ranges on RANGE_SIZE boundary to prevent proliferation + * of range structs + */ + for (cur = start_off; cur <= end_off; cur += RANGE_SIZE) { + read_lock(&hrtree->lock); + hr = hot_rb_lookup_hot_range_item(hrtree, cur); + read_unlock(&hrtree->lock); + + if (!hr) { + hr = kmem_cache_alloc(hot_range_item_cache, + GFP_KERNEL | GFP_NOFS); + if (!hr) { + ret = false; + goto out; + } + + write_lock(&hrtree->lock); + hot_rb_range_item_init(hr); + hr->start = cur & RANGE_SIZE_MASK; + hr->len = RANGE_SIZE; + hr->hot_inode = he; + hot_rb_add_hot_range_item(hrtree, hr); + write_unlock(&hrtree->lock); + } + + spin_lock(&hr->lock); + hot_rb_update_freq(&hr->hot_freq_data, rw); + spin_unlock(&hr->lock); + hot_rb_free_hot_range_item(hr); + } + +out: + return ret; +} + +/* main function to update access frequency from read/writepage(s) hooks */ +void hot_rb_update_freqs(struct inode *inode, u64 start, + u64 len, int rw) +{ + struct hot_inode_item *he; + + he = hot_rb_update_inode_freq(inode, rw); + + WARN_ON(!he); + + if (he) { + hot_rb_update_range_freq(he, start, len, + rw, &(inode->i_sb->s_hotinfo)); + + hot_rb_free_hot_inode_item(he); + } +} + +/* * Initialize kmem cache for hot_inode_item * and hot_range_item */ diff --git a/fs/hot_tracking.h b/fs/hot_tracking.h index 269b67a..2ba29e4 100644 --- a/fs/hot_tracking.h +++ b/fs/hot_tracking.h @@ -21,6 +21,21 @@ #define FREQ_DATA_TYPE_INODE (1 << 0) /* freq data struct is for a range */ #define FREQ_DATA_TYPE_RANGE (1 << 1) +/* size of sub-file ranges */ +#define RANGE_SIZE (1 << 20) +#define RANGE_SIZE_MASK (~((u64)(RANGE_SIZE - 1))) + +#define FREQ_POWER 4 + +struct hot_info; +struct inode; + +struct hot_inode_item +*hot_rb_lookup_hot_inode_item(struct hot_inode_tree *tree, + unsigned long inode_num); +void hot_rb_free_hot_inode_item(struct hot_inode_item *he); +void hot_rb_update_freqs(struct inode *inode, u64 start, u64 len, + int rw); void __init hot_track_cache_init(void); -- 1.7.6.5 -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
zwu.kernel@gmail.com
2012-Sep-23 12:56 UTC
[RFC v2 03/10] vfs: add one new mount option ''-o hottrack''
From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> Introduce one new mount option ''-o hottrack'', and add its parsing support. Its usage looks like: mount -o hottrack mount -o nouser,hottrack mount -o nouser,hottrack,loop mount -o hottrack,nouser Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> --- fs/hot_tracking.c | 34 ++++++++++++++++++++++++++++++++++ fs/hot_tracking.h | 1 + fs/super.c | 5 +++++ include/linux/hot_tracking.h | 7 +++++++ 4 files changed, 47 insertions(+), 0 deletions(-) diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c index 52ed926..f97e8a6 100644 --- a/fs/hot_tracking.c +++ b/fs/hot_tracking.c @@ -465,6 +465,40 @@ void hot_rb_update_freqs(struct inode *inode, u64 start, } /* + * Regular mount options parser for -hottrack option. + * return false if no -hottrack is specified; + * otherwise return true. And the -hottrack will be + * removed from options. + */ +bool hot_track_parse_options(char *options) +{ + long len; + char *p; + static char opts_hot[] = "hottrack"; + + if (!options) + return false; + + p = strstr(options, opts_hot); + if (!p) + return false; + + while (p) { + len = options + strlen(options) - (p + strlen(opts_hot)); + if (len == 0) { + options[0] = ''\0''; + break; + } + + memmove(p, p + strlen(opts_hot) + 1, len); + p = strstr(options, opts_hot); + } + + printk(KERN_INFO "vfs: turning on hot data tracking\n"); + return true; +} + +/* * Initialize kmem cache for hot_inode_item * and hot_range_item */ diff --git a/fs/hot_tracking.h b/fs/hot_tracking.h index 2ba29e4..6bd09eb 100644 --- a/fs/hot_tracking.h +++ b/fs/hot_tracking.h @@ -37,6 +37,7 @@ void hot_rb_free_hot_inode_item(struct hot_inode_item *he); void hot_rb_update_freqs(struct inode *inode, u64 start, u64 len, int rw); +bool hot_track_parse_options(char *options); void __init hot_track_cache_init(void); #endif /* __HOT_TRACKING__ */ diff --git a/fs/super.c b/fs/super.c index 0902cfa..7eb3b0c 100644 --- a/fs/super.c +++ b/fs/super.c @@ -34,6 +34,7 @@ #include <linux/cleancache.h> #include <linux/fsnotify.h> #include <linux/lockdep.h> +#include "hot_tracking.h" #include "internal.h" @@ -1125,6 +1126,7 @@ mount_fs(struct file_system_type *type, int flags, const char *name, void *data) struct dentry *root; struct super_block *sb; char *secdata = NULL; + bool hottrack = false; int error = -ENOMEM; if (data && !(type->fs_flags & FS_BINARY_MOUNTDATA)) { @@ -1137,6 +1139,9 @@ mount_fs(struct file_system_type *type, int flags, const char *name, void *data) goto out_free_secdata; } + if (data && hot_track_parse_options(data)) + hottrack = true; + root = type->mount(type, flags, name, data); if (IS_ERR(root)) { error = PTR_ERR(root); diff --git a/include/linux/hot_tracking.h b/include/linux/hot_tracking.h index a566f91..bb2a41c 100644 --- a/include/linux/hot_tracking.h +++ b/include/linux/hot_tracking.h @@ -20,6 +20,11 @@ #include <linux/rbtree.h> #include <linux/kref.h> +/* + * Flags for hot data tracking mount options. + */ +#define HOT_MOUNT_HOT_TRACK (1 << 0) + /* A tree that sits on the hot_info */ struct hot_inode_tree { struct rb_root map; @@ -89,6 +94,8 @@ struct hot_range_item { }; struct hot_info { + unsigned long mount_opt; + /* red-black tree that keeps track of fs-wide hot data */ struct hot_inode_tree hot_inode_tree; }; -- 1.7.6.5
From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> Add initialization function to create some key data structures when hot tracking is enabled; Clean up them when hot tracking is disabled Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> --- fs/hot_tracking.c | 60 +++++++++++++++++++++++++++++++++++++++++++++++++++++ fs/hot_tracking.h | 2 + fs/namespace.c | 4 +++ fs/super.c | 6 +++++ 4 files changed, 72 insertions(+), 0 deletions(-) diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c index f97e8a6..fa89f70 100644 --- a/fs/hot_tracking.c +++ b/fs/hot_tracking.c @@ -135,6 +135,51 @@ static void hot_rb_free_hot_range_item(struct hot_range_item *hr) } } +static int hot_rb_remove_hot_inode_item(struct hot_inode_tree *tree, + struct hot_inode_item *he) +{ + int ret = 0; + rb_erase(&he->rb_node, &tree->map); + he->in_tree = 0; + return ret; +} + +static int hot_rb_remove_hot_range_item(struct hot_range_tree *tree, + struct hot_range_item *hr) +{ + int ret = 0; + rb_erase(&hr->rb_node, &tree->map); + hr->in_tree = 0; + return ret; +} + +/* Frees the entire hot_inode_tree. */ +static void hot_rb_inode_tree_free(struct hot_info *root) +{ + struct rb_node *node, *node2; + struct hot_inode_item *he; + struct hot_range_item *hr; + + /* Free hot inode and range trees on fs root */ + node = rb_first(&root->hot_inode_tree.map); + + while (node) { + he = rb_entry(node, struct hot_inode_item, rb_node); + + node2 = rb_first(&he->hot_range_tree.map); + while (node2) { + hr = rb_entry(node2, struct hot_range_item, rb_node); + hot_rb_remove_hot_range_item(&he->hot_range_tree, hr); + hot_rb_free_hot_range_item(hr); + node2 = rb_first(&he->hot_range_tree.map); + } + + hot_rb_remove_hot_inode_item(&root->hot_inode_tree, he); + hot_rb_free_hot_inode_item(he); + node = rb_first(&root->hot_inode_tree.map); + } +} + static struct rb_node *hot_rb_insert_hot_inode_item(struct rb_root *root, unsigned long inode_num, struct rb_node *node) @@ -507,3 +552,18 @@ void __init hot_track_cache_init(void) if (hot_rb_item_cache_init()) return; } + +/* + * Initialize the data structures for hot data tracking. + */ +void hot_track_init(struct super_block *sb, const char *name) +{ + sb->s_hotinfo.mount_opt |= HOT_MOUNT_HOT_TRACK; + hot_rb_inode_tree_init(&sb->s_hotinfo.hot_inode_tree); +} + +void hot_track_exit(struct super_block *sb) +{ + sb->s_hotinfo.mount_opt &= ~HOT_MOUNT_HOT_TRACK; + hot_rb_inode_tree_free(&sb->s_hotinfo); +} diff --git a/fs/hot_tracking.h b/fs/hot_tracking.h index 6bd09eb..3a8d398 100644 --- a/fs/hot_tracking.h +++ b/fs/hot_tracking.h @@ -39,5 +39,7 @@ void hot_rb_update_freqs(struct inode *inode, u64 start, u64 len, bool hot_track_parse_options(char *options); void __init hot_track_cache_init(void); +void hot_track_init(struct super_block *sb, const char *name); +void hot_track_exit(struct super_block *sb); #endif /* __HOT_TRACKING__ */ diff --git a/fs/namespace.c b/fs/namespace.c index 4d31f73..55006c8 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -20,6 +20,7 @@ #include <linux/fs_struct.h> /* get_fs_root et.al. */ #include <linux/fsnotify.h> /* fsnotify_vfsmount_delete */ #include <linux/uaccess.h> +#include "hot_tracking.h" #include "pnode.h" #include "internal.h" @@ -1215,6 +1216,9 @@ static int do_umount(struct mount *mnt, int flags) return retval; } + if (sb->s_hotinfo.mount_opt & HOT_MOUNT_HOT_TRACK) + hot_track_exit(sb); + down_write(&namespace_sem); br_write_lock(&vfsmount_lock); event++; diff --git a/fs/super.c b/fs/super.c index 7eb3b0c..0999d5c 100644 --- a/fs/super.c +++ b/fs/super.c @@ -1153,6 +1153,9 @@ mount_fs(struct file_system_type *type, int flags, const char *name, void *data) WARN_ON(sb->s_bdi == &default_backing_dev_info); sb->s_flags |= MS_BORN; + if (hottrack) + hot_track_init(sb, name); + error = security_sb_kern_mount(sb, flags, secdata); if (error) goto out_sb; @@ -1170,6 +1173,9 @@ mount_fs(struct file_system_type *type, int flags, const char *name, void *data) free_secdata(secdata); return root; out_sb: + if (hottrack) + hot_track_exit(sb); + dput(root); deactivate_locked_super(sb); out_free_secdata: -- 1.7.6.5 -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> Adds a hash table structure which contains a lot of hash list and is used to efficiently look up the data temperature of a file or its ranges. In each hash list of hash table, the hash node will keep track of temperature info. Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> --- fs/hot_tracking.c | 77 ++++++++++++++++++++++++++++++++++++++++- include/linux/hot_tracking.h | 35 +++++++++++++++++++ 2 files changed, 110 insertions(+), 2 deletions(-) diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c index fa89f70..5f96442 100644 --- a/fs/hot_tracking.c +++ b/fs/hot_tracking.c @@ -16,6 +16,7 @@ #include <linux/module.h> #include <linux/spinlock.h> #include <linux/hardirq.h> +#include <linux/hash.h> #include <linux/fs.h> #include <linux/blkdev.h> #include <linux/types.h> @@ -24,6 +25,9 @@ /* kmem_cache pointers for slab caches */ static struct kmem_cache *hot_inode_item_cache; static struct kmem_cache *hot_range_item_cache; +static struct kmem_cache *hot_hash_node_cache; + +static void hot_hash_node_init(void *_node); /* * Initialize the inode tree. Should be called for each new inode @@ -57,6 +61,10 @@ void hot_rb_inode_item_init(void *_item) memset(he, 0, sizeof(*he)); kref_init(&he->refs); spin_lock_init(&he->lock); + he->heat_node = kmem_cache_alloc(hot_hash_node_cache, + GFP_KERNEL | GFP_NOFS); + hot_hash_node_init(he->heat_node); + he->heat_node->hot_freq_data = &he->hot_freq_data; he->hot_freq_data.avg_delta_reads = (u64) -1; he->hot_freq_data.avg_delta_writes = (u64) -1; he->hot_freq_data.flags = FREQ_DATA_TYPE_INODE; @@ -75,6 +83,10 @@ static void hot_rb_range_item_init(void *_item) memset(hr, 0, sizeof(*hr)); kref_init(&hr->refs); spin_lock_init(&hr->lock); + hr->heat_node = kmem_cache_alloc(hot_hash_node_cache, + GFP_KERNEL | GFP_NOFS); + hot_hash_node_init(hr->heat_node); + hr->heat_node->hot_freq_data = &hr->hot_freq_data; hr->hot_freq_data.avg_delta_reads = (u64) -1; hr->hot_freq_data.avg_delta_writes = (u64) -1; hr->hot_freq_data.flags = FREQ_DATA_TYPE_RANGE; @@ -105,6 +117,18 @@ inode_err: return -ENOMEM; } +static void hot_rb_inode_item_exit(void) +{ + if (hot_inode_item_cache) + kmem_cache_destroy(hot_inode_item_cache); +} + +static void hot_rb_range_item_exit(void) +{ + if (hot_range_item_cache) + kmem_cache_destroy(hot_range_item_cache); +} + /* * Drops the reference out on hot_inode_item by one and free the structure * if the reference count hits zero @@ -510,6 +534,48 @@ void hot_rb_update_freqs(struct inode *inode, u64 start, } /* + * Initialize hash node. + */ +static void hot_hash_node_init(void *_node) +{ + struct hot_hash_node *node = _node; + + memset(node, 0, sizeof(*node)); + INIT_HLIST_NODE(&node->hashnode); + node->hot_freq_data = NULL; + node->hlist = NULL; + spin_lock_init(&node->lock); + kref_init(&node->refs); +} + +static int __init hot_hash_node_cache_init(void) +{ + hot_hash_node_cache = kmem_cache_create("hot_hash_node", + sizeof(struct hot_hash_node), + 0, + SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD, + hot_hash_node_init); + if (!hot_hash_node_cache) + return -ENOMEM; + + return 0; +} + +/* + * Initialize inode/range hash lists. + */ +static void hot_hash_table_init(struct hot_info *root) +{ + int i; + for (i = 0; i < HEAT_HASH_SIZE; i++) { + root->heat_inode_hl[i].temperature = i; + root->heat_range_hl[i].temperature = i; + rwlock_init(&root->heat_inode_hl[i].rwlock); + rwlock_init(&root->heat_range_hl[i].rwlock); + } +} + +/* * Regular mount options parser for -hottrack option. * return false if no -hottrack is specified; * otherwise return true. And the -hottrack will be @@ -544,13 +610,18 @@ bool hot_track_parse_options(char *options) } /* - * Initialize kmem cache for hot_inode_item - * and hot_range_item + * Initialize kmem cache for hot_inode_item, + * hot_range_item and hot_hash_node */ void __init hot_track_cache_init(void) { if (hot_rb_item_cache_init()) return; + + if (hot_hash_node_cache_init()) { + hot_rb_inode_item_exit(); + hot_rb_range_item_exit(); + } } /* @@ -560,10 +631,12 @@ void hot_track_init(struct super_block *sb, const char *name) { sb->s_hotinfo.mount_opt |= HOT_MOUNT_HOT_TRACK; hot_rb_inode_tree_init(&sb->s_hotinfo.hot_inode_tree); + hot_hash_table_init(&sb->s_hotinfo); } void hot_track_exit(struct super_block *sb) { sb->s_hotinfo.mount_opt &= ~HOT_MOUNT_HOT_TRACK; + hot_hash_table_free(&sb->s_hotinfo); hot_rb_inode_tree_free(&sb->s_hotinfo); } diff --git a/include/linux/hot_tracking.h b/include/linux/hot_tracking.h index bb2a41c..635ffb6 100644 --- a/include/linux/hot_tracking.h +++ b/include/linux/hot_tracking.h @@ -20,6 +20,9 @@ #include <linux/rbtree.h> #include <linux/kref.h> +#define HEAT_HASH_BITS 8 +#define HEAT_HASH_SIZE (1 << HEAT_HASH_BITS) + /* * Flags for hot data tracking mount options. */ @@ -52,6 +55,28 @@ struct hot_freq_data { u32 last_temperature; }; +/* Hash list heads for hot hash table */ +struct hot_hash_head { + struct hlist_head hashhead; + rwlock_t rwlock; + u32 temperature; +}; + +/* Nodes stored in each hash list of hash table */ +struct hot_hash_node { + struct hlist_node hashnode; + struct list_head node; + struct hot_freq_data *hot_freq_data; + struct hot_hash_head *hlist; + spinlock_t lock; /* protects hlist */ + + /* + * number of references to this node + * equals 1 (hashlist entry) + */ + struct kref refs; +}; + /* An item representing an inode and its access frequency */ struct hot_inode_item { /* node for hot_inode_tree rb_tree */ @@ -68,6 +93,8 @@ struct hot_inode_item { spinlock_t lock; /* prevents kfree */ struct kref refs; + /* hashlist node for this inode */ + struct hot_hash_node *heat_node; }; /* @@ -91,6 +118,8 @@ struct hot_range_item { spinlock_t lock; /* prevents kfree */ struct kref refs; + /* hashlist node for this range */ + struct hot_hash_node *heat_node; }; struct hot_info { @@ -98,6 +127,12 @@ struct hot_info { /* red-black tree that keeps track of fs-wide hot data */ struct hot_inode_tree hot_inode_tree; + + /* hash map of inode temperature */ + struct hot_hash_head heat_inode_hl[HEAT_HASH_SIZE]; + + /* hash map of range temperature */ + struct hot_hash_head heat_range_hl[HEAT_HASH_SIZE]; }; #endif /* _LINUX_HOTTRACK_H */ -- 1.7.6.5 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> Miscellaneous features that implement hot data tracking and generally make the hot data functions a bit more friendly. Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> --- fs/direct-io.c | 10 ++++++++++ include/linux/hot_tracking.h | 11 +++++++++++ mm/filemap.c | 8 ++++++++ mm/page-writeback.c | 21 +++++++++++++++++++++ mm/readahead.c | 9 +++++++++ 5 files changed, 59 insertions(+), 0 deletions(-) diff --git a/fs/direct-io.c b/fs/direct-io.c index f86c720..3773f44 100644 --- a/fs/direct-io.c +++ b/fs/direct-io.c @@ -37,6 +37,7 @@ #include <linux/uio.h> #include <linux/atomic.h> #include <linux/prefetch.h> +#include "hot_tracking.h" /* * How many user pages to map in one call to get_user_pages(). This determines @@ -1297,6 +1298,15 @@ __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode, prefetch(bdev->bd_queue); prefetch((char *)bdev->bd_queue + SMP_CACHE_BYTES); + /* Hot data tracking */ + if (TRACK_THIS_INODE(iocb->ki_filp->f_mapping->host) + && iov_length(iov, nr_segs) > 0) { + hot_rb_update_freqs(iocb->ki_filp->f_mapping->host, + (u64)offset, + (u64)iov_length(iov, nr_segs), + rw & WRITE); + } + return do_blockdev_direct_IO(rw, iocb, inode, bdev, iov, offset, nr_segs, get_block, end_io, submit_io, flags); diff --git a/include/linux/hot_tracking.h b/include/linux/hot_tracking.h index 635ffb6..bc41f94 100644 --- a/include/linux/hot_tracking.h +++ b/include/linux/hot_tracking.h @@ -28,6 +28,14 @@ */ #define HOT_MOUNT_HOT_TRACK (1 << 0) +/* Hot data tracking -- guard macros */ +#define TRACKING_HOT_TRACK(root) \ + (root->s_hotinfo.mount_opt & HOT_MOUNT_HOT_TRACK) + +#define TRACK_THIS_INODE(inode) \ + ((TRACKING_HOT_TRACK(inode->i_sb)) && \ + !(inode->i_flags & S_NOHOTDATATRACK)) + /* A tree that sits on the hot_info */ struct hot_inode_tree { struct rb_root map; @@ -135,4 +143,7 @@ struct hot_info { struct hot_hash_head heat_range_hl[HEAT_HASH_SIZE]; }; +extern void hot_rb_update_freqs(struct inode *inode, + u64 start, u64 len, int rw); + #endif /* _LINUX_HOTTRACK_H */ diff --git a/mm/filemap.c b/mm/filemap.c index 3843445..8b1ecff 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -33,6 +33,7 @@ #include <linux/hardirq.h> /* for BUG_ON(!in_atomic()) only */ #include <linux/memcontrol.h> #include <linux/cleancache.h> +#include <linux/hot_tracking.h> #include "internal.h" /* @@ -1224,6 +1225,13 @@ readpage: * PG_error will be set again if readpage fails. */ ClearPageError(page); + + /* Hot data tracking */ + if (TRACK_THIS_INODE(filp->f_mapping->host)) + hot_rb_update_freqs(filp->f_mapping->host, + (u64)page->index << PAGE_CACHE_SHIFT, + PAGE_CACHE_SIZE, 0); + /* Start the actual read. The read will unlock the page. */ error = mapping->a_ops->readpage(filp, page); diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 5ad5ce2..552c861 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -35,6 +35,7 @@ #include <linux/buffer_head.h> /* __set_page_dirty_buffers */ #include <linux/pagevec.h> #include <linux/timer.h> +#include <linux/hot_tracking.h> #include <trace/events/writeback.h> /* @@ -1895,13 +1896,33 @@ EXPORT_SYMBOL(generic_writepages); int do_writepages(struct address_space *mapping, struct writeback_control *wbc) { int ret; + pgoff_t start = 0; + u64 prev_count = 0, count = 0; if (wbc->nr_to_write <= 0) return 0; + + /* Hot data tracking */ + if (TRACK_THIS_INODE(mapping->host) + && wbc->range_cyclic) { + start = mapping->writeback_index << PAGE_CACHE_SHIFT; + prev_count = (u64)wbc->nr_to_write; + } + if (mapping->a_ops->writepages) ret = mapping->a_ops->writepages(mapping, wbc); else ret = generic_writepages(mapping, wbc); + + /* Hot data tracking */ + if (TRACK_THIS_INODE(mapping->host) + && wbc->range_cyclic) { + count = prev_count - (u64)wbc->nr_to_write; + if (count) + hot_rb_update_freqs(mapping->host, (u64)start, + count * PAGE_CACHE_SIZE, 1); + } + return ret; } diff --git a/mm/readahead.c b/mm/readahead.c index ea8f8fa..7010fc4 100644 --- a/mm/readahead.c +++ b/mm/readahead.c @@ -19,6 +19,7 @@ #include <linux/pagemap.h> #include <linux/syscalls.h> #include <linux/file.h> +#include <linux/hot_tracking.h> /* * Initialise a struct file''s readahead state. Assumes that the caller has @@ -138,6 +139,14 @@ static int read_pages(struct address_space *mapping, struct file *filp, out: blk_finish_plug(&plug); + /* Hot data tracking */ + if (TRACK_THIS_INODE(mapping->host) && nr_pages > 0) { + u64 start = (u64)(list_entry(pages->prev, + struct page, lru)->index) << PAGE_CACHE_SHIFT; + hot_rb_update_freqs(mapping->host, start, + (u64)nr_pages * PAGE_CACHE_SIZE, 0); + } + return ret; } -- 1.7.6.5
zwu.kernel@gmail.com
2012-Sep-23 12:56 UTC
[RFC v2 07/10] vfs: fork one kthread to update data temperature
From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> Fork and run one kernel kthread to calculate that temperature based on some metrics kept in custom frequency data structs, and store the info in the hash table. Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> --- fs/hot_tracking.c | 467 +++++++++++++++++++++++++++++++++++++++++- fs/hot_tracking.h | 78 +++++++ include/linux/hot_tracking.h | 3 + 3 files changed, 542 insertions(+), 6 deletions(-) diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c index 5f96442..fd11695 100644 --- a/fs/hot_tracking.c +++ b/fs/hot_tracking.c @@ -17,6 +17,8 @@ #include <linux/spinlock.h> #include <linux/hardirq.h> #include <linux/hash.h> +#include <linux/kthread.h> +#include <linux/freezer.h> #include <linux/fs.h> #include <linux/blkdev.h> #include <linux/types.h> @@ -27,7 +29,12 @@ static struct kmem_cache *hot_inode_item_cache; static struct kmem_cache *hot_range_item_cache; static struct kmem_cache *hot_hash_node_cache; +static struct task_struct *hot_track_temperature_update_kthread; + static void hot_hash_node_init(void *_node); +static int hot_hash_is_aging(struct hot_freq_data *freq_data); +static void hot_hash_update_hash_table(struct hot_freq_data *freq_data, + struct hot_info *root); /* * Initialize the inode tree. Should be called for each new inode @@ -456,9 +463,13 @@ static struct hot_inode_item *hot_rb_update_inode_freq(struct inode *inode, write_unlock(&hitree->lock); } - spin_lock(&he->lock); - hot_rb_update_freq(&he->hot_freq_data, rw); - spin_unlock(&he->lock); + if (!hot_track_temperature_update_kthread + || hot_track_temperature_update_kthread->pid != current->pid) { + spin_lock(&he->lock); + hot_rb_update_freq(&he->hot_freq_data, rw); + spin_unlock(&he->lock); + hot_hash_update_hash_table(&he->hot_freq_data, root); + } out: return he; @@ -505,9 +516,14 @@ static bool hot_rb_update_range_freq(struct hot_inode_item *he, write_unlock(&hrtree->lock); } - spin_lock(&hr->lock); - hot_rb_update_freq(&hr->hot_freq_data, rw); - spin_unlock(&hr->lock); + if (!hot_track_temperature_update_kthread + || hot_track_temperature_update_kthread->pid != current->pid) { + spin_lock(&hr->lock); + hot_rb_update_freq(&hr->hot_freq_data, rw); + spin_unlock(&hr->lock); + hot_hash_update_hash_table(&hr->hot_freq_data, root); + } + hot_rb_free_hot_range_item(hr); } @@ -515,6 +531,58 @@ out: return ret; } +/* Walk the hot_inode_tree, locking as necessary */ +static struct hot_inode_item +*hot_rb_find_next_hot_inode(struct hot_info *root, + u64 objectid) +{ + struct rb_node *node; + struct rb_node *prev; + struct hot_inode_item *entry; + + read_lock(&root->hot_inode_tree.lock); + + node = root->hot_inode_tree.map.rb_node; + prev = NULL; + while (node) { + prev = node; + entry = rb_entry(node, struct hot_inode_item, rb_node); + + if (objectid < entry->i_ino) + node = node->rb_left; + else if (objectid > entry->i_ino) + node = node->rb_right; + else + break; + } + + if (!node) { + while (prev) { + entry = rb_entry(prev, struct hot_inode_item, rb_node); + if (objectid <= entry->i_ino) { + node = prev; + break; + } + prev = rb_next(prev); + } + } + + if (node) { + entry = rb_entry(node, struct hot_inode_item, rb_node); + /* + * increase reference count to prevent pruning while + * caller is using the hot_inode_item + */ + kref_get(&entry->refs); + + read_unlock(&root->hot_inode_tree.lock); + return entry; + } + + read_unlock(&root->hot_inode_tree.lock); + return NULL; +} + /* main function to update access frequency from read/writepage(s) hooks */ void hot_rb_update_freqs(struct inode *inode, u64 start, u64 len, int rw) @@ -534,6 +602,65 @@ void hot_rb_update_freqs(struct inode *inode, u64 start, } /* + * take hot range that is now cold and remove from indexes and clean up + * any memory associted, involves removing hot range from rb tree, and + * heat hash lists, and freeing up all memory. + */ +static void hot_rb_remove_range_data(struct hot_inode_item *hot_inode, + struct hot_range_item *hr, + struct hot_info *root) +{ + /* remove range from rb tree */ + hot_rb_remove_hot_range_item(&hot_inode->hot_range_tree, hr); + + /* remove range from hash list */ + spin_lock(&hr->heat_node->lock); + write_lock(&hr->heat_node->hlist->rwlock); + hlist_del(&hr->heat_node->hashnode); + write_unlock(&hr->heat_node->hlist->rwlock); + spin_unlock(&hr->heat_node->lock); + + /*free up memory */ + kfree(hr->heat_node); + hot_rb_free_hot_range_item(hr); +} + +/* Update temperatures for each range item for aging purposes */ +static void hot_rb_update_range_data(struct hot_inode_item *hot_inode, + struct hot_info *root) +{ + struct hot_range_tree *inode_range_tree; + struct rb_node *node; + struct rb_node *old_node; + struct hot_range_item *current_range; + int range_is_aging; + + inode_range_tree = &hot_inode->hot_range_tree; + write_lock(&inode_range_tree->lock); + node = rb_first(&inode_range_tree->map); + /* Walk the hot_range_tree for inode */ + while (node) { + current_range = rb_entry(node, struct hot_range_item, rb_node); + hot_hash_update_hash_table(¤t_range->hot_freq_data, root); + old_node = node; + node = rb_next(node); + + spin_lock(¤t_range->lock); + range_is_aging = hot_hash_is_aging(¤t_range->hot_freq_data); + spin_unlock(¤t_range->lock); + + if (range_is_aging) { + if (atomic_read( + ¤t_range->heat_node->refs.refcount) <= 1) + hot_rb_remove_range_data(hot_inode, + current_range, root); + } + } + + write_unlock(&inode_range_tree->lock); +} + +/* * Initialize hash node. */ static void hot_hash_node_init(void *_node) @@ -575,6 +702,308 @@ static void hot_hash_table_init(struct hot_info *root) } } +static void hot_hash_node_free(struct hlist_head *head) +{ + struct hlist_node *pos = NULL, *pos2 = NULL; + struct hot_hash_node *heatnode = NULL; + + hlist_for_each_safe(pos, pos2, head) { + heatnode = hlist_entry(pos, + struct hot_hash_node, + hashnode); + hlist_del(pos); + kfree(heatnode); + } + +} + +static void hot_hash_table_free(struct hot_info *root) +{ + int i; + + /* Free node/range heat hash lists */ + for (i = 0; i < HEAT_HASH_SIZE; i++) { + hot_hash_node_free(&root->heat_inode_hl[i].hashhead); + hot_hash_node_free(&root->heat_range_hl[i].hashhead); + } +} + +static u64 hot_hash_shift(u64 counter, u32 bits, bool shift_dir) +{ + if (shift_dir) + return counter << bits; + else + return counter >> bits; +} + +/* + * hot_hash_calc_temperature() is responsible for distilling the six heat + * criteria, which are described in detail in hot_hash.h) down into a single + * temperature value for the data, which is an integer between 0 + * and HEAT_MAX_VALUE. + * + * To accomplish this, the raw values from the hot_freq_data structure + * are shifted various ways in order to make the temperature calculation more + * or less sensitive to each value. + * + * Once this calibration has happened, we do some additional normalization and + * make sure that everything fits nicely in a u32. From there, we take a very + * rudimentary kind of "average" of each of the values, where the *_COEFF_POWER + * values act as weights for the average. + * + * Finally, we use the HEAT_HASH_BITS value, which determines the size of the + * heat hash list, to normalize the temperature to the proper granularity. + */ +int hot_hash_calc_temperature(struct hot_freq_data *freq_data) +{ + u64 result = 0; + + struct timespec ckt = current_kernel_time(); + u64 cur_time = timespec_to_ns(&ckt); + + u32 nrr_heat = (u32)hot_hash_shift((u64)freq_data->nr_reads, + NRR_MULTIPLIER_POWER, true); + u32 nrw_heat = (u32)hot_hash_shift((u64)freq_data->nr_writes, + NRW_MULTIPLIER_POWER, true); + + u64 ltr_heat + hot_hash_shift((cur_time - timespec_to_ns(&freq_data->last_read_time)), + LTR_DIVIDER_POWER, false); + u64 ltw_heat + hot_hash_shift((cur_time - timespec_to_ns(&freq_data->last_write_time)), + LTW_DIVIDER_POWER, false); + + u64 avr_heat + hot_hash_shift((((u64) -1) - freq_data->avg_delta_reads), + AVR_DIVIDER_POWER, false); + u64 avw_heat + hot_hash_shift((((u64) -1) - freq_data->avg_delta_writes), + AVW_DIVIDER_POWER, false); + + /* ltr_heat is now guaranteed to be u32 safe */ + if (ltr_heat >= hot_hash_shift((u64) 1, 32, true)) + ltr_heat = 0; + else + ltr_heat = hot_hash_shift((u64) 1, 32, true) - ltr_heat; + + /* ltw_heat is now guaranteed to be u32 safe */ + if (ltw_heat >= hot_hash_shift((u64) 1, 32, true)) + ltw_heat = 0; + else + ltw_heat = hot_hash_shift((u64) 1, 32, true) - ltw_heat; + + /* avr_heat is now guaranteed to be u32 safe */ + if (avr_heat >= hot_hash_shift((u64) 1, 32, true)) + avr_heat = (u32) -1; + + /* avw_heat is now guaranteed to be u32 safe */ + if (avw_heat >= hot_hash_shift((u64) 1, 32, true)) + avw_heat = (u32) -1; + + nrr_heat = (u32)hot_hash_shift((u64)nrr_heat, + (3 - NRR_COEFF_POWER), false); + nrw_heat = (u32)hot_hash_shift((u64)nrw_heat, + (3 - NRW_COEFF_POWER), false); + ltr_heat = hot_hash_shift(ltr_heat, (3 - LTR_COEFF_POWER), false); + ltw_heat = hot_hash_shift(ltw_heat, (3 - LTW_COEFF_POWER), false); + avr_heat = hot_hash_shift(avr_heat, (3 - AVR_COEFF_POWER), false); + avw_heat = hot_hash_shift(avw_heat, (3 - AVW_COEFF_POWER), false); + + result = nrr_heat + nrw_heat + (u32) ltr_heat + + (u32) ltw_heat + (u32) avr_heat + (u32) avw_heat; + + return result >> (32 - HEAT_HASH_BITS); +} + +static int hot_hash_is_aging(struct hot_freq_data *freq_data) +{ + int ret = 0; + struct timespec ckt = current_kernel_time(); + + u64 cur_time = timespec_to_ns(&ckt); + u64 last_read_ns + (cur_time - timespec_to_ns(&freq_data->last_read_time)); + u64 last_write_ns + (cur_time - timespec_to_ns(&freq_data->last_write_time)); + u64 kick_ns = TIME_TO_KICK * (u64)1000000000; + + if ((last_read_ns > kick_ns) && (last_write_ns > kick_ns)) + ret = 1; + + return ret; +} + +/* + * Calc a new temperature and, if necessary, move the heat_node corresponding + * to this inode or range to the proper hashlist with the new temperature + */ +static void hot_hash_update_hash_table(struct hot_freq_data *freq_data, + struct hot_info *root) +{ + int temperature = 0; + int moved = 0; + struct hot_hash_head *buckets, *current_bucket = NULL; + struct hot_inode_item *he; + struct hot_range_item *hr; + + if (freq_data->flags & FREQ_DATA_TYPE_INODE) { + he = hot_freq_data_get_he(freq_data); + buckets = root->heat_inode_hl; + + spin_lock(&he->lock); + temperature = hot_hash_calc_temperature(freq_data); + freq_data->last_temperature = temperature; + spin_unlock(&he->lock); + + if (he == NULL) + return; + + spin_lock(&he->heat_node->lock); + if (he->heat_node->hlist == NULL) { + current_bucket = buckets + temperature; + moved = 1; + } else { + write_lock(&he->heat_node->hlist->rwlock); + current_bucket = he->heat_node->hlist; + if (current_bucket->temperature != temperature) { + hlist_del(&he->heat_node->hashnode); + current_bucket = buckets + temperature; + moved = 1; + } + write_unlock(&he->heat_node->hlist->rwlock); + } + + if (moved) { + write_lock(¤t_bucket->rwlock); + hlist_add_head(&he->heat_node->hashnode, + ¤t_bucket->hashhead); + he->heat_node->hlist = current_bucket; + write_unlock(¤t_bucket->rwlock); + } + spin_unlock(&he->heat_node->lock); + } else if (freq_data->flags & FREQ_DATA_TYPE_RANGE) { + hr = hot_freq_data_get_hr(freq_data); + buckets = root->heat_range_hl; + + spin_lock(&hr->lock); + temperature = hot_hash_calc_temperature(freq_data); + freq_data->last_temperature = temperature; + spin_unlock(&hr->lock); + + if (hr == NULL) + return; + + spin_lock(&hr->heat_node->lock); + if (hr->heat_node->hlist == NULL) { + current_bucket = buckets + temperature; + moved = 1; + } else { + write_lock(&hr->heat_node->hlist->rwlock); + current_bucket = hr->heat_node->hlist; + if (current_bucket->temperature != temperature) { + hlist_del(&hr->heat_node->hashnode); + current_bucket = buckets + temperature; + moved = 1; + } + write_unlock(&hr->heat_node->hlist->rwlock); + } + + if (moved) { + write_lock(¤t_bucket->rwlock); + hlist_add_head(&hr->heat_node->hashnode, + ¤t_bucket->hashhead); + hr->heat_node->hlist = current_bucket; + write_unlock(¤t_bucket->rwlock); + } + spin_unlock(&hr->heat_node->lock); + } +} + +/* + * Update temperatures for each hot inode item and + * hot range item for aging purposes + */ +static void hot_hash_temperature_update_iterator(struct hot_info *root) +{ + struct hot_inode_item *current_hot_inode; + struct hot_inode_tree *hot_inode_tree; + unsigned long inode_num; + + hot_inode_tree = &root->hot_inode_tree; + + /* walk the inode tree */ + current_hot_inode = hot_rb_find_next_hot_inode(root, 0); + while (current_hot_inode) { + hot_hash_update_hash_table( + ¤t_hot_inode->hot_freq_data, root); + hot_rb_update_range_data(current_hot_inode, root); + inode_num = current_hot_inode->i_ino; + hot_rb_free_hot_inode_item(current_hot_inode); + current_hot_inode = hot_rb_find_next_hot_inode(root, + inode_num + 1); + } +} + +/* Determine if there is hot data tracking to be enabled */ +static bool hot_hash_global_hot_track(void) +{ + struct super_block *sb; + bool ret = false; + + spin_lock(&sb_lock); + list_for_each_entry(sb, &super_blocks, s_list) { + if (hlist_unhashed(&sb->s_instances)) + continue; + if (sb->s_hotinfo.mount_opt & HOT_MOUNT_HOT_TRACK) + ret = true; + } + spin_unlock(&sb_lock); + + return ret; +} + +/* + * kthread iterates each hot_inode_item and hot_range_item + * and update temperatures to be shifted in heat hash table + * for purposes of relocation and such hot file detection + */ +static int hot_hash_temperature_update_kthread(void *arg) +{ + struct super_block *sb; + struct hot_info *root; + unsigned long delay; + + do { + spin_lock(&sb_lock); + list_for_each_entry(sb, &super_blocks, s_list) { + if (hlist_unhashed(&sb->s_instances)) + continue; + delay = HZ * HEAT_UPDATE_DELAY; + root = &sb->s_hotinfo; + if (mutex_trylock( + &root->hot_data_update_kthread_mutex)) { + hot_hash_temperature_update_iterator(root); + mutex_unlock( + &root->hot_data_update_kthread_mutex); + } + if (unlikely(freezing(current))) { + __refrigerator(true); + } else { + set_current_state(TASK_INTERRUPTIBLE); + if (!kthread_should_stop()) { + spin_unlock(&sb_lock); + schedule_timeout(delay); + spin_lock(&sb_lock); + } + __set_current_state(TASK_RUNNING); + } + } + spin_unlock(&sb_lock); + } while (!kthread_should_stop() || !hot_hash_global_hot_track()); + + return 0; +} + /* * Regular mount options parser for -hottrack option. * return false if no -hottrack is specified; @@ -625,6 +1054,30 @@ void __init hot_track_cache_init(void) } /* + * Fork and initialize kthread for each new mount point that + * periodically goes through hot inodes and hot ranges and ages them + * based on frequency of access + */ +static void hot_track_fork_temperature_update_kthread(void) +{ + if (hot_track_temperature_update_kthread) + return; + + hot_track_temperature_update_kthread + kthread_run(hot_hash_temperature_update_kthread, NULL, + "hot_temperature_update_kthread"); + if (IS_ERR(hot_track_temperature_update_kthread)) + kthread_stop(hot_track_temperature_update_kthread); +} + +/* Stop the kthread to do temperature updates for all filesystems */ +static void hot_track_stop_temperature_update_kthread(void) +{ + if (hot_track_temperature_update_kthread) + kthread_stop(hot_track_temperature_update_kthread); +} + +/* * Initialize the data structures for hot data tracking. */ void hot_track_init(struct super_block *sb, const char *name) @@ -632,11 +1085,13 @@ void hot_track_init(struct super_block *sb, const char *name) sb->s_hotinfo.mount_opt |= HOT_MOUNT_HOT_TRACK; hot_rb_inode_tree_init(&sb->s_hotinfo.hot_inode_tree); hot_hash_table_init(&sb->s_hotinfo); + hot_track_fork_temperature_update_kthread(); } void hot_track_exit(struct super_block *sb) { sb->s_hotinfo.mount_opt &= ~HOT_MOUNT_HOT_TRACK; + hot_track_stop_temperature_update_kthread(); hot_hash_table_free(&sb->s_hotinfo); hot_rb_inode_tree_free(&sb->s_hotinfo); } diff --git a/fs/hot_tracking.h b/fs/hot_tracking.h index 3a8d398..1b6c694 100644 --- a/fs/hot_tracking.h +++ b/fs/hot_tracking.h @@ -27,6 +27,76 @@ #define FREQ_POWER 4 +/* + * time to quit keeping track of + * tracking data (seconds) + */ +#define TIME_TO_KICK 400 + +/* set how often to update temperatures (seconds) */ +#define HEAT_UPDATE_DELAY 400 + +/* + * The following comments explain what exactly comprises a unit of heat. + * + * Each of six values of heat are calculated and combined in order to form an + * overall temperature for the data: + * + * NRR - number of reads since mount + * NRW - number of writes since mount + * LTR - time elapsed since last read (ns) + * LTW - time elapsed since last write (ns) + * AVR - average delta between recent reads (ns) + * AVW - average delta between recent writes (ns) + * + * These values are divided (right-shifted) according to the *_DIVIDER_POWER + * values defined below to bring the numbers into a reasonable range. You can + * modify these values to fit your needs. However, each heat unit is a u32 and + * thus maxes out at 2^32 - 1. Therefore, you must choose your dividers quite + * carefully or else they could max out or be stuck at zero quite easily. + * + * (E.g., if you chose AVR_DIVIDER_POWER = 0, nothing less than 4s of atime + * delta would bring the temperature above zero, ever.) + * + * Finally, each value is added to the overall temperature between 0 and 8 + * times, depending on its *_COEFF_POWER value. Note that the coefficients are + * also actually implemented with shifts, so take care to treat these values + * as powers of 2. (I.e., 0 means we''ll add it to the temp once; 1 = 2x, etc.) + */ + +/* NRR/NRW heat unit = 2^X accesses */ +#define NRR_MULTIPLIER_POWER 20 +#define NRR_COEFF_POWER 0 +#define NRW_MULTIPLIER_POWER 20 +#define NRW_COEFF_POWER 0 + +/* LTR/LTW heat unit = 2^X ns of age */ +#define LTR_DIVIDER_POWER 30 +#define LTR_COEFF_POWER 1 +#define LTW_DIVIDER_POWER 30 +#define LTW_COEFF_POWER 1 + +/* + * AVR/AVW cold unit = 2^X ns of average delta + * AVR/AVW heat unit = HEAT_MAX_VALUE - cold unit + * + * E.g., data with an average delta between 0 and 2^X ns + * will have a cold value of 0, which means a heat value + * equal to HEAT_MAX_VALUE. + */ +#define AVR_DIVIDER_POWER 40 +#define AVR_COEFF_POWER 0 +#define AVW_DIVIDER_POWER 40 +#define AVW_COEFF_POWER 0 + +/* macros to wrap container_of()''s for hot data structs */ +#define hot_freq_data_get_he(x) \ + ((struct hot_inode_item *) container_of(x, \ + struct hot_inode_item, hot_freq_data)) +#define hot_freq_data_get_hr(x) \ + ((struct hot_range_item *) container_of(x, \ + struct hot_range_item, hot_freq_data)) + struct hot_info; struct inode; @@ -37,6 +107,14 @@ void hot_rb_free_hot_inode_item(struct hot_inode_item *he); void hot_rb_update_freqs(struct inode *inode, u64 start, u64 len, int rw); +/* + * Returns a value from 0 to HEAT_MAX_VALUE + * indicating the temperature of the + * file (and consequently its bucket number + * in hash list) (see hot_hash.c) + */ +int hot_hash_calc_temperature(struct hot_freq_data *freq_data); + bool hot_track_parse_options(char *options); void __init hot_track_cache_init(void); void hot_track_init(struct super_block *sb, const char *name); diff --git a/include/linux/hot_tracking.h b/include/linux/hot_tracking.h index bc41f94..1ec90a6 100644 --- a/include/linux/hot_tracking.h +++ b/include/linux/hot_tracking.h @@ -141,6 +141,9 @@ struct hot_info { /* hash map of range temperature */ struct hot_hash_head heat_range_hl[HEAT_HASH_SIZE]; + + /* protects hot data items while being iterated and updated */ + struct mutex hot_data_update_kthread_mutex; }; extern void hot_rb_update_freqs(struct inode *inode, -- 1.7.6.5
From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> FS_IOC_GET_HEAT_INFO: return a struct containing the various metrics collected in btrfs_freq_data structs, and also return a calculated data temperature based on those metrics. Optionally, retrieve the temperature from the hot data hash list instead of recalculating it. FS_IOC_GET_HEAT_OPTS: return an integer representing the current state of hot data tracking and migration: 0 = do nothing 1 = track frequency of access FS_IOC_SET_HEAT_OPTS: change the state of hot data tracking and migration, as described above. Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> --- fs/compat_ioctl.c | 8 +++ fs/ioctl.c | 130 ++++++++++++++++++++++++++++++++++++++++++ include/linux/fs.h | 11 ++++ include/linux/hot_tracking.h | 12 ++++ 4 files changed, 161 insertions(+), 0 deletions(-) diff --git a/fs/compat_ioctl.c b/fs/compat_ioctl.c index debdfe0..a88c7de 100644 --- a/fs/compat_ioctl.c +++ b/fs/compat_ioctl.c @@ -1390,6 +1390,11 @@ COMPATIBLE_IOCTL(TIOCSTART) COMPATIBLE_IOCTL(TIOCSTOP) #endif +/*Hot data tracking*/ +COMPATIBLE_IOCTL(FS_IOC_GET_HEAT_INFO) +COMPATIBLE_IOCTL(FS_IOC_SET_HEAT_OPTS) +COMPATIBLE_IOCTL(FS_IOC_GET_HEAT_OPTS) + /* fat ''r'' ioctls. These are handled by fat with ->compat_ioctl, but we don''t want warnings on other file systems. So declare them as compatible here. */ @@ -1572,6 +1577,9 @@ asmlinkage long compat_sys_ioctl(unsigned int fd, unsigned int cmd, case FIBMAP: case FIGETBSZ: case FIONREAD: + case FS_IOC_GET_HEAT_INFO: + case FS_IOC_SET_HEAT_OPTS: + case FS_IOC_GET_HEAT_OPTS: if (S_ISREG(filp->f_path.dentry->d_inode->i_mode)) break; /*FALL THROUGH*/ diff --git a/fs/ioctl.c b/fs/ioctl.c index 29167be..394975e 100644 --- a/fs/ioctl.c +++ b/fs/ioctl.c @@ -15,6 +15,7 @@ #include <linux/writeback.h> #include <linux/buffer_head.h> #include <linux/falloc.h> +#include "hot_tracking.h" #include <asm/ioctls.h> @@ -537,6 +538,126 @@ static int ioctl_fsthaw(struct file *filp) } /* + * Retrieve information about access frequency for the given file. Return it in + * a userspace-friendly struct for btrfsctl (or another tool) to parse. + * + * The temperature that is returned can be "live" -- that is, recalculated when + * the ioctl is called -- or it can be returned from the hashtable, reflecting + * the (possibly old) value that the system will use when considering files + * for migration. This behavior is determined by hot_heat_info->live. + */ +static int ioctl_heat_info(struct file *file, void __user *argp) +{ + struct inode *mnt_inode = file->f_path.dentry->d_inode; + struct inode *file_inode; + struct file *file_filp; + struct hot_info *root = &(mnt_inode->i_sb->s_hotinfo); + struct hot_heat_info *heat_info; + struct hot_inode_tree *hitree; + struct hot_inode_item *he; + int ret; + + heat_info = kmalloc(sizeof(struct hot_heat_info), + GFP_KERNEL | GFP_NOFS); + + if (copy_from_user((void *) heat_info, + argp, + sizeof(struct hot_heat_info)) != 0) { + ret = -EFAULT; + goto err; + } + + file_filp = filp_open(heat_info->filename, O_RDONLY, 0); + file_inode = file_filp->f_dentry->d_inode; + filp_close(file_filp, NULL); + + hitree = &root->hot_inode_tree; + read_lock(&hitree->lock); + he = hot_rb_lookup_hot_inode_item(hitree, file_inode->i_ino); + read_unlock(&hitree->lock); + if (!he) { + /* we don''t have any info on this file yet */ + ret = -ENODATA; + goto err; + } + + spin_lock(&he->lock); + heat_info->avg_delta_reads + (__u64) he->hot_freq_data.avg_delta_reads; + heat_info->avg_delta_writes + (__u64) he->hot_freq_data.avg_delta_writes; + heat_info->last_read_time + (__u64) timespec_to_ns(&he->hot_freq_data.last_read_time); + heat_info->last_write_time + (__u64) timespec_to_ns(&he->hot_freq_data.last_write_time); + heat_info->num_reads + (__u32) he->hot_freq_data.nr_reads; + heat_info->num_writes + (__u32) he->hot_freq_data.nr_writes; + + if (heat_info->live > 0) { + /* got a request for live temperature, + * call hot_hash_calc_temperature to recalculate + */ + heat_info->temperature + hot_hash_calc_temperature(&he->hot_freq_data); + } else { + /* not live temperature, get it from the hashlist */ + read_lock(&he->heat_node->hlist->rwlock); + heat_info->temperature = he->heat_node->hlist->temperature; + read_unlock(&he->heat_node->hlist->rwlock); + } + spin_unlock(&he->lock); + + hot_rb_free_hot_inode_item(he); + + if (copy_to_user(argp, (void *) heat_info, + sizeof(struct hot_heat_info))) { + ret = -EFAULT; + goto err; + } + + kfree(heat_info); + return 0; + +err: + kfree(heat_info); + return ret; +} + +static int ioctl_heat_opts(struct file *file, void __user *argp, int set) +{ + struct inode *inode = file->f_path.dentry->d_inode; + int arg, ret = 0; + + if (!set) { + arg = TRACK_THIS_INODE(inode) ? 1 : 0; + + if (copy_to_user(argp, (void *) &arg, sizeof(int)) != 0) + ret = -EFAULT; + } else { + if (copy_from_user((void *) &arg, argp, sizeof(int)) != 0) { + ret = -EFAULT; + } else { + switch (arg) { + case 0: /* track nothing */ + /* set S_NOHOTDATATRACK */ + inode->i_flags |= S_NOHOTDATATRACK; + break; + case 1: /* do tracking */ + /* clear S_NOHOTDATATRACK */ + inode->i_flags &= ~S_NOHOTDATATRACK; + break; + default: + ret = -EINVAL; + } + } + } + + return ret; +} + +/* * When you add any new common ioctls to the switches above and below * please update compat_sys_ioctl() too. * @@ -591,6 +712,15 @@ int do_vfs_ioctl(struct file *filp, unsigned int fd, unsigned int cmd, case FIGETBSZ: return put_user(inode->i_sb->s_blocksize, argp); + case FS_IOC_GET_HEAT_INFO: + return ioctl_heat_info(filp, argp); + + case FS_IOC_SET_HEAT_OPTS: + return ioctl_heat_opts(filp, argp, 1); + + case FS_IOC_GET_HEAT_OPTS: + return ioctl_heat_opts(filp, argp, 0); + default: if (S_ISREG(inode->i_mode)) error = file_ioctl(filp, cmd, arg); diff --git a/include/linux/fs.h b/include/linux/fs.h index db1a144..277791f 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -256,6 +256,7 @@ struct inodes_stat_t { #define S_IMA 1024 /* Inode has an associated IMA struct */ #define S_AUTOMOUNT 2048 /* Automount/referral quasi-directory */ #define S_NOSEC 4096 /* no suid or xattr security attributes */ +#define S_NOHOTDATATRACK (1 << 13) /* hot data tracking */ /* * Note that nosuid etc flags are inode-specific: setting some file-system @@ -354,6 +355,16 @@ struct inodes_stat_t { #define FS_IOC32_SETVERSION _IOW(''v'', 2, int) /* + * Hot data tracking ioctls: + * + * HOT_INFO - retrieve info on frequency of access + */ +#define FS_IOC_GET_HEAT_INFO _IOR(''f'', 17, \ + struct hot_heat_info) +#define FS_IOC_SET_HEAT_OPTS _IOW(''f'', 18, int) +#define FS_IOC_GET_HEAT_OPTS _IOR(''f'', 19, int) + +/* * Inode flags (FS_IOC_GETFLAGS / FS_IOC_SETFLAGS) */ #define FS_SECRM_FL 0x00000001 /* Secure deletion */ diff --git a/include/linux/hot_tracking.h b/include/linux/hot_tracking.h index 1ec90a6..b0a2705 100644 --- a/include/linux/hot_tracking.h +++ b/include/linux/hot_tracking.h @@ -63,6 +63,18 @@ struct hot_freq_data { u32 last_temperature; }; +struct hot_heat_info { + __u64 avg_delta_reads; + __u64 avg_delta_writes; + __u64 last_read_time; + __u64 last_write_time; + __u32 num_reads; + __u32 num_writes; + __u32 temperature; + __u8 live; + char filename[PATH_MAX]; +}; + /* Hash list heads for hot hash table */ struct hot_hash_head { struct hlist_head hashhead; -- 1.7.6.5
From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> Add a /sys/kernel/debug/hot_track/<device_name>/ directory for each volume that contains two files. The first, `inode_data'', contains the heat information for inodes that have been brought into the hot data map structures. The second, `range_data'', contains similar information for subfile ranges. Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> --- fs/hot_tracking.c | 466 +++++++++++++++++++++++++++++++++++++++++++++++++++++ fs/hot_tracking.h | 40 +++++ fs/namespace.c | 6 + 3 files changed, 512 insertions(+), 0 deletions(-) diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c index fd11695..6aeabad 100644 --- a/fs/hot_tracking.c +++ b/fs/hot_tracking.c @@ -22,6 +22,9 @@ #include <linux/fs.h> #include <linux/blkdev.h> #include <linux/types.h> +#include <linux/debugfs.h> +#include <linux/vmalloc.h> +#include <linux/limits.h> #include "hot_tracking.h" /* kmem_cache pointers for slab caches */ @@ -29,6 +32,13 @@ static struct kmem_cache *hot_inode_item_cache; static struct kmem_cache *hot_range_item_cache; static struct kmem_cache *hot_hash_node_cache; +/* list to keep track of each mounted volumes debugfs_vol_data */ +static struct list_head hot_debugfs_vol_data_list; +/* lock for debugfs_vol_data_list */ +static spinlock_t hot_debugfs_data_list_lock; +/* pointer to top level debugfs dentry */ +static struct dentry *hot_debugfs_root_dentry; + static struct task_struct *hot_track_temperature_update_kthread; static void hot_hash_node_init(void *_node); @@ -1004,6 +1014,460 @@ static int hot_hash_temperature_update_kthread(void *arg) return 0; } +static int hot_debugfs_copy(struct debugfs_vol_data *data, char *msg, int len) +{ + struct lstring *debugfs_log = data->debugfs_log; + uint new_log_alloc_size; + char *new_log; + static char err_msg[] = "No more memory!\n"; + + if (len >= data->log_alloc_size - debugfs_log->len) { + /* Not enough room in the log buffer for the new message. */ + /* Allocate a bigger buffer. */ + new_log_alloc_size = data->log_alloc_size + LOG_PAGE_SIZE; + new_log = vmalloc(new_log_alloc_size); + + if (new_log) { + memcpy(new_log, debugfs_log->str, debugfs_log->len); + memset(new_log + debugfs_log->len, 0, + new_log_alloc_size - debugfs_log->len); + vfree(debugfs_log->str); + debugfs_log->str = new_log; + data->log_alloc_size = new_log_alloc_size; + } else { + WARN_ON(1); + if (data->log_alloc_size - debugfs_log->len) { + strlcpy(debugfs_log->str + + debugfs_log->len, + err_msg, + data->log_alloc_size - debugfs_log->len); + debugfs_log->len ++ min((typeof(debugfs_log->len)) + sizeof(err_msg), + ((typeof(debugfs_log->len)) + data->log_alloc_size - debugfs_log->len)); + } + return 0; + } + } + + memcpy(debugfs_log->str + debugfs_log->len, data->log_work_buff, len); + debugfs_log->len += (unsigned long) len; + + return len; +} + +/* Returns the number of bytes written to the log. */ +static int hot_debugfs_log(struct debugfs_vol_data *data, const char *fmt, ...) +{ + struct lstring *debugfs_log = data->debugfs_log; + va_list args; + int len; + static char trunc_msg[] + "The next message has been truncated.\n"; + + if (debugfs_log->str == NULL) + return -1; + + spin_lock(&data->log_lock); + + va_start(args, fmt); + len = vsnprintf(data->log_work_buff, + sizeof(data->log_work_buff), fmt, args); + va_end(args); + + if (len >= sizeof(data->log_work_buff)) { + hot_debugfs_copy(data, trunc_msg, sizeof(trunc_msg)); + } + + len = hot_debugfs_copy(data, data->log_work_buff, len); + spin_unlock(&data->log_lock); + + return len; +} + +/* initialize a log corresponding to a fs volume */ +static int hot_debugfs_log_init(struct debugfs_vol_data *data) +{ + int err = 0; + struct lstring *debugfs_log = data->debugfs_log; + + spin_lock(&data->log_lock); + debugfs_log->str = vmalloc(INIT_LOG_ALLOC_SIZE); + if (debugfs_log->str) { + memset(debugfs_log->str, 0, INIT_LOG_ALLOC_SIZE); + data->log_alloc_size = INIT_LOG_ALLOC_SIZE; + } else { + err = -ENOMEM; + } + spin_unlock(&data->log_lock); + + return err; +} + +/* free a log corresponding to a fs volume */ +static void hot_debugfs_log_exit(struct debugfs_vol_data *data) +{ + struct lstring *debugfs_log = data->debugfs_log; + + spin_lock(&data->log_lock); + vfree(debugfs_log->str); + debugfs_log->str = NULL; + debugfs_log->len = 0; + spin_unlock(&data->log_lock); +} + +/* debugfs open file override from fops table */ +static int __hot_debugfs_open(struct inode *inode, struct file *file) +{ + if (inode->i_private) + file->private_data = inode->i_private; + + return 0; +} + +static void __hot_debugfs_print_range_freq_data( + struct hot_inode_item *hot_inode, + struct hot_range_item *hot_range, + struct debugfs_vol_data *data, + struct hot_info *root) +{ + struct hot_freq_data *freq_data; + u64 start; + u64 len; + + freq_data = &hot_range->hot_freq_data; + + spin_lock(&hot_range->lock); + start = hot_range->start; + len = hot_range->len; + spin_unlock(&hot_range->lock); + + /* Always lock hot_inode_item first */ + spin_lock(&hot_inode->lock); + spin_lock(&hot_range->lock); + hot_debugfs_log(data, "inode #%lu, range start " \ + "%llu (range len %llu) reads %u, writes %u, " + "avg read time %llu, avg write time %llu, temp %u\n", + hot_inode->i_ino, + hot_range->start, + hot_range->len, + freq_data->nr_reads, + freq_data->nr_writes, + freq_data->avg_delta_reads, + freq_data->avg_delta_writes, + freq_data->last_temperature); + spin_unlock(&hot_range->lock); + spin_unlock(&hot_inode->lock); +} + +/* + * take the inode, find ranges associated with inode + * and print each range data struct + */ +static void __hot_debugfs_walk_range_tree(struct hot_inode_item *hot_inode, + struct debugfs_vol_data *data, + struct hot_info *root) +{ + struct hot_range_tree *inode_range_tree; + struct rb_node *node; + struct hot_range_item *current_range; + + inode_range_tree = &hot_inode->hot_range_tree; + read_lock(&inode_range_tree->lock); + node = rb_first(&inode_range_tree->map); + + /* Walk the hot_range_tree for inode */ + while (node) { + current_range = rb_entry(node, struct hot_range_item, rb_node); + __hot_debugfs_print_range_freq_data(hot_inode, + current_range, data, root); + node = rb_next(node); + } + read_unlock(&inode_range_tree->lock); +} + +/* Print frequency data for each freq data to log */ +static void __hot_debugfs_print_inode_freq_data( + struct hot_inode_item *hot_inode, + struct debugfs_vol_data *data, + struct hot_info *root) +{ + struct hot_freq_data *freq_data = &hot_inode->hot_freq_data; + + spin_lock(&hot_inode->lock); + hot_debugfs_log(data, "inode #%lu, reads %u, writes %u, " \ + "avg read time %llu, avg write time %llu, temp %u\n", + hot_inode->i_ino, + freq_data->nr_reads, + freq_data->nr_writes, + freq_data->avg_delta_reads, + freq_data->avg_delta_writes, + freq_data->last_temperature); + spin_unlock(&hot_inode->lock); +} + +/* debugfs read file override from fops table */ +static ssize_t __hot_debugfs_range_read(struct file *file, char __user *user, + size_t count, loff_t *ppos) +{ + int err = 0; + struct hot_info *root; + struct hot_inode_item *current_hot_inode; + struct debugfs_vol_data *data; + struct lstring *debugfs_log; + unsigned long inode_num; + + data = (struct debugfs_vol_data *) file->private_data; + root = &(data->sb->s_hotinfo); + + if (!data->debugfs_log) { + /* initialize debugfs log corresponding to this volume*/ + debugfs_log = kmalloc(sizeof(struct lstring), + GFP_KERNEL | GFP_NOFS); + debugfs_log->str = NULL, + debugfs_log->len = 0; + data->debugfs_log = debugfs_log; + hot_debugfs_log_init(data); + } + + if ((unsigned long) *ppos > 0) { + /* caller is continuing a previous read, don''t walk tree */ + if ((unsigned long) *ppos >= data->debugfs_log->len) + goto clean_up; + + goto print_to_user; + } + + /* walk the inode tree */ + current_hot_inode = hot_rb_find_next_hot_inode(root, 0); + + while (current_hot_inode) { + /* walk ranges, print data to debugfs log */ + __hot_debugfs_walk_range_tree(current_hot_inode, data, root); + inode_num = current_hot_inode->i_ino; + hot_rb_free_hot_inode_item(current_hot_inode); + current_hot_inode = hot_rb_find_next_hot_inode(root, + inode_num + 1); + } + +print_to_user: + if (data->debugfs_log->len) { + err = simple_read_from_buffer(user, count, ppos, + data->debugfs_log->str, + data->debugfs_log->len); + } + + return err; + +clean_up: + /* Reader has finished the file, clean up */ + hot_debugfs_log_exit(data); + kfree(data->debugfs_log); + data->debugfs_log = NULL; + + return 0; +} + +/* debugfs read file override from fops table */ +static ssize_t __hot_debugfs_inode_read(struct file *file, char __user *user, + size_t count, loff_t *ppos) +{ + int err = 0; + struct hot_info *root; + struct hot_inode_item *current_hot_inode; + struct debugfs_vol_data *data; + struct lstring *debugfs_log; + unsigned long inode_num; + + data = (struct debugfs_vol_data *) file->private_data; + root = &(data->sb->s_hotinfo); + + if (!data->debugfs_log) { + /* initialize debugfs log corresponding to this volume */ + debugfs_log = kmalloc(sizeof(struct lstring), + GFP_KERNEL | GFP_NOFS); + debugfs_log->str = NULL, + debugfs_log->len = 0; + data->debugfs_log = debugfs_log; + hot_debugfs_log_init(data); + } + + if ((unsigned long) *ppos > 0) { + /* caller is continuing a previous read, don''t walk tree */ + if ((unsigned long) *ppos >= data->debugfs_log->len) + goto clean_up; + + goto print_to_user; + } + + /* walk the inode tree */ + current_hot_inode = hot_rb_find_next_hot_inode(root, 0); + + while (current_hot_inode) { + /* walk ranges, print data to debugfs log */ + __hot_debugfs_print_inode_freq_data(current_hot_inode, + data, root); + inode_num = current_hot_inode->i_ino; + hot_rb_free_hot_inode_item(current_hot_inode); + current_hot_inode = hot_rb_find_next_hot_inode(root, + inode_num + 1); + } + +print_to_user: + if (data->debugfs_log->len) { + err = simple_read_from_buffer(user, count, ppos, + data->debugfs_log->str, + data->debugfs_log->len); + } + + return err; + +clean_up: + /* reader has finished the file, clean up */ + hot_debugfs_log_exit(data); + kfree(data->debugfs_log); + data->debugfs_log = NULL; + + return 0; +} + +/* fops to override for printing range data */ +static const struct file_operations hot_debugfs_range_fops = { + .read = __hot_debugfs_range_read, + .open = __hot_debugfs_open, +}; + +/* fops to override for printing inode data */ +static const struct file_operations hot_debugfs_inode_fops = { + .read = __hot_debugfs_inode_read, + .open = __hot_debugfs_open, +}; + +/* initialize debugfs at module init */ +int hot_debugfs_init(void) +{ + hot_debugfs_root_dentry = debugfs_create_dir(DEBUGFS_ROOT_NAME, NULL); + /*init list of debugfs data list */ + INIT_LIST_HEAD(&hot_debugfs_vol_data_list); + /*init lock to list of debugfs data list */ + spin_lock_init(&hot_debugfs_data_list_lock); + if (!hot_debugfs_root_dentry) + goto debugfs_error; + + return 0; + +debugfs_error: + return -EIO; +} + +/* + * on each volume mount, initialize the debugfs dentries and associated + * structures (debugfs_vol_data and debugfs_log) + */ +static int hot_debugfs_volume_init(const char *uuid, struct super_block *sb) +{ + struct dentry *debugfs_volume_entry = NULL; + struct dentry *debugfs_range_entry = NULL; + struct dentry *debugfs_inode_entry = NULL; + struct debugfs_vol_data *range_data = NULL; + struct debugfs_vol_data *inode_data = NULL; + size_t dev_name_length = strlen(uuid); + char dev[NAME_MAX]; + + if (!hot_debugfs_root_dentry) + goto debugfs_error; + + /* create debugfs folder for this volume by mounted dev name */ + memcpy(dev, uuid + DEV_NAME_CHOP, dev_name_length - DEV_NAME_CHOP + 1); + debugfs_volume_entry = debugfs_create_dir(dev, hot_debugfs_root_dentry); + + if (!debugfs_volume_entry) + goto debugfs_error; + + /* malloc and initialize debugfs_vol_data for range_data */ + range_data = kmalloc(sizeof(struct debugfs_vol_data), + GFP_KERNEL | GFP_NOFS); + memset(range_data, 0, sizeof(struct debugfs_vol_data)); + range_data->debugfs_log = NULL; + range_data->sb = sb; + spin_lock_init(&range_data->log_lock); + range_data->log_alloc_size = 0; + + /* malloc and initialize debugfs_vol_data for inode_data */ + inode_data = kmalloc(sizeof(struct debugfs_vol_data), + GFP_KERNEL | GFP_NOFS); + memset(inode_data, 0, sizeof(struct debugfs_vol_data)); + inode_data->debugfs_log = NULL; + inode_data->sb = sb; + spin_lock_init(&inode_data->log_lock); + inode_data->log_alloc_size = 0; + + /* + * add debugfs_vol_data for inode data and range data for + * volume to list + */ + range_data->de = debugfs_volume_entry; + inode_data->de = debugfs_volume_entry; + spin_lock(&hot_debugfs_data_list_lock); + list_add(&range_data->node, &hot_debugfs_vol_data_list); + list_add(&inode_data->node, &hot_debugfs_vol_data_list); + spin_unlock(&hot_debugfs_data_list_lock); + + /* create debugfs range_data file */ + debugfs_range_entry = debugfs_create_file("range_data", + S_IFREG | S_IRUSR | S_IWUSR | S_IRUGO, + debugfs_volume_entry, + (void *) range_data, + &hot_debugfs_range_fops); + if (!debugfs_range_entry) + goto debugfs_error; + + /* create debugfs inode_data file */ + debugfs_inode_entry = debugfs_create_file("inode_data", + S_IFREG | S_IRUSR | S_IWUSR | S_IRUGO, + debugfs_volume_entry, + (void *) inode_data, + &hot_debugfs_inode_fops); + + if (!debugfs_inode_entry) + goto debugfs_error; + + return 0; + +debugfs_error: + kfree(range_data); + kfree(inode_data); + + return -EIO; +} + +/* + * find volume mounted (match by superblock) and remove + * debugfs dentry + */ +static void hot_debugfs_volume_exit(struct super_block *sb) +{ + struct list_head *head; + struct list_head *pos; + struct debugfs_vol_data *data; + + spin_lock(&hot_debugfs_data_list_lock); + head = &hot_debugfs_vol_data_list; + /* must clean up memory assicatied with superblock */ + list_for_each(pos, head) + { + data = list_entry(pos, struct debugfs_vol_data, node); + if (data->sb == sb) { + list_del(pos); + debugfs_remove_recursive(data->de); + kfree(data); + data = NULL; + } + } + spin_unlock(&hot_debugfs_data_list_lock); +} + /* * Regular mount options parser for -hottrack option. * return false if no -hottrack is specified; @@ -1086,6 +1550,7 @@ void hot_track_init(struct super_block *sb, const char *name) hot_rb_inode_tree_init(&sb->s_hotinfo.hot_inode_tree); hot_hash_table_init(&sb->s_hotinfo); hot_track_fork_temperature_update_kthread(); + hot_debugfs_volume_init(name, sb); } void hot_track_exit(struct super_block *sb) @@ -1094,4 +1559,5 @@ void hot_track_exit(struct super_block *sb) hot_track_stop_temperature_update_kthread(); hot_hash_table_free(&sb->s_hotinfo); hot_rb_inode_tree_free(&sb->s_hotinfo); + hot_debugfs_volume_exit(sb); } diff --git a/fs/hot_tracking.h b/fs/hot_tracking.h index 1b6c694..fa1eb9b 100644 --- a/fs/hot_tracking.h +++ b/fs/hot_tracking.h @@ -97,9 +97,47 @@ ((struct hot_range_item *) container_of(x, \ struct hot_range_item, hot_freq_data)) +/* size of log to vmalloc */ +#define INIT_LOG_ALLOC_SIZE (PAGE_SIZE * 10) +#define LOG_PAGE_SIZE (PAGE_SIZE * 10) + +/* + * number of chars of device name of chop off + * for making debugfs folder e.g. /dev/sda -> sda + */ +#define DEV_NAME_CHOP 5 + +/* + * Name for VFS data in debugfs directory + * e.g. /sys/kernel/debug/hot_track + */ +#define DEBUGFS_ROOT_NAME "hot_track" + struct hot_info; struct inode; +/* log to output to userspace in debugfs files */ +struct lstring { + char *str; + unsigned long len; +}; + +/* + * debugfs_vol_data is a struct of items + * that is passed to the debugfs + */ +struct debugfs_vol_data { + /* protected by hot_debugfs_data_list_lock */ + struct list_head node; + struct lstring *debugfs_log; + struct super_block *sb; + struct dentry *de; + /* protects debugfs_log */ + spinlock_t log_lock; + char log_work_buff[1024]; + uint log_alloc_size; +}; + struct hot_inode_item *hot_rb_lookup_hot_inode_item(struct hot_inode_tree *tree, unsigned long inode_num); @@ -115,6 +153,8 @@ void hot_rb_update_freqs(struct inode *inode, u64 start, u64 len, */ int hot_hash_calc_temperature(struct hot_freq_data *freq_data); +int hot_debugfs_init(void); + bool hot_track_parse_options(char *options); void __init hot_track_cache_init(void); void hot_track_init(struct super_block *sb, const char *name); diff --git a/fs/namespace.c b/fs/namespace.c index 55006c8..6cea6c0 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -2629,6 +2629,12 @@ void __init mnt_init(void) fs_kobj = kobject_create_and_add("fs", NULL); if (!fs_kobj) printk(KERN_WARNING "%s: kobj create error\n", __func__); + + err = hot_debugfs_init(); + if (err) + printk(KERN_WARNING "%s: sysfs_init error: %d\n", + __func__, err); + init_rootfs(); init_mount_tree(); } -- 1.7.6.5
From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> --- Documentation/filesystems/hot_tracking.txt | 106 ++++++++++++++++++++++++++++ 1 files changed, 106 insertions(+), 0 deletions(-) create mode 100644 Documentation/filesystems/hot_tracking.txt diff --git a/Documentation/filesystems/hot_tracking.txt b/Documentation/filesystems/hot_tracking.txt new file mode 100644 index 0000000..340df45 --- /dev/null +++ b/Documentation/filesystems/hot_tracking.txt @@ -0,0 +1,106 @@ +Hot Data Tracking + +Introduction +------------------------------------------------------------------------------- + + The feature adds experimental support for tracking data temperature +information in VFS layer. Essentially, this means maintaining some key +stats(like number of reads/writes, last read/write time, frequency of +reads/writes), then distilling those numbers down to a single +"temperature" value that reflects what data is "hot," and using that +temperature to move data to SSDs. + + The long-term goal of the feature is to allow some FSs, +e.g. Btrfs to intelligently utilize SSDs in a heterogenous volume. +Incidentally, this project has been motivated by +the Project Ideas page on the Btrfs wiki. + + Of course, users are warned not to run this code outside of development +environments. These patches are EXPERIMENTAL, and as such they might eat +your data and/or memory. That said, the code should be relatively safe +when the hottrack mount option are disabled. + +Motivation +------------------------------------------------------------------------------- + + The overall goal of enabling hot data relocation to SSD has been +motivated by the Project Ideas page on the Btrfs wiki at +<https://btrfs.wiki.kernel.org/index.php/Project_ideas>. +It will divide into two steps. VFS provide hot data tracking function +while specific FS will provide hot data relocation function. +So as the first step of this goal, it is hoped that the patchset +for hot data tracking will eventually mature into VFS. + + This is essentially the traditional cache argument: SSD is fast and +expensive; HDD is cheap but slow. ZFS, for example, can already take +advantage of SSD caching. Btrfs should also be able to take advantage of +hybrid storage without many broad, sweeping changes to existing code. + +Main Parts Description +------------------------------------------------------------------------------- + +These include the following parts: + * Hooks in existing vfs functions to track data access frequency + * New rbtrees for tracking access frequency of inodes and sub-file +ranges (hot_rb.c) + The relationship between super_block and rbtree is as below: +super_block->s_hotinfo.hot_inode_tree + In include/linux/fs.h, one struct hot_info s_hotinfo is added to +super_block struct. Each FS instance can find hot tracking info +s_hotinfo via its super_block. In this hot_info, it store a lot of hot +tracking info such as hot_inode_tree, inode and range hash list, etc. + * A hash list for indexing data by its temperature (hot_hash.c) + * A debugfs interface for dumping data from the rbtrees (hot_debugfs.c) + * A background kthread for updating inode heat info + * Mount options for enabling temperature tracking(-o hottrack, +default mean disabled) (hot_track.c) + * An ioctl to retrieve the frequency information collected for a certain +file + * Ioctls to enable/disable frequency tracking per inode. + +Git Development Tree +------------------------------------------------------------------------------- + + The feature is still on development and review, so if you''re interested, +you can pull from the git repository at the following location: + https://github.com/wuzhy/kernel.git hot_tracking + git://github.com/wuzhy/kernel.git hot_tracking + +Usage Example +------------------------------------------------------------------------------- +To use hot tracking, you should mount like this: + +$ mount -o hottrack /dev/sdb /mnt +[ 1505.894078] device label test devid 1 transid 29 /dev/sdb +[ 1505.952977] btrfs: disk space caching is enabled +[ 1506.069678] vfs: turning on hot data tracking + +Mount debugfs at first: + +$ mount -t debugfs none /sys/kernel/debug +$ ls -l /sys/kernel/debug/vfs_hotdata/ +total 0 +drwxr-xr-x 2 root root 0 Aug 8 04:40 sdb +$ ls -l /sys/kernel/debug/vfs_hotdata/sdb +total 0 +-rw-r--r-- 1 root root 0 Aug 8 04:40 inode_data +-rw-r--r-- 1 root root 0 Aug 8 04:40 range_data + +View information about hot tracking from debugfs: + +$ echo "hot tracking test" > /mnt/file +$ cat /sys/kernel/debug/hot_track/sdb/inode_data +inode #279, reads 0, writes 1, avg read time 18446744073709551615, +avg write time 5251566408153596, temp 109 +$ cat /sys/kernel/debug/hot_track/sdb/range_data +inode #279, range start 0 (range len 1048576) reads 0, writes 1, +avg read time 18446744073709551615, avg write time 1128690176623144209, temp 64 + +$ echo "hot data tracking test" >> /mnt/file +$ cat /sys/kernel/debug/hot_track/sdb/inode_data +inode #279, reads 0, writes 2, avg read time 18446744073709551615, +avg write time 4923343766042451, temp 109 +$ cat /sys/kernel/debug/hot_track/sdb/range_data +inode #279, range start 0 (range len 1048576) reads 0, writes 2, +avg read time 18446744073709551615, avg write time 1058147040842596150, temp 64 + -- 1.7.6.5
Dave Chinner
2012-Sep-25 07:37 UTC
Re: [RFC v2 01/10] vfs: introduce private rb structures
On Sun, Sep 23, 2012 at 08:56:26PM +0800, zwu.kernel@gmail.com wrote:> From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> > > One root structure hot_info is defined, is hooked > up in super_block, and will be used to hold rb trees > root, hash list root and some other information, etc. > Adds hot_inode_tree struct to keep track of > frequently accessed files, and be keyed by {inode, offset}. > Trees contain hot_inode_items representing those files > and ranges. > Having these trees means that vfs can quickly determine the > temperature of some data by doing some calculations on the > hot_freq_data struct that hangs off of the tree item. > Define two items hot_inode_item and hot_range_item, > one of them represents one tracked file > to keep track of its access frequency and the tree of > ranges in this file, while the latter represents > a file range of one inode. > Each of the two structures contains a hot_freq_data > struct with its frequency of access metrics (number of > {reads, writes}, last {read,write} time, frequency of > {reads,writes}). > Also, each hot_inode_item contains one hot_range_tree > struct which is keyed by {inode, offset, length} > and used to keep track of all the ranges in this file. > > Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>Just a coupl eof minor formatting things first up - I''ll have more comments as I get deeper into the series. ....> +/* > + * Initialize the inode tree. Should be called for each new inode > + * access or other user of the hot_inode interface. > + */ > +static void hot_rb_inode_tree_init(struct hot_inode_tree *tree)The names of these are a bit clunky. You probably don''t need the "_rb_" in the function name. i.e. hot_inode_tree_init() is sufficient, and if we every want to change in the tree type we don''t have to rename every single function... .....> +/* > + * Initialize a new hot_inode_item structure. The new structure is > + * returned with a reference count of one and needs to be > + * freed using free_inode_item() > + */ > +void hot_rb_inode_item_init(void *_item) > +{The usual naming convention for slab initialiser functions is to use a suffix of "_once" to indicate it is only ever called once per slab object instantiation, not every time the object is allocated fom the slab. See, for example, inode_init_once() and inode_init_always(). so, that would make this function hot_inode_item_init_once(). ....> +/* init hot_inode_item and hot_range_item kmem cache */ > +static int __init hot_rb_item_cache_init(void) > +{ > + hot_inode_item_cache = kmem_cache_create("hot_inode_item", > + sizeof(struct hot_inode_item), 0, > + SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD, > + hot_rb_inode_item_init); > + if (!hot_inode_item_cache) > + goto inode_err; > + > + hot_range_item_cache = kmem_cache_create("hot_range_item", > + sizeof(struct hot_range_item), 0, > + SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD, > + hot_rb_range_item_init); > + if (!hot_range_item_cache) > + goto range_err; > + > + return 0; > + > +range_err: > + kmem_cache_destroy(hot_inode_item_cache); > +inode_err: > + return -ENOMEM; > +} > + > +/* > + * Initialize kmem cache for hot_inode_item > + * and hot_range_item > + */ > +void __init hot_track_cache_init(void) > +{ > + if (hot_rb_item_cache_init()) > + return;No real need to have a hot_rb_item_cache_init() function here - just open code it all in the hot_track_cache_init() function.> +} > diff --git a/fs/hot_tracking.h b/fs/hot_tracking.h > new file mode 100644 > index 0000000..269b67a > --- /dev/null > +++ b/fs/hot_tracking.h > @@ -0,0 +1,27 @@ > +/* > + * fs/hot_tracking.h > + * > + * Copyright (C) 2012 IBM Corp. All rights reserved. > + * Written by Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> > + * Ben Chociej <bchociej@gmail.com> > + * > + * This program is free software; you can redistribute it and/or > + * modify it under the terms of the GNU General Public > + * License v2 as published by the Free Software Foundation. > + */ > + > +#ifndef __HOT_TRACKING__ > +#define __HOT_TRACKING__ > + > +#include <linux/rbtree.h> > +#include <linux/hot_tracking.h> > + > +/* values for hot_freq_data flags */ > +/* freq data struct is for an inode */ > +#define FREQ_DATA_TYPE_INODE (1 << 0) > +/* freq data struct is for a range */ > +#define FREQ_DATA_TYPE_RANGE (1 << 1)The comments are redundant - the name of the object documents it''s use sufficiently. ie. /* values for hot_freq_data flags */ #define FREQ_DATA_TYPE_INODE (1 << 0) #define FREQ_DATA_TYPE_RANGE (1 << 1) is just fine by itself. ....> +/* A frequency data struct holds values that are used to > + * determine temperature of files and file ranges. These structs > + * are members of hot_inode_item and hot_range_item > + *//* * This is a * multiline comment. ;) */> +struct hot_freq_data { > + struct timespec last_read_time; > + struct timespec last_write_time; > + u32 nr_reads; > + u32 nr_writes; > + u64 avg_delta_reads; > + u64 avg_delta_writes; > + u8 flags; > + u32 last_temperature;may as well make the flags a u32 - the compiler will ues that much space anyway as it aligned the u32 last_temperature variable after it.> +}; > + > +/* An item representing an inode and its access frequency */ > +struct hot_inode_item { > + /* node for hot_inode_tree rb_tree */ > + struct rb_node rb_node; > + /* tree of ranges in this inode */ > + struct hot_range_tree hot_range_tree; > + /* frequency data for this inode */ > + struct hot_freq_data hot_freq_data; > + /* inode number, copied from inode */ > + unsigned long i_ino; > + /* used to check for errors in ref counting */ > + u8 in_tree; > + /* protects hot_freq_data, i_no, in_tree */ > + spinlock_t lock; > + /* prevents kfree */ > + struct kref refs;It''s hard to see the code in the commentsi, and some of comments are redundant.. It''s easier to read if you do this: struct hot_inode_item { struct rb_node node; /* hot_inode_tree index */ struct hot_range_tree hot_range_tree; /* tree of ranges */ struct hot_freq_data hot_freq_data; /* frequency data */ unsigned long i_ino; /* inode number from inode */ u8 in_tree; /* ref counting check */ spinlock_t lock; /* protects object data */ struct kref refs; /* prevents kfree */ } Also: - i_ino really needs to be a 64 bit quantity as some filesystems can use 64 bit inode numbers even on 32 bit systems (e.g. XFS). - in_tree can be u32 or a flags field if it is boolean. if it is just debug, then maybe it can be removed whenteh code is ready for commit.> +}; > + > +/* > + * An item representing a range inside of an inode whose frequency > + * is being tracked > + */ > +struct hot_range_item { > + /* node for hot_range_tree rb_tree */ > + struct rb_node rb_node; > + /* frequency data for this range */ > + struct hot_freq_data hot_freq_data; > + /* the hot_inode_item associated with this hot_range_item */ > + struct hot_inode_item *hot_inode; > + /* starting offset of this range */ > + u64 start; > + /* length of this range */ > + u64 len;What units? u64 start; /* start offset in bytes */ u64 len /* length in bytes */> + /* used to check for errors in ref counting */ > + u8 in_tree; > + /* protects hot_freq_data, start, len, and in_tree */ > + spinlock_t lock; > + /* prevents kfree */ > + struct kref refs; > +}; > + > +struct hot_info { > + /* red-black tree that keeps track of fs-wide hot data */ > + struct hot_inode_tree hot_inode_tree; > +};The comment is redundant... Cheers, Dave. -- Dave Chinner david@fromorbit.com
On Tue, Sep 25, 2012 at 3:37 PM, Dave Chinner <david@fromorbit.com> wrote:> On Sun, Sep 23, 2012 at 08:56:26PM +0800, zwu.kernel@gmail.com wrote: >> From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> >> >> One root structure hot_info is defined, is hooked >> up in super_block, and will be used to hold rb trees >> root, hash list root and some other information, etc. >> Adds hot_inode_tree struct to keep track of >> frequently accessed files, and be keyed by {inode, offset}. >> Trees contain hot_inode_items representing those files >> and ranges. >> Having these trees means that vfs can quickly determine the >> temperature of some data by doing some calculations on the >> hot_freq_data struct that hangs off of the tree item. >> Define two items hot_inode_item and hot_range_item, >> one of them represents one tracked file >> to keep track of its access frequency and the tree of >> ranges in this file, while the latter represents >> a file range of one inode. >> Each of the two structures contains a hot_freq_data >> struct with its frequency of access metrics (number of >> {reads, writes}, last {read,write} time, frequency of >> {reads,writes}). >> Also, each hot_inode_item contains one hot_range_tree >> struct which is keyed by {inode, offset, length} >> and used to keep track of all the ranges in this file. >> >> Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> > > Just a coupl eof minor formatting things first up - I''ll have more > comments as I get deeper into the series.All comments are very reasonable, and will be applied. thanks for your review.> > .... >> +/* >> + * Initialize the inode tree. Should be called for each new inode >> + * access or other user of the hot_inode interface. >> + */ >> +static void hot_rb_inode_tree_init(struct hot_inode_tree *tree) > > The names of these are a bit clunky. You probably don''t need the > "_rb_" in the function name. i.e. hot_inode_tree_init() is > sufficient, and if we every want to change in the tree type we don''t > have to rename every single function... > > ..... >> +/* >> + * Initialize a new hot_inode_item structure. The new structure is >> + * returned with a reference count of one and needs to be >> + * freed using free_inode_item() >> + */ >> +void hot_rb_inode_item_init(void *_item) >> +{ > > The usual naming convention for slab initialiser functions is to use > a suffix of "_once" to indicate it is only ever called once per > slab object instantiation, not every time the object is allocated > fom the slab. See, for example, inode_init_once() and > inode_init_always(). > > so, that would make this function hot_inode_item_init_once(). > > .... >> +/* init hot_inode_item and hot_range_item kmem cache */ >> +static int __init hot_rb_item_cache_init(void) >> +{ >> + hot_inode_item_cache = kmem_cache_create("hot_inode_item", >> + sizeof(struct hot_inode_item), 0, >> + SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD, >> + hot_rb_inode_item_init); >> + if (!hot_inode_item_cache) >> + goto inode_err; >> + >> + hot_range_item_cache = kmem_cache_create("hot_range_item", >> + sizeof(struct hot_range_item), 0, >> + SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD, >> + hot_rb_range_item_init); >> + if (!hot_range_item_cache) >> + goto range_err; >> + >> + return 0; >> + >> +range_err: >> + kmem_cache_destroy(hot_inode_item_cache); >> +inode_err: >> + return -ENOMEM; >> +} >> + >> +/* >> + * Initialize kmem cache for hot_inode_item >> + * and hot_range_item >> + */ >> +void __init hot_track_cache_init(void) >> +{ >> + if (hot_rb_item_cache_init()) >> + return; > > No real need to have a hot_rb_item_cache_init() function here - just > open code it all in the hot_track_cache_init() function. > >> +} >> diff --git a/fs/hot_tracking.h b/fs/hot_tracking.h >> new file mode 100644 >> index 0000000..269b67a >> --- /dev/null >> +++ b/fs/hot_tracking.h >> @@ -0,0 +1,27 @@ >> +/* >> + * fs/hot_tracking.h >> + * >> + * Copyright (C) 2012 IBM Corp. All rights reserved. >> + * Written by Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> >> + * Ben Chociej <bchociej@gmail.com> >> + * >> + * This program is free software; you can redistribute it and/or >> + * modify it under the terms of the GNU General Public >> + * License v2 as published by the Free Software Foundation. >> + */ >> + >> +#ifndef __HOT_TRACKING__ >> +#define __HOT_TRACKING__ >> + >> +#include <linux/rbtree.h> >> +#include <linux/hot_tracking.h> >> + >> +/* values for hot_freq_data flags */ >> +/* freq data struct is for an inode */ >> +#define FREQ_DATA_TYPE_INODE (1 << 0) >> +/* freq data struct is for a range */ >> +#define FREQ_DATA_TYPE_RANGE (1 << 1) > > The comments are redundant - the name of the object documents it''s > use sufficiently. ie. > > /* values for hot_freq_data flags */ > #define FREQ_DATA_TYPE_INODE (1 << 0) > #define FREQ_DATA_TYPE_RANGE (1 << 1) > > is just fine by itself. > > .... >> +/* A frequency data struct holds values that are used to >> + * determine temperature of files and file ranges. These structs >> + * are members of hot_inode_item and hot_range_item >> + */ > > /* > * This is a > * multiline comment. ;) > */ > >> +struct hot_freq_data { >> + struct timespec last_read_time; >> + struct timespec last_write_time; >> + u32 nr_reads; >> + u32 nr_writes; >> + u64 avg_delta_reads; >> + u64 avg_delta_writes; >> + u8 flags; >> + u32 last_temperature; > > may as well make the flags a u32 - the compiler will ues that much > space anyway as it aligned the u32 last_temperature variable after > it. > >> +}; >> + >> +/* An item representing an inode and its access frequency */ >> +struct hot_inode_item { >> + /* node for hot_inode_tree rb_tree */ >> + struct rb_node rb_node; >> + /* tree of ranges in this inode */ >> + struct hot_range_tree hot_range_tree; >> + /* frequency data for this inode */ >> + struct hot_freq_data hot_freq_data; >> + /* inode number, copied from inode */ >> + unsigned long i_ino; >> + /* used to check for errors in ref counting */ >> + u8 in_tree; >> + /* protects hot_freq_data, i_no, in_tree */ >> + spinlock_t lock; >> + /* prevents kfree */ >> + struct kref refs; > > It''s hard to see the code in the commentsi, and some of comments are > redundant.. It''s easier to read if you do this: > > struct hot_inode_item { > struct rb_node node; /* hot_inode_tree index */ > struct hot_range_tree hot_range_tree; /* tree of ranges */ > struct hot_freq_data hot_freq_data; /* frequency data */ > unsigned long i_ino; /* inode number from inode */ > u8 in_tree; /* ref counting check */ > spinlock_t lock; /* protects object data */ > struct kref refs; /* prevents kfree */ > } > > Also: > - i_ino really needs to be a 64 bit quantity as some > filesystems can use 64 bit inode numbers even on 32 > bit systems (e.g. XFS). > - in_tree can be u32 or a flags field if it is boolean. if > it is just debug, then maybe it can be removed whenteh > code is ready for commit. > >> +}; >> + >> +/* >> + * An item representing a range inside of an inode whose frequency >> + * is being tracked >> + */ >> +struct hot_range_item { >> + /* node for hot_range_tree rb_tree */ >> + struct rb_node rb_node; >> + /* frequency data for this range */ >> + struct hot_freq_data hot_freq_data; >> + /* the hot_inode_item associated with this hot_range_item */ >> + struct hot_inode_item *hot_inode; >> + /* starting offset of this range */ >> + u64 start; >> + /* length of this range */ >> + u64 len; > > What units? > u64 start; /* start offset in bytes */ > u64 len /* length in bytes */ > >> + /* used to check for errors in ref counting */ >> + u8 in_tree; >> + /* protects hot_freq_data, start, len, and in_tree */ >> + spinlock_t lock; >> + /* prevents kfree */ >> + struct kref refs; >> +}; >> + >> +struct hot_info { >> + /* red-black tree that keeps track of fs-wide hot data */ >> + struct hot_inode_tree hot_inode_tree; >> +}; > > The comment is redundant... > > Cheers, > > Dave. > > -- > Dave Chinner > david@fromorbit.com-- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Sep 25, 2012 at 3:37 PM, Dave Chinner <david@fromorbit.com> wrote:> On Sun, Sep 23, 2012 at 08:56:26PM +0800, zwu.kernel@gmail.com wrote: >> From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> >> >> One root structure hot_info is defined, is hooked >> up in super_block, and will be used to hold rb trees >> root, hash list root and some other information, etc. >> Adds hot_inode_tree struct to keep track of >> frequently accessed files, and be keyed by {inode, offset}. >> Trees contain hot_inode_items representing those files >> and ranges. >> Having these trees means that vfs can quickly determine the >> temperature of some data by doing some calculations on the >> hot_freq_data struct that hangs off of the tree item. >> Define two items hot_inode_item and hot_range_item, >> one of them represents one tracked file >> to keep track of its access frequency and the tree of >> ranges in this file, while the latter represents >> a file range of one inode. >> Each of the two structures contains a hot_freq_data >> struct with its frequency of access metrics (number of >> {reads, writes}, last {read,write} time, frequency of >> {reads,writes}). >> Also, each hot_inode_item contains one hot_range_tree >> struct which is keyed by {inode, offset, length} >> and used to keep track of all the ranges in this file. >> >> Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> > > Just a coupl eof minor formatting things first up - I''ll have more > comments as I get deeper into the series.OK, very look forward to seeing more on other patches, indeed thanks again.> > .... >> +/* >> + * Initialize the inode tree. Should be called for each new inode >> + * access or other user of the hot_inode interface. >> + */ >> +static void hot_rb_inode_tree_init(struct hot_inode_tree *tree) > > The names of these are a bit clunky. You probably don''t need the > "_rb_" in the function name. i.e. hot_inode_tree_init() is > sufficient, and if we every want to change in the tree type we don''t > have to rename every single function... > > ..... >> +/* >> + * Initialize a new hot_inode_item structure. The new structure is >> + * returned with a reference count of one and needs to be >> + * freed using free_inode_item() >> + */ >> +void hot_rb_inode_item_init(void *_item) >> +{ > > The usual naming convention for slab initialiser functions is to use > a suffix of "_once" to indicate it is only ever called once per > slab object instantiation, not every time the object is allocated > fom the slab. See, for example, inode_init_once() and > inode_init_always(). > > so, that would make this function hot_inode_item_init_once(). > > .... >> +/* init hot_inode_item and hot_range_item kmem cache */ >> +static int __init hot_rb_item_cache_init(void) >> +{ >> + hot_inode_item_cache = kmem_cache_create("hot_inode_item", >> + sizeof(struct hot_inode_item), 0, >> + SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD, >> + hot_rb_inode_item_init); >> + if (!hot_inode_item_cache) >> + goto inode_err; >> + >> + hot_range_item_cache = kmem_cache_create("hot_range_item", >> + sizeof(struct hot_range_item), 0, >> + SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD, >> + hot_rb_range_item_init); >> + if (!hot_range_item_cache) >> + goto range_err; >> + >> + return 0; >> + >> +range_err: >> + kmem_cache_destroy(hot_inode_item_cache); >> +inode_err: >> + return -ENOMEM; >> +} >> + >> +/* >> + * Initialize kmem cache for hot_inode_item >> + * and hot_range_item >> + */ >> +void __init hot_track_cache_init(void) >> +{ >> + if (hot_rb_item_cache_init()) >> + return; > > No real need to have a hot_rb_item_cache_init() function here - just > open code it all in the hot_track_cache_init() function. > >> +} >> diff --git a/fs/hot_tracking.h b/fs/hot_tracking.h >> new file mode 100644 >> index 0000000..269b67a >> --- /dev/null >> +++ b/fs/hot_tracking.h >> @@ -0,0 +1,27 @@ >> +/* >> + * fs/hot_tracking.h >> + * >> + * Copyright (C) 2012 IBM Corp. All rights reserved. >> + * Written by Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> >> + * Ben Chociej <bchociej@gmail.com> >> + * >> + * This program is free software; you can redistribute it and/or >> + * modify it under the terms of the GNU General Public >> + * License v2 as published by the Free Software Foundation. >> + */ >> + >> +#ifndef __HOT_TRACKING__ >> +#define __HOT_TRACKING__ >> + >> +#include <linux/rbtree.h> >> +#include <linux/hot_tracking.h> >> + >> +/* values for hot_freq_data flags */ >> +/* freq data struct is for an inode */ >> +#define FREQ_DATA_TYPE_INODE (1 << 0) >> +/* freq data struct is for a range */ >> +#define FREQ_DATA_TYPE_RANGE (1 << 1) > > The comments are redundant - the name of the object documents it''s > use sufficiently. ie. > > /* values for hot_freq_data flags */ > #define FREQ_DATA_TYPE_INODE (1 << 0) > #define FREQ_DATA_TYPE_RANGE (1 << 1) > > is just fine by itself. > > .... >> +/* A frequency data struct holds values that are used to >> + * determine temperature of files and file ranges. These structs >> + * are members of hot_inode_item and hot_range_item >> + */ > > /* > * This is a > * multiline comment. ;) > */ > >> +struct hot_freq_data { >> + struct timespec last_read_time; >> + struct timespec last_write_time; >> + u32 nr_reads; >> + u32 nr_writes; >> + u64 avg_delta_reads; >> + u64 avg_delta_writes; >> + u8 flags; >> + u32 last_temperature; > > may as well make the flags a u32 - the compiler will ues that much > space anyway as it aligned the u32 last_temperature variable after > it. > >> +}; >> + >> +/* An item representing an inode and its access frequency */ >> +struct hot_inode_item { >> + /* node for hot_inode_tree rb_tree */ >> + struct rb_node rb_node; >> + /* tree of ranges in this inode */ >> + struct hot_range_tree hot_range_tree; >> + /* frequency data for this inode */ >> + struct hot_freq_data hot_freq_data; >> + /* inode number, copied from inode */ >> + unsigned long i_ino; >> + /* used to check for errors in ref counting */ >> + u8 in_tree; >> + /* protects hot_freq_data, i_no, in_tree */ >> + spinlock_t lock; >> + /* prevents kfree */ >> + struct kref refs; > > It''s hard to see the code in the commentsi, and some of comments are > redundant.. It''s easier to read if you do this: > > struct hot_inode_item { > struct rb_node node; /* hot_inode_tree index */ > struct hot_range_tree hot_range_tree; /* tree of ranges */ > struct hot_freq_data hot_freq_data; /* frequency data */ > unsigned long i_ino; /* inode number from inode */ > u8 in_tree; /* ref counting check */ > spinlock_t lock; /* protects object data */ > struct kref refs; /* prevents kfree */ > } > > Also: > - i_ino really needs to be a 64 bit quantity as some > filesystems can use 64 bit inode numbers even on 32 > bit systems (e.g. XFS). > - in_tree can be u32 or a flags field if it is boolean. if > it is just debug, then maybe it can be removed whenteh > code is ready for commit. > >> +}; >> + >> +/* >> + * An item representing a range inside of an inode whose frequency >> + * is being tracked >> + */ >> +struct hot_range_item { >> + /* node for hot_range_tree rb_tree */ >> + struct rb_node rb_node; >> + /* frequency data for this range */ >> + struct hot_freq_data hot_freq_data; >> + /* the hot_inode_item associated with this hot_range_item */ >> + struct hot_inode_item *hot_inode; >> + /* starting offset of this range */ >> + u64 start; >> + /* length of this range */ >> + u64 len; > > What units? > u64 start; /* start offset in bytes */ > u64 len /* length in bytes */ > >> + /* used to check for errors in ref counting */ >> + u8 in_tree; >> + /* protects hot_freq_data, start, len, and in_tree */ >> + spinlock_t lock; >> + /* prevents kfree */ >> + struct kref refs; >> +}; >> + >> +struct hot_info { >> + /* red-black tree that keeps track of fs-wide hot data */ >> + struct hot_inode_tree hot_inode_tree; >> +}; > > The comment is redundant... > > Cheers, > > Dave. > > -- > Dave Chinner > david@fromorbit.com-- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Dave Chinner
2012-Sep-25 09:17 UTC
Re: [RFC v2 02/10] vfs: add support for updating access frequency
On Sun, Sep 23, 2012 at 08:56:27PM +0800, zwu.kernel@gmail.com wrote:> From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> > > Add some utils helpers to update access frequencies > for one file or its range. > > Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> > --- > fs/hot_tracking.c | 359 +++++++++++++++++++++++++++++++++++++++++++++++++++++ > fs/hot_tracking.h | 15 +++ > 2 files changed, 374 insertions(+), 0 deletions(-) > > diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c > index 173054b..52ed926 100644 > --- a/fs/hot_tracking.c > +++ b/fs/hot_tracking.c > @@ -106,6 +106,365 @@ inode_err: > } > > /* > + * Drops the reference out on hot_inode_item by one and free the structure > + * if the reference count hits zero > + */ > +void hot_rb_free_hot_inode_item(struct hot_inode_item *he)hot_inode_item_put()> +{ > + if (!he) > + return;It''s a bug to call a put function on a kref counted item with a null pointer. Let the kernel crash so it is noticed and fixed.> + > + if (atomic_dec_and_test(&he->refs.refcount)) { > + WARN_ON(he->in_tree); > + kmem_cache_free(hot_inode_item_cache, he); > + }Isn''t this abusing krefs? i.e. this should be: hot_inode_item_free() { WARN_ON(he->in_tree); kmem_cache_free(hot_inode_item_cache, he); } hot_inode_item_put() { kref_put(&he->refs, hot_inode_item_free) }> +/* > + * Drops the reference out on hot_range_item by one and free the structure > + * if the reference count hits zero > + */ > +static void hot_rb_free_hot_range_item(struct hot_range_item *hr)same comments as above. ....> +static struct rb_node *hot_rb_insert_hot_inode_item(struct rb_root *root, > + unsigned long inode_num, > + struct rb_node *node)static struct rb_node * hot_inode_item_find(..... )> +{ > + struct rb_node **p = &root->rb_node; > + struct rb_node *parent = NULL; > + struct hot_inode_item *entry; > + > + /* walk tree to find insertion point */ > + while (*p) { > + parent = *p; > + entry = rb_entry(parent, struct hot_inode_item, rb_node); > + > + if (inode_num < entry->i_ino) > + p = &(*p)->rb_left; > + else if (inode_num > entry->i_ino) > + p = &(*p)->rb_right; > + else > + return parent; > + } > + > + entry = rb_entry(node, struct hot_inode_item, rb_node); > + entry->in_tree = 1; > + rb_link_node(node, parent, p); > + rb_insert_color(node, root);So the hot inode item search key is the inode number? Why use an rb-tree then? Wouldn''t a btree be a more memory efficient way to hold a sparse index that has fixed key and pointer sizes? Also, the API seems quite strange. you pass in the the rb_node and the inode number which instead of passing in the hot inode item that already holds this information. You then convert the rb_node back to a hot inode item to set the in_tree variable. So why not just pass in the struct hot_inode_item in the first place?> +static u64 hot_rb_range_end(struct hot_range_item *hr)hot_range_item_end()> +{ > + if (hr->start + hr->len < hr->start) > + return (u64)-1; > + > + return hr->start + hr->len - 1; > +} > + > +static struct rb_node *hot_rb_insert_hot_range_item(struct rb_root *root,hot_range_item_find()> + u64 start, > + struct rb_node *node) > +{ > + struct rb_node **p = &root->rb_node; > + struct rb_node *parent = NULL; > + struct hot_range_item *entry; > + > + /* ensure start is on a range boundary */ > + start = start & RANGE_SIZE_MASK; > + /* walk tree to find insertion point */ > + while (*p) { > + parent = *p; > + entry = rb_entry(parent, struct hot_range_item, rb_node); > + > + if (start < entry->start) > + p = &(*p)->rb_left; > + else if (start >= hot_rb_range_end(entry))Shouldn''t an aligned end always be one byte short of the start offset of the next aligned region? i.e. start =hot_rb_range_end(entry) is an indication of an off-by one bug somewhere?> + p = &(*p)->rb_right; > + else > + return parent; > + } > + > + entry = rb_entry(node, struct hot_range_item, rb_node); > + entry->in_tree = 1; > + rb_link_node(node, parent, p); > + rb_insert_color(node, root);Same comments as the hot_inode_range. Also, the start offset in the hot_range_item is already aligned, so why do you need to align it again?> + > + return NULL; > +} > + > +/* > + * Add a hot_inode_item to a hot_inode_tree. If the tree already contains > + * an item with the index given, return -EEXIST > + */ > +static int hot_rb_add_hot_inode_item(struct hot_inode_tree *tree, > + struct hot_inode_item *he) > +{ > + int ret = 0; > + struct rb_node *rb; > + > + rb = hot_rb_insert_hot_inode_item( > + &tree->map, he->i_ino, &he->rb_node); > + if (rb) { > + ret = -EEXIST; > + goto out; > + } > + > + kref_get(&he->refs); > + > +out: > + return ret;And this really just seems like a useless wrapper. just move the -EEXIST and kref_get(&he->refs); into hot_inode_item_find() and call that directly....> +} > + > +/* > + * Add a hot_range_item to a hot_range_tree. If the tree already contains > + * an item with the index given, return -EEXIST > + * > + * Also optionally aggresively merge ranges (currently disabled) > + */ > +static int hot_rb_add_hot_range_item(struct hot_range_tree *tree, > + struct hot_range_item *hr) > +{ > + int ret = 0; > + struct rb_node *rb; > + > + rb = hot_rb_insert_hot_range_item( > + &tree->map, hr->start, &hr->rb_node); > + if (rb) { > + ret = -EEXIST; > + goto out; > + } > + > + kref_get(&hr->refs); > + > +out: > + return ret; > +}Same again.> + > +/* > + * Lookup a hot_inode_item in the hot_inode_tree with the given index > + * (inode_num) > + */ > +struct hot_inode_item > +*hot_rb_lookup_hot_inode_item(struct hot_inode_tree *tree, > + unsigned long inode_num) > +{ > + struct rb_node **p = &(tree->map.rb_node); > + struct rb_node *parent = NULL; > + struct hot_inode_item *entry; > + > + while (*p) { > + parent = *p; > + entry = rb_entry(parent, struct hot_inode_item, rb_node); > + > + if (inode_num < entry->i_ino) > + p = &(*p)->rb_left; > + else if (inode_num > entry->i_ino) > + p = &(*p)->rb_right; > + else { > + kref_get(&entry->refs); > + return entry; > + } > + } > + > + return NULL; > +}That''s basically a duplicate of hot_inode_item_find(), jus without the rb_node parameter. You could combine the two into a single function - the lookup simply passes a NULL rb_node parameter. That would then justify passing the inode number and rb_node separately rather than the hot_inode_item into the insert function....> + > +/* > + * Lookup a hot_range_item in a hot_range_tree with the given index > + * (start, offset) > + */ > +static struct hot_range_item > +*hot_rb_lookup_hot_range_item(struct hot_range_tree *tree, > + u64 start) > +{ > + struct rb_node **p = &(tree->map.rb_node); > + struct rb_node *parent = NULL; > + struct hot_range_item *entry; > + > + /* ensure start is on a range boundary */ > + start = start & RANGE_SIZE_MASK; > + while (*p) { > + parent = *p; > + entry = rb_entry(parent, struct hot_range_item, rb_node); > + > + if (start < entry->start) > + p = &(*p)->rb_left; > + else if (start > hot_rb_range_end(entry)) > + p = &(*p)->rb_right; > + else { > + kref_get(&entry->refs); > + return entry; > + } > + } > + > + return NULL; > +}Same again.> + > +/* > + * This function does the actual work of updating the frequency numbers, > + * whatever they turn out to be. FREQ_POWER determines how many atime > + * deltas we keep track of (as a power of 2). So, setting it to anything above > + * 16ish is probably overkill. Also, the higher the power, the more bits get > + * right shifted out of the timestamp, reducing precision, so take note of that > + * as well. > + * > + * The caller should have already locked freq_data''s parent''s spinlock. > + * > + * FREQ_POWER, defined immediately below, determines how heavily to weight > + * the current frequency numbers against the newest access. For example, a value > + * of 4 means that the new access information will be weighted 1/16th (ie 2^-4) > + * as heavily as the existing frequency info. In essence, this is a kludged- > + * together version of a weighted average, since we can''t afford to keep all of > + * the information that it would take to get a _real_ weighted average. > + */ > +static void hot_rb_update_freq(struct hot_freq_data *freq_data, int rw)hot_inode_freq_update(struct hot_freq_data *freq_data, bool write)> +{ > + struct timespec old_atime; > + struct timespec current_time; > + struct timespec delta_ts; > + u64 new_avg; > + u64 new_delta; > + > + if (unlikely(rw)) {Don''t use unlikely() - the branch predictor in a modern CPU is better than static hints almost all the time. so: if (write) {> + old_atime = freq_data->last_write_time; > + freq_data->nr_writes += 1; > + new_avg = freq_data->avg_delta_writes; > + } else { > + old_atime = freq_data->last_read_time; > + freq_data->nr_reads += 1; > + new_avg = freq_data->avg_delta_reads; > + } > + > + current_time = current_kernel_time(); > + delta_ts = timespec_sub(current_time, old_atime); > + new_delta = timespec_to_ns(&delta_ts) >> FREQ_POWER; > + > + new_avg = (new_avg << FREQ_POWER) - new_avg + new_delta; > + new_avg = new_avg >> FREQ_POWER; > + > + if (unlikely(rw)) { > + freq_data->last_write_time = current_time; > + freq_data->avg_delta_writes = new_avg; > + } else { > + freq_data->last_read_time = current_time; > + freq_data->avg_delta_reads = new_avg; > + } > +}static u64 update_average(struct timespec old_atime, struct timespec current_time, u64 avg) { struct timespec delta_ts; u64 new_delta; delta_ts = timespec_sub(current_time, old_atime); new_delta = timespec_to_ns(&delta_ts) >> FREQ_POWER; avg = (avg << FREQ_POWER) - avg + new_delta; avg = avg >> FREQ_POWER; return avg } static void hot_inode_freq_update(struct hot_freq_data *freq_data, bool write) { struct timespec current_time = current_kernel_time(); if (write) { freq_data->avg_delta_writes = update_average( freq_data->last_write_time, current_time, freq_data->avg_delta_writes); freq_data->last_write_time = current_time; return; } freq_data->avg_delta_reads = update_average( freq_data->last_read_time, current_time, freq_data->avg_delta_reads); freq_data->last_read_time = current_time; }> + > +/* Update inode frequency struct */ > +static struct hot_inode_item *hot_rb_update_inode_freq(struct inode *inode, > + int rw) > +{ > + struct hot_info *root = &(inode->i_sb->s_hotinfo); > + struct hot_inode_tree *hitree = &(root->hot_inode_tree); > + struct hot_inode_item *he; > + > + read_lock(&hitree->lock); > + he = hot_rb_lookup_hot_inode_item(hitree, inode->i_ino); > + read_unlock(&hitree->lock); > + > + if (!he) { > + he = kmem_cache_alloc(hot_inode_item_cache, > + GFP_KERNEL | GFP_NOFS); > + if (!he) > + goto out; > + > + write_lock(&hitree->lock); > + hot_rb_inode_item_init(he); > + he->i_ino = inode->i_ino; > + hot_rb_add_hot_inode_item(hitree, he); > + write_unlock(&hitree->lock); > + }The insert is racy - you''ve dropped the lock after the failed lookup, so you can race with another failed lookup to insert the newly allocated inode item. You can''t avoid this race if you are using an rwlock, so you have to handle it. As it is, I suspect the use of a rwlock here is premature optimisation - if there are any amount of inserts being done then the rwlock will be more expensive than a spin lock. I''ve done the numbers before, which is why the XFS buffer cache rbtrees use a plain spin lock. They sustain >10^6 lookups per second under heavy load with a 99.999% lookup hit rate, and yet the spinlock is not a contention point. hence I suspect for simplicity sake the rwlock shoul dbe made a spin lock and the lookup done in a similar manner to xfs_buf_get_map/_xfs_buf_find() (And yes, you''ll see a lot of similarities between that code and the suggestions I''ve been making, like a single function that does both lookup and insert...)> + > + spin_lock(&he->lock); > + hot_rb_update_freq(&he->hot_freq_data, rw); > + spin_unlock(&he->lock);Probably should move the he->lock into hot_inode_freq_update().> + > +out: > + return he; > +} > + > +/* Update range frequency struct */ > +static bool hot_rb_update_range_freq(struct hot_inode_item *he, > + u64 off, u64 len, int rw, > + struct hot_info *root) > +{ > + struct hot_range_tree *hrtree = &(he->hot_range_tree); > + struct hot_range_item *hr = NULL; > + u64 start_off = off & RANGE_SIZE_MASK; > + u64 end_off = (off + len - 1) & RANGE_SIZE_MASK; > + u64 cur; > + int ret = true; > + > + if (len == 0) > + return false; > + > + /* > + * Align ranges on RANGE_SIZE boundary to prevent proliferation > + * of range structs > + */ > + for (cur = start_off; cur <= end_off; cur += RANGE_SIZE) { > + read_lock(&hrtree->lock); > + hr = hot_rb_lookup_hot_range_item(hrtree, cur); > + read_unlock(&hrtree->lock); > + > + if (!hr) { > + hr = kmem_cache_alloc(hot_range_item_cache, > + GFP_KERNEL | GFP_NOFS); > + if (!hr) { > + ret = false; > + goto out; > + } > + > + write_lock(&hrtree->lock); > + hot_rb_range_item_init(hr); > + hr->start = cur & RANGE_SIZE_MASK; > + hr->len = RANGE_SIZE; > + hr->hot_inode = he; > + hot_rb_add_hot_range_item(hrtree, hr); > + write_unlock(&hrtree->lock); > + } > + > + spin_lock(&hr->lock); > + hot_rb_update_freq(&hr->hot_freq_data, rw); > + spin_unlock(&hr->lock); > + hot_rb_free_hot_range_item(hr); > + }Same comments again about locking. I note that the code will always insert range items of a length RANGE_SIZE. This means you have a fixed object granularity and hence you have no need for a range based search. That is, you could use a radix tree where each entry in the radix tree points directly to the range object similar to how the page cache uses a radix tree for indexing pages. That brings the possibility of lockless range item lookups....> + > +out: > + return ret; > +} > + > +/* main function to update access frequency from read/writepage(s) hooks */ > +void hot_rb_update_freqs(struct inode *inode, u64 start, > + u64 len, int rw) > +{ > + struct hot_inode_item *he; > + > + he = hot_rb_update_inode_freq(inode, rw); > + > + WARN_ON(!he); > + > + if (he) { > + hot_rb_update_range_freq(he, start, len, > + rw, &(inode->i_sb->s_hotinfo)); > + > + hot_rb_free_hot_inode_item(he); > + }This is very assymetric. it would be much better to collapse all the abstraction down to something much simpler, say: int hot_inode_update_freqs() { he = hot_inode_item_find(tree, inum, null) if (!he) { new_he = allocate() if (!new_he) return -ENOMEM; <init new_he> he = hot_inode_item_find(tree, inum, new_he) if (he != new_he) free new_he } hot_inode_freq_update(&he->hot_freq_data, write) for (cur = start_off; cur <= end_off; cur += RANGE_SIZE) { hr = hot_range_item_find(tree, cur, NULL) if (!hr) { new_hr = allocate() if (!new_hr) return -ENOMEM; <init new_hr> hr = hot_inode_item_find(tree, cur, new_hr) if (hr != new_hr) free new_hr } hot_inode_freq_update(&hr->hot_freq_data, write) hot_range_item_put(hr); { hot_inode_item_put(he); }> +} > + > +/* > * Initialize kmem cache for hot_inode_item > * and hot_range_item > */ > diff --git a/fs/hot_tracking.h b/fs/hot_tracking.h > index 269b67a..2ba29e4 100644 > --- a/fs/hot_tracking.h > +++ b/fs/hot_tracking.h > @@ -21,6 +21,21 @@ > #define FREQ_DATA_TYPE_INODE (1 << 0) > /* freq data struct is for a range */ > #define FREQ_DATA_TYPE_RANGE (1 << 1) > +/* size of sub-file ranges */ > +#define RANGE_SIZE (1 << 20) > +#define RANGE_SIZE_MASK (~((u64)(RANGE_SIZE - 1)))You might want to state what units these ranges are in, and that there is one tracking object per range per inode....> +#define FREQ_POWER 4 > + > +struct hot_info; > +struct inode; > + > +struct hot_inode_itemYou''ve already included include/linux/hot_tracking.h, so you shouldn''t need these forward declarations... Cheers, Dave. -- Dave Chinner david@fromorbit.com -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Dave Chinner
2012-Sep-25 09:28 UTC
Re: [RFC v2 03/10] vfs: add one new mount option ''-o hottrack''
On Sun, Sep 23, 2012 at 08:56:28PM +0800, zwu.kernel@gmail.com wrote:> From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> > > Introduce one new mount option ''-o hottrack'', > and add its parsing support. > Its usage looks like: > mount -o hottrack > mount -o nouser,hottrack > mount -o nouser,hottrack,loop > mount -o hottrack,nouserI think that this option parsing should be done by the filesystem, even though the tracking functionality is in the VFS. That way ony the filesystems that can use the tracking information will turn it on, rather than being able to turn it on for everything regardless of whether it is useful or not. Along those lines, just using a normal superblock flag to indicate it is active (e.g. MS_HOT_INODE_TRACKING in sb->s_flags) means you don''t need to allocate the sb->s_hot_info structure just to be able to check whether we are tracking hot inodes or not. This then means the hot inode tracking for the superblock can be initialised by the filesystem as part of it''s fill_super method, along with the filesystem specific code that will use the hot tracking information the VFS gathers.... Cheers, Dave. -- Dave Chinner david@fromorbit.com
On Sun, Sep 23, 2012 at 08:56:30PM +0800, zwu.kernel@gmail.com wrote:> From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> > > Adds a hash table structure which contains > a lot of hash list and is used to efficiently > look up the data temperature of a file or its > ranges. > In each hash list of hash table, the hash node > will keep track of temperature info. > > Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> > --- > fs/hot_tracking.c | 77 ++++++++++++++++++++++++++++++++++++++++- > include/linux/hot_tracking.h | 35 +++++++++++++++++++ > 2 files changed, 110 insertions(+), 2 deletions(-) > > diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c > index fa89f70..5f96442 100644 > --- a/fs/hot_tracking.c > +++ b/fs/hot_tracking.c > @@ -16,6 +16,7 @@ > #include <linux/module.h> > #include <linux/spinlock.h> > #include <linux/hardirq.h> > +#include <linux/hash.h> > #include <linux/fs.h> > #include <linux/blkdev.h> > #include <linux/types.h> > @@ -24,6 +25,9 @@...snip...> +/* Hash list heads for hot hash table */ > +struct hot_hash_head { > + struct hlist_head hashhead; > + rwlock_t rwlock; > + u32 temperature; > +}; > + > +/* Nodes stored in each hash list of hash table */ > +struct hot_hash_node { > + struct hlist_node hashnode; > + struct list_head node; > + struct hot_freq_data *hot_freq_data; > + struct hot_hash_head *hlist; > + spinlock_t lock; /* protects hlist */ > + > + /* > + * number of references to this node > + * equals 1 (hashlist entry) > + */ > + struct kref refs; > +};Dont see why you need yet another datastructure to hash the inode_item and the range_item into a hash list. You can just add another hlist_node in the inode_item and range_item. This field can be then used to link into the corresponding hash list. You can use the container_of() get to the inode_item or the range_item using the hlist_node field. You can thus eliminate a lot of code.> + > /* An item representing an inode and its access frequency */ > struct hot_inode_item { > /* node for hot_inode_tree rb_tree */ > @@ -68,6 +93,8 @@ struct hot_inode_item { > spinlock_t lock; > /* prevents kfree */ > struct kref refs; > + /* hashlist node for this inode */ > + struct hot_hash_node *heat_node;this can be just struct hlist_node head_node; /* lookup hot_inode hash list */ Use this field to link it into the corresponding hashlist.> }; >this can be just> /* > @@ -91,6 +118,8 @@ struct hot_range_item { > spinlock_t lock; > /* prevents kfree */ > struct kref refs; > + /* hashlist node for this range */ > + struct hot_hash_node *heat_node;this can be just struct hlist_node head_node; /* lookup hot_range hash list */> }; > > struct hot_info { > @@ -98,6 +127,12 @@ struct hot_info { > > /* red-black tree that keeps track of fs-wide hot data */ > struct hot_inode_tree hot_inode_tree; > + > + /* hash map of inode temperature */ > + struct hot_hash_head heat_inode_hl[HEAT_HASH_SIZE]; > + > + /* hash map of range temperature */ > + struct hot_hash_head heat_range_hl[HEAT_HASH_SIZE]; > }; > > #endif /* _LINUX_HOTTRACK_H */-- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Sun, Sep 23, 2012 at 08:56:26PM +0800, zwu.kernel@gmail.com wrote:> From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> > > One root structure hot_info is defined, is hooked > up in super_block, and will be used to hold rb trees > root, hash list root and some other information, etc. > Adds hot_inode_tree struct to keep track of > frequently accessed files, and be keyed by {inode, offset}. > Trees contain hot_inode_items representing those files > and ranges. > Having these trees means that vfs can quickly determine the > temperature of some data by doing some calculations on the > hot_freq_data struct that hangs off of the tree item. > Define two items hot_inode_item and hot_range_item, > one of them represents one tracked file > to keep track of its access frequency and the tree of > ranges in this file, while the latter represents > a file range of one inode. > Each of the two structures contains a hot_freq_data > struct with its frequency of access metrics (number of > {reads, writes}, last {read,write} time, frequency of > {reads,writes}). > Also, each hot_inode_item contains one hot_range_tree > struct which is keyed by {inode, offset, length} > and used to keep track of all the ranges in this file. > > Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> > --- > +..snip..> +/* A tree that sits on the hot_info */ > +struct hot_inode_tree { > + struct rb_root map; > + rwlock_t lock; > +}; > + > +/* A tree of ranges for each inode in the hot_inode_tree */ > +struct hot_range_tree { > + struct rb_root map; > + rwlock_t lock; > +};Can as well have a generic datastructure called hot_tree instead of having two different datastructure which basically are the same.> + > +/* A frequency data struct holds values that are used to > + * determine temperature of files and file ranges. These structs > + * are members of hot_inode_item and hot_range_item > + */ > +struct hot_freq_data { > + struct timespec last_read_time; > + struct timespec last_write_time; > + u32 nr_reads; > + u32 nr_writes; > + u64 avg_delta_reads; > + u64 avg_delta_writes; > + u8 flags; > + u32 last_temperature; > +}; > + > +/* An item representing an inode and its access frequency */ > +struct hot_inode_item { > + /* node for hot_inode_tree rb_tree */ > + struct rb_node rb_node; > + /* tree of ranges in this inode */ > + struct hot_range_tree hot_range_tree; > + /* frequency data for this inode */ > + struct hot_freq_data hot_freq_data; > + /* inode number, copied from inode */ > + unsigned long i_ino; > + /* used to check for errors in ref counting */ > + u8 in_tree; > + /* protects hot_freq_data, i_no, in_tree */ > + spinlock_t lock; > + /* prevents kfree */ > + struct kref refs; > +}; > + > +/* > + * An item representing a range inside of an inode whose frequency > + * is being tracked > + */ > +struct hot_range_item { > + /* node for hot_range_tree rb_tree */ > + struct rb_node rb_node; > + /* frequency data for this range */ > + struct hot_freq_data hot_freq_data; > + /* the hot_inode_item associated with this hot_range_item */ > + struct hot_inode_item *hot_inode; > + /* starting offset of this range */ > + u64 start; > + /* length of this range */ > + u64 len; > + /* used to check for errors in ref counting */ > + u8 in_tree; > + /* protects hot_freq_data, start, len, and in_tree */ > + spinlock_t lock; > + /* prevents kfree */ > + struct kref refs; > +};might as well have just one generic datastructure called hot_item with all the common fields and then have struct hot_inode_item { struct hot_item hot_inode; struct hot_tree hot_range_tree; unsigned long i_ino; } and struct hot_range_item { struct hot_item hot_range; u64 start; u64 len; /* length of this range */ } This should help you eliminate some duplicate code as well. RP -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Zhi Yong Wu
2012-Sep-26 02:53 UTC
Re: [RFC v2 02/10] vfs: add support for updating access frequency
thanks a lot for your review in my heart, Dave. It is very helpful to me. On Tue, Sep 25, 2012 at 5:17 PM, Dave Chinner <david@fromorbit.com> wrote:> On Sun, Sep 23, 2012 at 08:56:27PM +0800, zwu.kernel@gmail.com wrote: >> From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> >> >> Add some utils helpers to update access frequencies >> for one file or its range. >> >> Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> >> --- >> fs/hot_tracking.c | 359 +++++++++++++++++++++++++++++++++++++++++++++++++++++ >> fs/hot_tracking.h | 15 +++ >> 2 files changed, 374 insertions(+), 0 deletions(-) >> >> diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c >> index 173054b..52ed926 100644 >> --- a/fs/hot_tracking.c >> +++ b/fs/hot_tracking.c >> @@ -106,6 +106,365 @@ inode_err: >> } >> >> /* >> + * Drops the reference out on hot_inode_item by one and free the structure >> + * if the reference count hits zero >> + */ >> +void hot_rb_free_hot_inode_item(struct hot_inode_item *he) > > hot_inode_item_put() > >> +{ >> + if (!he) >> + return; > > It''s a bug to call a put function on a kref counted item with a null > pointer. Let the kernel crash so it is noticed and fixed.OK, will remove it.> >> + >> + if (atomic_dec_and_test(&he->refs.refcount)) { >> + WARN_ON(he->in_tree); >> + kmem_cache_free(hot_inode_item_cache, he); >> + } > > Isn''t this abusing krefs? i.e. this should be:Sorry, thanks for your explaination as below:> > hot_inode_item_free() > { > WARN_ON(he->in_tree); > kmem_cache_free(hot_inode_item_cache, he); > } > > hot_inode_item_put() > { > kref_put(&he->refs, hot_inode_item_free) > } > >> +/* >> + * Drops the reference out on hot_range_item by one and free the structure >> + * if the reference count hits zero >> + */ >> +static void hot_rb_free_hot_range_item(struct hot_range_item *hr) > > same comments as above.OK, thanks.> .... >> +static struct rb_node *hot_rb_insert_hot_inode_item(struct rb_root *root, >> + unsigned long inode_num, >> + struct rb_node *node) > > static struct rb_node * > hot_inode_item_find(..... )OK, thanks.> >> +{ >> + struct rb_node **p = &root->rb_node; >> + struct rb_node *parent = NULL; >> + struct hot_inode_item *entry; >> + >> + /* walk tree to find insertion point */ >> + while (*p) { >> + parent = *p; >> + entry = rb_entry(parent, struct hot_inode_item, rb_node); >> + >> + if (inode_num < entry->i_ino) >> + p = &(*p)->rb_left; >> + else if (inode_num > entry->i_ino) >> + p = &(*p)->rb_right; >> + else >> + return parent; >> + } >> + >> + entry = rb_entry(node, struct hot_inode_item, rb_node); >> + entry->in_tree = 1; >> + rb_link_node(node, parent, p); >> + rb_insert_color(node, root); > > So the hot inode item search key is the inode number? Why use anYes> rb-tree then? Wouldn''t a btree be a more memory efficient way to > hold a sparse index that has fixed key and pointer sizes?Yes, i know, but if we don''t use btree, what will be better? Radix tree?> > Also, the API seems quite strange. you pass in the the rb_node and > the inode number which instead of passing in the hot inode item that > already holds this information. You then convert the rb_node back to > a hot inode item to set the in_tree variable. So why not just pass > in the struct hot_inode_item in the first place?Good catch, thanks for your remider.> >> +static u64 hot_rb_range_end(struct hot_range_item *hr) > > hot_range_item_end()OK> >> +{ >> + if (hr->start + hr->len < hr->start) >> + return (u64)-1; >> + >> + return hr->start + hr->len - 1; >> +} >> + >> +static struct rb_node *hot_rb_insert_hot_range_item(struct rb_root *root, > > hot_range_item_find()OK> >> + u64 start, >> + struct rb_node *node) >> +{ >> + struct rb_node **p = &root->rb_node; >> + struct rb_node *parent = NULL; >> + struct hot_range_item *entry; >> + >> + /* ensure start is on a range boundary */ >> + start = start & RANGE_SIZE_MASK; >> + /* walk tree to find insertion point */ >> + while (*p) { >> + parent = *p; >> + entry = rb_entry(parent, struct hot_range_item, rb_node); >> + >> + if (start < entry->start) >> + p = &(*p)->rb_left; >> + else if (start >= hot_rb_range_end(entry)) > > Shouldn''t an aligned end always be one byte short of the start > offset of the next aligned region? i.e. start => hot_rb_range_end(entry) is an indication of an off-by one bug > somewhere?This is really one good catch, thanks.> >> + p = &(*p)->rb_right; >> + else >> + return parent; >> + } >> + >> + entry = rb_entry(node, struct hot_range_item, rb_node); >> + entry->in_tree = 1; >> + rb_link_node(node, parent, p); >> + rb_insert_color(node, root); > > Same comments as the hot_inode_range.OK.> > Also, the start offset in the hot_range_item is already aligned, so > why do you need to align it again?ah, thanks, will remove it.> >> + >> + return NULL; >> +} >> + >> +/* >> + * Add a hot_inode_item to a hot_inode_tree. If the tree already contains >> + * an item with the index given, return -EEXIST >> + */ >> +static int hot_rb_add_hot_inode_item(struct hot_inode_tree *tree, >> + struct hot_inode_item *he) >> +{ >> + int ret = 0; >> + struct rb_node *rb; >> + >> + rb = hot_rb_insert_hot_inode_item( >> + &tree->map, he->i_ino, &he->rb_node); >> + if (rb) { >> + ret = -EEXIST; >> + goto out; >> + } >> + >> + kref_get(&he->refs); >> + >> +out: >> + return ret; > > And this really just seems like a useless wrapper. just move the > -EEXIST and kref_get(&he->refs); into hot_inode_item_find() and > call that directly....OK, thanks.> >> +} >> + >> +/* >> + * Add a hot_range_item to a hot_range_tree. If the tree already contains >> + * an item with the index given, return -EEXIST >> + * >> + * Also optionally aggresively merge ranges (currently disabled) >> + */ >> +static int hot_rb_add_hot_range_item(struct hot_range_tree *tree, >> + struct hot_range_item *hr) >> +{ >> + int ret = 0; >> + struct rb_node *rb; >> + >> + rb = hot_rb_insert_hot_range_item( >> + &tree->map, hr->start, &hr->rb_node); >> + if (rb) { >> + ret = -EEXIST; >> + goto out; >> + } >> + >> + kref_get(&hr->refs); >> + >> +out: >> + return ret; >> +} > > Same again.OK.> >> + >> +/* >> + * Lookup a hot_inode_item in the hot_inode_tree with the given index >> + * (inode_num) >> + */ >> +struct hot_inode_item >> +*hot_rb_lookup_hot_inode_item(struct hot_inode_tree *tree, >> + unsigned long inode_num) >> +{ >> + struct rb_node **p = &(tree->map.rb_node); >> + struct rb_node *parent = NULL; >> + struct hot_inode_item *entry; >> + >> + while (*p) { >> + parent = *p; >> + entry = rb_entry(parent, struct hot_inode_item, rb_node); >> + >> + if (inode_num < entry->i_ino) >> + p = &(*p)->rb_left; >> + else if (inode_num > entry->i_ino) >> + p = &(*p)->rb_right; >> + else { >> + kref_get(&entry->refs); >> + return entry; >> + } >> + } >> + >> + return NULL; >> +} > > That''s basically a duplicate of hot_inode_item_find(), jus > without the rb_node parameter. You could combine the two into a > single function - the lookup simply passes a NULL rb_node parameter. > That would then justify passing the inode number and rb_node > separately rather than the hot_inode_item into the insert > function....Got it. thanks.> >> + >> +/* >> + * Lookup a hot_range_item in a hot_range_tree with the given index >> + * (start, offset) >> + */ >> +static struct hot_range_item >> +*hot_rb_lookup_hot_range_item(struct hot_range_tree *tree, >> + u64 start) >> +{ >> + struct rb_node **p = &(tree->map.rb_node); >> + struct rb_node *parent = NULL; >> + struct hot_range_item *entry; >> + >> + /* ensure start is on a range boundary */ >> + start = start & RANGE_SIZE_MASK; >> + while (*p) { >> + parent = *p; >> + entry = rb_entry(parent, struct hot_range_item, rb_node); >> + >> + if (start < entry->start) >> + p = &(*p)->rb_left; >> + else if (start > hot_rb_range_end(entry)) >> + p = &(*p)->rb_right; >> + else { >> + kref_get(&entry->refs); >> + return entry; >> + } >> + } >> + >> + return NULL; >> +} > > Same again.OK> >> + >> +/* >> + * This function does the actual work of updating the frequency numbers, >> + * whatever they turn out to be. FREQ_POWER determines how many atime >> + * deltas we keep track of (as a power of 2). So, setting it to anything above >> + * 16ish is probably overkill. Also, the higher the power, the more bits get >> + * right shifted out of the timestamp, reducing precision, so take note of that >> + * as well. >> + * >> + * The caller should have already locked freq_data''s parent''s spinlock. >> + * >> + * FREQ_POWER, defined immediately below, determines how heavily to weight >> + * the current frequency numbers against the newest access. For example, a value >> + * of 4 means that the new access information will be weighted 1/16th (ie 2^-4) >> + * as heavily as the existing frequency info. In essence, this is a kludged- >> + * together version of a weighted average, since we can''t afford to keep all of >> + * the information that it would take to get a _real_ weighted average. >> + */ >> +static void hot_rb_update_freq(struct hot_freq_data *freq_data, int rw) > > hot_inode_freq_update(struct hot_freq_data *freq_data, bool write)OK> >> +{ >> + struct timespec old_atime; >> + struct timespec current_time; >> + struct timespec delta_ts; >> + u64 new_avg; >> + u64 new_delta; >> + >> + if (unlikely(rw)) { > > Don''t use unlikely() - the branch predictor in a modern CPU is > better than static hints almost all the time. so:OK> > if (write) { > >> + old_atime = freq_data->last_write_time; >> + freq_data->nr_writes += 1; >> + new_avg = freq_data->avg_delta_writes; >> + } else { >> + old_atime = freq_data->last_read_time; >> + freq_data->nr_reads += 1; >> + new_avg = freq_data->avg_delta_reads; >> + } >> + >> + current_time = current_kernel_time(); >> + delta_ts = timespec_sub(current_time, old_atime); >> + new_delta = timespec_to_ns(&delta_ts) >> FREQ_POWER; >> + >> + new_avg = (new_avg << FREQ_POWER) - new_avg + new_delta; >> + new_avg = new_avg >> FREQ_POWER; >> + >> + if (unlikely(rw)) { >> + freq_data->last_write_time = current_time; >> + freq_data->avg_delta_writes = new_avg; >> + } else { >> + freq_data->last_read_time = current_time; >> + freq_data->avg_delta_reads = new_avg; >> + } >> +} > > static u64 update_average(struct timespec old_atime, > struct timespec current_time, u64 avg) > { > struct timespec delta_ts; > u64 new_delta; > > delta_ts = timespec_sub(current_time, old_atime); > new_delta = timespec_to_ns(&delta_ts) >> FREQ_POWER; > > avg = (avg << FREQ_POWER) - avg + new_delta; > avg = avg >> FREQ_POWER; > > return avg > } > > static void hot_inode_freq_update(struct hot_freq_data *freq_data, bool write) > { > struct timespec current_time = current_kernel_time(); > > if (write) { > freq_data->avg_delta_writes = update_average( > freq_data->last_write_time, > current_time, > freq_data->avg_delta_writes); > freq_data->last_write_time = current_time; > return; > } > > freq_data->avg_delta_reads = update_average( > freq_data->last_read_time, > current_time, > freq_data->avg_delta_reads); > freq_data->last_read_time = current_time; > }OK, thanks a lot.> >> + >> +/* Update inode frequency struct */ >> +static struct hot_inode_item *hot_rb_update_inode_freq(struct inode *inode, >> + int rw) >> +{ >> + struct hot_info *root = &(inode->i_sb->s_hotinfo); >> + struct hot_inode_tree *hitree = &(root->hot_inode_tree); >> + struct hot_inode_item *he; >> + >> + read_lock(&hitree->lock); >> + he = hot_rb_lookup_hot_inode_item(hitree, inode->i_ino); >> + read_unlock(&hitree->lock); >> + >> + if (!he) { >> + he = kmem_cache_alloc(hot_inode_item_cache, >> + GFP_KERNEL | GFP_NOFS); >> + if (!he) >> + goto out; >> + >> + write_lock(&hitree->lock); >> + hot_rb_inode_item_init(he); >> + he->i_ino = inode->i_ino; >> + hot_rb_add_hot_inode_item(hitree, he); >> + write_unlock(&hitree->lock); >> + } > > The insert is racy - you''ve dropped the lock after the failed > lookup, so you can race with another failed lookup to insert the > newly allocated inode item. You can''t avoid this race if you are > using an rwlock, so you have to handle it. > > As it is, I suspect the use of a rwlock here is premature > optimisation - if there are any amount of inserts being done then > the rwlock will be more expensive than a spin lock. I''ve done the > numbers before, which is why the XFS buffer cache rbtrees use a > plain spin lock. They sustain >10^6 lookups per second under heavy > load with a 99.999% lookup hit rate, and yet the spinlock is not a > contention point. hence I suspect for simplicity sake the rwlock > shoul dbe made a spin lock and the lookup done in a similar manner > to xfs_buf_get_map/_xfs_buf_find()Got it, thanks.> > (And yes, you''ll see a lot of similarities between that code and the > suggestions I''ve been making, like a single function that does both > lookup and insert...) > >> + >> + spin_lock(&he->lock); >> + hot_rb_update_freq(&he->hot_freq_data, rw); >> + spin_unlock(&he->lock); > > Probably should move the he->lock into hot_inode_freq_update().OK> >> + >> +out: >> + return he; >> +} >> + >> +/* Update range frequency struct */ >> +static bool hot_rb_update_range_freq(struct hot_inode_item *he, >> + u64 off, u64 len, int rw, >> + struct hot_info *root) >> +{ >> + struct hot_range_tree *hrtree = &(he->hot_range_tree); >> + struct hot_range_item *hr = NULL; >> + u64 start_off = off & RANGE_SIZE_MASK; >> + u64 end_off = (off + len - 1) & RANGE_SIZE_MASK; >> + u64 cur; >> + int ret = true; >> + >> + if (len == 0) >> + return false; >> + >> + /* >> + * Align ranges on RANGE_SIZE boundary to prevent proliferation >> + * of range structs >> + */ >> + for (cur = start_off; cur <= end_off; cur += RANGE_SIZE) { >> + read_lock(&hrtree->lock); >> + hr = hot_rb_lookup_hot_range_item(hrtree, cur); >> + read_unlock(&hrtree->lock); >> + >> + if (!hr) { >> + hr = kmem_cache_alloc(hot_range_item_cache, >> + GFP_KERNEL | GFP_NOFS); >> + if (!hr) { >> + ret = false; >> + goto out; >> + } >> + >> + write_lock(&hrtree->lock); >> + hot_rb_range_item_init(hr); >> + hr->start = cur & RANGE_SIZE_MASK; >> + hr->len = RANGE_SIZE; >> + hr->hot_inode = he; >> + hot_rb_add_hot_range_item(hrtree, hr); >> + write_unlock(&hrtree->lock); >> + } >> + >> + spin_lock(&hr->lock); >> + hot_rb_update_freq(&hr->hot_freq_data, rw); >> + spin_unlock(&hr->lock); >> + hot_rb_free_hot_range_item(hr); >> + } > > Same comments again about locking.OK> > I note that the code will always insert range items of a length > RANGE_SIZE. This means you have a fixed object granularity and hence > you have no need for a range based search. That is, you could use a > radix tree where each entry in the radix tree points directly to the > range object similar to how the page cache uses a radix tree for > indexing pages. That brings the possibility of lockless range item > lookups....Great suggestion, but can we temporarily put it in TODO list? because it will bring one big code change.> > >> + >> +out: >> + return ret; >> +} >> + >> +/* main function to update access frequency from read/writepage(s) hooks */ >> +void hot_rb_update_freqs(struct inode *inode, u64 start, >> + u64 len, int rw) >> +{ >> + struct hot_inode_item *he; >> + >> + he = hot_rb_update_inode_freq(inode, rw); >> + >> + WARN_ON(!he); >> + >> + if (he) { >> + hot_rb_update_range_freq(he, start, len, >> + rw, &(inode->i_sb->s_hotinfo)); >> + >> + hot_rb_free_hot_inode_item(he); >> + } > > This is very assymetric. it would be much better to collapse all > the abstraction down to something much simpler, say:OK, thanks.> > > int hot_inode_update_freqs() > { > he = hot_inode_item_find(tree, inum, null) > if (!he) { > new_he = allocate() > if (!new_he) > return -ENOMEM; > > <init new_he> > he = hot_inode_item_find(tree, inum, new_he) > if (he != new_he) > free new_he > } > hot_inode_freq_update(&he->hot_freq_data, write) > > for (cur = start_off; cur <= end_off; cur += RANGE_SIZE) { > hr = hot_range_item_find(tree, cur, NULL) > if (!hr) { > new_hr = allocate() > if (!new_hr) > return -ENOMEM; > > <init new_hr> > hr = hot_inode_item_find(tree, cur, new_hr) > if (hr != new_hr) > free new_hr > } > hot_inode_freq_update(&hr->hot_freq_data, write) > hot_range_item_put(hr); > { > hot_inode_item_put(he); > } > > > > > > >> +} >> + >> +/* >> * Initialize kmem cache for hot_inode_item >> * and hot_range_item >> */ >> diff --git a/fs/hot_tracking.h b/fs/hot_tracking.h >> index 269b67a..2ba29e4 100644 >> --- a/fs/hot_tracking.h >> +++ b/fs/hot_tracking.h >> @@ -21,6 +21,21 @@ >> #define FREQ_DATA_TYPE_INODE (1 << 0) >> /* freq data struct is for a range */ >> #define FREQ_DATA_TYPE_RANGE (1 << 1) >> +/* size of sub-file ranges */ >> +#define RANGE_SIZE (1 << 20) >> +#define RANGE_SIZE_MASK (~((u64)(RANGE_SIZE - 1))) > > You might want to state what units these ranges are in, and that > there is one tracking object per range per inode....yes.> >> +#define FREQ_POWER 4 >> + >> +struct hot_info; >> +struct inode; >> + >> +struct hot_inode_item > > You''ve already included include/linux/hot_tracking.h, so you > shouldn''t need these forward declarations...OK.> > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com-- Regards, Zhi Yong Wu
Zhi Yong Wu
2012-Sep-26 02:56 UTC
Re: [RFC v2 03/10] vfs: add one new mount option ''-o hottrack''
On Tue, Sep 25, 2012 at 5:28 PM, Dave Chinner <david@fromorbit.com> wrote:> On Sun, Sep 23, 2012 at 08:56:28PM +0800, zwu.kernel@gmail.com wrote: >> From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> >> >> Introduce one new mount option ''-o hottrack'', >> and add its parsing support. >> Its usage looks like: >> mount -o hottrack >> mount -o nouser,hottrack >> mount -o nouser,hottrack,loop >> mount -o hottrack,nouser > > I think that this option parsing should be done by the filesystem, > even though the tracking functionality is in the VFS. That way ony > the filesystems that can use the tracking information will turn it > on, rather than being able to turn it on for everything regardless > of whether it is useful or not. > > Along those lines, just using a normal superblock flag to indicate > it is active (e.g. MS_HOT_INODE_TRACKING in sb->s_flags) means you > don''t need to allocate the sb->s_hot_info structure just to be able > to check whether we are tracking hot inodes or not. > > This then means the hot inode tracking for the superblock can be > initialised by the filesystem as part of it''s fill_super method, > along with the filesystem specific code that will use the hot > tracking information the VFS gathers....I can see what you mean, but don''t know if other guys also agree with this. If so, all FS specific code which use hot tracking feature wll have to add the same chunk of code in it fill_super method. Is it good?> > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com-- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Sep 25, 2012 at 6:20 PM, Ram Pai <linuxram@us.ibm.com> wrote:> On Sun, Sep 23, 2012 at 08:56:26PM +0800, zwu.kernel@gmail.com wrote: >> From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> >> >> One root structure hot_info is defined, is hooked >> up in super_block, and will be used to hold rb trees >> root, hash list root and some other information, etc. >> Adds hot_inode_tree struct to keep track of >> frequently accessed files, and be keyed by {inode, offset}. >> Trees contain hot_inode_items representing those files >> and ranges. >> Having these trees means that vfs can quickly determine the >> temperature of some data by doing some calculations on the >> hot_freq_data struct that hangs off of the tree item. >> Define two items hot_inode_item and hot_range_item, >> one of them represents one tracked file >> to keep track of its access frequency and the tree of >> ranges in this file, while the latter represents >> a file range of one inode. >> Each of the two structures contains a hot_freq_data >> struct with its frequency of access metrics (number of >> {reads, writes}, last {read,write} time, frequency of >> {reads,writes}). >> Also, each hot_inode_item contains one hot_range_tree >> struct which is keyed by {inode, offset, length} >> and used to keep track of all the ranges in this file. >> >> Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> >> --- >> + > ..snip.. > >> +/* A tree that sits on the hot_info */ >> +struct hot_inode_tree { >> + struct rb_root map; >> + rwlock_t lock; >> +}; >> + >> +/* A tree of ranges for each inode in the hot_inode_tree */ >> +struct hot_range_tree { >> + struct rb_root map; >> + rwlock_t lock; >> +}; > > Can as well have a generic datastructure called hot_tree instead > of having two different datastructure which basically are the same.OK.> >> + >> +/* A frequency data struct holds values that are used to >> + * determine temperature of files and file ranges. These structs >> + * are members of hot_inode_item and hot_range_item >> + */ >> +struct hot_freq_data { >> + struct timespec last_read_time; >> + struct timespec last_write_time; >> + u32 nr_reads; >> + u32 nr_writes; >> + u64 avg_delta_reads; >> + u64 avg_delta_writes; >> + u8 flags; >> + u32 last_temperature; >> +}; >> + >> +/* An item representing an inode and its access frequency */ >> +struct hot_inode_item { >> + /* node for hot_inode_tree rb_tree */ >> + struct rb_node rb_node; >> + /* tree of ranges in this inode */ >> + struct hot_range_tree hot_range_tree; >> + /* frequency data for this inode */ >> + struct hot_freq_data hot_freq_data; >> + /* inode number, copied from inode */ >> + unsigned long i_ino; >> + /* used to check for errors in ref counting */ >> + u8 in_tree; >> + /* protects hot_freq_data, i_no, in_tree */ >> + spinlock_t lock; >> + /* prevents kfree */ >> + struct kref refs; >> +}; >> + >> +/* >> + * An item representing a range inside of an inode whose frequency >> + * is being tracked >> + */ >> +struct hot_range_item { >> + /* node for hot_range_tree rb_tree */ >> + struct rb_node rb_node; >> + /* frequency data for this range */ >> + struct hot_freq_data hot_freq_data; >> + /* the hot_inode_item associated with this hot_range_item */ >> + struct hot_inode_item *hot_inode; >> + /* starting offset of this range */ >> + u64 start; >> + /* length of this range */ >> + u64 len; >> + /* used to check for errors in ref counting */ >> + u8 in_tree; >> + /* protects hot_freq_data, start, len, and in_tree */ >> + spinlock_t lock; >> + /* prevents kfree */ >> + struct kref refs; >> +}; > > might as well have just one generic datastructure called hot_item with > all the common fields and then have > > struct hot_inode_item { > struct hot_item hot_inode; > struct hot_tree hot_range_tree; > unsigned long i_ino; > } > > and > > struct hot_range_item { > struct hot_item hot_range; > u64 start; > u64 len; /* length of this range */ > } > > This should help you eliminate some duplicate code as well.OK, i will try to apply them. thanks.> > > RP >-- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Sep 25, 2012 at 5:54 PM, Ram Pai <linuxram@us.ibm.com> wrote:> On Sun, Sep 23, 2012 at 08:56:30PM +0800, zwu.kernel@gmail.com wrote: >> From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> >> >> Adds a hash table structure which contains >> a lot of hash list and is used to efficiently >> look up the data temperature of a file or its >> ranges. >> In each hash list of hash table, the hash node >> will keep track of temperature info. >> >> Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> >> --- >> fs/hot_tracking.c | 77 ++++++++++++++++++++++++++++++++++++++++- >> include/linux/hot_tracking.h | 35 +++++++++++++++++++ >> 2 files changed, 110 insertions(+), 2 deletions(-) >> >> diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c >> index fa89f70..5f96442 100644 >> --- a/fs/hot_tracking.c >> +++ b/fs/hot_tracking.c >> @@ -16,6 +16,7 @@ >> #include <linux/module.h> >> #include <linux/spinlock.h> >> #include <linux/hardirq.h> >> +#include <linux/hash.h> >> #include <linux/fs.h> >> #include <linux/blkdev.h> >> #include <linux/types.h> >> @@ -24,6 +25,9 @@ > > ...snip... > >> +/* Hash list heads for hot hash table */ >> +struct hot_hash_head { >> + struct hlist_head hashhead; >> + rwlock_t rwlock; >> + u32 temperature; >> +}; >> + >> +/* Nodes stored in each hash list of hash table */ >> +struct hot_hash_node { >> + struct hlist_node hashnode; >> + struct list_head node; >> + struct hot_freq_data *hot_freq_data; >> + struct hot_hash_head *hlist; >> + spinlock_t lock; /* protects hlist */ >> + >> + /* >> + * number of references to this node >> + * equals 1 (hashlist entry) >> + */ >> + struct kref refs; >> +}; > > Dont see why you need yet another datastructure to hash the inode_item > and the range_item into a hash list. You can just add anotherIf there''s such one structure, we can rapidly know which hash bucket one inode_item is currently linking to via its hlist field. This is useful if one inode_item corresponding to one file need to move from one bucket to another bucket when this file get cold or hotter. Does it make sense to you?> hlist_node in the inode_item and range_item. This field can be then used > to link into the corresponding hash list. > > You can use the container_of() get to the inode_item or the range_item > using the hlist_node field. > > You can thus eliminate a lot of code.As what i said above, i don''t think it can eliminate a lo of code.> >> + >> /* An item representing an inode and its access frequency */ >> struct hot_inode_item { >> /* node for hot_inode_tree rb_tree */ >> @@ -68,6 +93,8 @@ struct hot_inode_item { >> spinlock_t lock; >> /* prevents kfree */ >> struct kref refs; >> + /* hashlist node for this inode */ >> + struct hot_hash_node *heat_node; > > this can be just > struct hlist_node head_node; /* lookup hot_inode hash list */ > > Use this field to link it into the corresponding hashlist. > >> }; >> > this can be just >> /* >> @@ -91,6 +118,8 @@ struct hot_range_item { >> spinlock_t lock; >> /* prevents kfree */ >> struct kref refs; >> + /* hashlist node for this range */ >> + struct hot_hash_node *heat_node; > > this can be just > struct hlist_node head_node; /* lookup hot_range hash list */ > > >> }; >> >> struct hot_info { >> @@ -98,6 +127,12 @@ struct hot_info { >> >> /* red-black tree that keeps track of fs-wide hot data */ >> struct hot_inode_tree hot_inode_tree; >> + >> + /* hash map of inode temperature */ >> + struct hot_hash_head heat_inode_hl[HEAT_HASH_SIZE]; >> + >> + /* hash map of range temperature */ >> + struct hot_hash_head heat_range_hl[HEAT_HASH_SIZE]; >> }; >> >> #endif /* _LINUX_HOTTRACK_H */ >-- Regards, Zhi Yong Wu
Dave Chinner
2012-Sep-27 02:19 UTC
Re: [RFC v2 02/10] vfs: add support for updating access frequency
On Wed, Sep 26, 2012 at 10:53:07AM +0800, Zhi Yong Wu wrote:> On Tue, Sep 25, 2012 at 5:17 PM, Dave Chinner <david@fromorbit.com> wrote: > > On Sun, Sep 23, 2012 at 08:56:27PM +0800, zwu.kernel@gmail.com wrote: > > I note that the code will always insert range items of a length > > RANGE_SIZE. This means you have a fixed object granularity and hence > > you have no need for a range based search. That is, you could use a > > radix tree where each entry in the radix tree points directly to the > > range object similar to how the page cache uses a radix tree for > > indexing pages. That brings the possibility of lockless range item > > lookups.... > Great suggestion, but can we temporarily put it in TODO list? because > it will bring one big code change.Sure. I just wanted to point out that there are better choices for indexing fixed size elements than rb-trees and why it might make sense to use a different type of tree. Cheers, Dave. -- Dave Chinner david@fromorbit.com
Dave Chinner
2012-Sep-27 02:20 UTC
Re: [RFC v2 03/10] vfs: add one new mount option ''-o hottrack''
On Wed, Sep 26, 2012 at 10:56:08AM +0800, Zhi Yong Wu wrote:> On Tue, Sep 25, 2012 at 5:28 PM, Dave Chinner <david@fromorbit.com> wrote: > > On Sun, Sep 23, 2012 at 08:56:28PM +0800, zwu.kernel@gmail.com wrote: > >> From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> > >> > >> Introduce one new mount option ''-o hottrack'', > >> and add its parsing support. > >> Its usage looks like: > >> mount -o hottrack > >> mount -o nouser,hottrack > >> mount -o nouser,hottrack,loop > >> mount -o hottrack,nouser > > > > I think that this option parsing should be done by the filesystem, > > even though the tracking functionality is in the VFS. That way ony > > the filesystems that can use the tracking information will turn it > > on, rather than being able to turn it on for everything regardless > > of whether it is useful or not. > > > > Along those lines, just using a normal superblock flag to indicate > > it is active (e.g. MS_HOT_INODE_TRACKING in sb->s_flags) means you > > don''t need to allocate the sb->s_hot_info structure just to be able > > to check whether we are tracking hot inodes or not. > > > > This then means the hot inode tracking for the superblock can be > > initialised by the filesystem as part of it''s fill_super method, > > along with the filesystem specific code that will use the hot > > tracking information the VFS gathers.... > I can see what you mean, but don''t know if other guys also agree with this. > If so, all FS specific code which use hot tracking feature wll have to add > the same chunk of code in it fill_super method. Is it good?Most filesystems will only need to add 3-4 lines of code to their existing parser, so it''s not a big deal I think... Cheers, Dave. -- Dave Chinner david@fromorbit.com -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Sun, Sep 23, 2012 at 08:56:29PM +0800, zwu.kernel@gmail.com wrote:> From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> > > Add initialization function to create some > key data structures when hot tracking is enabled; > Clean up them when hot tracking is disabled > > Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> > --- > fs/hot_tracking.c | 60 +++++++++++++++++++++++++++++++++++++++++++++++++++++ > fs/hot_tracking.h | 2 + > fs/namespace.c | 4 +++ > fs/super.c | 6 +++++ > 4 files changed, 72 insertions(+), 0 deletions(-) > > diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c > index f97e8a6..fa89f70 100644 > --- a/fs/hot_tracking.c > +++ b/fs/hot_tracking.c > @@ -135,6 +135,51 @@ static void hot_rb_free_hot_range_item(struct hot_range_item *hr) > } > } > > +static int hot_rb_remove_hot_inode_item(struct hot_inode_tree *tree, > + struct hot_inode_item *he)hot_inode_item_remove()> +{ > + int ret = 0; > + rb_erase(&he->rb_node, &tree->map); > + he->in_tree = 0; > + return ret; > +} > + > +static int hot_rb_remove_hot_range_item(struct hot_range_tree *tree, > + struct hot_range_item *hr)hot_range_item_remove() (repeat for other function names ;)> +{ > + int ret = 0; > + rb_erase(&hr->rb_node, &tree->map); > + hr->in_tree = 0; > + return ret; > +}these can probably be void functions are they don''t use the return value at all.> + > +/* Frees the entire hot_inode_tree. */ > +static void hot_rb_inode_tree_free(struct hot_info *root) > +{ > + struct rb_node *node, *node2; > + struct hot_inode_item *he; > + struct hot_range_item *hr; > + > + /* Free hot inode and range trees on fs root */ > + node = rb_first(&root->hot_inode_tree.map); > + > + while (node) {while ((node = rb_first(&root->hot_inode_tree.map))) {> + he = rb_entry(node, struct hot_inode_item, rb_node); > + > + node2 = rb_first(&he->hot_range_tree.map); > + while (node2) {while ((node2 = rb_first(&he->hot_range_tree.map))) { .....> > + if (sb->s_hotinfo.mount_opt & HOT_MOUNT_HOT_TRACK) > + hot_track_exit(sb); > +Let the filesystems call in .put_super()> down_write(&namespace_sem); > br_write_lock(&vfsmount_lock); > event++; > diff --git a/fs/super.c b/fs/super.c > index 7eb3b0c..0999d5c 100644 > --- a/fs/super.c > +++ b/fs/super.c > @@ -1153,6 +1153,9 @@ mount_fs(struct file_system_type *type, int flags, const char *name, void *data) > WARN_ON(sb->s_bdi == &default_backing_dev_info); > sb->s_flags |= MS_BORN; > > + if (hottrack) > + hot_track_init(sb, name);And call this in .fill_super() after parsing the hottrack argument. Cheers, Dave. -- Dave Chinner david@fromorbit.com -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Zhi Yong Wu
2012-Sep-27 02:30 UTC
Re: [RFC v2 02/10] vfs: add support for updating access frequency
On Thu, Sep 27, 2012 at 10:19 AM, Dave Chinner <david@fromorbit.com> wrote:> On Wed, Sep 26, 2012 at 10:53:07AM +0800, Zhi Yong Wu wrote: >> On Tue, Sep 25, 2012 at 5:17 PM, Dave Chinner <david@fromorbit.com> wrote: >> > On Sun, Sep 23, 2012 at 08:56:27PM +0800, zwu.kernel@gmail.com wrote: >> > I note that the code will always insert range items of a length >> > RANGE_SIZE. This means you have a fixed object granularity and hence >> > you have no need for a range based search. That is, you could use a >> > radix tree where each entry in the radix tree points directly to the >> > range object similar to how the page cache uses a radix tree for >> > indexing pages. That brings the possibility of lockless range item >> > lookups.... >> Great suggestion, but can we temporarily put it in TODO list? because >> it will bring one big code change. > > Sure. I just wanted to point out that there are better choices for > indexing fixed size elements than rb-trees and why it might make > sense to use a different type of tree.Got it, thanks. Moreover, it should also be better to use radix tree to hold hot_inode, not only hot_range.> > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com-- Regards, Zhi Yong Wu
Zhi Yong Wu
2012-Sep-27 02:30 UTC
Re: [RFC v2 03/10] vfs: add one new mount option ''-o hottrack''
On Thu, Sep 27, 2012 at 10:20 AM, Dave Chinner <david@fromorbit.com> wrote:> On Wed, Sep 26, 2012 at 10:56:08AM +0800, Zhi Yong Wu wrote: >> On Tue, Sep 25, 2012 at 5:28 PM, Dave Chinner <david@fromorbit.com> wrote: >> > On Sun, Sep 23, 2012 at 08:56:28PM +0800, zwu.kernel@gmail.com wrote: >> >> From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> >> >> >> >> Introduce one new mount option ''-o hottrack'', >> >> and add its parsing support. >> >> Its usage looks like: >> >> mount -o hottrack >> >> mount -o nouser,hottrack >> >> mount -o nouser,hottrack,loop >> >> mount -o hottrack,nouser >> > >> > I think that this option parsing should be done by the filesystem, >> > even though the tracking functionality is in the VFS. That way ony >> > the filesystems that can use the tracking information will turn it >> > on, rather than being able to turn it on for everything regardless >> > of whether it is useful or not. >> > >> > Along those lines, just using a normal superblock flag to indicate >> > it is active (e.g. MS_HOT_INODE_TRACKING in sb->s_flags) means you >> > don''t need to allocate the sb->s_hot_info structure just to be able >> > to check whether we are tracking hot inodes or not. >> > >> > This then means the hot inode tracking for the superblock can be >> > initialised by the filesystem as part of it''s fill_super method, >> > along with the filesystem specific code that will use the hot >> > tracking information the VFS gathers.... >> I can see what you mean, but don''t know if other guys also agree with this. >> If so, all FS specific code which use hot tracking feature wll have to add >> the same chunk of code in it fill_super method. Is it good? > > Most filesystems will only need to add 3-4 lines of code to their > existing parser, so it''s not a big deal I think...OK, i try to do it. thanks a lot.> > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com-- Regards, Zhi Yong Wu
On Sun, Sep 23, 2012 at 08:56:30PM +0800, zwu.kernel@gmail.com wrote:> From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> > > Adds a hash table structure which contains > a lot of hash list and is used to efficiently > look up the data temperature of a file or its > ranges. > In each hash list of hash table, the hash node > will keep track of temperature info.So, let me see if I''ve got the relationship straight: - sb->s_hot_info.hot_inode_tree indexes hot_inode_items, one per inode - hot_inode_item contains access frequency data for that inode - hot_inode_item holds a heat hash node to index the access frequency data for that inode - hot_inode_item.hot_range_tree indexes hot_range_items for that inode - hot_range_item contains access frequency data for that range - hot_range_item holds a heat hash node to index the access frequency data for that range - sb->s_hot_info.heat_inode_hl indexes per-inode heat hash nodes - sb->s_hot_info.heat_range_hl indexes per-range heat hash nodes How about some ascii art? :) Just looking at the hot inode item case (the range item case is the same pattern, though), we have: heat_inode_hl hot_inode_tree | | | V | +-------hot_inode_item-------+ +---+ | frequency data | | V ^ V | ...<--hot_inode_item-->... | ...<--hot_inode_item-->.... | frequency data | frequency data | ^ | ^ | | | | | | | | +------>hot_hash_node-->hot_hash_node-->hot_hash_node-->.... There''s no actual data stored in the hot_hash_node, just pointer back to the frequency data, a hlist_node and a pointer to the hashlist head. IOWs, I agree with Ram that this does not need to exist and just embedding a hlist_node inside the hot_inode_item is all that is needed. i.e: heat_inode_hl hot_inode_tree | | | V | +-------hot_inode_item-------+ | | frequency data | +---+ | hlist_node | | V ^ | V | ...<--hot_inode_item-->... | | ...<--hot_inode_item-->.... | frequency data | | frequency data +------>hlist_node-----------+ +------->hlist_node--->..... There''s no need for separate allocations, initialisations, locks and reference counting - all that is already in the hot_inode_item. The items have the same lifecycle limitations - a hot_hash_node must be torn down before the frequency data it points to is freed. Finally, there''s no difference in how you move it between lists. Indeed, calling it a hash is wrong - there''s not hashing at all - it keeping an array of list where each entry corresponds to a specific temperature. It is a *heat map*, not a hash list. i.e. inode_heat_map, not heat_inode_hl. HEAT_MAP_SIZE, not HASH_SIZE. As it is, there aren''t any users of the heat maps that are generated in this patch set - it''s not even exported to userspace or to debugfs, so I''m not sure how it will be used yet. How are these heat maps going to be used by filesystems, Zhi? Cheers, Dave. -- Dave Chinner david@fromorbit.com -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Sun, Sep 23, 2012 at 08:56:31PM +0800, zwu.kernel@gmail.com wrote:> From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> > > Miscellaneous features that implement hot data tracking > and generally make the hot data functions a bit more friendly. > > Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> > --- > fs/direct-io.c | 10 ++++++++++ > include/linux/hot_tracking.h | 11 +++++++++++ > mm/filemap.c | 8 ++++++++ > mm/page-writeback.c | 21 +++++++++++++++++++++ > mm/readahead.c | 9 +++++++++ > 5 files changed, 59 insertions(+), 0 deletions(-) > > diff --git a/fs/direct-io.c b/fs/direct-io.c > index f86c720..3773f44 100644 > --- a/fs/direct-io.c > +++ b/fs/direct-io.c > @@ -37,6 +37,7 @@ > #include <linux/uio.h> > #include <linux/atomic.h> > #include <linux/prefetch.h> > +#include "hot_tracking.h" > > /* > * How many user pages to map in one call to get_user_pages(). This determines > @@ -1297,6 +1298,15 @@ __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode, > prefetch(bdev->bd_queue); > prefetch((char *)bdev->bd_queue + SMP_CACHE_BYTES); > > + /* Hot data tracking */ > + if (TRACK_THIS_INODE(iocb->ki_filp->f_mapping->host) > + && iov_length(iov, nr_segs) > 0) { > + hot_rb_update_freqs(iocb->ki_filp->f_mapping->host, > + (u64)offset, > + (u64)iov_length(iov, nr_segs), > + rw & WRITE); > + }That''s a bit messy. I''d prefer a static inline function that hides all this. e.g. track_hot_inode_ranges(inode, offset, length, rw) { if (inode->i_sb->s_flags & MS_HOT_TRACKING) hot_inode_freq_update(inode, offset, length, rw); }> diff --git a/mm/page-writeback.c b/mm/page-writeback.c > index 5ad5ce2..552c861 100644 > --- a/mm/page-writeback.c > +++ b/mm/page-writeback.c > @@ -35,6 +35,7 @@ > #include <linux/buffer_head.h> /* __set_page_dirty_buffers */ > #include <linux/pagevec.h> > #include <linux/timer.h> > +#include <linux/hot_tracking.h> > #include <trace/events/writeback.h> > > /* > @@ -1895,13 +1896,33 @@ EXPORT_SYMBOL(generic_writepages); > int do_writepages(struct address_space *mapping, struct writeback_control *wbc) > { > int ret; > + pgoff_t start = 0; > + u64 prev_count = 0, count = 0; > > if (wbc->nr_to_write <= 0) > return 0; > + > + /* Hot data tracking */ > + if (TRACK_THIS_INODE(mapping->host) > + && wbc->range_cyclic) { > + start = mapping->writeback_index << PAGE_CACHE_SHIFT; > + prev_count = (u64)wbc->nr_to_write; > + }Why only wbc->range_cyclic? This won''t record things like synchronous writes or fsync-triggered writes, are are far more likely to be to hot ranges in a file... Cheers, Dave. -- Dave Chinner david@fromorbit.com
Dave Chinner
2012-Sep-27 04:03 UTC
Re: [RFC v2 07/10] vfs: fork one kthread to update data temperature
On Sun, Sep 23, 2012 at 08:56:32PM +0800, zwu.kernel@gmail.com wrote:> From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> > > Fork and run one kernel kthread to calculate > that temperature based on some metrics kept > in custom frequency data structs, and store > the info in the hash table.No new kthreads, please. Use a per-superblock workqueue and a struct delayed_work to run periodic work on each superblock. That will also remove all the nasty, nasty !hot_track_temperature_update_kthread checks from the code, too. Also, I''d separate the work that the workqueue does from the patch that introduces the work queue. That way there is only one new thing to comment on in the patch. Further, I''d separate the aging code from the code that updates the temperature map into it''s own patch as well.. Finally, you''re going to need a shrinker to control the amount of memory that is used in tracking hot regions - if we are throwing inodes out of memory due to memory pressure, we most definitely are going to need to reduce the amount of memory the tracking code is using, even if it means losing useful information (i.e. the shrinker accelerates the aging process). Given the above, and the other comments earlier in the series, there''s not a lot of point in me spending time commenting on ethe code in detail here as it will change significantly as a result of all the earlier comments.... Cheers, Dave. -- Dave Chinner david@fromorbit.com
Zhi Yong Wu
2012-Sep-27 05:25 UTC
Re: [RFC v2 03/10] vfs: add one new mount option ''-o hottrack''
On Tue, Sep 25, 2012 at 5:28 PM, Dave Chinner <david@fromorbit.com> wrote:> On Sun, Sep 23, 2012 at 08:56:28PM +0800, zwu.kernel@gmail.com wrote: >> From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> >> >> Introduce one new mount option ''-o hottrack'', >> and add its parsing support. >> Its usage looks like: >> mount -o hottrack >> mount -o nouser,hottrack >> mount -o nouser,hottrack,loop >> mount -o hottrack,nouser > > I think that this option parsing should be done by the filesystem, > even though the tracking functionality is in the VFS. That way ony > the filesystems that can use the tracking information will turn it > on, rather than being able to turn it on for everything regardless > of whether it is useful or not. > > Along those lines, just using a normal superblock flag to indicate > it is active (e.g. MS_HOT_INODE_TRACKING in sb->s_flags) means you > don''t need to allocate the sb->s_hot_info structure just to be ableIf we don''t allocate one sb->s_hot_info, where will those hash list head and btree roots locate?> to check whether we are tracking hot inodes or not. > > This then means the hot inode tracking for the superblock can be > initialised by the filesystem as part of it''s fill_super method, > along with the filesystem specific code that will use the hot > tracking information the VFS gathers.... > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com-- Regards, Zhi Yong Wu
On Thu, Sep 27, 2012 at 11:43 AM, Dave Chinner <david@fromorbit.com> wrote:> On Sun, Sep 23, 2012 at 08:56:30PM +0800, zwu.kernel@gmail.com wrote: >> From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> >> >> Adds a hash table structure which contains >> a lot of hash list and is used to efficiently >> look up the data temperature of a file or its >> ranges. >> In each hash list of hash table, the hash node >> will keep track of temperature info. > > So, let me see if I''ve got the relationship straight: > > - sb->s_hot_info.hot_inode_tree indexes hot_inode_items, one per inode > > - hot_inode_item contains access frequency data for that inode > > - hot_inode_item holds a heat hash node to index the access > frequency data for that inode > > - hot_inode_item.hot_range_tree indexes hot_range_items for that inode > > - hot_range_item contains access frequency data for that range > > - hot_range_item holds a heat hash node to index the access > frequency data for that range > > - sb->s_hot_info.heat_inode_hl indexes per-inode heat hash nodes > > - sb->s_hot_info.heat_range_hl indexes per-range heat hash nodesCorrect.> > How about some ascii art? :) Just looking at the hot inode item case > (the range item case is the same pattern, though), we have: > > > heat_inode_hl hot_inode_tree > | | > | V > | +-------hot_inode_item-------+ > +---+ | frequency data | > | V ^ V > | ...<--hot_inode_item-->... | ...<--hot_inode_item-->.... > | frequency data | frequency data > | ^ | ^ > | | | | > | | | | > +------>hot_hash_node-->hot_hash_node-->hot_hash_node-->....Great, can we put them in hot_tracking.txt in Documentation?> > > There''s no actual data stored in the hot_hash_node, just pointer > back to the frequency data, a hlist_node and a pointer to the > hashlist head. IOWs, I agree with Ram that this does not need to > exist and just embedding a hlist_node inside the hot_inode_item is > all that is needed. i.e: > > heat_inode_hl hot_inode_tree > | | > | V > | +-------hot_inode_item-------+ > | | frequency data | > +---+ | hlist_node | > | V ^ | V > | ...<--hot_inode_item-->... | | ...<--hot_inode_item-->.... > | frequency data | | frequency data > +------>hlist_node-----------+ +------->hlist_node--->..... > > There''s no need for separate allocations, initialisations, locks and > reference counting - all that is already in the hot_inode_item. The > items have the same lifecycle limitations - a hot_hash_node must be > torn down before the frequency data it points to is freed. Finally, > there''s no difference in how you move it between lists.How will you know if one hot_inode_item should be moved between lists when its freq data is changed?> > Indeed, calling it a hash is wrong - there''s not hashing at all > - it keeping an array of list where each entry corresponds to a > specific temperature. It is a *heat map*, not a hash list. i.e. > inode_heat_map, not heat_inode_hl. HEAT_MAP_SIZE, not HASH_SIZE.OK.> > As it is, there aren''t any users of the heat maps that are generated > in this patch set - it''s not even exported to userspace or to > debugfs, so I''m not sure how it will be used yet. How are these heat > maps going to be used by filesystems, Zhi?In hot_hash_calc_temperature(), you can see that one hot_inode or hot_range''s freq data will be distilled into one temperature value, then it will be inserted to the heat map based on its temperature. When the file corresponding to the inode or range got hotter or cold, its location will be changed in the heat map based on its new temperature in hot_hash_update_hash_table(). And the user will retrieve those freq data and temperature info via debugfs or ioctl interfaces.> > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com-- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Sep 27, 2012 at 11:54 AM, Dave Chinner <david@fromorbit.com> wrote:> On Sun, Sep 23, 2012 at 08:56:31PM +0800, zwu.kernel@gmail.com wrote: >> From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> >> >> Miscellaneous features that implement hot data tracking >> and generally make the hot data functions a bit more friendly. >> >> Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> >> --- >> fs/direct-io.c | 10 ++++++++++ >> include/linux/hot_tracking.h | 11 +++++++++++ >> mm/filemap.c | 8 ++++++++ >> mm/page-writeback.c | 21 +++++++++++++++++++++ >> mm/readahead.c | 9 +++++++++ >> 5 files changed, 59 insertions(+), 0 deletions(-) >> >> diff --git a/fs/direct-io.c b/fs/direct-io.c >> index f86c720..3773f44 100644 >> --- a/fs/direct-io.c >> +++ b/fs/direct-io.c >> @@ -37,6 +37,7 @@ >> #include <linux/uio.h> >> #include <linux/atomic.h> >> #include <linux/prefetch.h> >> +#include "hot_tracking.h" >> >> /* >> * How many user pages to map in one call to get_user_pages(). This determines >> @@ -1297,6 +1298,15 @@ __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode, >> prefetch(bdev->bd_queue); >> prefetch((char *)bdev->bd_queue + SMP_CACHE_BYTES); >> >> + /* Hot data tracking */ >> + if (TRACK_THIS_INODE(iocb->ki_filp->f_mapping->host) >> + && iov_length(iov, nr_segs) > 0) { >> + hot_rb_update_freqs(iocb->ki_filp->f_mapping->host, >> + (u64)offset, >> + (u64)iov_length(iov, nr_segs), >> + rw & WRITE); >> + } > > That''s a bit messy. I''d prefer a static inline function that hides > all this. e.g.Do you think of moving the condition into hot_inode_udate_freqs(), not adding another new function?> > track_hot_inode_ranges(inode, offset, length, rw) > { > if (inode->i_sb->s_flags & MS_HOT_TRACKING) > hot_inode_freq_update(inode, offset, length, rw); > } > >> diff --git a/mm/page-writeback.c b/mm/page-writeback.c >> index 5ad5ce2..552c861 100644 >> --- a/mm/page-writeback.c >> +++ b/mm/page-writeback.c >> @@ -35,6 +35,7 @@ >> #include <linux/buffer_head.h> /* __set_page_dirty_buffers */ >> #include <linux/pagevec.h> >> #include <linux/timer.h> >> +#include <linux/hot_tracking.h> >> #include <trace/events/writeback.h> >> >> /* >> @@ -1895,13 +1896,33 @@ EXPORT_SYMBOL(generic_writepages); >> int do_writepages(struct address_space *mapping, struct writeback_control *wbc) >> { >> int ret; >> + pgoff_t start = 0; >> + u64 prev_count = 0, count = 0; >> >> if (wbc->nr_to_write <= 0) >> return 0; >> + >> + /* Hot data tracking */ >> + if (TRACK_THIS_INODE(mapping->host) >> + && wbc->range_cyclic) { >> + start = mapping->writeback_index << PAGE_CACHE_SHIFT; >> + prev_count = (u64)wbc->nr_to_write; >> + } > > Why only wbc->range_cyclic? This won''t record things like > synchronous writes or fsync-triggered writes, are are far more > likely to be to hot ranges in a file...sorry, i don''t undersand what wbc->range_cyclic means. OK, i will fix it in next version.> > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com-- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Zhi Yong Wu
2012-Sep-27 06:54 UTC
Re: [RFC v2 07/10] vfs: fork one kthread to update data temperature
On Thu, Sep 27, 2012 at 12:03 PM, Dave Chinner <david@fromorbit.com> wrote:> On Sun, Sep 23, 2012 at 08:56:32PM +0800, zwu.kernel@gmail.com wrote: >> From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> >> >> Fork and run one kernel kthread to calculate >> that temperature based on some metrics kept >> in custom frequency data structs, and store >> the info in the hash table. > > No new kthreads, please. Use a per-superblock workqueue and a struct > delayed_work to run periodic work on each superblock.If no new kthread is created, which kthread will work on these delayed_work tasks?> > That will also remove all the nasty, nasty > !hot_track_temperature_update_kthread checks from the code, too. > > Also, I''d separate the work that the workqueue does from the patch > that introduces the work queue. That way there is only one new thing > to comment on in the patch. Further, I''d separate the aging code > from the code that updates the temperature map into it''s own patch > as well.. > > Finally, you''re going to need a shrinker to control the amount of > memory that is used in tracking hot regions - if we are throwing > inodes out of memory due to memory pressure, we most definitely are > going to need to reduce the amount of memory the tracking code is > using, even if it means losing useful information (i.e. the shrinker > accelerates the aging process).Great, I agree with you.> > Given the above, and the other comments earlier in the series, > there''s not a lot of point in me spending time commenting on ethe > code in detail here as it will change significantly as a result of > all the earlier comments....OK, i will complete the code change based on all your earlier comments ASAP.> > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html-- Regards, Zhi Yong Wu
On Thu, Sep 27, 2012 at 02:23:16PM +0800, Zhi Yong Wu wrote:> On Thu, Sep 27, 2012 at 11:43 AM, Dave Chinner <david@fromorbit.com> wrote: > > On Sun, Sep 23, 2012 at 08:56:30PM +0800, zwu.kernel@gmail.com wrote: > >> From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> > >> > >> Adds a hash table structure which contains > >> a lot of hash list and is used to efficiently > >> look up the data temperature of a file or its > >> ranges. > >> In each hash list of hash table, the hash node > >> will keep track of temperature info. > > > > So, let me see if I''ve got the relationship straight: > > > > - sb->s_hot_info.hot_inode_tree indexes hot_inode_items, one per inode > > > > - hot_inode_item contains access frequency data for that inode > > > > - hot_inode_item holds a heat hash node to index the access > > frequency data for that inode > > > > - hot_inode_item.hot_range_tree indexes hot_range_items for that inode > > > > - hot_range_item contains access frequency data for that range > > > > - hot_range_item holds a heat hash node to index the access > > frequency data for that range > > > > - sb->s_hot_info.heat_inode_hl indexes per-inode heat hash nodes > > > > - sb->s_hot_info.heat_range_hl indexes per-range heat hash nodes > Correct. > > > > How about some ascii art? :) Just looking at the hot inode item case > > (the range item case is the same pattern, though), we have: > > > > > > heat_inode_hl hot_inode_tree > > | | > > | V > > | +-------hot_inode_item-------+ > > +---+ | frequency data | > > | V ^ V > > | ...<--hot_inode_item-->... | ...<--hot_inode_item-->.... > > | frequency data | frequency data > > | ^ | ^ > > | | | | > > | | | | > > +------>hot_hash_node-->hot_hash_node-->hot_hash_node-->.... > Great, can we put them in hot_tracking.txt in Documentation? > > > > > > There''s no actual data stored in the hot_hash_node, just pointer > > back to the frequency data, a hlist_node and a pointer to the > > hashlist head. IOWs, I agree with Ram that this does not need to > > exist and just embedding a hlist_node inside the hot_inode_item is > > all that is needed. i.e: > > > > heat_inode_hl hot_inode_tree > > | | > > | V > > | +-------hot_inode_item-------+ > > | | frequency data | > > +---+ | hlist_node | > > | V ^ | V > > | ...<--hot_inode_item-->... | | ...<--hot_inode_item-->.... > > | frequency data | | frequency data > > +------>hlist_node-----------+ +------->hlist_node--->..... > > > > There''s no need for separate allocations, initialisations, locks and > > reference counting - all that is already in the hot_inode_item. The > > items have the same lifecycle limitations - a hot_hash_node must be > > torn down before the frequency data it points to is freed. Finally, > > there''s no difference in how you move it between lists. > How will you know if one hot_inode_item should be moved between lists > when its freq data is changed?Record the current temperature in the frequency data, and if it changes, change the list it is on.> > Indeed, calling it a hash is wrong - there''s not hashing at all > > - it keeping an array of list where each entry corresponds to a > > specific temperature. It is a *heat map*, not a hash list. i.e. > > inode_heat_map, not heat_inode_hl. HEAT_MAP_SIZE, not HASH_SIZE. > OK. > > > > As it is, there aren''t any users of the heat maps that are generated > > in this patch set - it''s not even exported to userspace or to > > debugfs, so I''m not sure how it will be used yet. How are these heat > > maps going to be used by filesystems, Zhi? > In hot_hash_calc_temperature(), you can see that one hot_inode or > hot_range''s freq data will be distilled into one temperature value, > then it will be inserted to the heat map based on its temperature. > When the file corresponding to the inode or range got hotter or cold, > its location will be changed in the heat map based on its new > temperature in hot_hash_update_hash_table().Yes, but a hot_inode_item or hot_range_item can only have one location in the heat map, right? So it doesn''t need external structure to point to the frequency data to track this....> And the user will retrieve those freq data and temperature info via > debugfs or ioctl interfaces.Right - but that data is only extracted after an initial hot_inode_tree lookup - The heat map itself is never directly used for lookups. If it''s not used for lookups based on temperature, why is it needed? Cheers, Dave. -- Dave Chinner david@fromorbit.com -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Sep 27, 2012 at 02:28:12PM +0800, Zhi Yong Wu wrote:> On Thu, Sep 27, 2012 at 11:54 AM, Dave Chinner <david@fromorbit.com> wrote: > > On Sun, Sep 23, 2012 at 08:56:31PM +0800, zwu.kernel@gmail.com wrote: > >> From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> > >> > >> Miscellaneous features that implement hot data tracking > >> and generally make the hot data functions a bit more friendly. > >> > >> Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> > >> --- > >> fs/direct-io.c | 10 ++++++++++ > >> include/linux/hot_tracking.h | 11 +++++++++++ > >> mm/filemap.c | 8 ++++++++ > >> mm/page-writeback.c | 21 +++++++++++++++++++++ > >> mm/readahead.c | 9 +++++++++ > >> 5 files changed, 59 insertions(+), 0 deletions(-) > >> > >> diff --git a/fs/direct-io.c b/fs/direct-io.c > >> index f86c720..3773f44 100644 > >> --- a/fs/direct-io.c > >> +++ b/fs/direct-io.c > >> @@ -37,6 +37,7 @@ > >> #include <linux/uio.h> > >> #include <linux/atomic.h> > >> #include <linux/prefetch.h> > >> +#include "hot_tracking.h" > >> > >> /* > >> * How many user pages to map in one call to get_user_pages(). This determines > >> @@ -1297,6 +1298,15 @@ __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode, > >> prefetch(bdev->bd_queue); > >> prefetch((char *)bdev->bd_queue + SMP_CACHE_BYTES); > >> > >> + /* Hot data tracking */ > >> + if (TRACK_THIS_INODE(iocb->ki_filp->f_mapping->host) > >> + && iov_length(iov, nr_segs) > 0) { > >> + hot_rb_update_freqs(iocb->ki_filp->f_mapping->host, > >> + (u64)offset, > >> + (u64)iov_length(iov, nr_segs), > >> + rw & WRITE); > >> + } > > > > That''s a bit messy. I''d prefer a static inline function that hides > > all this. e.g. > Do you think of moving the condition into hot_inode_udate_freqs(), not > adding another new function?Moving it into hot_inode_udate_freqs() will add a function call overhead even when tracking is not enabled. a static inline function will just result in no extra overhead other than the if statement.... Cheers, Dave. -- Dave Chinner david@fromorbit.com -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Dave Chinner
2012-Sep-27 07:01 UTC
Re: [RFC v2 07/10] vfs: fork one kthread to update data temperature
On Thu, Sep 27, 2012 at 02:54:22PM +0800, Zhi Yong Wu wrote:> On Thu, Sep 27, 2012 at 12:03 PM, Dave Chinner <david@fromorbit.com> wrote: > > On Sun, Sep 23, 2012 at 08:56:32PM +0800, zwu.kernel@gmail.com wrote: > >> From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> > >> > >> Fork and run one kernel kthread to calculate > >> that temperature based on some metrics kept > >> in custom frequency data structs, and store > >> the info in the hash table. > > > > No new kthreads, please. Use a per-superblock workqueue and a struct > > delayed_work to run periodic work on each superblock. > If no new kthread is created, which kthread will work on these > delayed_work tasks?One of the kworker threads that service the workqueue infrastructure. Cheers, Dave. -- Dave Chinner david@fromorbit.com
Dave Chinner
2012-Sep-27 07:05 UTC
Re: [RFC v2 03/10] vfs: add one new mount option ''-o hottrack''
On Thu, Sep 27, 2012 at 01:25:34PM +0800, Zhi Yong Wu wrote:> On Tue, Sep 25, 2012 at 5:28 PM, Dave Chinner <david@fromorbit.com> wrote: > > On Sun, Sep 23, 2012 at 08:56:28PM +0800, zwu.kernel@gmail.com wrote: > >> From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> > >> > >> Introduce one new mount option ''-o hottrack'', > >> and add its parsing support. > >> Its usage looks like: > >> mount -o hottrack > >> mount -o nouser,hottrack > >> mount -o nouser,hottrack,loop > >> mount -o hottrack,nouser > > > > I think that this option parsing should be done by the filesystem, > > even though the tracking functionality is in the VFS. That way ony > > the filesystems that can use the tracking information will turn it > > on, rather than being able to turn it on for everything regardless > > of whether it is useful or not. > > > > Along those lines, just using a normal superblock flag to indicate > > it is active (e.g. MS_HOT_INODE_TRACKING in sb->s_flags) means you > > don''t need to allocate the sb->s_hot_info structure just to be able > If we don''t allocate one sb->s_hot_info, where will those hash list > head and btree roots locate?I wrote that thinking (mistakenly) that s-hot)info was dynamically allocated rather than being embedded in the struct super_block. Indeed, if the mount option is held in s_flags, then it could be dynamically allocated, but I don''t think that''s really necessary... Cheers, Dave. -- Dave Chinner david@fromorbit.com -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Sep 27, 2012 at 2:57 PM, Dave Chinner <david@fromorbit.com> wrote:> On Thu, Sep 27, 2012 at 02:23:16PM +0800, Zhi Yong Wu wrote: >> On Thu, Sep 27, 2012 at 11:43 AM, Dave Chinner <david@fromorbit.com> wrote: >> > On Sun, Sep 23, 2012 at 08:56:30PM +0800, zwu.kernel@gmail.com wrote: >> >> From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> >> >> >> >> Adds a hash table structure which contains >> >> a lot of hash list and is used to efficiently >> >> look up the data temperature of a file or its >> >> ranges. >> >> In each hash list of hash table, the hash node >> >> will keep track of temperature info. >> > >> > So, let me see if I''ve got the relationship straight: >> > >> > - sb->s_hot_info.hot_inode_tree indexes hot_inode_items, one per inode >> > >> > - hot_inode_item contains access frequency data for that inode >> > >> > - hot_inode_item holds a heat hash node to index the access >> > frequency data for that inode >> > >> > - hot_inode_item.hot_range_tree indexes hot_range_items for that inode >> > >> > - hot_range_item contains access frequency data for that range >> > >> > - hot_range_item holds a heat hash node to index the access >> > frequency data for that range >> > >> > - sb->s_hot_info.heat_inode_hl indexes per-inode heat hash nodes >> > >> > - sb->s_hot_info.heat_range_hl indexes per-range heat hash nodes >> Correct. >> > >> > How about some ascii art? :) Just looking at the hot inode item case >> > (the range item case is the same pattern, though), we have: >> > >> > >> > heat_inode_hl hot_inode_tree >> > | | >> > | V >> > | +-------hot_inode_item-------+ >> > +---+ | frequency data | >> > | V ^ V >> > | ...<--hot_inode_item-->... | ...<--hot_inode_item-->.... >> > | frequency data | frequency data >> > | ^ | ^ >> > | | | | >> > | | | | >> > +------>hot_hash_node-->hot_hash_node-->hot_hash_node-->.... >> Great, can we put them in hot_tracking.txt in Documentation? >> > >> > >> > There''s no actual data stored in the hot_hash_node, just pointer >> > back to the frequency data, a hlist_node and a pointer to the >> > hashlist head. IOWs, I agree with Ram that this does not need to >> > exist and just embedding a hlist_node inside the hot_inode_item is >> > all that is needed. i.e: >> > >> > heat_inode_hl hot_inode_tree >> > | | >> > | V >> > | +-------hot_inode_item-------+ >> > | | frequency data | >> > +---+ | hlist_node | >> > | V ^ | V >> > | ...<--hot_inode_item-->... | | ...<--hot_inode_item-->.... >> > | frequency data | | frequency data >> > +------>hlist_node-----------+ +------->hlist_node--->..... >> > >> > There''s no need for separate allocations, initialisations, locks and >> > reference counting - all that is already in the hot_inode_item. The >> > items have the same lifecycle limitations - a hot_hash_node must be >> > torn down before the frequency data it points to is freed. Finally, >> > there''s no difference in how you move it between lists. >> How will you know if one hot_inode_item should be moved between lists >> when its freq data is changed? > > Record the current temperature in the frequency data, and if itI know how to do it, thanks.> changes, change the list it is on. > >> > Indeed, calling it a hash is wrong - there''s not hashing at all >> > - it keeping an array of list where each entry corresponds to a >> > specific temperature. It is a *heat map*, not a hash list. i.e. >> > inode_heat_map, not heat_inode_hl. HEAT_MAP_SIZE, not HASH_SIZE. >> OK. >> > >> > As it is, there aren''t any users of the heat maps that are generated >> > in this patch set - it''s not even exported to userspace or to >> > debugfs, so I''m not sure how it will be used yet. How are these heat >> > maps going to be used by filesystems, Zhi? >> In hot_hash_calc_temperature(), you can see that one hot_inode or >> hot_range''s freq data will be distilled into one temperature value, >> then it will be inserted to the heat map based on its temperature. >> When the file corresponding to the inode or range got hotter or cold, >> its location will be changed in the heat map based on its new >> temperature in hot_hash_update_hash_table(). > > Yes, but a hot_inode_item or hot_range_item can only have one > location in the heat map, right? So it doesn''t need externalYes.> structure to point to the frequency data to track this....OK.> >> And the user will retrieve those freq data and temperature info via >> debugfs or ioctl interfaces. > > Right - but that data is only extracted after an initial > hot_inode_tree lookup - The heat map itself is never directly used > for lookups. If it''s not used for lookups based on temperature, why > is it needed?You mean we don''t need hot_inode_tree? You know, after those hook functions collect the freq data for inode, they will store those raw info in hot_inode_tree. One private kthread will iterate this tree to distill those raw freq data into one temperatue value in [0 ~ 255], then link the corresponding hot_inode_item in heat map.> > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com-- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Sep 27, 2012 at 2:59 PM, Dave Chinner <david@fromorbit.com> wrote:> On Thu, Sep 27, 2012 at 02:28:12PM +0800, Zhi Yong Wu wrote: >> On Thu, Sep 27, 2012 at 11:54 AM, Dave Chinner <david@fromorbit.com> wrote: >> > On Sun, Sep 23, 2012 at 08:56:31PM +0800, zwu.kernel@gmail.com wrote: >> >> From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> >> >> >> >> Miscellaneous features that implement hot data tracking >> >> and generally make the hot data functions a bit more friendly. >> >> >> >> Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> >> >> --- >> >> fs/direct-io.c | 10 ++++++++++ >> >> include/linux/hot_tracking.h | 11 +++++++++++ >> >> mm/filemap.c | 8 ++++++++ >> >> mm/page-writeback.c | 21 +++++++++++++++++++++ >> >> mm/readahead.c | 9 +++++++++ >> >> 5 files changed, 59 insertions(+), 0 deletions(-) >> >> >> >> diff --git a/fs/direct-io.c b/fs/direct-io.c >> >> index f86c720..3773f44 100644 >> >> --- a/fs/direct-io.c >> >> +++ b/fs/direct-io.c >> >> @@ -37,6 +37,7 @@ >> >> #include <linux/uio.h> >> >> #include <linux/atomic.h> >> >> #include <linux/prefetch.h> >> >> +#include "hot_tracking.h" >> >> >> >> /* >> >> * How many user pages to map in one call to get_user_pages(). This determines >> >> @@ -1297,6 +1298,15 @@ __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode, >> >> prefetch(bdev->bd_queue); >> >> prefetch((char *)bdev->bd_queue + SMP_CACHE_BYTES); >> >> >> >> + /* Hot data tracking */ >> >> + if (TRACK_THIS_INODE(iocb->ki_filp->f_mapping->host) >> >> + && iov_length(iov, nr_segs) > 0) { >> >> + hot_rb_update_freqs(iocb->ki_filp->f_mapping->host, >> >> + (u64)offset, >> >> + (u64)iov_length(iov, nr_segs), >> >> + rw & WRITE); >> >> + } >> > >> > That''s a bit messy. I''d prefer a static inline function that hides >> > all this. e.g. >> Do you think of moving the condition into hot_inode_udate_freqs(), not >> adding another new function? > > Moving it into hot_inode_udate_freqs() will add a function call > overhead even when tracking is not enabled. a static inline functionCan we not directly define hot_inode_udate_freqs to be a static inline?:)> will just result in no extra overhead other than the if > statement.... > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com-- Regards, Zhi Yong Wu
Zhi Yong Wu
2012-Sep-27 07:19 UTC
Re: [RFC v2 07/10] vfs: fork one kthread to update data temperature
On Thu, Sep 27, 2012 at 3:01 PM, Dave Chinner <david@fromorbit.com> wrote:> On Thu, Sep 27, 2012 at 02:54:22PM +0800, Zhi Yong Wu wrote: >> On Thu, Sep 27, 2012 at 12:03 PM, Dave Chinner <david@fromorbit.com> wrote: >> > On Sun, Sep 23, 2012 at 08:56:32PM +0800, zwu.kernel@gmail.com wrote: >> >> From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> >> >> >> >> Fork and run one kernel kthread to calculate >> >> that temperature based on some metrics kept >> >> in custom frequency data structs, and store >> >> the info in the hash table. >> > >> > No new kthreads, please. Use a per-superblock workqueue and a struct >> > delayed_work to run periodic work on each superblock. >> If no new kthread is created, which kthread will work on these >> delayed_work tasks? > > One of the kworker threads that service the workqueue > infrastructure.Got it, thanks> > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com-- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Zhi Yong Wu
2012-Sep-27 07:21 UTC
Re: [RFC v2 03/10] vfs: add one new mount option ''-o hottrack''
On Thu, Sep 27, 2012 at 3:05 PM, Dave Chinner <david@fromorbit.com> wrote:> On Thu, Sep 27, 2012 at 01:25:34PM +0800, Zhi Yong Wu wrote: >> On Tue, Sep 25, 2012 at 5:28 PM, Dave Chinner <david@fromorbit.com> wrote: >> > On Sun, Sep 23, 2012 at 08:56:28PM +0800, zwu.kernel@gmail.com wrote: >> >> From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> >> >> >> >> Introduce one new mount option ''-o hottrack'', >> >> and add its parsing support. >> >> Its usage looks like: >> >> mount -o hottrack >> >> mount -o nouser,hottrack >> >> mount -o nouser,hottrack,loop >> >> mount -o hottrack,nouser >> > >> > I think that this option parsing should be done by the filesystem, >> > even though the tracking functionality is in the VFS. That way ony >> > the filesystems that can use the tracking information will turn it >> > on, rather than being able to turn it on for everything regardless >> > of whether it is useful or not. >> > >> > Along those lines, just using a normal superblock flag to indicate >> > it is active (e.g. MS_HOT_INODE_TRACKING in sb->s_flags) means you >> > don''t need to allocate the sb->s_hot_info structure just to be able >> If we don''t allocate one sb->s_hot_info, where will those hash list >> head and btree roots locate? > > I wrote that thinking (mistakenly) that s-hot)info was dynamically > allocated rather than being embedded in the struct super_block. > > Indeed, if the mount option is held in s_flags, then it could be > dynamically allocated, but I don''t think that''s really necessary...ah, you prefer allocating it, OK, let me try. thanks for your explaination.> > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com-- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html