From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>

This patchset introduces a hot tracking function in the VFS layer that keeps track of real disk I/O in memory. With it, you can easily see more detail about disk I/O and detect where the disk I/O hot spots are. A specific FS can also make use of it for accurate defragmentation, hot relocation support, and so on.

After v1 was sent out, Chandra Seetharaman reviewed it and made a lot of comments; many thanks to him.

Now it's time to send out v2 for external review. Any comments or ideas are appreciated, thanks.

NOTE:

The patchset can be obtained via my kernel dev git on github:
  git://github.com/wuzhy/kernel.git hot_tracking
If you're interested, you can also review the patches via
  https://github.com/wuzhy/kernel/commits/hot_tracking

For usage, further information, and a performance report, please check hot_tracking.txt in Documentation and the following links:
  1.) http://lwn.net/Articles/525651/
  2.) https://lkml.org/lkml/2012/12/20/199

Changelog from v1:
 - Refactored to be under RCU [Chandra Seetharaman]
 - Merged some code changes [Chandra Seetharaman]
 - Fixed some issues [Chandra Seetharaman]

v1:
 - Solved the 64-bit inode number issue. [David Sterba]
 - Embed struct hot_type in struct file_system_type [Darrick J. Wong]
 - Cleaned up some issues [David Sterba]
 - Use a static hot debugfs root [Greg KH]

rfcv4:
 - Introduce hot func registering framework [Zhiyong]
 - Remove global variable for hot tracking [Zhiyong]
 - Add btrfs hot tracking support [Zhiyong]

rfcv3:
 1.) Rewrote debugfs support based on seq_file operations. [Dave Chinner]
 2.) Refactored workqueue support. [Dave Chinner]
 3.) Turned some macros into tunables: TIME_TO_KICK and HEAT_UPDATE_DELAY [Zhiyong, Liu Zheng]
 4.) Cleaned up a lot of other issues [Dave Chinner]

rfcv2:
 1.) Converted to Radix trees, not RB-tree [Zhiyong, Dave Chinner]
 2.) Added memory shrinker [Dave Chinner]
 3.) Converted to one workqueue to update map info periodically [Dave Chinner]
 4.) Cleaned up a lot of other issues [Dave Chinner]

rfcv1:
 1.) Reduce new files and put all in fs/hot_tracking.[ch] [Dave Chinner]
 2.) The first three patches can probably just be flattened into one. [Marco Stornelli, Dave Chinner]

Zhi Yong Wu (12):
  VFS hot tracking: introduce some data structures
  VFS hot tracking: add i/o freq tracking hooks
  VFS hot tracking: add one workqueue to update hot map
  VFS hot tracking: register one shrinker
  VFS hot tracking, rcu: introduce one rcu macro for list
  VFS hot tracking, seq_file: introduce one set of rcu seq_list interfaces
  VFS hot tracking: add debugfs support
  VFS hot tracking: add one ioctl interface
  VFS hot tracking, procfs: add two proc interfaces
  VFS hot tracking, btrfs: add hot tracking support
  VFS hot tracking: add documentation
  VFS hot tracking: add fs hot type support

 Documentation/filesystems/00-INDEX         |    2 +
 Documentation/filesystems/hot_tracking.txt |  256 ++++++
 fs/Makefile                                |    2 +-
 fs/btrfs/ctree.h                           |    1 +
 fs/btrfs/super.c                           |   22 +-
 fs/compat_ioctl.c                          |    5 +
 fs/dcache.c                                |    2 +
 fs/direct-io.c                             |    5 +
 fs/hot_tracking.c                          | 1320 ++++++++++++++++++++++++++++
 fs/hot_tracking.h                          |   67 ++
 fs/ioctl.c                                 |   70 ++
 fs/namei.c                                 |    2 +
 fs/seq_file.c                              |   37 +
 include/linux/fs.h                         |    5 +
 include/linux/hot_tracking.h               |  175 ++++
 include/linux/rculist.h                    |    5 +
 include/linux/seq_file.h                   |    7 +
 kernel/sysctl.c                            |   14 +
 mm/filemap.c                               |    6 +
 mm/page-writeback.c                        |   12 +
 mm/readahead.c                             |    6 +
 21 files changed, 2019 insertions(+), 2 deletions(-)
 create mode 100644 Documentation/filesystems/hot_tracking.txt
 create mode 100644 fs/hot_tracking.c
 create mode 100644 fs/hot_tracking.h
 create mode 100644 include/linux/hot_tracking.h

--
1.7.11.7
zwu.kernel@gmail.com
2013-May-14 00:59 UTC
[PATCH v2 01/12] VFS hot tracking: introduce some data structures
From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>

Define one root structure, hot_info, which is hooked up in super_block and holds the RB-tree root, the temperature map lists, and some other information.

Add hot_inode_tree to keep track of frequently accessed files; it is keyed by {inode, offset}. The trees contain hot_inode_items representing those files and ranges. Having these trees means that the VFS can quickly determine the temperature of some data by doing some calculations on the hot_freq_data struct that hangs off of the tree item.

Define two items, hot_inode_item and hot_range_item. The former represents one tracked file and keeps track of its access frequency and the tree of ranges in that file, while the latter represents a file range of one inode.

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
---
 fs/Makefile                  |   2 +-
 fs/dcache.c                  |   2 +
 fs/hot_tracking.c            | 209 +++++++++++++++++++++++++++++++++++++++++++
 fs/hot_tracking.h            |  17 ++++
 include/linux/fs.h           |   4 +
 include/linux/hot_tracking.h | 103 +++++++++++++++++++++
 6 files changed, 336 insertions(+), 1 deletion(-)
 create mode 100644 fs/hot_tracking.c
 create mode 100644 fs/hot_tracking.h
 create mode 100644 include/linux/hot_tracking.h

diff --git a/fs/Makefile b/fs/Makefile index 199c880..d0fc704 100644 --- a/fs/Makefile +++ b/fs/Makefile @@ -13,7 +13,7 @@ obj-y := open.o read_write.o file_table.o super.o \ attr.o bad_inode.o file.o filesystems.o namespace.o \ seq_file.o xattr.o libfs.o fs-writeback.o \ pnode.o splice.o sync.o utimes.o \ - stack.o fs_struct.o statfs.o + stack.o fs_struct.o statfs.o hot_tracking.o ifeq ($(CONFIG_BLOCK),y) obj-y += buffer.o bio.o block_dev.o direct-io.o mpage.o ioprio.o diff --git a/fs/dcache.c b/fs/dcache.c index f09b908..9d7c2af 100644 --- a/fs/dcache.c +++ b/fs/dcache.c @@ -37,6 +37,7 @@ #include <linux/rculist_bl.h> #include <linux/prefetch.h> #include <linux/ratelimit.h> +#include <linux/hot_tracking.h> #include
"internal.h" #include "mount.h" @@ -3094,4 +3095,5 @@ void __init vfs_caches_init(unsigned long mempages) mnt_init(); bdev_cache_init(); chrdev_init(); + hot_cache_init(); } diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c new file mode 100644 index 0000000..6bf4229 --- /dev/null +++ b/fs/hot_tracking.c @@ -0,0 +1,209 @@ +/* + * fs/hot_tracking.c + * + * Copyright (C) 2013 IBM Corp. All rights reserved. + * Written by Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public + * License v2 as published by the Free Software Foundation. + */ + +#include <linux/list.h> +#include <linux/err.h> +#include <linux/slab.h> +#include <linux/module.h> +#include <linux/spinlock.h> +#include <linux/fs.h> +#include <linux/types.h> +#include <linux/list_sort.h> +#include <linux/limits.h> +#include "hot_tracking.h" + +/* kmem_cache pointers for slab caches */ +static struct kmem_cache *hot_inode_item_cachep __read_mostly; +static struct kmem_cache *hot_range_item_cachep __read_mostly; + +static void hot_inode_item_free(struct kref *kref); + +static void hot_comm_item_free_cb(struct rcu_head *head) +{ + struct hot_comm_item *ci = container_of(head, + struct hot_comm_item, c_rcu); + + if (ci->hot_freq_data.flags == TYPE_RANGE) { + struct hot_range_item *hr = container_of(ci, + struct hot_range_item, hot_range); + kmem_cache_free(hot_range_item_cachep, hr); + } else { + struct hot_inode_item *he = container_of(ci, + struct hot_inode_item, hot_inode); + kmem_cache_free(hot_inode_item_cachep, he); + } +} + +static void hot_range_item_free(struct kref *kref) +{ + struct hot_comm_item *ci = container_of(kref, + struct hot_comm_item, refs); + struct hot_range_item *hr = container_of(ci, + struct hot_range_item, hot_range); + + hr->hot_inode = NULL; + + call_rcu(&hr->hot_range.c_rcu, hot_comm_item_free_cb); +} + +/* + * Drops the reference out on hot_comm_item by one + * and free 
the structure if the reference count hits zero + */ +void hot_comm_item_put(struct hot_comm_item *ci) +{ + kref_put(&ci->refs, (ci->hot_freq_data.flags == TYPE_RANGE) ? + hot_range_item_free : hot_inode_item_free); +} +EXPORT_SYMBOL_GPL(hot_comm_item_put); + +static void hot_comm_item_unlink(struct hot_info *root, + struct hot_comm_item *ci) +{ + if (!test_and_set_bit(HOT_DELETING, &ci->delete_flag)) { + hot_comm_item_put(ci); + } +} + +/* + * Frees the entire hot_range_tree. + */ +static void hot_range_tree_free(struct hot_inode_item *he) +{ + struct hot_info *root = he->hot_root; + struct rb_node *node; + struct hot_comm_item *ci; + + /* Free hot inode and range trees on fs root */ + rcu_read_lock(); + node = rb_first(&he->hot_range_tree); + while (node) { + ci = rb_entry(node, struct hot_comm_item, rb_node); + node = rb_next(node); + hot_comm_item_unlink(root, ci); + } + rcu_read_unlock(); + +} + +static void hot_inode_item_free(struct kref *kref) +{ + struct hot_comm_item *ci = container_of(kref, + struct hot_comm_item, refs); + struct hot_inode_item *he = container_of(ci, + struct hot_inode_item, hot_inode); + + hot_range_tree_free(he); + he->hot_root = NULL; + + call_rcu(&he->hot_inode.c_rcu, hot_comm_item_free_cb); +} + +/* + * Initialize kmem cache for hot_inode_item and hot_range_item. 
+ */ +void __init hot_cache_init(void) +{ + hot_inode_item_cachep = kmem_cache_create("hot_inode_item", + sizeof(struct hot_inode_item), 0, + SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD, + NULL); + if (!hot_inode_item_cachep) + return; + + hot_range_item_cachep = kmem_cache_create("hot_range_item", + sizeof(struct hot_range_item), 0, + SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD, + NULL); + if (!hot_range_item_cachep) + kmem_cache_destroy(hot_inode_item_cachep); +} +EXPORT_SYMBOL_GPL(hot_cache_init); + +static struct hot_info *hot_tree_init(struct super_block *sb) +{ + struct hot_info *root; + int i, j; + + root = kzalloc(sizeof(struct hot_info), GFP_NOFS); + if (!root) { + printk(KERN_ERR "%s: Failed to malloc memory for " + "hot_info\n", __func__); + return ERR_PTR(-ENOMEM); + } + + root->hot_inode_tree = RB_ROOT; + spin_lock_init(&root->t_lock); + spin_lock_init(&root->m_lock); + + for (i = 0; i < MAP_SIZE; i++) { + for (j = 0; j < MAX_TYPES; j++) + INIT_LIST_HEAD(&root->hot_map[j][i]); + } + + return root; +} + +/* + * Frees the entire hot tree. + */ +static void hot_tree_exit(struct hot_info *root) +{ + struct rb_node *node; + struct hot_comm_item *ci; + + rcu_read_lock(); + node = rb_first(&root->hot_inode_tree); + while (node) { + struct hot_inode_item *he; + ci = rb_entry(node, struct hot_comm_item, rb_node); + he = container_of(ci, struct hot_inode_item, hot_inode); + node = rb_next(node); + hot_comm_item_unlink(root, &he->hot_inode); + } + rcu_read_unlock(); +} + +/* + * Initialize the data structures for hot tracking. + * This function will be called by *_fill_super() + * when filesystem is mounted. 
+ */ +int hot_track_init(struct super_block *sb) +{ + struct hot_info *root; + + root = hot_tree_init(sb); + if (IS_ERR(root)) + return PTR_ERR(root); + + sb->s_hot_root = root; + + printk(KERN_INFO "VFS: Turning on hot data tracking\n"); + + return 0; +} +EXPORT_SYMBOL_GPL(hot_track_init); + +/* + * This function will be called by *_put_super() + * when filesystem is umounted, or also by *_fill_super() + * in some exceptional cases. + */ +void hot_track_exit(struct super_block *sb) +{ + struct hot_info *root = sb->s_hot_root; + + hot_tree_exit(root); + sb->s_hot_root = NULL; + kfree(root); +} +EXPORT_SYMBOL_GPL(hot_track_exit); diff --git a/fs/hot_tracking.h b/fs/hot_tracking.h new file mode 100644 index 0000000..a2ee95f --- /dev/null +++ b/fs/hot_tracking.h @@ -0,0 +1,17 @@ +/* + * fs/hot_tracking.h + * + * Copyright (C) 2013 IBM Corp. All rights reserved. + * Written by Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public + * License v2 as published by the Free Software Foundation. 
+ */ + +#ifndef __HOT_TRACKING__ +#define __HOT_TRACKING__ + +#include <linux/hot_tracking.h> + +#endif /* __HOT_TRACKING__ */ diff --git a/include/linux/fs.h b/include/linux/fs.h index 43db02e..ee2c54f 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -27,6 +27,7 @@ #include <linux/lockdep.h> #include <linux/percpu-rwsem.h> #include <linux/blk_types.h> +#include <linux/hot_tracking.h> #include <asm/byteorder.h> #include <uapi/linux/fs.h> @@ -1322,6 +1323,9 @@ struct super_block { /* Being remounted read-only */ int s_readonly_remount; + + /* Hot data tracking*/ + struct hot_info *s_hot_root; }; /* superblock cache pruning functions */ diff --git a/include/linux/hot_tracking.h b/include/linux/hot_tracking.h new file mode 100644 index 0000000..fa99439 --- /dev/null +++ b/include/linux/hot_tracking.h @@ -0,0 +1,103 @@ +/* + * include/linux/hot_tracking.h + * + * This file has definitions for VFS hot data tracking + * structures etc. + * + * Copyright (C) 2013 IBM Corp. All rights reserved. + * Written by Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public + * License v2 as published by the Free Software Foundation. + */ + +#ifndef _LINUX_HOTTRACK_H +#define _LINUX_HOTTRACK_H + +#include <linux/types.h> + +#ifdef __KERNEL__ + +#include <linux/rbtree.h> +#include <linux/kref.h> +#include <linux/fs.h> + +#define MAP_BITS 8 +#define MAP_SIZE (1 << MAP_BITS) + +/* values for hot_freq_data flags */ +enum { + TYPE_INODE = 0, + TYPE_RANGE, + MAX_TYPES +}; + +enum { + HOT_DELETING, +}; + +/* + * A frequency data struct holds values that are used to + * determine temperature of files and file ranges. 
These structs + * are members of hot_inode_item and hot_range_item + */ +struct hot_freq_data { + struct timespec last_read_time; + struct timespec last_write_time; + u32 nr_reads; + u32 nr_writes; + u64 avg_delta_reads; + u64 avg_delta_writes; + u32 flags; + u32 last_temp; +}; + +/* The common info for both following structures */ +struct hot_comm_item { + struct hot_freq_data hot_freq_data; /* frequency data */ + struct kref refs; + struct rb_node rb_node; /* rbtree index */ + unsigned long delete_flag; + struct rcu_head c_rcu; +}; + +/* An item representing an inode and its access frequency */ +struct hot_inode_item { + struct hot_comm_item hot_inode; /* node in hot_inode_tree */ + struct rb_root hot_range_tree; /* tree of ranges */ + spinlock_t i_lock; /* protect above tree */ +}; + +/* + * An item representing a range inside of + * an inode whose frequency is being tracked + */ +struct hot_range_item { + struct hot_comm_item hot_range; + struct hot_inode_item *hot_inode; /* associated hot_inode_item */ +}; + +struct hot_info { + struct rb_root hot_inode_tree; + spinlock_t t_lock; /* protect above tree */ + struct list_head hot_map[MAX_TYPES][MAP_SIZE]; /* map of inode temp */ + spinlock_t m_lock; +}; + +extern void __init hot_cache_init(void); +extern int hot_track_init(struct super_block *sb); +extern void hot_track_exit(struct super_block *sb); +extern void hot_comm_item_put(struct hot_comm_item *ci); + +static inline u64 hot_shift(u64 counter, u32 bits, bool dir) +{ + if (dir) + return counter << bits; + else + return counter >> bits; +} + +#endif /* __KERNEL__ */ + +#endif /* _LINUX_HOTTRACK_H */ -- 1.7.11.7 -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
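The embedding scheme in this patch, where hot_inode_item and hot_range_item each begin with a hot_comm_item and the put path chooses the free routine from hot_freq_data.flags, can be sketched in user-space C roughly as below. This is only an illustration under assumptions: a plain integer stands in for struct kref, free() runs immediately instead of being deferred through call_rcu(), and the names comm_item, inode_item, range_item, comm_item_get/put, freed_inodes and freed_ranges are hypothetical and not part of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* User-space stand-in for the kernel's container_of(). */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

enum { TYPE_INODE = 0, TYPE_RANGE, MAX_TYPES };

/* Simplified common header; a plain counter replaces struct kref. */
struct comm_item {
	unsigned int flags;	/* TYPE_INODE or TYPE_RANGE */
	int refs;
};

/* Both item types embed the common header as a member. */
struct inode_item {
	struct comm_item hot_inode;
	unsigned long long ino;
};

struct range_item {
	struct comm_item hot_range;
	struct inode_item *hot_inode;	/* owning inode item */
	long long start;
};

/* Hypothetical counters so a caller can observe which free path ran. */
int freed_inodes, freed_ranges;

void comm_item_init(struct comm_item *ci, unsigned int type)
{
	assert(type < MAX_TYPES);
	ci->flags = type;
	ci->refs = 1;		/* the creator holds the first reference */
}

void comm_item_get(struct comm_item *ci)
{
	ci->refs++;
}

/*
 * Drop one reference. On the last put, recover the enclosing object
 * from the embedded header and free it. The patch defers the free
 * with call_rcu(); this sketch frees at once.
 */
void comm_item_put(struct comm_item *ci)
{
	if (--ci->refs > 0)
		return;

	if (ci->flags == TYPE_RANGE) {
		free(container_of(ci, struct range_item, hot_range));
		freed_ranges++;
	} else {
		free(container_of(ci, struct inode_item, hot_inode));
		freed_inodes++;
	}
}
```

Because the type tag lives in the common header, a caller holding only a comm_item pointer can release either kind of item without knowing which one it is, which is what lets the RB-tree walkers in the patch treat both item types uniformly.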
zwu.kernel@gmail.com
2013-May-14 00:59 UTC
[PATCH v2 02/12] VFS hot tracking: add i/o freq tracking hooks
From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>

Add i/o frequency tracking hooks to the real read/write code paths: read_pages(), do_writepages(), do_generic_file_read(), and __blockdev_direct_IO().

Currently the whole FS has one RB-tree that tracks i/o frequencies for all inodes that had real disk i/o, while every inode has its own RB-tree that tracks i/o frequencies for all of its extents. When real disk i/o is done for an inode, its i/o frequency item is created or updated in the per-FS RB-tree, and the i/o frequency items for the extents touched by that i/o are created or updated in the per-inode RB-tree.

Each of the two structures hot_inode_item and hot_range_item contains a hot_freq_data struct with its access frequency metrics (number of {reads,writes}, last {read,write} time, frequency of {reads,writes}).

Each hot_inode_item also contains one hot_range_tree, which is keyed by {inode, offset, length} and used to keep track of all the ranges in the file.

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
---
 fs/direct-io.c               |   5 +
 fs/hot_tracking.c            | 284 +++++++++++++++++++++++++++++++++++++++++++
 fs/hot_tracking.h            |   4 +
 fs/namei.c                   |   2 +
 include/linux/hot_tracking.h |  17 +++
 mm/filemap.c                 |   6 +
 mm/page-writeback.c          |  12 ++
 mm/readahead.c               |   6 +
 8 files changed, 336 insertions(+)

diff --git a/fs/direct-io.c b/fs/direct-io.c index 7ab90f5..6cb0598 100644 --- a/fs/direct-io.c +++ b/fs/direct-io.c @@ -38,6 +38,7 @@ #include <linux/atomic.h> #include <linux/prefetch.h> #include <linux/aio.h> +#include "hot_tracking.h" /* * How many user pages to map in one call to get_user_pages(). This determines
This determines @@ -1295,6 +1296,10 @@ __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode, prefetch(bdev->bd_queue); prefetch((char *)bdev->bd_queue + SMP_CACHE_BYTES); + /* Hot data tracking */ + hot_update_freqs(inode, offset, iov_length(iov, nr_segs), + rw & WRITE); + return do_blockdev_direct_IO(rw, iocb, inode, bdev, iov, offset, nr_segs, get_block, end_io, submit_io, flags); diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c index 6bf4229..cc899f4 100644 --- a/fs/hot_tracking.c +++ b/fs/hot_tracking.c @@ -26,6 +26,26 @@ static struct kmem_cache *hot_range_item_cachep __read_mostly; static void hot_inode_item_free(struct kref *kref); +static void hot_comm_item_init(struct hot_comm_item *ci, int type) +{ + kref_init(&ci->refs); + clear_bit(HOT_DELETING, &ci->delete_flag); + memset(&ci->hot_freq_data, 0, sizeof(struct hot_freq_data)); + ci->hot_freq_data.avg_delta_reads = (u64) -1; + ci->hot_freq_data.avg_delta_writes = (u64) -1; + ci->hot_freq_data.flags = type; +} + +static void hot_range_item_init(struct hot_range_item *hr, + struct hot_inode_item *he, loff_t start) +{ + hr->start = start; + hr->len = hot_shift(1, RANGE_BITS, true); + hr->hot_inode = he; + hr->storage_type = -1; + hot_comm_item_init(&hr->hot_range, TYPE_RANGE); +} + static void hot_comm_item_free_cb(struct rcu_head *head) { struct hot_comm_item *ci = container_of(head, @@ -65,10 +85,27 @@ void hot_comm_item_put(struct hot_comm_item *ci) } EXPORT_SYMBOL_GPL(hot_comm_item_put); +/* + * root->t_lock or he->i_lock is acquired in this function + */ static void hot_comm_item_unlink(struct hot_info *root, struct hot_comm_item *ci) { if (!test_and_set_bit(HOT_DELETING, &ci->delete_flag)) { + if (ci->hot_freq_data.flags == TYPE_RANGE) { + struct hot_range_item *hr = container_of(ci, + struct hot_range_item, hot_range); + struct hot_inode_item *he = hr->hot_inode; + + spin_lock(&he->i_lock); + rb_erase(&ci->rb_node, &he->hot_range_tree); + spin_unlock(&he->i_lock); + } else { + 
spin_lock(&root->t_lock); + rb_erase(&ci->rb_node, &root->hot_inode_tree); + spin_unlock(&root->t_lock); + } + hot_comm_item_put(ci); } } @@ -94,6 +131,15 @@ static void hot_range_tree_free(struct hot_inode_item *he) } +static void hot_inode_item_init(struct hot_inode_item *he, + struct hot_info *hot_root, u64 ino) +{ + he->i_ino = ino; + he->hot_root = hot_root; + spin_lock_init(&he->i_lock); + hot_comm_item_init(&he->hot_inode, TYPE_INODE); +} + static void hot_inode_item_free(struct kref *kref) { struct hot_comm_item *ci = container_of(kref, @@ -107,6 +153,195 @@ static void hot_inode_item_free(struct kref *kref) call_rcu(&he->hot_inode.c_rcu, hot_comm_item_free_cb); } +/* root->t_lock is acquired in this function. */ +struct hot_inode_item +*hot_inode_item_lookup(struct hot_info *root, u64 ino, int alloc) +{ + struct rb_node **p; + struct rb_node *parent = NULL; + struct hot_comm_item *ci; + struct hot_inode_item *he, *he_new = NULL; + + /* walk tree to find insertion point */ +redo: + spin_lock(&root->t_lock); + p = &root->hot_inode_tree.rb_node; + while (*p) { + parent = *p; + ci = rb_entry(parent, struct hot_comm_item, rb_node); + he = container_of(ci, struct hot_inode_item, hot_inode); + if (ino < he->i_ino) + p = &(*p)->rb_left; + else if (ino > he->i_ino) + p = &(*p)->rb_right; + else { + hot_comm_item_get(&he->hot_inode); + spin_unlock(&root->t_lock); + if (he_new) + /* + * Lost the race. Somebody else inserted + * the item for the inode. Free the + * newly allocated item. 
+ */ + kmem_cache_free(hot_inode_item_cachep, he_new); + + if (test_bit(HOT_DELETING, &he->hot_inode.delete_flag)) + return ERR_PTR(-ENOENT); + + return he; + } + } + + if (he_new) { + rb_link_node(&he_new->hot_inode.rb_node, parent, p); + rb_insert_color(&he_new->hot_inode.rb_node, + &root->hot_inode_tree); + hot_comm_item_get(&he_new->hot_inode); + spin_unlock(&root->t_lock); + return he_new; + } + spin_unlock(&root->t_lock); + + if (!alloc) + return ERR_PTR(-ENOENT); + + he_new = kmem_cache_zalloc(hot_inode_item_cachep, GFP_NOFS); + if (!he_new) + return ERR_PTR(-ENOMEM); + + hot_inode_item_init(he_new, root, ino); + + goto redo; +} +EXPORT_SYMBOL_GPL(hot_inode_item_lookup); + +void hot_inode_item_delete(struct inode *inode) +{ + struct hot_info *root = inode->i_sb->s_hot_root; + struct hot_inode_item *he; + + if (!root || !S_ISREG(inode->i_mode)) + return; + + he = hot_inode_item_lookup(root, inode->i_ino, 0); + if (IS_ERR(he)) + return; + + hot_comm_item_put(&he->hot_inode); /* for lookup */ + hot_comm_item_unlink(root, &he->hot_inode); +} +EXPORT_SYMBOL_GPL(hot_inode_item_delete); + +/* he->i_lock is acquired in this function. */ +struct hot_range_item +*hot_range_item_lookup(struct hot_inode_item *he, loff_t start, int alloc) +{ + struct rb_node **p; + struct rb_node *parent = NULL; + struct hot_comm_item *ci; + struct hot_range_item *hr, *hr_new = NULL; + + start = hot_shift(start, RANGE_BITS, true); + + /* walk tree to find insertion point */ +redo: + spin_lock(&he->i_lock); + p = &he->hot_range_tree.rb_node; + while (*p) { + parent = *p; + ci = rb_entry(parent, struct hot_comm_item, rb_node); + hr = container_of(ci, struct hot_range_item, hot_range); + if (start < hr->start) + p = &(*p)->rb_left; + else if (start > (hr->start + hr->len - 1)) + p = &(*p)->rb_right; + else { + hot_comm_item_get(&hr->hot_range); + spin_unlock(&he->i_lock); + if(hr_new) + /* + * Lost the race. Somebody else inserted + * the item for the range. 
Free the + * newly allocated item. + */ + kmem_cache_free(hot_range_item_cachep, hr_new); + + if (test_bit(HOT_DELETING, &hr->hot_range.delete_flag)) + return ERR_PTR(-ENOENT); + + return hr; + } + } + + if (hr_new) { + rb_link_node(&hr_new->hot_range.rb_node, parent, p); + rb_insert_color(&hr_new->hot_range.rb_node, + &he->hot_range_tree); + hot_comm_item_get(&hr_new->hot_range); + spin_unlock(&he->i_lock); + return hr_new; + } + spin_unlock(&he->i_lock); + + if (!alloc) + return ERR_PTR(-ENOENT); + + hr_new = kmem_cache_zalloc(hot_range_item_cachep, GFP_NOFS); + if (!hr_new) + return ERR_PTR(-ENOMEM); + + hot_range_item_init(hr_new, he, start); + + goto redo; +} +EXPORT_SYMBOL_GPL(hot_range_item_lookup); + +/* + * This function does the actual work of updating + * the frequency numbers. + * + * avg_delta_{reads,writes} are indeed a kind of simple moving + * average of the time difference between each of the last + * 2^(FREQ_POWER) reads/writes. If there have not yet been that + * many reads or writes, it's likely that the values will be very + * large; they are initialized to the largest possible value for the + * data type. Simply, we don't want a few fast accesses to a file to + * automatically make it appear very hot.
+ */ +static void hot_freq_calc(struct timespec old_atime, + struct timespec cur_time, u64 *avg) +{ + struct timespec delta_ts; + u64 new_delta; + + delta_ts = timespec_sub(cur_time, old_atime); + new_delta = timespec_to_ns(&delta_ts) >> FREQ_POWER; + + *avg = (*avg << FREQ_POWER) - *avg + new_delta; + *avg = *avg >> FREQ_POWER; +} + +static void hot_freq_update(struct hot_info *root, + struct hot_comm_item *ci, bool write) +{ + struct timespec cur_time = current_kernel_time(); + struct hot_freq_data *freq_data = &ci->hot_freq_data; + + if (write) { + freq_data->nr_writes += 1; + hot_freq_calc(freq_data->last_write_time, + cur_time, + &freq_data->avg_delta_writes); + freq_data->last_write_time = cur_time; + } else { + freq_data->nr_reads += 1; + hot_freq_calc(freq_data->last_read_time, + cur_time, + &freq_data->avg_delta_reads); + freq_data->last_read_time = cur_time; + } +} + /* * Initialize kmem cache for hot_inode_item and hot_range_item. */ @@ -128,6 +363,55 @@ void __init hot_cache_init(void) } EXPORT_SYMBOL_GPL(hot_cache_init); +/* + * Main function to update i/o access frequencies, and it will be called + * from read/writepages() hooks, which are read_pages(), do_writepages(), + * do_generic_file_read(), and __blockdev_direct_IO(). 
+ */ +void hot_update_freqs(struct inode *inode, loff_t start, + size_t len, int rw) +{ + struct hot_info *root = inode->i_sb->s_hot_root; + struct hot_inode_item *he; + struct hot_range_item *hr; + u64 range_size; + loff_t cur, end; + + if (!root || (len == 0) || !S_ISREG(inode->i_mode)) + return; + + he = hot_inode_item_lookup(root, inode->i_ino, 1); + if (IS_ERR(he)) + return; + + hot_freq_update(root, &he->hot_inode, rw); + + /* + * Align ranges on range size boundary + * to prevent proliferation of range structs + */ + range_size = hot_shift(1, RANGE_BITS, true); + end = hot_shift((start + len + range_size - 1), + RANGE_BITS, false); + cur = hot_shift(start, RANGE_BITS, false); + for (; cur < end; cur++) { + hr = hot_range_item_lookup(he, cur, 1); + if (IS_ERR(hr)) { + WARN(1, "hot_range_item_lookup returns %ld\n", + PTR_ERR(hr)); + hot_comm_item_put(&he->hot_inode); + return; + } + + hot_freq_update(root, &hr->hot_range, rw); + + hot_comm_item_put(&hr->hot_range); + } + + hot_comm_item_put(&he->hot_inode); +} +EXPORT_SYMBOL_GPL(hot_update_freqs); + static struct hot_info *hot_tree_init(struct super_block *sb) { struct hot_info *root; diff --git a/fs/hot_tracking.h b/fs/hot_tracking.h index a2ee95f..bb4cb16 100644 --- a/fs/hot_tracking.h +++ b/fs/hot_tracking.h @@ -14,4 +14,8 @@ #include <linux/hot_tracking.h> +/* size of sub-file ranges */ +#define RANGE_BITS 20 +#define FREQ_POWER 4 + #endif /* __HOT_TRACKING__ */ diff --git a/fs/namei.c b/fs/namei.c index 85e40d1..2eec542 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -3394,6 +3394,8 @@ int vfs_unlink(struct inode *dir, struct dentry *dentry) if (!dir->i_op->unlink) return -EPERM; + hot_inode_item_delete(dentry->d_inode); + mutex_lock(&dentry->d_inode->i_mutex); if (d_mountpoint(dentry)) error = -EBUSY; diff --git a/include/linux/hot_tracking.h b/include/linux/hot_tracking.h index fa99439..3c0cf57 100644 --- a/include/linux/hot_tracking.h +++ b/include/linux/hot_tracking.h @@ -67,6 +67,8 @@ struct 
hot_inode_item { struct hot_comm_item hot_inode; /* node in hot_inode_tree */ struct rb_root hot_range_tree; /* tree of ranges */ spinlock_t i_lock; /* protect above tree */ + struct hot_info *hot_root; /* associated hot_info */ + u64 i_ino; /* inode number from inode */ }; /* @@ -76,6 +78,9 @@ struct hot_inode_item { struct hot_range_item { struct hot_comm_item hot_range; struct hot_inode_item *hot_inode; /* associated hot_inode_item */ + loff_t start; /* offset in bytes */ + size_t len; /* length in bytes */ + int storage_type; /* type of storage */ }; struct hot_info { @@ -89,6 +94,13 @@ extern void __init hot_cache_init(void); extern int hot_track_init(struct super_block *sb); extern void hot_track_exit(struct super_block *sb); extern void hot_comm_item_put(struct hot_comm_item *ci); +extern void hot_update_freqs(struct inode *inode, loff_t start, + size_t len, int rw); +extern struct hot_inode_item *hot_inode_item_lookup(struct hot_info *root, + u64 ino, int alloc); +extern struct hot_range_item *hot_range_item_lookup(struct hot_inode_item *he, + loff_t start, int alloc); +extern void hot_inode_item_delete(struct inode *inode); static inline u64 hot_shift(u64 counter, u32 bits, bool dir) { @@ -98,6 +110,11 @@ static inline u64 hot_shift(u64 counter, u32 bits, bool dir) return counter >> bits; } +static inline void hot_comm_item_get(struct hot_comm_item *ci) +{ + kref_get(&ci->refs); +} + #endif /* __KERNEL__ */ #endif /* _LINUX_HOTTRACK_H */ diff --git a/mm/filemap.c b/mm/filemap.c index 7905fe7..eb64c49 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -33,6 +33,7 @@ #include <linux/hardirq.h> /* for BUG_ON(!in_atomic()) only */ #include <linux/memcontrol.h> #include <linux/cleancache.h> +#include <linux/hot_tracking.h> #include "internal.h" #define CREATE_TRACE_POINTS @@ -1242,6 +1243,11 @@ readpage: * PG_error will be set again if readpage fails. 
*/ ClearPageError(page); + + /* Hot data tracking */ + hot_update_freqs(inode, (loff_t)page->index << PAGE_CACHE_SHIFT, + PAGE_CACHE_SIZE, 0); + /* Start the actual read. The read will unlock the page. */ error = mapping->a_ops->readpage(filp, page); diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 4514ad7..4bbca3a 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -36,6 +36,7 @@ #include <linux/pagevec.h> #include <linux/timer.h> #include <linux/sched/rt.h> +#include <linux/hot_tracking.h> #include <trace/events/writeback.h> /* @@ -1921,13 +1922,24 @@ EXPORT_SYMBOL(generic_writepages); int do_writepages(struct address_space *mapping, struct writeback_control *wbc) { int ret; + loff_t start = 0; + size_t count = 0; if (wbc->nr_to_write <= 0) return 0; + + start = mapping->writeback_index << PAGE_CACHE_SHIFT; + count = wbc->nr_to_write; + if (mapping->a_ops->writepages) ret = mapping->a_ops->writepages(mapping, wbc); else ret = generic_writepages(mapping, wbc); + + /* Hot data tracking */ + hot_update_freqs(mapping->host, start, + (count - wbc->nr_to_write) * PAGE_CACHE_SIZE, 1); + return ret; } diff --git a/mm/readahead.c b/mm/readahead.c index daed28d..901396b 100644 --- a/mm/readahead.c +++ b/mm/readahead.c @@ -19,6 +19,7 @@ #include <linux/pagemap.h> #include <linux/syscalls.h> #include <linux/file.h> +#include <linux/hot_tracking.h> /* * Initialise a struct file's readahead state. Assumes that the caller has @@ -115,6 +116,11 @@ static int read_pages(struct address_space *mapping, struct file *filp, unsigned page_idx; int ret; + /* Hot data tracking */ + hot_update_freqs(mapping->host, + list_to_page(pages)->index << PAGE_CACHE_SHIFT, + (size_t)nr_pages * PAGE_CACHE_SIZE, 0); + blk_start_plug(&plug); if (mapping->a_ops->readpages) { -- 1.7.11.7
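Two pieces of arithmetic in this patch are easy to misread in diff form: the moving-average update in hot_freq_calc() and the range-index computation in hot_update_freqs(). The user-space sketch below mirrors both as posted, assuming the RANGE_BITS and FREQ_POWER values from fs/hot_tracking.h. The function and struct names (freq_calc_sketch, range_span_sketch, range_span) are hypothetical, nanosecond deltas are passed in directly instead of being derived from struct timespec, and the RB-tree lookups are left out.

```c
#include <assert.h>
#include <stdint.h>

#define RANGE_BITS 20	/* 1 MiB sub-file ranges, as in fs/hot_tracking.h */
#define FREQ_POWER 4

/*
 * Mirror of the update in hot_freq_calc(): keep (2^FREQ_POWER - 1) parts
 * of the old average and add the new delta, which has already been
 * shifted right by FREQ_POWER once, then scale the sum back down.
 * The arithmetic is unsigned, so the left shift wraps modulo 2^64 while
 * avg is still near its initial all-ones value.
 */
uint64_t freq_calc_sketch(uint64_t avg, uint64_t delta_ns)
{
	uint64_t new_delta = delta_ns >> FREQ_POWER;

	avg = (avg << FREQ_POWER) - avg + new_delta;
	return avg >> FREQ_POWER;
}

/* Half-open interval [first, end) of range indexes touched by an i/o. */
struct range_span {
	uint64_t first;
	uint64_t end;
};

/*
 * Mirror of the alignment in hot_update_freqs(): round the start down
 * and the end up to RANGE_BITS boundaries, expressed as range indexes.
 */
struct range_span range_span_sketch(uint64_t start, uint64_t len)
{
	struct range_span s;
	uint64_t range_size = (uint64_t)1 << RANGE_BITS;

	assert(len > 0);	/* the patch returns early when len == 0 */
	s.first = start >> RANGE_BITS;
	s.end = (start + len + range_size - 1) >> RANGE_BITS;
	return s;
}
```

For example, an old average of 1600 ns and a new gap of 160 ns give 1500 ns, and a 2-byte i/o that starts one byte before the 1 MiB boundary yields the span [0, 2), that is, two range items.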
zwu.kernel@gmail.com
2013-May-14 00:59 UTC
[PATCH v2 03/12] VFS hot tracking: add one workqueue to update hot map
From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>

Add a workqueue per superblock and a delayed_work that periodically updates the map info on each superblock.

Two arrays of map lists are defined: one for hot inode items and the other for hot extent items.

Each hot item in the RB-tree is first distilled into one temperature in the range [0, 255]. If the item is old, it is either not linked or aged out; otherwise it is linked into the list of its corresponding map array that is indexed by that temperature.

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
---
 fs/hot_tracking.c            | 298 ++++++++++++++++++++++++++++++++++++++++++-
 fs/hot_tracking.h            |  25 ++++
 include/linux/hot_tracking.h |   4 +
 3 files changed, 326 insertions(+), 1 deletion(-)

diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c index cc899f4..2742d9e 100644 --- a/fs/hot_tracking.c +++ b/fs/hot_tracking.c @@ -29,7 +29,9 @@ static void hot_inode_item_free(struct kref *kref); static void hot_comm_item_init(struct hot_comm_item *ci, int type) { kref_init(&ci->refs); + clear_bit(HOT_IN_LIST, &ci->delete_flag); clear_bit(HOT_DELETING, &ci->delete_flag); + INIT_LIST_HEAD(&ci->track_list); memset(&ci->hot_freq_data, 0, sizeof(struct hot_freq_data)); ci->hot_freq_data.avg_delta_reads = (u64) -1; ci->hot_freq_data.avg_delta_writes = (u64) -1; @@ -86,12 +88,21 @@ void hot_comm_item_put(struct hot_comm_item *ci) EXPORT_SYMBOL_GPL(hot_comm_item_put); /* - * root->t_lock or he->i_lock is acquired in this function + * root->t_lock or he->i_lock, and root->m_lock + * are acquired in this function */ static void hot_comm_item_unlink(struct hot_info *root, struct hot_comm_item *ci) { if (!test_and_set_bit(HOT_DELETING, &ci->delete_flag)) { + if (test_and_clear_bit(HOT_IN_LIST, &ci->delete_flag)) { + spin_lock(&root->m_lock); + list_del_rcu(&ci->track_list); + spin_unlock(&root->m_lock); + + hot_comm_item_put(ci); + } + if (ci->hot_freq_data.flags ==
TYPE_RANGE) { struct hot_range_item *hr = container_of(ci, struct hot_range_item, hot_range); @@ -343,6 +354,274 @@ static void hot_freq_update(struct hot_info *root, } /* + * hot_temp_calc() is responsible for distilling the six heat + * criteria down into a single temperature value for the data, + * which is an integer between 0 and HEAT_MAX_VALUE. + * + * With the six values, we first do some very rudimentary + * "normalizations" to each metric such that they affect the + * final temperature calculation exactly the right way. It''s + * important to note that we still weren''t really sure that + * these six adjustments were exactly right. + * They could definitely use more tweaking and adjustment, + * especially in terms of the memory footprint they consume. + * + * Next, we take the adjusted values and shift them down to + * a manageable size, whereafter they are weighted using the + * the *_COEFF_POWER values and combined to a single temperature + * value. + */ +static u32 hot_temp_calc(struct hot_comm_item *ci) +{ + u32 result = 0; + struct hot_freq_data *freq_data = &ci->hot_freq_data; + + struct timespec ckt = current_kernel_time(); + u64 cur_time = timespec_to_ns(&ckt); + u32 nrr_heat, nrw_heat; + u64 ltr_heat, ltw_heat, avr_heat, avw_heat; + + nrr_heat = (u32)hot_shift((u64)freq_data->nr_reads, + NRR_MULTIPLIER_POWER, true); + nrw_heat = (u32)hot_shift((u64)freq_data->nr_writes, + NRW_MULTIPLIER_POWER, true); + + ltr_heat + hot_shift((cur_time - timespec_to_ns(&freq_data->last_read_time)), + LTR_DIVIDER_POWER, false); + ltw_heat + hot_shift((cur_time - timespec_to_ns(&freq_data->last_write_time)), + LTW_DIVIDER_POWER, false); + + avr_heat + hot_shift((((u64) -1) - freq_data->avg_delta_reads), + AVR_DIVIDER_POWER, false); + avw_heat + hot_shift((((u64) -1) - freq_data->avg_delta_writes), + AVW_DIVIDER_POWER, false); + + /* ltr_heat is now guaranteed to be u32 safe */ + if (ltr_heat >= hot_shift((u64) 1, 32, true)) + ltr_heat = 0; + else + ltr_heat = 
hot_shift((u64) 1, 32, true) - ltr_heat; + + /* ltw_heat is now guaranteed to be u32 safe */ + if (ltw_heat >= hot_shift((u64) 1, 32, true)) + ltw_heat = 0; + else + ltw_heat = hot_shift((u64) 1, 32, true) - ltw_heat; + + /* avr_heat is now guaranteed to be u32 safe */ + if (avr_heat >= hot_shift((u64) 1, 32, true)) + avr_heat = (u32) -1; + + /* avw_heat is now guaranteed to be u32 safe */ + if (avw_heat >= hot_shift((u64) 1, 32, true)) + avw_heat = (u32) -1; + + nrr_heat = (u32)hot_shift((u64)nrr_heat, + (3 - NRR_COEFF_POWER), false); + nrw_heat = (u32)hot_shift((u64)nrw_heat, + (3 - NRW_COEFF_POWER), false); + ltr_heat = hot_shift(ltr_heat, (3 - LTR_COEFF_POWER), false); + ltw_heat = hot_shift(ltw_heat, (3 - LTW_COEFF_POWER), false); + avr_heat = hot_shift(avr_heat, (3 - AVR_COEFF_POWER), false); + avw_heat = hot_shift(avw_heat, (3 - AVW_COEFF_POWER), false); + + result = nrr_heat + nrw_heat + (u32) ltr_heat + + (u32) ltw_heat + (u32) avr_heat + (u32) avw_heat; + + return result; +} + +static bool hot_is_obsolete(struct hot_comm_item *ci) +{ + int ret = 0; + struct timespec ckt = current_kernel_time(); + struct hot_freq_data *freq_data = &ci->hot_freq_data; + u64 last_read_ns, last_write_ns; + u64 cur_time = timespec_to_ns(&ckt); + u64 kick_ns = HOT_AGE_INTERVAL * NSEC_PER_SEC; + + last_read_ns + (cur_time - timespec_to_ns(&freq_data->last_read_time)); + last_write_ns + (cur_time - timespec_to_ns(&freq_data->last_write_time)); + + if ((last_read_ns > kick_ns) && (last_write_ns > kick_ns)) + ret = 1; + + return ret; +} + +static void hot_comm_item_link_cb(struct rcu_head *head) +{ + struct hot_comm_item *ci = container_of(head, + struct hot_comm_item, c_rcu); + struct hot_info *root; + u32 cur_temp = ci->hot_freq_data.last_temp; + + if (ci->hot_freq_data.flags == TYPE_RANGE) { + struct hot_range_item *hr = container_of(ci, + struct hot_range_item, hot_range); + root = hr->hot_inode->hot_root; + } else { + struct hot_inode_item *he = container_of(ci, + struct 
hot_inode_item, hot_inode); + root = he->hot_root; + + } + + spin_lock(&root->m_lock); + if (test_bit(HOT_DELETING, &ci->delete_flag)) { + spin_unlock(&root->m_lock); + return; + } + list_add_tail_rcu(&ci->track_list, + &root->hot_map[ci->hot_freq_data.flags][cur_temp]); + spin_unlock(&root->m_lock); +} + +/* + * Calculate a new temperature and, if necessary, + * move the list_head corresponding to this inode or range + * to the proper list with the new temperature. + */ +static int hot_map_update(struct hot_info *root, + struct hot_comm_item *ci) +{ + u32 temp = hot_temp_calc(ci); + u8 cur_temp, prev_temp; + int flag = false; + + cur_temp = (u8)hot_shift((u64)temp, + (32 - MAP_BITS), false); + prev_temp = (u8)hot_shift((u64)ci->hot_freq_data.last_temp, + (32 - MAP_BITS), false); + + if (cur_temp != prev_temp) { + if (test_and_set_bit(HOT_IN_LIST, &ci->delete_flag)) { + spin_lock(&root->m_lock); + list_del_rcu(&ci->track_list); + spin_unlock(&root->m_lock); + flag = true; + } + + ci->hot_freq_data.last_temp = temp; + + if (flag) + call_rcu(&ci->c_rcu, hot_comm_item_link_cb); + if (test_bit(HOT_DELETING, &ci->delete_flag)) + return 1; + else { + u32 flags = ci->hot_freq_data.flags; + + hot_comm_item_get(ci); + + spin_lock(&root->m_lock); + if (test_bit(HOT_DELETING, &ci->delete_flag)) { + spin_unlock(&root->m_lock); + return 1; + } + list_add_tail_rcu(&ci->track_list, + &root->hot_map[flags][cur_temp]); + spin_unlock(&root->m_lock); + } + } + + return 0; +} + +/* + * Update temperatures for each range item for aging purposes. + * If one hot range item is old, it will be aged out. 
+ */ +static void hot_range_update(struct hot_inode_item *he, + struct hot_info *root) +{ + struct rb_node *node; + struct hot_comm_item *ci; + bool obsolete; + + rcu_read_lock(); + node = rb_first(&he->hot_range_tree); + while (node) { + ci = rb_entry(node, struct hot_comm_item, rb_node); + node = rb_next(node); + if (test_bit(HOT_DELETING, &ci->delete_flag) || + hot_map_update(root, ci)) { + continue; + } + obsolete = hot_is_obsolete(ci); + if (obsolete) + hot_comm_item_unlink(root, ci); + } + rcu_read_unlock(); +} + +/* Temperature compare function*/ +static int hot_temp_cmp(void *priv, struct list_head *a, + struct list_head *b) +{ + struct hot_comm_item *ap = container_of(a, + struct hot_comm_item, track_list); + struct hot_comm_item *bp = container_of(b, + struct hot_comm_item, track_list); + + int diff = ap->hot_freq_data.last_temp + - bp->hot_freq_data.last_temp; + if (diff > 0) + return -1; + if (diff < 0) + return 1; + return 0; +} + +/* + * Every sync period we update temperatures for + * each hot inode item and hot range item for aging + * purposes. 
+ */ +static void hot_update_worker(struct work_struct *work) +{ + struct hot_info *root = container_of(to_delayed_work(work), + struct hot_info, update_work); + struct rb_node *node; + struct hot_comm_item *ci; + struct hot_inode_item *he; + int i, j; + + rcu_read_lock(); + node = rb_first(&root->hot_inode_tree); + while (node) { + ci = rb_entry(node, struct hot_comm_item, rb_node); + node = rb_next(node); + if (test_bit(HOT_DELETING, &ci->delete_flag) || + hot_map_update(root, ci)) { + continue; + } + he = container_of(ci, + struct hot_inode_item, hot_inode); + hot_range_update(he, root); + } + rcu_read_unlock(); + + /* Sort temperature map info based on last temperature*/ + for (i = 0; i < MAP_SIZE; i++) { + for (j = 0; j < MAX_TYPES; j++) { + spin_lock(&root->m_lock); + list_sort(NULL, &root->hot_map[j][i], hot_temp_cmp); + spin_unlock(&root->m_lock); + } + } + + /* Instert next delayed work */ + queue_delayed_work(root->update_wq, &root->update_work, + msecs_to_jiffies(HOT_UPDATE_INTERVAL * MSEC_PER_SEC)); +} + +/* * Initialize kmem cache for hot_inode_item and hot_range_item. 
*/ void __init hot_cache_init(void) @@ -433,6 +712,20 @@ static struct hot_info *hot_tree_init(struct super_block *sb) INIT_LIST_HEAD(&root->hot_map[j][i]); } + root->update_wq = alloc_workqueue( + "hot_update_wq", WQ_NON_REENTRANT, 0); + if (!root->update_wq) { + printk(KERN_ERR "%s: Failed to create " + "hot update workqueue\n", __func__); + kfree(root); + return ERR_PTR(-ENOMEM); + } + + /* Initialize hot tracking wq and arm one delayed work */ + INIT_DELAYED_WORK(&root->update_work, hot_update_worker); + queue_delayed_work(root->update_wq, &root->update_work, + msecs_to_jiffies(HOT_UPDATE_INTERVAL * MSEC_PER_SEC)); + return root; } @@ -444,6 +737,9 @@ static void hot_tree_exit(struct hot_info *root) struct rb_node *node; struct hot_comm_item *ci; + cancel_delayed_work_sync(&root->update_work); + destroy_workqueue(root->update_wq); + rcu_read_lock(); node = rb_first(&root->hot_inode_tree); while (node) { diff --git a/fs/hot_tracking.h b/fs/hot_tracking.h index bb4cb16..8a53c2d 100644 --- a/fs/hot_tracking.h +++ b/fs/hot_tracking.h @@ -12,10 +12,35 @@ #ifndef __HOT_TRACKING__ #define __HOT_TRACKING__ +#include <linux/workqueue.h> #include <linux/hot_tracking.h> +#define HOT_UPDATE_INTERVAL 150 +#define HOT_AGE_INTERVAL 300 + /* size of sub-file ranges */ #define RANGE_BITS 20 #define FREQ_POWER 4 +/* NRR/NRW heat unit = 2^X accesses */ +#define NRR_MULTIPLIER_POWER 20 /* NRR - number of reads since mount */ +#define NRR_COEFF_POWER 0 +#define NRW_MULTIPLIER_POWER 20 /* NRW - number of writes since mount */ +#define NRW_COEFF_POWER 0 + +/* LTR/LTW heat unit = 2^X ns of age */ +#define LTR_DIVIDER_POWER 30 /* LTR - time elapsed since last read(ns) */ +#define LTR_COEFF_POWER 1 +#define LTW_DIVIDER_POWER 30 /* LTW - time elapsed since last write(ns) */ +#define LTW_COEFF_POWER 1 + +/* + * AVR/AVW cold unit = 2^X ns of average delta + * AVR/AVW heat unit = HEAT_MAX_VALUE - cold unit + */ +#define AVR_DIVIDER_POWER 40 /* AVR - average delta between recent reads(ns) */ 
+#define AVR_COEFF_POWER 0 +#define AVW_DIVIDER_POWER 40 /* AVW - average delta between recent writes(ns) */ +#define AVW_COEFF_POWER 0 + #endif /* __HOT_TRACKING__ */ diff --git a/include/linux/hot_tracking.h b/include/linux/hot_tracking.h index 3c0cf57..c32197b 100644 --- a/include/linux/hot_tracking.h +++ b/include/linux/hot_tracking.h @@ -35,6 +35,7 @@ enum { enum { HOT_DELETING, + HOT_IN_LIST, }; /* @@ -60,6 +61,7 @@ struct hot_comm_item { struct rb_node rb_node; /* rbtree index */ unsigned long delete_flag; struct rcu_head c_rcu; + struct list_head track_list; /* link to *_map[] */ }; /* An item representing an inode and its access frequency */ @@ -88,6 +90,8 @@ struct hot_info { spinlock_t t_lock; /* protect above tree */ struct list_head hot_map[MAX_TYPES][MAP_SIZE]; /* map of inode temp */ spinlock_t m_lock; + struct workqueue_struct *update_wq; + struct delayed_work update_work; }; extern void __init hot_cache_init(void); -- 1.7.11.7
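The arithmetic in hot_temp_calc() and the bucket selection in hot_map_update() are easier to follow with concrete numbers. What follows is only a userspace sketch under stated assumptions, not the kernel code: hot_shift() is not shown in this patch and is assumed here to be a plain left/right shift, MAP_BITS is assumed to be 8 (which matches the [0, 255] range in the commit message), and every sketch_* name is hypothetical. The *_POWER constants are copied from the fs/hot_tracking.h hunk above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Constants copied from the fs/hot_tracking.h hunk in this patch. */
#define NRR_MULTIPLIER_POWER 20
#define NRR_COEFF_POWER 0
#define LTR_DIVIDER_POWER 30
#define LTR_COEFF_POWER 1
#define AVR_DIVIDER_POWER 40
#define AVR_COEFF_POWER 0
/* Assumption: 8 map bits, i.e. 256 temperature buckets. */
#define MAP_BITS 8

/* Assumed behaviour of hot_shift(): shift left when dir is true. */
static uint64_t hot_shift(uint64_t v, uint32_t bits, bool dir)
{
	assert(bits < 64);
	return dir ? v << bits : v >> bits;
}

/* Access-count term: more accesses give more heat. */
static uint32_t count_heat(uint32_t nr, uint32_t mult, uint32_t coeff)
{
	uint32_t heat = (uint32_t)hot_shift(nr, mult, true);

	return (uint32_t)hot_shift(heat, 3 - coeff, false);
}

/* Recency term: a smaller age in ns gives more heat. */
static uint32_t recency_heat(uint64_t age_ns, uint32_t div, uint32_t coeff)
{
	uint64_t cap = hot_shift(1, 32, true);
	uint64_t heat = hot_shift(age_ns, div, false);

	heat = (heat >= cap) ? 0 : cap - heat;
	return (uint32_t)hot_shift(heat, 3 - coeff, false);
}

/* Average-delta term: a smaller delta between accesses gives more heat. */
static uint32_t delta_heat(uint64_t avg_delta, uint32_t div, uint32_t coeff)
{
	uint64_t heat = hot_shift(UINT64_MAX - avg_delta, div, false);

	if (heat >= hot_shift(1, 32, true))
		heat = UINT32_MAX;
	return (uint32_t)hot_shift(heat, 3 - coeff, false);
}

/* Combine the six terms the same way hot_temp_calc() does. */
uint32_t sketch_temp_calc(uint32_t nr_reads, uint32_t nr_writes,
			  uint64_t read_age_ns, uint64_t write_age_ns,
			  uint64_t avg_delta_reads, uint64_t avg_delta_writes)
{
	return count_heat(nr_reads, NRR_MULTIPLIER_POWER, NRR_COEFF_POWER) +
	       count_heat(nr_writes, NRR_MULTIPLIER_POWER, NRR_COEFF_POWER) +
	       recency_heat(read_age_ns, LTR_DIVIDER_POWER, LTR_COEFF_POWER) +
	       recency_heat(write_age_ns, LTR_DIVIDER_POWER, LTR_COEFF_POWER) +
	       delta_heat(avg_delta_reads, AVR_DIVIDER_POWER, AVR_COEFF_POWER) +
	       delta_heat(avg_delta_writes, AVR_DIVIDER_POWER, AVR_COEFF_POWER);
}

/* Map a 32-bit temperature onto its hot_map[] index. */
uint8_t sketch_temp_to_bucket(uint32_t temp)
{
	return (uint8_t)hot_shift(temp, 32 - MAP_BITS, false);
}
```

With these constants, an item that was read once just now and has never been written lands near the middle of the map, because the two recency terms dominate: each contributes up to 2^30, while one read contributes only 2^17. The read and write constants are identical in the patch, so the sketch reuses the NRR/LTR/AVR values for the write terms.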
zwu.kernel@gmail.com
2013-May-14 00:59 UTC
[PATCH v2 04/12] VFS hot tracking: register one shrinker
From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>

Register a shrinker to control the amount of memory that is used in
tracking hot regions. If we are throwing inodes out of memory due to
memory pressure, we most definitely are going to need to reduce the
amount of memory the tracking code is using, even if it means losing
useful information. That is, the shrinker accelerates the aging
process.

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
---
 fs/hot_tracking.c            | 58 ++++++++++++++++++++++++++++++++++++++++++++
 include/linux/hot_tracking.h |  2 ++
 2 files changed, 60 insertions(+)

diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
index 2742d9e..af4498c 100644
--- a/fs/hot_tracking.c
+++ b/fs/hot_tracking.c
@@ -100,6 +100,7 @@ static void hot_comm_item_unlink(struct hot_info *root,
 			list_del_rcu(&ci->track_list);
 			spin_unlock(&root->m_lock);

+			atomic_dec(&root->hot_map_nr);
 			hot_comm_item_put(ci);
 		}

@@ -517,6 +518,7 @@ static int hot_map_update(struct hot_info *root,
 		else {
 			u32 flags = ci->hot_freq_data.flags;

+			atomic_inc(&root->hot_map_nr);
 			hot_comm_item_get(ci);

 			spin_lock(&root->m_lock);
@@ -642,6 +644,55 @@ void __init hot_cache_init(void)
 }
 EXPORT_SYMBOL_GPL(hot_cache_init);

+static void hot_prune_map(struct hot_info *root, long nr)
+{
+	int i;
+
+	for (i = 0; i < MAP_SIZE; i++) {
+		struct hot_comm_item *ci;
+		unsigned long prev_nr;
+
+		rcu_read_lock();
+		if (list_empty(&root->hot_map[TYPE_INODE][i])) {
+			rcu_read_unlock();
+			continue;
+		}
+
+		list_for_each_entry_rcu(ci, &root->hot_map[TYPE_INODE][i],
+				track_list) {
+			prev_nr = atomic_read(&root->hot_map_nr);
+			hot_comm_item_unlink(root, ci);
+			nr -= (prev_nr - atomic_read(&root->hot_map_nr));
+			if (nr <= 0)
+				break;
+		}
+		rcu_read_unlock();
+
+		if (nr <= 0)
+			break;
+	}
+
+	return;
+}
+
+/* The shrinker callback function */
+static int hot_track_prune(struct shrinker *shrink,
+			struct shrink_control *sc)
+{
+	struct hot_info *root =
+		container_of(shrink, struct hot_info, hot_shrink);
+
+	if (sc->nr_to_scan == 0)
+		return atomic_read(&root->hot_map_nr) / 2;
+
+	if (!(sc->gfp_mask & __GFP_FS))
+		return -1;
+
+	hot_prune_map(root, sc->nr_to_scan);
+
+	return atomic_read(&root->hot_map_nr);
+}
+
 /*
  * Main function to update i/o access frequencies, and it will be called
  * from read/writepages() hooks, which are read_pages(), do_writepages(),
@@ -706,6 +757,7 @@ static struct hot_info *hot_tree_init(struct super_block *sb)
 	root->hot_inode_tree = RB_ROOT;
 	spin_lock_init(&root->t_lock);
 	spin_lock_init(&root->m_lock);
+	atomic_set(&root->hot_map_nr, 0);

 	for (i = 0; i < MAP_SIZE; i++) {
 		for (j = 0; j < MAX_TYPES; j++)
@@ -726,6 +778,11 @@ static struct hot_info *hot_tree_init(struct super_block *sb)
 	queue_delayed_work(root->update_wq, &root->update_work,
 		msecs_to_jiffies(HOT_UPDATE_INTERVAL * MSEC_PER_SEC));

+	/* Register a shrinker callback */
+	root->hot_shrink.shrink = hot_track_prune;
+	root->hot_shrink.seeks = DEFAULT_SEEKS;
+	register_shrinker(&root->hot_shrink);
+
 	return root;
 }

@@ -737,6 +794,7 @@ static void hot_tree_exit(struct hot_info *root)
 	struct rb_node *node;
 	struct hot_comm_item *ci;

+	unregister_shrinker(&root->hot_shrink);
 	cancel_delayed_work_sync(&root->update_work);
 	destroy_workqueue(root->update_wq);

diff --git a/include/linux/hot_tracking.h b/include/linux/hot_tracking.h
index c32197b..a78b4fc 100644
--- a/include/linux/hot_tracking.h
+++ b/include/linux/hot_tracking.h
@@ -90,8 +90,10 @@ struct hot_info {
 	spinlock_t t_lock;			/* protect above tree */
 	struct list_head hot_map[MAX_TYPES][MAP_SIZE];	/* map of inode temp */
 	spinlock_t m_lock;
+	atomic_t hot_map_nr;
 	struct workqueue_struct *update_wq;
 	struct delayed_work update_work;
+	struct shrinker hot_shrink;
 };

 extern void __init hot_cache_init(void);
--
1.7.11.7

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
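The three return paths of hot_track_prune() can be modelled in userspace. The block below is a hedged sketch only: the sketch_* types are hypothetical stand-ins for struct shrink_control and struct hot_info, may_enter_fs stands in for the (gfp_mask & __GFP_FS) test, and pruning is reduced to decrementing a counter instead of walking hot_map[].

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the struct shrink_control fields used above. */
struct sketch_shrink_control {
	unsigned long nr_to_scan;	/* 0 means "only report a count" */
	bool may_enter_fs;		/* models (gfp_mask & __GFP_FS) */
};

/* Hypothetical stand-in for struct hot_info: only the item counter. */
struct sketch_hot_info {
	long hot_map_nr;
};

/* Models hot_prune_map(): drop up to nr items from the map. */
void sketch_prune_map(struct sketch_hot_info *root, long nr)
{
	long dropped = nr < root->hot_map_nr ? nr : root->hot_map_nr;

	root->hot_map_nr -= dropped;
}

/* Mirrors the control flow of hot_track_prune(). */
long sketch_track_prune(struct sketch_hot_info *root,
			const struct sketch_shrink_control *sc)
{
	assert(root->hot_map_nr >= 0);

	/* Query pass: report how much could be freed, touch nothing. */
	if (sc->nr_to_scan == 0)
		return root->hot_map_nr / 2;

	/* The caller cannot recurse into the filesystem: refuse. */
	if (!sc->may_enter_fs)
		return -1;

	sketch_prune_map(root, (long)sc->nr_to_scan);
	return root->hot_map_nr;
}
```

Note that, as in the patch, the query pass reports half of the tracked items while the scan pass reports the full remaining count.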
zwu.kernel@gmail.com
2013-May-14 00:59 UTC
[PATCH v2 05/12] VFS hot tracking, rcu: introduce one rcu macro for list
From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>

This RCU list macro will be used by the seq_list RCU interfaces.

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
---
 include/linux/rculist.h | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/include/linux/rculist.h b/include/linux/rculist.h
index 8089e35..a3fa055 100644
--- a/include/linux/rculist.h
+++ b/include/linux/rculist.h
@@ -218,6 +218,11 @@ static inline void list_splice_init_rcu(struct list_head *list,
 	at->prev = last;
 }

+#define __list_for_each_rcu(pos, head) \
+	for (pos = rcu_dereference(list_next_rcu(head)); \
+		pos != head; \
+		pos = rcu_dereference(list_next_rcu(pos)))
+
 /**
  * list_entry_rcu - get the struct for this entry
  * @ptr:        the &struct list_head pointer.
--
1.7.11.7
zwu.kernel@gmail.com
2013-May-14 00:59 UTC
[PATCH v2 06/12] VFS hot tracking, seq_file: introduce one set of rcu seq_list interfaces
From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>

Introduce a set of RCU interfaces for seq_list.

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
---
 fs/seq_file.c            | 37 +++++++++++++++++++++++++++++++++++++
 include/linux/seq_file.h |  7 +++++++
 2 files changed, 44 insertions(+)

diff --git a/fs/seq_file.c b/fs/seq_file.c
index 774c1eb..301caa7 100644
--- a/fs/seq_file.c
+++ b/fs/seq_file.c
@@ -795,6 +795,43 @@ struct list_head *seq_list_next(void *v, struct list_head *head, loff_t *ppos)
 }
 EXPORT_SYMBOL(seq_list_next);

+struct list_head *seq_list_start_rcu(struct list_head *head, loff_t pos)
+{
+	struct list_head *lh;
+
+	__list_for_each_rcu(lh, head)
+		if (pos-- == 0)
+			return lh;
+
+	return NULL;
+}
+EXPORT_SYMBOL(seq_list_start_rcu);
+
+struct list_head *seq_list_start_head_rcu(struct list_head *head, loff_t pos)
+{
+	if (!pos)
+		return head;
+
+	return seq_list_start_rcu(head, pos - 1);
+}
+EXPORT_SYMBOL(seq_list_start_head_rcu);
+
+struct list_head *seq_list_next_rcu(void *v, struct list_head *head,
+					loff_t *ppos)
+{
+	struct list_head *lh;
+
+	++*ppos;
+	rcu_read_lock();
+	lh = rcu_dereference(((struct list_head *)v)->next);
+	if (lh == head)
+		lh = NULL;
+	rcu_read_unlock();
+
+	return lh;
+}
+EXPORT_SYMBOL(seq_list_next_rcu);
+
 /**
  * seq_hlist_start - start an iteration of a hlist
  * @head: the head of the hlist
diff --git a/include/linux/seq_file.h b/include/linux/seq_file.h
index 2da29ac..7e391c9 100644
--- a/include/linux/seq_file.h
+++ b/include/linux/seq_file.h
@@ -155,6 +155,13 @@ extern struct list_head *seq_list_start_head(struct list_head *head,
 extern struct list_head *seq_list_next(void *v, struct list_head *head,
 		loff_t *ppos);

+extern struct list_head *seq_list_start_rcu(struct list_head *head,
+		loff_t pos);
+extern struct list_head *seq_list_start_head_rcu(struct list_head *head,
+		loff_t pos);
+extern struct list_head *seq_list_next_rcu(void *v, struct list_head *head,
+		loff_t *ppos);
+
 /*
  * Helpers for iteration over hlist_head-s in seq_files
  */
--
1.7.11.7
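To make the iterator contract concrete, here is a small userspace sketch of what seq_list_start_rcu() and seq_list_next_rcu() return on a circular list. It is an illustration under assumptions, not the kernel implementation: the sketch_* names are hypothetical, plain pointer loads stand in for rcu_dereference(), and there is no RCU read-side protection at all.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for struct list_head. */
struct sketch_list_head {
	struct sketch_list_head *next, *prev;
};

/* Plain loads stand in for rcu_dereference(list_next_rcu(...)). */
#define sketch_list_for_each(pos, head) \
	for (pos = (head)->next; pos != (head); pos = pos->next)

void sketch_list_init(struct sketch_list_head *head)
{
	head->next = head;
	head->prev = head;
}

void sketch_list_add_tail(struct sketch_list_head *node,
			  struct sketch_list_head *head)
{
	node->prev = head->prev;
	node->next = head;
	head->prev->next = node;
	head->prev = node;
}

/* Return the pos-th entry (0-based), or NULL when the list is shorter. */
struct sketch_list_head *sketch_seq_list_start(struct sketch_list_head *head,
					       long long pos)
{
	struct sketch_list_head *lh;

	assert(pos >= 0);
	sketch_list_for_each(lh, head)
		if (pos-- == 0)
			return lh;

	return NULL;
}

/* Advance *ppos and return the entry after v, or NULL at the list end. */
struct sketch_list_head *sketch_seq_list_next(struct sketch_list_head *v,
					      struct sketch_list_head *head,
					      long long *ppos)
{
	struct sketch_list_head *lh = v->next;

	++*ppos;
	return lh == head ? NULL : lh;
}
```

The list head itself is never returned as an entry: reaching it again is what signals the end of the iteration, which is why both helpers compare against head.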
zwu.kernel@gmail.com
2013-May-14 00:59 UTC
[PATCH v2 07/12] VFS hot tracking: add debugfs support
From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>

Add a directory '<dev_name>' in /sys/kernel/debug/hot_track/ for each
volume. Each directory contains four files: 'inode_stat',
'extent_stat', 'inode_spot', and 'extent_spot'.

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
---
 fs/hot_tracking.c            | 455 +++++++++++++++++++++++++++++++++++++++++++
 fs/hot_tracking.h            |   5 +
 include/linux/hot_tracking.h |   1 +
 3 files changed, 461 insertions(+)

diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
index af4498c..cea3675 100644
--- a/fs/hot_tracking.c
+++ b/fs/hot_tracking.c
@@ -17,9 +17,12 @@
 #include <linux/fs.h>
 #include <linux/types.h>
 #include <linux/list_sort.h>
+#include <linux/debugfs.h>
 #include <linux/limits.h>
 #include "hot_tracking.h"

+static struct dentry *hot_debugfs_root;
+
 /* kmem_cache pointers for slab caches */
 static struct kmem_cache *hot_inode_item_cachep __read_mostly;
 static struct kmem_cache *hot_range_item_cachep __read_mostly;
@@ -623,6 +626,444 @@ static void hot_update_worker(struct work_struct *work)
 		msecs_to_jiffies(HOT_UPDATE_INTERVAL * MSEC_PER_SEC));
 }

+static void *hot_range_seq_start(struct seq_file *seq, loff_t *pos)
+		__acquires(rcu)
+{
+	struct hot_info *root = seq->private;
+	struct rb_node *node_he, *node_hr;
+	struct hot_comm_item *ci_he, *ci_hr;
+	struct hot_inode_item *he;
+	struct hot_range_item *hr;
+	loff_t l = *pos;
+
+	rcu_read_lock();
+	node_he = rb_first(&root->hot_inode_tree);
+	while (node_he) {
+		ci_he = rb_entry(node_he, struct hot_comm_item, rb_node);
+		he = container_of(ci_he, struct hot_inode_item, hot_inode);
+		node_hr = rb_first(&he->hot_range_tree);
+		while (node_hr) {
+			if (!l--) {
+				ci_hr = rb_entry(node_hr,
+					struct hot_comm_item, rb_node);
+				hr = container_of(ci_hr,
+					struct hot_range_item, hot_range);
+				return hr;
+			}
+			node_hr = rb_next(node_hr);
+		}
+		node_he = rb_next(node_he);
+	}
+
+	return NULL;
+}
+
+static void *hot_range_seq_next(struct seq_file *seq,
+				void *v, loff_t *pos)
+{
+	struct rb_node *node_he, *node_hr;
+	struct hot_comm_item *ci_he, *ci_hr;
+	struct hot_range_item *hr_next = NULL, *hr = v;
+	struct hot_inode_item *he_next;
+
+	(*pos)++;
+	node_hr = rb_next(&hr->hot_range.rb_node);
+	if (node_hr) {
+		ci_hr = rb_entry(node_hr, struct hot_comm_item, rb_node);
+		hr_next = container_of(ci_hr, struct hot_range_item, hot_range);
+
+		return hr_next;
+	}
+
+	node_he = rb_next(&hr->hot_inode->hot_inode.rb_node);
+loop_he:
+	if (node_he) {
+		ci_he = rb_entry(node_he, struct hot_comm_item, rb_node);
+		he_next = container_of(ci_he, struct hot_inode_item, hot_inode);
+		node_hr = rb_first(&he_next->hot_range_tree);
+		if (node_hr) {
+			ci_hr = rb_entry(node_hr,
+					struct hot_comm_item, rb_node);
+			hr_next = container_of(ci_hr,
+					struct hot_range_item, hot_range);
+		} else {
+			node_he = rb_next(node_he);
+			goto loop_he;
+		}
+	}
+
+	return hr_next;
+}
+
+static void hot_range_seq_stop(struct seq_file *seq, void *v)
+		__releases(rcu)
+{
+	rcu_read_unlock();
+}
+
+static int hot_range_seq_show(struct seq_file *seq, void *v)
+{
+	struct hot_range_item *hr = v;
+	struct hot_inode_item *he = hr->hot_inode;
+	struct hot_freq_data *freq_data;
+
+	freq_data = &hr->hot_range.hot_freq_data;
+	seq_printf(seq, "inode %llu, extent %llu+%llu, " \
+			"reads %u, writes %u, temp %u, " \
+			"storage type %s\n",
+			he->i_ino, (unsigned long long)hr->start,
+			(unsigned long long)hr->len,
+			freq_data->nr_reads,
+			freq_data->nr_writes,
+			(u8)hot_shift((u64)freq_data->last_temp,
+					(32 - MAP_BITS), false),
+			(hr->storage_type == 1) ? "nonrot" : "rot");
+
+	return 0;
+}
+
+static void *hot_inode_seq_start(struct seq_file *seq, loff_t *pos)
+		__acquires(rcu)
+{
+	struct hot_info *root = seq->private;
+	struct rb_node *node;
+	struct hot_comm_item *ci;
+	struct hot_inode_item *he = NULL;
+	loff_t l = *pos;
+
+	rcu_read_lock();
+	node = rb_first(&root->hot_inode_tree);
+	while (node) {
+		if (!l--) {
+			ci = rb_entry(node, struct hot_comm_item, rb_node);
+			he = container_of(ci, struct hot_inode_item, hot_inode);
+			break;
+		}
+		node = rb_next(node);
+	}
+
+	return he;
+}
+
+static void *hot_inode_seq_next(struct seq_file *seq, void *v, loff_t *pos)
+{
+	struct hot_inode_item *he_next = NULL, *he = v;
+	struct rb_node *node;
+	struct hot_comm_item *ci;
+
+	(*pos)++;
+	node = rb_next(&he->hot_inode.rb_node);
+	if (node) {
+		ci = rb_entry(node, struct hot_comm_item, rb_node);
+		he_next = container_of(ci, struct hot_inode_item, hot_inode);
+	}
+
+	return he_next;
+}
+
+static void hot_inode_seq_stop(struct seq_file *seq, void *v)
+		__releases(rcu)
+{
+	rcu_read_unlock();
+}
+
+static int hot_inode_seq_show(struct seq_file *seq, void *v)
+{
+	struct hot_inode_item *he = v;
+	struct hot_freq_data *freq_data = &he->hot_inode.hot_freq_data;
+
+	seq_printf(seq, "inode %llu, reads %u, writes %u, temp %d\n",
+		he->i_ino,
+		freq_data->nr_reads,
+		freq_data->nr_writes,
+		(u8)hot_shift((u64)freq_data->last_temp,
+				(32 - MAP_BITS), false));
+
+	return 0;
+}
+
+static struct hot_comm_item *hot_spot_seq_start(struct hot_info *root,
+					loff_t *pos, int type)
+		__acquires(rcu)
+{
+	struct hot_comm_item *ci;
+	struct list_head *track_list;
+	int i;
+
+	rcu_read_lock();
+	for (i = MAP_SIZE - 1; i >= 0; i--) {
+		track_list = seq_list_start_rcu(&root->hot_map[type][i], *pos);
+		if (track_list) {
+			ci = container_of(track_list,
+				struct hot_comm_item, track_list);
+			return ci;
+		}
+	}
+
+	return NULL;
+}
+
+static struct hot_comm_item *hot_spot_seq_next(struct hot_info *root,
+					struct hot_comm_item *ci,
+					loff_t *pos, int type)
+{
+	struct hot_comm_item *ci_next = NULL;
+	struct list_head *track_list;
+	int i;
+
+	i = (int)hot_shift(ci->hot_freq_data.last_temp,
+			(32 - MAP_BITS), false);
+
+	track_list = seq_list_next_rcu(&ci->track_list,
+				&root->hot_map[type][i], pos);
+next:
+	if (track_list)
+		ci_next = container_of(track_list,
+				struct hot_comm_item, track_list);
+	else if (--i >= 0) {
+		track_list = seq_list_next_rcu(&root->hot_map[type][i],
+					&root->hot_map[type][i], pos);
+		goto next;
+	}
+
+	return ci_next;
+}
+
+static void *hot_spot_range_seq_start(struct seq_file *seq, loff_t *pos)
+{
+	struct hot_info *root = seq->private;
+	struct hot_range_item *hr = NULL;
+	struct hot_comm_item *ci;
+
+	ci = hot_spot_seq_start(root, pos, TYPE_RANGE);
+	if (ci)
+		hr = container_of(ci, struct hot_range_item, hot_range);
+
+	return hr;
+}
+
+static void *hot_spot_range_seq_next(struct seq_file *seq, void *v, loff_t *pos)
+{
+	struct hot_info *root = seq->private;
+	struct hot_range_item *hr_next = NULL, *hr = v;
+	struct hot_comm_item *ci_next;
+
+	ci_next = hot_spot_seq_next(root, &hr->hot_range, pos, TYPE_RANGE);
+	if (ci_next)
+		hr_next = container_of(ci_next,
+				struct hot_range_item, hot_range);
+
+	return hr_next;
+}
+
+static void *hot_spot_inode_seq_start(struct seq_file *seq, loff_t *pos)
+{
+	struct hot_info *root = seq->private;
+	struct hot_inode_item *he = NULL;
+	struct hot_comm_item *ci;
+
+	ci = hot_spot_seq_start(root, pos, TYPE_INODE);
+	if (ci)
+		he = container_of(ci, struct hot_inode_item, hot_inode);
+
+	return he;
+}
+
+static void *hot_spot_inode_seq_next(struct seq_file *seq, void *v, loff_t *pos)
+{
+	struct hot_info *root = seq->private;
+	struct hot_inode_item *he_next = NULL, *he = v;
+	struct hot_comm_item *ci_next;
+
+	ci_next = hot_spot_seq_next(root, &he->hot_inode, pos, TYPE_INODE);
+	if (ci_next)
+		he_next = container_of(ci_next,
+				struct hot_inode_item, hot_inode);
+
+	return he_next;
+}
+
+static const struct seq_operations hot_range_seq_ops = {
+	.start = hot_range_seq_start,
+	.next = hot_range_seq_next,
+	.stop = hot_range_seq_stop,
+	.show = hot_range_seq_show
+};
+
+static const struct seq_operations hot_inode_seq_ops = {
+	.start = hot_inode_seq_start,
+	.next = hot_inode_seq_next,
+	.stop = hot_inode_seq_stop,
+	.show = hot_inode_seq_show
+};
+
+static const struct seq_operations hot_spot_range_seq_ops = {
+	.start = hot_spot_range_seq_start,
+	.next = hot_spot_range_seq_next,
+	.stop = hot_range_seq_stop,
+	.show = hot_range_seq_show
+};
+
+static const struct seq_operations hot_spot_inode_seq_ops = {
+	.start = hot_spot_inode_seq_start,
+	.next = hot_spot_inode_seq_next,
+	.stop = hot_inode_seq_stop,
+	.show = hot_inode_seq_show
+};
+
+static int hot_range_seq_open(struct inode *inode, struct file *file)
+{
+	int ret = seq_open_private(file, &hot_range_seq_ops, 0);
+	if (ret == 0) {
+		struct seq_file *seq = file->private_data;
+		seq->private = inode->i_private;
+	}
+	return ret;
+}
+
+static int hot_inode_seq_open(struct inode *inode, struct file *file)
+{
+	int ret = seq_open_private(file, &hot_inode_seq_ops, 0);
+	if (ret == 0) {
+		struct seq_file *seq = file->private_data;
+		seq->private = inode->i_private;
+	}
+	return ret;
+}
+
+static int hot_spot_range_seq_open(struct inode *inode, struct file *file)
+{
+	int ret = seq_open_private(file, &hot_spot_range_seq_ops, 0);
+	if (ret == 0) {
+		struct seq_file *seq = file->private_data;
+		seq->private = inode->i_private;
+	}
+	return ret;
+}
+
+static int hot_spot_inode_seq_open(struct inode *inode, struct file *file)
+{
+	int ret = seq_open_private(file, &hot_spot_inode_seq_ops, 0);
+	if (ret == 0) {
+		struct seq_file *seq = file->private_data;
+		seq->private = inode->i_private;
+	}
+	return ret;
+}
+
+/* fops to override for printing range data */
+static const struct file_operations hot_debugfs_range_fops = {
+	.open = hot_range_seq_open,
+	.read = seq_read,
+	.llseek = seq_lseek,
+	.release = seq_release,
+};
+
+/* fops to override for printing inode data */
+static const struct file_operations hot_debugfs_inode_fops = {
+	.open = hot_inode_seq_open,
+	.read = seq_read,
+	.llseek = seq_lseek,
+	.release = seq_release,
+};
+
+/* fops to override for printing temperature data */
+static const struct file_operations hot_debugfs_spot_range_fops = {
+	.open = hot_spot_range_seq_open,
+	.read = seq_read,
+	.llseek = seq_lseek,
+	.release = seq_release,
+};
+
+static const struct file_operations hot_debugfs_spot_inode_fops = {
+	.open = hot_spot_inode_seq_open,
+	.read = seq_read,
+	.llseek = seq_lseek,
+	.release = seq_release,
+};
+
+static const struct hot_debugfs hot_debugfs[] = {
+	{
+		.name = "extent_stat",
+		.fops = &hot_debugfs_range_fops,
+	},
+	{
+		.name = "inode_stat",
+		.fops = &hot_debugfs_inode_fops,
+	},
+	{
+		.name = "extent_spot",
+		.fops = &hot_debugfs_spot_range_fops,
+	},
+	{
+		.name = "inode_spot",
+		.fops = &hot_debugfs_spot_inode_fops,
+	},
+};
+
+/* initialize debugfs */
+static int hot_debugfs_init(struct super_block *sb)
+{
+	static const char hot_name[] = "hot_track";
+	struct dentry *dentry;
+	int i, ret = 0;
+
+	/* Determine if hot debugfs root has existed */
+	if (!hot_debugfs_root) {
+		hot_debugfs_root = debugfs_create_dir(hot_name, NULL);
+		if (IS_ERR(hot_debugfs_root)) {
+			ret = PTR_ERR(hot_debugfs_root);
+			return ret;
+		}
+	}
+
+	/* create debugfs folder for this volume by mounted dev name */
+	sb->s_hot_root->debugfs_dentry =
+			debugfs_create_dir(sb->s_id, hot_debugfs_root);
+	if (IS_ERR(sb->s_hot_root->debugfs_dentry)) {
+		ret = PTR_ERR(sb->s_hot_root->debugfs_dentry);
+		goto root_err;
+	}
+
+	/* create debugfs hot data files */
+	for (i = 0; i < ARRAY_SIZE(hot_debugfs); i++) {
+		dentry = debugfs_create_file(hot_debugfs[i].name,
+					S_IFREG | S_IRUSR | S_IWUSR,
+					sb->s_hot_root->debugfs_dentry,
+					sb->s_hot_root,
+					hot_debugfs[i].fops);
+		if (IS_ERR(dentry)) {
+			ret = PTR_ERR(dentry);
+			goto err;
+		}
+	}
+
+	return 0;
+
+err:
+	debugfs_remove_recursive(sb->s_hot_root->debugfs_dentry);
+
+root_err:
+	if (list_empty(&hot_debugfs_root->d_subdirs)) {
+		debugfs_remove(hot_debugfs_root);
+		hot_debugfs_root = NULL;
+	}
+
+	return ret;
+}
+
+/* remove dentries for debugfs */
+static void hot_debugfs_exit(struct super_block *sb)
+{
+	/* remove all debugfs entries recursively from the volume root */
+	debugfs_remove_recursive(sb->s_hot_root->debugfs_dentry);
+
+	if (list_empty(&hot_debugfs_root->d_subdirs)) {
+		debugfs_remove(hot_debugfs_root);
+		hot_debugfs_root = NULL;
+	}
+}
+
 /*
  * Initialize kmem cache for hot_inode_item and hot_range_item.
  */
@@ -818,6 +1259,7 @@ static void hot_tree_exit(struct hot_info *root)
 int hot_track_init(struct super_block *sb)
 {
 	struct hot_info *root;
+	int ret;

 	root = hot_tree_init(sb);
 	if (IS_ERR(root))
@@ -825,9 +1267,21 @@ int hot_track_init(struct super_block *sb)

 	sb->s_hot_root = root;

+	ret = hot_debugfs_init(sb);
+	if (ret) {
+		printk(KERN_ERR "%s: hot_debugfs_init error: %d\n",
+				__func__, ret);
+		goto out;
+	}
+
 	printk(KERN_INFO "VFS: Turning on hot data tracking\n");

 	return 0;
+
+out:
+	hot_tree_exit(root);
+	sb->s_hot_root = NULL;
+	return ret;
 }
 EXPORT_SYMBOL_GPL(hot_track_init);

@@ -840,6 +1294,7 @@ void hot_track_exit(struct super_block *sb)
 {
 	struct hot_info *root = sb->s_hot_root;

+	hot_debugfs_exit(sb);
 	hot_tree_exit(root);
 	sb->s_hot_root = NULL;
 	kfree(root);
diff --git a/fs/hot_tracking.h b/fs/hot_tracking.h
index 8a53c2d..fcc60ac 100644
--- a/fs/hot_tracking.h
+++ b/fs/hot_tracking.h
@@ -43,4 +43,9 @@
 #define AVW_DIVIDER_POWER 40 /* AVW - average delta between recent writes(ns) */
 #define AVW_COEFF_POWER 0

+struct hot_debugfs {
+	const char *name;
+	const struct file_operations *fops;
+};
+
 #endif /* __HOT_TRACKING__ */
diff --git a/include/linux/hot_tracking.h b/include/linux/hot_tracking.h
index a78b4fc..63baae3 100644
--- a/include/linux/hot_tracking.h
+++ b/include/linux/hot_tracking.h
@@ -94,6 +94,7 @@ struct hot_info {
 	struct workqueue_struct *update_wq;
 	struct delayed_work update_work;
 	struct shrinker hot_shrink;
+	struct dentry *debugfs_dentry;
 };

 extern void __init hot_cache_init(void);
--
1.7.11.7
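For anyone consuming these debugfs files from userspace, each line of inode_stat and inode_spot follows the format string in hot_inode_seq_show(): "inode %llu, reads %u, writes %u, temp %d". A possible parser for one such line is sketched below. The struct and function names are hypothetical, and the sketch assumes the output format stays exactly as in this patch.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical container for one parsed inode_stat/inode_spot line. */
struct sketch_inode_stat {
	unsigned long long ino;
	unsigned int reads;
	unsigned int writes;
	unsigned int temp;	/* map index in the range 0..255 */
};

/* Parse one line; return 0 on success, -1 if the line does not match. */
int sketch_parse_inode_stat(const char *line, struct sketch_inode_stat *out)
{
	int n;

	assert(line != NULL && out != NULL);

	/* The literal text mirrors the seq_printf() format in the patch. */
	n = sscanf(line, "inode %llu, reads %u, writes %u, temp %u",
		   &out->ino, &out->reads, &out->writes, &out->temp);

	return n == 4 ? 0 : -1;
}
```

A caller would read /sys/kernel/debug/hot_track/<dev_name>/inode_stat line by line, for example with fgets(), and hand each line to the parser. The extent files carry extra fields (extent start, length, storage type) and would need their own format string.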
zwu.kernel@gmail.com
2013-May-14 00:59 UTC
[PATCH v2 08/12] VFS hot tracking: add one ioctl interface
From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>

FS_IOC_GET_HEAT_INFO: return a struct containing the various
metrics collected in hot_freq_data structs, and also return a
calculated data temperature based on those metrics. Optionally,
retrieve the temperature from the hot data hash list instead of
recalculating it.

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
---
 fs/compat_ioctl.c            |  5 ++++
 fs/hot_tracking.c            |  2 +-
 fs/ioctl.c                   | 70 ++++++++++++++++++++++++++++++++++++++++++++
 include/linux/hot_tracking.h | 21 +++++++++++++
 4 files changed, 97 insertions(+), 1 deletion(-)

diff --git a/fs/compat_ioctl.c b/fs/compat_ioctl.c
index 996cdc5..97bf972 100644
--- a/fs/compat_ioctl.c
+++ b/fs/compat_ioctl.c
@@ -57,6 +57,7 @@
 #include <linux/i2c-dev.h>
 #include <linux/atalk.h>
 #include <linux/gfp.h>
+#include <linux/hot_tracking.h>
 
 #include <net/bluetooth/bluetooth.h>
 #include <net/bluetooth/hci.h>
@@ -1402,6 +1403,9 @@ COMPATIBLE_IOCTL(TIOCSTART)
 COMPATIBLE_IOCTL(TIOCSTOP)
 #endif
 
+/*Hot data tracking*/
+COMPATIBLE_IOCTL(FS_IOC_GET_HEAT_INFO)
+
 /* fat 'r' ioctls. These are handled by fat with ->compat_ioctl,
    but we don't want warnings on other file systems. So declare
    them as compatible here. */
@@ -1581,6 +1585,7 @@ asmlinkage long compat_sys_ioctl(unsigned int fd, unsigned int cmd,
 	case FIBMAP:
 	case FIGETBSZ:
 	case FIONREAD:
+	case FS_IOC_GET_HEAT_INFO:
 		if (S_ISREG(file_inode(f.file)->i_mode))
 			break;
 		/*FALL THROUGH*/
diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
index cea3675..1618f21 100644
--- a/fs/hot_tracking.c
+++ b/fs/hot_tracking.c
@@ -375,7 +375,7 @@ static void hot_freq_update(struct hot_info *root,
  * the *_COEFF_POWER values and combined to a single temperature
  * value.
  */
-static u32 hot_temp_calc(struct hot_comm_item *ci)
+u32 hot_temp_calc(struct hot_comm_item *ci)
 {
 	u32 result = 0;
 	struct hot_freq_data *freq_data = &ci->hot_freq_data;
diff --git a/fs/ioctl.c b/fs/ioctl.c
index fd507fb..f9f3497 100644
--- a/fs/ioctl.c
+++ b/fs/ioctl.c
@@ -15,6 +15,7 @@
 #include <linux/writeback.h>
 #include <linux/buffer_head.h>
 #include <linux/falloc.h>
+#include <linux/hot_tracking.h>
 
 #include <asm/ioctls.h>
 
@@ -537,6 +538,72 @@ static int ioctl_fsthaw(struct file *filp)
 }
 
 /*
+ * Retrieve information about access frequency for the given file. Return it in
+ * a userspace-friendly struct for btrfsctl (or another tool) to parse.
+ *
+ * The temperature that is returned can be "live" -- that is, recalculated when
+ * the ioctl is called -- or it can be returned from the map list, reflecting
+ * the (possibly old) value that the system will use when considering files
+ * for migration. This behavior is determined by hot_heat_info->live.
+ */
+static int ioctl_heat_info(struct file *file, void __user *argp)
+{
+	struct inode *inode = file->f_dentry->d_inode;
+	struct hot_heat_info heat_info;
+	struct hot_inode_item *he;
+	int ret = 0;
+
+	if (copy_from_user((void *)&heat_info,
+			argp,
+			sizeof(struct hot_heat_info)) != 0) {
+		ret = -EFAULT;
+		goto err;
+	}
+
+	he = hot_inode_item_lookup(inode->i_sb->s_hot_root, inode->i_ino, 0);
+	if (IS_ERR(he)) {
+		/* we don't have any info on this file yet */
+		ret = -ENODATA;
+		goto err;
+	}
+
+	heat_info.avg_delta_reads =
+		(__u64) he->hot_inode.hot_freq_data.avg_delta_reads;
+	heat_info.avg_delta_writes =
+		(__u64) he->hot_inode.hot_freq_data.avg_delta_writes;
+	heat_info.last_read_time =
+		(__u64) timespec_to_ns(&he->hot_inode.hot_freq_data.last_read_time);
+	heat_info.last_write_time =
+		(__u64) timespec_to_ns(&he->hot_inode.hot_freq_data.last_write_time);
+	heat_info.num_reads =
+		(__u32) he->hot_inode.hot_freq_data.nr_reads;
+	heat_info.num_writes =
+		(__u32) he->hot_inode.hot_freq_data.nr_writes;
+
+	if (heat_info.live > 0) {
+		/*
+		 * got a request for live temperature,
+		 * call hot_calc_temp() to recalculate
+		 */
+		heat_info.temp = hot_temp_calc(&he->hot_inode);
+	} else {
+		/* not live temperature, get it from the map list */
+		heat_info.temp = he->hot_inode.hot_freq_data.last_temp;
+	}
+
+	hot_comm_item_put(&he->hot_inode);
+
+	if (copy_to_user(argp, (void *)&heat_info,
+			sizeof(struct hot_heat_info))) {
+		ret = -EFAULT;
+		goto err;
+	}
+
+err:
+	return ret;
+}
+
+/*
  * When you add any new common ioctls to the switches above and below
  * please update compat_sys_ioctl() too.
  *
@@ -591,6 +658,9 @@ int do_vfs_ioctl(struct file *filp, unsigned int fd, unsigned int cmd,
 	case FIGETBSZ:
 		return put_user(inode->i_sb->s_blocksize, argp);
 
+	case FS_IOC_GET_HEAT_INFO:
+		return ioctl_heat_info(filp, argp);
+
 	default:
 		if (S_ISREG(inode->i_mode))
 			error = file_ioctl(filp, cmd, arg);
diff --git a/include/linux/hot_tracking.h b/include/linux/hot_tracking.h
index 63baae3..263a15e 100644
--- a/include/linux/hot_tracking.h
+++ b/include/linux/hot_tracking.h
@@ -17,6 +17,18 @@
 
 #include <linux/types.h>
 
+struct hot_heat_info {
+	__u64 avg_delta_reads;
+	__u64 avg_delta_writes;
+	__u64 last_read_time;
+	__u64 last_write_time;
+	__u32 num_reads;
+	__u32 num_writes;
+	__u32 temp;
+	__u8 live;
+	__u8 resv[3];
+};
+
 #ifdef __KERNEL__
 
 #include <linux/rbtree.h>
@@ -97,6 +109,14 @@ struct hot_info {
 	struct dentry *debugfs_dentry;
 };
 
+/*
+ * Hot data tracking ioctls:
+ *
+ * HOT_INFO - retrieve info on frequency of access
+ */
+#define FS_IOC_GET_HEAT_INFO _IOR('f', 17, \
+			struct hot_heat_info)
+
 extern void __init hot_cache_init(void);
 extern int hot_track_init(struct super_block *sb);
 extern void hot_track_exit(struct super_block *sb);
@@ -108,6 +128,7 @@ extern struct hot_inode_item *hot_inode_item_lookup(struct hot_info *root,
 extern struct hot_range_item *hot_range_item_lookup(struct hot_inode_item *he,
 						loff_t start, int alloc);
 extern void hot_inode_item_delete(struct inode *inode);
+extern u32 hot_temp_calc(struct hot_comm_item *ci);
 
 static inline u64 hot_shift(u64 counter, u32 bits, bool dir)
 {
-- 
1.7.11.7
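For reviewers who want to try the ioctl from userspace, the following is a minimal sketch of a caller, not part of the patch. It assumes the struct layout and the `_IOR('f', 17, ...)` encoding shown in the diff above; the `uintN_t` mirror of `struct hot_heat_info` and the helper name `get_heat_info()` are hypothetical. On a kernel without this series the ioctl simply fails; per the patch, a file with no tracking data yet should fail with ENODATA.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Userspace mirror of struct hot_heat_info from the patch (assumed layout). */
struct hot_heat_info {
	uint64_t avg_delta_reads;
	uint64_t avg_delta_writes;
	uint64_t last_read_time;
	uint64_t last_write_time;
	uint32_t num_reads;
	uint32_t num_writes;
	uint32_t temp;
	uint8_t  live;
	uint8_t  resv[3];
};

/* Same encoding as in the patch. */
#define FS_IOC_GET_HEAT_INFO _IOR('f', 17, struct hot_heat_info)

/*
 * Hypothetical helper: fetch heat info for @path.
 * @live != 0 asks for a freshly recalculated temperature,
 * @live == 0 asks for the value cached in the map list.
 * Returns 0 on success, -1 on failure with errno set.
 */
int get_heat_info(const char *path, int live, struct hot_heat_info *out)
{
	int fd, ret, saved_errno;

	assert(path != NULL && out != NULL);

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	memset(out, 0, sizeof(*out));
	out->live = live ? 1 : 0;

	ret = ioctl(fd, FS_IOC_GET_HEAT_INFO, out);
	saved_errno = errno;	/* close() may clobber errno */
	close(fd);
	errno = saved_errno;

	return ret < 0 ? -1 : 0;
}
```

A caller would check the return value and then read `out->temp`, `out->num_reads`, and so on.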
zwu.kernel@gmail.com
2013-May-14 00:59 UTC
[PATCH v2 09/12] VFS hot tracking, procfs: add two proc interfaces
From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>

Add two proc interfaces, hot-age-interval and hot-update-interval,
under /proc/sys/fs/ so that HOT_AGE_INTERVAL and
HOT_UPDATE_INTERVAL become tunable.

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
---
 fs/hot_tracking.c            | 12 +++++++++---
 fs/hot_tracking.h            |  3 ---
 include/linux/hot_tracking.h |  7 +++++++
 kernel/sysctl.c              | 14 ++++++++++++++
 4 files changed, 30 insertions(+), 6 deletions(-)

diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
index 1618f21..088e9aa 100644
--- a/fs/hot_tracking.c
+++ b/fs/hot_tracking.c
@@ -23,6 +23,12 @@
 
 static struct dentry *hot_debugfs_root;
 
+int sysctl_hot_age_interval __read_mostly = 300;
+EXPORT_SYMBOL_GPL(sysctl_hot_age_interval);
+
+int sysctl_hot_update_interval __read_mostly = 300;
+EXPORT_SYMBOL_GPL(sysctl_hot_update_interval);
+
 /* kmem_cache pointers for slab caches */
 static struct kmem_cache *hot_inode_item_cachep __read_mostly;
 static struct kmem_cache *hot_range_item_cachep __read_mostly;
@@ -446,7 +452,7 @@ static bool hot_is_obsolete(struct hot_comm_item *ci)
 	struct hot_freq_data *freq_data = &ci->hot_freq_data;
 	u64 last_read_ns, last_write_ns;
 	u64 cur_time = timespec_to_ns(&ckt);
-	u64 kick_ns = HOT_AGE_INTERVAL * NSEC_PER_SEC;
+	u64 kick_ns = sysctl_hot_age_interval * NSEC_PER_SEC;
 
 	last_read_ns =
 	(cur_time - timespec_to_ns(&freq_data->last_read_time));
@@ -623,7 +629,7 @@ static void hot_update_worker(struct work_struct *work)
 
 	/* Instert next delayed work */
 	queue_delayed_work(root->update_wq, &root->update_work,
-		msecs_to_jiffies(HOT_UPDATE_INTERVAL * MSEC_PER_SEC));
+		msecs_to_jiffies(sysctl_hot_update_interval * MSEC_PER_SEC));
 }
 
 static void *hot_range_seq_start(struct seq_file *seq, loff_t *pos)
@@ -1217,7 +1223,7 @@ static struct hot_info *hot_tree_init(struct super_block *sb)
 	/* Initialize hot tracking wq and arm one delayed work */
 	INIT_DELAYED_WORK(&root->update_work, hot_update_worker);
 	queue_delayed_work(root->update_wq, &root->update_work,
-		msecs_to_jiffies(HOT_UPDATE_INTERVAL * MSEC_PER_SEC));
+		msecs_to_jiffies(sysctl_hot_update_interval * MSEC_PER_SEC));
 
 	/* Register a shrinker callback */
 	root->hot_shrink.shrink = hot_track_prune;
diff --git a/fs/hot_tracking.h b/fs/hot_tracking.h
index fcc60ac..d1ab48b 100644
--- a/fs/hot_tracking.h
+++ b/fs/hot_tracking.h
@@ -15,9 +15,6 @@
 #include <linux/workqueue.h>
 #include <linux/hot_tracking.h>
 
-#define HOT_UPDATE_INTERVAL 150
-#define HOT_AGE_INTERVAL 300
-
 /* size of sub-file ranges */
 #define RANGE_BITS 20
 #define FREQ_POWER 4
diff --git a/include/linux/hot_tracking.h b/include/linux/hot_tracking.h
index 263a15e..6de7153 100644
--- a/include/linux/hot_tracking.h
+++ b/include/linux/hot_tracking.h
@@ -110,6 +110,13 @@ struct hot_info {
 };
 
 /*
+ * Two variables have meanings as below:
+ * 1. time to quit keeping track of tracking data (seconds)
+ * 2. set how often to update temperatures (seconds)
+ */
+extern int sysctl_hot_age_interval, sysctl_hot_update_interval;
+
+/*
  * Hot data tracking ioctls:
  *
  * HOT_INFO - retrieve info on frequency of access
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 9edcf45..6ee4338 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1616,6 +1616,20 @@ static struct ctl_table fs_table[] = {
 		.proc_handler	= &pipe_proc_fn,
 		.extra1		= &pipe_min_size,
 	},
+	{
+		.procname	= "hot-age-interval",
+		.data		= &sysctl_hot_age_interval,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
+	{
+		.procname	= "hot-update-interval",
+		.data		= &sysctl_hot_update_interval,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
 	{ }
 };
 
-- 
1.7.11.7
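The two tunables are in seconds, while the ageing check in hot_is_obsolete() works in nanoseconds. The standalone sketch below illustrates that conversion and a possible staleness test. It is not code from the series: `interval_to_ns()` and `is_obsolete()` are hypothetical names, `NSEC_PER_SEC` is a userspace stand-in for the kernel constant, and the rule "stale only when both the last read and the last write are older than the interval" is an assumption about what the kernel function does. The sketch widens the `int` to 64 bits before multiplying so that the product cannot be evaluated in a narrower type.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace stand-in for the kernel's NSEC_PER_SEC. */
#define NSEC_PER_SEC 1000000000ULL

/* Convert a sysctl-style interval in seconds to nanoseconds. */
uint64_t interval_to_ns(int secs)
{
	assert(secs >= 0);
	/* Widen first so the multiplication is done in 64 bits. */
	return (uint64_t)secs * NSEC_PER_SEC;
}

/*
 * Assumed ageing rule: an item is obsolete only when both its last
 * read and its last write are older than the age interval.
 * All timestamps are in nanoseconds and must not exceed now_ns.
 */
int is_obsolete(uint64_t now_ns, uint64_t last_read_ns,
		uint64_t last_write_ns, int age_secs)
{
	uint64_t kick_ns = interval_to_ns(age_secs);

	return (now_ns - last_read_ns > kick_ns) &&
	       (now_ns - last_write_ns > kick_ns);
}
```

With the default of 300 seconds the threshold is 3 * 10^11 ns; writing 360 to /proc/sys/fs/hot-age-interval would raise it to 3.6 * 10^11 ns.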
zwu.kernel@gmail.com
2013-May-14 00:59 UTC
[PATCH v2 10/12] VFS hot tracking, btrfs: add hot tracking support
From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>

Introduce one new mount option '-o hot_track',
and add its parsing support.
Its usage looks like:
   mount -o hot_track
   mount -o nouser,hot_track
   mount -o nouser,hot_track,loop
   mount -o hot_track,nouser

Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
---
 fs/btrfs/ctree.h |  1 +
 fs/btrfs/super.c | 22 +++++++++++++++++++++-
 2 files changed, 22 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 63c328a..133a6ed 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1927,6 +1927,7 @@ struct btrfs_ioctl_defrag_range_args {
 #define BTRFS_MOUNT_CHECK_INTEGRITY	(1 << 20)
 #define BTRFS_MOUNT_CHECK_INTEGRITY_INCLUDING_EXTENT_DATA (1 << 21)
 #define BTRFS_MOUNT_PANIC_ON_FATAL_ERROR	(1 << 22)
+#define BTRFS_MOUNT_HOT_TRACK		(1 << 23)
 
 #define btrfs_clear_opt(o, opt)		((o) &= ~BTRFS_MOUNT_##opt)
 #define btrfs_set_opt(o, opt)		((o) |= BTRFS_MOUNT_##opt)
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index a4807ce..09fb9d2 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -42,6 +42,7 @@
 #include <linux/cleancache.h>
 #include <linux/ratelimit.h>
 #include <linux/btrfs.h>
+#include <linux/hot_tracking.h>
 #include "compat.h"
 #include "delayed-inode.h"
 #include "ctree.h"
@@ -306,6 +307,10 @@ static void btrfs_put_super(struct super_block *sb)
 	 * last process that kept it busy.  Or segfault in the aforementioned
 	 * process...  Whom would you report that to?
 	 */
+
+	/* Hot data tracking */
+	if (btrfs_test_opt(btrfs_sb(sb)->tree_root, HOT_TRACK))
+		hot_track_exit(sb);
 }
 
 enum {
@@ -318,7 +323,7 @@ enum {
 	Opt_enospc_debug, Opt_subvolrootid, Opt_defrag, Opt_inode_cache,
 	Opt_no_space_cache, Opt_recovery, Opt_skip_balance,
 	Opt_check_integrity, Opt_check_integrity_including_extent_data,
-	Opt_check_integrity_print_mask, Opt_fatal_errors,
+	Opt_check_integrity_print_mask, Opt_fatal_errors, Opt_hot_track,
 	Opt_err,
 };
 
@@ -359,6 +364,7 @@ static match_table_t tokens = {
 	{Opt_check_integrity_including_extent_data, "check_int_data"},
 	{Opt_check_integrity_print_mask, "check_int_print_mask=%d"},
 	{Opt_fatal_errors, "fatal_errors=%s"},
+	{Opt_hot_track, "hot_track"},
 	{Opt_err, NULL},
 };
 
@@ -624,6 +630,9 @@ int btrfs_parse_options(struct btrfs_root *root, char *options)
 				goto out;
 			}
 			break;
+		case Opt_hot_track:
+			btrfs_set_opt(info->mount_opt, HOT_TRACK);
+			break;
 		case Opt_err:
 			printk(KERN_INFO "btrfs: unrecognized mount option "
 			       "'%s'\n", p);
@@ -843,11 +852,20 @@ static int btrfs_fill_super(struct super_block *sb,
 		goto fail_close;
 	}
 
+	if (btrfs_test_opt(fs_info->tree_root, HOT_TRACK)) {
+		err = hot_track_init(sb);
+		if (err)
+			goto fail_hot;
+	}
+
 	save_mount_options(sb, data);
 	cleancache_init_fs(sb);
 	sb->s_flags |= MS_ACTIVE;
 	return 0;
 
+fail_hot:
+	dput(sb->s_root);
+	sb->s_root = NULL;
 fail_close:
 	close_ctree(fs_info->tree_root);
 	return err;
@@ -943,6 +961,8 @@ static int btrfs_show_options(struct seq_file *seq, struct dentry *dentry)
 		seq_puts(seq, ",skip_balance");
 	if (btrfs_test_opt(root, PANIC_ON_FATAL_ERROR))
 		seq_puts(seq, ",fatal_errors=panic");
+	if (btrfs_test_opt(root, HOT_TRACK))
+		seq_puts(seq, ",hot_track");
 	return 0;
 }
 
-- 
1.7.11.7
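The option handling above relies on the token-pasting set/test macros from ctree.h. The userspace sketch below shows the same pattern in isolation. All `DEMO_`/`demo_` names are hypothetical, and the hand-rolled comma splitter only stands in for the kernel's match_token() table, which this sketch does not use.

```c
#include <assert.h>
#include <string.h>

/* Same bit position the patch assigns to BTRFS_MOUNT_HOT_TRACK. */
#define DEMO_MOUNT_HOT_TRACK	(1UL << 23)

/* Token-pasting helpers in the style of btrfs_set_opt()/btrfs_test_opt(). */
#define demo_set_opt(o, opt)	((o) |= DEMO_MOUNT_##opt)
#define demo_test_opt(o, opt)	((o) & DEMO_MOUNT_##opt)

/*
 * Walk a comma-separated option string and return the resulting
 * flag word. Only an exact "hot_track" token sets the bit; every
 * other token is ignored by this sketch.
 */
unsigned long demo_parse(const char *options)
{
	unsigned long mount_opt = 0;
	const char *p = options;

	assert(options != NULL);

	while (*p) {
		size_t len = strcspn(p, ",");	/* length of this token */

		if (len == strlen("hot_track") &&
		    strncmp(p, "hot_track", len) == 0)
			demo_set_opt(mount_opt, HOT_TRACK);

		p += len;
		if (*p == ',')
			p++;			/* skip the separator */
	}

	return mount_opt;
}
```

The position of the token in the string does not matter, which matches the four usage forms listed in the commit message.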
zwu.kernel@gmail.com
2013-May-14 00:59 UTC
[PATCH v2 11/12] VFS hot tracking: add documentation
From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>

Add Documentation for VFS hot tracking feature

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
---
 Documentation/filesystems/00-INDEX         |   2 +
 Documentation/filesystems/hot_tracking.txt | 256 +++++++++++++++++++++++++++++
 2 files changed, 258 insertions(+)
 create mode 100644 Documentation/filesystems/hot_tracking.txt

diff --git a/Documentation/filesystems/00-INDEX b/Documentation/filesystems/00-INDEX
index 8042050..2454472 100644
--- a/Documentation/filesystems/00-INDEX
+++ b/Documentation/filesystems/00-INDEX
@@ -122,3 +122,5 @@ xfs.txt
 	- info and mount options for the XFS filesystem.
 xip.txt
 	- info on execute-in-place for file mappings.
+hot_tracking.txt
+	- info on hot data tracking in VFS layer
diff --git a/Documentation/filesystems/hot_tracking.txt b/Documentation/filesystems/hot_tracking.txt
new file mode 100644
index 0000000..9ea3fa8
--- /dev/null
+++ b/Documentation/filesystems/hot_tracking.txt
@@ -0,0 +1,256 @@
+Hot Data Tracking
+
+April, 2013		Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
+
+CONTENTS
+
+1. Introduction
+2. Motivation
+3. The Design
+4. How to Calc Frequency of Reads/Writes & Temperature
+5. Git Development Tree
+6. Usage Example
+
+
+1. Introduction
+
+  The feature adds support for tracking data temperature
+information in the VFS layer.  Essentially, this means maintaining some key
+stats (like number of reads/writes, last read/write time, frequency of
+reads/writes), then distilling those numbers down to a single
+"temperature" value that reflects what data is "hot", and the filesystem
+can use this information to move hot data from slow devices to fast
+devices.
+
+  The long-term goal of the feature is to allow some FSs,
+e.g. Btrfs to intelligently utilize SSDs in a heterogeneous volume.
+Incidentally, this project has been motivated by
+the Project Ideas page on the Btrfs wiki.
+
+
+2. Motivation
+
+  This is essentially the traditional cache argument: SSD is fast and
+expensive; HDD is cheap but slow. ZFS, for example, can already take
+advantage of SSD caching. Btrfs should also be able to take advantage of
+hybrid storage without many broad, sweeping changes to existing code.
+
+  The overall goal of enabling hot data relocation to SSD has been
+motivated by the Project Ideas page on the Btrfs wiki at
+<https://btrfs.wiki.kernel.org/index.php/Project_ideas>.
+The work divides into two parts: VFS provides the hot data tracking
+function, while a specific FS provides the hot data relocation function.
+So as the first step of this goal, this feature provides the first part
+of the functionality.
+
+
+3. The Design
+
+These include the following parts:
+
+    * Hooks in existing vfs functions to track data access frequency
+
+    * New rb-trees for tracking access frequency of inodes and sub-file
+ranges
+    The relationship between super_block and rb-trees is as below:
+hot_info.hot_inode_tree
+    Each FS instance can find hot tracking info s_hot_root.
+    hot_info has hot_inode_tree and it has inode's hot information,
+and it has hot_range_tree, which has range's hot information.
+
+    * A list of hot inodes and hot ranges by its temperature
+
+    * A debugfs interface for dumping data from the rb-trees
+
+    * A work queue for updating inode heat info
+
+    * Mount options for enabling temperature tracking (-o hot_track;
+disabled by default)
+    * An ioctl to retrieve the frequency information collected for a certain
+file
+    * Ioctls to enable/disable frequency tracking per inode.
+
+Let us see their relationship as below:
+
+    * hot_info.hot_inode_tree indexes hot_inode_items, one per inode
+
+    * hot_inode_item contains access frequency data for that inode
+
+    * hot_inode_item holds a heat list node to link the access frequency
+data for that inode
+
+    * hot_inode_item.hot_range_tree indexes hot_range_items for that inode
+
+    * hot_range_item contains access frequency data for that range
+
+    * hot_range_item holds a heat list node to index the access
+frequency data for that range
+
+    * hot_info.heat_inode_map indexes per-inode heat list nodes
+
+    * hot_info.heat_range_map indexes per-range heat list nodes
+
+  How about some ascii art? :) Just looking at the hot inode item case
+(the range item case is the same pattern, though), we have:
+
+                         super_block
+                              |
+                              V
+                          hot_info
+                              |
+    +-------------------------+----------------------------------------+
+    |                         |                                        |
+    |                         |                                        |
+    V                         V                                        V
+heat_inode_map         hot_inode_tree                           heat_range_map
+    |                         |                                        |
+    |                         V                                        |
+    |           +-------hot_comm_item--------+                         |
+    |           |       frequency data       |                         |
++---+           |         list_head          |                         |
+|   V           V                                                      |
+| ...<--hot_comm_item-->...     ...<--hot_comm_item-->...              |
+       frequency data                frequency data                    |
+       list_head                     list_head                         |
+       hot_range_tree                hot_range_tree                    |
+                                           |                           |
+                                           V                           |
+                             +-------hot_comm_item--------+            |
+                             |       frequency data       |            |
+                             |         list_head          |        +---+
+                             V                            ^        |   V                       |
+                   <--hot_comm_item-->...                 |        | ...<--hot_comm_item-->... |
+                     frequency data                                       frequency data
+                     list_head                                            list_head
+
+
+4. How to Calc Frequency of Reads/Writes & Temperature
+
+1.) hot_freq_calc()
+
+  This function does the actual work of updating the frequency numbers.
+FREQ_POWER determines how many atime deltas we keep track of (as a power of 2).
+So, setting it to anything above 16ish is probably overkill. Also,
+the higher the power, the more bits get right shifted out of the timestamp,
+reducing precision, so take note of that as well.
+
+  FREQ_POWER, defined immediately below, determines how heavily to weight
+the current frequency numbers against the newest access. For example, a value
+of 4 means that the new access information will be weighted 1/16th (ie 2^-4)
+as heavily as the existing frequency info. In essence, this is a kludged-
+together version of a weighted average, since we can't afford to keep all of
+the information that it would take to get a _real_ weighted average.
+
+2.) hot_temp_calc()
+
+  The following comments explain what exactly comprises a unit of heat.
+Six heat values are calculated and combined in order to form an
+overall temperature for the data:
+
+    * NRR - number of reads since mount
+    * NRW - number of writes since mount
+    * LTR - time elapsed since last read (ns)
+    * LTW - time elapsed since last write (ns)
+    * AVR - average delta between recent reads (ns)
+    * AVW - average delta between recent writes (ns)
+
+  These values are divided (right-shifted) according to the *_DIVIDER_POWER
+values defined below to bring the numbers into a reasonable range. You can
+modify these values to fit your needs. However, each heat unit is a u32 and
+thus maxes out at 2^32 - 1. Therefore, you must choose your dividers quite
+carefully or else they could max out or be stuck at zero quite easily.
+(E.g., if you chose AVR_DIVIDER_POWER = 0, nothing less than 4s of atime
+delta would bring the temperature above zero, ever.)
+
+  Finally, each value is added to the overall temperature between 0 and 8
+times, depending on its *_COEFF_POWER value. Note that the coefficients are
+also actually implemented with shifts, so take care to treat these values
+as powers of 2. (I.e., 0 means we'll add it to the temp once; 1 = 2x, etc.)
+
+    * AVR/AVW cold unit = 2^X ns of average delta
+    * AVR/AVW heat unit = HEAT_MAX_VALUE - cold unit
+
+  E.g., data with an average delta between 0 and 2^X ns will have a cold
+value of 0, which means a heat value equal to HEAT_MAX_VALUE.
+
+  This function is responsible for distilling the six heat
+criteria (which are described in detail in hot_tracking.h) down into a single
+temperature value for the data, which is an integer between 0
+and HEAT_MAX_VALUE.
+
+  To accomplish this, the raw values from the hot_freq_data structure
+are shifted in order to make the temperature calculation more
+or less sensitive to each value.
+
+  Once this calibration has happened, we do some additional normalization and
+make sure that everything fits nicely in a u32. From there, we take a very
+rudimentary kind of "average" of each of the values, where the *_COEFF_POWER
+values act as weights for the average.
+
+  Finally, we use the MAP_BITS value, which determines the size of the
+heat list array, to normalize the temperature to the proper granularity.
+
+
+5. Git Development Tree
+
+  This feature is still under development and review, so if you're interested,
+you can pull from the git repository at the following location:
+
+  https://github.com/wuzhy/kernel.git hot_tracking
+  git://github.com/wuzhy/kernel.git hot_tracking
+
+
+6. Usage Example
+
+1.) To use hot tracking, you should mount like this:
+
+$ mount -o hot_track /dev/sdb /mnt
+[ 1505.894078] device label test devid 1 transid 29 /dev/sdb
+[ 1505.952977] btrfs: disk space caching is enabled
+[ 1506.069678] vfs: turning on hot data tracking
+
+2.) Mount debugfs first:
+
+$ mount -t debugfs none /sys/kernel/debug
+$ ls -l /sys/kernel/debug/hot_track/
+total 0
+drwxr-xr-x 2 root root 0 Aug  8 04:40 sdb
+$ ls -l /sys/kernel/debug/hot_track/sdb
+total 0
+-rw-r--r-- 1 root root 0 Aug  8 04:40 inode_stat
+-rw-r--r-- 1 root root 0 Aug  8 04:40 extent_stat
+
+3.) View information about hot tracking from debugfs:
+
+$ echo "hot tracking test" > /mnt/file
+$ cat /sys/kernel/debug/hot_track/sdb/inode_stat
+inode 279, reads 0, writes 1, temp 109
+$ cat /sys/kernel/debug/hot_track/sdb/extent_stat
+inode 279, extent 0+1048576, reads 0, writes 1, temp 64
+
+$ echo "hot data tracking test" >> /mnt/file
+$ cat /sys/kernel/debug/hot_track/sdb/inode_stat
+inode 279, reads 0, writes 2, temp 109
+$ cat /sys/kernel/debug/hot_track/sdb/extent_stat
+inode 279, extent 0+1048576 reads 0, writes 2, temp 64
+
+4.) Check temp sorting result of some nodes:
+
+$ cat /sys/kernel/debug/hot_track/loop0/inode_spot
+inode 5248773, reads 0, writes 244, temp 111
+inode 878523, reads 0, writes 1, temp 109
+inode 878524, reads 0, writes 1, temp 109
+
+5.) Tune some hot tracking parameters as below:
+
+$ cat /proc/sys/fs/hot-age-interval
+300
+$ echo 360 > /proc/sys/fs/hot-age-interval
+$ cat /proc/sys/fs/hot-age-interval
+360
+$ cat /proc/sys/fs/hot-update-interval
+300
+$ echo 360 > /proc/sys/fs/hot-update-interval
+$ cat /proc/sys/fs/hot-update-interval
+360
+
-- 
1.7.11.7
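Section 4 of the documentation describes the frequency update as a "kludged-together" weighted average in which the newest access counts 1/16th (2^-FREQ_POWER) as much as the accumulated history. The sketch below shows one shift-only way to write such an update, as an illustration of the idea only. It is not the body of the function in fs/hot_tracking.c, which is not shown in this thread, and `avg_delta_update()` is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/* Same value the series uses in fs/hot_tracking.h. */
#define FREQ_POWER 4

/*
 * Shift-based exponentially weighted moving average:
 *
 *   avg' = avg - avg / 2^FREQ_POWER + new_delta / 2^FREQ_POWER
 *
 * so the history keeps 15/16 of its weight and the newest
 * inter-access delta (in ns) contributes 1/16. No per-access
 * history needs to be stored.
 */
uint64_t avg_delta_update(uint64_t avg, uint64_t new_delta)
{
	assert(FREQ_POWER < 64);

	return avg - (avg >> FREQ_POWER) + (new_delta >> FREQ_POWER);
}
```

For example, with a running average of 1600 ns, a new delta of 0 ns moves the average to 1500 ns, and a new delta equal to the average leaves it unchanged. The right shifts also show the precision loss the documentation warns about: deltas smaller than 2^FREQ_POWER ns contribute nothing.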
zwu.kernel@gmail.com
2013-May-14 00:59 UTC
[PATCH v2 12/12] VFS hot tracking: add fs hot type support
From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>

Introduce the ability for a specific FS to register
its own hot tracking functions.

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
---
 fs/hot_tracking.c            | 32 ++++++++++++++++++++++----------
 fs/hot_tracking.h            | 19 +++++++++++++++++++
 fs/ioctl.c                   |  2 +-
 include/linux/fs.h           |  1 +
 include/linux/hot_tracking.h | 22 +++++++++++++++++++++-
 5 files changed, 64 insertions(+), 12 deletions(-)

diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
index 088e9aa..4eee33c 100644
--- a/fs/hot_tracking.c
+++ b/fs/hot_tracking.c
@@ -51,7 +51,7 @@ static void hot_range_item_init(struct hot_range_item *hr,
 			struct hot_inode_item *he, loff_t start)
 {
 	hr->start = start;
-	hr->len = hot_shift(1, RANGE_BITS, true);
+	hr->len = hot_shift(1, he->hot_root->hot_type->range_bits, true);
 	hr->hot_inode = he;
 	hr->storage_type = -1;
 	hot_comm_item_init(&hr->hot_range, TYPE_RANGE);
@@ -259,10 +259,11 @@ struct hot_range_item
 {
 	struct rb_node **p;
 	struct rb_node *parent = NULL;
+	struct hot_info *root = he->hot_root;
 	struct hot_comm_item *ci;
 	struct hot_range_item *hr, *hr_new = NULL;
 
-	start = hot_shift(start, RANGE_BITS, true);
+	start = hot_shift(start, root->hot_type->range_bits, true);
 
 	/* walk tree to find insertion point */
 redo:
@@ -350,13 +351,13 @@ static void hot_freq_update(struct hot_info *root,
 
 	if (write) {
 		freq_data->nr_writes += 1;
-		hot_freq_calc(freq_data->last_write_time,
+		HOT_FREQ_CALC(root, freq_data->last_write_time,
 				cur_time,
 				&freq_data->avg_delta_writes);
 		freq_data->last_write_time = cur_time;
 	} else {
 		freq_data->nr_reads += 1;
-		hot_freq_calc(freq_data->last_read_time,
+		HOT_FREQ_CALC(root, freq_data->last_read_time,
 				cur_time,
 				&freq_data->avg_delta_reads);
 		freq_data->last_read_time = cur_time;
@@ -381,7 +382,7 @@ static void hot_freq_update(struct hot_info *root,
  * the *_COEFF_POWER values and combined to a single temperature
  * value.
  */
-u32 hot_temp_calc(struct hot_comm_item *ci)
+static u32 hot_temp_calc(struct hot_comm_item *ci)
 {
 	u32 result = 0;
 	struct hot_freq_data *freq_data = &ci->hot_freq_data;
@@ -501,7 +502,7 @@ static void hot_comm_item_link_cb(struct rcu_head *head)
 static int hot_map_update(struct hot_info *root,
 			struct hot_comm_item *ci)
 {
-	u32 temp = hot_temp_calc(ci);
+	u32 temp = HOT_TEMP_CALC(root, ci);
 	u8 cur_temp, prev_temp;
 	int flag = false;
 
@@ -564,7 +565,7 @@ static void hot_range_update(struct hot_inode_item *he,
 			hot_map_update(root, ci)) {
 			continue;
 		}
-		obsolete = hot_is_obsolete(ci);
+		obsolete = HOT_IS_OBSOLETE(root, ci);
 		if (obsolete)
 			hot_comm_item_unlink(root, ci);
 	}
@@ -1167,10 +1168,10 @@ void hot_update_freqs(struct inode *inode, loff_t start,
 	 * Align ranges on range size boundary
 	 * to prevent proliferation of range structs
 	 */
-	range_size = hot_shift(1, RANGE_BITS, true);
+	range_size = hot_shift(1, root->hot_type->range_bits, true);
 	end = hot_shift((start + len + range_size - 1),
-			RANGE_BITS, false);
-	cur = hot_shift(start, RANGE_BITS, false);
+			root->hot_type->range_bits, false);
+	cur = hot_shift(start, root->hot_type->range_bits, false);
 	for (; cur < end; cur++) {
 		hr = hot_range_item_lookup(he, cur, 1);
 		if (IS_ERR(hr)) {
@@ -1211,6 +1212,17 @@ static struct hot_info *hot_tree_init(struct super_block *sb)
 			INIT_LIST_HEAD(&root->hot_map[j][i]);
 	}
 
+	/* Get hot type for specific FS */
+	root->hot_type = &sb->s_type->hot_type;
+	if (!HOT_FREQ_FN_EXIST(root))
+		SET_HOT_FREQ_FN(root, hot_freq_calc);
+	if (!HOT_TEMP_FN_EXIST(root))
+		SET_HOT_TEMP_FN(root, hot_temp_calc);
+	if (!HOT_OBSOLETE_FN_EXIST(root))
+		SET_HOT_OBSOLETE_FN(root, hot_is_obsolete);
+	if (root->hot_type->range_bits == 0)
+		root->hot_type->range_bits = RANGE_BITS;
+
 	root->update_wq = alloc_workqueue(
 			"hot_update_wq", WQ_NON_REENTRANT, 0);
 	if (!root->update_wq) {
diff --git a/fs/hot_tracking.h b/fs/hot_tracking.h
index d1ab48b..4756fc3 100644
--- a/fs/hot_tracking.h
+++ b/fs/hot_tracking.h
@@ -40,6 +40,25 @@
 #define AVW_DIVIDER_POWER 40 /* AVW - average delta between recent writes(ns) */
 #define AVW_COEFF_POWER 0
 
+#define HOT_FREQ_FN_EXIST(root)		\
+	((root)->hot_type->ops.hot_freq_calc)
+#define HOT_TEMP_FN_EXIST(root)		\
+	((root)->hot_type->ops.hot_temp_calc)
+#define HOT_OBSOLETE_FN_EXIST(root)	\
+	((root)->hot_type->ops.hot_is_obsolete)
+
+#define HOT_FREQ_CALC(root, lt, ct, avg)	\
+	((root)->hot_type->ops.hot_freq_calc(lt, ct, avg))
+#define HOT_IS_OBSOLETE(root, ci)	\
+	((root)->hot_type->ops.hot_is_obsolete(ci))
+
+#define SET_HOT_FREQ_FN(root, fn)	\
+	(root)->hot_type->ops.hot_freq_calc = fn
+#define SET_HOT_TEMP_FN(root, fn)	\
+	(root)->hot_type->ops.hot_temp_calc = fn
+#define SET_HOT_OBSOLETE_FN(root, fn)	\
+	(root)->hot_type->ops.hot_is_obsolete = fn
+
 struct hot_debugfs {
 	const char *name;
 	const struct file_operations *fops;
diff --git a/fs/ioctl.c b/fs/ioctl.c
index f9f3497..95ec029 100644
--- a/fs/ioctl.c
+++ b/fs/ioctl.c
@@ -585,7 +585,7 @@ static int ioctl_heat_info(struct file *file, void __user *argp)
 		 * got a request for live temperature,
 		 * call hot_calc_temp() to recalculate
 		 */
-		heat_info.temp = hot_temp_calc(&he->hot_inode);
+		heat_info.temp = HOT_TEMP_CALC(he->hot_root, &he->hot_inode);
 	} else {
 		/* not live temperature, get it from the map list */
 		heat_info.temp = he->hot_inode.hot_freq_data.last_temp;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index ee2c54f..dda3b9c 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1817,6 +1817,7 @@ struct file_system_type {
 	struct dentry *(*mount) (struct file_system_type *, int,
 		       const char *, void *);
 	void (*kill_sb) (struct super_block *);
+	struct hot_type hot_type;
 	struct module *owner;
 	struct file_system_type * next;
 	struct hlist_head fs_supers;
diff --git a/include/linux/hot_tracking.h b/include/linux/hot_tracking.h
index 6de7153..6184296 100644
--- a/include/linux/hot_tracking.h
+++ b/include/linux/hot_tracking.h
@@ -97,6 +97,26 @@ struct hot_range_item {
 	int storage_type;			/* type of storage */
 };
 
+typedef void (hot_freq_calc_fn) (struct timespec old_atime,
+			struct timespec cur_time, u64 *avg);
+typedef u32 (hot_temp_calc_fn) (struct hot_comm_item *ci);
+typedef bool (hot_is_obsolete_fn) (struct hot_comm_item *ci);
+
+struct hot_func_ops {
+	hot_freq_calc_fn *hot_freq_calc;
+	hot_temp_calc_fn *hot_temp_calc;
+	hot_is_obsolete_fn *hot_is_obsolete;
+};
+
+/* identifies an hot type */
+struct hot_type {
+	u64 range_bits;
+	struct hot_func_ops ops;   /* fields provided by specific FS */
+};
+
+#define HOT_TEMP_CALC(root, ci)		\
+	((root)->hot_type->ops.hot_temp_calc(ci))
+
 struct hot_info {
 	struct rb_root hot_inode_tree;
 	spinlock_t t_lock;			/* protect above tree */
@@ -105,6 +125,7 @@ struct hot_info {
 	atomic_t hot_map_nr;
 	struct workqueue_struct *update_wq;
 	struct delayed_work update_work;
+	struct hot_type *hot_type;
 	struct shrinker hot_shrink;
 	struct dentry *debugfs_dentry;
 };
@@ -135,7 +156,6 @@ extern struct hot_inode_item *hot_inode_item_lookup(struct hot_info *root,
 extern struct hot_range_item *hot_range_item_lookup(struct hot_inode_item *he,
 						loff_t start, int alloc);
 extern void hot_inode_item_delete(struct inode *inode);
-extern u32 hot_temp_calc(struct hot_comm_item *ci);
 
 static inline u64 hot_shift(u64 counter, u32 bits, bool dir)
 {
-- 
1.7.11.7
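The registration scheme in this patch is an ops table whose empty slots are filled with VFS defaults at init time. The standalone sketch below models that pattern outside the kernel. Every `demo_` name is hypothetical, the item struct is reduced to two counters, and the "temperature" functions are placeholders rather than the real calculation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reduced stand-in for struct hot_comm_item. */
struct demo_item {
	uint32_t nr_reads;
	uint32_t nr_writes;
};

typedef uint32_t (demo_temp_calc_fn)(const struct demo_item *it);

/* Per-filesystem hooks; a NULL slot means "use the generic default". */
struct demo_ops {
	demo_temp_calc_fn *temp_calc;
};

struct demo_type {
	uint64_t range_bits;	/* 0 means "use the default range size" */
	struct demo_ops ops;
};

#define DEMO_RANGE_BITS 20	/* mirrors RANGE_BITS in fs/hot_tracking.h */

/* Placeholder generic implementation. */
static uint32_t demo_default_temp(const struct demo_item *it)
{
	return it->nr_reads + it->nr_writes;
}

/* Fill every hook the filesystem left empty, as hot_tree_init() does. */
void demo_type_init(struct demo_type *t)
{
	assert(t != NULL);

	if (!t->ops.temp_calc)
		t->ops.temp_calc = demo_default_temp;
	if (t->range_bits == 0)
		t->range_bits = DEMO_RANGE_BITS;
}

/* Dispatch through the table, like the HOT_TEMP_CALC() macro. */
uint32_t demo_temp(const struct demo_type *t, const struct demo_item *it)
{
	return t->ops.temp_calc(it);
}

/* Example override a filesystem might register: weight writes double. */
uint32_t demo_fs_temp(const struct demo_item *it)
{
	return it->nr_reads + 2 * it->nr_writes;
}
```

Because defaults are installed once at init, the hot paths can call through the table unconditionally without a NULL check on every access.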
Ping...

On Tue, May 14, 2013 at 8:59 AM, <zwu.kernel@gmail.com> wrote:
> From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
>
> The patchset is trying to introduce hot tracking function in
> VFS layer, which will keep track of real disk I/O in memory.
> [...]

-- 
Regards,

Zhi Yong Wu