zwu.kernel@gmail.com
2013-Jun-21 12:20 UTC
[RFC PATCH v2 0/5] BTRFS hot relocation support
From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>

This patchset now works well with the patchset v3 for VFS hot tracking in RAID single mode. It is sent out as an RFC mainly to see whether its design is going in the correct development direction. While working on this feature, I have tried to change as little of the existing btrfs code as possible.

After v0 was sent out, I carefully reviewed the patchset with respect to the speed profile and don't think it is meaningful to BTRFS hot relocation; but I do think it is one simple and effective way to introduce one new block group for nonrotating disks, to differentiate whether block space is reserved from a rotating disk or a nonrotating disk. So I would appreciate it if the developers could double-check whether the design is appropriate for BTRFS hot relocation.

The patchset introduces hot relocation support for BTRFS. In a hybrid storage environment, when data on a rotating disk gets hot, it can be relocated to a nonrotating disk automatically by BTRFS hot relocation support; also, if the nonrotating disk usage exceeds its upper threshold, the data which has gone cold is looked up and relocated back to the rotating disk first to make more space on the nonrotating disk, and then the data which has become hot is relocated to the nonrotating disk automatically.

BTRFS hot relocation mainly reserves block space on the nonrotating disk first, loads the hot data into the page cache from the rotating disk, allocates block space on the nonrotating disk, and finally writes the data there.

Below is its TODO list:

- BTRFS RAID full support. [Martin Steigerwald, Zhiyong]
- Mark files as hot via ioctl. [Martin Steigerwald]
- Easier setup. With BTRFS flexibility I would expect that an SSD used as a hot data cache can be added and removed on the fly while the filesystem is mounted. It only seems to be supported at mkfs time as I read the patch docs, but from my basic technical understanding of BTRFS it could be extended to work on the fly with a mounted FS as well. [Martin Steigerwald]

If you'd like to play with it, please pull the patchset from my git on github:

  https://github.com/wuzhy/kernel.git hot_reloc

For how to use it, please refer to the example below:

root@debian-i386:~# echo 0 > /sys/block/vdc/queue/rotational
  ^^^ The above command will hack /dev/vdc to be one SSD disk
root@debian-i386:~# echo 999999 > /proc/sys/fs/hot-age-interval
root@debian-i386:~# echo 10 > /proc/sys/fs/hot-update-interval
root@debian-i386:~# echo 10 > /proc/sys/fs/hot-reloc-interval
root@debian-i386:~# mkfs.btrfs -d single -m single -h /dev/vdb /dev/vdc -f

WARNING! - Btrfs v0.20-rc1-254-gb0136aa-dirty IS EXPERIMENTAL
WARNING! - see http://btrfs.wiki.kernel.org before using

[  140.279011] device fsid c563a6dc-f192-41a9-9fe1-5a3aa01f5e4c devid 1 transid 16 /dev/vdb
[  140.283650] device fsid c563a6dc-f192-41a9-9fe1-5a3aa01f5e4c devid 2 transid 16 /dev/vdc
[  140.550759] device fsid 197d47a7-b9cd-46a8-9360-eb087b119424 devid 1 transid 3 /dev/vdb
[  140.552473] device fsid c563a6dc-f192-41a9-9fe1-5a3aa01f5e4c devid 2 transid 16 /dev/vdc
adding device /dev/vdc id 2
[  140.636215] device fsid 197d47a7-b9cd-46a8-9360-eb087b119424 devid 2 transid 3 /dev/vdc
fs created label (null) on /dev/vdb
        nodesize 4096 leafsize 4096 sectorsize 4096 size 14.65GB
Btrfs v0.20-rc1-254-gb0136aa-dirty
root@debian-i386:~# mount -o hot_move /dev/vdb /data2
[  144.855471] device fsid 197d47a7-b9cd-46a8-9360-eb087b119424 devid 1 transid 6 /dev/vdb
[  144.870444] btrfs: disk space caching is enabled
[  144.904214] VFS: Turning on hot data tracking
root@debian-i386:~# dd if=/dev/zero of=/data2/test1 bs=1M count=2048
2048+0 records in
2048+0 records out
2147483648 bytes (2.1 GB) copied, 23.4948 s, 91.4 MB/s
root@debian-i386:~# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/vda1              16G   13G  2.2G  86% /
tmpfs                 4.8G     0  4.8G   0% /lib/init/rw
udev                   10M  176K  9.9M   2% /dev
tmpfs                 4.8G     0  4.8G   0% /dev/shm
/dev/vdb               15G  2.0G   13G  14% /data2
root@debian-i386:~# btrfs fi df /data2
Data: total=3.01GB, used=2.00GB
System: total=4.00MB, used=4.00KB
Metadata: total=8.00MB, used=2.19MB
Data_SSD: total=8.00MB, used=0.00
root@debian-i386:~# echo 108 > /proc/sys/fs/hot-reloc-threshold
  ^^^ The above command will start HOT RELOCATE, because the data temperature is currently 109
root@debian-i386:~# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/vda1              16G   13G  2.2G  86% /
tmpfs                 4.8G     0  4.8G   0% /lib/init/rw
udev                   10M  176K  9.9M   2% /dev
tmpfs                 4.8G     0  4.8G   0% /dev/shm
/dev/vdb               15G  2.1G   13G  14% /data2
root@debian-i386:~# btrfs fi df /data2
Data: total=3.01GB, used=6.25MB
System: total=4.00MB, used=4.00KB
Metadata: total=8.00MB, used=2.26MB
Data_SSD: total=2.01GB, used=2.00GB
root@debian-i386:~#

Changelog from v1:
- Fixed up one nospc bug which was introduced by this feature.

v1:
- Refactored the introduction of the new block group.

Zhi Yong Wu (5):
  BTRFS hot reloc, vfs: add one list_head field
  BTRFS hot reloc: add one new block group
  BTRFS hot reloc: add one hot reloc thread
  BTRFS hot reloc, procfs: add three proc interfaces
  BTRFS hot reloc: add hot relocation support

 fs/btrfs/Makefile            |    3 +-
 fs/btrfs/ctree.h             |   35 ++-
 fs/btrfs/extent-tree.c       |   99 ++++--
 fs/btrfs/extent_io.c         |   51 ++-
 fs/btrfs/extent_io.h         |    7 +
 fs/btrfs/file.c              |   27 +-
 fs/btrfs/free-space-cache.c  |    2 +-
 fs/btrfs/hot_relocate.c      |  721 +++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/hot_relocate.h      |   38 +++
 fs/btrfs/inode-map.c         |    7 +-
 fs/btrfs/inode.c             |  106 +++++--
 fs/btrfs/ioctl.c             |   17 +-
 fs/btrfs/relocation.c        |    6 +-
 fs/btrfs/super.c             |   30 +-
 fs/btrfs/volumes.c           |   29 +-
 fs/hot_tracking.c            |    1 +
 include/linux/btrfs.h        |    4 +
 include/linux/hot_tracking.h |    1 +
 kernel/sysctl.c              |   22 ++
 19 files changed, 1132 insertions(+), 74 deletions(-)
 create mode 100644 fs/btrfs/hot_relocate.c
 create mode 100644 fs/btrfs/hot_relocate.h

--
1.7.11.7
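To make the promote/demote policy described in the cover letter concrete, here is a small self-contained userspace model. It is not part of the patchset: RELOC_THRESHOLD mirrors the /proc/sys/fs/hot-reloc-threshold value used in the example session above, while HIGH_WATER_PCT, decide() and the placement names are illustrative assumptions only.

#include <stdio.h>

/*
 * Toy model of the relocation policy: hot ranges on the rotating
 * disk are promoted once their temperature reaches the threshold;
 * cold ranges are demoted only when the SSD is under space pressure.
 */
#define RELOC_THRESHOLD 108     /* cf. echo 108 > .../hot-reloc-threshold */
#define HIGH_WATER_PCT   75     /* assumed SSD usage upper watermark */

enum placement { ON_HDD, ON_SSD };

static const char *decide(int temperature, enum placement where, int ssd_pct)
{
        if (where == ON_HDD && temperature >= RELOC_THRESHOLD)
                return "promote to SSD";
        if (where == ON_SSD && ssd_pct >= HIGH_WATER_PCT &&
            temperature < RELOC_THRESHOLD)
                return "demote to HDD"; /* free SSD space first */
        return "leave in place";
}

int main(void)
{
        printf("temp 109 on HDD:       %s\n", decide(109, ON_HDD, 40));
        printf("temp  12 on SSD (80%%): %s\n", decide(12, ON_SSD, 80));
        printf("temp  12 on SSD (40%%): %s\n", decide(12, ON_SSD, 40));
        return 0;
}

In the example session above this is exactly what happens: the data temperature is 109, so lowering the threshold to 108 moves the 2 GB test file from Data into Data_SSD.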
zwu.kernel@gmail.com
2013-Jun-21 12:20 UTC
[RFC PATCH v2 1/5] BTRFS hot reloc, vfs: add one list_head field
From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>

Add one list_head field 'reloc_list' to accommodate hot relocation support.

Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
---
 fs/hot_tracking.c            | 1 +
 include/linux/hot_tracking.h | 1 +
 2 files changed, 2 insertions(+)

diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
index dbc90d4..f013182 100644
--- a/fs/hot_tracking.c
+++ b/fs/hot_tracking.c
@@ -44,6 +44,7 @@ static void hot_comm_item_init(struct hot_comm_item *ci, int type)
 	clear_bit(HOT_IN_LIST, &ci->delete_flag);
 	clear_bit(HOT_DELETING, &ci->delete_flag);
 	INIT_LIST_HEAD(&ci->track_list);
+	INIT_LIST_HEAD(&ci->reloc_list);
 	memset(&ci->hot_freq_data, 0, sizeof(struct hot_freq_data));
 	ci->hot_freq_data.avg_delta_reads = (u64) -1;
 	ci->hot_freq_data.avg_delta_writes = (u64) -1;
diff --git a/include/linux/hot_tracking.h b/include/linux/hot_tracking.h
index 1009377..98bb092 100644
--- a/include/linux/hot_tracking.h
+++ b/include/linux/hot_tracking.h
@@ -75,6 +75,7 @@ struct hot_comm_item {
 	unsigned long delete_flag;
 	struct rcu_head c_rcu;
 	struct list_head track_list;	/* link to *_map[] */
+	struct list_head reloc_list;	/* used in hot relocation */
 };

 /* An item representing an inode and its access frequency */
--
1.7.11.7
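The reason hot_comm_item needs a second list_head is that the same item must sit on two independent lists at once: track_list keeps it in its heat-map bucket, while reloc_list lets the relocation thread queue it without disturbing the heat map. A minimal kernel-style sketch (not from the series; relocq and the two helpers are illustrative, mirroring how patch 3's hot_queue_extent() uses the field):

#include <linux/list.h>
#include <linux/hot_tracking.h>

static LIST_HEAD(relocq);	/* cf. hot_relocq[] in patch 3 */

static void queue_for_reloc(struct hot_comm_item *ci)
{
	/* Item stays on ci->track_list; only reloc_list is linked here.
	 * The real code also takes a reference via hot_comm_item_get(). */
	list_add_tail(&ci->reloc_list, &relocq);
}

static void drain_relocq(void)
{
	struct hot_comm_item *ci, *next;

	list_for_each_entry_safe(ci, next, &relocq, reloc_list)
		list_del_init(&ci->reloc_list);	/* then hot_comm_item_put() */
}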
zwu.kernel@gmail.com
2013-Jun-21 12:20 UTC
[RFC PATCH v2 2/5] BTRFS hot reloc: add one new block group
From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>

Introduce one new block group, BTRFS_BLOCK_GROUP_DATA_NONROT, which is used to indicate whether block space is reserved and allocated from a rotating or a nonrotating disk.

Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
---
 fs/btrfs/ctree.h            |  33 +++++++++---
 fs/btrfs/extent-tree.c      |  99 +++++++++++++++++++++++++++++++--------
 fs/btrfs/extent_io.c        |  51 ++++++++++++++++++++-
 fs/btrfs/extent_io.h        |   7 +++
 fs/btrfs/file.c             |  27 +++++++----
 fs/btrfs/free-space-cache.c |   2 +-
 fs/btrfs/inode-map.c        |   7 +--
 fs/btrfs/inode.c            | 106 +++++++++++++++++++++++++++++---------
 fs/btrfs/ioctl.c            |  17 ++++---
 fs/btrfs/relocation.c       |   6 ++-
 fs/btrfs/super.c            |   4 +-
 fs/btrfs/volumes.c          |  29 +++++++++++-
 12 files changed, 318 insertions(+), 70 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 745cac4..1c11be1 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -963,6 +963,12 @@ struct btrfs_dev_replace_item {
 #define BTRFS_BLOCK_GROUP_RAID10	(1ULL << 6)
 #define BTRFS_BLOCK_GROUP_RAID5	(1 << 7)
 #define BTRFS_BLOCK_GROUP_RAID6	(1 << 8)
+/*
+ * New block group for use with the BTRFS hot relocation feature.
+ * When BTRFS hot relocation is enabled, the *_NONROT block group is
+ * forced to nonrotating drives.
+ */
+#define BTRFS_BLOCK_GROUP_DATA_NONROT	(1ULL << 9)
 #define BTRFS_BLOCK_GROUP_RESERVED	BTRFS_AVAIL_ALLOC_BIT_SINGLE

 enum btrfs_raid_types {
@@ -978,7 +984,8 @@ enum btrfs_raid_types {

 #define BTRFS_BLOCK_GROUP_TYPE_MASK	(BTRFS_BLOCK_GROUP_DATA |    \
					 BTRFS_BLOCK_GROUP_SYSTEM |  \
-					 BTRFS_BLOCK_GROUP_METADATA)
+					 BTRFS_BLOCK_GROUP_METADATA | \
+					 BTRFS_BLOCK_GROUP_DATA_NONROT)

 #define BTRFS_BLOCK_GROUP_PROFILE_MASK	(BTRFS_BLOCK_GROUP_RAID0 |   \
					 BTRFS_BLOCK_GROUP_RAID1 |   \
@@ -1521,6 +1528,7 @@ struct btrfs_fs_info {
 	struct list_head space_info;

 	struct btrfs_space_info *data_sinfo;
+	struct btrfs_space_info *nonrot_data_sinfo;

 	struct reloc_control *reloc_ctl;

@@ -1545,6 +1553,7 @@ struct btrfs_fs_info {
 	u64 avail_data_alloc_bits;
 	u64 avail_metadata_alloc_bits;
 	u64 avail_system_alloc_bits;
+	u64 avail_data_nonrot_alloc_bits;

 	/* restriper state */
 	spinlock_t balance_lock;
@@ -1557,6 +1566,7 @@ struct btrfs_fs_info {

 	unsigned data_chunk_allocations;
 	unsigned metadata_ratio;
+	unsigned data_nonrot_chunk_allocations;

 	void *bdev_holder;

@@ -1928,6 +1938,7 @@ struct btrfs_ioctl_defrag_range_args {
 #define BTRFS_MOUNT_CHECK_INTEGRITY_INCLUDING_EXTENT_DATA (1 << 21)
 #define BTRFS_MOUNT_PANIC_ON_FATAL_ERROR	(1 << 22)
 #define BTRFS_MOUNT_HOT_TRACK		(1 << 23)
+#define BTRFS_MOUNT_HOT_MOVE		(1 << 24)

 #define btrfs_clear_opt(o, opt)	((o) &= ~BTRFS_MOUNT_##opt)
 #define btrfs_set_opt(o, opt)	((o) |= BTRFS_MOUNT_##opt)
@@ -3043,6 +3054,8 @@ int btrfs_pin_extent_for_log_replay(struct btrfs_root *root,
 int btrfs_cross_ref_exist(struct btrfs_trans_handle *trans,
			  struct btrfs_root *root,
			  u64 objectid, u64 offset, u64 bytenr);
+struct btrfs_block_group_cache *btrfs_lookup_first_block_group(
+		struct btrfs_fs_info *info, u64 bytenr);
 struct btrfs_block_group_cache *btrfs_lookup_block_group(
		struct btrfs_fs_info *info, u64 bytenr);
@@ -3093,6 +3106,8 @@ int btrfs_inc_extent_ref(struct btrfs_trans_handle *trans,
			 struct btrfs_root *root,
			 u64 bytenr, u64 num_bytes, u64 parent,
			 u64 root_objectid, u64 owner, u64 offset, int for_cow);
+struct btrfs_block_group_cache *next_block_group(struct btrfs_root *root,
+		struct btrfs_block_group_cache *cache);

 int btrfs_write_dirty_block_groups(struct btrfs_trans_handle *trans,
				   struct btrfs_root *root);
@@ -3122,8 +3137,14 @@ enum
btrfs_reserve_flush_enum { BTRFS_RESERVE_FLUSH_ALL, }; -int btrfs_check_data_free_space(struct inode *inode, u64 bytes); -void btrfs_free_reserved_data_space(struct inode *inode, u64 bytes); +enum { + TYPE_ROT = 0, /* rot -> rotating */ + TYPE_NONROT, /* nonrot -> nonrotating */ + MAX_RELOC_TYPES, +}; + +int btrfs_check_data_free_space(struct inode *inode, u64 bytes, int *flag); +void btrfs_free_reserved_data_space(struct inode *inode, u64 bytes, int flag); void btrfs_trans_release_metadata(struct btrfs_trans_handle *trans, struct btrfs_root *root); int btrfs_orphan_reserve_metadata(struct btrfs_trans_handle *trans, @@ -3138,8 +3159,8 @@ void btrfs_subvolume_release_metadata(struct btrfs_root *root, u64 qgroup_reserved); int btrfs_delalloc_reserve_metadata(struct inode *inode, u64 num_bytes); void btrfs_delalloc_release_metadata(struct inode *inode, u64 num_bytes); -int btrfs_delalloc_reserve_space(struct inode *inode, u64 num_bytes); -void btrfs_delalloc_release_space(struct inode *inode, u64 num_bytes); +int btrfs_delalloc_reserve_space(struct inode *inode, u64 num_bytes, int *flag); +void btrfs_delalloc_release_space(struct inode *inode, u64 num_bytes, int flag); void btrfs_init_block_rsv(struct btrfs_block_rsv *rsv, unsigned short type); struct btrfs_block_rsv *btrfs_alloc_block_rsv(struct btrfs_root *root, unsigned short type); @@ -3612,7 +3633,7 @@ int btrfs_release_file(struct inode *inode, struct file *file); int btrfs_dirty_pages(struct btrfs_root *root, struct inode *inode, struct page **pages, size_t num_pages, loff_t pos, size_t write_bytes, - struct extent_state **cached); + struct extent_state **cached, int flag); /* tree-defrag.c */ int btrfs_defrag_leaves(struct btrfs_trans_handle *trans, diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index df472ab..a7b3044 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -628,7 +628,7 @@ static int cache_block_group(struct btrfs_block_group_cache *cache, /* * return the block group that starts at or after bytenr */ -static struct btrfs_block_group_cache * +struct btrfs_block_group_cache * btrfs_lookup_first_block_group(struct btrfs_fs_info *info, u64 bytenr) { struct btrfs_block_group_cache *cache; @@ -3027,7 +3027,7 @@ fail: } -static struct btrfs_block_group_cache * +struct btrfs_block_group_cache * next_block_group(struct btrfs_root *root, struct btrfs_block_group_cache *cache) { @@ -3056,6 +3056,7 @@ static int cache_save_setup(struct btrfs_block_group_cache *block_group, int num_pages = 0; int retries = 0; int ret = 0; + int flag = TYPE_ROT; /* * If this block group is smaller than 100 megs don''t bother caching the @@ -3144,7 +3145,7 @@ again: num_pages *= 16; num_pages *= PAGE_CACHE_SIZE; - ret = btrfs_check_data_free_space(inode, num_pages); + ret = btrfs_check_data_free_space(inode, num_pages, &flag); if (ret) goto out_put; @@ -3153,7 +3154,8 @@ again: &alloc_hint); if (!ret) dcs = BTRFS_DC_SETUP; - btrfs_free_reserved_data_space(inode, num_pages); + + btrfs_free_reserved_data_space(inode, num_pages, flag); out_put: iput(inode); @@ -3355,6 +3357,8 @@ static int update_space_info(struct btrfs_fs_info *info, u64 flags, list_add_rcu(&found->list, &info->space_info); if (flags & BTRFS_BLOCK_GROUP_DATA) info->data_sinfo = found; + else if (flags & BTRFS_BLOCK_GROUP_DATA_NONROT) + info->nonrot_data_sinfo = found; return 0; } @@ -3370,6 +3374,8 @@ static void set_avail_alloc_bits(struct btrfs_fs_info *fs_info, u64 flags) fs_info->avail_metadata_alloc_bits |= extra_flags; if (flags & 
BTRFS_BLOCK_GROUP_SYSTEM) fs_info->avail_system_alloc_bits |= extra_flags; + if (flags & BTRFS_BLOCK_GROUP_DATA_NONROT) + fs_info->avail_data_nonrot_alloc_bits |= extra_flags; write_sequnlock(&fs_info->profiles_lock); } @@ -3476,18 +3482,27 @@ static u64 get_alloc_profile(struct btrfs_root *root, u64 flags) flags |= root->fs_info->avail_system_alloc_bits; else if (flags & BTRFS_BLOCK_GROUP_METADATA) flags |= root->fs_info->avail_metadata_alloc_bits; + else if (flags & BTRFS_BLOCK_GROUP_DATA_NONROT) + flags |= root->fs_info->avail_data_nonrot_alloc_bits; } while (read_seqretry(&root->fs_info->profiles_lock, seq)); return btrfs_reduce_alloc_profile(root, flags); } +/* + * Turns a chunk_type integer into set of block group flags (a profile). + * Hot relocation code adds chunk_type 2 for hot data specific block + * group type. + */ u64 btrfs_get_alloc_profile(struct btrfs_root *root, int data) { u64 flags; u64 ret; - if (data) + if (data == 1) flags = BTRFS_BLOCK_GROUP_DATA; + else if (data == 2) + flags = BTRFS_BLOCK_GROUP_DATA_NONROT; else if (root == root->fs_info->chunk_root) flags = BTRFS_BLOCK_GROUP_SYSTEM; else @@ -3501,13 +3516,14 @@ u64 btrfs_get_alloc_profile(struct btrfs_root *root, int data) * This will check the space that the inode allocates from to make sure we have * enough space for bytes. */ -int btrfs_check_data_free_space(struct inode *inode, u64 bytes) +int btrfs_check_data_free_space(struct inode *inode, u64 bytes, int *flag) { struct btrfs_space_info *data_sinfo; struct btrfs_root *root = BTRFS_I(inode)->root; struct btrfs_fs_info *fs_info = root->fs_info; u64 used; int ret = 0, committed = 0, alloc_chunk = 1; + int data, tried = 0; /* make sure bytes are sectorsize aligned */ bytes = ALIGN(bytes, root->sectorsize); @@ -3518,7 +3534,15 @@ int btrfs_check_data_free_space(struct inode *inode, u64 bytes) committed = 1; } - data_sinfo = fs_info->data_sinfo; + if (*flag == TYPE_NONROT) { +try_nonrot: + data = 2; + data_sinfo = fs_info->nonrot_data_sinfo; + } else { + data = 1; + data_sinfo = fs_info->data_sinfo; + } + if (!data_sinfo) goto alloc; @@ -3536,13 +3560,22 @@ again: * if we don''t have enough free bytes in this space then we need * to alloc a new chunk. */ - if (!data_sinfo->full && alloc_chunk) { + if (alloc_chunk) { u64 alloc_target; + if (data_sinfo->full) { + if (!tried) { + tried = 1; + spin_unlock(&data_sinfo->lock); + goto try_nonrot; + } else + goto non_alloc; + } + data_sinfo->force_alloc = CHUNK_ALLOC_FORCE; spin_unlock(&data_sinfo->lock); alloc: - alloc_target = btrfs_get_alloc_profile(root, 1); + alloc_target = btrfs_get_alloc_profile(root, data); trans = btrfs_join_transaction(root); if (IS_ERR(trans)) return PTR_ERR(trans); @@ -3559,11 +3592,13 @@ alloc: } if (!data_sinfo) - data_sinfo = fs_info->data_sinfo; + data_sinfo = (data == 1) ? fs_info->data_sinfo : + fs_info->nonrot_data_sinfo; goto again; } +non_alloc: /* * If we have less pinned bytes than we want to allocate then * don''t bother committing the transaction, it won''t help us. 
@@ -3574,7 +3609,7 @@ alloc: /* commit the current transaction and try again */ commit_trans: - if (!committed && + if (!committed && data_sinfo && !atomic_read(&root->fs_info->open_ioctl_trans)) { committed = 1; trans = btrfs_join_transaction(root); @@ -3588,6 +3623,10 @@ commit_trans: return -ENOSPC; } + + if (tried) + *flag = TYPE_NONROT; + data_sinfo->bytes_may_use += bytes; trace_btrfs_space_reservation(root->fs_info, "space_info", data_sinfo->flags, bytes, 1); @@ -3599,7 +3638,7 @@ commit_trans: /* * Called if we need to clear a data reservation for this inode. */ -void btrfs_free_reserved_data_space(struct inode *inode, u64 bytes) +void btrfs_free_reserved_data_space(struct inode *inode, u64 bytes, int flag) { struct btrfs_root *root = BTRFS_I(inode)->root; struct btrfs_space_info *data_sinfo; @@ -3607,7 +3646,10 @@ void btrfs_free_reserved_data_space(struct inode *inode, u64 bytes) /* make sure bytes are sectorsize aligned */ bytes = ALIGN(bytes, root->sectorsize); - data_sinfo = root->fs_info->data_sinfo; + if (flag == TYPE_NONROT) + data_sinfo = root->fs_info->nonrot_data_sinfo; + else + data_sinfo = root->fs_info->data_sinfo; spin_lock(&data_sinfo->lock); data_sinfo->bytes_may_use -= bytes; trace_btrfs_space_reservation(root->fs_info, "space_info", @@ -3791,6 +3833,13 @@ again: force_metadata_allocation(fs_info); } + if (flags & BTRFS_BLOCK_GROUP_DATA_NONROT && fs_info->metadata_ratio) { + fs_info->data_nonrot_chunk_allocations++; + if (!(fs_info->data_nonrot_chunk_allocations % + fs_info->metadata_ratio)) + force_metadata_allocation(fs_info); + } + /* * Check if we have enough space in SYSTEM chunk because we may need * to update devices. @@ -4497,6 +4546,13 @@ static u64 calc_global_metadata_size(struct btrfs_fs_info *fs_info) meta_used = sinfo->bytes_used; spin_unlock(&sinfo->lock); + sinfo = __find_space_info(fs_info, BTRFS_BLOCK_GROUP_DATA_NONROT); + if (sinfo) { + spin_lock(&sinfo->lock); + data_used += sinfo->bytes_used; + spin_unlock(&sinfo->lock); + } + num_bytes = (data_used >> fs_info->sb->s_blocksize_bits) * csum_size * 2; num_bytes += div64_u64(data_used + meta_used, 50); @@ -4972,6 +5028,7 @@ void btrfs_delalloc_release_metadata(struct inode *inode, u64 num_bytes) * btrfs_delalloc_reserve_space - reserve data and metadata space for delalloc * @inode: inode we''re writing to * @num_bytes: the number of bytes we want to allocate + * @flag: indicate if block space is reserved from rotating disk or not * * This will do the following things * @@ -4983,17 +5040,17 @@ void btrfs_delalloc_release_metadata(struct inode *inode, u64 num_bytes) * * This will return 0 for success and -ENOSPC if there is no space left. 
*/ -int btrfs_delalloc_reserve_space(struct inode *inode, u64 num_bytes) +int btrfs_delalloc_reserve_space(struct inode *inode, u64 num_bytes, int *flag) { int ret; - ret = btrfs_check_data_free_space(inode, num_bytes); + ret = btrfs_check_data_free_space(inode, num_bytes, flag); if (ret) return ret; ret = btrfs_delalloc_reserve_metadata(inode, num_bytes); if (ret) { - btrfs_free_reserved_data_space(inode, num_bytes); + btrfs_free_reserved_data_space(inode, num_bytes, *flag); return ret; } @@ -5004,6 +5061,7 @@ int btrfs_delalloc_reserve_space(struct inode *inode, u64 num_bytes) * btrfs_delalloc_release_space - release data and metadata space for delalloc * @inode: inode we''re releasing space for * @num_bytes: the number of bytes we want to free up + * @flag: indicate if block space is freed from rotating disk or not * * This must be matched with a call to btrfs_delalloc_reserve_space. This is * called in the case that we don''t need the metadata AND data reservations @@ -5013,10 +5071,10 @@ int btrfs_delalloc_reserve_space(struct inode *inode, u64 num_bytes) * decrement ->delalloc_bytes and remove it from the fs_info delalloc_inodes * list if there are no delalloc bytes left. */ -void btrfs_delalloc_release_space(struct inode *inode, u64 num_bytes) +void btrfs_delalloc_release_space(struct inode *inode, u64 num_bytes, int flag) { btrfs_delalloc_release_metadata(inode, num_bytes); - btrfs_free_reserved_data_space(inode, num_bytes); + btrfs_free_reserved_data_space(inode, num_bytes, flag); } static int update_block_group(struct btrfs_root *root, @@ -5892,7 +5950,8 @@ static noinline int find_free_extent(struct btrfs_trans_handle *trans, struct btrfs_space_info *space_info; int loop = 0; int index = __get_raid_index(flags); - int alloc_type = (flags & BTRFS_BLOCK_GROUP_DATA) ? + int alloc_type = ((flags & BTRFS_BLOCK_GROUP_DATA) + || (flags & BTRFS_BLOCK_GROUP_DATA_NONROT)) ? RESERVE_ALLOC_NO_ACCOUNT : RESERVE_ALLOC; bool found_uncached_bg = false; bool failed_cluster_refill = false; @@ -8366,6 +8425,8 @@ static void clear_avail_alloc_bits(struct btrfs_fs_info *fs_info, u64 flags) fs_info->avail_metadata_alloc_bits &= ~extra_flags; if (flags & BTRFS_BLOCK_GROUP_SYSTEM) fs_info->avail_system_alloc_bits &= ~extra_flags; + if (flags & BTRFS_BLOCK_GROUP_DATA_NONROT) + fs_info->avail_data_nonrot_alloc_bits &= ~extra_flags; write_sequnlock(&fs_info->profiles_lock); } diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index e7e7afb..6fbfc90 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -1229,6 +1229,26 @@ int clear_extent_uptodate(struct extent_io_tree *tree, u64 start, u64 end, cached_state, mask); } +void set_extent_hot(struct inode *inode, u64 start, u64 end, + struct extent_state **cached_state, + int type, int flag) +{ + int bits = (type == TYPE_NONROT) ? EXTENT_HOT : EXTENT_COLD; + + if (flag) { + bits |= EXTENT_DELALLOC | EXTENT_UPTODATE; + + clear_extent_bit(&BTRFS_I(inode)->io_tree, start, end, + EXTENT_DIRTY | EXTENT_DELALLOC | + EXTENT_DO_ACCOUNTING | + EXTENT_HOT | EXTENT_COLD, + 0, 0, cached_state, GFP_NOFS); + } + + set_extent_bit(&BTRFS_I(inode)->io_tree, start, + end, bits, NULL, cached_state, GFP_NOFS); +} + /* * either insert or lock state struct between start and end use mask to tell * us if waiting is desired. 
@@ -1430,9 +1450,11 @@ static noinline u64 find_delalloc_range(struct extent_io_tree *tree, { struct rb_node *node; struct extent_state *state; + struct btrfs_root *root; u64 cur_start = *start; u64 found = 0; u64 total_bytes = 0; + int flag = EXTENT_DELALLOC; spin_lock(&tree->lock); @@ -1447,13 +1469,27 @@ static noinline u64 find_delalloc_range(struct extent_io_tree *tree, goto out; } + root = BTRFS_I(tree->mapping->host)->root; while (1) { state = rb_entry(node, struct extent_state, rb_node); if (found && (state->start != cur_start || (state->state & EXTENT_BOUNDARY))) { goto out; } - if (!(state->state & EXTENT_DELALLOC)) { + if (btrfs_test_opt(root, HOT_MOVE)) { + if (!(state->state & EXTENT_DELALLOC) || + (!(state->state & EXTENT_HOT) && + !(state->state & EXTENT_COLD))) { + if (!found) + *end = state->end; + goto out; + } else { + if (!found) + flag = (state->state & EXTENT_HOT) ? + EXTENT_HOT : EXTENT_COLD; + } + } + if (!(state->state & flag)) { if (!found) *end = state->end; goto out; @@ -1640,7 +1676,13 @@ again: lock_extent_bits(tree, delalloc_start, delalloc_end, 0, &cached_state); /* then test to make sure it is all still delalloc */ - ret = test_range_bit(tree, delalloc_start, delalloc_end, + if (btrfs_test_opt(BTRFS_I(inode)->root, HOT_MOVE)) { + ret = test_range_bit(tree, delalloc_start, delalloc_end, + EXTENT_DELALLOC | EXTENT_HOT, 1, cached_state) || + test_range_bit(tree, delalloc_start, delalloc_end, + EXTENT_DELALLOC | EXTENT_COLD, 1, cached_state); + } else + ret = test_range_bit(tree, delalloc_start, delalloc_end, EXTENT_DELALLOC, 1, cached_state); if (!ret) { unlock_extent_cached(tree, delalloc_start, delalloc_end, @@ -1678,6 +1720,11 @@ int extent_clear_unlock_delalloc(struct inode *inode, if (op & EXTENT_CLEAR_DELALLOC) clear_bits |= EXTENT_DELALLOC; + if (op & EXTENT_CLEAR_HOT) + clear_bits |= EXTENT_HOT; + if (op & EXTENT_CLEAR_COLD) + clear_bits |= EXTENT_COLD; + clear_extent_bit(tree, start, end, clear_bits, 1, 0, NULL, GFP_NOFS); if (!(op & (EXTENT_CLEAR_UNLOCK_PAGE | EXTENT_CLEAR_DIRTY | EXTENT_SET_WRITEBACK | EXTENT_END_WRITEBACK | diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h index 41fb81e..ef63452 100644 --- a/fs/btrfs/extent_io.h +++ b/fs/btrfs/extent_io.h @@ -19,6 +19,8 @@ #define EXTENT_FIRST_DELALLOC (1 << 12) #define EXTENT_NEED_WAIT (1 << 13) #define EXTENT_DAMAGED (1 << 14) +#define EXTENT_HOT (1 << 15) +#define EXTENT_COLD (1 << 16) #define EXTENT_IOBITS (EXTENT_LOCKED | EXTENT_WRITEBACK) #define EXTENT_CTLBITS (EXTENT_DO_ACCOUNTING | EXTENT_FIRST_DELALLOC) @@ -51,6 +53,8 @@ #define EXTENT_END_WRITEBACK 0x20 #define EXTENT_SET_PRIVATE2 0x40 #define EXTENT_CLEAR_ACCOUNTING 0x80 +#define EXTENT_CLEAR_HOT 0x100 +#define EXTENT_CLEAR_COLD 0x200 /* * page->private values. 
Every page that is controlled by the extent @@ -237,6 +241,9 @@ int set_extent_delalloc(struct extent_io_tree *tree, u64 start, u64 end, struct extent_state **cached_state, gfp_t mask); int set_extent_defrag(struct extent_io_tree *tree, u64 start, u64 end, struct extent_state **cached_state, gfp_t mask); +void set_extent_hot(struct inode *inode, u64 start, u64 end, + struct extent_state **cached_state, + int type, int flag); int find_first_extent_bit(struct extent_io_tree *tree, u64 start, u64 *start_ret, u64 *end_ret, unsigned long bits, struct extent_state **cached_state); diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index 4205ba7..e3c58c4 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -41,6 +41,7 @@ #include "locking.h" #include "compat.h" #include "volumes.h" +#include "hot_relocate.h" static struct kmem_cache *btrfs_inode_defrag_cachep; /* @@ -500,7 +501,7 @@ static void btrfs_drop_pages(struct page **pages, size_t num_pages) int btrfs_dirty_pages(struct btrfs_root *root, struct inode *inode, struct page **pages, size_t num_pages, loff_t pos, size_t write_bytes, - struct extent_state **cached) + struct extent_state **cached, int flag) { int err = 0; int i; @@ -514,6 +515,11 @@ int btrfs_dirty_pages(struct btrfs_root *root, struct inode *inode, num_bytes = ALIGN(write_bytes + pos - start_pos, root->sectorsize); end_of_last_block = start_pos + num_bytes - 1; + + if (btrfs_test_opt(root, HOT_MOVE)) + set_extent_hot(inode, start_pos, end_of_last_block, + cached, flag, 0); + err = btrfs_set_extent_delalloc(inode, start_pos, end_of_last_block, cached); if (err) @@ -1294,7 +1300,8 @@ again: clear_extent_bit(&BTRFS_I(inode)->io_tree, start_pos, last_pos - 1, EXTENT_DIRTY | EXTENT_DELALLOC | - EXTENT_DO_ACCOUNTING | EXTENT_DEFRAG, + EXTENT_DO_ACCOUNTING | EXTENT_DEFRAG | + EXTENT_HOT | EXTENT_COLD, 0, 0, &cached_state, GFP_NOFS); unlock_extent_cached(&BTRFS_I(inode)->io_tree, start_pos, last_pos - 1, &cached_state, @@ -1350,6 +1357,7 @@ static noinline ssize_t __btrfs_buffered_write(struct file *file, PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT; size_t dirty_pages; size_t copied; + int flag = TYPE_ROT; WARN_ON(num_pages > nrptrs); @@ -1363,7 +1371,7 @@ static noinline ssize_t __btrfs_buffered_write(struct file *file, } ret = btrfs_delalloc_reserve_space(inode, - num_pages << PAGE_CACHE_SHIFT); + num_pages << PAGE_CACHE_SHIFT, &flag); if (ret) break; @@ -1377,7 +1385,7 @@ static noinline ssize_t __btrfs_buffered_write(struct file *file, force_page_uptodate); if (ret) { btrfs_delalloc_release_space(inode, - num_pages << PAGE_CACHE_SHIFT); + num_pages << PAGE_CACHE_SHIFT, flag); break; } @@ -1416,16 +1424,16 @@ static noinline ssize_t __btrfs_buffered_write(struct file *file, } btrfs_delalloc_release_space(inode, (num_pages - dirty_pages) << - PAGE_CACHE_SHIFT); + PAGE_CACHE_SHIFT, flag); } if (copied > 0) { ret = btrfs_dirty_pages(root, inode, pages, dirty_pages, pos, copied, - NULL); + NULL, flag); if (ret) { btrfs_delalloc_release_space(inode, - dirty_pages << PAGE_CACHE_SHIFT); + dirty_pages << PAGE_CACHE_SHIFT, flag); btrfs_drop_pages(pages, num_pages); break; } @@ -2150,6 +2158,7 @@ static long btrfs_fallocate(struct file *file, int mode, u64 locked_end; struct extent_map *em; int blocksize = BTRFS_I(inode)->root->sectorsize; + int flag = TYPE_ROT; int ret; alloc_start = round_down(offset, blocksize); @@ -2166,7 +2175,7 @@ static long btrfs_fallocate(struct file *file, int mode, * Make sure we have enough space before we do the * allocation. 
*/ - ret = btrfs_check_data_free_space(inode, alloc_end - alloc_start); + ret = btrfs_check_data_free_space(inode, alloc_end - alloc_start, &flag); if (ret) return ret; if (root->fs_info->quota_enabled) { @@ -2281,7 +2290,7 @@ out: btrfs_qgroup_free(root, alloc_end - alloc_start); out_reserve_fail: /* Let go of our reservation. */ - btrfs_free_reserved_data_space(inode, alloc_end - alloc_start); + btrfs_free_reserved_data_space(inode, alloc_end - alloc_start, flag); return ret; } diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c index e530096..18f0467 100644 --- a/fs/btrfs/free-space-cache.c +++ b/fs/btrfs/free-space-cache.c @@ -1004,7 +1004,7 @@ static int __btrfs_write_out_cache(struct btrfs_root *root, struct inode *inode, io_ctl_zero_remaining_pages(&io_ctl); ret = btrfs_dirty_pages(root, inode, io_ctl.pages, io_ctl.num_pages, - 0, i_size_read(inode), &cached_state); + 0, i_size_read(inode), &cached_state, TYPE_ROT); io_ctl_drop_pages(&io_ctl); unlock_extent_cached(&BTRFS_I(inode)->io_tree, 0, i_size_read(inode) - 1, &cached_state, GFP_NOFS); diff --git a/fs/btrfs/inode-map.c b/fs/btrfs/inode-map.c index 2c66ddb..0f8268c 100644 --- a/fs/btrfs/inode-map.c +++ b/fs/btrfs/inode-map.c @@ -403,6 +403,7 @@ int btrfs_save_ino_cache(struct btrfs_root *root, u64 alloc_hint = 0; int ret; int prealloc; + int flag = TYPE_ROT; bool retry = false; /* only fs tree and subvol/snap needs ino cache */ @@ -492,17 +493,17 @@ again: /* Just to make sure we have enough space */ prealloc += 8 * PAGE_CACHE_SIZE; - ret = btrfs_delalloc_reserve_space(inode, prealloc); + ret = btrfs_delalloc_reserve_space(inode, prealloc, &flag); if (ret) goto out_put; ret = btrfs_prealloc_file_range_trans(inode, trans, 0, 0, prealloc, prealloc, prealloc, &alloc_hint); if (ret) { - btrfs_delalloc_release_space(inode, prealloc); + btrfs_delalloc_release_space(inode, prealloc, flag); goto out_put; } - btrfs_free_reserved_data_space(inode, prealloc); + btrfs_free_reserved_data_space(inode, prealloc, flag); ret = btrfs_write_out_ino_cache(root, trans, path); out_put: diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 17f3064..437d20f 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -57,6 +57,7 @@ #include "free-space-cache.h" #include "inode-map.h" #include "backref.h" +#include "hot_relocate.h" struct btrfs_iget_args { u64 ino; @@ -106,6 +107,27 @@ static struct extent_map *create_pinned_em(struct inode *inode, u64 start, static int btrfs_dirty_inode(struct inode *inode); +static int get_chunk_type(struct inode *inode, u64 start, u64 end) +{ + int hot, cold, ret = 1; + + hot = test_range_bit(&BTRFS_I(inode)->io_tree, + start, end, EXTENT_HOT, 1, NULL); + cold = test_range_bit(&BTRFS_I(inode)->io_tree, + start, end, EXTENT_COLD, 1, NULL); + + WARN_ON(hot && cold); + + if (hot) + ret = 2; + else if (cold) + ret = 1; + else + WARN_ON(1); + + return ret; +} + static int btrfs_init_inode_security(struct btrfs_trans_handle *trans, struct inode *inode, struct inode *dir, const struct qstr *qstr) @@ -861,13 +883,14 @@ static noinline int __cow_file_range(struct btrfs_trans_handle *trans, { u64 alloc_hint = 0; u64 num_bytes; - unsigned long ram_size; + unsigned long ram_size, hot_flag = 0; u64 disk_num_bytes; u64 cur_alloc_size; u64 blocksize = root->sectorsize; struct btrfs_key ins; struct extent_map *em; struct extent_map_tree *em_tree = &BTRFS_I(inode)->extent_tree; + int chunk_type = 1; int ret = 0; BUG_ON(btrfs_is_free_space_inode(inode)); @@ -875,6 +898,7 @@ static noinline int __cow_file_range(struct 
btrfs_trans_handle *trans, num_bytes = ALIGN(end - start + 1, blocksize); num_bytes = max(blocksize, num_bytes); disk_num_bytes = num_bytes; + ret = 0; /* if this is a small write inside eof, kick off defrag */ if (num_bytes < 64 * 1024 && @@ -894,7 +918,8 @@ static noinline int __cow_file_range(struct btrfs_trans_handle *trans, EXTENT_CLEAR_DELALLOC | EXTENT_CLEAR_DIRTY | EXTENT_SET_WRITEBACK | - EXTENT_END_WRITEBACK); + EXTENT_END_WRITEBACK | + hot_flag); *nr_written = *nr_written + (end - start + PAGE_CACHE_SIZE) / PAGE_CACHE_SIZE; @@ -916,9 +941,25 @@ static noinline int __cow_file_range(struct btrfs_trans_handle *trans, unsigned long op; cur_alloc_size = disk_num_bytes; + + /* + * Use COW operations to move hot data to SSD and cold data + * back to rotating disk. Sets chunk_type to 1 to indicate + * to write to BTRFS_BLOCK_GROUP_DATA or 2 to indicate + * BTRFS_BLOCK_GROUP_DATA_NONROT. + */ + if (btrfs_test_opt(root, HOT_MOVE)) { + chunk_type = get_chunk_type(inode, start, + start + cur_alloc_size - 1); + if (chunk_type == 2) + hot_flag = EXTENT_CLEAR_HOT; + else + hot_flag = EXTENT_CLEAR_COLD; + } + ret = btrfs_reserve_extent(trans, root, cur_alloc_size, root->sectorsize, 0, alloc_hint, - &ins, 1); + &ins, chunk_type); if (ret < 0) { btrfs_abort_transaction(trans, root, ret); goto out_unlock; @@ -986,7 +1027,7 @@ static noinline int __cow_file_range(struct btrfs_trans_handle *trans, */ op = unlock ? EXTENT_CLEAR_UNLOCK_PAGE : 0; op |= EXTENT_CLEAR_UNLOCK | EXTENT_CLEAR_DELALLOC | - EXTENT_SET_PRIVATE2; + EXTENT_SET_PRIVATE2 | hot_flag; extent_clear_unlock_delalloc(inode, &BTRFS_I(inode)->io_tree, start, start + ram_size - 1, @@ -1010,7 +1051,8 @@ out_unlock: EXTENT_CLEAR_DELALLOC | EXTENT_CLEAR_DIRTY | EXTENT_SET_WRITEBACK | - EXTENT_END_WRITEBACK); + EXTENT_END_WRITEBACK | + hot_flag); goto out; } @@ -1604,8 +1646,12 @@ static void btrfs_clear_bit_hook(struct inode *inode, btrfs_delalloc_release_metadata(inode, len); if (root->root_key.objectid != BTRFS_DATA_RELOC_TREE_OBJECTID - && do_list) - btrfs_free_reserved_data_space(inode, len); + && do_list) { + int flag = TYPE_ROT; + if ((state->state & EXTENT_HOT) && (*bits & EXTENT_HOT)) + flag = TYPE_NONROT; + btrfs_free_reserved_data_space(inode, len, flag); + } __percpu_counter_add(&root->fs_info->delalloc_bytes, -len, root->fs_info->delalloc_batch); @@ -1800,6 +1846,7 @@ static void btrfs_writepage_fixup_worker(struct btrfs_work *work) u64 page_start; u64 page_end; int ret; + int flag = TYPE_ROT; fixup = container_of(work, struct btrfs_writepage_fixup, work); page = fixup->page; @@ -1831,7 +1878,7 @@ again: goto again; } - ret = btrfs_delalloc_reserve_space(inode, PAGE_CACHE_SIZE); + ret = btrfs_delalloc_reserve_space(inode, PAGE_CACHE_SIZE, &flag); if (ret) { mapping_set_error(page->mapping, ret); end_extent_writepage(page, ret, page_start, page_end); @@ -1839,6 +1886,10 @@ again: goto out; } + if (btrfs_test_opt(BTRFS_I(inode)->root, HOT_MOVE)) + set_extent_hot(inode, page_start, page_end, + &cached_state, flag, 0); + btrfs_set_extent_delalloc(inode, page_start, page_end, &cached_state); ClearPageChecked(page); set_page_dirty(page); @@ -4286,20 +4337,21 @@ int btrfs_truncate_page(struct inode *inode, loff_t from, loff_t len, struct page *page; gfp_t mask = btrfs_alloc_write_mask(mapping); int ret = 0; + int flag = TYPE_ROT; u64 page_start; u64 page_end; if ((offset & (blocksize - 1)) == 0 && (!len || ((len & (blocksize - 1)) == 0))) goto out; - ret = btrfs_delalloc_reserve_space(inode, PAGE_CACHE_SIZE); + ret = 
btrfs_delalloc_reserve_space(inode, PAGE_CACHE_SIZE, &flag); if (ret) goto out; again: page = find_or_create_page(mapping, index, mask); if (!page) { - btrfs_delalloc_release_space(inode, PAGE_CACHE_SIZE); + btrfs_delalloc_release_space(inode, PAGE_CACHE_SIZE, flag); ret = -ENOMEM; goto out; } @@ -4338,9 +4390,14 @@ again: clear_extent_bit(&BTRFS_I(inode)->io_tree, page_start, page_end, EXTENT_DIRTY | EXTENT_DELALLOC | - EXTENT_DO_ACCOUNTING | EXTENT_DEFRAG, + EXTENT_DO_ACCOUNTING | EXTENT_DEFRAG | + EXTENT_HOT | EXTENT_COLD, 0, 0, &cached_state, GFP_NOFS); + if (btrfs_test_opt(root, HOT_MOVE)) + set_extent_hot(inode, page_start, page_end, + &cached_state, flag, 0); + ret = btrfs_set_extent_delalloc(inode, page_start, page_end, &cached_state); if (ret) { @@ -4367,7 +4424,7 @@ again: out_unlock: if (ret) - btrfs_delalloc_release_space(inode, PAGE_CACHE_SIZE); + btrfs_delalloc_release_space(inode, PAGE_CACHE_SIZE, flag); unlock_page(page); page_cache_release(page); out: @@ -7379,6 +7436,7 @@ static ssize_t btrfs_direct_IO(int rw, struct kiocb *iocb, struct inode *inode = file->f_mapping->host; size_t count = 0; int flags = 0; + int flag = TYPE_ROT; bool wakeup = true; bool relock = false; ssize_t ret; @@ -7401,7 +7459,7 @@ static ssize_t btrfs_direct_IO(int rw, struct kiocb *iocb, mutex_unlock(&inode->i_mutex); relock = true; } - ret = btrfs_delalloc_reserve_space(inode, count); + ret = btrfs_delalloc_reserve_space(inode, count, &flag); if (ret) goto out; } else if (unlikely(test_bit(BTRFS_INODE_READDIO_NEED_LOCK, @@ -7417,10 +7475,10 @@ static ssize_t btrfs_direct_IO(int rw, struct kiocb *iocb, btrfs_submit_direct, flags); if (rw & WRITE) { if (ret < 0 && ret != -EIOCBQUEUED) - btrfs_delalloc_release_space(inode, count); + btrfs_delalloc_release_space(inode, count, flag); else if (ret >= 0 && (size_t)ret < count) btrfs_delalloc_release_space(inode, - count - (size_t)ret); + count - (size_t)ret, flag); else btrfs_delalloc_release_metadata(inode, 0); } @@ -7543,7 +7601,8 @@ static void btrfs_invalidatepage(struct page *page, unsigned long offset) clear_extent_bit(tree, page_start, page_end, EXTENT_DIRTY | EXTENT_DELALLOC | EXTENT_LOCKED | EXTENT_DO_ACCOUNTING | - EXTENT_DEFRAG, 1, 0, &cached_state, GFP_NOFS); + EXTENT_DEFRAG | EXTENT_HOT | EXTENT_COLD, + 1, 0, &cached_state, GFP_NOFS); /* * whoever cleared the private bit is responsible * for the finish_ordered_io @@ -7559,7 +7618,8 @@ static void btrfs_invalidatepage(struct page *page, unsigned long offset) } clear_extent_bit(tree, page_start, page_end, EXTENT_LOCKED | EXTENT_DIRTY | EXTENT_DELALLOC | - EXTENT_DO_ACCOUNTING | EXTENT_DEFRAG, 1, 1, + EXTENT_DO_ACCOUNTING | EXTENT_DEFRAG | + EXTENT_HOT | EXTENT_COLD, 1, 1, &cached_state, GFP_NOFS); __btrfs_releasepage(page, GFP_NOFS); @@ -7599,11 +7659,12 @@ int btrfs_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) loff_t size; int ret; int reserved = 0; + int flag = TYPE_ROT; u64 page_start; u64 page_end; sb_start_pagefault(inode->i_sb); - ret = btrfs_delalloc_reserve_space(inode, PAGE_CACHE_SIZE); + ret = btrfs_delalloc_reserve_space(inode, PAGE_CACHE_SIZE, &flag); if (!ret) { ret = file_update_time(vma->vm_file); reserved = 1; @@ -7658,9 +7719,14 @@ again: */ clear_extent_bit(&BTRFS_I(inode)->io_tree, page_start, page_end, EXTENT_DIRTY | EXTENT_DELALLOC | - EXTENT_DO_ACCOUNTING | EXTENT_DEFRAG, + EXTENT_DO_ACCOUNTING | EXTENT_DEFRAG | + EXTENT_HOT | EXTENT_COLD, 0, 0, &cached_state, GFP_NOFS); + if (btrfs_test_opt(root, HOT_MOVE)) + set_extent_hot(inode, page_start, page_end, 
+ &cached_state, flag, 0); + ret = btrfs_set_extent_delalloc(inode, page_start, page_end, &cached_state); if (ret) { @@ -7700,7 +7766,7 @@ out_unlock: } unlock_page(page); out: - btrfs_delalloc_release_space(inode, PAGE_CACHE_SIZE); + btrfs_delalloc_release_space(inode, PAGE_CACHE_SIZE, flag); out_noreserve: sb_end_pagefault(inode->i_sb); return ret; diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index 0f81d67..9401229 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -56,6 +56,7 @@ #include "rcu-string.h" #include "send.h" #include "dev-replace.h" +#include "hot_relocate.h" /* Mask out flags that are inappropriate for the given type of inode. */ static inline __u32 btrfs_mask_flags(umode_t mode, __u32 flags) @@ -1001,6 +1002,7 @@ static int cluster_pages_for_defrag(struct inode *inode, int ret; int i; int i_done; + int flag = TYPE_ROT; struct btrfs_ordered_extent *ordered; struct extent_state *cached_state = NULL; struct extent_io_tree *tree; @@ -1013,7 +1015,7 @@ static int cluster_pages_for_defrag(struct inode *inode, page_cnt = min_t(u64, (u64)num_pages, (u64)file_end - start_index + 1); ret = btrfs_delalloc_reserve_space(inode, - page_cnt << PAGE_CACHE_SHIFT); + page_cnt << PAGE_CACHE_SHIFT, &flag); if (ret) return ret; i_done = 0; @@ -1101,9 +1103,12 @@ again: BTRFS_I(inode)->outstanding_extents++; spin_unlock(&BTRFS_I(inode)->lock); btrfs_delalloc_release_space(inode, - (page_cnt - i_done) << PAGE_CACHE_SHIFT); + (page_cnt - i_done) << PAGE_CACHE_SHIFT, flag); } + if (btrfs_test_opt(BTRFS_I(inode)->root, HOT_MOVE)) + set_extent_hot(inode, page_start, page_end - 1, + &cached_state, flag, 0); set_extent_defrag(&BTRFS_I(inode)->io_tree, page_start, page_end - 1, &cached_state, GFP_NOFS); @@ -1126,7 +1131,8 @@ out: unlock_page(pages[i]); page_cache_release(pages[i]); } - btrfs_delalloc_release_space(inode, page_cnt << PAGE_CACHE_SHIFT); + btrfs_delalloc_release_space(inode, + page_cnt << PAGE_CACHE_SHIFT, flag); return ret; } @@ -3021,8 +3027,9 @@ static long btrfs_ioctl_space_info(struct btrfs_root *root, void __user *arg) u64 types[] = {BTRFS_BLOCK_GROUP_DATA, BTRFS_BLOCK_GROUP_SYSTEM, BTRFS_BLOCK_GROUP_METADATA, - BTRFS_BLOCK_GROUP_DATA | BTRFS_BLOCK_GROUP_METADATA}; - int num_types = 4; + BTRFS_BLOCK_GROUP_DATA | BTRFS_BLOCK_GROUP_METADATA, + BTRFS_BLOCK_GROUP_DATA_NONROT}; + int num_types = 5; int alloc_size; int ret = 0; u64 slot_count = 0; diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c index 4febca4..9ea9d6c 100644 --- a/fs/btrfs/relocation.c +++ b/fs/btrfs/relocation.c @@ -31,6 +31,7 @@ #include "async-thread.h" #include "free-space-cache.h" #include "inode-map.h" +#include "hot_relocate.h" /* * backref_node, mapping_node and tree_block start with this @@ -2938,12 +2939,13 @@ int prealloc_file_extent_cluster(struct inode *inode, u64 num_bytes; int nr = 0; int ret = 0; + int flag = TYPE_ROT; BUG_ON(cluster->start != cluster->boundary[0]); mutex_lock(&inode->i_mutex); ret = btrfs_check_data_free_space(inode, cluster->end + - 1 - cluster->start); + 1 - cluster->start, &flag); if (ret) goto out; @@ -2965,7 +2967,7 @@ int prealloc_file_extent_cluster(struct inode *inode, nr++; } btrfs_free_reserved_data_space(inode, cluster->end + - 1 - cluster->start); + 1 - cluster->start, flag); out: mutex_unlock(&inode->i_mutex); return ret; diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c index f13517b..9ee751f 100644 --- a/fs/btrfs/super.c +++ b/fs/btrfs/super.c @@ -58,6 +58,7 @@ #include "rcu-string.h" #include "dev-replace.h" #include "free-space-cache.h" +#include 
"hot_relocate.h" #define CREATE_TRACE_POINTS #include <trace/events/btrfs.h> @@ -1521,7 +1522,8 @@ static int btrfs_statfs(struct dentry *dentry, struct kstatfs *buf) mutex_lock(&fs_info->chunk_mutex); rcu_read_lock(); list_for_each_entry_rcu(found, head, list) { - if (found->flags & BTRFS_BLOCK_GROUP_DATA) { + if ((found->flags & BTRFS_BLOCK_GROUP_DATA) || + (found->flags & BTRFS_BLOCK_GROUP_DATA_NONROT)) { total_free_data += found->disk_total - found->disk_used; total_free_data - btrfs_account_ro_block_groups_free_space(found); diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index 8bffb91..8b6beec 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -1451,6 +1451,9 @@ int btrfs_rm_device(struct btrfs_root *root, char *device_path) all_avail = root->fs_info->avail_data_alloc_bits | root->fs_info->avail_system_alloc_bits | root->fs_info->avail_metadata_alloc_bits; + if (btrfs_test_opt(root, HOT_MOVE)) + all_avail |+ root->fs_info->avail_data_nonrot_alloc_bits; } while (read_seqretry(&root->fs_info->profiles_lock, seq)); num_devices = root->fs_info->fs_devices->num_devices; @@ -3728,7 +3731,8 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans, devs_increment = btrfs_raid_array[index].devs_increment; ncopies = btrfs_raid_array[index].ncopies; - if (type & BTRFS_BLOCK_GROUP_DATA) { + if (type & BTRFS_BLOCK_GROUP_DATA || + type & BTRFS_BLOCK_GROUP_DATA_NONROT) { max_stripe_size = 1024 * 1024 * 1024; max_chunk_size = 10 * max_stripe_size; } else if (type & BTRFS_BLOCK_GROUP_METADATA) { @@ -3767,9 +3771,30 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans, struct btrfs_device *device; u64 max_avail; u64 dev_offset; + int dev_rot; + int skip = 0; device = list_entry(cur, struct btrfs_device, dev_alloc_list); + /* + * If HOT_MOVE is set, the chunk type being allocated + * determines which disks the data may be allocated on. + * This can cause problems if, for example, the data alloc + * profile is RAID0 and there are only two devices, 1 SSD + + * 1 HDD. All allocations to BTRFS_BLOCK_GROUP_DATA_NONROT + * in this config will return -ENOSPC as the allocation code + * can''t find allowable space for the second stripe. + */ + dev_rot = !blk_queue_nonrot(bdev_get_queue(device->bdev)); + if (btrfs_test_opt(extent_root, HOT_MOVE)) { + int ret1 = type & (BTRFS_BLOCK_GROUP_DATA | + BTRFS_BLOCK_GROUP_METADATA | + BTRFS_BLOCK_GROUP_SYSTEM) && !dev_rot; + int ret2 = type & BTRFS_BLOCK_GROUP_DATA_NONROT && dev_rot; + if (ret1 || ret2) + skip = 1; + } + cur = cur->next; if (!device->writeable) { @@ -3778,7 +3803,7 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans, continue; } - if (!device->in_fs_metadata || + if (skip || !device->in_fs_metadata || device->is_tgtdev_for_dev_replace) continue; -- 1.7.11.7 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
zwu.kernel@gmail.com
2013-Jun-21 12:20 UTC
[RFC PATCH v2 3/5] BTRFS hot reloc: add one hot reloc thread
From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>

Add one private thread for hot relocation. It first checks whether there are extents hotter than the threshold and queues them; if there are none, it returns and waits for its next turn. Otherwise, it checks whether the nonrotating disk ratio is beyond its usage threshold; if not, it directly relocates the hot extents from the rotating disk to the nonrotating disk. Otherwise, it first finds and queues the extents with low temperature, relocates those cold extents back to the rotating disk, and finally relocates the hot extents from the rotating disk to the nonrotating disk.

Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
---
 fs/btrfs/Makefile       |   3 +-
 fs/btrfs/ctree.h        |   2 +
 fs/btrfs/hot_relocate.c | 713 ++++++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/hot_relocate.h |  43 +++
 4 files changed, 760 insertions(+), 1 deletion(-)
 create mode 100644 fs/btrfs/hot_relocate.c
 create mode 100644 fs/btrfs/hot_relocate.h

diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
index 3932224..94f1ea5 100644
--- a/fs/btrfs/Makefile
+++ b/fs/btrfs/Makefile
@@ -8,7 +8,8 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
 	extent_io.o volumes.o async-thread.o ioctl.o locking.o orphan.o \
 	export.o tree-log.o free-space-cache.o zlib.o lzo.o \
 	compression.o delayed-ref.o relocation.o delayed-inode.o scrub.o \
-	reada.o backref.o ulist.o qgroup.o send.o dev-replace.o raid56.o
+	reada.o backref.o ulist.o qgroup.o send.o dev-replace.o raid56.o \
+	hot_relocate.o

 btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
 btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 1c11be1..956115d 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1636,6 +1636,8 @@ struct btrfs_fs_info {
 	struct btrfs_dev_replace dev_replace;

 	atomic_t mutually_exclusive_operation_running;
+
+	void *hot_reloc;
 };

 /*
diff --git a/fs/btrfs/hot_relocate.c b/fs/btrfs/hot_relocate.c
new file mode 100644
index 0000000..ae28b86
--- /dev/null
+++ b/fs/btrfs/hot_relocate.c
@@ -0,0 +1,713 @@
+/*
+ * fs/btrfs/hot_relocate.c
+ *
+ * Copyright (C) 2013 IBM Corp. All rights reserved.
+ * Written by Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
+ *            Ben Chociej <bchociej@gmail.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ */
+
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/blkdev.h>
+#include <linux/writeback.h>
+#include <linux/kthread.h>
+#include <linux/freezer.h>
+#include <linux/module.h>
+#include "hot_relocate.h"
+
+/*
+ * Hot relocation strategy:
+ *
+ * The relocation code below operates on the heat map lists to identify
+ * hot or cold data logical file ranges that are candidates for relocation.
+ * The triggering mechanism for relocation is controlled by a global heat
+ * threshold integer value (HOT_RELOC_THRESHOLD). Ranges are
+ * queued for relocation by the periodically executing relocate kthread,
+ * which updates the global heat threshold and responds to space pressure
+ * on the nonrotating disks.
+ *
+ * The heat map lists index logical ranges by heat and provide a
+ * constant-time access path to hot or cold range items. The relocation
+ * kthread uses this path to find hot or cold items to move to/from
+ * nonrotating disks. To ensure that the relocation kthread has a chance
+ * to sleep, and to prevent thrashing between nonrotating disks and HDD,
+ * there is a configurable limit to how many ranges are moved per
+ * iteration of the kthread. This limit may be overrun in the case where
+ * space pressure requires that items be aggressively moved from
+ * nonrotating disks back to HDD.
+ *
+ * This still needs more resistance to thrashing and stronger (read:
+ * actual) guarantees that relocation operations won't -ENOSPC.
+ *
+ * The relocation code has introduced one new btrfs block group type:
+ * BTRFS_BLOCK_GROUP_DATA_NONROT.
+ *
+ * When mkfs'ing a volume with the hot data relocation option, initial
+ * block groups are allocated to the proper disks. Runtime block group
+ * allocation only allocates BTRFS_BLOCK_GROUP_DATA,
+ * BTRFS_BLOCK_GROUP_METADATA and BTRFS_BLOCK_GROUP_SYSTEM to HDD, and
+ * likewise only allocates BTRFS_BLOCK_GROUP_DATA_NONROT to nonrotating
+ * disks (assuming, critically, the HOT_MOVE option is set at mount time).
+ */
+
+/*
+ * Returns the ratio of nonrotating disks that are full.
+ * If no nonrotating disk is found, returns THRESH_MAX_VALUE + 1.
+ */
+static int hot_calc_nonrot_ratio(struct hot_reloc *hot_reloc)
+{
+	struct btrfs_space_info *info;
+	struct btrfs_device *device, *next;
+	struct btrfs_fs_info *fs_info = hot_reloc->fs_info;
+	u64 total_bytes = 0, bytes_used = 0;
+
+	/*
+	 * Iterate through devices; if they're nonrot,
+	 * add their bytes to total_bytes.
+	 */
+	mutex_lock(&fs_info->fs_devices->device_list_mutex);
+	list_for_each_entry_safe(device, next,
+				 &fs_info->fs_devices->devices, dev_list) {
+		if (blk_queue_nonrot(bdev_get_queue(device->bdev)))
+			total_bytes += device->total_bytes;
+	}
+	mutex_unlock(&fs_info->fs_devices->device_list_mutex);
+
+	if (total_bytes == 0)
+		return THRESH_MAX_VALUE + 1;
+
+	/*
+	 * Iterate through space_info; if a nonrot data block group
+	 * is found, add the bytes used by that group to bytes_used.
+	 */
+	rcu_read_lock();
+	list_for_each_entry_rcu(info, &fs_info->space_info, list) {
+		if (info->flags & BTRFS_BLOCK_GROUP_DATA_NONROT)
+			bytes_used += info->bytes_used;
+	}
+	rcu_read_unlock();
+
+	/* Finish up, return ratio of nonrotating disks filled. */
+	BUG_ON(bytes_used >= total_bytes);
+
+	return (int) div64_u64(bytes_used * 100, total_bytes);
+}
+
+/*
+ * Update heat threshold for hot relocation
+ * based on how full nonrotating disks are.
+ */ +static int hot_update_threshold(struct hot_reloc *hot_reloc, + int update) +{ + int thresh = hot_reloc->thresh; + int ratio = hot_calc_nonrot_ratio(hot_reloc); + + /* Sometimes update global threshold, others not */ + if (!update && ratio < HIGH_WATER_LEVEL) + return ratio; + + if (unlikely(ratio > THRESH_MAX_VALUE)) + thresh = HEAT_MAX_VALUE + 1; + else { + WARN_ON(HIGH_WATER_LEVEL > THRESH_MAX_VALUE + || LOW_WATER_LEVEL < 0); + + if (ratio >= HIGH_WATER_LEVEL) + thresh += THRESH_UP_SPEED; + else if (ratio <= LOW_WATER_LEVEL) + thresh -= THRESH_DOWN_SPEED; + + if (thresh > HEAT_MAX_VALUE) + thresh = HEAT_MAX_VALUE + 1; + else if (thresh < 0) + thresh = 0; + } + + hot_reloc->thresh = thresh; + return ratio; +} + +static bool hot_can_relocate(struct inode *inode, u64 start, + u64 len, u64 *skip, u64 *end) +{ + struct extent_map *em = NULL; + struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree; + struct extent_map_tree *em_tree = &BTRFS_I(inode)->extent_tree; + bool ret = true; + + /* + * Make sure that once we start relocating an extent, + * we keep on relocating it + */ + if (start < *end) + return true; + + *skip = 0; + + /* + * Hopefully we have this extent in the tree already, + * try without the full extent lock + */ + read_lock(&em_tree->lock); + em = lookup_extent_mapping(em_tree, start, len); + read_unlock(&em_tree->lock); + if (!em) { + /* Get the big lock and read metadata off disk */ + lock_extent(io_tree, start, start + len - 1); + em = btrfs_get_extent(inode, NULL, 0, start, len, 0); + unlock_extent(io_tree, start, start + len - 1); + if (IS_ERR(em)) + return false; + } + + /* This will cover holes, and inline extents */ + if (em->block_start >= EXTENT_MAP_LAST_BYTE) + ret = false; + + if (ret) { + *end = extent_map_end(em); + } else { + *skip = extent_map_end(em); + *end = 0; + } + + free_extent_map(em); + return ret; +} + +static void hot_cleanup_relocq(struct list_head *bucket) +{ + struct hot_range_item *hr; + struct hot_comm_item *ci, *ci_next; + + list_for_each_entry_safe(ci, ci_next, bucket, reloc_list) { + hr = container_of(ci, struct hot_range_item, hot_range); + list_del_init(&hr->hot_range.reloc_list); + hot_comm_item_put(ci); + } +} + +static int hot_queue_extent(struct hot_reloc *hot_reloc, + struct list_head *bucket, + u64 *counter, int storage_type) +{ + struct hot_comm_item *ci; + struct hot_range_item *hr; + int st, ret = 0; + + /* Queue hot_ranges */ + list_for_each_entry_rcu(ci, bucket, track_list) { + if (test_bit(HOT_DELETING, &ci->delete_flag)) + continue; + + /* Queue up on relocate list */ + hr = container_of(ci, struct hot_range_item, hot_range); + st = hr->storage_type; + if (st != storage_type) { + list_del_init(&ci->reloc_list); + list_add_tail(&ci->reloc_list, + &hot_reloc->hot_relocq[storage_type]); + hot_comm_item_get(ci); + *counter = *counter + 1; + } + + if (*counter >= HOT_RELOC_MAX_ITEMS) + break; + + if (kthread_should_stop()) { + ret = 1; + break; + } + } + + return ret; +} + +static u64 hot_search_extent(struct hot_reloc *hot_reloc, + int thresh, int storage_type) +{ + struct hot_info *root; + u64 counter = 0; + int i, ret = 0; + + root = hot_reloc->fs_info->sb->s_hot_root; + for (i = HEAT_MAX_VALUE; i >= thresh; i--) { + rcu_read_lock(); + if (!list_empty(&root->hot_map[TYPE_RANGE][i])) + ret = hot_queue_extent(hot_reloc, + &root->hot_map[TYPE_RANGE][i], + &counter, storage_type); + rcu_read_unlock(); + if (ret) { + counter = 0; + break; + } + } + + if (ret) + hot_cleanup_relocq(&hot_reloc->hot_relocq[storage_type]); + + 
return counter; +} + +static int hot_load_file_extent(struct inode *inode, + struct page **pages, + unsigned long start_index, + int num_pages, int storage_type) +{ + unsigned long file_end; + int ret, i, i_done; + u64 isize = i_size_read(inode), page_start, page_end, page_cnt; + struct btrfs_ordered_extent *ordered; + struct extent_state *cached_state = NULL; + struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree; + gfp_t mask = btrfs_alloc_write_mask(inode->i_mapping); + + file_end = (isize - 1) >> PAGE_CACHE_SHIFT; + if (!isize || start_index > file_end) + return 0; + + page_cnt = min_t(u64, (u64)num_pages, (u64)file_end - start_index + 1); + + ret = btrfs_delalloc_reserve_space(inode, + page_cnt << PAGE_CACHE_SHIFT, &storage_type); + if (ret) + return ret; + + i_done = 0; + /* step one, lock all the pages */ + for (i = 0; i < page_cnt; i++) { + struct page *page; +again: + page = find_or_create_page(inode->i_mapping, + start_index + i, mask); + if (!page) + break; + + page_start = page_offset(page); + page_end = page_start + PAGE_CACHE_SIZE - 1; + while (1) { + lock_extent(tree, page_start, page_end); + ordered = btrfs_lookup_ordered_extent(inode, + page_start); + unlock_extent(tree, page_start, page_end); + if (!ordered) + break; + + unlock_page(page); + btrfs_start_ordered_extent(inode, ordered, 1); + btrfs_put_ordered_extent(ordered); + lock_page(page); + /* + * we unlocked the page above, so we need check if + * it was released or not. + */ + if (page->mapping != inode->i_mapping) { + unlock_page(page); + page_cache_release(page); + goto again; + } + } + + if (!PageUptodate(page)) { + btrfs_readpage(NULL, page); + lock_page(page); + if (!PageUptodate(page)) { + unlock_page(page); + page_cache_release(page); + ret = -EIO; + break; + } + } + + if (page->mapping != inode->i_mapping) { + unlock_page(page); + page_cache_release(page); + goto again; + } + + pages[i] = page; + i_done++; + } + if (!i_done || ret) + goto out; + + if (!(inode->i_sb->s_flags & MS_ACTIVE)) + goto out; + + page_start = page_offset(pages[0]); + page_end = page_offset(pages[i_done - 1]) + PAGE_CACHE_SIZE - 1; + + lock_extent_bits(tree, page_start, page_end, 0, &cached_state); + + if (i_done != page_cnt) { + spin_lock(&BTRFS_I(inode)->lock); + BTRFS_I(inode)->outstanding_extents++; + spin_unlock(&BTRFS_I(inode)->lock); + + btrfs_delalloc_release_space(inode, + (page_cnt - i_done) << PAGE_CACHE_SHIFT, + storage_type); + } + + set_extent_hot(inode, page_start, page_end, + &cached_state, storage_type, 1); + unlock_extent_cached(tree, page_start, page_end, + &cached_state, GFP_NOFS); + + for (i = 0; i < i_done; i++) { + clear_page_dirty_for_io(pages[i]); + ClearPageChecked(pages[i]); + set_page_extent_mapped(pages[i]); + set_page_dirty(pages[i]); + unlock_page(pages[i]); + page_cache_release(pages[i]); + } + + /* + * so now we have a nice long stream of locked + * and up to date pages, lets wait on them + */ + for (i = 0; i < i_done; i++) + wait_on_page_writeback(pages[i]); + + return i_done; +out: + for (i = 0; i < i_done; i++) { + unlock_page(pages[i]); + page_cache_release(pages[i]); + } + + btrfs_delalloc_release_space(inode, + page_cnt << PAGE_CACHE_SHIFT, + storage_type); + + return ret; +} + +/* + * Relocate data to SSD or spinning drive based on past location + * and load the file into page cache and marks pages as dirty. 
+ *
+ * Based on the defrag ioctl.
+ */
+static int hot_relocate_extent(struct hot_range_item *hr,
+                               struct hot_reloc *hot_reloc,
+                               int storage_type)
+{
+        struct btrfs_root *root = hot_reloc->fs_info->fs_root;
+        struct inode *inode;
+        struct file_ra_state *ra = NULL;
+        struct btrfs_key key;
+        u64 isize, last_len = 0, skip = 0, end = 0;
+        unsigned long i, last, ra_index = 0;
+        int ret = -ENOENT, count = 0, new = 0;
+        int max_cluster = (256 * 1024) >> PAGE_CACHE_SHIFT;
+        int cluster = max_cluster;
+        struct page **pages = NULL;
+
+        key.objectid = hr->hot_inode->i_ino;
+        key.type = BTRFS_INODE_ITEM_KEY;
+        key.offset = 0;
+        inode = btrfs_iget(root->fs_info->sb, &key, root, &new);
+        if (IS_ERR(inode))
+                goto out;
+        else if (is_bad_inode(inode))
+                goto out_inode;
+
+        isize = i_size_read(inode);
+        if (isize == 0) {
+                ret = 0;
+                goto out_inode;
+        }
+
+        ra = kzalloc(sizeof(*ra), GFP_NOFS);
+        if (!ra) {
+                ret = -ENOMEM;
+                goto out_inode;
+        } else {
+                file_ra_state_init(ra, inode->i_mapping);
+        }
+
+        pages = kmalloc(sizeof(struct page *) * max_cluster,
+                        GFP_NOFS);
+        if (!pages) {
+                ret = -ENOMEM;
+                goto out_ra;
+        }
+
+        /* Find the last page */
+        if (hr->start + hr->len > hr->start) {
+                last = min_t(u64, isize - 1,
+                             hr->start + hr->len - 1) >> PAGE_CACHE_SHIFT;
+        } else {
+                last = (isize - 1) >> PAGE_CACHE_SHIFT;
+        }
+
+        i = hr->start >> PAGE_CACHE_SHIFT;
+
+        /*
+         * Make writeback start from i, so the range can be
+         * written sequentially.
+         */
+        if (i < inode->i_mapping->writeback_index)
+                inode->i_mapping->writeback_index = i;
+
+        while (i <= last && count < last + 1 &&
+               (i < (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >>
+                PAGE_CACHE_SHIFT)) {
+                /*
+                 * Make sure we stop running if someone unmounts
+                 * the FS.
+                 */
+                if (!(inode->i_sb->s_flags & MS_ACTIVE))
+                        break;
+
+                if (signal_pending(current)) {
+                        printk(KERN_DEBUG "btrfs: hot relocation cancelled\n");
+                        break;
+                }
+
+                if (!hot_can_relocate(inode, (u64)i << PAGE_CACHE_SHIFT,
+                                      PAGE_CACHE_SIZE, &skip, &end)) {
+                        unsigned long next;
+                        /*
+                         * The function tells us how much to skip;
+                         * bump our counter by the suggested amount.
+                         */
+                        next = (skip + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
+                        i = max(i + 1, next);
+                        continue;
+                }
+
+                cluster = (PAGE_CACHE_ALIGN(end) >> PAGE_CACHE_SHIFT) - i;
+                cluster = min(cluster, max_cluster);
+
+                if (i + cluster > ra_index) {
+                        ra_index = max(i, ra_index);
+                        btrfs_force_ra(inode->i_mapping, ra, NULL, ra_index,
+                                       cluster);
+                        ra_index += max_cluster;
+                }
+
+                mutex_lock(&inode->i_mutex);
+                ret = hot_load_file_extent(inode, pages,
+                                           i, cluster, storage_type);
+                if (ret < 0) {
+                        mutex_unlock(&inode->i_mutex);
+                        goto out_ra;
+                }
+
+                count += ret;
+                balance_dirty_pages_ratelimited(inode->i_mapping);
+                mutex_unlock(&inode->i_mutex);
+
+                if (ret > 0) {
+                        i += ret;
+                        last_len += ret << PAGE_CACHE_SHIFT;
+                } else {
+                        i++;
+                        last_len = 0;
+                }
+        }
+
+        ret = count;
+        if (ret > 0)
+                hr->storage_type = storage_type;
+
+out_ra:
+        kfree(ra);
+        kfree(pages);
+out_inode:
+        iput(inode);
+out:
+        list_del_init(&hr->hot_range.reloc_list);
+
+        hot_comm_item_put(&hr->hot_range);
+
+        return ret;
+}
+
+/*
+ * Main function: iterates through the heat map table and
+ * finds hot and cold data to move based on SSD pressure.
+ *
+ * First we iterate through the items below the heat
+ * threshold; if an item is on SSD and is now cold,
+ * we queue it up for relocation back to spinning disk.
+ * After scanning these items, we call the relocation code
+ * on all ranges that have been queued up for moving
+ * to HDD.
+ *
+ * We then iterate through the items above the heat threshold,
+ * and if they are on HDD we queue them up to be moved to
+ * SSD. Finally we walk that queue and move the hot ranges
+ * to SSD if they are not already there.
+ */
+void hot_do_relocate(struct hot_reloc *hot_reloc)
+{
+        struct hot_info *root;
+        struct hot_range_item *hr;
+        struct hot_comm_item *ci, *ci_next;
+        int i, ret = 0, thresh, ratio = 0;
+        u64 count, count_to_cold, count_to_hot;
+        static u32 run = 1;
+
+        run++;
+        ratio = hot_update_threshold(hot_reloc, !(run % 15));
+        thresh = hot_reloc->thresh;
+
+        INIT_LIST_HEAD(&hot_reloc->hot_relocq[TYPE_NONROT]);
+
+        /* Check and queue hot extents */
+        count_to_hot = hot_search_extent(hot_reloc,
+                                         thresh, TYPE_NONROT);
+        if (count_to_hot == 0)
+                return;
+
+        count_to_cold = HOT_RELOC_MAX_ITEMS;
+
+        /* Don't move cold data to HDD unless there's space pressure */
+        if (ratio < HIGH_WATER_LEVEL)
+                goto do_hot_reloc;
+
+        INIT_LIST_HEAD(&hot_reloc->hot_relocq[TYPE_ROT]);
+
+        /*
+         * Move up to HOT_RELOC_MAX_ITEMS cold ranges back to spinning
+         * disk. First, queue up items to move on hot_relocq[TYPE_ROT].
+         */
+        root = hot_reloc->fs_info->sb->s_hot_root;
+        for (count = 0, count_to_cold = 0; (count < thresh) &&
+             (count_to_cold < count_to_hot); count++) {
+                rcu_read_lock();
+                if (!list_empty(&root->hot_map[TYPE_RANGE][count]))
+                        ret = hot_queue_extent(hot_reloc,
+                                        &root->hot_map[TYPE_RANGE][count],
+                                        &count_to_cold, TYPE_ROT);
+                rcu_read_unlock();
+                if (ret)
+                        goto relocq_clean;
+        }
+
+        /* Do the hot -> cold relocation */
+        count_to_cold = 0;
+        list_for_each_entry_safe(ci, ci_next,
+                        &hot_reloc->hot_relocq[TYPE_ROT], reloc_list) {
+                hr = container_of(ci, struct hot_range_item, hot_range);
+                ret = hot_relocate_extent(hr, hot_reloc, TYPE_ROT);
+                if ((ret == -ENOSPC) || (ret == -ENOMEM) ||
+                    kthread_should_stop())
+                        goto relocq_clean;
+                else if (ret > 0)
+                        count_to_cold++;
+        }
+
+        /*
+         * Move up to HOT_RELOC_MAX_ITEMS ranges to SSD. Periodically check
+         * for space pressure on SSD and directly return if we've exceeded
+         * the SSD capacity high water mark.
+         * First, queue up items to move on hot_relocq[TYPE_NONROT].
+         */
+do_hot_reloc:
+        /* Do the cold -> hot relocation */
+        count_to_hot = 0;
+        list_for_each_entry_safe(ci, ci_next,
+                        &hot_reloc->hot_relocq[TYPE_NONROT], reloc_list) {
+                if (count_to_hot >= count_to_cold)
+                        goto relocq_clean;
+                hr = container_of(ci, struct hot_range_item, hot_range);
+                ret = hot_relocate_extent(hr, hot_reloc, TYPE_NONROT);
+                if ((ret == -ENOSPC) || (ret == -ENOMEM) ||
+                    kthread_should_stop())
+                        goto relocq_clean;
+                else if (ret > 0)
+                        count_to_hot++;
+
+                /*
+                 * If we've exceeded the SSD capacity high water mark,
+                 * directly return.
+                 */
+                if ((count_to_hot != 0) && count_to_hot % 30 == 0) {
+                        ratio = hot_update_threshold(hot_reloc, 1);
+                        if (ratio >= HIGH_WATER_LEVEL)
+                                goto relocq_clean;
+                }
+        }
+
+        return;
+
+relocq_clean:
+        for (i = 0; i < MAX_RELOC_TYPES; i++)
+                hot_cleanup_relocq(&hot_reloc->hot_relocq[i]);
+}
+
+/* Main loop for the relocation kthread */
+static int hot_relocate_kthread(void *arg)
+{
+        struct hot_reloc *hot_reloc = arg;
+        unsigned long delay;
+
+        do {
+                delay = HZ * HOT_RELOC_INTERVAL;
+                if (mutex_trylock(&hot_reloc->hot_reloc_mutex)) {
+                        hot_do_relocate(hot_reloc);
+                        mutex_unlock(&hot_reloc->hot_reloc_mutex);
+                }
+
+                if (!try_to_freeze()) {
+                        set_current_state(TASK_INTERRUPTIBLE);
+                        if (!kthread_should_stop())
+                                schedule_timeout(delay);
+                        __set_current_state(TASK_RUNNING);
+                }
+        } while (!kthread_should_stop());
+
+        return 0;
+}
+
+/* Kick off the relocation kthread */
+int hot_relocate_init(struct btrfs_fs_info *fs_info)
+{
+        int i, ret = 0;
+        struct hot_reloc *hot_reloc;
+
+        hot_reloc = kzalloc(sizeof(*hot_reloc), GFP_NOFS);
+        if (!hot_reloc) {
+                printk(KERN_ERR "%s: Failed to allocate memory for "
+                                "hot_reloc\n", __func__);
+                return -ENOMEM;
+        }
+
+        fs_info->hot_reloc = hot_reloc;
+        hot_reloc->fs_info = fs_info;
+        hot_reloc->thresh = HOT_RELOC_THRESHOLD;
+        for (i = 0; i < MAX_RELOC_TYPES; i++)
+                INIT_LIST_HEAD(&hot_reloc->hot_relocq[i]);
+        mutex_init(&hot_reloc->hot_reloc_mutex);
+
+        hot_reloc->hot_reloc_kthread = kthread_run(hot_relocate_kthread,
+                                hot_reloc, "hot_relocate_kthread");
+        if (IS_ERR(hot_reloc->hot_reloc_kthread)) {
+                /* Don't kthread_stop() an ERR_PTR; just clean up */
+                ret = PTR_ERR(hot_reloc->hot_reloc_kthread);
+                fs_info->hot_reloc = NULL;
+                kfree(hot_reloc);
+        }
+
+        return ret;
+}
+
+void hot_relocate_exit(struct btrfs_fs_info *fs_info)
+{
+        struct hot_reloc *hot_reloc = fs_info->hot_reloc;
+
+        if (hot_reloc->hot_reloc_kthread)
+                kthread_stop(hot_reloc->hot_reloc_kthread);
+
+        kfree(hot_reloc);
+        fs_info->hot_reloc = NULL;
+}
diff --git a/fs/btrfs/hot_relocate.h b/fs/btrfs/hot_relocate.h
new file mode 100644
index 0000000..1b1cfb5
--- /dev/null
+++ b/fs/btrfs/hot_relocate.h
@@ -0,0 +1,43 @@
+/*
+ * fs/btrfs/hot_relocate.h
+ *
+ * Copyright (C) 2013 IBM Corp. All rights reserved.
+ * Written by Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
+ *            Ben Chociej <bchociej@gmail.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ */
+
+#ifndef __HOT_RELOCATE__
+#define __HOT_RELOCATE__
+
+#include <linux/hot_tracking.h>
+#include "ctree.h"
+#include "btrfs_inode.h"
+#include "volumes.h"
+
+#define HOT_RELOC_INTERVAL   120
+#define HOT_RELOC_THRESHOLD  150
+#define HOT_RELOC_MAX_ITEMS  250
+
+#define HEAT_MAX_VALUE       (MAP_SIZE - 1)
+#define HIGH_WATER_LEVEL     75   /* when to raise the threshold */
+#define LOW_WATER_LEVEL      50   /* when to lower the threshold */
+#define THRESH_UP_SPEED      10   /* how much to raise it by */
+#define THRESH_DOWN_SPEED    1    /* how much to lower it by */
+#define THRESH_MAX_VALUE     100
+
+struct hot_reloc {
+        struct btrfs_fs_info *fs_info;
+        struct list_head hot_relocq[MAX_RELOC_TYPES];
+        int thresh;
+        struct task_struct *hot_reloc_kthread;
+        struct mutex hot_reloc_mutex;
+};
+
+int hot_relocate_init(struct btrfs_fs_info *fs_info);
+void hot_relocate_exit(struct btrfs_fs_info *fs_info);
+
+#endif /* __HOT_RELOCATE__ */
--
1.7.11.7
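[Editorial illustration, not part of the patch] The water-mark hysteresis above is easy to see in isolation: the heat threshold climbs quickly while the SSD is under space pressure and decays slowly once pressure eases. The following standalone userspace C sketch mirrors the arithmetic of hot_update_threshold() using the constants from hot_relocate.h; it assumes MAP_SIZE is 256 (as in the VFS hot tracking patches) and omits the periodic-update gating for clarity:

/* threshold hysteresis sketch; build with: cc -o thresh thresh.c */
#include <stdio.h>

#define MAP_SIZE          256              /* assumed, from hot tracking */
#define HEAT_MAX_VALUE    (MAP_SIZE - 1)
#define HIGH_WATER_LEVEL  75
#define LOW_WATER_LEVEL   50
#define THRESH_UP_SPEED   10
#define THRESH_DOWN_SPEED 1
#define THRESH_MAX_VALUE  100

static int update_thresh(int thresh, int ratio)
{
        if (ratio > THRESH_MAX_VALUE)
                return HEAT_MAX_VALUE + 1; /* no SSD: disable relocation */

        if (ratio >= HIGH_WATER_LEVEL)
                thresh += THRESH_UP_SPEED;   /* raise the bar, relocate less */
        else if (ratio <= LOW_WATER_LEVEL)
                thresh -= THRESH_DOWN_SPEED; /* lower the bar, relocate more */

        if (thresh > HEAT_MAX_VALUE)
                thresh = HEAT_MAX_VALUE + 1;
        else if (thresh < 0)
                thresh = 0;
        return thresh;
}

int main(void)
{
        /* hypothetical SSD fill ratios over successive relocation passes */
        int ratios[] = { 40, 60, 80, 90, 80, 60, 40, 40 };
        int thresh = 150; /* HOT_RELOC_THRESHOLD */
        unsigned int i;

        for (i = 0; i < sizeof(ratios) / sizeof(ratios[0]); i++) {
                thresh = update_thresh(thresh, ratios[i]);
                printf("ratio=%3d%% -> thresh=%d\n", ratios[i], thresh);
        }
        return 0;
}

Note the asymmetry: THRESH_UP_SPEED is ten times THRESH_DOWN_SPEED, so the relocator backs off fast when the SSD fills but only gradually admits more hot data afterwards.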
zwu.kernel@gmail.com
2013-Jun-21 12:20 UTC
[RFC PATCH v2 4/5] BTRFS hot reloc, procfs: add three proc interfaces
From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>

Add three proc interfaces hot-reloc-thresh, hot-reloc-interval,
and hot-reloc-max-items under /proc/sys/fs/ in order to make
HOT_RELOC_THRESHOLD, HOT_RELOC_INTERVAL, and HOT_RELOC_MAX_ITEMS
tunable.

Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
---
 fs/btrfs/hot_relocate.c | 26 +++++++++++++++++---------
 fs/btrfs/hot_relocate.h |  5 -----
 include/linux/btrfs.h   |  4 ++++
 kernel/sysctl.c         | 22 ++++++++++++++++++++++
 4 files changed, 43 insertions(+), 14 deletions(-)

diff --git a/fs/btrfs/hot_relocate.c b/fs/btrfs/hot_relocate.c
index ae28b86..debf580 100644
--- a/fs/btrfs/hot_relocate.c
+++ b/fs/btrfs/hot_relocate.c
@@ -25,7 +25,7 @@
  * The relocation code below operates on the heat map lists to identify
  * hot or cold data logical file ranges that are candidates for relocation.
  * The triggering mechanism for relocation is controlled by a global heat
- * threshold integer value (HOT_RELOC_THRESHOLD). Ranges are
+ * threshold integer value (sysctl_hot_reloc_thresh). Ranges are
  * queued for relocation by the periodically executing relocate kthread,
  * which updates the global heat threshold and responds to space pressure
  * on the nonrotating disks.
@@ -53,6 +53,15 @@
  * (assuming, critically, the HOT_MOVE option is set at mount time).
  */
 
+int sysctl_hot_reloc_thresh = 150;
+EXPORT_SYMBOL_GPL(sysctl_hot_reloc_thresh);
+
+int sysctl_hot_reloc_interval __read_mostly = 120;
+EXPORT_SYMBOL_GPL(sysctl_hot_reloc_interval);
+
+int sysctl_hot_reloc_max_items __read_mostly = 250;
+EXPORT_SYMBOL_GPL(sysctl_hot_reloc_max_items);
+
 /*
  * Returns the ratio of nonrotating disks that are full.
  * If no nonrotating disk is found, returns THRESH_MAX_VALUE + 1.
@@ -103,7 +112,7 @@ static int hot_calc_nonrot_ratio(struct hot_reloc *hot_reloc)
 static int hot_update_threshold(struct hot_reloc *hot_reloc,
                                 int update)
 {
-        int thresh = hot_reloc->thresh;
+        int thresh = sysctl_hot_reloc_thresh;
         int ratio = hot_calc_nonrot_ratio(hot_reloc);
 
         /* Update the global threshold only periodically or under space pressure */
@@ -127,7 +136,7 @@ static int hot_update_threshold(struct hot_reloc *hot_reloc,
                 thresh = 0;
         }
 
-        hot_reloc->thresh = thresh;
+        sysctl_hot_reloc_thresh = thresh;
         return ratio;
 }
 
@@ -215,7 +224,7 @@ static int hot_queue_extent(struct hot_reloc *hot_reloc,
                         *counter = *counter + 1;
                 }
 
-                if (*counter >= HOT_RELOC_MAX_ITEMS)
+                if (*counter >= sysctl_hot_reloc_max_items)
                         break;
 
                 if (kthread_should_stop()) {
@@ -293,7 +302,7 @@ again:
                 while (1) {
                         lock_extent(tree, page_start, page_end);
                         ordered = btrfs_lookup_ordered_extent(inode,
-                                                        page_start);
+                                                              page_start);
                         unlock_extent(tree, page_start, page_end);
                         if (!ordered)
                                 break;
@@ -559,7 +568,7 @@ void hot_do_relocate(struct hot_reloc *hot_reloc)
 
         run++;
         ratio = hot_update_threshold(hot_reloc, !(run % 15));
-        thresh = hot_reloc->thresh;
+        thresh = sysctl_hot_reloc_thresh;
 
         INIT_LIST_HEAD(&hot_reloc->hot_relocq[TYPE_NONROT]);
 
@@ -569,7 +578,7 @@ void hot_do_relocate(struct hot_reloc *hot_reloc)
         if (count_to_hot == 0)
                 return;
 
-        count_to_cold = HOT_RELOC_MAX_ITEMS;
+        count_to_cold = sysctl_hot_reloc_max_items;
 
         /* Don't move cold data to HDD unless there's space pressure */
         if (ratio < HIGH_WATER_LEVEL)
@@ -653,7 +662,7 @@ static int hot_relocate_kthread(void *arg)
         unsigned long delay;
 
         do {
-                delay = HZ * HOT_RELOC_INTERVAL;
+                delay = HZ * sysctl_hot_reloc_interval;
                 if (mutex_trylock(&hot_reloc->hot_reloc_mutex)) {
                         hot_do_relocate(hot_reloc);
                         mutex_unlock(&hot_reloc->hot_reloc_mutex);
@@ -685,7 +694,6 @@ int hot_relocate_init(struct btrfs_fs_info *fs_info)
 
         fs_info->hot_reloc = hot_reloc;
         hot_reloc->fs_info = fs_info;
-        hot_reloc->thresh = HOT_RELOC_THRESHOLD;
         for (i = 0; i < MAX_RELOC_TYPES; i++)
                 INIT_LIST_HEAD(&hot_reloc->hot_relocq[i]);
         mutex_init(&hot_reloc->hot_reloc_mutex);
diff --git a/fs/btrfs/hot_relocate.h b/fs/btrfs/hot_relocate.h
index 1b1cfb5..94defe6 100644
--- a/fs/btrfs/hot_relocate.h
+++ b/fs/btrfs/hot_relocate.h
@@ -18,10 +18,6 @@
 #include "btrfs_inode.h"
 #include "volumes.h"
 
-#define HOT_RELOC_INTERVAL   120
-#define HOT_RELOC_THRESHOLD  150
-#define HOT_RELOC_MAX_ITEMS  250
-
 #define HEAT_MAX_VALUE       (MAP_SIZE - 1)
 #define HIGH_WATER_LEVEL     75   /* when to raise the threshold */
 #define LOW_WATER_LEVEL      50   /* when to lower the threshold */
@@ -32,7 +28,6 @@
 struct hot_reloc {
         struct btrfs_fs_info *fs_info;
         struct list_head hot_relocq[MAX_RELOC_TYPES];
-        int thresh;
         struct task_struct *hot_reloc_kthread;
         struct mutex hot_reloc_mutex;
 };
diff --git a/include/linux/btrfs.h b/include/linux/btrfs.h
index 22d7991..a713514 100644
--- a/include/linux/btrfs.h
+++ b/include/linux/btrfs.h
@@ -3,4 +3,8 @@
 
 #include <uapi/linux/btrfs.h>
 
+extern int sysctl_hot_reloc_thresh;
+extern int sysctl_hot_reloc_interval;
+extern int sysctl_hot_reloc_max_items;
+
 #endif /* _LINUX_BTRFS_H */
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 753585d..7050086 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -62,6 +62,7 @@
 #include <linux/capability.h>
 #include <linux/binfmts.h>
 #include <linux/sched/sysctl.h>
+#include <linux/btrfs.h>
 
 #include <asm/uaccess.h>
 #include <asm/processor.h>
@@ -1637,6 +1638,27 @@ static struct ctl_table fs_table[] = {
                 .mode           = 0644,
                 .proc_handler   = proc_dointvec,
         },
+        {
+                .procname       = "hot-reloc-thresh",
+                .data           = &sysctl_hot_reloc_thresh,
+                .maxlen         = sizeof(int),
+                .mode           = 0644,
+                .proc_handler   = proc_dointvec,
+        },
+        {
+                .procname       = "hot-reloc-interval",
+                .data           = &sysctl_hot_reloc_interval,
+                .maxlen         = sizeof(int),
+                .mode           = 0644,
+                .proc_handler   = proc_dointvec,
+        },
+        {
+                .procname       = "hot-reloc-max-items",
+                .data           = &sysctl_hot_reloc_max_items,
+                .maxlen         = sizeof(int),
+                .mode           = 0644,
+                .proc_handler   = proc_dointvec,
+        },
         { }
 };
--
1.7.11.7
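[Editorial illustration, not part of the patch] With this patch applied, the three tunables are plain integers handled by proc_dointvec, so any read/write of the proc files works. The following standalone C sketch shows one way to inspect and lower the heat threshold from userspace; it assumes a kernel with this patchset and must run as root:

/* sysctl tunables sketch; build with: cc -o reloc_tune reloc_tune.c */
#include <stdio.h>
#include <stdlib.h>

static int read_tunable(const char *path)
{
        FILE *f = fopen(path, "r");
        int val = -1;

        if (!f) {
                perror(path);
                exit(1);
        }
        if (fscanf(f, "%d", &val) != 1)
                val = -1;
        fclose(f);
        return val;
}

static void write_tunable(const char *path, int val)
{
        FILE *f = fopen(path, "w");

        if (!f) {
                perror(path);
                exit(1);
        }
        fprintf(f, "%d\n", val);
        fclose(f);
}

int main(void)
{
        const char *thresh = "/proc/sys/fs/hot-reloc-thresh";

        printf("current threshold: %d\n", read_tunable(thresh));
        /* lower the heat threshold so more ranges qualify as hot */
        write_tunable(thresh, 108);
        printf("new threshold: %d\n", read_tunable(thresh));
        return 0;
}

One consequence of routing the threshold through a sysctl is that the kthread's own hysteresis updates (sysctl_hot_reloc_thresh is written by hot_update_threshold()) and administrator writes now race benignly on the same integer; whichever write lands last wins until the next relocation pass.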
zwu.kernel@gmail.com
2013-Jun-21 12:21 UTC
[RFC PATCH v2 5/5] BTRFS hot reloc: add hot relocation support
From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>

Add one new mount option '-o hot_move' for hot relocation support.
When hot relocation is enabled, hot tracking will be enabled
automatically. Its usage looks like:
   mount -o hot_move
   mount -o nouser,hot_move
   mount -o nouser,hot_move,loop
   mount -o hot_move,nouser

Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
---
 fs/btrfs/super.c | 26 +++++++++++++++++++++++---
 1 file changed, 23 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 9ee751f..31e1d17 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -309,8 +309,13 @@ static void btrfs_put_super(struct super_block *sb)
          * process...  Whom would you report that to?
          */
 
+        /* Hot data relocation */
+        if (btrfs_test_opt(btrfs_sb(sb)->tree_root, HOT_MOVE))
+                hot_relocate_exit(btrfs_sb(sb));
+
         /* Hot data tracking */
-        if (btrfs_test_opt(btrfs_sb(sb)->tree_root, HOT_TRACK))
+        if (btrfs_test_opt(btrfs_sb(sb)->tree_root, HOT_MOVE)
+            || btrfs_test_opt(btrfs_sb(sb)->tree_root, HOT_TRACK))
                 hot_track_exit(sb);
 }
 
@@ -325,7 +330,7 @@ enum {
         Opt_no_space_cache, Opt_recovery, Opt_skip_balance,
         Opt_check_integrity, Opt_check_integrity_including_extent_data,
         Opt_check_integrity_print_mask, Opt_fatal_errors, Opt_hot_track,
-        Opt_err,
+        Opt_hot_move, Opt_err,
 };
 
 static match_table_t tokens = {
@@ -366,6 +371,7 @@ static match_table_t tokens = {
         {Opt_check_integrity_print_mask, "check_int_print_mask=%d"},
         {Opt_fatal_errors, "fatal_errors=%s"},
         {Opt_hot_track, "hot_track"},
+        {Opt_hot_move, "hot_move"},
         {Opt_err, NULL},
 };
 
@@ -634,6 +640,9 @@ int btrfs_parse_options(struct btrfs_root *root, char *options)
                 case Opt_hot_track:
                         btrfs_set_opt(info->mount_opt, HOT_TRACK);
                         break;
+                case Opt_hot_move:
+                        btrfs_set_opt(info->mount_opt, HOT_MOVE);
+                        break;
                 case Opt_err:
                         printk(KERN_INFO "btrfs: unrecognized mount option "
                                "'%s'\n", p);
@@ -853,17 +862,26 @@ static int btrfs_fill_super(struct super_block *sb,
                 goto fail_close;
         }
 
-        if (btrfs_test_opt(fs_info->tree_root, HOT_TRACK)) {
+        if (btrfs_test_opt(fs_info->tree_root, HOT_MOVE)
+            || btrfs_test_opt(fs_info->tree_root, HOT_TRACK)) {
                 err = hot_track_init(sb);
                 if (err)
                         goto fail_hot;
         }
 
+        if (btrfs_test_opt(fs_info->tree_root, HOT_MOVE)) {
+                err = hot_relocate_init(fs_info);
+                if (err)
+                        goto fail_reloc;
+        }
+
         save_mount_options(sb, data);
         cleancache_init_fs(sb);
         sb->s_flags |= MS_ACTIVE;
         return 0;
 
+fail_reloc:
+        hot_track_exit(sb);
 fail_hot:
         dput(sb->s_root);
         sb->s_root = NULL;
@@ -964,6 +982,8 @@ static int btrfs_show_options(struct seq_file *seq, struct dentry *dentry)
                 seq_puts(seq, ",fatal_errors=panic");
         if (btrfs_test_opt(root, HOT_TRACK))
                 seq_puts(seq, ",hot_track");
+        if (btrfs_test_opt(root, HOT_MOVE))
+                seq_puts(seq, ",hot_move");
         return 0;
 }
--
1.7.11.7
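[Editorial illustration, not part of the patch] Since hot_move is an ordinary mount option string, it can also be passed programmatically through the data argument of mount(2). A minimal sketch, with /dev/vdb and /data2 as placeholder device and mount point on a patched kernel, run as root:

/* mount with hot_move sketch; build with: cc -o hotmount hotmount.c */
#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
        /* equivalent to: mount -o hot_move /dev/vdb /data2 */
        if (mount("/dev/vdb", "/data2", "btrfs", 0, "hot_move")) {
                perror("mount");
                return 1;
        }
        /* per btrfs_fill_super(), HOT_MOVE implies hot tracking too */
        printf("mounted with hot_move; hot tracking enabled implicitly\n");
        return 0;
}

Note the error-path ordering in btrfs_fill_super(): if hot_relocate_init() fails, the new fail_reloc label first tears down hot tracking with hot_track_exit(sb) before falling through to fail_hot, so a half-enabled state is never left behind.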