

Stefan Behrens

2012-Nov-06 16:38 UTC

[PATCH 00/26] Btrfs: Add device replace code

This patch series adds support for replacing disks at runtime.

It replaces the following steps in case a disk was lost:
    mount ... -o degraded
    btrfs device add new_disk
    btrfs device delete missing

Or in case a disk just needs to be replaced because the error rate
is increasing:
    btrfs device add new_disk
    btrfs device delete old_disk

Instead just run:
    btrfs replace start mountpoint old_disk new_disk

The device replace operation takes place at runtime on a live
filesystem; you don't need to unmount it or stop active tasks.
It is safe to crash or lose power during the operation; the
process resumes with the next mount.
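
To illustrate why a copy like this can be resumed after a crash, here is a minimal sketch, not the code from this series. It assumes a progress record that is written to stable storage after each copied chunk; the struct, its fields, and both function names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical progress record, assumed to be persisted after each chunk. */
struct replace_progress {
	uint64_t cursor;	/* every byte below this offset is already copied */
	uint64_t end;		/* total number of bytes to copy */
};

/*
 * Report the next range to copy. Stores the start offset in *start and
 * returns the length, or 0 when nothing is left. Assumes cursor <= end.
 */
static uint64_t replace_next_range(const struct replace_progress *p,
				   uint64_t chunk, uint64_t *start)
{
	uint64_t left = p->end - p->cursor;

	*start = p->cursor;
	return left < chunk ? left : chunk;
}

/*
 * Advance the cursor once a range has been copied. Because the cursor
 * only moves after the copy, a crash re-copies at most one chunk.
 */
static void replace_commit_range(struct replace_progress *p, uint64_t len)
{
	p->cursor += len;
}
```

After a restart, the caller would reload the saved record and call replace_next_range() again; the copy then continues at the saved cursor instead of at offset 0.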

The copy usually runs at about 90% of the available platter
speed if no other disk I/O is ongoing during the copy
operation, so the degraded state without redundancy can be
left quickly.

The copy process is started manually. Reacting to an increased
device I/O error rate by starting this procedure automatically
is a separate project.

The patch series is based on btrfs-next and also available here:
git://btrfs.giantdisaster.de/git/btrfs device-replace

The user mode part is the top commit of
git://btrfs.giantdisaster.de/git/btrfs-progs master


replace start [-Bfr] <path> <srcdev>|<devid> <targetdev>
       Replace device of a btrfs filesystem. On a live filesystem,
       duplicate the data which is currently stored on the source
       device to the target device. If the source device is not
       available anymore, or if the -r option is set, the data is
       built only using the RAID redundancy mechanisms. After
       completion of the operation, the source device is removed
       from the filesystem. If the srcdev is a numerical value, it
       is assumed to be the device id of the filesystem which is
       mounted at <path>; otherwise it is the path to the source
       device. If the source device is disconnected from the system,
       you have to use the devid parameter format. The targetdev
       needs to be the same size as or larger than the srcdev.

       Options

       -r     only read from srcdev if no other zero-defect mirror
              exists (enable this if your drive has lots of read
              errors, since access to it would be very slow)

       -f     force using and overwriting targetdev even if it appears
              to contain a valid btrfs filesystem. A valid filesystem
              is assumed if a btrfs superblock is found which contains
              a correct checksum. Devices which are currently mounted
              are never allowed to be used as the targetdev.

       -B     do not background


replace status [-1] <path>
       Print status and progress information of a running device
       replace operation.

       Options

       -1     print once instead of printing continuously until the
              replace operation finishes (or is canceled)


replace cancel <path>
       Cancel a running device replace operation.
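
The "replace start" text above says that a numerical srcdev is taken as a device id and anything else as a device path. A minimal sketch of that distinction follows. It is an illustration only, not the btrfs-progs code; struct srcdev_arg and parse_srcdev() are hypothetical names.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical result type; not taken from btrfs-progs. */
struct srcdev_arg {
	bool is_devid;		/* true: devid holds the numeric device id */
	uint64_t devid;
	const char *path;	/* set when is_devid is false */
};

/*
 * Classify the <srcdev>|<devid> argument. A string consisting only of
 * decimal digits is treated as a devid, anything else as a device path.
 * Returns 0 on success, -1 on an empty or out-of-range argument.
 */
static int parse_srcdev(const char *arg, struct srcdev_arg *out)
{
	char *end;
	unsigned long long val;

	if (!arg || !*arg)
		return -1;

	out->is_devid = false;
	out->devid = 0;
	out->path = arg;

	/* strtoull() would accept signs and spaces, so check the first char. */
	if (arg[0] < '0' || arg[0] > '9')
		return 0;

	errno = 0;
	val = strtoull(arg, &end, 10);
	if (*end != '\0')
		return 0;	/* trailing non-digits: keep treating it as a path */
	if (errno == ERANGE)
		return -1;

	out->is_devid = true;
	out->devid = val;
	out->path = NULL;
	return 0;
}
```

With this sketch, "3" yields devid 3, while "/dev/sdb" is kept as a path. The devid form is the only one that still works once the source device has vanished from the system.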


Stefan Behrens (26):
  Btrfs: rename the scrub context structure
  Btrfs: remove the block device pointer from the scrub context struct
  Btrfs: make the scrub page array dynamically allocated
  Btrfs: in scrub repair code, optimize the reading of mirrors
  Btrfs: in scrub repair code, simplify alloc error handling
  Btrfs: cleanup scrub bio and worker wait code
  Btrfs: add two more find_device() methods
  Btrfs: Pass fs_info to btrfs_num_copies() instead of mapping_tree
  Btrfs: pass fs_info to btrfs_map_block() instead of mapping_tree
  Btrfs: add btrfs_scratch_superblock() function
  Btrfs: pass fs_info instead of root
  Btrfs: avoid risk of a deadlock in btrfs_handle_error
  Btrfs: enhance btrfs structures for device replace support
  Btrfs: introduce a btrfs_dev_replace_item type
  Btrfs: add a new source file with device replace code
  Btrfs: disallow mutually exclusive admin operations from user mode
  Btrfs: disallow some operations on the device replace target device
  Btrfs: handle errors from btrfs_map_bio() everywhere
  Btrfs: add code to scrub to copy read data to another disk
  Btrfs: change core code of btrfs to support the device replace
    operations
  Btrfs: introduce GET_READ_MIRRORS functionality for btrfs_map_block()
  Btrfs: changes to live filesystem are also written to replacement
    disk
  Btrfs: optionally avoid reads from device replace source drive
  Btrfs: increase BTRFS_MAX_MIRRORS by one for dev replace
  Btrfs: allow repair code to include target disk when searching
    mirrors
  Btrfs: add support for device replace ioctls

 fs/btrfs/Makefile          |    2 +-
 fs/btrfs/check-integrity.c |   29 +-
 fs/btrfs/compression.c     |    6 +-
 fs/btrfs/ctree.h           |  127 ++-
 fs/btrfs/dev-replace.c     |  843 ++++++++++++++++++++
 fs/btrfs/dev-replace.h     |   44 ++
 fs/btrfs/disk-io.c         |   79 +-
 fs/btrfs/extent-tree.c     |    5 +-
 fs/btrfs/extent_io.c       |   28 +-
 fs/btrfs/extent_io.h       |    4 +-
 fs/btrfs/inode.c           |   39 +-
 fs/btrfs/ioctl.c           |  117 ++-
 fs/btrfs/ioctl.h           |   45 ++
 fs/btrfs/print-tree.c      |    3 +
 fs/btrfs/reada.c           |   31 +-
 fs/btrfs/scrub.c           | 1822 ++++++++++++++++++++++++++++++++------------
 fs/btrfs/super.c           |   30 +-
 fs/btrfs/transaction.c     |    7 +-
 fs/btrfs/volumes.c         |  624 +++++++++++++--
 fs/btrfs/volumes.h         |   26 +-
 20 files changed, 3244 insertions(+), 667 deletions(-)
 create mode 100644 fs/btrfs/dev-replace.c
 create mode 100644 fs/btrfs/dev-replace.h

-- 
1.8.0


Stefan Behrens

2012-Nov-06 16:38 UTC


[PATCH 01/26] Btrfs: rename the scrub context structure

The device replace procedure makes use of the scrub code. The scrub
code is the most efficient code to read the allocated data of a disk,
i.e. it reads sequentially in order to avoid disk head movements, it
skips unallocated blocks, it uses read ahead mechanisms, and it
contains all the code to detect and repair defects.
This commit is a first preparation step to adapt the scrub code to
be shareable for the device replace procedure.
The block device will be removed from the scrub context state
structure in a later step. It used to be the source block device.
The scrub code as it is used for the device replace procedure reads
the source data from wherever it is optimal. The source device might
even be gone (disconnected, for instance due to a hardware failure).
Or the drive can be so faulty that the device replace procedure
tries to avoid access to the faulty source drive as much as possible
and accesses the source disk only as a last resort, if all other
mirrors are damaged.
The modified scrub code operates as if it were handling the source
drive and thereby generates an exact copy of the source disk on the
target disk, even if the source disk is not present at all. Therefore
the block device pointer to the source disk is removed in a later
patch, and the context structure is renamed (this is the goal of the
current patch) to reflect that the scrub context is no longer tied to
a source block device.
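
The read-source policy described above can be sketched as follows. This is a simplified illustration under assumed data structures, not the scrub code itself; struct mirror_state and pick_read_mirror() are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-mirror state; not the kernel's data structure. */
struct mirror_state {
	bool present;	/* device is attached and readable */
	bool is_source;	/* this mirror lives on the drive being replaced */
	bool bad;	/* copy failed its checksum or returned an I/O error */
};

/*
 * Pick the mirror to read from. With avoid_source set, a healthy mirror
 * on another device wins and the source drive is used only as a last
 * resort. Without it, the source drive is preferred when it is usable.
 * Returns the mirror index, or -1 if no usable mirror exists.
 */
static int pick_read_mirror(const struct mirror_state *m, size_t n,
			    bool avoid_source)
{
	int source = -1;
	int other = -1;
	size_t i;

	for (i = 0; i < n; i++) {
		if (!m[i].present || m[i].bad)
			continue;
		if (m[i].is_source) {
			if (source < 0)
				source = (int)i;
		} else if (other < 0) {
			other = (int)i;
		}
	}

	if (avoid_source)
		return other >= 0 ? other : source;
	return source >= 0 ? source : other;
}
```

A missing source device is simply a mirror with present == false, so the same function covers the disconnected-disk case: the data is then taken from whichever other mirror is still good.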

Summary:
This first preparation step consists of a textual substitution of the
term "dev" with the term "ctx" wherever the scrub context is
used.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
---
 fs/btrfs/scrub.c   | 504 ++++++++++++++++++++++++++---------------------------
 fs/btrfs/volumes.h |   2 +-
 2 files changed, 253 insertions(+), 253 deletions(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 27892f6..29c8aac 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -42,10 +42,10 @@
  */
 
 struct scrub_block;
-struct scrub_dev;
+struct scrub_ctx;
 
 #define SCRUB_PAGES_PER_BIO	16	/* 64k per bio */
-#define SCRUB_BIOS_PER_DEV	16	/* 1 MB per device in flight */
+#define SCRUB_BIOS_PER_CTX	16	/* 1 MB per device in flight */
 #define SCRUB_MAX_PAGES_PER_BLOCK	16	/* 64k per node/leaf/sector */
 
 struct scrub_page {
@@ -66,7 +66,7 @@ struct scrub_page {
 
 struct scrub_bio {
 	int			index;
-	struct scrub_dev	*sdev;
+	struct scrub_ctx	*sctx;
 	struct bio		*bio;
 	int			err;
 	u64			logical;
@@ -82,7 +82,7 @@ struct scrub_block {
 	int			page_count;
 	atomic_t		outstanding_pages;
 	atomic_t		ref_count; /* free mem on transition to zero */
-	struct scrub_dev	*sdev;
+	struct scrub_ctx	*sctx;
 	struct {
 		unsigned int	header_error:1;
 		unsigned int	checksum_error:1;
@@ -91,8 +91,8 @@ struct scrub_block {
 	};
 };
 
-struct scrub_dev {
-	struct scrub_bio	*bios[SCRUB_BIOS_PER_DEV];
+struct scrub_ctx {
+	struct scrub_bio	*bios[SCRUB_BIOS_PER_CTX];
 	struct btrfs_device	*dev;
 	int			first_free;
 	int			curr;
@@ -116,7 +116,7 @@ struct scrub_dev {
 };
 
 struct scrub_fixup_nodatasum {
-	struct scrub_dev	*sdev;
+	struct scrub_ctx	*sctx;
 	u64			logical;
 	struct btrfs_root	*root;
 	struct btrfs_work	work;
@@ -138,7 +138,7 @@ struct scrub_warning {
 
 
 static int scrub_handle_errored_block(struct scrub_block *sblock_to_check);
-static int scrub_setup_recheck_block(struct scrub_dev *sdev,
+static int scrub_setup_recheck_block(struct scrub_ctx *sctx,
 				     struct btrfs_mapping_tree *map_tree,
 				     u64 length, u64 logical,
 				     struct scrub_block *sblock);
@@ -163,9 +163,9 @@ static int scrub_checksum_tree_block(struct scrub_block *sblock);
 static int scrub_checksum_super(struct scrub_block *sblock);
 static void scrub_block_get(struct scrub_block *sblock);
 static void scrub_block_put(struct scrub_block *sblock);
-static int scrub_add_page_to_bio(struct scrub_dev *sdev,
+static int scrub_add_page_to_bio(struct scrub_ctx *sctx,
 				 struct scrub_page *spage);
-static int scrub_pages(struct scrub_dev *sdev, u64 logical, u64 len,
+static int scrub_pages(struct scrub_ctx *sctx, u64 logical, u64 len,
 		       u64 physical, u64 flags, u64 gen, int mirror_num,
 		       u8 *csum, int force);
 static void scrub_bio_end_io(struct bio *bio, int err);
@@ -173,27 +173,27 @@ static void scrub_bio_end_io_worker(struct btrfs_work *work);
 static void scrub_block_complete(struct scrub_block *sblock);
 
 
-static void scrub_free_csums(struct scrub_dev *sdev)
+static void scrub_free_csums(struct scrub_ctx *sctx)
 {
-	while (!list_empty(&sdev->csum_list)) {
+	while (!list_empty(&sctx->csum_list)) {
 		struct btrfs_ordered_sum *sum;
-		sum = list_first_entry(&sdev->csum_list,
+		sum = list_first_entry(&sctx->csum_list,
 				       struct btrfs_ordered_sum, list);
 		list_del(&sum->list);
 		kfree(sum);
 	}
 }
 
-static noinline_for_stack void scrub_free_dev(struct scrub_dev *sdev)
+static noinline_for_stack void scrub_free_ctx(struct scrub_ctx *sctx)
 {
 	int i;
 
-	if (!sdev)
+	if (!sctx)
 		return;
 
 	/* this can happen when scrub is cancelled */
-	if (sdev->curr != -1) {
-		struct scrub_bio *sbio = sdev->bios[sdev->curr];
+	if (sctx->curr != -1) {
+		struct scrub_bio *sbio = sctx->bios[sctx->curr];
 
 		for (i = 0; i < sbio->page_count; i++) {
 			BUG_ON(!sbio->pagev[i]);
@@ -203,69 +203,69 @@ static noinline_for_stack void scrub_free_dev(struct scrub_dev *sdev)
 		bio_put(sbio->bio);
 	}
 
-	for (i = 0; i < SCRUB_BIOS_PER_DEV; ++i) {
-		struct scrub_bio *sbio = sdev->bios[i];
+	for (i = 0; i < SCRUB_BIOS_PER_CTX; ++i) {
+		struct scrub_bio *sbio = sctx->bios[i];
 
 		if (!sbio)
 			break;
 		kfree(sbio);
 	}
 
-	scrub_free_csums(sdev);
-	kfree(sdev);
+	scrub_free_csums(sctx);
+	kfree(sctx);
 }
 
 static noinline_for_stack
-struct scrub_dev *scrub_setup_dev(struct btrfs_device *dev)
+struct scrub_ctx *scrub_setup_ctx(struct btrfs_device *dev)
 {
-	struct scrub_dev *sdev;
+	struct scrub_ctx *sctx;
 	int		i;
 	struct btrfs_fs_info *fs_info = dev->dev_root->fs_info;
 	int pages_per_bio;
 
 	pages_per_bio = min_t(int, SCRUB_PAGES_PER_BIO,
 			      bio_get_nr_vecs(dev->bdev));
-	sdev = kzalloc(sizeof(*sdev), GFP_NOFS);
-	if (!sdev)
+	sctx = kzalloc(sizeof(*sctx), GFP_NOFS);
+	if (!sctx)
 		goto nomem;
-	sdev->dev = dev;
-	sdev->pages_per_bio = pages_per_bio;
-	sdev->curr = -1;
-	for (i = 0; i < SCRUB_BIOS_PER_DEV; ++i) {
+	sctx->dev = dev;
+	sctx->pages_per_bio = pages_per_bio;
+	sctx->curr = -1;
+	for (i = 0; i < SCRUB_BIOS_PER_CTX; ++i) {
 		struct scrub_bio *sbio;
 
 		sbio = kzalloc(sizeof(*sbio), GFP_NOFS);
 		if (!sbio)
 			goto nomem;
-		sdev->bios[i] = sbio;
+		sctx->bios[i] = sbio;
 
 		sbio->index = i;
-		sbio->sdev = sdev;
+		sbio->sctx = sctx;
 		sbio->page_count = 0;
 		sbio->work.func = scrub_bio_end_io_worker;
 
-		if (i != SCRUB_BIOS_PER_DEV-1)
-			sdev->bios[i]->next_free = i + 1;
+		if (i != SCRUB_BIOS_PER_CTX - 1)
+			sctx->bios[i]->next_free = i + 1;
 		else
-			sdev->bios[i]->next_free = -1;
-	}
-	sdev->first_free = 0;
-	sdev->nodesize = dev->dev_root->nodesize;
-	sdev->leafsize = dev->dev_root->leafsize;
-	sdev->sectorsize = dev->dev_root->sectorsize;
-	atomic_set(&sdev->in_flight, 0);
-	atomic_set(&sdev->fixup_cnt, 0);
-	atomic_set(&sdev->cancel_req, 0);
-	sdev->csum_size = btrfs_super_csum_size(fs_info->super_copy);
-	INIT_LIST_HEAD(&sdev->csum_list);
-
-	spin_lock_init(&sdev->list_lock);
-	spin_lock_init(&sdev->stat_lock);
-	init_waitqueue_head(&sdev->list_wait);
-	return sdev;
+			sctx->bios[i]->next_free = -1;
+	}
+	sctx->first_free = 0;
+	sctx->nodesize = dev->dev_root->nodesize;
+	sctx->leafsize = dev->dev_root->leafsize;
+	sctx->sectorsize = dev->dev_root->sectorsize;
+	atomic_set(&sctx->in_flight, 0);
+	atomic_set(&sctx->fixup_cnt, 0);
+	atomic_set(&sctx->cancel_req, 0);
+	sctx->csum_size = btrfs_super_csum_size(fs_info->super_copy);
+	INIT_LIST_HEAD(&sctx->csum_list);
+
+	spin_lock_init(&sctx->list_lock);
+	spin_lock_init(&sctx->stat_lock);
+	init_waitqueue_head(&sctx->list_wait);
+	return sctx;
 
 nomem:
-	scrub_free_dev(sdev);
+	scrub_free_ctx(sctx);
 	return ERR_PTR(-ENOMEM);
 }
 
@@ -345,7 +345,7 @@ err:
 
 static void scrub_print_warning(const char *errstr, struct scrub_block *sblock)
 {
-	struct btrfs_device *dev = sblock->sdev->dev;
+	struct btrfs_device *dev = sblock->sctx->dev;
 	struct btrfs_fs_info *fs_info = dev->dev_root->fs_info;
 	struct btrfs_path *path;
 	struct btrfs_key found_key;
@@ -530,21 +530,21 @@ static void scrub_fixup_nodatasum(struct btrfs_work *work)
 {
 	int ret;
 	struct scrub_fixup_nodatasum *fixup;
-	struct scrub_dev *sdev;
+	struct scrub_ctx *sctx;
 	struct btrfs_trans_handle *trans = NULL;
 	struct btrfs_fs_info *fs_info;
 	struct btrfs_path *path;
 	int uncorrectable = 0;
 
 	fixup = container_of(work, struct scrub_fixup_nodatasum, work);
-	sdev = fixup->sdev;
+	sctx = fixup->sctx;
 	fs_info = fixup->root->fs_info;
 
 	path = btrfs_alloc_path();
 	if (!path) {
-		spin_lock(&sdev->stat_lock);
-		++sdev->stat.malloc_errors;
-		spin_unlock(&sdev->stat_lock);
+		spin_lock(&sctx->stat_lock);
+		++sctx->stat.malloc_errors;
+		spin_unlock(&sctx->stat_lock);
 		uncorrectable = 1;
 		goto out;
 	}
@@ -573,22 +573,22 @@ static void scrub_fixup_nodatasum(struct btrfs_work *work)
 	}
 	WARN_ON(ret != 1);
 
-	spin_lock(&sdev->stat_lock);
-	++sdev->stat.corrected_errors;
-	spin_unlock(&sdev->stat_lock);
+	spin_lock(&sctx->stat_lock);
+	++sctx->stat.corrected_errors;
+	spin_unlock(&sctx->stat_lock);
 
 out:
 	if (trans && !IS_ERR(trans))
 		btrfs_end_transaction(trans, fixup->root);
 	if (uncorrectable) {
-		spin_lock(&sdev->stat_lock);
-		++sdev->stat.uncorrectable_errors;
-		spin_unlock(&sdev->stat_lock);
+		spin_lock(&sctx->stat_lock);
+		++sctx->stat.uncorrectable_errors;
+		spin_unlock(&sctx->stat_lock);
 
 		printk_ratelimited_in_rcu(KERN_ERR
 			"btrfs: unable to fixup (nodatasum) error at logical %llu on dev %s\n",
 			(unsigned long long)fixup->logical,
-			rcu_str_deref(sdev->dev->name));
+			rcu_str_deref(sctx->dev->name));
 	}
 
 	btrfs_free_path(path);
@@ -599,9 +599,9 @@ out:
 	atomic_dec(&fs_info->scrubs_running);
 	atomic_dec(&fs_info->scrubs_paused);
 	mutex_unlock(&fs_info->scrub_lock);
-	atomic_dec(&sdev->fixup_cnt);
+	atomic_dec(&sctx->fixup_cnt);
 	wake_up(&fs_info->scrub_pause_wait);
-	wake_up(&sdev->list_wait);
+	wake_up(&sctx->list_wait);
 }
 
 /*
@@ -614,7 +614,7 @@ out:
  */
 static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
 {
-	struct scrub_dev *sdev = sblock_to_check->sdev;
+	struct scrub_ctx *sctx = sblock_to_check->sctx;
 	struct btrfs_fs_info *fs_info;
 	u64 length;
 	u64 logical;
@@ -633,7 +633,7 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
 				      DEFAULT_RATELIMIT_BURST);
 
 	BUG_ON(sblock_to_check->page_count < 1);
-	fs_info = sdev->dev->dev_root->fs_info;
+	fs_info = sctx->dev->dev_root->fs_info;
 	length = sblock_to_check->page_count * PAGE_SIZE;
 	logical = sblock_to_check->pagev[0].logical;
 	generation = sblock_to_check->pagev[0].generation;
@@ -677,25 +677,25 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
 				     sizeof(*sblocks_for_recheck),
 				     GFP_NOFS);
 	if (!sblocks_for_recheck) {
-		spin_lock(&sdev->stat_lock);
-		sdev->stat.malloc_errors++;
-		sdev->stat.read_errors++;
-		sdev->stat.uncorrectable_errors++;
-		spin_unlock(&sdev->stat_lock);
-		btrfs_dev_stat_inc_and_print(sdev->dev,
+		spin_lock(&sctx->stat_lock);
+		sctx->stat.malloc_errors++;
+		sctx->stat.read_errors++;
+		sctx->stat.uncorrectable_errors++;
+		spin_unlock(&sctx->stat_lock);
+		btrfs_dev_stat_inc_and_print(sctx->dev,
 					     BTRFS_DEV_STAT_READ_ERRS);
 		goto out;
 	}
 
 	/* setup the context, map the logical blocks and alloc the pages */
-	ret = scrub_setup_recheck_block(sdev, &fs_info->mapping_tree, length,
+	ret = scrub_setup_recheck_block(sctx, &fs_info->mapping_tree, length,
 					logical, sblocks_for_recheck);
 	if (ret) {
-		spin_lock(&sdev->stat_lock);
-		sdev->stat.read_errors++;
-		sdev->stat.uncorrectable_errors++;
-		spin_unlock(&sdev->stat_lock);
-		btrfs_dev_stat_inc_and_print(sdev->dev,
+		spin_lock(&sctx->stat_lock);
+		sctx->stat.read_errors++;
+		sctx->stat.uncorrectable_errors++;
+		spin_unlock(&sctx->stat_lock);
+		btrfs_dev_stat_inc_and_print(sctx->dev,
 					     BTRFS_DEV_STAT_READ_ERRS);
 		goto out;
 	}
@@ -704,13 +704,13 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
 
 	/* build and submit the bios for the failed mirror, check checksums */
 	ret = scrub_recheck_block(fs_info, sblock_bad, is_metadata, have_csum,
-				  csum, generation, sdev->csum_size);
+				  csum, generation, sctx->csum_size);
 	if (ret) {
-		spin_lock(&sdev->stat_lock);
-		sdev->stat.read_errors++;
-		sdev->stat.uncorrectable_errors++;
-		spin_unlock(&sdev->stat_lock);
-		btrfs_dev_stat_inc_and_print(sdev->dev,
+		spin_lock(&sctx->stat_lock);
+		sctx->stat.read_errors++;
+		sctx->stat.uncorrectable_errors++;
+		spin_unlock(&sctx->stat_lock);
+		btrfs_dev_stat_inc_and_print(sctx->dev,
 					     BTRFS_DEV_STAT_READ_ERRS);
 		goto out;
 	}
@@ -725,45 +725,45 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
 		 * different bio (usually one of the two latter cases is
 		 * the cause)
 		 */
-		spin_lock(&sdev->stat_lock);
-		sdev->stat.unverified_errors++;
-		spin_unlock(&sdev->stat_lock);
+		spin_lock(&sctx->stat_lock);
+		sctx->stat.unverified_errors++;
+		spin_unlock(&sctx->stat_lock);
 
 		goto out;
 	}
 
 	if (!sblock_bad->no_io_error_seen) {
-		spin_lock(&sdev->stat_lock);
-		sdev->stat.read_errors++;
-		spin_unlock(&sdev->stat_lock);
+		spin_lock(&sctx->stat_lock);
+		sctx->stat.read_errors++;
+		spin_unlock(&sctx->stat_lock);
 		if (__ratelimit(&_rs))
 			scrub_print_warning("i/o error", sblock_to_check);
-		btrfs_dev_stat_inc_and_print(sdev->dev,
+		btrfs_dev_stat_inc_and_print(sctx->dev,
 					     BTRFS_DEV_STAT_READ_ERRS);
 	} else if (sblock_bad->checksum_error) {
-		spin_lock(&sdev->stat_lock);
-		sdev->stat.csum_errors++;
-		spin_unlock(&sdev->stat_lock);
+		spin_lock(&sctx->stat_lock);
+		sctx->stat.csum_errors++;
+		spin_unlock(&sctx->stat_lock);
 		if (__ratelimit(&_rs))
 			scrub_print_warning("checksum error", sblock_to_check);
-		btrfs_dev_stat_inc_and_print(sdev->dev,
+		btrfs_dev_stat_inc_and_print(sctx->dev,
 					     BTRFS_DEV_STAT_CORRUPTION_ERRS);
 	} else if (sblock_bad->header_error) {
-		spin_lock(&sdev->stat_lock);
-		sdev->stat.verify_errors++;
-		spin_unlock(&sdev->stat_lock);
+		spin_lock(&sctx->stat_lock);
+		sctx->stat.verify_errors++;
+		spin_unlock(&sctx->stat_lock);
 		if (__ratelimit(&_rs))
 			scrub_print_warning("checksum/header error",
 					    sblock_to_check);
 		if (sblock_bad->generation_error)
-			btrfs_dev_stat_inc_and_print(sdev->dev,
+			btrfs_dev_stat_inc_and_print(sctx->dev,
 				BTRFS_DEV_STAT_GENERATION_ERRS);
 		else
-			btrfs_dev_stat_inc_and_print(sdev->dev,
+			btrfs_dev_stat_inc_and_print(sctx->dev,
 				BTRFS_DEV_STAT_CORRUPTION_ERRS);
 	}
 
-	if (sdev->readonly)
+	if (sctx->readonly)
 		goto did_not_correct_error;
 
 	if (!is_metadata && !have_csum) {
@@ -779,7 +779,7 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
 		fixup_nodatasum = kzalloc(sizeof(*fixup_nodatasum), GFP_NOFS);
 		if (!fixup_nodatasum)
 			goto did_not_correct_error;
-		fixup_nodatasum->sdev = sdev;
+		fixup_nodatasum->sctx = sctx;
 		fixup_nodatasum->logical = logical;
 		fixup_nodatasum->root = fs_info->extent_root;
 		fixup_nodatasum->mirror_num = failed_mirror_index + 1;
@@ -796,7 +796,7 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
 		atomic_inc(&fs_info->scrubs_running);
 		atomic_inc(&fs_info->scrubs_paused);
 		mutex_unlock(&fs_info->scrub_lock);
-		atomic_inc(&sdev->fixup_cnt);
+		atomic_inc(&sctx->fixup_cnt);
 		fixup_nodatasum->work.func = scrub_fixup_nodatasum;
 		btrfs_queue_worker(&fs_info->scrub_workers,
 				   &fixup_nodatasum->work);
@@ -818,7 +818,7 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
 		ret = scrub_recheck_block(fs_info,
 					  sblocks_for_recheck + mirror_index,
 					  is_metadata, have_csum, csum,
-					  generation, sdev->csum_size);
+					  generation, sctx->csum_size);
 		if (ret)
 			goto did_not_correct_error;
 	}
@@ -930,7 +930,7 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
 			 */
 			ret = scrub_recheck_block(fs_info, sblock_bad,
 						  is_metadata, have_csum, csum,
-						  generation, sdev->csum_size);
+						  generation, sctx->csum_size);
 			if (!ret && !sblock_bad->header_error &&
 			    !sblock_bad->checksum_error &&
 			    sblock_bad->no_io_error_seen)
@@ -939,23 +939,23 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
 				goto did_not_correct_error;
 		} else {
 corrected_error:
-			spin_lock(&sdev->stat_lock);
-			sdev->stat.corrected_errors++;
-			spin_unlock(&sdev->stat_lock);
+			spin_lock(&sctx->stat_lock);
+			sctx->stat.corrected_errors++;
+			spin_unlock(&sctx->stat_lock);
 			printk_ratelimited_in_rcu(KERN_ERR
 				"btrfs: fixed up error at logical %llu on dev %s\n",
 				(unsigned long long)logical,
-				rcu_str_deref(sdev->dev->name));
+				rcu_str_deref(sctx->dev->name));
 		}
 	} else {
 did_not_correct_error:
-		spin_lock(&sdev->stat_lock);
-		sdev->stat.uncorrectable_errors++;
-		spin_unlock(&sdev->stat_lock);
+		spin_lock(&sctx->stat_lock);
+		sctx->stat.uncorrectable_errors++;
+		spin_unlock(&sctx->stat_lock);
 		printk_ratelimited_in_rcu(KERN_ERR
 			"btrfs: unable to fixup (regular) error at logical %llu on dev %s\n",
 			(unsigned long long)logical,
-			rcu_str_deref(sdev->dev->name));
+			rcu_str_deref(sctx->dev->name));
 	}
 
 out:
@@ -978,7 +978,7 @@ out:
 	return 0;
 }
 
-static int scrub_setup_recheck_block(struct scrub_dev *sdev,
+static int scrub_setup_recheck_block(struct scrub_ctx *sctx,
 				     struct btrfs_mapping_tree *map_tree,
 				     u64 length, u64 logical,
 				     struct scrub_block *sblocks_for_recheck)
@@ -988,7 +988,7 @@ static int scrub_setup_recheck_block(struct scrub_dev *sdev,
 	int ret;
 
 	/*
-	 * note: the three members sdev, ref_count and outstanding_pages
+	 * note: the three members sctx, ref_count and outstanding_pages
 	 * are not used (and not set) in the blocks that are used for
 	 * the recheck procedure
 	 */
@@ -1028,9 +1028,9 @@ static int scrub_setup_recheck_block(struct scrub_dev *sdev,
 			page->mirror_num = mirror_index + 1;
 			page->page = alloc_page(GFP_NOFS);
 			if (!page->page) {
-				spin_lock(&sdev->stat_lock);
-				sdev->stat.malloc_errors++;
-				spin_unlock(&sdev->stat_lock);
+				spin_lock(&sctx->stat_lock);
+				sctx->stat.malloc_errors++;
+				spin_unlock(&sctx->stat_lock);
 				kfree(bbio);
 				return -ENOMEM;
 			}
@@ -1259,14 +1259,14 @@ static void scrub_checksum(struct scrub_block *sblock)
 
 static int scrub_checksum_data(struct scrub_block *sblock)
 {
-	struct scrub_dev *sdev = sblock->sdev;
+	struct scrub_ctx *sctx = sblock->sctx;
 	u8 csum[BTRFS_CSUM_SIZE];
 	u8 *on_disk_csum;
 	struct page *page;
 	void *buffer;
 	u32 crc = ~(u32)0;
 	int fail = 0;
-	struct btrfs_root *root = sdev->dev->dev_root;
+	struct btrfs_root *root = sctx->dev->dev_root;
 	u64 len;
 	int index;
 
@@ -1278,7 +1278,7 @@ static int scrub_checksum_data(struct scrub_block *sblock)
 	page = sblock->pagev[0].page;
 	buffer = kmap_atomic(page);
 
-	len = sdev->sectorsize;
+	len = sctx->sectorsize;
 	index = 0;
 	for (;;) {
 		u64 l = min_t(u64, len, PAGE_SIZE);
@@ -1296,7 +1296,7 @@ static int scrub_checksum_data(struct scrub_block *sblock)
 	}
 
 	btrfs_csum_final(crc, csum);
-	if (memcmp(csum, on_disk_csum, sdev->csum_size))
+	if (memcmp(csum, on_disk_csum, sctx->csum_size))
 		fail = 1;
 
 	return fail;
@@ -1304,9 +1304,9 @@ static int scrub_checksum_data(struct scrub_block *sblock)
 
 static int scrub_checksum_tree_block(struct scrub_block *sblock)
 {
-	struct scrub_dev *sdev = sblock->sdev;
+	struct scrub_ctx *sctx = sblock->sctx;
 	struct btrfs_header *h;
-	struct btrfs_root *root = sdev->dev->dev_root;
+	struct btrfs_root *root = sctx->dev->dev_root;
 	struct btrfs_fs_info *fs_info = root->fs_info;
 	u8 calculated_csum[BTRFS_CSUM_SIZE];
 	u8 on_disk_csum[BTRFS_CSUM_SIZE];
@@ -1324,7 +1324,7 @@ static int scrub_checksum_tree_block(struct scrub_block *sblock)
 	page = sblock->pagev[0].page;
 	mapped_buffer = kmap_atomic(page);
 	h = (struct btrfs_header *)mapped_buffer;
-	memcpy(on_disk_csum, h->csum, sdev->csum_size);
+	memcpy(on_disk_csum, h->csum, sctx->csum_size);
 
 	/*
 	 * we don't use the getter functions here, as we
@@ -1345,8 +1345,8 @@ static int scrub_checksum_tree_block(struct scrub_block *sblock)
 		   BTRFS_UUID_SIZE))
 		++fail;
 
-	BUG_ON(sdev->nodesize != sdev->leafsize);
-	len = sdev->nodesize - BTRFS_CSUM_SIZE;
+	BUG_ON(sctx->nodesize != sctx->leafsize);
+	len = sctx->nodesize - BTRFS_CSUM_SIZE;
 	mapped_size = PAGE_SIZE - BTRFS_CSUM_SIZE;
 	p = ((u8 *)mapped_buffer) + BTRFS_CSUM_SIZE;
 	index = 0;
@@ -1368,7 +1368,7 @@ static int scrub_checksum_tree_block(struct scrub_block *sblock)
 	}
 
 	btrfs_csum_final(crc, calculated_csum);
-	if (memcmp(calculated_csum, on_disk_csum, sdev->csum_size))
+	if (memcmp(calculated_csum, on_disk_csum, sctx->csum_size))
 		++crc_fail;
 
 	return fail || crc_fail;
@@ -1377,8 +1377,8 @@ static int scrub_checksum_tree_block(struct scrub_block *sblock)
 static int scrub_checksum_super(struct scrub_block *sblock)
 {
 	struct btrfs_super_block *s;
-	struct scrub_dev *sdev = sblock->sdev;
-	struct btrfs_root *root = sdev->dev->dev_root;
+	struct scrub_ctx *sctx = sblock->sctx;
+	struct btrfs_root *root = sctx->dev->dev_root;
 	struct btrfs_fs_info *fs_info = root->fs_info;
 	u8 calculated_csum[BTRFS_CSUM_SIZE];
 	u8 on_disk_csum[BTRFS_CSUM_SIZE];
@@ -1396,7 +1396,7 @@ static int scrub_checksum_super(struct scrub_block *sblock)
 	page = sblock->pagev[0].page;
 	mapped_buffer = kmap_atomic(page);
 	s = (struct btrfs_super_block *)mapped_buffer;
-	memcpy(on_disk_csum, s->csum, sdev->csum_size);
+	memcpy(on_disk_csum, s->csum, sctx->csum_size);
 
 	if (sblock->pagev[0].logical != le64_to_cpu(s->bytenr))
 		++fail_cor;
@@ -1429,7 +1429,7 @@ static int scrub_checksum_super(struct scrub_block *sblock)
 	}
 
 	btrfs_csum_final(crc, calculated_csum);
-	if (memcmp(calculated_csum, on_disk_csum, sdev->csum_size))
+	if (memcmp(calculated_csum, on_disk_csum, sctx->csum_size))
 		++fail_cor;
 
 	if (fail_cor + fail_gen) {
@@ -1438,14 +1438,14 @@ static int scrub_checksum_super(struct scrub_block *sblock)
 		 * They will get written with the next transaction commit
 		 * anyway
 		 */
-		spin_lock(&sdev->stat_lock);
-		++sdev->stat.super_errors;
-		spin_unlock(&sdev->stat_lock);
+		spin_lock(&sctx->stat_lock);
+		++sctx->stat.super_errors;
+		spin_unlock(&sctx->stat_lock);
 		if (fail_cor)
-			btrfs_dev_stat_inc_and_print(sdev->dev,
+			btrfs_dev_stat_inc_and_print(sctx->dev,
 				BTRFS_DEV_STAT_CORRUPTION_ERRS);
 		else
-			btrfs_dev_stat_inc_and_print(sdev->dev,
+			btrfs_dev_stat_inc_and_print(sctx->dev,
 				BTRFS_DEV_STAT_GENERATION_ERRS);
 	}
 
@@ -1469,21 +1469,21 @@ static void scrub_block_put(struct scrub_block *sblock)
 	}
 }
 
-static void scrub_submit(struct scrub_dev *sdev)
+static void scrub_submit(struct scrub_ctx *sctx)
 {
 	struct scrub_bio *sbio;
 
-	if (sdev->curr == -1)
+	if (sctx->curr == -1)
 		return;
 
-	sbio = sdev->bios[sdev->curr];
-	sdev->curr = -1;
-	atomic_inc(&sdev->in_flight);
+	sbio = sctx->bios[sctx->curr];
+	sctx->curr = -1;
+	atomic_inc(&sctx->in_flight);
 
 	btrfsic_submit_bio(READ, sbio->bio);
 }
 
-static int scrub_add_page_to_bio(struct scrub_dev *sdev,
+static int scrub_add_page_to_bio(struct scrub_ctx *sctx,
 				 struct scrub_page *spage)
 {
 	struct scrub_block *sblock = spage->sblock;
@@ -1494,20 +1494,20 @@ again:
 	/*
 	 * grab a fresh bio or wait for one to become available
 	 */
-	while (sdev->curr == -1) {
-		spin_lock(&sdev->list_lock);
-		sdev->curr = sdev->first_free;
-		if (sdev->curr != -1) {
-			sdev->first_free = sdev->bios[sdev->curr]->next_free;
-			sdev->bios[sdev->curr]->next_free = -1;
-			sdev->bios[sdev->curr]->page_count = 0;
-			spin_unlock(&sdev->list_lock);
+	while (sctx->curr == -1) {
+		spin_lock(&sctx->list_lock);
+		sctx->curr = sctx->first_free;
+		if (sctx->curr != -1) {
+			sctx->first_free = sctx->bios[sctx->curr]->next_free;
+			sctx->bios[sctx->curr]->next_free = -1;
+			sctx->bios[sctx->curr]->page_count = 0;
+			spin_unlock(&sctx->list_lock);
 		} else {
-			spin_unlock(&sdev->list_lock);
-			wait_event(sdev->list_wait, sdev->first_free != -1);
+			spin_unlock(&sctx->list_lock);
+			wait_event(sctx->list_wait, sctx->first_free != -1);
 		}
 	}
-	sbio = sdev->bios[sdev->curr];
+	sbio = sctx->bios[sctx->curr];
 	if (sbio->page_count == 0) {
 		struct bio *bio;
 
@@ -1515,7 +1515,7 @@ again:
 		sbio->logical = spage->logical;
 		bio = sbio->bio;
 		if (!bio) {
-			bio = bio_alloc(GFP_NOFS, sdev->pages_per_bio);
+			bio = bio_alloc(GFP_NOFS, sctx->pages_per_bio);
 			if (!bio)
 				return -ENOMEM;
 			sbio->bio = bio;
@@ -1523,14 +1523,14 @@ again:
 
 		bio->bi_private = sbio;
 		bio->bi_end_io = scrub_bio_end_io;
-		bio->bi_bdev = sdev->dev->bdev;
+		bio->bi_bdev = sctx->dev->bdev;
 		bio->bi_sector = spage->physical >> 9;
 		sbio->err = 0;
 	} else if (sbio->physical + sbio->page_count * PAGE_SIZE !=
 		   spage->physical ||
 		   sbio->logical + sbio->page_count * PAGE_SIZE !=
 		   spage->logical) {
-		scrub_submit(sdev);
+		scrub_submit(sctx);
 		goto again;
 	}
 
@@ -1542,20 +1542,20 @@ again:
 			sbio->bio = NULL;
 			return -EIO;
 		}
-		scrub_submit(sdev);
+		scrub_submit(sctx);
 		goto again;
 	}
 
 	scrub_block_get(sblock); /* one for the added page */
 	atomic_inc(&sblock->outstanding_pages);
 	sbio->page_count++;
-	if (sbio->page_count == sdev->pages_per_bio)
-		scrub_submit(sdev);
+	if (sbio->page_count == sctx->pages_per_bio)
+		scrub_submit(sctx);
 
 	return 0;
 }
 
-static int scrub_pages(struct scrub_dev *sdev, u64 logical, u64 len,
+static int scrub_pages(struct scrub_ctx *sctx, u64 logical, u64 len,
 		       u64 physical, u64 flags, u64 gen, int mirror_num,
 		       u8 *csum, int force)
 {
@@ -1564,15 +1564,15 @@ static int scrub_pages(struct scrub_dev *sdev, u64 logical, u64 len,
 
 	sblock = kzalloc(sizeof(*sblock), GFP_NOFS);
 	if (!sblock) {
-		spin_lock(&sdev->stat_lock);
-		sdev->stat.malloc_errors++;
-		spin_unlock(&sdev->stat_lock);
+		spin_lock(&sctx->stat_lock);
+		sctx->stat.malloc_errors++;
+		spin_unlock(&sctx->stat_lock);
 		return -ENOMEM;
 	}
 
 	/* one ref inside this function, plus one for each page later on */
 	atomic_set(&sblock->ref_count, 1);
-	sblock->sdev = sdev;
+	sblock->sctx = sctx;
 	sblock->no_io_error_seen = 1;
 
 	for (index = 0; len > 0; index++) {
@@ -1582,9 +1582,9 @@ static int scrub_pages(struct scrub_dev *sdev, u64 logical, u64 len,
 		BUG_ON(index >= SCRUB_MAX_PAGES_PER_BLOCK);
 		spage->page = alloc_page(GFP_NOFS);
 		if (!spage->page) {
-			spin_lock(&sdev->stat_lock);
-			sdev->stat.malloc_errors++;
-			spin_unlock(&sdev->stat_lock);
+			spin_lock(&sctx->stat_lock);
+			sctx->stat.malloc_errors++;
+			spin_unlock(&sctx->stat_lock);
 			while (index > 0) {
 				index--;
 				__free_page(sblock->pagev[index].page);
@@ -1593,7 +1593,7 @@ static int scrub_pages(struct scrub_dev *sdev, u64 logical, u64 len,
 			return -ENOMEM;
 		}
 		spage->sblock = sblock;
-		spage->dev = sdev->dev;
+		spage->dev = sctx->dev;
 		spage->flags = flags;
 		spage->generation = gen;
 		spage->logical = logical;
@@ -1601,7 +1601,7 @@ static int scrub_pages(struct scrub_dev *sdev, u64 logical, u64 len,
 		spage->mirror_num = mirror_num;
 		if (csum) {
 			spage->have_csum = 1;
-			memcpy(spage->csum, csum, sdev->csum_size);
+			memcpy(spage->csum, csum, sctx->csum_size);
 		} else {
 			spage->have_csum = 0;
 		}
@@ -1616,7 +1616,7 @@ static int scrub_pages(struct scrub_dev *sdev, u64 logical, u64 len,
 		struct scrub_page *spage = sblock->pagev + index;
 		int ret;
 
-		ret = scrub_add_page_to_bio(sdev, spage);
+		ret = scrub_add_page_to_bio(sctx, spage);
 		if (ret) {
 			scrub_block_put(sblock);
 			return ret;
@@ -1624,7 +1624,7 @@ static int scrub_pages(struct scrub_dev *sdev, u64 logical, u64 len,
 	}
 
 	if (force)
-		scrub_submit(sdev);
+		scrub_submit(sctx);
 
 	/* last one frees, either here or in bio completion for last page */
 	scrub_block_put(sblock);
@@ -1634,8 +1634,8 @@ static int scrub_pages(struct scrub_dev *sdev, u64 logical, u64 len,
 static void scrub_bio_end_io(struct bio *bio, int err)
 {
 	struct scrub_bio *sbio = bio->bi_private;
-	struct scrub_dev *sdev = sbio->sdev;
-	struct btrfs_fs_info *fs_info = sdev->dev->dev_root->fs_info;
+	struct scrub_ctx *sctx = sbio->sctx;
+	struct btrfs_fs_info *fs_info = sctx->dev->dev_root->fs_info;
 
 	sbio->err = err;
 	sbio->bio = bio;
@@ -1646,7 +1646,7 @@ static void scrub_bio_end_io(struct bio *bio, int err)
 static void scrub_bio_end_io_worker(struct btrfs_work *work)
 {
 	struct scrub_bio *sbio = container_of(work, struct scrub_bio, work);
-	struct scrub_dev *sdev = sbio->sdev;
+	struct scrub_ctx *sctx = sbio->sctx;
 	int i;
 
 	BUG_ON(sbio->page_count > SCRUB_PAGES_PER_BIO);
@@ -1671,12 +1671,12 @@ static void scrub_bio_end_io_worker(struct btrfs_work *work)
 
 	bio_put(sbio->bio);
 	sbio->bio = NULL;
-	spin_lock(&sdev->list_lock);
-	sbio->next_free = sdev->first_free;
-	sdev->first_free = sbio->index;
-	spin_unlock(&sdev->list_lock);
-	atomic_dec(&sdev->in_flight);
-	wake_up(&sdev->list_wait);
+	spin_lock(&sctx->list_lock);
+	sbio->next_free = sctx->first_free;
+	sctx->first_free = sbio->index;
+	spin_unlock(&sctx->list_lock);
+	atomic_dec(&sctx->in_flight);
+	wake_up(&sctx->list_wait);
 }
 
 static void scrub_block_complete(struct scrub_block *sblock)
@@ -1687,7 +1687,7 @@ static void scrub_block_complete(struct scrub_block *sblock)
 		scrub_checksum(sblock);
 }
 
-static int scrub_find_csum(struct scrub_dev *sdev, u64 logical, u64 len,
+static int scrub_find_csum(struct scrub_ctx *sctx, u64 logical, u64 len,
 			   u8 *csum)
 {
 	struct btrfs_ordered_sum *sum = NULL;
@@ -1695,15 +1695,15 @@ static int scrub_find_csum(struct scrub_dev *sdev, u64 logical, u64 len,
 	unsigned long i;
 	unsigned long num_sectors;
 
-	while (!list_empty(&sdev->csum_list)) {
-		sum = list_first_entry(&sdev->csum_list,
+	while (!list_empty(&sctx->csum_list)) {
+		sum = list_first_entry(&sctx->csum_list,
 				       struct btrfs_ordered_sum, list);
 		if (sum->bytenr > logical)
 			return 0;
 		if (sum->bytenr + sum->len > logical)
 			break;
 
-		++sdev->stat.csum_discards;
+		++sctx->stat.csum_discards;
 		list_del(&sum->list);
 		kfree(sum);
 		sum = NULL;
@@ -1711,10 +1711,10 @@ static int scrub_find_csum(struct scrub_dev *sdev, u64 logical, u64 len,
 	if (!sum)
 		return 0;
 
-	num_sectors = sum->len / sdev->sectorsize;
+	num_sectors = sum->len / sctx->sectorsize;
 	for (i = 0; i < num_sectors; ++i) {
 		if (sum->sums[i].bytenr == logical) {
-			memcpy(csum, &sum->sums[i].sum, sdev->csum_size);
+			memcpy(csum, &sum->sums[i].sum, sctx->csum_size);
 			ret = 1;
 			break;
 		}
@@ -1727,7 +1727,7 @@ static int scrub_find_csum(struct scrub_dev *sdev, u64 logical, u64 len,
 }
 
 /* scrub extent tries to collect up to 64 kB for each bio */
-static int scrub_extent(struct scrub_dev *sdev, u64 logical, u64 len,
+static int scrub_extent(struct scrub_ctx *sctx, u64 logical, u64 len,
 			u64 physical, u64 flags, u64 gen, int mirror_num)
 {
 	int ret;
@@ -1735,20 +1735,20 @@ static int scrub_extent(struct scrub_dev *sdev, u64 logical, u64 len,
 	u32 blocksize;
 
 	if (flags & BTRFS_EXTENT_FLAG_DATA) {
-		blocksize = sdev->sectorsize;
-		spin_lock(&sdev->stat_lock);
-		sdev->stat.data_extents_scrubbed++;
-		sdev->stat.data_bytes_scrubbed += len;
-		spin_unlock(&sdev->stat_lock);
+		blocksize = sctx->sectorsize;
+		spin_lock(&sctx->stat_lock);
+		sctx->stat.data_extents_scrubbed++;
+		sctx->stat.data_bytes_scrubbed += len;
+		spin_unlock(&sctx->stat_lock);
 	} else if (flags & BTRFS_EXTENT_FLAG_TREE_BLOCK) {
-		BUG_ON(sdev->nodesize != sdev->leafsize);
-		blocksize = sdev->nodesize;
-		spin_lock(&sdev->stat_lock);
-		sdev->stat.tree_extents_scrubbed++;
-		sdev->stat.tree_bytes_scrubbed += len;
-		spin_unlock(&sdev->stat_lock);
+		BUG_ON(sctx->nodesize != sctx->leafsize);
+		blocksize = sctx->nodesize;
+		spin_lock(&sctx->stat_lock);
+		sctx->stat.tree_extents_scrubbed++;
+		sctx->stat.tree_bytes_scrubbed += len;
+		spin_unlock(&sctx->stat_lock);
 	} else {
-		blocksize = sdev->sectorsize;
+		blocksize = sctx->sectorsize;
 		BUG_ON(1);
 	}
 
@@ -1758,11 +1758,11 @@ static int scrub_extent(struct scrub_dev *sdev, u64 logical, u64 len,
 
 		if (flags & BTRFS_EXTENT_FLAG_DATA) {
 			/* push csums to sbio */
-			have_csum = scrub_find_csum(sdev, logical, l, csum);
+			have_csum = scrub_find_csum(sctx, logical, l, csum);
 			if (have_csum == 0)
-				++sdev->stat.no_csum;
+				++sctx->stat.no_csum;
 		}
-		ret = scrub_pages(sdev, logical, l, physical, flags, gen,
+		ret = scrub_pages(sctx, logical, l, physical, flags, gen,
 				  mirror_num, have_csum ? csum : NULL, 0);
 		if (ret)
 			return ret;
@@ -1773,11 +1773,11 @@ static int scrub_extent(struct scrub_dev *sdev, u64 logical, u64 len,
 	return 0;
 }
 
-static noinline_for_stack int scrub_stripe(struct scrub_dev *sdev,
+static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx,
 	struct map_lookup *map, int num, u64 base, u64 length)
 {
 	struct btrfs_path *path;
-	struct btrfs_fs_info *fs_info = sdev->dev->dev_root->fs_info;
+	struct btrfs_fs_info *fs_info = sctx->dev->dev_root->fs_info;
 	struct btrfs_root *root = fs_info->extent_root;
 	struct btrfs_root *csum_root = fs_info->csum_root;
 	struct btrfs_extent_item *extent;
@@ -1843,8 +1843,8 @@ static noinline_for_stack int scrub_stripe(struct scrub_dev *sdev,
 	 */
 	logical = base + offset;
 
-	wait_event(sdev->list_wait,
-		   atomic_read(&sdev->in_flight) == 0);
+	wait_event(sctx->list_wait,
+		   atomic_read(&sctx->in_flight) == 0);
 	atomic_inc(&fs_info->scrubs_paused);
 	wake_up(&fs_info->scrub_pause_wait);
 
@@ -1898,7 +1898,7 @@ static noinline_for_stack int scrub_stripe(struct scrub_dev *sdev,
 		 * canceled?
 		 */
 		if (atomic_read(&fs_info->scrub_cancel_req) ||
-		    atomic_read(&sdev->cancel_req)) {
+		    atomic_read(&sctx->cancel_req)) {
 			ret = -ECANCELED;
 			goto out;
 		}
@@ -1907,9 +1907,9 @@ static noinline_for_stack int scrub_stripe(struct scrub_dev *sdev,
 		 */
 		if (atomic_read(&fs_info->scrub_pause_req)) {
 			/* push queued extents */
-			scrub_submit(sdev);
-			wait_event(sdev->list_wait,
-				   atomic_read(&sdev->in_flight) == 0);
+			scrub_submit(sctx);
+			wait_event(sctx->list_wait,
+				   atomic_read(&sctx->in_flight) == 0);
 			atomic_inc(&fs_info->scrubs_paused);
 			wake_up(&fs_info->scrub_pause_wait);
 			mutex_lock(&fs_info->scrub_lock);
@@ -1926,7 +1926,7 @@ static noinline_for_stack int scrub_stripe(struct scrub_dev *sdev,
 
 		ret = btrfs_lookup_csums_range(csum_root, logical,
 					       logical + map->stripe_len - 1,
-					       &sdev->csum_list, 1);
+					       &sctx->csum_list, 1);
 		if (ret)
 			goto out;
 
@@ -2004,7 +2004,7 @@ static noinline_for_stack int scrub_stripe(struct scrub_dev *sdev,
 					     key.objectid;
 			}
 
-			ret = scrub_extent(sdev, key.objectid, key.offset,
+			ret = scrub_extent(sctx, key.objectid, key.offset,
 					   key.objectid - logical + physical,
 					   flags, generation, mirror_num);
 			if (ret)
@@ -2016,12 +2016,12 @@ next:
 		btrfs_release_path(path);
 		logical += increment;
 		physical += map->stripe_len;
-		spin_lock(&sdev->stat_lock);
-		sdev->stat.last_physical = physical;
-		spin_unlock(&sdev->stat_lock);
+		spin_lock(&sctx->stat_lock);
+		sctx->stat.last_physical = physical;
+		spin_unlock(&sctx->stat_lock);
 	}
 	/* push queued extents */
-	scrub_submit(sdev);
+	scrub_submit(sctx);
 
 out:
 	blk_finish_plug(&plug);
@@ -2029,12 +2029,12 @@ out:
 	return ret < 0 ? ret : 0;
 }
 
-static noinline_for_stack int scrub_chunk(struct scrub_dev *sdev,
+static noinline_for_stack int scrub_chunk(struct scrub_ctx *sctx,
 	u64 chunk_tree, u64 chunk_objectid, u64 chunk_offset, u64 length,
 	u64 dev_offset)
 {
 	struct btrfs_mapping_tree *map_tree =
-		&sdev->dev->dev_root->fs_info->mapping_tree;
+		&sctx->dev->dev_root->fs_info->mapping_tree;
 	struct map_lookup *map;
 	struct extent_map *em;
 	int i;
@@ -2055,9 +2055,9 @@ static noinline_for_stack int scrub_chunk(struct scrub_dev *sdev,
 		goto out;
 
 	for (i = 0; i < map->num_stripes; ++i) {
-		if (map->stripes[i].dev == sdev->dev &&
+		if (map->stripes[i].dev == sctx->dev &&
 		    map->stripes[i].physical == dev_offset) {
-			ret = scrub_stripe(sdev, map, i, chunk_offset, length);
+			ret = scrub_stripe(sctx, map, i, chunk_offset, length);
 			if (ret)
 				goto out;
 		}
@@ -2069,11 +2069,11 @@ out:
 }
 
 static noinline_for_stack
-int scrub_enumerate_chunks(struct scrub_dev *sdev, u64 start, u64 end)
+int scrub_enumerate_chunks(struct scrub_ctx *sctx, u64 start, u64 end)
 {
 	struct btrfs_dev_extent *dev_extent = NULL;
 	struct btrfs_path *path;
-	struct btrfs_root *root = sdev->dev->dev_root;
+	struct btrfs_root *root = sctx->dev->dev_root;
 	struct btrfs_fs_info *fs_info = root->fs_info;
 	u64 length;
 	u64 chunk_tree;
@@ -2094,7 +2094,7 @@ int scrub_enumerate_chunks(struct scrub_dev *sdev, u64 start, u64 end)
 	path->search_commit_root = 1;
 	path->skip_locking = 1;
 
-	key.objectid = sdev->dev->devid;
+	key.objectid = sctx->dev->devid;
 	key.offset = 0ull;
 	key.type = BTRFS_DEV_EXTENT_KEY;
 
@@ -2117,7 +2117,7 @@ int scrub_enumerate_chunks(struct scrub_dev *sdev, u64 start, u64 end)
 
 		btrfs_item_key_to_cpu(l, &found_key, slot);
 
-		if (found_key.objectid != sdev->dev->devid)
+		if (found_key.objectid != sctx->dev->devid)
 			break;
 
 		if (btrfs_key_type(&found_key) != BTRFS_DEV_EXTENT_KEY)
@@ -2151,7 +2151,7 @@ int scrub_enumerate_chunks(struct scrub_dev *sdev, u64 start, u64 end)
 			ret = -ENOENT;
 			break;
 		}
-		ret = scrub_chunk(sdev, chunk_tree, chunk_objectid,
+		ret = scrub_chunk(sctx, chunk_tree, chunk_objectid,
 				  chunk_offset, length, found_key.offset);
 		btrfs_put_block_group(cache);
 		if (ret)
@@ -2170,13 +2170,13 @@ int scrub_enumerate_chunks(struct scrub_dev *sdev, u64 start, u64 end)
 	return ret < 0 ? ret : 0;
 }
 
-static noinline_for_stack int scrub_supers(struct scrub_dev *sdev)
+static noinline_for_stack int scrub_supers(struct scrub_ctx *sctx)
 {
 	int	i;
 	u64	bytenr;
 	u64	gen;
 	int	ret;
-	struct btrfs_device *device = sdev->dev;
+	struct btrfs_device *device = sctx->dev;
 	struct btrfs_root *root = device->dev_root;
 
 	if (root->fs_info->fs_state & BTRFS_SUPER_FLAG_ERROR)
@@ -2189,12 +2189,12 @@ static noinline_for_stack int scrub_supers(struct scrub_dev *sdev)
 		if (bytenr + BTRFS_SUPER_INFO_SIZE > device->total_bytes)
 			break;
 
-		ret = scrub_pages(sdev, bytenr, BTRFS_SUPER_INFO_SIZE, bytenr,
+		ret = scrub_pages(sctx, bytenr, BTRFS_SUPER_INFO_SIZE, bytenr,
 				     BTRFS_EXTENT_FLAG_SUPER, gen, i, NULL, 1);
 		if (ret)
 			return ret;
 	}
-	wait_event(sdev->list_wait, atomic_read(&sdev->in_flight) == 0);
+	wait_event(sctx->list_wait, atomic_read(&sctx->in_flight) == 0);
 
 	return 0;
 }
@@ -2238,7 +2238,7 @@ static noinline_for_stack void scrub_workers_put(struct btrfs_root *root)
 int btrfs_scrub_dev(struct btrfs_root *root, u64 devid, u64 start, u64 end,
 		    struct btrfs_scrub_progress *progress, int readonly)
 {
-	struct scrub_dev *sdev;
+	struct scrub_ctx *sctx;
 	struct btrfs_fs_info *fs_info = root->fs_info;
 	int ret;
 	struct btrfs_device *dev;
@@ -2302,41 +2302,41 @@ int btrfs_scrub_dev(struct btrfs_root *root, u64 devid, u64 start, u64 end,
 		scrub_workers_put(root);
 		return -EINPROGRESS;
 	}
-	sdev = scrub_setup_dev(dev);
-	if (IS_ERR(sdev)) {
+	sctx = scrub_setup_ctx(dev);
+	if (IS_ERR(sctx)) {
 		mutex_unlock(&fs_info->scrub_lock);
 		mutex_unlock(&root->fs_info->fs_devices->device_list_mutex);
 		scrub_workers_put(root);
-		return PTR_ERR(sdev);
+		return PTR_ERR(sctx);
 	}
-	sdev->readonly = readonly;
-	dev->scrub_device = sdev;
+	sctx->readonly = readonly;
+	dev->scrub_device = sctx;
 
 	atomic_inc(&fs_info->scrubs_running);
 	mutex_unlock(&fs_info->scrub_lock);
 	mutex_unlock(&root->fs_info->fs_devices->device_list_mutex);
 
 	down_read(&fs_info->scrub_super_lock);
-	ret = scrub_supers(sdev);
+	ret = scrub_supers(sctx);
 	up_read(&fs_info->scrub_super_lock);
 
 	if (!ret)
-		ret = scrub_enumerate_chunks(sdev, start, end);
+		ret = scrub_enumerate_chunks(sctx, start, end);
 
-	wait_event(sdev->list_wait, atomic_read(&sdev->in_flight) == 0);
+	wait_event(sctx->list_wait, atomic_read(&sctx->in_flight) == 0);
 	atomic_dec(&fs_info->scrubs_running);
 	wake_up(&fs_info->scrub_pause_wait);
 
-	wait_event(sdev->list_wait, atomic_read(&sdev->fixup_cnt) == 0);
+	wait_event(sctx->list_wait, atomic_read(&sctx->fixup_cnt) == 0);
 
 	if (progress)
-		memcpy(progress, &sdev->stat, sizeof(*progress));
+		memcpy(progress, &sctx->stat, sizeof(*progress));
 
 	mutex_lock(&fs_info->scrub_lock);
 	dev->scrub_device = NULL;
 	mutex_unlock(&fs_info->scrub_lock);
 
-	scrub_free_dev(sdev);
+	scrub_free_ctx(sctx);
 	scrub_workers_put(root);
 
 	return ret;
@@ -2407,15 +2407,15 @@ int btrfs_scrub_cancel(struct btrfs_root *root)
 int btrfs_scrub_cancel_dev(struct btrfs_root *root, struct btrfs_device *dev)
 {
 	struct btrfs_fs_info *fs_info = root->fs_info;
-	struct scrub_dev *sdev;
+	struct scrub_ctx *sctx;
 
 	mutex_lock(&fs_info->scrub_lock);
-	sdev = dev->scrub_device;
-	if (!sdev) {
+	sctx = dev->scrub_device;
+	if (!sctx) {
 		mutex_unlock(&fs_info->scrub_lock);
 		return -ENOTCONN;
 	}
-	atomic_inc(&sdev->cancel_req);
+	atomic_inc(&sctx->cancel_req);
 	while (dev->scrub_device) {
 		mutex_unlock(&fs_info->scrub_lock);
 		wait_event(fs_info->scrub_pause_wait,
@@ -2453,15 +2453,15 @@ int btrfs_scrub_progress(struct btrfs_root *root, u64 devid,
 			 struct btrfs_scrub_progress *progress)
 {
 	struct btrfs_device *dev;
-	struct scrub_dev *sdev = NULL;
+	struct scrub_ctx *sctx = NULL;
 
 	mutex_lock(&root->fs_info->fs_devices->device_list_mutex);
 	dev = btrfs_find_device(root, devid, NULL, NULL);
 	if (dev)
-		sdev = dev->scrub_device;
-	if (sdev)
-		memcpy(progress, &sdev->stat, sizeof(*progress));
+		sctx = dev->scrub_device;
+	if (sctx)
+		memcpy(progress, &sctx->stat, sizeof(*progress));
 	mutex_unlock(&root->fs_info->fs_devices->device_list_mutex);
 
-	return dev ? (sdev ? 0 : -ENOTCONN) : -ENODEV;
+	return dev ? (sctx ? 0 : -ENOTCONN) : -ENODEV;
 }
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 53c06af..1789cda 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -88,7 +88,7 @@ struct btrfs_device {
 	u8 uuid[BTRFS_UUID_SIZE];
 
 	/* per-device scrub information */
-	struct scrub_dev *scrub_device;
+	struct scrub_ctx *scrub_device;
 
 	struct btrfs_work work;
 	struct rcu_head rcu;
-- 
1.8.0


Stefan Behrens

2012-Nov-06 16:38 UTC


[PATCH 02/26] Btrfs: remove the block device pointer from the scrub context struct

The block device is removed from the scrub context state structure.
The scrub code as it is used for the device replace procedure reads
the source data from wherever it is optimal. The source device might
even be gone (disconnected, for instance due to a hardware failure).
Or the drive can be so faulty that the device replace procedure
tries to avoid access to the faulty source drive as much as possible,
and accesses the source disk only as a last resort, if all other
mirrors are damaged.
The modified scrub code operates as if it were handling the source
drive and thereby generates an exact copy of the source disk on the
target disk, even if the source disk is not present at all. Therefore
the block device pointer to the source disk is removed from the scrub
context struct and moved into the lower level scope of the scrub_bio,
fixup and page structures where the block device context is known.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
---
 fs/btrfs/scrub.c | 133 ++++++++++++++++++++++++++++++-------------------------
 1 file changed, 73 insertions(+), 60 deletions(-)
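
Note for readers (not part of the commit message): once the device pointer
lives in the page and bio structures instead of the context, a page may only
be merged into a pending bio if it targets the same device as the pages
already queued there. The following is a minimal userspace sketch of that
merge rule, not the kernel code itself. All sketch_* names and
SKETCH_PAGE_SIZE are hypothetical stand-ins invented for this illustration;
only the three-part condition (physically contiguous, logically contiguous,
same device) is taken from the scrub_add_page_to_bio() hunk in this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed page size for the sketch */

/* Hypothetical stand-in for struct btrfs_device. */
struct sketch_device { int devid; };

/* Hypothetical stand-in for struct scrub_page: carries its own device. */
struct sketch_page {
	const struct sketch_device *dev;
	uint64_t physical;
	uint64_t logical;
};

/* Hypothetical stand-in for struct scrub_bio: remembers the device,
 * physical and logical start of the first page that was added. */
struct sketch_bio {
	const struct sketch_device *dev;
	uint64_t physical;
	uint64_t logical;
	unsigned int page_count;
};

/* An empty bio accepts any page. A non-empty bio accepts a page only if
 * it continues the bio physically and logically and is on the same device. */
static bool sketch_can_append(const struct sketch_bio *sbio,
			      const struct sketch_page *spage)
{
	uint64_t filled = (uint64_t)sbio->page_count * SKETCH_PAGE_SIZE;

	if (sbio->page_count == 0)
		return true;
	return sbio->physical + filled == spage->physical &&
	       sbio->logical + filled == spage->logical &&
	       sbio->dev == spage->dev;
}

/* Add a page; the first page fixes the bio's device and start offsets. */
static void sketch_append(struct sketch_bio *sbio,
			  const struct sketch_page *spage)
{
	assert(sketch_can_append(sbio, spage));
	if (sbio->page_count == 0) {
		sbio->dev = spage->dev;
		sbio->physical = spage->physical;
		sbio->logical = spage->logical;
	}
	sbio->page_count++;
}
```

In the patch itself, a page that fails this check causes the pending bio to
be submitted via scrub_submit() and the add to be retried on a fresh bio; the
sketch leaves that step to the caller.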

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 29c8aac..822c08a 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -67,6 +67,7 @@ struct scrub_page {
 struct scrub_bio {
 	int			index;
 	struct scrub_ctx	*sctx;
+	struct btrfs_device	*dev;
 	struct bio		*bio;
 	int			err;
 	u64			logical;
@@ -93,7 +94,7 @@ struct scrub_block {
 
 struct scrub_ctx {
 	struct scrub_bio	*bios[SCRUB_BIOS_PER_CTX];
-	struct btrfs_device	*dev;
+	struct btrfs_root	*dev_root;
 	int			first_free;
 	int			curr;
 	atomic_t		in_flight;
@@ -117,6 +118,7 @@ struct scrub_ctx {
 
 struct scrub_fixup_nodatasum {
 	struct scrub_ctx	*sctx;
+	struct btrfs_device	*dev;
 	u64			logical;
 	struct btrfs_root	*root;
 	struct btrfs_work	work;
@@ -166,8 +168,8 @@ static void scrub_block_put(struct scrub_block *sblock);
 static int scrub_add_page_to_bio(struct scrub_ctx *sctx,
 				 struct scrub_page *spage);
 static int scrub_pages(struct scrub_ctx *sctx, u64 logical, u64 len,
-		       u64 physical, u64 flags, u64 gen, int mirror_num,
-		       u8 *csum, int force);
+		       u64 physical, struct btrfs_device *dev, u64 flags,
+		       u64 gen, int mirror_num, u8 *csum, int force);
 static void scrub_bio_end_io(struct bio *bio, int err);
 static void scrub_bio_end_io_worker(struct btrfs_work *work);
 static void scrub_block_complete(struct scrub_block *sblock);
@@ -228,9 +230,9 @@ struct scrub_ctx *scrub_setup_ctx(struct btrfs_device *dev)
 	sctx = kzalloc(sizeof(*sctx), GFP_NOFS);
 	if (!sctx)
 		goto nomem;
-	sctx->dev = dev;
 	sctx->pages_per_bio = pages_per_bio;
 	sctx->curr = -1;
+	sctx->dev_root = dev->dev_root;
 	for (i = 0; i < SCRUB_BIOS_PER_CTX; ++i) {
 		struct scrub_bio *sbio;
 
@@ -345,8 +347,8 @@ err:
 
 static void scrub_print_warning(const char *errstr, struct scrub_block *sblock)
 {
-	struct btrfs_device *dev = sblock->sctx->dev;
-	struct btrfs_fs_info *fs_info = dev->dev_root->fs_info;
+	struct btrfs_device *dev;
+	struct btrfs_fs_info *fs_info;
 	struct btrfs_path *path;
 	struct btrfs_key found_key;
 	struct extent_buffer *eb;
@@ -361,15 +363,18 @@ static void scrub_print_warning(const char *errstr, struct scrub_block *sblock)
 	const int bufsize = 4096;
 	int ret;
 
+	WARN_ON(sblock->page_count < 1);
+	dev = sblock->pagev[0].dev;
+	fs_info = sblock->sctx->dev_root->fs_info;
+
 	path = btrfs_alloc_path();
 
 	swarn.scratch_buf = kmalloc(bufsize, GFP_NOFS);
 	swarn.msg_buf = kmalloc(bufsize, GFP_NOFS);
-	BUG_ON(sblock->page_count < 1);
 	swarn.sector = (sblock->pagev[0].physical) >> 9;
 	swarn.logical = sblock->pagev[0].logical;
 	swarn.errstr = errstr;
-	swarn.dev = dev;
+	swarn.dev = NULL;
 	swarn.msg_bufsize = bufsize;
 	swarn.scratch_bufsize = bufsize;
 
@@ -405,6 +410,7 @@ static void scrub_print_warning(const char *errstr, struct scrub_block *sblock)
 		} while (ret != 1);
 	} else {
 		swarn.path = path;
+		swarn.dev = dev;
 		iterate_extent_inodes(fs_info, found_key.objectid,
 					extent_item_pos, 1,
 					scrub_print_warning_inode, &swarn);
@@ -588,7 +594,7 @@ out:
 		printk_ratelimited_in_rcu(KERN_ERR
 			"btrfs: unable to fixup (nodatasum) error at logical %llu on dev %s\n",
 			(unsigned long long)fixup->logical,
-			rcu_str_deref(sctx->dev->name));
+			rcu_str_deref(fixup->dev->name));
 	}
 
 	btrfs_free_path(path);
@@ -615,6 +621,7 @@ out:
 static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
 {
 	struct scrub_ctx *sctx = sblock_to_check->sctx;
+	struct btrfs_device *dev;
 	struct btrfs_fs_info *fs_info;
 	u64 length;
 	u64 logical;
@@ -633,7 +640,7 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
 				      DEFAULT_RATELIMIT_BURST);
 
 	BUG_ON(sblock_to_check->page_count < 1);
-	fs_info = sctx->dev->dev_root->fs_info;
+	fs_info = sctx->dev_root->fs_info;
 	length = sblock_to_check->page_count * PAGE_SIZE;
 	logical = sblock_to_check->pagev[0].logical;
 	generation = sblock_to_check->pagev[0].generation;
@@ -643,6 +650,7 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
 			BTRFS_EXTENT_FLAG_DATA);
 	have_csum = sblock_to_check->pagev[0].have_csum;
 	csum = sblock_to_check->pagev[0].csum;
+	dev = sblock_to_check->pagev[0].dev;
 
 	/*
 	 * read all mirrors one after the other. This includes to
@@ -682,8 +690,7 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
 		sctx->stat.read_errors++;
 		sctx->stat.uncorrectable_errors++;
 		spin_unlock(&sctx->stat_lock);
-		btrfs_dev_stat_inc_and_print(sctx->dev,
-					     BTRFS_DEV_STAT_READ_ERRS);
+		btrfs_dev_stat_inc_and_print(dev, BTRFS_DEV_STAT_READ_ERRS);
 		goto out;
 	}
 
@@ -695,8 +702,7 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
 		sctx->stat.read_errors++;
 		sctx->stat.uncorrectable_errors++;
 		spin_unlock(&sctx->stat_lock);
-		btrfs_dev_stat_inc_and_print(sctx->dev,
-					     BTRFS_DEV_STAT_READ_ERRS);
+		btrfs_dev_stat_inc_and_print(dev, BTRFS_DEV_STAT_READ_ERRS);
 		goto out;
 	}
 	BUG_ON(failed_mirror_index >= BTRFS_MAX_MIRRORS);
@@ -710,8 +716,7 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
 		sctx->stat.read_errors++;
 		sctx->stat.uncorrectable_errors++;
 		spin_unlock(&sctx->stat_lock);
-		btrfs_dev_stat_inc_and_print(sctx->dev,
-					     BTRFS_DEV_STAT_READ_ERRS);
+		btrfs_dev_stat_inc_and_print(dev, BTRFS_DEV_STAT_READ_ERRS);
 		goto out;
 	}
 
@@ -738,15 +743,14 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
 		spin_unlock(&sctx->stat_lock);
 		if (__ratelimit(&_rs))
 			scrub_print_warning("i/o error", sblock_to_check);
-		btrfs_dev_stat_inc_and_print(sctx->dev,
-					     BTRFS_DEV_STAT_READ_ERRS);
+		btrfs_dev_stat_inc_and_print(dev, BTRFS_DEV_STAT_READ_ERRS);
 	} else if (sblock_bad->checksum_error) {
 		spin_lock(&sctx->stat_lock);
 		sctx->stat.csum_errors++;
 		spin_unlock(&sctx->stat_lock);
 		if (__ratelimit(&_rs))
 			scrub_print_warning("checksum error", sblock_to_check);
-		btrfs_dev_stat_inc_and_print(sctx->dev,
+		btrfs_dev_stat_inc_and_print(dev,
 					     BTRFS_DEV_STAT_CORRUPTION_ERRS);
 	} else if (sblock_bad->header_error) {
 		spin_lock(&sctx->stat_lock);
@@ -756,10 +760,10 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
 			scrub_print_warning("checksum/header error",
 					    sblock_to_check);
 		if (sblock_bad->generation_error)
-			btrfs_dev_stat_inc_and_print(sctx->dev,
+			btrfs_dev_stat_inc_and_print(dev,
 				BTRFS_DEV_STAT_GENERATION_ERRS);
 		else
-			btrfs_dev_stat_inc_and_print(sctx->dev,
+			btrfs_dev_stat_inc_and_print(dev,
 				BTRFS_DEV_STAT_CORRUPTION_ERRS);
 	}
 
@@ -780,6 +784,7 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
 		if (!fixup_nodatasum)
 			goto did_not_correct_error;
 		fixup_nodatasum->sctx = sctx;
+		fixup_nodatasum->dev = dev;
 		fixup_nodatasum->logical = logical;
 		fixup_nodatasum->root = fs_info->extent_root;
 		fixup_nodatasum->mirror_num = failed_mirror_index + 1;
@@ -945,7 +950,7 @@ corrected_error:
 			printk_ratelimited_in_rcu(KERN_ERR
 				"btrfs: fixed up error at logical %llu on dev %s\n",
 				(unsigned long long)logical,
-				rcu_str_deref(sctx->dev->name));
+				rcu_str_deref(dev->name));
 		}
 	} else {
 did_not_correct_error:
@@ -955,7 +960,7 @@ did_not_correct_error:
 		printk_ratelimited_in_rcu(KERN_ERR
 			"btrfs: unable to fixup (regular) error at logical %llu on dev %s\n",
 			(unsigned long long)logical,
-			rcu_str_deref(sctx->dev->name));
+			rcu_str_deref(dev->name));
 	}
 
 out:
@@ -1266,7 +1271,7 @@ static int scrub_checksum_data(struct scrub_block *sblock)
 	void *buffer;
 	u32 crc = ~(u32)0;
 	int fail = 0;
-	struct btrfs_root *root = sctx->dev->dev_root;
+	struct btrfs_root *root = sctx->dev_root;
 	u64 len;
 	int index;
 
@@ -1306,7 +1311,7 @@ static int scrub_checksum_tree_block(struct scrub_block *sblock)
 {
 	struct scrub_ctx *sctx = sblock->sctx;
 	struct btrfs_header *h;
-	struct btrfs_root *root = sctx->dev->dev_root;
+	struct btrfs_root *root = sctx->dev_root;
 	struct btrfs_fs_info *fs_info = root->fs_info;
 	u8 calculated_csum[BTRFS_CSUM_SIZE];
 	u8 on_disk_csum[BTRFS_CSUM_SIZE];
@@ -1378,7 +1383,7 @@ static int scrub_checksum_super(struct scrub_block *sblock)
 {
 	struct btrfs_super_block *s;
 	struct scrub_ctx *sctx = sblock->sctx;
-	struct btrfs_root *root = sctx->dev->dev_root;
+	struct btrfs_root *root = sctx->dev_root;
 	struct btrfs_fs_info *fs_info = root->fs_info;
 	u8 calculated_csum[BTRFS_CSUM_SIZE];
 	u8 on_disk_csum[BTRFS_CSUM_SIZE];
@@ -1442,10 +1447,10 @@ static int scrub_checksum_super(struct scrub_block *sblock)
 		++sctx->stat.super_errors;
 		spin_unlock(&sctx->stat_lock);
 		if (fail_cor)
-			btrfs_dev_stat_inc_and_print(sctx->dev,
+			btrfs_dev_stat_inc_and_print(sblock->pagev[0].dev,
 				BTRFS_DEV_STAT_CORRUPTION_ERRS);
 		else
-			btrfs_dev_stat_inc_and_print(sctx->dev,
+			btrfs_dev_stat_inc_and_print(sblock->pagev[0].dev,
 				BTRFS_DEV_STAT_GENERATION_ERRS);
 	}
 
@@ -1513,6 +1518,7 @@ again:
 
 		sbio->physical = spage->physical;
 		sbio->logical = spage->logical;
+		sbio->dev = spage->dev;
 		bio = sbio->bio;
 		if (!bio) {
 			bio = bio_alloc(GFP_NOFS, sctx->pages_per_bio);
@@ -1523,13 +1529,14 @@ again:
 
 		bio->bi_private = sbio;
 		bio->bi_end_io = scrub_bio_end_io;
-		bio->bi_bdev = sctx->dev->bdev;
-		bio->bi_sector = spage->physical >> 9;
+		bio->bi_bdev = sbio->dev->bdev;
+		bio->bi_sector = sbio->physical >> 9;
 		sbio->err = 0;
 	} else if (sbio->physical + sbio->page_count * PAGE_SIZE !=
 		   spage->physical ||
 		   sbio->logical + sbio->page_count * PAGE_SIZE !=
-		   spage->logical) {
+		   spage->logical ||
+		   sbio->dev != spage->dev) {
 		scrub_submit(sctx);
 		goto again;
 	}
@@ -1556,8 +1563,8 @@ again:
 }
 
 static int scrub_pages(struct scrub_ctx *sctx, u64 logical, u64 len,
-		       u64 physical, u64 flags, u64 gen, int mirror_num,
-		       u8 *csum, int force)
+		       u64 physical, struct btrfs_device *dev, u64 flags,
+		       u64 gen, int mirror_num, u8 *csum, int force)
 {
 	struct scrub_block *sblock;
 	int index;
@@ -1593,7 +1600,7 @@ static int scrub_pages(struct scrub_ctx *sctx, u64 logical, u64 len,
 			return -ENOMEM;
 		}
 		spage->sblock = sblock;
-		spage->dev = sctx->dev;
+		spage->dev = dev;
 		spage->flags = flags;
 		spage->generation = gen;
 		spage->logical = logical;
@@ -1634,8 +1641,7 @@ static int scrub_pages(struct scrub_ctx *sctx, u64 logical, u64 len,
 static void scrub_bio_end_io(struct bio *bio, int err)
 {
 	struct scrub_bio *sbio = bio->bi_private;
-	struct scrub_ctx *sctx = sbio->sctx;
-	struct btrfs_fs_info *fs_info = sctx->dev->dev_root->fs_info;
+	struct btrfs_fs_info *fs_info = sbio->dev->dev_root->fs_info;
 
 	sbio->err = err;
 	sbio->bio = bio;
@@ -1728,7 +1734,8 @@ static int scrub_find_csum(struct scrub_ctx *sctx, u64 logical, u64 len,
 
 /* scrub extent tries to collect up to 64 kB for each bio */
 static int scrub_extent(struct scrub_ctx *sctx, u64 logical, u64 len,
-			u64 physical, u64 flags, u64 gen, int mirror_num)
+			u64 physical, struct btrfs_device *dev, u64 flags,
+			u64 gen, int mirror_num)
 {
 	int ret;
 	u8 csum[BTRFS_CSUM_SIZE];
@@ -1762,7 +1769,7 @@ static int scrub_extent(struct scrub_ctx *sctx, u64 logical, u64 len,
 			if (have_csum == 0)
 				++sctx->stat.no_csum;
 		}
-		ret = scrub_pages(sctx, logical, l, physical, flags, gen,
+		ret = scrub_pages(sctx, logical, l, physical, dev, flags, gen,
 				  mirror_num, have_csum ? csum : NULL, 0);
 		if (ret)
 			return ret;
@@ -1774,10 +1781,12 @@ static int scrub_extent(struct scrub_ctx *sctx, u64 logical, u64 len,
 }
 
 static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx,
-	struct map_lookup *map, int num, u64 base, u64 length)
+					   struct map_lookup *map,
+					   struct btrfs_device *scrub_dev,
+					   int num, u64 base, u64 length)
 {
 	struct btrfs_path *path;
-	struct btrfs_fs_info *fs_info = sctx->dev->dev_root->fs_info;
+	struct btrfs_fs_info *fs_info = sctx->dev_root->fs_info;
 	struct btrfs_root *root = fs_info->extent_root;
 	struct btrfs_root *csum_root = fs_info->csum_root;
 	struct btrfs_extent_item *extent;
@@ -1797,7 +1806,6 @@ static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx,
 	struct reada_control *reada2;
 	struct btrfs_key key_start;
 	struct btrfs_key key_end;
-
 	u64 increment = map->stripe_len;
 	u64 offset;
 
@@ -2006,7 +2014,8 @@ static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx,
 
 			ret = scrub_extent(sctx, key.objectid, key.offset,
 					   key.objectid - logical + physical,
-					   flags, generation, mirror_num);
+					   scrub_dev, flags, generation,
+					   mirror_num);
 			if (ret)
 				goto out;
 
@@ -2030,11 +2039,13 @@ out:
 }
 
 static noinline_for_stack int scrub_chunk(struct scrub_ctx *sctx,
-	u64 chunk_tree, u64 chunk_objectid, u64 chunk_offset, u64 length,
-	u64 dev_offset)
+					  struct btrfs_device *scrub_dev,
+					  u64 chunk_tree, u64 chunk_objectid,
+					  u64 chunk_offset, u64 length,
+					  u64 dev_offset)
 {
 	struct btrfs_mapping_tree *map_tree =
-		&sctx->dev->dev_root->fs_info->mapping_tree;
+		&sctx->dev_root->fs_info->mapping_tree;
 	struct map_lookup *map;
 	struct extent_map *em;
 	int i;
@@ -2055,9 +2066,10 @@ static noinline_for_stack int scrub_chunk(struct scrub_ctx *sctx,
 		goto out;
 
 	for (i = 0; i < map->num_stripes; ++i) {
-		if (map->stripes[i].dev == sctx->dev &&
+		if (map->stripes[i].dev->bdev == scrub_dev->bdev &&
 		    map->stripes[i].physical == dev_offset) {
-			ret = scrub_stripe(sctx, map, i, chunk_offset, length);
+			ret = scrub_stripe(sctx, map, scrub_dev, i,
+					   chunk_offset, length);
 			if (ret)
 				goto out;
 		}
@@ -2069,11 +2081,12 @@ out:
 }
 
 static noinline_for_stack
-int scrub_enumerate_chunks(struct scrub_ctx *sctx, u64 start, u64 end)
+int scrub_enumerate_chunks(struct scrub_ctx *sctx,
+			   struct btrfs_device *scrub_dev, u64 start, u64 end)
 {
 	struct btrfs_dev_extent *dev_extent = NULL;
 	struct btrfs_path *path;
-	struct btrfs_root *root = sctx->dev->dev_root;
+	struct btrfs_root *root = sctx->dev_root;
 	struct btrfs_fs_info *fs_info = root->fs_info;
 	u64 length;
 	u64 chunk_tree;
@@ -2094,11 +2107,10 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx, u64 start, u64 end)
 	path->search_commit_root = 1;
 	path->skip_locking = 1;
 
-	key.objectid = sctx->dev->devid;
+	key.objectid = scrub_dev->devid;
 	key.offset = 0ull;
 	key.type = BTRFS_DEV_EXTENT_KEY;
 
-
 	while (1) {
 		ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
 		if (ret < 0)
@@ -2117,7 +2129,7 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx, u64 start, u64 end)
 
 		btrfs_item_key_to_cpu(l, &found_key, slot);
 
-		if (found_key.objectid != sctx->dev->devid)
+		if (found_key.objectid != scrub_dev->devid)
 			break;
 
 		if (btrfs_key_type(&found_key) != BTRFS_DEV_EXTENT_KEY)
@@ -2151,7 +2163,7 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx, u64 start, u64 end)
 			ret = -ENOENT;
 			break;
 		}
-		ret = scrub_chunk(sctx, chunk_tree, chunk_objectid,
+		ret = scrub_chunk(sctx, scrub_dev, chunk_tree, chunk_objectid,
 				  chunk_offset, length, found_key.offset);
 		btrfs_put_block_group(cache);
 		if (ret)
@@ -2170,14 +2182,14 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx, u64 start, u64 end)
 	return ret < 0 ? ret : 0;
 }
 
-static noinline_for_stack int scrub_supers(struct scrub_ctx *sctx)
+static noinline_for_stack int scrub_supers(struct scrub_ctx *sctx,
+					   struct btrfs_device *scrub_dev)
 {
 	int	i;
 	u64	bytenr;
 	u64	gen;
 	int	ret;
-	struct btrfs_device *device = sctx->dev;
-	struct btrfs_root *root = device->dev_root;
+	struct btrfs_root *root = sctx->dev_root;
 
 	if (root->fs_info->fs_state & BTRFS_SUPER_FLAG_ERROR)
 		return -EIO;
@@ -2186,11 +2198,12 @@ static noinline_for_stack int scrub_supers(struct scrub_ctx *sctx)
 
 	for (i = 0; i < BTRFS_SUPER_MIRROR_MAX; i++) {
 		bytenr = btrfs_sb_offset(i);
-		if (bytenr + BTRFS_SUPER_INFO_SIZE > device->total_bytes)
+		if (bytenr + BTRFS_SUPER_INFO_SIZE > scrub_dev->total_bytes)
 			break;
 
 		ret = scrub_pages(sctx, bytenr, BTRFS_SUPER_INFO_SIZE, bytenr,
-				     BTRFS_EXTENT_FLAG_SUPER, gen, i, NULL, 1);
+				  scrub_dev, BTRFS_EXTENT_FLAG_SUPER, gen, i,
+				  NULL, 1);
 		if (ret)
 			return ret;
 	}
@@ -2317,11 +2330,11 @@ int btrfs_scrub_dev(struct btrfs_root *root, u64 devid, u64 start, u64 end,
 	mutex_unlock(&root->fs_info->fs_devices->device_list_mutex);
 
 	down_read(&fs_info->scrub_super_lock);
-	ret = scrub_supers(sctx);
+	ret = scrub_supers(sctx, dev);
 	up_read(&fs_info->scrub_super_lock);
 
 	if (!ret)
-		ret = scrub_enumerate_chunks(sctx, start, end);
+		ret = scrub_enumerate_chunks(sctx, dev, start, end);
 
 	wait_event(sctx->list_wait, atomic_read(&sctx->in_flight) == 0);
 	atomic_dec(&fs_info->scrubs_running);
-- 
1.8.0

Stefan Behrens

2012-Nov-06 16:38 UTC

[PATCH 03/26] Btrfs: make the scrub page array dynamically allocated

The modified design, which is needed to support the device replace
procedure, makes it necessary to allocate the page array dynamically.
The reason is that pages are reused: a page is first used for
the bio that reads the data from the filesystem, and the same page
is then reused for the bio that writes the data to the target disk.
Since the read process and the write process are completely
decoupled, this requires a new concept of refcounts and get/put
functions for pages, and it requires newly allocated pages
for each read bio, which are freed after the write operation
has finished.
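
The get/put idea described above can be illustrated with a small
userspace sketch. This is not the kernel code from the patch: the
names demo_page, demo_page_alloc, demo_page_get, demo_page_put and
demo_pages_freed are hypothetical, a plain int stands in for atomic_t,
and malloc stands in for alloc_page. It only shows that the buffer
stays alive until both the read side and the write side have dropped
their reference.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for struct scrub_page (not the kernel type). */
struct demo_page {
	int ref_count;	/* the patch uses atomic_t; a plain int keeps the sketch simple */
	void *data;	/* stands in for the struct page allocated for each read */
};

/* Counts completed frees so that the behaviour can be observed from outside. */
static int demo_pages_freed;

static void demo_page_get(struct demo_page *p)
{
	p->ref_count++;
}

/* Drops one reference; the last put releases the buffer and the descriptor. */
static void demo_page_put(struct demo_page *p)
{
	assert(p->ref_count > 0);
	if (--p->ref_count == 0) {
		free(p->data);
		free(p);
		demo_pages_freed++;
	}
}

/* Allocates a zeroed descriptor, takes the first reference, then the buffer. */
static struct demo_page *demo_page_alloc(size_t size)
{
	struct demo_page *p = calloc(1, sizeof(*p));

	if (!p)
		return NULL;
	demo_page_get(p);		/* reference held by the creator */
	p->data = malloc(size);
	if (!p->data) {
		demo_page_put(p);	/* frees the descriptor again */
		return NULL;
	}
	return p;
}
```

In this sketch, a reader that hands the page to a writer would call
demo_page_get() before submitting the write and demo_page_put() once
its own work is done. The buffer is released only when the writer
calls demo_page_put() as well.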

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
---
 fs/btrfs/scrub.c | 195 ++++++++++++++++++++++++++++++++++---------------------
 1 file changed, 121 insertions(+), 74 deletions(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 822c08a..15ac82a 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -46,6 +46,12 @@ struct scrub_ctx;
 
 #define SCRUB_PAGES_PER_BIO	16	/* 64k per bio */
 #define SCRUB_BIOS_PER_CTX	16	/* 1 MB per device in flight */
+
+/*
+ * the following value times PAGE_SIZE needs to be large enough to match the
+ * largest node/leaf/sector size that shall be supported.
+ * Values larger than BTRFS_STRIPE_LEN are not supported.
+ */
 #define SCRUB_MAX_PAGES_PER_BLOCK	16	/* 64k per node/leaf/sector */
 
 struct scrub_page {
@@ -56,6 +62,7 @@ struct scrub_page {
 	u64			generation;
 	u64			logical;
 	u64			physical;
+	atomic_t		ref_count;
 	struct {
 		unsigned int	mirror_num:8;
 		unsigned int	have_csum:1;
@@ -79,7 +86,7 @@ struct scrub_bio {
 };
 
 struct scrub_block {
-	struct scrub_page	pagev[SCRUB_MAX_PAGES_PER_BLOCK];
+	struct scrub_page	*pagev[SCRUB_MAX_PAGES_PER_BLOCK];
 	int			page_count;
 	atomic_t		outstanding_pages;
 	atomic_t		ref_count; /* free mem on transition to zero */
@@ -165,6 +172,8 @@ static int scrub_checksum_tree_block(struct scrub_block *sblock);
 static int scrub_checksum_super(struct scrub_block *sblock);
 static void scrub_block_get(struct scrub_block *sblock);
 static void scrub_block_put(struct scrub_block *sblock);
+static void scrub_page_get(struct scrub_page *spage);
+static void scrub_page_put(struct scrub_page *spage);
 static int scrub_add_page_to_bio(struct scrub_ctx *sctx,
 				 struct scrub_page *spage);
 static int scrub_pages(struct scrub_ctx *sctx, u64 logical, u64 len,
@@ -364,15 +373,15 @@ static void scrub_print_warning(const char *errstr, struct scrub_block *sblock)
 	int ret;
 
 	WARN_ON(sblock->page_count < 1);
-	dev = sblock->pagev[0].dev;
+	dev = sblock->pagev[0]->dev;
 	fs_info = sblock->sctx->dev_root->fs_info;
 
 	path = btrfs_alloc_path();
 
 	swarn.scratch_buf = kmalloc(bufsize, GFP_NOFS);
 	swarn.msg_buf = kmalloc(bufsize, GFP_NOFS);
-	swarn.sector = (sblock->pagev[0].physical) >> 9;
-	swarn.logical = sblock->pagev[0].logical;
+	swarn.sector = (sblock->pagev[0]->physical) >> 9;
+	swarn.logical = sblock->pagev[0]->logical;
 	swarn.errstr = errstr;
 	swarn.dev = NULL;
 	swarn.msg_bufsize = bufsize;
@@ -642,15 +651,15 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
 	BUG_ON(sblock_to_check->page_count < 1);
 	fs_info = sctx->dev_root->fs_info;
 	length = sblock_to_check->page_count * PAGE_SIZE;
-	logical = sblock_to_check->pagev[0].logical;
-	generation = sblock_to_check->pagev[0].generation;
-	BUG_ON(sblock_to_check->pagev[0].mirror_num < 1);
-	failed_mirror_index = sblock_to_check->pagev[0].mirror_num - 1;
-	is_metadata = !(sblock_to_check->pagev[0].flags &
+	logical = sblock_to_check->pagev[0]->logical;
+	generation = sblock_to_check->pagev[0]->generation;
+	BUG_ON(sblock_to_check->pagev[0]->mirror_num < 1);
+	failed_mirror_index = sblock_to_check->pagev[0]->mirror_num - 1;
+	is_metadata = !(sblock_to_check->pagev[0]->flags &
 			BTRFS_EXTENT_FLAG_DATA);
-	have_csum = sblock_to_check->pagev[0].have_csum;
-	csum = sblock_to_check->pagev[0].csum;
-	dev = sblock_to_check->pagev[0].dev;
+	have_csum = sblock_to_check->pagev[0]->have_csum;
+	csum = sblock_to_check->pagev[0]->csum;
+	dev = sblock_to_check->pagev[0]->dev;
 
 	/*
 	 * read all mirrors one after the other. This includes to
@@ -892,7 +901,7 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
 
 	success = 1;
 	for (page_num = 0; page_num < sblock_bad->page_count; page_num++) {
-		struct scrub_page *page_bad = sblock_bad->pagev + page_num;
+		struct scrub_page *page_bad = sblock_bad->pagev[page_num];
 
 		if (!page_bad->io_error)
 			continue;
@@ -903,8 +912,8 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
 		     mirror_index++) {
 			struct scrub_block *sblock_other = sblocks_for_recheck +
 							   mirror_index;
-			struct scrub_page *page_other = sblock_other->pagev +
-							page_num;
+			struct scrub_page *page_other = sblock_other->pagev[
+							page_num];
 
 			if (!page_other->io_error) {
 				ret = scrub_repair_page_from_good_copy(
@@ -971,11 +980,11 @@ out:
 						     mirror_index;
 			int page_index;
 
-			for (page_index = 0; page_index < SCRUB_PAGES_PER_BIO;
-			     page_index++)
-				if (sblock->pagev[page_index].page)
-					__free_page(
-						sblock->pagev[page_index].page);
+			for (page_index = 0; page_index < sblock->page_count;
+			     page_index++) {
+				sblock->pagev[page_index]->sblock = NULL;
+				scrub_page_put(sblock->pagev[page_index]);
+			}
 		}
 		kfree(sblocks_for_recheck);
 	}
@@ -993,7 +1002,7 @@ static int scrub_setup_recheck_block(struct scrub_ctx *sctx,
 	int ret;
 
 	/*
-	 * note: the three members sctx, ref_count and outstanding_pages
+	 * note: the two members ref_count and outstanding_pages
 	 * are not used (and not set) in the blocks that are used for
 	 * the recheck procedure
 	 */
@@ -1025,21 +1034,27 @@ static int scrub_setup_recheck_block(struct scrub_ctx *sctx,
 				continue;
 
 			sblock = sblocks_for_recheck + mirror_index;
-			page = sblock->pagev + page_index;
-			page->logical = logical;
-			page->physical = bbio->stripes[mirror_index].physical;
-			/* for missing devices, dev->bdev is NULL */
-			page->dev = bbio->stripes[mirror_index].dev;
-			page->mirror_num = mirror_index + 1;
-			page->page = alloc_page(GFP_NOFS);
-			if (!page->page) {
+			sblock->sctx = sctx;
+			page = kzalloc(sizeof(*page), GFP_NOFS);
+			if (!page) {
+leave_nomem:
 				spin_lock(&sctx->stat_lock);
 				sctx->stat.malloc_errors++;
 				spin_unlock(&sctx->stat_lock);
 				kfree(bbio);
 				return -ENOMEM;
 			}
+			scrub_page_get(page);
+			sblock->pagev[page_index] = page;
+			page->logical = logical;
+			page->physical = bbio->stripes[mirror_index].physical;
+			/* for missing devices, dev->bdev is NULL */
+			page->dev = bbio->stripes[mirror_index].dev;
+			page->mirror_num = mirror_index + 1;
 			sblock->page_count++;
+			page->page = alloc_page(GFP_NOFS);
+			if (!page->page)
+				goto leave_nomem;
 		}
 		kfree(bbio);
 		length -= sublen;
@@ -1071,7 +1086,7 @@ static int scrub_recheck_block(struct btrfs_fs_info *fs_info,
 	for (page_num = 0; page_num < sblock->page_count; page_num++) {
 		struct bio *bio;
 		int ret;
-		struct scrub_page *page = sblock->pagev + page_num;
+		struct scrub_page *page = sblock->pagev[page_num];
 		DECLARE_COMPLETION_ONSTACK(complete);
 
 		if (page->dev->bdev == NULL) {
@@ -1080,7 +1095,7 @@ static int scrub_recheck_block(struct btrfs_fs_info *fs_info,
 			continue;
 		}
 
-		BUG_ON(!page->page);
+		WARN_ON(!page->page);
 		bio = bio_alloc(GFP_NOFS, 1);
 		if (!bio)
 			return -EIO;
@@ -1125,14 +1140,14 @@ static void scrub_recheck_block_checksum(struct btrfs_fs_info *fs_info,
 	struct btrfs_root *root = fs_info->extent_root;
 	void *mapped_buffer;
 
-	BUG_ON(!sblock->pagev[0].page);
+	WARN_ON(!sblock->pagev[0]->page);
 	if (is_metadata) {
 		struct btrfs_header *h;
 
-		mapped_buffer = kmap_atomic(sblock->pagev[0].page);
+		mapped_buffer = kmap_atomic(sblock->pagev[0]->page);
 		h = (struct btrfs_header *)mapped_buffer;
 
-		if (sblock->pagev[0].logical != le64_to_cpu(h->bytenr) ||
+		if (sblock->pagev[0]->logical != le64_to_cpu(h->bytenr) ||
 		    memcmp(h->fsid, fs_info->fsid, BTRFS_UUID_SIZE) ||
 		    memcmp(h->chunk_tree_uuid, fs_info->chunk_tree_uuid,
 			   BTRFS_UUID_SIZE)) {
@@ -1146,7 +1161,7 @@ static void scrub_recheck_block_checksum(struct btrfs_fs_info *fs_info,
 		if (!have_csum)
 			return;
 
-		mapped_buffer = kmap_atomic(sblock->pagev[0].page);
+		mapped_buffer = kmap_atomic(sblock->pagev[0]->page);
 	}
 
 	for (page_num = 0;;) {
@@ -1162,9 +1177,9 @@ static void scrub_recheck_block_checksum(struct btrfs_fs_info *fs_info,
 		page_num++;
 		if (page_num >= sblock->page_count)
 			break;
-		BUG_ON(!sblock->pagev[page_num].page);
+		WARN_ON(!sblock->pagev[page_num]->page);
 
-		mapped_buffer = kmap_atomic(sblock->pagev[page_num].page);
+		mapped_buffer = kmap_atomic(sblock->pagev[page_num]->page);
 	}
 
 	btrfs_csum_final(crc, calculated_csum);
@@ -1202,11 +1217,11 @@ static int scrub_repair_page_from_good_copy(struct scrub_block *sblock_bad,
 					    struct scrub_block *sblock_good,
 					    int page_num, int force_write)
 {
-	struct scrub_page *page_bad = sblock_bad->pagev + page_num;
-	struct scrub_page *page_good = sblock_good->pagev + page_num;
+	struct scrub_page *page_bad = sblock_bad->pagev[page_num];
+	struct scrub_page *page_good = sblock_good->pagev[page_num];
 
-	BUG_ON(sblock_bad->pagev[page_num].page == NULL);
-	BUG_ON(sblock_good->pagev[page_num].page == NULL);
+	BUG_ON(page_bad->page == NULL);
+	BUG_ON(page_good->page == NULL);
 	if (force_write || sblock_bad->header_error ||
 	    sblock_bad->checksum_error || page_bad->io_error) {
 		struct bio *bio;
@@ -1247,8 +1262,8 @@ static void scrub_checksum(struct scrub_block *sblock)
 	u64 flags;
 	int ret;
 
-	BUG_ON(sblock->page_count < 1);
-	flags = sblock->pagev[0].flags;
+	WARN_ON(sblock->page_count < 1);
+	flags = sblock->pagev[0]->flags;
 	ret = 0;
 	if (flags & BTRFS_EXTENT_FLAG_DATA)
 		ret = scrub_checksum_data(sblock);
@@ -1276,11 +1291,11 @@ static int scrub_checksum_data(struct scrub_block *sblock)
 	int index;
 
 	BUG_ON(sblock->page_count < 1);
-	if (!sblock->pagev[0].have_csum)
+	if (!sblock->pagev[0]->have_csum)
 		return 0;
 
-	on_disk_csum = sblock->pagev[0].csum;
-	page = sblock->pagev[0].page;
+	on_disk_csum = sblock->pagev[0]->csum;
+	page = sblock->pagev[0]->page;
 	buffer = kmap_atomic(page);
 
 	len = sctx->sectorsize;
@@ -1295,8 +1310,8 @@ static int scrub_checksum_data(struct scrub_block *sblock)
 			break;
 		index++;
 		BUG_ON(index >= sblock->page_count);
-		BUG_ON(!sblock->pagev[index].page);
-		page = sblock->pagev[index].page;
+		BUG_ON(!sblock->pagev[index]->page);
+		page = sblock->pagev[index]->page;
 		buffer = kmap_atomic(page);
 	}
 
@@ -1326,7 +1341,7 @@ static int scrub_checksum_tree_block(struct scrub_block *sblock)
 	int index;
 
 	BUG_ON(sblock->page_count < 1);
-	page = sblock->pagev[0].page;
+	page = sblock->pagev[0]->page;
 	mapped_buffer = kmap_atomic(page);
 	h = (struct btrfs_header *)mapped_buffer;
 	memcpy(on_disk_csum, h->csum, sctx->csum_size);
@@ -1337,10 +1352,10 @@ static int scrub_checksum_tree_block(struct scrub_block *sblock)
 	 * b) the page is already kmapped
 	 */
 
-	if (sblock->pagev[0].logical != le64_to_cpu(h->bytenr))
+	if (sblock->pagev[0]->logical != le64_to_cpu(h->bytenr))
 		++fail;
 
-	if (sblock->pagev[0].generation != le64_to_cpu(h->generation))
+	if (sblock->pagev[0]->generation != le64_to_cpu(h->generation))
 		++fail;
 
 	if (memcmp(h->fsid, fs_info->fsid, BTRFS_UUID_SIZE))
@@ -1365,8 +1380,8 @@ static int scrub_checksum_tree_block(struct scrub_block *sblock)
 			break;
 		index++;
 		BUG_ON(index >= sblock->page_count);
-		BUG_ON(!sblock->pagev[index].page);
-		page = sblock->pagev[index].page;
+		BUG_ON(!sblock->pagev[index]->page);
+		page = sblock->pagev[index]->page;
 		mapped_buffer = kmap_atomic(page);
 		mapped_size = PAGE_SIZE;
 		p = mapped_buffer;
@@ -1398,15 +1413,15 @@ static int scrub_checksum_super(struct scrub_block *sblock)
 	int index;
 
 	BUG_ON(sblock->page_count < 1);
-	page = sblock->pagev[0].page;
+	page = sblock->pagev[0]->page;
 	mapped_buffer = kmap_atomic(page);
 	s = (struct btrfs_super_block *)mapped_buffer;
 	memcpy(on_disk_csum, s->csum, sctx->csum_size);
 
-	if (sblock->pagev[0].logical != le64_to_cpu(s->bytenr))
+	if (sblock->pagev[0]->logical != le64_to_cpu(s->bytenr))
 		++fail_cor;
 
-	if (sblock->pagev[0].generation != le64_to_cpu(s->generation))
+	if (sblock->pagev[0]->generation != le64_to_cpu(s->generation))
 		++fail_gen;
 
 	if (memcmp(s->fsid, fs_info->fsid, BTRFS_UUID_SIZE))
@@ -1426,8 +1441,8 @@ static int scrub_checksum_super(struct scrub_block *sblock)
 			break;
 		index++;
 		BUG_ON(index >= sblock->page_count);
-		BUG_ON(!sblock->pagev[index].page);
-		page = sblock->pagev[index].page;
+		BUG_ON(!sblock->pagev[index]->page);
+		page = sblock->pagev[index]->page;
 		mapped_buffer = kmap_atomic(page);
 		mapped_size = PAGE_SIZE;
 		p = mapped_buffer;
@@ -1447,10 +1462,10 @@ static int scrub_checksum_super(struct scrub_block *sblock)
 		++sctx->stat.super_errors;
 		spin_unlock(&sctx->stat_lock);
 		if (fail_cor)
-			btrfs_dev_stat_inc_and_print(sblock->pagev[0].dev,
+			btrfs_dev_stat_inc_and_print(sblock->pagev[0]->dev,
 				BTRFS_DEV_STAT_CORRUPTION_ERRS);
 		else
-			btrfs_dev_stat_inc_and_print(sblock->pagev[0].dev,
+			btrfs_dev_stat_inc_and_print(sblock->pagev[0]->dev,
 				BTRFS_DEV_STAT_GENERATION_ERRS);
 	}
 
@@ -1468,12 +1483,25 @@ static void scrub_block_put(struct scrub_block *sblock)
 		int i;
 
 		for (i = 0; i < sblock->page_count; i++)
-			if (sblock->pagev[i].page)
-				__free_page(sblock->pagev[i].page);
+			scrub_page_put(sblock->pagev[i]);
 		kfree(sblock);
 	}
 }
 
+static void scrub_page_get(struct scrub_page *spage)
+{
+	atomic_inc(&spage->ref_count);
+}
+
+static void scrub_page_put(struct scrub_page *spage)
+{
+	if (atomic_dec_and_test(&spage->ref_count)) {
+		if (spage->page)
+			__free_page(spage->page);
+		kfree(spage);
+	}
+}
+
 static void scrub_submit(struct scrub_ctx *sctx)
 {
 	struct scrub_bio *sbio;
@@ -1577,28 +1605,28 @@ static int scrub_pages(struct scrub_ctx *sctx, u64 logical, u64 len,
 		return -ENOMEM;
 	}
 
-	/* one ref inside this function, plus one for each page later on */
+	/* one ref inside this function, plus one for each page added to
+	 * a bio later on */
 	atomic_set(&sblock->ref_count, 1);
 	sblock->sctx = sctx;
 	sblock->no_io_error_seen = 1;
 
 	for (index = 0; len > 0; index++) {
-		struct scrub_page *spage = sblock->pagev + index;
+		struct scrub_page *spage;
 		u64 l = min_t(u64, len, PAGE_SIZE);
 
-		BUG_ON(index >= SCRUB_MAX_PAGES_PER_BLOCK);
-		spage->page = alloc_page(GFP_NOFS);
-		if (!spage->page) {
+		spage = kzalloc(sizeof(*spage), GFP_NOFS);
+		if (!spage) {
+leave_nomem:
 			spin_lock(&sctx->stat_lock);
 			sctx->stat.malloc_errors++;
 			spin_unlock(&sctx->stat_lock);
-			while (index > 0) {
-				index--;
-				__free_page(sblock->pagev[index].page);
-			}
-			kfree(sblock);
+			scrub_block_put(sblock);
 			return -ENOMEM;
 		}
+		BUG_ON(index >= SCRUB_MAX_PAGES_PER_BLOCK);
+		scrub_page_get(spage);
+		sblock->pagev[index] = spage;
 		spage->sblock = sblock;
 		spage->dev = dev;
 		spage->flags = flags;
@@ -1613,14 +1641,17 @@ static int scrub_pages(struct scrub_ctx *sctx, u64 logical, u64 len,
 			spage->have_csum = 0;
 		}
 		sblock->page_count++;
+		spage->page = alloc_page(GFP_NOFS);
+		if (!spage->page)
+			goto leave_nomem;
 		len -= l;
 		logical += l;
 		physical += l;
 	}
 
-	BUG_ON(sblock->page_count == 0);
+	WARN_ON(sblock->page_count == 0);
 	for (index = 0; index < sblock->page_count; index++) {
-		struct scrub_page *spage = sblock->pagev + index;
+		struct scrub_page *spage = sblock->pagev[index];
 		int ret;
 
 		ret = scrub_add_page_to_bio(sctx, spage);
@@ -2289,6 +2320,22 @@ int btrfs_scrub_dev(struct btrfs_root *root, u64 devid, u64 start, u64 end,
 		return -EINVAL;
 	}
 
+	if (fs_info->chunk_root->nodesize >
+	    PAGE_SIZE * SCRUB_MAX_PAGES_PER_BLOCK ||
+	    fs_info->chunk_root->sectorsize >
+	    PAGE_SIZE * SCRUB_MAX_PAGES_PER_BLOCK) {
+		/*
+		 * would exhaust the array bounds of pagev member in
+		 * struct scrub_block
+		 */
+		pr_err("btrfs_scrub: size assumption nodesize and sectorsize <= SCRUB_MAX_PAGES_PER_BLOCK (%d <= %d && %d <= %d) fails\n",
+		       fs_info->chunk_root->nodesize,
+		       SCRUB_MAX_PAGES_PER_BLOCK,
+		       fs_info->chunk_root->sectorsize,
+		       SCRUB_MAX_PAGES_PER_BLOCK);
+		return -EINVAL;
+	}
+
 	ret = scrub_workers_get(root);
 	if (ret)
 		return ret;
-- 
1.8.0

Stefan Behrens

2012-Nov-06 16:38 UTC

[PATCH 04/26] Btrfs: in scrub repair code, optimize the reading of mirrors

When disk blocks need to be repaired (rewritten), the current
code, for simplicity, first reads all alternate mirrors and then
selects the best one in a second step. This is now changed to read
one alternate mirror after the other and to leave the loop early
as soon as a perfect mirror is found.
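
The change in control flow can be sketched in plain C as follows.
This is only an illustration, not the code from the patch: demo_mirror,
demo_recheck, demo_reads_issued and demo_pick_good_mirror are
hypothetical names, and the recheck stand-in merely counts how many
mirrors were read so that the early exit becomes visible.

```c
#include <assert.h>

/* Hypothetical per-mirror state, modelled on the flags used in the diff. */
struct demo_mirror {
	int header_error;
	int checksum_error;
	int no_io_error_seen;
};

/* Number of mirror reads issued so far (for illustration only). */
static int demo_reads_issued;

/* Stand-in for scrub_recheck_block(): here it only counts the read. */
static void demo_recheck(struct demo_mirror *m)
{
	(void)m;
	demo_reads_issued++;
}

/*
 * Reads the alternate mirrors one after the other and stops at the
 * first one without header, checksum or I/O errors.
 * Returns the index of that mirror, or -1 if none is clean.
 */
static int demo_pick_good_mirror(struct demo_mirror *mirrors, int count,
				 int failed_index)
{
	int i;

	assert(count >= 0);
	for (i = 0; i < count; i++) {
		if (i == failed_index)
			continue;	/* the mirror being repaired is skipped */
		demo_recheck(&mirrors[i]);
		if (!mirrors[i].header_error &&
		    !mirrors[i].checksum_error &&
		    mirrors[i].no_io_error_seen)
			return i;	/* leave the loop early */
	}
	return -1;
}
```

With four mirrors where mirror 0 has failed and mirror 1 is clean,
this sketch issues a single read instead of three.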

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
---
 fs/btrfs/scrub.c | 35 ++++++++++++-----------------------
 1 file changed, 12 insertions(+), 23 deletions(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 15ac82a..7d38f40 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -819,26 +819,8 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
 
 	/*
 	 * now build and submit the bios for the other mirrors, check
-	 * checksums
-	 */
-	for (mirror_index = 0;
-	     mirror_index < BTRFS_MAX_MIRRORS &&
-	     sblocks_for_recheck[mirror_index].page_count > 0;
-	     mirror_index++) {
-		if (mirror_index == failed_mirror_index)
-			continue;
-
-		/* build and submit the bios, check checksums */
-		ret = scrub_recheck_block(fs_info,
-					  sblocks_for_recheck + mirror_index,
-					  is_metadata, have_csum, csum,
-					  generation, sctx->csum_size);
-		if (ret)
-			goto did_not_correct_error;
-	}
-
-	/*
-	 * first try to pick the mirror which is completely without I/O
+	 * checksums.
+	 * First try to pick the mirror which is completely without I/O
 	 * errors and also does not have a checksum error.
 	 * If one is found, and if a checksum is present, the full block
 	 * that is known to contain an error is rewritten. Afterwards
@@ -854,10 +836,17 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
 	     mirror_index < BTRFS_MAX_MIRRORS &&
 	     sblocks_for_recheck[mirror_index].page_count > 0;
 	     mirror_index++) {
-		struct scrub_block *sblock_other = sblocks_for_recheck +
-						   mirror_index;
+		struct scrub_block *sblock_other;
 
-		if (!sblock_other->header_error &&
+		if (mirror_index == failed_mirror_index)
+			continue;
+		sblock_other = sblocks_for_recheck + mirror_index;
+
+		/* build and submit the bios, check checksums */
+		ret = scrub_recheck_block(fs_info, sblock_other, is_metadata,
+					  have_csum, csum, generation,
+					  sctx->csum_size);
+		if (!ret && !sblock_other->header_error &&
 		    !sblock_other->checksum_error &&
 		    sblock_other->no_io_error_seen) {
 			int force_write = is_metadata || have_csum;
-- 
1.8.0

Stefan Behrens

2012-Nov-06 16:38 UTC

[PATCH 05/26] Btrfs: in scrub repair code, simplify alloc error handling

In the scrub repair code, memory allocation errors are now handled
a little more smartly: they are treated just like read errors.
This simplifies the code and removes a couple of lines, since the
code to handle read errors is there anyway.
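
A minimal userspace sketch of this idea follows. It is not the kernel
code: demo_page, demo_block, demo_bio_alloc, demo_fail_alloc_at and
demo_recheck_block are hypothetical names, and the allocator is a
stand-in that can be forced to fail for one page. The point is that
the function no longer returns an error. A failed allocation just
marks the page as having an I/O error and the loop moves on.

```c
#include <assert.h>
#include <stdlib.h>

#define DEMO_MAX_PAGES 4

/* Hypothetical, simplified stand-ins for scrub_page and scrub_block. */
struct demo_page {
	int io_error;
};

struct demo_block {
	struct demo_page pages[DEMO_MAX_PAGES];
	int page_count;
	int no_io_error_seen;
};

/* Page index for which the fake allocator fails; -1 means never. */
static int demo_fail_alloc_at = -1;

/* Stand-in for bio_alloc(); fails on demand to simulate memory pressure. */
static void *demo_bio_alloc(int page_num)
{
	if (page_num == demo_fail_alloc_at)
		return NULL;
	return malloc(1);
}

/* Returns nothing: every failure is recorded in the block itself. */
static void demo_recheck_block(struct demo_block *b)
{
	int i;

	assert(b->page_count <= DEMO_MAX_PAGES);
	b->no_io_error_seen = 1;
	for (i = 0; i < b->page_count; i++) {
		void *bio = demo_bio_alloc(i);

		b->pages[i].io_error = 0;
		if (!bio) {
			/* an allocation failure is treated like a read error */
			b->pages[i].io_error = 1;
			b->no_io_error_seen = 0;
			continue;
		}
		/* a real implementation would submit the read here */
		free(bio);
	}
}
```

Callers then only need to inspect the error flags of the block, which
they have to do for real read errors in any case.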

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
---
 fs/btrfs/scrub.c | 61 ++++++++++++++++++++++++--------------------------------
 1 file changed, 26 insertions(+), 35 deletions(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 7d38f40..fcd5bcc 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -151,10 +151,10 @@ static int scrub_setup_recheck_block(struct scrub_ctx *sctx,
 				     struct btrfs_mapping_tree *map_tree,
 				     u64 length, u64 logical,
 				     struct scrub_block *sblock);
-static int scrub_recheck_block(struct btrfs_fs_info *fs_info,
-			       struct scrub_block *sblock, int is_metadata,
-			       int have_csum, u8 *csum, u64 generation,
-			       u16 csum_size);
+static void scrub_recheck_block(struct btrfs_fs_info *fs_info,
+				struct scrub_block *sblock, int is_metadata,
+				int have_csum, u8 *csum, u64 generation,
+				u16 csum_size);
 static void scrub_recheck_block_checksum(struct btrfs_fs_info *fs_info,
 					 struct scrub_block *sblock,
 					 int is_metadata, int have_csum,
@@ -718,16 +718,8 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
 	sblock_bad = sblocks_for_recheck + failed_mirror_index;
 
 	/* build and submit the bios for the failed mirror, check checksums */
-	ret = scrub_recheck_block(fs_info, sblock_bad, is_metadata, have_csum,
-				  csum, generation, sctx->csum_size);
-	if (ret) {
-		spin_lock(&sctx->stat_lock);
-		sctx->stat.read_errors++;
-		sctx->stat.uncorrectable_errors++;
-		spin_unlock(&sctx->stat_lock);
-		btrfs_dev_stat_inc_and_print(dev, BTRFS_DEV_STAT_READ_ERRS);
-		goto out;
-	}
+	scrub_recheck_block(fs_info, sblock_bad, is_metadata, have_csum,
+			    csum, generation, sctx->csum_size);
 
 	if (!sblock_bad->header_error && !sblock_bad->checksum_error &&
 	    sblock_bad->no_io_error_seen) {
@@ -843,10 +835,11 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
 		sblock_other = sblocks_for_recheck + mirror_index;
 
 		/* build and submit the bios, check checksums */
-		ret = scrub_recheck_block(fs_info, sblock_other, is_metadata,
-					  have_csum, csum, generation,
-					  sctx->csum_size);
-		if (!ret && !sblock_other->header_error &&
+		scrub_recheck_block(fs_info, sblock_other, is_metadata,
+				    have_csum, csum, generation,
+				    sctx->csum_size);
+
+		if (!sblock_other->header_error &&
 		    !sblock_other->checksum_error &&
 		    sblock_other->no_io_error_seen) {
 			int force_write = is_metadata || have_csum;
@@ -931,10 +924,10 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
 			 * is verified, but most likely the data comes out
 			 * of the page cache.
 			 */
-			ret = scrub_recheck_block(fs_info, sblock_bad,
-						  is_metadata, have_csum, csum,
-						  generation, sctx->csum_size);
-			if (!ret && !sblock_bad->header_error &&
+			scrub_recheck_block(fs_info, sblock_bad,
+					    is_metadata, have_csum, csum,
+					    generation, sctx->csum_size);
+			if (!sblock_bad->header_error &&
 			    !sblock_bad->checksum_error &&
 			    sblock_bad->no_io_error_seen)
 				goto corrected_error;
@@ -1061,10 +1054,10 @@ leave_nomem:
  * to take those pages that are not errored from all the mirrors so that
  * the pages that are errored in the just handled mirror can be repaired.
  */
-static int scrub_recheck_block(struct btrfs_fs_info *fs_info,
-			       struct scrub_block *sblock, int is_metadata,
-			       int have_csum, u8 *csum, u64 generation,
-			       u16 csum_size)
+static void scrub_recheck_block(struct btrfs_fs_info *fs_info,
+				struct scrub_block *sblock, int is_metadata,
+				int have_csum, u8 *csum, u64 generation,
+				u16 csum_size)
 {
 	int page_num;
 
@@ -1074,7 +1067,6 @@ static int scrub_recheck_block(struct btrfs_fs_info *fs_info,
 
 	for (page_num = 0; page_num < sblock->page_count; page_num++) {
 		struct bio *bio;
-		int ret;
 		struct scrub_page *page = sblock->pagev[page_num];
 		DECLARE_COMPLETION_ONSTACK(complete);
 
@@ -1086,18 +1078,17 @@ static int scrub_recheck_block(struct btrfs_fs_info *fs_info,
 
 		WARN_ON(!page->page);
 		bio = bio_alloc(GFP_NOFS, 1);
-		if (!bio)
-			return -EIO;
+		if (!bio) {
+			page->io_error = 1;
+			sblock->no_io_error_seen = 0;
+			continue;
+		}
 		bio->bi_bdev = page->dev->bdev;
 		bio->bi_sector = page->physical >> 9;
 		bio->bi_end_io = scrub_complete_bio_end_io;
 		bio->bi_private = &complete;
 
-		ret = bio_add_page(bio, page->page, PAGE_SIZE, 0);
-		if (PAGE_SIZE != ret) {
-			bio_put(bio);
-			return -EIO;
-		}
+		bio_add_page(bio, page->page, PAGE_SIZE, 0);
 		btrfsic_submit_bio(READ, bio);
 
 		/* this will also unplug the queue */
@@ -1114,7 +1105,7 @@ static int scrub_recheck_block(struct btrfs_fs_info *fs_info,
 					     have_csum, csum, generation,
 					     csum_size);
 
-	return 0;
+	return;
 }
 
 static void scrub_recheck_block_checksum(struct btrfs_fs_info *fs_info,
-- 
1.8.0

Stefan Behrens

2012-Nov-06 16:38 UTC

[PATCH 06/26] Btrfs: cleanup scrub bio and worker wait code

Just move some code into functions to make everything more readable.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
---
 fs/btrfs/scrub.c | 106 +++++++++++++++++++++++++++++++++++++------------------
 1 file changed, 71 insertions(+), 35 deletions(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index fcd5bcc..a67b1a1 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -1,5 +1,5 @@
 /*
- * Copyright (C) 2011 STRATO.  All rights reserved.
+ * Copyright (C) 2011, 2012 STRATO.  All rights reserved.
  *
  * This program is free software; you can redistribute it and/or
  * modify it under the terms of the GNU General Public
@@ -104,8 +104,8 @@ struct scrub_ctx {
 	struct btrfs_root	*dev_root;
 	int			first_free;
 	int			curr;
-	atomic_t		in_flight;
-	atomic_t		fixup_cnt;
+	atomic_t		bios_in_flight;
+	atomic_t		workers_pending;
 	spinlock_t		list_lock;
 	wait_queue_head_t	list_wait;
 	u16			csum_size;
@@ -146,6 +146,10 @@ struct scrub_warning {
 };
 
 
+static void scrub_pending_bio_inc(struct scrub_ctx *sctx);
+static void scrub_pending_bio_dec(struct scrub_ctx *sctx);
+static void scrub_pending_trans_workers_inc(struct scrub_ctx *sctx);
+static void scrub_pending_trans_workers_dec(struct scrub_ctx *sctx);
 static int scrub_handle_errored_block(struct scrub_block *sblock_to_check);
 static int scrub_setup_recheck_block(struct scrub_ctx *sctx,
 				     struct btrfs_mapping_tree *map_tree,
@@ -184,6 +188,59 @@ static void scrub_bio_end_io_worker(struct btrfs_work *work);
 static void scrub_block_complete(struct scrub_block *sblock);
 
 
+static void scrub_pending_bio_inc(struct scrub_ctx *sctx)
+{
+	atomic_inc(&sctx->bios_in_flight);
+}
+
+static void scrub_pending_bio_dec(struct scrub_ctx *sctx)
+{
+	atomic_dec(&sctx->bios_in_flight);
+	wake_up(&sctx->list_wait);
+}
+
+/*
+ * used for workers that require transaction commits (i.e., for the
+ * NOCOW case)
+ */
+static void scrub_pending_trans_workers_inc(struct scrub_ctx *sctx)
+{
+	struct btrfs_fs_info *fs_info = sctx->dev_root->fs_info;
+
+	/*
+	 * increment scrubs_running to prevent cancel requests from
+	 * completing as long as a worker is running. we must also
+	 * increment scrubs_paused to prevent deadlocking on pause
+	 * requests used for transactions commits (as the worker uses a
+	 * transaction context). it is safe to regard the worker
+	 * as paused for all matters practical. effectively, we only
+	 * avoid cancellation requests from completing.
+	 */
+	mutex_lock(&fs_info->scrub_lock);
+	atomic_inc(&fs_info->scrubs_running);
+	atomic_inc(&fs_info->scrubs_paused);
+	mutex_unlock(&fs_info->scrub_lock);
+	atomic_inc(&sctx->workers_pending);
+}
+
+/* used for workers that require transaction commits */
+static void scrub_pending_trans_workers_dec(struct scrub_ctx *sctx)
+{
+	struct btrfs_fs_info *fs_info = sctx->dev_root->fs_info;
+
+	/*
+	 * see scrub_pending_trans_workers_inc() why we're pretending
+	 * to be paused in the scrub counters
+	 */
+	mutex_lock(&fs_info->scrub_lock);
+	atomic_dec(&fs_info->scrubs_running);
+	atomic_dec(&fs_info->scrubs_paused);
+	mutex_unlock(&fs_info->scrub_lock);
+	atomic_dec(&sctx->workers_pending);
+	wake_up(&fs_info->scrub_pause_wait);
+	wake_up(&sctx->list_wait);
+}
+
 static void scrub_free_csums(struct scrub_ctx *sctx)
 {
 	while (!list_empty(&sctx->csum_list)) {
@@ -264,8 +321,8 @@ struct scrub_ctx *scrub_setup_ctx(struct btrfs_device *dev)
 	sctx->nodesize = dev->dev_root->nodesize;
 	sctx->leafsize = dev->dev_root->leafsize;
 	sctx->sectorsize = dev->dev_root->sectorsize;
-	atomic_set(&sctx->in_flight, 0);
-	atomic_set(&sctx->fixup_cnt, 0);
+	atomic_set(&sctx->bios_in_flight, 0);
+	atomic_set(&sctx->workers_pending, 0);
 	atomic_set(&sctx->cancel_req, 0);
 	sctx->csum_size = btrfs_super_csum_size(fs_info->super_copy);
 	INIT_LIST_HEAD(&sctx->csum_list);
@@ -609,14 +666,7 @@ out:
 	btrfs_free_path(path);
 	kfree(fixup);
 
-	/* see caller why we're pretending to be paused in the scrub counters */
-	mutex_lock(&fs_info->scrub_lock);
-	atomic_dec(&fs_info->scrubs_running);
-	atomic_dec(&fs_info->scrubs_paused);
-	mutex_unlock(&fs_info->scrub_lock);
-	atomic_dec(&sctx->fixup_cnt);
-	wake_up(&fs_info->scrub_pause_wait);
-	wake_up(&sctx->list_wait);
+	scrub_pending_trans_workers_dec(sctx);
 }
 
 /*
@@ -789,20 +839,7 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
 		fixup_nodatasum->logical = logical;
 		fixup_nodatasum->root = fs_info->extent_root;
 		fixup_nodatasum->mirror_num = failed_mirror_index + 1;
-		/*
-		 * increment scrubs_running to prevent cancel requests from
-		 * completing as long as a fixup worker is running. we must also
-		 * increment scrubs_paused to prevent deadlocking on pause
-		 * requests used for transactions commits (as the worker uses a
-		 * transaction context). it is safe to regard the fixup worker
-		 * as paused for all matters practical. effectively, we only
-		 * avoid cancellation requests from completing.
-		 */
-		mutex_lock(&fs_info->scrub_lock);
-		atomic_inc(&fs_info->scrubs_running);
-		atomic_inc(&fs_info->scrubs_paused);
-		mutex_unlock(&fs_info->scrub_lock);
-		atomic_inc(&sctx->fixup_cnt);
+		scrub_pending_trans_workers_inc(sctx);
 		fixup_nodatasum->work.func = scrub_fixup_nodatasum;
 		btrfs_queue_worker(&fs_info->scrub_workers,
 				   &fixup_nodatasum->work);
@@ -1491,7 +1528,7 @@ static void scrub_submit(struct scrub_ctx *sctx)
 
 	sbio = sctx->bios[sctx->curr];
 	sctx->curr = -1;
-	atomic_inc(&sctx->in_flight);
+	scrub_pending_bio_inc(sctx);
 
 	btrfsic_submit_bio(READ, sbio->bio);
 }
@@ -1692,8 +1729,7 @@ static void scrub_bio_end_io_worker(struct btrfs_work *work)
 	sbio->next_free = sctx->first_free;
 	sctx->first_free = sbio->index;
 	spin_unlock(&sctx->list_lock);
-	atomic_dec(&sctx->in_flight);
-	wake_up(&sctx->list_wait);
+	scrub_pending_bio_dec(sctx);
 }
 
 static void scrub_block_complete(struct scrub_block *sblock)
@@ -1863,7 +1899,7 @@ static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx,
 	logical = base + offset;
 
 	wait_event(sctx->list_wait,
-		   atomic_read(&sctx->in_flight) == 0);
+		   atomic_read(&sctx->bios_in_flight) == 0);
 	atomic_inc(&fs_info->scrubs_paused);
 	wake_up(&fs_info->scrub_pause_wait);
 
@@ -1928,7 +1964,7 @@ static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx,
 			/* push queued extents */
 			scrub_submit(sctx);
 			wait_event(sctx->list_wait,
-				   atomic_read(&sctx->in_flight) == 0);
+				   atomic_read(&sctx->bios_in_flight) == 0);
 			atomic_inc(&fs_info->scrubs_paused);
 			wake_up(&fs_info->scrub_pause_wait);
 			mutex_lock(&fs_info->scrub_lock);
@@ -2218,7 +2254,7 @@ static noinline_for_stack int scrub_supers(struct scrub_ctx *sctx,
 		if (ret)
 			return ret;
 	}
-	wait_event(sctx->list_wait, atomic_read(&sctx->in_flight) == 0);
+	wait_event(sctx->list_wait, atomic_read(&sctx->bios_in_flight) == 0);
 
 	return 0;
 }
@@ -2363,11 +2399,11 @@ int btrfs_scrub_dev(struct btrfs_root *root, u64 devid, u64 start, u64 end,
 	if (!ret)
 		ret = scrub_enumerate_chunks(sctx, dev, start, end);
 
-	wait_event(sctx->list_wait, atomic_read(&sctx->in_flight) == 0);
+	wait_event(sctx->list_wait, atomic_read(&sctx->bios_in_flight) == 0);
 	atomic_dec(&fs_info->scrubs_running);
 	wake_up(&fs_info->scrub_pause_wait);
 
-	wait_event(sctx->list_wait, atomic_read(&sctx->fixup_cnt) == 0);
+	wait_event(sctx->list_wait, atomic_read(&sctx->workers_pending) == 0);
 
 	if (progress)
 		memcpy(progress, &sctx->stat, sizeof(*progress));
-- 
1.8.0
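
The counter handling that this patch factors into scrub_pending_trans_workers_inc() and scrub_pending_trans_workers_dec() can be modeled in user space. What follows is a minimal sketch, not kernel code: it uses C11 atomics, omits scrub_lock and the wait queues, and every model_* name is hypothetical. It also assumes that a pause request waits until scrubs_paused equals scrubs_running and that a cancel request waits until scrubs_running reaches zero; neither wait loop appears in this patch.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for the scrub counters in struct btrfs_fs_info. */
struct model_fs_info {
	atomic_int scrubs_running;
	atomic_int scrubs_paused;
};

/* Hypothetical stand-in for struct scrub_ctx. */
struct model_scrub_ctx {
	struct model_fs_info *fs_info;
	atomic_int workers_pending;
};

void model_init(struct model_scrub_ctx *sctx, struct model_fs_info *fs_info)
{
	atomic_init(&fs_info->scrubs_running, 0);
	atomic_init(&fs_info->scrubs_paused, 0);
	atomic_init(&sctx->workers_pending, 0);
	sctx->fs_info = fs_info;
}

/*
 * A worker that needs a transaction counts as running (so a cancel
 * request cannot complete) and as paused (so a pause request issued
 * for a transaction commit does not wait for it).
 */
void model_trans_worker_inc(struct model_scrub_ctx *sctx)
{
	atomic_fetch_add(&sctx->fs_info->scrubs_running, 1);
	atomic_fetch_add(&sctx->fs_info->scrubs_paused, 1);
	atomic_fetch_add(&sctx->workers_pending, 1);
}

void model_trans_worker_dec(struct model_scrub_ctx *sctx)
{
	atomic_fetch_sub(&sctx->fs_info->scrubs_running, 1);
	atomic_fetch_sub(&sctx->fs_info->scrubs_paused, 1);
	atomic_fetch_sub(&sctx->workers_pending, 1);
}

/* Assumed pause condition: every running scrub reports itself paused. */
int model_pause_satisfied(struct model_fs_info *fs_info)
{
	return atomic_load(&fs_info->scrubs_paused) ==
	       atomic_load(&fs_info->scrubs_running);
}

/* Assumed cancel condition: nothing is running any more. */
int model_cancel_complete(struct model_fs_info *fs_info)
{
	return atomic_load(&fs_info->scrubs_running) == 0;
}
```

While a worker is pending in this model, model_cancel_complete() returns 0 and model_pause_satisfied() still returns 1, which matches the intent stated in the patch comment: only cancellation is held back.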

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Stefan Behrens

2012-Nov-06 16:38 UTC

[PATCH 07/26] Btrfs: add two more find_device() methods

The new function btrfs_find_device_missing_or_by_path() will be
used by the device replace procedure. It calls the second new
function, btrfs_find_device_by_path().
Unfortunately, the rest of the code cannot currently be converted
to use these functions, since the existing functions that look
similar at first glance each behave slightly differently. New code
could benefit from the two new functions in the future; for now,
device replace uses them.
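
The selection logic of btrfs_find_device_missing_or_by_path() can be illustrated outside the kernel. This is a simplified sketch under stated assumptions: the model_device type and the model_* function are hypothetical, the device list is a plain array, and the by-path branch compares path strings, whereas the patch reads the superblock from the block device and looks the device up by devid, device UUID and fsid.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, reduced stand-in for struct btrfs_device. */
struct model_device {
	int in_fs_metadata;	/* the filesystem references this device */
	const void *bdev;	/* NULL when no block device is attached */
	const char *path;	/* used only by this sketch */
};

/*
 * Return the matching device, or NULL if there is none.
 * The keyword "missing" selects the first device that the filesystem
 * knows about but that has no block device attached.
 */
struct model_device *
model_find_missing_or_by_path(struct model_device *devs, size_t n,
			      const char *path)
{
	size_t i;

	if (strcmp(path, "missing") == 0) {
		for (i = 0; i < n; i++) {
			if (devs[i].in_fs_metadata && !devs[i].bdev)
				return &devs[i];
		}
		return NULL;
	}

	/* Simplification: string compare instead of a superblock lookup. */
	for (i = 0; i < n; i++) {
		if (devs[i].path && strcmp(devs[i].path, path) == 0)
			return &devs[i];
	}
	return NULL;
}
```

A NULL result in the sketch corresponds to the -ENOENT return in the patch.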

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
---
 fs/btrfs/volumes.c | 74 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/volumes.h |  5 ++++
 2 files changed, 79 insertions(+)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index eeed97d..bcd3097 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1512,6 +1512,80 @@ error_undo:
 	goto error_brelse;
 }
 
+int btrfs_find_device_by_path(struct btrfs_root *root, char *device_path,
+			      struct btrfs_device **device)
+{
+	int ret = 0;
+	struct btrfs_super_block *disk_super;
+	u64 devid;
+	u8 *dev_uuid;
+	struct block_device *bdev;
+	struct buffer_head *bh = NULL;
+
+	*device = NULL;
+	mutex_lock(&uuid_mutex);
+	bdev = blkdev_get_by_path(device_path, FMODE_READ,
+				  root->fs_info->bdev_holder);
+	if (IS_ERR(bdev)) {
+		ret = PTR_ERR(bdev);
+		bdev = NULL;
+		goto out;
+	}
+
+	set_blocksize(bdev, 4096);
+	invalidate_bdev(bdev);
+	bh = btrfs_read_dev_super(bdev);
+	if (!bh) {
+		ret = -EINVAL;
+		goto out;
+	}
+	disk_super = (struct btrfs_super_block *)bh->b_data;
+	devid = btrfs_stack_device_id(&disk_super->dev_item);
+	dev_uuid = disk_super->dev_item.uuid;
+	*device = btrfs_find_device(root, devid, dev_uuid,
+				    disk_super->fsid);
+	brelse(bh);
+	if (!*device)
+		ret = -ENOENT;
+out:
+	mutex_unlock(&uuid_mutex);
+	if (bdev)
+		blkdev_put(bdev, FMODE_READ);
+	return ret;
+}
+
+int btrfs_find_device_missing_or_by_path(struct btrfs_root *root,
+					 char *device_path,
+					 struct btrfs_device **device)
+{
+	*device = NULL;
+	if (strcmp(device_path, "missing") == 0) {
+		struct list_head *devices;
+		struct btrfs_device *tmp;
+
+		devices = &root->fs_info->fs_devices->devices;
+		/*
+		 * It is safe to read the devices since the volume_mutex
+		 * is held by the caller.
+		 */
+		list_for_each_entry(tmp, devices, dev_list) {
+			if (tmp->in_fs_metadata && !tmp->bdev) {
+				*device = tmp;
+				break;
+			}
+		}
+
+		if (!*device) {
+			pr_err("btrfs: no missing device found\n");
+			return -ENOENT;
+		}
+
+		return 0;
+	} else {
+		return btrfs_find_device_by_path(root, device_path, device);
+	}
+}
+
 /*
  * does all the dirty work required for changing file system's UUID.
  */
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 1789cda..657bb12 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -268,6 +268,11 @@ int btrfs_scan_one_device(const char *path, fmode_t flags, void *holder,
 			  struct btrfs_fs_devices **fs_devices_ret);
 int btrfs_close_devices(struct btrfs_fs_devices *fs_devices);
 void btrfs_close_extra_devices(struct btrfs_fs_devices *fs_devices);
+int btrfs_find_device_missing_or_by_path(struct btrfs_root *root,
+					 char *device_path,
+					 struct btrfs_device **device);
+int btrfs_find_device_by_path(struct btrfs_root *root, char *device_path,
+			      struct btrfs_device **device);
 int btrfs_add_device(struct btrfs_trans_handle *trans,
 		     struct btrfs_root *root,
 		     struct btrfs_device *device);
-- 
1.8.0

Stefan Behrens

2012-Nov-06 16:38 UTC

[PATCH 08/26] Btrfs: Pass fs_info to btrfs_num_copies() instead of mapping_tree

This is required for the device replace procedure in a later step.
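
The signature change follows a common pattern: the caller hands over the whole filesystem context and the callee picks out the member it needs. The sketch below is a user-space illustration only. The model_* types are hypothetical, and the replace_ongoing field and its effect on the result are invented for the example; they are not taken from this patch.

```c
#include <assert.h>

/* Hypothetical stand-in for struct btrfs_mapping_tree. */
struct model_mapping_tree {
	int stored_copies;
};

/* Hypothetical stand-in for struct btrfs_fs_info. */
struct model_fs_info {
	struct model_mapping_tree mapping_tree;
	int replace_ongoing;	/* invented field, for illustration only */
};

/*
 * New-style signature: takes the filesystem context and derives the
 * mapping tree itself, as btrfs_num_copies() does after this patch.
 */
int model_num_copies(struct model_fs_info *fs_info)
{
	struct model_mapping_tree *map_tree = &fs_info->mapping_tree;
	int ret = map_tree->stored_copies;

	/*
	 * With only the mapping tree as an argument, state stored
	 * elsewhere in fs_info would be out of reach at this point.
	 */
	if (fs_info->replace_ongoing)
		ret++;
	return ret;
}
```

Callers that previously passed &fs_info->mapping_tree pass fs_info instead, which is the mechanical change visible in the hunks below.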

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
---
 fs/btrfs/check-integrity.c | 12 ++++++------
 fs/btrfs/disk-io.c         |  2 +-
 fs/btrfs/extent_io.c       | 11 +++++------
 fs/btrfs/volumes.c         |  3 ++-
 fs/btrfs/volumes.h         |  2 +-
 5 files changed, 15 insertions(+), 15 deletions(-)

diff --git a/fs/btrfs/check-integrity.c b/fs/btrfs/check-integrity.c
index 5a3e45d..58dfac1 100644
--- a/fs/btrfs/check-integrity.c
+++ b/fs/btrfs/check-integrity.c
@@ -723,7 +723,7 @@ static int btrfsic_process_superblock(struct btrfsic_state *state,
 		}
 
 		num_copies =
-		    btrfs_num_copies(&state->root->fs_info->mapping_tree,
+		    btrfs_num_copies(state->root->fs_info,
 				     next_bytenr, state->metablock_size);
 		if (state->print_mask & BTRFSIC_PRINT_MASK_NUM_COPIES)
 			printk(KERN_INFO "num_copies(log_bytenr=%llu) = %d\n",
@@ -903,7 +903,7 @@ static int btrfsic_process_superblock_dev_mirror(
 		}
 
 		num_copies =
-		    btrfs_num_copies(&state->root->fs_info->mapping_tree,
+		    btrfs_num_copies(state->root->fs_info,
 				     next_bytenr, state->metablock_size);
 		if (state->print_mask & BTRFSIC_PRINT_MASK_NUM_COPIES)
 			printk(KERN_INFO "num_copies(log_bytenr=%llu) = %d\n",
@@ -1287,7 +1287,7 @@ static int btrfsic_create_link_to_next_block(
 	*next_blockp = NULL;
 	if (0 == *num_copiesp) {
 		*num_copiesp =
-		    btrfs_num_copies(&state->root->fs_info->mapping_tree,
+		    btrfs_num_copies(state->root->fs_info,
 				     next_bytenr, state->metablock_size);
 		if (state->print_mask & BTRFSIC_PRINT_MASK_NUM_COPIES)
 			printk(KERN_INFO "num_copies(log_bytenr=%llu) = %d\n",
@@ -1489,7 +1489,7 @@ static int btrfsic_handle_extent_data(
 			chunk_len = num_bytes;
 
 		num_copies =
-		    btrfs_num_copies(&state->root->fs_info->mapping_tree,
+		    btrfs_num_copies(state->root->fs_info,
 				     next_bytenr, state->datablock_size);
 		if (state->print_mask & BTRFSIC_PRINT_MASK_NUM_COPIES)
 			printk(KERN_INFO "num_copies(log_bytenr=%llu) = %d\n",
@@ -2463,7 +2463,7 @@ static int btrfsic_process_written_superblock(
 		}
 
 		num_copies =
-		    btrfs_num_copies(&state->root->fs_info->mapping_tree,
+		    btrfs_num_copies(state->root->fs_info,
 				     next_bytenr, BTRFS_SUPER_INFO_SIZE);
 		if (state->print_mask & BTRFSIC_PRINT_MASK_NUM_COPIES)
 			printk(KERN_INFO "num_copies(log_bytenr=%llu) = %d\n",
@@ -2960,7 +2960,7 @@ static void btrfsic_cmp_log_and_dev_bytenr(struct btrfsic_state *state,
 	struct btrfsic_block_data_ctx block_ctx;
 	int match = 0;
 
-	num_copies = btrfs_num_copies(&state->root->fs_info->mapping_tree,
+	num_copies = btrfs_num_copies(state->root->fs_info,
 				      bytenr, state->metablock_size);
 
 	for (mirror_num = 1; mirror_num <= num_copies; mirror_num++) {
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 0643159..f4b2733 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -387,7 +387,7 @@ static int btree_read_extent_buffer_pages(struct btrfs_root *root,
 		if (test_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags))
 			break;
 
-		num_copies = btrfs_num_copies(&root->fs_info->mapping_tree,
+		num_copies = btrfs_num_copies(root->fs_info,
 					      eb->start, eb->len);
 		if (num_copies == 1)
 			break;
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 472873a..a18f5c9 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2046,10 +2046,10 @@ static int clean_io_failure(u64 start, struct page *page)
 	spin_unlock(&BTRFS_I(inode)->io_tree.lock);
 
 	if (state && state->start == failrec->start) {
-		map_tree = &BTRFS_I(inode)->root->fs_info->mapping_tree;
-		num_copies = btrfs_num_copies(map_tree, failrec->logical,
-						failrec->len);
+		num_copies = btrfs_num_copies(BTRFS_I(inode)->root->fs_info,
+					      failrec->logical, failrec->len);
 		if (num_copies > 1)  {
+			map_tree = &BTRFS_I(inode)->root->fs_info->mapping_tree;
 			ret = repair_io_failure(map_tree, start, failrec->len,
 						failrec->logical, page,
 						failrec->failed_mirror);
@@ -2159,9 +2159,8 @@ static int bio_readpage_error(struct bio *failed_bio, struct page *page,
 		 * clean_io_failure() clean all those errors at once.
 		 */
 	}
-	num_copies = btrfs_num_copies(
-			      &BTRFS_I(inode)->root->fs_info->mapping_tree,
-			      failrec->logical, failrec->len);
+	num_copies = btrfs_num_copies(BTRFS_I(inode)->root->fs_info,
+				      failrec->logical, failrec->len);
 	if (num_copies == 1) {
 		/*
 		 * we only have a single copy of the data, so don't bother with
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index bcd3097..133582b 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -3790,8 +3790,9 @@ void btrfs_mapping_tree_free(struct btrfs_mapping_tree *tree)
 	}
 }
 
-int btrfs_num_copies(struct btrfs_mapping_tree *map_tree, u64 logical, u64 len)
+int btrfs_num_copies(struct btrfs_fs_info *fs_info, u64 logical, u64 len)
 {
+	struct btrfs_mapping_tree *map_tree = &fs_info->mapping_tree;
 	struct extent_map *em;
 	struct map_lookup *map;
 	struct extent_map_tree *em_tree = &map_tree->map_tree;
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 657bb12..35ea442 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -278,7 +278,7 @@ int btrfs_add_device(struct btrfs_trans_handle *trans,
 		     struct btrfs_device *device);
 int btrfs_rm_device(struct btrfs_root *root, char *device_path);
 void btrfs_cleanup_fs_uuids(void);
-int btrfs_num_copies(struct btrfs_mapping_tree *map_tree, u64 logical, u64 len);
+int btrfs_num_copies(struct btrfs_fs_info *fs_info, u64 logical, u64 len);
 int btrfs_grow_device(struct btrfs_trans_handle *trans,
 		      struct btrfs_device *device, u64 new_size);
 struct btrfs_device *btrfs_find_device(struct btrfs_root *root, u64 devid,
-- 
1.8.0

Stefan Behrens

2012-Nov-06 16:38 UTC

[PATCH 09/26] Btrfs: pass fs_info to btrfs_map_block() instead of mapping_tree

This is required for the device replace procedure in a later step.
Two calling functions also had to be changed to have the fs_info
pointer: repair_io_failure() and scrub_setup_recheck_block().

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
---
 fs/btrfs/check-integrity.c |  2 +-
 fs/btrfs/extent-tree.c     |  2 +-
 fs/btrfs/extent_io.c       | 19 +++++++++----------
 fs/btrfs/extent_io.h       |  4 ++--
 fs/btrfs/inode.c           | 12 +++++-------
 fs/btrfs/reada.c           |  3 +--
 fs/btrfs/scrub.c           | 14 +++++++-------
 fs/btrfs/volumes.c         | 11 +++++------
 fs/btrfs/volumes.h         |  2 +-
 9 files changed, 32 insertions(+), 37 deletions(-)

diff --git a/fs/btrfs/check-integrity.c b/fs/btrfs/check-integrity.c
index 58dfac1..8f9abed 100644
--- a/fs/btrfs/check-integrity.c
+++ b/fs/btrfs/check-integrity.c
@@ -1582,7 +1582,7 @@ static int btrfsic_map_block(struct btrfsic_state *state, u64 bytenr, u32 len,
 	struct btrfs_device *device;
 
 	length = len;
-	ret = btrfs_map_block(&state->root->fs_info->mapping_tree, READ,
+	ret = btrfs_map_block(state->root->fs_info, READ,
 			      bytenr, &length, &multi, mirror_num);
 
 	device = multi->stripes[0].dev;
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index b495cb4..4c94183 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -1818,7 +1818,7 @@ static int btrfs_discard_extent(struct btrfs_root *root, u64 bytenr,
 
 
 	/* Tell the block device(s) that the sectors can be discarded */
-	ret = btrfs_map_block(&root->fs_info->mapping_tree, REQ_DISCARD,
+	ret = btrfs_map_block(root->fs_info, REQ_DISCARD,
 			      bytenr, &num_bytes, &bbio, 0);
 	/* Error condition is -ENOMEM */
 	if (!ret) {
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index a18f5c9..82fab08 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1919,12 +1919,12 @@ static void repair_io_failure_callback(struct bio *bio, int err)
  * the standard behavior is to write all copies in a raid setup. here we only
  * want to write the one bad copy. so we do the mapping for ourselves and issue
  * submit_bio directly.
- * to avoid any synchonization issues, wait for the data after writing, which
+ * to avoid any synchronization issues, wait for the data after writing, which
  * actually prevents the read that triggered the error from finishing.
  * currently, there can be no more than two copies of every data bit. thus,
  * exactly one rewrite is required.
  */
-int repair_io_failure(struct btrfs_mapping_tree *map_tree, u64 start,
+int repair_io_failure(struct btrfs_fs_info *fs_info, u64 start,
 			u64 length, u64 logical, struct page *page,
 			int mirror_num)
 {
@@ -1946,7 +1946,7 @@ int repair_io_failure(struct btrfs_mapping_tree *map_tree, u64 start,
 	bio->bi_size = 0;
 	map_length = length;
 
-	ret = btrfs_map_block(map_tree, WRITE, logical,
+	ret = btrfs_map_block(fs_info, WRITE, logical,
 			      &map_length, &bbio, mirror_num);
 	if (ret) {
 		bio_put(bio);
@@ -1984,14 +1984,13 @@ int repair_io_failure(struct btrfs_mapping_tree *map_tree, u64 start,
 int repair_eb_io_failure(struct btrfs_root *root, struct extent_buffer *eb,
 			 int mirror_num)
 {
-	struct btrfs_mapping_tree *map_tree = &root->fs_info->mapping_tree;
 	u64 start = eb->start;
 	unsigned long i, num_pages = num_extent_pages(eb->start, eb->len);
 	int ret = 0;
 
 	for (i = 0; i < num_pages; i++) {
 		struct page *p = extent_buffer_page(eb, i);
-		ret = repair_io_failure(map_tree, start, PAGE_CACHE_SIZE,
+		ret = repair_io_failure(root->fs_info, start, PAGE_CACHE_SIZE,
 					start, p, mirror_num);
 		if (ret)
 			break;
@@ -2010,7 +2009,7 @@ static int clean_io_failure(u64 start, struct page *page)
 	u64 private;
 	u64 private_failure;
 	struct io_failure_record *failrec;
-	struct btrfs_mapping_tree *map_tree;
+	struct btrfs_fs_info *fs_info;
 	struct extent_state *state;
 	int num_copies;
 	int did_repair = 0;
@@ -2046,11 +2045,11 @@ static int clean_io_failure(u64 start, struct page *page)
 	spin_unlock(&BTRFS_I(inode)->io_tree.lock);
 
 	if (state && state->start == failrec->start) {
-		num_copies = btrfs_num_copies(BTRFS_I(inode)->root->fs_info,
-					      failrec->logical, failrec->len);
+		fs_info = BTRFS_I(inode)->root->fs_info;
+		num_copies = btrfs_num_copies(fs_info, failrec->logical,
+					      failrec->len);
 		if (num_copies > 1)  {
-			map_tree = &BTRFS_I(inode)->root->fs_info->mapping_tree;
-			ret = repair_io_failure(map_tree, start, failrec->len,
+			ret = repair_io_failure(fs_info, start, failrec->len,
 						failrec->logical, page,
 						failrec->failed_mirror);
 			did_repair = !ret;
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 711d12b..2eacfab 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -337,9 +337,9 @@ struct bio *
 btrfs_bio_alloc(struct block_device *bdev, u64 first_sector, int nr_vecs,
 		gfp_t gfp_flags);
 
-struct btrfs_mapping_tree;
+struct btrfs_fs_info;
 
-int repair_io_failure(struct btrfs_mapping_tree *map_tree, u64 start,
+int repair_io_failure(struct btrfs_fs_info *fs_info, u64 start,
 			u64 length, u64 logical, struct page *page,
 			int mirror_num);
 int end_extent_writepage(struct page *page, int err, u64 start, u64 end);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 3978bcb..5172a62 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1533,7 +1533,6 @@ int btrfs_merge_bio_hook(struct page *page, unsigned long offset,
 			 unsigned long bio_flags)
 {
 	struct btrfs_root *root = BTRFS_I(page->mapping->host)->root;
-	struct btrfs_mapping_tree *map_tree;
 	u64 logical = (u64)bio->bi_sector << 9;
 	u64 length = 0;
 	u64 map_length;
@@ -1543,11 +1542,10 @@ int btrfs_merge_bio_hook(struct page *page, unsigned long offset,
 		return 0;
 
 	length = bio->bi_size;
-	map_tree = &root->fs_info->mapping_tree;
 	map_length = length;
-	ret = btrfs_map_block(map_tree, READ, logical,
+	ret = btrfs_map_block(root->fs_info, READ, logical,
 			      &map_length, NULL, 0);
-	/* Will always return 0 or 1 with map_multi == NULL */
+	/* Will always return 0 with map_multi == NULL */
 	BUG_ON(ret < 0);
 	if (map_length < length + size)
 		return 1;
@@ -6977,7 +6975,6 @@ static int btrfs_submit_direct_hook(int rw, struct btrfs_dio_private *dip,
 {
 	struct inode *inode = dip->inode;
 	struct btrfs_root *root = BTRFS_I(inode)->root;
-	struct btrfs_mapping_tree *map_tree = &root->fs_info->mapping_tree;
 	struct bio *bio;
 	struct bio *orig_bio = dip->orig_bio;
 	struct bio_vec *bvec = orig_bio->bi_io_vec;
@@ -6990,7 +6987,7 @@ static int btrfs_submit_direct_hook(int rw, struct btrfs_dio_private *dip,
 	int async_submit = 0;
 
 	map_length = orig_bio->bi_size;
-	ret = btrfs_map_block(map_tree, READ, start_sector << 9,
+	ret = btrfs_map_block(root->fs_info, READ, start_sector << 9,
 			      &map_length, NULL, 0);
 	if (ret) {
 		bio_put(orig_bio);
@@ -7044,7 +7041,8 @@ static int btrfs_submit_direct_hook(int rw, struct btrfs_dio_private *dip,
 			bio->bi_end_io = btrfs_end_dio_bio;
 
 			map_length = orig_bio->bi_size;
-			ret = btrfs_map_block(map_tree, READ, start_sector << 9,
+			ret = btrfs_map_block(root->fs_info, READ,
+					      start_sector << 9,
 					      &map_length, NULL, 0);
 			if (ret) {
 				bio_put(bio);
diff --git a/fs/btrfs/reada.c b/fs/btrfs/reada.c
index a955669..0ddc565 100644
--- a/fs/btrfs/reada.c
+++ b/fs/btrfs/reada.c
@@ -323,7 +323,6 @@ static struct reada_extent *reada_find_extent(struct btrfs_root *root,
 	struct reada_extent *re = NULL;
 	struct reada_extent *re_exist = NULL;
 	struct btrfs_fs_info *fs_info = root->fs_info;
-	struct btrfs_mapping_tree *map_tree = &fs_info->mapping_tree;
 	struct btrfs_bio *bbio = NULL;
 	struct btrfs_device *dev;
 	struct btrfs_device *prev_dev;
@@ -358,7 +357,7 @@ static struct reada_extent *reada_find_extent(struct btrfs_root *root,
 	 * map block
 	 */
 	length = blocksize;
-	ret = btrfs_map_block(map_tree, REQ_WRITE, logical, &length, &bbio, 0);
+	ret = btrfs_map_block(fs_info, REQ_WRITE, logical, &length, &bbio, 0);
 	if (ret || !bbio || length < blocksize)
 		goto error;
 
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index a67b1a1..894bb27 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -152,7 +152,7 @@ static void scrub_pending_trans_workers_inc(struct scrub_ctx *sctx);
 static void scrub_pending_trans_workers_dec(struct scrub_ctx *sctx);
 static int scrub_handle_errored_block(struct scrub_block *sblock_to_check);
 static int scrub_setup_recheck_block(struct scrub_ctx *sctx,
-				     struct btrfs_mapping_tree *map_tree,
+				     struct btrfs_fs_info *fs_info,
 				     u64 length, u64 logical,
 				     struct scrub_block *sblock);
 static void scrub_recheck_block(struct btrfs_fs_info *fs_info,
@@ -523,7 +523,7 @@ static int scrub_fixup_readpage(u64 inum, u64 offset, u64 root, void *ctx)
 	}
 
 	if (PageUptodate(page)) {
-		struct btrfs_mapping_tree *map_tree;
+		struct btrfs_fs_info *fs_info;
 		if (PageDirty(page)) {
 			/*
 			 * we need to write the data to the defect sector. the
@@ -544,8 +544,8 @@ static int scrub_fixup_readpage(u64 inum, u64 offset, u64 root, void *ctx)
 			ret = -EIO;
 			goto out;
 		}
-		map_tree = &BTRFS_I(inode)->root->fs_info->mapping_tree;
-		ret = repair_io_failure(map_tree, offset, PAGE_SIZE,
+		fs_info = BTRFS_I(inode)->root->fs_info;
+		ret = repair_io_failure(fs_info, offset, PAGE_SIZE,
 					fixup->logical, page,
 					fixup->mirror_num);
 		unlock_page(page);
@@ -754,7 +754,7 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
 	}
 
 	/* setup the context, map the logical blocks and alloc the pages */
-	ret = scrub_setup_recheck_block(sctx, &fs_info->mapping_tree, length,
+	ret = scrub_setup_recheck_block(sctx, fs_info, length,
 					logical, sblocks_for_recheck);
 	if (ret) {
 		spin_lock(&sctx->stat_lock);
@@ -1012,7 +1012,7 @@ out:
 }
 
 static int scrub_setup_recheck_block(struct scrub_ctx *sctx,
-				     struct btrfs_mapping_tree *map_tree,
+				     struct btrfs_fs_info *fs_info,
 				     u64 length, u64 logical,
 				     struct scrub_block *sblocks_for_recheck)
 {
@@ -1036,7 +1036,7 @@ static int scrub_setup_recheck_block(struct scrub_ctx *sctx,
 		 * with a length of PAGE_SIZE, each returned stripe
 		 * represents one mirror
 		 */
-		ret = btrfs_map_block(map_tree, WRITE, logical, &mapped_length,
+		ret = btrfs_map_block(fs_info, WRITE, logical, &mapped_length,
 				      &bbio, 0);
 		if (ret || !bbio || mapped_length < sublen) {
 			kfree(bbio);
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 133582b..c79f8db 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -3831,13 +3831,14 @@ static int find_live_mirror(struct map_lookup *map, int first, int num,
 	return optimal;
 }
 
-static int __btrfs_map_block(struct btrfs_mapping_tree *map_tree, int rw,
+static int __btrfs_map_block(struct btrfs_fs_info *fs_info, int rw,
 			     u64 logical, u64 *length,
 			     struct btrfs_bio **bbio_ret,
 			     int mirror_num)
 {
 	struct extent_map *em;
 	struct map_lookup *map;
+	struct btrfs_mapping_tree *map_tree = &fs_info->mapping_tree;
 	struct extent_map_tree *em_tree = &map_tree->map_tree;
 	u64 offset;
 	u64 stripe_offset;
@@ -4066,11 +4067,11 @@ out:
 	return ret;
 }
 
-int btrfs_map_block(struct btrfs_mapping_tree *map_tree, int rw,
+int btrfs_map_block(struct btrfs_fs_info *fs_info, int rw,
 		      u64 logical, u64 *length,
 		      struct btrfs_bio **bbio_ret, int mirror_num)
 {
-	return __btrfs_map_block(map_tree, rw, logical, length, bbio_ret,
+	return __btrfs_map_block(fs_info, rw, logical, length, bbio_ret,
 				 mirror_num);
 }
 
@@ -4399,7 +4400,6 @@ static void bbio_error(struct btrfs_bio *bbio, struct bio *bio, u64 logical)
 int btrfs_map_bio(struct btrfs_root *root, int rw, struct bio *bio,
 		  int mirror_num, int async_submit)
 {
-	struct btrfs_mapping_tree *map_tree;
 	struct btrfs_device *dev;
 	struct bio *first_bio = bio;
 	u64 logical = (u64)bio->bi_sector << 9;
@@ -4411,10 +4411,9 @@ int btrfs_map_bio(struct btrfs_root *root, int rw, struct bio *bio,
 	struct btrfs_bio *bbio = NULL;
 
 	length = bio->bi_size;
-	map_tree = &root->fs_info->mapping_tree;
 	map_length = length;
 
-	ret = btrfs_map_block(map_tree, rw, logical, &map_length, &bbio,
+	ret = btrfs_map_block(root->fs_info, rw, logical, &map_length, &bbio,
 			      mirror_num);
 	if (ret) /* -ENOMEM */
 		return ret;
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 35ea442..ad5566d 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -248,7 +248,7 @@ int btrfs_alloc_dev_extent(struct btrfs_trans_handle *trans,
 			   struct btrfs_device *device,
 			   u64 chunk_tree, u64 chunk_objectid,
 			   u64 chunk_offset, u64 start, u64 num_bytes);
-int btrfs_map_block(struct btrfs_mapping_tree *map_tree, int rw,
+int btrfs_map_block(struct btrfs_fs_info *fs_info, int rw,
 		    u64 logical, u64 *length,
 		    struct btrfs_bio **bbio_ret, int mirror_num);
 int btrfs_rmap_block(struct btrfs_mapping_tree *map_tree,
-- 
1.8.0

Stefan Behrens

2012-Nov-06 16:38 UTC

[PATCH 10/26] Btrfs: add btrfs_scratch_superblock() function

This new function is used by the device replace procedure in
a later patch.
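
The effect of scratching a superblock can be shown with a small user-space model: once the magic field is zeroed, a signature check no longer recognizes the block, while the other fields stay untouched. This is a sketch under assumptions. The struct is a heavily reduced, hypothetical stand-in for struct btrfs_super_block, the model_* names are invented, and the buffer handling of the patch (reading the block, marking it dirty and syncing it) is left out.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Signature string used by this model to mark a valid superblock. */
#define MODEL_MAGIC "_BHRfS_M"

/* Hypothetical, reduced stand-in for struct btrfs_super_block. */
struct model_super_block {
	uint8_t csum[32];
	uint8_t fsid[16];
	uint64_t bytenr;
	uint64_t flags;
	uint8_t magic[8];
};

/* Non-zero if the block still carries the expected signature. */
int model_has_magic(const struct model_super_block *sb)
{
	return memcmp(sb->magic, MODEL_MAGIC, sizeof(sb->magic)) == 0;
}

/* Mirror of the memset() in the patch: wipe only the magic field. */
void model_scratch_superblock(struct model_super_block *sb)
{
	memset(sb->magic, 0, sizeof(sb->magic));
}
```

Because only the magic bytes are overwritten, the operation is cheap; the remaining contents of the block are left as they were.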

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
---
 fs/btrfs/volumes.c | 18 ++++++++++++++++++
 fs/btrfs/volumes.h |  1 +
 2 files changed, 19 insertions(+)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index c79f8db..bf5e68b 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5111,3 +5111,21 @@ int btrfs_get_dev_stats(struct btrfs_root *root,
 		stats->nr_items = BTRFS_DEV_STAT_VALUES_MAX;
 	return 0;
 }
+
+int btrfs_scratch_superblock(struct btrfs_device *device)
+{
+	struct buffer_head *bh;
+	struct btrfs_super_block *disk_super;
+
+	bh = btrfs_read_dev_super(device->bdev);
+	if (!bh)
+		return -EINVAL;
+	disk_super = (struct btrfs_super_block *)bh->b_data;
+
+	memset(&disk_super->magic, 0, sizeof(disk_super->magic));
+	set_buffer_dirty(bh);
+	sync_dirty_buffer(bh);
+	brelse(bh);
+
+	return 0;
+}
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index ad5566d..7eaaf4e 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -301,6 +301,7 @@ int btrfs_get_dev_stats(struct btrfs_root *root,
 int btrfs_init_dev_stats(struct btrfs_fs_info *fs_info);
 int btrfs_run_dev_stats(struct btrfs_trans_handle *trans,
 			struct btrfs_fs_info *fs_info);
+int btrfs_scratch_superblock(struct btrfs_device *device);
 
 static inline void btrfs_dev_stat_inc(struct btrfs_device *dev,
 				      int index)
-- 
1.8.0

Stefan Behrens

2012-Nov-06 16:38 UTC

[PATCH 11/26] Btrfs: pass fs_info instead of root

A small number of functions used by the device replace procedure
cannot pass the same root pointer when the operation is resumed at
mount time as they would in the regular (ioctl) context. Since
these functions need only the fs_info and not the root pointer,
the root pointer argument is replaced with an fs_info pointer
argument.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
---
 fs/btrfs/ctree.h   | 11 ++++----
 fs/btrfs/disk-io.c |  4 +--
 fs/btrfs/ioctl.c   |  8 +++---
 fs/btrfs/scrub.c   | 76 ++++++++++++++++++++++++------------------------------
 fs/btrfs/super.c   |  2 +-
 fs/btrfs/volumes.c | 23 +++++++++--------
 fs/btrfs/volumes.h |  2 +-
 7 files changed, 60 insertions(+), 66 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 95c1a8e..ef83538 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3540,15 +3540,16 @@ int btrfs_reloc_post_snapshot(struct btrfs_trans_handle *trans,
 			      struct btrfs_pending_snapshot *pending);
 
 /* scrub.c */
-int btrfs_scrub_dev(struct btrfs_root *root, u64 devid, u64 start, u64 end,
-		    struct btrfs_scrub_progress *progress, int readonly);
+int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start,
+		    u64 end, struct btrfs_scrub_progress *progress,
+		    int readonly);
 void btrfs_scrub_pause(struct btrfs_root *root);
 void btrfs_scrub_pause_super(struct btrfs_root *root);
 void btrfs_scrub_continue(struct btrfs_root *root);
 void btrfs_scrub_continue_super(struct btrfs_root *root);
-int __btrfs_scrub_cancel(struct btrfs_fs_info *info);
-int btrfs_scrub_cancel(struct btrfs_root *root);
-int btrfs_scrub_cancel_dev(struct btrfs_root *root, struct btrfs_device *dev);
+int btrfs_scrub_cancel(struct btrfs_fs_info *info);
+int btrfs_scrub_cancel_dev(struct btrfs_fs_info *info,
+			   struct btrfs_device *dev);
 int btrfs_scrub_cancel_devid(struct btrfs_root *root, u64 devid);
 int btrfs_scrub_progress(struct btrfs_root *root, u64 devid,
 			 struct btrfs_scrub_progress *progress);
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index f4b2733..53ab3cd 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3283,9 +3283,9 @@ int close_ctree(struct btrfs_root *root)
 	smp_mb();
 
 	/* pause restriper - we want to resume on mount */
-	btrfs_pause_balance(root->fs_info);
+	btrfs_pause_balance(fs_info);
 
-	btrfs_scrub_cancel(root);
+	btrfs_scrub_cancel(fs_info);
 
 	/* wait for any defraggers to finish */
 	wait_event(fs_info->transaction_wait,
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 1a9c5ee..7ece1fe 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -1340,7 +1340,7 @@ static noinline int btrfs_ioctl_resize(struct btrfs_root *root,
 		printk(KERN_INFO "btrfs: resizing devid %llu\n",
 		       (unsigned long long)devid);
 	}
-	device = btrfs_find_device(root, devid, NULL, NULL);
+	device = btrfs_find_device(root->fs_info, devid, NULL, NULL);
 	if (!device) {
 		printk(KERN_INFO "btrfs: resizer unable to find device %llu\n",
 		       (unsigned long long)devid);
@@ -2329,7 +2329,7 @@ static long btrfs_ioctl_dev_info(struct btrfs_root *root, void __user *arg)
 		s_uuid = di_args->uuid;
 
 	mutex_lock(&fs_devices->device_list_mutex);
-	dev = btrfs_find_device(root, di_args->devid, s_uuid, NULL);
+	dev = btrfs_find_device(root->fs_info, di_args->devid, s_uuid, NULL);
 	mutex_unlock(&fs_devices->device_list_mutex);
 
 	if (!dev) {
@@ -3086,7 +3086,7 @@ static long btrfs_ioctl_scrub(struct btrfs_root *root, void __user *arg)
 	if (IS_ERR(sa))
 		return PTR_ERR(sa);
 
-	ret = btrfs_scrub_dev(root, sa->devid, sa->start, sa->end,
+	ret = btrfs_scrub_dev(root->fs_info, sa->devid, sa->start, sa->end,
 			      &sa->progress, sa->flags & BTRFS_SCRUB_READONLY);
 
 	if (copy_to_user(arg, sa, sizeof(*sa)))
@@ -3101,7 +3101,7 @@ static long btrfs_ioctl_scrub_cancel(struct btrfs_root *root, void __user *arg)
 	if (!capable(CAP_SYS_ADMIN))
 		return -EPERM;
 
-	return btrfs_scrub_cancel(root);
+	return btrfs_scrub_cancel(root->fs_info);
 }
 
 static long btrfs_ioctl_scrub_progress(struct btrfs_root *root,
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 894bb27..6cf23f4 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -2262,9 +2262,8 @@ static noinline_for_stack int scrub_supers(struct scrub_ctx *sctx,
 /*
  * get a reference count on fs_info->scrub_workers. start worker if necessary
  */
-static noinline_for_stack int scrub_workers_get(struct btrfs_root *root)
+static noinline_for_stack int scrub_workers_get(struct btrfs_fs_info *fs_info)
 {
-	struct btrfs_fs_info *fs_info = root->fs_info;
 	int ret = 0;
 
 	mutex_lock(&fs_info->scrub_lock);
@@ -2283,10 +2282,8 @@ out:
 	return ret;
 }
 
-static noinline_for_stack void scrub_workers_put(struct btrfs_root *root)
+static noinline_for_stack void scrub_workers_put(struct btrfs_fs_info *fs_info)
 {
-	struct btrfs_fs_info *fs_info = root->fs_info;
-
 	mutex_lock(&fs_info->scrub_lock);
 	if (--fs_info->scrub_workers_refcnt == 0)
 		btrfs_stop_workers(&fs_info->scrub_workers);
@@ -2294,29 +2291,29 @@ static noinline_for_stack void scrub_workers_put(struct btrfs_root *root)
 	mutex_unlock(&fs_info->scrub_lock);
 }
 
-
-int btrfs_scrub_dev(struct btrfs_root *root, u64 devid, u64 start, u64 end,
-		    struct btrfs_scrub_progress *progress, int readonly)
+int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start,
+		    u64 end, struct btrfs_scrub_progress *progress,
+		    int readonly)
 {
 	struct scrub_ctx *sctx;
-	struct btrfs_fs_info *fs_info = root->fs_info;
 	int ret;
 	struct btrfs_device *dev;
 
-	if (btrfs_fs_closing(root->fs_info))
+	if (btrfs_fs_closing(fs_info))
 		return -EINVAL;
 
 	/*
 	 * check some assumptions
 	 */
-	if (root->nodesize != root->leafsize) {
+	if (fs_info->chunk_root->nodesize != fs_info->chunk_root->leafsize) {
 		printk(KERN_ERR
 		       "btrfs_scrub: size assumption nodesize == leafsize (%d == %d) fails\n",
-		       root->nodesize, root->leafsize);
+		       fs_info->chunk_root->nodesize,
+		       fs_info->chunk_root->leafsize);
 		return -EINVAL;
 	}
 
-	if (root->nodesize > BTRFS_STRIPE_LEN) {
+	if (fs_info->chunk_root->nodesize > BTRFS_STRIPE_LEN) {
 		/*
 		 * in this case scrub is unable to calculate the checksum
 		 * the way scrub is implemented. Do not handle this
@@ -2324,15 +2321,16 @@ int btrfs_scrub_dev(struct btrfs_root *root, u64 devid, u64 start, u64 end,
 		 */
 		printk(KERN_ERR
 		       "btrfs_scrub: size assumption nodesize <= BTRFS_STRIPE_LEN (%d <= %d) fails\n",
-		       root->nodesize, BTRFS_STRIPE_LEN);
+		       fs_info->chunk_root->nodesize, BTRFS_STRIPE_LEN);
 		return -EINVAL;
 	}
 
-	if (root->sectorsize != PAGE_SIZE) {
+	if (fs_info->chunk_root->sectorsize != PAGE_SIZE) {
 		/* not supported for data w/o checksums */
 		printk(KERN_ERR
 		       "btrfs_scrub: size assumption sectorsize != PAGE_SIZE (%d != %lld) fails\n",
-		       root->sectorsize, (unsigned long long)PAGE_SIZE);
+		       fs_info->chunk_root->sectorsize,
+		       (unsigned long long)PAGE_SIZE);
 		return -EINVAL;
 	}
 
@@ -2352,37 +2350,37 @@ int btrfs_scrub_dev(struct btrfs_root *root, u64 devid, u64 start, u64 end,
 		return -EINVAL;
 	}
 
-	ret = scrub_workers_get(root);
+	ret = scrub_workers_get(fs_info);
 	if (ret)
 		return ret;
 
-	mutex_lock(&root->fs_info->fs_devices->device_list_mutex);
-	dev = btrfs_find_device(root, devid, NULL, NULL);
+	mutex_lock(&fs_info->fs_devices->device_list_mutex);
+	dev = btrfs_find_device(fs_info, devid, NULL, NULL);
 	if (!dev || dev->missing) {
-		mutex_unlock(&root->fs_info->fs_devices->device_list_mutex);
-		scrub_workers_put(root);
+		mutex_unlock(&fs_info->fs_devices->device_list_mutex);
+		scrub_workers_put(fs_info);
 		return -ENODEV;
 	}
 	mutex_lock(&fs_info->scrub_lock);
 
 	if (!dev->in_fs_metadata) {
 		mutex_unlock(&fs_info->scrub_lock);
-		mutex_unlock(&root->fs_info->fs_devices->device_list_mutex);
-		scrub_workers_put(root);
-		return -ENODEV;
+		mutex_unlock(&fs_info->fs_devices->device_list_mutex);
+		scrub_workers_put(fs_info);
+		return -EIO;
 	}
 
 	if (dev->scrub_device) {
 		mutex_unlock(&fs_info->scrub_lock);
-		mutex_unlock(&root->fs_info->fs_devices->device_list_mutex);
-		scrub_workers_put(root);
+		mutex_unlock(&fs_info->fs_devices->device_list_mutex);
+		scrub_workers_put(fs_info);
 		return -EINPROGRESS;
 	}
 	sctx = scrub_setup_ctx(dev);
 	if (IS_ERR(sctx)) {
 		mutex_unlock(&fs_info->scrub_lock);
-		mutex_unlock(&root->fs_info->fs_devices->device_list_mutex);
-		scrub_workers_put(root);
+		mutex_unlock(&fs_info->fs_devices->device_list_mutex);
+		scrub_workers_put(fs_info);
 		return PTR_ERR(sctx);
 	}
 	sctx->readonly = readonly;
@@ -2390,7 +2388,7 @@ int btrfs_scrub_dev(struct btrfs_root *root, u64 devid, u64 start, u64 end,
 
 	atomic_inc(&fs_info->scrubs_running);
 	mutex_unlock(&fs_info->scrub_lock);
-	mutex_unlock(&root->fs_info->fs_devices->device_list_mutex);
+	mutex_unlock(&fs_info->fs_devices->device_list_mutex);
 
 	down_read(&fs_info->scrub_super_lock);
 	ret = scrub_supers(sctx, dev);
@@ -2413,7 +2411,7 @@ int btrfs_scrub_dev(struct btrfs_root *root, u64 devid, u64 start, u64 end,
 	mutex_unlock(&fs_info->scrub_lock);
 
 	scrub_free_ctx(sctx);
-	scrub_workers_put(root);
+	scrub_workers_put(fs_info);
 
 	return ret;
 }
@@ -2453,9 +2451,8 @@ void btrfs_scrub_continue_super(struct btrfs_root *root)
 	up_write(&root->fs_info->scrub_super_lock);
 }
 
-int __btrfs_scrub_cancel(struct btrfs_fs_info *fs_info)
+int btrfs_scrub_cancel(struct btrfs_fs_info *fs_info)
 {
-
 	mutex_lock(&fs_info->scrub_lock);
 	if (!atomic_read(&fs_info->scrubs_running)) {
 		mutex_unlock(&fs_info->scrub_lock);
@@ -2475,14 +2472,9 @@ int __btrfs_scrub_cancel(struct btrfs_fs_info *fs_info)
 	return 0;
 }
 
-int btrfs_scrub_cancel(struct btrfs_root *root)
+int btrfs_scrub_cancel_dev(struct btrfs_fs_info *fs_info,
+			   struct btrfs_device *dev)
 {
-	return __btrfs_scrub_cancel(root->fs_info);
-}
-
-int btrfs_scrub_cancel_dev(struct btrfs_root *root, struct btrfs_device *dev)
-{
-	struct btrfs_fs_info *fs_info = root->fs_info;
 	struct scrub_ctx *sctx;
 
 	mutex_lock(&fs_info->scrub_lock);
@@ -2514,12 +2506,12 @@ int btrfs_scrub_cancel_devid(struct btrfs_root *root, u64 devid)
 	 * does not go away in cancel_dev. FIXME: find a better solution
 	 */
 	mutex_lock(&fs_info->fs_devices->device_list_mutex);
-	dev = btrfs_find_device(root, devid, NULL, NULL);
+	dev = btrfs_find_device(fs_info, devid, NULL, NULL);
 	if (!dev) {
 		mutex_unlock(&fs_info->fs_devices->device_list_mutex);
 		return -ENODEV;
 	}
-	ret = btrfs_scrub_cancel_dev(root, dev);
+	ret = btrfs_scrub_cancel_dev(fs_info, dev);
 	mutex_unlock(&fs_info->fs_devices->device_list_mutex);
 
 	return ret;
@@ -2532,7 +2524,7 @@ int btrfs_scrub_progress(struct btrfs_root *root, u64 devid,
 	struct scrub_ctx *sctx = NULL;
 
 	mutex_lock(&root->fs_info->fs_devices->device_list_mutex);
-	dev = btrfs_find_device(root, devid, NULL, NULL);
+	dev = btrfs_find_device(root->fs_info, devid, NULL, NULL);
 	if (dev)
 		sctx = dev->scrub_device;
 	if (sctx)
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 8bba9d1..8b7cf54 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -116,7 +116,7 @@ static void btrfs_handle_error(struct btrfs_fs_info *fs_info)
 	if (fs_info->fs_state & BTRFS_SUPER_FLAG_ERROR) {
 		sb->s_flags |= MS_RDONLY;
 		printk(KERN_INFO "btrfs is forced readonly\n");
-		__btrfs_scrub_cancel(fs_info);
+		btrfs_scrub_cancel(fs_info);
 //		WARN_ON(1);
 	}
 }
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index bf5e68b..473ae89 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1388,7 +1388,7 @@ int btrfs_rm_device(struct btrfs_root *root, char *device_path)
 		disk_super = (struct btrfs_super_block *)bh->b_data;
 		devid = btrfs_stack_device_id(&disk_super->dev_item);
 		dev_uuid = disk_super->dev_item.uuid;
-		device = btrfs_find_device(root, devid, dev_uuid,
+		device = btrfs_find_device(root->fs_info, devid, dev_uuid,
 					   disk_super->fsid);
 		if (!device) {
 			ret = -ENOENT;
@@ -1425,7 +1425,7 @@ int btrfs_rm_device(struct btrfs_root *root, char *device_path)
 	spin_unlock(&root->fs_info->free_chunk_lock);
 
 	device->in_fs_metadata = 0;
-	btrfs_scrub_cancel_dev(root, device);
+	btrfs_scrub_cancel_dev(root->fs_info, device);
 
 	/*
 	 * the device list mutex makes sure that we don't change
@@ -1482,7 +1482,7 @@ int btrfs_rm_device(struct btrfs_root *root, char *device_path)
 	 * at this point, the device is zero sized.  We want to
 	 * remove it from the devices list and zero out the old super
 	 */
-	if (clear_super) {
+	if (clear_super && disk_super) {
 		/* make sure this device isn't detected as part of
 		 * the FS anymore
 		 */
@@ -1542,7 +1542,7 @@ int btrfs_find_device_by_path(struct btrfs_root *root, char *device_path,
 	disk_super = (struct btrfs_super_block *)bh->b_data;
 	devid = btrfs_stack_device_id(&disk_super->dev_item);
 	dev_uuid = disk_super->dev_item.uuid;
-	*device = btrfs_find_device(root, devid, dev_uuid,
+	*device = btrfs_find_device(root->fs_info, devid, dev_uuid,
 				    disk_super->fsid);
 	brelse(bh);
 	if (!*device)
@@ -1704,7 +1704,8 @@ next_slot:
 		read_extent_buffer(leaf, fs_uuid,
 				   (unsigned long)btrfs_device_fsid(dev_item),
 				   BTRFS_UUID_SIZE);
-		device = btrfs_find_device(root, devid, dev_uuid, fs_uuid);
+		device = btrfs_find_device(root->fs_info, devid, dev_uuid,
+					   fs_uuid);
 		BUG_ON(!device); /* Logic error */
 
 		if (device->fs_devices->seeding) {
@@ -4468,13 +4469,13 @@ int btrfs_map_bio(struct btrfs_root *root, int rw, struct bio *bio,
 	return 0;
 }
 
-struct btrfs_device *btrfs_find_device(struct btrfs_root *root, u64 devid,
+struct btrfs_device *btrfs_find_device(struct btrfs_fs_info *fs_info, u64 devid,
 				       u8 *uuid, u8 *fsid)
 {
 	struct btrfs_device *device;
 	struct btrfs_fs_devices *cur_devices;
 
-	cur_devices = root->fs_info->fs_devices;
+	cur_devices = fs_info->fs_devices;
 	while (cur_devices) {
 		if (!fsid ||
 		    !memcmp(cur_devices->fsid, fsid, BTRFS_UUID_SIZE)) {
@@ -4572,8 +4573,8 @@ static int read_one_chunk(struct btrfs_root *root, struct btrfs_key *key,
 		read_extent_buffer(leaf, uuid, (unsigned long)
 				   btrfs_stripe_dev_uuid_nr(chunk, i),
 				   BTRFS_UUID_SIZE);
-		map->stripes[i].dev = btrfs_find_device(root, devid, uuid,
-							NULL);
+		map->stripes[i].dev = btrfs_find_device(root->fs_info, devid,
+							uuid, NULL);
 		if (!map->stripes[i].dev && !btrfs_test_opt(root, DEGRADED)) {
 			kfree(map);
 			free_extent_map(em);
@@ -4691,7 +4692,7 @@ static int read_one_dev(struct btrfs_root *root,
 			return ret;
 	}
 
-	device = btrfs_find_device(root, devid, dev_uuid, fs_uuid);
+	device = btrfs_find_device(root->fs_info, devid, dev_uuid, fs_uuid);
 	if (!device || !device->bdev) {
 		if (!btrfs_test_opt(root, DEGRADED))
 			return -EIO;
@@ -5083,7 +5084,7 @@ int btrfs_get_dev_stats(struct btrfs_root *root,
 	int i;
 
 	mutex_lock(&fs_devices->device_list_mutex);
-	dev = btrfs_find_device(root, stats->devid, NULL, NULL);
+	dev = btrfs_find_device(root->fs_info, stats->devid, NULL, NULL);
 	mutex_unlock(&fs_devices->device_list_mutex);
 
 	if (!dev) {
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 7eaaf4e..802e2ba 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -281,7 +281,7 @@ void btrfs_cleanup_fs_uuids(void);
 int btrfs_num_copies(struct btrfs_fs_info *fs_info, u64 logical, u64 len);
 int btrfs_grow_device(struct btrfs_trans_handle *trans,
 		      struct btrfs_device *device, u64 new_size);
-struct btrfs_device *btrfs_find_device(struct btrfs_root *root, u64 devid,
+struct btrfs_device *btrfs_find_device(struct btrfs_fs_info *fs_info, u64 devid,
 				       u8 *uuid, u8 *fsid);
 int btrfs_shrink_device(struct btrfs_device *device, u64 new_size);
 int btrfs_init_new_device(struct btrfs_root *root, char *path);
-- 
1.8.0
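
The change above follows a general pattern: a helper that only ever dereferences root->fs_info can take the fs_info pointer directly, so callers that have no suitable root at hand (such as a resume at mount time) can still call it. The following is a minimal userspace sketch of that idea, not kernel code. All sketch_* types and functions are hypothetical, heavily reduced stand-ins that only imitate the shape of the kernel structures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, reduced stand-ins; these are not the kernel definitions. */
struct sketch_device {
	uint64_t devid;
	struct sketch_device *next;
};

struct sketch_fs_info {
	struct sketch_device *devices;	/* singly linked device list */
};

struct sketch_root {
	struct sketch_fs_info *fs_info;	/* every root points at the shared fs_info */
};

/*
 * Takes only the filesystem-wide state it actually needs, so it can be
 * called from contexts that have an fs_info but no particular root.
 */
static struct sketch_device *sketch_find_device(struct sketch_fs_info *fs_info,
						uint64_t devid)
{
	struct sketch_device *dev;

	assert(fs_info != NULL);
	for (dev = fs_info->devices; dev; dev = dev->next)
		if (dev->devid == devid)
			return dev;
	return NULL;
}

/* An ioctl-style caller that holds a root simply forwards root->fs_info. */
static struct sketch_device *sketch_lookup_from_root(struct sketch_root *root,
						     uint64_t devid)
{
	return sketch_find_device(root->fs_info, devid);
}
```

Callers that already hold a root pay one extra dereference at the call site, which is what most of the one-line hunks in this patch amount to.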

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Stefan Behrens

2012-Nov-06 16:38 UTC

[PATCH 12/26] Btrfs: avoid risk of a deadlock in btrfs_handle_error

Remove the attempt to cancel a running scrub or device replace
operation in btrfs_handle_error() because it adds the risk of
a deadlock. The only penalty of not canceling the operation is
that some I/O remains active until the procedure completes.
This is basically what already happens to other tasks that are
running in user mode context: they are neither affected nor
stopped in btrfs_handle_error() and just need to handle write
errors correctly.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
---
 fs/btrfs/super.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 8b7cf54..bdf1f5e 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -116,7 +116,16 @@ static void btrfs_handle_error(struct btrfs_fs_info *fs_info)
 	if (fs_info->fs_state & BTRFS_SUPER_FLAG_ERROR) {
 		sb->s_flags |= MS_RDONLY;
 		printk(KERN_INFO "btrfs is forced readonly\n");
-		btrfs_scrub_cancel(fs_info);
+		/*
+		 * Note that a running device replace operation is not
+		 * canceled here although there is no way to update
+		 * the progress. It would add the risk of a deadlock,
+		 * therefore the canceling is omitted. The only penalty
+		 * is that some I/O remains active until the procedure
+		 * completes. The next time when the filesystem is
+		 * mounted writeable again, the device replace
+		 * operation continues.
+		 */
 //		WARN_ON(1);
 	}
 }
-- 
1.8.0

Stefan Behrens

2012-Nov-06 16:38 UTC

[PATCH 13/26] Btrfs: enhance btrfs structures for device replace support

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
---
 fs/btrfs/ctree.h   | 39 +++++++++++++++++++++++++++++++++++++++
 fs/btrfs/disk-io.c |  5 +++++
 2 files changed, 44 insertions(+)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index ef83538..57961f1 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -885,6 +885,42 @@ struct btrfs_dev_stats_item {
 	__le64 values[BTRFS_DEV_STAT_VALUES_MAX];
 } __attribute__ ((__packed__));
 
+#define BTRFS_DEV_REPLACE_ITEM_CONT_READING_FROM_SRCDEV_MODE_ALWAYS	0
+#define BTRFS_DEV_REPLACE_ITEM_CONT_READING_FROM_SRCDEV_MODE_AVOID	1
+#define BTRFS_DEV_REPLACE_ITEM_STATE_NEVER_STARTED	0
+#define BTRFS_DEV_REPLACE_ITEM_STATE_STARTED		1
+#define BTRFS_DEV_REPLACE_ITEM_STATE_SUSPENDED		2
+#define BTRFS_DEV_REPLACE_ITEM_STATE_FINISHED		3
+#define BTRFS_DEV_REPLACE_ITEM_STATE_CANCELED		4
+
+struct btrfs_dev_replace {
+	u64 replace_state;	/* see #define above */
+	u64 time_started;	/* seconds since 1-Jan-1970 */
+	u64 time_stopped;	/* seconds since 1-Jan-1970 */
+	atomic64_t num_write_errors;
+	atomic64_t num_uncorrectable_read_errors;
+
+	u64 cursor_left;
+	u64 committed_cursor_left;
+	u64 cursor_left_last_write_of_item;
+	u64 cursor_right;
+
+	u64 cont_reading_from_srcdev_mode;	/* see #define above */
+
+	int is_valid;
+	int item_needs_writeback;
+	struct btrfs_device *srcdev;
+	struct btrfs_device *tgtdev;
+
+	pid_t lock_owner;
+	atomic_t nesting_level;
+	struct mutex lock_finishing_cancel_unmount;
+	struct mutex lock_management_lock;
+	struct mutex lock;
+
+	struct btrfs_scrub_progress scrub_progress;
+};
+
 /* different types of block groups (and chunks) */
 #define BTRFS_BLOCK_GROUP_DATA		(1ULL << 0)
 #define BTRFS_BLOCK_GROUP_SYSTEM	(1ULL << 1)
@@ -1471,6 +1507,9 @@ struct btrfs_fs_info {
 	int backup_root_index;
 
 	int num_tolerated_disk_barrier_failures;
+
+	/* device replace state */
+	struct btrfs_dev_replace dev_replace;
 };
 
 /*
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 53ab3cd..bdf6345 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2131,6 +2131,11 @@ int open_ctree(struct super_block *sb,
 	init_rwsem(&fs_info->extent_commit_sem);
 	init_rwsem(&fs_info->cleanup_work_sem);
 	init_rwsem(&fs_info->subvol_sem);
+	fs_info->dev_replace.lock_owner = 0;
+	atomic_set(&fs_info->dev_replace.nesting_level, 0);
+	mutex_init(&fs_info->dev_replace.lock_finishing_cancel_unmount);
+	mutex_init(&fs_info->dev_replace.lock_management_lock);
+	mutex_init(&fs_info->dev_replace.lock);
 
 	spin_lock_init(&fs_info->qgroup_lock);
 	fs_info->qgroup_tree = RB_ROOT;
-- 
1.8.0
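
The replace_state field introduced above takes one of five values. As a rough illustration, the sketch below mirrors those values in a userspace enum and adds two hypothetical helpers (sketch_replace_is_ongoing and sketch_replace_state_name are not part of the patch). Treating STARTED and SUSPENDED as the states in which an operation is still in flight is an assumption based on how the rest of the series handles them.

```c
#include <assert.h>
#include <stdint.h>

/* Values copied from the BTRFS_DEV_REPLACE_ITEM_STATE_* defines in the patch. */
enum sketch_replace_state {
	SKETCH_STATE_NEVER_STARTED = 0,
	SKETCH_STATE_STARTED = 1,
	SKETCH_STATE_SUSPENDED = 2,
	SKETCH_STATE_FINISHED = 3,
	SKETCH_STATE_CANCELED = 4,
};

/* Compile-time guard so the sketch cannot drift from the patch values. */
static_assert(SKETCH_STATE_CANCELED == 4, "state values must match the patch");

/* Hypothetical helper: is a replace operation still in flight? */
static int sketch_replace_is_ongoing(uint64_t state)
{
	return state == SKETCH_STATE_STARTED ||
	       state == SKETCH_STATE_SUSPENDED;
}

/* Hypothetical helper: human-readable name for logging. */
static const char *sketch_replace_state_name(uint64_t state)
{
	switch (state) {
	case SKETCH_STATE_NEVER_STARTED:
		return "never started";
	case SKETCH_STATE_STARTED:
		return "started";
	case SKETCH_STATE_SUSPENDED:
		return "suspended";
	case SKETCH_STATE_FINISHED:
		return "finished";
	case SKETCH_STATE_CANCELED:
		return "canceled";
	default:
		return "unknown";
	}
}
```

The state is stored as a 64-bit value rather than an enum in the patch, which is why the helpers take a uint64_t and handle unknown values explicitly.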

Stefan Behrens

2012-Nov-06 16:38 UTC

[PATCH 14/26] Btrfs: introduce a btrfs_dev_replace_item type

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
---
 fs/btrfs/ctree.h      | 66 +++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/print-tree.c |  3 +++
 2 files changed, 69 insertions(+)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 57961f1..4bffc63 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -921,6 +921,23 @@ struct btrfs_dev_replace {
 	struct btrfs_scrub_progress scrub_progress;
 };
 
+struct btrfs_dev_replace_item {
+	/*
+	 * grow this item struct at the end for future enhancements and keep
+	 * the existing values unchanged
+	 */
+	__le64 src_devid;
+	__le64 cursor_left;
+	__le64 cursor_right;
+	__le64 cont_reading_from_srcdev_mode;
+
+	__le64 replace_state;
+	__le64 time_started;
+	__le64 time_stopped;
+	__le64 num_write_errors;
+	__le64 num_uncorrectable_read_errors;
+} __attribute__ ((__packed__));
+
 /* different types of block groups (and chunks) */
 #define BTRFS_BLOCK_GROUP_DATA		(1ULL << 0)
 #define BTRFS_BLOCK_GROUP_SYSTEM	(1ULL << 1)
@@ -1763,6 +1780,12 @@ struct btrfs_ioctl_defrag_range_args {
 #define BTRFS_DEV_STATS_KEY	249
 
 /*
+ * Persistently stores the device replace state in the device tree.
+ * The key is built like this: (0, BTRFS_DEV_REPLACE_KEY, 0).
+ */
+#define BTRFS_DEV_REPLACE_KEY	250
+
+/*
  * string items are for debugging.  They just store a short string of
  * data in the FS
  */
@@ -2796,6 +2819,49 @@ BTRFS_SETGET_FUNCS(qgroup_limit_rsv_rfer, struct btrfs_qgroup_limit_item,
 BTRFS_SETGET_FUNCS(qgroup_limit_rsv_excl, struct btrfs_qgroup_limit_item,
 		   rsv_excl, 64);
 
+/* btrfs_dev_replace_item */
+BTRFS_SETGET_FUNCS(dev_replace_src_devid,
+		   struct btrfs_dev_replace_item, src_devid, 64);
+BTRFS_SETGET_FUNCS(dev_replace_cont_reading_from_srcdev_mode,
+		   struct btrfs_dev_replace_item, cont_reading_from_srcdev_mode,
+		   64);
+BTRFS_SETGET_FUNCS(dev_replace_replace_state, struct btrfs_dev_replace_item,
+		   replace_state, 64);
+BTRFS_SETGET_FUNCS(dev_replace_time_started, struct btrfs_dev_replace_item,
+		   time_started, 64);
+BTRFS_SETGET_FUNCS(dev_replace_time_stopped, struct btrfs_dev_replace_item,
+		   time_stopped, 64);
+BTRFS_SETGET_FUNCS(dev_replace_num_write_errors, struct btrfs_dev_replace_item,
+		   num_write_errors, 64);
+BTRFS_SETGET_FUNCS(dev_replace_num_uncorrectable_read_errors,
+		   struct btrfs_dev_replace_item, num_uncorrectable_read_errors,
+		   64);
+BTRFS_SETGET_FUNCS(dev_replace_cursor_left, struct btrfs_dev_replace_item,
+		   cursor_left, 64);
+BTRFS_SETGET_FUNCS(dev_replace_cursor_right, struct btrfs_dev_replace_item,
+		   cursor_right, 64);
+
+BTRFS_SETGET_STACK_FUNCS(stack_dev_replace_src_devid,
+			 struct btrfs_dev_replace_item, src_devid, 64);
+BTRFS_SETGET_STACK_FUNCS(stack_dev_replace_cont_reading_from_srcdev_mode,
+			 struct btrfs_dev_replace_item,
+			 cont_reading_from_srcdev_mode, 64);
+BTRFS_SETGET_STACK_FUNCS(stack_dev_replace_replace_state,
+			 struct btrfs_dev_replace_item, replace_state, 64);
+BTRFS_SETGET_STACK_FUNCS(stack_dev_replace_time_started,
+			 struct btrfs_dev_replace_item, time_started, 64);
+BTRFS_SETGET_STACK_FUNCS(stack_dev_replace_time_stopped,
+			 struct btrfs_dev_replace_item, time_stopped, 64);
+BTRFS_SETGET_STACK_FUNCS(stack_dev_replace_num_write_errors,
+			 struct btrfs_dev_replace_item, num_write_errors, 64);
+BTRFS_SETGET_STACK_FUNCS(stack_dev_replace_num_uncorrectable_read_errors,
+			 struct btrfs_dev_replace_item,
+			 num_uncorrectable_read_errors, 64);
+BTRFS_SETGET_STACK_FUNCS(stack_dev_replace_cursor_left,
+			 struct btrfs_dev_replace_item, cursor_left, 64);
+BTRFS_SETGET_STACK_FUNCS(stack_dev_replace_cursor_right,
+			 struct btrfs_dev_replace_item, cursor_right, 64);
+
 static inline struct btrfs_fs_info *btrfs_sb(struct super_block *sb)
 {
 	return sb->s_fs_info;
diff --git a/fs/btrfs/print-tree.c b/fs/btrfs/print-tree.c
index 5e23684..50d95fd 100644
--- a/fs/btrfs/print-tree.c
+++ b/fs/btrfs/print-tree.c
@@ -297,6 +297,9 @@ void btrfs_print_leaf(struct btrfs_root *root, struct extent_buffer *l)
 		case BTRFS_DEV_STATS_KEY:
 			printk(KERN_INFO "\t\tdevice stats\n");
 			break;
+		case BTRFS_DEV_REPLACE_KEY:
+			printk(KERN_INFO "\t\tdev replace\n");
+			break;
 		};
 	}
 }
-- 
1.8.0
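
The item added above is a packed sequence of nine little-endian 64-bit fields, and the BTRFS_SETGET_* macros generate one getter and one setter per field. The sketch below shows the same idea in plain userspace C: each field is read and written byte by byte so that the result does not depend on host endianness. It is only an illustration; the sketch_* names are hypothetical and the kernel macros work on extent buffers rather than on a flat byte array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Field order copied from struct btrfs_dev_replace_item in the patch. */
enum sketch_field {
	SKETCH_SRC_DEVID,
	SKETCH_CURSOR_LEFT,
	SKETCH_CURSOR_RIGHT,
	SKETCH_CONT_READING_FROM_SRCDEV_MODE,
	SKETCH_REPLACE_STATE,
	SKETCH_TIME_STARTED,
	SKETCH_TIME_STOPPED,
	SKETCH_NUM_WRITE_ERRORS,
	SKETCH_NUM_UNCORRECTABLE_READ_ERRORS,
	SKETCH_FIELD_COUNT
};

/* Nine packed 64-bit fields, so 72 bytes on disk. */
#define SKETCH_ITEM_SIZE (SKETCH_FIELD_COUNT * sizeof(uint64_t))

/* Store v into the given field, least significant byte first. */
static void sketch_set_le64(uint8_t *item, enum sketch_field f, uint64_t v)
{
	uint8_t *p = item + (size_t)f * sizeof(uint64_t);

	assert(f < SKETCH_FIELD_COUNT);
	for (int i = 0; i < 8; i++)
		p[i] = (uint8_t)(v >> (8 * i));
}

/* Load the given field, assembling it from little-endian bytes. */
static uint64_t sketch_get_le64(const uint8_t *item, enum sketch_field f)
{
	const uint8_t *p = item + (size_t)f * sizeof(uint64_t);
	uint64_t v = 0;

	assert(f < SKETCH_FIELD_COUNT);
	for (int i = 0; i < 8; i++)
		v |= (uint64_t)p[i] << (8 * i);
	return v;
}

/* An item read back from disk is only trusted if its size matches exactly. */
static int sketch_item_size_ok(size_t item_size)
{
	return item_size == SKETCH_ITEM_SIZE;
}
```

Because new fields may only be appended at the end of the item, as the comment in the struct requires, the offsets of the existing fields stay stable across format revisions.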

Stefan Behrens

2012-Nov-06 16:38 UTC

[PATCH 15/26] Btrfs: add a new source file with device replace code

This adds a new file to the sources together with the header file
and the changes to ioctl.h that are required by the new C source
file.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
---
 fs/btrfs/dev-replace.c | 843 +++++++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/dev-replace.h |  44 +++
 fs/btrfs/ioctl.h       |  45 +++
 3 files changed, 932 insertions(+)

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
new file mode 100644
index 0000000..1d56163
--- /dev/null
+++ b/fs/btrfs/dev-replace.c
@@ -0,0 +1,843 @@
+/*
+ * Copyright (C) STRATO AG 2012.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ */
+#include <linux/sched.h>
+#include <linux/bio.h>
+#include <linux/slab.h>
+#include <linux/buffer_head.h>
+#include <linux/blkdev.h>
+#include <linux/random.h>
+#include <linux/iocontext.h>
+#include <linux/capability.h>
+#include <linux/kthread.h>
+#include <linux/math64.h>
+#include <asm/div64.h>
+#include "compat.h"
+#include "ctree.h"
+#include "extent_map.h"
+#include "disk-io.h"
+#include "transaction.h"
+#include "print-tree.h"
+#include "volumes.h"
+#include "async-thread.h"
+#include "check-integrity.h"
+#include "rcu-string.h"
+#include "dev-replace.h"
+
+static u64 btrfs_get_seconds_since_1970(void);
+static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
+				       int scrub_ret);
+static void btrfs_dev_replace_update_device_in_mapping_tree(
+						struct btrfs_fs_info *fs_info,
+						struct btrfs_device *srcdev,
+						struct btrfs_device *tgtdev);
+static int btrfs_dev_replace_find_srcdev(struct btrfs_root *root, u64 srcdevid,
+					 char *srcdev_name,
+					 struct btrfs_device **device);
+static u64 __btrfs_dev_replace_cancel(struct btrfs_fs_info *fs_info);
+static int btrfs_dev_replace_kthread(void *data);
+static int btrfs_dev_replace_continue_on_mount(struct btrfs_fs_info *fs_info);
+
+
+int btrfs_init_dev_replace(struct btrfs_fs_info *fs_info)
+{
+	struct btrfs_key key;
+	struct btrfs_root *dev_root = fs_info->dev_root;
+	struct btrfs_dev_replace *dev_replace = &fs_info->dev_replace;
+	struct extent_buffer *eb;
+	int slot;
+	int ret = 0;
+	struct btrfs_path *path = NULL;
+	int item_size;
+	struct btrfs_dev_replace_item *ptr;
+	u64 src_devid;
+
+	path = btrfs_alloc_path();
+	if (!path) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	key.objectid = 0;
+	key.type = BTRFS_DEV_REPLACE_KEY;
+	key.offset = 0;
+	ret = btrfs_search_slot(NULL, dev_root, &key, path, 0, 0);
+	if (ret) {
+no_valid_dev_replace_entry_found:
+		ret = 0;
+		dev_replace->replace_state =
+			BTRFS_DEV_REPLACE_ITEM_STATE_NEVER_STARTED;
+		dev_replace->cont_reading_from_srcdev_mode =
+		    BTRFS_DEV_REPLACE_ITEM_CONT_READING_FROM_SRCDEV_MODE_ALWAYS;
+		dev_replace->replace_state = 0;
+		dev_replace->time_started = 0;
+		dev_replace->time_stopped = 0;
+		atomic64_set(&dev_replace->num_write_errors, 0);
+		atomic64_set(&dev_replace->num_uncorrectable_read_errors, 0);
+		dev_replace->cursor_left = 0;
+		dev_replace->committed_cursor_left = 0;
+		dev_replace->cursor_left_last_write_of_item = 0;
+		dev_replace->cursor_right = 0;
+		dev_replace->srcdev = NULL;
+		dev_replace->tgtdev = NULL;
+		dev_replace->is_valid = 0;
+		dev_replace->item_needs_writeback = 0;
+		goto out;
+	}
+	slot = path->slots[0];
+	eb = path->nodes[0];
+	item_size = btrfs_item_size_nr(eb, slot);
+	ptr = btrfs_item_ptr(eb, slot, struct btrfs_dev_replace_item);
+
+	if (item_size != sizeof(struct btrfs_dev_replace_item)) {
+		pr_warn("btrfs: dev_replace entry found has unexpected size, ignore entry\n");
+		goto no_valid_dev_replace_entry_found;
+	}
+
+	src_devid = btrfs_dev_replace_src_devid(eb, ptr);
+	dev_replace->cont_reading_from_srcdev_mode =
+		btrfs_dev_replace_cont_reading_from_srcdev_mode(eb, ptr);
+	dev_replace->replace_state = btrfs_dev_replace_replace_state(eb, ptr);
+	dev_replace->time_started = btrfs_dev_replace_time_started(eb, ptr);
+	dev_replace->time_stopped =
+		btrfs_dev_replace_time_stopped(eb, ptr);
+	atomic64_set(&dev_replace->num_write_errors,
+		     btrfs_dev_replace_num_write_errors(eb, ptr));
+	atomic64_set(&dev_replace->num_uncorrectable_read_errors,
+		     btrfs_dev_replace_num_uncorrectable_read_errors(eb, ptr));
+	dev_replace->cursor_left = btrfs_dev_replace_cursor_left(eb, ptr);
+	dev_replace->committed_cursor_left = dev_replace->cursor_left;
+	dev_replace->cursor_left_last_write_of_item = dev_replace->cursor_left;
+	dev_replace->cursor_right = btrfs_dev_replace_cursor_right(eb, ptr);
+	dev_replace->is_valid = 1;
+
+	dev_replace->item_needs_writeback = 0;
+	switch (dev_replace->replace_state) {
+	case BTRFS_IOCTL_DEV_REPLACE_STATE_NEVER_STARTED:
+	case BTRFS_IOCTL_DEV_REPLACE_STATE_FINISHED:
+	case BTRFS_IOCTL_DEV_REPLACE_STATE_CANCELED:
+		dev_replace->srcdev = NULL;
+		dev_replace->tgtdev = NULL;
+		break;
+	case BTRFS_IOCTL_DEV_REPLACE_STATE_STARTED:
+	case BTRFS_IOCTL_DEV_REPLACE_STATE_SUSPENDED:
+		dev_replace->srcdev = btrfs_find_device(fs_info, src_devid,
+							NULL, NULL);
+		dev_replace->tgtdev = btrfs_find_device(fs_info,
+							BTRFS_DEV_REPLACE_DEVID,
+							NULL, NULL);
+		/*
+		 * allow 'btrfs dev replace_cancel' if src/tgt device is
+		 * missing
+		 */
+		if (!dev_replace->srcdev &&
+		    !btrfs_test_opt(dev_root, DEGRADED)) {
+			ret = -EIO;
+			pr_warn("btrfs: cannot mount because device replace operation is ongoing and\n" "srcdev (devid %llu) is missing, need to run 'btrfs dev scan'?\n",
+				(unsigned long long)src_devid);
+		}
+		if (!dev_replace->tgtdev &&
+		    !btrfs_test_opt(dev_root, DEGRADED)) {
+			ret = -EIO;
+			pr_warn("btrfs: cannot mount because device replace operation is ongoing and\n" "tgtdev (devid %llu) is missing, need to run btrfs dev scan?\n",
+				(unsigned long long)BTRFS_DEV_REPLACE_DEVID);
+		}
+		if (dev_replace->tgtdev) {
+			if (dev_replace->srcdev) {
+				dev_replace->tgtdev->total_bytes =
+					dev_replace->srcdev->total_bytes;
+				dev_replace->tgtdev->disk_total_bytes =
+					dev_replace->srcdev->disk_total_bytes;
+				dev_replace->tgtdev->bytes_used =
+					dev_replace->srcdev->bytes_used;
+			}
+			dev_replace->tgtdev->is_tgtdev_for_dev_replace = 1;
+			btrfs_init_dev_replace_tgtdev_for_resume(fs_info,
+				dev_replace->tgtdev);
+		}
+		break;
+	}
+
+out:
+	if (path) {
+		btrfs_release_path(path);
+		btrfs_free_path(path);
+	}
+	return ret;
+}
+
+/*
+ * called from commit_transaction. Writes changed device replace state to
+ * disk.
+ */
+int btrfs_run_dev_replace(struct btrfs_trans_handle *trans,
+			  struct btrfs_fs_info *fs_info)
+{
+	int ret;
+	struct btrfs_root *dev_root = fs_info->dev_root;
+	struct btrfs_path *path;
+	struct btrfs_key key;
+	struct extent_buffer *eb;
+	struct btrfs_dev_replace_item *ptr;
+	struct btrfs_dev_replace *dev_replace = &fs_info->dev_replace;
+
+	btrfs_dev_replace_lock(dev_replace);
+	if (!dev_replace->is_valid ||
+	    !dev_replace->item_needs_writeback) {
+		btrfs_dev_replace_unlock(dev_replace);
+		return 0;
+	}
+	btrfs_dev_replace_unlock(dev_replace);
+
+	key.objectid = 0;
+	key.type = BTRFS_DEV_REPLACE_KEY;
+	key.offset = 0;
+
+	path = btrfs_alloc_path();
+	if (!path) {
+		ret = -ENOMEM;
+		goto out;
+	}
+	ret = btrfs_search_slot(trans, dev_root, &key, path, -1, 1);
+	if (ret < 0) {
+		pr_warn("btrfs: error %d while searching for dev_replace item!\n",
+			ret);
+		goto out;
+	}
+
+	if (ret == 0 &&
+	    btrfs_item_size_nr(path->nodes[0], path->slots[0]) < sizeof(*ptr)) {
+		/*
+		 * need to delete old one and insert a new one.
+		 * Since no attempt is made to recover any old state, if the
+		 * dev_replace state is 'running', the data on the target
+		 * drive is lost.
+		 * It would be possible to recover the state: just make sure
+		 * that the beginning of the item is never changed and always
+		 * contains all the essential information. Then read this
+		 * minimal set of information and use it as a base for the
+		 * new state.
+		 */
+		ret = btrfs_del_item(trans, dev_root, path);
+		if (ret != 0) {
+			pr_warn("btrfs: delete too small dev_replace item failed %d!\n",
+				ret);
+			goto out;
+		}
+		ret = 1;
+	}
+
+	if (ret == 1) {
+		/* need to insert a new item */
+		btrfs_release_path(path);
+		ret = btrfs_insert_empty_item(trans, dev_root, path,
+					      &key, sizeof(*ptr));
+		if (ret < 0) {
+			pr_warn("btrfs: insert dev_replace item failed %d!\n",
+				ret);
+			goto out;
+		}
+	}
+
+	eb = path->nodes[0];
+	ptr = btrfs_item_ptr(eb, path->slots[0],
+			     struct btrfs_dev_replace_item);
+
+	btrfs_dev_replace_lock(dev_replace);
+	if (dev_replace->srcdev)
+		btrfs_set_dev_replace_src_devid(eb, ptr,
+			dev_replace->srcdev->devid);
+	else
+		btrfs_set_dev_replace_src_devid(eb, ptr, (u64)-1);
+	btrfs_set_dev_replace_cont_reading_from_srcdev_mode(eb, ptr,
+		dev_replace->cont_reading_from_srcdev_mode);
+	btrfs_set_dev_replace_replace_state(eb, ptr,
+		dev_replace->replace_state);
+	btrfs_set_dev_replace_time_started(eb, ptr, dev_replace->time_started);
+	btrfs_set_dev_replace_time_stopped(eb, ptr, dev_replace->time_stopped);
+	btrfs_set_dev_replace_num_write_errors(eb, ptr,
+		atomic64_read(&dev_replace->num_write_errors));
+	btrfs_set_dev_replace_num_uncorrectable_read_errors(eb, ptr,
+		atomic64_read(&dev_replace->num_uncorrectable_read_errors));
+	dev_replace->cursor_left_last_write_of_item =
+		dev_replace->cursor_left;
+	btrfs_set_dev_replace_cursor_left(eb, ptr,
+		dev_replace->cursor_left_last_write_of_item);
+	btrfs_set_dev_replace_cursor_right(eb, ptr,
+		dev_replace->cursor_right);
+	dev_replace->item_needs_writeback = 0;
+	btrfs_dev_replace_unlock(dev_replace);
+
+	btrfs_mark_buffer_dirty(eb);
+
+out:
+	btrfs_free_path(path);
+
+	return ret;
+}
+
+void btrfs_after_dev_replace_commit(struct btrfs_fs_info *fs_info)
+{
+	struct btrfs_dev_replace *dev_replace = &fs_info->dev_replace;
+
+	dev_replace->committed_cursor_left =
+		dev_replace->cursor_left_last_write_of_item;
+}
+
+static u64 btrfs_get_seconds_since_1970(void)
+{
+	struct timespec t = CURRENT_TIME_SEC;
+
+	return t.tv_sec;
+}
+
+int btrfs_dev_replace_start(struct btrfs_root *root,
+			    struct btrfs_ioctl_dev_replace_args *args)
+{
+	struct btrfs_trans_handle *trans;
+	struct btrfs_fs_info *fs_info = root->fs_info;
+	struct btrfs_dev_replace *dev_replace = &fs_info->dev_replace;
+	int ret;
+	struct btrfs_device *tgt_device = NULL;
+	struct btrfs_device *src_device = NULL;
+
+	switch (args->start.cont_reading_from_srcdev_mode) {
+	case BTRFS_IOCTL_DEV_REPLACE_CONT_READING_FROM_SRCDEV_MODE_ALWAYS:
+	case BTRFS_IOCTL_DEV_REPLACE_CONT_READING_FROM_SRCDEV_MODE_AVOID:
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	if ((args->start.srcdevid == 0 && args->start.srcdev_name[0] == '\0') ||
+	    args->start.tgtdev_name[0] == '\0')
+		return -EINVAL;
+
+	mutex_lock(&fs_info->volume_mutex);
+	ret = btrfs_init_dev_replace_tgtdev(root, args->start.tgtdev_name,
+					    &tgt_device);
+	if (ret) {
+		pr_err("btrfs: target device %s is invalid!\n",
+		       args->start.tgtdev_name);
+		mutex_unlock(&fs_info->volume_mutex);
+		return -EINVAL;
+	}
+
+	ret = btrfs_dev_replace_find_srcdev(root, args->start.srcdevid,
+					    args->start.srcdev_name,
+					    &src_device);
+	mutex_unlock(&fs_info->volume_mutex);
+	if (ret) {
+		ret = -EINVAL;
+		goto leave_no_lock;
+	}
+
+	if (tgt_device->total_bytes < src_device->total_bytes) {
+		pr_err("btrfs: target device is smaller than source device!\n");
+		ret = -EINVAL;
+		goto leave_no_lock;
+	}
+
+	btrfs_dev_replace_lock(dev_replace);
+	switch (dev_replace->replace_state) {
+	case BTRFS_IOCTL_DEV_REPLACE_STATE_NEVER_STARTED:
+	case BTRFS_IOCTL_DEV_REPLACE_STATE_FINISHED:
+	case BTRFS_IOCTL_DEV_REPLACE_STATE_CANCELED:
+		break;
+	case BTRFS_IOCTL_DEV_REPLACE_STATE_STARTED:
+	case BTRFS_IOCTL_DEV_REPLACE_STATE_SUSPENDED:
+		args->result = BTRFS_IOCTL_DEV_REPLACE_RESULT_ALREADY_STARTED;
+		goto leave;
+	}
+
+	dev_replace->cont_reading_from_srcdev_mode =
+		args->start.cont_reading_from_srcdev_mode;
+	WARN_ON(!src_device);
+	dev_replace->srcdev = src_device;
+	WARN_ON(!tgt_device);
+	dev_replace->tgtdev = tgt_device;
+
+	tgt_device->total_bytes = src_device->total_bytes;
+	tgt_device->disk_total_bytes = src_device->disk_total_bytes;
+	tgt_device->bytes_used = src_device->bytes_used;
+
+	/*
+	 * from now on, the writes to the srcdev are all duplicated to
+	 * go to the tgtdev as well (refer to btrfs_map_block()).
+	 */
+	dev_replace->replace_state = BTRFS_IOCTL_DEV_REPLACE_STATE_STARTED;
+	dev_replace->time_started = btrfs_get_seconds_since_1970();
+	dev_replace->cursor_left = 0;
+	dev_replace->committed_cursor_left = 0;
+	dev_replace->cursor_left_last_write_of_item = 0;
+	dev_replace->cursor_right = 0;
+	dev_replace->is_valid = 1;
+	dev_replace->item_needs_writeback = 1;
+	args->result = BTRFS_IOCTL_DEV_REPLACE_RESULT_NO_ERROR;
+	btrfs_dev_replace_unlock(dev_replace);
+
+	btrfs_wait_ordered_extents(root, 0);
+
+	/* force writing the updated state information to disk */
+	trans = btrfs_start_transaction(root, 0);
+	if (IS_ERR(trans)) {
+		ret = PTR_ERR(trans);
+		btrfs_dev_replace_lock(dev_replace);
+		goto leave;
+	}
+
+	ret = btrfs_commit_transaction(trans, root);
+	WARN_ON(ret);
+
+	/* the disk copy procedure reuses the scrub code */
+	ret = btrfs_scrub_dev(fs_info, src_device->devid, 0,
+			      src_device->total_bytes,
+			      &dev_replace->scrub_progress, 0, 1);
+
+	ret = btrfs_dev_replace_finishing(root->fs_info, ret);
+	WARN_ON(ret);
+
+	return 0;
+
+leave:
+	dev_replace->srcdev = NULL;
+	dev_replace->tgtdev = NULL;
+	btrfs_dev_replace_unlock(dev_replace);
+leave_no_lock:
+	if (tgt_device)
+		btrfs_destroy_dev_replace_tgtdev(fs_info, tgt_device);
+	return ret;
+}
+
+static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
+				       int scrub_ret)
+{
+	struct btrfs_dev_replace *dev_replace = &fs_info->dev_replace;
+	struct btrfs_device *tgt_device;
+	struct btrfs_device *src_device;
+	struct btrfs_root *root = fs_info->tree_root;
+	u8 uuid_tmp[BTRFS_UUID_SIZE];
+	struct btrfs_trans_handle *trans;
+	int ret = 0;
+
+	/* don't allow cancel or unmount to disturb the finishing procedure */
+	mutex_lock(&dev_replace->lock_finishing_cancel_unmount);
+
+	btrfs_dev_replace_lock(dev_replace);
+	/* was the operation canceled, or is it finished? */
+	if (dev_replace->replace_state !=
+	    BTRFS_IOCTL_DEV_REPLACE_STATE_STARTED) {
+		btrfs_dev_replace_unlock(dev_replace);
+		mutex_unlock(&dev_replace->lock_finishing_cancel_unmount);
+		return 0;
+	}
+
+	tgt_device = dev_replace->tgtdev;
+	src_device = dev_replace->srcdev;
+	btrfs_dev_replace_unlock(dev_replace);
+
+	/* replace old device with new one in mapping tree */
+	if (!scrub_ret)
+		btrfs_dev_replace_update_device_in_mapping_tree(fs_info,
+								src_device,
+								tgt_device);
+
+	/*
+	 * flush all outstanding I/O and inode extent mappings before the
+	 * copy operation is declared as being finished
+	 */
+	btrfs_start_delalloc_inodes(root, 0);
+	btrfs_wait_ordered_extents(root, 0);
+
+	trans = btrfs_start_transaction(root, 0);
+	if (IS_ERR(trans)) {
+		mutex_unlock(&dev_replace->lock_finishing_cancel_unmount);
+		return PTR_ERR(trans);
+	}
+	ret = btrfs_commit_transaction(trans, root);
+	WARN_ON(ret);
+
+	/* keep away write_all_supers() during the finishing procedure */
+	mutex_lock(&root->fs_info->fs_devices->device_list_mutex);
+	btrfs_dev_replace_lock(dev_replace);
+	dev_replace->replace_state =
+		scrub_ret ? BTRFS_IOCTL_DEV_REPLACE_STATE_CANCELED
+			  : BTRFS_IOCTL_DEV_REPLACE_STATE_FINISHED;
+	dev_replace->tgtdev = NULL;
+	dev_replace->srcdev = NULL;
+	dev_replace->time_stopped = btrfs_get_seconds_since_1970();
+	dev_replace->item_needs_writeback = 1;
+
+	if (scrub_ret) {
+		printk_in_rcu(KERN_ERR
+			      "btrfs: btrfs_scrub_dev(%s, %llu, %s) failed %d\n",
+			      rcu_str_deref(src_device->name),
+			      src_device->devid,
+			      rcu_str_deref(tgt_device->name), scrub_ret);
+		btrfs_dev_replace_unlock(dev_replace);
+		mutex_unlock(&root->fs_info->fs_devices->device_list_mutex);
+		if (tgt_device)
+			btrfs_destroy_dev_replace_tgtdev(fs_info, tgt_device);
+		mutex_unlock(&dev_replace->lock_finishing_cancel_unmount);
+
+		return 0;
+	}
+
+	tgt_device->is_tgtdev_for_dev_replace = 0;
+	tgt_device->devid = src_device->devid;
+	src_device->devid = BTRFS_DEV_REPLACE_DEVID;
+	tgt_device->bytes_used = src_device->bytes_used;
+	memcpy(uuid_tmp, tgt_device->uuid, sizeof(uuid_tmp));
+	memcpy(tgt_device->uuid, src_device->uuid, sizeof(tgt_device->uuid));
+	memcpy(src_device->uuid, uuid_tmp, sizeof(src_device->uuid));
+	tgt_device->total_bytes = src_device->total_bytes;
+	tgt_device->disk_total_bytes = src_device->disk_total_bytes;
+	tgt_device->bytes_used = src_device->bytes_used;
+	if (fs_info->sb->s_bdev == src_device->bdev)
+		fs_info->sb->s_bdev = tgt_device->bdev;
+	if (fs_info->fs_devices->latest_bdev == src_device->bdev)
+		fs_info->fs_devices->latest_bdev = tgt_device->bdev;
+	list_add(&tgt_device->dev_alloc_list, &fs_info->fs_devices->alloc_list);
+
+	btrfs_rm_dev_replace_srcdev(fs_info, src_device);
+	if (src_device->bdev) {
+		/* zero out the old super */
+		btrfs_scratch_superblock(src_device);
+	}
+	/*
+	 * this is again a consistent state where no dev_replace procedure
+	 * is running, the target device is part of the filesystem, the
+	 * source device is not part of the filesystem anymore and its 1st
+	 * superblock is scratched out so that it is no longer marked to
+	 * belong to this filesystem.
+	 */
+	btrfs_dev_replace_unlock(dev_replace);
+	mutex_unlock(&root->fs_info->fs_devices->device_list_mutex);
+
+	/* write back the superblocks */
+	trans = btrfs_start_transaction(root, 0);
+	if (!IS_ERR(trans))
+		btrfs_commit_transaction(trans, root);
+
+	mutex_unlock(&dev_replace->lock_finishing_cancel_unmount);
+
+	return 0;
+}
+
+static void btrfs_dev_replace_update_device_in_mapping_tree(
+						struct btrfs_fs_info *fs_info,
+						struct btrfs_device *srcdev,
+						struct btrfs_device *tgtdev)
+{
+	struct extent_map_tree *em_tree = &fs_info->mapping_tree.map_tree;
+	struct extent_map *em;
+	struct map_lookup *map;
+	u64 start = 0;
+	int i;
+
+	write_lock(&em_tree->lock);
+	do {
+		em = lookup_extent_mapping(em_tree, start, (u64)-1);
+		if (!em)
+			break;
+		map = (struct map_lookup *)em->bdev;
+		for (i = 0; i < map->num_stripes; i++)
+			if (srcdev == map->stripes[i].dev)
+				map->stripes[i].dev = tgtdev;
+		start = em->start + em->len;
+		free_extent_map(em);
+	} while (start);
+	write_unlock(&em_tree->lock);
+}
+
+static int btrfs_dev_replace_find_srcdev(struct btrfs_root *root, u64 srcdevid,
+					 char *srcdev_name,
+					 struct btrfs_device **device)
+{
+	int ret = 0;
+
+	if (srcdevid) {
+		*device = btrfs_find_device(root->fs_info, srcdevid, NULL,
+					    NULL);
+		if (!*device)
+			ret = -ENOENT;
+	} else {
+		ret = btrfs_find_device_missing_or_by_path(root, srcdev_name,
+							   device);
+	}
+	return ret;
+}
+
+void btrfs_dev_replace_status(struct btrfs_fs_info *fs_info,
+			      struct btrfs_ioctl_dev_replace_args *args)
+{
+	struct btrfs_dev_replace *dev_replace = &fs_info->dev_replace;
+
+	btrfs_dev_replace_lock(dev_replace);
+	/* even if !dev_replace_is_valid, the values are good enough for
+	 * the replace_status ioctl */
+	args->result = BTRFS_IOCTL_DEV_REPLACE_RESULT_NO_ERROR;
+	args->status.replace_state = dev_replace->replace_state;
+	args->status.time_started = dev_replace->time_started;
+	args->status.time_stopped = dev_replace->time_stopped;
+	args->status.num_write_errors =
+		atomic64_read(&dev_replace->num_write_errors);
+	args->status.num_uncorrectable_read_errors =
+		atomic64_read(&dev_replace->num_uncorrectable_read_errors);
+	switch (dev_replace->replace_state) {
+	case BTRFS_IOCTL_DEV_REPLACE_STATE_NEVER_STARTED:
+	case BTRFS_IOCTL_DEV_REPLACE_STATE_CANCELED:
+		args->status.progress_1000 = 0;
+		break;
+	case BTRFS_IOCTL_DEV_REPLACE_STATE_FINISHED:
+		args->status.progress_1000 = 1000;
+		break;
+	case BTRFS_IOCTL_DEV_REPLACE_STATE_STARTED:
+	case BTRFS_IOCTL_DEV_REPLACE_STATE_SUSPENDED:
+		args->status.progress_1000 = div64_u64(dev_replace->cursor_left,
+			div64_u64(dev_replace->srcdev->total_bytes, 1000));
+		break;
+	}
+	btrfs_dev_replace_unlock(dev_replace);
+}
+
+int btrfs_dev_replace_cancel(struct btrfs_fs_info *fs_info,
+			     struct btrfs_ioctl_dev_replace_args *args)
+{
+	args->result = __btrfs_dev_replace_cancel(fs_info);
+	return 0;
+}
+
+static u64 __btrfs_dev_replace_cancel(struct btrfs_fs_info *fs_info)
+{
+	struct btrfs_dev_replace *dev_replace = &fs_info->dev_replace;
+	struct btrfs_device *tgt_device = NULL;
+	struct btrfs_trans_handle *trans;
+	struct btrfs_root *root = fs_info->tree_root;
+	u64 result;
+	int ret;
+
+	mutex_lock(&dev_replace->lock_finishing_cancel_unmount);
+	btrfs_dev_replace_lock(dev_replace);
+	switch (dev_replace->replace_state) {
+	case BTRFS_IOCTL_DEV_REPLACE_STATE_NEVER_STARTED:
+	case BTRFS_IOCTL_DEV_REPLACE_STATE_FINISHED:
+	case BTRFS_IOCTL_DEV_REPLACE_STATE_CANCELED:
+		result = BTRFS_IOCTL_DEV_REPLACE_RESULT_NOT_STARTED;
+		btrfs_dev_replace_unlock(dev_replace);
+		goto leave;
+	case BTRFS_IOCTL_DEV_REPLACE_STATE_STARTED:
+	case BTRFS_IOCTL_DEV_REPLACE_STATE_SUSPENDED:
+		result = BTRFS_IOCTL_DEV_REPLACE_RESULT_NO_ERROR;
+		tgt_device = dev_replace->tgtdev;
+		dev_replace->tgtdev = NULL;
+		dev_replace->srcdev = NULL;
+		break;
+	}
+	dev_replace->replace_state = BTRFS_IOCTL_DEV_REPLACE_STATE_CANCELED;
+	dev_replace->time_stopped = btrfs_get_seconds_since_1970();
+	dev_replace->item_needs_writeback = 1;
+	btrfs_dev_replace_unlock(dev_replace);
+	btrfs_scrub_cancel(fs_info);
+
+	trans = btrfs_start_transaction(root, 0);
+	if (IS_ERR(trans)) {
+		mutex_unlock(&dev_replace->lock_finishing_cancel_unmount);
+		return PTR_ERR(trans);
+	}
+	ret = btrfs_commit_transaction(trans, root);
+	WARN_ON(ret);
+	if (tgt_device)
+		btrfs_destroy_dev_replace_tgtdev(fs_info, tgt_device);
+
+leave:
+	mutex_unlock(&dev_replace->lock_finishing_cancel_unmount);
+	return result;
+}
+
+void btrfs_dev_replace_suspend_for_unmount(struct btrfs_fs_info *fs_info)
+{
+	struct btrfs_dev_replace *dev_replace = &fs_info->dev_replace;
+
+	mutex_lock(&dev_replace->lock_finishing_cancel_unmount);
+	btrfs_dev_replace_lock(dev_replace);
+	switch (dev_replace->replace_state) {
+	case BTRFS_IOCTL_DEV_REPLACE_STATE_NEVER_STARTED:
+	case BTRFS_IOCTL_DEV_REPLACE_STATE_FINISHED:
+	case BTRFS_IOCTL_DEV_REPLACE_STATE_CANCELED:
+	case BTRFS_IOCTL_DEV_REPLACE_STATE_SUSPENDED:
+		break;
+	case BTRFS_IOCTL_DEV_REPLACE_STATE_STARTED:
+		dev_replace->replace_state =
+			BTRFS_IOCTL_DEV_REPLACE_STATE_SUSPENDED;
+		dev_replace->time_stopped = btrfs_get_seconds_since_1970();
+		dev_replace->item_needs_writeback = 1;
+		pr_info("btrfs: suspending dev_replace for unmount\n");
+		break;
+	}
+
+	btrfs_dev_replace_unlock(dev_replace);
+	mutex_unlock(&dev_replace->lock_finishing_cancel_unmount);
+}
+
+/* resume dev_replace procedure that was interrupted by unmount */
+int btrfs_resume_dev_replace_async(struct btrfs_fs_info *fs_info)
+{
+	struct task_struct *task;
+	struct btrfs_dev_replace *dev_replace = &fs_info->dev_replace;
+
+	btrfs_dev_replace_lock(dev_replace);
+	switch (dev_replace->replace_state) {
+	case BTRFS_IOCTL_DEV_REPLACE_STATE_NEVER_STARTED:
+	case BTRFS_IOCTL_DEV_REPLACE_STATE_FINISHED:
+	case BTRFS_IOCTL_DEV_REPLACE_STATE_CANCELED:
+		btrfs_dev_replace_unlock(dev_replace);
+		return 0;
+	case BTRFS_IOCTL_DEV_REPLACE_STATE_STARTED:
+		break;
+	case BTRFS_IOCTL_DEV_REPLACE_STATE_SUSPENDED:
+		dev_replace->replace_state =
+			BTRFS_IOCTL_DEV_REPLACE_STATE_STARTED;
+		break;
+	}
+	if (!dev_replace->tgtdev || !dev_replace->tgtdev->bdev) {
+		pr_info("btrfs: cannot continue dev_replace, tgtdev is missing\n"
+			"btrfs: you may cancel the operation after 'mount -o degraded'\n");
+		btrfs_dev_replace_unlock(dev_replace);
+		return 0;
+	}
+	btrfs_dev_replace_unlock(dev_replace);
+
+	WARN_ON(atomic_xchg(
+		&fs_info->mutually_exclusive_operation_running, 1));
+	task = kthread_run(btrfs_dev_replace_kthread, fs_info, "btrfs-devrepl");
+	return PTR_RET(task);
+}
+
+static int btrfs_dev_replace_kthread(void *data)
+{
+	struct btrfs_fs_info *fs_info = data;
+	struct btrfs_dev_replace *dev_replace = &fs_info->dev_replace;
+	struct btrfs_ioctl_dev_replace_args *status_args;
+	u64 progress;
+
+	status_args = kzalloc(sizeof(*status_args), GFP_NOFS);
+	if (status_args) {
+		btrfs_dev_replace_status(fs_info, status_args);
+		progress = status_args->status.progress_1000;
+		kfree(status_args);
+		do_div(progress, 10);
+		printk_in_rcu(KERN_INFO
+			      "btrfs: continuing dev_replace from %s (devid %llu) to %s @%u%%\n",
+			      dev_replace->srcdev->missing ? "<missing disk>" :
+				rcu_str_deref(dev_replace->srcdev->name),
+			      dev_replace->srcdev->devid,
+			      dev_replace->tgtdev ?
+				rcu_str_deref(dev_replace->tgtdev->name) :
+				"<missing target disk>",
+			      (unsigned int)progress);
+	}
+	btrfs_dev_replace_continue_on_mount(fs_info);
+	atomic_set(&fs_info->mutually_exclusive_operation_running, 0);
+
+	return 0;
+}
+
+static int btrfs_dev_replace_continue_on_mount(struct btrfs_fs_info *fs_info)
+{
+	struct btrfs_dev_replace *dev_replace = &fs_info->dev_replace;
+	int ret;
+
+	ret = btrfs_scrub_dev(fs_info, dev_replace->srcdev->devid,
+			      dev_replace->committed_cursor_left,
+			      dev_replace->srcdev->total_bytes,
+			      &dev_replace->scrub_progress, 0, 1);
+	ret = btrfs_dev_replace_finishing(fs_info, ret);
+	WARN_ON(ret);
+	return 0;
+}
+
+int btrfs_dev_replace_is_ongoing(struct btrfs_dev_replace *dev_replace)
+{
+	if (!dev_replace->is_valid)
+		return 0;
+
+	switch (dev_replace->replace_state) {
+	case BTRFS_IOCTL_DEV_REPLACE_STATE_NEVER_STARTED:
+	case BTRFS_IOCTL_DEV_REPLACE_STATE_FINISHED:
+	case BTRFS_IOCTL_DEV_REPLACE_STATE_CANCELED:
+		return 0;
+	case BTRFS_IOCTL_DEV_REPLACE_STATE_STARTED:
+	case BTRFS_IOCTL_DEV_REPLACE_STATE_SUSPENDED:
+		/*
+		 * return true even if tgtdev is missing (this is
+		 * something that can happen if the dev_replace
+		 * procedure is suspended by an umount and then
+		 * the tgtdev is missing, or "btrfs dev scan" was
+		 * not called, and the filesystem is remounted
+		 * in degraded state). This does not stop the
+		 * dev_replace procedure. It needs to be canceled
+		 * manually if the cancelation is wanted.
+		 */
+		break;
+	}
+	return 1;
+}
+
+void btrfs_dev_replace_lock(struct btrfs_dev_replace *dev_replace)
+{
+	/* the beginning is just an optimization for the typical case */
+	if (atomic_read(&dev_replace->nesting_level) == 0) {
+acquire_lock:
+		/* this is not a nested case where the same thread
+		 * is trying to acquire the same lock twice */
+		mutex_lock(&dev_replace->lock);
+		mutex_lock(&dev_replace->lock_management_lock);
+		dev_replace->lock_owner = current->pid;
+		atomic_inc(&dev_replace->nesting_level);
+		mutex_unlock(&dev_replace->lock_management_lock);
+		return;
+	}
+
+	mutex_lock(&dev_replace->lock_management_lock);
+	if (atomic_read(&dev_replace->nesting_level) > 0 &&
+	    dev_replace->lock_owner == current->pid) {
+		WARN_ON(!mutex_is_locked(&dev_replace->lock));
+		atomic_inc(&dev_replace->nesting_level);
+		mutex_unlock(&dev_replace->lock_management_lock);
+		return;
+	}
+
+	mutex_unlock(&dev_replace->lock_management_lock);
+	goto acquire_lock;
+}
+
+void btrfs_dev_replace_unlock(struct btrfs_dev_replace *dev_replace)
+{
+	WARN_ON(!mutex_is_locked(&dev_replace->lock));
+	mutex_lock(&dev_replace->lock_management_lock);
+	WARN_ON(atomic_read(&dev_replace->nesting_level) < 1);
+	WARN_ON(dev_replace->lock_owner != current->pid);
+	atomic_dec(&dev_replace->nesting_level);
+	if (atomic_read(&dev_replace->nesting_level) == 0) {
+		dev_replace->lock_owner = 0;
+		mutex_unlock(&dev_replace->lock_management_lock);
+		mutex_unlock(&dev_replace->lock);
+	} else {
+		mutex_unlock(&dev_replace->lock_management_lock);
+	}
+}
diff --git a/fs/btrfs/dev-replace.h b/fs/btrfs/dev-replace.h
new file mode 100644
index 0000000..20035cb
--- /dev/null
+++ b/fs/btrfs/dev-replace.h
@@ -0,0 +1,44 @@
+/*
+ * Copyright (C) STRATO AG 2012.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 02111-1307, USA.
+ */
+
+#if !defined(__BTRFS_DEV_REPLACE__)
+#define __BTRFS_DEV_REPLACE__
+
+struct btrfs_ioctl_dev_replace_args;
+
+int btrfs_init_dev_replace(struct btrfs_fs_info *fs_info);
+int btrfs_run_dev_replace(struct btrfs_trans_handle *trans,
+			  struct btrfs_fs_info *fs_info);
+void btrfs_after_dev_replace_commit(struct btrfs_fs_info *fs_info);
+int btrfs_dev_replace_start(struct btrfs_root *root,
+			    struct btrfs_ioctl_dev_replace_args *args);
+void btrfs_dev_replace_status(struct btrfs_fs_info *fs_info,
+			      struct btrfs_ioctl_dev_replace_args *args);
+int btrfs_dev_replace_cancel(struct btrfs_fs_info *fs_info,
+			     struct btrfs_ioctl_dev_replace_args *args);
+void btrfs_dev_replace_suspend_for_unmount(struct btrfs_fs_info *fs_info);
+int btrfs_resume_dev_replace_async(struct btrfs_fs_info *fs_info);
+int btrfs_dev_replace_is_ongoing(struct btrfs_dev_replace *dev_replace);
+void btrfs_dev_replace_lock(struct btrfs_dev_replace *dev_replace);
+void btrfs_dev_replace_unlock(struct btrfs_dev_replace *dev_replace);
+
+static inline void btrfs_dev_replace_stats_inc(atomic64_t *stat_value)
+{
+	atomic64_inc(stat_value);
+}
+#endif
diff --git a/fs/btrfs/ioctl.h b/fs/btrfs/ioctl.h
index 731e287..62006ba 100644
--- a/fs/btrfs/ioctl.h
+++ b/fs/btrfs/ioctl.h
@@ -123,6 +123,48 @@ struct btrfs_ioctl_scrub_args {
 	__u64 unused[(1024-32-sizeof(struct btrfs_scrub_progress))/8];
 };
 
+#define BTRFS_IOCTL_DEV_REPLACE_CONT_READING_FROM_SRCDEV_MODE_ALWAYS	0
+#define BTRFS_IOCTL_DEV_REPLACE_CONT_READING_FROM_SRCDEV_MODE_AVOID	1
+struct btrfs_ioctl_dev_replace_start_params {
+	__u64 srcdevid;	/* in, if 0, use srcdev_name instead */
+	__u8 srcdev_name[BTRFS_PATH_NAME_MAX + 1];	/* in */
+	__u8 tgtdev_name[BTRFS_PATH_NAME_MAX + 1];	/* in */
+	__u64 cont_reading_from_srcdev_mode;	/* in, see #define
+						 * above */
+};
+
+#define BTRFS_IOCTL_DEV_REPLACE_STATE_NEVER_STARTED	0
+#define BTRFS_IOCTL_DEV_REPLACE_STATE_STARTED		1
+#define BTRFS_IOCTL_DEV_REPLACE_STATE_FINISHED		2
+#define BTRFS_IOCTL_DEV_REPLACE_STATE_CANCELED		3
+#define BTRFS_IOCTL_DEV_REPLACE_STATE_SUSPENDED		4
+struct btrfs_ioctl_dev_replace_status_params {
+	__u64 replace_state;	/* out, see #define above */
+	__u64 progress_1000;	/* out, 0 <= x <= 1000 */
+	__u64 time_started;	/* out, seconds since 1-Jan-1970 */
+	__u64 time_stopped;	/* out, seconds since 1-Jan-1970 */
+	__u64 num_write_errors;	/* out */
+	__u64 num_uncorrectable_read_errors;	/* out */
+};
+
+#define BTRFS_IOCTL_DEV_REPLACE_CMD_START			0
+#define BTRFS_IOCTL_DEV_REPLACE_CMD_STATUS			1
+#define BTRFS_IOCTL_DEV_REPLACE_CMD_CANCEL			2
+#define BTRFS_IOCTL_DEV_REPLACE_RESULT_NO_ERROR			0
+#define BTRFS_IOCTL_DEV_REPLACE_RESULT_NOT_STARTED		1
+#define BTRFS_IOCTL_DEV_REPLACE_RESULT_ALREADY_STARTED		2
+struct btrfs_ioctl_dev_replace_args {
+	__u64 cmd;	/* in */
+	__u64 result;	/* out */
+
+	union {
+		struct btrfs_ioctl_dev_replace_start_params start;
+		struct btrfs_ioctl_dev_replace_status_params status;
+	};	/* in/out */
+
+	__u64 spare[64];
+};
+
 #define BTRFS_DEVICE_PATH_NAME_MAX 1024
 struct btrfs_ioctl_dev_info_args {
 	__u64 devid;				/* in/out */
@@ -453,4 +495,7 @@ struct btrfs_ioctl_send_args {
 			       struct btrfs_ioctl_qgroup_limit_args)
 #define BTRFS_IOC_GET_DEV_STATS _IOWR(BTRFS_IOCTL_MAGIC, 52, \
 				      struct btrfs_ioctl_get_dev_stats)
+#define BTRFS_IOC_DEV_REPLACE _IOWR(BTRFS_IOCTL_MAGIC, 53, \
+				    struct btrfs_ioctl_dev_replace_args)
+
 #endif
-- 
1.8.0
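
Note on the progress value: btrfs_dev_replace_status() in the patch above
reports progress in units of 1/1000 by dividing cursor_left by one
thousandth of the source device size. The arithmetic can be sketched in
plain C as below. This is only an illustration: the function name is
hypothetical, and both the zero guard and the clamp to 1000 are additions
of this sketch, not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper mirroring the div64_u64() arithmetic used for the
 * STARTED and SUSPENDED states: cursor_left / (total_bytes / 1000).
 */
static uint64_t replace_progress_1000(uint64_t cursor_left,
				      uint64_t total_bytes)
{
	uint64_t per_mille = total_bytes / 1000;
	uint64_t progress;

	/* sketch-only precondition: the cursor never passes the end */
	assert(cursor_left <= total_bytes);

	/* sketch-only guard: devices smaller than 1000 bytes */
	if (per_mille == 0)
		return 0;

	progress = cursor_left / per_mille;

	/* sketch-only clamp: integer rounding can overshoot 1000 */
	return progress > 1000 ? 1000 : progress;
}
```

btrfs_dev_replace_kthread() divides this value by 10 to print a percentage
in the "continuing dev_replace" message.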


Stefan Behrens

2012-Nov-06 16:38 UTC

[PATCH 16/26] Btrfs: disallow mutually exclusive admin operations from user mode

Btrfs admin operations that are manually started from user mode
and that cannot be executed at the same time return -EINPROGRESS.
A common way to enter and leave this locked section is introduced
since it used to be specific to the balance operation.
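
The enter/leave pattern used throughout the diff below is an atomic
exchange acting as a try-lock: whoever swaps the flag from 0 to 1 owns the
section, everyone else sees 1 and backs off. A minimal userspace sketch
with C11 atomics follows. The names exclusive_op_running,
exclusive_op_enter() and exclusive_op_leave() are hypothetical; the kernel
code uses atomic_xchg() and atomic_set() on
fs_info->mutually_exclusive_operation_running directly, and the assert in
the leave path is an addition of this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>

/* hypothetical stand-in for fs_info->mutually_exclusive_operation_running */
static atomic_int exclusive_op_running;

/* returns 0 if the caller now owns the section, -EINPROGRESS otherwise */
static int exclusive_op_enter(void)
{
	/* atomic_exchange() returns the previous value; nonzero means busy */
	if (atomic_exchange(&exclusive_op_running, 1))
		return -EINPROGRESS;
	return 0;
}

static void exclusive_op_leave(void)
{
	/* sketch-only check: leaving without having entered is a bug */
	int was_running = atomic_exchange(&exclusive_op_running, 0);

	assert(was_running == 1);
}
```

Unlike a mutex, a second caller never sleeps here; it fails immediately,
which is what lets the ioctls return -EINPROGRESS to user mode.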

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
---
 fs/btrfs/ctree.h   |  2 ++
 fs/btrfs/ioctl.c   | 53 ++++++++++++++++++++++++++++++++++++-----------------
 fs/btrfs/volumes.c |  2 ++
 3 files changed, 40 insertions(+), 17 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 4bffc63..f9ceea9 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1527,6 +1527,8 @@ struct btrfs_fs_info {
 
 	/* device replace state */
 	struct btrfs_dev_replace dev_replace;
+
+	atomic_t mutually_exclusive_operation_running;
 };
 
 /*
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 7ece1fe..d912c64 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -1314,13 +1314,13 @@ static noinline int btrfs_ioctl_resize(struct btrfs_root *root,
 	if (!capable(CAP_SYS_ADMIN))
 		return -EPERM;
 
-	mutex_lock(&root->fs_info->volume_mutex);
-	if (root->fs_info->balance_ctl) {
-		printk(KERN_INFO "btrfs: balance in progress\n");
-		ret = -EINVAL;
-		goto out;
+	if (atomic_xchg(&root->fs_info->mutually_exclusive_operation_running,
+			1)) {
+		pr_info("btrfs: dev add/delete/balance/replace/resize operation in progress\n");
+		return -EINPROGRESS;
 	}
 
+	mutex_lock(&root->fs_info->volume_mutex);
 	vol_args = memdup_user(arg, sizeof(*vol_args));
 	if (IS_ERR(vol_args)) {
 		ret = PTR_ERR(vol_args);
@@ -1416,6 +1416,7 @@ out_free:
 	kfree(vol_args);
 out:
 	mutex_unlock(&root->fs_info->volume_mutex);
+	atomic_set(&root->fs_info->mutually_exclusive_operation_running, 0);
 	return ret;
 }
 
@@ -2157,9 +2158,17 @@ static int btrfs_ioctl_defrag(struct file *file, void __user *argp)
 	if (btrfs_root_readonly(root))
 		return -EROFS;
 
+	if (atomic_xchg(&root->fs_info->mutually_exclusive_operation_running,
+			1)) {
+		pr_info("btrfs: dev add/delete/balance/replace/resize operation in progress\n");
+		return -EINPROGRESS;
+	}
 	ret = mnt_want_write_file(file);
-	if (ret)
+	if (ret) {
+		atomic_set(&root->fs_info->mutually_exclusive_operation_running,
+			   0);
 		return ret;
+	}
 
 	switch (inode->i_mode & S_IFMT) {
 	case S_IFDIR:
@@ -2211,6 +2220,7 @@ static int btrfs_ioctl_defrag(struct file *file, void __user *argp)
 	}
 out:
 	mnt_drop_write_file(file);
+	atomic_set(&root->fs_info->mutually_exclusive_operation_running, 0);
 	return ret;
 }
 
@@ -2222,13 +2232,13 @@ static long btrfs_ioctl_add_dev(struct btrfs_root *root, void __user *arg)
 	if (!capable(CAP_SYS_ADMIN))
 		return -EPERM;
 
-	mutex_lock(&root->fs_info->volume_mutex);
-	if (root->fs_info->balance_ctl) {
-		printk(KERN_INFO "btrfs: balance in progress\n");
-		ret = -EINVAL;
-		goto out;
+	if (atomic_xchg(&root->fs_info->mutually_exclusive_operation_running,
+			1)) {
+		pr_info("btrfs: dev add/delete/balance/replace/resize operation in progress\n");
+		return -EINPROGRESS;
 	}
 
+	mutex_lock(&root->fs_info->volume_mutex);
 	vol_args = memdup_user(arg, sizeof(*vol_args));
 	if (IS_ERR(vol_args)) {
 		ret = PTR_ERR(vol_args);
@@ -2241,6 +2251,7 @@ static long btrfs_ioctl_add_dev(struct btrfs_root *root, void __user *arg)
 	kfree(vol_args);
 out:
 	mutex_unlock(&root->fs_info->volume_mutex);
+	atomic_set(&root->fs_info->mutually_exclusive_operation_running, 0);
 	return ret;
 }
 
@@ -2255,13 +2266,13 @@ static long btrfs_ioctl_rm_dev(struct btrfs_root *root, void __user *arg)
 	if (root->fs_info->sb->s_flags & MS_RDONLY)
 		return -EROFS;
 
-	mutex_lock(&root->fs_info->volume_mutex);
-	if (root->fs_info->balance_ctl) {
-		printk(KERN_INFO "btrfs: balance in progress\n");
-		ret = -EINVAL;
-		goto out;
+	if (atomic_xchg(&root->fs_info->mutually_exclusive_operation_running,
+			1)) {
+		pr_info("btrfs: dev add/delete/balance/replace/resize operation in progress\n");
+		return -EINPROGRESS;
 	}
 
+	mutex_lock(&root->fs_info->volume_mutex);
 	vol_args = memdup_user(arg, sizeof(*vol_args));
 	if (IS_ERR(vol_args)) {
 		ret = PTR_ERR(vol_args);
@@ -2274,6 +2285,7 @@ static long btrfs_ioctl_rm_dev(struct btrfs_root *root, void __user *arg)
 	kfree(vol_args);
 out:
 	mutex_unlock(&root->fs_info->volume_mutex);
+	atomic_set(&root->fs_info->mutually_exclusive_operation_running, 0);
 	return ret;
 }
 
@@ -3316,6 +3328,7 @@ static long btrfs_ioctl_balance(struct file *file, void __user *arg)
 	struct btrfs_ioctl_balance_args *bargs;
 	struct btrfs_balance_control *bctl;
 	int ret;
+	int need_to_clear_lock = 0;
 
 	if (!capable(CAP_SYS_ADMIN))
 		return -EPERM;
@@ -3351,10 +3364,13 @@ static long btrfs_ioctl_balance(struct file *file, void __user *arg)
 		bargs = NULL;
 	}
 
-	if (fs_info->balance_ctl) {
+	if (atomic_xchg(&root->fs_info->mutually_exclusive_operation_running,
+			1)) {
+		pr_info("btrfs: dev add/delete/balance/replace/resize operation in progress\n");
 		ret = -EINPROGRESS;
 		goto out_bargs;
 	}
+	need_to_clear_lock = 1;
 
 	bctl = kzalloc(sizeof(*bctl), GFP_NOFS);
 	if (!bctl) {
@@ -3388,6 +3404,9 @@ do_balance:
 out_bargs:
 	kfree(bargs);
 out:
+	if (need_to_clear_lock)
+		atomic_set(&root->fs_info->mutually_exclusive_operation_running,
+			   0);
 	mutex_unlock(&fs_info->balance_mutex);
 	mutex_unlock(&fs_info->volume_mutex);
 	mnt_drop_write_file(file);
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 473ae89..1827bcd 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -2956,6 +2956,7 @@ static int balance_kthread(void *data)
 		ret = btrfs_balance(fs_info->balance_ctl, NULL);
 	}
 
+	atomic_set(&fs_info->mutually_exclusive_operation_running, 0);
 	mutex_unlock(&fs_info->balance_mutex);
 	mutex_unlock(&fs_info->volume_mutex);
 
@@ -2978,6 +2979,7 @@ int btrfs_resume_balance_async(struct btrfs_fs_info *fs_info)
 		return 0;
 	}
 
+	WARN_ON(atomic_xchg(&fs_info->mutually_exclusive_operation_running, 1));
 	tsk = kthread_run(balance_kthread, fs_info, "btrfs-balance");
 	if (IS_ERR(tsk))
 		return PTR_ERR(tsk);
-- 
1.8.0


Stefan Behrens

2012-Nov-06 16:38 UTC

[PATCH 17/26] Btrfs: disallow some operations on the device replace target device

This patch adds some code to disallow operations on the device that
is used as the target for the device replace operation.
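
As an illustration of the kind of check this adds, the sketch below
mirrors the condition changed in btrfs_can_relocate(): a device only
counts as a candidate for new chunks if it has enough free space and is
not the replace target. The reduced struct sketch_device and the function
name are hypothetical stand-ins, not kernel definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* hypothetical, reduced stand-in for struct btrfs_device */
struct sketch_device {
	uint64_t total_bytes;
	uint64_t bytes_used;
	bool is_tgtdev_for_dev_replace;
};

/* true if the device may receive relocated block groups */
static bool device_usable_for_relocation(const struct sketch_device *dev,
					 uint64_t min_free)
{
	assert(dev != NULL);

	/* same shape as the patched test in btrfs_can_relocate() */
	return dev->total_bytes > dev->bytes_used + min_free &&
	       !dev->is_tgtdev_for_dev_replace;
}
```

The target device is a mirror image under construction, so handing its
space to the allocator before the copy finishes would let new data land in
regions the copy procedure still has to fill.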

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
---
 fs/btrfs/ctree.h       |  2 +-
 fs/btrfs/extent-tree.c |  3 ++-
 fs/btrfs/ioctl.c       |  8 +++++++-
 fs/btrfs/scrub.c       | 14 +++++++++-----
 fs/btrfs/super.c       |  3 ++-
 fs/btrfs/volumes.c     | 41 ++++++++++++++++++++++++++++++++---------
 fs/btrfs/volumes.h     |  1 +
 7 files changed, 54 insertions(+), 18 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index f9ceea9..83904b5 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3649,7 +3649,7 @@ int btrfs_reloc_post_snapshot(struct btrfs_trans_handle *trans,
 /* scrub.c */
 int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start,
 		    u64 end, struct btrfs_scrub_progress *progress,
-		    int readonly);
+		    int readonly, int is_dev_replace);
 void btrfs_scrub_pause(struct btrfs_root *root);
 void btrfs_scrub_pause_super(struct btrfs_root *root);
 void btrfs_scrub_continue(struct btrfs_root *root);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 4c94183..f37c0cc 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -7462,7 +7462,8 @@ int btrfs_can_relocate(struct btrfs_root *root, u64 bytenr)
 		 * check to make sure we can actually find a chunk with enough
 		 * space to fit our block group in.
 		 */
-		if (device->total_bytes > device->bytes_used + min_free) {
+		if (device->total_bytes > device->bytes_used + min_free &&
+		    !device->is_tgtdev_for_dev_replace) {
 			ret = find_free_dev_extent(device, min_free,
 						   &dev_offset, NULL);
 			if (!ret)
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index d912c64..1a93c14 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -1372,6 +1372,11 @@ static noinline int btrfs_ioctl_resize(struct btrfs_root *root,
 		}
 	}
 
+	if (device->is_tgtdev_for_dev_replace) {
+		ret = -EINVAL;
+		goto out_free;
+	}
+
 	old_size = device->total_bytes;
 
 	if (mod < 0) {
@@ -3099,7 +3104,8 @@ static long btrfs_ioctl_scrub(struct btrfs_root *root, void __user *arg)
 		return PTR_ERR(sa);
 
 	ret = btrfs_scrub_dev(root->fs_info, sa->devid, sa->start, sa->end,
-			      &sa->progress, sa->flags & BTRFS_SCRUB_READONLY);
+			      &sa->progress, sa->flags & BTRFS_SCRUB_READONLY,
+			      0);
 
 	if (copy_to_user(arg, sa, sizeof(*sa)))
 		ret = -EFAULT;
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 6cf23f4..460e30b 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -116,6 +116,9 @@ struct scrub_ctx {
 	u32			sectorsize;
 	u32			nodesize;
 	u32			leafsize;
+
+	int			is_dev_replace;
+
 	/*
 	 * statistics
 	 */
@@ -284,7 +287,7 @@ static noinline_for_stack void scrub_free_ctx(struct scrub_ctx *sctx)
 }
 
 static noinline_for_stack
-struct scrub_ctx *scrub_setup_ctx(struct btrfs_device *dev)
+struct scrub_ctx *scrub_setup_ctx(struct btrfs_device *dev, int is_dev_replace)
 {
 	struct scrub_ctx *sctx;
 	int		i;
@@ -296,6 +299,7 @@ struct scrub_ctx *scrub_setup_ctx(struct btrfs_device *dev)
 	sctx = kzalloc(sizeof(*sctx), GFP_NOFS);
 	if (!sctx)
 		goto nomem;
+	sctx->is_dev_replace = is_dev_replace;
 	sctx->pages_per_bio = pages_per_bio;
 	sctx->curr = -1;
 	sctx->dev_root = dev->dev_root;
@@ -2293,7 +2297,7 @@ static noinline_for_stack void scrub_workers_put(struct btrfs_fs_info *fs_info)
 
 int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start,
 		    u64 end, struct btrfs_scrub_progress *progress,
-		    int readonly)
+		    int readonly, int is_dev_replace)
 {
 	struct scrub_ctx *sctx;
 	int ret;
@@ -2356,14 +2360,14 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start,
 
 	mutex_lock(&fs_info->fs_devices->device_list_mutex);
 	dev = btrfs_find_device(fs_info, devid, NULL, NULL);
-	if (!dev || dev->missing) {
+	if (!dev || (dev->missing && !is_dev_replace)) {
 		mutex_unlock(&fs_info->fs_devices->device_list_mutex);
 		scrub_workers_put(fs_info);
 		return -ENODEV;
 	}
 	mutex_lock(&fs_info->scrub_lock);
 
-	if (!dev->in_fs_metadata) {
+	if (!dev->in_fs_metadata || dev->is_tgtdev_for_dev_replace) {
 		mutex_unlock(&fs_info->scrub_lock);
 		mutex_unlock(&fs_info->fs_devices->device_list_mutex);
 		scrub_workers_put(fs_info);
@@ -2376,7 +2380,7 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start,
 		scrub_workers_put(fs_info);
 		return -EINPROGRESS;
 	}
-	sctx = scrub_setup_ctx(dev);
+	sctx = scrub_setup_ctx(dev, is_dev_replace);
 	if (IS_ERR(sctx)) {
 		mutex_unlock(&fs_info->scrub_lock);
 		mutex_unlock(&fs_info->fs_devices->device_list_mutex);
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index bdf1f5e..789c9b2 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -1362,7 +1362,8 @@ static int btrfs_calc_avail_data_space(struct btrfs_root *root, u64 *free_bytes)
 		min_stripe_size = BTRFS_STRIPE_LEN;
 
 	list_for_each_entry(device, &fs_devices->devices, dev_list) {
-		if (!device->in_fs_metadata || !device->bdev)
+		if (!device->in_fs_metadata || !device->bdev ||
+		    device->is_tgtdev_for_dev_replace)
 			continue;
 
 		avail_space = device->total_bytes - device->bytes_used;
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 1827bcd..87051e8 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -480,8 +480,9 @@ again:
 	/* This is the initialized path, it is safe to release the devices. */
 	list_for_each_entry_safe(device, next, &fs_devices->devices, dev_list) {
 		if (device->in_fs_metadata) {
-			if (!latest_transid ||
-			    device->generation > latest_transid) {
+			if (!device->is_tgtdev_for_dev_replace &&
+			    (!latest_transid ||
+			     device->generation > latest_transid)) {
 				latest_devid = device->devid;
 				latest_transid = device->generation;
 				latest_bdev = device->bdev;
@@ -796,7 +797,7 @@ int btrfs_account_dev_extents_size(struct btrfs_device *device, u64 start,
 
 	*length = 0;
 
-	if (start >= device->total_bytes)
+	if (start >= device->total_bytes || device->is_tgtdev_for_dev_replace)
 		return 0;
 
 	path = btrfs_alloc_path();
@@ -913,7 +914,7 @@ int find_free_dev_extent(struct btrfs_device *device, u64 num_bytes,
 	max_hole_size = 0;
 	hole_size = 0;
 
-	if (search_start >= search_end) {
+	if (search_start >= search_end || device->is_tgtdev_for_dev_replace) {
 		ret = -ENOSPC;
 		goto error;
 	}
@@ -1096,6 +1097,7 @@ int btrfs_alloc_dev_extent(struct btrfs_trans_handle *trans,
 	struct btrfs_key key;
 
 	WARN_ON(!device->in_fs_metadata);
+	WARN_ON(device->is_tgtdev_for_dev_replace);
 	path = btrfs_alloc_path();
 	if (!path)
 		return -ENOMEM;
@@ -1357,7 +1359,9 @@ int btrfs_rm_device(struct btrfs_root *root, char *device_path)
 		 * is held.
 		 */
 		list_for_each_entry(tmp, devices, dev_list) {
-			if (tmp->in_fs_metadata && !tmp->bdev) {
+			if (tmp->in_fs_metadata &&
+			    !tmp->is_tgtdev_for_dev_replace &&
+			    !tmp->bdev) {
 				device = tmp;
 				break;
 			}
@@ -1396,6 +1400,12 @@ int btrfs_rm_device(struct btrfs_root *root, char *device_path)
 		}
 	}
 
+	if (device->is_tgtdev_for_dev_replace) {
+		pr_err("btrfs: unable to remove the dev_replace target dev\n");
+		ret = -EINVAL;
+		goto error_brelse;
+	}
+
 	if (device->writeable && root->fs_info->fs_devices->rw_devices == 1) {
 		printk(KERN_ERR "btrfs: unable to remove the only writeable "
 		       "device\n");
@@ -1415,6 +1425,11 @@ int btrfs_rm_device(struct btrfs_root *root, char *device_path)
 	if (ret)
 		goto error_undo;
 
+	/*
+	 * TODO: the superblock still includes this device in its num_devices
+	 * counter although write_all_supers() is not locked out. This
+	 * could give a filesystem state which requires a degraded mount.
+	 */
 	ret = btrfs_rm_dev_item(root->fs_info->chunk_root, device);
 	if (ret)
 		goto error_undo;
@@ -1812,6 +1827,7 @@ int btrfs_init_new_device(struct btrfs_root *root, char *device_path)
 	device->dev_root = root->fs_info->dev_root;
 	device->bdev = bdev;
 	device->in_fs_metadata = 1;
+	device->is_tgtdev_for_dev_replace = 0;
 	device->mode = FMODE_EXCL;
 	set_blocksize(device->bdev, 4096);
 
@@ -1975,7 +1991,8 @@ static int __btrfs_grow_device(struct btrfs_trans_handle *trans,
 
 	if (!device->writeable)
 		return -EACCES;
-	if (new_size <= device->total_bytes)
+	if (new_size <= device->total_bytes ||
+	    device->is_tgtdev_for_dev_replace)
 		return -EINVAL;
 
 	btrfs_set_super_total_bytes(super_copy, old_total + diff);
@@ -2604,7 +2621,8 @@ static int __btrfs_balance(struct btrfs_fs_info *fs_info)
 		size_to_free = div_factor(old_size, 1);
 		size_to_free = min(size_to_free, (u64)1 * 1024 * 1024);
 		if (!device->writeable ||
-		    device->total_bytes - device->bytes_used > size_to_free)
+		    device->total_bytes - device->bytes_used > size_to_free ||
+		    device->is_tgtdev_for_dev_replace)
 			continue;
 
 		ret = btrfs_shrink_device(device, old_size - size_to_free);
@@ -3136,6 +3154,9 @@ int btrfs_shrink_device(struct btrfs_device *device, u64 new_size)
 	u64 old_size = device->total_bytes;
 	u64 diff = device->total_bytes - new_size;
 
+	if (device->is_tgtdev_for_dev_replace)
+		return -EINVAL;
+
 	path = btrfs_alloc_path();
 	if (!path)
 		return -ENOMEM;
@@ -3406,7 +3427,8 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
 			continue;
 		}
 
-		if (!device->in_fs_metadata)
+		if (!device->in_fs_metadata ||
+		    device->is_tgtdev_for_dev_replace)
 			continue;
 
 		if (device->total_bytes > device->bytes_used)
@@ -4617,6 +4639,7 @@ static void fill_device_from_item(struct extent_buffer *leaf,
 	device->io_align = btrfs_device_io_align(leaf, dev_item);
 	device->io_width = btrfs_device_io_width(leaf, dev_item);
 	device->sector_size = btrfs_device_sector_size(leaf, dev_item);
+	device->is_tgtdev_for_dev_replace = 0;
 
 	ptr = (unsigned long)btrfs_device_uuid(dev_item);
 	read_extent_buffer(leaf, device->uuid, ptr, BTRFS_UUID_SIZE);
@@ -4727,7 +4750,7 @@ static int read_one_dev(struct btrfs_root *root,
 	fill_device_from_item(leaf, dev_item, device);
 	device->dev_root = root->fs_info->dev_root;
 	device->in_fs_metadata = 1;
-	if (device->writeable) {
+	if (device->writeable && !device->is_tgtdev_for_dev_replace) {
 		device->fs_devices->total_rw_bytes += device->total_bytes;
 		spin_lock(&root->fs_info->free_chunk_lock);
 		root->fs_info->free_chunk_space += device->total_bytes -
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 802e2ba..8fd5a4d 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -50,6 +50,7 @@ struct btrfs_device {
 	int in_fs_metadata;
 	int missing;
 	int can_discard;
+	int is_tgtdev_for_dev_replace;
 
 	spinlock_t io_lock;
 
-- 
1.8.0


Stefan Behrens

2012-Nov-06 16:38 UTC

[PATCH 18/26] Btrfs: handle errors from btrfs_map_bio() everywhere

With the addition of the device replace procedure, it is possible
for btrfs_map_bio(READ) to report an error. This happens when the
requested mirror is the one located on the target disk and the copy
operation has not yet copied the block in question. The block
cannot be read in that case, and this error state is indicated by
returning EIO.

Some background: a new mirror is added while the device replace
procedure is running. btrfs_get_num_copies() returns one more, and
btrfs_map_bio(GET_READ_MIRROR) adds one more mirror if a disk
location is involved that was already handled by the device
replace copy operation. The assigned mirror num is the highest
mirror number, e.g. the value 3 in case of RAID1.

There are two cases. If btrfs_map_bio() is invoked with
mirror_num == 0 (i.e., select any mirror), the copy on the target
drive is never selected, because that disk shall be able to
perform the write requests as quickly as possible. Parallel read
requests would only slow down the disk copy procedure. If
btrfs_map_bio() is called with mirror_num > 0, which is done from
the repair code only, the highest mirror num is assigned to the
target disk, since it is used last. When this mirror is not
available because the copy procedure has not yet handled this
area, an error is returned. This patch adds the handling of such
errors everywhere in the code.
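
The mirror selection rules described above can be sketched as follows.
This is only a model under stated assumptions, not the btrfs
implementation: struct demo_replace_state, demo_pick_read_mirror() and the
copied_up_to progress marker are hypothetical names, "any mirror" is
simplified to "the first regular mirror", and the -EINVAL return for an
out-of-range mirror number is an assumption of this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical model: a chunk with num_regular_mirrors real copies plus
 * one extra mirror on the replace target. Mirror numbers are 1-based and
 * the target always gets the highest number. copied_up_to models how far
 * the copy operation has progressed on the target.
 */
struct demo_replace_state {
	int num_regular_mirrors;
	uint64_t copied_up_to;
};

/* Returns the mirror number to read from, or a negative errno value. */
static int demo_pick_read_mirror(const struct demo_replace_state *st,
				 int mirror_num, uint64_t offset)
{
	int target_mirror;

	assert(st->num_regular_mirrors > 0);
	target_mirror = st->num_regular_mirrors + 1;

	/* "Select any mirror": the target disk is never chosen. */
	if (mirror_num == 0)
		return 1;

	/* Assumption of this sketch: reject unknown mirror numbers. */
	if (mirror_num < 0 || mirror_num > target_mirror)
		return -EINVAL;

	/* Explicit read of the target: only valid once the block is copied. */
	if (mirror_num == target_mirror && offset >= st->copied_up_to)
		return -EIO;

	return mirror_num;
}
```

With two regular mirrors (RAID1), mirror 3 is the target: a read of
mirror 3 succeeds below copied_up_to and fails with -EIO above it, which
is the error every caller now has to handle.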

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
---
 fs/btrfs/check-integrity.c | 15 +++++++++++++--
 fs/btrfs/compression.c     |  6 ++++--
 fs/btrfs/disk-io.c         | 44 +++++++++++++++++++++++++++-----------------
 fs/btrfs/extent_io.c       |  4 ----
 fs/btrfs/inode.c           | 27 ++++++++++++++++++++-------
 fs/btrfs/volumes.c         |  2 +-
 6 files changed, 65 insertions(+), 33 deletions(-)

diff --git a/fs/btrfs/check-integrity.c b/fs/btrfs/check-integrity.c
index 8f9abed..badc6f1 100644
--- a/fs/btrfs/check-integrity.c
+++ b/fs/btrfs/check-integrity.c
@@ -1585,6 +1585,18 @@ static int btrfsic_map_block(struct btrfsic_state *state, u64 bytenr, u32 len,
 	ret = btrfs_map_block(state->root->fs_info, READ,
 			      bytenr, &length, &multi, mirror_num);
 
+	if (ret) {
+		block_ctx_out->start = 0;
+		block_ctx_out->dev_bytenr = 0;
+		block_ctx_out->len = 0;
+		block_ctx_out->dev = NULL;
+		block_ctx_out->datav = NULL;
+		block_ctx_out->pagev = NULL;
+		block_ctx_out->mem_to_free = NULL;
+
+		return ret;
+	}
+
 	device = multi->stripes[0].dev;
 	block_ctx_out->dev = btrfsic_dev_state_lookup(device->bdev);
 	block_ctx_out->dev_bytenr = multi->stripes[0].physical;
@@ -1594,8 +1606,7 @@ static int btrfsic_map_block(struct btrfsic_state *state, u64 bytenr, u32 len,
 	block_ctx_out->pagev = NULL;
 	block_ctx_out->mem_to_free = NULL;
 
-	if (0 == ret)
-		kfree(multi);
+	kfree(multi);
 	if (NULL == block_ctx_out->dev) {
 		ret = -ENXIO;
 		printk(KERN_INFO "btrfsic: error, cannot lookup dev (#1)!\n");
diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index c6467aa..94ab2f8 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -687,7 +687,8 @@ int btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
 
 			ret = btrfs_map_bio(root, READ, comp_bio,
 					    mirror_num, 0);
-			BUG_ON(ret); /* -ENOMEM */
+			if (ret)
+				bio_endio(comp_bio, ret);
 
 			bio_put(comp_bio);
 
@@ -712,7 +713,8 @@ int btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
 	}
 
 	ret = btrfs_map_bio(root, READ, comp_bio, mirror_num, 0);
-	BUG_ON(ret); /* -ENOMEM */
+	if (ret)
+		bio_endio(comp_bio, ret);
 
 	bio_put(comp_bio);
 	return 0;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index bdf6345..dbe0434 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -852,11 +852,16 @@ static int __btree_submit_bio_done(struct inode *inode, int rw, struct bio *bio,
 				 int mirror_num, unsigned long bio_flags,
 				 u64 bio_offset)
 {
+	int ret;
+
 	/*
 	 * when we're called for a write, we're already in the async
 	 * submission context.  Just jump into btrfs_map_bio
 	 */
-	return btrfs_map_bio(BTRFS_I(inode)->root, rw, bio, mirror_num, 1);
+	ret = btrfs_map_bio(BTRFS_I(inode)->root, rw, bio, mirror_num, 1);
+	if (ret)
+		bio_endio(bio, ret);
+	return ret;
 }
 
 static int check_async_write(struct inode *inode, unsigned long bio_flags)
@@ -878,7 +883,6 @@ static int btree_submit_bio_hook(struct inode *inode, int rw, struct bio *bio,
 	int ret;
 
 	if (!(rw & REQ_WRITE)) {
-
 		/*
 		 * called for a read, do the setup so that checksum validation
 		 * can happen in the async kernel threads
@@ -886,26 +890,32 @@ static int btree_submit_bio_hook(struct inode *inode, int rw, struct bio *bio,
 		ret = btrfs_bio_wq_end_io(BTRFS_I(inode)->root->fs_info,
 					  bio, 1);
 		if (ret)
-			return ret;
-		return btrfs_map_bio(BTRFS_I(inode)->root, rw, bio,
-				     mirror_num, 0);
+			goto out_w_error;
+		ret = btrfs_map_bio(BTRFS_I(inode)->root, rw, bio,
+				    mirror_num, 0);
 	} else if (!async) {
 		ret = btree_csum_one_bio(bio);
 		if (ret)
-			return ret;
-		return btrfs_map_bio(BTRFS_I(inode)->root, rw, bio,
-				     mirror_num, 0);
+			goto out_w_error;
+		ret = btrfs_map_bio(BTRFS_I(inode)->root, rw, bio,
+				    mirror_num, 0);
+	} else {
+		/*
+		 * kthread helpers are used to submit writes so that
+		 * checksumming can happen in parallel across all CPUs
+		 */
+		ret = btrfs_wq_submit_bio(BTRFS_I(inode)->root->fs_info,
+					  inode, rw, bio, mirror_num, 0,
+					  bio_offset,
+					  __btree_submit_bio_start,
+					  __btree_submit_bio_done);
 	}
 
-	/*
-	 * kthread helpers are used to submit writes so that checksumming
-	 * can happen in parallel across all CPUs
-	 */
-	return btrfs_wq_submit_bio(BTRFS_I(inode)->root->fs_info,
-				   inode, rw, bio, mirror_num, 0,
-				   bio_offset,
-				   __btree_submit_bio_start,
-				   __btree_submit_bio_done);
+	if (ret) {
+out_w_error:
+		bio_endio(bio, ret);
+	}
+	return ret;
 }
 
 #ifdef CONFIG_MIGRATION
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 82fab08..f4e8d3a 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2464,10 +2464,6 @@ btrfs_bio_alloc(struct block_device *bdev, u64 first_sector, int nr_vecs,
 	return bio;
 }
 
-/*
- * Since writes are async, they will only return -ENOMEM.
- * Reads can return the full range of I/O error conditions.
- */
 static int __must_check submit_one_bio(int rw, struct bio *bio,
 				       int mirror_num, unsigned long bio_flags)
 {
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 5172a62..90857c1 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1586,7 +1586,12 @@ static int __btrfs_submit_bio_done(struct inode *inode, int rw, struct bio *bio,
 			  u64 bio_offset)
 {
 	struct btrfs_root *root = BTRFS_I(inode)->root;
-	return btrfs_map_bio(root, rw, bio, mirror_num, 1);
+	int ret;
+
+	ret = btrfs_map_bio(root, rw, bio, mirror_num, 1);
+	if (ret)
+		bio_endio(bio, ret);
+	return ret;
 }
 
 /*
@@ -1610,15 +1615,17 @@ static int btrfs_submit_bio_hook(struct inode *inode, int rw, struct bio *bio,
 	if (!(rw & REQ_WRITE)) {
 		ret = btrfs_bio_wq_end_io(root->fs_info, bio, metadata);
 		if (ret)
-			return ret;
+			goto out;
 
 		if (bio_flags & EXTENT_BIO_COMPRESSED) {
-			return btrfs_submit_compressed_read(inode, bio,
-						    mirror_num, bio_flags);
+			ret = btrfs_submit_compressed_read(inode, bio,
+							   mirror_num,
+							   bio_flags);
+			goto out;
 		} else if (!skip_sum) {
 			ret = btrfs_lookup_bio_sums(root, inode, bio, NULL);
 			if (ret)
-				return ret;
+				goto out;
 		}
 		goto mapit;
 	} else if (!skip_sum) {
@@ -1626,15 +1633,21 @@ static int btrfs_submit_bio_hook(struct inode *inode, int rw, struct bio *bio,
 		if (root->root_key.objectid == BTRFS_DATA_RELOC_TREE_OBJECTID)
 			goto mapit;
 		/* we're doing a write, do the async checksumming */
-		return btrfs_wq_submit_bio(BTRFS_I(inode)->root->fs_info,
+		ret = btrfs_wq_submit_bio(BTRFS_I(inode)->root->fs_info,
 				   inode, rw, bio, mirror_num,
 				   bio_flags, bio_offset,
 				   __btrfs_submit_bio_start,
 				   __btrfs_submit_bio_done);
+		goto out;
 	}
 
 mapit:
-	return btrfs_map_bio(root, rw, bio, mirror_num, 0);
+	ret = btrfs_map_bio(root, rw, bio, mirror_num, 0);
+
+out:
+	if (ret < 0)
+		bio_endio(bio, ret);
+	return ret;
 }
 
 /*
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 87051e8..9233219 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -4440,7 +4440,7 @@ int btrfs_map_bio(struct btrfs_root *root, int rw, struct bio *bio,
 
 	ret = btrfs_map_block(root->fs_info, rw, logical, &map_length, &bbio,
 			      mirror_num);
-	if (ret) /* -ENOMEM */
+	if (ret)
 		return ret;
 
 	total_devs = bbio->num_stripes;
-- 
1.8.0


Stefan Behrens

2012-Nov-06 16:38 UTC

[PATCH 19/26] Btrfs: add code to scrub to copy read data to another disk

The device replace procedure makes use of the scrub code. The scrub
code is the most efficient code to read the allocated data of a disk,
i.e. it reads sequentially in order to avoid disk head movements, it
skips unallocated blocks, it uses read ahead mechanisms, and it
contains all the code to detect and repair defects.

This commit adds code to scrub to allow the scrub code to copy read
data to another disk.

One goal is to be as fast as possible. Therefore the write requests
are collected until huge bios are built, and the write process is
decoupled from the read process, with flow control in order to limit
the allocated memory. The best performance on spinning disks is
reached when head movements are avoided as much as possible.
Therefore a single worker is used to interface the read process
with the write process.

The regular scrub operation works as fast as before; it is not
negatively influenced and is more or less unchanged.
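
The idea of collecting write requests into large bios can be sketched
roughly as below. This is an illustration only, not the code added by this
patch: struct demo_wr_ctx and the demo_* helpers are hypothetical, the
"submit" step only counts requests instead of issuing I/O, and ending a
request at a gap in the physical offsets is an assumption of this sketch.
The limit of 32 pages matches SCRUB_PAGES_PER_WR_BIO from the diff.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PAGE_SIZE		4096u
#define DEMO_PAGES_PER_WR_BIO	32	/* same value as SCRUB_PAGES_PER_WR_BIO */

/* Hypothetical stand-in for the write side of a scrub context. */
struct demo_wr_ctx {
	uint64_t start;		/* physical offset of the first queued page */
	int page_count;		/* pages collected so far */
	int submitted_bios;	/* number of write requests issued */
	int submitted_pages;	/* total pages in issued requests */
};

/* Issue the collected pages as one write request (only counted here). */
static void demo_wr_submit(struct demo_wr_ctx *wr)
{
	if (wr->page_count == 0)
		return;
	wr->submitted_bios++;
	wr->submitted_pages += wr->page_count;
	wr->page_count = 0;
}

/* Queue one page for the target disk at the given physical offset. */
static void demo_wr_add_page(struct demo_wr_ctx *wr, uint64_t physical)
{
	assert(physical % DEMO_PAGE_SIZE == 0);

	/* A gap ends the current request: only contiguous pages merge. */
	if (wr->page_count > 0 &&
	    wr->start + (uint64_t)wr->page_count * DEMO_PAGE_SIZE != physical)
		demo_wr_submit(wr);

	if (wr->page_count == 0)
		wr->start = physical;
	wr->page_count++;

	/* A full request is written out immediately. */
	if (wr->page_count == DEMO_PAGES_PER_WR_BIO)
		demo_wr_submit(wr);
}
```

A final demo_wr_submit() call after the last page flushes whatever is
still queued, so that a partially filled request is not lost.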

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
---
 fs/btrfs/ctree.h |   2 +
 fs/btrfs/reada.c |  10 +-
 fs/btrfs/scrub.c | 881 ++++++++++++++++++++++++++++++++++++++++++++++++++-----
 fs/btrfs/super.c |   3 +-
 4 files changed, 823 insertions(+), 73 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 83904b5..e17f211 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1483,6 +1483,8 @@ struct btrfs_fs_info {
 	struct rw_semaphore scrub_super_lock;
 	int scrub_workers_refcnt;
 	struct btrfs_workers scrub_workers;
+	struct btrfs_workers scrub_wr_completion_workers;
+	struct btrfs_workers scrub_nocow_workers;
 
 #ifdef CONFIG_BTRFS_FS_CHECK_INTEGRITY
 	u32 check_integrity_print_mask;
diff --git a/fs/btrfs/reada.c b/fs/btrfs/reada.c
index 0ddc565..9f363e1 100644
--- a/fs/btrfs/reada.c
+++ b/fs/btrfs/reada.c
@@ -418,12 +418,17 @@ static struct reada_extent *reada_find_extent(struct btrfs_root *root,
 			 */
 			continue;
 		}
+		if (!dev->bdev) {
+			/* cannot read ahead on missing device */
+			continue;
+		}
 		prev_dev = dev;
 		ret = radix_tree_insert(&dev->reada_extents, index, re);
 		if (ret) {
 			while (--i >= 0) {
 				dev = bbio->stripes[i].dev;
 				BUG_ON(dev == NULL);
+				/* ignore whether the entry was inserted */
 				radix_tree_delete(&dev->reada_extents, index);
 			}
 			BUG_ON(fs_info == NULL);
@@ -914,7 +919,10 @@ struct reada_control *btrfs_reada_add(struct btrfs_root *root,
 	generation = btrfs_header_generation(node);
 	free_extent_buffer(node);
 
-	reada_add_block(rc, start, &max_key, level, generation);
+	if (reada_add_block(rc, start, &max_key, level, generation)) {
+		kfree(rc);
+		return ERR_PTR(-ENOMEM);
+	}
 
 	reada_start_machine(root->fs_info);
 
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 460e30b..59c69e0 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -25,6 +25,7 @@
 #include "transaction.h"
 #include "backref.h"
 #include "extent_io.h"
+#include "dev-replace.h"
 #include "check-integrity.h"
 #include "rcu-string.h"
 
@@ -44,8 +45,15 @@
 struct scrub_block;
 struct scrub_ctx;
 
-#define SCRUB_PAGES_PER_BIO	16	/* 64k per bio */
-#define SCRUB_BIOS_PER_CTX	16	/* 1 MB per device in flight */
+/*
+ * the following three values only influence the performance.
+ * The last one configures the number of parallel and outstanding I/O
+ * operations. The first two values configure an upper limit for the number
+ * of (dynamically allocated) pages that are added to a bio.
+ */
+#define SCRUB_PAGES_PER_RD_BIO	32	/* 128k per bio */
+#define SCRUB_PAGES_PER_WR_BIO	32	/* 128k per bio */
+#define SCRUB_BIOS_PER_SCTX	64	/* 8MB per device in flight */
 
 /*
  * the following value times PAGE_SIZE needs to be large enough to match the
@@ -62,6 +70,7 @@ struct scrub_page {
 	u64			generation;
 	u64			logical;
 	u64			physical;
+	u64			physical_for_dev_replace;
 	atomic_t		ref_count;
 	struct {
 		unsigned int	mirror_num:8;
@@ -79,7 +88,11 @@ struct scrub_bio {
 	int			err;
 	u64			logical;
 	u64			physical;
-	struct scrub_page	*pagev[SCRUB_PAGES_PER_BIO];
+#if SCRUB_PAGES_PER_WR_BIO >= SCRUB_PAGES_PER_RD_BIO
+	struct scrub_page	*pagev[SCRUB_PAGES_PER_WR_BIO];
+#else
+	struct scrub_page	*pagev[SCRUB_PAGES_PER_RD_BIO];
+#endif
 	int			page_count;
 	int			next_free;
 	struct btrfs_work	work;
@@ -99,8 +112,16 @@ struct scrub_block {
 	};
 };
 
+struct scrub_wr_ctx {
+	struct scrub_bio *wr_curr_bio;
+	struct btrfs_device *tgtdev;
+	int pages_per_wr_bio; /* <= SCRUB_PAGES_PER_WR_BIO */
+	atomic_t flush_all_writes;
+	struct mutex wr_lock;
+};
+
 struct scrub_ctx {
-	struct scrub_bio	*bios[SCRUB_BIOS_PER_CTX];
+	struct scrub_bio	*bios[SCRUB_BIOS_PER_SCTX];
 	struct btrfs_root	*dev_root;
 	int			first_free;
 	int			curr;
@@ -112,12 +133,13 @@ struct scrub_ctx {
 	struct list_head	csum_list;
 	atomic_t		cancel_req;
 	int			readonly;
-	int			pages_per_bio; /* <= SCRUB_PAGES_PER_BIO */
+	int			pages_per_rd_bio;
 	u32			sectorsize;
 	u32			nodesize;
 	u32			leafsize;
 
 	int			is_dev_replace;
+	struct scrub_wr_ctx	wr_ctx;
 
 	/*
 	 * statistics
@@ -135,6 +157,15 @@ struct scrub_fixup_nodatasum {
 	int			mirror_num;
 };
 
+struct scrub_copy_nocow_ctx {
+	struct scrub_ctx	*sctx;
+	u64			logical;
+	u64			len;
+	int			mirror_num;
+	u64			physical_for_dev_replace;
+	struct btrfs_work	work;
+};
+
 struct scrub_warning {
 	struct btrfs_path	*path;
 	u64			extent_item_size;
@@ -156,8 +187,9 @@ static void scrub_pending_trans_workers_dec(struct scrub_ctx *sctx);
 static int scrub_handle_errored_block(struct scrub_block *sblock_to_check);
 static int scrub_setup_recheck_block(struct scrub_ctx *sctx,
 				     struct btrfs_fs_info *fs_info,
+				     struct scrub_block *original_sblock,
 				     u64 length, u64 logical,
-				     struct scrub_block *sblock);
+				     struct scrub_block *sblocks_for_recheck);
 static void scrub_recheck_block(struct btrfs_fs_info *fs_info,
 				struct scrub_block *sblock, int is_metadata,
 				int have_csum, u8 *csum, u64 generation,
@@ -174,6 +206,9 @@ static int scrub_repair_block_from_good_copy(struct scrub_block *sblock_bad,
 static int scrub_repair_page_from_good_copy(struct scrub_block *sblock_bad,
 					    struct scrub_block *sblock_good,
 					    int page_num, int force_write);
+static void scrub_write_block_to_dev_replace(struct scrub_block *sblock);
+static int scrub_write_page_to_dev_replace(struct scrub_block *sblock,
+					   int page_num);
 static int scrub_checksum_data(struct scrub_block *sblock);
 static int scrub_checksum_tree_block(struct scrub_block *sblock);
 static int scrub_checksum_super(struct scrub_block *sblock);
@@ -181,14 +216,38 @@ static void scrub_block_get(struct scrub_block *sblock);
 static void scrub_block_put(struct scrub_block *sblock);
 static void scrub_page_get(struct scrub_page *spage);
 static void scrub_page_put(struct scrub_page *spage);
-static int scrub_add_page_to_bio(struct scrub_ctx *sctx,
-				 struct scrub_page *spage);
+static int scrub_add_page_to_rd_bio(struct scrub_ctx *sctx,
+				    struct scrub_page *spage);
 static int scrub_pages(struct scrub_ctx *sctx, u64 logical, u64 len,
 		       u64 physical, struct btrfs_device *dev, u64 flags,
-		       u64 gen, int mirror_num, u8 *csum, int force);
+		       u64 gen, int mirror_num, u8 *csum, int force,
+		       u64 physical_for_dev_replace);
 static void scrub_bio_end_io(struct bio *bio, int err);
 static void scrub_bio_end_io_worker(struct btrfs_work *work);
 static void scrub_block_complete(struct scrub_block *sblock);
+static void scrub_remap_extent(struct btrfs_fs_info *fs_info,
+			       u64 extent_logical, u64 extent_len,
+			       u64 *extent_physical,
+			       struct btrfs_device **extent_dev,
+			       int *extent_mirror_num);
+static int scrub_setup_wr_ctx(struct scrub_ctx *sctx,
+			      struct scrub_wr_ctx *wr_ctx,
+			      struct btrfs_fs_info *fs_info,
+			      struct btrfs_device *dev,
+			      int is_dev_replace);
+static void scrub_free_wr_ctx(struct scrub_wr_ctx *wr_ctx);
+static int scrub_add_page_to_wr_bio(struct scrub_ctx *sctx,
+				    struct scrub_page *spage);
+static void scrub_wr_submit(struct scrub_ctx *sctx);
+static void scrub_wr_bio_end_io(struct bio *bio, int err);
+static void scrub_wr_bio_end_io_worker(struct btrfs_work *work);
+static int write_page_nocow(struct scrub_ctx *sctx,
+			    u64 physical_for_dev_replace, struct page *page);
+static int copy_nocow_pages_for_inode(u64 inum, u64 offset, u64 root,
+				      void *ctx);
+static int copy_nocow_pages(struct scrub_ctx *sctx, u64 logical, u64 len,
+			    int mirror_num, u64 physical_for_dev_replace);
+static void copy_nocow_pages_worker(struct btrfs_work *work);
 
 
 static void scrub_pending_bio_inc(struct scrub_ctx *sctx)
@@ -262,19 +321,20 @@ static noinline_for_stack void scrub_free_ctx(struct scrub_ctx *sctx)
 	if (!sctx)
 		return;
 
+	scrub_free_wr_ctx(&sctx->wr_ctx);
+
 	/* this can happen when scrub is cancelled */
 	if (sctx->curr != -1) {
 		struct scrub_bio *sbio = sctx->bios[sctx->curr];
 
 		for (i = 0; i < sbio->page_count; i++) {
-			BUG_ON(!sbio->pagev[i]);
-			BUG_ON(!sbio->pagev[i]->page);
+			WARN_ON(!sbio->pagev[i]->page);
 			scrub_block_put(sbio->pagev[i]->sblock);
 		}
 		bio_put(sbio->bio);
 	}
 
-	for (i = 0; i < SCRUB_BIOS_PER_CTX; ++i) {
+	for (i = 0; i < SCRUB_BIOS_PER_SCTX; ++i) {
 		struct scrub_bio *sbio = sctx->bios[i];
 
 		if (!sbio)
@@ -292,18 +352,29 @@ struct scrub_ctx *scrub_setup_ctx(struct btrfs_device *dev, int is_dev_replace)
 	struct scrub_ctx *sctx;
 	int		i;
 	struct btrfs_fs_info *fs_info = dev->dev_root->fs_info;
-	int pages_per_bio;
+	int pages_per_rd_bio;
+	int ret;
 
-	pages_per_bio = min_t(int, SCRUB_PAGES_PER_BIO,
-			      bio_get_nr_vecs(dev->bdev));
+	/*
+	 * the setting of pages_per_rd_bio is correct for scrub but might
+	 * be wrong for the dev_replace code where we might read from
+	 * different devices in the initial huge bios. However, that
+	 * code is able to correctly handle the case when adding a page
+	 * to a bio fails.
+	 */
+	if (dev->bdev)
+		pages_per_rd_bio = min_t(int, SCRUB_PAGES_PER_RD_BIO,
+					 bio_get_nr_vecs(dev->bdev));
+	else
+		pages_per_rd_bio = SCRUB_PAGES_PER_RD_BIO;
 	sctx = kzalloc(sizeof(*sctx), GFP_NOFS);
 	if (!sctx)
 		goto nomem;
 	sctx->is_dev_replace = is_dev_replace;
-	sctx->pages_per_bio = pages_per_bio;
+	sctx->pages_per_rd_bio = pages_per_rd_bio;
 	sctx->curr = -1;
 	sctx->dev_root = dev->dev_root;
-	for (i = 0; i < SCRUB_BIOS_PER_CTX; ++i) {
+	for (i = 0; i < SCRUB_BIOS_PER_SCTX; ++i) {
 		struct scrub_bio *sbio;
 
 		sbio = kzalloc(sizeof(*sbio), GFP_NOFS);
@@ -316,7 +387,7 @@ struct scrub_ctx *scrub_setup_ctx(struct btrfs_device *dev, int is_dev_replace)
 		sbio->page_count = 0;
 		sbio->work.func = scrub_bio_end_io_worker;
 
-		if (i != SCRUB_BIOS_PER_CTX - 1)
+		if (i != SCRUB_BIOS_PER_SCTX - 1)
 			sctx->bios[i]->next_free = i + 1;
 		else
 			sctx->bios[i]->next_free = -1;
@@ -334,6 +405,13 @@ struct scrub_ctx *scrub_setup_ctx(struct btrfs_device *dev, int is_dev_replace)
 	spin_lock_init(&sctx->list_lock);
 	spin_lock_init(&sctx->stat_lock);
 	init_waitqueue_head(&sctx->list_wait);
+
+	ret = scrub_setup_wr_ctx(sctx, &sctx->wr_ctx, fs_info,
+				 fs_info->dev_replace.tgtdev, is_dev_replace);
+	if (ret) {
+		scrub_free_ctx(sctx);
+		return ERR_PTR(ret);
+	}
 	return sctx;
 
 nomem:
@@ -341,7 +419,8 @@ nomem:
 	return ERR_PTR(-ENOMEM);
 }
 
-static int scrub_print_warning_inode(u64 inum, u64 offset, u64 root, void *ctx)
+static int scrub_print_warning_inode(u64 inum, u64 offset, u64 root,
+				     void *warn_ctx)
 {
 	u64 isize;
 	u32 nlink;
@@ -349,7 +428,7 @@ static int scrub_print_warning_inode(u64 inum, u64 offset, u64 root, void *ctx)
 	int i;
 	struct extent_buffer *eb;
 	struct btrfs_inode_item *inode_item;
-	struct scrub_warning *swarn = ctx;
+	struct scrub_warning *swarn = warn_ctx;
 	struct btrfs_fs_info *fs_info = swarn->dev->dev_root->fs_info;
 	struct inode_fs_paths *ipath = NULL;
 	struct btrfs_root *local_root;
@@ -492,11 +571,11 @@ out:
 	kfree(swarn.msg_buf);
 }
 
-static int scrub_fixup_readpage(u64 inum, u64 offset, u64 root, void *ctx)
+static int scrub_fixup_readpage(u64 inum, u64 offset, u64 root, void *fixup_ctx)
 {
 	struct page *page = NULL;
 	unsigned long index;
-	struct scrub_fixup_nodatasum *fixup = ctx;
+	struct scrub_fixup_nodatasum *fixup = fixup_ctx;
 	int ret;
 	int corrected = 0;
 	struct btrfs_key key;
@@ -660,7 +739,9 @@ out:
 		spin_lock(&sctx->stat_lock);
 		++sctx->stat.uncorrectable_errors;
 		spin_unlock(&sctx->stat_lock);
-
+		btrfs_dev_replace_stats_inc(
+			&sctx->dev_root->fs_info->dev_replace.
+			num_uncorrectable_read_errors);
 		printk_ratelimited_in_rcu(KERN_ERR
 			"btrfs: unable to fixup (nodatasum) error at logical %llu on dev %s\n",
 			(unsigned long long)fixup->logical,
@@ -715,6 +796,11 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
 	csum = sblock_to_check->pagev[0]->csum;
 	dev = sblock_to_check->pagev[0]->dev;
 
+	if (sctx->is_dev_replace && !is_metadata && !have_csum) {
+		sblocks_for_recheck = NULL;
+		goto nodatasum_case;
+	}
+
 	/*
 	 * read all mirrors one after the other. This includes to
 	 * re-read the extent or metadata block that failed (that was
@@ -758,7 +844,7 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
 	}
 
 	/* setup the context, map the logical blocks and alloc the pages */
-	ret = scrub_setup_recheck_block(sctx, fs_info, length,
+	ret = scrub_setup_recheck_block(sctx, fs_info, sblock_to_check, length,
 					logical, sblocks_for_recheck);
 	if (ret) {
 		spin_lock(&sctx->stat_lock);
@@ -789,6 +875,8 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
 		sctx->stat.unverified_errors++;
 		spin_unlock(&sctx->stat_lock);
 
+		if (sctx->is_dev_replace)
+			scrub_write_block_to_dev_replace(sblock_bad);
 		goto out;
 	}
 
@@ -822,12 +910,15 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
 				BTRFS_DEV_STAT_CORRUPTION_ERRS);
 	}
 
-	if (sctx->readonly)
+	if (sctx->readonly && !sctx->is_dev_replace)
 		goto did_not_correct_error;
 
 	if (!is_metadata && !have_csum) {
 		struct scrub_fixup_nodatasum *fixup_nodatasum;
 
+nodatasum_case:
+		WARN_ON(sctx->is_dev_replace);
+
 		/*
 		 * !is_metadata and !have_csum, this means that the data
 		 * might not be COW'ed, that it might be modified
@@ -883,18 +974,79 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
 		if (!sblock_other->header_error &&
 		    !sblock_other->checksum_error &&
 		    sblock_other->no_io_error_seen) {
-			int force_write = is_metadata || have_csum;
-
-			ret = scrub_repair_block_from_good_copy(sblock_bad,
-								sblock_other,
-								force_write);
+			if (sctx->is_dev_replace) {
+				scrub_write_block_to_dev_replace(sblock_other);
+			} else {
+				int force_write = is_metadata || have_csum;
+
+				ret = scrub_repair_block_from_good_copy(
+						sblock_bad, sblock_other,
+						force_write);
+			}
 			if (0 == ret)
 				goto corrected_error;
 		}
 	}
 
 	/*
-	 * in case of I/O errors in the area that is supposed to be
+	 * for dev_replace, pick good pages and write to the target device.
+	 */
+	if (sctx->is_dev_replace) {
+		success = 1;
+		for (page_num = 0; page_num < sblock_bad->page_count;
+		     page_num++) {
+			int sub_success;
+
+			sub_success = 0;
+			for (mirror_index = 0;
+			     mirror_index < BTRFS_MAX_MIRRORS &&
+			     sblocks_for_recheck[mirror_index].page_count > 0;
+			     mirror_index++) {
+				struct scrub_block *sblock_other =
+					sblocks_for_recheck + mirror_index;
+				struct scrub_page *page_other =
+					sblock_other->pagev[page_num];
+
+				if (!page_other->io_error) {
+					ret = scrub_write_page_to_dev_replace(
+							sblock_other, page_num);
+					if (ret == 0) {
+						/* succeeded for this page */
+						sub_success = 1;
+						break;
+					} else {
+						btrfs_dev_replace_stats_inc(
+							&sctx->dev_root->
+							fs_info->dev_replace.
+							num_write_errors);
+					}
+				}
+			}
+
+			if (!sub_success) {
+				/*
+				 * did not find a mirror to fetch the page
+				 * from. scrub_write_page_to_dev_replace()
+				 * handles this case (page->io_error), by
+				 * filling the block with zeros before
+				 * submitting the write request
+				 */
+				success = 0;
+				ret = scrub_write_page_to_dev_replace(
+						sblock_bad, page_num);
+				if (ret)
+					btrfs_dev_replace_stats_inc(
+						&sctx->dev_root->fs_info->
+						dev_replace.num_write_errors);
+			}
+		}
+
+		goto out;
+	}
+
+	/*
+	 * for regular scrub, repair those pages that are errored.
+	 * In case of I/O errors in the area that is supposed to be
 	 * repaired, continue by picking good copies of those pages.
 	 * Select the good pages from mirrors to rewrite bad pages from
 	 * the area to fix. Afterwards verify the checksum of the block
@@ -1017,6 +1169,7 @@ out:
 
 static int scrub_setup_recheck_block(struct scrub_ctx *sctx,
 				     struct btrfs_fs_info *fs_info,
+				     struct scrub_block *original_sblock,
 				     u64 length, u64 logical,
 				     struct scrub_block *sblocks_for_recheck)
 {
@@ -1047,7 +1200,7 @@ static int scrub_setup_recheck_block(struct scrub_ctx *sctx,
 			return -EIO;
 		}
 
-		BUG_ON(page_index >= SCRUB_PAGES_PER_BIO);
+		BUG_ON(page_index >= SCRUB_PAGES_PER_RD_BIO);
 		for (mirror_index = 0; mirror_index < (int)bbio->num_stripes;
 		     mirror_index++) {
 			struct scrub_block *sblock;
@@ -1071,6 +1224,10 @@ leave_nomem:
 			sblock->pagev[page_index] = page;
 			page->logical = logical;
 			page->physical = bbio->stripes[mirror_index].physical;
+			BUG_ON(page_index >= original_sblock->page_count);
+			page->physical_for_dev_replace =
+				original_sblock->pagev[page_index]->
+				physical_for_dev_replace;
 			/* for missing devices, dev->bdev is NULL */
 			page->dev = bbio->stripes[mirror_index].dev;
 			page->mirror_num = mirror_index + 1;
@@ -1249,6 +1406,12 @@ static int scrub_repair_page_from_good_copy(struct scrub_block *sblock_bad,
 		int ret;
 		DECLARE_COMPLETION_ONSTACK(complete);
 
+		if (!page_bad->dev->bdev) {
+			printk_ratelimited(KERN_WARNING
+				"btrfs: scrub_repair_page_from_good_copy(bdev == NULL) is unexpected!\n");
+			return -EIO;
+		}
+
 		bio = bio_alloc(GFP_NOFS, 1);
 		if (!bio)
 			return -EIO;
@@ -1269,6 +1432,9 @@ static int scrub_repair_page_from_good_copy(struct scrub_block *sblock_bad,
 		if (!bio_flagged(bio, BIO_UPTODATE)) {
 			btrfs_dev_stat_inc_and_print(page_bad->dev,
 				BTRFS_DEV_STAT_WRITE_ERRS);
+			btrfs_dev_replace_stats_inc(
+				&sblock_bad->sctx->dev_root->fs_info->
+				dev_replace.num_write_errors);
 			bio_put(bio);
 			return -EIO;
 		}
@@ -1278,7 +1444,166 @@ static int scrub_repair_page_from_good_copy(struct scrub_block *sblock_bad,
 	return 0;
 }
 
-static void scrub_checksum(struct scrub_block *sblock)
+static void scrub_write_block_to_dev_replace(struct scrub_block *sblock)
+{
+	int page_num;
+
+	for (page_num = 0; page_num < sblock->page_count; page_num++) {
+		int ret;
+
+		ret = scrub_write_page_to_dev_replace(sblock, page_num);
+		if (ret)
+			btrfs_dev_replace_stats_inc(
+				&sblock->sctx->dev_root->fs_info->dev_replace.
+				num_write_errors);
+	}
+}
+
+static int scrub_write_page_to_dev_replace(struct scrub_block *sblock,
+					   int page_num)
+{
+	struct scrub_page *spage = sblock->pagev[page_num];
+
+	BUG_ON(spage->page == NULL);
+	if (spage->io_error) {
+		void *mapped_buffer = kmap_atomic(spage->page);
+
+		memset(mapped_buffer, 0, PAGE_CACHE_SIZE);
+		flush_dcache_page(spage->page);
+		kunmap_atomic(mapped_buffer);
+	}
+	return scrub_add_page_to_wr_bio(sblock->sctx, spage);
+}
+
+static int scrub_add_page_to_wr_bio(struct scrub_ctx *sctx,
+				    struct scrub_page *spage)
+{
+	struct scrub_wr_ctx *wr_ctx = &sctx->wr_ctx;
+	struct scrub_bio *sbio;
+	int ret;
+
+	mutex_lock(&wr_ctx->wr_lock);
+again:
+	if (!wr_ctx->wr_curr_bio) {
+		wr_ctx->wr_curr_bio = kzalloc(sizeof(*wr_ctx->wr_curr_bio),
+					      GFP_NOFS);
+		if (!wr_ctx->wr_curr_bio)
+			return -ENOMEM;
+		wr_ctx->wr_curr_bio->sctx = sctx;
+		wr_ctx->wr_curr_bio->page_count = 0;
+	}
+	sbio = wr_ctx->wr_curr_bio;
+	if (sbio->page_count == 0) {
+		struct bio *bio;
+
+		sbio->physical = spage->physical_for_dev_replace;
+		sbio->logical = spage->logical;
+		sbio->dev = wr_ctx->tgtdev;
+		bio = sbio->bio;
+		if (!bio) {
+			bio = bio_alloc(GFP_NOFS, wr_ctx->pages_per_wr_bio);
+			if (!bio) {
+				mutex_unlock(&wr_ctx->wr_lock);
+				return -ENOMEM;
+			}
+			sbio->bio = bio;
+		}
+
+		bio->bi_private = sbio;
+		bio->bi_end_io = scrub_wr_bio_end_io;
+		bio->bi_bdev = sbio->dev->bdev;
+		bio->bi_sector = sbio->physical >> 9;
+		sbio->err = 0;
+	} else if (sbio->physical + sbio->page_count * PAGE_SIZE !=
+		   spage->physical_for_dev_replace ||
+		   sbio->logical + sbio->page_count * PAGE_SIZE !=
+		   spage->logical) {
+		scrub_wr_submit(sctx);
+		goto again;
+	}
+
+	ret = bio_add_page(sbio->bio, spage->page, PAGE_SIZE, 0);
+	if (ret != PAGE_SIZE) {
+		if (sbio->page_count < 1) {
+			bio_put(sbio->bio);
+			sbio->bio = NULL;
+			mutex_unlock(&wr_ctx->wr_lock);
+			return -EIO;
+		}
+		scrub_wr_submit(sctx);
+		goto again;
+	}
+
+	sbio->pagev[sbio->page_count] = spage;
+	scrub_page_get(spage);
+	sbio->page_count++;
+	if (sbio->page_count == wr_ctx->pages_per_wr_bio)
+		scrub_wr_submit(sctx);
+	mutex_unlock(&wr_ctx->wr_lock);
+
+	return 0;
+}
+
+static void scrub_wr_submit(struct scrub_ctx *sctx)
+{
+	struct scrub_wr_ctx *wr_ctx = &sctx->wr_ctx;
+	struct scrub_bio *sbio;
+
+	if (!wr_ctx->wr_curr_bio)
+		return;
+
+	sbio = wr_ctx->wr_curr_bio;
+	wr_ctx->wr_curr_bio = NULL;
+	WARN_ON(!sbio->bio->bi_bdev);
+	scrub_pending_bio_inc(sctx);
+	/* process all writes in a single worker thread. Then the block layer
+	 * orders the requests before sending them to the driver which
+	 * doubled the write performance on spinning disks when measured
+	 * with Linux 3.5 */
+	btrfsic_submit_bio(WRITE, sbio->bio);
+}
+
+static void scrub_wr_bio_end_io(struct bio *bio, int err)
+{
+	struct scrub_bio *sbio = bio->bi_private;
+	struct btrfs_fs_info *fs_info = sbio->dev->dev_root->fs_info;
+
+	sbio->err = err;
+	sbio->bio = bio;
+
+	sbio->work.func = scrub_wr_bio_end_io_worker;
+	btrfs_queue_worker(&fs_info->scrub_wr_completion_workers, &sbio->work);
+}
+
+static void scrub_wr_bio_end_io_worker(struct btrfs_work *work)
+{
+	struct scrub_bio *sbio = container_of(work, struct scrub_bio, work);
+	struct scrub_ctx *sctx = sbio->sctx;
+	int i;
+
+	WARN_ON(sbio->page_count > SCRUB_PAGES_PER_WR_BIO);
+	if (sbio->err) {
+		struct btrfs_dev_replace *dev_replace =
+			&sbio->sctx->dev_root->fs_info->dev_replace;
+
+		for (i = 0; i < sbio->page_count; i++) {
+			struct scrub_page *spage = sbio->pagev[i];
+
+			spage->io_error = 1;
+			btrfs_dev_replace_stats_inc(&dev_replace->
+						    num_write_errors);
+		}
+	}
+
+	for (i = 0; i < sbio->page_count; i++)
+		scrub_page_put(sbio->pagev[i]);
+
+	bio_put(sbio->bio);
+	kfree(sbio);
+	scrub_pending_bio_dec(sctx);
+}
+
+static int scrub_checksum(struct scrub_block *sblock)
 {
 	u64 flags;
 	int ret;
@@ -1296,6 +1621,8 @@ static void scrub_checksum(struct scrub_block *sblock)
 		WARN_ON(1);
 	if (ret)
 		scrub_handle_errored_block(sblock);
+
+	return ret;
 }
 
 static int scrub_checksum_data(struct scrub_block *sblock)
@@ -1386,7 +1713,7 @@ static int scrub_checksum_tree_block(struct scrub_block *sblock)
 		   BTRFS_UUID_SIZE))
 		++fail;
 
-	BUG_ON(sctx->nodesize != sctx->leafsize);
+	WARN_ON(sctx->nodesize != sctx->leafsize);
 	len = sctx->nodesize - BTRFS_CSUM_SIZE;
 	mapped_size = PAGE_SIZE - BTRFS_CSUM_SIZE;
 	p = ((u8 *)mapped_buffer) + BTRFS_CSUM_SIZE;
@@ -1534,11 +1861,24 @@ static void scrub_submit(struct scrub_ctx *sctx)
 	sctx->curr = -1;
 	scrub_pending_bio_inc(sctx);
 
-	btrfsic_submit_bio(READ, sbio->bio);
+	if (!sbio->bio->bi_bdev) {
+		/*
+		 * this case should not happen. If btrfs_map_block() is
+		 * wrong, it could happen for dev-replace operations on
+		 * missing devices when no mirrors are available, but in
+		 * this case it should already fail the mount.
+		 * This case is handled correctly (but _very_ slowly).
+		 */
+		printk_ratelimited(KERN_WARNING
+			"btrfs: scrub_submit(bio bdev == NULL) is unexpected!\n");
+		bio_endio(sbio->bio, -EIO);
+	} else {
+		btrfsic_submit_bio(READ, sbio->bio);
+	}
 }
 
-static int scrub_add_page_to_bio(struct scrub_ctx *sctx,
-				 struct scrub_page *spage)
+static int scrub_add_page_to_rd_bio(struct scrub_ctx *sctx,
+				    struct scrub_page *spage)
 {
 	struct scrub_block *sblock = spage->sblock;
 	struct scrub_bio *sbio;
@@ -1570,7 +1910,7 @@ again:
 		sbio->dev = spage->dev;
 		bio = sbio->bio;
 		if (!bio) {
-			bio = bio_alloc(GFP_NOFS, sctx->pages_per_bio);
+			bio = bio_alloc(GFP_NOFS, sctx->pages_per_rd_bio);
 			if (!bio)
 				return -ENOMEM;
 			sbio->bio = bio;
@@ -1602,10 +1942,10 @@ again:
 		goto again;
 	}
 
-	scrub_block_get(sblock); /* one for the added page */
+	scrub_block_get(sblock); /* one for the page added to the bio */
 	atomic_inc(&sblock->outstanding_pages);
 	sbio->page_count++;
-	if (sbio->page_count == sctx->pages_per_bio)
+	if (sbio->page_count == sctx->pages_per_rd_bio)
 		scrub_submit(sctx);
 
 	return 0;
@@ -1613,7 +1953,8 @@ again:
 
 static int scrub_pages(struct scrub_ctx *sctx, u64 logical, u64 len,
 		       u64 physical, struct btrfs_device *dev, u64 flags,
-		       u64 gen, int mirror_num, u8 *csum, int force)
+		       u64 gen, int mirror_num, u8 *csum, int force,
+		       u64 physical_for_dev_replace)
 {
 	struct scrub_block *sblock;
 	int index;
@@ -1654,6 +1995,7 @@ leave_nomem:
 		spage->generation = gen;
 		spage->logical = logical;
 		spage->physical = physical;
+		spage->physical_for_dev_replace = physical_for_dev_replace;
 		spage->mirror_num = mirror_num;
 		if (csum) {
 			spage->have_csum = 1;
@@ -1668,6 +2010,7 @@ leave_nomem:
 		len -= l;
 		logical += l;
 		physical += l;
+		physical_for_dev_replace += l;
 	}
 
 	WARN_ON(sblock->page_count == 0);
@@ -1675,7 +2018,7 @@ leave_nomem:
 		struct scrub_page *spage = sblock->pagev[index];
 		int ret;
 
-		ret = scrub_add_page_to_bio(sctx, spage);
+		ret = scrub_add_page_to_rd_bio(sctx, spage);
 		if (ret) {
 			scrub_block_put(sblock);
 			return ret;
@@ -1707,7 +2050,7 @@ static void scrub_bio_end_io_worker(struct btrfs_work *work)
 	struct scrub_ctx *sctx = sbio->sctx;
 	int i;
 
-	BUG_ON(sbio->page_count > SCRUB_PAGES_PER_BIO);
+	BUG_ON(sbio->page_count > SCRUB_PAGES_PER_RD_BIO);
 	if (sbio->err) {
 		for (i = 0; i < sbio->page_count; i++) {
 			struct scrub_page *spage = sbio->pagev[i];
@@ -1733,15 +2076,30 @@ static void scrub_bio_end_io_worker(struct btrfs_work *work)
 	sbio->next_free = sctx->first_free;
 	sctx->first_free = sbio->index;
 	spin_unlock(&sctx->list_lock);
+
+	if (sctx->is_dev_replace &&
+	    atomic_read(&sctx->wr_ctx.flush_all_writes)) {
+		mutex_lock(&sctx->wr_ctx.wr_lock);
+		scrub_wr_submit(sctx);
+		mutex_unlock(&sctx->wr_ctx.wr_lock);
+	}
+
 	scrub_pending_bio_dec(sctx);
 }
 
 static void scrub_block_complete(struct scrub_block *sblock)
 {
-	if (!sblock->no_io_error_seen)
+	if (!sblock->no_io_error_seen) {
 		scrub_handle_errored_block(sblock);
-	else
-		scrub_checksum(sblock);
+	} else {
+		/*
+		 * if has checksum error, write via repair mechanism in
+		 * dev replace case, otherwise write here in dev replace
+		 * case.
+		 */
+		if (!scrub_checksum(sblock) && sblock->sctx->is_dev_replace)
+			scrub_write_block_to_dev_replace(sblock);
+	}
 }
 
 static int scrub_find_csum(struct scrub_ctx *sctx, u64 logical, u64 len,
@@ -1786,7 +2144,7 @@ static int scrub_find_csum(struct scrub_ctx *sctx, u64 logical, u64 len,
 /* scrub extent tries to collect up to 64 kB for each bio */
 static int scrub_extent(struct scrub_ctx *sctx, u64 logical, u64 len,
 			u64 physical, struct btrfs_device *dev, u64 flags,
-			u64 gen, int mirror_num)
+			u64 gen, int mirror_num, u64 physical_for_dev_replace)
 {
 	int ret;
 	u8 csum[BTRFS_CSUM_SIZE];
@@ -1799,7 +2157,7 @@ static int scrub_extent(struct scrub_ctx *sctx, u64 logical, u64 len,
 		sctx->stat.data_bytes_scrubbed += len;
 		spin_unlock(&sctx->stat_lock);
 	} else if (flags & BTRFS_EXTENT_FLAG_TREE_BLOCK) {
-		BUG_ON(sctx->nodesize != sctx->leafsize);
+		WARN_ON(sctx->nodesize != sctx->leafsize);
 		blocksize = sctx->nodesize;
 		spin_lock(&sctx->stat_lock);
 		sctx->stat.tree_extents_scrubbed++;
@@ -1807,7 +2165,7 @@ static int scrub_extent(struct scrub_ctx *sctx, u64 logical, u64 len,
 		spin_unlock(&sctx->stat_lock);
 	} else {
 		blocksize = sctx->sectorsize;
-		BUG_ON(1);
+		WARN_ON(1);
 	}
 
 	while (len) {
@@ -1819,14 +2177,23 @@ static int scrub_extent(struct scrub_ctx *sctx, u64 logical, u64 len,
 			have_csum = scrub_find_csum(sctx, logical, l, csum);
 			if (have_csum == 0)
 				++sctx->stat.no_csum;
+			if (sctx->is_dev_replace && !have_csum) {
+				ret = copy_nocow_pages(sctx, logical, l,
+						       mirror_num,
+						      physical_for_dev_replace);
+				goto behind_scrub_pages;
+			}
 		}
 		ret = scrub_pages(sctx, logical, l, physical, dev, flags, gen,
-				  mirror_num, have_csum ? csum : NULL, 0);
+				  mirror_num, have_csum ? csum : NULL, 0,
+				  physical_for_dev_replace);
+behind_scrub_pages:
 		if (ret)
 			return ret;
 		len -= l;
 		logical += l;
 		physical += l;
+		physical_for_dev_replace += l;
 	}
 	return 0;
 }
@@ -1834,7 +2201,8 @@ static int scrub_extent(struct scrub_ctx *sctx, u64 logical, u64 len,
 static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx,
 					   struct map_lookup *map,
 					   struct btrfs_device *scrub_dev,
-					   int num, u64 base, u64 length)
+					   int num, u64 base, u64 length,
+					   int is_dev_replace)
 {
 	struct btrfs_path *path;
 	struct btrfs_fs_info *fs_info = sctx->dev_root->fs_info;
@@ -1859,6 +2227,11 @@ static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx,
 	struct btrfs_key key_end;
 	u64 increment = map->stripe_len;
 	u64 offset;
+	u64 extent_logical;
+	u64 extent_physical;
+	u64 extent_len;
+	struct btrfs_device *extent_dev;
+	int extent_mirror_num;
 
 	nstripes = length;
 	offset = 0;
@@ -1966,9 +2339,14 @@ static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx,
 		 */
 		if (atomic_read(&fs_info->scrub_pause_req)) {
 			/* push queued extents */
+			atomic_set(&sctx->wr_ctx.flush_all_writes, 1);
 			scrub_submit(sctx);
+			mutex_lock(&sctx->wr_ctx.wr_lock);
+			scrub_wr_submit(sctx);
+			mutex_unlock(&sctx->wr_ctx.wr_lock);
 			wait_event(sctx->list_wait,
 				   atomic_read(&sctx->bios_in_flight) == 0);
+			atomic_set(&sctx->wr_ctx.flush_all_writes, 0);
 			atomic_inc(&fs_info->scrubs_paused);
 			wake_up(&fs_info->scrub_pause_wait);
 			mutex_lock(&fs_info->scrub_lock);
@@ -2063,10 +2441,20 @@ static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx,
 					     key.objectid;
 			}
 
-			ret = scrub_extent(sctx, key.objectid, key.offset,
-					   key.objectid - logical + physical,
-					   scrub_dev, flags, generation,
-					   mirror_num);
+			extent_logical = key.objectid;
+			extent_physical = key.objectid - logical + physical;
+			extent_len = key.offset;
+			extent_dev = scrub_dev;
+			extent_mirror_num = mirror_num;
+			if (is_dev_replace)
+				scrub_remap_extent(fs_info, extent_logical,
+						   extent_len, &extent_physical,
+						   &extent_dev,
+						   &extent_mirror_num);
+			ret = scrub_extent(sctx, extent_logical, extent_len,
+					   extent_physical, extent_dev, flags,
+					   generation, extent_mirror_num,
+					   key.objectid - logical + physical);
 			if (ret)
 				goto out;
 
@@ -2080,10 +2468,13 @@ next:
 		sctx->stat.last_physical = physical;
 		spin_unlock(&sctx->stat_lock);
 	}
+out:
 	/* push queued extents */
 	scrub_submit(sctx);
+	mutex_lock(&sctx->wr_ctx.wr_lock);
+	scrub_wr_submit(sctx);
+	mutex_unlock(&sctx->wr_ctx.wr_lock);
 
-out:
 	blk_finish_plug(&plug);
 	btrfs_free_path(path);
 	return ret < 0 ? ret : 0;
@@ -2093,14 +2484,14 @@ static noinline_for_stack int scrub_chunk(struct scrub_ctx *sctx,
 					  struct btrfs_device *scrub_dev,
 					  u64 chunk_tree, u64 chunk_objectid,
 					  u64 chunk_offset, u64 length,
-					  u64 dev_offset)
+					  u64 dev_offset, int is_dev_replace)
 {
 	struct btrfs_mapping_tree *map_tree =
 		&sctx->dev_root->fs_info->mapping_tree;
 	struct map_lookup *map;
 	struct extent_map *em;
 	int i;
-	int ret = -EINVAL;
+	int ret = 0;
 
 	read_lock(&map_tree->map_tree.lock);
 	em = lookup_extent_mapping(&map_tree->map_tree, chunk_offset, 1);
@@ -2120,7 +2511,8 @@ static noinline_for_stack int scrub_chunk(struct scrub_ctx *sctx,
 		if (map->stripes[i].dev->bdev == scrub_dev->bdev &&
 		    map->stripes[i].physical == dev_offset) {
 			ret = scrub_stripe(sctx, map, scrub_dev, i,
-					   chunk_offset, length);
+					   chunk_offset, length,
+					   is_dev_replace);
 			if (ret)
 				goto out;
 		}
@@ -2133,7 +2525,8 @@ out:
 
 static noinline_for_stack
 int scrub_enumerate_chunks(struct scrub_ctx *sctx,
-			   struct btrfs_device *scrub_dev, u64 start, u64 end)
+			   struct btrfs_device *scrub_dev, u64 start, u64 end,
+			   int is_dev_replace)
 {
 	struct btrfs_dev_extent *dev_extent = NULL;
 	struct btrfs_path *path;
@@ -2149,6 +2542,7 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
 	struct btrfs_key key;
 	struct btrfs_key found_key;
 	struct btrfs_block_group_cache *cache;
+	struct btrfs_dev_replace *dev_replace = &fs_info->dev_replace;
 
 	path = btrfs_alloc_path();
 	if (!path)
@@ -2214,11 +2608,61 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
 			ret = -ENOENT;
 			break;
 		}
+		dev_replace->cursor_right = found_key.offset + length;
+		dev_replace->cursor_left = found_key.offset;
+		dev_replace->item_needs_writeback = 1;
 		ret = scrub_chunk(sctx, scrub_dev, chunk_tree, chunk_objectid,
-				  chunk_offset, length, found_key.offset);
+				  chunk_offset, length, found_key.offset,
+				  is_dev_replace);
+
+		/*
+		 * flush, submit all pending read and write bios, afterwards
+		 * wait for them.
+		 * Note that in the dev replace case, a read request causes
+		 * write requests that are submitted in the read completion
+		 * worker. Therefore in the current situation, it is required
+		 * that all write requests are flushed, so that all read and
+		 * write requests are really completed when bios_in_flight
+		 * changes to 0.
+		 */
+		atomic_set(&sctx->wr_ctx.flush_all_writes, 1);
+		scrub_submit(sctx);
+		mutex_lock(&sctx->wr_ctx.wr_lock);
+		scrub_wr_submit(sctx);
+		mutex_unlock(&sctx->wr_ctx.wr_lock);
+
+		wait_event(sctx->list_wait,
+			   atomic_read(&sctx->bios_in_flight) == 0);
+		atomic_set(&sctx->wr_ctx.flush_all_writes, 0);
+		atomic_inc(&fs_info->scrubs_paused);
+		wake_up(&fs_info->scrub_pause_wait);
+		wait_event(sctx->list_wait,
+			   atomic_read(&sctx->workers_pending) == 0);
+
+		mutex_lock(&fs_info->scrub_lock);
+		while (atomic_read(&fs_info->scrub_pause_req)) {
+			mutex_unlock(&fs_info->scrub_lock);
+			wait_event(fs_info->scrub_pause_wait,
+			   atomic_read(&fs_info->scrub_pause_req) == 0);
+			mutex_lock(&fs_info->scrub_lock);
+		}
+		atomic_dec(&fs_info->scrubs_paused);
+		mutex_unlock(&fs_info->scrub_lock);
+		wake_up(&fs_info->scrub_pause_wait);
+
+		dev_replace->cursor_left = dev_replace->cursor_right;
+		dev_replace->item_needs_writeback = 1;
 		btrfs_put_block_group(cache);
 		if (ret)
 			break;
+		if (atomic64_read(&dev_replace->num_write_errors) > 0) {
+			ret = -EIO;
+			break;
+		}
+		if (sctx->stat.malloc_errors > 0) {
+			ret = -ENOMEM;
+			break;
+		}
 
 		key.offset = found_key.offset + length;
 		btrfs_release_path(path);
@@ -2254,7 +2698,7 @@ static noinline_for_stack int scrub_supers(struct scrub_ctx *sctx,
 
 		ret = scrub_pages(sctx, bytenr, BTRFS_SUPER_INFO_SIZE, bytenr,
 				  scrub_dev, BTRFS_EXTENT_FLAG_SUPER, gen, i,
-				  NULL, 1);
+				  NULL, 1, bytenr);
 		if (ret)
 			return ret;
 	}
@@ -2266,18 +2710,38 @@ static noinline_for_stack int scrub_supers(struct scrub_ctx *sctx,
 /*
  * get a reference count on fs_info->scrub_workers. start worker if necessary
  */
-static noinline_for_stack int scrub_workers_get(struct btrfs_fs_info *fs_info)
+static noinline_for_stack int scrub_workers_get(struct btrfs_fs_info *fs_info,
+						int is_dev_replace)
 {
 	int ret = 0;
 
 	mutex_lock(&fs_info->scrub_lock);
 	if (fs_info->scrub_workers_refcnt == 0) {
-		btrfs_init_workers(&fs_info->scrub_workers, "scrub",
-			   fs_info->thread_pool_size, &fs_info->generic_worker);
+		if (is_dev_replace)
+			btrfs_init_workers(&fs_info->scrub_workers, "scrub", 1,
+					&fs_info->generic_worker);
+		else
+			btrfs_init_workers(&fs_info->scrub_workers, "scrub",
+					fs_info->thread_pool_size,
+					&fs_info->generic_worker);
 		fs_info->scrub_workers.idle_thresh = 4;
 		ret = btrfs_start_workers(&fs_info->scrub_workers);
 		if (ret)
 			goto out;
+		btrfs_init_workers(&fs_info->scrub_wr_completion_workers,
+				   "scrubwrc",
+				   fs_info->thread_pool_size,
+				   &fs_info->generic_worker);
+		fs_info->scrub_wr_completion_workers.idle_thresh = 2;
+		ret = btrfs_start_workers(
+				&fs_info->scrub_wr_completion_workers);
+		if (ret)
+			goto out;
+		btrfs_init_workers(&fs_info->scrub_nocow_workers, "scrubnc", 1,
+				   &fs_info->generic_worker);
+		ret = btrfs_start_workers(&fs_info->scrub_nocow_workers);
+		if (ret)
+			goto out;
 	}
 	++fs_info->scrub_workers_refcnt;
 out:
@@ -2289,8 +2753,11 @@ out:
 static noinline_for_stack void scrub_workers_put(struct btrfs_fs_info *fs_info)
 {
 	mutex_lock(&fs_info->scrub_lock);
-	if (--fs_info->scrub_workers_refcnt == 0)
+	if (--fs_info->scrub_workers_refcnt == 0) {
 		btrfs_stop_workers(&fs_info->scrub_workers);
+		btrfs_stop_workers(&fs_info->scrub_wr_completion_workers);
+		btrfs_stop_workers(&fs_info->scrub_nocow_workers);
+	}
 	WARN_ON(fs_info->scrub_workers_refcnt < 0);
 	mutex_unlock(&fs_info->scrub_lock);
 }
@@ -2354,7 +2821,7 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start,
 		return -EINVAL;
 	}
 
-	ret = scrub_workers_get(fs_info);
+	ret = scrub_workers_get(fs_info, is_dev_replace);
 	if (ret)
 		return ret;
 
@@ -2394,12 +2861,15 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start,
 	mutex_unlock(&fs_info->scrub_lock);
 	mutex_unlock(&fs_info->fs_devices->device_list_mutex);
 
-	down_read(&fs_info->scrub_super_lock);
-	ret = scrub_supers(sctx, dev);
-	up_read(&fs_info->scrub_super_lock);
+	if (!is_dev_replace) {
+		down_read(&fs_info->scrub_super_lock);
+		ret = scrub_supers(sctx, dev);
+		up_read(&fs_info->scrub_super_lock);
+	}
 
 	if (!ret)
-		ret = scrub_enumerate_chunks(sctx, dev, start, end);
+		ret = scrub_enumerate_chunks(sctx, dev, start, end,
+					     is_dev_replace);
 
 	wait_event(sctx->list_wait, atomic_read(&sctx->bios_in_flight) == 0);
 	atomic_dec(&fs_info->scrubs_running);
@@ -2537,3 +3007,272 @@ int btrfs_scrub_progress(struct btrfs_root *root, u64 devid,
 
 	return dev ? (sctx ? 0 : -ENOTCONN) : -ENODEV;
 }
+
+static void scrub_remap_extent(struct btrfs_fs_info *fs_info,
+			       u64 extent_logical, u64 extent_len,
+			       u64 *extent_physical,
+			       struct btrfs_device **extent_dev,
+			       int *extent_mirror_num)
+{
+	u64 mapped_length;
+	struct btrfs_bio *bbio = NULL;
+	int ret;
+
+	mapped_length = extent_len;
+	ret = btrfs_map_block(fs_info, READ, extent_logical,
+			      &mapped_length, &bbio, 0);
+	if (ret || !bbio || mapped_length < extent_len ||
+	    !bbio->stripes[0].dev->bdev) {
+		kfree(bbio);
+		return;
+	}
+
+	*extent_physical = bbio->stripes[0].physical;
+	*extent_mirror_num = bbio->mirror_num;
+	*extent_dev = bbio->stripes[0].dev;
+	kfree(bbio);
+}
+
+static int scrub_setup_wr_ctx(struct scrub_ctx *sctx,
+			      struct scrub_wr_ctx *wr_ctx,
+			      struct btrfs_fs_info *fs_info,
+			      struct btrfs_device *dev,
+			      int is_dev_replace)
+{
+	WARN_ON(wr_ctx->wr_curr_bio != NULL);
+
+	mutex_init(&wr_ctx->wr_lock);
+	wr_ctx->wr_curr_bio = NULL;
+	if (!is_dev_replace)
+		return 0;
+
+	WARN_ON(!dev->bdev);
+	wr_ctx->pages_per_wr_bio = min_t(int, SCRUB_PAGES_PER_WR_BIO,
+					 bio_get_nr_vecs(dev->bdev));
+	wr_ctx->tgtdev = dev;
+	atomic_set(&wr_ctx->flush_all_writes, 0);
+	return 0;
+}
+
+static void scrub_free_wr_ctx(struct scrub_wr_ctx *wr_ctx)
+{
+	mutex_lock(&wr_ctx->wr_lock);
+	kfree(wr_ctx->wr_curr_bio);
+	wr_ctx->wr_curr_bio = NULL;
+	mutex_unlock(&wr_ctx->wr_lock);
+}
+
+static int copy_nocow_pages(struct scrub_ctx *sctx, u64 logical, u64 len,
+			    int mirror_num, u64 physical_for_dev_replace)
+{
+	struct scrub_copy_nocow_ctx *nocow_ctx;
+	struct btrfs_fs_info *fs_info = sctx->dev_root->fs_info;
+
+	nocow_ctx = kzalloc(sizeof(*nocow_ctx), GFP_NOFS);
+	if (!nocow_ctx) {
+		spin_lock(&sctx->stat_lock);
+		sctx->stat.malloc_errors++;
+		spin_unlock(&sctx->stat_lock);
+		return -ENOMEM;
+	}
+
+	scrub_pending_trans_workers_inc(sctx);
+
+	nocow_ctx->sctx = sctx;
+	nocow_ctx->logical = logical;
+	nocow_ctx->len = len;
+	nocow_ctx->mirror_num = mirror_num;
+	nocow_ctx->physical_for_dev_replace = physical_for_dev_replace;
+	nocow_ctx->work.func = copy_nocow_pages_worker;
+	btrfs_queue_worker(&fs_info->scrub_nocow_workers,
+			   &nocow_ctx->work);
+
+	return 0;
+}
+
+static void copy_nocow_pages_worker(struct btrfs_work *work)
+{
+	struct scrub_copy_nocow_ctx *nocow_ctx =
+		container_of(work, struct scrub_copy_nocow_ctx, work);
+	struct scrub_ctx *sctx = nocow_ctx->sctx;
+	u64 logical = nocow_ctx->logical;
+	u64 len = nocow_ctx->len;
+	int mirror_num = nocow_ctx->mirror_num;
+	u64 physical_for_dev_replace = nocow_ctx->physical_for_dev_replace;
+	int ret;
+	struct btrfs_trans_handle *trans = NULL;
+	struct btrfs_fs_info *fs_info;
+	struct btrfs_path *path;
+	struct btrfs_root *root;
+	int not_written = 0;
+
+	fs_info = sctx->dev_root->fs_info;
+	root = fs_info->extent_root;
+
+	path = btrfs_alloc_path();
+	if (!path) {
+		spin_lock(&sctx->stat_lock);
+		sctx->stat.malloc_errors++;
+		spin_unlock(&sctx->stat_lock);
+		not_written = 1;
+		goto out;
+	}
+
+	trans = btrfs_join_transaction(root);
+	if (IS_ERR(trans)) {
+		not_written = 1;
+		goto out;
+	}
+
+	ret = iterate_inodes_from_logical(logical, fs_info, path,
+					  copy_nocow_pages_for_inode,
+					  nocow_ctx);
+	if (ret != 0 && ret != -ENOENT) {
+		pr_warn("iterate_inodes_from_logical() failed: log %llu, phys %llu, len %llu, mir %llu, ret %d\n",
+			(unsigned long long)logical,
+			(unsigned long long)physical_for_dev_replace,
+			(unsigned long long)len,
+			(unsigned long long)mirror_num, ret);
+		not_written = 1;
+		goto out;
+	}
+
+out:
+	if (trans && !IS_ERR(trans))
+		btrfs_end_transaction(trans, root);
+	if (not_written)
+		btrfs_dev_replace_stats_inc(&fs_info->dev_replace.
+					    num_uncorrectable_read_errors);
+
+	btrfs_free_path(path);
+	kfree(nocow_ctx);
+
+	scrub_pending_trans_workers_dec(sctx);
+}
+
+static int copy_nocow_pages_for_inode(u64 inum, u64 offset, u64 root, void *ctx)
+{
+	unsigned long index;
+	struct scrub_copy_nocow_ctx *nocow_ctx = ctx;
+	int ret = 0;
+	struct btrfs_key key;
+	struct inode *inode = NULL;
+	struct btrfs_root *local_root;
+	u64 physical_for_dev_replace;
+	u64 len;
+	struct btrfs_fs_info *fs_info = nocow_ctx->sctx->dev_root->fs_info;
+
+	key.objectid = root;
+	key.type = BTRFS_ROOT_ITEM_KEY;
+	key.offset = (u64)-1;
+	local_root = btrfs_read_fs_root_no_name(fs_info, &key);
+	if (IS_ERR(local_root))
+		return PTR_ERR(local_root);
+
+	key.type = BTRFS_INODE_ITEM_KEY;
+	key.objectid = inum;
+	key.offset = 0;
+	inode = btrfs_iget(fs_info->sb, &key, local_root, NULL);
+	if (IS_ERR(inode))
+		return PTR_ERR(inode);
+
+	physical_for_dev_replace = nocow_ctx->physical_for_dev_replace;
+	len = nocow_ctx->len;
+	while (len >= PAGE_CACHE_SIZE) {
+		struct page *page = NULL;
+		int ret_sub;
+
+		index = offset >> PAGE_CACHE_SHIFT;
+
+		page = find_or_create_page(inode->i_mapping, index, GFP_NOFS);
+		if (!page) {
+			pr_err("find_or_create_page() failed\n");
+			ret = -ENOMEM;
+			goto next_page;
+		}
+
+		if (PageUptodate(page)) {
+			if (PageDirty(page))
+				goto next_page;
+		} else {
+			ClearPageError(page);
+			ret_sub = extent_read_full_page(&BTRFS_I(inode)->
+							 io_tree,
+							page, btrfs_get_extent,
+							nocow_ctx->mirror_num);
+			if (ret_sub) {
+				ret = ret_sub;
+				goto next_page;
+			}
+			wait_on_page_locked(page);
+			if (!PageUptodate(page)) {
+				ret = -EIO;
+				goto next_page;
+			}
+		}
+		ret_sub = write_page_nocow(nocow_ctx->sctx,
+					   physical_for_dev_replace, page);
+		if (ret_sub) {
+			ret = ret_sub;
+			goto next_page;
+		}
+
+next_page:
+		if (page) {
+			unlock_page(page);
+			put_page(page);
+		}
+		offset += PAGE_CACHE_SIZE;
+		physical_for_dev_replace += PAGE_CACHE_SIZE;
+		len -= PAGE_CACHE_SIZE;
+	}
+
+	if (inode)
+		iput(inode);
+	return ret;
+}
+
+static int write_page_nocow(struct scrub_ctx *sctx,
+			    u64 physical_for_dev_replace, struct page *page)
+{
+	struct bio *bio;
+	struct btrfs_device *dev;
+	int ret;
+	DECLARE_COMPLETION_ONSTACK(compl);
+
+	dev = sctx->wr_ctx.tgtdev;
+	if (!dev)
+		return -EIO;
+	if (!dev->bdev) {
+		printk_ratelimited(KERN_WARNING
+			"btrfs: scrub write_page_nocow(bdev == NULL) is unexpected!\n");
+		return -EIO;
+	}
+	bio = bio_alloc(GFP_NOFS, 1);
+	if (!bio) {
+		spin_lock(&sctx->stat_lock);
+		sctx->stat.malloc_errors++;
+		spin_unlock(&sctx->stat_lock);
+		return -ENOMEM;
+	}
+	bio->bi_private = &compl;
+	bio->bi_end_io = scrub_complete_bio_end_io;
+	bio->bi_size = 0;
+	bio->bi_sector = physical_for_dev_replace >> 9;
+	bio->bi_bdev = dev->bdev;
+	ret = bio_add_page(bio, page, PAGE_CACHE_SIZE, 0);
+	if (ret != PAGE_CACHE_SIZE) {
+leave_with_eio:
+		bio_put(bio);
+		btrfs_dev_stat_inc_and_print(dev, BTRFS_DEV_STAT_WRITE_ERRS);
+		return -EIO;
+	}
+	btrfsic_submit_bio(WRITE_SYNC, bio);
+	wait_for_completion(&compl);
+
+	if (!test_bit(BIO_UPTODATE, &bio->bi_flags))
+		goto leave_with_eio;
+
+	bio_put(bio);
+	return 0;
+}
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 789c9b2..8ad5f15 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -1203,7 +1203,8 @@ static void btrfs_resize_thread_pool(struct btrfs_fs_info *fs_info,
 	btrfs_set_max_workers(&fs_info->endio_freespace_worker, new_pool_size);
 	btrfs_set_max_workers(&fs_info->delayed_workers, new_pool_size);
 	btrfs_set_max_workers(&fs_info->readahead_workers, new_pool_size);
-	btrfs_set_max_workers(&fs_info->scrub_workers, new_pool_size);
+	btrfs_set_max_workers(&fs_info->scrub_wr_completion_workers,
+			      new_pool_size);
 }
 
 static int btrfs_remount(struct super_block *sb, int *flags, char *data)
-- 
1.8.0

Stefan Behrens

2012-Nov-06 16:38 UTC

[PATCH 20/26] Btrfs: change core code of btrfs to support the device replace operations

This commit contains all the essential changes to the core code
of Btrfs for support of the device replace procedure.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
---
 fs/btrfs/Makefile      |   2 +-
 fs/btrfs/ctree.h       |   2 +
 fs/btrfs/disk-io.c     |  24 +++++-
 fs/btrfs/reada.c       |  17 +++++
 fs/btrfs/scrub.c       |   7 +-
 fs/btrfs/super.c       |  13 ++++
 fs/btrfs/transaction.c |   7 +-
 fs/btrfs/volumes.c     | 193 ++++++++++++++++++++++++++++++++++++++++++++++---
 fs/btrfs/volumes.h     |  11 ++-
 9 files changed, 261 insertions(+), 15 deletions(-)

diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
index d7fcdba..7df3e0f 100644
--- a/fs/btrfs/Makefile
+++ b/fs/btrfs/Makefile
@@ -8,7 +8,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
 	   extent_io.o volumes.o async-thread.o ioctl.o locking.o orphan.o \
 	   export.o tree-log.o free-space-cache.o zlib.o lzo.o \
 	   compression.o delayed-ref.o relocation.o delayed-inode.o scrub.o \
-	   reada.o backref.o ulist.o qgroup.o send.o
+	   reada.o backref.o ulist.o qgroup.o send.o dev-replace.o
 
 btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
 btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index e17f211..7c9e4f7 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -142,6 +142,8 @@ struct btrfs_ordered_sum;
 
 #define BTRFS_EMPTY_SUBVOL_DIR_OBJECTID 2
 
+#define BTRFS_DEV_REPLACE_DEVID 0
+
 /*
  * the max metadata block size.  This limit is somewhat artificial,
  * but the memmove costs go through the roof for larger blocks.
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index dbe0434..509fce2 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -45,6 +45,7 @@
 #include "inode-map.h"
 #include "check-integrity.h"
 #include "rcu-string.h"
+#include "dev-replace.h"
 
 #ifdef CONFIG_X86
 #include <asm/cpufeature.h>
@@ -2438,7 +2439,11 @@ int open_ctree(struct super_block *sb,
 		goto fail_tree_roots;
 	}
 
-	btrfs_close_extra_devices(fs_devices);
+	/*
+	 * keep the device that is marked to be the target device for the
+	 * dev_replace procedure
+	 */
+	btrfs_close_extra_devices(fs_info, fs_devices, 0);
 
 	if (!fs_devices->latest_bdev) {
 		printk(KERN_CRIT "btrfs: failed to read devices on %s\n",
@@ -2510,6 +2515,14 @@ retry_root_backup:
 		goto fail_block_groups;
 	}
 
+	ret = btrfs_init_dev_replace(fs_info);
+	if (ret) {
+		pr_err("btrfs: failed to init dev_replace: %d\n", ret);
+		goto fail_block_groups;
+	}
+
+	btrfs_close_extra_devices(fs_info, fs_devices, 1);
+
 	ret = btrfs_init_space_info(fs_info);
 	if (ret) {
 		printk(KERN_ERR "Failed to initial space info: %d\n", ret);
@@ -2658,6 +2671,13 @@ retry_root_backup:
 		return ret;
 	}
 
+	ret = btrfs_resume_dev_replace_async(fs_info);
+	if (ret) {
+		pr_warn("btrfs: failed to resume dev_replace\n");
+		close_ctree(tree_root);
+		return ret;
+	}
+
 	return 0;
 
 fail_qgroup:
@@ -3300,6 +3320,8 @@ int close_ctree(struct btrfs_root *root)
 	/* pause restriper - we want to resume on mount */
 	btrfs_pause_balance(fs_info);
 
+	btrfs_dev_replace_suspend_for_unmount(fs_info);
+
 	btrfs_scrub_cancel(fs_info);
 
 	/* wait for any defraggers to finish */
diff --git a/fs/btrfs/reada.c b/fs/btrfs/reada.c
index 9f363e1..c705a48 100644
--- a/fs/btrfs/reada.c
+++ b/fs/btrfs/reada.c
@@ -27,6 +27,7 @@
 #include "volumes.h"
 #include "disk-io.h"
 #include "transaction.h"
+#include "dev-replace.h"
 
 #undef DEBUG
 
@@ -331,6 +332,7 @@ static struct reada_extent *reada_find_extent(struct btrfs_root *root,
 	int nzones = 0;
 	int i;
 	unsigned long index = logical >> PAGE_CACHE_SHIFT;
+	int dev_replace_is_ongoing;
 
 	spin_lock(&fs_info->reada_lock);
 	re = radix_tree_lookup(&fs_info->reada_tree, index);
@@ -392,6 +394,7 @@ static struct reada_extent *reada_find_extent(struct btrfs_root *root,
 	}
 
 	/* insert extent in reada_tree + all per-device trees, all or nothing */
+	btrfs_dev_replace_lock(&fs_info->dev_replace);
 	spin_lock(&fs_info->reada_lock);
 	ret = radix_tree_insert(&fs_info->reada_tree, index, re);
 	if (ret == -EEXIST) {
@@ -399,13 +402,17 @@ static struct reada_extent *reada_find_extent(struct btrfs_root *root,
 		BUG_ON(!re_exist);
 		re_exist->refcnt++;
 		spin_unlock(&fs_info->reada_lock);
+		btrfs_dev_replace_unlock(&fs_info->dev_replace);
 		goto error;
 	}
 	if (ret) {
 		spin_unlock(&fs_info->reada_lock);
+		btrfs_dev_replace_unlock(&fs_info->dev_replace);
 		goto error;
 	}
 	prev_dev = NULL;
+	dev_replace_is_ongoing = btrfs_dev_replace_is_ongoing(
+			&fs_info->dev_replace);
 	for (i = 0; i < nzones; ++i) {
 		dev = bbio->stripes[i].dev;
 		if (dev == prev_dev) {
@@ -422,6 +429,14 @@ static struct reada_extent *reada_find_extent(struct btrfs_root *root,
 			/* cannot read ahead on missing device */
 			continue;
 		}
+		if (dev_replace_is_ongoing &&
+		    dev == fs_info->dev_replace.tgtdev) {
+			/*
+			 * as this device is selected for reading only as
+			 * a last resort, skip it for read ahead.
+			 */
+			continue;
+		}
 		prev_dev = dev;
 		ret = radix_tree_insert(&dev->reada_extents, index, re);
 		if (ret) {
@@ -434,10 +449,12 @@ static struct reada_extent *reada_find_extent(struct btrfs_root *root,
 			BUG_ON(fs_info == NULL);
 			radix_tree_delete(&fs_info->reada_tree, index);
 			spin_unlock(&fs_info->reada_lock);
+			btrfs_dev_replace_unlock(&fs_info->dev_replace);
 			goto error;
 		}
 	}
 	spin_unlock(&fs_info->reada_lock);
+	btrfs_dev_replace_unlock(&fs_info->dev_replace);
 
 	kfree(bbio);
 	return re;
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 59c69e0..8bfe782 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -2841,12 +2841,17 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start,
 		return -EIO;
 	}
 
-	if (dev->scrub_device) {
+	btrfs_dev_replace_lock(&fs_info->dev_replace);
+	if (dev->scrub_device ||
+	    (!is_dev_replace &&
+	     btrfs_dev_replace_is_ongoing(&fs_info->dev_replace))) {
+		btrfs_dev_replace_unlock(&fs_info->dev_replace);
 		mutex_unlock(&fs_info->scrub_lock);
 		mutex_unlock(&fs_info->fs_devices->device_list_mutex);
 		scrub_workers_put(fs_info);
 		return -EINPROGRESS;
 	}
+	btrfs_dev_replace_unlock(&fs_info->dev_replace);
 	sctx = scrub_setup_ctx(dev, is_dev_replace);
 	if (IS_ERR(sctx)) {
 		mutex_unlock(&fs_info->scrub_lock);
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 8ad5f15..9a89423 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -55,6 +55,7 @@
 #include "export.h"
 #include "compression.h"
 #include "rcu-string.h"
+#include "dev-replace.h"
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/btrfs.h>
@@ -1233,8 +1234,15 @@ static int btrfs_remount(struct super_block *sb, int *flags, char *data)
 		return 0;
 
 	if (*flags & MS_RDONLY) {
+		/*
+		 * this also happens on 'umount -rf' or on shutdown, when
+		 * the filesystem is busy.
+		 */
 		sb->s_flags |= MS_RDONLY;
 
+		btrfs_dev_replace_suspend_for_unmount(fs_info);
+		btrfs_scrub_cancel(fs_info);
+
 		ret = btrfs_commit_super(root);
 		if (ret)
 			goto restore;
@@ -1271,6 +1279,11 @@ static int btrfs_remount(struct super_block *sb, int *flags, char *data)
 		if (ret)
 			goto restore;
 
+		ret = btrfs_resume_dev_replace_async(fs_info);
+		if (ret) {
+			pr_warn("btrfs: failed to resume dev_replace\n");
+			goto restore;
+		}
 		sb->s_flags &= ~MS_RDONLY;
 	}
 
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 259f74e..063e49e 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -30,6 +30,7 @@
 #include "tree-log.h"
 #include "inode-map.h"
 #include "volumes.h"
+#include "dev-replace.h"
 
 #define BTRFS_ROOT_TRANS_TAG 0
 
@@ -848,7 +849,9 @@ static noinline int commit_cowonly_roots(struct btrfs_trans_handle *trans,
 		return ret;
 
 	ret = btrfs_run_dev_stats(trans, root->fs_info);
-	BUG_ON(ret);
+	WARN_ON(ret);
+	ret = btrfs_run_dev_replace(trans, root->fs_info);
+	WARN_ON(ret);
 
 	ret = btrfs_run_qgroups(trans, root->fs_info);
 	BUG_ON(ret);
@@ -871,6 +874,8 @@ static noinline int commit_cowonly_roots(struct btrfs_trans_handle *trans,
 	switch_commit_root(fs_info->extent_root);
 	up_write(&fs_info->extent_commit_sem);
 
+	btrfs_after_dev_replace_commit(fs_info);
+
 	return 0;
 }
 
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 9233219..3ef0df2 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -36,6 +36,7 @@
 #include "check-integrity.h"
 #include "rcu-string.h"
 #include "math.h"
+#include "dev-replace.h"
 
 static int init_first_rw_device(struct btrfs_trans_handle *trans,
 				struct btrfs_root *root,
@@ -467,7 +468,8 @@ error:
 	return ERR_PTR(-ENOMEM);
 }
 
-void btrfs_close_extra_devices(struct btrfs_fs_devices *fs_devices)
+void btrfs_close_extra_devices(struct btrfs_fs_info *fs_info,
+			       struct btrfs_fs_devices *fs_devices, int step)
 {
 	struct btrfs_device *device, *next;
 
@@ -490,6 +492,21 @@ again:
 			continue;
 		}
 
+		if (device->devid == BTRFS_DEV_REPLACE_DEVID) {
+			/*
+			 * In the first step, keep the device which has
+			 * the correct fsid and the devid that is used
+			 * for the dev_replace procedure.
+			 * In the second step, the dev_replace state is
+			 * read from the device tree and it is known
+			 * whether the procedure is really active or
+			 * not, which means whether this device is
+			 * used or whether it should be removed.
+			 */
+			if (step == 0 || device->is_tgtdev_for_dev_replace) {
+				continue;
+			}
+		}
 		if (device->bdev) {
 			blkdev_put(device->bdev, device->mode);
 			device->bdev = NULL;
@@ -498,7 +515,8 @@ again:
 		if (device->writeable) {
 			list_del_init(&device->dev_alloc_list);
 			device->writeable = 0;
-			fs_devices->rw_devices--;
+			if (!device->is_tgtdev_for_dev_replace)
+				fs_devices->rw_devices--;
 		}
 		list_del_init(&device->dev_list);
 		fs_devices->num_devices--;
@@ -556,7 +574,7 @@ static int __btrfs_close_devices(struct btrfs_fs_devices *fs_devices)
 		if (device->bdev)
 			fs_devices->open_devices--;
 
-		if (device->writeable) {
+		if (device->writeable && !device->is_tgtdev_for_dev_replace) {
 			list_del_init(&device->dev_alloc_list);
 			fs_devices->rw_devices--;
 		}
@@ -688,7 +706,7 @@ static int __btrfs_open_devices(struct btrfs_fs_devices *fs_devices,
 			fs_devices->rotating = 1;
 
 		fs_devices->open_devices++;
-		if (device->writeable) {
+		if (device->writeable && !device->is_tgtdev_for_dev_replace) {
 			fs_devices->rw_devices++;
 			list_add(&device->dev_alloc_list,
 				 &fs_devices->alloc_list);
@@ -1332,16 +1350,22 @@ int btrfs_rm_device(struct btrfs_root *root, char *device_path)
 		root->fs_info->avail_system_alloc_bits |
 		root->fs_info->avail_metadata_alloc_bits;
 
-	if ((all_avail & BTRFS_BLOCK_GROUP_RAID10) &&
-	    root->fs_info->fs_devices->num_devices <= 4) {
+	num_devices = root->fs_info->fs_devices->num_devices;
+	btrfs_dev_replace_lock(&root->fs_info->dev_replace);
+	if (btrfs_dev_replace_is_ongoing(&root->fs_info->dev_replace)) {
+		WARN_ON(num_devices < 1);
+		num_devices--;
+	}
+	btrfs_dev_replace_unlock(&root->fs_info->dev_replace);
+
+	if ((all_avail & BTRFS_BLOCK_GROUP_RAID10) && num_devices <= 4) {
 		printk(KERN_ERR "btrfs: unable to go below four devices "
 		       "on raid10\n");
 		ret = -EINVAL;
 		goto out;
 	}
 
-	if ((all_avail & BTRFS_BLOCK_GROUP_RAID1) &&
-	    root->fs_info->fs_devices->num_devices <= 2) {
+	if ((all_avail & BTRFS_BLOCK_GROUP_RAID1) && num_devices <= 2) {
 		printk(KERN_ERR "btrfs: unable to go below two "
 		       "devices on raid1\n");
 		ret = -EINVAL;
@@ -1527,6 +1551,53 @@ error_undo:
 	goto error_brelse;
 }
 
+void btrfs_rm_dev_replace_srcdev(struct btrfs_fs_info *fs_info,
+				 struct btrfs_device *srcdev)
+{
+	WARN_ON(!mutex_is_locked(&fs_info->fs_devices->device_list_mutex));
+	list_del_rcu(&srcdev->dev_list);
+	list_del_rcu(&srcdev->dev_alloc_list);
+	fs_info->fs_devices->num_devices--;
+	if (srcdev->missing) {
+		fs_info->fs_devices->missing_devices--;
+		fs_info->fs_devices->rw_devices++;
+	}
+	if (srcdev->can_discard)
+		fs_info->fs_devices->num_can_discard--;
+	if (srcdev->bdev)
+		fs_info->fs_devices->open_devices--;
+
+	call_rcu(&srcdev->rcu, free_device);
+}
+
+void btrfs_destroy_dev_replace_tgtdev(struct btrfs_fs_info *fs_info,
+				      struct btrfs_device *tgtdev)
+{
+	struct btrfs_device *next_device;
+
+	WARN_ON(!tgtdev);
+	mutex_lock(&fs_info->fs_devices->device_list_mutex);
+	if (tgtdev->bdev) {
+		btrfs_scratch_superblock(tgtdev);
+		fs_info->fs_devices->open_devices--;
+	}
+	fs_info->fs_devices->num_devices--;
+	if (tgtdev->can_discard)
+		fs_info->fs_devices->num_can_discard++;
+
+	next_device = list_entry(fs_info->fs_devices->devices.next,
+				 struct btrfs_device, dev_list);
+	if (tgtdev->bdev == fs_info->sb->s_bdev)
+		fs_info->sb->s_bdev = next_device->bdev;
+	if (tgtdev->bdev == fs_info->fs_devices->latest_bdev)
+		fs_info->fs_devices->latest_bdev = next_device->bdev;
+	list_del_rcu(&tgtdev->dev_list);
+
+	call_rcu(&tgtdev->rcu, free_device);
+
+	mutex_unlock(&fs_info->fs_devices->device_list_mutex);
+}
+
 int btrfs_find_device_by_path(struct btrfs_root *root, char *device_path,
 			      struct btrfs_device **device)
 {
@@ -1935,6 +2006,98 @@ error:
 	return ret;
 }
 
+int btrfs_init_dev_replace_tgtdev(struct btrfs_root *root, char *device_path,
+				  struct btrfs_device **device_out)
+{
+	struct request_queue *q;
+	struct btrfs_device *device;
+	struct block_device *bdev;
+	struct btrfs_fs_info *fs_info = root->fs_info;
+	struct list_head *devices;
+	struct rcu_string *name;
+	int ret = 0;
+
+	*device_out = NULL;
+	if (fs_info->fs_devices->seeding)
+		return -EINVAL;
+
+	bdev = blkdev_get_by_path(device_path, FMODE_WRITE | FMODE_EXCL,
+				  fs_info->bdev_holder);
+	if (IS_ERR(bdev))
+		return PTR_ERR(bdev);
+
+	filemap_write_and_wait(bdev->bd_inode->i_mapping);
+
+	devices = &fs_info->fs_devices->devices;
+	list_for_each_entry(device, devices, dev_list) {
+		if (device->bdev == bdev) {
+			ret = -EEXIST;
+			goto error;
+		}
+	}
+
+	device = kzalloc(sizeof(*device), GFP_NOFS);
+	if (!device) {
+		ret = -ENOMEM;
+		goto error;
+	}
+
+	name = rcu_string_strdup(device_path, GFP_NOFS);
+	if (!name) {
+		kfree(device);
+		ret = -ENOMEM;
+		goto error;
+	}
+	rcu_assign_pointer(device->name, name);
+
+	q = bdev_get_queue(bdev);
+	if (blk_queue_discard(q))
+		device->can_discard = 1;
+	mutex_lock(&root->fs_info->fs_devices->device_list_mutex);
+	device->writeable = 1;
+	device->work.func = pending_bios_fn;
+	generate_random_uuid(device->uuid);
+	device->devid = BTRFS_DEV_REPLACE_DEVID;
+	spin_lock_init(&device->io_lock);
+	device->generation = 0;
+	device->io_width = root->sectorsize;
+	device->io_align = root->sectorsize;
+	device->sector_size = root->sectorsize;
+	device->total_bytes = i_size_read(bdev->bd_inode);
+	device->disk_total_bytes = device->total_bytes;
+	device->dev_root = fs_info->dev_root;
+	device->bdev = bdev;
+	device->in_fs_metadata = 1;
+	device->is_tgtdev_for_dev_replace = 1;
+	device->mode = FMODE_EXCL;
+	set_blocksize(device->bdev, 4096);
+	device->fs_devices = fs_info->fs_devices;
+	list_add(&device->dev_list, &fs_info->fs_devices->devices);
+	fs_info->fs_devices->num_devices++;
+	fs_info->fs_devices->open_devices++;
+	if (device->can_discard)
+		fs_info->fs_devices->num_can_discard++;
+	mutex_unlock(&root->fs_info->fs_devices->device_list_mutex);
+
+	*device_out = device;
+	return ret;
+
+error:
+	blkdev_put(bdev, FMODE_EXCL);
+	return ret;
+}
+
+void btrfs_init_dev_replace_tgtdev_for_resume(struct btrfs_fs_info *fs_info,
+					      struct btrfs_device *tgtdev)
+{
+	WARN_ON(fs_info->fs_devices->rw_devices == 0);
+	tgtdev->io_width = fs_info->dev_root->sectorsize;
+	tgtdev->io_align = fs_info->dev_root->sectorsize;
+	tgtdev->sector_size = fs_info->dev_root->sectorsize;
+	tgtdev->dev_root = fs_info->dev_root;
+	tgtdev->in_fs_metadata = 1;
+}
+
 static noinline int btrfs_update_device(struct btrfs_trans_handle *trans,
 					struct btrfs_device *device)
 {
@@ -2800,6 +2963,7 @@ int btrfs_balance(struct btrfs_balance_control *bctl,
 	u64 allowed;
 	int mixed = 0;
 	int ret;
+	u64 num_devices;
 
 	if (btrfs_fs_closing(fs_info) ||
 	    atomic_read(&fs_info->balance_pause_req) ||
@@ -2828,10 +2992,17 @@ int btrfs_balance(struct btrfs_balance_control *bctl,
 		}
 	}
 
+	num_devices = fs_info->fs_devices->num_devices;
+	btrfs_dev_replace_lock(&fs_info->dev_replace);
+	if (btrfs_dev_replace_is_ongoing(&fs_info->dev_replace)) {
+		BUG_ON(num_devices < 1);
+		num_devices--;
+	}
+	btrfs_dev_replace_unlock(&fs_info->dev_replace);
 	allowed = BTRFS_AVAIL_ALLOC_BIT_SINGLE;
-	if (fs_info->fs_devices->num_devices == 1)
+	if (num_devices == 1)
 		allowed |= BTRFS_BLOCK_GROUP_DUP;
-	else if (fs_info->fs_devices->num_devices < 4)
+	else if (num_devices < 4)
 		allowed |= (BTRFS_BLOCK_GROUP_RAID0 | BTRFS_BLOCK_GROUP_RAID1);
 	else
 		allowed |= (BTRFS_BLOCK_GROUP_RAID0 | BTRFS_BLOCK_GROUP_RAID1 |
@@ -3457,6 +3628,7 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
 		devices_info[ndevs].total_avail = total_avail;
 		devices_info[ndevs].dev = device;
 		++ndevs;
+		WARN_ON(ndevs > fs_devices->rw_devices);
 	}
 
 	/*
@@ -4639,6 +4811,7 @@ static void fill_device_from_item(struct extent_buffer *leaf,
 	device->io_align = btrfs_device_io_align(leaf, dev_item);
 	device->io_width = btrfs_device_io_width(leaf, dev_item);
 	device->sector_size = btrfs_device_sector_size(leaf, dev_item);
+	WARN_ON(device->devid == BTRFS_DEV_REPLACE_DEVID);
 	device->is_tgtdev_for_dev_replace = 0;
 
 	ptr = (unsigned long)btrfs_device_uuid(dev_item);
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 8fd5a4d..37d0157 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -268,7 +268,8 @@ int btrfs_open_devices(struct btrfs_fs_devices *fs_devices,
 int btrfs_scan_one_device(const char *path, fmode_t flags, void *holder,
 			  struct btrfs_fs_devices **fs_devices_ret);
 int btrfs_close_devices(struct btrfs_fs_devices *fs_devices);
-void btrfs_close_extra_devices(struct btrfs_fs_devices *fs_devices);
+void btrfs_close_extra_devices(struct btrfs_fs_info *fs_info,
+			       struct btrfs_fs_devices *fs_devices, int step);
 int btrfs_find_device_missing_or_by_path(struct btrfs_root *root,
 					 char *device_path,
 					 struct btrfs_device **device);
@@ -286,6 +287,8 @@ struct btrfs_device *btrfs_find_device(struct btrfs_fs_info *fs_info, u64 devid,
 				       u8 *uuid, u8 *fsid);
 int btrfs_shrink_device(struct btrfs_device *device, u64 new_size);
 int btrfs_init_new_device(struct btrfs_root *root, char *path);
+int btrfs_init_dev_replace_tgtdev(struct btrfs_root *root, char *device_path,
+				  struct btrfs_device **device_out);
 int btrfs_balance(struct btrfs_balance_control *bctl,
 		  struct btrfs_ioctl_balance_args *bargs);
 int btrfs_resume_balance_async(struct btrfs_fs_info *fs_info);
@@ -302,6 +305,12 @@ int btrfs_get_dev_stats(struct btrfs_root *root,
 int btrfs_init_dev_stats(struct btrfs_fs_info *fs_info);
 int btrfs_run_dev_stats(struct btrfs_trans_handle *trans,
 			struct btrfs_fs_info *fs_info);
+void btrfs_rm_dev_replace_srcdev(struct btrfs_fs_info *fs_info,
+				 struct btrfs_device *srcdev);
+void btrfs_destroy_dev_replace_tgtdev(struct btrfs_fs_info *fs_info,
+				      struct btrfs_device *tgtdev);
+void btrfs_init_dev_replace_tgtdev_for_resume(struct btrfs_fs_info *fs_info,
+					      struct btrfs_device *tgtdev);
 int btrfs_scratch_superblock(struct btrfs_device *device);
 
 static inline void btrfs_dev_stat_inc(struct btrfs_device *dev,
-- 
1.8.0

Stefan Behrens

2012-Nov-06 16:38 UTC

[PATCH 21/26] Btrfs: introduce GET_READ_MIRRORS functionality for btrfs_map_block()

Before this commit, btrfs_map_block() was called with REQ_WRITE
in order to retrieve the list of mirrors for a disk block.
This needs to be changed for the device replace procedure since
it makes a difference whether you are asking for read mirrors
or for locations to write to.
GET_READ_MIRRORS is introduced as a new interface to call
btrfs_map_block().
In the current commit, the functionality is not yet changed,
only the interface for GET_READ_MIRRORS is introduced and all
the places that should use this new interface are adapted.

The reason that REQ_WRITE cannot be abused anymore to retrieve
a list of read mirrors is that during a running dev replace
operation all write requests to the live filesystem are
duplicated to also write to the target drive.
Keep in mind that the target disk is only partially a valid
copy of the source disk while the operation is ongoing. All
writes go to the target disk, but not all reads would return
valid data on the target disk. Therefore it is not possible
anymore to abuse a REQ_WRITE interface to find valid mirrors
for a REQ_READ.
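
The distinction described above can be illustrated with a small user-space model. This is only a sketch, not the kernel code: `toy_map`, `toy_map_block()` and `TOY_REQ_WRITE` are hypothetical names invented for illustration, and only the value `1 << 30` of the read-mirror flag is taken from this patch. It assumes a RAID1-like layout in which a write during a running replace is also sent to the target device, while a query for read mirrors is not.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag bits; only the value (1 << 30) mirrors the patch. */
enum {
	TOY_REQ_WRITE            = 1 << 0,
	TOY_REQ_GET_READ_MIRRORS = 1 << 30,
};

struct toy_map {
	int  mirrors[2];       /* devids that hold a complete copy */
	int  num_mirrors;
	bool replace_ongoing;
	int  srcdev;           /* devid that is being replaced */
	int  tgtdev;           /* devid of the partially filled target */
};

/* Fills out[] with the devids for the request and returns their count. */
static size_t toy_map_block(const struct toy_map *m, int rw,
			    int *out, size_t cap)
{
	size_t n = 0;
	int i;

	assert(m->num_mirrors <= 2);

	/* Every request type lists all regular mirrors in this sketch. */
	for (i = 0; i < m->num_mirrors && n < cap; i++)
		out[n++] = m->mirrors[i];

	/* Only real writes are duplicated to the replace target. */
	if ((rw & TOY_REQ_WRITE) && m->replace_ongoing) {
		for (i = 0; i < m->num_mirrors && n < cap; i++)
			if (m->mirrors[i] == m->srcdev)
				out[n++] = m->tgtdev;
	}
	return n;
}
```

In this model a caller that asked with the write flag merely to enumerate mirrors would be handed the target device as well, even though the target may not yet hold valid data for the block. Asking with the separate read-mirror flag avoids that.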

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
---
 fs/btrfs/ctree.h   | 3 +++
 fs/btrfs/reada.c   | 3 ++-
 fs/btrfs/scrub.c   | 4 ++--
 fs/btrfs/volumes.c | 8 ++++----
 4 files changed, 11 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 7c9e4f7..01fcfcb 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -174,6 +174,9 @@ static int btrfs_csum_sizes[] = { 4, 0 };
 /* four bytes for CRC32 */
 #define BTRFS_EMPTY_DIR_SIZE 0
 
+/* specific to btrfs_map_block(), therefore not in include/linux/blk_types.h */
+#define REQ_GET_READ_MIRRORS	(1 << 30)
+
 #define BTRFS_FT_UNKNOWN	0
 #define BTRFS_FT_REG_FILE	1
 #define BTRFS_FT_DIR		2
diff --git a/fs/btrfs/reada.c b/fs/btrfs/reada.c
index c705a48..96b93da 100644
--- a/fs/btrfs/reada.c
+++ b/fs/btrfs/reada.c
@@ -359,7 +359,8 @@ static struct reada_extent *reada_find_extent(struct btrfs_root *root,
 	 * map block
 	 */
 	length = blocksize;
-	ret = btrfs_map_block(fs_info, REQ_WRITE, logical, &length, &bbio, 0);
+	ret = btrfs_map_block(fs_info, REQ_GET_READ_MIRRORS, logical, &length,
+			      &bbio, 0);
 	if (ret || !bbio || length < blocksize)
 		goto error;
 
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 8bfe782..ba46eb0 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -1193,8 +1193,8 @@ static int scrub_setup_recheck_block(struct scrub_ctx *sctx,
 		 * with a length of PAGE_SIZE, each returned stripe
 		 * represents one mirror
 		 */
-		ret = btrfs_map_block(fs_info, WRITE, logical, &mapped_length,
-				      &bbio, 0);
+		ret = btrfs_map_block(fs_info, REQ_GET_READ_MIRRORS, logical,
+				      &mapped_length, &bbio, 0);
 		if (ret || !bbio || mapped_length < sublen) {
 			kfree(bbio);
 			return -EIO;
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 3ef0df2..208cc13 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -4108,7 +4108,7 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info, int rw,
 					    stripe_nr_end - stripe_nr_orig);
 		stripe_index = do_div(stripe_nr, map->num_stripes);
 	} else if (map->type & BTRFS_BLOCK_GROUP_RAID1) {
-		if (rw & (REQ_WRITE | REQ_DISCARD))
+		if (rw & (REQ_WRITE | REQ_DISCARD | REQ_GET_READ_MIRRORS))
 			num_stripes = map->num_stripes;
 		else if (mirror_num)
 			stripe_index = mirror_num - 1;
@@ -4120,7 +4120,7 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info, int rw,
 		}
 
 	} else if (map->type & BTRFS_BLOCK_GROUP_DUP) {
-		if (rw & (REQ_WRITE | REQ_DISCARD)) {
+		if (rw & (REQ_WRITE | REQ_DISCARD | REQ_GET_READ_MIRRORS)) {
 			num_stripes = map->num_stripes;
 		} else if (mirror_num) {
 			stripe_index = mirror_num - 1;
@@ -4134,7 +4134,7 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info, int rw,
 		stripe_index = do_div(stripe_nr, factor);
 		stripe_index *= map->sub_stripes;
 
-		if (rw & REQ_WRITE)
+		if (rw & (REQ_WRITE | REQ_GET_READ_MIRRORS))
 			num_stripes = map->sub_stripes;
 		else if (rw & REQ_DISCARD)
 			num_stripes = min_t(u64, map->sub_stripes *
@@ -4247,7 +4247,7 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info, int rw,
 		}
 	}
 
-	if (rw & REQ_WRITE) {
+	if (rw & (REQ_WRITE | REQ_GET_READ_MIRRORS)) {
 		if (map->type & (BTRFS_BLOCK_GROUP_RAID1 |
 				 BTRFS_BLOCK_GROUP_RAID10 |
 				 BTRFS_BLOCK_GROUP_DUP)) {
-- 
1.8.0

Stefan Behrens

2012-Nov-06 16:38 UTC

[PATCH 22/26] Btrfs: changes to live filesystem are also written to replacement disk

During a running dev replace operation, all write requests to
the live filesystem are duplicated to also write to the target
drive. Therefore btrfs_map_block() is changed so that every
stripe written to the source disk of a device replace procedure
is duplicated and written to the target disk as well.
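
The stripe duplication can be sketched in user space as follows. This is a simplified model under stated assumptions, not the code from the patch: `toy_stripe` and `toy_dup_stripes_for_replace()` are hypothetical names, devices are reduced to plain devids, and the caller is assumed to have allocated room for twice the original number of stripes, in the same spirit as the doubled allocation in the diff below.

```c
#include <assert.h>
#include <stdint.h>

struct toy_stripe {
	int      devid;     /* device the stripe is written to */
	uint64_t physical;  /* byte offset on that device */
};

/*
 * For every stripe that targets the source device, append a copy that
 * targets the replacement device at the same physical offset.
 * Returns the new number of stripes.
 */
static int toy_dup_stripes_for_replace(struct toy_stripe *stripes,
				       int num_stripes, int capacity,
				       int srcdev_devid, int tgtdev_devid,
				       int *max_errors)
{
	int index_where_to_add = num_stripes;
	int i;

	/* Worst case: every stripe is on the source device. */
	assert(capacity >= 2 * num_stripes);

	for (i = 0; i < num_stripes; i++) {
		if (stripes[i].devid == srcdev_devid) {
			stripes[index_where_to_add].devid = tgtdev_devid;
			stripes[index_where_to_add].physical =
				stripes[i].physical;
			index_where_to_add++;
			/* One more write that is allowed to fail. */
			(*max_errors)++;
		}
	}
	return index_where_to_add;
}
```

Only the first `num_stripes` entries are scanned, so the appended copies are never duplicated again. The tolerated error count grows with each added stripe, so that a failed write to the target alone does not fail the whole request.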

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
---
 fs/btrfs/volumes.c | 50 +++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 49 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 208cc13..164f9b1 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -4049,6 +4049,9 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info, int rw,
 	int num_stripes;
 	int max_errors = 0;
 	struct btrfs_bio *bbio = NULL;
+	struct btrfs_dev_replace *dev_replace = &fs_info->dev_replace;
+	int dev_replace_is_ongoing = 0;
+	int num_alloc_stripes;
 
 	read_lock(&em_tree->lock);
 	em = lookup_extent_mapping(em_tree, logical, *length);
@@ -4094,6 +4097,11 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info, int rw,
 	if (!bbio_ret)
 		goto out;
 
+	btrfs_dev_replace_lock(dev_replace);
+	dev_replace_is_ongoing = btrfs_dev_replace_is_ongoing(dev_replace);
+	if (!dev_replace_is_ongoing)
+		btrfs_dev_replace_unlock(dev_replace);
+
 	num_stripes = 1;
 	stripe_index = 0;
 	stripe_nr_orig = stripe_nr;
@@ -4160,7 +4168,10 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info, int rw,
 	}
 	BUG_ON(stripe_index >= map->num_stripes);
 
-	bbio = kzalloc(btrfs_bio_size(num_stripes), GFP_NOFS);
+	num_alloc_stripes = num_stripes;
+	if (dev_replace_is_ongoing && (rw & (REQ_WRITE | REQ_DISCARD)))
+		num_alloc_stripes <<= 1;
+	bbio = kzalloc(btrfs_bio_size(num_alloc_stripes), GFP_NOFS);
 	if (!bbio) {
 		ret = -ENOMEM;
 		goto out;
@@ -4255,11 +4266,48 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info, int rw,
 		}
 	}
 
+	if (dev_replace_is_ongoing && (rw & (REQ_WRITE | REQ_DISCARD)) &&
+	    dev_replace->tgtdev != NULL) {
+		int index_where_to_add;
+		u64 srcdev_devid = dev_replace->srcdev->devid;
+
+		/*
+		 * duplicate the write operations while the dev replace
+		 * procedure is running. Since the copying of the old disk
+		 * to the new disk takes place at run time while the
+		 * filesystem is mounted writable, the regular write
+		 * operations to the old disk have to be duplicated to go
+		 * to the new disk as well.
+		 * Note that device->missing is handled by the caller, and
+		 * that the write to the old disk is already set up in the
+		 * stripes array.
+		 */
+		index_where_to_add = num_stripes;
+		for (i = 0; i < num_stripes; i++) {
+			if (bbio->stripes[i].dev->devid == srcdev_devid) {
+				/* write to new disk, too */
+				struct btrfs_bio_stripe *new =
+					bbio->stripes + index_where_to_add;
+				struct btrfs_bio_stripe *old =
+					bbio->stripes + i;
+
+				new->physical = old->physical;
+				new->length = old->length;
+				new->dev = dev_replace->tgtdev;
+				index_where_to_add++;
+				max_errors++;
+			}
+		}
+		num_stripes = index_where_to_add;
+	}
+
 	*bbio_ret = bbio;
 	bbio->num_stripes = num_stripes;
 	bbio->max_errors = max_errors;
 	bbio->mirror_num = mirror_num;
 out:
+	if (dev_replace_is_ongoing)
+		btrfs_dev_replace_unlock(dev_replace);
 	free_extent_map(em);
 	return ret;
 }
-- 
1.8.0

Stefan Behrens

2012-Nov-06 16:38 UTC

[PATCH 23/26] Btrfs: optionally avoid reads from device replace source drive

It is desirable to be able to configure the device replace
procedure to avoid reading the source drive (the one to be
copied) whenever possible. This is useful when the number of
read errors on this disk is high, because those errors would
delay the copy procedure a lot. Therefore there is an option to
avoid reading from the source disk unless the repair procedure
really needs to access it. A regular read request asks for the
block to be mapped with mirror_num == 0; in this case the
source disk is avoided whenever possible. The repair code
selects the mirror explicitly (mirror_num != 0); this case is
not changed by this commit.
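
The mirror selection for the mirror_num == 0 case can be modelled in user space with two passes over the candidates. This is a sketch under assumptions, not the kernel function: `toy_mirror` and `toy_find_live_mirror()` are hypothetical names, `present` stands in for a device having an open block device, and `avoid_devid` is -1 when no source drive needs to be avoided.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_mirror {
	int  devid;
	bool present;   /* false models a missing device */
};

/*
 * Pass 0 accepts only present mirrors that are not the device to avoid.
 * Pass 1 accepts any present mirror, so the avoided device is used only
 * if nothing else is available. Returns an index into m[].
 */
static int toy_find_live_mirror(const struct toy_mirror *m, int first,
				int num, int optimal, int avoid_devid)
{
	int tolerance;
	int i;

	assert(optimal >= first && optimal < first + num);

	for (tolerance = 0; tolerance < 2; tolerance++) {
		if (m[optimal].present &&
		    (tolerance || m[optimal].devid != avoid_devid))
			return optimal;
		for (i = first; i < first + num; i++) {
			if (m[i].present &&
			    (tolerance || m[i].devid != avoid_devid))
				return i;
		}
	}
	/* Nothing usable was found; return something and let the
	 * caller's I/O error handling deal with it. */
	return optimal;
}
```

With two present mirrors on devids 1 and 2 and devid 1 to be avoided, the first pass skips index 0 and returns index 1. If devid 2 goes missing, the second pass falls back to index 0.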

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
---
 fs/btrfs/volumes.c | 46 +++++++++++++++++++++++++++++++++++-----------
 1 file changed, 35 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 164f9b1..edd5573 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -4012,16 +4012,37 @@ int btrfs_num_copies(struct btrfs_fs_info *fs_info, u64 logical, u64 len)
 	return ret;
 }
 
-static int find_live_mirror(struct map_lookup *map, int first, int num,
-			    int optimal)
+static int find_live_mirror(struct btrfs_fs_info *fs_info,
+			    struct map_lookup *map, int first, int num,
+			    int optimal, int dev_replace_is_ongoing)
 {
 	int i;
-	if (map->stripes[optimal].dev->bdev)
-		return optimal;
-	for (i = first; i < first + num; i++) {
-		if (map->stripes[i].dev->bdev)
-			return i;
+	int tolerance;
+	struct btrfs_device *srcdev;
+
+	if (dev_replace_is_ongoing &&
+	    fs_info->dev_replace.cont_reading_from_srcdev_mode ==
+	    BTRFS_DEV_REPLACE_ITEM_CONT_READING_FROM_SRCDEV_MODE_AVOID)
+		srcdev = fs_info->dev_replace.srcdev;
+	else
+		srcdev = NULL;
+
+	/*
+	 * try to avoid the drive that is the source drive for a
+	 * dev-replace procedure, only choose it if no other non-missing
+	 * mirror is available
+	 */
+	for (tolerance = 0; tolerance < 2; tolerance++) {
+		if (map->stripes[optimal].dev->bdev &&
+		    (tolerance || map->stripes[optimal].dev != srcdev))
+			return optimal;
+		for (i = first; i < first + num; i++) {
+			if (map->stripes[i].dev->bdev &&
+			    (tolerance || map->stripes[i].dev != srcdev))
+				return i;
+		}
 	}
+
 	/* we couldn't find one that doesn't fail.  Just return something
 	 * and the io error handling code will clean up eventually
 	 */
@@ -4121,9 +4142,10 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info, int rw,
 		else if (mirror_num)
 			stripe_index = mirror_num - 1;
 		else {
-			stripe_index = find_live_mirror(map, 0,
+			stripe_index = find_live_mirror(fs_info, map, 0,
 					    map->num_stripes,
-					    current->pid % map->num_stripes);
+					    current->pid % map->num_stripes,
+					    dev_replace_is_ongoing);
 			mirror_num = stripe_index + 1;
 		}
 
@@ -4152,9 +4174,11 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info, int rw,
 			stripe_index += mirror_num - 1;
 		else {
 			int old_stripe_index = stripe_index;
-			stripe_index = find_live_mirror(map, stripe_index,
+			stripe_index = find_live_mirror(fs_info, map,
+					      stripe_index,
 					      map->sub_stripes, stripe_index +
-					      current->pid % map->sub_stripes);
+					      current->pid % map->sub_stripes,
+					      dev_replace_is_ongoing);
 			mirror_num = stripe_index - old_stripe_index + 1;
 		}
 	} else {
-- 
1.8.0

Stefan Behrens

2012-Nov-06 16:38 UTC

[PATCH 24/26] Btrfs: increase BTRFS_MAX_MIRRORS by one for dev replace

This change of the define is effective in all modes, but it
is required and used only while a device replace procedure
is running. The reason is that during an active device
replace procedure, the target device of the copy operation
is an additional mirror for the filesystem data that can be
read in order to repair read errors on other disks.
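
As a rough illustration of why the limit grows by one: the next patch in
this series makes btrfs_num_copies() report one extra copy while a replace
is running, so code that sizes arrays by the maximum mirror count needs
room for it. The sketch below uses hypothetical names (SKETCH_MAX_MIRRORS,
sketch_num_copies) and is not taken from the kernel.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed to match the value of BTRFS_MAX_MIRRORS after this patch. */
#define SKETCH_MAX_MIRRORS 3

/*
 * Number of readable copies of a block: the copies provided by the RAID
 * profile, plus the replace target while a replace is running.
 */
static int sketch_num_copies(int profile_copies, bool replace_ongoing)
{
	int copies = profile_copies + (replace_ongoing ? 1 : 0);

	/* callers index per-mirror arrays of SKETCH_MAX_MIRRORS entries */
	assert(copies >= 1 && copies <= SKETCH_MAX_MIRRORS);
	return copies;
}
```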

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
---
 fs/btrfs/ctree.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 01fcfcb..133e5c4 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -48,7 +48,7 @@ struct btrfs_ordered_sum;
 
 #define BTRFS_MAGIC "_BHRfS_M"
 
-#define BTRFS_MAX_MIRRORS 2
+#define BTRFS_MAX_MIRRORS 3
 
 #define BTRFS_MAX_LEVEL 8
 
-- 
1.8.0

Stefan Behrens

2012-Nov-06 16:38 UTC

[PATCH 25/26] Btrfs: allow repair code to include target disk when searching mirrors

Make the target disk of a running device replace operation
available for reading. This is only used as a last resort for
the defect repair procedure. Whether it can be used depends on the
location of the data block to read, because during an ongoing device
replace operation, the target drive is only partially filled
with the filesystem data.
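
The location check can be pictured as follows. This is a sketch with a
hypothetical helper, target_has_copy(), modelled on the cursor_left
comparison in the diff below; the overflow guard is an addition for the
user-space example and is not part of the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: decide whether a stripe of stripe_len bytes that
 * starts at "physical" on the source drive has already been copied, i.e.
 * whether it lies entirely left of the replace cursor.
 */
static bool target_has_copy(uint64_t physical, uint64_t stripe_len,
			    uint64_t cursor_left)
{
	assert(stripe_len > 0);

	/* guard against wrap-around of physical + stripe_len */
	if (physical > UINT64_MAX - stripe_len)
		return false;

	return physical + stripe_len <= cursor_left;
}
```

Only when this holds is the target drive offered as an additional mirror;
otherwise a read that explicitly asks for that mirror fails with EIO.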

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
---
 fs/btrfs/volumes.c | 159 +++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 154 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index edd5573..959a4c8 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -4009,6 +4009,12 @@ int btrfs_num_copies(struct btrfs_fs_info *fs_info, u64 logical, u64 len)
 	else
 		ret = 1;
 	free_extent_map(em);
+
+	btrfs_dev_replace_lock(&fs_info->dev_replace);
+	if (btrfs_dev_replace_is_ongoing(&fs_info->dev_replace))
+		ret++;
+	btrfs_dev_replace_unlock(&fs_info->dev_replace);
+
 	return ret;
 }
 
@@ -4073,6 +4079,8 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info, int rw,
 	struct btrfs_dev_replace *dev_replace = &fs_info->dev_replace;
 	int dev_replace_is_ongoing = 0;
 	int num_alloc_stripes;
+	int patch_the_first_stripe_for_dev_replace = 0;
+	u64 physical_to_patch_in_first_stripe = 0;
 
 	read_lock(&em_tree->lock);
 	em = lookup_extent_mapping(em_tree, logical, *length);
@@ -4089,9 +4097,6 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info, int rw,
 	map = (struct map_lookup *)em->bdev;
 	offset = logical - em->start;
 
-	if (mirror_num > map->num_stripes)
-		mirror_num = 0;
-
 	stripe_nr = offset;
 	/*
 	 * stripe_nr counts the total number of stripes we have to stride
@@ -4123,6 +4128,88 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info, int rw,
 	if (!dev_replace_is_ongoing)
 		btrfs_dev_replace_unlock(dev_replace);
 
+	if (dev_replace_is_ongoing && mirror_num == map->num_stripes + 1 &&
+	    !(rw & (REQ_WRITE | REQ_DISCARD | REQ_GET_READ_MIRRORS)) &&
+	    dev_replace->tgtdev != NULL) {
+		/*
+		 * in dev-replace case, for repair case (that's the only
+		 * case where the mirror is selected explicitly when
+		 * calling btrfs_map_block), blocks left of the left cursor
+		 * can also be read from the target drive.
+		 * For REQ_GET_READ_MIRRORS, the target drive is added as
+		 * the last one to the array of stripes. For READ, it also
+		 * needs to be supported using the same mirror number.
+		 * If the requested block is not left of the left cursor,
+		 * EIO is returned. This can happen because btrfs_num_copies()
+		 * returns one more in the dev-replace case.
+		 */
+		u64 tmp_length = *length;
+		struct btrfs_bio *tmp_bbio = NULL;
+		int tmp_num_stripes;
+		u64 srcdev_devid = dev_replace->srcdev->devid;
+		int index_srcdev = 0;
+		int found = 0;
+		u64 physical_of_found = 0;
+
+		ret = __btrfs_map_block(fs_info, REQ_GET_READ_MIRRORS,
+			     logical, &tmp_length, &tmp_bbio, 0);
+		if (ret) {
+			WARN_ON(tmp_bbio != NULL);
+			goto out;
+		}
+
+		tmp_num_stripes = tmp_bbio->num_stripes;
+		if (mirror_num > tmp_num_stripes) {
+			/*
+			 * REQ_GET_READ_MIRRORS does not contain this
+			 * mirror, that means that the requested area
+			 * is not left of the left cursor
+			 */
+			ret = -EIO;
+			kfree(tmp_bbio);
+			goto out;
+		}
+
+		/*
+		 * process the rest of the function using the mirror_num
+		 * of the source drive. Therefore look it up first.
+		 * At the end, patch the device pointer to the one of the
+		 * target drive.
+		 */
+		for (i = 0; i < tmp_num_stripes; i++) {
+			if (tmp_bbio->stripes[i].dev->devid == srcdev_devid) {
+				/*
+				 * In case of DUP, in order to keep it
+				 * simple, only add the mirror with the
+				 * lowest physical address
+				 */
+				if (found &&
+				    physical_of_found <=
+				     tmp_bbio->stripes[i].physical)
+					continue;
+				index_srcdev = i;
+				found = 1;
+				physical_of_found =
+					tmp_bbio->stripes[i].physical;
+			}
+		}
+
+		if (found) {
+			mirror_num = index_srcdev + 1;
+			patch_the_first_stripe_for_dev_replace = 1;
+			physical_to_patch_in_first_stripe = physical_of_found;
+		} else {
+			WARN_ON(1);
+			ret = -EIO;
+			kfree(tmp_bbio);
+			goto out;
+		}
+
+		kfree(tmp_bbio);
+	} else if (mirror_num > map->num_stripes) {
+		mirror_num = 0;
+	}
+
 	num_stripes = 1;
 	stripe_index = 0;
 	stripe_nr_orig = stripe_nr;
@@ -4193,8 +4280,12 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info, int rw,
 	BUG_ON(stripe_index >= map->num_stripes);
 
 	num_alloc_stripes = num_stripes;
-	if (dev_replace_is_ongoing && (rw & (REQ_WRITE | REQ_DISCARD)))
-		num_alloc_stripes <<= 1;
+	if (dev_replace_is_ongoing) {
+		if (rw & (REQ_WRITE | REQ_DISCARD))
+			num_alloc_stripes <<= 1;
+		if (rw & REQ_GET_READ_MIRRORS)
+			num_alloc_stripes++;
+	}
 	bbio = kzalloc(btrfs_bio_size(num_alloc_stripes), GFP_NOFS);
 	if (!bbio) {
 		ret = -ENOMEM;
@@ -4323,12 +4414,70 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info, int rw,
 			}
 		}
 		num_stripes = index_where_to_add;
+	} else if (dev_replace_is_ongoing && (rw & REQ_GET_READ_MIRRORS) &&
+		   dev_replace->tgtdev != NULL) {
+		u64 srcdev_devid = dev_replace->srcdev->devid;
+		int index_srcdev = 0;
+		int found = 0;
+		u64 physical_of_found = 0;
+
+		/*
+		 * During the dev-replace procedure, the target drive can
+		 * also be used to read data in case it is needed to repair
+		 * a corrupt block elsewhere. This is possible if the
+		 * requested area is left of the left cursor. In this area,
+		 * the target drive is a full copy of the source drive.
+		 */
+		for (i = 0; i < num_stripes; i++) {
+			if (bbio->stripes[i].dev->devid == srcdev_devid) {
+				/*
+				 * In case of DUP, in order to keep it
+				 * simple, only add the mirror with the
+				 * lowest physical address
+				 */
+				if (found &&
+				    physical_of_found <=
+				     bbio->stripes[i].physical)
+					continue;
+				index_srcdev = i;
+				found = 1;
+				physical_of_found = bbio->stripes[i].physical;
+			}
+		}
+		if (found) {
+			u64 length = map->stripe_len;
+
+			if (physical_of_found + length <=
+			    dev_replace->cursor_left) {
+				struct btrfs_bio_stripe *tgtdev_stripe =
+					bbio->stripes + num_stripes;
+
+				tgtdev_stripe->physical = physical_of_found;
+				tgtdev_stripe->length =
+					bbio->stripes[index_srcdev].length;
+				tgtdev_stripe->dev = dev_replace->tgtdev;
+
+				num_stripes++;
+			}
+		}
 	}
 
 	*bbio_ret = bbio;
 	bbio->num_stripes = num_stripes;
 	bbio->max_errors = max_errors;
 	bbio->mirror_num = mirror_num;
+
+	/*
+	 * this is the case that REQ_READ && dev_replace_is_ongoing &&
+	 * mirror_num == num_stripes + 1 && dev_replace target drive is
+	 * available as a mirror
+	 */
+	if (patch_the_first_stripe_for_dev_replace && num_stripes > 0) {
+		WARN_ON(num_stripes > 1);
+		bbio->stripes[0].dev = dev_replace->tgtdev;
+		bbio->stripes[0].physical = physical_to_patch_in_first_stripe;
+		bbio->mirror_num = map->num_stripes + 1;
+	}
 out:
 	if (dev_replace_is_ongoing)
 		btrfs_dev_replace_unlock(dev_replace);
-- 
1.8.0

Stefan Behrens

2012-Nov-06 16:38 UTC

[PATCH 26/26] Btrfs: add support for device replace ioctls

This is the commit that allows the device replace procedure
to be started.

An ioctl() interface is added that supports starting and
canceling the device replace procedure, and retrieving
its status and progress.
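
The start command in the diff below is guarded by an atomic exchange on a
flag so that it cannot run concurrently with other exclusive operations.
A user-space sketch of that pattern with C11 atomics might look like this;
run_exclusive(), sketch_noop() and sketch_try_nested() are hypothetical
names, and the kernel code uses atomic_xchg()/atomic_set() instead.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>

/* 1 while an exclusive operation is running, 0 otherwise. */
static atomic_int exclusive_op_running;

/*
 * Run op(arg) unless another exclusive operation is already running.
 * Returns -EINPROGRESS in that case, otherwise the result of op().
 */
static int run_exclusive(int (*op)(void *), void *arg)
{
	int ret;

	assert(op != NULL);

	/* atomic_exchange returns the old value: nonzero means "taken" */
	if (atomic_exchange(&exclusive_op_running, 1))
		return -EINPROGRESS;

	ret = op(arg);
	atomic_store(&exclusive_op_running, 0);
	return ret;
}

/* Example operation that does nothing. */
static int sketch_noop(void *arg)
{
	(void)arg;
	return 0;
}

/* Example operation that tries to start a second one while running. */
static int sketch_try_nested(void *arg)
{
	int *nested_result = arg;

	*nested_result = run_exclusive(sketch_noop, NULL);
	return 0;
}
```

The exchange both tests and sets the flag in one step, so two callers
cannot both observe it as free.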

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
---
 fs/btrfs/ioctl.c | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 48 insertions(+)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 1a93c14..914a88c 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -55,6 +55,7 @@
 #include "backref.h"
 #include "rcu-string.h"
 #include "send.h"
+#include "dev-replace.h"
 
 /* Mask out flags that are inappropriate for the given type of inode. */
 static inline __u32 btrfs_mask_flags(umode_t mode, __u32 flags)
@@ -3168,6 +3169,51 @@ static long btrfs_ioctl_get_dev_stats(struct btrfs_root *root,
 	return ret;
 }
 
+static long btrfs_ioctl_dev_replace(struct btrfs_root *root, void __user *arg)
+{
+	struct btrfs_ioctl_dev_replace_args *p;
+	int ret;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	p = memdup_user(arg, sizeof(*p));
+	if (IS_ERR(p))
+		return PTR_ERR(p);
+
+	switch (p->cmd) {
+	case BTRFS_IOCTL_DEV_REPLACE_CMD_START:
+		if (atomic_xchg(
+			&root->fs_info->mutually_exclusive_operation_running,
+			1)) {
+			pr_info("btrfs: dev add/delete/balance/replace/resize operation in progress\n");
+			ret = -EINPROGRESS;
+		} else {
+			ret = btrfs_dev_replace_start(root, p);
+			atomic_set(
+			 &root->fs_info->mutually_exclusive_operation_running,
+			 0);
+		}
+		break;
+	case BTRFS_IOCTL_DEV_REPLACE_CMD_STATUS:
+		btrfs_dev_replace_status(root->fs_info, p);
+		ret = 0;
+		break;
+	case BTRFS_IOCTL_DEV_REPLACE_CMD_CANCEL:
+		ret = btrfs_dev_replace_cancel(root->fs_info, p);
+		break;
+	default:
+		ret = -EINVAL;
+		break;
+	}
+
+	if (copy_to_user(arg, p, sizeof(*p)))
+		ret = -EFAULT;
+
+	kfree(p);
+	return ret;
+}
+
 static long btrfs_ioctl_ino_to_path(struct btrfs_root *root, void __user *arg)
 {
 	int ret = 0;
@@ -3823,6 +3869,8 @@ long btrfs_ioctl(struct file *file, unsigned int
 		return btrfs_ioctl_qgroup_create(root, argp);
 	case BTRFS_IOC_QGROUP_LIMIT:
 		return btrfs_ioctl_qgroup_limit(root, argp);
+	case BTRFS_IOC_DEV_REPLACE:
+		return btrfs_ioctl_dev_replace(root, argp);
 	}
 
 	return -ENOTTY;
-- 
1.8.0

Stefan Behrens

2012-Nov-06 18:57 UTC

Re: [PATCH 00/26] Btrfs: Add device replace code

On 11/06/2012 19:17, Bart Noordervliet wrote:
> Great new feature Stefan, thanks a lot! Going to give it a spin right
> away. While compiling I got this (probably superfluous) error that you
> might want to prevent:
>
> fs/btrfs/dev-replace.c: In function ‘btrfs_dev_replace_start’:
> fs/btrfs/dev-replace.c:344:5: warning: ‘ret’ may be used uninitialized
> in this function [-Wmaybe-uninitialized]
>
> Regards,
>
> Bart
>
Yes, this happens on 32 bit builds, not on 64 bit builds. If you look at 
the source code, the compiler is obviously wrong (or I am blind).

         ret = btrfs_dev_replace_find_srcdev(root, args->start.srcdevid,
                                             args->start.srcdev_name,
                                             &src_device);
         mutex_unlock(&fs_info->volume_mutex);
         if (ret) {  <-- this is line 344

But thanks for the feedback anyway!

Hugo Mills

2012-Nov-06 19:20 UTC

Re: [PATCH 00/26] Btrfs: Add device replace code

On Tue, Nov 06, 2012 at 07:57:47PM +0100, Stefan Behrens wrote:
> On 11/06/2012 19:17, Bart Noordervliet wrote:
> >Great new feature Stefan, thanks a lot! Going to give it a spin right
> >away. While compiling I got this (probably superfluous) error that you
> >might want to prevent:
> >
> >fs/btrfs/dev-replace.c: In function ‘btrfs_dev_replace_start’:
> >fs/btrfs/dev-replace.c:344:5: warning: ‘ret’ may be used uninitialized
> >in this function [-Wmaybe-uninitialized]
> >
> >Regards,
> >
> >Bart
> >
> 
> Yes, this happens on 32 bit builds, not on 64 bit builds. If you
> look at the source code, the compiler is obviously wrong (or I am
> blind).
> 
>         ret = btrfs_dev_replace_find_srcdev(root, args->start.srcdevid,
>                                             args->start.srcdev_name,
>                                             &src_device);
>         mutex_unlock(&fs_info->volume_mutex);
>         if (ret) {  <-- this is line 344
> 
> But thanks for the feedback anyway!

   This usually means that somewhere in the call tree under
btrfs_dev_replace_find_srcdev(), there's a way that the function can
return without returning a value, or returning an uninitialised
variable. Not quite sure why it's only on 32 bits, though.

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ==  PGP
key: 515C238D from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
     --- Computer Science is not about computers,  any more than ---     
                     astronomy is about telescopes.

Zach Brown

2012-Nov-06 22:48 UTC

Re: [PATCH 00/26] Btrfs: Add device replace code

> > Yes, this happens on 32 bit builds, not on 64 bit builds. If you
> > look at the source code, the compiler is obviously wrong (or I am
> > blind).
> > 
> >         ret = btrfs_dev_replace_find_srcdev(root, args->start.srcdevid,
> >                                             args->start.srcdev_name,
> >                                             &src_device);
> >         mutex_unlock(&fs_info->volume_mutex);
> >         if (ret) {  <-- this is line 344
> > 
> > But thanks for the feedback anyway!
> 
>    This usually means that somewhere in the call tree under
> btrfs_dev_replace_find_srcdev(), there''s a way that the function
can
> return without returning a value
Indeed, and that''s obviously the case in 15/26 with
btrfs_dev_replace_find_srcdev() when btrfs_find_device() returns a
device.

*mumbles something about the reason for ERR_PTR semantics*

- z
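
The remark about ERR_PTR semantics is not spelled out in the thread. One
reading is that a lookup which returns either a valid pointer or an encoded
error through a single return value leaves no separate "ret" variable that
a code path could forget to set. A user-space sketch of that convention
follows; every sketch_* name is hypothetical, and the helpers only imitate
the kernel's ERR_PTR(), PTR_ERR() and IS_ERR().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Largest errno value that is encoded in a pointer (assumption). */
#define SKETCH_MAX_ERRNO 4095

/* Encode a negative errno value as a pointer. */
static inline void *sketch_err_ptr(long err)
{
	assert(err < 0 && err >= -SKETCH_MAX_ERRNO);
	return (void *)(intptr_t)err;
}

/* Decode the errno value from an error pointer. */
static inline long sketch_ptr_err(const void *p)
{
	return (long)(intptr_t)p;
}

/* True if the pointer lies in the range reserved for error codes. */
static inline int sketch_is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-SKETCH_MAX_ERRNO;
}

struct sketch_device {
	uint64_t devid;
};

static struct sketch_device sketch_devices[] = { { 1 }, { 2 } };

/*
 * Every path returns either a device or an encoded error, so there is no
 * second output that could be left uninitialised.
 */
static struct sketch_device *sketch_find_device(uint64_t devid)
{
	size_t i;

	for (i = 0; i < sizeof(sketch_devices) / sizeof(sketch_devices[0]); i++) {
		if (sketch_devices[i].devid == devid)
			return &sketch_devices[i];
	}
	return sketch_err_ptr(-ENOENT);
}
```

A caller checks the result with sketch_is_err() and only then decodes the
error with sketch_ptr_err().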

Tsutomu Itoh

2012-Nov-07 00:30 UTC

Re: [PATCH 19/26] Btrfs: add code to scrub to copy read data to another disk

(2012/11/07 1:38), Stefan Behrens wrote:
> The device replace procedure makes use of the scrub code. The scrub
> code is the most efficient code to read the allocated data of a disk,
> i.e. it reads sequentially in order to avoid disk head movements, it
> skips unallocated blocks, it uses read ahead mechanisms, and it
> contains all the code to detect and repair defects.
> This commit adds code to scrub to allow the scrub code to copy read
> data to another disk.
> One goal is to be able to perform as fast as possible. Therefore the
> write requests are collected until huge bios are built, and the
> write process is decoupled from the read process with some kind of
> flow control, of course, in order to limit the allocated memory.
> The best performance on spinning disks could be reached when the
> head movements are avoided as much as possible. Therefore a single
> worker is used to interface the read process with the write process.
> The regular scrub operation works as fast as before, it is not
> negatively influenced and actually it is more or less unchanged.
> 
> Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
> ---
>   fs/btrfs/ctree.h |   2 +
>   fs/btrfs/reada.c |  10 +-
>   fs/btrfs/scrub.c | 881 ++++++++++++++++++++++++++++++++++++++++++++++++++-----
>   fs/btrfs/super.c |   3 +-
>   4 files changed, 823 insertions(+), 73 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 83904b5..e17f211 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -1483,6 +1483,8 @@ struct btrfs_fs_info {
>   	struct rw_semaphore scrub_super_lock;
>   	int scrub_workers_refcnt;
>   	struct btrfs_workers scrub_workers;
> +	struct btrfs_workers scrub_wr_completion_workers;
> +	struct btrfs_workers scrub_nocow_workers;
>   
>   #ifdef CONFIG_BTRFS_FS_CHECK_INTEGRITY
>   	u32 check_integrity_print_mask;
> diff --git a/fs/btrfs/reada.c b/fs/btrfs/reada.c
> index 0ddc565..9f363e1 100644
> --- a/fs/btrfs/reada.c
> +++ b/fs/btrfs/reada.c
> @@ -418,12 +418,17 @@ static struct reada_extent *reada_find_extent(struct btrfs_root *root,
>   			 */
>   			continue;
>   		}
> +		if (!dev->bdev) {
> +			/* cannot read ahead on missing device */
> +			continue;
> +		}
>   		prev_dev = dev;
>   		ret = radix_tree_insert(&dev->reada_extents, index, re);
>   		if (ret) {
>   			while (--i >= 0) {
>   				dev = bbio->stripes[i].dev;
>   				BUG_ON(dev == NULL);
> +				/* ignore whether the entry was inserted */
>   				radix_tree_delete(&dev->reada_extents, index);
>   			}
>   			BUG_ON(fs_info == NULL);
> @@ -914,7 +919,10 @@ struct reada_control *btrfs_reada_add(struct btrfs_root *root,
>   	generation = btrfs_header_generation(node);
>   	free_extent_buffer(node);
>   
> -	reada_add_block(rc, start, &max_key, level, generation);
> +	if (reada_add_block(rc, start, &max_key, level, generation)) {
> +		kfree(rc);
> +		return ERR_PTR(-ENOMEM);
> +	}
>   
>   	reada_start_machine(root->fs_info);
>   
> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> index 460e30b..59c69e0 100644
> --- a/fs/btrfs/scrub.c
> +++ b/fs/btrfs/scrub.c
> @@ -25,6 +25,7 @@
>   #include "transaction.h"
>   #include "backref.h"
>   #include "extent_io.h"
> +#include "dev-replace.h"
>   #include "check-integrity.h"
>   #include "rcu-string.h"
>   
> @@ -44,8 +45,15 @@
>   struct scrub_block;
>   struct scrub_ctx;
>   
> -#define SCRUB_PAGES_PER_BIO	16	/* 64k per bio */
> -#define SCRUB_BIOS_PER_CTX	16	/* 1 MB per device in flight */
> +/*
> + * the following three values only influence the performance.
> + * The last one configures the number of parallel and outstanding I/O
> + * operations. The first two values configure an upper limit for the number
> + * of (dynamically allocated) pages that are added to a bio.
> + */
> +#define SCRUB_PAGES_PER_RD_BIO	32	/* 128k per bio */
> +#define SCRUB_PAGES_PER_WR_BIO	32	/* 128k per bio */
> +#define SCRUB_BIOS_PER_SCTX	64	/* 8MB per device in flight */
>   
>   /*
>    * the following value times PAGE_SIZE needs to be large enough to match the
> @@ -62,6 +70,7 @@ struct scrub_page {
>   	u64			generation;
>   	u64			logical;
>   	u64			physical;
> +	u64			physical_for_dev_replace;
>   	atomic_t		ref_count;
>   	struct {
>   		unsigned int	mirror_num:8;
> @@ -79,7 +88,11 @@ struct scrub_bio {
>   	int			err;
>   	u64			logical;
>   	u64			physical;
> -	struct scrub_page	*pagev[SCRUB_PAGES_PER_BIO];
> +#if SCRUB_PAGES_PER_WR_BIO >= SCRUB_PAGES_PER_RD_BIO
> +	struct scrub_page	*pagev[SCRUB_PAGES_PER_WR_BIO];
> +#else
> +	struct scrub_page	*pagev[SCRUB_PAGES_PER_RD_BIO];
> +#endif
>   	int			page_count;
>   	int			next_free;
>   	struct btrfs_work	work;
> @@ -99,8 +112,16 @@ struct scrub_block {
>   	};
>   };
>   
> +struct scrub_wr_ctx {
> +	struct scrub_bio *wr_curr_bio;
> +	struct btrfs_device *tgtdev;
> +	int pages_per_wr_bio; /* <= SCRUB_PAGES_PER_WR_BIO */
> +	atomic_t flush_all_writes;
> +	struct mutex wr_lock;
> +};
> +
>   struct scrub_ctx {
> -	struct scrub_bio	*bios[SCRUB_BIOS_PER_CTX];
> +	struct scrub_bio	*bios[SCRUB_BIOS_PER_SCTX];
>   	struct btrfs_root	*dev_root;
>   	int			first_free;
>   	int			curr;
> @@ -112,12 +133,13 @@ struct scrub_ctx {
>   	struct list_head	csum_list;
>   	atomic_t		cancel_req;
>   	int			readonly;
> -	int			pages_per_bio; /* <= SCRUB_PAGES_PER_BIO */
> +	int			pages_per_rd_bio;
>   	u32			sectorsize;
>   	u32			nodesize;
>   	u32			leafsize;
>   
>   	int			is_dev_replace;
> +	struct scrub_wr_ctx	wr_ctx;
>   
>   	/*
>   	 * statistics
> @@ -135,6 +157,15 @@ struct scrub_fixup_nodatasum {
>   	int			mirror_num;
>   };
>   
> +struct scrub_copy_nocow_ctx {
> +	struct scrub_ctx	*sctx;
> +	u64			logical;
> +	u64			len;
> +	int			mirror_num;
> +	u64			physical_for_dev_replace;
> +	struct btrfs_work	work;
> +};
> +
>   struct scrub_warning {
>   	struct btrfs_path	*path;
>   	u64			extent_item_size;
> @@ -156,8 +187,9 @@ static void scrub_pending_trans_workers_dec(struct scrub_ctx *sctx);
>   static int scrub_handle_errored_block(struct scrub_block *sblock_to_check);
>   static int scrub_setup_recheck_block(struct scrub_ctx *sctx,
>   				     struct btrfs_fs_info *fs_info,
> +				     struct scrub_block *original_sblock,
>   				     u64 length, u64 logical,
> -				     struct scrub_block *sblock);
> +				     struct scrub_block *sblocks_for_recheck);
>   static void scrub_recheck_block(struct btrfs_fs_info *fs_info,
>   				struct scrub_block *sblock, int is_metadata,
>   				int have_csum, u8 *csum, u64 generation,
> @@ -174,6 +206,9 @@ static int scrub_repair_block_from_good_copy(struct scrub_block *sblock_bad,
>   static int scrub_repair_page_from_good_copy(struct scrub_block *sblock_bad,
>   					    struct scrub_block *sblock_good,
>   					    int page_num, int force_write);
> +static void scrub_write_block_to_dev_replace(struct scrub_block *sblock);
> +static int scrub_write_page_to_dev_replace(struct scrub_block *sblock,
> +					   int page_num);
>   static int scrub_checksum_data(struct scrub_block *sblock);
>   static int scrub_checksum_tree_block(struct scrub_block *sblock);
>   static int scrub_checksum_super(struct scrub_block *sblock);
> @@ -181,14 +216,38 @@ static void scrub_block_get(struct scrub_block *sblock);
>   static void scrub_block_put(struct scrub_block *sblock);
>   static void scrub_page_get(struct scrub_page *spage);
>   static void scrub_page_put(struct scrub_page *spage);
> -static int scrub_add_page_to_bio(struct scrub_ctx *sctx,
> -				 struct scrub_page *spage);
> +static int scrub_add_page_to_rd_bio(struct scrub_ctx *sctx,
> +				    struct scrub_page *spage);
>   static int scrub_pages(struct scrub_ctx *sctx, u64 logical, u64 len,
>   		       u64 physical, struct btrfs_device *dev, u64 flags,
> -		       u64 gen, int mirror_num, u8 *csum, int force);
> +		       u64 gen, int mirror_num, u8 *csum, int force,
> +		       u64 physical_for_dev_replace);
>   static void scrub_bio_end_io(struct bio *bio, int err);
>   static void scrub_bio_end_io_worker(struct btrfs_work *work);
>   static void scrub_block_complete(struct scrub_block *sblock);
> +static void scrub_remap_extent(struct btrfs_fs_info *fs_info,
> +			       u64 extent_logical, u64 extent_len,
> +			       u64 *extent_physical,
> +			       struct btrfs_device **extent_dev,
> +			       int *extent_mirror_num);
> +static int scrub_setup_wr_ctx(struct scrub_ctx *sctx,
> +			      struct scrub_wr_ctx *wr_ctx,
> +			      struct btrfs_fs_info *fs_info,
> +			      struct btrfs_device *dev,
> +			      int is_dev_replace);
> +static void scrub_free_wr_ctx(struct scrub_wr_ctx *wr_ctx);
> +static int scrub_add_page_to_wr_bio(struct scrub_ctx *sctx,
> +				    struct scrub_page *spage);
> +static void scrub_wr_submit(struct scrub_ctx *sctx);
> +static void scrub_wr_bio_end_io(struct bio *bio, int err);
> +static void scrub_wr_bio_end_io_worker(struct btrfs_work *work);
> +static int write_page_nocow(struct scrub_ctx *sctx,
> +			    u64 physical_for_dev_replace, struct page *page);
> +static int copy_nocow_pages_for_inode(u64 inum, u64 offset, u64 root,
> +				      void *ctx);
> +static int copy_nocow_pages(struct scrub_ctx *sctx, u64 logical, u64 len,
> +			    int mirror_num, u64 physical_for_dev_replace);
> +static void copy_nocow_pages_worker(struct btrfs_work *work);
>   
>   
>   static void scrub_pending_bio_inc(struct scrub_ctx *sctx)
> @@ -262,19 +321,20 @@ static noinline_for_stack void scrub_free_ctx(struct scrub_ctx *sctx)
>   	if (!sctx)
>   		return;
>   
> +	scrub_free_wr_ctx(&sctx->wr_ctx);
> +
>   	/* this can happen when scrub is cancelled */
>   	if (sctx->curr != -1) {
>   		struct scrub_bio *sbio = sctx->bios[sctx->curr];
>   
>   		for (i = 0; i < sbio->page_count; i++) {
> -			BUG_ON(!sbio->pagev[i]);
> -			BUG_ON(!sbio->pagev[i]->page);
> +			WARN_ON(!sbio->pagev[i]->page);
>   			scrub_block_put(sbio->pagev[i]->sblock);
>   		}
>   		bio_put(sbio->bio);
>   	}
>   
> -	for (i = 0; i < SCRUB_BIOS_PER_CTX; ++i) {
> +	for (i = 0; i < SCRUB_BIOS_PER_SCTX; ++i) {
>   		struct scrub_bio *sbio = sctx->bios[i];
>   
>   		if (!sbio)
> @@ -292,18 +352,29 @@ struct scrub_ctx *scrub_setup_ctx(struct btrfs_device *dev, int is_dev_replace)
>   	struct scrub_ctx *sctx;
>   	int		i;
>   	struct btrfs_fs_info *fs_info = dev->dev_root->fs_info;
> -	int pages_per_bio;
> +	int pages_per_rd_bio;
> +	int ret;
>   
> -	pages_per_bio = min_t(int, SCRUB_PAGES_PER_BIO,
> -			      bio_get_nr_vecs(dev->bdev));
> +	/*
> +	 * the setting of pages_per_rd_bio is correct for scrub but might
> +	 * be wrong for the dev_replace code where we might read from
> +	 * different devices in the initial huge bios. However, that
> +	 * code is able to correctly handle the case when adding a page
> +	 * to a bio fails.
> +	 */
> +	if (dev->bdev)
> +		pages_per_rd_bio = min_t(int, SCRUB_PAGES_PER_RD_BIO,
> +					 bio_get_nr_vecs(dev->bdev));
> +	else
> +		pages_per_rd_bio = SCRUB_PAGES_PER_RD_BIO;
>   	sctx = kzalloc(sizeof(*sctx), GFP_NOFS);
>   	if (!sctx)
>   		goto nomem;
>   	sctx->is_dev_replace = is_dev_replace;
> -	sctx->pages_per_bio = pages_per_bio;
> +	sctx->pages_per_rd_bio = pages_per_rd_bio;
>   	sctx->curr = -1;
>   	sctx->dev_root = dev->dev_root;
> -	for (i = 0; i < SCRUB_BIOS_PER_CTX; ++i) {
> +	for (i = 0; i < SCRUB_BIOS_PER_SCTX; ++i) {
>   		struct scrub_bio *sbio;
>   
>   		sbio = kzalloc(sizeof(*sbio), GFP_NOFS);
> @@ -316,7 +387,7 @@ struct scrub_ctx *scrub_setup_ctx(struct btrfs_device *dev, int is_dev_replace)
>   		sbio->page_count = 0;
>   		sbio->work.func = scrub_bio_end_io_worker;
>   
> -		if (i != SCRUB_BIOS_PER_CTX - 1)
> +		if (i != SCRUB_BIOS_PER_SCTX - 1)
>   			sctx->bios[i]->next_free = i + 1;
>   		else
>   			sctx->bios[i]->next_free = -1;
> @@ -334,6 +405,13 @@ struct scrub_ctx *scrub_setup_ctx(struct btrfs_device *dev, int is_dev_replace)
>   	spin_lock_init(&sctx->list_lock);
>   	spin_lock_init(&sctx->stat_lock);
>   	init_waitqueue_head(&sctx->list_wait);
> +
> +	ret = scrub_setup_wr_ctx(sctx, &sctx->wr_ctx, fs_info,
> +				 fs_info->dev_replace.tgtdev, is_dev_replace);
> +	if (ret) {
> +		scrub_free_ctx(sctx);
> +		return ERR_PTR(ret);
> +	}
>   	return sctx;
>   
>   nomem:
> @@ -341,7 +419,8 @@ nomem:
>   	return ERR_PTR(-ENOMEM);
>   }
>   
> -static int scrub_print_warning_inode(u64 inum, u64 offset, u64 root, void *ctx)
> +static int scrub_print_warning_inode(u64 inum, u64 offset, u64 root,
> +				     void *warn_ctx)
>   {
>   	u64 isize;
>   	u32 nlink;
> @@ -349,7 +428,7 @@ static int scrub_print_warning_inode(u64 inum, u64 offset, u64 root, void *ctx)
>   	int i;
>   	struct extent_buffer *eb;
>   	struct btrfs_inode_item *inode_item;
> -	struct scrub_warning *swarn = ctx;
> +	struct scrub_warning *swarn = warn_ctx;
>   	struct btrfs_fs_info *fs_info = swarn->dev->dev_root->fs_info;
>   	struct inode_fs_paths *ipath = NULL;
>   	struct btrfs_root *local_root;
> @@ -492,11 +571,11 @@ out:
>   	kfree(swarn.msg_buf);
>   }
>   
> -static int scrub_fixup_readpage(u64 inum, u64 offset, u64 root, void *ctx)
> +static int scrub_fixup_readpage(u64 inum, u64 offset, u64 root, void *fixup_ctx)
>   {
>   	struct page *page = NULL;
>   	unsigned long index;
> -	struct scrub_fixup_nodatasum *fixup = ctx;
> +	struct scrub_fixup_nodatasum *fixup = fixup_ctx;
>   	int ret;
>   	int corrected = 0;
>   	struct btrfs_key key;
> @@ -660,7 +739,9 @@ out:
>   		spin_lock(&sctx->stat_lock);
>   		++sctx->stat.uncorrectable_errors;
>   		spin_unlock(&sctx->stat_lock);
> -
> +		btrfs_dev_replace_stats_inc(
> +			&sctx->dev_root->fs_info->dev_replace.
> +			num_uncorrectable_read_errors);
>   		printk_ratelimited_in_rcu(KERN_ERR
>   			"btrfs: unable to fixup (nodatasum) error at logical %llu on dev %s\n",
>   			(unsigned long long)fixup->logical,
> @@ -715,6 +796,11 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
>   	csum = sblock_to_check->pagev[0]->csum;
>   	dev = sblock_to_check->pagev[0]->dev;
>   
> +	if (sctx->is_dev_replace && !is_metadata && !have_csum) {
> +		sblocks_for_recheck = NULL;
> +		goto nodatasum_case;
> +	}
> +
>   	/*
>   	 * read all mirrors one after the other. This includes to
>   	 * re-read the extent or metadata block that failed (that was
> @@ -758,7 +844,7 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
>   	}
>   
>   	/* setup the context, map the logical blocks and alloc the pages */
> -	ret = scrub_setup_recheck_block(sctx, fs_info, length,
> +	ret = scrub_setup_recheck_block(sctx, fs_info, sblock_to_check, length,
>   					logical, sblocks_for_recheck);
>   	if (ret) {
>   		spin_lock(&sctx->stat_lock);
> @@ -789,6 +875,8 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
>   		sctx->stat.unverified_errors++;
>   		spin_unlock(&sctx->stat_lock);
>   
> +		if (sctx->is_dev_replace)
> +			scrub_write_block_to_dev_replace(sblock_bad);
>   		goto out;
>   	}
>   
> @@ -822,12 +910,15 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
>   				BTRFS_DEV_STAT_CORRUPTION_ERRS);
>   	}
>   
> -	if (sctx->readonly)
> +	if (sctx->readonly && !sctx->is_dev_replace)
>   		goto did_not_correct_error;
>   
>   	if (!is_metadata && !have_csum) {
>   		struct scrub_fixup_nodatasum *fixup_nodatasum;
>   
> +nodatasum_case:
> +		WARN_ON(sctx->is_dev_replace);
> +
>   		/*
>   		 * !is_metadata and !have_csum, this means that the data
>   		 * might not be COW'ed, that it might be modified
> @@ -883,18 +974,79 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
>   		if (!sblock_other->header_error &&
>   		    !sblock_other->checksum_error &&
>   		    sblock_other->no_io_error_seen) {
> -			int force_write = is_metadata || have_csum;
> -
> -			ret = scrub_repair_block_from_good_copy(sblock_bad,
> -								sblock_other,
> -								force_write);
> +			if (sctx->is_dev_replace) {
> +				scrub_write_block_to_dev_replace(sblock_other);
> +			} else {
> +				int force_write = is_metadata || have_csum;
> +
> +				ret = scrub_repair_block_from_good_copy(
> +						sblock_bad, sblock_other,
> +						force_write);
> +			}
>   			if (0 == ret)
>   				goto corrected_error;
>   		}
>   	}
>   
>   	/*
> -	 * in case of I/O errors in the area that is supposed to be
> +	 * for dev_replace, pick good pages and write to the target device.
> +	 */
> +	if (sctx->is_dev_replace) {
> +		success = 1;
> +		for (page_num = 0; page_num < sblock_bad->page_count;
> +		     page_num++) {
> +			int sub_success;
> +
> +			sub_success = 0;
> +			for (mirror_index = 0;
> +			     mirror_index < BTRFS_MAX_MIRRORS &&
> +			     sblocks_for_recheck[mirror_index].page_count > 0;
> +			     mirror_index++) {
> +				struct scrub_block *sblock_other =
> +					sblocks_for_recheck + mirror_index;
> +				struct scrub_page *page_other =
> +					sblock_other->pagev[page_num];
> +
> +				if (!page_other->io_error) {
> +					ret = scrub_write_page_to_dev_replace(
> +							sblock_other, page_num);
> +					if (ret == 0) {
> +						/* succeeded for this page */
> +						sub_success = 1;
> +						break;
> +					} else {
> +						btrfs_dev_replace_stats_inc(
> +							&sctx->dev_root->
> +							fs_info->dev_replace.
> +							num_write_errors);
> +					}
> +				}
> +			}
> +
> +			if (!sub_success) {
> +				/*
> +				 * did not find a mirror to fetch the page
> +				 * from. scrub_write_page_to_dev_replace()
> +				 * handles this case (page->io_error), by
> +				 * filling the block with zeros before
> +				 * submitting the write request
> +				 */
> +				success = 0;
> +				ret = scrub_write_page_to_dev_replace(
> +						sblock_bad, page_num);
> +				if (ret)
> +					btrfs_dev_replace_stats_inc(
> +						&sctx->dev_root->fs_info->
> +						dev_replace.num_write_errors);
> +			}
> +		}
> +
> +		goto out;
> +	}
> +
> +	/*
> +	 * for regular scrub, repair those pages that are errored.
> +	 * In case of I/O errors in the area that is supposed to be
>   	 * repaired, continue by picking good copies of those pages.
>   	 * Select the good pages from mirrors to rewrite bad pages from
>   	 * the area to fix. Afterwards verify the checksum of the block
> @@ -1017,6 +1169,7 @@ out:
>   
>   static int scrub_setup_recheck_block(struct scrub_ctx *sctx,
>   				     struct btrfs_fs_info *fs_info,
> +				     struct scrub_block *original_sblock,
>   				     u64 length, u64 logical,
>   				     struct scrub_block *sblocks_for_recheck)
>   {
> @@ -1047,7 +1200,7 @@ static int scrub_setup_recheck_block(struct scrub_ctx *sctx,
>   			return -EIO;
>   		}
>   
> -		BUG_ON(page_index >= SCRUB_PAGES_PER_BIO);
> +		BUG_ON(page_index >= SCRUB_PAGES_PER_RD_BIO);
>   		for (mirror_index = 0; mirror_index < (int)bbio->num_stripes;
>   		     mirror_index++) {
>   			struct scrub_block *sblock;
> @@ -1071,6 +1224,10 @@ leave_nomem:
>   			sblock->pagev[page_index] = page;
>   			page->logical = logical;
>   			page->physical = bbio->stripes[mirror_index].physical;
> +			BUG_ON(page_index >= original_sblock->page_count);
> +			page->physical_for_dev_replace =
> +				original_sblock->pagev[page_index]->
> +				physical_for_dev_replace;
>   			/* for missing devices, dev->bdev is NULL */
>   			page->dev = bbio->stripes[mirror_index].dev;
>   			page->mirror_num = mirror_index + 1;
> @@ -1249,6 +1406,12 @@ static int scrub_repair_page_from_good_copy(struct scrub_block *sblock_bad,
>   		int ret;
>   		DECLARE_COMPLETION_ONSTACK(complete);
>   
> +		if (!page_bad->dev->bdev) {
> +			printk_ratelimited(KERN_WARNING
> +				"btrfs: scrub_repair_page_from_good_copy(bdev == NULL) is unexpected!\n");
> +			return -EIO;
> +		}
> +
>   		bio = bio_alloc(GFP_NOFS, 1);
>   		if (!bio)
>   			return -EIO;
> @@ -1269,6 +1432,9 @@ static int scrub_repair_page_from_good_copy(struct scrub_block *sblock_bad,
>   		if (!bio_flagged(bio, BIO_UPTODATE)) {
>   			btrfs_dev_stat_inc_and_print(page_bad->dev,
>   				BTRFS_DEV_STAT_WRITE_ERRS);
> +			btrfs_dev_replace_stats_inc(
> +				&sblock_bad->sctx->dev_root->fs_info->
> +				dev_replace.num_write_errors);
>   			bio_put(bio);
>   			return -EIO;
>   		}
> @@ -1278,7 +1444,166 @@ static int scrub_repair_page_from_good_copy(struct scrub_block *sblock_bad,
>   	return 0;
>   }
>   
> -static void scrub_checksum(struct scrub_block *sblock)
> +static void scrub_write_block_to_dev_replace(struct scrub_block *sblock)
> +{
> +	int page_num;
> +
> +	for (page_num = 0; page_num < sblock->page_count; page_num++) {
> +		int ret;
> +
> +		ret = scrub_write_page_to_dev_replace(sblock, page_num);
> +		if (ret)
> +			btrfs_dev_replace_stats_inc(
> +				&sblock->sctx->dev_root->fs_info->dev_replace.
> +				num_write_errors);
> +	}
> +}
> +
> +static int scrub_write_page_to_dev_replace(struct scrub_block *sblock,
> +					   int page_num)
> +{
> +	struct scrub_page *spage = sblock->pagev[page_num];
> +
> +	BUG_ON(spage->page == NULL);
> +	if (spage->io_error) {
> +		void *mapped_buffer = kmap_atomic(spage->page);
> +
> +		memset(mapped_buffer, 0, PAGE_CACHE_SIZE);
> +		flush_dcache_page(spage->page);
> +		kunmap_atomic(mapped_buffer);
> +	}
> +	return scrub_add_page_to_wr_bio(sblock->sctx, spage);
> +}
> +
> +static int scrub_add_page_to_wr_bio(struct scrub_ctx *sctx,
> +				    struct scrub_page *spage)
> +{
> +	struct scrub_wr_ctx *wr_ctx = &sctx->wr_ctx;
> +	struct scrub_bio *sbio;
> +	int ret;
> +
> +	mutex_lock(&wr_ctx->wr_lock);
> +again:
> +	if (!wr_ctx->wr_curr_bio) {
> +		wr_ctx->wr_curr_bio = kzalloc(sizeof(*wr_ctx->wr_curr_bio),
> +					      GFP_NOFS);
> +		if (!wr_ctx->wr_curr_bio)

I think mutex_unlock(&wr_ctx->wr_lock) is necessary before it returns.

> +			return -ENOMEM;
> +		wr_ctx->wr_curr_bio->sctx = sctx;
> +		wr_ctx->wr_curr_bio->page_count = 0;
> +	}...
...

- Tsutomu
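
The per-page fallback in the quoted scrub_handle_errored_block() hunk (take each page from the first mirror without an I/O error, otherwise write zeros) can be sketched in user space roughly as follows. This is only an illustration: toy_page, TOY_PAGE_SIZE, TOY_MAX_MIRRORS and pick_page_for_target() are hypothetical names, and the sketch leaves out the retry on write failure and the error counters that the patch maintains.

```c
#include <assert.h>
#include <string.h>

#define TOY_PAGE_SIZE 16	/* hypothetical, stands in for the page size */
#define TOY_MAX_MIRRORS 3	/* hypothetical upper bound on mirrors */

/* Hypothetical stand-in for one mirror's copy of a page. */
struct toy_page {
	int io_error;
	unsigned char data[TOY_PAGE_SIZE];
};

/*
 * For one page index, copy the first mirror copy that has no I/O error
 * into dst. If every mirror failed, fill dst with zeros instead.
 * Returns 1 if a good copy was found, 0 if the page was zero-filled.
 */
static int pick_page_for_target(const struct toy_page *mirrors,
				int nr_mirrors, unsigned char *dst)
{
	int i;

	assert(nr_mirrors > 0 && nr_mirrors <= TOY_MAX_MIRRORS);
	for (i = 0; i < nr_mirrors; i++) {
		if (!mirrors[i].io_error) {
			memcpy(dst, mirrors[i].data, TOY_PAGE_SIZE);
			return 1;
		}
	}
	memset(dst, 0, TOY_PAGE_SIZE);
	return 0;
}
```

A caller would loop over all page indexes of the block and treat the block as fully copied only if every page returned 1.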


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Tsutomu Itoh

2012-Nov-07 02:14 UTC

Re: [PATCH 00/26] Btrfs: Add device replace code

(2012/11/07 1:38), Stefan Behrens wrote:
> This patch series adds support for replacing disks at runtime.
> 
> It replaces the following steps in case a disk was lost:
>      mount ... -o degraded
>      btrfs device add new_disk
>      btrfs device delete missing
> 
> Or in case a disk just needs to be replaced because the error rate
> is increasing:
>      btrfs device add new_disk
>      btrfs device delete old_disk
> 
> Instead just run:
>      btrfs replace mountpoint old_disk new_disk
> 
> The device replace operation takes place at runtime on a live
> filesystem, you don't need to unmount it or stop active tasks.
> It is safe to crash or lose power during the operation, the
> process resumes with the next mount.
> 
> The copy usually takes place at 90% of the available platter
> speed if no additional disk I/O is ongoing during the copy
> operation, thus the degraded state without redundancy can be
> left quickly.
> 
> The copy process is started manually. It is a different project
> to react on an increased device I/O error rate with an automatic
> start of this procedure.
> 
> The patch series is based on btrfs-next and also available here:
> git://btrfs.giantdisaster.de/git/btrfs device-replace
> 
> The user mode part is the top commit of
> git://btrfs.giantdisaster.de/git/btrfs-progs master
> 
> 
> replace start [-Bfr] <path> <srcdev>|<devid> <targetdev>

I think that
 "btrfs replace start [-Bfr] <srcdev>|<devid> <targetdev> <path>"
of the same synopsis as other subcommands is better.

- Tsutomu
>         Replace device of a btrfs filesystem.   On  a  live  filesystem,
>         duplicate  the  data  to  the  target  device which is currently
>         stored on the source device. If the source device is not  avail-
>         able anymore, or if the -r option is set, the data is built only
>         using the RAID redundancy mechanisms. After  completion  of  the
>         operation, the source device is removed from the filesystem.  If
>         the srcdev is a numerical value, it is assumed to be the  device
>         id  of the filesystem which is mounted at mount_point, otherwise
>         it is the path to the source device. If  the  source  device  is
>         disconnected, from the system, you have to use the devid parame-
>         ter format.  The targetdev needs to be same size or larger  than
>         the srcdev.
> 
>         Options
> 
>         -r     only  read  from  srcdev  if  no other zero-defect mirror
>                exists (enable this  if  your  drive  has  lots  of  read
>                errors, the access would be very slow)
> 
>         -f     force  using  and  overwriting targetdev even if it looks
>                like  containing  a  valid  btrfs  filesystem.  A   valid
>                filesystem  is  assumed  if  a  btrfs superblock is found
>                which contains a correct checksum. Devices which are cur-
>                rently  mounted  are never allowed to be used as the tar-
>                getdev
> 
>         -B     do not background
> 
> 
> replace status [-1] <path>
>         Print status  and  progress  information  of  a  running  device
>         replace operation.
> 
>         Options
> 
>         -1     print once instead of print continuously until the replace
>                operation finishes (or is canceled)
> 
> 
> replace cancel <path>
>         Cancel a running device replace operation.
> 
> 
> Stefan Behrens (26):
>    Btrfs: rename the scrub context structure
>    Btrfs: remove the block device pointer from the scrub context struct
>    Btrfs: make the scrub page array dynamically allocated
>    Btrfs: in scrub repair code, optimize the reading of mirrors
>    Btrfs: in scrub repair code, simplify alloc error handling
>    Btrfs: cleanup scrub bio and worker wait code
>    Btrfs: add two more find_device() methods
>    Btrfs: Pass fs_info to btrfs_num_copies() instead of mapping_tree
>    Btrfs: pass fs_info to btrfs_map_block() instead of mapping_tree
>    Btrfs: add btrfs_scratch_superblock() function
>    Btrfs: pass fs_info instead of root
>    Btrfs: avoid risk of a deadlock in btrfs_handle_error
>    Btrfs: enhance btrfs structures for device replace support
>    Btrfs: introduce a btrfs_dev_replace_item type
>    Btrfs: add a new source file with device replace code
>    Btrfs: disallow mutually exclusiv admin operations from user mode
>    Btrfs: disallow some operations on the device replace target device
>    Btrfs: handle errors from btrfs_map_bio() everywhere
>    Btrfs: add code to scrub to copy read data to another disk
>    Btrfs: change core code of btrfs to support the device replace
>      operations
>    Btrfs: introduce GET_READ_MIRRORS functionality for btrfs_map_block()
>    Btrfs: changes to live filesystem are also written to replacement
>      disk
>    Btrfs: optionally avoid reads from device replace source drive
>    Btrfs: increase BTRFS_MAX_MIRRORS by one for dev replace
>    Btrfs: allow repair code to include target disk when searching
>      mirrors
>    Btrfs: add support for device replace ioctls
> 
>   fs/btrfs/Makefile          |    2 +-
>   fs/btrfs/check-integrity.c |   29 +-
>   fs/btrfs/compression.c     |    6 +-
>   fs/btrfs/ctree.h           |  127 ++-
>   fs/btrfs/dev-replace.c     |  843 ++++++++++++++++++++
>   fs/btrfs/dev-replace.h     |   44 ++
>   fs/btrfs/disk-io.c         |   79 +-
>   fs/btrfs/extent-tree.c     |    5 +-
>   fs/btrfs/extent_io.c       |   28 +-
>   fs/btrfs/extent_io.h       |    4 +-
>   fs/btrfs/inode.c           |   39 +-
>   fs/btrfs/ioctl.c           |  117 ++-
>   fs/btrfs/ioctl.h           |   45 ++
>   fs/btrfs/print-tree.c      |    3 +
>   fs/btrfs/reada.c           |   31 +-
>   fs/btrfs/scrub.c           | 1822 ++++++++++++++++++++++++++++++++------------
>   fs/btrfs/super.c           |   30 +-
>   fs/btrfs/transaction.c     |    7 +-
>   fs/btrfs/volumes.c         |  624 +++++++++++++--
>   fs/btrfs/volumes.h         |   26 +-
>   20 files changed, 3244 insertions(+), 667 deletions(-)
>   create mode 100644 fs/btrfs/dev-replace.c
>   create mode 100644 fs/btrfs/dev-replace.h
> 

Stefan Behrens

2012-Nov-07 10:29 UTC

Re: [PATCH 00/26] Btrfs: Add device replace code

On Tue, 6 Nov 2012 14:48:36 -0800, Zach Brown wrote:
>>> Yes, this happens on 32 bit builds, not on 64 bit builds. If you
>>> look at the source code, the compiler is obviously wrong (or I am
>>> blind).
>>>
>>>         ret = btrfs_dev_replace_find_srcdev(root, args->start.srcdevid,
>>>                                             args->start.srcdev_name,
>>>                                             &src_device);
>>>         mutex_unlock(&fs_info->volume_mutex);
>>>         if (ret) {  <-- this is line 344
>>>
>>> But thanks for the feedback anyway!
>>
>>    This usually means that somewhere in the call tree under
>> btrfs_dev_replace_find_srcdev(), there's a way that the function can
>> return without returning a value
> 
> Indeed, and that's obviously the case in 15/26 with
> btrfs_dev_replace_find_srcdev() when btrfs_find_device() returns a
> device.
> 
> *mumbles something about the reason for ERR_PTR semantics*

Thanks Bart, Hugo and Zach!
I'll fix it in
git://btrfs.giantdisaster.de/git/btrfs device-replace
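
For readers unfamiliar with the ERR_PTR idiom mentioned above: the lookup returns either a valid pointer or an error code encoded in the pointer value, so there is no separate return variable that one code path can forget to set. The following is a minimal user-space sketch of that idea, not the kernel's implementation; err_ptr(), ptr_err(), is_err(), struct device and find_srcdev() are hypothetical names that only mirror the kernel's ERR_PTR/PTR_ERR/IS_ERR helpers.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Encode a negative errno value in a pointer. */
static inline void *err_ptr(long error)
{
	assert(error < 0 && error >= -MAX_ERRNO);
	return (void *)(intptr_t)error;
}

/* Decode the errno value from an error pointer. */
static inline long ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

/* True if ptr lies in the top MAX_ERRNO bytes of the address space. */
static inline int is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)(-MAX_ERRNO);
}

/* Hypothetical device table for the demonstration. */
struct device {
	uint64_t devid;
};

static struct device devices[] = { { 1 }, { 2 } };

/*
 * Every path returns something meaningful: either the device or an
 * encoded error. There is no out-parameter plus "ret" pair to keep
 * in sync.
 */
static struct device *find_srcdev(uint64_t devid)
{
	size_t i;

	for (i = 0; i < sizeof(devices) / sizeof(devices[0]); i++)
		if (devices[i].devid == devid)
			return &devices[i];
	return err_ptr(-ENOENT);
}
```

A caller checks is_err() on the result and uses ptr_err() to recover the error code.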

Stefan Behrens

2012-Nov-07 10:30 UTC

Re: [PATCH 19/26] Btrfs: add code to scrub to copy read data to another disk

On Wed, 07 Nov 2012 09:30:59 +0900, Tsutomu Itoh wrote:
> (2012/11/07 1:38), Stefan Behrens wrote:
>> +static int scrub_add_page_to_wr_bio(struct scrub_ctx *sctx,
>> +				    struct scrub_page *spage)
>> +{
>> +	struct scrub_wr_ctx *wr_ctx = &sctx->wr_ctx;
>> +	struct scrub_bio *sbio;
>> +	int ret;
>> +
>> +	mutex_lock(&wr_ctx->wr_lock);
>> +again:
>> +	if (!wr_ctx->wr_curr_bio) {
>> +		wr_ctx->wr_curr_bio = kzalloc(sizeof(*wr_ctx->wr_curr_bio),
>> +					      GFP_NOFS);
>> +		if (!wr_ctx->wr_curr_bio)
> 
> I think mutex_unlock(&wr_ctx->wr_lock) is necessary before it returns.
> 
>> +			return -ENOMEM;

Thanks Tsutomu!
I'll fix it in
git://btrfs.giantdisaster.de/git/btrfs device-replace
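
One common way to avoid this kind of leaked lock is to route every exit after the lock is taken through a single unlock label. The sketch below shows the pattern in user space under stated assumptions: toy_mutex, wr_bio, wr_ctx and add_page_to_wr_bio() are hypothetical stand-ins for the kernel mutex, scrub_bio, scrub_wr_ctx and scrub_add_page_to_wr_bio(), and alloc_fails is a test hook that simulates kzalloc() returning NULL. It is not the fix that went into the tree.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Toy lock that only tracks whether it is held (not thread-safe). */
struct toy_mutex {
	int held;
};

static void toy_lock(struct toy_mutex *m)
{
	assert(!m->held);
	m->held = 1;
}

static void toy_unlock(struct toy_mutex *m)
{
	assert(m->held);
	m->held = 0;
}

/* Hypothetical stand-ins for scrub_bio and scrub_wr_ctx. */
struct wr_bio {
	int page_count;
};

struct wr_ctx {
	struct toy_mutex lock;
	struct wr_bio *curr_bio;
};

/*
 * Add one page to the current write bio, allocating the bio on demand.
 * Every exit after toy_lock() passes through "out", so the allocation
 * failure path cannot return with the lock still held.
 */
static int add_page_to_wr_bio(struct wr_ctx *ctx, int alloc_fails)
{
	int ret = 0;

	toy_lock(&ctx->lock);
	if (!ctx->curr_bio) {
		ctx->curr_bio = alloc_fails ? NULL :
				calloc(1, sizeof(*ctx->curr_bio));
		if (!ctx->curr_bio) {
			ret = -ENOMEM;
			goto out;
		}
	}
	ctx->curr_bio->page_count++;
out:
	toy_unlock(&ctx->lock);
	return ret;
}
```

An explicit unlock immediately before the early return, as suggested in the review, works equally well; the single-exit form just makes it harder to miss a path when the function grows.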

Stefan Behrens

2012-Nov-07 13:12 UTC

Re: [PATCH 00/26] Btrfs: Add device replace code

On Wed, 07 Nov 2012 11:14:36 +0900, Tsutomu Itoh wrote:
> (2012/11/07 1:38), Stefan Behrens wrote:
>> replace start [-Bfr] <path> <srcdev>|<devid> <targetdev>
> 
> I think that
>  "btrfs replace start [-Bfr] <srcdev>|<devid> <targetdev> <path>"
> of the same synopsis as other subcommands is better.

You are right. 'btrfs device add' and 'btrfs device delete' have the
path at the end and the device in front, and there is no good reason to
not make it look equally. I will change it accordingly and post a V2 of
the Btrfs-progs patch within the next few days.

Thanks!

Chris Mason

2012-Nov-08 00:59 UTC

Re: [PATCH 00/26] Btrfs: Add device replace code

On Tue, Nov 06, 2012 at 09:38:18AM -0700, Stefan Behrens wrote:
> This patch series adds support for replacing disks at runtime.
> 
> It replaces the following steps in case a disk was lost:
>     mount ... -o degraded
>     btrfs device add new_disk
>     btrfs device delete missing
> 
> Or in case a disk just needs to be replaced because the error rate
> is increasing:
>     btrfs device add new_disk
>     btrfs device delete old_disk
> 
> Instead just run:
>     btrfs replace mountpoint old_disk new_disk
> 
This is just fantastic.  I'm pulling it down and doing a test
integration with RAID56.  More comments as I hash through things.

-chris

Goffredo Baroncelli

2012-Nov-08 12:50 UTC

Re: [PATCH 00/26] Btrfs: Add device replace code

Hi Stefan,

great work. However, I have a suggestion: what about putting all the
commands under the 'device' subcommand, something like:

- btrfs device replace <old> <new> </path>

- btrfs device status </path>

Where "btrfs device status" would show only the status of the
"replacing" operation; but in the future it could also show the status
of the "delete" command (which is the only other "device
sub-command" that needs time to complete).

Of course I am not asking you to complete the "btrfs device status" part
for the "btrfs device delete" command. This could be implemented
later.

I think that this way "replace" would be the natural extension to the
"add" and "delete" subcommands.

BR
G.Baroncelli

On Wed, Nov 7, 2012 at 2:12 PM, Stefan Behrens
<sbehrens@giantdisaster.de> wrote:
> On Wed, 07 Nov 2012 11:14:36 +0900, Tsutomu Itoh wrote:
>> (2012/11/07 1:38), Stefan Behrens wrote:
>>> replace start [-Bfr] <path> <srcdev>|<devid> <targetdev>
>>
>> I think that
>>  "btrfs replace start [-Bfr] <srcdev>|<devid> <targetdev> <path>"
>> of the same synopsis as other subcommands is better.
>
> You are right. 'btrfs device add' and 'btrfs device delete' have the
> path at the end and the device in front, and there is no good reason to
> not make it look equally. I will change it accordingly and post a V2 of
> the Btrfs-progs patch within the next few days.
>
> Thanks!

Liu Bo

2012-Nov-08 14:24 UTC

Re: [PATCH 07/26] Btrfs: add two more find_device() methods

On Tue, Nov 06, 2012 at 05:38:25PM +0100, Stefan Behrens wrote:
> The new function btrfs_find_device_missing_or_by_path() will be
> used for the device replace procedure. This function itself calls
> the second new function btrfs_find_device_by_path().
> Unfortunately, it is not possible to currently make the rest of the
> code use these functions as well, since all functions that look
> similar at first view are all a little bit different in what they
> are doing. But in the future, new code could benefit from these
> two new functions, and currently, device replace uses them.
> 
> Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
> ---
>  fs/btrfs/volumes.c | 74 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  fs/btrfs/volumes.h |  5 ++++
>  2 files changed, 79 insertions(+)
> 
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index eeed97d..bcd3097 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -1512,6 +1512,80 @@ error_undo:
>  	goto error_brelse;
>  }
>  
> +int btrfs_find_device_by_path(struct btrfs_root *root, char *device_path,
> +			      struct btrfs_device **device)
> +{
> +	int ret = 0;
> +	struct btrfs_super_block *disk_super;
> +	u64 devid;
> +	u8 *dev_uuid;
> +	struct block_device *bdev;
> +	struct buffer_head *bh = NULL;
> +
> +	*device = NULL;
> +	mutex_lock(&uuid_mutex);

Since the caller already holds volume_mutex, we can get rid of the
mutex_lock here, can't we?

> +	bdev = blkdev_get_by_path(device_path, FMODE_READ,
> +				  root->fs_info->bdev_holder);
> +	if (IS_ERR(bdev)) {
> +		ret = PTR_ERR(bdev);
> +		bdev = NULL;
> +		goto out;
> +	}
> +
> +	set_blocksize(bdev, 4096);
> +	invalidate_bdev(bdev);
> +	bh = btrfs_read_dev_super(bdev);
> +	if (!bh) {
> +		ret = -EINVAL;
> +		goto out;
> +	}

I scanned the code for this 'get bdev & bh' pattern, and I think
maybe we can make a function,
func_get(device_path, flags, mode, &bdev, &bh, flush)

where we need to take care of setting bdev = NULL, bh = NULL, and
'flush' is for filemap_bdev().  Besides, we also need to do
some proper error handling.


thanks,
liubo
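
A rough user-space analogue of the helper suggested above, to show the shape of the contract (both out-parameters are NULL on any failure, so callers need one error check and no partial cleanup). Everything here is an assumption made for illustration: get_dev_and_super(), put_dev_and_super() and SUPER_SIZE are hypothetical names, FILE and a heap buffer stand in for the block device and buffer head, and the flags, mode and flush arguments of the proposed func_get() are left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

#define SUPER_SIZE 4096	/* hypothetical size of the block that is read */

/*
 * Open the device and read its first SUPER_SIZE bytes in one step.
 * On success returns 0 and hands both resources to the caller.
 * On failure returns a negative errno and leaves both outputs NULL.
 */
static int get_dev_and_super(const char *path, FILE **dev_out,
			     unsigned char **super_out)
{
	FILE *dev;
	unsigned char *super;

	assert(dev_out != NULL && super_out != NULL);
	*dev_out = NULL;
	*super_out = NULL;

	dev = fopen(path, "rb");
	if (!dev)
		return errno ? -errno : -EIO;

	super = malloc(SUPER_SIZE);
	if (!super) {
		fclose(dev);
		return -ENOMEM;
	}
	if (fread(super, 1, SUPER_SIZE, dev) != SUPER_SIZE) {
		free(super);
		fclose(dev);
		return -EINVAL;
	}

	*dev_out = dev;
	*super_out = super;
	return 0;
}

/* Counterpart that accepts NULL for either argument. */
static void put_dev_and_super(FILE *dev, unsigned char *super)
{
	free(super);
	if (dev)
		fclose(dev);
}
```

With such a pair, the quoted function's error handling would reduce to a single check after the get call and a single put call at the end.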
> +	disk_super = (struct btrfs_super_block *)bh->b_data;
> +	devid = btrfs_stack_device_id(&disk_super->dev_item);
> +	dev_uuid = disk_super->dev_item.uuid;
> +	*device = btrfs_find_device(root, devid, dev_uuid,
> +				    disk_super->fsid);
> +	brelse(bh);
> +	if (!*device)
> +		ret = -ENOENT;
> +out:
> +	mutex_unlock(&uuid_mutex);
> +	if (bdev)
> +		blkdev_put(bdev, FMODE_READ);
> +	return ret;
> +}
> +
> +int btrfs_find_device_missing_or_by_path(struct btrfs_root *root,
> +					 char *device_path,
> +					 struct btrfs_device **device)
> +{
> +	*device = NULL;
> +	if (strcmp(device_path, "missing") == 0) {
> +		struct list_head *devices;
> +		struct btrfs_device *tmp;
> +
> +		devices = &root->fs_info->fs_devices->devices;
> +		/*
> +		 * It is safe to read the devices since the volume_mutex
> +		 * is held by the caller.
> +		 */
> +		list_for_each_entry(tmp, devices, dev_list) {
> +			if (tmp->in_fs_metadata && !tmp->bdev) {
> +				*device = tmp;
> +				break;
> +			}
> +		}
> +
> +		if (!*device) {
> +			pr_err("btrfs: no missing device found\n");
> +			return -ENOENT;
> +		}
> +
> +		return 0;
> +	} else {
> +		return btrfs_find_device_by_path(root, device_path, device);
> +	}
> +}
> +
>  /*
>   * does all the dirty work required for changing file system's UUID.
>   */
> diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
> index 1789cda..657bb12 100644
> --- a/fs/btrfs/volumes.h
> +++ b/fs/btrfs/volumes.h
> @@ -268,6 +268,11 @@ int btrfs_scan_one_device(const char *path, fmode_t flags, void *holder,
>  			  struct btrfs_fs_devices **fs_devices_ret);
>  int btrfs_close_devices(struct btrfs_fs_devices *fs_devices);
>  void btrfs_close_extra_devices(struct btrfs_fs_devices *fs_devices);
> +int btrfs_find_device_missing_or_by_path(struct btrfs_root *root,
> +					 char *device_path,
> +					 struct btrfs_device **device);
> +int btrfs_find_device_by_path(struct btrfs_root *root, char *device_path,
> +			      struct btrfs_device **device);
>  int btrfs_add_device(struct btrfs_trans_handle *trans,
>  		     struct btrfs_root *root,
>  		     struct btrfs_device *device);
> -- 
> 1.8.0
> 

Liu Bo

2012-Nov-08 14:50 UTC

Re: [PATCH 15/26] Btrfs: add a new source file with device replace code

On Tue, Nov 06, 2012 at 05:38:33PM +0100, Stefan Behrens wrote:
> This adds a new file to the sources together with the header file
> and the changes to ioctl.h that are required by the new C source
> file.
> 
> Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
> ---
>  fs/btrfs/dev-replace.c | 843 +++++++++++++++++++++++++++++++++++++++++++++++++
>  fs/btrfs/dev-replace.h |  44 +++
>  fs/btrfs/ioctl.h       |  45 +++
>  3 files changed, 932 insertions(+)
> 
> diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
> new file mode 100644
> index 0000000..1d56163
> --- /dev/null
> +++ b/fs/btrfs/dev-replace.c
> @@ -0,0 +1,843 @@
> +/*
> + * Copyright (C) STRATO AG 2012.  All rights reserved.
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public
> + * License v2 as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public
> + * License along with this program; if not, write to the
> + * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
> + * Boston, MA 021110-1307, USA.
> + */
> +#include <linux/sched.h>
> +#include <linux/bio.h>
> +#include <linux/slab.h>
> +#include <linux/buffer_head.h>
> +#include <linux/blkdev.h>
> +#include <linux/random.h>
> +#include <linux/iocontext.h>
> +#include <linux/capability.h>
> +#include <linux/kthread.h>
> +#include <linux/math64.h>
> +#include <asm/div64.h>
> +#include "compat.h"
> +#include "ctree.h"
> +#include "extent_map.h"
> +#include "disk-io.h"
> +#include "transaction.h"
> +#include "print-tree.h"
> +#include "volumes.h"
> +#include "async-thread.h"
> +#include "check-integrity.h"
> +#include "rcu-string.h"
> +#include "dev-replace.h"
> +
> +static u64 btrfs_get_seconds_since_1970(void);
> +static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
> +				       int scrub_ret);
> +static void btrfs_dev_replace_update_device_in_mapping_tree(
> +						struct btrfs_fs_info *fs_info,
> +						struct btrfs_device *srcdev,
> +						struct btrfs_device *tgtdev);
> +static int btrfs_dev_replace_find_srcdev(struct btrfs_root *root, u64 srcdevid,
> +					 char *srcdev_name,
> +					 struct btrfs_device **device);
> +static u64 __btrfs_dev_replace_cancel(struct btrfs_fs_info *fs_info);
> +static int btrfs_dev_replace_kthread(void *data);
> +static int btrfs_dev_replace_continue_on_mount(struct btrfs_fs_info *fs_info);
> +
> +
> +int btrfs_init_dev_replace(struct btrfs_fs_info *fs_info)
> +{
> +	struct btrfs_key key;
> +	struct btrfs_root *dev_root = fs_info->dev_root;
> +	struct btrfs_dev_replace *dev_replace = &fs_info->dev_replace;
> +	struct extent_buffer *eb;
> +	int slot;
> +	int ret = 0;
> +	struct btrfs_path *path = NULL;
> +	int item_size;
> +	struct btrfs_dev_replace_item *ptr;
> +	u64 src_devid;
> +
> +	path = btrfs_alloc_path();
> +	if (!path) {
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +
> +	key.objectid = 0;
> +	key.type = BTRFS_DEV_REPLACE_KEY;
> +	key.offset = 0;
> +	ret = btrfs_search_slot(NULL, dev_root, &key, path, 0, 0);
> +	if (ret) {
> +no_valid_dev_replace_entry_found:
> +		ret = 0;
> +		dev_replace->replace_state =
> +			BTRFS_DEV_REPLACE_ITEM_STATE_NEVER_STARTED;
> +		dev_replace->cont_reading_from_srcdev_mode =
> +		    BTRFS_DEV_REPLACE_ITEM_CONT_READING_FROM_SRCDEV_MODE_ALWAYS;
> +		dev_replace->replace_state = 0;
> +		dev_replace->time_started = 0;
> +		dev_replace->time_stopped = 0;
> +		atomic64_set(&dev_replace->num_write_errors, 0);
> +		atomic64_set(&dev_replace->num_uncorrectable_read_errors, 0);
> +		dev_replace->cursor_left = 0;
> +		dev_replace->committed_cursor_left = 0;
> +		dev_replace->cursor_left_last_write_of_item = 0;
> +		dev_replace->cursor_right = 0;
> +		dev_replace->srcdev = NULL;
> +		dev_replace->tgtdev = NULL;
> +		dev_replace->is_valid = 0;
> +		dev_replace->item_needs_writeback = 0;
> +		goto out;
> +	}
> +	slot = path->slots[0];
> +	eb = path->nodes[0];
> +	item_size = btrfs_item_size_nr(eb, slot);
> +	ptr = btrfs_item_ptr(eb, slot, struct btrfs_dev_replace_item);
> +
> +	if (item_size != sizeof(struct btrfs_dev_replace_item)) {
> +		pr_warn("btrfs: dev_replace entry found has unexpected size, ignore entry\n");
> +		goto no_valid_dev_replace_entry_found;
> +	}
> +
> +	src_devid = btrfs_dev_replace_src_devid(eb, ptr);
> +	dev_replace->cont_reading_from_srcdev_mode =
> +		btrfs_dev_replace_cont_reading_from_srcdev_mode(eb, ptr);
> +	dev_replace->replace_state = btrfs_dev_replace_replace_state(eb, ptr);
> +	dev_replace->time_started = btrfs_dev_replace_time_started(eb, ptr);
> +	dev_replace->time_stopped =
> +		btrfs_dev_replace_time_stopped(eb, ptr);
> +	atomic64_set(&dev_replace->num_write_errors,
> +		     btrfs_dev_replace_num_write_errors(eb, ptr));
> +	atomic64_set(&dev_replace->num_uncorrectable_read_errors,
> +		     btrfs_dev_replace_num_uncorrectable_read_errors(eb, ptr));
> +	dev_replace->cursor_left = btrfs_dev_replace_cursor_left(eb, ptr);
> +	dev_replace->committed_cursor_left = dev_replace->cursor_left;
> +	dev_replace->cursor_left_last_write_of_item = dev_replace->cursor_left;
> +	dev_replace->cursor_right = btrfs_dev_replace_cursor_right(eb, ptr);
> +	dev_replace->is_valid = 1;
> +
> +	dev_replace->item_needs_writeback = 0;
> +	switch (dev_replace->replace_state) {
> +	case BTRFS_IOCTL_DEV_REPLACE_STATE_NEVER_STARTED:
> +	case BTRFS_IOCTL_DEV_REPLACE_STATE_FINISHED:
> +	case BTRFS_IOCTL_DEV_REPLACE_STATE_CANCELED:
> +		dev_replace->srcdev = NULL;
> +		dev_replace->tgtdev = NULL;
> +		break;
> +	case BTRFS_IOCTL_DEV_REPLACE_STATE_STARTED:
> +	case BTRFS_IOCTL_DEV_REPLACE_STATE_SUSPENDED:
> +		dev_replace->srcdev = btrfs_find_device(fs_info, src_devid,
> +							NULL, NULL);
> +		dev_replace->tgtdev = btrfs_find_device(fs_info,
> +							BTRFS_DEV_REPLACE_DEVID,
> +							NULL, NULL);
> +		/*
> +		 * allow 'btrfs dev replace_cancel' if src/tgt device is
> +		 * missing
> +		 */
> +		if (!dev_replace->srcdev &&
> +		    !btrfs_test_opt(dev_root, DEGRADED)) {
> +			ret = -EIO;
> +			pr_warn("btrfs: cannot mount because device replace operation is ongoing and\n" "srcdev (devid %llu) is missing, need to run 'btrfs dev scan'?\n",
> +				(unsigned long long)src_devid);
> +		}
> +		if (!dev_replace->tgtdev &&
> +		    !btrfs_test_opt(dev_root, DEGRADED)) {
> +			ret = -EIO;
> +			pr_warn("btrfs: cannot mount because device replace operation is ongoing and\n" "tgtdev (devid %llu) is missing, need to run btrfs dev scan?\n",
> +				(unsigned long long)BTRFS_DEV_REPLACE_DEVID);
> +		}
> +		if (dev_replace->tgtdev) {
> +			if (dev_replace->srcdev) {
> +				dev_replace->tgtdev->total_bytes =
> +					dev_replace->srcdev->total_bytes;
> +				dev_replace->tgtdev->disk_total_bytes =
> +					dev_replace->srcdev->disk_total_bytes;
> +				dev_replace->tgtdev->bytes_used =
> +					dev_replace->srcdev->bytes_used;
> +			}
> +			dev_replace->tgtdev->is_tgtdev_for_dev_replace = 1;
> +			btrfs_init_dev_replace_tgtdev_for_resume(fs_info,
> +				dev_replace->tgtdev);
> +		}
> +		break;
> +	}
> +
> +out:
> +	if (path) {
> +		btrfs_release_path(path);
> +		btrfs_free_path(path);

btrfs_free_path(path) will do release for you :)

> +	}
> +	return ret;
> +}
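
On the remark above that btrfs_free_path() already performs the release: when the free routine releases internally and tolerates NULL, the cleanup label needs only one unconditional call. The toy model below illustrates that contract in user space; toy_path, toy_alloc_path(), toy_release_path() and toy_free_path() are hypothetical names, not the btrfs functions.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical miniature of a path object that holds references. */
struct toy_path {
	int held_refs;
};

static struct toy_path *toy_alloc_path(void)
{
	return calloc(1, sizeof(struct toy_path));
}

/* Drop whatever the path currently holds; safe to call repeatedly. */
static void toy_release_path(struct toy_path *p)
{
	p->held_refs = 0;
}

/*
 * Free the path. It releases first and accepts NULL, so callers need
 * neither a separate release call nor a NULL check before freeing.
 */
static void toy_free_path(struct toy_path *p)
{
	if (!p)
		return;
	toy_release_path(p);
	assert(p->held_refs == 0);
	free(p);
}
```

Under that contract, the "if (path) { release; free; }" block in the quoted function could shrink to a single free call.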
> +
> +/*
> + * called from commit_transaction. Writes changed device replace state to
> + * disk.
> + */
> +int btrfs_run_dev_replace(struct btrfs_trans_handle *trans,
> +			  struct btrfs_fs_info *fs_info)
> +{
> +	int ret;
> +	struct btrfs_root *dev_root = fs_info->dev_root;
> +	struct btrfs_path *path;
> +	struct btrfs_key key;
> +	struct extent_buffer *eb;
> +	struct btrfs_dev_replace_item *ptr;
> +	struct btrfs_dev_replace *dev_replace = &fs_info->dev_replace;
> +
> +	btrfs_dev_replace_lock(dev_replace);
> +	if (!dev_replace->is_valid ||
> +	    !dev_replace->item_needs_writeback) {
> +		btrfs_dev_replace_unlock(dev_replace);
> +		return 0;
> +	}
> +	btrfs_dev_replace_unlock(dev_replace);
> +
> +	key.objectid = 0;
> +	key.type = BTRFS_DEV_REPLACE_KEY;
> +	key.offset = 0;
> +
> +	path = btrfs_alloc_path();
> +	if (!path) {
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +	ret = btrfs_search_slot(trans, dev_root, &key, path, -1, 1);
> +	if (ret < 0) {
> +		pr_warn("btrfs: error %d while searching for dev_replace item!\n",
> +			ret);
> +		goto out;
> +	}
> +
> +	if (ret == 0 &&
> +	    btrfs_item_size_nr(path->nodes[0], path->slots[0]) < sizeof(*ptr)) {
> +		/*
> +		 * need to delete old one and insert a new one.
> +		 * Since no attempt is made to recover any old state, if the
> +		 * dev_replace state is 'running', the data on the target
> +		 * drive is lost.
> +		 * It would be possible to recover the state: just make sure
> +		 * that the beginning of the item is never changed and always
> +		 * contains all the essential information. Then read this
> +		 * minimal set of information and use it as a base for the
> +		 * new state.
> +		 */
> +		ret = btrfs_del_item(trans, dev_root, path);
> +		if (ret != 0) {
> +			pr_warn("btrfs: delete too small dev_replace item failed %d!\n",
> +				ret);
> +			goto out;
> +		}
> +		ret = 1;
> +	}
> +
> +	if (ret == 1) {
> +		/* need to insert a new item */
> +		btrfs_release_path(path);
> +		ret = btrfs_insert_empty_item(trans, dev_root, path,
> +					      &key, sizeof(*ptr));
> +		if (ret < 0) {
> +			pr_warn("btrfs: insert dev_replace item failed %d!\n",
> +				ret);
> +			goto out;
> +		}
> +	}
> +
> +	eb = path->nodes[0];
> +	ptr = btrfs_item_ptr(eb, path->slots[0],
> +			     struct btrfs_dev_replace_item);
> +
> +	btrfs_dev_replace_lock(dev_replace);
> +	if (dev_replace->srcdev)
> +		btrfs_set_dev_replace_src_devid(eb, ptr,
> +			dev_replace->srcdev->devid);
> +	else
> +		btrfs_set_dev_replace_src_devid(eb, ptr, (u64)-1);
> +	btrfs_set_dev_replace_cont_reading_from_srcdev_mode(eb, ptr,
> +		dev_replace->cont_reading_from_srcdev_mode);
> +	btrfs_set_dev_replace_replace_state(eb, ptr,
> +		dev_replace->replace_state);
> +	btrfs_set_dev_replace_time_started(eb, ptr, dev_replace->time_started);
> +	btrfs_set_dev_replace_time_stopped(eb, ptr, dev_replace->time_stopped);
> +	btrfs_set_dev_replace_num_write_errors(eb, ptr,
> +		atomic64_read(&dev_replace->num_write_errors));
> +	btrfs_set_dev_replace_num_uncorrectable_read_errors(eb, ptr,
> +		atomic64_read(&dev_replace->num_uncorrectable_read_errors));
> +	dev_replace->cursor_left_last_write_of_item =
> +		dev_replace->cursor_left;
> +	btrfs_set_dev_replace_cursor_left(eb, ptr,
> +		dev_replace->cursor_left_last_write_of_item);
> +	btrfs_set_dev_replace_cursor_right(eb, ptr,
> +		dev_replace->cursor_right);
> +	dev_replace->item_needs_writeback = 0;
> +	btrfs_dev_replace_unlock(dev_replace);
> +
> +	btrfs_mark_buffer_dirty(eb);
> +
> +out:
> +	btrfs_free_path(path);
> +
> +	return ret;
> +}
> +
> +void btrfs_after_dev_replace_commit(struct btrfs_fs_info *fs_info)
> +{
> +	struct btrfs_dev_replace *dev_replace = &fs_info->dev_replace;
> +
> +	dev_replace->committed_cursor_left =
> +		dev_replace->cursor_left_last_write_of_item;
> +}
> +
> +static u64 btrfs_get_seconds_since_1970(void)
> +{
> +	struct timespec t = CURRENT_TIME_SEC;
> +
> +	return t.tv_sec;
> +}
> +
> +int btrfs_dev_replace_start(struct btrfs_root *root,
> +			    struct btrfs_ioctl_dev_replace_args *args)
> +{
> +	struct btrfs_trans_handle *trans;
> +	struct btrfs_fs_info *fs_info = root->fs_info;
> +	struct btrfs_dev_replace *dev_replace = &fs_info->dev_replace;
> +	int ret;
> +	struct btrfs_device *tgt_device = NULL;
> +	struct btrfs_device *src_device = NULL;
> +
> +	switch (args->start.cont_reading_from_srcdev_mode) {
> +	case BTRFS_IOCTL_DEV_REPLACE_CONT_READING_FROM_SRCDEV_MODE_ALWAYS:
> +	case BTRFS_IOCTL_DEV_REPLACE_CONT_READING_FROM_SRCDEV_MODE_AVOID:
> +		break;
> +	default:
> +		return -EINVAL;
> +	}
> +
> +	if ((args->start.srcdevid == 0 && args->start.srcdev_name[0] == '\0') ||
> +	    args->start.tgtdev_name[0] == '\0')
> +		return -EINVAL;
> +
> +	mutex_lock(&fs_info->volume_mutex);
> +	ret = btrfs_init_dev_replace_tgtdev(root, args->start.tgtdev_name,
> +					    &tgt_device);
> +	if (ret) {
> +		pr_err("btrfs: target device %s is invalid!\n",
> +		       args->start.tgtdev_name);
> +		mutex_unlock(&fs_info->volume_mutex);
> +		return -EINVAL;
> +	}
> +
> +	ret = btrfs_dev_replace_find_srcdev(root, args->start.srcdevid,
> +					    args->start.srcdev_name,
> +					    &src_device);
> +	mutex_unlock(&fs_info->volume_mutex);
> +	if (ret) {
> +		ret = -EINVAL;
> +		goto leave_no_lock;
> +	}
> +
> +	if (tgt_device->total_bytes < src_device->total_bytes) {
> +		pr_err("btrfs: target device is smaller than source device!\n");
> +		ret = -EINVAL;
> +		goto leave_no_lock;
> +	}
> +
> +	btrfs_dev_replace_lock(dev_replace);
> +	switch (dev_replace->replace_state) {
> +	case BTRFS_IOCTL_DEV_REPLACE_STATE_NEVER_STARTED:
> +	case BTRFS_IOCTL_DEV_REPLACE_STATE_FINISHED:
> +	case BTRFS_IOCTL_DEV_REPLACE_STATE_CANCELED:
> +		break;
> +	case BTRFS_IOCTL_DEV_REPLACE_STATE_STARTED:
> +	case BTRFS_IOCTL_DEV_REPLACE_STATE_SUSPENDED:
> +		args->result = BTRFS_IOCTL_DEV_REPLACE_RESULT_ALREADY_STARTED;
> +		goto leave;
> +	}
> +
> +	dev_replace->cont_reading_from_srcdev_mode =
> +		args->start.cont_reading_from_srcdev_mode;
> +	WARN_ON(!src_device);
> +	dev_replace->srcdev = src_device;
> +	WARN_ON(!tgt_device);
> +	dev_replace->tgtdev = tgt_device;
> +
> +	tgt_device->total_bytes = src_device->total_bytes;
> +	tgt_device->disk_total_bytes = src_device->disk_total_bytes;
> +	tgt_device->bytes_used = src_device->bytes_used;
> +
> +	/*
> +	 * from now on, the writes to the srcdev are all duplicated to
> +	 * go to the tgtdev as well (refer to btrfs_map_block()).
> +	 */
> +	dev_replace->replace_state = BTRFS_IOCTL_DEV_REPLACE_STATE_STARTED;
> +	dev_replace->time_started = btrfs_get_seconds_since_1970();
> +	dev_replace->cursor_left = 0;
> +	dev_replace->committed_cursor_left = 0;
> +	dev_replace->cursor_left_last_write_of_item = 0;
> +	dev_replace->cursor_right = 0;
> +	dev_replace->is_valid = 1;
> +	dev_replace->item_needs_writeback = 1;
> +	args->result = BTRFS_IOCTL_DEV_REPLACE_RESULT_NO_ERROR;
> +	btrfs_dev_replace_unlock(dev_replace);
> +
> +	btrfs_wait_ordered_extents(root, 0);
> +
> +	/* force writing the updated state information to disk */
> +	trans = btrfs_start_transaction(root, 0);
why a start_transaction here?  Any reasons?
(same question also for some other places)

thanks,
liubo
> +	if (IS_ERR(trans)) {
> +		ret = PTR_ERR(trans);
> +		btrfs_dev_replace_lock(dev_replace);
> +		goto leave;
> +	}
> +
> +	ret = btrfs_commit_transaction(trans, root);
> +	WARN_ON(ret);
> +
> +	/* the disk copy procedure reuses the scrub code */
> +	ret = btrfs_scrub_dev(fs_info, src_device->devid, 0,
> +			      src_device->total_bytes,
> +			      &dev_replace->scrub_progress, 0, 1);
> +
> +	ret = btrfs_dev_replace_finishing(root->fs_info, ret);
> +	WARN_ON(ret);
> +
> +	return 0;
> +
> +leave:
> +	dev_replace->srcdev = NULL;
> +	dev_replace->tgtdev = NULL;
> +	btrfs_dev_replace_unlock(dev_replace);
> +leave_no_lock:
> +	if (tgt_device)
> +		btrfs_destroy_dev_replace_tgtdev(fs_info, tgt_device);
> +	return ret;
> +}
> +
> +static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
> +				       int scrub_ret)
> +{
> +	struct btrfs_dev_replace *dev_replace = &fs_info->dev_replace;
> +	struct btrfs_device *tgt_device;
> +	struct btrfs_device *src_device;
> +	struct btrfs_root *root = fs_info->tree_root;
> +	u8 uuid_tmp[BTRFS_UUID_SIZE];
> +	struct btrfs_trans_handle *trans;
> +	int ret = 0;
> +
> +	/* don't allow cancel or unmount to disturb the finishing procedure */
> +	mutex_lock(&dev_replace->lock_finishing_cancel_unmount);
> +
> +	btrfs_dev_replace_lock(dev_replace);
> +	/* was the operation canceled, or is it finished? */
> +	if (dev_replace->replace_state !=
> +	    BTRFS_IOCTL_DEV_REPLACE_STATE_STARTED) {
> +		btrfs_dev_replace_unlock(dev_replace);
> +		mutex_unlock(&dev_replace->lock_finishing_cancel_unmount);
> +		return 0;
> +	}
> +
> +	tgt_device = dev_replace->tgtdev;
> +	src_device = dev_replace->srcdev;
> +	btrfs_dev_replace_unlock(dev_replace);
> +
> +	/* replace old device with new one in mapping tree */
> +	if (!scrub_ret)
> +		btrfs_dev_replace_update_device_in_mapping_tree(fs_info,
> +								src_device,
> +								tgt_device);
> +
> +	/*
> +	 * flush all outstanding I/O and inode extent mappings before the
> +	 * copy operation is declared as being finished
> +	 */
> +	btrfs_start_delalloc_inodes(root, 0);
> +	btrfs_wait_ordered_extents(root, 0);
> +
> +	trans = btrfs_start_transaction(root, 0);
> +	if (IS_ERR(trans)) {
> +		mutex_unlock(&dev_replace->lock_finishing_cancel_unmount);
> +		return PTR_ERR(trans);
> +	}
> +	ret = btrfs_commit_transaction(trans, root);
> +	WARN_ON(ret);
> +
> +	/* keep away write_all_supers() during the finishing procedure */
> +	mutex_lock(&root->fs_info->fs_devices->device_list_mutex);
> +	btrfs_dev_replace_lock(dev_replace);
> +	dev_replace->replace_state =
> +		scrub_ret ? BTRFS_IOCTL_DEV_REPLACE_STATE_CANCELED
> +			  : BTRFS_IOCTL_DEV_REPLACE_STATE_FINISHED;
> +	dev_replace->tgtdev = NULL;
> +	dev_replace->srcdev = NULL;
> +	dev_replace->time_stopped = btrfs_get_seconds_since_1970();
> +	dev_replace->item_needs_writeback = 1;
> +
> +	if (scrub_ret) {
> +		printk_in_rcu(KERN_ERR
> +			      "btrfs: btrfs_scrub_dev(%s, %llu, %s) failed %d\n",
> +			      rcu_str_deref(src_device->name),
> +			      src_device->devid,
> +			      rcu_str_deref(tgt_device->name), scrub_ret);
> +		btrfs_dev_replace_unlock(dev_replace);
> +		mutex_unlock(&root->fs_info->fs_devices->device_list_mutex);
> +		if (tgt_device)
> +			btrfs_destroy_dev_replace_tgtdev(fs_info, tgt_device);
> +		mutex_unlock(&dev_replace->lock_finishing_cancel_unmount);
> +
> +		return 0;
> +	}
> +
> +	tgt_device->is_tgtdev_for_dev_replace = 0;
> +	tgt_device->devid = src_device->devid;
> +	src_device->devid = BTRFS_DEV_REPLACE_DEVID;
> +	tgt_device->bytes_used = src_device->bytes_used;
> +	memcpy(uuid_tmp, tgt_device->uuid, sizeof(uuid_tmp));
> +	memcpy(tgt_device->uuid, src_device->uuid, sizeof(tgt_device->uuid));
> +	memcpy(src_device->uuid, uuid_tmp, sizeof(src_device->uuid));
> +	tgt_device->total_bytes = src_device->total_bytes;
> +	tgt_device->disk_total_bytes = src_device->disk_total_bytes;
> +	tgt_device->bytes_used = src_device->bytes_used;
> +	if (fs_info->sb->s_bdev == src_device->bdev)
> +		fs_info->sb->s_bdev = tgt_device->bdev;
> +	if (fs_info->fs_devices->latest_bdev == src_device->bdev)
> +		fs_info->fs_devices->latest_bdev = tgt_device->bdev;
> +	list_add(&tgt_device->dev_alloc_list, &fs_info->fs_devices->alloc_list);
> +
> +	btrfs_rm_dev_replace_srcdev(fs_info, src_device);
> +	if (src_device->bdev) {
> +		/* zero out the old super */
> +		btrfs_scratch_superblock(src_device);
> +	}
> +	/*
> +	 * this is again a consistent state where no dev_replace procedure
> +	 * is running, the target device is part of the filesystem, the
> +	 * source device is not part of the filesystem anymore and its 1st
> +	 * superblock is scratched out so that it is no longer marked to
> +	 * belong to this filesystem.
> +	 */
> +	btrfs_dev_replace_unlock(dev_replace);
> +	mutex_unlock(&root->fs_info->fs_devices->device_list_mutex);
> +
> +	/* write back the superblocks */
> +	trans = btrfs_start_transaction(root, 0);
> +	if (!IS_ERR(trans))
> +		btrfs_commit_transaction(trans, root);
> +
> +	mutex_unlock(&dev_replace->lock_finishing_cancel_unmount);
> +
> +	return 0;
> +}
> +
> +static void btrfs_dev_replace_update_device_in_mapping_tree(
> +						struct btrfs_fs_info *fs_info,
> +						struct btrfs_device *srcdev,
> +						struct btrfs_device *tgtdev)
> +{
> +	struct extent_map_tree *em_tree = &fs_info->mapping_tree.map_tree;
> +	struct extent_map *em;
> +	struct map_lookup *map;
> +	u64 start = 0;
> +	int i;
> +
> +	write_lock(&em_tree->lock);
> +	do {
> +		em = lookup_extent_mapping(em_tree, start, (u64)-1);
> +		if (!em)
> +			break;
> +		map = (struct map_lookup *)em->bdev;
> +		for (i = 0; i < map->num_stripes; i++)
> +			if (srcdev == map->stripes[i].dev)
> +				map->stripes[i].dev = tgtdev;
> +		start = em->start + em->len;
> +		free_extent_map(em);
> +	} while (start);
> +	write_unlock(&em_tree->lock);
> +}
> +
> +static int btrfs_dev_replace_find_srcdev(struct btrfs_root *root, u64 srcdevid,
> +					 char *srcdev_name,
> +					 struct btrfs_device **device)
> +{
> +	int ret;
> +
> +	if (srcdevid) {
> +		*device = btrfs_find_device(root->fs_info, srcdevid, NULL,
> +					    NULL);
> +		if (!*device)
> +			ret = -ENOENT;
> +	} else {
> +		ret = btrfs_find_device_missing_or_by_path(root, srcdev_name,
> +							   device);
> +	}
> +	return ret;
> +}
> +
> +void btrfs_dev_replace_status(struct btrfs_fs_info *fs_info,
> +			      struct btrfs_ioctl_dev_replace_args *args)
> +{
> +	struct btrfs_dev_replace *dev_replace = &fs_info->dev_replace;
> +
> +	btrfs_dev_replace_lock(dev_replace);
> +	/* even if !dev_replace_is_valid, the values are good enough for
> +	 * the replace_status ioctl */
> +	args->result = BTRFS_IOCTL_DEV_REPLACE_RESULT_NO_ERROR;
> +	args->status.replace_state = dev_replace->replace_state;
> +	args->status.time_started = dev_replace->time_started;
> +	args->status.time_stopped = dev_replace->time_stopped;
> +	args->status.num_write_errors =
> +		atomic64_read(&dev_replace->num_write_errors);
> +	args->status.num_uncorrectable_read_errors =
> +		atomic64_read(&dev_replace->num_uncorrectable_read_errors);
> +	switch (dev_replace->replace_state) {
> +	case BTRFS_IOCTL_DEV_REPLACE_STATE_NEVER_STARTED:
> +	case BTRFS_IOCTL_DEV_REPLACE_STATE_CANCELED:
> +		args->status.progress_1000 = 0;
> +		break;
> +	case BTRFS_IOCTL_DEV_REPLACE_STATE_FINISHED:
> +		args->status.progress_1000 = 1000;
> +		break;
> +	case BTRFS_IOCTL_DEV_REPLACE_STATE_STARTED:
> +	case BTRFS_IOCTL_DEV_REPLACE_STATE_SUSPENDED:
> +		args->status.progress_1000 = div64_u64(dev_replace->cursor_left,
> +			div64_u64(dev_replace->srcdev->total_bytes, 1000));
> +		break;
> +	}
> +	btrfs_dev_replace_unlock(dev_replace);
> +}
> +
> +int btrfs_dev_replace_cancel(struct btrfs_fs_info *fs_info,
> +			     struct btrfs_ioctl_dev_replace_args *args)
> +{
> +	args->result = __btrfs_dev_replace_cancel(fs_info);
> +	return 0;
> +}
> +
> +static u64 __btrfs_dev_replace_cancel(struct btrfs_fs_info *fs_info)
> +{
> +	struct btrfs_dev_replace *dev_replace = &fs_info->dev_replace;
> +	struct btrfs_device *tgt_device = NULL;
> +	struct btrfs_trans_handle *trans;
> +	struct btrfs_root *root = fs_info->tree_root;
> +	u64 result;
> +	int ret;
> +
> +	mutex_lock(&dev_replace->lock_finishing_cancel_unmount);
> +	btrfs_dev_replace_lock(dev_replace);
> +	switch (dev_replace->replace_state) {
> +	case BTRFS_IOCTL_DEV_REPLACE_STATE_NEVER_STARTED:
> +	case BTRFS_IOCTL_DEV_REPLACE_STATE_FINISHED:
> +	case BTRFS_IOCTL_DEV_REPLACE_STATE_CANCELED:
> +		result = BTRFS_IOCTL_DEV_REPLACE_RESULT_NOT_STARTED;
> +		btrfs_dev_replace_unlock(dev_replace);
> +		goto leave;
> +	case BTRFS_IOCTL_DEV_REPLACE_STATE_STARTED:
> +	case BTRFS_IOCTL_DEV_REPLACE_STATE_SUSPENDED:
> +		result = BTRFS_IOCTL_DEV_REPLACE_RESULT_NO_ERROR;
> +		tgt_device = dev_replace->tgtdev;
> +		dev_replace->tgtdev = NULL;
> +		dev_replace->srcdev = NULL;
> +		break;
> +	}
> +	dev_replace->replace_state = BTRFS_IOCTL_DEV_REPLACE_STATE_CANCELED;
> +	dev_replace->time_stopped = btrfs_get_seconds_since_1970();
> +	dev_replace->item_needs_writeback = 1;
> +	btrfs_dev_replace_unlock(dev_replace);
> +	btrfs_scrub_cancel(fs_info);
> +
> +	trans = btrfs_start_transaction(root, 0);
> +	if (IS_ERR(trans)) {
> +		mutex_unlock(&dev_replace->lock_finishing_cancel_unmount);
> +		return PTR_ERR(trans);
> +	}
> +	ret = btrfs_commit_transaction(trans, root);
> +	WARN_ON(ret);
> +	if (tgt_device)
> +		btrfs_destroy_dev_replace_tgtdev(fs_info, tgt_device);
> +
> +leave:
> +	mutex_unlock(&dev_replace->lock_finishing_cancel_unmount);
> +	return result;
> +}
> +
> +void btrfs_dev_replace_suspend_for_unmount(struct btrfs_fs_info *fs_info)
> +{
> +	struct btrfs_dev_replace *dev_replace = &fs_info->dev_replace;
> +
> +	mutex_lock(&dev_replace->lock_finishing_cancel_unmount);
> +	btrfs_dev_replace_lock(dev_replace);
> +	switch (dev_replace->replace_state) {
> +	case BTRFS_IOCTL_DEV_REPLACE_STATE_NEVER_STARTED:
> +	case BTRFS_IOCTL_DEV_REPLACE_STATE_FINISHED:
> +	case BTRFS_IOCTL_DEV_REPLACE_STATE_CANCELED:
> +	case BTRFS_IOCTL_DEV_REPLACE_STATE_SUSPENDED:
> +		break;
> +	case BTRFS_IOCTL_DEV_REPLACE_STATE_STARTED:
> +		dev_replace->replace_state =
> +			BTRFS_IOCTL_DEV_REPLACE_STATE_SUSPENDED;
> +		dev_replace->time_stopped = btrfs_get_seconds_since_1970();
> +		dev_replace->item_needs_writeback = 1;
> +		pr_info("btrfs: suspending dev_replace for unmount\n");
> +		break;
> +	}
> +
> +	btrfs_dev_replace_unlock(dev_replace);
> +	mutex_unlock(&dev_replace->lock_finishing_cancel_unmount);
> +}
> +
> +/* resume dev_replace procedure that was interrupted by unmount */
> +int btrfs_resume_dev_replace_async(struct btrfs_fs_info *fs_info)
> +{
> +	struct task_struct *task;
> +	struct btrfs_dev_replace *dev_replace = &fs_info->dev_replace;
> +
> +	btrfs_dev_replace_lock(dev_replace);
> +	switch (dev_replace->replace_state) {
> +	case BTRFS_IOCTL_DEV_REPLACE_STATE_NEVER_STARTED:
> +	case BTRFS_IOCTL_DEV_REPLACE_STATE_FINISHED:
> +	case BTRFS_IOCTL_DEV_REPLACE_STATE_CANCELED:
> +		btrfs_dev_replace_unlock(dev_replace);
> +		return 0;
> +	case BTRFS_IOCTL_DEV_REPLACE_STATE_STARTED:
> +		break;
> +	case BTRFS_IOCTL_DEV_REPLACE_STATE_SUSPENDED:
> +		dev_replace->replace_state =
> +			BTRFS_IOCTL_DEV_REPLACE_STATE_STARTED;
> +		break;
> +	}
> +	if (!dev_replace->tgtdev || !dev_replace->tgtdev->bdev) {
> +		pr_info("btrfs: cannot continue dev_replace, tgtdev is missing\n"
> +			"btrfs: you may cancel the operation after 'mount -o degraded'\n");
> +		btrfs_dev_replace_unlock(dev_replace);
> +		return 0;
> +	}
> +	btrfs_dev_replace_unlock(dev_replace);
> +
> +	WARN_ON(atomic_xchg(
> +		&fs_info->mutually_exclusive_operation_running, 1));
> +	task = kthread_run(btrfs_dev_replace_kthread, fs_info, "btrfs-devrepl");
> +	return PTR_RET(task);
> +}
> +
> +static int btrfs_dev_replace_kthread(void *data)
> +{
> +	struct btrfs_fs_info *fs_info = data;
> +	struct btrfs_dev_replace *dev_replace = &fs_info->dev_replace;
> +	struct btrfs_ioctl_dev_replace_args *status_args;
> +	u64 progress;
> +
> +	status_args = kzalloc(sizeof(*status_args), GFP_NOFS);
> +	if (status_args) {
> +		btrfs_dev_replace_status(fs_info, status_args);
> +		progress = status_args->status.progress_1000;
> +		kfree(status_args);
> +		do_div(progress, 10);
> +		printk_in_rcu(KERN_INFO
> +			      "btrfs: continuing dev_replace from %s (devid %llu) to %s @%u%%\n",
> +			      dev_replace->srcdev->missing ? "<missing disk>" :
> +				rcu_str_deref(dev_replace->srcdev->name),
> +			      dev_replace->srcdev->devid,
> +			      dev_replace->tgtdev ?
> +				rcu_str_deref(dev_replace->tgtdev->name) :
> +				"<missing target disk>",
> +			      (unsigned int)progress);
> +	}
> +	btrfs_dev_replace_continue_on_mount(fs_info);
> +	atomic_set(&fs_info->mutually_exclusive_operation_running, 0);
> +
> +	return 0;
> +}
> +
> +static int btrfs_dev_replace_continue_on_mount(struct btrfs_fs_info *fs_info)
> +{
> +	struct btrfs_dev_replace *dev_replace = &fs_info->dev_replace;
> +	int ret;
> +
> +	ret = btrfs_scrub_dev(fs_info, dev_replace->srcdev->devid,
> +			      dev_replace->committed_cursor_left,
> +			      dev_replace->srcdev->total_bytes,
> +			      &dev_replace->scrub_progress, 0, 1);
> +	ret = btrfs_dev_replace_finishing(fs_info, ret);
> +	WARN_ON(ret);
> +	return 0;
> +}
> +
> +int btrfs_dev_replace_is_ongoing(struct btrfs_dev_replace *dev_replace)
> +{
> +	if (!dev_replace->is_valid)
> +		return 0;
> +
> +	switch (dev_replace->replace_state) {
> +	case BTRFS_IOCTL_DEV_REPLACE_STATE_NEVER_STARTED:
> +	case BTRFS_IOCTL_DEV_REPLACE_STATE_FINISHED:
> +	case BTRFS_IOCTL_DEV_REPLACE_STATE_CANCELED:
> +		return 0;
> +	case BTRFS_IOCTL_DEV_REPLACE_STATE_STARTED:
> +	case BTRFS_IOCTL_DEV_REPLACE_STATE_SUSPENDED:
> +		/*
> +		 * return true even if tgtdev is missing (this is
> +		 * something that can happen if the dev_replace
> +		 * procedure is suspended by an umount and then
> +		 * the tgtdev is missing (or "btrfs dev scan") was
> +		 * not called and the the filesystem is remounted
> +		 * in degraded state. This does not stop the
> +		 * dev_replace procedure. It needs to be canceled
> +		 * manually if the cancelation is wanted.
> +		 */
> +		break;
> +	}
> +	return 1;
> +}
> +
> +void btrfs_dev_replace_lock(struct btrfs_dev_replace *dev_replace)
> +{
> +	/* the beginning is just an optimization for the typical case */
> +	if (atomic_read(&dev_replace->nesting_level) == 0) {
> +acquire_lock:
> +		/* this is not a nested case where the same thread
> +		 * is trying to acqurire the same lock twice */
> +		mutex_lock(&dev_replace->lock);
> +		mutex_lock(&dev_replace->lock_management_lock);
> +		dev_replace->lock_owner = current->pid;
> +		atomic_inc(&dev_replace->nesting_level);
> +		mutex_unlock(&dev_replace->lock_management_lock);
> +		return;
> +	}
> +
> +	mutex_lock(&dev_replace->lock_management_lock);
> +	if (atomic_read(&dev_replace->nesting_level) > 0 &&
> +	    dev_replace->lock_owner == current->pid) {
> +		WARN_ON(!mutex_is_locked(&dev_replace->lock));
> +		atomic_inc(&dev_replace->nesting_level);
> +		mutex_unlock(&dev_replace->lock_management_lock);
> +		return;
> +	}
> +
> +	mutex_unlock(&dev_replace->lock_management_lock);
> +	goto acquire_lock;
> +}
> +
> +void btrfs_dev_replace_unlock(struct btrfs_dev_replace *dev_replace)
> +{
> +	WARN_ON(!mutex_is_locked(&dev_replace->lock));
> +	mutex_lock(&dev_replace->lock_management_lock);
> +	WARN_ON(atomic_read(&dev_replace->nesting_level) < 1);
> +	WARN_ON(dev_replace->lock_owner != current->pid);
> +	atomic_dec(&dev_replace->nesting_level);
> +	if (atomic_read(&dev_replace->nesting_level) == 0) {
> +		dev_replace->lock_owner = 0;
> +		mutex_unlock(&dev_replace->lock_management_lock);
> +		mutex_unlock(&dev_replace->lock);
> +	} else {
> +		mutex_unlock(&dev_replace->lock_management_lock);
> +	}
> +}
> diff --git a/fs/btrfs/dev-replace.h b/fs/btrfs/dev-replace.h
> new file mode 100644
> index 0000000..20035cb
> --- /dev/null
> +++ b/fs/btrfs/dev-replace.h
> @@ -0,0 +1,44 @@
> +/*
> + * Copyright (C) STRATO AG 2012.  All rights reserved.
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public
> + * License v2 as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public
> + * License along with this program; if not, write to the
> + * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
> + * Boston, MA 021110-1307, USA.
> + */
> +
> +#if !defined(__BTRFS_DEV_REPLACE__)
> +#define __BTRFS_DEV_REPLACE__
> +
> +struct btrfs_ioctl_dev_replace_args;
> +
> +int btrfs_init_dev_replace(struct btrfs_fs_info *fs_info);
> +int btrfs_run_dev_replace(struct btrfs_trans_handle *trans,
> +			  struct btrfs_fs_info *fs_info);
> +void btrfs_after_dev_replace_commit(struct btrfs_fs_info *fs_info);
> +int btrfs_dev_replace_start(struct btrfs_root *root,
> +			    struct btrfs_ioctl_dev_replace_args *args);
> +void btrfs_dev_replace_status(struct btrfs_fs_info *fs_info,
> +			      struct btrfs_ioctl_dev_replace_args *args);
> +int btrfs_dev_replace_cancel(struct btrfs_fs_info *fs_info,
> +			     struct btrfs_ioctl_dev_replace_args *args);
> +void btrfs_dev_replace_suspend_for_unmount(struct btrfs_fs_info *fs_info);
> +int btrfs_resume_dev_replace_async(struct btrfs_fs_info *fs_info);
> +int btrfs_dev_replace_is_ongoing(struct btrfs_dev_replace *dev_replace);
> +void btrfs_dev_replace_lock(struct btrfs_dev_replace *dev_replace);
> +void btrfs_dev_replace_unlock(struct btrfs_dev_replace *dev_replace);
> +
> +static inline void btrfs_dev_replace_stats_inc(atomic64_t *stat_value)
> +{
> +	atomic64_inc(stat_value);
> +}
> +#endif
> diff --git a/fs/btrfs/ioctl.h b/fs/btrfs/ioctl.h
> index 731e287..62006ba 100644
> --- a/fs/btrfs/ioctl.h
> +++ b/fs/btrfs/ioctl.h
> @@ -123,6 +123,48 @@ struct btrfs_ioctl_scrub_args {
>  	__u64 unused[(1024-32-sizeof(struct btrfs_scrub_progress))/8];
>  };
>  
> +#define BTRFS_IOCTL_DEV_REPLACE_CONT_READING_FROM_SRCDEV_MODE_ALWAYS	0
> +#define BTRFS_IOCTL_DEV_REPLACE_CONT_READING_FROM_SRCDEV_MODE_AVOID	1
> +struct btrfs_ioctl_dev_replace_start_params {
> +	__u64 srcdevid;	/* in, if 0, use srcdev_name instead */
> +	__u8 srcdev_name[BTRFS_PATH_NAME_MAX + 1];	/* in */
> +	__u8 tgtdev_name[BTRFS_PATH_NAME_MAX + 1];	/* in */
> +	__u64 cont_reading_from_srcdev_mode;	/* in, see #define
> +						 * above */
> +};
> +
> +#define BTRFS_IOCTL_DEV_REPLACE_STATE_NEVER_STARTED	0
> +#define BTRFS_IOCTL_DEV_REPLACE_STATE_STARTED		1
> +#define BTRFS_IOCTL_DEV_REPLACE_STATE_FINISHED		2
> +#define BTRFS_IOCTL_DEV_REPLACE_STATE_CANCELED		3
> +#define BTRFS_IOCTL_DEV_REPLACE_STATE_SUSPENDED		4
> +struct btrfs_ioctl_dev_replace_status_params {
> +	__u64 replace_state;	/* out, see #define above */
> +	__u64 progress_1000;	/* out, 0 <= x <= 1000 */
> +	__u64 time_started;	/* out, seconds since 1-Jan-1970 */
> +	__u64 time_stopped;	/* out, seconds since 1-Jan-1970 */
> +	__u64 num_write_errors;	/* out */
> +	__u64 num_uncorrectable_read_errors;	/* out */
> +};
> +
> +#define BTRFS_IOCTL_DEV_REPLACE_CMD_START			0
> +#define BTRFS_IOCTL_DEV_REPLACE_CMD_STATUS			1
> +#define BTRFS_IOCTL_DEV_REPLACE_CMD_CANCEL			2
> +#define BTRFS_IOCTL_DEV_REPLACE_RESULT_NO_ERROR			0
> +#define BTRFS_IOCTL_DEV_REPLACE_RESULT_NOT_STARTED		1
> +#define BTRFS_IOCTL_DEV_REPLACE_RESULT_ALREADY_STARTED		2
> +struct btrfs_ioctl_dev_replace_args {
> +	__u64 cmd;	/* in */
> +	__u64 result;	/* out */
> +
> +	union {
> +		struct btrfs_ioctl_dev_replace_start_params start;
> +		struct btrfs_ioctl_dev_replace_status_params status;
> +	};	/* in/out */
> +
> +	__u64 spare[64];
> +};
> +
>  #define BTRFS_DEVICE_PATH_NAME_MAX 1024
>  struct btrfs_ioctl_dev_info_args {
>  	__u64 devid;				/* in/out */
> @@ -453,4 +495,7 @@ struct btrfs_ioctl_send_args {
>  			       struct btrfs_ioctl_qgroup_limit_args)
>  #define BTRFS_IOC_GET_DEV_STATS _IOWR(BTRFS_IOCTL_MAGIC, 52, \
>  				      struct btrfs_ioctl_get_dev_stats)
> +#define BTRFS_IOC_DEV_REPLACE _IOWR(BTRFS_IOCTL_MAGIC, 53, \
> +				    struct btrfs_ioctl_dev_replace_args)
> +
>  #endif
> -- 
> 1.8.0
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

Stefan Behrens

2012-Nov-08 17:24 UTC


Re: [PATCH 15/26] Btrfs: add a new source file with device replace code

On Thu, 8 Nov 2012 22:50:47 +0800, Liu Bo wrote:
> On Tue, Nov 06, 2012 at 05:38:33PM +0100, Stefan Behrens wrote:
>> +out:
>> +	if (path) {
>> +		btrfs_release_path(path);
>> +		btrfs_free_path(path);
> 
> btrfs_free_path(path) will do release for you :)
> 
Right :) Thanks.
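
The simplification agreed on here can be illustrated with a small user-space sketch. The names demo_path, demo_release_path() and demo_free_path() are hypothetical stand-ins and not the kernel API. The sketch only assumes what the review comment states, that the free routine releases the path itself, plus NULL tolerance, which the "out:" label in btrfs_run_dev_replace() relies on when the path allocation fails.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct btrfs_path; not the kernel type. */
struct demo_path {
	int held_refs;	/* models references/locks held by the path */
};

/* Counts release calls so the behaviour can be observed. */
static int demo_release_calls;

struct demo_path *demo_alloc_path(void)
{
	return calloc(1, sizeof(struct demo_path));
}

/* Drop everything the path holds; the path itself stays allocated. */
void demo_release_path(struct demo_path *p)
{
	assert(p != NULL);
	p->held_refs = 0;
	demo_release_calls++;
}

/* Free the path. Releases first, and accepts NULL as a no-op. */
void demo_free_path(struct demo_path *p)
{
	if (!p)
		return;
	demo_release_path(p);
	free(p);
}
```

With a free routine of this shape, the cleanup label shrinks to a single unconditional demo_free_path(path) call; neither the NULL check nor the explicit release is needed.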

>> +int btrfs_dev_replace_start(struct btrfs_root *root,
>> +			    struct btrfs_ioctl_dev_replace_args *args)
>> +{
>> +	struct btrfs_trans_handle *trans;
>> +	struct btrfs_fs_info *fs_info = root->fs_info;
>> +	struct btrfs_dev_replace *dev_replace = &fs_info->dev_replace;
>> +	int ret;
>> +	struct btrfs_device *tgt_device = NULL;
>> +	struct btrfs_device *src_device = NULL;
>> +
>> +	switch (args->start.cont_reading_from_srcdev_mode) {
>> +	case BTRFS_IOCTL_DEV_REPLACE_CONT_READING_FROM_SRCDEV_MODE_ALWAYS:
>> +	case BTRFS_IOCTL_DEV_REPLACE_CONT_READING_FROM_SRCDEV_MODE_AVOID:
>> +		break;
>> +	default:
>> +		return -EINVAL;
>> +	}
>> +
>> +	if ((args->start.srcdevid == 0 && args->start.srcdev_name[0] == '\0') ||
>> +	    args->start.tgtdev_name[0] == '\0')
>> +		return -EINVAL;
>> +
>> +	mutex_lock(&fs_info->volume_mutex);
>> +	ret = btrfs_init_dev_replace_tgtdev(root, args->start.tgtdev_name,
>> +					    &tgt_device);
>> +	if (ret) {
>> +		pr_err("btrfs: target device %s is invalid!\n",
>> +		       args->start.tgtdev_name);
>> +		mutex_unlock(&fs_info->volume_mutex);
>> +		return -EINVAL;
>> +	}
>> +
>> +	ret = btrfs_dev_replace_find_srcdev(root, args->start.srcdevid,
>> +					    args->start.srcdev_name,
>> +					    &src_device);
>> +	mutex_unlock(&fs_info->volume_mutex);
>> +	if (ret) {
>> +		ret = -EINVAL;
>> +		goto leave_no_lock;
>> +	}
>> +
>> +	if (tgt_device->total_bytes < src_device->total_bytes) {
>> +		pr_err("btrfs: target device is smaller than source device!\n");
>> +		ret = -EINVAL;
>> +		goto leave_no_lock;
>> +	}
>> +
>> +	btrfs_dev_replace_lock(dev_replace);
>> +	switch (dev_replace->replace_state) {
>> +	case BTRFS_IOCTL_DEV_REPLACE_STATE_NEVER_STARTED:
>> +	case BTRFS_IOCTL_DEV_REPLACE_STATE_FINISHED:
>> +	case BTRFS_IOCTL_DEV_REPLACE_STATE_CANCELED:
>> +		break;
>> +	case BTRFS_IOCTL_DEV_REPLACE_STATE_STARTED:
>> +	case BTRFS_IOCTL_DEV_REPLACE_STATE_SUSPENDED:
>> +		args->result = BTRFS_IOCTL_DEV_REPLACE_RESULT_ALREADY_STARTED;
>> +		goto leave;
>> +	}
>> +
>> +	dev_replace->cont_reading_from_srcdev_mode =
>> +		args->start.cont_reading_from_srcdev_mode;
>> +	WARN_ON(!src_device);
>> +	dev_replace->srcdev = src_device;
>> +	WARN_ON(!tgt_device);
>> +	dev_replace->tgtdev = tgt_device;
>> +
>> +	tgt_device->total_bytes = src_device->total_bytes;
>> +	tgt_device->disk_total_bytes = src_device->disk_total_bytes;
>> +	tgt_device->bytes_used = src_device->bytes_used;
>> +
>> +	/*
>> +	 * from now on, the writes to the srcdev are all duplicated to
>> +	 * go to the tgtdev as well (refer to btrfs_map_block()).
>> +	 */
>> +	dev_replace->replace_state = BTRFS_IOCTL_DEV_REPLACE_STATE_STARTED;
>> +	dev_replace->time_started = btrfs_get_seconds_since_1970();
>> +	dev_replace->cursor_left = 0;
>> +	dev_replace->committed_cursor_left = 0;
>> +	dev_replace->cursor_left_last_write_of_item = 0;
>> +	dev_replace->cursor_right = 0;
>> +	dev_replace->is_valid = 1;
>> +	dev_replace->item_needs_writeback = 1;
>> +	args->result = BTRFS_IOCTL_DEV_REPLACE_RESULT_NO_ERROR;
>> +	btrfs_dev_replace_unlock(dev_replace);
>> +
>> +	btrfs_wait_ordered_extents(root, 0);
>> +
>> +	/* force writing the updated state information to disk */
>> +	trans = btrfs_start_transaction(root, 0);
> 
> why a start_transaction here?  Any reasons?
> (same question also for some other places)
> 
Without this transaction, there is outstanding I/O which is not flushed.
Pending writes that go only to the old disk need to be flushed before
the mode is switched to write all live data to the source disk and to
the target disk as well. The copy operation that is part of the scrub
code works on the commit root for performance reasons. Every write
request that is performed after the commit root is established needs to
go to both disks. Those requests that already have the bdev assigned
(i.e., btrfs_map_bio() was already called) cannot be duplicated anymore
to write to the new disk as well.

btrfs_dev_replace_finishing() looks similar and goes through a
transaction commit between the steps where the bdev in the mapping tree
is swapped and the step when the old bdev is freed. Otherwise the bdev
would be accessed after being freed.
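
To illustrate the point about requests that were mapped before the
switch, here is a minimal userspace sketch in C. It is not the kernel
code: toy_fs, toy_map_write() and the integer device ids are hypothetical
stand-ins, and the real decision is made in the btrfs mapping code on
bios, not on plain integers. The sketch only shows that the set of target
devices is fixed at mapping time, which is why writes mapped earlier have
to be flushed before the copy starts.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; these are not kernel structures. */
enum replace_state { REPLACE_IDLE, REPLACE_STARTED };

struct toy_fs {
	enum replace_state state;
	int srcdev;	/* id of the device being replaced */
	int tgtdev;	/* id of the device receiving the copy */
};

/*
 * Decide which devices a write aimed at `dev` must reach.
 * Fills `out` (capacity 2) and returns the number of targets.
 * The decision is taken once, at mapping time: a request that was
 * mapped while the state was REPLACE_IDLE keeps its single target,
 * even if the state changes before the write is submitted.
 */
static size_t toy_map_write(const struct toy_fs *fs, int dev, int out[2])
{
	size_t n = 0;

	out[n++] = dev;
	if (fs->state == REPLACE_STARTED && dev == fs->srcdev)
		out[n++] = fs->tgtdev;
	return n;
}
```

A write mapped while the state is REPLACE_IDLE gets one target; the same
write mapped after the switch gets two. Writes to devices other than the
source are never duplicated.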

>> +	if (IS_ERR(trans)) {
>> +		ret = PTR_ERR(trans);
>> +		btrfs_dev_replace_lock(dev_replace);
>> +		goto leave;
>> +	}
>> +
>> +	ret = btrfs_commit_transaction(trans, root);
>> +	WARN_ON(ret);
>> +
>> +	/* the disk copy procedure reuses the scrub code */
>> +	ret = btrfs_scrub_dev(fs_info, src_device->devid, 0,
>> +			      src_device->total_bytes,
>> +			      &dev_replace->scrub_progress, 0, 1);
>> +
>> +	ret = btrfs_dev_replace_finishing(root->fs_info, ret);
>> +	WARN_ON(ret);
>> +
>> +	return 0;
>> +
>> +leave:
>> +	dev_replace->srcdev = NULL;
>> +	dev_replace->tgtdev = NULL;
>> +	btrfs_dev_replace_unlock(dev_replace);
>> +leave_no_lock:
>> +	if (tgt_device)
>> +		btrfs_destroy_dev_replace_tgtdev(fs_info, tgt_device);
>> +	return ret;
>> +}
>> +
>> +static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
>> +				       int scrub_ret)
>> +{
>> +	struct btrfs_dev_replace *dev_replace = &fs_info->dev_replace;
>> +	struct btrfs_device *tgt_device;
>> +	struct btrfs_device *src_device;
>> +	struct btrfs_root *root = fs_info->tree_root;
>> +	u8 uuid_tmp[BTRFS_UUID_SIZE];
>> +	struct btrfs_trans_handle *trans;
>> +	int ret = 0;
>> +
>> +	/* don't allow cancel or unmount to disturb the finishing procedure */
>> +	mutex_lock(&dev_replace->lock_finishing_cancel_unmount);
>> +
>> +	btrfs_dev_replace_lock(dev_replace);
>> +	/* was the operation canceled, or is it finished? */
>> +	if (dev_replace->replace_state !=
>> +	    BTRFS_IOCTL_DEV_REPLACE_STATE_STARTED) {
>> +		btrfs_dev_replace_unlock(dev_replace);
>> +		mutex_unlock(&dev_replace->lock_finishing_cancel_unmount);
>> +		return 0;
>> +	}
>> +
>> +	tgt_device = dev_replace->tgtdev;
>> +	src_device = dev_replace->srcdev;
>> +	btrfs_dev_replace_unlock(dev_replace);
>> +
>> +	/* replace old device with new one in mapping tree */
>> +	if (!scrub_ret)
>> +		btrfs_dev_replace_update_device_in_mapping_tree(fs_info,
>> +								src_device,
>> +								tgt_device);
>> +
>> +	/*
>> +	 * flush all outstanding I/O and inode extent mappings before the
>> +	 * copy operation is declared as being finished
>> +	 */
>> +	btrfs_start_delalloc_inodes(root, 0);
>> +	btrfs_wait_ordered_extents(root, 0);
>> +
>> +	trans = btrfs_start_transaction(root, 0);
>> +	if (IS_ERR(trans)) {
>> +		mutex_unlock(&dev_replace->lock_finishing_cancel_unmount);
>> +		return PTR_ERR(trans);
>> +	}
>> +	ret = btrfs_commit_transaction(trans, root);
>> +	WARN_ON(ret);
>> +
>> +	/* keep away write_all_supers() during the finishing procedure */
>> +	mutex_lock(&root->fs_info->fs_devices->device_list_mutex);
>> +	btrfs_dev_replace_lock(dev_replace);
>> +	dev_replace->replace_state =
>> +		scrub_ret ? BTRFS_IOCTL_DEV_REPLACE_STATE_CANCELED
>> +			  : BTRFS_IOCTL_DEV_REPLACE_STATE_FINISHED;
>> +	dev_replace->tgtdev = NULL;
>> +	dev_replace->srcdev = NULL;
>> +	dev_replace->time_stopped = btrfs_get_seconds_since_1970();
>> +	dev_replace->item_needs_writeback = 1;
>> +
>> +	if (scrub_ret) {
>> +		printk_in_rcu(KERN_ERR
>> +			      "btrfs: btrfs_scrub_dev(%s, %llu, %s) failed %d\n",
>> +			      rcu_str_deref(src_device->name),
>> +			      src_device->devid,
>> +			      rcu_str_deref(tgt_device->name), scrub_ret);
>> +		btrfs_dev_replace_unlock(dev_replace);
>> +		mutex_unlock(&root->fs_info->fs_devices->device_list_mutex);
>> +		if (tgt_device)
>> +			btrfs_destroy_dev_replace_tgtdev(fs_info, tgt_device);
>> +		mutex_unlock(&dev_replace->lock_finishing_cancel_unmount);
>> +
>> +		return 0;
>> +	}
>> +
>> +	tgt_device->is_tgtdev_for_dev_replace = 0;
>> +	tgt_device->devid = src_device->devid;
>> +	src_device->devid = BTRFS_DEV_REPLACE_DEVID;
>> +	tgt_device->bytes_used = src_device->bytes_used;
>> +	memcpy(uuid_tmp, tgt_device->uuid, sizeof(uuid_tmp));
>> +	memcpy(tgt_device->uuid, src_device->uuid, sizeof(tgt_device->uuid));
>> +	memcpy(src_device->uuid, uuid_tmp, sizeof(src_device->uuid));
>> +	tgt_device->total_bytes = src_device->total_bytes;
>> +	tgt_device->disk_total_bytes = src_device->disk_total_bytes;
>> +	tgt_device->bytes_used = src_device->bytes_used;
>> +	if (fs_info->sb->s_bdev == src_device->bdev)
>> +		fs_info->sb->s_bdev = tgt_device->bdev;
>> +	if (fs_info->fs_devices->latest_bdev == src_device->bdev)
>> +		fs_info->fs_devices->latest_bdev = tgt_device->bdev;
>> +	list_add(&tgt_device->dev_alloc_list, &fs_info->fs_devices->alloc_list);
>> +
>> +	btrfs_rm_dev_replace_srcdev(fs_info, src_device);
>> +	if (src_device->bdev) {
>> +		/* zero out the old super */
>> +		btrfs_scratch_superblock(src_device);
>> +	}
>> +	/*
>> +	 * this is again a consistent state where no dev_replace procedure
>> +	 * is running, the target device is part of the filesystem, the
>> +	 * source device is not part of the filesystem anymore and its 1st
>> +	 * superblock is scratched out so that it is no longer marked to
>> +	 * belong to this filesystem.
>> +	 */
>> +	btrfs_dev_replace_unlock(dev_replace);
>> +	mutex_unlock(&root->fs_info->fs_devices->device_list_mutex);
>> +
>> +	/* write back the superblocks */
>> +	trans = btrfs_start_transaction(root, 0);
>> +	if (!IS_ERR(trans))
>> +		btrfs_commit_transaction(trans, root);
>> +
>> +	mutex_unlock(&dev_replace->lock_finishing_cancel_unmount);
>> +
>> +	return 0;
>> +}
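
The state handling in the two quoted functions can be summarized with a
small userspace model. This is a sketch under assumptions, not the patch
itself: the toy_* names are hypothetical, locking and on-disk writeback
are left out, and the NEVER_STARTED state is assumed because the
beginning of the switch statement is not shown above.

```c
#include <assert.h>

/* Hypothetical model of the replace states; not the kernel's constants. */
enum toy_state { NEVER_STARTED, STARTED, SUSPENDED, FINISHED, CANCELED };
enum toy_result { RESULT_NO_ERROR, RESULT_ALREADY_STARTED };

/* Start path: refuse if a replace is running or suspended. */
static enum toy_result toy_try_start(enum toy_state *state)
{
	switch (*state) {
	case STARTED:
	case SUSPENDED:
		return RESULT_ALREADY_STARTED;
	case NEVER_STARTED:
	case FINISHED:
	case CANCELED:
		break;
	}
	*state = STARTED;
	return RESULT_NO_ERROR;
}

/*
 * Finishing path: only acts while the state is STARTED.
 * A non-zero scrub result ends the operation as CANCELED,
 * a zero result ends it as FINISHED.
 */
static void toy_finish(enum toy_state *state, int scrub_ret)
{
	if (*state != STARTED)
		return;
	*state = scrub_ret ? CANCELED : FINISHED;
}
```

In this model a second start attempt while one is running is rejected,
and a finished or canceled operation can be followed by a new start.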
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Stefan Behrens

2012-Nov-08 17:31 UTC

Re: [PATCH 00/26] Btrfs: Add device replace code

On Thu, 8 Nov 2012 13:50:19 +0100, Goffredo Baroncelli wrote:
> great work. However I have a suggestion: what about putting all the
> command under 'device' sub commands: something like:
>
> - btrfs device replace <old> <new> </path>
>
> - btrfs device status </path>
>
> Where "btrfs device status" would show only the status of the
> "replacing" operation; but in the future it could show also the status
> of the "delete" command (which it is the only other "device
> sub-command" which needs time to complete).
>
> Of course I am not asking to complete the "btrfs device status" part
> for the "btrfs device delete" command. This could be implemented in a
> second time.
>
> I think that so "replace" would be the natural extension to the "add"
> and "delete" subcommands.

"btrfs device replace <old> <new> <path>"
was also my first idea. It used to be like this initially.

"btrfs device replace cancel <path>"
was the point when I gave up putting it below the "device" commands. IMO
that's just too long, too much to type.

Now it has the same look and feel as the "scrub" commands ("scrub
start", "scrub status" and "scrub cancel").

Goffredo Baroncelli

2012-Nov-08 18:41 UTC

Re: [PATCH 00/26] Btrfs: Add device replace code

On 11/08/2012 06:31 PM, Stefan Behrens wrote:
> On Thu, 8 Nov 2012 13:50:19 +0100, Goffredo Baroncelli wrote:
[...]
>> I think that so "replace" would be the natural extension to the "add"
>> and "delete" subcommands.
>
> "btrfs device replace <old> <new> <path>"
> was also my first idea. It used to be like this initially.
>
> "btrfs device replace cancel <path>"
> was the point when I gave up putting it below the "device" commands. IMO
> that's just too long, too much to type.
>
> Now it has the same look and feel as the "scrub" commands ("scrub
> start", "scrub status" and "scrub cancel").

Yes, but scrub was a new command. Instead I see "replace" as an
extension of "btrfs device add/del" (from a user interface POV).

If someone wanted to extend the "btrfs device delete" command to support
status/pause/resume, how could they do it?
Maybe we need a new series of commands which handle the "background
processes" (like btrfs replace, btrfs device delete, btrfs subvolume
delete...) to status/stop/suspend/resume these processes?

I am doing a bit of brainstorming...

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli (kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5
Liu Bo

2012-Nov-09 00:44 UTC

Re: [PATCH 15/26] Btrfs: add a new source file with device replace code

On Thu, Nov 08, 2012 at 06:24:36PM +0100, Stefan Behrens wrote:
> On Thu, 8 Nov 2012 22:50:47 +0800, Liu Bo wrote:
> > On Tue, Nov 06, 2012 at 05:38:33PM +0100, Stefan Behrens wrote:
[...]
> >> +	btrfs_dev_replace_unlock(dev_replace);
> >> +
> >> +	btrfs_wait_ordered_extents(root, 0);
> >> +
> >> +	/* force writing the updated state information to disk */
> >> +	trans = btrfs_start_transaction(root, 0);
> > 
> > why a start_transaction here?  Any reasons?
> > (same question also for some other places)
> > 
> 
> Without this transaction, there is outstanding I/O which is not flushed.
> Pending writes that go only to the old disk need to be flushed before
> the mode is switched to write all live data to the source disk and to
> the target disk as well. The copy operation that is part of the scrub
> code works on the commit root for performance reasons. Every write
> request that is performed after the commit root is established needs to
> go to both disks. Those requests that already have the bdev assigned
> (i.e., btrfs_map_bio() was already called) cannot be duplicated anymore
> to write to the new disk as well.
> 
> btrfs_dev_replace_finishing() looks similar and goes through a
> transaction commit between the steps where the bdev in the mapping tree
> is swapped and the step when the old bdev is freed. Otherwise the bdev
> would be accessed after being freed.
> 

I see, if you're only about to flush metadata, why not join a
transaction?

thanks,
liubo
Michael Kjörling

2012-Nov-09 10:02 UTC

Re: [PATCH 00/26] Btrfs: Add device replace code

On 8 Nov 2012 18:31 +0100, from sbehrens@giantdisaster.de (Stefan Behrens):
> "btrfs device replace cancel <path>"
> was the point when I gave up putting it below the "device" commands. IMO
> that's just too long, too much to type.

How often is one going to type that? I like the idea of consistency
with add/delete, here, since it really amounts to largely the same
thing.

-- 
Michael Kjörling • http://michael.kjorling.se • michael@kjorling.se
                “People who think they know everything really annoy
                those of us who know we don’t.” (Bjarne Stroustrup)
Stefan Behrens

2012-Nov-09 10:19 UTC

Re: [PATCH 15/26] Btrfs: add a new source file with device replace code

On Fri, 9 Nov 2012 08:44:01 +0800, Liu Bo wrote:
> On Thu, Nov 08, 2012 at 06:24:36PM +0100, Stefan Behrens wrote:
>> On Thu, 8 Nov 2012 22:50:47 +0800, Liu Bo wrote:
>>> On Tue, Nov 06, 2012 at 05:38:33PM +0100, Stefan Behrens wrote:
>>>> +	trans = btrfs_start_transaction(root, 0);
>>>
>>> why a start_transaction here?  Any reasons?
>>> (same question also for some other places)
>>>
>>
>> Without this transaction, there is outstanding I/O which is not flushed.
>> Pending writes that go only to the old disk need to be flushed before
>> the mode is switched to write all live data to the source disk and to
>> the target disk as well. The copy operation that is part of the scrub
>> code works on the commit root for performance reasons. Every write
>> request that is performed after the commit root is established needs to
>> go to both disks. Those requests that already have the bdev assigned
>> (i.e., btrfs_map_bio() was already called) cannot be duplicated anymore
>> to write to the new disk as well.
>>
>> btrfs_dev_replace_finishing() looks similar and goes through a
>> transaction commit between the steps where the bdev in the mapping tree
>> is swapped and the step when the old bdev is freed. Otherwise the bdev
>> would be accessed after being freed.
>>
> 
> I see, if you're only about to flush metadata, why not join a transaction?

btrfs_join_transaction() would delay the current transaction and enforce
that the current transaction is used and not a new one.
btrfs_start_transaction() would use either the current transaction, or a
new one. It is less interfering.

Since in dev-replace.c it is not required to enforce that a current
transaction is joined, btrfs_start_transaction() is the one to choose
here, as I understood it.

But that's an interesting topic, and I would appreciate a definite
rule for which one to choose when.

David Pottage

2012-Nov-09 10:47 UTC

RE: [PATCH 24/26] Btrfs: increase BTRFS_MAX_MIRRORS by one for dev replace

From: linux-btrfs-owner@vger.kernel.org
[mailto:linux-btrfs-owner@vger.kernel.org] On Behalf Of Stefan Behrens
Sent: 06 November 2012 16:39
To: linux-btrfs@vger.kernel.org
Subject: [PATCH 24/26] Btrfs: increase BTRFS_MAX_MIRRORS by one for dev replace
> This change of the define is effective in all modes, it is required and
> used only in the case when a device replace procedure is running. The
> reason is that during an active device replace procedure, the target
> device of the copy operation is a mirror for the filesystem data as well
> that can be used to read data in order to repair read errors on other
> disks.

Are you assuming the user is only replacing one device at once?

If the user is upgrading their disc array to increase the capacity (or speed),
then it would make sense for them to replace all the drives in the array at
once. Is that supported?

For example, could a user with a 4-drive array of 1 TB drives fit 4 new 2 TB
drives, then call btrfs replace mountpoint <old> <new> 4 times, leave the
system overnight to copy the data, and remove the old drives in the morning?



Stefan Behrens

2012-Nov-09 11:23 UTC

Re: [PATCH 24/26] Btrfs: increase BTRFS_MAX_MIRRORS by one for dev replace

On Fri, 9 Nov 2012 10:47:39 +0000, David Pottage wrote:
> Are you assuming the user is only replacing one device at once?
>
> If the user is upgrading their disc array to increase the capacity (or
> speed), then it would make sense for them to replace all the drives in the
> array at once. Is that supported?
>
> For example, could a user with a 4-drive array of 1 TB drives fit 4 new
> 2 TB drives, then call btrfs replace mountpoint <old> <new> 4 times, leave
> the system overnight to copy the data, and remove the old drives in the
> morning?

Yes, one device at once.

"btrfs replace start ... & btrfs replace start ..." is not supported.

"btrfs replace start -B ... ; btrfs replace start -B ..." is the
designated way to replace multiple disks.
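
As an illustration of the sequential form, the following C sketch builds
such a chained command line for several disk pairs. The helper
toy_build_chain(), the struct and the example paths are hypothetical; the
argument order (<path> <srcdev> <targetdev>) follows the synopsis in the
cover letter of this series and may differ in other btrfs-progs versions.
No shell quoting is done, so paths containing spaces would need extra
care.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* One source/target pair; hypothetical helper type for this sketch. */
struct toy_pair {
	const char *src;
	const char *tgt;
};

/*
 * Write "btrfs replace start -B <mnt> <src> <tgt>" for every pair into
 * `buf`, separated by " ; " so that a shell runs them one after another.
 * Returns 0 on success, -1 if `buf` is too small.
 */
static int toy_build_chain(char *buf, size_t len, const char *mnt,
			   const struct toy_pair *pairs, size_t n)
{
	size_t used = 0;

	if (len == 0)
		return -1;
	buf[0] = '\0';
	for (size_t i = 0; i < n; i++) {
		int w = snprintf(buf + used, len - used,
				 "%sbtrfs replace start -B %s %s %s",
				 i ? " ; " : "", mnt,
				 pairs[i].src, pairs[i].tgt);
		/* snprintf reports the untruncated length; detect overflow */
		if (w < 0 || (size_t)w >= len - used)
			return -1;
		used += (size_t)w;
	}
	return 0;
}
```

The commands are joined with ";" rather than "&", matching the answer
above that only one replace operation may run at a time.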

Liu Bo

2012-Nov-09 14:45 UTC

Re: [PATCH 15/26] Btrfs: add a new source file with device replace code

On Fri, Nov 09, 2012 at 11:19:17AM +0100, Stefan Behrens wrote:
> On Fri, 9 Nov 2012 08:44:01 +0800, Liu Bo wrote:
> > On Thu, Nov 08, 2012 at 06:24:36PM +0100, Stefan Behrens wrote:
> >> On Thu, 8 Nov 2012 22:50:47 +0800, Liu Bo wrote:
> >>> On Tue, Nov 06, 2012 at 05:38:33PM +0100, Stefan Behrens wrote:
> >>>> +	trans = btrfs_start_transaction(root, 0);
> >>>
> >>> why a start_transaction here?  Any reasons?
> >>> (same question also for some other places)
> >>>
> >>
> >> Without this transaction, there is outstanding I/O which is not flushed.
> >> Pending writes that go only to the old disk need to be flushed before
> >> the mode is switched to write all live data to the source disk and to
> >> the target disk as well. The copy operation that is part of the scrub
> >> code works on the commit root for performance reasons. Every write
> >> request that is performed after the commit root is established needs to
> >> go to both disks. Those requests that already have the bdev assigned
> >> (i.e., btrfs_map_bio() was already called) cannot be duplicated anymore
> >> to write to the new disk as well.
> >>
> >> btrfs_dev_replace_finishing() looks similar and goes through a
> >> transaction commit between the steps where the bdev in the mapping tree
> >> is swapped and the step when the old bdev is freed. Otherwise the bdev
> >> would be accessed after being freed.
> >>
> >
> > I see, if you're only about to flush metadata, why not join a transaction?
> 
> btrfs_join_transaction() would delay the current transaction and enforce
> that the current transaction is used and not a new one.
> btrfs_start_transaction() would use either the current transaction, or a
> new one. It is less interfering.

Hmm... btrfs_start_transaction() would not use the current transaction unless
you're still in the same task, i.e. current->journal_info remains unchanged;
otherwise it will be blocked by the current transaction (wait_current_trans()).

If there are several btrfs_start_transaction() calls being blocked, after the
current one's commit, one of them will allocate a new transaction, and the rest
can join it.

But btrfs_join_transaction() will join the current one as much as possible.

And since here we don't do any reservation and seem to just update the
chunk/device tree (which will use the global block rsv directly), I prefer
btrfs_join_transaction().

thanks,
liubo
> 
> Since in dev-replace.c it is not required to enforce that a current
> transaction is joined, btrfs_start_transaction() is the one to choose
> here, as I understood it.
> 
> But that's an interesting topic and I would appreciate to get a definite
> rule which one to choose when.
Stefan Behrens

2012-Nov-12 16:50 UTC

Re: [PATCH 07/26] Btrfs: add two more find_device() methods

On Thu, 8 Nov 2012 22:24:51 +0800, Liu Bo wrote:
> On Tue, Nov 06, 2012 at 05:38:25PM +0100, Stefan Behrens wrote:
>> The new function btrfs_find_device_missing_or_by_path() will be
>> used for the device replace procedure. This function itself calls
>> the second new function btrfs_find_device_by_path().
>> Unfortunately, it is not possible to currently make the rest of the
>> code use these functions as well, since all functions that look
>> similar at first view are all a little bit different in what they
>> are doing. But in the future, new code could benefit from these
>> two new functions, and currently, device replace uses them.
>>
>> Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
>> ---
>>  fs/btrfs/volumes.c | 74 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>  fs/btrfs/volumes.h |  5 ++++
>>  2 files changed, 79 insertions(+)
>>
>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>> index eeed97d..bcd3097 100644
>> --- a/fs/btrfs/volumes.c
>> +++ b/fs/btrfs/volumes.c
>> @@ -1512,6 +1512,80 @@ error_undo:
>>  	goto error_brelse;
>>  }
>>  
>> +int btrfs_find_device_by_path(struct btrfs_root *root, char *device_path,
>> +			      struct btrfs_device **device)
>> +{
>> +	int ret = 0;
>> +	struct btrfs_super_block *disk_super;
>> +	u64 devid;
>> +	u8 *dev_uuid;
>> +	struct block_device *bdev;
>> +	struct buffer_head *bh = NULL;
>> +
>> +	*device = NULL;
>> +	mutex_lock(&uuid_mutex);
> 
> Since the caller already holds volume_mutex, we can get rid of the
> mutex_lock here, can't we?
> 
Yes, you are right.

>> +	bdev = blkdev_get_by_path(device_path, FMODE_READ,
>> +				  root->fs_info->bdev_holder);
>> +	if (IS_ERR(bdev)) {
>> +		ret = PTR_ERR(bdev);
>> +		bdev = NULL;
>> +		goto out;
>> +	}
>> +
>> +	set_blocksize(bdev, 4096);
>> +	invalidate_bdev(bdev);
>> +	bh = btrfs_read_dev_super(bdev);
>> +	if (!bh) {
>> +		ret = -EINVAL;
>> +		goto out;
>> +	}
> 
> I made a scan for this 'get bdev & bh' in the code, I think
> maybe we can make a function,
> func_get(device_path, flags, mode, &bdev, &bh, flush)
>
> where we need to take care of setting bdev = NULL, bh = NULL, and
> 'flush' is for filemap_bdev().  Besides, we also need to make
> some proper error handling.

Good idea for a cleanup! I have now added such a function. It improves
the readability. And it is only one place to make changes in the future
(e.g. when it is time to replace the "short term measure" commit
3c4bb26b213 which added the flushing code as a temporary workaround).

Thanks for your comments!
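
For readers who want to see the shape of such a helper, here is a
userspace analogue in C. It is only a sketch: it uses stdio and malloc
instead of block devices and buffer heads, toy_get_file_and_header() and
TOY_SUPER_SIZE are made-up names, and it does not show the function that
was actually added to the patch series. What it demonstrates is the
pattern discussed above: clear both output pointers first, and unwind
with goto labels so every error path releases what was acquired.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

#define TOY_SUPER_SIZE 64	/* hypothetical size, for illustration only */

/*
 * Open `path` and read a fixed-size header from it.
 * On success returns 0 and hands both resources to the caller, who
 * must free(*buf_out) and fclose(*file_out).
 * On failure returns a negative errno-style value and leaves
 * *file_out and *buf_out set to NULL.
 */
static int toy_get_file_and_header(const char *path, FILE **file_out,
				   unsigned char **buf_out)
{
	FILE *f;
	unsigned char *buf;
	int ret;

	*file_out = NULL;
	*buf_out = NULL;

	f = fopen(path, "rb");
	if (!f)
		return errno ? -errno : -EIO;

	buf = malloc(TOY_SUPER_SIZE);
	if (!buf) {
		ret = -ENOMEM;
		goto err_close;
	}

	/* a short read means there is no complete header */
	if (fread(buf, 1, TOY_SUPER_SIZE, f) != TOY_SUPER_SIZE) {
		ret = -EINVAL;
		goto err_free;
	}

	*file_out = f;
	*buf_out = buf;
	return 0;

err_free:
	free(buf);
err_close:
	fclose(f);
	return ret;
}
```

A caller can then test the return value once and, on failure, rely on
both pointers being NULL instead of resetting them at every call site.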

Stefan Behrens

2012-Nov-12 17:21 UTC

Re: [PATCH 15/26] Btrfs: add a new source file with device replace code

On Fri, 9 Nov 2012 22:45:16 +0800, Liu Bo wrote:
> On Fri, Nov 09, 2012 at 11:19:17AM +0100, Stefan Behrens wrote:
>> On Fri, 9 Nov 2012 08:44:01 +0800, Liu Bo wrote:
>>> On Thu, Nov 08, 2012 at 06:24:36PM +0100, Stefan Behrens wrote:
>>>> On Thu, 8 Nov 2012 22:50:47 +0800, Liu Bo wrote:
>>>>> On Tue, Nov 06, 2012 at 05:38:33PM +0100, Stefan Behrens wrote:
>>>>>> +	trans = btrfs_start_transaction(root, 0);
>>>>>
>>>>> why a start_transaction here?  Any reasons?
>>>>> (same question also for some other places)
>>>>>
>>>>
>>>> Without this transaction, there is outstanding I/O which is not flushed.
>>>> Pending writes that go only to the old disk need to be flushed before
>>>> the mode is switched to write all live data to the source disk and to
>>>> the target disk as well. The copy operation that is part of the scrub
>>>> code works on the commit root for performance reasons. Every write
>>>> request that is performed after the commit root is established needs to
>>>> go to both disks. Those requests that already have the bdev assigned
>>>> (i.e., btrfs_map_bio() was already called) cannot be duplicated anymore
>>>> to write to the new disk as well.
>>>>
>>>> btrfs_dev_replace_finishing() looks similar and goes through a
>>>> transaction commit between the steps where the bdev in the mapping tree
>>>> is swapped and the step when the old bdev is freed. Otherwise the bdev
>>>> would be accessed after being freed.
>>>>
>>>
>>> I see, if you're only about to flush metadata, why not join a transaction?
>>
>> btrfs_join_transaction() would delay the current transaction and enforce
>> that the current transaction is used and not a new one.
>> btrfs_start_transaction() would use either the current transaction, or a
>> new one. It is less interfering.
>
> Hmm... btrfs_start_transaction() would not use the current transaction unless
> you're still in the same task, i.e. current->journal_info remains unchanged;
> otherwise it will be blocked by the current transaction (wait_current_trans()).
>
> If there are several btrfs_start_transaction() calls being blocked, after the
> current one's commit, one of them will allocate a new transaction, and the rest
> can join it.
>
> But btrfs_join_transaction() will join the current one as much as possible.
>
> And since here we don't do any reservation and seem to just update the
> chunk/device tree (which will use the global block rsv directly), I prefer
> btrfs_join_transaction().
>

I am still not sure which one is worse or better:
a) to delay a commit by calling btrfs_join_transaction() which joins and thereby
delays a transaction, or
b) to go through one additional transaction.

Here is the log message of the commit that added btrfs_join_transaction(). For
me, it sounds like one should use btrfs_join_transaction() only when it is
_required_ to join a transaction, e.g. when a low level function is required to
join the transaction that some higher level function has started:

commit f9295749388f82c8d2f485e99c72cd7c7876a99b
Author: Chris Mason <chris.mason@oracle.com>
Date:   Thu Jul 17 12:54:14 2008 -0400

    btrfs_start_transaction: wait for commits in progress to finish

    btrfs_commit_transaction has to loop waiting for any writers in the
    transaction to finish before it can proceed.  btrfs_start_transaction
    should be polite and not join a transaction that is in the process
    of being finished off.

    There are a few places that can't wait, basically the ones doing IO that
    might be needed to finish the transaction.  For them, btrfs_join_transaction
    is added.
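
The difference described in this commit message, together with the
same-task case mentioned earlier in the thread, can be written down as a
small decision function. This is a toy model in userspace C with
hypothetical names (toy_pick() and the enums); it ignores space
reservation and other details of the kernel implementation and only
encodes the rule that a start waits for a commit in progress while a
join does not.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names; this models the described behavior only. */
enum toy_kind { TOY_START, TOY_JOIN };

enum toy_action {
	REUSE_TASK_HANDLE,	/* the task already holds a handle */
	ATTACH_TO_CURRENT,	/* become a writer of the current transaction */
	WAIT_THEN_NEXT,		/* wait for the commit, use the next transaction */
	OPEN_NEW		/* nothing is running, open a transaction */
};

static enum toy_action toy_pick(enum toy_kind kind, bool task_has_handle,
				bool trans_running, bool commit_in_progress)
{
	/* a nested call in the same task reuses its existing handle */
	if (task_has_handle)
		return REUSE_TASK_HANDLE;
	if (!trans_running)
		return OPEN_NEW;
	/* "start" is polite: it does not enter a transaction being committed */
	if (kind == TOY_START && commit_in_progress)
		return WAIT_THEN_NEXT;
	/* "join" attaches even while the commit is under way */
	return ATTACH_TO_CURRENT;
}
```

Under this model the two calls differ only while a commit is in
progress, which is the case the discussion above is about.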


>>
>> Since in dev-replace.c it is not required to enforce that a current
>> transaction is joined, btrfs_start_transaction() is the one to choose
>> here, as I understood it.
>>
>> But that's an interesting topic and I would appreciate to get a definite
>> rule which one to choose when.

Bart Noordervliet

2012-Nov-13 16:25 UTC

Re: [PATCH 00/26] Btrfs: Add device replace code

Hi Stefan,

I gave your patchset a whirl and it worked like a charm. Thanks a lot
for your work. I was confused for a moment by the fact that the
operation doesn't immediately resize btrfs to use all of the new
device's available space. But after a bit of thought I suppose that is
intentional to make it easy to return to a device of similar size as
the original.

One suggestion remains: I would prefer the operation to leave some
record of what has happened in the system log. Other operations like
device add and device delete do and I think it makes sense to have a
log trail of such invasive operations on the filesystem.

Tested-by: Bart Noordervliet <bart@noordervliet.net>

Best regards,

Bart
Stefan Behrens

2012-Nov-14 11:42 UTC

Re: [PATCH 00/26] Btrfs: Add device replace code

On Tue, 13 Nov 2012 17:25:46 +0100, Bart Noordervliet wrote:
> Hi Stefan,
>
> I gave your patchset a whirl and it worked like a charm. Thanks a lot
> for your work. I was confused for a moment by the fact that the
> operation doesn't immediately resize btrfs to use all of the new
> device's available space. But after a bit of thought I suppose that is
> intentional to make it easy to return to a device of similar size as
> the original.

Yes, replace and resize is an operation that intentionally requires two
manual steps, since the step backwards, to shrink a device, can take a
long time.

> One suggestion remains: I would prefer the operation to leave some
> record of what has happened in the system log. Other operations like
> device add and device delete do and I think it makes sense to have a
> log trail of such invasive operations on the filesystem.

That's a good idea. I have now added such KERN_INFO log messages.

> Tested-by: Bart Noordervliet <bart@noordervliet.net>

Thanks for testing and for your feedback :)

Btrfs devel - Nov 2012 - [PATCH 00/26] Btrfs: Add device replace code

[PATCH 00/26] Btrfs: Add device replace code

[PATCH 01/26] Btrfs: rename the scrub context structure

[PATCH 02/26] Btrfs: remove the block device pointer from the scrub context struct

[PATCH 03/26] Btrfs: make the scrub page array dynamically allocated

[PATCH 04/26] Btrfs: in scrub repair code, optimize the reading of mirrors

[PATCH 05/26] Btrfs: in scrub repair code, simplify alloc error handling

[PATCH 06/26] Btrfs: cleanup scrub bio and worker wait code

[PATCH 07/26] Btrfs: add two more find_device() methods

[PATCH 08/26] Btrfs: Pass fs_info to btrfs_num_copies() instead of mapping_tree

[PATCH 09/26] Btrfs: pass fs_info to btrfs_map_block() instead of mapping_tree

[PATCH 10/26] Btrfs: add btrfs_scratch_superblock() function

[PATCH 11/26] Btrfs: pass fs_info instead of root

[PATCH 12/26] Btrfs: avoid risk of a deadlock in btrfs_handle_error

[PATCH 13/26] Btrfs: enhance btrfs structures for device replace support

[PATCH 14/26] Btrfs: introduce a btrfs_dev_replace_item type

[PATCH 15/26] Btrfs: add a new source file with device replace code

[PATCH 16/26] Btrfs: disallow mutually exclusiv admin operations from user mode

[PATCH 17/26] Btrfs: disallow some operations on the device replace target device

[PATCH 18/26] Btrfs: handle errors from btrfs_map_bio() everywhere

[PATCH 19/26] Btrfs: add code to scrub to copy read data to another disk

[PATCH 20/26] Btrfs: change core code of btrfs to support the device replace operations

[PATCH 21/26] Btrfs: introduce GET_READ_MIRRORS functionality for btrfs_map_block()

[PATCH 22/26] Btrfs: changes to live filesystem are also written to replacement disk

[PATCH 23/26] Btrfs: optionally avoid reads from device replace source drive

[PATCH 24/26] Btrfs: increase BTRFS_MAX_MIRRORS by one for dev replace

[PATCH 25/26] Btrfs: allow repair code to include target disk when searching mirrors

[PATCH 26/26] Btrfs: add support for device replace ioctls
