I've been trying to figure out the btrfs I/O stack, in order to
understand why, sometimes (but not always), after a failure to read a
(data, non-replicated) block from the disk, the file being accessed
becomes permanently locked, and the filesystem unmountable.

Sometimes (but not always) it's possible to kill the process that
accessed the file, and sometimes (but not always) the failure causes
the machine load to skyrocket by 60+ processes.

In one of the failures that caused machine load spikes, I tried to
collect info on active processes with perf top and SysRq-T, but
nothing there seemed to explain the spike.  Thoughts on how to figure
out what's causing this?

Another weirdness I noticed is that, after a single read failure,
btree_io_failed_hook gets called multiple times, until io_pages gets
down to zero.  This seems wrong: I think it should only be called once
when a single block fails, rather than having that single failure get
all pending pages marked as failed, no?

Here are some instrumented dumps I collected from one occurrence of
the scenario described in the previous paragraph (it didn't cause a
load spike).  Only one disk block had a read failure.  At the end, I
enclose the patch that got those dumps printed, the result of several
iterations in which one failure led me to find another function to
instrument.

end_request: I/O error, dev sdd, sector 183052083
btrfs: bdev /dev/sdd4 errs: wr 0, rd 3, flush 0, corrupt 0, gen 0
btrfs_end_bio orig -EIO 1 > 0 pending 0 end ffffffffa0240820,ffffffffa020c2d0
end_workqueue_bio err -5 bi_rw 0
ata5: EH complete
end_workqueue_fn err -5 end_io ffffffffa020c2d0,ffffffffa0231080
btree_io_failed_hook failed_mirror 1 io_pages 15 readahead 0
end_bio_extent_readpage err -5 faied_hook ffffffffa020bed0 ret -5
btree_io_failed_hook failed_mirror 1 io_pages 14 readahead 0
end_bio_extent_readpage err -5 failed_hook ffffffffa020bed0 ret -5
[...both messages repeat, with io_pages decremented one at a time...]
btree_io_failed_hook failed_mirror 1 io_pages 0 readahead 0
end_bio_extent_readpage err -5 failed_hook ffffffffa020bed0 ret -5
(no further related messages)

-- 
Alexandre Oliva, freedom fighter    http://FSFLA.org/~lxoliva/
You must be the change you wish to see in the world. -- Gandhi
Be Free! -- http://FSFLA.org/   FSF Latin America board member
Free Software Evangelist      Red Hat Brazil Compiler Engineer
On Apr 4, 2013, Alexandre Oliva <oliva@gnu.org> wrote:

> I've been trying to figure out the btrfs I/O stack, in order to
> understand why, sometimes (but not always), after a failure to read a
> (data, non-replicated) block from the disk, the file being accessed
> becomes permanently locked, and the filesystem unmountable.

So, after some further investigation, we could determine that the
problem was that end_bio_extent_readpage would unlock_extent_cached
only part of the page, because it had previously computed whole_page
as zero because of the nonzero bv_offset.

So I started hunting for some place that would set up the bio with
partial pages, and I failed to find one.

I was already suspecting some race condition or other form of
corruption of the bvec before it got to end_bio_extent_readpage when I
realized that the bv_offset was always a multiple of 512 bytes, and
that it represented the offset into the 4KiB page that the failed
sector was going to occupy.

So I started hunting for places that modified bv_offset, and I found
blk_update_request in block/blk-core.c, where the very error message
reporting the failed sector was output.

The conclusion is that we cannot assume the bvec is unmodified between
our submitting the bio and our getting an error back.

OTOH, I don't see that we ever set up bvecs that do not correspond to
whole pages.  Indeed, my attempts to catch such situations with a
wrapper around bio_add_page got no hits whatsoever, which suggests we
could just do away with the whole_page computation and take
bv_offset + bv_len == PAGE_CACHE_SIZE as the requested read size.

With this patch, after a read error, I get an EIO rather than a
process hang that causes further attempts to access the file to hang,
generally in a non-interruptible way.  Yay!

-- 
Alexandre Oliva, freedom fighter    http://FSFLA.org/~lxoliva/
You must be the change you wish to see in the world. -- Gandhi
Be Free! -- http://FSFLA.org/   FSF Latin America board member
Free Software Evangelist      Red Hat Brazil Compiler Engineer
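[The gist of the change, condensed from the patch quoted later in the
thread; this is a sketch of the logic in end_bio_extent_readpage, not
the verbatim diff.  bv_offset and bv_len are the bvec fields that
blk_update_request() may have advanced by the time the endio runs:]

	/* Old: trust bv_offset/bv_len as submitted; a bvec that does
	 * not span the whole page is treated as partial-page I/O, so
	 * only part of the file range gets unlocked and the page
	 * stays locked forever after a partial read failure. */
	start = page_offset(page) + bvec->bv_offset;
	end = start + bvec->bv_len - 1;
	whole_page = (bvec->bv_offset == 0 &&
		      bvec->bv_len == PAGE_CACHE_SIZE);

	/* New: assume every bvec covered a whole page at submission
	 * time.  After a partial failure, bv_offset has been advanced
	 * and bv_len shrunk, but their sum still marks the end of the
	 * page-sized range that was requested, so the full range is
	 * unlocked and the reader gets -EIO instead of hanging. */
	start = page_offset(page);
	end = start + bvec->bv_offset + bvec->bv_len - 1;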
On Thu, Apr 04, 2013 at 01:10:27PM -0300, Alexandre Oliva wrote:
> I've been trying to figure out the btrfs I/O stack, in order to
> understand why, sometimes (but not always), after a failure to read a
> (data, non-replicated) block from the disk, the file being accessed
> becomes permanently locked, and the filesystem unmountable.
> 
> Sometimes (but not always) it's possible to kill the process that
> accessed the file, and sometimes (but not always) the failure causes
> the machine load to skyrocket by 60+ processes.
> 
> In one of the failures that caused machine load spikes, I tried to
> collect info on active processes with perf top and SysRq-T, but
> nothing there seemed to explain the spike.  Thoughts on how to figure
> out what's causing this?

Hi Alexandre,

Although I've seen your solution patch in this thread, I'm still
curious about this scenario; could you please share the reproducer
script or something?

> Another weirdness I noticed is that, after a single read failure,
> btree_io_failed_hook gets called multiple times, until io_pages gets
> down to zero.  This seems wrong: I think it should only be called
> once when a single block fails, rather than having that single
> failure get all pending pages marked as failed, no?

I guess that you're using '-l 64k -n 64k' for mkfs.btrfs, so an 'eb'
has 16 pages in total, and reading an eb will not complete until all
16 pages finish reading.

thanks,
liubo

> Here are some instrumented dumps I collected from one occurrence of
> the scenario described in the previous paragraph (it didn't cause a
> load spike).  Only one disk block had a read failure.  At the end, I
> enclose the patch that got those dumps printed, the result of several
> iterations in which one failure led me to find another function to
> instrument.
> 
> end_request: I/O error, dev sdd, sector 183052083
> btrfs: bdev /dev/sdd4 errs: wr 0, rd 3, flush 0, corrupt 0, gen 0
> btrfs_end_bio orig -EIO 1 > 0 pending 0 end ffffffffa0240820,ffffffffa020c2d0
> end_workqueue_bio err -5 bi_rw 0
> ata5: EH complete
> end_workqueue_fn err -5 end_io ffffffffa020c2d0,ffffffffa0231080
> btree_io_failed_hook failed_mirror 1 io_pages 15 readahead 0
> end_bio_extent_readpage err -5 faied_hook ffffffffa020bed0 ret -5
> btree_io_failed_hook failed_mirror 1 io_pages 14 readahead 0
> end_bio_extent_readpage err -5 failed_hook ffffffffa020bed0 ret -5
> [...both messages repeat, with io_pages decremented one at a time...]
> btree_io_failed_hook failed_mirror 1 io_pages 0 readahead 0
> end_bio_extent_readpage err -5 failed_hook ffffffffa020bed0 ret -5
> (no further related messages)
> 
> Be verbose about the path followed after an I/O error
> 
> From: Alexandre Oliva <lxoliva@fsfla.org>
> 
> 
> ---
>  fs/btrfs/disk-io.c   | 22 ++++++++++++++++++++--
>  fs/btrfs/extent_io.c |  6 ++++++
>  fs/btrfs/volumes.c   | 31 +++++++++++++++++++++++++++++--
>  3 files changed, 55 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 6d19a0a..20f9828 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -659,13 +659,18 @@ static int btree_io_failed_hook(struct page *page, int failed_mirror)
>  {
>  	struct extent_buffer *eb;
>  	struct btrfs_root *root = BTRFS_I(page->mapping->host)->root;
> +	long io_pages;
> +	bool readahead;
>  
>  	eb = (struct extent_buffer *)page->private;
>  	set_bit(EXTENT_BUFFER_IOERR, &eb->bflags);
>  	eb->read_mirror = failed_mirror;
> -	atomic_dec(&eb->io_pages);
> -	if (test_and_clear_bit(EXTENT_BUFFER_READAHEAD, &eb->bflags))
> +	io_pages = atomic_dec_return(&eb->io_pages);
> +	if ((readahead = test_and_clear_bit(EXTENT_BUFFER_READAHEAD, &eb->bflags)))
>  		btree_readahead_hook(root, eb, eb->start, -EIO);
> +	printk(KERN_ERR
> +	       "btree_io_failed_hook failed_mirror %i io_pages %li readahead %i\n",
> +	       failed_mirror, io_pages, readahead);
>  	return -EIO;	/* we fixed nothing */
>  }
> 
> @@ -674,6 +679,12 @@ static void end_workqueue_bio(struct bio *bio, int err)
>  {
>  	struct end_io_wq *end_io_wq = bio->bi_private;
>  	struct btrfs_fs_info *fs_info;
>  
> +	if (err) {
> +		printk(KERN_ERR
> +		       "end_workqueue_bio err %i bi_rw %lx\n",
> +		       err, (unsigned long)bio->bi_rw);
> +	}
> +
>  	fs_info = end_io_wq->info;
>  	end_io_wq->error = err;
>  	end_io_wq->work.func = end_workqueue_fn;
> @@ -1647,6 +1658,13 @@ static void end_workqueue_fn(struct btrfs_work *work)
>  	fs_info = end_io_wq->info;
>  
>  	error = end_io_wq->error;
> +
> +	if (error) {
> +		printk(KERN_ERR
> +		       "end_workqueue_fn err %i end_io %p,%p\n",
> +		       error, bio->bi_end_io, end_io_wq->end_io);
> +	}
> +
>  	bio->bi_private = end_io_wq->private;
>  	bio->bi_end_io = end_io_wq->end_io;
>  	kfree(end_io_wq);
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index cdee391..355b24e 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -2422,6 +2422,9 @@ static void end_bio_extent_readpage(struct bio *bio, int err)
>  
>  		if (!uptodate && tree->ops && tree->ops->readpage_io_failed_hook) {
>  			ret = tree->ops->readpage_io_failed_hook(page, mirror);
> +			printk(KERN_ERR
> +			       "end_bio_extent_readpage err %i failed_hook %p ret %i\n",
> +			       err, tree->ops->readpage_io_failed_hook, ret);
>  			if (!ret && !err &&
>  			    test_bit(BIO_UPTODATE, &bio->bi_flags))
>  				uptodate = 1;
> @@ -2437,6 +2440,9 @@ static void end_bio_extent_readpage(struct bio *bio, int err)
>  		 * remain responsible for that page.
>  		 */
>  		ret = bio_readpage_error(bio, page, start, end, mirror, NULL);
> +		printk(KERN_ERR
> +		       "end_bio_extent_readpage err %i readpage_error ret %i\n",
> +		       err, ret);
>  		if (ret == 0) {
>  			uptodate =
>  				test_bit(BIO_UPTODATE, &bio->bi_flags);
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 2854c82..59ea128 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -5033,6 +5033,8 @@ static void btrfs_end_bio(struct bio *bio, int err)
>  {
>  	struct btrfs_bio *bbio = extract_bbio_from_bio_private(bio->bi_private);
>  	int is_orig_bio = 0;
> +	int pending;
> +	int errors = 0;
>  
>  	if (err) {
>  		atomic_inc(&bbio->error);
> @@ -5062,11 +5064,13 @@ static void btrfs_end_bio(struct bio *bio, int err)
>  	if (bio == bbio->orig_bio)
>  		is_orig_bio = 1;
>  
> -	if (atomic_dec_and_test(&bbio->stripes_pending)) {
> +	if ((pending = atomic_dec_return(&bbio->stripes_pending)) == 0) {
> +		void (*endio)(struct bio *, int);
>  		if (!is_orig_bio) {
>  			bio_put(bio);
>  			bio = bbio->orig_bio;
>  		}
> +		endio = bio->bi_end_io;
>  		bio->bi_private = bbio->private;
>  		bio->bi_end_io = bbio->end_io;
>  		bio->bi_bdev = (struct block_device *)
> @@ -5074,21 +5078,44 @@ static void btrfs_end_bio(struct bio *bio, int err)
>  		/* only send an error to the higher layers if it is
>  		 * beyond the tolerance of the btrfs bio
>  		 */
> -		if (atomic_read(&bbio->error) > bbio->max_errors) {
> +		;
> +		if ((errors = atomic_read(&bbio->error)) > bbio->max_errors) {
>  			err = -EIO;
> +			printk(KERN_ERR
> +			       "btrfs_end_bio %s-EIO %i > %i pending %i end %p,%p\n",
> +			       is_orig_bio ? "orig " : "",
> +			       errors, bbio->max_errors, pending,
> +			       endio, bio->bi_end_io);
>  		} else {
>  			/*
>  			 * this bio is actually up to date, we didn't
>  			 * go over the max number of errors
>  			 */
>  			set_bit(BIO_UPTODATE, &bio->bi_flags);
> +			if (errors) {
> +				printk(KERN_ERR
> +				       "btrfs_end_bio %s retry %i <= %i pending %i end %p,%p\n",
> +				       is_orig_bio ? "orig " : "",
> +				       errors, bbio->max_errors, pending,
> +				       endio, bio->bi_end_io);
> +			}
>  			err = 0;
>  		}
>  		kfree(bbio);
>  
>  		bio_endio(bio, err);
>  	} else if (!is_orig_bio) {
> +		errors = atomic_read(&bbio->error);
>  		bio_put(bio);
> +		if (errors) {
> +			printk(KERN_ERR
> +			       "btrfs_end_bio no-endio %i <?> %i pending %i\n",
> +			       errors, bbio->max_errors, pending);
> +		}
> +	} else if ((errors = atomic_read(&bbio->error))) {
> +		printk(KERN_ERR
> +		       "btrfs_end_bio orig no-endio %i <?> %i pending %i\n",
> +		       errors, bbio->max_errors, pending);
>  	}
>  }
> 
> 
> -- 
> Alexandre Oliva, freedom fighter    http://FSFLA.org/~lxoliva/
> You must be the change you wish to see in the world. -- Gandhi
> Be Free! -- http://FSFLA.org/   FSF Latin America board member
> Free Software Evangelist      Red Hat Brazil Compiler Engineer
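[A quick standalone check of liubo's 16-pages-per-eb arithmetic above;
a sketch in plain userspace C, not kernel code, assuming 4KiB pages
and the '-l 64k' leaf size he guessed:]

#include <stdio.h>

int main(void)
{
	unsigned leafsize = 64 * 1024;	/* mkfs.btrfs -l 64k */
	unsigned page_size = 4 * 1024;	/* PAGE_CACHE_SIZE on x86 */

	/* One extent buffer (eb) spans leafsize bytes, so its read is
	 * split across this many pages; eb->io_pages starts here and
	 * each page's completion decrements it, which is why a single
	 * failed sector still yields 16 btree_io_failed_hook calls,
	 * logging io_pages as 15, 14, ..., 0 as in the quoted dump. */
	printf("pages per eb: %u\n", leafsize / page_size);	/* 16 */
	return 0;
}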
On Sat, May 11, 2013 at 01:16:38AM -0600, Alexandre Oliva wrote:
> On Apr 4, 2013, Alexandre Oliva <oliva@gnu.org> wrote:
> 
> > I've been trying to figure out the btrfs I/O stack, in order to
> > understand why, sometimes (but not always), after a failure to read
> > a (data, non-replicated) block from the disk, the file being
> > accessed becomes permanently locked, and the filesystem
> > unmountable.
> 
> So, after some further investigation, we could determine that the
> problem was that end_bio_extent_readpage would unlock_extent_cached
> only part of the page, because it had previously computed whole_page
> as zero because of the nonzero bv_offset.
> 
> So I started hunting for some place that would set up the bio with
> partial pages, and I failed to find one.
> 
> I was already suspecting some race condition or other form of
> corruption of the bvec before it got to end_bio_extent_readpage when
> I realized that the bv_offset was always a multiple of 512 bytes, and
> that it represented the offset into the 4KiB page that the failed
> sector was going to occupy.
> 
> So I started hunting for places that modified bv_offset, and I found
> blk_update_request in block/blk-core.c, where the very error message
> reporting the failed sector was output.
> 
> The conclusion is that we cannot assume the bvec is unmodified
> between our submitting the bio and our getting an error back.
> 
> OTOH, I don't see that we ever set up bvecs that do not correspond to
> whole pages.  Indeed, my attempts to catch such situations with a
> wrapper around bio_add_page got no hits whatsoever, which suggests we
> could just do away with the whole_page computation and take
> bv_offset + bv_len == PAGE_CACHE_SIZE as the requested read size.
> 
> With this patch, after a read error, I get an EIO rather than a
> process hang that causes further attempts to access the file to hang,
> generally in a non-interruptible way.  Yay!
> 
> 
> btrfs: do away with non-whole_page extent I/O
> 
> From: Alexandre Oliva <oliva@gnu.org>
> 
> end_bio_extent_readpage computes whole_page based on bv_offset and
> bv_len, without taking into account that blk_update_request may
> modify them when some of the blocks to be read into a page produce a
> read error.  This would cause the read to unlock only part of the
> file range associated with the page, which would in turn leave the
> entire page locked, which would not only keep the process blocked
> instead of returning -EIO to it, but also prevent any further access
> to the file.
> 
> It turns out that btrfs always issues whole-page reads and writes.
> The special handling of non-whole_page appears to be a mistake or a
> left-over from a time when this wasn't the case.  Indeed,
> end_bio_extent_writepage distinguished between whole_page and
> non-whole_page writes but behaved identically in both cases!
> 
> I've replaced the whole_page computations with warnings, just to be
> sure that we're not issuing partial-page reads or writes.  The
> warnings should probably just go away some time.
> 
> Signed-off-by: Alexandre Oliva <oliva@gnu.org>
> ---
>  fs/btrfs/extent_io.c | 85 ++++++++++++++++++--------------------------------
>  1 file changed, 30 insertions(+), 55 deletions(-)
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index cdee391..f44b033 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -1873,28 +1873,6 @@ static void check_page_uptodate(struct extent_io_tree *tree, struct page *page)
>  }
>  
>  /*
> - * helper function to unlock a page if all the extents in the tree
> - * for that page are unlocked
> - */
> -static void check_page_locked(struct extent_io_tree *tree, struct page *page)
> -{
> -	u64 start = page_offset(page);
> -	u64 end = start + PAGE_CACHE_SIZE - 1;
> -	if (!test_range_bit(tree, start, end, EXTENT_LOCKED, 0, NULL))
> -		unlock_page(page);
> -}
> -
> -/*
> - * helper function to end page writeback if all the extents
> - * in the tree for that page are done with writeback
> - */
> -static void check_page_writeback(struct extent_io_tree *tree,
> -				 struct page *page)
> -{
> -	end_page_writeback(page);
> -}
> -
> -/*
>   * When IO fails, either with EIO or csum verification fails, we
>   * try other mirrors that might have a good copy of the data.  This
>   * io_failure_record is used to record state as we go through all the
> @@ -2323,19 +2301,24 @@ static void end_bio_extent_writepage(struct bio *bio, int err)
>  	struct extent_io_tree *tree;
>  	u64 start;
>  	u64 end;
> -	int whole_page;
>  
>  	do {
>  		struct page *page = bvec->bv_page;
>  		tree = &BTRFS_I(page->mapping->host)->io_tree;
>  
> -		start = page_offset(page) + bvec->bv_offset;
> -		end = start + bvec->bv_len - 1;
> +		/* We always issue full-page reads, but if some block
> +		 * in a page fails to read, blk_update_request() will
> +		 * advance bv_offset and adjust bv_len to compensate.
> +		 * Print a warning for nonzero offsets, and an error
> +		 * if they don't add up to a full page.  */
> +		if (bvec->bv_offset || bvec->bv_len != PAGE_CACHE_SIZE)
> +			printk("%s page write in btrfs with offset %u and length %u\n",
> +			       bvec->bv_offset + bvec->bv_len != PAGE_CACHE_SIZE
> +			       ? KERN_ERR "partial" : KERN_INFO "incomplete",
> +			       bvec->bv_offset, bvec->bv_len);
>  
> -		if (bvec->bv_offset == 0 && bvec->bv_len == PAGE_CACHE_SIZE)
> -			whole_page = 1;
> -		else
> -			whole_page = 0;
> +		start = page_offset(page);
> +		end = start + bvec->bv_offset + bvec->bv_len - 1;
>  
>  		if (--bvec >= bio->bi_io_vec)
>  			prefetchw(&bvec->bv_page->flags);
> @@ -2343,10 +2326,7 @@ static void end_bio_extent_writepage(struct bio *bio, int err)
>  		if (end_extent_writepage(page, err, start, end))
>  			continue;
>  
> -		if (whole_page)
> -			end_page_writeback(page);
> -		else
> -			check_page_writeback(tree, page);
> +		end_page_writeback(page);
>  	} while (bvec >= bio->bi_io_vec);
>  
>  	bio_put(bio);
> @@ -2371,7 +2351,6 @@ static void end_bio_extent_readpage(struct bio *bio, int err)
>  	struct extent_io_tree *tree;
>  	u64 start;
>  	u64 end;
> -	int whole_page;
>  	int mirror;
>  	int ret;
>  
> @@ -2388,13 +2367,19 @@ static void end_bio_extent_readpage(struct bio *bio, int err)
>  			 (long int)bio->bi_bdev);
>  		tree = &BTRFS_I(page->mapping->host)->io_tree;
>  
> -		start = page_offset(page) + bvec->bv_offset;
> -		end = start + bvec->bv_len - 1;
> +		/* We always issue full-page reads, but if some block
> +		 * in a page fails to read, blk_update_request() will
> +		 * advance bv_offset and adjust bv_len to compensate.
> +		 * Print a warning for nonzero offsets, and an error
> +		 * if they don't add up to a full page.
>  		 */
> +		if (bvec->bv_offset || bvec->bv_len != PAGE_CACHE_SIZE)
> +			printk("%s page read in btrfs with offset %u and length %u\n",
> +			       bvec->bv_offset + bvec->bv_len != PAGE_CACHE_SIZE
> +			       ? KERN_ERR "partial" : KERN_INFO "incomplete",
> +			       bvec->bv_offset, bvec->bv_len);
>  
> -		if (bvec->bv_offset == 0 && bvec->bv_len == PAGE_CACHE_SIZE)
> -			whole_page = 1;
> -		else
> -			whole_page = 0;
> +		start = page_offset(page);
> +		end = start + bvec->bv_offset + bvec->bv_len - 1;
>  
>  		if (++bvec <= bvec_end)
>  			prefetchw(&bvec->bv_page->flags);
> @@ -2453,23 +2438,13 @@ static void end_bio_extent_readpage(struct bio *bio, int err)
>  		}
>  		unlock_extent_cached(tree, start, end, &cached, GFP_ATOMIC);
>  
> -		if (whole_page) {
> -			if (uptodate) {
> -				SetPageUptodate(page);
> -			} else {
> -				ClearPageUptodate(page);
> -				SetPageError(page);
> -			}
> -			unlock_page(page);
> +		if (uptodate) {
> +			SetPageUptodate(page);
>  		} else {
> -			if (uptodate) {
> -				check_page_uptodate(tree, page);
> -			} else {
> -				ClearPageUptodate(page);
> -				SetPageError(page);
> -			}
> -			check_page_locked(tree, page);
> +			ClearPageUptodate(page);
> +			SetPageError(page);
>  		}
> +		unlock_page(page);
>  	} while (bvec <= bvec_end);
>  
>  	bio_put(bio);

So this should only happen in the case that you are on a dm device, it
looks like; is that how you are running?  Kind of mean that the block
layer does that to us, though.  I wonder what fses that do
sub-page-size writes are supposed to do.  Jens, do you have any
insight to offer here?  Do we just need to cram the info we need into
a private and just not trust the bio we get back?  Thanks,

Josef
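[One shape Josef's "cram it into a private" idea could take, as a
hypothetical sketch only: this is not what btrfs does, and the wrapper
struct, its field names, and the handle_read_completion() helper are
invented for illustration:]

/* Remember the range each bio covered at submit time in a wrapper
 * hung off bi_private, so the completion handler need not trust the
 * bvec fields that blk_update_request() may have advanced after a
 * partial failure. */
struct saved_bio_extent {
	void	*orig_private;	/* submitter's original bi_private */
	u64	start;		/* file offset, recorded at submit time */
	u32	len;		/* length, recorded at submit time */
};

/* Completion side; the wrapper would be kmalloc'ed and installed in
 * bio->bi_private just before submit_bio(). */
static void saved_extent_end_io(struct bio *bio, int err)
{
	struct saved_bio_extent *save = bio->bi_private;

	/* Use the remembered range, not bv_offset/bv_len, which the
	 * block layer may have modified underneath us. */
	handle_read_completion(save->start, save->len, err); /* invented */

	bio->bi_private = save->orig_private;
	kfree(save);
}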
On May 14, 2013, Liu Bo <bo.li.liu@oracle.com> wrote:

>> In one of the failures that caused machine load spikes, I tried to
>> collect info on active processes with perf top and SysRq-T, but
>> nothing there seemed to explain the spike.  Thoughts on how to
>> figure out what's causing this?

> Although I've seen your solution patch in this thread, I'm still
> curious about this scenario; could you please share the reproducer
> script or something?

I'm afraid I don't have one.  I just use the filesystem on various
disks, with ceph osds and other non-ceph subvolumes and files, and
occasionally I run into one of these bad blocks and the filesystem
gets into these odd states.

> I guess that you're using '-l 64k -n 64k' for mkfs.btrfs

That is correct, but IIUC this should only affect metadata, and
metadata recovery from the DUP block works.  It's data (single copy)
that fails as described.

-- 
Alexandre Oliva, freedom fighter    http://FSFLA.org/~lxoliva/
You must be the change you wish to see in the world. -- Gandhi
Be Free! -- http://FSFLA.org/   FSF Latin America board member
Free Software Evangelist      Red Hat Brazil Compiler Engineer
On May 15, 2013, Josef Bacik <jbacik@fusionio.com> wrote:

> So this should only happen in the case that you are on a dm device,
> it looks like; is that how you are running?

That was my first thought, but no, I'm using partitions out of the
SATA disks directly.  I even checked for stray dm out of fake raid or
somesuch, but the dm modules were not even loaded, and perusing
/sys/block confirms the "scsi" devices are actual ATA disks.

Further investigation suggested that, when individual 512-byte blocks
are read from a disk (that's the block size reported by the kernel),
the underlying disk driver is supposed to inform the upper layers of
how much it could read by updating the bio_vec in precisely the
observed way.

-- 
Alexandre Oliva, freedom fighter    http://FSFLA.org/~lxoliva/
You must be the change you wish to see in the world. -- Gandhi
Be Free! -- http://FSFLA.org/   FSF Latin America board member
Free Software Evangelist      Red Hat Brazil Compiler Engineer
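[A simplified model of that reporting, matching the numbers seen
earlier in the thread; this sketches the effect only, while the real
logic lives in blk_update_request() in block/blk-core.c and handles
far more cases:]

/* On a partial completion, the block layer retires the sectors that
 * did complete by advancing the leading bio_vec before the bio's
 * end_io runs.  For a 4KiB page whose first N 512-byte sectors were
 * readable, the end_io therefore sees bv_offset == N * 512 and
 * bv_len == 4096 - N * 512: bv_offset is always a multiple of 512,
 * and bv_offset + bv_len still equals the page size, exactly as
 * observed in end_bio_extent_readpage. */
static void advance_bvec(struct bio_vec *bvec, unsigned int done_bytes)
{
	bvec->bv_offset += done_bytes;	/* done_bytes = N * 512 */
	bvec->bv_len -= done_bytes;
}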