Alex Lyakas
2013-Jan-21  18:37 UTC
btrfs_start_delalloc_inodes livelocks when creating snapshot under IO
Greetings all, I see the following issue during snap creation under IO: Transaction commit calls btrfs_start_delalloc_inodes() that locks the delalloc_inodes list, fetches the first inode, unlocks the list, triggers btrfs_alloc_delalloc_work/btrfs_queue_worker for this inode and then locks the list again. Then it checks the head of the list again. In my case, this is always exactly the same inode. As a result, this function allocates a huge amount of btrfs_delalloc_work structures, and I start seeing OOM messages in the kernel log, killing processes etc. During that time this transaction commit is stuck, so, for example, other requests to create snapshot (that must wait for this transaction to commit first) get stuck. The transaction_kthread also gets stuck in attempt to commit the transaction. Is this an intended behavior? Shouldn''t we ensure that every inode in the delalloc list gets handled at most once? If the delalloc work is processed asynchronously, maybe the delalloc list can be locked once and traversed once? Josef, I see in commit 996d282c7ff470f150a467eb4815b90159d04c47 that you mention that "btrfs_start_delalloc_inodes will just walk the list of delalloc inodes and start writing them out, but it doesn''t splice the list or anything so as long as somebody is doing work on the box you could end up in this section _forever_." I guess I am hitting this here also. Miao, I tested the behavior before your commit 8ccf6f19b67f7e0921063cc309f4672a6afcb528 "Btrfs: make delalloc inodes be flushed by multi-task", on kernel 3.6. I see same issue there as well, but OOM doesn''t happen, because before your change btrfs_start_delalloc_inodes() was calling filemap_flush() directly. But I see still that btrfs_start_delalloc_inodes() handles same inode more than once, and in some cases never returns in 15 minutes or more, delaying all other transactions. And snapshot creation gets stuck for all this time. (The stack I see on kernel 3.6 is like this: [<ffffffff812f26c6>] get_request_wait+0xf6/0x1d0 [<ffffffff812f35df>] blk_queue_bio+0x7f/0x380 [<ffffffff812f0374>] generic_make_request.part.50+0x74/0xb0 [<ffffffff812f0788>] generic_make_request+0x68/0x70 [<ffffffff812f0815>] submit_bio+0x85/0x110 [<ffffffffa034ace5>] btrfs_map_bio+0x165/0x2f0 [btrfs] [<ffffffffa032880f>] btrfs_submit_bio_hook+0x7f/0x170 [btrfs] [<ffffffffa033b7da>] submit_one_bio+0x6a/0xa0 [btrfs] [<ffffffffa033f8a4>] submit_extent_page.isra.34+0xe4/0x230 [btrfs] [<ffffffffa034084c>] __extent_writepage+0x5ec/0x810 [btrfs] [<ffffffffa0340d22>] extent_write_cache_pages.isra.26.constprop.40+0x2b2/0x410 [btrfs] [<ffffffffa03410c5>] extent_writepages+0x45/0x60 [btrfs] [<ffffffffa0327178>] btrfs_writepages+0x28/0x30 [btrfs] [<ffffffff81122b21>] do_writepages+0x21/0x40 [<ffffffff81118e5b>] __filemap_fdatawrite_range+0x5b/0x60 [<ffffffff8111982c>] filemap_flush+0x1c/0x20 [<ffffffffa0334289>] btrfs_start_delalloc_inodes+0xc9/0x1f0 [btrfs] [<ffffffffa0324f5d>] btrfs_commit_transaction+0x44d/0xaf0 [btrfs] [<ffffffffa035200d>] btrfs_mksubvol.isra.53+0x37d/0x440 [btrfs] [<ffffffffa03521ca>] btrfs_ioctl_snap_create_transid+0xfa/0x190 [btrfs] [<ffffffffa03523e3>] btrfs_ioctl_snap_create_v2+0x103/0x140 [btrfs] [<ffffffffa03546cf>] btrfs_ioctl+0x80f/0x1bf0 [btrfs] [<ffffffff8118a01a>] do_vfs_ioctl+0x8a/0x340 [<ffffffff8118a361>] sys_ioctl+0x91/0xa0 [<ffffffff81665c42>] system_call_fastpath+0x16/0x1b [<ffffffffffffffff>] 0xffffffffffffffff Somehow the request queue of the block device gets empty and the transaction waits for a long time to allocate a request.) Some details about my setup: I am testing for-linus Chris''s branch I have one subvolume with 8 large files (10GB each). I am running two fio processes (one per file, so only 2 out of 8 files are involved) with 8 threads each like this: fio --thread --directory=/btrfs/subvol1 --rw=randwrite --randrepeat=1 --fadvise_hint=0 --fallocate=posix --size=1000m --filesize=10737418240 --bsrange=512b-64k --scramble_buffers=1 --nrfiles=1 --overwrite=1 --ioengine=sync --filename=file-1 --name=job0 --name=job1 --name=job2 --name=job3 --name=job4 --name=job5 --name=job6 --name=job7 The files are preallocated with fallocate before the fio run. Mount options: noatime,nodatasum,nodatacow,nospace_cache Can somebody please advise on how to address this issue, and, if possible, how to solve it on kernel 3.6. Thanks, Alex. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Miao Xie
2013-Jan-22  08:59 UTC
Re: btrfs_start_delalloc_inodes livelocks when creating snapshot under IO
On mon, 21 Jan 2013 20:37:57 +0200, Alex Lyakas wrote:> Greetings all, > > I see the following issue during snap creation under IO: > Transaction commit calls btrfs_start_delalloc_inodes() that locks the > delalloc_inodes list, fetches the first inode, unlocks the list, > triggers btrfs_alloc_delalloc_work/btrfs_queue_worker for this inode > and then locks the list again. Then it checks the head of the list > again. In my case, this is always exactly the same inode. As a result, > this function allocates a huge amount of btrfs_delalloc_work > structures, and I start seeing OOM messages in the kernel log, killing > processes etc.OK, I will deal with this problem.> > During that time this transaction commit is stuck, so, for example, > other requests to create snapshot (that must wait for this transaction > to commit first) get stuck. The transaction_kthread also gets stuck in > attempt to commit the transaction. > > Is this an intended behavior? Shouldn''t we ensure that every inode in > the delalloc list gets handled at most once? If the delalloc work is > processed asynchronously, maybe the delalloc list can be locked once > and traversed once?1st question: It is an intended behavior. We flush the delalloc inodes when committing the transaction is to guarantee that the data in the snapshot is the same as the source subvolume. For example: # dd if=/dev/zero of=<mnt>/file0 bs=1M count=1 # btrfs subvolume snapshot <mnt> <mnt>/snap0 If we don''t flush the delalloc inode(file0), we will find the data in <mnt/>snap0/file0 is different with <mnt>/file0. It is why we have to flush the delalloc inodes. 2nd question: I think we needn''t flush all the delalloc inodes in the fs when creating a snapshot, we just need flush all the delalloc inodes of the source subvolumes when creating a snapshot. If so, we must split the delalloc list and introduce pre-subvolume delalloc list.(If no one rejects this idea, I can implement it.) 3th question: Traversing the delalloc list once is reasonable and safe, because we can guarantee the following process is safe. Task write data into a file make a snapshot Maybe somebody will worry this case: Task0 Task1 make a snapshot commit transaction flush delalloc inodes write data into a file commit all trees update super block transaction commit end because we can not get the data that Task1 wrote from the snapshot. I think this case can be ignored because this data was written into the file after we start the process of the snapshot creation.> Josef, I see in commit 996d282c7ff470f150a467eb4815b90159d04c47 that > you mention that "btrfs_start_delalloc_inodes will just walk the list > of delalloc inodes and start writing them out, but it doesn''t splice > the list or anything so as long as somebody is doing work on the box > you could end up in this section _forever_." I guess I am hitting this > here also. > > Miao, I tested the behavior before your commit > 8ccf6f19b67f7e0921063cc309f4672a6afcb528 "Btrfs: make delalloc inodes > be flushed by multi-task", on kernel 3.6. I see same issue there as > well, but OOM doesn''t happen, because before your change > btrfs_start_delalloc_inodes() was calling filemap_flush() directly. > But I see still that btrfs_start_delalloc_inodes() handles same inode > more than once, and in some cases never returns in 15 minutes or more, > delaying all other transactions. And snapshot creation gets stuck for > all this time.You are right. I will fix this problem.> > (The stack I see on kernel 3.6 is like this: > [<ffffffff812f26c6>] get_request_wait+0xf6/0x1d0 > [<ffffffff812f35df>] blk_queue_bio+0x7f/0x380 > [<ffffffff812f0374>] generic_make_request.part.50+0x74/0xb0 > [<ffffffff812f0788>] generic_make_request+0x68/0x70 > [<ffffffff812f0815>] submit_bio+0x85/0x110 > [<ffffffffa034ace5>] btrfs_map_bio+0x165/0x2f0 [btrfs] > [<ffffffffa032880f>] btrfs_submit_bio_hook+0x7f/0x170 [btrfs] > [<ffffffffa033b7da>] submit_one_bio+0x6a/0xa0 [btrfs] > [<ffffffffa033f8a4>] submit_extent_page.isra.34+0xe4/0x230 [btrfs] > [<ffffffffa034084c>] __extent_writepage+0x5ec/0x810 [btrfs] > [<ffffffffa0340d22>] > extent_write_cache_pages.isra.26.constprop.40+0x2b2/0x410 [btrfs] > [<ffffffffa03410c5>] extent_writepages+0x45/0x60 [btrfs] > [<ffffffffa0327178>] btrfs_writepages+0x28/0x30 [btrfs] > [<ffffffff81122b21>] do_writepages+0x21/0x40 > [<ffffffff81118e5b>] __filemap_fdatawrite_range+0x5b/0x60 > [<ffffffff8111982c>] filemap_flush+0x1c/0x20 > [<ffffffffa0334289>] btrfs_start_delalloc_inodes+0xc9/0x1f0 [btrfs] > [<ffffffffa0324f5d>] btrfs_commit_transaction+0x44d/0xaf0 [btrfs] > [<ffffffffa035200d>] btrfs_mksubvol.isra.53+0x37d/0x440 [btrfs] > [<ffffffffa03521ca>] btrfs_ioctl_snap_create_transid+0xfa/0x190 [btrfs] > [<ffffffffa03523e3>] btrfs_ioctl_snap_create_v2+0x103/0x140 [btrfs] > [<ffffffffa03546cf>] btrfs_ioctl+0x80f/0x1bf0 [btrfs] > [<ffffffff8118a01a>] do_vfs_ioctl+0x8a/0x340 > [<ffffffff8118a361>] sys_ioctl+0x91/0xa0 > [<ffffffff81665c42>] system_call_fastpath+0x16/0x1b > [<ffffffffffffffff>] 0xffffffffffffffff > > Somehow the request queue of the block device gets empty and the > transaction waits for a long time to allocate a request.)I think it is because there is no enough memory and it must wait for the memory reclaim. 1038 * 1039 * Get a free request from @q. If %__GFP_WAIT is set in @gfp_mask, this 1040 * function keeps retrying under memory pressure and fails iff @q is dead. 1041 * Thanks Miao> > Some details about my setup: > I am testing for-linus Chris''s branch > I have one subvolume with 8 large files (10GB each). > I am running two fio processes (one per file, so only 2 out of 8 files > are involved) with 8 threads each like this: > fio --thread --directory=/btrfs/subvol1 --rw=randwrite --randrepeat=1 > --fadvise_hint=0 --fallocate=posix --size=1000m --filesize=10737418240 > --bsrange=512b-64k --scramble_buffers=1 --nrfiles=1 --overwrite=1 > --ioengine=sync --filename=file-1 --name=job0 --name=job1 --name=job2 > --name=job3 --name=job4 --name=job5 --name=job6 --name=job7 > The files are preallocated with fallocate before the fio run. > Mount options: noatime,nodatasum,nodatacow,nospace_cache > > Can somebody please advise on how to address this issue, and, if > possible, how to solve it on kernel 3.6. > > Thanks, > Alex. > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html >-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Maybe Matching Threads
- btrfs-delalloc - threaded?
- [PATCH] Btrfs: limit delalloc pages outside of find_delalloc_range
- [PATCH] Btrfs: make sure the delalloc workers actually flush compressed writes
- [PATCH] Btrfs: add tests for find_lock_delalloc_range
- [PATCH 1/5] Btrfs: fix joining the same transaction handler more than 2 times