John Lightsey
2017-Nov-21 02:45 UTC
[Ocfs2-devel] [PATCH] Bug#841144: kernel BUG at /build/linux-Wgpe2M/linux-4.8.11/fs/ocfs2/alloc.c:1514!
On Tue, 2017-11-21 at 00:58 +0000, Changwei Ge wrote:> > @@ -873,6 +875,7 @@ static int ocfs2_alloc_write_ctxt(struct > > ocfs2_write_ctxt **wcp, > > ? > > ? ocfs2_init_dealloc_ctxt(&wc->w_dealloc); > > ? INIT_LIST_HEAD(&wc->w_unwritten_list); > > + wc->w_unwritten_count = 0; > > I think you don't have to evaluate ::w_unwritten_count to zero since? > kzalloc already did that.Very true. I was following the example of how dwc was handling the dw_zero_count. You'll have to forgive me a bit. I'm very unfamiliar with the linux kernel codebase.> > > ? > > ? *wcp = wc; > > ? > > @@ -1373,6 +1376,7 @@ static int ocfs2_unwritten_check(struct inode > > *inode, > > ? desc->c_clear_unwritten = 0; > > ? list_add_tail(&new->ue_ip_node, &oi->ip_unwritten_list); > > ? list_add_tail(&new->ue_node, &wc->w_unwritten_list); > > + wc->w_unwritten_count++; > > You increase ::w_unwritten_coun once a new _ue_ is attached to? > ::w_unwritten_list. So if no _ue_ ever is attached, > ::w_unwritten_list? > is still empty. I think your change has the same effect with origin. > > Moreover I don't see the relation between the reported crash issue > and?your patch change. Can you elaborate further?The important part is in the next segment in the patch. This block is just using w_unwritten_count to track the size of w_unwritten_list.> > @@ -2246,7 +2250,7 @@ static int ocfs2_dio_get_block(struct inode > > *inode, sector_t iblock, > > ? ue->ue_phys = desc->c_phys; > > ? > > ? list_splice_tail_init(&wc->w_unwritten_list, &dwc->dw_zero_list); > > - dwc->dw_zero_count++; > > + dwc->dw_zero_count += wc->w_unwritten_count; > > ? } > > ? > >dw_zero_count is tracking the number of elements in dw_zero_list. The old version assumed that after dw_zero_list and w_unwritten_list were spliced together, that the new length was dw_zero_count + 1. This assumption is not correct if w_unwritten_list contained more than one element. The length of dw_zero_list is used by ocfs2_dio_end_io_write() to determine whether or not meta_ac will be needed to complete the write: ????ret = ocfs2_lock_allocators(inode, &et, 0, dwc->dw_zero_count*2, ????????????????????&data_ac, &meta_ac); This will return with success and a null meta_ac if there are at least dw_zero_count * 2 extents available for the write. Since dw_zero_count was not being calculated correctly, this will occasionally result in the write getting into ocfs2_grow_tree() with a null meta_ac following this chain: ocfs2_dio_end_io_write() ocfs2_mark_extent_written() ocfs2_change_extent_flag() ocfs2_split_extent() ocfs2_split_and_insert() ocfs2_grow_tree() That's my understanding of what's causing the bug. Our OCFS2 cluster was crashing every two to three hours after we upgraded to a 4.x kernel. We've gone about 18 hours with this patch applied and no crashes. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 833 bytes Desc: This is a digitally signed message part Url : http://oss.oracle.com/pipermail/ocfs2-devel/attachments/20171120/b68bb5fa/attachment.bin
Changwei Ge
2017-Nov-21 05:58 UTC
[Ocfs2-devel] [PATCH] Bug#841144: kernel BUG at /build/linux-Wgpe2M/linux-4.8.11/fs/ocfs2/alloc.c:1514!
On 2017/11/21 10:45, John Lightsey wrote:> On Tue, 2017-11-21 at 00:58 +0000, Changwei Ge wrote: >>> @@ -873,6 +875,7 @@ static int ocfs2_alloc_write_ctxt(struct >>> ocfs2_write_ctxt **wcp, >>> >>> ? ocfs2_init_dealloc_ctxt(&wc->w_dealloc); >>> ? INIT_LIST_HEAD(&wc->w_unwritten_list); >>> + wc->w_unwritten_count = 0; >> >> I think you don't have to evaluate ::w_unwritten_count to zero since >> kzalloc already did that. > > Very true. I was following the example of how dwc was handling the > dw_zero_count. You'll have to forgive me a bit. I'm very unfamiliar > with the linux kernel codebase. > >> >>> >>> ? *wcp = wc; >>> >>> @@ -1373,6 +1376,7 @@ static int ocfs2_unwritten_check(struct inode >>> *inode, >>> ? desc->c_clear_unwritten = 0; >>> ? list_add_tail(&new->ue_ip_node, &oi->ip_unwritten_list); >>> ? list_add_tail(&new->ue_node, &wc->w_unwritten_list); >>> + wc->w_unwritten_count++; >> >> You increase ::w_unwritten_coun once a new _ue_ is attached to >> ::w_unwritten_list. So if no _ue_ ever is attached, >> ::w_unwritten_list >> is still empty. I think your change has the same effect with origin. >> >> Moreover I don't see the relation between the reported crash issue >> and?your patch change. Can you elaborate further? > > The important part is in the next segment in the patch. This block is > just using w_unwritten_count to track the size of w_unwritten_list. > >>> @@ -2246,7 +2250,7 @@ static int ocfs2_dio_get_block(struct inode >>> *inode, sector_t iblock, >>> ? ue->ue_phys = desc->c_phys; >>> >>> ? list_splice_tail_init(&wc->w_unwritten_list, &dwc->dw_zero_list); >>> - dwc->dw_zero_count++; >>> + dwc->dw_zero_count += wc->w_unwritten_count; >>> ? } >>> >>> > > dw_zero_count is tracking the number of elements in dw_zero_list. > > The old version assumed that after dw_zero_list and w_unwritten_list > were spliced together, that the new length was dw_zero_count + 1. This > assumption is not correct if w_unwritten_list contained more than one > element. > > The length of dw_zero_list is used by ocfs2_dio_end_io_write() to > determine whether or not meta_ac will be needed to complete the write: > > ????ret = ocfs2_lock_allocators(inode, &et, 0, dwc->dw_zero_count*2, > ????????????????????&data_ac, &meta_ac);Hi John, Thanks for reporting. I probably get your point. Can your tell me how did you format your volume? What's your _cluster size_ and _block size_? Your can obtain such information via debugfs.ocfs2 <your volume> -R 'stats' | grep 'Cluster Size' It's better for you provide a way to reproduce this issue so that we can perform some test. Thanks, Changwei> > This will return with success and a null meta_ac if there are at least > dw_zero_count * 2 extents available for the write. > > Since dw_zero_count was not being calculated correctly, this will > occasionally result in the write getting into ocfs2_grow_tree() with a > null meta_ac following this chain: > > ocfs2_dio_end_io_write() > ocfs2_mark_extent_written() > ocfs2_change_extent_flag() > ocfs2_split_extent() > ocfs2_split_and_insert() > ocfs2_grow_tree() > > That's my understanding of what's causing the bug. > > Our OCFS2 cluster was crashing every two to three hours after we > upgraded to a 4.x kernel. We've gone about 18 hours with this patch > applied and no crashes. >