Mark Fasheh
2015-Dec-18 23:28 UTC
[Ocfs2-devel] The root cause analysis about buffer read getting starvation
On Fri, Dec 18, 2015 at 05:09:39PM +0800, Eric Ren wrote:> Hi all, > > On Thu, Dec 17, 2015 at 08:08:42AM -0700, He Gang wrote: > > Hello Mark and all, > > In the past days, I and Eric were looking at a customer issue, the customer is complaining that buffer reading sometimes lasts too much time ( 1 - 10 seconds) in case reading/writing the same file from different nodes concurrently, some day ago I sent a mail to the list for some discussions, you can read some details via the link https://oss.oracle.com/pipermail/ocfs2-devel/2015-December/011389.html. > > But, this problem does not happen under SLES10 (sp1 - sp4), the customer upgraded his Linux OS to SLES11(sp3 or sp4), the problem happened, this is why the customer complains, he hope we can give a investigation, to see how to make OCFS2 buffer reading/writing behavior be consistent with SLES10. > > According to our code reviewing and some testings, we found that the root cause to let buffer read get starvation. > > The suspicious code in aops.c > > 274 static int ocfs2_readpage(struct file *file, struct page *page) > > 275 { > > 276 struct inode *inode = page->mapping->host; > > 277 struct ocfs2_inode_info *oi = OCFS2_I(inode); > > 278 loff_t start = (loff_t)page->index << PAGE_CACHE_SHIFT; > > 279 int ret, unlock = 1; > > 280 long delta; > > 281 struct timespec t_enter, t_mid1, t_mid2, t_exit; > > 282 > > 283 trace_ocfs2_readpage((unsigned long long)oi->ip_blkno, > > 284 (page ? page->index : 0)); > > 285 > > 286 ret = ocfs2_inode_lock_with_page(inode, NULL, 0, page); <<= here, using nonblock way to get lock will bring many times retry, spend too much time > > 287 if (ret != 0) { > > 288 if (ret == AOP_TRUNCATED_PAGE) > > 289 unlock = 0; > > 290 mlog_errno(ret); > > 291 goto out; > > 292 } > > 293 > > 294 if (down_read_trylock(&oi->ip_alloc_sem) == 0) { <<= here, the same problem with above > > 295 /* > > 296 * Unlock the page and cycle ip_alloc_sem so that we don't > > 297 * busyloop waiting for ip_alloc_sem to unlock > > 298 */ > > 299 ret = AOP_TRUNCATED_PAGE; > > 300 unlock_page(page); > > 301 unlock = 0; > > 302 down_read(&oi->ip_alloc_sem); > > 303 up_read(&oi->ip_alloc_sem); > > 304 goto out_inode_unlock; > > 305 } > > > > > > As you can see, using nonblock way to get lock will bring many time retry, spend too much time. > > We can't modify the code to using block way to get the lock, as this will bring a dead lock. > > Actually, we did some testing when trying to use block way to get the lock here, the deadlock problems were encountered. > > But, in SLES10 source code, there is not any using nonblock way to get lock in buffer reading/writing, this is why buffer reading/writing are very fair to get IO when reading/writing the same file from multiple nodes. > SLES10 with kernel version about 2.6.16.x, used blocking way, i.e. down_read(), wich has the > potential deaklock between page lock / ip_alloc_sem when one node get the cluster lock and > does writing and reading on same file on it. This deadlock was fixed by this commit:You are correct here - the change was introduced to solve a deadlock between page lock and ip_alloc_sem(). Basically, ->readpage is going to be called with the page lock held and we need to be aware of that.> --- > commit e9dfc0b2bc42761410e8db6c252c6c5889e178b8 > Author: Mark Fasheh <mark.fasheh at oracle.com> > Date: Mon May 14 11:38:51 2007 -0700 > > ocfs2: trylock in ocfs2_readpage() > > Similarly to the page lock / cluster lock inversion in ocfs2_readpage, we > can deadlock on ip_alloc_sem. We can down_read_trylock() instead and just > return AOP_TRUNCATED_PAGE if the operation fails. > > Signed-off-by: Mark Fasheh <mark.fasheh at oracle.com> > > diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c > index 8e7cafb..3030670 100644 > --- a/fs/ocfs2/aops.c > +++ b/fs/ocfs2/aops.c > @@ -222,7 +222,10 @@ static int ocfs2_readpage(struct file *file, struct page *page) > goto out; > } > > - down_read(&OCFS2_I(inode)->ip_alloc_sem); > + if (down_read_trylock(&OCFS2_I(inode)->ip_alloc_sem) == 0) { > + ret = AOP_TRUNCATED_PAGE; > + goto out_meta_unlock; > + } > > /* > * i_size might have just been updated as we grabed the meta lock. We > @@ -258,6 +261,7 @@ static int ocfs2_readpage(struct file *file, struct page *page) > ocfs2_data_unlock(inode, 0); > out_alloc: > up_read(&OCFS2_I(inode)->ip_alloc_sem); > +out_meta_unlock: > ocfs2_meta_unlock(inode, 0); > out: > if (unlock) > --- > > But somehow with this patch, performance in the scenario become very bad. I don't how this could happen? because the reading node just has only one > thread reading the shared file, then down_read_trylock() should always get ip_alloc_sem successfully, right? if not, who else may race ip_alloc_sem?Hmm, there's only one thread and it can't get the lock? Any chance you might put some debug prints around where we acquire ip_alloc_sem? It would be interesting to see where it get taken to prevent this from happening. --Mark -- Mark Fasheh
Gang He
2015-Dec-21 02:53 UTC
[Ocfs2-devel] The root cause analysis about buffer read getting starvation
Hello Mark and all,>>> > On Fri, Dec 18, 2015 at 05:09:39PM +0800, Eric Ren wrote: >> Hi all, >> >> On Thu, Dec 17, 2015 at 08:08:42AM -0700, He Gang wrote: >> > Hello Mark and all, >> > In the past days, I and Eric were looking at a customer issue, the customer > is complaining that buffer reading sometimes lasts too much time ( 1 - 10 > seconds) in case reading/writing the same file from different nodes > concurrently, some day ago I sent a mail to the list for some discussions, > you can read some details via the link > https://oss.oracle.com/pipermail/ocfs2-devel/2015-December/011389.html. >> > But, this problem does not happen under SLES10 (sp1 - sp4), the customer > upgraded his Linux OS to SLES11(sp3 or sp4), the problem happened, this is > why the customer complains, he hope we can give a investigation, to see how > to make OCFS2 buffer reading/writing behavior be consistent with SLES10. >> > According to our code reviewing and some testings, we found that the root > cause to let buffer read get starvation. >> > The suspicious code in aops.c >> > 274 static int ocfs2_readpage(struct file *file, struct page *page) >> > 275 { >> > 276 struct inode *inode = page->mapping->host; >> > 277 struct ocfs2_inode_info *oi = OCFS2_I(inode); >> > 278 loff_t start = (loff_t)page->index << PAGE_CACHE_SHIFT; >> > 279 int ret, unlock = 1; >> > 280 long delta; >> > 281 struct timespec t_enter, t_mid1, t_mid2, t_exit; >> > 282 >> > 283 trace_ocfs2_readpage((unsigned long long)oi->ip_blkno, >> > 284 (page ? page->index : 0)); >> > 285 >> > 286 ret = ocfs2_inode_lock_with_page(inode, NULL, 0, page); <<= > here, using nonblock way to get lock will bring many times retry, spend too > much time >> > 287 if (ret != 0) { >> > 288 if (ret == AOP_TRUNCATED_PAGE) >> > 289 unlock = 0; >> > 290 mlog_errno(ret); >> > 291 goto out; >> > 292 } >> > 293 >> > 294 if (down_read_trylock(&oi->ip_alloc_sem) == 0) { <<= here, the > same problem with above >> > 295 /* >> > 296 * Unlock the page and cycle ip_alloc_sem so that we > don't >> > 297 * busyloop waiting for ip_alloc_sem to unlock >> > 298 */ >> > 299 ret = AOP_TRUNCATED_PAGE; >> > 300 unlock_page(page); >> > 301 unlock = 0; >> > 302 down_read(&oi->ip_alloc_sem); >> > 303 up_read(&oi->ip_alloc_sem); >> > 304 goto out_inode_unlock; >> > 305 } >> > >> > >> > As you can see, using nonblock way to get lock will bring many time retry, > spend too much time. >> > We can't modify the code to using block way to get the lock, as this will > bring a dead lock. >> > Actually, we did some testing when trying to use block way to get the lock > here, the deadlock problems were encountered. >> > But, in SLES10 source code, there is not any using nonblock way to get lock > in buffer reading/writing, this is why buffer reading/writing are very fair > to get IO when reading/writing the same file from multiple nodes. >> SLES10 with kernel version about 2.6.16.x, used blocking way, i.e. > down_read(), wich has the >> potential deaklock between page lock / ip_alloc_sem when one node get the > cluster lock and >> does writing and reading on same file on it. This deadlock was fixed by this > commit: > > You are correct here - the change was introduced to solve a deadlock between > page lock and ip_alloc_sem(). Basically, ->readpage is going to be called > with the page lock held and we need to be aware of that.Hello guys, my main question is, why we changed ip_alloc_sem lock/unlock position from SLES10 to SLES11? In SLES10, we get ip_alloc_sem lock before calling generic_file_read() or generic_file_write_nolock in file.c, but in SLES11, we get ip_alloc_sem lock in ocfs2_readpage in aops.c, and more, getting/putting the page lock and ip_alloc_sem lock orders are NOT consistent in read/write path. I just want to know the background behind this code evolution. If we keep getting the ip_alloc_sem lock before calling generic_file_aio_read in SLES11, the deadlock can be avoided? then, we need not to use nonblocking way to get the lock in read_page(), buffer read will not getting starvation in such case, the read/write IO behavior will be the same with SLES10. Thanks Gang> > >> --- >> commit e9dfc0b2bc42761410e8db6c252c6c5889e178b8 >> Author: Mark Fasheh <mark.fasheh at oracle.com> >> Date: Mon May 14 11:38:51 2007 -0700 >> >> ocfs2: trylock in ocfs2_readpage() >> >> Similarly to the page lock / cluster lock inversion in ocfs2_readpage, > we >> can deadlock on ip_alloc_sem. We can down_read_trylock() instead and > just >> return AOP_TRUNCATED_PAGE if the operation fails. >> >> Signed-off-by: Mark Fasheh <mark.fasheh at oracle.com> >> >> diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c >> index 8e7cafb..3030670 100644 >> --- a/fs/ocfs2/aops.c >> +++ b/fs/ocfs2/aops.c >> @@ -222,7 +222,10 @@ static int ocfs2_readpage(struct file *file, struct page > *page) >> goto out; >> } >> >> - down_read(&OCFS2_I(inode)->ip_alloc_sem); >> + if (down_read_trylock(&OCFS2_I(inode)->ip_alloc_sem) == 0) { >> + ret = AOP_TRUNCATED_PAGE; >> + goto out_meta_unlock; >> + } >> >> /* >> * i_size might have just been updated as we grabed the meta lock. > We >> @@ -258,6 +261,7 @@ static int ocfs2_readpage(struct file *file, struct page > *page) >> ocfs2_data_unlock(inode, 0); >> out_alloc: >> up_read(&OCFS2_I(inode)->ip_alloc_sem); >> +out_meta_unlock: >> ocfs2_meta_unlock(inode, 0); >> out: >> if (unlock) >> --- >> >> But somehow with this patch, performance in the scenario become very bad. I > don't how this could happen? because the reading node just has only one >> thread reading the shared file, then down_read_trylock() should always get > ip_alloc_sem successfully, right? if not, who else may race ip_alloc_sem? > > Hmm, there's only one thread and it can't get the lock? Any chance you might > put some debug prints around where we acquire ip_alloc_sem? It would be > interesting to see where it get taken to prevent this from happening. > --Mark > > -- > Mark Fasheh
Eric Ren
2015-Dec-21 11:23 UTC
[Ocfs2-devel] The root cause analysis about buffer read getting starvation
Hello Mark, ...snip..> > SLES10 with kernel version about 2.6.16.x, used blocking way, i.e. down_read(), wich has the > > potential deaklock between page lock / ip_alloc_sem when one node get the cluster lock and > > does writing and reading on same file on it. This deadlock was fixed by this commit: > > You are correct here - the change was introduced to solve a deadlock between > page lock and ip_alloc_sem(). Basically, ->readpage is going to be called > with the page lock held and we need to be aware of that....snip..> > But somehow with this patch, performance in the scenario become very bad. I don't how this could happen? because the reading node just has only one > > thread reading the shared file, then down_read_trylock() should always get ip_alloc_sem successfully, right? if not, who else may race ip_alloc_sem? > > Hmm, there's only one thread and it can't get the lock? Any chance you mightNo, it can always get the lock in this case. Sorry, I made a false testing result. There're probably mainly two factors: 1. none-isolated testing environment - include nodes, network and shared disk; 2. testing program from customer - sleep for 1s after finishing ~1M read/write each time, thus the overlap time of read/write on two nodes is random; so the shoter overlap time is, the better performance looks. Sorry again for bothering your time. --Eric> put some debug prints around where we acquire ip_alloc_sem? It would be > interesting to see where it get taken to prevent this from happening. > --Mark > > -- > Mark Fasheh > > _______________________________________________ > Ocfs2-devel mailing list > Ocfs2-devel at oss.oracle.com > https://oss.oracle.com/mailman/listinfo/ocfs2-devel >