thr3ads.net - Ocfs2 devel - [Ocfs2-devel] The root cause analysis about buffer read getting starvation [Dec 2015]

If this information is useful, please help other people find it:
Share via:

Mark Fasheh

2015-Dec-18 23:28 UTC

[Ocfs2-devel] The root cause analysis about buffer read getting starvation

On Fri, Dec 18, 2015 at 05:09:39PM +0800, Eric Ren
wrote:> Hi all,
> 
> On Thu, Dec 17, 2015 at 08:08:42AM -0700, He Gang wrote: 
> > Hello Mark and all,
> > In the past days, I and Eric were looking at a customer issue, the
customer is complaining that buffer reading sometimes lasts too much time ( 1 -
10 seconds) in case reading/writing the same file from different nodes
concurrently, some day ago I sent a mail to the list for some discussions, you
can read some details via the link
https://oss.oracle.com/pipermail/ocfs2-devel/2015-December/011389.html.
> > But, this problem does not happen under SLES10 (sp1 - sp4), the
customer upgraded his Linux OS to SLES11(sp3 or sp4), the problem happened, this
is why the customer complains, he hope we can give a investigation, to see how
to make OCFS2 buffer reading/writing behavior be consistent with SLES10.
> > According to our code reviewing and some testings, we found that the
root cause to let buffer read get starvation.
> > The suspicious code in aops.c 
> >  274 static int ocfs2_readpage(struct file *file, struct page *page)
> >  275 {
> >  276         struct inode *inode = page->mapping->host;
> >  277         struct ocfs2_inode_info *oi = OCFS2_I(inode);
> >  278         loff_t start = (loff_t)page->index <<
PAGE_CACHE_SHIFT;
> >  279         int ret, unlock = 1;
> >  280         long delta;
> >  281         struct timespec t_enter, t_mid1, t_mid2, t_exit;
> >  282
> >  283         trace_ocfs2_readpage((unsigned long long)oi->ip_blkno,
> >  284                              (page ? page->index : 0));
> >  285
> >  286         ret = ocfs2_inode_lock_with_page(inode, NULL, 0, page); 
<<= here, using nonblock way to get lock will bring many times retry,
spend too much time
> >  287         if (ret != 0) {
> >  288                 if (ret == AOP_TRUNCATED_PAGE)
> >  289                         unlock = 0;
> >  290                 mlog_errno(ret);
> >  291                 goto out;
> >  292         }
> >  293
> >  294         if (down_read_trylock(&oi->ip_alloc_sem) == 0) { 
<<= here, the same problem with above
> >  295                 /*
> >  296                  * Unlock the page and cycle ip_alloc_sem so that
we don't
> >  297                  * busyloop waiting for ip_alloc_sem to unlock
> >  298                  */
> >  299                 ret = AOP_TRUNCATED_PAGE;
> >  300                 unlock_page(page);
> >  301                 unlock = 0;
> >  302                 down_read(&oi->ip_alloc_sem);
> >  303                 up_read(&oi->ip_alloc_sem);
> >  304                 goto out_inode_unlock;
> >  305         }
> > 
> > 
> > As you can see, using nonblock way to get lock will bring many time
retry, spend too much time.
> > We can't modify the code to using block way to get the lock, as
this will bring a dead lock.
> > Actually, we did some testing when trying to use block way to get the
lock here, the deadlock problems were encountered.
> > But, in SLES10 source code, there is not any using nonblock way to get
lock in buffer reading/writing, this is why buffer reading/writing are very fair
to get IO when reading/writing the same file from multiple nodes.
> SLES10 with kernel version about 2.6.16.x, used blocking way, i.e.
down_read(), wich has the
> potential deaklock between page lock / ip_alloc_sem when one node get the
cluster lock and
> does writing and reading on same file on it. This deadlock was fixed by
this commit:
You are correct here - the change was introduced to solve a deadlock between
page lock and ip_alloc_sem(). Basically, ->readpage is going to be called
with the page lock held and we need to be aware of that.

> ---
> commit e9dfc0b2bc42761410e8db6c252c6c5889e178b8
> Author: Mark Fasheh <mark.fasheh at oracle.com>
> Date:   Mon May 14 11:38:51 2007 -0700
> 
>     ocfs2: trylock in ocfs2_readpage()
> 
>     Similarly to the page lock / cluster lock inversion in ocfs2_readpage,
we
>     can deadlock on ip_alloc_sem. We can down_read_trylock() instead and
just
>     return AOP_TRUNCATED_PAGE if the operation fails.
> 
>     Signed-off-by: Mark Fasheh <mark.fasheh at oracle.com>
> 
> diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c
> index 8e7cafb..3030670 100644
> --- a/fs/ocfs2/aops.c
> +++ b/fs/ocfs2/aops.c
> @@ -222,7 +222,10 @@ static int ocfs2_readpage(struct file *file, struct
page *page)
>                 goto out;
>         }
> 
> -       down_read(&OCFS2_I(inode)->ip_alloc_sem);
> +       if (down_read_trylock(&OCFS2_I(inode)->ip_alloc_sem) == 0) {
> +               ret = AOP_TRUNCATED_PAGE;
> +               goto out_meta_unlock;
> +       }
> 
>         /*
>          * i_size might have just been updated as we grabed the meta lock. 
We
> @@ -258,6 +261,7 @@ static int ocfs2_readpage(struct file *file, struct
page *page)
>         ocfs2_data_unlock(inode, 0);
>  out_alloc:
>         up_read(&OCFS2_I(inode)->ip_alloc_sem);
> +out_meta_unlock:
>         ocfs2_meta_unlock(inode, 0);
>  out:
>         if (unlock)
> ---
> 
> But somehow with this patch, performance in the scenario become very bad. I
don't how this could happen? because the reading node just has only one
> thread reading the shared file, then down_read_trylock() should always get
ip_alloc_sem successfully, right? if not, who else may race ip_alloc_sem?
Hmm, there's only one thread and it can't get the lock? Any chance you
might
put some debug prints around where we acquire ip_alloc_sem? It would be
interesting to see where it get taken to prevent this from happening.
	--Mark

--
Mark Fasheh

Gang He

2015-Dec-21 02:53 UTC

head link

[Ocfs2-devel] The root cause analysis about buffer read getting starvation

Hello Mark and all,

>>> 
> On Fri, Dec 18, 2015 at 05:09:39PM +0800, Eric Ren wrote:
>> Hi all,
>> 
>> On Thu, Dec 17, 2015 at 08:08:42AM -0700, He Gang wrote: 
>> > Hello Mark and all,
>> > In the past days, I and Eric were looking at a customer issue, the
customer
> is complaining that buffer reading sometimes lasts too much time ( 1 - 10 
> seconds) in case reading/writing the same file from different nodes 
> concurrently, some day ago I sent a mail to the list for some discussions, 
> you can read some details via the link 
> https://oss.oracle.com/pipermail/ocfs2-devel/2015-December/011389.html.
>> > But, this problem does not happen under SLES10 (sp1 - sp4), the
customer
> upgraded his Linux OS to SLES11(sp3 or sp4), the problem happened, this is 
> why the customer complains, he hope we can give a investigation, to see how
> to make OCFS2 buffer reading/writing behavior be consistent with SLES10.
>> > According to our code reviewing and some testings, we found that
the root
> cause to let buffer read get starvation.
>> > The suspicious code in aops.c 
>> >  274 static int ocfs2_readpage(struct file *file, struct page
*page)
>> >  275 {
>> >  276         struct inode *inode = page->mapping->host;
>> >  277         struct ocfs2_inode_info *oi = OCFS2_I(inode);
>> >  278         loff_t start = (loff_t)page->index <<
PAGE_CACHE_SHIFT;
>> >  279         int ret, unlock = 1;
>> >  280         long delta;
>> >  281         struct timespec t_enter, t_mid1, t_mid2, t_exit;
>> >  282
>> >  283         trace_ocfs2_readpage((unsigned long
long)oi->ip_blkno,
>> >  284                              (page ? page->index : 0));
>> >  285
>> >  286         ret = ocfs2_inode_lock_with_page(inode, NULL, 0,
page);  <<=
> here, using nonblock way to get lock will bring many times retry, spend too
> much time
>> >  287         if (ret != 0) {
>> >  288                 if (ret == AOP_TRUNCATED_PAGE)
>> >  289                         unlock = 0;
>> >  290                 mlog_errno(ret);
>> >  291                 goto out;
>> >  292         }
>> >  293
>> >  294         if (down_read_trylock(&oi->ip_alloc_sem) == 0)
{  <<= here, the
> same problem with above
>> >  295                 /*
>> >  296                  * Unlock the page and cycle ip_alloc_sem so
that we
> don't
>> >  297                  * busyloop waiting for ip_alloc_sem to
unlock
>> >  298                  */
>> >  299                 ret = AOP_TRUNCATED_PAGE;
>> >  300                 unlock_page(page);
>> >  301                 unlock = 0;
>> >  302                 down_read(&oi->ip_alloc_sem);
>> >  303                 up_read(&oi->ip_alloc_sem);
>> >  304                 goto out_inode_unlock;
>> >  305         }
>> > 
>> > 
>> > As you can see, using nonblock way to get lock will bring many
time retry,
> spend too much time.
>> > We can't modify the code to using block way to get the lock,
as this will
> bring a dead lock.
>> > Actually, we did some testing when trying to use block way to get
the lock
> here, the deadlock problems were encountered.
>> > But, in SLES10 source code, there is not any using nonblock way to
get lock
> in buffer reading/writing, this is why buffer reading/writing are very fair
> to get IO when reading/writing the same file from multiple nodes. 
>> SLES10 with kernel version about 2.6.16.x, used blocking way, i.e. 
> down_read(), wich has the
>> potential deaklock between page lock / ip_alloc_sem when one node get
the
> cluster lock and
>> does writing and reading on same file on it. This deadlock was fixed by
this
> commit:
> 
> You are correct here - the change was introduced to solve a deadlock
between
> page lock and ip_alloc_sem(). Basically, ->readpage is going to be
called
> with the page lock held and we need to be aware of that.Hello guys, my main question is, why we changed ip_alloc_sem lock/unlock
position from SLES10 to SLES11?
In SLES10, we get ip_alloc_sem lock before calling generic_file_read() or
generic_file_write_nolock in file.c,
but in SLES11, we get ip_alloc_sem lock in ocfs2_readpage in aops.c, and more,
getting/putting the page lock and ip_alloc_sem lock orders are NOT consistent in
read/write path.
I just want to know the background behind this code evolution. If we keep
getting the ip_alloc_sem lock before calling generic_file_aio_read in SLES11,
the deadlock can be avoided?
then, we need not to use nonblocking way to get the lock in read_page(), buffer
read will not getting starvation in such case, the read/write IO behavior will
be the same with SLES10.

Thanks
Gang
> 
> 
>> ---
>> commit e9dfc0b2bc42761410e8db6c252c6c5889e178b8
>> Author: Mark Fasheh <mark.fasheh at oracle.com>
>> Date:   Mon May 14 11:38:51 2007 -0700
>> 
>>     ocfs2: trylock in ocfs2_readpage()
>> 
>>     Similarly to the page lock / cluster lock inversion in
ocfs2_readpage,
> we
>>     can deadlock on ip_alloc_sem. We can down_read_trylock() instead
and
> just
>>     return AOP_TRUNCATED_PAGE if the operation fails.
>> 
>>     Signed-off-by: Mark Fasheh <mark.fasheh at oracle.com>
>> 
>> diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c
>> index 8e7cafb..3030670 100644
>> --- a/fs/ocfs2/aops.c
>> +++ b/fs/ocfs2/aops.c
>> @@ -222,7 +222,10 @@ static int ocfs2_readpage(struct file *file,
struct page
> *page)
>>                 goto out;
>>         }
>> 
>> -       down_read(&OCFS2_I(inode)->ip_alloc_sem);
>> +       if (down_read_trylock(&OCFS2_I(inode)->ip_alloc_sem) ==
0) {
>> +               ret = AOP_TRUNCATED_PAGE;
>> +               goto out_meta_unlock;
>> +       }
>> 
>>         /*
>>          * i_size might have just been updated as we grabed the meta
lock.
> We
>> @@ -258,6 +261,7 @@ static int ocfs2_readpage(struct file *file, struct
page
> *page)
>>         ocfs2_data_unlock(inode, 0);
>>  out_alloc:
>>         up_read(&OCFS2_I(inode)->ip_alloc_sem);
>> +out_meta_unlock:
>>         ocfs2_meta_unlock(inode, 0);
>>  out:
>>         if (unlock)
>> ---
>> 
>> But somehow with this patch, performance in the scenario become very
bad. I
> don't how this could happen? because the reading node just has only one
>> thread reading the shared file, then down_read_trylock() should always
get
> ip_alloc_sem successfully, right? if not, who else may race ip_alloc_sem?
> 
> Hmm, there's only one thread and it can't get the lock? Any chance
you might
> put some debug prints around where we acquire ip_alloc_sem? It would be
> interesting to see where it get taken to prevent this from happening.
> 	--Mark
> 
> --
> Mark Fasheh

Eric Ren

2015-Dec-21 11:23 UTC

head link

[Ocfs2-devel] The root cause analysis about buffer read getting starvation

Hello Mark,

...snip..> > SLES10 with kernel version about 2.6.16.x, used blocking way, i.e.
down_read(), wich has the
> > potential deaklock between page lock / ip_alloc_sem when one node get
the cluster lock and
> > does writing and reading on same file on it. This deadlock was fixed
by this commit:
> 
> You are correct here - the change was introduced to solve a deadlock
between
> page lock and ip_alloc_sem(). Basically, ->readpage is going to be
called
> with the page lock held and we need to be aware of that.
...snip..> > But somehow with this patch, performance in the scenario become very
bad. I don't how this could happen? because the reading node just has only
one
> > thread reading the shared file, then down_read_trylock() should always
get ip_alloc_sem successfully, right? if not, who else may race ip_alloc_sem?
> 
> Hmm, there's only one thread and it can't get the lock? Any chance
you mightNo, it can always get the lock in this case. Sorry, I made a false testing
result. There're probably mainly two factors:

1. none-isolated testing environment - include nodes, network and shared disk;
2. testing program from customer - sleep for 1s after finishing ~1M read/write
each time,
   thus the overlap time of read/write on two nodes is random; so the shoter
overlap time is,
   the better performance looks.
   
Sorry again for bothering your time.
--Eric> put some debug prints around where we acquire ip_alloc_sem? It would be
> interesting to see where it get taken to prevent this from happening.
> 	--Mark
> 
> --
> Mark Fasheh
> 
> _______________________________________________
> Ocfs2-devel mailing list
> Ocfs2-devel at oss.oracle.com
> https://oss.oracle.com/mailman/listinfo/ocfs2-devel
>

Ocfs2 devel - Dec 2015 - The root cause analysis about buffer read getting starvation

[Ocfs2-devel] The root cause analysis about buffer read getting starvation

[Ocfs2-devel] The root cause analysis about buffer read getting starvation

[Ocfs2-devel] The root cause analysis about buffer read getting starvation