Hi Eric,
On 2015/11/12 17:48, Eric Ren wrote:
> Hi Joseph,
>
> On 11/12/15 16:00, Joseph Qi wrote:
>> On 2015/11/12 15:23, Eric Ren wrote:
>>> Hi Joseph,
>>>
>>> Thanks for your reply! There're more details I'd like to ask about ;-)
>>>
>>> On 11/12/15 11:05, Joseph Qi wrote:
>>>> Hi Eric,
>>>> You reported an issue that sometimes the io response time may be long.
>>>>
>>>> From your test case information, I think it was caused by downconvert.
>>> From what I learned from fs/dlm, the lock manager grants all
>>> down-conversion requests in place, i.e. on the grant queue. Here are some
>>> silly questions:
>>> 1. Who may request a down-conversion?
>>> 2. When does a down-conversion happen?
>>> 3. How could a down-conversion take so long?
>> IMO, it happens mostly in two cases.
>> 1. The owner knows another node is waiting on the lock; in other words, one
>> node has blocked another's request. It may be triggered in ast, bast, or
>> unlock.
>> 2. ocfs2cmt does periodic commits.
>>
>> One case that can lead to a long downconvert is when it really has too much
>> work to do. I am not sure if there are any other cases, or a code bug.
> OK, I'm not familiar with ocfs2cmt. Could I bother you to explain what
> ocfs2cmt is used for, its relation with R/W, and why a down-conversion can
> be triggered when it commits?
Sorry, the above explanation is not right and may mislead you.
jbd2/xxx (previously called kjournald2?) does the periodic commit; the
default interval is 5s and can be set with the mount option "commit=".
ocfs2cmt does the checkpoint, and it can be woken up when:
a) unblocking a lock during downconvert; if jbd2/xxx has already done the
commit, ocfs2cmt won't actually be woken up because the data has already
been checkpointed, so ocfs2cmt works together with jbd2/xxx;
b) evicting an inode and then doing the downconvert.
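Roughly, the wakeup logic looks like the sketch below. This is not the real
ocfs2 code, just an illustration; my_osb, my_lockres and
journal_fully_checkpointed() are made-up names, and the needs_checkpoint /
checkpoint_event fields are assumed to be an atomic_t plus a wait queue:

/*
 * Illustration only, not actual ocfs2 code.  The downconvert path only
 * needs ocfs2cmt when the journaled data covered by this lock has not
 * been checkpointed yet; if jbd2/xxx already took care of it, there is
 * nothing left for ocfs2cmt to flush.
 */
static void maybe_kick_checkpoint(struct my_osb *osb,
                                  struct my_lockres *lockres)
{
        /* Already checkpointed thanks to the periodic jbd2/xxx commit? */
        if (journal_fully_checkpointed(osb, lockres))
                return;

        /* Otherwise wake ocfs2cmt and let it force the checkpoint. */
        atomic_set(&osb->needs_checkpoint, 1);
        wake_up(&osb->checkpoint_event);
}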
>>> Could you describe this case in more detail?
>>>> And it seemed reasonable because it had to.
>>>>
>>>> Node 1 wrote the file, and node 2 read it. Since you used buffered io,
>>>> after node 1 had finished writing, the data might still be in the page
>>>> cache.
>>> Sorry, I cannot understand the relationship between "still in page cache"
>>> and "so...downconvert".
>>>> So node 1 should downconvert first, then node 2's read could continue.
>>>> That was why you said it seemed ocfs2_inode_lock_with_page spent most
>>> Actually, what surprises me more is that such a long time was spent, not
>>> that it was *most* of the time compared to the "readpage" stuff ;-)
>>>> time. More specifically, it was ocfs2_inode_lock after trying the
>>>> nonblock lock and returning -EAGAIN.
>>> You mean the read process would repeatedly try the nonblock lock until
>>> the write process's down-conversion completes?
>> No, after the nonblock lock returns -EAGAIN, it will unlock the page and
>> then call ocfs2_inode_lock and ocfs2_inode_unlock. And ocfs2_inode_lock will
> Yes.
>> wait for downconvert completion on another node.
> Another node, meaning the node the read or write process is on?
Yes, the node that blocks my request.
For example, if node 1 holds EX and node 2 wants to get PR, node 2 has to
wait for node 1 to downconvert first.
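If it helps, the rough shape of what happens on node 1 is sketched below.
This is only an illustration with made-up names (my_lockres,
queue_for_downconvert(), wake_up_downconvert_thread()), not the real
dlmglue code:

/*
 * Illustration only.  Node 2's PR request triggers a blocking AST on
 * node 1, which still holds EX.  Node 1 cannot drop to a compatible
 * level until its dirty data and journal for that inode are flushed,
 * and that flush is where a long downconvert time usually goes.
 */
static void my_blocking_ast(struct my_lockres *lockres, int wanted_level)
{
        /* Remember the most restrictive level another node is waiting for. */
        if (wanted_level > lockres->blocking)
                lockres->blocking = wanted_level;

        /*
         * Hand the real work (flush pages, checkpoint, then convert
         * EX -> PR or NL) to the downconvert thread so this callback
         * returns quickly.
         */
        queue_for_downconvert(lockres);
        wake_up_downconvert_thread(lockres->osb);
}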
Thanks,
Joseph
>> This is for a lock inversion case. You can refer to the comments of
>> ocfs2_inode_lock_with_page.
> Yeah, actually I've read these comments again and again, but still fail to
> get the idea.
> Could you please explain how this works? I'm really really interested ;-)
> Forgive me for pasting the code below, to make it convenient to refer to.
>
> /*
> * This is working around a lock inversion between tasks acquiring DLM
> * locks while holding a page lock and the downconvert thread which
> * blocks dlm lock acquiry while acquiring page locks.
> *
> * ** These _with_page variants are only intended to be called from aop
> * methods that hold page locks and return a very specific *positive* error
> * code that aop methods pass up to the VFS -- test for errors with != 0. **
> *
> * The DLM is called such that it returns -EAGAIN if it would have
> * blocked waiting for the downconvert thread. In that case we unlock
> * our page so the downconvert thread can make progress. Once we've
> * done this we have to return AOP_TRUNCATED_PAGE so the aop method
> * that called us can bubble that back up into the VFS who will then
> * immediately retry the aop call.
> *
> * We do a blocking lock and immediate unlock before returning, though, so
> * that the lock has a great chance of being cached on this node by the time
> * the VFS calls back to retry the aop. This has a potential to livelock as
> * nodes ping locks back and forth, but that's a risk we're willing to take
> * to avoid the lock inversion simply.
> */
> int ocfs2_inode_lock_with_page(struct inode *inode,
>                                struct buffer_head **ret_bh,
>                                int ex,
>                                struct page *page)
> {
>         int ret;
>
>         ret = ocfs2_inode_lock_full(inode, ret_bh, ex, OCFS2_LOCK_NONBLOCK);
>         if (ret == -EAGAIN) {
>                 unlock_page(page);
>                 if (ocfs2_inode_lock(inode, ret_bh, ex) == 0)
>                         ocfs2_inode_unlock(inode, ex);
>                 ret = AOP_TRUNCATED_PAGE;
>         }
>
>         return ret;
> }
>
> Thanks,
> Eric
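To make the retry described in that comment more concrete, here is a
simplified sketch of how an aop method can use it. It is loosely modeled on
what ocfs2_readpage does, but trimmed down, so treat it as an illustration
rather than the exact code:

static int my_readpage(struct file *file, struct page *page)
{
        struct inode *inode = page->mapping->host;
        int ret, unlock = 1;

        /* Nonblocking cluster lock attempt; PR level since this is a read. */
        ret = ocfs2_inode_lock_with_page(inode, NULL, 0, page);
        if (ret != 0) {
                /*
                 * AOP_TRUNCATED_PAGE: our page was already unlocked so
                 * the downconvert thread can make progress; the VFS
                 * will retry the whole ->readpage call.
                 */
                if (ret == AOP_TRUNCATED_PAGE)
                        unlock = 0;
                goto out;
        }

        /* ... read the data and mark the page up to date here ... */

        ocfs2_inode_unlock(inode, 0);
out:
        if (unlock)
                unlock_page(page);
        return ret;
}

So the nonblock attempt either succeeds, or we drop the page lock, take and
drop the cluster lock in blocking mode (so it is likely cached here for the
retry), and let the VFS call the aop again.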
>>>> And this also explained why direct io didn't have the issue, but took
>>>> more time.
>>>>
>>>> I am not sure if your test case is the same as what the customer has
>>>> reported. I think you should recheck the operations in each node.
>>> Yes, we've verified several times both on sles10 and sles11. On sles10,
>>> each IO time is smooth, with no long IO peaks.
>>>> And we have reported a case before about a DLM handling issue. I am not
>>>> sure if it is related.
>>>>
>>>> https://oss.oracle.com/pipermail/ocfs2-devel/2015-August/011045.html
>>> Thanks, I've read this post. I cannot see any relation yet. Actually,
>>> fs/dlm is also implemented that way; it's the so-called "conversion
>>> deadlock" mentioned in section 2.3.7.3 of the "Programming Locking
>>> Applications" book.
>>>
>>> There are only two processes from two nodes. Process A is blocked on the
>>> wait queue because of process B on the convert queue, which leaves the
>>> grant queue empty. Is this possible?
>> So we have to investigate why the convert request cannot be satisfied.
>> If dlm still works fine, it is impossible. Otherwise it is a bug.
>>
>>> You know I'm new here, so maybe some of my questions are improper; please
>>> point them out if so ;-)
>>>
>>> Thanks,
>>> Eric