thr3ads.net - Lustre discuss - [Lustre-discuss] osc_brw_redo

If this information is useful, please help other people find it:
Share via:

James Robnett

2011-Feb-09 21:35 UTC

[Lustre-discuss] osc_brw_redo_request error on clients

I have a fairly simple lustre environment that consists of a single MDS and
2 OSS''s each with 4 OST''s.  The servers and clients are all
running Lustre
1.8.5 under RHEL 5.5,  RPM''s downloaded from lustre.

Normally I''ve had no problems but recently I have multiple clients
reporting the following error:

LustreError: 3935:0:(osc_request.c:1629:osc_brw_redo_request()) @@@ redo
for recoverable error  req at ffff8101ae084000 x1358858531428366/t60136289752
o4->lustre-OST0004_UUID at 192.168.1.12@o2ib:6/4 lens 448/608 e 0 to 1 dl
1297285890 ref 2 fl Interpret:R/0/0 rc 0/0

which in turn appears to generate a premature EOF on our user software.

There are no corresponding errors on the servers.

I seem to only see this error on clients connected via QDR infiniband
though that may be a false lead.  In addition the problem seems more
prevalent under load.  Lastly it seems to be getting worse, almost as
if there''s some garbage collection issue on the clients.

I''ve done some searching and don''t see reports involving that
routine.  It
seems like a timeout of some sort.  Any hints as to what this error
indicates as a problem ?

james

Oleg Drokin

2011-Feb-09 22:11 UTC

head link

[Lustre-discuss] osc_brw_redo_request error on clients

Hello!

On Feb 9, 2011, at 4:35 PM, James Robnett wrote:> Normally I''ve had no problems but recently I have multiple clients
> reporting the following error:
> 
> LustreError: 3935:0:(osc_request.c:1629:osc_brw_redo_request()) @@@ redo
> for recoverable error  req at ffff8101ae084000
x1358858531428366/t60136289752
> o4->lustre-OST0004_UUID at 192.168.1.12@o2ib:6/4 lens 448/608 e 0 to 1
dl
> 1297285890 ref 2 fl Interpret:R/0/0 rc 0/0
One line before that there should be the actual RPC error specified that we need
to know what happened.
> which in turn appears to generate a premature EOF on our user software.
Actually this message does what it does - resends the request, so the userspace
should not notice
any problems. On the other hand if any other requests aside from brw requests
fail, they might not
get the resending benefit and cause userspace-visible errors.

Bye,
    Oleg

James Robnett

2011-Feb-09 22:20 UTC

head link

[Lustre-discuss] osc_brw_redo_request error on clients

>> LustreError: 3935:0:(osc_request.c:1629:osc_brw_redo_request()) @@@
redo for recoverable error  req at ffff8101ae084000>> x1358858531428366/t60136289752
>> o4->lustre-OST0004_UUID at 192.168.1.12@o2ib:6/4 lens 448/608 e 0 to
1 dl
1297285890 ref 2 fl Interpret:R/0/0 rc 0/0>
> One line before that there should be the actual RPC error specified thatwe need to know what happened.

   Nope, just that error repeated:
Feb  9 03:19:26 nm-post-2 kernel: LustreError:
3935:0:(osc_request.c:1629:osc_brw_redo_request()) @@@ redo for
recoverable error  req at ffff8101aaa3dc00 x1358858525376456/t60135184183
o4->lustre-OST0007_UUID at 192.168.1.12@o2ib:6/4 lens 464/608 e 0 to 1 dl
1297246810 ref 2 fl Interpret:R/0/0 rc 0/0
Feb  9 03:29:56 nm-post-2 kernel: LustreError:
3935:0:(osc_request.c:1629:osc_brw_redo_request()) @@@ redo for
recoverable error  req at ffff8101aaa3d000 x1358858525468762/t60135184397
o4->lustre-OST0007_UUID at 192.168.1.12@o2ib:6/4 lens 464/608 e 0 to 1 dl
1297247403 ref 2 fl Interpret:R/0/0 rc 0/0
Feb  9 03:40:22 nm-post-2 kernel: LustreError:
3935:0:(osc_request.c:1629:osc_brw_redo_request()) @@@ redo for
recoverable error  req at ffff8101aaa3c400 x1358858525557912/t60135184598
o4->lustre-OST0007_UUID at 192.168.1.12@o2ib:6/4 lens 464/608 e 0 to 1 dl
1297248029 ref 2 fl Interpret:R/0/0 rc 0/0
Feb  9 03:51:18 nm-post-2 kernel: LustreError:
3935:0:(osc_request.c:1629:osc_brw_redo_request()) @@@ redo for
recoverable error  req at ffff8101afef2400 x1358858525655268/t60135392181
o4->lustre-OST0002_UUID at 192.168.1.11@o2ib:6/4 lens 464/608 e 0 to 1 dl
1297248685 ref 2 fl Interpret:R/0/0 rc 0/0
Feb  9 04:01:40 nm-post-2 kernel: LustreError:
3935:0:(osc_request.c:1629:osc_brw_redo_request()) @@@ redo for
recoverable error  req at ffff8101aaa3dc00 x1358858525738536/t60135185019
o4->lustre-OST0007_UUID at 192.168.1.12@o2ib:6/4 lens 464/608 e 0 to 1 dl
1297249307 ref 2 fl Interpret:R/0/0 rc 0/0
Feb  9 04:12:04 nm-post-2 kernel: LustreError:
3935:0:(osc_request.c:1629:osc_brw_redo_request()) @@@ redo for
recoverable error  req at ffff8101aaa3c400 x1358858525822214/t60135185246
o4->lustre-OST0007_UUID at 192.168.1.12@o2ib:6/4 lens 464/608 e 0 to 1 dl
1297249931 ref 2 fl Interpret:R/0/0 rc 0/0
Feb  9 10:48:28 nm-post-2 kernel: LustreError:
3935:0:(osc_request.c:1629:osc_brw_redo_request()) @@@ redo for
recoverable error  req at ffff8101addda800 x1358858527540672/t60134973305
o4->lustre-OST0000_UUID at 192.168.1.11@o2ib:6/4 lens 448/608 e 0 to 1 dl
1297273752 ref 2 fl Interpret:R/0/0 rc 0/0
Feb  9 10:49:51 nm-post-2 kernel: LustreError:
3935:0:(osc_request.c:1629:osc_brw_redo_request()) @@@ redo for
recoverable error  req at ffff8101adf3c800 x1358858527567804/t60134976801
o4->lustre-OST0000_UUID at 192.168.1.11@o2ib:6/4 lens 448/608 e 0 to 1 dl
1297273835 ref 2 fl Interpret:R/0/0 rc 0/0
Feb  9 10:52:22 nm-post-2 kernel: LustreError:
3935:0:(osc_request.c:1629:osc_brw_redo_request()) @@@ redo for
recoverable error  req at ffff8101adddb000 x1358858527619100/t60134983332
o4->lustre-OST0000_UUID at 192.168.1.11@o2ib:6/4 lens 448/608 e 0 to 1 dl
1297273986 ref 2 fl Interpret:R/0/0 rc 0/0
Feb  9 10:57:23 nm-post-2 kernel: LustreError:
3935:0:(osc_request.c:1629:osc_brw_redo_request()) @@@ redo for
recoverable error  req at ffff8101addda000 x1358858527728677/t60134998617
o4->lustre-OST0000_UUID at 192.168.1.11@o2ib:6/4 lens 448/608 e 0 to 1 dl
1297274250 ref 2 fl Interpret:R/0/0 rc 0/0
Feb  9 11:07:23 nm-post-2 kernel: LustreError:
3935:0:(osc_request.c:1629:osc_brw_redo_request()) @@@ redo for
recoverable error  req at ffff8101adddbc00 x1358858527926588/t60135043030
o4->lustre-OST0000_UUID at 192.168.1.11@o2ib:6/4 lens 448/608 e 0 to 1 dl
1297274887 ref 2 fl Interpret:R/0/0 rc 0/0

   the above is from ''less /var/log/messages'', not some false
negative
by greping for osc_brw or lustre etc from the logs.

  In addition to the above I also see this sequence:

Feb  9 11:57:41 nm-post-2 kernel: LustreError:
3935:0:(osc_request.c:1629:osc_brw_redo_request()) @@@ redo for
recoverable error  req at ffff8101add34000 x1358858528880660/t64430183909
o4->lustre-OST0001_UUID at 192.168.1.11@o2ib:6/4 lens 448/608 e 0 to 1 dl
1297277905 ref 2 fl Interpret:R/0/0 rc 0/0
 to 1 dl 1297278471 ref 2 fl Interpret:R/0/0 rc 0/0
Feb  9 12:07:42 nm-post-2 kernel: LustreError:
3935:0:(osc_request.c:1629:osc_brw_redo_request()) Skipped 1000 previous si
milar messages
Feb  9 12:15:12 nm-post-2 kernel: LustreError:
400:0:(osc_request.c:1143:can_merge_pages()) is it ok to have flags 0xc20 a
nd 0x420 in the same brw?
Feb  9 12:15:12 nm-post-2 kernel: LustreError:
400:0:(osc_request.c:1143:can_merge_pages()) Skipped 43 previous similar me
ssages
Feb  9 12:15:50 nm-post-2 kernel: LustreError:
3935:0:(osc_request.c:1143:can_merge_pages()) is it ok to have flags 0xc20
and 0x420 in the same brw?
Feb  9 12:15:50 nm-post-2 kernel: LustreError:
3935:0:(osc_request.c:1143:can_merge_pages()) Skipped 1 previous similar me
ssage
>> which in turn appears to generate a premature EOF on our user software.
>
> Actually this message does what it does - resends the request, so the
userspace should not notice> any problems. On the other hand if any other requests aside from brw
requests fail, they might not> get the resending benefit and cause userspace-visible errors.
   I glanced at the source and my initial impression was what you just
said,  that this is an internal retry, on the other hand there seems to be
a tight correlation between these messages and the user space EOF
occurrences.

   Thanks for the quick response.

james
> Bye,
>     Oleg
>

James Robnett

2011-Feb-10 00:24 UTC

head link

[Lustre-discuss] osc_brw_redo_request error on clients

> Normally I''ve had no problems but recently I have multiple clients
> reporting the following error:
>
> LustreError: 3935:0:(osc_request.c:1629:osc_brw_redo_request()) @@@ redo
> for recoverable error  req at ffff8101ae084000
x1358858531428366/t60136289752
> o4->lustre-OST0004_UUID at 192.168.1.12@o2ib:6/4 lens 448/608 e 0 to 1
dl
> 1297285890 ref 2 fl Interpret:R/0/0 rc 0/0
>
> which in turn appears to generate a premature EOF on our user software.
>
> There are no corresponding errors on the servers.
   The above is not true.  There are apparently corresponding errors of
the form:

Feb  9 17:05:21 lustre-oss-1 kernel: LustreError:
2964:0:(ost_handler.c:1038:ost_brw_write()) client csum f00001, server
csum 964d53e2
Feb  9 17:05:21 lustre-oss-1 kernel: LustreError:
2964:0:(ost_handler.c:1038:ost_brw_write()) Skipped 43 previous similar
messages
Feb  9 17:05:21 lustre-oss-1 kernel: LustreError: 168-f: lustre-OST0000:
BAD WRITE CHECKSUM: changed in transit before arrival at OST from
12345-10.64.1.212 at tcp inum 2981338/1802650709 object 8183950/0 extent
[2384461824-2385510399]
Feb  9 17:05:21 lustre-oss-1 kernel: LustreError: Skipped 43 previous
similar messages
Feb  9 17:05:21 lustre-oss-1 kernel: LustreError:
2964:0:(ost_handler.c:1100:ost_brw_write()) client csum f00001, original
server csum 964d53e2, server csum now 964d53e2
Feb  9 17:05:21 lustre-oss-1 kernel: LustreError:
2964:0:(ost_handler.c:1100:ost_brw_write()) Skipped 43 previous similar
messages
Feb  9 17:10:22 lustre-oss-1 kernel: LustreError:
3035:0:(ost_handler.c:1038:ost_brw_write()) client csum f00001, server
csum 180cd9bd
Feb  9 17:10:22 lustre-oss-1 kernel: LustreError:
3035:0:(ost_handler.c:1038:ost_brw_write()) Skipped 63 previous similar
messages
Feb  9 17:10:22 lustre-oss-1 kernel: LustreError: 168-f: lustre-OST0000:
BAD WRITE CHECKSUM: changed in transit before arrival at OST from
12345-10.64.1.212 at tcp inum 2981338/1802650709 object 8183950/0 extent
[4355784704-4356833279]
Feb  9 17:10:22 lustre-oss-1 kernel: LustreError: Skipped 63 previous
similar messages
Feb  9 17:10:22 lustre-oss-1 kernel: LustreError:
3035:0:(ost_handler.c:1100:ost_brw_write()) client csum f00001, original
server csum 180cd9bd, server csum now 180cd9bd
Feb  9 17:10:22 lustre-oss-1 kernel: LustreError:
3035:0:(ost_handler.c:1100:ost_brw_write()) Skipped 63 previous similar
messages

   The other OSS shows similar errors.  We are doing mmap I/O and a
search implies those errors are related to mmap I/O.

   I''m open to suggestions, in the meantime the userspace code can be
switched from mmap to regular file I/O via an rc file so we''ll try that
and see if it at least makes the errors go away.

James

Bob Ball

2011-Feb-10 00:33 UTC

head link

[Lustre-discuss] osc_brw_redo_request error on clients

Maybe, clients should mount the file system with "localflock" 
parameter?  Please check the manual for information about this, but I 
think it was the same problem we had a while back where a dynamic link 
was failing.

bob

On 2/9/2011 7:24 PM, James Robnett wrote:>> Normally I''ve had no problems but recently I have multiple
clients
>> reporting the following error:
>>
>> LustreError: 3935:0:(osc_request.c:1629:osc_brw_redo_request()) @@@
redo
>> for recoverable error  req at ffff8101ae084000
x1358858531428366/t60136289752
>> o4->lustre-OST0004_UUID at 192.168.1.12@o2ib:6/4 lens 448/608 e 0 to
1 dl
>> 1297285890 ref 2 fl Interpret:R/0/0 rc 0/0
>>
>> which in turn appears to generate a premature EOF on our user software.
>>
>> There are no corresponding errors on the servers.
>     The above is not true.  There are apparently corresponding errors of
> the form:
>
> Feb  9 17:05:21 lustre-oss-1 kernel: LustreError:
> 2964:0:(ost_handler.c:1038:ost_brw_write()) client csum f00001, server
> csum 964d53e2
> Feb  9 17:05:21 lustre-oss-1 kernel: LustreError:
> 2964:0:(ost_handler.c:1038:ost_brw_write()) Skipped 43 previous similar
> messages
> Feb  9 17:05:21 lustre-oss-1 kernel: LustreError: 168-f: lustre-OST0000:
> BAD WRITE CHECKSUM: changed in transit before arrival at OST from
> 12345-10.64.1.212 at tcp inum 2981338/1802650709 object 8183950/0 extent
> [2384461824-2385510399]
> Feb  9 17:05:21 lustre-oss-1 kernel: LustreError: Skipped 43 previous
> similar messages
> Feb  9 17:05:21 lustre-oss-1 kernel: LustreError:
> 2964:0:(ost_handler.c:1100:ost_brw_write()) client csum f00001, original
> server csum 964d53e2, server csum now 964d53e2
> Feb  9 17:05:21 lustre-oss-1 kernel: LustreError:
> 2964:0:(ost_handler.c:1100:ost_brw_write()) Skipped 43 previous similar
> messages
> Feb  9 17:10:22 lustre-oss-1 kernel: LustreError:
> 3035:0:(ost_handler.c:1038:ost_brw_write()) client csum f00001, server
> csum 180cd9bd
> Feb  9 17:10:22 lustre-oss-1 kernel: LustreError:
> 3035:0:(ost_handler.c:1038:ost_brw_write()) Skipped 63 previous similar
> messages
> Feb  9 17:10:22 lustre-oss-1 kernel: LustreError: 168-f: lustre-OST0000:
> BAD WRITE CHECKSUM: changed in transit before arrival at OST from
> 12345-10.64.1.212 at tcp inum 2981338/1802650709 object 8183950/0 extent
> [4355784704-4356833279]
> Feb  9 17:10:22 lustre-oss-1 kernel: LustreError: Skipped 63 previous
> similar messages
> Feb  9 17:10:22 lustre-oss-1 kernel: LustreError:
> 3035:0:(ost_handler.c:1100:ost_brw_write()) client csum f00001, original
> server csum 180cd9bd, server csum now 180cd9bd
> Feb  9 17:10:22 lustre-oss-1 kernel: LustreError:
> 3035:0:(ost_handler.c:1100:ost_brw_write()) Skipped 63 previous similar
> messages
>
>     The other OSS shows similar errors.  We are doing mmap I/O and a
> search implies those errors are related to mmap I/O.
>
>     I''m open to suggestions, in the meantime the userspace code
can be
> switched from mmap to regular file I/O via an rc file so we''ll try
that
> and see if it at least makes the errors go away.
>
> James
>
>
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>

Oleg Drokin

2011-Feb-10 00:34 UTC

head link

[Lustre-discuss] osc_brw_redo_request error on clients

Hello!

On Feb 9, 2011, at 7:24 PM, James Robnett wrote:
>> Normally I''ve had no problems but recently I have multiple
clients
>> reporting the following error:
>> 
>> LustreError: 3935:0:(osc_request.c:1629:osc_brw_redo_request()) @@@
redo
>> for recoverable error  req at ffff8101ae084000
x1358858531428366/t60136289752
>> o4->lustre-OST0004_UUID at 192.168.1.12@o2ib:6/4 lens 448/608 e 0 to
1 dl
>> 1297285890 ref 2 fl Interpret:R/0/0 rc 0/0
>> 
>> which in turn appears to generate a premature EOF on our user software.
>> 
>> There are no corresponding errors on the servers.
>   The above is not true.  There are apparently corresponding errors of
> the form:
> Feb  9 17:05:21 lustre-oss-1 kernel: LustreError:
> 2964:0:(ost_handler.c:1038:ost_brw_write()) client csum f00001, server
> csum 964d53e2
> Feb  9 17:05:21 lustre-oss-1 kernel: LustreError:
> 2964:0:(ost_handler.c:1038:ost_brw_write()) Skipped 43 previous similar
> messages
>   The other OSS shows similar errors.  We are doing mmap I/O and a
> search implies those errors are related to mmap I/O.
Ok, so this is it. The mmap code changes the page before it is sent therefore
failing the crc check and causes the resent
(I am a bit surprised there is no crc error in the client logs, though).
>   I''m open to suggestions, in the meantime the userspace code can
be
> switched from mmap to regular file I/O via an rc file so we''ll try
that
> and see if it at least makes the errors go away.
Well, your options are to disable mmap in the code (Lustre mmap code is not
super fast,
so if that''s a real option, give it a try and you might find that it
speeds up everything too)
or you can disable checksum checking.

I also did some more digging and in fact there was a patch included in 1.8.4
that essentially makes
retries to be done only once and in fact ignore subsequent errors, so there
still should be no user-visible failures.
(the patch is from bug 11742, there the last comments in fact references the
messages just like you see on the client,
but no ill effect from them).

Bye,
    Oleg

Lustre discuss - Feb 2011 - osc_brw_redo_request error on clients

[Lustre-discuss] osc_brw_redo_request error on clients

[Lustre-discuss] osc_brw_redo_request error on clients

[Lustre-discuss] osc_brw_redo_request error on clients

[Lustre-discuss] osc_brw_redo_request error on clients

[Lustre-discuss] osc_brw_redo_request error on clients

[Lustre-discuss] osc_brw_redo_request error on clients