thr3ads.net - Lustre discuss - [Lustre-discuss] potential issue with data corruption [Jul 2011]

Lisa Giacchetti

2011-Jul-14 17:59 UTC

[Lustre-discuss] potential issue with data corruption

Hi,
  We are seeing a problem where some running jobs attempted to copy a
file from local disk on a worker node to a Lustre file system. 14 of
those files ended up empty or truncated.

We have 7 OSSs with either 6 or 12 OSTs on each. All 14 files ended up
on an OST on one of the two systems that have 12 OSTs. There are 12
different OSTs involved.

So if I look at the messages file on one of those OSSs and specifically
look for messages related to one of the OSTs that has a truncated or
empty file, I see things like this:

Jul  7 07:10:08 cmsls6 kernel: Lustre: 15431:0:(ldlm_lib.c:575:target_handle_reconnect()) cmsprod1-OST002d: c03badd9-c242-1507-6824-3a9648c8b21f reconnecting
Jul  7 07:59:42 cmsls6 kernel: Lustre: 3272:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request x1359120905647245 sent from cmsprod1-OST002d to NID 131.225.191.35@tcp 7s ago has timed out (7s prior to deadline).
Jul  7 07:59:42 cmsls6 kernel: LustreError: 138-a: cmsprod1-OST002d: A client on nid 131.225.191.35@tcp was evicted due to a lock completion callback to 131.225.191.35@tcp timed out: rc -107
Jul  7 09:26:58 cmsls6 kernel: Lustre: 15433:0:(ldlm_lib.c:575:target_handle_reconnect()) cmsprod1-OST002d: 9235f65e-ff71-2b1f-60fb-c049cbad5728 reconnecting
Jul  7 09:53:50 cmsls6 kernel: Lustre: 2663:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request x1359120905668862 sent from cmsprod1-OST002d to NID 131.225.204.88@tcp 7s ago has timed out (7s prior to deadline).
Jul  7 09:53:50 cmsls6 kernel: LustreError: 138-a: cmsprod1-OST002d: A client on nid 131.225.204.88@tcp was evicted due to a lock blocking callback to 131.225.204.88@tcp timed out: rc -107
Jul  7 10:18:57 cmsls6 kernel: LustreError: 138-a: cmsprod1-OST002d: A client on nid 131.225.207.176@tcp was evicted due to a lock blocking callback to 131.225.207.176@tcp timed out: rc -107
Jul  7 10:23:01 cmsls6 kernel: Lustre: 15405:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request x1359120905675944 sent from cmsprod1-OST002d to NID 131.225.204.118@tcp 7s ago has timed out (7s prior to deadline).
Jul  7 11:06:31 cmsls6 kernel: Lustre: 15341:0:(ldlm_lib.c:575:target_handle_reconnect()) cmsprod1-OST002d: e25b2761-680a-4d94-ed2c-10913403c0a3 reconnecting
Jul  7 12:26:17 cmsls6 kernel: Lustre: 15352:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request x1359120905703492 sent from cmsprod1-OST002d to NID 131.225.190.151@tcp 7s ago has timed out (7s prior to deadline).
Jul  7 12:26:17 cmsls6 kernel: LustreError: 138-a: cmsprod1-OST002d: A client on nid 131.225.190.151@tcp was evicted due to a lock blocking callback to 131.225.190.151@tcp timed out: rc -107
Jul  7 12:26:17 cmsls6 kernel: LustreError: 15352:0:(ldlm_lockd.c:1167:ldlm_handle_enqueue()) ### lock on destroyed export ffff810c3926f400 ns: filter-cmsprod1-OST002d_UUID lock: ffff8109c7f21a00/0xf22d54118e04e04d lrc: 3/0,0 mode: --/PW res: 337742/0 rrc: 2 type: EXT [0->1048575] (req 0->1048575) flags: 0x0 remote: 0x6c03f21f59f6b4e6 expref: 19 pid: 15352 timeout 0
Jul  7 12:26:17 cmsls6 kernel: Lustre: 2740:0:(ost_handler.c:1219:ost_brw_write()) cmsprod1-OST002d: ignoring bulk IO comm error with f81d3629-7e6a-1b5d-810e-ad73d7f5c90d@NET_0x2000083e1be97_UUID id 12345-131.225.190.151@tcp - client will retry
Jul  7 12:26:19 cmsls6 kernel: Lustre: 2742:0:(ost_handler.c:1219:ost_brw_write()) cmsprod1-OST002d: ignoring bulk IO comm error with f81d3629-7e6a-1b5d-810e-ad73d7f5c90d@NET_0x2000083e1be97_UUID id 12345-131.225.190.151@tcp - client will retry
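
One way to get an overview of an excerpt like the one above is to tally the "138-a" eviction messages by OST and by callback type, and to collect the evicted client NIDs so their logs can be checked. The following is only a minimal sketch: the function name `tally_evictions` is hypothetical, and the pattern is written against the message wording shown in this post. It accepts both the `@tcp` form and the ` at tcp` form that the list archive produces, and it rejoins lines wrapped by the mailer.

```python
import re
from collections import Counter

# Pattern for the "138-a" eviction message as worded in the excerpt above.
# The NID separator may be "@" or " at " (list-archive rendering).
EVICTION_RE = re.compile(
    r"LustreError: 138-a: (?P<ost>\S+): A client on nid "
    r"(?P<nid>\S+?)(?: at |@)(?P<net>\w+) was evicted due to a lock "
    r"(?P<callback>\w+) callback"
)

def tally_evictions(text):
    """Return (Counter keyed by (ost, callback type), set of evicted NIDs)."""
    flat = " ".join(text.split())  # undo mail line wrapping
    counts = Counter()
    nids = set()
    for m in EVICTION_RE.finditer(flat):
        counts[(m.group("ost"), m.group("callback"))] += 1
        nids.add(f"{m.group('nid')}@{m.group('net')}")
    return counts, nids

# Three of the eviction lines from the excerpt above.
sample = """
Jul  7 07:59:42 cmsls6 kernel: LustreError: 138-a: cmsprod1-OST002d: A client on nid 131.225.191.35@tcp was evicted due to a lock completion callback to 131.225.191.35@tcp timed out: rc -107
Jul  7 09:53:50 cmsls6 kernel: LustreError: 138-a: cmsprod1-OST002d: A client on nid 131.225.204.88@tcp was evicted due to a lock blocking callback to 131.225.204.88@tcp timed out: rc -107
Jul  7 10:18:57 cmsls6 kernel: LustreError: 138-a: cmsprod1-OST002d: A client on nid 131.225.207.176@tcp was evicted due to a lock blocking callback to 131.225.207.176@tcp timed out: rc -107
"""

if __name__ == "__main__":
    counts, nids = tally_evictions(sample)
    for (ost, callback), n in sorted(counts.items()):
        print(ost, callback, n)
    print(sorted(nids))
```

Running the same function over the full messages file of each OSS would show whether the evictions cluster on particular clients or particular OSTs.
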


Some of these errors seem really bad, like the bulk IO comm error or
the eviction due to a lock callback.
What should I be looking for here?  Some of the messages say a client
was evicted because the OSS thinks it is dead, but I have determined
that at least some of these are not due to the client system being down.
So what makes the OSS think the client is dead?

Also, is there any way to determine what files are involved in these errors?

lisa



Lisa Giacchetti

2011-Jul-14 18:15 UTC


I am running 1.8.3 on servers and clients.
lisa


Oleg Drokin

2011-Jul-14 19:47 UTC


Hello!

On Jul 14, 2011, at 1:59 PM, Lisa Giacchetti wrote:
> Some of these errors seem really bad - like the bulk IO comm error or the
> eviction due to a locking call back.
> What should I be looking for here?  I have determined some of the messages
> that say a client has been evicted cause the
> OSS thinks its dead are not due the system being down. So what makes the
> OSS think the client is dead?

Well, the clients become unresponsive for some reason, you really need to look
at the client side logs for some clues on that.

> Also is there any way to determine what files are involved in these errors?

Well, the lock blocking callbacks message will provide you with ost number and
object index that you might be able to backreference to a file.
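
As a rough illustration of pulling those two values out of a log line, the sketch below parses the "lock on destroyed export" message quoted earlier in the thread. It assumes that the four hex digits after `OST` in the namespace name are the OST index and that the first number after `res:` is the object ID; the second `res:` number is returned uninterpreted. The function name `parse_lock_resource` is hypothetical.

```python
import re

# Assumed layout: "ns: filter-<fsname>-OST<hex index>_UUID ... res: <objid>/<n>"
LOCK_RE = re.compile(
    r"ns: filter-(?P<fsname>[\w-]+?)-OST(?P<idx>[0-9a-fA-F]{4})_UUID"
    r".*?res: (?P<objid>\d+)/(?P<second>\d+)"
)

def parse_lock_resource(message):
    """Extract the OST index and object ID from an ldlm lock dump line."""
    flat = " ".join(message.split())  # tolerate mail line wrapping
    m = LOCK_RE.search(flat)
    if m is None:
        return None
    return {
        "fsname": m.group("fsname"),
        "ost_index": int(m.group("idx"), 16),  # OST002d -> 45
        "object_id": int(m.group("objid")),
        "res_second": int(m.group("second")),  # meaning not assumed here
    }

# The lock dump from the OSS log in the original post.
message = (
    "### lock on destroyed export ffff810c3926f400 "
    "ns: filter-cmsprod1-OST002d_UUID lock: "
    "ffff8109c7f21a00/0xf22d54118e04e04d lrc: 3/0,0 mode: --/PW "
    "res: 337742/0 rrc: 2 type: EXT [0->1048575] (req 0->1048575)"
)

if __name__ == "__main__":
    print(parse_lock_resource(message))
```

For the line above this gives OST index 45 (hex 2d) and object ID 337742, which are the two numbers to look for on the client side.
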

All that said, 1.8.3 is quite old and I think it would be a much better idea to
try 1.8.6 and see if it improves things.

Bye,
    Oleg
--
Oleg Drokin
Senior Software Engineer
Whamcloud, Inc.

Lisa Giacchetti

2011-Jul-14 19:55 UTC


Oleg,
thanks for your response.
See my responses inline.

lisa

On 7/14/11 2:47 PM, Oleg Drokin wrote:
> Hello!
>
> On Jul 14, 2011, at 1:59 PM, Lisa Giacchetti wrote:
>
>> Some of these errors seem really bad - like the bulk IO comm error or
>> the eviction due to a locking call back.
>> What should I be looking for here?  I have determined some of the
>> messages that say a client has been evicted cause the
>> OSS thinks its dead are not due the system being down. So what makes
>> the OSS think the client is dead?
> Well, the clients become unresponsive for some reason, you really need to
> look at the client side logs for some clues on that.

I have been doing this as I was waiting for a reply and going through
the manual and lustre-discuss archives.
Here is an example of one of the client's logs during the appropriate
time frame:

Jul  7 11:55:33 cmswn1526 kernel: LustreError: 11-0: an error occurred while communicating with 131.225.191.164@tcp. The obd_ping operation failed with -107
Jul  7 11:55:33 cmswn1526 kernel: Lustre: cmsprod1-OST0033-osc-ffff810617966400: Connection to service cmsprod1-OST0033 via nid 131.225.191.164@tcp was lost; in progress operations using this service will wait for recovery to complete.
Jul  7 11:55:33 cmswn1526 kernel: LustreError: 11-0: an error occurred while communicating with 131.225.191.164@tcp. The ost_write operation failed with -107
Jul  7 11:55:35 cmswn1526 kernel: LustreError: 167-0: This client was evicted by cmsprod1-OST0033; in progress operations using this service will fail.
Jul  7 11:55:35 cmswn1526 kernel: LustreError: 3750:0:(client.c:858:ptlrpc_import_delay_req()) @@@ IMP_INVALID  req@ffff81031d414400 x1373265269802511/t0 o4->cmsprod1-OST0033_UUID@131.225.191.164@tcp:6/4 lens 448/608 e 0 to 1 dl 0 ref 2 fl Rpc:/0/0 rc 0/0
Jul  7 11:55:35 cmswn1526 kernel: Lustre: cmsprod1-OST0033-osc-ffff810617966400: Connection restored to service cmsprod1-OST0033 using nid 131.225.191.164@tcp.

>> Also is there any way to determine what files are involved in these
>> errors?
> Well, the lock blocking callbacks message will provide you with ost number
> and object index that you might be able to backreference to a file.

I know there is a way to do this from the /proc file system (at least I
think it's /proc) but I can't find any reference to this
in the book I got from class on this or in the manual.
Can someone refresh my memory?

> All that said, 1.8.3 is quite old and I think it would be a much better
> idea to try 1.8.6 and see if it improves things.

Downtimes are few and far between for us so this may take a while to get
scheduled.
If there is anything that can be done in the meantime I'd like to try
it.

lisa

> Bye,
>      Oleg
> --
> Oleg Drokin
> Senior Software Engineer
> Whamcloud, Inc.

Oleg Drokin

2011-Jul-14 20:05 UTC


Hello!

On Jul 14, 2011, at 3:55 PM, Lisa Giacchetti wrote:
>>> Jul  7 07:10:08 cmsls6 kernel: Lustre: 15431:0:(ldlm_lib.c:575:target_handle_reconnect()) cmsprod1-OST002d: c03badd9-c242-1507-6824-3a9648c8b21f reconnecting
>>> Some of these errors seem really bad - like the bulk IO comm error
>>> or the eviction due to a locking call back.
>>> What should I be looking for here?  I have determined some of the
>>> messages that say a client has been evicted cause the
>>> OSS thinks its dead are not due the system being down. So what
>>> makes the OSS think the client is dead?
>> Well, the clients become unresponsive for some reason, you really need
>> to look at the client side logs for some clues on that.
> I have been doing this as I was waiting for a reply and going through the
> manual and lustre-discuss archives.
> Here is an example of one of the client's logs during the
> appropriate time frame:
> Jul  7 11:55:33 cmswn1526 kernel: LustreError: 11-0: an error occurred while communicating with 131.225.191.164@tcp. The obd_ping operation failed with -107
> Jul  7 11:55:33 cmswn1526 kernel: Lustre: cmsprod1-OST0033-osc-ffff810617966400: Connection to service cmsprod1-OST0033 via nid 131.225.191.164@tcp was lost; in progress operations using this service will wait for recovery to complete.

This is way too late in the game, here the server already evicted the client.
Was there anything before then?
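
A small helper can make "anything before then" concrete by pulling every client syslog line that falls in a window before the moment the client noticed the eviction. This is only a sketch: the function names and the 10-minute window are arbitrary choices, the year is assumed because syslog timestamps omit it, month names are assumed to be in English, and client and server clocks are assumed to be roughly in sync.

```python
from datetime import datetime, timedelta

def parse_syslog_time(line, year=2011):
    """Parse a leading 'Jul  7 11:55:33' style timestamp (year is assumed)."""
    stamp = " ".join(line.split()[:3])
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")

def lines_before(lines, event_time, window=timedelta(minutes=10)):
    """Return lines logged within `window` before `event_time` (exclusive)."""
    selected = []
    for line in lines:
        try:
            logged = parse_syslog_time(line)
        except ValueError:
            continue  # not a timestamped syslog line
        if event_time - window <= logged < event_time:
            selected.append(line)
    return selected

# Lines taken from the client log quoted above (cmswn1526).
client_log = [
    "Jul  7 11:55:33 cmswn1526 kernel: LustreError: 11-0: an error occurred while communicating with 131.225.191.164@tcp. The obd_ping operation failed with -107",
    "Jul  7 11:55:33 cmswn1526 kernel: LustreError: 11-0: an error occurred while communicating with 131.225.191.164@tcp. The ost_write operation failed with -107",
    "Jul  7 11:55:35 cmswn1526 kernel: LustreError: 167-0: This client was evicted by cmsprod1-OST0033; in progress operations using this service will fail.",
]

if __name__ == "__main__":
    evicted_at = datetime(2011, 7, 7, 11, 55, 35)
    for line in lines_before(client_log, evicted_at):
        print(line)
```

Run against the whole messages file rather than only the Lustre lines, this would also surface non-Lustre events (network, memory pressure, hung tasks) logged shortly before the eviction.
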
>>> Also is there any way to determine what files are involved in these
>>> errors?
>> Well, the lock blocking callbacks message will provide you with ost
>> number and object index that you might be able to backreference to a file.
> I know there is a way to do this from the /proc file system (at least I
> think its /proc) but I can't find any reference to this
> in the book I got from class on this or in the manual.
> Can someone refresh my memory?

Actually I think you can do it with combination of lfs find and lfs getattr.
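
A sketch of that lookup, assuming "lfs getattr" here refers to `lfs getstripe`, which is the `lfs` subcommand that prints the `obdidx` and `objid` of each stripe of a file. Candidate files can be narrowed first with `lfs find --obd <OST UUID> <directory>`, and the saved `lfs getstripe` output for those files can then be searched for the OST index and object ID from the server log. The parser below is illustrative only: the function name, the file paths, and the sample text are hypothetical, and the exact column layout may differ between Lustre versions.

```python
def files_for_object(getstripe_text, ost_index, object_id):
    """Return paths whose stripe rows match (obdidx, objid).

    Assumes each file's path appears on its own line starting with '/',
    followed by rows whose first two columns are decimal obdidx and objid.
    """
    matches = []
    current_path = None
    for raw in getstripe_text.splitlines():
        line = raw.strip()
        if line.startswith("/"):
            current_path = line
            continue
        fields = line.split()
        if len(fields) >= 2 and fields[0].isdigit() and fields[1].isdigit():
            if (int(fields[0]), int(fields[1])) == (ost_index, object_id):
                if current_path is not None:
                    matches.append(current_path)
    return matches

# Hypothetical text in the general shape of `lfs getstripe` output;
# the paths and the second object ID are made up for illustration.
SAMPLE_OUTPUT = """\
/lustre/cmsprod1/example/file_a
    obdidx     objid      objid     group
        45    337742    0x5274e         0

/lustre/cmsprod1/example/file_b
    obdidx     objid      objid     group
        45    337743    0x5274f         0
"""

if __name__ == "__main__":
    # OST002d is index 45; 337742 is the "res:" value from the OSS log.
    print(files_for_object(SAMPLE_OUTPUT, 45, 337742))
```

The OST index has to be converted from the hex suffix of the OST name (002d) to decimal (45) before comparing it with the `obdidx` column.
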
>> All that said, 1.8.3 is quite old and I think it would be a much better
>> idea to try 1.8.6 and see if it improves things.
> downtimes are few and far between for us so this may take a while to get
> scheduled.
> If there is anything that can be done in the meantime I'd like to
> try it.

I suspect there might have been several bugs since 1.8.3 that might have
manifested in slowness to reply to lock callback requests
and you'll end up having downtime to upgrade the clients one way or the
other.

Bye,
    Oleg
--
Oleg Drokin
Senior Software Engineer
Whamcloud, Inc.

Lisa Giacchetti

2011-Jul-14 20:07 UTC


On 7/14/11 3:05 PM, Oleg Drokin wrote:
> Hello!
>
> On Jul 14, 2011, at 3:55 PM, Lisa Giacchetti wrote:
>>>> Jul  7 07:10:08 cmsls6 kernel: Lustre: 15431:0:(ldlm_lib.c:575:target_handle_reconnect()) cmsprod1-OST002d: c03badd9-c242-1507-6824-3a9648c8b21f reconnecting
>>>> Some of these errors seem really bad - like the bulk IO comm
>>>> error or the eviction due to a locking call back.
>>>> What should I be looking for here?  I have determined some of
>>>> the messages that say a client has been evicted cause the
>>>> OSS thinks its dead are not due the system being down. So what
>>>> makes the OSS think the client is dead?
>>> Well, the clients become unresponsive for some reason, you really
>>> need to look at the client side logs for some clues on that.
>> I have been doing this as I was waiting for a reply and going through
>> the manual and lustre-discuss archives.
>> Here is an example of one of the client's logs during the
>> appropriate time frame:
>> Jul  7 11:55:33 cmswn1526 kernel: LustreError: 11-0: an error occurred while communicating with 131.225.191.164@tcp. The obd_ping operation failed with -107
>> Jul  7 11:55:33 cmswn1526 kernel: Lustre: cmsprod1-OST0033-osc-ffff810617966400: Connection to service cmsprod1-OST0033 via nid 131.225.191.164@tcp was lost; in progress operations using this service will wait for recovery to complete.
> This is way too late in the game, here the server already evicted the
> client.
> Was there anything before then?

No there is nothing before then.

>>>> Also is there any way to determine what files are involved in
>>>> these errors?
>>> Well, the lock blocking callbacks message will provide you with ost
>>> number and object index that you might be able to backreference to a file.
>> I know there is a way to do this from the /proc file system (at least I
>> think its /proc) but I can't find any reference to this
>> in the book I got from class on this or in the manual.
>> Can someone refresh my memory?
> Actually I think you can do it with combination of lfs find and lfs
> getattr.

Hmm. Ok let me try that

>>> All that said, 1.8.3 is quite old and I think it would be a much
>>> better idea to try 1.8.6 and see if it improves things.
>> downtimes are few and far between for us so this may take a while to
>> get scheduled.
>> If there is anything that can be done in the meantime I'd like
>> to try it.
> I suspect the might have been several bugs since 1.8.3 that might have
> manifested in slowness to reply to lock callback requests
> and you'll end up having downtime to upgrade the clients one way
> or the other.
>
> Bye,
>      Oleg
> --
> Oleg Drokin
> Senior Software Engineer
> Whamcloud, Inc.
