Good morning,

To make a long story short: the network malfunctioned and the client was
reset before the network problems were solved. Now the system seems to
be presenting some coherency issues.

On the client I see:
LustreError: 3311:0:(client.c:951:ptlrpc_expire_one_request()) @@@ timeout (sent at 1170669653, 5s ago) req@eadab400 x348045/t0 o8->b-ost0_UUID@no6_UUID:6 lens 240/272 ref 1 fl Rpc:/0/0 rc 0/0
LustreError: 3311:0:(client.c:951:ptlrpc_expire_one_request()) previously skipped 199 similar messages

On the node no6 I see:
LustreError: 5462:0:(lib-move.c:152:lnet_match_md()) Dropping PUT from 12345-10.10.1.4@tcp portal 6 match 0x52844b offset 0 length 240: no match
LustreError: 5462:0:(lib-move.c:152:lnet_match_md()) previously skipped 287 similar messages

Any suggestions on how to reestablish a coherent filesystem would be
greatly appreciated. Lustre is 1.4.6.4 with a patch for bug 10730, with
kernel 2.6.12.6.

Thanks in advance,
João Miguel Neves
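For anyone reading along, the client timeout line packs several fields together, and the "sent at" value looks like a Unix epoch timestamp. A minimal sketch that pulls the fields apart and converts the timestamp to UTC follows. The regular expression and the field names (opcode, target, node, portal) are assumptions inferred from this single sample line, not taken from Lustre documentation.

```python
import re
from datetime import datetime, timezone

# Client-side timeout line copied from the message above.
LINE = (
    "LustreError: 3311:0:(client.c:951:ptlrpc_expire_one_request()) @@@ timeout "
    "(sent at 1170669653, 5s ago) req@eadab400 x348045/t0 "
    "o8->b-ost0_UUID@no6_UUID:6 lens 240/272 ref 1 fl Rpc:/0/0 rc 0/0"
)

# Assumed layout, inferred from the sample: "oN->TARGET@NODE:PORTAL".
TIMEOUT_RE = re.compile(
    r"sent at (?P<sent>\d+), (?P<age>\d+)s ago\).*? "
    r"o(?P<opcode>\d+)->(?P<target>\S+?)@(?P<node>\S+?):(?P<portal>\d+) "
)


def parse_timeout(line):
    """Return the fields of a timeout line as a dict, or None if it does not match."""
    m = TIMEOUT_RE.search(line)
    if m is None:
        return None
    sent = datetime.fromtimestamp(int(m["sent"]), tz=timezone.utc)
    return {
        "sent_utc": sent.isoformat(),
        "age_seconds": int(m["age"]),
        "opcode": int(m["opcode"]),
        "target": m["target"],
        "node": m["node"],
        "portal": int(m["portal"]),
    }


if __name__ == "__main__":
    print(parse_timeout(LINE))
```

Converting the timestamp makes it easier to line the client log up against the log on no6 when the two machines log in different formats.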
Latest messages (when trying to do an ls -l on the client):
LustreError: 3702:0:(file.c:704:ll_glimpse_size()) obd_enqueue returned rc -4, returning -EIO

Any clues are welcome.

Best regards,
João Miguel Neves
On Feb 05, 2007 15:56 +0000, João Miguel Neves wrote:
> Latest messages (when trying to do a ls -l on the client):
> LustreError: 3702:0:(file.c:704:ll_glimpse_size()) obd_enqueue returned rc -4, returning -EIO

-4 = -EINTR, means someone hit CTRL-C for ls.

> On Mon, 2007-02-05 at 10:06 +0000, João Miguel Neves wrote:
> > On the node no6 I see:
> > LustreError: 5462:0:(lib-move.c:152:lnet_match_md()) Dropping PUT from 12345-10.10.1.4@tcp portal 6 match 0x52844b offset 0 length 240: no match
> > LustreError: 5462:0:(lib-move.c:152:lnet_match_md()) previously skipped 287 similar messages

It means nothing is listening for this request on the OST (i.e. it is not
started up yet).

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
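The rc values in these log lines are negated errno numbers, so they can be decoded the same way as the -4 above. A small sketch using only the Python standard library follows; describe_rc is a hypothetical helper name, and the numeric-to-name mapping is whatever the local platform defines (on Linux, 4 is EINTR and 5 is EIO).

```python
import errno
import os


def describe_rc(rc):
    """Map a negative kernel-style return code to (symbolic name, message)."""
    code = abs(rc)
    # errno.errorcode maps platform errno numbers to names such as "EINTR".
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


if __name__ == "__main__":
    # -4 is the rc from obd_enqueue, -5 is the -EIO that was returned to ls.
    for rc in (-4, -5):
        name, text = describe_rc(rc)
        print(f"rc {rc}: -{name} ({text})")
```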
> > On Mon, 2007-02-05 at 10:06 +0000, João Miguel Neves wrote:
> > > On the node no6 I see:
> > > LustreError: 5462:0:(lib-move.c:152:lnet_match_md()) Dropping PUT from 12345-10.10.1.4@tcp portal 6 match 0x52844b offset 0 length 240: no match
> > > LustreError: 5462:0:(lib-move.c:152:lnet_match_md()) previously skipped 287 similar messages
> >
> > It means nothing is listening for this request on the OST (i.e. it is not
> > started up yet).

So this would normally happen if an OST is down, right? But the node no6
has the 8 OSTs that it always had...

# cat /proc/fs/lustre/devices
  0 UP obdfilter b-ost0 b-ost0_UUID 4
  1 UP ost OSS OSS_UUID 3
  2 UP obdfilter b-ost1 b-ost1_UUID 4
  3 UP obdfilter b-ost2 b-ost2_UUID 4
  4 UP obdfilter b-ost3 b-ost3_UUID 5
  5 UP obdfilter b-ost4 b-ost4_UUID 5
  6 UP obdfilter b-ost5 b-ost5_UUID 5
  7 UP obdfilter b-ost6 b-ost6_UUID 5
  8 UP obdfilter b-ost7 b-ost7_UUID 5

Giving some more info to see if I can understand what's going on:

client - 10.10.1.2
mds - 10.10.1.4
no6 (ost) - 10.10.1.6

If I just test the names (metadata) with a 'find' on the client,
everything shows correctly. If I do a 'find -ls' in some directories
(the ones where I suspect there has been data loss), it simply seems to
block, and I get this in the log of no6:

LustreError: 5462:0:(lib-move.c:152:lnet_match_md()) Dropping PUT from 12345-10.10.1.2@tcp portal 6 match 0x155830 offset 0 length 240: no match
LustreError: 5462:0:(lib-move.c:152:lnet_match_md()) previously skipped 295 similar messages

On the client I get:

LustreError: 3314:0:(client.c:951:ptlrpc_expire_one_request()) @@@ timeout (sent at 1170756287, 5s ago) req@ec066200 x1398929/t0 o8->b-ost0_UUID@no6_UUID:6 lens 240/272 ref 1 fl Rpc:/0/0 rc 0/0
LustreError: 3314:0:(client.c:951:ptlrpc_expire_one_request()) previously skipped 199 similar messages

Does this make any sense?

Thanks,
João Miguel Neves
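With the address list from the message above, the "Dropping PUT" lines on no6 can be matched back to the machine that sent each request. A rough sketch follows; the regular expression is an assumption based on the two sample lines in this thread, and ROLES simply restates the addresses given above.

```python
import re

# Address-to-role table as listed in the message above.
ROLES = {
    "10.10.1.2": "client",
    "10.10.1.4": "mds",
    "10.10.1.6": "no6 (ost)",
}

# Assumed layout: "Dropping <op> from <pid>-<addr>@<net> portal <n> match <hex>".
DROP_RE = re.compile(
    r"Dropping (?P<op>\w+) from (?P<pid>\d+)-(?P<addr>[\d.]+)@(?P<net>\w+) "
    r"portal (?P<portal>\d+) match (?P<match>0x[0-9a-fA-F]+)"
)

# The two drop messages reported from no6 in this thread.
LINES = [
    "LustreError: 5462:0:(lib-move.c:152:lnet_match_md()) Dropping PUT from "
    "12345-10.10.1.4@tcp portal 6 match 0x52844b offset 0 length 240: no match",
    "LustreError: 5462:0:(lib-move.c:152:lnet_match_md()) Dropping PUT from "
    "12345-10.10.1.2@tcp portal 6 match 0x155830 offset 0 length 240: no match",
]


def parse_drop(line):
    """Return the sender and portal of a drop message, or None if it does not match."""
    m = DROP_RE.search(line)
    if m is None:
        return None
    info = m.groupdict()
    info["portal"] = int(info["portal"])
    info["role"] = ROLES.get(info["addr"], "unknown")
    return info


if __name__ == "__main__":
    for line in LINES:
        info = parse_drop(line)
        print(f"{info['op']} from {info['addr']} ({info['role']}), portal {info['portal']}")
```

Run against the two samples, this shows the first dropped request coming from the mds address and the second from the client address, both on portal 6.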
On Tue, 2007-02-06 at 10:10 +0000, João Miguel Neves wrote:
> If I do a 'find -ls' in some directories
> (the ones I suspect where there has been data loss), it simply seems to
> block and, I get this in the log of no6:
>
> LustreError: 5462:0:(lib-move.c:152:lnet_match_md()) Dropping PUT from 12345-10.10.1.2@tcp portal 6 match 0x155830 offset 0 length 240: no match
> LustreError: 5462:0:(lib-move.c:152:lnet_match_md()) previously skipped 295 similar messages
>
> On the client I get:
>
> LustreError: 3314:0:(client.c:951:ptlrpc_expire_one_request()) @@@ timeout (sent at 1170756287, 5s ago) req@ec066200 x1398929/t0 o8->b-ost0_UUID@no6_UUID:6 lens 240/272 ref 1 fl Rpc:/0/0 rc 0/0
> LustreError: 3314:0:(client.c:951:ptlrpc_expire_one_request()) previously skipped 199 similar messages

This set of messages repeats itself until I stop the find.

Best regards,
João Miguel Neves