Hi all. I've been trying to update from 1.6b5 to 1.6b7, and running into assorted unexpected problems. I recall hearing about some compatibility issues on this list, but I couldn't remember the details, and searching through the list archives didn't reveal what I was looking for.

First, it seems the ksocklnd made an incompatible message format change, signified by different magic? My older 1.6b5 clients caused all manner of assertion failures when they tried to connect.

Second, even when I disable the old clients, I get a bunch of stuff like this on the MGS. I haven't seen this one before, and it's not clear to me what it's trying to tell me:

Feb 22 09:37:39 gsrv022 LustreError: 10882:0:(client.c:942:ptlrpc_expire_one_request()) @@@ timeout (sent at 1172154959, 100s ago) req@ffff8100ec701e00 x180/t0 o250->MGS@MGC10.2.2.22@tcp_0:26 lens 304/328 ref 1 fl Rpc:/0/0 rc 0/0
Feb 22 09:38:31 gsrv022 LustreError: 11941:0:(client.c:514:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@ffff8100ee704200 x190/t0 o101->MGS@MGC10.2.2.22@tcp_0:26 lens 232/240 ref 1 fl Rpc:/0/0 rc 0/0
Feb 22 09:38:31 gsrv022 LustreError: 11941:0:(client.c:514:ptlrpc_import_delay_req()) Skipped 3 previous similar messages

Am I remembering correctly that there was an on-disk format change? If that's the case I'll just go ahead and reconstitute the fs, but I haven't done that yet. Anybody else have hints on what it might be griping about?

TIA...
John,

I remember reading the wiki for MountConf and there was a filesystem change from beta5 to beta7. We went ahead and reformatted our MGS and OST nodes with beta7.

--
Jeremy Mann
jeremy@biochem.uthscsa.edu
University of Texas Health Science Center
Bioinformatics Core Facility
http://www.bioinformatics.uthscsa.edu
Phone: (210) 567-2672

_______________________________________________
Lustre-discuss mailing list
Lustre-discuss@clusterfs.com
https://mail.clusterfs.com/mailman/listinfo/lustre-discuss
    From: "Jeremy Mann" <jeremy@biochem.uthscsa.edu>
    Date: Thu, 22 Feb 2007 08:47:53 -0600 (CST)

    John, I remember reading the wiki for MountConf and there was a filesystem change from beta5 to beta7. We went ahead and reformatted our MGS and OST nodes with beta7.

Ok, that's probably it. Thanks.
from http://www.clusterfs.com/changelog.1.6.0.html :

WIRE PROTOCOL CHANGE from previous 1.6 beta versions. This version will not interoperate with 1.6 betas before beta7 (1.5.97).

There were no disk changes, but there were configuration log changes, so if you want to keep your old 1.6b5 data you should probably do a --writeconf (see https://mail.clusterfs.com/wikis/lustre/MountConf) to regenerate the config logs afresh.

Also, I'm not making an official announcement because the internal testing has not completed, but there's a b8 up on the ftp site, if you're interested.
    From: Nathaniel Rutman <nathan@clusterfs.com>
    Date: Thu, 22 Feb 2007 08:14:26 -0800

    from http://www.clusterfs.com/changelog.1.6.0.html :
    WIRE PROTOCOL CHANGE from previous 1.6 beta versions. This version will not interoperate with 1.6 betas before beta7 (1.5.97).

    There were no disk changes, but there were configuration log changes, so if you want to keep your old 1.6b5 data you should probably do a --writeconf (see https://mail.clusterfs.com/wikis/lustre/MountConf) to regenerate the config logs afresh.

Hmmm. Would that account for the odd log messages I'm seeing?

    Also, I'm not making an official announcement because the internal testing has not completed, but there's a b8 up on the ftp site, if you're interested.

Yikes. I think I'll try to finish b7 before I launch off into that one...
John R. Dunning wrote:
> Hmmm. Would that account for the odd log messages I'm seeing?

It could, depending on what's in the old logs.
> First, it seems the ksocklnd made an incompatible message format change,
> signified by different magic? My older 1.6b5 clients caused all manner of
> assertion failures when they tried to connect.

ERK! We changed the socklnd wire protocol so that we could do zero-copy sends without requiring a kernel patch. This change should still allow connection establishment with "old" peers, so I'm surprised that you've seen "all manner of assertion failures". Could you let me see some? Are they in the socklnd?

Cheers,
Eric
    From: "Eric Barton" <eeb@bartonsoftware.com>
    Date: Mon, 26 Feb 2007 15:59:15 -0000

    ERK! We changed the socklnd wire protocol so that we could do zero-copy sends without requiring a kernel patch. This change should still allow connection establishment with "old" peers, so I'm surprised that you've seen "all manner of assertion failures". Could you let me see some? Are they in the socklnd?

Here's a representative one. I got this kind of thing every time an older client tried to connect, and the problem went away as soon as I shut down or updated the older clients. I didn't do much further debugging than that.

Feb 22 08:28:36 localhost Lustre: 11093:0:(lib-move.c:1644:lnet_parse_put()) Dropping PUT from 12345-10.0.1.105@tcp portal 6 match 15660 offset 0 length 240: 2
Feb 22 08:28:36 localhost LustreError: 11215:0:(pack_generic.c:809:lustre_unpack_msg()) bad lustre msg magic: 0XBD00BD2
Feb 22 08:28:36 localhost LustreError: 11215:0:(service.c:557:ptlrpc_server_handle_request()) error unpacking request: ptl 12 from 12345-10.0.1.105@tcp xid 15659
Feb 22 08:28:36 localhost LustreError: 11215:0:(pack_generic.c:1298:lustre_msg_get_opc()) ASSERTION(0) failed:incorrect message magic: 0bd00bd2
Feb 22 08:28:36 localhost LustreError: 11215:0:(pack_generic.c:1298:lustre_msg_get_opc()) LBUG
Feb 22 08:28:36 localhost Lustre: 11215:0:(linux-debug.c:166:libcfs_debug_dumpstack()) showing stack for process 11215
Feb 22 08:28:36 localhost ll_mdt_01 R running task 0 11215 1 11216 11178 (L-TLB)
Feb 22 08:28:36 localhost ffff8100f39f1d18 0000000000000003 ffff810005d3f980 0000000000000286
Feb 22 08:28:36 localhost 0000000000000003 ffff8100f9b24080 ffffffff88120210 0000000000000512
Feb 22 08:28:36 localhost 0000000000000000 0000000000000000
Feb 22 08:28:36 localhost Call Trace:<ffffffff8010f53f>{show_trace+527} <ffffffff8010f6b5>{show_stack+229}
Feb 22 08:28:36 localhost <ffffffff88000d0a>{:libcfs:lbug_with_loc+122} <ffffffff880feb2d>{:ptlrpc:lustre_msg_get_opc+285}
Feb 22 08:28:36 localhost <ffffffff8810a208>{:ptlrpc:ptlrpc_main+5784} <ffffffff80131440>{default_wake_function+0}
Feb 22 08:28:36 localhost <ffffffff8010ebc2>{child_rip+8} <ffffffff88108b70>{:ptlrpc:ptlrpc_main+0}
Feb 22 08:28:36 localhost <ffffffff8010ebba>{child_rip+0}
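One way to see which clients are still running the older beta is to scan syslog for the "error unpacking request" lines and collect the peer NIDs they name. What follows is only a minimal Python sketch: the regular expression is inferred from the sample log above, and the function name `peers_with_unpack_errors` is made up for illustration. It is not a Lustre tool, and other versions may word the message differently.

```python
import re
from collections import Counter

# Assumed pattern, derived only from the sample log lines in this thread.
# It captures the peer NID (for example "10.0.1.105@tcp") that follows the
# numeric prefix in "from 12345-10.0.1.105@tcp".
UNPACK_ERR = re.compile(
    r"error unpacking request: ptl \d+ from \d+-(?P<nid>\S+@\S+) xid \d+"
)


def peers_with_unpack_errors(lines):
    """Count 'error unpacking request' messages per peer NID."""
    counts = Counter()
    for line in lines:
        match = UNPACK_ERR.search(line)
        if match:
            counts[match.group("nid")] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "Feb 22 08:28:36 localhost LustreError: 11215:0:"
        "(service.c:557:ptlrpc_server_handle_request()) error unpacking "
        "request: ptl 12 from 12345-10.0.1.105@tcp xid 15659",
        "Feb 22 08:28:36 localhost LustreError: 11215:0:"
        "(pack_generic.c:1298:lustre_msg_get_opc()) LBUG",
    ]
    for nid, count in peers_with_unpack_errors(sample).items():
        print(nid, count)
```

Feeding it the lines of a syslog file would give a per-client count, which is enough to decide which nodes to shut down or upgrade first.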
John,

> Here's a representative one. I got this kind of thing every time an older
> client tried to connect, and the problem went away as soon as I shut down or
> updated the older clients. I didn't do much further debugging than that.

Phew! The socklnd is working OK, but these 2 betas aren't interoperable at the lustre protocol level. Our bad for causing an assertion failure though - a node shouldn't fall over just because someone spoke garbage to it!

Cheers,
Eric
On Feb 26, 2007 17:57 -0000, Eric Barton wrote:
> Phew! The socklnd is working OK, but these 2 betas aren't interoperable
> at the lustre protocol level. Our bad for causing an assertion failure
> though - a node shouldn't fall over just because someone spoke garbage to it!

I've fixed the ptlrpc code to not try and calculate RPC stats on a message that we don't understand. This will appear in the final 1.6.0 release.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
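The general idea behind that kind of fix (check the magic first, and touch nothing else in a message you do not recognise) can be sketched as below. This is a Python illustration only, not the actual ptlrpc change. The header layout (magic at offset 0, opcode at offset 4, little-endian), the accepted magic value, and all names are hypothetical placeholders rather than anything taken from the Lustre source.

```python
import struct

# Hypothetical set of magics this node understands. Placeholder value only,
# not taken from the Lustre source.
KNOWN_MAGICS = frozenset({0x0BD00BD3})


class BadMagic(Exception):
    """Raised when a message carries a magic this node does not recognise."""


def parse_header(buf):
    """Return (magic, opcode) for a recognised message, else raise BadMagic.

    Assumed layout: 32-bit magic at offset 0, 32-bit opcode at offset 4.
    The opcode is read only after the magic has been validated.
    """
    if len(buf) < 8:
        raise BadMagic("short message")
    (magic,) = struct.unpack_from("<I", buf, 0)
    if magic not in KNOWN_MAGICS:
        raise BadMagic(f"unrecognised magic: {magic:#010x}")
    (opcode,) = struct.unpack_from("<I", buf, 4)
    return magic, opcode


def handle_request(buf, stats):
    """Update per-opcode stats; drop, rather than assert on, unknown messages."""
    try:
        _, opcode = parse_header(buf)
    except BadMagic:
        # Count the drop and carry on: a garbled peer must not take us down.
        stats["dropped"] = stats.get("dropped", 0) + 1
        return False
    stats[opcode] = stats.get(opcode, 0) + 1
    return True
```

The design point is simply that statistics and opcode lookups happen after validation, so an incompatible peer costs one dropped request instead of an LBUG.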