Reto Gantenbein
2008-Apr-29 19:06 UTC
[Lustre-discuss] lustre_mgs: operation ... on unconnected MGS
Dear Lustre users,

I set up a Lustre file system with 7 OSTs (Fibre Channel RAIDs) and an MGS/MDT, exported via two nodes. One node has the MGS/MDT and 3 OSTs mounted, the other has 4 OSTs. The nodes run the Lustre-patched 2.6.18 vanilla kernel. The clients are patchless and run the 2.6.22 Gentoo kernel. Lustre 1.6.4.3 is compiled from sources under Gentoo Linux.

The two nodes are called lustre01 and lustre02.

I formatted the MGS/MDT on lustre01 with:

mkfs.lustre --mgs --mdt --fsname=homefs --failnode=lustre02@tcp --reformat /dev/sdb

Then I mounted it and formatted the OSTs, also on lustre01, with:

mkfs.lustre --ost --mgsnode=lustre01@tcp --mgsnode=lustre02@tcp --fsname=homefs --failnode=lustre02@tcp --index=1 /dev/sdc

and so on...

Is there already a general mistake in this installation setup?

The OSTs are distributed over both servers to increase bandwidth and also for failover reasons. All OSTs and the MGS are connected to both servers but only mounted on a single one.

Now to my problem: I mounted the file system from a client with IP 10.1.1.65, and these are the messages that appear in the system log:

lustre01 LustreError: 13533:0:(handler.c:148:mds_sendpage()) @@@ bulk failed: timeout 0(4096), evicting 87fb775c-8f64-5d85-2a95-8fb595e62892@NET_0x200000a010141_UUID
lustre01 req@ffff81011dc72e00 x2483/t0 o37->87fb775c-8f64-5d85-2a95-8fb595e62892@NET_0x200000a010141_UUID:-1 lens 296/296 ref 0 fl Interpret:/0/0 rc 0/0

lustre01 LustreError: 13469:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-107) req@ffff81011d704a00 x2479/t0 o400-><?>@<?>:-1 lens 128/0 ref 0 fl Interpret:/0/0 rc -107/0

lustre01 LustreError: 13469:0:(handler.c:1499:mds_handle()) operation 400 on unconnected MDS from 12345-10.1.1.65@tcp

lustre01 LustreError: 13535:0:(mgs_handler.c:515:mgs_handle()) lustre_mgs: operation 101 on unconnected MGS

lustre01 LustreError: 13535:0:(mgs_handler.c:515:mgs_handle()) lustre_mgs: operation 501 on unconnected MGS

I already tried to find some answers on the net, but without much success. I cannot find what these messages mean or where they come from.

Maybe it also helps to show you my device list:

lustre01:
lctl > device_list
  0 UP mgs MGS MGS 11
  1 UP mgc MGC10.1.140.2@tcp 89b4c0f0-c602-0857-c22e-ed232d8ad7aa 5
  2 UP mdt MDS MDS_uuid 3
  3 UP lov homefs-mdtlov homefs-mdtlov_UUID 4
  4 UP mds homefs-MDT0000 homefs-MDT0000_UUID 5
  5 UP osc homefs-OST0001-osc homefs-mdtlov_UUID 5
  6 UP osc homefs-OST0004-osc homefs-mdtlov_UUID 5
  7 UP osc homefs-OST0005-osc homefs-mdtlov_UUID 5
  8 UP osc homefs-OST0002-osc homefs-mdtlov_UUID 5
  9 UP osc homefs-OST0003-osc homefs-mdtlov_UUID 5
 10 UP osc homefs-OST0006-osc homefs-mdtlov_UUID 5
 11 UP osc homefs-OST0007-osc homefs-mdtlov_UUID 5
 12 UP mgc MGC10.1.140.1@tcp c8ad2ab0-9eef-b334-37af-85734b53ac94 5
 13 UP ost OSS OSS_uuid 3
 14 UP obdfilter homefs-OST0001 homefs-OST0001_UUID 7
 15 UP obdfilter homefs-OST0004 homefs-OST0004_UUID 7
 16 UP obdfilter homefs-OST0005 homefs-OST0005_UUID 7

lustre02:
lctl > device_list
  0 UP mgc MGC10.1.140.1@tcp 6154baf3-e830-81d9-ff6c-451d107650c1 5
  1 UP ost OSS OSS_uuid 3
  2 UP obdfilter homefs-OST0002 homefs-OST0002_UUID 7
  3 UP obdfilter homefs-OST0003 homefs-OST0003_UUID 7
  4 UP obdfilter homefs-OST0006 homefs-OST0006_UUID 7
  5 UP obdfilter homefs-OST0007 homefs-OST0007_UUID 7

Can someone give me some hints? What is going wrong here?

Kind regards,
Reto Gantenbein
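For context, here is a minimal sketch of the full formatting and mounting sequence described above. The OST indices follow the device lists shown above; the mount points, the device names other than /dev/sdb and /dev/sdc, and the --failnode value used for the OSTs on lustre02 are assumptions for illustration only, not taken from the original message.

# lustre01: format the combined MGS/MDT and mount it first (command as posted)
mkfs.lustre --mgs --mdt --fsname=homefs --failnode=lustre02@tcp --reformat /dev/sdb
mount -t lustre /dev/sdb /mnt/mdt            # mount point assumed

# lustre01: format and mount its three OSTs (indices 1, 4, 5 per the device list;
# device names other than /dev/sdc are assumed)
mkfs.lustre --ost --mgsnode=lustre01@tcp --mgsnode=lustre02@tcp \
    --fsname=homefs --failnode=lustre02@tcp --index=1 /dev/sdc
mount -t lustre /dev/sdc /mnt/ost0001
# ... repeat with --index=4 and --index=5 on the other two devices

# lustre02: format and mount its four OSTs (indices 2, 3, 6, 7 per the device list;
# --failnode=lustre01@tcp is an assumed failover pairing, device names assumed)
mkfs.lustre --ost --mgsnode=lustre01@tcp --mgsnode=lustre02@tcp \
    --fsname=homefs --failnode=lustre01@tcp --index=2 /dev/sdc
mount -t lustre /dev/sdc /mnt/ost0002
# ... repeat with --index=3, --index=6 and --index=7

# client 10.1.1.65: mount the file system, listing both MGS NIDs for failover
mount -t lustre lustre01@tcp:lustre02@tcp:/homefs /mnt/homefs

Listing both --mgsnode values at format time, and both MGS NIDs in the client mount, is what lets the OSTs and clients find the MGS on whichever node it is currently running after a failover.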
Reto Gantenbein
2008-Apr-30 14:33 UTC
[Lustre-discuss] lustre_mgs: operation ... on unconnected MGS
Hello everybody,

I did a clean install of the OSTs/MGS and now it seems to work without these errors. But it is still unclear to me why they appeared or what they explicitly mean. Simpler error messages would be a nice thing in Lustre, especially for newbies.

Cheers,
Reto
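A full reformat is generally not the only way to get back to a clean configuration: in Lustre 1.6 the configuration logs served by the MGS can be regenerated with tunefs.lustre --writeconf. A rough sketch, with the mount points and device names carried over as assumptions from the first message:

# Unmount the clients first, then all targets on both nodes
umount /mnt/homefs                    # on each client
umount /mnt/ost0001                   # ... and every other OST mount
umount /mnt/mdt                       # on lustre01

# Erase the configuration logs so they are rewritten at the next mount:
# MGS/MDT first, then every OST
tunefs.lustre --writeconf /dev/sdb    # on lustre01 (MGS/MDT)
tunefs.lustre --writeconf /dev/sdc    # repeat on every OST device, both nodes

# Remount in order: MGS/MDT, then the OSTs, then the clients
mount -t lustre /dev/sdb /mnt/mdt
mount -t lustre /dev/sdc /mnt/ost0001
mount -t lustre lustre01@tcp:lustre02@tcp:/homefs /mnt/homefs

Note that regenerating the logs this way also discards parameters previously set with lctl conf_param, so any such settings would have to be reapplied afterwards.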