I got around the lustre-modules problem by removing the RPM and reinstalling it. That worked, but now I'm at a loss as to what is going on here. So far I have 1 dedicated mgs/mds node, 1 ost and 1 client.

Making the mgs/mds node went fine, same with the ost. The problem is with the client and I can't figure out why it's doing this.

On the client, df hangs and the logs show me:

Lustre: 4569:0:(import.c:395:import_select_connection()) bcffs-OST0000-osc-ffff81007f203400: tried all connections, increasing latency to 20s
Lustre: Request x22 sent from bcffs-OST0000-osc-ffff81007f203400 to NID 192.168.1.254@tcp 5s ago has timed out (limit 5s).
Lustre: 4569:0:(import.c:395:import_select_connection()) bcffs-OST0000-osc-ffff81007f203400: tried all connections, increasing latency to 25s
Lustre: Request x25 sent from bcffs-OST0000-osc-ffff81007f203400 to NID 192.168.1.254@tcp 5s ago has timed out (limit 5s).
Lustre: 4569:0:(import.c:395:import_select_connection()) bcffs-OST0000-osc-ffff81007f203400: tried all connections, increasing latency to 30s
LustreError: 4568:0:(events.c:55:request_out_callback()) @@@ type 4, status -5 req@ffff81004eaefa00 x28/t0 o8->bcffs-OST0000_UUID@192.168.1.254@tcp:6/4 lens 240/400 e 0 to 5 dl 1217610639 ref 2 fl Rpc:/0/0 rc 0/0
Lustre: Request x28 sent from bcffs-OST0000-osc-ffff81007f203400 to NID 192.168.1.254@tcp 0s ago has timed out (limit 5s).

Each device (mgs/mdt, ost and client) has gigE. The client is the front end, which also serves NFS and Grid services; those work fine.

What is the latency issue with Lustre 1.6.5.1?

-- 
Jeremy Mann
jeremy at biochem.uthscsa.edu

University of Texas Health Science Center
Bioinformatics Core Facility
http://www.bioinformatics.uthscsa.edu
Phone: (210) 567-2672
Any ideas to the latency issue? I've tried everything I can and it only happens with my frontend node. If I try other client nodes, it works. Ideas?

Lustre: Client bcffs-client has started
Lustre: Request x42 sent from bcffs-OST0000-osc-ffff8101ff6b4000 to NID 192.168.1.254@tcp 5s ago has timed out (limit 5s).
Lustre: 5097:0:(import.c:395:import_select_connection()) bcffs-OST0000-osc-ffff8101ff6b4000: tried all connections, increasing latency to 5s
LustreError: 5096:0:(events.c:55:request_out_callback()) @@@ type 4, status -5 req@ffff81004e550200 x47/t0 o8->bcffs-OST0000_UUID@192.168.1.254@tcp:6/4 lens 240/400 e 0 to 5 dl 1217614425 ref 2 fl Rpc:/0/0 rc 0/0

-- 
Jeremy Mann
Andreas Dilger
2008-Aug-01 20:28 UTC
[Lustre-discuss] Desperate problems with Lustre 1.6.5.1
On Aug 01, 2008 13:16 -0500, Jeremy Mann wrote:
> Any ideas to the latency issue? I've tried everything I can and it only
> happens with my frontend node. If I try other client nodes, it works.

Do you have xinetd or selinux or other port filtering enabled?

Try running tcpdump on the server to see if you get any traffic at all.

You can also try "telnet {server} 988" to see if you get any connection at all. It won't print anything to the screen, but after typing something and pressing enter it will disconnect.

Cheers, Andreas
-- 
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
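
The "telnet {server} 988" check suggested above can also be scripted, which helps when several nodes need testing. What follows is only a minimal sketch, not part of the original reply: the function name can_connect, the LUSTRE_PORT constant, and the 5 second default timeout are all made up for illustration. Port 988 is the one named in the reply. A True result only shows that the TCP port is reachable from this node; it does not prove that Lustre itself is healthy. A False result covers both a refused connection and a timeout, so it cannot tell a firewall that drops packets from a service that is not listening.

```python
import socket
import sys

# Port taken from the "telnet {server} 988" suggestion in the thread.
LUSTRE_PORT = 988


def can_connect(host, port=LUSTRE_PORT, timeout=5.0):
    """Return True if a TCP connection to host:port opens within `timeout` seconds.

    Hypothetical helper for illustration only. It opens the connection and
    closes it again without sending any data.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # OSError covers refused connections, timeouts, unreachable hosts,
        # and name-resolution failures.
        return False


if __name__ == "__main__":
    # Check every host given on the command line; do nothing if none are given.
    for server in sys.argv[1:]:
        state = "reachable" if can_connect(server) else "NOT reachable"
        print(f"{server}:{LUSTRE_PORT} is {state}")
```

Saved under a name of your choosing, for example check_port.py, it would be run from the affected client as "python3 check_port.py {server}", and then from a client that works for comparison.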
Andreas Dilger wrote:
> On Aug 01, 2008 13:16 -0500, Jeremy Mann wrote:
>> Any ideas to the latency issue? I've tried everything I can and it only
>> happens with my frontend node. If I try other client nodes, it works.
>
> Do you have xinetd or selinux or other port filtering enabled?
>
> Try running tcpdump on the server to see if you get any traffic at all.
>
> You can also try "telnet {server} 988" to see if you get any connection
> at all. It won't print anything to the screen, but after typing something
> and pressing enter it will disconnect.

It didn't occur to me at the time to check iptables on the compute nodes. There were in fact iptables firewalls on all of our compute nodes. Once I disabled them, it started working.

-- 
Jeremy Mann