Hi all,

we're using Lustre 1.8.3 on a gigabit network. We have 4 OSTs on 2 OSSes and a single MGS. We are serving the users' home directories (mostly small files) to 64 clients on this network. It has now happened for the third time that the cluster went down: either the OSS or the MGS blocks, and nobody can access the Lustre share anymore.

Looking at the logs, I see lots of connectivity errors:

LustreError: 17792:0:(mgs_handler.c:641:mgs_handle()) MGS handle cmd=250 rc=-16
LustreError: 17792:0:(mgs_handler.c:641:mgs_handle()) Skipped 3 previous similar messages
LustreError: 17792:0:(ldlm_lib.c:1892:target_send_reply_msg()) @@@ processing error (-16) req@f5ae642c x1344336331331741/t0 o250->6e1c6cb5-564f-49b0-a01e-e7e4605423a8@NET_0x200000affffc1_UUID:0/0 lens 368/264 e 0 to 0 dl 1288258643 ref 1 fl Interpret:/0/0 rc -16/0
LustreError: 17792:0:(ldlm_lib.c:1892:target_send_reply_msg()) Skipped 3 previous similar messages
Lustre: 2895:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request x1348047450260821 sent from lustre-MDT0000-mdc-f3aa2e00 to NID 0@lo 30s ago has timed out (30s prior to deadline). req@e66d1e00 x1348047450260821/t0 o38->lustre-MDT0000_UUID@10.1.1.1@tcp:12/10 lens 368/584 e 0 to 1 dl 1288258632 ref 1 fl Rpc:N/0/0 rc 0/0
Lustre: 2895:0:(client.c:1463:ptlrpc_expire_one_request()) Skipped 19 previous similar messages
Lustre: MGS: haven't heard from client 03b9cdae-66f1-552b-8c7c-94a9499c8dcf (at 10.255.255.220@tcp) in 228 seconds. I think it's dead, and I am evicting it.
LustreError: 2893:0:(acceptor.c:455:lnet_acceptor()) Error -11 reading connection request from 10.255.255.199
Lustre: 2936:0:(ldlm_lib.c:575:target_handle_reconnect()) MGS: fa602b20-b24c-bbcd-7003-b3b9bf702db4 reconnecting
Lustre: 2936:0:(ldlm_lib.c:575:target_handle_reconnect()) Skipped 49 previous similar messages
LustreError: 2888:0:(socklnd_cb.c:1707:ksocknal_recv_hello()) Error -104 reading HELLO from 10.255.255.199
LustreError: 2888:0:(socklnd_cb.c:1707:ksocknal_recv_hello()) Skipped 2 previous similar messages
LustreError: 11b-b: Connection to 10.255.255.199@tcp at host 10.255.255.199 on port 988 was reset: is it running a compatible version of Lustre and is 10.255.255.199@tcp one of its NIDs?
Lustre: 20268:0:(ldlm_lib.c:804:target_handle_connect()) MGS: exp eb3ff200 already connecting
Lustre: 17792:0:(ldlm_lib.c:875:target_handle_connect()) MGS: refuse reconnection from fa602b20-b24c-bbcd-7003-b3b9bf702db4@10.255.255.199@tcp to 0xeb3ff200; still busy with 2 active RPCs
Lustre: 17792:0:(ldlm_lib.c:875:target_handle_connect()) Skipped 1 previous similar message
Lustre: 2895:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request x1348047450261066 sent from lustre-MDT0000-mdc-f3aa2e00 to NID 0@lo 30s ago has timed out (30s prior to deadline). req@ce186400 x1348047450261066/t0 o38->lustre-MDT0000_UUID@10.1.1.1@tcp:12/10 lens 368/584 e 0 to 1 dl 1288259252 ref 1 fl Rpc:N/0/0 rc 0/0
Lustre: 2895:0:(client.c:1463:ptlrpc_expire_one_request()) Skipped 19 previous similar messages
Lustre: There was an unexpected network error while writing to 10.255.255.221: -110.
Lustre: 20268:0:(ldlm_lib.c:804:target_handle_connect()) MGS: exp d8637600 already connecting
Lustre: 19656:0:(ldlm_lib.c:875:target_handle_connect()) MGS: refuse reconnection from 7ee2fe58-3fab-c39a-8adb-c356d1bdc434@10.255.255.209@tcp to 0xd8637600; still busy with 1 active RPCs
LustreError: 19656:0:(mgs_handler.c:641:mgs_handle()) MGS handle cmd=250 rc=-16
LustreError: 19656:0:(mgs_handler.c:641:mgs_handle()) Skipped 3 previous similar messages
LustreError: 19656:0:(ldlm_lib.c:1892:target_send_reply_msg()) @@@ processing error (-16) req@f5ae6c2c x1344329600678410/t0 o250->7ee2fe58-3fab-c39a-8adb-c356d1bdc434@NET_0x200000affffd1_UUID:0/0 lens 368/264 e 0 to 0 dl 1288259427 ref 1 fl Interpret:/0/0 rc -16/0
LustreError: 19656:0:(ldlm_lib.c:1892:target_send_reply_msg()) Skipped 3 previous similar messages
Lustre: There was an unexpected network error while writing to 10.255.255.209: -110.
Lustre: 2935:0:(ldlm_lib.c:575:target_handle_reconnect()) MGS: fa602b20-b24c-bbcd-7003-b3b9bf702db4 reconnecting
Lustre: 2935:0:(ldlm_lib.c:575:target_handle_reconnect()) Skipped 55 previous similar messages
LustreError: 2890:0:(socklnd_cb.c:1707:ksocknal_recv_hello()) Error -11 reading HELLO from 10.255.255.191
LustreError: 2890:0:(socklnd_cb.c:1707:ksocknal_recv_hello()) Skipped 1 previous similar message
LustreError: 2893:0:(acceptor.c:455:lnet_acceptor()) Error -11 reading connection request from 10.255.255.220
eth0: too many iterations (16) in nv_nic_irq.
Lustre: MGS: haven't heard from client ed155959-9a71-653f-7d3b-0f4982ed020c (at 10.255.255.212@tcp) in 222 seconds. I think it's dead, and I am evicting it.
Lustre: Skipped 3 previous similar messages
Lustre: There was an unexpected network error while writing to 10.255.255.215: -110.
Lustre: 2935:0:(ldlm_lib.c:875:target_handle_connect()) MGS: refuse reconnection from 4e3abc28-f42b-ad98-3021-5557328f80a7@10.255.255.213@tcp to 0xee70bc00; still busy with 1 active RPCs
Lustre: 2895:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request x1348047450261319 sent from lustre-MDT0000-mdc-f3aa2e00 to NID 0@lo 30s ago has timed out (30s prior to deadline). req@ec7ffc00 x1348047450261319/t0 o38->lustre-MDT0000_UUID@10.1.1.1@tcp:12/10 lens 368/584 e 0 to 1 dl 1288259872 ref 1 fl Rpc:N/0/0 rc 0/0
Lustre: 2895:0:(client.c:1463:ptlrpc_expire_one_request()) Skipped 19 previous similar messages
Lustre: 20268:0:(ldlm_lib.c:575:target_handle_reconnect()) MGS: ed155959-9a71-653f-7d3b-0f4982ed020c reconnecting
Lustre: 20268:0:(ldlm_lib.c:575:target_handle_reconnect()) Skipped 53 previous similar messages
LustreError: 2889:0:(socklnd_cb.c:1707:ksocknal_recv_hello()) Error -11 reading HELLO from 10.255.255.217
LustreError: 2889:0:(socklnd_cb.c:1707:ksocknal_recv_hello()) Skipped 1 previous similar message
Lustre: MGS: haven't heard from client 5ccc9289-412d-1e96-a777-f5c3dbcc3899 (at 10.255.255.218@tcp) in 227 seconds. I think it's dead, and I am evicting it.
Lustre: Skipped 3 previous similar messages
Lustre: 2895:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request x1348047450261564 sent from lustre-MDT0000-mdc-f3aa2e00 to NID 0@lo 30s ago has timed out (30s prior to deadline).
req@eb72ba00 x1348047450261564/t0 o38->lustre-MDT0000_UUID@10.1.1.1@tcp:12/10 lens 368/584 e 0 to 1 dl 1288260492 ref 1 fl Rpc:N/0/0 rc 0/0
Lustre: 2895:0:(client.c:1463:ptlrpc_expire_one_request()) Skipped 19 previous similar messages
LustreError: 14286:0:(mdc_locks.c:625:mdc_enqueue()) ldlm_cli_enqueue: -4
LustreError: 14286:0:(file.c:3256:ll_inode_revalidate_fini()) failure -4 inode 38076417
LustreError: 14286:0:(mdc_locks.c:625:mdc_enqueue()) ldlm_cli_enqueue: -4
LustreError: 14286:0:(file.c:3256:ll_inode_revalidate_fini()) failure -4 inode 38076417
LustreError: 21507:0:(mdc_locks.c:625:mdc_enqueue()) ldlm_cli_enqueue: -4
LustreError: 21507:0:(file.c:3256:ll_inode_revalidate_fini()) failure -4 inode 38076417
Lustre: 2935:0:(ldlm_lib.c:875:target_handle_connect()) MGS: refuse reconnection from ed155959-9a71-653f-7d3b-0f4982ed020c@10.255.255.212@tcp to 0xd1681e00; still busy with 1 active RPCs
Lustre: 2935:0:(ldlm_lib.c:875:target_handle_connect()) Skipped 1 previous similar message
LustreError: 2935:0:(mgs_handler.c:641:mgs_handle()) MGS handle cmd=250 rc=-16
LustreError: 2935:0:(mgs_handler.c:641:mgs_handle()) Skipped 2 previous similar messages
LustreError: 2935:0:(ldlm_lib.c:1892:target_send_reply_msg()) @@@ processing error (-16) req@f55c402c x1344329607901556/t0 o250->ed155959-9a71-653f-7d3b-0f4982ed020c@NET_0x200000affffd4_UUID:0/0 lens 368/264 e 0 to 0 dl 1288260660 ref 1 fl Interpret:/0/0 rc -16/0
LustreError: 2935:0:(ldlm_lib.c:1892:target_send_reply_msg()) Skipped 2 previous similar messages
LustreError: 21584:0:(mdc_locks.c:625:mdc_enqueue()) ldlm_cli_enqueue: -4
LustreError: 21584:0:(file.c:3256:ll_inode_revalidate_fini()) failure -4 inode 38076417

So this seems (I guess) to be an issue with our network, which appears under certain conditions (e.g. high load). Another hint pointing in this direction is the high number of RX errors I can see on the ports of the connecting switch, and the fact that some of the nodes are resetting their link.
In normal operation, I don't see any of these behaviors.

I suspect that something goes wrong while communicating over the network, the MGS starts to evict clients, and afterwards the whole system spirals down in a storm of requests. At least it didn't get better when I reset the switch, so it's not the switch alone. I only managed to fix it by switching off half of the nodes...

Any thoughts on this? How can I know if our switch (a Netgear GS748T) is up to the task? What type of backend do you recommend for a cluster of this size?

regards,
Arne

-- 
Arne Brutschy                  Ph.D. Student
Email arne.brutschy(AT)ulb.ac.be
IRIDIA CP 194/6                Web iridia.ulb.ac.be/~abrutschy
Universite Libre de Bruxelles  Tel +32 2 650 2273
Avenue Franklin Roosevelt 50   Fax +32 2 650 2715
1050 Bruxelles, Belgium        (Fax at IRIDIA secretary)
Arne Brutschy wrote:
> Hi all,
>
> we're using Lustre 1.8.3 on a gigabit network. We have 4 OSTs on 2 OSSes
> and a single MGS. We are serving the users' home directories (mostly small
> files) to 64 clients on this network. It has now happened for the third
> time that the cluster went down: either the OSS or the MGS blocks, and
> nobody can access the Lustre share anymore.
>
> Looking at the logs, I see lots of connectivity errors:
>
> LustreError: 17792:0:(mgs_handler.c:641:mgs_handle()) MGS handle cmd=250 rc=-16
> LustreError: 17792:0:(mgs_handler.c:641:mgs_handle()) Skipped 3 previous similar messages
> LustreError: 17792:0:(ldlm_lib.c:1892:target_send_reply_msg()) @@@ processing error (-16) req@f5ae642c x1344336331331741/t0 o250->6e1c6cb5-564f-49b0-a01e-e7e4605423a8@NET_0x200000affffc1_UUID:0/0 lens 368/264 e 0 to 0 dl 1288258643 ref 1 fl Interpret:/0/0 rc -16/0
> LustreError: 17792:0:(ldlm_lib.c:1892:target_send_reply_msg()) Skipped 3 previous similar messages
> Lustre: 2895:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request x1348047450260821 sent from lustre-MDT0000-mdc-f3aa2e00 to NID 0@lo 30s ago has timed out (30s prior to deadline).
> req@e66d1e00 x1348047450260821/t0 o38->lustre-MDT0000_UUID@10.1.1.1@tcp:12/10 lens 368/584 e 0 to 1 dl 1288258632 ref 1 fl Rpc:N/0/0 rc 0/0
> Lustre: 2895:0:(client.c:1463:ptlrpc_expire_one_request()) Skipped 19 previous similar messages
> Lustre: MGS: haven't heard from client 03b9cdae-66f1-552b-8c7c-94a9499c8dcf (at 10.255.255.220@tcp) in 228 seconds. I think it's dead, and I am evicting it.
> LustreError: 2893:0:(acceptor.c:455:lnet_acceptor()) Error -11 reading connection request from 10.255.255.199
> Lustre: 2936:0:(ldlm_lib.c:575:target_handle_reconnect()) MGS: fa602b20-b24c-bbcd-7003-b3b9bf702db4 reconnecting
> Lustre: 2936:0:(ldlm_lib.c:575:target_handle_reconnect()) Skipped 49 previous similar messages
> LustreError: 2888:0:(socklnd_cb.c:1707:ksocknal_recv_hello()) Error -104 reading HELLO from 10.255.255.199
> LustreError: 2888:0:(socklnd_cb.c:1707:ksocknal_recv_hello()) Skipped 2 previous similar messages
> LustreError: 11b-b: Connection to 10.255.255.199@tcp at host 10.255.255.199 on port 988 was reset: is it running a compatible version of Lustre and is 10.255.255.199@tcp one of its NIDs?
> Lustre: 20268:0:(ldlm_lib.c:804:target_handle_connect()) MGS: exp eb3ff200 already connecting
> Lustre: 17792:0:(ldlm_lib.c:875:target_handle_connect()) MGS: refuse reconnection from fa602b20-b24c-bbcd-7003-b3b9bf702db4@10.255.255.199@tcp to 0xeb3ff200; still busy with 2 active RPCs
>

It looks like the server threads are spending a long time processing requests. If you look at the client logs for 10.255.255.199, you will likely see that it thinks the server died and tried to "failover". When the server finally got around to processing the request, it noticed that the client was no longer there, since the client had given up on the server. The server, for sanity, won't allow the client to reconnect until the outstanding transactions have completed (so the question is why they are taking so long).

Are you seeing any "slow" messages on the servers? There are lots of reasons server threads could be slow. If /proc/sys/vm/zone_reclaim_mode is 1, try setting it to 0.
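For reference, a minimal sketch of checking and changing that setting on the servers (a hedged example, not from the original thread; the sysctl is not present on every kernel build):

```
# check the current value; 1 means zone reclaim is enabled
cat /proc/sys/vm/zone_reclaim_mode
# disable it for the running kernel
echo 0 > /proc/sys/vm/zone_reclaim_mode
# persist the change across reboots
echo "vm.zone_reclaim_mode = 0" >> /etc/sysctl.conf
```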
You might want to try the patch in Bug 23826, which I found useful for tracking how long the server thread spends processing the whole request, rather than just the IO phase.

> Lustre: 17792:0:(ldlm_lib.c:875:target_handle_connect()) Skipped 1 previous similar message
> Lustre: 2895:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request x1348047450261066 sent from lustre-MDT0000-mdc-f3aa2e00 to NID 0@lo 30s ago has timed out (30s prior to deadline).
> req@ce186400 x1348047450261066/t0 o38->lustre-MDT0000_UUID@10.1.1.1@tcp:12/10 lens 368/584 e 0 to 1 dl 1288259252 ref 1 fl Rpc:N/0/0 rc 0/0
> Lustre: 2895:0:(client.c:1463:ptlrpc_expire_one_request()) Skipped 19 previous similar messages
> Lustre: There was an unexpected network error while writing to 10.255.255.221: -110.
> Lustre: 20268:0:(ldlm_lib.c:804:target_handle_connect()) MGS: exp d8637600 already connecting
> Lustre: 19656:0:(ldlm_lib.c:875:target_handle_connect()) MGS: refuse reconnection from 7ee2fe58-3fab-c39a-8adb-c356d1bdc434@10.255.255.209@tcp to 0xd8637600; still busy with 1 active RPCs
> LustreError: 19656:0:(mgs_handler.c:641:mgs_handle()) MGS handle cmd=250 rc=-16
> LustreError: 19656:0:(mgs_handler.c:641:mgs_handle()) Skipped 3 previous similar messages
> LustreError: 19656:0:(ldlm_lib.c:1892:target_send_reply_msg()) @@@ processing error (-16) req@f5ae6c2c x1344329600678410/t0 o250->7ee2fe58-3fab-c39a-8adb-c356d1bdc434@NET_0x200000affffd1_UUID:0/0 lens 368/264 e 0 to 0 dl 1288259427 ref 1 fl Interpret:/0/0 rc -16/0
> LustreError: 19656:0:(ldlm_lib.c:1892:target_send_reply_msg()) Skipped 3 previous similar messages
> Lustre: There was an unexpected network error while writing to 10.255.255.209: -110.
> Lustre: 2935:0:(ldlm_lib.c:575:target_handle_reconnect()) MGS: fa602b20-b24c-bbcd-7003-b3b9bf702db4 reconnecting
> Lustre: 2935:0:(ldlm_lib.c:575:target_handle_reconnect()) Skipped 55 previous similar messages
> LustreError: 2890:0:(socklnd_cb.c:1707:ksocknal_recv_hello()) Error -11 reading HELLO from 10.255.255.191
> LustreError: 2890:0:(socklnd_cb.c:1707:ksocknal_recv_hello()) Skipped 1 previous similar message
> LustreError: 2893:0:(acceptor.c:455:lnet_acceptor()) Error -11 reading connection request from 10.255.255.220
> eth0: too many iterations (16) in nv_nic_irq.
>

Ok, I've seen this before. This is an interesting driver. Basically, under heavy load the packets are coming in faster than the interrupt handler can pull them off the RX queue. Rather than having the driver bail, you probably need something like this in modprobe.conf:

options forcedeth max_interrupt_work=100

But this message means the driver is dropping packets, not the switch. (Note that NAPI was developed for situations like this.) You probably also want to increase the tx/rx ring sizes on the driver, unless you've already done that.

> Lustre: MGS: haven't heard from client ed155959-9a71-653f-7d3b-0f4982ed020c (at 10.255.255.212@tcp) in 222 seconds. I think it's dead, and I am evicting it.
> Lustre: Skipped 3 previous similar messages
> Lustre: There was an unexpected network error while writing to 10.255.255.215: -110.
> Lustre: 2935:0:(ldlm_lib.c:875:target_handle_connect()) MGS: refuse reconnection from 4e3abc28-f42b-ad98-3021-5557328f80a7@10.255.255.213@tcp to 0xee70bc00; still busy with 1 active RPCs
> Lustre: 2895:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request x1348047450261319 sent from lustre-MDT0000-mdc-f3aa2e00 to NID 0@lo 30s ago has timed out (30s prior to deadline).
> req@ec7ffc00 x1348047450261319/t0 o38->lustre-MDT0000_UUID@10.1.1.1@tcp:12/10 lens 368/584 e 0 to 1 dl 1288259872 ref 1 fl Rpc:N/0/0 rc 0/0
> Lustre: 2895:0:(client.c:1463:ptlrpc_expire_one_request()) Skipped 19 previous similar messages
> Lustre: 20268:0:(ldlm_lib.c:575:target_handle_reconnect()) MGS: ed155959-9a71-653f-7d3b-0f4982ed020c reconnecting
> Lustre: 20268:0:(ldlm_lib.c:575:target_handle_reconnect()) Skipped 53 previous similar messages
> LustreError: 2889:0:(socklnd_cb.c:1707:ksocknal_recv_hello()) Error -11 reading HELLO from 10.255.255.217
> LustreError: 2889:0:(socklnd_cb.c:1707:ksocknal_recv_hello()) Skipped 1 previous similar message
> Lustre: MGS: haven't heard from client 5ccc9289-412d-1e96-a777-f5c3dbcc3899 (at 10.255.255.218@tcp) in 227 seconds. I think it's dead, and I am evicting it.
> Lustre: Skipped 3 previous similar messages
> Lustre: 2895:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request x1348047450261564 sent from lustre-MDT0000-mdc-f3aa2e00 to NID 0@lo 30s ago has timed out (30s prior to deadline).
> req@eb72ba00 x1348047450261564/t0 o38->lustre-MDT0000_UUID@10.1.1.1@tcp:12/10 lens 368/584 e 0 to 1 dl 1288260492 ref 1 fl Rpc:N/0/0 rc 0/0
> Lustre: 2895:0:(client.c:1463:ptlrpc_expire_one_request()) Skipped 19 previous similar messages
> LustreError: 14286:0:(mdc_locks.c:625:mdc_enqueue()) ldlm_cli_enqueue: -4
> LustreError: 14286:0:(file.c:3256:ll_inode_revalidate_fini()) failure -4 inode 38076417
> LustreError: 14286:0:(mdc_locks.c:625:mdc_enqueue()) ldlm_cli_enqueue: -4
> LustreError: 14286:0:(file.c:3256:ll_inode_revalidate_fini()) failure -4 inode 38076417
> LustreError: 21507:0:(mdc_locks.c:625:mdc_enqueue()) ldlm_cli_enqueue: -4
> LustreError: 21507:0:(file.c:3256:ll_inode_revalidate_fini()) failure -4 inode 38076417
> Lustre: 2935:0:(ldlm_lib.c:875:target_handle_connect()) MGS: refuse reconnection from ed155959-9a71-653f-7d3b-0f4982ed020c@10.255.255.212@tcp to 0xd1681e00; still busy with 1 active RPCs
> Lustre: 2935:0:(ldlm_lib.c:875:target_handle_connect()) Skipped 1 previous similar message
> LustreError: 2935:0:(mgs_handler.c:641:mgs_handle()) MGS handle cmd=250 rc=-16
> LustreError: 2935:0:(mgs_handler.c:641:mgs_handle()) Skipped 2 previous similar messages
> LustreError: 2935:0:(ldlm_lib.c:1892:target_send_reply_msg()) @@@ processing error (-16) req@f55c402c x1344329607901556/t0 o250->ed155959-9a71-653f-7d3b-0f4982ed020c@NET_0x200000affffd4_UUID:0/0 lens 368/264 e 0 to 0 dl 1288260660 ref 1 fl Interpret:/0/0 rc -16/0
> LustreError: 2935:0:(ldlm_lib.c:1892:target_send_reply_msg()) Skipped 2 previous similar messages
> LustreError: 21584:0:(mdc_locks.c:625:mdc_enqueue()) ldlm_cli_enqueue: -4
> LustreError: 21584:0:(file.c:3256:ll_inode_revalidate_fini()) failure -4 inode 38076417
>
> So this seems (I guess) to be an issue with our network, which appears
> under certain conditions (e.g. high load).
> Another hint pointing in this direction is the high number of RX errors
> I can see on the ports of the connecting switch, and the fact that some
> of the nodes are resetting their link. In normal operation, I don't see
> any of these behaviors.
>
> I suspect that something goes wrong while communicating over the
> network, the MGS starts to evict clients, and afterwards the whole system
> spirals down in a storm of requests. At least it didn't get better when I
> reset the switch, so it's not the switch alone. I only managed to fix
> it by switching off half of the nodes...
>
> Any thoughts on this? How can I know if our switch (a Netgear GS748T)
> is up to the task? What type of backend do you recommend for a cluster
> of this size?
>

Well, if you are having packet overrun issues, try enabling tx and rx flow control (PAUSE frames) on the nodes and the switch. Properly implemented (which is not always the case), it allows the receiver to pause the sender in order to catch up. But you should be able to tune things at 1GigE to run without pause frames.

> regards,
> Arne
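The ring-size and flow-control suggestions above can be sketched with ethtool (a hedged example, not from the original thread; the interface name eth0 is an assumption, and not every driver supports every option):

```
# query the current and maximum rx/tx ring sizes the driver supports
ethtool -g eth0
# raise the rings toward the hardware maximum, if the driver allows it
ethtool -G eth0 rx 1024 tx 1024

# query the pause-frame (flow control) state of the NIC
ethtool -a eth0
# enable rx/tx flow control; the switch ports must be configured to match
ethtool -A eth0 rx on tx on
```

Note these are runtime settings and are lost on reboot, so to persist them you would use module options (where the driver exposes them) or an init script.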
Hey, thanks for your reply!

> Are you seeing any "slow" messages on the servers? There are lots of
> reasons server threads could be slow. If /proc/sys/vm/zone_reclaim_mode
> is 1, try setting it to 0.

I don't understand what you mean by "slow" messages? The servers are basically idle... I don't have a /proc/sys/vm/zone_reclaim_mode (running the CentOS 5 kernel, version 2.6.18-164.11.1.el5_lustre.1.8.3).

> You might want to try the patch in Bug 23826 which I found useful in
> tracking how long the server thread was processing the request, rather
> than just the IO phase.

Ok, thanks, I will have a look into this as soon as I have time to download, patch and compile Lustre (we're using binary packages for now).

> Ok, I've seen this before. This is an interesting driver. Basically,
> under heavy load the packets are coming in as fast as the interrupt
> handler can pull them off the RX queue. Rather than having the driver
> bail, you probably need something like this in modprobe.conf:
> options forcedeth max_interrupt_work=100
> But this message means the driver is dropping packets, not the switch.
> (Note that NAPI was developed for situations like this).
>
> You probably also want to increase the tx/rx ring sizes on the driver,
> unless you've already done that.

I upgraded the driver on my 3 Lustre servers so that I can set the ring sizes (this wasn't possible before, see http://www.centos.org/modules/newbb/print.php?form=1&topic_id=20835&forum=40&order=ASC&start=0). I am now using the following options:

options forcedeth max_interrupt_work=100 rx_ring_size=1024 tx_ring_size=1024 rx_flow_control=1 tx_flow_control=1

Let's wait and hope. But I've also seen nodes disconnecting from the switch, and those nodes do not use the forcedeth driver. Under high load, the switch starts to show RX errors on *all* interfaces. When this happens, even the nodes with Broadcom NICs disconnect (Tigon3 BCM95780).
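One way to sanity-check that the upgraded driver actually accepts those module parameters and that the settings took effect (a hedged sketch; parameter names taken from the options line above, eth0 assumed):

```
# confirm the loaded forcedeth module exposes the options set in modprobe.conf
modinfo forcedeth | grep -E 'max_interrupt_work|ring_size|flow_control'
# after reloading the module, verify the rings and pause settings on the NIC
ethtool -g eth0
ethtool -a eth0
```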
Additionally, we recently upgraded our cluster (from Rocks 4.2.1 with Lustre 1.6.3 to Rocks 5.3 with Lustre 1.8.3), and I saw similar behavior on the old cluster. I guess it's the switch then, but I would like to be sure. At 2000 euros, swapping it on a hunch is an expensive test...

Thanks!
Arne