We've got the latest Lustre running (1.8.4) and kernel 2.6.18-194.3.1.el5. The machine in question is what I call our primary client, as it is what exposes the file system for others to use via NFS/Samba. Today the machine seemingly rebooted on its own, and checking the logs I see these messages:

Nov 22 12:25:52 cajal kernel: LustreError: 3909:0:(socklnd_cb.c:1714:ksocknal_recv_hello()) Skipped 8 previous similar messages
Nov 22 12:25:52 cajal kernel: LustreError: 11b-b: Connection to 192.168.5.101@tcp at host 192.168.5.101 on port 988 was reset: is it running a compatible version of Lustre and is 192.168.5.101@tcp one of its NIDs?
Nov 22 12:25:52 cajal kernel: LustreError: Skipped 8 previous similar messages
Nov 22 12:31:22 cajal kernel: LustreError: 5870:0:(llite_nfs.c:96:search_inode_for_lustre()) failure -2 inode 565846402
Nov 22 12:31:22 cajal kernel: LustreError: 5870:0:(llite_nfs.c:96:search_inode_for_lustre()) Skipped 490 previous similar messages
Nov 22 12:33:40 cajal mountd[5959]: /lustre/home and /home have same filehandle for 10.0.0.0/255.0.0.0, using first
Nov 22 12:36:31 cajal kernel: LustreError: 3908:0:(socklnd_cb.c:1714:ksocknal_recv_hello()) Error -104 reading HELLO from 192.168.5.101
Nov 22 12:36:31 cajal kernel: LustreError: 3908:0:(socklnd_cb.c:1714:ksocknal_recv_hello()) Skipped 9 previous similar messages
Nov 22 12:36:31 cajal kernel: LustreError: 11b-b: Connection to 192.168.5.101@tcp at host 192.168.5.101 on port 988 was reset: is it running a compatible version of Lustre and is 192.168.5.101@tcp one of its NIDs?
Nov 22 12:36:31 cajal kernel: LustreError: Skipped 9 previous similar messages
Nov 22 12:37:34 cajal mountd[5959]: authenticated mount request from 129.115.117.22:723 for /lustre/home/qyu926 (/lustre/home)
Nov 22 12:38:38 cajal mountd[5959]: /lustre/home and /home have same filehandle for 129.115.0.0/255.255.0.0, using first
Nov 22 12:40:20 cajal rpc.idmapd[3669]: nss_getpwnam: name '500' does not map into domain 'cbi.utsa.edu'
Nov 22 12:41:23 cajal kernel: LustreError: 5466:0:(llite_nfs.c:96:search_inode_for_lustre()) failure -2 inode 565846402
Nov 22 12:41:23 cajal kernel: LustreError: 5466:0:(llite_nfs.c:96:search_inode_for_lustre()) Skipped 503 previous similar messages

This is the last entry before the system reboots and the normal kernel boot messages begin.

This is what I see on 192.168.5.101:

Nov 22 12:25:22 data2 kernel: LustreError: 4726:0:(socklnd_cb.c:1714:ksocknal_recv_hello()) Error -104 reading HELLO from 129.115.117.8
Nov 22 12:25:22 data2 kernel: LustreError: 4726:0:(socklnd_cb.c:1714:ksocknal_recv_hello()) Skipped 8 previous similar messages
Nov 22 12:36:02 data2 kernel: LustreError: 4725:0:(socklnd_cb.c:1714:ksocknal_recv_hello()) Error -104 reading HELLO from 129.115.117.8
Nov 22 12:36:02 data2 kernel: LustreError: 4725:0:(socklnd_cb.c:1714:ksocknal_recv_hello()) Skipped 9 previous similar messages
Nov 22 12:43:39 data2 kernel: Lustre: 23762:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request x1351344462337868 sent from lustre-OST0002 to NID 129.115.117.8@tcp 7s ago has timed out (7s prior to deadline).
Nov 22 12:43:39 data2 kernel: req@ffff81004c42d800 x1351344462337868/t0 o104->@NET_0x2000081737508_UUID:15/16 lens 296/384 e 0 to 1 dl 1290451419 ref 1 fl Rpc:N/0/0 rc 0/0
Nov 22 12:43:39 data2 kernel: LustreError: 138-a: lustre-OST0002: A client on nid 129.115.117.8@tcp was evicted due to a lock blocking callback to 129.115.117.8@tcp timed out: rc -107
Nov 22 12:44:38 data2 kernel: Lustre: 23569:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request x1351344462337882 sent from lustre-OST0003 to NID 129.115.117.8@tcp 0s ago has failed due to network error (7s prior to deadline).
Nov 22 12:44:38 data2 kernel: req@ffff810111b38400 x1351344462337882/t0 o104->@NET_0x2000081737508_UUID:15/16 lens 296/384 e 0 to 1 dl 1290451485 ref 1 fl Rpc:N/0/0 rc 0/0
Nov 22 12:44:38 data2 kernel: LustreError: 138-a: lustre-OST0003: A client on nid 129.115.117.8@tcp was evicted due to a lock blocking callback to 129.115.117.8@tcp timed out: rc -107

What's going on?

Thanks
David

-- 
Personally, I liked the university. They gave us money and facilities, we didn't have to produce anything! You've never been out of college! You don't know what it's like out there! I've worked in the private sector. They expect results. -Ray Ghostbusters

Hello!

On Nov 22, 2010, at 2:04 PM, David Noriega wrote:

> We've got the latest lustre running(1.8.4) and kernel
> 2.6.18-194.3.1.el5. I call it our primary client as it is what exposes
> the file system for others to use via nfs/samba. Today the machine
> seemingly rebooted on its own and checking the logs I see these
> messages

Do you have automatic reboot on panic set? If so, that means you just ran into a BUG() or LBUG situation. If you have some sort of serial console set up, you should be able to see there what it was. If you do not, there is now no way to find out, but please consider setting one up for the future.

> This is what I see on 192.168.5.101
> Nov 22 12:25:22 data2 kernel: LustreError:
> 4726:0:(socklnd_cb.c:1714:ksocknal_recv_hello()) Error -104 reading
> HELLO from 129.115.117.8
> Nov 22 12:25:22 data2 kernel: LustreError:
> 4726:0:(socklnd_cb.c:1714:ksocknal_recv_hello()) Skipped 8 previous
> similar messages
> Nov 22 12:36:02 data2 kernel: LustreError:
> 4725:0:(socklnd_cb.c:1714:ksocknal_recv_hello()) Error -104 reading
> HELLO from 129.115.117.8
> Nov 22 12:36:02 data2 kernel: LustreError:
> 4725:0:(socklnd_cb.c:1714:ksocknal_recv_hello()) Skipped 9 previous
> similar messages

So you are getting connection resets from this node for some reason. Was it the one that rebooted?

> Nov 22 12:43:39 data2 kernel: Lustre:
> 23762:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request
> x1351344462337868 sent from lustre-OST0002 to NID 129.115.117.8@tcp 7s
> ago has timed out (7s prior to deadline).
> Nov 22 12:43:39 data2 kernel: req@ffff81004c42d800
> x1351344462337868/t0 o104->@NET_0x2000081737508_UUID:15/16 lens
> 296/384 e 0 to 1 dl 1290451419 ref 1 fl Rpc:N/0/0 rc 0/0

The attempt to send a blocking callback to 129.115.117.8 failed.

> Nov 22 12:43:39 data2 kernel: LustreError: 138-a: lustre-OST0002: A
> client on nid 129.115.117.8@tcp was evicted due to a lock blocking
> callback to 129.115.117.8@tcp timed out: rc -107
> Nov 22 12:44:38 data2 kernel: Lustre:
> 23569:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request
> x1351344462337882 sent from lustre-OST0003 to NID 129.115.117.8@tcp 0s
> ago has failed due to network error (7s prior to deadline).
> Nov 22 12:44:38 data2 kernel: req@ffff810111b38400
> x1351344462337882/t0 o104->@NET_0x2000081737508_UUID:15/16 lens
> 296/384 e 0 to 1 dl 1290451485 ref 1 fl Rpc:N/0/0 rc 0/0
> Nov 22 12:44:38 data2 kernel: LustreError: 138-a: lustre-OST0003: A
> client on nid 129.115.117.8@tcp was evicted due to a lock blocking
> callback to 129.115.117.8@tcp timed out: rc -107

We are evicting this client (129.115.117.8) because we cannot deliver ldlm ASTs to it, and we assume it is dead or in some wedged state.

Bye,
Oleg
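
For anyone reading these logs later: the numeric codes in the messages above (-2, -104, -107) appear to be negated Linux errno values, which is consistent with the "connection reset" reading given in the reply. A minimal sketch for decoding them, assuming standard Linux errno numbering on both nodes; `decode_codes`, `LINUX_ERRNO` and `CODE_RE` are hypothetical helpers written for illustration and are not part of Lustre:

```python
import re

# Linux errno values for the three codes that appear in this thread.
# Assumption: the nodes use standard Linux errno numbering.
LINUX_ERRNO = {
    2: ("ENOENT", "No such file or directory"),
    104: ("ECONNRESET", "Connection reset by peer"),
    107: ("ENOTCONN", "Transport endpoint is not connected"),
}

# Matches the three spellings seen in the logs:
# "Error -104", "failure -2" and "rc -107".
CODE_RE = re.compile(r"\b(?:Error|failure|rc) -(\d+)\b")


def decode_codes(line):
    """Return (number, name, description) for each error code in a log line."""
    found = []
    for match in CODE_RE.finditer(line):
        num = int(match.group(1))
        name, desc = LINUX_ERRNO.get(num, ("UNKNOWN", "not in table"))
        found.append((num, name, desc))
    return found


if __name__ == "__main__":
    sample = (
        "LustreError: 3908:0:(socklnd_cb.c:1714:ksocknal_recv_hello()) "
        "Error -104 reading HELLO from 192.168.5.101"
    )
    print(decode_codes(sample))
```

Read this way, the HELLO failures are TCP connection resets (ECONNRESET), the eviction messages report an endpoint that is no longer connected (ENOTCONN), and the llite_nfs messages report an inode lookup that found nothing (ENOENT).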