Oral, H. Sarp
2007-Mar-22 12:39 UTC
[Lustre-discuss] 1.4.9. client errors of unknown source
Hello,

We got the following error messages on our 1.4.9 Lustre clients. Our Lustre servers are also running 1.4.9. The first two incidents happened at nearly the same time, while both clients were busy running two IOR jobs each. The last incident happened while the client was idle and not running any jobs. These errors did not produce any other bug messages or log entries on the clients.

At this point we are not sure whether this is a Lustre problem or something else, but the RIP line ({:ksocklnd:ksocknal_process_transmit+969}) makes us think it might be a Lustre problem. Has anyone seen something like this?

Mar 19 18:38:52 pinto0002-admin kernel: Unable to handle kernel paging request at 0000000000100108 RIP:
Mar 19 18:38:52 pinto0002-admin kernel: <7>Losing some ticks... checking if CPU frequency changed.
Mar 19 18:38:52 pinto0002-admin kernel: Oops: 0002 [1] SMP
Mar 19 18:38:52 pinto0002-admin kernel: Oops: 0002 [1] SMP
Mar 19 18:38:52 pinto0002-admin kernel: RIP <ffffffffa016be29>{:ksocklnd:ksocknal_process_transmit+969} RSP <00000100de20be58>
Mar 19 18:38:52 pinto0002-admin kernel: CR2: 0000000000100108
Mar 19 18:38:52 pinto0002-admin kernel: CR2: 0000000000100108

Mar 19 18:38:26 pinto0009-admin kernel: Unable to handle kernel paging request at 0000000000100108 RIP:
Mar 19 18:38:26 pinto0009-admin kernel: <7>Losing some ticks... checking if CPU frequency changed.
Mar 19 18:38:26 pinto0009-admin kernel: <ffffffffa0238e29>{:ksocklnd:ksocknal_process_transmit+969}
Mar 19 18:38:26 pinto0009-admin kernel: Oops: 0002 [1] SMP
Mar 19 18:38:26 pinto0009-admin kernel: Oops: 0002 [1] SMP
Mar 19 18:38:26 pinto0009-admin kernel: RIP <ffffffffa0238e29>{:ksocklnd:ksocknal_process_transmit+969} RSP <00000100c2647e58>
Mar 19 18:38:26 pinto0009-admin kernel: CR2: 0000000000100108
Mar 19 18:38:26 pinto0009-admin kernel: CR2: 0000000000100108

Mar 20 16:22:40 pinto0060-admin kernel: Unable to handle kernel paging request at 0000000000100108 RIP:
Mar 20 16:22:40 pinto0060-admin kernel: <7>Losing some ticks... checking if CPU frequency changed.
Mar 20 16:22:40 pinto0060-admin kernel: Oops: 0002 [1] SMP
Mar 20 16:22:40 pinto0060-admin kernel: Oops: 0002 [1] SMP
Mar 20 16:22:40 pinto0060-admin kernel: RIP <ffffffffa0238e29>{:ksocklnd:ksocknal_process_transmit+969} RSP <00000100c1c83e58>
Mar 20 16:22:40 pinto0060-admin kernel: CR2: 0000000000100108
Mar 20 16:22:40 pinto0060-admin kernel: CR2: 0000000000100108

PS: The same hardware was running without any problems before these errors. After a reboot the clients are running fine again, and no hardware configuration changes have been made on them.

Thanks,
Sarp

--------------------
Sarp Oral, Ph.D.
National Center for Computational Sciences (NCCS)
Oak Ridge National Lab, Oak Ridge, Tennessee 37831
865-574-2173, oralhs@ornl.gov
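For anyone comparing the three incidents side by side, here is a minimal sketch that pulls the host, faulting symbol, and CR2 address out of syslog lines like the ones above. The regular expressions, the `summarize` function, and the `SAMPLE` list are illustrative names, not part of any Lustre tool, and the patterns are only assumed to fit the message layout shown in this post.

```python
import re
from collections import OrderedDict

# Matches "Mar 19 18:38:52 pinto0002-admin kernel: <message>"
LINE_RE = re.compile(r"^(\w{3} +\d+ [\d:]{8}) (\S+) kernel: (.*)$")
# Matches "RIP <address>{symbol}" but not the bare "RIP:" suffix
RIP_RE = re.compile(r"RIP <([0-9a-f]+)>\{([^}]*)\}")
CR2_RE = re.compile(r"CR2: ([0-9a-f]+)")

# A few lines copied from the post above (hypothetical input for the sketch)
SAMPLE = [
    "Mar 19 18:38:52 pinto0002-admin kernel: Oops: 0002 [1] SMP",
    "Mar 19 18:38:52 pinto0002-admin kernel: RIP <ffffffffa016be29>"
    "{:ksocklnd:ksocknal_process_transmit+969} RSP <00000100de20be58>",
    "Mar 19 18:38:52 pinto0002-admin kernel: CR2: 0000000000100108",
    "Mar 20 16:22:40 pinto0060-admin kernel: RIP <ffffffffa0238e29>"
    "{:ksocklnd:ksocknal_process_transmit+969} RSP <00000100c1c83e58>",
    "Mar 20 16:22:40 pinto0060-admin kernel: CR2: 0000000000100108",
]

def summarize(lines):
    """Collect one record per host; repeated lines overwrite the same fields."""
    hosts = OrderedDict()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip anything that is not a kernel syslog line
        stamp, host, msg = m.groups()
        info = hosts.setdefault(
            host, {"time": stamp, "rip": None, "symbol": None, "cr2": None}
        )
        rip = RIP_RE.search(msg)
        if rip:
            info["rip"], info["symbol"] = rip.groups()
        cr2 = CR2_RE.search(msg)
        if cr2:
            info["cr2"] = cr2.group(1)
    return hosts

for host, info in summarize(SAMPLE).items():
    print(host, info["time"], info["symbol"], info["cr2"])
```

Run against the full excerpt, this should make it easy to see that all three clients report the same symbol offset and the same CR2 value, even though the module load address differs between nodes.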
Eric Barton
2007-Mar-22
[Lustre-discuss] 1.4.9. client errors of unknown source
Sarp, Shane,

Is the primary problem an access violation? The "Losing some ticks" messages seem secondary (and we've actually seen that printk with interrupts disabled can be a cause of this).

We really need to determine the source code line to work out what has screwed up here. A stacktrace helps, but kernel core dumps are even better. Is it possible to arrange that?

Can you file a Lustre bug with all this info?

Cheers,
Eric

Eric Barton
Barton Software
9 York Gardens
Clifton
Bristol, BS8 4LL
United Kingdom

Tel: +44 (117) 330 1575
Mobile: +44 (7909) 680 356
Fax: Call to arrange
Email: eeb@bartonsoftware.com
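On the access-violation question, the "Oops: 0002" and CR2 values in the original post can be decoded mechanically. The sketch below assumes the standard x86 page-fault error-code bit layout; `decode_oops` is an illustrative helper name, and the `LIST_POISON1` value is taken from mainline 2.6 kernels, which may not match the kernel actually running on these clients.

```python
# x86 page-fault error code bits, as printed (in hex) after "Oops:"
PF_PROT = 0x1   # 0: page not present, 1: protection violation
PF_WRITE = 0x2  # 0: read access,      1: write access
PF_USER = 0x4   # 0: kernel mode,      1: user mode

# Assumption: list poison value as defined in mainline 2.6 kernels
LIST_POISON1 = 0x00100100

def decode_oops(code_hex, cr2_hex):
    """Decode an Oops error code and report how far CR2 is from LIST_POISON1."""
    code = int(code_hex, 16)
    addr = int(cr2_hex, 16)
    return {
        "cause": "protection violation" if code & PF_PROT else "page not present",
        "access": "write" if code & PF_WRITE else "read",
        "mode": "user" if code & PF_USER else "kernel",
        "address": addr,
        "offset_from_list_poison1": addr - LIST_POISON1,
    }

# Values copied from the logs in the original post
result = decode_oops("0002", "0000000000100108")
for key, value in result.items():
    print(key, value)
```

Under those assumptions the fault is a kernel-mode write to a page that is not present, at an address 8 bytes past the list poison value. On x86_64 that offset would correspond to the `prev` pointer of a `struct list_head`, so one possible reading is a second `list_del` on an entry that was already removed from its list. That is only an inference from the numbers and is not confirmed anywhere in this thread; a core dump would be needed to check it.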
Oral, H. Sarp
2007-Mar-26 08:39 UTC
[Lustre-discuss] 1.4.9. client errors of unknown source
Eric,

We are in the process of getting netdump working on these client nodes. The RT entry for this issue is [rt.clusterfs.com #28583]. I will file a Lustre bug for this issue if CFS hasn't already done so.

Sarp
Oral, H. Sarp
2007-Mar-26 09:10 UTC
[Lustre-discuss] 1.4.9. client errors of unknown source
Eric,

Bugzilla #: 12016

Sarp