Correction below: the stack dumps (once a minute or so) started at 12:40. I rebooted client1 at 13:13. Sorry for the confusion.

Rob

-----Original Message-----
From: Hendelman, Rob
Sent: Tuesday, October 06, 2009 8:15 AM
To: 'lustre-discuss at lists.lustre.org'
Subject: Soft CPU Lockup

Hello Mr. Drokin,

Thank you for your prior response. There was a client eviction just prior to the threads hanging and eating 100% CPU, but NOT prior to the OSS finally dropping CPU usage again.

Here is a basic timeline (in hours:min "military" time):

09:07->12:39: Client "6", which was cloned from "client1", is being worked on, rebooted, and connected/disconnected from the Lustre servers. No issues.
12:39: Final OSS message that says "haven't heard from <ip of client66> in 240 seconds, I think it's dead and I'm evicting it".
12:40: What appear to be stack dumps on the OSS server for 2 I/O threads (previously mentioned).
12:44: client1 has lost its Lustre mounts and is complaining in Nagios. All other clients are fine.
13:13: "Stack dumps" once a minute or so, but no LBUG. I leave the server up and finally reboot client1. The other clients (2-5) are not affected. All other clients seem to be working normally, so I don't touch the OSS.
14:10: Final messages on the OSS before it calms down (no messages after this):

Oct 5 14:10:56 maglustre04 kernel:
Oct 5 14:10:59 maglustre04 kernel: Lustre: 13366:0:(service.c:1317:ptlrpc_server_handle_request()) @@@ Request x6413848 took longer than estimated (100+5495s); client may timeout.
  req@ffff81009308c400 x6413848/t0 o101->1b9e4991-1d5e-814d-2607-8c52f432e68d@:0/0 lens 232/288 e 0 to 0 dl 1254764364 ref 1 fl Complete:/0/0 rc 301/301
Oct 5 14:10:59 maglustre04 kernel: Lustre: 13421:0:(watchdog.c:330:lcw_update_time()) Expired watchdog for pid 13421 disabled after 5595.8041s
Oct 5 14:10:59 maglustre04 kernel: Lustre: 13366:0:(service.c:1317:ptlrpc_server_handle_request()) Skipped 1 previous similar message
Oct 5 14:10:59 maglustre04 kernel: Lustre: 13366:0:(watchdog.c:330:lcw_update_time()) Expired watchdog for pid 13366 disabled after 5595.8059s

Should I file a new bug? Is there enough info in /var/log/messages to file a bug, or do I need to turn on some sort of more verbose debugging in case this happens again?

Thanks,
Robert
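
As a cross-check of the timeline against the log excerpt, the watchdog durations can be subtracted from the syslog timestamps to estimate when each I/O thread got stuck. The sketch below is only illustrative: the function name `stall_start` and the regular expression are made up for this example, the year is assumed to be 2009 because syslog lines do not carry one, and it assumes one log entry per line in the format quoted above.

```python
import re
from datetime import datetime, timedelta

# Matches lines such as:
#   Oct 5 14:10:59 host kernel: Lustre: ... Expired watchdog for pid 13421 disabled after 5595.8041s
# (hypothetical pattern written for the excerpt above, not taken from Lustre itself)
WATCHDOG_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+kernel:.*"
    r"Expired watchdog for pid (?P<pid>\d+) disabled after (?P<secs>\d+(?:\.\d+)?)s"
)


def stall_start(line, year=2009):
    """Return (pid, estimated start of the stall) or None if the line does not match."""
    m = WATCHDOG_RE.search(line)
    if m is None:
        return None
    # syslog timestamps have no year, so one is supplied by the caller
    ended = datetime.strptime(f"{year} {m.group('stamp')}", "%Y %b %d %H:%M:%S")
    started = ended - timedelta(seconds=float(m.group("secs")))
    return int(m.group("pid")), started


if __name__ == "__main__":
    sample = (
        "Oct 5 14:10:59 maglustre04 kernel: Lustre: 13421:0:"
        "(watchdog.c:330:lcw_update_time()) Expired watchdog for pid 13421 "
        "disabled after 5595.8041s"
    )
    pid, started = stall_start(sample)
    print(pid, started.strftime("%H:%M:%S"))
```

For the pid 13421 entry this works out to roughly 12:37, a couple of minutes before the 12:39 eviction message and the 12:40 stack dumps in the timeline.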