nic@cray.com
2006-Dec-15 09:43 UTC
[Lustre-devel] [Bug 11394] OSS loses its mind, spits out error messages with garbage data.
Please don't reply to lustre-devel. Instead, comment in Bugzilla by using the following link:
https://bugzilla.lustre.org/show_bug.cgi?id=11394

nic@cray.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |eeb@clusterfs.com
           Priority|P3                          |P2

We have had 2 new hits on the production XT3 at ORNL (jaguar) of this bug. This is now a hot one!

This always seems to occur after some network issues -- it almost seems like we are using a buffer with bogus data in it, like we are ignoring (or not seeing) a bad return value from a network receive.

I'll upload more logs -- but we need to get some movement on this soon.
pbojanic@clusterfs.com
2007-Jan-08 07:58 UTC
[Lustre-devel] [Bug 11394] OSS loses its mind, spits out error messages with garbage data.
Please don't reply to lustre-devel. Instead, comment in Bugzilla by using the following link:
https://bugzilla.lustre.org/show_bug.cgi?id=11394

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |green@clusterfs.com
  Status Whiteboard|2006-12-21: CFS consider the|2007-01-08: Likely not an
                   |corruption has occurred     |LNET issue, but may be
                   |prior to reaching LNET      |Lustre or Cray Portals; CFS
                   |                            |to determine next steps for
                   |                            |debugging

Eric Barton advises that he really doesn't think this is an LNET issue. He is not sure whether it's Lustre or Cray Portals related. He suggests trying the normal use-after-free debugging techniques (e.g. POISON) first.

I've asked Oleg to discuss with mjmac to see if we can move this forward ourselves, or if we need further help from Cray.
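
For reference, the poison-style use-after-free debugging suggested above generally works by overwriting memory with a recognizable byte pattern just before it is freed, so that a stale pointer later yields obviously bogus, repeatable values instead of plausible-looking garbage. The following is only a minimal user-space sketch in C of that general idea. The names `POISON_BYTE`, `poison_buf`, `poison_free`, and `looks_poisoned`, and the value 0x5a, are hypothetical and chosen for illustration; this is not the Lustre, LNET, or Cray Portals code.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical poison value, chosen only for this sketch. */
#define POISON_BYTE 0x5a

/* Fill a buffer with the poison pattern. The volatile pointer keeps the
 * compiler from discarding the stores as dead writes before free(). */
static void poison_buf(void *buf, size_t len)
{
    volatile unsigned char *p = buf;
    size_t i;

    for (i = 0; i < len; i++)
        p[i] = POISON_BYTE;
}

/* Poison a heap buffer, then release it. The caller must pass the
 * allocation size, since plain free() does not know it. */
static void poison_free(void *buf, size_t len)
{
    if (buf == NULL)
        return;
    poison_buf(buf, len);
    free(buf);
}

/* Return 1 if every byte of the buffer matches the poison pattern,
 * 0 otherwise. A message handler could call this on incoming data
 * that looks suspicious before logging or acting on it. */
static int looks_poisoned(const void *buf, size_t len)
{
    const unsigned char *p = buf;
    size_t i;

    for (i = 0; i < len; i++) {
        if (p[i] != POISON_BYTE)
            return 0;
    }
    return 1;
}
```

If error messages start printing values made up entirely of the poison byte, that points toward a buffer being read after it was freed rather than toward corruption on the wire.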