Peter Bojanic
2006-Nov-15 22:19 UTC
[Lustre-discuss] Lustre 1.6.0/Unicos integration issues
Nathan,

The following bugs were reported by Cray when they attempted to integrate Lustre 1.6.0 beta6 with their environment. Could these problems have been found by integrating with a stock SLES 9 distribution first? Is anyone running Lustre 1.6.0 with SLES 9 yet?

11153 unexpected errors seen during evict by nid
11147 more sanity test failures in 1.6 beta
11143 sanity test fails in fcntl test
11138 sanity fails early in 1.6 beta
11134 liblustre clients won't connect to servers
11133 old mount syntax does not work
11120 bad ldiskfs build suse in 1.6 beta
11114 bad patch in 1.6 beta
11102 no zero-copy TCP in new 1.6 beta
11093 build failure in 1.6 beta
11091 build question about latest 1.6 beta
10809 liblustre sanity test fails

Please give a run-down of how these problems could have been prevented in the first place. Also, how could the liblustre issues have been identified prior to running on Catamount?

Thanks,
Bojanic
Nathaniel Rutman
2006-Nov-16 11:02 UTC
[Lustre-discuss] Re: Lustre 1.6.0/Unicos integration issues
Peter Bojanic wrote:
> Nathan,
>
> The following bugs were reported by Cray when they attempted to
> integrate Lustre 1.6.0 beta6 with their environment. Could these
> problems have been found by integrating with a stock SLES 9
> distribution, first? Is anyone running Lustre 1.6.0 with SLES 9 yet?

There were 2 Suse-specific issues and at least 1 Catamount-specific issue. 1 issue was well-known ahead of time. 2 are intended behavior. I think 4 issues (11138, 11120, 11114, 11091) would have been caught by Suse Liblustre testing. 2 more issues (11093, 11147 maybe) by doing our own Catamount testing.

> 11153 unexpected errors seen during evict by nid

I don't think this is an error yet - still looking. We don't seem to actually have a test for evict by nid though; a regression test for this should be added to our test suite.

> 11147 more sanity test failures in 1.6 beta

There are a few issues in this bug. 10809 is dealt with below; the LNET issue I don't see on my x86, and I am not sure whether SLES9 testing would show it.

> 11143 sanity test fails in fcntl test

Previously known as 10842, still open. I have been very vocal about this bug for a while now (before Cray testing).

> 11138 sanity fails early in 1.6 beta

Previously known and fixed as 10999; the fix didn't make it into the beta.

> 11134 liblustre clients won't connect to servers

Cross-version issue that I never thought to check. We need to add a "cross-version liblustre check" to our major release process.

> 11133 old mount syntax does not work

Expected behavior - clarified documentation.

> 11120 bad ldiskfs build suse in 1.6 beta

This would have been found with a SUSE build - malformed patch from an update from b1_4.

> 11114 bad patch in 1.6 beta

Would have been found with SUSE testing.

> 11102 no zero-copy TCP in new 1.6 beta

Intentional.

> 11093 build failure in 1.6 beta

This would only have been found somewhere where HAVE_LIBPTHREAD isn't defined, which afaik is only Catamount. Might have been found in a very careful code review.

> 11091 build question about latest 1.6 beta

This broke many builds and should have been caught by our current testing. (Unreviewed change to build that should never have been signed in.) I think this was just unlucky timing for Cray.

> 10809 liblustre sanity test fails

I dropped the ball on this one - the test 55 failure was masked by the earlier test 21 failure (10842), and I never tried to run the remaining tests. I have now started skipping test21 to get the remaining coverage. I should have done this as soon as it became apparent that 10842 would take a while to fix.

> Please give a run-down of how these problems could have been prevented
> in the first place. Also, how could the liblustre issues have been
> identified prior to running on Catamount?
>
> Thanks,
> Bojanic
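
The masking problem described for 10809 (an early known failure hiding a later, unrelated one) can be illustrated with a small sketch. This is not the Lustre sanity script or its interface; the runner functions, the test names, and the pass/fail stand-ins below are all hypothetical and only model the behavior described in the reply: a run that stops at the first failure never reaches test 55, while a run that skips the known-bad test 21 does.

```python
# Hypothetical sketch: function names, test names, and results are
# illustrative only, not the real liblustre sanity test interface.

def run_until_first_failure(tests):
    """Stop at the first failing test, which hides any later failures."""
    results = {}
    for name, test in tests.items():
        ok = test()
        results[name] = "PASS" if ok else "FAIL"
        if not ok:
            break
    return results


def run_suite(tests, skip=()):
    """Run every test not listed in `skip`, continuing past failures.

    `tests` maps a test name to a zero-argument callable returning True
    on success. Returns a dict of name -> "PASS", "FAIL", or "SKIP".
    """
    results = {}
    for name, test in tests.items():
        if name in skip:
            results[name] = "SKIP"
            continue
        try:
            ok = test()
        except Exception:
            # Treat a crashing test as a failure rather than aborting the run.
            ok = False
        results[name] = "PASS" if ok else "FAIL"
    return results


# Stand-ins for the situation in the thread: test 21 fails because of a
# known open bug, and test 55 fails for an unrelated reason.
tests = {
    "test_20": lambda: True,
    "test_21": lambda: False,  # known failure, tracked separately
    "test_55": lambda: False,  # the failure that was being masked
    "test_56": lambda: True,
}

masked = run_until_first_failure(tests)       # never reaches test_55
covered = run_suite(tests, skip={"test_21"})  # reports test_55 as FAIL
```

Recording the skipped test as "SKIP" rather than dropping it keeps the known failure visible in the results, so it is not forgotten while the remaining tests regain coverage.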