Peter Bojanic
2006-Nov-15 22:18 UTC
[Lustre-discuss] Lustre.org Community Forum - 2006-11-13
Thanks to all the Lustre users who joined us for this meeting in Tampa, Florida. Please find attached slides from the discussion. I've also attached Jody McIntyre's notes. I do a lot of talking in these meetings, but I expect that to change in the future with more participation by Lustre community members and other CFS engineers. If you have any suggested topics for future Lustre.org meetings, please let me know.

Thanks also to Brent Gorda, LLNL, for hosting an interesting and informative Lustre BOF on Tuesday morning. It was well worth getting up early to attend ;)

Cheers,
Peter

P.S. I'm fairly certain that "blue shirt guy, front" is Brent Gorda, LLNL. If anyone else cares to self-identify based on their shirt colour and dialog, please feel free.

Lustre.org Community Forum, 2006-11-13
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

"Open Source Strategy"

pbojanic: Any thoughts on what is needed to support this type of development?
blue golf shirt guy: Open bug tracking, so users can tell if anyone else has seen a bug before.
kevinc: Customers would like the ability to comment on design as we go along.
pbojanic: hears about private comments (and the fact that we use them) all the time.
white golf shirt guy: It would be nice to know what other people are thinking about regarding needed features.
Adam Boggs: This will prevent duplicated effort.
Evan Felix: We would like this to occur on a public list, rather than on an internal list with a feature that shows up at the end sometime.
Adam Boggs: Having seen a lot of the code, one of the difficulties is the complexity and intricacy of the codebase, which implies a high level of commitment to making contributions and means a lot of testing is required - there should be some sort of testing to assure the quality of outside contributions.

--

"Release Model"

plaid shirt guy in front right: How are you going to do testing without hardware?
pbojanic: Customers will help with testing.
plaid shirt guy in front right: Is this what customers want?
blue golf shirt guy: Yes - running a test cluster is part of having a reliable production system.
Lee Ward: How do you get customers to sign up for testing and make sure they do this?
pbojanic: Our current testing infrastructure is woefully inadequate - deploying ltest externally is practically impossible. We are re-engineering a test environment. We will share our requirements and get input, aiming for a unified test framework that can run on developers' workstations, the CFS test clusters, and on customers' systems for system-wide and system-scale testing.
blue shirt guy, front: Can you walk us through a release? How much testing does it get?
pbojanic: What we do today is not what we want to be doing in the future. 1.6.0 is where we've drawn the line. It hasn't been released yet because we want it to be tested differently: it has to have been tested at LLNL, on an XT3, etc. We haven't been able to coordinate this yet. Today: a series of regression tests on our test cluster.
blue shirt guy, front: What model are we going to? What if Sandia or LLNL do not have time to test a release?
pbojanic: Yes, the release will be held up. We need to test it before putting out a major release.
blue shirt guy, front: That's an interesting dependency.
Lee Ward: I think this is a mistake - because customers cannot do this on your schedule.
Adam Boggs: The challenge will be if things blow up; tracking down the problem could be a nightmare for the customer.
pink shirt guy, front row: What are the key issues for testing?
pbojanic: The key issue is the hardware. We are willing to allocate the resources to do the testing - it's a lot easier to hire a US citizen who can log in to an XT3 than it is for us to obtain access to an XT3.
Lee Ward: Allow your customers to "sign" a release, to give confidence to others.
pbojanic: Good suggestion - perhaps if LLNL doesn't have time but 3 major partners have signed it, this might be OK for most.
Adam Boggs: Having it be web-observable would be useful.
white golf shirt guy: It would be nice to know what tests have passed/failed - visible on the web.

--

"Test System"

Lee Ward: This policy is counterproductive given your other slides - it is antagonistic, since you will not open your test system.
pbojanic: The tests themselves will be completely open.
Lee Ward: If the test framework is not open, the bar is much higher for anyone who wants to do testing but does not have access.
pbojanic: Not sure how to address this point...
Lee Ward: You're shooting yourself in the foot - your value isn't in QA. If I can't easily run everyone's tests, I will only run mine.
pbojanic: If the tool for collecting results were not open, would that be OK?
Lee Ward: ...maybe.
very pink shirt guy: (I missed his point)
pbojanic: Thanks for the input.
Adam Boggs: Will the tests be packaged with the Lustre source tree? Will they have the same development model?
pbojanic: Yes.
Adam Boggs: Good - that will eliminate the need to match test versions against Lustre versions.
LLNL guy next to Evan: Will performance be part of the test suite?
pbojanic: Yes.

--

"Benchmarking Web Site"

Adam Boggs: It could be useful to have some statistical confidence with benchmarks - is there dramatic variation across repeated runs of the same benchmark? (See the sketch after these notes.)
blue golf shirt guy: This would be helpful for us to know if we're getting reasonable results from our hardware.
Adam Boggs: Potential bragging rights.
Lee Ward: Will you winnow the results?
kevinc: Unlikely.
Lee Ward: Your competitors also have bugs - if you're going to put up graphs, they're not reasonable if you throw out the ones you don't like.
Adam Boggs: Statistical confidence may help with this issue.
kevinc: In general, people want to see how well things run.
Lee Ward: There are two ways to use this system - for marketing, or for a potential customer to see how Lustre will perform on their configuration.
pbojanic: We aim to generate a library, perhaps with commentary on negative results.
Lee Ward: This will just generate work for you. I have been generating 20 graphs a month, just from Sandia. Would you be comfortable with this?
pbojanic: Yes, we are. If there are bad points, we want to know why.

--

"CFS Strategic Priorities"

pbojanic: (described zerocopy fixes)
pink t-shirt guy: Which upstream kernel version introduced this bug?
pbojanic: Sometime in 2.4; well documented in bug 10089.
Lee Ward: Does CFS have multiple teams?
pbojanic: ...
Lee Ward: How many people do these teams represent?
pbojanic: ~60 engineers, including QA.

--

(conclusion)

Adam Boggs: What timeframe are you looking at for these changes?
pbojanic: ASAP, based on sysadmins; however, this will be a bit of a culture shift internally for us. No timeframe on the SVN repository.
pink shirt guy behind me: Will you still maintain the kernel patches if metadata performance remains faster on patched systems?
pbojanic: Unknown. We are quite confident we will be able to close the gap.
SFS guy in the back: To the people in the room: please contribute success stories, as well as problems, openly.
Lee Ward: Well, it _runs_ on a 10,000 node cluster... The fact that the national labs run Lustre and not something else says a lot.
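(Editorial sketch, not part of the original notes.) To illustrate the statistical-confidence point from the "Benchmarking Web Site" discussion, here is a minimal Python example of how repeated runs of one benchmark could be summarized as a mean with a rough 95% confidence interval. The run values below are made up for illustration only and are not real Lustre measurements or a CFS tool.

    import math
    import statistics

    # Hypothetical aggregate write bandwidth (MB/s) from eight repeated
    # runs of the same benchmark on the same configuration.
    runs = [412.0, 398.5, 420.3, 405.1, 415.8, 401.2, 409.9, 418.4]

    mean = statistics.mean(runs)
    stdev = statistics.stdev(runs)             # sample standard deviation
    stderr = stdev / math.sqrt(len(runs))      # standard error of the mean
    ci95 = 1.96 * stderr                       # ~95% interval, normal approximation

    print("runs  : %d" % len(runs))
    print("mean  : %.1f MB/s" % mean)
    print("95%% CI: +/- %.1f MB/s (%.1f .. %.1f MB/s)"
          % (ci95, mean - ci95, mean + ci95))

Reporting the interval alongside each published graph would let readers judge whether a difference between two configurations is larger than the run-to-run noise, which also bears on the "winnowing" concern raised above.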
-------------- next part --------------
Attachment: Lustre.org-SC06DiscussionSlides-061113.pdf (application/pdf, 49731 bytes)
Url: http://mail.clusterfs.com/pipermail/lustre-discuss/attachments/20061115/d2ce9ef0/Lustre.org-SC06DiscussionSlides-061113-0001.pdf