Peter Bojanic
2006-Jul-18 15:48 UTC
[Lustre-discuss] Meeting Summary for lustre.org Community Forum 2006-06-29
lustre.org Community Forum 2006-06-29 - SESSION 3 - Lustre Release framework
Prepared by: Peter Jones
Date: 2006-06-29

MEETING NOTES

1. TESTING

pbojanic: CFS is in the process of rethinking its automated test suite; we don't have access to the same scale and variety of hardware as our community of users; we're interested in scale of testing, especially on large scale
ORNL: test system 80 nodes of TCP and 80 nodes of IB; other testing on XT3; can test on Jaguar with notice (5000 nodes); needs planning though
Q: does the gold standard release correlate to Peter Braam's comments on fewer releases?
pbojanic: yes, but let's save that discussion
???: three stress tests: testing network communications between nodes, file system tests using IOR, Hendrix metadata test to check ability to handle small files; most run continuously on a cluster for 8 hours or so
LLNL: two fairly large test clusters; 16 portal routers to file system with 128 I/O nodes and 1024 processor nodes; smaller testbeds to test all architectures.
pbojanic: LLNL is closest testing partner and gets early releases so they should get credit for quality of CFS releases
HP: smaller systems for basic sanity testing, share time on larger systems; month-long testing on 256 node IB cluster using known stressors such as iozone and known customer hotspots; reboot all nodes to check everything remounts correctly, as that has been a pain point in the past; generally, once it is up and running everything usually runs ok at scale, but getting there can be problematic.
pbojanic: one month soak test?
HP: yes, at least, depends upon availability, maybe more if possible
pbojanic: platforms tested? anyone testing on more esoteric platforms such as PowerPC?
Indiana: not currently, but have a whole pile, could do
ORNL: limited testing on ???; right now have problems scaling
Bull: ia64 systems; mainly 10-20 nodes
HP: same, but smaller
LLNL: do testing on all those platforms, all ia64
pbojanic: large scale testing of InfiniBand, Elan?
HP: lot of IB testing
Sandia: OpenFabrics soon; hopefully 300-400 nodes on IB
NCSA: running Topspin: 450 now, more in fall
pbojanic: development Cisco LND?
NCSA: that's us with Topspin. Cisco bought Topspin
pbojanic: OpenFabrics?
ORNL: interested in doing this
DDN: working with Voltaire to get request sizes higher
LLNL: small clusters OpenIB gen 2 (aka OpenFabrics)
???: CFS to provide framework to allow people to upload testing results?
pbojanic: interest in separate forum on this particular subject
HP: clarification on specifically what kind of testing talking about
pbojanic: who would want to be involved?
A: HP, Cray, ORNL, Sandia, Indiana, Terascala, Bull

2. GOLD STANDARD LUSTRE RELEASE

pbojanic: As per ORNL's earlier comment, CFS wants to make fewer releases but on a more regular release schedule; LLNL gradually increases testing to a very large scale, but only on some architectures; gold standard is a release that will be tested on certain configs to large scale.
pbojanic: perceived benefits?
HP: scaling is one thing, but all users hit different issues so would need broader testing to feel assured; regular releases are better because can align on that schedule
Cray: if problems found on gold, then fixes go into release?
pbojanic: benefit from gold standard?
Cray: testing would mean issues flushed out and higher quality
HP: gold standard behind betas by fixed amount? 6 months? year?
pbojanic: not sure of details yet, but it might be several months before a release could get the gold standard designation
ORNL: we see lustre.org being a Fedora version of CFS because then new features could come out more aggressively than in mainstream release
pbojanic: we have idea of what we want to do in our releases (train model); what would be your ideal scenario?
ORNL: A Fedora model is interesting because we make certain features available without needing to do a lot of tweaking; however, we want to avoid lustre.org being junk
pbojanic: does the community want features more rapidly or prefer something more conservative?
Terascala: prefer conservative
HP: stability
???: just bug fixes, no new features in gold standard
pbojanic: Debian model?
????: excellent idea
???: like concept of stable and development repositories because gives options
HP: would still prefer stable gold standard and applying exact patch for new feature than taking a more unknown quantity
Terascala: shortcomings of Debian include outdated packages; the price is worth paying for stability (others nodding heads at training); critical new features come as patches
pbojanic: interest in more detailed discussion on Gold Standard?
A: HP, NCSA, ORNL

3. INFORMATION MANAGEMENT - Source code repository / Information management; public communication forum

pbojanic: lustre.org website hosted and maintained by CFS; thoughts on what Lustre community needs to collaborate more effectively on these areas?
Indiana: Better docs help give everyone participating same starting point
pbojanic: Should CFS still maintain the web site, or should someone else take this over?
???: similar to SourceForge site?
pbojanic: no, there is an external read-only CVS which is password protected; CFS needs to be sensitive about early code in development that may have proprietary/customer content, or is not ready, so completely free access is not plausible
???: central CVS repository for projects being run within lustre.org community
pbojanic: Thoughts on CVS versus other alternatives? What using at Cray?
Cray: just switched to Subversion
pbojanic: better than CVS?
Cray: not used personally, but understands that it is better
???: no doubt
ORNL: same
pbojanic: how would DDN publish tuning info? Need CFS to filter info and organize site? wiki model?
???: different tools for different needs. wiki makes sense for docs. forum for discussion. Does not want community to waste time doing something already out there
pbojanic: Did LLNL talk about Collabnet at the LUG?
LLNL: Yes
pbojanic: Collabnet?
???: could be valuable but not sure of cost etc
pbojanic: definitely some costs. Anything else?
HP: updating comments in Lustre comments. Some comments not too helpful
PAJ: should we raise this at the forthcoming engineering launch?

ACTION: CFS to table subject of comments at forthcoming engineering launch