Peter Bojanic
2006-Jun-28 20:15 UTC
[Lustre-discuss] Meeting Summary for lustre.org Community Forum 2006-06-27
lustre.org Community Forum 2006-06-27 - SESSION 1 - Organization and Communication
Prepared by: Peter Jones
Date: 2006-06-27

MEETING PARTICIPANTS
Army HPC Research Center - Duane Cloud
Boeing - Josef Sikora
Bull - Philippe Couvée
Chevron - Jim Owens
CEA - Jacques-Charles Lafoucriere
CFS - Peter Bojanic
CFS - Peter Jones
CGG - John Belshaw
DDN - Dave Fellinger
HP - Frank O'Neill
Indiana University - Steve Simms
Myricom - Patrick Geoffray
NCSA - Anthony Tong
Ohio State University - Ranjit Noronha
Ohio State University - DK Panda
ORNL - Shane Cannon
ORNL - Sarp Oral
ORNL - David Vasil
PSC - Doug Balog
Sandia - Milt Clauser
Sandia - Lee Ward
Sicortex - John Goodhue

MEETING NOTES

1. Update on lustre.org beginnings - CFS/Peter Bojanic
- lustre-discuss traffic has increased significantly and steadily
  - 2006-01: 81; 2006-02: 68; 2006-03: 146; 2006-04: 90; 2006-05: 109; 2006-06: 111
  - 2005-01: 13; 2005-02: 42; 2005-03: 56; 2005-04: 33; 2005-05: 25; 2005-06: 57
- lustre-devel list also started, with moderate traffic so far
- HSM project initiated with CEA under a joint copyright agreement
- parallel I/O work with DK Panda at Ohio State University
- working with the Debian community on an unsupported Lustre release for that platform
- detailed networking designs shared through the lustre-devel mailing list; the first time CFS has shared this level of detail publicly
- Lustre internals training in Boulder, CO

2. Update on CFS activities since LUG - CFS/Peter Bojanic
  1. [Management] Understandable and documented error messages; troubleshooting - near-term initiative to identify, improve, and document the "top 10" Lustre and LNET error messages; longer-term RAS strategy already under development
  2. [Management] OST stripe management: 1) Pools; 2) Join files; 3) background migration - Quality of Storage feature landed for Lustre 1.6.0; Pools is under development and anticipated for Lustre 1.8.0 in the new year (see the striping sketch after this list)
  3. [Management] Improved logging, debugging, and diagnostic tools; NID logic; per-client stats - Lustre monitoring/debugging initiative kicked off this week; longer-term RAS strategy already under development
  4. [Backup] Lustre HPSS copy, using DMAPI - HSM initiative started with CEA
  5. [Management] I/O and metadata performance profiling, analysis, and reporting - Lustre monitoring/debugging initiative kicked off this week; longer-term RAS strategy already under development
  6. [Management] Cluster monitoring tool - working with Livermore to release the Lustre Monitoring Tool; discussions with other Lustre users about getting on board to enhance the tool
  7. [Performance] NUMA awareness - nothing done in this area yet because it is not a high priority for current customers
  8. [Management] Lustre internals documentation - training course on Lustre internals developed and being delivered this week in Boulder, CO; an enormous effort with hundreds of slides, exercises, and workshops
  9. [Stability] Version-based recovery - high-level design in progress
  10. [Management] Global health check - high-level design in progress for "adaptive timeouts", a health-based system for recovery
  11. [Management] Multiple mount protection - near complete; will be delivered in a future maintenance release of Lustre 1.4.7.x
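As a reference point for the stripe-management item (2) above, the fragment below is a minimal, untested sketch of what per-file OST stripe control already looks like from user space through liblustreapi, the library used by lfs setstripe. The exact header name, the llapi_file_create() prototype, and the link flags differ between Lustre releases, so treat those details as assumptions to verify against your installation; Pools would constrain which OSTs such stripes may be allocated on rather than replace this interface.

/*
 * Minimal sketch, not a tested program: create a file with an explicit
 * OST stripe layout through liblustreapi (link with -llustreapi).
 * Header name and prototype details vary between Lustre releases.
 */
#include <stdio.h>
#include <stdlib.h>
#include <lustre/liblustreapi.h>   /* <lustre/lustreapi.h> on some releases */

int main(int argc, char **argv)
{
        if (argc != 2) {
                fprintf(stderr, "usage: %s <new file on a Lustre mount>\n", argv[0]);
                return EXIT_FAILURE;
        }

        /* 1 MB stripe size, any starting OST (-1), 4 stripes, default pattern (0) */
        int rc = llapi_file_create(argv[1], 1048576, -1, 4, 0);
        if (rc != 0) {
                fprintf(stderr, "llapi_file_create failed: rc = %d\n", rc);
                return EXIT_FAILURE;
        }
        printf("created %s striped over 4 OSTs with 1 MB stripes\n", argv[1]);
        return EXIT_SUCCESS;
}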
3. Round-table discussion on lustre.org, based on the high-level questions:

a. What Lustre development is your organization currently engaged in?
- Sandia: testing the existing Lustre product; linkage to the client library more complete; testing routers; libsysio; Red Storm, 10,000 nodes; also a smaller 400-node IB cluster to support Red Storm; another small cluster with two interconnects (IB, Myrinet); Thunderbird, 9,000 processors
- Bull: integration of servers and storage; Quadrics; GigE; porting to ia64; HPC networks; specific admin tools for large clusters with 600+ nodes; HA testing; performance
- Cray: putting Lustre onto XD1 and XT3, using Lustre as a supportable scratch file system
- DDN: storage systems for the labs; InfiniBand, Fibre Channel, SCSI RDMA protocol
- HP: product offering based on Lustre technology; range of storage options/computer types; focus on testing robustness and ease of use
- Myricom: native MX support; small config; strong system interest in Lustre
- Sicortex: performance, robustness, and usability
- Boeing: long-term state; assessing file systems; not yet installed
- CGG: two projects; prototype and pilot in London; 200 TB file system for geophysical data; stable HA; long-term robustness over performance
- Chevron: several instances; small development effort on monitoring; improving HA for config
- Indiana University: Lustre over WAN, with good success; demo at the last conference; monitoring tools; talking to Dresden about a plugin; Data Capacitor
- Ohio State: performance testing presented at LUG; other interconnects for the ADIO driver
- PSC: router development projects to get Lustre production-ready
- Army HPC: Lustre on all systems in-house
- CEA: production at large scale; OpenFabrics LND; multiple clusters
- NCSA: Lustre in production; security in Lustre; Kerberos support

b. What value does your organization want to derive from lustre.org (in the next 24 months; in 2-3 years)?
c. What are some of the other potential key benefits to participating in lustre.org?
- NCSA: multiple clusters; more machines, but concerned about present security (i.e. Kerberos)
- Sandia: similar to the existing Hendrix project?
- CFS: the Hendrix CMD2 project developed Kerberos support; will be productized soon; details in the roadmap
- CEA: interested in sharing tools and the experience of different users
- Army HPC: trying to set up a global file system shared across all machines; interested in everything (security, HSM)
- CFS: large-lab experience (LLNL, Sandia) useful
- PSC: sharing experiences with some of the other sites, e.g. Sandia; also, how they can contribute
- Ohio: issues in the ADIO interface - what hooks are available? what features does the community want for ADIO? bottlenecks, etc.? (see the MPI-IO hint sketch at the end of these notes)
- Indiana: share their experiences; Lustre with HPSS; performance and tuning; what utility for WAN?
- Chevron: share experiences and find out what others are doing, to see if they are hitting the same sorts of issues
- CGG: HA and HSM issues are difficult to solve, so better addressed in a large forum; adopted InfiniBand early, so helped with testing; commercial, so cannot be as open as some research labs
- Boeing: learn about capabilities as background prior to using Lustre, and to help influence direction
- Sicortex: a forum to share information; lower-priority items such as management tools and coordinated testing
- Myricom: to know what people want in the future; to keep up to date with development; especially good to hear success stories, because vendors often only hear about problems
- HP: focus on robustness, availability, security; sharing experiences
- DDN: has collected data for performance tuning; sometimes the figures do not seem to make sense; would like to publish data for a performance tuning guide - both successes and failures, so that people can know how to tune a system for different data access patterns; being prepared for sharing
- Cray: ADIO layers, routing, monitoring, security, HSM in what Cray offers; what Cray expects to bring is data and experience of scaling and supportability/maintainability on large machines
- Bull: Lustre quality and stability; HSM; monitoring
- CFS: Bull's early efforts in testing the DDN 9500 contributed to performance tuning efforts, to everyone's benefit
- Sandia: sharing experiences on a global file system, routing, diagnostics, a performance tuning guide; long-term viability of Lustre; monitoring for the Tri Labs; ensure Lustre remains a viable option for the Tri Labs
- ORNL: a broader development base; a venue to share development efforts, both to contribute and to get back; HSM (someone joining soon); better integration with NPO?; client cache; better diagnostic tools; I/O patterns and a cookbook for best performance
- CFS: better coordinated and leveraged testing efforts for Lustre; gold-standard Lustre releases with a predictable release schedule (because of testing at larger scales); accelerated Lustre roadmap development (e.g. monitoring, HPSS)

d. What reservations or cautions do you have regarding an open source development forum for Lustre?
- Sicortex: does not want to undermine CFS's architecture lead
- Cray: if the lustre.org community implements a feature in a different way than CFS, there could be conflicting priorities, or each believes the other is dealing with it so no one does
- ORNL: does not want to weaken CFS's commercial viability, because it still relies on CFS for core features; risk that lustre.org goes off at a tangent from what users need

ACTIONS
- CFS to raise the question of information sharing in the Session 3 discussions on Thursday, June 29
- CFS to ask lustre.org members for topics for future meetings; engage lustre.org members to present/lead discussions; possible topics include:
  - Lustre/DDN performance tuning
  - experience deploying a Lustre global file system
  - high availability Lustre
  - Lustre over WAN
  - Lustre routers
  - HSM
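As background for the ADIO discussion above (Ohio State's question about available hooks), the fragment below is a minimal sketch of how an application normally reaches the ADIO layer: through MPI-IO hints supplied at file-open time. It assumes a ROMIO-based MPI implementation; "striping_factor" and "striping_unit" are standard ROMIO hint names but are only honored when the underlying ADIO driver implements them, and the file path is a placeholder, not a recommendation.

/*
 * Minimal sketch, assuming a ROMIO-based MPI-IO stack: pass striping hints
 * down to the ADIO layer when collectively creating a shared file.
 * Whether the hints take effect depends on the ADIO driver in use.
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
        MPI_File fh;
        MPI_Info info;
        char msg[64];
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Info_create(&info);
        MPI_Info_set(info, "striping_factor", "4");     /* stripe over 4 OSTs */
        MPI_Info_set(info, "striping_unit", "1048576"); /* 1 MB stripe size   */

        MPI_File_open(MPI_COMM_WORLD, "/mnt/lustre/shared.out",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

        /* each rank writes a fixed-size record at its own disjoint offset */
        snprintf(msg, sizeof(msg), "hello from rank %d\n", rank);
        MPI_File_write_at(fh, (MPI_Offset)rank * sizeof(msg), msg,
                          (int)sizeof(msg), MPI_CHAR, MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        MPI_Info_free(&info);
        MPI_Finalize();
        return 0;
}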