Kevin and list,
We're most interested in gaining this functionality for Gainestown-based
Vayu blades. We're receiving 800+ of them very soon. It could be
true that memory error information is logged into the SEL via the SP,
but we haven't received any concrete information on the capabilities of
the Vayu in this respect. Could you provide this information to me?
Regardless, our experience doing this kind of thing on 1,000s of nodes
causes us to greatly prefer the EDAC approach for the following reasons:
A. We have direct real-time access to the actual hardware events.
With EDAC, we get immediate visibility via syslog when an
ECC event happens, without having to deal with polling the SP's
SEL. I.e., we want to know about ECC events the moment they
happen. Can the SP push all of these events to us via SNMP
trap, syslog or another mechanism as they happen, or do we have
to poll the SP?
B. We much prefer the direct access to memory fault information
that EDAC provides over the filtered access that is usually
provided by vendors in their SP/SEL implementations. I.e., we
want to control the error reporting thresholds and have access
to all error data, instead of the data being filtered through
opaque firmware containing thresholds that we cannot change.
Does the SP/SEL approach give me access to every event? Can I
modify all thresholds?
C. The use of EDAC is operationally consistent with all of our
other platforms.
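
To make points A and B concrete, here is a rough sketch of the kind of
check we would run on each node. It assumes the standard EDAC sysfs
layout (mc*/ce_count and mc*/ue_count under /sys/devices/system/edac/mc)
is present on the Vayu once a suitable driver is loaded, which is exactly
what we have not yet confirmed. The function names and the ce_limit
value are made up for illustration and are not part of EDAC.

```python
from pathlib import Path

# Standard EDAC sysfs location; whether it is populated on a given
# platform depends on a memory-controller driver being loaded.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_edac_counts(root=EDAC_ROOT):
    """Return {controller: {"ce": int, "ue": int}} for each mc* directory."""
    counts = {}
    for mc in sorted(Path(root).glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            # Skip controllers whose counters are missing or unreadable.
            continue
        counts[mc.name] = {"ce": ce, "ue": ue}
    return counts


def over_threshold(counts, ce_limit=10):
    """Hypothetical site policy: flag any uncorrectable error, or
    ce_limit or more correctable errors, on a controller."""
    return sorted(
        name
        for name, c in counts.items()
        if c["ue"] > 0 or c["ce"] >= ce_limit
    )


if __name__ == "__main__":
    for name in over_threshold(read_edac_counts()):
        print(f"{name}: ECC error count over site threshold")
```

The point is that the threshold lives in our own tooling, where we can
change it, rather than in firmware.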
Does that make sense?
Thanks,
-Matthew
On Fri, 2008-09-12 at 11:09 -0600, Matthew Bohnsack wrote:
> Thanks for the pointer Kevin.
>
> I will examine and evaluate the SP tools more closely over the next
> weeks and get back to you on this.
>
> -Matthew
>
>
> On Fri, 2008-09-12 at 11:00 -0600, Kevin Van Maren wrote:
> > Memory errors on the x6250 are logged by the SP into the SEL, and tied
> > into the blade's fault reporting.
> > I think your request would need to clarify in which ways the existing
> > tools are not adequate to identify bad memory.
> >
> > Kevin
> >
>