During the diagnostics of my SAN failure last week we thought we had seen a backplane failure due to high error counts with 'lsiutil'. However, even with a new backplane and ruling out failed cards (MPXIO or singular) or bad cables I'm still seeing my error count with LSIUTIL increment. I've got no disks attached to the array right now so I've also ruled those out.

Even with nothing connected but the HBA to the backplane expander, a simple restart of the SAN into an OpenIndiana LiveCD or other distribution (NexentaStor) increments the counter. I've been as careful as I can be to clear the counter between changes to parts to try and eliminate a potentially bad cable/card/etc. You can see phy 8-15 throw errors regardless of MPXIO or single-card config, or which expander port I use on the backplane.

According to my VAR, something in the mptsas code changed "recently" (not sure what that means in time terms) and they do not see the problems with 6GB backplanes and adapters.

[Attachment: SAS Diags.txt]

Attached is a log I took through NexentaStor 3.1.1 with my disks still attached. The disks themselves don't seem to be throwing errors, so that's good.

Has anyone seen anything like this? I have not tried to boot into an older version of Solaris or NexentaStor yet, but booting into Scientific Linux 6.1 yields about the same results with lsiutil. Nothing from fmadm, /var/adm/messages or otherwise indicates these data errors outside of lsiutil.
Richard Elling
2011-Dec-04 03:03 UTC
[zfs-discuss] LSI 3GB HBA SAS Errors (and other misc)
On Dec 1, 2011, at 5:08 PM, Ryan Wehler wrote:

> During the diagnostics of my SAN failure last week we thought we had seen a backplane failure due to high error counts with 'lsiutil'. However, even with a new backplane and ruling out failed cards (MPXIO or singular) or bad cables I'm still seeing my error count with LSIUTIL increment. I've got no disks attached to the array right now so I've also ruled those out.

The link error counters are on the receiving side. To see the complete picture, you need to look at link errors on both ends of each link (more below).

> Even with nothing connected but the HBA to the backplane expander, a simple restart of the SAN into an OpenIndiana LiveCD or other distribution (NexentaStor) increments the counter.

A few counters can tick up when the system is reset at boot. These can be ignored. What you are looking for is a consistent increase of the counters under load. In some cases I have seen millions of errors per minute on a very unhappy system.

> I've been as careful as I can be to clear the counter between changes to parts to try and eliminate a potentially bad cable/card/etc. You can see phy 8-15 throw errors regardless of MPXIO or single-card config, or which expander port I use on the backplane.

The info you attached doesn't show the topology (lsiutil command 16), so it is difficult to say why this occurs.

> According to my VAR, something in the mptsas code changed "recently" (not sure what that means in time terms) and they do not see the problems with 6GB backplanes and adapters.

These counters are in the physical interfaces, far away from any OS.

> Attached is a log I took through NexentaStor 3.1.1 with my disks still attached. The disks themselves don't seem to be throwing errors, so that's good.

To see errors from the disk's perspective, you need to look at the disk's logs. I use sg3_utils for this (sg_logs -a /dev/rdsk/...).

> Has anyone seen anything like this? I have not tried to boot into an older version of Solaris or NexentaStor yet, but booting into Scientific Linux 6.1 yields about the same results with lsiutil.

Yes. Root cause is always hardware.

> Nothing from fmadm, /var/adm/messages or otherwise indicates these data errors outside of lsiutil.

Those errors are counters as part of the SAS link state machine. The symptoms will show as poor performance or occasional command resets at the OS level.

-- richard

--
ZFS and performance consulting
http://www.RichardElling.com
LISA '11, Boston, MA, December 4-9
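To follow the sg_logs suggestion across a whole shelf, a loop along the following lines can be used. This is only a sketch: it assumes sg3_utils is installed and that the disks appear under Solaris-style /dev/rdsk names; the device glob and the grep pattern are illustrative and may need adjusting for your system.

  #!/bin/sh
  # Dump the drive-side error pages that sg_logs reports, one disk at a time.
  # Adjust the glob for your platform's device naming.
  for dev in /dev/rdsk/c*t*d*s2; do
      echo "=== $dev ==="
      # -a prints all supported log pages (read/write error counters,
      # protocol-specific port page with the SAS phy counters, etc.);
      # the grep keeps only the counter lines of interest.
      sg_logs -a "$dev" 2>/dev/null | egrep -i 'error|invalid dword|disparity|synch'
  done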
Hi Richard,
Thanks for getting back to me.

On Dec 3, 2011, at 9:03 PM, Richard Elling wrote:

> A few counters can tick up when the system is reset at boot. These can be ignored.
>
> What you are looking for is a consistent increase of the counters under load. In some cases I have seen millions of errors per minute on a very unhappy system.

But we're talking about 600,000 to 2,000,000 errors on a simple reset at boot. Per my VAR, their 6GB hardware shows significantly fewer errors (in the 10s to 100s of errors, not hundreds of thousands to millions).

> The info you attached doesn't show the topology (lsiutil command 16), so it is difficult to say why this occurs.

Attached is the output of option 16 on each card.

[Attachment: LSI1068.rtf]

> To see errors from the disk's perspective, you need to look at the disk's logs.
> I use sg3_utils for this (sg_logs -a /dev/rdsk/...).

I'd paste some of this, but the output would be pretty big. :)  I'll look more into this. Though my "errors corrected without substantial delay" count stands out as pretty high, even on a new disk I just received. Is there anything specific I should be looking at?
Richard Elling
2011-Dec-04 04:31 UTC
[zfs-discuss] LSI 3GB HBA SAS Errors (and other misc)
On Dec 3, 2011, at 7:36 PM, Ryan Wehler wrote:

>> What you are looking for is a consistent increase of the counters under load. In some cases I have seen millions of errors per minute on a very unhappy system.
>
> But we're talking about 600,000 to 2,000,000 errors on a simple reset at boot. Per my VAR, their 6GB hardware shows significantly fewer errors (in the 10s to 100s of errors, not hundreds of thousands to millions).

For high-quality hardware, I see 4 to 8. If I see > 1,000, then I start replacing hardware.

>> The info you attached doesn't show the topology (lsiutil command 16), so it is difficult to say why this occurs.
>
> Attached is the output of option 16 on each card.
>
> <LSI1068.rtf>

This shows that the handle 0009 phys 12 to 15 are the other HBA (initiator). It is unusual to see millions of errors there.

Also, the number of errors is not symmetrical. From the HBA (Adapter phy 1) you see on the order of a thousand errors. From the expander (handle 0009) you see millions of errors on phys 12 to 15, which are connected to the HBA.

Also interesting is that one of the phys, adapter phy 0, shows no errors, but we see errors on the others. This is unusual because there are 4 links in the cable.

Still smells like hardware to me.
-- richard

--
ZFS and performance consulting
http://www.RichardElling.com
LISA '11, Boston, MA, December 4-9
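The symmetry argument above (compare the counters at both ends of each link) is easier to check when the two sides are tabulated together. A rough awk sketch over a captured lsiutil dump in the format quoted later in this thread; the file name is a placeholder and the label matching is an assumption about that format.

  #!/bin/sh
  # Pull per-phy Invalid DWord counts out of a saved lsiutil text dump so the
  # HBA side ("Adapter Phy N") and expander side ("Expander (Handle ...) Phy N")
  # can be compared line by line. lsiutil-dump.txt is a placeholder name.
  awk '
    /^Adapter Phy|^Expander/ { phy = $0; sub(/:.*/, "", phy) }   # remember current phy
    /Invalid DWord Count/    { printf "%-32s %s\n", phy, $NF }   # phy name + count
  ' lsiutil-dump.txt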
On Dec 3, 2011, at 10:31 PM, Richard Elling wrote:

>> But we're talking about 600,000 to 2,000,000 errors on a simple reset at boot. Per my VAR, their 6GB hardware shows significantly fewer errors (in the 10s to 100s of errors, not hundreds of thousands to millions).
>
> For high-quality hardware, I see 4 to 8. If I see > 1,000, then I start replacing hardware.

And how do you define "high quality hardware"? Obviously these aren't crummy SATA adapters and low-cost drives. The chassis and backplane are on Nexenta's HSL. While the cards are not explicitly listed, the underlying chip (LSI 1068) is on another card (3081E-R) that is on the HSL.

> This shows that the handle 0009 phys 12 to 15 are the other HBA (initiator).
>
> It is unusual to see millions of errors there.
>
> Also, the number of errors is not symmetrical. From the HBA (Adapter phy 1) you see on the order of a thousand errors. From the expander (handle 0009) you see millions of errors on phys 12 to 15, which are connected to the HBA.
>
> Also interesting is that one of the phys, adapter phy 0, shows no errors, but we see errors on the others. This is unusual because there are 4 links in the cable.
>
> Still smells like hardware to me.

I'm not quite extrapolating this data the way you are. I see handle 0009, which looks to be the expander. Card #1 is hooked to phy 8-11 and Card #2 is hooked to phy 12-15 (ports 0 and 1 on the expander).

As far as symmetrical errors go, yeah, the whole thing is screwy. The one thing that stands out, which I did not notice before for some reason, is that the "right" card (the one that normally handles phy 12-15) carries 1M+ errors on the expander phys in the output from my initial inquiry regardless of which cable ("right" or "left") is used. Perhaps that is an indicator of hardware malfunction. The "left" card (usually responsible for phy 8-11) throws something on the order of 600K+ (under 1M) with either cable (phy 8-11 or 12-15). Those numbers are uncomfortably high too, though.

Basically, the output in my SAS Diags.txt was flipping between single use of each card with each of the two cables I had available to me. If I were to show the output now with both cards enabled, phy 8-15 on the expander all show "link up".

The other mystery, as you mentioned, is why Adapter phy 0 is error free while the other 3 phys are not. It's also persistent across cables used AND cards used.
Richard Elling
2011-Dec-04 05:18 UTC
[zfs-discuss] LSI 3GB HBA SAS Errors (and other misc)
On Dec 3, 2011, at 9:02 PM, Ryan Wehler wrote:

> And how do you define "high quality hardware"? Obviously these aren't crummy SATA adapters and low-cost drives. The chassis and backplane are on Nexenta's HSL. While the cards are not explicitly listed, the underlying chip (LSI 1068) is on another card (3081E-R) that is on the HSL.

I recently tested an HP DL380 G7 with D2600 and D2700 JBOD chassis. Zero errors.

Currently, the test process for HSL records any errors, but as long as the root cause can be explained, the devices can pass certification.

> As far as symmetrical errors go, yeah, the whole thing is screwy. The "right" card (the one that normally handles phy 12-15) carries 1M+ errors on the expander phys regardless of which cable is used. Perhaps that is an indicator of hardware malfunction. The "left" card (usually responsible for phy 8-11) throws something on the order of 600K+ (under 1M) with either cable. Those numbers are uncomfortably high too, though.

Agree.

> Basically, the output in my SAS Diags.txt was flipping between single use of each card with each of the two cables I had available to me. If I were to show the output now with both cards enabled, phy 8-15 on the expander all show "link up".

Are the cables of the same make/model? Unfortunately, it is not uncommon to see bad cables :-(  I had one just last week :-(

> The other mystery, as you mentioned, is why Adapter phy 0 is error free while the other 3 phys are not. It's also persistent across cables used AND cards used.

A mystery?
-- richard

--
ZFS and performance consulting
http://www.RichardElling.com
LISA '11, Boston, MA, December 4-9
On Dec 3, 2011, at 11:18 PM, Richard Elling wrote:

> I recently tested an HP DL380 G7 with D2600 and D2700 JBOD chassis. Zero errors.

I'm assuming these had some sort of LSI cards in them, since that's the primary focus here. Do you happen to know the models and what expander chip was used on the backplane(s)?

> Currently, the test process for HSL records any errors, but as long as the root cause can be explained, the devices can pass certification.

Well... since we can't even come to a reasonable justification for why these errors exist, with no "true" indicator of bad hardware, something like this could pass the HSL if the VAR can justify it? I'm not saying that's what happened; I'm just trying to understand the process.

> Are the cables of the same make/model? Unfortunately, it is not uncommon to see bad cables :-(  I had one just last week :-(

The cables are identical. My VAR put this all together about 2 years ago. I don't have any other cables to test, but the present fix is "upgrade to SAS3 (6GB) backplane/cards/cables".
Richard Elling
2011-Dec-04 05:45 UTC
[zfs-discuss] LSI 3GB HBA SAS Errors (and other misc)
On Dec 3, 2011, at 9:32 PM, Ryan Wehler wrote:

> I'm assuming these had some sort of LSI cards in them, since that's the primary focus here. Do you happen to know the models and what expander chip was used on the backplane(s)?

LSI 2008 chipset (HP SC08Ge HBA). Expanders are HP-branded; I'll speculate they are LSI SAS2x28.

Note: there is also firmware on the HBAs and expanders. But I do not expect firmware to change the link error counts. I suspect that is more of a physical issue.

> Well... since we can't even come to a reasonable justification for why these errors exist, with no "true" indicator of bad hardware, something like this could pass the HSL if the VAR can justify it? I'm not saying that's what happened; I'm just trying to understand the process.

A certification does not mean that any specific implementation operates without errors. A failed part, noisy environment, or other influences will affect any specific implementation.
-- richard

--
ZFS and performance consulting
http://www.RichardElling.com
LISA '11, Boston, MA, December 4-9
On Dec 3, 2011, at 11:45 PM, Richard Elling wrote:

> LSI 2008 chipset (HP SC08Ge HBA). Expanders are HP-branded; I'll speculate they are LSI SAS2x28.
>
> Note: there is also firmware on the HBAs and expanders. But I do not expect firmware to change the link error counts. I suspect that is more of a physical issue.

In an effort to solve this problem I did update my 3442E-R HBAs from a 2009 firmware to "Phase 21", which came out earlier this year from LSI. The replacement backplane I got from my VAR when they thought that was the issue moved the backplane firmware from 7015 to 7017, per lsiutil's output. You're right that it must be a physical issue, but it just seems highly unlikely that BOTH HBAs failed and BOTH SAS cables failed (we'll take the expander out of the equation since it was replaced).

> A certification does not mean that any specific implementation operates without errors. A failed part, noisy environment, or other influences will affect any specific implementation.

Would it not be more prudent to re-run the tests after a failure was fixed and try to eliminate environmental variables? If you were to look up the reason it made it onto the HSL, it should be "It just works!", not "it works, but this is why we're seeing errors". That leads to doubt when there are caveats and when trying to diagnose like/same hardware in the future.
Richard Elling
2011-Dec-04 20:23 UTC
[zfs-discuss] LSI 3GB HBA SAS Errors (and other misc)
On Dec 4, 2011, at 8:50 AM, Ryan Wehler wrote:

>> A certification does not mean that any specific implementation operates without errors. A failed part, noisy environment, or other influences will affect any specific implementation.
>
> Would it not be more prudent to re-run the tests after a failure was fixed and try to eliminate environmental variables? If you were to look up the reason it made it onto the HSL, it should be "It just works!", not "it works, but this is why we're seeing errors". That leads to doubt when there are caveats and when trying to diagnose like/same hardware in the future.

Perhaps I wasn't clear. When we root cause an error reported during certification, it is to absolve the device under test. For example, if we run a test against a disk and see errors on the wire caused by a backplane or cable, then we must absolve the disk of the errors. If the disk is the root cause of the error reports, then it fails certification.

Do not confuse certification with "it runs forever with no problems in all cases".
-- richard

--
ZFS and performance consulting
http://www.RichardElling.com
LISA '11, Boston, MA, December 4-9
James C. McPherson
2011-Dec-04 22:11 UTC
[zfs-discuss] LSI 3GB HBA SAS Errors (and other misc)
On 5/12/11 02:50 AM, Ryan Wehler wrote:
...
> In an effort to solve this problem I did update my 3442E-R HBAs from a 2009 firmware to "Phase 21", which came out earlier this year from LSI. The replacement backplane I got from my VAR when they thought that was the issue moved the backplane firmware from 7015 to 7017, per lsiutil's output. You're right that it must be a physical issue, but it just seems highly unlikely that BOTH HBAs failed and BOTH SAS cables failed (we'll take the expander out of the equation since it was replaced).

You need to look at the data available, rather than making assumptions. When I was part of CPRE (now PTS?) in Sun we referred to swapping hardware without investigation as practicing "swaptronics". Every escalation we got where this had happened took longer to resolve as a result.

So yes, it certainly could be a hardware problem twice in a row. You'd want to examine the serial numbers and other identifying data, such as manufacturing date codes, to see how likely that is. In the past I've seen cases where replacement disks turned out to be duds across several different batches and different factories. The true root cause was traced to a chip that was supplied to the manufacturer by a third party.

Personally, I'd start looking at the cables first - in my experience they seem to incur more physical stress through the connect/disconnect operations than HBAs.

James C. McPherson
--
Oracle
http://www.jmcp.homeunix.com/blog
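The serial-number/date-code check James suggests can be done from the host with the same sg3_utils package mentioned earlier in the thread, at least for the disks. A sketch only: the device glob is Solaris-style and illustrative, and the VPD lookup assumes the drives report a unit serial number page.

  #!/bin/sh
  # Collect vendor/model/firmware and unit serial number for each disk, to spot
  # replacement parts that may have come from the same suspect batch.
  for dev in /dev/rdsk/c*t*d*s2; do
      echo "=== $dev ==="
      sg_inq "$dev" | egrep -i 'vendor|product|revision'
      sg_vpd --page=sn "$dev"     # Unit Serial Number VPD page
  done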
Well, if we want to get into theories on faulty hardware batches and such, we can. I think the likelihood is slim, but not impossible I suppose.

I did the best I could diagnostics-wise, given I have no spare parts that have never been a part of this SAN. As I said, I still think the likelihood of two failed HBAs or failed cables just doesn't add up. The errors thrown between cards are pretty consistent between cable swaps too, so nothing is really indicative of A bad cable, let alone two.

My vendor has more hardware on its way to me early this coming week, so I'll be able to report back once I have new HBAs and cables too.

On Dec 4, 2011, at 4:11 PM, James C. McPherson wrote:

> You need to look at the data available, rather than making assumptions. When I was part of CPRE (now PTS?) in Sun we referred to swapping hardware without investigation as practicing "swaptronics". Every escalation we got where this had happened took longer to resolve as a result.
>
> Personally, I'd start looking at the cables first - in my experience they seem to incur more physical stress through the connect/disconnect operations than HBAs.
Here's LSIUTIL after swapping to a 6GB backplane and dual 9211-8i cards on a fresh boot. Much better. :)

Adapter Phy 0:  Link Up, No Errors
Adapter Phy 1:  Link Up, No Errors
Adapter Phy 2:  Link Up, No Errors
Adapter Phy 3:  Link Up, No Errors
Adapter Phy 4:  Link Down, No Errors
Adapter Phy 5:  Link Down, No Errors
Adapter Phy 6:  Link Down, No Errors
Adapter Phy 7:  Link Down, No Errors

Expander (Handle 0009) Phy 0:  Link Down, No Errors
Expander (Handle 0009) Phy 1:  Link Down, No Errors
Expander (Handle 0009) Phy 2:  Link Down, No Errors
Expander (Handle 0009) Phy 3:  Link Down, No Errors
Expander (Handle 0009) Phy 4:  Link Up
  Invalid DWord Count            8
  Running Disparity Error Count  7
  Loss of DWord Synch Count      2
  Phy Reset Problem Count        0
Expander (Handle 0009) Phy 5:  Link Up
  Invalid DWord Count            8
  Running Disparity Error Count  6
  Loss of DWord Synch Count      2
  Phy Reset Problem Count        0
Expander (Handle 0009) Phy 6:  Link Up
  Invalid DWord Count            8
  Running Disparity Error Count  5
  Loss of DWord Synch Count      2
  Phy Reset Problem Count        0
Expander (Handle 0009) Phy 7:  Link Up
  Invalid DWord Count            8
  Running Disparity Error Count  7
  Loss of DWord Synch Count      2
  Phy Reset Problem Count        0
Expander (Handle 0009) Phy 8:  Link Up
  Invalid DWord Count            8
  Running Disparity Error Count  7
  Loss of DWord Synch Count      2
  Phy Reset Problem Count        0
Expander (Handle 0009) Phy 9:  Link Up
  Invalid DWord Count            8
  Running Disparity Error Count  4
  Loss of DWord Synch Count      2
  Phy Reset Problem Count        0
Expander (Handle 0009) Phy 10:  Link Up
  Invalid DWord Count            8
  Running Disparity Error Count  6
  Loss of DWord Synch Count      2
  Phy Reset Problem Count        0
Expander (Handle 0009) Phy 11:  Link Up
  Invalid DWord Count            8
  Running Disparity Error Count  6
  Loss of DWord Synch Count      2
  Phy Reset Problem Count        0
Expander (Handle 0009) Phy 12:  Link Down, No Errors
Expander (Handle 0009) Phy 13:  Link Down, No Errors
Expander (Handle 0009) Phy 14:  Link Down, No Errors
Expander (Handle 0009) Phy 15:  Link Down, No Errors
Expander (Handle 0009) Phy 16:  Link Down, No Errors
Expander (Handle 0009) Phy 17:  Link Down, No Errors
Expander (Handle 0009) Phy 18:  Link Up, No Errors
Expander (Handle 0009) Phy 19:  Link Up, No Errors
Expander (Handle 0009) Phy 20:  Link Down, No Errors
Expander (Handle 0009) Phy 21:  Link Down, No Errors
Expander (Handle 0009) Phy 22:  Link Down, No Errors
Expander (Handle 0009) Phy 23:  Link Down, No Errors
Expander (Handle 0009) Phy 24:  Link Down, No Errors
Expander (Handle 0009) Phy 25:  Link Down, No Errors
Expander (Handle 0009) Phy 26:  Link Down, No Errors
Expander (Handle 0009) Phy 27:  Link Down, No Errors
Expander (Handle 0009) Phy 28:  Link Up, No Errors
Expander (Handle 0009) Phy 29:  Link Up, No Errors
Expander (Handle 0009) Phy 30:  Link Up, No Errors
Expander (Handle 0009) Phy 31:  Link Up, No Errors
Expander (Handle 0009) Phy 32:  Link Up, No Errors
Expander (Handle 0009) Phy 33:  Link Up, No Errors
Expander (Handle 0009) Phy 34:  Link Down, No Errors
Expander (Handle 0009) Phy 35:  Link Down, No Errors
Expander (Handle 0009) Phy 36:  Link Up, No Errors
Expander (Handle 0009) Phy 37:  Link Down, No Errors
Whoops. Make that 9211-4i cards. :) Still promising.
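Richard's earlier rule of thumb (a handful of errors at boot is normal; more than about 1,000 means start swapping hardware) can be applied to output in the format above with a short awk pass. A sketch only; the threshold and file name are illustrative.

  #!/bin/sh
  # Sum the four per-phy counters from a saved lsiutil listing (format as above)
  # and flag any phy whose total exceeds a threshold. phy-counters.txt is a
  # placeholder for wherever the listing was captured.
  awk -v limit=1000 '
    function report() { if (phy != "" && total > limit) printf "%-32s %d\n", phy, total }
    /^Adapter Phy|^Expander/ { report(); phy = $0; sub(/:.*/, "", phy); total = 0; next }
    /Count/                  { total += $NF }
    END                      { report() }
  ' phy-counters.txt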
2011-12-05 5:15, Ryan Wehler wrote:

> Well, if we want to get into theories on faulty hardware batches and such, we can. I think the likelihood is slim, but not impossible I suppose.
>
> I did the best I could diagnostics-wise, given I have no spare parts that have never been a part of this SAN. As I said, I still think the likelihood of two failed HBAs or failed cables just doesn't add up. The errors thrown between cards are pretty consistent between cable swaps too, so nothing is really indicative of A bad cable, let alone two.

Well, speculation-wise, if these were nearly identical items serving for the same time in identical conditions (same enclosure), they could fail together just because they were subjected to the same shocks, power surges, or, perhaps more likely, the same aging of components (e.g. drying up of capacitors, oxidation of soldered connections, diffusion of atoms in the microchips - whatever).

Regarding soldered connections - there was a true story some 10 years ago about Fujitsu desktop drives dying at nearly the same age after leaving the factory (a few months old), which was tracked to higher-than-usual acidity of the soldering lead or its additives. Overall, the electrical links just stopped working after a while due to oxidation into the bulk of the metal blobs :)

Still, congratulations that the replacement hardware did solve the problem! ;)

//Jim