thr3ads.net - Lustre discuss - [Lustre-discuss] aacraid kernel panic caused failover [Mar 2011]

David Noriega

2011-Mar-25 14:37 UTC

[Lustre-discuss] aacraid kernel panic caused failover

Had some craziness happen to our Lustre system. We have two OSSs, both
identical Sun X4140 servers, and on only one of them have I seen the
messages below pop up in the kernel log, followed by a kernel panic. The
panic seemed to then spread and caused the network to go down and the second
OSS to try to fail over (or fail back?). Anyway, a 'split-brain' occurred
and I was able to get in and set them straight. I researched these
aacraid module messages and so far all I can find says to increase the
timeout, but these are old posts and the timeouts are currently set to 60.
Anyone else have any ideas?

aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter reset request. SCSI hang ?
AAC: Host adapter BLINK LED 0xef
AAC0: adapter kernel panic'd ef.
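
A small sketch for checking how often this signature shows up in a saved kernel log. The function names and the log path in the usage note are illustrative assumptions, not part of aacraid or Lustre; the patterns are taken from the messages quoted above.

```shell
#!/bin/sh
# Hypothetical helpers; the names are made up for this example.

# aac_panic_events FILE
# Print how many "BLINK LED" lines (one per adapter panic) FILE contains.
aac_panic_events() {
    # grep -c exits non-zero on zero matches but still prints 0.
    grep -c 'Host adapter BLINK LED' "$1" || true
}

# aac_blink_codes FILE
# Print the hex code reported with each BLINK LED line, one per line.
aac_blink_codes() {
    sed -n 's/.*Host adapter BLINK LED \(0x[0-9a-fA-F][0-9a-fA-F]*\).*/\1/p' "$1"
}
```

For example, running `aac_panic_events /var/log/messages` on each OSS gives a quick side-by-side count (the path assumes the default syslog location).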

-- 
Personally, I liked the university. They gave us money and facilities,
we didn't have to produce anything! You've never been out of college!
You don't know what it's like out there! I've worked in the private
sector. They expect results. -Ray Ghostbusters

David Noriega

2011-Apr-06 05:03 UTC

[Lustre-discuss] aacraid kernel panic caused failover

OK, I updated the aacraid driver and the RAID firmware, yet the problem
still happened, so I did more research and applied the following
tweaks:

1) Rebuilt the initrd with the following options:
a) edited /etc/sysconfig/mkinitrd/multipath to contain MULTIPATH=yes
b) mkinitrd initrd-2.6.18-194.3.1.el5_lustre.1.8.4.img 2.6.18-194.3.1.el5_lustre.1.8.4 --preload=scsi_dh_rdac
2) Added the local hard disk to the multipath blacklist
3) Edited modprobe.conf to have the following aacraid options:
options aacraid firmware_debug=2 startup_timeout=60
(the debug option doesn't seem to print anything to dmesg)
4) Added pcie_aspm=off to the kernel boot options
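
To confirm that tweaks 3) and 4) are actually in effect after a reboot, something along these lines could be used. This is a minimal sketch: the function names are hypothetical, and the default paths (/proc/cmdline, /etc/modprobe.conf) assume a RHEL 5 style system like the one described.

```shell
#!/bin/sh
# Hypothetical helpers; both accept an optional file argument so they
# can be pointed at a copy of the file instead of the live system.

# has_boot_option OPTION [CMDLINE_FILE]
# Succeed if OPTION appears as a whole word on the kernel command line.
has_boot_option() {
    tr ' ' '\n' < "${2:-/proc/cmdline}" | grep -qxF -- "$1"
}

# module_options MODULE [CONF_FILE]
# Print the option string of each "options MODULE ..." line.
module_options() {
    sed -n "s/^options[[:space:]][[:space:]]*$1[[:space:]][[:space:]]*//p" \
        "${2:-/etc/modprobe.conf}"
}
```

`has_boot_option pcie_aspm=off && echo ok` checks the running kernel, and `module_options aacraid` prints what modprobe.conf currently requests for the driver.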

So things looked good for a while. I did have a problem mounting the
Lustre partitions, but that was my fault: I had misconfigured some LNET
options I was experimenting with. I fixed that and, just as a test,
ran 'modprobe lustre', since I wasn't ready to fail back the partitions
just yet (I wanted to wait until activity was at its lowest). That was
earlier today. I was about to fail back tonight, yet when I checked
the server again I saw the same aacraid problems in dmesg as before.
Is it possible Lustre is interfering with aacraid? It's weird, since I
have a duplicate machine that isn't having any of these problems.

On Fri, Mar 25, 2011 at 9:55 AM, Temple Jason <jtemple at cscs.ch> wrote:
> Adaptec should have the firmware and drivers on their site for your card.
> If not Adaptec, then SOracle will have it available somewhere.
>
> The firmware and system drivers usually come with a utility that will check
> the current version and upgrade it for you.
>
> Hope this helps (I use different cards, so I can't tell you exactly).
>
> -Jason
>
> -----Original Message-----
> From: David Noriega [mailto:tsk133 at my.utsa.edu]
> Sent: Friday, 25 March 2011 15:47
> To: Temple Jason
> Subject: Re: [Lustre-discuss] aacraid kernel panic caused failover
>
> Hmm, not sure. What's the best way to find out?
>
> On Fri, Mar 25, 2011 at 9:46 AM, Temple Jason <jtemple at cscs.ch> wrote:
>> Hi,
>>
>> Are you using the latest firmware? This sort of thing used to happen
>> to me, but with different RAID cards.
>>
>> -Jason

Thomas Roth

2011-Apr-06 08:05 UTC

[Lustre-discuss] aacraid kernel panic caused failover

We have ~60 servers with these Adaptec controllers, and have found that this
problem just happens from time to time.
Upgrading the aacraid module didn't help. We were in contact with
Adaptec, but they had no clue either.
The only good thing is that this adapter panic seems to happen in an instant,
halting the machine, with no prior phase of degradation: the controller
doesn't start leaving out every second bit or writing just the '1's and
not the '0's or ... - so whatever data made it to the disks before the
crash seems to be intact. Reboot and never buy Adaptec again.

Cheers,
Thomas

-- 
--------------------------------------------------------------------
Thomas Roth
Department: Information Technology
Location: SB3 1.262
Phone: +49-6159-71 1453  Fax: +49-6159-71 2986

GSI Helmholtzzentrum für Schwerionenforschung GmbH
Planckstraße 1
64291 Darmstadt
www.gsi.de

Limited liability company (GmbH)
Registered office: Darmstadt
Commercial register: Amtsgericht Darmstadt, HRB 1528

Managing directors: Professor Dr. Dr. h.c. Horst Stöcker,
Dr. Hartmut Eickhoff

Chair of the Supervisory Board: Dr. Beatrix Vierkorn-Rudolph
Deputy: Ministerialdirigent Dr. Rolf Bernhardt

Jeff Johnson

2011-Apr-06 12:52 UTC

[Lustre-discuss] aacraid kernel panic caused failover

I have seen similar behavior on these controllers, on dissimilar configs and
systems of different ages. These happened to be non-Lustre standalone NFS and
iSCSI target boxes.

We went through controller and drive firmware upgrades, plus low-level firmware
dumps and analysis by the dev engineers.

In the end it was never really explained or resolved. It appears that these
controllers, like small children, have tantrums and fall apart. A power cycle
clears the condition.

Not the best controller for an OSS.

--Jeff

---mobile signature---
Jeff Johnson - Aeon Computing
jeff.johnson at aeoncomputing.com



David Noriega

2011-Apr-06 14:58 UTC

[Lustre-discuss] aacraid kernel panic caused failover

Our Adaptec RAID card is a Sun StorageTek RAID INT card, made by Intel
of all people. So I installed the RAID manager software, which of
course doesn't say anything is wrong, but it does come with a
monitoring daemon, and it printed this message after the last aacraid
kernel panic:

Sun StorageTek RAID Manager Agent: [203] The battery-backup cache
device needs a new battery: controller 1.

So could that be the problem?


Thomas Roth

2011-Apr-06 15:38 UTC

[Lustre-discuss] aacraid kernel panic caused failover

Provided your card is actually an Adaptec RAID controller (ours say
"Adaptec ASR 5405", not Intel or Sun), this is definitely
not the problem. We have had a number of broken or aged batteries among
our 60 or so controller cards, but never saw any correlation between the
kernel panic and the controller complaining about its BBU.
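
One way to check for such a relation in your own logs is to pull both kinds of message into a single timeline. A rough sketch, assuming the kernel messages and the RAID manager agent messages end up in the same syslog file (an assumption; the agent may log elsewhere), and using a hypothetical function name:

```shell
#!/bin/sh
# aac_timeline FILE
# Print adapter-panic and battery-warning lines from FILE in log order,
# each prefixed with a tag so the two kinds are easy to tell apart.
aac_timeline() {
    awk '
        /Host adapter BLINK LED/ { print "PANIC   " $0 }
        /needs a new battery/    { print "BATTERY " $0 }
    ' "$1"
}
```

If the BATTERY lines show up on machines that never panic, or the PANIC lines on machines with healthy batteries, the battery is unlikely to be the cause.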

Cheers,
Thomas

On 04/06/2011 04:58 PM, David Noriega wrote:> Our adaptec raid card is a Sun StorageTek RAID INT card, made by intel
> of all people. So I installed the raid manager software, which of
> course doesn''t say anything is wrong, but it does come with a
> monitoring daemon and it printed this message after the last aacraid
> kernel panic:
>
> Sun StorageTek RAID Manager Agent: [203] The battery-backup cache
> device needs a new battery: controller 1.
>
> So could that be the problem?
>
> On Wed, Apr 6, 2011 at 7:52 AM, Jeff Johnson
> <jeff.johnson at aeoncomputing.com>  wrote:
>> I have seen similar behavior on these controllers. On dissimilar
configs and different aged systems. These happened to be non-Lustre standalone
nfs and iscsi target boxes.
>>
>> Went through controller and drive firmware upgrades, low-level fw dumps
and analysis from dev engineers.
>>
>> In the end it was never really explained or resolved. It appears that
these controllers, like small children, have tantrums and fall apart. A power
cycle clears the condition.
>>
>> Not the best controller for an OSS.
>>
>> --Jeff
>>
>> ---mobile signature---
>> Jeff Johnson - Aeon Computing
>> jeff.johnson at aeoncomputing.com
>>
>>
>> On Apr 6, 2011, at 1:05, Thomas Roth<t.roth at gsi.de>  wrote:
>>
>>> We have ~ 60 servers with these Adaptec controllers, and found this
problem just to happen from time to time.
>>> Upgrade of the aacraid module wouldn''t help. We had
contacts to Adaptec, but they had no clue either.
>>> Only good thing is it seems that this adapter panic happens in an
instant, halting the machine, but has no prior phase of degradation: the
controller
>>> doesn''t start leaving out every second bit or just writing
the ''1''s and not the ''0''s or ... - so
whatever data has made it to the disks before the
>>> crash seems to be quite sensible. Reboot and never buy Adaptec
again.
>>>
>>> Cheers,
>>> Thomas
>>>
>>> On 04/06/2011 07:03 AM, David Noriega wrote:
>>>> Ok I updated the aacraid driver and the raid firmware, yet I
still had
>>>> the problem happen, so I did more research and applied the
following
>>>> tweaks:
>>>>
>>>> 1) Rebuilt mkinitrd with the following options:
>>>> a) edit /etc/sysconfig/mkinitrid/multipath to contain
MULTIPATH=yes
>>>> b) mkinitrid initrd-2.6.18-194.3.1.el5_lustre.1.8.4.img
>>>> 2.6.18-194.3.1.el5_lustre.1.8.4 --preload=scsi_dh_rdac
>>>> 2) Added the local hard disk to the multipath black list
>>>> 3) Edited modprobe.conf to have the following aacraid options:
>>>> options aacraid firmware_debug=2 startup_timeout=60 #the debug
doesn''t
>>>> seem to print anything to dmesg
>>>> 4) Added pcie_aspm=off to the kernel boot options
>>>>
>>>> So things looked good for a while. I did have a problem
mounting the
>>>> lustre partitions but this was my fault in misconfiguring some
lnet
>>>> options I was experimenting with. I fixed that and just as a
test, I
>>>> ran ''modprobe lustre'' since I wasn''t
ready to fail back the partitions
>>>> just yet(wanted to wait till when activity was the lowest).
That was
>>>> earlier today. I was about to fail back tonight, yet when I
checked
>>>> the server again I saw in dmesg the same aacraid problems from
before.
>>>> Is it possible lustre is interfering with aacraid? Its weird
since I
>>>> do have a duplicate machine and its not having any of thise
problems.
>>>>
>>>> On Fri, Mar 25, 2011 at 9:55 AM, Temple  Jason<jtemple at
cscs.ch>  wrote:
>>>>> Adaptec should have the firmware and drivers on their site
for your card.  If not adaptec, then SOracle will have it available somewhere.
>>>>>
>>>>> The firmware and system drivers usually have a utility that
will check the current version and upgrade it for you.
>>>>>
>>>>> Hope this helps (I use different cards, so I can''t
tell you exactly).
>>>>>
>>>>> -Jason
>>>>>
>>>>> -----Original Message-----
>>>>> From: David Noriega [mailto:tsk133 at my.utsa.edu]
>>>>> Sent: venerd?, 25. marzo 2011 15:47
>>>>> To: Temple Jason
>>>>> Subject: Re: [Lustre-discuss] aacraid kernel panic caused
failover
>>>>>
>>>>> Hmm not sure, whats the best way to find out?
>>>>>
>>>>> On Fri, Mar 25, 2011 at 9:46 AM, Temple  Jason<jtemple
at cscs.ch>  wrote:
>>>>>> Hi,
>>>>>>
>>>>>> Are you using the latest firmware?  This sort of thing
used to happen to me, but with different raid cards.
>>>>>>
>>>>>> -Jason
>>>>>>

-- 
--------------------------------------------------------------------
Thomas Roth           IT-HPC-Linux
Location: SB3 1.262   Phone: +49-6159-71 1453


http://twitter.com/gsi_it

David Noriega

2011-Apr-06 17:26 UTC

[Lustre-discuss] aacraid kernel panic caused failover

It is Adaptec-based, just branded by Sun and built by Intel. Anyway, I
reseated the card and will wait and see. If it still goes wonky, is
there a card anyone recommends? It has to be a low-profile PCIe x8 card
with two internal x4 SAS connectors.
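
While waiting to see whether the reseat helped, the kernel log can be checked for the failure signature. This is only a rough sketch: the patterns are the four messages quoted at the start of this thread, and the function name `aacraid_events` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: count lines in a kernel log that match the
# aacraid failure messages quoted in this thread.
# Reads the file given as $1, or stdin when no argument is given.
aacraid_events() {
    pattern='aacraid: Host adapter (abort|reset) request|AAC: Host adapter BLINK LED|AAC[0-9]+: adapter kernel panic'
    if [ "$#" -gt 0 ]; then
        # grep -c exits 1 when the count is 0; treat that as success.
        grep -c -E "$pattern" "$1" || [ "$?" -eq 1 ]
    else
        grep -c -E "$pattern" || [ "$?" -eq 1 ]
    fi
}
```

Usage would be something like `dmesg | aacraid_events`, or `aacraid_events /var/log/messages` if that is where syslog writes kernel messages on the OSS (an assumption about the local setup). A count above zero after the reseat means the controller is still resetting.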

On Wed, Apr 6, 2011 at 10:38 AM, Thomas Roth <t.roth at gsi.de> wrote:
> Provided your card is actually an Adaptec RAID controller (it says
> "Adaptec ASR 5405" on our cards, not Intel or Sun), this is definitely
> not the problem. We have had a number of broken or aged batteries amongst
> our 60 or so controller cards, but never any relation between the kernel
> panic and the controller complaining about its BBU.
>
> Cheers,
> Thomas
>
> On 04/06/2011 04:58 PM, David Noriega wrote:
>> Our Adaptec raid card is a Sun StorageTek RAID INT card, made by Intel
>> of all people. So I installed the raid manager software, which of
>> course doesn't say anything is wrong, but it does come with a
>> monitoring daemon and it printed this message after the last aacraid
>> kernel panic:
>>
>> Sun StorageTek RAID Manager Agent: [203] The battery-backup cache
>> device needs a new battery: controller 1.
>>
>> So could that be the problem?
>>
>> On Wed, Apr 6, 2011 at 7:52 AM, Jeff Johnson
>> <jeff.johnson at aeoncomputing.com> wrote:
>>> I have seen similar behavior on these controllers, on dissimilar
>>> configs and different-aged systems. These happened to be non-Lustre
>>> standalone NFS and iSCSI target boxes.
>>>
>>> Went through controller and drive firmware upgrades, low-level fw
>>> dumps and analysis from dev engineers.
>>>
>>> In the end it was never really explained or resolved. It appears
>>> that these controllers, like small children, have tantrums and fall
>>> apart. A power cycle clears the condition.
>>>
>>> Not the best controller for an OSS.
>>>
>>> --Jeff
>>>
>>> ---mobile signature---
>>> Jeff Johnson - Aeon Computing
>>> jeff.johnson at aeoncomputing.com
>>>
>>>
>>> On Apr 6, 2011, at 1:05, Thomas Roth <t.roth at gsi.de> wrote:
>>>
>>>> We have ~60 servers with these Adaptec controllers, and found
>>>> that this problem just happens from time to time.
>>>> Upgrading the aacraid module wouldn't help. We had contacts at
>>>> Adaptec, but they had no clue either.
>>>> The only good thing is that this adapter panic seems to happen in
>>>> an instant, halting the machine, with no prior phase of
>>>> degradation: the controller doesn't start leaving out every second
>>>> bit or just writing the '1's and not the '0's or ..., so whatever
>>>> data has made it to the disks before the crash seems to be quite
>>>> sensible. Reboot and never buy Adaptec again.
>>>>
>>>> Cheers,
>>>> Thomas
>>>>
>>>> On 04/06/2011 07:03 AM, David Noriega wrote:
>>>>> Ok I updated the aacraid driver and the raid firmware, yet I still
>>>>> had the problem happen, so I did more research and applied the
>>>>> following tweaks:
>>>>>
>>>>> 1) Rebuilt the initrd with the following options:
>>>>> a) edit /etc/sysconfig/mkinitrd/multipath to contain MULTIPATH=yes
>>>>> b) mkinitrd initrd-2.6.18-194.3.1.el5_lustre.1.8.4.img
>>>>> 2.6.18-194.3.1.el5_lustre.1.8.4 --preload=scsi_dh_rdac
>>>>> 2) Added the local hard disk to the multipath blacklist
>>>>> 3) Edited modprobe.conf to have the following aacraid options:
>>>>> options aacraid firmware_debug=2 startup_timeout=60 #the debug
>>>>> doesn't seem to print anything to dmesg
>>>>> 4) Added pcie_aspm=off to the kernel boot options
>>>>>
>>>>> So things looked good for a while. I did have a problem mounting
>>>>> the lustre partitions but this was my fault in misconfiguring some
>>>>> lnet options I was experimenting with. I fixed that and just as a
>>>>> test, I ran 'modprobe lustre' since I wasn't ready to fail back the
>>>>> partitions just yet (wanted to wait till when activity was the
>>>>> lowest). That was earlier today. I was about to fail back tonight,
>>>>> yet when I checked the server again I saw in dmesg the same aacraid
>>>>> problems from before. Is it possible lustre is interfering with
>>>>> aacraid? It's weird since I do have a duplicate machine and it's
>>>>> not having any of these problems.
>>>>>
>>>>> On Fri, Mar 25, 2011 at 9:55 AM, Temple Jason <jtemple at cscs.ch> wrote:
>>>>>> Adaptec should have the firmware and drivers on their site for
>>>>>> your card. If not Adaptec, then SOracle will have it available
>>>>>> somewhere.
>>>>>>
>>>>>> The firmware and system drivers usually have a utility that will
>>>>>> check the current version and upgrade it for you.
>>>>>>
>>>>>> Hope this helps (I use different cards, so I can't tell you
>>>>>> exactly).
>>>>>>
>>>>>> -Jason
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: David Noriega [mailto:tsk133 at my.utsa.edu]
>>>>>> Sent: Friday, 25 March 2011 15:47
>>>>>> To: Temple Jason
>>>>>> Subject: Re: [Lustre-discuss] aacraid kernel panic caused failover
>>>>>>
>>>>>> Hmm not sure, what's the best way to find out?
>>>>>>
>>>>>> On Fri, Mar 25, 2011 at 9:46 AM, Temple Jason <jtemple at cscs.ch> wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> Are you using the latest firmware? This sort of thing used to
>>>>>>> happen to me, but with different raid cards.
>>>>>>>
>>>>>>> -Jason
>>>>>>>


-- 
Personally, I liked the university. They gave us money and facilities,
we didn't have to produce anything! You've never been out of college!
You don't know what it's like out there! I've worked in the private
sector. They expect results. -Ray Ghostbusters
