thr3ads.net - freebsd stable - FreeBSD-6 amr and ahd trouble [Nov 2005]

Joerg Pulz

2005-Nov-15 08:04 UTC

FreeBSD-6 amr and ahd trouble

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


Hi guys,

I'm running a Fujitsu-Siemens Primergy RX300 server (dual Xeon, 
hyperthreading enabled) with an onboard LSI MegaRAID controller and an 
Adaptec 39320A Ultra320 dual-channel SCSI adapter. The LSI MegaRAID 
controller is configured as RAID1 with two disks and one hot spare. 
FreeBSD is installed on this array.
Until now, the system has run fine, first with FreeBSD-5.3 and now with 
FreeBSD-5.4.
I tried to upgrade this beast to FreeBSD-6.0-RELEASE without success. The 
kernel boots and detects all devices correctly, but when it comes to 
reading from the amr(4) device, the last thing I see is "GEOM: new disk 
amrd0". After that the system "hangs", and it's nearly impossible to 
scroll the kernel messages up or down (Scroll Lock pressed). Then, after 
a while, there are a lot of SCSI error messages about SCB timeouts coming 
from ahd(4).
I decided to boot the old RELENG_5_4 kernel and cvsup'ed the sources to 
RELENG_6, but I got the same result. Booting from a FreeBSD-6.0-RELEASE 
bootonly CD-ROM gave the same result again.
I searched Google about this, found a tunable sysctl/loader setting 
called hw.pci.do_powerstate and tried it, with the same result. Later I 
saw that in RELENG_6 this tunable has been renamed and is set to 0 
anyway.
The next step was removing the Adaptec card to make sure it was not 
interfering with amr(4), but the only effect was that the SCSI error 
messages went away, so this was not the problem.
I decided to give CURRENT from today a try, and it worked without any 
problems. I tested CURRENT going back step by step until I hit 700003, 
dated "Sun Sep 18 05:12:39 2005 UTC", which is exactly when the RELENG_6 
branch was marked for 6.0-BETA5, and CURRENT worked at every point I 
checked out from cvs. Unfortunately, 6.0-BETA5 is NOT working.
I checked out the sources for 6.0-BETA4 and it works again. So somewhere 
between 6.0-BETA4 and 6.0-BETA5 the whole thing broke, at least for me 
and my hardware.
I've seen some differences in sys/cam/cam_xpt.c; maybe these cause the 
trouble I have, but I'm not deep enough into the FreeBSD kernel code to 
be sure.

It would be nice if someone could take a look at this so it gets fixed in 
RELENG_6.
Any patches to test are welcome.

regards
Joerg

- -- 
The beginning is the most important part of the work.
 				-Plato
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2 (FreeBSD)

iD8DBQFDegcDSPOsGF+KA+MRAtErAJ4w6Y8jpTvd7Q0SWMDYepTCsjFq9wCgtyuW
XYxOUeRNY+DDtp7BfQOVMS8=QYI6
-----END PGP SIGNATURE-----
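
The narrowing-down described above (6.0-BETA4 boots, 6.0-BETA5 hangs) can
be continued as a bisection over source snapshots between the two tags. A
minimal Python sketch of that idea follows. The snapshot labels, the
breaking point and the boots_ok() check are hypothetical placeholders, not
real CVS dates or results from this thread. In practice boots_ok() would
mean checking out that snapshot, building a kernel and seeing whether the
machine gets past "GEOM: new disk amrd0".

```python
from typing import Callable, Sequence


def first_bad(checkpoints: Sequence[str],
              boots_ok: Callable[[str], bool]) -> str:
    """Return the earliest failing checkpoint.

    Assumes checkpoints[0] is known good, checkpoints[-1] is known bad,
    and that once the breakage appears it stays in all later checkpoints.
    """
    lo, hi = 0, len(checkpoints) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if boots_ok(checkpoints[mid]):
            lo = mid  # breakage was introduced after mid
        else:
            hi = mid  # breakage is at mid or earlier
    return checkpoints[hi]


# Hypothetical snapshots between the two tags named in the message.
checkpoints = ["6.0-BETA4", "snap-a", "snap-b", "snap-c", "snap-d", "6.0-BETA5"]


def boots_ok(label: str) -> bool:
    # Placeholder result: pretend everything before "snap-c" boots.
    return checkpoints.index(label) < checkpoints.index("snap-c")


print(first_bad(checkpoints, boots_ok))
```

With six snapshots this needs two test boots instead of four, which
matters when each step is a full kernel build and reboot.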

Scott Long

2005-Nov-16 08:23 UTC

Joerg Pulz wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> 
> Hi guys,
> 
> I'm running an Fujitsu-Siemens Primergy RX300 dual-XEON hyperthreading 
> enabled server with an onboard LSI MegaRAID controller and an Adaptec 
> 39320A Ultra320 dual channel SCSI adapter. The LSI MegaRAID controller 
> is configured to RAID1 with two disk and one hotspare. On this array 
> FreeBSD is installed.
> Up to now, the system was running fine with FreeBSD-5.3 first and 
> FreeBSD-5.4 now.
> I tried to upgrade this beast to FreeBSD-6.0-RELEASE without success. 
> The kernel is booting and detects all devices correctly but when it 
> comes to read from the amr(4) the last thing i see is "GEOM: new disk 
> amrd0" after that the system "hangs" and its nearly impossible to scroll
> the kernel messages up or down (Scroll lock pressed). then after a while 
> there are a lot of SCSI error messages about SCB timeouts coming from 
> the ahd(4).
> I decided to boot the old RELENG_5_4 kernel and cvsup'ed the sources to
> RELENG_6 but i got the same results. booting from a FreeBSD-6.0-RELEASE 
> bootonly CDRom got again the same results.
> I searched google about this, and found something about a tuneable 
> sysctl/loader setting called hw.pci.do_powerstate and tried it, but the 
> same result. later i saw, that in RELENG_6 this tuneable is renamed and 
> set to 0 anyway.
> the next step was removing the Adaptec card to make sure this one is not 
> interrupting the amr(4) but the only thing that happened was the SCSI 
> error messages going away so this was not the problem.
> I decided to give CURRENT from today a try, and it was working without 
> any problems. I have tested CURRENT some steps back until i hit 700003 
> dated to "Sun Sep 18 05:12:39 2005 UTC" which is exactly the same time
> the RELENG_6 branch was marked for 6.0-BETA5 and CURRENT was working 
> with every point i checked out from cvs. Unfortunately 6.0-BETA5 is NOT 
> working.
> I checked out the sources for 6.0-BETA4 and it is working again. So 
> somewhere between 6.0-BETA4 and 6.0-BETA5 the whole thing is broken, at 
> least for me and my hardware.
> I've seen some differences in sys/cam/cam_xpt.c, maybe these cause the 
> trouble i have, but I'm not so deep in the FreeBSD kernel code to make 
> this sure.
> 
> It would be nice if someone can take a look at this to get this fixed in 
> RELENG_6.
> Any patches to test are welcome.
> 
> regards
> Joerg
> 
This is almost certainly an interrupt routing bug.  Can you try booting 
with ACPI disabled?  Can you try building a 6.0 kernel without SMP and
the 'apic' devices?  From 5.4, can you send your system information?

Scott

Joerg Pulz

2005-Nov-17 06:01 UTC

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Wed, 16 Nov 2005, Scott Long wrote:
> This is almost certainly an interrupt routing bug.  Can you try booting
> with ACPI disabled?  Can you try building a 6.0 kernel without SMP and
> the 'apic' devices?  From 5.4, can you send your system information?

Hi Scott,

I've attached the kernel messages from the different attempts.
Here is a short description.

RELENG-5.4_SMP-APIC-ACPI_verbose:
 	- custom-built kernel with SMP, apic and acpi enabled
 	- this one has been working since 5.4 came out
 	- the only thing to mention is the wrong order of the serial ports
 	  sio0 == COM2, sio1 == COM1

RELENG-6_UP-APIC-ACPI_verbose:
 	- GENERIC kernel, NO SMP, apic and acpi enabled
 	- this one isn't working
 	- the order of the serial ports is wrong too

RELENG-6_UP-APIC-NOACPI_verbose:
 	- GENERIC kernel, NO SMP, apic enabled and acpi disabled
 	- this one isn't working
 	- the order of the serial ports is correct
 	  sio0 == COM1, sio1 == COM2

RELENG-6_UP-NOAPIC-ACPI_verbose:
 	- GENERIC kernel, NO SMP, apic disabled and acpi enabled
 	- this one isn't working,
 	  it hangs at "start_init: trying /sbin/init"
 	  but probing amr(4) and ahd(4) seems to work
 	- the order of the serial ports is wrong too

RELENG-6_UP-NOAPIC-NOACPI_verbose:
 	- GENERIC kernel, NO SMP, apic and acpi disabled
 	- this one IS WORKING, I can log in and use all devices (NIC, DISK)
 	- the order of the serial ports is correct

Just to mention: the order of the serial ports is also wrong in a recent
CURRENT, but that recent CURRENT works with SMP, apic and acpi enabled.
Any hints or patches to get this working with SMP again are welcome.

regards
Joerg

- --
The beginning is the most important part of the work.
 				-Plato
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2 (FreeBSD)

iD8DBQFDfI0jSPOsGF+KA+MRAoBtAJ9eqzKu00JBCqZpp+0dYyEhJfJoIQCZAWOn
uafR1CcJyYUCXCXvUqGzrLY=XS6w
-----END PGP SIGNATURE-----
-------------- next part --------------
A non-text attachment was scrubbed...
Name: logs.tgz
Type: application/octet-stream
Size: 44500 bytes
Desc: 
Url : http://lists.freebsd.org/pipermail/freebsd-stable/attachments/20051117/923a252d/logs.obj
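
The RELENG_6 test kernels listed above can be tabulated to see which
settings the working run has in common. The sketch below is only an
illustration in Python: the Run record and the working() helper are
made-up names, and the boots flags simply restate the results reported in
the message. The 5.4 kernel is left out because it is a different branch.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Run:
    name: str    # log name as given in the message
    apic: bool   # apic enabled?
    acpi: bool   # acpi enabled?
    boots: bool  # reported as working (login possible)?


# Results as reported in the message; all four are GENERIC kernels without SMP.
RUNS = [
    Run("RELENG-6_UP-APIC-ACPI", apic=True, acpi=True, boots=False),
    Run("RELENG-6_UP-APIC-NOACPI", apic=True, acpi=False, boots=False),
    Run("RELENG-6_UP-NOAPIC-ACPI", apic=False, acpi=True, boots=False),
    Run("RELENG-6_UP-NOAPIC-NOACPI", apic=False, acpi=False, boots=True),
]


def working(runs: List[Run]) -> List[str]:
    """Names of the runs that were reported as booting to a usable system."""
    return [r.name for r in runs if r.boots]


for r in RUNS:
    apic = "on" if r.apic else "off"
    acpi = "on" if r.acpi else "off"
    status = "works" if r.boots else "fails"
    print(f"{r.name:<28} apic={apic:<3} acpi={acpi:<3} {status}")

print("working:", working(RUNS))
```

Only the run with both apic and acpi disabled is reported as working,
which is at least consistent with the interrupt routing suspicion raised
earlier in the thread.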

Michael Rogato

2005-Dec-02 22:20 UTC

I know I'm a couple of weeks late, but I've been having the same problem 
with my 300-8x. On my dual Opteron box, the system just hangs after a 
seemingly random period of time. It did kernel panic once when I was 
taking down the GEOM array. Originally I thought it might have something 
to do with GEOM, but since it has also happened outside of a GEOM array, 
I'm kind of at a loss.

Have you managed to find anything out about what exactly is causing the 
problem? I don't get any kind of error messages, so I haven't had much 
luck in tracking it down.
