thr3ads.net - freebsd stable - Several issues on Dell 1950/2950 servers (6-STABLE and 7-CURRENT) [Sep 2006]

Alex Salazar

2006-Sep-02 09:23 UTC

Several issues on Dell 1950/2950 servers (6-STABLE and 7-CURRENT)

Apologies for the long message, and thanks in advance for any response.

I've just bought one of those new generation Dell servers, specifically,
the PowerEdge 1950.

It has two dual-core Intel Xeon 5050 CPUs (3.0 GHz, 667 MHz FSB) and
1 GB of 533 MHz RAM.

The server has an LSI Logic SAS 5/i integrated adapter and two embedded
Broadcom NetXtreme II 5708 Gigabit Ethernet NICs.

When I tried to install from a FreeBSD 6.0-RELEASE i386 CD I had at hand,
no hard disc was detected.

After finding out that the SAS controller was not supported in that release,
I grabbed the most recent 6.1-STABLE i386 snapshot (200608) and tried again.
This time, the hard disc was detected properly.

The installation succeeded and, after the post-install configuration,
the system was restarted.

The OS booted up and the SAS controller was now detected and supported by
the mpt(4) driver:
---
mpt0: <LSILogic SAS Adapter> port 0xec00-0xecff mem 0xfc4fc000-0xfc4fffff,
0xfc4e0000-0xfc4effff irq 64 at device 8.0 on pci2
mpt0: Reserved 0x100 bytes for rid 0x10 type 4 at 0xec00
mpt0: Reserved 0x4000 bytes for rid 0x14 type 3 at 0xfc4fc000
mpt0: [GIANT-LOCKED]
mpt0: MPI Version=1.5.12.0
---

And the related errors showed up immediately, for the first time:
---
mpt0: mpt_cam_event: 0x16
mpt0: Unhandled Event Notify Frame. Event 0x16 (ACK not required).
mpt0: mpt_cam_event: 0x12
mpt0: Unhandled Event Notify Frame. Event 0x12 (ACK not required).
mpt0: mpt_cam_event: MPI_EVENT_SAS_DEVICE_STATUS_CHANGE
mpt0: mpt_cam_event: MPI_EVENT_SAS_DEVICE_STATUS_CHANGE
mpt0: mpt_cam_event: 0x16
mpt0: Unhandled Event Notify Frame. Event 0x16 (ACK not required).
-- 

When the boot process reached the SCSI probe, there was no activity on the
screen for about five minutes, so I was forced to use the power button.
After rebooting, the same symptoms appeared, so I rebooted the machine once
again, this time in verbose mode.

This debug information was printed on the screen one character at a time,
at about 1 char/sec:

(probe8:mpt0:0:8:0): error 22
(probe8:mpt0:0:8:0): Unretryable Error
(probe8:mpt0:0:8:0): error 22
(probe8:mpt0:0:8:0): Unretryable Error
(probe0:mpt0:0:0:1): error 22
(probe0:mpt0:0:0:1): Unretryable Error
(probe1:mpt0:0:8:1): Unexpected Bus Free
(probe1:mpt0:0:8:1): Retrying Command
...
(probe0:mpt0:0:8:7): Unexpected Bus Free
(probe0:mpt0:0:8:7): Retrying Command
(probe0:mpt0:0:8:7): Unexpected Bus Free
(probe0:mpt0:0:8:7): Retrying Command
(probe0:mpt0:0:8:7): Unexpected Bus Free
(probe0:mpt0:0:8:7): Retrying Command
(probe0:mpt0:0:8:7): Unexpected Bus Free
(probe0:mpt0:0:8:7): Retrying Command
(probe0:mpt0:0:8:7): Unexpected Bus Free
(probe0:mpt0:0:8:7): error 5
(probe0:mpt0:0:8:7): Retries Exausted

After 18 (eighteen) minutes, the error messages ceased and the boot process
continued as usual:
---
pass0 at mpt0 bus 0 target 0 lun 0
pass0: <MAXTOR ATLAS15K2_073SAS BP00> Fixed Direct Access SCSI-5 device
pass0: Serial Number K40C1Q5K
pass0: 300.000MB/s transfers, Tagged Queueing Enabled
pass1 at mpt0 bus 0 target 8 lun 0
pass1: <DP BACKPLANE 1.00> Fixed Enclosure Services SCSI-5 device
pass1: 300.000MB/s transfers, Tagged Queueing Enabled
ses0 at mpt0 bus 0 target 8 lun 0
ses0: <DP BACKPLANE 1.00> Fixed Enclosure Services SCSI-5 device
ses0: 300.000MB/s transfers, Tagged Queueing Enabled
ses0: SCSI-3 SES Device
GEOM: new disk da0
da0 at mpt0 bus 0 target 0 lun 0
da0: <MAXTOR ATLAS15K2_073SAS BP00> Fixed Direct Access SCSI-5 device
da0: Serial Number K40C1Q5K
da0: 300.000MB/s transfers, Tagged Queueing Enabled
da0: 70007MB (143374650 512 byte sectors: 255H 63S/T 8924C)
---

As a workaround, I disabled the APIC (hint.apic.0.disabled),
and the ~15-minute delay at boot was gone. Fine.

(BTW, 7-CURRENT has the same problem, but without that huge delay)

Once I was logged in to the server, I proceeded to populate my ports tree
using portsnap(8). When I extracted the tarball (portsnap extract), the
following error message appeared repeatedly, at about 1 message per second:

mpt0: Unhandled Event Notify Frame. Event 0xe (ACK not required).

Once in a while, an error like the one below showed up:
-- 
(da0:mpt0:0:0:0): WRITE(10). CDB: 2a 0 1 55 6f 5f 0 0 20 0
(da0:mpt0:0:0:0): CAM Status: SCSI Status Error
(da0:mpt0:0:0:0): SCSI Status: Check Condition
(da0:mpt0:0:0:0): UNIT ATTENTION asc:29,2
(da0:mpt0:0:0:0): Scsi bus reset occurred
-- 
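
For reference, the CDB in the WRITE(10) line above can be decoded by hand.
The following is a minimal Python sketch, assuming the standard 10-byte
WRITE(10) layout (opcode, flags, 4-byte big-endian LBA, group number,
2-byte big-endian transfer length, control) and the 512-byte sector size
that the kernel reports for da0. The function name decode_write10 is made
up for illustration.

```python
def decode_write10(cdb_text):
    """Decode a WRITE(10) CDB given as space-separated hex bytes."""
    cdb = [int(b, 16) for b in cdb_text.split()]
    if len(cdb) != 10 or cdb[0] != 0x2A:
        raise ValueError("not a 10-byte WRITE(10) CDB")
    # Bytes 2..5 hold the logical block address, most significant byte first.
    lba = int.from_bytes(bytes(cdb[2:6]), "big")
    # Bytes 7..8 hold the transfer length in blocks.
    blocks = int.from_bytes(bytes(cdb[7:9]), "big")
    return {"opcode": cdb[0], "lba": lba, "blocks": blocks}


# CDB copied from the kernel message above.
info = decode_write10("2a 0 1 55 6f 5f 0 0 20 0")
print(info["lba"], info["blocks"], info["blocks"] * 512)
```

For this CDB the sketch yields LBA 22376287 and a 32-block (16384-byte)
transfer, which is inside the 143374650 sectors reported for da0, so the
request itself looks like an ordinary write that was hit by the bus reset.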

After running the diagnostics included on the utility CDs shipped with this
server, I concluded this was a software issue:
-- 
Device Name     : SAS Disk 0:0
Description     : SAS MAXTOR ATLAS15K2_073SAS
Test Name       : Disk Self Test
Passes          : 1
Result          : passed
Start Time      : Mon Aug 21 02:04:06 2006
Completion Time : Mon Aug 21 02:26:12 2006
Result Event    : The test operation completed successfully
-- 

(In order to run those diagnostics, I had to install SUSE Linux
Enterprise Server 9, which also shipped with this machine.)

After reinstalling FreeBSD, I logged remotely into the server, via ssh,
and fetched the ports snapshot again and extracted once more.

Suddenly, the screen activity ceased and the network connection timed out.

Locally, on the server, there were a lot of mpt(4) errors and warnings.
---
(da0:mpt0:0:0:0): CAM Status 0x18
(da0:mpt0:0:0:0): Retrying Command
(... and about 500 more lines like those...)
---

Then came some bce(4) errors, which caused the network interface to be reset:
---
bce0: ../../../dev/bce/if_bce.c(5032): Watchdog timeout occurred, resetting
bce0: link state changed to DOWN
bce0: Gigabit link up
bce0: link state changed to UP
bce0: ../../../dev/bce/if_bce.c(5032): Watchdog timeout occurred, resetting
---

And finally, these errors from mpt(4):

---
request 0xc4c4a080:44717 timed out for ccb 0xc4e41400 (req->ccb 0xc4e41400)
request 0xc4c4b430:44718 timed out for ccb 0xc4ca5800 (req->ccb 0xc4ca5800)
request 0xc4c4cd80:44719 timed out for ccb 0xc4c52800 (req->ccb 0xc4c52800)
(... and about 300 more lines like those ...)
---

which were followed by the same number of lines like these:
---
mpt0: completing timedout/aborted req 0xc4c4a080:44717
mpt0: completing timedout/aborted req 0xc4c4b430:44718
mpt0: completing timedout/aborted req 0xc4c4cd80:44719
---

and finishing with this line:
---
mpt0: Timedout requests already complete. Interrupts may not be functioning.
---
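
To check whether every timed-out request was later reported as completed,
the two kinds of lines can be paired up by request address and sequence
number. This is a minimal Python sketch that only assumes the message
formats shown in the excerpts above; the function name unmatched_timeouts
is hypothetical, and the sample holds just the three pairs quoted here,
not the full log.

```python
import re

TIMED_OUT = re.compile(r"request (0x[0-9a-f]+):(\d+) timed out")
COMPLETED = re.compile(r"completing timedout/aborted req (0x[0-9a-f]+):(\d+)")


def unmatched_timeouts(lines):
    """Return (address, serial) pairs that timed out but were never completed."""
    timed_out, completed = set(), set()
    for line in lines:
        m = TIMED_OUT.search(line)
        if m:
            timed_out.add((m.group(1), int(m.group(2))))
            continue
        m = COMPLETED.search(line)
        if m:
            completed.add((m.group(1), int(m.group(2))))
    return sorted(timed_out - completed)


sample = [
    "request 0xc4c4a080:44717 timed out for ccb 0xc4e41400 (req->ccb 0xc4e41400)",
    "request 0xc4c4b430:44718 timed out for ccb 0xc4ca5800 (req->ccb 0xc4ca5800)",
    "request 0xc4c4cd80:44719 timed out for ccb 0xc4c52800 (req->ccb 0xc4c52800)",
    "mpt0: completing timedout/aborted req 0xc4c4a080:44717",
    "mpt0: completing timedout/aborted req 0xc4c4b430:44718",
    "mpt0: completing timedout/aborted req 0xc4c4cd80:44719",
]
missing = unmatched_timeouts(sample)
print(missing)
```

On the three pairs quoted above nothing is left unmatched, which agrees
with the driver's closing remark that the timed-out requests had already
completed.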


After an hour and a half, the system was still unstable and I was forced to
reboot it.




Those are the main issues (mpt(4) and bce(4)) with this hardware
configuration and FreeBSD (6-STABLE and 7-CURRENT). However,
two other problems showed up as well.

1. The first network interface (labeled "Gb 1" on the server's case)
was detected as bce1, and the second one (Gb 2) as bce0.
---
bce0: <Broadcom NetXtreme II BCM5708 1000Base-T (B1), v0.9.6>
     mem 0xf4000000-0xf5ffffff irq 16 at device 0.0 on pci8
bce0: Reserved 0x2000000 bytes for rid 0x10 type 3 at 0xf4000000
bce0: ASIC ID 0x57081010; Revision (B1); PCI-X 64-bit 133MHz
miibus0: <MII bus> on bce0
brgphy0: <BCM5708C 10/100/1000baseTX PHY> on miibus0
brgphy0:  10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX,
         1000baseTX, 1000baseTX-FDX, auto
bce0: bpf attached
bce0: Ethernet address: 00:13:72:f9:xx:xx
bce0: [MPSAFE]
...
bce1: <Broadcom NetXtreme II BCM5708 1000Base-T (B1), v0.9.6>
     mem 0xf8000000-0xf9ffffff irq 16 at device 0.0 on pci4
bce1: Reserved 0x2000000 bytes for rid 0x10 type 3 at 0xf8000000
bce1: ASIC ID 0x57081010; Revision (B1); PCI-X 64-bit 133MHz
miibus1: <MII bus> on bce1
brgphy1: <BCM5708C 10/100/1000baseTX PHY> on miibus1
brgphy1:  10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX,
         1000baseTX, 1000baseTX-FDX, auto
bce1: bpf attached
bce1: Ethernet address: 00:13:72:f9:xx:xx
bce1: [MPSAFE]
---

According to this log, bce0 is on pci8, while bce1 is on pci4.
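
The unit-to-bus mapping can also be pulled out of a saved dmesg
mechanically. This is a minimal Python sketch, assuming attach messages
shaped like the ones above, where the "at device X on pciN" part may sit
on an indented continuation line; the function name pci_locations is
invented for the example.

```python
import re


def pci_locations(dmesg_text):
    """Map a device unit (e.g. 'bce0') to its (pci bus, device.function)."""
    locations = {}
    current = None
    for line in dmesg_text.splitlines():
        m = re.match(r"(\w+\d+): ", line)
        if m:
            current = m.group(1)      # line starts a message for this unit
        elif not line[:1].isspace():
            current = None            # unrelated line, forget the unit
        loc = re.search(r"at device (\d+\.\d+) on (pci\d+)", line)
        if loc and current:
            locations[current] = (loc.group(2), loc.group(1))
    return locations


sample = """\
bce0: <Broadcom NetXtreme II BCM5708 1000Base-T (B1), v0.9.6>
     mem 0xf4000000-0xf5ffffff irq 16 at device 0.0 on pci8
miibus0: <MII bus> on bce0
bce1: <Broadcom NetXtreme II BCM5708 1000Base-T (B1), v0.9.6>
     mem 0xf8000000-0xf9ffffff irq 16 at device 0.0 on pci4
"""
found = pci_locations(sample)
print(found)
```

For the excerpt above this reports bce0 on pci8 and bce1 on pci4, the
same mapping read off the log by hand.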


2. Sometimes, the server refuses to be halted or rebooted via
shutdown(8) or reboot(8).


Any hint will be appreciated :)


/var/run/dmesg.boot (verbose log)
http://bsdero.tripod.com/dmesg.boot.txt

/var/log/messages (with remarks)
http://bsdero.tripod.com/messages.txt

pciconf -lv output
http://bsdero.tripod.com/pciconf.txt

-- 
Alex Salazar

Matthew Jacob

2006-Sep-02 19:21 UTC

>
> The OS booted up and the SAS controller was now detected and supported by
> the mpt(4) driver:
> ---
> mpt0: <LSILogic SAS Adapter> port 0xec00-0xecff mem 0xfc4fc000-0xfc4fffff,
> 0xfc4e0000-0xfc4effff irq 64 at device 8.0 on pci2
> mpt0: Reserved 0x100 bytes for rid 0x10 type 4 at 0xec00
> mpt0: Reserved 0x4000 bytes for rid 0x14 type 3 at 0xfc4fc000
> mpt0: [GIANT-LOCKED]
> mpt0: MPI Version=1.5.12.0
> ---
>
> And the related errors showed up immediately, for the first time:
> ---
> mpt0: mpt_cam_event: 0x16
> mpt0: Unhandled Event Notify Frame. Event 0x16 (ACK not required).
> mpt0: mpt_cam_event: 0x12
> mpt0: Unhandled Event Notify Frame. Event 0x12 (ACK not required).
> mpt0: mpt_cam_event: MPI_EVENT_SAS_DEVICE_STATUS_CHANGE
> mpt0: mpt_cam_event: MPI_EVENT_SAS_DEVICE_STATUS_CHANGE
> mpt0: mpt_cam_event: 0x16
> mpt0: Unhandled Event Notify Frame. Event 0x16 (ACK not required).
> --
These are device arrival events.
>
> When the bootstrap process reached the SCSI probe, there were
> no activity on the screen for about five minutes, so I was forced to use
> the power off button, and after rebooting, the same symptoms were evident,
> so I rebooted the machine once again, this time in verbose mode.
>
> This debug information was being printed on the screen, one character at time,
> at about 1 char/sec:
>
> (probe8:mpt0:0:8:0): error 22
What's at target 8? It isn't happy for a variety of reasons. Oh, I see
from below: it's an SES instance that drops dead if given something
at a lun > 0.
> (probe8:mpt0:0:8:0): Unretryable Error
> ---
> pass0 at mpt0 bus 0 target 0 lun 0
> pass0: <MAXTOR ATLAS15K2_073SAS BP00> Fixed Direct Access SCSI-5 device
>
> As a workaround, I disabled the APICs (hint.apic.0.disabled),
> and that ~15 minutes delay at boot up, now was gone. Fine.
>
> (BTW, 7-CURRENT has the same problem, but without that huge delay)
Do you have APIC disabled for 7-CURRENT also?
>
> Once I was logged in the server, I proceeded to populate my ports tree,
> by using portsnap(8), so, when I extracted the tarball (portsnap extract),
> there was a lot of the following error message, at about 1 message per second:
>
> mpt0: Unhandled Event Notify Frame. Event 0xe (ACK not required).
Queue Full events from the SAS firmware.
>
> Once in a while, an error message like below, showed up:
> --
> (da0:mpt0:0:0:0): WRITE(10). CDB: 2a 0 1 55 6f 5f 0 0 20 0
> (da0:mpt0:0:0:0): CAM Status: SCSI Status Error
> (da0:mpt0:0:0:0): SCSI Status: Check Condition
> (da0:mpt0:0:0:0): UNIT ATTENTION asc:29,2
> (da0:mpt0:0:0:0): Scsi bus reset occurred
Somebody is resetting the bus periodically. We (FreeBSD) aren't
doing this deliberately, as far as I'm aware.
> In order to perform those diagnostics, I had to install a SuSe Linux
> Enterprise Server 9, which was also shipped with this machine)
Which is a good way of saying that LSI-Logic support isn't very
evident on FreeBSD.
>
> After reinstalling FreeBSD, I logged remotely into the server, via ssh,
> and fetched the ports snapshot again and extracted once more.
>
> Suddenly, the screen activity ceased and the network connection timed out.
>
> Locally, on the server, there was a lot of mpt(4) errors and warnings.
> ---
> (da0:mpt0:0:0:0): CAM Status 0x18
> (da0:mpt0:0:0:0): Retrying Command
> (... and about 500 more lines like those...)
Hmm.
> ---
>
> And finally, those errors from mpt(4):
>
> ---
> request 0xc4c4a080:44717 timed out for ccb 0xc4e41400 (req->ccb 0xc4e41400)
> request 0xc4c4b430:44718 timed out for ccb 0xc4ca5800 (req->ccb 0xc4ca5800)
> request 0xc4c4cd80:44719 timed out for ccb 0xc4c52800 (req->ccb 0xc4c52800)
> (... and about 300 more lines like those ...)
> ---
>
> which were followed by the same number of lines like these:
> ---
> mpt0: completing timedout/aborted req 0xc4c4a080:44717
> mpt0: completing timedout/aborted req 0xc4c4b430:44718
> mpt0: completing timedout/aborted req 0xc4c4cd80:44719
> ---
>
> and finishing with this line:
> ---
> mpt0: Timedout requests already complete. Interrupts may not be functioning.
> ---
>
I've seen this on Supermicro EM64T in the past on 7-current, but that
went away about 3-4 weeks ago. It really seemed to me that this was
indeed an interrupt related problem.

Yup, sounds like a mess here.
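
To see how often each unhandled event shows up, the codes can be tallied
from a saved log. This is a minimal Python sketch; the labels are taken
only from the remarks in this reply (not from the MPI headers), the
function name tally_events is hypothetical, and the sample lines are
copied from the quoted excerpts rather than from the full log.

```python
import re
from collections import Counter

# Labels as described in this reply; anything else is reported as unknown.
LABELS = {
    0x16: "device arrival event",
    0x12: "device arrival event",
    0x0E: "Queue Full event from the SAS firmware",
}
EVENT = re.compile(r"Unhandled Event Notify Frame\. Event (0x[0-9a-f]+)")


def tally_events(lines):
    """Count unhandled event codes found in mpt(4) log lines."""
    counts = Counter()
    for line in lines:
        m = EVENT.search(line)
        if m:
            counts[int(m.group(1), 16)] += 1
    return counts


sample = [
    "mpt0: Unhandled Event Notify Frame. Event 0x16 (ACK not required).",
    "mpt0: Unhandled Event Notify Frame. Event 0x12 (ACK not required).",
    "mpt0: Unhandled Event Notify Frame. Event 0x16 (ACK not required).",
    "mpt0: Unhandled Event Notify Frame. Event 0xe (ACK not required).",
]
counts = tally_events(sample)
for code, n in sorted(counts.items()):
    print(hex(code), n, LABELS.get(code, "unknown"))
```

A steady stream of 0xe entries during heavy disk writes would fit the
Queue Full reading above, but the tally alone does not show the cause.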

Morten A. Middelthon

2006-Sep-04 05:49 UTC

On Sat, Sep 02, 2006 at 04:23:21AM -0500, Alex Salazar wrote:
> Apologies for the long message, and thanks in advance for any response.
> 
> I've just bought one of those new generation Dell servers, specifically,
> the PowerEdge 1950.
> 
> This is a dual Intel Dual Core Xeon 5050, 3.0 GHz, 667MHz FSB,
> 1GB 533MHz RAM, system.
> 
> This server has a LSI Logic SAS 5/i integrated adapter and dual embedded
> Broadcom NetXtreme II 5708 Gigabit Ethernet NIC.
<snip>

Just wanted to say that I'm running FreeBSD/i386 6.1-STABLE on two Dell PE 1950s
without any problems. The only thing I had to do was update the bce driver for
the NIC; other than that, the mfi RAID controller is detected properly, as are
the SAS disks and the RAID array. The only difference I can think of is perhaps
the firmware and BIOS versions.

Attached is the dmesg output from one of the two 1950's I've got.

with regards,

-- 
Morten A. Middelthon

"I have been Foolish and Deluded,
and I am a Bear of No Brain at All." 
		-- Pooh

Morten A. Middelthon

2006-Sep-04 06:13 UTC

[Message body not archived: content of type multipart/mixed was skipped.]

Alex Salazar

2006-Sep-15 22:46 UTC

On 9/15/06, Conrad Burger <conrad.burger@mxit.com> wrote:
> Hi
>
> Has anyone been able to solve the problem with the bce driver on 6.1-stable?
>
> I am running 6.1-stable(200609013)AMD64 on a Dell 1950 with a SMP kernel.
>
> The system boots up fine. When I copy data to an nfs mount the bce network
> interface times out and then resets. It never recovers from this state.
>
> From /var/log/messages-----------
> Sep 13 05:01:13 gold kernel: bce0: /usr/src/sys/dev/bce/if_bce.c(5032): Watchdog timeout occurred, resetting!
> Sep 13 05:01:13 gold kernel: bce0: link state changed to DOWN
> Sep 13 05:01:16 gold kernel: bce0: link state changed to UP
> Sep 13 05:02:41 gold kernel: bce0: /usr/src/sys/dev/bce/if_bce.c(5032): Watchdog timeout occurred, resetting!
> Sep 13 05:02:41 gold kernel: bce0: link state changed to DOWN
> Sep 13 05:02:44 gold kernel: bce0: link state changed to UP
>
> Any help would be appreciated.
>
> Regards
> Conrad
>
I was almost sure this problem had been fixed in the 6-STABLE version
of sys/dev/bce/ a month or so ago. I can't confirm it, however, since
I'm currently running 7-CURRENT (i386) on my PE 1950, and this behaviour is
not present in that version.

By the way, is your server equipped with an LSI SAS 5/i controller or a
Dell PERC 5/i?

-- 
Alex Salazar
BSD México
www.bsd.org.mx
