Alex Salazar
2006-Sep-02 09:23 UTC
Several issues on Dell 1950/2950 servers (6-STABLE and 7-CURRENT)
Apologies for the long message, and thanks in advance for any response.

I've just bought one of those new generation Dell servers, specifically, the PowerEdge 1950.

This is a dual Intel Dual Core Xeon 5050, 3.0 GHz, 667MHz FSB, 1GB 533MHz RAM, system.

This server has an LSI Logic SAS 5/i integrated adapter and dual embedded Broadcom NetXtreme II 5708 Gigabit Ethernet NICs.

When I tried to install from a FreeBSD 6.0-RELEASE i386 CD I had at hand, no hard disc was detected. After finding out that the SAS controller was not supported on that release, I grabbed the most recent 6.1-STABLE i386 snapshot (200608) and tried again. This time, the hard disc was detected properly.

The installation succeeded and, after the post-install configuration, the system was restarted.

The OS booted up and the SAS controller was now detected and supported by the mpt(4) driver:
---
mpt0: <LSILogic SAS Adapter> port 0xec00-0xecff mem 0xfc4fc000-0xfc4fffff, 0xfc4e0000-0xfc4effff irq 64 at device 8.0 on pci2
mpt0: Reserved 0x100 bytes for rid 0x10 type 4 at 0xec00
mpt0: Reserved 0x4000 bytes for rid 0x14 type 3 at 0xfc4fc000
mpt0: [GIANT-LOCKED]
mpt0: MPI Version=1.5.12.0
---

And the related errors showed up immediately, for the first time:
---
mpt0: mpt_cam_event: 0x16
mpt0: Unhandled Event Notify Frame. Event 0x16 (ACK not required).
mpt0: mpt_cam_event: 0x12
mpt0: Unhandled Event Notify Frame. Event 0x12 (ACK not required).
mpt0: mpt_cam_event: MPI_EVENT_SAS_DEVICE_STATUS_CHANGE
mpt0: mpt_cam_event: MPI_EVENT_SAS_DEVICE_STATUS_CHANGE
mpt0: mpt_cam_event: 0x16
mpt0: Unhandled Event Notify Frame. Event 0x16 (ACK not required).
--

When the bootstrap process reached the SCSI probe, there was no activity on the screen for about five minutes, so I was forced to use the power off button. After rebooting, the same symptoms were evident, so I rebooted the machine once again, this time in verbose mode.

This debug information was being printed on the screen, one character at a time, at about 1 char/sec:

(probe8:mpt0:0:8:0): error 22
(probe8:mpt0:0:8:0): Unretryable Error
(probe8:mpt0:0:8:0): error 22
(probe8:mpt0:0:8:0): Unretryable Error
(probe0:mpt0:0:0:1): error 22
(probe0:mpt0:0:0:1): Unretryable Error
(probe1:mpt0:0:8:1): Unexpected Bus Free
(probe1:mpt0:0:8:1): Retrying Command
...
(probe0:mpt0:0:8:7): Unexpected Bus Free
(probe0:mpt0:0:8:7): Retrying Command
(probe0:mpt0:0:8:7): Unexpected Bus Free
(probe0:mpt0:0:8:7): Retrying Command
(probe0:mpt0:0:8:7): Unexpected Bus Free
(probe0:mpt0:0:8:7): Retrying Command
(probe0:mpt0:0:8:7): Unexpected Bus Free
(probe0:mpt0:0:8:7): Retrying Command
(probe0:mpt0:0:8:7): Unexpected Bus Free
(probe0:mpt0:0:8:7): error 5
(probe0:mpt0:0:8:7): Retries Exausted

After 18 (eighteen) minutes, the error messages ceased, and the boot process continued as usual:
---
pass0 at mpt0 bus 0 target 0 lun 0
pass0: <MAXTOR ATLAS15K2_073SAS BP00> Fixed Direct Access SCSI-5 device
pass0: Serial Number K40C1Q5K
pass0: 300.000MB/s transfers, Tagged Queueing Enabled
pass1 at mpt0 bus 0 target 8 lun 0
pass1: <DP BACKPLANE 1.00> Fixed Enclosure Services SCSI-5 device
pass1: 300.000MB/s transfers, Tagged Queueing Enabled
ses0 at mpt0 bus 0 target 8 lun 0
ses0: <DP BACKPLANE 1.00> Fixed Enclosure Services SCSI-5 device
ses0: 300.000MB/s transfers, Tagged Queueing Enabled
ses0: SCSI-3 SES Device
GEOM: new dida0 at mpt0 bus 0 target 0 lun 0
da0: <MAXTOR ATLAS15K2_073SAS BP00> Fixed Direct Access SCSI-5 device
da0: Serial Number K40C1Q5K
da0: 300.000MB/s transfers, Tagged Queueing Enabled
da0: 70007MB (143374650 512 byte sectors: 255H 63S/T 8924C)
---

As a workaround, I disabled the APICs (hint.apic.0.disabled), and that ~15-minute delay at boot-up was gone. Fine.
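
For reference, a minimal sketch of how the hint named above is typically set. The message gives only the hint name, so the file location and the quoted value below are assumptions based on the usual FreeBSD loader conventions, not something the poster confirmed:

```
# /boot/loader.conf  (assumed location; the message names only the hint)
# Ask the kernel not to attach the APIC, so it falls back to the
# legacy interrupt controller on the next boot.
hint.apic.0.disabled="1"
```

The same variable can usually be set for a single boot from the loader prompt with `set hint.apic.0.disabled=1` followed by `boot`, which makes it easy to test the workaround before making it permanent.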

(BTW, 7-CURRENT has the same problem, but without that huge delay)

Once I was logged in to the server, I proceeded to populate my ports tree using portsnap(8). When I extracted the tarball (portsnap extract), there were a lot of the following error messages, at about 1 message per second:

mpt0: Unhandled Event Notify Frame. Event 0xe (ACK not required).

Once in a while, an error message like the one below showed up:
--
(da0:mpt0:0:0:0): WRITE(10). CDB: 2a 0 1 55 6f 5f 0 0 20 0
(da0:mpt0:0:0:0): CAM Status: SCSI Status Error
(da0:mpt0:0:0:0): SCSI Status: Check Condition
(da0:mpt0:0:0:0): UNIT ATTENTION asc:29,2
(da0:mpt0:0:0:0): Scsi bus reset occurred
--

After running some diagnostics included on the utility CDs shipped with this server, I concluded this was a software issue:
--
Device Name     : SAS Disk 0:0
Description     : SAS MAXTOR ATLAS15K2_073SAS
Test Name       : Disk Self Test
Passes          : 1
Result          : passed
Start Time      : Mon Aug 21 02:04:06 2006
Completion Time : Mon Aug 21 02:26:12 2006
Result Event    : The test operation completed successfully
--

(In order to perform those diagnostics, I had to install SuSE Linux Enterprise Server 9, which was also shipped with this machine)

After reinstalling FreeBSD, I logged remotely into the server, via ssh, and fetched the ports snapshot again and extracted once more.

Suddenly, the screen activity ceased and the network connection timed out.

Locally, on the server, there were a lot of mpt(4) errors and warnings.
---
(da0:mpt0:0:0:0): CAM Status 0x18
(da0:mpt0:0:0:0): Retrying Command
(... and about 500 more lines like those...)
---

Then, some bce(4) errors, which caused the network interface to be shut down:
---
bce0: ../../../dev/bce/if_bce.c(5032): Watchdog timeout occurred, resetting
bce0: link state changed to DOWN
bce0: Gigabit link up
bce0: link state changed to UP
bce0: ../../../dev/bce/if_bce.c(5032): Watchdog timeout occurred, resetting
---

And finally, those errors from mpt(4):
---
request 0xc4c4a080:44717 timed out for ccb 0xc4e41400 (req->ccb 0xc4e41400)
request 0xc4c4b430:44718 timed out for ccb 0xc4ca5800 (req->ccb 0xc4ca5800)
request 0xc4c4cd80:44719 timed out for ccb 0xc4c52800 (req->ccb 0xc4c52800)
(... and about 300 more lines like those ...)
---

which were followed by the same number of lines like these:
---
mpt0: completing timedout/aborted req 0xc4c4a080:44717
mpt0: completing timedout/aborted req 0xc4c4b430:44718
mpt0: completing timedout/aborted req 0xc4c4cd80:44719
---

and finishing with this line:
---
mpt0: Timedout requests already complete. Interrupts may not be functioning.
---

After one hour and a half, the system was still unstable and I was forced to reboot it.

Those are the main issues (mpt(4) and bce(4)) regarding this hardware configuration and FreeBSD (6-STABLE and 7-CURRENT). However, two other problems showed up as well.

1. The first network interface (labeled "Gb 1" on the server's case) was detected as bce1, and the second one (Gb 2) as bce0.
---
bce0: <Broadcom NetXtreme II BCM5708 1000Base-T (B1), v0.9.6> mem 0xf4000000-0xf5ffffff irq 16 at device 0.0 on pci8
bce0: Reserved 0x2000000 bytes for rid 0x10 type 3 at 0xf4000000
bce0: ASIC ID 0x57081010; Revision (B1); PCI-X 64-bit 133MHz
miibus0: <MII bus> on bce0
brgphy0: <BCM5708C 10/100/1000baseTX PHY> on miibus0
brgphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseTX, 1000baseTX-FDX, auto
bce0: bpf attached
bce0: Ethernet address: 00:13:72:f9:xx:xx
bce0: [MPSAFE]
...
bce1: <Broadcom NetXtreme II BCM5708 1000Base-T (B1), v0.9.6> mem 0xf8000000-0xf9ffffff irq 16 at device 0.0 on pci4
bce1: Reserved 0x2000000 bytes for rid 0x10 type 3 at 0xf8000000
bce1: ASIC ID 0x57081010; Revision (B1); PCI-X 64-bit 133MHz
miibus1: <MII bus> on bce1
brgphy1: <BCM5708C 10/100/1000baseTX PHY> on miibus1
brgphy1: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseTX, 1000baseTX-FDX, auto
bce1: bpf attached
bce1: Ethernet address: 00:13:72:f9:xx:xx
bce1: [MPSAFE]
---

According to this log, bce0 is on pci8, while bce1 is on pci4.

2. Sometimes, the server refuses to be halted or rebooted via shutdown(8) or reboot(8).

Any hint will be appreciated :)

/var/run/dmesg.boot (verbose log)
http://bsdero.tripod.com/dmesg.boot.txt

/var/log/messages (with remarks)
http://bsdero.tripod.com/messages.txt

pciconf -lv output
http://bsdero.tripod.com/pciconf.txt

--
Alex Salazar
Matthew Jacob
2006-Sep-02 19:21 UTC
Several issues on Dell 1950/2950 servers (6-STABLE and 7-CURRENT)
>
> The OS booted up and the SAS controller was now detected and supported by
> the mpt(4) driver:
> ---
> mpt0: <LSILogic SAS Adapter> port 0xec00-0xecff mem 0xfc4fc000-0xfc4fffff,
> 0xfc4e0000-0xfc4effff irq 64 at device 8.0 on pci2
> mpt0: Reserved 0x100 bytes for rid 0x10 type 4 at 0xec00
> mpt0: Reserved 0x4000 bytes for rid 0x14 type 3 at 0xfc4fc000
> mpt0: [GIANT-LOCKED]
> mpt0: MPI Version=1.5.12.0
> ---
>
> And the related errors showed up immediately, for the first time:
> ---
> mpt0: mpt_cam_event: 0x16
> mpt0: Unhandled Event Notify Frame. Event 0x16 (ACK not required).
> mpt0: mpt_cam_event: 0x12
> mpt0: Unhandled Event Notify Frame. Event 0x12 (ACK not required).
> mpt0: mpt_cam_event: MPI_EVENT_SAS_DEVICE_STATUS_CHANGE
> mpt0: mpt_cam_event: MPI_EVENT_SAS_DEVICE_STATUS_CHANGE
> mpt0: mpt_cam_event: 0x16
> mpt0: Unhandled Event Notify Frame. Event 0x16 (ACK not required).
> --

These are device arrival events.

>
> When the bootstrap process reached the SCSI probe, there was
> no activity on the screen for about five minutes, so I was forced to use
> the power off button. After rebooting, the same symptoms were evident,
> so I rebooted the machine once again, this time in verbose mode.
>
> This debug information was being printed on the screen, one character at a time,
> at about 1 char/sec:
>
> (probe8:mpt0:0:8:0): error 22

What's at target 8? It isn't happy for a variety of reasons. Oh- I see from below- it's an SES instance that drops dead if given something at > lun 0.

> (probe8:mpt0:0:8:0): Unretryable Error
> ---
> pass0 at mpt0 bus 0 target 0 lun 0
> pass0: <MAXTOR ATLAS15K2_073SAS BP00> Fixed Direct Access SCSI-5 device
>
> As a workaround, I disabled the APICs (hint.apic.0.disabled),
> and that ~15-minute delay at boot-up was gone. Fine.
>
> (BTW, 7-CURRENT has the same problem, but without that huge delay)

Do you have APIC disabled for 7-CURRENT also?

>
> Once I was logged in to the server, I proceeded to populate my ports tree
> using portsnap(8). When I extracted the tarball (portsnap extract),
> there were a lot of the following error messages, at about 1 message per second:
>
> mpt0: Unhandled Event Notify Frame. Event 0xe (ACK not required).

Queue Full events from the SAS firmware.

>
> Once in a while, an error message like the one below showed up:
> --
> (da0:mpt0:0:0:0): WRITE(10). CDB: 2a 0 1 55 6f 5f 0 0 20 0
> (da0:mpt0:0:0:0): CAM Status: SCSI Status Error
> (da0:mpt0:0:0:0): SCSI Status: Check Condition
> (da0:mpt0:0:0:0): UNIT ATTENTION asc:29,2
> (da0:mpt0:0:0:0): Scsi bus reset occurred

Somebody is resetting the bus periodically. We (freebsd) aren't volitionally doing this that I'm aware of here.

> (In order to perform those diagnostics, I had to install SuSE Linux
> Enterprise Server 9, which was also shipped with this machine)

Which is a good way of saying that LSI-Logic support isn't very evident on FreeBSD.

>
> After reinstalling FreeBSD, I logged remotely into the server, via ssh,
> and fetched the ports snapshot again and extracted once more.
>
> Suddenly, the screen activity ceased and the network connection timed out.
>
> Locally, on the server, there were a lot of mpt(4) errors and warnings.
> ---
> (da0:mpt0:0:0:0): CAM Status 0x18
> (da0:mpt0:0:0:0): Retrying Command
> (... and about 500 more lines like those...)

Hmm.

> ---
>
>
> And finally, those errors from mpt(4):
>
> ---
> request 0xc4c4a080:44717 timed out for ccb 0xc4e41400 (req->ccb 0xc4e41400)
> request 0xc4c4b430:44718 timed out for ccb 0xc4ca5800 (req->ccb 0xc4ca5800)
> request 0xc4c4cd80:44719 timed out for ccb 0xc4c52800 (req->ccb 0xc4c52800)
> (... and about 300 more lines like those ...)
> ---
>
> which were followed by the same number of lines like these:
> ---
> mpt0: completing timedout/aborted req 0xc4c4a080:44717
> mpt0: completing timedout/aborted req 0xc4c4b430:44718
> mpt0: completing timedout/aborted req 0xc4c4cd80:44719
> ---
>
> and finishing with this line:
> ---
> mpt0: Timedout requests already complete. Interrupts may not be functioning.
> ---
>

I've seen this on Supermicro EM64T in the past on 7-current, but that went away about 3-4 weeks ago. It really seemed to me that this was indeed an interrupt related problem.

Yup, sounds like a mess here.
Morten A. Middelthon
2006-Sep-04 05:49 UTC
Several issues on Dell 1950/2950 servers (6-STABLE and 7-CURRENT)
On Sat, Sep 02, 2006 at 04:23:21AM -0500, Alex Salazar wrote:
> Apologies for the long message, and thanks in advance for any response.
>
> I've just bought one of those new generation Dell servers, specifically,
> the PowerEdge 1950.
>
> This is a dual Intel Dual Core Xeon 5050, 3.0 GHz, 667MHz FSB,
> 1GB 533MHz RAM, system.
>
> This server has an LSI Logic SAS 5/i integrated adapter and dual embedded
> Broadcom NetXtreme II 5708 Gigabit Ethernet NICs.

<snip>

Just wanted to say that I'm running FreeBSD/i386 6.1-STABLE on two Dell PE 1950's without any problems. The only thing I had to do was update the bce driver for the NIC, but other than that the mfi RAID controller is detected properly, as well as the SAS disks and RAID array. The only difference I can think of is perhaps the firmware and BIOS versions.

Attached is the dmesg output from one of the two 1950's I've got.

with regards,

--
Morten A. Middelthon

"I have been Foolish and Deluded, and I am a Bear of No Brain at All."
-- Pooh
Morten A. Middelthon
2006-Sep-04 06:13 UTC
Several issues on Dell 1950/2950 servers (6-STABLE and 7-CURRENT)
Skipped content of type multipart/mixed
Alex Salazar
2006-Sep-15 22:46 UTC
Several issues on Dell 1950/2950 servers (6-STABLE and 7-CURRENT)
On 9/15/06, Conrad Burger <conrad.burger@mxit.com> wrote:
> Hi
>
> Has anyone been able to solve the problem with the bce driver on 6.1-stable?
>
> I am running 6.1-stable(200609013)AMD64 on a Dell 1950 with a SMP kernel.
>
> The system boots up fine. When I copy data to an nfs mount the bce network
> interface times out and then resets. It never recovers from this state.
>
> From /var/log/messages-----------
> Sep 13 05:01:13 gold kernel: bce0: /usr/src/sys/dev/bce/if_bce.c(5032): Watchdog timeout occurred, resetting!
> Sep 13 05:01:13 gold kernel: bce0: link state changed to DOWN
> Sep 13 05:01:16 gold kernel: bce0: link state changed to UP
> Sep 13 05:02:41 gold kernel: bce0: /usr/src/sys/dev/bce/if_bce.c(5032): Watchdog timeout occurred, resetting!
> Sep 13 05:02:41 gold kernel: bce0: link state changed to DOWN
> Sep 13 05:02:44 gold kernel: bce0: link state changed to UP
>
> Any help would be appreciated.
>
> Regards
> Conrad
>

I was almost sure this problem had been solved by the 6-STABLE version of sys/dev/bce/ a month or so ago. I can't back it up, however, since I'm currently using 7-CURRENT (i386) on my PE 1950, and this behaviour is not present in this FreeBSD version.

By the way, is your server equipped with an LSI SAS 5/i controller, or a Dell PERC 5/i?

--
Alex Salazar
BSD México
www.bsd.org.mx