Got a bunch of servers from Penguin. Supermicro m/b's H8QG6. We put a 3tb drive in for additional workspace for the users, and some of them won't read, others will go for weeks, then spit out DRDY errors. lshw shows the controller as an ATI SB7x0/SB8x0/SB9x0 SATA. I did notice that it shows

  *-storage
       description: SATA controller
       product: SB7x0/SB8x0/SB9x0 SATA Controller [AHCI mode]
       vendor: ATI Technologies Inc
       <snip>
       width: 32 bits
       ^^^^^^^^^^^^^^
       clock: 66MHz
       ^^^^^^^^^^^^
       capabilities: storage pm ahci_1.0 bus_master cap_list

From /var/log/dmesg:

pci 0000:00:0d.0: PME# supported from D0 D3hot D3cold
pci 0000:00:0d.0: PME# disabled
pci 0000:00:11.0: reg 10 io port: [0xd000-0xd007]
pci 0000:00:11.0: reg 14 io port: [0xc000-0xc003]
pci 0000:00:11.0: reg 18 io port: [0xb000-0xb007]
pci 0000:00:11.0: reg 1c io port: [0xa000-0xa003]
pci 0000:00:11.0: reg 20 io port: [0x9000-0x900f]
pci 0000:00:11.0: reg 24 32bit mmio: [0xdfefa400-0xdfefa7ff]
<...>
ahci 0000:00:11.0: version 3.0
alloc irq_desc for 22 on node 0
alloc kstat_irqs on node 0
ahci 0000:00:11.0: PCI INT A -> GSI 22 (level, low) -> IRQ 22
ahci 0000:00:11.0: AHCI 0001.0100 32 slots 4 ports 3 Gbps 0xf impl SATA mode
ahci 0000:00:11.0: flags: 64bit ncq sntf ilck pm led clo pmp pio slum part ccc
<...>
ata1: SATA max UDMA/133 abar m1024 at 0xdfefa400 port 0xdfefa500 irq 22
ata2: SATA max UDMA/133 abar m1024 at 0xdfefa400 port 0xdfefa580 irq 22

I've included the above, because I note the 32bit mmio, but the 64bit flag; also the clock speed for the controller.

Now, I've been working on one with Penguin. I noticed one thing, that it was set to native IDE. After googling, I saw that the most recent spec, which included EIDE, should be good to petabytes... but I tried resetting it to AHCI anyway.

The user ran one job, ok... then another last night, and it's spitting the same errors.
In /var/log/messages, I see JBD: detected IO errors while flushing file data:

Mar 7 00:53:28 <server> kernel: ata2.00: exception Emask 0x0 SAct 0x3 SErr 0x0 action 0x6 frozen
Mar 7 00:53:28 <server> kernel: ata2.00: failed command: WRITE FPDMA QUEUED
Mar 7 00:53:28 <server> kernel: ata2.00: cmd 61/08:00:72:4a:a4/00:00:ae:00:00/40 tag 0 ncq 4096 out
Mar 7 00:53:28 <server> kernel: res 40/00:04:20:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
Mar 7 00:53:28 <server> kernel: ata2.00: status: { DRDY }
<...>
Mar 7 00:53:28 <server> kernel: ata2: hard resetting link
Mar 7 00:53:33 <server> kernel: ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Mar 7 00:53:33 <server> kernel: ata2.00: configured for UDMA/133
Mar 7 00:53:33 <server> kernel: ata2.00: device reported invalid CHS sector 0
Mar 7 00:53:33 <server> kernel: ata2: EH complete

Notice the "device reported invalid CHS sector 0". The drive does have a GPT rather than an MBR.

So, has anyone else seen similar problems, or have some suggestions as to something I can try? Penguin's still waiting for a response from Supermicro, and has escalated....

        mark
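The "cmd" line in the log above packs the ATA taskfile registers into hex bytes, and reading it by hand is error-prone. As a rough aid, here is a minimal Python sketch that pulls out the command, sector count and LBA. The field order is an assumption based on how the kernel's libata error handler prints these lines (command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device), and decode_taskfile is a hypothetical helper name, not part of any existing tool.

```python
import re

# Assumed layout of the libata "cmd" line:
#   cmd CMD/FEAT:NSECT:LBAL:LBAM:LBAH/HFEAT:HNSECT:HLBAL:HLBAM:HLBAH/DEV
FIVE = r"((?:[0-9a-f]{2}:){4}[0-9a-f]{2})"
TF_RE = re.compile(r"cmd ([0-9a-f]{2})/" + FIVE + "/" + FIVE + r"/([0-9a-f]{2})")


def decode_taskfile(line):
    """Decode the taskfile of an NCQ command from a libata error line."""
    m = TF_RE.search(line)
    if m is None:
        raise ValueError("no 'cmd' taskfile found in line")
    command = int(m.group(1), 16)
    feature, nsect, lbal, lbam, lbah = (int(b, 16) for b in m.group(2).split(":"))
    hfeat, hnsect, hlbal, hlbam, hlbah = (int(b, 16) for b in m.group(3).split(":"))
    device = int(m.group(4), 16)
    # 48-bit LBA: the "hob" (high order byte) registers hold bits 24..47.
    lba = (hlbah << 40) | (hlbam << 32) | (hlbal << 24) | (lbah << 16) | (lbam << 8) | lbal
    # For NCQ (FPDMA) commands the sector count lives in the feature registers.
    sectors = (hfeat << 8) | feature
    return {"command": command, "device": device, "lba": lba, "sectors": sectors}


info = decode_taskfile(
    "ata2.00: cmd 61/08:00:72:4a:a4/00:00:ae:00:00/40 tag 0 ncq 4096 out"
)
print(hex(info["command"]), info["sectors"], hex(info["lba"]))
```

Under those assumptions the failed command is 0x61 (WRITE FPDMA QUEUED, as the log says) for 8 sectors, which matches the "ncq 4096" in the same line if the logical sector size is 512 bytes, at LBA 0xaea44a72. That LBA fits in 32 bits, so this particular write was not beyond the 2^32-sector mark.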
On Wednesday 07 March 2012 11.17.15 m.roth at 5-cent.us wrote:
> Got a bunch of servers from Penguin. Supermicro m/b's H8QG6. We put a 3tb
> drive in for additional workspace for the users, and some of them won't
> read, others will go for weeks, then spit out DRDY errors. lshw shows the
> controller as an ATI SB7x0/SB8x0/SB9x0 SATA.
...
> Now, I've been working on one with Penguin. I noticed one thing, that it
> was set to native IDE. After googling, I saw that the most recent spec,
> which included EIDE, should be good to petabytes... but I tried resetting
> it to AHCI anyway.
>
> The user ran one job, ok... then another last night, and it's spitting the
> same errors.
...
> Mar 7 00:53:28 <server> kernel: ata2.00: failed command: WRITE FPDMA QUEUED
...
> 40/00:04:20:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
...
> Mar 7 00:53:28 <server> kernel: ata2: hard resetting link

While writing, the drive timed out and the link to it was then subjected to a hard reset. This is not normal and usually points to a bad drive or buggy firmware.

Have you had a look at the SMART data for the drive(s)? (You may want to run the SMART self-tests.)

Also, I'd suggest you test it in a controlled environment. For example, can any of your drives survive a full surface write? (dd if=/dev/zero bs=1M of=..)
Full surface read? Do the tests against /dev/sdX to be sure (excludes partitioning, filesystems, volume management, etc.)

Do note that writing your drive full of zeros _will_ destroy your data (I really hope that's stating the obvious...).

/Peter
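The full surface read suggested above can be done with dd, but a small script makes it easier to record where the errors land. What follows is only a sketch of a non-destructive sequential read, not a replacement for dd or the SMART self-tests: scan_read and its parameters are hypothetical names, it needs a Unix-like system (it relies on os.pread), and it would have to be run as root against the whole device (/dev/sdX) to bypass partitioning and filesystems.

```python
import os


def scan_read(path, chunk_size=1024 * 1024, max_errors=16):
    """Read `path` from start to end; return (bytes_read, bad_chunks).

    bad_chunks is a list of (offset, error message) for chunks that
    failed to read. The scan gives up after `max_errors` failures so a
    device that has dropped off the bus does not loop forever.
    """
    bad_chunks = []
    bytes_read = 0
    offset = 0
    fd = os.open(path, os.O_RDONLY)  # read-only: never writes to the device
    try:
        while len(bad_chunks) < max_errors:
            try:
                data = os.pread(fd, chunk_size, offset)
            except OSError as exc:
                # Record the failing offset and skip ahead one chunk.
                bad_chunks.append((offset, exc.strerror))
                offset += chunk_size
                continue
            if not data:  # empty read means end of device or file
                break
            bytes_read += len(data)
            offset += len(data)
    finally:
        os.close(fd)
    return bytes_read, bad_chunks
```

A call such as scan_read("/dev/sdX") would then return the number of bytes read and any failing offsets, which can be compared against the timestamps of ata errors in /var/log/messages. The 1 MiB chunk size mirrors the bs=1M in the dd command; it is a convenience, not a tuned value.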
Peter Kjellström wrote:
> On Wednesday 07 March 2012 11.17.15 m.roth at 5-cent.us wrote:
>> Got a bunch of servers from Penguin. Supermicro m/b's H8QG6. We put a
>> 3tb drive in for additional workspace for the users, and some of them
>> won't read, others will go for weeks, then spit out DRDY errors. lshw
>> shows the controller as an ATI SB7x0/SB8x0/SB9x0 SATA.
> ...
>> Now, I've been working on one with Penguin. I noticed one thing, that it
>> was set to native IDE. After googling, I saw that the most recent spec,
>> which included EIDE, should be good to petabytes... but I tried
>> resetting it to AHCI anyway.
>>
>> The user ran one job, ok... then another last night, and it's spitting
>> the same errors.
> ...
>> Mar 7 00:53:28 <server> kernel: ata2.00: failed command: WRITE FPDMA
>> QUEUED
> ...
>> 40/00:04:20:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
> ...
>> Mar 7 00:53:28 <server> kernel: ata2: hard resetting link
>
> While writing, the drive timed out and the link to it was then subjected to
> a hard reset. This is not normal and usually points to a bad drive or buggy
> firmware.
>
> Have you had a look at the SMART data for the drive(s)? (You may want to run
> the SMART self-tests.)
>
> Also, I'd suggest you test it in a controlled environment. For example,
> can any of your drives survive a full surface write? (dd if=/dev/zero
> bs=1M of=..)
> Full surface read? Do the tests against /dev/sdX to be sure (excludes
> partitioning, filesystems, volume management, etc.)
>
> Do note that writing your drive full of zeros _will_ destroy your data (I
> really hope that's stating the obvious...).

<g> Of course.

Nahhh... I've run bonnie++ against it, but couldn't provoke it. It's this one user, who runs *large* jobs, with big o/p, when it hits.

smartctl - I ran the short test just before lunch, and smartctl -H reports it passed, completed without errors.

I saw that it timed out.
One of the reasons for some of the stuff I included, above, was that

kernel: ata2.00: device reported invalid CHS sector 0

Also, I noticed that lshw showed the ATI controller having a width of 32 bits, and a clock of 66MHz, and wondered if there could be some sort of slip-through-the-cracks where the driver didn't handle this correctly.

        mark
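Whatever the lshw width and clock figures turn out to mean, the numbers that decide whether a 3 TB drive can be addressed at all are sector counts. The arithmetic can be sketched as below. The assumptions are a 512-byte logical sector and a decimal 3 TB (3 * 10**12 bytes); sectors_needed is a hypothetical helper, and the real sector count of a given drive will differ slightly.

```python
SECTOR_SIZE = 512            # assumed logical sector size in bytes
DRIVE_BYTES = 3 * 10**12     # "3tb" taken as decimal terabytes

LBA32_SECTORS = 2**32        # reach of a 32-bit sector number (e.g. MBR fields)
LBA48_SECTORS = 2**48        # reach of 48-bit LBA addressing


def sectors_needed(capacity_bytes, sector_size=SECTOR_SIZE):
    """Number of sectors required to cover capacity_bytes (rounded up)."""
    return -(-capacity_bytes // sector_size)


sectors = sectors_needed(DRIVE_BYTES)
print("sectors on drive:     ", sectors)
print("exceeds 32-bit count: ", sectors > LBA32_SECTORS)
print("fits in 48-bit LBA:   ", sectors <= LBA48_SECTORS)
print("32-bit limit in TiB:  ", LBA32_SECTORS * SECTOR_SIZE / 2**40)
print("48-bit limit in PiB:  ", LBA48_SECTORS * SECTOR_SIZE / 2**50)
```

With these assumptions the drive needs more sectors than a 32-bit count can express, which is why it carries a GPT rather than an MBR, while 48-bit LBA reaches into the petabyte range, consistent with the "good to petabytes" remark earlier in the thread. Any component that truncates sector numbers to 32 bits would only misbehave for accesses past the 2 TiB mark, so the LBA of each failed command is worth checking against that boundary.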