thr3ads.net - CentOS - [CentOS] hardware issues? driver issues? [Mar 2012]

m.roth at 5-cent.us

2012-Mar-07 16:17 UTC

[CentOS] hardware issues? driver issues?

Got a bunch of servers from Penguin. Supermicro m/b's H8QG6. We put a 3tb
drive in for additional workspace for the users, and some of them won't
read, others will go for weeks, then spit out DRDY errors. lshw shows the
controller as an ATI SB7x0/SB8x0/SB9x0 SATA.

I did notice that it shows
 *-storage
             description: SATA controller
             product: SB7x0/SB8x0/SB9x0 SATA Controller [AHCI mode]
             vendor: ATI Technologies Inc
<snip>
             width: 32 bits
             ^^^^^^^^^^^^^^
             clock: 66MHz
             ^^^^^^^^^^^^
             capabilities: storage pm ahci_1.0 bus_master cap_list

From /var/log/dmesg:
pci 0000:00:0d.0: PME# supported from D0 D3hot D3cold
pci 0000:00:0d.0: PME# disabled
pci 0000:00:11.0: reg 10 io port: [0xd000-0xd007]
pci 0000:00:11.0: reg 14 io port: [0xc000-0xc003]
pci 0000:00:11.0: reg 18 io port: [0xb000-0xb007]
pci 0000:00:11.0: reg 1c io port: [0xa000-0xa003]
pci 0000:00:11.0: reg 20 io port: [0x9000-0x900f]
pci 0000:00:11.0: reg 24 32bit mmio: [0xdfefa400-0xdfefa7ff]
<...>
ahci 0000:00:11.0: version 3.0
  alloc irq_desc for 22 on node 0
  alloc kstat_irqs on node 0
ahci 0000:00:11.0: PCI INT A -> GSI 22 (level, low) -> IRQ 22
ahci 0000:00:11.0: AHCI 0001.0100 32 slots 4 ports 3 Gbps 0xf impl SATA mode
ahci 0000:00:11.0: flags: 64bit ncq sntf ilck pm led clo pmp pio slum part ccc
<...>
ata1: SATA max UDMA/133 abar m1024 at 0xdfefa400 port 0xdfefa500 irq 22
ata2: SATA max UDMA/133 abar m1024 at 0xdfefa400 port 0xdfefa580 irq 22

I've included the above, because I note the 32bit mmio, but the 64bit
flag; also the clock speed for the controller.

Now, I've been working on one with Penguin. I noticed one thing, that it
was set to native IDE. After googling, I saw that the most recent spec,
which included EIDE, should be good to petabytes... but I tried resetting
it to AHCI anyway.
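
As a rough check on the "good to petabytes" remark, the addressing limits can be worked out directly. This is only a sketch: it assumes 512-byte logical sectors and a nominal 3 TB drive measured in decimal terabytes, neither of which is stated in the post. The names `max_bytes` and `needs_gpt` are made up for illustration.

```python
SECTOR = 512  # assumed logical sector size in bytes (not stated in the post)


def max_bytes(address_bits, sector=SECTOR):
    """Largest capacity addressable with the given number of LBA bits."""
    return (1 << address_bits) * sector


LBA28_LIMIT = max_bytes(28)  # 28-bit ATA addressing
LBA48_LIMIT = max_bytes(48)  # 48-bit ATA addressing
MBR_LIMIT = max_bytes(32)    # MBR partition entries hold 32-bit sector values

DRIVE_BYTES = 3 * 10**12     # nominal "3 TB" drive, decimal (assumption)


def needs_gpt(drive_bytes, sector=SECTOR):
    """True if the drive is too large for an MBR partition table to cover."""
    return drive_bytes > max_bytes(32, sector)


if __name__ == "__main__":
    print("LBA28 limit (GiB):", LBA28_LIMIT // 2**30)
    print("LBA48 limit (PiB):", LBA48_LIMIT // 2**50)
    print("MBR limit (TiB):  ", MBR_LIMIT // 2**40)
    print("3 TB drive needs GPT:", needs_gpt(DRIVE_BYTES))
```

Under those assumptions, 48-bit addressing covers a 3 TB drive with a wide margin, so the ATA addressing mode by itself should not be the limit. The MBR limit is the one a 3 TB drive exceeds, which is why a GPT label is needed.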

The user ran one job, ok... then another last night, and it's spitting the
same errors.

In /var/log/messages, I see JBD: detected IO errors while flushing file data:
Mar  7 00:53:28 <server> kernel: ata2.00: exception Emask 0x0 SAct 0x3 SErr 0x0 action 0x6 frozen
Mar  7 00:53:28 <server> kernel: ata2.00: failed command: WRITE FPDMA QUEUED
Mar  7 00:53:28 <server> kernel: ata2.00: cmd 61/08:00:72:4a:a4/00:00:ae:00:00/40 tag 0 ncq 4096 out
Mar  7 00:53:28 <server> kernel:         res 40/00:04:20:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
Mar  7 00:53:28 <server> kernel: ata2.00: status: { DRDY }
<...>
Mar  7 00:53:28 <server> kernel: ata2: hard resetting link
Mar  7 00:53:33 <server> kernel: ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Mar  7 00:53:33 <server> kernel: ata2.00: configured for UDMA/133
Mar  7 00:53:33 <server> kernel: ata2.00: device reported invalid CHS sector 0
Mar  7 00:53:33 <server> kernel: ata2: EH complete

Notice the "device reported invalid CHS sector 0". The drive does have a
GPT rather than an MBR.
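
To see where on the disk the failed write landed, the `cmd` taskfile from the log can be decoded. This is a minimal sketch, assuming the usual libata field order (command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device) and 512-byte sectors. The function names are made up for illustration and are not part of any kernel tool.

```python
import re

# Field order assumed from libata's error report format.
FIELDS = ("command", "feature", "nsect", "lbal", "lbam", "lbah",
          "hob_feature", "hob_nsect", "hob_lbal", "hob_lbam", "hob_lbah",
          "device")


def parse_taskfile(tf):
    """Split a 'cc/ff:nn:..../dd' taskfile string into named integer fields."""
    values = [int(part, 16) for part in re.split(r"[/:]", tf.strip())]
    if len(values) != len(FIELDS):
        raise ValueError("expected %d fields, got %d" % (len(FIELDS), len(values)))
    return dict(zip(FIELDS, values))


def lba48(tf):
    """Reassemble the 48-bit LBA from the low and high-order byte fields."""
    f = parse_taskfile(tf)
    return (f["hob_lbah"] << 40 | f["hob_lbam"] << 32 | f["hob_lbal"] << 24
            | f["lbah"] << 16 | f["lbam"] << 8 | f["lbal"])


def ncq_sector_count(tf):
    """For FPDMA QUEUED commands the sector count travels in the feature fields."""
    f = parse_taskfile(tf)
    return f["hob_feature"] << 8 | f["feature"]


def active_tags(sact):
    """List the NCQ tags set in an SAct bitmask."""
    return [tag for tag in range(32) if (sact >> tag) & 1]


CMD = "61/08:00:72:4a:a4/00:00:ae:00:00/40"  # copied from the log above

if __name__ == "__main__":
    print("LBA:", hex(lba48(CMD)))
    print("byte offset:", lba48(CMD) * 512)
    print("sectors:", ncq_sector_count(CMD))
    print("tags outstanding:", active_tags(0x3))
```

If the assumed field order is right, the command is a write of 8 sectors (4096 bytes, matching "ncq 4096 out") at LBA 0xaea44a72, roughly 1.5 TB into the drive, with two NCQ tags outstanding when the port froze. That LBA is below the 2 TiB mark, so this particular timeout did not occur at an address beyond the 32-bit sector range.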

So, has anyone else seen similar problems, or have some suggestions as to
something I can try? Penguin's still waiting for a response from
Supermicro, and has escalated....

          mark

Peter Kjellström

2012-Mar-07 16:43 UTC

On Wednesday 07 March 2012 11.17.15 m.roth at 5-cent.us wrote:
> Got a bunch of servers from Penguin. Supermicro m/b's H8QG6. We put a 3tb
> drive in for additional workspace for the users, and some of them won't
> read, others will go for weeks, then spit out DRDY errors. lshw shows the
> controller as an ATI SB7x0/SB8x0/SB9x0 SATA.
...
> Now, I've been working on one with Penguin. I noticed one thing, that it
> was set to native IDE. After googling, I saw that the most recent spec,
> which included EIDE, should be good to petabytes... but I tried resetting
> it to AHCI anyway.
>
> The user ran one job, ok... then another last night, and it's spitting the
> same errors.
...
> Mar  7 00:53:28 <server> kernel: ata2.00: failed command: WRITE FPDMA QUEUED
...
> 40/00:04:20:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
...
> Mar  7 00:53:28 <server> kernel: ata2: hard resetting link

While writing, the drive timed out and the link to it was then subjected to a
hard reset. This is not normal and usually points to a bad drive or buggy
firmware.

Have you had a look at smartdata for the drive(s)? (you may want to run the 
smart selftests)

Also, I'd suggest you test it in a controlled environment. For example, can 
any of your drives survive a full surface write? (dd if=/dev/zero bs=1M of=..) 
Full surface read? Do the tests against /dev/sdX to be sure (excludes 
partitioning, filesystems, volume management, etc.)

Do note that writing your drive full of zeros _will_ destroy your data (I 
really hope that's stating the obvious...).

/Peter
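
The non-destructive half of that suggestion (the full surface read) could be sketched roughly as follows. This is an illustration only, not a replacement for dd or the SMART self-tests: `read_scan` is a made-up helper, the device path is whatever /dev/sdX applies, and reads may be served from the page cache unless caches are dropped or direct I/O is used, which this sketch does not attempt.

```python
import os
import tempfile

CHUNK = 1024 * 1024  # 1 MiB per read, mirroring bs=1M in the dd suggestion


def read_scan(path, chunk=CHUNK, max_errors=16):
    """Read `path` front to back without writing anything.

    Returns (bytes_read_ok, bad), where bad is a list of (offset, errno)
    for chunks that raised an I/O error. Stops after `max_errors` failures.
    """
    bad = []
    good_bytes = 0
    offset = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        while len(bad) < max_errors:
            try:
                data = os.pread(fd, chunk, offset)
            except OSError as exc:
                bad.append((offset, exc.errno))
                offset += chunk  # skip the unreadable chunk and carry on
                continue
            if not data:         # end of file or device
                break
            good_bytes += len(data)
            offset += len(data)
    finally:
        os.close(fd)
    return good_bytes, bad


if __name__ == "__main__":
    # Demonstration on a scratch file; point it at the real device
    # (as root) only once the behaviour is understood.
    with tempfile.NamedTemporaryFile() as tmp:
        tmp.write(b"\0" * (3 * CHUNK + 123))
        tmp.flush()
        print(read_scan(tmp.name))
```

Running it against the raw device rather than a partition or filesystem keeps the layers above the drive out of the picture, as suggested above. Any offsets it reports can be compared with the LBAs in the kernel log.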

m.roth at 5-cent.us

2012-Mar-07 18:16 UTC

Peter Kjellström wrote:
> On Wednesday 07 March 2012 11.17.15 m.roth at 5-cent.us wrote:
>> Got a bunch of servers from Penguin. Supermicro m/b's H8QG6. We put a
>> 3tb drive in for additional workspace for the users, and some of them
>> won't read, others will go for weeks, then spit out DRDY errors. lshw
>> shows the controller as an ATI SB7x0/SB8x0/SB9x0 SATA.
> ...
>> Now, I've been working on one with Penguin. I noticed one thing, that it
>> was set to native IDE. After googling, I saw that the most recent spec,
>> which included EIDE, should be good to petabytes... but I tried
>> resetting it to AHCI anyway.
>>
>> The user ran one job, ok... then another last night, and it's spitting
>> the same errors.
> ...
>> Mar  7 00:53:28 <server> kernel: ata2.00: failed command: WRITE FPDMA
>> QUEUED
> ...
>> 40/00:04:20:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
> ...
>> Mar  7 00:53:28 <server> kernel: ata2: hard resetting link
>
> While writing the drive timed out and the link to it was then subjected to
> a hard reset. This is not normal and usually points to bad drive or buggy
> firmware.
>
> Have you had a look at smartdata for the drive(s)? (you may want to run
> the smart selftests)
>
> Also, I'd suggest you test it in a controlled environment. For example,
> can any of your drives survive a full surface write? (dd if=/dev/zero
> bs=1M of=..)
> Full surface read? Do the tests against /dev/sdX to be sure (excludes
> partitioning, filesystems, volume management, etc.)
>
> Do note that writing your drive full of zeros _will_ destroy your data (I
> really hope that's stating the obvious...).
<g>
Of course. Nahhh... I've run bonnie++ against it, but couldn't provoke it.
It's this one user, who runs *large* jobs, with big o/p, when it hits.

smartctl - I ran the short test just before lunch, and smartctl -H reports
it passed, completed without errors.

I saw that it timed out. One of the reasons for some of the stuff I
included, above, was that
kernel: ata2.00: device reported invalid CHS sector 0

Also, I noticed that lshw showed the ATI controller having a width of 32
bits, and a clock of 66MHz, and wondered if there could be some sort of
slip-through-the-cracks where the driver didn't handle this correctly.

        mark
