thr3ads.net - freebsd stable - HELP DEBUG: FreeBSD 6.3-RELEASE-p3 TIMEOUT - WRITE_DMA + other strange behaviour! [Sep 2008]

Anton - Valqk

2008-Sep-26 10:30 UTC

HELP DEBUG: FreeBSD 6.3-RELEASE-p3 TIMEOUT - WRITE_DMA + other strange behaviour!

Hello,
I have a VERY strangely behaving 6.3-p3 box with DMA timeouts and network cards
'dropping traffic'.
Following is a description of the hardware and the things that are happening.
The machine is a Dell OptiPlex, PII 300MHz with 512MB RAM.
It has 3 NICs:
fxp0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1500
        options=8<VLAN_MTU>
        inet 7.8.9.10 netmask 0xfffff000 broadcast 7.8.9.255
        ether 00:91:21:16:14:bf
        media: Ethernet autoselect (100baseTX <full-duplex>)
        status: active
rl0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1500
        options=8<VLAN_MTU>
        inet 8.9.10.11 netmask 0xffffffe0 broadcast 8.9.10.255
        ether 00:02:44:73:2a:fa
        media: Ethernet autoselect (100baseTX <full-duplex>)
        status: active
xl0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1500
        options=9<RXCSUM,VLAN_MTU>
        inet 192.168.123.2 netmask 0xffffff00 broadcast 192.168.123.255
        inet 192.168.123.5 netmask 0xffffff00 broadcast 192.168.123.255
        inet 192.168.123.6 netmask 0xffffff00 broadcast 192.168.123.255
        ether 00:c0:4f:20:66:a3
        media: Ethernet autoselect (100baseTX <full-duplex>)
        status: active
fxp0 and rl0 are external links to the world and are plugged into PCI slots.
xl0 is the internal interface and is integrated on the motherboard.
It also has one Promise Ultra133 ATA PCI IDE controller plugged into a
PCI slot.
It has 5 disks in it: 4 connected to the Promise card and 1 to the
motherboard IDE.

They are as follows:
ad0 and ad6 are two identical Hitachi disks in a gmirror for the system
and a partition that I keep backups on.

ad4, ad5 and ad7 are storage disks (500GB Seagates with 8MB cache) that I
keep ISOs and similar files on, and they are the problematic ones (maybe
because of heavy traffic compared to the other two?).

What is the problem:
Actually there are two problems:
1. I get a lot of DMA timeouts, mostly on ad5 and ad7, where I keep
files over 4-5MB and read/write very often at 3 to 8MB/s from the
disk. I don't use ad4, so I can't tell whether it would time out too, but I
suppose it would (it currently has Linux partitions on it and is not
mounted). I get these errors:
dmesg.today:ad7: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=5554848
dmesg.today:ad7: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=5914112
dmesg.today:ad7: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=14924096
dmesg.today:ad7: TIMEOUT - WRITE_DMA48 retrying (1 retry left) LBA=374303456
dmesg.today:ad7: FAILURE - WRITE_DMA48 status=51<READY,DSC,ERROR> error=10<NID_NOT_FOUND> LBA=374303456
dmesg.today:g_vfs_done():ad7[WRITE(offset=191643369472, length=131072)]error = 5
dmesg.today:ad5: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=50757760
dmesg.today:ad5: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=50760192
dmesg.today:ad5: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=12032
dmesg.today:ad5: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=50769792

The strange thing is that I've only recently started seeing the g_vfs_done
errors, while the timeouts have been there from the very start of this
hardware setup.
The machine used to work perfectly with two Hitachi disks connected as ad0 and
ad1 (integrated IDE) and only one NIC, xl0.
The problems started when I plugged in the Promise and the other NIC cards
and started using it as a router, fileserver and backup server (each in a
separate jail, except the pf firewall).
2. The other strange issue is that when (I guess) the timeouts start,
*sometimes*, not every time, I lose connectivity on xl0 or fxp0
(sometimes rl0 works and accepts connections from the outside,
sometimes not). When I go to the machine and plug in a monitor, there
are no messages from the kernel and nothing in /var/log/messages or debug.
The strange thing is that when I ping the host from the local net it times
out, yet ifconfig shows the interface connected at 100Mbit full-duplex and
everything seems OK. I've tried ifconfig xl0 down/up but it doesn't help;
I tried unplugging the cable and the link came back, but no packets passed
- timeout again!
I rebooted and the NIC came up. These 'drops' have become more and more
common recently, and last night I wasn't able to log in for about an hour,
after which the machine came back by itself! That was on the LAN; it
wasn't accessible at all from the outside. The strange thing is
that it replied to ping, but I couldn't even open a connection to the ssh
port, and NAT wasn't working?! Afterwards I remembered that
at that time I have a cron job that runs for about an hour and saves
an online radio broadcast to a file... weird! The machine also runs
rtorrent, apache22 and samba (in a jail).

some output from it can be found here:
http://valqk.ath.cx/tmp/dmesg
http://valqk.ath.cx/tmp/vmstat
http://valqk.ath.cx/tmp/smartctl


please give any ideas/hints/solutions!

thanks a lot to everyone!
cheers,
valqk.

Jeremy Chadwick

2008-Sep-26 11:11 UTC

On Fri, Sep 26, 2008 at 01:12:14PM +0300, Anton - Valqk wrote:
> Hello,
> I have a VERY strangely behaving 6.3-p3 box with DMA timeouts and network cards
> 'dropping traffic'.
The disk errors you see are well-known, but the reasons for them
happening differ per person.  Some people replace cables and the problem
goes away.  Others change controller cards.  Others found no solution
and went to Linux.

http://wiki.freebsd.org/JeremyChadwick/ATA_issues_and_troubleshooting

Here's some facts:

1) The LBAs reported to have problems are scattered, which indicates to
me there are probably not bad blocks on your disks,

2) You have two separate disks showing the above behaviour, decreasing
the probability of it being bad blocks/sectors,

3) Your dmesg.today doesn't include timestamps, so I have to assume the
problems all happen at once or within short moments of one another,
rather than at random moments throughout a 24 hour period,
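Points 1 and 2 can be checked mechanically. What follows is only a minimal Python sketch: the log text is the set of lines quoted in the original post (wrapped lines rejoined), while the regular expression and the function name `lbas_by_disk` are illustrative assumptions, not part of any FreeBSD tool.

```python
import re
from collections import defaultdict

# Log lines as quoted in the original post (wrapped lines rejoined).
LOG = """\
dmesg.today:ad7: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=5554848
dmesg.today:ad7: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=5914112
dmesg.today:ad7: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=14924096
dmesg.today:ad7: TIMEOUT - WRITE_DMA48 retrying (1 retry left) LBA=374303456
dmesg.today:ad7: FAILURE - WRITE_DMA48 status=51<READY,DSC,ERROR> error=10<NID_NOT_FOUND> LBA=374303456
dmesg.today:g_vfs_done():ad7[WRITE(offset=191643369472, length=131072)]error = 5
dmesg.today:ad5: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=50757760
dmesg.today:ad5: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=50760192
dmesg.today:ad5: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=12032
dmesg.today:ad5: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=50769792
"""

# Hypothetical pattern for the TIMEOUT/FAILURE lines shown above.
# The g_vfs_done line carries no "LBA=" field and is skipped.
LINE_RE = re.compile(r"(ad\d+): (TIMEOUT|FAILURE) - (\w+).*LBA=(\d+)")


def lbas_by_disk(log_text):
    """Return {disk: [LBA, ...]} for every TIMEOUT/FAILURE line."""
    result = defaultdict(list)
    for line in log_text.splitlines():
        match = LINE_RE.search(line)
        if match:
            result[match.group(1)].append(int(match.group(4)))
    return dict(result)


summary = lbas_by_disk(LOG)
for disk, lbas in sorted(summary.items()):
    # Event count plus lowest and highest affected LBA per disk.
    print(disk, len(lbas), min(lbas), max(lbas))
```

Whether a given spread counts as "scattered" remains a judgment call; the sketch only makes the per-disk range easy to see. Without timestamps in the log it cannot say anything about point 3.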
> strange thing is that I'm seeing the g_vfs_done just recently and this
> problem is from the very start of this hardware setup of the machine.
I believe the g_vfs_done issues can either be attributed to the disk
errors you're seeing, or oddities with gmirror/GEOM.  I've seen people
report this before, and GEOM often spits back an error on an
index/offset which seems way too large for it to be realistic.
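For the specific figures in this thread, the GEOM offset can be compared with the LBA from the ATA FAILURE line. The sketch below assumes 512-byte sectors (an assumption, though the usual size for ATA disks of this era); the helper name `lba_to_offset` is made up for illustration.

```python
SECTOR_SIZE = 512  # bytes; assumed, not stated anywhere in the thread


def lba_to_offset(lba, sector_size=SECTOR_SIZE):
    """Convert a logical block address to a byte offset on the provider."""
    return lba * sector_size


failed_lba = 374303456      # from the ad7 FAILURE - WRITE_DMA48 line
geom_offset = 191643369472  # from the g_vfs_done() line for ad7
geom_length = 131072        # length of the failed write, in bytes

# Does the GEOM offset correspond to the failing LBA?
print(lba_to_offset(failed_lba) == geom_offset)
# Size of the failed write expressed in sectors.
print(geom_length // SECTOR_SIZE)
```

Under that assumption the two numbers agree exactly, so in this instance the g_vfs_done offset (about 191.6 GB into a 500GB disk) looks like the same failed write reported one layer up rather than a separate fault.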
> The machine used to work with two hitachi disks connected to the ad0 and
> ad1 (integrated ide) and only one - xl0 - nic perfectly.
> The problems started when I plugged in the PROMISE and other nic cards
> and started using it as router, fileserver and backup server (each in
> separate jail, except the pf firewall).
> ...
>
> 2. The other strange issue is that when (I guess) the timeouts start,
> *sometimes*, not every time, I lose connectivity on xl0 or fxp0
> (sometimes rl0 works and accepts connections from the outside,
> sometimes not). When I go to the machine and plug in a monitor, there
> are no messages from the kernel and nothing in /var/log/messages or debug.
> The strange thing is that when I ping the host from the local net it times
> out, yet ifconfig shows the interface connected at 100Mbit full-duplex and
> everything seems OK. I've tried ifconfig xl0 down/up but it doesn't help;
> I tried unplugging the cable and the link came back, but no packets passed
> - timeout again!
I've looked at your dmesg and vmstat output, and I have a feeling the
problem is an obvious one.

Your system has no APIC (this is not a typo), so your system *must*
share IRQs.  You have ***four*** devices on IRQ 11: a USB controller,
your fxp0 card, your rl0 card, and your xl0 card.
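IRQ sharing of this kind can be spotted by grouping device probe lines by their `irq N` field. The Python sketch below is an illustration only: the sample lines imitate the style of FreeBSD dmesg output but are NOT taken from the poster's dmesg. The device descriptions, port ranges, slot numbers, the `uhci0` name and the IRQ given to `atapci1` are all made up.

```python
import re
from collections import defaultdict

# Illustrative probe lines in the style of FreeBSD dmesg output.
# Everything here except the fxp0/rl0/xl0 names and "irq 11" is invented.
SAMPLE_DMESG = """\
uhci0: <USB controller> port 0xdce0-0xdcff irq 11 at device 7.2 on pci0
fxp0: <Ethernet> port 0xdc00-0xdc3f irq 11 at device 13.0 on pci0
rl0: <Ethernet> port 0xd800-0xd8ff irq 11 at device 14.0 on pci0
xl0: <Ethernet> port 0xd400-0xd47f irq 11 at device 17.0 on pci0
atapci1: <ATA controller> port 0xd000-0xd007 irq 10 at device 15.0 on pci0
"""

# Hypothetical pattern: device name at line start, then "irq N at device".
IRQ_RE = re.compile(r"^(\w+): .* irq (\d+) at device")


def devices_by_irq(dmesg_text):
    """Return {irq_number: [device, ...]} in order of appearance."""
    table = defaultdict(list)
    for line in dmesg_text.splitlines():
        match = IRQ_RE.match(line)
        if match:
            table[int(match.group(2))].append(match.group(1))
    return dict(table)


def shared_irqs(table):
    """Keep only the IRQs that have more than one device attached."""
    return {irq: devs for irq, devs in table.items() if len(devs) > 1}


shared = shared_irqs(devices_by_irq(SAMPLE_DMESG))
for irq, devs in sorted(shared.items()):
    print(f"irq {irq}: {', '.join(devs)}")
```

Fed with real probe lines, the same grouping would show whether the disk controller sits on the crowded interrupt as well.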
> http://valqk.ath.cx/tmp/dmesg
> http://valqk.ath.cx/tmp/vmstat
> http://valqk.ath.cx/tmp/smartctl
> 
> please give any ideas/hints/solutions!
I would recommend you start yanking PCI cards out of the system and
see which one solves the problem.  You did state that the problems began
once you added the Promise card (which makes your system have FIVE PCI
cards in it?!?  Sheesh).

I can't imagine you'll have a stable system with that many cards in the
box all sharing a single IRQ -- especially on a board that old.

I'd recommend decreasing the number of cards you have in that system, or
getting a motherboard that has an APIC and preferably some reliable on-board
networking (read: Intel chips).  Toss the rl0 card if possible, and
consider replacing the Promise controller with a different one.

-- 
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |

Anton - Valqk

2008-Sep-26 12:13 UTC

Thanks Jeremy and Peter,
you are right that the machine has *lots* of hardware in it.
I was thinking of the power supply as a cause and measured the 12 and 5
volt rails - they seemed OK at 11.8 and 5.2 with all the hardware in it.
The shared IRQ is something I'd thought of, and that's why I posted
vmstat -i, to hear your opinion.
[I forgot to mention that I've read the wiki, and the next step is to patch the
kernel with this patch:
http://freenas.svn.sourceforge.net/viewvc/freenas/branches/0.69/build/kernel-patches/ata/files/patch-ata.diff?view=markup
(anything bad to say about this patch, or can I just run it and nothing bad
will happen?)]
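
The rail readings quoted above can be put against the usual ATX tolerance. This is a minimal Python sketch assuming the common +/-5% limit for the +5V and +12V rails; the function name `within_tolerance` is hypothetical.

```python
TOLERANCE = 0.05  # assumed ATX limit of +/-5% for the +5V and +12V rails


def within_tolerance(nominal, measured, tolerance=TOLERANCE):
    """True if the measured voltage is within the allowed band."""
    return abs(measured - nominal) <= nominal * tolerance


# Nominal rail voltage -> value measured with all hardware installed.
readings = {12.0: 11.8, 5.0: 5.2}

for nominal, measured in readings.items():
    deviation = (measured - nominal) / nominal * 100
    ok = within_tolerance(nominal, measured)
    print(f"{nominal:g}V rail: {measured}V ({deviation:+.1f}%) ok={ok}")
```

Both readings fall inside the band, although the 5V rail is close to its upper limit. A static multimeter reading also cannot show brief sags under load, so this does not rule out the power supply.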

Yes, I have 3 NICs (2 on PCI) + the PCI IDE Promise. I'll get a smart switch
with VLANs and, in the next few days, leave just the integrated xl0 and fxp0,
with both external IPs on the latter,
but first I'll patch the kernel if Jeremy says it won't hurt (as far as
I saw, it just moves a timeout from a hardcoded value to a sysctl?)...
I have another Promise card that is a RAID controller, but when I
started looking for one I asked here, and the answers said the
PROMISE ULTRA ATA133 was a good card for FreeBSD (
http://docs.freebsd.org/cgi/getmsg.cgi?fetch=290848+0+archive/2008/freebsd-stable/20080316.freebsd-stable
)
(hmm, I just saw that Jeremy pointed out the Promise card: 'Their Ultra133
TX2 card works fine on 33MHz PCI bus machines; don't worry about the
card being 66MHz, it will downthrottle correctly.'), so maybe the problem
will be solved if I leave just two NICs and no rl0...
Actually I'm using 6.3 here because I didn't want this to happen and I
was aware of such problems happening on 7-current....

So tests must be done... please just tell me whether the patch will be
helpful, or whether I should try:

1. remove rl0 and run only one ISP for the test.
2. replace the Ultra133 card with another one.
3. try replacing the ATA100 cables (the ones with 80 wires) with
older ones with only 40 wires?
4. ? anything else?



Peter Jeremy

2008-Sep-26 22:21 UTC

On 2008-Sep-26 13:12:14 +0300, Anton - Valqk <lists@lozenetz.org> wrote:
>1. I get a lot of DMA timeouts, mostly on ad5 and ad7, where I keep
...
>dmesg.today:ad7: FAILURE - WRITE_DMA48 status=51<READY,DSC,ERROR>
>error=10<NID_NOT_FOUND> LBA=374303456
This is a bad sign and suggests a dying disk but...
>2. The other strange issue is that when (I guess) the timeouts start,
>*sometimes*, not every time, I lose connectivity on xl0 or fxp0
You have an awful lot of hardware in this box.  Are you sure the
power supply and cooling are up to scratch?  Sagging power could
cause the problems you report, as could overheating.

-- 
Peter Jeremy
Please excuse any delays as the result of my ISP's inability to implement
an MTA that is either RFC2821-compliant or matches their claimed behaviour.
