thr3ads.net - freebsd stable - broken re(4) [May 2008]

Gerrit Kühn

2008-May-27 15:23 UTC

broken re(4)

Hi folks,

I have four identical ITX boards from Jetway here, each having two re(4)
onboard nics:

re0@pci0:0:9:0: class=0x020000 card=0x10ec16f3 chip=0x816710ec rev=0x10 hdr=0x00
    vendor     = 'Realtek Semiconductor'
    device     = 'RTL8169/8110 Family Gigabit Ethernet NIC'
    class      = network
    subclass   = ethernet
re1@pci0:0:11:0:        class=0x020000 card=0x10ec16f3 chip=0x816710ec rev=0x10 hdr=0x00
    vendor     = 'Realtek Semiconductor'
    device     = 'RTL8169/8110 Family Gigabit Ethernet NIC'
    class      = network
    subclass   = ethernet
atapci0@pci0:0:15:0:    class=0x01018f card=0x31491106 chip=0x31491106 rev=0x80


I run FreeBSD 7-stable from early March 08 on three of these
machines and noticed no problems with networking with that so far.
Some days ago I installed a fourth machine with 7-stable from early May
(and some days later -because of the problems described below- to May
17th). With this new machine I see several networking problems. The most
prominent are these two:

- heavy networking traffic (in this case backup via tar & NFS) causes hangs
for about 10s-30s and sometimes also leads to watchdog timeouts: 
May 27 09:04:07 protoserve kernel: re0: watchdog timeout 
May 27 09:04:07 protoserve kernel: re0: link state changed to DOWN
May 27 09:04:10 protoserve kernel: re0: link state changed to UP

- copying large files (more than some 100MB) via ssh/scp drops the
connection due to "corrupted MAC on input":
Disconnecting: Corrupted MAC on input.
lost connection

In the latter case the networking traffic should actually not be that
high, because these are nanobsd systems which are transferring a new image
file (system update, 2GB) via ssh (so the bottleneck should be the write
speed of the CF card used to hold the system).
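
For context on the second symptom: "Corrupted MAC on input" is what ssh reports when the message authentication code of a received packet does not verify, meaning the bytes changed somewhere between sender and receiver. The following is only a minimal sketch of that check. The key, the packet contents, and the choice of HMAC-SHA256 are assumptions for illustration; the MAC algorithm a real ssh session uses is negotiated and may differ.

```python
import hashlib
import hmac


def tag(key: bytes, payload: bytes) -> bytes:
    """Return an HMAC-SHA256 tag for payload under key."""
    return hmac.new(key, payload, hashlib.sha256).digest()


def verify(key: bytes, payload: bytes, received_tag: bytes) -> bool:
    """Recompute the tag on the receiving side and compare in constant time."""
    return hmac.compare_digest(tag(key, payload), received_tag)


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with exactly one bit inverted."""
    corrupted = bytearray(data)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


# Hypothetical session key and packet, not taken from the thread.
key = b"hypothetical-session-key"
packet = b"x" * 1024
packet_tag = tag(key, packet)

intact_ok = verify(key, packet, packet_tag)
damaged_ok = verify(key, flip_bit(packet, 4097), packet_tag)
```

A single flipped bit anywhere in the packet is enough to make the check fail, which is why the error shows up even when the link otherwise appears to work.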


I do not see these problems with the old codebase from March 08 on my old
machines. The cvs shows a large MFC for the re-driver in April, so I
guessed something came in there which broke things here. Therefore I
downgraded the new system to a cvs codebase from March 1st, but the
problems persist. They also exist on both interfaces. memtest86 has been
running for hours now without finding anything wrong.

Any hints what I should do next to find the culprit?


cu
  Gerrit

Michael Proto

2008-May-27 15:45 UTC

Gerrit Kühn wrote:
> Hi folks,
> 
> I have four identical ITX boards from Jetway here, each having two re(4)
> onboard nics:
> 
> re0@pci0:0:9:0: class=0x020000 card=0x10ec16f3 chip=0x816710ec rev=0x10
> hdr=0x00 vendor     = 'Realtek Semiconductor'
>     device     = 'RTL8169/8110 Family Gigabit Ethernet NIC'
>     class      = network
>     subclass   = ethernet
> re1@pci0:0:11:0:        class=0x020000 card=0x10ec16f3 chip=0x816710ec
> rev=0x10 hdr=0x00 vendor     = 'Realtek Semiconductor'
>     device     = 'RTL8169/8110 Family Gigabit Ethernet NIC'
>     class      = network
>     subclass   = ethernet
> atapci0@pci0:0:15:0:    class=0x01018f card=0x31491106 chip=0x31491106
> rev=0x80
> 
> 
> I run FreeBSD 7-stable from early March 08 on three of these
> machines and noticed no problems with networking with that so far.
> Some days ago I installed a fourth machine with 7-stable from early May
> (and some days later -because of the problems described below- to May
> 17th). With this new machine I see several networking problems. The most
> prominent are these two:
> 
> - heavy networking traffic (in this case backup via tar & NFS) causes hangs
> for about 10s-30s and sometimes also leads to watchdog timeouts:
> May 27 09:04:07 protoserve kernel: re0: watchdog timeout
> May 27 09:04:07 protoserve kernel: re0: link state changed to DOWN
> May 27 09:04:10 protoserve kernel: re0: link state changed to UP
> 
> - copying large files (more than some 100MB) via ssh/scp drops the
> connection due to "corrupted MAC on input":
> Disconnecting: Corrupted MAC on input.
> lost connection
> 
> In the latter case the networking traffic should actually not be that
> high, because these are nanobsd systems which are transferring a new image
> file (system update, 2GB) via ssh (so the bottleneck should be the write
> speed of the CF card used to hold the system).
> 
> 
> I do not see these problems with the old codebase from March 08 on my old
> machines. The cvs shows a large MFC for the re-driver in April, so I
> guessed something came in there which broke things here. Therefore I
> downgraded the new system to a cvs codebase from March 1st, but the
> problems persist. They also exist on both interfaces. memtest86 is running
> for hours now without finding something wrong.
> 
> Any hints what I should do next to find the culprit?
> 
I'm running 6.3 on the exact same Jetway board at home, and while I
haven't been bitten by the DOWN/UP issue I have seen the occasional
"corrupted MAC on input" error when doing an ssh/scp. Seems to have
simmered-down since moving from 6.3-RELEASE to 6.3-STABLE (last
supped/rebuilt on 5/6/08).

Note this is using only one of the 2 on-board NICs. I disabled the 2nd
one in the BIOS as I don't need it at the moment.



-Proto

Pyun YongHyeon

2008-May-28 00:28 UTC

On Tue, May 27, 2008 at 04:52:32PM +0200, Gerrit Kühn wrote:
 > Hi folks,
 > 
 > I have four identical ITX boards from Jetway here, each having two re(4)
 > onboard nics:
 > 
 > re0@pci0:0:9:0: class=0x020000 card=0x10ec16f3 chip=0x816710ec rev=0x10
 > hdr=0x00 vendor     = 'Realtek Semiconductor'
 >     device     = 'RTL8169/8110 Family Gigabit Ethernet NIC'
 >     class      = network
 >     subclass   = ethernet
 > re1@pci0:0:11:0:        class=0x020000 card=0x10ec16f3 chip=0x816710ec
 > rev=0x10 hdr=0x00 vendor     = 'Realtek Semiconductor'
 >     device     = 'RTL8169/8110 Family Gigabit Ethernet NIC'
 >     class      = network
 >     subclass   = ethernet
 > atapci0@pci0:0:15:0:    class=0x01018f card=0x31491106 chip=0x31491106
 > rev=0x80 
 > 
 > 
 > I run FreeBSD 7-stable from early March 08 on three of these
 > machines and noticed no problems with networking with that so far.
 > Some days ago I installed a fourth machine with 7-stable from early May
 > (and some days later -because of the problems described below- to May
 > 17th). With this new machine I see several networking problems. The most
 > prominent are these two:
 > 
 > - heavy networking traffic (in this case backup via tar & NFS) causes hangs
 > for about 10s-30s and sometimes also leads to watchdog timeouts: 
 > May 27 09:04:07 protoserve kernel: re0: watchdog timeout 
 > May 27 09:04:07 protoserve kernel: re0: link state changed to DOWN
 > May 27 09:04:10 protoserve kernel: re0: link state changed to UP
 > 
 > - copying large files (more than some 100MB) via ssh/scp drops the
 > connection due to "corrupted MAC on input":
 > Disconnecting: Corrupted MAC on input.
 > lost connection
 > 
 > In the latter case the networking traffic should actually not be that
 > high, because these are nanobsd systems which are transferring a new image
 > file (system update, 2GB) via ssh (so the bottleneck should be the write
 > speed of the CF card used to hold the system).
 > 
 > 
 > I do not see these problems with the old codebase from March 08 on my old
 > machines. The cvs shows a large MFC for the re-driver in April, so I
 > guessed something came in there which broke things here. Therefore I
 > downgraded the new system to a cvs codebase from March 1st, but the
 > problems persist. They also exist on both interfaces. memtest86 is running
 > for hours now without finding something wrong.
 > 
 > Any hints what I should do next to find the culprit?
 > 

There were similar reports on this issue. It seems that it's very
hard to make re(4) work across so many RTL8168/8169/8111 revisions
without documentation, as different revisions require different
workarounds. Anyway, would you try this one? The patch was generated
against HEAD but it would apply to STABLE too.
http://people.freebsd.org/~yongari/re/re.HEAD.20080519

-- 
Regards,
Pyun YongHyeon

Wilko Bulte

2008-May-30 13:58 UTC

Quoting Gerrit Kühn, who wrote on Fri, May 30, 2008 at 02:47:59PM +0200 ..
> On Fri, 30 May 2008 13:49:24 +0200 Wilko Bulte <wb@freebie.xs4all.nl>
> wrote about Re: broken re(4):
> 
> WB> > Typing "pci riser card jumper" in Google will give you
> WB> > many more pages with interesting (or frightening) stuff
> WB> > to read.
> 
> WB> Well, if you know how the PCI bus electrically works this kind of
> WB> problem is hardly a surprise ;-)
> 
> Well, the riser that came with this 1HU-chassis is probably even more
> frightening: it plugs into the pci port and uses a short ribbon cable to
> connect to an extra board which holds the cards.
Hmmm... brr....

-- 
Wilko Bulte				wilko@FreeBSD.org

Daniele Bastianini

2008-Jun-10 19:52 UTC

On 27 May 2008, at 16:52, Gerrit Kühn wrote:
> Hi folks,
>
> I have four identical ITX boards from Jetway here, each having two re(4)
> onboard nics:
>
> re0@pci0:0:9:0: class=0x020000 card=0x10ec16f3 chip=0x816710ec rev=0x10
> hdr=0x00 vendor     = 'Realtek Semiconductor'
>     device     = 'RTL8169/8110 Family Gigabit Ethernet NIC'
>     class      = network
>     subclass   = ethernet
> re1@pci0:0:11:0:        class=0x020000 card=0x10ec16f3 chip=0x816710ec
> rev=0x10 hdr=0x00 vendor     = 'Realtek Semiconductor'
>     device     = 'RTL8169/8110 Family Gigabit Ethernet NIC'
>     class      = network
>     subclass   = ethernet
> atapci0@pci0:0:15:0:    class=0x01018f card=0x31491106 chip=0x31491106
> rev=0x80
>
>
> I run FreeBSD 7-stable from early March 08 on three of these
> machines and noticed no problems with networking with that so far.
> Some days ago I installed a fourth machine with 7-stable from early May
> (and some days later -because of the problems described below- to May
> 17th). With this new machine I see several networking problems. The most
> prominent are these two:
>
> - heavy networking traffic (in this case backup via tar & NFS) causes hangs
> for about 10s-30s and sometimes also leads to watchdog timeouts:
> May 27 09:04:07 protoserve kernel: re0: watchdog timeout
> May 27 09:04:07 protoserve kernel: re0: link state changed to DOWN
> May 27 09:04:10 protoserve kernel: re0: link state changed to UP
>
> - copying large files (more than some 100MB) via ssh/scp drops the
> connection due to "corrupted MAC on input":
> Disconnecting: Corrupted MAC on input.
> lost connection
I had the same problem.
I fixed it (for now) by doing a buildworld with
*default date=2008.03.01.00.00.00 in my src csup configuration.

I'm not skilled enough to investigate the sources, but the problem
appeared after this date.

Regards
   Daniele Bastianini

Gerrit Kühn

2008-Jun-11 07:25 UTC

On Tue, 10 Jun 2008 20:43:04 +0200 Daniele Bastianini
<liste.bsd@gmail.com> wrote about Re: broken re(4):

DB> > - copying large files (more than some 100MB) via ssh/scp drops the
DB> > connection due to "corrupted MAC on input":
DB> > Disconnecting: Corrupted MAC on input.
DB> > lost connection

DB> I had the same problem.
DB> I fixed it (for now) making a buildworld with
DB> *default date=2008.03.01.00.00.00 in my src csup configuration.

DB> I'm not so skilled to investigate in the sources but the problem is  
DB> after this date.

For me all versions from cvs and all patches from Pyun are working now,
after I have solved the issue with the bad riser card. I still think it's
funny that the riser causes this kind of trouble for the networking chips.

On the other hand, I have not been able to get more than about 10MByte/s
through the interfaces of this particular system. I have 1GBit-networking
equipment, and the other systems (which are used as router) have no
problem doing a throughput of >20MB/s. Even bonding the two interfaces
using lagg(4) does not improve the performance - where else could be the
bottleneck?
The only difference here is that I have the extra SATA-controller with
disks in there. However, the disks appear to be as fast as I can expect
from a SATA150-interface.
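
To put the figures above side by side, here is a small worked conversion. It is only arithmetic on nominal line rates and ignores Ethernet, IP and TCP overhead, so the real ceilings are somewhat lower. The function name is made up for this sketch, and the 100 Mbit/s row is included purely for comparison; nothing in the thread establishes the negotiated link speed.

```python
def mbyte_per_s(link_mbit_per_s: float) -> float:
    """Convert a nominal link rate in Mbit/s to MByte/s (8 bits per byte)."""
    return link_mbit_per_s / 8.0


gigabit_ceiling = mbyte_per_s(1000)        # nominal 1 Gbit/s link
fast_ethernet_ceiling = mbyte_per_s(100)   # nominal 100 Mbit/s link, for comparison

observed = 10.0       # MByte/s reported for the slow machine
other_systems = 20.0  # MByte/s lower bound reported for the routers

# Fraction of the nominal gigabit ceiling that each figure represents.
observed_share = observed / gigabit_ceiling
other_share = other_systems / gigabit_ceiling

print(f"1 Gbit/s   ~ {gigabit_ceiling:.1f} MByte/s")
print(f"100 Mbit/s ~ {fast_ethernet_ceiling:.1f} MByte/s")
print(f"observed   = {observed_share:.0%} of the gigabit ceiling")
print(f"routers    > {other_share:.0%} of the gigabit ceiling")
```

Both reported rates are a small fraction of what a gigabit link can nominally carry, so the question of where the bottleneck sits applies to the faster machines as well.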


cu
  Gerrit

Oliver Fromme

2008-Jun-11 15:26 UTC

Gerrit Kühn wrote:
 > On the other hand, I have not been able to get more than about 10MByte/s
 > through the interfaces of this particular system. I have 1GBit-networking
 > equipment, and the other systems (which are used as router) have no
 > problem doing a throughput of >20MB/s. Even bonding the two interfaces
 > using lagg(4) does not improve the performance - where else could be the
 > bottleneck?

A few questions or hints ...

 - What is the CPU usage during your network test (user,
   sys, intr, idle)?

 - Do you see errors in "netstat -i"?

 - Do you use jumbo frames?

 - Is polling enabled?

 - Are there any network-related sysctls (/etc/sysctl.conf)
   or kernel settings?  Have you enabled kernel debugging
   features (INVARIANTS, WITNESS etc.)?

 - Do you have any packet filter rules (PF, IPF, IPFW)?

Best regards
   Oliver

-- 
Oliver Fromme, secnetix GmbH & Co. KG, Marktplatz 29, 85567 Grafing b. M.
Handelsregister: Registergericht Muenchen, HRA 74606,  Geschäftsfuehrung:
secnetix Verwaltungsgesellsch. mbH, Handelsregister: Registergericht Mün-
chen, HRB 125758,  Geschäftsführer: Maik Bachmann, Olaf Erb, Ralf Gebhart

FreeBSD-Dienstleistungen, -Produkte und mehr:  http://www.secnetix.de/bsd

"C++ is the only current language making COBOL look good."
        -- Bertrand Meyer

Gerrit Kühn

2008-Jun-11 15:38 UTC

On Wed, 11 Jun 2008 17:26:29 +0200 (CEST) Oliver Fromme
<olli@lurza.secnetix.de> wrote about Re: broken re(4):

OF>  > On the other hand, I have not been able to get more than about
OF>  > 10MByte/s through the interfaces of this particular system. I have
OF>  > 1GBit-networking equipment, and the other systems (which are used
OF>  > as router) have no problem doing a throughput of >20MB/s. Even
OF>  > bonding the two interfaces using lagg(4) does not improve the
OF>  > performance - where else could be the bottleneck?

OF> A few questions or hints ...
OF>  - What is the CPU usage during your network test (user,
OF>    sys, intr, idle)?

I will test and report that tomorrow.

OF>  - Do you see errors in "netstat -i"?

None.

OF>  - Do you use jumbo frames?

No.

OF>  - Is polling enabled?

No. I tested polling on a lot of different machines earlier and never
found it to improve performance so far (same for jumbo frames, btw).

OF>  - Are there any network-related sysctls (/etc/sysctl.conf)
OF>    or kernel settings?  Have you enabled kernel debugging
OF>    features (INVARIANTS, WITNESS etc.)?

No, stock GENERIC, only with a lot of things disabled.

OF>  - Do you have any packet filter rules (PF, IPF, IPFW)?

No, not on this machine. The faster machines are router/firewalls, they do
filtering; so it should be something different...


cu
  Gerrit

Pyun YongHyeon

2008-Jun-12 07:03 UTC

On Thu, Jun 12, 2008 at 08:58:10AM +0200, Gerrit Kühn wrote:
 > On Thu, 12 Jun 2008 12:22:28 +0900 Pyun YongHyeon <pyunyh@gmail.com> wrote
 > about Re: broken re(4):
 > 
 > PY> Before checking performance of network controller you had to rule
 > PY> out other factors like disk I/O. Use one of benchmark programs in
 > PY> ports/benchmark.
 > 
 > I already did simple benchmarking by using "dd if=/dev/zero of=file" which
 > gave me several 10s of MByte/s under all circumstances.
 > Can you recommend one of the benchmarking programs for more detailed
 > testing?
 > 

Try netperf or iperf in ports/benchmark.
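
The point of such tools is that they move data from memory to memory, so disk I/O drops out of the measurement. As a rough illustration of that idea only (not a replacement for netperf or iperf), here is a minimal sketch that pushes zero-filled buffers through a TCP socket and reports the rate. The function names, chunk size and transfer size are assumptions, and as written it runs sender and receiver in one process over loopback, so it exercises no NIC at all; for a real test the receiving half would have to run on another host.

```python
import socket
import threading
import time

CHUNK = 64 * 1024  # bytes per send/recv call (arbitrary choice)


def _sink(server: socket.socket, counter: list) -> None:
    """Accept one connection and count bytes until the sender closes."""
    conn, _ = server.accept()
    with conn:
        while True:
            data = conn.recv(CHUNK)
            if not data:
                break
            counter[0] += len(data)


def measure_throughput(total_bytes: int, host: str = "127.0.0.1") -> tuple:
    """Send total_bytes of zeros over TCP; return (bytes_received, MByte/s)."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind((host, 0))  # port 0: let the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]

    counter = [0]
    receiver = threading.Thread(target=_sink, args=(server, counter))
    receiver.start()

    payload = bytes(CHUNK)  # zero-filled buffer, never touches the disk
    sent = 0
    start = time.perf_counter()
    with socket.create_connection((host, port)) as client:
        while sent < total_bytes:
            n = min(CHUNK, total_bytes - sent)
            client.sendall(payload[:n])
            sent += n
    receiver.join()  # wait until the receiver has drained everything
    elapsed = time.perf_counter() - start
    server.close()
    return counter[0], counter[0] / elapsed / 1e6


received, rate = measure_throughput(8 * 1024 * 1024)
print(f"received {received} bytes at {rate:.1f} MByte/s")
```

If a memory-to-memory test between two hosts shows the same low rate as the file transfers, the disks are unlikely to be the limiting factor.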

 > 
 > cu
 >   Gerrit

-- 
Regards,
Pyun YongHyeon
