Hi all,

First of all, I am reposting this email to the Shorewall list, as there are a lot of firewall experts here who might know what the hell is going on. We have also posted it to the Linux bridge list (we needed acceptance first) and the LEAF list. Thank you for your understanding.

We are experiencing a very strange problem and need some help. We have a LEAF / Shorewall based box (actually a Lince box, kernel 2.4.26) running as a bridge with 8 gigabit Ethernet ports, a PIV 3GHz and 2GB RAM. 4 of the ports share the same PCI Express bus and the other 4 a different PCI bus. We have NAPI enabled on all ports and IRQ moderation enabled (dynamic). The firewall was not activated at this point.

Some ASCII art before proceeding.

    Router 1          Router 2
       |                 |
       ----- Switch -----
               |
               |
            Firewall

    WAN   LAN   Empty Empty   Empty Empty Empty Empty
     |     |     |     |       |     |     |     |
    eth0  eth1  eth2  eth3    eth4  eth5  eth6  eth7
    --------------------      ----------------------
           PCI-X                      PCI

Both routers use Cisco's HSRP to share information about who is alive. It uses multicast UDP packets to address 224.0.0.1, port 1985.

The problem is that after a while (1 or 2 minutes) the CPU reaches 100% (load 0.99, 99% system), with the ksoftirqd_CPU0 process at 99%. Using iptraf we found that eth4 to eth7 (the ones that share the PCI bus) are running at full speed. The traffic is on port 1985 and comes from the 2 virtual IPs of the redundant routers. The packets seem to enter an infinite loop and completely kill the system. BTW, the only ports in use are eth0 and eth1, both on the PCI-X bus, and eth2 and eth3 seem unaffected (no traffic). Bear in mind that real traffic on eth0 and eth1 does not exceed 1Mbps. Also, no service is provided at this point, not even firewalling.

The problem appears both with and without STP activated, and we have verified there is no loop in the network.
If we disable eth4 to eth7 (ip link set ethX down) the problem seems to disappear, but we are not sure, as we did not want to disturb the client any longer (in fact the problem did not appear for 15 minutes, whereas with the ports up it appeared in well under 5 minutes). In this case, even activating things like a Netflow probe on eth0 did not disturb the system at all.

The same problem seems to appear on a Via 1GHz box with 4 Realtek Ethernet ports and around 4Mbps of traffic (this system was placed under heavier load and, as the problem appeared, we tested with the big box the same afternoon). When the problem appeared this box was so slow that we could not even open an ssh session, so we do not know if it is the same problem (but we bet it is).

So, some questions:

1) Is this related to running as a bridge? Would the problem disappear if we used a pseudo bridge (proxy ARP)?

2) Can such a beast sustain 8 Ethernet ports as a single bridge? Bear in mind they do not carry gigabit traffic, they are just gigabit ports :) What is the limit for a Linux bridge? Would it be better to break it into two bridges?

3) As this traffic is only needed between the two routers and does not need to pass through the firewall, will dropping it on eth0 solve the problem? (That way the packets cannot reach the other Ethernet ports.) What would happen with other multicast based apps? Would they need to be dropped too?

Many thanks in advance. Regards.

-- 
Jaime Nebrera - jnebrera@eneotecnologia.com
Consultor TI - ENEO Tecnologia SL
Telf.- 95 455 40 62 - 619 04 55 18
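One way to check whether the supposedly empty ports really carry traffic, without iptraf, is to compare two snapshots of /proc/net/dev. The following is only a minimal sketch: the function names parse_net_dev and counter_deltas are made up for illustration, and it assumes the usual /proc/net/dev column order (receive bytes first, receive multicast eighth, transmit bytes ninth).

```python
def parse_net_dev(text):
    """Return {iface: (rx_bytes, rx_multicast, tx_bytes)} from /proc/net/dev text."""
    stats = {}
    for line in text.splitlines():
        # Interface lines look like "  eth0: 1000 10 ..."; header lines have no colon.
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        if len(fields) < 9 or not fields[0].isdigit():
            continue
        stats[name.strip()] = (int(fields[0]), int(fields[7]), int(fields[8]))
    return stats


def counter_deltas(before_text, after_text):
    """Per-interface counter growth between two snapshots of /proc/net/dev."""
    before = parse_net_dev(before_text)
    after = parse_net_dev(after_text)
    return {
        iface: tuple(a - b for a, b in zip(after[iface], before[iface]))
        for iface in after
        if iface in before
    }
```

Reading /proc/net/dev twice a few seconds apart and passing both texts to counter_deltas shows which interfaces are growing; a large delta on an unused port would support the flooding theory.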
Sounds like you have a NIC driver/IRQ issue to me.
What does "cat /proc/interrupts" give you?
Are the 2 active cards sharing the same interrupt?
Maybe you could switch eth1 and eth2 around and see what happens; don't move them in the box, just reconfigure them.
Just my 2 cents worth, hope it helps.

Jerry Vonau
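The shared-interrupt check suggested above can be automated. This is a rough sketch, not a definitive tool: the function name shared_irqs is invented here, and it assumes the 2.4-style /proc/interrupts layout, where each row has one counter per CPU, a single controller-type token such as XT-PIC, and then a comma-separated list of devices.

```python
def shared_irqs(interrupts_text):
    """Map IRQ number -> list of devices, for IRQs used by more than one device."""
    shared = {}
    lines = interrupts_text.splitlines()
    if not lines:
        return shared
    n_cpus = len(lines[0].split())  # header row: CPU0 [CPU1 ...]
    for line in lines[1:]:
        irq, sep, rest = line.partition(":")
        irq = irq.strip()
        if not sep or not irq.isdigit():
            continue  # skip summary rows such as NMI or ERR
        fields = rest.split()
        tail = fields[n_cpus:]  # controller type, then device names
        if len(tail) < 2:
            continue
        # Assumption: exactly one controller token precedes the device list.
        devices = [d.strip() for d in " ".join(tail[1:]).split(",") if d.strip()]
        if len(devices) > 1:
            shared[int(irq)] = devices
    return shared
```

Calling shared_irqs(open("/proc/interrupts").read()) on the box would list any IRQ line that two or more devices share, for example eth0 together with one of the unused ports.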
Hi Jerry,

> Sounds like you have a NIC driver/IRQ issue to me.
> What does "cat /proc/interrupts" give you?
> Are the 2 active cards sharing the same interrupt?
> Maybe you could switch eth1 and eth2 around and
> see what happens; don't move them in the box, just
> reconfigure them.
> Just my 2 cents worth, hope it helps.

We don't know, as we don't have access to that setup right now. We are trying to mimic it in our office, but we don't have equipment with HSRP :(

Also, before going to the client's site we placed the box in our lab and did some testing. We had all our PCs scanning (nmap) the firewall or a PC behind it (100Mbps network), and the box did not even notice the load. Of course, we did not see strange traffic on unused ports either.

Regarding swapping the Ethernet ports, that might be hard. The problem is that those two ports have a LAN Bypass feature that cannot be changed. If the box is turned off, those two ports are bypassed. This is good for QoS appliances and our clients ask for it.

Still, thanks. We will run more tests and report them to the list. I just sent the email to see if somebody could give us a clue. As you can see, no problem arose in our earlier tests, and they were harsher than the real environment, but we did not have those pesky HSRP packets.

Many thanks. Regards.

-- 
Jaime Nebrera - jnebrera@eneotecnologia.com
Consultor TI - ENEO Tecnologia SL
Telf.- 95 455 40 62 - 619 04 55 18
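Without HSRP equipment, one possible way to approximate the traffic in a lab is to send plain UDP datagrams to the multicast group and port mentioned in the thread. This is only a sketch under clear assumptions: the payload is a dummy string, not a valid HSRP hello, so it can only exercise how the bridge floods multicast frames; the names make_multicast_sender and send_hellos are invented for illustration.

```python
import socket
import time

HSRP_GROUP = "224.0.0.1"  # multicast address as reported in this thread
HSRP_PORT = 1985          # UDP port as reported in this thread


def make_multicast_sender(ttl=1):
    """Create a UDP socket whose multicast packets stay on the local segment."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, ttl)
    return sock


def send_hellos(sock, count, interval=1.0, payload=b"lab-test-not-real-hsrp"):
    """Send `count` dummy datagrams to the group, pausing `interval` seconds each."""
    sent = 0
    for _ in range(count):
        sock.sendto(payload, (HSRP_GROUP, HSRP_PORT))
        sent += 1
        time.sleep(interval)
    return sent
```

Running send_hellos(make_multicast_sender(), count=60) from two lab PCs on the WAN side, while watching the unused bridge ports, would show whether multicast alone is enough to trigger the flood.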