OK folks - my turn to ask desperately for help. I hope it doesn't turn out that the answer is in the FAQ! ;-)

We have a router cluster running SLES 9 and Shorewall 2.4.7. It has been working very well for us up until now. (I originally installed it about a year ago on Shorewall 2.4.0prewhatever.)

Last night we had some power fluctuations, and today lots of machines can't contact the server LAN. We seem to be getting intermittent packet loss. e.g. One Windows machine pinging the server 12 times, 4 packets lost, 2 OK, 2 more lost, 4 OK. This is reproducible for all machines on that VLAN.

Things on the same LAN as the server work fine, as have *some* machines on the other VLANs at some times today, but all machines we put on that LAN now fail to map most server drives. There are a couple of isolated, minor server drives which work fine, but not our primary file/print cluster. :-(

I'm getting no relevant messages in the Shorewall logs, and I eat my own dog food on this point and *always* leave my policies in log mode, with an explicit zone2zone policy for every zone combination.

My first guess was that we were filling up some TCP tables in the kernel, but I'm getting no errors to that effect in dmesg (although I'm not sure whether this requires any tuning).

Tom pointed me to the 'ip -s link ls' command, which shows a lot of errors on several of my links, but we're not sure how to reset the stats. The error count doesn't seem to be going up at the moment, either.

Any ideas?

Paul

-------------------------------------------------------
This SF.net email is sponsored by: Splunk Inc. Do you grep through log files for problems? Stop! Download the new AJAX search engine that makes searching your log files as easy as surfing the web. DOWNLOAD SPLUNK!
http://sel.as-us.falkag.net/sel?cmd=lnk&kid=103432&bid=230486&dat=121642
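On the question of resetting the 'ip -s link ls' statistics: one way around it is to leave the counters alone and compare two readings taken some time apart, so only new errors show up. The following is a minimal sketch of that idea, not something from the original post. It assumes the kernel exposes per-interface counters as files under /sys/class/net/<interface>/statistics/; the function names and the choice of counters are illustrative.

```python
from pathlib import Path

# Counter files to sample; this selection is an assumption for illustration.
COUNTERS = ("rx_errors", "tx_errors", "rx_dropped", "tx_dropped")


def read_counters(sysfs_root="/sys/class/net"):
    """Return {interface: {counter: value}} for every interface found."""
    snapshot = {}
    for iface_dir in sorted(Path(sysfs_root).iterdir()):
        stats_dir = iface_dir / "statistics"
        if not stats_dir.is_dir():
            continue  # skip entries that are not network interfaces
        values = {}
        for name in COUNTERS:
            counter_file = stats_dir / name
            if counter_file.is_file():
                values[name] = int(counter_file.read_text().strip())
        snapshot[iface_dir.name] = values
    return snapshot


def counter_deltas(before, after):
    """Return only the counters that increased between two snapshots."""
    deltas = {}
    for iface, after_values in after.items():
        before_values = before.get(iface, {})
        changed = {
            name: value - before_values.get(name, 0)
            for name, value in after_values.items()
            if value > before_values.get(name, 0)
        }
        if changed:
            deltas[iface] = changed
    return deltas
```

To use it, call read_counters() once, wait while reproducing the packet loss, call it again, and pass both results to counter_deltas(). An empty result means no counter moved in that window, which matches the observation that the error count is not currently going up.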
> OK folks - my turn to ask desperately for help. I hope it doesn't turn
> out that the answer is in the FAQ! ;-)
>
> We have a router cluster running SLES 9 and Shorewall 2.4.7. It has
> been working very well for us up until now. (I originally installed it
> about a year ago on Shorewall 2.4.0prewhatever.)
>
> Last night we had some power fluctuations, and today lots of machines
> can't contact the server LAN. We seem to be getting intermittent packet
> loss. e.g. One Windows machine pinging the server 12 times, 4 packets
> lost, 2 OK, 2 more lost, 4 OK. This is reproducible for all machines on
> that VLAN.
>
> Things on the same LAN as the server work fine, as have *some* machines
> on the other VLANs at some times today, but all machines we put on that
> LAN now fail to map most server drives. There are a couple of isolated,
> minor server drives which work fine, but not our primary file/print
> cluster. :-(
>
> I'm getting no relevant messages in the Shorewall logs, and I eat my own
> dog food on this point and *always* leave my policies in log mode, with
> an explicit zone2zone policy for every zone combination.
>
> My first guess was that we were filling up some TCP tables in the
> kernel, but I'm getting no errors to that effect in dmesg (although I'm
> not sure whether this requires any tuning).
>
> Tom pointed me to the 'ip -s link ls' command, which shows a lot of
> errors on several of my links, but we're not sure how to reset the
> stats. The error count doesn't seem to be going up at the moment, either.
>
> Any ideas?

To me this sounds very much like a problem on your LAN side. At least I don't remember a similar problem which turned out to be a server/firewall/whatever problem.

From what I understand you had issues with your power supply and I guess some devices have lost power in the event.
I'll try to list the things that have been important for me to check in similar situations:

- Is it possible that some network device (most likely a switch) has lost its configuration because it was not saved to flash after it was configured?
- Does your network make use of some kind of layer 2 redundancy (spanning tree, whatever)? Maybe things got out of control while devices lost power.
- Do your switches have features like broadcast storm protection enabled? Some modern switches do this (by default!) and those features can cause exactly what you describe. It's really unbelievable that switches are delivered with such defaults.
- You said firewall cluster, maybe you also have other clusters. You can run into all kinds of ARP-related problems with clusters, redundant ethernet connections, ethernet load balancing and such.

Good luck,
Simon
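For the ARP item in the list above, one possible check is to snapshot the ARP table on a Linux router twice and see whether any IP address has moved to a different MAC address. This is only a sketch under assumptions: it parses text in the column layout of /proc/net/arp (IP address, HW type, Flags, HW address, Mask, Device), and the function names are made up for illustration. A change right after a cluster failover would be expected; repeated changes with no failover would be worth a closer look.

```python
def parse_arp_table(text):
    """Parse text in /proc/net/arp layout into {ip: mac}."""
    entries = {}
    for line in text.splitlines()[1:]:  # the first line is the column header
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        ip, mac = fields[0], fields[3].lower()
        if mac != "00:00:00:00:00:00":  # all zeros marks an unresolved entry
            entries[ip] = mac
    return entries


def changed_macs(before, after):
    """Return {ip: (old_mac, new_mac)} for addresses whose MAC changed."""
    return {
        ip: (before[ip], mac)
        for ip, mac in after.items()
        if ip in before and before[ip] != mac
    }
```

Typical use would be to read /proc/net/arp into a string, parse it, wait a while, repeat, and hand both dictionaries to changed_macs().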
Simon Matter wrote:
>> ...
> To me this sounds very much like a problem on your LAN side. At least I
> don't remember a similar problem which turned out to be a
> server/firewall/whatever problem.

Well, Simon, it turns out you are exactly right. The server on the LAN side had its nappy in a knot somehow. We failed over the clustered file services to the other node in the cluster, and everything started working again.

> From what I understand you had issues with your power supply and I guess
> some devices have lost power in the event. I try to make a list of things
> to check which have been important to me in similar situations:
>
> - Is it possible that some network device ( most likely switches) have
> lost its configuration because it has not been saved in flash after it
> was configured?

We had already rebooted some of our switches (including ones that were not affected by the power fluctuations).

> - Does your network make use of some kind of layer 2 redundancy (spanning
> tree, whatever). Maybe things got out of control while devices lost power.

We definitely use RSTP, and the servers in question use the Broadcom B57 load-balancing driver for their front end LANs. Maybe it was one of those that was screwed up.

> - Do your switches have features enabled like broadcast storm protection.
> Some modern switches do this (by default!) and those features can cause
> exactly what you describe. It's really unbelievable that switches are
> delivered with such defaults.

We have that turned on, but I've never seen it affect anything like this.

> - You said firewall cluster, maybe you also have other clusters. You can
> run into all kind of ARP related problems with clusters, redundant
> ethernet connections, ethernet load balancing and such.

We have 3 clusters and counting. :-) All of them so far have seemed very well behaved with ARP. (They send out gratuitous ARPs whenever they get a failover.)

Anyway, problem solved. Thanks for your input.
Paul