Daniel Braniss wrote:
> > On 24 Aug 2015, at 10:22, Hans Petter Selasky <hps at selasky.org> wrote:
> >
> > On 08/24/15 01:02, Rick Macklem wrote:
> >> The other thing is the degradation seems to cut the rate by about half
> >> each time. 300-->150-->70. I have no idea if this helps to explain it.
> >
> > Might be a NUMA binding issue for the processes involved.
> >
> > man cpuset
> >
> > --HPS
>
> I can't see how this is relevant, given that the same host, using the
> mellanox/mlxen, behaves much better.
Well, the "ix" driver has a bunch of tunables for things like "number of
queues", and although I'll admit I don't understand how these queues are used,
I think they are related to CPUs and their caches. There is also something
called IXGBE_FDIR, which others have recommended be disabled. (The code is
#ifdef IXGBE_FDIR, but I don't know if it is defined for your kernel?) There
are also tunables for the interrupt rate and something called
hw.ixgbe_tx_process_limit, which appears to limit the number of packets to
send per pass, or something like that?
(I suspect Hans would understand this stuff much better than I do, since I
don't understand it at all. ;-)
At a glance, the mellanox driver looks very different.
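If it helps, here is roughly how I'd poke at them. Treat the names and values
below as examples to adapt, not exact settings - on the newer driver the
tunables seem to live under hw.ix (older ones used hw.ixgbe), and I may have
the spelling slightly wrong:

    # per-device sysctls and counters for the first ix interface
    sysctl dev.ix.0 | more
    # driver-wide tunables (the hw.ix.* ones get set in /boot/loader.conf and need a reboot)
    sysctl hw.ix
    # example /boot/loader.conf lines to experiment with (placeholder values)
    hw.ix.num_queues=4
    hw.ix.max_interrupt_rate=8000

And if you want to test Hans's NUMA theory, cpuset can pin the process doing
the I/O to a given set of cores, something like:

    # <pid> is the pid of whatever is generating the writes
    cpuset -l 0-3 -p <pid>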
> I'm getting different results with the intel/ix depending on who is the nfs
> server.
>
Who knows until you figure out what is actually going on. It could just be the
timing of handling the write RPCs, or when the different servers send acks for
the TCP segments, or ... that causes this for one server and not another.
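If you want to actually see the RPC and ack timing, a packet capture on the
client against each server is one way to collect facts (ix0 and the server
names below are just placeholders):

    # capture NFS traffic to each server, then compare the two in wireshark
    tcpdump -i ix0 -s 0 -w fastserver.pcap host fastserver and port 2049
    tcpdump -i ix0 -s 0 -w slowserver.pcap host slowserver and port 2049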
One of the principles used when investigating airplane accidents is to "never
assume anything" and just try to collect the facts until the pieces of the
puzzle fall into place. I think the same principle works for this kind of
stuff.
I once had a case where a specific read of one NFS file would fail on certain
machines. I won't bore you with the details, but after weeks we got to the
point where we had a lab of identical machines (exactly the same hardware and
exactly the same software loaded on them) and we could reproduce the problem
on about half the machines and not the other half. We (myself and the guy I
worked with) finally noticed that the failing machines were on network ports
of a given switch. We moved the net cables to another switch and the problem
went away.
--> This particular network switch was broken in such a way that it would
garble one specific packet consistently, but worked fine for everything else.
My point here is that, if someone had suggested "the network switch might be
broken" at the beginning of investigating this, I would probably have
dismissed it, based on "the network is working just fine", but in the end,
that was the problem.
--> I am not suggesting you have a broken network switch, just "don't take
anything off the table until you know what is actually going on".
And to be honest, you may never know, but it is fun to try and solve these
puzzles.
Beyond what I already suggested, I'd look at the "ix" driver's stats and
tunables and see if any of the tunables has an effect. (And, yes, it will take
time to work through these.)
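For the stats, something run while the test is going might show whether a
queue or the interface itself is dropping packets (just a sketch - the exact
counter names depend on the driver version):

    # per-second interface counters during the write test
    netstat -I ix0 -w 1
    # dump the per-queue and error counters before and after a run and diff them
    sysctl dev.ix.0 | egrep -i 'queue|drop|err'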
Good luck with it, rick
>
> danny
>