Daniel Braniss wrote:
> On 24 Aug 2015, at 10:22, Hans Petter Selasky <hps at selasky.org> wrote:
>>
>> On 08/24/15 01:02, Rick Macklem wrote:
>>> The other thing is the degradation seems to cut the rate by about half
>>> each time.
>>> 300-->150-->70 I have no idea if this helps to explain it.
>>
>> Might be a NUMA binding issue for the processes involved.
>>
>> man cpuset
>>
>> --HPS
>
> I can't see how this is relevant, given that the same host, using the
> mellanox/mlxen, behaves much better.

Well, the "ix" driver has a bunch of tunables for things like "number of
queues", and although I'll admit I don't understand how these queues are
used, I think they are related to CPUs and their caches. There is also
something called IXGBE_FDIR, which others have recommended be disabled.
(The code is #ifdef IXGBE_FDIR, but I don't know if it is defined for your
kernel?) There are also tunables for the interrupt rate and something called
hw.ixgbe_tx_process_limit, which appears to limit the number of packets to
send, or something like that?
(I suspect Hans would understand this stuff much better than I do, since I
don't understand it at all. ;-)

At a glance, the mellanox driver looks very different.

> I'm getting different results with the intel/ix depending on who is the
> nfs server
>
Who knows until you figure out what is actually going on. It could just be
the timing of handling the write RPCs, or when the different servers send
acks for the TCP segments, or ... that causes this for one server and not
another.

One of the principles used when investigating airplane accidents is "never
assume anything" and just try to collect the facts until the pieces of the
puzzle fall into place. I think the same principle works for this kind of
stuff.
I once had a case where a specific read of one NFS file would fail on certain
machines. I won't bore you with the details, but after weeks we got to the
point where we had a lab of identical machines (exactly the same hardware and
exactly the same software loaded on them) and we could reproduce this problem
on about half the machines and not the other half. We (myself and the guy I
worked with) finally noticed that the failing machines were on network ports
of a given switch. We moved the net cables to another switch and the problem
went away.
--> This particular network switch was broken in such a way that it would
    garble one specific packet consistently, but worked fine for everything
    else.
My point here is that, if someone had suggested "the network switch might be
broken" at the beginning of the investigation, I would probably have
dismissed it, based on "the network is working just fine", but in the end,
that was the problem.
--> I am not suggesting you have a broken network switch, just "don't take
    anything off the table until you know what is actually going on".

And to be honest, you may never know, but it is fun to try and solve these
puzzles.

Beyond what I already suggested, I'd look at the "ix" driver's stats and
tunables and see if any of the tunables has an effect. (And, yes, it will
take time to work through these.)

Good luck with it, rick

>
> danny
>
> _______________________________________________
> freebsd-stable at freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-stable
> To unsubscribe, send any mail to "freebsd-stable-unsubscribe at freebsd.org"
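For readers trying to follow the tunables Rick refers to: the ix(4) counters
and tunables can be dumped with sysctl. These commands are only a sketch; the
exact names differ between driver versions (older drivers use the
hw.ixgbe_* style Rick quotes, newer ones hw.ix.*), so check "sysctl -d"
output on the system in question before relying on any of them:

    # per-interface counters kept by the ix(4) driver (per-queue drops,
    # no-descriptor counts, etc.), useful for spotting the bottleneck
    sysctl dev.ix.0

    # list whatever global ixgbe tunables this driver version exposes
    sysctl -a | grep -E 'hw\.(ix|ixgbe)'

    # example /boot/loader.conf entry for the transmit processing limit
    # Rick mentions (name taken from his mail; verify it exists first)
    hw.ixgbe_tx_process_limit="256"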
Hi,

Some hand-waving suggestions:

* If you're running something before 10.2, please disable IXGBE_FDIR in
  sys/conf/options and sys/modules/ixgbe/Makefile. It's buggy and it caused
  a lot of issues.

* It sounds like some extra latency is happening, so I'd fiddle around with
  the interrupt settings. By default the driver does something called
  adaptive interrupt moderation and it may be getting in the way of what
  you're trying to do. There's a way to disable AIM in /boot/loader.conf and
  manually set the interrupt rate.

* As others have said, TSO has been a bit of a problem - hps has been
  working on solidifying the TSO configuration side of things so NICs
  advertise to the stack what their maximum offload capability is, so things
  like NFS and TCP don't exceed the segment count. I don't know if it's
  tunable without hacking the driver, but maybe hack the driver to reduce
  the count a little to make sure you're not overflowing things and causing
  it to fall back to a slower path (where it copies all the mbufs into a
  single larger one to send to the NIC.)

* Disable software LRO and see if it helps. Since you're doing lots of
  little non-streaming operations, it may actually be hindering.

HTH,

-adrian
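Roughly what Adrian's loader.conf and interface suggestions look like in
practice. The hw.ix.* names below are assumptions based on later ixgbe(4)
documentation and may be spelled differently (or not exist) on older driver
versions, so verify with "sysctl -d" first; the ifconfig flags are standard:

    # /boot/loader.conf: disable adaptive interrupt moderation and pin the
    # interrupt rate to a fixed value (tunable names are assumptions)
    hw.ix.enable_aim="0"
    hw.ix.max_interrupt_rate="8000"

    # at runtime: turn off TSO and software LRO on the interface and re-test
    ifconfig ix0 -tso -lro

    # turn them back on afterwards
    ifconfig ix0 tso lro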
> On Aug 24, 2015, at 3:25 PM, Rick Macklem <rmacklem at uoguelph.ca> wrote:
>
> Daniel Braniss wrote:
>>
>>> On 24 Aug 2015, at 10:22, Hans Petter Selasky <hps at selasky.org> wrote:
>>>
>>> On 08/24/15 01:02, Rick Macklem wrote:
>>>> The other thing is the degradation seems to cut the rate by about half
>>>> each time.
>>>> 300-->150-->70 I have no idea if this helps to explain it.
>>>
>>> Might be a NUMA binding issue for the processes involved.
>>>
>>> man cpuset
>>>
>>> --HPS
>>
>> I can't see how this is relevant, given that the same host, using the
>> mellanox/mlxen, behaves much better.
>
> Well, the "ix" driver has a bunch of tunables for things like "number of
> queues", and although I'll admit I don't understand how these queues are
> used, I think they are related to CPUs and their caches. There is also
> something called IXGBE_FDIR, which others have recommended be disabled.
> (The code is #ifdef IXGBE_FDIR, but I don't know if it is defined for your
> kernel?) There are also tunables for the interrupt rate and something
> called hw.ixgbe_tx_process_limit, which appears to limit the number of
> packets to send, or something like that?
> (I suspect Hans would understand this stuff much better than I do, since I
> don't understand it at all. ;-)

but how does this explain the fact that, at the same time, the throughput to
the NetApp is about 70MB/s while to a FreeBSD server it's above 150MB/s?
(window size negotiation?)
switching off TSO evens out this diff.

> At a glance, the mellanox driver looks very different.
>
>> I'm getting different results with the intel/ix depending on who is the
>> nfs server
>>
> Who knows until you figure out what is actually going on. It could just be
> the timing of handling the write RPCs, or when the different servers send
> acks for the TCP segments, or ... that causes this for one server and not
> another.
>
> One of the principles used when investigating airplane accidents is "never
> assume anything" and just try to collect the facts until the pieces of the
> puzzle fall into place. I think the same principle works for this kind of
> stuff.
> I once had a case where a specific read of one NFS file would fail on
> certain machines. I won't bore you with the details, but after weeks we got
> to the point where we had a lab of identical machines (exactly the same
> hardware and exactly the same software loaded on them) and we could
> reproduce this problem on about half the machines and not the other half.
> We (myself and the guy I worked with) finally noticed that the failing
> machines were on network ports of a given switch. We moved the net cables
> to another switch and the problem went away.
> --> This particular network switch was broken in such a way that it would
>     garble one specific packet consistently, but worked fine for everything
>     else.
> My point here is that, if someone had suggested "the network switch might
> be broken" at the beginning of the investigation, I would probably have
> dismissed it, based on "the network is working just fine", but in the end,
> that was the problem.
> --> I am not suggesting you have a broken network switch, just "don't take
>     anything off the table until you know what is actually going on".
>
> And to be honest, you may never know, but it is fun to try and solve these
> puzzles.

one needs to find the clues ...
at the moment: when things go bad, they stay bad (ix/nfs/tcp/tso and NetApp);
when things are ok, the numbers fluctuate, which is probably due to loads on
the system, but they are far above the 70MB/s (100 to 200).

> Beyond what I already suggested, I'd look at the "ix" driver's stats and
> tunables and see if any of the tunables has an effect. (And, yes, it will
> take time to work through these.)
>
> Good luck with it, rick
>
>>
>> danny
>>
>> _______________________________________________
>> freebsd-stable at freebsd.org mailing list
>> https://lists.freebsd.org/mailman/listinfo/freebsd-stable
>> To unsubscribe, send any mail to "freebsd-stable-unsubscribe at freebsd.org"
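A crude way to reproduce the comparison Danny describes, i.e. the same NFS
write stream with and without TSO on the ix interface (/mnt/nfs is a
stand-in for whatever NFS mount is being tested, and dd is only a rough
benchmark, but the 70 vs. 150 MB/s gap is large enough to show up):

    ifconfig ix0 -tso
    dd if=/dev/zero of=/mnt/nfs/tso-off.bin bs=1m count=4096

    ifconfig ix0 tso
    dd if=/dev/zero of=/mnt/nfs/tso-on.bin bs=1m count=4096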
Hi,

I've now MFC'ed r287775 to 10-stable and 9-stable. I hope this will resolve
the issues with m_defrag() being called on too long mbuf chains due to an
off-by-one in the driver TSO parameters, and that it will be easier to
maintain these parameters in the future.

Some comments were made that we might want to have an option to select
whether the IP header should be counted or not. Certain network drivers
require copying of the whole ETH/TCP/IP header into a separate memory area,
and can then handle one more data payload mbuf for TSO. Others require
DMA-ing of the whole mbuf TSO chain.

I think it is acceptable to have one TX-DMA segment slot free in the case of
2K mbuf clusters being used for TSO. From my experience the limitation
typically kicks in when 2K mbuf clusters are used for TSO instead of 4K mbuf
clusters: 65536 / 4096 = 16, whereas 65536 / 2048 = 32. If an ethernet
hardware driver has a limitation of 24 data segments (mlxen), and assuming
that each mbuf represents a single segment, then if the majority of mbufs
being transmitted are 2K clusters we may have a small, 1/24 = 4.2%, loss of
TX capability per TSO packet.

From what I've seen using iperf, which in turn calls m_uiotombuf(), which in
turn calls m_getm2(), MJUMPAGESIZE'ed mbuf clusters are preferred for large
data transfers, so this issue might only happen in the case of NODELAY being
used on the socket and if the writes are small from the application's point
of view. If an application is writing small amounts of data per send() system
call, it is expected to degrade the system performance. Please file a PR if
it becomes an issue.

Someone asked me to MFC r287775 to 10.X release as well. Is this still
required?

--HPS
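For reference, the arithmetic behind Hans Petter's numbers, spelled out
(this assumes the usual 65536-byte TSO payload limit and the 24-segment
mlxen limit he mentions):

    # payload mbufs needed to build one full-sized TSO burst
    echo $((65536 / 4096))   # 16 when 4K (MJUMPAGESIZE) clusters are used
    echo $((65536 / 2048))   # 32 when 2K clusters are used

    # keeping one of mlxen's 24 TX-DMA segment slots free costs roughly
    # 1/24 of the TX capability per TSO packet, the small loss he refers to
    echo "scale=3; 100 / 24" | bc   # ~4.166 percent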