Daniel Braniss wrote:
> On 24 Aug 2015, at 10:22, Hans Petter Selasky <hps at selasky.org> wrote:
>>
>> On 08/24/15 01:02, Rick Macklem wrote:
>>> The other thing is the degradation seems to cut the rate by about half
>>> each time.
>>> 300-->150-->70 I have no idea if this helps to explain it.
>>
>> Might be a NUMA binding issue for the processes involved.
>>
>> man cpuset
>>
>> --HPS
>
> I can't see how this is relevant, given that the same host, using the
> mellanox/mlxen, behaves much better.

Well, the "ix" driver has a bunch of tunables for things like "number of
queues", and although I'll admit I don't understand how these queues are
used, I think they are related to CPUs and their caches. There is also
something called IXGBE_FDIR, which others have recommended be disabled.
(The code is #ifdef IXGBE_FDIR, but I don't know if it is defined for your
kernel?) There are also tunables for the interrupt rate and something called
hw.ixgbe_tx_process_limit, which appears to limit the number of packets to
send, or something like that?
(I suspect Hans would understand this stuff much better than I do, since I
don't understand it at all. ;-)

At a glance, the mellanox driver looks very different.

> I'm getting different results with the intel/ix depending on who is the
> nfs server
>
Who knows until you figure out what is actually going on. It could just be
the timing of handling the write RPCs, or when the different servers send
acks for the TCP segments, or ... that causes this for one server and not
another.

One of the principles used when investigating airplane accidents is "never
assume anything" and just try to collect the facts until the pieces of the
puzzle fall into place. I think the same principle works for this kind of
stuff.
I once had a case where a specific read of one NFS file would fail on certain
machines. I won't bore you with the details, but after weeks we got to the
point where we had a lab of identical machines (exactly the same hardware and
exactly the same software loaded on them) and we could reproduce this problem
on about half the machines and not the other half. We (myself and the guy I
worked with) finally noticed that the failing machines were on network ports
of a given switch. We moved the net cables to another switch and the problem
went away.
--> This particular network switch was broken in such a way that it would
    garble one specific packet consistently, but worked fine for everything
    else.
My point here is that, if someone had suggested "the network switch might be
broken" at the beginning of the investigation, I would probably have
dismissed it, based on "the network is working just fine", but in the end,
that was the problem.
--> I am not suggesting you have a broken network switch, just "don't take
    anything off the table until you know what is actually going on".

And to be honest, you may never know, but it is fun to try and solve these
puzzles.

Beyond what I already suggested, I'd look at the "ix" driver's stats and
tunables and see if any of the tunables has an effect. (And, yes, it will
take time to work through these.)

Good luck with it, rick

>
> danny
>
> _______________________________________________
> freebsd-stable at freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-stable
> To unsubscribe, send any mail to "freebsd-stable-unsubscribe at freebsd.org"
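For readers trying to follow the tunables Rick refers to: the ix(4) counters
and tunables can be dumped with sysctl. These commands are only a sketch; the
exact names differ between driver versions (older drivers use the
hw.ixgbe_* style Rick quotes, newer ones hw.ix.*), so check "sysctl -d"
output on the system in question before relying on any of them:

    # per-interface counters kept by the ix(4) driver (per-queue drops,
    # no-descriptor counts, etc.), useful for spotting the bottleneck
    sysctl dev.ix.0

    # list whatever global ixgbe tunables this driver version exposes
    sysctl -a | grep -E 'hw\.(ix|ixgbe)'

    # example /boot/loader.conf entry for the transmit processing limit
    # Rick mentions (name taken from his mail; verify it exists first)
    hw.ixgbe_tx_process_limit="256"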
Hi,

Some hand-waving suggestions:

* If you're running something before 10.2, please disable IXGBE_FDIR in
  sys/conf/options and sys/modules/ixgbe/Makefile. It's buggy and it caused
  a lot of issues.

* It sounds like some extra latency is happening, so I'd fiddle around with
  the interrupt settings. By default the driver does something called
  adaptive interrupt moderation and it may be getting in the way of what
  you're trying to do. There's a way to disable AIM in /boot/loader.conf and
  manually set the interrupt rate.

* As others have said, TSO has been a bit of a problem - hps has been
  working on solidifying the TSO configuration side of things so NICs
  advertise to the stack what their maximum offload capability is, so things
  like NFS and TCP don't exceed the segment count. I don't know if it's
  tunable without hacking the driver, but maybe hack the driver to reduce
  the count a little to make sure you're not overflowing things and causing
  it to fall back to a slower path (where it copies all the mbufs into a
  single larger one to send to the NIC.)

* Disable software LRO and see if it helps. Since you're doing lots of
  little non-streaming operations, it may actually be hindering.

HTH,

-adrian
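Roughly what Adrian's loader.conf and interface suggestions look like in
practice. The hw.ix.* names below are assumptions based on later ixgbe(4)
documentation and may be spelled differently (or not exist) on older driver
versions, so verify with "sysctl -d" first; the ifconfig flags are standard:

    # /boot/loader.conf: disable adaptive interrupt moderation and pin the
    # interrupt rate to a fixed value (tunable names are assumptions)
    hw.ix.enable_aim="0"
    hw.ix.max_interrupt_rate="8000"

    # at runtime: turn off TSO and software LRO on the interface and re-test
    ifconfig ix0 -tso -lro

    # turn them back on afterwards
    ifconfig ix0 tso lro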
> On Aug 24, 2015, at 3:25 PM, Rick Macklem <rmacklem at uoguelph.ca> wrote:
>
> Daniel Braniss wrote:
>>
>>> On 24 Aug 2015, at 10:22, Hans Petter Selasky <hps at selasky.org> wrote:
>>>
>>> On 08/24/15 01:02, Rick Macklem wrote:
>>>> The other thing is the degradation seems to cut the rate by about half
>>>> each time.
>>>> 300-->150-->70 I have no idea if this helps to explain it.
>>>
>>> Might be a NUMA binding issue for the processes involved.
>>>
>>> man cpuset
>>>
>>> --HPS
>>
>> I can't see how this is relevant, given that the same host, using the
>> mellanox/mlxen, behaves much better.
>
> Well, the "ix" driver has a bunch of tunables for things like "number of
> queues", and although I'll admit I don't understand how these queues are
> used, I think they are related to CPUs and their caches. There is also
> something called IXGBE_FDIR, which others have recommended be disabled.
> (The code is #ifdef IXGBE_FDIR, but I don't know if it is defined for your
> kernel?) There are also tunables for the interrupt rate and something
> called hw.ixgbe_tx_process_limit, which appears to limit the number of
> packets to send, or something like that?
> (I suspect Hans would understand this stuff much better than I do, since I
> don't understand it at all. ;-)

but how does this explain the fact that, at the same time, the throughput to
the NetApp is about 70MB/s while to a FreeBSD server it's above 150MB/s?
(window size negotiation?)
switching off TSO evens out this diff.

> At a glance, the mellanox driver looks very different.
>
>> I'm getting different results with the intel/ix depending on who is the
>> nfs server
>>
> Who knows until you figure out what is actually going on. It could just be
> the timing of handling the write RPCs, or when the different servers send
> acks for the TCP segments, or ... that causes this for one server and not
> another.
>
> One of the principles used when investigating airplane accidents is "never
> assume anything" and just try to collect the facts until the pieces of the
> puzzle fall into place. I think the same principle works for this kind of
> stuff.
> I once had a case where a specific read of one NFS file would fail on
> certain machines. I won't bore you with the details, but after weeks we got
> to the point where we had a lab of identical machines (exactly the same
> hardware and exactly the same software loaded on them) and we could
> reproduce this problem on about half the machines and not the other half.
> We (myself and the guy I worked with) finally noticed that the failing
> machines were on network ports of a given switch. We moved the net cables
> to another switch and the problem went away.
> --> This particular network switch was broken in such a way that it would
>     garble one specific packet consistently, but worked fine for everything
>     else.
> My point here is that, if someone had suggested "the network switch might
> be broken" at the beginning of the investigation, I would probably have
> dismissed it, based on "the network is working just fine", but in the end,
> that was the problem.
> --> I am not suggesting you have a broken network switch, just "don't take
>     anything off the table until you know what is actually going on".
>
> And to be honest, you may never know, but it is fun to try and solve these
> puzzles.

one needs to find the clues ...
at the moment: when things go bad, they stay bad (ix/nfs/tcp/tso and NetApp);
when things are ok, the numbers fluctuate, which is probably due to loads on
the system, but they are far above the 70MB/s (100 to 200).

> Beyond what I already suggested, I'd look at the "ix" driver's stats and
> tunables and see if any of the tunables has an effect. (And, yes, it will
> take time to work through these.)
>
> Good luck with it, rick
>
>>
>> danny
>>
>> _______________________________________________
>> freebsd-stable at freebsd.org mailing list
>> https://lists.freebsd.org/mailman/listinfo/freebsd-stable
>> To unsubscribe, send any mail to "freebsd-stable-unsubscribe at freebsd.org"
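A crude way to reproduce the comparison Danny describes, i.e. the same NFS
write stream with and without TSO on the ix interface (/mnt/nfs is a
stand-in for whatever NFS mount is being tested, and dd is only a rough
benchmark, but the 70 vs. 150 MB/s gap is large enough to show up):

    ifconfig ix0 -tso
    dd if=/dev/zero of=/mnt/nfs/tso-off.bin bs=1m count=4096

    ifconfig ix0 tso
    dd if=/dev/zero of=/mnt/nfs/tso-on.bin bs=1m count=4096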
Hi,

I've now MFC'ed r287775 to 10-stable and 9-stable. I hope this will resolve
the issues with m_defrag() being called on too long mbuf chains due to an
off-by-one in the driver TSO parameters, and that it will be easier to
maintain these parameters in the future.

Some comments were made that we might want to have an option to select
whether the IP header should be counted or not. Certain network drivers
require copying of the whole ETH/TCP/IP header into a separate memory area,
and can then handle one more data payload mbuf for TSO. Others require
DMA-ing of the whole mbuf TSO chain.

I think it is acceptable to have one TX-DMA segment slot free in the case of
2K mbuf clusters being used for TSO. From my experience the limitation
typically kicks in when 2K mbuf clusters are used for TSO instead of 4K mbuf
clusters: 65536 / 4096 = 16, whereas 65536 / 2048 = 32. If an ethernet
hardware driver has a limitation of 24 data segments (mlxen), and assuming
that each mbuf represents a single segment, then if the majority of mbufs
being transmitted are 2K clusters we may have a small, 1/24 = 4.2%, loss of
TX capability per TSO packet.

From what I've seen using iperf, which in turn calls m_uiotombuf(), which in
turn calls m_getm2(), MJUMPAGESIZE'ed mbuf clusters are preferred for large
data transfers, so this issue might only happen in the case of NODELAY being
used on the socket and if the writes are small from the application's point
of view. If an application is writing small amounts of data per send() system
call, it is expected to degrade the system performance. Please file a PR if
it becomes an issue.

Someone asked me to MFC r287775 to 10.X release as well. Is this still
required?

--HPS
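For reference, the arithmetic behind Hans Petter's numbers, spelled out
(this assumes the usual 65536-byte TSO payload limit and the 24-segment
mlxen limit he mentions):

    # payload mbufs needed to build one full-sized TSO burst
    echo $((65536 / 4096))   # 16 when 4K (MJUMPAGESIZE) clusters are used
    echo $((65536 / 2048))   # 32 when 2K clusters are used

    # keeping one of mlxen's 24 TX-DMA segment slots free costs roughly
    # 1/24 of the TX capability per TSO packet, the small loss he refers to
    echo "scale=3; 100 / 24" | bc   # ~4.166 percent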