Hi,

Has anybody done any performance comparison between Lustre with 10GbE and Lustre with Infiniband 4X SDR? I wonder if they perform similarly.

Thanks,

Jeffrey A. Bennett
HPC Data Engineer
San Diego Supercomputer Center
http://users.sdsc.edu/~jab
On Wed, 2009-02-11 at 11:08 -0800, Jeffrey Bennett wrote:
> Hi,
>
> Has anybody done any performance comparison between Lustre with 10GbE and Lustre with Infiniband 4X SDR? I wonder if they perform similarly.

While I don't have any performance numbers or experience for you, I will mention the differences in the way Lustre uses those two technologies.

On 10GbE, Lustre (via its sock LND) will use the TCP/IP stack on top of the Ethernet stack. With Infiniband, we communicate directly with the IB stack (via the o2ib LND) and take direct advantage of its RDMA capabilities to achieve a very high percentage of wire speed.

My gut feeling is that the overhead of TCP/IP carves some percentage out of your ability to achieve full wire speed.

Maybe some others here, including our benchmarking folks here at Sun, can provide some real-world experiences and comparisons.

b.
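For concreteness, a minimal sketch of the LNET module options that select one LND or the other on a Lustre node of this era; the interface names (eth0, ib0) and the exact config file location are assumptions and will vary by site:

    # /etc/modprobe.conf (location is distro-dependent; interface names are assumptions)
    # 10GbE node, socket LND (ksocklnd) over TCP/IP:
    options lnet networks=tcp0(eth0)

    # InfiniBand node, OpenFabrics LND (ko2iblnd) using RDMA:
    options lnet networks=o2ib0(ib0)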
On Feb 11, 2009, at 2:25 PM, Brian J. Murrell wrote:
> On Wed, 2009-02-11 at 11:08 -0800, Jeffrey Bennett wrote:
>> Hi,
>>
>> Has anybody done any performance comparison between Lustre with 10GbE and Lustre with Infiniband 4X SDR? I wonder if they perform similarly.
>
> While I don't have any performance numbers or experience for you, I will mention the differences in the way Lustre uses those two technologies.
>
> On 10GbE, Lustre (via its sock LND) will use the TCP/IP stack on top of the Ethernet stack. With Infiniband, we communicate directly with the IB stack (via the o2ib LND) and take direct advantage of its RDMA capabilities to achieve a very high percentage of wire speed.
>
> My gut feeling is that the overhead of TCP/IP carves some percentage out of your ability to achieve full wire speed.
>
> Maybe some others here, including our benchmarking folks here at Sun, can provide some real-world experiences and comparisons.
>
> b.

Jeffrey,

To add to Brian's comments, IB 4X SDR is limited to about 700-750 MB/s by the fabric. O2IBLND cannot go faster than the minimum of what the fabric or the PCI-E connection allows.

SOCKLND is limited by a copy on the receive side. When a client writes, the server has to copy the data out. When a client reads, the client has to copy the data out. Because of this, from a server's point of view, multiple-client read performance can scale with the number of clients (the server is sending with zero-copy to multiple clients) and can reach line rate.

I did some tests a couple of years ago with SOCKLND and our NICs:

http://wiki.lustre.org/index.php?title=Myri-10G_Ethernet

It shows a single server with 1 and 3 clients reading and writing. When 3 clients read, it got very close to line rate.

Indiana University won the SC07 Bandwidth Challenge using Lustre over the wide area. They used SOCKLND with Myricom NICs and top-of-the-line DDN storage. They saturated a 10 Gb/s link (sending and receiving simultaneously), but I think it took a couple of DDN systems and corresponding OSSes.

If your storage cannot exceed 700-750 MB/s, then either should work for you.

Scott
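A quick back-of-envelope reading of Scott's last point, with illustrative numbers that are assumptions rather than measurements: delivered throughput is bounded by the slowest stage in the path, so a backend slower than either fabric makes the choice between them largely moot.

    # Illustrative Python sketch: end-to-end rate is capped by the slowest stage.
    def expected_throughput_MBps(storage_MBps, network_MBps):
        return min(storage_MBps, network_MBps)

    print(expected_throughput_MBps(600, 750))    # 600 -> storage-bound on either fabric
    print(expected_throughput_MBps(1200, 750))   # 750 -> the fabric becomes the limit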
On Feb 11, 2009, at 4:35 PM, Scott Atchley wrote:
> To add to Brian's comments, IB 4X SDR is limited to about 700-750 MB/s by the fabric. O2IBLND cannot go faster than the minimum of what the fabric or the PCI-E connection allows.

Hmmm. I can agree with the second part of that statement but I question the first. We've measured much closer to the 1 GByte/sec wire rate of IB using several different tools. 750 MBytes/sec corresponds to roughly 6 Gbits/sec. You lose 2 of the 10 Gbits to encoding (8b/10b), so line rate is really 8 Gbits/sec, or 1 GByte/sec. Yes, you'll lose some more to protocol and switching overhead, but it is not anywhere near an additional 2 Gbits/sec - in our experience.

Just ran a quick IMB (formerly Pallas) between a couple of our SDR nodes and got 860 MBytes/sec (ping-pong, 4 MB). So I don't think there is anything inherent in SDR IB that limits you to 750 MBytes/sec. However, running IPoIB will probably limit you to something even less than that, which is why you should use the O2IBLND if you want the real benefit of IB.

Just our experience,

Charlie Taylor
UF HPC Center
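Charlie's arithmetic, written out as a quick sanity check in Python (the figures are the ones quoted above, not new measurements):

    # SDR 4X signals at 10 Gbit/s; 8b/10b encoding leaves 8 Gbit/s of data.
    signaling_gbit = 10.0
    data_gbit = signaling_gbit * 8 / 10      # 8.0 Gbit/s
    wire_rate_MB = data_gbit * 1000 / 8      # 1000 MB/s, i.e. ~1 GByte/s

    # 750 MByte/s expressed back in Gbit/s, for comparison with the wire rate:
    claimed_gbit = 750.0 * 8 / 1000          # 6.0 Gbit/s

    print(wire_rate_MB, claimed_gbit)        # 1000.0 6.0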
Charles Taylor wrote:
> On Feb 11, 2009, at 4:35 PM, Scott Atchley wrote:
>
>> To add to Brian's comments, IB 4X SDR is limited to about 700-750 MB/s by the fabric. O2IBLND cannot go faster than the minimum of what the fabric or the PCI-E connection allows.
>
> Hmmm. I can agree with the second part of that statement but I question the first. We've measured much closer to the 1 GByte/sec wire rate of IB using several different tools. 750 MBytes/sec corresponds to roughly 6 Gbits/sec. You lose 2 of the 10 Gbits to encoding (8b/10b), so line rate is really 8 Gbits/sec, or 1 GByte/sec. Yes, you'll lose some more to protocol and switching overhead, but it is not anywhere near an additional 2 Gbits/sec - in our experience.

Correct. Infinipath SDR was getting ~980 MB/s, and DDR HCAs in SDR mode can also do quite well in an x8 PCIe slot.

The PCI-X HCAs were limited to around 850 MB/s by the bus, and PCIe HCAs _are_ likewise limited to around 700-750 MB/s -- but only in a PCIe x4 slot.

DDR IB (unless using a PCIe gen2 ConnectX card, or an x16 Infinipath card) is also limited to around 1450-1600 MB/s by the PCIe x8 bus, with a wire speed of 2000 MB/s.

QDR IB, in a Gen2 x8 PCIe slot, is also going to be limited to well below the 4000 MB/s line rate (expect around twice the bandwidth of the gen1 PCIe slots).

The IB headers are very small compared to a 2KB or 4KB packet size, but the PCIe headers (and e.g. flow-control overhead) are quite large compared to a typical 256B packet size.

To clarify one point: IB advertises the "signaling" rate, so the 10Gb includes the overhead bits, as 8 bits are encoded in a 10-bit representation for transmission. So 10 Gb/s = 1 GB/s, with 10-bit bytes. Ethernet, on the other hand, always advertises the "data" rate, so 10Gb Ethernet is 1.25 GB/s (12.5 Gb/s signaling rate), as there are 8 bits in a byte. Ethernet packet headers are also effectively a bit larger than for IB (with IFG, preamble, etc.).

Kevin

> Just ran a quick IMB (formerly Pallas) between a couple of our SDR nodes and got 860 MBytes/sec (ping-pong, 4 MB). So I don't think there is anything inherent in SDR IB that limits you to 750 MBytes/sec. However, running IPoIB will probably limit you to something even less than that, which is why you should use the O2IBLND if you want the real benefit of IB.
>
> Just our experience,
>
> Charlie Taylor
> UF HPC Center
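Kevin's point about advertised rates, restated as a small worked comparison (payload figures before protocol headers; nothing here beyond the numbers already given above):

    # IB quotes the signaling rate: 10 Gbit/s SDR carries 8 Gbit/s of data after 8b/10b.
    ib_sdr_payload_GB = 10.0 * 8 / 10 / 8    # 1.0 GB/s

    # Ethernet quotes the data rate: 10GbE carries 10 Gbit/s of payload bits.
    tengbe_payload_GB = 10.0 / 8             # 1.25 GB/s

    print(ib_sdr_payload_GB, tengbe_payload_GB)   # 1.0 1.25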
One more clarification: those IB numbers were for MPI, not Lustre.

Kevin
On Wed, Feb 11, 2009 at 06:11:30PM -0500, Charles Taylor wrote:
> ......
> Just ran a quick IMB (formerly Pallas) between a couple of our SDR nodes and got 860 MBytes/sec (ping-pong, 4 MB). So I don't think there is anything inherent in SDR IB that limits you to 750 MBytes/sec. However, running IPoIB will probably limit you to something even less than that, which is why you should use the O2IBLND if you want the real benefit of IB.

Yes, the last time I checked, IPoIB didn't make use of RDMA at all.

Isaac
On Wed, Feb 11, 2009 at 04:35:47PM -0500, Scott Atchley wrote:
> ......
> SOCKLND is limited by a copy on the receive side. When a client writes, the server has to copy the data out. When a client reads, it
> ......

One exception is SOCKLND on Chelsio's T3, quote:

"The T3 ASIC uses the mechanism of Direct Data Placement (DDP) that provides a flexible zero copy on receive capability for regular TCP connections, requiring no changes to the sender, the wire protocol, or the socket API on sending or the receiving side."

I remember that a small SOCKLND fix landed recently to make use of this zero-copy receive capability.

Isaac
On Feb 12, 2009, at 12:29 AM, Isaac Huang wrote:
> On Wed, Feb 11, 2009 at 04:35:47PM -0500, Scott Atchley wrote:
>> ......
>> SOCKLND is limited by a copy on the receive side. When a client writes, the server has to copy the data out. When a client reads, it
>> ......
>
> One exception is SOCKLND on Chelsio's T3, quote:
>
> "The T3 ASIC uses the mechanism of Direct Data Placement (DDP) that provides a flexible zero copy on receive capability for regular TCP connections, requiring no changes to the sender, the wire protocol, or the socket API on sending or the receiving side."
>
> I remember that a small SOCKLND fix landed recently to make use of this zero-copy receive capability.
>
> Isaac

Interesting. Is this code available yet?

Scott
Thanks everyone! I get so confused by IB performance claims. :-)

Scott
On Thu, Feb 12, 2009 at 08:26:09AM -0500, Scott Atchley wrote:
>> ......
>> One exception is SOCKLND on Chelsio's T3, quote:
>>
>> "The T3 ASIC uses the mechanism of Direct Data Placement (DDP) that provides a flexible zero copy on receive capability for regular TCP connections, requiring no changes to the sender, the wire protocol, or the socket API on sending or the receiving side."
>>
>> I remember that a small SOCKLND fix landed recently to make use of this zero-copy receive capability.
>>
>> Isaac
>
> Interesting. Is this code available yet?

You can find everything at:

https://bugzilla.lustre.org/show_bug.cgi?id=15093

Isaac
You might find this interesting:

http://www.cse.ohio-state.edu/~panda/temp/ib_10ge_advanced.pdf

Isaac

On Wed, Feb 11, 2009 at 2:08 PM, Jeffrey Bennett <jab at sdsc.edu> wrote:
> Hi,
>
> Has anybody done any performance comparison between Lustre with 10GbE and Lustre with Infiniband 4X SDR? I wonder if they perform similarly.
>
> Thanks,
>
> Jeffrey A. Bennett
> HPC Data Engineer
> San Diego Supercomputer Center
> http://users.sdsc.edu/~jab