Hi,

Has anybody done any performance comparison between Lustre with 10GbE and Lustre with Infiniband 4X SDR? I wonder if they perform similarly.

Thanks,

Jeffrey A. Bennett
HPC Data Engineer
San Diego Supercomputer Center
http://users.sdsc.edu/~jab
On Wed, 2009-02-11 at 11:08 -0800, Jeffrey Bennett wrote:
> Hi,
>
> Has anybody done any performance comparison between Lustre with 10GbE and Lustre with Infiniband 4X SDR? I wonder if they perform similarly.

While I don't have any performance numbers or experience for you, I will mention the differences in the way Lustre uses those two technologies.

On 10GbE, Lustre (via its sock LND) will use the TCP/IP stack on top of the Ethernet stack. With Infiniband, we communicate directly with the IB stack (via the o2ib LND) and take direct advantage of its RDMA capabilities to achieve a very high percentage of wire speed.

My gut feeling is that the overhead of TCP/IP carves some percentage out of your ability to achieve full wire speed.

Maybe some others here, including our benchmarking folks here at Sun, can provide some real-world experiences and comparisons.

b.
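For concreteness, a minimal sketch of the LNET module options that select one LND or the other on a Lustre node of this era; the interface names (eth0, ib0) and the exact config file location are assumptions and will vary by site:

    # /etc/modprobe.conf (location is distro-dependent; interface names are assumptions)
    # 10GbE node, socket LND (ksocklnd) over TCP/IP:
    options lnet networks=tcp0(eth0)

    # InfiniBand node, OpenFabrics LND (ko2iblnd) using RDMA:
    options lnet networks=o2ib0(ib0)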
On Feb 11, 2009, at 2:25 PM, Brian J. Murrell wrote:
> On Wed, 2009-02-11 at 11:08 -0800, Jeffrey Bennett wrote:
>> Hi,
>>
>> Has anybody done any performance comparison between Lustre with 10GbE and Lustre with Infiniband 4X SDR? I wonder if they perform similarly.
>
> While I don't have any performance numbers or experience for you, I will mention the differences in the way Lustre uses those two technologies.
>
> On 10GbE, Lustre (via its sock LND) will use the TCP/IP stack on top of the Ethernet stack. With Infiniband, we communicate directly with the IB stack (via the o2ib LND) and take direct advantage of its RDMA capabilities to achieve a very high percentage of wire speed.
>
> My gut feeling is that the overhead of TCP/IP carves some percentage out of your ability to achieve full wire speed.
>
> Maybe some others here, including our benchmarking folks here at Sun, can provide some real-world experiences and comparisons.
>
> b.

Jeffrey,

To add to Brian's comments, IB 4X SDR is limited to about 700-750 MB/s by the fabric. O2IBLND cannot go faster than the minimum of what the fabric or the PCI-E connection allows.

SOCKLND is limited by a copy on the receive side. When a client writes, the server has to copy the data out. When a client reads, the client has to copy the data out. Because of this, from a server's point of view, multiple-client read performance can scale with the number of clients (the server is sending with zero-copy to multiple clients) and can reach line rate.

I did some tests a couple of years ago with SOCKLND and our NICs:

http://wiki.lustre.org/index.php?title=Myri-10G_Ethernet

It shows a single server with 1 and 3 clients reading and writing. When 3 clients read, it got very close to line rate.

Indiana University won the SC07 Bandwidth Challenge using Lustre over the wide area. They used SOCKLND with Myricom NICs and top-of-the-line DDN storage. They saturated a 10 Gb/s link (sending and receiving simultaneously), but I think it took a couple of DDN systems and corresponding OSSes.

If your storage cannot exceed 700-750 MB/s, then either should work for you.

Scott
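A quick back-of-envelope reading of Scott's last point, with illustrative numbers that are assumptions rather than measurements: delivered throughput is bounded by the slowest stage in the path, so a backend slower than either fabric makes the choice between them largely moot.

    # Illustrative Python sketch: end-to-end rate is capped by the slowest stage.
    def expected_throughput_MBps(storage_MBps, network_MBps):
        return min(storage_MBps, network_MBps)

    print(expected_throughput_MBps(600, 750))    # 600 -> storage-bound on either fabric
    print(expected_throughput_MBps(1200, 750))   # 750 -> the fabric becomes the limit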
On Feb 11, 2009, at 4:35 PM, Scott Atchley wrote:
> To add to Brian's comments, IB 4X SDR is limited to about 700-750 MB/s by the fabric. O2IBLND cannot go faster than the minimum of what the fabric or the PCI-E connection allows.

Hmmm. I can agree with the second part of that statement but I question the first. We've measured much closer to the 1 GByte/sec wire rate of IB using several different tools. 750 MBytes/sec corresponds to roughly 6 Gbits/sec. You lose 2 of the 10 Gbits to encoding (8b/10b), so line rate is really 8 Gbits/sec, or 1 GByte/sec. Yes, you'll lose some more to protocol and switching overhead, but it is not anywhere near an additional 2 Gbits/sec - in our experience.

Just ran a quick IMB (formerly Pallas) between a couple of our SDR nodes and got 860 MBytes/sec (ping-pong, 4 MB). So I don't think there is anything inherent in SDR IB that limits you to 750 MBytes/sec. However, running IPoIB will probably limit you to something even less than that, which is why you should use the O2IBLND if you want the real benefit of IB.

Just our experience,

Charlie Taylor
UF HPC Center
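Charlie's arithmetic, written out as a quick sanity check in Python (the figures are the ones quoted above, not new measurements):

    # SDR 4X signals at 10 Gbit/s; 8b/10b encoding leaves 8 Gbit/s of data.
    signaling_gbit = 10.0
    data_gbit = signaling_gbit * 8 / 10      # 8.0 Gbit/s
    wire_rate_MB = data_gbit * 1000 / 8      # 1000 MB/s, i.e. ~1 GByte/s

    # 750 MByte/s expressed back in Gbit/s, for comparison with the wire rate:
    claimed_gbit = 750.0 * 8 / 1000          # 6.0 Gbit/s

    print(wire_rate_MB, claimed_gbit)        # 1000.0 6.0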
Charles Taylor wrote:
> On Feb 11, 2009, at 4:35 PM, Scott Atchley wrote:
>
>> To add to Brian's comments, IB 4X SDR is limited to about 700-750 MB/s by the fabric. O2IBLND cannot go faster than the minimum of what the fabric or the PCI-E connection allows.
>
> Hmmm. I can agree with the second part of that statement but I question the first. We've measured much closer to the 1 GByte/sec wire rate of IB using several different tools. 750 MBytes/sec corresponds to roughly 6 Gbits/sec. You lose 2 of the 10 Gbits to encoding (8b/10b), so line rate is really 8 Gbits/sec, or 1 GByte/sec. Yes, you'll lose some more to protocol and switching overhead, but it is not anywhere near an additional 2 Gbits/sec - in our experience.

Correct. Infinipath SDR was getting ~980 MB/s, and DDR HCAs in SDR mode can also do quite well in an x8 PCIe slot.

The PCI-X HCAs were limited to around 850 MB/s by the bus, and PCIe HCAs _are_ likewise limited to around 700-750 MB/s -- but only in a PCIe x4 slot.

DDR IB (unless using a PCIe gen2 ConnectX card, or an x16 Infinipath card) is also limited to around 1450-1600 MB/s by the PCIe x8 bus, with a wire speed of 2000 MB/s.

QDR IB, in a Gen2 x8 PCIe slot, is also going to be limited to well below the 4000 MB/s line rate (expect around twice the bandwidth of the gen1 PCIe slots).

The IB headers are very small compared to a 2KB or 4KB packet size, but the PCIe headers (and e.g. flow-control overhead) are quite large compared to a typical 256B packet size.

To clarify one point: IB advertises the "signaling" rate, so the 10Gb includes the overhead bits, as 8 bits are encoded in a 10-bit representation for transmission. So 10 Gb/s = 1 GB/s, with 10-bit bytes. Ethernet, on the other hand, always advertises the "data" rate, so 10Gb Ethernet is 1.25 GB/s (12.5 Gb/s signaling rate), as there are 8 bits in a byte. Ethernet packet headers are also effectively a bit larger than for IB (with IFG, preamble, etc.).

Kevin

> Just ran a quick IMB (formerly Pallas) between a couple of our SDR nodes and got 860 MBytes/sec (ping-pong, 4 MB). So I don't think there is anything inherent in SDR IB that limits you to 750 MBytes/sec. However, running IPoIB will probably limit you to something even less than that, which is why you should use the O2IBLND if you want the real benefit of IB.
>
> Just our experience,
>
> Charlie Taylor
> UF HPC Center
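Kevin's point about advertised rates, restated as a small worked comparison (payload figures before protocol headers; nothing here beyond the numbers already given above):

    # IB quotes the signaling rate: 10 Gbit/s SDR carries 8 Gbit/s of data after 8b/10b.
    ib_sdr_payload_GB = 10.0 * 8 / 10 / 8    # 1.0 GB/s

    # Ethernet quotes the data rate: 10GbE carries 10 Gbit/s of payload bits.
    tengbe_payload_GB = 10.0 / 8             # 1.25 GB/s

    print(ib_sdr_payload_GB, tengbe_payload_GB)   # 1.0 1.25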
One more clarification: those IB numbers were for MPI, not Lustre.

Kevin
On Wed, Feb 11, 2009 at 06:11:30PM -0500, Charles Taylor wrote:
> ......
> Just ran a quick IMB (formerly Pallas) between a couple of our SDR nodes and got 860 MBytes/sec (ping-pong, 4 MB). So I don't think there is anything inherent in SDR IB that limits you to 750 MBytes/sec. However, running IPoIB will probably limit you to something even less than that, which is why you should use the O2IBLND if you want the real benefit of IB.

Yes, the last time I checked, IPoIB didn't make use of RDMA at all.

Isaac
On Wed, Feb 11, 2009 at 04:35:47PM -0500, Scott Atchley wrote:
> ......
> SOCKLND is limited by a copy on the receive side. When a client writes, the server has to copy the data out. When a client reads, it
> ......

One exception is SOCKLND on Chelsio's T3, quote:

"The T3 ASIC uses the mechanism of Direct Data Placement (DDP) that provides a flexible zero copy on receive capability for regular TCP connections, requiring no changes to the sender, the wire protocol, or the socket API on sending or the receiving side."

I remember that a small SOCKLND fix landed recently to make use of this zero-copy receive capability.

Isaac
On Feb 12, 2009, at 12:29 AM, Isaac Huang wrote:
> On Wed, Feb 11, 2009 at 04:35:47PM -0500, Scott Atchley wrote:
>> ......
>> SOCKLND is limited by a copy on the receive side. When a client writes, the server has to copy the data out. When a client reads, it
>> ......
>
> One exception is SOCKLND on Chelsio's T3, quote:
>
> "The T3 ASIC uses the mechanism of Direct Data Placement (DDP) that provides a flexible zero copy on receive capability for regular TCP connections, requiring no changes to the sender, the wire protocol, or the socket API on sending or the receiving side."
>
> I remember that a small SOCKLND fix landed recently to make use of this zero-copy receive capability.
>
> Isaac

Interesting. Is this code available yet?

Scott
Thanks everyone! I get so confused by IB performance claims. :-)

Scott
On Thu, Feb 12, 2009 at 08:26:09AM -0500, Scott Atchley wrote:
>> ......
>> One exception is SOCKLND on Chelsio's T3, quote:
>>
>> "The T3 ASIC uses the mechanism of Direct Data Placement (DDP) that provides a flexible zero copy on receive capability for regular TCP connections, requiring no changes to the sender, the wire protocol, or the socket API on sending or the receiving side."
>>
>> I remember that a small SOCKLND fix landed recently to make use of this zero-copy receive capability.
>>
>> Isaac
>
> Interesting. Is this code available yet?

You can find everything at:

https://bugzilla.lustre.org/show_bug.cgi?id=15093

Isaac
You might find this interesting:

http://www.cse.ohio-state.edu/~panda/temp/ib_10ge_advanced.pdf

Isaac

On Wed, Feb 11, 2009 at 2:08 PM, Jeffrey Bennett <jab at sdsc.edu> wrote:
> Hi,
>
> Has anybody done any performance comparison between Lustre with 10GbE and Lustre with Infiniband 4X SDR? I wonder if they perform similarly.
>
> Thanks,
>
> Jeffrey A. Bennett
> HPC Data Engineer
> San Diego Supercomputer Center
> http://users.sdsc.edu/~jab