Bartek Krawczyk
2012-Oct-18 06:38 UTC
[Gluster-users] Gluster 3.3.0 on CentOS 6 - GigabitEthernet vs InfiniBand
Hi, we've been assembling a small cluster using a Dell M1000e and a few M620s. We've decided to use GlusterFS as our storage solution. Since our setup has an InfiniBand switch, we want to run GlusterFS over the rdma transport. I've been doing some iozone benchmarking and the results I got are really strange: there's almost no difference between Gigabit Ethernet, InfiniBand IPoIB and InfiniBand RDMA.

To test InfiniBand IPoIB I added peers using the IPs on the ibX interfaces and used the tcp transport for the volumes. To test InfiniBand RDMA I added peers using the same ibX IPs and created a volume with rdma as its only transport. I mounted it with "mount -t glusterfs masterib:/vol4.rdma /home/test".

Please find attached two plots which also include a raw-disk iozone benchmark for comparison. The first plot is the average of all "iozone -a" results for each test; the second plot is the maximum value of the "iozone -a" results for each type of connection. As you can see, raw disk performance isn't the bottleneck.

The GlusterFS 3.3.0 documentation says the rdma transport isn't well supported in the 3.3.0 release. But then why is there almost no difference between InfiniBand IPoIB (10 Gbps) and Gigabit Ethernet (1 Gbps)? I see GlusterFS 3.3.1 was released on the 16th of October; I'll try upgrading, but I don't see any significant changes to RDMA in it.

Regards, and feel free to chime in with your suggestions.

PS. Here are some InfiniBand diagnostic commands showing that the fabric itself should be working correctly:

[root@node01 ~]# ibv_rc_pingpong masterib
  local address:  LID 0x0001, QPN 0x4c004a, PSN 0xd90c3e, GID ::
  remote address: LID 0x0002, QPN 0x64004a, PSN 0x64d15d, GID ::
8192000 bytes in 0.01 seconds = 8733.48 Mbit/sec
1000 iters in 0.01 seconds = 7.50 usec/iter

[root@node01 ~]# ibhosts
Ca      : 0x0002c90300384bc0 ports 2 "node02 mlx4_0"
Ca      : 0x0002c90300385450 ports 2 "master mlx4_0"
Ca      : 0x0002c90300385150 ports 2 "node01 mlx4_0"

[root@node01 ~]# ibv_devinfo
hca_id: mlx4_0
        transport:                      InfiniBand (0)
        fw_ver:                         2.10.2132
        node_guid:                      0002:c903:0038:5150
        sys_image_guid:                 0002:c903:0038:5153
        vendor_id:                      0x02c9
        vendor_part_id:                 4099
        hw_ver:                         0x0
        board_id:                       DEL0A10210018
        phys_port_cnt:                  2
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                2048 (4)
                        active_mtu:             2048 (4)
                        sm_lid:                 4
                        port_lid:               1
                        port_lmc:               0x00
                        link_layer:             InfiniBand

                port:   2
                        state:                  PORT_DOWN (1)
                        max_mtu:                2048 (4)
                        active_mtu:             2048 (4)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             InfiniBand

--
Bartek Krawczyk
network and system administrator

-------------- next part --------------
A non-text attachment was scrubbed...
Name: average.jpg
Type: image/jpeg
Size: 71591 bytes
Desc: not available
URL: <http://supercolony.gluster.org/pipermail/gluster-users/attachments/20121018/8b796715/attachment.jpg>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: max.jpg
Type: image/jpeg
Size: 79196 bytes
Desc: not available
URL: <http://supercolony.gluster.org/pipermail/gluster-users/attachments/20121018/8b796715/attachment-0001.jpg>
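For reference, the three configurations described in the message above come down to roughly the commands below. This is only a sketch: the volume names, brick paths and mount points are illustrative placeholders, not the exact ones from the cluster; only the rdma mount line is taken verbatim from the post, and the peers for the IB tests are assumed to have been probed via their ibX addresses (masterib/node01ib), as the post describes.

# 1) Gigabit Ethernet baseline: tcp transport over the eth addresses.
gluster volume create vol-eth transport tcp master:/bricks/b1 node01:/bricks/b1
gluster volume start vol-eth
mount -t glusterfs master:/vol-eth /mnt/eth

# 2) IPoIB: still tcp transport, but bricks addressed via the ib0 IPs,
#    so the IP traffic rides the InfiniBand link.
gluster volume create vol-ipoib transport tcp masterib:/bricks/b2 node01ib:/bricks/b2
gluster volume start vol-ipoib
mount -t glusterfs masterib:/vol-ipoib /mnt/ipoib

# 3) RDMA: rdma as the only transport; the ".rdma" suffix in the mount
#    spec selects the rdma client transport.
gluster volume create vol4 transport rdma masterib:/bricks/b3 node01ib:/bricks/b3
gluster volume start vol4
mount -t glusterfs masterib:/vol4.rdma /home/test

# The same benchmark was then run on each mount point:
iozone -a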
Bartek Krawczyk
2012-Oct-18 07:48 UTC
[Gluster-users] Gluster 3.3.0 on CentOS 6 - GigabitEthernet vs InfiniBand
On 18 October 2012 08:44, Ling Ho <ling at slac.stanford.edu> wrote:
> When you mount using rdma, try running some network tools like iftop to
> see if traffic is going through your GbE interface.
>
> If your volume is created with both tcp and rdma, my experience is rdma
> does not work under 3.3.0 and it will always fall back to tcp.
>
> However, IPoIB works fine for us. Again, you should check where the
> traffic goes.

I used tcpdump and iftop and confirmed that the IPoIB traffic goes through the ib0 interface, not eth. When I use the rdma transport the traffic shows up on neither eth nor ib0, so I guess it really is using RDMA. I re-ran the "iozone -a" tests and they came out the same.

In addition I did a "dd" test of read and write on the IPoIB- and RDMA-mounted volumes.

IPoIB:
[root@master gluster3]# dd if=/dev/zero of=test bs=100M count=50
50+0 records in
50+0 records out
5242880000 bytes (5.2 GB) copied, 16.997 s, 308 MB/s
[root@master gluster3]# dd if=test of=/dev/null
10240000+0 records in
10240000+0 records out
5242880000 bytes (5.2 GB) copied, 28.4185 s, 184 MB/s

RDMA:
[root@master gluster]# dd if=/dev/zero of=test bs=100M count=50
50+0 records in
50+0 records out
5242880000 bytes (5.2 GB) copied, 70.3636 s, 74.5 MB/s
[root@master gluster]# dd if=test of=/dev/null
10240000+0 records in
10240000+0 records out
5242880000 bytes (5.2 GB) copied, 10.8389 s, 484 MB/s

I did a "sync" between those tests. Funny, isn't it? RDMA is much slower than IPoIB on writes and faster on reads. I think we'll stick to IPoIB until something is fixed in GlusterFS. And still - why are the results so similar to Gigabit Ethernet?

Regards

--
Bartek Krawczyk
network and system administrator
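For anyone repeating the check, verifying where the traffic goes and running the dd tests boils down to roughly the following. The interface and mount-point names match the setup above; the drop_caches step is an extra precaution of my own suggestion, not something mentioned in the post.

# Watch the interfaces while the benchmark runs (separate terminals):
iftop -i eth0        # should stay quiet during the IB tests
iftop -i ib0         # shows the IPoIB traffic; stays quiet for pure RDMA
tcpdump -i ib0 -n    # same check, packet by packet

# Write test on the mounted volume (5 GB of zeros):
cd /home/test
dd if=/dev/zero of=test bs=100M count=50
sync

# Dropping the page cache before the read makes the read figure more
# trustworthy, otherwise part of the file may be served from local RAM:
echo 3 > /proc/sys/vm/drop_caches

# Read test (default 512-byte blocks, hence the 10240000 records):
dd if=test of=/dev/null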
Ivan Dimitrov
2012-Oct-18 09:16 UTC
[Gluster-users] Gluster 3.3.0 on CentOS 6 - GigabitEthernet vs InfiniBand
On 10/18/12 10:48 AM, Bartek Krawczyk wrote:
> On 18 October 2012 08:44, Ling Ho <ling at slac.stanford.edu> wrote:
>> If your volume is created with both tcp and rdma, my experience is rdma
>> does not work under 3.3.0 and it will always fall back to tcp.

I just converted from GbE to InfiniBand and I spent the entire last week cursing about this. Please, devs: make sure we can set the transport type between peers!
Corey Kovacs
2012-Oct-18 13:39 UTC
[Gluster-users] Gluster 3.3.0 on CentOS 6 - GigabitEthernet vs InfiniBand
One problem right away is that 3.3 doesn't support rdma. I know it still builds the rdma packages, but rdma isn't a supported connection method for 3.3; it was set aside so the 3.3 release could make it on time. As I understand it, 3.1 was/is supposed to be the first 3.x version to fully support it. I'd upgrade and test again. That, or downgrade to 3.2.7, which does in fact support it right now. I'm not even sure how you got things mounted using the "rdma" semantics with 3.3.

My experience so far was somewhat disappointing until I found out a few key things about GlusterFS which I'd taken for granted:

1. Stripes are not what you might think. The I/O for a stripe does _not_ fan out as it would on a RAID card. It's an unfortunate use of the term; it only means you can store files larger than the maximum size of a single brick.
2. I/O is done in sync mode, so cache coherency isn't an issue and the integrity of the written data is ensured.
3. The performance of a distributed volume far exceeds that of a striped one for my use. Again, it depends on the size of the bricks.

These are the things which affect my experience, and now that I at least _think_ I understand them, my results make much more sense to me. Generally my throughput maxes out around 800-900 MB/s, which is the limit of my disk storage right now.

As a test I created as large a volume as I could using ramdisks, to see just how much of a limiter my disks actually are, and I was very surprised to see the speed NOT increase across the file system using an rdma target on version 3.2.6. The local I/O reached 1.9 GB/s. So, in addition to my spindle-based limit, I do not believe I am using any more than a single lane of my IB cards (which are FDR (56 Gb) Mellanox). Using the ib test utilities I can indeed max the connection at 6 GB/s, which is really amazing to see, but so far I just can't seem to get GlusterFS to make use of the total bandwidth available.

Also, you might try putting your cards into "Connected Mode" and increasing your MTU to 64k.

Corey
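The runtime version of that connected-mode change is roughly the following sketch; to make it persistent you would put the equivalent into your distribution's ib0 network configuration. The sysfs path and the 65520 MTU are the standard IPoIB values, not something specific to this thread's setup.

# Switch ib0 from datagram to connected mode, then raise the MTU;
# 65520 is the usual maximum for IPoIB in connected mode.
echo connected > /sys/class/net/ib0/mode
ip link set ib0 mtu 65520

# Verify the mode and MTU took effect:
cat /sys/class/net/ib0/mode
ip link show ib0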