Our OSSs with 2x1GB NICs (bonded) appear limited to 1GB worth of write throughput each.

Our setup:
2 OSS serving 1 OST each
Lustre 1.8.5
RHEL 5.4
New Dell M610 blade servers with plenty of CPU and RAM
All SAN fibre connections are at least 4GB

Some notes:
- A direct write (dd) from a single OSS to the OST gets 4GB, the OSS's fibre wire speed.
- A single client will get 2GB of lustre write speed, the client's ethernet wire speed.
- We've tried bond modes 6 and 0 on all systems. With mode 6 we see both NICs on both OSSs receiving data.
- We've tried multiple OSTs per OSS.

But 2 clients writing a file will get 2GB of total bandwidth to the filesystem. We have been unable to isolate any particular resource bottleneck. None of the systems (MDS, OSS, or client) seems to be working very hard.

The 1GB-per-OSS threshold is so consistent that it almost appears to be by design - and hopefully we're missing something obvious.

Any advice?

Thanks.

djm
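[A sketch of the kind of direct write test described above. The mount point is hypothetical; oflag=direct keeps the page cache from inflating the result, and the target should be a scratch filesystem on the same SAN, never a live OST:

    # on the OSS: 16GB sequential write straight to a SAN-backed scratch fs
    dd if=/dev/zero of=/mnt/scratch/testfile bs=1M count=16384 oflag=direct
]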
Balagopal Pillai
2011-Jan-27 13:48 UTC
[Lustre-discuss] 1GB throughput limit on OST (1.8.5)?
I guess you have two gigabit NICs bonded in mode 6 and not two 1GB NICs? (B = bytes, b = bits.) The max aggregate throughput could be about 200MBps out of the 2 bonded NICs. I think mode 0 bonding works only with Cisco EtherChannel or something similar on the switch side. Same with the FC connection - it's 4Gbps (not 4GBps), or about 400-500MBps max throughput.

Maybe you could also check the max read and write capabilities of the RAID controller, not just the network. When testing with dd, some of the data remains as dirty data until it's flushed to disk. I think the default dirty_background_ratio is 10% on RHEL 5, which would be sizable if your OSSs have lots of RAM. There is a chance of lockup of the OSS once it hits the dirty_ratio limit, which is 40% by default. So a bit more aggressive flushing to disk by lowering dirty_background_ratio, and a bit more headroom before it hits dirty_ratio, is generally desirable if your RAID controller can keep up with it.

So with your current setup, I guess you could get a max of 400MBps out of both OSSs if they both have two 1Gb NICs in them. Maybe if you have one of the Dell switches with 4 10Gb ports (their PowerConnect 6248), 10Gb NICs for your OSSs might be a cheaper way to increase the aggregate performance. I think over 1GBps from a client is possible in cases where you use InfiniBand and RDMA to deliver data.
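[A minimal sketch of that VM tuning on RHEL 5. The values here are illustrative, not recommendations; the right numbers depend on how much RAM the OSSs have and how fast the controller can drain dirty pages:

    # flush dirty pages to disk earlier (RHEL 5 defaults are 10 and 40)
    sysctl -w vm.dirty_background_ratio=5
    sysctl -w vm.dirty_ratio=20
    # add both keys to /etc/sysctl.conf to persist across reboots
]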
Sorry - little b all the way around.

We're limited to 1Gb per OST.

djm
Kevin Van Maren
2011-Jan-27 17:16 UTC
[Lustre-discuss] 1GB throughput limit on OST (1.8.5)?
Normally if you are having a problem with write BW, you need to futz with the switch. If you were having problems with read BW, you would need to futz with the server's config (xmit hash policy is the usual culprit).

Are you testing multiple clients to the same server?

Are you using mode 6 because you don't have bonding support in your switch? I normally use 802.3ad mode, assuming your switch supports link aggregation.

I was bonding 2x1Gb links for Lustre back in 2004. That was before BOND_XMIT_POLICY_LAYER34 was in the kernel, so I had to hack the bond xmit hash (with multiple NICs standard, layer2 hashing does not produce a uniform distribution, and can't work if going through a router).

Any one connection (socket, or node/node connection) will use only one gigabit link. While it is possible to use two links using round-robin, that normally only helps for client reads (the server can't choose which link it receives data on; the switch picks that), and it has the serious downside of out-of-order packets on the TCP stream.

[If you want clients to have better bandwidth to a single file, change your default stripe count to 2, so it will hit two different servers.]

Kevin
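[For reference, a sketch of an 802.3ad bond with a layer3+4 hash on RHEL 5. Interface and bond names are illustrative, and the switch ports must be configured as an LACP aggregation group:

    # /etc/modprobe.conf
    alias bond0 bonding
    options bond0 mode=802.3ad xmit_hash_policy=layer3+4 miimon=100

    # /etc/sysconfig/network-scripts/ifcfg-eth0 (and likewise ifcfg-eth1)
    DEVICE=eth0
    MASTER=bond0
    SLAVE=yes
    BOOTPROTO=none
    ONBOOT=yes
]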
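[And a sketch of the stripe-count change from the bracketed note. The directory is hypothetical; on Lustre 1.8, new files created under it inherit the layout:

    # stripe new files across 2 OSTs so a single file can hit both OSSs
    lfs setstripe -c 2 /mnt/lustre/scratch
]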
Appreciate the input.

We've been using mode 6 as I expect it provides the fewest configuration pratfalls. If single-stream throughput becomes our bottleneck, we'll mess with aggregation.

What I can't find is the bottleneck in our current setup. With 4 servers - 2 clients and 2 OSSs - I'd expect 4Gb of aggregate throughput, where each client has a single connection to each OST. Instead we're limited to 2Gb, where each OSS appears limited to 1Gb of I/O.

The strangeness is that iptraf on the OSSs shows traffic through the expected connections (2 x 2), but at only 35%-65% of bandwidth. And a third client writing to the filesystem will briefly increase aggregate throughput, but it quickly settles back to ~2Gb.

djm
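[One way to see how traffic is really distributed across the bond slaves while a test runs. The bond name is illustrative, and sar comes from the sysstat package:

    # per-slave link state and aggregation status
    cat /proc/net/bonding/bond0
    # per-interface throughput, sampled once a second
    sar -n DEV 1
]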