I've put together a small test Lustre system which is giving confusing
(at least to me) results. All nodes are running fully patched 64-bit
RHEL 5.3 with the premade Lustre 1.8.1 x86_64 RPMs.

The nodes are a bit cobbled together from what I had handy.

One MDS:  8-core 2.5 GHz Nehalem, 8GB RAM, single E1000 gigabit NIC.
          MDT is just a partition on a 1TB SAS Seagate.
Two OSS:  Dual-core 2.8GHz Xeon, 4GB RAM, single E1000 gigabit NIC.
          Dual 3ware 9550SX cards with 7+1 RAID 5 across 400GB WD SATA drives.
          Two OSTs per OSS of 2TB each, configured as LVM; 1 and 4MB
          stripe sizes tried.
Client:   8-core 2.5 GHz Xeon, 8GB RAM, single Broadcom gigabit NIC.
Network:  Dedicated Cisco 2960G gigabit switch.

This gives 2 OSSes and 4 OSTs of 2TB each, for a total of 8TB. I've tried
1MB and 4MB stripes.

Using Bonnie++ 1.03b (-f -s24g) from the client I see decent numbers when
reading/writing to any single OST (94 and 112 MB/s write/read). I see
slightly better numbers using 2 OSTs on the same OSS (98 and 115 MB/s
write/read).

When I use any 2 OSTs across two OSSes, or all 4 OSTs, I see a distinct
fall-off in read rates. In that case I get full 115MB/s writes but only
40MB/s reads. This holds true for striping that uses any combination of
OSTs spanning both OSSes.

All the data rates are about what I'd expect given the subsystems and
gigabit ethernet, but those very slow reads confuse me. I expected
slightly slower reads (say 80-90 MB/s) due to buffer issues, but not 40.

With iostat I see relatively sustained read rates on each OST's volume,
as opposed to full reads, wait, full reads, wait, which seems to imply
the client is the one setting the pace. But I'm confused why the client
is so slow reassembling two streams from two OSSes and not two streams
from one OSS.

I've tried 1MB and 4MB stripe sizes, I've tried increasing the RX ring on
the OSSes to 4096, and I've tried disabling checksums. Not surprisingly,
nothing had any effect, since each OSS can easily handle the client
requests on its own. I have *not* applied the patches that address the
potential corruption issue in 1.8.x; I saw no evidence they applied in
this case.

I've searched through this list but haven't seen anything that seems
equivalent. I feel I must have missed something simple on the client side
but am at my wits' end what that is.

Thanks in advance for any insight as to what I'm missing.

James Robnett
NRAO/NM
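[For context, the invocation behind these numbers would look roughly like
the sketch below. The mount point and directory are made up; the only
Bonnie++ options actually mentioned above are -f and -s.]

    # Hypothetical client mount point; 24 GB test size so the file is well
    # beyond the client's 8 GB of RAM, -f skips the slow per-character tests.
    mkdir -p /mnt/lustre/bonnie
    bonnie++ -d /mnt/lustre/bonnie -s 24g -f
    # (add -u <user> if running as root; bonnie++ refuses to run as root
    #  without it)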
After reading through my first post I felt some clarification was
probably warranted.

In this test setup there are two OSSes, call them OSS-1 and OSS-2;
each has two OSTs, call them OSS-1-A, OSS-1-B and OSS-2-A, OSS-2-B.

The MDS, OSSes and client all have 1Gbit ethernet connections.

The following table shows the data rates I see in MB/s.

   OST(s)                               Read    Write
   OSS-1-A                               113     95
   OSS-1-B                               112     93
   OSS-1-A OSS-1-B                       112     98
   OSS-2-A                               105     93
   OSS-2-B                               115     94
   OSS-2-A OSS-2-B                       115     98
   OSS-1-B OSS-2-A                  --->  42     113
   OSS-1-A OSS-2-B                  --->  42     114
   OSS-1-A OSS-1-B OSS-2-A OSS-2-B  --->  46     114

The write numbers are almost exactly what I'd expect across 1Gbit:
96MB/s or so between the client and a single OSS, and nearly full rate
(112MB/s) with two OSSes. The 113MB/s read numbers for a single OSS (one
or more OSTs) are also pretty much exactly what I'd expect.

It's the 40MB/s reads when utilizing 2 OSSes that are throwing me. I can
envision that there would be more re-assembly overhead on the client in
the case of 2 OSSes(1), but I'm surprised it's that high.

Is this an expected result?

If it's unexpected, is there a common misconfiguration or client
shortcoming that causes it to be slower when reading from multiple
OSSes?

Is there some command I could run or data I could provide that would
help identify the issue? I'm fairly new to Lustre so I'm just as likely
to flood noise as signal if I just randomly appended data beyond raw
rates.

I just upgraded to 1.8.1.1, which had no effect.

James Robnett
NRAO/NM

1) I'm assuming in the case of a single OSS with 2 OSTs the OSS presents
the client with a single stream. If assembly of two data streams is
required on the client in both the single and dual OSS (both with 2
OSTs) cases then I'm even more confused about those results.

James Robnett wrote:
> The nodes are a bit cobbled together from what I had handy.
>
> One MDS: Dual quad-core 2.5GHz Nehalem, 8GB RAM, E1000 gigabit NIC
>          MDT is just a partition on a 1TB SAS Seagate
> Two OSS: Single dual-core 2.8GHz Xeon, 4GB RAM, single gigabit NIC
>          Dual 3ware 9550SX cards with 7+1 RAID 5 across 400GB WD SATA
>          drives.
>          Two OST/OSS: 2TB. Configured as LVM. 1 and 4MB stripe size tried.
> Client:  Dual quad-core 2.5 GHz Xeon, 8GB RAM, single gigabit NIC
> Network: Dedicated Cisco 2960G gigabit switch
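[The per-combination tests in the table can be reproduced by pinning the
stripe layout of a directory before writing the test file. A sketch only,
assuming OST indices 0-3 and the Lustre 1.8 option syntax for lfs
setstripe; with the 1.8 tools you pick a starting index and a stripe
count rather than an arbitrary OST list, so which OSS pairs you can hit
depends on how the indices map onto the OSSes.]

    # Single OST (index 0), 4 MB stripe size:
    lfs setstripe -s 4M -i 0 -c 1 /mnt/lustre/ost0-only

    # Two OSTs starting at index 1; this spans both OSSes if indices 0,1
    # live on one OSS and 2,3 on the other:
    lfs setstripe -s 4M -i 1 -c 2 /mnt/lustre/two-oss

    # Confirm which OST objects a test file actually landed on:
    lfs getstripe /mnt/lustre/two-oss/testfile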
On 14-Oct-09, at 14:15, James Robnett wrote:
> After reading through my first post I felt some clarification was
> probably warranted.
>
> In this test setup there are two OSSes, call them OSS-1 and OSS-2;
> each has two OSTs, call them OSS-1-A, OSS-1-B and OSS-2-A, OSS-2-B.
>
> The MDS, OSSes and client all have 1Gbit ethernet connections.
>
> The following table shows the data rates I see in MB/s.
>
>    OST(s)                               Read    Write
>    OSS-1-A                               113     95
>    OSS-1-B                               112     93
>    OSS-1-A OSS-1-B                       112     98
>    OSS-2-A                               105     93
>    OSS-2-B                               115     94
>    OSS-2-A OSS-2-B                       115     98
>    OSS-1-B OSS-2-A                  --->  42     113
>    OSS-1-A OSS-2-B                  --->  42     114
>    OSS-1-A OSS-1-B OSS-2-A OSS-2-B  --->  46     114

You're sure that there isn't some other strange effect here, like you
are only measuring the speed of a single iozone thread or similar?

> I can envision that there would be more re-assembly overhead on
> the client in the case of 2 OSSes(1) but I'm surprised it's that high.
>
> Is this an expected result ?
>
> If it's unexpected is there a common misconfiguration or client
> shortcoming that causes it to be slower when reading from multiple
> OSSes?

This is definitely NOT expected, and I'm puzzled as to why this might
be.

> Is there some command I could run or data I could provide that would
> help identify the issue ? I'm fairly new to Lustre so I'm just as
> likely to flood noise as signal if I just randomly appended data
> beyond raw rates.

You could check /proc/fs/lustre/obdfilter/*/brw_stats on the respective
OSTs to see if the client is not assembling the RPCs very well for some
reason.

Alternately, it might be that you have configured the disk storage of
OSS-1 and OSS-2 to compete (e.g. different partitions sharing the same
disks).

> 1) I'm assuming in the case of a single OSS with 2 OSTs the OSS
> presents the client with a single stream. If assembly of two data
> streams is required on the client in both the single and dual OSS
> (both with 2 OSTs) cases then I'm even more confused about those
> results.

No, the client needs to assemble the OST objects itself, regardless of
whether the OSTs are on the same OSS or not. The file should be striped
over all of the OSTs involved in the test.

> James Robnett wrote:
>> The nodes are a bit cobbled together from what I had handy.
>>
>> One MDS: Dual quad-core 2.5GHz Nehalem, 8GB RAM, E1000 gigabit NIC
>>          MDT is just a partition on a 1TB SAS Seagate
>> Two OSS: Single dual-core 2.8GHz Xeon, 4GB RAM, single gigabit NIC
>>          Dual 3ware 9550SX cards with 7+1 RAID 5 across 400GB WD SATA
>>          drives.
>>          Two OST/OSS: 2TB. Configured as LVM. 1 and 4MB stripe size tried.
>> Client:  Dual quad-core 2.5 GHz Xeon, 8GB RAM, single gigabit NIC
>> Network: Dedicated Cisco 2960G gigabit switch

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
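[Concretely, the checks being suggested here might look like the
following on a 1.8 system. The /proc paths are the usual ones but should
be treated as assumptions for this particular install.]

    # On each OSS: per-OST bulk-RPC histograms (ideally reads arrive as
    # full 256-page / 1 MB RPCs):
    cat /proc/fs/lustre/obdfilter/*/brw_stats

    # On the client: per-OSC RPC statistics, showing how many pages the
    # client packs into each read RPC and how many RPCs are in flight:
    cat /proc/fs/lustre/osc/*/rpc_stats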
Many thanks for the reply Andreas.

> You're sure that there isn't some other strange effect here, like you
> are only measuring the speed of a single iozone thread or similar?

I'm just looking at the output from Bonnie++ running on the client. I
see corresponding numbers when examining iostat on each OST; the sum of
the iostats from each OST in use matches the Bonnie++ numbers. Can
Bonnie be at fault? I've only been setting the test size. I'll try
iozone to see if it returns similar results.

> This is definitely NOT expected, and I'm puzzled as to why this might
> be.

Considering how 'stock' this should be, i.e. RHEL 5.3 with Sun-provided
RPMs, I must be doing something wrong or more folks would see it, but
I'm dipped if I know what it is. Everything works, no errors, just slow
for multiple OSSes.

> You could check /proc/fs/lustre/obdfilter/*/brw_stats on the
> respective OSTs to see if the client is not assembling the RPCs very
> well for some reason.

I ran two instances of bonnie++: the first used OST0000 and OST0001 on
OSS1, the second used OST0001 on OSS1 and OST0002 on OSS2. I rebooted
between each run to reset the stats.

The contents of /proc/fs/lustre/obdfilter/lustre-OST0001/brw_stats look
essentially identical in both runs, even though the read rate in the
first was 114MB/s and in the second 38MB/s. I've appended the read
portion of both files below. I'm not sure exactly what I should be
looking for in those stats.

I'm also curious how it could be the OST at fault, since 2 OSTs on one
OSS give the expected ~115MB/s read rate but 2 OSTs on two OSSes give
~40MB/s.

> Alternately, it might be that you have configured the disk storage of
> OSS-1 and OSS-2 to compete (e.g. different partitions sharing the same
> disks).

Each OSS has two internal PCI 8-port 3ware 9550SX cards and 16 internal
disks carved into two separate 7+1 RAID 5 groups (one per card). They're
physically distinct where disk storage is concerned.

> No, the client needs to assemble the OST objects itself, regardless of
> whether the OSTs are on the same OSS or not. The file should be striped
> over all of the OSTs involved in the test.

Iostat on each OST confirms the striping. I see and don't see reads on
OSTs where I'd expect as I change the striping. OSTs not in use are
quiescent. OSTs in use show uniform read rates between them, and they
have relatively constant rates per second. No starvation apparent.

It sure seems like some issue on the client not being able to deal with
multiple streams from multiple OSSes, while it can deal just fine with
multiple streams from a single OSS.

I've tried to think of some way the switch could be at fault but haven't
come up with anything. It's a Cisco 2960 gigabit switch, and while it
can block, it shouldn't be in this case. I have no problem obtaining
115MB/s reads and writes as long as I avoid reading across two OSSes.

Again, many thanks for the reply. If nothing else, knowing it really is
wrong will make me keep digging. If you can think of any output I could
show or test I could do to help isolate the problem, I'm all ears.

James Robnett
NRAO/NM

Below is the read portion of brw_stats for OST0001 from the 40MB/s run
(left) and the 115MB/s run (right); I removed the write portion for
clarity.
                           read (40MB/s)      |   read (115MB/s)
pages per bulk r/w       rpcs   %  cum %      |   rpcs   %  cum %
1:                       5003  17    17       |   5256  18    18
2:                         13   0    17       |     23   0    18
4:                         11   0    17       |      1   0    18
8:                         19   0    17       |      1   0    18
16:                        14   0    17       |     11   0    18
32:                        53   0    17       |     18   0    18
64:                        47   0    17       |     11   0    18
128:                       74   0    17       |     35   0    18
256:                    24145  82   100       |  23415  81   100

                           read (40MB/s)      |   read (115MB/s)
discontiguous pages      rpcs   %  cum %      |   rpcs   %  cum %
0:                      29261  99    99       |  28735  99    99
1:                         61   0    99       |     34   0    99
2:                         18   0    99       |      2   0   100
3:                         15   0    99       |      0   0   100
4:                          9   0    99       |      0   0   100
5:                          7   0    99       |      0   0   100
6:                          4   0    99       |      0   0   100
7:                          3   0    99       |      0   0   100
8:                          0   0    99       |      0   0   100
9:                          1   0   100       |      0   0   100
10:                         0   0   100       |      0   0   100
11:                         0   0   100       |      0   0   100
12:                         0   0   100       |      0   0   100
13:                         0   0   100       |

                           read (40MB/s)      |   read (115MB/s)
discontiguous blocks     rpcs   %  cum %      |   rpcs   %  cum %
0:                      29261  99    99       |  28735  99    99
1:                         61   0    99       |     34   0    99
2:                         18   0    99       |      2   0   100
3:                         15   0    99       |      0   0   100
4:                          9   0    99       |      0   0   100
5:                          7   0    99       |      0   0   100
6:                          4   0    99       |      0   0   100
7:                          3   0    99       |      0   0   100
8:                          0   0    99       |      0   0   100
9:                          1   0   100       |      0   0   100
10:                         0   0   100       |      0   0   100
11:                         0   0   100       |      0   0   100
12:                         0   0   100       |      0   0   100
13:                         0   0   100       |

                           read (40MB/s)      |   read (115MB/s)
disk fragmented I/Os      ios   %  cum %      |    ios   %  cum %
0:                          1   0     0       |   5308  18    18
1:                       5084  17    17       |     12   0    18
2:                         44   0    17       |     18   0    18
3:                         46   0    17       |     17   0    18
4:                         38   0    17       |     10   0    18
5:                         31   0    17       |     20   0    18
6:                         30   0    17       |     12   0    18
7:                         29   0    18       |  23353  81    99
8:                      24034  81    99       |     21   0   100
9:                         27   0    99       |      0   0   100
10:                         8   0    99       |      0   0   100
11:                         3   0    99       |      0   0   100
12:                         3   0    99       |      0   0   100
13:                         0   0    99       |
14:                         1   0   100       |

                           read (40MB/s)      |   read (115MB/s)
disk I/Os in flight       ios   %  cum %      |    ios   %  cum %
1:                      15990   8     8       |  14821   7     7
2:                      16817   8    16       |  16105   8    16
3:                      15968   8    24       |  14930   7    23
4:                      15761   7    32       |  14260   7    31
5:                      16390   8    40       |  14644   7    38
6:                      17131   8    49       |  15039   7    46
7:                      17786   8    58       |  15383   7    54
8:                      18551   9    67       |  15887   8    62
9:                       7313   3    71       |   7218   3    66
10:                      7100   3    74       |   7006   3    70
11:                      6755   3    78       |   6824   3    73
12:                      6416   3    81       |   6738   3    77
13:                      5931   2    84       |   6438   3    80
14:                      5386   2    87       |   6209   3    83
15:                      4831   2    89       |   5983   3    86
16:                      4287   2    91       |   5540   2    89
17:                      2146   1    92       |   2314   1    90
18:                      1928   0    93       |   2213   1    92
19:                      1703   0    94       |   2046   1    93
20:                      1531   0    95       |   1911   0    94
21:                      1376   0    96       |   1772   0    95
22:                      1202   0    96       |   1602   0    95
23:                      1011   0    97       |   1398   0    96
24:                       749   0    97       |   1190   0    97
25:                       435   0    97       |    640   0    97
26:                       383   0    98       |    584   0    97
27:                       358   0    98       |    526   0    98
28:                       328   0    98       |    477   0    98
29:                       298   0    98       |    434   0    98
30:                       258   0    98       |    365   0    98
31:                      2559   1   100       |   2224   1   100

                           read (40MB/s)      |   read (115MB/s)
I/O time (1/1000s)        ios   %  cum %      |    ios   %  cum %
1:                       1079   3     3       |    339   1     1
2:                       5565  18    22       |   3228  11    12
4:                       5672  19    41       |   6847  23    36
8:                       2649   9    50       |   4393  15    51
16:                      5967  20    71       |   8461  29    80
32:                      7243  24    95       |   4243  14    95
64:                      1073   3    99       |   1176   4    99
128:                      126   0    99       |     84   0   100
256:                        5   0   100       |      0   0   100
512:                        0   0   100       |      0   0   100

                           read (40MB/s)      |   read (115MB/s)
disk I/O size             ios   %  cum %      |    ios   %  cum %
4K:                      5147   2     2       |   5263   2     2
8K:                        94   0     2       |     28   0     2
16K:                       18   0     2       |     11   0     2
32K:                       45   0     2       |     20   0     2
64K:                       98   0     2       |     48   0     2
128K:                  193276  97   100       | 187351  97   100
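[For repeating this kind of before/after comparison without rebooting,
something like the following should work. The assumption, based on the
usual Lustre /proc convention, is that writing to the stats files zeroes
the counters; nothing here is specific to this install.]

    # On each OSS, clear the bulk-RPC histograms before a run:
    for f in /proc/fs/lustre/obdfilter/*/brw_stats; do echo 0 > $f; done

    # Watch per-device disk throughput on the OSS during the run:
    iostat -x 2

    # Afterwards, save the histogram for later comparison:
    cat /proc/fs/lustre/obdfilter/lustre-OST0001/brw_stats > /tmp/brw_stats.run1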
Hi,

When are we going to see the new updated Lustre 1.8.1.1 support matrix?
http://wiki.lustre.org/index.php/Lustre_Support_Matrix only lists 1.8.1.

Another question: does Lustre support servers running the latest RHEL 5.3
with 1.8.1.1 but clients only running RHEL 5.1? Can the client also run
1.8.1.1? Can this configuration run OFED 1.4.1 or 1.4.2?

TIA
The support matrix has been updated for 1.8.1.1:
http://wiki.lustre.org/index.php/Lustre_Release_Information

Dr. Hung-Sheng Tsao wrote:
> Hi,
>
> When are we going to see the new updated Lustre 1.8.1.1 support matrix?
> http://wiki.lustre.org/index.php/Lustre_Support_Matrix only lists 1.8.1.
>
> Another question: does Lustre support servers running the latest RHEL 5.3
> with 1.8.1.1 but clients only running RHEL 5.1? Can the client also run
> 1.8.1.1? Can this configuration run OFED 1.4.1 or 1.4.2?
>
> TIA
A bit more (likely not useful) info on this odd read performance problem
across multiple OSSes.

We have a different prototype Lustre installation in another group. The
hardware is a bit different, but it is also a RHEL 5.3-based system
installed from RPMs (v1.8). That system also sends traffic through a
Cisco Catalyst 2960 48-port gigabit switch. We see the same odd
performance issue on that Lustre system: reads limited to a single OSS
are client-network limited, while reads that involve more than one OSS
are capped around 40MB/s.

Considering nobody else has reported this, it seems like it must be
either some oddity in our base RHEL 5.3 install or some oddity in that
Cisco switch. The switch has a 32Gbit/s backplane and otherwise seems
perfectly capable of handling traffic from multiple clients at much
faster rates than we're seeing. Nor do I see any evidence on the OSSes
or switch that the switch becomes congested trying to direct multiple
read streams to the client's 1Gbit interface. It's quite possible I'm
looking for the wrong thing there; suggestions welcome. I'm more
inclined to believe it's our base OS install.

It continues to appear that the experienced read rate is driven by the
request rate from the client and not by some reply bottleneck. The
client simply isn't requesting reads at full speed if multiple OSSes
are involved.

There's some evidence that this other Lustre system used to get more
typical read/write rates. That suggests some subsequent RHEL 5.3 patch
is affecting the performance.

BTW, I've now tried bonnie++, iozone and IOR; all give similar results,
so that rules out some bonnie++ pathology.

James Robnett
NRAO/NM

8 OSTs across all 4 OSSes:

[root at casa-dev-13 C]# lfs getstripe /lustre/casa-store/IOR/
OBDS:
0: lustre-OST0000_UUID ACTIVE
1: lustre-OST0001_UUID ACTIVE
2: lustre-OST0002_UUID ACTIVE
3: lustre-OST0003_UUID ACTIVE
4: lustre-OST0004_UUID ACTIVE
5: lustre-OST0005_UUID ACTIVE
6: lustre-OST0006_UUID ACTIVE
7: lustre-OST0007_UUID ACTIVE
/lustre/casa-store/IOR/
stripe_count: -1 stripe_size: 4194304 stripe_offset: 0

Run began: Tue Oct 20 10:39:22 2009
Command line used: ./IOR -r -w -b 64K -t 64K -s 270000 -N 1 -o /lustre/casa-store/IOR/junk
Machine: Linux casa-dev-13

Summary:
        api                 = POSIX
        test filename       = /lustre/casa-store/IOR/junk
        access              = single-shared-file
        ordering in a file  = sequential offsets
        ordering inter file = no tasks offsets
        clients             = 1 (1 per node)
        repetitions         = 1
        xfersize            = 65536 bytes
        blocksize           = 65536 bytes
        aggregate filesize  = 16.48 GiB

Operation  Max (MiB)  Min (MiB)  Mean (MiB)  Std Dev  Max (OPs)  Min (OPs)  Mean (OPs)  Std Dev   Mean (s)
Op grep    #Tasks tPN reps fPP reord reordoff reordrand seed segcnt blksiz xsize aggsize
---------  ---------  ---------  ----------  -------  ---------  ---------  ----------  -------   --------
write        116.01     116.01      116.01     0.00       0.01       0.01        0.01     0.00  145.46395  1 1 1 0 0 1 0 0 270000 65536 65536 17694720000 -1 POSIX EXCEL
read          44.29      44.29       44.29     0.00       0.00       0.00        0.00     0.00  381.01059  1 1 1 0 0 1 0 0 270000 65536 65536 17694720000 -1 POSIX EXCEL

Max Write: 116.01 MiB/sec (121.64 MB/sec)
Max Read:   44.29 MiB/sec (46.44 MB/sec)

2 OSTs (0000 and 0001 on OSS-1, confirmed via iostat on OSS-1):
lfs getstripe /lustre/casa-store/IOR-OST-0-1/
OBDS:
0: lustre-OST0000_UUID ACTIVE
1: lustre-OST0001_UUID ACTIVE
2: lustre-OST0002_UUID ACTIVE
3: lustre-OST0003_UUID ACTIVE
4: lustre-OST0004_UUID ACTIVE
5: lustre-OST0005_UUID ACTIVE
6: lustre-OST0006_UUID ACTIVE
7: lustre-OST0007_UUID ACTIVE
/lustre/casa-store/IOR-OST-0-1/
stripe_count: 2 stripe_size: 4194304 stripe_offset: 0

Run began: Tue Oct 20 10:31:32 2009
Command line used: ./IOR -r -w -b 64K -t 64K -s 270000 -N 1 -o /lustre/casa-store/IOR-OST-0-1/junk
Machine: Linux casa-dev-13

Summary:
        api                 = POSIX
        test filename       = /lustre/casa-store/IOR-OST-0-1/junk
        access              = single-shared-file
        ordering in a file  = sequential offsets
        ordering inter file = no tasks offsets
        clients             = 1 (1 per node)
        repetitions         = 1
        xfersize            = 65536 bytes
        blocksize           = 65536 bytes
        aggregate filesize  = 16.48 GiB

Operation  Max (MiB)  Min (MiB)  Mean (MiB)  Std Dev  Max (OPs)  Min (OPs)  Mean (OPs)  Std Dev   Mean (s)
Op grep    #Tasks tPN reps fPP reord reordoff reordrand seed segcnt blksiz xsize aggsize
---------  ---------  ---------  ----------  -------  ---------  ---------  ----------  -------   --------
write        118.45     118.45      118.45     0.00       0.01       0.01        0.01     0.00  142.46609  1 1 1 0 0 1 0 0 270000 65536 65536 17694720000 -1 POSIX EXCEL
read         117.64     117.64      117.64     0.00       0.01       0.01        0.01     0.00  143.44616  1 1 1 0 0 1 0 0 270000 65536 65536 17694720000 -1 POSIX EXCEL

Max Write: 118.45 MiB/sec (124.20 MB/sec)
Max Read:  117.64 MiB/sec (123.35 MB/sec)
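[Since the symptoms point at the client pacing the read requests, the
client-side readahead and RPC-concurrency tunables are worth checking
alongside these IOR runs. A sketch only; the parameter names follow the
usual Lustre 1.8 layout and should be verified against the installed
version.]

    # Per-OSC concurrent RPC limit (default 8) and client readahead limits:
    lctl get_param osc.*.max_rpcs_in_flight
    lctl get_param llite.*.max_read_ahead_mb
    lctl get_param llite.*.max_read_ahead_per_file_mb

    # Example of raising them temporarily for a test run:
    lctl set_param osc.*.max_rpcs_in_flight=32
    lctl set_param llite.*.max_read_ahead_mb=256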
James Robnett wrote:
> BTW, I've now tried bonnie++, iozone and IOR; all give similar results,
> so that rules out some bonnie++ pathology.

FWIW, I tried plain old dd on one of the Lustre filesystems (striping
across OSTs on multiple OSSes), and the results were similar.

-- Martin
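[For completeness, the kind of dd check referred to here might look like
the following; the file path is illustrative, and the cache drop is just
to make sure the read actually goes over the wire rather than being
served from client memory.]

    # Drop the client page cache first:
    echo 3 > /proc/sys/vm/drop_caches

    # Sequential read of an existing striped file; dd reports the rate:
    dd if=/lustre/casa-store/IOR/junk of=/dev/null bs=1M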
The problem appears to be network congestion control on the OSSes,
triggered by these Cisco 2960 switches' inability to deal with
over-subscription very well. The problem even occurs if the client has a
channel-bonded 2 x 1Gbit interface pair and only two OSSes are involved.
Sadly it was that result that led me to believe the problem was on the
client and not the switch or the OSSes.

I connected the OSSes and client to an el cheapo Allied Telesyn 8-port
1Gbit switch. A client with a single 1Gbit interface and a test with two
OSSes resulted in 116MB/s writes and reads. A second test involving 4
OSSes (each with two OSTs) reverted to 116MB/s writes and 40-ish MB/s
reads, which implies the AT switch is better but there's still a problem.

Looking at the OSSes I discovered some sub-optimal IP stack settings, in
particular:

   net.ipv4.tcp_sack = 0
   net.ipv4.tcp_timestamps = 0

Setting those both to *1* improved the AT switch case to about 78MB/s
reads across 4 OSSes, but that switch doesn't support a 9000 MTU.

Fixing up the OSSes with those IP settings and returning to the original
switch (which does support a 9000 MTU) seems to be the best case:

   Across 4 OSSes w/8 OSTs, 4MB stripe size, 9000 MTU:
      115MB/s writes, 106MB/s reads
   Across 4 OSSes w/8 OSTs, 1MB stripe size, 9000 MTU:
      115MB/s writes, 111MB/s reads

So for now I'd say it's all better, though I'll be suspicious of our
settings till I see a scaled-up version running on a newer switch with
full throughput.

I did some "site:lists.lustre.org <string>" type searches for congestion
and tcp sysctl and came up with very little. Are there best-practice TCP
settings for Lustre in 1Gbit, channel-bonded environments (as opposed to
IB or 10G)? We have our own set here we've empirically settled on.

James Robnett

Here are all the changes we make beyond stock settings:

   net.ipv4.tcp_tw_recycle = 1
   net.ipv4.tcp_fin_timeout = 10
   net.core.rmem_max = 16777216
   net.core.wmem_max = 16777216
   net.ipv4.tcp_rmem = 4096 87380 16777216
   net.ipv4.tcp_wmem = 4096 65536 16777216
   net.ipv4.tcp_no_metrics_save = 1
   net.core.netdev_max_backlog = 3000
   # Added for Lustre
   net.ipv4.tcp_sack = 1
   net.ipv4.tcp_timestamps = 1
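[A small sketch of how these settings would typically be applied and made
persistent on RHEL 5 nodes, with the jumbo-frame change included; the
interface name eth0 is a placeholder.]

    # Apply the two settings that made the difference at runtime:
    sysctl -w net.ipv4.tcp_sack=1
    sysctl -w net.ipv4.tcp_timestamps=1

    # Persist them (together with the buffer tuning above) by adding the
    # lines to /etc/sysctl.conf, then reload:
    sysctl -p

    # Jumbo frames on the OSS and client data interfaces (the switch must
    # also be configured for a 9000-byte MTU):
    ifconfig eth0 mtu 9000
    # To make the MTU persistent on RHEL, add MTU=9000 to
    # /etc/sysconfig/network-scripts/ifcfg-eth0.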