Sandia has made available a very interesting set of graphs with some questions. They study single-file-per-process and shared-file IO on Red Storm.

Most striking is that on the liblustre platform, shared-file IO seems to be 4x slower. It feels like this should be a simple contention issue on the server, but we haven't been able to find it.

A question not mentioned in this email is why this won't scale beyond 12,500 clients. Probably that has a very simple answer too, and something blows up there.

LLNL has shown that in some cases shared-file IO can be faster than file per process, and it is high time we put this puzzle to bed.

I attach the graphs and copy Sandia's questions below.

Peter

What follows are some IOR tests run at Sandia on a 160-OSS/320-OST Lustre file system. The file system had just been reformatted prior to the runs.

The following issues seem to be the key ones:
- the single shared file is a factor of 4-5 too slow; what is the overhead?
- why are reads so slow?
- why is there a significant read dropoff?
- why are two cores so much slower than a single core?

[Attachment: IOtestingResults_2_.pdf (application/pdf, 23636 bytes) -- http://mail.clusterfs.com/pipermail/lustre-devel/attachments/20061121/916e1d5e/IOtestingResults_2_-0001.pdf]
Peter Braam wrote:
> Sandia has made available a very interesting set of graphs with some
> questions. They study single-file-per-process and shared-file IO on
> Red Storm.
>
> Most striking is that on the liblustre platform, shared-file IO seems
> to be 4x slower. It feels like this should be a simple contention
> issue on the server, but we haven't been able to find it.
>
> A question not mentioned in this email is why this won't scale beyond
> 12,500 clients. Probably that has a very simple answer too, and
> something blows up there.
>
> LLNL has shown that in some cases shared-file IO can be faster than
> file per process, and it is high time we put this puzzle to bed.
>
> I attach the graphs and copy Sandia's questions below.
>
> Peter
>
> What follows are some IOR tests run at Sandia on a 160-OSS/320-OST
> Lustre file system. The file system had just been reformatted prior
> to the runs.
>
> The following issues seem to be the key ones:
> - the single shared file is a factor of 4-5 too slow; what is the overhead?
> - why are reads so slow?

Was the server lnet.debug set to 0? We have seen a large improvement in reads on the XT3 with that set to 0. I doubt it would fix such a large difference, but it is one thing to get out of the way.

Nic
Hello!

On Tue, Nov 21, 2006 at 10:24:42AM -0700, Peter Braam wrote:
> Sandia has made available a very interesting set of graphs with some
> questions. They study single-file-per-process and shared-file IO on Red
> Storm.

The test was unfair, see:

With one stripe per file, file per process, 320 OSTs are used for 10k clients; that's roughly 31 clients per OST.

Now with the shared file, the stripe size (count, I presume?) is 157. Since there is only one file, it means only 157 OSTs were in use, or roughly 64 clients per OST.

So for a meaningful comparison we should compare 10k clients file per process with 5k clients shared file. This "only" gives us a 2x difference, which is still better than 4x.

Also the stripe size is not specified; what was it set to?

> The following issues seem to be the key ones:
> - the single shared file is a factor of 4-5 too slow; what is the overhead?

Only 2x, it seems.

> - why are reads so slow?

No proper readahead by the backend disk?

> - why is there a significant read dropoff?

Writes can be cached, nicely aggregated and written to disk in nice linear chunks, while the disk backend cannot do proper readahead for such a seemingly random 1M here, 1M there? Was the write cache enabled?

> - why are two cores so much slower than a single core?

Were the two cores on a single chip counted as a single client on the graph, or as two clients? Probably the latter? Could it be some local bus bottleneck due to increased load on the same bus/network?

Bye,
    Oleg
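To make the load estimate above concrete, here is a small shell sketch of the same arithmetic. The client and OST counts are the ones quoted in the message; the script itself is purely illustrative:

    #!/bin/sh
    # Rough clients-per-OST load for the two configurations discussed above.
    CLIENTS=10000

    FPP_OSTS=320   # file per process: every OST in the filesystem carries objects
    SF_OSTS=157    # single shared file: only the 157 OSTs holding a stripe are used

    echo "file per process: $((CLIENTS / FPP_OSTS)) clients per OST"   # prints 31
    echo "shared file:      $((CLIENTS / SF_OSTS)) clients per OST"    # prints 63 (~64 as rounded above)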
Oleg pointed out that the striping width of 160 can have negative impacts - if the bandwidth per OST is too low, it will limit the aggregate number you can reach.

But there is no reason that has to be the case. OST targets could be RAID0-striped across multiple FC interconnects, giving them a very high bandwidth. In this way, with a stripe width of 160, 2x the current bandwidth can be achieved.

To validate Oleg's claim for reads, namely that they are suboptimal for random 1MB reads - could the OST survey be run against the OBD filter device? This only requires one DDN.

- Peter -
I think we'd very much appreciate it if users provided us with more data from their runs. The most interesting things are:

1) vmstat 1
2) utils/llobdstat.pl <ost name> 1
3) utils/llstat.pl <path to stats file> 1

Note that every service has its own stats file. For example:

[root@tom tests]# find /proc/fs/lustre -name stats
/proc/fs/lustre/ldlm/services/ldlm_canceld/stats
/proc/fs/lustre/ldlm/services/ldlm_cbd/stats
/proc/fs/lustre/llite/fs0/stats
/proc/fs/lustre/mdt/MDT/mds_readpage/stats
/proc/fs/lustre/mdt/MDT/mds_setattr/stats
/proc/fs/lustre/mdt/MDT/mds/stats
/proc/fs/lustre/osc/OSC_tom_OST_tom_MNT_tom/stats
/proc/fs/lustre/osc/OSC_tom_OST_tom_mds1/stats
/proc/fs/lustre/obdfilter/OST_tom/stats
/proc/fs/lustre/ost/OSS/ost_io/stats
/proc/fs/lustre/ost/OSS/ost_create/stats
/proc/fs/lustre/ost/OSS/ost/stats

It seems like we need a silly script to simplify the gathering. Peter, what do you think?

thanks, Alex
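A minimal sketch of such a gathering script, built only from the commands listed above. The install path for the utils and the way OST names are discovered (listing /proc/fs/lustre/obdfilter) are assumptions, so adjust to the local setup:

    #!/bin/sh
    # Sketch: collect vmstat, llobdstat and llstat output on a server node.
    # LUSTRE is an assumed install path; INTERVAL is the sample interval in seconds.
    LUSTRE=${LUSTRE:-/usr/lib/lustre}
    INTERVAL=1
    OUT=stats.$(hostname).$$
    mkdir -p "$OUT"

    vmstat $INTERVAL > "$OUT/vmstat.log" &

    # one llobdstat per local obdfilter (OST) device
    for ost in $(ls /proc/fs/lustre/obdfilter 2>/dev/null); do
        "$LUSTRE/utils/llobdstat.pl" "$ost" $INTERVAL > "$OUT/llobdstat.$ost.log" &
    done

    # one llstat per service stats file
    for f in $(find /proc/fs/lustre -name stats); do
        "$LUSTRE/utils/llstat.pl" "$f" $INTERVAL > "$OUT/llstat.$(echo $f | tr / _).log" &
    done

    wait    # runs until interrupted; one log per service ends up in $OUT/

Started at job launch and killed at job end, something like this would produce the per-service logs that could be attached to reports like the ones in this thread.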
On Tue, 2006-11-21 at 20:48 +0200, Oleg Drokin wrote:
> The test was unfair, see:
> With one stripe per file, file per process, 320 OSTs are used for 10k
> clients; that's roughly 31 clients per OST.
>
> Now with the shared file, the stripe size (count, I presume?) is 157. Since
> there is only one file, it means only 157 OSTs were in use, or roughly 64
> clients per OST.
>
> So for a meaningful comparison we should compare 10k clients file per process
> with 5k clients shared file. This "only" gives us a 2x difference, which is
> still better than 4x.
>
> Also the stripe size is not specified; what was it set to?

Please define "stripe size"?

> > The following issues seem to be the key ones:
> > - the single shared file is a factor of 4-5 too slow; what is the overhead?
>
> Only 2x, it seems.
>
> > - why are reads so slow?
>
> No proper readahead by the backend disk?

MF is set to "on", max-prefetch is set to ("x" and "1")

> > - why is there a significant read dropoff?
>
> Writes can be cached, nicely aggregated and written to disk in nice linear
> chunks, while the disk backend cannot do proper readahead for such a
> seemingly random 1M here, 1M there?
> Was the write cache enabled?

Yes. Not partitioned.

> > - why are two cores so much slower than a single core?
>
> Were the two cores on a single chip counted as a single client on the graph,
> or as two clients? Probably the latter? Could it be some local bus bottleneck
> due to increased load on the same bus/network?

It's counted as 2 clients. The node architecture *is* 2 clients in this scenario. Memory is partitioned, etc. The only thing shared is the NIC. I suppose the HT is shared as well, but it is so much faster than the NIC that it would seem to need an architectural deficiency to figure in here.
The runs can't be repeated at present, but we plan to collect this data when there is an opportunity.

Attached are some tables and plots related to the previous set, which also show the per-OST performance from the same runs as the original four plots. (fpp == file per process, sf == shared file, vn == 2 clients per compute node)

Figure 1: attached files fpp_avg_per_ost.png, fpp_table.txt
Figure 2: attached files fpp_vn_avg_per_ost.png, fpp_vn_table.txt
Figure 3: attached files sf_avg_per_ost.png, sf_table.txt
Figure 4: attached files sf_vn_avg_per_ost.png, sf_vn_table.txt

Thanks
Jim, Ruth

On Wed, 2006-11-22 at 02:04 +0300, Alex Tomas wrote:
> I think we'd very much appreciate it if users provided us with more data
> from their runs. The most interesting things are:
> 1) vmstat 1
> 2) utils/llobdstat.pl <ost name> 1
> 3) utils/llstat.pl <path to stats file> 1
[Attachment: fpp_avg_per_ost.png (image/png, 4512 bytes) -- http://mail.clusterfs.com/pipermail/lustre-devel/attachments/20061122/5bdbf181/fpp_avg_per_ost-0001.png]

fpp_table.txt (reads):

# CLIENTS  STRIPE  MAX OSTS  SAMPLE 1  SAMPLE 2  AVG READ   AVG READ/OST  ERROR
  513      1       320       31442     31930     31686.00   99.02         488.00
  1023     1       320       32332     32571     32451.50   101.41        239.00
  2049     1       320       35720     31865     33792.50   105.60        3855.00
  2559     1       320       32900     31898     32399.00   101.25        1002.00
  3073     1       320       31012     31453     31232.50   97.60         441.00
  3583     1       320       31123     31035     31079.00   97.12         88.00
  4300     1       320       26512     -         26512.00   82.85         0.00
  6300     1       320       25177     -         25177.00   78.68         0.00
  8300     1       320       15549     -         15549.00   48.59         0.00
  10300    1       320       14877     -         14877.00   46.49         0.00

fpp_table.txt (writes):

# CLIENTS  STRIPE  MAX OSTS  SAMPLE 1  SAMPLE 2  AVG WRITE  AVG WRITE/OST  ERROR
  513      1       320       30190     28824     29507.00   92.21          1366.00
  1023     1       320       34263     15097     24680.00   77.12          19166.00
  2049     1       320       41169     46168     43668.50   136.46         4999.00
  2559     1       320       39888     39814     39851.00   124.53         74.00
  3073     1       320       38459     39904     39181.50   122.44         1445.00
  3583     1       320       40635     41507     41071.00   128.35         872.00
  4300     1       320       41298     -         41298.00   129.06         0.00
  6300     1       320       38915     -         38915.00   121.61         0.00
  8300     1       320       38386     -         38386.00   119.96         0.00
  10300    1       320       36556     -         36556.00   114.24         0.00
[Attachment: fpp_vn_avg_per_ost.png (image/png, 4499 bytes) -- http://mail.clusterfs.com/pipermail/lustre-devel/attachments/20061122/5bdbf181/fpp_vn_avg_per_ost-0001.png]

fpp_vn_table.txt (reads):

# CLIENTS  STRIPE  MAX OSTS  SAMPLE 1  AVG READ   AVG READ/OST  ERROR
  513      1       320       22650     22650.00   70.78         0.00
  1023     1       320       25037     25037.00   78.24         0.00
  2049     1       320       30358     30358.00   94.87         0.00
  4095     1       320       28612     28612.00   89.41         0.00
  8191     1       320       19229     19229.00   60.09         0.00
  10300    1       320       12732     12732.00   39.79         0.00

fpp_vn_table.txt (writes):

# CLIENTS  STRIPE  MAX OSTS  SAMPLE 1  AVG WRITE  AVG WRITE/OST  ERROR
  513      1       320       27913     27913.00   87.23          0.00
  1023     1       320       30648     30648.00   95.78          0.00
  2049     1       320       37164     37164.00   116.14         0.00
  4095     1       320       39290     39290.00   122.78         0.00
  8191     1       320       30000     30000.00   93.75          0.00
  10300    1       320       27912     27912.00   87.22          0.00

[Attachment: sf_avg_per_ost.png (image/png, 4029 bytes) -- http://mail.clusterfs.com/pipermail/lustre-devel/attachments/20061122/5bdbf181/sf_avg_per_ost-0001.png]

sf_table.txt (reads):

# CLIENTS  STRIPE  MAX OSTS  SAMPLE 1  AVG READ  AVG READ/OST  ERROR
  2049     157     157       5119      5119.00   32.61         0.00
  4095     157     157       6826      6826.00   43.48         0.00
  8191     157     157       6281      6281.00   40.01         0.00
  10300    157     157       5883      5883.00   37.47         0.00

sf_table.txt (writes):

# CLIENTS  STRIPE  MAX OSTS  SAMPLE 1  AVG WRITE  AVG WRITE/OST  ERROR
  2049     157     157       7474      7474.00    47.61          0.00
  4095     157     157       11122     11122.00   70.84          0.00
  8191     157     157       9289      9289.00    59.17          0.00
  10300    157     157       8407      8407.00    53.55          0.00

[Attachment: sf_vn_avg_per_ost.png (image/png, 3650 bytes) -- http://mail.clusterfs.com/pipermail/lustre-devel/attachments/20061122/5bdbf181/sf_vn_avg_per_ost-0001.png]

sf_vn_table.txt (reads):

# CLIENTS  STRIPE  MAX OSTS  SAMPLE 1  AVG READ  AVG READ/OST  ERROR
  4095     157     157       6653      6653.00   42.38         0.00
  8191     157     157       6157      6157.00   39.22         0.00
  10300    157     157       5822      5822.00   37.08         0.00

sf_vn_table.txt (writes):

# CLIENTS  STRIPE  MAX OSTS  SAMPLE 1  AVG WRITE  AVG WRITE/OST  ERROR
  4095     157     157       11138     11138.00   70.94          0.00
  8191     157     157       8530      8530.00    54.33          0.00
  10300    157     157       7850      7850.00    50.00          0.00
On Tue, 2006-11-21 at 15:34 -0700, Peter Braam wrote:
> Oleg pointed out that the striping width of 160 can have negative
> impacts - if the bandwidth per OST is too low, it will limit the
> aggregate number you can reach.
>
> But there is no reason that has to be the case. OST targets could be
> RAID0-striped across multiple FC interconnects, giving them a very high
> bandwidth. In this way, with a stripe width of 160, 2x the current
> bandwidth can be achieved.
>
> To validate Oleg's claim for reads, namely that they are suboptimal for
> random 1MB reads - could the OST survey be run against the OBD filter
> device? This only requires one DDN.

Does the OST survey run for catamount clients? I'd need some pointers to get going on trying that. We ran obdecho once, but I haven't personally got an obdfilter device going on the xt3 as of yet.

Thanks,
Ruth
On our ddn's I've noticed that during large reads the backend bandwidth of the ddn is pegged (700-800 MB/s) while the total bandwidth being delivered through the fc interfaces is in the low hundreds of MB/s. This leads me to believe that the ddn cache is being thrashed heavily due to overly aggressive read-ahead by many OSTs and by the ddn itself.

I haven't re-run these tests with the 2.6 kernel (cray 1.4XX), so I'm not sure if this phenomenon still exists, but the theory may still have some validity since the ddn cache is shared between all the osts connected to it.

paul

Lee Ward wrote:
> On Tue, 2006-11-21 at 20:48 +0200, Oleg Drokin wrote:
> > The test was unfair, see:
> > With one stripe per file, file per process, 320 OSTs are used for 10k
> > clients; that's roughly 31 clients per OST.
> >
> > Now with the shared file, the stripe size (count, I presume?) is 157. Since
> > there is only one file, it means only 157 OSTs were in use, or roughly 64
> > clients per OST.

There are 2 OSTs per OSS, so I don't think the test is that unfair unless somehow it failed to utilize every fiber channel link available. So if one OST was used on each OSS, then in terms of network and disk bandwidth the tests are equivalent.
pauln wrote:
> On our ddn's I've noticed that during large reads the backend
> bandwidth of the ddn is pegged (700-800 MB/s) while the total
> bandwidth being delivered through the fc interfaces is in the low
> hundreds of MB/s. This leads me to believe that the ddn cache is being
> thrashed heavily due to overly aggressive read-ahead by many OSTs and
> by the ddn itself.
> I haven't re-run these tests with the 2.6 kernel (cray 1.4XX), so I'm
> not sure if this phenomenon still exists, but the theory may still have
> some validity since the ddn cache is shared between all the osts
> connected to it.
> paul

Paul -- what is the "prefetch" value for the LUNs? The "cache" command is where this can be found.

Historical testing on s2a8500s shows that for Lustre, prefetch=1 or prefetch=0 is desired. When run with prefetch=8, we saw the same behavior you are describing, namely the DDN "stomping" on itself trying to prefetch too much data.

This testing has been done twice on the XT3, both times with the same result -- prefetch=1.

Nic
Hello!

On Wed, Nov 22, 2006 at 11:39:02AM -0700, Lee Ward wrote:
> > So for a meaningful comparison we should compare 10k clients file per
> > process with 5k clients shared file. This "only" gives us a 2x difference,
> > which is still better than 4x.
> > Also the stripe size is not specified; what was it set to?
> Please define "stripe size"?

Stripe size is the number of bytes that go into a single stripe on one OST before we switch to the next OST for another stripe. With lfs setstripe it is the first of three arguments.

> > > - why are reads so slow?
> > No proper readahead by the backend disk?
> MF is set to "on", max-prefetch is set to ("x" and "1")

I am not familiar with those options, unfortunately. Is there a simple description or something like that? How large is such readahead with these settings?

See, if all clients read data sequentially, 1M each, then we have 31 clients competing for reads from one particular OST. Every such client has to read 1MB out of every 31MB in an object that lives on this OST. It is also complicated by the fact that there is no particular order those requests arrive in. E.g. if we get a request for the last 1M chunk of that 31M area, the readahead logic in the backend is probably going to read data forward (how far forward?), but when the other requests come in they are uncached and we wait for them to hit the disk. This is the single-file case, which is quite favourable; in fact you can see the read speed does not descend as badly in this case.

For file per process there is a separate file 'object' for every process. Those objects live in different (ext3) 'subdirs', and the allocator may well decide to allocate them in different parts of the disk (that's what the ext3 default allocator does, I think). So one client reading data most likely does not prefetch any data for other clients on a first read. And if readahead does not read gigabytes of data in advance for every object, there is going to be some contention for the backend storage to read all those multiple places in various areas of the storage (how many parallel I/O streams can the disk backend sustain?). For 2MB I/O the figures above need to be doubled, of course.

Was there a reason the single-file, single-core case was measured with a 1M xfer size when all the other cases were done with a 2M xfer size?

> > > - why is there a significant read dropoff?
> > Writes can be cached, nicely aggregated and written to disk in nice linear
> > chunks, while the disk backend cannot do proper readahead for such a
> > seemingly random 1M here, 1M there?
> > Was the write cache enabled?
> Yes. Not partitioned.

That's likely the answer to why writes are so fast and do not drop off as the number of clients goes up, as long as the data fits into cache (what we can see on the first graph). It is not very clear why on the second graph the write performance starts to go down after some number of clients, and I guess this suggests a bottleneck in some place other than the disk backend.

> > > - why are two cores so much slower than a single core?
> > Were the two cores on a single chip counted as a single client on the graph,
> > or as two clients? Probably the latter? Could it be some local bus
> > bottleneck due to increased load on the same bus/network?
> It's counted as 2 clients. The node architecture *is* 2 clients in this
> scenario. Memory is partitioned, etc. The only thing shared is the NIC.
> I suppose the HT is shared as well, but it is so much faster than the NIC
> that it would seem to need an architectural deficiency to figure in
> here.

So it could be that you saturate the NIC or parts of the network? (You have 2x as much traffic in the dual-core node case for the same network.) Is there a way to measure this?

Bye,
    Oleg
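For reference, a hypothetical invocation of the old three-argument setstripe form Oleg describes. The file name is made up, and the values (2 MiB stripe size, 157 stripes, starting OST chosen by the MDS) simply mirror the shared-file runs discussed in this thread:

    # lfs setstripe <file> <stripe-size-bytes> <starting-OST-index> <stripe-count>
    # 2097152 = 2 MiB; -1 lets the MDS choose the starting OST; 157 stripes.
    lfs setstripe /mnt/lustre/ior_shared_file 2097152 -1 157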
It was set to 1. This happened on a 2048 pe job reading in a few hundred gigs of input data from a single file.

paul

Nicholas Henke wrote:
> Paul -- what is the "prefetch" value for the LUNs? The "cache" command
> is where this can be found.
>
> Historical testing on s2a8500s shows that for Lustre, prefetch=1 or
> prefetch=0 is desired. When run with prefetch=8, we saw the same
> behavior you are describing, namely the DDN "stomping" on itself
> trying to prefetch too much data.
>
> This testing has been done twice on the XT3, both times with the same
> result -- prefetch=1.
>
> Nic
Lee Ward wrote:
> On Tue, 2006-11-21 at 20:48 +0200, Oleg Drokin wrote:
> > The test was unfair, see:
> > With one stripe per file, file per process, 320 OSTs are used for 10k
> > clients; that's roughly 31 clients per OST.
> >
> > Now with the shared file, the stripe size (count, I presume?) is 157. Since
> > there is only one file, it means only 157 OSTs were in use, or roughly 64
> > clients per OST.
> >
> > So for a meaningful comparison we should compare 10k clients file per
> > process with 5k clients shared file. This "only" gives us a 2x difference,
> > which is still better than 4x.
> >
> > Also the stripe size is not specified; what was it set to?

Oleg, I'm not sure I entirely agree here. If every OSS was used in the single-file test then from the IO hardware perspective they are functionally equivalent. They run 2 OSTs on each OSS (which is probably done for capacity reasons due to the small lun limitation of ext3), so a test which only uses half of the OSTs could still use every OSS processor, interconnect link, and fiber channel link.

paul
Hello!

On Wed, Nov 22, 2006 at 02:33:23PM -0500, pauln wrote:
> Oleg, I'm not sure I entirely agree here. If every OSS was used in the
> single-file test then from the IO hardware perspective they are
> functionally equivalent. They run 2 OSTs on each OSS (which is probably
> done for capacity reasons due to the small lun limitation of ext3), so a
> test which only uses half of the OSTs could still use every OSS
> processor, interconnect link, and fiber channel link.

I have no such information. My assumption was that every OST had its own disk backend, but even if not, we have no idea in what order the OSTs were created. If they were created such that on oss0 we have ost0 & ost1 and so on, what I describe is still correct too.

If somehow the number of physical disk backends & OSSes used in both cases is the same, other things might have come into play too, like fewer spindles used by the underlying disk backend due to partitioning?

Bye,
    Oleg
On Wed, 2006-11-22 at 21:11 +0200, Oleg Drokin wrote:
> > > > Also the stripe size is not specified; what was it set to?
> > > Please define "stripe size"?
> >
> > Stripe size is the number of bytes that go into a single stripe on one OST
> > before we switch to the next OST for another stripe. With lfs setstripe it
> > is the first of three arguments.

Ah. Ok. It's 2 MiB.

> > > > > - why are reads so slow?
> > > > No proper readahead by the backend disk?
> > > MF is set to "on", max-prefetch is set to ("x" and "1")
> >
> > I am not familiar with those options, unfortunately. Is there a simple
> > description or something like that? How large is such readahead with these
> > settings?

Ok. Our interpretation is that these settings mean readahead is enabled, up to a 64K prefetch.

The DDN site has the technical docs for the controller online. Feel free to try to interpret the settings yourself :-)

> > See, if all clients read data sequentially, 1M each, then we have 31 clients
> > competing for reads from one particular OST. Every such client has to read
> > 1MB out of every 31MB in an object that lives on this OST. It is also
> > complicated by the fact that there is no particular order those requests
> > arrive in.

But they should arrive very close to one another (all began at the same time and proceed in order through the file, tending to keep them in step) and, with the elevator seek-sort applied, they are supposed to be reordered so that the controller typically sees long, contiguous access requests -- provided the FS did an extent-based allocation, of course. You seem to indicate that it does extent-based allocation for the file-per-process case in the paragraph after the next. Are you thinking it does not do extent allocation for objects involved in the shared-file environment?

> > E.g. if we get a request for the last 1M chunk of that 31M area, the
> > readahead logic in the backend is probably going to read data forward (how
> > far forward?), but when the other requests come in they are uncached and we
> > wait for them to hit the disk. This is the single-file case, which is quite
> > favourable; in fact you can see the read speed does not descend as badly in
> > this case.

This would make sense without extent-based allocation and re-ordering. With it, though, things should work out just fine.

Paul Nowicki, in a previous message in this thread, seems to have evidence that pre-fetching at the controller might not be a good idea. Could that be contributing? Are we wasting back-end bandwidth in the RAID controller? I can see that explaining the fall-off for the file-per-process case. It doesn't explain the simply bad performance of the single-shared-file case, though. There, prefetch should be in the noise -- again, given a proper extent allocator and seek-sort algorithm.

> > For file per process there is a separate file 'object' for every process.
> > Those objects live in different (ext3) 'subdirs', and the allocator may
> > well decide to allocate them in different parts of the disk (that's what
> > the ext3 default allocator does, I think). So one client reading data most
> > likely does not prefetch any data for other clients on a first read. And if
> > readahead does not read gigabytes of data in advance for every object,
> > there is going to be some contention for the backend storage to read all
> > those multiple places in various areas of the storage (how many parallel
> > I/O streams can the disk backend sustain?)
> > For 2MB I/O the figures above need to be doubled, of course.
> >
> > Was there a reason the single-file, single-core case was measured with a 1M
> > xfer size when all the other cases were done with a 2M xfer size?

Dunno, but the ranges and shapes are similar. I'm thinking, then, there's not much difference due to this.

> > > > > - why is there a significant read dropoff?
> > > > Writes can be cached, nicely aggregated and written to disk in nice
> > > > linear chunks, while the disk backend cannot do proper readahead for
> > > > such a seemingly random 1M here, 1M there?
> > > > Was the write cache enabled?
> > > Yes. Not partitioned.
> >
> > That's likely the answer to why writes are so fast and do not drop off as
> > the number of clients goes up, as long as the data fits into cache (what we
> > can see on the first graph). It is not very clear why on the second graph
> > the write performance starts to go down after some number of clients, and I
> > guess this suggests a bottleneck in some place other than the disk backend.

I agree. NIC arbitration, maybe? I'm less concerned about the single-file-per-process case. So long as we can get *something* usable we can counsel our users. Such a thing exists for the file-per-process case. It just doesn't for the single shared file.

> > > > > - why are two cores so much slower than a single core?
> > > > Were the two cores on a single chip counted as a single client on the
> > > > graph, or as two clients? Probably the latter? Could it be some local
> > > > bus bottleneck due to increased load on the same bus/network?
> > > It's counted as 2 clients. The node architecture *is* 2 clients in this
> > > scenario. Memory is partitioned, etc. The only thing shared is the NIC.
> > > I suppose the HT is shared as well, but it is so much faster than the NIC
> > > that it would seem to need an architectural deficiency to figure in here.
> >
> > So it could be that you saturate the NIC or parts of the network? (You have
> > 2x as much traffic in the dual-core node case for the same network.)
> > Is there a way to measure this?

I can't see that from these graphs. Peak is the same, and at the same node count -- i.e., the client can drive things sufficiently whether single or dual. So it isn't the client side. As things go out further the *number* of messages injected into the network is the same, so I can't reach a place where we are calling out the network.

For the service side, it's the same NIC, receiving the same number of messages. The server is *always* single-core. My only question would be whether the same nid is used by a dual-core client and whether Lustre/lnet is doing something different. Ruth or Jim, do you know if the client node is using the same NID for both virtual machines?
Lee Ward wrote:
> Paul Nowicki, in a previous message in this thread, seems to have
> evidence that pre-fetching at the controller might not be a good idea.
> Could that be contributing? Are we wasting back-end bandwidth in the
> RAID controller? I can see that explaining the fall-off for the
> file-per-process case. It doesn't explain the simply bad performance of
> the single-shared-file case, though. There, prefetch should be in the
> noise -- again, given a proper extent allocator and seek-sort algorithm.

"Nowicki" - that's one I haven't heard before! :)

Lee, can you guys run "stats repeat=mbs" on your ddns during the single-file and multi-file tests? It would be interesting to compare the back-end to front-end bw ratios for both cases. We had one case (single-file) where the backend prefetching killed the performance, so I'm certain (at least under 2.4) that single-file usage cases can cause thrashing.

> > For file per process there is a separate file 'object' for every process.
> > Those objects live in different (ext3) 'subdirs', and the allocator may
> > well decide to allocate them in different parts of the disk (that's what
> > the ext3 default allocator does, I think). So one client reading data most
> > likely does not prefetch any data for other clients on a first read. And if
> > readahead does not read gigabytes of data in advance for every object,
> > there is going to be some contention for the backend storage to read all
> > those multiple places in various areas of the storage (how many parallel
> > I/O streams can the disk backend sustain?)

The story gets worse if prefetched data, which would have been useful, is replaced prematurely!

paul
On Wed, 22 Nov 2006, Ruth Klundt wrote:
> On Tue, 2006-11-21 at 15:34 -0700, Peter Braam wrote:
> > To validate Oleg's claim for reads, namely that they are suboptimal for
> > random 1MB reads - could the OST survey be run against the OBD filter
> > device? This only requires one DDN.
>
> Does the OST survey run for catamount clients? I'd need some pointers to
> get going on trying that. We ran obdecho once, but I haven't personally
> got an obdfilter device going on the xt3 as of yet.

I think Peter suggested running the obdsurvey on the servers. IIRC your servers are regular Linux systems, right?

There are some instructions for the lustre_survey python script in the KB:
https://bugzilla.lustre.org/show_bug.cgi?id=9383

I'm not sure it's still working though; I know there were related bugs opened, but I can't find them anymore.

Here we are still using Eric Barton's obdfilter-survey shell script, which works pretty well, but we only use it on test systems.

--
Jean-Marc Saffroy - jean-marc.saffroy@ext.bull.net
On Wed, 2006-11-22 at 22:07 +0200, Oleg Drokin wrote:
> On Wed, Nov 22, 2006 at 02:33:23PM -0500, pauln wrote:
> > Oleg, I'm not sure I entirely agree here. If every OSS was used in the
> > single-file test then from the IO hardware perspective they are
> > functionally equivalent. They run 2 OSTs on each OSS (which is probably
> > done for capacity reasons due to the small lun limitation of ext3), so a
> > test which only uses half of the OSTs could still use every OSS
> > processor, interconnect link, and fiber channel link.

We use 2 OSTs per OSS in order to actively use both channels of the HBA -- staying away from channel bonding.

> I have no such information. My assumption was that every OST had its own
> disk backend, but even if not, we have no idea in what order the OSTs were
> created. If they were created such that on oss0 we have ost0 & ost1 and so
> on, what I describe is still correct too.
>
> If somehow the number of physical disk backends & OSSes used in both cases
> is the same, other things might have come into play too, like fewer
> spindles used by the underlying disk backend due to partitioning?

They are partitioned. However, during this test the other file systems were idle.

Each channel on an HBA does, effectively, have its own string of disk -- DDN calls them "tiers". There are 4 OSSes per DDN "couplet" (controller pair). There are 8 "tiers" per DDN.

Lustre is configured so that two consecutive OSTs are on the same OSS.
Lee Ward wrote:
> We use 2 OSTs per OSS in order to actively use both channels of the HBA
> -- staying away from channel bonding.

So that limits (currently) your bandwidth to 50% of what is available. Making an OST volume through LVM can give you full bandwidth, without channel bonding. I think we have just established that using LVM for RAID0 will not have an impact on performance.
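A sketch of the kind of LVM aggregation being suggested here: combining the two LUNs an OSS sees into one striped (RAID0-style) volume that is then formatted as a single OST. The device names, volume names, sizes and stripe size below are purely illustrative:

    # /dev/sdb and /dev/sdc stand in for the two FC LUNs visible to one OSS
    pvcreate /dev/sdb /dev/sdc
    vgcreate ostvg /dev/sdb /dev/sdc

    # -i 2: stripe across both PVs (RAID0); -I 1024: 1 MiB stripe size (in KiB)
    lvcreate -i 2 -I 1024 -L 2T -n ost0 ostvg

    # /dev/ostvg/ost0 would then be formatted as a single OST in whatever way
    # the installed Lustre version expects.

The trade-off discussed later in the thread applies: one larger OST per OSS doubles the per-file bandwidth available without striping, but also widens the failure exposure of a single backend problem.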
On Thu, Nov 23, 2006 at 01:40:55PM +0100, Jean-Marc Saffroy wrote:
> There are some instructions for the lustre_survey python script in the KB:
> https://bugzilla.lustre.org/show_bug.cgi?id=9383
>
> I'm not sure it's still working though; I know there were related bugs
> opened, but I can't find them anymore.

obdsurvey (the python script) does not work and has been removed from current versions of the iokit.

> Here we are still using Eric Barton's obdfilter-survey shell script, which
> works pretty well, but we only use it on test systems.

This is what we currently recommend. The script is included in the iokit, and there is fairly complete documentation on using it in the Lustre manual. Like obdsurvey, it is nondestructive.

IO Kit: https://downloads.clusterfs.com/customer/lustre-iokit/
Manual: http://lustre.org/manual.html

Cheers, Jody
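For anyone trying this, a rough sketch of an obdfilter-survey run on an OSS. The tunable names below (size, nobjlo/nobjhi, thrlo/thrhi) follow the iokit documentation, but they and the way target OSTs are selected vary between iokit releases, so treat this as a guess and check the script header and the manual section mentioned above first:

    # Non-destructive survey of the local obdfilter (OST) devices on one OSS.
    # size: amount of data per test (MB); nobj*/thr*: ranges of object and
    # thread counts to sweep.
    size=1024 nobjlo=1 nobjhi=32 thrlo=1 thrhi=64 sh ./obdfilter-survey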
On Thu, 2006-11-23 at 12:46 -0700, Peter Braam wrote:
> > We use 2 OSTs per OSS in order to actively use both channels of the HBA
> > -- staying away from channel bonding.
>
> So that limits (currently) your bandwidth to 50% of what is available.
> Making an OST volume through LVM can give you full bandwidth, without
> channel bonding. I think we have just established that using LVM for
> RAID0 will not have an impact on performance.

Huh? In the aggregate, all the performance is there. The graphs reflect that. No way could we get 40 GB/s using only half of what the attached disk is capable of.

In the singular, for file-per-process without any striping, I would agree. But we are happy with what we have there, I think.

Also, as you are probably aware, we have been having some reliability issues with the DDN controllers. Striping on a single host also brings a greater exposure, then. The arrangement we have tends to mitigate that exposure for smaller jobs. For the larger ones, well, there's little we can think of to do, really. It's just always a factor. Things are getting better, though. Maybe someday we can consider aggregating using LVM.
On Nov 23, 2006 14:37 -0700, Lee Ward wrote:
> On Thu, 2006-11-23 at 12:46 -0700, Peter Braam wrote:
> > > We use 2 OSTs per OSS in order to actively use both channels of the HBA
> > > -- staying away from channel bonding.
> >
> > So that limits (currently) your bandwidth to 50% of what is available.
> > Making an OST volume through LVM can give you full bandwidth, without
> > channel bonding. I think we have just established that using LVM for
> > RAID0 will not have an impact on performance.
>
> Huh? In the aggregate, all the performance is there. The graphs reflect
> that. No way could we get 40 GB/s using only half of what the attached
> disk is capable of.

I think to clarify the previous comments:
- for FPP jobs, files are striped over 1 OST, but all 320 OSTs (hence 320 DDN tiers, 320 FC controllers) are in use because there are multiple files
- for SSF jobs, the one file is striped over at most 160 OSTs (this is a current Lustre limit for a single file), so at most 160 DDN tiers and 160 FC controllers are in use

This difference is why Oleg previously mentioned the "explained 2x difference" in the tests.

For an apples-to-apples comparison, it would be possible to deactivate 160 OSTs (one per OSS) on the MDS via "for N in ... ;do lctl --device N deactivate;done" and then run the SSF and FPP jobs again. This will limit the FPP jobs to 160 OSTs (like the SSF). It might also be useful to disable 161 OSTs (leaving 159 active) to avoid aliasing in the SSF case, or alternately have clients each write 7MB chunk sizes or something.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
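Spelled out, the deactivation loop Andreas sketches might look like the following. Only the lctl deactivate command comes from his message; the device-number range is a placeholder and would be taken from the 'lctl dl' listing on the MDS for the OSCs of the OSTs to be disabled:

    # On the MDS: deactivate one OST per OSS so FPP jobs see only 160 OSTs,
    # matching the single-shared-file case. The range below is a placeholder --
    # substitute the real device numbers reported by 'lctl dl'.
    for N in $(seq 10 2 328); do
        lctl --device $N deactivate
    done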
I'd just like to take one step back here for the reading problem - let's
keep it simple.

When we worked with DDN 8500 controllers we established that with something
like 10 reading threads reading 1MB chunks randomly from OSTs we achieved
the full bandwidth of the array. Read-ahead probably only damages things at
this point (but is very important for reading smaller files).

There are many users of the 9500 SATA arrays now, and it surprises me that
we don't have the OST survey graphs for these arrays. I believe that someone
from our staff is working with Sandia to get one done. This should show us
the parameters to get full bandwidth at least when the application reads at
least X MB, where X is hopefully still 1 (if not, we have a nasty networking
problem on our hands).

I would say that "playing" with the parameters (DDN settings, Lustre
read-ahead, etc.) is almost certain to make things worse in these use cases,
and to be highly confusing. Perhaps we should disable at least the Lustre
read-ahead stuff when we have the minimum sizes to reach full bandwidth.

- peter -
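[Editor's note: for concreteness, a sketch of what "disabling the Lustre read-ahead stuff" usually means on a Linux client of that era. The proc path and tunable name are assumptions based on the 1.x llite interface; liblustre clients on the Red Storm compute nodes may not expose them at all, so verify against the installed release.]

    # Inspect the current client read-ahead limit (in MB) on a Linux client
    cat /proc/fs/lustre/llite/*/max_read_ahead_mb

    # Disable client read-ahead entirely for the duration of the test
    for f in /proc/fs/lustre/llite/*/max_read_ahead_mb; do
        echo 0 > $f
    done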
Peter Braam wrote:
> I'd just like to take one step back here for the reading problem - let's
> keep it simple.
>
> When we worked with DDN 8500 controllers we established that with
> something like 10 reading threads reading 1MB chunks randomly from
> OSTs we achieved the full bandwidth of the array.
> Read-ahead probably only damages things at this point (but is very
> important for reading smaller files).

Is testing with only 10 threads truly representative of the situation? I
recall from a previous message that they have 30 threads per OST, so on a
DDN couplet that's 30*8 threads. I would assume that such a workload would
greatly randomize the I/O from the perspective of the DDN.

> There are many users of the 9500 SATA arrays now, and it surprises me
> that we don't have the OST survey graphs for these arrays. I believe
> that someone from our staff is working with Sandia to get one done.
> This should show us the parameters to get full bandwidth at least when
> the application reads at least X MB, where X is hopefully still 1 (if
> not, we have a nasty networking problem on our hands).
>
> I would say that "playing" with the parameters (DDN settings, Lustre
> read-ahead, etc.) is almost certain to make things worse in these use
> cases, and to be highly confusing. Perhaps we should disable at least
> the Lustre read-ahead stuff when we have the minimum sizes to reach
> full bandwidth.
>
> - peter -
pauln wrote:
> Peter Braam wrote:
>> I'd just like to take one step back here for the reading problem - let's
>> keep it simple.
>>
>> When we worked with DDN 8500 controllers we established that with
>> something like 10 reading threads reading 1MB chunks randomly from
>> OSTs we achieved the full bandwidth of the array.
>> Read-ahead probably only damages things at this point (but is very
>> important for reading smaller files).
>
> Is testing with only 10 threads truly representative of the situation?
> I recall from a previous message that they have 30 threads per OST, so
> on a DDN couplet that's 30*8 threads. I would assume that such a workload
> would greatly randomize the I/O from the perspective of the DDN.

Around 10 threads are needed to max out the array - throughput is
completely stable up to 1000 threads.

- Peter -

>> There are many users of the 9500 SATA arrays now, and it surprises me
>> that we don't have the OST survey graphs for these arrays. I believe
>> that someone from our staff is working with Sandia to get one done.
>> This should show us the parameters to get full bandwidth at least
>> when the application reads at least X MB, where X is hopefully still
>> 1 (if not, we have a nasty networking problem on our hands).
>>
>> I would say that "playing" with the parameters (DDN settings, Lustre
>> read-ahead, etc.) is almost certain to make things worse in these use
>> cases, and to be highly confusing. Perhaps we should disable at least
>> the Lustre read-ahead stuff when we have the minimum sizes to reach
>> full bandwidth.
>>
>> - peter -
On Thu, 2006-11-23 at 15:03 -0700, Andreas Dilger wrote:
> On Nov 23, 2006 14:37 -0700, Lee Ward wrote:
> > On Thu, 2006-11-23 at 12:46 -0700, Peter Braam wrote:
> > > > On Wed, 2006-11-22 at 22:07 +0200, Oleg Drokin wrote:
> > > > We use 2 OSTs per OSS in order to actively use both channels of the
> > > > HBA -- staying away from channel bonding.
> > >
> > > So that limits (currently) your bandwidth to 50% of what is available.
> > > Making an OST volume through LVM can give you full bandwidth, without
> > > channel bonding. I think we have just established that using LVM for
> > > RAID0 will not have an impact on performance.
> >
> > Huh? In the aggregate, all the performance is there. The graphs reflect
> > that. There is no way we could get 40 GB/s using only half of what the
> > attached disk is capable of.
>
> I think, to clarify the previous comments:
> - for FPP jobs, files are striped over 1 OST, but all 320 OSTs (hence 320
>   DDN tiers, 320 FC controllers) are in use because there are multiple files
> - for SSF jobs, the one file is striped over at most 160 OSTs (this is a
>   current Lustre limit for a single file), so at most 160 DDN tiers and 160
>   FC controllers are in use
>
> This difference is why Oleg previously mentioned the "explained 2x
> difference" in the tests.
>
> For an apples-to-apples comparison, it would be possible to deactivate 160
> OSTs (one per OSS) on the MDS via "for N in ... ; do lctl --device N
> deactivate; done" and then run the SSF and FPP jobs again. This will limit
> the FPP jobs to 160 OSTs (like the SSF). It might also be useful to disable
> 161 OSTs (leaving 159 active) to avoid the aliasing in the SSF case, or
> alternately to have clients each write 7MB chunk sizes or something.

For our formal testing, we have in our scripts a section that precreates
files, locating them on specific OSTs. Would that be a suitable alternative
to the reconfiguration of Lustre you suggest? What you suggest is possible,
of course. I would just rather not deviate from our process without a good
reason.

--Lee
On Thu, 2006-11-23 at 18:21 -0700, Peter Braam wrote:
> I'd just like to take one step back here for the reading problem - let's
> keep it simple.
>
> When we worked with DDN 8500 controllers we established that with
> something like 10 reading threads reading 1MB chunks randomly from OSTs
> we achieved the full bandwidth of the array.
>
> Read-ahead probably only damages things at this point (but is very
> important for reading smaller files).
>
> There are many users of the 9500 SATA arrays now, and it surprises me
> that we don't have the OST survey graphs for these arrays. I believe
> that someone from our staff is working with Sandia to get one done.
> This should show us the parameters to get full bandwidth at least when
> the application reads at least X MB, where X is hopefully still 1 (if
> not, we have a nasty networking problem on our hands).
>
> I would say that "playing" with the parameters (DDN settings, Lustre
> read-ahead, etc.) is almost certain to make things worse in these use
> cases, and to be highly confusing. Perhaps we should disable at least the
> Lustre read-ahead stuff when we have the minimum sizes to reach full
> bandwidth.

I believe we already know what to do to accomplish what you suggest.
However, could you take the time to be explicit? Better that we know than
believe...

--Lee
On Nov 23, 2006 22:39 -0700, Lee Ward wrote:
> On Thu, 2006-11-23 at 15:03 -0700, Andreas Dilger wrote:
> > For an apples-to-apples comparison, it would be possible to deactivate 160
> > OSTs (one per OSS) on the MDS via "for N in ... ; do lctl --device N
> > deactivate; done" and then run the SSF and FPP jobs again. This will limit
> > the FPP jobs to 160 OSTs (like the SSF). It might also be useful to disable
> > 161 OSTs (leaving 159 active) to avoid the aliasing in the SSF case, or
> > alternately to have clients each write 7MB chunk sizes or something.
>
> For our formal testing, we have in our scripts a section that precreates
> files, locating them on specific OSTs. Would that be a suitable alternative
> to the reconfiguration of Lustre you suggest? What you suggest is possible,
> of course. I would just rather not deviate from our process without a good
> reason.

Sure, precreating files on, say, the first 160 OSTs in a round-robin manner
for all of the client processes for FPP, and similarly forcing the one SSF
file to be created starting on ost_idx 0, will use the same 160 OSTs.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
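[Editor's note: a sketch of what that precreation could look like with lfs, as an illustration only. The mount point, file names, and client count are placeholders, and the -c/-i option spelling is taken from later lfs versions (the lfs setstripe of that era used positional stripe-size/stripe-start/stripe-count arguments), so check "lfs help setstripe" on the installed release.]

    MNT=/mnt/lustre/iortest        # placeholder run directory

    # FPP: one file per client rank, single stripe each, walking the first
    # 160 OST indices round-robin so FPP and SSF touch the same OSTs.
    for rank in $(seq 0 9999); do
        lfs setstripe -c 1 -i $((rank % 160)) $MNT/fpp.$rank
    done

    # SSF: one shared file striped across those same 160 OSTs, starting at
    # OST index 0.
    lfs setstripe -c 160 -i 0 $MNT/shared_file

[The benchmark then has to open the pre-created files rather than recreating them, so the layouts set here actually take effect.]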
[Non-text attachments scrubbed from this message: ddn.xls (application/vnd.ms-excel, 179200 bytes) and iographs.xls (application/vnd.ms-excel, 168448 bytes). Archived copies:
http://mail.clusterfs.com/pipermail/lustre-devel/attachments/20061124/6902c6fb/ddn-0001.xls
http://mail.clusterfs.com/pipermail/lustre-devel/attachments/20061124/6902c6fb/iographs-0001.xls ]
I don't really fancy this suggestion that much - you are proposing an
apples-to-apples comparison at half the performance.

With a proper LVM configuration we could get full performance. Yes, that
exposes one to more HBA/DDN vulnerabilities, but not more than a higher
stripe_count would do. I think this is the way to get full bandwidth on
widely striped shared files.

- Peter -

Andreas Dilger wrote:
> On Nov 23, 2006 22:39 -0700, Lee Ward wrote:
> > On Thu, 2006-11-23 at 15:03 -0700, Andreas Dilger wrote:
> > > For an apples-to-apples comparison, it would be possible to deactivate
> > > 160 OSTs (one per OSS) on the MDS via "for N in ... ; do lctl --device N
> > > deactivate; done" and then run the SSF and FPP jobs again. This will
> > > limit the FPP jobs to 160 OSTs (like the SSF). It might also be useful
> > > to disable 161 OSTs (leaving 159 active) to avoid the aliasing in the
> > > SSF case, or alternately to have clients each write 7MB chunk sizes or
> > > something.
> >
> > For our formal testing, we have in our scripts a section that precreates
> > files, locating them on specific OSTs. Would that be a suitable
> > alternative to the reconfiguration of Lustre you suggest? What you
> > suggest is possible, of course. I would just rather not deviate from our
> > process without a good reason.
>
> Sure, precreating files on, say, the first 160 OSTs in a round-robin manner
> for all of the client processes for FPP, and similarly forcing the one SSF
> file to be created starting on ost_idx 0, will use the same 160 OSTs.
Hi Lee,

Peter Braam advises that you find our IOkit documentation inadequate. Would
you please let me know what specific problems, uncertainties, or questions
you have? The most up-to-date obdfilter_survey instructions can be found in
section 1.2.2 of part III of the Lustre manual.

Thanks,
Jody
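[Editor's note: for readers without the manual at hand, an obdfilter survey run is driven by environment variables along roughly these lines. The variable names (size, nobjhi, thrhi, targets), the values, and the script name are assumptions based on the lustre-iokit of approximately that era, not a quote from the manual section cited above; the script header is the authoritative reference.]

    # On an OSS, survey the local OSTs directly against the disk backend.
    # size is the amount of data moved per OST (MB); nobjhi and thrhi bound
    # the object and thread counts the survey sweeps through.
    size=8192 nobjhi=4 thrhi=32 targets="ost1 ost2" ./obdfilter-survey

[Plotting throughput against thread count from such a sweep is what would show whether roughly 10 threads per OST already saturate the 9500 arrays, as Peter expects from the 8500 experience.]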
Hi Jody,

I was talking to Lee earlier, and he doesn't remember being the source of
complaints about the IOkit. I haven't run it myself, so I have no complaints
either ;)

What we did try last spring was to start obdecho and also obdfilter on an
XT3 test cluster, using some info on the wiki. The first worked, but I
couldn't get the second one going properly for more than one OST, for
reasons I now forget and which may have been user error. Given the
considerable difference in code base, I'd say that's water under the bridge
now.

When the opportunity comes up again, I'll certainly try out the kit and give
any feedback on the instructions.

Thanks much,
Ruth

-----Original Message-----
From: lustre-devel-bounces@clusterfs.com
[mailto:lustre-devel-bounces@clusterfs.com] On Behalf Of Jody McIntyre
Sent: Monday, November 27, 2006 4:28 PM
To: Ward, Lee
Cc: lustre-devel@clusterfs.com
Subject: Re: [Lustre-devel] wide striping
Hi Ruth,

On Tue, Nov 28, 2006 at 11:00:46AM -0700, Klundt, Ruth wrote:
> I was talking to Lee earlier, and he doesn't remember being the source
> of complaints about the IOkit. I haven't run it myself, so I have no
> complaints either ;)

OK, thanks for clarifying - I received my information third-hand and it
seems it was wrong.

> [...]
> When the opportunity comes up again, I'll certainly try out the kit and
> give any feedback on the instructions.

OK, thanks. We're here to help :)

Cheers,
Jody

> Thanks much,
> Ruth