Sandia has made available a very interesting set of graphs with some questions. They study single-file-per-process and shared-file IO on Red Storm.

Most striking is that on the liblustre platform, shared-file IO seems to be 4x slower. It feels like this should be a simple contention issue on the server, but we haven't been able to find it.

A question not mentioned in this email is why this won't scale beyond 12,500 clients. Probably that has a very simple answer too, and something blows up there.

LLNL has shown that in some cases shared-file IO can be faster than file per process, and it is high time we put this puzzle to bed.

I attach the graphs and copy Sandia's questions below.

Peter

What follows are some IOR tests run at Sandia on a 160-OSS/320-OST Lustre file system. The file system had just been reformatted prior to the runs.

The following issues seem to be the key ones:
- the single shared file is a factor of 4-5 too slow; what is the overhead?
- why are reads so slow?
- why is there a significant read dropoff?
- why are two cores so much slower than a single core?

[Attachment: IOtestingResults_2_.pdf (application/pdf, 23636 bytes) -- http://mail.clusterfs.com/pipermail/lustre-devel/attachments/20061121/916e1d5e/IOtestingResults_2_-0001.pdf]
Peter Braam wrote:
> Sandia has made available a very interesting set of graphs with some
> questions. They study single-file-per-process and shared-file IO on
> Red Storm.
>
> Most striking is that on the liblustre platform, shared-file IO seems
> to be 4x slower. It feels like this should be a simple contention
> issue on the server, but we haven't been able to find it.
>
> A question not mentioned in this email is why this won't scale beyond
> 12,500 clients. Probably that has a very simple answer too, and
> something blows up there.
>
> LLNL has shown that in some cases shared-file IO can be faster than
> file per process, and it is high time we put this puzzle to bed.
>
> I attach the graphs and copy Sandia's questions below.
>
> Peter
>
> What follows are some IOR tests run at Sandia on a 160-OSS/320-OST
> Lustre file system. The file system had just been reformatted prior
> to the runs.
>
> The following issues seem to be the key ones:
> - the single shared file is a factor of 4-5 too slow; what is the overhead?
> - why are reads so slow?

Was the server lnet.debug set to 0? We have seen a large improvement in reads on the XT3 with that set to 0. I doubt it would fix such a large difference, but it is one thing to get out of the way.

Nic
Hello!

On Tue, Nov 21, 2006 at 10:24:42AM -0700, Peter Braam wrote:
> Sandia has made available a very interesting set of graphs with some
> questions. They study single-file-per-process and shared-file IO on Red
> Storm.

The test was unfair, see:

With one stripe per file, file per process, 320 OSTs are used for 10k clients; that's roughly 31 clients per OST.

Now with the shared file, the stripe size (count, I presume?) is 157. Since there is only one file, it means only 157 OSTs were in use, or roughly 64 clients per OST.

So for a meaningful comparison we should compare 10k clients file per process with 5k clients shared file. This "only" gives us a 2x difference, which is still better than 4x.

Also the stripe size is not specified; what was it set to?

> The following issues seem to be the key ones:
> - the single shared file is a factor of 4-5 too slow; what is the overhead?

Only 2x, it seems.

> - why are reads so slow?

No proper readahead by the backend disk?

> - why is there a significant read dropoff?

Writes can be cached, nicely aggregated and written to disk in nice linear chunks, while the disk backend cannot do proper readahead for such a seemingly random 1M here, 1M there? Was the write cache enabled?

> - why are two cores so much slower than a single core?

Were the two cores on a single chip counted as a single client on the graph, or as two clients? Probably the latter? Could it be some local bus bottleneck due to increased load on the same bus/network?

Bye,
    Oleg
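To make the load estimate above concrete, here is a small shell sketch of the same arithmetic. The client and OST counts are the ones quoted in the message; the script itself is purely illustrative:

    #!/bin/sh
    # Rough clients-per-OST load for the two configurations discussed above.
    CLIENTS=10000

    FPP_OSTS=320   # file per process: every OST in the filesystem carries objects
    SF_OSTS=157    # single shared file: only the 157 OSTs holding a stripe are used

    echo "file per process: $((CLIENTS / FPP_OSTS)) clients per OST"   # prints 31
    echo "shared file:      $((CLIENTS / SF_OSTS)) clients per OST"    # prints 63 (~64 as rounded above)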
Oleg pointed out that the striping width of 160 can have negative impacts - if the bandwidth per OST is too low, it will limit the aggregate number you can reach.

But there is no reason that has to be the case. OST targets could be RAID0-striped across multiple FC interconnects, giving them a very high bandwidth. In this way, with a stripe width of 160, 2x the current bandwidth can be achieved.

To validate Oleg's claim for reads, namely that they are suboptimal for random 1MB reads - could the OST survey be run against the OBD filter device? This only requires one DDN.

- Peter -
I think we'd very much appreciate it if users provided us with more data from their runs. The most interesting things are:

1) vmstat 1
2) utils/llobdstat.pl <ost name> 1
3) utils/llstat.pl <path to stats file> 1

Note that every service has its own stats file. For example:

[root@tom tests]# find /proc/fs/lustre -name stats
/proc/fs/lustre/ldlm/services/ldlm_canceld/stats
/proc/fs/lustre/ldlm/services/ldlm_cbd/stats
/proc/fs/lustre/llite/fs0/stats
/proc/fs/lustre/mdt/MDT/mds_readpage/stats
/proc/fs/lustre/mdt/MDT/mds_setattr/stats
/proc/fs/lustre/mdt/MDT/mds/stats
/proc/fs/lustre/osc/OSC_tom_OST_tom_MNT_tom/stats
/proc/fs/lustre/osc/OSC_tom_OST_tom_mds1/stats
/proc/fs/lustre/obdfilter/OST_tom/stats
/proc/fs/lustre/ost/OSS/ost_io/stats
/proc/fs/lustre/ost/OSS/ost_create/stats
/proc/fs/lustre/ost/OSS/ost/stats

It seems like we need a silly script to simplify the gathering. Peter, what do you think?

thanks, Alex
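A minimal sketch of such a gathering script, built only from the commands listed above. The install path for the utils and the way OST names are discovered (listing /proc/fs/lustre/obdfilter) are assumptions, so adjust to the local setup:

    #!/bin/sh
    # Sketch: collect vmstat, llobdstat and llstat output on a server node.
    # LUSTRE is an assumed install path; INTERVAL is the sample interval in seconds.
    LUSTRE=${LUSTRE:-/usr/lib/lustre}
    INTERVAL=1
    OUT=stats.$(hostname).$$
    mkdir -p "$OUT"

    vmstat $INTERVAL > "$OUT/vmstat.log" &

    # one llobdstat per local obdfilter (OST) device
    for ost in $(ls /proc/fs/lustre/obdfilter 2>/dev/null); do
        "$LUSTRE/utils/llobdstat.pl" "$ost" $INTERVAL > "$OUT/llobdstat.$ost.log" &
    done

    # one llstat per service stats file
    for f in $(find /proc/fs/lustre -name stats); do
        "$LUSTRE/utils/llstat.pl" "$f" $INTERVAL > "$OUT/llstat.$(echo $f | tr / _).log" &
    done

    wait    # runs until interrupted; one log per service ends up in $OUT/

Started at job launch and killed at job end, something like this would produce the per-service logs that could be attached to reports like the ones in this thread.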
On Tue, 2006-11-21 at 20:48 +0200, Oleg Drokin wrote:
> The test was unfair, see:
> With one stripe per file, file per process, 320 OSTs are used for 10k
> clients; that's roughly 31 clients per OST.
>
> Now with the shared file, the stripe size (count, I presume?) is 157. Since
> there is only one file, it means only 157 OSTs were in use, or roughly 64
> clients per OST.
>
> So for a meaningful comparison we should compare 10k clients file per process
> with 5k clients shared file. This "only" gives us a 2x difference, which is
> still better than 4x.
>
> Also the stripe size is not specified; what was it set to?

Please define "stripe size"?

> > The following issues seem to be the key ones:
> > - the single shared file is a factor of 4-5 too slow; what is the overhead?
>
> Only 2x, it seems.
>
> > - why are reads so slow?
>
> No proper readahead by the backend disk?

MF is set to "on", max-prefetch is set to ("x" and "1")

> > - why is there a significant read dropoff?
>
> Writes can be cached, nicely aggregated and written to disk in nice linear
> chunks, while the disk backend cannot do proper readahead for such a
> seemingly random 1M here, 1M there?
> Was the write cache enabled?

Yes. Not partitioned.

> > - why are two cores so much slower than a single core?
>
> Were the two cores on a single chip counted as a single client on the graph,
> or as two clients? Probably the latter? Could it be some local bus bottleneck
> due to increased load on the same bus/network?

It's counted as 2 clients. The node architecture *is* 2 clients in this scenario. Memory is partitioned, etc. The only thing shared is the NIC. I suppose the HT is shared as well, but it is so much faster than the NIC that it would seem to need an architectural deficiency to figure in here.
The runs can't be repeated at present, but we plan to collect this data when there is an opportunity.

Attached are some tables and plots related to the previous set, which also show the per-OST performance from the same runs as the original four plots. (fpp == file per process, sf == shared file, vn == 2 clients per compute node)

Figure 1: attached files fpp_avg_per_ost.png, fpp_table.txt
Figure 2: attached files fpp_vn_avg_per_ost.png, fpp_vn_table.txt
Figure 3: attached files sf_avg_per_ost.png, sf_table.txt
Figure 4: attached files sf_vn_avg_per_ost.png, sf_vn_table.txt

Thanks
Jim, Ruth

On Wed, 2006-11-22 at 02:04 +0300, Alex Tomas wrote:
> I think we'd very much appreciate it if users provided us with more data
> from their runs. The most interesting things are:
> 1) vmstat 1
> 2) utils/llobdstat.pl <ost name> 1
> 3) utils/llstat.pl <path to stats file> 1
[Attachment: fpp_avg_per_ost.png (image/png, 4512 bytes) -- http://mail.clusterfs.com/pipermail/lustre-devel/attachments/20061122/5bdbf181/fpp_avg_per_ost-0001.png]

fpp_table.txt (reads):

# CLIENTS  STRIPE  MAX OSTS  SAMPLE 1  SAMPLE 2  AVG READ   AVG READ/OST  ERROR
  513      1       320       31442     31930     31686.00   99.02         488.00
  1023     1       320       32332     32571     32451.50   101.41        239.00
  2049     1       320       35720     31865     33792.50   105.60        3855.00
  2559     1       320       32900     31898     32399.00   101.25        1002.00
  3073     1       320       31012     31453     31232.50   97.60         441.00
  3583     1       320       31123     31035     31079.00   97.12         88.00
  4300     1       320       26512     -         26512.00   82.85         0.00
  6300     1       320       25177     -         25177.00   78.68         0.00
  8300     1       320       15549     -         15549.00   48.59         0.00
  10300    1       320       14877     -         14877.00   46.49         0.00

fpp_table.txt (writes):

# CLIENTS  STRIPE  MAX OSTS  SAMPLE 1  SAMPLE 2  AVG WRITE  AVG WRITE/OST  ERROR
  513      1       320       30190     28824     29507.00   92.21          1366.00
  1023     1       320       34263     15097     24680.00   77.12          19166.00
  2049     1       320       41169     46168     43668.50   136.46         4999.00
  2559     1       320       39888     39814     39851.00   124.53         74.00
  3073     1       320       38459     39904     39181.50   122.44         1445.00
  3583     1       320       40635     41507     41071.00   128.35         872.00
  4300     1       320       41298     -         41298.00   129.06         0.00
  6300     1       320       38915     -         38915.00   121.61         0.00
  8300     1       320       38386     -         38386.00   119.96         0.00
  10300    1       320       36556     -         36556.00   114.24         0.00
[Attachment: fpp_vn_avg_per_ost.png (image/png, 4499 bytes) -- http://mail.clusterfs.com/pipermail/lustre-devel/attachments/20061122/5bdbf181/fpp_vn_avg_per_ost-0001.png]

fpp_vn_table.txt (reads):

# CLIENTS  STRIPE  MAX OSTS  SAMPLE 1  AVG READ   AVG READ/OST  ERROR
  513      1       320       22650     22650.00   70.78         0.00
  1023     1       320       25037     25037.00   78.24         0.00
  2049     1       320       30358     30358.00   94.87         0.00
  4095     1       320       28612     28612.00   89.41         0.00
  8191     1       320       19229     19229.00   60.09         0.00
  10300    1       320       12732     12732.00   39.79         0.00

fpp_vn_table.txt (writes):

# CLIENTS  STRIPE  MAX OSTS  SAMPLE 1  AVG WRITE  AVG WRITE/OST  ERROR
  513      1       320       27913     27913.00   87.23          0.00
  1023     1       320       30648     30648.00   95.78          0.00
  2049     1       320       37164     37164.00   116.14         0.00
  4095     1       320       39290     39290.00   122.78         0.00
  8191     1       320       30000     30000.00   93.75          0.00
  10300    1       320       27912     27912.00   87.22          0.00

[Attachment: sf_avg_per_ost.png (image/png, 4029 bytes) -- http://mail.clusterfs.com/pipermail/lustre-devel/attachments/20061122/5bdbf181/sf_avg_per_ost-0001.png]

sf_table.txt (reads):

# CLIENTS  STRIPE  MAX OSTS  SAMPLE 1  AVG READ  AVG READ/OST  ERROR
  2049     157     157       5119      5119.00   32.61         0.00
  4095     157     157       6826      6826.00   43.48         0.00
  8191     157     157       6281      6281.00   40.01         0.00
  10300    157     157       5883      5883.00   37.47         0.00

sf_table.txt (writes):

# CLIENTS  STRIPE  MAX OSTS  SAMPLE 1  AVG WRITE  AVG WRITE/OST  ERROR
  2049     157     157       7474      7474.00    47.61          0.00
  4095     157     157       11122     11122.00   70.84          0.00
  8191     157     157       9289      9289.00    59.17          0.00
  10300    157     157       8407      8407.00    53.55          0.00

[Attachment: sf_vn_avg_per_ost.png (image/png, 3650 bytes) -- http://mail.clusterfs.com/pipermail/lustre-devel/attachments/20061122/5bdbf181/sf_vn_avg_per_ost-0001.png]

sf_vn_table.txt (reads):

# CLIENTS  STRIPE  MAX OSTS  SAMPLE 1  AVG READ  AVG READ/OST  ERROR
  4095     157     157       6653      6653.00   42.38         0.00
  8191     157     157       6157      6157.00   39.22         0.00
  10300    157     157       5822      5822.00   37.08         0.00

sf_vn_table.txt (writes):

# CLIENTS  STRIPE  MAX OSTS  SAMPLE 1  AVG WRITE  AVG WRITE/OST  ERROR
  4095     157     157       11138     11138.00   70.94          0.00
  8191     157     157       8530      8530.00    54.33          0.00
  10300    157     157       7850      7850.00    50.00          0.00
On Tue, 2006-11-21 at 15:34 -0700, Peter Braam wrote:
> Oleg pointed out that the striping width of 160 can have negative
> impacts - if the bandwidth per OST is too low, it will limit the
> aggregate number you can reach.
>
> But there is no reason that has to be the case. OST targets could be
> RAID0-striped across multiple FC interconnects, giving them a very high
> bandwidth. In this way, with a stripe width of 160, 2x the current
> bandwidth can be achieved.
>
> To validate Oleg's claim for reads, namely that they are suboptimal for
> random 1MB reads - could the OST survey be run against the OBD filter
> device? This only requires one DDN.

Does the OST survey run for catamount clients? I'd need some pointers to get going on trying that. We ran obdecho once, but I haven't personally got an obdfilter device going on the xt3 as of yet.

Thanks,
Ruth
On our ddn's I've noticed that during large reads the backend bandwidth of the ddn is pegged (700-800 MB/s) while the total bandwidth being delivered through the fc interfaces is in the low hundreds of MB/s. This leads me to believe that the ddn cache is being thrashed heavily due to overly aggressive read-ahead by many OSTs and by the ddn itself.

I haven't re-run these tests with the 2.6 kernel (cray 1.4XX), so I'm not sure if this phenomenon still exists, but the theory may still have some validity since the ddn cache is shared between all the osts connected to it.

paul

Lee Ward wrote:
> On Tue, 2006-11-21 at 20:48 +0200, Oleg Drokin wrote:
> > The test was unfair, see:
> > With one stripe per file, file per process, 320 OSTs are used for 10k
> > clients; that's roughly 31 clients per OST.
> >
> > Now with the shared file, the stripe size (count, I presume?) is 157. Since
> > there is only one file, it means only 157 OSTs were in use, or roughly 64
> > clients per OST.

There are 2 OSTs per OSS, so I don't think the test is that unfair unless somehow it failed to utilize every fiber channel link available. So if one OST was used on each OSS, then in terms of network and disk bandwidth the tests are equivalent.
pauln wrote:
> On our ddn's I've noticed that during large reads the backend
> bandwidth of the ddn is pegged (700-800 MB/s) while the total
> bandwidth being delivered through the fc interfaces is in the low
> hundreds of MB/s. This leads me to believe that the ddn cache is being
> thrashed heavily due to overly aggressive read-ahead by many OSTs and
> by the ddn itself.
> I haven't re-run these tests with the 2.6 kernel (cray 1.4XX), so I'm
> not sure if this phenomenon still exists, but the theory may still have
> some validity since the ddn cache is shared between all the osts
> connected to it.
> paul

Paul -- what is the "prefetch" value for the LUNs? The "cache" command is where this can be found.

Historical testing on s2a8500s shows that for Lustre, prefetch=1 or prefetch=0 is desired. When run with prefetch=8, we saw the same behavior you are describing, namely the DDN "stomping" on itself trying to prefetch too much data.

This testing has been done twice on the XT3, both times with the same result -- prefetch=1.

Nic
Hello!

On Wed, Nov 22, 2006 at 11:39:02AM -0700, Lee Ward wrote:
> > So for a meaningful comparison we should compare 10k clients file per
> > process with 5k clients shared file. This "only" gives us a 2x difference,
> > which is still better than 4x.
> > Also the stripe size is not specified; what was it set to?
> Please define "stripe size"?

Stripe size is the number of bytes that go into a single stripe on one OST before we switch to the next OST for another stripe. With lfs setstripe it is the first of three arguments.

> > > - why are reads so slow?
> > No proper readahead by the backend disk?
> MF is set to "on", max-prefetch is set to ("x" and "1")

I am not familiar with those options, unfortunately. Is there a simple description or something like that? How large is such readahead with these settings?

See, if all clients read data sequentially, 1M each, then we have 31 clients competing for reads from one particular OST. Every such client has to read 1MB out of every 31MB in an object that lives on this OST. It is also complicated by the fact that there is no particular order those requests arrive in. E.g. if we get a request for the last 1M chunk of that 31M area, the readahead logic in the backend is probably going to read data forward (how far forward?), but when the other requests come in they are uncached and we wait for them to hit the disk. This is the single-file case, which is quite favourable; in fact you can see the read speed does not descend as badly in this case.

For file per process there is a separate file 'object' for every process. Those objects live in different (ext3) 'subdirs', and the allocator may well decide to allocate them in different parts of the disk (that's what the ext3 default allocator does, I think). So one client reading data most likely does not prefetch any data for other clients on a first read. And if readahead does not read gigabytes of data in advance for every object, there is going to be some contention for the backend storage to read all those multiple places in various areas of the storage (how many parallel I/O streams can the disk backend sustain?). For 2MB I/O the figures above need to be doubled, of course.

Was there a reason the single-file, single-core case was measured with a 1M xfer size when all the other cases were done with a 2M xfer size?

> > > - why is there a significant read dropoff?
> > Writes can be cached, nicely aggregated and written to disk in nice linear
> > chunks, while the disk backend cannot do proper readahead for such a
> > seemingly random 1M here, 1M there?
> > Was the write cache enabled?
> Yes. Not partitioned.

That's likely the answer to why writes are so fast and do not drop off as the number of clients goes up, as long as the data fits into cache (what we can see on the first graph). It is not very clear why on the second graph the write performance starts to go down after some number of clients, and I guess this suggests a bottleneck in some place other than the disk backend.

> > > - why are two cores so much slower than a single core?
> > Were the two cores on a single chip counted as a single client on the graph,
> > or as two clients? Probably the latter? Could it be some local bus
> > bottleneck due to increased load on the same bus/network?
> It's counted as 2 clients. The node architecture *is* 2 clients in this
> scenario. Memory is partitioned, etc. The only thing shared is the NIC.
> I suppose the HT is shared as well, but it is so much faster than the NIC
> that it would seem to need an architectural deficiency to figure in
> here.

So it could be that you saturate the NIC or parts of the network? (You have 2x as much traffic in the dual-core node case for the same network.) Is there a way to measure this?

Bye,
    Oleg
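For reference, a hypothetical invocation of the old three-argument setstripe form Oleg describes. The file name is made up, and the values (2 MiB stripe size, 157 stripes, starting OST chosen by the MDS) simply mirror the shared-file runs discussed in this thread:

    # lfs setstripe <file> <stripe-size-bytes> <starting-OST-index> <stripe-count>
    # 2097152 = 2 MiB; -1 lets the MDS choose the starting OST; 157 stripes.
    lfs setstripe /mnt/lustre/ior_shared_file 2097152 -1 157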
It was set to 1. This happened on a 2048 pe job reading in a few hundred gigs of input data from a single file.

paul

Nicholas Henke wrote:
> Paul -- what is the "prefetch" value for the LUNs? The "cache" command
> is where this can be found.
>
> Historical testing on s2a8500s shows that for Lustre, prefetch=1 or
> prefetch=0 is desired. When run with prefetch=8, we saw the same
> behavior you are describing, namely the DDN "stomping" on itself
> trying to prefetch too much data.
>
> This testing has been done twice on the XT3, both times with the same
> result -- prefetch=1.
>
> Nic
Lee Ward wrote:
> On Tue, 2006-11-21 at 20:48 +0200, Oleg Drokin wrote:
> > The test was unfair, see:
> > With one stripe per file, file per process, 320 OSTs are used for 10k
> > clients; that's roughly 31 clients per OST.
> >
> > Now with the shared file, the stripe size (count, I presume?) is 157. Since
> > there is only one file, it means only 157 OSTs were in use, or roughly 64
> > clients per OST.
> >
> > So for a meaningful comparison we should compare 10k clients file per
> > process with 5k clients shared file. This "only" gives us a 2x difference,
> > which is still better than 4x.
> >
> > Also the stripe size is not specified; what was it set to?

Oleg, I'm not sure I entirely agree here. If every OSS was used in the single-file test then from the IO hardware perspective they are functionally equivalent. They run 2 OSTs on each OSS (which is probably done for capacity reasons due to the small lun limitation of ext3), so a test which only uses half of the OSTs could still use every OSS processor, interconnect link, and fiber channel link.

paul
Hello!

On Wed, Nov 22, 2006 at 02:33:23PM -0500, pauln wrote:
> Oleg, I'm not sure I entirely agree here. If every OSS was used in the
> single-file test then from the IO hardware perspective they are
> functionally equivalent. They run 2 OSTs on each OSS (which is probably
> done for capacity reasons due to the small lun limitation of ext3), so a
> test which only uses half of the OSTs could still use every OSS
> processor, interconnect link, and fiber channel link.

I have no such information. My assumption was that every OST had its own disk backend, but even if not, we have no idea in what order the OSTs were created. If they were created such that on oss0 we have ost0 & ost1 and so on, what I describe is still correct too.

If somehow the number of physical disk backends & OSSes used in both cases is the same, other things might have come into play too, like fewer spindles used by the underlying disk backend due to partitioning?

Bye,
    Oleg
On Wed, 2006-11-22 at 21:11 +0200, Oleg Drokin wrote:
> > > > Also the stripe size is not specified; what was it set to?
> > > Please define "stripe size"?
> >
> > Stripe size is the number of bytes that go into a single stripe on one OST
> > before we switch to the next OST for another stripe. With lfs setstripe it
> > is the first of three arguments.

Ah. Ok. It's 2 MiB.

> > > > > - why are reads so slow?
> > > > No proper readahead by the backend disk?
> > > MF is set to "on", max-prefetch is set to ("x" and "1")
> >
> > I am not familiar with those options, unfortunately. Is there a simple
> > description or something like that? How large is such readahead with these
> > settings?

Ok. Our interpretation is that these settings mean readahead is enabled, up to a 64K prefetch.

The DDN site has the technical docs for the controller online. Feel free to try to interpret the settings yourself :-)

> > See, if all clients read data sequentially, 1M each, then we have 31 clients
> > competing for reads from one particular OST. Every such client has to read
> > 1MB out of every 31MB in an object that lives on this OST. It is also
> > complicated by the fact that there is no particular order those requests
> > arrive in.

But they should arrive very close to one another (all began at the same time and proceed in order through the file, tending to keep them in step) and, with the elevator seek-sort applied, they are supposed to be reordered so that the controller typically sees long, contiguous access requests -- provided the FS did an extent-based allocation, of course. You seem to indicate that it does extent-based allocation for the file-per-process case in the paragraph after the next. Are you thinking it does not do extent allocation for objects involved in the shared-file environment?

> > E.g. if we get a request for the last 1M chunk of that 31M area, the
> > readahead logic in the backend is probably going to read data forward (how
> > far forward?), but when the other requests come in they are uncached and we
> > wait for them to hit the disk. This is the single-file case, which is quite
> > favourable; in fact you can see the read speed does not descend as badly in
> > this case.

This would make sense without extent-based allocation and re-ordering. With it, though, things should work out just fine.

Paul Nowicki, in a previous message in this thread, seems to have evidence that pre-fetching at the controller might not be a good idea. Could that be contributing? Are we wasting back-end bandwidth in the RAID controller? I can see that explaining the fall-off for the file-per-process case. It doesn't explain the simply bad performance of the single-shared-file case, though. There, prefetch should be in the noise -- again, given a proper extent allocator and seek-sort algorithm.

> > For file per process there is a separate file 'object' for every process.
> > Those objects live in different (ext3) 'subdirs', and the allocator may
> > well decide to allocate them in different parts of the disk (that's what
> > the ext3 default allocator does, I think). So one client reading data most
> > likely does not prefetch any data for other clients on a first read. And if
> > readahead does not read gigabytes of data in advance for every object,
> > there is going to be some contention for the backend storage to read all
> > those multiple places in various areas of the storage (how many parallel
> > I/O streams can the disk backend sustain?)
> > For 2MB I/O the figures above need to be doubled, of course.
> >
> > Was there a reason the single-file, single-core case was measured with a 1M
> > xfer size when all the other cases were done with a 2M xfer size?

Dunno, but the ranges and shapes are similar. I'm thinking, then, there's not much difference due to this.

> > > > > - why is there a significant read dropoff?
> > > > Writes can be cached, nicely aggregated and written to disk in nice
> > > > linear chunks, while the disk backend cannot do proper readahead for
> > > > such a seemingly random 1M here, 1M there?
> > > > Was the write cache enabled?
> > > Yes. Not partitioned.
> >
> > That's likely the answer to why writes are so fast and do not drop off as
> > the number of clients goes up, as long as the data fits into cache (what we
> > can see on the first graph). It is not very clear why on the second graph
> > the write performance starts to go down after some number of clients, and I
> > guess this suggests a bottleneck in some place other than the disk backend.

I agree. NIC arbitration, maybe? I'm less concerned about the single-file-per-process case. So long as we can get *something* usable we can counsel our users. Such a thing exists for the file-per-process case. It just doesn't for the single shared file.

> > > > > - why are two cores so much slower than a single core?
> > > > Were the two cores on a single chip counted as a single client on the
> > > > graph, or as two clients? Probably the latter? Could it be some local
> > > > bus bottleneck due to increased load on the same bus/network?
> > > It's counted as 2 clients. The node architecture *is* 2 clients in this
> > > scenario. Memory is partitioned, etc. The only thing shared is the NIC.
> > > I suppose the HT is shared as well, but it is so much faster than the NIC
> > > that it would seem to need an architectural deficiency to figure in here.
> >
> > So it could be that you saturate the NIC or parts of the network? (You have
> > 2x as much traffic in the dual-core node case for the same network.)
> > Is there a way to measure this?

I can't see that from these graphs. Peak is the same, and at the same node count -- i.e., the client can drive things sufficiently whether single or dual. So it isn't the client side. As things go out further the *number* of messages injected into the network is the same, so I can't reach a place where we are calling out the network.

For the service side, it's the same NIC, receiving the same number of messages. The server is *always* single-core. My only question would be whether the same nid is used by a dual-core client and whether Lustre/lnet is doing something different. Ruth or Jim, do you know if the client node is using the same NID for both virtual machines?
Lee Ward wrote:
> Paul Nowicki, in a previous message in this thread, seems to have
> evidence that pre-fetching at the controller might not be a good idea.
> Could that be contributing? Are we wasting back-end bandwidth in the
> RAID controller? I can see that explaining the fall-off for the
> file-per-process case. It doesn't explain the simply bad performance of
> the single-shared-file case, though. There, prefetch should be in the
> noise -- again, given a proper extent allocator and seek-sort algorithm.

"Nowicki" - that's one I haven't heard before! :)

Lee, can you guys run "stats repeat=mbs" on your ddns during the single-file and multi-file tests? It would be interesting to compare the back-end to front-end bw ratios for both cases. We had one case (single-file) where the backend prefetching killed the performance, so I'm certain (at least under 2.4) that single-file usage cases can cause thrashing.

> > For file per process there is a separate file 'object' for every process.
> > Those objects live in different (ext3) 'subdirs', and the allocator may
> > well decide to allocate them in different parts of the disk (that's what
> > the ext3 default allocator does, I think). So one client reading data most
> > likely does not prefetch any data for other clients on a first read. And if
> > readahead does not read gigabytes of data in advance for every object,
> > there is going to be some contention for the backend storage to read all
> > those multiple places in various areas of the storage (how many parallel
> > I/O streams can the disk backend sustain?)

The story gets worse if prefetched data, which would have been useful, is replaced prematurely!

paul
On Wed, 22 Nov 2006, Ruth Klundt wrote:
> On Tue, 2006-11-21 at 15:34 -0700, Peter Braam wrote:
> > To validate Oleg's claim for reads, namely that they are suboptimal for
> > random 1MB reads - could the OST survey be run against the OBD filter
> > device? This only requires one DDN.
>
> Does the OST survey run for catamount clients? I'd need some pointers to
> get going on trying that. We ran obdecho once, but I haven't personally
> got an obdfilter device going on the xt3 as of yet.

I think Peter suggested running the obdsurvey on the servers. IIRC your servers are regular Linux systems, right?

There are some instructions for the lustre_survey python script in the KB:
https://bugzilla.lustre.org/show_bug.cgi?id=9383

I'm not sure it's still working though; I know there were related bugs opened, but I can't find them anymore.

Here we are still using Eric Barton's obdfilter-survey shell script, which works pretty well, but we only use it on test systems.

--
Jean-Marc Saffroy - jean-marc.saffroy@ext.bull.net
On Wed, 2006-11-22 at 22:07 +0200, Oleg Drokin wrote:
> On Wed, Nov 22, 2006 at 02:33:23PM -0500, pauln wrote:
> > Oleg, I'm not sure I entirely agree here. If every OSS was used in the
> > single-file test then from the IO hardware perspective they are
> > functionally equivalent. They run 2 OSTs on each OSS (which is probably
> > done for capacity reasons due to the small lun limitation of ext3), so a
> > test which only uses half of the OSTs could still use every OSS
> > processor, interconnect link, and fiber channel link.

We use 2 OSTs per OSS in order to actively use both channels of the HBA -- staying away from channel bonding.

> I have no such information. My assumption was that every OST had its own
> disk backend, but even if not, we have no idea in what order the OSTs were
> created. If they were created such that on oss0 we have ost0 & ost1 and so
> on, what I describe is still correct too.
>
> If somehow the number of physical disk backends & OSSes used in both cases
> is the same, other things might have come into play too, like fewer
> spindles used by the underlying disk backend due to partitioning?

They are partitioned. However, during this test the other file systems were idle.

Each channel on an HBA does, effectively, have its own string of disk -- DDN calls them "tiers". There are 4 OSSes per DDN "couplet" (controller pair). There are 8 "tiers" per DDN.

Lustre is configured so that two consecutive OSTs are on the same OSS.
Lee Ward wrote:
> We use 2 OSTs per OSS in order to actively use both channels of the HBA
> -- staying away from channel bonding.

So that limits (currently) your bandwidth to 50% of what is available. Making an OST volume through LVM can give you full bandwidth, without channel bonding. I think we have just established that using LVM for RAID0 will not have an impact on performance.
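A sketch of the kind of LVM aggregation being suggested here: combining the two LUNs an OSS sees into one striped (RAID0-style) volume that is then formatted as a single OST. The device names, volume names, sizes and stripe size below are purely illustrative:

    # /dev/sdb and /dev/sdc stand in for the two FC LUNs visible to one OSS
    pvcreate /dev/sdb /dev/sdc
    vgcreate ostvg /dev/sdb /dev/sdc

    # -i 2: stripe across both PVs (RAID0); -I 1024: 1 MiB stripe size (in KiB)
    lvcreate -i 2 -I 1024 -L 2T -n ost0 ostvg

    # /dev/ostvg/ost0 would then be formatted as a single OST in whatever way
    # the installed Lustre version expects.

The trade-off discussed later in the thread applies: one larger OST per OSS doubles the per-file bandwidth available without striping, but also widens the failure exposure of a single backend problem.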
On Thu, Nov 23, 2006 at 01:40:55PM +0100, Jean-Marc Saffroy wrote:
> There are some instructions for the lustre_survey python script in the KB:
> https://bugzilla.lustre.org/show_bug.cgi?id=9383
>
> I'm not sure it's still working though; I know there were related bugs
> opened, but I can't find them anymore.

obdsurvey (the python script) does not work and has been removed from current versions of the iokit.

> Here we are still using Eric Barton's obdfilter-survey shell script, which
> works pretty well, but we only use it on test systems.

This is what we currently recommend. The script is included in the iokit, and there is fairly complete documentation on using it in the Lustre manual. Like obdsurvey, it is nondestructive.

IO Kit: https://downloads.clusterfs.com/customer/lustre-iokit/
Manual: http://lustre.org/manual.html

Cheers, Jody
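For anyone trying this, a rough sketch of an obdfilter-survey run on an OSS. The tunable names below (size, nobjlo/nobjhi, thrlo/thrhi) follow the iokit documentation, but they and the way target OSTs are selected vary between iokit releases, so treat this as a guess and check the script header and the manual section mentioned above first:

    # Non-destructive survey of the local obdfilter (OST) devices on one OSS.
    # size: amount of data per test (MB); nobj*/thr*: ranges of object and
    # thread counts to sweep.
    size=1024 nobjlo=1 nobjhi=32 thrlo=1 thrhi=64 sh ./obdfilter-survey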
On Thu, 2006-11-23 at 12:46 -0700, Peter Braam wrote:
> > We use 2 OSTs per OSS in order to actively use both channels of the HBA
> > -- staying away from channel bonding.
>
> So that limits (currently) your bandwidth to 50% of what is available.
> Making an OST volume through LVM can give you full bandwidth, without
> channel bonding. I think we have just established that using LVM for
> RAID0 will not have an impact on performance.

Huh? In the aggregate, all the performance is there. The graphs reflect that. No way could we get 40 GB/s using only half of what the attached disk is capable of.

In the singular, for file-per-process without any striping, I would agree. But we are happy with what we have there, I think.

Also, as you are probably aware, we have been having some reliability issues with the DDN controllers. Striping on a single host also brings a greater exposure, then. The arrangement we have tends to mitigate that exposure for smaller jobs. For the larger ones, well, there's little we can think of to do, really. It's just always a factor. Things are getting better, though. Maybe someday we can consider aggregating using LVM.
On Nov 23, 2006 14:37 -0700, Lee Ward wrote:
> On Thu, 2006-11-23 at 12:46 -0700, Peter Braam wrote:
> > > We use 2 OSTs per OSS in order to actively use both channels of the HBA
> > > -- staying away from channel bonding.
> >
> > So that limits (currently) your bandwidth to 50% of what is available.
> > Making an OST volume through LVM can give you full bandwidth, without
> > channel bonding. I think we have just established that using LVM for
> > RAID0 will not have an impact on performance.
>
> Huh? In the aggregate, all the performance is there. The graphs reflect
> that. No way could we get 40 GB/s using only half of what the attached
> disk is capable of.

I think to clarify the previous comments:
- for FPP jobs, files are striped over 1 OST, but all 320 OSTs (hence 320 DDN tiers, 320 FC controllers) are in use because there are multiple files
- for SSF jobs, the one file is striped over at most 160 OSTs (this is a current Lustre limit for a single file), so at most 160 DDN tiers and 160 FC controllers are in use

This difference is why Oleg previously mentioned the "explained 2x difference" in the tests.

For an apples-to-apples comparison, it would be possible to deactivate 160 OSTs (one per OSS) on the MDS via "for N in ... ;do lctl --device N deactivate;done" and then run the SSF and FPP jobs again. This will limit the FPP jobs to 160 OSTs (like the SSF). It might also be useful to disable 161 OSTs (leaving 159 active) to avoid aliasing in the SSF case, or alternately have clients each write 7MB chunk sizes or something.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
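Spelled out, the deactivation loop Andreas sketches might look like the following. Only the lctl deactivate command comes from his message; the device-number range is a placeholder and would be taken from the 'lctl dl' listing on the MDS for the OSCs of the OSTs to be disabled:

    # On the MDS: deactivate one OST per OSS so FPP jobs see only 160 OSTs,
    # matching the single-shared-file case. The range below is a placeholder --
    # substitute the real device numbers reported by 'lctl dl'.
    for N in $(seq 10 2 328); do
        lctl --device $N deactivate
    done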
I'd just like to take one step back here for the reading problem - let's
keep it simple.

When we worked with DDN 8500 controllers we established that with something
like 10 reading threads reading 1MB chunks randomly from OSTs we achieved
the full bandwidth of the array. Read-ahead probably only damages things at
this point (but is very important for reading smaller files).

There are many users of the 9500 SATA arrays now, and it surprises me that
we don't have the OST survey graphs for these arrays. I believe that someone
from our staff is working with Sandia to get one done. This should show us
the parameters to get full bandwidth at least when the application reads at
least X MB, where X is hopefully still 1 (if not, we have a nasty networking
problem on our hands).

I would say that "playing" with the parameters (DDN settings, Lustre
read-ahead, etc.) is almost certain to make things worse in these use cases,
and to be highly confusing. Perhaps we should disable at least the Lustre
read-ahead stuff when we have the minimum sizes to reach full bandwidth.

- peter -
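[Editor's note: for concreteness, a sketch of what "disabling the Lustre read-ahead stuff" usually means on a Linux client of that era. The proc path and tunable name are assumptions based on the 1.x llite interface; liblustre clients on the Red Storm compute nodes may not expose them at all, so verify against the installed release.]

    # Inspect the current client read-ahead limit (in MB) on a Linux client
    cat /proc/fs/lustre/llite/*/max_read_ahead_mb

    # Disable client read-ahead entirely for the duration of the test
    for f in /proc/fs/lustre/llite/*/max_read_ahead_mb; do
        echo 0 > $f
    done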
Peter Braam wrote:
> I'd just like to take one step back here for the reading problem - let's
> keep it simple.
>
> When we worked with DDN 8500 controllers we established that with
> something like 10 reading threads reading 1MB chunks randomly from
> OSTs we achieved the full bandwidth of the array.
> Read-ahead probably only damages things at this point (but is very
> important for reading smaller files).

Is testing with only 10 threads truly representative of the situation? I
recall from a previous message that they have 30 threads per OST, so on a
DDN couplet that's 30*8 threads. I would assume that such a workload would
greatly randomize the I/O from the perspective of the DDN.

> There are many users of the 9500 SATA arrays now, and it surprises me
> that we don't have the OST survey graphs for these arrays. I believe
> that someone from our staff is working with Sandia to get one done.
> This should show us the parameters to get full bandwidth at least when
> the application reads at least X MB, where X is hopefully still 1 (if
> not, we have a nasty networking problem on our hands).
>
> I would say that "playing" with the parameters (DDN settings, Lustre
> read-ahead, etc.) is almost certain to make things worse in these use
> cases, and to be highly confusing. Perhaps we should disable at least
> the Lustre read-ahead stuff when we have the minimum sizes to reach
> full bandwidth.
>
> - peter -
pauln wrote:
> Peter Braam wrote:
>> I'd just like to take one step back here for the reading problem - let's
>> keep it simple.
>>
>> When we worked with DDN 8500 controllers we established that with
>> something like 10 reading threads reading 1MB chunks randomly from
>> OSTs we achieved the full bandwidth of the array.
>> Read-ahead probably only damages things at this point (but is very
>> important for reading smaller files).
>
> Is testing with only 10 threads truly representative of the situation?
> I recall from a previous message that they have 30 threads per OST, so
> on a DDN couplet that's 30*8 threads. I would assume that such a workload
> would greatly randomize the I/O from the perspective of the DDN.

Around 10 threads are needed to max out the array - throughput is
completely stable up to 1000 threads.

- Peter -

>> There are many users of the 9500 SATA arrays now, and it surprises me
>> that we don't have the OST survey graphs for these arrays. I believe
>> that someone from our staff is working with Sandia to get one done.
>> This should show us the parameters to get full bandwidth at least
>> when the application reads at least X MB, where X is hopefully still
>> 1 (if not, we have a nasty networking problem on our hands).
>>
>> I would say that "playing" with the parameters (DDN settings, Lustre
>> read-ahead, etc.) is almost certain to make things worse in these use
>> cases, and to be highly confusing. Perhaps we should disable at least
>> the Lustre read-ahead stuff when we have the minimum sizes to reach
>> full bandwidth.
>>
>> - peter -
On Thu, 2006-11-23 at 15:03 -0700, Andreas Dilger wrote:
> On Nov 23, 2006 14:37 -0700, Lee Ward wrote:
> > On Thu, 2006-11-23 at 12:46 -0700, Peter Braam wrote:
> > > > On Wed, 2006-11-22 at 22:07 +0200, Oleg Drokin wrote:
> > > > We use 2 OSTs per OSS in order to actively use both channels of the
> > > > HBA -- staying away from channel bonding.
> > >
> > > So that limits (currently) your bandwidth to 50% of what is available.
> > > Making an OST volume through LVM can give you full bandwidth, without
> > > channel bonding. I think we have just established that using LVM for
> > > RAID0 will not have an impact on performance.
> >
> > Huh? In the aggregate, all the performance is there. The graphs reflect
> > that. There is no way we could get 40 GB/s using only half of what the
> > attached disk is capable of.
>
> I think, to clarify the previous comments:
> - for FPP jobs, files are striped over 1 OST, but all 320 OSTs (hence 320
>   DDN tiers, 320 FC controllers) are in use because there are multiple files
> - for SSF jobs, the one file is striped over at most 160 OSTs (this is a
>   current Lustre limit for a single file), so at most 160 DDN tiers and 160
>   FC controllers are in use
>
> This difference is why Oleg previously mentioned the "explained 2x
> difference" in the tests.
>
> For an apples-to-apples comparison, it would be possible to deactivate 160
> OSTs (one per OSS) on the MDS via "for N in ... ; do lctl --device N
> deactivate; done" and then run the SSF and FPP jobs again. This will limit
> the FPP jobs to 160 OSTs (like the SSF). It might also be useful to disable
> 161 OSTs (leaving 159 active) to avoid the aliasing in the SSF case, or
> alternately to have clients each write 7MB chunk sizes or something.

For our formal testing, we have in our scripts a section that precreates
files, locating them on specific OSTs. Would that be a suitable alternative
to the reconfiguration of Lustre you suggest? What you suggest is possible,
of course. I would just rather not deviate from our process without a good
reason.

--Lee
On Thu, 2006-11-23 at 18:21 -0700, Peter Braam wrote:
> I'd just like to take one step back here for the reading problem - let's
> keep it simple.
>
> When we worked with DDN 8500 controllers we established that with
> something like 10 reading threads reading 1MB chunks randomly from OSTs
> we achieved the full bandwidth of the array.
>
> Read-ahead probably only damages things at this point (but is very
> important for reading smaller files).
>
> There are many users of the 9500 SATA arrays now, and it surprises me
> that we don't have the OST survey graphs for these arrays. I believe
> that someone from our staff is working with Sandia to get one done.
> This should show us the parameters to get full bandwidth at least when
> the application reads at least X MB, where X is hopefully still 1 (if
> not, we have a nasty networking problem on our hands).
>
> I would say that "playing" with the parameters (DDN settings, Lustre
> read-ahead, etc.) is almost certain to make things worse in these use
> cases, and to be highly confusing. Perhaps we should disable at least the
> Lustre read-ahead stuff when we have the minimum sizes to reach full
> bandwidth.

I believe we already know what to do to accomplish what you suggest.
However, could you take the time to be explicit? Better that we know than
believe...

--Lee
On Nov 23, 2006 22:39 -0700, Lee Ward wrote:
> On Thu, 2006-11-23 at 15:03 -0700, Andreas Dilger wrote:
> > For an apples-to-apples comparison, it would be possible to deactivate 160
> > OSTs (one per OSS) on the MDS via "for N in ... ; do lctl --device N
> > deactivate; done" and then run the SSF and FPP jobs again. This will limit
> > the FPP jobs to 160 OSTs (like the SSF). It might also be useful to disable
> > 161 OSTs (leaving 159 active) to avoid the aliasing in the SSF case, or
> > alternately to have clients each write 7MB chunk sizes or something.
>
> For our formal testing, we have in our scripts a section that precreates
> files, locating them on specific OSTs. Would that be a suitable alternative
> to the reconfiguration of Lustre you suggest? What you suggest is possible,
> of course. I would just rather not deviate from our process without a good
> reason.

Sure, precreating files on, say, the first 160 OSTs in a round-robin manner
for all of the client processes for FPP, and similarly forcing the one SSF
file to be created starting on ost_idx 0, will use the same 160 OSTs.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
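[Editor's note: a sketch of what that precreation could look like with lfs, as an illustration only. The mount point, file names, and client count are placeholders, and the -c/-i option spelling is taken from later lfs versions (the lfs setstripe of that era used positional stripe-size/stripe-start/stripe-count arguments), so check "lfs help setstripe" on the installed release.]

    MNT=/mnt/lustre/iortest        # placeholder run directory

    # FPP: one file per client rank, single stripe each, walking the first
    # 160 OST indices round-robin so FPP and SSF touch the same OSTs.
    for rank in $(seq 0 9999); do
        lfs setstripe -c 1 -i $((rank % 160)) $MNT/fpp.$rank
    done

    # SSF: one shared file striped across those same 160 OSTs, starting at
    # OST index 0.
    lfs setstripe -c 160 -i 0 $MNT/shared_file

[The benchmark then has to open the pre-created files rather than recreating them, so the layouts set here actually take effect.]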
[Non-text attachments scrubbed from this message: ddn.xls (application/vnd.ms-excel, 179200 bytes) and iographs.xls (application/vnd.ms-excel, 168448 bytes). Archived copies:
http://mail.clusterfs.com/pipermail/lustre-devel/attachments/20061124/6902c6fb/ddn-0001.xls
http://mail.clusterfs.com/pipermail/lustre-devel/attachments/20061124/6902c6fb/iographs-0001.xls ]
I don't really fancy this suggestion that much - you are proposing an
apples-to-apples comparison at half the performance.

With a proper LVM configuration we could get full performance. Yes, that
exposes one to more HBA/DDN vulnerabilities, but not more than a higher
stripe_count would do. I think this is the way to get full bandwidth on
widely striped shared files.

- Peter -

Andreas Dilger wrote:
> On Nov 23, 2006 22:39 -0700, Lee Ward wrote:
> > On Thu, 2006-11-23 at 15:03 -0700, Andreas Dilger wrote:
> > > For an apples-to-apples comparison, it would be possible to deactivate
> > > 160 OSTs (one per OSS) on the MDS via "for N in ... ; do lctl --device N
> > > deactivate; done" and then run the SSF and FPP jobs again. This will
> > > limit the FPP jobs to 160 OSTs (like the SSF). It might also be useful
> > > to disable 161 OSTs (leaving 159 active) to avoid the aliasing in the
> > > SSF case, or alternately to have clients each write 7MB chunk sizes or
> > > something.
> >
> > For our formal testing, we have in our scripts a section that precreates
> > files, locating them on specific OSTs. Would that be a suitable
> > alternative to the reconfiguration of Lustre you suggest? What you
> > suggest is possible, of course. I would just rather not deviate from our
> > process without a good reason.
>
> Sure, precreating files on, say, the first 160 OSTs in a round-robin manner
> for all of the client processes for FPP, and similarly forcing the one SSF
> file to be created starting on ost_idx 0, will use the same 160 OSTs.
Hi Lee,

Peter Braam advises that you find our IOkit documentation inadequate. Would
you please let me know what specific problems, uncertainties, or questions
you have? The most up-to-date obdfilter_survey instructions can be found in
section 1.2.2 of part III of the Lustre manual.

Thanks,
Jody
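[Editor's note: for readers without the manual at hand, an obdfilter survey run is driven by environment variables along roughly these lines. The variable names (size, nobjhi, thrhi, targets), the values, and the script name are assumptions based on the lustre-iokit of approximately that era, not a quote from the manual section cited above; the script header is the authoritative reference.]

    # On an OSS, survey the local OSTs directly against the disk backend.
    # size is the amount of data moved per OST (MB); nobjhi and thrhi bound
    # the object and thread counts the survey sweeps through.
    size=8192 nobjhi=4 thrhi=32 targets="ost1 ost2" ./obdfilter-survey

[Plotting throughput against thread count from such a sweep is what would show whether roughly 10 threads per OST already saturate the 9500 arrays, as Peter expects from the 8500 experience.]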
Hi Jody,

I was talking to Lee earlier, and he doesn't remember being the source of
complaints about the IOkit. I haven't run it myself, so I have no complaints
either ;)

What we did try last spring was to start obdecho and also obdfilter on an
XT3 test cluster, using some info on the wiki. The first worked, but I
couldn't get the second one going properly for more than one OST, for
reasons I now forget and which may have been user error. Given the
considerable difference in code base, I'd say that's water under the bridge
now.

When the opportunity comes up again, I'll certainly try out the kit and give
any feedback on the instructions.

Thanks much,
Ruth

-----Original Message-----
From: lustre-devel-bounces@clusterfs.com
[mailto:lustre-devel-bounces@clusterfs.com] On Behalf Of Jody McIntyre
Sent: Monday, November 27, 2006 4:28 PM
To: Ward, Lee
Cc: lustre-devel@clusterfs.com
Subject: Re: [Lustre-devel] wide striping
Hi Ruth,

On Tue, Nov 28, 2006 at 11:00:46AM -0700, Klundt, Ruth wrote:
> I was talking to Lee earlier, and he doesn't remember being the source
> of complaints about the IOkit. I haven't run it myself, so I have no
> complaints either ;)

OK, thanks for clarifying - I received my information third-hand and it
seems it was wrong.

> [...]
> When the opportunity comes up again, I'll certainly try out the kit and
> give any feedback on the instructions.

OK, thanks. We're here to help :)

Cheers,
Jody

> Thanks much,
> Ruth