Hi all,

Attached are more sgpdd-survey results, this time from a full (48 disk)
Sun Thumper using software RAID 5 with a RHEL 4 kernel. Version 3.6.3 of
the out-of-tree mv_sata driver was used, with a patch from Sun to disable
the writeback cache.

The read results are mostly consistent up to 128 threads, which is good
to see. There are a few strange points in all the 8-device graphs that
should probably be ignored - this is likely one device falling behind the
others, since the sgpdd-survey tool essentially reports the "worst-case"
aggregate bandwidth. No such points exist in the single-device survey.

One interesting result is that smaller chunk sizes than we currently
recommend can increase read performance by allowing larger request sizes.
I'm not sure why it is possible to do a 2 or 4MB request through the MD
layer with 128K chunks but not with 256K chunks. No error is shown on the
console in the failing case.

Considering the write results, we actually see higher performance with a
small chunk size (128K) no matter what request size is used, and again
the smaller chunk size allows larger requests with correspondingly higher
performance.

So far the best performance I've observed for both reads and writes is
with a 128K chunk size and 4MB requests. This request size corresponds to
a stripe size of 4MB, assuming all layers from obdfilter to the MD device
preserve the 4MB IO intact.

My next plan is to test obdfilter-survey with various chunk and request
sizes. Assuming 4MB IOs are preserved intact all the way to the MD layer,
I expect similar results. I will then test even smaller chunk sizes with
both sgpdd-survey and obdfilter-survey.

Cheers,
Jody

-------------- next part --------------
A non-text attachment was scrubbed...
Name: thumper-sgp_dd.xls
Type: application/vnd.ms-excel
Size: 183296 bytes
Desc: not available
Url : http://mail.clusterfs.com/pipermail/lustre-devel/attachments/20061214/605d6c62/thumper-sgp_dd-0001.xls
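As a rough illustration of the "worst-case" aggregate bandwidth point
above - this is an assumed model for illustration only, not sgpdd-survey's
actual accounting, and the function name and numbers are made up - one
lagging device drags the whole figure down because the run only ends when
its slowest member finishes:

    # Assumed model of "worst-case" aggregate bandwidth (illustration only,
    # not sgpdd-survey's real code): every device/region moves the same
    # amount of data, but the run only ends when the slowest device does.
    def aggregate_bandwidth_mb_s(bytes_per_device, elapsed_secs):
        total_mb = sum(bytes_per_device) / (1 << 20)
        worst_time = max(elapsed_secs)   # slowest device defines the run
        return total_mb / worst_time

    # 8 devices, 1 GiB each; one device takes twice as long as the rest.
    sizes = [1 << 30] * 8
    times = [10.0] * 7 + [20.0]
    print(aggregate_bandwidth_mb_s(sizes, times))
    # ~409.6 MB/s, versus 819.2 MB/s if all devices had kept pace

A single slow device in an 8-device run can therefore produce the isolated
low points seen in the 8-device graphs without any of the other devices
misbehaving.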
Jody,

> One interesting result is that smaller chunk sizes than we currently
> recommend can increase read performance by allowing larger request
> sizes. I'm not sure why it is possible to do a 2 or 4MB request
> through the MD layer with 128K chunks but not with 256K chunks. No
> error is shown on the console in the failing case.

What is "chunk size"?

> Considering the write results, we actually see higher performance with
> small chunk sizes (128K) no matter what request size is used, and again
> the smaller chunk size allows larger requests with correspondingly
> higher performance.
>
> So far the best performance for both reads and writes I've observed is
> with a 128K chunk size and 4MB requests. This request size corresponds
> to a stripe size of 4MB, assuming all layers from obdfilter to the MD
> device preserve the 4MB IO intact.

What is the page size of this machine? There are issues with
scatter/gather descriptors larger than 1 page which you might be running
into.

> My next plan is to test obdfilter-survey with various chunk and request
> sizes. Assuming 4MB IOs are preserved intact all the way to the MD
> layer, I expect similar results. I will then test even smaller chunk
> sizes with both sgpdd-survey and obdfilter-survey.

By default, the maximum request size issued by clients is 1MByte, which
is the LNET MTU. It's possible to build Lustre with a larger maximum I/O
size that can be exploited in special cases (e.g. no routing, page size
on client and server > 4K, and the LND supports it). Supporting these in
the general case requires changes in Lustre (e.g. multiple bulk transfers
in a single RPC).

Cheers,
Eric
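For reference on Eric's page-size question, the page counts involved are
easy to work out. The sketch below is plain arithmetic under the
assumption of one scatter/gather entry per 4KB page; it says nothing
about any particular LND or driver's actual descriptor limits:

    # Back-of-the-envelope page counts (assuming one scatter/gather entry
    # per 4KB page; real descriptor limits depend on the driver and LND).
    PAGE_SIZE = 4 * 1024
    for req_kb in (1024, 2048, 4096):          # 1MB, 2MB, 4MB requests
        pages = (req_kb * 1024) // PAGE_SIZE
        print("%4dKB request -> %4d pages" % (req_kb, pages))
    # 1MB -> 256 pages, 2MB -> 512 pages, 4MB -> 1024 pages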
Hi Eric,

On Fri, Dec 15, 2006 at 11:47:40AM -0000, Eric Barton wrote:

> > One interesting result is that smaller chunk sizes than we currently
> > recommend can increase read performance by allowing larger request
> > sizes. I'm not sure why it is possible to do a 2 or 4MB request
> > through the MD layer with 128K chunks but not with 256K chunks. No
> > error is shown on the console in the failing case.
>
> What is "chunk size"?

Chunk size is a RAID parameter, applicable to at least RAID 5. It is the
size of the individual pieces written to each disk. For example, if
Lustre does a 1MB write on a 4+1 array with a 256K chunk size, the write
will be broken up into 4 256K data chunks, and a further 256K parity
chunk will be computed and written (assuming writes are aligned on chunk
boundaries). We currently recommend selecting chunk size and stripe size
such that (stripe size) = (chunk size) * (# active disks).

One thought I had on the smaller chunk size is that perhaps the 128K IOs
end up being merged by Alex's raid5-merge-ios.patch, so the disks
actually see larger IOs and therefore perform well. Alex, is this
possible? Do you have any other theories on why smaller chunk sizes than
we recommend actually produce better performance?

> > So far the best performance for both reads and writes I've observed is
> > with a 128K chunk size and 4MB requests. This request size corresponds
> > to a stripe size of 4MB, assuming all layers from obdfilter to the MD
> > device preserve the 4MB IO intact.
>
> What is the page size of this machine? There are issues with
> scatter/gather descriptors larger than 1 page which you might be
> running into.

It's x86_64, so 4KB. I don't think scatter/gather descriptors come into
play at this point, since the 4MB write will be broken up into smaller
writes by the MD layer.

> By default, the maximum request size issued by clients is 1MByte, which
> is the LNET MTU. It's possible to build Lustre with a larger maximum
> I/O size that can be exploited in special cases (e.g. no routing, page
> size on client and server > 4K, and the LND supports it). Supporting
> these in the general case requires changes in Lustre (e.g. multiple
> bulk transfers in a single RPC).

What about echo clients talking directly to a local obdfilter? That is
what I am attempting to survey. I realize Lustre in general does not
support IOs larger than 1MB; part of the point of my current work is to
see if 4MB IOs are something we need to support in more general cases.

Cheers,
Jody
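For concreteness, here is a minimal sketch of the chunk/stripe arithmetic
Jody describes, using the 4+1 geometry from his example. It is a toy
model of aligned full-stripe writes only; parity rotation and the MD
driver's real logic are not represented:

    # Toy model of the example above: a full-stripe-aligned write is split
    # into chunk-sized pieces across the data disks, plus one parity chunk
    # per stripe.  Illustration only, not the MD driver's actual code.
    def split_full_stripe_write(request_bytes, chunk_bytes, data_disks):
        stripe_bytes = chunk_bytes * data_disks   # recommended request size
        assert request_bytes % stripe_bytes == 0, "not a full-stripe write"
        stripes = request_bytes // stripe_bytes
        # returns (stripe size, data chunks written, parity chunks written)
        return stripe_bytes, stripes * data_disks, stripes

    KB = 1024
    print(split_full_stripe_write(1024 * KB, 256 * KB, 4))
    # (1048576, 4, 1): Jody's 1MB example - 4 data chunks plus 1 parity
    print(split_full_stripe_write(1024 * KB, 128 * KB, 4))
    # (524288, 8, 2): with 128K chunks the same 1MB write covers 2 stripes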
On Dec 15, 2006  10:49 -0500, Jody McIntyre wrote:

> > > One interesting result is that smaller chunk sizes than we currently
> > > recommend can increase read performance by allowing larger request
> > > sizes. I'm not sure why it is possible to do a 2 or 4MB request
> > > through the MD layer with 128K chunks but not with 256K chunks. No
> > > error is shown on the console in the failing case.
>
> One thought I had on the smaller chunk size is that perhaps the 128K
> IOs end up being merged by Alex's raid5-merge-ios.patch, so the disks
> actually see larger IOs and therefore perform well. Alex, is this
> possible? Do you have any other theories on why smaller chunk sizes
> than we recommend actually produce better performance?

To be honest, if smaller chunk sizes perform better, then that is the
best of both worlds. It means that smaller IOs will have to do fewer or
smaller read-modify-write operations on the RAID5 parity.

> It's x86_64, so 4KB. I don't think scatter/gather descriptors come into
> play at this point, since the 4MB write will be broken up into smaller
> writes by the MD layer.

Hmm, though wouldn't the incoming request itself have to be put into a
bio in order to be submitted? There are patches in bug 9945 that Bull has
been testing in this area.

> > By default, the maximum request size issued by clients is 1MByte,
> > which is the LNET MTU. It's possible to build Lustre with a larger
> > maximum I/O size that can be exploited in special cases (e.g. no
> > routing, page size on client and server > 4K, and the LND supports
> > it). Supporting these in the general case requires changes in Lustre
> > (e.g. multiple bulk transfers in a single RPC).
>
> What about echo clients talking directly to a local obdfilter? That is
> what I am attempting to survey. I realize Lustre in general does not
> support IOs larger than 1MB; part of the point of my current work is to
> see if 4MB IOs are something we need to support in more general cases.

You would need to compile Lustre with an LNET_MTU of 4MB (to get the same
sized PTLRPC_MAX_BRW_SIZE) so that the filter allocates a large enough
pool for the iobuf.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
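To illustrate Andreas' read-modify-write point, here is a deliberately
simplified cost model. It is an assumption made for illustration, not how
the MD RAID5 code actually chooses between parity update strategies:
writes that cover whole stripes need no extra reads, and smaller chunks
mean smaller stripes, so more writes hit that cheap case.

    # Simplified RAID5 write-cost model (illustration only): a write that
    # covers a whole stripe can compute parity from the new data alone,
    # while a partial-stripe write must first read the untouched data in
    # the stripe before the new parity can be computed.
    def extra_read_kb(write_bytes, chunk_bytes, data_disks):
        stripe = chunk_bytes * data_disks
        partial = write_bytes % stripe
        if partial == 0:
            return 0                       # full-stripe write: no reads
        return (stripe - partial) // 1024  # untouched data to read first

    KB = 1024
    for chunk_kb in (128, 256):
        print(chunk_kb, extra_read_kb(512 * KB, chunk_kb * KB, 4))
    # 128K chunks: a 512K write is one full stripe -> 0KB of extra reads
    # 256K chunks: the same write is half a stripe -> 512KB of extra reads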
Hi Andreas,

On Fri, Dec 15, 2006 at 03:10:57PM -0700, Andreas Dilger wrote:

> > It's x86_64, so 4KB. I don't think scatter/gather descriptors come
> > into play at this point, since the 4MB write will be broken up into
> > smaller writes by the MD layer.
>
> Hmm, though wouldn't the incoming request itself have to be put into a
> bio in order to be submitted? There are patches in bug 9945 that Bull
> has been testing in this area.

I really don't know what goes on when using MD. It does seem logical that
bios would be used at the input side of the MD device. Alex, can you
comment on this?

All I know is that 2 and 4MB IOs work with a chunk size of 128KB but not
with a chunk size of 256KB, and no error message is printed when it
doesn't work.

Cheers,
Jody
On Thu, Dec 14, 2006 at 04:34:06PM -0500, Jody McIntyre wrote:

> My next plan is to test obdfilter-survey with various chunk and request
> sizes. Assuming 4MB IOs are preserved intact all the way to the MD
> layer, I expect similar results. I will then test even smaller chunk
> sizes with both sgpdd-survey and obdfilter-survey.

The obdfilter-surveys are still in progress, but attached is a revised
sgpdd-survey spreadsheet. As you can see, the 64K chunk size is worse,
sometimes significantly, than the 128K chunk size, except for 1MB writes
(which are slightly better with the 64K chunk size).

Therefore a 128K chunk size is basically the sweet spot for a RAID 5
array with 5 active devices on Thumper hardware.

Cheers,
Jody

-------------- next part --------------
A non-text attachment was scrubbed...
Name: thumper-sgp_dd.xls
Type: application/vnd.ms-excel
Size: 269312 bytes
Desc: not available
Url : http://mail.clusterfs.com/pipermail/lustre-devel/attachments/20061229/852f5e76/thumper-sgp_dd-0001.xls
Jody,

A great set of results.

Have you found any parameters that make all the series (threads/regions)
collapse into a single line? If that can't happen, then different
workloads will have quite different performance - and the more
"interesting" the graphs, the more performance will vary.

BTW, did you want the graphs to have different scales? I've found that if
you set them all the same, it makes for easier comparison by eye.

Cheers,
Eric

---------------------------------------------------
|Eric Barton        Barton Software                |
|9 York Gardens     Tel:    +44 (117) 330 1575     |
|Clifton            Mobile: +44 (7909) 680 356     |
|Bristol BS8 4LL    Fax:    call first             |
|United Kingdom     E-Mail: eeb@bartonsoftware.com |
---------------------------------------------------
Hi Eric,

On Tue, Jan 02, 2007 at 02:57:01PM -0000, Eric Barton wrote:

> Have you found any parameters that make all the series
> (threads/regions) collapse into a single line? If that
> can't happen, then different workloads will have quite
> different performance - and the more "interesting" the
> graphs, the more performance will vary.

No, I haven't, unfortunately.

> BTW, did you want the graphs to have different scales?
> I've found that if you set them all the same, it makes
> for easier comparison by eye.

Most of the scales are the same for similar region sizes, so you can
compare graphs when reading across the page. As you know, it's an easy
change to make if you have different preferences.

Cheers,
Jody
I also found these ClusterFS performance results in the archives. Can you
provide details on the host configuration (processor, OS, storage
connection) that was used in these tests?

Thx,
Tim

-----Original Message-----
From: lustre-devel-bounces@clusterfs.com
[mailto:lustre-devel-bounces@clusterfs.com] On Behalf Of Jody McIntyre
Sent: Friday, December 29, 2006 4:10 PM
To: lustre-devel@clusterfs.com
Cc: bzzz@clusterfs.com
Subject: [Lustre-devel] Re: sgpdd-survey of Sun Thumper

On Thu, Dec 14, 2006 at 04:34:06PM -0500, Jody McIntyre wrote:

> My next plan is to test obdfilter-survey with various chunk and request
> sizes. Assuming 4MB IOs are preserved intact all the way to the MD
> layer, I expect similar results. I will then test even smaller chunk
> sizes with both sgpdd-survey and obdfilter-survey.

The obdfilter-surveys are still in progress, but attached is a revised
sgpdd-survey spreadsheet. As you can see, the 64K chunk size is worse,
sometimes significantly, than the 128K chunk size, except for 1MB writes
(which are slightly better with the 64K chunk size).

Therefore a 128K chunk size is basically the sweet spot for a RAID 5
array with 5 active devices on Thumper hardware.

Cheers,
Jody
Hi Tim,

On Thu, Jul 05, 2007 at 07:09:38AM -0600, Snider, Tim wrote:

> I also found these ClusterFS performance results in the archives.
> Can you provide details on the host configuration (processor, OS,
> storage connection) that was used in these tests?

This node is a Sun Thumper, AKA Sun Fire X4500. Specifications are here:
http://www.sun.com/servers/x64/x4500/specs.xml

The OS used was RHEL 4, and the storage was the Thumper's builtin
storage, which is connected via SATA.

Cheers,
Jody