Hi all,

Attached are more sgpdd-survey results, this time from a full (48 disk)
Sun Thumper using software RAID 5 with a RHEL 4 kernel. Version 3.6.3 of
the out-of-tree mv_sata driver was used, with a patch from Sun to disable
the writeback cache.

The read results are mostly consistent up to 128 threads, which is good
to see. There are a few strange points in all the 8-device graphs that
should probably be ignored - this is likely one device falling behind the
others, since the sgpdd-survey tool essentially reports the "worst-case"
aggregate bandwidth. No such points exist in the single-device survey.

One interesting result is that smaller chunk sizes than we currently
recommend can increase read performance by allowing larger request sizes.
I'm not sure why it is possible to do a 2 or 4MB request through the MD
layer with 128K chunks but not with 256K chunks. No error is shown on the
console in the failing case.

Considering the write results, we actually see higher performance with a
small chunk size (128K) no matter what request size is used, and again
the smaller chunk size allows larger requests with correspondingly higher
performance.

So far the best performance I've observed for both reads and writes is
with a 128K chunk size and 4MB requests. This request size corresponds to
a stripe size of 4MB, assuming all layers from obdfilter to the MD device
preserve the 4MB IO intact.

My next plan is to test obdfilter-survey with various chunk and request
sizes. Assuming 4MB IOs are preserved intact all the way to the MD layer,
I expect similar results. I will then test even smaller chunk sizes with
both sgpdd-survey and obdfilter-survey.

Cheers,
Jody

-------------- next part --------------
A non-text attachment was scrubbed...
Name: thumper-sgp_dd.xls
Type: application/vnd.ms-excel
Size: 183296 bytes
Desc: not available
Url : http://mail.clusterfs.com/pipermail/lustre-devel/attachments/20061214/605d6c62/thumper-sgp_dd-0001.xls
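As a rough illustration of the "worst-case" aggregate bandwidth point
above - this is an assumed model for illustration only, not sgpdd-survey's
actual accounting, and the function name and numbers are made up - one
lagging device drags the whole figure down because the run only ends when
its slowest member finishes:

    # Assumed model of "worst-case" aggregate bandwidth (illustration only,
    # not sgpdd-survey's real code): every device/region moves the same
    # amount of data, but the run only ends when the slowest device does.
    def aggregate_bandwidth_mb_s(bytes_per_device, elapsed_secs):
        total_mb = sum(bytes_per_device) / (1 << 20)
        worst_time = max(elapsed_secs)   # slowest device defines the run
        return total_mb / worst_time

    # 8 devices, 1 GiB each; one device takes twice as long as the rest.
    sizes = [1 << 30] * 8
    times = [10.0] * 7 + [20.0]
    print(aggregate_bandwidth_mb_s(sizes, times))
    # ~409.6 MB/s, versus 819.2 MB/s if all devices had kept pace

A single slow device in an 8-device run can therefore produce the isolated
low points seen in the 8-device graphs without any of the other devices
misbehaving.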
Jody,

> One interesting result is that smaller chunk sizes than we currently
> recommend can increase read performance by allowing larger request
> sizes. I'm not sure why it is possible to do a 2 or 4MB request
> through the MD layer with 128K chunks but not with 256K chunks. No
> error is shown on the console in the failing case.

What is "chunk size"?

> Considering the write results, we actually see higher performance with
> small chunk sizes (128K) no matter what request size is used, and again
> the smaller chunk size allows larger requests with correspondingly
> higher performance.
>
> So far the best performance for both reads and writes I've observed is
> with a 128K chunk size and 4MB requests. This request size corresponds
> to a stripe size of 4MB, assuming all layers from obdfilter to the MD
> device preserve the 4MB IO intact.

What is the page size of this machine? There are issues with
scatter/gather descriptors larger than 1 page which you might be running
into.

> My next plan is to test obdfilter-survey with various chunk and request
> sizes. Assuming 4MB IOs are preserved intact all the way to the MD
> layer, I expect similar results. I will then test even smaller chunk
> sizes with both sgpdd-survey and obdfilter-survey.

By default, the maximum request size issued by clients is 1MByte, which
is the LNET MTU. It's possible to build Lustre with a larger maximum I/O
size that can be exploited in special cases (e.g. no routing, page size
on client and server > 4K, and the LND supports it). Supporting these in
the general case requires changes in Lustre (e.g. multiple bulk transfers
in a single RPC).

Cheers,
Eric
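For reference on Eric's page-size question, the page counts involved are
easy to work out. The sketch below is plain arithmetic under the
assumption of one scatter/gather entry per 4KB page; it says nothing
about any particular LND or driver's actual descriptor limits:

    # Back-of-the-envelope page counts (assuming one scatter/gather entry
    # per 4KB page; real descriptor limits depend on the driver and LND).
    PAGE_SIZE = 4 * 1024
    for req_kb in (1024, 2048, 4096):          # 1MB, 2MB, 4MB requests
        pages = (req_kb * 1024) // PAGE_SIZE
        print("%4dKB request -> %4d pages" % (req_kb, pages))
    # 1MB -> 256 pages, 2MB -> 512 pages, 4MB -> 1024 pages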
Hi Eric,

On Fri, Dec 15, 2006 at 11:47:40AM -0000, Eric Barton wrote:

> > One interesting result is that smaller chunk sizes than we currently
> > recommend can increase read performance by allowing larger request
> > sizes. I'm not sure why it is possible to do a 2 or 4MB request
> > through the MD layer with 128K chunks but not with 256K chunks. No
> > error is shown on the console in the failing case.
>
> What is "chunk size"?

Chunk size is a RAID parameter, applicable to at least RAID 5. It is the
size of the individual pieces written to each disk. For example, if
Lustre does a 1MB write on a 4+1 array with a 256K chunk size, the write
will be broken up into 4 256K data chunks, and a further 256K parity
chunk will be computed and written (assuming writes are aligned on chunk
boundaries). We currently recommend selecting chunk size and stripe size
such that (stripe size) = (chunk size) * (# active disks).

One thought I had on the smaller chunk size is that perhaps the 128K IOs
end up being merged by Alex's raid5-merge-ios.patch, so the disks
actually see larger IOs and therefore perform well. Alex, is this
possible? Do you have any other theories on why smaller chunk sizes than
we recommend actually produce better performance?

> > So far the best performance for both reads and writes I've observed is
> > with a 128K chunk size and 4MB requests. This request size corresponds
> > to a stripe size of 4MB, assuming all layers from obdfilter to the MD
> > device preserve the 4MB IO intact.
>
> What is the page size of this machine? There are issues with
> scatter/gather descriptors larger than 1 page which you might be
> running into.

It's x86_64, so 4KB. I don't think scatter/gather descriptors come into
play at this point, since the 4MB write will be broken up into smaller
writes by the MD layer.

> By default, the maximum request size issued by clients is 1MByte, which
> is the LNET MTU. It's possible to build Lustre with a larger maximum
> I/O size that can be exploited in special cases (e.g. no routing, page
> size on client and server > 4K, and the LND supports it). Supporting
> these in the general case requires changes in Lustre (e.g. multiple
> bulk transfers in a single RPC).

What about echo clients talking directly to a local obdfilter? That is
what I am attempting to survey. I realize Lustre in general does not
support IOs larger than 1MB; part of the point of my current work is to
see if 4MB IOs are something we need to support in more general cases.

Cheers,
Jody
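For concreteness, here is a minimal sketch of the chunk/stripe arithmetic
Jody describes, using the 4+1 geometry from his example. It is a toy
model of aligned full-stripe writes only; parity rotation and the MD
driver's real logic are not represented:

    # Toy model of the example above: a full-stripe-aligned write is split
    # into chunk-sized pieces across the data disks, plus one parity chunk
    # per stripe.  Illustration only, not the MD driver's actual code.
    def split_full_stripe_write(request_bytes, chunk_bytes, data_disks):
        stripe_bytes = chunk_bytes * data_disks   # recommended request size
        assert request_bytes % stripe_bytes == 0, "not a full-stripe write"
        stripes = request_bytes // stripe_bytes
        # returns (stripe size, data chunks written, parity chunks written)
        return stripe_bytes, stripes * data_disks, stripes

    KB = 1024
    print(split_full_stripe_write(1024 * KB, 256 * KB, 4))
    # (1048576, 4, 1): Jody's 1MB example - 4 data chunks plus 1 parity
    print(split_full_stripe_write(1024 * KB, 128 * KB, 4))
    # (524288, 8, 2): with 128K chunks the same 1MB write covers 2 stripes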
On Dec 15, 2006  10:49 -0500, Jody McIntyre wrote:

> > > One interesting result is that smaller chunk sizes than we currently
> > > recommend can increase read performance by allowing larger request
> > > sizes. I'm not sure why it is possible to do a 2 or 4MB request
> > > through the MD layer with 128K chunks but not with 256K chunks. No
> > > error is shown on the console in the failing case.
>
> One thought I had on the smaller chunk size is that perhaps the 128K
> IOs end up being merged by Alex's raid5-merge-ios.patch, so the disks
> actually see larger IOs and therefore perform well. Alex, is this
> possible? Do you have any other theories on why smaller chunk sizes
> than we recommend actually produce better performance?

To be honest, if smaller chunk sizes perform better, then that is the
best of both worlds. It means that smaller IOs will have to do fewer or
smaller read-modify-write operations on the RAID5 parity.

> It's x86_64, so 4KB. I don't think scatter/gather descriptors come into
> play at this point, since the 4MB write will be broken up into smaller
> writes by the MD layer.

Hmm, though wouldn't the incoming request itself have to be put into a
bio in order to be submitted? There are patches in bug 9945 that Bull has
been testing in this area.

> > By default, the maximum request size issued by clients is 1MByte,
> > which is the LNET MTU. It's possible to build Lustre with a larger
> > maximum I/O size that can be exploited in special cases (e.g. no
> > routing, page size on client and server > 4K, and the LND supports
> > it). Supporting these in the general case requires changes in Lustre
> > (e.g. multiple bulk transfers in a single RPC).
>
> What about echo clients talking directly to a local obdfilter? That is
> what I am attempting to survey. I realize Lustre in general does not
> support IOs larger than 1MB; part of the point of my current work is to
> see if 4MB IOs are something we need to support in more general cases.

You would need to compile Lustre with an LNET_MTU of 4MB (to get the same
sized PTLRPC_MAX_BRW_SIZE) so that the filter allocates a large enough
pool for the iobuf.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
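To illustrate Andreas' read-modify-write point, here is a deliberately
simplified cost model. It is an assumption made for illustration, not how
the MD RAID5 code actually chooses between parity update strategies:
writes that cover whole stripes need no extra reads, and smaller chunks
mean smaller stripes, so more writes hit that cheap case.

    # Simplified RAID5 write-cost model (illustration only): a write that
    # covers a whole stripe can compute parity from the new data alone,
    # while a partial-stripe write must first read the untouched data in
    # the stripe before the new parity can be computed.
    def extra_read_kb(write_bytes, chunk_bytes, data_disks):
        stripe = chunk_bytes * data_disks
        partial = write_bytes % stripe
        if partial == 0:
            return 0                       # full-stripe write: no reads
        return (stripe - partial) // 1024  # untouched data to read first

    KB = 1024
    for chunk_kb in (128, 256):
        print(chunk_kb, extra_read_kb(512 * KB, chunk_kb * KB, 4))
    # 128K chunks: a 512K write is one full stripe -> 0KB of extra reads
    # 256K chunks: the same write is half a stripe -> 512KB of extra reads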
Hi Andreas,

On Fri, Dec 15, 2006 at 03:10:57PM -0700, Andreas Dilger wrote:

> > It's x86_64, so 4KB. I don't think scatter/gather descriptors come
> > into play at this point, since the 4MB write will be broken up into
> > smaller writes by the MD layer.
>
> Hmm, though wouldn't the incoming request itself have to be put into a
> bio in order to be submitted? There are patches in bug 9945 that Bull
> has been testing in this area.

I really don't know what goes on when using MD. It does seem logical that
bios would be used at the input side of the MD device. Alex, can you
comment on this?

All I know is that 2 and 4MB IOs work with a chunk size of 128KB but not
with a chunk size of 256KB, and no error message is printed when it
doesn't work.

Cheers,
Jody
On Thu, Dec 14, 2006 at 04:34:06PM -0500, Jody McIntyre wrote:

> My next plan is to test obdfilter-survey with various chunk and request
> sizes. Assuming 4MB IOs are preserved intact all the way to the MD
> layer, I expect similar results. I will then test even smaller chunk
> sizes with both sgpdd-survey and obdfilter-survey.

The obdfilter-surveys are still in progress, but attached is a revised
sgpdd-survey spreadsheet. As you can see, the 64K chunk size is worse,
sometimes significantly, than the 128K chunk size, except for 1MB writes
(which are slightly better with the 64K chunk size).

Therefore a 128K chunk size is basically the sweet spot for a RAID 5
array with 5 active devices on Thumper hardware.

Cheers,
Jody

-------------- next part --------------
A non-text attachment was scrubbed...
Name: thumper-sgp_dd.xls
Type: application/vnd.ms-excel
Size: 269312 bytes
Desc: not available
Url : http://mail.clusterfs.com/pipermail/lustre-devel/attachments/20061229/852f5e76/thumper-sgp_dd-0001.xls
Jody,

A great set of results.

Have you found any parameters that make all the series (threads/regions)
collapse into a single line? If that can't happen, then different
workloads will have quite different performance - and the more
"interesting" the graphs, the more performance will vary.

BTW, did you want the graphs to have different scales? I've found that if
you set them all the same, it makes for easier comparison by eye.

Cheers,
Eric

---------------------------------------------------
|Eric Barton        Barton Software                |
|9 York Gardens     Tel:    +44 (117) 330 1575     |
|Clifton            Mobile: +44 (7909) 680 356     |
|Bristol BS8 4LL    Fax:    call first             |
|United Kingdom     E-Mail: eeb@bartonsoftware.com |
---------------------------------------------------
Hi Eric,

On Tue, Jan 02, 2007 at 02:57:01PM -0000, Eric Barton wrote:

> Have you found any parameters that make all the series
> (threads/regions) collapse into a single line? If that
> can't happen, then different workloads will have quite
> different performance - and the more "interesting" the
> graphs, the more performance will vary.

No, I haven't, unfortunately.

> BTW, did you want the graphs to have different scales?
> I've found that if you set them all the same, it makes
> for easier comparison by eye.

Most of the scales are the same for similar region sizes, so you can
compare graphs when reading across the page. As you know, it's an easy
change to make if you have different preferences.

Cheers,
Jody
I also found these ClusterFS performance results in the archives. Can you
provide details on the host configuration (processor, OS, storage
connection) that was used in these tests?

Thx,
Tim

-----Original Message-----
From: lustre-devel-bounces@clusterfs.com
[mailto:lustre-devel-bounces@clusterfs.com] On Behalf Of Jody McIntyre
Sent: Friday, December 29, 2006 4:10 PM
To: lustre-devel@clusterfs.com
Cc: bzzz@clusterfs.com
Subject: [Lustre-devel] Re: sgpdd-survey of Sun Thumper

On Thu, Dec 14, 2006 at 04:34:06PM -0500, Jody McIntyre wrote:

> My next plan is to test obdfilter-survey with various chunk and request
> sizes. Assuming 4MB IOs are preserved intact all the way to the MD
> layer, I expect similar results. I will then test even smaller chunk
> sizes with both sgpdd-survey and obdfilter-survey.

The obdfilter-surveys are still in progress, but attached is a revised
sgpdd-survey spreadsheet. As you can see, the 64K chunk size is worse,
sometimes significantly, than the 128K chunk size, except for 1MB writes
(which are slightly better with the 64K chunk size).

Therefore a 128K chunk size is basically the sweet spot for a RAID 5
array with 5 active devices on Thumper hardware.

Cheers,
Jody
Hi Tim,

On Thu, Jul 05, 2007 at 07:09:38AM -0600, Snider, Tim wrote:

> I also found these ClusterFS performance results in the archives.
> Can you provide details on the host configuration (processor, OS,
> storage connection) that was used in these tests?

This node is a Sun Thumper, AKA Sun Fire X4500. Specifications are here:
http://www.sun.com/servers/x64/x4500/specs.xml

The OS used was RHEL 4, and the storage was the Thumper's builtin
storage, which is connected via SATA.

Cheers,
Jody