Sky, Phil,

I've started observing the behavior of Lustre 1.0.4. These observations may be relevant to this discussion.

I have a single OST and a single client.

I observe that the block transfer requests coming from the sockNAL are always 4KB in size, even though the application is doing 1MB writes ("dd if=/dev/zero of=/mnt/lustre/file500M bs=1M count=512"). Similar behavior is observed with reads. My guess is that if O_DIRECT is set, the kernel/Lustre doesn't break the transfer into 4KB chunks but instead uses the chunk size requested by the application (which improves efficiency significantly).

The effective bandwidth of very high speed interconnects like InfiniBand (IB) rises sharply as the transfer size increases: a 4KB message spends only about 4 microseconds on the wire (roughly 1GB/s), whereas the control overhead in Lustre/sockNAL/TCP/IPoIB is much longer than 4 microseconds. The effective bandwidth is therefore limited by the control overhead. I know that if we can get Lustre to issue larger IO requests to an OST, we can significantly increase the observed throughput of the Lustre filesystem over high speed interconnects like InfiniBand.

I've also had some experience measuring IO performance on Linux. Tools like bonnie++ (and especially Iometer for Linux) can report misleading IO performance numbers because they cannot bypass the client's local page cache (if there is an easy way to configure Lustre to always bypass the local cache on the client machine, I'd love to use it). Right now, I'm running my Lustre Lite client on a machine restricted to 256MB of memory (using mem=256M at boot) to reduce the likelihood of pages remaining cached locally (I'm also transferring 500MB files to reduce the likelihood of cache hits).

Duane.

-----Original Message-----
From: Phil Schwan [mailto:phil@clusterfs.com]
Sent: Friday, June 11, 2004 7:14 PM
To: sky
Cc: lustre-discuss@lists.clusterfs.com
Subject: Re: [Lustre-discuss] O_DIRECT

On Fri, 2004-05-28 at 00:06, sky wrote:
> I had performance trouble against IPoIB on IA64. It's very interesting that bonnie with
> the O_DIRECT option achieves a write rate of about 90~100MB/s, but only 20~30MB/s without O_DIRECT.
> We have 2 OSTs with SCSI 320 disks on IA64.
> It seems that data flowing into the page cache is pushed out of the cache slowly.
> I have not read the Lustre source code seriously. I suspect that the IPoIB or InfiniBand stack
> affects Lustre cache operation. It's just my opinion.
> I don't know if the IB NAL under development has experienced the same problem. Is there any way to
> adjust the Lustre cache?

What I find most surprising is that you get such good performance with O_DIRECT at all. bonnie++, at least the way we run it, is a very metadata-intensive test, and it's these metadata portions which are usually the bottleneck.

The difference between O_DIRECT and the page cache is the size and number of RPCs. With O_DIRECT, the API requires that we send the contents of your write() to the server immediately, and not return until it's there. We don't batch it together with other writes, and we can only have 1 RPC in flight at a time per thread.

When writing through the page cache, we take pains to keep many RPCs in flight at all times (4 or 8 by default, depending on your Lustre version), and also to make those RPCs 512k in size. Both of these variables are tunable, for each OSC, in /proc/fs/lustre/osc/.

You could try tuning max_pages_per_rpc down to something smaller, or max_rpcs_in_flight.
Given what was written to the list earlier today about the window sizes in the IPoIB stack, this might just work. Please let us know what you learn!

Thanks--

-Phil

_______________________________________________
Lustre-discuss mailing list
Lustre-discuss@lists.clusterfs.com
https://lists.clusterfs.com/mailman/listinfo/lustre-discuss
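To make Phil's O_DIRECT point concrete, here is a minimal user-space sketch of what such a write looks like; this is illustrative C, not Lustre code, and the file name and 1MB size are simply borrowed from Duane's dd test. O_DIRECT requires the buffer (and the file offset and length) to be suitably aligned, and the write() does not return until the data is on the server, which is why a single thread can keep only one RPC in flight:

#define _GNU_SOURCE                 /* O_DIRECT needs this with glibc */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const size_t chunk = 1 << 20;   /* 1MB, matching bs=1M in the dd test */
    void *buf;
    int fd;

    /* O_DIRECT requires an aligned buffer; page alignment is a safe choice */
    if (posix_memalign(&buf, 4096, chunk) != 0)
        return 1;
    memset(buf, 0, chunk);          /* stand-in for /dev/zero data */

    fd = open("/mnt/lustre/file500M", O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* with O_DIRECT, this call blocks until the data has reached the
     * server: one synchronous RPC, no batching with other writes */
    if (write(fd, buf, chunk) != (ssize_t)chunk)
        perror("write");

    close(fd);
    free(buf);
    return 0;
}

A buffered write, by contrast, returns as soon as the pages are dirtied, which is what lets the client batch them into 512k RPCs and keep several in flight, up to the max_pages_per_rpc and max_rpcs_in_flight limits Phil mentions.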
Duane,

> -----Original Message-----
> From: lustre-discuss-admin@lists.clusterfs.com
> [mailto:lustre-discuss-admin@lists.clusterfs.com] On Behalf Of McCrory, Duane
> Sent: Saturday, June 12, 2004 4:32 AM
> To: Phil Schwan; sky
> Cc: lustre-discuss@lists.clusterfs.com
> Subject: RE: [Lustre-discuss] O_DIRECT
>
> I observe that the block transfer requests coming from the sockNAL are
> always 4KB in size, even though the application is doing 1MB writes
> ("dd if=/dev/zero of=/mnt/lustre/file500M bs=1M count=512"). Similar
> behavior is observed with reads. My guess is that if O_DIRECT is set,
> the kernel/Lustre doesn't break the transfer into 4KB chunks but
> instead uses the chunk size requested by the application (which
> improves efficiency significantly).
> [...]

This is a "feature" of the socknal: it only gives one fragment at a time to the network. However, it _does_ set the 'more' flag on all sends apart from the very last fragment, which allows the Linux TCP/IP stack to aggregate these fragments into full-MTU packets.

I noticed that this was also a problem for the Voltaire SDP implementation, so I produced a patch that passes a full-sized iovec on writes, at the expense of always saving and reconstructing the iovec (partial writes leave the iovec in an unknown state, since there is no guarantee that the caller's iovec will or won't be 'consumed'; we've seen both behaviours!).

Cheers,
Eric

---------------------------------------------------
|Eric Barton        Barton Software                |
|9 York Gardens     Tel:    +44 (117) 330 1575     |
|Clifton            Mobile: +44 (7909) 680 356     |
|Bristol BS8 4LL    Fax:    call first             |
|United Kingdom     E-Mail: eeb@bartonsoftware.com |
---------------------------------------------------
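Eric's point about the 'more' flag can be illustrated with a small user-space analogy. The sketch below is not socknal source, and the function names are hypothetical: the first variant hands the network one fragment per send() but sets MSG_MORE on all fragments except the last, so the TCP stack may coalesce them into full-MTU packets; the second passes the whole fragment list in a single call, the approach of Eric's patch. (The iovec-consumption hazard Eric describes is a property of the kernel-internal send path; the user-space sendmsg() shown here leaves the caller's iovec intact.)

#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Variant 1: one fragment per send(); MSG_MORE tells the stack that
 * more data follows, so fragments may be aggregated into full-MTU
 * packets.  Short-send/resume handling is omitted for brevity. */
ssize_t send_fragments(int sock, const struct iovec *frags, int nfrags)
{
    ssize_t total = 0;
    int i;

    for (i = 0; i < nfrags; i++) {
        int flags = (i < nfrags - 1) ? MSG_MORE : 0;
        ssize_t n = send(sock, frags[i].iov_base, frags[i].iov_len, flags);
        if (n < 0)
            return -1;
        total += n;
    }
    return total;
}

/* Variant 2: pass the full fragment list in one call, so the stack
 * sees the whole transfer size up front. */
ssize_t send_whole_iovec(int sock, struct iovec *frags, int nfrags)
{
    struct msghdr msg;

    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = frags;
    msg.msg_iovlen = nfrags;
    return sendmsg(sock, &msg, 0);
}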
hi, everyone

I had performance trouble against IPoIB on IA64. It's very interesting that bonnie with
the O_DIRECT option achieves a write rate of about 90~100MB/s, but only 20~30MB/s without O_DIRECT.
We have 2 OSTs with SCSI 320 disks on IA64.
It seems that data flowing into the page cache is pushed out of the cache slowly.
I have not read the Lustre source code seriously. I suspect that the IPoIB or InfiniBand stack
affects Lustre cache operation. It's just my opinion.
I don't know if the IB NAL under development has experienced the same problem. Is there any way to
adjust the Lustre cache?

kane.