Hello,

We've been seeing strange performance behavior with random I/O on large
files on a Lustre FS. Trying to decipher it, I took a closer look at the
'i_blksize' value used by Lustre, particularly in the ll_getattr_it()
call.

The behavior is: some of our tools do a huge number of fseek-fread(a few
bytes) operations on large files. Those tools show very poor performance
compared to what could be expected given Lustre's usual performance on
the same configuration:

    f = fopen()
    pos = somewhere
    do
        fseek(f, pos)
        fread(10 bytes, f)
        pos = somewhere_else
    loop

Analyzing the behavior, it appears that, for each small fread(), glibc
actually performed a 2MB read, leading to poor performance as seen from
the program (even though the global Lustre throughput was good, only a
few bytes of each 2MB read were useful).

So why does glibc read so much data when the binary requested only a few
bytes? It turns out that the library uses the 'st_blksize' value returned
by fstat() as the optimal read-ahead block size, and sizes its stdio
buffer accordingly. This value is computed by Lustre using:

    [lustre/llite/llite_lib.c:1567]
    inode->i_blkbits = min(PTLRPC_MAX_BRW_BITS+1, LL_MAX_BLKSIZE_BITS);

which is quite a huge value (2MB), and, as our tool does many fseek()
calls, this buffer is useless and is discarded between each read.

We made a small patch so that the value returned by ll_getattr_it()
(which fills the stat struct) can be changed via a module parameter.
Performance is really good when the value is reduced to 4KB. Indeed, in
these cases Lustre manages and buffers the I/O itself much better than
glibc does.

So we are wondering whether it would be a good idea to reduce the default
value, or to use something like a module parameter to tune it depending
on the kind of I/O done on the client. What's your opinion on this?

--
Aurelien Degremont
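For reference, the access pattern above boils down to something like the
following self-contained C sketch; the path, file size, and iteration
count are placeholders, not values from the original tools:

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        char buf[10];
        long pos;
        int i;

        /* Placeholder path: any large file on the Lustre mount. */
        FILE *f = fopen("/mnt/lustre/bigfile", "rb");
        if (!f) {
            perror("fopen");
            return 1;
        }

        /* Placeholder size: assume a 1GB file for illustration. */
        const long file_size = 1L * 1024 * 1024 * 1024;

        for (i = 0; i < 100000; i++) {
            pos = (long)(drand48() * (double)(file_size - (long)sizeof(buf)));
            /* fseek() discards the stdio read buffer; glibc refills it on
             * the next fread() with a read of up to st_blksize bytes (2MB
             * here), of which only 10 bytes are actually used. */
            fseek(f, pos, SEEK_SET);
            if (fread(buf, 1, sizeof(buf), f) != sizeof(buf))
                break;
        }

        fclose(f);
        return 0;
    }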
Solofo.Ramangalahy@bull.net
2007-Feb-13 05:46 UTC
[Lustre-devel] i_blksize value in lustre 1.5.97
Aurelien Degremont writes:
> Analyzing the behavior, it appears that, for each small fread(), glibc
> actually performed a 2MB read, leading to poor performance as seen from
> the program (even though the global Lustre throughput was good, only a
> few bytes of each 2MB read were useful).
>
> So why does glibc read so much data when the binary requested only a
> few bytes?

Not sure it is relevant, but there was also some discussion about an
fseek glibc optimisation last year:

http://lkml.org/lkml/2006/2/28/187
"Drastic Slowdown of 'fseek()' Calls From 2.4 to 2.6 -- VMM Change?"

--
solofo
Solofo.Ramangalahy@bull.net wrote:
> Not sure it is relevant, but there was also some discussion about an
> fseek glibc optimisation last year:
> http://lkml.org/lkml/2006/2/28/187
> "Drastic Slowdown of 'fseek()' Calls From 2.4 to 2.6 -- VMM Change?"

This is perfectly relevant, because it seems Reiser hit exactly the same
issue with ReiserFS (the value jumped from 4K in 2.4 to 128K in 2.6).

The solution I propose for Lustre: since Lustre can buffer the I/O far
better than glibc, why not return a small value for st_blksize and let
Lustre manage the different I/O patterns?

Does any other code use the st_blksize value returned by stat(), apart
from glibc?

--
Aurelien Degremont
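For reference, the field in question is visible to any program through
stat(2); a minimal self-contained check (the path is a placeholder):

    #include <stdio.h>
    #include <sys/stat.h>

    int main(void)
    {
        struct stat st;

        /* Placeholder path: any file on the Lustre mount. */
        if (stat("/mnt/lustre/bigfile", &st) != 0) {
            perror("stat");
            return 1;
        }

        /* glibc sizes its stdio buffers from this field. */
        printf("st_blksize = %ld\n", (long)st.st_blksize);
        return 0;
    }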
On Feb 13, 2007 13:22 +0100, Aurelien Degremont wrote:
> Analyzing the behavior, it appears that, for each small fread(), glibc
> actually performed a 2MB read, leading to poor performance as seen
> from the program.

Ah, interesting. I've seen similar reports, but didn't realize the
problem was caused by glibc using the block size, rather than by the
application itself using f_bsize.

> [lustre/llite/llite_lib.c:1567]
> inode->i_blkbits = min(PTLRPC_MAX_BRW_BITS+1, LL_MAX_BLKSIZE_BITS);
>
> Performance is really good when the value is reduced to 4KB. Indeed,
> in these cases Lustre manages and buffers the I/O itself much better
> than glibc does. So we are wondering whether it would be a good idea
> to reduce the default value, or to use something like a module
> parameter to tune it depending on the kind of I/O done on the client.

We haven't tested the performance of such a change under different
loads, so I can't really comment. The original reason that i_blkbits was
increased was that programs like "cp" benefit a great deal from a larger
block size.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
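To make the "cp" point concrete: a streaming copy typically sizes its
buffer from st_blksize, so the large advertised value is exactly what you
want there. A simplified sketch of such a copy loop, not GNU cp's actual
code:

    #include <stdlib.h>
    #include <unistd.h>
    #include <fcntl.h>
    #include <sys/stat.h>

    /* Simplified cp-style copy loop: streaming readers size their buffer
     * from st_blksize, so a 2MB value means fewer, bigger, faster reads. */
    static int copy_file(const char *src, const char *dst)
    {
        struct stat st;
        char *buf;
        size_t bufsize;
        ssize_t n;

        int in = open(src, O_RDONLY);
        if (in < 0)
            return -1;
        int out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (out < 0) {
            close(in);
            return -1;
        }

        /* Buffer sized from the filesystem's advertised preferred I/O
         * size; with Lustre's 2MB st_blksize, each read() moves 2MB. */
        if (fstat(in, &st) != 0 || st.st_blksize <= 0)
            bufsize = 4096;
        else
            bufsize = (size_t)st.st_blksize;

        buf = malloc(bufsize);
        if (!buf) {
            close(in);
            close(out);
            return -1;
        }

        while ((n = read(in, buf, bufsize)) > 0)
            if (write(out, buf, (size_t)n) != n)
                break;

        free(buf);
        close(in);
        close(out);
        return n == 0 ? 0 : -1;
    }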
Hello!

On Tue, Feb 13, 2007 at 04:35:26PM +0100, Aurelien Degremont wrote:
> Solofo.Ramangalahy@bull.net wrote:
> > Not sure it is relevant, but there was also some discussion about an
> > fseek glibc optimisation last year:
> > http://lkml.org/lkml/2006/2/28/187
> > "Drastic Slowdown of 'fseek()' Calls From 2.4 to 2.6 -- VMM Change?"
> This is perfectly relevant, because it seems Reiser hit exactly the
> same issue with ReiserFS (the value jumped from 4K in 2.4 to 128K in
> 2.6).

In reiserfs the problem was only reported as slow kmail speed, I think.
Also, as I remember, the decision at the time was that i_blksize
actually has nothing to do with reads. It is just a hint for WRITEs,
showing the optimal write size to avoid partial-block writes (and the
associated speed penalties).

Too bad glibc has now adopted this bad practice. Perhaps somebody can
file a bug with the glibc people?

Bye,
Oleg
Oleg Drokin wrote:
> Too bad glibc has now adopted this bad practice. Perhaps somebody can
> file a bug with the glibc people?

Sure, it would be a good idea. Could somebody at CFS do this? As far as
I know, Reiser tried to sort this out with the glibc people and they
refused. But Hans Reiser is not known for being temperate...

And what about a workaround in the meantime? What is the short-term
solution?

--
Aurelien Degremont
Hello!

On Wed, Feb 21, 2007 at 11:00:26AM +0100, Aurelien Degremont wrote:
> Oleg Drokin wrote:
> > Too bad glibc has now adopted this bad practice. Perhaps somebody
> > can file a bug with the glibc people?
> Sure, it would be a good idea. Could somebody at CFS do this?

http://sources.redhat.com/bugzilla/show_bug.cgi?id=4099

> As far as I know, Reiser tried to sort this out with the glibc people
> and they refused. But Hans Reiser is not known for being temperate...

I do not remember there actually being any reply.

> And what about a workaround in the meantime? What is the short-term
> solution?

Do not use streaming I/O functions for now? It is also possible to
decrease the st_blksize reported by Lustre by patching Lustre.

Bye,
Oleg
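Two untested sketches of the first workaround, i.e. avoiding (or taming)
the stdio layer; the path and offsets are placeholders:

    #define _XOPEN_SOURCE 500   /* for pread() */
    #include <stdio.h>
    #include <unistd.h>
    #include <fcntl.h>

    int main(void)
    {
        char buf[10];

        /* Workaround 1: bypass stdio entirely. pread() reads exactly the
         * bytes asked for; there is no read-ahead buffer to throw away. */
        int fd = open("/mnt/lustre/bigfile", O_RDONLY);  /* placeholder */
        if (fd >= 0) {
            pread(fd, buf, sizeof(buf), 123456789);      /* placeholder */
            close(fd);
        }

        /* Workaround 2: keep stdio, but cap its buffer at 4KB instead of
         * letting glibc size it from the 2MB st_blksize. setvbuf() must
         * be called before the first operation on the stream. */
        FILE *f = fopen("/mnt/lustre/bigfile", "rb");
        if (f) {
            static char iobuf[4096];
            setvbuf(f, iobuf, _IOFBF, sizeof(iobuf));
            fseek(f, 123456789L, SEEK_SET);
            fread(buf, 1, sizeof(buf), f);
            fclose(f);
        }
        return 0;
    }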
On Feb 25, 2007 21:11 -0500, Oleg Drokin wrote:
> > And what about a workaround in the meantime? What is the short-term
> > solution?
>
> Do not use streaming I/O functions for now? It is also possible to
> decrease the st_blksize reported by Lustre by patching Lustre.

With older versions of Lustre, if you specified a striping like 65536
bytes, then st_blksize would be returned as stripe_count * stripe_size.
This caused problems with NFS export and connectathon, because NFS
creates files via mknod() and they don't have any file striping
associated with them until the file is opened, so the st_blksize of a
file could change. We might want to consider revisiting this change.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
Oleg Drokin wrote:
> > And what about a workaround in the meantime? What is the short-term
> > solution?
>
> It is also possible to decrease the st_blksize reported by Lustre by
> patching Lustre.

Here is a small patch that adds a module parameter tuning the value
returned by the stat/fstat/lstat syscalls (which all go through
ll_getattr_it()).

By default, Lustre behaves exactly as it does today; if you change the
module parameter 'stat_blksize', Lustre will return that value as the
preferred I/O size. The value can also be changed dynamically through
/proc. We have been using it for days with success, and it makes tuning
easy, depending on your preferred behaviour.

I can add it to Bugzilla if needed. The patch is against Lustre 1.5.97.

--
Aurelien Degremont
-------------- next part --------------
A non-text attachment was scrubbed...
Name: stat_blksize.patch
Type: text/x-patch
Size: 2108 bytes
Desc: not available
Url : http://mail.clusterfs.com/pipermail/lustre-devel/attachments/20070302/2a41d071/stat_blksize.bin
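The attachment itself was scrubbed by the archive. Based on the
description, the change presumably looks roughly like the sketch below
against the 1.5.97-era llite code. The parameter name 'stat_blksize'
comes from the mail; the function signature, hook placement, and elided
code are assumptions, not the actual patch:

    /* lustre/llite/llite_lib.c -- sketch, NOT the scrubbed patch itself */

    /* 0 = keep the default behaviour (blksize derived from i_blkbits);
     * any other value is returned verbatim as st_blksize. Runtime-tunable
     * (the mail mentions changing it through /proc). */
    static unsigned int stat_blksize = 0;
    module_param(stat_blksize, uint, 0644);
    MODULE_PARM_DESC(stat_blksize,
                     "preferred I/O size reported by stat(2), 0 = default");

    int ll_getattr_it(struct vfsmount *mnt, struct dentry *de,
                      struct lookup_intent *it, struct kstat *stat)
    {
            struct inode *inode = de->d_inode;

            /* ... existing attribute revalidation and stat filling
             *     elided ... */

            if (stat_blksize)
                    stat->blksize = stat_blksize;           /* override */
            else
                    stat->blksize = 1 << inode->i_blkbits;  /* 2MB today */

            return 0;
    }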