I am having an issue with our Lustre file system. In our current environment on a SAN file system, opening a large file and doing fseeks completes in under 2 seconds. Running that same routine on our Lustre file system, the routine actually never finishes. Are there any tunable parameters in Lustre that can alleviate this problem?

Thanks in advance.
On 2010-04-07, at 14:09, Ronald K Long wrote:
> I am having an issue with our Lustre file system. In our current
> environment on a SAN file system, opening a large file and doing
> fseeks completes in under 2 seconds. Running that same routine on
> our Lustre file system, the routine actually never finishes.

Doing fseek() itself is only a client-side operation, so it should have no performance impact, UNLESS you are doing SEEK_END, which requires that the actual file size be computed on the client. That causes lock revocation from all of the clients and is an expensive operation. Using SEEK_CUR or SEEK_SET has no cost at all.

> Are there any tunable parameters in Lustre that can alleviate this
> problem?

It depends on what the problem really is.

Cheers, Andreas
--
Andreas Dilger
Principal Engineer, Lustre Group
Oracle Corporation Canada Inc.
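For illustration, here is a minimal sketch of the distinction at the system-call level (using lseek() rather than stdio's fseek(), to keep library buffering out of the picture); the file name and offsets are made up. Only the SEEK_END case needs the real end-of-file, which is what triggers the lock revocation described above:

    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("bigfile", O_RDONLY);   /* illustrative file name */
        if (fd < 0)
            return 1;

        /* Client-local: these only set the file position, no server traffic. */
        lseek(fd, 2097152, SEEK_SET);
        lseek(fd, 4096, SEEK_CUR);

        /* Requires the current file size, so the Lustre client may have to
         * revoke locks held by other clients - the expensive case. */
        lseek(fd, 0, SEEK_END);

        close(fd);
        return 0;
    }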
We are doing SEEK_SET:

    fseek(fp, offset[i], SEEK_SET);

We were running into this same issue on our SAN file system until we set the dma_cache_read_ahead to match our buffer size of 256k. Just wondering if there is a way to set that within Lustre. We are running 1.8 on the MDS and OSS, and the clients running the fseek are running 1.6.

Thanks again.

Rocky

Andreas Dilger <andreas.dilger at oracle.com> wrote:
> Doing fseek() itself is only a client-side operation, so it should
> have no performance impact, UNLESS you are doing SEEK_END, which
> requires that the actual file size be computed on the client. [...]
> Using SEEK_CUR or SEEK_SET has no cost at all.
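As a side note on readahead hints: at the application level, the portable way to tell the kernel that a file will be accessed randomly (so aggressive readahead is unlikely to help) is posix_fadvise(); whether the Lustre 1.6/1.8 client actually honors this hint is not established in this thread, so treat the following as a sketch only. The file name is illustrative:

    #define _XOPEN_SOURCE 600
    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("bigfile", O_RDONLY);   /* illustrative file name */
        if (fd < 0)
            return 1;

        /* Hint that accesses to the whole file (offset 0, length 0) will
         * be in random order. */
        posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);

        /* ... seeks and reads ... */

        close(fd);
        return 0;
    }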
On 2010-04-13, at 07:59, Ronald K Long wrote:
> We are doing SEEK_SET:
>
>     fseek(fp, offset[i], SEEK_SET);
>
> We were running into this same issue on our SAN file system until we
> set the dma_cache_read_ahead to match our buffer size of 256k. Just
> wondering if there is a way to set that within Lustre. We are
> running 1.8 on the MDS and OSS, and the clients running the fseek
> are running 1.6.

Sorry, I didn't read enough into your question. When you said "opening a large file and doing fseek()" I thought that was the only thing you were doing, but really you are doing IOs after the fseek(), and that is presumably what is taking a long time.

It's true that if you are doing random reads it may cause sub-optimal performance, because the client cannot do readahead to mitigate the network/disk latency.

One problem that we saw some time ago at another customer was that their application doing random IO was getting the "IO blocksize" from the file via {f,}stat() and reading from the file in chunks of st_blksize. Since Lustre returns st_blksize = 2MB, the application wanting 4kB chunks of random data from the file was seeking and reading 2MB of extra data for each seek. It would be worthwhile to strace your application to see if it is doing the same thing.

> Andreas Dilger <andreas.dilger at oracle.com> wrote:
>> Doing fseek() itself is only a client-side operation, so it should
>> have no performance impact, UNLESS you are doing SEEK_END [...].
>> Using SEEK_CUR or SEEK_SET has no cost at all.

Cheers, Andreas
--
Andreas Dilger
Principal Engineer, Lustre Group
Oracle Corporation Canada Inc.
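A quick way to see the block size being described is to print st_blksize for a file; glibc's stdio sizes its stream buffer from this value, which is why buffered reads on Lustre can turn into 2MB transfers. A minimal sketch (pass the path of the file your application reads):

    #include <stdio.h>
    #include <sys/stat.h>

    int main(int argc, char **argv)
    {
        struct stat st;

        if (argc < 2 || stat(argv[1], &st) != 0) {
            perror("stat");
            return 1;
        }
        /* On a Lustre 1.6/1.8 client this typically prints 2097152 (2MB);
         * on a local disk it is usually 4096. */
        printf("st_blksize = %ld\n", (long)st.st_blksize);
        return 0;
    }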
Andreas - Here is a snippet of the strace output:

    _llseek(3, 2097152, [2097152], SEEK_SET) = 0
    _llseek(3, 2097152, [2097152], SEEK_SET) = 0
    _llseek(3, 2097152, [2097152], SEEK_SET) = 0
    _llseek(3, 2097152, [2097152], SEEK_SET) = 0
    _llseek(3, 2097152, [2097152], SEEK_SET) = 0
    _llseek(3, 2097152, [2097152], SEEK_SET) = 0
    _llseek(3, 2097152, [2097152], SEEK_SET) = 0
    _llseek(3, 2097152, [2097152], SEEK_SET) = 0
    read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 2097152) = 2097152
    _llseek(3, 0, [0], SEEK_SET) = 0
    read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 2097152) = 2097152
    _llseek(3, 2097152, [2097152], SEEK_SET) = 0
    read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 2097152) = 2097152
    _llseek(3, 0, [0], SEEK_SET) = 0
    read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 2097152) = 2097152
    _llseek(3, 2097152, [2097152], SEEK_SET) = 0
    read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 2097152) = 2097152
    _llseek(3, 0, [0], SEEK_SET) = 0
    read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 2097152) = 2097152
    _llseek(3, 2097152, [2097152], SEEK_SET) = 0
    read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 2097152) = 2097152
    _llseek(3, 0, [0], SEEK_SET) = 0
    read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 2097152) = 2097152
    _llseek(3, 2097152, [2097152], SEEK_SET) = 0
    read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 2097152) = 2097152
    _llseek(3, 0, [0], SEEK_SET) = 0
    read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 2097152) = 2097152
    _llseek(3, 2097152, [2097152], SEEK_SET) = 0
    read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 2097152) = 2097152
    _llseek(3, 0, [0], SEEK_SET) = 0

Thanks again

Rocky

Andreas Dilger <andreas.dilger at oracle.com> wrote:
> One problem that we saw some time ago at another customer was that
> their application doing random IO was getting the "IO blocksize" from
> the file via {f,}stat() and reading from the file in chunks of
> st_blksize. Since Lustre returns st_blksize = 2MB, the application
> wanting 4kB chunks of random data from the file was seeking and
> reading 2MB of extra data for each seek. It would be worthwhile to
> strace your application to see if it is doing the same thing.
On Wed, 2010-04-14 at 07:08 -0500, Ronald K Long wrote:
> Andreas - Here is a snippet of the strace output.
>
> read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 2097152) = 2097152

As Andreas suspected, your application is doing 2MB reads every time. Does it really need 2MB of data on each read? If not, can you fix your application to only read as much data as it actually wants?

b.
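For what "only read as much data as it actually wants" could look like in practice, here is a minimal sketch using pread() to fetch just a small record at a given offset, bypassing stdio buffering entirely; the file name, offset, and record size are illustrative:

    #define _XOPEN_SOURCE 700
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        char buf[4096];                       /* only the bytes we need */
        int fd = open("bigfile", O_RDONLY);   /* illustrative file name */
        if (fd < 0)
            return 1;

        /* Read 4kB at offset 2097152 without moving the file position
         * and without any library-level buffer fill. */
        ssize_t n = pread(fd, buf, sizeof(buf), 2097152);
        if (n < 0)
            perror("pread");

        close(fd);
        return 0;
    }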
We've narrowed down the problem quite a bit.

The problematic code snippet is not actually doing any reads or writes; it's just doing a massive number of fseek() operations within a couple of nested loops. (Note: the production code is doing some I/O, but this snippet was narrowed down to the bare minimum example that exhibited the problem, which is how we discovered that fseek was the culprit.)

The issue appears to be the behavior of the glibc implementation of fseek(). Apparently, a call to fseek() on a buffered file stream causes glibc to flush the stream, regardless of whether a flush is actually needed. If we modify the snippet to call setvbuf() and disable buffering on the file stream before any of the fseek() calls, then it finishes more or less instantly, as you would expect.

The problem is that this offending code is actually buried deep within a COTS library that we're using to do image processing (the Hierarchical Data Format (HDF) library). While we do have access to the source code for this library and could conceivably modify it, it is a large and complex library, and a change of this nature would require us to do a large amount of regression testing to ensure that nothing was broken.

So at the end of the day this is really not a "Lustre problem" per se, though we would still be interested in any suggestions as to how we can minimize the effects of this glibc "flush penalty". This penalty is not particularly onerous when reading and writing to local disk, but is obviously more of an issue with a distributed filesystem.

Thank you again for the support.

Rocky

Brian J. Murrell wrote:
> As Andreas suspected, your application is doing 2MB reads every time.
> Does it really need 2MB of data on each read? If not, can you fix
> your application to only read as much data as it actually wants?
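A minimal sketch of the workaround described above, with illustrative names (the data file and offsets are made up): buffering is disabled with setvbuf() immediately after fopen(), so the seek-heavy loop no longer pays the stdio flush/refill cost on every fseek():

    #include <stdio.h>

    int main(void)
    {
        long offsets[] = { 0, 2097152, 4096, 1048576 };   /* illustrative */
        FILE *fp = fopen("datafile", "rb");               /* illustrative */
        if (fp == NULL)
            return 1;

        /* Must be called before any other operation on the stream:
         * switch it to unbuffered mode. */
        setvbuf(fp, NULL, _IONBF, 0);

        for (size_t i = 0; i < sizeof(offsets) / sizeof(offsets[0]); i++)
            fseek(fp, offsets[i], SEEK_SET);   /* now essentially free */

        fclose(fp);
        return 0;
    }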
On 2010-04-14, at 11:08, Ronald K Long wrote:
> The issue appears to be the behavior of the glibc implementation of
> fseek(). Apparently, a call to fseek() on a buffered file stream
> causes glibc to flush the stream, regardless of whether a flush is
> actually needed. If we modify the snippet to call setvbuf() and
> disable buffering on the file stream before any of the fseek() calls,
> then it finishes more or less instantly, as you would expect.

I'd encourage you to file a bug (preferably with a patch) against glibc to fix this. I've had reasonable success in getting problems like this fixed upstream.

> The problem is that this offending code is actually buried deep within
> a COTS library that we're using to do image processing (the
> Hierarchical Data Format (HDF) library). [...]
>
> So at the end of the day this is really not a "Lustre problem" per se,
> though we would still be interested in any suggestions as to how we
> can minimize the effects of this glibc "flush penalty".

Similarly, HDF + Lustre usage is very common, and I expect that the HDF developers would be interested to fix this if possible.

Cheers, Andreas
--
Andreas Dilger
Principal Engineer, Lustre Group
Oracle Corporation Canada Inc.
After doing some more digging, it looks as though a bug was reported on this in 2007:

https://bugzilla.lustre.org/show_bug.cgi?id=12739

We have loaded the Lustre patch attached to this bug; however, when running the set_param command I am getting the following error:

    lctl set_param llite*.*.stat_blksize=4096
    error: set_param: /proc/{fs,sys}/{lnet,lustre}/llite/lustre*/stat_blksize: No such process

Is this patch still valid for 2.6.9-78.0.22.EL_lustre.1.6.7.2smp?

Thanks again

Rocky

Andreas Dilger <andreas.dilger at oracle.com> wrote:
> I'd encourage you to file a bug (preferably with a patch) against
> glibc to fix this. I've had reasonable success in getting problems
> like this fixed upstream.
>
> Similarly, HDF + Lustre usage is very common, and I expect that the
> HDF developers would be interested to fix this if possible.
We were able to find where to tune stat_blksize after loading the patch mentioned below, and the fseek function is working correctly with this patch installed.

Thanks

Rocky

> After doing some more digging, it looks as though a bug was reported
> on this in 2007:
>
> https://bugzilla.lustre.org/show_bug.cgi?id=12739
>
> We have loaded the Lustre patch attached to this bug; however, when
> running the set_param command I am getting the following error:
>
>     lctl set_param llite*.*.stat_blksize=4096
>     error: set_param: /proc/{fs,sys}/{lnet,lustre}/llite/lustre*/stat_blksize: No such process