Goswin von Brederlow
2006-May-19 07:36 UTC
[Lustre-discuss] disable client-side caching on Lustre?
Wei-keng Liao <wkliao@ece.northwestern.edu> writes:
> How do I disable client-side caching on Lustre when accessing a
> file in a C program?
>
> Wei-keng Liao

Does the O_DIRECT flag for open have the desired effect?

MfG
Goswin
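For readers who want to try this, here is a minimal sketch of opening a file with O_DIRECT on Linux (the path and error handling are illustrative only; as discussed later in this thread, the buffers, offsets, and lengths passed to read()/write() must also meet the filesystem's alignment requirements, or the calls fail with EINVAL):

#define _GNU_SOURCE            /* needed for O_DIRECT with glibc */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }
    int fd = open(argv[1], O_RDWR | O_DIRECT);   /* bypass the page cache */
    if (fd < 0) {
        perror("open(O_DIRECT)");
        return 1;
    }
    /* ... suitably aligned read()/write() calls go here ... */
    close(fd);
    return 0;
}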
Andreas Dilger
2006-May-19 07:36 UTC
[Lustre-discuss] disable client-side caching on Lustre?
On Mar 03, 2006 23:26 -0600, Wei-keng Liao wrote:> How do I disable client-side caching on Lustre when accessing a > file in a C program?When you say "disable client-side cache" do you mean read cache or write cache? The standard Linux mechanism for bypassing the page cache entirely (read and write) is by using O_DIRECT as was previously mentioned. If you want to flush written data from cache, then you should call fsync() on the file after writing. It is possible to limit the amount of cached write data on the client by setting each of the /proc/fs/lustre/osc/*/max_dirty_mb tunables. They are set to 32MB of dirty data by default, per OSC. The client will accumulate dirty pages (for non-O_DIRECT writes) until they make up at least a full RPC (1MB), but will not throttle writes until they hit the max_dirty_mb limit. It is also possible to limit the amount of cached read data on the client by the /proc/fs/lustre/llite/fs*/max_cached_mb tunable. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc.
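A small sketch of the fsync() approach mentioned above (the path and buffer size are illustrative); this flushes the client's dirty pages for the file to the servers, but it does not prevent caching in the first place:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    char buf[8192];
    int fd = open("/mnt/lustre/output.dat", O_WRONLY | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    memset(buf, 0, sizeof(buf));
    if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf))
        perror("write");
    if (fsync(fd) != 0)          /* force the dirty pages out to the OSTs */
        perror("fsync");

    close(fd);
    return 0;
}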
Andreas Dilger
2006-May-19 07:36 UTC
[Lustre-discuss] disable client-side caching on Lustre?
On Mar 06, 2006 01:14 -0600, Wei-keng Liao wrote:
> I meant both read and write caching. I am running two I/O benchmarks:
> FLASH I/O and BTIO, where FLASH I/O has only write operations and BTIO has
> reads after writes (file closed in between). I got very bad performance
> results for both on Lustre and am trying to see if it is due to caching.
> O_DIRECT is not applicable here because of the non-aligned read-write
> amount.

I was able to find some information about the BTIO and FLASH benchmarks,
but these are not tests that we have run ourselves. My brief reading on
these benchmarks indicates that they are doing a lot of small-size IO
requests striped throughout the file, instead of large contiguous IO
requests. Is that correct? Do you have any information on e.g. the size
of the IO requests? You already mentioned that the IO is not aligned,
which also hurts performance.

Do these tests also do their IO using MPIIO? We have had problems in the
past when the ADIO layer was making bad assumptions for Lustre IO. We
have not yet worked on any Lustre-specific ADIO layer, but this may allow
a large performance improvement for some kinds of applications.

> For applications performing only write operations (and no repeated or
> overlapped writes), caching causes unnecessary memory copies and thus the
> bad performance. Similar for non-repeated reads only (although prefetch
> can also help).

One issue is that even if the write segments are not overlapping, if they
are not aligned on PAGE_SIZE (4k normally) boundaries, then they can not
be handled by the linux VM efficiently, and result in a read-modify-write
cycle for each of the "boundary" pages. Lustre also aligns the DLM locks
on client PAGE_SIZE boundaries, so this can cause artificial contention
among the writes.

In any case, I suspect that the root of the problem is the small,
interleaved, and unaligned IO pattern, and not the cache itself.

> As a Lustre user, I would like to know if I can disable read and write
> caching for the benchmarks I tested, without root permission to change
> file system configuration.

No, this isn't possible (excluding O_DIRECT). The reason O_DIRECT allows
uncached IO is because the kernel is using the user-supplied data buffers
to do the IO directly to the filesystem. Since the filesystem layer has
to work with aligned buffers, this imposes the alignment restrictions.
Working with unaligned data buffers imposes the data copying overhead.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
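To illustrate the boundary-page issue described above, here is a small sketch (the helper is hypothetical, not part of any Lustre or MPI API) that rounds each writer's chunk up to a PAGE_SIZE multiple so neighbouring writers do not share a "boundary" page and so avoid the read-modify-write cycle:

#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: split 'total' bytes among 'nproc' writers so that
 * every writer's starting offset is a multiple of the page size. */
static void page_aligned_chunk(long total, int nproc, int rank,
                               long *offset, long *length)
{
    long page  = sysconf(_SC_PAGESIZE);                       /* typically 4096 */
    long chunk = ((total / nproc + page - 1) / page) * page;  /* round up */

    *offset = (long)rank * chunk;
    *length = (*offset >= total) ? 0 :
              (total - *offset < chunk ? total - *offset : chunk);
}

int main(void)
{
    long off, len;
    for (int rank = 0; rank < 4; rank++) {
        page_aligned_chunk(100 * 1024 * 1024 + 123, 4, rank, &off, &len);
        printf("rank %d: offset %ld length %ld\n", rank, off, len);
    }
    return 0;
}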
Wei-keng Liao
2006-May-19 07:36 UTC
[Lustre-discuss] disable client-side caching on Lustre?
I meant both read and write caching. I am running two I/O benchmarks: FLASH I/O and BTIO, where FLASH I/O has only write operations and BTIO has reads after writes (file closed in between). I got very bad performance results for both on Lustre and are trying to see if it is due to caching. O_DIRECT is not applicable here because of the non-aligned read-write amount. For applications performs only write operations (and no repeated or overlapped writes), caching causes unnecessary memory copies and thus the bad performance. Similar for non-repeated reads only (although prefetch can also help). As a Lustre user, I would like to know if I can disable read and write caching for the benchmarks I tested, without root permission to change file system configuration. I was told ioctl() can do that, but could not find any info from Lustre''s manual. Wei-keng On Sun, 5 Mar 2006, Andreas Dilger wrote:> On Mar 03, 2006 23:26 -0600, Wei-keng Liao wrote: >> How do I disable client-side caching on Lustre when accessing a >> file in a C program? > > When you say "disable client-side cache" do you mean read cache or > write cache? The standard Linux mechanism for bypassing the page > cache entirely (read and write) is by using O_DIRECT as was previously > mentioned. If you want to flush written data from cache, then you > should call fsync() on the file after writing. > > It is possible to limit the amount of cached write data on the client > by setting each of the /proc/fs/lustre/osc/*/max_dirty_mb tunables. > They are set to 32MB of dirty data by default, per OSC. The client will > accumulate dirty pages (for non-O_DIRECT writes) until they make up at > least a full RPC (1MB), but will not throttle writes until they hit the > max_dirty_mb limit. > > It is also possible to limit the amount of cached read data on the client > by the /proc/fs/lustre/llite/fs*/max_cached_mb tunable. > > Cheers, Andreas > -- > Andreas Dilger > Principal Software Engineer > Cluster File Systems, Inc. >
Wei-keng Liao
2006-May-19 07:36 UTC
[Lustre-discuss] disable client-side caching on Lustre?
According to the Lustre manual, part IV, section 1.4, O_DIRECT requires the
buffer size for read() and write() to be aligned on a 512-byte boundary.
If not, the error EINVAL occurs. I did get this error when I used buffers
of other sizes. My question is really whether there is a way to simply
disable caching (using a system call like ioctl or so) without the
512-byte limitation. Many existing applications do their I/O without
512-byte alignment, and it can be hard to change all of the I/O calls just
to see the effect of O_DIRECT.

Wei-keng

On Sat, 4 Mar 2006, Goswin von Brederlow wrote:
> Wei-keng Liao <wkliao@ece.northwestern.edu> writes:
>
>> How do I disable client-side caching on Lustre when accessing a
>> file in a C program?
>>
>> Wei-keng Liao
>
> Does the O_DIRECT flag for open have the desired effect?
>
> MfG
> Goswin
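For concreteness, here is a hedged sketch of what the 512-byte requirement means in practice (the 512-byte figure is taken from the manual section cited above; some configurations may require page-size alignment instead, and the path is illustrative). Both the buffer address and the transfer length are aligned:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    void  *buf;
    size_t len = 1 << 20;                 /* 1 MB, a multiple of 512 bytes */

    /* Allocate a 512-byte-aligned buffer for the O_DIRECT transfer. */
    if (posix_memalign(&buf, 512, len) != 0) {
        perror("posix_memalign");
        return 1;
    }
    memset(buf, 0, len);

    int fd = open("/mnt/lustre/direct.dat", O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    if (write(fd, buf, len) < 0)          /* an unaligned length gives EINVAL */
        perror("write");

    close(fd);
    free(buf);
    return 0;
}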
Wei-keng Liao
2006-May-19 07:36 UTC
[Lustre-discuss] disable client-side caching on Lustre?
How do I disable client-side caching on Lustre when accessing a file in a C program? Wei-keng Liao
Wei-keng Liao
2006-May-19 07:36 UTC
[Lustre-discuss] disable client-side caching on Lustre?
Jay,

What benchmarks were you using to get the low I/O performance results on
Lustre? Do they do parallel I/O through MPI-IO? On shared files or
separate files?

On a file system that does no caching at all, I suspect it is the
conflicting file locks that cause I/O serialization and hence the low I/O
performance results.

Wei-keng

On Mon, 6 Mar 2006, Andreas Dilger wrote:
> On Mar 06, 2006 20:03 -0600, J. K. Cliburn wrote:
>> On 3/6/06, J. K. Cliburn <jcliburn@gmail.com> wrote:
>> I meant to say a 4,176 *processor* Cray XT3. 4,128 processors are
>> embodied in dual Opteron compute nodes running the Catamount
>> lightweight kernel. The remainder are 4-way service nodes running a
>> variant of SLES9.
>
> Note that the Catamount Lustre clients are _completely_ different than
> the Linux VFS Lustre clients. The Catamount clients cannot do ANY write
> caching or asynchronous writes because the operating environment does
> not support interrupts for notification of IO completion. The writes
> must be completed during each syscall, and are as a result completely
> synchronous.
>
> Cheers, Andreas
> --
> Andreas Dilger
> Principal Software Engineer
> Cluster File Systems, Inc.
>
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss@clusterfs.com
> https://mail.clusterfs.com/mailman/listinfo/lustre-discuss
Andreas Dilger
2006-May-19 07:36 UTC
[Lustre-discuss] disable client-side caching on Lustre?
On Mar 14, 2006 15:46 -0600, Wei-keng Liao wrote:
> On a file system that does no caching at all, I suspect it is the
> conflicting file locks that cause I/O serialization and hence the low
> I/O performance results.

The clients do not get any locks either. On the other hand, we have
recently identified one problem which would negatively impact performance
in this environment, and Cray is working to make the fix available in
their next release.

> On Mon, 6 Mar 2006, Andreas Dilger wrote:
>> On Mar 06, 2006 20:03 -0600, J. K. Cliburn wrote:
>>> On 3/6/06, J. K. Cliburn <jcliburn@gmail.com> wrote:
>>> I meant to say a 4,176 *processor* Cray XT3. 4,128 processors are
>>> embodied in dual Opteron compute nodes running the Catamount
>>> lightweight kernel. The remainder are 4-way service nodes running a
>>> variant of SLES9.
>>
>> Note that the Catamount Lustre clients are _completely_ different than
>> the Linux VFS Lustre clients. The Catamount clients cannot do ANY write
>> caching or asynchronous writes because the operating environment does
>> not support interrupts for notification of IO completion. The writes
>> must be completed during each syscall, and are as a result completely
>> synchronous.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
J. K. Cliburn
2006-May-19 07:36 UTC
[Lustre-discuss] disable client-side caching on Lustre?
On 3/14/06, Wei-keng Liao <wkliao@ece.northwestern.edu> wrote:> Jay > > What benchmarks were you using to get the low I/O performance results on > Lustre? Does it do parallel I/O through MPI-IO? On shared files or > separate files?Real codes, not benchmarks, were our first indication that we had work to do. We then wrote a small parallel writer/reader (no MPI_IO, IIRC) to further characterize what we''d seen. I''m most interested in many writers to a single file (many-to-one), since, in my humble opinion anyway, that''s what "parallel I/O" is, and that''s where I thought Lustre''s strength lay. We get much, much better performance in one-to-one I/O, although it seems (understandably) to be extraordinarily sensitive to stripe number and size. We host a wide variety of users who employ a corresponding wide variety of parallel codes applied to all sorts of scientific problems, most of whom will need to carefully tune their respective code''s I/O in order to see acceptable turnaround times on their batch jobs. I''m confident that Cray will help us get things sorted out, though. (And I''m sure the Cray employees who read this will agree, since, oddly, as I learned from my previous post, there appears to be a direct conduit from this list to Mendota Heights.) Jay
Andreas Dilger
2006-May-19 07:36 UTC
[Lustre-discuss] disable client-side caching on Lustre?
On Mar 14, 2006 16:08 -0600, Wei-keng Liao wrote:
> On Tue, 7 Mar 2006, Andreas Dilger wrote:
>> As this is the first time that I have seen complaints about IO performance
>> from this type of workload, there has not been any effort to tune lustre
>> for this load yet. I see that the lock policy doesn't consider a single
>> file to be "contended" until there are at least 5 concurrent writers, and
>> not "highly contended" until there are at least 33 concurrent writers.
>> Most of the shared-file write tunings have been done with 400+ clients.
>
> I do not quite understand this lock policy. What does Lustre do when it
> doesn't consider a single file to be "contended"? What does it do in the
> "highly contended" case? Can you point me to any further reading about this?

In the "uncontended" case (the common single-writer-per-file or
only-readers-per-file case) the Lustre DLM will grow the requested lock to
the maximum non-conflicting extent possible. So, for example, if the first
client requests a DLM write extent lock on [131072, 196607] of an object
it will actually be granted a lock on [0, EOF] (EOF=2^64-1 in this
context). This means that it will not have to issue another DLM request
for the entire file access, if there are no other conflicting users.

If a second client then requests a lock on [65536, 131071] the first lock
is cancelled and the client is granted a lock on [0, 131071] because the
DLM still considers the to-be-cancelled first extent when growing the lock.
A third client requesting [0, 65535] would cancel the first lock and be
granted exactly that extent. The DLM also takes requested-but-blocked
locks into account when growing the locks, so that it doesn't grant
conflicting locks for known users.

If the clients are doing IO to only a single region of the file each
(e.g. [chunksize * mpi_rank, +chunksize]) then the above algorithm will
converge fairly quickly to give each client a non-overlapping write lock
on its particular part of the file. If the chunksize is not a multiple of
the client PAGE_SIZE then this will work poorly, because it is not possible
to grant those locks concurrently due to the overlap caused by rounding,
and no effort has yet been made to optimize this. At some point (if the
chunk size is large enough to span multiple write() calls) the clients
will be working on disjoint extents. Clearly, this is sub-optimal if there
are multiple clients doing small strided writes to a single file, but at
least in past experience that was a rare case.

In the "contended" case, extent locks are not grown "downwards", avoiding
potential unnecessary lock conflicts. In the "highly contended" case,
instead of growing the locks to the maximum extent possible they are
capped at the next 32MB boundary. This is in
lustre/ldlm/ldlm_extent.c::ldlm_extent_internal_policy().

Ideally, there would be some way to know in advance what the "correct"
lock size is (maybe tracking per object what the granted locks' request
sizes are), but for any such algorithm there is a pathological case. We
would of course welcome any contributions to improve the algorithms.

> Can you describe how to disable locking in conjunction with O_DIRECT, even
> though it is not supported? Can it be done by application users without
> root permission?

Currently it is not accessible to non-root users, as it was only
implemented for very early benchmarking and is not a supported feature.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
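As a reading aid, here is a much-simplified sketch of the extent-growing idea described above. It is purely illustrative and is not the actual ldlm_extent_internal_policy() code; the real policy also considers blocked (queued) requests and the contention heuristics, and this sketch assumes the granted list does not conflict with the request:

#include <stdio.h>

struct extent { unsigned long long start, end; };

/* Grow a requested extent to the largest range that does not overlap any
 * already-granted extent on the same object (EOF modelled as 2^64-1). */
static struct extent grow_extent(struct extent req,
                                 const struct extent *granted, int n)
{
    struct extent grown = { 0, 0xffffffffffffffffULL };

    for (int i = 0; i < n; i++) {
        if (granted[i].end < req.start && granted[i].end + 1 > grown.start)
            grown.start = granted[i].end + 1;      /* neighbour below */
        if (granted[i].start > req.end && granted[i].start - 1 < grown.end)
            grown.end = granted[i].start - 1;      /* neighbour above */
    }
    return grown;
}

int main(void)
{
    struct extent granted[] = { { 0, 65535 } };    /* an existing lock */
    struct extent req = { 131072, 196607 };        /* the new request  */
    struct extent g = grow_extent(req, granted, 1);

    printf("granted [%llu, %llu]\n", g.start, g.end);  /* [65536, 2^64-1] */
    return 0;
}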
Andreas Dilger
2006-May-19 07:36 UTC
[Lustre-discuss] disable client-side caching on Lustre?
On Mar 06, 2006 14:51 -0600, Wei-keng Liao wrote:
> Attached is a collection of offset-length pairs for FLASH-IO and BTIO.
>
> I have been playing different stripe size and no. OSTs on Tungsten@NCSA
> which has default 4MB stripe size and stripe over 8 OSTs. Changing the
> stripe size and no. OSTs does not help the performance a lot for FLASH and
> BTIO. I tried 64KB, 512KB, 1MB stripe size and 8 and 16 OSTs. Numbers of
> clients I ran range from 4 to 64. However, in both FLASH and BTIO, all
> clients read/write to a shared file. Maybe this is what causes the bad
> performance on Lustre. (The I/O bandwidth I got for FLASH-IO is less than
> 10 MB/s and less than 80 MB/s for BTIO, on Tungsten which potentially can
> provide 11 GB/s I/O bandwidth.) So, my guess is that Lustre has not been
> optimized for concurrent or parallel I/O operations yet. As a heavy MPI-IO
> user, I wish there will be more demands on this aspect in the future :(

Lustre has in fact been tested with single-file parallel IO for a long time,
and while performance is not as good as file-per-process, it has generally
been acceptable if the clients are doing large non-overlapping concurrent
writes. However, the IO pattern that single-file writes have been tuned
to is different than is shown here, with the heuristics being tuned for
tens-hundreds of clients doing aligned writes of 1MB or more concurrently.

As this is the first time that I have seen complaints about IO performance
from this type of workload, there has not been any effort to tune lustre
for this load yet. I see that the lock policy doesn't consider a single
file to be "contended" until there are at least 5 concurrent writers, and
not "highly contended" until there are at least 33 concurrent writers.
Most of the shared-file write tunings have been done with 400+ clients.

Looking at the start of one of the logs in question (OFFSET_LEN/BTIO.A4):

1 o btio.full.out
2 o btio.full.out
3 o btio.full.out
0 o btio.full.out
1 w 2621440 2621440
1 b 1 1
3 w 7864320 2621440
3 b 1 1
0 w 0 2621440
0 b 1 1
2 w 5242880 2621440
2 b 1 1
0 w 10485760 2621440
3 w 18350080 2621440
2 w 15728640 2621440
3 b 1 1
2 b 1 1
0 b 1 1
1 w 13107200 2621440
1 b 1 1
2 w 26214400 2621440
2 b 1 1
0 w 20971520 2621440
3 w 28835840 2621440
1 w 23592960 2621440
3 b 1 1
0 b 1 1
1 b 1 1

It is clear from looking at this IO log that this application would be
optimal with a stripe size of 2621440 bytes and 4 stripes for the file,
or a multiple thereof. This would allow each process to essentially
write to its own file, and there would be no lock contention at all.

lfs setstripe /path/to/output/btio.full.out 2621440 -1 4
or
lfs setstripe /path/to/output/btio.full.out 1310720 -1 8
or
lfs setstripe /path/to/output/btio.full.out 655360 -1 16
etc.

It is the lock contention that is essentially serializing the writes
from each client. It would be interesting to know whether the ROMIO/MPIIO
layer would send the proper hints to a lustre-ADIO to allow it to create
the file in this manner? I admit I don't know enough about the hints
that an ADIO implementation is given.

In your README you mention that the "b 1 1" operation is an MPI barrier.
Is there also a "sync" operation associated with this barrier? Even
without a sync, the fact that there is a barrier between each IO means
that each IO is waiting for the slowest client to complete the writes.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
Wei-keng Liao
2006-May-19 07:36 UTC
[Lustre-discuss] disable client-side caching on Lustre?
Andreas, I wonder how Lustre implements the POSIX read/write atomicity without locking, when a file is stored across more than one OST. Wei-keng On Tue, 14 Mar 2006, Andreas Dilger wrote:> On Mar 14, 2006 15:46 -0600, Wei-keng Liao wrote: >> On a file system that does no caching at all, I think maybe it is due to >> the conflict file locks that causes I/O serialization and hence the low >> I/O performance results. > > The clients do not get any locks either. On the other hand, we have > recently identified one problem which would negatively impact performace > in this environment, and Cray is working to make this available in their > next release. > >> On Mon, 6 Mar 2006, Andreas Dilger wrote: >>> On Mar 06, 2006 20:03 -0600, J. K. Cliburn wrote: >>>> On 3/6/06, J. K. Cliburn <jcliburn@gmail.com> wrote: >>>> I meant to say a 4,176 *processor* Cray XT3. 4,128 processors are >>>> embodied in dual Opteron compute nodes running the Catamount >>>> lightweight kernel. The remainder are 4-way service nodes running a >>>> variant of SLES9. >>> >>> Note that the Catamount Lustre clients are _completely_ different than >>> the Linux VFS Lustre clients. The Catamount clients cannot do ANY write >>> caching or asynchronous writes because the operating environment does >>> not support interrupts for notification of IO completion. The writes >>> must be completed during each syscall, and are as a result completely >>> synchronous. > > Cheers, Andreas > -- > Andreas Dilger > Principal Software Engineer > Cluster File Systems, Inc. >
Wei-keng Liao
2006-May-19 07:36 UTC
[Lustre-discuss] disable client-side caching on Lustre?
Andreas, The BTIO I am running does I/O through MPI-IO. The FLASH I/O calls HDF5 I/O APIs which internally is implemented on top of MPI-IO. FLASH I/O writes data into files in HDF5 format which contains both metadata and array data. For writing metadata, there are lots of small writes, but for array data, writes are in large amount. As for BTIO, it is user''s option to run through MPI collective or non-collective I/O. When collective I/O is used (which is I only use), the reads/writes are all in large amount. Both benchmarks read/write global arrays from/tom files. The actual read/write amount in each processes depends on the total number of processes running the job. Therefore, they are seldom aligned with 512-byte boundary. In addition, since MPI collective I/O is used, reads/writes are not interleaved. I have collected the I/O trace, the read/write offset-length pairs, for the two benchmarks. Let me know if you are interested. I personally believe the access patterns used in these two benchmarks appear frequently in many scientific applications. It is important to see Lustre provide a reasonable performance for them. I ran these benchmarks on PVFS which does no client-side caching and got a very good result (both Lustre and pvfs ran on the same machine, Tungsten @NCSA). This is why I am seeking to disable caching on Lustre. You mentioned that ADIO layer for Lustre has not been implemented yet. What is the plan for this in the near future? Can you also shed light on why ADIO makes bad assumption for Lustre I/O? All I can see is ADIO regards Lustre as a UFS and only POSIX I/O calls (open,read,write) are used. I also wonder if caching is disabled if a file is open in O_WRONLY, since it makes little sense to do caching in this case, unless for aggregating small writes into large ones. Does Lustre use any I/O benchmark that can demonstrate the benefit of caching? Please let me know. I am very interested in seeing the I/O patterns of it. Thanks. Wei-keng On Mon, 6 Mar 2006, Andreas Dilger wrote:> On Mar 06, 2006 01:14 -0600, Wei-keng Liao wrote: >> I meant both read and write caching. I am running two I/O benchmarks: >> FLASH I/O and BTIO, where FLASH I/O has only write operations and BTIO has >> reads after writes (file closed in between). I got very bad performance >> results for both on Lustre and are trying to see if it is due to caching. >> O_DIRECT is not applicable here because of the non-aligned read-write >> amount. > > I was able to find some information about the BTIO and FLASH benchmarks, > but these are not tests that we have run ourselves. My brief reading on > these benchmarks indicates that they are doing a lot of small-size IO > requests striped throughout the file, instead of large contiguous IO > requests. Is that correct? Do you have any information on e.g. the size > of the IO requests? You already mentioned that the IO is not aligned, > which also hurts performance. > > Do these tests also do their IO using MPIIO? We have had problems in the > past when the ADIO layer was making bad assumptions for Lustre IO. We > have not yet worked on any Lustre-specific ADIO layer, but this may allow > a large performance improvement for some kinds of applications. > >> For applications performs only write operations (and no repeated or >> overlapped writes), caching causes unnecessary memory copies and thus the >> bad performance. Similar for non-repeated reads only (although prefetch >> can also help). 
> > One issue is that even if the write segments are not overlapping, if they > are not aligned on PAGE_SIZE (4k normally) boundaries, then they can not > be handled by the linux VM efficiently, and result in a read-modify-write > cycle for each of the "boundary" pages. Lustre also aligns the DLM locks > on client PAGE_SIZE boundaries, so this can cause artificial contention > among the writes. > > In any case, I suspect that the root of the problem is the small, > interleaved, and unaligned IO pattern, and not the cache itself. > >> As a Lustre user, I would like to know if I can disable read and write >> caching for the benchmarks I tested, without root permission to change >> file system configuration. > > No, this isn''t possible (excluding O_DIRECT). The reason O_DIRECT allows > uncached IO is because the kernel is using the user-supplied data buffers > to do the IO directly to the filesystem. Since the filesystem layer has > to work with aligned buffers, this imposes the alignmen restrictions. > Working with unaligned data buffers imposes the data copying overhead. > > Cheers, Andreas > -- > Andreas Dilger > Principal Software Engineer > Cluster File Systems, Inc. >
Andreas Dilger
2006-May-19 07:36 UTC
[Lustre-discuss] disable client-side caching on Lustre?
On Mar 06, 2006 11:44 -0600, Wei-keng Liao wrote:
> The BTIO I am running does I/O through MPI-IO. The FLASH I/O calls HDF5
> I/O APIs which internally is implemented on top of MPI-IO. FLASH I/O
> writes data into files in HDF5 format which contains both metadata and
> array data. For writing metadata, there are lots of small writes, but for
> array data, writes are in large amount... I have collected the I/O trace,
> the read/write offset-length pairs, for the two benchmarks. Let me know
> if you are interested.

Yes, it would be useful to see what kind of IO pattern is being generated.

> You mentioned that ADIO layer for Lustre has not been implemented yet.
> What is the plan for this in the near future? Can you also shed light on
> why ADIO makes bad assumption for Lustre I/O? All I can see is ADIO
> regards Lustre as a UFS and only POSIX I/O calls (open,read,write) are
> used.

As yet there has not been a strong customer demand for an ADIO layer for
Lustre. One of the things that could benefit a lot from a Lustre-specific
ADIO layer would be file striping parameters.

It is possible that your test is only running against a small number of
storage servers by default, which would not be optimal if there are lots
of clients concurrently reading/writing to the same file at one time. In
the vast majority of cases there is only a single client doing IO to each
file, so the default tuning parameters are set to optimize this case.

You can check this by running "lfs getstripe -v /path/to/file(s)" to check
the striping on the file(s). You can set the default striping parameters
for the output directory, and all NEW files created in that directory will
inherit this striping configuration:

$ ./lustre/utils/lfs setstripe -h
Create a new file with a specific striping pattern or
set the default striping pattern on an existing directory or
delete the default striping pattern from an existing directory

usage: setstripe <filename|dirname> <stripe_size> <stripe_start> <stripe_count>
       or
       setstripe -d <dirname>   (to delete default striping)
        stripe_size:  Number of bytes on each OST (0 filesystem default)
        stripe_start: OST index of first stripe (-1 filesystem default)
        stripe_count: Number of OSTs to stripe over (0 default, -1 all)

The stripe_start parameter should be left at -1, but for shared-file writes
the stripe_count should be increased to get the maximum bandwidth.

> I also wonder if caching is disabled if a file is open in O_WRONLY, since
> it makes little sense to do caching in this case, unless for aggregating
> small writes into large ones.

The cache is a function of the linux VM subsystem and is not implemented
specifically in lustre. Only the delayed-write mechanism to aggregate
writes for sending larger RPCs is specific to Lustre.

> Does Lustre use any I/O benchmark that can demonstrate the benefit of
> caching? Please let me know. I am very interested in seeing the I/O
> patterns of it. Thanks.

IOR is a good example of a benchmark that can see benefit from the
cache. In some earlier versions, if the write size does not exceed RAM
size, then the read rates approach memory bandwidth. In later versions
of IOR the read algorithm was changed so that clients would not read back
the data they just wrote, so that they do not read from cache.

Similarly, for IO that is sequential but not page aligned or not in
multiples of full pages the IO rate is improved because the RPCs are large
and efficient, instead of only the size of the read/write request.
Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
Wei-keng Liao
2006-May-19 07:36 UTC
[Lustre-discuss] disable client-side caching on Lustre?
Andreas, Attached is a collection of offset-length pairs for FLASH-IO and BTIO. I have been playing different stripe size and no. OSTs on Tungsten@NCSA which has default 4MB stripe size and stripe over 8 OSTs. Changing the stripe size and no. OSTs does not help the performance a lot for FLASH and BTIO. I tried 64KB, 512KB, 1MB stripe size and 8 and 16 OSTs. Numbers of clients I ran range from 4 to 64. However, in both FLASH and BTIO, all clients read/write to a shared file. Maybe this is what causes the bad performance on Lustre. (The I/O bandwidth I got for FLASH-IO is less than 10 MB/s and less than 80 MB/s for BTIO, on Tungsten which potentially can provide 11 GB/s I/O bandwidth.) So, my guess is that Lustre has not been optimized for concurrent or parallel I/O operations yet. As a heavy MPI-IO user, I wish there will be more demands on this aspect in the future :( I ran IOR on Lustre recently and it does not have any pattern that can be used to show the effect of client-side caching. I also wrote a simple program to test I/O bandwidth on Tungsten. I found the O_DIRECT gives a huge performance boost if the buffer is aligned. The only change in my test code is just with or without O_DIRECT. This is one of the reasons that I suspect the Lustre''s caching effect. I also ran FLASH and BTIO on IBM GPFS which also does caching and distributed file locking like Lustre. The performance on GPFS is much better (even better than PVFS). I think maybe Lustre development group would be interested in finding the causes of the bad performance. Wei-keng On Mon, 6 Mar 2006, Andreas Dilger wrote:> On Mar 06, 2006 11:44 -0600, Wei-keng Liao wrote: >> The BTIO I am running does I/O through MPI-IO. The FLASH I/O calls HDF5 >> I/O APIs which internally is implemented on top of MPI-IO. FLASH I/O >> writes data into files in HDF5 format which contains both metadata and >> array data. For writing metadata, there are lots of small writes, but for >> array data, writes are in large amount... I have collected the I/O trace, >> the read/write offset-length pairs, for the two benchmarks. Let me know >> if you are interested. > > Yes, it would be useful to see what kind of IO pattern is being generated. > >> You mentioned that ADIO layer for Lustre has not been implemented yet. >> What is the plan for this in the near future? Can you also shed light on >> why ADIO makes bad assumption for Lustre I/O? All I can see is ADIO >> regards Lustre as a UFS and only POSIX I/O calls (open,read,write) are >> used. > > As yet there has not been a strong customer demand for an ADIO layer for > Lustre. One of the things that could benefit a lot from a Lustre-specific > ADIO layer would be file striping parameters. > > It is possible that your test is only running against a small number of > storage servers by default, which would not be optimal if there are lots > of clients concurrently reading/writing to the same file at one time. In > the vast majority of cases there is only a single client doing IO to each > file so the default tuning parameters are set to optimize this case. > > You can check this by running "lfs getstripe -v /path/to/file(s)" to check > the striping on the file(s). 
You can set the default striping parameters > for the output directory, and all NEW files created in that directory will > inherit this striping configuration: > > $ ./lustre/utils/lfs setstripe -h > Create a new file with a specific striping pattern or > set the default striping pattern on an existing directory or > delete the default striping pattern from an existing directory > usage: setstripe <filename|dirname> <stripe_size> <stripe_start> <stripe_count> > or > setstripe -d <dirname> (to delete default striping) > stripe_size: Number of bytes on each OST (0 filesystem default) > stripe_start: OST index of first stripe (-1 filesystem default) > stripe_count: Number of OSTs to stripe over (0 default, -1 all) > > The stripe_start parameter should be left at -1, but for shared-file writes > the stripe_count should be increased to get the maximum bandwidth. > >> I also wonder if caching is disabled if a file is open in O_WRONLY, since >> it makes little sense to do caching in this case, unless for aggregating >> small writes into large ones. > > The cache is a function of the linux VM subsystem and is not implemented > specifically in lustre. Only the delayed-write mechanism to aggregate > writes for sending larger RPCs is specific to Lustre. > >> Does Lustre use any I/O benchmark that can demonstrate the benefit of >> caching? Please let me know. I am very interested in seeing the I/O >> patterns of it. Thanks. > > IOR is a good example of a benchmark that can can see benefit from the > cache. In some earlier versions, if the write size does not exceed RAM > size, then the read rates approach memory bandwidth. In later versions > of IOR the read algorithm was changed so that clients would not read back > the data they just wrote, so that they do not read from cache. > > Similarly, for IO that is sequential but not page aligned or multiples > of full pages the IO rate is improved because the RPCs are large and > efficient, instead of only the size of the read/write request. > > Cheers, Andreas > -- > Andreas Dilger > Principal Software Engineer > Cluster File Systems, Inc. >-------------- next part -------------- A non-text attachment was scrubbed... Name: offset_len.tar.gz Type: application/octet-stream Size: 178469 bytes Desc: Url : http://mail.clusterfs.com/pipermail/lustre-discuss/attachments/20060306/2d299b5d/offset_len.tar.obj
Wei-keng Liao
2006-May-19 07:36 UTC
[Lustre-discuss] disable client-side caching on Lustre?
Don,

Please discard those BTIO files. I generated these offset-length pairs on a
4-node machine. ROMIO internally figures this out and lets only 4 processes
perform the read()/write() calls. I will try to generate the true
offset-length pairs for different numbers of clients. For now, please just
use BTIO.A.4 and BTIO.B.4. The FLASH-IO offset-length files seem fine.

Here is a brief description of BTIO: BTIO performs 40 MPI collective writes
followed by 40 collective reads. A 3-D array partitioned among clients in a
block-block-block fashion is written to and then read back from a single
file. In each collective call, the offset-length pairs are re-arranged such
that every process makes a single contiguous read()/write() call to the
file system. This is why you see the sequential read patterns, and this is
also why caching is not helping here. If you read the line segments in
groups of 4, each group corresponding to an MPI collective read/write, the
BTIO pattern becomes very clear. If a different number of clients is used,
the patterns should look similar.

Thanks for making this chart, a very good one.

Wei-keng

On Mon, 6 Mar 2006, Iozone wrote:
> Wei-keng Liao,
>
> Looking at your BTIO.A.16 data set, here is what I see :-)
>
> The reader phase looks pure sequential for BTIO.A.16
>
> Enjoy,
> Don Capps
J. K. Cliburn
2006-May-19 07:36 UTC
[Lustre-discuss] disable client-side caching on Lustre?
On 3/6/06, Wei-keng Liao <wkliao@ece.northwestern.edu> wrote:> if you can briefly present your similar findings in the discuss > group, it may increase the chance for Lustre to pay more attentions on > parallel I/O :)I''ll see what I can pull together from our group. The machine (a 4,176 node Cray XT3) is in semi-production at the moment, but we''re trying to squeeze in some I/O tests where we can. The initial results, while not exhaustive by any means, are disappointing. With past parallel systems we''ve owned, one could -- to a varying extent -- always improve I/O performance from a baseline by organizing reads and writes to better fit the underlying parallel filesystem (PFS, [C]XFS, GPFS, in our experience). This is our first experience with Lustre, and we don''t expect to avoid the I/O improvement curve this time, but my chief complaint in the early going is that what I''m calling the "baseline" seems so low; a few (or at best 10s) MB/sec on a filesystem billed to deliver 100s or 1000s MB/sec. I''m concerned that our users are going to have to significantly modify their codes to achieve reasonably acceptable I/O rates. Unfortunately, most of these users operate on budgets that don''t include code-hacking funds; their raison d''etre is to conduct computational science. I''m interested in hearing from any list members who have experience with Lustre operating with a variety of parallel codes employing a wide range of node counts, large and small. Any tips you can provide as our group goes forward will help. Best regards, Jay
J. K. Cliburn
2006-May-19 07:36 UTC
[Lustre-discuss] disable client-side caching on Lustre?
On 3/6/06, J. K. Cliburn <jcliburn@gmail.com> wrote:> The machine (a 4,176 node Cray XT3)I meant to say a 4,176 *processor* Cray XT3. 4,128 processors are embodied in dual Opteron compute nodes running the Catamount lightweight kernel. The remainder are 4-way service nodes running a variant of SLES9.
Wei-keng Liao
2006-May-19 07:36 UTC
[Lustre-discuss] disable client-side caching on Lustre?
Don, Maybe I did not clearly describe the file format in the README. In each file, the first column represents the client id, the 2nd column is the I/O operation (o:open, r: read, w: write, c: close, b: global synchronization), the 3rd column is the filename for open and offset for reads and writes, the 4th column is the length of reads and writes. I ran BTIO with the option of using MPI collective calls. If you look at a particular client, say 2, the I/O trace is always a read/write followed by a synchronization. For example, output from command: grep " 2 " BTIO.A.4 2 o btio.full.out 2 w 5242880 2621440 2 b 1 1 2 w 15728640 2621440 2 b 1 1 2 w 26214400 2621440 2 b 1 1 ... These files must be interpreted by a group of n processes. In the case of BTIO.A.4, n is 4. That is, the read()/write() in every group of 4 occurs concurrently and these groups happen in a sequence of timely manner. BTIO performs 40 MPI collective writes followed by 40 reads. Each collective write appends the data after the previous write. So, the I/O pattern is actually all sequential. In this pattern, caching is only good for write-behind in which small data are accumulated and later written in a big chunk to utilize the network bandwidth. I do agree with you that the fine-tuned stripe size and stripe factor matching exact the application''s I/O pattern should generate optimal performance. I also believe even the stripe size and factor are not optimal, as long as they won''t result in extremely many I/O calls to OSTs, the performance should be reasonable close to the optimal. In fact, no matter how you choose stripe sizes, it is always the same amount of I/O from a group of clients to a group of OSTs. Unfortunately, I got very low I/O bandwidth. I tried 64K, 512K, 1M, and 4M but got < 80 MB/s on a machine that can sustain 11 GB/s peak bandwidth. Hence, I suspect caching is not helping here, instead, worsens the performance. Wei-keng ps. I re-post your charts to the discussion group. They are very nice and should be shared with others. On Mon, 6 Mar 2006, Iozone wrote:> > ----- Original Message ----- > From: "Wei-keng Liao" <wkliao@ece.northwestern.edu> > To: "Iozone" <capps@iozone.org> > Cc: <lustre-discuss@clusterfs.com> > Sent: Monday, March 06, 2006 5:47 PM > Subject: Re: [Lustre-discuss] disable client-side caching on Lustre? > > >> Don, >> >> This is exactly the BTIO access pattern. No repeats exist among writes and >> no repeats among reads. For the write phase, all writes are disjoint. >> Same for the read phase. >> >> Wei-keng >> >> > > Wei-keng, > > The BTIO.A.4 writer looks like an oscillating strided writer, > but the BTIO.A.4 reader looks sequential. > > > > > A client side cache may, or may not, benefit the application''s performance. > It would depend on the size of the client side cache. If it were big enough > to encapsulate the entire data set, then the reader, and writer, could run at > memory speeds. If not, then it would depend on the page/buffer cache > replacement algorithm, the alignment of the buffers, the number of controllers, > the number of disks, the memory bandwidth, and the VM subsystem''s > interaction, should it detect VM pressure. > > It''s likely that a striped filesystem could help the strided writer > phase, as the I/Os could be overlapped across controllers, and > disks. It looks like BTIO.A.4 would like the stripe size to be > 256kbytes. If your stripe size is smaller, or larger, than this > typical transfer size, then performance would not be optimal. 
> > If the system''s read-ahead were deep enough, then the > stripe might help it too. > > > Enjoy, > Don Capps >-------------- next part -------------- A non-text attachment was scrubbed... Name: BTIO.A.4_reader.jpg Type: image/jpeg Size: 134474 bytes Desc: Url : http://mail.clusterfs.com/pipermail/lustre-discuss/attachments/20060306/7ffd512d/BTIO.A.4_reader.jpg -------------- next part -------------- A non-text attachment was scrubbed... Name: BTIO.A.4.jpg Type: image/jpeg Size: 143527 bytes Desc: Url : http://mail.clusterfs.com/pipermail/lustre-discuss/attachments/20060306/7ffd512d/BTIO.A.4.jpg
Wei-keng Liao
2006-May-19 07:36 UTC
[Lustre-discuss] disable client-side caching on Lustre?
Andreas, The "b 1 1" in the I/O trace files is indeed to indicate a synchronization among all processes. In an MPI collective I/O, every process must wait for all others to complete before returning the call. Sync here means the process'' execution thread, not I/O sync. I ran BTIO.A.4 on Tungsten@NCSA with stripe size=2621440 and number of OSTs=4 against the system default setting (stripe size=4MB, OSTs=8). Here is the result: stripe size = 2560KB stripe size = 4MB no. OSTs = 4 no. OSTs = 8 ------------------------------------------------------------------------ MPI file setup time = 0.11 sec 0.10 sec write 400.00 MB in 1.85 sec 3.20 sec read 400.00 MB in 1.42 sec 92.07 sec total I/O amount : 800.00 MB 800.00 MB number of I/Os : 80 80 ------------------------------------------------------------------------ nproc array size bandwidth (MB/s) bandwidth (MB/s) 4 64 x 64 x 64 236.60 8.39 The overall performance is enhanced significantly. I believe if I use more processors, larger I/O amount, the bandwidth can go even higher. But let''s check the read write performance separately for the non-optimal case, the default stripe size. I can see read is much worse than write. Could you comment on why the huge difference? If I/O is completely serialized due to file locking, how come read for default stripe size is 65 times slower on a run with 4 compute nodes. Applying locking on write seems reasonable, though. This also comes back to my question first posted: "disable client-side caching on Lustre?". If file locking is used to ensure the cache coherence on Lustre, can we imply that disable caching -> no locking -> better performance for non-overlap, write-only IO like BTIO? While tuning stripe size for a single access pattern shows good performance like above, it also create a problem. What if two different pattern I/Os need to be written to a single file. Even if the "future ADIO hints" for Lustre allow users to specify the stripe size, can a file''s stripe size + OSTs be changed during the open and close period? If yes, what will be the overhead of doing so? For the current release of Lustre, how can I specify stripe size and number of OSTs when creating a file? Of it has to be done off line with command "lfs"? Thanks. Wei-keng On Tue, 7 Mar 2006, Andreas Dilger wrote:> On Mar 06, 2006 14:51 -0600, Wei-keng Liao wrote: >> Attached is a collection of offset-length pairs for FLASH-IO and BTIO. >> >> I have been playing different stripe size and no. OSTs on Tungsten@NCSA >> which has default 4MB stripe size and stripe over 8 OSTs. Changing the >> stripe size and no. OSTs does not help the performance a lot for FLASH and >> BTIO. I tried 64KB, 512KB, 1MB stripe size and 8 and 16 OSTs. Numbers of >> clients I ran range from 4 to 64. However, in both FLASH and BTIO, all >> clients read/write to a shared file. Maybe this is what causes the bad >> performance on Lustre. (The I/O bandwidth I got for FLASH-IO is less than >> 10 MB/s and less than 80 MB/s for BTIO, on Tungsten which potentially can >> provide 11 GB/s I/O bandwidth.) So, my guess is that Lustre has not been >> optimized for concurrent or parallel I/O operations yet. As a heavy MPI-IO >> user, I wish there will be more demands on this aspect in the future :( > > Lustre has in fact been tested with single-file parallel IO for a long time, > and while performance is not as good as file-per-process, it has generally > been acceptable if the clients are doing large non-overlapping concurrent > writes. 
However, the IO pattern that single-file writes have been tuned > to is different than is shown here, with the heuristics being tuned for > tens-hundreds of clients doing aligned writes of 1MB or more concurrently. > > As this is the first time that I have seen complaints about IO performance > from this type of workload, there has not been any effort to tune lustre > for this load yet. I see that the lock policy doesn''t consider a single > file to be "contended" until there are at least 5 concurrent writers, and > not "highly contended" until there are at least 33 concurrent writers. > Most of the shared-file write tunings have been done with 400+ clients. > > > Looking at the start of one of the logs in question (OFFSET_LEN/BTIO.A4): > 1 o btio.full.out > 2 o btio.full.out > 3 o btio.full.out > 0 o btio.full.out > 1 w 2621440 2621440 > 1 b 1 1 > 3 w 7864320 2621440 > 3 b 1 1 > 0 w 0 2621440 > 0 b 1 1 > 2 w 5242880 2621440 > 2 b 1 1 > 0 w 10485760 2621440 > 3 w 18350080 2621440 > 2 w 15728640 2621440 > 3 b 1 1 > 2 b 1 1 > 0 b 1 1 > 1 w 13107200 2621440 > 1 b 1 1 > 2 w 26214400 2621440 > 2 b 1 1 > 0 w 20971520 2621440 > 3 w 28835840 2621440 > 1 w 23592960 2621440 > 3 b 1 1 > 0 b 1 1 > 1 b 1 1 > > It is clear from looking at this IO log that this application would be > optimal with a stripe size of 2621440 bytes and 4 stripes for the file, > or a multiple thereof. This would allow each process to essentially > write to its own file, and there would be no lock contention at all. > > lfs setstripe /path/to/output/btio.full.out 2621440 -1 4 > or > lfs setstripe /path/to/output/btio.full.out 1310720 -1 8 > or > lfs setstripe /path/to/output/btio.full.out 655360 -1 16 > etc. > > It is the lock contention that is essentially serializing the writes > from each client. It would be interesting to know if the ROMIO/MPIIO > layer would send the proper hints to a lustre-ADIO to allow it to create > the file in this manner? I admit I don''t know enough about the hints > that an ADIO implemention is given. > > In your README you mention that the "b 1 1" operation is an MPI barrier. > Is there also a "sync" operation associated with this barrier? Even > without a sync, the fact that there is a barrier between each IO means > that each IO is waiting for the slowest client to complete the writes. > > Cheers, Andreas > -- > Andreas Dilger > Principal Software Engineer > Cluster File Systems, Inc. >
Kumaran Rajaram
2006-May-19 07:36 UTC
[Lustre-discuss] disable client-side caching on Lustre?
Wei-keng, Please see below:>>>Wei-keng Liao <wkliao@ece.northwestern.edu> 03/07/06 1:17 pm >>>Andreas, I ran BTIO.A.4 on Tungsten@NCSA with stripe size=2621440 and number of OSTs=4 against the system default setting (stripe size=4MB, OSTs=8). Here is the result: stripe size = 2560KB stripe size = 4MB no. OSTs = 4 no. OSTs = 8 ------------------------------------------------------------------------ MPI file setup time = 0.11 sec 0.10 sec write 400.00 MB in 1.85 sec 3.20 sec read 400.00 MB in 1.42 sec 92.07 sec total I/O amount : 800.00 MB 800.00 MB number of I/Os : 80 80 ------------------------------------------------------------------------ nproc array size bandwidth (MB/s) bandwidth (MB/s) 4 64 x 64 x 64 236.60 8.39 The overall performance is enhanced significantly. I believe if I use more processors, larger I/O amount, the bandwidth can go even higher. But let''s check the read write performance separately for the non-optimal case, the default stripe size. I can see read is much worse than write. Could you comment on why the huge difference? If I/O is completely serialized due to file locking, how come read for default stripe size is 65 times slower on a run with 4 compute nodes. Applying locking on write seems reasonable, though. --> ROMIO collective I/O implementation is based on two-phase algortihm. My hunch is that since the I/O request size (400MB) is small, write returns as soon as data is written to the file-system cache (with no read-modify-write being performed). For read operation (two-phase I/O), if data access pattern causes cache miss and if concurrent process have overlapping segments, dirty cache might need to be flushed to disk and data read from the disks (costly). Also, ROMIO collective implementation, might incur communication costs as the number of nodes are scaled This also comes back to my question first posted: disable client-side caching on Lustre?. If file locking is used to ensure the cache coherence on Lustre, can we imply that disable caching -> no locking -> better performance for non-overlap, write-only IO like BTIO? While tuning stripe size for a single access pattern shows good performance like above, it also create a problem. What if two different pattern I/Os need to be written to a single file. Even if the future ADIO hints for Lustre allow users to specify the stripe size, can a file''s stripe size + OSTs be changed during the open and close period? If yes, what will be the overhead of doing so? ---> You can only set/change the file-stripping parameters when the file is initially created. MPI-IO allows this though file-info hints during MPI_File_open, MPI_File_set_view, MPI_File_set_info. For the current release of Lustre, how can I specify stripe size and number of OSTs when creating a file? Of it has to be done off line with command lfs? --> In older Lustre releases, you have to use ioctl() calls along with LL_IOC_LOV parameters (defined in lustre_user.h) prior to file create. Its been long since I played with ioctls(), so Iam not sure if this holds true in the current 1.4.6 release. HTH, -Kums Thanks. Wei-keng On Tue, 7 Mar 2006, Andreas Dilger wrote:>On Mar 06, 2006 14:51 -0600, Wei-keng Liao wrote: >>Attached is a collection of offset-length pairs for FLASH-IO and BTIO. >> >>I have been playing different stripe size and no. OSTs on Tungsten@NCSA >>which has default 4MB stripe size and stripe over 8 OSTs. Changing the >>stripe size and no. OSTs does not help the performance a lot for FLASH and >>BTIO. 
I tried 64KB, 512KB, 1MB stripe size and 8 and 16 OSTs. Numbers of >>clients I ran range from 4 to 64. However, in both FLASH and BTIO, all >>clients read/write to a shared file. Maybe this is what causes the bad >>performance on Lustre. (The I/O bandwidth I got for FLASH-IO is less than >>10 MB/s and less than 80 MB/s for BTIO, on Tungsten which potentially can >>provide 11 GB/s I/O bandwidth.) So, my guess is that Lustre has not been >>optimized for concurrent or parallel I/O operations yet. As a heavy MPI-IO >>user, I wish there will be more demands on this aspect in the future :( > >Lustre has in fact been tested with single-file parallel IO for a long time, >and while performance is not as good as file-per-process, it has generally >been acceptable if the clients are doing large non-overlapping concurrent >writes. However, the IO pattern that single-file writes have been tuned >to is different than is shown here, with the heuristics being tuned for >tens-hundreds of clients doing aligned writes of 1MB or more concurrently. > >As this is the first time that I have seen complaints about IO performance >from this type of workload, there has not been any effort to tune lustre >for this load yet. I see that the lock policy doesn''t consider a single >file to be contended until there are at least 5 concurrent writers, and >not highly contended until there are at least 33 concurrent writers. >Most of the shared-file write tunings have been done with 400+ clients. > > >Looking at the start of one of the logs in question (OFFSET_LEN/BTIO.A4): >1 o btio.full.out >2 o btio.full.out >3 o btio.full.out >0 o btio.full.out >1 w 2621440 2621440 >1 b 1 1 >3 w 7864320 2621440 >3 b 1 1 >0 w 0 2621440 >0 b 1 1 >2 w 5242880 2621440 >2 b 1 1 >0 w 10485760 2621440 >3 w 18350080 2621440 >2 w 15728640 2621440 >3 b 1 1 >2 b 1 1 >0 b 1 1 >1 w 13107200 2621440 >1 b 1 1 >2 w 26214400 2621440 >2 b 1 1 >0 w 20971520 2621440 >3 w 28835840 2621440 >1 w 23592960 2621440 >3 b 1 1 >0 b 1 1 >1 b 1 1 > >It is clear from looking at this IO log that this application would be >optimal with a stripe size of 2621440 bytes and 4 stripes for the file, >or a multiple thereof. This would allow each process to essentially >write to its own file, and there would be no lock contention at all. > >lfs setstripe /path/to/output/btio.full.out 2621440 -1 4 >or >lfs setstripe /path/to/output/btio.full.out 1310720 -1 8 >or >lfs setstripe /path/to/output/btio.full.out 655360 -1 16 >etc. > >It is the lock contention that is essentially serializing the writes >from each client. It would be interesting to know if the ROMIO/MPIIO >layer would send the proper hints to a lustre-ADIO to allow it to create >the file in this manner? I admit I don''t know enough about the hints >that an ADIO implemention is given. > >In your README you mention that the b 1 1 operation is an MPI barrier. >Is there also a sync operation associated with this barrier? Even >without a sync, the fact that there is a barrier between each IO means >that each IO is waiting for the slowest client to complete the writes. > >Cheers, Andreas >-- >Andreas Dilger >Principal Software Engineer >Cluster File Systems, Inc. >Lustre-discuss mailing list Lustre-discuss@clusterfs.com https://mail.clusterfs.com/mailman/listinfo/lustre-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.clusterfs.com/pipermail/lustre-discuss/attachments/20060307/945ce5a5/attachment.html
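For completeness, here is a sketch of the ioctl() route Kums describes, based on the lov_user_md structure and LL_IOC_LOV_SETSTRIPE ioctl exposed by lustre_user.h in Lustre releases of that era. The exact header path, field names, the O_LOV_DELAY_CREATE open flag, and the requirement that striping be set before any data is written are assumptions that should be checked against the installed header and release notes:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <lustre/lustre_user.h>   /* struct lov_user_md, LL_IOC_LOV_SETSTRIPE */

int main(void)
{
    struct lov_user_md lum;

    /* Create the file without instantiating its objects, then set striping
     * with the ioctl before writing any data. */
    int fd = open("/mnt/lustre/btio.full.out",
                  O_CREAT | O_EXCL | O_WRONLY | O_LOV_DELAY_CREATE, 0644);
    if (fd < 0) { perror("open"); return 1; }

    memset(&lum, 0, sizeof(lum));
    lum.lmm_magic         = LOV_USER_MAGIC;
    lum.lmm_stripe_size   = 2621440;   /* bytes per OST object */
    lum.lmm_stripe_count  = 4;         /* number of OSTs */
    lum.lmm_stripe_offset = 0xffff;    /* i.e. -1: let the MDS pick the first OST */

    if (ioctl(fd, LL_IOC_LOV_SETSTRIPE, &lum) < 0)
        perror("LL_IOC_LOV_SETSTRIPE");

    close(fd);
    return 0;
}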
Andreas Dilger
2006-May-19 07:36 UTC
[Lustre-discuss] disable client-side caching on Lustre?
On Mar 07, 2006 14:17 -0600, Wei-keng Liao wrote:
> I ran BTIO.A.4 on Tungsten@NCSA with stripe size=2621440 and number of
> OSTs=4 against the system default setting (stripe size=4MB, OSTs=8). Here
> is the result:
>
>                             stripe size = 2560KB   stripe size = 4MB
>                             no. OSTs = 4           no. OSTs = 8
>  ------------------------------------------------------------------------
>  MPI file setup time =      0.11 sec               0.10 sec
>  write 400.00 MB in         1.85 sec               3.20 sec
>  read  400.00 MB in         1.42 sec               92.07 sec
>  total I/O amount :         800.00 MB              800.00 MB
>  number of I/Os   :         80                     80
>  ------------------------------------------------------------------------
>  nproc   array size         bandwidth (MB/s)       bandwidth (MB/s)
>  4       64 x 64 x 64       236.60                 8.39
>
> The overall performance is enhanced significantly. I believe that if I use
> more processors and a larger I/O amount, the bandwidth can go even higher.

Good, at least this part of the problem is understood. It is a lot easier
for me to understand what sort of problems are being seen when the data
is presented in this format (separate times for read and write) compared
to a single "bandwidth" number.

I also agree that with more OSTs and more clients it is likely that the
aggregate IO performance can be increased here.

> But let's check the read and write performance separately for the
> non-optimal case, the default stripe size. I can see that read is much
> worse than write. Could you comment on why the huge difference? If I/O is
> completely serialized due to file locking, how come read for the default
> stripe size is 65 times slower on a run with 4 compute nodes? Applying
> locking on write seems reasonable, though.

There needs to be locking on both read and write, to ensure that the
clients always see the correct data. Lustre implements full POSIX
read/write semantics and fine-grained (though page-aligned) file locks.
During read the clients can share locks on the same file, so there should
not be any conflicts there.

I suspect that one reason the read calls are much slower is that this time
also includes a significant fraction of time spent writing the data from
the client to disk. If there was an fsync (or MPI-IO equivalent) call after
the write, I suspect the read "performance" would increase, at the expense
of increasing the write time.

> This also comes back to my question first posted: "disable client-side
> caching on Lustre?". If file locking is used to ensure cache coherence
> on Lustre, can we infer that disabling caching -> no locking -> better
> performance for non-overlapping, write-only IO like BTIO?

While this might be true in theory, there is currently no mechanism other
than O_DIRECT to disable the client-side cache. It is possible (though not
supported) to disable client-side locking of the file, but this can only
be used in conjunction with O_DIRECT to ensure that the client does not
cache any data without a lock. In this case the application is wholly
responsible for ensuring the integrity of the file.

> While tuning the stripe size for a single access pattern shows good
> performance like the above, it also creates a problem. What if I/O with
> two different patterns needs to be written to a single file?

Yes, this would indeed pose a problem. In some cases it may be possible
to have a "lowest common denominator" for the stripe configuration, but
if the IO isn't aligned then even that won't help.

I agree that there is likely a lot more that can be done to improve this
situation, but so far there have been very few requirements in this
direction, so it may take some time to understand and resolve the issues.

> Even if the "future ADIO hints" for Lustre allow users to specify
> the stripe size, can a file's stripe size + OSTs be changed during the
> open and close period? If yes, what would be the overhead of doing so?

It is not possible to change the striping after the file has been created.

> For the current release of Lustre, how can I specify the stripe size and
> number of OSTs when creating a file? Or does it have to be done offline
> with the "lfs" command?

As Kums says, it is possible to specify the striping via ioctl() directly
on a new file, but there is also a small C library (liblustreapi.a) to
abstract such operations and give applications a nicer interface:

    #include <lustre/liblustreapi.h>

    int llapi_file_create(char *name, long stripe_size, int stripe_offset,
                          int stripe_count, int stripe_pattern)

    name           = pathname of file
    stripe_size    = number of bytes to put in each object (OST)
    stripe_offset  = starting OST index (should be -1 in the majority of cases)
    stripe_count   = number of stripes (OSTs) on which to create the file
    stripe_pattern = Lustre stripe pattern (only '0' is currently supported)

    return is 0 on success, or -ve unix error number

Link with "-llustreapi".

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
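For completeness, a minimal sketch of the interface described above, creating
the BTIO output file with the 2621440-byte stripe size and 4 stripes suggested
earlier in the thread; the mount-point path is illustrative:

    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <lustre/liblustreapi.h>

    int main(void)
    {
        /* stripe_offset -1 lets Lustre choose the starting OST;
         * stripe_pattern 0 is the only supported pattern. */
        int rc = llapi_file_create("/mnt/lustre/btio.full.out",
                                   2621440,   /* stripe_size    */
                                   -1,        /* stripe_offset  */
                                   4,         /* stripe_count   */
                                   0);        /* stripe_pattern */
        if (rc != 0)
            fprintf(stderr, "llapi_file_create failed: %s\n", strerror(-rc));
        return rc ? 1 : 0;
    }

Compile and link with -llustreapi, as noted above.
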
Andreas Dilger
2006-May-19 07:36 UTC
[Lustre-discuss] disable client-side caching on Lustre?
On Mar 08, 2006 15:43 -0600, Wei-keng Liao wrote:
> To verify your suspicion about data flushing in the read phase, I ran BTIO
> with fsync() for the following 2 cases:
> 1) add a single fsync() after all writes complete and before reads start
> 2) add an fsync() after each individual write.
>
> I only tested the default stripe size and OSTs, which is not optimized for
> BTIO on Lustre. Unfortunately, the results of both cases are similar to
> the one without calling fsync(). The read phase is still much longer than
> the write phase. I ran a couple of times and the results are all the same.
> So, caching/flushing is not the cause of this strange behavior.
>
> I would suspect the locking mechanism. (Maybe it's an implementation issue
> of deadlock, timeout, etc.) Consider this: even if I/O is completely
> serialized (from 4 processors) due to file locking, the timing/bandwidth
> should not be much worse than 4 times the optimal case, assuming there is
> no contention at all in the optimal case.

I agree, and for the write part the unoptimized performance is about 70%
slower than the optimized case (or 245% slower if you consider that the
unoptimized test has twice as many stripes). It isn't at all clear why the
read portion is so much slower for the unoptimized case, especially since
read locks do not conflict with each other.

> What is the locking design used by Lustre? Is Lustre using distributed
> file locking?

Lustre uses server-based extent locking. It is not fully distributed
locking in the sense that the "lock master" for a single resource could be
on different nodes. Rather, each storage server does locking for the data
that it controls, and the locking scalability of each file increases as the
number of stripes in the file increases (along with the potential IO
bandwidth).

The locks themselves are read and write extent locks, aligned on 4kB
boundaries (in most cases at least). Since the most common usage case is a
single client doing writes to the file, the OST optimistically grows the
extent (up to a full-file lock for the first writer) to reduce the number
of future lock requests. Consider the case where a process (e.g. cp) is
doing writes in 4kB chunks: we don't want to issue a lock request for each
write. If there are multiple competing writers then the lock must be
revoked and split into smaller non-overlapping extents, with a heuristic to
limit the extent size if there are "many" writers for the same object.
Five lockers stop downward lock growth, and 33 lockers cap the lock extent
size at max(requested size, 32MB), subject to other locking constraints.
This heuristic doesn't seem to fit your usage pattern very well at all.

Readers always get full-file extent locks unless there are write locks,
because many clients can have overlapping read locks without conflict. It
is of course possible to also read data under a write lock.

One possibility is that there is a bad interaction with the client
readahead. The client readahead may read a lot of data from parts of the
file it doesn't need (yet), so there is a potential (num_clients - 1)
overhead there. If the reads are done under a write lock, and that lock
needs to be revoked, then all the readahead would be discarded and possibly
read in again. Having the barriers between each read can also hamper this,
because the clients that have read data ahead of where they need it may not
get any chance to use that data before it is evicted due to a write lock
cancellation caused by another client issuing a conflicting read lock
request.
In the optimized case, each client is only doing read/write operations on a
single object, so it likely gets a lock only once on "its" object and then
never issues any more locks and doesn't cancel any other client's locks, so
all the read-ahead data is eventually used.

As a test, what happens if each client reads in all the file data it needs
at the beginning of the read phase, and then does all the computation in a
second step?

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
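A small sketch of the test suggested above: each client pulls its whole read
set into memory in one pass and then computes from the buffer, so no lock or
readahead traffic happens during the compute step. Plain POSIX calls are used
here, and the offset/length arguments are placeholders for a rank's actual
offset-length list:

    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* Read one contiguous extent of the shared file into memory.
     * Returns a malloc'd buffer the caller frees, or NULL on error. */
    static char *read_extent(const char *path, off_t offset, size_t len)
    {
        int fd = open(path, O_RDONLY);
        if (fd < 0)
            return NULL;

        char *buf = malloc(len);
        ssize_t got = buf ? pread(fd, buf, len, offset) : -1;
        close(fd);

        if (got != (ssize_t)len) {
            free(buf);
            return NULL;
        }
        return buf;    /* the compute phase then works from this buffer */
    }

In the striped-to-match case each rank's extent maps onto a single object, so
this read should complete with a single lock acquisition per rank.
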
Wei-keng Liao
2006-May-19 07:36 UTC
[Lustre-discuss] disable client-side caching on Lustre?
Andreas,

To verify your suspicion about data flushing in the read phase, I ran BTIO
with fsync() for the following 2 cases:
1) add a single fsync() after all writes complete and before reads start
2) add an fsync() after each individual write.

I only tested the default stripe size and OSTs, which is not optimized for
BTIO on Lustre. Unfortunately, the results of both cases are similar to the
one without calling fsync(). The read phase is still much longer than the
write phase. I ran a couple of times and the results are all the same. So,
caching/flushing is not the cause of this strange behavior.

I would suspect the locking mechanism. (Maybe it's an implementation issue
of deadlock, timeout, etc.) Consider this: even if I/O is completely
serialized (from 4 processors) due to file locking, the timing/bandwidth
should not be much worse than 4 times the optimal case, assuming there is
no contention at all in the optimal case.

What is the locking design used by Lustre? Is Lustre using distributed file
locking?

Wei-keng
I am looking for more information about using O_DIRECT. According to the
Lustre 1.4.6 manual, page 50, it says "For more information about the pros
and cons of using Direct I/O with Lustre, see Performance Concepts." Where
can I find this "Performance Concepts" document?

Wei-keng
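While that document is tracked down, the general shape of an O_DIRECT open on
Linux looks like the sketch below. The 4096-byte alignment is an assumption
for illustration; the exact alignment requirement depends on the kernel and
filesystem:

    #define _GNU_SOURCE     /* for O_DIRECT */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    /* Write nbytes of zero-filled data with the page cache bypassed.
     * Both the buffer address and the transfer size must satisfy the
     * filesystem's O_DIRECT alignment rules; 4096 is assumed here. */
    int write_direct(const char *path, size_t nbytes)
    {
        void *buf;
        if (nbytes % 4096 != 0 || posix_memalign(&buf, 4096, nbytes) != 0)
            return -1;
        memset(buf, 0, nbytes);          /* placeholder payload */

        int fd = open(path, O_WRONLY | O_CREAT | O_DIRECT, 0644);
        if (fd < 0) {
            free(buf);
            return -1;
        }

        ssize_t written = write(fd, buf, nbytes);
        close(fd);
        free(buf);
        return written == (ssize_t)nbytes ? 0 : -1;
    }
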
Wei-keng Liao
2006-May-19 07:36 UTC
[Lustre-discuss] disable client-side caching on Lustre?
Jay,

The zipped tarball I sent out has no timing data; it contains just
offset-length pairs. As I indicated in my earlier email, my experience was
on Tungsten@NCSA:
http://www.ncsa.uiuc.edu/UserInfo/Resources/Hardware/XeonCluster/TechSummary/

The bad performance numbers alone cannot go to any conference, which is why
I would like to figure out the cause. Lustre is supposed to have great
potential for scalable I/O on parallel machines. I plan to summarize the
results I collected recently and provide them to the Lustre development
group if they are willing to dig into this problem. In addition, if you can
briefly present your similar findings to the discussion group, it may
increase the chance of Lustre paying more attention to parallel I/O :)

Wei-keng

On Mon, 6 Mar 2006, J. K. Cliburn wrote:
> On 3/6/06, Wei-keng Liao <wkliao@ece.northwestern.edu> wrote:
>> Attached is a collection of offset-length pairs for FLASH-IO and BTIO.
>
> I didn't see any timings in your data. I'm very interested in your results
> because we're seeing marginal performance similar to your description on a
> Lustre filesystem attached to a very large parallel machine.
>
> Do you intend to present your results at a conference, or were the data
> collected for your own project's benefit? How could I get a copy of your
> findings?
>
> Thanks,
> Jay
Wei-keng Liao
2006-May-19 07:36 UTC
[Lustre-discuss] disable client-side caching on Lustre?
Don,

This is exactly the BTIO access pattern. No repeats exist among the writes
and no repeats among the reads. For the write phase, all writes are
disjoint. The same holds for the read phase.

Wei-keng

On Mon, 6 Mar 2006, Iozone wrote:
>> Don,
>>
>> Please discard those BTIO files. I generated these offset-length pairs on
>> a 4-node machine. ROMIO internally figures this out and lets only 4
>> processes perform the read()/write() calls. I will try to generate the
>> true offset-length pairs for different numbers of clients. For now, please
>> just use BTIO.A.4 and BTIO.B.4. The FLASH-IO offset-length files seem fine.
>
> Wei-Keng,
>
> Ok... Using BTIO.A.4 and looking at the writer, it's still an
> oscillating strided writer :-)
>
> Enjoy,
> Don Capps
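To make the write pattern under discussion easier to inspect, here is a small,
hypothetical reader for these offset-length trace files. It assumes each
record is "rank op arg1 arg2", with op 'w' meaning a write of arg2 bytes at
file offset arg1 and 'b' an MPI barrier, as the BTIO.A.4 excerpt quoted
earlier in the thread suggests; the 'o' open records carry a filename and are
skipped:

    #include <stdio.h>

    #define MAX_RANKS 1024

    /* Reads a trace on stdin and prints the seek gap before each write,
     * per rank; a constant non-zero gap indicates a strided writer. */
    int main(void)
    {
        char line[512];
        long last_end[MAX_RANKS];
        for (int i = 0; i < MAX_RANKS; i++)
            last_end[i] = -1;

        while (fgets(line, sizeof(line), stdin)) {
            long rank, off, len;
            char op;
            if (sscanf(line, "%ld %c %ld %ld", &rank, &op, &off, &len) != 4)
                continue;                 /* e.g. the "o filename" records */
            if (op != 'w' || rank < 0 || rank >= MAX_RANKS)
                continue;
            if (last_end[rank] >= 0)
                printf("rank %ld: gap of %ld bytes before write at offset %ld\n",
                       rank, off - last_end[rank], off);
            last_end[rank] = off + len;
        }
        return 0;
    }

Feeding BTIO.A.4 through this on stdin shows a constant gap between each
rank's successive fixed-size writes, which is the strided pattern noted above.
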
Andreas Dilger
2006-May-19 07:36 UTC
[Lustre-discuss] disable client-side caching on Lustre?
On Mar 06, 2006 20:03 -0600, J. K. Cliburn wrote:
> On 3/6/06, J. K. Cliburn <jcliburn@gmail.com> wrote:
> I meant to say a 4,176 *processor* Cray XT3. 4,128 processors are
> embodied in dual Opteron compute nodes running the Catamount
> lightweight kernel. The remainder are 4-way service nodes running a
> variant of SLES9.

Note that the Catamount Lustre clients are _completely_ different than the
Linux VFS Lustre clients. The Catamount clients cannot do ANY write caching
or asynchronous writes because the operating environment does not support
interrupts for notification of IO completion. The writes must be completed
during each syscall, and are as a result completely synchronous.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
Wei-keng Liao
2006-May-19 07:36 UTC
[Lustre-discuss] disable client-side caching on Lustre?
Andreas,

Please see below.

On Tue, 7 Mar 2006, Andreas Dilger wrote:
> As this is the first time that I have seen complaints about IO performance
> from this type of workload, there has not been any effort to tune Lustre
> for this load yet. I see that the lock policy doesn't consider a single
> file to be "contended" until there are at least 5 concurrent writers, and
> not "highly contended" until there are at least 33 concurrent writers.
> Most of the shared-file write tunings have been done with 400+ clients.

I do not quite understand this lock policy. What does Lustre do when it
doesn't consider a single file to be "contended"? What does it do in the
"highly contended" case? Can you point me to any further reading about this?

> While this might be true in theory, there is currently no mechanism other
> than O_DIRECT to disable client-side cache. It is possible (though not
> supported) to disable client-side locking of the file, but this can only
> be used in conjunction with O_DIRECT to ensure that the client does not
> cache any data without a lock. In this case the application is wholly
> responsible for ensuring the integrity of the file.

Can you describe how to disable locking in conjunction with O_DIRECT, even
though it is not supported? Can it be done by application users without
root permission?

Thanks.
Wei-keng