Goswin von Brederlow
2006-May-19 07:36 UTC
[Lustre-discuss] disable client-side caching on Lustre?
Wei-keng Liao <wkliao@ece.northwestern.edu> writes:
> How do I disable client-side caching on Lustre when accessing a
> file in a C program?
>
> Wei-keng Liao

Does the O_DIRECT flag for open have the desired effect?

MfG
Goswin
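For readers who want to try this, here is a minimal sketch of opening a file with O_DIRECT on Linux (the path and error handling are illustrative only; as discussed later in this thread, the buffers, offsets, and lengths passed to read()/write() must also meet the filesystem's alignment requirements, or the calls fail with EINVAL):

#define _GNU_SOURCE            /* needed for O_DIRECT with glibc */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }
    int fd = open(argv[1], O_RDWR | O_DIRECT);   /* bypass the page cache */
    if (fd < 0) {
        perror("open(O_DIRECT)");
        return 1;
    }
    /* ... suitably aligned read()/write() calls go here ... */
    close(fd);
    return 0;
}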
Andreas Dilger
2006-May-19 07:36 UTC
[Lustre-discuss] disable client-side caching on Lustre?
On Mar 03, 2006 23:26 -0600, Wei-keng Liao wrote:> How do I disable client-side caching on Lustre when accessing a > file in a C program?When you say "disable client-side cache" do you mean read cache or write cache? The standard Linux mechanism for bypassing the page cache entirely (read and write) is by using O_DIRECT as was previously mentioned. If you want to flush written data from cache, then you should call fsync() on the file after writing. It is possible to limit the amount of cached write data on the client by setting each of the /proc/fs/lustre/osc/*/max_dirty_mb tunables. They are set to 32MB of dirty data by default, per OSC. The client will accumulate dirty pages (for non-O_DIRECT writes) until they make up at least a full RPC (1MB), but will not throttle writes until they hit the max_dirty_mb limit. It is also possible to limit the amount of cached read data on the client by the /proc/fs/lustre/llite/fs*/max_cached_mb tunable. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc.
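A small sketch of the fsync() approach mentioned above (the path and buffer size are illustrative); this flushes the client's dirty pages for the file to the servers, but it does not prevent caching in the first place:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    char buf[8192];
    int fd = open("/mnt/lustre/output.dat", O_WRONLY | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    memset(buf, 0, sizeof(buf));
    if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf))
        perror("write");
    if (fsync(fd) != 0)          /* force the dirty pages out to the OSTs */
        perror("fsync");

    close(fd);
    return 0;
}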
Andreas Dilger
2006-May-19 07:36 UTC
[Lustre-discuss] disable client-side caching on Lustre?
On Mar 06, 2006 01:14 -0600, Wei-keng Liao wrote:
> I meant both read and write caching. I am running two I/O benchmarks:
> FLASH I/O and BTIO, where FLASH I/O has only write operations and BTIO has
> reads after writes (file closed in between). I got very bad performance
> results for both on Lustre and am trying to see if it is due to caching.
> O_DIRECT is not applicable here because of the non-aligned read-write
> amount.

I was able to find some information about the BTIO and FLASH benchmarks,
but these are not tests that we have run ourselves. My brief reading on
these benchmarks indicates that they are doing a lot of small-size IO
requests striped throughout the file, instead of large contiguous IO
requests. Is that correct? Do you have any information on e.g. the size
of the IO requests? You already mentioned that the IO is not aligned,
which also hurts performance.

Do these tests also do their IO using MPIIO? We have had problems in the
past when the ADIO layer was making bad assumptions for Lustre IO. We
have not yet worked on any Lustre-specific ADIO layer, but this may allow
a large performance improvement for some kinds of applications.

> For applications performing only write operations (and no repeated or
> overlapped writes), caching causes unnecessary memory copies and thus the
> bad performance. Similar for non-repeated reads only (although prefetch
> can also help).

One issue is that even if the write segments are not overlapping, if they
are not aligned on PAGE_SIZE (4k normally) boundaries, then they can not
be handled by the linux VM efficiently, and result in a read-modify-write
cycle for each of the "boundary" pages. Lustre also aligns the DLM locks
on client PAGE_SIZE boundaries, so this can cause artificial contention
among the writes.

In any case, I suspect that the root of the problem is the small,
interleaved, and unaligned IO pattern, and not the cache itself.

> As a Lustre user, I would like to know if I can disable read and write
> caching for the benchmarks I tested, without root permission to change
> file system configuration.

No, this isn't possible (excluding O_DIRECT). The reason O_DIRECT allows
uncached IO is because the kernel is using the user-supplied data buffers
to do the IO directly to the filesystem. Since the filesystem layer has
to work with aligned buffers, this imposes the alignment restrictions.
Working with unaligned data buffers imposes the data copying overhead.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
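To illustrate the boundary-page issue described above, here is a small sketch (the helper is hypothetical, not part of any Lustre or MPI API) that rounds each writer's chunk up to a PAGE_SIZE multiple so neighbouring writers do not share a "boundary" page and so avoid the read-modify-write cycle:

#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: split 'total' bytes among 'nproc' writers so that
 * every writer's starting offset is a multiple of the page size. */
static void page_aligned_chunk(long total, int nproc, int rank,
                               long *offset, long *length)
{
    long page  = sysconf(_SC_PAGESIZE);                       /* typically 4096 */
    long chunk = ((total / nproc + page - 1) / page) * page;  /* round up */

    *offset = (long)rank * chunk;
    *length = (*offset >= total) ? 0 :
              (total - *offset < chunk ? total - *offset : chunk);
}

int main(void)
{
    long off, len;
    for (int rank = 0; rank < 4; rank++) {
        page_aligned_chunk(100 * 1024 * 1024 + 123, 4, rank, &off, &len);
        printf("rank %d: offset %ld length %ld\n", rank, off, len);
    }
    return 0;
}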
Wei-keng Liao
2006-May-19 07:36 UTC
[Lustre-discuss] disable client-side caching on Lustre?
I meant both read and write caching. I am running two I/O benchmarks: FLASH I/O and BTIO, where FLASH I/O has only write operations and BTIO has reads after writes (file closed in between). I got very bad performance results for both on Lustre and are trying to see if it is due to caching. O_DIRECT is not applicable here because of the non-aligned read-write amount. For applications performs only write operations (and no repeated or overlapped writes), caching causes unnecessary memory copies and thus the bad performance. Similar for non-repeated reads only (although prefetch can also help). As a Lustre user, I would like to know if I can disable read and write caching for the benchmarks I tested, without root permission to change file system configuration. I was told ioctl() can do that, but could not find any info from Lustre''s manual. Wei-keng On Sun, 5 Mar 2006, Andreas Dilger wrote:> On Mar 03, 2006 23:26 -0600, Wei-keng Liao wrote: >> How do I disable client-side caching on Lustre when accessing a >> file in a C program? > > When you say "disable client-side cache" do you mean read cache or > write cache? The standard Linux mechanism for bypassing the page > cache entirely (read and write) is by using O_DIRECT as was previously > mentioned. If you want to flush written data from cache, then you > should call fsync() on the file after writing. > > It is possible to limit the amount of cached write data on the client > by setting each of the /proc/fs/lustre/osc/*/max_dirty_mb tunables. > They are set to 32MB of dirty data by default, per OSC. The client will > accumulate dirty pages (for non-O_DIRECT writes) until they make up at > least a full RPC (1MB), but will not throttle writes until they hit the > max_dirty_mb limit. > > It is also possible to limit the amount of cached read data on the client > by the /proc/fs/lustre/llite/fs*/max_cached_mb tunable. > > Cheers, Andreas > -- > Andreas Dilger > Principal Software Engineer > Cluster File Systems, Inc. >
Wei-keng Liao
2006-May-19 07:36 UTC
[Lustre-discuss] disable client-side caching on Lustre?
According to the Lustre manual, part IV, section 1.4, O_DIRECT requires the
buffer size for read() and write() to be aligned on a 512-byte boundary.
If not, the error EINVAL occurs. I did get this error when I used buffers
of other sizes. My question is really whether there is a way to simply
disable caching (using a system call like ioctl or so) without the
512-byte limitation. Many existing applications do their I/O without
512-byte alignment, and it can be hard to change all of the I/O calls just
to see the effect of O_DIRECT.

Wei-keng

On Sat, 4 Mar 2006, Goswin von Brederlow wrote:
> Wei-keng Liao <wkliao@ece.northwestern.edu> writes:
>
>> How do I disable client-side caching on Lustre when accessing a
>> file in a C program?
>>
>> Wei-keng Liao
>
> Does the O_DIRECT flag for open have the desired effect?
>
> MfG
> Goswin
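For concreteness, here is a hedged sketch of what the 512-byte requirement means in practice (the 512-byte figure is taken from the manual section cited above; some configurations may require page-size alignment instead, and the path is illustrative). Both the buffer address and the transfer length are aligned:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    void  *buf;
    size_t len = 1 << 20;                 /* 1 MB, a multiple of 512 bytes */

    /* Allocate a 512-byte-aligned buffer for the O_DIRECT transfer. */
    if (posix_memalign(&buf, 512, len) != 0) {
        perror("posix_memalign");
        return 1;
    }
    memset(buf, 0, len);

    int fd = open("/mnt/lustre/direct.dat", O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    if (write(fd, buf, len) < 0)          /* an unaligned length gives EINVAL */
        perror("write");

    close(fd);
    free(buf);
    return 0;
}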
Wei-keng Liao
2006-May-19 07:36 UTC
[Lustre-discuss] disable client-side caching on Lustre?
How do I disable client-side caching on Lustre when accessing a file in a C program? Wei-keng Liao
Wei-keng Liao
2006-May-19 07:36 UTC
[Lustre-discuss] disable client-side caching on Lustre?
Jay,

What benchmarks were you using to get the low I/O performance results on
Lustre? Do they do parallel I/O through MPI-IO? On shared files or
separate files?

On a file system that does no caching at all, I suspect it is the
conflicting file locks that cause I/O serialization and hence the low I/O
performance results.

Wei-keng

On Mon, 6 Mar 2006, Andreas Dilger wrote:
> On Mar 06, 2006 20:03 -0600, J. K. Cliburn wrote:
>> On 3/6/06, J. K. Cliburn <jcliburn@gmail.com> wrote:
>> I meant to say a 4,176 *processor* Cray XT3. 4,128 processors are
>> embodied in dual Opteron compute nodes running the Catamount
>> lightweight kernel. The remainder are 4-way service nodes running a
>> variant of SLES9.
>
> Note that the Catamount Lustre clients are _completely_ different than
> the Linux VFS Lustre clients. The Catamount clients cannot do ANY write
> caching or asynchronous writes because the operating environment does
> not support interrupts for notification of IO completion. The writes
> must be completed during each syscall, and are as a result completely
> synchronous.
>
> Cheers, Andreas
> --
> Andreas Dilger
> Principal Software Engineer
> Cluster File Systems, Inc.
>
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss@clusterfs.com
> https://mail.clusterfs.com/mailman/listinfo/lustre-discuss
Andreas Dilger
2006-May-19 07:36 UTC
[Lustre-discuss] disable client-side caching on Lustre?
On Mar 14, 2006 15:46 -0600, Wei-keng Liao wrote:
> On a file system that does no caching at all, I suspect it is the
> conflicting file locks that cause I/O serialization and hence the low
> I/O performance results.

The clients do not get any locks either. On the other hand, we have
recently identified one problem which would negatively impact performance
in this environment, and Cray is working to make the fix available in
their next release.

> On Mon, 6 Mar 2006, Andreas Dilger wrote:
>> On Mar 06, 2006 20:03 -0600, J. K. Cliburn wrote:
>>> On 3/6/06, J. K. Cliburn <jcliburn@gmail.com> wrote:
>>> I meant to say a 4,176 *processor* Cray XT3. 4,128 processors are
>>> embodied in dual Opteron compute nodes running the Catamount
>>> lightweight kernel. The remainder are 4-way service nodes running a
>>> variant of SLES9.
>>
>> Note that the Catamount Lustre clients are _completely_ different than
>> the Linux VFS Lustre clients. The Catamount clients cannot do ANY write
>> caching or asynchronous writes because the operating environment does
>> not support interrupts for notification of IO completion. The writes
>> must be completed during each syscall, and are as a result completely
>> synchronous.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
J. K. Cliburn
2006-May-19 07:36 UTC
[Lustre-discuss] disable client-side caching on Lustre?
On 3/14/06, Wei-keng Liao <wkliao@ece.northwestern.edu> wrote:> Jay > > What benchmarks were you using to get the low I/O performance results on > Lustre? Does it do parallel I/O through MPI-IO? On shared files or > separate files?Real codes, not benchmarks, were our first indication that we had work to do. We then wrote a small parallel writer/reader (no MPI_IO, IIRC) to further characterize what we''d seen. I''m most interested in many writers to a single file (many-to-one), since, in my humble opinion anyway, that''s what "parallel I/O" is, and that''s where I thought Lustre''s strength lay. We get much, much better performance in one-to-one I/O, although it seems (understandably) to be extraordinarily sensitive to stripe number and size. We host a wide variety of users who employ a corresponding wide variety of parallel codes applied to all sorts of scientific problems, most of whom will need to carefully tune their respective code''s I/O in order to see acceptable turnaround times on their batch jobs. I''m confident that Cray will help us get things sorted out, though. (And I''m sure the Cray employees who read this will agree, since, oddly, as I learned from my previous post, there appears to be a direct conduit from this list to Mendota Heights.) Jay
Andreas Dilger
2006-May-19 07:36 UTC
[Lustre-discuss] disable client-side caching on Lustre?
On Mar 14, 2006 16:08 -0600, Wei-keng Liao wrote:
> On Tue, 7 Mar 2006, Andreas Dilger wrote:
>> As this is the first time that I have seen complaints about IO performance
>> from this type of workload, there has not been any effort to tune lustre
>> for this load yet. I see that the lock policy doesn't consider a single
>> file to be "contended" until there are at least 5 concurrent writers, and
>> not "highly contended" until there are at least 33 concurrent writers.
>> Most of the shared-file write tunings have been done with 400+ clients.
>
> I do not quite understand this lock policy. What does Lustre do when it
> doesn't consider a single file to be "contended"? What does it do in the
> "highly contended" case? Can you point me to any further reading about this?

In the "uncontended" case (the common single-writer-per-file or
only-readers-per-file case) the Lustre DLM will grow the requested lock to
the maximum non-conflicting extent possible. So, for example, if the first
client requests a DLM write extent lock on [131072, 196607] of an object
it will actually be granted a lock on [0, EOF] (EOF=2^64-1 in this
context). This means that it will not have to issue another DLM request
for the entire file access, if there are no other conflicting users.

If a second client then requests a lock on [65536, 131071] the first lock
is cancelled and the client is granted a lock on [0, 131071] because the
DLM still considers the to-be-cancelled first extent when growing the lock.
A third client requesting [0, 65535] would cancel the first lock and be
granted exactly that extent. The DLM also takes requested-but-blocked
locks into account when growing the locks, so that it doesn't grant
conflicting locks for known users.

If the clients are doing IO to only a single region of the file each
(e.g. [chunksize * mpi_rank, +chunksize]) then the above algorithm will
converge fairly quickly to give each client a non-overlapping write lock
on its particular part of the file. If the chunksize is not a multiple of
the client PAGE_SIZE then this will work poorly, because it is not possible
to grant those locks concurrently due to the overlap caused by rounding,
and no effort has yet been made to optimize this. At some point (if the
chunk size is large enough to span multiple write() calls) the clients
will be working on disjoint extents. Clearly, this is sub-optimal if there
are multiple clients doing small strided writes to a single file, but at
least in past experience that was a rare case.

In the "contended" case, extent locks are not grown "downwards", avoiding
potential unnecessary lock conflicts. In the "highly contended" case,
instead of growing the locks to the maximum extent possible they are
capped at the next 32MB boundary. This is in
lustre/ldlm/ldlm_extent.c::ldlm_extent_internal_policy().

Ideally, there would be some way to know in advance what the "correct"
lock size is (maybe tracking per object what the granted locks' request
sizes are), but for any such algorithm there is a pathological case. We
would of course welcome any contributions to improve the algorithms.

> Can you describe how to disable locking in conjunction with O_DIRECT, even
> though it is not supported? Can it be done by application users without
> root permission?

Currently it is not accessible to non-root users, as it was only
implemented for very early benchmarking and is not a supported feature.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
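As a reading aid, here is a much-simplified sketch of the extent-growing idea described above. It is purely illustrative and is not the actual ldlm_extent_internal_policy() code; the real policy also considers blocked (queued) requests and the contention heuristics, and this sketch assumes the granted list does not conflict with the request:

#include <stdio.h>

struct extent { unsigned long long start, end; };

/* Grow a requested extent to the largest range that does not overlap any
 * already-granted extent on the same object (EOF modelled as 2^64-1). */
static struct extent grow_extent(struct extent req,
                                 const struct extent *granted, int n)
{
    struct extent grown = { 0, 0xffffffffffffffffULL };

    for (int i = 0; i < n; i++) {
        if (granted[i].end < req.start && granted[i].end + 1 > grown.start)
            grown.start = granted[i].end + 1;      /* neighbour below */
        if (granted[i].start > req.end && granted[i].start - 1 < grown.end)
            grown.end = granted[i].start - 1;      /* neighbour above */
    }
    return grown;
}

int main(void)
{
    struct extent granted[] = { { 0, 65535 } };    /* an existing lock */
    struct extent req = { 131072, 196607 };        /* the new request  */
    struct extent g = grow_extent(req, granted, 1);

    printf("granted [%llu, %llu]\n", g.start, g.end);  /* [65536, 2^64-1] */
    return 0;
}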
Andreas Dilger
2006-May-19 07:36 UTC
[Lustre-discuss] disable client-side caching on Lustre?
On Mar 06, 2006 14:51 -0600, Wei-keng Liao wrote:
> Attached is a collection of offset-length pairs for FLASH-IO and BTIO.
>
> I have been playing different stripe size and no. OSTs on Tungsten@NCSA
> which has default 4MB stripe size and stripe over 8 OSTs. Changing the
> stripe size and no. OSTs does not help the performance a lot for FLASH and
> BTIO. I tried 64KB, 512KB, 1MB stripe size and 8 and 16 OSTs. Numbers of
> clients I ran range from 4 to 64. However, in both FLASH and BTIO, all
> clients read/write to a shared file. Maybe this is what causes the bad
> performance on Lustre. (The I/O bandwidth I got for FLASH-IO is less than
> 10 MB/s and less than 80 MB/s for BTIO, on Tungsten which potentially can
> provide 11 GB/s I/O bandwidth.) So, my guess is that Lustre has not been
> optimized for concurrent or parallel I/O operations yet. As a heavy MPI-IO
> user, I wish there will be more demands on this aspect in the future :(

Lustre has in fact been tested with single-file parallel IO for a long time,
and while performance is not as good as file-per-process, it has generally
been acceptable if the clients are doing large non-overlapping concurrent
writes. However, the IO pattern that single-file writes have been tuned
to is different than is shown here, with the heuristics being tuned for
tens-hundreds of clients doing aligned writes of 1MB or more concurrently.

As this is the first time that I have seen complaints about IO performance
from this type of workload, there has not been any effort to tune lustre
for this load yet. I see that the lock policy doesn't consider a single
file to be "contended" until there are at least 5 concurrent writers, and
not "highly contended" until there are at least 33 concurrent writers.
Most of the shared-file write tunings have been done with 400+ clients.

Looking at the start of one of the logs in question (OFFSET_LEN/BTIO.A4):

1 o btio.full.out
2 o btio.full.out
3 o btio.full.out
0 o btio.full.out
1 w 2621440 2621440
1 b 1 1
3 w 7864320 2621440
3 b 1 1
0 w 0 2621440
0 b 1 1
2 w 5242880 2621440
2 b 1 1
0 w 10485760 2621440
3 w 18350080 2621440
2 w 15728640 2621440
3 b 1 1
2 b 1 1
0 b 1 1
1 w 13107200 2621440
1 b 1 1
2 w 26214400 2621440
2 b 1 1
0 w 20971520 2621440
3 w 28835840 2621440
1 w 23592960 2621440
3 b 1 1
0 b 1 1
1 b 1 1

It is clear from looking at this IO log that this application would be
optimal with a stripe size of 2621440 bytes and 4 stripes for the file,
or a multiple thereof. This would allow each process to essentially
write to its own file, and there would be no lock contention at all.

lfs setstripe /path/to/output/btio.full.out 2621440 -1 4
or
lfs setstripe /path/to/output/btio.full.out 1310720 -1 8
or
lfs setstripe /path/to/output/btio.full.out 655360 -1 16
etc.

It is the lock contention that is essentially serializing the writes
from each client. It would be interesting to know whether the ROMIO/MPIIO
layer would send the proper hints to a lustre-ADIO to allow it to create
the file in this manner? I admit I don't know enough about the hints
that an ADIO implementation is given.

In your README you mention that the "b 1 1" operation is an MPI barrier.
Is there also a "sync" operation associated with this barrier? Even
without a sync, the fact that there is a barrier between each IO means
that each IO is waiting for the slowest client to complete the writes.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
Wei-keng Liao
2006-May-19 07:36 UTC
[Lustre-discuss] disable client-side caching on Lustre?
Andreas, I wonder how Lustre implements the POSIX read/write atomicity without locking, when a file is stored across more than one OST. Wei-keng On Tue, 14 Mar 2006, Andreas Dilger wrote:> On Mar 14, 2006 15:46 -0600, Wei-keng Liao wrote: >> On a file system that does no caching at all, I think maybe it is due to >> the conflict file locks that causes I/O serialization and hence the low >> I/O performance results. > > The clients do not get any locks either. On the other hand, we have > recently identified one problem which would negatively impact performace > in this environment, and Cray is working to make this available in their > next release. > >> On Mon, 6 Mar 2006, Andreas Dilger wrote: >>> On Mar 06, 2006 20:03 -0600, J. K. Cliburn wrote: >>>> On 3/6/06, J. K. Cliburn <jcliburn@gmail.com> wrote: >>>> I meant to say a 4,176 *processor* Cray XT3. 4,128 processors are >>>> embodied in dual Opteron compute nodes running the Catamount >>>> lightweight kernel. The remainder are 4-way service nodes running a >>>> variant of SLES9. >>> >>> Note that the Catamount Lustre clients are _completely_ different than >>> the Linux VFS Lustre clients. The Catamount clients cannot do ANY write >>> caching or asynchronous writes because the operating environment does >>> not support interrupts for notification of IO completion. The writes >>> must be completed during each syscall, and are as a result completely >>> synchronous. > > Cheers, Andreas > -- > Andreas Dilger > Principal Software Engineer > Cluster File Systems, Inc. >
Wei-keng Liao
2006-May-19 07:36 UTC
[Lustre-discuss] disable client-side caching on Lustre?
Andreas, The BTIO I am running does I/O through MPI-IO. The FLASH I/O calls HDF5 I/O APIs which internally is implemented on top of MPI-IO. FLASH I/O writes data into files in HDF5 format which contains both metadata and array data. For writing metadata, there are lots of small writes, but for array data, writes are in large amount. As for BTIO, it is user''s option to run through MPI collective or non-collective I/O. When collective I/O is used (which is I only use), the reads/writes are all in large amount. Both benchmarks read/write global arrays from/tom files. The actual read/write amount in each processes depends on the total number of processes running the job. Therefore, they are seldom aligned with 512-byte boundary. In addition, since MPI collective I/O is used, reads/writes are not interleaved. I have collected the I/O trace, the read/write offset-length pairs, for the two benchmarks. Let me know if you are interested. I personally believe the access patterns used in these two benchmarks appear frequently in many scientific applications. It is important to see Lustre provide a reasonable performance for them. I ran these benchmarks on PVFS which does no client-side caching and got a very good result (both Lustre and pvfs ran on the same machine, Tungsten @NCSA). This is why I am seeking to disable caching on Lustre. You mentioned that ADIO layer for Lustre has not been implemented yet. What is the plan for this in the near future? Can you also shed light on why ADIO makes bad assumption for Lustre I/O? All I can see is ADIO regards Lustre as a UFS and only POSIX I/O calls (open,read,write) are used. I also wonder if caching is disabled if a file is open in O_WRONLY, since it makes little sense to do caching in this case, unless for aggregating small writes into large ones. Does Lustre use any I/O benchmark that can demonstrate the benefit of caching? Please let me know. I am very interested in seeing the I/O patterns of it. Thanks. Wei-keng On Mon, 6 Mar 2006, Andreas Dilger wrote:> On Mar 06, 2006 01:14 -0600, Wei-keng Liao wrote: >> I meant both read and write caching. I am running two I/O benchmarks: >> FLASH I/O and BTIO, where FLASH I/O has only write operations and BTIO has >> reads after writes (file closed in between). I got very bad performance >> results for both on Lustre and are trying to see if it is due to caching. >> O_DIRECT is not applicable here because of the non-aligned read-write >> amount. > > I was able to find some information about the BTIO and FLASH benchmarks, > but these are not tests that we have run ourselves. My brief reading on > these benchmarks indicates that they are doing a lot of small-size IO > requests striped throughout the file, instead of large contiguous IO > requests. Is that correct? Do you have any information on e.g. the size > of the IO requests? You already mentioned that the IO is not aligned, > which also hurts performance. > > Do these tests also do their IO using MPIIO? We have had problems in the > past when the ADIO layer was making bad assumptions for Lustre IO. We > have not yet worked on any Lustre-specific ADIO layer, but this may allow > a large performance improvement for some kinds of applications. > >> For applications performs only write operations (and no repeated or >> overlapped writes), caching causes unnecessary memory copies and thus the >> bad performance. Similar for non-repeated reads only (although prefetch >> can also help). 
> > One issue is that even if the write segments are not overlapping, if they > are not aligned on PAGE_SIZE (4k normally) boundaries, then they can not > be handled by the linux VM efficiently, and result in a read-modify-write > cycle for each of the "boundary" pages. Lustre also aligns the DLM locks > on client PAGE_SIZE boundaries, so this can cause artificial contention > among the writes. > > In any case, I suspect that the root of the problem is the small, > interleaved, and unaligned IO pattern, and not the cache itself. > >> As a Lustre user, I would like to know if I can disable read and write >> caching for the benchmarks I tested, without root permission to change >> file system configuration. > > No, this isn''t possible (excluding O_DIRECT). The reason O_DIRECT allows > uncached IO is because the kernel is using the user-supplied data buffers > to do the IO directly to the filesystem. Since the filesystem layer has > to work with aligned buffers, this imposes the alignmen restrictions. > Working with unaligned data buffers imposes the data copying overhead. > > Cheers, Andreas > -- > Andreas Dilger > Principal Software Engineer > Cluster File Systems, Inc. >
Andreas Dilger
2006-May-19 07:36 UTC
[Lustre-discuss] disable client-side caching on Lustre?
On Mar 06, 2006 11:44 -0600, Wei-keng Liao wrote:
> The BTIO I am running does I/O through MPI-IO. The FLASH I/O calls HDF5
> I/O APIs which internally is implemented on top of MPI-IO. FLASH I/O
> writes data into files in HDF5 format which contains both metadata and
> array data. For writing metadata, there are lots of small writes, but for
> array data, writes are in large amount... I have collected the I/O trace,
> the read/write offset-length pairs, for the two benchmarks. Let me know
> if you are interested.

Yes, it would be useful to see what kind of IO pattern is being generated.

> You mentioned that ADIO layer for Lustre has not been implemented yet.
> What is the plan for this in the near future? Can you also shed light on
> why ADIO makes bad assumption for Lustre I/O? All I can see is ADIO
> regards Lustre as a UFS and only POSIX I/O calls (open,read,write) are
> used.

As yet there has not been a strong customer demand for an ADIO layer for
Lustre. One of the things that could benefit a lot from a Lustre-specific
ADIO layer would be file striping parameters.

It is possible that your test is only running against a small number of
storage servers by default, which would not be optimal if there are lots
of clients concurrently reading/writing to the same file at one time. In
the vast majority of cases there is only a single client doing IO to each
file, so the default tuning parameters are set to optimize this case.

You can check this by running "lfs getstripe -v /path/to/file(s)" to check
the striping on the file(s). You can set the default striping parameters
for the output directory, and all NEW files created in that directory will
inherit this striping configuration:

$ ./lustre/utils/lfs setstripe -h
Create a new file with a specific striping pattern or
set the default striping pattern on an existing directory or
delete the default striping pattern from an existing directory

usage: setstripe <filename|dirname> <stripe_size> <stripe_start> <stripe_count>
       or
       setstripe -d <dirname>   (to delete default striping)
        stripe_size:  Number of bytes on each OST (0 filesystem default)
        stripe_start: OST index of first stripe (-1 filesystem default)
        stripe_count: Number of OSTs to stripe over (0 default, -1 all)

The stripe_start parameter should be left at -1, but for shared-file writes
the stripe_count should be increased to get the maximum bandwidth.

> I also wonder if caching is disabled if a file is open in O_WRONLY, since
> it makes little sense to do caching in this case, unless for aggregating
> small writes into large ones.

The cache is a function of the linux VM subsystem and is not implemented
specifically in lustre. Only the delayed-write mechanism to aggregate
writes for sending larger RPCs is specific to Lustre.

> Does Lustre use any I/O benchmark that can demonstrate the benefit of
> caching? Please let me know. I am very interested in seeing the I/O
> patterns of it. Thanks.

IOR is a good example of a benchmark that can see benefit from the
cache. In some earlier versions, if the write size does not exceed RAM
size, then the read rates approach memory bandwidth. In later versions
of IOR the read algorithm was changed so that clients would not read back
the data they just wrote, so that they do not read from cache.

Similarly, for IO that is sequential but not page aligned or not in
multiples of full pages the IO rate is improved because the RPCs are large
and efficient, instead of only the size of the read/write request.
Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
Wei-keng Liao
2006-May-19 07:36 UTC
[Lustre-discuss] disable client-side caching on Lustre?
Andreas, Attached is a collection of offset-length pairs for FLASH-IO and BTIO. I have been playing different stripe size and no. OSTs on Tungsten@NCSA which has default 4MB stripe size and stripe over 8 OSTs. Changing the stripe size and no. OSTs does not help the performance a lot for FLASH and BTIO. I tried 64KB, 512KB, 1MB stripe size and 8 and 16 OSTs. Numbers of clients I ran range from 4 to 64. However, in both FLASH and BTIO, all clients read/write to a shared file. Maybe this is what causes the bad performance on Lustre. (The I/O bandwidth I got for FLASH-IO is less than 10 MB/s and less than 80 MB/s for BTIO, on Tungsten which potentially can provide 11 GB/s I/O bandwidth.) So, my guess is that Lustre has not been optimized for concurrent or parallel I/O operations yet. As a heavy MPI-IO user, I wish there will be more demands on this aspect in the future :( I ran IOR on Lustre recently and it does not have any pattern that can be used to show the effect of client-side caching. I also wrote a simple program to test I/O bandwidth on Tungsten. I found the O_DIRECT gives a huge performance boost if the buffer is aligned. The only change in my test code is just with or without O_DIRECT. This is one of the reasons that I suspect the Lustre''s caching effect. I also ran FLASH and BTIO on IBM GPFS which also does caching and distributed file locking like Lustre. The performance on GPFS is much better (even better than PVFS). I think maybe Lustre development group would be interested in finding the causes of the bad performance. Wei-keng On Mon, 6 Mar 2006, Andreas Dilger wrote:> On Mar 06, 2006 11:44 -0600, Wei-keng Liao wrote: >> The BTIO I am running does I/O through MPI-IO. The FLASH I/O calls HDF5 >> I/O APIs which internally is implemented on top of MPI-IO. FLASH I/O >> writes data into files in HDF5 format which contains both metadata and >> array data. For writing metadata, there are lots of small writes, but for >> array data, writes are in large amount... I have collected the I/O trace, >> the read/write offset-length pairs, for the two benchmarks. Let me know >> if you are interested. > > Yes, it would be useful to see what kind of IO pattern is being generated. > >> You mentioned that ADIO layer for Lustre has not been implemented yet. >> What is the plan for this in the near future? Can you also shed light on >> why ADIO makes bad assumption for Lustre I/O? All I can see is ADIO >> regards Lustre as a UFS and only POSIX I/O calls (open,read,write) are >> used. > > As yet there has not been a strong customer demand for an ADIO layer for > Lustre. One of the things that could benefit a lot from a Lustre-specific > ADIO layer would be file striping parameters. > > It is possible that your test is only running against a small number of > storage servers by default, which would not be optimal if there are lots > of clients concurrently reading/writing to the same file at one time. In > the vast majority of cases there is only a single client doing IO to each > file so the default tuning parameters are set to optimize this case. > > You can check this by running "lfs getstripe -v /path/to/file(s)" to check > the striping on the file(s). 
You can set the default striping parameters > for the output directory, and all NEW files created in that directory will > inherit this striping configuration: > > $ ./lustre/utils/lfs setstripe -h > Create a new file with a specific striping pattern or > set the default striping pattern on an existing directory or > delete the default striping pattern from an existing directory > usage: setstripe <filename|dirname> <stripe_size> <stripe_start> <stripe_count> > or > setstripe -d <dirname> (to delete default striping) > stripe_size: Number of bytes on each OST (0 filesystem default) > stripe_start: OST index of first stripe (-1 filesystem default) > stripe_count: Number of OSTs to stripe over (0 default, -1 all) > > The stripe_start parameter should be left at -1, but for shared-file writes > the stripe_count should be increased to get the maximum bandwidth. > >> I also wonder if caching is disabled if a file is open in O_WRONLY, since >> it makes little sense to do caching in this case, unless for aggregating >> small writes into large ones. > > The cache is a function of the linux VM subsystem and is not implemented > specifically in lustre. Only the delayed-write mechanism to aggregate > writes for sending larger RPCs is specific to Lustre. > >> Does Lustre use any I/O benchmark that can demonstrate the benefit of >> caching? Please let me know. I am very interested in seeing the I/O >> patterns of it. Thanks. > > IOR is a good example of a benchmark that can can see benefit from the > cache. In some earlier versions, if the write size does not exceed RAM > size, then the read rates approach memory bandwidth. In later versions > of IOR the read algorithm was changed so that clients would not read back > the data they just wrote, so that they do not read from cache. > > Similarly, for IO that is sequential but not page aligned or multiples > of full pages the IO rate is improved because the RPCs are large and > efficient, instead of only the size of the read/write request. > > Cheers, Andreas > -- > Andreas Dilger > Principal Software Engineer > Cluster File Systems, Inc. >-------------- next part -------------- A non-text attachment was scrubbed... Name: offset_len.tar.gz Type: application/octet-stream Size: 178469 bytes Desc: Url : http://mail.clusterfs.com/pipermail/lustre-discuss/attachments/20060306/2d299b5d/offset_len.tar.obj
Wei-keng Liao
2006-May-19 07:36 UTC
[Lustre-discuss] disable client-side caching on Lustre?
Don,

Please discard those BTIO files. I generated these offset-length pairs on a
4-node machine. ROMIO internally figures this out and lets only 4 processes
perform the read()/write() calls. I will try to generate the true
offset-length pairs for different numbers of clients. For now, please just
use BTIO.A.4 and BTIO.B.4. The FLASH-IO offset-length files seem fine.

Here is a brief description of BTIO: BTIO performs 40 MPI collective writes
followed by 40 collective reads. A 3-D array partitioned among clients in a
block-block-block fashion is written to and then read back from a single
file. In each collective call, the offset-length pairs are re-arranged such
that every process makes a single contiguous read()/write() call to the
file system. This is why you see the sequential read patterns, and this is
also why caching is not helping here. If you read the line segments in
groups of 4, each group corresponding to an MPI collective read/write, the
BTIO pattern becomes very clear. If a different number of clients is used,
the patterns should look similar.

Thanks for making this chart, a very good one.

Wei-keng

On Mon, 6 Mar 2006, Iozone wrote:
> Wei-keng Liao,
>
> Looking at your BTIO.A.16 data set, here is what I see :-)
>
> The reader phase looks pure sequential for BTIO.A.16
>
> Enjoy,
> Don Capps
J. K. Cliburn
2006-May-19 07:36 UTC
[Lustre-discuss] disable client-side caching on Lustre?
On 3/6/06, Wei-keng Liao <wkliao@ece.northwestern.edu> wrote:> if you can briefly present your similar findings in the discuss > group, it may increase the chance for Lustre to pay more attentions on > parallel I/O :)I''ll see what I can pull together from our group. The machine (a 4,176 node Cray XT3) is in semi-production at the moment, but we''re trying to squeeze in some I/O tests where we can. The initial results, while not exhaustive by any means, are disappointing. With past parallel systems we''ve owned, one could -- to a varying extent -- always improve I/O performance from a baseline by organizing reads and writes to better fit the underlying parallel filesystem (PFS, [C]XFS, GPFS, in our experience). This is our first experience with Lustre, and we don''t expect to avoid the I/O improvement curve this time, but my chief complaint in the early going is that what I''m calling the "baseline" seems so low; a few (or at best 10s) MB/sec on a filesystem billed to deliver 100s or 1000s MB/sec. I''m concerned that our users are going to have to significantly modify their codes to achieve reasonably acceptable I/O rates. Unfortunately, most of these users operate on budgets that don''t include code-hacking funds; their raison d''etre is to conduct computational science. I''m interested in hearing from any list members who have experience with Lustre operating with a variety of parallel codes employing a wide range of node counts, large and small. Any tips you can provide as our group goes forward will help. Best regards, Jay
J. K. Cliburn
2006-May-19 07:36 UTC
[Lustre-discuss] disable client-side caching on Lustre?
On 3/6/06, J. K. Cliburn <jcliburn@gmail.com> wrote:> The machine (a 4,176 node Cray XT3)I meant to say a 4,176 *processor* Cray XT3. 4,128 processors are embodied in dual Opteron compute nodes running the Catamount lightweight kernel. The remainder are 4-way service nodes running a variant of SLES9.
Wei-keng Liao
2006-May-19 07:36 UTC
[Lustre-discuss] disable client-side caching on Lustre?
Don, Maybe I did not clearly describe the file format in the README. In each file, the first column represents the client id, the 2nd column is the I/O operation (o:open, r: read, w: write, c: close, b: global synchronization), the 3rd column is the filename for open and offset for reads and writes, the 4th column is the length of reads and writes. I ran BTIO with the option of using MPI collective calls. If you look at a particular client, say 2, the I/O trace is always a read/write followed by a synchronization. For example, output from command: grep " 2 " BTIO.A.4 2 o btio.full.out 2 w 5242880 2621440 2 b 1 1 2 w 15728640 2621440 2 b 1 1 2 w 26214400 2621440 2 b 1 1 ... These files must be interpreted by a group of n processes. In the case of BTIO.A.4, n is 4. That is, the read()/write() in every group of 4 occurs concurrently and these groups happen in a sequence of timely manner. BTIO performs 40 MPI collective writes followed by 40 reads. Each collective write appends the data after the previous write. So, the I/O pattern is actually all sequential. In this pattern, caching is only good for write-behind in which small data are accumulated and later written in a big chunk to utilize the network bandwidth. I do agree with you that the fine-tuned stripe size and stripe factor matching exact the application''s I/O pattern should generate optimal performance. I also believe even the stripe size and factor are not optimal, as long as they won''t result in extremely many I/O calls to OSTs, the performance should be reasonable close to the optimal. In fact, no matter how you choose stripe sizes, it is always the same amount of I/O from a group of clients to a group of OSTs. Unfortunately, I got very low I/O bandwidth. I tried 64K, 512K, 1M, and 4M but got < 80 MB/s on a machine that can sustain 11 GB/s peak bandwidth. Hence, I suspect caching is not helping here, instead, worsens the performance. Wei-keng ps. I re-post your charts to the discussion group. They are very nice and should be shared with others. On Mon, 6 Mar 2006, Iozone wrote:> > ----- Original Message ----- > From: "Wei-keng Liao" <wkliao@ece.northwestern.edu> > To: "Iozone" <capps@iozone.org> > Cc: <lustre-discuss@clusterfs.com> > Sent: Monday, March 06, 2006 5:47 PM > Subject: Re: [Lustre-discuss] disable client-side caching on Lustre? > > >> Don, >> >> This is exactly the BTIO access pattern. No repeats exist among writes and >> no repeats among reads. For the write phase, all writes are disjoint. >> Same for the read phase. >> >> Wei-keng >> >> > > Wei-keng, > > The BTIO.A.4 writer looks like an oscillating strided writer, > but the BTIO.A.4 reader looks sequential. > > > > > A client side cache may, or may not, benefit the application''s performance. > It would depend on the size of the client side cache. If it were big enough > to encapsulate the entire data set, then the reader, and writer, could run at > memory speeds. If not, then it would depend on the page/buffer cache > replacement algorithm, the alignment of the buffers, the number of controllers, > the number of disks, the memory bandwidth, and the VM subsystem''s > interaction, should it detect VM pressure. > > It''s likely that a striped filesystem could help the strided writer > phase, as the I/Os could be overlapped across controllers, and > disks. It looks like BTIO.A.4 would like the stripe size to be > 256kbytes. If your stripe size is smaller, or larger, than this > typical transfer size, then performance would not be optimal. 
> > If the system''s read-ahead were deep enough, then the > stripe might help it too. > > > Enjoy, > Don Capps >-------------- next part -------------- A non-text attachment was scrubbed... Name: BTIO.A.4_reader.jpg Type: image/jpeg Size: 134474 bytes Desc: Url : http://mail.clusterfs.com/pipermail/lustre-discuss/attachments/20060306/7ffd512d/BTIO.A.4_reader.jpg -------------- next part -------------- A non-text attachment was scrubbed... Name: BTIO.A.4.jpg Type: image/jpeg Size: 143527 bytes Desc: Url : http://mail.clusterfs.com/pipermail/lustre-discuss/attachments/20060306/7ffd512d/BTIO.A.4.jpg
Wei-keng Liao
2006-May-19 07:36 UTC
[Lustre-discuss] disable client-side caching on Lustre?
Andreas, The "b 1 1" in the I/O trace files is indeed to indicate a synchronization among all processes. In an MPI collective I/O, every process must wait for all others to complete before returning the call. Sync here means the process'' execution thread, not I/O sync. I ran BTIO.A.4 on Tungsten@NCSA with stripe size=2621440 and number of OSTs=4 against the system default setting (stripe size=4MB, OSTs=8). Here is the result: stripe size = 2560KB stripe size = 4MB no. OSTs = 4 no. OSTs = 8 ------------------------------------------------------------------------ MPI file setup time = 0.11 sec 0.10 sec write 400.00 MB in 1.85 sec 3.20 sec read 400.00 MB in 1.42 sec 92.07 sec total I/O amount : 800.00 MB 800.00 MB number of I/Os : 80 80 ------------------------------------------------------------------------ nproc array size bandwidth (MB/s) bandwidth (MB/s) 4 64 x 64 x 64 236.60 8.39 The overall performance is enhanced significantly. I believe if I use more processors, larger I/O amount, the bandwidth can go even higher. But let''s check the read write performance separately for the non-optimal case, the default stripe size. I can see read is much worse than write. Could you comment on why the huge difference? If I/O is completely serialized due to file locking, how come read for default stripe size is 65 times slower on a run with 4 compute nodes. Applying locking on write seems reasonable, though. This also comes back to my question first posted: "disable client-side caching on Lustre?". If file locking is used to ensure the cache coherence on Lustre, can we imply that disable caching -> no locking -> better performance for non-overlap, write-only IO like BTIO? While tuning stripe size for a single access pattern shows good performance like above, it also create a problem. What if two different pattern I/Os need to be written to a single file. Even if the "future ADIO hints" for Lustre allow users to specify the stripe size, can a file''s stripe size + OSTs be changed during the open and close period? If yes, what will be the overhead of doing so? For the current release of Lustre, how can I specify stripe size and number of OSTs when creating a file? Of it has to be done off line with command "lfs"? Thanks. Wei-keng On Tue, 7 Mar 2006, Andreas Dilger wrote:> On Mar 06, 2006 14:51 -0600, Wei-keng Liao wrote: >> Attached is a collection of offset-length pairs for FLASH-IO and BTIO. >> >> I have been playing different stripe size and no. OSTs on Tungsten@NCSA >> which has default 4MB stripe size and stripe over 8 OSTs. Changing the >> stripe size and no. OSTs does not help the performance a lot for FLASH and >> BTIO. I tried 64KB, 512KB, 1MB stripe size and 8 and 16 OSTs. Numbers of >> clients I ran range from 4 to 64. However, in both FLASH and BTIO, all >> clients read/write to a shared file. Maybe this is what causes the bad >> performance on Lustre. (The I/O bandwidth I got for FLASH-IO is less than >> 10 MB/s and less than 80 MB/s for BTIO, on Tungsten which potentially can >> provide 11 GB/s I/O bandwidth.) So, my guess is that Lustre has not been >> optimized for concurrent or parallel I/O operations yet. As a heavy MPI-IO >> user, I wish there will be more demands on this aspect in the future :( > > Lustre has in fact been tested with single-file parallel IO for a long time, > and while performance is not as good as file-per-process, it has generally > been acceptable if the clients are doing large non-overlapping concurrent > writes. 
However, the IO pattern that single-file writes have been tuned > to is different than is shown here, with the heuristics being tuned for > tens-hundreds of clients doing aligned writes of 1MB or more concurrently. > > As this is the first time that I have seen complaints about IO performance > from this type of workload, there has not been any effort to tune lustre > for this load yet. I see that the lock policy doesn''t consider a single > file to be "contended" until there are at least 5 concurrent writers, and > not "highly contended" until there are at least 33 concurrent writers. > Most of the shared-file write tunings have been done with 400+ clients. > > > Looking at the start of one of the logs in question (OFFSET_LEN/BTIO.A4): > 1 o btio.full.out > 2 o btio.full.out > 3 o btio.full.out > 0 o btio.full.out > 1 w 2621440 2621440 > 1 b 1 1 > 3 w 7864320 2621440 > 3 b 1 1 > 0 w 0 2621440 > 0 b 1 1 > 2 w 5242880 2621440 > 2 b 1 1 > 0 w 10485760 2621440 > 3 w 18350080 2621440 > 2 w 15728640 2621440 > 3 b 1 1 > 2 b 1 1 > 0 b 1 1 > 1 w 13107200 2621440 > 1 b 1 1 > 2 w 26214400 2621440 > 2 b 1 1 > 0 w 20971520 2621440 > 3 w 28835840 2621440 > 1 w 23592960 2621440 > 3 b 1 1 > 0 b 1 1 > 1 b 1 1 > > It is clear from looking at this IO log that this application would be > optimal with a stripe size of 2621440 bytes and 4 stripes for the file, > or a multiple thereof. This would allow each process to essentially > write to its own file, and there would be no lock contention at all. > > lfs setstripe /path/to/output/btio.full.out 2621440 -1 4 > or > lfs setstripe /path/to/output/btio.full.out 1310720 -1 8 > or > lfs setstripe /path/to/output/btio.full.out 655360 -1 16 > etc. > > It is the lock contention that is essentially serializing the writes > from each client. It would be interesting to know if the ROMIO/MPIIO > layer would send the proper hints to a lustre-ADIO to allow it to create > the file in this manner? I admit I don''t know enough about the hints > that an ADIO implemention is given. > > In your README you mention that the "b 1 1" operation is an MPI barrier. > Is there also a "sync" operation associated with this barrier? Even > without a sync, the fact that there is a barrier between each IO means > that each IO is waiting for the slowest client to complete the writes. > > Cheers, Andreas > -- > Andreas Dilger > Principal Software Engineer > Cluster File Systems, Inc. >
Kumaran Rajaram
2006-May-19 07:36 UTC
[Lustre-discuss] disable client-side caching on Lustre?
Wei-keng, Please see below:>>>Wei-keng Liao <wkliao@ece.northwestern.edu> 03/07/06 1:17 pm >>>Andreas, I ran BTIO.A.4 on Tungsten@NCSA with stripe size=2621440 and number of OSTs=4 against the system default setting (stripe size=4MB, OSTs=8). Here is the result: stripe size = 2560KB stripe size = 4MB no. OSTs = 4 no. OSTs = 8 ------------------------------------------------------------------------ MPI file setup time = 0.11 sec 0.10 sec write 400.00 MB in 1.85 sec 3.20 sec read 400.00 MB in 1.42 sec 92.07 sec total I/O amount : 800.00 MB 800.00 MB number of I/Os : 80 80 ------------------------------------------------------------------------ nproc array size bandwidth (MB/s) bandwidth (MB/s) 4 64 x 64 x 64 236.60 8.39 The overall performance is enhanced significantly. I believe if I use more processors, larger I/O amount, the bandwidth can go even higher. But let''s check the read write performance separately for the non-optimal case, the default stripe size. I can see read is much worse than write. Could you comment on why the huge difference? If I/O is completely serialized due to file locking, how come read for default stripe size is 65 times slower on a run with 4 compute nodes. Applying locking on write seems reasonable, though. --> ROMIO collective I/O implementation is based on two-phase algortihm. My hunch is that since the I/O request size (400MB) is small, write returns as soon as data is written to the file-system cache (with no read-modify-write being performed). For read operation (two-phase I/O), if data access pattern causes cache miss and if concurrent process have overlapping segments, dirty cache might need to be flushed to disk and data read from the disks (costly). Also, ROMIO collective implementation, might incur communication costs as the number of nodes are scaled This also comes back to my question first posted: disable client-side caching on Lustre?. If file locking is used to ensure the cache coherence on Lustre, can we imply that disable caching -> no locking -> better performance for non-overlap, write-only IO like BTIO? While tuning stripe size for a single access pattern shows good performance like above, it also create a problem. What if two different pattern I/Os need to be written to a single file. Even if the future ADIO hints for Lustre allow users to specify the stripe size, can a file''s stripe size + OSTs be changed during the open and close period? If yes, what will be the overhead of doing so? ---> You can only set/change the file-stripping parameters when the file is initially created. MPI-IO allows this though file-info hints during MPI_File_open, MPI_File_set_view, MPI_File_set_info. For the current release of Lustre, how can I specify stripe size and number of OSTs when creating a file? Of it has to be done off line with command lfs? --> In older Lustre releases, you have to use ioctl() calls along with LL_IOC_LOV parameters (defined in lustre_user.h) prior to file create. Its been long since I played with ioctls(), so Iam not sure if this holds true in the current 1.4.6 release. HTH, -Kums Thanks. Wei-keng On Tue, 7 Mar 2006, Andreas Dilger wrote:>On Mar 06, 2006 14:51 -0600, Wei-keng Liao wrote: >>Attached is a collection of offset-length pairs for FLASH-IO and BTIO. >> >>I have been playing different stripe size and no. OSTs on Tungsten@NCSA >>which has default 4MB stripe size and stripe over 8 OSTs. Changing the >>stripe size and no. OSTs does not help the performance a lot for FLASH and >>BTIO. 
I tried 64KB, 512KB, 1MB stripe size and 8 and 16 OSTs. Numbers of >>clients I ran range from 4 to 64. However, in both FLASH and BTIO, all >>clients read/write to a shared file. Maybe this is what causes the bad >>performance on Lustre. (The I/O bandwidth I got for FLASH-IO is less than >>10 MB/s and less than 80 MB/s for BTIO, on Tungsten which potentially can >>provide 11 GB/s I/O bandwidth.) So, my guess is that Lustre has not been >>optimized for concurrent or parallel I/O operations yet. As a heavy MPI-IO >>user, I wish there will be more demands on this aspect in the future :( > >Lustre has in fact been tested with single-file parallel IO for a long time, >and while performance is not as good as file-per-process, it has generally >been acceptable if the clients are doing large non-overlapping concurrent >writes. However, the IO pattern that single-file writes have been tuned >to is different than is shown here, with the heuristics being tuned for >tens-hundreds of clients doing aligned writes of 1MB or more concurrently. > >As this is the first time that I have seen complaints about IO performance >from this type of workload, there has not been any effort to tune lustre >for this load yet. I see that the lock policy doesn''t consider a single >file to be contended until there are at least 5 concurrent writers, and >not highly contended until there are at least 33 concurrent writers. >Most of the shared-file write tunings have been done with 400+ clients. > > >Looking at the start of one of the logs in question (OFFSET_LEN/BTIO.A4): >1 o btio.full.out >2 o btio.full.out >3 o btio.full.out >0 o btio.full.out >1 w 2621440 2621440 >1 b 1 1 >3 w 7864320 2621440 >3 b 1 1 >0 w 0 2621440 >0 b 1 1 >2 w 5242880 2621440 >2 b 1 1 >0 w 10485760 2621440 >3 w 18350080 2621440 >2 w 15728640 2621440 >3 b 1 1 >2 b 1 1 >0 b 1 1 >1 w 13107200 2621440 >1 b 1 1 >2 w 26214400 2621440 >2 b 1 1 >0 w 20971520 2621440 >3 w 28835840 2621440 >1 w 23592960 2621440 >3 b 1 1 >0 b 1 1 >1 b 1 1 > >It is clear from looking at this IO log that this application would be >optimal with a stripe size of 2621440 bytes and 4 stripes for the file, >or a multiple thereof. This would allow each process to essentially >write to its own file, and there would be no lock contention at all. > >lfs setstripe /path/to/output/btio.full.out 2621440 -1 4 >or >lfs setstripe /path/to/output/btio.full.out 1310720 -1 8 >or >lfs setstripe /path/to/output/btio.full.out 655360 -1 16 >etc. > >It is the lock contention that is essentially serializing the writes >from each client. It would be interesting to know if the ROMIO/MPIIO >layer would send the proper hints to a lustre-ADIO to allow it to create >the file in this manner? I admit I don''t know enough about the hints >that an ADIO implemention is given. > >In your README you mention that the b 1 1 operation is an MPI barrier. >Is there also a sync operation associated with this barrier? Even >without a sync, the fact that there is a barrier between each IO means >that each IO is waiting for the slowest client to complete the writes. > >Cheers, Andreas >-- >Andreas Dilger >Principal Software Engineer >Cluster File Systems, Inc. >Lustre-discuss mailing list Lustre-discuss@clusterfs.com https://mail.clusterfs.com/mailman/listinfo/lustre-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.clusterfs.com/pipermail/lustre-discuss/attachments/20060307/945ce5a5/attachment.html
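For completeness, here is a sketch of the ioctl() route Kums describes, based on the lov_user_md structure and LL_IOC_LOV_SETSTRIPE ioctl exposed by lustre_user.h in Lustre releases of that era. The exact header path, field names, the O_LOV_DELAY_CREATE open flag, and the requirement that striping be set before any data is written are assumptions that should be checked against the installed header and release notes:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <lustre/lustre_user.h>   /* struct lov_user_md, LL_IOC_LOV_SETSTRIPE */

int main(void)
{
    struct lov_user_md lum;

    /* Create the file without instantiating its objects, then set striping
     * with the ioctl before writing any data. */
    int fd = open("/mnt/lustre/btio.full.out",
                  O_CREAT | O_EXCL | O_WRONLY | O_LOV_DELAY_CREATE, 0644);
    if (fd < 0) { perror("open"); return 1; }

    memset(&lum, 0, sizeof(lum));
    lum.lmm_magic         = LOV_USER_MAGIC;
    lum.lmm_stripe_size   = 2621440;   /* bytes per OST object */
    lum.lmm_stripe_count  = 4;         /* number of OSTs */
    lum.lmm_stripe_offset = 0xffff;    /* i.e. -1: let the MDS pick the first OST */

    if (ioctl(fd, LL_IOC_LOV_SETSTRIPE, &lum) < 0)
        perror("LL_IOC_LOV_SETSTRIPE");

    close(fd);
    return 0;
}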
Andreas Dilger
2006-May-19 07:36 UTC
[Lustre-discuss] disable client-side caching on Lustre?
On Mar 07, 2006 14:17 -0600, Wei-keng Liao wrote:
> I ran BTIO.A.4 on Tungsten@NCSA with stripe size=2621440 and number of
> OSTs=4 against the system default setting (stripe size=4MB, OSTs=8). Here
> is the result:
>
>                             stripe size = 2560KB   stripe size = 4MB
>                             no. OSTs = 4           no. OSTs = 8
>  ------------------------------------------------------------------------
>  MPI file setup time =      0.11 sec               0.10 sec
>  write 400.00 MB in         1.85 sec               3.20 sec
>  read  400.00 MB in         1.42 sec               92.07 sec
>  total I/O amount :         800.00 MB              800.00 MB
>  number of I/Os   :         80                     80
>  ------------------------------------------------------------------------
>  nproc   array size         bandwidth (MB/s)       bandwidth (MB/s)
>  4       64 x 64 x 64       236.60                 8.39
>
> The overall performance is enhanced significantly. I believe that if I use
> more processors and a larger I/O amount, the bandwidth can go even higher.

Good, at least this part of the problem is understood. It is a lot easier
for me to understand what sort of problems are being seen when the data
is presented in this format (separate times for read and write) compared
to a single "bandwidth" number.

I also agree that with more OSTs and more clients it is likely that the
aggregate IO performance can be increased here.

> But let's check the read and write performance separately for the
> non-optimal case, the default stripe size. I can see that read is much
> worse than write. Could you comment on why the huge difference? If I/O is
> completely serialized due to file locking, how come read for the default
> stripe size is 65 times slower on a run with 4 compute nodes? Applying
> locking on write seems reasonable, though.

There needs to be locking on both read and write, to ensure that the
clients always see the correct data. Lustre implements full POSIX
read/write semantics and fine-grained (though page-aligned) file locks.
During read the clients can share locks on the same file, so there should
not be any conflicts there.

I suspect that one reason the read calls are much slower is that this time
also includes a significant fraction of time spent writing the data from
the client to disk. If there was an fsync (or MPI-IO equivalent) call after
the write, I suspect the read "performance" would increase, at the expense
of increasing the write time.

> This also comes back to my question first posted: "disable client-side
> caching on Lustre?". If file locking is used to ensure cache coherence
> on Lustre, can we infer that disabling caching -> no locking -> better
> performance for non-overlapping, write-only IO like BTIO?

While this might be true in theory, there is currently no mechanism other
than O_DIRECT to disable the client-side cache. It is possible (though not
supported) to disable client-side locking of the file, but this can only
be used in conjunction with O_DIRECT to ensure that the client does not
cache any data without a lock. In this case the application is wholly
responsible for ensuring the integrity of the file.

> While tuning the stripe size for a single access pattern shows good
> performance like the above, it also creates a problem. What if I/O with
> two different patterns needs to be written to a single file?

Yes, this would indeed pose a problem. In some cases it may be possible
to have a "lowest common denominator" for the stripe configuration, but
if the IO isn't aligned then even that won't help.

I agree that there is likely a lot more that can be done to improve this
situation, but so far there have been very few requirements in this
direction, so it may take some time to understand and resolve the issues.

> Even if the "future ADIO hints" for Lustre allow users to specify
> the stripe size, can a file's stripe size + OSTs be changed during the
> open and close period? If yes, what would be the overhead of doing so?

It is not possible to change the striping after the file has been created.

> For the current release of Lustre, how can I specify the stripe size and
> number of OSTs when creating a file? Or does it have to be done offline
> with the "lfs" command?

As Kums says, it is possible to specify the striping via ioctl() directly
on a new file, but there is also a small C library (liblustreapi.a) to
abstract such operations and give applications a nicer interface:

    #include <lustre/liblustreapi.h>

    int llapi_file_create(char *name, long stripe_size, int stripe_offset,
                          int stripe_count, int stripe_pattern)

    name           = pathname of file
    stripe_size    = number of bytes to put in each object (OST)
    stripe_offset  = starting OST index (should be -1 in the majority of cases)
    stripe_count   = number of stripes (OSTs) on which to create the file
    stripe_pattern = Lustre stripe pattern (only '0' is currently supported)

    return is 0 on success, or -ve unix error number

Link with "-llustreapi".

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
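For completeness, a minimal sketch of the interface described above, creating
the BTIO output file with the 2621440-byte stripe size and 4 stripes suggested
earlier in the thread; the mount-point path is illustrative:

    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <lustre/liblustreapi.h>

    int main(void)
    {
        /* stripe_offset -1 lets Lustre choose the starting OST;
         * stripe_pattern 0 is the only supported pattern. */
        int rc = llapi_file_create("/mnt/lustre/btio.full.out",
                                   2621440,   /* stripe_size    */
                                   -1,        /* stripe_offset  */
                                   4,         /* stripe_count   */
                                   0);        /* stripe_pattern */
        if (rc != 0)
            fprintf(stderr, "llapi_file_create failed: %s\n", strerror(-rc));
        return rc ? 1 : 0;
    }

Compile and link with -llustreapi, as noted above.
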
Andreas Dilger
2006-May-19 07:36 UTC
[Lustre-discuss] disable client-side caching on Lustre?
On Mar 08, 2006 15:43 -0600, Wei-keng Liao wrote:
> To verify your suspicion about data flushing in the read phase, I ran BTIO
> with fsync() for the following 2 cases:
> 1) add a single fsync() after all writes complete and before reads start
> 2) add an fsync() after each individual write.
>
> I only tested the default stripe size and OSTs, which is not optimized for
> BTIO on Lustre. Unfortunately, the results of both cases are similar to
> the one without calling fsync(). The read phase is still much longer than
> the write phase. I ran a couple of times and the results are all the same.
> So, caching/flushing is not the cause of this strange behavior.
>
> I would suspect the locking mechanism. (Maybe it's an implementation issue
> of deadlock, timeout, etc.) Consider this: even if I/O is completely
> serialized (from 4 processors) due to file locking, the timing/bandwidth
> should not be much worse than 4 times the optimal case, assuming there is
> no contention at all in the optimal case.

I agree, and for the write part the unoptimized performance is about 70%
slower than the optimized case (or 245% slower if you consider that the
unoptimized test has twice as many stripes). It isn't at all clear why the
read portion is so much slower for the unoptimized case, especially since
read locks do not conflict with each other.

> What is the locking design used by Lustre? Is Lustre using distributed
> file locking?

Lustre uses server-based extent locking. It is not fully distributed
locking in the sense that the "lock master" for a single resource could be
on different nodes. Rather, each storage server does locking for the data
that it controls, and the locking scalability of each file increases as the
number of stripes in the file increases (along with the potential IO
bandwidth).

The locks themselves are read and write extent locks, aligned on 4kB
boundaries (in most cases at least). Since the most common usage case is a
single client doing writes to the file, the OST optimistically grows the
extent (up to a full-file lock for the first writer) to reduce the number
of future lock requests. Consider the case where a process (e.g. cp) is
doing writes in 4kB chunks: we don't want to issue a lock request for each
write. If there are multiple competing writers then the lock must be
revoked and split into smaller non-overlapping extents, with a heuristic to
limit the extent size if there are "many" writers for the same object.
Five lockers stop downward lock growth, and 33 lockers cap the lock extent
size at max(requested size, 32MB), subject to other locking constraints.
This heuristic doesn't seem to fit your usage pattern very well at all.

Readers always get full-file extent locks unless there are write locks,
because many clients can have overlapping read locks without conflict. It
is of course possible to also read data under a write lock.

One possibility is that there is a bad interaction with the client
readahead. The client readahead may read a lot of data from parts of the
file it doesn't need (yet), so there is a potential (num_clients - 1)
overhead there. If the reads are done under a write lock, and that lock
needs to be revoked, then all the readahead would be discarded and possibly
read in again. Having the barriers between each read can also hamper this,
because the clients that have read data ahead of where they need it may not
get any chance to use that data before it is evicted due to a write lock
cancellation caused by another client issuing a conflicting read lock
request.
In the optimized case, each client is only doing read/write operations on a
single object, so it likely gets a lock only once on "its" object and then
never issues any more locks and doesn't cancel any other client's locks, so
all the read-ahead data is eventually used.

As a test, what happens if each client reads in all the file data it needs
at the beginning of the read phase, and then does all the computation in a
second step?

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
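A small sketch of the test suggested above: each client pulls its whole read
set into memory in one pass and then computes from the buffer, so no lock or
readahead traffic happens during the compute step. Plain POSIX calls are used
here, and the offset/length arguments are placeholders for a rank's actual
offset-length list:

    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* Read one contiguous extent of the shared file into memory.
     * Returns a malloc'd buffer the caller frees, or NULL on error. */
    static char *read_extent(const char *path, off_t offset, size_t len)
    {
        int fd = open(path, O_RDONLY);
        if (fd < 0)
            return NULL;

        char *buf = malloc(len);
        ssize_t got = buf ? pread(fd, buf, len, offset) : -1;
        close(fd);

        if (got != (ssize_t)len) {
            free(buf);
            return NULL;
        }
        return buf;    /* the compute phase then works from this buffer */
    }

In the striped-to-match case each rank's extent maps onto a single object, so
this read should complete with a single lock acquisition per rank.
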
Wei-keng Liao
2006-May-19 07:36 UTC
[Lustre-discuss] disable client-side caching on Lustre?
Andreas,

To verify your suspicion about data flushing in the read phase, I ran BTIO
with fsync() for the following 2 cases:
1) add a single fsync() after all writes complete and before reads start
2) add an fsync() after each individual write.

I only tested the default stripe size and OSTs, which is not optimized for
BTIO on Lustre. Unfortunately, the results of both cases are similar to the
one without calling fsync(). The read phase is still much longer than the
write phase. I ran a couple of times and the results are all the same. So,
caching/flushing is not the cause of this strange behavior.

I would suspect the locking mechanism. (Maybe it's an implementation issue
of deadlock, timeout, etc.) Consider this: even if I/O is completely
serialized (from 4 processors) due to file locking, the timing/bandwidth
should not be much worse than 4 times the optimal case, assuming there is
no contention at all in the optimal case.

What is the locking design used by Lustre? Is Lustre using distributed file
locking?

Wei-keng
I am looking for more information about using O_DIRECT. According to the
Lustre 1.4.6 manual, page 50, it says "For more information about the pros
and cons of using Direct I/O with Lustre, see Performance Concepts." Where
can I find this "Performance Concepts" document?

Wei-keng
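While that document is tracked down, the general shape of an O_DIRECT open on
Linux looks like the sketch below. The 4096-byte alignment is an assumption
for illustration; the exact alignment requirement depends on the kernel and
filesystem:

    #define _GNU_SOURCE     /* for O_DIRECT */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    /* Write nbytes of zero-filled data with the page cache bypassed.
     * Both the buffer address and the transfer size must satisfy the
     * filesystem's O_DIRECT alignment rules; 4096 is assumed here. */
    int write_direct(const char *path, size_t nbytes)
    {
        void *buf;
        if (nbytes % 4096 != 0 || posix_memalign(&buf, 4096, nbytes) != 0)
            return -1;
        memset(buf, 0, nbytes);          /* placeholder payload */

        int fd = open(path, O_WRONLY | O_CREAT | O_DIRECT, 0644);
        if (fd < 0) {
            free(buf);
            return -1;
        }

        ssize_t written = write(fd, buf, nbytes);
        close(fd);
        free(buf);
        return written == (ssize_t)nbytes ? 0 : -1;
    }
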
Wei-keng Liao
2006-May-19 07:36 UTC
[Lustre-discuss] disable client-side caching on Lustre?
Jay,

The zipped tarball I sent out has no timing data; it contains just
offset-length pairs. As I indicated in my earlier email, my experience was
on Tungsten@NCSA:
http://www.ncsa.uiuc.edu/UserInfo/Resources/Hardware/XeonCluster/TechSummary/

The bad performance numbers alone cannot go to any conference, which is why
I would like to figure out the cause. Lustre is supposed to have great
potential for scalable I/O on parallel machines. I plan to summarize the
results I collected recently and provide them to the Lustre development
group if they are willing to dig into this problem. In addition, if you can
briefly present your similar findings to the discussion group, it may
increase the chance of Lustre paying more attention to parallel I/O :)

Wei-keng

On Mon, 6 Mar 2006, J. K. Cliburn wrote:
> On 3/6/06, Wei-keng Liao <wkliao@ece.northwestern.edu> wrote:
>> Attached is a collection of offset-length pairs for FLASH-IO and BTIO.
>
> I didn't see any timings in your data. I'm very interested in your results
> because we're seeing marginal performance similar to your description on a
> Lustre filesystem attached to a very large parallel machine.
>
> Do you intend to present your results at a conference, or were the data
> collected for your own project's benefit? How could I get a copy of your
> findings?
>
> Thanks,
> Jay
Wei-keng Liao
2006-May-19 07:36 UTC
[Lustre-discuss] disable client-side caching on Lustre?
Don,

This is exactly the BTIO access pattern. No repeats exist among the writes
and no repeats among the reads. For the write phase, all writes are
disjoint. The same holds for the read phase.

Wei-keng

On Mon, 6 Mar 2006, Iozone wrote:
>> Don,
>>
>> Please discard those BTIO files. I generated these offset-length pairs on
>> a 4-node machine. ROMIO internally figures this out and lets only 4
>> processes perform the read()/write() calls. I will try to generate the
>> true offset-length pairs for different numbers of clients. For now, please
>> just use BTIO.A.4 and BTIO.B.4. The FLASH-IO offset-length files seem fine.
>
> Wei-Keng,
>
> Ok... Using BTIO.A.4 and looking at the writer, it's still an
> oscillating strided writer :-)
>
> Enjoy,
> Don Capps
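To make the write pattern under discussion easier to inspect, here is a small,
hypothetical reader for these offset-length trace files. It assumes each
record is "rank op arg1 arg2", with op 'w' meaning a write of arg2 bytes at
file offset arg1 and 'b' an MPI barrier, as the BTIO.A.4 excerpt quoted
earlier in the thread suggests; the 'o' open records carry a filename and are
skipped:

    #include <stdio.h>

    #define MAX_RANKS 1024

    /* Reads a trace on stdin and prints the seek gap before each write,
     * per rank; a constant non-zero gap indicates a strided writer. */
    int main(void)
    {
        char line[512];
        long last_end[MAX_RANKS];
        for (int i = 0; i < MAX_RANKS; i++)
            last_end[i] = -1;

        while (fgets(line, sizeof(line), stdin)) {
            long rank, off, len;
            char op;
            if (sscanf(line, "%ld %c %ld %ld", &rank, &op, &off, &len) != 4)
                continue;                 /* e.g. the "o filename" records */
            if (op != 'w' || rank < 0 || rank >= MAX_RANKS)
                continue;
            if (last_end[rank] >= 0)
                printf("rank %ld: gap of %ld bytes before write at offset %ld\n",
                       rank, off - last_end[rank], off);
            last_end[rank] = off + len;
        }
        return 0;
    }

Feeding BTIO.A.4 through this on stdin shows a constant gap between each
rank's successive fixed-size writes, which is the strided pattern noted above.
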
Andreas Dilger
2006-May-19 07:36 UTC
[Lustre-discuss] disable client-side caching on Lustre?
On Mar 06, 2006 20:03 -0600, J. K. Cliburn wrote:
> On 3/6/06, J. K. Cliburn <jcliburn@gmail.com> wrote:
> I meant to say a 4,176 *processor* Cray XT3. 4,128 processors are
> embodied in dual Opteron compute nodes running the Catamount
> lightweight kernel. The remainder are 4-way service nodes running a
> variant of SLES9.

Note that the Catamount Lustre clients are _completely_ different than the
Linux VFS Lustre clients. The Catamount clients cannot do ANY write caching
or asynchronous writes because the operating environment does not support
interrupts for notification of IO completion. The writes must be completed
during each syscall, and are as a result completely synchronous.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
Wei-keng Liao
2006-May-19 07:36 UTC
[Lustre-discuss] disable client-side caching on Lustre?
Andreas,

Please see below.

On Tue, 7 Mar 2006, Andreas Dilger wrote:
> As this is the first time that I have seen complaints about IO performance
> from this type of workload, there has not been any effort to tune Lustre
> for this load yet. I see that the lock policy doesn't consider a single
> file to be "contended" until there are at least 5 concurrent writers, and
> not "highly contended" until there are at least 33 concurrent writers.
> Most of the shared-file write tunings have been done with 400+ clients.

I do not quite understand this lock policy. What does Lustre do when it
doesn't consider a single file to be "contended"? What does it do in the
"highly contended" case? Can you point me to any further reading about this?

> While this might be true in theory, there is currently no mechanism other
> than O_DIRECT to disable client-side cache. It is possible (though not
> supported) to disable client-side locking of the file, but this can only
> be used in conjunction with O_DIRECT to ensure that the client does not
> cache any data without a lock. In this case the application is wholly
> responsible for ensuring the integrity of the file.

Can you describe how to disable locking in conjunction with O_DIRECT, even
though it is not supported? Can it be done by application users without
root permission?

Thanks.
Wei-keng