thr3ads.net - Lustre discuss - [Lustre-discuss] short writes [Jul 2010]

David Singleton

2010-Jul-08 13:25 UTC

[Lustre-discuss] short writes

The POSIX standard pretty clearly allows short writes to occur (number of
bytes written less than requested in a successful call to write) but it's
not something you see very often and I don't think many users/applications
expect it to occur when writing to disk-based files.  We are seeing it
fairly regularly and just wanted to confirm that we (or rather our users)
should expect this behaviour from Lustre.

We are seeing the issue with the infamous Gaussian quantum chem code
which spends literally days constantly writing and reading to scratch files
in roughly 1GB chunks as part of out-of-core solvers.  We manage jobs using
simple SIGSTOP/SIGCONT based suspend/resume and occasionally jobs will flag
a short write immediately after a SIGCONT. The application incorrectly
treats this as an error and aborts.  Adding code to complete the write
appears to fix the problem (as you'd hope).  Now we are at the stage of
"debating" with the application developers whether it's their problem or
Lustre's.

Is this considered normal Lustre behaviour?

This is with 1.8.3 clients on 2.6.27.46.

Thanks,
David

Kevin Van Maren

2010-Jul-08 13:53 UTC

[Lustre-discuss] short writes

Hi David,

I've also seen short writes on local file systems -- can't even count
the number of times I've modified codes to use wrappers that handle
short reads/writes.  Not at all surprised you see them when suspending
the app.

http://www.opengroup.org/onlinepubs/000095399/functions/write.html
"If write() is interrupted by a signal after it successfully writes some 
data, it shall return the number of bytes written."
Similar language exists for read as well.  I always thought libc should
handle the retry for you by default, but I didn't write the spec.

Signals are relatively rare, and the window is a bit smaller for a local
file system, which may be why they haven't seen it/properly dealt with
it yet.

Kevin

Brian J. Murrell

2010-Jul-08 14:05 UTC

[Lustre-discuss] short writes

On Thu, 2010-07-08 at 23:25 +1000, David Singleton wrote:
> The POSIX standard pretty clearly allows short writes to occur (number of
> bytes written less than requested in a successful call to write)
Yes, I believe that is true.  The write(2) manpage also indicates that a
write of less than the requested bytes is possible.
> but its
> not something you see very often
Not with local disk, no.
> and I dont think many users/applications
> expect it to occur when writing to disk based files.
Indeed.  Sadly, many applications don't expect a lot of conditions that
are possible and allowed but historically not seen.
> We manage jobs using
> simple SIGSTOP/SIGCONT based suspend/resume and occasionally jobs will flag
> a short write immediately after a SIGCONT.
Interesting.
> The application incorrectly
> treats this as an error and aborts.
Yes, if an application is treating a short write as an error, I do
believe that that is an application defect.
> Adding code to complete the write
> appears to fix the problem (as you'd hope).
No surprise.  :-)
> Now we are at the stage of
> "debating" with the application developers whether it's their problem or
> Lustre's.
Well, I'm not sure how much of a debate there is there.  If POSIX is
clearly allowing this and given that Lustre is (mostly, if not
completely) POSIX compliant, it should be clear that this is not a
defect in Lustre and that the application(s) should be fixed to handle
what is legal behaviour.
> Is this considered normal Lustre behaviour?
Hopefully one of the developers with a more intimate knowledge can
answer definitively about particular conditions in Lustre that can cause
short writes, but it would not at all be surprising that if POSIX allows
it and we have a need to take advantage of it, we would.

After all, from an application point of view, if there were some
situation in the filesystem in which it simply cannot perform the
complete write, but could write a smaller portion with a possibility
that another write will be successful also, wouldn't that be better than
an outright EIO?

b.

Brian J. Murrell

2010-Jul-08 14:08 UTC

[Lustre-discuss] short writes

On Thu, 2010-07-08 at 07:53 -0600, Kevin Van Maren wrote:
> Hi David,
Hey Kevin,
> http://www.opengroup.org/onlinepubs/000095399/functions/write.html
Heh.  Funny enough, I was reading the exact same URL.
> I always thought libc should
> handle the retry for you by default, but I didn't write the spec.
write(2) is a system call, not a libc function.  fwrite(3) is a
comparable libc function, so libc might be able to handle short
write(2)s in fwrite(3), but really it should not (IMHO) be mucking with
write(2) (or any other) system calls.

b.

Bernd Schubert

2010-Jul-08 16:33 UTC

[Lustre-discuss] short writes

On Thursday, July 08, 2010, Brian J. Murrell wrote:
> On Thu, 2010-07-08 at 07:53 -0600, Kevin Van Maren wrote:
> > Hi David,
>
> Hey Kevin,
>
> > http://www.opengroup.org/onlinepubs/000095399/functions/write.html
>
> Heh.  Funny enough, I was reading the exact same URL.
>
> > I always thought libc should
> > handle the retry for you by default, but I didn't write the spec.
>
> write(2) is a system call, not a libc function.  fwrite(3) is a
> comparable libc function, so libc might be able to handle short
> write(2)s in fwrite(3), but really it should not (IMHO) be mucking with
> write(2) (or any other) system calls.
You have to keep in mind that Gaussian is a Fortran application. Fortran has 
its own IO library and it would be quite possible that the library of some 
compilers can handle a short write, but the library of other compilers can 
not... In my university group we had to deal with quite some weird effects 
between fortran-IO implementations... 

David, did you use PGI or another compiler? Last time I had to deal with 
Gaussian only PGI was supported, but I have not checked for recent Gaussian 
versions.


Cheers,
Bernd

-- 
Bernd Schubert
DataDirect Networks

John Hammond

2010-Jul-08 21:50 UTC

[Lustre-discuss] short writes

On 07/08/2010 08:53 AM, Kevin Van Maren wrote:
> Hi David,
>
> I've also seen short writes on local file systems -- can't even count
> the number of times I've modified codes to use wrappers that handle
> short reads/writes.  Not at all surprised you see them when suspending
> the app.
>
> http://www.opengroup.org/onlinepubs/000095399/functions/write.html
> "If write() is interrupted by a signal after it successfully writes some
> data, it shall return the number of bytes written."
> Similar language exists for read as well.  I always thought libc should
> handle the retry for you by default, but I didn't write the spec.
>
> Signals are relatively rare, and the window is a bit smaller for a local
> file system, which may be why they haven't seen it/properly dealt with
> it yet.
It also says "The issue of which files or file types are interruptible 
is considered an implementation design issue. This is often affected 
primarily by hardware and reliability issues."

For Linux, the signal(7) manpage indicates that read(2), readv(2),
write(2), writev(2), and ioctl(2) calls on "slow" devices should return
-EINTR when interrupted by a signal, and goes on to say that "slow"
devices are ones "where the I/O call may block for an indefinite time,
for example, a terminal, pipe, or socket.  (A disk is not a slow device
according to this definition.)"

Nowhere does it say something really helpfully clear like "Writing to a 
regular file shall suspend the calling process until such time as..." 
But, I interpret this to mean that operations on regular files are not 
interruptible, and should not return -EINTR.  Moreover, I understand 
that this is the consensus among those unlucky enough to care.

On the other hand, there are some explicitly specified situations which 
will result in short writes to a regular file, like file size limits.

-- 
John L. Hammond, Ph.D.
ICES, The University of Texas at Austin
jhammond at ices.utexas.edu
(512) 471-9304

David Singleton

2010-Jul-08 21:54 UTC

[Lustre-discuss] short writes

On 07/09/2010 02:33 AM, Bernd Schubert wrote:
> You have to keep in mind that Gaussian is a Fortran application. Fortran has
> its own IO library and it would be quite possible that the library of some
> compilers can handle a short write, but the library of other compilers can
> not... In my university group we had to deal with quite some weird effects
> between fortran-IO implementations...
>
> David, did you use PGI or another compiler? Last time I had to deal with
> Gaussian only PGI was supported, but I have not checked for recent Gaussian
> versions.
>
Hi Bernd,

Gaussian uses C directly for all system interaction.  And we have confirmed
that the behaviour is not confined to the standard PGI build.

With regards to dodgy Fortran runtimes, yes, we believe the Intel lib can
(effectively) return a "zero written" EINTR to the application. I say
effectively because the only indication is an "interrupted system call"
message to stderr while doing IO.  There is no portable Fortran way to
determine that this is a recoverable EINTR return.

David

Kevin Van Maren

2010-Jul-08 22:48 UTC

[Lustre-discuss] short writes

John Hammond wrote:
> On 07/08/2010 08:53 AM, Kevin Van Maren wrote:
>> Hi David,
>>
>> I've also seen short writes on local file systems -- can't even count
>> the number of times I've modified codes to use wrappers that handle
>> short reads/writes.  Not at all surprised you see them when suspending
>> the app.
>>
>> http://www.opengroup.org/onlinepubs/000095399/functions/write.html
>> "If write() is interrupted by a signal after it successfully writes some
>> data, it shall return the number of bytes written."
>> Similar language exists for read as well.  I always thought libc should
>> handle the retry for you by default, but I didn't write the spec.
>>
>> Signals are relatively rare, and the window is a bit smaller for a local
>> file system, which may be why they haven't seen it/properly dealt with
>> it yet.
>
> It also says "The issue of which files or file types are interruptible
> is considered an implementation design issue. This is often affected 
> primarily by hardware and reliability issues."
>
> For Linux, the signal(7) manpage indicates that read(2), readv(2), 
> write(2), writev(2), and ioctl(2) calls on "slow" devices should 
> return -EINTR when interrupted by a signal, and goes on to say that 
> "slow" devices are ones "where the I/O call may block for an
> indefinite time, for example, a terminal, pipe, or socket.  (A disk is 
> not a slow device according to this definition.)"
How about a network file system waiting for server failover (especially 
if it is not automatic)?
> Nowhere does it say something really helpfully clear like "Writing to 
> a regular file shall suspend the calling process until such time 
> as..." But, I interpret this to mean that operations on regular files 
> are not interruptible, and should not return -EINTR.  Moreover, I 
> understand that this is the consensus among those unlucky enough to care.
>
> On the other hand, there are some explicitly specified situations 
> which will result in short writes to a regular file, like file size 
> limits.
With NFS, "hard,intr" is the most sane configuration.  For Lustre, 
operations (should) become interruptible after the initial timeout 
period has passed.

Kevin

John Hammond

2010-Jul-08 23:51 UTC

[Lustre-discuss] short writes

On 07/08/2010 05:48 PM, Kevin Van Maren wrote:
> John Hammond wrote:
>> On 07/08/2010 08:53 AM, Kevin Van Maren wrote:
>>> Hi David,
>>>
>>> I've also seen short writes on local file systems -- can't even
>>> count the number of times I've modified codes to use wrappers
>>> that handle short reads/writes. Not at all surprised you see
>>> them when suspending the app.
>>>
>>> http://www.opengroup.org/onlinepubs/000095399/functions/write.html
>>>
>>> "If write() is interrupted by a signal after it successfully writes
>>> some data, it shall return the number of bytes written." Similar
>>> language exists for read as well. I always thought libc should
>>> handle the retry for you by default, but I didn't write the
>>> spec.
>>>
>>> Signals are relatively rare, and the window is a bit smaller for
>>> a local file system, which may be why they haven't seen
>>> it/properly dealt with it yet.
>>
>> It also says "The issue of which files or file types are
>> interruptible is considered an implementation design issue. This
>> is often affected primarily by hardware and reliability issues."
>>
>> For Linux, the signal(7) manpage indicates that read(2), readv(2),
>> write(2), writev(2), and ioctl(2) calls on "slow" devices should
>> return -EINTR when interrupted by a signal, and goes on to say
>> that "slow" devices are ones "where the I/O call may block for an
>> indefinite time, for example, a terminal, pipe, or socket. (A disk
>> is not a slow device according to this definition.)"
>
> How about a network file system waiting for server failover
> (especially if it is not automatic)?
That's not indefinite.  The FS is waiting for something which will
eventually occur.  (Assuming it is correctly administered.)
>> Nowhere does it say something really helpfully clear like "Writing
>> to a regular file shall suspend the calling process until such
>> time as..." But, I interpret this to mean that operations on
>> regular files are not interruptible, and should not return -EINTR.
>> Moreover, I understand that this is the consensus among those
>> unlucky enough to care.
>>
>> On the other hand, there are some explicitly specified situations
>> which will result in short writes to a regular file, like file
>> size limits.
>
> With NFS, "hard,intr" is the most sane configuration.
Yes, because NFS is FTP but with less typing.
> For Lustre, operations (should) become interruptible after the
> initial timeout period has passed.
I disagree.  Lustre is not NFS.  The intended uses are big
noninteractive jobs.  Who would interrupt them?  Most likely
administrators who know that some server is hosed.  Better to have a
long timeout, and once it passes, the operations return -EIO, the
clients try to reconnect, and maybe the FS can heal itself.  Even if it
can't, I would argue that the FS is still easier to administer than
before, since you don't have to ssh out to every stuck node.

Also, why make the logic any harder?  In the FS, isn't it much easier to
emulate a block device, leaving the process in D sleep when you have to?
And in the application it's one less thing to worry about.  Plus the
behavior matches the block device/page cache mental model better.

-- 
John L. Hammond, Ph.D.
ICES, The University of Texas at Austin
jhammond at ices.utexas.edu
(512) 471-9304

Andrew Perepechko

2010-Jul-14 23:07 UTC

[Lustre-discuss] short writes

Hello John,

please note that the Lustre client does not write data to a disk device;
rather, it sends and receives data over the network, in particular through
a socket in the case of ksocklnd. Even if the call may not be considered
slow with respect to indefiniteness, it is still slow compared to
"fast I/O" (local disk I/O). I can recall several requests to add
even more interruption points, made by people who wanted support for
premature read/write termination (and, I believe, l_wait_event usage in
1.8 is such that once a network request has been sent, the read/write
system call cannot be interrupted until the corresponding Lustre timeout
has expired or a reply has been received).

Best wishes,
Andrew.

John Hammond

2010-Jul-15 21:21 UTC

[Lustre-discuss] short writes

Hi Andrew,
> please note that the Lustre client does not write data to a disk
> device, rather it sends and receives data through network,
Yes, I've heard about that.
> particularly, through a socket in case of ksocklnd.
A socket isn't considered "slow" because of the speed of the network;
it's considered "slow" so that the application is allowed to handle IPC
with unresponsive peers.
> Even if the call may not be considered slow with respect to
> indefiniteness, it still is slow as compared to the "fast I/O" (local
> disk I/O). I can recall several requests of adding even more
> interruption points made by people who were looking for the support
> of premature read/write termination (and, I believe, l_wait_event
> usage in 1.8 is such that since a network request has been sent, the
> read/write system call cannot be interrupted until the corresponding
> lustre timeout has happened or a reply has been received)
Looking at 1.8.3, I see that l_wait_event() allows calls to be
interrupted under certain circumstances (if the timeout has expired, or
no timeout was specified, ...).  But then only if the pending signal
belongs to LUSTRE_FATAL_SIGS: SIGKILL, SIGINT, SIGTERM, SIGQUIT, or
SIGALRM.  I guess the assumption is that all of these signals are fatal
anyway, and delivering them is useful to users who change their minds
about untarring the Encyclopedia Britannica, and then go complain that
Lustre breaks Ctrl-C.  Fine.  Aside from the fact that the latter four
may not be fatal, and that this may cause some unexpected breakage among
unsuspecting applications that handle these signals for purposes other
than process termination...whatever.  I'm giving up on this point.

I also noticed that the signal mask handling in l_wait_event is slightly
defective.  In the cases where l_wait_event would allow the call to be
interrupted, it sets the caller's mask to allow LUSTRE_FATAL_SIGS.
Consider the following sequence of events for a process P:

   P blocks SIGALRM.
   SIGALRM is sent to P.
   P calls open().
   RPC to MDS times out.
   l_wait_event unblocks LUSTRE_FATAL_SIGS.
   l_wait_event determines that SIGALRM is deliverable.
   l_wait_event restores the signal mask.
   l_wait_event returns -EINTR.
   open() returns -EINTR.

Thus open() is interrupted by the non-delivery of a blocked signal.
It's easy to reproduce, if somewhat obscure.

Best,

John

-- 
John L. Hammond, Ph.D.
ICES, The University of Texas at Austin
jhammond at ices.utexas.edu
(512) 471-9304

Christopher J. Morrone

2010-Jul-15 23:56 UTC

[Lustre-discuss] short writes

On 07/08/2010 04:51 PM, John Hammond wrote:
>> How about a network file system waiting for server failover
>> (especially if it is not automatic)?
>
> That's not indefinite.  The FS is waiting for something which will
> eventually occur.  (Assuming it is correctly administered).
That IS indefinite.  Indefinite just means that the limit is vague
and/or unknown, as opposed to having a clear and well defined bound.

In the IO context, any operation that is unbounded (indefinite) may take
a very long time in human terms, and therefore should be interruptible.
It is just not very reasonable to have a process stuck unkillable for
days.  But on the other hand, we don't want it timing out either if the
data is valuable and we are willing to wait a day or two for hardware
repairs.

Even ignoring the fact that Lustre's behavior is allowed by the POSIX
spec, I believe that Lustre is doing the Right Thing.

If a job hangs on a write because servers are unavailable, it should 
hang indefinitely until the server is restored, or until a human 
interrupts the process with a signal.

Only a human can determine how valuable that job's output is.  If the
human determines that the data is very important, and must finish, then
they leave the job hanging and go fix the servers.  If they decide that
the data is easily reproducible and the compute cluster is better spent
running another job that doesn't require the down filesystem, then they
have the ability to abort the operation with a signal.

Now all that said, there may be an argument to be made that SIGSTOP and 
SIGCONT should not be signals that interrupt Lustre client operations.

Peter Grandi

2010-Jul-17 09:10 UTC

[Lustre-discuss] short writes

>>>> [ ... whether apps can rely on the kernel always returning a
>>>> full read or write count on file IO except at EOF on read ... ]
As it has been remarked the answer is NO.

BTW this question is not the same as the "interruptible" one.
There is a difference between the kernel being allowed to
read(2) or write(2) less than requested by the process and them
being apparently atomic.

The kernel may always return a count of bytes read or written
less than requested for any reason whatever, even if a signal
has not interrupted the operations.

The applications have to deal with it. Most applications are
written wrong (and in many other ways, e.g. how many check, or
even just '(void)', the return code from 'close', for example, and
never mind not calling 'flock' or 'fsync'), and as many kernel
writers say, "userspace sucks"; but most application mistakes
matter only in infrequent cases, and when these happen users
just shrug.

The reasons why the semantics are like that have been explained
very clearly by Gabriel in his paper "Worse is better".
>>> How about a network file system waiting for server failover
>>> (especially if it is not automatic)?
>> That's not indefinite. The FS is waiting for something which
>> will eventually occur.
Here "indefinite" as to a wait duration is used in two rather
different ways. One is to say that it is "unknowable", the other
is that it is "unknown at the moment and expected to be in some
relevant sense long".

There is a fundamental difference. If the outcome (success or
failure) of an operation may or may not become known, we have a
completely different class of models of computation from the
usual Turing or Church or Von Neumann one, with rather
completely different properties from the usual one.

The halting problem does not exist, as all computations must
complete, but the outcome on completion can be indeterminate
(which is the opposite of the usual class of models of computation).
Once upon a time I even wrote a paper (in a very obscure journal)
on the difference between the two classes of models of
computation and why it matters a lot.

In the distributed filesystem case one is trying to simulate one
class of model of computation on another, which is simply not
possible in the edge cases (those which matter). Attempting
POSIX semantics in that case requires a lot of effort and a
considerable suspension of disbelief.
> (Assuming it is correctly administered).
That's the key statement -- here the hidden assumption is that
"correctly administered" means that there is a central agency
that ensures that all operations have a known outcome if they
complete. If there is no central agency, all operations complete
because they eventually timeout, but whether they succeeded or
not is not always knowable.
> That IS indefinite. Indefinite just means that the limit is
> vague and/or unknown, as opposed to having a clear and well
> defined bound.
Actually that applies only to the non-distributed case. In the
distributed case it means that the outcome may be absolutely
unknowable.

Suppose, for example, that you write a log entry to a file on a
Lustre file server, and the kernel code receives confirmation
that the write request has been sent, but then all communication
with the file server ceases. Has the log entry been written to
the file server disk?  Well, how can you figure that out? No way
(unless an admin looks at the file server and thus restarts
communications). That's "indefinite": when whether the operation
succeeded or failed cannot be known.

[ ... ]
> If a job hangs on a write because servers are unavailable, it
> should hang indefinitely until the server is restored, or
> until a human interrupts the process with a signal.
That is if one wants to preserve the illusion that a centralized
class model of computation is available when a distributed class
one is the reality. Then the human interruption is the point at
which the illusion goes away.

The better way to handle the distributed case is to design
programs knowingly for the class of distributed models of
computation, which requires completely different programming
strategies, and of course nearly everybody does not realize that
(even if some programmers of distributed systems with high
reality requirements rediscover them).
> Only a human can determine how valuable that job's output is.
> If the human determines that the data is very important, and
> must finish, then they leave the job hanging and go fix the
> servers. If they decide that the data is easily reproducible
> and the compute cluster is better spent running another job
> that doesn't require the down filesystem, then they have the
> ability to abort the operation with a signal. [ ... ]
Here you are assuming, though, that the underlying model of computation
is in the centralized class, that is, the outcome (success or
failure) of an operation is knowable, even if only by the human
component of the system.

For Lustre systems this is usually a good assumption as most
Lustre installations are centrally managed and in a single
location and hardware and software state can and will be
inspected to determine the outcome of operations.

But people are now using Lustre across wide geographical
networks and over mobs of thousands (or tens of thousands)
of clients and servers, and in such cases it is usually not
practical to assume that the outcome of an operation is
knowable, and eventually people will learn that means a
completely different world. Or not, as the 'O_PONIES' story
about 'fsync' and barriers demonstrates.

Andrew Perepechko

2010-Jul-19 12:09 UTC

[Lustre-discuss] short writes

Hello John.

On 07/16/2010 01:21 AM, John Hammond wrote:
> A socket isn't considered "slow" because of the speed of the network,
> it's considered "slow" so that the application is allowed to handle
> IPC with unresponsive peers.

The Linux man page does not state that socket I/O is considered "slow"
NOT because of the speed of the network. It is usually considered slow
because of possible slowness of the network, because of possibly
unresponsive peers, and for some other reasons. In any case, the difference
between local disk I/O and socket I/O is not that the latter may
last forever, since the socket interface uses the notion of a timeout.
Also, local disk I/O may take a very long time to complete if the I/O
subsystem is under pressure. The difference is subtle.

If possibly unresponsive peers indicate "slowness" of the I/O, then the
Lustre client _should_ be able to interrupt the I/O and is allowed to
perform short reads.

Best wishes,
Andrew.
