The POSIX standard pretty clearly allows short writes to occur (number of
bytes written less than requested in a successful call to write), but it's
not something you see very often and I don't think many users/applications
expect it to occur when writing to disk-based files. We are seeing it
fairly regularly and just wanted to confirm that we (or rather our users)
should expect this behaviour from Lustre.

We are seeing the issue with the infamous Gaussian quantum chem code,
which spends literally days constantly writing and reading to scratch files
in roughly 1GB chunks as part of out-of-core solvers. We manage jobs using
simple SIGSTOP/SIGCONT based suspend/resume, and occasionally jobs will flag
a short write immediately after a SIGCONT. The application incorrectly
treats this as an error and aborts. Adding code to complete the write
appears to fix the problem (as you'd hope). Now we are at the stage of
"debating" with the application developers whether it's their problem or
Lustre's.

Is this considered normal Lustre behaviour?

This is with 1.8.3 clients on 2.6.27.46.

Thanks,
David

Hi David,

I've also seen short writes on local file systems -- can't even count
the number of times I've modified codes to use wrappers that handle
short reads/writes. Not at all surprised you see them when suspending
the app.

http://www.opengroup.org/onlinepubs/000095399/functions/write.html
"If write() is interrupted by a signal after it successfully writes some
data, it shall return the number of bytes written."
Similar language exists for read as well. I always thought libc should
handle the retry for you by default, but I didn't write the spec.

Signals are relatively rare, and the window is a bit smaller for a local
file system, which may be why they haven't seen it or properly dealt with
it yet.

Kevin

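The kind of wrapper mentioned above can be sketched roughly as follows. This is only a minimal illustration, not the code used in Gaussian or at any site in this thread: the name write_all is hypothetical, and treating a zero-byte return as an error (EIO) is an assumption made here to guarantee the loop terminates.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * write_all: hypothetical helper, not part of libc or Lustre.
 * Calls write(2) repeatedly until all `count` bytes are written.
 * Returns 0 on success, -1 on error with errno set.
 */
int write_all(int fd, const void *buf, size_t count)
{
    const char *p = buf;

    while (count > 0) {
        ssize_t n = write(fd, p, count);

        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted before any data moved: retry */
            return -1;          /* genuine error, errno set by write(2) */
        }
        if (n == 0) {
            errno = EIO;        /* assumption: no progress is treated as an error */
            return -1;
        }
        /* n < count is a short write: legal, so advance and keep going */
        p += n;
        count -= (size_t)n;
    }
    return 0;
}
```

A caller would replace a bare write(fd, buf, len) with write_all(fd, buf, len) and only treat a -1 return as a failure. A matching read wrapper would also need to stop at end-of-file (a return of 0).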
On Thu, 2010-07-08 at 23:25 +1000, David Singleton wrote:
> The POSIX standard pretty clearly allows short writes to occur (number of
> bytes written less than requested in a successful call to write)

Yes, I believe that is true. The write(2) manpage also indicates that a
write of less than the requested bytes is possible.

> but its
> not something you see very often

Not with local disk, no.

> and I dont think many users/applications
> expect it to occur when writing to disk based files.

Indeed. Sadly, many applications don't expect a lot of conditions that are
possible and allowed but historically not seen.

> We manage jobs using
> simple SIGSTOP/SIGCONT based suspend/resume and occasionally jobs will flag
> a short write immediately after a SIGCONT.

Interesting.

> The application incorrectly
> treats this as an error and aborts.

Yes, if an application is treating a short write as an error, I do
believe that is an application defect.

> Adding code to complete the write
> appears to fix the problem (as you'd hope).

No surprise. :-)

> Now we are at the stage of
> "debating" with the application developers whether it's their problem or
> Lustre's.

Well, I'm not sure how much of a debate there is there. If POSIX clearly
allows this, and given that Lustre is (mostly, if not completely) POSIX
compliant, it should be clear that this is not a defect in Lustre and that
the application(s) should be fixed to handle what is legal behaviour.

> Is this considered normal Lustre behaviour?

Hopefully one of the developers with a more intimate knowledge can
answer definitively about particular conditions in Lustre that can cause
short writes, but it would not be at all surprising if, given that POSIX
allows it and we have a need to take advantage of it, we did.

After all, from an application point of view, if there were some
situation in the filesystem in which it simply cannot perform the complete
write, but could write a smaller portion with a possibility that another
write will also be successful, wouldn't that be better than an outright
EIO?

b.

On Thu, 2010-07-08 at 07:53 -0600, Kevin Van Maren wrote:
> Hi David,

Hey Kevin,

> http://www.opengroup.org/onlinepubs/000095399/functions/write.html

Heh. Funny enough, I was reading the exact same URL.

> I always thought libc should
> handle the retry for you by default, but I didn't write the spec.

write(2) is a system call, not a libc function. fwrite(3) is a
comparable libc function, so libc might be able to handle short
write(2)s in fwrite(3), but really it should not (IMHO) be mucking with
write(2) (or any other) system calls.

b.

On Thursday, July 08, 2010, Brian J. Murrell wrote:
> On Thu, 2010-07-08 at 07:53 -0600, Kevin Van Maren wrote:
> > I always thought libc should
> > handle the retry for you by default, but I didn't write the spec.
>
> write(2) is a system call, not a libc function. fwrite(3) is a
> comparable libc function, so libc might be able to handle short
> write(2)s in fwrite(3), but really it should not (IMHO) be mucking with
> write(2) (or any other) system calls.

You have to keep in mind that Gaussian is a Fortran application. Fortran has
its own I/O library, and it is quite possible that the library of some
compilers can handle a short write while the library of other compilers
cannot... In my university group we had to deal with some quite weird
differences between Fortran I/O implementations...

David, did you use PGI or another compiler? Last time I had to deal with
Gaussian only PGI was supported, but I have not checked recent Gaussian
versions.

Cheers,
Bernd

--
Bernd Schubert
DataDirect Networks

On 07/08/2010 08:53 AM, Kevin Van Maren wrote:
> http://www.opengroup.org/onlinepubs/000095399/functions/write.html
> "If write() is interrupted by a signal after it successfully writes some
> data, it shall return the number of bytes written."
> Similar language exists for read as well. I always thought libc should
> handle the retry for you by default, but I didn't write the spec.
>
> Signals are relatively rare, and the window is a bit smaller for a local
> file system, which may be why they haven't seen it/properly dealt with
> it yet.

It also says "The issue of which files or file types are interruptible
is considered an implementation design issue. This is often affected
primarily by hardware and reliability issues."

For Linux, the signal(7) manpage indicates that read(2), readv(2),
write(2), writev(2), and ioctl(2) calls on "slow" devices should return
-EINTR when interrupted by a signal, and goes on to say that "slow"
devices are ones "where the I/O call may block for an indefinite time,
for example, a terminal, pipe, or socket. (A disk is not a slow device
according to this definition.)"

Nowhere does it say something really helpfully clear like "Writing to a
regular file shall suspend the calling process until such time as..."
But I interpret this to mean that operations on regular files are not
interruptible, and should not return -EINTR. Moreover, I understand
that this is the consensus among those unlucky enough to care.

On the other hand, there are some explicitly specified situations which
will result in short writes to a regular file, like file size limits.

--
John L. Hammond, Ph.D.
ICES, The University of Texas at Austin
jhammond at ices.utexas.edu
(512) 471-9304

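The file size limit case mentioned above is one short write that can be provoked on demand, with no signals or network involved. The sketch below is an illustration under assumptions, not anything from Lustre or Gaussian: the helper name, the /tmp path and the sizes are all hypothetical. It lowers the soft RLIMIT_FSIZE and then requests a write larger than the limit on a fresh regular file. On Linux I would expect write(2) to return only the bytes that fit, though the exact behaviour is up to the implementation.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * short_write_at_fsize_limit: hypothetical demo helper.
 * Temporarily sets the soft file size limit to `limit` bytes, then asks
 * write(2) for `request` bytes at offset 0 of a new temporary file.
 * Returns whatever write(2) returned, or -1 if the setup failed.
 */
ssize_t short_write_at_fsize_limit(size_t limit, size_t request)
{
    char path[] = "/tmp/shortwrite-XXXXXX";   /* assumed writable location */
    struct rlimit old, lim;
    void (*old_handler)(int);
    ssize_t n = -1;
    char *buf;
    int fd;

    if (getrlimit(RLIMIT_FSIZE, &old) != 0)
        return -1;
    lim = old;
    lim.rlim_cur = (rlim_t)limit;             /* only the soft limit changes */

    /* SIGXFSZ is sent when a write fails outright at the limit; ignore it */
    old_handler = signal(SIGXFSZ, SIG_IGN);

    buf = calloc(1, request);
    fd = mkstemp(path);
    if (buf != NULL && fd >= 0 && setrlimit(RLIMIT_FSIZE, &lim) == 0) {
        n = write(fd, buf, request);          /* may be short: only `limit` bytes fit */
        setrlimit(RLIMIT_FSIZE, &old);        /* restore the original soft limit */
    }

    if (fd >= 0) {
        close(fd);
        unlink(path);
    }
    free(buf);
    signal(SIGXFSZ, old_handler);
    return n;
}
```

For example, short_write_at_fsize_limit(4096, 8192) requests 8192 bytes with only 4096 bytes of room, so a return value of 4096 is a successful short write rather than an error. An application that treats it as a failure has the same defect discussed in this thread.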
On 07/09/2010 02:33 AM, Bernd Schubert wrote:
> You have to keep in mind that Gaussian is a Fortran application. Fortran has
> its own IO library and it would be quite possible that the library of some
> compilers can handle a short write, but the library of other compilers can
> not... In my university group we had to deal with quite some weird effects
> between fortran-IO implementations...
>
> David, did you use PGI or another compiler? Last time I had to deal with
> Gaussian only PGI was supported, but I have not checked for recent Gaussian
> versions.

Hi Bernd,

Gaussian uses C directly for all system interaction. And we have confirmed
that the behaviour is not confined to the standard PGI build.

With regard to dodgy Fortran runtimes: yes, we believe the Intel lib can
(effectively) return a "zero written" EINTR to the application. I say
effectively because the only indication is an "interrupted system call"
message to stderr while doing I/O. There is no portable Fortran way to
determine that this is a recoverable EINTR return.

David

John Hammond wrote:
> It also says "The issue of which files or file types are interruptible
> is considered an implementation design issue. This is often affected
> primarily by hardware and reliability issues."
>
> For Linux, the signal(7) manpage indicates that read(2), readv(2),
> write(2), writev(2), and ioctl(2) calls on "slow" devices should
> return -EINTR when interrupted by a signal, and goes on to say that
> "slow" devices are ones "where the I/O call may block for an
> indefinite time, for example, a terminal, pipe, or socket. (A disk is
> not a slow device according to this definition.)"

How about a network file system waiting for server failover (especially
if it is not automatic)?

> Nowhere does it say something really helpfully clear like "Writing to
> a regular file shall suspend the calling process until such time
> as..." But, I interpret this to mean that operations on regular files
> are not interruptible, and should not return -EINTR. Moreover, I
> understand that this is the consensus among those unlucky enough to care.
>
> On the other hand, there are some explicitly specified situations
> which will result in short writes to a regular file, like file size
> limits.

With NFS, "hard,intr" is the most sane configuration. For Lustre,
operations (should) become interruptible after the initial timeout
period has passed.

Kevin

On 07/08/2010 05:48 PM, Kevin Van Maren wrote:
> John Hammond wrote:
>> For Linux, the signal(7) manpage indicates that read(2), readv(2),
>> write(2), writev(2), and ioctl(2) calls on "slow" devices should
>> return -EINTR when interrupted by a signal, and goes on to say
>> that "slow" devices are ones "where the I/O call may block for an
>> indefinite time, for example, a terminal, pipe, or socket. (A disk
>> is not a slow device according to this definition.)"
>
> How about a network file system waiting for server failover
> (especially if it is not automatic)?

That's not indefinite. The FS is waiting for something which will
eventually occur. (Assuming it is correctly administered.)

>> Nowhere does it say something really helpfully clear like "Writing
>> to a regular file shall suspend the calling process until such
>> time as..." But, I interpret this to mean that operations on
>> regular files are not interruptible, and should not return -EINTR.
>> Moreover, I understand that this is the consensus among those
>> unlucky enough to care.
>>
>> On the other hand, there are some explicitly specified situations
>> which will result in short writes to a regular file, like file
>> size limits.
>
> With NFS, "hard,intr" is the most sane configuration.

Yes, because NFS is FTP but with less typing.

> For Lustre, operations (should) become interruptible after the
> initial timeout period has passed.

I disagree. Lustre is not NFS. The intended uses are big
noninteractive jobs. Who would interrupt them? Most likely
administrators who know that some server is hosed. Better to have a
long timeout, and once it passes, the operations return -EIO, the
clients try to reconnect, and maybe the FS can heal itself. Even if it
can't, I would argue that the FS is still easier to administer than
before, since you don't have to ssh out to every stuck node.

Also, why make the logic any harder? In the FS, isn't it much easier to
emulate a block device, leaving the process in D sleep when you have to?
And in the application it's one less thing to worry about. Plus the
behavior matches the block device/page cache mental model better.

--
John L. Hammond, Ph.D.
ICES, The University of Texas at Austin
jhammond at ices.utexas.edu
(512) 471-9304

Hello John,

please note that the Lustre client does not write data to a disk device;
rather, it sends and receives data through the network, in particular
through a socket in the case of ksocklnd. Even if the call may not be
considered slow with respect to indefiniteness, it is still slow compared
to "fast I/O" (local disk I/O).

I can recall several requests to add even more interruption points, made
by people who were looking for support for premature read/write
termination. (And, I believe, l_wait_event usage in 1.8 is such that once
a network request has been sent, the read/write system call cannot be
interrupted until the corresponding Lustre timeout has happened or a reply
has been received.)

Best wishes,
Andrew.

Hi Andrew,

> please note that the Lustre client does not write data to a disk
> device, rather it sends and receives data through network,

Yes, I've heard about that.

> particularly, through a socket in case of ksocklnd.

A socket isn't considered "slow" because of the speed of the network;
it's considered "slow" so that the application is allowed to handle
IPC with unresponsive peers.

> Even if the call may not be considered slow with respect to
> indefiniteness, it still is slow as compared to the "fast I/O" (local
> disk I/O). I can recall several requests of adding even more
> interruption points made by people who were looking for the support
> of premature read/write termination (and, I believe, l_wait_event
> usage in 1.8 is such that since a network request has been sent, the
> read/write system call cannot be interrupted until the corresponding
> lustre timeout has happened or a reply has been received)

Looking at 1.8.3, I see that l_wait_event() allows calls to be
interrupted under certain circumstances (if the timeout has expired, or
no timeout was specified, ...). But then only if the pending signal
belongs to LUSTRE_FATAL_SIGS: SIGKILL, SIGINT, SIGTERM, SIGQUIT, or
SIGALRM.

I guess the assumption is that all of these signals are fatal anyway,
and delivering them is useful to users who change their minds about
untarring the Encyclopedia Britannica, and then go complain that Lustre
breaks Ctrl-C. Fine. Aside from the fact that the latter four may not be
fatal, and that this may cause some unexpected breakage among
unsuspecting applications that handle these signals for purposes other
than process termination...whatever. I'm giving up on this point.

I also noticed that the signal mask handling in l_wait_event is slightly
defective. In the cases where l_wait_event would allow the call to be
interrupted, it sets the caller's mask to allow LUSTRE_FATAL_SIGS.
Consider the following sequence of events for a process P:

  P blocks SIGALRM.
  SIGALRM is sent to P.
  P calls open().
  RPC to MDS times out.
  l_wait_event unblocks LUSTRE_FATAL_SIGS.
  l_wait_event determines that SIGALRM is deliverable.
  l_wait_event restores the signal mask.
  l_wait_event returns -EINTR.
  open() returns -EINTR.

Thus open() is interrupted by the non-delivery of a blocked signal.
It's easy to reproduce, if somewhat obscure.

Best,
John

--
John L. Hammond, Ph.D.
ICES, The University of Texas at Austin
jhammond at ices.utexas.edu
(512) 471-9304

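The user-space half of that sequence (P blocks SIGALRM, SIGALRM is sent to P) can be sketched as below. This is only an illustration of the precondition, assuming a single-threaded process. The helper name is hypothetical, and the sketch does not touch Lustre or reproduce the open() behaviour itself. It shows that a blocked signal stays pending rather than being delivered, which is why an -EINTR from open() in that state is surprising.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>

/*
 * blocked_signal_stays_pending: hypothetical helper.
 * Blocks `signo`, sends it to the calling process, and checks that it is
 * reported as pending instead of being delivered.
 * Returns 1 if pending, 0 if not, -1 on error.
 */
int blocked_signal_stays_pending(int signo)
{
    sigset_t block, old, pending;
    int consumed = 0;
    int result = -1;

    sigemptyset(&block);
    sigaddset(&block, signo);

    /* "P blocks SIGALRM." */
    if (sigprocmask(SIG_BLOCK, &block, &old) != 0)
        return -1;

    /* "SIGALRM is sent to P." */
    if (raise(signo) == 0 && sigpending(&pending) == 0)
        result = sigismember(&pending, signo);

    /* Consume the pending signal so that restoring the mask is harmless. */
    if (result == 1)
        sigwait(&block, &consumed);

    sigprocmask(SIG_SETMASK, &old, NULL);
    return result;
}
```

To attempt the reproduction described above, one would presumably leave the signal blocked and pending and then call open() on a Lustre file while the MDS is unresponsive. That step depends on the Lustre setup and is not shown here.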
On 07/08/2010 04:51 PM, John Hammond wrote:
>> How about a network file system waiting for server failover
>> (especially if it is not automatic)?
>
> That's not indefinite. The FS is waiting for something which will
> eventually occur. (Assuming it's is correctly administered).

That IS indefinite. Indefinite just means that the limit is vague
and/or unknown, as opposed to having a clear and well-defined bound.

In the I/O context, any operation that is unbounded (indefinite) may take
a very long time in human terms, and therefore should be interruptible.
It is just not very reasonable to have a process stuck unkillable for
days. But on the other hand, we don't want it timing out either if the
data is valuable and we are willing to wait a day or two for hardware
repairs.

Even ignoring the fact that Lustre's behavior is allowed by the POSIX
spec, I believe that Lustre is doing the Right Thing.

If a job hangs on a write because servers are unavailable, it should
hang indefinitely until the server is restored, or until a human
interrupts the process with a signal.

Only a human can determine how valuable that job's output is. If the
human determines that the data is very important, and must finish, then
they leave the job hanging and go fix the servers. If they decide that
the data is easily reproducible and the compute cluster is better spent
running another job that doesn't require the down filesystem, then they
have the ability to abort the operation with a signal.

Now all that said, there may be an argument to be made that SIGSTOP and
SIGCONT should not be signals that interrupt Lustre client operations.

>>>> [ ... whether apps can rely on the kernel always returning a
>>>> full read or write count on file IO except at EOF on read ... ]

As has been remarked, the answer is NO.

BTW this question is not the same as the "interruptible" one. There is
a difference between the kernel being allowed to read(2) or write(2)
less than requested by the process, and those calls being apparently
atomic.

The kernel may always return a count of bytes read or written less
than requested for any reason whatever, even if a signal has not
interrupted the operation. The applications have to deal with it.

Most applications are written wrong (and in many other ways, e.g. how
many do check or even just '(void)' the return code from 'close' for
example, and never mind not calling 'flock' or 'fsync'), and as many
kernel writers say, "userspace sucks", but most application mistakes
matter only in infrequent cases, and when these happen users just
shrug.

The reasons why the semantics are like that have been explained very
clearly by Gabriel in his paper "Worse is better".

>>> How about a network file system waiting for server failover
>>> (especially if it is not automatic)?

>> That's not indefinite. The FS is waiting for something which
>> will eventually occur.

Here "indefinite" as to a wait duration is used in two rather
different ways. One is to say that it is "unknowable"; the other is
that it is "unknown at the moment and expected to be in some relevant
sense long".

There is a fundamental difference. If the outcome (success or failure)
of an operation may or may not become known, we have a completely
different class of models of computation from the usual Turing or
Church or Von Neumann one, with rather completely different properties
from the usual one. The halting problem does not exist, as all
computations must complete, but the outcome on completion can be
indeterminate, which is the opposite of the usual class of models of
computation.

Once upon a time I even wrote a paper (in a very obscure journal) on the
difference between the two classes of models of computation and why it
matters a lot.

In the distributed filesystem case one is trying to simulate one class
of model of computation on another, which is simply not possible in
the edge cases (those which matter). Attempting POSIX semantics in
that case requires a lot of effort and a considerable suspension of
disbelief.

> (Assuming it is correctly administered).

That's the key statement -- here the hidden assumption is that
"correctly administered" means that there is a central agency that
ensures that all operations have a known outcome if they complete.

If there is no central agency, all operations complete because they
eventually time out, but whether they succeeded or not is not always
knowable.

> That IS indefinite. Indefinite just means that the limit is
> vague and/or unknown, as opposed to having a clear and well
> defined bound.

Actually that applies only to the non-distributed case. In the
distributed case it means that the outcome may be absolutely
unknowable.

Suppose, for example, that you write a log entry to a file on a Lustre
file server, and the kernel code receives confirmation that the write
request has been sent, but then all communication with the file server
ceases. Has the log entry been written to the file server disk? Well,
how can you figure that out? No way (unless an admin looks at the file
server and thus restarts communications). That's "indefinite": when
whether the operation succeeded or failed cannot be known.

[ ... ]

> If a job hangs on a write because servers are unavailable, it
> should hang indefinitely until the server is restored, or
> until a human interrupts the process with a signal.

That is if one wants to preserve the illusion that a centralized class
model of computation is available when a distributed class one is the
reality. Then the human interruption is the point at which the
illusion goes away.

The better way to handle the distributed case is to design programs
knowingly for the class of distributed models of computation, which
requires completely different programming strategies, and of course
nearly everybody does not realize that (even if some programmers of
distributed systems with high reliability requirements rediscover them).

> Only a human can determine how valuable that job's output is.
> If the human determines that the data is very important, and
> must finish, then they leave the job hanging and go fix the
> servers. If they decide that the data is easily reproducible
> and the compute cluster is better spent running another job
> that doesn't require the down filesystem, then they have the
> ability to abort the operation with a signal. [ ... ]

Here, though, you are assuming that the underlying model of computation
is in the centralized class, that is, that the outcome (success or
failure) of an operation is knowable, even if only by the human
component of the system.

For Lustre systems this is usually a good assumption, as most Lustre
installations are centrally managed and in a single location, and
hardware and software state can and will be inspected to determine the
outcome of operations.

But people are now using Lustre across wide geographical networks and
over mobs of thousands (or tens of thousands) of clients and
servers, and in such cases it is usually not practical to assume that
the outcome of an operation is knowable, and eventually people will
learn that this means a completely different world. Or not, as the
'O_PONIES' story about 'fsync' and barriers demonstrates.

Hello John.

On 07/16/2010 01:21 AM, John Hammond wrote:
> A socket isn't considered "slow" because of the speed of the network,
> it's considered "slow" so that the application is allowed to handle
> IPC with unresponsive peers.

The Linux man page does not say that socket I/O is considered "slow" for
reasons other than the speed of the network. It is usually considered
slow because of possible slowness of the network, because of possibly
unresponsive peers, and for some other reasons. In any case, the
difference between local disk I/O and socket I/O is not that the latter
may last forever, since the socket interface uses the notion of a
timeout. Also, local disk I/O may take a very long time to complete if
the I/O subsystem is under pressure. The difference is subtle.

If possibly unresponsive peers indicate "slowness" of the I/O, then the
Lustre client _should_ be able to interrupt the I/O and is allowed to
perform short reads.

Best wishes,
Andrew.