Hi, I have a problem with the scalable molecular dynamics software NAMD. It writes restart files once in a while, but sometimes the binary write crashes. When it crashes is not consistent. The only constant is that it happens when it writes to our Lustre file system. When it writes to anything else, it is fine. I can't find any errors in /var/log/messages. Has anyone had problems with NAMD? Someone seems to have had the same problem, but no solution was provided: http://www.ks.uiuc.edu/Research/namd/mailing_list/namd-l/12205.html

--
Richard Lefebvre, Sys-admin, RQCHP
"Don't Panic" -- THGTTG
RQCHP (rqchp.ca) --------------------- Calcul Canada (computecanada.org)
On 2010-07-22, at 14:59, Richard Lefebvre wrote:
> The only constant is that it happens when it writes to our Lustre file
> system. When it writes to anything else, it is fine. I can't find any
> errors in /var/log/messages.

Rarely has anyone complained about Lustre not providing error messages when there is a problem, so if there is nothing in /var/log/messages on either the client or the server, it is hard to know whether it is a Lustre problem or not...

If possible, you could try running the application under strace (limited to the I/O calls, or it would produce much too much data) to see which system call the error is coming from.

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.
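A concrete invocation along those lines might look like the following (the flag set and file names are illustrative, not from the thread, and a parallel launcher such as charmrun may spawn children of its own, hence -f):

    strace -f -tt -e trace=open,read,write,close,fsync -o namd.strace ./namd2 myconfig.namd

The -e trace= filter limits the output to file I/O system calls, and -f follows forked children so worker processes are captured too.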
Hi Richard,

If the cause of the I/O errors is Lustre, there will be some message in the logs. I am seeing a similar problem with some applications that run on our cluster. The symptoms are always the same: just before the application crashes with an I/O error, the node gets evicted with a message like this:

LustreError: 167-0: This client was evicted by ddn_data-OST000f; in progress operations using this service will fail.

The OSS that mounts the OST from the above message has the following line in its log:

LustreError: 0:0:(ldlm_lockd.c:305:waiting_locks_callback()) ### lock callback timer expired after 101s: evicting client at 10.143.5.9@tcp ns: filter-ddn_data-OST000f_UUID lock: ffff81021a84ba00/0x744b1dd4481e38b2 lrc: 3/0,0 mode: PR/PR res: 34959884/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x20 remote: 0x1d34b900a905375d expref: 9 pid: 1506 timeout 8374258376

Can you please check your logs for similar messages?

Best regards,
Wojciech
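A quick way to look for these on both the clients and the servers is to search the syslog directly, for example:

    grep LustreError /var/log/messages

(The log path is the usual default; adjust it for your syslog configuration.)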
We have the same problem when running NAMD on Lustre sometimes; the console log suggests a file lock expired, but I don't know why.
There is a similar thread on this mailing list:
http://groups.google.com/group/lustre-discuss-list/browse_thread/thread/afe24159554cd3ff/8b37bababf848123?lnk=gst&q=I%2FO+error+on+clients#

There is also an open bug that reports a similar problem:
https://bugzilla.lustre.org/show_bug.cgi?id=23190
There are many reasons a server might evict a client (a network error, or a ptlrpcd bug, for example), but in my experience the only time I see the I/O error is when running NAMD on a Lustre filesystem. I can see other "evict" events sometimes, but none of them results in an I/O error. So besides the client eviction, there may be something else causing the "I/O error".
If I had some Lustre error it would give me a clue, but the only error the users get is the following traceback from the application:

-------------------------------------------------------------------
Reason: FATAL ERROR: Error on write to binary file
restart/ABCD_les4.950000.vel: Interrupted system call

Fatal error on PE 0:
FATAL ERROR: Error on write to binary file
restart/ABCD_les4.950000.vel: Interrupted system call

FINISHED WRITING RESTART COORDINATES
WRITING VELOCITIES TO RESTART FILE AT STEP 950000
FATAL ERROR: Error on write to binary file
restart/ABCD_les4.950000.vel: Interrupted system call
[0] Stack Traceback:
  [0] CmiAbort+0x5f [0xabb81b]
  [1] _Z8NAMD_errPKc+0x9d [0x5099e9]
  [2] _ZN6Output17write_binary_fileEPciP6Vector+0xb0 [0x916d7e]
  [3] _ZN6Output25output_restart_velocitiesEiiP6Vector+0x249 [0x918d93]
  [4] _ZN6Output8velocityEiiP6Vector+0xdb [0x918a53]
  [5] _ZN24CkIndex_CollectionMaster40_call_receiveVelocities_CollectVectorMsgEPvP16CollectionMaster+0x16c [0x51c184]
  [6] CkDeliverMessageFree+0x21 [0xa2f899]
  [7] _Z15_processHandlerPvP11CkCoreState+0x50f [0xa2ef63]
  [8] CsdScheduleForever+0xa5 [0xabc631]
  [9] CsdScheduler+0x1c [0xabc232]
  [10] _ZN7BackEnd7suspendEv+0xb [0x511ddd]
  [11] _ZN9ScriptTcl7Tcl_runEPvP10Tcl_InterpiPPc+0x140 [0x971ddc]
  [12] TclInvokeStringCommand+0x91 [0xae0518]
  [13] /share/apps/namd/NAMD_2.7b2_Linux-x86_64-ibverbs/namd2 [0xb16368]
  [14] Tcl_EvalEx+0x176 [0xb169ab]
  [15] Tcl_EvalFile+0x134 [0xb0e3b4]
  [16] _ZN9ScriptTcl3runEPc+0x14 [0x9714da]
  [17] _Z18after_backend_initiPPc+0x22b [0x50da7b]
  [18] main+0x3a [0x50d81a]
  [19] __libc_start_main+0xf4 [0x398661d974]
  [20] _ZNSt8ios_base4InitD1Ev+0x4a [0x508bda]
-------------------------------------------------------------------

When I told the user to use a slower NFS file system instead, the problem did not occur. As someone else commented about file locking: the "flock" option was recently added to our Lustre mount (for another application), but NAMD still has problems.

Richard

--
Richard Lefebvre, Sys-admin, RQCHP, (514)343-6111 x5313
"Don't Panic" -- THGTTG
Richard.Lefebvre@rqchp.qc.ca
RQCHP (rqchp.ca) --------------------- Calcul Canada (computecanada.org)
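For reference, flock support is a client-side mount option; a typical mount looks roughly like this (the MGS NID, filesystem name, and mount point are placeholders):

    mount -t lustre mgsnode@tcp0:/fsname /mnt/lustre -o flock

If I recall correctly, without "flock" (or "localflock") a flock() call on a Lustre client fails with ENOSYS instead of taking the lock, so adding the option changes which locking code paths an application like NAMD exercises.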
Hi Larry,

From my experience, if the application is doing some I/O and the server evicts the node that the application is running on, this will definitely result in an EIO error being sent to the application, hence the input/output error message in the standard output of the application. In the case of my cluster the eviction always happens with particular applications, and this behaviour is very reproducible. I have checked the cluster network but it doesn't seem to have any congestion at the time of the eviction. This problem started after upgrading from 1.6.6 to 1.8.3. Currently, for the applications affected by this problem, we work around it by using the compute nodes' local disks, but this is not ideal and hopefully we will see some progress on this case soon.

Best regards,
Wojciech

--
Wojciech Turek
Assistant System Manager
On Fri, Jul 23, 2010 at 10:53:45AM -0700, Richard Lefebvre wrote:
> Reason: FATAL ERROR: Error on write to binary file
> restart/ABCD_les4.950000.vel: Interrupted system call

Interrupted system call is not a fatal error. The application should retry the operation if it fails with EINTR.

Ned
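The standard POSIX idiom for that is a retry loop around write(). A minimal sketch in C (illustrative only, not NAMD's actual write path):

    /* write_all: keep writing until all bytes are out, retrying when
     * the call is interrupted by a signal (EINTR) and continuing after
     * short writes.  Returns count on success, -1 on a real error. */
    #include <errno.h>
    #include <unistd.h>

    ssize_t write_all(int fd, const void *buf, size_t count)
    {
        const char *p = buf;
        size_t left = count;

        while (left > 0) {
            ssize_t n = write(fd, p, left);
            if (n < 0) {
                if (errno == EINTR)
                    continue;      /* interrupted: retry the write */
                return -1;         /* genuine I/O error */
            }
            p += n;                /* short write: advance and go on */
            left -= n;
        }
        return (ssize_t)count;
    }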
On 2010-07-23, at 11:53, Richard Lefebvre wrote:
> FATAL ERROR: Error on write to binary file
> restart/ABCD_les4.950000.vel: Interrupted system call

There was a bug just filed on EINTR and flock. I don't have the number, but a quick search should find it. No patch as yet, but it would be worthwhile to subscribe to it for updates.

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.
> There was a bug just filed on EINTR and flock. I don't have the number,
> but a quick search should find it.

Bug 23372: https://bugzilla.lustre.org/show_bug.cgi?id=23372
On 07/23/2010 06:39 PM, Rick Grubin wrote:
> Bug 23372: https://bugzilla.lustre.org/show_bug.cgi?id=23372

In the scenario of bug 23372, -EINTR is not returned, although it should be. However, there are several places where -EINTR is used (internally) by the ptlrpc layer. And, IIRC, there are also some places where ptlrpc return codes are (inappropriately) used as the return codes for file operations. So it may be the case that -EINTR is generated in ptlrpc because of an eviction or some other mishap and returned by write(), even though no signal was actually delivered during the call.

--
John L. Hammond, Ph.D.
ICES, The University of Texas at Austin
jhammond@ices.utexas.edu  (512) 471-9304
I recently had a user reporting this same issue (NAMD 2.6 and Lustre 1.8.1). Thanks for the link to bug 23372.

Mike