Hi, I have a problem with the scalable molecular dynamics software NAMD. It writes restart files once in a while, but sometimes the binary write crashes. When it crashes is not consistent. The only constant is that it happens when it writes to our Lustre file system. When it writes to anything else, it is fine. I can't find any errors in /var/log/messages. Has anyone had problems with NAMD? Someone seems to have had the same problem, but no solution was provided: http://www.ks.uiuc.edu/Research/namd/mailing_list/namd-l/12205.html

--
Richard Lefebvre, Sys-admin, RQCHP
"Don't Panic" -- THGTTG
RQCHP (rqchp.ca) --------------------- Calcul Canada (computecanada.org)
On 2010-07-22, at 14:59, Richard Lefebvre wrote:
> The only constant is that it happens when it writes to our Lustre file
> system. When it writes to anything else, it is fine. I can't find any
> errors in /var/log/messages.

Rarely has anyone complained about Lustre not providing error messages when there is a problem, so if there is nothing in /var/log/messages on either the client or the server, it is hard to know whether it is a Lustre problem or not...

If possible, you could try running the application under strace (limited to the I/O calls, or it would produce much too much data) to see which system call the error is coming from.

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.
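A concrete invocation along those lines might look like the following (the flag set and file names are illustrative, not from the thread, and a parallel launcher such as charmrun may spawn children of its own, hence -f):

    strace -f -tt -e trace=open,read,write,close,fsync -o namd.strace ./namd2 myconfig.namd

The -e trace= filter limits the output to file I/O system calls, and -f follows forked children so worker processes are captured too.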
Hi Richard,

If the cause of the I/O errors is Lustre, there will be some message in the logs. I am seeing a similar problem with some applications that run on our cluster. The symptoms are always the same: just before the application crashes with an I/O error, the node gets evicted with a message like this:

LustreError: 167-0: This client was evicted by ddn_data-OST000f; in progress operations using this service will fail.

The OSS that mounts the OST from the above message has the following line in its log:

LustreError: 0:0:(ldlm_lockd.c:305:waiting_locks_callback()) ### lock callback timer expired after 101s: evicting client at 10.143.5.9@tcp ns: filter-ddn_data-OST000f_UUID lock: ffff81021a84ba00/0x744b1dd4481e38b2 lrc: 3/0,0 mode: PR/PR res: 34959884/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x20 remote: 0x1d34b900a905375d expref: 9 pid: 1506 timeout 8374258376

Can you please check your logs for similar messages?

Best regards,
Wojciech
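A quick way to look for these on both the clients and the servers is to search the syslog directly, for example:

    grep LustreError /var/log/messages

(The log path is the usual default; adjust it for your syslog configuration.)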
We have the same problem when running NAMD on Lustre sometimes; the console log suggests a file lock expired, but I don't know why.
There is a similar thread on this mailing list:
http://groups.google.com/group/lustre-discuss-list/browse_thread/thread/afe24159554cd3ff/8b37bababf848123?lnk=gst&q=I%2FO+error+on+clients#

There is also an open bug that reports a similar problem:
https://bugzilla.lustre.org/show_bug.cgi?id=23190
There are many reasons a server might evict a client (a network error, or a ptlrpcd bug, for example), but in my experience the only time I see the I/O error is when running NAMD on a Lustre filesystem. I can see other "evict" events sometimes, but none of them results in an I/O error. So besides the client eviction, there may be something else causing the "I/O error".
If I had some Lustre error it would give me a clue, but the only error the users get is the following traceback from the application:

-------------------------------------------------------------------
Reason: FATAL ERROR: Error on write to binary file
restart/ABCD_les4.950000.vel: Interrupted system call

Fatal error on PE 0:
FATAL ERROR: Error on write to binary file
restart/ABCD_les4.950000.vel: Interrupted system call

FINISHED WRITING RESTART COORDINATES
WRITING VELOCITIES TO RESTART FILE AT STEP 950000
FATAL ERROR: Error on write to binary file
restart/ABCD_les4.950000.vel: Interrupted system call
[0] Stack Traceback:
  [0] CmiAbort+0x5f [0xabb81b]
  [1] _Z8NAMD_errPKc+0x9d [0x5099e9]
  [2] _ZN6Output17write_binary_fileEPciP6Vector+0xb0 [0x916d7e]
  [3] _ZN6Output25output_restart_velocitiesEiiP6Vector+0x249 [0x918d93]
  [4] _ZN6Output8velocityEiiP6Vector+0xdb [0x918a53]
  [5] _ZN24CkIndex_CollectionMaster40_call_receiveVelocities_CollectVectorMsgEPvP16CollectionMaster+0x16c [0x51c184]
  [6] CkDeliverMessageFree+0x21 [0xa2f899]
  [7] _Z15_processHandlerPvP11CkCoreState+0x50f [0xa2ef63]
  [8] CsdScheduleForever+0xa5 [0xabc631]
  [9] CsdScheduler+0x1c [0xabc232]
  [10] _ZN7BackEnd7suspendEv+0xb [0x511ddd]
  [11] _ZN9ScriptTcl7Tcl_runEPvP10Tcl_InterpiPPc+0x140 [0x971ddc]
  [12] TclInvokeStringCommand+0x91 [0xae0518]
  [13] /share/apps/namd/NAMD_2.7b2_Linux-x86_64-ibverbs/namd2 [0xb16368]
  [14] Tcl_EvalEx+0x176 [0xb169ab]
  [15] Tcl_EvalFile+0x134 [0xb0e3b4]
  [16] _ZN9ScriptTcl3runEPc+0x14 [0x9714da]
  [17] _Z18after_backend_initiPPc+0x22b [0x50da7b]
  [18] main+0x3a [0x50d81a]
  [19] __libc_start_main+0xf4 [0x398661d974]
  [20] _ZNSt8ios_base4InitD1Ev+0x4a [0x508bda]
-------------------------------------------------------------------

When I told the user to use a slower NFS file system instead, the problem did not occur. As someone else commented about file locking: the "flock" option was recently added to our Lustre mount (for another application), but NAMD still has problems.

Richard

--
Richard Lefebvre, Sys-admin, RQCHP, (514)343-6111 x5313
"Don't Panic" -- THGTTG
Richard.Lefebvre@rqchp.qc.ca
RQCHP (rqchp.ca) --------------------- Calcul Canada (computecanada.org)
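For reference, flock support is a client-side mount option; a typical mount looks roughly like this (the MGS NID, filesystem name, and mount point are placeholders):

    mount -t lustre mgsnode@tcp0:/fsname /mnt/lustre -o flock

If I recall correctly, without "flock" (or "localflock") a flock() call on a Lustre client fails with ENOSYS instead of taking the lock, so adding the option changes which locking code paths an application like NAMD exercises.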
Hi Larry,

From my experience, if the application is doing some I/O and the server evicts the node that the application is running on, this will definitely result in an EIO error being sent to the application, hence the input/output error message in the standard output of the application. In the case of my cluster the eviction always happens with particular applications, and this behaviour is very reproducible. I have checked the cluster network but it doesn't seem to have any congestion at the time of the eviction. This problem started after upgrading from 1.6.6 to 1.8.3. Currently, for the applications affected by this problem, we work around it by using the compute nodes' local disks, but this is not ideal and hopefully we will see some progress on this case soon.

Best regards,
Wojciech

--
Wojciech Turek
Assistant System Manager
On Fri, Jul 23, 2010 at 10:53:45AM -0700, Richard Lefebvre wrote:
> Reason: FATAL ERROR: Error on write to binary file
> restart/ABCD_les4.950000.vel: Interrupted system call

Interrupted system call is not a fatal error. The application should retry the operation if it fails with EINTR.

Ned
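The standard POSIX idiom for that is a retry loop around write(). A minimal sketch in C (illustrative only, not NAMD's actual write path):

    /* write_all: keep writing until all bytes are out, retrying when
     * the call is interrupted by a signal (EINTR) and continuing after
     * short writes.  Returns count on success, -1 on a real error. */
    #include <errno.h>
    #include <unistd.h>

    ssize_t write_all(int fd, const void *buf, size_t count)
    {
        const char *p = buf;
        size_t left = count;

        while (left > 0) {
            ssize_t n = write(fd, p, left);
            if (n < 0) {
                if (errno == EINTR)
                    continue;      /* interrupted: retry the write */
                return -1;         /* genuine I/O error */
            }
            p += n;                /* short write: advance and go on */
            left -= n;
        }
        return (ssize_t)count;
    }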
On 2010-07-23, at 11:53, Richard Lefebvre wrote:
> FATAL ERROR: Error on write to binary file
> restart/ABCD_les4.950000.vel: Interrupted system call

There was a bug just filed on EINTR and flock. I don't have the number, but a quick search should find it. No patch as yet, but it would be worthwhile to subscribe to it for updates.

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.
> There was a bug just filed on EINTR and flock. I don't have the number,
> but a quick search should find it.

Bug 23372: https://bugzilla.lustre.org/show_bug.cgi?id=23372
On 07/23/2010 06:39 PM, Rick Grubin wrote:
> Bug 23372: https://bugzilla.lustre.org/show_bug.cgi?id=23372

In the scenario of bug 23372, -EINTR is not returned, although it should be. However, there are several places where -EINTR is used (internally) by the ptlrpc layer. And, IIRC, there are also some places where ptlrpc return codes are (inappropriately) used as the return codes for file operations. So it may be the case that -EINTR is generated in ptlrpc because of an eviction or some other mishap and returned by write(), even though no signal was actually delivered during the call.

--
John L. Hammond, Ph.D.
ICES, The University of Texas at Austin
jhammond@ices.utexas.edu  (512) 471-9304
I recently had a user reporting this same issue (NAMD 2.6 and Lustre 1.8.1). Thanks for the link to bug 23372.

Mike