I'm working with a new Lustre installation and I'm running into problems. I'm running iozone tests across multiple clients, and when I exceed more than 2, sometimes 3, clients, I start getting a lot of evictions from the server. An example of the error can be found at:

http://www.thedudeminds.com/lustre.txt

Right now I suspect it's possibly network related, although I can't really say this for sure based on the errors.

I'm launching iozone using c3 from a head node. Here are my options:

    cexec rlx-2-6:3-5 '/opt/iozone/bin/iozone -i 0 -i 1 -r 8k -s 4g -f /mnt/lustre/stripe_test/testfile.'\''$$'\'''

Any help is appreciated.

-jeremy
The errors are indicative of a client eviction by a server for lack of responsiveness, potentially due to a network error (though that's not indicated) or potentially due to a timeout within the Lustre DLM.

I suspect the intention was to incorporate the pid in the file name. In that case I think the command may not do what was intended. Replacing 'cexec' with 'ssh' and 'iozone' with 'touch', I can see on a system here that files are created with names like 'testfile.$$' (a literal '$$'). Thus on my system each client would be writing to the same file at the same location, relatively synchronously. Was this what was really intended, or perhaps cexec has different semantics?

- Tom

---
Tom Hancock, Hewlett Packard, Galway. +353-91-754765
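A minimal sketch of the quoting behavior Tom describes, assuming bash on both ends ("nodeA" is a hypothetical host; the path is from the thread):

    # The '\'' sequences re-quote $$ for the remote shell, so it is never
    # expanded -- every node touches the same literal name:
    ssh nodeA 'touch /mnt/lustre/stripe_test/testfile.'\''$$'\'''
    # -> creates testfile.$$ (a literal "$$")

    # Without the extra quoting, the remote shell expands its own pid,
    # giving a unique name per node:
    ssh nodeA 'touch /mnt/lustre/stripe_test/testfile.$$'
    # -> creates e.g. testfile.12345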
Jeremy,

I'm wondering if you were intending to gather system-wide throughput? If so, you may wish to let Iozone handle the launch of remote execution.

Iozone's -+m option is intended to be used in a cluster environment and provides stonewalling (thread-straggler elimination), unique file names, and aggregation of the results.

Example (a simple two-node system, "hostA" and "hostB", where each node has 1 Gig of RAM):

    iozone -r 64 -s 2g -t 2 -+m control_file

and the contents of control_file:

    hostA /mnt/lustre/stripe_test /opt/iozone/bin/iozone
    hostB /mnt/lustre/stripe_test /opt/iozone/bin/iozone

In this example Iozone will start two copies of itself, one instance on hostA and the other on hostB. Each instance will test with 64-Kbyte transfers, for a file size of 2 Gig. Since Iozone is coordinating the execution, the file names that Iozone uses for testing will be automatically generated and non-conflicting. These test files would be created in /mnt/lustre/stripe_test (from control_file above). Each test would be started at a barrier, and would finish when the first child is complete. Throughput would then be calculated and returned as results.

Note: Iozone also supports multiple execution with a single shared test file. You may also choose to use file locking, record locking, or no locking at all, in this mode.

Note: Iozone presents a very heavy workload. On my Lustre systems, in my lab at my house, I've also seen some evictions when running Iozone. This happens when I push the system (with Iozone) to the point where the nodes are not able to respond (due to high network loads, such as running Iozone on top of NFS, on top of Lustre, with large transfer sizes, on very large files, on a slow 100BT network). Once the congestion rises to a certain level, the nodes are unable to exchange keep-alive messages and evictions begin.
Once the load drops down, the nodes re-join, and life goes on. I believe there are some timeouts (in Lustre) that you can tweak to increase the amount of time that the nodes will wait for a keep-alive to come back. Adjusting this might reduce the evictions. It might also increase the time one would wait in the event of a real node failure... one of those trade-offs. Other things that I've noticed seem to help: more memory, a faster interconnect, more CPUs per node.

Hope that my observations are helpful...

Enjoy,
Don Capps
capps_at_iozone_dot_org
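A sketch of the tweak Don mentions, assuming a 1.4-era Lustre that exposes the global obd timeout under /proc (the exact path and default vary by version, so check your install):

    # Assumption: /proc/sys/lustre/timeout is present on this release.
    cat /proc/sys/lustre/timeout            # current RPC timeout, in seconds
    echo 300 > /proc/sys/lustre/timeout     # set on clients and servers; gives
                                            # busy nodes longer before eviction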
Perfect. I was unaware of this; I will try it asap. Although it seems that my invocation of iozone should be unrelated to the problems I've been seeing.

Thank you

-jeremy
To add to this: through a bit more testing, it seems I only get errors when running the test on a directory in which I've changed the stripe pattern to a stripe size of 4M using all available OSTs, which in this case is 30 OSTs. Running the same test using the default striping config seems to pass, but performance is very low. The default stripe config I believe is 1M and 1 OST per file.

-jeremy
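For reference, the striping change described above would look something like the following, assuming the 1.4-era positional lfs syntax (stripe size in bytes, stripe-start -1 = any OST, stripe-count -1 = all OSTs); later releases use option flags instead:

    # Assumes 1.4-era lfs; verify against your version's lfs help.
    lfs setstripe /mnt/lustre/stripe_test 4194304 -1 -1   # 4M stripes, all OSTs
    lfs getstripe /mnt/lustre/stripe_test                 # confirm the layout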
Cexec is from the c3 project. Basically you create a table of machines, and cexec will launch commands on those machines in parallel, using ssh. My intention is to run iozone tests across several clients in parallel using a different testfile on each client, hence the .$$ to make sure the files are different. So to answer your question, what I was doing was intentional.

As for network problems, this is definitely a possibility. We're in an environment where we do not control the switch we're connected to, which is changing very shortly.

Thanks

-jeremy
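Roughly what the cexec invocation amounts to, written as a plain ssh loop (node names here are hypothetical; c3 reads them from its cluster table and runs the commands concurrently):

    for node in node3 node4 node5; do
        ssh $node '/opt/iozone/bin/iozone -i 0 -i 1 -r 8k -s 4g \
            -f /mnt/lustre/stripe_test/testfile.$$' &
    done
    wait   # $$ expands in each remote shell, so every client gets its own file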
Don, thank you for the tips. Very useful. This is where I am at the moment.

The entire network is gigE, currently connected to a Cisco 6513. All OSSs are connected to a single blade in the 6513, and there are 30 OSSs. Using the default stripe of 1M and a single OSS per file, only 8 clients, and using your suggestion in this fashion:

    /opt/iozone/bin/iozone -i 0 -i 1 -r 8k -s 4g -t 8 -+m hostlist -w

What I see is the first set of results (which do not look that great):

    Children see throughput for 8 initial writers = 116711.56 KB/sec
    Min throughput per process = 14515.36 KB/sec
    Max throughput per process = 14627.10 KB/sec
    Avg throughput per process = 14588.95 KB/sec
    Min xfer = 4162560.00 KB

I would hope to see a throughput of at least 1G, and that's where it stops. The test fails to continue; it just sits. If I do this same test using a larger stripe size across all 30 OSSs, this is where I can easily get the errors to flood in and clients start getting evicted.

It feels like a networking issue, but I see no errors, dropped packets, etc. on the switch (although that doesn't convince me that the switch isn't a problem). We're replacing the Cisco with a Force10 E1200 in a few weeks, but I'm uneasy about not being able to prove definitively that it's a switch problem.

Using iozone's default -r and -s sizes, tests complete fine and give numbers more like what I'm expecting:

    /opt/iozone/bin/iozone -i 0 -i 1 -t 8 -+m hostlist -w

    Children see throughput for 8 initial writers = 1081793.05 KB/sec
    Min throughput per process = 150853.59 KB/sec
    Max throughput per process = 161713.44 KB/sec
    Avg throughput per process = 135224.13 KB/sec
    Min xfer = 512.00 KB

But this test isn't sufficient, since we're going to be working with mostly very large files, and I think the other tests, even if they max out the switch, should at least complete.

Thank you

-jeremy
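The hostlist file those commands read follows the control-file format Don gave earlier -- one line per client: hostname, working directory, path to the remote iozone binary. The client names below are hypothetical:

    for i in 1 2 3 4 5 6 7 8; do
        echo "client$i /mnt/lustre/stripe_test /opt/iozone/bin/iozone"
    done > hostlist
    /opt/iozone/bin/iozone -i 0 -i 1 -r 8k -s 4g -t 8 -+m hostlist -w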
On 11/21/05 3:54 AM, "Hancock, Tom" <Tom.Hancock@hp.com> wrote:

> Thus on my system each client would be writing to the same file at the
> same location, relatively synchronously. Was this what was really
> intended, or perhaps cexec has different semantics?

Tom, I read over your email again. The files are in fact getting the pid at the end; this I can verify. The paste of my command line may have been misleading. You do have to escape the $$'s to have the shell interpret things properly.

-jeremy