Jacques-Charles Lafoucriere
2003-Dec-17 12:46 UTC
[lustre-devel] Tests with lustre 1.0.0/1.0.1
Hello,

I am using the following cluster:
- 100 UP P4 clients, 1 GB mem
- 6 BP P4 OST nodes, 2 GB mem (each OST node has 3 partitions on 3 SCSI disks => 18 OSTs)
- 1 BP MDS node, 2 GB mem
- gigabit ethernet as interconnect

The base kernel is 2.4.20-19.9.

I have installed Lustre 1.0.0 and Lustre 1.0.1 (with portals debug off) with kernel patch levels 29 and 32 (1.0.0 tested with 29 and 32, 1.0.1 only with 32).

I am running the following test: each client writes a 900 MB private file in a common directory.

With both releases I get I/O errors on some files (2 or 3, not always the same ones).

I see the following messages on the OSTs:

Dec 17 13:13:42 cors123 kernel: LustreError: 1979:(niobuf.c:369:ptlrpc_abort_bulk()) Unexpectedly long timeout: desc f737d800
Dec 17 13:13:46 cors123 kernel: LustreError: 1980:(niobuf.c:369:ptlrpc_abort_bulk()) Unexpectedly long timeout: desc ed73a000
Dec 17 13:13:52 cors123 kernel: LustreError: 1979:(niobuf.c:369:ptlrpc_abort_bulk()) Unexpectedly long timeout: desc f737d800
Dec 17 13:13:56 cors123 kernel: LustreError: 1980:(niobuf.c:369:ptlrpc_abort_bulk()) Unexpectedly long timeout: desc ed73a000
Dec 17 13:14:02 cors123 kernel: LustreError: 1979:(niobuf.c:369:ptlrpc_abort_bulk()) Unexpectedly long timeout: desc f737d800
Dec 17 13:14:06 cors123 kernel: LustreError: 1980:(niobuf.c:369:ptlrpc_abort_bulk()) Unexpectedly long timeout: desc ed73a000
Dec 17 13:14:12 cors123 kernel: LustreError: 1979:(niobuf.c:369:ptlrpc_abort_bulk()) Unexpectedly long timeout: desc f737d800
Dec 17 13:14:16 cors123 kernel: LustreError: 1980:(niobuf.c:369:ptlrpc_abort_bulk()) Unexpectedly long timeout: desc ed73a000

A few days ago, I ran the same test with a b_devel snapshot from November and patch level 29 (the latest with that release) and had no issues (it was very stable).

Any ideas?

JCL
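For reference, the per-client workload in this test amounts to something like the C sketch below. The mount point /mnt/lustre/test, the hostname-based file naming, and the 1 MB chunk size are illustrative assumptions, not the actual test code:

/* Minimal sketch of the test described above: each client writes a
 * 900 MB private file into a shared Lustre directory.  Path and
 * chunk size are illustrative assumptions. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define CHUNK (1 << 20)              /* write in 1 MB chunks */
#define TOTAL (900L * (1 << 20))     /* 900 MB per client */

int main(void)
{
    char path[256], host[64];
    static char buf[CHUNK];
    long written = 0;
    int fd;

    gethostname(host, sizeof(host));
    snprintf(path, sizeof(path), "/mnt/lustre/test/%s.dat", host);

    fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    memset(buf, 0xAA, sizeof(buf));
    while (written < TOTAL) {
        ssize_t rc = write(fd, buf, CHUNK);
        if (rc < 0) {               /* this is where the reported
                                     * I/O errors would surface */
            perror("write");
            close(fd);
            return 1;
        }
        written += rc;
    }
    close(fd);
    return 0;
}

One instance per client, all pointed at the same directory, reproduces the N-clients/N-files pattern discussed later in the thread.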
I have gotten some errors recently that were also in niobuf.c (see
https://bugzilla.lustre.org/show_bug.cgi?id=2363). Both of our errors are
triggered by PTL_MD_INUSE being the return value of PtlMDUnlink and
l_wait_event returning -ETIMEDOUT. I'm not sure how our bugs are related
(they may not be), but I thought I would point it out. When I get these
errors, occasionally the system becomes unusable; are you able to
eventually continue with reads and writes?

-Blake

Whatchu talkin' 'bout, Willis?

--
Blake Matheny                                      bmatheny@purdue.edu
The things that should be available to everyone,   http://www.mkfifo.net
are the things that impact everyone.               http://ovmj.org/GNUnet/
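The code path Blake points at has roughly this shape: ptlrpc_abort_bulk() tries to unlink a bulk memory descriptor and, if the network still holds it, waits with a timeout for it to be released. The self-contained pthread analogy below mimics only that shape; it is emphatically not the Lustre/Portals code, and md_unlink()/wait_for_idle() are stand-ins invented for illustration:

/* Self-contained analogy of the failure shape in ptlrpc_abort_bulk():
 * an unlink that fails while the descriptor is "in use", then a timed
 * wait that expires with -ETIMEDOUT.  NOT the Lustre/Portals code. */
#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <time.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  idle = PTHREAD_COND_INITIALIZER;
static int in_use = 1;                  /* stands in for PTL_MD_INUSE */

/* stand-in for PtlMDUnlink(): refuses while the "bulk" is in flight */
static int md_unlink(void)
{
    int busy;

    pthread_mutex_lock(&lock);
    busy = in_use;
    pthread_mutex_unlock(&lock);
    return busy ? -EBUSY : 0;
}

/* stand-in for l_wait_event() with a timeout */
static int wait_for_idle(int seconds)
{
    struct timespec deadline;
    int rc = 0;

    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += seconds;

    pthread_mutex_lock(&lock);
    while (in_use && rc == 0)
        rc = pthread_cond_timedwait(&idle, &lock, &deadline);
    pthread_mutex_unlock(&lock);
    return rc == ETIMEDOUT ? -ETIMEDOUT : 0;
}

int main(void)
{
    if (md_unlink() != 0 && wait_for_idle(10) == -ETIMEDOUT)
        /* nothing ever completes the transfer here, so the wait times
         * out -- the analogue of the repeating console messages above */
        fprintf(stderr, "Unexpectedly long timeout\n");
    return 0;
}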
Jacques-Charles Lafoucriere
2003-Dec-18 12:53 UTC
[lustre-devel] Tests with lustre 1.0.0/1.0.1
The client on which the error occurs has to be rebooted (sometimes rlogin
is no longer possible on the node). The other nodes are OK and can
continue to work with Lustre (ls -l works, read/write also).

I have also run the tests with 10 nodes (instead of 100, but still with
100 clients configured) and I see the same issue. Maybe I have some kind
of saturation of my ethernet network. With the previous release the
per-client performance was lower; with 1.0.x the stress on the OST nodes
may be higher. I'll do some tuning of the TCP memory buffers on the nodes
to make sure there are many more buffers on the OSTs than on the clients.

JC
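The tuning JC mentions is typically done by raising the kernel's TCP buffer limits through /proc. A minimal sketch follows, with purely illustrative values rather than recommendations (on the OSTs one would presumably set larger limits than on the clients):

/* Sketch of raising the TCP buffer limits via /proc, per JC's plan.
 * Values are illustrative; the paths are the standard Linux 2.4
 * TCP tunables.  Requires root. */
#include <stdio.h>

static int set_sysctl(const char *path, const char *value)
{
    FILE *f = fopen(path, "w");

    if (!f) {
        perror(path);
        return -1;
    }
    fprintf(f, "%s\n", value);
    return fclose(f);
}

int main(void)
{
    /* per-socket TCP buffers: min / default / max, in bytes */
    set_sysctl("/proc/sys/net/ipv4/tcp_rmem", "4096 262144 1048576");
    set_sysctl("/proc/sys/net/ipv4/tcp_wmem", "4096 262144 1048576");
    /* global caps honoured by setsockopt(SO_RCVBUF / SO_SNDBUF) */
    set_sysctl("/proc/sys/net/core/rmem_max", "1048576");
    set_sysctl("/proc/sys/net/core/wmem_max", "1048576");
    return 0;
}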
Jacques-Charles Lafoucriere wrote:
> The client on which the error occurs has to be rebooted (sometimes
> rlogin is no longer possible on the node). The other nodes are OK and
> can continue to work with Lustre (ls -l works, read/write also).

Same thing here.

> I have also run the tests with 10 nodes (instead of 100, but still
> with 100 clients configured) and I see the same issue. Maybe I have
> some kind of saturation of my ethernet network. With the previous
> release the per-client performance was lower; with 1.0.x the stress on
> the OST nodes may be higher. I'll do some tuning of the TCP memory
> buffers on the nodes to make sure there are many more buffers on the
> OSTs than on the clients.

I doubt it's saturation. I have a test cluster with only 6 machines
(1 MDS, 4 OSTs, 1 client, LOV across all OSTs) on a 100Mb switch, and I
really doubt it's being saturated. Although, I do seem to get this error
mostly when I'm testing writes of large files.

-Blake

--
Blake Matheny                                      bmatheny@purdue.edu
The things that should be available to everyone,   http://www.mkfifo.net
are the things that impact everyone.               http://ovmj.org/GNUnet/
Hello Jacques--

Jacques-Charles Lafoucriere wrote:
> I have also run the tests with 10 nodes (instead of 100, but still
> with 100 clients configured) and I see the same issue. Maybe I have
> some kind of saturation of my ethernet network. With the previous
> release the per-client performance was lower; with 1.0.x the stress on
> the OST nodes may be higher. I'll do some tuning of the TCP memory
> buffers on the nodes to make sure there are many more buffers on the
> OSTs than on the clients.

I agree with Blake that the network is probably not saturated as such,
but something is causing a node not to respond in a timely fashion,
which leads to timeouts and errors. Certainly the test that you're
running is something which our production installations do every day.

TCP memory buffers will not normally cause something catastrophic, in my
experience. A shortage of Lustre request buffers would, but with 10
clients, I think it is very unlikely that you are running out of Lustre
request buffers.

>>> I have the following messages on the OSTs:
>>> Dec 17 13:13:42 cors123 kernel: LustreError:
>>> 1979:(niobuf.c:369:ptlrpc_abort_bulk()) Unexpectedly long timeout:
>>> desc f737d800

Can you send me the messages that you see on the OSTs and the client
before this one occurs? By the time this message arrives, things have
already been unhappy for a little while.

Also, you mentioned that you can reproduce this with just 10 clients.
Can you perhaps reproduce it with just one? What is your default file
striping pattern? If you run "vmstat" on one or two OSTs while this
test is running, what is the average disk I/O (bo) rate?

It may be that your configuration requires some tuning away from the
default cache parameters, although with only 10 clients something else
would probably also have to be wrong.

>>> A few days ago, I ran the same test with a b_devel snapshot from
>>> November and patch level 29 (the latest with that release) and had
>>> no issues (it was very stable)

That is very interesting, too. Do you happen to remember which snapshot
number this was?

Thanks for your patience and feedback--

-Phil
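A small helper in the spirit of the vmstat suggestion: sample vmstat for about a minute and average the disk blocks-out ("bo") column. The 5-second interval and sample count are arbitrary choices; the column index is taken from vmstat's own header line rather than assumed:

/* Average the "bo" column of vmstat output.  The column position is
 * located from vmstat's header line, so no fixed layout is assumed. */
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    FILE *p = popen("vmstat 5 13", "r");   /* 13 reports; the first is
                                            * the since-boot average */
    char line[512], copy[512], *tok;
    int i, col = -1;
    long sum = 0, samples = 0;

    if (!p) {
        perror("popen");
        return 1;
    }
    while (fgets(line, sizeof(line), p)) {
        strcpy(copy, line);
        if (col < 0) {
            /* header pass: find which field is named "bo" */
            for (i = 0, tok = strtok(copy, " \t\n"); tok;
                 tok = strtok(NULL, " \t\n"), i++)
                if (strcmp(tok, "bo") == 0)
                    col = i;
            continue;
        }
        /* data pass: walk to field 'col', accumulate if numeric */
        for (i = 0, tok = strtok(copy, " \t\n"); tok && i < col;
             tok = strtok(NULL, " \t\n"), i++)
            ;
        if (tok && isdigit((unsigned char)*tok)) {
            sum += atol(tok);
            samples++;
        }
    }
    pclose(p);
    if (samples > 0)
        printf("average bo: %ld blocks/s over %ld samples\n",
               sum / samples, samples);
    return 0;
}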
Jacques-Charles Lafoucriere
2003-Dec-18 16:46 UTC
[lustre-devel] Tests with lustre 1.0.0/1.0.1
Phil Schwan wrote:
> Can you send me the messages that you see on the OSTs and the client
> before this one occurs? By the time this message arrives, things have
> already been unhappy for a little while.

On the OST, there are plenty of lines of the same type and a few lines
like:

Dec 17 13:13:33 cors123 kernel: Lustre: 1732:(socknal_cb.c:1534:ksocknal_process_receive()) [d8856800] EOF from 0xc0a8011b ip c0a8011b:32805

and also:

Dec 17 12:56:42 cors123 kernel: LustreError: 1999:(ost_handler.c:730:ost_brw_write()) bulk IO comms error: evicting c798f_lov1_48c0260155@ nid 0xc0a80161 (0:192.168.1.97)
Dec 17 12:56:42 cors123 kernel: LustreError: 1968:(ost_handler.c:730:ost_brw_write()) bulk IO comms error: evicting c798f_lov1_48c0260155@ nid 0xc0a80161 (0:192.168.1.97)
Dec 17 12:56:42 cors123 kernel: LustreError: 1983:(ost_handler.c:730:ost_brw_write()) bulk IO comms error: evicting c798f_lov1_48c0260155@ nid 0xc0a80161 (0:192.168.1.97)
Dec 17 12:56:42 cors123 kernel: LustreError: 1978:(ost_handler.c:730:ost_brw_write()) bulk IO comms error: evicting c798f_lov1_48c0260155@ nid 0xc0a80161 (0:192.168.1.97)
Dec 17 12:56:42 cors123 kernel: LustreError: 1971:(ost_handler.c:730:ost_brw_write()) bulk IO comms error: evicting c798f_lov1_48c0260155@ nid 0xc0a80161 (0:192.168.1.97)
Dec 17 12:56:46 cors123 kernel: LustreError: 1980:(niobuf.c:369:ptlrpc_abort_bulk()) Unexpectedly long timeout: desc ed73a000
Dec 17 12:56:52 cors123 kernel: LustreError: 1979:(niobuf.c:369:ptlrpc_abort_bulk()) Unexpectedly long timeout: desc f737d800
Dec 17 12:56:56 cors123 kernel: LustreError: 1980:(niobuf.c:369:ptlrpc_abort_bulk()) Unexpectedly long timeout: desc ed73a000
Dec 17 12:57:02 cors123 kernel: LustreError: 1979:(niobuf.c:369:ptlrpc_abort_bulk()) Unexpectedly long timeout: desc f737d800
Dec 17 12:57:06 cors123 kernel: LustreError: 1980:(niobuf.c:369:ptlrpc_abort_bulk()) Unexpectedly long timeout: desc ed73a000
Dec 17 12:57:08 cors123 kernel: Lustre: 1732:(socknal_cb.c:1534:ksocknal_process_receive()) [d88b3800] EOF from 0xc0a80161 ip c0a80161:32788
Dec 17 12:57:08 cors123 kernel: LustreError: 1740:(socknal.c:1004:ksocknal_destroy_conn()) Refusing to complete a partial receive from 0xc0a80161, ip c0a80161
Dec 17 12:57:08 cors123 kernel: LustreError: 1740:(socknal.c:1006:ksocknal_destroy_conn()) This may hang communications and prevent modules from unloading
Dec 17 12:57:12 cors123 kernel: LustreError: 1979:(niobuf.c:369:ptlrpc_abort_bulk()) Unexpectedly long timeout: desc f737d800

I can send you the full listings, but they are big.

On the client (I am not sure it is the right one, but it is the only one
which has errors at 13:13):

Dec 17 13:13:17 cors071 kernel: LustreError: 1717:(socknal_cb.c:2491:ksocknal_find_timed_out_conn()) Timed out TX to 0xc0a8017e 4425024 f75e3000
Dec 17 13:41:37 cors071 kernel: LustreError: 1717:(socknal_cb.c:2522:ksocknal_check_peer_timeouts()) Timeout out conn->0xc0a8017e ip c0a8017e:988
Dec 17 13:41:37 cors071 kernel: LustreError: 1742:(client.c:810:ptlrpc_expire_one_request()) @@@ timeout req@ed252000 x6558/t0 o4->OST_cors126_2_UUID@NID_192.168.1.126_UUID:6 lens 288/248 ref 1 fl RPC:N/0/0 rc 0
Dec 17 13:41:37 cors071 kernel: LustreError: 1742:(recover.c:100:ptlrpc_run_failed_import_upcall()) Error invoking recovery upcall /usr/lib/lustre/lustre_upcall FAILED_IMPORT OST_cors126_2_UUID OSC_cors071.ocre.cea.fr_OST_cors126_2_MNT_cors071 NID_192.168.1.126_UUID: -2; check /proc/sys/lustre/lustre_upcall

> Also, you mentioned that you can reproduce this with just 10 clients.
> Can you perhaps reproduce it with just one? What is your default file
> striping pattern?

With one client it works well (and faster than with the previous
release). The default striping is over all OSTs => 18.

> That is very interesting, too. Do you happen to remember which
> snapshot number this was?

I downloaded it on November 8th at 23:20 (French time). I used it with
up to 100 clients writing to 6 and 12 OSTs with many different stripe
sizes.
Hello Jacques--

Jacques-Charles Lafoucriere wrote:
> On the client (I am not sure it is the right one, but it is the only
> one which has errors at 13:13):
> Dec 17 13:13:17 cors071 kernel: LustreError:
> 1717:(socknal_cb.c:2491:ksocknal_find_timed_out_conn()) Timed out TX
> to 0xc0a8017e 4425024 f75e3000

I have one idea before we go to the next step: are you using the RPMs
from the web site, or did you compile Lustre yourself? If you compiled
your own Lustre, did you run configure with --enable-zerocopy?

Recent testing is showing us that the memory allocations in the
non-zero-copy TCP path may be doing significant harm, because they can
block unpredictably. Lustre 1.0.2 will default to zero-copy, but 1.0.1
did not.

If you are already using zero-copy, then we will give you some
instructions for gathering more detailed log data. This message points
us in the right direction, I think.

>> What is your default file striping pattern?
>
> The default striping is over all OSTs => 18.

Although this is not the source of your problem, with these kinds of
applications (N clients writing N files), we have found that one stripe
per file is usually best. 100 clients are many more than are needed to
saturate 18 OSTs, and more stripes just cause unnecessary contention.
For the other common case (N clients writing 1 file), it is indeed best
to stripe over all OSTs for maximum performance.

Unfortunately, it is not yet possible to change your default stripe
policy on a per-directory or per-environment basis, so you would have to
create the files in advance with a script, or change your application to
call an ioctl. Neither is a great answer.

Best--

-Phil
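As a rough illustration of the ioctl route Phil mentions: the sketch below creates a file with an explicit one-stripe layout. The names used (lustre/lustre_user.h, struct lov_user_md, O_LOV_DELAY_CREATE, LL_IOC_LOV_SETSTRIPE) follow later Lustre releases and are assumed here; the exact 1.0-era API may differ:

/* Sketch: create a file with an explicit stripe layout via ioctl.
 * Header, struct, and flag names follow later Lustre releases; the
 * 1.0-era equivalents may differ, so treat this as illustrative. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <lustre/lustre_user.h>

int create_one_stripe(const char *path)
{
    struct lov_user_md lum;
    int fd;

    memset(&lum, 0, sizeof(lum));
    lum.lmm_magic         = LOV_USER_MAGIC;
    lum.lmm_stripe_size   = 1 << 20;  /* 1 MB stripe size (arbitrary) */
    lum.lmm_stripe_count  = 1;        /* one stripe per file, per Phil */
    lum.lmm_stripe_offset = -1;       /* let the MDS pick the OST */

    /* the layout must be set before any objects are allocated, hence
     * O_LOV_DELAY_CREATE on a brand-new file */
    fd = open(path, O_CREAT | O_EXCL | O_WRONLY | O_LOV_DELAY_CREATE, 0644);
    if (fd < 0) {
        perror("open");
        return -1;
    }
    if (ioctl(fd, LL_IOC_LOV_SETSTRIPE, &lum) < 0) {
        perror("LL_IOC_LOV_SETSTRIPE");
        close(fd);
        unlink(path);
        return -1;
    }
    return fd;                        /* caller writes, then closes */
}

int main(int argc, char **argv)
{
    int fd = create_one_stripe(argc > 1 ? argv[1]
                                        : "/mnt/lustre/test/striped.dat");
    if (fd >= 0)
        close(fd);
    return fd < 0;
}

Creating the files in advance with a wrapper like this, or with the stripe-creation utility shipped with Lustre, gets the one-stripe-per-file layout without touching the application itself.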