thr3ads.net - Lustre discuss - [Lustre-discuss] Odd errors in testing [May 2006]

Andreas Dilger

2006-May-19 07:36 UTC

[Lustre-discuss] Odd errors in testing

On Sep 09, 2004  07:18 -0600, Sonja Tideman wrote:
>  I am running Lustre 1.2.5 on a 16-node cluster running RH Enterprise Linux
> WS 3.0 (with a 2.4.21 SMP kernel) using a tcp NAL with IP over gm.  I have
> it configured with 8 OSTs and 1 MDS, configured to stripe in 2MB sizes
> across all OSTs (the plan is to connect this to a much bigger cluster, which
> is why there are so many OSTs).  Running a really simple benchmark run, I
> noticed some strange errors.  The benchmark will report various errors,
> starting with a short read and followed by "No such file or directory".
> This occurs using 6 clients all writing to and reading from the same file
> (which is either 2GB or 3GB in length) using either a 128KB or 256KB block
> size (which I realize is too small for good performance, but I was using the
> test to test for functionality, not performance).  It only occurs on the
> reads; the writes seem to have no problem.
>
> The client nodes have the following in the dmesg buffer:
>
> > LustreError: 11090:0:(client.c:452:ptlrpc_check_status()) @@@ type ==
> > PTL_RPC_MSG_ERR, err == -5 req@f7397400 x388800/t0
> > o3->ost518_UUID@NID_ga518-myri0_UUID:6 lens 288/240 ref 2 fl Rpc:R/0/0 rc 0/-5
> > LustreError: 11090:0:(client.c:452:ptlrpc_check_status()) @@@ type ==
> > PTL_RPC_MSG_ERR, err == -5 req@ef062400 x388828/t0
> > o3->ost514_UUID@NID_ga514-myri0_UUID:6 lens 288/240 ref 2 fl Rpc:R/0/0 rc 0/-5
> > LustreError: 11090:0:(client.c:452:ptlrpc_check_status()) @@@ type ==
> > PTL_RPC_MSG_ERR, err == -5 req@e3bf3e00 x542055/t0
> > o3->ost516_UUID@NID_ga516-myri0_UUID:6 lens 288/240 ref 2 fl Rpc:R/0/0 rc 0/-5
> > LustreError: 11090:0:(client.c:452:ptlrpc_check_status()) @@@ type ==
> > PTL_RPC_MSG_ERR, err == -5 req@de9ca000 x710072/t0
> > o3->ost513_UUID@NID_ga513-myri0_UUID:6 lens 288/240 ref 2 fl Rpc:R/0/0 rc 0/-5
>
> The OSTs and the MDS have a lot of the following messages:
>
> > LustreError: 11300:0:(filter_io.c:85:filter_finish_page_read()) page index
> > 3835/offset 0xefb000 not uptodate
> > LustreError: 11300:0:(filter_io.c:396:filter_preprw_read()) error page
> > 4096@15708160 121 f09d1380: rc -5
>
> I didn't see this behavior with Lustre 1.2.1.  It also only seems to happen
> with 6 clients; less than that performs fine (I haven't experimented with
> using more than that) and it only happens with the smaller block size.
> Block sizes larger than 256KB perform fine as well.  Anyone have any ideas
> what could be going on?  I have all debugging turned off, and I haven't
> played around with any of the tunable parameters (in /proc...such as in
> /proc/fs/lustre/OSC/OST*).

This was a bug that appeared in 1.2.5 but has been resolved for the
upcoming 1.2.6 release.  It only happened when multiple clients were
concurrently reading overlapping regions from the same large file.
As a workaround you can #define FILTER_MAX_CACHE_SIZE OBD_OBJECT_EOF
in obdfilter/filter_internal.h, or at runtime you can also fix this with

echo 0xffffffffffffffff > /proc/fs/lustre/obdfilter/<OST>/readcache_max_filesize
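
With 8 OSTs, the runtime workaround has to be repeated once per obdfilter device. A minimal sketch of how that could be looped is shown here; the function name set_readcache_max is hypothetical, and it assumes every device directory under /proc/fs/lustre/obdfilter exposes a readcache_max_filesize file as in the command above.

```shell
#!/bin/sh
# Sketch only: write one value to readcache_max_filesize for every
# obdfilter device found under a given root directory.
# set_readcache_max is a hypothetical helper name, not part of Lustre.
set_readcache_max() {
    root="$1"     # e.g. /proc/fs/lustre/obdfilter
    value="$2"    # e.g. 0xffffffffffffffff
    count=0
    for f in "$root"/*/readcache_max_filesize; do
        # Skip the literal pattern when the glob matches nothing.
        [ -e "$f" ] || continue
        echo "$value" > "$f" || return 1
        count=$((count + 1))
    done
    # Report how many devices were updated.
    echo "$count"
}
```

Run as root on each OST node, it would be called as: set_readcache_max /proc/fs/lustre/obdfilter 0xffffffffffffffff. The setting is a runtime value under /proc, so it presumably needs to be reapplied after the OST is restarted.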

Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://members.shaw.ca/adilger/             http://members.shaw.ca/golinux/


Sonja Tideman

2006-May-19 07:36 UTC

[Lustre-discuss] Odd errors in testing

Hello,

 I am running Lustre 1.2.5 on a 16-node cluster running RH Enterprise Linux
WS 3.0 (with a 2.4.21 SMP kernel) using a tcp NAL with IP over gm.  I have
it configured with 8 OSTs and 1 MDS, configured to stripe in 2MB sizes
across all OSTs (the plan is to connect this to a much bigger cluster, which
is why there are so many OSTs).  Running a really simple benchmark run, I
noticed some strange errors.  The benchmark will report various errors,
starting with a short read and followed by "No such file or directory".
This occurs using 6 clients all writing to and reading from the same file
(which is either 2GB or 3GB in length) using either a 128KB or 256KB block
size (which I realize is too small for good performance, but I was using the
test to test for functionality, not performance).  It only occurs on the
reads; the writes seem to have no problem.

The client nodes have the following in the dmesg buffer:

> LustreError: 11090:0:(client.c:452:ptlrpc_check_status()) @@@ type ==
> PTL_RPC_MSG_ERR, err == -5 req@f7397400 x388800/t0
> o3->ost518_UUID@NID_ga518-myri0_UUID:6 lens 288/240 ref 2 fl Rpc:R/0/0 rc 0/-5
> LustreError: 11090:0:(client.c:452:ptlrpc_check_status()) @@@ type ==
> PTL_RPC_MSG_ERR, err == -5 req@ef062400 x388828/t0
> o3->ost514_UUID@NID_ga514-myri0_UUID:6 lens 288/240 ref 2 fl Rpc:R/0/0 rc 0/-5
> LustreError: 11090:0:(client.c:452:ptlrpc_check_status()) @@@ type ==
> PTL_RPC_MSG_ERR, err == -5 req@e3bf3e00 x542055/t0
> o3->ost516_UUID@NID_ga516-myri0_UUID:6 lens 288/240 ref 2 fl Rpc:R/0/0 rc 0/-5
> LustreError: 11090:0:(client.c:452:ptlrpc_check_status()) @@@ type ==
> PTL_RPC_MSG_ERR, err == -5 req@de9ca000 x710072/t0
> o3->ost513_UUID@NID_ga513-myri0_UUID:6 lens 288/240 ref 2 fl Rpc:R/0/0 rc 0/-5

The OSTs and the MDS have a lot of the following messages:

> LustreError: 11300:0:(filter_io.c:85:filter_finish_page_read()) page index
> 3835/offset 0xefb000 not uptodate
> LustreError: 11300:0:(filter_io.c:396:filter_preprw_read()) error page
> 4096@15708160 121 f09d1380: rc -5
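
The two OST messages above appear to describe the same page: index 3835 times a 4096-byte page gives byte offset 15708160, which is 0xefb000 in hex, and rc -5 matches -EIO on Linux. The small check below is a sketch only; page_byte_offset is a hypothetical helper, the 4096-byte page size is assumed, and reading "4096@15708160" as length@offset is an assumption rather than something the log states.

```shell
#!/bin/sh
# Sketch only: convert a page index to a byte offset.
# page_byte_offset is a hypothetical helper; 4096 is an assumed page size.
page_byte_offset() {
    index="$1"
    page_size="${2:-4096}"
    echo $((index * page_size))
}

page_byte_offset 3835                        # prints 15708160
printf '0x%x\n' "$(page_byte_offset 3835)"   # prints 0xefb000
```
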
I didn't see this behavior with Lustre 1.2.1.  It also only seems to happen
with 6 clients; less than that performs fine (I haven't experimented with
using more than that) and it only happens with the smaller block size.
Block sizes larger than 256KB perform fine as well.  Anyone have any ideas
what could be going on?  I have all debugging turned off, and I haven't
played around with any of the tunable parameters (in /proc...such as in
/proc/fs/lustre/OSC/OST*).

Thanks,
  Sonja Tideman
