On Sep 09, 2004 07:18 -0600, Sonja Tideman wrote:
> I am running Lustre 1.2.5 on a 16-node cluster running RH Enterprise Linux
> WS 3.0 (with a 2.4.21 SMP kernel) using a tcp NAL with IP over gm. I have
> it configured with 8 OSTs and 1 MDS, configured to stripe in 2MB sizes
> across all OSTs (the plan is to connect this to a much bigger cluster, which
> is why there are so many OSTs). Running a really simple benchmark run, I
> noticed some strange errors. The benchmark will report various errors,
> starting with a short read and followed by "No such file or directory".
> This occurs using 6 clients all writing to and reading from the same file
> (which is either 2GB or 3GB in length) using either a 128KB or 256KB block
> size (which I realize is too small for good performance, but I was using the
> test to test for functionality, not performance). It only occurs on the
> reads; the writes seem to have no problem.
>
> The client nodes have the following in the dmesg buffer:
>
> > LustreError: 11090:0:(client.c:452:ptlrpc_check_status()) @@@ type ==
> > PTL_RPC_MSG_ERR, err == -5 req@f7397400 x388800/t0
> > o3->ost518_UUID@NID_ga518-myri0_UUID:6 lens 288/240 ref 2 fl Rpc:R/0/0 rc 0/-5
> > LustreError: 11090:0:(client.c:452:ptlrpc_check_status()) @@@ type ==
> > PTL_RPC_MSG_ERR, err == -5 req@ef062400 x388828/t0
> > o3->ost514_UUID@NID_ga514-myri0_UUID:6 lens 288/240 ref 2 fl Rpc:R/0/0 rc 0/-5
> > LustreError: 11090:0:(client.c:452:ptlrpc_check_status()) @@@ type ==
> > PTL_RPC_MSG_ERR, err == -5 req@e3bf3e00 x542055/t0
> > o3->ost516_UUID@NID_ga516-myri0_UUID:6 lens 288/240 ref 2 fl Rpc:R/0/0 rc 0/-5
> > LustreError: 11090:0:(client.c:452:ptlrpc_check_status()) @@@ type ==
> > PTL_RPC_MSG_ERR, err == -5 req@de9ca000 x710072/t0
> > o3->ost513_UUID@NID_ga513-myri0_UUID:6 lens 288/240 ref 2 fl Rpc:R/0/0 rc 0/-5
>
> The OSTs and the MDS have a lot of the following messages:
>
> > LustreError: 11300:0:(filter_io.c:85:filter_finish_page_read()) page index
> > 3835/offset 0xefb000 not uptodate
> > LustreError: 11300:0:(filter_io.c:396:filter_preprw_read()) error page
> > 4096@15708160 121 f09d1380: rc -5
>
> I didn't see this behavior with Lustre 1.2.1. It also only seems to happen
> with 6 clients; less than that performs fine (I haven't experimented with
> using more than that) and it only happens with the smaller block size.
> Block sizes larger than 256KB perform fine as well. Anyone have any ideas
> what could be going on? I have all debugging turned off, and I haven't
> played around with any of the tunable parameters (in /proc...such as in
> /proc/fs/lustre/OSC/OST*).

This was a bug that appeared in 1.2.5 but has been resolved for the
upcoming 1.2.6 release. It only happened when multiple clients were
concurrently reading overlapping regions from the same large file.
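
To make the trigger condition concrete, here is a rough sketch of that access pattern: several readers pulling the same file at once with a small block size. It is not the benchmark from the original report. The function name read_same_file and its arguments are made up for illustration, and local dd processes only stand in for separate client nodes, so on a real cluster you would presumably start it on each client against the same file at the same time. A read error such as the -5 (EIO) above would be expected to show up as a failed reader.

```shell
#!/bin/sh
# Hypothetical helper: start several concurrent sequential readers on one file.
#   $1 = file to read
#   $2 = number of readers (default 6, as in the report)
#   $3 = block size in bytes (default 131072 = 128KB)
read_same_file() {
    file="$1"
    readers="${2:-6}"
    bs="${3:-131072}"
    pids=""
    i=0
    while [ "$i" -lt "$readers" ]; do
        # Each reader scans the whole file, so all readers overlap.
        dd if="$file" of=/dev/null bs="$bs" 2>/dev/null &
        pids="$pids $!"
        i=$((i + 1))
    done
    failed=0
    for p in $pids; do
        # wait returns the exit status of that particular dd.
        wait "$p" || failed=$((failed + 1))
    done
    echo "$failed of $readers readers failed"
    [ "$failed" -eq 0 ]
}
```

For example, read_same_file /mnt/lustre/testfile 6 131072, where the mount point and file name are placeholders.
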
As a workaround you can #define FILTER_MAX_CACHE_SIZE OBD_OBJECT_EOF
in obdfilter/filter_internal.h, or at runtime you can also fix this with:

    echo 0xffffffffffffffff > /proc/fs/lustre/obdfilter/<OST>/readcache_max_filesize

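
With 8 OSTs it may be easier to apply the runtime setting in a loop. The following is a minimal sketch, assuming each OST shows up as its own subdirectory under /proc/fs/lustre/obdfilter, as the <OST> placeholder suggests. The function name set_readcache_max and its optional directory argument are hypothetical; the argument exists only so the loop can be tried against a scratch directory first.

```shell
#!/bin/sh
# Hypothetical helper: write the workaround value into every
# readcache_max_filesize file found under the obdfilter directory.
#   $1 = base directory (default: /proc/fs/lustre/obdfilter)
set_readcache_max() {
    base="${1:-/proc/fs/lustre/obdfilter}"
    for f in "$base"/*/readcache_max_filesize; do
        # If the glob matched nothing, the literal pattern is left; skip it.
        [ -e "$f" ] || continue
        echo 0xffffffffffffffff > "$f" && echo "updated $f"
    done
}
```

Calling set_readcache_max with no argument, as root on each node that serves OSTs, would then touch every matching entry on that node.
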
Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://members.shaw.ca/adilger/ http://members.shaw.ca/golinux/