Ender Güler
2009-Jun-29 15:35 UTC
[Lustre-discuss] errors on mds and osses regarding to cheksum and decreased stripe counts
Hi there,

We have a Lustre 1.6.5.1 installation on RHEL 5.1. The interconnect is InfiniBand. I came across errors like the following on the MDS:

Lustre: 21241:0:(lov_qos.c:427:qos_shrink_lsm()) using fewer stripes for object 103514695: old 8 new 6

And here are the errors regarding checksums, on one of the OSTs:

LustreError: 12397:0:(ost_handler.c:1225:ost_brw_write()) client csum 41d0fa49, original server csum e388fa92, server csum now e388fa92

We set the stripe count to 8 at installation time. I've googled but could not find any relevant information about the source of the problem (by the way, I assume that it is a problem). Have you encountered log messages like these before? What could be wrong, and how could the problem be solved?

Thanks in advance.

Regards,
Ender GULER
System Administrator
National Center for HPC of Turkey
Andreas Dilger
2009-Jun-30 21:12 UTC
[Lustre-discuss] errors on mds and osses regarding to cheksum and decreased stripe counts
On Jun 29, 2009 18:35 +0300, Ender Güler wrote:
> We have lustre 1.6.5.1 installation on RHEL 5.1. The interconnect is
> infiniband. I came across the errors like following, on mds:
>
> Lustre: 21241:0:(lov_qos.c:427:qos_shrink_lsm()) using fewer stripes for
> object 103514695: old 8 new 6

This can happen if some of your OSTs are not responsive to precreate requests. It appears you are using wide striping by default, which is good if you have lots of clients reading/writing from the same file on a regular basis, but is not recommended if clients normally read/write from a single file OR the bandwidth of a single OST can handle the needs of a single client.

> And here is the errors regarding to checksum, on one of the ost's:
> LustreError: 12397:0:(ost_handler.c:1225:ost_brw_write()) client csum
> 41d0fa49, original server csum e388fa92, server csum now e388fa92

This looks like you are having network problems, or possibly you are using mmap IO? The data arriving at the server is different from the data that was originally checksummed by the client. This can happen in some cases if the client is doing repeated mmap writes to the same part of the file.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
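To make the mmap explanation above concrete, here is a minimal Python sketch of the race it describes: the client checksums a page, the application rewrites the same page before the bytes go out, and the server then checksums what actually arrived. This is an illustration only, not Lustre code. CRC32 from `zlib` is used as a stand-in checksum (no claim is made here about which algorithm Lustre 1.6 uses), and the 4096-byte page and all variable names are assumptions made for the example.

```python
import zlib


def checksum(buf) -> int:
    """CRC32 used as a stand-in checksum (assumption, not Lustre's code)."""
    return zlib.crc32(bytes(buf)) & 0xFFFFFFFF


# A page of file data that the application has mapped with mmap (hypothetical).
page = bytearray(b"A" * 4096)

# 1. The client checksums the page when it prepares the write.
client_csum = checksum(page)

# 2. Before the bytes leave the client, the application writes to the
#    same page again through its mapping.
page[0:4] = b"BBBB"

# 3. The bytes that reach the server are the modified ones.
wire_data = bytes(page)

# 4. The server checksums the data on arrival and again when re-checking.
server_csum = checksum(wire_data)
server_csum_again = checksum(wire_data)

mismatch = client_csum != server_csum
stable_on_server = server_csum == server_csum_again

print(
    f"client csum {client_csum:08x}, "
    f"original server csum {server_csum:08x}, "
    f"server csum now {server_csum_again:08x}"
)
```

In the sketch the two server-side values agree while the client value differs, which has the same shape as the logged message (e388fa92 twice on the server, 41d0fa49 on the client). That pattern alone does not distinguish an mmap rewrite from corruption on the network, since both change the data before the server first sees it.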
Ender Güler
2009-Jul-06 18:29 UTC
[Lustre-discuss] errors on mds and osses regarding to cheksum and decreased stripe counts
Thank you Andreas,

After investigating a little deeper, I found that one of our users' processes creates lots of small files and performs I/O on them. This is the cause of the "fewer stripes" log message. As for the csum problem, my investigation is still in progress.

Thanks again.
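For anyone tracking down the same two messages, a small Python sketch like the following could help tally them from collected syslog output. The regular expressions are written against the two log lines quoted in this thread only; other Lustre versions may word them differently. The function name `summarize` and the assumption that each message sits on a single line are hypothetical choices made for this example.

```python
import re
from collections import Counter

# Patterns derived from the two messages quoted in this thread.
SHRINK_RE = re.compile(
    r"using fewer stripes for object (\d+): old (\d+) new (\d+)"
)
CSUM_RE = re.compile(
    r"client csum ([0-9a-f]+), original server csum ([0-9a-f]+), "
    r"server csum now ([0-9a-f]+)"
)


def summarize(lines):
    """Count stripe reductions by (old, new) and collect checksum events."""
    shrinks = Counter()
    csum_events = []
    for line in lines:
        m = SHRINK_RE.search(line)
        if m:
            shrinks[(int(m.group(2)), int(m.group(3)))] += 1
            continue
        m = CSUM_RE.search(line)
        if m:
            client, original, now = m.groups()
            csum_events.append(
                {
                    "client": client,
                    "server_original": original,
                    "server_now": now,
                    # True when the data did not change after reaching the server.
                    "server_stable": original == now,
                }
            )
    return shrinks, csum_events


sample = [
    "Lustre: 21241:0:(lov_qos.c:427:qos_shrink_lsm()) using fewer stripes "
    "for object 103514695: old 8 new 6",
    "LustreError: 12397:0:(ost_handler.c:1225:ost_brw_write()) client csum "
    "41d0fa49, original server csum e388fa92, server csum now e388fa92",
]

shrinks, csum_events = summarize(sample)
print(dict(shrinks))
print(csum_events)
```

Grouping the stripe reductions by their (old, new) pair makes it easier to see whether they cluster around one workload, such as the many-small-files job mentioned above.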