Hi;

We've been testing some 1.8.0.1 patchless clients (RHEL5.3, x86_64, RPMs from the Sun download page) with out 1.6.4.2 servers.

The OSS nodes started logging these LustreErrors from the 1.8.0.1 clients:

> LustreError: 7302:0:(ost_handler.c:1157:ost_brw_write()) client csum 8448447f, original server csum 66fb7cff, server csum now 66fb7cff
> LustreError: 7302:0:(ost_handler.c:1157:ost_brw_write()) Skipped 1 previous similar message
> LustreError: 7391:0:(ost_handler.c:1095:ost_brw_write()) client csum 9d8c7d6a, server csum 2cfdcb47
> LustreError: 168-f: ufhpc-OST0004: BAD WRITE CHECKSUM: changed in transit before arrival at OST from 12345-10.13.28.55@tcp inum 38470778/1485322248 object 67094039/0 extent [0-1023]

Is this a known issue with running 1.8.0.1 clients against 1.6.4.2 servers? We aren't seeing these messages in relation to our 1.6 clients.

Looking through the Lustre bugzilla, I see bug 18296, which discusses these messages, but it was logged against Lustre version 1.6.6.

Cheers,
Craig
On Jul 24, 2009, at 10:33 AM, Craig Prescott wrote:

> Hi;
>
> We've been testing some 1.8.0.1 patchless clients (RHEL5.3, x86_64, RPMs
> from the Sun download page) with out 1.6.4.2 servers.

Just to clarify the typo... That should have been "with our" 1.6.4.2 servers. We are running 1.8.0.1 patch-less clients with 1.6.4.2 on the MGS/MDS and OSSs and getting the messages Craig refers to below.

ct

> The OSS nodes started logging these LustreErrors from the 1.8.0.1 clients:
>
>> LustreError: 7302:0:(ost_handler.c:1157:ost_brw_write()) client csum 8448447f, original server csum 66fb7cff, server csum now 66fb7cff
>> LustreError: 7302:0:(ost_handler.c:1157:ost_brw_write()) Skipped 1 previous similar message
>> LustreError: 7391:0:(ost_handler.c:1095:ost_brw_write()) client csum 9d8c7d6a, server csum 2cfdcb47
>> LustreError: 168-f: ufhpc-OST0004: BAD WRITE CHECKSUM: changed in transit before arrival at OST from 12345-10.13.28.55@tcp inum 38470778/1485322248 object 67094039/0 extent [0-1023]
>
> Is this a known issue with running 1.8.0.1 clients against 1.6.4.2
> servers? We aren't seeing these messages in relation to our 1.6 clients.
>
> Looking through the Lustre bugzilla, I see bug 18296, which discusses
> these messages, but it was logged against Lustre version 1.6.6.
>
> Cheers,
> Craig
On Jul 24, 2009 10:33 -0400, Craig Prescott wrote:

> We've been testing some 1.8.0.1 patchless clients (RHEL5.3, x86_64, RPMs
> from the Sun download page) with our 1.6.4.2 servers.
>
> The OSS nodes started logging these LustreErrors from the 1.8.0.1 clients:
>
> > LustreError: 7302:0:(ost_handler.c:1157:ost_brw_write()) client csum 8448447f, original server csum 66fb7cff, server csum now 66fb7cff
> > LustreError: 7302:0:(ost_handler.c:1157:ost_brw_write()) Skipped 1 previous similar message
> > LustreError: 7391:0:(ost_handler.c:1095:ost_brw_write()) client csum 9d8c7d6a, server csum 2cfdcb47
> > LustreError: 168-f: ufhpc-OST0004: BAD WRITE CHECKSUM: changed in transit before arrival at OST from 12345-10.13.28.55@tcp inum 38470778/1485322248 object 67094039/0 extent [0-1023]
>
> Is this a known issue with running 1.8.0.1 clients against 1.6.4.2
> servers? We aren't seeing these messages in relation to our 1.6 clients.

This is a known issue if the clients are using mmap IO (which can change the kernel pages w/o notifying the kernel). It would be possible to fix this warning by adding a "file is mmapped" flag to the RPC and suppress the console error on the server and subsequent error message if the IO never makes it to the server at least once in the next 5 retries.

Unfortunately, since this is a non-fatal error, nobody has worked on fixing it yet.

> Looking through the Lustre bugzilla, I see bug 18296, which discusses
> these messages, but it was logged against Lustre version 1.6.6.

The 1.6 and 1.8 code is very similar, with only a handful of isolated features added.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
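For readers unfamiliar with the mmap case described above, here is a minimal sketch of how a checksum can go stale when a page is changed through a mapping after the checksum was taken. It is not Lustre code: the use of CRC32 via `zlib`, the temporary file, and the function name `checksum_then_modify` are all assumptions made for illustration, and the real client and server compute their checksums inside the kernel.

```python
import mmap
import tempfile
import zlib

PAGE = mmap.PAGESIZE


def checksum_then_modify():
    """Return (checksum taken before the store, checksum taken after it)."""
    with tempfile.TemporaryFile() as f:
        # Back the mapping with one page of zeros.
        f.write(b"\0" * PAGE)
        f.flush()
        with mmap.mmap(f.fileno(), PAGE) as m:
            # Stand-in for the client checksumming the page for a write.
            before = zlib.crc32(m[:])
            # The application stores through the mapping; no write() call
            # is involved, so nothing recomputes the earlier checksum.
            m[0:4] = b"oops"
            # Stand-in for the server checksumming the data it received.
            after = zlib.crc32(m[:])
    return before, after


before, after = checksum_then_modify()
print(f"client csum {before:08x}, server csum {after:08x}")
```

The two values differ even though nothing was corrupted on the wire, which is why the message can be a false alarm for mmapped files.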
On Jul 24, 2009 14:06 -0600, Andreas Dilger wrote:

> On Jul 24, 2009 10:33 -0400, Craig Prescott wrote:
> > We've been testing some 1.8.0.1 patchless clients (RHEL5.3, x86_64, RPMs
> > from the Sun download page) with our 1.6.4.2 servers.
> >
> > The OSS nodes started logging these LustreErrors from the 1.8.0.1 clients:
> >
> > > LustreError: 7302:0:(ost_handler.c:1157:ost_brw_write()) client csum 8448447f, original server csum 66fb7cff, server csum now 66fb7cff
> > > LustreError: 7302:0:(ost_handler.c:1157:ost_brw_write()) Skipped 1 previous similar message
> > > LustreError: 7391:0:(ost_handler.c:1095:ost_brw_write()) client csum 9d8c7d6a, server csum 2cfdcb47
> > > LustreError: 168-f: ufhpc-OST0004: BAD WRITE CHECKSUM: changed in transit before arrival at OST from 12345-10.13.28.55@tcp inum 38470778/1485322248 object 67094039/0 extent [0-1023]
> >
> > Is this a known issue with running 1.8.0.1 clients against 1.6.4.2
> > servers? We aren't seeing these messages in relation to our 1.6 clients.
>
> This is a known issue if the clients are using mmap IO (which can change
> the kernel pages w/o notifying the kernel). It would be possible to fix
> this warning by adding a "file is mmapped" flag to the RPC and suppress
> the console error on the server and subsequent error message if the IO
> never makes it to the server at least once in the next 5 retries.
>
> Unfortunately, since this is a non-fatal error, nobody has worked on
> fixing it yet.

PS - of course, if mmap is not involved and the errors are isolated to particular client/server nodes, it is entirely possible that the network is corrupting the data in transit, as the message suggests.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
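One way to check whether the errors are isolated to particular client/server pairs, as the PS suggests, is to tally the BAD WRITE CHECKSUM lines from the OSS logs by peer and OST. The following is a rough sketch, not a Lustre tool: the regular expression is an assumption based only on the single message format quoted in this thread, and `tally_bad_checksums` is a hypothetical helper name. It accepts both `@tcp` and the archive-mangled ` at tcp` spelling of the peer.

```python
import re
from collections import Counter

# Assumed format, taken from the one example line in this thread:
#   <ost>: BAD WRITE CHECKSUM: ... from <peer>@<net> inum ...
BAD_CSUM = re.compile(
    r"(?P<ost>\S+): BAD WRITE CHECKSUM: .*? from "
    r"(?P<peer>\S+?(?:@| at )\w+) inum"
)


def tally_bad_checksums(lines):
    """Count BAD WRITE CHECKSUM messages per (peer, ost) pair."""
    counts = Counter()
    for line in lines:
        match = BAD_CSUM.search(line)
        if match:
            # Normalise the mail-archive spelling back to peer@net.
            peer = match.group("peer").replace(" at ", "@")
            counts[(peer, match.group("ost"))] += 1
    return counts


sample = [
    "LustreError: 168-f: ufhpc-OST0004: BAD WRITE CHECKSUM: changed in "
    "transit before arrival at OST from 12345-10.13.28.55@tcp inum "
    "38470778/1485322248 object 67094039/0 extent [0-1023]",
    "LustreError: 7302:0:(ost_handler.c:1157:ost_brw_write()) Skipped 1 "
    "previous similar message",
]

for (peer, ost), n in tally_bad_checksums(sample).most_common():
    print(f"{n:6d}  {peer}  {ost}")
```

If the counts cluster on one client or one OSS, the network path or hardware for that node is the more likely suspect. An even spread across the 1.8 clients points back towards the mmap explanation.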