Hi;

We are running Lustre 1.6.3 using o2ib (OFED 1.2) and tcp networks. Clients are CentOS 4.5 patchless clients, and the single server (MGS/MDS/OSS combo) is a CentOS 5.0 with patched kernel (includes proposed fix for bug 13438). All nodes are x86_64.

We have run into a problem on the clients when one of our users tries to install an rpm package into an rpm database that lives on the Lustre filesystem. The rpm install command hangs in I/O wait state (client is using o2ib). Attempts to access the rpm database directory from other processes like ls also hang in D state. ldlm_poold and pdflush are stuck:

[root@iogw1 ~]# ps auxww | grep ' D ' | grep -v grep
root 79 0.0 0.0 0 0 ? D Oct26 0:01 [pdflush]
uscms01 7155 0.0 0.0 16912 1684 ? D Oct26 0:00 ls -al /scratch/mri/osg/app/cmssoft/cms/slc3_ia32_gcc323/var/lib/rpm
uscms01 11712 0.0 0.0 16912 1684 ? D Oct27 0:00 ls -al /scratch/mri/osg/app/cmssoft/cms/slc3_ia32_gcc323/var/lib/rpm
uscms01 15363 0.0 0.0 2820 820 ? D Oct27 0:00 /usr/sbin/lsof /scratch/mri/osg/app/cmssoft/cms/slc3_ia32_gcc323/var/lib
uscms01 16589 0.0 0.1 12700 5564 ? D Oct26 0:02 rpm -Uvh --define _rpmlock_path /scratch/mri/osg/app/cmssoft/cms/slc3_ia32_gcc323/var/lib/rpm/__db.0 -r /scratch/mri/osg/app/cmssoft/cms --dbpath /scratch/mri/osg/app/cmssoft/cms/slc3_ia32_gcc323/var/lib/rpm --rcfile /scratch/mri/osg/app/cmssoft/cms/tmp/BOOTSTRAP/build/eulisse/rpm/PKGTOOLS/slc3_ia32_gcc323/external/rpm/4.4.2.1-CMS3/lib/rpm/rpmrc --nodeps --prefix /scratch/mri/osg/app/cmssoft/cms --ignoreos --ignorearch /scratch/mri/osg/app/cmssoft/cms/tmp/BOOTSTRAP/external+elfutils+0.128-CMS3-1-1007.slc3_ia32_gcc323.rpm
uscms01 21531 0.0 0.0 16912 1684 ? D Oct27 0:00 ls -al /scratch/mri/osg/app/cmssoft/cms/slc3_ia32_gcc323/var/lib/rpm
root 23827 0.0 0.0 0 0 ? D Oct26 0:02 [ldlm_poold]

Rebooting the client and remounting restores access to the rpm database directory (ls works), but if the user starts their commands again, the problem repeats.
We tried adding the 'flock' mount option, and the user was able to do his software installation once, but the problem returned.

In the system logs on the client, I see some LustreErrors, but am unsure whether they correspond to the user's activity (see appended). They mention bug 11742, which does not appear to have a solution.

Has anyone seen this before, or does anyone know of a fix? Any help would be appreciated.

Thanks,
Craig

From the client:

Oct 26 12:56:33 iogw1 kernel: LustreError: 132-0: BAD WRITE CHECKSUM: changed in transit AND doesn't match the original - likely false positive due to mmap IO (bug 11742): from 10.13.24.85@o2ib inum 18200061/3052028908 object 3065594/0 extent [446464-450559]
Oct 26 12:56:33 iogw1 kernel: LustreError: 23766:0:(osc_request.c:1087:check_write_checksum()) original client csum 947c387a, server csum bdfdb0f6, client csum now 279c1071
Oct 26 12:56:33 iogw1 kernel: LustreError: 23766:0:(osc_request.c:1281:osc_brw_redo_request()) @@@ redo for checksum error req@000001011f910e00 x390550/t31236588 o4->mri-OST0004_UUID@10.13.24.85@o2ib:28 lens 384/352 ref 2 fl Complete:R/0/0 rc 0/0
Oct 26 12:56:33 iogw1 kernel: LustreError: 132-0: BAD WRITE CHECKSUM: changed in transit AND doesn't match the original - likely false positive due to mmap IO (bug 11742): from 10.13.24.85@o2ib inum 18200059/1461467594 object 3307767/0 extent [20480-24575]
Oct 26 12:56:33 iogw1 kernel: LustreError: 23766:0:(osc_request.c:1087:check_write_checksum()) original client csum 38323908, server csum 8ecf8abb, client csum now 8e60616c
Oct 26 12:56:33 iogw1 kernel: LustreError: 23766:0:(osc_request.c:1281:osc_brw_redo_request()) @@@ redo for checksum error req@000001011e530a00 x390536/t34503458 o4->mri-OST0000_UUID@10.13.24.85@o2ib:28 lens 384/352 ref 2 fl Complete:R/0/0 rc 0/0
Oct 26 12:56:33 iogw1 kernel: LustreError: 132-0: BAD WRITE CHECKSUM: changed in transit AND doesn't match the original - likely false positive due to mmap IO (bug 11742): from 10.13.24.85@o2ib inum 18200060/1314702654 object 3279245/0 extent [1204224-1318911]
Oct 26 12:56:33 iogw1 kernel: LustreError: Skipped 3 previous similar messages
Oct 26 12:56:33 iogw1 kernel: LustreError: 23766:0:(osc_request.c:1087:check_write_checksum()) original client csum 18528fa8, server csum 4cda3c4e, client csum now 7b2e3f37
Oct 26 12:56:33 iogw1 kernel: LustreError: 23766:0:(osc_request.c:1087:check_write_checksum()) Skipped 3 previous similar messages
Oct 26 12:56:33 iogw1 kernel: LustreError: 23766:0:(osc_request.c:1281:osc_brw_redo_request()) @@@ redo for checksum error req@00000100bfda0e00 x390556/t29531809 o4->mri-OST0001_UUID@10.13.24.85@o2ib:28 lens 432/360 ref 2 fl Complete:R/0/0 rc 0/0
Oct 26 12:56:33 iogw1 kernel: LustreError: 23766:0:(osc_request.c:1281:osc_brw_redo_request()) Skipped 3 previous similar messages
Oct 26 12:56:33 iogw1 kernel: LustreError: 132-0: BAD WRITE CHECKSUM: changed in transit AND doesn't match the original - likely false positive due to mmap IO (bug 11742): from 10.13.24.85@o2ib inum 18200060/1314702654 object 3279245/0 extent [1204224-1318911]
Oct 26 12:56:33 iogw1 kernel: LustreError: Skipped 4 previous similar messages
Oct 26 12:56:33 iogw1 kernel: LustreError: 23766:0:(osc_request.c:1087:check_write_checksum()) original client csum 6fc1808f, server csum f9c02df2, client csum now 271aff8c
Oct 26 12:56:33 iogw1 kernel: LustreError: 23766:0:(osc_request.c:1087:check_write_checksum()) Skipped 4 previous similar messages
Oct 26 12:56:33 iogw1 kernel: LustreError: 23766:0:(osc_request.c:1281:osc_brw_redo_request()) @@@ redo for checksum error req@000001011e189e00 x390559/t29531810 o4->mri-OST0001_UUID@10.13.24.85@o2ib:28 lens 432/360 ref 2 fl Complete:R/0/0 rc 0/0
Oct 26 12:56:33 iogw1 kernel: LustreError: 23766:0:(osc_request.c:1281:osc_brw_redo_request()) Skipped 4 previous similar messages
Oct 26 12:56:33 iogw1 kernel: LustreError: 132-0: BAD WRITE CHECKSUM: changed in transit AND doesn't match the original - likely false positive due to mmap IO (bug 11742): from 10.13.24.85@o2ib inum 18200061/3052028908 object 3065594/0 extent [446464-450559]
Oct 26 12:56:33 iogw1 kernel: LustreError: Skipped 2 previous similar messages
Oct 26 12:56:33 iogw1 kernel: LustreError: 23766:0:(osc_request.c:1087:check_write_checksum()) original client csum be7ec5be, server csum 952e539d, client csum now 7c80e126
Oct 26 12:56:33 iogw1 kernel: LustreError: 23766:0:(osc_request.c:1087:check_write_checksum()) Skipped 2 previous similar messages
Oct 26 12:56:33 iogw1 kernel: LustreError: 23766:0:(osc_request.c:1277:osc_brw_redo_request()) too many checksum retries, returning error
Oct 26 12:56:33 iogw1 kernel: LustreError: 23766:0:(osc_request.c:1281:osc_brw_redo_request()) @@@ redo for checksum error req@000001012673d000 x390567/t29531812 o4->mri-OST0001_UUID@10.13.24.85@o2ib:28 lens 432/360 ref 2 fl Complete:R/0/0 rc 0/0
Oct 26 12:56:33 iogw1 kernel: LustreError: 23766:0:(osc_request.c:1281:osc_brw_redo_request()) Skipped 3 previous similar messages
Oct 26 12:56:33 iogw1 kernel: LustreError: 23766:0:(osc_request.c:1277:osc_brw_redo_request()) too many checksum retries, returning error
Oct 26 12:56:33 iogw1 kernel: LustreError: 23766:0:(osc_request.c:1277:osc_brw_redo_request()) Skipped 1 previous similar message
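
In case it helps anyone triaging the same messages, here is a minimal Python sketch that tallies the "redo for checksum error" lines per OST target, to see whether the retries are spread across OSTs or concentrated on one. The function name count_redos_by_target and the sample list are illustrative assumptions, not anything provided by Lustre, and the pattern only relies on the "o4->mri-OST0001_UUID" form seen in the log above. Because the kernel rate-limits these messages ("Skipped N previous similar messages"), the counts are a lower bound, not an exact total.

```python
import re
from collections import Counter

# Target name as it appears in the redo lines above, e.g. "o4->mri-OST0004_UUID".
TARGET = re.compile(r"o\d+->(\S+?)_UUID")

def count_redos_by_target(lines):
    """Count 'redo for checksum error' messages per OST target (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        # Ignore everything except the actual redo messages, including the
        # rate-limiting "Skipped N previous similar messages" lines.
        if "redo for checksum error" not in line:
            continue
        match = TARGET.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

# Sample lines copied from the client log above.
PREFIX = "Oct 26 12:56:33 iogw1 kernel: LustreError: 23766:0:(osc_request.c:1281:osc_brw_redo_request()) "
sample = [
    PREFIX + "@@@ redo for checksum error req@000001011f910e00 x390550/t31236588 o4->mri-OST0004_UUID@10.13.24.85@o2ib:28 lens 384/352 ref 2 fl Complete:R/0/0 rc 0/0",
    PREFIX + "@@@ redo for checksum error req@00000100bfda0e00 x390556/t29531809 o4->mri-OST0001_UUID@10.13.24.85@o2ib:28 lens 432/360 ref 2 fl Complete:R/0/0 rc 0/0",
    PREFIX + "Skipped 3 previous similar messages",
    PREFIX + "@@@ redo for checksum error req@000001011e189e00 x390559/t29531810 o4->mri-OST0001_UUID@10.13.24.85@o2ib:28 lens 432/360 ref 2 fl Complete:R/0/0 rc 0/0",
]

if __name__ == "__main__":
    for target, n in sorted(count_redos_by_target(sample).items()):
        print(target, n)
```

To run it against a real log instead of the sample, pass an open file object (for example the client's syslog file) to count_redos_by_target in place of the list.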