Anatoly Oreshkin
2007-Dec-14 15:28 UTC
[Lustre-discuss] networking problem with kernel-lustre-smp-2.6.9-55.0.9.EL_lustre.1.6.3(1.6.4)smp
Hello,

We have Scientific Linux SL release 4.4 (aka RHEL 4.4) with kernel 2.6.9-42.0.3.ELsmp installed on our cluster.

I downloaded the binary RPMs for RHEL-2.6-i686 from the clusterfs site:

http://www.clusterfs.com/downloads/public/Lustre/v1.6/Production/1.6.3/rhel-2.6-i686/

kernel-lustre-smp-2.6.9-55.0.9.EL_lustre.1.6.3.i686.rpm
kernel-lustre-source-2.6.9-55.0.9.EL_lustre.1.6.3.i686.rpm
lustre-ldiskfs-3.0.2-2.6.9_55.0.9.EL_lustre.1.6.3smp.i686.rpm
lustre-modules-1.6.3-2.6.9_55.0.9.EL_lustre.1.6.3smp.i686.rpm
lustre-1.6.3-2.6.9_55.0.9.EL_lustre.1.6.3smp.i686.rpm

and installed them on the head node and all client nodes.

First I tried to test networking with this kernel over NFS, without a Lustre file system. The NFS server runs on the head node and exports a non-Lustre file system. When I started reading from the NFS file system on the client nodes, I ran into a networking problem.

On one client, with a Marvell 88E8050 Gigabit Ethernet card (sky2 driver), the kernel logged "hw tcp v4 csum failed" error messages and the read hung.

On another client node, with an Intel 82566DC Gigabit Ethernet card (e1000 driver), dmesg showed

nfs_statfs: statfs error = 512
nfs_statfs: statfs error = 512
nfs_statfs: statfs error = 512
....

and the read also hung.

With my old kernel from SL 4.4 there were no such problems.

Then I installed the binary RPMs for Lustre 1.6.4 from http://www.clusterfs.com/downloads/public/Lustre/v1.6/Production/1.6.4 and repeated the same read test, but the result was the same.

What might be wrong?

Thank you.
Brian J. Murrell
2007-Dec-14 15:46 UTC
[Lustre-discuss] networking problem with kernel-lustre-smp-2.6.9-55.0.9.EL_lustre.1.6.3(1.6.4)smp
On Fri, 2007-12-14 at 18:28 +0300, Anatoly Oreshkin wrote:
> Hello,
>
> We have Scientific Linux SL release 4.4 (aka RHEL 4.4) with
> kernel 2.6.9-42.0.3.ELsmp installed on our cluster.

So this is the older RHEL4 kernel.

> I've got from clusterfs site
> http://www.clusterfs.com/downloads/public/Lustre/v1.6/Production/1.6.3/rhel-2.6-i686/
>
> binary rpms for RHEL-2.6-i686:
>
> kernel-lustre-smp-2.6.9-55.0.9.EL_lustre.1.6.3.i686.rpm

Indeed, this is the kernel from what I think is RHEL 4.5.

> First I've tried to test networking with this kernel on NFS file system
> without lustre file system.
> NFS server is started on head node and exports non-lustre file system.
> I've started reading on client nodes NFS file system and encountered
> networking problem.

Did you attempt this exact same test with the SL 4.4 (2.6.9-42.0.3.ELsmp) kernel before using our 2.6.9-55 kernel? That would be a most interesting result.

If the problem reproduces with that kernel then you appear to have a kernel-version independent hardware/driver problem which you will need to rectify before you can do anything useful with that node/nic.

If the problem does not exist in the SL 2.6.9-42.0.3.ELsmp kernel, then the next step would be to try the 2.6.9-55.0.9.ELsmp kernel from SL. If the problem does exist in that kernel then there is an issue with the generic 2.6.9-55.0.9.ELsmp kernel. If it does not happen with that kernel either then we have a possible regression in our kernel.

We cannot determine any of this though without you being able to try those combinations out.

All of this is expensive in terms of time though, and many times it's just cheaper to replace troublesome hardware with something that is known to be working. It's not as intellectually interesting, but can be cheaper. :-)

b.
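
The kernel comparison described above can be written down as a small decision function. This is only an illustrative sketch, not something from the thread: the function name `triage` and its two inputs are hypothetical, and each input stands for the outcome of manually repeating the same NFS read test under the named kernel. It assumes the problem is already known to occur with the Lustre-patched 2.6.9-55.0.9 kernel.

```python
from typing import Optional


def triage(fails_on_sl_42: bool, fails_on_sl_55: Optional[bool] = None) -> str:
    """Map NFS read-test outcomes to the conclusions given in the reply.

    fails_on_sl_42: True if the hang reproduces on the stock SL
                    2.6.9-42.0.3.ELsmp kernel.
    fails_on_sl_55: True/False for the stock SL 2.6.9-55.0.9.ELsmp kernel,
                    or None if that kernel has not been tested yet.
    (Both parameter names are hypothetical, chosen for this sketch.)
    """
    if fails_on_sl_42:
        # Fails even on the old kernel: not tied to a kernel version.
        return "kernel-version independent hardware/driver problem"
    if fails_on_sl_55 is None:
        # Old kernel is fine; the stock newer kernel is the next thing to try.
        return "next step: test the stock SL 2.6.9-55.0.9.ELsmp kernel"
    if fails_on_sl_55:
        # Stock newer kernel fails too, so the Lustre patches are not implicated.
        return "issue with the generic 2.6.9-55.0.9.ELsmp kernel"
    # Both stock kernels work; only the Lustre-patched kernel fails.
    return "possible regression in the Lustre-patched kernel"


if __name__ == "__main__":
    # Example: old kernel works, stock 2.6.9-55.0.9 kernel not yet tested.
    print(triage(fails_on_sl_42=False))
```

Each call corresponds to one row of the test matrix; the results only mean something if the same client, NIC, and read workload are used for every kernel.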