Hi,

I am having problems with file locking, or the lack of it.
I'm setting up a Lustre environment consisting of a co-located MGS/MDS and 3 OSS servers. I am using CTDB and Samba to get Windows to co-operate. I have configured the OSS servers to also be clients in order to get CTDB/Samba working.
All seems okay and working apart from file locking, which is not working from a Linux or Windows perspective.

I have set up the following:

    MGS/MDS = 192.168.3.171
    OSS1    = 192.168.3.172
    OSS2    = 192.168.2.2
    OSS3    = 192.168.1.2

I have Lustre installed on all servers, which run CentOS 4, using the following packages:

    kernel-lustre-smp-2.6.9-67.0.7.EL_lustre.1.6.5.i686.rpm
    lustre-ldiskfs-3.0.4-2.6.9_67.0.7.EL_lustre.1.6.5smp.i686.rpm
    lustre-modules-1.6.5-2.6.9_67.0.7.EL_lustre.1.6.5smp.i686.rpm
    lustre-1.6.5-2.6.9_67.0.7.EL_lustre.1.6.5smp.i686.rpm

I have rebuilt e2fsprogs:

    rpmbuild --rebuild e2fsprogs-1.40.7.sun3-0redhat.src.rpm

and installed the resulting packages:

    e2fsprogs-1.40.7.sun3-0redhat.i386.rpm
    e2fsprogs-debuginfo-1.40.7.sun3-0redhat.i386.rpm
    e2fsprogs-devel-1.40.7.sun3-0redhat.i386.rpm
    uuidd-1.40.7.sun3-0redhat.i386.rpm

The following has been added to modprobe.conf on all Lustre servers ../lnet/parameters dir

    options lnet networks=tcp

I have configured Lustre using the following commands:

****************** MDT/MGS ****************************************
    mkfs.lustre --fsname lustre --mdt --mgs /dev/sdb
    mkdir -p /mnt/mdt
    mount -t lustre /dev/sdb /mnt/mdt

***************** OSS1 *********************************************
    mkfs.lustre --fsname lustre --ost --mgsnode=192.168.3.171@tcp0 /dev/sdb
    mkdir -p /mnt/ost1
    mount -t lustre /dev/sdb /mnt/ost1

****************** OSS2 ********************************************
    mkfs.lustre --fsname lustre --ost --mgsnode=192.168.3.171@tcp0 /dev/sdb
    mkdir -p /mnt/ost2
    mount -t lustre /dev/sdb /mnt/ost2

****************** OSS3 ********************************************
    mkfs.lustre --fsname lustre --ost --mgsnode=192.168.3.171@tcp0 /dev/sdb
    mkdir -p /mnt/ost3
    mount -t lustre /dev/sdb /mnt/ost3

I have mounted the clients on all OSS servers using the following commands:

****************** Client *********************************************
    mkdir -p /mnt/lustre
    mount -t lustre -o flock 192.168.3.171@tcp0:/lustre /mnt/lustre

As you can see, I have used -o flock on the client mount command.

Can you please advise? Have I missed something or configured it wrongly?
What are the best tools I can use to check why file locking is not working?

If you need more info, please let me know.

Regards
Darren George
On Thursday 21 August 2008 10:52:12 Darren George wrote:
> Hi,
> I am having problems with file locking or the lack of it.
> I'm setting up a lustre environment consisting of a co-lo MGS/MDS and 3
> OSS server. I am using CTDB and SAMBA to get windows to co-operate.
> I have configured the OSS server to also be clients in order to get
> CTDB/SAMBA working.
> All seems okay and working apart from file locking which is not working
> from a linux or windows prospective.

You provided quite a lot of information, but not a single piece of proof that locking doesn't work. How did you figure that out?

[...]

>
> I have mounted the clients on all OSS server using the following commands
                              ^^^^^^^^^^^^^^^^^^^^^^

What does "on all OSS server" mean? Are you using your servers also as Lustre clients? This might/will deadlock.

> ****************** Client *********************************************
> mkdir -p /mnt/lustre
> mount -t lustre -o flock 192.168.3.171@tcp0:/lustre /mnt/lustre
>
> As you can see I have used the -o flock on the client mount command.

Did you already try "-o localflock"?

Cheers,
Bernd

--
Bernd Schubert
Q-Leap Networks GmbH
Hello Darren,

On Thursday 21 August 2008 11:32:45 Darren George wrote:
> Hi Bernd,
>
> Thanks for your quick responce.
> I am indeed using the servers as lustre clients as well.
> I see the following from messages log on the server the windows client
> is connected to
>
> localhost kernel: LustreError: 32074:0:(file.c:2570:ll_file_flock()) LBUG
> Aug 21 17:06:12 localhost kernel: LustreError:
> 32074:0:(file.c:2569:ll_file_flock()) unknown fcntl lock type: 96

D'oh, what is type 96?

From include/asm-generic/fcntl.h:

    /* for posix fcntl() and lockf() */
    #ifndef F_RDLCK
    #define F_RDLCK 0
    #define F_WRLCK 1
    #define F_UNLCK 2
    #endif

Now you have run into a rather ugly habit of the Lustre developers: rather often they simply call LBUG(), even though the problem is not grave.

In lustre/llite/file.c:

    switch (file_lock->fl_type) {
    case F_RDLCK:
            einfo.ei_mode = LCK_PR;
            break;
    case F_UNLCK:
            /* An unlock request may or may not have any relation to
             * existing locks so we may not be able to pass a lock handle
             * via a normal ldlm_lock_cancel() request. The request may even
             * unlock a byte range in the middle of an existing lock. In
             * order to process an unlock request we need all of the same
             * information that is given with a normal read or write record
             * lock request. To avoid creating another ldlm unlock (cancel)
             * message we'll treat a LCK_NL flock request as an unlock. */
            einfo.ei_mode = LCK_NL;
            break;
    case F_WRLCK:
            einfo.ei_mode = LCK_PW;
            break;
    default:
            CERROR("unknown fcntl lock type: %d\n", file_lock->fl_type);
            LBUG();
    }

IMHO, instead of calling LBUG() here, the code should simply "return EINVAL". With the present code, it seems that whenever userspace sets a wrong struct flock l_type, it will trigger an LBUG(). I'm going to check this and will then file a Bugzilla entry.

Cheers,
Bernd

--
Bernd Schubert
Q-Leap Networks GmbH
On Aug 21, 2008 12:23 +0200, Bernd Schubert wrote:
> IHMO, instead of calling LBUG() here, simply "return EINVAL" should be done.
> So with the present code it seems whenever the userspace is setting a
> wrong struct flock l_type, it will trigger a LBUG(). I'm going to check this
> and then will fill in a bugzilla entry.

Yes, this is very old code, and it should be fixed. Ideally it would be fixed to handle this lock type properly, but EINVAL is definitely better than the LBUG. I believe there is already a bug open for this.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
On Thursday 21 August 2008 12:32:32 Andreas Dilger wrote:
> On Aug 21, 2008 12:23 +0200, Bernd Schubert wrote:
> > IHMO, instead of calling LBUG() here, simply "return EINVAL" should be
> > done. So with the present code it seems whenever the userspace is setting
> > a wrong struct flock l_type, it will trigger a LBUG(). I'm going to check
> > this and then will fill in a bugzilla entry.
>
> Yes, this is very old code, and it should be fixed. Ideally it would
> be fixed to handle this lock type properly, but EINVAL is definitely
> better than the LBUG. I believe there is already a bug open for this.

Hmm, but what is type 96? In binary it is "1100000", so maybe it is an endian problem and it should be "11"? But then what is "11"? If anything, it corresponds to 1 | 2, thus F_WRLCK | F_UNLCK. But does it make sense to set both at once?

Cheers,
Bernd

--
Bernd Schubert
Q-Leap Networks GmbH
On Thursday 21 August 2008 13:15:00 Bernd Schubert wrote:
> On Thursday 21 August 2008 12:32:32 Andreas Dilger wrote:
> > On Aug 21, 2008 12:23 +0200, Bernd Schubert wrote:
> > > IHMO, instead of calling LBUG() here, simply "return EINVAL" should be
> > > done. So with the present code it seems whenever the userspace is
> > > setting a wrong struct flock l_type, it will trigger a LBUG(). I'm
> > > going to check this and then will fill in a bugzilla entry.
> >
> > Yes, this is very old code, and it should be fixed. Ideally it would
> > be fixed to handle this lock type properly, but EINVAL is definitely
> > better than the LBUG. I believe there is already a bug open for this.
>
> Hmm, but what is type 96? In binary it is "1100000", so maybe an endian
> problem and it should "11"? But then what is "11", if at all corresponds to
> 1 | 2, thus F_WRLCK | F_UNLCK. But does it make sense to set both at once?

Ah, it is probably the flock call and not lockf/fcntl; I always get confused by the two different locking methods.
This is bug #5135 (https://bugzilla.lustre.org/show_bug.cgi?id=5135).

Cheers,
Bernd

--
Bernd Schubert
Q-Leap Networks GmbH
On Aug 21, 2008 13:15 +0200, Bernd Schubert wrote:
> On Thursday 21 August 2008 12:32:32 Andreas Dilger wrote:
> > On Aug 21, 2008 12:23 +0200, Bernd Schubert wrote:
> > > IHMO, instead of calling LBUG() here, simply "return EINVAL" should be
> > > done. So with the present code it seems whenever the userspace is setting
> > > a wrong struct flock l_type, it will trigger a LBUG(). I'm going to check
> > > this and then will fill in a bugzilla entry.
> >
> > Yes, this is very old code, and it should be fixed. Ideally it would
> > be fixed to handle this lock type properly, but EINVAL is definitely
> > better than the LBUG. I believe there is already a bug open for this.
>
> Hmm, but what is type 96? In binary it is "1100000", so maybe an endian
> problem and it should "11"? But then what is "11", if at all corresponds to
> 1 | 2, thus F_WRLCK | F_UNLCK. But does it make sense to set both at once?

My recollection is that it has something to do with BSD locking modes or similar. See https://bugzilla.lustre.org/show_bug.cgi?id=15920

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.