I'm setting up an HA FTP server (amongst other services).

When two connections happen simultaneously and, more specifically, the same user from two IPs attempts to access the same file (one for reading and one for writing), both processes hang, and all subsequent attempts to either read or write the file fail.

The two processes that seem to have caused the lock:

user 24139 1657 Thu Apr 1 18:25:01 2010 proftpd: cbs - ::ffff:xxx.yyy.0.253: RETR prim_wo_img_dom.obs
user 24142 1657 Thu Apr 1 18:25:01 2010 proftpd: cbs - ::ffff:xxx.yyy.103.208: STOR prim_wo_img_dom.obs

(There are 49 other processes trying to do the same things, but these are the first ones.)

I'm more than happy to provide any information needed on this issue:

OS:
CentOS release 5.4 (Final)

uname -a:
Linux prtftp01<omitted> 2.6.18-164.11.1.el5 #1 SMP Wed Jan 20 07:32:21 EST 2010 x86_64 x86_64 x86_64 GNU/Linux

ocfs2 version 1.4.4

At the moment, only one host is actively serving FTP at any time. I can fail the services back and forth as needed.

--Jason
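For anyone who wants to spot the same read/write collision in a longer process list, here is a rough Python sketch. It assumes every line ends in the `proftpd: <user> - <client>: <COMMAND> <file>` layout seen in the two lines quoted above; the function name `find_conflicts` and the regular expression are illustrative only and may need adjusting for other proftpd versions.

```python
import re
from collections import defaultdict

# Assumed layout, inferred from the two process lines in the report:
#   ... proftpd: <user> - <client address>: <FTP command> <file name>
LINE_RE = re.compile(
    r"proftpd: (?P<user>\S+) - (?P<client>\S+): (?P<cmd>[A-Z]+) (?P<path>.+)$"
)


def find_conflicts(lines):
    """Return {file: {command: [clients]}} for files seen with both RETR and STOR."""
    by_file = defaultdict(lambda: defaultdict(list))
    for line in lines:
        match = LINE_RE.search(line.strip())
        if match:
            by_file[match["path"]][match["cmd"]].append(match["client"])
    # Keep only files that have at least one reader and one writer.
    return {
        path: dict(cmds)
        for path, cmds in by_file.items()
        if "RETR" in cmds and "STOR" in cmds
    }


# The two lines from the report, used here as sample input.
sample = [
    "user 24139 1657 Thu Apr 1 18:25:01 2010 proftpd: cbs - "
    "::ffff:xxx.yyy.0.253: RETR prim_wo_img_dom.obs",
    "user 24142 1657 Thu Apr 1 18:25:01 2010 proftpd: cbs - "
    "::ffff:xxx.yyy.103.208: STOR prim_wo_img_dom.obs",
]
print(find_conflicts(sample))
```

Feeding it the full list of 51 processes would show whether every hung session involves the same file or whether other files are affected as well.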
To add further information:

Node A:

# cat /sys/kernel/debug/o2dlm/6D419D86AE8A4DB1940788EDDA27027B/dlm_state
Domain: 6D419D86AE8A4DB1940788EDDA27027B  Key: 0xc955c1d5
Thread Pid: 3869  Node: 1  State: JOINED
Number of Joins: 1  Joining Node: 255
Domain Map: 1 2
Live Map: 1 2
Lock Resources: 70731 (442210)
MLEs: 0 (1048380)
  Blocking: 0 (647669)
  Mastery: 0 (400711)
  Migration: 0 (0)
Lists: Dirty=Empty  Purge=Empty  PendingASTs=Empty  PendingBASTs=Empty
Purge Count: 0  Refs: 70732
Dead Node: 255
Recovery Pid: 3870  Master: 255  State: INACTIVE
Recovery Map:
Recovery Node State:

Node B:

# cat /sys/kernel/debug/o2dlm/6D419D86AE8A4DB1940788EDDA27027B/dlm_state
Domain: 6D419D86AE8A4DB1940788EDDA27027B  Key: 0xc955c1d5
Thread Pid: 3757  Node: 2  State: JOINED
Number of Joins: 1  Joining Node: 255
Domain Map: 1 2
Live Map: 1 2
Lock Resources: 48113 (50521)
MLEs: 0 (85510)
  Blocking: 0 (35121)
  Mastery: 0 (50389)
  Migration: 0 (0)
Lists: Dirty=Empty  Purge=Empty  PendingASTs=Empty  PendingBASTs=Empty
Purge Count: 0  Refs: 48114
Dead Node: 255
Recovery Pid: 3758  Master: 255  State: INACTIVE
Recovery Map:
Recovery Node State:

There are apparently no busy locks, as shown by:

# debugfs.ocfs2 -R "fs_locks -B" /dev/sda1
#

I am unable to kill any of these processes, even with kill -9.

# cat /etc/ocfs2/cluster.conf
cluster:
        node_count = 2
        name = ocfs2ftpcluster

node:
        ip_port = 7777
        ip_address = 192.168.0.1
        number = 1
        name = prtftp01
        cluster = ocfs2ftpcluster

node:
        ip_port = 7777
        ip_address = 192.168.0.2
        number = 2
        name = prtftp02
        cluster = ocfs2ftpcluster

If you'd like the output of:

# debugfs.ocfs2 -R "fs_locks" /dev/sda1 | wc -l
768681

I can give it, but it's a lot of output.

--Jason

On Fri, Apr 2, 2010 at 11:38 AM, Jason Price <japrice at gmail.com> wrote:
> I'm setting up an HA ftp server (amongst other services).
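Comparing the two dlm_state dumps by eye is error-prone, so here is a small Python sketch that pulls out a few fields for a side-by-side look. It assumes the field names and ordering shown in the dumps above; `summarize_dlm_state` is a made-up helper, and the meaning of the number in parentheses after "Lock Resources" is not established in this thread, so the code only reports it without interpreting it.

```python
import re


def summarize_dlm_state(text):
    """Extract a few fields from an o2dlm dlm_state dump (layout assumed from this thread)."""
    node = re.search(r"Thread Pid: \d+\s+Node: (\d+)\s+State: (\w+)", text)
    res = re.search(r"Lock Resources: (\d+) \((\d+)\)", text)
    if node is None or res is None:
        raise ValueError("text does not look like a dlm_state dump")
    # "Purge Count:" is not matched here because the pattern requires "=".
    lists = dict(re.findall(r"(Dirty|Purge|PendingASTs|PendingBASTs)=(\w+)", text))
    return {
        "node": int(node.group(1)),
        "state": node.group(2),
        "lock_resources": int(res.group(1)),
        "lock_resources_paren": int(res.group(2)),  # meaning not confirmed here
        "nonempty_lists": sorted(k for k, v in lists.items() if v != "Empty"),
    }


# Excerpt of the Node A dump from this message.
node_a = """Domain: 6D419D86AE8A4DB1940788EDDA27027B  Key: 0xc955c1d5
Thread Pid: 3869  Node: 1  State: JOINED
Lock Resources: 70731 (442210)
Lists: Dirty=Empty  Purge=Empty  PendingASTs=Empty  PendingBASTs=Empty
Purge Count: 0  Refs: 70732
"""
print(summarize_dlm_state(node_a))
```

Because the patterns use `\s+` between fields, the same function also works on a dump whose line breaks were lost when it was pasted into mail.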
FWIW, I have seen a similar problem here on occasion, but with vsftpd instead. When I run `ps -e -o pid,stat,comm,wchan=WIDE-WCHAN-COLUMN` I usually see one node with a single vsftpd in D (uninterruptible I/O) state, and multiple vsftpd processes on the other node, presumably waiting for the resource.

I also believe this happens when multiple processes are trying to read and write the same file via FTP. If left alone for a bit, other programs that read the same file will hang waiting as well. Mine are typically not busy waits, though I have seen a couple that were.

Sometimes I find that everything has cleared and is back to normal after a short while (a timeout somewhere, perhaps?). Usually, though, the only solution is to reboot one or both nodes, which I have to instigate via kernel panic/self-fence because a normal shutdown also gets caught up by the unkillable processes.

I need to get a netconsole set up to capture some output for the next time so that I can add it to the bugzilla as well.

At 10:52 AM 4/2/2010, Jason Price wrote:
>Message: 1
>Date: Fri, 2 Apr 2010 11:38:24 -0400
>From: Jason Price <japrice at gmail.com>
>Subject: [Ocfs2-users] Ftp server... single file seems locked
>To: ocfs2-users at oss.oracle.com
>Message-ID:
>        <p2r83f15e31004020838o961f478cg19ae4f403631764 at mail.gmail.com>
>Content-Type: text/plain; charset="iso-8859-1"
>
>I'm setting up an HA ftp server (amongst other services).
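The `ps` check described in this reply can be scripted so that it is easy to run on both nodes and compare. The Python sketch below is only an illustration: it assumes the four-column output of `ps -e -o pid,stat,comm,wchan=WIDE-WCHAN-COLUMN` with command names that contain no spaces, `d_state_processes` and `read_ps` are hypothetical helper names, and the sample text (PIDs and the wait-channel name) is invented for the example rather than captured from a real node.

```python
import subprocess


def d_state_processes(ps_text):
    """Return (pid, command, wchan) for every process whose STAT starts with 'D'."""
    rows = []
    for line in ps_text.splitlines()[1:]:  # first line is the column header
        parts = line.split(None, 3)
        if len(parts) < 3:
            continue  # blank or malformed line
        pid, stat, comm = parts[:3]
        wchan = parts[3].strip() if len(parts) > 3 else "-"
        if stat.startswith("D"):
            rows.append((int(pid), comm, wchan))
    return rows


def read_ps():
    """Run the ps command from the message above and return its output as text."""
    cmd = ["ps", "-e", "-o", "pid,stat,comm,wchan=WIDE-WCHAN-COLUMN"]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


# Invented sample output, for illustration only.
SAMPLE = """  PID STAT COMMAND         WIDE-WCHAN-COLUMN
 2210 Ss   sshd            -
 4101 D    vsftpd          example_wait_fn
 4102 Ss   vsftpd          -
"""
for pid, comm, wchan in d_state_processes(SAMPLE):
    print(pid, comm, wchan)
```

On a live node, `d_state_processes(read_ps())` would list the stuck processes; the wchan column is the part worth attaching to a bug report, since it names the kernel function each process is sleeping in.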