I have a 2-node cluster of HP DL380 G4s. These machines are attached via SCSI to an external HP disk enclosure. They run 32-bit RH AS 4.0 and OCFS2 1.2.4, the latest release. They were upgraded from 1.2.3 only a few days after 1.2.4 was released. I had reported on the mailing list that my developers were happy and things seemed faster. However, twice in that time the cluster has gone down because the kernel OOM killer killed processes; ASR then kicks in and eventually reboots the box.

I am also starting to notice some directory corruption, and errors like this in /var/log/messages:

Feb 18 04:14:37 cyber1 kernel: (23693,1):ocfs2_check_dir_entry:1703 ERROR: bad entry in directory #101726961: rec_len % 4 != 0 - offset=0, inode=3484598105688391, rec_len=18, name_len=128

Sometimes I can't delete a directory: it tells me it's not empty, even though it is.

What could this be? I was hoping that OCFS2 1.2.4 would have fixed the out-of-memory problems, but it looks like I still run into them. What information can I provide that will help?
Start monitoring /proc/slabinfo and /proc/meminfo. Dump them to a file every 5-10 minutes. Which version of the RHEL4 kernel are you on (uname -a)?

_______________________________________________
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users
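A minimal sketch of the periodic dump suggested above; the log path, interval, and script name are arbitrary choices, not anything prescribed in the thread:

```shell
#!/bin/sh
# Append slab and memory statistics to a log every 5 minutes, so the
# growth of any kernel cache can be tracked between OOM incidents.
LOG=/var/log/meminfo-watch.log

while true; do
    {
        date
        echo '--- /proc/meminfo ---'
        cat /proc/meminfo
        echo '--- /proc/slabinfo ---'
        cat /proc/slabinfo
    } >> "$LOG"
    sleep 300
done
```

Run it in the background (or trigger the dump from cron) and compare successive snapshots to see which slab caches keep growing.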
It seems that OCFS2 has an unfixed memory leak even in the most recent version. I hope to make a more detailed bug report on Monday.

John

On Fri, 2007-02-23 at 15:31 -0800, Luis Freitas wrote:
> Hi,
>
> This is a bit off topic; I hope that is not a problem.
>
> Is anyone out there experiencing heavy swapping while the kernel
> retains a large amount of buffers? This used to be a problem on 2.4,
> and I usually changed /proc/sys/vm/freepages to fix it, but on 2.6
> that parameter no longer exists.
>
> One of the servers here is holding over 3.5 GB of cache even while
> using over 700 MB of swap, and free memory is always low.
>
> [oracle@br001sv0432 ~]$ free
>              total       used       free     shared    buffers     cached
> Mem:       5190736    4810880     379856          0     143032    3583868
> -/+ buffers/cache:    1083980    4106756
> Swap:      2048248     723064    1325184
> [oracle@br001sv0432 ~]$
>
> I am tuning /proc/sys/vm/swappiness, but this seems to have no effect
> at all: I changed it from 60 to 10 and saw no difference. The server
> runs Oracle RAC with OCFS2.
>
> Regards,
> Luis
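For reference, a sketch of inspecting and changing the tunable Luis mentions; the value 10 is simply the one he tried, not a recommendation from this thread, and the write requires root, so it is shown commented out:

```shell
# Show the current VM swappiness setting (the RHEL4 / 2.6 default is 60).
cat /proc/sys/vm/swappiness

# To lower it (as root), either of:
#   sysctl -w vm.swappiness=10
#   echo 10 > /proc/sys/vm/swappiness

# Then watch whether the buffers/cache vs. swap balance actually moves.
free
```

Note that on 2.6 a low swappiness only biases the reclaim path toward dropping page cache instead of swapping anonymous pages; it does not enforce a free-page floor the way the old /proc/sys/vm/freepages did.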