John Lange
2007-Mar-23 12:24 UTC
[Ocfs2-users] The ongoing mystery of the ocfs2 memory leak
If you have been watching this list you may have seen my postings about some kind of memory leak when using ocfs2. This problem is still not solved, and I'm hoping someone on the list can help us isolate the issue.

The circumstances are very strange. After much analysis and testing, what we have been able to figure out is that there is a 400 MB drop in free memory every day between 6:45am and 7:45am. This memory is never recovered, and after about 3-4 days the node starts killing processes (oom-killer) until it self-destructs.

Now you are probably thinking (as we were) that this is some kind of cron job that kicks in at that time and causes the problem, but that is not the case. For one thing, daily cron does not run at that time. Secondly, we logged all processes to a file every 15 minutes and compared what was running before the memory loss to what was running during and after it, and there is nothing new running! And when we analyze the slabinfo for the same period, nothing takes a corresponding (400 MB) jump in size.

So where the heck is our memory going?!?

Does anyone have a clue how we can diagnose this? Currently we are capturing vmstat, slabinfo, and a full process list at 15-minute intervals. Is there anything else we could be logging?

Thanks,
John Lange
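One more thing that could be logged alongside vmstat and slabinfo is /proc/meminfo itself. The sketch below is only an illustration, not something from the original posts: it parses two saved /proc/meminfo snapshots, lists the fields that changed the most between them, and computes a rough "unaccounted" figure (MemTotal minus the fields that are listed). The function names and the list of accounted fields are assumptions, and the fields available differ between kernel versions, so treat the result as a hint rather than an exact number.

```python
import re

# Fields assumed to account for "explained" memory. This list is an
# assumption; older kernels do not report all of them (missing ones count as 0).
ACCOUNTED_FIELDS = ("MemFree", "Buffers", "Cached", "Slab", "PageTables", "AnonPages")


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field: value}.

    Values are in kB for most fields; a few (e.g. HugePages_Total) are counts.
    """
    values = {}
    for line in text.splitlines():
        match = re.match(r"^(\S+):\s+(\d+)(?:\s+kB)?\s*$", line)
        if match:
            values[match.group(1)] = int(match.group(2))
    return values


def diff_snapshots(before, after):
    """Return [(field, delta)] for changed fields, largest absolute change first."""
    deltas = [
        (field, after[field] - before[field])
        for field in before
        if field in after and after[field] != before[field]
    ]
    return sorted(deltas, key=lambda item: abs(item[1]), reverse=True)


def unaccounted_kb(info, fields=ACCOUNTED_FIELDS):
    """Rough amount of memory (kB) not explained by the given fields."""
    return info["MemTotal"] - sum(info.get(field, 0) for field in fields)
```

To use it, save `cat /proc/meminfo` to a file every 15 minutes with the other captures, then feed the 6:45am and 7:45am files through `parse_meminfo` and `diff_snapshots`. If the "unaccounted" figure grows by roughly 400 MB while no slab cache and no process does, that would point toward kernel allocations made outside the slab allocator.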
Brian Sieler
2007-Apr-08 22:48 UTC
[Ocfs2-users] The ongoing mystery of the ocfs2 memory leak
John,

This may or may not help:

Swap memory was being consumed on my "main" node every day between 3 and 4 PM. Memory and swap would be completely depleted after about 2 1/2 days and the node would crash. (I'm running a 2-node OCFS2 cluster with Oracle RAC on 2.6.9-34.0.2.ELsmp, RHEL 4.0.)

Thinking it was heavy OCFS2 file system activity, I moved 95% of the file system activity off the node to an ext3 filesystem on a different server. The problem persisted on the OCFS2 node in the same predictable manner.

We have a web application that connects to the database and uses a particular config setting to remove abandoned db connections. Once I removed that setting I stopped getting the predictable afternoon drain, and the node is more stable.

Now I'm on a slower bleed. Swap is still being consumed and apparently never released, about 100MB/day, even on non-business, low-activity days.

I'm shuffling processes and settings on my two nodes to try to isolate the problem. I'm no longer convinced it's OCFS2 doing the leaking.

One setting I'm looking at is /proc/sys/vm/swappiness, though I've read from some other folks that it was ineffective at limiting the swapping on RHEL.

If anyone has any hints or suggestions, by all means...
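The slow bleed described above (about 100MB/day of swap) can be quantified from periodic samples. The following is a minimal sketch under stated assumptions, not a tool from the original thread: it takes (timestamp, swap used in kB) pairs, fits a least-squares line to estimate growth per day, and projects how long until swap is exhausted. The function names are hypothetical; swap used is taken as SwapTotal minus SwapFree from /proc/meminfo.

```python
SECONDS_PER_DAY = 86400


def swap_used_kb(info):
    """Swap in use (kB), given a dict holding /proc/meminfo's SwapTotal and SwapFree."""
    return info["SwapTotal"] - info["SwapFree"]


def growth_per_day(samples):
    """Least-squares growth rate in kB/day.

    samples: list of (epoch_seconds, used_kb) pairs taken at different times.
    """
    count = len(samples)
    if count < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / count
    mean_u = sum(u for _, u in samples) / count
    numerator = sum((t - mean_t) * (u - mean_u) for t, u in samples)
    denominator = sum((t - mean_t) ** 2 for t, _ in samples)
    if denominator == 0:
        raise ValueError("samples must span more than one point in time")
    return numerator / denominator * SECONDS_PER_DAY


def days_until_exhausted(samples, swap_total_kb):
    """Projected days until swap is full, or None if usage is not growing."""
    rate = growth_per_day(samples)
    if rate <= 0:
        return None
    latest_used = samples[-1][1]
    return (swap_total_kb - latest_used) / rate
```

A straight-line fit is a deliberate simplification. It suits a steady daily drain, but a once-a-day step such as the afternoon drop would be better examined by comparing the samples taken just before and just after that window.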