Christian Schlittchen
2006-Oct-09 08:46 UTC
[Ocfs2-users] Stability issues with ocfs2 on a mailserver cluster
While trying to build a large-scale mail server cluster (about 40000 users, currently 3 blades: blade01, blade02, blade03), we've run into two major issues with ocfs2.

First, one of the cluster machines crashes at irregular intervals. The kernel log shows the following:

Oct 9 15:51:31 blade01 kernel: (27817,1):o2net_send_tcp_msg:800 ERROR: sendmsg returned -104 instead of 96
Oct 9 15:51:31 blade01 kernel: o2net: no longer connected to node blade03 (num 2) at 134.102.20.127:7777
Oct 9 15:51:31 blade01 kernel: (2319,3):ocfs2_broadcast_vote:713 ERROR: status = -112
Oct 9 15:51:31 blade01 kernel: (2319,3):ocfs2_do_request_vote:786 ERROR: status = -112
Oct 9 15:51:31 blade01 kernel: (2319,3):ocfs2_query_inode_wipe:766 ERROR: status = -112

From this point on there are many different error messages from ocfs2-related kernel functions. All processes trying to access the cluster on blade01 hang, and a few seconds later the OOM killer starts killing processes. The strange thing is that blade03 and blade02 carry on just fine: there is no obvious problem with blade03, and blade02 and blade03 stay connected. blade01's network connectivity is fine, too. We are running vanilla 2.6.18 kernels; the ocfs2 filesystem is on an FC-connected storage server. Any help on this would be appreciated very much.

Another problem we have detected is with the vacation program. Vacation, when called with the parameter -i, initializes the vacation database. On an ocfs2 filesystem this gives inconsistent results. Sometimes the initialization works fine, but sometimes it creates a damaged db file. This happens in roughly 50% of all attempts to initialize the database, with no pattern we can see. We have tried several versions of the Berkeley DB libraries, from version 3 to 4.4, but the problem remained. vacation works fine on non-ocfs2, non-clustered filesystems.
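The negative numbers in the log lines above have the shape of negated Linux errno values; in the Linux errno table, 104 is ECONNRESET and 112 is EHOSTDOWN. The following is a minimal sketch for extracting and naming them, assuming the log format shown in this message. The helper `decode_log_line` and the small lookup table are illustrative names introduced here, not part of ocfs2 or its tools.

```python
import re

# Linux errno values that appear in the log above (assumption: the kernel
# reports negated Linux errno codes). Hard-coded so the result does not
# depend on the platform running this script.
LINUX_ERRNO = {
    104: ("ECONNRESET", "Connection reset by peer"),
    112: ("EHOSTDOWN", "Host is down"),
}

# Matches "status = -112" and "sendmsg returned -104".
_CODE_RE = re.compile(r"(?:status = |returned )-(\d+)")
# Matches the "(number,number):function:line" prefix seen in the log lines.
_FUNC_RE = re.compile(r"\(\d+,\d+\):(\w+):\d+")


def decode_log_line(line):
    """Return (function, errno_name, description), or None if no code is found."""
    code = _CODE_RE.search(line)
    if code is None:
        return None
    func = _FUNC_RE.search(line)
    name, text = LINUX_ERRNO.get(int(code.group(1)), ("UNKNOWN", "not in table"))
    return (func.group(1) if func else "?", name, text)


if __name__ == "__main__":
    sample = [
        "Oct 9 15:51:31 blade01 kernel: (27817,1):o2net_send_tcp_msg:800 "
        "ERROR: sendmsg returned -104 instead of 96",
        "Oct 9 15:51:31 blade01 kernel: (2319,3):ocfs2_broadcast_vote:713 "
        "ERROR: status = -112",
    ]
    for entry in sample:
        print(decode_log_line(entry))
```

Read this way, the first line would mean the TCP connection to blade03 was reset, and the later vote errors would follow from the node being treated as down. That reading is an interpretation of the codes, not something the log states directly.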
Mark Fasheh
2006-Oct-10 16:40 UTC
[Ocfs2-users] Stability issues with ocfs2 on a mailserver cluster
Hi,

On Mon, Oct 09, 2006 at 05:46:10PM +0200, Christian Schlittchen wrote:
> Another problem we have detected is with the vacation program. Vacation,
> when called with the parameter -i, initializes the vacation database.
> On an ocfs2-Filesystem this gives inconsistent results. Sometimes the
> initialization works fine, but sometimes it creates a damaged db-file.
> This happens in roughly 50% of all attempts to initialize the database
> but seems totally random.

Can you apply the following patches (in order) to your kernel tree and rebuild? A bug in our extend zeroing code seems to have crept in recently, and these patches fix it.

http://kernel.org/pub/linux/kernel/people/mfasheh/ocfs2/backports/2.6.18/

--Mark

--
Mark Fasheh
Senior Software Developer, Oracle
mark.fasheh@oracle.com
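The reply above points at the extend zeroing code, that is, the code that makes newly extended regions of a file read back as zeros. A rough check for that property can be run against any directory, including one on the ocfs2 mount. It is only a sketch: the thread does not say how the bug manifests, so treating it as non-zero bytes in a gap left by writing past end-of-file is an assumption made here, and `hole_reads_as_zeros` and its parameters are hypothetical names.

```python
import os
import tempfile


def hole_reads_as_zeros(directory, gap=64 * 1024, payload=b"end"):
    """Write `payload` at offset `gap` in a new file; check the gap is all zeros.

    Assumption: a zeroing problem would show up as non-zero bytes in the
    region [0, gap), which was never written explicitly.
    """
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "r+b") as f:
            f.seek(gap)           # move past end-of-file without writing
            f.write(payload)      # extends the file, leaving a gap before it
            f.flush()
            os.fsync(f.fileno())  # push the data out to storage
            f.seek(0)
            data = f.read(gap)
        return len(data) == gap and data == bytes(gap)
    finally:
        os.remove(path)


if __name__ == "__main__":
    # Replace the temp directory with a directory on the filesystem under test.
    print(hole_reads_as_zeros(tempfile.gettempdir()))
```

A read on the same node may be served from cache, so a more convincing test would write on one node and read the file back on another, before and after applying the patches.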