Hi,
I'm getting system (and eventually cluster) crashes on intensive disk
writes on Ubuntu Server 10.04 with my OCFS2 file system.
I have an iSER (InfiniBand) backed shared disk array with OCFS2 on it.
There are 6 nodes in the cluster, and the heartbeat interface is over a
regular 1GigE connection. Originally, the problem presented itself while
I was doing performance testing and it's been reproducible ever since.
Running something like
'dd if=/dev/zero of=/<ocfs2 array>/zeroes bs=64k count=100000'
kills the node almost immediately, and then hangs the rest of the
cluster when other nodes try to unmount the array (for a restart or
any other reason). This happens regardless of how many nodes are
running; I've tried with a single node and it still happens.
I was lucky enough to capture some messages from stderr that weren't
being caught by syslog. I've attached them here as a screenshot, since
my management interface doesn't allow copying or pasting text directly.
Please take a look: http://img163.imageshack.us/img163/4771/screenshots.png
Note that no other nodes are started up, and I have no idea
how there could be another node "heartbeating" in the same slot.
I should also note that I originally had the heartbeat configured on the
same InfiniBand interface, so I thought the iSER traffic was crowding
out the heartbeat. However, moving the heartbeat to another interface
didn't solve the problem. I'm also fairly certain the iSER interface
itself isn't the cause, because I have formatted the array as ext4 and
successfully run read/write tests (from one node at a time, of course).
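In case it helps, here is the shape of my current /etc/ocfs2/cluster.conf
after moving the heartbeat network onto the 1GigE interface. The names,
addresses, and port below are placeholders standing in for my real values
(7777 is just the stock O2CB port), not the actual config:

```
cluster:
	node_count = 6
	name = ocfs2

node:
	ip_port = 7777
	ip_address = 192.168.1.10
	number = 0
	name = node0
	cluster = ocfs2
```

Five more node stanzas follow, one per cluster member, each with a unique
number and that node's 1GigE address.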
Thanks in advance for any replies,
Matt