I've been having terrible problems with ocfs2 getting corrupted. (Of course this is after I said on this list a couple of months ago that I've been using it for a while without issues!)

I have two sets of SLES11 servers, each sharing its own ocfs2 volume. I started having problems with the original set of servers and opened a ticket with Novell. They wanted me to completely update the systems. Since they were running critical VMs I didn't feel comfortable doing that, so I installed two more servers with their own ocfs2 volume. These two I then patched completely. Unfortunately, these two servers started exhibiting their own corruption problems. Just copying my virtual disk files and running some VMs would cause the ocfs2 volume to get corrupted. Right now it's at a point where I can't even fix it with fsck.ocfs2. I'm told the ticket has been escalated to the ocfs2 devs.

Earlier this week I had a problem with one of the original servers and had to hard restart it. This is a problem I've always had with xen after it runs for a long time: sometimes it will have memory allocation issues, can't start VMs, etc. Worse yet, there's no way to restart it nicely, because VMs will not shut down and you can't get on the console or ssh in to shut down the server cleanly. The only option (that I know of) is to hard reset the box. Of course this can have side effects. In this case everything came back up ok, but I could see there was corruption.

I asked Novell and they said I should unmount the volumes, run fsck.ocfs2 and make sure it's clean, then restart everything. This was on Monday, and since my critical machines were up and running, I couldn't afford to have them down right then. So I decided 2AM this morning was a good time to take these systems down, run the fsck and then get them back up. I thought this would be fairly simple, take 30-60 mins, and get things stable for a while longer while we work on the ocfs2 issue with Novell.
Unfortunately, after running fsck.ocfs2 and making sure it was clean, my VMs would not all come back up. I could get 4 or 5 of them up, but not the rest. After an unhealthy and very stressful investigation I found that the ocfs2 volume is going read only. I'm waiting for a call back from Novell right now. It seems that once my ocfs2 volume gets corrupted there's no way to fix it or make it stable again.

Our storage is on a Xiotech Magnitude 4000 3D. Each xen server is assigned the same vdisk that is used for ocfs2. We use file based disks for our VMs. Performance wise this does the job for us. It makes them very easy to move around, copy for new VMs, etc.

What other options should I look at besides ocfs2?

I also have a call in to our xiotech admin to create me a new disk that I can assign directly to my server (one for each) so I can copy my VMs and get them up and running, just in case Novell is not able to get a resolution for me. I'm confident that the VMs will be stable once they are running on "local" storage.

Sorry this got so long, but I don't think I can take much more stress around the stability of my xen servers. I've also looked at XenServer, which seems to be really stable and has nice features, but you also lose a lot of portability. Hard for me to explain, but on sles/xen it's incredibly easy to create sles VMs. It's also nice to be able to mount disk files if needed, copy them, etc.

If anyone gets this far into the message I'd appreciate any suggestions.

Thanks a lot,
James

_______________________________________________
Xen-users mailing list
Xen-users@lists.xensource.com
http://lists.xensource.com/xen-users
James Pifer
2010-Feb-28 15:56 UTC
Re: [Xen-users] nothing but problems with ocfs2 on sles11
Wanted to give an update on this. It turns out I had one VM that was really causing the problem. Whenever I tried to start it, ocfs2 would decide there's a problem with it and make the whole volume read only. This would essentially hose both servers.

I copied that single VM to a local disk and I was able to start it up. It's a sles guest, and sles determined there was a problem and ran fsck. Thankfully it ran the check and finished booting.

Novell wants me to make a full backup of everything, remove and recreate the ocfs2 volume, then restore everything. They feel it's a problem with the disk. The disk is coming from a Xiotech SAN, which reports no disk problems. I'm not a disk or storage expert so I'm not sure where the real problem is.

Considering the problems I've had on both sets of servers using ocfs2, I'm going to use local disks on each server for a while and split up the load. I lose all my portability, but right now I need stability.

If anyone has any suggestions I'm all eyes/ears.

Thanks,
James
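
For anyone chasing the same symptom, one way to spot the volume flipping to read only is to look at the mount options the kernel reports. This is only a minimal sketch, assuming the usual Linux /proc/mounts layout (device, mount point, filesystem type, options); the function name readonly_ocfs2_mounts is made up for illustration and is not part of any ocfs2 tool.

```python
def readonly_ocfs2_mounts(mounts_text):
    """Return the mount points of ocfs2 filesystems that are mounted read-only.

    mounts_text is expected to be the contents of /proc/mounts (an assumption
    of this sketch), where each line is:
        device mount_point fs_type options dump pass
    """
    readonly = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point, fs_type, options = fields[1], fields[2], fields[3]
        # "ro" appears as its own comma-separated option on a read-only mount
        if fs_type == "ocfs2" and "ro" in options.split(","):
            readonly.append(mount_point)
    return readonly
```

On a xen host this could be fed with open("/proc/mounts").read() from a cron job or monitoring check, so the read-only switch shows up before VMs start failing. It only reports the state; it does not say why ocfs2 decided to remount the volume.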