I've been on vacation (and still am) but had to work on a couple of problems. I have a couple of Citrix servers that are domUs on a sles10sp2 server that has local storage and connects to two ocfs2 volumes.

I tried to restart one of the Citrix servers and it would not restart, giving an error that the disk was already mounted in a loopback, etc. I looked at mount and didn't see anything mounted, and I had just shut the domU down. I assumed it had not shut down completely. This domU runs from the local disk.

So I decided a restart of the host was in order. I shut down the rest of the domUs, including an oracle server running off one of the ocfs2 clusters. This server had been in use from this location for the last three weeks.

After restarting dom0 I started bringing the domUs back up. All of them came back up fine, except for the oracle server. It gave an error that the disk files did not exist, and they don't; they aren't there anymore.

I checked and double-checked history to see if any rm commands had been given and I didn't find any.

When I restarted, there was an error on one of the local file systems that said "JDB: barrier-based sync failed...".

Luckily I have a copy of this domU from a few weeks ago, BEFORE I copied it to the ocfs2 volume. What could explain the sudden deletion of a directory like this?

If this happened on some of the other domUs it could be ugly.

Any advice is appreciated!!!!

James
_______________________________________________
Xen-users mailing list
Xen-users@lists.xensource.com
http://lists.xensource.com/xen-users
On Wed, Dec 30, 2009 at 11:32:01PM -0500, James Pifer wrote:
> I've been on vacation (and still am) but had to work on a couple of
> problems. I have a couple of Citrix servers that are domUs on a sles10sp2
> server that has local storage and connects to two ocfs2 volumes.
>
> I tried to restart one of the Citrix servers and it would not restart,
> giving an error that the disk was already mounted in a loopback, etc. I
> looked at mount and didn't see anything mounted and I had just shut the
> domU down. I assumed it had not shut down completely. This domU runs
> from the local disk.
>
> So I decided a restart of the host was in order. I shut down the rest of
> the domUs, including an oracle server running off one of the ocfs2
> clusters. This server had been in use from this location for the last
> three weeks.
>
> After restarting dom0 I started bringing the domUs back up. All of them
> came back up fine, except for the oracle server. It gave an error that
> the disk files did not exist, and they don't; they aren't there
> anymore.
>
> I checked and double-checked history to see if any rm commands had been
> given and I didn't find any.
>
> When I restarted, there was an error on one of the local file systems
> that said "JDB: barrier-based sync failed...".
>
> Luckily I have a copy of this domU from a few weeks ago, BEFORE I copied
> it to the ocfs2 volume. What could explain the sudden deletion of a
> directory like this?
>
> If this happened on some of the other domUs it could be ugly.
>
> Any advice is appreciated!!!!
>

Sounds like a problem with OCFS2.

This is exactly the reason why I don't like storing VM disk images on a filesystem - fsck or this kind of weird filesystem error can completely f*ck up the disk images.

I suggest LVM for guest disks.

Sorry, I can't really help with the problem. Did you try fsck?
-- Pasi
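
As a rough illustration of the suggestion above, the difference between a file-backed guest disk and an LVM-backed one shows up in the disk line of the domU config. The fragment below is a sketch only: xm config files use Python syntax, and the mount point, volume group, logical volume and device names are all hypothetical and not taken from this thread.

```python
# Hypothetical fragment of an xm domU config file (these files use Python syntax).
# Every path and name below is made up for illustration.

# File-backed image on a mounted filesystem, attached via the loop driver:
# disk = ['file:/mnt/ocfs2/oracle/disk0,xvda,w']

# The same image attached through blktap:
# disk = ['tap:aio:/mnt/ocfs2/oracle/disk0,xvda,w']

# Block device (for example an LVM logical volume) handed to the guest directly:
disk = ['phy:/dev/vg_guests/oracle_disk0,xvda,w']
```

With a phy: device there is no image file sitting in a directory on the shared filesystem, so there is nothing there for an rm or an fsck to remove. The logical volume still has to be created and backed up separately.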
James Pifer wrote:
> I've been on vacation (and still am) but had to work on a couple of
> problems. I have a couple of Citrix servers that are domUs on a sles10sp2
> server that has local storage and connects to two ocfs2 volumes.
>
> I tried to restart one of the Citrix servers and it would not restart,
> giving an error that the disk was already mounted in a loopback, etc. I
> looked at mount and didn't see anything mounted and I had just shut the
> domU down. I assumed it had not shut down completely. This domU runs
> from the local disk.
>
> So I decided a restart of the host was in order. I shut down the rest of
> the domUs, including an oracle server running off one of the ocfs2
> clusters. This server had been in use from this location for the last
> three weeks.
>
> After restarting dom0 I started bringing the domUs back up. All of them
> came back up fine, except for the oracle server. It gave an error that
> the disk files did not exist, and they don't; they aren't there
> anymore.
>
> I checked and double-checked history to see if any rm commands had been
> given and I didn't find any.
>
> When I restarted, there was an error on one of the local file systems
> that said "JDB: barrier-based sync failed...".
>
> Luckily I have a copy of this domU from a few weeks ago, BEFORE I copied
> it to the ocfs2 volume. What could explain the sudden deletion of a
> directory like this?
>
> If this happened on some of the other domUs it could be ugly.
>
> Any advice is appreciated!!!!
>
> James

I am afraid that I don't have advice either, but I'd like to second the recommendation of not using filesystems to store VMs, if only for performance reasons. I found that dedicated partitions made a world of performance difference.

A note on LVM though: if you cluster it, it won't do LVM snapshots. I found this out late in the game, and had to create secondary partitions in the actual VM to use for snapshotting.
I got lucky that I had the space to do this, as I hadn't planned for it. That said, it's been a so-far successful setup.

Best of luck, and please do share any solution you find.

Madi
> A note on LVM though; if you cluster it, it won't do LVM snapshots. I
> found this out late in the game, and had to create secondary partitions
> in the actual VM to use for snapshotting. I got lucky that I had the
> space to do this, as I hadn't planned for it. That said, it's been a
> so-far successful setup.
>

Is there any word on clustered LVM and snapshotting? Is it ever going to happen or are there fundamental reasons why it won't work?

When I was first tinkering with it, probably 5 years ago, I somehow managed to make a snapshot on a clustered LVM setup and boy did it blow up in my face!!! A word of advice - don't do that :)

James
James Pifer wrote:
> I've been on vacation (and still am) but had to work on a couple of
> problems. I have a couple of Citrix servers that are domUs on a sles10sp2
> server that has local storage and connects to two ocfs2 volumes.

Is there more than just the sles server using both volumes? If not, have you considered using another filesystem? Personally I've had nothing but trouble with ocfs2 in Debian and Centos -- clusters would just randomly fall apart. I've also found that unless filesystem throughput is very good, ocfs2 would end up losing writes by getting ahead of itself somehow. All depends on the storage backend I suppose.

> I tried to restart one of the Citrix servers and it would not restart,
> giving an error that the disk was already mounted in a loopback, etc. I
> looked at mount and didn't see anything mounted and I had just shut the
> domU down. I assumed it had not shut down completely. This domU runs
> from the local disk.
>
> So I decided a restart of the host was in order. I shut down the rest of
> the domUs, including an oracle server running off one of the ocfs2
> clusters. This server had been in use from this location for the last
> three weeks.
>
> After restarting dom0 I started bringing the domUs back up. All of them
> came back up fine, except for the oracle server. It gave an error that
> the disk files did not exist, and they don't; they aren't there
> anymore.
>
> I checked and double-checked history to see if any rm commands had been
> given and I didn't find any.

Do you still have the system logs from the reboots? You might see the cluster falling apart there, depending on how your dom0 shut down the domUs.

> When I restarted, there was an error on one of the local file systems
> that said "JDB: barrier-based sync failed...".
>
> Luckily I have a copy of this domU from a few weeks ago, BEFORE I copied
> it to the ocfs2 volume. What could explain the sudden deletion of a
> directory like this?

Per above, what kind of storage architecture are you using underneath your ocfs2 volumes? I recall reading a bug report that described the lost writes I mentioned, though I can't for the life of me find it now.

> If this happened on some of the other domUs it could be ugly.
>
> Any advice is appreciated!!!!

I've not used SLES, so maybe that's the determining factor, but have you tried gfs2 or lustre (http://wiki.lustre.org/index.php/Main_Page)? I particularly want to try the latter.

Jamon
James Pifer wrote:
> On Fri, 2010-01-01 at 14:07 -0500, Jamon Camisso wrote:
>> Is there more than just the sles server using both volumes? If not, have
>> you considered using another filesystem? Personally I've had nothing but
>> trouble with ocfs2 in Debian and Centos -- clusters would just randomly
>> fall apart. I've also found that unless filesystem throughput is very
>> good, ocfs2 would end up losing writes by getting ahead of itself
>> somehow. All depends on the storage backend I suppose.
>
> I think I know what happened in this case. After a lot of thought, I
> believe the blunder was mine. I remember working with this specific domU
> in early December. I was moving it from my dev machine with local
> storage to the cluster. I did not realize how much space it was actually
> using, so after copying I decided it would be best to leave it on local
> storage since it was not a super critical system.
>
> Here's where I'm speculating. Somewhere along the way I think I screwed
> up and did bring the domU up on the ocfs2 cluster, or I had already
> modified the config. I then started it back up before deleting the one I
> had just copied. I then tried to delete the copy on ocfs2 while it was
> running. Not sure why I may have stopped here when it did not delete,
> maybe sidetracked, don't know. In any case I'm thinking the files were
> marked for deletion.
>
> Then after Christmas I had to reboot the server for a different problem.
> When I stopped the domU, or during reboot, the file deletion actually
> took place. Thankfully I still had a copy of it. Wouldn't have been the
> end of the world except for the work of rebuilding it.
>
> I'm not sure if that is even possible but that's what I'm thinking.
> Other than that my ocfs2 cluster has been solid on sles. Been using it
> for quite some time, well over a year I think.

That sounds plausible. I could see doing the same thing pretty easily.
I use xm migrate (live) to make sure that there's only ever one copy of a domU running anywhere. That way I can definitively check from the dom0 which filesystem is being used too -- it must get messy with different storage pools, lvm volumes, raw tap:aio files etc.

The one doubt I have is the timeline involved. I suppose it is possible that the domU continued merrily along with a filesystem that was losing writes for the rest of the month (a couple of weeks?). It's too bad there isn't a copy of the filesystem around where you could see the logs to confirm it!

Good to hear you've got a backup and that you haven't had problems since the reboot :)

Jamon
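
If the "deleted while in use" explanation is right, the image would have been unlinked but still held open for weeks. On Linux that state is visible in /proc: a file descriptor whose name has been removed shows a link target ending in " (deleted)". The following is a minimal sketch of such a check, not something anyone in this thread ran. The function name is made up, it needs root in dom0 to see other users' processes, and it only covers files held open by userspace processes (assumed here to include a tap:aio backend). Images attached through the kernel loop driver would likely not appear.

```python
import os

DELETED_SUFFIX = " (deleted)"

def deleted_open_files(proc_root="/proc"):
    """Return (pid, path) pairs for open files whose name has been unlinked."""
    found = []
    for pid in os.listdir(proc_root):
        if not pid.isdigit():
            continue  # skip non-process entries such as /proc/meminfo
        fd_dir = os.path.join(proc_root, pid, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited, or we lack permission
        for fd in fds:
            try:
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue  # descriptor was closed while we were looking
            if target.endswith(DELETED_SUFFIX):
                found.append((int(pid), target[:-len(DELETED_SUFFIX)]))
    return found

if __name__ == "__main__":
    for pid, path in deleted_open_files():
        print(pid, path)
```

Any disk image path that turns up in this list belongs to a file that will disappear for good once the process holding it exits, so it is worth copying the data out before stopping the guest.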
On Fri, 2010-01-01 at 14:07 -0500, Jamon Camisso wrote:
> James Pifer wrote:
> > I've been on vacation (and still am) but had to work on a couple of
> > problems. I have a couple of Citrix servers that are domUs on a sles10sp2
> > server that has local storage and connects to two ocfs2 volumes.
>
> Is there more than just the sles server using both volumes? If not, have
> you considered using another filesystem? Personally I've had nothing but
> trouble with ocfs2 in Debian and Centos -- clusters would just randomly
> fall apart. I've also found that unless filesystem throughput is very
> good, ocfs2 would end up losing writes by getting ahead of itself
> somehow. All depends on the storage backend I suppose.
>

I think I know what happened in this case. After a lot of thought, I believe the blunder was mine. I remember working with this specific domU in early December. I was moving it from my dev machine with local storage to the cluster. I did not realize how much space it was actually using, so after copying I decided it would be best to leave it on local storage since it was not a super critical system.

Here's where I'm speculating. Somewhere along the way I think I screwed up and did bring the domU up on the ocfs2 cluster, or I had already modified the config. I then started it back up before deleting the one I had just copied. I then tried to delete the copy on ocfs2 while it was running. Not sure why I may have stopped here when it did not delete, maybe sidetracked, don't know. In any case I'm thinking the files were marked for deletion.

Then after Christmas I had to reboot the server for a different problem. When I stopped the domU, or during reboot, the file deletion actually took place. Thankfully I still had a copy of it. Wouldn't have been the end of the world except for the work of rebuilding it.

I'm not sure if that is even possible but that's what I'm thinking. Other than that my ocfs2 cluster has been solid on sles.
Been using it for quite some time, well over a year I think.

Thanks,
James
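
On whether this is even possible: on POSIX filesystems, removing a file that is still open only removes its name. The data stays reachable through the open descriptor and is released at the last close. The sketch below demonstrates that on a local temporary directory. It is an illustration under assumptions, not a reproduction of the incident: the file name is hypothetical, an ordinary open file stands in for the running domU's disk backend, and it does not show how OCFS2 handles the same case across cluster nodes.

```python
import os
import tempfile

def unlink_while_open():
    """Show that an open file survives unlink until its last close."""
    workdir = tempfile.mkdtemp()
    path = os.path.join(workdir, "disk0.img")  # hypothetical image name
    with open(path, "wb") as f:
        f.write(b"guest data")

    holder = open(path, "rb")  # stands in for the backend of a running domU
    os.unlink(path)            # the "rm" issued while the image is in use

    listed = os.path.exists(path)               # the name is gone from the directory
    links = os.fstat(holder.fileno()).st_nlink  # typically 0 on a local filesystem
    data = holder.read()                        # contents are still readable

    holder.close()             # last close: only now is the space released
    os.rmdir(workdir)
    return listed, links, data

if __name__ == "__main__":
    print(unlink_while_open())
```

This matches the pattern described above: ls and mount show nothing wrong while the guest keeps running, and the files only vanish for good when the guest is stopped or the host reboots.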