thr3ads.net - Xen users - [Xen-users] Scary!!! Lost domU!!! [Dec 2009]

James Pifer

2009-Dec-31 04:32 UTC

[Xen-users] Scary!!! Lost domU!!!

I've been on vacation (and still am) but had to work on a couple of
problems. I have a couple of Citrix servers that are domUs on a sles10sp2
server that has local storage and connects to two ocfs2 volumes.

I tried to restart one of the Citrix servers and it would not restart,
giving an error that the disk was already mounted in a loopback, etc. I
looked at mount and didn't see anything mounted and I had just shut the
domU down. I assumed it had not shut down completely. This domU runs
from the local disk.

So I decided a restart of the host was in order. I downed the rest of
the domUs, including an oracle server running off one of the ocfs2
clusters. This server has been in use for the last three weeks from
this location.

After restarting dom0 I started bringing the domUs back up. All of them
came back up fine, except for the oracle server. It gave an error that
the disk files did not exist, and they don't, they aren't there
anymore.

I checked and double checked history to see if any rm commands had been
given and I didn't find any.

When I restarted, there was an error on one of the local file systems
that said "JBD: barrier-based sync failed...".

Luckily I have a copy of this domU from a few weeks ago BEFORE I copied
it to the ocfs2 volume. What could explain the sudden deletion of a
directory like this?

If this happened on some of the other domUs it could be ugly.

Any advice is appreciated!!!!

James


_______________________________________________
Xen-users mailing list
Xen-users@lists.xensource.com
http://lists.xensource.com/xen-users

Pasi Kärkkäinen

2009-Dec-31 13:59 UTC

Re: [Xen-users] Scary!!! Lost domU!!!

On Wed, Dec 30, 2009 at 11:32:01PM -0500, James Pifer wrote:
> I've been on vacation (and still am) but had to work on a couple of
> problems. I have a couple of Citrix servers that are domUs on a sles10sp2
> server that has local storage and connects to two ocfs2 volumes.
>
> I tried to restart one of the Citrix servers and it would not restart,
> giving an error that the disk was already mounted in a loopback, etc. I
> looked at mount and didn't see anything mounted and I had just shut the
> domU down. I assumed it had not shut down completely. This domU runs
> from the local disk.
>
> So I decided a restart of the host was in order. I downed the rest of
> the domUs, including an oracle server running off one of the ocfs2
> clusters. This server has been in use for the last three weeks from
> this location.
>
> After restarting dom0 I started bringing the domUs back up. All of them
> came back up fine, except for the oracle server. It gave an error that
> the disk files did not exist, and they don't, they aren't there
> anymore.
>
> I checked and double checked history to see if any rm commands had been
> given and I didn't find any.
>
> When I restarted, there was an error on one of the local file systems
> that said "JBD: barrier-based sync failed...".
>
> Luckily I have a copy of this domU from a few weeks ago BEFORE I copied
> it to the ocfs2 volume. What could explain the sudden deletion of a
> directory like this?
>
> If this happened on some of the other domUs it could be ugly.
>
> Any advice is appreciated!!!!
>

Sounds like a problem with OCFS2. This is exactly the reason why I don't
like storing VM disk images on a filesystem - fsck or this kind of weird
filesystem error can completely f*ck up the disk images.

I suggest LVM for guest disks.

Sorry, I can't really help with the problem. Did you try fsck?

-- Pasi
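
The suggestion of LVM for guest disks corresponds, in an xm-style domU config, to pointing the `disk` line at a block device instead of an image file. A minimal sketch of such a fragment, assuming a hypothetical volume group `vg0`, a hypothetical logical volume `oracle-disk`, and a made-up image path (none of these names come from the thread):

```python
# Sketch of an xm-style domU config fragment (these files use Python syntax).
# The volume group "vg0", the LV "oracle-disk" and the image path below are
# hypothetical names chosen for illustration only.

# File-backed form (image file stored on a filesystem, local or ocfs2):
# disk = ['file:/var/lib/xen/images/oracle/disk0.img,xvda,w']

# LVM-backed form: the guest is handed the logical volume as a raw block
# device, so no filesystem in dom0 sits between the guest and its disk.
disk = ['phy:/dev/vg0/oracle-disk,xvda,w']
```

Each entry has three comma-separated fields: the backend and its target, the device name seen by the guest, and the access mode.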



Madison Kelly

2010-Jan-01 00:31 UTC

Re: [Xen-users] Scary!!! Lost domU!!!

James Pifer wrote:
> I've been on vacation (and still am) but had to work on a couple of
> problems. I have a couple of Citrix servers that are domUs on a sles10sp2
> server that has local storage and connects to two ocfs2 volumes.
>
> I tried to restart one of the Citrix servers and it would not restart,
> giving an error that the disk was already mounted in a loopback, etc. I
> looked at mount and didn't see anything mounted and I had just shut the
> domU down. I assumed it had not shut down completely. This domU runs
> from the local disk.
>
> So I decided a restart of the host was in order. I downed the rest of
> the domUs, including an oracle server running off one of the ocfs2
> clusters. This server has been in use for the last three weeks from
> this location.
>
> After restarting dom0 I started bringing the domUs back up. All of them
> came back up fine, except for the oracle server. It gave an error that
> the disk files did not exist, and they don't, they aren't there
> anymore.
>
> I checked and double checked history to see if any rm commands had been
> given and I didn't find any.
>
> When I restarted, there was an error on one of the local file systems
> that said "JBD: barrier-based sync failed...".
>
> Luckily I have a copy of this domU from a few weeks ago BEFORE I copied
> it to the ocfs2 volume. What could explain the sudden deletion of a
> directory like this?
>
> If this happened on some of the other domUs it could be ugly.
>
> Any advice is appreciated!!!!
>
> James

I am afraid that I don't have advice, either, but I'd like to second the
recommendation of not using filesystems to store VMs, even for simple
performance reasons. I found that dedicated partitions made a world of
performance difference.

A note on LVM though; if you cluster it, it won't do LVM snapshots. I
found this out late in the game, and had to create secondary partitions
in the actual VM to use for snapshotting. I got lucky that I had the
space to do this, as I hadn't planned for it. That said, it's been a
so-far successful setup.

Best of luck, and do share any solution you find please.

Madi


James Harper

2010-Jan-01 00:51 UTC

RE: [Xen-users] Scary!!! Lost domU!!!

>
> A note on LVM though; if you cluster it, it won't do LVM snapshots. I
> found this out late in the game, and had to create secondary partitions
> in the actual VM to use for snapshotting. I got lucky that I had the
> space to do this, as I hadn't planned for it. That said, it's been a
> so-far successful setup.
>
Is there any word on clustered LVM and snapshotting? Is it ever going to happen
or are there fundamental reasons why it won't work?

When I was first tinkering with it, probably 5 years ago, I somehow managed to
make a snapshot on a clustered LVM setup and boy did it blow up in my face!!! A
word of advice - don't do that :)

James


Jamon Camisso

2010-Jan-01 19:07 UTC

Re: [Xen-users] Scary!!! Lost domU!!!

James Pifer wrote:
> I've been on vacation (and still am) but had to work on a couple of
> problems. I have a couple of Citrix servers that are domUs on a sles10sp2
> server that has local storage and connects to two ocfs2 volumes.

Is there more than just the sles server using both volumes? If not, have
you considered using another filesystem? Personally I've had nothing but
trouble with ocfs2 in Debian and CentOS -- clusters would just randomly
fall apart. I've also found that unless filesystem throughput is very
good, ocfs2 would end up losing writes by getting ahead of itself
somehow. All depends on the storage backend I suppose.

> I tried to restart one of the Citrix servers and it would not restart,
> giving an error that the disk was already mounted in a loopback, etc. I
> looked at mount and didn't see anything mounted and I had just shut the
> domU down. I assumed it had not shut down completely. This domU runs
> from the local disk.
>
> So I decided a restart of the host was in order. I downed the rest of
> the domUs, including an oracle server running off one of the ocfs2
> clusters. This server has been in use for the last three weeks from
> this location.
>
> After restarting dom0 I started bringing the domUs back up. All of them
> came back up fine, except for the oracle server. It gave an error that
> the disk files did not exist, and they don't, they aren't there
> anymore.
>
> I checked and double checked history to see if any rm commands had been
> given and I didn't find any.

Do you still have the system logs from the reboots? You might see the
cluster falling apart there depending on how your dom0 shut down the domUs.

> When I restarted, there was an error on one of the local file systems
> that said "JBD: barrier-based sync failed...".
>
> Luckily I have a copy of this domU from a few weeks ago BEFORE I copied
> it to the ocfs2 volume. What could explain the sudden deletion of a
> directory like this?

Per above, what kind of storage architecture are you using underneath
your ocfs2 volumes? I recall reading a bug that described the lost
writes I mentioned, though I can't for the life of me find it now.

> If this happened on some of the other domUs it could be ugly.
>
> Any advice is appreciated!!!!

I've not used SLES, so maybe that's the determining factor, but have you
tried gfs2 or lustre (http://wiki.lustre.org/index.php/Main_Page)? I
particularly want to try the latter.

Jamon



Jamon Camisso

2010-Jan-04 02:15 UTC

Re: [Xen-users] Scary!!! Lost domU!!!

James Pifer wrote:
> On Fri, 2010-01-01 at 14:07 -0500, Jamon Camisso wrote:
>> Is there more than just the sles server using both volumes? If not, have
>> you considered using another filesystem? Personally I've had nothing but
>> trouble with ocfs2 in Debian and CentOS -- clusters would just randomly
>> fall apart. I've also found that unless filesystem throughput is very
>> good, ocfs2 would end up losing writes by getting ahead of itself
>> somehow. All depends on the storage backend I suppose.
>
> I think I know what happened in this case. After a lot of thought, I
> believe the blunder was mine. I remember working with this specific domU
> in early December. I was moving it from my dev machine with local
> storage to the cluster. I did not realize how much space it was actually
> using, so after copying I decided it would be best to leave it on local
> storage since it was not a super critical system.
>
> Here's where I'm speculating. Somewhere along the way I think I screwed
> up and did bring the domU up on the ocfs2 cluster or I had already
> modified the config. I then started it back up before deleting the one I
> just copied. I then tried to delete the copy on ocfs2 while it was
> running. Not sure why I may have stopped here when it did not delete,
> maybe sidetracked, don't know. In any case I'm thinking they were
> marked for deletion.
>
> Then after Christmas I had to reboot the server for a different problem.
> When I stopped the domU, or during reboot, the file deletion actually
> took place. Thankfully I still had a copy of it. Wouldn't have been the
> end of the world except for work rebuilding it.
>
> I'm not sure if that is even possible but that's what I'm thinking.
> Other than that my ocfs2 cluster has been solid on sles. Been using it
> for quite some time, well over a year I think.

That sounds plausible. I could see doing the same thing pretty easily. I
use xm migrate (live) to make sure that there's only ever one copy of a
domU running anywhere. That way I can definitively check from the dom0
which filesystem is being used too -- it must get messy with different
storage pools, lvm volumes, raw tap:aio files etc.

The one doubt I have is the timeline involved. I suppose it is possible
that the domU continued merrily along with a filesystem that was losing
writes for the rest of the month (a couple of weeks?). It's too bad there
isn't a copy of the filesystem around where you could see the logs to
confirm it!

Good to hear you've got a backup and that you haven't had problems since
the reboot :)

Jamon


James Pifer

2010-Jan-04 02:49 UTC

Re: [Xen-users] Scary!!! Lost domU!!!

On Fri, 2010-01-01 at 14:07 -0500, Jamon Camisso wrote:
> James Pifer wrote:
> > I've been on vacation (and still am) but had to work on a couple of
> > problems. I have a couple of Citrix servers that are domUs on a sles10sp2
> > server that has local storage and connects to two ocfs2 volumes.
>
> Is there more than just the sles server using both volumes? If not, have
> you considered using another filesystem? Personally I've had nothing but
> trouble with ocfs2 in Debian and CentOS -- clusters would just randomly
> fall apart. I've also found that unless filesystem throughput is very
> good, ocfs2 would end up losing writes by getting ahead of itself
> somehow. All depends on the storage backend I suppose.
>
I think I know what happened in this case. After a lot of thought, I
believe the blunder was mine. I remember working with this specific domU
in early December. I was moving it from my dev machine with local
storage to the cluster. I did not realize how much space it was actually
using, so after copying I decided it would be best to leave it on local
storage since it was not a super critical system.

Here's where I'm speculating. Somewhere along the way I think I screwed
up and did bring the domU up on the ocfs2 cluster or I had already
modified the config. I then started it back up before deleting the one I
just copied. I then tried to delete the copy on ocfs2 while it was
running. Not sure why I may have stopped here when it did not delete,
maybe sidetracked, don't know. In any case I'm thinking they were
marked for deletion.

Then after Christmas I had to reboot the server for a different problem.
When I stopped the domU, or during reboot, the file deletion actually
took place. Thankfully I still had a copy of it. Wouldn't have been the
end of the world except for work rebuilding it.

I'm not sure if that is even possible but that's what I'm thinking.
Other than that my ocfs2 cluster has been solid on sles. Been using it
for quite some time, well over a year I think.

Thanks,
James
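
The mechanism speculated about above resembles ordinary POSIX unlink semantics: removing a file that a process still holds open deletes only its name, and the data is released when the last holder closes it. A minimal Python sketch on a local temporary directory follows. The file name `disk0.img` and the `holder` handle standing in for a running domU's disk backend are assumptions for illustration. It does not exercise OCFS2 itself, and the thread does not confirm that this is what actually happened.

```python
import os
import tempfile


def demo_deferred_delete():
    """Unlink a file that is still open and show the data stays readable."""
    workdir = tempfile.mkdtemp()
    path = os.path.join(workdir, "disk0.img")  # hypothetical image name

    with open(path, "wb") as f:
        f.write(b"guest data")

    # Stand-in for the running domU's backend keeping the image open.
    holder = open(path, "rb")

    # The equivalent of "rm": the directory entry goes away immediately.
    os.remove(path)
    name_gone = not os.path.exists(path)

    # The open handle still reaches the data even though no name is left.
    data = holder.read()

    # Closing the last handle is the point where the space is released,
    # comparable to stopping the domU or rebooting the host.
    holder.close()
    os.rmdir(workdir)
    return name_gone, data


if __name__ == "__main__":
    print(demo_deferred_delete())
```

On a Linux dom0, files in this state are usually shown with a "(deleted)" suffix by tools that list open files, which is one way to spot an image that has been removed while a guest is still using it.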

