Hi,

We are attempting to migrate blkif devices backed by drbd devices, using an approach similar to the vTPM migration. The migration itself seems to complete without errors on both the source and the destination. The migrated machine responds to external network queries such as ping, arping and nmap, but I cannot ssh into it. Also, when using xm console, I get these messages before the login prompt:

    vbd vbd-769: 16 Device in use; refusing to close
    netfront: device eth0 has flipping receive path

... and then the machine hangs after I enter the login username.

My guess is that even though the hotplug scripts returned successfully for the vbd device (according to the xend.log below), the vbd did not migrate successfully and dom0 cannot read anything from the disk.

Do you have any suggestions on what the problem might be, and where and how to look for more debugging information? Attached are the xend.logs for the source and the destination.

Thank you,
Cristian
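A quick way to see what state the vbd actually ended up in after the migration is to dump its xenstore entries on the destination host. Below is a rough sketch of such a check, not part of the original report: it assumes the usual Xen 3.x xenstore layout, and the domid and device id (769, taken from the "vbd-769" console message) are placeholders to adjust for the actual domain.

    #!/usr/bin/env python
    # Debugging sketch: dump the vbd frontend and backend entries from
    # xenstore on the destination host after the migration, to see what
    # state the device ended up in.  Path layout is the usual Xen 3.x one;
    # domid and devid are placeholders.
    import commands

    domid = 1     # domid of the migrated guest on the destination host
    devid = 769   # virtual device id, as in the "vbd-769" console message

    paths = ["/local/domain/%d/device/vbd/%d" % (domid, devid),     # frontend
             "/local/domain/0/backend/vbd/%d/%d" % (domid, devid)]  # backend

    for p in paths:
        # XenbusState values: 1=Initialising, 2=InitWait, 3=Initialised,
        # 4=Connected, 5=Closing, 6=Closed.  After a successful migration
        # both ends should normally be Connected (4).
        status, output = commands.getstatusoutput("xenstore-ls " + p)
        print "== %s ==" % p
        print output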
Hi,

I am trying to live migrate blkif devices backed by drbd devices and I have been struggling with a problem for a few days now. The problem is that after migration, the domU cannot load any new programs into memory: the ssh connection survives the migration and I can run programs that are already in memory, but not anything that has to be loaded from the disk.

I am currently testing with an almost idle machine, and I am triggering the drive migration after the domain is suspended, in step 2, from XendCheckpoint.py:

    dominfo.migrateDevices(network, dst, DEV_MIGRATE_STEP2, domain_name)

However, I also tried triggering it before the domain is suspended, in step 1 (dominfo.migrateDevices(network, dst, DEV_MIGRATE_STEP1, domain_name)), and everything works fine, except that there is the obvious possibility of losing some writes to the disk because the domain is not suspended yet.

After migration, when I reattach a console I get this message:

    vbd vbd-769: 16 Device in use; refusing to close

This comes from the backend_changed() function in blkfront.c, but I cannot figure out why the error occurs. From the xend.logs and dmesg output on the source and destination (attached below) I cannot spot any errors.

I am using Xen 3.0.3. Any ideas would be greatly appreciated, as I am a beginner at developing and debugging Xen.

Thank you very much.
On Thu, Dec 07, 2006 at 03:47:39PM +0000, Cristian Zamfir wrote:

> Hi,
>
> I am trying to live migrate blkif devices backed by drbd devices and I
> am struggling with a problem for a few days now. The problem is that
> after migration, the domU machine cannot load any new programs into
> memory. The ssh connection survives migration and I can run programs
> that are already in memory but not something that needs to be loaded
> from the disk.
>
> I am currently testing with an almost idle machine and I am triggering
> the drive migration after the domain is suspended, in step 2, from:
> XendCheckpoint.py: dominfo.migrateDevices(network, dst,
> DEV_MIGRATE_STEP2, domain_name).
>
> However, I also tried before the domain is suspended from step 1
> (dominfo.migrateDevices(network, dst, DEV_MIGRATE_STEP1, domain_name))
> and everything works fine, except that there is the obvious possibility
> of losing some writes to the disk because the domain is not suspended yet.
>
> After migration, when I reattach a console I get this message:
> "vbd vbd-769: 16 Device in use; refusing to close"
> This is from the blkfront.c backend_changed() function but I cannot
> figure out why this error occurs.

I believe that this means that the frontend has seen that the backend is
tearing down, but since the device is still mounted inside the guest, it's
refusing. I don't think that the frontend ought to see the backend tear down
at all -- the guest ought to be suspended before you tear down the backend
device.

When you say that you are "triggering the drive migration", what does that
involve? Why would the frontend see the store contents change at all at this
point?

Have you tried a localhost migration? This would be easier, because you don't
actually need to move the disk of course, so you can get half your signalling
tested before moving on to the harder problem.

Ewan.
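One way to pin down why (and when) the frontend sees the store contents change is to watch the backend's state key during a migration attempt and log every transition with a timestamp, so it can be lined up against the suspend in xend.log. The sketch below is an illustration only: it assumes the usual Xen 3.x xenstore layout, and the domid and devid are placeholders.

    #!/usr/bin/env python
    # Sketch: poll the vbd backend's xenstore state key during a migration
    # attempt and print each transition with a timestamp, so the change can
    # be correlated with the suspend recorded in xend.log.
    import commands, time

    domid, devid = 1, 769   # placeholders for the guest domid and device id
    state_path = "/local/domain/0/backend/vbd/%d/%d/state" % (domid, devid)

    last = None
    while True:
        # xenstore-read prints the current XenbusState value (4 = Connected,
        # 5 = Closing, 6 = Closed); a non-zero status means the key is gone.
        status, value = commands.getstatusoutput("xenstore-read " + state_path)
        if status != 0:
            value = "(missing)"
        if value != last:
            print "%s  %s = %s" % (time.strftime("%H:%M:%S"), state_path, value)
            last = value
        time.sleep(0.1)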
Ewan Mellor wrote:
> On Thu, Dec 07, 2006 at 03:47:39PM +0000, Cristian Zamfir wrote:
>
>> Hi,
>>
>> I am trying to live migrate blkif devices backed by drbd devices and I
>> am struggling with a problem for a few days now. The problem is that
>> after migration, the domU machine cannot load any new programs into
>> memory. The ssh connection survives migration and I can run programs
>> that are already in memory but not something that needs to be loaded
>> from the disk.
>>
>> I am currently testing with an almost idle machine and I am triggering
>> the drive migration after the domain is suspended, in step 2, from:
>> XendCheckpoint.py: dominfo.migrateDevices(network, dst,
>> DEV_MIGRATE_STEP2, domain_name).
>>
>> However, I also tried before the domain is suspended from step 1
>> (dominfo.migrateDevices(network, dst, DEV_MIGRATE_STEP1, domain_name))
>> and everything works fine, except that there is the obvious possibility
>> of losing some writes to the disk because the domain is not suspended yet.
>>
>> After migration, when I reattach a console I get this message:
>> "vbd vbd-769: 16 Device in use; refusing to close"
>> This is from the blkfront.c backend_changed() function but I cannot
>> figure out why this error occurs.
>
> I believe that this means that the frontend has seen that the backend is
> tearing down, but since the device is still mounted inside the guest, it's
> refusing. I don't think that the frontend ought to see the backend tear down
> at all -- the guest ought to be suspended before you tear down the backend
> device.

I am triggering the migration in DEV_MIGRATE_STEP2, which is right after the
domain is suspended, as far as I can tell from the Python code in
XendCheckpoint.py:

    dominfo.migrateDevices(network, dst, DEV_MIGRATE_STEP1, domain_name)
    ....
    ....
    def saveInputHandler(line, tochild):
        log.debug("In saveInputHandler %s", line)
        if line == "suspend":
            log.debug("Suspending %d ...", dominfo.getDomid())
            dominfo.shutdown('suspend')
            dominfo.waitForShutdown()
            dominfo.migrateDevices(network, dst, DEV_MIGRATE_STEP2,
                                   domain_name)
            log.info("Domain %d suspended.", dominfo.getDomid())
            dominfo.migrateDevices(network, dst, DEV_MIGRATE_STEP3,
                                   domain_name)

"Triggering the drive migration" means that dominfo.migrateDevices(..) calls
my script in /etc/xen/scripts. This script checks that the drive at the
source and the replica at the destination are in sync and then switches their
roles (the one on the source becomes secondary and the one on the destination
becomes primary). Since the guest is suspended at this point, I don't
understand why the frontend should see any change.

I found that DRBD drives are not really usable while they are in the
secondary state; only the primary one should be mounted. For instance, when
trying to mount a drbd device that is in secondary state I get this error:

    # mount -r -t reiserfs /dev/drbd1 /mnt/vm
    mount: /dev/drbd1 already mounted or /mnt/vm busy

Could this error therefore happen on the destination, during restore, while
waiting for the backends to set up, if the drive is still in secondary state?

I also don't understand why everything works if I migrate the hard drive in
DEV_MIGRATE_STEP1. The only error I get in that case is reiserfs complaining
about some writes that failed, but everything else seems fine.

I cannot really try a localhost migration because I think drbd only works
between two machines, but I have tested most of my code outside Xen and it
worked.
Thank you very much for your help.

> When you say that you are "triggering the drive migration", what does that
> involve? Why would the frontend see the store contents change at all at this
> point?
>
> Have you tried a localhost migration? This would be easier, because you don't
> actually need to move the disk of course, so you can get half your signalling
> tested before moving on to the harder problem.
>
> Ewan.
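For reference, the role switch that the script described above performs (demote the DRBD resource on the source, promote the replica on the destination) might be sketched roughly as follows. This is an illustration rather than the actual script: the resource name r0, the destination hostname, the ssh-based remote call and the /proc/drbd sync check are all assumptions.

    #!/usr/bin/env python
    # Rough illustration of the DRBD role switch: check that the mirror is
    # connected and in sync, demote the source copy, then promote the
    # replica on the destination.  Resource name, hostname and the use of
    # ssh are placeholders; the sync check just inspects /proc/drbd.
    import commands, sys

    RESOURCE = "r0"           # placeholder DRBD resource name
    DEST     = "destination"  # placeholder hostname of the target dom0

    def run(cmd):
        status, output = commands.getstatusoutput(cmd)
        if status != 0:
            sys.exit("failed: %s\n%s" % (cmd, output))
        return output

    # 1. Refuse to switch roles unless the mirror is connected and no resync
    #    (SyncSource/SyncTarget) is in progress.
    proc = open("/proc/drbd").read()
    if "cs:Connected" not in proc or "Sync" in proc:
        sys.exit("drbd resource not connected/in sync, aborting")

    # 2. Demote the local (source) copy to secondary ...
    run("drbdadm secondary %s" % RESOURCE)

    # 3. ... and promote the replica on the destination to primary.
    run("ssh root@%s drbdadm primary %s" % (DEST, RESOURCE))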