Hello,

I have been playing around with Remus on Xen 4.0.1, attempting to fail-over for an HVM domU.

I've run into some problems that I think could be related to tapdisk2 and its interaction with how one sets up Remus disk replication in the domU config file.

A few things I've noticed:

- The tap:remus:backupHostIP:port|aio:imagePath notation does not work for me, although this is what is written in the Remus documentation. However, I have found the following to work (i.e., not complain when starting domU), so this is what I've been using:

  tap2:remus:backupHostIP:port|aio:imagePath...

When I invoke the remus script, I do indeed see checkpoint traffic flowing between primary host and backup host. However, disk replication does not actually seem to be working (the image on the backup host is never modified). xend.log gives me the following error on the backup host:

------------------
[2010-09-07 11:55:41 2584] ERROR (XendDomainInfo:2244) Failed to restart domain 6.
Traceback (most recent call last):
  File "/usr/lib64/python2.6/site-packages/xen/xend/XendDomainInfo.py", line 2227, in _restart
    new_dom_info)
  File "/usr/lib64/python2.6/site-packages/xen/xend/XendDomain.py", line 998, in domain_create_from_dict
    dominfo = XendDomainInfo.create_from_dict(config_dict)
  File "/usr/lib64/python2.6/site-packages/xen/xend/XendDomainInfo.py", line 126, in create_from_dict
    vm.start()
  File "/usr/lib64/python2.6/site-packages/xen/xend/XendDomainInfo.py", line 469, in start
    XendTask.log_progress(31, 60, self._initDomain)
  File "/usr/lib64/python2.6/site-packages/xen/xend/XendTask.py", line 209, in log_progress
    retval = func(*args, **kwds)
  File "/usr/lib64/python2.6/site-packages/xen/xend/XendDomainInfo.py", line 2896, in _initDomain
    self._createDevices()
  File "/usr/lib64/python2.6/site-packages/xen/xend/XendDomainInfo.py", line 2374, in _createDevices
    devid = self._createDevice(devclass, config)
  File "/usr/lib64/python2.6/site-packages/xen/xend/XendDomainInfo.py", line 2336, in _createDevice
    return self.getDeviceController(deviceClass).createDevice(devConfig)
  File "/usr/lib64/python2.6/site-packages/xen/xend/server/BlktapController.py", line 212, in createDevice
    raise Exception, 'Failed to create device.\n stdout: %s\n stderr: %s\nCheck that target \"%s\" exists and that blktap2 driver installed in dom0.' % (out.rstrip(), err.rstrip(), file);
Exception: Failed to create device.
 stdout: vbd open failed: -22
 stderr:
Check that target "192.168.1.106:9500|aio:/home/jak/remus/XenGuest1.img" exists and that blktap2 driver installed in dom0.
--------------------

Thus, it seems like the backup host does not properly set up the tap disk for one of two reasons: either dom0 does not have blktap2 support, or there is something wrong with the path being passed to tapdisk2.

I do believe that dom0 on the backup host has blktap2 support, because I can launch the guest on the backup host (without using Remus) using the tap2 notation. Further, the file /home/jak/remus/XenGuest1.img does in fact exist on the backup host.

I could be completely wrong about this, but it would seem to me that the backup host should be trying to set up the tap disk without any mention of the hostIP and port number, since (if I understand correctly) Remus does not replicate the backup's disk.

Does anyone know what might be going on here? Am I using the right syntax in my configuration file? Any help you could provide would be greatly appreciated.

Thanks so much,
Jon

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
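For readers following along, the disk string discussed in this message can be pulled apart as in the sketch below. It is an illustration only: `split_remus_uname` is a hypothetical helper (not part of Xen), it handles just the four-field `PREFIX:remus:HOST:PORT|DRIVER:PATH` form quoted above, and the `local_uname` it builds is merely what the string would look like with the host and port dropped, as the message guesses the backup might need. Whether xend accepts that exact form is an assumption.

```python
def split_remus_uname(uname):
    """Split 'PREFIX:remus:HOST:PORT|DRIVER:PATH' into named parts.

    Hypothetical helper for illustration; not a Xen API.
    """
    front, sep, back = uname.partition("|")
    if not sep:
        raise ValueError("expected '|' in disk uname: %r" % uname)

    fields = front.split(":")
    if len(fields) != 4 or fields[1] != "remus":
        raise ValueError("expected PREFIX:remus:HOST:PORT before '|': %r" % uname)
    prefix, _, host, port = fields

    image_driver, sep, image_path = back.partition(":")
    if not sep:
        raise ValueError("expected DRIVER:PATH after '|': %r" % uname)

    return {
        "prefix": prefix,              # e.g. 'tap2'
        "backup_host": host,           # replication target
        "port": int(port),             # replication port
        "image_driver": image_driver,  # e.g. 'aio'
        "image_path": image_path,      # backing image file
    }


# The values below are the ones that appear in the error message above.
parts = split_remus_uname(
    "tap2:remus:192.168.1.106:9500|aio:/home/jak/remus/XenGuest1.img")

# What the uname would look like without the Remus host/port (an assumption
# about what the backup side might want, per the guess in the message).
local_uname = "%s:%s:%s" % (
    parts["prefix"], parts["image_driver"], parts["image_path"])
```

Comparing `local_uname` with the target printed in the exception makes the difference visible: the failing target still carries the `HOST:PORT|` part.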
On Tue, Sep 07, 2010 at 03:28:32PM -0700, Jonathan Kirsch wrote:
> -The tap:remus:backupHostIP:port|aio:imagePath notation does not work for
> me, although this is what is written in the Remus documentation. However,
> I have found the following to work (i.e., not complain when starting
> domU), so this is what I've been using:
>
> tap2:remus:backupHostIP:port|aio:imagePath...

Yeah, this stuff was changed in Xen 4.0.1:
http://wiki.xensource.com/xenwiki/blktap2

I guess someone should update the remus wiki page.

-- Pasi
It's not just the tap2:remus:... notation.

There is a bug lurking in tools/python/xen/remus/device.py, in the ReplicatedDisk class. The regular expression scans the domU config only for the tap:tapdisk:remus... or tap:remus... disk types. I was able to get it working by fixing that regexp.

This applies to Xen 4.0.1 only. I am not sure about xen unstable.

Here is a patch that might be of help to you (it's rather crude, but heck, I was too lazy :) )

diff -r b536ebfba183 tools/python/xen/remus/device.py
--- a/tools/python/xen/remus/device.py  Wed Aug 25 09:22:42 2010 +0100
+++ b/tools/python/xen/remus/device.py  Fri Sep 03 08:47:13 2010 -0700
@@ -36,10 +36,13 @@
         # to request commits.
         self.ctlfd = None
 
-        if not disk.uname.startswith('tap:remus:') and not disk.uname.startswith('tap:tapdisk:remus:'):
+        if not disk.uname.startswith('tap2:remus:') and not disk.uname.startswith('tap:remus:') and not disk.uname.startswith('tap:tapdisk:remus:'):
             raise ReplicatedDiskException('Disk is not replicated: %s' %
                                         str(disk))
-        fifo = re.match("tap:.*(remus.*)\|", disk.uname).group(1).replace(':', '_')
+        if disk.uname.startswith('tap2:remus:'):
+            fifo = re.match("tap2:.*(remus.*)\|", disk.uname).group(1).replace(':', '_')
+        else:
+            fifo = re.match("tap:.*(remus.*)\|", disk.uname).group(1).replace(':', '_')
         absfifo = os.path.join(self.FIFODIR, fifo)
         absmsgfifo = absfifo + '.msg'
 

-- 
perception is but an offspring of its own self
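The prefix check and FIFO-name extraction in the patch can also be written with a single pattern. The following is only a sketch of that idea, not the code that ships in Xen: `fifo_name` and `REMUS_PREFIXES` are hypothetical names, and `ValueError` stands in for the `ReplicatedDiskException` used in device.py.

```python
import re

# Prefixes accepted by the patched check in the thread.
REMUS_PREFIXES = ("tap2:remus:", "tap:remus:", "tap:tapdisk:remus:")

# 'tap2?:' accepts both 'tap:' and 'tap2:'; the capture group is the same
# '(remus.*)' up to the '|' separator that the original expression used.
REMUS_FIFO = re.compile(r"tap2?:.*(remus.*)\|")


def fifo_name(uname):
    """Return the FIFO base name for a replicated disk uname.

    Hypothetical helper; raises ValueError where device.py raises
    ReplicatedDiskException.
    """
    if not uname.startswith(REMUS_PREFIXES):  # str.startswith accepts a tuple
        raise ValueError("Disk is not replicated: %s" % uname)
    match = REMUS_FIFO.match(uname)
    if match is None:
        raise ValueError("No '|' separator in disk uname: %s" % uname)
    return match.group(1).replace(":", "_")
```

With the uname from the first message, `fifo_name("tap2:remus:192.168.1.106:9500|aio:/home/jak/remus/XenGuest1.img")` yields `remus_192.168.1.106_9500`; the word "remus" in the image path is not picked up because no `|` follows it.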
Hi,

Thanks a lot for the patch. Unfortunately, this did not solve the problem for me (after applying the patch on both primary and backup, rebuilding and installing xen/tools/stubdom, and then rebooting both hosts). The backup is still unable to create the disk device when the fail-over occurs. Thus, although I see checkpoint traffic flowing from primary to backup, the state of the backup's disk image is never modified (as judged by the image's last-modified time). The backup does switch from "paused" to "running," but it consumes 100% CPU and when I connect to its vnc console it is as if the VM is frozen. So *something* is being transferred, because I do see the screen from the primary, but obviously all is not right, because I can't interact with it at all.

Out of curiosity, in your working Remus deployment, which dom0 kernel are you running (and which version of Xen)? I'm running Xen 4.0.1 and the pvops 2.6.31.14 dom0 kernel. My understanding was that Remus supported pvops dom0 2.6.31.x.

Any other ideas regarding what this might be a symptom of? My naive interpretation is that it is not a networking configuration problem (since state is being transferred), but that it has something to do with setting up the tapdisk via tapdisk2.

Thanks,
Jon
On Wed, Sep 8, 2010 at 11:33 AM, Jonathan Kirsch <kirsch.jonathan@gmail.com> wrote:
> Thanks a lot for the patch. Unfortunately, this did not solve the problem
> for me (after applying the patch on both primary and backup, rebuilding and
> installing xen/tools/stubdom, and then rebooting both hosts). The backup is
> still unable to create the disk device when the fail-over occurs. Thus,
> although I see checkpoint traffic flowing from primary to backup, the state
> of the backup's disk image is never modified (as judged by the image's
> last-modified time). The backup does switch from "paused" to "running," but
> it consumes 100% CPU and when I connect to its vnc console it is as if the
> VM is frozen. So *something* is being transferred, because I do see the
> screen from the primary, but obviously all is not right, because I can't
> interact with it at all.

Are there any error messages in the backup machine's syslog (or equivalent) about the tapdisks being used for the VM?

Are there error messages in /var/log/xen/xend.log on the backup machine?

> Out of curiosity, in your working Remus deployment, which dom0 kernel are
> you running (and which version of Xen)? I'm running Xen 4.0.1 and the pvops
> 2.6.31.14 dom0 kernel. My understanding was that Remus supported pvops dom0
> 2.6.31.x.

I am running Xen 4.0.1 with pvops 2.6.32.18, but I have not run any HVMs on Remus on my setup yet. So, if your current setup is able to run HVM domUs (without Remus) and you are also able to "live" migrate HVM domUs between the two machines, then the issue is somewhere else, IMO.
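One way to act on the questions above is to filter the backup's logs for lines that mention errors or the tap disk. The sketch below is only a convenience under stated assumptions: `interesting_lines` is a hypothetical helper, the keyword list is a guess based on the xend.log excerpt posted earlier in this thread, and the path in the usage comment is the one named in the reply.

```python
def interesting_lines(lines, keywords=("error", "tapdisk", "blktap", "vbd open failed")):
    """Return (line_number, text) pairs for lines containing any keyword.

    Matching is case-insensitive. The default keywords are assumptions
    drawn from the log excerpt in this thread, not an official list.
    """
    lowered = [k.lower() for k in keywords]
    hits = []
    for number, line in enumerate(lines, 1):
        text = line.rstrip("\n")
        if any(k in text.lower() for k in lowered):
            hits.append((number, text))
    return hits


# Usage on the backup host (path taken from the reply above):
#
#     with open("/var/log/xen/xend.log") as log:
#         for number, text in interesting_lines(log):
#             print("%d: %s" % (number, text))
```

The same function can be pointed at the syslog file; its location depends on the distribution, so it is left to the reader.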
Hi,

Thanks for the response. I looked at all of the error messages and didn't see anything jumping out at me. Stumped, I decided to try upgrading to a Gigabit switch (I had been running 100Mbps). This partially solves the problem, as described below.

The good thing is that the problem I had been describing before (frozen backup, no disk replication) no longer happens. I now see that the disk image on the backup is being updated, and the backup is at least not frozen.

The bad thing is that even though the backup sits in the paused state before the fail-over occurs (as expected), once the fail-over happens it thinks the VM has been shut down and reboots itself. Thus, when I log into the backup's vnc console after pulling the network cable from the primary, I see the VM booting up -- any program that I had running on the primary is (obviously) no longer running. When I log in on the backup, I do see disk modifications reflected in the file system, which leads me to believe that disk replication now works.

I'm not sure how to go about debugging why the backup thinks the domain has been shut down. I've checked the various log files but nothing jumps out at me. Any ideas about why this might be happening or which logs I should be looking at? I'd be happy to post the logs if you think that would help.

Note that I've confirmed that live migration (still) works perfectly. I can migrate from one to the other and back again, with state maintained as it should be and without this "rebooting" issue coming up.

Thanks for the help,
Jon

PS I also don't understand why simply upgrading the switch would cause the other problems to go away. Maybe I'm wrong, but it seems like using the Gigabit switch should have made synchronization happen faster but shouldn't have fundamentally changed the equation. Any thoughts about this?
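On the PS, one possible way to reason about link speed is a back-of-the-envelope estimate of how much memory traffic each checkpoint generates. This is not a diagnosis of the behaviour reported in the thread. The sketch assumes 4 KiB pages, ignores protocol overhead and disk replication traffic, and the workload numbers (2000 dirty pages per checkpoint, one checkpoint every 0.2 s) are purely hypothetical.

```python
PAGE_SIZE_BYTES = 4096  # 4 KiB pages (assumed)


def required_mbit_per_s(dirty_pages_per_checkpoint, checkpoint_interval_s):
    """Link rate needed to ship one checkpoint's dirty pages within one interval."""
    bits = dirty_pages_per_checkpoint * PAGE_SIZE_BYTES * 8
    return bits / float(checkpoint_interval_s) / 1e6


def link_keeps_up(dirty_pages_per_checkpoint, checkpoint_interval_s, link_mbit_per_s):
    """True if the link can carry each checkpoint before the next one is due."""
    needed = required_mbit_per_s(dirty_pages_per_checkpoint, checkpoint_interval_s)
    return needed <= link_mbit_per_s


# Hypothetical workload, chosen only to illustrate the arithmetic.
needed = required_mbit_per_s(2000, 0.2)      # about 327.68 Mbit/s
fits_fast_ethernet = link_keeps_up(2000, 0.2, 100)
fits_gigabit = link_keeps_up(2000, 0.2, 1000)
```

Under these made-up numbers the checkpoint stream would not fit in 100 Mbit/s but would fit in 1 Gbit/s, which is one way a faster link could change more than just the speed of synchronization. Whether that is what happened here cannot be told from the thread.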