stevegt@TerraLuna.Org
2004-Feb-10 05:25 UTC
[Xen-devel] txenmon: Cluster monitoring/management
On Sun, Feb 08, 2004 at 09:19:56AM +0000, Ian Pratt wrote:
> Of course, this will all be much neater in rev 3 of the domain
> control tools that will use a db backend to maintain state about
> currently running domains across a cluster...

Ack! We might be doing duplicate work. How far have you gotten with
this?

Right now I'm running python code (distantly descended from
createlinuxdom.py) that is able to:

 - monitor each domain and restart as needed

 - migrate domains from one host to another

 - dynamically create and assign swap vd's, and garbage collect them
   at shutdown or after crash

...and a few other things. Right now migration is via reboot, not
suspend; haven't had a chance to troubleshoot resume further.

The only thing I'm using VD's for at this point is swap. This code so
far depends on NFS root partitions, all served from a central NFS
server; control and state are communicated via NFS also. I just today
started migrating the control/state comms to jabber instead, so that I
could start using VD root filesystems after the COW stuff settles
down. Haven't decided what to do for migrating filesystems between
nodes in that case, though.

Right now I'm calling this 'txenmon' (TerraLuna Xen Monitor) but was
already considering renaming it 'xenmon' and posting it after I got it
cleaned up.

This is all to support a production Xen cluster rollout that I plan to
have running by the end of this month. I really don't want to go back
to UML at this point, and if I don't have this cluster running by
March I'm in deep doo-doo -- so I'm committed to working full-time on
Xen tools now. ;-}

So here's the current version -- not cleaned up, way too verbose,
crufty, but running:

Steve


#!/usr/bin/python2.2

import Xc, XenoUtil, string, sys, os, time, socket, cPickle

# initialize a few variables that might come in handy
thishostname = socket.gethostname()
if not len(sys.argv) >= 2:
    print "usage: %s /path/to/base/of/users/hosts" % sys.argv[0]
    sys.exit(1)

nfsserv="10.27.2.50"

base = sys.argv[1]
if len(base) > 1 and base.endswith('/'):
    base=base[:-1]

# Obtain an instance of the Xen control interface
xc = Xc.new()

# daemonize
daemonize=0
if daemonize:
    try:
        pid = os.fork()
        if pid > 0:
            # exit first parent
            sys.exit(0)
    except OSError, e:
        print >>sys.stderr, "fork #1 failed: %d (%s)" % (e.errno, e.strerror)
        sys.exit(1)
    # decouple from parent environment
    # os.chdir("/")
    os.setsid()
    os.umask(0)
    # XXX what about stdout etc?
    # do second fork
    try:
        pid = os.fork()
        if pid > 0:
            # exit from second parent, print eventual PID before
            # print "Daemon PID %d" % pid
            sys.exit(0)
    except OSError, e:
        print >>sys.stderr, "fork #2 failed: %d (%s)" % (e.errno, e.strerror)
        sys.exit(1)

def main():
    while 1:
        guests=getGuests(base)
        # state machine
        for guest in guests:
            print guest.path,guest.activeHost,guest.isRunning()
            if guest.isMine():
                if guest.isRunningHere():
                    guest.heartbeat()
                if guest.isRunnable():
                    if guest.isRunningHere():
                        pass
                    else:
                        if guest.isRunning():
                            print "warning: %s is running on %s" % (
                                guest.name, guest.activeHost
                            )
                        else:
                            guest.start()
                else: # not guest.isRunnable()
                    if guest.isRunningHere():
                        guest.shutdown()
                        if guest.isHung():
                            guest.destroy()
            else: # not guest.isMine()
                if guest.isRunningHere():
                    guest.shutdown()
                if guest.isRunning():
                    pass
                else:
                    print "warning: %s is not running on %s" % (
                        guest.name,guest.ctl('host')
                    )
        # end state machine
        # garbage collect vd's
        usedVds=[]
        for guest in guests:
            if guest.isRunningHere():
                usedVds+=guest.vds()
                guest.pickle()
        for vd in listvdids():
            print "usedVds =",usedVds
            if vd in usedVds:
                pass
            else:
                print "deleting vd %s" % vd
                XenoUtil.vd_delete(vd)
        # garbage collect domains
        # XXX
        time.sleep(10)
    # end while

def getGuests(base):
    users=os.listdir(base)
    guests=[]
    for user in users:
        if not os.path.isdir("%s/%s" % (base,user)):
            continue
        guestnames=os.listdir("%s/%s" % (base,user))
        for name in guestnames:
            path="%s/%s/%s" % (base,user,name)
            try:
                try:
                    file=open("%s/log/pickle" % path,"r")
                    guest=cPickle.load(file)
                    file.close()
                except:
                    print "creating",path
                    guest=Guest(path)
            except:
                print "exception creating guest %s/%s: %s" % (
                    user,
                    name,
                    sys.exc_info()[1].__dict__
                )
                continue
            guests.append(guest)
    return guests

def listvdids():
    vdids=[]
    for vbd in XenoUtil.vd_list():
        vdids.append(vbd['vdisk_id'])
    print "listvdids =", vdids
    return vdids


class Guest(object):

    def __init__(self,path):
        self.reload(path)

    def reload(self,path):
        pathparts=path.split('/')
        name=pathparts.pop()
        user=pathparts.pop()
        base='/'.join(pathparts)
        self.path=path
        self.base=base
        self.user=user
        self.name=name
        self.domain_name="%s" % name
        self.ctlcache={}
        # requested domain id number
        self.domid=self.ctl("domid")
        # kernel
        self.image=self.ctl("kernel")
        # memory
        self.memory_megabytes=int(self.ctl("mem"))
        swap=self.ctl("swap")
        (swap_dev,swap_megabytes) = swap.split(",")
        self.swap_dev=swap_dev
        self.swap_megabytes=int(swap_megabytes)
        # ip
        self.ipaddr = [self.ctl("ip")]
        self.netmask = XenoUtil.get_current_ipmask()
        self.gateway = self.ctl("gw")
        # vbd's
        vbds = []
        vbdfile = open("%s/ctl/vbds" % self.path,"r")
        for line in vbdfile.readlines():
            print line
            ( uname, virt_name, rw ) = line.split(',')
            uname = uname.strip()
            virt_name = virt_name.strip()
            rw = rw.strip()
            vbds.append(( uname, virt_name, rw ))
        self.vbds=vbds
        self.vbd_expert = 0
        # build kernel command line
        ipbit = "ip="+self.ipaddr[0]
        ipbit += ":"+nfsserv
        ipbit += ":"+self.gateway+":"+self.netmask+"::eth0:off"
        rootbit = "root=/dev/nfs nfsroot=/export/%s/root" % path
        extrabit = "4 DOMID=%s " % self.domid
        self.cmdline = ipbit +" "+ rootbit +" "+ extrabit
        self.curid=None
        self.swapvdid=None
        self.shutdownTime=None
        self.activeHost=None

    def ctl(self,var):
        filename="%s/ctl/%s" % (self.path,var)
        # if not hasattr(self,'ctlcache'):
        #     print dir(self)
        #     print self.path
        #     self.ctlcache={}
        if not self.ctlcache.has_key(filename):
            self.ctlcache[filename]={'mtime': 0, 'val': None}
        val=None
        mtime=os.path.getmtime(filename)
        if self.ctlcache[filename]['mtime'] < mtime:
            val = open(filename,"r").readline().strip()
            self.ctlcache[filename]={'mtime': mtime, 'val': val}
        else:
            val = self.ctlcache[filename]['val']
        return val

    def destroy(self):
        print "destroying %s" % self.domain_name
        # print "now curid =",self.curid
        if self.curid == 0:
            raise "attempt to kill dom0"
        xc.domain_destroy(dom=self.curid,force=True)

    def heartbeat(self):
        assert self.isRunningHere()
        # update swap expiry to one day
        try:
            XenoUtil.vd_refresh(self.swapvdid, 86400)
        except:
            print "%s missed swap expiry update: %s" % (
                self.domain_name,
                sys.exc_info()[1].__dict__
            )
        self.activeHost=thishostname
        self.pickle()

    def isHung(self):
        if not self.isRunningHere():
            return False
        if self.shutdownTime and time.time() - self.shutdownTime > 300:
            return True
        return False

    def isMine(self):
        if self.ctl("host") == thishostname:
            return True
        return False

    def isRunnable(self):
        run=int(self.ctl("run"))
        if run > 0:
            return True
        return False

    def isRunning(self):
        if self.isRunningHere():
            return True
        else:
            host=self.activeHost
            if host == None:
                return None
            if host == thishostname:
                return False
            filename="%s/log/%s" % (self.path,"pickle")
            mtime=None
            try:
                mtime=os.path.getmtime(filename)
            except:
                return False
            now=time.time()
            if now - mtime < 60:
                return True
            return False

    def isRunningHere(self):
        if not self.curid or self.curid == 0:
            return False
        domains=xc.domain_getinfo()
        domids = [ d['dom'] for d in domains ]
        if self.curid in domids:
            # print self.curid
            return True
        self.curid=None
        return False

    def XXXlog(self,var,val=None,append=False):
        filename="%s/log/%s" % (self.path,var)
        if val==None:
            out=None
            try:
                out=open(filename,"r").readlines()
            except:
                return None
            out=[l.strip() for l in out]
            return out
        mode="w"
        if append:
            mode="a"
        file=open(filename,mode)
        file.write("%s\n" % str(val))
        file.close()

    def mkswap(self):
        # create swap, 1 minute expiry
        vdid=XenoUtil.vd_create(self.swap_megabytes,60)
        # print "vdid =",vdid
        self.swapvdid=vdid
        uname="vd:%s" % vdid
        # format it
        segments = XenoUtil.lookup_disk_uname(uname)
        if XenoUtil.vd_extents_validate(segments,1) < 0:
            print "segment conflict on %s" % uname
            sys.exit(1)
        tmpdev="/dev/xenswap%s" % vdid
        cmd="mknod %s b 125 %s" % (tmpdev,vdid)
        os.system(cmd)
        virt_dev = XenoUtil.blkdev_name_to_number(tmpdev)
        xc.vbd_create(0,virt_dev,1)
        xc.vbd_setextents(0,virt_dev,segments)
        cmd="mkswap %s" % tmpdev
        os.system(cmd)
        xc.vbd_destroy(0,virt_dev)
        self.vbds.append(( uname, self.swap_dev, "w" ))
        print "mkswap:",uname, self.swap_dev, "w"
        print self.vbds

    def pickle(self):
        assert self.isRunningHere()
        # write then rename so others see an atomic operation...
        file=open("%s/log/pickle.new" % self.path,"w")
        cPickle.dump(self,file)
        file.close()
        os.rename(
            "%s/log/pickle.new" % self.path,
            "%s/log/pickle" % self.path
        )

    def shutdown(self):
        print "shutting down %s" % self.name
        # reduce swap expiry to 10 minutes (to give it time to shut down)
        if self.swapvdid:
            XenoUtil.vd_refresh(self.swapvdid, 600)
        xc.domain_destroy(dom=self.curid)
        if not self.shutdownTime:
            self.shutdownTime=time.time()

    def start(self):
        """Create, build and start the domain for this guest."""
        self.reload(self.path)
        image=self.image
        memory_megabytes=self.memory_megabytes
        domain_name=self.domain_name
        ipaddr=self.ipaddr
        netmask=self.netmask
        vbds=self.vbds
        cmdline=self.cmdline
        vbd_expert=self.vbd_expert

        print "Domain image          : ", self.image
        print "Domain memory         : ", self.memory_megabytes
        print "Domain IP address(es) : ", self.ipaddr
        print "Domain block devices  : ", self.vbds
        print 'Domain cmdline        : "%s"' % self.cmdline

        if self.isRunning():
            raise "%s already running on %s" % (self.name,self.activeHost)

        if not os.path.isfile( image ):
            print "Image file '" + image + "' does not exist"
            return None

        id = xc.domain_create( mem_kb=memory_megabytes*1024, name=domain_name )
        print "Created new domain with id = " + str(id)
        if id <= 0:
            print "Error creating domain"
            return None

        ret = xc.linux_build( dom=id, image=image, cmdline=cmdline )
        if ret < 0:
            print "Error building Linux guest OS: "
            print "Return code from linux_build = " + str(ret)
            xc.domain_destroy ( dom=id )
            return None

        # setup the virtual block devices
        # set the expertise level appropriately
        XenoUtil.VBD_EXPERT_MODE = vbd_expert

        self.mkswap()

        self.datavds=[]
        for ( uname, virt_name, rw ) in vbds:
            virt_dev = XenoUtil.blkdev_name_to_number( virt_name )
            segments = XenoUtil.lookup_disk_uname( uname )
            if not segments or segments < 0:
                print "Error looking up %s\n" % uname
                xc.domain_destroy ( dom=id )
                return None

            # check that setting up this VBD won't violate the sharing
            # allowed by the current VBD expertise level
            # print uname, virt_name, rw, segments
            if XenoUtil.vd_extents_validate(segments, rw=='w' or rw=='rw') < 0:
                xc.domain_destroy( dom = id )
                return None

            if xc.vbd_create( dom=id, vbd=virt_dev, writeable= rw=='w' or rw=='rw' ):
                print "Error creating VBD vbd=%d writeable=%s\n" % (virt_dev,rw)
                xc.domain_destroy ( dom=id )
                return None

            if xc.vbd_setextents(
                dom=id,
                vbd=virt_dev,
                extents=segments):
                print "Error populating VBD vbd=%d\n" % virt_dev
                xc.domain_destroy ( dom=id )
                return None
            self.datavds.append(virt_dev)

        # setup virtual firewall rules for all aliases
        for ip in ipaddr:
            XenoUtil.setup_vfr_rules_for_vif( id, 0, ip )

        if xc.domain_start( dom=id ) < 0:
            print "Error starting domain"
            xc.domain_destroy ( dom=id )
            sys.exit()

        self.curid=id
        print "domain (re)started: %s (%d)" % (domain_name,id)
        self.heartbeat()
        return id

    def vds(self):
        vds=[]
        # XXX add data vbds
        vds.append(self.swapvdid)
        return vds


main()

-- 
Stephen G. Traugott (KG6HDQ)
UNIX/Linux Infrastructure Architect, TerraLuna LLC
stevegt@TerraLuna.Org
http://www.stevegt.com -- http://Infrastructures.Org
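[For reference: reload() above expects one value per file under each
guest's ctl directory -- for example, ctl/swap is parsed as
"swap_dev,swap_megabytes". A sketch of what one guest's ctl files
might contain; the file names come from later in this thread, but all
of the values here are made-up examples:

    ctl/domid    13
    ctl/gw       10.27.2.1
    ctl/host     node43
    ctl/ip       10.27.2.113
    ctl/kernel   /boot/xenolinux.gz
    ctl/mem      64
    ctl/run      1
    ctl/swap     /dev/xvda2,128    (parsed as swap_dev,swap_megabytes)
    ctl/vbds     one "uname,virt_name,rw" line per block device
]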
stevegt@TerraLuna.Org
2004-Feb-10 05:41 UTC
[Xen-devel] Re: txenmon: Cluster monitoring/management
One thing I didn't mention: this code can also be killed on the fly
without killing the domains it monitors; on restart it will discover
and adopt them, as well as their swap VD's, and resume monitoring.

This feature was needed because the script dies after a few hours of
running -- I'm getting a SIGABRT from somewhere in the xc libraries
every few hours, I think. Note that I'm not only checking
xc.domain_getinfo(), but also updating the swap VD expiries on every
trip through the while loop; one of those is likely the culprit.

Steve

On Mon, Feb 09, 2004 at 09:25:36PM -0800, stevegt@TerraLuna.Org wrote:
> On Sun, Feb 08, 2004 at 09:19:56AM +0000, Ian Pratt wrote:
> > Of course, this will all be much neater in rev 3 of the domain
> > control tools that will use a db backend to maintain state about
> > currently running domains across a cluster...
>
> Ack! We might be doing duplicate work. How far have you gotten with
> this?
>
> Right now I'm running python code (distantly descended from
> createlinuxdom.py) that is able to:
>
> - monitor each domain and restart as needed
>
> - migrate domains from one host to another
>
> - dynamically create and assign swap vd's, and garbage collect them at
>   shutdown or after crash
> [...]

-- 
Stephen G. Traugott (KG6HDQ)
UNIX/Linux Infrastructure Architect, TerraLuna LLC
stevegt@TerraLuna.Org
http://www.stevegt.com -- http://Infrastructures.Org
Ian Pratt
2004-Feb-10 08:06 UTC
Re: [Xen-devel] txenmon: Cluster monitoring/management
> On Sun, Feb 08, 2004 at 09:19:56AM +0000, Ian Pratt wrote:
> > Of course, this will all be much neater in rev 3 of the domain
> > control tools that will use a db backend to maintain state about
> > currently running domains across a cluster...
>
> Ack! We might be doing duplicate work. How far have you gotten with
> this?

We haven't even started, but have been thinking about the design, and
what the schema for the database should be, etc.

> Right now I'm running python code (distantly descended from
> createlinuxdom.py) that is able to:
>
> - monitor each domain and restart as needed
>
> - migrate domains from one host to another
>
> - dynamically create and assign swap vd's, and garbage collect them at
>   shutdown or after crash
>
> ...and a few other things. Right now migration is via reboot, not
> suspend; haven't had a chance to troubleshoot resume further.

Cool! It's always a nice surprise to find out what work is going on by
people on the list.

You might want to try repulling 1.2 and trying the newer versions of
the tools, which are a bit more user friendly.

> Right now I'm calling this 'txenmon' (TerraLuna Xen Monitor) but was
> already considering renaming it 'xenmon' and posting it after I got it
> cleaned up.

Great, we'd love to see stuff like this in the tree.

Thanks,
Ian
stevegt@TerraLuna.Org
2004-Feb-10 19:47 UTC
[Xen-devel] Re: txenmon: Cluster monitoring/management
On Tue, Feb 10, 2004 at 08:06:25AM +0000, Ian Pratt wrote:
> > On Sun, Feb 08, 2004 at 09:19:56AM +0000, Ian Pratt wrote:
> > > Of course, this will all be much neater in rev 3 of the domain
> > > control tools that will use a db backend to maintain state about
> > > currently running domains across a cluster...
> >
> > Ack! We might be doing duplicate work. How far have you gotten with
> > this?
>
> We haven't even started, but have been thinking about the design,
> and what the schema for the database should be, etc.

When you say "database", do you mean "an independent sqlite running in
each dom0", or do you mean "a central SQL server running somewhere on
a dedicated machine"? (See further down for why I ask.)

As far as schema goes, the things I've needed to track so far are
these "control" items, referenced in the guest.ctl() calls in txenmon:

    domid gw host ip kernel mem run swap vbds

...and I'm considering adding a 'reboot' boolean. I also track several
runtime state items as attributes of the Guest class -- the whole
object is saved as a pickle, so see __init__ for a list of them.

The NFS export directory tree looks something like this:

    /export/xen/fs/stevegt
    /export/xen/fs/stevegt/tcx
    /export/xen/fs/stevegt/tcx/root
    /export/xen/fs/stevegt/tcx/ctl
    /export/xen/fs/stevegt/tcx/log
    /export/xen/fs/stevegt/xentest1
    /export/xen/fs/stevegt/xentest1/root
    /export/xen/fs/stevegt/xentest1/log
    /export/xen/fs/stevegt/xentest1/ctl
    /export/xen/fs/stevegt/xentest2
    /export/xen/fs/stevegt/xentest2/root
    /export/xen/fs/stevegt/xentest2/log
    /export/xen/fs/stevegt/xentest2/ctl
    /export/xen/fs/stevegt/crashme1
    /export/xen/fs/stevegt/crashme1/root
    /export/xen/fs/stevegt/crashme1/ctl
    /export/xen/fs/stevegt/crashme1/log

...where 'stevegt' is a user who owns one or more virtual domains, and
'xentest1' is the hostname of a virtual domain. Those control items I
mentioned above go in individual files (qmail style) under ./ctl, and
the python pickle for each virtual domain is saved as ./log/pickle.
The root partition for each domain is under ./root.

Here's what the contents of ./ctl look like for a given guest:

    nfs1:/export/xen# ls -l /export/xen/fs/stevegt/tcx/ctl
    total 32
    -rw-r--r--    1 root   root      3 Feb  8 20:57 domid
    -rw-r--r--    1 root   root     12 Feb  5 22:51 gw
    -rw-r--r--    1 root   root      6 Feb  9 21:56 host
    -rw-r--r--    1 root   root     13 Feb  8 20:57 ip
    -rw-r--r--    1 root   root     30 Feb  5 22:52 kernel
    -rw-r--r--    1 root   root      4 Feb  9 17:47 mem
    -rw-r--r--    1 root   root      2 Feb  9 21:56 run
    -rw-r--r--    1 root   root     14 Feb  5 22:53 swap
    -rw-r--r--    1 root   root      0 Feb  5 22:52 vbds

Because these are individual files, this makes it easy to say, for
instance, 'echo 0 > run' from a shell prompt to cause a domain to shut
down, or 'echo node43 > host' to cause it to move to a different node.

I considered using the sqlite db for these things; I didn't do that
(1) because this was faster to implement and easier to access from the
command line, and (2) because I didn't want to cause future schema
conflicts with whatever you were going to do.

* * *

Having said all this, I'm less worried about schema and more worried
about single points of failure. Right now txenmon runs in domain 0 on
each node, and the data store is distributed as above. This gives me a
dependence on the central NFS server staying up, but an NFS server is
a relatively simple thing: it can be HA'd, backed up easily, and will
tend to have uptimes in the hundreds of days anyway as long as you
leave it alone.

If these data items were to move to a "real" database server instead
-- say a central mysql or postgresql server -- then I'd worry more;
database servers aren't as easy to keep available for hundreds of days
without interruption. (See http://Infrastructures.Org for more of my
perspective on this.)

I'm moving in the direction of keeping some sort of distributed data
store, like those flat files and python pickles (or maybe the sqlite
on each dom0?), which can be cached on local disk in each dom0, and
then using something like UDP broadcast (simple) or XMPP/jabber (less
simple) as a peer-to-peer communications mechanism to keep the caches
synced.

My goal here is to be able to walk into a Xen data center and destroy
any random machine without impacting any user for more than a few
minutes. (See http://www.infrastructures.org/bootstrap/recovery.shtml.)

To this end, I'm curious what people's thoughts are on backups and
real-time replication of virtual disks -- I'm only using them for swap
right now because of these issues.

* * *

> Cool! It's always a nice surprise to find out what work is
> going on by people on the list.

As I said last night, you have me full time right now. ;-) My wife and
I are launching a commercial service based on Xen (we were evaluating
UML). I have until the end of March. If enough revenue is flowing by
then, you get to keep me. If not, then "the boss" will tell me to put
myself back on the consulting market. Nothing like a little pressure.
;-)

> You might want to try repulling 1.2 and trying the newer versions
> of the tools which are a bit more user friendly.

My most recent pull was a week ago; this got me xc_dom_control and
xc_vd_tool. I'll likely do another pull this week. We already have one
production customer (woo hoo!), so I am trying to limit
upgrades/reboots for them.

> Great, we'd love to see stuff like this in the tree.

Would it help if I exposed a bk repository you could pull from, or how
do you want to do this?

Steve

-- 
Stephen G. Traugott (KG6HDQ)
UNIX/Linux Infrastructure Architect, TerraLuna LLC
stevegt@TerraLuna.Org
http://www.stevegt.com -- http://Infrastructures.Org
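[To make the UDP-broadcast cache-sync idea above concrete, here is a
minimal sketch in the same python-2.2 style as txenmon. It is
illustrative only, not part of txenmon: the port number and the
pickle-over-datagram message format are assumptions invented for this
example.

    # Sketch: each dom0 broadcasts its guest state, and peers merge
    # what they hear into a local cache.  The port and message format
    # are made up for this example.  Note that unpickling datagrams
    # from the network is only safe on a trusted LAN.
    import socket, cPickle, time

    PORT = 30301   # hypothetical; any agreed-on port would do

    def announce(state):
        # broadcast this host's guest-state dict to all peers
        s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        s.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        s.sendto(cPickle.dumps(state), ('<broadcast>', PORT))
        s.close()

    def listen(cache):
        # merge peers' announcements into a local cache, timestamped so
        # stale entries can be expired (cf. the 60-second pickle-mtime
        # check in Guest.isRunning() above)
        s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        s.bind(('', PORT))
        while 1:
            data, (peer, port) = s.recvfrom(65536)
            cache[peer] = (time.time(), cPickle.loads(data))
]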
Williamson, Mark A
2004-Feb-10 19:55 UTC
RE: [Xen-devel] txenmon: Cluster monitoring/management
> ...and a few other things. Right now migration is via reboot, not
> suspend; haven't had a chance to troubleshoot resume further.

We've improved the front-end to the resume functionality since you
highlighted the problem, so you may want to have a look at the
modified tools if you have time.

The previous version of xc_dom_control.py just called linux_restore in
the Xc library in order to reload a domain's memory state. That didn't
recreate all of the VBDs, or set up the appropriate VFR (Virtual
Firewall Router) state, which was the problem you'd experienced.

I'm not sure what version of the tools you're using. We now use
'xc_dom_create.py' to start / restore domains -- it can read its
configuration from a file, using the '-f' option. We use
xc_dom_control.py to control running domains.

Using the latest tools, your domains should be restored using
xc_dom_create.py, specifying the original configuration file as usual
with the '-f' flag (which provides the information for setting up the
VFR / VBDs again), but also the domain memory state file, using the
'-L' flag for 'Load domain state from file'. That way, the VBD / VFR
state gets put back before the domain is restarted.

Also, the save option of xc_dom_control.py is now 'suspend', and it
stops and destroys the copy of the domain in memory after it has been
suspended to disk (so it can't change its persistent storage, etc.,
which would otherwise confuse the image if you resume from file
later).

> This is all to support a production Xen cluster rollout that I plan to
> have running by the end of this month. I really don't want to go back
> to UML at this point, and if I don't have this cluster running by March
> I'm in deep doo-doo -- so I'm committed to working full-time on Xen
> tools now. ;-}

Thanks for the contribution! And good luck, too!

Mark
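[Putting Mark's description together into commands, the round trip
looks roughly like the sketch below, written in txenmon's os.system()
style. Only the 'suspend' subcommand and the '-f'/'-L' flags come from
the message above; the file paths are made-up examples and the
argument order for 'suspend' is a guess.

    # Sketch of the suspend/restore round trip Mark describes.  Paths
    # are invented for illustration; the 'suspend' argument order is a
    # guess -- only 'suspend', '-f' and '-L' come from the thread.
    import os

    domid = 13                                          # example domain id
    cfg   = "/etc/xen/tcx.cfg"                          # hypothetical config file
    state = "/export/xen/fs/stevegt/tcx/log/memstate"   # hypothetical state file

    # suspend: save the memory image to disk, then destroy the
    # in-memory copy so it can't keep writing to persistent storage
    os.system("xc_dom_control.py suspend %d %s" % (domid, state))

    # restore: '-f' rebuilds VBD / VFR state from the original config,
    # '-L' reloads the saved memory image
    os.system("xc_dom_create.py -f %s -L %s" % (cfg, state))
]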
stevegt@TerraLuna.Org
2004-Feb-10 20:13 UTC
Re: [Xen-devel] txenmon: Cluster monitoring/management
I see these save/restore updates in 'bk changes -R' -- my last pull
was Monday a week ago. And yes, using xc_dom_create.py for restore
sounds like exactly the right idea; I had just hit that realization
myself late last night.

Pulling 1.2 at this instant; I'll exercise it and let you know how it
goes.

Steve

On Tue, Feb 10, 2004 at 07:55:36PM -0000, Williamson, Mark A wrote:
> > ...and a few other things. Right now migration is via reboot, not
> > suspend; haven't had a chance to troubleshoot resume further.
>
> We've improved the front-end to the resume functionality since you
> highlighted the problem, so you may want to have a look at the
> modified tools if you have time.
> [...]

-- 
Stephen G. Traugott (KG6HDQ)
UNIX/Linux Infrastructure Architect, TerraLuna LLC
stevegt@TerraLuna.Org
http://www.stevegt.com -- http://Infrastructures.Org
Bin Ren
2004-Feb-11 00:43 UTC
Re: [Xen-devel] Re: txenmon: Cluster monitoring/management
On 10 Feb 2004, at 19:47, stevegt@TerraLuna.Org wrote:
> nfs1:/export/xen# ls -l /export/xen/fs/stevegt/tcx/ctl
> total 32
> -rw-r--r--    1 root   root      3 Feb  8 20:57 domid
> [...]
> -rw-r--r--    1 root   root      0 Feb  5 22:52 vbds
>
> Because these are individual files, this makes it easy to say, for
> instance, 'echo 0 > run' from a shell prompt to cause a domain to shut
> down, or 'echo node43 > host' to cause it to move to a different node.

Hey, this is the very Plan9 style, isn't it?! ;-p

-- Bin
stevegt@TerraLuna.Org
2004-Feb-11 04:17 UTC
Re: [Xen-devel] Re: txenmon: Cluster monitoring/management
On Wed, Feb 11, 2004 at 12:43:59AM +0000, Bin Ren wrote:
> On 10 Feb 2004, at 19:47, stevegt@TerraLuna.Org wrote:
> > nfs1:/export/xen# ls -l /export/xen/fs/stevegt/tcx/ctl
> > [...]
> >
> > Because these are individual files, this makes it easy to say, for
> > instance, 'echo 0 > run' from a shell prompt to cause a domain to shut
> > down, or 'echo node43 > host' to cause it to move to a different node.
>
> Hey, this is the very Plan9 style, isn't it?! ;-p

Is it? Never played with that. I used to live behind Murray Hill Bell
Labs and work at USL; maybe I got polluted. ;-}

Steve

-- 
Stephen G. Traugott (KG6HDQ)
UNIX/Linux Infrastructure Architect, TerraLuna LLC
stevegt@TerraLuna.Org
http://www.stevegt.com -- http://Infrastructures.Org
stevegt@TerraLuna.Org
2004-Feb-11 08:29 UTC
Re: [Xen-devel] txenmon: Cluster monitoring/management
Though I didn't get a working xenolinux today, I did decide to try
today's tools with the 02 Feb 1.2 xen/xenolinux.

- I like the new 'list' output. Pretty! ;-)

- After dealing with the builder_fn='xc.linux_build' vs.
  builder_fn='linux' change, I was able to suspend and restore a
  virtual domain just fine, including its swap VD. I even ran my perl
  malloc torture test to make sure the swap device actually worked.

- Amusing note: I forgot to log off of the virtual domain before
  suspending it. I thought about this after I resumed, thought "aww,
  gotta ssh in again", and then was in for a surprise. The socket
  survived. Very cool. ;-)

- I'll go ahead and integrate the suspend/restore (should we just call
  this "resume"?) machinery into txenmon, so it can migrate guests
  between xenoservers without rebooting them. I haven't tested restore
  to a different xenoserver yet, but am hoping that will just work
  (one possible shape is sketched below).

G'night all,

Steve

On Tue, Feb 10, 2004 at 12:13:39PM -0800, stevegt@TerraLuna.Org wrote:
> I see these save/restore updates in 'bk changes -R' -- my last pull
> was Monday a week ago. And yes, using xc_dom_create.py for restore
> sounds like exactly the right idea; had just hit that realization
> myself late last night.
>
> Pulling 1.2 at this instant; I'll exercise it and let you know how it
> goes.
> [...]

-- 
Stephen G. Traugott (KG6HDQ)
UNIX/Linux Infrastructure Architect, TerraLuna LLC
stevegt@TerraLuna.Org
http://www.stevegt.com -- http://Infrastructures.Org
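[For the cross-xenoserver migration in the last bullet above, one
plausible shape for the txenmon integration is sketched here. This is
a sketch only: it assumes the memory state file lives on the shared
NFS export so any node can reach it, it reuses the guessed command
forms from earlier in the thread, and migrate()/adopt() are
hypothetical helpers, not txenmon methods.

    # Sketch: suspend on the source node, flip ctl/host, restore on the
    # target.  Argument forms are guesses based on Mark's description
    # of 'suspend' and the '-f'/'-L' flags; paths are hypothetical.
    import os

    def migrate(guest, target_host):
        # source xenoserver: park the memory image on the shared NFS
        # export, then hand ownership to the target via ctl/host, the
        # same way 'echo node43 > host' does from a shell
        state = "%s/log/memstate" % guest.path
        os.system("xc_dom_control.py suspend %d %s" % (guest.curid, state))
        os.system("echo %s > %s/ctl/host" % (target_host, guest.path))

    def adopt(guest, cfg):
        # target xenoserver (now named in ctl/host): rebuild VBD / VFR
        # state from the config and reload the parked memory image
        state = "%s/log/memstate" % guest.path
        os.system("xc_dom_create.py -f %s -L %s" % (cfg, state))
]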