Hi,

Could anybody (Christian?) please tell me whether we can get support for
registering watches with /proc/xen/xenbus?  (OK, I know this /proc stuff
will be changed to a device soon.)

So far we can only do read/write/rm.  I really miss the xen watch feature.

Many thanks,
Hieu

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
On 9/8/05, NAHieu <nahieu@gmail.com> wrote:
> Could anybody (Christian?) please tell me whether we can get support for
> registering watches with /proc/xen/xenbus?  (OK, I know this /proc stuff
> will be changed to a device soon.)
>
> So far we can only do read/write/rm.  I really miss the xen watch feature.

I don't plan to add support for watches anytime soon myself.  Patches
are welcome of course!

    christian
On Thu, 2005-09-08 at 11:38 +0100, Christian Limpach wrote:
> On 9/8/05, NAHieu <nahieu@gmail.com> wrote:
> > Could anybody (Christian?) please tell me whether we can get support for
> > registering watches with /proc/xen/xenbus?  (OK, I know this /proc stuff
> > will be changed to a device soon.)
> >
> > So far we can only do read/write/rm.  I really miss the xen watch feature.

At the moment, the xenbus device is a simple hack which grabs the lock
on all store communication on open, and drops it on close.  It's not
really a general mechanism for tools in domUs to communicate with the
store.

BTW, Christian, you'll be happy to know I'm testing a patch which
changes the unnatural ioctl into a read/write interface 8)

Cheers,
Rusty.
--
A bad analogy is like a leaky screwdriver -- Richard Braakman
On 9/9/05, Rusty Russell <rusty@rustcorp.com.au> wrote:
> At the moment, the xenbus device is a simple hack which grabs the lock
> on all store communication on open, and drops it on close.  It's not
> really a general mechanism for tools in domUs to communicate with the
> store.

Could we reduce the time we hold the lock, so that it is held only from
before we call xb_write until both data->bytes_left and
data->awaiting_reply are 0 again?  I think this would work except for
transactions.  How do you feel about supporting multiple transactions
per connection?  Should we at least extend the interface now so that we
can add concurrent transaction support later, i.e. have transaction
start return a handle and have all operations take a handle argument?

    christian
On Tue, 2005-09-13 at 10:42 +0100, Christian Limpach wrote:
> Could we reduce the time we hold the lock, so that it is held only from
> before we call xb_write until both data->bytes_left and
> data->awaiting_reply are 0 again?  I think this would work except for
> transactions.  How do you feel about supporting multiple transactions
> per connection?  Should we at least extend the interface now so that we
> can add concurrent transaction support later, i.e. have transaction
> start return a handle and have all operations take a handle argument?

I think if we want to do this we should actually introduce a new
mechanism for communications.  There's no reason why a domain can't
introduce more pages for xenstore communication beyond the one it is
given to start with, is there?  (OK, unintroduce needs to take a shared
page instead of/as well as a domid.)

So when someone opens the xenbus dev, we introduce a new page to the
domain and xenstored uses that for comms.  When closed, the page is
released.  This actually simplifies the xenbus_dev driver a lot: now
it's just a dumb pass-through, since we don't have to worry about the
userspace program blowing chunks all over the kernel's comms mechanism.

I'll hack something up and see what you think...

Rusty.
--
A bad analogy is like a leaky screwdriver -- Richard Braakman
On Wed, Sep 14, 2005 at 10:21:04AM +1000, Rusty Russell wrote:
> I think if we want to do this we should actually introduce a new
> mechanism for communications.  There's no reason why a domain can't
> introduce more pages for xenstore communication beyond the one it is
> given to start with, is there?  (OK, unintroduce needs to take a shared
> page instead of/as well as a domid.)
>
> So when someone opens the xenbus dev, we introduce a new page to the
> domain and xenstored uses that for comms.  When closed, the page is
> released.  This actually simplifies the xenbus_dev driver a lot: now
> it's just a dumb pass-through, since we don't have to worry about the
> userspace program blowing chunks all over the kernel's comms mechanism.

That would work, but there are a few reasons why this is impractical:

- xenstored needs to keep track of these pages, and this consumes
  resources in the server.  There's no way for xenstored to close a
  connection to free up resources, unless we want to start handling
  this case in all clients.  With multiple transactions, we can just
  time out the transaction after 5 minutes; we have to handle the
  timeout case in the client already anyway.

- this complicates save/restore yet again, because we'll have to
  reconnect all these pages when the domain is restored

    christian
On Wed, 2005-09-14 at 10:21 +1000, Rusty Russell wrote:
> So when someone opens the xenbus dev, we introduce a new page to the
> domain and xenstored uses that for comms.  When closed, the page is
> released.  This actually simplifies the xenbus_dev driver a lot: now
> it's just a dumb pass-through, since we don't have to worry about the
> userspace program blowing chunks all over the kernel's comms mechanism.
>
> I'll hack something up and see what you think...

OK, I realized there's a problem with suspend/resume.  Within the
kernel, we prevent suspend/resume by simply holding the xenstore_lock,
so it can't happen during normal operations, or transactions.

So let's leave this for now.  It is fairly similar to the problem of
xenstored restarts, which I'm trying to merge, so I suggest we revisit
after that...

Patch for reading only (I subbed in NULL for the default store page for
the moment, since I know there's another patch out there which touches
this).

diff -r 0d8c0db04258 linux-2.6-xen-sparse/drivers/xen/xenbus/xenbus_comms.c
--- a/linux-2.6-xen-sparse/drivers/xen/xenbus/xenbus_comms.c	Tue Sep 13 21:52:24 2005
+++ b/linux-2.6-xen-sparse/drivers/xen/xenbus/xenbus_comms.c	Wed Sep 14 19:16:48 2005
@@ -46,14 +46,18 @@
 
 DECLARE_WAIT_QUEUE_HEAD(xb_waitq);
 
-static inline struct ringbuf_head *outbuf(void)
-{
-	return mfn_to_virt(xen_start_info->store_mfn);
-}
-
-static inline struct ringbuf_head *inbuf(void)
-{
-	return mfn_to_virt(xen_start_info->store_mfn) + PAGE_SIZE/2;
+static inline struct ringbuf_head *outbuf(void *page)
+{
+	if (!page)
+		return mfn_to_virt(xen_start_info->store_mfn);
+	return page;
+}
+
+static inline struct ringbuf_head *inbuf(void *page)
+{
+	if (!page)
+		return mfn_to_virt(xen_start_info->store_mfn) + PAGE_SIZE/2;
+	return page + PAGE_SIZE/2;
 }
 
 static irqreturn_t wake_waiting(int irq, void *unused, struct pt_regs *regs)
@@ -117,10 +121,10 @@
 	return avail != 0;
 }
 
-int xb_write(const void *data, unsigned len)
+int xb_write(const void *data, unsigned len, void *page)
 {
 	struct ringbuf_head h;
-	struct ringbuf_head *out = outbuf();
+	struct ringbuf_head *out = outbuf(page);
 
 	do {
 		void *dst;
@@ -151,19 +155,19 @@
 	return 0;
 }
 
-int xs_input_avail(void)
+int xs_input_avail(void *page)
 {
 	unsigned int avail;
-	struct ringbuf_head *in = inbuf();
+	struct ringbuf_head *in = inbuf(page);
 
 	get_input_chunk(in, in->buf, &avail);
 	return avail != 0;
 }
 
-int xb_read(void *data, unsigned len)
+int xb_read(void *data, unsigned len, void *page)
 {
 	struct ringbuf_head h;
-	struct ringbuf_head *in = inbuf();
+	struct ringbuf_head *in = inbuf(page);
 	int was_full;
 
 	while (len != 0) {
@@ -200,34 +204,25 @@
 	return 0;
 }
 
-/* Set up interrupt handler off store event channel. */
-int xb_init_comms(void)
+/* Set up interrupt handler off store event channel, and clear page. */
+int xb_init_comms(unsigned int evtchn, void *page)
 {
 	int err;
 
-	if (!xen_start_info->store_evtchn)
-		return 0;
-
-	err = bind_evtchn_to_irqhandler(
-		xen_start_info->store_evtchn, wake_waiting,
-		0, "xenbus", &xb_waitq);
+	err = bind_evtchn_to_irqhandler(evtchn, wake_waiting,
					0, "xenbus", &xb_waitq);
 	if (err) {
 		printk(KERN_ERR "XENBUS request irq failed %i\n", err);
-		unbind_evtchn_from_irq(xen_start_info->store_evtchn);
 		return err;
 	}
 
 	/* FIXME zero out page -- domain builder should probably do this*/
-	memset(mfn_to_virt(xen_start_info->store_mfn), 0, PAGE_SIZE);
-
+	memset(inbuf(page), 0, PAGE_SIZE/2);
+	memset(outbuf(page), 0, PAGE_SIZE/2);
 	return 0;
 }
 
-void xb_suspend_comms(void)
-{
-
-	if (!xen_start_info->store_evtchn)
-		return;
-
-	unbind_evtchn_from_irqhandler(xen_start_info->store_evtchn, &xb_waitq);
-}
+void xb_stop_comms(unsigned int evtchn)
+{
+	unbind_evtchn_from_irqhandler(evtchn, &xb_waitq);
+}
diff -r 0d8c0db04258 linux-2.6-xen-sparse/drivers/xen/xenbus/xenbus_comms.h
--- a/linux-2.6-xen-sparse/drivers/xen/xenbus/xenbus_comms.h	Tue Sep 13 21:52:24 2005
+++ b/linux-2.6-xen-sparse/drivers/xen/xenbus/xenbus_comms.h	Wed Sep 14 19:16:48 2005
@@ -29,13 +29,13 @@
 #define _XENBUS_COMMS_H
 
 int xs_init(void);
-int xb_init_comms(void);
-void xb_suspend_comms(void);
+int xb_init_comms(unsigned int evtchn, void *page);
+void xb_stop_comms(unsigned int evtchn);
 
-/* Low level routines. */
-int xb_write(const void *data, unsigned len);
-int xb_read(void *data, unsigned len);
-int xs_input_avail(void);
+/* Low level routines: page is NULL for default store page */
+int xb_write(const void *data, unsigned len, void *page);
+int xb_read(void *data, unsigned len, void *page);
+int xs_input_avail(void *page);
 extern wait_queue_head_t xb_waitq;
 
 #endif /* _XENBUS_COMMS_H */
diff -r 0d8c0db04258 linux-2.6-xen-sparse/drivers/xen/xenbus/xenbus_dev.c
--- a/linux-2.6-xen-sparse/drivers/xen/xenbus/xenbus_dev.c	Tue Sep 13 21:52:24 2005
+++ b/linux-2.6-xen-sparse/drivers/xen/xenbus/xenbus_dev.c	Wed Sep 14 19:16:48 2005
@@ -36,6 +36,7 @@
 #include <linux/notifier.h>
 #include <linux/wait.h>
 #include <linux/fs.h>
+#include <linux/spinlock.h>
 
 #include "xenbus_comms.h"
 
@@ -45,90 +46,47 @@
 #include <asm-xen/linux-public/xenstored.h>
 
 struct xenbus_dev_data {
-	/* Are there bytes left to be read in this message? */
-	int bytes_left;
-	/* Are we still waiting for the reply to a message we wrote? */
-	int awaiting_reply;
-	/* Buffer for outgoing messages. */
-	unsigned int len;
-	union {
-		struct xsd_sockmsg msg;
-		char buffer[PAGE_SIZE];
-	} u;
+	unsigned int evtchn;
+	char *page;
+	char bouncebuf[PAGE_SIZE];
 };
 
 static struct proc_dir_entry *xenbus_dev_intf;
 
-/* Reply can be long (dir, getperm): don't buffer, just examine
- * headers so we can discard rest if they die. */
 static ssize_t xenbus_dev_read(struct file *filp,
			       char __user *ubuf,
			       size_t len, loff_t *ppos)
 {
 	struct xenbus_dev_data *data = filp->private_data;
-	struct xsd_sockmsg msg;
-	int err;
+	int ret;
 
-	/* Refill empty buffer? */
-	if (data->bytes_left == 0) {
-		if (len < sizeof(msg))
-			return -EINVAL;
-
-		err = xb_read(&msg, sizeof(msg));
-		if (err)
-			return err;
-		data->bytes_left = msg.len;
-		if (ubuf && copy_to_user(ubuf, &msg, sizeof(msg)) != 0)
-			return -EFAULT;
-		/* We can receive spurious XS_WATCH_EVENT messages. */
-		if (msg.type != XS_WATCH_EVENT)
-			data->awaiting_reply = 0;
-		return sizeof(msg);
-	}
-
-	/* Don't read over next header, or over temporary buffer. */
-	if (len > sizeof(data->u.buffer))
-		len = sizeof(data->u.buffer);
-	if (len > data->bytes_left)
-		len = data->bytes_left;
-
-	err = xb_read(data->u.buffer, len);
-	if (err)
-		return err;
-
-	data->bytes_left -= len;
-	if (ubuf && copy_to_user(ubuf, data->u.buffer, len) != 0)
-		return -EFAULT;
-	return len;
+	if (len > sizeof(data->bouncebuf))
+		len = sizeof(data->bouncebuf);
+	ret = xb_read(data->bouncebuf, len, data->page);
+	if (ret)
+		return ret;
+	if (copy_to_user(ubuf, data->bouncebuf, len) != 0)
+		return -EFAULT;
+	return len;
 }
 
-/* We do v. basic sanity checking so they don't screw up kernel later. */
 static ssize_t xenbus_dev_write(struct file *filp,
				const char __user *ubuf,
				size_t len, loff_t *ppos)
 {
 	struct xenbus_dev_data *data = filp->private_data;
-	int err;
+	int ret;
 
-	/* We gather data in buffer until we're ready to send it. */
-	if (len > data->len + sizeof(data->u))
-		return -EINVAL;
-	if (copy_from_user(data->u.buffer + data->len, ubuf, len) != 0)
+	if (len > sizeof(data->bouncebuf))
+		len = sizeof(data->bouncebuf);
+	if (copy_from_user(data->bouncebuf, ubuf, len) != 0)
 		return -EFAULT;
-	data->len += len;
-	if (data->len >= sizeof(data->u.msg) + data->u.msg.len) {
-		err = xb_write(data->u.buffer, data->len);
-		if (err)
-			return err;
-		data->len = 0;
-		data->awaiting_reply = 1;
-	}
-	return len;
+	ret = xb_write(data->bouncebuf, len, data->page);
+	if (ret)
+		return ret;
+	return len;
 }
 
 static int xenbus_dev_open(struct inode *inode, struct file *filp)
 {
-	struct xenbus_dev_data *u;
+	struct xenbus_dev_data *data;
+	int err;
+	evtchn_op_t op = { .cmd = EVTCHNOP_alloc_unbound };
 
 	if (xen_start_info->store_evtchn == 0)
 		return -ENOENT;
@@ -136,34 +94,71 @@
 	/* Don't try seeking. */
 	nonseekable_open(inode, filp);
 
-	u = kmalloc(sizeof(*u), GFP_KERNEL);
-	if (u == NULL)
-		return -ENOMEM;
+	data = kmalloc(sizeof(*data), GFP_KERNEL);
+	if (!data) {
+		err = -ENOMEM;
+		goto out;
+	}
+	data->page = (void *)__get_free_page(GFP_KERNEL);
+	if (!data->page) {
+		err = -ENOMEM;
+		goto free_data;
+	}
+	op.u.alloc_unbound.dom = 0;
+	err = HYPERVISOR_event_channel_op(&op);
+	if (err)
+		goto free_page;
+	data->evtchn = op.u.alloc_unbound.port;
+	err = xb_init_comms(data->evtchn, data->page);
+	if (err)
+		goto release_evtchn;
 
-	memset(u, 0, sizeof(*u));
+	down(&xenbus_lock);
+	err = xenbus_introduce(virt_to_mfn(data->page), data->evtchn);
+	up(&xenbus_lock);
+	if (err)
+		goto release_comms;
 
-	filp->private_data = u;
+	filp->private_data = data;
+	return 0;
 
-	down(&xenbus_lock);
+release_comms:
+	xb_stop_comms(data->evtchn);
+release_evtchn:
+	op.cmd = EVTCHNOP_close;
+	op.u.close.dom = DOMID_SELF;
+	op.u.close.port = data->evtchn;
+	if (HYPERVISOR_event_channel_op(&op) != 0)
+		printk(KERN_WARNING "xenbus_dev: channel close failed\n");
+free_page:
+	free_page((unsigned long)data->page);
+free_data:
+	kfree(data);
+out:
+	return err;
-
-	return 0;
 }
 
 static int xenbus_dev_release(struct inode *inode, struct file *filp)
 {
 	struct xenbus_dev_data *data = filp->private_data;
+	evtchn_op_t op = { .cmd = EVTCHNOP_close };
 
-	/* Discard any unread replies. */
-	while (data->bytes_left || data->awaiting_reply)
-		xenbus_dev_read(filp, NULL, sizeof(data->u.buffer), NULL);
-
-	/* Harmless if no transaction in progress. */
-	xenbus_transaction_end(1);
-
+	down(&xenbus_lock);
+	if (xenbus_release(virt_to_mfn(data->page)) != 0)
+		printk(KERN_WARNING "xenbus_dev: release failed\n");
 	up(&xenbus_lock);
+	xb_stop_comms(data->evtchn);
+	op.u.close.dom = DOMID_SELF;
+	op.u.close.port = data->evtchn;
+	if (HYPERVISOR_event_channel_op(&op) != 0)
+		printk(KERN_WARNING "xenbus_dev: channel close failed\n");
+	free_page((unsigned long)data->page);
 	kfree(data);
-
 	return 0;
 }
diff -r 0d8c0db04258 linux-2.6-xen-sparse/drivers/xen/xenbus/xenbus_probe.c
--- a/linux-2.6-xen-sparse/drivers/xen/xenbus/xenbus_probe.c	Tue Sep 13 21:52:24 2005
+++ b/linux-2.6-xen-sparse/drivers/xen/xenbus/xenbus_probe.c	Wed Sep 14 19:16:48 2005
@@ -607,12 +607,12 @@
 	down(&xenbus_lock);
 	bus_for_each_dev(&xenbus_frontend.bus, NULL, NULL, suspend_dev);
 	bus_for_each_dev(&xenbus_backend.bus, NULL, NULL, suspend_dev);
-	xb_suspend_comms();
+	xb_stop_comms(xen_start_info->store_evtchn);
 }
 
 void xenbus_resume(void)
 {
-	xb_init_comms();
+	xb_init_comms(xen_start_info->store_evtchn, NULL);
 	reregister_xenbus_watches();
 	bus_for_each_dev(&xenbus_frontend.bus, NULL, NULL, resume_dev);
 	bus_for_each_dev(&xenbus_backend.bus, NULL, NULL, resume_dev);
diff -r 0d8c0db04258 linux-2.6-xen-sparse/drivers/xen/xenbus/xenbus_xs.c
--- a/linux-2.6-xen-sparse/drivers/xen/xenbus/xenbus_xs.c	Tue Sep 13 21:52:24 2005
+++ b/linux-2.6-xen-sparse/drivers/xen/xenbus/xenbus_xs.c	Wed Sep 14 19:16:48 2005
@@ -426,6 +426,44 @@
 }
 EXPORT_SYMBOL(xenbus_gather);
 
+int xenbus_introduce(unsigned long mfn, unsigned int evtchn)
+{
+	char domid_str[21];
+	char mfn_str[21];
+	char evtchn_str[21];
+	struct kvec iov[4];
+
+	sprintf(domid_str, "%u", DOMID_SELF);
+	sprintf(mfn_str, "%lu", mfn);
+	sprintf(evtchn_str, "%u", evtchn);
+
+	iov[0].iov_base = domid_str;
+	iov[0].iov_len = strlen(domid_str) + 1;
+	iov[1].iov_base = mfn_str;
+	iov[1].iov_len = strlen(mfn_str) + 1;
+	iov[2].iov_base = evtchn_str;
+	iov[2].iov_len = strlen(evtchn_str) + 1;
+	iov[3].iov_base = "";
+	iov[3].iov_len = 1;
+
+	return xs_error(xs_talkv(XS_INTRODUCE, iov, ARRAY_SIZE(iov), NULL));
+}
+
+int xenbus_release(unsigned long mfn)
+{
+	struct kvec iov[2];
+	char domid_str[21], mfn_str[21];
+
+	sprintf(domid_str, "%u", DOMID_SELF);
+	sprintf(mfn_str, "%lu", mfn);
+	iov[0].iov_base = domid_str;
+	iov[0].iov_len = strlen(domid_str) + 1;
+	iov[1].iov_base = mfn_str;
+	iov[1].iov_len = strlen(mfn_str) + 1;
+
+	return xs_error(xs_talkv(XS_RELEASE, iov, ARRAY_SIZE(iov), NULL));
+}
+
 static int xs_watch(const char *path, const char *token)
 {
 	struct kvec iov[2];
@@ -569,7 +607,7 @@
 	int err;
 	struct task_struct *watcher;
 
-	err = xb_init_comms();
+	err = xb_init_comms(xen_start_info->store_evtchn, NULL);
 	if (err)
 		return err;
diff -r 0d8c0db04258 linux-2.6-xen-sparse/include/asm-xen/xenbus.h
--- a/linux-2.6-xen-sparse/include/asm-xen/xenbus.h	Tue Sep 13 21:52:24 2005
+++ b/linux-2.6-xen-sparse/include/asm-xen/xenbus.h	Wed Sep 14 19:16:48 2005
@@ -90,6 +90,8 @@
 int xenbus_rm(const char *dir, const char *node);
 int xenbus_transaction_start(const char *subtree);
 int xenbus_transaction_end(int abort);
+int xenbus_introduce(unsigned long mfn, unsigned int evtchn);
+int xenbus_release(unsigned long mfn);
 
 /* Single read and scanf: returns -errno or num scanned if > 0. */
 int xenbus_scanf(const char *dir, const char *node, const char *fmt, ...)
diff -r 0d8c0db04258 tools/python/xen/lowlevel/xs/xs.c
--- a/tools/python/xen/lowlevel/xs/xs.c	Tue Sep 13 21:52:24 2005
+++ b/tools/python/xen/lowlevel/xs/xs.c	Wed Sep 14 19:16:48 2005
@@ -708,7 +708,7 @@
 }
 
 #define xspy_release_domain_doc "\n"					\
-	"Tell xenstore to release its channel to a domain.\n"		\
+	"Tell xenstore to release its channel(s) to a domain.\n"	\
 	"Unless this is done the domain will not be released.\n"	\
 	" dom [int]: domain id\n"					\
 	"\n"								\
@@ -733,7 +733,7 @@
				      &dom))
        goto exit;
    Py_BEGIN_ALLOW_THREADS
-    xsval = xs_release_domain(xh, dom);
+    xsval = xs_release_domain(xh, dom, 0);
    Py_END_ALLOW_THREADS
    if (!xsval) {
        PyErr_SetFromErrno(PyExc_RuntimeError);
diff -r 0d8c0db04258 tools/xenstore/testsuite/09domain.test
--- a/tools/xenstore/testsuite/09domain.test	Tue Sep 13 21:52:24 2005
+++ b/tools/xenstore/testsuite/09domain.test	Wed Sep 14 19:16:48 2005
@@ -10,10 +10,16 @@
 close
 
 # Release that domain.
-release 1
+release 1 100
 close
 
 # Introduce and release by same connection.
 expect handle is 2
 introduce 1 100 7 /my/home
-release 1
+release 1 100
+
+# Introduce another connection from a domain connection
+expect handle is 3
+# FIXME: We can't actually handle multiple domain connections in xs_test.
+introduce 1 101 7 /my/home
+release 1 0
diff -r 0d8c0db04258 tools/xenstore/testsuite/11domain-watch.test
--- a/tools/xenstore/testsuite/11domain-watch.test	Tue Sep 13 21:52:24 2005
+++ b/tools/xenstore/testsuite/11domain-watch.test	Wed Sep 14 19:16:48 2005
@@ -12,7 +12,7 @@
 1 waitwatch
 1 ackwatch token
 1 unwatch /test token
-release 1
+release 1 0
 1 close
 
 # ignore watches while doing commands, should work.
@@ -26,7 +26,7 @@
 expect 1:/dir/test:token
 1 waitwatch
 1 ackwatch token
-release 1
+release 1 0
 1 close
 
 # unwatch
@@ -39,7 +39,7 @@
 expect 1:/dir/test2:token2
 1 waitwatch
 1 unwatch /dir token2
-release 1
+release 1 0
 1 close
 
 # unwatch while watch pending.
@@ -48,5 +48,5 @@
 1 watch /dir token1
 write /dir/test2 create contents
 1 unwatch /dir token1
-release 1
+release 1 0
 1 close
diff -r 0d8c0db04258 tools/xenstore/xenstored_core.c
--- a/tools/xenstore/xenstored_core.c	Tue Sep 13 21:52:24 2005
+++ b/tools/xenstore/xenstored_core.c	Wed Sep 14 19:16:48 2005
@@ -1285,7 +1285,7 @@
 		break;
 
 	case XS_RELEASE:
-		do_release(conn, onearg(in));
+		do_release(conn, in);
 		break;
 
 	case XS_GET_DOMAIN_PATH:
diff -r 0d8c0db04258 tools/xenstore/xenstored_domain.c
--- a/tools/xenstore/xenstored_domain.c	Tue Sep 13 21:52:24 2005
+++ b/tools/xenstore/xenstored_domain.c	Wed Sep 14 19:16:48 2005
@@ -51,6 +51,9 @@
 	/* Event channel port */
 	u16 port;
 
+	/* Shared page frame */
+	unsigned long mfn;
+
 	/* Domain path in store. */
 	char *path;
 
@@ -273,6 +276,7 @@
 	domain = talloc(context, struct domain);
 	domain->port = 0;
 	domain->domid = domid;
+	domain->mfn = mfn;
 	domain->path = talloc_strdup(domain, path);
 	domain->page = xc_map_foreign_range(*xc_handle, domain->domid,
					    getpagesize(),
@@ -302,6 +306,10 @@
 void do_introduce(struct connection *conn, struct buffered_data *in)
 {
 	struct domain *domain;
+	domid_t domid;
+	unsigned long mfn;
+	unsigned int evtchn;
+	char *path;
 	char *vec[4];
 
 	if (get_strings(in, vec, ARRAY_SIZE(vec)) < ARRAY_SIZE(vec)) {
@@ -309,7 +317,20 @@
 		return;
 	}
 
-	if (conn->id != 0) {
+	domid = atoi(vec[0]);
+	mfn = atol(vec[1]);
+	evtchn = atoi(vec[2]);
+	path = vec[3];
+
+	/* Domains can introduce more pages: share same path. */
+	if (conn->domain) {
+		if (domid != DOMID_SELF) {
+			send_error(conn, EACCES);
+			return;
+		}
+		domid = conn->domain->domid;
+		path = conn->domain->path;
+	} else if (conn->id != 0) {
 		send_error(conn, EACCES);
 		return;
 	}
@@ -320,13 +341,12 @@
 	}
 
 	/* Sanity check args. */
-	if ((atoi(vec[2]) <= 0) || !is_valid_nodename(vec[3])) {
+	if (mfn == 0 || !is_valid_nodename(path)) {
 		send_error(conn, EINVAL);
 		return;
 	}
 
 	/* Hang domain off "in" until we're finished. */
-	domain = new_domain(in, atoi(vec[0]), atol(vec[1]), atol(vec[2]),
-			    vec[3]);
+	domain = new_domain(in, domid, mfn, evtchn, path);
 	if (!domain) {
 		send_error(conn, errno);
 		return;
@@ -340,54 +360,69 @@
 	send_ack(conn, XS_INTRODUCE);
 }
 
-static struct domain *find_domain_by_domid(domid_t domid)
+static struct domain *find_domain(domid_t domid, unsigned long mfn)
 {
 	struct domain *i;
 
 	list_for_each_entry(i, &domains, list) {
-		if (i->domid == domid)
-			return i;
+		if (i->domid == domid) {
+			if (!mfn || i->mfn == mfn)
+				return i;
+		}
 	}
 	return NULL;
 }
 
-/* domid */
-void do_release(struct connection *conn, const char *domid_str)
+/* domid, mfn */
+void do_release(struct connection *conn, struct buffered_data *in)
 {
 	struct domain *domain;
 	domid_t domid;
-
-	if (!domid_str) {
+	unsigned long mfn;
+	char *vec[2];
+
+	if (get_strings(in, vec, ARRAY_SIZE(vec)) < ARRAY_SIZE(vec)) {
 		send_error(conn, EINVAL);
 		return;
 	}
-
-	domid = atoi(domid_str);
+	domid = atol(vec[0]);
+	mfn = atol(vec[1]);
+
 	if (!domid) {
 		send_error(conn, EINVAL);
 		return;
 	}
 
-	if (conn->id != 0) {
+	if (!conn->can_write) {
+		send_error(conn, EROFS);
+		return;
+	}
+
+	/* Domains can release own pages, but not all of them. */
+	if (conn->domain) {
+		if (domid != DOMID_SELF || !mfn) {
+			send_error(conn, EACCES);
+			return;
+		}
+		domid = conn->domain->domid;
+	} else if (conn->id != 0) {
 		send_error(conn, EACCES);
 		return;
 	}
 
-	domain = find_domain_by_domid(domid);
-	if (!domain) {
-		send_error(conn, ENOENT);
-		return;
-	}
-
-	if (!domain->conn) {
-		send_error(conn, EINVAL);
-		return;
-	}
-
-	talloc_free(domain->conn);
-
-	fire_watches(NULL, "@releaseDomain", false);
-
+	if (mfn) {
+		domain = find_domain(domid, mfn);
+		if (!domain) {
+			send_error(conn, ENOENT);
+			return;
+		}
+		talloc_free(domain->conn);
+	} else {
+		while ((domain = find_domain(domid, 0)) != NULL)
+			talloc_free(domain->conn);
+
+		fire_watches(NULL, "@releaseDomain", false);
+	}
 	send_ack(conn, XS_RELEASE);
 }
 
@@ -405,7 +440,7 @@
 	if (domid == DOMID_SELF)
 		domain = conn->domain;
 	else
-		domain = find_domain_by_domid(domid);
+		domain = find_domain(domid, 0);
 
 	if (!domain)
 		send_error(conn, ENOENT);
diff -r 0d8c0db04258 tools/xenstore/xenstored_domain.h
--- a/tools/xenstore/xenstored_domain.h	Tue Sep 13 21:52:24 2005
+++ b/tools/xenstore/xenstored_domain.h	Wed Sep 14 19:16:48 2005
@@ -26,7 +26,7 @@
 void do_introduce(struct connection *conn, struct buffered_data *in);
 
 /* domid */
-void do_release(struct connection *conn, const char *domid_str);
+void do_release(struct connection *conn, struct buffered_data *in);
 
 /* domid */
 void do_get_domain_path(struct connection *conn, const char *domid_str);
diff -r 0d8c0db04258 tools/xenstore/xs.c
--- a/tools/xenstore/xs.c	Tue Sep 13 21:52:24 2005
+++ b/tools/xenstore/xs.c	Wed Sep 14 19:16:48 2005
@@ -565,13 +565,19 @@
 	return xs_bool(xs_talkv(h, XS_INTRODUCE, iov, ARRAY_SIZE(iov), NULL));
 }
 
-bool xs_release_domain(struct xs_handle *h, domid_t domid)
-{
-	char domid_str[MAX_STRLEN(domid)];
+bool xs_release_domain(struct xs_handle *h, domid_t domid, unsigned long mfn)
+{
+	struct iovec iov[2];
+	char domid_str[MAX_STRLEN(domid)], mfn_str[MAX_STRLEN(mfn)];
 
 	sprintf(domid_str, "%u", domid);
-
-	return xs_bool(xs_single(h, XS_RELEASE, domid_str, NULL));
+	sprintf(mfn_str, "%lu", mfn);
+	iov[0].iov_base = domid_str;
+	iov[0].iov_len = strlen(domid_str) + 1;
+	iov[1].iov_base = mfn_str;
+	iov[1].iov_len = strlen(mfn_str) + 1;
+
+	return xs_bool(xs_talkv(h, XS_RELEASE, iov, ARRAY_SIZE(iov), NULL));
 }
 
 char *xs_get_domain_path(struct xs_handle *h, domid_t domid)
diff -r 0d8c0db04258 tools/xenstore/xs.h
--- a/tools/xenstore/xs.h	Tue Sep 13 21:52:24 2005
+++ b/tools/xenstore/xs.h	Wed Sep 14 19:16:48 2005
@@ -125,14 +125,16 @@
 /* Introduce a new domain.
  * This tells the store daemon about a shared memory page, event channel
  * and store path associated with a domain: the domain uses these to
  * communicate.
+ * The domain can use this to add its own extra pages.
  */
 bool xs_introduce_domain(struct xs_handle *h, domid_t domid,
			 unsigned long mfn, unsigned int eventchn,
			 const char *path);
 
 /* Release a domain.
  * Tells the store domain to release the memory page to the domain.
+ * mfn == 0 means to release all pages.
  */
-bool xs_release_domain(struct xs_handle *h, domid_t domid);
+bool xs_release_domain(struct xs_handle *h, domid_t domid, unsigned long mfn);
 
 /* Query the home path of a domain. */
diff -r 0d8c0db04258 tools/xenstore/xs_test.c
--- a/tools/xenstore/xs_test.c	Tue Sep 13 21:52:24 2005
+++ b/tools/xenstore/xs_test.c	Wed Sep 14 19:16:48 2005
@@ -207,6 +207,7 @@
 	       "  start <node>\n"
 	       "  abort\n"
 	       "  introduce <domid> <mfn> <eventchn> <path>\n"
+	       "  release <domid> <mfn>\n"
 	       "  commit\n"
 	       "  sleep <milliseconds>\n"
 	       "  expect <pattern>\n"
@@ -630,9 +631,9 @@
 	daemon_pid = *(int *)((void *)out + 32);
 }
 
-static void do_release(unsigned int handle, const char *domid)
-{
-	if (!xs_release_domain(handles[handle], atoi(domid)))
+static void do_release(unsigned int handle, const char *domid, const char *mfn)
+{
+	if (!xs_release_domain(handles[handle], atoi(domid), atol(mfn)))
 		failed(handle);
 }
 
@@ -809,7 +810,7 @@
 		do_introduce(handle, arg(line, 1), arg(line, 2),
			     arg(line, 3), arg(line, 4));
 	else if (streq(command, "release"))
-		do_release(handle, arg(line, 1));
+		do_release(handle, arg(line, 1), arg(line, 2));
 	else if (streq(command, "dump"))
 		dump(handle);
 	else if (streq(command, "sleep")) {
--
A bad analogy is like a leaky screwdriver -- Richard Braakman
On 9/14/05, Rusty Russell <rusty@rustcorp.com.au> wrote:
> Patch for reading only (I subbed in NULL for the default store page for
> the moment, since I know there's another patch out there which touches
> this).

I really don't think that the multi-page approach is good, and it's also
orthogonal, i.e. we could have multiple connections but still want
concurrent transactions on the same connection.  What's wrong with
concurrent transactions on the same connection?

    christian
On Wed, 2005-09-14 at 13:55 +0100, Christian Limpach wrote:
> I really don't think that the multi-page approach is good, and it's also
> orthogonal, i.e. we could have multiple connections but still want
> concurrent transactions on the same connection.  What's wrong with
> concurrent transactions on the same connection?

You're right that it's orthogonal, but we really do want separate
connections for each client: they're logically separate, so overloading
them on one transport is going to be a hack.  Inside the kernel we do it
to a limited degree, but we already have a proxy for handling watches,
and we trust everyone to get it right and use One Big Lock.

Now, we still might want concurrent transactions for a single user, but
more likely we want to get rid of the "root of transaction" model
altogether, since no one likes it, and there's an unrelated issue with
an NFS-root mounted store (it breaks horribly) that is likely to change
our underlying (on-disk, directory-based) implementation.

Thoughts?
Rusty.
--
A bad analogy is like a leaky screwdriver -- Richard Braakman
On 15 Sep 2005, at 02:39, Rusty Russell wrote:
> we really do want separate connections for each client: they're
> logically separate, so overloading them on one transport is going to
> be a hack.

How does two connections being 'logically separate' imply that it is
improper for them not to also be 'physically separate'?  Multiplexing
multiple simultaneous connections/transactions onto a single underlying
page-level transport would seem fine to me!

 -- Keir
On Thu, Sep 15, 2005 at 11:39:50AM +1000, Rusty Russell wrote:
> On Wed, 2005-09-14 at 13:55 +0100, Christian Limpach wrote:
> > On 9/14/05, Rusty Russell <rusty@rustcorp.com.au> wrote:
> > > Patch for reading only (I subbed in NULL for the default store page for
> > > the moment, since I know there's another patch out there which touches
> > > this).
> >
> > I really don't think that the multi-page approach is good and it's
> > also orthogonal, i.e. we could have multiple connections but still
> > want concurrent transactions on the same connection. What's wrong
> > with concurrent transactions on the same connection?
>
> You're right it's orthogonal, but we really do want separate connections
> for each client: they're logically separate, so overloading them on one
> transport is going to be a hack. Inside the kernel we do it to a
> limited degree, but already have a proxy for handling watches and we
> trust everyone to get it right and use One Big Lock.

I think it's not even a hack by your definition, because at least for the uses for which I added the device (backend device setup from hotplug scripts), it is logically the same client.

It seems to me that the solutions we want are exact opposites:
- multiple pages, with a single connection and a single transaction per page
- a single page, with a single connection and multiple transactions

My main objections against multiple pages are:
- setup/teardown overhead: we'll have to add messages to set up and tear down a new connection
- we have to maintain state about the connection in the daemon
- save/restore becomes more complicated

> Now, we still might want concurrent transactions for a single user, but
> more likely we want to get rid of the "root of transaction" model
> altogether, since no one likes it, and there's an unrelated issue with
> NFS-root mounted store (it breaks horribly) that is likely to change our
> underlying (on-disk directory-based) implementation.
>
> Thoughts?

I'd like us to extend the message format so that we can support concurrent transactions in the future, i.e. have transaction_start return a transaction handle and have everything else take a transaction handle. We can return 0 as the handle for now and also use 0 for outside-of-transaction operation.

christian
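A rough sketch of what Christian's handle-based routing could look like on the daemon side, with 0 meaning "no transaction". This is a Python illustration, not the real xenstored code or wire format, and all names are made up; in particular it omits conflict detection, which a real store would need.

```python
# Hypothetical sketch of handle-based transaction routing in a store
# daemon. Handle 0 operates directly on the main store; each non-zero
# handle names a private snapshot. No conflict detection is attempted:
# a real store would fail conflicting commits (e.g. with EAGAIN).

NO_TRANSACTION = 0

class Store:
    def __init__(self):
        self.data = {}            # the committed store
        self.transactions = {}    # handle -> private snapshot
        self.next_handle = 1

    def transaction_start(self):
        handle = self.next_handle
        self.next_handle += 1
        # Each transaction works against a snapshot of the store.
        self.transactions[handle] = dict(self.data)
        return handle

    def _view(self, handle):
        if handle == NO_TRANSACTION:
            return self.data
        return self.transactions[handle]

    def read(self, handle, key):
        return self._view(handle).get(key)

    def write(self, handle, key, value):
        self._view(handle)[key] = value

    def transaction_end(self, handle, commit=True):
        snapshot = self.transactions.pop(handle)
        if commit:
            self.data = snapshot
```

Because every operation carries its handle, two transactions on the same connection no longer interfere, which is exactly what makes the handle worth adding to the message format now even if the daemon returns 0 for a while.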
On Thu, 2005-09-15 at 11:53 +0100, Keir Fraser wrote:
> On 15 Sep 2005, at 02:39, Rusty Russell wrote:
>
> > we really do want separate connections
> > for each client: they're logically separate, so overloading them on one
> > transport is going to be a hack.
>
> How does two connections being 'logically separate' imply that it is
> improper for them not to also be 'physically separate'? Multiplexing
> multiple simultaneous connections/transactions onto a single underlying
> page-level transport would seem fine to me!

Um, multiplexing, like any feature, adds complexity: if we don't need it, don't do it. <shrug>

We have a way of establishing new ringbuffers to talk to the store, we just currently assume one per domain. Loosening that seems simpler and more robust than introducing a multiplexing layer, unless you two can see something I can't?

Christian says:
> My main objections against multiple pages are:
> - setup/teardown overhead: we'll have to add messages to set up
>   and tear down a new connection

Which we already have, as above.

> - we have to maintain state about the connection in the daemon

But allowing multiple connections over one transport doesn't change this.

> - save/restore becomes more complicated

Actually, I think it becomes simpler: we simply force the device closed, which should be handled by libxenstore just the same as a unix domain socket closing on daemon restart, AFAICT. So it has an appeal.

What was the reason for wanting multiple transactions per connection? Changing the interface is going to be a PITA, so we should figure out if we're going to need that soon...

Thanks!
Rusty.
--
A bad analogy is like a leaky screwdriver -- Richard Braakman
On 17 Sep 2005, at 09:26, Rusty Russell wrote:

>> How does two connections being 'logically separate' imply that it is
>> improper for them not to also be 'physically separate'? Multiplexing
>> multiple simultaneous connections/transactions onto a single underlying
>> page-level transport would seem fine to me!
>
> Um, multiplexing, like any feature, adds complexity: if we don't need
> it, don't do it. <shrug>

That doesn't make it a 'hack'.

> We have a way of establishing new ringbuffers to talk to the store, we
> just currently assume one per domain. Loosening that seems simpler and
> more robust than introducing a multiplexing layer, unless you two can
> see something I can't?

Multiplexing will require user-space reads/writes to be passed to the kernel rather than user space stuffing its own comms page directly. This has the advantage of being what we already do, and any performance disadvantages really don't matter. On the xenstored side it ought simply to be a matter of picking a transaction or connection id out of the message to index into some kind of state table.

If we have multiple pages, the client driver is complicated by reserving user pages and creating grant references for them, and by cleanly tearing down and dealloc'ing grant references at the appropriate point(s). I agree the daemon doesn't really get any more complicated, but I think save/restore will need extra code, either in the domain0 tools or in the guest OS, to reconnect pages through to xenstored.

Maybe there is a hidden complexity to muxing that we don't see? I guess save/restore needs some care because transaction-id state will be lost when we reconnect to a new xenstored.

-- Keir
On Sat, Sep 17, 2005 at 06:26:04PM +1000, Rusty Russell wrote:
> What was the reason for wanting multiple transactions per connection?
> Changing the interface is going to be a PITA, so we should figure out if
> we're going to need that soon...

It solves the immediate problem of making the xenbus lock in the kernel driver more fine-grained. I'm not too happy with how a userspace program can block all xenbus access for that domain at the moment, even if this is limited to the root user.

Additionally, I believe it's necessary to support reconnect to a different store after restore, allowing the store daemon or the xenbus driver to distinguish operations which are part of an incomplete transaction from operations which are not part of a transaction at all. Finally, adding transaction ids later will be a lot more difficult.

I also find the current behaviour a bit odd: we support operations outside of transactions, but once you start a transaction, every operation you make is part of that transaction. I think we should either require always being in a transaction or always allow outside-of-transaction operations.

christian
On Sat, 2005-09-17 at 09:33 +0100, Keir Fraser wrote:
> On 17 Sep 2005, at 09:26, Rusty Russell wrote:
>
> >> How does two connections being 'logically separate' imply that it is
> >> improper for them not to also be 'physically separate'? Multiplexing
> >> multiple simultaneous connections/transactions onto a single underlying
> >> page-level transport would seem fine to me!
> >
> > Um, multiplexing, like any feature, adds complexity: if we don't need
> > it, don't do it. <shrug>
>
> That doesn't make it a 'hack'.

That's quoting a little out of context. The current partial exposure of the kernel's channel is a hack, since the current model is a 1:1 mapping between the transport and the connection. The term hack is not a bad thing in itself, but it does accurately reflect that it's limited, as in this case where we're asked to add watch support.

> > We have a way of establishing new ringbuffers to talk to the store, we
> > just currently assume one per domain. Loosening that seems simpler and
> > more robust than introducing a multiplexing layer, unless you two can
> > see something I can't?
>
> Multiplexing will require user-space reads/writes to be passed to the
> kernel rather than stuffing its own comms page directly. This has the
> advantage of being what we already do, and any performance
> disadvantages really don't matter.

No, I'm proposing we keep the device in the domU kernel exactly as now, but rather than sharing the same page, use separate pages. This means we don't have to share the same lock or worry about userspace messing up, etc.

> On the xenstored side it ought simply to be a matter of picking a
> transaction or connection id out of the message to index into some kind
> of state table.

Sure, I can write multiplexing and demultiplexing code for the store daemon, separate the data structures, and duplicate that code in libxenstore and xenbus. But AFAICT it's code which simply doesn't need to exist.

> If we have multiple pages the client driver is complicated by reserving
> user pages and creating grant references for them, and cleanly tearing
> down and dealloc'ing grant references at the appropriate point(s). I
> agree the daemon doesn't really get any more complicated, but I think
> save/restore will need extra code, either in the domain0 tools or in
> the guest os, to reconnect pages through to xenstored.

No, this is the beauty of it: save/restore should look exactly the same to libxenstore as a restart of the store daemon; Christian and I have been vigorously debating semantics to get them the same (and I think we're close). Then on save, the device simply closes; on restore, the library reconnects.

So, in summary: separating the concept of transport from connection is possible, but I don't think it actually buys us anything except another layer of indirection.

Rusty.
On Sat, 2005-09-17 at 18:40 +0100, Christian Limpach wrote:
> On Sat, Sep 17, 2005 at 06:26:04PM +1000, Rusty Russell wrote:
> > What was the reason for wanting multiple transactions per connection?
> > Changing the interface is going to be a PITA, so we should figure out if
> > we're going to need that soon...
>
> It solves the immediate problem of making the xenbus lock in the kernel
> driver more fine-grained. I'm not too happy with how a userspace
> program can block all xenbus access for that domain at the moment,
> even if this is limited to the root user.

Sure, and we already have the concept of separate transports, so I think we should use them.

> Additionally, I believe
> it's necessary to support reconnect to a different store after
> restore, allowing the store daemon or the xenbus driver to distinguish
> operations which are part of an incomplete transaction from operations
> which are not part of a transaction at all.

And I think you're wrong; it's unnecessary and it doesn't actually help. The code will show.

> Finally, adding transaction
> ids later will be a lot more difficult.

Unless we introduce an xs_transaction_context() call, and leave it implicit?

> I also find the current behaviour a bit odd, we support operations
> outside of transactions but once you start a transaction, every
> operation you make is part of that transaction. i think we should
> either require to always be in a transaction or always allow outside
> of transaction operations.

We could drop implicit transactions; it would simplify the code a bit. I didn't do this because it seems a bit silly for simple monitoring tools.

The start/end transaction model is simple, but if it's demonstrably insufficient, I agree we should go for transaction ids. That's orthogonal to the multiplexing of connections over a single transport, though.

Rusty.
On 19 Sep 2005, at 01:11, Rusty Russell wrote:

> That's quoting a little out of context. The current partial exposure of
> the kernel's channel is a hack, since the current model is a 1:1 mapping
> between the transport and the connection. The term hack is not a bad
> thing in itself, but it does accurately reflect that it's limited, as in
> this case where we're asked to add watch support.

Well, I can't disagree with this. :-)

Whichever way we go, we have to keep some per-handle state on users of the xenbus device, whether that's a transaction id or a reference to a unique transport page. Either way we don't need to continuously hold the xenbus_lock (which is super gross).

>> If we have multiple pages the client driver is complicated by reserving
>> user pages and creating grant references for them, and cleanly tearing
>> down and dealloc'ing grant references at the appropriate point(s). I
>> agree the daemon doesn't really get any more complicated, but I think
>> save/restore will need extra code, either in the domain0 tools or in
>> the guest os, to reconnect pages through to xenstored.
>
> No, this is the beauty of it: save/restore should look exactly the same
> to libxenstore as a restart of the store daemon; Christian and I have
> been vigorously debating semantics to get them the same (and I think
> we're close). Then on save, the device simply closes; on restore, the
> library reconnects.

I wholeheartedly agree that restore should be equivalent to xenstored restart from the p.o.v. of the xenbus driver. That's how the block and net drivers already work. But it's orthogonal to whether we mux connections onto a single transport page.

-- Keir
On Mon, 2005-09-19 at 09:54 +0100, Keir Fraser wrote:
> I wholeheartedly agree that restore should be equivalent to xenstored
> restart from the p.o.v. of the xenbus driver. That's how the block and
> net drivers already work. But it's orthogonal to whether we mux
> connections onto a single transport page.

OK, that's a little confusing. There are three ways to talk to xenstored (dom0, domU kernel, domU userspace), and two comms problems (xenstored restart, and migration).

1) For domU kernels, we have a shared page, so xenstored restarts are completely transparent to the domains: the new xenstored just maps the page and away we go. We also hold the xenstore_lock across entire transactions, which is grabbed on save/restore, so we trivially avoid any sync issues on migration.

2) When dom0 tools connect to xenstored, they use a socket, which there's no easy way for the new xenstored to pick up (e.g. you write to the socket just as the daemon shuts down...). The idea, then, is to handle reconnecting in libxenstore itself. The recent protocol tweaks are designed to make this trivial. There's no migration of dom0 tools to worry about.

3) The new problem is domU tools. By providing a kernel device node we make it look exactly like talking through a socket, so libxenstore is basically unchanged. By each open using a separate page to talk to xenstored, we don't have to hold the kernel lock, nor worry about tool errors/crashes screwing the kernel's store page. xenstored restarts are transparent. On migration, we can simply force close (unmap page, return EBADF), which libxenstore can treat exactly like the xenstored restart case with sockets: reconnect and re-xmit.

The only issue is that, in the case of migration, the new xenstored won't know about any transaction currently in progress. We can either migrate transactions (easy for clients), or return EAGAIN for the next operation (easy for xenstored, sucks for clients).

Cheers,
Rusty.
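The reconnect-and-retransmit idea in (3) can be illustrated with a toy model. This is a Python sketch, not the real libxenstore C code; the class and method names are invented. The point is only that the library treats a forced close of the device (migration) exactly like the daemon closing a socket (restart): reconnect and resend.

```python
# Toy model of "reconnect and re-xmit": the client library hides a
# forced close (migration) or daemon restart behind a retry loop.
# All names here are illustrative, not the real libxenstore API.

class ConnectionClosed(Exception):
    """The device/socket was force-closed (daemon restart or migration)."""

class FlakyTransport:
    """A transport that drops the connection a fixed number of times."""
    def __init__(self, failures):
        self.failures = failures

    def request(self, op, arg):
        if self.failures > 0:
            self.failures -= 1
            raise ConnectionClosed()
        return ("ok", op, arg)

class Client:
    def __init__(self, transport, max_retries=3):
        self.transport = transport
        self.max_retries = max_retries

    def do(self, op, arg):
        for _ in range(self.max_retries + 1):
            try:
                return self.transport.request(op, arg)
            except ConnectionClosed:
                # Reconnect (re-open the device / socket) and retransmit.
                continue
        raise ConnectionClosed("gave up reconnecting")
```

This only works cleanly for idempotent requests outside a transaction, which is exactly why in-progress transactions are the remaining issue Rusty identifies.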
On 20 Sep 2005, at 12:01, Rusty Russell wrote:

> By providing a kernel device node we
> make it look exactly like talking through a socket, so libxenstore is
> basically unchanged. By each open using a separate page to talk to
> xenstored, we don't have to hold the kernel lock, nor worry about tool
> errors/crashes screwing the kernel's store page. xenstored restarts are
> transparent. On migration, we can simply force close (unmap page,
> return EBADF), which libxenstore can treat exactly like the xenstored
> restart case with sockets: reconnect and re-xmit.

None of this really adds weight for or against muxing versus page-per-transaction. If accesses from domU userspace are always communicated via the kernel device node, userspace gets no direct access to the transport page and so the kernel can ensure the page does not get corrupted. Also, there's no need to hold the kernel lock for the duration of the user-space transaction if we mux multiple transactions onto a single page -- as independent transactions, it is up to xenstored to deal with sync issues between them. We would need to hold the lock for short periods during the transaction (e.g., while accessing the shared transport page), but we would not be holding it continuously.

As for client handling of migration/restart -- it's an API issue that is independent of the underlying 'wire' transport.

-- Keir
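Keir's locking point can be sketched concretely: the lock protects only the shared transport page, never a whole transaction, so concurrent transactions interleave freely and the daemon sorts them out by transaction id. This is a Python illustration (the "ring" is just a list standing in for the shared page; all names are invented, and the real code would be kernel C).

```python
# Sketch of short-critical-section locking: the lock is held only while
# touching the shared "page", never across a whole transaction.

import threading

class Ring:
    def __init__(self):
        self.lock = threading.Lock()   # guards only the shared page
        self.messages = []

    def send(self, tx_id, op):
        # Short critical section: just long enough to enqueue one message.
        with self.lock:
            self.messages.append((tx_id, op))

def run_transaction(ring, tx_id, ops):
    # No lock is held between operations; concurrent transactions can
    # interleave, and the daemon demuxes them by tx_id.
    for op in ops:
        ring.send(tx_id, op)

ring = Ring()
t1 = threading.Thread(target=run_transaction, args=(ring, 1, ["read a", "write b"]))
t2 = threading.Thread(target=run_transaction, args=(ring, 2, ["read c"]))
t1.start(); t2.start(); t1.join(); t2.join()
```

Contrast this with the status quo described earlier in the thread, where the xenbus lock is grabbed on open and held until close, serialising everything.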
On 20 Sep 2005, at 12:01, Rusty Russell wrote:

> The only issue is that, in the case of migration, the new xenstored
> won't know about any transaction currently in progress. We can either
> migrate transactions (easy for clients), or return EAGAIN for the next
> operation (easy for xenstored, sucks for clients).

Well, you know we already disagree very strongly on this.

Either we allow clients to lock down sections of the xenstore hierarchy for unbounded periods of time (unacceptable since we do not trust all clients) or we have to handle transaction failure in the clients. The only exception to this I can think of is read-only transactions, where you can take a read-only consistent snapshot of the store when the transaction begins and guarantee eventual success (even here we may want a timeout to avoid resource hogging). Apart from that, on failure the client *has* to execute its transactional code again -- the values it writes to the store may be dependent on values it reads as part of the same transaction. If the transaction is failed because some of those values it read are now stale (or might be stale, because a leased lock was revoked), the transaction replay has to include re-execution of the client code -- it cannot be hidden in the transactional API (e.g., just replaying the transactional writes from the previous failed attempt may be incorrect if those writes are based on stale reads from the previous failed attempt).

This is the price for providing ACID guarantees (well, A, C and I at least, and without atomicity, consistency and isolation you are not implementing transactions as understood by every computer scientist).

If we have to make failure visible to clients anyway (and I'm sure we do), it does at least greatly simplify things like xenstored restart and migration. Transactions are simply failed.

-- Keir
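Keir's "re-execute the client code" rule is the familiar optimistic-retry pattern. Here is a hedged sketch of it in Python; the function names and the version-counter conflict check are illustrative, not the xenstore API. The key property is that on failure the whole transaction body re-runs, reads included, so writes are never replayed against stale reads.

```python
# Optimistic retry loop: on conflict, the whole body (reads AND writes)
# re-executes against a fresh snapshot. All names are illustrative.

class Eagain(Exception):
    pass

class Store:
    def __init__(self):
        self.data = {}
        self.version = 0

    def start(self):
        # A snapshot gives the transaction an isolated, consistent view.
        return {"snapshot": dict(self.data), "writes": {},
                "version": self.version}

    def commit(self, txn):
        if txn["version"] != self.version:
            raise Eagain()           # store changed under the transaction
        self.data.update(txn["writes"])
        self.version += 1

def transactional(store, body, max_attempts=10):
    """Run body(txn) until it commits without conflict."""
    for _ in range(max_attempts):
        txn = store.start()
        body(txn)                    # client code re-executes on each try
        try:
            store.commit(txn)
            return
        except Eagain:
            continue
    raise Eagain("too much contention")
```

Note that `transactional` cannot hide the retry from the client: `body` is handed back the fresh snapshot each time, which is exactly Keir's point about replaying writes alone being incorrect.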
FWIW: I think most of the ugliness here is a result of confusion from using the store both for the storage of persistent configuration information and as a mechanism to implement configuration ABIs.

If you split out those two functions, such that there is a persistent store service which is generally useful for storage of persistent configuration information but only stores information that is private to each client, then you can allow clients to lock down their private data for unbounded periods (because it doesn't matter to any other client) and there is no need for transaction retry in the clients.

You have to provide a separate mechanism to publish configuration ABIs. The provider of a configuration ABI can sign up to a maximum service time for requests made of that ABI. If a request takes too long, it can be promoted to a domain failure of the provider domain.

So a scenario for a misbehaving domain might go as follows: the domain misbehaves and accidentally locks up part of its private region of the store. This state can persist indefinitely without affecting other domains. The super-user performs some configuration which results in a request to the domain's configuration ABI to reconfigure something. The domain attempts the reconfiguration but hangs because it has locked up a bit of its area of the store. The configuration request violates the domain's config ABI SLA. The domain is killed. The super-user's configuration request fails. The store detects the domain exit and rolls back the uncommitted transaction. The domain is restarted. The super-user can retry the configuration request.

With this approach, clients of the persistent store don't have to handle transaction failures. Clients of the configuration ABI have to handle failures of configuration requests as a result of domains exiting (which is unavoidable: failing requests could be automatically retried after a domain is restarted, but a domain might fail continually, in which case the configuration request would eventually have to be failed).

This note might not be very helpful for 3.0 given the current architecture, but you might perhaps find a mapping of the above solution onto the xenstore infrastructure.
On Wed, 2005-09-21 at 10:35 +0100, Keir Fraser wrote:
> On 20 Sep 2005, at 12:01, Rusty Russell wrote:
>
> > By providing a kernel device node we
> > make it look exactly like talking through a socket, so libxenstore is
> > basically unchanged. By each open using a separate page to talk to
> > xenstored, we don't have to hold the kernel lock, nor worry about tool
> > errors/crashes screwing the kernel's store page. xenstored restarts are
> > transparent. On migration, we can simply force close (unmap page,
> > return EBADF), which libxenstore can treat exactly like the xenstored
> > restart case with sockets: reconnect and re-xmit.
>
> None of this really adds weight for or against muxing versus
> page-per-transaction.

Exactly! So you would have me implement multiplexing code in the kernel, demultiplexing code in the daemon, and checking code in the kernel to make sure we don't corrupt the shared comms channel, for no reason.

Cheers,
Rusty.
On Wed, 2005-09-21 at 10:39 +0100, Keir Fraser wrote:
> On 20 Sep 2005, at 12:01, Rusty Russell wrote:
>
> > The only issue is that, in the case of migration, the new xenstored
> > won't know about any transaction currently in progress. We can either
> > migrate transactions (easy for clients), or return EAGAIN for the next
> > operation (easy for xenstored, sucks for clients).
>
> Well, you know we already disagree very strongly on this.

Perhaps I was unclear? It's not the *commit* that fails with EAGAIN, but *any operation* (read/write/dir, etc.), in this scenario. Unlike daemon restarts, where we can simply re-establish transactions, even if the eventual commit is doomed to fail. (I sent such code to Christian, but abandoned it because we had to change everything else anyway.)

Now, reread the paragraph you quoted. Is my question clearer now? I really do want to know what you think. Should we try to migrate transactions, or label the xenstored API clearly that any operation inside a transaction can return EAGAIN if you are inside a domU using the xenbus device? Or can you see a third way?

(BTW, your analysis of the use of locks to provide ACID is flawed, but since I've had to abandon that approach for another reason I'll not waste time now in an academic argument: that's for pubs and IRC.)

Have I unconfused the issue?
Rusty.
On 22 Sep 2005, at 03:22, Rusty Russell wrote:

> It's not the *commit* that fails with EAGAIN, but *any operation*
> (read/write/dir, etc.), in this scenario.

Well, that is more of a pain. Probably okay for Python, as an exception gets raised and you automatically unwind the stack to the full scope of the transaction. More of a pain for C code. But still, it also has the advantage that you can fail a doomed transaction earlier -- for example, you can ensure that client code doesn't execute based on inconsistent sets of read values (i.e., read a value A before a remote transaction commits; read a value B after a remote transaction commits; where that remote transaction updates both A and B).

Either we believe some clients are fragile to reading bad/inconsistent data, in which case I think EAGAIN on any read/write is a good idea (better than massively complicating xenstored). Or, if we believe clients should be robust to reading crap from the store (and many ought to be, because backends can't trust values written by frontends, for example), then we can handle doomed transactions without early report of EAGAIN as follows:
* discard writes
* return empty strings for reads (or ENOENT, or something like that).

Whatever, the client probably needs the code to realise that a bad thing has happened and to take appropriate action whichever strategy we go for. I suspect they are equivalent complexity for clients.

-- Keir
On 22 Sep 2005, at 03:07, Rusty Russell wrote:

> Exactly! So you would have me implement multiplexing code in the
> kernel, demultiplexing code in the daemon, checking code in the kernel
> to make sure we don't corrupt the shared comms channel, for no reason.

Better to implement simple mux/demux code than pain-in-the-arse save/restore code.

-- Keir
On Thu, 2005-09-22 at 10:36 +0100, Keir Fraser wrote:
> On 22 Sep 2005, at 03:07, Rusty Russell wrote:
>
> > Exactly! So you would have me implement multiplexing code in the
> > kernel, demultiplexing code in the daemon, checking code in the kernel
> > to make sure we don't corrupt the shared comms channel, for no reason.
>
> Better to implement simple mux/demux code than pain-in-the-arse
> save/restore code.

But but but... it doesn't *help*. That's the entire point!

OK, please describe, in simple terms, why you think save/restore is different if we multiplex across a single transport?

Rusty.
On Thu, 2005-09-22 at 10:35 +0100, Keir Fraser wrote:
> Whatever, the client probably needs the code to realise that a bad
> thing has happened and to take appropriate action whichever strategy we
> go for. I suspect they are equivalent complexity for clients.

I think you've summed it up well. Of these two I'm leaning towards EAGAIN (which the client can turn into a fake success if they want). But both are subtle and kinda icky.

Which is why I am pondering a bundle/unbundle interface for transactions, so we can migrate them with the domain. Summary:

1) Easy to do at the moment: we already snapshot the entire store for transactions, so we can just bundle/unbundle that. We need globally-unique transaction IDs, but that's fairly simple.
2) Each domain adds roughly 5k to the store (this will increase, say to 10k). This means migrating off a node with 100 domains means adding 1M to the data we have to send *per transaction*.
3) The store compresses extremely well (~800 bytes per domain), so we could trivially get it down to 160k/transaction in the 100 domain case.

You know I treasure simple APIs, and this makes the store API simpler and so reduces subtle errors in future.

But is it worth the complexity?
Rusty.
Rusty,

Can you explain once again why you think that migrating in-progress transactions is the right thing to do? It seems to me that our transactions are generally pretty small, and I don't imagine them getting problematically bigger in the future. If client-side transaction code is already being written to expect failures and retry when they occur, what is the argument against blowing away in-progress transactions when you migrate?

Given that the majority of current transaction code is to do with device drivers, and you disconnect/reconnect those on migration anyway, why go through the extra work of adding complexity to migration?

a.
> But but but... it doesn't *help*. That's the entire point!
>
> OK, please describe, in simple terms, why you think save/restore is
> different if we multiplex across a single transport?

Well, maybe there's not so much in it after all. I'll assume here we go for the 'xenstored forgets all state, and clients get EAGAIN at the first available opportunity' approach.

If we mux on a single transport:
1. The shared transport page is set up automatically in xenstored when the domain is restored. Xenstored has forgotten about any in-progress transactions.
2. The xenbus driver marks all file handles (or transaction structures, or whatever it uses to track local state for each local transaction) as doomed. Any further activity on those transactions returns EAGAIN rather than passing through to xenstored.
3. That's it! Clients detect failure and retry.

If we have a page per transaction:
1. Same as (1) above.
2. Same as (2) above, but free the per-transaction transport page.
3. Same as (3) above.

However, I'm not clear yet what each separate transport page represents. Is it a single transaction, or a connection that stores multiple watches and one transaction at a time? If the latter, save/restore gets a bit harder, as the transport pages must be automatically re-registered and the watches re-registered...

-- Keir
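Step (2) of Keir's sequence amounts to a small piece of driver bookkeeping. Here is a speculative stand-in in Python (the real thing would be kernel C, and every name here is invented): mark every open handle doomed on restore, and fail further operations with EAGAIN until the client starts over.

```python
# Stand-in sketch of marking in-flight transactions "doomed" on restore:
# xenstored has forgotten everything, so any further use of an old
# handle fails with EAGAIN. Illustrative only.

import errno

class Handle:
    def __init__(self):
        self.doomed = False

class XenbusDriver:
    def __init__(self):
        self.handles = []

    def open(self):
        h = Handle()
        self.handles.append(h)
        return h

    def op(self, handle, request):
        if handle.doomed:
            return -errno.EAGAIN      # client must retry from scratch
        return 0                      # would forward request to xenstored

    def restored(self):
        # Called after migration: every in-flight transaction is doomed.
        for h in self.handles:
            h.doomed = True
```

Note this is identical for the mux and page-per-transaction designs except for where the transport page is freed, which is Keir's point that there is "not so much in it".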
On 23 Sep 2005, at 00:51, Rusty Russell wrote:> Which is why I am pondering a bundle/unbundle interface for > transactions, so we can migrate them with the domain. Summary:How do you guarantee consistency of the migrated transaction? It holds no locks or anything on the values it has read while it is not running, so they could change under its feet? Correctness-wise, it sounds to me as though bundle/unbundle has the same correctness guarantees as drop all writes, ENOENT on reads, and EAGAIN on commit. Or is the bundling simply so that you can continue to feed consistent snapshotted values to the transaction from its 'shadow store', even though it is ultimately a doomed transaction? Do xenstored's performance problems stem from copying the store for every transaction? -- Keir
On Thu, 2005-09-22 at 18:01 -0700, Andrew Warfield wrote:> Rusty, > > Can you explain once again why you think that migrating in-progress > transactions is the right thing to do? It seems to me that our > transactions are generally pretty small, and I don't imagine them > getting problematically bigger in the future. If client-side > transaction code is already being written to expect failures and retry > when they occur, what is the argument against blowing away in-progress > transactions when you migrate. > > Given that the majority of current transaction code is to do with > device drivers and you disconnect/reconnect those on migration anyway, > why go through the extra work of adding complexity to migration?Actually, with more thought on your excellent points I've changed my mind: we should wait for any transactions to complete before saving the domain and sidestep the whole issue. To recap, the problem is that if we are halfway through a transaction when we migrate the domain, it's hard to know what to do on the next op (eg. "xs_write", "xs_read" etc). Without migrating the transaction, we can fail the next op with EAGAIN or return "dummy" values, neither of which is pleasant. (Failing with EAGAIN on commit is different, and something the caller needs to handle anyway). Now, we already have this "domain won't save until transactions are done" simply because we use a single big lock, but this discussion started because we want to get rid of that lock for /proc/xen/xenbus (it's fine for drivers). I think we should do so, but keep this wont-save-during-transactions semantic; it means a waitqueue etc, but I don't think it's too bad. As you say, our transactions are pretty small. Do people like this more? Rusty.
On Fri, 2005-09-23 at 10:24 +0100, Keir Fraser wrote:> Or is the bundling simply so that you can continue to feed consistent > snapshotted values to the transaction from its 'shadow store', even > though it is ultimately a doomed transaction?Yes, that was the plan, although see other mail; I now prefer not allowing save during transactions at all, to avoid this issue.> Do xenstored's performance problems stem from copying the store for > every transaction?Yes. Using TDB means it's now copying a single file, rather than a whole directory tree, but if someone puts enough domains in the store it will become a problem and we'll need a more sophisticated implementation. It would be a fun problem to work on, but I don't think I can justify spending cycles on it yet: I can create around 10,000 domain-style directories in the store on my laptop in 22 seconds (using 10,000 separate transactions). The advantages of the "whole copy" approach are simplicity and robustness: if you SIGKILL xenstored your chances of recovery are excellent, as commit is done as a rename(2). Cheers! Rusty.
On Fri, 2005-09-23 at 10:17 +0100, Keir Fraser wrote:> However, I'm not clear yet what each separate transport page > represents. Is it a single transaction, or a connection that stores > multiple watches and one transaction at a time? If the latter, > save/restore gets a bit harder as the transport pages must be > automatically re-registered and watches re-registered...I was assuming per-connection, ie. every time a tool opens /proc/xen/xenbus we map a new page/event channel, and free on close. ie. the drivers in the domU kernel share the store comms page created with the domain, tools in the domU get a separate page each, and tools in dom0 each connect to the unix domain socket. To the store, one connection, one transport. (Each connection can set up multiple watches of course, and Christian has been hinting that he wants multiple transactions too, but that's another argument). So my proposal is (for those who came in late):

[xenstored]
 |
 |--> /var/run/xenstored/socket <-- libxenstore <-- tool in dom0
 |                              <-- libxenstore <-- tool in dom0
 |    ...
 |--> /var/run/xenstored/socket_ro <-- libxenstore <-- tool in dom0
 |                                 <-- libxenstore <-- tool in dom0
 |    ...
 |--> shared page 100/evtchn 100 <-- xenbus/xenbus_xs.c <-- domU 1 kernel
 |--> shared page 200/evtchn 200 <-- xenbus/xenbus_xs.c <-- domU 2 kernel
 |--> shared page 300/evtchn 300 <-- xenbus/xenbus_xs.c <-- domU 3 kernel
 |    ...
 |--> shared page 1000/evtchn 1000 <-- xenbus/xenbus_dev.c <-- libxenstore <-- tool in domU 1
 |--> shared page 1001/evtchn 1001 <-- xenbus/xenbus_dev.c <-- libxenstore <-- tool in domU 1
 |--> shared page 1002/evtchn 1002 <-- xenbus/xenbus_dev.c <-- libxenstore <-- tool in domU 1
 |    ...

This differs from what we have: xenbus/xenbus_dev.c uses the same shared page as xenbus/xenbus_xs.c, and there's one big lock (xenbus_lock) so drivers can't use xenbus while the device is open. We all agree that is suboptimal. So on domain save, we just force close everyone who has /proc/xen/xenbus open. 
This closing, like normal closing, frees all those pages and channels, and sends XS_UNINTRODUCE to the store to tell it to unmap. On restore, libxenstore gets EBADF from the /proc/xen/xenbus fd. It reopens the fd, re-registers watches, and continues. This turns out to be *exactly* the same as the case where a tool in dom0, using libxenstore to talk to xenstored via socket, sees the xenstored restart. ie. we need that code anyway. In summary, I think this is the cleanest way to make /proc/xen/xenbus a first-class citizen and more robust, involves minimal changes, and doesn't complicate save/restore much at all. Cheers! Rusty.
On 25 Sep 2005, at 04:29, Rusty Russell wrote:> > I was assuming per-connection, ie. every time a tool > opens /proc/xen/xenbus we map a new page/event channel, and free on > close. ie. the drivers in the domU kernel share the store comms page > created with the domain, tools in the domU get a separate page each, > tools in dom0 connect to the unix domain socket each. To the store, > one > connection, one transport. (Each connection can set up multiple > watches > of course, and Christian has been hinting that he wants multiple > transactions too, but that's another argument).Yeah, I can live with this, although: What about multiple transactions within the kernel? Do you plan to continue serialising them (e.g., on a waitqueue)? An advantage of mux/demux would be that concurrent kernel transactions could easily use the same mechanism. Your scheme places restart mechanisms in user space, so they're out of reach for kernel transactions. -- Keir
On 25 Sep 2005, at 01:57, Rusty Russell wrote:> Now, we already have this "domain won't save until transactions are > done" simply because we use a single big lock, but this discussion > started because we want to get rid of that lock for /proc/xen/xenbus > (it's fine for drivers). I think we should do so, but keep this > wont-save-during-transactions semantic; it means a waitqueue etc, but I > don't think it's too bad. As you say, our transactions are pretty > small. > > Do people like this more?Blocking system progress on arbitrary user apps doesn't sound particularly attractive to me, but I guess it is at least simple. I'm more inclined to EAGAIN on read/write, maybe sugared by exceptions in languages that support them, or setjmp/longjmp to get you back to the outermost scope of the transaction. I agree that's not pretty either, though. How will you handle 'xenstored restart'? You can't really guarantee that to always happen at opportune moments with no transactions in flight. -- Keir
On 25 Sep 2005, at 12:02, Keir Fraser wrote:> Yeah, I can live with this, although: What about multiple transactions > within the kernel? Do you plan to continue serialising them (e.g., on > a waitqueue)? An advantage of mux/demux would be that concurrent > kernel transactions could easily use the same mechanism. Your scheme > places restart mechanisms in user space, so they're out of reach for > kernel transactions.Also, page-per-connection won't entirely avoid sharing of state/resource in xenstored. At some point we'll want to add per-domain access policy, and space/bandwidth quotas (to prevent DoS). All of those must be shared between the multiple connections of a domain -- so the separate connections aren't as independent as you might like. -- Keir
On Sun, Sep 25, 2005 at 12:33:33PM +0100, Keir Fraser wrote:> > On 25 Sep 2005, at 12:02, Keir Fraser wrote: > > >Yeah, I can live with this, although: What about multiple transactions > >within the kernel? Do you plan to continue serialising them (e.g., on > >a waitqueue)? An advantage of mux/demux would be that concurrent > >kernel transactions could easily use the same mechanism. Your scheme > >places restart mechanisms in user space, so they're out of reach for > >kernel transactions. > > Also, page-per-connection won't entirely avoid sharing of > state/resource in xenstored. At some point we'll want to add per-domain > access policy, and space/bandwidth quotas (to prevent DoS). All of > those must be shared between the multiple connections of a domain -- so > the separate connections aren't as independent as you might like.I believe that the only thing we really _need_ at the moment is support for multiple concurrent transactions. This avoids having to hold the lock except for very short periods, and it allows the server to return EAGAIN on operations where it doesn't have the state corresponding to the transaction anymore. Since we need to add some kind of transaction identifier to the interface to support this, we should make this change now. christian
On Sun, 2005-09-25 at 12:09 +0100, Keir Fraser wrote:> How will you handle 'xenstored restart'? You can't really guarantee > that to always happen at opportune moments with no transactions in > flight.Transaction IDs: when you create a transaction, instead of returning "OK" it returns an ID (this also helps if later we want multiple transactions at a time on a single connection). If you were in a transaction when you got EBADF, you simply ask for it back before continuing: this can be done by libxenstore. It's simple enough to implement, because xenstored keeps each transaction in a separate file: it just has to pick up those files again on restart. I'll start sending patches... Cheers, Rusty.
On Sun, 2005-09-25 at 12:33 +0100, Keir Fraser wrote:> On 25 Sep 2005, at 12:02, Keir Fraser wrote: > > > Yeah, I can live with this, although: What about multiple transactions > > within the kernel? Do you plan to continue serialising them (e.g., on > > a waitqueue)? An advantage of mux/demux would be that concurrent > > kernel transactions could easily use the same mechanism. Your scheme > > places restart mechanisms in user space, so they're out of reach for > > kernel transactions.We already have the mechanism: xenbus_lock. I don't think we want to go for parallelism within the kernel for xenstore comms: it'd be a fair amount of pain for something which isn't exactly speed critical. Like Andrew said, I can't see transactions getting significantly longer.> Also, page-per-connection won't entirely avoid sharing of > state/resource in xenstored. At some point we'll want to add per-domain > access policy, and space/bandwidth quotas (to prevent DoS). All of > those must be shared between the multiple connections of a domain -- so > the separate connections aren't as independent as you might like.We already have a permissions model based on domid (although not actually enforced due to a bug: we can fix this with one line, but it will require xend fixups I imagine). Space quotas will have to be by ID, too, not by the connection(s) which created them: in the case of migration, the store will be recreated by the tools, but should still be counted against the ID which owns them. So even if we multiplexed all the connections together for one domain they would still have to be separate. Bandwidth quotas are an interesting idea: I was thinking of a dumb fairness scheme. We almost do this: we rotate the list of connections, but there's a FIXME about the unfair way we service domain pages. Or we could just measure the time we spend servicing each connection, and put the slowest ones at the tail... (socket connections would be immune, since we trust dom0 tools). 
I haven't thought too hard about it. Thanks, I'll update the TODO file... Rusty.
On Sun, 2005-09-25 at 19:55 +0100, Christian Limpach wrote:> On Sun, Sep 25, 2005 at 12:33:33PM +0100, Keir Fraser wrote: > > Also, page-per-connection won't entirely avoid sharing of > > state/resource in xenstored. At some point we'll want to add per-domain > > access policy, and space/bandwidth quotas (to prevent DoS). All of > > those must be shared between the multiple connections of a domain -- so > > the separate connections aren't as independent as you might like. > > I believe that the only thing we really _need_ at the moment > is support for multiple concurrent transactions.You mean, on the same connection, I assume?> This avoids > having to hold the lock except for very short periodsTrue, but is holding the lock a problem? If we remove /proc/xen/xenbus from the equation, I don't think it is. In fact, I think the lock is a very good thing, which is why I put it there.> and it > allows the server to return EAGAIN on operations where it doesn't > have the state corresponding to the transaction anymore.Non sequitur, AFAICT. We can have the server return EAGAIN on any operation, but that's independent of the kernel's locking strategy. But you will have to make all the kernel code deal with it, as well as everyone using libxenstore. Frankly, that's just stupid if we can easily avoid it.> Since we > need to add some kind of transaction identifier to the interface > to support this, we should make this change now.Or, alternately, since we don't need it, we shouldn't. Rusty.
On 26 Sep 2005, at 07:36, Rusty Russell wrote:>> Since we >> need to add some kind of transaction identifier to the interface >> to support this, we should make this change now. > > Or, alternately, since we don't need it, we shouldn't.Holding xenbus_lock even across kernel transactions is undesirable if we can avoid it, I think. Remind me again why mux/demux is harder or more code than spawning multiple pages per domain? Both approaches require extra code to be written after all, and it's not clear to me that the extra code required in xenstored for mux/demux is very much or particularly tricky to write. -- Keir
On Mon, Sep 26, 2005 at 04:36:11PM +1000, Rusty Russell wrote:> > I believe that the only thing we really _need_ at the moment > > is support for multiple concurrent transactions. > > You mean, on the same connection, I assume?Yes. Before I reply to the rest of your comments, I'd like to point out that I don't agree with your assumption that we can delay suspend/resume until we're outside of a transaction. This seems to be the fundamental difference driving the different solutions.> > This avoids > > having to hold the lock except for very short periods > > True, but is holding the lock a problem? If we remove /proc/xen/xenbus > from the equation, I don't think it is. In fact, I think the lock is a > very good thing, which is why I put it there.I think holding the lock is not desirable. For live migration, it might even turn out to be a bottleneck if we have to serialize device reconnect.> > and it > > allows the server to return EAGAIN on operations where it doesn't > > have the state corresponding to the transaction anymore. > > Non sequitur, AFAICT. We can have the server return EAGAIN on any > operation, but that's independent of the kernel's locking strategy. But > you will have to make all the kernel code deal with it, as well as > everyone using libxenstore. Frankly, that's just stupid if we can > easily avoid it.You read what I wrote in the wrong context: transaction ids are necessary to distinguish operations which are part of a transaction from those which are outside of transactions. On suspend/resume, you need to be able to decide, either at the xenstored level or better on the client side (xenbus driver or libxenstore), whether you need to fail an operation or not. You can achieve the same by only supporting operations within a transaction.> > Since we > > need to add some kind of transaction identifier to the interface > > to support this, we should make this change now. 
> > Or, alternately, since we don't need it, we shouldn't.I think we need them since it's the simplest solution to the whole multi-page/multi-connection issue for a saner xenbus_dev implementation: - lock only held around xs_talkv - transaction ids - single point for demultiplexing watch events christian
On 26 Sep 2005, at 19:51, Christian Limpach wrote:>>> Since we >>> need to add some kind of transaction identifier to the interface >>> to support this, we should make this change now. >> >> Or, alternately, since we don't need it, we shouldn't. > > I think we need them since it's the simplest solution to the whole > multi-page/multi-connection issue for a saner xenbus_dev > implementation: > - lock only held around xs_talkv > - transaction ids > - single point for demultiplexing watch eventsThis is precisely how I expected that xenbus was going to be structured in the first place. It seems the simplest, most natural implementation and happens to avoid a lot of potential unnecessary blocking and serialisation. And not even at much cost in xenstored (how hard can the demultiplex be?). -- Keir
On Mon, 2005-09-26 at 20:30 +0100, Keir Fraser wrote:> On 26 Sep 2005, at 19:51, Christian Limpach wrote: > > >>> Since we > >>> need to add some kind of transaction identifier to the interface > >>> to support this, we should make this change now. > >> > >> Or, alternately, since we don't need it, we shouldn't. > > > > I think we need them since it's the simplest solution to the whole > > multi-page/multi-connection issue for a saner xenbus_dev > > implementation: > > - lock only held around xs_talkv > > - transaction ids > > - single point for demultiplexing watch events > > This is precisely how I expected that xenbus was going to be structured > in the first place. It seems the simplest, most natural implementation > and happens to avoid a lot of potential unnecessary blocking and > serialisation. And not even at much cost in xenstored (how hard can the > demultiplex be?).Well, here's the simple patch that modifies introductions to allow the domain to introduce new pages, and another one that tests it. I'm away for the next week and a half on other stuff, then I'm in Cambridge. Modifying the xenbus dev to use this should be fairly easy, if you choose to go this route. 
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> diff -r 0e368f851f6a tools/python/xen/lowlevel/xs/xs.c --- a/tools/python/xen/lowlevel/xs/xs.c Tue Sep 27 00:55:56 2005 +++ b/tools/python/xen/lowlevel/xs/xs.c Tue Sep 27 16:36:19 2005 @@ -728,7 +728,7 @@ &dom)) goto exit; Py_BEGIN_ALLOW_THREADS - xsval = xs_release_domain(xh, dom); + xsval = xs_release_domain(xh, dom, 0); Py_END_ALLOW_THREADS if (!xsval) { PyErr_SetFromErrno(PyExc_RuntimeError); diff -r 0e368f851f6a tools/xenstore/xenstored_core.c --- a/tools/xenstore/xenstored_core.c Tue Sep 27 00:55:56 2005 +++ b/tools/xenstore/xenstored_core.c Tue Sep 27 16:36:19 2005 @@ -1136,7 +1136,7 @@ break; case XS_RELEASE: - do_release(conn, onearg(in)); + do_release(conn, in); break; case XS_GET_DOMAIN_PATH: diff -r 0e368f851f6a tools/xenstore/xenstored_domain.c --- a/tools/xenstore/xenstored_domain.c Tue Sep 27 00:55:56 2005 +++ b/tools/xenstore/xenstored_domain.c Tue Sep 27 16:36:19 2005 @@ -50,6 +50,9 @@ /* Event channel port */ u16 port; + + /* Page number. */ + unsigned long mfn; /* Domain path in store. */ char *path; @@ -282,6 +285,7 @@ domain->port = 0; domain->shutdown = 0; domain->domid = domid; + domain->mfn = mfn; domain->path = talloc_strdup(domain, path); domain->page = xc_map_foreign_range(*xc_handle, domain->domid, getpagesize(), @@ -312,15 +316,28 @@ { struct domain *domain; char *vec[4]; + domid_t domid; if (get_strings(in, vec, ARRAY_SIZE(vec)) < ARRAY_SIZE(vec)) { send_error(conn, EINVAL); return; } - if (conn->id != 0 || !conn->can_write) { + domid = atoi(vec[0]); + if (!conn->can_write || (conn->id != 0 && domid != DOMID_SELF)) { send_error(conn, EACCES); return; + } + + /* Domains can introduce more comms pages to store. FIXME: limit. */ + if (domid == DOMID_SELF) { + if (!conn->domain) { + send_error(conn, EINVAL); + return; + } + /* Same domid and path. */ + domid = conn->domain->domid; + vec[3] = conn->domain->path; } /* Sanity check args. 
*/ @@ -329,8 +346,7 @@ return; } /* Hang domain off "in" until we're finished. */ - domain = new_domain(in, atoi(vec[0]), atol(vec[1]), atol(vec[2]), - vec[3]); + domain = new_domain(in, domid, atol(vec[1]), atol(vec[2]), vec[3]); if (!domain) { send_error(conn, errno); return; } @@ -339,58 +355,69 @@ /* Now domain belongs to its connection. */ talloc_steal(domain->conn, domain); - fire_watches(conn, "@introduceDomain", false); + if (domid != DOMID_SELF) + fire_watches(conn, "@introduceDomain", false); send_ack(conn, XS_INTRODUCE); } -static struct domain *find_domain_by_domid(domid_t domid) +static struct domain *find_domain(domid_t domid, unsigned long mfn) { struct domain *i; list_for_each_entry(i, &domains, list) { - if (i->domid == domid) + if (i->domid == domid && (!mfn || i->mfn == mfn)) return i; } return NULL; } -/* domid */ -void do_release(struct connection *conn, const char *domid_str) +/* domid, mfn */ +void do_release(struct connection *conn, struct buffered_data *in) { struct domain *domain; domid_t domid; - - if (!domid_str) { + unsigned long mfn; + bool released = false; + char *vec[2]; + + if (get_strings(in, vec, ARRAY_SIZE(vec)) < ARRAY_SIZE(vec)) { send_error(conn, EINVAL); return; } - - domid = atoi(domid_str); + domid = atoi(vec[0]); + mfn = atol(vec[1]); if (!domid) { send_error(conn, EINVAL); return; } - if (conn->id != 0) { - send_error(conn, EACCES); - return; - } - - domain = find_domain_by_domid(domid); - if (!domain) { + if (domid == DOMID_SELF) { + if (!conn->domain || mfn == 0) { + send_error(conn, EINVAL); + return; + } + domid = conn->domain->domid; + } else { + if (conn->id != 0) { + send_error(conn, EACCES); + return; + } + } + + /* Can release multiple if mfn == 0 */ + while ((domain = find_domain(domid, mfn)) != NULL) { + talloc_free(domain->conn); + released = true; + } + + if (!released) { send_error(conn, ENOENT); return; } - if (!domain->conn) { - send_error(conn, EINVAL); - return; - } - - talloc_free(domain->conn); - - 
fire_watches(conn, "@releaseDomain", false); + if (domid != DOMID_SELF) + fire_watches(conn, "@releaseDomain", false); send_ack(conn, XS_RELEASE); } @@ -409,7 +436,7 @@ if (domid == DOMID_SELF) domain = conn->domain; else - domain = find_domain_by_domid(domid); + domain = find_domain(domid, 0); if (!domain) send_error(conn, ENOENT); diff -r 0e368f851f6a tools/xenstore/xenstored_domain.h --- a/tools/xenstore/xenstored_domain.h Tue Sep 27 00:55:56 2005 +++ b/tools/xenstore/xenstored_domain.h Tue Sep 27 16:36:19 2005 @@ -25,8 +25,8 @@ /* domid, mfn, eventchn, path */ void do_introduce(struct connection *conn, struct buffered_data *in); -/* domid */ -void do_release(struct connection *conn, const char *domid_str); +/* domid, mfn */ +void do_release(struct connection *conn, struct buffered_data *in); /* domid */ void do_get_domain_path(struct connection *conn, const char *domid_str); diff -r 0e368f851f6a tools/xenstore/xs.c --- a/tools/xenstore/xs.c Tue Sep 27 00:55:56 2005 +++ b/tools/xenstore/xs.c Tue Sep 27 16:36:19 2005 @@ -676,13 +676,21 @@ return xs_bool(xs_talkv(h, XS_INTRODUCE, iov, ARRAY_SIZE(iov), NULL)); } -bool xs_release_domain(struct xs_handle *h, domid_t domid) +bool xs_release_domain(struct xs_handle *h, domid_t domid, unsigned long mfn) { char domid_str[MAX_STRLEN(domid)]; + char mfn_str[MAX_STRLEN(mfn)]; + struct iovec iov[2]; sprintf(domid_str, "%u", domid); - - return xs_bool(xs_single(h, XS_RELEASE, domid_str, NULL)); + sprintf(mfn_str, "%lu", mfn); + + iov[0].iov_base = domid_str; + iov[0].iov_len = strlen(domid_str) + 1; + iov[1].iov_base = mfn_str; + iov[1].iov_len = strlen(mfn_str) + 1; + + return xs_bool(xs_talkv(h, XS_RELEASE, iov, ARRAY_SIZE(iov), NULL)); } char *xs_get_domain_path(struct xs_handle *h, domid_t domid) diff -r 0e368f851f6a tools/xenstore/xs.h --- a/tools/xenstore/xs.h Tue Sep 27 00:55:56 2005 +++ b/tools/xenstore/xs.h Tue Sep 27 16:36:19 2005 @@ -130,8 +130,9 @@ /* Release a domain. 
* Tells the store domain to release the memory page to the domain. + * mfn is 0 to release all of them. */ -bool xs_release_domain(struct xs_handle *h, domid_t domid); +bool xs_release_domain(struct xs_handle *h, domid_t domid, unsigned long mfn); /* Query the home path of a domain. */ Test code for previous patch. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> diff -r 0e368f851f6a tools/xenstore/fake_libxc.c --- a/tools/xenstore/fake_libxc.c Tue Sep 27 00:55:56 2005 +++ b/tools/xenstore/fake_libxc.c Tue Sep 27 16:36:19 2005 @@ -44,42 +44,38 @@ return 0; } -void *xc_map_foreign_range(int xc_handle, u32 dom __attribute__((unused)), +void *xc_map_foreign_range(int xc_handle __attribute__((unused)), + u32 dom __attribute__((unused)), int size, int prot, unsigned long mfn __attribute__((unused))) { void *ret; + int *extra; + int fd; - ret = mmap(NULL, size, prot, MAP_SHARED, xc_handle, 0); + fd = open("/tmp/xcmap", O_RDWR); + if (fd < 0) + return NULL; + + /* We actually get extra page, for comms with xs_test. */ + ret = mmap(NULL, size + getpagesize(), prot, MAP_SHARED, fd, 0); if (ret == MAP_FAILED) return NULL; - /* xs_test tells us pid and port by putting it in buffer, we reply. 
 */
-	xs_test_pid = *(int *)(ret + 32);
-	port = *(int *)(ret + 36);
-	*(int *)(ret + 32) = getpid();
+	extra = ret + size;
+	xs_test_pid = extra[0];
+	port = extra[1];
+	extra[2] = getpid();
 	return ret;
 }
 
 int xc_interface_open(void)
 {
-	int fd;
-	char page[getpagesize()];
-
-	fd = open("/tmp/xcmap", O_RDWR|O_CREAT|O_TRUNC, 0600);
-	if (fd < 0)
-		return fd;
-
-	memset(page, 0, sizeof(page));
-	if (!xs_write_all(fd, page, sizeof(page)))
-		barf_perror("Failed to write /tmp/xcmap page");
-
-	return fd;
+	return 1;
 }
 
-int xc_interface_close(int xc_handle)
+int xc_interface_close(int xc_handle __attribute__((unused)))
 {
-	close(xc_handle);
 	return 0;
 }
diff -r 0e368f851f6a tools/xenstore/testsuite/09domain.test
--- a/tools/xenstore/testsuite/09domain.test	Tue Sep 27 00:55:56 2005
+++ b/tools/xenstore/testsuite/09domain.test	Tue Sep 27 16:36:19 2005
@@ -17,3 +17,24 @@
 expect handle is 2
 introduce 1 100 7 /my/home
 release 1
+
+# Introduce sub-connection
+write /my/home/entry contents
+close
+expect handle is 3
+introduce 1 100 7 /my/home
+expect handle is 4
+3 introduce-self 120 7 /my/home
+
+# Check home is correct
+expect 4:contents
+4 read entry
+
+# Release 4 from 3
+3 release-self 120
+
+# Introduce a new one and release both at once.
+expect handle is 5
+3 introduce-self 120 7 /my/home
+
+release 1
diff -r 0e368f851f6a tools/xenstore/xs_test.c
--- a/tools/xenstore/xs_test.c	Tue Sep 27 00:55:56 2005
+++ b/tools/xenstore/xs_test.c	Tue Sep 27 16:36:19 2005
@@ -59,6 +59,7 @@
 static struct ringbuf_head *out, *in;
 static unsigned int ringbuf_datasize;
+static int event_channel;
 static int daemon_pid;
 
 /* FIXME: Mark connection as broken (close it?) when this happens.
 */
@@ -208,6 +209,9 @@
 	"  start <node>\n"
 	"  abort\n"
 	"  introduce <domid> <mfn> <eventchn> <path>\n"
+	"  introduce-self <mfn> <eventchn> <path>\n"
+	"  release <domid>\n"
+	"  release-self <mfn>\n"
 	"  commit\n"
 	"  sleep <milliseconds>\n"
 	"  expect <pattern>\n"
@@ -575,16 +579,22 @@
 }
 
 static void do_introduce(unsigned int handle,
-			 const char *domid,
+			 int domid,
 			 const char *mfn,
 			 const char *eventchn,
 			 const char *path)
 {
 	unsigned int i;
-	int fd;
+	int fd, *extra;
+	char pages[getpagesize()*2];
+
+	fd = open("/tmp/xcmap", O_RDWR|O_CREAT, 0600);
+	memset(pages, 0, sizeof(pages));
+	write(fd, pages, sizeof(pages));
 
 	/* This mechanism is v. slow w. valgrind running. */
-	timeout_ms = 5000;
+	if (timeout_ms)
+		timeout_ms = 5000;
 
 	/* We poll, so ignore signal */
 	signal(SIGUSR2, SIG_IGN);
@@ -592,22 +602,24 @@
 		if (!handles[i])
 			break;
 
-	fd = open("/tmp/xcmap", O_RDWR);
 	/* Set in and out pointers. */
-	out = mmap(NULL, getpagesize(), PROT_WRITE|PROT_READ, MAP_SHARED,fd,0);
+	out = mmap(NULL, sizeof(pages), PROT_WRITE|PROT_READ, MAP_SHARED,fd,0);
 	if (out == MAP_FAILED)
 		barf_perror("Failed to map /tmp/xcmap page");
 	in = (void *)out + getpagesize() / 2;
 	close(fd);
 
+	event_channel = atoi(eventchn);
+
 	/* Tell them the event channel and our PID. */
-	*(int *)((void *)out + 32) = getpid();
-	*(u16 *)((void *)out + 36) = atoi(eventchn);
-
-	if (!xs_introduce_domain(handles[handle], atoi(domid),
+	extra = (void *)out + getpagesize();
+	extra[0] = getpid();
+	extra[1] = event_channel;
+
+	if (!xs_introduce_domain(handles[handle], domid,
 				 atol(mfn), atoi(eventchn), path)) {
 		failed(handle);
-		munmap(out, getpagesize());
+		munmap(out, getpagesize()*2);
 		return;
 	}
 	output("handle is %i\n", i);
@@ -622,12 +634,18 @@
 	handles[i]->committing = false;
 
 	/* Read in daemon pid.
 */
-	daemon_pid = *(int *)((void *)out + 32);
+	daemon_pid = extra[2];
 }
 
 static void do_release(unsigned int handle, const char *domid)
 {
-	if (!xs_release_domain(handles[handle], atoi(domid)))
+	if (!xs_release_domain(handles[handle], atoi(domid), 0))
+		failed(handle);
+}
+
+static void do_release_self(unsigned int handle, const char *mfn)
+{
+	if (!xs_release_domain(handles[handle], DOMID_SELF, atol(mfn)))
 		failed(handle);
 }
 
@@ -802,10 +820,15 @@
 	else if (streq(command, "abort"))
 		do_end(handle, true);
 	else if (streq(command, "introduce"))
-		do_introduce(handle, arg(line, 1), arg(line, 2),
+		do_introduce(handle, atoi(arg(line, 1)), arg(line, 2),
 			     arg(line, 3), arg(line, 4));
+	else if (streq(command, "introduce-self"))
+		do_introduce(handle, DOMID_SELF,
+			     arg(line, 1), arg(line, 2), arg(line, 3));
 	else if (streq(command, "release"))
 		do_release(handle, arg(line, 1));
+	else if (streq(command, "release-self"))
+		do_release_self(handle, arg(line, 1));
 	else if (streq(command, "dump"))
 		dump(handle);
 	else if (streq(command, "sleep")) {

-- 
A bad analogy is like a leaky screwdriver -- Richard Braakman
On Mon, 2005-09-26 at 19:51 +0100, Christian Limpach wrote:
> On Mon, Sep 26, 2005 at 04:36:11PM +1000, Rusty Russell wrote:
> > > I believe that the only thing we really _need_ at the moment
> > > is support for multiple concurrent transactions.
> >
> > You mean, on the same connection, I assume?
>
> Yes.
>
> Before I reply to the rest of your comments, I'd like to point out that
> I don't agree with your assumption that we can delay suspend/resume
> until we're outside of a transaction.  This seems to be the fundamental
> difference driving different solutions.

Absolutely.  The delays will be immeasurably short.  The code will be
simple.  The API will be simple.  Locking will be simple.  Drivers will
be simple.  Debugging will be simple.

Parallelism is not going to make restore faster; in fact, the current
"winner takes all" transaction model means they'll be much, much slower.

Now, since this is really my last email before I head off on leave,
you're going to decide this without my bitching and moaning (yeah, I
know how much you'll miss it!).  Nonetheless, I'm happy to hand
xenstored to your capable hands.

When I come back, I might start writing tests or something equally
uncontroversial 8)

Cheers!
Rusty.
-- 
A bad analogy is like a leaky screwdriver -- Richard Braakman
Rusty Russell wrote:
> On Mon, 2005-09-26 at 19:51 +0100, Christian Limpach wrote:
> > On Mon, Sep 26, 2005 at 04:36:11PM +1000, Rusty Russell wrote:
> >
> > Before I reply to the rest of your comments, I'd like to point out that
> > I don't agree with your assumption that we can delay suspend/resume
> > until we're outside of a transaction.  This seems to be the fundamental
> > difference driving different solutions.
>
> Absolutely.  The delays will be immeasurably short.

What about denial of service (intentional or due to bugs in guest
OSes/tools)?

-- 
David Hopwood <david.nospam.hopwood@blueyonder.co.uk>