David Vrabel
2012-Aug-23 17:13 UTC
[RFC PATCH 0/3] xen/privcmd: support for paged-out frames
This series is a straight forward-port of some functionality from classic kernels to support Xen hosts that do paging of guests. This isn''t functionality the XenServer makes use of so I''ve not tested these with paging in use (GridCentric requested that our older kernels supported this and I''m just doing the forward port). I''m not entirely happy about the approach used here because: 1. It relies on the meaning of the return code of the update_mmu hypercall and it assumes the value Xen used for -ENOENT is the same the kernel uses. This does not appear to be a formal part of the hypercall ABI. Keir, can you comment on this? 2. It seems more sensible to have the kernel do the retries instead of libxc doing them. The kernel has to have a mechanism for this any way (for mapping back/front rings). 3. The current way of handling paged-out frames by repeatedly retrying is a bit lame. Shouldn''t there be some event that the guest waiting for the frame can wait on instead? By moving the retry mechanism into the kernel we can change this without impacting the ABI to userspace. David
David Vrabel
2012-Aug-23 17:13 UTC
[PATCH 1/3] xen/mm: return more precise error from xen_remap_domain_range()
From: David Vrabel <david.vrabel@citrix.com> Callers of xen_remap_domain_range() need to know if the remap failed because frame is currently paged out. So they can retry the remap later on. Return -ENOENT in this case. This assumes that the error codes returned by Xen are a subset of those used by the kernel. It is unclear if this is defined as part of the hypercall ABI. Signed-off-by: David Vrabel <david.vrabel@citrix.com> --- arch/x86/xen/mmu.c | 4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c index b65a761..fb187ea 100644 --- a/arch/x86/xen/mmu.c +++ b/arch/x86/xen/mmu.c @@ -2355,8 +2355,8 @@ int xen_remap_domain_mfn_range(struct vm_area_struct *vma, if (err) goto out; - err = -EFAULT; - if (HYPERVISOR_mmu_update(mmu_update, batch, NULL, domid) < 0) + err = HYPERVISOR_mmu_update(mmu_update, batch, NULL, domid); + if (err < 0) goto out; nr -= batch; -- 1.7.2.5
David Vrabel
2012-Aug-23 17:13 UTC
[PATCH 2/3] xen/privcmd: report paged-out frames in PRIVCMD_MMAPBATCH ioctl
From: David Vrabel <david.vrabel@citrix.com> libxc handles paged-out frames in xc_map_foreign_bulk() and friends by retrying the map operation. libxc expects the PRIVCMD_MMAPBATCH ioctl to report paged out frames by setting bit 31 in the mfn. Do this for the PRIVCMD_MMAPBATCH ioctl if xen_remap_domain_mfn_range() returned -ENOENT. Signed-off-by: David Vrabel <david.vrabel@citrix.com> --- drivers/xen/privcmd.c | 11 ++++++++--- 1 files changed, 8 insertions(+), 3 deletions(-) diff --git a/drivers/xen/privcmd.c b/drivers/xen/privcmd.c index ccee0f1..f8c1b6d 100644 --- a/drivers/xen/privcmd.c +++ b/drivers/xen/privcmd.c @@ -255,10 +255,15 @@ static int mmap_batch_fn(void *data, void *state) { xen_pfn_t *mfnp = data; struct mmap_batch_state *st = state; + int ret; - if (xen_remap_domain_mfn_range(st->vma, st->va & PAGE_MASK, *mfnp, 1, - st->vma->vm_page_prot, st->domain) < 0) { - *mfnp |= 0xf0000000U; + ret = xen_remap_domain_mfn_range(st->vma, st->va & PAGE_MASK, *mfnp, 1, + st->vma->vm_page_prot, st->domain); + if (ret < 0) { + if (ret == -ENOENT) + *mfnp |= 0x80000000U; + else + *mfnp |= 0xf0000000U; st->err++; } st->va += PAGE_SIZE; -- 1.7.2.5
David Vrabel
2012-Aug-23 17:13 UTC
[PATCH 3/3] xen/privcmd: add PRIVCMD_MMAPBATCH_V2 ioctl
From: David Vrabel <david.vrabel@citrix.com> PRIVCMD_MMAPBATCH_V2 extends PRIVCMD_MMAPBATCH with an additional field for reporting the error code for every frame that could not be mapped. libxc prefers PRIVCMD_MMAPBATCH_V2 over PRIVCMD_MMAPBATCH. Signed-off-by: David Vrabel <david.vrabel@citrix.com> --- drivers/xen/privcmd.c | 54 +++++++++++++++++++++++++++++++++++++++---------- include/xen/privcmd.h | 10 +++++++++ 2 files changed, 53 insertions(+), 11 deletions(-) diff --git a/drivers/xen/privcmd.c b/drivers/xen/privcmd.c index f8c1b6d..4f97160 100644 --- a/drivers/xen/privcmd.c +++ b/drivers/xen/privcmd.c @@ -248,7 +248,8 @@ struct mmap_batch_state { struct vm_area_struct *vma; int err; - xen_pfn_t __user *user; + xen_pfn_t __user *user_mfn; + int __user *user_err; }; static int mmap_batch_fn(void *data, void *state) @@ -275,34 +276,58 @@ static int mmap_return_errors(void *data, void *state) { xen_pfn_t *mfnp = data; struct mmap_batch_state *st = state; + int ret = 0; - return put_user(*mfnp, st->user++); + if (st->user_err) { + if ((*mfnp & 0xf0000000U) == 0xf0000000U) + ret = -ENOENT; + else if ((*mfnp & 0xf0000000U) == 0x80000000U) + ret = -EINVAL; + else + ret = 0; + return __put_user(ret, st->user_err); + } else + return __put_user(*mfnp, st->user_mfn++); } static struct vm_operations_struct privcmd_vm_ops; -static long privcmd_ioctl_mmap_batch(void __user *udata) +static long privcmd_ioctl_mmap_batch(void __user *udata, int version) { int ret; - struct privcmd_mmapbatch m; + struct privcmd_mmapbatch_v2 m; struct mm_struct *mm = current->mm; struct vm_area_struct *vma; unsigned long nr_pages; LIST_HEAD(pagelist); struct mmap_batch_state state; + printk("%s(%d)\n", __func__, version); + if (!xen_initial_domain()) return -EPERM; - if (copy_from_user(&m, udata, sizeof(m))) - return -EFAULT; + if (version == 1) { + if (copy_from_user(&m, udata, sizeof(struct privcmd_mmapbatch))) + return -EFAULT; + if (!access_ok(VERIFY_WRITE, m.arr, m.num * sizeof(*m.arr))) + return -EFAULT; + m.err = NULL; + } else { + if (copy_from_user(&m, udata, sizeof(struct privcmd_mmapbatch_v2))) + return -EFAULT; + if (!access_ok(VERIFY_READ, m.arr, m.num * sizeof(*m.arr))) + return -EFAULT; + if (!access_ok(VERIFY_WRITE, m.err, m.num * (sizeof(*m.err)))) + return -EFAULT; + } nr_pages = m.num; if ((m.num <= 0) || (nr_pages > (LONG_MAX >> PAGE_SHIFT))) return -EINVAL; ret = gather_array(&pagelist, m.num, sizeof(xen_pfn_t), - m.arr); + (xen_pfn_t *)m.arr); if (ret || list_empty(&pagelist)) goto out; @@ -331,10 +356,11 @@ static long privcmd_ioctl_mmap_batch(void __user *udata) up_write(&mm->mmap_sem); if (state.err > 0) { - state.user = m.arr; + state.user_mfn = (xen_pfn_t *)m.arr; + state.user_err = m.err; ret = traverse_pages(m.num, sizeof(xen_pfn_t), - &pagelist, - mmap_return_errors, &state); + &pagelist, + mmap_return_errors, &state); } out: @@ -359,7 +385,13 @@ static long privcmd_ioctl(struct file *file, break; case IOCTL_PRIVCMD_MMAPBATCH: - ret = privcmd_ioctl_mmap_batch(udata); + ret = privcmd_ioctl_mmap_batch(udata, 1); + printk("%s() batch ret = %d\n", __func__, ret); + break; + + case IOCTL_PRIVCMD_MMAPBATCH_V2: + ret = privcmd_ioctl_mmap_batch(udata, 2); + printk("%s() batch ret = %d\n", __func__, ret); break; default: diff --git a/include/xen/privcmd.h b/include/xen/privcmd.h index 17857fb..9fa27c4 100644 --- a/include/xen/privcmd.h +++ b/include/xen/privcmd.h @@ -62,6 +62,14 @@ struct privcmd_mmapbatch { xen_pfn_t __user *arr; /* array of mfns - top nibble set on err */ }; +struct privcmd_mmapbatch_v2 { + unsigned int num; /* number of pages to populate */ + domid_t dom; /* target domain */ + __u64 addr; /* virtual address */ + const xen_pfn_t __user *arr; /* array of mfns */ + int __user *err; /* array of error codes */ +}; + /* * @cmd: IOCTL_PRIVCMD_HYPERCALL * @arg: &privcmd_hypercall_t @@ -73,5 +81,7 @@ struct privcmd_mmapbatch { _IOC(_IOC_NONE, ''P'', 2, sizeof(struct privcmd_mmap)) #define IOCTL_PRIVCMD_MMAPBATCH \ _IOC(_IOC_NONE, ''P'', 3, sizeof(struct privcmd_mmapbatch)) +#define IOCTL_PRIVCMD_MMAPBATCH_V2 \ + _IOC(_IOC_NONE, ''P'', 4, sizeof(struct privcmd_mmapbatch_v2)) #endif /* __LINUX_PUBLIC_PRIVCMD_H__ */ -- 1.7.2.5
Konrad Rzeszutek Wilk
2012-Aug-23 19:40 UTC
Re: [PATCH 3/3] xen/privcmd: add PRIVCMD_MMAPBATCH_V2 ioctl
On Thu, Aug 23, 2012 at 06:13:46PM +0100, David Vrabel wrote:> From: David Vrabel <david.vrabel@citrix.com> > > PRIVCMD_MMAPBATCH_V2 extends PRIVCMD_MMAPBATCH with an additional > field for reporting the error code for every frame that could not be > mapped. libxc prefers PRIVCMD_MMAPBATCH_V2 over PRIVCMD_MMAPBATCH. > > Signed-off-by: David Vrabel <david.vrabel@citrix.com> > --- > drivers/xen/privcmd.c | 54 +++++++++++++++++++++++++++++++++++++++---------- > include/xen/privcmd.h | 10 +++++++++ > 2 files changed, 53 insertions(+), 11 deletions(-) > > diff --git a/drivers/xen/privcmd.c b/drivers/xen/privcmd.c > index f8c1b6d..4f97160 100644 > --- a/drivers/xen/privcmd.c > +++ b/drivers/xen/privcmd.c > @@ -248,7 +248,8 @@ struct mmap_batch_state { > struct vm_area_struct *vma; > int err; > > - xen_pfn_t __user *user; > + xen_pfn_t __user *user_mfn; > + int __user *user_err; > }; > > static int mmap_batch_fn(void *data, void *state) > @@ -275,34 +276,58 @@ static int mmap_return_errors(void *data, void *state) > { > xen_pfn_t *mfnp = data; > struct mmap_batch_state *st = state; > + int ret = 0; > > - return put_user(*mfnp, st->user++); > + if (st->user_err) { > + if ((*mfnp & 0xf0000000U) == 0xf0000000U) > + ret = -ENOENT; > + else if ((*mfnp & 0xf0000000U) == 0x80000000U) > + ret = -EINVAL;Yikes. Any way those 0xf000.. and 0x8000 can at least be #defined> + else > + ret = 0; > + return __put_user(ret, st->user_err); > + } else > + return __put_user(*mfnp, st->user_mfn++); > } > > static struct vm_operations_struct privcmd_vm_ops; > > -static long privcmd_ioctl_mmap_batch(void __user *udata) > +static long privcmd_ioctl_mmap_batch(void __user *udata, int version) > { > int ret; > - struct privcmd_mmapbatch m; > + struct privcmd_mmapbatch_v2 m; > struct mm_struct *mm = current->mm; > struct vm_area_struct *vma; > unsigned long nr_pages; > LIST_HEAD(pagelist); > struct mmap_batch_state state; > > + printk("%s(%d)\n", __func__, version); > +Hehe.> if (!xen_initial_domain()) > return -EPERM; > > - if (copy_from_user(&m, udata, sizeof(m))) > - return -EFAULT; > + if (version == 1) { > + if (copy_from_user(&m, udata, sizeof(struct privcmd_mmapbatch))) > + return -EFAULT; > + if (!access_ok(VERIFY_WRITE, m.arr, m.num * sizeof(*m.arr))) > + return -EFAULT;That is new..> + m.err = NULL; > + } else {Not elseif (version == 2) ? .. what if version 3 comes around?> + if (copy_from_user(&m, udata, sizeof(struct privcmd_mmapbatch_v2))) > + return -EFAULT; > + if (!access_ok(VERIFY_READ, m.arr, m.num * sizeof(*m.arr))) > + return -EFAULT; > + if (!access_ok(VERIFY_WRITE, m.err, m.num * (sizeof(*m.err)))) > + return -EFAULT;I think the VERIFY_WRITE can cover both versions?> + } > > nr_pages = m.num; > if ((m.num <= 0) || (nr_pages > (LONG_MAX >> PAGE_SHIFT))) > return -EINVAL; > > ret = gather_array(&pagelist, m.num, sizeof(xen_pfn_t), > - m.arr); > + (xen_pfn_t *)m.arr); > > if (ret || list_empty(&pagelist)) > goto out; > @@ -331,10 +356,11 @@ static long privcmd_ioctl_mmap_batch(void __user *udata) > up_write(&mm->mmap_sem); > > if (state.err > 0) { > - state.user = m.arr; > + state.user_mfn = (xen_pfn_t *)m.arr; > + state.user_err = m.err; > ret = traverse_pages(m.num, sizeof(xen_pfn_t), > - &pagelist, > - mmap_return_errors, &state); > + &pagelist, > + mmap_return_errors, &state); > } > > out: > @@ -359,7 +385,13 @@ static long privcmd_ioctl(struct file *file, > break; > > case IOCTL_PRIVCMD_MMAPBATCH: > - ret = privcmd_ioctl_mmap_batch(udata); > + ret = privcmd_ioctl_mmap_batch(udata, 1); > + printk("%s() batch ret = %d\n", __func__, ret);Pfff...> + break; > + > + case IOCTL_PRIVCMD_MMAPBATCH_V2: > + ret = privcmd_ioctl_mmap_batch(udata, 2); > + printk("%s() batch ret = %d\n", __func__, ret); > break; > > default: > diff --git a/include/xen/privcmd.h b/include/xen/privcmd.h > index 17857fb..9fa27c4 100644 > --- a/include/xen/privcmd.h > +++ b/include/xen/privcmd.h > @@ -62,6 +62,14 @@ struct privcmd_mmapbatch { > xen_pfn_t __user *arr; /* array of mfns - top nibble set on err */ > }; > > +struct privcmd_mmapbatch_v2 { > + unsigned int num; /* number of pages to populate */unsigend int? Not ''u32''?> + domid_t dom; /* target domain */ > + __u64 addr; /* virtual address */ > + const xen_pfn_t __user *arr; /* array of mfns */ > + int __user *err; /* array of error codes */int? Not a specific type?> +}; > + > /* > * @cmd: IOCTL_PRIVCMD_HYPERCALL > * @arg: &privcmd_hypercall_t > @@ -73,5 +81,7 @@ struct privcmd_mmapbatch { > _IOC(_IOC_NONE, ''P'', 2, sizeof(struct privcmd_mmap)) > #define IOCTL_PRIVCMD_MMAPBATCH \ > _IOC(_IOC_NONE, ''P'', 3, sizeof(struct privcmd_mmapbatch)) > +#define IOCTL_PRIVCMD_MMAPBATCH_V2 \ > + _IOC(_IOC_NONE, ''P'', 4, sizeof(struct privcmd_mmapbatch_v2)) > > #endif /* __LINUX_PUBLIC_PRIVCMD_H__ */ > -- > 1.7.2.5
Andres Lagar-Cavilla
2012-Aug-24 01:32 UTC
Re: [RFC PATCH 0/3] xen/privcmd: support for paged-out frames
On Aug 23, 2012, at 1:13 PM, David Vrabel wrote:> This series is a straight forward-port of some functionality from > classic kernels to support Xen hosts that do paging of guests. > > This isn''t functionality the XenServer makes use of so I''ve not tested > these with paging in use (GridCentric requested that our older kernels > supported this and I''m just doing the forward port).Thanks for this series. Very timely. I may add that we are not the only consumers of paging. This functionality was first added into classic kernels by Olaf Hering from Suse (added to cc).> > I''m not entirely happy about the approach used here because: > > 1. It relies on the meaning of the return code of the update_mmu > hypercall and it assumes the value Xen used for -ENOENT is the same > the kernel uses. This does not appear to be a formal part of the > hypercall ABI. > > Keir, can you comment on this?I see your point. I may add that it''s likely to be more pervasive than just relying on ENOENT being 12. Which is a fairly safe bet.> > 2. It seems more sensible to have the kernel do the retries instead of > libxc doing them. The kernel has to have a mechanism for this any way > (for mapping back/front rings). > > 3. The current way of handling paged-out frames by repeatedly retrying > is a bit lame. Shouldn''t there be some event that the guest waiting > for the frame can wait on instead? By moving the retry mechanism into > the kernel we can change this without impacting the ABI to userspace.Lame is an interesting choice of language :) I am not a huge fan of libxc retry, but we''ve been pounding it quite hard for a while and it works -- and, importantly, it yields the scheduler :) While kernel retry may benefit from hypothetical code reuse, "Shouldn''t there be some event that the guest waiting for the frame can wait on instead?" will need to become concrete to start a real discussion. For better or worse, since xen-4.1 (!) libxc will do the right thing if fed the appropriate errno. Let me re-iterate that this is great work. Thanks Andres> > David >
Andres Lagar-Cavilla
2012-Aug-24 01:34 UTC
Re: [PATCH 2/3] xen/privcmd: report paged-out frames in PRIVCMD_MMAPBATCH ioctl
On Aug 23, 2012, at 1:13 PM, David Vrabel wrote:> From: David Vrabel <david.vrabel@citrix.com> > > libxc handles paged-out frames in xc_map_foreign_bulk() and friends by > retrying the map operation. libxc expects the PRIVCMD_MMAPBATCH ioctl > to report paged out frames by setting bit 31 in the mfn. > > Do this for the PRIVCMD_MMAPBATCH ioctl if > xen_remap_domain_mfn_range() returned -ENOENT. > > Signed-off-by: David Vrabel <david.vrabel@citrix.com> > --- > drivers/xen/privcmd.c | 11 ++++++++--- > 1 files changed, 8 insertions(+), 3 deletions(-) > > diff --git a/drivers/xen/privcmd.c b/drivers/xen/privcmd.c > index ccee0f1..f8c1b6d 100644 > --- a/drivers/xen/privcmd.c > +++ b/drivers/xen/privcmd.c > @@ -255,10 +255,15 @@ static int mmap_batch_fn(void *data, void *state) > { > xen_pfn_t *mfnp = data; > struct mmap_batch_state *st = state; > + int ret; > > - if (xen_remap_domain_mfn_range(st->vma, st->va & PAGE_MASK, *mfnp, 1, > - st->vma->vm_page_prot, st->domain) < 0) { > - *mfnp |= 0xf0000000U; > + ret = xen_remap_domain_mfn_range(st->vma, st->va & PAGE_MASK, *mfnp, 1, > + st->vma->vm_page_prot, st->domain); > + if (ret < 0) { > + if (ret == -ENOENT) > + *mfnp |= 0x80000000U;As Konrad pointed out separately, constants here would be great. These two constants are defined in <xen>/include/public/domctl.h. In particular, the 0x80..0 constant is defined there but *not* used by the hypervisor. I don''t see a problem in (re)defining it in the linux interface headers shared between libxc and the kernel -- this is where it really belongs. It could be even phased out of domctl.h in 4.2 (unlikely) or 4.3. As for the 0xf0..0 constant I have no strong opinion -- pre-existing problem.> + else > + *mfnp |= 0xf0000000U; > st->err++; > } > st->va += PAGE_SIZE;Libxc expects errno to be ENOENT if at least one map hypercall returned ENOENT. So you need to keep an extra latch in the state, e.g. st->enoents. Or recycle st->err to be a tristate (ok, error, enoent), since I don''t see any actual use of err count. Whichever way, privcmd_ioctl_mmap_batch needs to find out if it needs to return ENOENT. Note this requirement also extends to PRIVCMD_MMAPBATCH_V2. Andres> -- > 1.7.2.5 >
Andres Lagar-Cavilla
2012-Aug-24 01:35 UTC
Re: [PATCH 3/3] xen/privcmd: add PRIVCMD_MMAPBATCH_V2 ioctl
On Aug 23, 2012, at 1:13 PM, David Vrabel wrote:> From: David Vrabel <david.vrabel@citrix.com> > > PRIVCMD_MMAPBATCH_V2 extends PRIVCMD_MMAPBATCH with an additional > field for reporting the error code for every frame that could not be > mapped. libxc prefers PRIVCMD_MMAPBATCH_V2 over PRIVCMD_MMAPBATCH. > > Signed-off-by: David Vrabel <david.vrabel@citrix.com> > --- > drivers/xen/privcmd.c | 54 +++++++++++++++++++++++++++++++++++++++---------- > include/xen/privcmd.h | 10 +++++++++ > 2 files changed, 53 insertions(+), 11 deletions(-) > > diff --git a/drivers/xen/privcmd.c b/drivers/xen/privcmd.c > index f8c1b6d..4f97160 100644 > --- a/drivers/xen/privcmd.c > +++ b/drivers/xen/privcmd.c > @@ -248,7 +248,8 @@ struct mmap_batch_state { > struct vm_area_struct *vma; > int err; > > - xen_pfn_t __user *user; > + xen_pfn_t __user *user_mfn; > + int __user *user_err; > }; > > static int mmap_batch_fn(void *data, void *state) > @@ -275,34 +276,58 @@ static int mmap_return_errors(void *data, void *state) > { > xen_pfn_t *mfnp = data; > struct mmap_batch_state *st = state; > + int ret = 0; > > - return put_user(*mfnp, st->user++); > + if (st->user_err) { > + if ((*mfnp & 0xf0000000U) == 0xf0000000U) > + ret = -ENOENT; > + else if ((*mfnp & 0xf0000000U) == 0x80000000U) > + ret = -EINVAL;Wires crossed. 0x80..0 is ENOENT, 0xf0..0 is EINVAL. Really in need of constants.> + else > + ret = 0; > + return __put_user(ret, st->user_err); > + } else > + return __put_user(*mfnp, st->user_mfn++); > } > > static struct vm_operations_struct privcmd_vm_ops; > > -static long privcmd_ioctl_mmap_batch(void __user *udata) > +static long privcmd_ioctl_mmap_batch(void __user *udata, int version) > { > int ret; > - struct privcmd_mmapbatch m; > + struct privcmd_mmapbatch_v2 m; > struct mm_struct *mm = current->mm; > struct vm_area_struct *vma; > unsigned long nr_pages; > LIST_HEAD(pagelist); > struct mmap_batch_state state; > > + printk("%s(%d)\n", __func__, version);Surely this and other unconditional printk''s to go away in next round… Thanks Andres> + > if (!xen_initial_domain()) > return -EPERM; > > - if (copy_from_user(&m, udata, sizeof(m))) > - return -EFAULT; > + if (version == 1) { > + if (copy_from_user(&m, udata, sizeof(struct privcmd_mmapbatch))) > + return -EFAULT; > + if (!access_ok(VERIFY_WRITE, m.arr, m.num * sizeof(*m.arr))) > + return -EFAULT; > + m.err = NULL; > + } else { > + if (copy_from_user(&m, udata, sizeof(struct privcmd_mmapbatch_v2))) > + return -EFAULT; > + if (!access_ok(VERIFY_READ, m.arr, m.num * sizeof(*m.arr))) > + return -EFAULT; > + if (!access_ok(VERIFY_WRITE, m.err, m.num * (sizeof(*m.err)))) > + return -EFAULT; > + } > > nr_pages = m.num; > if ((m.num <= 0) || (nr_pages > (LONG_MAX >> PAGE_SHIFT))) > return -EINVAL; > > ret = gather_array(&pagelist, m.num, sizeof(xen_pfn_t), > - m.arr); > + (xen_pfn_t *)m.arr); > > if (ret || list_empty(&pagelist)) > goto out; > @@ -331,10 +356,11 @@ static long privcmd_ioctl_mmap_batch(void __user *udata) > up_write(&mm->mmap_sem); > > if (state.err > 0) { > - state.user = m.arr; > + state.user_mfn = (xen_pfn_t *)m.arr; > + state.user_err = m.err; > ret = traverse_pages(m.num, sizeof(xen_pfn_t), > - &pagelist, > - mmap_return_errors, &state); > + &pagelist, > + mmap_return_errors, &state); > } > > out: > @@ -359,7 +385,13 @@ static long privcmd_ioctl(struct file *file, > break; > > case IOCTL_PRIVCMD_MMAPBATCH: > - ret = privcmd_ioctl_mmap_batch(udata); > + ret = privcmd_ioctl_mmap_batch(udata, 1); > + printk("%s() batch ret = %d\n", __func__, ret); > + break; > + > + case IOCTL_PRIVCMD_MMAPBATCH_V2: > + ret = privcmd_ioctl_mmap_batch(udata, 2); > + printk("%s() batch ret = %d\n", __func__, ret); > break; > > default: > diff --git a/include/xen/privcmd.h b/include/xen/privcmd.h > index 17857fb..9fa27c4 100644 > --- a/include/xen/privcmd.h > +++ b/include/xen/privcmd.h > @@ -62,6 +62,14 @@ struct privcmd_mmapbatch { > xen_pfn_t __user *arr; /* array of mfns - top nibble set on err */ > }; > > +struct privcmd_mmapbatch_v2 { > + unsigned int num; /* number of pages to populate */ > + domid_t dom; /* target domain */ > + __u64 addr; /* virtual address */ > + const xen_pfn_t __user *arr; /* array of mfns */ > + int __user *err; /* array of error codes */ > +}; > + > /* > * @cmd: IOCTL_PRIVCMD_HYPERCALL > * @arg: &privcmd_hypercall_t > @@ -73,5 +81,7 @@ struct privcmd_mmapbatch { > _IOC(_IOC_NONE, ''P'', 2, sizeof(struct privcmd_mmap)) > #define IOCTL_PRIVCMD_MMAPBATCH \ > _IOC(_IOC_NONE, ''P'', 3, sizeof(struct privcmd_mmapbatch)) > +#define IOCTL_PRIVCMD_MMAPBATCH_V2 \ > + _IOC(_IOC_NONE, ''P'', 4, sizeof(struct privcmd_mmapbatch_v2)) > > #endif /* __LINUX_PUBLIC_PRIVCMD_H__ */ > -- > 1.7.2.5 >
David Vrabel
2012-Aug-24 11:14 UTC
Re: [PATCH 3/3] xen/privcmd: add PRIVCMD_MMAPBATCH_V2 ioctl
On 23/08/12 20:40, Konrad Rzeszutek Wilk wrote:> On Thu, Aug 23, 2012 at 06:13:46PM +0100, David Vrabel wrote: >> From: David Vrabel <david.vrabel@citrix.com> >> >> PRIVCMD_MMAPBATCH_V2 extends PRIVCMD_MMAPBATCH with an additional >> field for reporting the error code for every frame that could not be >> mapped. libxc prefers PRIVCMD_MMAPBATCH_V2 over PRIVCMD_MMAPBATCH.[...]>> diff --git a/drivers/xen/privcmd.c b/drivers/xen/privcmd.c >> index f8c1b6d..4f97160 100644 >> --- a/drivers/xen/privcmd.c >> +++ b/drivers/xen/privcmd.c >> @@ -275,34 +276,58 @@ static int mmap_return_errors(void *data, void *state) >> { >> xen_pfn_t *mfnp = data; >> struct mmap_batch_state *st = state; >> + int ret = 0; >> >> - return put_user(*mfnp, st->user++); >> + if (st->user_err) { >> + if ((*mfnp & 0xf0000000U) == 0xf0000000U) >> + ret = -ENOENT; >> + else if ((*mfnp & 0xf0000000U) == 0x80000000U) >> + ret = -EINVAL; > > Yikes. Any way those 0xf000.. and 0x8000 can at least be #definedAgreed.>> + else >> + ret = 0; >> + return __put_user(ret, st->user_err); >> + } else >> + return __put_user(*mfnp, st->user_mfn++); >> } >> >> static struct vm_operations_struct privcmd_vm_ops; >> >> -static long privcmd_ioctl_mmap_batch(void __user *udata) >> +static long privcmd_ioctl_mmap_batch(void __user *udata, int version) >> { >> int ret; >> - struct privcmd_mmapbatch m; >> + struct privcmd_mmapbatch_v2 m; >> struct mm_struct *mm = current->mm; >> struct vm_area_struct *vma; >> unsigned long nr_pages; >> LIST_HEAD(pagelist); >> struct mmap_batch_state state; >> >> + printk("%s(%d)\n", __func__, version); >> + > > Hehe.Heh. I didn''t polish up these patches as I wasn''t sure this new interface was useful.>> if (!xen_initial_domain()) >> return -EPERM; >> >> - if (copy_from_user(&m, udata, sizeof(m))) >> - return -EFAULT; >> + if (version == 1) { >> + if (copy_from_user(&m, udata, sizeof(struct privcmd_mmapbatch))) >> + return -EFAULT; >> + if (!access_ok(VERIFY_WRITE, m.arr, m.num * sizeof(*m.arr))) >> + return -EFAULT; > > That is new..We use access_ok() here so we can use the less expensive __get_user() and __put_user() later on.>> + if (copy_from_user(&m, udata, sizeof(struct privcmd_mmapbatch_v2))) >> + return -EFAULT; >> + if (!access_ok(VERIFY_READ, m.arr, m.num * sizeof(*m.arr))) >> + return -EFAULT; >> + if (!access_ok(VERIFY_WRITE, m.err, m.num * (sizeof(*m.err)))) >> + return -EFAULT; > > I think the VERIFY_WRITE can cover both versions?Yes, VERIFY_WRITE is a superset of VERIFY_READ. In v1, m.arr is read/write but in v2, m.arr is const so we use VERIFY_READ so the userspace buffer can reside in a read-only section.>> --- a/include/xen/privcmd.h >> +++ b/include/xen/privcmd.h >> @@ -62,6 +62,14 @@ struct privcmd_mmapbatch { >> xen_pfn_t __user *arr; /* array of mfns - top nibble set on err */ >> }; >> >> +struct privcmd_mmapbatch_v2 { >> + unsigned int num; /* number of pages to populate */ > > unsigend int? Not ''u32''? >> + domid_t dom; /* target domain */ >> + __u64 addr; /* virtual address */ >> + const xen_pfn_t __user *arr; /* array of mfns */ >> + int __user *err; /* array of error codes */ > > int? Not a specific type?It''s an existing interface supported by classic Xen kernels and currently being used by libxc. So while I agree that it''s not the best interface, I don''t think it can be changed. David
Konrad Rzeszutek Wilk
2012-Aug-24 11:41 UTC
Re: [PATCH 3/3] xen/privcmd: add PRIVCMD_MMAPBATCH_V2 ioctl
> >> +struct privcmd_mmapbatch_v2 { > >> + unsigned int num; /* number of pages to populate */ > > > > unsigend int? Not ''u32''? > >> + domid_t dom; /* target domain */ > >> + __u64 addr; /* virtual address */ > >> + const xen_pfn_t __user *arr; /* array of mfns */ > >> + int __user *err; /* array of error codes */ > > > > int? Not a specific type? > > It''s an existing interface supported by classic Xen kernels and > currently being used by libxc. So while I agree that it''s not the best > interface, I don''t think it can be changed.How does it work with a 64-bit dom0 and 32-bit userspace? Is the libxc smart enough to figure out the size of the structure?> > David
Ian Campbell
2012-Aug-24 11:50 UTC
Re: [PATCH 3/3] xen/privcmd: add PRIVCMD_MMAPBATCH_V2 ioctl
On Fri, 2012-08-24 at 12:41 +0100, Konrad Rzeszutek Wilk wrote:> > >> +struct privcmd_mmapbatch_v2 { > > >> + unsigned int num; /* number of pages to populate */ > > > > > > unsigend int? Not ''u32''? > > >> + domid_t dom; /* target domain */ > > >> + __u64 addr; /* virtual address */ > > >> + const xen_pfn_t __user *arr; /* array of mfns */ > > >> + int __user *err; /* array of error codes */ > > > > > > int? Not a specific type? > > > > It''s an existing interface supported by classic Xen kernels and > > currently being used by libxc. So while I agree that it''s not the best > > interface, I don''t think it can be changed. > > How does it work with a 64-bit dom0 and 32-bit userspace? Is the libxc > smart enough to figure out the size of the structure?This already doesn''t work for numerous reasons. The man one being that libxc will make 32 bit hypercalls, but when the kernel forwards them on they look like 64 bit ones. Ian.
David Vrabel
2012-Aug-24 11:58 UTC
Re: [RFC PATCH 0/3] xen/privcmd: support for paged-out frames
On 24/08/12 02:32, Andres Lagar-Cavilla wrote:> On Aug 23, 2012, at 1:13 PM, David Vrabel wrote: > >> This series is a straight forward-port of some functionality from >> classic kernels to support Xen hosts that do paging of guests. >> >> This isn''t functionality the XenServer makes use of so I''ve not tested >> these with paging in use (GridCentric requested that our older kernels >> supported this and I''m just doing the forward port). > > Thanks for this series. Very timely. I may add that we are not the > only consumers of paging. This functionality was first added into > classic kernels by Olaf Hering from Suse (added to cc).Sure.>> I''m not entirely happy about the approach used here because: >> >> 1. It relies on the meaning of the return code of the update_mmu >> hypercall and it assumes the value Xen used for -ENOENT is the same >> the kernel uses. This does not appear to be a formal part of the >> hypercall ABI. >> >> Keir, can you comment on this? > > I see your point. I may add that it''s likely to be more pervasive > than just relying on ENOENT being 12. Which is a fairly safe bet. > >> >> 2. It seems more sensible to have the kernel do the retries instead of >> libxc doing them. The kernel has to have a mechanism for this any way >> (for mapping back/front rings). >> >> 3. The current way of handling paged-out frames by repeatedly retrying >> is a bit lame. Shouldn''t there be some event that the guest waiting >> for the frame can wait on instead? By moving the retry mechanism into >> the kernel we can change this without impacting the ABI to userspace. > > Lame is an interesting choice of language :)My embedded background makes me frown at anything that polls -- it''s generally bad for power consumption.> I am not a huge fan of libxc retry, but we''ve been pounding it quite > hard for a while and it works -- and, importantly, it yields the > scheduler :) > > While kernel retry may benefit from hypothetical code reuse, > "Shouldn''t there be some event that the guest waiting for the frame can > wait on instead?" will need to become concrete to start a real discussion.In kernel retries don''t require the event. Using the event is something to consider in the longer term.> For better or worse, since xen-4.1 (!) libxc will do the right thing > if fed the appropriate errno.At a minimum I think we need to fix the existing interface to have the behavior libxc expects. Is PRIVCMD_MMAPBATCH_V2 actually required? It doesn''t seem to do much more than the V1 interface. Perhaps the fix in patches #1 and #2 sufficient to make libxc work? David
David Vrabel
2012-Aug-24 12:00 UTC
Re: [PATCH 3/3] xen/privcmd: add PRIVCMD_MMAPBATCH_V2 ioctl
On 24/08/12 12:41, Konrad Rzeszutek Wilk wrote:>>>> +struct privcmd_mmapbatch_v2 { >>>> + unsigned int num; /* number of pages to populate */ >>> >>> unsigend int? Not ''u32''? >>>> + domid_t dom; /* target domain */ >>>> + __u64 addr; /* virtual address */ >>>> + const xen_pfn_t __user *arr; /* array of mfns */ >>>> + int __user *err; /* array of error codes */ >>> >>> int? Not a specific type? >> >> It''s an existing interface supported by classic Xen kernels and >> currently being used by libxc. So while I agree that it''s not the best >> interface, I don''t think it can be changed.It''s also the same as struct privcmd_mmapbatch except for the extra ''err'' field and ''arr'' being const.> How does it work with a 64-bit dom0 and 32-bit userspace? Is the libxc > smart enough to figure out the size of the structure?privcmd doesn''t support compat ioctls because there there is nothing doing the translation of the hypercalls from the 32-bit to 64-bit ABI -- the hypervisor won''t do it as the hypercalls are called from the 64-bit kernel. 64 bit Xen, 64 bit dom0, 32 bit tools has never worked for this reason. I think Ian Campbell had some hacky patches for this but he may not want to admit to them. ;) David
Ian Campbell
2012-Aug-24 12:14 UTC
Re: [PATCH 3/3] xen/privcmd: add PRIVCMD_MMAPBATCH_V2 ioctl
On Fri, 2012-08-24 at 13:00 +0100, David Vrabel wrote:> 64 bit Xen, 64 bit dom0, 32 bit tools has never worked for this reason. > I think Ian Campbell had some hacky patches for this but he may not > want to admit to them. ;)They were pretty nasty. I think I made a 32-bit hypercall entry path available to 64 bit kernels (via int 0x8?) so privcmd used that for 32 bit tasks. Then there was some hacking to make the hypercall compat arg translation area work for 64 bit guests (I don''t remember what I did here, I seem to recall it wasn''t pretty) I eventually fell at the hurdle of getting 64 bit blktap kernel module to work with 32 bit tapdisk process (since the ring layouts are different). Not an impossible problem but at this poiint I couldn''t be bothered any more.> > David
Ian Campbell
2012-Aug-24 12:14 UTC
Re: [RFC PATCH 0/3] xen/privcmd: support for paged-out frames
On Fri, 2012-08-24 at 12:58 +0100, David Vrabel wrote:> Is PRIVCMD_MMAPBATCH_V2 actually required? It doesn''t seem to do much > more than the V1 interface. Perhaps the fix in patches #1 and #2 > sufficient to make libxc work?The V1 interface has some hideous misfeature wrt error reporting iirc. It doesn''t report the actual per-frame error code, just the "an error" signal through setting the top nibble (already a nasty interface!), IIRC Aha: http://xenbits.xen.org/hg/linux-2.6.18-xen.hg/rev/6d6c3dd995c0 It was knackered on 64 bit (actually set bits 28-31...)... Ian.
Andres Lagar-Cavilla
2012-Aug-24 15:06 UTC
Re: [RFC PATCH 0/3] xen/privcmd: support for paged-out frames
On Aug 24, 2012, at 8:14 AM, Ian Campbell wrote:> On Fri, 2012-08-24 at 12:58 +0100, David Vrabel wrote: >> Is PRIVCMD_MMAPBATCH_V2 actually required? It doesn''t seem to do much >> more than the V1 interface. Perhaps the fix in patches #1 and #2 >> sufficient to make libxc work? > > The V1 interface has some hideous misfeature wrt error reporting iirc. > > It doesn''t report the actual per-frame error code, just the "an error" > signal through setting the top nibble (already a nasty interface!), IIRC > > Aha: http://xenbits.xen.org/hg/linux-2.6.18-xen.hg/rev/6d6c3dd995c0IIuc, David''s approach in these patches was hypervisor rc -> encode into mfn field -> decode into error rval if available. The changeset Ian references shows a different approach is needed. Perhaps: 1. Provide local err array in all cases 2. Always store hypervisor rc into err array 3. if version 2, copy into user-space. If version 1, encode into user-space mfn Andres> > It was knackered on 64 bit (actually set bits 28-31...)... > > Ian. >
Bastian Blank
2012-Aug-24 18:01 UTC
Re: [PATCH 2/3] xen/privcmd: report paged-out frames in PRIVCMD_MMAPBATCH ioctl
On Thu, Aug 23, 2012 at 06:13:45PM +0100, David Vrabel wrote:> + if (ret < 0) { > + if (ret == -ENOENT) > + *mfnp |= 0x80000000U; > + else > + *mfnp |= 0xf0000000U;Please don''t use magic constants. Bastian -- Respect is a rational process -- McCoy, "The Galileo Seven", stardate 2822.3