Hi All,

This is the test report of xen-unstable tree on some Intel platforms. We found 1 new bug about live migration.

Version Info:
================================================================
xen changeset: 26323:64b36dde26bc
Dom0: Linux 3.7.1
================================================================

New issue(1)
=============
1. sometimes live migration failed and reported call trace in dom0
  http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1841

Fixed issue(0)
=============

Old issues(11)
=============
1. [ACPI] Dom0 can't resume from S3 sleep
  http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1707
2. [XL] "xl vcpu-set" causes dom0 crash or panic
  http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1730
  -- Dom0 will not panic with latest linux.git tree 3.8-rc..
  -- We'll have more detailed update on that bugzilla.
3. [VT-D] VNC console broken after detach NIC from guest
  http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1736
4. Sometimes Xen panic on ia32pae Sandybridge when restore guest
  http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1747
5. 'xl vcpu-set' can't decrease the vCPU number of a HVM guest
  http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1822
6. Dom0 cannot be shutdown before PCI device detachment from guest
  http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1826
7. xl pci-list shows one PCI device (PF or VF) could be assigned to two different guests
  http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1834
8. [upstream qemu] Guest free memory with upstream qemu is 14MB lower than that with qemu-xen-unstable.git
  http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1836
9. [upstream qemu] 'maxvcpus=NUM' item is not supported in upstream QEMU
  http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1837
10. [upstream qemu] Guest console hangs after save/restore or live-migration when setting 'hpet=0' in guest config file
  http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1838
11. [upstream qemu] 'xen_platform_pci=0' setting cannot make the guest use emulated PCI devices by default
  http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1839

Best Regards,
    Yongjie (Jay)
>>> On 10.01.13 at 08:51, "Ren, Yongjie" <yongjie.ren@intel.com> wrote:
> New issue(1)
> =============
> 1. sometimes live migration failed and reported call trace in dom0
> http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1841

For the failed allocation, the only obvious candidate appears to be

	err_array = kcalloc(m.num, sizeof(int), GFP_KERNEL);

which quite obviously can be of (almost) arbitrary size, because

	nr_pages = m.num;
	if ((m.num <= 0) || (nr_pages > (LONG_MAX >> PAGE_SHIFT)))
		return -EINVAL;

really only checks for completely insane values.

This got introduced by Andres' "xen/privcmd: add PRIVCMD_MMAPBATCH_V2 ioctl" and is becoming worse with Mukesh's recent "xen: privcmd: support autotranslated physmap guests", which added another similar (twice as large) allocation in alloc_empty_pages().

I'd like to note that the forward-ported kernels don't appear to have a similar issue, as they never allocate more than a page at a time. Was that code consulted at all when that addition was done?

Jan
On Jan 10, 2013, at 3:57 AM, "Jan Beulich" <JBeulich@suse.com> wrote:

>>>> On 10.01.13 at 08:51, "Ren, Yongjie" <yongjie.ren@intel.com> wrote:
>> New issue(1)
>> =============
>> 1. sometimes live migration failed and reported call trace in dom0
>> http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1841
>
> For the failed allocation, the only obvious candidate appears to be
>
> 	err_array = kcalloc(m.num, sizeof(int), GFP_KERNEL);
>
> which quite obviously can be of (almost) arbitrary size because
>
> 	nr_pages = m.num;
> 	if ((m.num <= 0) || (nr_pages > (LONG_MAX >> PAGE_SHIFT)))
> 		return -EINVAL;
>
> really only checks for completely insane values.
>
> This got introduced by Andres' "xen/privcmd: add PRIVCMD_MMAPBATCH_V2
> ioctl" and is becoming worse with Mukesh's recent "xen: privcmd:
> support autotranslated physmap guests", which added another
> similar (twice as large) allocation in alloc_empty_pages().

Perhaps the err_array in this case, since alloc_empty_pages only happens for auto-translated dom0s.

I'm not familiar with whether libxl changes (or is even capable of changing) the parameters of the migration code. But right now in libxc, mapping is done in MAX_BATCH_SIZE batches, which are of size 1024. So we are talking about 1024 ints, which is *one* page.

So is the kernel really incapable of allocating one measly page?

This leads me to think that it might be gather_array, but that one would allocate a grand total of two pages.

In any case, both functions allocate an arbitrary number of pages, and that is the fundamental problem.

What is the approach in the forward-ported kernel wrt gather_array?

The cleanest alternative I can think of is to refactor the body of mmap_batch to allocate one page for each array, and iteratively call traverse_pages, recycling the local arrays and advancing the pointers into the source user-space arrays.

Having said that, that would always allocate two pages, while the code right now allocates at most three (for libxc-driven migrations). So maybe the problem is elsewhere...

Thanks,
Andres

> I'd like to note that the forward-ported kernels don't appear to
> have a similar issue, as they never allocate more than a page at
> a time. Was that code consulted at all when that addition was
> done?
>
> Jan
>
On 10/01/13 17:10, Andres Lagar-Cavilla wrote:
> On Jan 10, 2013, at 3:57 AM, "Jan Beulich" <JBeulich@suse.com> wrote:
>
>>>>> On 10.01.13 at 08:51, "Ren, Yongjie" <yongjie.ren@intel.com> wrote:
>>> New issue(1)
>>> =============
>>> 1. sometimes live migration failed and reported call trace in dom0
>>> http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1841
>> For the failed allocation, the only obvious candidate appears to be
>>
>> 	err_array = kcalloc(m.num, sizeof(int), GFP_KERNEL);
>>
>> which quite obviously can be of (almost) arbitrary size because
>>
>> 	nr_pages = m.num;
>> 	if ((m.num <= 0) || (nr_pages > (LONG_MAX >> PAGE_SHIFT)))
>> 		return -EINVAL;
>>
>> really only checks for completely insane values.
>>
>> This got introduced by Andres' "xen/privcmd: add PRIVCMD_MMAPBATCH_V2
>> ioctl" and is becoming worse with Mukesh's recent "xen: privcmd:
>> support autotranslated physmap guests", which added another
>> similar (twice as large) allocation in alloc_empty_pages().
> Perhaps the err_array in this case, since alloc_empty_pages only happens for auto-translated dom0s.
>
> Not familiar with whether libxl changes (or is even capable of changing) parameters of the migration code. But right now in libxc, mapping is done in MAX_BATCH_SIZE batches, which are of size 1024. So we are talking about 1024 ints, which is *one* page.
>
> So is the kernel really incapable of allocating one measly page?
>
> This leads me to think that it might be gather_array, but that one would allocate a grand total of two pages.
>
> In any case, both functions allocate an arbitrary number of pages, and that is the fundamental problem.
>
> What is the approach in the forward-ported kernel wrt gather_array?
>
> The cleanest alternative I can think of is to refactor the body of mmap_batch to allocate one page for each array, and iteratively call traverse_pages, recycling the local arrays and advancing the pointers into the source user-space arrays.
>
> Having said that, that would always allocate two pages, while the code right now allocates at most three (for libxc-driven migrations). So maybe the problem is elsewhere...
>
> Thanks,
> Andres

Whilst this may not add much to the discussion: while working on the improved privcmd.c, I have been using 3.7.0rc5 and 3.8.0. Both of these seem to work fine for migration using the libxc interface (since I've been using the Xenserver build, the migration is not done through libxl).

I have not had a single failure to allocate pages during migration - I have a script that loops around migrating the guest to the same host as quickly as it can, and I have used guests of up to 64GB (and left that running overnight - a migration takes about 3 minutes, so a night gives several hundred iterations).

So I'm wondering what is different between my setup and this one...

-- Mats

>
>> I'd like to note that the forward-ported kernels don't appear to
>> have a similar issue, as they never allocate more than a page at
>> a time. Was that code consulted at all when that addition was
>> done?
>>
>> Jan
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel
>
On 10/01/13 08:57, Jan Beulich wrote:
>>>> On 10.01.13 at 08:51, "Ren, Yongjie" <yongjie.ren@intel.com> wrote:
>> New issue(1)
>> =============
>> 1. sometimes live migration failed and reported call trace in dom0
>> http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1841
>
> For the failed allocation, the only obvious candidate appears to be
>
> 	err_array = kcalloc(m.num, sizeof(int), GFP_KERNEL);
>
> which quite obviously can be of (almost) arbitrary size because
>
> 	nr_pages = m.num;
> 	if ((m.num <= 0) || (nr_pages > (LONG_MAX >> PAGE_SHIFT)))
> 		return -EINVAL;
>
> really only checks for completely insane values.
>
> This got introduced by Andres' "xen/privcmd: add PRIVCMD_MMAPBATCH_V2
> ioctl" and is becoming worse with Mukesh's recent "xen: privcmd:
> support autotranslated physmap guests", which added another
> similar (twice as large) allocation in alloc_empty_pages().
>
> I'd like to note that the forward-ported kernels don't appear to
> have a similar issue, as they never allocate more than a page at
> a time. Was that code consulted at all when that addition was
> done?

I did highlight this at the time[1].

See [2] for how I avoided this allocation.

David

[1] http://lists.xen.org/archives/html/xen-devel/2012-08/msg02208.html
[2] http://lists.xen.org/archives/html/xen-devel/2012-08/msg02092.html
>>> On 10.01.13 at 18:10, Andres Lagar-Cavilla <andreslc@gridcentric.ca> wrote:
> On Jan 10, 2013, at 3:57 AM, "Jan Beulich" <JBeulich@suse.com> wrote:
>
>>>>> On 10.01.13 at 08:51, "Ren, Yongjie" <yongjie.ren@intel.com> wrote:
>>> New issue(1)
>>> =============
>>> 1. sometimes live migration failed and reported call trace in dom0
>>> http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1841
>>
>> For the failed allocation, the only obvious candidate appears to be
>>
>> 	err_array = kcalloc(m.num, sizeof(int), GFP_KERNEL);
>>
>> which quite obviously can be of (almost) arbitrary size because
>>
>> 	nr_pages = m.num;
>> 	if ((m.num <= 0) || (nr_pages > (LONG_MAX >> PAGE_SHIFT)))
>> 		return -EINVAL;
>>
>> really only checks for completely insane values.
>>
>> This got introduced by Andres' "xen/privcmd: add PRIVCMD_MMAPBATCH_V2
>> ioctl" and is becoming worse with Mukesh's recent "xen: privcmd:
>> support autotranslated physmap guests", which added another
>> similar (twice as large) allocation in alloc_empty_pages().
>
> Perhaps the err_array in this case, since alloc_empty_pages only happens for
> auto-translated dom0s.
>
> Not familiar with whether libxl changes (or is even capable of changing)
> parameters of the migration code. But right now in libxc, mapping is done in
> MAX_BATCH_SIZE batches, which are of size 1024. So we are talking about 1024
> ints, which is *one* page.
>
> So is the kernel really incapable of allocating one measly page?
>
> This leads me to think that it might be gather_array, but that one would
> allocate a grand total of two pages.

The log indicated a failed order-4 allocation, though. So maybe not that allocation after all (but the basically unbounded size here is a problem in any case).

> In any case, both functions allocate an arbitrary number of pages, and that is
> the fundamental problem.
>
> What is the approach in the forward-ported kernel wrt gather_array?

There's no gather_array there. It simply sets up things one page at a time. (And you can always go and check linux-2.6.18-xen.hg).

Jan
On Jan 10, 2013, at 2:23 PM, David Vrabel <david.vrabel@citrix.com> wrote:

> On 10/01/13 08:57, Jan Beulich wrote:
>>>>> On 10.01.13 at 08:51, "Ren, Yongjie" <yongjie.ren@intel.com> wrote:
>>> New issue(1)
>>> =============
>>> 1. sometimes live migration failed and reported call trace in dom0
>>> http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1841
>>
>> For the failed allocation, the only obvious candidate appears to be
>>
>> 	err_array = kcalloc(m.num, sizeof(int), GFP_KERNEL);
>>
>> which quite obviously can be of (almost) arbitrary size because
>>
>> 	nr_pages = m.num;
>> 	if ((m.num <= 0) || (nr_pages > (LONG_MAX >> PAGE_SHIFT)))
>> 		return -EINVAL;
>>
>> really only checks for completely insane values.
>>
>> This got introduced by Andres' "xen/privcmd: add PRIVCMD_MMAPBATCH_V2
>> ioctl" and is becoming worse with Mukesh's recent "xen: privcmd:
>> support autotranslated physmap guests", which added another
>> similar (twice as large) allocation in alloc_empty_pages().
>>
>> I'd like to note that the forward-ported kernels don't appear to
>> have a similar issue, as they never allocate more than a page at
>> a time. Was that code consulted at all when that addition was
>> done?
>
> I did highlight this at the time[1].
>
> See [2] for how I avoided this allocation.

Well, you should have pushed more forcefully back then! I remember convincing you otherwise ;)

In any case, see the follow-up to Jan.

Andres

>
> David
>
> [1] http://lists.xen.org/archives/html/xen-devel/2012-08/msg02208.html
> [2] http://lists.xen.org/archives/html/xen-devel/2012-08/msg02092.html
On Jan 11, 2013, at 3:34 AM, "Jan Beulich" <JBeulich@suse.com> wrote:

>>>> On 10.01.13 at 18:10, Andres Lagar-Cavilla <andreslc@gridcentric.ca> wrote:
>> On Jan 10, 2013, at 3:57 AM, "Jan Beulich" <JBeulich@suse.com> wrote:
>>
>>>>>> On 10.01.13 at 08:51, "Ren, Yongjie" <yongjie.ren@intel.com> wrote:
>>>> New issue(1)
>>>> =============
>>>> 1. sometimes live migration failed and reported call trace in dom0
>>>> http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1841
>>>
>>> For the failed allocation, the only obvious candidate appears to be
>>>
>>> 	err_array = kcalloc(m.num, sizeof(int), GFP_KERNEL);
>>>
>>> which quite obviously can be of (almost) arbitrary size because
>>>
>>> 	nr_pages = m.num;
>>> 	if ((m.num <= 0) || (nr_pages > (LONG_MAX >> PAGE_SHIFT)))
>>> 		return -EINVAL;
>>>
>>> really only checks for completely insane values.
>>>
>>> This got introduced by Andres' "xen/privcmd: add PRIVCMD_MMAPBATCH_V2
>>> ioctl" and is becoming worse with Mukesh's recent "xen: privcmd:
>>> support autotranslated physmap guests", which added another
>>> similar (twice as large) allocation in alloc_empty_pages().
>>
>> Perhaps the err_array in this case, since alloc_empty_pages only happens for
>> auto-translated dom0s.
>>
>> Not familiar with whether libxl changes (or is even capable of changing)
>> parameters of the migration code. But right now in libxc, mapping is done in
>> MAX_BATCH_SIZE batches, which are of size 1024. So we are talking about 1024
>> ints, which is *one* page.
>>
>> So is the kernel really incapable of allocating one measly page?
>>
>> This leads me to think that it might be gather_array, but that one would
>> allocate a grand total of two pages.
>
> The log indicated a failed order-4 allocation though. So maybe not
> that allocation after all (but the basically unbounded size here is a
> problem in any case).

Unless the batch size somehow got changed to 16384, this most definitely is not err_array. It would be good to ascertain that.

I do see how gather_array avoids allocations larger than order 0. Let me look into cooking a similar solution for err_array.

Andres

>
>> In any case, both functions allocate an arbitrary number of pages, and that is
>> the fundamental problem.
>>
>> What is the approach in the forward-ported kernel wrt gather_array?
>
> There's no gather_array there. It simply sets up things one page
> at a time. (And you can always go and check linux-2.6.18-xen.hg).
>
> Jan
>
On Jan 11, 2013, at 3:34 AM, Jan Beulich <jbeulich@suse.com> wrote:

>>>> On 10.01.13 at 18:10, Andres Lagar-Cavilla <andreslc@gridcentric.ca> wrote:
>> On Jan 10, 2013, at 3:57 AM, "Jan Beulich" <JBeulich@suse.com> wrote:
>>
>>>>>> On 10.01.13 at 08:51, "Ren, Yongjie" <yongjie.ren@intel.com> wrote:
>>>> New issue(1)
>>>> =============
>>>> 1. sometimes live migration failed and reported call trace in dom0
>>>> http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1841
>>>
>>> For the failed allocation, the only obvious candidate appears to be
>>>
>>> 	err_array = kcalloc(m.num, sizeof(int), GFP_KERNEL);
>>>
>>> which quite obviously can be of (almost) arbitrary size because
>>>
>>> 	nr_pages = m.num;
>>> 	if ((m.num <= 0) || (nr_pages > (LONG_MAX >> PAGE_SHIFT)))
>>> 		return -EINVAL;
>>>
>>> really only checks for completely insane values.
>>>
>>> This got introduced by Andres' "xen/privcmd: add PRIVCMD_MMAPBATCH_V2
>>> ioctl" and is becoming worse with Mukesh's recent "xen: privcmd:
>>> support autotranslated physmap guests", which added another
>>> similar (twice as large) allocation in alloc_empty_pages().
>>
>> Perhaps the err_array in this case, since alloc_empty_pages only happens for
>> auto-translated dom0s.
>>
>> Not familiar with whether libxl changes (or is even capable of changing)
>> parameters of the migration code. But right now in libxc, mapping is done in
>> MAX_BATCH_SIZE batches, which are of size 1024. So we are talking about 1024
>> ints, which is *one* page.
>>
>> So is the kernel really incapable of allocating one measly page?
>>
>> This leads me to think that it might be gather_array, but that one would
>> allocate a grand total of two pages.
>
> The log indicated a failed order-4 allocation though. So maybe not
> that allocation after all (but the basically unbounded size here is a
> problem in any case).

In my mind this casts serious doubt on whether err_array is the culprit of this test result. I would look into other variables.
Having said that, a kcalloc of a potentially multi-page-sized array is a problem.

Below you'll find pasted an RFC patch to fix this. I've expanded the cc line to add Mats Peterson, who is also looking into some improvements to privcmd (and IanC for general feedback).

The RFC patch cuts down code overall and cleans up the logic too. I did change the behavior wrt the classic implementations when it comes to handling errors & EFAULT. Instead of doing all the mapping work and then copying back to user space, I copy back each individual mapping error as soon as it arises, and short-circuit and quit the whole operation as soon as the first EFAULT arises.

Short-circuiting on the first EFAULT is the right thing to do IMHO. There is no point in continuing to work if we can't tell the caller about the result. Further, libxc will just munmap the vma and undo all the work upon receiving EFAULT, so any further work is wasted work.

Please advise, thanks
Andres

diff --git a/drivers/xen/privcmd.c b/drivers/xen/privcmd.c
index 3421f0d..9433396 100644
--- a/drivers/xen/privcmd.c
+++ b/drivers/xen/privcmd.c
@@ -254,18 +254,16 @@ struct mmap_batch_state {
 	unsigned long va;
 	struct vm_area_struct *vma;
 	int index;
-	/* A tristate:
-	 * 0 for no errors
-	 * 1 if at least one error has happened (and no
-	 *    -ENOENT errors have happened)
-	 * -ENOENT if at least 1 -ENOENT has happened.
-	 */
-	int global_error;
+	/* Whether at least one enoent has happened. */
+	int enoent;
 	/* An array for individual errors */
 	int *err;

-	/* User-space mfn array to store errors in the second pass for V1. */
+	int version;
+	/* User-space mfn array to store errors for V1. */
 	xen_pfn_t __user *user_mfn;
+	/* User space int array to store errors for V2. */
+	int __user *user_err;
 };

 /* auto translated dom0 note: if domU being created is PV, then mfn is
@@ -287,40 +285,35 @@ static int mmap_batch_fn(void *data, void *state)
 					  st->vma->vm_page_prot, st->domain,
 					  &cur_page);

-	/* Store error code for second pass. */
-	*(st->err++) = ret;
-
-	/* And see if it affects the global_error. */
-	if (ret < 0) {
-		if (ret == -ENOENT)
-			st->global_error = -ENOENT;
-		else {
-			/* Record that at least one error has happened. */
-			if (st->global_error == 0)
-				st->global_error = 1;
+	/* See if error affects the global enoent flag. */
+	if (ret == -ENOENT) {
+		st->enoent = 1;
+	}
+
+	/* And, if error, store in user space (version dependent). */
+	if (ret) {
+		int efault = 0;
+		if (st->version == 1) {
+			xen_pfn_t mfn_err = *mfnp;
+			mfn_err |= (ret == -ENOENT) ?
+				PRIVCMD_MMAPBATCH_PAGED_ERROR :
+				PRIVCMD_MMAPBATCH_MFN_ERROR;
+			efault = __put_user(mfn_err, st->user_mfn++);
+		} else { /* st->version == 2 */
+			efault = __put_user(ret, st->user_err++);
 		}
+		/* If copy to user failed, short circuit now. */
+		if (efault)
+			return efault;
+	} else {
+		st->user_mfn++;
+		st->user_err++;
 	}
-	st->va += PAGE_SIZE;

+	st->va += PAGE_SIZE;
 	return 0;
 }

-static int mmap_return_errors_v1(void *data, void *state)
-{
-	xen_pfn_t *mfnp = data;
-	struct mmap_batch_state *st = state;
-	int err = *(st->err++);
-
-	/*
-	 * V1 encodes the error codes in the 32bit top nibble of the
-	 * mfn (with its known limitations vis-a-vis 64 bit callers).
-	 */
-	*mfnp |= (err == -ENOENT) ?
-		PRIVCMD_MMAPBATCH_PAGED_ERROR :
-		PRIVCMD_MMAPBATCH_MFN_ERROR;
-	return __put_user(*mfnp, st->user_mfn++);
-}
-
 /* Allocate pfns that are then mapped with gmfns from foreign domid. Update
  * the vma with the page info to use later.
  * Returns: 0 if success, otherwise -errno
@@ -357,7 +350,6 @@ static long privcmd_ioctl_mmap_batch(void __user *udata, int version)
 	struct vm_area_struct *vma;
 	unsigned long nr_pages;
 	LIST_HEAD(pagelist);
-	int *err_array = NULL;
 	struct mmap_batch_state state;

 	if (!xen_initial_domain())
@@ -396,10 +388,12 @@ static long privcmd_ioctl_mmap_batch(void __user *udata, int version)
 		goto out;
 	}

-	err_array = kcalloc(m.num, sizeof(int), GFP_KERNEL);
-	if (err_array == NULL) {
-		ret = -ENOMEM;
-		goto out;
+	/* Zero the V2 array of errors, so we only write return codes on error. */
+	if (version == 2) {
+		if (clear_user(m.err, sizeof(int) * m.num)) {
+			ret = -EFAULT;
+			goto out;
+		}
 	}

 	down_write(&mm->mmap_sem);
@@ -426,38 +420,22 @@ static long privcmd_ioctl_mmap_batch(void __user *udata, int version)
 	state.vma = vma;
 	state.va = m.addr;
 	state.index = 0;
-	state.global_error = 0;
-	state.err = err_array;
+	state.enoent = 0;
+	state.user_mfn = (xen_pfn_t *)m.arr;
+	state.user_err = m.err;
+	state.version = version;

-	/* mmap_batch_fn guarantees ret == 0 */
-	BUG_ON(traverse_pages(m.num, sizeof(xen_pfn_t),
-			      &pagelist, mmap_batch_fn, &state));
+	ret = traverse_pages(m.num, sizeof(xen_pfn_t),
+			     &pagelist, mmap_batch_fn, &state);

 	up_write(&mm->mmap_sem);

-	if (version == 1) {
-		if (state.global_error) {
-			/* Write back errors in second pass. */
-			state.user_mfn = (xen_pfn_t *)m.arr;
-			state.err = err_array;
-			ret = traverse_pages(m.num, sizeof(xen_pfn_t),
-					     &pagelist, mmap_return_errors_v1, &state);
-		} else
-			ret = 0;
-
-	} else if (version == 2) {
-		ret = __copy_to_user(m.err, err_array, m.num * sizeof(int));
-		if (ret)
-			ret = -EFAULT;
-	}
-
 	/* If we have not had any EFAULT-like global errors then set the global
 	 * error to -ENOENT if necessary. */
-	if ((ret == 0) && (state.global_error == -ENOENT))
+	if ((ret == 0) && (state.enoent))
 		ret = -ENOENT;

 out:
-	kfree(err_array);
 	free_page_list(&pagelist);

 	return ret;

>
>> In any case, both functions allocate an arbitrary number of pages, and that is
>> the fundamental problem.
>>
>> What is the approach in the forward-ported kernel wrt gather_array?
>
> There's no gather_array there. It simply sets up things one page
> at a time. (And you can always go and check linux-2.6.18-xen.hg).
>
> Jan
>
On Jan 13, 2013, at 11:29 PM, Andres Lagar-Cavilla <andreslc@gridcentric.ca> wrote:

> On Jan 11, 2013, at 3:34 AM, Jan Beulich <jbeulich@suse.com> wrote:
>
>>>>> On 10.01.13 at 18:10, Andres Lagar-Cavilla <andreslc@gridcentric.ca> wrote:
>>> On Jan 10, 2013, at 3:57 AM, "Jan Beulich" <JBeulich@suse.com> wrote:
>>>
>>>>>>> On 10.01.13 at 08:51, "Ren, Yongjie" <yongjie.ren@intel.com> wrote:
>>>>> New issue(1)
>>>>> =============
>>>>> 1. sometimes live migration failed and reported call trace in dom0
>>>>> http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1841
>>>>
>>>> For the failed allocation, the only obvious candidate appears to be
>>>>
>>>> 	err_array = kcalloc(m.num, sizeof(int), GFP_KERNEL);
>>>>
>>>> which quite obviously can be of (almost) arbitrary size because
>>>>
>>>> 	nr_pages = m.num;
>>>> 	if ((m.num <= 0) || (nr_pages > (LONG_MAX >> PAGE_SHIFT)))
>>>> 		return -EINVAL;
>>>>
>>>> really only checks for completely insane values.
>>>>
>>>> This got introduced by Andres' "xen/privcmd: add PRIVCMD_MMAPBATCH_V2
>>>> ioctl" and is becoming worse with Mukesh's recent "xen: privcmd:
>>>> support autotranslated physmap guests", which added another
>>>> similar (twice as large) allocation in alloc_empty_pages().
>>>
>>> Perhaps the err_array in this case, since alloc_empty_pages only happens for
>>> auto-translated dom0s.
>>>
>>> Not familiar with whether libxl changes (or is even capable of changing)
>>> parameters of the migration code. But right now in libxc, mapping is done in
>>> MAX_BATCH_SIZE batches, which are of size 1024. So we are talking about 1024
>>> ints, which is *one* page.
>>>
>>> So is the kernel really incapable of allocating one measly page?
>>>
>>> This leads me to think that it might be gather_array, but that one would
>>> allocate a grand total of two pages.
>>
>> The log indicated a failed order-4 allocation though. So maybe not
>> that allocation after all (but the basically unbounded size here is a
>> problem in any case).
>
> In my mind this casts serious doubt on whether err_array is the culprit of this test result. I would look into other variables.
>
> Having said that, a kcalloc of a potentially multi-page-sized array is a problem.
>
> Below you'll find pasted an RFC patch to fix this. I've expanded the cc line to add Mats Peterson, who is also looking into some improvements to privcmd (and IanC for general feedback).
>
> The RFC patch cuts down code overall and cleans up the logic too. I did change the behavior wrt the classic implementations when it comes to handling errors & EFAULT. Instead of doing all the mapping work and then copying back to user space, I copy back each individual mapping error as soon as it arises, and short-circuit and quit the whole operation as soon as the first EFAULT arises.
>
> Short-circuiting on the first EFAULT is the right thing to do IMHO. There is no point in continuing to work if we can't tell the caller about the result. Further, libxc will just munmap the vma and undo all the work upon receiving EFAULT, so any further work is wasted work.

Copying back one error at a time is what the current code does for the V1 ioctl in any case.

I've noticed a glaring logic error in the V1 case: if there is any mapping error, a second pass to copy back errors is done, and that second pass always sets the top nibble of each mfn to some error, whether one has happened for that mapping or not. Ugh. The patch I just posted fixes this. Some version of a fix has to go in soon.

Andres

> Please advise, thanks
> Andres
>
> diff --git a/drivers/xen/privcmd.c b/drivers/xen/privcmd.c
> index 3421f0d..9433396 100644
> --- a/drivers/xen/privcmd.c
> +++ b/drivers/xen/privcmd.c
> @@ -254,18 +254,16 @@ struct mmap_batch_state {
>  	unsigned long va;
>  	struct vm_area_struct *vma;
>  	int index;
> -	/* A tristate:
> -	 * 0 for no errors
> -	 * 1 if at least one error has happened (and no
> -	 *    -ENOENT errors have happened)
> -	 * -ENOENT if at least 1 -ENOENT has happened.
> -	 */
> -	int global_error;
> +	/* Whether at least one enoent has happened. */
> +	int enoent;
>  	/* An array for individual errors */
>  	int *err;
> 
> -	/* User-space mfn array to store errors in the second pass for V1. */
> +	int version;
> +	/* User-space mfn array to store errors for V1. */
>  	xen_pfn_t __user *user_mfn;
> +	/* User space int array to store errors for V2. */
> +	int __user *user_err;
>  };
> 
>  /* auto translated dom0 note: if domU being created is PV, then mfn is
> @@ -287,40 +285,35 @@ static int mmap_batch_fn(void *data, void *state)
>  					  st->vma->vm_page_prot, st->domain,
>  					  &cur_page);
> 
> -	/* Store error code for second pass. */
> -	*(st->err++) = ret;
> -
> -	/* And see if it affects the global_error. */
> -	if (ret < 0) {
> -		if (ret == -ENOENT)
> -			st->global_error = -ENOENT;
> -		else {
> -			/* Record that at least one error has happened. */
> -			if (st->global_error == 0)
> -				st->global_error = 1;
> +	/* See if error affects the global enoent flag. */
> +	if (ret == -ENOENT) {
> +		st->enoent = 1;
> +	}
> +
> +	/* And, if error, store in user space (version dependent). */
> +	if (ret) {
> +		int efault = 0;
> +		if (st->version == 1) {
> +			xen_pfn_t mfn_err = *mfnp;
> +			mfn_err |= (ret == -ENOENT) ?
> +				PRIVCMD_MMAPBATCH_PAGED_ERROR :
> +				PRIVCMD_MMAPBATCH_MFN_ERROR;
> +			efault = __put_user(mfn_err, st->user_mfn++);
> +		} else { /* st->version == 2 */
> +			efault = __put_user(ret, st->user_err++);
>  		}
> +		/* If copy to user failed, short circuit now. */
> +		if (efault)
> +			return efault;
> +	} else {
> +		st->user_mfn++;
> +		st->user_err++;
>  	}
> -	st->va += PAGE_SIZE;
> 
> +	st->va += PAGE_SIZE;
>  	return 0;
>  }
> 
> -static int mmap_return_errors_v1(void *data, void *state)
> -{
> -	xen_pfn_t *mfnp = data;
> -	struct mmap_batch_state *st = state;
> -	int err = *(st->err++);
> -
> -	/*
> -	 * V1 encodes the error codes in the 32bit top nibble of the
> -	 * mfn (with its known limitations vis-a-vis 64 bit callers).
> -	 */
> -	*mfnp |= (err == -ENOENT) ?
> -		PRIVCMD_MMAPBATCH_PAGED_ERROR :
> -		PRIVCMD_MMAPBATCH_MFN_ERROR;
> -	return __put_user(*mfnp, st->user_mfn++);
> -}
> -
>  /* Allocate pfns that are then mapped with gmfns from foreign domid. Update
>   * the vma with the page info to use later.
>   * Returns: 0 if success, otherwise -errno
> @@ -357,7 +350,6 @@ static long privcmd_ioctl_mmap_batch(void __user *udata, int version)
>  	struct vm_area_struct *vma;
>  	unsigned long nr_pages;
>  	LIST_HEAD(pagelist);
> -	int *err_array = NULL;
>  	struct mmap_batch_state state;
> 
>  	if (!xen_initial_domain())
> @@ -396,10 +388,12 @@ static long privcmd_ioctl_mmap_batch(void __user *udata, int version)
>  		goto out;
>  	}
> 
> -	err_array = kcalloc(m.num, sizeof(int), GFP_KERNEL);
> -	if (err_array == NULL) {
> -		ret = -ENOMEM;
> -		goto out;
> +	/* Zero the V2 array of errors, so we only write return codes on error. */
> +	if (version == 2) {
> +		if (clear_user(m.err, sizeof(int) * m.num)) {
> +			ret = -EFAULT;
> +			goto out;
> +		}
>  	}
> 
>  	down_write(&mm->mmap_sem);
> @@ -426,38 +420,22 @@ static long privcmd_ioctl_mmap_batch(void __user *udata, int version)
>  	state.vma = vma;
>  	state.va = m.addr;
>  	state.index = 0;
> -	state.global_error = 0;
> -	state.err = err_array;
> +	state.enoent = 0;
> +	state.user_mfn = (xen_pfn_t *)m.arr;
> +	state.user_err = m.err;
> +	state.version = version;
> 
> -	/* mmap_batch_fn guarantees ret == 0 */
> -	BUG_ON(traverse_pages(m.num, sizeof(xen_pfn_t),
> -			      &pagelist, mmap_batch_fn, &state));
> +	ret = traverse_pages(m.num, sizeof(xen_pfn_t),
> +			     &pagelist, mmap_batch_fn, &state);
> 
>  	up_write(&mm->mmap_sem);
> 
> -	if (version == 1) {
> -		if (state.global_error) {
> -			/* Write back errors in second pass. */
> -			state.user_mfn = (xen_pfn_t *)m.arr;
> -			state.err = err_array;
> -			ret = traverse_pages(m.num, sizeof(xen_pfn_t),
> -					     &pagelist, mmap_return_errors_v1, &state);
> -		} else
> -			ret = 0;
> -
> -	} else if (version == 2) {
> -		ret = __copy_to_user(m.err, err_array, m.num * sizeof(int));
> -		if (ret)
> -			ret = -EFAULT;
> -	}
> -
>  	/* If we have not had any EFAULT-like global errors then set the global
>  	 * error to -ENOENT if necessary. */
> -	if ((ret == 0) && (state.global_error == -ENOENT))
> +	if ((ret == 0) && (state.enoent))
>  		ret = -ENOENT;
> 
>  out:
> -	kfree(err_array);
>  	free_page_list(&pagelist);
> 
>  	return ret;
>
>
>>
>>> In any case, both functions allocate an arbitrary number of pages, and that is
>>> the fundamental problem.
>>>
>>> What is the approach in the forward-ported kernel wrt gather_array?
>>
>> There's no gather_array there. It simply sets up things one page
>> at a time. (And you can always go and check linux-2.6.18-xen.hg).
>>
>> Jan
>>
>
On 14/01/13 04:29, Andres Lagar-Cavilla wrote:
>
> Below you'll find pasted an RFC patch to fix this. I've expanded the
> cc line to add Mats Peterson, who is also looking into some improvements
> to privcmd (and IanC for general feedback).
>
> The RFC patch cuts down code overall and cleans up logic too. I did
> change the behavior wrt classic implementations when it comes to
> handling errors & EFAULT. Instead of doing all the mapping work and then
> copying back to user, I copy back each individual mapping error as soon
> as it arises. And short-circuit and quit the whole operation as soon as
> the first EFAULT arises.

Which is broken. Please just look at my v3 patch and implement that
method.

> diff --git a/drivers/xen/privcmd.c b/drivers/xen/privcmd.c
> index 3421f0d..9433396 100644
> --- a/drivers/xen/privcmd.c
> +++ b/drivers/xen/privcmd.c
[...]
> @@ -287,40 +285,35 @@ static int mmap_batch_fn(void *data, void *state)
[...]
> + efault = __put_user(mfn_err, st->user_mfn++);
> + } else { /* st->version == 2 */
> + efault = __put_user(ret, st->user_err++);

You can't use __put_user() or any other function accessing user memory
while holding mmap_sem or you will occasionally deadlock in the page
fault handler (depending on whether the user page is currently present
or not).

David
On Jan 14, 2013, at 8:59 AM, David Vrabel <david.vrabel@citrix.com> wrote:

> On 14/01/13 04:29, Andres Lagar-Cavilla wrote:
>>
>> Below you'll find pasted an RFC patch to fix this. I've expanded the
>> cc line to add Mats Peterson, who is also looking into some improvements
>> to privcmd (and IanC for general feedback).
>>
>> The RFC patch cuts down code overall and cleans up logic too. I did
>> change the behavior wrt classic implementations when it comes to
>> handling errors & EFAULT. Instead of doing all the mapping work and then
>> copying back to user, I copy back each individual mapping error as soon
>> as it arises. And short-circuit and quit the whole operation as soon as
>> the first EFAULT arises.
>
> Which is broken.

Certainly due to copy_on_write within the mmap semaphore. Unfortunately
I didn't have time last night to post the fix, pardon for the noise.

> Please just look at my v3 patch and implement that method.

The one nit I have about that is that it does an unnecessary get_user
of the mfn on the second pass for V1. How about this?

Andres

diff --git a/drivers/xen/privcmd.c b/drivers/xen/privcmd.c
index 3421f0d..fc4952d 100644
--- a/drivers/xen/privcmd.c
+++ b/drivers/xen/privcmd.c
@@ -261,11 +261,12 @@ struct mmap_batch_state {
 * -ENOENT if at least 1 -ENOENT has happened.
 */
 int global_error;
- /* An array for individual errors */
- int *err;
+ int version;

 /* User-space mfn array to store errors in the second pass for V1. */
 xen_pfn_t __user *user_mfn;
+ /* User-space int array to store errors in the second pass for V2. */
+ int __user *user_err;
 };

 /* auto translated dom0 note: if domU being created is PV, then mfn is
@@ -288,7 +289,19 @@ static int mmap_batch_fn(void *data, void *state)
 &cur_page);

 /* Store error code for second pass. */
- *(st->err++) = ret;
+ if (st->version == 1) {
+ if (ret < 0) {
+ /*
+ * V1 encodes the error codes in the 32bit top nibble of the
+ * mfn (with its known limitations vis-a-vis 64 bit callers).
+ */
+ *mfnp |= (ret == -ENOENT) ?
+ PRIVCMD_MMAPBATCH_PAGED_ERROR :
+ PRIVCMD_MMAPBATCH_MFN_ERROR;
+ }
+ } else { /* st->version == 2 */
+ *((int *) mfnp) = ret;
+ }

 /* And see if it affects the global_error. */
 if (ret < 0) {
@@ -305,20 +318,25 @@ static int mmap_batch_fn(void *data, void *state)
 return 0;
 }

-static int mmap_return_errors_v1(void *data, void *state)
+static int mmap_return_errors(void *data, void *state)
 {
- xen_pfn_t *mfnp = data;
 struct mmap_batch_state *st = state;
- int err = *(st->err++);
-
- /*
- * V1 encodes the error codes in the 32bit top nibble of the
- * mfn (with its known limitations vis-a-vis 64 bit callers).
- */
- *mfnp |= (err == -ENOENT) ?
- PRIVCMD_MMAPBATCH_PAGED_ERROR :
- PRIVCMD_MMAPBATCH_MFN_ERROR;
- return __put_user(*mfnp, st->user_mfn++);
+ if (st->version == 1) {
+ xen_pfn_t mfnp = *((xen_pfn_t *) data);
+ if (mfnp & PRIVCMD_MMAPBATCH_MFN_ERROR)
+ return __put_user(mfnp, st->user_mfn++);
+ else
+ st->user_mfn++;
+ } else { /* st->version == 2 */
+ int err = *((int *) data);
+ if (err)
+ return __put_user(err, st->user_err++);
+ else
+ st->user_err++;
+ }
+
+ return 0;
 }

 /* Allocate pfns that are then mapped with gmfns from foreign domid. Update
@@ -357,7 +375,6 @@ static long privcmd_ioctl_mmap_batch(void __user *udata, int version)
 struct vm_area_struct *vma;
 unsigned long nr_pages;
 LIST_HEAD(pagelist);
- int *err_array = NULL;
 struct mmap_batch_state state;

 if (!xen_initial_domain())
@@ -396,10 +413,12 @@ static long privcmd_ioctl_mmap_batch(void __user *udata, int version)
 goto out;
 }

- err_array = kcalloc(m.num, sizeof(int), GFP_KERNEL);
- if (err_array == NULL) {
- ret = -ENOMEM;
- goto out;
+ if (version == 2) {
+ /* Zero error array now to only copy back actual errors. */
+ if (clear_user(m.err, sizeof(int) * m.num)) {
+ ret = -EFAULT;
+ goto out;
+ }
 }

 down_write(&mm->mmap_sem);
@@ -427,7 +446,7 @@ static long privcmd_ioctl_mmap_batch(void __user *udata, int version)
 state.va = m.addr;
 state.index = 0;
 state.global_error = 0;
- state.err = err_array;
+ state.version = version;

 /* mmap_batch_fn guarantees ret == 0 */
 BUG_ON(traverse_pages(m.num, sizeof(xen_pfn_t),
@@ -435,21 +454,14 @@ static long privcmd_ioctl_mmap_batch(void __user *udata, int version)

 up_write(&mm->mmap_sem);

- if (version == 1) {
- if (state.global_error) {
- /* Write back errors in second pass. */
- state.user_mfn = (xen_pfn_t *)m.arr;
- state.err = err_array;
- ret = traverse_pages(m.num, sizeof(xen_pfn_t),
- &pagelist, mmap_return_errors_v1, &state);
- } else
- ret = 0;
-
- } else if (version == 2) {
- ret = __copy_to_user(m.err, err_array, m.num * sizeof(int));
- if (ret)
- ret = -EFAULT;
- }
+ if (state.global_error) {
+ /* Write back errors in second pass. */
+ state.user_mfn = (xen_pfn_t *)m.arr;
+ state.user_err = m.err;
+ ret = traverse_pages(m.num, sizeof(xen_pfn_t),
+ &pagelist, mmap_return_errors, &state);
+ } else
+ ret = 0;

 /* If we have not had any EFAULT-like global errors then set the global
 * error to -ENOENT if necessary. */
@@ -457,7 +469,6 @@ static long privcmd_ioctl_mmap_batch(void __user *udata, int version)
 ret = -ENOENT;

 out:
- kfree(err_array);
 free_page_list(&pagelist);

 return ret;

>
>> diff --git a/drivers/xen/privcmd.c b/drivers/xen/privcmd.c
>> index 3421f0d..9433396 100644
>> --- a/drivers/xen/privcmd.c
>> +++ b/drivers/xen/privcmd.c
> [...]
>> @@ -287,40 +285,35 @@ static int mmap_batch_fn(void *data, void *state)
> [...]
>> + efault = __put_user(mfn_err, st->user_mfn++);
>> + } else { /* st->version == 2 */
>> + efault = __put_user(ret, st->user_err++);
>
> You can't use __put_user() or any other function accessing user memory
> while holding mmap_sem or you will occasionally deadlock in the page
> fault handler (depending on whether the user page is currently present
> or not).
>
> David
On 14/01/13 15:06, Andres Lagar-Cavilla wrote:
> On Jan 14, 2013, at 8:59 AM, David Vrabel <david.vrabel@citrix.com> wrote:
>
>> On 14/01/13 04:29, Andres Lagar-Cavilla wrote:
>>>
>>> Below you'll find pasted an RFC patch to fix this. I've expanded the
>>> cc line to add Mats Peterson, who is also looking into some improvements
>>> to privcmd (and IanC for general feedback).
>>>
>>> The RFC patch cuts down code overall and cleans up logic too. I did
>>> change the behavior wrt classic implementations when it comes to
>>> handling errors & EFAULT. Instead of doing all the mapping work and then
>>> copying back to user, I copy back each individual mapping error as soon
>>> as it arises. And short-circuit and quit the whole operation as soon as
>>> the first EFAULT arises.
>>
>> Which is broken.
>
> Certainly due to copy_on_write within the mmap semaphore. Unfortunately
> I didn't have time last night to post the fix, pardon for the noise.
>
>> Please just look at my v3 patch and implement that method.

... but be aware that I messed up mmap_return_errors() for V1 and set
all MFNs as having errors. Oops.

> The one nit I have about that is that it does an unnecessary get_user
> of the mfn on the second pass for V1. How about this?

__get_user() and __put_user() are actually cheap (provided they don't
fault).

This looks ok except for one thing.

> diff --git a/drivers/xen/privcmd.c b/drivers/xen/privcmd.c
> index 3421f0d..fc4952d 100644
> --- a/drivers/xen/privcmd.c
> +++ b/drivers/xen/privcmd.c
[...]
> @@ -288,7 +289,19 @@ static int mmap_batch_fn(void *data, void *state)
> &cur_page);
>
> /* Store error code for second pass. */
> - *(st->err++) = ret;
> + if (st->version == 1) {
> + if (ret < 0) {
> + /*
> + * V1 encodes the error codes in the 32bit top nibble of the
> + * mfn (with its known limitations vis-a-vis 64 bit callers).
> + */
> + *mfnp |= (ret == -ENOENT) ?
> + PRIVCMD_MMAPBATCH_PAGED_ERROR :
> + PRIVCMD_MMAPBATCH_MFN_ERROR;

You also need to clear the top nibble on success (ret >= 0) so large
PFNs with the top nibble already set don't give false positives of
errors.

David
>>> On 14.01.13 at 17:03, David Vrabel <david.vrabel@citrix.com> wrote:
> On 14/01/13 15:06, Andres Lagar-Cavilla wrote:
>> @@ -288,7 +289,19 @@ static int mmap_batch_fn(void *data, void *state)
>> &cur_page);
>>
>> /* Store error code for second pass. */
>> - *(st->err++) = ret;
>> + if (st->version == 1) {
>> + if (ret < 0) {
>> + /*
>> + * V1 encodes the error codes in the 32bit top nibble of the
>> + * mfn (with its known limitations vis-a-vis 64 bit callers).
>> + */
>> + *mfnp |= (ret == -ENOENT) ?
>> + PRIVCMD_MMAPBATCH_PAGED_ERROR :
>> + PRIVCMD_MMAPBATCH_MFN_ERROR;
>
> You also need to clear the top nibble on success (ret >= 0) so large
> PFNs with the top nibble already set don't give false positives of errors.

Not really - that's what v2 was added for (the caller, unless
keeping a second array with the original MFNs, wouldn't be able
to match things up in that case).

Jan
On 14/01/13 16:14, Jan Beulich wrote:
>>>> On 14.01.13 at 17:03, David Vrabel <david.vrabel@citrix.com> wrote:
>> On 14/01/13 15:06, Andres Lagar-Cavilla wrote:
>>> @@ -288,7 +289,19 @@ static int mmap_batch_fn(void *data, void *state)
>>> &cur_page);
>>>
>>> /* Store error code for second pass. */
>>> - *(st->err++) = ret;
>>> + if (st->version == 1) {
>>> + if (ret < 0) {
>>> + /*
>>> + * V1 encodes the error codes in the 32bit top nibble of the
>>> + * mfn (with its known limitations vis-a-vis 64 bit callers).
>>> + */
>>> + *mfnp |= (ret == -ENOENT) ?
>>> + PRIVCMD_MMAPBATCH_PAGED_ERROR :
>>> + PRIVCMD_MMAPBATCH_MFN_ERROR;
>>
>> You also need to clear the top nibble on success (ret >= 0) so large
>> PFNs with the top nibble already set don't give false positives of errors.
>
> Not really - that's what v2 was added for (the caller, unless
> keeping a second array with the original MFNs, wouldn't be able
> to match things up in that case).

Ok, I can agree with that.

David
On Jan 14, 2013, at 11:03 AM, David Vrabel <david.vrabel@citrix.com> wrote:

> On 14/01/13 15:06, Andres Lagar-Cavilla wrote:
>> On Jan 14, 2013, at 8:59 AM, David Vrabel <david.vrabel@citrix.com> wrote:
>>
>>> On 14/01/13 04:29, Andres Lagar-Cavilla wrote:
>>>>
>>>> Below you'll find pasted an RFC patch to fix this. I've expanded the
>>>> cc line to add Mats Peterson, who is also looking into some improvements
>>>> to privcmd (and IanC for general feedback).
>>>>
>>>> The RFC patch cuts down code overall and cleans up logic too. I did
>>>> change the behavior wrt classic implementations when it comes to
>>>> handling errors & EFAULT. Instead of doing all the mapping work and then
>>>> copying back to user, I copy back each individual mapping error as soon
>>>> as it arises. And short-circuit and quit the whole operation as soon as
>>>> the first EFAULT arises.
>>>
>>> Which is broken.
>>
>> Certainly due to copy_on_write within the mmap semaphore. Unfortunately
>> I didn't have time last night to post the fix, pardon for the noise.
>>
>>> Please just look at my v3 patch and implement that method.
>
> ... but be aware that I messed up mmap_return_errors() for V1 and set
> all MFNs as having errors. Oops.
>
>> The one nit I have about that is that it does an unnecessary get_user
>> of the mfn on the second pass for V1. How about this?
>
> __get_user() and __put_user() are actually cheap (provided they don't
> fault).
>
> This looks ok except for one thing.
>
>> diff --git a/drivers/xen/privcmd.c b/drivers/xen/privcmd.c
>> index 3421f0d..fc4952d 100644
>> --- a/drivers/xen/privcmd.c
>> +++ b/drivers/xen/privcmd.c
> [...]
>> @@ -288,7 +289,19 @@ static int mmap_batch_fn(void *data, void *state)
>> &cur_page);
>>
>> /* Store error code for second pass. */
>> - *(st->err++) = ret;
>> + if (st->version == 1) {
>> + if (ret < 0) {
>> + /*
>> + * V1 encodes the error codes in the 32bit top nibble of the
>> + * mfn (with its known limitations vis-a-vis 64 bit callers).
>> + */
>> + *mfnp |= (ret == -ENOENT) ?
>> + PRIVCMD_MMAPBATCH_PAGED_ERROR :
>> + PRIVCMD_MMAPBATCH_MFN_ERROR;
>
> You also need to clear the top nibble on success (ret >= 0) so large
> PFNs with the top nibble already set don't give false positives of errors.

But classic kernels don't do this either, afaict (e.g.
http://xenbits.xen.org/hg/linux-2.6.18-xen.hg/file/c340a22a3a63/drivers/xen/privcmd/privcmd.c#l282
or XenServer's 6.1 kernel). The key reason for V2 is to get rid of all
these limitations and stop trying to fix them in V1.

I'm open to whichever fix, though. It'd be just one line for the else
case. I'd just like some feedback before "officially" submitting a
patch.

Thanks!
Andres

>
> David
>

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
On Jan 14, 2013, at 11:15 AM, David Vrabel <david.vrabel@citrix.com> wrote:

> On 14/01/13 16:14, Jan Beulich wrote:
>>>>> On 14.01.13 at 17:03, David Vrabel <david.vrabel@citrix.com> wrote:
>>> On 14/01/13 15:06, Andres Lagar-Cavilla wrote:
>>>> @@ -288,7 +289,19 @@ static int mmap_batch_fn(void *data, void *state)
>>>> &cur_page);
>>>>
>>>> /* Store error code for second pass. */
>>>> - *(st->err++) = ret;
>>>> + if (st->version == 1) {
>>>> + if (ret < 0) {
>>>> + /*
>>>> + * V1 encodes the error codes in the 32bit top nibble of the
>>>> + * mfn (with its known limitations vis-a-vis 64 bit callers).
>>>> + */
>>>> + *mfnp |= (ret == -ENOENT) ?
>>>> + PRIVCMD_MMAPBATCH_PAGED_ERROR :
>>>> + PRIVCMD_MMAPBATCH_MFN_ERROR;
>>>
>>> You also need to clear the top nibble on success (ret >= 0) so large
>>> PFNs with the top nibble already set don't give false positives of errors.
>>
>> Not really - that's what v2 was added for (the caller, unless
>> keeping a second array with the original MFNs, wouldn't be able
>> to match things up in that case).
>
> Ok, I can agree with that.

Ok, cool, thanks David. Jan, Konrad, is the last patch sent this (EST)
morning decent enough?

Andres

>
> David
On Mon, Jan 14, 2013 at 11:19:18AM -0500, Andres Lagar-Cavilla wrote:
> On Jan 14, 2013, at 11:15 AM, David Vrabel <david.vrabel@citrix.com> wrote:
>
> > On 14/01/13 16:14, Jan Beulich wrote:
> >>>>> On 14.01.13 at 17:03, David Vrabel <david.vrabel@citrix.com> wrote:
> >>> On 14/01/13 15:06, Andres Lagar-Cavilla wrote:
> >>>> @@ -288,7 +289,19 @@ static int mmap_batch_fn(void *data, void *state)
> >>>> &cur_page);
> >>>>
> >>>> /* Store error code for second pass. */
> >>>> - *(st->err++) = ret;
> >>>> + if (st->version == 1) {
> >>>> + if (ret < 0) {
> >>>> + /*
> >>>> + * V1 encodes the error codes in the 32bit top nibble of the
> >>>> + * mfn (with its known limitations vis-a-vis 64 bit callers).
> >>>> + */
> >>>> + *mfnp |= (ret == -ENOENT) ?
> >>>> + PRIVCMD_MMAPBATCH_PAGED_ERROR :
> >>>> + PRIVCMD_MMAPBATCH_MFN_ERROR;
> >>>
> >>> You also need to clear the top nibble on success (ret >= 0) so large
> >>> PFNs with the top nibble already set don't give false positives of errors.
> >>
> >> Not really - that's what v2 was added for (the caller, unless
> >> keeping a second array with the original MFNs, wouldn't be able
> >> to match things up in that case).
> >
> > Ok, I can agree with that.
>
> Ok, cool, thanks David. Jan, Konrad, is the last patch sent this (EST)
> morning decent enough?

Hm, I am not seeing it in my mailbox. Was I on the 'To' or 'CC' list?