thr3ads.net - Xen devel - Re: Some trouble to use NVIDIA CUDA with Xen [Aug 2013]

Martin Cerveny

2013-Aug-10 16:21 UTC

Re: Some trouble to use NVIDIA CUDA with Xen

Hello.

Any progress on this topic?
What is blocking the NVIDIA (proprietary) kernel drivers from running
correctly in Xen Dom0?

The problem still persists:

- CUDA "deviceQuery" runs and exits without error (BUT it takes 9 sec with
Xen and 1 sec without Xen)
- other CUDA programs, for example "bandwidthTest", exit with error
"code=46(cudaErrorDevicesUnavailable)"

Test environment:

- fedora18
- kernel 3.9.11-200.fc18.x86_64
- nvidia drivers 319.37 (comes with CUDA)
- nvidia CUDA 5.5
- XEN 4.2.2 and directly from git repo 4.4.unstable (commit
73f18583dd824f0e49f65149ef603600ce31b8ee)
- AMD Athlon(tm) 64 X2 Dual Core Processor 5600+

Attached files:

- output from dmesg, deviceQuery (with /bin/time), bandwidthTest (with
/bin/time), lspci -vvv

Hypothesis:

- PCI output is the same with and without Xen - probably no problem there
- dmesg differs ( diff dmesg_boot_without_xen_wt.txt
dmesg_boot_xen_44u_wt.txt )

=====================without XEN:

< MTRR default type: uncachable
< MTRR fixed ranges enabled:
<   00000-9FFFF write-back
<   A0000-EFFFF uncachable
<   F0000-FFFFF write-protect
< MTRR variable ranges enabled:
<   0 base 0000000000 mask FF80000000 write-back
<   1 base 0080000000 mask FFC0000000 write-back
<   2 disabled
<   3 disabled
<   4 disabled
<   5 disabled
<   6 disabled
<   7 disabled
< x86 PAT enabled: cpu 0, old 0x7040600070406, new 0x7010600070106

=====================with XEN:
> NVRM: PAT configuration unsupported.
=====================
Is MTRR+PAT (still) not supported by the kernel in Xen Dom0?
Maybe MTRR+PAT is needed for CUDA too.

Is there any workaround for Linux kernels ~ 3.9.x and Xen ~ v4.x.x?

Thanks, Martin Cerveny

Refs:

[Xen-devel] XEN MTRR - 
http://lists.xen.org/archives/html/xen-devel/2012-06/msg00194.html
[Xen-users] Xen and nvidia -
http://lists.xen.org/archives/html/xen-users/2013-01/msg00169.html

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

Gordan Bobic

2013-Aug-12 12:33 UTC

Re: Some trouble to use NVIDIA CUDA with Xen

Your Nvidia device ID seems to imply a GTX770.
I could be wrong, but wasn't this only supported on Quadro/Grid GPUs?
Or is that limitation only applicable to Windows?

Gordan


Konrad Rzeszutek Wilk

2013-Aug-12 13:00 UTC

Re: Some trouble to use NVIDIA CUDA with Xen

On Mon, Aug 12, 2013 at 01:33:13PM +0100, Gordan Bobic wrote:
> Your Nvidia device ID seems to imply a GTX770.
> I could be wrong, but wasn't this only supported on Quadro/Grid GPUs?
> Or is that limitation only applicable to Windows?
> 
> Gordan
> 
> On Sat, 10 Aug 2013 18:21:32 +0200 (CEST), Martin Cerveny
> <martin@c-home.cz> wrote:
> >Hello.
> >
> >Any progress on this topics ?
> >What is the blocker to run correctly NVIDIA (proprietary) drivers in
> >kernel in XEN/Dom0 ?
> >
> >The problem still persit:
> >
> >- CUDA "deviceQuery" run and exit without error (BUT with xen after
> >9sec, without xen after 1sec)
> >- CUDA other programs for example "bandwidthTest" exit with error
> >"code=46(cudaErrorDevicesUnavailable)"
> >
> >Test environment:
> >
> >- fedora18
> >- kernel 3.9.11-200.fc18.x86_64
> >- nvidia drivers 319.37 (comes with CUDA)
> >- nvidia CUDA 5.5
> >- XEN 4.2.2 and directly from git repo 4.4.unstable (commit
> >73f18583dd824f0e49f65149ef603600ce31b8ee)
> >- AMD Athlon(tm) 64 X2 Dual Core Processor 5600+
> >
> >Attached files:
> >
> >- output from dmesg, deviceQuery (with /bin.time), bandwidthTest
> >(with /bin/time), lspci -vvv
> >
> >Hypotehesis:
> >
> >- PCI output are the same - probably no problem
> >- dmesg differences ( diff dmesg_boot_without_xen_wt.txt
> >dmesg_boot_xen_44u_wt.txt )
> >
> >=====================
> >without XEN:
> >
> >< MTRR default type: uncachable
> >< MTRR fixed ranges enabled:
> ><   00000-9FFFF write-back
> ><   A0000-EFFFF uncachable
> ><   F0000-FFFFF write-protect
> >< MTRR variable ranges enabled:
> ><   0 base 0000000000 mask FF80000000 write-back
> ><   1 base 0080000000 mask FFC0000000 write-back
> ><   2 disabled
> ><   3 disabled
> ><   4 disabled
> ><   5 disabled
> ><   6 disabled
> ><   7 disabled
> >< x86 PAT enabled: cpu 0, old 0x7040600070406, new 0x7010600070106
> >
> >=====================
> >with XEN:
> >
> >>NVRM: PAT configuration unsupported.

Right, so there are a couple of patches that can enable that back.

You need to revert these two:
8eaffa67b43e99ae581622c5133e20b0f48bcef1
c79c49826270b8b0061b2fca840fc3f013c8a78a

And apply this patch:

https://lkml.org/lkml/2012/2/10/229

That should re-enable PAT.  Try that and please report back.


Martin Cerveny

2013-Aug-13 19:59 UTC

Re: Some trouble to use NVIDIA CUDA with Xen

Hello.

On Mon, 12 Aug 2013, Konrad Rzeszutek Wilk wrote:
> On Mon, Aug 12, 2013 at 01:33:13PM +0100, Gordan Bobic wrote:
>> Your Nvidia device ID seems to imply a GTX770.
>> I could be wrong, but wasn't this only supported on Quadro/Grid GPUs?
>> Or is that limitation only applicable to Windows?

I did not try to use it in a "multios" (domU) environment, only in plain
Dom0.

>>>> NVRM: PAT configuration unsupported.
> Right, so there are couple of patches that can enable that back.
>
> You need to revert these two:
> 8eaffa67b43e99ae581622c5133e20b0f48bcef1
> c79c49826270b8b0061b2fca840fc3f013c8a78a
>
> And apply this patch:
>
> https://lkml.org/lkml/2012/2/10/229
>
> That should re-enable PAT.  Try that and please report back.

I applied the patch to 3.9.11-200.PAT.fc18.x86_64 (3.10 does not work due
to incompatibilities with the nvidia driver source code).

The error persists:
"NVRM: PAT configuration unsupported."

There is some progress at least.
The tested programs now run at "normal speed" (as without Xen),
a 20x speedup compared with the non-patched kernel in Dom0.

The CUDA error still persists:

---
# /usr/bin/time ./bandwidthTest
[CUDA Bandwidth Test] - Starting...
Running on...

  Device 0: GeForce GTX 770
  Quick Mode

CUDA error at bandwidthTest.cu:719 code=46(cudaErrorDevicesUnavailable) "cudaEventCreate(&start)"

0.00user 0.20system 0:00.26elapsed 79%CPU (0avgtext+0avgdata 5236maxresident)k
0inputs+0outputs (0major+1182minor)pagefaults 0swaps
---

I reported the problem to NVIDIA too and am waiting for a solution.

Any other hints?

Thanks, Martin Cerveny

Konrad Rzeszutek Wilk

2013-Aug-13 20:20 UTC

Re: Some trouble to use NVIDIA CUDA with Xen

On Tue, Aug 13, 2013 at 09:59:48PM +0200, Martin Cerveny wrote:
> Hello.
> 
> On Mon, 12 Aug 2013, Konrad Rzeszutek Wilk wrote:
> >On Mon, Aug 12, 2013 at 01:33:13PM +0100, Gordan Bobic wrote:
> >>Your Nvidia device ID seems to imply a GTX770.
> >>I could be wrong, but wasn't this only supported on Quadro/Grid GPUs?
> >>Or is that limitation only applicable to Windows?
> 
> I did not try to use it in "multios" (domU) environment (only simple Dom0).
> 
> >>>>NVRM: PAT configuration unsupported.
> >Right, so there are couple of patches that can enable that back.
> >
> >You need to revert these two:
> >8eaffa67b43e99ae581622c5133e20b0f48bcef1
> >c79c49826270b8b0061b2fca840fc3f013c8a78a
> >
> >And apply this patch:
> >
> >https://lkml.org/lkml/2012/2/10/229
> >
> >That should re-enable PAT.  Try that and please report back.
> 
> I applied the patch to 3.9.11-200.PAT.fc18.x86_64 (3.10 is not
> working due to incompatibilities with nvidia driver source code).

Did you revert the other two git commits?


Martin Cerveny

2013-Aug-13 20:32 UTC

Re: Some trouble to use NVIDIA CUDA with Xen

Hello.

On Tue, 13 Aug 2013, Konrad Rzeszutek Wilk wrote:
>>>>>> NVRM: PAT configuration unsupported.
>>> Right, so there are couple of patches that can enable that back.
>>>
>>> You need to revert these two:
>>> 8eaffa67b43e99ae581622c5133e20b0f48bcef1
>>> c79c49826270b8b0061b2fca840fc3f013c8a78a
>>>
>>> And apply this patch:
>>>
>>> https://lkml.org/lkml/2012/2/10/229
>>>
>>> That should re-enable PAT.  Try that and please report back.
>>
>> I applied the patch to 3.9.11-200.PAT.fc18.x86_64 (3.10 is not
>> working due to incompatibilities with nvidia driver source code).
>>
>> Error persists:
>> "NVRM: PAT configuration unsupported."
> Did you revert the other two git commits?

Yes. But something went wrong during the revert. I will try again.

M.C.

Martin Cerveny

2013-Aug-14 22:21 UTC

Re: Some trouble to use NVIDIA CUDA with Xen

Hello.

Partial SUCCESS!

On Tue, 13 Aug 2013, Konrad Rzeszutek Wilk wrote:
>>>>>> NVRM: PAT configuration unsupported.
>>> Right, so there are couple of patches that can enable that back.
>>>
>>> You need to revert these two:
>>> 8eaffa67b43e99ae581622c5133e20b0f48bcef1
>>> c79c49826270b8b0061b2fca840fc3f013c8a78a
>>>
>>> And apply this patch:
>>>
>>> https://lkml.org/lkml/2012/2/10/229
>>>
>>> That should re-enable PAT.  Try that and please report back.
>>
>> I applied the patch to 3.9.11-200.PAT.fc18.x86_64 (3.10 is not
>> working due to incompatibilities with nvidia driver source code).
>
> Did you revert the other two git commits?

Yes, double-checked (the attachment contains the combined patch for
rpmbuild/SOURCE/ and the rpmbuild patch too).

# rdmsr 0x277
50100070406

I looked at the nvidia source code.

The error is on the nvidia side:

snip from /usr/src/nvidia-319.37/nv-pat.c
================
....
#if defined(HAVE_NV_XEN) && defined(CONFIG_XEN) && defined(CONFIG_PARAVIRT)
     if (PAT_WC_index == 4)
         return NV_PAT_MODE_KERNEL;
#endif

     if (PAT_WC_index == 1)
         return NV_PAT_MODE_KERNEL;
     else if (PAT_WC_index != 0xf)
     {
         nv_printf(NV_DBG_ERRORS,
             "NVRM: PAT configuration unsupported.\n");
         return NV_PAT_MODE_DISABLED;
     }
....
==================
HAVE_NV_XEN is NOT defined.

HAVE_NV_XEN is defined only if "nv-xen.h" is present (tested in
/usr/src/nvidia-319.37/conftest.h), and that header seems to have been
removed from the distributed source (~ in nvidia driver 19x.x.x versions).

OK, I downloaded an older version of "nv-xen.h" from the net to
/usr/src/nvidia-319.37/nv-xen.h and recompiled the driver
("cd /usr/src/nvidia-319.37; make clean module; rmmod nvidia;
cp nvidia.ko /lib/modules/3.9.11-200.PAT.fc18.x86_64/extra;
modprobe nvidia").

The error "NVRM: PAT configuration unsupported." is no longer shown (as
expected).

Most CUDA demo programs WORK without error!!!

But some programs hang PCIe and the kernel:

[55799.433278] BUG: Bad rss-counter state mm:ffff8800723e0000 idx:1 val:21
[55800.139090] abrt-handle-eve[10175]: segfault at 18 ip 0000003f20ebb6d3 sp 00007fffa7e6ef00 error 4 in libc-2.16.so[3f20e00000+1ad000]
[55800.375196] BUG: Bad rss-counter state mm:ffff8800723e2680 idx:1 val:5
[55845.124636] BUG: Bad rss-counter state mm:ffff8800723e0000 idx:1 val:8
[55962.186275] BUG: Bad rss-counter state mm:ffff880074a27800 idx:0 val:5
[55962.192811] BUG: Bad rss-counter state mm:ffff880074a27800 idx:1 val:795
[55962.262019] traps: abrt-handle-eve[10287] general protection ip:3f20ebb7a6 sp:7fffbd613410 error:0 in libc-2.16.so[3f20e00000+1ad000]
[55962.394789] BUG: Bad rss-counter state mm:ffff8800723e0380 idx:1 val:13
[55981.779246] NVRM: GPU at 0000:02:00: GPU-fe328712-3546-53fe-149d-3d78e7aa64d5
[55981.786391] NVRM: Xid (0000:02:00): 38, 0001 00000000 00000000 00000000 00000000 00000000
[55982.407300] NVRM: GPU at 0000:02:00.0 has fallen off the bus.
[55982.425810] NVRM: os_pci_init_handle: invalid context!
....
[57200.089052] BUG: soft lockup - CPU#0 stuck for 22s! [dct8x8:10290]
....
[56008.089053] RIP: e030:[<ffffffffa15061a8>]  [<ffffffffa15061a8>] _nv012574rm+0x4/0x51 [nvidia]
....
[56008.089053] Call Trace:
[56008.089053]  [<ffffffffa15056f3>] ? _nv012271rm+0xbe/0x1c6 [nvidia]
[56008.089053]  [<ffffffffa17cddf3>] ? _nv008298rm+0x26/0xb2 [nvidia]
[56008.089053]  [<ffffffffa17ec512>] ? _nv003411rm+0x47dd/0xb184 [nvidia]
[56008.089053]  [<ffffffffa17ec5b2>] ? _nv003411rm+0x487d/0xb184 [nvidia]
[56008.089053]  [<ffffffffa17f49cd>] ? _nv014043rm+0xfcd/0x1b30 [nvidia]
[56008.089053]  [<ffffffffa17ec820>] ? _nv003411rm+0x4aeb/0xb184 [nvidia]
[56008.089053]  [<ffffffffa17ec94d>] ? _nv003411rm+0x4c18/0xb184 [nvidia]
[56008.089053]  [<ffffffffa18a203f>] ? _nv010926rm+0x28/0xeb [nvidia]
[56008.089053]  [<ffffffffa18a1df4>] ? _nv011116rm+0x162/0x385 [nvidia]
[56008.089053]  [<ffffffffa14da4bd>] ? _nv008434rm+0xed/0x176 [nvidia]
[56008.089053]  [<ffffffffa1929132>] ? _nv013320rm+0x5e/0xb4 [nvidia]
[56008.089053]  [<ffffffffa192f5ea>] ? _nv013321rm+0xc76/0x2dcc [nvidia]
[56008.089053]  [<ffffffffa192f975>] ? _nv013321rm+0x1001/0x2dcc [nvidia]
[56008.089053]  [<ffffffffa192f400>] ? _nv013321rm+0xa8c/0x2dcc [nvidia]
[56008.089053]  [<ffffffffa17f1657>] ? _nv003411rm+0x9922/0xb184 [nvidia]
[56008.089053]  [<ffffffffa17efbe6>] ? _nv003411rm+0x7eb1/0xb184 [nvidia]
[56008.089053]  [<ffffffffa19c35ef>] ? _nv000747rm+0x2a3/0x2f2 [nvidia]
[56008.089053]  [<ffffffffa19bcee2>] ? rm_disable_adapter+0x74/0x107 [nvidia]
[56008.089053]  [<ffffffffa19da600>] ? nv_check_pci_config_space+0x1d0/0x2e0 [nvidia]
[56008.089053]  [<ffffffff8108808e>] ? down+0x2e/0x50
[56008.089053]  [<ffffffffa19dc987>] ? nv_kern_close+0x147/0x440 [nvidia]
[56008.089053]  [<ffffffff8119ed3c>] ? __fput+0xec/0x240
[56008.089053]  [<ffffffff8119ee9e>] ? ____fput+0xe/0x10
[56008.089053]  [<ffffffff8107f6d5>] ? task_work_run+0xc5/0xe0
[56008.089053]  [<ffffffff81064a5e>] ? do_exit+0x2ae/0xa30
[56008.089053]  [<ffffffff8108ffcb>] ? finish_task_switch+0x4b/0xe0
[56008.089053]  [<ffffffff8106526f>] ? do_group_exit+0x3f/0xa0
[56008.089053]  [<ffffffff810652e7>] ? sys_exit_group+0x17/0x20
[56008.089053]  [<ffffffff81667f99>] ? system_call_fastpath+0x16/0x1b

M.C.



Konrad Rzeszutek Wilk

2013-Aug-15 13:19 UTC

Re: Some trouble to use NVIDIA CUDA with Xen

On Thu, Aug 15, 2013 at 12:21:41AM +0200, Martin Cerveny wrote:
> Hello.
> 
> Partial SUCCSESS !
> 
> On Tue, 13 Aug 2013, Konrad Rzeszutek Wilk wrote:
> >>>>>>NVRM: PAT configuration unsupported.
> >>>Right, so there are couple of patches that can enable that back.
> >>>
> >>>You need to revert these two:
> >>>8eaffa67b43e99ae581622c5133e20b0f48bcef1
> >>>c79c49826270b8b0061b2fca840fc3f013c8a78a
> >>>
> >>>And apply this patch:
> >>>
> >>>https://lkml.org/lkml/2012/2/10/229
> >>>
> >>>That should re-enable PAT.  Try that and please report back.
> >>
> >>I applied the patch to 3.9.11-200.PAT.fc18.x86_64 (3.10 is not
> >>working due to incompatibilities with nvidia driver source code).
> >
> >Did you revert the other two git commits?
> 
> Yes, double check (combined patch is in the attachment to
> rpmbuild/SOURCE/, rpmbuild patch too).

Looks correct.

> 
> # rdmsr 0x277
> 50100070406
> 
> I look to nvidia source code.
> 
> The error is on nvidia side:
> 
> snip from /usr/src/nvidia-319.37/nv-pat.c
> ================
> ....
> #if defined(HAVE_NV_XEN) && defined(CONFIG_XEN) && defined(CONFIG_PARAVIRT)
>     if (PAT_WC_index == 4)
>         return NV_PAT_MODE_KERNEL;
> #endif
> 
>     if (PAT_WC_index == 1)
>         return NV_PAT_MODE_KERNEL;
>     else if (PAT_WC_index != 0xf)
>     {
>         nv_printf(NV_DBG_ERRORS,
>             "NVRM: PAT configuration unsupported.\n");
>         return NV_PAT_MODE_DISABLED;
>     }
> ....
> ==================
> 
> HAVE_NV_XEN is NOT defined.
> 
> HAVE_NV_XEN is defined only if "nv-xen.h" is present (tested in
> /usr/src/nvidia-319.37/conftest.h) and it seems to be removed from
> distributed source (~ in nvidia driver 19x.x.x versions).
> 
> Ok, i downloaded some older version "nv-xen.h" from net to

Do you know what it contains? Perhaps there are some oddities in there?

> /usr/src/nvidia-319.37/nv-xen.h recompile driver
> ("cd /usr/src/nvidia-319.37; make clean module; rmmod nvidia;
> cp nvidia.ko /lib/modules/3.9.11-200.PAT.fc18.x86_64/extra;
> modprobe nvidia").
> 
> Error "NVRM: PAT configuration unsupported." does not shown (as expected).
> 
> Most CUDA demoprograms WORKS without error!!!

Nice.

> 
> But some programs hung PCIe and kernel:
> 
> [55799.433278] BUG: Bad rss-counter state mm:ffff8800723e0000 idx:1 val:21
> [55800.139090] abrt-handle-eve[10175]: segfault at 18 ip 0000003f20ebb6d3 sp 00007fffa7e6ef00 error 4 in libc-2.16.so[3f20e00000+1ad000]
> [55800.375196] BUG: Bad rss-counter state mm:ffff8800723e2680 idx:1 val:5
> [55845.124636] BUG: Bad rss-counter state mm:ffff8800723e0000 idx:1 val:8
> [55962.186275] BUG: Bad rss-counter state mm:ffff880074a27800 idx:0 val:5
> [55962.192811] BUG: Bad rss-counter state mm:ffff880074a27800 idx:1 val:795
> [55962.262019] traps: abrt-handle-eve[10287] general protection ip:3f20ebb7a6 sp:7fffbd613410 error:0 in libc-2.16.so[3f20e00000+1ad000]
> [55962.394789] BUG: Bad rss-counter state mm:ffff8800723e0380 idx:1 val:13

That and the errors above imply that the nvidia driver is not doing
a good job of converting the WC pages back to WB. And when they
go back to the general pool of memory they still have the WC bit
set. Which is really, really bad.

I presume there was some code that did the 'mark_WC' and then
'unmark_WC' (or mark_WB), or perhaps set_pages_wb and
set_pages_wc.

(The set_pages_wb and set_pages_wc fix is in the one pageattr.c file.
You could also add a printk in the code there to make sure that
it is indeed working correctly - or use this little module:

http://xenbits.xen.org/gitweb/?p=xentesttools/bootstrap.git;a=blob;f=root_image/drivers/wb_to_wc/wb_to_wc.c;h=cd2439ac103150229f14f732a9a7a271ca6f397e;hb=HEAD

to double check that it is working correctly).

> [55981.779246] NVRM: GPU at 0000:02:00: GPU-fe328712-3546-53fe-149d-3d78e7aa64d5
> [55981.786391] NVRM: Xid (0000:02:00): 38, 0001 00000000 00000000 00000000 00000000 00000000
> [55982.407300] NVRM: GPU at 0000:02:00.0 has fallen off the bus.

Ha!

> [55982.425810] NVRM: os_pci_init_handle: invalid context!
> ....

Martin Cerveny

2013-Aug-15 13:28 UTC

Re: Some trouble to use NVIDIA CUDA with Xen

Hello.

On Thu, 15 Aug 2013, Konrad Rzeszutek Wilk wrote:
>> HAVE_NV_XEN is NOT defined.
>>
>> HAVE_NV_XEN is defined only if "nv-xen.h" is present (tested in
>> /usr/src/nvidia-319.37/conftest.h) and it seems to be removed from
>> distributed source (~ in nvidia driver 19x.x.x versions).
>>
>> Ok, i downloaded some older version "nv-xen.h" from net to
>
> Do you know what it contains? Perhaps there are some oddities in there?

I took the first match from Google :-)
https://github.com/lll-project/nvidia/blob/master/include/nvidia/nv-xen.h

Maybe it is not the latest one, but I have asked nvidia to deliver an
"up-to-date" version.

>> But some programs hung PCIe and kernel:
>>
>> [55799.433278] BUG: Bad rss-counter state mm:ffff8800723e0000 idx:1 val:21
>> [55800.139090] abrt-handle-eve[10175]: segfault at 18 ip 0000003f20ebb6d3 sp 00007fffa7e6ef00 error 4 in libc-2.16.so[3f20e00000+1ad000]
>> [55800.375196] BUG: Bad rss-counter state mm:ffff8800723e2680 idx:1 val:5
>> [55845.124636] BUG: Bad rss-counter state mm:ffff8800723e0000 idx:1 val:8
>> [55962.186275] BUG: Bad rss-counter state mm:ffff880074a27800 idx:0 val:5
>> [55962.192811] BUG: Bad rss-counter state mm:ffff880074a27800 idx:1 val:795
>> [55962.262019] traps: abrt-handle-eve[10287] general protection ip:3f20ebb7a6 sp:7fffbd613410 error:0 in libc-2.16.so[3f20e00000+1ad000]
>> [55962.394789] BUG: Bad rss-counter state mm:ffff8800723e0380 idx:1 val:13
>
> That and those errors above imply that the nvidia driver is not doing
> a good job of converting the WC pages back to WB. And when they
> go back to the general pool of memory they still have the WC bit
> set. Which is really really bad.
>
> I presume there was some code that did the 'mark_WC' and then
> 'unmark_WC' (or mark_WB) or perhaps set_pages_wb and set_pages_wc.
>
> (The set_pages_wb and set_pages_wb fix is the one pageattr.c file.
> You could also add in the code there an printk to make sure that
> it is indeed working correctly - or use this little module:
>
> http://xenbits.xen.org/gitweb/?p=xentesttools/bootstrap.git;a=blob;f=root_image/drivers/wb_to_wc/wb_to_wc.c;h=cd2439ac103150229f14f732a9a7a271ca6f397e;hb=HEAD
>
> to double check that it is working correctly).

I will try it at the weekend.

M.C.

Konrad Rzeszutek Wilk

2013-Aug-15 14:15 UTC

Re: Some trouble to use NVIDIA CUDA with Xen

On Thu, Aug 15, 2013 at 03:28:44PM +0200, Martin Cerveny wrote:
> Hello.
> 
> On Thu, 15 Aug 2013, Konrad Rzeszutek Wilk wrote:
> >>HAVE_NV_XEN is NOT defined.
> >>
> >>HAVE_NV_XEN is defined only if "nv-xen.h" is present (tested in
> >>/usr/src/nvidia-319.37/conftest.h) and it seems to be removed from
> >>distributed source (~ in nvidia driver 19x.x.x versions).
> >>
> >>Ok, i downloaded some older version "nv-xen.h" from net to
> >
> >Do you know what it contains? Perhaps there are some oddities in there?
> 
> I take first match from google :-)
> https://github.com/lll-project/nvidia/blob/master/include/nvidia/nv-xen.h

Duh! <hides away in shame>

.. snip..

> >http://xenbits.xen.org/gitweb/?p=xentesttools/bootstrap.git;a=blob;f=root_image/drivers/wb_to_wc/wb_to_wc.c;h=cd2439ac103150229f14f732a9a7a271ca6f397e;hb=HEAD
> >
> >to double check that it is working correctly).
> 
> I will try @weekend.

Thank you.

> M.C.

Martin Cerveny

2013-Aug-27 13:17 UTC
