thr3ads.net - Nouveau - [Nouveau] [RFC] drm/nouveau: disable caching for VRAM BOs on ARM [May 2014]


Alexandre Courbot

2014-May-23 09:43 UTC

[Nouveau] [RFC] drm/nouveau: disable caching for VRAM BOs on ARM

On 05/23/2014 06:24 PM, Lucas Stach wrote:
> On Friday, 23.05.2014 at 16:10 +0900, Alexandre Courbot wrote:
>> On Mon, May 19, 2014 at 7:16 PM, Lucas Stach <l.stach at pengutronix.de> wrote:
>>> On Monday, 19.05.2014 at 19:06 +0900, Alexandre Courbot wrote:
>>>> On 05/19/2014 06:57 PM, Lucas Stach wrote:
>>>>> On Monday, 19.05.2014 at 18:46 +0900, Alexandre Courbot wrote:
>>>>>> This patch is not meant to be merged, but rather to try and understand
>>>>>> why this is needed and what a more suitable solution could be.
>>>>>>
>>>>>> Allowing BOs to be write-cached results in the following happening when
>>>>>> trying to run any program on Tegra/GK20A:
>>>>>>
>>>>>> Unhandled fault: external abort on non-linefetch (0x1008) at 0xf0036010
>>>>>> ...
>>>>>> (nouveau_bo_rd32) from [<c0357d00>] (nouveau_fence_update+0x5c/0x80)
>>>>>> (nouveau_fence_update) from [<c0357d40>] (nouveau_fence_done+0x1c/0x38)
>>>>>> (nouveau_fence_done) from [<c02c3d00>] (ttm_bo_wait+0xec/0x168)
>>>>>> (ttm_bo_wait) from [<c035e334>] (nouveau_gem_ioctl_cpu_prep+0x44/0x100)
>>>>>> (nouveau_gem_ioctl_cpu_prep) from [<c02aaa84>] (drm_ioctl+0x1d8/0x4f4)
>>>>>> (drm_ioctl) from [<c0355394>] (nouveau_drm_ioctl+0x54/0x80)
>>>>>> (nouveau_drm_ioctl) from [<c00ee7b0>] (do_vfs_ioctl+0x3dc/0x5a0)
>>>>>> (do_vfs_ioctl) from [<c00ee9a8>] (SyS_ioctl+0x34/0x5c)
>>>>>> (SyS_ioctl) from [<c000e6e0>] (ret_fast_syscall+0x0/0x30
>>>>>>
>>>>>> The offending nouveau_bo_rd32 is done over an IO-mapped BO, e.g. a BO
>>>>>> mapped through the BAR.
>>>>>>
>>>>> Um wait, this memory is behind an already mapped BAR? I think ioremap on
>>>>> ARM defaults to uncached mappings, so if you want to access the memory
>>>>> behind this BAR as WC you need to map the BAR as a whole as WC by using
>>>>> ioremap_wc.
>>>>
>>>> Tried mapping the BAR using ioremap_wc(), but to no avail. On the other
>>>> hand, could it be that VRAM BOs end up creating a mapping over an
>>>> already-mapped region? I seem to remember that ARM might not like it...
>>>
>>> Multiple mappings are generally allowed, as long as they have the same
>>> caching state. It's conflicting mappings (uncached vs cached, or cached
>>> vs WC) that are documented to yield undefined results.
>>
>> Sorry about the confusion. The BAR is *not* mapped to the kernel yet
>> (it is BAR1, there is no BAR3 on GK20A) and an ioremap_*() is
>> performed in ttm_bo_ioremap() to make the part of the BAR where the
>> buffer is mapped visible. It seems that doing an ioremap_wc() on the
>> BAR area on Tegra is what leads to these errors. ioremap() or
>> ioremap_nocache() (which are in effect the same on ARM) do not cause
>> this issue.
>>
> It would be cool if you could ask HW, or the blob developers, if this is
> a general issue. The external abort is clearly the GPU's AXI client
> responding with an error to the read request, though I'm not clear where
> a WC read differs from an uncached one.

Will check that.

>
>> The best way to solve this issue would be to not use the BAR at all
>> since the memory behind these objects can be directly accessed by the
>> CPU. As such it would better be mapped using ttm_bo_kmap_ttm()
>> instead. But right now this is clearly not how nouveau_bo.c is written
>> and it does not look like this can easily be done. :/
>
> Yeah, it sounds like we want this shortcut for stolen VRAM
> implementations.

Actually, isn't it the case that we do not want to use TTM at all for
stolen VRAM (UMA) devices?

I have been trying to wrap my head around this for a while now, and could
not think of a way to use the current TTM-based nouveau_bo optimally for
GK20A. Because we cannot do without the idea of VRAM and GART, we will
always have to "move" objects from one location to another, or deal with
constraints that do not make sense for UMA devices (like in the current
case, accessing VRAM objects through the BAR).

I am currently contemplating the idea of writing an alternative non-TTM 
implementation of nouveau_bo for UMA devices, that would (hopefully) be 
much simpler and would spare us a lot of stunts.

On the other hand, this sounds like considerable work, and I would like
to make sure that my lack of understanding of TTM is not driving me to
the wrong solution. Thoughts?

Thanks,
Alex.

Lucas Stach

2014-May-23 09:59 UTC



On Friday, 23.05.2014 at 18:43 +0900, Alexandre Courbot wrote:
> On 05/23/2014 06:24 PM, Lucas Stach wrote:
> > On Friday, 23.05.2014 at 16:10 +0900, Alexandre Courbot wrote:
> >> Sorry about the confusion. The BAR is *not* mapped to the kernel yet
> >> (it is BAR1, there is no BAR3 on GK20A) and an ioremap_*() is
> >> performed in ttm_bo_ioremap() to make the part of the BAR where the
> >> buffer is mapped visible. It seems that doing an ioremap_wc() on the
> >> BAR area on Tegra is what leads to these errors. ioremap() or
> >> ioremap_nocache() (which are in effect the same on ARM) do not cause
> >> this issue.
> >>
> > It would be cool if you could ask HW, or the blob developers, if this is
> > a general issue. The external abort is clearly the GPU's AXI client
> > responding with an error to the read request, though I'm not clear where
> > a WC read differs from an uncached one.
>
> Will check that.
>
> >
> >> The best way to solve this issue would be to not use the BAR at all
> >> since the memory behind these objects can be directly accessed by the
> >> CPU. As such it would better be mapped using ttm_bo_kmap_ttm()
> >> instead. But right now this is clearly not how nouveau_bo.c is written
> >> and it does not look like this can easily be done. :/
> >
> > Yeah, it sounds like we want this shortcut for stolen VRAM
> > implementations.
> 
> Actually, isn't it the case that we do not want to use TTM at all for
> stolen VRAM (UMA) devices?
>
> I have been trying to wrap my head around this for a while now, and could
> not think of a way to use the current TTM-based nouveau_bo optimally for
> GK20A. Because we cannot do without the idea of VRAM and GART, we will
> always have to "move" objects from one location to another, or deal with
> constraints that do not make sense for UMA devices (like in the current
> case, accessing VRAM objects through the BAR).
>
> I am currently contemplating the idea of writing an alternative non-TTM
> implementation of nouveau_bo for UMA devices, that would (hopefully) be
> much simpler and would spare us a lot of stunts.
>
> On the other hand, this sounds like considerable work, and I would like
> to make sure that my lack of understanding of TTM is not driving me to
> the wrong solution. Thoughts?

You may want to make yourself aware of all the quirks required for
sharing memory between the GPU and CPU on an ARM host. I think they are
far more involved than what you see now, and writing a replacement for
TTM will not be an easy task.

Doing away with the concept of two memory areas will not get you to a
single unified address space. You would have to deal with things like
not being able to change the caching state of pages in the system's
lowmem yourself. You will still have to deal with remapping pages that
aren't currently visible to the CPU because they are in the system's
highmem, or even in a different LPAE area (OK, this is not an issue on
Jetson right now, as it only has 2GB of RAM).

You really want to be sure you are aware of all the consequences of
this before considering this task.
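
To illustrate the lowmem/highmem point, here is a rough user-space C model,
not kernel code. All names below are hypothetical, and treating lowmem as a
single range starting at offset 0 is a simplifying assumption; in the kernel
this distinction is made per page (for instance with PageHighMem() and
kmap()), not with helpers like these.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical classification of how the CPU can reach a page. */
enum cpu_access {
	ACCESS_LINEAR_MAP,  /* lowmem: permanently mapped by the kernel */
	ACCESS_TEMP_MAP     /* highmem: must be mapped before each use */
};

/* Hypothetical layout: lowmem covers RAM offsets [0, lowmem_end). */
struct mem_layout {
	uint64_t lowmem_end;
};

/* Classify a page by its offset into RAM. */
static enum cpu_access page_cpu_access(const struct mem_layout *m,
				       uint64_t ram_offset)
{
	assert(m != NULL);
	return ram_offset < m->lowmem_end ? ACCESS_LINEAR_MAP
					  : ACCESS_TEMP_MAP;
}

/*
 * A lowmem page already has a cached linear mapping, so a second mapping
 * with another caching state would conflict with it. A highmem page has
 * no permanent mapping, so the attribute can be chosen when it is mapped.
 */
static bool can_pick_caching(const struct mem_layout *m, uint64_t ram_offset)
{
	return page_cpu_access(m, ram_offset) == ACCESS_TEMP_MAP;
}
```

With a layout whose lowmem_end is 768 MiB, a page at offset 4 KiB is reached
through the linear map and cannot freely change caching state, while a page
at offset 1 GiB needs a temporary mapping first.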

Regards,
Lucas

-- 
Pengutronix e.K.             | Lucas Stach                 |
Industrial Linux Solutions   | http://www.pengutronix.de/  |

Alexandre Courbot

2014-May-23 14:40 UTC



On 05/23/2014 06:59 PM, Lucas Stach wrote:
> On Friday, 23.05.2014 at 18:43 +0900, Alexandre Courbot wrote:
>> On 05/23/2014 06:24 PM, Lucas Stach wrote:
>>> On Friday, 23.05.2014 at 16:10 +0900, Alexandre Courbot wrote:
>>>> Sorry about the confusion. The BAR is *not* mapped to the kernel yet
>>>> (it is BAR1, there is no BAR3 on GK20A) and an ioremap_*() is
>>>> performed in ttm_bo_ioremap() to make the part of the BAR where the
>>>> buffer is mapped visible. It seems that doing an ioremap_wc() on the
>>>> BAR area on Tegra is what leads to these errors. ioremap() or
>>>> ioremap_nocache() (which are in effect the same on ARM) do not cause
>>>> this issue.
>>>>
>>> It would be cool if you could ask HW, or the blob developers, if this is
>>> a general issue. The external abort is clearly the GPU's AXI client
>>> responding with an error to the read request, though I'm not clear where
>>> a WC read differs from an uncached one.
>>
>> Will check that.

So after checking with more knowledgeable people, it turns out this is
the expected behavior on ARM and BAR regions should be mapped uncached
on GK20A. All the more reason to avoid using the BAR at all.

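A minimal sketch of that policy, written as a user-space C model rather than
kernel code. The enum, struct, and function names are hypothetical and are
not part of TTM or nouveau. The only facts taken from this thread are that
ioremap_wc() on the BAR faults on Tegra/GK20A and that an uncached mapping
does not; keeping write-combining on other hosts is an assumption made for
illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical mapping attributes; not the kernel's pgprot or TTM flags. */
enum map_attr {
	MAP_UNCACHED,       /* what plain ioremap() gives on ARM */
	MAP_WRITE_COMBINED  /* what ioremap_wc() asks for */
};

/* Hypothetical description of the host a BAR-mapped BO lives on. */
struct host_info {
	bool is_arm;        /* e.g. Tegra with GK20A */
};

/*
 * Pick the attribute for a CPU mapping that goes through the BAR.
 * On ARM the BAR stays uncached, as reported in this thread; on other
 * hosts this sketch assumes write-combining remains acceptable.
 */
static enum map_attr bar_map_attr(const struct host_info *host)
{
	assert(host != NULL);
	return host->is_arm ? MAP_UNCACHED : MAP_WRITE_COMBINED;
}
```

This only models the decision; in the driver the choice would have to be
made wherever the BAR range is actually ioremapped for a BO.
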
>>
>>>
>>>> The best way to solve this issue would be to not use the BAR at all
>>>> since the memory behind these objects can be directly accessed by the
>>>> CPU. As such it would better be mapped using ttm_bo_kmap_ttm()
>>>> instead. But right now this is clearly not how nouveau_bo.c is written
>>>> and it does not look like this can easily be done. :/
>>>
>>> Yeah, it sounds like we want this shortcut for stolen VRAM
>>> implementations.
>>
>> Actually, isn't it the case that we do not want to use TTM at all for
>> stolen VRAM (UMA) devices?
>>
>> I have been trying to wrap my head around this for a while now, and could
>> not think of a way to use the current TTM-based nouveau_bo optimally for
>> GK20A. Because we cannot do without the idea of VRAM and GART, we will
>> always have to "move" objects from one location to another, or deal with
>> constraints that do not make sense for UMA devices (like in the current
>> case, accessing VRAM objects through the BAR).
>>
>> I am currently contemplating the idea of writing an alternative non-TTM
>> implementation of nouveau_bo for UMA devices, that would (hopefully) be
>> much simpler and would spare us a lot of stunts.
>>
>> On the other hand, this sounds like considerable work, and I would like
>> to make sure that my lack of understanding of TTM is not driving me to
>> the wrong solution. Thoughts?
>>
> You may want to make yourself aware of all the quirks required for
> sharing memory between the GPU and CPU on an ARM host. I think they are
> far more involved than what you see now, and writing a replacement for
> TTM will not be an easy task.
>
> Doing away with the concept of two memory areas will not get you to a
> single unified address space. You would have to deal with things like
> not being able to change the caching state of pages in the system's
> lowmem yourself. You will still have to deal with remapping pages that
> aren't currently visible to the CPU because they are in the system's
> highmem, or even in a different LPAE area (OK, this is not an issue on
> Jetson right now, as it only has 2GB of RAM).
>
> You really want to be sure you are aware of all the consequences of
> this before considering this task.

Yep, that's why I am seeking advice here. My first hope is that with a
few tweaks we will be able to keep using TTM and the current nouveau_bo
implementation. But unless I missed something, this is not going to be easy.

We can also use something like the patch I originally sent to make it
work, although not with good performance, on GK20A. Not very graceful,
but it will allow applications to run.

In the long run though, we will want to achieve better performance, and
it seems like a BO implementation targeted at UMA devices would also be
beneficial to quite a few desktop GPUs. So as tricky as it may be, I'm
interested in gathering thoughts and perhaps giving it a first try with
GK20A, even if it imposes some limitations at first, like having buffers
in lowmem (we can probably live with this one for a short while, and
64-bit will also be coming to the rescue :)).

Thanks,
Alex.
