thr3ads.net - Nouveau - [Nouveau] [RFC] drm/nouveau: disable caching for VRAM BOs on ARM [May 2014]

Stéphane Marchesin

2014-May-27 01:07 UTC

[Nouveau] [RFC] drm/nouveau: disable caching for VRAM BOs on ARM

On Mon, May 26, 2014 at 5:02 PM, Alexandre Courbot <gnurou at gmail.com> wrote:
> On Mon, May 26, 2014 at 6:21 PM, Lucas Stach <l.stach at pengutronix.de> wrote:
>> On Monday, 26.05.2014 at 09:45 +0300, Terje Bergström wrote:
>>> On 23.05.2014 17:40, Alex Courbot wrote:
>>> > On 05/23/2014 06:59 PM, Lucas Stach wrote:
>>> > So after checking with more knowledgeable people, it turns out this is
>>> > the expected behavior on ARM and BAR regions should be mapped uncached
>>> > on GK20A. All the more reasons to avoid using the BAR at all.
>>>
>>> This is actually specific to Tegra.
>>>
>>> >> You may want to make yourself aware of all the quirks required for
>>> >> sharing memory between the GPU and CPU on an ARM host. I think they are
>>> >> far more involved than what you see now and writing a replacement for
>>> >> TTM will not be an easy task.
>>> >>
>>> >> Doing away with the concept of two memory areas will not get you to a
>>> >> single unified address space. You would have to deal with things like
>>> >> not being able to change the caching state of pages in the system's
>>> >> lowmem yourself. You will still have to deal with remapping pages that
>>> >> aren't currently visible to the CPU (ok this is not an issue on Jetson
>>> >> right now as it only has 2GB of RAM), because they are in the system's
>>> >> highmem, or even in a different LPAE area.
>>> >>
>>> >> You really want to be sure you are aware of all the consequences of
>>> >> this, before considering this task.
>>> >
>>> > Yep, that's why I am seeking advice here. My first hope is that with a
>>> > few tweaks we will be able to keep using TTM and the current nouveau_bo
>>> > implementation. But unless I missed something this is not going to be easy.
>>> >
>>> > We can also use something like the patch I originally sent to make it
>>> > work, although not with good performance, on GK20A. Not very graceful,
>>> > but it will allow applications to run.
>>> >
>>> > In the long run though, we will want to achieve better performance, and
>>> > it seems like a BO implementation targeted at UMA devices would also be
>>> > beneficial to quite a few desktop GPUs. So as tricky as it may be I'm
>>> > interested in gathering thoughts and why not giving it a first try with
>>> > GK20A, even if it imposes some limitations like having buffers in lowmem
>>> > at first (we can probably live with this one for a short while,
>>> > and 64 bits will also be coming to the rescue :))
>>>
>>> I don't think lowmem or LPAE is any problem, if the memory manager is
>>> designed with that in mind. The vast majority of the buffers the kernel
>>> allocates do not need to be touched in kernel space.
>>>
>>> Actually I can't think of any buffers that we allocate on behalf of user
>>> space that would need to be permanently mapped also to kernel. In the case
>>> of relocs, only the push buffer needs to be temporarily mapped to kernel.
>>>
>>> Ultimately even relocs are not necessary if we expose GPU virtual
>>> addresses directly to user space. But that's another topic.
>>>
>> Nouveau already exposes constant virtual addresses to userspace and
>> skips the pushbuf patching when the presumed offset from userspace is
>> the same as what the kernel thinks it should be.
>>
>> The problem with lowmem on ARM is that you can't unmap those pages from
>> the kernel cached mapping. So if you alloc a page, give it to userspace
>> and userspace decides to map the page WC, you just produced a conflicting
>> mapping, which may yield undefined results on ARMv7. You may think this
>> is not a problem as you are not touching the kernel cached mapping, but
>> in fact it is. The CPU's prefetcher can still access this mapping.
>
> Why would this memory be mapped into the kernel?

On ARM the kernel keeps a linear mapping of lowmem using sections
(ARM's version of huge pages). This is always cached, and because the
sections are not 4k, it's a pain to remove parts of it. See
arch/arm/mm/mmu.c.
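As a rough illustration of that granularity problem, here is a minimal sketch of the arithmetic. It assumes the classic ARMv7 short-descriptor format without LPAE, where a section covers 1 MiB and a small page covers 4 KiB. The function names and the sample address are hypothetical and are not taken from the kernel source.

```python
# Assumed sizes for the ARMv7 short-descriptor format (non-LPAE).
SMALL_PAGE_SIZE = 4 * 1024      # 4 KiB small page
SECTION_SIZE = 1024 * 1024      # 1 MiB section


def section_base(addr: int) -> int:
    """Return the base address of the section that contains addr."""
    return addr & ~(SECTION_SIZE - 1)


def pages_per_section() -> int:
    """Number of small pages that share a single section entry."""
    return SECTION_SIZE // SMALL_PAGE_SIZE


if __name__ == "__main__":
    page = 0xC0123000  # hypothetical lowmem virtual address of one page
    print(hex(section_base(page)))   # section whose attributes cover the page
    print(pages_per_section())       # pages affected by changing that entry
```

Under these assumptions, changing the caching attributes of one page would mean replacing its section entry with a table of small-page entries covering the whole section, which is why carving pieces out of the linear mapping is awkward.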

That said, I don't think this issue exists on A15 (which is what those
GPUs are paired with), so it's a purely theoretical problem.

Stéphane

Alexandre Courbot

2014-May-27 02:42 UTC

[Nouveau] [RFC] drm/nouveau: disable caching for VRAM BOs on ARM

On Tue, May 27, 2014 at 10:07 AM, Stéphane Marchesin <stephane.marchesin at gmail.com> wrote:
> On Mon, May 26, 2014 at 5:02 PM, Alexandre Courbot <gnurou at gmail.com> wrote:
>> Why would this memory be mapped into the kernel?
>
> On ARM the kernel keeps a linear mapping of lowmem using sections
> (ARM's version of huge pages). This is always cached, and because the
> sections are not 4k, it's a pain to remove parts of it. See
> arch/arm/mm/mmu.c

Ah, are we talking about the directly-mapped low memory region
starting at PAGE_OFFSET? OK, it makes sense now, thanks.
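For readers unfamiliar with that region, a small sketch of the address arithmetic behind a direct (linear) mapping may help. The constants are assumptions for illustration only: PAGE_OFFSET is set to the common 3G/1G split value and PHYS_OFFSET is a hypothetical start of RAM. The helpers are loosely modeled on the idea behind the kernel's __va()/__pa() and are not its actual implementation.

```python
PAGE_OFFSET = 0xC0000000  # assumed: start of the kernel linear mapping
PHYS_OFFSET = 0x80000000  # assumed: hypothetical physical start of RAM


def phys_to_virt(phys: int) -> int:
    """Kernel linear-map virtual address for a lowmem physical address."""
    return phys - PHYS_OFFSET + PAGE_OFFSET


def virt_to_phys(virt: int) -> int:
    """Inverse of phys_to_virt for addresses inside the linear map."""
    return virt - PAGE_OFFSET + PHYS_OFFSET


if __name__ == "__main__":
    phys = 0x80123000  # hypothetical page handed out to userspace
    print(hex(phys_to_virt(phys)))
```

The point of the sketch is that every lowmem page already has a fixed, cached kernel alias at a computable address, whatever attributes a later userspace mapping of the same page uses.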

But it seems to me that such conflicting mappings can happen in
many other scenarios as well, can't they? How is the issue handled in
those cases?

Stéphane Marchesin

2014-May-27 05:18 UTC

[Nouveau] [RFC] drm/nouveau: disable caching for VRAM BOs on ARM

On Mon, May 26, 2014 at 7:42 PM, Alexandre Courbot <gnurou at gmail.com> wrote:
> On Tue, May 27, 2014 at 10:07 AM, Stéphane Marchesin <stephane.marchesin at gmail.com> wrote:
>> On ARM the kernel keeps a linear mapping of lowmem using sections
>> (ARM's version of huge pages). This is always cached, and because the
>> sections are not 4k, it's a pain to remove parts of it. See
>> arch/arm/mm/mmu.c
>
> Ah, are we talking about the directly-mapped low memory region
> starting at PAGE_OFFSET? Ok, it makes sense now, thanks.
>
> But it seems to me that such different mappings can also happen in
> many other scenarios as well, don't they? How is the issue handled in
> these cases?

It depends. A lot of platforms actually solve this in hardware, in the
cache controller. For example, I think Tegra2 is one of those platforms.
And then a lot of platforms just ignore the issue completely because it
has a very low probability of occurring.

Stéphane
