Alexandre Courbot
2014-May-23 07:10 UTC
[Nouveau] [RFC] drm/nouveau: disable caching for VRAM BOs on ARM
On Mon, May 19, 2014 at 7:16 PM, Lucas Stach <l.stach at pengutronix.de> wrote:> Am Montag, den 19.05.2014, 19:06 +0900 schrieb Alexandre Courbot: >> On 05/19/2014 06:57 PM, Lucas Stach wrote: >> > Am Montag, den 19.05.2014, 18:46 +0900 schrieb Alexandre Courbot: >> >> This patch is not meant to be merged, but rather to try and understand >> >> why this is needed and what a more suitable solution could be. >> >> >> >> Allowing BOs to be write-cached results in the following happening when >> >> trying to run any program on Tegra/GK20A: >> >> >> >> Unhandled fault: external abort on non-linefetch (0x1008) at 0xf0036010 >> >> ... >> >> (nouveau_bo_rd32) from [<c0357d00>] (nouveau_fence_update+0x5c/0x80) >> >> (nouveau_fence_update) from [<c0357d40>] (nouveau_fence_done+0x1c/0x38) >> >> (nouveau_fence_done) from [<c02c3d00>] (ttm_bo_wait+0xec/0x168) >> >> (ttm_bo_wait) from [<c035e334>] (nouveau_gem_ioctl_cpu_prep+0x44/0x100) >> >> (nouveau_gem_ioctl_cpu_prep) from [<c02aaa84>] (drm_ioctl+0x1d8/0x4f4) >> >> (drm_ioctl) from [<c0355394>] (nouveau_drm_ioctl+0x54/0x80) >> >> (nouveau_drm_ioctl) from [<c00ee7b0>] (do_vfs_ioctl+0x3dc/0x5a0) >> >> (do_vfs_ioctl) from [<c00ee9a8>] (SyS_ioctl+0x34/0x5c) >> >> (SyS_ioctl) from [<c000e6e0>] (ret_fast_syscall+0x0/0x30 >> >> >> >> The offending nouveau_bo_rd32 is done over an IO-mapped BO, e.g. a BO >> >> mapped through the BAR. >> >> >> > Um wait, this memory is behind an already mapped bar? I think ioremap on >> > ARM defaults to uncached mappings, so if you want to access the memory >> > behind this bar as WC you need to map the BAR as a whole as WC by using >> > ioremap_wc. >> >> Tried mapping the BAR using ioremap_wc(), but to no avail. On the other >> hand, could it be that VRAM BOs end up creating a mapping over an >> already-mapped region? I seem to remember that ARM might not like it... > > Multiple mapping are generally allowed, as long as they have the same > caching state. It's conflicting mappings (uncached vs cached, or cached > vs wc), that are documented to yield undefined results.Sorry about the confusion. The BAR is *not* mapped to the kernel yet (it is BAR1, there is no BAR3 on GK20A) and an ioremap_*() is performed in ttm_bo_ioremap() to make the part of the BAR where the buffer is mapped visible. It seems that doing an ioremap_wc() on the BAR area on Tegra is what leads to these errors. ioremap() or ioremap_nocache() (which are in effect the same on ARM) do not cause this issue. The best way to solve this issue would be to not use the BAR at all since the memory behind these objects can be directly accessed by the CPU. As such it would better be mapped using ttm_bo_kmap_ttm() instead. But right now this is clearly not how nouveau_bo.c is written and it does not look like this can easily be done. :/
Lucas Stach
2014-May-23 09:24 UTC
[Nouveau] [RFC] drm/nouveau: disable caching for VRAM BOs on ARM
Am Freitag, den 23.05.2014, 16:10 +0900 schrieb Alexandre Courbot:> On Mon, May 19, 2014 at 7:16 PM, Lucas Stach <l.stach at pengutronix.de> wrote: > > Am Montag, den 19.05.2014, 19:06 +0900 schrieb Alexandre Courbot: > >> On 05/19/2014 06:57 PM, Lucas Stach wrote: > >> > Am Montag, den 19.05.2014, 18:46 +0900 schrieb Alexandre Courbot: > >> >> This patch is not meant to be merged, but rather to try and understand > >> >> why this is needed and what a more suitable solution could be. > >> >> > >> >> Allowing BOs to be write-cached results in the following happening when > >> >> trying to run any program on Tegra/GK20A: > >> >> > >> >> Unhandled fault: external abort on non-linefetch (0x1008) at 0xf0036010 > >> >> ... > >> >> (nouveau_bo_rd32) from [<c0357d00>] (nouveau_fence_update+0x5c/0x80) > >> >> (nouveau_fence_update) from [<c0357d40>] (nouveau_fence_done+0x1c/0x38) > >> >> (nouveau_fence_done) from [<c02c3d00>] (ttm_bo_wait+0xec/0x168) > >> >> (ttm_bo_wait) from [<c035e334>] (nouveau_gem_ioctl_cpu_prep+0x44/0x100) > >> >> (nouveau_gem_ioctl_cpu_prep) from [<c02aaa84>] (drm_ioctl+0x1d8/0x4f4) > >> >> (drm_ioctl) from [<c0355394>] (nouveau_drm_ioctl+0x54/0x80) > >> >> (nouveau_drm_ioctl) from [<c00ee7b0>] (do_vfs_ioctl+0x3dc/0x5a0) > >> >> (do_vfs_ioctl) from [<c00ee9a8>] (SyS_ioctl+0x34/0x5c) > >> >> (SyS_ioctl) from [<c000e6e0>] (ret_fast_syscall+0x0/0x30 > >> >> > >> >> The offending nouveau_bo_rd32 is done over an IO-mapped BO, e.g. a BO > >> >> mapped through the BAR. > >> >> > >> > Um wait, this memory is behind an already mapped bar? I think ioremap on > >> > ARM defaults to uncached mappings, so if you want to access the memory > >> > behind this bar as WC you need to map the BAR as a whole as WC by using > >> > ioremap_wc. > >> > >> Tried mapping the BAR using ioremap_wc(), but to no avail. On the other > >> hand, could it be that VRAM BOs end up creating a mapping over an > >> already-mapped region? I seem to remember that ARM might not like it... > > > > Multiple mapping are generally allowed, as long as they have the same > > caching state. It's conflicting mappings (uncached vs cached, or cached > > vs wc), that are documented to yield undefined results. > > Sorry about the confusion. The BAR is *not* mapped to the kernel yet > (it is BAR1, there is no BAR3 on GK20A) and an ioremap_*() is > performed in ttm_bo_ioremap() to make the part of the BAR where the > buffer is mapped visible. It seems that doing an ioremap_wc() on the > BAR area on Tegra is what leads to these errors. ioremap() or > ioremap_nocache() (which are in effect the same on ARM) do not cause > this issue. >It would be cool if you could ask HW, or the blob developers, if this is a general issue. The external abort is clearly the GPUs AXI client responding with an error to the read request, though I'm not clear where a WC read differs from an uncached one.> The best way to solve this issue would be to not use the BAR at all > since the memory behind these objects can be directly accessed by the > CPU. As such it would better be mapped using ttm_bo_kmap_ttm() > instead. But right now this is clearly not how nouveau_bo.c is written > and it does not look like this can easily be done. :/Yeah, it sounds like we want this shortcut for stolen VRAM implementations. Regards, Lucas -- Pengutronix e.K. | Lucas Stach | Industrial Linux Solutions | http://www.pengutronix.de/ |
Alexandre Courbot
2014-May-23 09:43 UTC
[Nouveau] [RFC] drm/nouveau: disable caching for VRAM BOs on ARM
On 05/23/2014 06:24 PM, Lucas Stach wrote:> Am Freitag, den 23.05.2014, 16:10 +0900 schrieb Alexandre Courbot: >> On Mon, May 19, 2014 at 7:16 PM, Lucas Stach <l.stach at pengutronix.de> wrote: >>> Am Montag, den 19.05.2014, 19:06 +0900 schrieb Alexandre Courbot: >>>> On 05/19/2014 06:57 PM, Lucas Stach wrote: >>>>> Am Montag, den 19.05.2014, 18:46 +0900 schrieb Alexandre Courbot: >>>>>> This patch is not meant to be merged, but rather to try and understand >>>>>> why this is needed and what a more suitable solution could be. >>>>>> >>>>>> Allowing BOs to be write-cached results in the following happening when >>>>>> trying to run any program on Tegra/GK20A: >>>>>> >>>>>> Unhandled fault: external abort on non-linefetch (0x1008) at 0xf0036010 >>>>>> ... >>>>>> (nouveau_bo_rd32) from [<c0357d00>] (nouveau_fence_update+0x5c/0x80) >>>>>> (nouveau_fence_update) from [<c0357d40>] (nouveau_fence_done+0x1c/0x38) >>>>>> (nouveau_fence_done) from [<c02c3d00>] (ttm_bo_wait+0xec/0x168) >>>>>> (ttm_bo_wait) from [<c035e334>] (nouveau_gem_ioctl_cpu_prep+0x44/0x100) >>>>>> (nouveau_gem_ioctl_cpu_prep) from [<c02aaa84>] (drm_ioctl+0x1d8/0x4f4) >>>>>> (drm_ioctl) from [<c0355394>] (nouveau_drm_ioctl+0x54/0x80) >>>>>> (nouveau_drm_ioctl) from [<c00ee7b0>] (do_vfs_ioctl+0x3dc/0x5a0) >>>>>> (do_vfs_ioctl) from [<c00ee9a8>] (SyS_ioctl+0x34/0x5c) >>>>>> (SyS_ioctl) from [<c000e6e0>] (ret_fast_syscall+0x0/0x30 >>>>>> >>>>>> The offending nouveau_bo_rd32 is done over an IO-mapped BO, e.g. a BO >>>>>> mapped through the BAR. >>>>>> >>>>> Um wait, this memory is behind an already mapped bar? I think ioremap on >>>>> ARM defaults to uncached mappings, so if you want to access the memory >>>>> behind this bar as WC you need to map the BAR as a whole as WC by using >>>>> ioremap_wc. >>>> >>>> Tried mapping the BAR using ioremap_wc(), but to no avail. On the other >>>> hand, could it be that VRAM BOs end up creating a mapping over an >>>> already-mapped region? I seem to remember that ARM might not like it... >>> >>> Multiple mapping are generally allowed, as long as they have the same >>> caching state. It's conflicting mappings (uncached vs cached, or cached >>> vs wc), that are documented to yield undefined results. >> >> Sorry about the confusion. The BAR is *not* mapped to the kernel yet >> (it is BAR1, there is no BAR3 on GK20A) and an ioremap_*() is >> performed in ttm_bo_ioremap() to make the part of the BAR where the >> buffer is mapped visible. It seems that doing an ioremap_wc() on the >> BAR area on Tegra is what leads to these errors. ioremap() or >> ioremap_nocache() (which are in effect the same on ARM) do not cause >> this issue. >> > It would be cool if you could ask HW, or the blob developers, if this is > a general issue. The external abort is clearly the GPUs AXI client > responding with an error to the read request, though I'm not clear where > a WC read differs from an uncached one.Will check that.> >> The best way to solve this issue would be to not use the BAR at all >> since the memory behind these objects can be directly accessed by the >> CPU. As such it would better be mapped using ttm_bo_kmap_ttm() >> instead. But right now this is clearly not how nouveau_bo.c is written >> and it does not look like this can easily be done. :/ > > Yeah, it sounds like we want this shortcut for stolen VRAM > implementations.Actually, isn't it the case that we do not want to use TTM at all for stolen VRAM (UMA) devices? I am trying to wrap my head around this since a while already, and could not think of a way to use the current TTM-based nouveau_bo optimally for GK20A. Because we cannot do without the idea of VRAM and GART, we will always have to "move" objects from one location to another, or deal with constraints that do not make sense for UMA devices (like in the current case, accessing VRAM objects through the BAR). I am currently contemplating the idea of writing an alternative non-TTM implementation of nouveau_bo for UMA devices, that would (hopefully) be much simpler and would spare us a lot of stunts. On the other hand, this sounds like a considerable work and I would like to make sure that my lack of understanding of TTM is not driving me to the wrong solution. Thoughts? Thanks, Alex.
Alexandre Courbot
2014-May-26 09:58 UTC
[Nouveau] [RFC] drm/nouveau: disable caching for VRAM BOs on ARM
On Fri, May 23, 2014 at 6:24 PM, Lucas Stach <l.stach at pengutronix.de> wrote:>> The best way to solve this issue would be to not use the BAR at all >> since the memory behind these objects can be directly accessed by the >> CPU. As such it would better be mapped using ttm_bo_kmap_ttm() >> instead. But right now this is clearly not how nouveau_bo.c is written >> and it does not look like this can easily be done. :/ > > Yeah, it sounds like we want this shortcut for stolen VRAM > implementations.Tried playing a bit with nouveau_bo and the following hack allows a simple Mesa program to run to completion... once (second time leads to a kernel panic): diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.c b/drivers/gpu/drm/nouveau/nouveau_bo.c index f00ae18003f1..6317d30a8e1d 100644 --- a/drivers/gpu/drm/nouveau/nouveau_bo.c +++ b/drivers/gpu/drm/nouveau/nouveau_bo.c @@ -538,7 +538,6 @@ nouveau_bo_init_mem_type(struct ttm_bo_device *bdev, uint32_t type, man->available_caching = TTM_PL_MASK_CACHING; man->default_caching = TTM_PL_FLAG_CACHED; break; - case TTM_PL_VRAM: if (nv_device(drm->device)->card_type >= NV_50) { man->func = &nouveau_vram_manager; man->io_reserve_fastpath = false; @@ -556,6 +555,7 @@ nouveau_bo_init_mem_type(struct ttm_bo_device *bdev, uint32_t type, man->default_caching = TTM_PL_FLAG_WC; #endif break; + case TTM_PL_VRAM: case TTM_PL_TT: if (nv_device(drm->device)->card_type >= NV_50) man->func = &nouveau_gart_manager; @@ -1297,6 +1297,7 @@ nouveau_ttm_io_mem_reserve(struct ttm_bo_device *bdev, struct ttm_mem_reg *mem) break; /* fallthrough, tiled memory */ case TTM_PL_VRAM: + break; mem->bus.offset = mem->start << PAGE_SHIFT; mem->bus.base = nv_device_resource_start(nouveau_dev(dev), 1); mem->bus.is_iomem = true; Of course it won't go very far this way, but I wonder if the principle is not what we would want to do for UMA devices? Not using the vram manager at all, and strictly rely on TT placements for BOs. We will need to add extra handling for things like tiled memory. Does that look like the right direction?
Possibly Parallel Threads
- [RFC] drm/nouveau: disable caching for VRAM BOs on ARM
- [RFC] drm/nouveau: disable caching for VRAM BOs on ARM
- [RFC] drm/nouveau: disable caching for VRAM BOs on ARM
- [RFC] drm/nouveau: disable caching for VRAM BOs on ARM
- [PATCH 00/12] drm/nouveau: support for GK20A, cont'd