Hello,

I am currently looking into ways to support fixed virtual address allocations and sparse mappings in nouveau, as a step towards supporting CUDA.

CUDA requires that the GPU virtual address for a given buffer match the CPU virtual address. Therefore, when mapping a CUDA buffer, we have to have a way of specifying a particular virtual address to map to (we would ask that the CPU virtual address be used). Currently, as I understand it, the allocator implemented in nvkm/core/mm.c, used to provision virtual addresses, doesn't allow for this (but it's very easy to modify the allocator slightly to allow for this, which I have done locally in my experiments).

In addition, the CUDA use case typically involves allocating a big chunk of address space ahead of time as a way to reserve that chunk for future CUDA use. It then maps individual buffers into that address space as needed. Currently, the virtual address allocation is done during buffer mapping, so in order to support these sparse mappings, it seems to me that the virtual address allocation and buffer mapping need to be decoupled into separate operations.

My current strawman proposal for supporting this is to introduce two new ioctls, DRM_IOCTL_NOUVEAU_AS_ALLOC and DRM_IOCTL_NOUVEAU_AS_FREE, that look roughly like this:

#define NOUVEAU_AS_ALLOC_FLAGS_FIXED_OFFSET 0x1
struct drm_nouveau_as_alloc {
    uint64_t pages;     /* in, pages */
    uint32_t page_size; /* in, bytes */
    uint32_t flags;     /* in */
    uint64_t offset;    /* in/out, byte address */
};

struct drm_nouveau_as_free {
    uint64_t offset;    /* in, byte address */
};

These ioctls just call into the allocator to allocate a range of addresses, resulting in a struct nvkm_vma that tracks that allocation (or releases the struct nvkm_vma back into the virtual address pool in the case of the free ioctl). If NOUVEAU_AS_ALLOC_FLAGS_FIXED_OFFSET is set, offset specifies the requested virtual address. Otherwise, an arbitrary address will be allocated.

In addition to this, a way to map/unmap buffers is needed. Ordinarily, one would just use DRM_IOCTL_PRIME_FD_TO_HANDLE to import and map a dmabuf into gem. However, this ioctl will try to grab the virtual address range for this buffer, which will fail in the CUDA case since the virtual address range has been reserved ahead of time. So we perhaps introduce a set of ioctls to map/unmap buffers on top of an already existing virtual address allocation.

Please, feedback and questions are very much appreciated.
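To make the proposed semantics a bit more concrete, here is a minimal userspace sketch of how a fixed-offset request and an "anywhere" request might behave. This is not nouveau code and does not touch nvkm/core/mm.c; the two request structs are copied from the proposal, while everything prefixed toy_ (the range table, the first-fit search, the -1 error convention) is a hypothetical stand-in for the real allocator. Overflow checks are omitted for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NOUVEAU_AS_ALLOC_FLAGS_FIXED_OFFSET 0x1

/* Request structs as given in the strawman proposal. */
struct drm_nouveau_as_alloc {
    uint64_t pages;     /* in, pages */
    uint32_t page_size; /* in, bytes */
    uint32_t flags;     /* in */
    uint64_t offset;    /* in/out, byte address */
};

struct drm_nouveau_as_free {
    uint64_t offset;    /* in, byte address */
};

/* Everything below is hypothetical: a toy address space with a small
 * fixed-size table of reserved ranges inside [base, limit). */
#define TOY_MAX_RANGES 16

struct toy_range {
    uint64_t start;
    uint64_t length;
    int used;
};

struct toy_as {
    uint64_t base;
    uint64_t limit;
    struct toy_range ranges[TOY_MAX_RANGES];
};

static uint64_t toy_round_up(uint64_t x, uint64_t align)
{
    return (x + align - 1) / align * align;
}

/* Return a reserved range overlapping [start, start + length), or NULL. */
static const struct toy_range *toy_find_overlap(const struct toy_as *as,
                                                uint64_t start, uint64_t length)
{
    for (size_t i = 0; i < TOY_MAX_RANGES; i++) {
        const struct toy_range *r = &as->ranges[i];
        if (r->used && start < r->start + r->length && r->start < start + length)
            return r;
    }
    return NULL;
}

static int toy_fits(const struct toy_as *as, uint64_t start, uint64_t length)
{
    return start >= as->base && start <= as->limit && length <= as->limit - start;
}

/* Reserve a range; returns 0 on success, -1 on failure. */
static int toy_as_alloc(struct toy_as *as, struct drm_nouveau_as_alloc *req)
{
    const struct toy_range *hit;
    uint64_t length, start;
    size_t slot;

    if (req->pages == 0 || req->page_size == 0)
        return -1;
    length = req->pages * (uint64_t)req->page_size;

    for (slot = 0; slot < TOY_MAX_RANGES && as->ranges[slot].used; slot++)
        ;
    if (slot == TOY_MAX_RANGES)
        return -1;

    if (req->flags & NOUVEAU_AS_ALLOC_FLAGS_FIXED_OFFSET) {
        /* Fixed: the caller's address is used as-is or the call fails. */
        start = req->offset;
        if (start % req->page_size || !toy_fits(as, start, length) ||
            toy_find_overlap(as, start, length))
            return -1;
    } else {
        /* Arbitrary: first fit, skipping past each conflicting range. */
        start = toy_round_up(as->base, req->page_size);
        while (toy_fits(as, start, length) &&
               (hit = toy_find_overlap(as, start, length)) != NULL)
            start = toy_round_up(hit->start + hit->length, req->page_size);
        if (!toy_fits(as, start, length))
            return -1;
    }

    as->ranges[slot].start = start;
    as->ranges[slot].length = length;
    as->ranges[slot].used = 1;
    req->offset = start; /* in/out: report the address actually reserved */
    return 0;
}

/* Release the range that starts at req->offset; 0 on success, -1 if none. */
static int toy_as_free(struct toy_as *as, const struct drm_nouveau_as_free *req)
{
    for (size_t i = 0; i < TOY_MAX_RANGES; i++) {
        if (as->ranges[i].used && as->ranges[i].start == req->offset) {
            as->ranges[i].used = 0;
            return 0;
        }
    }
    return -1;
}
```

In this sketch a fixed request that overlaps an existing reservation simply fails, which is the behaviour a later "map into a reserved range" operation would have to avoid triggering.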
On Mon, Jul 6, 2015 at 8:42 PM, Andrew Chew <achew at nvidia.com> wrote:
> These ioctls just call into the allocator to allocate a range of addresses,
> resulting in a struct nvkm_vma that tracks that allocation (or releases the
> struct nvkm_vma back into the virtual address pool in the case of the free
> ioctl). If NOUVEAU_AS_ALLOC_FLAGS_FIXED_OFFSET is set, offset specifies the
> requested virtual address. Otherwise, an arbitrary address will be
> allocated.

Well, this can't just be an address space. You still need bo's, if this is to work with nouveau -- it has to know when to swap things in and out, when they're used, etc. (and/or move between VRAM and GART and system/swap). I suspect that your targets here are the GK20A and GM20B chips which don't have dedicated VRAM, but the ioctl's need to work for everything.

Would it be sufficient to extend NOUVEAU_GEM_NEW or create a NOUVEAU_GEM_NEW_FIXED or something? IOW, why do we have to separate the concept of a GEM object and a VM allocation?

> In addition to this, a way to map/unmap buffers is needed. Ordinarily, one
> would just use DRM_IOCTL_PRIME_FD_TO_HANDLE to import and map a dmabuf into
> gem. However, this ioctl will try to grab the virtual address range for this
> buffer, which will fail in the CUDA case since the virtual address range
> has been reserved ahead of time. So we perhaps introduce a set of ioctls
> to map/unmap buffers on top of an already existing virtual address allocation.

My suggestion above is an alternative to this, right? I think dmabufs tend to be used for sharing between devices. I suspect there's more going on here that I don't understand though -- I assume the CUDA use-case is similar to the HSA use-case -- being able to build up data structures that point to one another on the CPU and then process them on the GPU? Can you detail a specific use-case perhaps, including the interactions with the GPU and its address space?

Jérôme, I believe you were doing the HSA kernel implementation. Perhaps you'd have some feedback on this proposal?

Cheers,

  -ilia
Jerome Glisse
2015-Jul-07 17:27 UTC
[Nouveau] CUDA fixed VA allocations and sparse mappings
On Tue, Jul 07, 2015 at 11:29:38AM -0400, Ilia Mirkin wrote:
> Well, this can't just be an address space. You still need bo's, if
> this is to work with nouveau -- it has to know when to swap things in
> and out, when they're used, etc. (and/or move between VRAM and GART
> and system/swap). I suspect that your targets here are the GK20A and
> GM20B chips which don't have dedicated VRAM, but the ioctl's need to
> work for everything.
>
> Would it be sufficient to extend NOUVEAU_GEM_NEW or create a
> NOUVEAU_GEM_NEW_FIXED or something? IOW, why do we have to separate the
> concept of a GEM object and a VM allocation?

Well, maybe something like what I did for radeon. With radeon you have two sets of ioctls: one to create/delete a bo (GEM stuff) and one to associate a virtual address with a bo. I wanted to let userspace decide on the virtual address of a buffer for precisely the same reason CUDA does, i.e. to allow mapping a buffer at the same address in the GPU address space as in the CPU address space. So far we never really took advantage of that on the radeon side.

Also, on radeon you can map the same bo at different virtual addresses in the same process (you will need a different file descriptor for each mapping, and you can only submit a command stream using a mapping that is valid for that file descriptor). Though this is mostly useful when sharing the same bo across different processes.

I think my radeon virtual address ioctls are a nice design, but others might disagree. If you want to look at the code:

drivers/gpu/drm/radeon/radeon_vm.c
drivers/gpu/drm/radeon/radeon_gem.c

Grep for _va (virtual address per bo) or _vm (virtual address manager per file descriptor) in function and structure names.

On the command stream and bo eviction side everything is as usual on radeon. So a bo can be evicted between two command streams to make room for another one. Either its mapping is invalidated or it is updated to point to system memory. So most of the logic for everything else remains the same (you just need to update the multiple virtual address spaces).

> My suggestion above is an alternative to this, right? I think dmabufs
> tend to be used for sharing between devices. I suspect there's more
> going on here that I don't understand though -- I assume the CUDA
> use-case is similar to the HSA use-case -- being able to build up data
> structures that point to one another on the CPU and then process them
> on the GPU? Can you detail a specific use-case perhaps, including the
> interactions with the GPU and its address space?

I think you nailed it; it is really about having the same address pointing to the same thing on both the GPU and CPU. But this is also valid and useful for VRAM. OpenCL 2.0 has various levels of transparent address space (probably not the term used in the spec), and the lowest level would need something like what radeon has in order to work. The most advanced level needs more plumbing inside the core kernel mm or inside the CPU and GPU hardware.

> Jérôme, I believe you were doing the HSA kernel implementation.
> Perhaps you'd have some feedback on this proposal?

No, I did not do the HSA stuff; the AMD team led by Oded did :)

Cheers,
Jérôme
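A rough sketch of the split described above, where a bo is created on its own and a virtual address is attached to it in a separate step, one address space per file descriptor. This is not the radeon code or its ioctl ABI; every toy_ name here is made up for illustration, and eviction is modelled only as a change of the bo's placement field while its virtual addresses stay put.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical placement of a buffer object's backing storage. */
enum toy_placement { TOY_VRAM, TOY_SYSTEM };

/* Hypothetical bo: exists independently of any virtual address. */
struct toy_bo {
    uint32_t handle;
    uint64_t size;
    enum toy_placement placement;
};

/* One bo-to-address association; bo == NULL marks a free slot. */
struct toy_va_mapping {
    const struct toy_bo *bo;
    uint64_t va;
};

#define TOY_MAX_MAPPINGS 8

/* Hypothetical per-file-descriptor address space. */
struct toy_vm {
    struct toy_va_mapping maps[TOY_MAX_MAPPINGS];
};

/* Attach bo at the caller-chosen address; 0 on success, -1 on conflict
 * with an existing mapping in this vm or when the table is full. */
static int toy_va_map(struct toy_vm *vm, const struct toy_bo *bo, uint64_t va)
{
    size_t free_slot = TOY_MAX_MAPPINGS;

    for (size_t i = 0; i < TOY_MAX_MAPPINGS; i++) {
        const struct toy_va_mapping *m = &vm->maps[i];
        if (!m->bo) {
            if (free_slot == TOY_MAX_MAPPINGS)
                free_slot = i;
            continue;
        }
        if (va < m->va + m->bo->size && m->va < va + bo->size)
            return -1;
    }
    if (free_slot == TOY_MAX_MAPPINGS)
        return -1;

    vm->maps[free_slot].bo = bo;
    vm->maps[free_slot].va = va;
    return 0;
}

/* Detach whatever mapping starts at va; 0 on success, -1 if none. */
static int toy_va_unmap(struct toy_vm *vm, uint64_t va)
{
    for (size_t i = 0; i < TOY_MAX_MAPPINGS; i++) {
        if (vm->maps[i].bo && vm->maps[i].va == va) {
            vm->maps[i].bo = NULL;
            return 0;
        }
    }
    return -1;
}

/* Resolve an address inside this vm to the bo mapped there, or NULL. */
static const struct toy_bo *toy_va_lookup(const struct toy_vm *vm, uint64_t addr)
{
    for (size_t i = 0; i < TOY_MAX_MAPPINGS; i++) {
        const struct toy_va_mapping *m = &vm->maps[i];
        if (m->bo && addr >= m->va && addr - m->va < m->bo->size)
            return m->bo;
    }
    return NULL;
}
```

Because the mappings only point at the bo, the same bo can sit at one address in one toy_vm and at a different address in another, and moving it between VRAM and system memory does not change either address.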
On Tue, Jul 07, 2015 at 11:29:38AM -0400, Ilia Mirkin wrote:
> On Mon, Jul 6, 2015 at 8:42 PM, Andrew Chew <achew at nvidia.com> wrote:
> > These ioctls just call into the allocator to allocate a range of addresses,
> > resulting in a struct nvkm_vma that tracks that allocation (or releases the
> > struct nvkm_vma back into the virtual address pool in the case of the free
> > ioctl). If NOUVEAU_AS_ALLOC_FLAGS_FIXED_OFFSET is set, offset specifies the
> > requested virtual address. Otherwise, an arbitrary address will be
> > allocated.
>
> Well, this can't just be an address space. You still need bo's, if
> this is to work with nouveau -- it has to know when to swap things in
> and out, when they're used, etc. (and/or move between VRAM and GART
> and system/swap). I suspect that your targets here are the GK20A and
> GM20B chips which don't have dedicated VRAM, but the ioctl's need to
> work for everything.
>
> Would it be sufficient to extend NOUVEAU_GEM_NEW or create a
> NOUVEAU_GEM_NEW_FIXED or something? IOW, why do we have to separate the
> concept of a GEM object and a VM allocation?

You're correct. This is for gk20a and gm20b.

The thing these proposed ioctls are supposed to accomplish is to reserve, ahead of time, a portion of the address space. So at this time, there really aren't any buffer objects yet, and there's nothing to be mapped to the GMMU. That part would come later.

> > In addition to this, a way to map/unmap buffers is needed. Ordinarily, one
> > would just use DRM_IOCTL_PRIME_FD_TO_HANDLE to import and map a dmabuf into
> > gem. However, this ioctl will try to grab the virtual address range for this
> > buffer, which will fail in the CUDA case since the virtual address range
> > has been reserved ahead of time. So we perhaps introduce a set of ioctls
> > to map/unmap buffers on top of an already existing virtual address allocation.
>
> My suggestion above is an alternative to this, right? I think dmabufs
> tend to be used for sharing between devices. I suspect there's more
> going on here that I don't understand though -- I assume the CUDA
> use-case is similar to the HSA use-case -- being able to build up data
> structures that point to one another on the CPU and then process them
> on the GPU? Can you detail a specific use-case perhaps, including the
> interactions with the GPU and its address space?

The whole dmabufs thing is kind of a side issue. I'll take a look at NOUVEAU_GEM_NEW, but that could be an alternative to this, maybe, if extended (or we make a new NOUVEAU_GEM_NEW_FIXED, as you suggested). Crucially, the NOUVEAU_GEM_NEW_FIXED operation shouldn't result in trying to get a virtual address region and then failing because a previous operation (see above) has reserved it already.

The use case is exactly as you describe. There are data structures built up that contain CPU pointers, and those pointers need to make sense to the GPU as well.
On 7 July 2015 at 10:42, Andrew Chew <achew at nvidia.com> wrote:
> Hello,
>
> I am currently looking into ways to support fixed virtual address allocations
> and sparse mappings in nouveau, as a step towards supporting CUDA.

Hey Andrew,

Sparse mappings were something I'd actually planned on doing too in the near future, though I haven't yet settled on exactly how it'd be exposed. Fixed address allocations weren't going to be part of that, but I see that it makes sense for a variety of use cases.

One question I have here is how this is intended to work where the RM needs to make some of these allocations itself (for graphics context mapping, etc), how should potential conflicts with user mappings be handled?

Thanks,
Ben.
regarding
--------
Fixed address allocations weren't going to be part of that, but I see that it makes sense for a variety of use cases. One question I have here is how this is intended to work where the RM needs to make some of these allocations itself (for graphics context mapping, etc), how should potential conflicts with user mappings be handled?
--------

As an initial implementation you can probably assume that the GPU offloading is in "exclusive" mode, i.e. that the CUDA or OpenACC code has full ownership of the card. The Tesla cards don't even have a video out on them. To complicate this even more, some offloading code has very long-running kernels and, even worse, may critically depend on using the full available GPU RAM (large matrix sizes, and soon big Fortran arrays or complex data types).

Long term, direct PCIe copies between cards will be important, aka zero-copy. It may seem crazy, but when you have 16+ GPUs in a single workstation (Cirrascale), stuff like this is key.