Jason Gunthorpe
2024-Sep-26 14:40 UTC
[RFC 00/29] Introduce NVIDIA GPU Virtualization (vGPU) Support
On Thu, Sep 26, 2024 at 02:54:38PM +0200, Greg KH wrote:
> That's fine, but again, do NOT make design decisions based on what you
> can, and can not, feel you can slide by one of these companies to get it
> into their old kernels. That's what I take objection to here.

It is not slide by. It is a recognition that participating in the
community gives everyone value. If you excessively deny value from one
side they will have no reason to participate.

In this case the value is that, with enough light work, the
kernel-fork community can deploy this code to their users. This has
been the accepted bargain for a long time now.

There is a great big question mark over Rust regarding what impact it
actually has on this dynamic. It is definitely not just backport a few
hundred upstream patches. There is clearly new upstream development
work needed still - arch support being a very obvious one.

> Also always remember please, that the % of overall Linux kernel
> installs, even counting out Android and embedded, is VERY tiny for these
> companies. The huge % overall is doing the "right thing" by using
> upstream kernels. And with the laws in place now that % is only going
> to grow and those older kernels will rightfully fall away into even
> smaller %.

Who is "doing the right thing"? That is not what I see, we sell
server HW to *everyone*. There are a couple sites that are "near"
upstream, but that is not too common. Everyone is running some kind of
kernel fork.

I dislike this generalization you do with % of users. Almost 100% of
NVIDIA server HW are running forks. I would estimate around 10% is
above a 6.0 baseline. It is not tiny either, NVIDIA sold like $60B of
server HW running Linux last year with this kind of demographic. So
did Intel, AMD, etc.

I would not describe this as "VERY tiny". Maybe you mean RHEL-alike
specifically, and yes, they are a diminishing install share. However,
the hyperscale companies more than make up for that with their
internal secret proprietary forks :(

> > Otherwise, let's slow down here. Nova is still years away from being
> > finished. Nouveau is the in-tree driver for this HW. This series
> > improves on Nouveau. We are definitely not at the point of refusing
> > new code because it is not written in Rust, RIGHT?
>
> No, I do object to "we are ignoring the driver being proposed by the
> developers involved for this hardware by adding to the old one instead"
> which it seems like is happening here.

That is too harsh. We've consistently taken a community position that
OOT stuff doesn't matter, and yes that includes OOT stuff that people
we trust and respect are working on. Until it is ready for submission,
and ideally merged, it is an unknown quantity. Good well meaning
people routinely drop their projects, good projects run into
unexpected roadblocks, and life happens.

Nova is not being ignored, there is dialog, and yes some disagreement.

Again, nobody here is talking about disrupting Nova. We just want to
keep going as-is until we can all agree together it is ready to make a
change.

Jason
Andy Ritger
2024-Sep-26 18:07 UTC
[RFC 00/29] Introduce NVIDIA GPU Virtualization (vGPU) Support
I hope and expect the nova and vgpu_mgr efforts to ultimately converge.

First, for the fw ABI debacle: yes, it is unfortunate that we still don't
have a stable ABI from GSP. We /are/ working on it, though there isn't
anything to show, yet. FWIW, I expect the end result will be a much
simpler interface than what is there today, and a stable interface that
NVIDIA can guarantee.

But, for now, we have a timing problem like Jason described:

- We have customers eager for upstream vfio support in the near term,
  and that seems like something NVIDIA can develop/contribute/maintain in
  the near term, as an incremental step forward.

- Nova is still early in its development, relative to nouveau/nvkm.

- From NVIDIA's perspective, we're nervous about the backportability of
  rust-based components to enterprise kernels in the near term.

- The stable GSP ABI is not going to be ready in the near term.

I agree with what Dave said in one of the forks of this thread, in the
context of
NV2080_CTRL_VGPU_MGR_INTERNAL_BOOTLOAD_GSP_VGPU_PLUGIN_TASK_PARAMS:

> The GSP firmware interfaces are not guaranteed stable. Exposing these
> interfaces outside the nvkm core is unacceptable, as otherwise we
> would have to adapt the whole kernel depending on the loaded firmware.
>
> You cannot use any nvidia sdk headers, these all have to be abstracted
> behind things that have no bearing on the API.

Agreed. Though not infinitely scalable, and not as clean as in rust, it
seems possible to abstract
NV2080_CTRL_VGPU_MGR_INTERNAL_BOOTLOAD_GSP_VGPU_PLUGIN_TASK_PARAMS
behind a C-implemented abstraction layer in nvkm, at least for the short
term.

Is there a potential compromise where vgpu_mgr starts its life with a
dependency on nvkm, and as things mature we migrate it to instead depend
on nova?
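
To make the "C-implemented abstraction layer" idea above a little more concrete, here is a minimal, self-contained sketch. It is not taken from nvkm, nouveau, or any NVIDIA SDK header: every struct, field, and function name below is hypothetical, and the two "firmware" layouts are invented purely to stand in for GSP versions whose parameter structs differ. The only point is the shape: callers fill a stable request, and a per-firmware ops table inside the core translates it into whatever layout the loaded firmware expects.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stable request type that code outside the core would see.
 * Hypothetical: the field names are invented for this sketch. */
struct vgpu_bootload_req {
	uint32_t vgpu_type;
	uint64_t heap_addr;
	uint32_t heap_size;
};

/* Two invented firmware-specific layouts. Neither matches a real header;
 * they only differ in field order and in one extra field. */
struct fw_a_params {
	uint32_t type;
	uint32_t heap_size;
	uint64_t heap_addr;
};

struct fw_b_params {
	uint64_t heap_addr;
	uint32_t heap_size;
	uint32_t type;
	uint32_t flags;
};

/* Per-firmware translation hooks, selected once when firmware is loaded. */
struct gsp_fw_ops {
	size_t params_size;
	void (*pack_bootload)(const struct vgpu_bootload_req *req, void *out);
};

static void fw_a_pack(const struct vgpu_bootload_req *req, void *out)
{
	struct fw_a_params p;

	memset(&p, 0, sizeof(p));
	p.type = req->vgpu_type;
	p.heap_size = req->heap_size;
	p.heap_addr = req->heap_addr;
	memcpy(out, &p, sizeof(p));
}

static void fw_b_pack(const struct vgpu_bootload_req *req, void *out)
{
	struct fw_b_params p;

	memset(&p, 0, sizeof(p));	/* flags stays 0 in this sketch */
	p.heap_addr = req->heap_addr;
	p.heap_size = req->heap_size;
	p.type = req->vgpu_type;
	memcpy(out, &p, sizeof(p));
}

const struct gsp_fw_ops fw_a_ops = { sizeof(struct fw_a_params), fw_a_pack };
const struct gsp_fw_ops fw_b_ops = { sizeof(struct fw_b_params), fw_b_pack };

/* Entry point callers would use. Returns the number of bytes written to
 * buf, or -1 on a bad argument or a buffer that is too small. */
int gsp_pack_vgpu_bootload(const struct gsp_fw_ops *ops,
			   const struct vgpu_bootload_req *req,
			   void *buf, size_t len)
{
	if (!ops || !req || !buf)
		return -1;
	assert(ops->params_size > 0 && ops->pack_bootload);
	if (len < ops->params_size)
		return -1;
	ops->pack_bootload(req, buf);
	return (int)ops->params_size;
}
```

In this arrangement a caller only ever touches struct vgpu_bootload_req and gsp_pack_vgpu_bootload(); supporting another firmware layout means adding one more ops table in the core, which is also why the approach is "not infinitely scalable".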
Danilo Krummrich
2024-Sep-26 22:42 UTC
[RFC 00/29] Introduce NVIDIA GPU Virtualization (vGPU) Support
On Thu, Sep 26, 2024 at 11:40:57AM -0300, Jason Gunthorpe wrote:
> On Thu, Sep 26, 2024 at 02:54:38PM +0200, Greg KH wrote:
> >
> > No, I do object to "we are ignoring the driver being proposed by the
> > developers involved for this hardware by adding to the old one instead"
> > which it seems like is happening here.
>
> That is too harsh. We've consistently taken a community position that
> OOT stuff doesn't matter, and yes that includes OOT stuff that people
> we trust and respect are working on. Until it is ready for submission,
> and ideally merged, it is an unknown quantity. Good well meaning
> people routinely drop their projects, good projects run into
> unexpected roadblocks, and life happens.

That's not the point -- at least it never was my point.

Upstream has set a strategy, and it's totally fine to raise concerns,
discuss them, look for solutions, draw conclusions and make adjustments
where needed.

But, we have to agree on a long term strategy and work towards the
corresponding goals *together*. I don't want to end up in a situation
where everyone just does their own thing.

So, when you say things like "go do Nova, have fun", it really just
sounds as if you want to do your own thing and ignore the existing
upstream strategy instead of collaborating on it and shaping it.