thr3ads.net - Nouveau - [RFC 00/29] Introduce NVIDIA GPU Virtualization (vGPU) Support [Sep 2024]


Jason Gunthorpe

2024-Sep-26 14:40 UTC

[RFC 00/29] Introduce NVIDIA GPU Virtualization (vGPU) Support

On Thu, Sep 26, 2024 at 02:54:38PM +0200, Greg KH wrote:
> That's fine, but again, do NOT make design decisions based on what you
> can, and can not, feel you can slide by one of these companies to get it
> into their old kernels.  That's what I take objection to here.
It is not slide by. It is a recognition that participating in the
community gives everyone value. If you excessively deny value from one
side they will have no reason to participate.

In this case the value is that, with enough light work, the
kernel-fork community can deploy this code to their users. This has
been the accepted bargain for a long time now.

There is a great big question mark over Rust regarding what impact it
actually has on this dynamic. It is definitely not just a matter of
backporting a few hundred upstream patches. There is clearly new
upstream development work still needed, arch support being a very
obvious one.
> Also always remember please, that the % of overall Linux kernel
> installs, even counting out Android and embedded, is VERY tiny for these
> companies.  The huge % overall is doing the "right thing" by using
> upstream kernels.  And with the laws in place now that % is only going
> to grow and those older kernels will rightfully fall away into even
> smaller %.
Who is "doing the right thing"? That is not what I see, we sell
server HW to *everyone*. There are a couple sites that are "near"
upstream, but that is not too common. Everyone is running some kind of
kernel fork.

I dislike this generalization you do with % of users. Almost 100% of
NVIDIA server HW is running forks. I would estimate around 10% is
above a 6.0 baseline. It is not tiny either, NVIDIA sold like $60B of
server HW running Linux last year with this kind of demographic. So
did Intel, AMD, etc.

I would not describe this as "VERY tiny". Maybe you mean RHEL-alike
specifically, and yes, they are a diminishing install share. However,
the hyperscale companies more than make up for that with their
internal secret proprietary forks :(
> > Otherwise, let's slow down here. Nova is still years away from being
> > finished. Nouveau is the in-tree driver for this HW. This series
> > improves on Nouveau. We are definitely not at the point of refusing
> > new code because it is not written in Rust, RIGHT?
> 
> No, I do object to "we are ignoring the driver being proposed by the
> developers involved for this hardware by adding to the old one instead"
> which it seems like is happening here.
That is too harsh. We've consistently taken a community position that
OOT stuff doesn't matter, and yes that includes OOT stuff that people
we trust and respect are working on. Until it is ready for submission,
and ideally merged, it is an unknown quantity. Good, well-meaning
people routinely drop their projects, good projects run into
unexpected roadblocks, and life happens.

Nova is not being ignored, there is dialog, and yes some disagreement.

Again, nobody here is talking about disrupting Nova. We just want to
keep going as-is until we can all agree together it is ready to make a
change.

Jason

Andy Ritger

2024-Sep-26 18:07 UTC


[RFC 00/29] Introduce NVIDIA GPU Virtualization (vGPU) Support

I hope and expect the nova and vgpu_mgr efforts to ultimately converge.

First, for the fw ABI debacle: yes, it is unfortunate that we still don't
have a stable ABI from GSP.  We /are/ working on it, though there isn't
anything to show, yet.  FWIW, I expect the end result will be a much
simpler interface than what is there today, and a stable interface that
NVIDIA can guarantee.

But, for now, we have a timing problem like Jason described:

- We have customers eager for upstream vfio support in the near term,
  and that seems like something NVIDIA can develop/contribute/maintain in
  the near term, as an incremental step forward.

- Nova is still early in its development, relative to nouveau/nvkm.

- From NVIDIA's perspective, we're nervous about the backportability of
  rust-based components to enterprise kernels in the near term.

- The stable GSP ABI is not going to be ready in the near term.


I agree with what Dave said in one of the forks of this thread, in the
context of NV2080_CTRL_VGPU_MGR_INTERNAL_BOOTLOAD_GSP_VGPU_PLUGIN_TASK_PARAMS:
> The GSP firmware interfaces are not guaranteed stable. Exposing these
> interfaces outside the nvkm core is unacceptable, as otherwise we
> would have to adapt the whole kernel depending on the loaded firmware.
>
> You cannot use any nvidia sdk headers, these all have to be abstracted
> behind things that have no bearing on the API.
Agreed.  Though not infinitely scalable, and not
as clean as in rust, it seems possible to abstract
NV2080_CTRL_VGPU_MGR_INTERNAL_BOOTLOAD_GSP_VGPU_PLUGIN_TASK_PARAMS behind
a C-implemented abstraction layer in nvkm, at least for the short term.
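
A minimal sketch of what such a C abstraction layer could look like,
under stated assumptions. Every name below (vgpu_bootload_req, fw_ops,
the "fw-a"/"fw-b" layouts and their fields) is hypothetical and made up
for illustration; none of it is taken from nvkm, the vGPU series, or
the NVIDIA SDK headers. The idea shown is only that callers fill in a
firmware-independent request, and a per-firmware-version table inside
the core translates it into whatever layout the loaded firmware expects.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stable, firmware-independent request seen by callers (hypothetical). */
struct vgpu_bootload_req {
	uint32_t gfid;
	uint64_t heap_offset;
	uint64_t heap_size;
};

/* Two made-up firmware layouts; real GSP structures are NOT shown here. */
struct fw_a_bootload_params {
	uint32_t gfid;
	uint64_t heap_offset;
	uint64_t heap_size;
};

struct fw_b_bootload_params {
	uint64_t heap_size;
	uint64_t heap_offset;
	uint32_t gfid;
	uint32_t flags;
};

/* Per-firmware-version operations, private to the abstraction layer. */
struct fw_ops {
	const char *version;
	size_t params_size;
	void (*pack_bootload)(const struct vgpu_bootload_req *req, void *buf);
};

static void fw_a_pack_bootload(const struct vgpu_bootload_req *req, void *buf)
{
	struct fw_a_bootload_params p;

	memset(&p, 0, sizeof(p));	/* also zeroes padding bytes */
	p.gfid = req->gfid;
	p.heap_offset = req->heap_offset;
	p.heap_size = req->heap_size;
	memcpy(buf, &p, sizeof(p));
}

static void fw_b_pack_bootload(const struct vgpu_bootload_req *req, void *buf)
{
	struct fw_b_bootload_params p;

	memset(&p, 0, sizeof(p));
	p.heap_size = req->heap_size;
	p.heap_offset = req->heap_offset;
	p.gfid = req->gfid;
	p.flags = 0;			/* field that only this layout has */
	memcpy(buf, &p, sizeof(p));
}

static const struct fw_ops fw_ops_table[] = {
	{ "fw-a", sizeof(struct fw_a_bootload_params), fw_a_pack_bootload },
	{ "fw-b", sizeof(struct fw_b_bootload_params), fw_b_pack_bootload },
};

/* Pick the translation table for the loaded firmware; NULL if unknown. */
const struct fw_ops *fw_ops_lookup(const char *version)
{
	size_t i;

	for (i = 0; i < sizeof(fw_ops_table) / sizeof(fw_ops_table[0]); i++)
		if (strcmp(fw_ops_table[i].version, version) == 0)
			return &fw_ops_table[i];
	return NULL;
}

/* Returns the number of bytes written, or -1 if the buffer is too small. */
int vgpu_bootload_pack(const struct fw_ops *ops,
		       const struct vgpu_bootload_req *req,
		       void *buf, size_t len)
{
	assert(ops != NULL && req != NULL && buf != NULL);
	if (len < ops->params_size)
		return -1;
	ops->pack_bootload(req, buf);
	return (int)ops->params_size;
}
```

In this sketch a caller only ever touches struct vgpu_bootload_req and
vgpu_bootload_pack(); the version-specific structures stay inside the
layer. The cost is the one noted above: each new firmware layout needs
another hand-written entry in the table.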

Is there a potential compromise where vgpu_mgr starts its life with a
dependency on nvkm, and as things mature we migrate it to instead depend
on nova?



Danilo Krummrich

2024-Sep-26 22:42 UTC


[RFC 00/29] Introduce NVIDIA GPU Virtualization (vGPU) Support

On Thu, Sep 26, 2024 at 11:40:57AM -0300, Jason Gunthorpe wrote:
> On Thu, Sep 26, 2024 at 02:54:38PM +0200, Greg KH wrote:
> > 
> > No, I do object to "we are ignoring the driver being proposed by the
> > developers involved for this hardware by adding to the old one instead"
> > which it seems like is happening here.
> 
> That is too harsh. We've consistently taken a community position that
> OOT stuff doesn't matter, and yes that includes OOT stuff that people
> we trust and respect are working on. Until it is ready for submission,
> and ideally merged, it is an unknown quantity. Good, well-meaning
> people routinely drop their projects, good projects run into
> unexpected roadblocks, and life happens.
That's not the point -- at least it never was my point.

Upstream has set a strategy, and it's totally fine to raise concerns,
discuss them, look for solutions, draw conclusions and do adjustments
where needed.

But we have to agree on a long term strategy and work towards the
corresponding goals *together*.

I don't want to end up in a situation where everyone just does their
own thing.

So, when you say things like "go do Nova, have fun", it really sounds
as if you just want to do your own thing and ignore the existing
upstream strategy instead of collaborating on it and shaping it.
