thr3ads.net - Nouveau - [PATCH 11/16] gpu: nova-core: add falcon register definitions and base code [May 2025]

If this information is useful, please help other people find it:
Share via:

Danilo Krummrich

2025-Apr-30 18:16 UTC

[PATCH 11/16] gpu: nova-core: add falcon register definitions and base code

On Wed, Apr 30, 2025 at 10:38:11AM -0400, Joel Fernandes
wrote:> On 4/30/2025 9:25 AM, Alexandre Courbot wrote:
> > On Tue Apr 22, 2025 at 11:44 PM JST, Danilo Krummrich wrote:
> 
> >>> +/// Returns a boxed falcon HAL adequate for the passed
`chipset`.
> >>> +///
> >>> +/// We use this function and a heap-allocated trait object
instead of statically defined trait
> >>> +/// objects because of the two-dimensional (Chipset, Engine)
lookup required to return the
> >>> +/// requested HAL.
> >>
> >> Do we really need the dynamic dispatch? AFAICS, there's only
E::BASE that is
> >> relevant to FalconHal impls?
> >>
> >> Can't we do something like I do in the following example [1]?
> >>
> >> [1]
https://play.rust-lang.org/?version=stable&mode=debug&edition=2024&gist=bf7035a07e79a4047fb6834eac03a9f2
> > 
> > So are you have noticed there are two dimensions from which the
falcons
> > can be instantiated:
> > 
> > - The engine, which determines its register BASE,
> > - The HAL, which is determined by the chipset.
> > 
> > For the engine, I want to keep things static for the main reason that
if
> > BASE was dynamic, we would have to do all our IO using
> > try_read()/try_write() and check for an out-of-bounds error at each
> > register access. The cost of monomorphization is limited as there are
> > only a handful of engines.
> > 
> > But the HAL introduces a second dimension to this, and if we support N
> > engines then the amount of monomorphized code would then increase by N
> > for each new HAL we add. Chipsets are released at a good cadence, so
> > this is the dimension that risks growing the most.
I agree, avoiding the dynamic dispatch is probably not worth in this case
considering the long term. However, I wanted to point out an alternative with
[2].
> > It is also the one that makes use of methods to abstract things (vs.
> > fixed parameters), so it is a natural candidate for using virtual
> > methods. I am not a fan of having ever-growing boilerplate match
> > statements for each method that needs to be abstracted, especially
since
> > this is that virtual methods do without requiring extra code, and for
a
> > runtime penalty that is completely negligible in our context and IMHO
> > completely balanced by the smaller binary size that results from their
> > use.
>
> Adding to what Alex said, note that the runtime cost is still there even
without
> using dyn. Because at runtime, the match conditionals need to route
function
> calls to the right place.
Honestly, I don't know how dynamic dispatch scales compared to static
dispatch
with conditionals.

OOC, I briefly looked for a benchmark and found [3], which doesn't look
unreasonable at a first glance.

I modified it real quick to have more than 2 actions. [4]

2 Actions
---------
Dynamic Dispatch: time:   [2.0679 ns 2.0825 ns 2.0945 ns]
 Static Dispatch: time:   [850.29 ps 851.05 ps 852.36 ps]

20 Actions
----------
Dynamic Dispatch: time:   [21.368 ns 21.827 ns 22.284 ns]
 Static Dispatch: time:   [1.3623 ns 1.3703 ns 1.3793 ns]

100 Actions
-----------
Dynamic Dispatch: time:   [103.72 ns 104.33 ns 105.13 ns]
 Static Dispatch: time:   [4.5905 ns 4.6311 ns 4.6775 ns]

Absolutely take it with a grain of salt, I neither spend a lot of brain power
nor time on this, which usually is not a great combination with benchmarking
things. :)

However, I think it's probably not too important here. Hence, feel free to
go
with dynamic dispatch for this.
> I am just not seeing the benefits of not using dyn for
> this use case and only drawbacks. IMHO, we should try to not be doing the
> compiler's job.
> 
> Maybe the only benefit is you don't need an Arc or Kbox wrapper?
That's not a huge concern for me, it's only one single allocation per
Engine,
correct?

[2]
https://play.rust-lang.org/?version=stable&mode=debug&edition=2024&gist=99ce0f12542488f78e35356c99a1e23f
[3] https://github.com/tailcallhq/rust-benchmarks
[4] https://pastebin.com/k0PqtQnq

Joel Fernandes

2025-Apr-30 23:09 UTC

head link

[PATCH 11/16] gpu: nova-core: add falcon register definitions and base code

On 4/30/2025 2:16 PM, Danilo Krummrich wrote:
[...]>>> It is also the one that makes use of methods to abstract things
(vs.
>>> fixed parameters), so it is a natural candidate for using virtual
>>> methods. I am not a fan of having ever-growing boilerplate match
>>> statements for each method that needs to be abstracted, especially
since
>>> this is that virtual methods do without requiring extra code, and
for a
>>> runtime penalty that is completely negligible in our context and
IMHO
>>> completely balanced by the smaller binary size that results from
their
>>> use.
>>
>> Adding to what Alex said, note that the runtime cost is still there
even without
>> using dyn. Because at runtime, the match conditionals need to route
function
>> calls to the right place.
> 
> Honestly, I don't know how dynamic dispatch scales compared to static
dispatch
> with conditionals.
> 
> OOC, I briefly looked for a benchmark and found [3], which doesn't look
> unreasonable at a first glance.
> 
> I modified it real quick to have more than 2 actions. [4]
> 
> 2 Actions
> ---------
> Dynamic Dispatch: time:   [2.0679 ns 2.0825 ns 2.0945 ns]
>  Static Dispatch: time:   [850.29 ps 851.05 ps 852.36 ps]
> 
> 20 Actions
> ----------
> Dynamic Dispatch: time:   [21.368 ns 21.827 ns 22.284 ns]
>  Static Dispatch: time:   [1.3623 ns 1.3703 ns 1.3793 ns]
> 
> 100 Actions
> -----------
> Dynamic Dispatch: time:   [103.72 ns 104.33 ns 105.13 ns]
>  Static Dispatch: time:   [4.5905 ns 4.6311 ns 4.6775 ns]
> 
> Absolutely take it with a grain of salt, I neither spend a lot of brain
power
> nor time on this, which usually is not a great combination with
benchmarking
> things. :)
Interesting, thanks for running the benchmark. I think this could be because of
function inlining during the static dispatch, so maybe at runtime there is no
overhead after all, even with long match statements. Just speculating, I have
not looked at codegen for this or anything.

But as you noted, the overhead still is not that much an issue (unless say the
method in concern is in an extremely hot path).
> 
> However, I think it's probably not too important here. Hence, feel free
to go
> with dynamic dispatch for this.
Ok thanks, sounds good to me. It does seem the code is a lot more readable IMHO
as well, with dyn.
>> I am just not seeing the benefits of not using dyn for
>> this use case and only drawbacks. IMHO, we should try to not be doing
the
>> compiler's job.
>>
>> Maybe the only benefit is you don't need an Arc or Kbox wrapper?
> 
> That's not a huge concern for me, it's only one single allocation
per Engine,
> correct?
Yes, that's right. I was more referring to the fact that static dispatch as
in
your example does not need Arc/Box, however even with Arc/Box IMHO the
readability of the code using dyn is more due to the lack of long match
statements on the access methods.

thanks,

- Joel

Alexandre Courbot

2025-May-01 00:09 UTC

head link

[PATCH 11/16] gpu: nova-core: add falcon register definitions and base code

On Thu May 1, 2025 at 3:16 AM JST, Danilo Krummrich
wrote:> On Wed, Apr 30, 2025 at 10:38:11AM -0400, Joel Fernandes wrote:
>> On 4/30/2025 9:25 AM, Alexandre Courbot wrote:
>> > On Tue Apr 22, 2025 at 11:44 PM JST, Danilo Krummrich wrote:
>> 
>> >>> +/// Returns a boxed falcon HAL adequate for the passed
`chipset`.
>> >>> +///
>> >>> +/// We use this function and a heap-allocated trait
object instead of statically defined trait
>> >>> +/// objects because of the two-dimensional (Chipset,
Engine) lookup required to return the
>> >>> +/// requested HAL.
>> >>
>> >> Do we really need the dynamic dispatch? AFAICS, there's
only E::BASE that is
>> >> relevant to FalconHal impls?
>> >>
>> >> Can't we do something like I do in the following example
[1]?
>> >>
>> >> [1]
https://play.rust-lang.org/?version=stable&mode=debug&edition=2024&gist=bf7035a07e79a4047fb6834eac03a9f2
>> > 
>> > So are you have noticed there are two dimensions from which the
falcons
>> > can be instantiated:
>> > 
>> > - The engine, which determines its register BASE,
>> > - The HAL, which is determined by the chipset.
>> > 
>> > For the engine, I want to keep things static for the main reason
that if
>> > BASE was dynamic, we would have to do all our IO using
>> > try_read()/try_write() and check for an out-of-bounds error at
each
>> > register access. The cost of monomorphization is limited as there
are
>> > only a handful of engines.
>> > 
>> > But the HAL introduces a second dimension to this, and if we
support N
>> > engines then the amount of monomorphized code would then increase
by N
>> > for each new HAL we add. Chipsets are released at a good cadence,
so
>> > this is the dimension that risks growing the most.
>
> I agree, avoiding the dynamic dispatch is probably not worth in this case
> considering the long term. However, I wanted to point out an alternative
with
> [2].
>
>> > It is also the one that makes use of methods to abstract things
(vs.
>> > fixed parameters), so it is a natural candidate for using virtual
>> > methods. I am not a fan of having ever-growing boilerplate match
>> > statements for each method that needs to be abstracted, especially
since
>> > this is that virtual methods do without requiring extra code, and
for a
>> > runtime penalty that is completely negligible in our context and
IMHO
>> > completely balanced by the smaller binary size that results from
their
>> > use.
>>
>> Adding to what Alex said, note that the runtime cost is still there
even without
>> using dyn. Because at runtime, the match conditionals need to route
function
>> calls to the right place.
>
> Honestly, I don't know how dynamic dispatch scales compared to static
dispatch
> with conditionals.
>
> OOC, I briefly looked for a benchmark and found [3], which doesn't look
> unreasonable at a first glance.
>
> I modified it real quick to have more than 2 actions. [4]
>
> 2 Actions
> ---------
> Dynamic Dispatch: time:   [2.0679 ns 2.0825 ns 2.0945 ns]
>  Static Dispatch: time:   [850.29 ps 851.05 ps 852.36 ps]
>
> 20 Actions
> ----------
> Dynamic Dispatch: time:   [21.368 ns 21.827 ns 22.284 ns]
>  Static Dispatch: time:   [1.3623 ns 1.3703 ns 1.3793 ns]
>
> 100 Actions
> -----------
> Dynamic Dispatch: time:   [103.72 ns 104.33 ns 105.13 ns]
>  Static Dispatch: time:   [4.5905 ns 4.6311 ns 4.6775 ns]
>
> Absolutely take it with a grain of salt, I neither spend a lot of brain
power
> nor time on this, which usually is not a great combination with
benchmarking
> things. :)
>
> However, I think it's probably not too important here. Hence, feel free
to go
> with dynamic dispatch for this.
Indeed, it looks like the cost of dispatch will be completely shadowed
by the IO behind it anyway. And these HAL calls are like a few here and
there anyway, it's not like they are on a critical path.
>
>> I am just not seeing the benefits of not using dyn for
>> this use case and only drawbacks. IMHO, we should try to not be doing
the
>> compiler's job.
>> 
>> Maybe the only benefit is you don't need an Arc or Kbox wrapper?
>
> That's not a huge concern for me, it's only one single allocation
per Engine,
> correct?
Correct. Note that for other engines we will be able to store the HALs as
static singletons instead of building them on the heap like I am
currently doing. The reason for doing this on falcon is that the
dual-dimension of the instances makes it more complex to build and look
them up.

... or maybe I could just use a macro? Let me try that and see whether
it works.

Nouveau - May 2025 - [PATCH 11/16] gpu: nova-core: add falcon register definitions and base code

[PATCH 11/16] gpu: nova-core: add falcon register definitions and base code

[PATCH 11/16] gpu: nova-core: add falcon register definitions and base code

[PATCH 11/16] gpu: nova-core: add falcon register definitions and base code