Danilo Krummrich
2025-Apr-30 18:16 UTC
[PATCH 11/16] gpu: nova-core: add falcon register definitions and base code
On Wed, Apr 30, 2025 at 10:38:11AM -0400, Joel Fernandes wrote:> On 4/30/2025 9:25 AM, Alexandre Courbot wrote: > > On Tue Apr 22, 2025 at 11:44 PM JST, Danilo Krummrich wrote: > > >>> +/// Returns a boxed falcon HAL adequate for the passed `chipset`. > >>> +/// > >>> +/// We use this function and a heap-allocated trait object instead of statically defined trait > >>> +/// objects because of the two-dimensional (Chipset, Engine) lookup required to return the > >>> +/// requested HAL. > >> > >> Do we really need the dynamic dispatch? AFAICS, there's only E::BASE that is > >> relevant to FalconHal impls? > >> > >> Can't we do something like I do in the following example [1]? > >> > >> [1] https://play.rust-lang.org/?version=stable&mode=debug&edition=2024&gist=bf7035a07e79a4047fb6834eac03a9f2 > > > > So are you have noticed there are two dimensions from which the falcons > > can be instantiated: > > > > - The engine, which determines its register BASE, > > - The HAL, which is determined by the chipset. > > > > For the engine, I want to keep things static for the main reason that if > > BASE was dynamic, we would have to do all our IO using > > try_read()/try_write() and check for an out-of-bounds error at each > > register access. The cost of monomorphization is limited as there are > > only a handful of engines. > > > > But the HAL introduces a second dimension to this, and if we support N > > engines then the amount of monomorphized code would then increase by N > > for each new HAL we add. Chipsets are released at a good cadence, so > > this is the dimension that risks growing the most.I agree, avoiding the dynamic dispatch is probably not worth in this case considering the long term. However, I wanted to point out an alternative with [2].> > It is also the one that makes use of methods to abstract things (vs. > > fixed parameters), so it is a natural candidate for using virtual > > methods. I am not a fan of having ever-growing boilerplate match > > statements for each method that needs to be abstracted, especially since > > this is that virtual methods do without requiring extra code, and for a > > runtime penalty that is completely negligible in our context and IMHO > > completely balanced by the smaller binary size that results from their > > use. > > Adding to what Alex said, note that the runtime cost is still there even without > using dyn. Because at runtime, the match conditionals need to route function > calls to the right place.Honestly, I don't know how dynamic dispatch scales compared to static dispatch with conditionals. OOC, I briefly looked for a benchmark and found [3], which doesn't look unreasonable at a first glance. I modified it real quick to have more than 2 actions. [4] 2 Actions --------- Dynamic Dispatch: time: [2.0679 ns 2.0825 ns 2.0945 ns] Static Dispatch: time: [850.29 ps 851.05 ps 852.36 ps] 20 Actions ---------- Dynamic Dispatch: time: [21.368 ns 21.827 ns 22.284 ns] Static Dispatch: time: [1.3623 ns 1.3703 ns 1.3793 ns] 100 Actions ----------- Dynamic Dispatch: time: [103.72 ns 104.33 ns 105.13 ns] Static Dispatch: time: [4.5905 ns 4.6311 ns 4.6775 ns] Absolutely take it with a grain of salt, I neither spend a lot of brain power nor time on this, which usually is not a great combination with benchmarking things. :) However, I think it's probably not too important here. Hence, feel free to go with dynamic dispatch for this.> I am just not seeing the benefits of not using dyn for > this use case and only drawbacks. IMHO, we should try to not be doing the > compiler's job. > > Maybe the only benefit is you don't need an Arc or Kbox wrapper?That's not a huge concern for me, it's only one single allocation per Engine, correct? [2] https://play.rust-lang.org/?version=stable&mode=debug&edition=2024&gist=99ce0f12542488f78e35356c99a1e23f [3] https://github.com/tailcallhq/rust-benchmarks [4] https://pastebin.com/k0PqtQnq
Joel Fernandes
2025-Apr-30 23:09 UTC
[PATCH 11/16] gpu: nova-core: add falcon register definitions and base code
On 4/30/2025 2:16 PM, Danilo Krummrich wrote: [...]>>> It is also the one that makes use of methods to abstract things (vs. >>> fixed parameters), so it is a natural candidate for using virtual >>> methods. I am not a fan of having ever-growing boilerplate match >>> statements for each method that needs to be abstracted, especially since >>> this is that virtual methods do without requiring extra code, and for a >>> runtime penalty that is completely negligible in our context and IMHO >>> completely balanced by the smaller binary size that results from their >>> use. >> >> Adding to what Alex said, note that the runtime cost is still there even without >> using dyn. Because at runtime, the match conditionals need to route function >> calls to the right place. > > Honestly, I don't know how dynamic dispatch scales compared to static dispatch > with conditionals. > > OOC, I briefly looked for a benchmark and found [3], which doesn't look > unreasonable at a first glance. > > I modified it real quick to have more than 2 actions. [4] > > 2 Actions > --------- > Dynamic Dispatch: time: [2.0679 ns 2.0825 ns 2.0945 ns] > Static Dispatch: time: [850.29 ps 851.05 ps 852.36 ps] > > 20 Actions > ---------- > Dynamic Dispatch: time: [21.368 ns 21.827 ns 22.284 ns] > Static Dispatch: time: [1.3623 ns 1.3703 ns 1.3793 ns] > > 100 Actions > ----------- > Dynamic Dispatch: time: [103.72 ns 104.33 ns 105.13 ns] > Static Dispatch: time: [4.5905 ns 4.6311 ns 4.6775 ns] > > Absolutely take it with a grain of salt, I neither spend a lot of brain power > nor time on this, which usually is not a great combination with benchmarking > things. :)Interesting, thanks for running the benchmark. I think this could be because of function inlining during the static dispatch, so maybe at runtime there is no overhead after all, even with long match statements. Just speculating, I have not looked at codegen for this or anything. But as you noted, the overhead still is not that much an issue (unless say the method in concern is in an extremely hot path).> > However, I think it's probably not too important here. Hence, feel free to go > with dynamic dispatch for this.Ok thanks, sounds good to me. It does seem the code is a lot more readable IMHO as well, with dyn.>> I am just not seeing the benefits of not using dyn for >> this use case and only drawbacks. IMHO, we should try to not be doing the >> compiler's job. >> >> Maybe the only benefit is you don't need an Arc or Kbox wrapper? > > That's not a huge concern for me, it's only one single allocation per Engine, > correct?Yes, that's right. I was more referring to the fact that static dispatch as in your example does not need Arc/Box, however even with Arc/Box IMHO the readability of the code using dyn is more due to the lack of long match statements on the access methods. thanks, - Joel
Alexandre Courbot
2025-May-01 00:09 UTC
[PATCH 11/16] gpu: nova-core: add falcon register definitions and base code
On Thu May 1, 2025 at 3:16 AM JST, Danilo Krummrich wrote:> On Wed, Apr 30, 2025 at 10:38:11AM -0400, Joel Fernandes wrote: >> On 4/30/2025 9:25 AM, Alexandre Courbot wrote: >> > On Tue Apr 22, 2025 at 11:44 PM JST, Danilo Krummrich wrote: >> >> >>> +/// Returns a boxed falcon HAL adequate for the passed `chipset`. >> >>> +/// >> >>> +/// We use this function and a heap-allocated trait object instead of statically defined trait >> >>> +/// objects because of the two-dimensional (Chipset, Engine) lookup required to return the >> >>> +/// requested HAL. >> >> >> >> Do we really need the dynamic dispatch? AFAICS, there's only E::BASE that is >> >> relevant to FalconHal impls? >> >> >> >> Can't we do something like I do in the following example [1]? >> >> >> >> [1] https://play.rust-lang.org/?version=stable&mode=debug&edition=2024&gist=bf7035a07e79a4047fb6834eac03a9f2 >> > >> > So are you have noticed there are two dimensions from which the falcons >> > can be instantiated: >> > >> > - The engine, which determines its register BASE, >> > - The HAL, which is determined by the chipset. >> > >> > For the engine, I want to keep things static for the main reason that if >> > BASE was dynamic, we would have to do all our IO using >> > try_read()/try_write() and check for an out-of-bounds error at each >> > register access. The cost of monomorphization is limited as there are >> > only a handful of engines. >> > >> > But the HAL introduces a second dimension to this, and if we support N >> > engines then the amount of monomorphized code would then increase by N >> > for each new HAL we add. Chipsets are released at a good cadence, so >> > this is the dimension that risks growing the most. > > I agree, avoiding the dynamic dispatch is probably not worth in this case > considering the long term. However, I wanted to point out an alternative with > [2]. > >> > It is also the one that makes use of methods to abstract things (vs. >> > fixed parameters), so it is a natural candidate for using virtual >> > methods. I am not a fan of having ever-growing boilerplate match >> > statements for each method that needs to be abstracted, especially since >> > this is that virtual methods do without requiring extra code, and for a >> > runtime penalty that is completely negligible in our context and IMHO >> > completely balanced by the smaller binary size that results from their >> > use. >> >> Adding to what Alex said, note that the runtime cost is still there even without >> using dyn. Because at runtime, the match conditionals need to route function >> calls to the right place. > > Honestly, I don't know how dynamic dispatch scales compared to static dispatch > with conditionals. > > OOC, I briefly looked for a benchmark and found [3], which doesn't look > unreasonable at a first glance. > > I modified it real quick to have more than 2 actions. [4] > > 2 Actions > --------- > Dynamic Dispatch: time: [2.0679 ns 2.0825 ns 2.0945 ns] > Static Dispatch: time: [850.29 ps 851.05 ps 852.36 ps] > > 20 Actions > ---------- > Dynamic Dispatch: time: [21.368 ns 21.827 ns 22.284 ns] > Static Dispatch: time: [1.3623 ns 1.3703 ns 1.3793 ns] > > 100 Actions > ----------- > Dynamic Dispatch: time: [103.72 ns 104.33 ns 105.13 ns] > Static Dispatch: time: [4.5905 ns 4.6311 ns 4.6775 ns] > > Absolutely take it with a grain of salt, I neither spend a lot of brain power > nor time on this, which usually is not a great combination with benchmarking > things. :) > > However, I think it's probably not too important here. Hence, feel free to go > with dynamic dispatch for this.Indeed, it looks like the cost of dispatch will be completely shadowed by the IO behind it anyway. And these HAL calls are like a few here and there anyway, it's not like they are on a critical path.> >> I am just not seeing the benefits of not using dyn for >> this use case and only drawbacks. IMHO, we should try to not be doing the >> compiler's job. >> >> Maybe the only benefit is you don't need an Arc or Kbox wrapper? > > That's not a huge concern for me, it's only one single allocation per Engine, > correct?Correct. Note that for other engines we will be able to store the HALs as static singletons instead of building them on the heap like I am currently doing. The reason for doing this on falcon is that the dual-dimension of the instances makes it more complex to build and look them up. ... or maybe I could just use a macro? Let me try that and see whether it works.