Andy Ritger
2017-Nov-22 01:29 UTC
[Nouveau] Addressing the problem of noisy GPUs under Nouveau
Hi Martin,

I was asked to clarify a few things:

(1) Are all the user reports of loud fans on Fermi-era GPUs?

(2) When the VBIOS POSTs the card, it loads initial ucode onto the Falcon processor (PMU), which will do basic fan management on its own. We call this init ucode "IFR" (Init From ROM). nvidia.ko will restore the IFR ucode when unloaded. I assume the loud fan symptom occurs after Nouveau is loaded and running, correct? I.e., this is a problem in Nouveau's fan control programming, rather than a problem in IFR.

(3) IFR will run until something else is loaded on the Falcon processor (PMU). On Fermi, I assume the Nouveau kernel driver is uploading the Nouveau-written ucode from here:

    drivers/gpu/drm/nouveau/nvkm/subdev/pmu/fuc

correct? I only ask to rule out the possibility that IFR and Nouveau are both attempting to program fans simultaneously. The symptoms you describe don't sound like that, but just double checking...

(4) Given the PMU ucode debacle, I'm embarrassed to ask, but at least on Fermi, how much does Nouveau strictly depend on Nouveau's PMU ucode? Would it be an option to just let IFR continue to manage fans?

(5) Lastly, I was asked how Nouveau determines what fan speed to (attempt to) program.

Thanks,
- Andy

On Sun, Nov 12, 2017 at 11:15:45PM -0800, John Hubbard wrote:
> On 11/12/2017 06:29 PM, Martin Peres wrote:
> > Hello,
> >
> > Some users have been complaining for years about their GPU sounding like
> > a jet engine at take off. Last year, I finally laid my hand on one of
> > these GPUs and have been trying to fix this issue on and off since then.
>
> Some early feedback: can you tell us the exact SKUs you have? And are these
> production boards with production VBIOSes?
>
> Normally, it's just our bringup boards that we'd expect to be noisy like
> this, so we're looking for a few more details.
>
> thanks,
> John Hubbard
> NVIDIA
>
> >
> > After failing to find anything in the HW, I figured out that the duty
> > cycle set by nvidia's proprietary driver would be way under the expected
> > value. By randomly changing values in the unknown tables of the vbios, I
> > found out that there is a fan calibration table at the offset 0x18 in
> > the BIT P table (version 2).
> >
> > In this table, I identified 2 major 16-bit parameters at offset 0xa and
> > 0xc[2]. The first one, I named pwm_max, while naming the latter
> > pwm_offset. As expected, these parameters look like a mapping function
> > of the form aX + b. However, after gathering more samples, I found out
> > that the output was not continuous when linearly increasing pwm_offset
> > [1]. Even more funnily, the period of this square function is linear
> > with the frequency used for the fan's PWM.
> >
> > I tried reverse engineering the formula to describe this function, but
> > failed to find a version that would work perfectly for all PWM
> > frequencies. This is the closest I have got to[3], and I basically stopped
> > there about a year ago because I could not figure it out and got
> > frustrated :s.
> >
> > I started again on this project 2 weeks ago, with the intent of finding
> > a good-enough solution for nouveau, and modelling the rest of the
> > equation that would allow me to compute what duty I should set for
> > every wanted fan speed (%). I again mostly succeeded... but it would
> > seem that the interpretation of the table depends on the generation of
> > chipset (Tesla behaves one way, Fermi+ behaves another way). Also, the
> > proprietary driver is not consistent for rules such as what to do when the
> > computed duty value is going to be lower than 0 or not (sometimes we
> > clamp it to 0, sometimes we set it to the same value as the divider,
> > sometimes we set it to a slightly lower value than the divider).
> >
> > I have been trying to cover all edge cases by generating a randomized
> > set of values for the PWM frequency, pwm_max, and pwm_offset values,
> > flashing the vbios, and iterating from 0% to 100% fan speed while dumping
> > the values set by your driver. Using half a million sample points (which
> > took a week to acquire), my model computes 97% of the values correctly
> > (ignoring off by ones), while the remaining 3% are worryingly off (by up
> > to 100%)... It is clear that the code is not trivial and is full of
> > branching, which makes clean-room reverse engineering a chore.
> >
> > As a final attempt to make a somewhat complete solution, I tried this
> > weekend to make a "safe" model that would still make the GPUs quiet. I
> > managed to improve the pass rate from 97 to 99.6%, but the remaining
> > failures conflict with my previous findings, which are also way more
> > prevalent. In the end, the only completely-safe way of driving the fan
> > is the current behaviour of nouveau...
> >
> > At this point, I am ready to throw in the towel and hardcode parameters
> > in nouveau to address the problem of the loudest GPUs, but this is of
> > course suboptimal. This is why I am asking for your help. Would you have
> > some documentation about this fan calibration table that could help me
> > here? Code would be even more appreciated.
> >
> > Thanks a lot in advance,
> > Martin
> >
> > PS: here is most of the code you may want to see:
> > http://fs.mupuf.org/nvidia/fan_calib/
> >
> > [1] http://fs.mupuf.org/nvidia/fan_calib/pwm_offset.png
> > [2] https://github.com/envytools/envytools/blob/master/nvbios/power.c#L333
> > [3] https://github.com/envytools/envytools/blob/master/nvbios/power.c#L298
> >
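The validation described in the quoted message (comparing a model's predicted duty values against the values the proprietary driver actually set, while ignoring off-by-ones) can be sketched roughly as below. This is only an illustration under stated assumptions: the function names and the sample numbers are hypothetical and are not taken from Martin's scripts.

```python
def pass_rate(predicted, observed, tolerance=1):
    """Fraction of samples where the model's duty value is within
    `tolerance` of the value dumped from the driver.

    tolerance=1 mirrors the "ignoring off by ones" rule from the email;
    everything else here is an assumption made for illustration.
    """
    if len(predicted) != len(observed):
        raise ValueError("sample lists must have the same length")
    if not predicted:
        raise ValueError("no samples to compare")
    hits = sum(1 for p, o in zip(predicted, observed) if abs(p - o) <= tolerance)
    return hits / len(predicted)


def worst_error(predicted, observed):
    """Largest absolute difference between model and driver values."""
    return max(abs(p - o) for p, o in zip(predicted, observed))


# Hypothetical sample: four duty values from the model vs. the driver.
model_values = [10, 20, 30, 40]
driver_values = [10, 21, 28, 40]
rate = pass_rate(model_values, driver_values)
worst = worst_error(model_values, driver_values)
```

Reporting the worst error next to the pass rate matters here, because the email's concern is less the 3% of mismatches than how far off they are.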
Ilia Mirkin
2017-Nov-22 02:06 UTC
[Nouveau] Addressing the problem of noisy GPUs under Nouveau
On Tue, Nov 21, 2017 at 8:29 PM, Andy Ritger <aritger at nvidia.com> wrote:
> Hi Martin,

Martin should have complete answers,

>
> I was asked to clarify a few things:
>
> (1) Are all the user reports of loud fans on Fermi-era GPUs?

Yes. Although I believe some GK208 users are also having trouble, including yours truly. (It's been quite a while since I've checked though... my memory is weak in that regard.)

>
> (2) When the VBIOS POSTs the card, it loads initial ucode onto the Falcon
> processor (PMU), which will do basic fan management on its own. We call this
> init ucode "IFR" (Init From ROM). nvidia.ko will restore the IFR ucode when
> unloaded. I assume the loud fan symptom occurs after Nouveau is loaded and
> running, correct? I.e., this is a problem in Nouveau's fan control
> programming, rather than a problem in IFR.

Correct.

>
> (3) IFR will run until something else is loaded on the Falcon processor (PMU).
> On Fermi, I assume the Nouveau kernel driver is uploading the Nouveau-written
> ucode from here:
>
> drivers/gpu/drm/nouveau/nvkm/subdev/pmu/fuc
>
> correct? I only ask to rule out the possibility that IFR and Nouveau are both
> attempting to program fans simultaneously. The symptoms you describe don't
> sound like that, but just double checking...

Correct.

>
> (4) Given the PMU ucode debacle, I'm embarrassed to ask, but at least on Fermi,
> how much does Nouveau strictly depend on Nouveau's PMU ucode? Would it be an
> option to just let IFR continue to manage fans?

Reclocking is still on our horizon, which clearly won't happen without nouveau PMU code loaded. Not sure what it's used for until reclocking becomes a thing on Fermi.

>
> (5) Lastly, I was asked how Nouveau determines what fan speed to (attempt
> to) program.

I'll let Martin answer this, but as you're probably aware, there's 2 different ways this can be done - there might be a PWM, we might have to toggle it manually. Maybe something else still.

Have a look at drm/nouveau/nvkm/subdev/therm/fan.c and the various bits it ends up calling (pre-GF119 fermi's end up with the nv50 fan_set, I believe).

The bios stuff is parsed in nvkm/subdev/bios/fan.c and therm.c, although I believe Martin's latest analysis is more advanced than what's in that code.

Martin's question was very long, but it boils down to this:

How do we compute the correct values to write into the e114/e118 pwm registers based on the VBIOS contents and current state of the board (like temperature).

We generally do this right, but appear to get it extra-wrong for certain GPUs.

Cheers,

-ilia
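To make the question above concrete, here is a minimal sketch of the naive mapping from a wanted fan speed (%) to a duty value for a PWM divider/duty register pair. It is an assumption-laden illustration, not Nouveau's actual code: the function name, the round-up choice, and the `inverted` flag are hypothetical, and it deliberately ignores the fan calibration table (pwm_max / pwm_offset) whose interpretation is the open problem in this thread.

```python
def percent_to_duty(percent, divider, inverted=False):
    """Naive fan speed (%) -> PWM duty value.

    `divider` is the PWM period in clock ticks (the value written next
    to the duty). Rounding up and the optional inversion are assumptions
    made for this sketch; no VBIOS calibration is applied.
    """
    if divider <= 0:
        raise ValueError("divider must be positive")
    percent = max(0, min(100, percent))      # clamp to a valid percentage
    duty = (divider * percent + 99) // 100   # integer math, rounded up
    if inverted:
        duty = divider - duty                # some fans use inverted PWM
    return duty


# Hypothetical usage: 30% fan speed with a period of 1000 ticks.
duty_value = percent_to_duty(30, 1000)
```

A purely proportional mapping like this is what one would expect to be "generally right"; the calibration table discussed in the thread is what would change the result on the affected boards.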
Karol Herbst
2017-Nov-22 03:55 UTC
[Nouveau] Addressing the problem of noisy GPUs under Nouveau
On Wed, Nov 22, 2017 at 3:06 AM, Ilia Mirkin <imirkin at alum.mit.edu> wrote:
> On Tue, Nov 21, 2017 at 8:29 PM, Andy Ritger <aritger at nvidia.com> wrote:
>> Hi Martin,
>
> Martin should have complete answers,
>
>>
>> I was asked to clarify a few things:
>>
>> (1) Are all the user reports of loud fans on Fermi-era GPUs?
>
> Yes. Although I believe some GK208 users are also having trouble,
> including yours truly. (It's been quite a while since I've checked
> though... my memory is weak in that regard.)
>

I think there are some Keplers where we drive the fans too loud? Maybe it got fixed, but I am sure some users complained about this on Kepler GPUs.

>>
>> (2) When the VBIOS POSTs the card, it loads initial ucode onto the Falcon
>> processor (PMU), which will do basic fan management on its own. We call this
>> init ucode "IFR" (Init From ROM). nvidia.ko will restore the IFR ucode when
>> unloaded. I assume the loud fan symptom occurs after Nouveau is loaded and
>> running, correct? I.e., this is a problem in Nouveau's fan control
>> programming, rather than a problem in IFR.
>
> Correct.
>
>>
>> (3) IFR will run until something else is loaded on the Falcon processor (PMU).
>> On Fermi, I assume the Nouveau kernel driver is uploading the Nouveau-written
>> ucode from here:
>>
>> drivers/gpu/drm/nouveau/nvkm/subdev/pmu/fuc
>>
>> correct? I only ask to rule out the possibility that IFR and Nouveau are both
>> attempting to program fans simultaneously. The symptoms you describe don't
>> sound like that, but just double checking...
>
> Correct.
>
>>
>> (4) Given the PMU ucode debacle, I'm embarrassed to ask, but at least on Fermi,
>> how much does Nouveau strictly depend on Nouveau's PMU ucode? Would it be an
>> option to just let IFR continue to manage fans?
>
> Reclocking is still on our horizon, which clearly won't happen without
> nouveau PMU code loaded. Not sure what it's used for until reclocking
> becomes a thing on Fermi.
>

Well, I plan to use the PMU for the PMU counters readout code. Not that it matters much on Fermi...

>>
>> (5) Lastly, I was asked how Nouveau determines what fan speed to (attempt
>> to) program.
>
> I'll let Martin answer this, but as you're probably aware, there's 2
> different ways this can be done - there might be a PWM, we might have
> to toggle it manually. Maybe something else still.
>
> Have a look at drm/nouveau/nvkm/subdev/therm/fan.c and the various
> bits it ends up calling (pre-GF119 fermi's end up with the nv50
> fan_set, I believe).
>
> The bios stuff is parsed in nvkm/subdev/bios/fan.c and therm.c,
> although I believe Martin's latest analysis is more advanced than
> what's in that code.
>
> Martin's question was very long, but it boils down to this:
>
> How do we compute the correct values to write into the e114/e118 pwm
> registers based on the VBIOS contents and current state of the board
> (like temperature).
>
> We generally do this right, but appear to get it extra-wrong for certain GPUs.
>

Well, the short answer is: Nouveau parses the vbios and sees what it has to do. Apparently it is wrong in some cases. I don't think there is anything else Nouveau tries to do, like having its own curves for calculating fan speeds.

> Cheers,
>
> -ilia
> _______________________________________________
> Nouveau mailing list
> Nouveau at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/nouveau
Martin Peres
2017-Nov-23 01:07 UTC
[Nouveau] Addressing the problem of noisy GPUs under Nouveau
Hey,

Thanks for your answer, Andy!

On 22/11/17 04:06, Ilia Mirkin wrote:
> On Tue, Nov 21, 2017 at 8:29 PM, Andy Ritger <aritger at nvidia.com> wrote:
>> Hi Martin,
>
> Martin should have complete answers,
>
>>
>> I was asked to clarify a few things:
>>
>> (1) Are all the user reports of loud fans on Fermi-era GPUs?
>
> Yes. Although I believe some GK208 users are also having trouble,
> including yours truly. (It's been quite a while since I've checked
> though... my memory is weak in that regard.)

We did not hear back from a lot of users about these issues, but I can see that most GF108 vbioses in our vbios repo are problematic, and some GK106/GT215/GT216/GT218 might be too.

>
>>
>> (2) When the VBIOS POSTs the card, it loads initial ucode onto the Falcon
>> processor (PMU), which will do basic fan management on its own. We call this
>> init ucode "IFR" (Init From ROM). nvidia.ko will restore the IFR ucode when
>> unloaded. I assume the loud fan symptom occurs after Nouveau is loaded and
>> running, correct? I.e., this is a problem in Nouveau's fan control
>> programming, rather than a problem in IFR.
>
> Correct.

Indeed.

>
>>
>> (3) IFR will run until something else is loaded on the Falcon processor (PMU).
>> On Fermi, I assume the Nouveau kernel driver is uploading the Nouveau-written
>> ucode from here:
>>
>> drivers/gpu/drm/nouveau/nvkm/subdev/pmu/fuc
>>
>> correct? I only ask to rule out the possibility that IFR and Nouveau are both
>> attempting to program fans simultaneously. The symptoms you describe don't
>> sound like that, but just double checking...
>
> Correct.

Indeed.

>
>>
>> (4) Given the PMU ucode debacle, I'm embarrassed to ask, but at least on Fermi,
>> how much does Nouveau strictly depend on Nouveau's PMU ucode? Would it be an
>> option to just let IFR continue to manage fans?
>
> Reclocking is still on our horizon, which clearly won't happen without
> nouveau PMU code loaded.
> Not sure what it's used for until reclocking
> becomes a thing on Fermi.

Yeah, this would hinder our reclocking efforts :s

The best idea I can come up with is to fake the temperature (register 0x20408) to 1°C (the minimum the hardware allows us) and read the PWM duty, then we can get the maximum duty by setting the temperature to the fan_boost threshold. Not sure we have a sure way of computing the fan_boost threshold though; maybe we can just use the thermal throttling threshold for this (more on this later in the email).

In any case, all of these solutions are workarounds. Given that the code to compute these values is already found in vbioses, why is it a problem to share the meaning of all the values in the fan calibration table, and/or the algorithm?

>
>>
>> (5) Lastly, I was asked how Nouveau determines what fan speed to (attempt
>> to) program.

Oh, thanks for giving me an idea about what the other values in this table may be about :D

Anyways, the current code uses the entry id 0x46 of the thermal table (BIT P, offset 0x10) to find the thermal points for $fan_min and $fan_max. The $fan_min and $fan_max values are found in the entry id 0x22 of the same table. If the 0x46 entry is not present in the thermal table (which seems to be the norm for Fermi), we revert to default values: 40 -> 85°C.

With these 4 values, we get 2 trip points, (temp_min, fan_min) and (temp_max, fan_max), and we merely do linear interpolation between them.

>
> I'll let Martin answer this, but as you're probably aware, there's 2
> different ways this can be done - there might be a PWM, we might have
> to toggle it manually. Maybe something else still.

The manual toggle fans are only present on pre-Tesla GPUs; let's ignore them here, because we know what to do there.
All recent (2006+) GPUs use PWM, and anything after the GT215 uses this fan calibration table, which took me a while to find and is still mostly a mystery to me :s

>
> Have a look at drm/nouveau/nvkm/subdev/therm/fan.c and the various
> bits it ends up calling (pre-GF119 fermi's end up with the nv50
> fan_set, I believe).
>
> The bios stuff is parsed in nvkm/subdev/bios/fan.c and therm.c,
> although I believe Martin's latest analysis is more advanced than
> what's in that code.

Absolutely :) I have not updated Nouveau yet, for fear of setting a value lower than what the proprietary driver does...

>
> Martin's question was very long, but it boils down to this:
>
> How do we compute the correct values to write into the e114/e118 pwm
> registers based on the VBIOS contents and current state of the board
> (like temperature).

Unfortunately, it can also be the e11c/e120 couple, or 0x200d8/dc on GF119+, or 0x200cd/d0 on Kepler+. At least, it looks like we know which PWM controller we need to drive, so I did not want to muddy the water even more by giving register addresses, rather concentrating on the problem at hand: how to compute the duty value for the PWM controller.

>
> We generally do this right, but appear to get it extra-wrong for certain GPUs.

Yes... So far, we are always safe, but users tend to mind when their computer sounds like a jumbo jet at take off... Who would have thought? :D

Anyway, looking forward to your answer!

Cheers,
Martin
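
The answer to question (5) above describes two trip points, (temp_min, fan_min) and (temp_max, fan_max), with linear interpolation between them and a 40 -> 85°C default range when the VBIOS lacks the 0x46 entry. A minimal sketch of that idea follows. It is not Nouveau's code: the function name is hypothetical, the fan_min/fan_max figures in the usage line are made up (on real hardware they would come from the VBIOS), and clamping outside the two trip points is an assumption.

```python
def fan_target(temp_c, fan_min, fan_max, temp_min=40, temp_max=85):
    """Linear interpolation of fan speed (%) between two trip points.

    The 40/85 degree defaults come from the email; clamping below
    temp_min and above temp_max is an assumption of this sketch.
    """
    if temp_max <= temp_min:
        raise ValueError("temp_max must be greater than temp_min")
    if temp_c <= temp_min:
        return fan_min
    if temp_c >= temp_max:
        return fan_max
    span = temp_max - temp_min
    # Integer interpolation between (temp_min, fan_min) and (temp_max, fan_max).
    return fan_min + (fan_max - fan_min) * (temp_c - temp_min) // span


# Hypothetical usage: fan limits of 30% and 100%, GPU at 62 degrees C.
target = fan_target(62, fan_min=30, fan_max=100)
```

The resulting percentage is only the wanted fan speed; turning it into the duty value actually written to the PWM controller is the step the rest of this thread is about.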