Martin Peres
2017-Nov-13 02:29 UTC
[Nouveau] Addressing the problem of noisy GPUs under Nouveau
Hello,

Some users have been complaining for years about their GPU sounding like a jet engine at take-off. Last year, I finally laid my hands on one of these GPUs and have been trying to fix this issue on and off since then.

After failing to find anything in the HW, I figured out that the duty cycle set by nvidia's proprietary driver was way under the expected value. By randomly changing values in the unknown tables of the vbios, I found out that there is a fan calibration table at offset 0x18 in the BIT P table (version 2).

In this table, I identified 2 major 16-bit parameters at offsets 0xa and 0xc [2]. I named the first one pwm_max and the second one pwm_offset. As expected, these parameters look like a mapping function of the form aX + b. However, after gathering more samples, I found out that the output was not continuous when linearly increasing pwm_offset [1]. Even more funnily, the period of this square function is linear with the frequency used for the fan's PWM.

I tried reverse engineering the formula to describe this function, but failed to find a version that would work perfectly for all PWM frequencies. This is the closest I have got to [3], and I basically stopped there about a year ago because I could not figure it out and got frustrated :s.

I started again on this project 2 weeks ago, with the intent of finding a good-enough solution for nouveau, and modelling the rest of the equation that would allow me to compute what duty I should set for every wanted fan speed (%). I again mostly succeeded... but it would seem that the interpretation of the table depends on the generation of chipset (Tesla behaves one way, Fermi+ behaves another way). Also, the proprietary driver is not consistent for rules such as what to do when the computed duty value is going to be lower than 0 (sometimes it is clamped to 0, sometimes it is set to the same value as the divider, sometimes it is set to a slightly lower value than the divider).

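To make the "aX + b" reading above concrete, here is a minimal sketch. It is not the proprietary driver's formula, which is exactly what remains unknown in this message. Everything beyond the two field offsets (0xa and 0xc) and the three observed underflow behaviours is an assumption made for illustration: the little-endian unsigned decoding, the names read_calibration, duty_for_speed and Underflow, the choice of pwm_max as the slope and pwm_offset as a subtracted intercept, and "divider - 1" standing in for "a slightly lower value than the divider".

```python
import struct
from enum import Enum


class Underflow(Enum):
    """What to do when the computed duty would be below 0.

    The message reports seeing all three behaviours; which one applies
    when is unknown, so the caller has to pick (hypothetical API).
    """
    CLAMP_ZERO = 0
    DIVIDER = 1
    BELOW_DIVIDER = 2


def read_calibration(table):
    """Read the two 16-bit fields at offsets 0xa and 0xc of the table.

    Assumption: both fields are unsigned and little-endian.
    """
    pwm_max, pwm_offset = struct.unpack_from("<HH", table, 0xA)
    return pwm_max, pwm_offset


def duty_for_speed(percent, divider, pwm_max, pwm_offset,
                   underflow=Underflow.CLAMP_ZERO):
    """Map a fan speed in percent to a PWM duty value, assumed linear.

    Assumption: duty = pwm_max * percent / 100 - pwm_offset, capped at
    the divider. The real mapping is known to be discontinuous.
    """
    if not 0 <= percent <= 100:
        raise ValueError("percent must be within 0..100")
    duty = (pwm_max * percent) // 100 - pwm_offset
    if duty < 0:
        if underflow is Underflow.CLAMP_ZERO:
            return 0
        if underflow is Underflow.DIVIDER:
            return divider
        return max(divider - 1, 0)  # illustrative "slightly lower" value
    return min(duty, divider)
```

A purely linear model like this one cannot reproduce the discontinuities visible in [1]; it only shows where the two parameters and the underflow rule would plug in.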
I have been trying to cover all edge cases by generating a randomized set of values for the PWM frequency, pwm_max, and pwm_offset, flashing the vbios, and iterating from 0% to 100% fan speed while dumping the values set by your driver. Using half a million sample points (which took a week to acquire), my model computes 97% of the values correctly (ignoring off-by-ones), while the remaining 3% are worryingly off (by up to 100%)... It is clear that the code is not trivial and is full of branching, which makes clean-room reverse engineering a chore.

As a final attempt to make a somewhat complete solution, I tried this weekend to make a "safe" model that would still make the GPUs quiet. I managed to improve the pass rate from 97 to 99.6%, but the remaining failures conflict with my previous findings, which are also way more prevalent. In the end, the only completely safe way of driving the fan is the current behaviour of nouveau...

At this point, I am ready to throw in the towel and hardcode parameters in nouveau to address the problem of the loudest GPUs, but this is of course suboptimal. This is why I am asking for your help. Would you have some documentation about this fan calibration table that could help me here? Code would be even more appreciated.

Thanks a lot in advance,
Martin

PS: here is most of the code you may want to see: http://fs.mupuf.org/nvidia/fan_calib/

[1] http://fs.mupuf.org/nvidia/fan_calib/pwm_offset.png
[2] https://github.com/envytools/envytools/blob/master/nvbios/power.c#L333
[3] https://github.com/envytools/envytools/blob/master/nvbios/power.c#L298
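The pass-rate figures quoted above (97%, then 99.6%, "ignoring off-by-ones") suggest a comparison along the following lines. This is only a sketch of how such a score could be computed, not the scripts published under fan_calib/; the function names pass_rate and worst_error and the sample values in the usage note are hypothetical.

```python
def pass_rate(observed, predicted, tolerance=1):
    """Fraction of samples where the model is within `tolerance` of the
    duty value dumped from the driver (tolerance=1 ignores off-by-ones)."""
    if len(observed) != len(predicted):
        raise ValueError("sample lists must have the same length")
    if not observed:
        raise ValueError("need at least one sample")
    hits = sum(1 for o, p in zip(observed, predicted)
               if abs(o - p) <= tolerance)
    return hits / len(observed)


def worst_error(observed, predicted):
    """Largest absolute difference between dumped and modelled duty."""
    return max(abs(o - p) for o, p in zip(observed, predicted))
```

For example, with made-up samples observed = [10, 20, 30, 40] and predicted = [10, 21, 30, 90], three of the four values are within one unit, so pass_rate returns 0.75, while worst_error returns 50. Reporting both numbers matters here because a high pass rate can hide a few failures that are far off.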
John Hubbard
2017-Nov-13 03:12 UTC
[Nouveau] Addressing the problem of noisy GPUs under Nouveau
Hi Martin,

This is just a quick ACK. I've started an internal email thread and we'll see if we can get back to you soon.

Yes, our thermal and fan control definitely changes a lot with the various chip architectures. I'm continually impressed by how much the SW+HW has been able to improve performance per watt, year after year, but of course the side effect is a very complex system, as you are seeing. But even so, let's see if there is any sort of simpler approximation that would work for you here... no promises, because I'm about to be humbled when the thermal experts respond. :)

thanks,
John Hubbard
NVIDIA
John Hubbard
2017-Nov-13 07:15 UTC
[Nouveau] Addressing the problem of noisy GPUs under Nouveau
On 11/12/2017 06:29 PM, Martin Peres wrote:
> Hello,
>
> Some users have been complaining for years about their GPU sounding like
> a jet engine at take off. Last year, I finally laid my hands on one of
> these GPUs and have been trying to fix this issue on and off since then.

Some early feedback: can you tell us the exact SKUs you have? And are these production boards with production VBIOSes?

Normally, it's just our bringup boards that we'd expect to be noisy like this, so we're looking for a few more details.

thanks,
John Hubbard
NVIDIA
Martin Peres
2017-Nov-13 09:25 UTC
[Nouveau] Addressing the problem of noisy GPUs under Nouveau
Hello,

On 13/11/17 09:15, John Hubbard wrote:
> Some early feedback: can you tell us the exact SKUs you have? And are these
> production boards with production VBIOSes?
>
> Normally, it's just our bringup boards that we'd expect to be noisy like
> this, so we're looking for a few more details.

Thanks for the quick feedback. We only have access to production hardware with production vbioses, as far as I know.

In any case, I made all my experiments on the following GPU (with a stock vbios, albeit modified to perform the experiment):

NVIDIA Corporation GF108 [GeForce GT 620] (rev a1) (prog-if 00 [VGA controller])
Subsystem: eVga.com. Corp. Device 2625

I pushed my vbios to http://fs.mupuf.org/nvidia/fan_calib/ if this is interesting to you (I doubt it, but if that can save us a round trip, then let's do this :)).

Thanks,
Martin
Andy Ritger
2017-Nov-22 01:29 UTC
[Nouveau] Addressing the problem of noisy GPUs under Nouveau
Hi Martin,

I was asked to clarify a few things:

(1) Are all the user reports of loud fans on Fermi-era GPUs?

(2) When the VBIOS POSTs the card, it loads initial ucode onto the Falcon processor (PMU), which will do basic fan management on its own. We call this init ucode "IFR" (Init From ROM). nvidia.ko will restore the IFR ucode when unloaded. I assume the loud fan symptom occurs after Nouveau is loaded and running, correct? I.e., this is a problem in Nouveau's fan control programming, rather than a problem in IFR.

(3) IFR will run until something else is loaded on the Falcon processor (PMU). On Fermi, I assume the Nouveau kernel driver is uploading the Nouveau-written ucode from here:

drivers/gpu/drm/nouveau/nvkm/subdev/pmu/fuc

correct? I only ask to rule out the possibility that IFR and Nouveau are both attempting to program fans simultaneously. The symptoms you describe don't sound like that, but just double checking...

(4) Given the PMU ucode debacle, I'm embarrassed to ask, but at least on Fermi, how much does Nouveau strictly depend on Nouveau's PMU ucode? Would it be an option to just let IFR continue to manage fans?

(5) Lastly, I was asked how Nouveau determines what fan speed to (attempt to) program.

Thanks,
- Andy