Bug 100121

Summary: Dell Studio XPS 8100 - CPU fan speed going up and down
Product: Drivers Reporter: Jan C Peters (jcpeters89)
Component: Hardware Monitoring Assignee: Pali Rohár (pali)
Status: RESOLVED CODE_FIX    
Severity: low CC: aaron.lu, adeptg, dakotajaywhipple, fedora, forums, jcpeters89, jdelvare, lenb, leon, linuxtom68, rickfharris, rjw, the.ant
Priority: P1    
Hardware: Intel   
OS: Linux   
See Also: https://bugzilla.redhat.com/show_bug.cgi?id=1225869
Kernel Version: at least 4.0.4 and 4.0.5, probably all 4.*.* Subsystem:
Regression: Yes Bisected commit-id:
Attachments: DMI information of an affected machine
0001-dell-smm-hwmon-Blacklist-Dell-Studio-XPS-8100.patch
Blacklist Studio 8000

Description Jan C Peters 2015-06-18 16:46:13 UTC
Overview
========

After installing the first 4.*.* kernel, the fan speed of my CPU keeps going up and down. It stays at ~400 RPM for about 10-15 seconds, then peaks at ~3000 RPM for about one second, then goes down again, and repeats that indefinitely. This happens for no apparent reason; the CPU is more or less idle. CPU temperature seems to be stable at around 40° C.


Steps to Reproduce
==================

Install kernel 4.0.4-202.fc21.x86_64 on a Fedora 21 machine with an Intel Core i7 870 processor. Or run a live DVD with Fedora 22. This bug occurs 100% of the time.


Expected Results
================

Stable low fan speed.


Build / System
==============

Dist: Fedora 21
CPU: Intel Core i7 870

I tested with the following kernel builds, all showed the behavior described above:

4.0.4-201.fc21.x86_64
4.0.4-202.fc21.x86_64
4.0.5-200.vanilla.stable.knurd.1.fc21.x86_64

The first two are from the official fedora repos, the third is from the kernel-vanilla-stable repo (see https://fedoraproject.org/wiki/Kernel_Vanilla_Repositories). So I suspect that the fedora patches are not causing this bug.


Additional Information
======================

This bug was originally reported on the Fedora bugtracker (https://bugzilla.redhat.com/show_bug.cgi?id=1225869), there are two other people who have the same problem in the following configuration:

Fedora 22, kernel 4.0.4-301.fc22.x86_64, Intel Core i7 860

With the prior Fedora 21 kernels (3.19.7-200.fc21 and older) the fan behaved as expected, stable and low speed.


Output of $ sensors
-------------------

i8k-virtual-0
Adapter: Virtual device
Processor Fan:   3269 RPM
Motherboard Fan:  807 RPM

coretemp-isa-0000
Adapter: ISA adapter
Core 0:       +44.0°C  (high = +83.0°C, crit = +99.0°C)
Core 1:       +44.0°C  (high = +83.0°C, crit = +99.0°C)
Core 2:       +46.0°C  (high = +83.0°C, crit = +99.0°C)
Core 3:       +47.0°C  (high = +83.0°C, crit = +99.0°C)

it8720-isa-0a10
Adapter: ISA adapter
in0:          +0.86 V  (min =  +0.00 V, max =  +4.08 V)  ALARM
in1:          +3.04 V  (min =  +0.00 V, max =  +4.08 V)  ALARM
in2:          +3.36 V  (min =  +0.00 V, max =  +4.08 V)  ALARM
+5V:          +3.02 V  (min =  +0.00 V, max =  +4.08 V)  ALARM
in4:          +3.02 V  (min =  +0.00 V, max =  +4.08 V)  ALARM
in5:          +2.16 V  (min =  +0.00 V, max =  +4.08 V)  ALARM
in6:          +2.16 V  (min =  +0.00 V, max =  +4.08 V)  ALARM
5VSB:         +2.98 V  (min =  +0.00 V, max =  +4.08 V)  ALARM
Vbat:         +2.99 V  
fan1:        3245 RPM  (min =    0 RPM)
fan2:         801 RPM  (min =    0 RPM)
temp1:       +127.0°C  (low  =  -1.0°C, high = +127.0°C)  ALARM  sensor = thermal diode
temp2:        +22.0°C  (low  =  -1.0°C, high = +127.0°C)  ALARM  sensor = thermistor
temp3:        -53.0°C  (low  =  -1.0°C, high = +127.0°C)  sensor = Intel PECI
cpu0_vid:    +0.000 V
intrusion0:  ALARM

-----------------------
and a second later

i8k-virtual-0
Adapter: Virtual device
Processor Fan:    445 RPM
Motherboard Fan:  818 RPM

coretemp-isa-0000
Adapter: ISA adapter
Core 0:       +41.0°C  (high = +83.0°C, crit = +99.0°C)
Core 1:       +44.0°C  (high = +83.0°C, crit = +99.0°C)
Core 2:       +46.0°C  (high = +83.0°C, crit = +99.0°C)
Core 3:       +44.0°C  (high = +83.0°C, crit = +99.0°C)

it8720-isa-0a10
Adapter: ISA adapter
in0:          +1.23 V  (min =  +0.00 V, max =  +4.08 V)
in1:          +3.04 V  (min =  +0.00 V, max =  +4.08 V)
in2:          +3.36 V  (min =  +0.00 V, max =  +4.08 V)
+5V:          +3.02 V  (min =  +0.00 V, max =  +4.08 V)
in4:          +3.02 V  (min =  +0.00 V, max =  +4.08 V)
in5:          +2.16 V  (min =  +0.00 V, max =  +4.08 V)
in6:          +2.16 V  (min =  +0.00 V, max =  +4.08 V)
5VSB:         +2.98 V  (min =  +0.00 V, max =  +4.08 V)
Vbat:         +2.99 V  
fan1:         441 RPM  (min =    0 RPM)
fan2:         812 RPM  (min =    0 RPM)
temp1:       +127.0°C  (low  =  -1.0°C, high = +127.0°C)  sensor = thermal diode
temp2:        +22.0°C  (low  =  -1.0°C, high = +127.0°C)  sensor = thermistor
temp3:        -53.0°C  (low  =  -1.0°C, high = +127.0°C)  sensor = Intel PECI
cpu0_vid:    +0.000 V
intrusion0:  ALARM
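
The two readings above are easier to compare if only the fan value is logged. A minimal sketch, assuming the sensors output keeps the "Processor Fan:" label shown above; the function name fan_rpm is made up for this illustration:

```shell
# Print the RPM value of the "Processor Fan" line from `sensors` output on stdin.
# The label is the one shown in the output above; other machines may use another.
fan_rpm() {
    awk '/^Processor Fan:/ { print $3 }'
}
```

Running "sensors | fan_rpm" about once per second should show the ~400 RPM baseline with short spikes to ~3000 RPM. Note that such polling goes through the same i8k driver, so it may itself influence the behavior.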


Output of $ cat /proc/cpuinfo | grep "model name"
-------------------------------------------------

model name	: Intel(R) Core(TM) i7 CPU         870  @ 2.93GHz
model name	: Intel(R) Core(TM) i7 CPU         870  @ 2.93GHz
model name	: Intel(R) Core(TM) i7 CPU         870  @ 2.93GHz
model name	: Intel(R) Core(TM) i7 CPU         870  @ 2.93GHz
model name	: Intel(R) Core(TM) i7 CPU         870  @ 2.93GHz
model name	: Intel(R) Core(TM) i7 CPU         870  @ 2.93GHz
model name	: Intel(R) Core(TM) i7 CPU         870  @ 2.93GHz
model name	: Intel(R) Core(TM) i7 CPU         870  @ 2.93GHz
Comment 1 Roman Spirgi 2015-06-18 19:50:49 UTC
Yes, indeed it seems to be an issue on other distributions as well:
http://forum.antergos.com/topic/2133/problems-with-fan-after-4-0-kernel-update

I do have the same issue. My CPU model name:
model name	: Intel(R) Core(TM) i7 CPU         860  @ 2.80GHz
...
Comment 2 Jan C Peters 2015-06-19 09:24:44 UTC
Found another reference on the arch bugtracker which seems to reference a similar bug (maybe the same, there is not much information given): https://bugs.archlinux.org/task/44819
Comment 3 Roman Spirgi 2015-06-30 20:18:37 UTC
Issue seen on DELL Studio XPS 8100 models so far (3 confirms, see https://bugzilla.redhat.com/show_bug.cgi?id=1225869).

Question: What did change since kernel version 3.19.x? Obviously there must be commits related to CPU fan speed ?!
Comment 4 Roman Spirgi 2015-07-03 22:03:11 UTC
Problem persists with kernel 4.0.6-200.fc21.x86_64.
Comment 5 Aaron Lu 2015-07-08 05:06:55 UTC
Jan,

Does v3.19 work for you? If so, then this is a regression.
Comment 6 Jan C Peters 2015-07-08 11:19:16 UTC
Indeed v3.19 worked perfectly. I changed it to regression.
Comment 7 Aaron Lu 2015-07-09 01:37:43 UTC
Then please see if v4.1 and/or latest v4.2-rc fixes this issue, if not, please bisect between v3.19 and v4.0 to find the offending commit, thanks!
Comment 8 Roman Spirgi 2015-07-11 21:36:45 UTC
Neither kernel version 4.1. nor 4.2.-rc1 fixes this issue. I did test with the following versions:
kernel-4.1.0-1.fc23.x86_64
kernel-4.2.0-0.rc1.git3.1.fc23.x86_64

Latest 3.19 kernel works fine instead (kernel-3.19.8-100.fc20.x86_64). But issue is there in the earliest Fedora build of the kernel 4 series (kernel-4.0.3-200.fc21.x86_64).

I'm afraid I don't have in-depth knowledge of the kernel so how to bisect to find the offending commit?

Thanks,
Roman
Comment 9 Jan C Peters 2015-07-12 11:09:50 UTC
Aaron probably refers to the bisect subcommand of git which lets you perform a binary search through the sequence of commits between a known working state of the source tree ("good") and a state that definitely contains the bug ("bad") to efficiently find the commit that introduced it. Every search step corresponds to building the binary (here: the kernel and the modules) and investigating if the bug is present (here: rebooting with the freshly built version of the kernel).

I am going to try to find the offending commit this way, but it may take some time. I will report any findings here.
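
As a rough sketch of that workflow: the tags v3.19 and v4.0 are the versions discussed above, the helper function names are made up for illustration, and the round count is only an estimate, not what git itself will report.

```shell
# Mark the known-bad and known-good ends; git then checks out a commit in between.
# Only defined here, not run: it needs a kernel git tree as the working directory.
start_bisect() {
    git bisect start
    git bisect bad v4.0
    git bisect good v3.19
}

# Rough number of build-and-reboot rounds for N candidate commits:
# each round halves the remaining range.
bisect_rounds() {
    n=$1
    rounds=0
    while [ "$n" -gt 1 ]; do
        n=$(( (n + 1) / 2 ))
        rounds=$(( rounds + 1 ))
    done
    echo "$rounds"
}
```

After each build and reboot, "git bisect good" or "git bisect bad" records the result and moves on to the next candidate; "git bisect reset" returns to the original checkout when done.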
Comment 10 Aaron Lu 2015-07-14 01:49:40 UTC
(In reply to Jan C Peters from comment #9)
> I am going to try to find the offending commit this way, but it may take
> some time. I will report any findings here.

Thanks a lot!
Comment 11 Rafael J. Wysocki 2015-07-14 14:26:39 UTC
We should find out if the fan is ACPI-managed.

Aaron, can you please give Jan instructions on that?
Comment 12 Jan C Peters 2015-07-14 16:35:19 UTC
I am down to 21 commits now, and there is a series of commits mentioning i8k, temperature sensors and fan stuff, I am pretty confident that the bug has been introduced there somewhere. References:

I am working with the linux-stable repo, located at
https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/

Currently:

bisect/bad:  786f949631c7d7dad6eae15659d29e78a5598662
bisect/good: ff1e33b0c16ba422e3bf3fa8cc7e89b2c958e193

I will follow it down to the single offending commit.

As for ACPI: I know my way around Linux systems, git and C/C++ code, but I am not very familiar with the kernel's internals. So help on that matter is very much appreciated.
Comment 13 Jan C Peters 2015-07-14 21:35:53 UTC
Finally, I found it. After 15 tedious times of kernel recompilation and rebooting.

The offending commit is f989e55452c74b4f7b22c889b8ec9f1192aaeec4. Here is a link: https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?id=f989e55452c74b4f7b22c889b8ec9f1192aaeec4

And according to the description it has definitely something to do with the fan. The 8 commits immediately preceding this one all have the prefix i8k and seem to belong together. This is the last commit of that chain. 

These i8k commits also mention Dell models, so I might add that I am indeed running a Dell machine (Dell Studio XPS 8100), as is Roman and another guy (as they mentioned on the Fedora Bugzilla).

I have no experience in fiddling with the kernel code, nor do I have any idea what implications my changes might have, so I would very much appreciate it if someone else could take a closer look at this. The author of the offending commit is Pali Rohár <pali.rohar@gmail.com>.

If you need more information, let me know.
Comment 14 Aaron Lu 2015-07-15 02:04:21 UTC
Thanks for the bisect, I'll assign it to the commit author and move it to the correct component.
Comment 15 Pali Rohár 2015-07-15 07:41:56 UTC
i8k.ko driver is specific for Dell machines with Dell SMM BIOS and on other machines cannot be loaded.

Is this problem just with "Dell Studio XPS 8100"? Can you provide DMI information about affected machines? If it is just one, we can blacklist it for now.

Can you try to stop all possible application which can access /sys/class/hwmon nodes? I need to know if this problem is caused by accessing hwmon attributes from userspace or this problem happen also when nobody invoke calls to SMM BIOS.

Just to note that i8k.ko was in 4.2 renamed to dell-smm-hwmon.ko

Temporary solution is to just disable i8k.ko (resp. dell-smm-hwmon.ko for >= 4.2) to not load.
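
A minimal sketch of that workaround, assuming the usual modprobe.d "blacklist" syntax and a kernel version string as printed by uname -r; the helper names, and the file name in the note below, are only examples:

```shell
# Pick the driver name for a kernel version string such as "4.0.5" or "4.2-rc3":
# i8k before 4.2, dell-smm-hwmon from 4.2 on (per the rename mentioned above).
smm_module_name() {
    major=${1%%.*}
    rest=${1#*.}
    minor=${rest%%.*}
    minor=${minor%%[!0-9]*}   # drop suffixes such as "-rc3"
    if [ "$major" -gt 4 ] || { [ "$major" -eq 4 ] && [ "${minor:-0}" -ge 2 ]; }; then
        echo dell-smm-hwmon
    else
        echo i8k
    fi
}

# Print the modprobe.d line that keeps the driver from being auto-loaded.
blacklist_line() {
    echo "blacklist $(smm_module_name "$1")"
}
```

The output of blacklist_line "$(uname -r)" would go into a file such as /etc/modprobe.d/dell-fan.conf (written as root). This only helps when the driver is built as a module; a built-in driver is not affected by a modprobe blacklist.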
Comment 16 Pali Rohár 2015-07-15 11:20:21 UTC
And are you sure that no fancontrol or similar software is running?

Can you test if this problem happen also with booted linux into minimal userspace (e.g. single user mode or init=/bin/bash) with loaded i8k/dell-smm-hwmon modules?
Comment 17 Jan C Peters 2015-07-15 15:50:00 UTC
Created attachment 182731 [details]
DMI information of an affected machine

Output of dmidecode
Comment 18 Jan C Peters 2015-07-15 16:06:42 UTC
I have no idea if this affects only the Dell Studio XPS 8100. It almost seems so. Anyway, I attached my DMI information, but the most interesting part is probably this:

System Information
	Manufacturer: Dell Inc.
	Product Name: Studio XPS 8100

Just to be clear: this is a standard desktop (ATX) machine, not a laptop.

I am very sure that no fan control or similar software is running. I am just using conky to display some values but I stopped that and it did not change anything.

I am not sure how I can stop all userspace programs from accessing hwmon attributes, but i am confident that this is not the problem.

I tested blacklisting the i8k module (by temporarily adding modprobe.blacklist=i8k to the kernel command line) on a 4.0.5 kernel, and indeed the fan showed normal behaviour (no rapid speedup/downs). The strange thing is, that when I ran "lsmod | grep i8k" the i8k module showed up, which shouldn't happen in this case, right? Also the /proc/i8k file was present. But, as I said, fan acted normal. And yes, the i8k driver was actually built as a load-on-demand  module, not as a kernel built-in.

I will try the minimal userspace thing now and report how that goes...
Comment 19 Pali Rohár 2015-07-15 16:16:56 UTC
(In reply to Jan C Peters from comment #18)
> I have no idea if this affects only the Dell Studio XPS 8100.

I would like to know this. Can somebody else with same problem confirm that it is not on other machine?

> It almost
> seems so. Anyway, I attached my DMI information, but the most interesting
> part is probably this:
> 
> System Information
>       Manufacturer: Dell Inc.
>       Product Name: Studio XPS 8100
> 

Yes, this is enough.

> Just to be clear: this is a standard desktop (ATX) machine, not a laptop.
> 

Does not matter. It is x86 machine with Dell BIOS which provide SMM interface for fan control.

> I am very sure that no fan control or similar software is running. I am just
> using conky to display some values but I stopped that and it did not change
> anything.
> 
> I am not sure how I can stop all userspace programs from accessing hwmon
> attributes, but i am confident that this is not the problem.
> 

If you are 100% sure that commit f989e55452c74b4f7b22c889b8ec9f1192aaeec4 introduced this problem, then there must be problem just when accessing hwmon attributes. See that patch and you will see that it only add new attributes.

> I tested blacklisting the i8k module (by temporarily adding
> modprobe.blacklist=i8k to the kernel command line) on a 4.0.5 kernel, and
> indeed the fan showed normal behaviour (no rapid speedup/downs). The strange
> thing is, that when I ran "lsmod | grep i8k" the i8k module showed up, which
> shouldn't happen in this case, right? Also the /proc/i8k file was present.
> But, as I said, fan acted normal. And yes, the i8k driver was actually built
> as a load-on-demand  module, not as a kernel built-in.
> 

This is distribution/userspace specific. If you disable and blacklist i8k, then it must not be in lsmod and /proc/i8k must not exist. Otherwise it is not disabled... I cannot help you more with distribution specific settings...

> I will try the minimal userspace thing now and report how that goes...

Yes, try this. Load i8k manually. And then play with accessing hwmon attributes (cat is enough) and see what happen.
Comment 20 Jan C Peters 2015-07-15 16:47:29 UTC
First of all I figured out that blacklisting in Fedora works with putting a respective config file into /lib/modprobe.d/. Then the module really isn't loaded (does not show up in lsmod).

So this is what happens: When I boot into single user mode (by appending single at the kernel command line) I have the same experience (with respect to the bug) that I have when I boot normally (full-fledged gnome session), which is the following:

If I don't blacklist i8k the fan shows buggy behavior (also in single user mode).

If I do blacklist i8k the fan's behavior appears normal, stable. In this case there is only hwmon0, which contains only the attributes of the cpu cores. The fan's attributes are nowhere to be found. Then I tried loading i8k manually via "modprobe i8k". Curiously, the fan is still operating stable, no buggy behavior. And now there is also hwmon1 with the fan's attributes. Reading out any of these attributes did not yield any noticeable reaction, neither before loading i8k nor afterwards. All this holds for both single-user mode as well as complete graphical multi-user session.

----

Yes, this is the commit which is the first to show this buggy behavior for me. Maybe the commits before that are connected to this?

I don't know how to proceed from here. Can you please give me some hints?
Comment 21 Pali Rohár 2015-07-15 16:53:48 UTC
hwmon directory for coretemp and it8720 is irrelevant here. Just i8k hwmon directory matter. So try to read hwmon attributes for i8k (after you load i8k manually).

And really, when you load i8k.ko manually there is no problem? Only if i8k.ko is loaded at boot time automatically? It does not make sense...
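
Reading only the i8k attributes could look roughly like this. It is a sketch: the hwmon name "i8k" is inferred from the "i8k-virtual-0" entry in the sensors output, and the function name is made up here.

```shell
# Print every fan* attribute of the hwmon device whose "name" file equals $1.
# $2 is the hwmon root directory, defaulting to /sys/class/hwmon.
dump_fan_attrs() {
    want=$1
    root=${2:-/sys/class/hwmon}
    for dir in "$root"/hwmon*; do
        [ -r "$dir/name" ] || continue
        [ "$(cat "$dir/name")" = "$want" ] || continue
        for attr in "$dir"/fan*; do
            if [ -f "$attr" ]; then
                printf '%s: %s\n' "${attr##*/}" "$(cat "$attr")"
            fi
        done
    done
}
```

Calling "dump_fan_attrs i8k" right after "modprobe i8k", and again a few times later, would show whether reading these attributes has any effect on the fan.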
Comment 22 Jan C Peters 2015-07-15 17:13:29 UTC
Ok, it gets even stranger: actually it does not seem to happen EVERY time, I just booted a 4.0.5 kernel and everything was fine, stable fan speed. But as I started putting stress on the cpu (compiling the kernel), the fan started out doing its buggy speeding up and down. So after all it may be that this commit wasn't the culprit after all. But where do we go from here?
Comment 23 Jan C Peters 2015-07-15 17:33:48 UTC
As long as no one has a better suggestion I will bisect again, this time with cpu stress to be sure.
Comment 24 Roman Spirgi 2015-07-15 18:36:24 UTC
I do have the same issue on the same model with slightly different CPU, i7 860 @ 2.80 GHz
...
System Information
	Manufacturer: Dell Inc.
	Product Name: Studio XPS 8100
...

I see - hear - the issue almost every time when booting up. It is starting just before the Gnome login screen appears. And I did notice that fan speed after resuming from standby seems to be stable instead.
Comment 25 Jan C Peters 2015-07-16 16:30:50 UTC
This time I put massive stress on the CPU when booting the compiled kernel (by again compiling the kernel in parallel on all cores), I reached almost 100% utilization for all cores. The result is still the same commit I found the first time. There has to be something (maybe indirectly) changing the behaviour of the fan.

The commits before that also look suspicious to me (fan speed multiplier and the like).

Meanwhile, is it correct that blacklisting the i8k driver will not result in any negative effects (other than not being able to read out fan speeds from /sys/class/hwmon)? In this case I will just blacklist that module for now.

What was that about the fan being ACPI managed? How can I investigate this?
Comment 26 Pali Rohár 2015-07-16 16:43:00 UTC
(In reply to Jan C Peters from comment #25)
> This time I put massive stress on the CPU when booting the compiled kernel
> (by again compiling the kernel in parallel on all cores), I reached almost
> 100% utilization for all cores. The result is still the same commit I found
> the first time. There has to be something (maybe indirectly) changing the
> behaviour of the fan.
> 
> The commits before that also look suspicious to me (fan speed multiplier and
> the like).
> 

I do not have Dell Studio XPS 8100 machine, so I cannot help more. Somebody must provide which was first commit which introduced this bug. After that I can start thinking where can be problem and how to fix it.

From all your comments I'm really unsure if it is f989e55452c74b4f7b22c889b8ec9f1192aaeec4 or not.

If you can at least identify that i8k.ko module is 100% problematic, we can blacklist it for Dell Studio XPS 8100 machine directly in i8k (resp. dell-smm-hwmon) code to prevent loading it.

> Meanwhile, is it correct that blacklisting the i8k driver will not result in
> any negative effects (other than not being able to read out fan speeds from
> /sys/class/hwmon)? In this case I will just blacklist that module for now.

Yes, i8k module is just for reading temperature and fan values via SMM requests. On older Dell laptops it provide also information about some pressed hotkeys and power status. If you do not need all this information (looks like only fan speed is supported for your machine), just disable and do not load i8k.ko module.
Comment 27 Jan C Peters 2015-07-16 16:58:53 UTC
Well I don't know what to tell you, I did the complete (!) bisection (between versions v3.19 and v4.0) two times now. And the first "bad" commit is f989e55452c74b4f7b22c889b8ec9f1192aaeec4. When I compiled the kernel at the commit immediately before that (1a131ca1de7a84cf3827c418ee5971b493c6f23f) everything was fine. Also with massive cpu load -- got >75° C and fan accelerated a bit to compensate for that, but no buggy behavior.

The behavior, that the bug does not emerge after manual loading of i8k or after waking from standby, is really strange to me too and I can't explain it. But that is what happens.

Well, it seems we are not able to resolve this, and not many users are affected by this bug, so I suggest just blacklisting the module for Dell Studio XPS 8100 machines.

If somebody has suggestions what I could try, I am at your service.
Comment 28 Pali Rohár 2015-07-16 18:52:56 UTC
Ok, so from above information that commit 1a131ca1de7a84cf3827c418ee5971b493c6f23f is OK and f989e55452c74b4f7b22c889b8ec9f1192aaeec4 is bad we can conclude that f989e55452c74b4f7b22c889b8ec9f1192aaeec4 is really first bad commit. Maybe somebody else should confirm it so will be 100% sure.

The next step is to identify what is causing that high CPU fan speed. That patch is doing two things:

1) At module init it checks whether fan1 and fan2 are present by calling the i8k_get_fan_type() function
2) It exports fan1_label and fan2_label hwmon attributes via /sys. Accessing those attributes just calls the i8k_get_fan_type() function.

So you can check:

1) If unloading module at runtime make CPU fan speed to work at normal level
2) Adding debug lines to i8k_get_fan_type() function to print something every time when is called
3) Try to load module before or after others or after desktop application load

It could identify if problem is there:

1) always after calling i8k_get_fan_type() (even just one time call) and poweroff/reboot machine is needed
2) only after i8k_get_fan_type() finish and if i8k_get_fan_type() is not called more times there is no problem
3) totally randomly after module was loaded (or when was loaded before something other)
Comment 29 Pali Rohár 2015-07-16 18:58:18 UTC
Created attachment 182861 [details]
0001-dell-smm-hwmon-Blacklist-Dell-Studio-XPS-8100.patch

You can try to apply this attached patch 0001-dell-smm-hwmon-Blacklist-Dell-Studio-XPS-8100.patch

It should blacklist Dell-Studio-XPS-8100 machine and loading module should fail. With parameter ignore_dmi=1 blacklist should be ignored.
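
The matching idea can be illustrated from userspace. This is only a shell sketch of a DMI vendor/product check, not the attached patch (which is C code inside the driver); the function name is made up, and the single entry is the machine reported in this bug.

```shell
# Return success if the vendor/product pair is on the (illustrative) blacklist.
is_blacklisted() {
    vendor=$1
    product=$2
    [ "$vendor" = "Dell Inc." ] || return 1
    case $product in
        *"Studio XPS 8100"*) return 0 ;;
        *) return 1 ;;
    esac
}
```

On a running machine the two arguments could be taken from /sys/class/dmi/id/sys_vendor and /sys/class/dmi/id/product_name.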
Comment 30 Jan C Peters 2015-07-16 21:07:15 UTC
Ok, that should be worth a shot. I will put in some debug outputs and see what they tell me. How can I influence the order the modules are loaded in?

I don't have so much time right now, I will get back to you in a few days.
Comment 31 Jan C Peters 2015-07-23 15:09:06 UTC
The only thing I can say for sure by now is that if I remove the i8k module post-boot, the weird fan behavior is still there, does not go back to normal. I pretty much expected that, since I also think that i8k only indirectly causes the bug. Next I am trying to find out which program/other module is accessing the i8k hwmon attributes, because one of these must be the culprit then, right?

A question in this regard: Which part of the kernel/which module actually controls the cpu fan? Can I somehow track calls/changes/function invocations to this part? Or can I track queries of this module (e. g. there must be calls such as "what is the current cpu temperature?" to some other component)?
Comment 32 Pali Rohár 2015-07-24 07:54:08 UTC
So if unloading i8k module did not help, there is something really bad. Try my patch to see if it fixes your problem, and also try to revert that bad commit on top of some 4.2 rc release (to make sure that bad commit is really the bad one).
Comment 33 Tom G 2015-07-28 18:27:58 UTC
I have a DELL Studio XPS 8100 Intel Core i7 860 @ 2.80 that began exhibiting this behavior after upgrading to Fedora 22 x86_64 and updating kernel > 4.

Blacklisting the i8k is the only thing that stops the issue.  I'm not a kernel programmer so I'll not speculate as to the code issues generating the problem, but it's most definitely related to i8k.ko module in my case.
Comment 34 Pali Rohár 2015-07-28 18:40:08 UTC
Please check if patch from comment #29 fix this problem.
Comment 35 Jan C Peters 2015-07-30 17:36:17 UTC
I just installed the new Fedora distribution kernel 4.1.2-200.fc22.x86_64 and the bug is still there, but I noticed -- as I did before -- that the weird fan behavior does not always start during boot, but seems to be dependent on CPU temperature. On my first cold boot of kernel 4.1.2 today the fan was keeping quiet (CPU temp was around 20-40°C), but when I started recompiling the kernel with make -j8 the temperature naturally increased and the fan started spiking.

Aside from that I compiled kernel v4.2-rc3 once without and once with the patch from comment #29. The (unsurprising) bad news is that without the patch the kernel still exhibits buggy behavior. The good news is that the patch actually fixed the buggy fan behavior. The downside: it does not really fix it, it just deactivates the module for this particular kind of Dell machine.

Unfortunately I won't be able to contribute to this investigation in the future, since the affected machine that I have been using until now will be disassembled and replaced by a much more recent machine in the next few days (and it's not because of this bug, rather about other unrelated motherboard and hardware deficiencies).
Comment 36 Jan C Peters 2015-07-30 21:11:38 UTC
Ok, I wanted to try one last thing: reverting the commit from comment #13. Of course there was a conflict because of the renaming, I just copied the reverted version of the drivers/char/i8k.c file over to drivers/hwmon/dell_smm_hwmon.c and removed the i8k.c file. I explicitly did _not_ use the patch from comment #29. The result: indeed the fan showed no buggy behavior. One more indicator that this commit introduced the bug, however strange that is.

Here is an idea (more of a compromise) for a resolution: This commit adds labels to the sensors, right? So maybe you could add these labels only if the machine is not a Dell Studio XPS 8100. This way owners of these machines could still use the driver to read of fan speeds (well, without labels, but that is not that bad), while users of other machines can use the driver in its entirety. That seems better than deactivating the module completely for us Dell Studio XPS 8100 users. And it should work this way, at least in theory.
Comment 37 Pali Rohár 2015-07-30 21:19:12 UTC
Maybe you are right that label patch causing this problem, but my question is why? It just does not make sense at all. Or there is bug in BIOS/firmware...

From your output I see that it87 driver also reports fan speed on your Dell machine. So I'm for blacklisting this driver completely, as until we know more it looks like a BIOS bug.

Anyway, I sent this patch from comment #29 to mailing list for review.
Comment 38 Thorsten Leemhuis 2015-08-28 06:53:25 UTC
For the record, seeing this problem on a Studio XPS 8000, too (I'm also on Fedora). Shall I submit a patch like the one from comment #29 to blacklist this machine as well? Or how can I help to find and fix the root cause of the problem?
Comment 39 Pali Rohár 2015-08-31 09:38:57 UTC
(In reply to Thorsten Leemhuis from comment #38)
> For the record, seeing this problem on a Studio XPS 8000, too (I'm also on
> Fedora). Shall I submit a patch like the one from comment #29 to blacklist
> this machine as well?

If 8000 has same problem as 8100 you can send patch.

> Or how can I help to find and fix the root cause of the problem?

Find *what* cause that problem. If it is some smm call or something else.
Comment 40 RickD 2015-11-10 17:16:21 UTC
I have this problem on an Inspiron 580. Blacklisting dell-smm-hwmon seems to solve it.
Comment 41 Rick Harris 2015-11-20 01:02:40 UTC
I'm not on Fedora, but am having similar experience relevant to this bug on a Dell Studio 1749 laptop after upgrading to kernel-4.*

My CPU fan functions fine at boot but after a time of inactivity, the fan drops to eventually nothing and the machine starts to overheat.
eg. System is idle, CPU temp at 85 degrees celsius and fan at 0 RPM

CONFIG_SENSORS_DELL_SMM=y
# CONFIG_I8K is not set

I am currently working around the bug by disabling CONFIG_SENSORS_DELL_SMM in the kernel.

I did not apply any similar patch as mentioned in comment #29 as I fear blacklisting machines is simply hiding the bug instead of addressing it as it seems to be occurring on so many different Dell laptops now.

Thanks :)
Comment 42 Pali Rohár 2015-11-20 08:48:12 UTC
@RickD and @Rick Harris if you want to try to debug this bug, look at source code of dell_smm_hwmon.c and try to disable calling some SMM function. Maybe one of that cause this bad behaviour...
Comment 43 Rick Harris 2015-11-20 21:49:51 UTC
(In reply to Pali Rohár from comment #42)
> @RickD and @Rick Harris if you want to try to debug this bug, look at source
> code of dell_smm_hwmon.c and try to disable calling some SMM function. Maybe
> one of that cause this bad behaviour...

I'm no coder, just a pesky user :D
So I really have no clue where to start editing the source code to try and debug for you.

I'd even go so far as to say I'm hesitant to try patches as for each time it doesn't work, the system overheats to the point the BIOS performs a sudden poweroff.
This can't be good for the hardware and I really need my laptop to keep working into the near future.

Something I failed to mention which may help is that I can wake the fan back up by typing 'sensors' in a terminal.
This seems to re-initialise the driver and the fan temporarily returns to normal functionality.

The laptop usually idles at around 50 degrees celsius and waking it up with the 'sensors' command quickly returns it to its normal temperature.

Hope this helps some :)
Comment 44 Dakota Whipple 2016-01-12 23:24:26 UTC
I'm currently using a Dell Studio XPS 8000 and having this issue. Could you please add that version to the update?
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/drivers/hwmon/dell-smm-hwmon.c?id=a4b45b25f18d1e798965efec429ba5fc01b9f0b6
Comment 45 Thorsten Leemhuis 2016-01-13 09:57:28 UTC
Created attachment 199491 [details]
Blacklist Studio 8000

(In reply to Dakota Whipple from comment #44)
> I'm currently using a Dell Studio XPS 8000 and having this issue. Could you
> please add that version to the update?

@Pali: Here is a patch to blacklist the Studio 8000. It's against Linux 4.4; just verified it works as expected. I still have the system close to me, but that will change soon.

Signed-off-by: Thorsten Leemhuis <linux@leemhuis.info>
Comment 46 Pali Rohár 2016-01-15 17:55:27 UTC
@Thorsten Leemhuis: Ok, please send it to maintainers.
Comment 47 Pali Rohár 2016-02-09 22:00:08 UTC
Look also at bug #112021
Comment 48 Yury Vidineev 2016-02-10 12:26:58 UTC
I have this problem on an XPS 13 9333. rmmod dell-smm-hwmon seems to solve it.
Comment 49 Pali Rohár 2016-02-13 12:22:58 UTC
Can you repeat test from bug #112021 on other affected machines? To check if this is same problem or not.
Comment 50 Yury Vidineev 2016-05-18 21:49:56 UTC
Sorry for silence - I haven't such bug with ubuntu 16.04 (4.4.0-22) on Dell XPS 13 (9333)
Comment 51 Pali Rohár 2016-05-19 11:08:29 UTC
On Wednesday 18 May 2016 21:49:56 bugzilla-daemon@bugzilla.kernel.org wrote:
> --- Comment #50 from Yury Vidineev <adeptg@gmail.com> ---
> Sorry for silence - I haven't such bug with ubuntu 16.04 (4.4.0-22) on Dell
> XPS
> 13 (9333)

So dell-smm-hwmon module is working fine on your XPS 13?
Comment 52 Yury Vidineev 2016-05-19 18:35:03 UTC
Looks like yes, it works fine for me.
I have module loaded

lsmod|grep smm
dell_smm_hwmon         16384  0

and running 'sensors' doesn't lead to some strange fan behavior


sudo dmidecode|grep XPS
        Product Name: XPS13 9333

uname -r
4.4.0-22-generic
Comment 53 Pali Rohár 2016-05-21 14:56:28 UTC
Hi all! I sent proposed patch to lkml: https://lkml.org/lkml/2016/5/21/47 patch: https://lkml.org/lkml/diff/2016/5/21/47/1

Please test it and let me know whether it finally fixes the problem at boot time or not.
Comment 54 Jan C Peters 2016-05-24 15:46:55 UTC
I no longer own the machine I had this bug on, so I cannot test the patch. But kudos for figuring this out, that was a really hard bug to catch!
Comment 55 Rick Harris 2016-05-25 09:30:57 UTC
(In reply to Pali Rohár from comment #53)
> Hi all! I sent proposed patch to lkml: https://lkml.org/lkml/2016/5/21/47
> patch: https://lkml.org/lkml/diff/2016/5/21/47/1
> 
> Please test it and let me know if it finally fix problem at boot time or not.

Many thanks for providing a patch, but only yesterday I replaced the 12-month-old hard drive that was cooked by this bug, as reported by SMART (it sits next to the CPU heatsink).

However, I still have the old hard drive, barely functioning in a severely degraded state, so I can sacrifice it to test this patch.
But I will need to swap hard drives in and out to test, given time.

Thanks :)
Comment 56 Leon Yu 2016-06-14 20:14:52 UTC
How do I test this if I am running Ubuntu? I have an Inspiron desktop. The fan spins up every 20 seconds or so. My current solution has been blacklisting dell-smm-hwmon.
Comment 57 Pali Rohár 2016-06-14 20:36:57 UTC
OK, now I think the root of this problem is known, and also the reason. This bug (fan speed going up and down) is different from bug 112021 (machine hangs for a few seconds with max fan speed). See more details in my email: https://lkml.org/lkml/2016/6/13/761

So for now there is no fix/patch for this bug 100121.

It looks like there is no way to fix this bug other than blacklisting the machines affected by the broken Dell SMM firmware (or whitelisting the working machines).

So can I ask everybody who is affected by this bug (CPU fan speed going randomly up and down) to provide the DMI vendor and DMI product name (from which a blacklist can be built)? Thanks.

These two commands print the vendor and product name of the running machine:
$ cat /sys/class/dmi/id/sys_vendor
$ cat /sys/class/dmi/id/product_name
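
For anyone collecting these values on several machines, here is a minimal sketch that reads the same two sysfs files and prints them in a form where leading or trailing spaces stay visible. The function names and the `base` parameter are assumptions made for illustration; only the two file paths come from the commands above.

```python
from pathlib import Path

# Directory used by the two cat commands above.
DMI_DIR = Path("/sys/class/dmi/id")


def read_dmi_field(name, base=DMI_DIR):
    """Return one DMI field, removing only the final newline.

    Other whitespace is kept on purpose, because a product name
    may contain spaces that matter when it is matched later.
    """
    text = (Path(base) / name).read_text()
    return text[:-1] if text.endswith("\n") else text


def dmi_report(base=DMI_DIR):
    """Format vendor and product name; repr() exposes stray spaces."""
    vendor = read_dmi_field("sys_vendor", base)
    product = read_dmi_field("product_name", base)
    return f"sys_vendor={vendor!r} product_name={product!r}"
```

Calling print(dmi_report()) on the affected machine gives a single line that can be pasted into a comment here.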
Comment 58 Leon Yu 2016-06-17 08:22:30 UTC
I have the following: 

leonyu@neptune:~$ cat /sys/class/dmi/id/sys_vendor
Dell Inc.
leonyu@neptune:~$ cat /sys/class/dmi/id/product_name
Inspiron 580
Comment 59 Dakota Whipple 2016-06-17 08:27:25 UTC
I am no longer affected by this bug. The fan drivers appear to be working now with Dell SMM blacklisted.

$ cat /sys/class/dmi/id/sys_vendor
Dell Inc.
$ cat /sys/class/dmi/id/product_name
Studio XPS 8000
Comment 60 Pali Rohár 2016-08-07 17:51:57 UTC
A workaround for this bug is in Linus' tree:  https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=2744d2fde00dc8bcc3679eb72c81a63058e90faa

This is a bug in Dell SMM or BIOS code. Please report it to Dell support, as only Dell can fix their BIOSes... There is no other way.
Comment 61 Rick Harris 2016-08-21 00:05:46 UTC
Good news: there is another way. Fix or remove the regressive piece of code committed to the kernel.

This is not Dell's bug, it's yours, so take ownership of it.

Still happening on a "Dell Studio 1749".

In the interests of some other poor user who may unknowingly enable this bug in their kernel config, you may add that to the list of Dell machines that this kernel option doesn't work with.

 CONFIG_SENSORS_DELL_SMM
"adds support for reporting temperature of different sensors and controls the fans on Dell laptops via System Management Mode provided by Dell BIOS"

Why does the CPU temp and fan work just fine with this *DISABLED* ?
It seems totally redundant?

Distro kernel packagers should just disable the kernel option, as the support that this option purports to add exists without it enabled.

Thanks :)
Comment 62 Pali Rohár 2017-05-12 15:49:16 UTC
(In reply to Rick Harris from comment #61)
> This is not Dell's bug

Yes, it is. It is in a Dell firmware function used by dell-smm-hwmon. We cannot do anything except avoid calling it.

> Still happening on a "Dell Studio 1749".
> 
> In the interests of some other poor user who may unknowingly enable this bug
> in their kernel config, you may add that to the list of Dell machines that
> this kernel option doesn't work with.

If another machine is affected, please provide the DMI vendor and product name (with all whitespace) and we can add it to the blacklist if it is really affected by this bug.
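
To illustrate why the whitespace matters, here is a minimal sketch of an exact-match blacklist lookup. It is not the driver's code: the table and function name are hypothetical, the entries are simply the vendor and product strings reported in this bug, and exact string comparison is an assumption of the sketch.

```python
# Hypothetical table built from the strings reported in this bug.
# This is an illustration, not the kernel's actual blacklist.
BLACKLIST = (
    ("Dell Inc.", "Studio XPS 8000"),
    ("Dell Inc.", "Studio XPS 8100"),
    ("Dell Inc.", "Inspiron 580"),
)


def is_blacklisted(vendor, product, blacklist=BLACKLIST):
    """Return True only if both strings match an entry exactly."""
    # No stripping or case folding: a single extra space is a mismatch.
    return (vendor, product) in blacklist
```

With exact comparison, a product name reported with a trailing space would not match an entry written without one, so the strings need to be copied exactly as the firmware reports them.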

>  CONFIG_SENSORS_DELL_SMM
> "adds support for reporting temperature of different sensors and controls
> the fans on Dell laptops via System Management Mode provided by Dell BIOS"
> 
> Why does the CPU temp ...
> It seems totally redundant?

"adds support for reporting temperature ... via System Management Mode provided by Dell BIOS"

So if you have another way to read the temperature or set the fans, it is redundant. Otherwise it is not.

> and fan work just fine with this *DISABLED* ?

"controls the fans ... via System Management Mode provided by Dell BIOS"

Probably your fans are in auto mode and you are not controlling them manually?