Bug 112021

| Summary: | Dell Precision M3800 hangs for 2 - 3 seconds with max fan speed while reading fan labels | | |
|---|---|---|---|
| Product: | Drivers | Reporter: | Tolga Cakir (cevelnet) |
| Component: | Hardware Monitoring | Assignee: | Pali Rohár (pali) |
| Status: | RESOLVED CODE_FIX | | |
| Severity: | normal | CC: | cevelnet, fedora, guy.b, howl.nsp, jdelvare, stephen.h.nease, szg00000, w.shackleton, zvova7890 |
| Priority: | P1 | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Kernel Version: | >=3.19 affected | Subsystem: | |
| Regression: | No | Bisected commit-id: | |

Attachments:
DMI Data of affected machine
[PATCH] dell-smm-hwmon: Cache fan_type() calls and use fan_status() for fan detection
Output of sensors (lm_sensors)
Second fan support - diff

Description
Tolga Cakir
2016-02-05 23:16:27 UTC
Hi!

Can you check if commit f989e55452c74b4f7b22c889b8ec9f1192aaeec4 introduced this hang?

The function which reads the fan label is i8k_hwmon_show_temp_label(), see: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/drivers/hwmon/dell-smm-hwmon.c#n563

The only place where it can hang is in the i8k_get_temp_type() function, which does an SMM call. Can you try to hardcode the type variable in i8k_hwmon_show_temp_label() to 0 and comment out the code which calls i8k_get_temp_type()? In this case sysfs just reports for any fan that the type is CPU.

Also I have another test: Can you take a kernel version which is working fine and backport the dell-smm-hwmon.ko driver from the Linus mainline tree? I would like to know that no other change in the kernel could cause this problem...

I have put a standalone dell-smm-hwmon driver for testing purposes at: http://jdelvare.nerim.net/devel/lm-sensors/drivers/dell-smm-hwmon/

Generic instructions can be found at: http://jdelvare.nerim.net/devel/lm-sensors/drivers/INSTALL

It is the version from kernel v4.5-rc3, and should build and run on kernel versions >= v3.16.

Thanks for helping!

Sorry for the delay; I finally found some time to follow your instructions. I'm starting with Jean's instructions, as they were easier to do:

First of all, I installed and booted the 3.16.7.24-1-MANJARO kernel, rebooted several times and made sure that the problem didn't occur. Then I downloaded and compiled the 4.5-rc3 kernel module, and as soon as I insmodded the kernel module (first try!), my fans turned up to maximum and my laptop froze for about 3 seconds. Again, using 'cat /sys/class/hwmon/hwmon<dell_smm>/fan1_label' shows the same symptoms.

I will now go in depth and follow Pali's instructions in a new post.

Cheers,
Tolga

(In reply to Pali Rohár from comment #1)
> Hi!
>
> Can you check if commit f989e55452c74b4f7b22c889b8ec9f1192aaeec4 introduced
> this hang?

Yes, according to my tests, I can confirm that this problem exists in f989e55452c74b4f7b22c889b8ec9f1192aaeec4.
I've also tested its parent 1a131ca1de7a84cf3827c418ee5971b493c6f23f, where I couldn't reproduce this issue.

> The function which reads the fan label is i8k_hwmon_show_temp_label(), see:
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/drivers/hwmon/dell-smm-hwmon.c#n563
>
> The only place where it can hang is in the i8k_get_temp_type() function,
> which does an SMM call. Can you try to hardcode the type variable in
> i8k_hwmon_show_temp_label() to 0 and comment out the code which calls
> i8k_get_temp_type()? In this case sysfs just reports for any fan that the
> type is CPU.

First, I modified i8k_hwmon_show_temp_label(), which didn't result in any difference. Since this issue occurs when using cat on fan1_label, I also modified i8k_hwmon_show_fan_label() according to your instructions: I was able to eliminate the freeze during cat. However, the freeze during insmod was still there.

So I looked into i8k_init_hwmon(). Commenting out i8k_get_fan_type(0) fixed this issue on insmod (it was returning err >= 0). Commenting out i8k_get_fan_type(1) did not have any impact. As a result, i8k_get_fan_type() seems to cause this issue.

> Also I have another test: Can you take a kernel version which is working fine
> and backport the dell-smm-hwmon.ko driver from the Linus mainline tree? I
> would like to know that no other change in the kernel could cause this
> problem...

Just realized, this was identical to Jean's request. I've posted information about this here: https://bugzilla.kernel.org/show_bug.cgi?id=112021#c3.

Thank you!

Cheers,
Tolga

Ok, thank you for testing! Now we know that the I8K_SMM_GET_FAN_TYPE SMM call takes too long. What we can do right now is to cache all fan_type and temp_type values, as those should not change. But I'm not sure what to do with i8k_get_fan_type() in the init code sequence, as we use it for detecting whether a fan is present. That SMM call is also used by the old Dell DOS binary...

The DMI data contains some fan information... I'm wondering if we can use that.
Or maybe the result of I8K_SMM_GET_FAN_TYPE is the same for all recent models, so it could be skipped based on the result of dmi_get_date()? I don't have the hardware and I don't use the driver, so these are only random ideas... Maybe good, maybe not.

I already thought about using DMI tables... but the fan data in DMI provides bogus and incorrect values :-(

What's really puzzling me is that it only appears randomly. There are boots where dell-smm-hwmon is working absolutely fine. I can read temps etc. without any hangs or the fans going nuts. These kinds of random, unexplainable problems usually have to do with outdated CPU microcode / BIOS versions. But I'm on the latest M3800 BIOS A10 and am using the latest 20151106 intel-ucode. I'm really wondering what **randomly** causes I8K_SMM_GET_FAN_TYPE to hang the machine.

I've come across an ACPI bug for the Dell XPS 15 9530, which also applies to the M3800: https://bugzilla.kernel.org/show_bug.cgi?id=83921. Could it make the SMM misbehave in specific situations?

Also, does blacklisting dell-smm-hwmon have any drawbacks, other than not being able to monitor fan speed?

Also, dell-smm-hwmon only reports 1 fan, whereas the Dell M3800 actually has 2 fans (http://www.appstate.edu/~taylorsa1/m3800/7.jpg).

That fan function just calls the SMM function and the CPU enters SMM mode. In SMM mode the Linux kernel is stopped and some firmware handler code is doing something... This is the reason why you see the laptop hang. It is a BIOS problem, not microcode. The firmware then fills in the return value for our SMM function and returns to the kernel.

Maybe try to contact Dell support or the Dell kernel devs, to see if they can report the problem to the BIOS team? There were more bugs in Dell BIOS which were reproduced *only* by the Linux kernel, and some of them Dell fixed with a new BIOS release (e.g. repeating keys)...

The dell-smm-hwmon driver provides only fan and CPU temp info.
For very, very old laptops (about year 2000) without ACPI support it provides power supply state and some multimedia keys. So there are no drawbacks.

Currently dell-smm-hwmon checks the existence of fans number 0 and 1 (via that I8K_SMM_GET_FAN_TYPE function). IIRC the old DOS Dell binary tries to check fans with numbers 0...255. So it is possible that your second fan has another index number assigned. You can try to play with it; see function i8k_init_hwmon and how function i8k_hwmon_show_fan reads the index number defined in SENSOR_DEVICE_ATTR.

Anyway, before commit f989e55452c74b4f7b22c889b8ec9f1192aaeec4, fan presence was checked by function i8k_get_fan_status(). I changed it to i8k_get_fan_type() because the DOS binary NBSVC.MDM also does the 0x03a3 call (=i8k_get_fan_type).

So maybe we can revert fan detection to the i8k_get_fan_status() function and cache the value of i8k_get_fan_type() for each fan. If i8k_get_fan_status() is not hanging and detects fans correctly, we can use it in the driver init function to prevent a kernel hang when booting. And when somebody does cat /sys/.../*_label, then just the first call will hang (the value is not cached yet).

I have no other idea how to fix this problem... The best would be if Dell fixed their BIOSes...

Created attachment 204001 [details]
[PATCH] dell-smm-hwmon: Cache fan_type() calls and use fan_status() for fan detection
Try this patch if it works and helps you.
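
For readers following along, the approach described in the comments above (detect fans with the status call, cache the slow fan-type call) can be modelled in userspace. This is only an illustrative Python sketch, not the kernel code from the attachment: `FanTypeCache`, `detect_fans`, `slow_fan_type` and `fast_fan_status` are hypothetical names, the "firmware" is simulated, and the convention that a non-negative return value means "fan present" is an assumption based on the "err >= 0" remark earlier in the thread.

```python
import time


class FanTypeCache:
    """Userspace model of caching a slow firmware query whose result is constant."""

    def __init__(self, query_fan_type):
        # query_fan_type stands in for the slow SMM call (hypothetical callable).
        self._query = query_fan_type
        self._cache = {}

    def fan_type(self, fan):
        # Only the first lookup per fan reaches the slow call; the fan type
        # is assumed not to change while the driver is loaded.
        if fan not in self._cache:
            self._cache[fan] = self._query(fan)
        return self._cache[fan]


def detect_fans(query_fan_status, candidates=(0, 1)):
    """Probe fans with the fast status query instead of the slow type query."""
    # Assumption: a negative result means the fan is absent.
    return [fan for fan in candidates if query_fan_status(fan) >= 0]


calls = []


def slow_fan_type(fan):
    # Simulated slow firmware call; records every invocation.
    calls.append(fan)
    time.sleep(0.01)
    return 0


def fast_fan_status(fan):
    # Simulated fast call: only fan 0 exists in this model.
    return 0 if fan == 0 else -1


cache = FanTypeCache(slow_fan_type)
present = detect_fans(fast_fan_status)  # init path: no slow call involved
first = cache.fan_type(0)               # first label read: slow call happens once
second = cache.fan_type(0)              # later reads: answered from the cache
```

In this model, detection never triggers the slow call, and only the first label read pays for it, which matches the behaviour reported for the patch in the following comments.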
Thank you Pali! I've contacted Dell in the meantime; they will get back to me during business hours.

Also, I've tried out your patch and it fixes the hang on module init. However, as you already said, the first read of /sys/.../*_label (like sensors from lm_sensors does) hangs the device. All reads after the first one work correctly and detect fan speed and type correctly.

Tomorrow I'm going to play around a bit with the indices and see if I can find out the GPU fan index.

Even though the problem is not entirely fixed, it's still a step forward. Thank you for that!

Cheers,
Tolga

@Tolga: Do you have a response from Dell? Have you found the index for the GPU fan?

@Jean: What do you think about my patch in comment #11?

@Pali: Yes, I've contacted them via Twitter @DellCaresPRO. They will keep me updated about this issue, but I don't have any further information yet. I haven't had time to play around with the GPU fan yet, sorry about the delay.

(In reply to Pali Rohár from comment #13)
> @Jean: What do you think about my patch in comment #11?

Looks good. You can add my Reviewed-by:. Please just add a comment explaining why you are caching the values, so that we do not forget. Thanks.

@Tolga: Is that patch OK for you, or does it still not fix your use case? Also, you can try to contact the Dell kernel developers directly (look into git history, etc).

@Jean: Ok, I will add an explanatory comment and your Reviewed-by. But before sending it for merging I would like to see tests from other affected machines, to check that it does not break anything else... But it looks like nobody who is watching bug #100121 wants to test it...

I have added your patch to my standalone dell-smm-hwmon driver at: http://jdelvare.nerim.net/devel/lm-sensors/drivers/dell-smm-hwmon/ if it can help with testing.

Hi, I'm affected by this i8k bug on the same machine (2015).
I'm on kernel 3.16.0-62-generic, xubuntu 14.04.4 LTS, with i8kutils 1.42 (from this ppa): https://launchpad.net/i8kutils

The machine doesn't really freeze at boot, but the fans are at full speed for 1 second. The main problem is that every time the fans switch on there is a micro freeze, and also if I watch the speed through conky.

Let me know if you need more info and how to proceed.

@guy.b: Hi! Can you test the patch from comment #11?

Pali: Your diff is for the dell-smm-hwmon driver, but on my OS it's still called and installed as i8k. So, can you confirm that your patch can be installed in these circumstances?

lsmod:
i8k 14703 0

sensors:
i8k-virtual-0
Adapter: Virtual device
Right Fan: 0 RPM
CPU: +41.0°C

boot.log:
* Starting Dell fan/cpu-temperature monitor i8kmon
* Starting ACPI daemon [ OK ]

@guy.b: You have an old kernel; in new ones there is the dell-smm-hwmon.ko module instead of i8k.ko. So either update the kernel or look at Jean's standalone module.

Yes, I will try in a few days once back at home; I'm abroad with very unstable wifi connections. Thank you.

I could upgrade to kernel 4.2.0-30-generic and dell_smm_hwmon, which now makes the freezes at boot (3 times) and with the "sensors" command (~ 3s) very clear, with the fans at full speed:

03-10 09:46 kernel: [ 2.021950] iwlwifi 0000:06:00.0: L1 Enabled - LTR Enabled
03-10 09:46 kernel: [ 6.234030] clocksource: Switched to clocksource tsc
03-10 09:46 kernel: [ 6.239541] dcdbas dcdbas: Dell Systems Management Base Driver (version 5.6.0-3.2)

dell_smm-virtual-0
Adapter: Virtual device
Processor Fan: 4800 RPM
CPU: +46.0°C

I will patch (as soon as I know how) and report.

(In reply to guy.b from comment #18)
> The machine doesn't really freeze at boot, but the fans are at full speed for
> 1 second. The main problem is that every time the fans switch on there is a
> micro freeze, and also if I watch the speed through conky.

So you have a 3.16 kernel. Can you be more specific about your problem? When does that micro freeze happen? And when are the fans at full speed?
Can you check if this is a regression? (E.g. find some older kernel version which is working fine with the i8k.ko module loaded.)

With the 3.16 kernel the fans are at full speed for a second when the i8k module gets loaded at the end of the boot (boot.log), but it's not "really" a freeze, far away from the clear 3 times ~ 3s freezes during boot with 4.2.

Then, 3.16 makes only micro freezes with the "sensors" command, while getting the fan speed from conky, and when the fan switches on. But with the 4.2 kernel the "sensors" command makes the machine freeze completely for about 3s. I haven't tried to get the fan speed from conky with that kernel, which would probably end in a total freeze.

I can't say whether it is a regression or not; I installed i8kutils for the first time 3-4 days ago on that last 3.16. I never tried before, and it's the only 3.16 kernel that is installed on the OS.

@Pali: Sorry for the delay. I've just played around with the values and got a hit on my first try: index 2. The reported values seem to be correct. It's listed as "Processor Fan", which is correct, as both fans are attached to a shared heatpipe and cool down GPU and CPU at the same time. They run at similar speeds all the time; on a hang, both fans run at full speed (plausible to me, judging by the noise it makes). I'm going to attach a sample output of the "sensors" utility and the diff of the changes I've made.

I've noticed that, even with the latest patch, upon calling "sensors" my system hangs for some milliseconds, noticeable as very short hangs during audio playback. I never noticed this on my desktop, which is also a Haswell-based Dell machine (Dell T20, Xeon E3-1225v3). This might get annoying in setups where people have conky / monitoring tools running in the background.

About your patched dell-smm-hwmon: while it's definitely a good workaround for the annoying hangs on boot, it doesn't fix the problem at its root (as you already stated, probably not fixable in the kernel).
Would it be a better approach to blacklist the M3800 in the module until Dell fixes their BIOSes? Either way, you should either include your patch or blacklist the M3800. Both solutions are better than leaving it in its current state.

As a side note: I was just asking myself why M3800 support was added in the first place and e.g. the XPS 15 9530 wasn't. Both devices are practically the same, with the only difference being the built-in GPU (Quadro K1100M vs GeForce GTX750M). On one hand, dell-smm-hwmon looks for very unspecific strings like "Precision" or "Studio" and adds devices which might not be supported. On the other hand, it looks up specific devices like the "XPS13".

For my day-to-day use, I've blacklisted this module. Being able to read my fan speeds doesn't justify the downsides. Fan speeds seem to be regulated by the BIOS anyway (in a positive way: running cool & quiet, most of the time the fans are at 0 rpm).

I'm still in contact with Dell support. I guess it will take some more time; they've asked me to do some tests. I'll keep you updated on this matter. I'm also going to try to find a Dell developer from the Git history.

Cheers,
Tolga

Created attachment 209931 [details]
Output of sensors (lm_sensors)
Comment #26

Created attachment 209941 [details]
Second fan support - diff
Comment #26

Hi there,

I've just come across the Intel Haswell Errata document (http://www.intel.de/content/dam/www/public/us/en/documents/specification-updates/4th-gen-core-family-desktop-specification-update.pdf). What are your thoughts about HSD8 and HSD89? Might they be the cause?

Cheers,
Tolga

I have the same issue, but it always happens. Executing sensors causes a little hang, and so does cat /sys/class/hwmon/hwmon1/fan1_input (hwmon1 is dell_smm on my system). Temperature values are read normally.

The machine is a Dell Vostro 3360 and I have two mobos for it.
One is an "Intel(R) Core(TM) i3-2367M CPU @ 1.40GHz" (2nd gen i3) and the other one is an "Intel(R) Core(TM) i7-3517U CPU @ 1.90GHz" (3rd gen i7). Both of them show the same behaviour. Both of them report only one fan measurement, not two, the same as in the Dell Diagnostics program. Also, this laptop has only one physical fan.

Tolga, I don't think that's the cause. I don't know whether those errata could add to the issue on Haswell, but the processors I have tested are Sandy Bridge and Ivy Bridge.

I have received a final decision from Dell (TL;DR version): Windows runs fine, so it isn't a hardware problem. This driver has not been developed by Dell, so no support. I don't think we can expect a BIOS update.

But maybe it's really due to the ancient DOS program being incompatible with newer notebooks. I can't remember facing anything similar under Windows.

@David: at least we can rule that out. Good find!

(In reply to Tolga Cakir from comment #31)
> I have received a final decision from Dell (TL;DR version): Windows runs
> fine, so it isn't a hardware problem.

Idiocy, like always... No need to comment on it...

> This driver has not been developed by
> Dell, so no support. I don't think we can expect a BIOS update.
>
> But maybe it's really due to the ancient DOS program being incompatible
> with newer notebooks.

In that case, what is the new API or program which *is* compatible?

> I can't remember facing anything similar under Windows.
>
> @David: at least we can rule that out. Good find!

My question now is: How to solve this problem? With my patch? Or something else?

Hi,

During a backup (Grsync) a few days ago, I realised that the fans didn't spin up at all with the CPU being constantly at ~ 60° for a few minutes. I don't know if the reason is the blacklisted dell-smm-hwmon. At the next backup I'm going to test with the module loaded.

Archlinux
Kernel 4.5.4-1-ARCH

(In reply to Tolga Cakir from comment #31)
> I have received a final decision from Dell (TL;DR version): Windows runs
> fine, so it isn't a hardware problem.

The same problem can be reproduced on Windows too. And on *any* operating system. Just calling that SMM function for retrieving the fan type causes this problem.

(In reply to Pali Rohár from comment #34)
> The same problem can be reproduced on Windows too. And on *any* operating
> system. Just calling that SMM function for retrieving the fan type causes
> this problem.

I know it's a lame excuse by Dell, but I don't believe they will change their minds. Maybe they would care if we were able to demonstrate the freeze under Windows using that SMM call. I will read up on this and see what I can do. There is nothing more annoying than dealing with somebody else's buggy code, especially if they don't care.

Hi guys,

Is there a fix for this which doesn't involve blacklisting the driver? When I blacklist the driver, my fan operates at 100%, so I guess the BIOS for my XPS13 9333 isn't handling the fan correctly. Any advice?

A workaround for this bug is in the Linus tree: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=5ce91714b0d8c0a3ff9b858966721f508351cf4c

This is a bug in Dell SMM or BIOS code. Please report it to Dell support, as only Dell can fix their BIOSes... There is no other way.

Hi there,

just wanted to add that commit 27046a3ffbb01ba715e6236c170701c84759b61d "hwmon: Use smp_call_on_cpu() for dell-smm i8k" by Juergen Gross completely fixes this issue from the ground up, even on the initial call. No more freezes, yay!

Thanks to everyone involved in fixing this!

Cheers,
Tolga

Are you sure it was not 5ce91714b0d8c0a3ff9b858966721f508351cf4c? Because it was in the Linus tree before 27046a3ffbb01ba715e6236c170701c84759b61d.

I tested the whole patch series in June one by one. The result was: no hang on boot, but "sometimes" a hang on the initial call (e.g. the sensors user-space program).
My e-mail from June has more details on 5ce91714b0d8c0a3ff9b858966721f508351cf4c: https://lkml.org/lkml/2016/6/18/252

I didn't test 27046a3ffbb01ba715e6236c170701c84759b61d on its own, but full 4.9-rc3. I wasn't able to reproduce the issue a single time. I will test 27046a3ffbb01ba715e6236c170701c84759b61d on an older kernel and report. Just to be sure.

Unfortunately, smp_call_on_cpu() got introduced in 4.9-rc1, so I can't try this out on an older kernel, just the whole package. Either way, this issue seems to be fully fixed now.

I cannot believe that 27046a3ffbb01ba715e6236c170701c84759b61d fixes this problem. I think that some other commit could be relevant, but not this one... Anyway, if you are 100% sure that it fixed this freezing, are you able to track down which lines of the original (unpatched) code caused that freeze? I'm just really curious...

Hi,

I've tried the dell_smm module again after Tolga's latest news and I see changes (4.8.6-1-ARCH). With the sensors command it freezes just the first time, then not anymore, and the fan info returned is different (both fans are recognized). Now I get:

sensors
dell_smm-virtual-0
Adapter: Virtual device
Processor Fan: 0 RPM
Video Fan: 0 RPM
CPU: +39.0°C

Instead of only one before:

sensors
dell_smm-virtual-0
Adapter: Virtual device
Processor Fan: 4800 RPM
CPU: +46.0°C

I haven't yet tried booting with the module loaded, so no idea if the boot freeze still occurs. I will try and report.

(In reply to guy.b from comment #43)
> I've tried the dell_smm module again after Tolga's latest news and I see
> changes (4.8.6-1-ARCH). With the sensors command it freezes just the first
> time, then not anymore

This is due to commit 5ce91714b0d8c0a3ff9b858966721f508351cf4c, which introduced caching.

> and the fan info returned is different (both
> fans are recognized). Now I get:
>
> sensors
> dell_smm-virtual-0
> Adapter: Virtual device
> Processor Fan: 0 RPM
> Video Fan: 0 RPM
> CPU: +39.0°C

Are those 0 RPM values correct?
Do you really have two fans?

> I haven't yet tried booting with the module loaded, so no idea if the boot
> freeze still occurs. I will try and report.

Ok, let us know after you try it.

> This is due to commit 5ce91714b0d8c0a3ff9b858966721f508351cf4c, which
> introduced caching.

Ok.

> Are those 0 RPM values correct? Do you really have two fans?

Yes, it has two fans (CPU & GPU). But whether the values are correct is not certain: with the sensors command it freezes the first time with the fans at max speed but still shows 0 RPM, compared to before in the same situation:

Processor Fan: 4800 RPM

Then with the sensors command (without freezes) it shows 0 RPM, which is correct. So it's difficult to know.

I have tried the backup test, which makes the CPU hot (~ 65°) for a few seconds, but the fans don't switch on! I will try a stress test to see whether the fans switch on or not and report.

The boot freeze is gone.

My Dell 7720 is also affected by this problem. Any operation that reads or changes the RPM causes a freeze of around 500 ms.

Logged RPM read:
> [15567.022496] dell_smm_hwmon: i8k_hwmon_show_fan
> [15567.022499] dell_smm_hwmon: i8k_get_fan_speed 0
> [15567.523122] dell_smm_hwmon: smm(0x02a3 0x0000) = 0x0a52 (took 488908 usecs)

Writing the RPM has a similar delay:
> [14494.740165] dell_smm_hwmon: smm(0x01a3 0x0200) = 0x0000 (took 489212 usecs)

The BIOS is up to date, version A17.
> [vova7890@dell-i7 hwmon]$ uname -a
> Linux dell-i7 4.11.0-1-ARCH #1 SMP PREEMPT Tue May 9 18:26:58 UTC 2017 x86_64 GNU/Linux

I think that the bug is in the BIOS, and we can't do anything about it. One way: blacklist the fan operations, with support for overriding the blacklist for some reason.
> DMI: Dell Inc. Inspiron 7720/ , BIOS A17 05/19/2015

@vova7890: Hi!
Please rather create a new bug report, because this one is about the freeze when reading *_label attrs and was worked around for all models by the introduction of caching in commit https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5ce91714b0d8c0a3ff9b858966721f508351cf4c

From your comment it can be seen that your problem is about reading or changing the current speed.

Hi,

I just wanted to confirm that with dell_smm_hwmon enabled (not blacklisted) the CPU fan doesn't spin up under load (~ 60° - 70° for a few minutes).
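
For anyone who wants to check whether their own machine shows the read latency discussed in this report, the manual `cat` test can be approximated with a small timing script. This is a minimal sketch, not part of any patch in this thread: `time_reads` and `report` are hypothetical helper names, and the attribute path in the usage note is an assumption that differs from system to system.

```python
import time


def time_reads(path, repeats=3):
    """Read a sysfs-style attribute several times and time each read."""
    results = []
    for _ in range(repeats):
        start = time.monotonic()
        with open(path) as attr:
            value = attr.read().strip()
        # Store the value together with the elapsed wall-clock time in seconds.
        results.append((value, time.monotonic() - start))
    return results


def report(path, repeats=3):
    """Print one line per read so a slow first read stands out."""
    for i, (value, seconds) in enumerate(time_reads(path, repeats), start=1):
        print(f"read {i}: {value!r} took {seconds * 1000:.1f} ms")
```

A call such as `report("/sys/class/hwmon/hwmon1/fan1_label")` would mirror the `cat` test from the comments; the hwmon index has to be adjusted to wherever dell_smm shows up on the machine in question. With the caching workaround, one would expect only the first read to be slow.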