Bug 211895

Summary: dell_wmi_sysman causes unbootable system
Product: Drivers
Component: Platform_x86
Status: RESOLVED CODE_FIX
Severity: high
Priority: P1
Hardware: x86-64
OS: Linux
Kernel Version: 5.11
Subsystem:
Regression: No
Bisected commit-id:
Reporter: Borislav Gerassimov (borislav_ba)
Assignee: drivers_platform_x86 (drivers_platform_x86)
CC: freddy, hi-angel, jwrdegoede, perry_yuan, pspacek, superm1
Attachments: journalctl from the failed boot
journalctl from the failed boot - DELL E5470
journalctl from 5.11.2 + the patch by Mario
dmesg during modprobe dell_wmi_sysman

Description Borislav Gerassimov 2021-02-22 13:26:42 UTC
Created attachment 295399
journalctl from the failed boot

The affected system is a Dell Latitude E5570 running Arch Linux with the latest firmware/UEFI/etc. The system boots to a general protection fault screen (see https://bugs.archlinux.org/task/69702; a screenshot is attached to the first comment there).
I've also just tested with linux-next from 20210222, and the problem persists. Blacklisting 'dell_wmi_sysman' is the only way I have found to reach a working system.

Tell me if I can be of any help.
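
For readers applying the blacklisting workaround described above, here is a minimal Python sketch that checks whether a modprobe.d-style configuration text blacklists the module. The helper names and the example file path are hypothetical and not part of any tool mentioned in this report. It assumes the usual modprobe behaviour of treating '-' and '_' in module names as equivalent, which is why both dell-wmi-sysman and dell_wmi_sysman appear in this thread.

```python
def normalize(module):
    """Fold '-' to '_' so both spellings of a module name compare equal."""
    return module.strip().replace("-", "_")


def is_blacklisted(config_text, module):
    """Return True if the text contains a 'blacklist <module>' line."""
    wanted = normalize(module)
    for raw in config_text.splitlines():
        line = raw.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) == 2 and fields[0] == "blacklist" and normalize(fields[1]) == wanted:
            return True
    return False


# Hypothetical content for a file such as /etc/modprobe.d/dell-wmi-sysman.conf
EXAMPLE_CONF = """\
# Work around the boot-time fault until a fixed kernel is installed
blacklist dell_wmi_sysman
"""
```

Calling is_blacklisted(EXAMPLE_CONF, "dell-wmi-sysman") returns True for the example text. Note that a "blacklist" line only stops automatic loading through aliases; an explicit modprobe of the module still loads it.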
Comment 1 Petr Špaček 2021-02-26 09:16:46 UTC
Created attachment 295461
journalctl from the failed boot - DELL E5470
Comment 2 Petr Špaček 2021-02-26 09:18:10 UTC
The panic also reproduces on a Dell Latitude E5470/0MT53G, BIOS 1.23.3 08/04/2020, with kernel 5.11.1.

The first log lines about WMI are:
wmi_bus wmi_bus-PNP0C14:01: WQBC data block query control method not found
acpi PNP0C14:02: duplicate WMI GUID 05901221-D566-11D1-B2F0-00A0C9062910 (first instance was on PNP0C14:01)
acpi PNP0C14:03: duplicate WMI GUID 05901221-D566-11D1-B2F0-00A0C9062910 (first instance was on PNP0C14:01)

and then the boot goes on for a second. Several processes cause stack traces to be printed, pointing all over the place. Blacklisting the module works around the problem.
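
When comparing logs from several machines, the duplicate-GUID messages quoted above can be grouped mechanically. The following is a rough Python sketch, not part of any kernel tooling; the regular expression is an assumption based only on the three lines quoted in this comment, and the function name is made up for illustration.

```python
import re

# Sample lines copied from the comment above.
LOG = """\
wmi_bus wmi_bus-PNP0C14:01: WQBC data block query control method not found
acpi PNP0C14:02: duplicate WMI GUID 05901221-D566-11D1-B2F0-00A0C9062910 (first instance was on PNP0C14:01)
acpi PNP0C14:03: duplicate WMI GUID 05901221-D566-11D1-B2F0-00A0C9062910 (first instance was on PNP0C14:01)
"""

# Pattern inferred from the sample lines only; other kernels may word this differently.
DUPLICATE = re.compile(
    r"acpi (?P<device>\S+): duplicate WMI GUID (?P<guid>[0-9A-Fa-f-]{36}) "
    r"\(first instance was on (?P<first>\S+)\)"
)


def duplicate_guids(log_text):
    """Map each duplicated GUID to (first device, [devices repeating it])."""
    found = {}
    for match in DUPLICATE.finditer(log_text):
        _first, repeats = found.setdefault(match["guid"], (match["first"], []))
        repeats.append(match["device"])
    return found
```

For the sample above, duplicate_guids(LOG) reports one GUID, first seen on PNP0C14:01 and repeated on PNP0C14:02 and PNP0C14:03.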
Comment 3 Mario Limonciello 2021-02-26 18:57:47 UTC
So if it's 5.11.1 as well as linux-next, then you already have https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/drivers/platform/x86/dell-wmi-sysman?h=linux-5.11.y&id=215164bfb7144c5890dd8021ff06e486939862d4 applied.

> 5570/5470?  

These are quite old, right? They came out in the 2015/2016 time frame.

I would expect they don't support this interface, but they should have bailed more gracefully too.

Can you please test https://lkml.org/lkml/2021/2/18/748?
It needs to be re-spun based on maintainer feedback, but I think it will fix the issue for you. If it works for you, I'll respin it soon and hopefully you can test the new one as well.
Comment 4 Borislav Gerassimov 2021-03-01 12:12:55 UTC
Created attachment 295547
journalctl from 5.11.2 + the patch by Mario
Comment 5 Borislav Gerassimov 2021-03-01 15:15:51 UTC
There is no change in the behavior with the above patch.
Comment 6 Freddy Willemsen 2021-03-19 13:40:50 UTC
The panic occurs on my Dell Latitude E5570 as well (figures, same hardware as the E5470) using Fedora 33. The proposed patch does not change anything.
Comment 7 Hans de Goede 2021-03-20 14:40:36 UTC
I've prepared and posted a set of patches which deal with various problems with error-exit path cleanups and general robustness of the dell-wmi-sysman driver:

https://lore.kernel.org/platform-driver-x86/20210320143429.76047-1-hdegoede@redhat.com/T/#t

Note it is not entirely clear to me what is going on here, so I'm not sure if these patches fix things but hopefully they will help.

What would be helpful, independent of testing the patches, is if someone could boot a 5.11 kernel with dell-wmi-sysman blacklisted to avoid the problem.

And then:

1. Switch to a text-console
2. ssh into the machine and run dmesg -w
3. ssh into the machine a second time and run: "sudo modprobe dell_wmi_sysman dyndbg"

Then collect the log output from "dmesg -w". If there are log messages on the text-console which did not make it into the ssh "dmesg -w" output, take a picture of those.
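
Steps 2 and 3 above can also be driven from a second machine, so that the captured log survives if the laptop locks up. The following Python sketch shows one way to do that; it is not a tested procedure from this report. The host name, the log file name, the sleep intervals and the function names are all assumptions. It also assumes the remote user may read the kernel log, which may require root on systems that restrict dmesg.

```python
import subprocess
import time


def follow_command(host):
    """Step 2: follow the kernel log over ssh (dmesg -w keeps running)."""
    return ["ssh", host, "dmesg", "-w"]


def load_command(host, module="dell_wmi_sysman"):
    """Step 3: load the module with dyndbg; ssh -t lets sudo prompt for a password."""
    return ["ssh", "-t", host, "sudo", "modprobe", module, "dyndbg"]


def capture(host, log_path, settle_seconds=10.0):
    """Record the remote kernel log locally while the module is loaded."""
    with open(log_path, "wb") as log:
        follower = subprocess.Popen(follow_command(host), stdout=log)
        try:
            time.sleep(2.0)  # let the follower connect before loading the module
            result = subprocess.run(load_command(host))
            time.sleep(settle_seconds)  # keep collecting late messages
        finally:
            follower.terminate()
            follower.wait()
    return result.returncode
```

A call such as capture("e5570.example", "sysman-dmesg.log") would leave the log in a local file that can be attached here. Step 1 (switching the laptop to a text-console) still has to be done by hand, since messages that never reach dmesg only show up there.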

And if you are capable of building your own kernels then testing the patches would be great too of course (save the emails in "raw" format and then "git am" them).
Comment 8 Freddy Willemsen 2021-03-21 00:13:49 UTC
I answered the above question for my situation in https://bugzilla.redhat.com/show_bug.cgi?id=1936171. For completeness' sake I will upload the requested dmesg info here as well.
Comment 9 Freddy Willemsen 2021-03-21 00:14:09 UTC
Created attachment 295965
dmesg during modprobe dell_wmi_sysman
Comment 10 Hans de Goede 2021-03-21 12:23:32 UTC
Thanks to Freddy's testing, we now have confirmation that the patches fix things, and I now also know the exact circumstances / root-cause which triggers this crash.

I've posted a v2 of the patches, adding one more robustness fix and dropping one patch which needs more testing:

https://lore.kernel.org/platform-driver-x86/20210321115901.35072-1-hdegoede@redhat.com/T/#t

I'll work on getting these merged by Linus and then also on getting them added to the stable kernel series (I'm the drivers/platform/x86 maintainer). In the meantime it would be best if distros carry the v2 patch-series as downstream patches.
Comment 11 Borislav Gerassimov 2021-03-23 13:35:31 UTC
I've just tested with Linux Next 20210323 on my Dell Latitude E5570 - everything works as expected. I just couldn't reboot for some reason - the system seemed stuck at some point before the reboot target and I had to forcibly turn it off. But I think this is out of scope of the current report :) 

Thank you all for fixing and testing this.
Good luck ahead!