Bug 207491
Summary: | [regression in 5.6][bisected] "BUG: kernel NULL pointer dereference" after resuming from suspend | ||
---|---|---|---|
Product: | Platform Specific/Hardware | Reporter: | Dominik Mierzejewski (dominik) |
Component: | x86-64 | Assignee: | Rafael J. Wysocki (rjw) |
Status: | RESOLVED CODE_FIX | ||
Severity: | normal | CC: | bp, jwrdegoede, luzmaximilian, malattia, platform_x86_64, verdre, williambader |
Priority: | P1 | ||
Hardware: | x86-64 | ||
OS: | Linux | ||
Kernel Version: | v5.6-rc1 | Subsystem: | |
Regression: | Yes | Bisected commit-id: | |
Attachments: |
dmesg from booting Fedora 32 kernel-5.6.7-300.fc32.x86_64
acpi_evaluate_object returns buffers SNC calls should handle BUFFER types Don't use thermal handle if NULL |
Description
Dominik Mierzejewski
2020-04-28 20:55:22 UTC
Created attachment 288807 [details]
dmesg from booting Fedora 32 kernel-5.6.7-300.fc32.x86_64
Machine is a Sony Vaio Pro 13 (Intel Haswell i7-4500U) with SecureBoot enabled. Downstream bug report: https://bugzilla.redhat.com/show_bug.cgi?id=1829096 . I managed to get an Oops on the console after adding no_console_suspend to kernel command line: [ 86.898573] Freezing user space processes ... (elapsed 0.001 seconds) done. [ 86.900150] OOM killer disabled. [ 86.900165] Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done. [ 86.901579] wlp1s0: deauthenticating from aa:bb:cc:dd:ee:ff by local choice (Reason: 3=DEAUTH_LEAVING) [ 86.907405] sd 0:0:0:0: [sda] Synchronizing SCSI cache [ 86.910092] sd 0:0:0:0: [sda] Stopping disk [ 86.939714] Removing pn544 [ 87.216904] PM: suspend devices took 0.315 seconds [ 87.230238] ACPI: EC: interrupt blocked [ 87.244083] ACPI: Preparing to enter system sleep state S3 [ 87.245040] ACPI: EC: event blocked [ 87.245046] ACPI: EC: EC stopped [ 87.245049] PM: Saving platform NVS memory [ 87.245306] Disabling non-boot CPUs ... [ 87.246142] IRQ 16: no longer affine to CPU1 [ 87.247209] smpboot: CPU 1 is now offline [ 87.252696] IRQ 45: no longer affine to CPU2 [ 87.253848] smpboot: CPU 2 is now offline [ 87.256900] IRQ 23: no longer affine to CPU3 [ 87.256902] IRQ 43: no longer affine to CPU3 [ 87.256905] IRQ 49: no longer affine to CPU3 [ 87.257919] smpboot: CPU 3 is now offline [ 87.261044] ACPI: Low-level resume complete [ 87.261115] ACPI: EC: EC started [ 87.261118] PM: Restoring platform NVS memory [ 87.265116] Enabling non-boot CPUs ... [ 87.265175] x86: Booting SMP configuration: [ 87.265179] smpboot: Booting Node 0 Processor 1 APIC 0x2 [ 87.268461] CPU1 is up [ 87.268508] smpboot: Booting Node 0 Processor 2 APIC 0x1 [ 87.269612] CPU2 is up [ 87.269649] smpboot: Booting Node 0 Processor 3 APIC 0x3 [ 87.270604] CPU3 is up [ 87.272498] ACPI: Waking up from system sleep state S3 [ 87.274019] ACPI: button: The lid device is not compliant to SW_LID. [ 87.274453] ACPI: EC: interrupt unblocked [ 87.298871] ACPI: EC: event unblocked [ 87.309328] sd 0:0:0:0: [sda] Starting disk [ 87.313944] sony_laptop: invalid acpi_object: expected 0x1 got 0x3 [ 87.314115] sony_laptop: invalid acpi_object: expected 0x1 got 0x3 [ 87.315003] sony_laptop: invalid acpi_object: expected 0x1 got 0x3 [ 87.315012] BUG: kernel NULL pointer dereference, address: 000000000000000000 [ 87.315014] #PF: supervisor read access in kernel mode [ 87.315015] #PF: error_code(0x0000) - not-present page [ 87.315016] PGD 0 P4D 0 [ 87.315019] Oops: 0000 [#1] SMP PTI [ 87.315021] CPU: 0 PID: 1796 Comm: systemd-sleep Not tainted 5.6.7-300-fc32.x86_64 #1 [ 87.315023] Hardware name: Sony Corporation SVP1322C5E/VAIO, BIOS R2091V7 03/24/2014 [ 87.315028] RIP: 0100:sony_nc_resume+0x1de/0x200 [sony_laptop] [ 87.315030] Code: ff ff ff e9 40 ff ff ff 4c 89 e2 be 00 01 00 00 bf 22 01 00 00 e8 f2 df ff ff 85 c0 75 23 0f b6 44 24 0c 48 8b 15 12 97 00 00 <8b> 3a 39 c7 0f 84 14 ff ff ff 0f b7 ff e8 30 e0 ff ff e9 07 ff ff [ 87.315032] RSP: 0018:ffffa98140953d10 EFLAGS: 00010282 [ 87.315034] RAX: 00000000fffffffb RBX: 000000000000000f RCX: 000000000000937d [ 87.315036] RDX: 0000000000000000 RSI: c5f11fee0037bf32 RDI: 0000000000030080 [ 87.315037] RBP: ffff8c69963ab260 R08: 0000000000000461 R09: 0000000000000029 [ 87.315039] R10: ffff8c696c4a59a0 R11: 0000000000000000 R12: ffffa98140953d1c [ 87.315040] R13: 0000000000000000 R14: ffffffffba3df2e1 R15: 0000000000000010 [ 87.315042] FS: 00007f45e0ab2b80(0000) GS:ffff8c6997a00000(0000) knlGS:0000000000000000 [ 87.315044] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000000050033 [ 87.315045] CR2: 0000000000000000 CR3: 00000001fcc7a001 CR4: 00000000001606f0 [ 87.315047] Call Trace: [ 87.315052] ? _cond_resched+0x16/0x40 [ 87.315054] ? sony_nc_thermal_mode_show+0x60/0x60 [sony_laptop] [ 87.315057] dpm_run_callback+0x4f/0x140 [ 87.315059] device_resume+0x136/0x200 [ 87.315062] dpm_resume+0xce/0x2e0 [ 87.315064] dpm_resume_end+0xd/0x20 [ 87.628326] ata1.00: configured for UDMA/133 SControl 300)ci_hcdType 0x1000000000000000f ff ff 85 c0 75 23 0f b6 44 24 0c 48 8b 15 12 97 00 00 <8b> 3a 39 c7 0f 84 14 ff ff ff 0f b7 ff e8 30 e0 ff ffimer mei_me snd mei soundc Looks like sony_laptop module is triggering this, but I don't see any source changes between 5.5.17 and 5.6.7 in that module. There are some new errors in dmesg compared to 5.6.7, though: [ 18.419670] sony_laptop: Invalid acpi_object: expected 0x1 got 0x3 [ 18.422698] sony_laptop: Invalid acpi_object: expected 0x1 got 0x3 [ 18.424856] sony_laptop: couldn't set up keyboard backlight function (-22) [ 18.428007] sony_laptop: Invalid acpi_object: expected 0x1 got 0x3 [ 18.430306] sony_laptop: Invalid acpi_object: expected 0x1 got 0x3 [ 18.433781] sony_laptop: Invalid acpi_object: expected 0x1 got 0x3 [ 18.435843] sony_laptop: No USB Charge capability found [ 18.438865] sony_laptop: Invalid acpi_object: expected 0x1 got 0x3 [ 18.441430] sony_laptop: couldn't set up lid resume function (-5) [ 18.443902] sony_laptop: Invalid acpi_object: expected 0x1 got 0x3 [ 18.446393] sony_laptop: couldn't to read the thermal profiles [ 18.448270] sony_laptop: couldn't set up thermal profile function (-22) [ 18.452028] sony_laptop: SNC setup done. Cc'ing sony_laptop maintainer. Booting with sony-laptop blacklisted makes resume from suspend work again. Thanks to Hans de Goede for the suggestion. Added x86_64 platform maintainers to Cc: as well. I'm seeing the same issues on a Vaio Pro 13 device, did you already try to bisect Dominik? I didn't have time yet. For now, I'm using a work-around to unload the module before suspend and reload after resume: $ cat /usr/lib/systemd/system-sleep/sony-laptop #!/bin/sh if [ "${1}" == "pre" ]; then logger --journald <<__end MESSAGE=removing sony-laptop module before suspending, see https://bugzilla.redhat.com/show_bug.cgi?id=1829096 __end modprobe -r sony-laptop elif [ "${1}" == "post" ]; then logger --journald <<__end MESSAGE=reloading sony-laptop module after resuming, see https://bugzilla.redhat.com/show_bug.cgi?id=1829096 __end modprobe sony-laptop fi # chmod +x /usr/lib/systemd/system-sleep/sony-laptop If you have time to do the bisect, please do it. I'll appreciate it. Looks like it was bisected by William Bader in https://bugzilla.redhat.com/show_bug.cgi?id=1830150 . The offending commit seems to be: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=v5.6-rc1&id=6d232b29cfce65961db4a668c2c6c6987cd24d45 Quoting Maximillian Luz from the above ticket: The referenced patch changes the behavior of the Linux AML interpreter to match the behavior of the Windows one. Prior to this patch, AML buffer fields created via CreateField(...) were automatically converted to type Integer on return if they were small enough (<= 64 bits on a 64 bit machine). It seems that this is the behavior expected in the sony-laptop driver. As an example: The DSDT contains the following code: Mutex (SNM1, 0x00) Mutex (SNM2, 0x00) Name (SNBF, Buffer (0x0410){}) CreateField (SNBF, Zero, 0x20, SNBD) CreateWordField (SNBF, 0x02, CPW0) CreateWordField (SNBF, 0x04, CPW1) CreateWordField (SNBF, Zero, RCW0) CreateWordField (SNBF, 0x02, RCW1) Method (SN07, 1, NotSerialized) { Acquire (SNM2, 0xFFFF) Local0 = Arg0 SNBD = Local0 SNCM () Release (SNM2) Return (SNBD) /* \_SB_.PCI0.LPCB.SNC_.SNBD */ } Prior to the patch, SN07 returned an Integer type, calls to SN07 from sony-laptop expect this: https://elixir.bootlin.com/linux/v5.6.10/source/drivers/platform/x86/sony-laptop.c#L913. This call now fails and we see sony_laptop: Invalid acpi_object: expected 0x1 got 0x3 in the logs (see https://elixir.bootlin.com/linux/v5.6.10/source/drivers/platform/x86/sony-laptop.c#L774). Type 0x3 is Buffer, type 0x1 is Integer. I assume that something like this could break suspend/resume if called during any of this. As far as I can see, the easiest solution would be to convert the returned buffer to an integer in sony_nc_int_call. (In reply to Dominik Mierzejewski from comment #10) ... > As far as I can see, the easiest solution would be to convert the returned > buffer to an integer in sony_nc_int_call. Sounds about right. A few lines below, in sony_nc_buffer_call (https://elixir.bootlin.com/linux/v5.6.10/source/drivers/platform/x86/sony-laptop.c#L788) you see the driver doing some similar conversion. I just booted up my Vaio after about ~5 months not using it, I'll try to reproduce and cook a patch. Created attachment 288901 [details] acpi_evaluate_object returns buffers I tested this patch on my laptop and I get the files and functionality I expect in /sys/devices/platform/sony-laptop/ for my Vaio Pro 11. Suspend and resume work with the module loaded. I assume this patch should go to stable@vger.kernel.org. It's been some time, I need to double check what's the new rules to submit patches now. What happens in sony_nc_int_call() if the buffer is smaller than sizeof(int)? Won't that leave high bytes of *result uninitialized, or can't that ever happen? That's right, sorry about that. It's worth at least zeroing the value in sony_nc_int_call before passing it along to the buffer_call, much like you've done in your patch on https://bugzilla.redhat.com/show_bug.cgi?id=1830150. Note that I started replying to this bugzilla instead of the redhat one just because I received this one first. If we should discuss on https://bugzilla.redhat.com/show_bug.cgi?id=1830150 instead, I'll be happy to move over there. For the record, I'm going to add a second patch to fix the NULL deref in the thermal profile setup. Created attachment 288903 [details]
SNC calls should handle BUFFER types
Created attachment 288905 [details]
Don't use thermal handle if NULL
>second patch to fix the NULL deref in the thermal profile setup. I do not get BUG: kernel NULL pointer dereference, address: 000000000000000000 booted from 5.6.8-200.fc31.x86_64 My laptop is a Sony Vaio VPCCB4Q1E, if that makes a difference. >I started replying to this bugzilla I am CC'd on both this bugzilla and the redhat bugzilla, so it doesn't matter to me. I'll test the two patches. The NULL pointer would happen on some models (Vaio Pros for sure) after having resumed from suspend. From the trace that Dominik provided it seems to make sense that the NULL is in sony_nc_thermal_resume, but his call trace doesn't completely agree and points to sony_nc_thermal_mode_show. On the other hand I don't see how sony_nc_thermal_mode_show is called from sony_nc_resume. [ 87.315028] RIP: 0100:sony_nc_resume+0x1de/0x200 [sony_laptop] ... [ 87.315054] ? sony_nc_thermal_mode_show+0x60/0x60 [sony_laptop] (In reply to William Bader from comment #17) ... > I'll test the two patches. And also, thanks for your help. Copy and pasting my comment from: https://bugzilla.redhat.com/show_bug.cgi?id=1830150 here to keep everyone in the loop: Hi All, Thank you all for your great work on finding the cause of this. This potentially impacts a lot of other code then just the sony-laptop driver. So I believe that this is best discussed with various involved upstream maintainers. I plan to start a mailinglist thread about this soon. William, Steve, Mattia. I would like to add you to the Cc of the upstream discussion, but this will leak the email address you use for bugzilla to pretty much the entire world. Can you please let me know if that is ok ? Or mail me a different email address to use at hdegoede@redhat.com . Regards, Hans Dominik, same question for you, do you want to be on the Cc for an upstream thread about this. Mattia, I just realized that you are the sony-laptop driver maintainer, thank you for the quick fix. I'm happy to see that the sony-laptop driver is actively maintained. I've mixed experiences with how responsive maintainers of individual platform/x86 drivers are. I guess that we might go with your fix, the reason I want to discuss this upstream is because this likely will impact some other drivers too, so then the question becomes if we revert the ACPICA change, or fix up all the drivers as bug reports come in ? Hi Hans. (In reply to Hans de Goede from comment #21) > [...] the sony-laptop driver is actively maintained. Heh, not as much as I would like... It's actually a mix of a long weekend where I live and the pandemic lock-down that got me into looking at this regression. On the bright side I'm enjoying it, maybe I'll find more time to toy around in the future. > I guess that we might go with your fix, the reason I want to discuss this > upstream is because this likely will impact some other drivers too, so then > the question becomes if we revert the ACPICA change, or fix up all the > drivers as bug reports come in ? I'd want the fixes for sony-laptop regardless of what happens with the ACPICA change. I just sent the series to the platform-driver-x86 list. The discussion with upstream on how to deal with the larger issue of the interpreter change in behaviour is likely necessary. Thanks for taking the time to kick start it (yes, I'd like to be CC'd). (In reply to Mattia Dongili from comment #22) > I'd want the fixes for sony-laptop regardless of what happens with the > ACPICA change. I just sent the series to the platform-driver-x86 list. Hmm, but if the ACPICA change gets reverted then things will not work with your new changes because you now check that object->type == ACPI_TYPE_BUFFER even in the sony_nc_int_call() paths, which will break if the change gets reverted. > The discussion with upstream on how to deal with the larger issue of the > interpreter change in behaviour is likely necessary. Thanks for taking the > time to kick start it (yes, I'd like to be CC'd). Great, I'm going to wait a bit to see if I get also get a request to be on the Cc from some of the others and then I'll send out an email about this. Update: I've gone ahead and send out the email with William and Dominik added to the Cc. (In reply to Hans de Goede from comment #23) > (In reply to Mattia Dongili from comment #22) > > I'd want the fixes for sony-laptop regardless of what happens with the > > ACPICA change. I just sent the series to the platform-driver-x86 list. > > Hmm, but if the ACPICA change gets reverted then things will not work with > your new changes because you now check that object->type == ACPI_TYPE_BUFFER > even in the > sony_nc_int_call() paths, which will break if the change gets reverted. There's existing code that checks for ACPI_TYPE_BUFFER and if that fails, it checks ACPI_TYPE_INTEGER: https://github.com/torvalds/linux/blob/master/drivers/platform/x86/sony-laptop.c#L798-L805 My patch only adds the missing bits in sony_nc_buffer_call from sony_nc_int_call so that it should be possible to use it for both types. I'm trying the first patch with the evtypes change reverted just to make sure, but I'm relatively confident it should work. (In reply to Mattia Dongili from comment #25) > I'm trying the first patch with the evtypes change reverted just to make > sure, but I'm relatively confident it should work. Ok, I did not notice the integer type check in the other place, since I was only looking at the patch, I think you are right and that the new code should work fine in both cases. >I've gone ahead and send out the email with William and Dominik added to the >Cc. Thanks. I can test patches if necessary. >pandemic lock-down That is also how I had a weekend to do the bisection. >This potentially impacts a lot of other code then just the sony-laptop driver. Sony hasn't made laptops for several years, so there probably aren't a large number left, but Dominik and I both found the resume problem within days of Fedora moving to 5.6 kernels. Why doesn't the redhat bugzilla have a long list of bug reports for other laptop vendors by now? >I'm trying the first patch with the evtypes change reverted Could that patch have been implemented in device-specific code like in a surface-laptop.c driver similar to the sony-laptop.c driver, or did it need to go into shared code? (In reply to William Bader from comment #27) > Sony hasn't made laptops for several years, so there probably aren't a large > number left, but Dominik and I both found the resume problem within days of > Fedora moving to 5.6 kernels. Why doesn't the redhat bugzilla have a long > list of bug reports for other laptop vendors by now? I honestly don't expect that much code to rely on the buffer-to-integer conversion. If the writers of the ACPI code/DSDT in question had intended to return an Integer, I think they would have chosen CreateDWordField (creating a 32 bit Integer) instead of CreateWordField with a length of 32 bits. It's a fairly specific situation where this issue occurs. You have to return a buffer field from ACPI, created via CreateField. If you return any other variable, this doesn't apply. If the field is too large (>64 bits), you'd have to handle it as buffer field in the first place (pre commit) and thus the issue also wouldn't apply. I suspect that most stuff returned from ACPI are simple integer variables, less buffer fields, and much less fields created via CreateField that fall into that category. > Could that patch have been implemented in device-specific code like in a > surface-laptop.c driver similar to the sony-laptop.c driver, or did it need > to go into shared code? The difference between the sony-laptop issue and the issue experienced on the Surface devices is where the problem is: For Sony laptops, the problem is in the driver, for Surface laptops, the problem is in the DSDT/ACPI code execution (see commit message of the commit in question for details). So there's nothing you can do with a device-specific kernel driver. You could a) change the Linux interpreter as done in the commit, b) do a) but only for Surface devices, or c) patch the DSDT for Surface devices specifically for Linux. This commit changes the AML _interpreter_ behavior to match the behavior of Windows. And as much as anyone may hate it, but MS kind of dictates the specification here, as pretty much all devices will be validated against their interpreter. So option b) ignores the fact that this may happen on other devices too (and introduces way more complexity for basically nothing), and option c) is straight up not viable if you want to have your average user be able to install Linux on such a device (at least as far as I know, my knowledge of ACPI patching is limited). This leaves us option a) and handling the fallout as we experience it now. Yes, it's unfortunate. sony-laptop fixes are in mainline and have been picked up for the upcoming 5.6 and 5.7 stable releases. I'm trying to figure out how to close this bug report, but I won't be offended if someone closes before I do. Sure, that's easy. :-) |