Bug 207491 - [regression in 5.6][bisected] "BUG: kernel NULL pointer dereference" after resuming from suspend
Summary: [regression in 5.6][bisected] "BUG: kernel NULL pointer dereference" after re...
Status: RESOLVED CODE_FIX
Alias: None
Product: Platform Specific/Hardware
Classification: Unclassified
Component: x86-64 (show other bugs)
Hardware: x86-64 Linux
: P1 normal
Assignee: Rafael J. Wysocki
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-04-28 20:55 UTC by Dominik Mierzejewski
Modified: 2020-06-17 09:03 UTC (History)
7 users (show)

See Also:
Kernel Version: v5.6-rc1
Tree: Mainline
Regression: Yes


Attachments
dmesg from booting Fedora 32 kernel-5.6.7-300.fc32.x86_64 (77.80 KB, text/plain)
2020-04-28 21:03 UTC, Dominik Mierzejewski
Details
acpi_evaluate_object returns buffers (2.74 KB, patch)
2020-05-05 02:47 UTC, Mattia Dongili
Details | Diff
SNC calls should handle BUFFER types (3.11 KB, patch)
2020-05-05 05:04 UTC, Mattia Dongili
Details | Diff
Don't use thermal handle if NULL (1.14 KB, patch)
2020-05-05 05:05 UTC, Mattia Dongili
Details | Diff

Description Dominik Mierzejewski 2020-04-28 20:55:22 UTC
After upgrading from 5.5.17 to 5.6.7 (Fedora 31 and 32), the machine no longer resumes after suspend successfully. The screen comes up, but there's no reaction to mouse or keyboard input and the machine remains inaccessible via network (iwlwifi). Fedora rawhide kernel-5.7.0-0.rc3.1.fc33 suffers from this as well.
Comment 1 Dominik Mierzejewski 2020-04-28 21:03:09 UTC
Created attachment 288807 [details]
dmesg from booting Fedora 32 kernel-5.6.7-300.fc32.x86_64
Comment 2 Dominik Mierzejewski 2020-04-28 21:04:26 UTC
Machine is a Sony Vaio Pro 13 (Intel Haswell i7-4500U) with SecureBoot enabled.
Comment 3 Dominik Mierzejewski 2020-04-28 21:08:44 UTC
Downstream bug report: https://bugzilla.redhat.com/show_bug.cgi?id=1829096 .
Comment 4 Dominik Mierzejewski 2020-04-29 08:00:07 UTC
I managed to get an Oops on the console after adding no_console_suspend to kernel command line:

[   86.898573] Freezing user space processes ... (elapsed 0.001 seconds) done.
[   86.900150] OOM killer disabled.
[   86.900165] Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done.
[   86.901579] wlp1s0: deauthenticating from aa:bb:cc:dd:ee:ff by local choice (Reason: 3=DEAUTH_LEAVING)
[   86.907405] sd 0:0:0:0: [sda] Synchronizing SCSI cache
[   86.910092] sd 0:0:0:0: [sda] Stopping disk
[   86.939714] Removing pn544
[   87.216904] PM: suspend devices took 0.315 seconds
[   87.230238] ACPI: EC: interrupt blocked
[   87.244083] ACPI: Preparing to enter system sleep state S3
[   87.245040] ACPI: EC: event blocked
[   87.245046] ACPI: EC: EC stopped
[   87.245049] PM: Saving platform NVS memory
[   87.245306] Disabling non-boot CPUs ...
[   87.246142] IRQ 16: no longer affine to CPU1
[   87.247209] smpboot: CPU 1 is now offline
[   87.252696] IRQ 45: no longer affine to CPU2
[   87.253848] smpboot: CPU 2 is now offline
[   87.256900] IRQ 23: no longer affine to CPU3
[   87.256902] IRQ 43: no longer affine to CPU3
[   87.256905] IRQ 49: no longer affine to CPU3
[   87.257919] smpboot: CPU 3 is now offline
[   87.261044] ACPI: Low-level resume complete
[   87.261115] ACPI: EC: EC started
[   87.261118] PM: Restoring platform NVS memory
[   87.265116] Enabling non-boot CPUs ...
[   87.265175] x86: Booting SMP configuration:
[   87.265179] smpboot: Booting Node 0 Processor 1 APIC 0x2
[   87.268461] CPU1 is up
[   87.268508] smpboot: Booting Node 0 Processor 2 APIC 0x1
[   87.269612] CPU2 is up
[   87.269649] smpboot: Booting Node 0 Processor 3 APIC 0x3
[   87.270604] CPU3 is up
[   87.272498] ACPI: Waking up from system sleep state S3
[   87.274019] ACPI: button: The lid device is not compliant to SW_LID.
[   87.274453] ACPI: EC: interrupt unblocked
[   87.298871] ACPI: EC: event unblocked
[   87.309328] sd 0:0:0:0: [sda] Starting disk
[   87.313944] sony_laptop: invalid acpi_object: expected 0x1 got 0x3
[   87.314115] sony_laptop: invalid acpi_object: expected 0x1 got 0x3
[   87.315003] sony_laptop: invalid acpi_object: expected 0x1 got 0x3
[   87.315012] BUG: kernel NULL pointer dereference, address: 000000000000000000
[   87.315014] #PF: supervisor read access in kernel mode
[   87.315015] #PF: error_code(0x0000) - not-present page
[   87.315016] PGD 0 P4D 0
[   87.315019] Oops: 0000 [#1] SMP PTI
[   87.315021] CPU: 0 PID: 1796 Comm: systemd-sleep Not tainted 5.6.7-300-fc32.x86_64 #1
[   87.315023] Hardware name: Sony Corporation SVP1322C5E/VAIO, BIOS R2091V7 03/24/2014
[   87.315028] RIP: 0100:sony_nc_resume+0x1de/0x200 [sony_laptop]
[   87.315030] Code: ff ff ff e9 40 ff ff ff 4c 89 e2 be 00 01 00 00 bf 22 01 00 00 e8 f2 df ff ff 85 c0 75 23 0f b6 44 24 0c 48 8b 15 12 97 00 00 <8b> 3a 39 c7 0f 84 14 ff ff ff 0f b7 ff e8 30 e0 ff ff e9 07 ff ff
[   87.315032] RSP: 0018:ffffa98140953d10 EFLAGS: 00010282
[   87.315034] RAX: 00000000fffffffb RBX: 000000000000000f RCX: 000000000000937d
[   87.315036] RDX: 0000000000000000 RSI: c5f11fee0037bf32 RDI: 0000000000030080
[   87.315037] RBP: ffff8c69963ab260 R08: 0000000000000461 R09: 0000000000000029
[   87.315039] R10: ffff8c696c4a59a0 R11: 0000000000000000 R12: ffffa98140953d1c
[   87.315040] R13: 0000000000000000 R14: ffffffffba3df2e1 R15: 0000000000000010
[   87.315042] FS:  00007f45e0ab2b80(0000) GS:ffff8c6997a00000(0000) knlGS:0000000000000000
[   87.315044] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000000050033
[   87.315045] CR2: 0000000000000000 CR3: 00000001fcc7a001 CR4: 00000000001606f0
[   87.315047] Call Trace:
[   87.315052]  ? _cond_resched+0x16/0x40
[   87.315054]  ? sony_nc_thermal_mode_show+0x60/0x60 [sony_laptop]
[   87.315057] dpm_run_callback+0x4f/0x140
[   87.315059] device_resume+0x136/0x200
[   87.315062] dpm_resume+0xce/0x2e0
[   87.315064] dpm_resume_end+0xd/0x20
[   87.628326] ata1.00: configured for UDMA/133 SControl 300)ci_hcdType 0x1000000000000000f ff ff 85 c0 75 23 0f b6 44 24 0c 48 8b 15 12 97 00 00 <8b> 3a 39 c7 0f 84 14 ff ff ff 0f b7 ff e8 30 e0 ff ffimer mei_me snd mei soundc
Comment 5 Dominik Mierzejewski 2020-04-29 10:12:58 UTC
Looks like sony_laptop module is triggering this, but I don't see any source changes between 5.5.17 and 5.6.7 in that module. There are some new errors in dmesg compared to 5.6.7, though:
[   18.419670] sony_laptop: Invalid acpi_object: expected 0x1 got 0x3
[   18.422698] sony_laptop: Invalid acpi_object: expected 0x1 got 0x3
[   18.424856] sony_laptop: couldn't set up keyboard backlight function (-22)
[   18.428007] sony_laptop: Invalid acpi_object: expected 0x1 got 0x3
[   18.430306] sony_laptop: Invalid acpi_object: expected 0x1 got 0x3
[   18.433781] sony_laptop: Invalid acpi_object: expected 0x1 got 0x3
[   18.435843] sony_laptop: No USB Charge capability found
[   18.438865] sony_laptop: Invalid acpi_object: expected 0x1 got 0x3
[   18.441430] sony_laptop: couldn't set up lid resume function (-5)
[   18.443902] sony_laptop: Invalid acpi_object: expected 0x1 got 0x3
[   18.446393] sony_laptop: couldn't to read the thermal profiles
[   18.448270] sony_laptop: couldn't set up thermal profile function (-22)
[   18.452028] sony_laptop: SNC setup done.

Cc'ing sony_laptop maintainer.
Comment 6 Dominik Mierzejewski 2020-04-29 11:05:08 UTC
Booting with sony-laptop blacklisted makes resume from suspend work again. Thanks to Hans de Goede for the suggestion.
Comment 7 Dominik Mierzejewski 2020-04-29 21:41:56 UTC
Added x86_64 platform maintainers to Cc: as well.
Comment 8 verdre 2020-05-04 09:56:42 UTC
I'm seeing the same issues on a Vaio Pro 13 device, did you already try to bisect Dominik?
Comment 9 Dominik Mierzejewski 2020-05-04 10:35:22 UTC
I didn't have time yet. For now, I'm using a work-around to unload the module before suspend and reload after resume:

$ cat /usr/lib/systemd/system-sleep/sony-laptop 
#!/bin/sh
if [ "${1}" == "pre" ]; then
  logger --journald <<__end
MESSAGE=removing sony-laptop module before suspending, see https://bugzilla.redhat.com/show_bug.cgi?id=1829096
__end
  modprobe -r sony-laptop
elif [ "${1}" == "post" ]; then
  logger --journald <<__end
MESSAGE=reloading sony-laptop module after resuming, see https://bugzilla.redhat.com/show_bug.cgi?id=1829096
__end
  modprobe sony-laptop
fi
# chmod +x /usr/lib/systemd/system-sleep/sony-laptop

If you have time to do the bisect, please do it. I'll appreciate it.
Comment 10 Dominik Mierzejewski 2020-05-04 22:05:12 UTC
Looks like it was bisected by William Bader in https://bugzilla.redhat.com/show_bug.cgi?id=1830150 . The offending commit seems to be: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=v5.6-rc1&id=6d232b29cfce65961db4a668c2c6c6987cd24d45

Quoting Maximillian Luz from the above ticket:

The referenced patch changes the behavior of the Linux AML interpreter to match the behavior of the Windows one. Prior to this patch, AML buffer fields created via CreateField(...) were automatically converted to type Integer on return if they were small enough (<= 64 bits on a 64 bit machine). It seems that this is the behavior expected in the sony-laptop driver.

As an example: The DSDT contains the following code:

  Mutex (SNM1, 0x00)
  Mutex (SNM2, 0x00)
  Name (SNBF, Buffer (0x0410){})
  CreateField (SNBF, Zero, 0x20, SNBD)
  CreateWordField (SNBF, 0x02, CPW0)
  CreateWordField (SNBF, 0x04, CPW1)
  CreateWordField (SNBF, Zero, RCW0)
  CreateWordField (SNBF, 0x02, RCW1)

  Method (SN07, 1, NotSerialized)
  {
      Acquire (SNM2, 0xFFFF)
      Local0 = Arg0
      SNBD = Local0
      SNCM ()
      Release (SNM2)
      Return (SNBD) /* \_SB_.PCI0.LPCB.SNC_.SNBD */
  }

Prior to the patch, SN07 returned an Integer type, calls to SN07 from sony-laptop expect this: https://elixir.bootlin.com/linux/v5.6.10/source/drivers/platform/x86/sony-laptop.c#L913. This call now fails and we see

  sony_laptop: Invalid acpi_object: expected 0x1 got 0x3

in the logs (see https://elixir.bootlin.com/linux/v5.6.10/source/drivers/platform/x86/sony-laptop.c#L774). Type 0x3 is Buffer, type 0x1 is Integer. I assume that something like this could break suspend/resume if called during any of this.

As far as I can see, the easiest solution would be to convert the returned buffer to an integer in sony_nc_int_call.
Comment 11 Mattia Dongili 2020-05-05 00:01:05 UTC
(In reply to Dominik Mierzejewski from comment #10)
...
> As far as I can see, the easiest solution would be to convert the returned
> buffer to an integer in sony_nc_int_call.

Sounds about right. A few lines below, in sony_nc_buffer_call (https://elixir.bootlin.com/linux/v5.6.10/source/drivers/platform/x86/sony-laptop.c#L788) you see the driver doing some similar conversion.

I just booted up my Vaio after about ~5 months not using it, I'll try to reproduce and cook a patch.
Comment 12 Mattia Dongili 2020-05-05 02:47:42 UTC
Created attachment 288901 [details]
acpi_evaluate_object returns buffers

I tested this patch on my laptop and I get the files and functionality I expect in /sys/devices/platform/sony-laptop/ for my Vaio Pro 11.
Suspend and resume work with the module loaded.

I assume this patch should go to stable@vger.kernel.org.

It's been some time, I need to double check what's the new rules to submit patches now.
Comment 13 William Bader 2020-05-05 03:44:27 UTC
What happens in sony_nc_int_call() if the buffer is smaller than sizeof(int)? Won't that leave high bytes of *result uninitialized, or can't that ever happen?
Comment 14 Mattia Dongili 2020-05-05 04:50:47 UTC
That's right, sorry about that. It's worth at least zeroing the value in sony_nc_int_call before passing it along to the buffer_call, much like you've done in your patch on https://bugzilla.redhat.com/show_bug.cgi?id=1830150.


Note that I started replying to this bugzilla instead of the redhat one just because I received this one first. If we should discuss on https://bugzilla.redhat.com/show_bug.cgi?id=1830150 instead, I'll be happy to move over there.

For the record, I'm going to add a second patch to fix the NULL deref in the thermal profile setup.
Comment 15 Mattia Dongili 2020-05-05 05:04:52 UTC
Created attachment 288903 [details]
SNC calls should handle BUFFER types
Comment 16 Mattia Dongili 2020-05-05 05:05:13 UTC
Created attachment 288905 [details]
Don't use thermal handle if NULL
Comment 17 William Bader 2020-05-05 05:31:18 UTC
>second patch to fix the NULL deref in the thermal profile setup.

I do not get
BUG: kernel NULL pointer dereference, address: 000000000000000000
booted from 5.6.8-200.fc31.x86_64
My laptop is a Sony Vaio VPCCB4Q1E, if that makes a difference.

>I started replying to this bugzilla

I am CC'd on both this bugzilla and the redhat bugzilla, so it doesn't matter to me.

I'll test the two patches.
Comment 18 Mattia Dongili 2020-05-05 05:44:15 UTC
The NULL pointer would happen on some models (Vaio Pros for sure) after having resumed from suspend.

From the trace that Dominik provided it seems to make sense that the NULL is in sony_nc_thermal_resume, but his call trace doesn't completely agree and points to sony_nc_thermal_mode_show. On the other hand I don't see how sony_nc_thermal_mode_show is called from sony_nc_resume.

[   87.315028] RIP: 0100:sony_nc_resume+0x1de/0x200 [sony_laptop]
...
[   87.315054]  ? sony_nc_thermal_mode_show+0x60/0x60 [sony_laptop]
Comment 19 Mattia Dongili 2020-05-05 05:48:49 UTC
(In reply to William Bader from comment #17)
... 
> I'll test the two patches.

And also, thanks for your help.
Comment 20 Hans de Goede 2020-05-05 09:18:28 UTC
Copy and pasting my comment from:
https://bugzilla.redhat.com/show_bug.cgi?id=1830150
here to keep everyone in the loop:

Hi All,

Thank you all for your great work on finding the cause of this. This potentially impacts a lot of other code then just the sony-laptop driver. So I believe that this is best discussed with various involved upstream maintainers. I plan to start a mailinglist thread about this soon.

William, Steve, Mattia. I would like to add you to the Cc of the upstream discussion, but this will leak the email address you use for bugzilla to pretty much the entire world. Can you please let me know if that is ok ?  Or mail me a different email address to use at hdegoede@redhat.com .

Regards,

Hans


Dominik, same question for you, do you want to be on the Cc for an upstream thread about this.
Comment 21 Hans de Goede 2020-05-05 09:20:57 UTC
Mattia, I just realized that you are the sony-laptop driver maintainer, thank you for the quick fix. I'm happy to see that the sony-laptop driver is actively maintained. I've mixed experiences with how responsive maintainers of individual platform/x86 drivers are.

I guess that we might go with your fix, the reason I want to discuss this upstream is because this likely will impact some other drivers too, so then the question becomes if we revert the ACPICA change, or fix up all the drivers as bug reports come in ?
Comment 22 Mattia Dongili 2020-05-05 10:18:44 UTC
Hi Hans.

(In reply to Hans de Goede from comment #21)
> [...] the sony-laptop driver is actively maintained.

Heh, not as much as I would like... It's actually a mix of a long weekend where I live and the pandemic lock-down that got me into looking at this regression.
On the bright side I'm enjoying it, maybe I'll find more time to toy around in the future.

> I guess that we might go with your fix, the reason I want to discuss this
> upstream is because this likely will impact some other drivers too, so then
> the question becomes if we revert the ACPICA change, or fix up all the
> drivers as bug reports come in ?

I'd want the fixes for sony-laptop regardless of what happens with the ACPICA change. I just sent the series to the platform-driver-x86 list.

The discussion with upstream on how to deal with the larger issue of the interpreter change in behaviour is likely necessary. Thanks for taking the time to kick start it (yes, I'd like to be CC'd).
Comment 23 Hans de Goede 2020-05-05 12:43:22 UTC
(In reply to Mattia Dongili from comment #22)
> I'd want the fixes for sony-laptop regardless of what happens with the
> ACPICA change. I just sent the series to the platform-driver-x86 list.

Hmm, but if the ACPICA change gets reverted then things will not work with your new changes because you now check that object->type == ACPI_TYPE_BUFFER even in the
sony_nc_int_call() paths, which will break if the change gets reverted.

> The discussion with upstream on how to deal with the larger issue of the
> interpreter change in behaviour is likely necessary. Thanks for taking the
> time to kick start it (yes, I'd like to be CC'd).

Great, I'm going to wait a bit to see if I get also get a request to be on the Cc from some of the others and then I'll send out an email about this.
Comment 24 Hans de Goede 2020-05-05 13:03:32 UTC
Update: I've gone ahead and send out the email with William and Dominik added to the Cc.
Comment 25 Mattia Dongili 2020-05-05 13:28:31 UTC
(In reply to Hans de Goede from comment #23)
> (In reply to Mattia Dongili from comment #22)
> > I'd want the fixes for sony-laptop regardless of what happens with the
> > ACPICA change. I just sent the series to the platform-driver-x86 list.
> 
> Hmm, but if the ACPICA change gets reverted then things will not work with
> your new changes because you now check that object->type == ACPI_TYPE_BUFFER
> even in the
> sony_nc_int_call() paths, which will break if the change gets reverted.

There's existing code that checks for ACPI_TYPE_BUFFER and if that fails, it checks ACPI_TYPE_INTEGER:
https://github.com/torvalds/linux/blob/master/drivers/platform/x86/sony-laptop.c#L798-L805

My patch only adds the missing bits in sony_nc_buffer_call from sony_nc_int_call so that it should be possible to use it for both types.

I'm trying the first patch with the evtypes change reverted just to make sure, but I'm relatively confident it should work.
Comment 26 Hans de Goede 2020-05-05 13:31:09 UTC
(In reply to Mattia Dongili from comment #25)
> I'm trying the first patch with the evtypes change reverted just to make
> sure, but I'm relatively confident it should work.

Ok, I did not notice the integer type check in the other place, since I was only looking at the patch, I think you are right and that the new code should work fine in both cases.
Comment 27 William Bader 2020-05-05 17:21:56 UTC
>I've gone ahead and send out the email with William and Dominik added to the
>Cc.

Thanks. I can test patches if necessary.

>pandemic lock-down

That is also how I had a weekend to do the bisection.

>This potentially impacts a lot of other code then just the sony-laptop driver.

Sony hasn't made laptops for several years, so there probably aren't a large number left, but Dominik and I both found the resume problem within days of Fedora moving to 5.6 kernels. Why doesn't the redhat bugzilla have a long list of bug reports for other laptop vendors by now?

>I'm trying the first patch with the evtypes change reverted

Could that patch have been implemented in device-specific code like in a surface-laptop.c driver similar to the sony-laptop.c driver, or did it need to go into shared code?
Comment 28 Maximilian Luz 2020-05-05 18:04:35 UTC
(In reply to William Bader from comment #27)
> Sony hasn't made laptops for several years, so there probably aren't a large
> number left, but Dominik and I both found the resume problem within days of
> Fedora moving to 5.6 kernels. Why doesn't the redhat bugzilla have a long
> list of bug reports for other laptop vendors by now?

I honestly don't expect that much code to rely on the buffer-to-integer conversion. If the writers of the ACPI code/DSDT in question had intended to return an Integer, I think they would have chosen CreateDWordField (creating a 32 bit Integer) instead of CreateWordField with a length of 32 bits. It's a fairly specific situation where this issue occurs. You have to return a buffer field from ACPI, created via CreateField. If you return any other variable, this doesn't apply. If the field is too large (>64 bits), you'd have to handle it as buffer field in the first place (pre commit) and thus the issue also wouldn't apply. I suspect that most stuff returned from ACPI are simple integer variables, less buffer fields, and much less fields created via CreateField that fall into that category.

> Could that patch have been implemented in device-specific code like in a
> surface-laptop.c driver similar to the sony-laptop.c driver, or did it need
> to go into shared code?

The difference between the sony-laptop issue and the issue experienced on the Surface devices is where the problem is: For Sony laptops, the problem is in the driver, for Surface laptops, the problem is in the DSDT/ACPI code execution (see commit message of the commit in question for details). So there's nothing you can do with a device-specific kernel driver. You could a) change the Linux interpreter as done in the commit, b) do a) but only for Surface devices, or c) patch the DSDT for Surface devices specifically for Linux.

This commit changes the AML _interpreter_ behavior to match the behavior of Windows. And as much as anyone may hate it, but MS kind of dictates the specification here, as pretty much all devices will be validated against their interpreter. So option b) ignores the fact that this may happen on other devices too (and introduces way more complexity for basically nothing), and option c) is straight up not viable if you want to have your average user be able to install Linux on such a device (at least as far as I know, my knowledge of ACPI patching is limited). This leaves us option a) and handling the fallout as we experience it now. Yes, it's unfortunate.
Comment 29 Mattia Dongili 2020-06-17 00:14:24 UTC
sony-laptop fixes are in mainline and have been picked up for the upcoming 5.6 and 5.7 stable releases.

I'm trying to figure out how to close this bug report, but I won't be offended if someone closes before I do.
Comment 30 Borislav Petkov 2020-06-17 09:03:44 UTC
Sure, that's easy. :-)

Note You need to log in before you can comment on or make changes to this bug.