Bug 203939 - Dell 7920 Workstation doesn't wake-up after going into Suspend (Sleep) state
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: PCI
Hardware: All Linux
Importance: P1 blocking
Assignee: drivers_pci@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-06-20 21:57 UTC by Sasikumar
Modified: 2019-09-29 07:59 UTC

See Also:
Kernel Version: 4.18
Subsystem:
Regression: No
Bisected commit-id:


Attachments
Kernel Logs (36.92 KB, text/plain)
2019-06-20 21:57 UTC, Sasikumar
S3 failed to resume log (813.88 KB, application/octet-stream)
2019-06-21 02:26 UTC, Perry_Yuan

Description Sasikumar 2019-06-20 21:57:58 UTC
Created attachment 283327
Kernel Logs

The Dell 7920 Workstation does not wake up on RHEL 8.0 after going into Suspend (Sleep) state. The system has a MegaRAID (MR) 9460-8i adapter with one RAID0 virtual drive (VD).
Notes:

1. The system works correctly with kernel versions <= 4.17.
2. The system does not wake up after going to sleep with kernel versions >= 4.18.


-------------------------------------------------------------------------
Jun 11 13:10:02 dhcp-135-15-125-231 kernel: ACPI Error: Field [DRQL] at bit offset/length 136/8 exceeds size of target Buffer (128 bits) (20180531/dsopcode-201)
Jun 11 13:10:02 dhcp-135-15-125-231 kernel: ACPI Error: Method parse/execution failed \_SB.PC00.LPC0.SIO1.DSRS, AE_AML_BUFFER_LIMIT (20180531/psparse-516)
Jun 11 13:10:02 dhcp-135-15-125-231 kernel: ACPI Error: Method parse/execution failed \_SB.PC00.LPC0.UAR1._SRS, AE_AML_BUFFER_LIMIT (20180531/psparse-516)
Jun 11 13:10:02 dhcp-135-15-125-231 kernel: serial 00:03: activation failed
Jun 11 13:10:02 dhcp-135-15-125-231 kernel: dpm_run_callback(): pnp_bus_resume+0x0/0x90 returns -5
Jun 11 13:10:02 dhcp-135-15-125-231 kernel: PM: Device 00:03 failed to resume: error -5
Jun 11 13:10:02 dhcp-135-15-125-231 kernel: megaraid_sas 0000:17:00.0: Waiting for FW to come to ready state
Jun 11 13:10:02 dhcp-135-15-125-231 kernel: snd_hda_intel 0000:b3:00.1: azx_get_response timeout, switching to polling mode: last cmd=0x005f0900
Jun 11 13:10:02 dhcp-135-15-125-231 kernel: dpm_run_callback(): pci_pm_resume+0x0/0xa0 returns -19
Jun 11 13:10:02 dhcp-135-15-125-231 kernel: PM: Device 0000:17:00.0 failed to resume async: error -19
Jun 11 13:10:02 dhcp-135-15-125-231 kernel: ACPI Error: AE_AML_PACKAGE_LIMIT, Index (0x00000000C) is beyond end of object (length 0xA) (20180531/exoparg2-396)
Jun 11 13:10:02 dhcp-135-15-125-231 kernel: ACPI Error: Method parse/execution failed \_SB.GADR, AE_AML_PACKAGE_LIMIT (20180531/psparse-516)
Jun 11 13:10:02 dhcp-135-15-125-231 kernel: sd 3:0:0:0: [sdb] Starting disk
Jun 11 13:10:02 dhcp-135-15-125-231 kernel: ACPI Error: Method parse/execution failed \_SB.UGP1, AE_AML_PACKAGE_LIMIT (20180531/psparse-516)
Jun 11 13:10:02 dhcp-135-15-125-231 kernel: ACPI Error: Method parse/execution failed \_SB.UGPS, AE_AML_PACKAGE_LIMIT (20180531/psparse-516)
Jun 11 13:10:02 dhcp-135-15-125-231 kernel: ACPI Error: Method parse/execution failed \_GPE._L6F, AE_AML_PACKAGE_LIMIT (20180531/psparse-516)
Jun 11 13:10:02 dhcp-135-15-125-231 kernel: ACPI Error: AE_AML_PACKAGE_LIMIT, while evaluating GPE method [_L6F] 
-----------------------------------------------------------------------------
Comment 1 Perry_Yuan 2019-06-21 02:26:47 UTC
Created attachment 283341
S3 failed to resume log
Comment 2 RobertK 2019-07-01 18:10:35 UTC
Hello Rafael J. Wysocki:

Could you provide us with an update on this bug? Kernel 4.18.5 includes commit d3252ace0bc652a1a244455556b6a549f969bf99, which appears to be broken, or at least does not work correctly for all PCIe devices.
The 4.18.5 changelog lists this commit: https://cdn.kernel.org/pub/linux/kernel/v4.x/ChangeLog-4.18.5

It appears that the commit has since been included in all upstream kernels:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d3252ace0bc652a1a244455556b6a549f969bf99

We also see that Resizable BAR support is optional according to the PCIe Base Specification 3.0 and that the hardware of this PCIe device supports it, which would explain why the 4.17.11 kernel worked (the OS resumed from the S3 state).

If any further logs are needed, please request them and provide instructions for enabling and extracting them. Thanks for your support.

Regards,
-RobertK
Comment 3 Perry_Yuan 2019-07-03 05:47:35 UTC
(In reply to RobertK from comment #2)

Robert,
Is the issue resolved if you remove the patch from your kernel source?

Perry
Comment 4 RobertK 2019-07-15 18:31:58 UTC
We are able to reproduce this issue in our lab. After removing the patch below from the 4.18 and 5.2 kernels, the issue is no longer observed:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d3252ace0bc652a1a244455556b6a549f969bf99

We dug further and found a *bug* in the Linux PCI code. The function pci_restore_rebar_state() writes an incorrect value to the "BAR size" field of the Resizable BAR control register for the MR adapter.

---------------------------
Layout of the Resizable BAR control register:
31-13    Reserved
12-8     BAR size
7-5      Number of Resizable BARs
4-3      Reserved
2-0      BAR index
--
Bits 12:8 of the register hold the BAR size as a read/write encoded value. The encoding of the BAR size field is:
0 = 1MB
1 = 2MB
2 = 4MB
3 = 8MB
--
The default value of this field is equal to the default size of the address space that the BAR resource is requesting via the BAR’s read-only bits. When this register field is programmed, the value is immediately reflected in the size of the resource, as encoded in the number of read-only bits in the BAR.
-----------------------------
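As an illustration of the layout above, here is a minimal C sketch that splits a 32-bit control register value into its fields and converts a size code to bytes. The macro, struct, and function names are made up for this example and are not the kernel's own definitions. The conversion assumes the encoding keeps doubling beyond the four values listed (code n means 2^n MB).

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative names; the masks follow the register layout quoted above. */
#define REBAR_BAR_INDEX_MASK  0x00000007u  /* bits 2:0  */
#define REBAR_NUM_BARS_MASK   0x000000E0u  /* bits 7:5  */
#define REBAR_NUM_BARS_SHIFT  5
#define REBAR_BAR_SIZE_MASK   0x00001F00u  /* bits 12:8 */
#define REBAR_BAR_SIZE_SHIFT  8

struct rebar_ctrl {
    unsigned bar_index;  /* which BAR this control register describes */
    unsigned num_bars;   /* number of resizable BARs */
    unsigned size_code;  /* encoded BAR size: 0 = 1MB, 1 = 2MB, ... */
};

/* Split a raw control register value into its three defined fields. */
static struct rebar_ctrl rebar_ctrl_decode(uint32_t reg)
{
    struct rebar_ctrl f;

    f.bar_index = reg & REBAR_BAR_INDEX_MASK;
    f.num_bars  = (reg & REBAR_NUM_BARS_MASK) >> REBAR_NUM_BARS_SHIFT;
    f.size_code = (reg & REBAR_BAR_SIZE_MASK) >> REBAR_BAR_SIZE_SHIFT;
    return f;
}

/* Assumption: size code n stands for 2^n MB, i.e. 2^(n + 20) bytes. */
static uint64_t rebar_size_code_to_bytes(unsigned size_code)
{
    assert(size_code <= 0x1Fu);  /* the field is only 5 bits wide */
    return (uint64_t)1 << (size_code + 20);
}
```

With this sketch, a size code of 0 maps to 1048576 bytes (1 MB), which is the value the adapter in this report is expected to keep after resume.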

The megaraid_sas adapter correctly exposes a BAR size of 1 MB as the BAR resource length through PCI config space, but on resume the Linux PCI code miscalculates the value and fills the BAR size field (bits 12:8) of the Resizable BAR control register with all ones ("f's").

The expected value for a 1 MB BAR size is all 0s.
-----
Below is the relevant code snippet that writes to the Resizable BAR control register (the culprit is the calculation of the size variable):
static void pci_restore_rebar_state(struct pci_dev *pdev)
...
...
...
        size = order_base_2((resource_size(res) >> 20) | 1) - 1;
        /* For a 1 MB BAR this evaluates to -1, so all ones ("ff") are written to the BAR size bits (12:8). */
...
...
...
-----
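The arithmetic can be checked outside the kernel with a small C sketch. `order_base_2_sketch()` is a userspace stand-in written for this example (it assumes the kernel helper returns the ceiling of log2, with 0 for inputs of 0 or 1). `rebar_size_log2()` is only one possible way to compute an encoding that matches the table above, assuming BAR sizes are powers of two of at least 1 MB; it is not taken from any patch referenced in this report.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace stand-in for order_base_2(): ceil(log2(n)), 0 for n <= 1. */
static int order_base_2_sketch(uint64_t n)
{
    int order = 0;

    while (order < 63 && ((uint64_t)1 << order) < n)
        order++;
    return order;
}

/* The size calculation as quoted from pci_restore_rebar_state(). */
static int rebar_size_as_quoted(uint64_t bar_bytes)
{
    return order_base_2_sketch((bar_bytes >> 20) | 1) - 1;
}

/* Hypothetical alternative: floor(log2(bytes)) - 20, so 1 MB gives 0. */
static int rebar_size_log2(uint64_t bar_bytes)
{
    int log2 = 0;

    assert(bar_bytes >= ((uint64_t)1 << 20));  /* smaller BARs have no code */
    while (bar_bytes >>= 1)
        log2++;
    return log2 - 20;
}

/* Value seen in bits 12:8 after the size is shifted into place. */
static unsigned rebar_size_field(int size)
{
    return (((uint32_t)size << 8) & 0x1F00u) >> 8;
}
```

For a 1 MB BAR, `rebar_size_as_quoted()` returns -1 and `rebar_size_field(-1)` reads back as 0x1f, while `rebar_size_log2()` returns 0. For 2 MB, 4 MB, and 8 MB the two functions agree (1, 2, and 3), so in this sketch only the 1 MB case goes wrong.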

If any further logs are needed, please request them and provide instructions for enabling and extracting them.
Thanks for your support.
 
Regards,
-RobertK
Comment 5 RobertK 2019-07-16 16:48:40 UTC
Broadcom has submitted a suggested fix for the problem, based on our limited understanding of the relevant PCI layer code.

The patch should be reviewed by the Linux PCI maintainers, who can decide how they want to address this bug.

https://patchwork.kernel.org/patch/11045839/
