Bug 218531

Summary: Continuous ACPI errors resulting in high CPU usage by journald
Product: Platform Specific/Hardware Reporter: danilrybakov249
Component: x86-64Assignee: platform_x86_64 (platform_x86_64)
Status: RESOLVED CODE_FIX    
Severity: normal CC: andriy.shevchenko, andy.shevchenko, bagasdotme, jwrdegoede, shinichiro.kawasaki
Priority: P3    
Hardware: All   
OS: Linux   
Kernel Version: 6.6.15 Subsystem:
Regression: Yes Bisected commit-id: 847e1eb30e269a094da046c08273abe3f3361cf2
Attachments: output of journalctl -kb
output of acpidump
Adds printk to understand the failure cause
output of journalctl -kb with patch
output of journalctl -kb healthy
A trial patch to skip P2SB function 0 resource cache
output of journalctl -kb with two patches applied
Another trial patch to use pci_scan_slot()
2nd version of the trial patch to use pci_scan_slot()
output of journalctl -kb with trial patch v2
Trial to defer P2SB scan for all functions for Goldmont CPUs
output of journalctl -kb with trial patch v3
Trial to defer P2SB scan and do the deferred cache only for requested function
output of journalctl -kb with trial patch v4
[PATCH] platform/x86: p2sb: On Goldmont only cache P2SB and SPI devfn BAR
[PATCH v3] platform/x86: p2sb: Defer P2SB device scan when P2SB device has func 0

Description danilrybakov249 2024-02-27 05:11:51 UTC
Overview:

After updating from lts v6.6.14-2 to lts v6.6.17-1 noticed high CPU temperature and lag. After running htop noticed that journald was using 30-60% of CPU. Afterwards, tried switching to stable, or lts v6.6.18-1, but encountered the same issue.

Running journalctl -f gives these lines over and over again:

Feb 19 21:09:12 danirybe kernel: ACPI Error: Could not disable RealTimeClock events (20230628/evxfevnt-243)
Feb 19 21:09:12 danirybe kernel: ACPI Error: No handler or method for GPE 08, disabling event (20230628/evgpe-839)
Feb 19 21:09:12 danirybe kernel: ACPI Error: No handler or method for GPE 0A, disabling event (20230628/evgpe-839)
Feb 19 21:09:12 danirybe kernel: ACPI Error: No handler or method for GPE 0B, disabling event (20230628/evgpe-839)
Feb 19 21:09:12 danirybe kernel: ACPI Error: No installed handler for fixed event - PM_Timer (0), disabling (20230628/evevent-255)
Feb 19 21:09:12 danirybe kernel: ACPI Error: No installed handler for fixed event - PowerButton (2), disabling (20230628/evevent-255)
Feb 19 21:09:12 danirybe kernel: ACPI Error: No installed handler for fixed event - SleepButton (3), disabling (20230628/evevent-255)

My system info:

Laptop model: ASUS VivoBook D540NV-GQ065T
OS: Arch Linux x86_64
Kernel: 6.6.14-2-lts
WM: sway
CPU: Intel Pentium N420 (4) @ 2.500GHz
GPU1: Intel Apollo Lake [HD Graphics 505]
GPU2: NVIDIA GeForce 920MX

I've pinned down the commit after which the problem occurs:

847e1eb30e269a094da046c08273abe3f3361cf2 is the first bad commit
commit 847e1eb30e269a094da046c08273abe3f3361cf2
Author: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Date:   Mon Jan 8 15:20:58 2024 +0900

    platform/x86: p2sb: Allow p2sb_bar() calls during PCI device probe
    
    commit 5913320eb0b3ec88158cfcb0fa5e996bf4ef681b upstream.
    
    p2sb_bar() unhides P2SB device to get resources from the device. It
    guards the operation by locking pci_rescan_remove_lock so that parallel
    rescans do not find the P2SB device. However, this lock causes deadlock
    when PCI bus rescan is triggered by /sys/bus/pci/rescan. The rescan
    locks pci_rescan_remove_lock and probes PCI devices. When PCI devices
    call p2sb_bar() during probe, it locks pci_rescan_remove_lock again.
    Hence the deadlock.
    
    To avoid the deadlock, do not lock pci_rescan_remove_lock in p2sb_bar().
    Instead, do the lock at fs_initcall. Introduce p2sb_cache_resources()
    for fs_initcall which gets and caches the P2SB resources. At p2sb_bar(),
    refer the cache and return to the caller.
    
    Before operating the device at P2SB DEVFN for resource cache, check
    that its device class is PCI_CLASS_MEMORY_OTHER 0x0580 that PCH
    specifications define. This avoids unexpected operation to other devices
    at the same DEVFN.
    
    Link: https://lore.kernel.org/linux-pci/6xb24fjmptxxn5js2fjrrddjae6twex5bjaftwqsuawuqqqydx@7cl3uik5ef6j/
    Fixes: 9745fb07474f ("platform/x86/intel: Add Primary to Sideband (P2SB) bridge support")
    Cc: stable@vger.kernel.org
    Suggested-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
    Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
    Link: https://lore.kernel.org/r/20240108062059.3583028-2-shinichiro.kawasaki@wdc.com
    Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
    Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
    Tested-by Klara Modin <klarasmodin@gmail.com>
    Reviewed-by: Hans de Goede <hdegoede@redhat.com>
    Signed-off-by: Hans de Goede <hdegoede@redhat.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

 drivers/platform/x86/p2sb.c | 180 ++++++++++++++++++++++++++++++++++----------
 1 file changed, 139 insertions(+), 41 deletions(-)
Comment 1 Artem S. Tashkinov 2024-02-27 06:34:45 UTC
This is worth posting to LKML as well, please do.

CC'ing Andy Shevchenko anyways.
Comment 2 danilrybakov249 2024-02-27 06:57:38 UTC
(In reply to Artem S. Tashkinov from comment #1)
> This is worth posting to LKML as well, please do.
> 
> CC'ing Andy Shevchenko anyways.

To be honest, I have no idea how to do that properly.
Comment 3 Bagas Sanjaya 2024-02-27 08:29:57 UTC
(In reply to danilrybakov249 from comment #2)
> (In reply to Artem S. Tashkinov from comment #1)
> > This is worth posting to LKML as well, please do.
> > 
> > CC'ing Andy Shevchenko anyways.
> 
> To be honest, I have no idea how to do that properly.

Report forwarded [1].

[1]: https://lore.kernel.org/regressions/Zd2bsV8VsFJMlbFW@archie.me/
Comment 4 Shin'ichiro Kawasaki 2024-02-27 09:54:17 UTC
Thanks for the report and sorry for the trouble.

I guess that this issue might be unique to the systems with Goldmont CPU microarchitecture. I don't have access to the hardware. If it is possible, I would like to ask help for debug. As the first step, I want to get two things below.

1) Can we get kernel messages during boot process? The messages may help to understand the prcies cause. If this works, I can provide patches to add printk to confirm behaviors on the system.

2) Can you share "lspci -v" output using the kernel without the issue?
Comment 5 danilrybakov249 2024-02-27 11:28:09 UTC
(In reply to Shin'ichiro Kawasaki from comment #4)
> Thanks for the report and sorry for the trouble.
> 
> I guess that this issue might be unique to the systems with Goldmont CPU
> microarchitecture. I don't have access to the hardware. If it is possible, I
> would like to ask help for debug. As the first step, I want to get two
> things below.
> 
> 1) Can we get kernel messages during boot process? The messages may help to
> understand the prcies cause. If this works, I can provide patches to add
> printk to confirm behaviors on the system.
> 
> 2) Can you share "lspci -v" output using the kernel without the issue?

I'll try:
journalctl -b (unhealth): http://0x0.st/HRmH.txt
dmesg (unhealthy): http://0x0.st/HRA7.txt
lspci -v (healthy): http://0x0.st/HRAh.txt
Comment 6 Hans de Goede 2024-02-27 14:11:08 UTC
Thank you for reporting this and sorry about the regression.

> I've pinned down the commit after which the problem occurs:
> 847e1eb30e269a094da046c08273abe3f3361cf2 is the first bad commit

So you have build your own 6.6.17 kernel from source with just that commit reverted and with that locally build 6.617 with that single commit reverted you are not seeing this issue ?

> I think this CPU is in Goldmont microarchitecture group. The group is handled
> in a bit unique way in drivers/platform/x86/p2sb.c. I guess the commit
> affected
> handling of P2SB resource on machines with that architecture.

I just tried to reproduce this on one of my own laptops with a Apollo Lake N3450 processor, but I cannot reproduce this. So this seems to be specific to this one laptop model.

> dmesg (unhealthy): http://0x0.st/HRA7.txt

Unfortunately that is incomplete due to the errors being spammed to the log the entire time, so it only contains a long list of the errors repeating.

Can you append " log_buf_len=50M" to your kernel commandline and then *immediately*after boot run "dmesg > dmesg.txt"?  And then please attach the generated dmesg.txt here in bugzilla using bugzilla's attachment function.

Using pastebins is problematic because the links to these become invalid over time.
Comment 7 Andy Shevchenko 2024-02-27 14:43:33 UTC
(In reply to danilrybakov249 from comment #5)
> (In reply to Shin'ichiro Kawasaki from comment #4)

> I'll try:
> journalctl -b (unhealth): http://0x0.st/HRmH.txt
> dmesg (unhealthy): http://0x0.st/HRA7.txt
> lspci -v (healthy): http://0x0.st/HRAh.txt

The first two are useless.
Please, run `journalctl -kb` which will give you the kernel boot messages from the very recent boot.
Comment 8 Andy Shevchenko 2024-02-27 14:46:32 UTC
What we also might need is the `acpidump -o VivoBook-D540NV-GQ065T.dat` (the mentioned file). You might need to install ACPI tools package.
Comment 9 danilrybakov249 2024-02-27 15:04:24 UTC
Created attachment 305914 [details]
output of journalctl -kb
Comment 10 danilrybakov249 2024-02-27 15:05:45 UTC
Actually, v6.6.15 already causes the issue. I've bisected the commits between v6.6.14 and v6.6.15 and 847e1eb30e269a094da046c08273abe3f3361cf2 is the first commit that produces the issue.
Comment 11 danilrybakov249 2024-02-27 15:26:32 UTC
Created attachment 305915 [details]
output of acpidump
Comment 12 Shin'ichiro Kawasaki 2024-02-28 02:10:47 UTC
Created attachment 305917 [details]
Adds printk to understand the failure cause

Please apply this patch to the kernel, then reboot with the kernel and share "journalctl -kb" output. Thanks in advance.
Comment 13 Shin'ichiro Kawasaki 2024-02-28 02:21:51 UTC
Thank you danilrybakov, for sharing the logs. The lspci output did not list "0d.x" PCI devices, so it was confirmed the P2SB device is hidden.

The "journalctl -kb" shows 0d.x devices as below. So the p2sb device scan and cache feature is working. It lists multiple devices then it was confirmed that the scan and cache is repeated for Goldmont CPU as expected.

------------------------------------------------------------------------------------
...
Feb 27 21:48:25 danirybe kernel: clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
Feb 27 21:48:25 danirybe kernel: pci 0000:00:0d.0: [8086:5a92] type 00 class 0x058000
Feb 27 21:48:25 danirybe kernel: pci 0000:00:0d.0: reg 0x10: [mem 0xd0000000-0xd0ffffff 64bit]
Feb 27 21:48:25 danirybe kernel: pci 0000:00:0d.1: [8086:5a94] type 00 class 0xff0000
Feb 27 21:48:25 danirybe kernel: pci 0000:00:0d.1: reg 0x10: [mem 0xfe042000-0xfe043fff 64bit]
Feb 27 21:48:25 danirybe kernel: pci 0000:00:0d.1: reg 0x18: [mem 0xfe044000-0xfe044fff 64bit]
Feb 27 21:48:25 danirybe kernel: pci 0000:00:0d.1: PME# supported from D0 D3hot
Feb 27 21:48:25 danirybe kernel: pci 0000:00:0d.2: [8086:5a96] type 00 class 0x0c8000
Feb 27 21:48:25 danirybe kernel: pci 0000:00:0d.2: reg 0x10: [mem 0xfed01000-0xfed01fff]
Feb 27 21:48:25 danirybe kernel: pci 0000:00:0d.3: [8086:5aec] type 00 class 0x050000
Feb 27 21:48:25 danirybe kernel: pci 0000:00:0d.3: reg 0x10: [mem 0xfe900000-0xfe901fff 64bit]
Feb 27 21:48:25 danirybe kernel: pci 0000:00:0d.3: reg 0x18: [mem 0xfe902000-0xfe902fff 64bit]
Feb 27 21:48:25 danirybe kernel: NET: Registered PF_INET protocol family
...
------------------------------------------------------------------------------------

I can think of two potential issues:

1) p2sb_bar() fails due to bus dev id mismatch or any other reason.
2) p2sb_bar() is called before p2sb scan and cache (this is not likely...)

To confirm these guesses, may I ask danilrybakov to try out the patch I attached?

https://bugzilla.kernel.org/attachment.cgi?id=305917

Please apply it to the unhealthy kernel, build, boot with the kernel, and share "journalctl -kb". Thanks in advance.
Comment 14 danilrybakov249 2024-02-28 05:42:25 UTC
Created attachment 305918 [details]
output of journalctl -kb with patch

patched with "git patch" using the patch provided by Shin'ichiro Kawasaki
Comment 15 danilrybakov249 2024-02-28 05:58:44 UTC
Comment on attachment 305918 [details]
output of journalctl -kb with patch

pathed with "git apply" using the patch provided by Shin'ichiro Kawasaki
Comment 16 Shin'ichiro Kawasaki 2024-02-28 07:56:10 UTC
Thanks for the journalctl log with the patch. Unfortunately, my guesses were wrong...

1) p2sb_bar() call succeeded
2) p2sb_bar() call was made after p2sb scan and cache

lpc_ich driver called ps2b_bar successfully to setup SPI device. But before that the ACPI error were already reported. And after lpc_ich driver probe, the ACPI errors continue.

-----------------------------------------------------------------------------------
...
Feb 28 12:33:47 danirybe kernel: ACPI Error: No handler or method for GPE 08, disabling event (20230628/evgpe-839)
Feb 28 12:33:47 danirybe kernel: ACPI Error: No handler or method for GPE 0A, disabling event (20230628/evgpe-839)
Feb 28 12:33:47 danirybe kernel: ACPI Error: No handler or method for GPE 0B, disabling event (20230628/evgpe-839)
Feb 28 12:33:47 danirybe kernel: lpc_ich_init_spi: calling p2sb_bar
Feb 28 12:33:47 danirybe kernel: p2sb_bar: devfn=d.2
Feb 28 12:33:47 danirybe kernel: ACPI Error: No installed handler for fixed event - PM_Timer (0), disabling (20230628/evevent-255)
Feb 28 12:33:47 danirybe kernel: ACPI Error: No installed handler for fixed event - PowerButton (2), disabling (20230628/evevent-255)
Feb 28 12:33:47 danirybe kernel: ACPI Error: No installed handler for fixed event - SleepButton (3), disabling (20230628/evevent-255)
Feb 28 12:33:47 danirybe kernel: ACPI Error: Could not disable RealTimeClock events (20230628/evxfevnt-243)
...
-----------------------------------------------------------------------------------

I had guessed that p2sb_bar() call failure would have caused lpc_ich probe failure and the ACPI errors, but the guess was wrong. Now I'm not sure why the ACPI errors are triggered (and why they are not suppressed after lpc_ich probe).

danilrybakov, could you share "journalctl -kb" log with the healthy kernel? I would like to compare it with the unhealthy kernel, and hope that it will be a clue.
Comment 17 danilrybakov249 2024-02-28 09:20:42 UTC
Created attachment 305919 [details]
output of journalctl -kb healthy
Comment 18 Andy Shevchenko 2024-02-28 09:56:47 UTC
@Shin'ichiro Kawasaki I have now two observations: 1) we, for some reason, do not use `pci_scan_slot()` which may have subtle differences to our for-loop; 2) with the above we do enumerate from function 0, which we remove from PCI immediately, and function 0 is special. Perhaps, even if it's not related to this bug, we need to refactor caching approach in this case.
Comment 19 Shin'ichiro Kawasaki 2024-02-28 11:19:46 UTC
danylrybakov, thanks for the log of the healthy kernel. It does not record the ACPI error. So the question is why the ACPI error is triggered after the p2sb scan and cache the commit introduced.

Andy, thanks for the comment. I have not yet fully understood, but it is likely that the function 0 device scan and remove could be the cause.

I guess that the concerned system does not use P2SB resource at function 0, so I would like to try out what happens if we skip it. Will create a patch for it and ask experiment with it.
Comment 20 Shin'ichiro Kawasaki 2024-02-28 11:26:35 UTC
Created attachment 305920 [details]
A trial patch to skip P2SB function 0 resource cache

Please apply this patch together with the previous patch to add debug log. Then please build kernel, boot with that kernel and share "journalctl -kb" log. Thanks in advance.
Comment 21 danilrybakov249 2024-02-28 13:44:13 UTC
Created attachment 305921 [details]
output of journalctl -kb with two patches applied

This fixed the problem!
Thanks, everybody!
Comment 22 Andy Shevchenko 2024-02-28 14:54:59 UTC
(In reply to danilrybakov249 from comment #21)
> Created attachment 305921 [details]
> output of journalctl -kb with two patches applied
> 
> This fixed the problem!

Okay, so I would give a try for `pci_scan_slot()` first. What is special about P2SB itself is that it is not 100% pure PCI device, it simply emulates programming interface of PCI devices to some extend. This what I mentioned as well in the commit log that brought P2SB library into the kernel.
Comment 23 Shin'ichiro Kawasaki 2024-02-29 03:46:01 UTC
(In reply to Andy Shevchenko from comment #22)
> (In reply to danilrybakov249 from comment #21)
> > Created attachment 305921 [details]
> > output of journalctl -kb with two patches applied
> > 
> > This fixed the problem!

Great, but still we need to think about good solution. The trial patch was just a dirty edit for the experiment.

> 
> Okay, so I would give a try for `pci_scan_slot()` first. What is special
> about P2SB itself is that it is not 100% pure PCI device, it simply emulates
> programming interface of PCI devices to some extend. This what I mentioned
> as well in the commit log that brought P2SB library into the kernel.

I have created a patch to use pci_scan_slot() when P2SB has function 0 and will post it soon. Please take a look if it is what you expect.
Comment 24 Shin'ichiro Kawasaki 2024-02-29 03:54:44 UTC
Created attachment 305926 [details]
Another trial patch to use pci_scan_slot()

Use pci_scan_slot() in place of pci_scan_single_device() when P2SB has function 0.
It still it has some debug prints to confirm its behavior.

I suggest danylrybakov to wait for Andy's comment on it. If the patch is what he expects, it will be worth trying on the concerned machine.
Comment 25 Andy Shevchenko 2024-02-29 09:16:47 UTC
(In reply to Shin'ichiro Kawasaki from comment #24)
> Created attachment 305926 [details]
> Another trial patch to use pci_scan_slot()
> 
> Use pci_scan_slot() in place of pci_scan_single_device() when P2SB has
> function 0.
> It still it has some debug prints to confirm its behavior.
> 
> I suggest danylrybakov to wait for Andy's comment on it. If the patch is
> what he expects, it will be worth trying on the concerned machine.

Yeah, but you need to cache all found devices, with this patch the SPI NOR won't be cached. So, something like this: 1) scan slot; 2) iterate over all found devices and cache their resources; 3) reverse remove devices including function 0 (at the end).
Comment 26 Shin'ichiro Kawasaki 2024-02-29 10:18:29 UTC
Created attachment 305928 [details]
2nd version of the trial patch to use pci_scan_slot()

Andy, thanks for the comment. I updated the patch to follow Andy's comment.

danylrybakov, could you try this patch? Please apply this patch only to the unhealthy kernel. Fingers crossed...
Comment 27 danilrybakov249 2024-02-29 13:40:44 UTC
Created attachment 305930 [details]
output of journalctl -kb with trial patch v2

Unfortunately, errors again.
Comment 28 Shin'ichiro Kawasaki 2024-03-01 07:17:50 UTC
Hmm, so the difference between simple pci_scan_single_device() loop and pci_scan_slot() is not the cause of the errors. I guess P2SB function 0 scan at fs_initcall step does something harmful.

It is ideal to understand why P2SB function 0 scan at that step cause the ACPI errors. I walked through related code, but it looks too much for me without the device at my hand. It will need a number of trials (). Even if the cause is known, the fix will be modifications in PCI or ACPI sub-systems, which will not be accepted.

As the different approach, I would like to seek for the solution by delaying the P2SB device scan, from fs_initcall step to p2sb_bar() call. Such delayed scan will leave the
chance to cause deadlock with sysfs pci bus rescan, but by caching the resource, such
device scan happens only once. Also, such delayed scan can be limited only for the concerned device. Then the chance of deadlock will be small enough.

I will post a trial patch based on this approach. It will delay the P2SB scan for all functions from 0 to 7. One thing I'm not sure is that the delayed function 0 scan is okay on the concerned system or not. If not, additional change will be required.
Comment 29 Shin'ichiro Kawasaki 2024-03-01 07:20:31 UTC
Created attachment 305932 [details]
Trial to defer P2SB scan for all functions for Goldmont CPUs

danylrybakov, could you try this patch? Please apply this patch only to the unhealthy kernel.
Comment 30 danilrybakov249 2024-03-01 08:04:18 UTC
Created attachment 305933 [details]
output of journalctl -kb with trial patch v3

no errors
Comment 31 Shin'ichiro Kawasaki 2024-03-01 12:04:05 UTC
(In reply to danilrybakov249 from comment #30)
> Created attachment 305933 [details]
> output of journalctl -kb with trial patch v3
> 
> no errors

Thanks! It's good the patch works as expected.
I will create a formal patch to ask comments and reviews.
Comment 32 Shin'ichiro Kawasaki 2024-03-01 13:47:24 UTC
(In reply to Shin'ichiro Kawasaki from comment #31)
...
> I will create a formal patch to ask comments and reviews.

Posted the patch to the platfrom-driver-x86 list.

https://lore.kernel.org/platform-driver-x86/20240301134504.1887132-1-shinichiro.kawasaki@wdc.com/T/#u
Comment 33 Shin'ichiro Kawasaki 2024-03-02 01:30:32 UTC
Andy, thank you for the comments on the patch. I reflected them and posted v2.

https://lore.kernel.org/platform-driver-x86/20240302012813.2011111-1-shinichiro.kawasaki@wdc.com/T/#u

danylrybakov, I CCed the v2 patch to you. If you have time to afford, please try it and confirm it avoids the ACPI errors.
Comment 34 danilrybakov249 2024-03-02 07:01:02 UTC
Thank you for your work. What I've discovered with this patch is that the errors don't appear immediately after booting, but they still appear after a bit of time. The same with v1 of the patch.
Comment 35 Shin'ichiro Kawasaki 2024-03-02 07:25:41 UTC
(In reply to danilrybakov249 from comment #34)
> Thank you for your work. What I've discovered with this patch is that the
> errors don't appear immediately after booting, but they still appear after a
> bit of time. The same with v1 of the patch.

Thanks for the report. It is good to know that the fix patch still has the issue.

Now I think the scan and cache of function 0 triggers the ACPI errors regardless of the step, at fs_initcall() as wells p2bs_bar() call. I will try to create another trial patch which will do the deferred scan and cache only for the requested function number.
Comment 36 Shin'ichiro Kawasaki 2024-03-02 09:33:19 UTC
Created attachment 305948 [details]
Trial to defer P2SB scan and do the deferred cache only for requested function

danylrybakov, please try this patch and see if it avoids the ACPI errors. I hope it will work good long time after the boot.

The patch has some printks. Please share journactl -kb output so that I can confirm the modified behavior. Thanks in advance.
Comment 37 danilrybakov249 2024-03-02 16:40:12 UTC
Created attachment 305951 [details]
output of journalctl -kb with trial patch v4

So far no errors. I'll leave the laptop running for the night and post an update tomorrow.
Comment 38 danilrybakov249 2024-03-02 22:06:39 UTC
Yep, no errors, even after a while.
Comment 39 Shin'ichiro Kawasaki 2024-03-02 23:42:30 UTC
(In reply to danilrybakov249 from comment #38)
> Yep, no errors, even after a while.

Great, thank you for testing. It's good to have this solid fix :)

As discussed in platfomr-driver-x86 list, Hans suggested another potential fix idea which may be better than the confirmed fix. Before I post the next formal patch I would like to wait a bit to see how the discussion go.
Comment 40 Hans de Goede 2024-03-04 13:33:32 UTC
Shin'ichiro, thank you for your work on this.

Danilrybakov, thank you for all your testing!

We are close to fixing this now. Shin'ichiro has posted a slightly different patch upstream then the last patch which you tested; and I think I may have found an even simpler solution:

On Goldmont platform devices like your laptop the troublesome p2sb_bar() function only ever gets called for 2 devices, the actual P2SB devfn 13,0 and the SPI controller which is part of the P2SB, devfn 13,2.

But the current p2sb code tries to cache BAR0 info for all of devfn 13,0 to 13,7 . This involves calling pci_scan_single_device() for device 13 functions 0-7 and I think that maybe your laptop model does not seem to like pci_scan_single_device() getting called for some of the other hidden devices.

I have prepared a patch (replacing Shin'ichiro's patches) which only caches BAR0 for the P2SB (13.0) and SPI (13,2) devfn combinations. I'll attach that patch here, can you please give this a try?

I'll also attach Shin'ichiro's latest patch as posted on the list:

https://lore.kernel.org/platform-driver-x86/20240304071912.2340622-1-shinichiro.kawasaki@wdc.com/

with some follow-up fixes from me squashed in. If the patch to only cache 13,0 + 13,2 BAR0 info does not help, please give this patch a try. This one should work, but the other one would be a better fix if that approach works.
Comment 41 Hans de Goede 2024-03-04 13:39:15 UTC
Created attachment 305957 [details]
[PATCH] platform/x86: p2sb: On Goldmont only cache P2SB and SPI devfn BAR

Patch to only cache BAR0 for the P2SB (13.0) and SPI (13,2) devfn combinations, please try this first.

Note their is no need to try the next patch if this one works.
Comment 42 Hans de Goede 2024-03-04 13:54:55 UTC
Created attachment 305958 [details]
[PATCH v3] platform/x86: p2sb: Defer P2SB device scan when P2SB device has func 0

Latest patch from Shin'ichiro's:

https://lore.kernel.org/platform-driver-x86/20240304071912.2340622-1-shinichiro.kawasaki@wdc.com/

with some fixes from Hans squashed in from:
https://lore.kernel.org/platform-driver-x86/20240304134921.305604-2-hdegoede@redhat.com/

In case the "PATCH] platform/x86: p2sb: On Goldmont only cache P2SB and SPI devfn BAR" patch does NOT fix things for you please give this one a try, so that we can be sure that this issue will actually be fixed when I merge the 2 patches linked above.

Note this is basically almost the same as "A trial patch to skip P2SB function 0 resource cache" so this should work, but I would like to be sure.

Thank you for your patience and for all the tests you have run.
Comment 43 danilrybakov249 2024-03-04 16:36:43 UTC
No issues with [PATCH] platform/x86: p2sb: On Goldmont only cache P2SB and SPI devfn BAR. :)
Comment 44 Hans de Goede 2024-03-04 16:40:53 UTC
Hi,

(In reply to danilrybakov249 from comment #43)
> No issues with [PATCH] platform/x86: p2sb: On Goldmont only cache P2SB and
> SPI devfn BAR. :)

That is great, thank you for the quick test.

Then I'll go and send that patch to Linus sometime this week.
Comment 45 Shin'ichiro Kawasaki 2024-03-05 01:03:44 UTC
(In reply to Hans de Goede from comment #44)
> Hi,
> 
> (In reply to danilrybakov249 from comment #43)
> > No issues with [PATCH] platform/x86: p2sb: On Goldmont only cache P2SB and
> > SPI devfn BAR. :)
> 
> That is great, thank you for the quick test.
> 
> Then I'll go and send that patch to Linus sometime this week.

Hans, thank you very much finding out the best solution :) As I commented on the list, I agree with the patch.

Danilrybakov, my thank also goes to you for your cooperations and patience :)
I think your efforts deserve the credit tags "Reported-by" and "Tested-by" that Hans mentioned. I suggest you to reply to the mail from Hans so that the tags get handled with the patch.
Comment 46 danilrybakov249 2024-03-05 06:03:49 UTC
(In reply to Shin'ichiro Kawasaki from comment #45)
 
> I suggest you to reply to the mail from Hans so that
> the tags get handled with the patch.

I did, thank you as well.