Overview: After updating from lts v6.6.14-2 to lts v6.6.17-1 noticed high CPU temperature and lag. After running htop noticed that journald was using 30-60% of CPU. Afterwards, tried switching to stable, or lts v6.6.18-1, but encountered the same issue. Running journalctl -f gives these lines over and over again: Feb 19 21:09:12 danirybe kernel: ACPI Error: Could not disable RealTimeClock events (20230628/evxfevnt-243) Feb 19 21:09:12 danirybe kernel: ACPI Error: No handler or method for GPE 08, disabling event (20230628/evgpe-839) Feb 19 21:09:12 danirybe kernel: ACPI Error: No handler or method for GPE 0A, disabling event (20230628/evgpe-839) Feb 19 21:09:12 danirybe kernel: ACPI Error: No handler or method for GPE 0B, disabling event (20230628/evgpe-839) Feb 19 21:09:12 danirybe kernel: ACPI Error: No installed handler for fixed event - PM_Timer (0), disabling (20230628/evevent-255) Feb 19 21:09:12 danirybe kernel: ACPI Error: No installed handler for fixed event - PowerButton (2), disabling (20230628/evevent-255) Feb 19 21:09:12 danirybe kernel: ACPI Error: No installed handler for fixed event - SleepButton (3), disabling (20230628/evevent-255) My system info: Laptop model: ASUS VivoBook D540NV-GQ065T OS: Arch Linux x86_64 Kernel: 6.6.14-2-lts WM: sway CPU: Intel Pentium N420 (4) @ 2.500GHz GPU1: Intel Apollo Lake [HD Graphics 505] GPU2: NVIDIA GeForce 920MX I've pinned down the commit after which the problem occurs: 847e1eb30e269a094da046c08273abe3f3361cf2 is the first bad commit commit 847e1eb30e269a094da046c08273abe3f3361cf2 Author: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Date: Mon Jan 8 15:20:58 2024 +0900 platform/x86: p2sb: Allow p2sb_bar() calls during PCI device probe commit 5913320eb0b3ec88158cfcb0fa5e996bf4ef681b upstream. p2sb_bar() unhides P2SB device to get resources from the device. It guards the operation by locking pci_rescan_remove_lock so that parallel rescans do not find the P2SB device. However, this lock causes deadlock when PCI bus rescan is triggered by /sys/bus/pci/rescan. The rescan locks pci_rescan_remove_lock and probes PCI devices. When PCI devices call p2sb_bar() during probe, it locks pci_rescan_remove_lock again. Hence the deadlock. To avoid the deadlock, do not lock pci_rescan_remove_lock in p2sb_bar(). Instead, do the lock at fs_initcall. Introduce p2sb_cache_resources() for fs_initcall which gets and caches the P2SB resources. At p2sb_bar(), refer the cache and return to the caller. Before operating the device at P2SB DEVFN for resource cache, check that its device class is PCI_CLASS_MEMORY_OTHER 0x0580 that PCH specifications define. This avoids unexpected operation to other devices at the same DEVFN. Link: https://lore.kernel.org/linux-pci/6xb24fjmptxxn5js2fjrrddjae6twex5bjaftwqsuawuqqqydx@7cl3uik5ef6j/ Fixes: 9745fb07474f ("platform/x86/intel: Add Primary to Sideband (P2SB) bridge support") Cc: stable@vger.kernel.org Suggested-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Link: https://lore.kernel.org/r/20240108062059.3583028-2-shinichiro.kawasaki@wdc.com Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Tested-by Klara Modin <klarasmodin@gmail.com> Reviewed-by: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> drivers/platform/x86/p2sb.c | 180 ++++++++++++++++++++++++++++++++++---------- 1 file changed, 139 insertions(+), 41 deletions(-)
This is worth posting to LKML as well, please do. CC'ing Andy Shevchenko anyways.
(In reply to Artem S. Tashkinov from comment #1) > This is worth posting to LKML as well, please do. > > CC'ing Andy Shevchenko anyways. To be honest, I have no idea how to do that properly.
(In reply to danilrybakov249 from comment #2) > (In reply to Artem S. Tashkinov from comment #1) > > This is worth posting to LKML as well, please do. > > > > CC'ing Andy Shevchenko anyways. > > To be honest, I have no idea how to do that properly. Report forwarded [1]. [1]: https://lore.kernel.org/regressions/Zd2bsV8VsFJMlbFW@archie.me/
Thanks for the report and sorry for the trouble. I guess that this issue might be unique to the systems with Goldmont CPU microarchitecture. I don't have access to the hardware. If it is possible, I would like to ask help for debug. As the first step, I want to get two things below. 1) Can we get kernel messages during boot process? The messages may help to understand the prcies cause. If this works, I can provide patches to add printk to confirm behaviors on the system. 2) Can you share "lspci -v" output using the kernel without the issue?
(In reply to Shin'ichiro Kawasaki from comment #4) > Thanks for the report and sorry for the trouble. > > I guess that this issue might be unique to the systems with Goldmont CPU > microarchitecture. I don't have access to the hardware. If it is possible, I > would like to ask help for debug. As the first step, I want to get two > things below. > > 1) Can we get kernel messages during boot process? The messages may help to > understand the prcies cause. If this works, I can provide patches to add > printk to confirm behaviors on the system. > > 2) Can you share "lspci -v" output using the kernel without the issue? I'll try: journalctl -b (unhealth): http://0x0.st/HRmH.txt dmesg (unhealthy): http://0x0.st/HRA7.txt lspci -v (healthy): http://0x0.st/HRAh.txt
Thank you for reporting this and sorry about the regression. > I've pinned down the commit after which the problem occurs: > 847e1eb30e269a094da046c08273abe3f3361cf2 is the first bad commit So you have build your own 6.6.17 kernel from source with just that commit reverted and with that locally build 6.617 with that single commit reverted you are not seeing this issue ? > I think this CPU is in Goldmont microarchitecture group. The group is handled > in a bit unique way in drivers/platform/x86/p2sb.c. I guess the commit > affected > handling of P2SB resource on machines with that architecture. I just tried to reproduce this on one of my own laptops with a Apollo Lake N3450 processor, but I cannot reproduce this. So this seems to be specific to this one laptop model. > dmesg (unhealthy): http://0x0.st/HRA7.txt Unfortunately that is incomplete due to the errors being spammed to the log the entire time, so it only contains a long list of the errors repeating. Can you append " log_buf_len=50M" to your kernel commandline and then *immediately*after boot run "dmesg > dmesg.txt"? And then please attach the generated dmesg.txt here in bugzilla using bugzilla's attachment function. Using pastebins is problematic because the links to these become invalid over time.
(In reply to danilrybakov249 from comment #5) > (In reply to Shin'ichiro Kawasaki from comment #4) > I'll try: > journalctl -b (unhealth): http://0x0.st/HRmH.txt > dmesg (unhealthy): http://0x0.st/HRA7.txt > lspci -v (healthy): http://0x0.st/HRAh.txt The first two are useless. Please, run `journalctl -kb` which will give you the kernel boot messages from the very recent boot.
What we also might need is the `acpidump -o VivoBook-D540NV-GQ065T.dat` (the mentioned file). You might need to install ACPI tools package.
Created attachment 305914 [details] output of journalctl -kb
Actually, v6.6.15 already causes the issue. I've bisected the commits between v6.6.14 and v6.6.15 and 847e1eb30e269a094da046c08273abe3f3361cf2 is the first commit that produces the issue.
Created attachment 305915 [details] output of acpidump
Created attachment 305917 [details] Adds printk to understand the failure cause Please apply this patch to the kernel, then reboot with the kernel and share "journalctl -kb" output. Thanks in advance.
Thank you danilrybakov, for sharing the logs. The lspci output did not list "0d.x" PCI devices, so it was confirmed the P2SB device is hidden. The "journalctl -kb" shows 0d.x devices as below. So the p2sb device scan and cache feature is working. It lists multiple devices then it was confirmed that the scan and cache is repeated for Goldmont CPU as expected. ------------------------------------------------------------------------------------ ... Feb 27 21:48:25 danirybe kernel: clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns Feb 27 21:48:25 danirybe kernel: pci 0000:00:0d.0: [8086:5a92] type 00 class 0x058000 Feb 27 21:48:25 danirybe kernel: pci 0000:00:0d.0: reg 0x10: [mem 0xd0000000-0xd0ffffff 64bit] Feb 27 21:48:25 danirybe kernel: pci 0000:00:0d.1: [8086:5a94] type 00 class 0xff0000 Feb 27 21:48:25 danirybe kernel: pci 0000:00:0d.1: reg 0x10: [mem 0xfe042000-0xfe043fff 64bit] Feb 27 21:48:25 danirybe kernel: pci 0000:00:0d.1: reg 0x18: [mem 0xfe044000-0xfe044fff 64bit] Feb 27 21:48:25 danirybe kernel: pci 0000:00:0d.1: PME# supported from D0 D3hot Feb 27 21:48:25 danirybe kernel: pci 0000:00:0d.2: [8086:5a96] type 00 class 0x0c8000 Feb 27 21:48:25 danirybe kernel: pci 0000:00:0d.2: reg 0x10: [mem 0xfed01000-0xfed01fff] Feb 27 21:48:25 danirybe kernel: pci 0000:00:0d.3: [8086:5aec] type 00 class 0x050000 Feb 27 21:48:25 danirybe kernel: pci 0000:00:0d.3: reg 0x10: [mem 0xfe900000-0xfe901fff 64bit] Feb 27 21:48:25 danirybe kernel: pci 0000:00:0d.3: reg 0x18: [mem 0xfe902000-0xfe902fff 64bit] Feb 27 21:48:25 danirybe kernel: NET: Registered PF_INET protocol family ... ------------------------------------------------------------------------------------ I can think of two potential issues: 1) p2sb_bar() fails due to bus dev id mismatch or any other reason. 2) p2sb_bar() is called before p2sb scan and cache (this is not likely...) To confirm these guesses, may I ask danilrybakov to try out the patch I attached? https://bugzilla.kernel.org/attachment.cgi?id=305917 Please apply it to the unhealthy kernel, build, boot with the kernel, and share "journalctl -kb". Thanks in advance.
Created attachment 305918 [details] output of journalctl -kb with patch patched with "git patch" using the patch provided by Shin'ichiro Kawasaki
Comment on attachment 305918 [details] output of journalctl -kb with patch pathed with "git apply" using the patch provided by Shin'ichiro Kawasaki
Thanks for the journalctl log with the patch. Unfortunately, my guesses were wrong... 1) p2sb_bar() call succeeded 2) p2sb_bar() call was made after p2sb scan and cache lpc_ich driver called ps2b_bar successfully to setup SPI device. But before that the ACPI error were already reported. And after lpc_ich driver probe, the ACPI errors continue. ----------------------------------------------------------------------------------- ... Feb 28 12:33:47 danirybe kernel: ACPI Error: No handler or method for GPE 08, disabling event (20230628/evgpe-839) Feb 28 12:33:47 danirybe kernel: ACPI Error: No handler or method for GPE 0A, disabling event (20230628/evgpe-839) Feb 28 12:33:47 danirybe kernel: ACPI Error: No handler or method for GPE 0B, disabling event (20230628/evgpe-839) Feb 28 12:33:47 danirybe kernel: lpc_ich_init_spi: calling p2sb_bar Feb 28 12:33:47 danirybe kernel: p2sb_bar: devfn=d.2 Feb 28 12:33:47 danirybe kernel: ACPI Error: No installed handler for fixed event - PM_Timer (0), disabling (20230628/evevent-255) Feb 28 12:33:47 danirybe kernel: ACPI Error: No installed handler for fixed event - PowerButton (2), disabling (20230628/evevent-255) Feb 28 12:33:47 danirybe kernel: ACPI Error: No installed handler for fixed event - SleepButton (3), disabling (20230628/evevent-255) Feb 28 12:33:47 danirybe kernel: ACPI Error: Could not disable RealTimeClock events (20230628/evxfevnt-243) ... ----------------------------------------------------------------------------------- I had guessed that p2sb_bar() call failure would have caused lpc_ich probe failure and the ACPI errors, but the guess was wrong. Now I'm not sure why the ACPI errors are triggered (and why they are not suppressed after lpc_ich probe). danilrybakov, could you share "journalctl -kb" log with the healthy kernel? I would like to compare it with the unhealthy kernel, and hope that it will be a clue.
Created attachment 305919 [details] output of journalctl -kb healthy
@Shin'ichiro Kawasaki I have now two observations: 1) we, for some reason, do not use `pci_scan_slot()` which may have subtle differences to our for-loop; 2) with the above we do enumerate from function 0, which we remove from PCI immediately, and function 0 is special. Perhaps, even if it's not related to this bug, we need to refactor caching approach in this case.
danylrybakov, thanks for the log of the healthy kernel. It does not record the ACPI error. So the question is why the ACPI error is triggered after the p2sb scan and cache the commit introduced. Andy, thanks for the comment. I have not yet fully understood, but it is likely that the function 0 device scan and remove could be the cause. I guess that the concerned system does not use P2SB resource at function 0, so I would like to try out what happens if we skip it. Will create a patch for it and ask experiment with it.
Created attachment 305920 [details] A trial patch to skip P2SB function 0 resource cache Please apply this patch together with the previous patch to add debug log. Then please build kernel, boot with that kernel and share "journalctl -kb" log. Thanks in advance.
Created attachment 305921 [details] output of journalctl -kb with two patches applied This fixed the problem! Thanks, everybody!
(In reply to danilrybakov249 from comment #21) > Created attachment 305921 [details] > output of journalctl -kb with two patches applied > > This fixed the problem! Okay, so I would give a try for `pci_scan_slot()` first. What is special about P2SB itself is that it is not 100% pure PCI device, it simply emulates programming interface of PCI devices to some extend. This what I mentioned as well in the commit log that brought P2SB library into the kernel.
(In reply to Andy Shevchenko from comment #22) > (In reply to danilrybakov249 from comment #21) > > Created attachment 305921 [details] > > output of journalctl -kb with two patches applied > > > > This fixed the problem! Great, but still we need to think about good solution. The trial patch was just a dirty edit for the experiment. > > Okay, so I would give a try for `pci_scan_slot()` first. What is special > about P2SB itself is that it is not 100% pure PCI device, it simply emulates > programming interface of PCI devices to some extend. This what I mentioned > as well in the commit log that brought P2SB library into the kernel. I have created a patch to use pci_scan_slot() when P2SB has function 0 and will post it soon. Please take a look if it is what you expect.
Created attachment 305926 [details] Another trial patch to use pci_scan_slot() Use pci_scan_slot() in place of pci_scan_single_device() when P2SB has function 0. It still it has some debug prints to confirm its behavior. I suggest danylrybakov to wait for Andy's comment on it. If the patch is what he expects, it will be worth trying on the concerned machine.
(In reply to Shin'ichiro Kawasaki from comment #24) > Created attachment 305926 [details] > Another trial patch to use pci_scan_slot() > > Use pci_scan_slot() in place of pci_scan_single_device() when P2SB has > function 0. > It still it has some debug prints to confirm its behavior. > > I suggest danylrybakov to wait for Andy's comment on it. If the patch is > what he expects, it will be worth trying on the concerned machine. Yeah, but you need to cache all found devices, with this patch the SPI NOR won't be cached. So, something like this: 1) scan slot; 2) iterate over all found devices and cache their resources; 3) reverse remove devices including function 0 (at the end).
Created attachment 305928 [details] 2nd version of the trial patch to use pci_scan_slot() Andy, thanks for the comment. I updated the patch to follow Andy's comment. danylrybakov, could you try this patch? Please apply this patch only to the unhealthy kernel. Fingers crossed...
Created attachment 305930 [details] output of journalctl -kb with trial patch v2 Unfortunately, errors again.
Hmm, so the difference between simple pci_scan_single_device() loop and pci_scan_slot() is not the cause of the errors. I guess P2SB function 0 scan at fs_initcall step does something harmful. It is ideal to understand why P2SB function 0 scan at that step cause the ACPI errors. I walked through related code, but it looks too much for me without the device at my hand. It will need a number of trials (). Even if the cause is known, the fix will be modifications in PCI or ACPI sub-systems, which will not be accepted. As the different approach, I would like to seek for the solution by delaying the P2SB device scan, from fs_initcall step to p2sb_bar() call. Such delayed scan will leave the chance to cause deadlock with sysfs pci bus rescan, but by caching the resource, such device scan happens only once. Also, such delayed scan can be limited only for the concerned device. Then the chance of deadlock will be small enough. I will post a trial patch based on this approach. It will delay the P2SB scan for all functions from 0 to 7. One thing I'm not sure is that the delayed function 0 scan is okay on the concerned system or not. If not, additional change will be required.
Created attachment 305932 [details] Trial to defer P2SB scan for all functions for Goldmont CPUs danylrybakov, could you try this patch? Please apply this patch only to the unhealthy kernel.
Created attachment 305933 [details] output of journalctl -kb with trial patch v3 no errors
(In reply to danilrybakov249 from comment #30) > Created attachment 305933 [details] > output of journalctl -kb with trial patch v3 > > no errors Thanks! It's good the patch works as expected. I will create a formal patch to ask comments and reviews.
(In reply to Shin'ichiro Kawasaki from comment #31) ... > I will create a formal patch to ask comments and reviews. Posted the patch to the platfrom-driver-x86 list. https://lore.kernel.org/platform-driver-x86/20240301134504.1887132-1-shinichiro.kawasaki@wdc.com/T/#u
Andy, thank you for the comments on the patch. I reflected them and posted v2. https://lore.kernel.org/platform-driver-x86/20240302012813.2011111-1-shinichiro.kawasaki@wdc.com/T/#u danylrybakov, I CCed the v2 patch to you. If you have time to afford, please try it and confirm it avoids the ACPI errors.
Thank you for your work. What I've discovered with this patch is that the errors don't appear immediately after booting, but they still appear after a bit of time. The same with v1 of the patch.
(In reply to danilrybakov249 from comment #34) > Thank you for your work. What I've discovered with this patch is that the > errors don't appear immediately after booting, but they still appear after a > bit of time. The same with v1 of the patch. Thanks for the report. It is good to know that the fix patch still has the issue. Now I think the scan and cache of function 0 triggers the ACPI errors regardless of the step, at fs_initcall() as wells p2bs_bar() call. I will try to create another trial patch which will do the deferred scan and cache only for the requested function number.
Created attachment 305948 [details] Trial to defer P2SB scan and do the deferred cache only for requested function danylrybakov, please try this patch and see if it avoids the ACPI errors. I hope it will work good long time after the boot. The patch has some printks. Please share journactl -kb output so that I can confirm the modified behavior. Thanks in advance.
Created attachment 305951 [details] output of journalctl -kb with trial patch v4 So far no errors. I'll leave the laptop running for the night and post an update tomorrow.
Yep, no errors, even after a while.
(In reply to danilrybakov249 from comment #38) > Yep, no errors, even after a while. Great, thank you for testing. It's good to have this solid fix :) As discussed in platfomr-driver-x86 list, Hans suggested another potential fix idea which may be better than the confirmed fix. Before I post the next formal patch I would like to wait a bit to see how the discussion go.
Shin'ichiro, thank you for your work on this. Danilrybakov, thank you for all your testing! We are close to fixing this now. Shin'ichiro has posted a slightly different patch upstream then the last patch which you tested; and I think I may have found an even simpler solution: On Goldmont platform devices like your laptop the troublesome p2sb_bar() function only ever gets called for 2 devices, the actual P2SB devfn 13,0 and the SPI controller which is part of the P2SB, devfn 13,2. But the current p2sb code tries to cache BAR0 info for all of devfn 13,0 to 13,7 . This involves calling pci_scan_single_device() for device 13 functions 0-7 and I think that maybe your laptop model does not seem to like pci_scan_single_device() getting called for some of the other hidden devices. I have prepared a patch (replacing Shin'ichiro's patches) which only caches BAR0 for the P2SB (13.0) and SPI (13,2) devfn combinations. I'll attach that patch here, can you please give this a try? I'll also attach Shin'ichiro's latest patch as posted on the list: https://lore.kernel.org/platform-driver-x86/20240304071912.2340622-1-shinichiro.kawasaki@wdc.com/ with some follow-up fixes from me squashed in. If the patch to only cache 13,0 + 13,2 BAR0 info does not help, please give this patch a try. This one should work, but the other one would be a better fix if that approach works.
Created attachment 305957 [details] [PATCH] platform/x86: p2sb: On Goldmont only cache P2SB and SPI devfn BAR Patch to only cache BAR0 for the P2SB (13.0) and SPI (13,2) devfn combinations, please try this first. Note their is no need to try the next patch if this one works.
Created attachment 305958 [details] [PATCH v3] platform/x86: p2sb: Defer P2SB device scan when P2SB device has func 0 Latest patch from Shin'ichiro's: https://lore.kernel.org/platform-driver-x86/20240304071912.2340622-1-shinichiro.kawasaki@wdc.com/ with some fixes from Hans squashed in from: https://lore.kernel.org/platform-driver-x86/20240304134921.305604-2-hdegoede@redhat.com/ In case the "PATCH] platform/x86: p2sb: On Goldmont only cache P2SB and SPI devfn BAR" patch does NOT fix things for you please give this one a try, so that we can be sure that this issue will actually be fixed when I merge the 2 patches linked above. Note this is basically almost the same as "A trial patch to skip P2SB function 0 resource cache" so this should work, but I would like to be sure. Thank you for your patience and for all the tests you have run.
No issues with [PATCH] platform/x86: p2sb: On Goldmont only cache P2SB and SPI devfn BAR. :)
Hi, (In reply to danilrybakov249 from comment #43) > No issues with [PATCH] platform/x86: p2sb: On Goldmont only cache P2SB and > SPI devfn BAR. :) That is great, thank you for the quick test. Then I'll go and send that patch to Linus sometime this week.
(In reply to Hans de Goede from comment #44) > Hi, > > (In reply to danilrybakov249 from comment #43) > > No issues with [PATCH] platform/x86: p2sb: On Goldmont only cache P2SB and > > SPI devfn BAR. :) > > That is great, thank you for the quick test. > > Then I'll go and send that patch to Linus sometime this week. Hans, thank you very much finding out the best solution :) As I commented on the list, I agree with the patch. Danilrybakov, my thank also goes to you for your cooperations and patience :) I think your efforts deserve the credit tags "Reported-by" and "Tested-by" that Hans mentioned. I suggest you to reply to the mail from Hans so that the tags get handled with the patch.
(In reply to Shin'ichiro Kawasaki from comment #45) > I suggest you to reply to the mail from Hans so that > the tags get handled with the patch. I did, thank you as well.