Bug 215533

Summary: [BISECTED][REGRESSION] UI becomes unresponsive every couple of seconds
Product: Drivers Reporter: joey.corleone
Component: PCIAssignee: drivers_pci (drivers_pci)
Status: RESOLVED CODE_FIX    
Severity: normal CC: bjorn, dspencer577, jan.kiszka, regressions, sergiu.deitsch
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 5.16 Subsystem:
Regression: Yes Bisected commit-id:
Attachments: Bisection Log
v517good
v517bad
instrumentation
Interrupts
dmesg with debug messages (offending commit reverted)
dmesg with debug messages
dmesg with pci=earlydump
bios 1.14 dmesg with pci=earlydump
bios 1.12 dmesg with pci=earlydump

Description joey.corleone 2022-01-26 07:47:24 UTC

    
Comment 1 joey.corleone 2022-01-26 08:18:12 UTC
I accidentally sent the report prematurely. So here come my findings:

Since 5.16
(1) my system becomes unresponsive every couple of seconds (micro lags), which makes it more or less unusable.
(2) wrong(?) CPU frequencies are reported. 

- 5.15 works fine.
- Starting from some commit in 5.17, it seems (1) is fixed (unsure), but definitely not (2).

I have bisected the kernel between 5.15 and 5.16, and found that the offending commit is 0e8ae5a6ff5952253cd7cc0260df838ab4c21009 ("PCI/portdrv: Do not setup up IRQs if there are no users"). Bisection log attached.

Reverting this commit on linux-git[1] fixes both (1) and (2).

Important notes:
- This regression was reported on a DELL XPS 9550 laptop by two users [2], so it might be related strictly to that model. 
- According to user mallocman, the issue can also be fixed by reverting the BIOS version of the laptop to v1.12.
- The issue ONLY occurs when AC is plugged in (and stays there even when I unplug it).
- When booting on battery power, there is no issue at all.

You can easily observe the regression via: 

watch cat /sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_cur_fre

As soon as I plug in AC, all frequencies go up to values around 3248338 and stay there even if I unplug AC. This does not happen at all when booted on battery power. 

Also note: 
- the laptop's fans are not really affected by the high frequencies.
- setting the governor to "powersave" has no effect on the frequencies (as compared to when on battery power).
- lowering the maximum frequency manually works, but does not fix (1).

[1] https://aur.archlinux.org/pkgbase/linux-git/ (pulled commits up to 0280e3c58f92b2fe0e8fbbdf8d386449168de4a8).
[2] https://bbs.archlinux.org/viewtopic.php?id=273330
Comment 2 joey.corleone 2022-01-26 08:18:40 UTC
Created attachment 300320 [details]
Bisection Log
Comment 3 Jan Kiszka 2022-01-26 17:31:38 UTC
Are all of these config switches enabled in your kernel?

CONFIG_PCIE_PME
CONFIG_PCIEAER
CONFIG_HOTPLUG_PCI_PCIE
CONFIG_PCIE_DPC

If not, does it help to enable them?

Then: Could you also do some code changes that may help to understand which PCIe service has a quirk on this device? If yes, I could try to guide through them.
Comment 4 Bjorn Helgaas 2022-01-26 18:07:53 UTC
Correct me if I missed something:

  v5.15: fine (tested on upstream v5.15 per bisect log)
  v5.16: laggy, cpufreq issues (tested on Arch Linux v5.16.2 per log)
  v5.17-rc1: cpufreq issues (unspecified; I assume tested on Arch Linux?)
  v5.17-rc1 + revert 0e8ae5a6ff59: fine (I assume tested on Arch Linux?)

  Above all with BIOS 1.14.
  Above all on AC power.

I think we need to reduce the number of variables.  Test on mainline kernels if possible since we don't know what Arch adds on top.  Focus on booting and staying on AC power for now.

v5.17 hasn't been released yet, so I assume you're testing v5.17-rc1 or similar.  If you see lag in v5.17-rc1, we need to debug that.  If you only see it in v5.16 and it's already fixed in v5.17-rc1, it becomes a secondary problem that might be of interest for stable kernels.

Please attach complete dmesg logs for v5.17-rc1 (with the cpufreq issue) and v5.17-rc1 + revert 0e8ae5a6ff59 (which works fine if I understand correctly).
Comment 5 joey.corleone 2022-01-27 10:05:08 UTC
On 26.01.22 18:31, bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=215533
> 
> --- Comment #3 from Jan Kiszka (jan.kiszka@web.de) ---
> Are all of these config switches enabled in your kernel?
> 
> CONFIG_PCIE_PME
> CONFIG_PCIEAER
> CONFIG_HOTPLUG_PCI_PCIE
> CONFIG_PCIE_DPC
> 
> If not, does it help to enable them?

All of these switches were already enabled.

> Then: Could you also do some code changes that may help to understand which
> PCIe service has a quirk on this device? If yes, I could try to guide through
> them.

Sure, no problem.
Comment 6 joey.corleone 2022-01-27 10:54:02 UTC
Created attachment 300334 [details]
v517good

On 26.01.22 19:07, bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=215533
> 
> --- Comment #4 from Bjorn Helgaas (bjorn@helgaas.com) ---
> Correct me if I missed something:
> 
>    v5.15: fine (tested on upstream v5.15 per bisect log)
>    v5.16: laggy, cpufreq issues (tested on Arch Linux v5.16.2 per log)

Yes, when Arch Linux updated their linux package [1] to v5.16, I had 
noticed the issue for the first time.

I then proceeded to test the linux-lts package [2] which is currently 
tracking v5.15. The issue is not present in there.

>    v5.17-rc1: cpufreq issues (unspecified; I assume tested on Arch Linux?)
	
Next, I used the Arch Build System [3] to build the linux-git kernel 
[4], which automatically pulls the latest mainline commits - in my case 
v5.17-rc1.

The lags seem to either be gone completely or occur less frequently, I 
am not sure about that yet. The cpufreq issue remains.

>    v5.17-rc1 + revert 0e8ae5a6ff59: fine (I assume tested on Arch Linux?)

Yes, I then bisected the kernel between 5.15 and 5.16 using the same 
Arch Build System as in the previous step. When I found the offending 
commit, I reverted it on top of v5.17-rc1 and found that none of the 
issues was present any more on this updated kernel.

>    Above all with BIOS 1.14.

Yes.

>    Above all on AC power.

I have always
- booted on battery power
- then started watching the frequencies using this command

watch cat /sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_cur_fre

- and only then plugged in the power cable.

As soon as I plugged it in, the CPU frequencies went up, and the system 
began to stutter.

Actually, once or twice I forgot to remove the cable, and booted with it 
already being inserted - it led to the same issues though.

> I think we need to reduce the number of variables.  Test on mainline kernels
> if
> possible since we don't know what Arch adds on top.

I was under the impression that the Arch Build System applies only 
minimal additional patches. I guess I could look into traditional 
compilation [5] but haven't tried that out yet.

>  Focus on booting and
> staying on AC power for now.
> 
> v5.17 hasn't been released yet, so I assume you're testing v5.17-rc1 or
> similar.  If you see lag in v5.17-rc1, we need to debug that.  If you only
> see
> it in v5.16 and it's already fixed in v5.17-rc1, it becomes a secondary
> problem
> that might be of interest for stable kernels.
> 
> Please attach complete dmesg logs for v5.17-rc1 (with the cpufreq issue) and
> v5.17-rc1 + revert 0e8ae5a6ff59 (which works fine if I understand correctly).

I have attached both logs. I proceeded as in all my previous tests: 
booted on battery power, started to watch the cpu frequencies on cmd, 
and then connected the power cable.

[1] https://archlinux.org/packages/core/x86_64/linux/
[2] https://archlinux.org/packages/core/x86_64/linux-lts/
[3] https://wiki.archlinux.org/title/Kernel/Arch_Build_System
[4] https://aur.archlinux.org/pkgbase/linux-git/
[5] https://wiki.archlinux.org/title/Kernel/Traditional_compilation
Comment 7 joey.corleone 2022-01-27 10:54:02 UTC
Created attachment 300335 [details]
v517bad
Comment 8 Sergiu Deitsch 2022-01-30 20:22:22 UTC
I observe exactly the same problem on a Dell Mobile Precision 5510 CTO (also Arch Linux). Booting without A/C plugged in does not trigger the issue. htop shows a kernel thread permanently consuming over 90% of a single core.
Comment 9 joey.corleone 2022-01-31 14:00:48 UTC
Some new information and perspectives on this bug can be found in this report: https://bugzilla.kernel.org/show_bug.cgi?id=215546 .
Comment 10 Jan Kiszka 2022-01-31 16:34:08 UTC
That other ticket points toward an IRQ storm on a shared IRQ line, possibly also shared by one of the PCI services.

Can you confirm that as well? How does /proc/interrupts look like in the good and and the bad case?
Comment 11 Jan Kiszka 2022-01-31 17:03:27 UTC
Created attachment 300362 [details]
instrumentation

Could you also run with this ad-hoc instrumentation? The last hunk would be interesting to add to the reverted setup as well.
Comment 12 joey.corleone 2022-01-31 17:23:41 UTC
Created attachment 300363 [details]
Interrupts
Comment 13 joey.corleone 2022-01-31 17:27:44 UTC
(In reply to Jan Kiszka from comment #10)
> That other ticket points toward an IRQ storm on a shared IRQ line, possibly
> also shared by one of the PCI services.
> 
> Can you confirm that as well? How does /proc/interrupts look like in the
> good and and the bad case?

Yes, seems to be the same for me. I have saved the output of cat /proc/interrupts for
- v5.15 on battery power
- v5.15 on AC power
- v5.16 on battery power
- v5.16 on AC power

and uploaded them as attachment.
Comment 14 Jan Kiszka 2022-01-31 17:28:31 UTC
Thanks. The interrupts logs confirm the IRQ storm theory, namely on IRQ16. That kills all devices sharing that line (idma64.0, i801_smbus, i2c_designware.0).

Now the question is what the kernel can / must do to prevent this. The instrumentation may help me get an idea.
Comment 15 David Spencer 2022-01-31 19:06:05 UTC
*** Bug 215546 has been marked as a duplicate of this bug. ***
Comment 16 joey.corleone 2022-01-31 19:52:04 UTC
Created attachment 300366 [details]
dmesg with debug messages (offending commit reverted)
Comment 17 joey.corleone 2022-01-31 19:53:01 UTC
Created attachment 300367 [details]
dmesg with debug messages
Comment 18 joey.corleone 2022-01-31 19:54:56 UTC
(In reply to Jan Kiszka from comment #11)
> Created attachment 300362 [details]
> instrumentation
> 
> Could you also run with this ad-hoc instrumentation? The last hunk would be
> interesting to add to the reverted setup as well.

I have attached dmesg logs for both the good and the bad case.
Comment 19 Jan Kiszka 2022-01-31 20:41:45 UTC
This may fix the issue if I got it right:

diff --git a/drivers/pci/pcie/portdrv_core.c b/drivers/pci/pcie/portdrv_core.c
index bda630889f95..68b0013c3662 100644
--- a/drivers/pci/pcie/portdrv_core.c
+++ b/drivers/pci/pcie/portdrv_core.c
@@ -331,7 +331,7 @@ int pcie_port_device_register(struct pci_dev *dev)
 
 	pci_set_master(dev);
 
-	irq_services = 0;
+	irq_services = PCIE_PORT_SERVICE_BWNOTIF;
 	if (IS_ENABLED(CONFIG_PCIE_PME))
 		irq_services |= PCIE_PORT_SERVICE_PME;
 	if (IS_ENABLED(CONFIG_PCIEAER))
Comment 20 Jan Kiszka 2022-01-31 21:05:56 UTC
"Fix" was the wrong term: It should restore the behavior prior to my commit. But it also defeats its purpose.

Please give it a try, and if it "works", I need to discuss this afterwards with the PCI core people.
Comment 21 joey.corleone 2022-01-31 21:11:29 UTC
Yes, it works. The issues are gone with the patch applied.
Comment 22 joey.corleone 2022-01-31 21:35:58 UTC
Initially, another user on the Arch forums had reported that he also managed to fix the issue by downgrading the machine's BIOS (1.14 -> 1.12). 

So, if only the PCIE_PORT_SERVICE_BWNOTIF bit is set in my 1.14 logs, does this mean that the 1.14 BIOS version either removed some of the device's capabilities or fails to report them correctly?
Comment 23 Bjorn Helgaas 2022-01-31 21:47:03 UTC
> So, if only the PCIE_PORT_SERVICE_BWNOTIF bit is set in my 1.14 logs, does
> this mean that the 1.14 BIOS version either removed some of the device's
> capabilities or fails to report them correctly?

"pci=earlydump" logs with 1.12 and 1.14 should be able to answer that question.
Comment 24 joey.corleone 2022-02-01 09:07:05 UTC
Created attachment 300370 [details]
dmesg with pci=earlydump
Comment 25 joey.corleone 2022-02-01 09:10:37 UTC
(In reply to Bjorn Helgaas from comment #23)
> > So, if only the PCIE_PORT_SERVICE_BWNOTIF bit is set in my 1.14 logs, does
> > this mean that the 1.14 BIOS version either removed some of the device's
> > capabilities or fails to report them correctly?
> 
> "pci=earlydump" logs with 1.12 and 1.14 should be able to answer that
> question.

I have uploaded a dmesg log with pci=earlydump for BIOS version 1.14 and asked the guy with 1.12 [1] to obtain a log as well.

[1] https://bbs.archlinux.org/viewtopic.php?pid=2019530#p2019530
Comment 26 The Linux kernel's regression tracker (Thorsten Leemhuis) 2022-02-02 05:47:11 UTC
FWIW, there is a ticket with a regression that sounds similar to this one (at least to me): https://bugzilla.kernel.org/show_bug.cgi?id=215548

Would be cool if someone with more knowledge about the area and this issue could take a look and comment there.
Comment 27 David Spencer 2022-02-05 01:01:41 UTC
Created attachment 300396 [details]
bios 1.14 dmesg with pci=earlydump

Another dmesg log with bios 1.14.
Comment 28 David Spencer 2022-02-05 01:02:51 UTC
Created attachment 300397 [details]
bios 1.12 dmesg with pci=earlydump

And here's a dmesg log with pci=earlydump from bios 1.12. I can confirm that the issue is fixed by downgrading to bios 1.12.
Comment 29 Bjorn Helgaas 2022-02-08 20:42:08 UTC
Some analysis here: https://lore.kernel.org/r/20220208185658.GA489586@bhelgaas

Obviously we need a kernel fix for this.  In the meantime, try this to see if it works around the problem:

  # setpci -s00:1d.0 CAP_EXP+0x10.w=0x0080
  # setpci -s00:1d.0 CAP_EXP+0x12.w=0x4000
Comment 30 joey.corleone 2022-02-09 08:54:23 UTC
(In reply to Bjorn Helgaas from comment #29)
> Some analysis here:
> https://lore.kernel.org/r/20220208185658.GA489586@bhelgaas
> 
> Obviously we need a kernel fix for this.  In the meantime, try this to see
> if it works around the problem:
> 
>   # setpci -s00:1d.0 CAP_EXP+0x10.w=0x0080
>   # setpci -s00:1d.0 CAP_EXP+0x12.w=0x4000

Yes, setting either one of these temporarily fixes the issue (i.e. it stops the storm on IRQ 16). Although this only lasts until the power cable is either plugged in or out - then the storm returns.

Thanks for taking the time to analyze this!
Comment 31 joey.corleone 2022-02-18 11:41:33 UTC
The reverted commit has landed in v5.16 (and v5.17) and thus fixed the issues described in this ticket.

Do you want me to close this ticket now or keep it open until the underlying issue described in [1] is solved?

[1] https://lore.kernel.org/all/20220208185658.GA489586@bhelgaas/
Comment 32 Bjorn Helgaas 2022-02-18 16:20:57 UTC
https://git.kernel.org/linus/075b7d363c67 appeared in v5.17-rc4.
https://git.kernel.org/linus/2d88a0f49d4d appeared in v5.16.10.

Closing since *this* issue is resolved.  If the workaround mentioned in comment #31 is implemented and there are issues with it, we should open a new report.