Created attachment 306225 [details] doc for bug https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2015670
As requested by bjorn-helgaas for bug https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2015670, attaching a zip file with:
dmesg.txt - complete dmesg log (includes some Correctable Errors)
lspci-command - terminal message from lspci
lspci-output - output of "sudo lspci -vv"
grub-txt - grub default modified with "pcie_aspm=off" ... makes a difference
dmesg-2.txt - complete dmesg log with "pcie_aspm=off"
lspci-2-output - with "pcie_aspm=off"
inxi.txt
Add pci=noaer to kernel boot parameters. This is extremely unlikely to be fixed because normally such errors are a result of some issue with your motherboard.
IIRC Ubuntu uses a modified r8169 driver. So please re-test with a recent mainline kernel. Instead of using command line parameter pcie_aspm you can also disable selected ASPM states via sysfs (attributes in link directory). In the example given only L0s is active. RTL810xE is quite old, therefore I would say it's likely that some silicon bug is involved. Especially as with newer Realtek NIC versions the problem doesn't seem to occur.
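For anyone who wants to try the sysfs route, here is a minimal sketch. It assumes a kernel that exposes the per-link ASPM attributes (l0s_aspm, l1_aspm, and so on) in the device's link/ directory, and that the NIC is at 0000:01:00.0 as in the attached lspci output. The helper name set_aspm_state and its sysfs_root parameter are made up for illustration. Writing to sysfs requires root.

```python
from pathlib import Path

# ASPM states this sketch knows about; each maps to link/<state>_aspm.
# (Assumption: the running kernel provides these attributes.)
VALID_STATES = {"l0s", "l1", "l1_1", "l1_2"}


def set_aspm_state(bdf, state, enable, sysfs_root="/sys/bus/pci/devices"):
    """Enable or disable one ASPM state for the device at `bdf`.

    Writes "1" or "0" to <sysfs_root>/<bdf>/link/<state>_aspm and
    returns the path that was written.
    """
    if state not in VALID_STATES:
        raise ValueError(f"unknown ASPM state: {state!r}")
    attr = Path(sysfs_root) / bdf / "link" / f"{state}_aspm"
    if not attr.exists():
        # Attribute is missing, e.g. the kernel does not control ASPM
        # for this link or the device does not support the state.
        raise FileNotFoundError(str(attr))
    attr.write_text("1" if enable else "0")
    return attr


# Example (as root): turn off L0s on the NIC, then watch dmesg for CEs.
# set_aspm_state("0000:01:00.0", "l0s", enable=False)
```

If the CEs stop with L0s disabled this way and AER still enabled, that would point at an ASPM connection; if they continue, it would not.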
Tested with the latest mainline kernel; the problem REMAINS.
corrado@corrado-n7-noble:~$ uname -a
Linux corrado-n7-noble 6.9.0-060900rc5-generic #202404212032 SMP PREEMPT_DYNAMIC Sun Apr 21 20:46:14 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
corrado@corrado-n7-noble:~$
Reassigning to PCI and reopening because we should be able to do better. Users should not have to discover and use kernel parameters to deal with issues like this (except as a temporary workaround or debugging tool).

I wanted to look at this based on the fact that "pcie_aspm=off" seemed to be a workaround for the Correctable Error flood. I assumed that this disabled ASPM, so I hoped there was a connection between ASPM and the CEs.

But I learned today that this parameter doesn't disable ASPM. It prevents Linux from touching the ASPM config at all, and also prevents Linux from advertising ASPM support to the platform (see ACPI_PCIE_REQ_SUPPORT), which means Linux doesn't request AER control, so that is also left up to the platform. In corrado's case it looks like firmware left AER disabled, so the effect is similar to "pci=noaer":

  acpi PNP0A08:00: _OSC: OS supports [ExtendedConfig Segments MSI EDR HPX-Type3]
  acpi PNP0A08:00: _OSC: not requesting OS control; OS requires [ExtendedConfig ASPM ClockPM MSI]

So probably there is no connection between ASPM and the CEs; it's just that "pcie_aspm=off" effectively turns off AER just like "pci=noaer" does. So the CEs still *happen* as shown by "01:00.0 CESta: RxErr+" in lspci-2.txt, but the Root Port doesn't generate interrupts ("00:1d.0 RootCmd: CERptEn-").

I feel like I've worked through all this in the past, probably several times. I think we may want to mask the CEs in the AER Correctable Error Mask. Then we should still be able to use AER for other errors.
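To illustrate the masking idea, here is a rough sketch of the bit arithmetic only; it is not the kernel change itself. The bit values follow the PCI_ERR_COR_* definitions in Linux's pci_regs.h as I understand them, the short names mirror what lspci prints, and the function names are invented for this example.

```python
# Bits in the AER Correctable Error Status and Mask registers.
PCI_ERR_COR_RCVR = 0x00000001       # Receiver Error (lspci: RxErr)
PCI_ERR_COR_BAD_TLP = 0x00000040    # Bad TLP
PCI_ERR_COR_BAD_DLLP = 0x00000080   # Bad DLLP
PCI_ERR_COR_REP_ROLL = 0x00000100   # REPLAY_NUM Rollover
PCI_ERR_COR_REP_TIMER = 0x00001000  # Replay Timer Timeout
PCI_ERR_COR_ADV_NFAT = 0x00002000   # Advisory Non-Fatal Error

COR_BIT_NAMES = [
    (PCI_ERR_COR_RCVR, "RxErr"),
    (PCI_ERR_COR_BAD_TLP, "BadTLP"),
    (PCI_ERR_COR_BAD_DLLP, "BadDLLP"),
    (PCI_ERR_COR_REP_ROLL, "Rollover"),
    (PCI_ERR_COR_REP_TIMER, "Timeout"),
    (PCI_ERR_COR_ADV_NFAT, "AdvNonFatalErr"),
]


def decode_cor_bits(value):
    """Return the names of the correctable-error bits set in `value`."""
    return [name for bit, name in COR_BIT_NAMES if value & bit]


def mask_receiver_error(cor_mask):
    """Return the mask register value with Receiver Error also masked."""
    return cor_mask | PCI_ERR_COR_RCVR


def reported_errors(cor_status, cor_mask):
    """Errors that are logged in status and not suppressed by the mask."""
    return decode_cor_bits(cor_status & ~cor_mask)
```

With Receiver Error masked, a status of RxErr plus BadTLP would still report BadTLP, which is the point: the noisy error is silenced while AER keeps working for everything else. The status bit for a masked error can still be set, so "CESta: RxErr+" may continue to show up in lspci.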
Created attachment 306243 [details] test patch to mask r8169 Receiver Errors @corrado, if you have a chance to test this, I'd like to hear the results. This is based on v6.9-rc1. You shouldn't need any special kernel parameters. If you test it, please attach the complete dmesg log and "sudo lspci -vv" output. Thank you!
I installed v6.9-rc1 and downloaded your patch, but I don't know how to use it. Could you explain, please?
> I installed v6.9-rc1 and downloaded your patch but i don't know how to use
> it. may you explain please?
Oh, sorry! There are some instructions at https://wiki.ubuntu.com/KernelTeam/GitKernelBuild . In that guide, between steps 3 and 4, you would run "patch -p1 < aer-mask-r8169" to apply my patch. That guide hasn't been updated for three years, so I don't know whether it's completely up to date. If it doesn't work for you, let me know and email me your kernel config file from /boot/config-*, and I'll try to build it for you.
I have a problem with the prerequisites.
corrado@corrado-n7-noble:~$ sudo apt-get install git build-essential kernel-package fakeroot libncurses5-dev libssl-dev ccache bison flex libelf-dev dwarves
[sudo] password for corrado:
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
Note, selecting 'libncurses-dev' instead of 'libncurses5-dev'
Package kernel-package is not available, but is referred to by another package.
This may mean that the package is missing, has been obsoleted, or
is only available from another source
E: Package 'kernel-package' has no installation candidate
corrado@corrado-n7-noble:~$
Sorry about that, it is definitely kind of a pain to build debs from source. I think I managed to do it, based on the config file you sent. I used the instructions at https://wiki.ubuntu.com/KernelTeam/GitKernelBuild and disabled CONFIG_SECURITY_LOCKDOWN_LSM and CONFIG_MODULE_SIG. The resulting debs are here: https://drive.google.com/drive/folders/1Uq2zmh8oubiBh7czbM48vP_6zyhvO8_2?usp=sharing These are v6.8 + the comment #5 patch. Let me know if you try them out.
Created attachment 306258 [details] attachment-9361-0.html With your custom kernel problem disappeared, let me know if You need additional info. thanks Corrado Venturini
Thank you very much for testing this! Are you OK with me crediting you with reporting and testing in the public git log? E.g.,
  Reported-by: Corrado Venturini <email-addr>
  Tested-by: Corrado Venturini <email-addr>
If so, let me know what email address you'd prefer there.
Still this seems to be a workaround. I'd really like to know whether disabling ASPM L0s via sysfs stops the CE's on the affected chip version. So far it seems only RTL8106e is affected. Therefore any change in r8169 should be specific to this chip version.
An additional thought: Logging each and every CE may spam the dmesg log without any actual benefit. So it may be worthwhile to either:
- use a _once logging variant to log only the first CE, or
- use __ratelimit() for CEs
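To make the two options concrete, here is a small userspace sketch of the policies. It only models the behavior (at most `burst` messages per `interval` seconds, or a single message ever); it is not kernel code, and the class names, parameters, and injectable clock are all illustrative assumptions.

```python
import time


class RateLimit:
    """Allow at most `burst` events per `interval` seconds."""

    def __init__(self, interval, burst, clock=time.monotonic):
        self.interval = interval
        self.burst = burst
        self.clock = clock          # injectable so the policy can be tested
        self.window_start = None
        self.printed = 0            # events allowed in the current window
        self.missed = 0             # events suppressed so far (cumulative)

    def allow(self):
        now = self.clock()
        # Start a new window on first use or once the interval has elapsed.
        if self.window_start is None or now - self.window_start >= self.interval:
            self.window_start = now
            self.printed = 0
        if self.printed < self.burst:
            self.printed += 1
            return True
        self.missed += 1
        return False


class LogOnce:
    """Allow only the very first event."""

    def __init__(self):
        self.done = False

    def allow(self):
        if self.done:
            return False
        self.done = True
        return True
```

A caller would check allow() before printing each CE message. With a device that produces CEs continually, the rate-limited variant still emits a steady trickle of messages, while the once variant stays silent after the first.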
Re comment #13, r8169 does funky ASPM stuff that it shouldn't need to do, but I haven't actually seen evidence of an ASPM connection. I *suspected* one because "pcie_aspm=off" shut the messages off, but I was wrong. That parameter doesn't actually change the ASPM config, it just leaves the BIOS configuration alone. It *does* prevent Linux from enabling AER, which is why the messages go away. Re comment #14, I agree, it's not a great workaround and possibly not the best approach. Similar thoughts here: https://lore.kernel.org/r/20240502225608.GA1553882@bhelgaas Improvements like you suggest in the PCI core might reduce the need for driver changes. But in cases where a device generates CEs continually, I think even rate-limited reporting will lead to legitimate bug reports. In that case, a driver change to say "we know this device is broken and these CEs are harmless" might be helpful.
(In reply to Bjorn Helgaas from comment #12)
> Thank you very much for testing this! Are you OK with me crediting you with
> reporting and testing in the public git log? E.g.,
>
> Reported-by: Corrado Venturini <email-addr>
> Tested-by: Corrado Venturini <email-addr>
>
> If so, let me know what email address you'd prefer there.
OK, no problem, my email is corradoventu@gmail.com
Thanks again