Bug 218784 - [r8169] AER: Multiple Corrected error received for RTL810xE
Status: REOPENED
Alias: None
Product: Drivers
Classification: Unclassified
Component: PCI
Hardware: All Linux
Importance: P3 normal
Assignee: drivers_pci@kernel-bugs.osdl.org
URL: https://bugs.launchpad.net/ubuntu/+so...
Keywords:
Depends on:
Blocks:
 
Reported: 2024-04-27 09:04 UTC by corrado venturini
Modified: 2024-05-03 16:06 UTC

See Also:
Kernel Version: 6.9-rc5
Subsystem:
Regression: No
Bisected commit-id:


Attachments
doc for bug https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2015670 (54.49 KB, application/zip)
2024-04-27 09:04 UTC, corrado venturini
test patch to mask r8169 Receiver Errors (2.15 KB, patch)
2024-04-29 22:21 UTC, Bjorn Helgaas
attachment-9361-0.html (1.86 KB, text/html)
2024-05-02 06:51 UTC, corrado venturini

Description corrado venturini 2024-04-27 09:04:55 UTC
Created attachment 306225
doc for bug https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2015670

As requested by bjorn-helgaas for bug
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2015670
attaching zip file with:
dmesg.txt - complete dmesg log (includes some Correctable Errors)
lspci-command - terminal message from lspci
lspci-output - output of "sudo lspci -vv"
grub-txt - grub default modified with "pcie_aspm=off" ... makes a difference
dmesg-2.txt - complete dmesg log with "pcie_aspm=off"
lspci-2-output - with "pcie_aspm=off"
inxi.txt
Comment 1 Artem S. Tashkinov 2024-04-27 13:30:38 UTC
Add pci=noaer to kernel boot parameters.

This is extremely unlikely to be fixed because normally such errors are a result of some issue with your motherboard.
Comment 2 Heiner Kallweit 2024-04-27 17:16:02 UTC
IIRC Ubuntu uses a modified r8169 driver. So please re-test with a recent mainline kernel.
Instead of using the command line parameter pcie_aspm, you can also disable selected ASPM states via sysfs (attributes in the device's link directory). In the example given, only L0s is active.
RTL810xE is quite old, therefore I would say it's likely that some silicon bug is involved. Especially as with newer Realtek NIC versions the problem doesn't seem to occur.
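
A minimal sketch of the sysfs approach mentioned above, assuming the NIC sits at 0000:01:00.0 (the domain prefix is an assumption; the thread only shows "01:00.0") and that the kernel exposes the per-link ASPM attributes described in Documentation/ABI/testing/sysfs-bus-pci. The helper names are hypothetical, and writing the attribute requires root.

```python
from pathlib import Path

# Per-link power management attributes listed in
# Documentation/ABI/testing/sysfs-bus-pci.
ASPM_ATTRS = {
    "clkpm", "l0s_aspm", "l1_aspm",
    "l1_1_aspm", "l1_2_aspm", "l1_1_pcipm", "l1_2_pcipm",
}


def aspm_attr_path(bdf, attr, sysfs_root="/sys"):
    """Build the path of one ASPM attribute in the device's link directory."""
    if attr not in ASPM_ATTRS:
        raise ValueError(f"unknown ASPM attribute: {attr}")
    return Path(sysfs_root) / "bus/pci/devices" / bdf / "link" / attr


def set_aspm_state(bdf, attr, enabled, sysfs_root="/sys"):
    """Write 1 or 0 to the attribute to enable or disable that link state."""
    path = aspm_attr_path(bdf, attr, sysfs_root)
    path.write_text("1" if enabled else "0")
    return path
```

For the test asked about here, the call would be set_aspm_state("0000:01:00.0", "l0s_aspm", False). The link directory is only present when the kernel manages ASPM for that device, so a missing file is itself a useful data point.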
Comment 3 corrado venturini 2024-04-28 06:19:37 UTC
Tested with the latest mainline kernel; the problem REMAINS.
corrado@corrado-n7-noble:~$ uname -a
Linux corrado-n7-noble 6.9.0-060900rc5-generic #202404212032 SMP PREEMPT_DYNAMIC Sun Apr 21 20:46:14 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
corrado@corrado-n7-noble:~$
Comment 4 Bjorn Helgaas 2024-04-29 22:18:04 UTC
Reassigning to PCI and reopening because we should be able to do better.  Users should not have to discover and use kernel parameters to deal with issues like this (except as a temporary workaround or debugging tool).

I wanted to look at this based on the fact that "pcie_aspm=off" seemed to be a workaround for the Correctable Error flood.  I assumed that this disabled ASPM, so I hoped there was a connection between ASPM and the CEs.

But I learned today that this parameter doesn't disable ASPM.  It prevents Linux from touching the ASPM config at all, and also prevents Linux from advertising ASPM support to the platform (see ACPI_PCIE_REQ_SUPPORT).  That means Linux doesn't request AER control, so that is also left up to the platform.  In corrado's case it looks like firmware left AER disabled, so the effect is similar to "pci=noaer":

  acpi PNP0A08:00: _OSC: OS supports [ExtendedConfig Segments MSI EDR HPX-Type3]
  acpi PNP0A08:00: _OSC: not requesting OS control; OS requires [ExtendedConfig ASPM ClockPM MSI]
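
The decision behind those two log lines can be sketched as a subset check. This is only an illustration using the feature names exactly as they are printed in the log; the function names are hypothetical and this is not the kernel's code.

```python
# Features the OS must support before it asks the platform for control,
# as printed after "OS requires" in the log line above.
REQUIRED_FOR_CONTROL = frozenset({"ExtendedConfig", "ASPM", "ClockPM", "MSI"})


def requests_os_control(supported):
    """True if every required feature is in the advertised set."""
    return REQUIRED_FOR_CONTROL <= set(supported)


def missing_features(supported):
    """Required features that the OS did not advertise, sorted by name."""
    return sorted(REQUIRED_FOR_CONTROL - set(supported))


# Set advertised with "pcie_aspm=off", taken from the "OS supports" log line.
with_aspm_off = {"ExtendedConfig", "Segments", "MSI", "EDR", "HPX-Type3"}

print(requests_os_control(with_aspm_off))  # False
print(missing_features(with_aspm_off))     # ['ASPM', 'ClockPM']
```

Because ASPM and ClockPM are missing from the advertised set, control is never requested, and that includes control of AER.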

So probably there is no connection between ASPM and the CEs; it's just that "pcie_aspm=off" effectively turns off AER just like "pci=noaer" does.  So the CEs still *happen* as shown by "01:00.0 CESta: RxErr+" in lspci-2.txt, but the Root Port doesn't generate interrupts ("00:1d.0 RootCmd: CERptEn-").
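
For anyone comparing the two lspci dumps, a small sketch of pulling those +/- flags out of one device's "lspci -vv" stanza. The helper name is hypothetical, and the sample text used with it should come from your own lspci output.

```python
def parse_flags(lspci_stanza, field):
    """Return {flag_name: is_set} for a field such as 'CESta' or 'RootCmd'.

    lspci_stanza is the "lspci -vv" text for a single device. Each flag is
    printed as a name followed by '+' (set) or '-' (clear).
    """
    prefix = field + ":"
    for line in lspci_stanza.splitlines():
        stripped = line.strip()
        if stripped.startswith(prefix):
            tokens = stripped[len(prefix):].split()
            return {t[:-1]: t.endswith("+") for t in tokens if t[-1] in "+-"}
    return {}  # field not present in this stanza
```

With the values quoted above, parse_flags(nic_stanza, "CESta")["RxErr"] would be True for 01:00.0 while parse_flags(root_port_stanza, "RootCmd")["CERptEn"] would be False for 00:1d.0, which is the "errors happen but are not reported" situation.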

I feel like I've worked through all this in the past, probably several times.

I think we may want to mask the CEs in the AER Correctable Error Mask.  Then we should still be able to use AER for other errors.
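
The arithmetic behind that idea, as a rough sketch rather than the actual patch: setting the Receiver Error bit in the Correctable Error Mask leaves the other correctable errors reportable. The register offsets and bit values are the ones in include/uapi/linux/pci_regs.h; the function names are hypothetical.

```python
# Offsets within the AER extended capability (include/uapi/linux/pci_regs.h).
PCI_ERR_COR_STATUS = 0x10  # Correctable Error Status
PCI_ERR_COR_MASK = 0x14    # Correctable Error Mask

# Bits shared by the status and mask registers.
PCI_ERR_COR_RCVR = 0x00000001      # Receiver Error
PCI_ERR_COR_BAD_TLP = 0x00000040   # Bad TLP
PCI_ERR_COR_BAD_DLLP = 0x00000080  # Bad DLLP
PCI_ERR_COR_ADV_NFAT = 0x00002000  # Advisory Non-Fatal Error


def mask_receiver_errors(cor_mask):
    """Return the 32-bit mask value with Receiver Errors masked as well."""
    return (cor_mask | PCI_ERR_COR_RCVR) & 0xFFFFFFFF


def reportable(cor_status, cor_mask):
    """Status bits that are logged but not masked, so still reported."""
    return cor_status & ~cor_mask & 0xFFFFFFFF
```

For example, starting from a mask of 0x2000 (only Advisory Non-Fatal masked), mask_receiver_errors() gives 0x2001. A status of RxErr plus BadTLP (0x41) then reports only BadTLP (0x40): the RxErr bit still gets set in the status register, it just no longer triggers a message to the Root Port.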
Comment 5 Bjorn Helgaas 2024-04-29 22:21:42 UTC
Created attachment 306243
test patch to mask r8169 Receiver Errors

@corrado, if you have a chance to test this, I'd like to hear the results.  This is based on v6.9-rc1.  You shouldn't need any special kernel parameters.  If you test it, please attach the complete dmesg log and "sudo lspci -vv" output.  Thank you!
Comment 6 corrado venturini 2024-04-30 05:59:17 UTC
I installed v6.9-rc1 and downloaded your patch, but I don't know how to use it. Could you explain, please?
Comment 7 Bjorn Helgaas 2024-04-30 16:36:11 UTC
> I installed v6.9-rc1 and downloaded your patch, but I don't know how to use
> it. Could you explain, please?

Oh, sorry!  There are some instructions at https://wiki.ubuntu.com/KernelTeam/GitKernelBuild .  In that guide, between steps 3 and 4, you would run "patch -p1 < aer-mask-r8169" to apply my patch.

That guide hasn't been updated for three years, so I don't know whether it's completely up to date.  If it doesn't work for you, let me know and email me your kernel config file from /boot/config-*, and I'll try to build it for you.
Comment 8 corrado venturini 2024-05-01 05:06:01 UTC
I have a problem with the prerequisites.
corrado@corrado-n7-noble:~$ sudo apt-get install git build-essential kernel-package fakeroot libncurses5-dev libssl-dev ccache bison flex libelf-dev dwarves
[sudo] password for corrado: 
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
Note, selecting 'libncurses-dev' instead of 'libncurses5-dev'
Package kernel-package is not available, but is referred to by another package.
This may mean that the package is missing, has been obsoleted, or
is only available from another source

E: Package 'kernel-package' has no installation candidate
corrado@corrado-n7-noble:~$
Comment 9 Bjorn Helgaas 2024-05-01 22:34:58 UTC
Sorry about that, it is definitely kind of a pain to build debs from source.  I think I managed to do it, based on the config file you sent.  I used the instructions at https://wiki.ubuntu.com/KernelTeam/GitKernelBuild and disabled CONFIG_SECURITY_LOCKDOWN_LSM and CONFIG_MODULE_SIG.

The resulting debs are here: https://drive.google.com/drive/folders/1Uq2zmh8oubiBh7czbM48vP_6zyhvO8_2?usp=sharing

These are v6.8 + the comment #5 patch.  Let me know if you try them out.
Comment 10 corrado venturini 2024-05-02 06:51:20 UTC
Created attachment 306258
attachment-9361-0.html

With your custom kernel the problem disappeared. Let me know if you need
additional info.
thanks

Corrado Venturini


Comment 11 corrado venturini 2024-05-02 12:44:20 UTC
With your custom kernel the problem disappeared. Let me know if you need additional info.
thanks
Comment 12 Bjorn Helgaas 2024-05-02 15:58:48 UTC
Thank you very much for testing this!  Are you OK with me crediting you with reporting and testing in the public git log?  E.g.,

  Reported-by: Corrado Venturini <email-addr>
  Tested-by: Corrado Venturini <email-addr>

If so, let me know what email address you'd prefer there.
Comment 13 Heiner Kallweit 2024-05-03 07:59:09 UTC
Still, this seems to be a workaround. I'd really like to know whether disabling ASPM L0s via sysfs stops the CEs on the affected chip version.
So far it seems only RTL8106e is affected. Therefore any change in r8169 should be specific to this chip version.
Comment 14 Heiner Kallweit 2024-05-03 09:09:11 UTC
An additional thought:
Logging each and every CE may spam the dmesg log without any actual benefit. So it may be worthwhile to:
- use _once to log only the first CE, or
- use __ratelimit() for CEs
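
To make the two options concrete, here is a user-space sketch of both policies. It only models the idea (at most "burst" messages per "interval" seconds, or a single message ever); the class and function names are hypothetical and this is not the kernel's ratelimit implementation.

```python
import time


class RateLimit:
    """Allow at most `burst` events in each window of `interval` seconds."""

    def __init__(self, interval, burst, clock=time.monotonic):
        self.interval = interval
        self.burst = burst
        self.clock = clock        # injectable so the policy can be tested
        self.window_start = None
        self.printed = 0
        self.suppressed = 0       # total events dropped so far

    def allow(self):
        now = self.clock()
        # Start a new window on first use or once the old one has expired.
        if self.window_start is None or now - self.window_start >= self.interval:
            self.window_start = now
            self.printed = 0
        if self.printed < self.burst:
            self.printed += 1
            return True
        self.suppressed += 1
        return False


def make_once():
    """Return a function that is True on its first call and False afterwards."""
    fired = False

    def once():
        nonlocal fired
        first = not fired
        fired = True
        return first

    return once
```

A caller would log a CE only when allow() (or once()) returns True. With a continuous flood, the rate-limited variant still emits a steady trickle of messages, whereas the once variant goes quiet after the first report.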
Comment 15 Bjorn Helgaas 2024-05-03 15:31:53 UTC
Re comment #13, r8169 does funky ASPM stuff that it shouldn't need to do, but I haven't actually seen evidence of an ASPM connection.  I *suspected* one because "pcie_aspm=off" shut the messages off, but I was wrong. That parameter doesn't actually change the ASPM config, it just leaves the BIOS configuration alone.  It *does* prevent Linux from enabling AER, which is why the messages go away.

Re comment #14, I agree, it's not a great workaround and possibly not the best approach.  Similar thoughts here: https://lore.kernel.org/r/20240502225608.GA1553882@bhelgaas

Improvements like you suggest in the PCI core might reduce the need for driver changes.  But in cases where a device generates CEs continually, I think even rate-limited reporting will lead to legitimate bug reports.  In that case, a driver change to say "we know this device is broken and these CEs are harmless" might be helpful.
Comment 16 corrado venturini 2024-05-03 16:06:21 UTC
(In reply to Bjorn Helgaas from comment #12)
> Thank you very much for testing this!  Are you OK with me crediting you with
> reporting and testing in the public git log?  E.g.,
> 
>   Reported-by: Corrado Venturini <email-addr>
>   Tested-by: Corrado Venturini <email-addr>
> 
> If so, let me know what email address you'd prefer there.

OK, no problem. My email is corradoventu@gmail.com
thanks again
