Bug 206217 - Crash with r8169 IRQ on Jetson-TK1
Summary: Crash with r8169 IRQ on Jetson-TK1
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: Other
Hardware: All Linux
Importance: P1 normal
Assignee: drivers_pci@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-01-15 21:55 UTC by NicolasChauvet
Modified: 2020-05-18 10:03 UTC

See Also:
Kernel Version: 5.3+
Subsystem:
Regression: No
Bisected commit-id:


Attachments
Fix phy-names on jetson-tk1 (1.09 KB, patch)
2020-03-15 20:02 UTC, NicolasChauvet
setpci manipulation (1.65 KB, text/plain)
2020-04-02 12:23 UTC, NicolasChauvet
lspci-vvv-jetson-tk1.txt (7.59 KB, text/plain)
2020-04-08 14:40 UTC, NicolasChauvet
lspci-jetson-tk1-after_aer.txt (3.65 KB, text/plain)
2020-04-08 21:58 UTC, NicolasChauvet
Revert raw_violation_fixup (1.86 KB, application/mbox)
2020-04-18 16:21 UTC, NicolasChauvet

Description NicolasChauvet 2020-01-15 21:55:31 UTC
I'm experiencing the following error on Jetson TK1 (ARM/Tegra) with a Fedora kernel 5.3+ (tested up to 5.4.10). The same behavior is not reproducible on the Trimslice with the same kernel.
Last known good kernel was 4.19

This is easily reproducible once the network is under load (ssh access works, but yum update exhibits the problem).

I can workaround the problem on Jetson TK1 by blacklisting the r8169 kernel module and using a usb ethernet NIC.


-----

[10375.391081] pcieport 0000:00:02.0: AER: Uncorrected (Non-Fatal) error received: 0000:01:00.0
[10375.402037] r8169 0000:01:00.0: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[10375.413754] r8169 0000:01:00.0: AER:   device [10ec:8168] error status/mask=00004000/00400000
[10375.422693] r8169 0000:01:00.0: AER:    [14] CmpltTO                (First)
[10375.430723] pcieport 0000:00:02.0: AER: Device recovery failed
[10386.513018] ------------[ cut here ]------------ kB/s | 589 kB     00:17 ETA
[10386.518082] WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:447 dev_watchdog+0x114/0x1f0
[10386.526410] NETDEV WATCHDOG: enp1s0 (r8169): transmit queue 0 timed out
[10386.533070] Modules linked in: r8169 xfs tun rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache ip6t_REJECT nf_reject_ipv6 ip6table_filter ip6_tables rfkill iptable_nat nf_nat ipt_REJECT b
[10386.533840]  cqhci phy_tegra_usb libahci_platform rtc_tegra i2c_tegra
[10386.629003] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.4.10-100.fc30.armv7hl+lpae #1
[10386.636863] Hardware name: NVIDIA Tegra SoC (Flattened Device Tree)
[10386.643420] [<c043191c>] (unwind_backtrace) from [<c042bfcc>] (show_stack+0x18/0x1c)
[10386.651268] [<c042bfcc>] (show_stack) from [<c0befb40>] (dump_stack+0xb4/0xd0)
[10386.658568] [<c0befb40>] (dump_stack) from [<c0457abc>] (__warn+0xdc/0xf8)
[10386.665479] [<c0457abc>] (__warn) from [<c0457e6c>] (warn_slowpath_fmt+0x70/0x84)
[10386.672998] [<c0457e6c>] (warn_slowpath_fmt) from [<c0af3d90>] (dev_watchdog+0x114/0x1f0)
[10386.681249] [<c0af3d90>] (dev_watchdog) from [<c04c7b98>] (call_timer_fn+0x4c/0x15c)
[10386.689037] [<c04c7b98>] (call_timer_fn) from [<c04c83e4>] (__run_timers.part.0+0x1b4/0x1cc)
[10386.697486] [<c04c83e4>] (__run_timers.part.0) from [<c04c8434>] (run_timer_softirq+0x38/0x6c)
[10386.706127] [<c04c8434>] (run_timer_softirq) from [<c04023b4>] (__do_softirq+0x1fc/0x30c)
[10386.714364] [<c04023b4>] (__do_softirq) from [<c045d3b4>] (irq_exit+0x7c/0xdc)
[10386.721652] [<c045d3b4>] (irq_exit) from [<c04ae938>] (__handle_domain_irq+0x7c/0xa8)
[10386.729572] [<c04ae938>] (__handle_domain_irq) from [<c07bc054>] (gic_handle_irq+0x54/0x80)
[10386.737971] [<c07bc054>] (gic_handle_irq) from [<c0401bb8>] (__irq_svc+0x58/0x74)
[10386.745454] Exception stack(0xc1401f30 to 0xc1401f78)
[10386.750510] 1f20:                                     00000000 ee78a174 00000000 c043caa0
[10386.758690] 1f40: 00000000 00000000 c1400000 00000001 c1405ffc c1401f88 c1406044 00000000
[10386.766868] 1f60: 00000000 c1401f80 c0428f00 c0428edc 680f0013 ffffffff
[10386.773512] [<c0401bb8>] (__irq_svc) from [<c0428edc>] (arch_cpu_idle+0x24/0x54)
[10386.780980] [<c0428edc>] (arch_cpu_idle) from [<c04834cc>] (do_idle+0x108/0x268)
[10386.788413] [<c04834cc>] (do_idle) from [<c0483898>] (cpu_startup_entry+0x20/0x28)
[10386.796035] [<c0483898>] (cpu_startup_entry) from [<c1200f18>] (start_kernel+0x42c/0x4d0)
[10386.804485] ---[ end trace 1d15883f7e94fdfe ]---
[10386.865042] pcieport 0000:00:02.0: AER: Uncorrected (Non-Fatal) error received: 0000:01:00.0
[10386.873627] r8169 0000:01:00.0: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[10386.885085] r8169 0000:01:00.0: AER:   device [10ec:8168] error status/mask=00004000/00400000
[10386.893702] r8169 0000:01:00.0: AER:    [14] CmpltTO                (First)
[10386.901358] pcieport 0000:00:02.0: AER: Device recovery failed
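
For reference, the status/mask pair in the AER lines above can be decoded by bit position. The sketch below is not part of the original report; it assumes the bit names used by the Linux AER driver for the Uncorrectable Error Status register (only a subset is listed), and the function name aer_uncor_decode is made up for illustration.

```sh
#!/bin/sh
# Decode an AER Uncorrectable Error Status (or Mask) value given in hex.
# Bit names follow the Linux AER driver's short names (subset only).
aer_uncor_decode() (
    status=$(( 0x${1#0x} ))
    for entry in 4:DLP 5:SDES 12:TLP 13:FCP 14:CmpltTO 15:CmpltAbrt \
                 16:UnxCmplt 17:RxOF 18:MalfTLP 19:ECRC 20:UnsupReq \
                 21:ACSViol 22:UncorrIntErr; do
        bit=${entry%%:*}     # part before the colon: bit position
        name=${entry#*:}     # part after the colon: short name
        if [ $(( $status & (1 << $bit) )) -ne 0 ]; then
            printf '[%d] %s\n' "$bit" "$name"
        fi
    done
)

aer_uncor_decode 00004000    # status value from the log
```

With status 00004000 only bit 14 (Completion Timeout) is set, which matches the "[14] CmpltTO" line in the log. The mask value 00400000 has only bit 22 set.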
Comment 1 NicolasChauvet 2020-01-15 22:00:44 UTC
I'm suspecting the following patch to have an effect on the crash...
"PCI: tegra: Advertise PCIe Advanced Error Reporting (AER) capability"
git hash c635a815c8c73a985b0e723efd4ffd70e99729fc

Test pending ...
Comment 2 Manikanta Maddireddy 2020-01-19 10:10:14 UTC
Hi Nicolas,

This patch only enables error reporting capability, I don't think it caused this error. I believe you would observe "NETDEV WATCHDOG" even if you revert this patch. I tested these patches in Jetson TK1 platform with "Arch Linux ARM" before publishing them. Could you please provide me the steps to install "Fedora kernel 5.3+" on Jetson TK1, I will give it a try on my setup.

Thanks,
Manikanta
Comment 3 NicolasChauvet 2020-01-19 20:39:07 UTC
Hello Manikanta,

Thanks for your answer.

Installing the Fedora kernel 5.3+ is as simple as installing Fedora 31 and updating to the current kernel (which as of today is version 5.4.10).

Also, the bootloader I'm using is a vanilla u-boot from at least January 2018 (I can confirm the exact version if needed).

I will try to build a current kernel with and without the AER enablement in the next few days. I don't expect it to be the root cause, but rather that some side effect is affecting AER.
Comment 4 NicolasChauvet 2020-03-15 15:41:29 UTC
@manikanta

I've reproduced using a 5.6.0-rc5 kernel
(using the fedora rawhide kernel nodebug kernel from a f30 userspace).

Something worth mentioning is that I'm using the ahci driver with the rootfs on SATA.

I'm not reproducing when using the rootfs on the emmc with the same kernel (and a f31 userspace).

With such 5.6.0-rc5 kernel, I can see the following:
pcieport 0000:00:02.0: AER: enabled with IRQ 25*4096
So it's unlikely to be related to AER, but probably to some interaction between ahci and r8169.
Comment 5 NicolasChauvet 2020-03-15 15:42:12 UTC
This is a lpae fedora kernel: 5.6.0-0.rc5.git0.1.fc33.armv7hl+lpae
Comment 6 Manikanta Maddireddy 2020-03-15 19:13:39 UTC
Hi Nicolas,

I was able to reproduce the issue once with the R21.8.0 Nvidia release package + 5.5.0-rc7. I tried a kernel git bisect, but I didn't see the issue again, even after going back to the top of the changes. In short, I don't have a reliable way to reproduce and debug the issue.

Jetson TK1 doesn't have any other PCIe ports, so it's not a PCIe-based AHCI controller. I don't see how ahci and r8169 are related. Can you do a kernel git bisect on the recent Tegra PCIe changes?
As I said earlier AER will just report errors, so please look for any network instability issues when you go below AER enable change.

Thanks,
Manikanta
Comment 7 NicolasChauvet 2020-03-15 19:35:34 UTC
Maybe there is something at the padctl level ? (which seems common between pcie/sata ?)

Given that the earliest kernel where I've experienced the issue was 5.3 and the latest I've tested without issue was 4.19, that's a whole range of kernel versions to test for a regression... I need to narrow down which component is involved before bisecting any code.

Do you confirm that you can use both rootfs on sata and ethernet using such kernel from:
https://dl.fedoraproject.org/pub/alt/rawhide-kernel-nodebug/armhfp/

Thanks for your feedback on this.
Comment 8 NicolasChauvet 2020-03-15 20:02:57 UTC
Created attachment 287935 [details]
Fix phy-names on jetson-tk1

I might test this patch (later tomorrow).

I don't know what the implications are of using the same phy-names or of not using the appropriate ones. Right now it's only a wild guess...
Comment 9 Manikanta Maddireddy 2020-03-16 05:02:30 UTC
Hi Nicolas,

Tegra PCIe driver reads the phy using name from pcie-0 to pcie-<num_lane-1>. You don't need to change the phy-name, please refer to tegra_pcie_port_get_phys() -> devm_of_phy_optional_get_index() functions.

Thanks,
Manikanta
Comment 10 NicolasChauvet 2020-03-16 16:49:50 UTC
Okay, so as I understand the phy-names are based on the index of the given consumer hierarchy level. So phy from pci@1,0 and pci@2,0 can share the same name and are not referring to names from pcie-0 to pcie-4 from the padctl->pad->pcie->lanes nodes in tegra124.dtsi.

Now I still have my problem.

You said, you have reproduced the issue once.
- Which bootloader are you using ? (I'm using upstream u-boot from 2019-10).
- Which kernel have you attempted to bisect from 5.5.0-rc7 ? (seems old given 5.5 is already stable). Are you trying to pull from git.kernel.org tegra/for-5.6 trees or anything else ?
- Is this seen only on a cold boot ?

For me, what is surprising is that the crash is wrapped between AER messages, but it might be unrelated to AER. It's more likely (given the backtrace) related to the IRQ of the NIC. (Updating the topic accordingly.)
But then I don't get why there is such a crash.

Anything to help narrow the issue ?
Comment 11 Manikanta Maddireddy 2020-03-17 04:39:01 UTC
Hi,

>>Which bootloader are you using ? (I'm using upstream u-boot from 2019-10).
It is from R21.8.0_armhf build downloaded from https://developer.nvidia.com/embedded/downloads.
U-Boot 2018.05-gc50329da15 (Oct 31 2019 - 13:48:20 -0700)

>>Which kernel have you attempted to bisect from 5.5.0-rc7 ? 
I cloned it from linux-next.

>>Is this seen only on a cold boot ?
It reproduced only once, I can't remember.


>>Anything to help narrow the issue ?
It is not a crash, rather a warning trace.
The TX queue of the enp1s0 interface was stopped for longer than the timeout period, so the network watchdog printed a warning. This happened because a PCIe memory read failed with a completion timeout and the r8169 driver got wrong data (most likely all Fs or 0s). Upon detecting the wrong data, it stopped the TX queue with a netif_stop_queue() call. I believe this happened in either rtl8169_start_xmit() or rtl8169_interrupt().

[10386.526410] NETDEV WATCHDOG: enp1s0 (r8169): transmit queue 0 timed out

I suggest below experiments,
- Update only u-boot image with nvidia release binary, you can download from https://developer.nvidia.com/embedded/downloads.
- After issue is reproduced, try to access PCIe config space and BAR registers. Access only 4 bytes of data using setpci and devmem tools, else kernel log will flood with AER errors.

Thanks,
Manikanta
Comment 12 NicolasChauvet 2020-03-29 13:22:21 UTC
I've reproduced with 5.6.0-rc7-next-20200325+
But then I've another issue related to tegra_drm vic... (there I can't enable the display).

To me the problem is easily reproducible any time I use the NIC under a little load. I can ssh into the board without issue, but using scp to (for example) copy a kernel with some modules and dtbs exhibits the problem.

I don't expect the l4t u-boot to load my kernel with syslinux support.
I've had a quick look and not much has changed over the last year in upstream u-boot. So I'll keep it for now.

My best bet is to try an older fedora release (from external MMC) and upgrade kernel until I can reproduce the issue...
Comment 13 NicolasChauvet 2020-04-02 12:23:10 UTC
> - After issue is reproduced, try to access PCIe config space and BAR
> registers. Access only 4 bytes of data using setpci and devmem tools, else
> kernel log will flood with AER errors.

Can you elaborate on this ? which registers should I access ? Do I need to compare before/after the AER issue is logged ? (seems there is only a difference between BASE_ADDRESS on the Tegra PCI Bridge, no diffs on the NIC)


I can reliably reproduce with a 5.3+ kernel by using scp to copy a large file (~1 GB) to the Jetson. The last known good kernel for me is 5.2.18-200.fc30.armv7hl.

Also note that once the AER error is logged I cannot use the NIC anymore; there is no way to send any packet on the LAN.

Also I'm now using u-boot 2020.04-rc4 with a few patches sent recently by Tom Warren and applied to the Fedora uboot
https://src.fedoraproject.org/rpms/uboot-tools/tree/master
https://koji.fedoraproject.org/koji/buildinfo?buildID=1487207
(image located in uboot-images-armv7 noarch RPM). 

In particular, the net-tegra-Misc-network-fixes.patch seems to mention an interaction between the NIC kernel driver and the bootloader.
(But I'm still reproducing the issue with these patches.)
Comment 14 NicolasChauvet 2020-04-02 12:23:47 UTC
Created attachment 288153 [details]
setpci manipulation
Comment 15 Manikanta Maddireddy 2020-04-02 14:21:34 UTC
Hi,

I believe you can access the serial console after the AER error. Please use an RS232 to USB cable to get a serial console. Please follow https://youtu.be/ZY34SxdHufI for details.

>>Can you elaborate on this ? which registers should I access ?
How to access config registers:
  1) "sudo lspci" command gives list of all PCIe devices. Take BDF of Ethernet device, looks something like "01:00.0".
  2) "sudo setpci -s <BDF of NIC> 0x0.l" should give the proper device ID & vendor ID. If it gives 0xffffffff, then the PCIe link is not stable. Ex: sudo setpci -s 01:00.0 0x0.l

How to access BAR registers:
  1) "sudo lspci -v -s <BDF of NIC> | grep Region" gives the BAR addresses. There can be multiple BARs, pick one address. This address is displayed in hex.
  2) "sudo devmem <addr> w" should give non Fs/0s. This means BAR is accessible.
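
As a rough illustration of the two checks above (not from the original comment), the helper below classifies a 32-bit value as printed by setpci or devmem. The name classify_read and its output labels are made up for this sketch. The sample value 816810ec is the ID dword one would expect from the 10ec:8168 NIC in this report (device ID in the upper half, vendor ID in the lower half).

```sh
#!/bin/sh
# Classify a 32-bit readback: all Fs or all 0s suggests the read did not
# reach the device; anything else is treated as plausible data.
classify_read() (
    v=$(printf '%s' "$1" | tr 'A-F' 'a-f')   # normalise case
    v=${v#0x}                                # drop an optional 0x prefix
    case "$v" in
        ffffffff)    echo "bad: all Fs" ;;
        0|00000000)  echo "bad: all 0s" ;;
        *)           echo "ok: $v" ;;
    esac
)

classify_read 816810ec
# On the target one might run (needs root and the real device):
#   classify_read "$(sudo setpci -s 01:00.0 0x0.l)"
```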

Please share "sudo lspci -vvv" output as well.

Is it possible for you to flash the complete (recent) nvidia package from https://developer.nvidia.com/embedded/downloads and then update only the kernel & DTB? I did the same on my board and don't see this issue.

Thanks,
Manikanta
Comment 16 NicolasChauvet 2020-04-08 14:39:56 UTC
Thanks for your help tracking this issue.
I will try to use the latest l4t. But I will have to backup my existing system.

Here is some information (running u-boot 2020-04-rc4 with fedora patches).


There is an issue with trying to access the lspci of the NIC:
  r8169 0000:01:00.0: invalid short VPD tag 00 at offset 1
cat: '/sys/bus/pci/devices/0000:01:00.0/vpd': Input/output error

See the attached lspci -vvv

Looking at the regions for the NIC:
# devmem2 1000 w
EAFFFFFE
# devmem2 13000000 w
EAFFFFFE
# devmem2 20000000 w
[  398.420973] 8<--- cut here ---
[  398.424023] Unhandled fault: imprecise external abort (0x406) at 0x00000000
[  398.430968] pgd = b4b2dd01
[  398.433664] [00000000] *pgd=aa4c7835, *pte=00000000, *ppte=00000000
Bus error (core dumped)
Comment 17 NicolasChauvet 2020-04-08 14:40:40 UTC
Created attachment 288285 [details]
lspci-vvv-jetson-tk1.txt
Comment 18 Manikanta Maddireddy 2020-04-08 16:01:04 UTC
Hi Nicolas,

These addresses are in hex, could you please read the BAR with below commands.
# devmem2 0x13000000 w
# devmem2 0x20000000 w

Also dump lspci output.
# lspci -xxxx > lspci_dump.txt

Please note that this data should be captured after AER error is observed.
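
To illustrate why the 0x prefix matters: assuming devmem2 parses its address argument with automatic base detection (as C's strtoul does with base 0), a bare 13000000 is read as decimal. The shell's printf uses the same convention for numeric arguments, so it can show which address was actually requested. The helper name as_hex is made up for this sketch.

```sh
#!/bin/sh
# Print a numeric argument as hex, using C-style base detection:
# no prefix means decimal, a 0x prefix means hexadecimal.
as_hex() { printf '0x%x\n' "$1"; }

as_hex 13000000      # decimal input, prints 0xc65d40
as_hex 0x13000000    # hex input, prints 0x13000000
```

Under that assumption, a read of "13000000" without the prefix would have targeted 0xc65d40 rather than the BAR.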

Thanks,
Manikanta
Comment 19 NicolasChauvet 2020-04-08 21:57:52 UTC
(In reply to Manikanta Maddireddy from comment #18)
> Hi Nicolas,
> 
> These addresses are in hex, could you please read the BAR with below
> commands.
> # devmem2 0x13000000 w
Error mapping (1) : Operation not permitted
> # devmem2 0x20000000 w
FBC4D000

But the Error is reproducible both before and after the AER error is issued.


> Also dump lspci output.
> # lspci -xxxx > lspci_dump.txt
Uploaded lspci-jetson-tk1-after_aer.txt

I've installed l4t 21.8, and I was able to boot using the l4t kernel. But I cannot boot to userspace with an upstream kernel. Extlinux seems to hand over to the kernel, but I get no serial console output and the boot hangs at some point. (My config relies on having an initramfs, if relevant.)

When the same kernel is booted using a 2020-04-rc4 upstream u-boot, I can boot any upstream kernel normally. (but then I have this AER issue only with 5.3+ kernels).
Comment 20 NicolasChauvet 2020-04-08 21:58:18 UTC
Created attachment 288291 [details]
lspci-jetson-tk1-after_aer.txt
Comment 21 Manikanta Maddireddy 2020-04-09 10:51:42 UTC
Hi Nicolas,

BAR & config registers are accessible after AER error, so PCIe link is stable. Can you toggle the eth interface down & up and see if Ethernet recovers. We may face same issue again, but this tells us PCIe link status.

>When the same kernel is booted using a 2020-04-rc4 upstream u-boot, I can boot
>>any upstream kernel normally. (but then I have this AER issue only with 5.3+
>>kernels).
My patch series is merged in 5.2. Can you bisect my changes and identify which patch caused the issue? AER will be printed only if its capability is enabled(which is present in same series), so you have to look for "NETDEV WATCHDOG" warning log.

>I've installed l4t 21.8, and I was able to boot using the l4t kernel. But I
>>cannot boot to userspace with an upstream kernel.
Please follow below instructions to boot upstream kernel,
 make ARCH=arm tegra_defconfig
 make ARCH=arm zImage -j8
 make ARCH=arm dtbs
 make ARCH=arm modules -j8
 rm -rf modules_install
 DESTDIR=./modules_install make ARCH=arm modules_install INSTALL_MOD_PATH=./modules_install

Check extlinux.conf to find out Linux and FDT image path on target. Copy compiled images to these locations.

Thanks,
Manikanta
Comment 22 NicolasChauvet 2020-04-16 14:01:22 UTC
> Can you toggle the eth interface down & up and see if Ethernet recovers.
No, the interface does not recover. I'm using nmcli, so I can disconnect but cannot connect again, whether or not I unload/reload the r8169 kernel module:

# nmcli  dev disconnect enp1s0
Device 'enp1s0' successfully disconnected.
# rmmod r8169
# modprobe r8169
# nmcli  dev connect enp1s0
Error: nmcli terminated by signal Interrupt (2)
[  303.201616] r8169 0000:01:00.0 enp1s0: Link is Down
[  306.531104] libphy: r8169: probed
[  306.532306] r8169 0000:01:00.0 eth0: RTL8168g/8111g, 00:04:4b:25:b7:bc, XID 4c0, IRQ 396
[  306.532320] r8169 0000:01:00.0 eth0: jumbo features [frames: 9200 bytes, tx checksumming: ko]
[  306.556824] r8169 0000:01:00.0 enp1s0: renamed from eth0
[  306.608853] Generic FE-GE Realtek PHY r8169-100:00: attached PHY driver [Generic FE-GE Realtek PHY] (mii_bus:phy_addr=r8169-100:00, irq=IGNORE)
[  306.709227] r8169 0000:01:00.0 enp1s0: Link is Down
[  332.346444] pcieport 0000:00:02.0: AER: Uncorrected (Non-Fatal) error received: 0000:01:00.0
[  332.347117] r8169 0000:01:00.0: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[  332.347175] r8169 0000:01:00.0: AER:   device [10ec:8168] error status/mask=00004000/00400000
[  332.347287] r8169 0000:01:00.0: AER:    [14] CmpltTO                (First)
[  332.353136] pcieport 0000:00:02.0: AER: Device recovery failed
[  338.474594] pcieport 0000:00:02.0: AER: Uncorrected (Non-Fatal) error received: 0000:01:00.0
[  338.474621] r8169 0000:01:00.0: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[  338.474629] r8169 0000:01:00.0: AER:   device [10ec:8168] error status/mask=00004000/00400000
[  338.474638] r8169 0000:01:00.0: AER:    [14] CmpltTO                (First)
[  338.474756] pcieport 0000:00:02.0: AER: Device recovery failed

> My patch series is merged in 5.2.
Can you remind your patchset or git tree ?

> Please follow below instructions to boot upstream kernel,
I'm using a config close to tegra_defconfig already. But the default tegra_defconfig isn't able to boot a fedora userspace well.
I would like to avoid this path as this is trying to solve another issue.
(and upstream u-boot is able to boot fedora kernels and upstream correctly).
Comment 23 NicolasChauvet 2020-04-16 15:13:48 UTC
> Can you remind your patchset or git tree ?
Here is the git log for pci-tegra.c since 5.2.
I'm going to revert these commits on top of a current kernel and verify that I can no longer reproduce the issue. If it is not reproducible, I should be able to bisect.

94e99b194e5f PCI: tegra: Use pci_parse_request_of_pci_ranges()
45586c7078d4 treewide: remove redundant IS_ERR() before error code check
21a92676e1fe PCI: tegra: Fix afi_pex2_ctrl reg offset for Tegra30
885199148442 PCI: tegra: Fix return value check of pm_runtime_get_sync()
9e38e690ace3 PCI: tegra: Fix OF node reference leak
7be142caabc4 PCI: tegra: Enable Relaxed Ordering only for Tegra20 & Tegra30
4b16a8227907 PCI: tegra: Change link retry log level to debug
dbdcc22c845b PCI: tegra: Add support for GPIO based PERST#
2d8c7361585f PCI: tegra: Put PEX CLK & BIAS pads in DPD mode
adb2653b3d2e PCI: tegra: Add AFI_PEX2_CTRL reg offset as part of SoC struct
c894121d0142 PCI: tegra: Change PRSNT_SENSE IRQ log to debug
b5b4717ea0dd PCI: tegra: Program AFI_CACHE_BAR_{0,1}_{ST,SZ} registers only for Tegra20
eef4a3502661 PCI: tegra: Fix PLLE power down issue due to CLKREQ# signal
c23ae2aec5bc PCI: tegra: Set target speed as Gen1 before starting LTSSM
9f570b6c240e PCI: tegra: Update flow control timer frequency in Tegra210
191cd6fb5d2c PCI: tegra: Add SW fixup for RAW violations
b2634cd0d26d PCI: tegra: Increase the deskew retry time
f1178099a6e4 PCI: tegra: Enable PCIe xclk clock clamping
52db2fd89e1a PCI: tegra: Process pending DLL transactions before entering L1 or L2
92bd94f1fdde PCI: tegra: Disable AFI dynamic clock gating
7763cc24e210 PCI: tegra: Enable opportunistic UpdateFC and ACK
2513a4ee4735 PCI: tegra: Program UPHY electrical settings for Tegra210
c635a815c8c7 PCI: tegra: Advertise PCIe Advanced Error Reporting (AER) capability
538123a29aeb PCI: tegra: Add PCIe Gen2 link speed support
d1f9113faf8a PCI: tegra: Fix PCIe host power up sequence
316b9ef1ee14 PCI: tegra: Mask AFI_INTR in runtime suspend
973d7499c51c PCI: tegra: Rearrange Tegra PCIe driver functions
1056dda8a8d6 PCI: tegra: Handle failure cases in tegra_pcie_power_on()
Comment 24 NicolasChauvet 2020-04-18 14:47:56 UTC
A quick update on this.

I've made a first pass over all of these commits without reproducing any IP error.
One thing I changed in my test was to copy the large file to /tmp (a tmpfs) instead of the regular MMC (I'm now using the external MMC).

It seems that copying over ethernet and using the MMC at the same time is needed for a reproducer. I've also blacklisted nouveau, but that seems to have no effect.
Comment 25 NicolasChauvet 2020-04-18 15:10:41 UTC
Okay, here is an interesting result:


git bisect start '--term-old=unfixed' '--term-new=fixed'
# fixed: [d7eb396957281bdbdee5f1552132f12a3f2b218d] Revert "PCI: tegra: Handle failure cases in tegra_pcie_power_on()"
git bisect fixed d7eb396957281bdbdee5f1552132f12a3f2b218d
# unfixed: [05512d55221c5735b20fccf9d588d525c4c190f3] Revert "PCI: tegra: Fix return value check of pm_runtime_get_sync()"
git bisect unfixed 05512d55221c5735b20fccf9d588d525c4c190f3
# fixed: [e78232ad2780c0d5cfec2df357ce78796205f067] Revert "PCI: tegra: Add SW fixup for RAW violations"
git bisect fixed e78232ad2780c0d5cfec2df357ce78796205f067
# unfixed: [4dc36b05e9eef5ae30d2ae72b298837fbff50819] Revert "PCI: tegra: Add AFI_PEX2_CTRL reg offset as part of SoC struct"
git bisect unfixed 4dc36b05e9eef5ae30d2ae72b298837fbff50819
# unfixed: [4272f4dce96eb71e207bf380a3b0516e305eea87] Revert "PCI: tegra: Fix PLLE power down issue due to CLKREQ# signal"
git bisect unfixed 4272f4dce96eb71e207bf380a3b0516e305eea87
# unfixed: [eb5846b38a56b333808a47da747d67d9c635dc7d] Revert "PCI: tegra: Update flow control timer frequency in Tegra210"
git bisect unfixed eb5846b38a56b333808a47da747d67d9c635dc7d
# first fixed commit: [e78232ad2780c0d5cfec2df357ce78796205f067] Revert "PCI: tegra: Add SW fixup for RAW violations"


In other words, 191cd6fb5d2cf184a3010c55ca290b2fe5a3d727 is the first commit that breaks for me.

Trying to revert it on top of current to see if everything is back to normal...
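
The search above can be pictured as a binary search over the stack of reverts. The sketch below is a simulation only, not the procedure actually run: positions count from 0 in the newest-first commit list of comment 23, so position 3 is the revert of 885199148442, position 15 the revert of 191cd6fb5d2c, and position 27 the revert of 1056dda8a8d6. The is_fixed function is a hypothetical stand-in for booting the kernel and running the copy test.

```sh
#!/bin/sh
# Hypothetical test result: the problem disappears once the revert at
# position 15 (or any later one) is included in the tree.
is_fixed() { [ "$1" -ge 15 ]; }

# first_fixed <known_unfixed_pos> <known_fixed_pos>
# Binary search for the first position whose tree is "fixed".
# Prints every probe, then the result.
first_fixed() (
    lo=$(( $1 + 1 ))     # first candidate after the known-unfixed point
    hi=$2                # known-fixed point, also a candidate
    while [ "$lo" -lt "$hi" ]; do
        mid=$(( ($lo + $hi) / 2 ))
        if is_fixed "$mid"; then
            echo "probe $mid: fixed"
            hi=$mid
        else
            echo "probe $mid: unfixed"
            lo=$(( $mid + 1 ))
        fi
    done
    echo "first fixed: $lo"
)

first_fixed 3 27
```

Under these assumptions the probes land on positions 15, 9, 12 and 14, the same four reverts tested in the log above, and the search ends at position 15.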
Comment 26 NicolasChauvet 2020-04-18 16:21:38 UTC
Created attachment 288601 [details]
Revert raw_violation_fixup
Comment 27 NicolasChauvet 2020-04-18 16:23:10 UTC
I confirm that the revert fixes the issue for me.

I've uploaded the exact patch I've tested on top of the current linux-next.
Now I'm able to use ethernet to copy files onto the MMC without issue.
Comment 28 Manikanta Maddireddy 2020-04-20 04:06:13 UTC
Hi Nicolas,

I checked the downstream Tegra PCIe driver; it has the same programming sequence. I am not sure what is causing this issue with the upstream kernel. It might take a few weeks for me to start working on this issue. Please go ahead and publish the patch to the upstream mailing list.

I need a few more details for debugging. Some of them have already been shared in earlier comments, but please share the complete details again so that we have everything in a single comment.
- "sudo lspci -xxxx" output from your build. Please mention the kernel version and the Tegra PCIe driver top commit.
- Your setup details, both HW & SW.
- Steps to reproduce the issues.

Thanks,
Manikanta
Comment 29 NicolasChauvet 2020-04-20 16:07:44 UTC
Here are some info for a reproducer:


- I'm using jetson-tk1 with upstream u-boot 2020.04 (2019 versions also exhibit the issue; the l4t u-boot version cannot boot an upstream kernel/Fedora userspace for me.)
- Kernel/Userspace is Fedora 30 armhfp Minimal Spin 
https://download.fedoraproject.org/pub/fedora/linux/releases/30/Spins/armhfp/images/Fedora-Minimal-armhfp-30-1.2-sda.raw.xz
rootfs and /boot are ext4 /boot/efi is vfat
- Using the external MMC: sdcard is Lexar SDHCI I, x200 speed, level 8 grade.
- Last known good kernel is 5.2.18-200.fc30.armv7hl
(download with koji download-build kernel-5.2.18-200.fc30 --arch armv7hl)
- Tested bad kernel from 5.3+ is kernel-5.3.18-200.fc30.armv7hl
- Fedora kernels are built with all_modules (the notable exception is pci-tegra, which is builtin).
- Issue reproducible on the kernel-lpae Fedora variant also.
- Upstream kernel from linux next are either vmlinuz-5.6.0-rc7-next-20200325-tegra+ or vmlinuz-5.7.0-rc1-next-20200414-tegra+ (there everything is builtin).
- lspci before AER error: https://bugzilla.kernel.org/attachment.cgi?id=288285
- lspci after AER error: https://bugzilla.kernel.org/attachment.cgi?id=288291
- Bisection is from reverting the pci-tegra.c patches in reverse order on current linux-next. See https://bugzilla.kernel.org/show_bug.cgi?id=206217#c25
The bisection shows the first commit that triggers the error to be 191cd6fb5d2cf184a3010c55ca290b2fe5a3d727
"PCI: tegra: Add SW fixup for RAW violations"

To reproduce:
Using a 5.3+ kernel on Fedora 30 on external MMC, I can reproduce the issue by copying (using scp) a large file (800 MB) into /var/tmp of the jetson-tk1 (if the directory is a tmpfs the issue is not reproducible). It has to be on the MMC.
It's also possible to reproduce by using "dnf update" (as initially reported) but this depends on the updates available.
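
A minimal sketch of the copy-based reproduction step, for convenience. It only creates the test payload locally; the scp line is left as a comment because the user and host name are placeholders, not values from this report. make_payload is a made-up helper name.

```sh
#!/bin/sh
# make_payload <file> <size in MiB>: fill <file> with random data.
make_payload() {
    dd if=/dev/urandom of="$1" bs=1024 count=$(( $2 * 1024 )) 2>/dev/null
}

payload=$(mktemp)
make_payload "$payload" 1     # use 800 to match the size in the report
wc -c < "$payload"            # size in bytes of the generated file
# Copy to a directory backed by the MMC, not a tmpfs:
#   scp "$payload" user@jetson-tk1:/var/tmp/    (placeholder user and host)
rm -f "$payload"
```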
