Bug 199249 - random PCIe Bus Errors(AER can't find device of id0008) on new ryzen2200g build preceding hard disk/network/mouse freeze
Status: RESOLVED INVALID
Alias: None
Product: Drivers
Classification: Unclassified
Component: PCI
Hardware: All Linux
Importance: P1 normal
Assignee: io_other
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-03-30 16:59 UTC by Luca
Modified: 2018-04-06 11:25 UTC

See Also:
Kernel Version: 4.13.9
Subsystem:
Regression: No
Bisected commit-id:


Attachments
pciebus error crash liveusb fedora27 hard disk removed, usb flashdrive (2.33 MB, text/plain)
2018-03-30 16:59 UTC, Luca
lshw (25.69 KB, text/plain)
2018-03-31 16:06 UTC, Luca
lspci of 15:00.2 / 00:01.2 (1.60 KB, text/plain)
2018-03-31 16:07 UTC, Luca
Same crash log with ubuntu 4.15 and firmware downgraded (565.05 KB, application/zip)
2018-03-31 16:49 UTC, Luca

Description Luca 2018-03-30 16:59:22 UTC
Created attachment 275019
pciebus error crash liveusb fedora27 hard disk removed, usb flashdrive

Several times a day my new build freezes, usually after about 1 hour, but once it has happened the first time it becomes more frequent. I always keep a terminal open with journalctl -f, and before each crash I start seeing errors on the PCIe bus:

Mar 30 11:12:57 localhost-live kernel: pcieport 0000:00:01.2: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=000a(Transmitter ID)
Mar 30 11:12:57 localhost-live kernel: pcieport 0000:00:01.2:   device [1022:15d3] error status/mask=00001000/00006000

The errors are not limited to the Data Link Layer type.
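
For readers decoding the status/mask pair in the message above, here is a minimal sketch. It assumes that, because the message says severity=Corrected, the two hex words are the AER Correctable Error Status and Mask registers, and it uses the bit names from the PCIe specification (the same bit positions as the PCI_ERR_COR_* constants in the Linux headers). The helper names are hypothetical and not part of any tool.

```python
import re

# Correctable Error Status bit positions from the PCIe AER capability
# (same positions as PCI_ERR_COR_* in Linux's pci_regs.h).
COR_BITS = {
    0: "Receiver Error",
    6: "Bad TLP",
    7: "Bad DLLP",
    8: "REPLAY_NUM Rollover",
    12: "Replay Timer Timeout",
    13: "Advisory Non-Fatal Error",
    14: "Corrected Internal Error",
    15: "Header Log Overflow",
}


def decode_bits(value):
    """Return the names of all correctable-error bits set in value."""
    return [name for bit, name in sorted(COR_BITS.items()) if value & (1 << bit)]


def parse_status_mask(line):
    """Extract (status, mask) as integers from an AER log line, or None."""
    m = re.search(r"error status/mask=([0-9a-fA-F]{8})/([0-9a-fA-F]{8})", line)
    if m is None:
        return None
    return int(m.group(1), 16), int(m.group(2), 16)


# The line quoted in the report.
line = "pcieport 0000:00:01.2:   device [1022:15d3] error status/mask=00001000/00006000"
status, mask = parse_status_mask(line)
print("status:", decode_bits(status))
print("mask:  ", decode_bits(mask))
```

Under those assumptions the status word 00001000 has only bit 12 set (Replay Timer Timeout, a Data Link Layer error), and the mask 00006000 covers bits 13 and 14.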

[liveuser@localhost-live ~]$ lspci -d 1022:15d3
00:01.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] Device 15d3
00:01.6 PCI bridge: Advanced Micro Devices, Inc. [AMD] Device 15d3
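
The id= field in the AER message can be mapped back to a bus:device.function address. The following is a minimal sketch, assuming the usual 16-bit PCIe requester ID layout (8 bits bus, 5 bits device, 3 bits function); the function name is hypothetical.

```python
def decode_requester_id(req_id):
    """Split a 16-bit PCIe requester ID into lspci-style bus:device.function."""
    bus = (req_id >> 8) & 0xFF
    device = (req_id >> 3) & 0x1F
    function = req_id & 0x7
    return f"{bus:02x}:{device:02x}.{function:x}"


print(decode_requester_id(0x000A))  # id=000a from the log line -> 00:01.2
print(decode_requester_id(0x0008))  # id0008 from the bug title -> 00:01.0
```

With that layout, id=000a is 00:01.2, which matches the first bridge listed by lspci above.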

Within at most 3 minutes of those messages the mouse stops responding (the keyboard keeps working), the network becomes unavailable, then the hard disk gets stuck and is dropped. After that I only see PCIe bus errors on tty2, or the machine freezes completely. At that point I try to reboot with SysRq or I shut down the machine.

The attached log is from a Fedora 27 live image (kernel 4.13.9) booted over USB from my phone (USB Mountr, USB gadget), because otherwise the hard disk became unavailable and it was impossible to save the log.

I have removed the hard drive and the CD-ROM drive, and I don't have any PCI Express card installed. While the log was being recorded I had a USB flash drive attached, and it also became unavailable. I was able to save the log by opening my phone in Nautilus via mtp://.

The same happens with Ubuntu 18.04 on kernel 4.15, and also on 4.16-rc7, when I attach the hard drive. In that case, hidden among the pcieport errors, I also see 1 or 2 SATA hard resets and ata failed check power mode messages.

I have cleared the CMOS and removed the BIOS battery, but it still happens. I have also tried setting "Typical Current Idle" in the BIOS, since the PSU I bought is not certified for zero load (I didn't know about that requirement when I bought it).

If I boot with pcie_aspm=off the errors disappear but it still freezes. pci=noaer gives the same result, and so does pci=nommconf. With pci=nomsi I can't boot.
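
When comparing logs from boots with different parameters, it helps to record which of these workarounds were actually active. This is a minimal sketch that checks a kernel command line for the four options mentioned above; it assumes the documented form pci=option[,option...], and the function name is hypothetical.

```python
from pathlib import Path

# The workarounds tried in this report.
TRIED = {"pcie_aspm=off", "pci=noaer", "pci=nommconf", "pci=nomsi"}


def active_workarounds(cmdline):
    """Return the tried workarounds that appear in a kernel command line."""
    found = set()
    for token in cmdline.split():
        if token in TRIED:
            found.add(token)
        elif token.startswith("pci="):
            # pci= accepts a comma-separated list, e.g. pci=noaer,nommconf
            for opt in token[len("pci="):].split(","):
                if f"pci={opt}" in TRIED:
                    found.add(f"pci={opt}")
    return sorted(found)


if __name__ == "__main__":
    path = Path("/proc/cmdline")
    text = path.read_text() if path.exists() else ""
    print(active_workarounds(text))
```

Saving this list next to each captured log makes it clear which boot a given freeze belongs to.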

I've run memtest and mprime and they don't show any problems. Under high load the system seems stable, although one time it crashed after I stopped mprime. I have read many similar bug reports and tried many of the suggestions, but I seem to have multiple problems. I can't reproduce the problem on demand; sometimes it is fine when I leave the computer idle.

I don't have another PSU or another motherboard to test with, so it could be a hardware problem. I'll try to install Windows in the next few days to see if it is stable there.

The logs are full of pcieport errors, but there are other kinds of errors too. Filtering with grep -v pcieport, I can see these other errors:

Mar 30 11:15:08 localhost-live kernel: ------------[ cut here ]------------
Mar 30 11:15:08 localhost-live kernel: WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:316 dev_watchdog+0x214/0x220
Mar 30 11:15:08 localhost-live kernel: Modules linked in: vfat fat fuse nf_conntrack_netbios_ns nf_conntrack_broadcast xt_CT ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack libcrc32c iptable_mangle iptable_raw iptable_security ebtable_filter ebtables ip6table_filter ip6_tables snd_hda_codec_realtek snd_hda_codec_generic amdgpu snd_hda_codec_hdmi snd_hda_intel snd_hda_codec snd_hda_core edac_mce_amd snd_hwdep snd_seq kvm_amd i2c_algo_bit joydev kvm ttm snd_seq_device drm_kms_helper snd_pcm drm snd_timer irqbypass snd wmi_bmof soundcore tpm_tis tpm_tis_core tpm sp5100_tco wmi video i2c_piix4
Mar 30 11:15:08 localhost-live kernel:  shpchp acpi_cpufreq nls_utf8 isofs squashfs crct10dif_pclmul crc32_pclmul crc32c_intel 8021q ghash_clmulni_intel garp mrp stp llc uas r8169 mii usb_storage sunrpc scsi_transport_iscsi loop
Mar 30 11:15:08 localhost-live kernel: CPU: 0 PID: 0 Comm: swapper/0 Tainted: G        W       4.13.9-300.fc27.x86_64 #1
Mar 30 11:15:08 localhost-live kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./AB350M Pro4, BIOS P4.70 02/09/2018
Mar 30 11:15:08 localhost-live kernel: task: ffffffff95e10480 task.stack: ffffffff95e00000
Mar 30 11:15:08 localhost-live kernel: RIP: 0010:dev_watchdog+0x214/0x220
Mar 30 11:15:08 localhost-live kernel: RSP: 0018:ffff91b04f603e58 EFLAGS: 00010282
Mar 30 11:15:08 localhost-live kernel: RAX: 000000000000003c RBX: 0000000000000000 RCX: 0000000000000000
Mar 30 11:15:08 localhost-live kernel: RDX: 0000000000000000 RSI: 00000000000000f6 RDI: 0000000000000300
Mar 30 11:15:08 localhost-live kernel: RBP: ffff91b04f603e78 R08: 0000000000002977 R09: 0000000000000004
Mar 30 11:15:08 localhost-live kernel: R10: ffff91b04f603ed8 R11: 0000000000000001 R12: ffff91b0432e8000
Mar 30 11:15:08 localhost-live kernel: R13: 0000000000000000 R14: 0000000000000001 R15: ffff91b0432e8000
Mar 30 11:15:08 localhost-live kernel: FS:  0000000000000000(0000) GS:ffff91b04f600000(0000) knlGS:0000000000000000
Mar 30 11:15:08 localhost-live kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar 30 11:15:08 localhost-live kernel: CR2: 00007f5feab6b5c0 CR3: 00000003a63ff000 CR4: 00000000003406f0
Mar 30 11:15:08 localhost-live kernel: Call Trace:
Mar 30 11:15:08 localhost-live kernel:  <IRQ>
Mar 30 11:15:08 localhost-live kernel:  ? qdisc_rcu_free+0x50/0x50
Mar 30 11:15:08 localhost-live kernel:  call_timer_fn+0x33/0x130
Mar 30 11:15:08 localhost-live kernel:  run_timer_softirq+0x3f2/0x440
Mar 30 11:15:08 localhost-live kernel:  ? sched_clock+0x9/0x10
Mar 30 11:15:08 localhost-live kernel:  ? sched_clock+0x9/0x10
Mar 30 11:15:08 localhost-live kernel:  ? sched_clock_cpu+0x11/0xb0
Mar 30 11:15:08 localhost-live kernel:  __do_softirq+0xea/0x2bf
Mar 30 11:15:08 localhost-live kernel:  irq_exit+0xfb/0x100
Mar 30 11:15:08 localhost-live kernel:  smp_apic_timer_interrupt+0x3d/0x50
Mar 30 11:15:08 localhost-live kernel:  apic_timer_interrupt+0x93/0xa0
Mar 30 11:15:08 localhost-live kernel: RIP: 0010:cpuidle_enter_state+0x130/0x2d0
Mar 30 11:15:08 localhost-live kernel: RSP: 0018:ffffffff95e03dc8 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10
Mar 30 11:15:08 localhost-live kernel: RAX: ffff91b04f61a440 RBX: 00000084abfa5654 RCX: 000000000000001f
Mar 30 11:15:08 localhost-live kernel: RDX: 00000084abfa5654 RSI: fffffffb80b944a4 RDI: 0000000000000000
Mar 30 11:15:08 localhost-live kernel: RBP: ffffffff95e03e08 R08: 0000000000001a03 R09: 0000000000000007
Mar 30 11:15:08 localhost-live kernel: R10: ffffffff95e03d98 R11: 00000000000003e5 R12: ffff91b043894600
Mar 30 11:15:08 localhost-live kernel: R13: 0000000000000000 R14: 0000000000000002 R15: ffffffff95f7f4d8
Mar 30 11:15:08 localhost-live kernel:  </IRQ>
Mar 30 11:15:08 localhost-live kernel:  ? cpuidle_enter_state+0x120/0x2d0
Mar 30 11:15:08 localhost-live kernel:  cpuidle_enter+0x17/0x20
Mar 30 11:15:08 localhost-live kernel:  call_cpuidle+0x23/0x40
Mar 30 11:15:08 localhost-live kernel:  do_idle+0x18f/0x1e0
Mar 30 11:15:08 localhost-live kernel:  cpu_startup_entry+0x73/0x80
Mar 30 11:15:08 localhost-live kernel:  rest_init+0xbc/0xc0
Mar 30 11:15:08 localhost-live kernel:  start_kernel+0x4c5/0x4e6
Mar 30 11:15:08 localhost-live kernel:  ? early_idt_handler_array+0x120/0x120
Mar 30 11:15:08 localhost-live kernel:  x86_64_start_reservations+0x24/0x26
Mar 30 11:15:08 localhost-live kernel:  x86_64_start_kernel+0x13e/0x161
Mar 30 11:15:08 localhost-live kernel:  secondary_startup_64+0x9f/0x9f
Mar 30 11:15:08 localhost-live kernel: Code: 8c 24 60 04 00 00 eb 8f 4c 89 e7 c6 05 f3 b0 88 00 01 e8 70 68 fd ff 89 d9 48 89 c2 4c 89 e6 48 c7 c7 c8 0b d3 95 e8 6d 8d 98 ff <0f> ff eb c1 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 c7 47 08 
Mar 30 11:15:08 localhost-live kernel: ---[ end trace 467781cd51614a71 ]---
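
The grep -v pcieport filtering used to find the trace above can be sketched in a few lines, together with a tally of the AER messages by severity and type. This is only an illustration: the function names are hypothetical, and the sample lines are the ones quoted in this report.

```python
import re
from collections import Counter


def non_pcieport(lines):
    """Keep only lines that do not mention pcieport, like `grep -v pcieport`."""
    return [ln for ln in lines if "pcieport" not in ln]


def count_aer_types(lines):
    """Count AER messages per (severity, type) pair."""
    counts = Counter()
    for ln in lines:
        m = re.search(r"PCIe Bus Error: severity=([^,]+), type=([^,]+),", ln)
        if m:
            counts[(m.group(1), m.group(2))] += 1
    return counts


sample = [
    "Mar 30 11:12:57 localhost-live kernel: pcieport 0000:00:01.2: PCIe Bus Error: "
    "severity=Corrected, type=Data Link Layer, id=000a(Transmitter ID)",
    "Mar 30 11:12:57 localhost-live kernel: pcieport 0000:00:01.2:   device [1022:15d3] "
    "error status/mask=00001000/00006000",
    "Mar 30 11:15:08 localhost-live kernel: WARNING: CPU: 0 PID: 0 at "
    "net/sched/sch_generic.c:316 dev_watchdog+0x214/0x220",
]

for ln in non_pcieport(sample):
    print(ln)
print(dict(count_aer_types(sample)))
```

On a full log, the tally shows at a glance whether error types other than Data Link Layer are present.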

Hardware:
Motherboard: ASRock AB350M Pro4 - firmware 4.70
PSU: Seasonic S12II
RAM: 2x8 GB Kingston HyperX Fury 2666
CPU: Ryzen 3 2200G
Comment 1 Luca 2018-03-31 16:06:38 UTC
Created attachment 275027
lshw
Comment 2 Luca 2018-03-31 16:07:48 UTC
Created attachment 275029
lspci of 15:00.2 / 00:01.2
Comment 3 Luca 2018-03-31 16:49:46 UTC
Created attachment 275031
Same crash log with ubuntu 4.15 and firmware downgraded

I have tried downgrading the firmware to 4.50 (the only other firmware compatible with my CPU), and now the problem seems less aggressive. The computer still freezes, but after the AER PCIe bus errors appear I have more time to save the log or keep using the computer.
I have attached logs of 3 different instances of the problem. In the file named "lot of errors before it freezes", at the last line of the log I was switching to TTY2 and noticed a popup in GDM saying "no activity system is going to suspend soon". I had already disabled suspend, so it should not have said that. Screen suspension was still enabled in the control center, but the problem occurs while I'm actively using the computer. I have now disabled that too.

Could the trigger be a user-space daemon?

mar 31 17:34:20 ghv systemd[1]: Starting Daemon for power management..
mar 31 17:34:20 ghv dbus-daemon[1690]: [system] Successfully activated service 'org.freedesktop.UPower'
mar 31 17:34:05 ghv thermald[1788]: NO RAPL sysfs present 
mar 31 17:34:05 ghv thermald[1788]: Unsupported cpu model, use thermal-conf.xml file or run with --ignore-cpuid-check
mar 31 17:34:05 ghv thermald[1788]: THD engine start failed
Comment 4 Luca 2018-04-06 11:25:22 UTC
I RMA'd the motherboard. Sorry, it was probably a hardware defect.
