Created attachment 285977: backtrace

When using amdgpu with kernels 5.4-rc5 through 5.4-rc7, we are seeing invalid DMA under load with the Vega 64. This issue did not occur on 5.3 or earlier. The invalid DMA causes an EEH and knocks the GPU offline until a reboot. Trace attached.
Can you bisect?
I am travelling now but can bisect when back at the lab next week.
Just had a chance to test on 5.4.0; it still fails. (I haven't had a chance to bisect yet; I suspect it's more related to the 64-bit enablement on POWER in 5.4 than anything else.)

The EEH is quite strange. The PEST register decodes as:

MMIO CFG Read Other Transaction Type
An MMIO Load, MMIO I/O Write, or other transaction returned from the PCIe link with a status of Unsupported Request (UR)
Failure address: 0x000000000000

Full trace:

[20341.276752702,3] PHB#0033[8:3]: PHB Freeze/Fence detected ! [20341.276848173,3] PHB#0033[8:3]: PCI FIR=2000000000000000 [20341.276900504,3] PHB#0033[8:3]: PCI FIR WOF=2000000000000000 [20341.276939625,3] PHB#0033[8:3]: NEST FIR=0000800000000000 [20341.276979866,3] PHB#0033[8:3]: NEST FIR WOF=0000800000000000 [20341.277023394,3] PHB#0033[8:3]: ERR RPT0=0000000000000001 [20341.277068184,3] PHB#0033[8:3]: ERR RPT1=0000000000000000 [20341.277110812,3] PHB#0033[8:3]: AIB ERR=0000200000000000 [20341.277830701,3] PHB#0033[8:3]: brdgCtl = 00000002 [20341.277906614,3] PHB#0033[8:3]: deviceStatus = 00000020 [20341.277946469,3] PHB#0033[8:3]: slotStatus = 00402000 [20341.277981186,3] PHB#0033[8:3]: linkStatus = e9010008 [20341.278025974,3] PHB#0033[8:3]: devCmdStatus = 00100107 [20341.278068859,3] PHB#0033[8:3]: devSecStatus = 00000000 [20341.278109829,3] PHB#0033[8:3]: rootErrorStatus = 00000000 [20341.278149196,3] PHB#0033[8:3]: corrErrorStatus = 00000000 [20341.278190145,3] PHB#0033[8:3]: uncorrErrorStatus = 00000000 [20341.278223684,3] PHB#0033[8:3]: devctl = 00000020 [20341.278276525,3] PHB#0033[8:3]: devStat = 00000000 [20341.278314241,3] PHB#0033[8:3]: tlpHdr1 = 00000000 [20341.278356746,3] PHB#0033[8:3]: tlpHdr2 = 00000000 [20341.278397163,3] PHB#0033[8:3]: tlpHdr3 = 00000000 [20341.278440709,3] PHB#0033[8:3]: tlpHdr4 = 00000000 [20341.278478424,3] PHB#0033[8:3]: sourceId = 00000000 [20341.278516547,3] PHB#0033[8:3]: nFir = 0000800000000000 [20341.278555975,3] PHB#0033[8:3]: nFirMask = 0030001c00000000 [20341.278598653,3] PHB#0033[8:3]: 
nFirWOF = 0000800000000000 [20341.278642004,3] PHB#0033[8:3]: phbPlssr = 0000001800000000 [20341.278686870,3] PHB#0033[8:3]: phbCsr = 0000001800000000 [20341.278731874,3] PHB#0033[8:3]: lemFir = 0004000100000100 [20341.278776158,3] PHB#0033[8:3]: lemErrorMask = 0000000000000000 [20341.278815229,3] PHB#0033[8:3]: lemWOF = 0000000100000000 [20341.278857015,3] PHB#0033[8:3]: phbErrorStatus = 000005a000000000 [20341.278909821,3] PHB#0033[8:3]: phbFirstErrorStatus = 0000002000000000 [20341.278951950,3] PHB#0033[8:3]: phbErrorLog0 = 2148000098000240 [20341.278999524,3] PHB#0033[8:3]: phbErrorLog1 = a008400000000000 [20341.279042839,3] PHB#0033[8:3]: phbTxeErrorStatus = 0000200000000000 [20341.279081676,3] PHB#0033[8:3]: phbTxeFirstErrorStatus = 0000200000000000 [20341.279120945,3] PHB#0033[8:3]: phbTxeErrorLog0 = 4000000000000000 [20341.279160833,3] PHB#0033[8:3]: phbTxeErrorLog1 = 0000000000000000 [20341.279207802,3] PHB#0033[8:3]: phbRxeArbErrorStatus = 0000000000000000 [20341.279254658,3] PHB#0033[8:3]: phbRxeArbFrstErrorStatus = 0000000000000000 [20341.279297181,3] PHB#0033[8:3]: phbRxeArbErrorLog0 = 0000000000000000 [20341.279334227,3] PHB#0033[8:3]: phbRxeArbErrorLog1 = 0000000000000000 [20341.279376968,3] PHB#0033[8:3]: phbRxeMrgErrorStatus = 0000000000000001 [20341.279420726,3] PHB#0033[8:3]: phbRxeMrgFrstErrorStatus = 0000000000000001 [20341.279469009,3] PHB#0033[8:3]: phbRxeMrgErrorLog0 = 0000000000000000 [20341.279512839,3] PHB#0033[8:3]: phbRxeMrgErrorLog1 = 0000000000000000 [20341.279561496,3] PHB#0033[8:3]: phbRxeTceErrorStatus = 0000000000000000 [20341.279604696,3] PHB#0033[8:3]: phbRxeTceFrstErrorStatus = 0000000000000000 [20341.279645952,3] PHB#0033[8:3]: phbRxeTceErrorLog0 = 0000000000000000 [20341.279685644,3] PHB#0033[8:3]: phbRxeTceErrorLog1 = 0000000000000000 [20341.279731458,3] PHB#0033[8:3]: phbPblErrorStatus = 0000000000000800 [20341.279778323,3] PHB#0033[8:3]: phbPblFirstErrorStatus = 0000000000000800 [20341.279825433,3] PHB#0033[8:3]: 
phbPblErrorLog0 = 0000000000000000 [20341.279866852,3] PHB#0033[8:3]: phbPblErrorLog1 = 00000000028de410 [20341.279903104,3] PHB#0033[8:3]: phbPcieDlpErrorLog1 = 0000000000000000 [20341.279942888,3] PHB#0033[8:3]: phbPcieDlpErrorLog2 = 0000000000000000 [20341.279984925,3] PHB#0033[8:3]: phbPcieDlpErrorStatus = 0000000000000000 [20341.280033282,3] PHB#0033[8:3]: phbRegbErrorStatus = 0010001000000000 [20341.280080310,3] PHB#0033[8:3]: phbRegbFirstErrorStatus = 0000001000000000 [20341.280126330,3] PHB#0033[8:3]: phbRegbErrorLog0 = 4800003c00000000 [20341.280173657,3] PHB#0033[8:3]: phbRegbErrorLog1 = 0000000000000200 [20341.280218925,3] PHB#0033[8:3]: PEST[1ff] = 3740002a01000000 0000000000000000 [ 1580.231935] EEH: PHB#33 failure detected, location: N/A [ 1580.231958] EEH: Frozen PHB#33-PE#0 detected [ 1580.231969] EEH: Call Trace: [ 1580.231983] EEH: [00000000741e7c92] __eeh_send_failure_event+0x78/0x150 [ 1580.232006] EEH: [0000000019c0a3ea] eeh_dev_check_failure+0x1d8/0x6b0 [ 1580.232019] EEH: [00000000d1114f7e] eeh_check_failure+0x98/0x100 [ 1580.232080] EEH: [0000000026fdad67] amdgpu_mm_rreg+0x20c/0x250 [amdgpu] [ 1580.232134] EEH: [0000000087736ee4] vi_flush_hdp+0xa0/0xc0 [amdgpu] [ 1580.232191] EEH: [000000000b00465e] amdgpu_gart_bind+0x78/0x140 [amdgpu] [ 1580.232247] EEH: [00000000e410157a] amdgpu_ttm_gart_bind+0x124/0x140 [amdgpu] [ 1580.232295] EEH: [0000000027696b17] amdgpu_ttm_alloc_gart+0x19c/0x230 [amdgpu] [ 1580.232350] EEH: [00000000abff626d] amdgpu_vm_sdma_map_table+0x4c/0x70 [amdgpu] [ 1580.232411] EEH: [000000003babc62e] amdgpu_vm_clear_bo+0x188/0x460 [amdgpu] [ 1580.232460] EEH: [000000003135d9d5] amdgpu_vm_update_ptes+0x300/0x5f0 [amdgpu] [ 1580.232513] EEH: [00000000a9b62a4c] amdgpu_vm_bo_update_mapping+0x100/0x140 [amdgpu] [ 1580.232565] EEH: [00000000c53ee852] amdgpu_vm_bo_update+0x348/0x8a0 [amdgpu] [ 1580.232614] EEH: [00000000e468e987] amdgpu_gem_va_ioctl+0x5c4/0x620 [amdgpu] [ 1580.232644] EEH: [000000002c0a19e7] 
drm_ioctl_kernel+0xfc/0x180 [drm] [ 1580.232671] EEH: [000000005cb0f244] drm_ioctl+0x238/0x480 [drm] [ 1580.232725] EEH: [00000000b812c3a6] amdgpu_drm_ioctl+0x70/0xd0 [amdgpu] [ 1580.232749] EEH: [000000004de566d7] do_vfs_ioctl+0xe0/0xac0 [ 1580.232770] EEH: [0000000045206404] ksys_ioctl+0xc4/0x110 [ 1580.232782] EEH: [000000001e273b3a] sys_ioctl+0x28/0x80 [ 1580.232804] EEH: [00000000aa248bf4] system_call+0x5c/0x68 [ 1580.232834] EEH: This PCI device has failed 1 times in the last hour and will be permanently disabled after 5 failures. [ 1580.232880] EEH: Notify device drivers to shutdown [ 1580.232911] EEH: Beginning: 'error_detected(IO frozen)' [ 1580.232933] PCI 0033:00:00.0#01fe: EEH: no driver [ 1580.232935] PCI 0033:01:00.0#0000: EEH: driver not EEH aware [ 1580.232957] PCI 0033:01:00.1#0000: EEH: driver not EEH aware [ 1580.232970] EEH: Finished:'error_detected(IO frozen)' with aggregate recovery state:'none' [ 1580.232998] EEH: Collect temporary log [ 1580.233008] PHB4 PHB#51 Diag-data (Version: 1) [ 1580.233018] brdgCtl: 00000002 [ 1580.233028] RootSts: 00000020 00402000 e9010008 00100107 00000000 [ 1580.233040] nFir: 0000800000000000 0030001c00000000 0000800000000000 [ 1580.233062] PhbSts: 0000001800000000 0000001800000000 [ 1580.233082] Lem: 0004000100000100 0000000000000000 0000000100000000 [ 1580.233104] PhbErr: 000005a000000000 0000002000000000 2148000098000240 a008400000000000 [ 1580.233136] PhbTxeErr: 0000200000000000 0000200000000000 4000000000000000 0000000000000000 [ 1580.233169] RxeMrgErr: 0000000000000001 0000000000000001 0000000000000000 0000000000000000 [ 1580.233192] PblErr: 0000000000000800 0000000000000800 0000000000000000 00000000028de410 [ 1580.233225] RegbErr: 0010001000000000 0000001000000000 4800003c00000000 0000000000000200 [ 1580.233259] EEH: Reset with hotplug activity [ 1580.891352] snd_hda_codec_hdmi hdaudioC0D0: Unable to sync register 0x2f0d00. 
-5 [ 1590.340025] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=7463, emitted seq=7465 [ 1590.340117] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process pid 0 thread pid 0 [ 1590.340172] amdgpu 0033:01:00.0: GPU reset begin! [ 1590.350000] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=325761, emitted seq=325763 [ 1590.350057] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process hyperspace pid 4160 thread hyperspace:cs0 pid 4161 [ 1590.350089] amdgpu 0033:01:00.0: GPU reset begin! [ 1590.350108] [drm] Bailing on TDR for s_job:4f608, as another already in progress [ 1590.350923] amdgpu: [powerplay] [ 1590.350923] last message was failed ret is 65535 [ 1590.350949] amdgpu: [powerplay] [ 1590.350949] failed to send message 261 ret is 65535 [ 1590.350971] amdgpu: [powerplay] [ 1590.350971] last message was failed ret is 65535 [ 1590.350983] amdgpu: [powerplay] [ 1590.350983] failed to send message 261 ret is 65535 [ 1590.350996] amdgpu: [powerplay] [ 1590.350996] last message was failed ret is 65535 [ 1590.351017] amdgpu: [powerplay] [ 1590.351017] failed to send message 261 ret is 65535 [ 1590.351030] amdgpu: [powerplay] [ 1590.351030] last message was failed ret is 65535 [ 1590.351064] amdgpu: [powerplay] [ 1590.351064] failed to send message 261 ret is 65535 [ 1590.351096] amdgpu: [powerplay] [ 1590.351096] last message was failed ret is 65535 [ 1590.351127] amdgpu: [powerplay] [ 1590.351127] failed to send message 261 ret is 65535 [ 1590.351158] amdgpu: [powerplay] [ 1590.351158] last message was failed ret is 65535 [ 1590.351202] amdgpu: [powerplay] [ 1590.351202] failed to send message 261 ret is 65535 [ 1590.351224] amdgpu: [powerplay] [ 1590.351224] last message was failed ret is 65535 [ 1590.351236] amdgpu: [powerplay] [ 1590.351236] failed to send message 261 ret is 65535 [ 1590.351251] amdgpu: [powerplay] [ 1590.351251] last message was failed ret is 65535 [ 
1590.351272] amdgpu: [powerplay] [ 1590.351272] failed to send message 261 ret is 65535 [ 1590.351303] amdgpu: [powerplay] [ 1590.351303] last message was failed ret is 65535 [ 1590.351324] amdgpu: [powerplay] [ 1590.351324] failed to send message 261 ret is 65535 [ 1590.351356] amdgpu: [powerplay] [ 1590.351356] last message was failed ret is 65535 [ 1590.351378] amdgpu: [powerplay] [ 1590.351378] failed to send message 261 ret is 65535 [ 1590.351410] amdgpu: [powerplay] [ 1590.351410] last message was failed ret is 65535 [ 1590.351441] amdgpu: [powerplay] [ 1590.351441] failed to send message 261 ret is 65535 [ 1590.351463] amdgpu: [powerplay] [ 1590.351463] last message was failed ret is 65535 [ 1590.351485] amdgpu: [powerplay] [ 1590.351485] failed to send message 261 ret is 65535 [ 1590.351520] amdgpu: [powerplay] [ 1590.351520] last message was failed ret is 65535 [ 1590.351541] amdgpu: [powerplay] [ 1590.351541] failed to send message 261 ret is 65535 [ 1590.351572] amdgpu: [powerplay] [ 1590.351572] last message was failed ret is 65535 [ 1590.351603] amdgpu: [powerplay] [ 1590.351603] failed to send message 261 ret is 65535 [ 1590.351634] amdgpu: [powerplay] [ 1590.351634] last message was failed ret is 65535 [ 1590.351666] amdgpu: [powerplay] [ 1590.351666] failed to send message 261 ret is 65535 [ 1590.351698] amdgpu: [powerplay] [ 1590.351698] last message was failed ret is 65535 [ 1590.351730] amdgpu: [powerplay] [ 1590.351730] failed to send message 261 ret is 65535 [ 1590.351761] amdgpu: [powerplay] [ 1590.351761] last message was failed ret is 65535 [ 1590.351795] amdgpu: [powerplay] [ 1590.351795] failed to send message 261 ret is 65535 [ 1590.351980] amdgpu: [powerplay] [ 1590.351980] last message was failed ret is 65535 [ 1590.352014] amdgpu: [powerplay] [ 1590.352014] failed to send message 306 ret is 65535 [ 1590.352039] amdgpu: [powerplay] [ 1590.352039] last message was failed ret is 65535 [ 1590.352080] amdgpu: [powerplay] [ 1590.352080] 
failed to send message 5e ret is 65535 [ 1590.352103] amdgpu: [powerplay] [ 1590.352103] last message was failed ret is 65535 [ 1590.352134] amdgpu: [powerplay] [ 1590.352134] failed to send message 145 ret is 65535 [ 1590.352156] amdgpu: [powerplay] [ 1590.352156] last message was failed ret is 65535 [ 1590.352190] amdgpu: [powerplay] [ 1590.352190] failed to send message 146 ret is 65535 [ 1590.352225] amdgpu: [powerplay] [ 1590.352225] last message was failed ret is 65535 [ 1590.352271] amdgpu: [powerplay] [ 1590.352271] failed to send message 148 ret is 65535 [ 1590.352292] amdgpu: [powerplay] [ 1590.352292] last message was failed ret is 65535 [ 1590.352304] amdgpu: [powerplay] [ 1590.352304] failed to send message 145 ret is 65535 [ 1590.352339] amdgpu: [powerplay] [ 1590.352339] last message was failed ret is 65535 [ 1590.352370] amdgpu: [powerplay] [ 1590.352370] failed to send message 146 ret is 65535 [ 1590.383835] [drm] REG_WAIT timeout 10us * 3000 tries - dce110_stream_encoder_dp_blank line:956 [ 1590.383875] ------------[ cut here ]------------ [ 1590.383912] WARNING: CPU: 48 PID: 1214 at drivers/gpu/drm/amd/amdgpu/../display/dc/dc_helper.c:332 generic_reg_wait+0x214/0x230 [amdgpu] [ 1590.383945] Modules linked in: i2c_dev uinput amdgpu snd_usb_audio drm_vram_helper snd_usbmidi_lib gpu_sched ttm snd_rawmidi snd_seq_device ses mc drm_kms_helper snd_hda_codec_hdmi enclosure joydev sd_mod evdev scsi_transport_sas drm snd_hda_intel sg snd_hda_codec drm_panel_orientation_quirks snd_hda_core syscopyarea sysfillrect ecb snd_hwdep aacraid sysimgblt fb_sys_fops snd_pcm nvme nvme_core xts i2c_algo_bit snd_timer snd soundcore ctr cbc ofpart vmx_crypto ipmi_powernv ipmi_devintf powernv_flash gf128mul mtd ipmi_msghandler opal_prd at24 binfmt_misc parport_pc lp parport ip_tables x_tables autofs4 nfsv3 nfs_acl nfs lockd grace sunrpc fscache raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor hid_generic usbhid hid raid6_pq libcrc32c 
crc32c_generic raid1 raid0 multipath linear md_mod xhci_pci xhci_hcd usbcore tg3 libphy [ 1590.384181] CPU: 48 PID: 1214 Comm: kworker/48:2 Not tainted 5.4.0 #5 [ 1590.384194] Workqueue: events drm_sched_job_timedout [gpu_sched] [ 1590.384205] NIP: c00800000888505c LR: c00800000888504c CTR: c000000000715d70 [ 1590.384238] REGS: c0000007dd55ec40 TRAP: 0700 Not tainted (5.4.0) [ 1590.384257] MSR: 9000000002029033 <SF,HV,VEC,EE,ME,IR,DR,RI,LE> CR: 28224228 XER: 00000000 [ 1590.384284] CFAR: c0000000001b66f4 IRQMASK: 0 [ 1590.384284] GPR00: c00800000888504c c0000007dd55eed0 c0080000089f5000 0000000000000052 [ 1590.384284] GPR04: c0000007fdd1ce18 c0000007fdda5858 0000000000000490 c0000007fffc9000 [ 1590.384284] GPR08: 0000000000000007 0000000000000000 00000007fced0000 9000000002001033 [ 1590.384284] GPR12: 0000000000004000 c0000007fffc9000 c000200715000000 c0000007eff449c0 [ 1590.384284] GPR16: c0000007dc7a6000 c0000007def45300 0000000000000000 00000000000003bc [ 1590.384284] GPR20: c0080000088f6470 0000000000000000 0000000000004ea4 0000000000010000 [ 1590.384284] GPR24: 0000000000000000 c00800000890ca90 c0000007a9e40680 0000000000000bb8 [ 1590.384284] GPR28: 0000000000000010 0000000000000bb8 000000000000000a 0000000000000bb9 [ 1590.384414] NIP [c00800000888505c] generic_reg_wait+0x214/0x230 [amdgpu] [ 1590.384450] LR [c00800000888504c] generic_reg_wait+0x204/0x230 [amdgpu] [ 1590.384467] Call Trace: [ 1590.384499] [c0000007dd55eed0] [c00800000888504c] generic_reg_wait+0x204/0x230 [amdgpu] (unreliable) [ 1590.384548] [c0000007dd55efa0] [c00800000882caec] dce110_stream_encoder_dp_blank+0x104/0x170 [amdgpu] [ 1590.384601] [c0000007dd55f030] [c00800000885a07c] dce110_blank_stream+0xf4/0x120 [amdgpu] [ 1590.384632] [c0000007dd55f060] [c0080000088743bc] core_link_disable_stream+0x64/0x420 [amdgpu] [ 1590.384692] [c0000007dd55f140] [c008000008857dbc] dce110_reset_hw_ctx_wrap+0xf4/0x2e0 [amdgpu] [ 1590.384745] [c0000007dd55f200] [c00800000885a2e0] 
dce110_apply_ctx_to_hw+0x58/0x600 [amdgpu] [ 1590.384797] [c0000007dd55f2d0] [c00800000886dcec] dc_commit_state+0x3d4/0x820 [amdgpu] [ 1590.384853] [c0000007dd55f400] [c0080000087fe94c] amdgpu_dm_atomic_commit_tail+0x3c4/0x19a8 [amdgpu] [ 1590.384888] [c0000007dd55f700] [c008000007d93fb0] commit_tail+0xf8/0x1f0 [drm_kms_helper] [ 1590.384912] [c0000007dd55f740] [c008000007d942a8] drm_atomic_helper_commit+0x1e0/0x1f0 [drm_kms_helper] [ 1590.384951] [c0000007dd55f780] [c0080000087fbac8] amdgpu_dm_atomic_commit+0x110/0x140 [amdgpu] [ 1590.384992] [c0000007dd55f7e0] [c0080000079ce2cc] drm_atomic_commit+0x74/0xa0 [drm] [ 1590.385016] [c0000007dd55f850] [c008000007d94768] drm_atomic_helper_disable_all+0x290/0x2b0 [drm_kms_helper] [ 1590.385044] [c0000007dd55f8a0] [c008000007d949dc] drm_atomic_helper_suspend+0x154/0x1a0 [drm_kms_helper] [ 1590.385094] [c0000007dd55f920] [c0080000087f717c] dm_suspend+0x44/0xa0 [amdgpu] [ 1590.385124] [c0000007dd55f950] [c008000008621e2c] amdgpu_device_ip_suspend_phase1+0xe4/0x190 [amdgpu] [ 1590.385163] [c0000007dd55f9d0] [c008000008623ddc] amdgpu_device_ip_suspend+0x44/0xe0 [amdgpu] [ 1590.385192] [c0000007dd55fa10] [c00800000888de54] amdgpu_device_pre_asic_reset+0x248/0x28c [amdgpu] [ 1590.385230] [c0000007dd55fab0] [c00800000888e7b8] amdgpu_device_gpu_recover+0x2f0/0xb4c [amdgpu] [ 1590.385268] [c0000007dd55fb90] [c008000008779f3c] amdgpu_job_timedout+0x124/0x170 [amdgpu] [ 1590.385290] [c0000007dd55fc30] [c008000007651244] drm_sched_job_timedout+0x6c/0x110 [gpu_sched] [ 1590.385336] [c0000007dd55fc70] [c000000000154ee0] process_one_work+0x260/0x520 [ 1590.385379] [c0000007dd55fd10] [c000000000155228] worker_thread+0x88/0x5f0 [ 1590.385400] [c0000007dd55fdb0] [c00000000015f21c] kthread+0x19c/0x1b0 [ 1590.385430] [c0000007dd55fe20] [c00000000000bd54] ret_from_kernel_thread+0x5c/0x68 [ 1590.385463] Instruction dump: [ 1590.385480] 4bfffed4 3c620000 e8633ab8 7e679b78 7e86a378 7f65db78 7fc4f378 4800f091 [ 1590.385513] e8410018 813a0020 
2f890001 419eff7c <0fe00000> 4bffff74 60000000 60000000 [ 1590.385546] ---[ end trace 59567a2f8b8649ed ]--- [ 1591.478349] PCI 0033:01:00.0#0000: EEH: 2100000 reads ignored for recovering device at location=CPU2 Slot1 (16x) driver=amdgpu [ 1591.478370] PCI 0033:01:00.0#0000: EEH: Might be infinite loop in amdgpu driver [ 1591.478382] CPU: 48 PID: 1214 Comm: kworker/48:2 Tainted: G W 5.4.0 #5 [ 1591.478405] Workqueue: events drm_sched_job_timedout [gpu_sched] [ 1591.478414] Call Trace: [ 1591.478422] [c0000007dd55e940] [c000000000a9ccc8] dump_stack+0xbc/0x104 (unreliable) [ 1591.478434] [c0000007dd55e980] [c00000000003e788] eeh_dev_check_failure+0x598/0x6b0 [ 1591.478455] [c0000007dd55ea30] [c00000000003eb08] eeh_check_failure+0x98/0x100 [ 1591.478491] [c0000007dd55ea70] [c008000008622744] amdgpu_mm_rreg+0x20c/0x250 [amdgpu] [ 1591.478539] [c0000007dd55eac0] [c0080000086298f4] cail_reg_read+0x2c/0x50 [amdgpu] [ 1591.478577] [c0000007dd55eae0] [c00800000863255c] atom_get_src_int+0x104/0xa00 [amdgpu] [ 1591.478615] [c0000007dd55eb90] [c008000008633e30] atom_op_test+0xd8/0x1d0 [amdgpu] [ 1591.478660] [c0000007dd55ec20] [c008000008636a2c] amdgpu_atom_execute_table_locked+0x204/0x3e0 [amdgpu] [ 1591.478701] [c0000007dd55ed20] [c008000008636d30] atom_op_calltable+0x128/0x1e0 [amdgpu] [ 1591.478740] [c0000007dd55eda0] [c008000008636a2c] amdgpu_atom_execute_table_locked+0x204/0x3e0 [amdgpu] [ 1591.478770] [c0000007dd55eea0] [c008000008636e58] amdgpu_atom_execute_table+0x70/0xb0 [amdgpu] [ 1591.478829] [c0000007dd55eee0] [c008000008810f30] transmitter_control_v1_6+0x128/0x220 [amdgpu] [ 1591.478887] [c0000007dd55ef40] [c00800000880c410] bios_parser_transmitter_control+0x38/0x70 [amdgpu] [ 1591.478944] [c0000007dd55ef60] [c00800000882f678] dce110_link_encoder_disable_output+0xd0/0x1c0 [amdgpu] [ 1591.478997] [c0000007dd55f020] [c00800000887cbfc] dp_disable_link_phy+0xa4/0x1d0 [amdgpu] [ 1591.479029] [c0000007dd55f060] [c008000008874488] core_link_disable_stream+0x130/0x420 
[amdgpu] [ 1591.479082] [c0000007dd55f140] [c008000008857dbc] dce110_reset_hw_ctx_wrap+0xf4/0x2e0 [amdgpu] [ 1591.479134] [c0000007dd55f200] [c00800000885a2e0] dce110_apply_ctx_to_hw+0x58/0x600 [amdgpu] [ 1591.479186] [c0000007dd55f2d0] [c00800000886dcec] dc_commit_state+0x3d4/0x820 [amdgpu] [ 1591.479241] [c0000007dd55f400] [c0080000087fe94c] amdgpu_dm_atomic_commit_tail+0x3c4/0x19a8 [amdgpu] [ 1591.479280] [c0000007dd55f700] [c008000007d93fb0] commit_tail+0xf8/0x1f0 [drm_kms_helper] [ 1591.479325] [c0000007dd55f740] [c008000007d942a8] drm_atomic_helper_commit+0x1e0/0x1f0 [drm_kms_helper] [ 1591.479381] [c0000007dd55f780] [c0080000087fbac8] amdgpu_dm_atomic_commit+0x110/0x140 [amdgpu] [ 1591.479419] [c0000007dd55f7e0] [c0080000079ce2cc] drm_atomic_commit+0x74/0xa0 [drm] [ 1591.479445] [c0000007dd55f850] [c008000007d94768] drm_atomic_helper_disable_all+0x290/0x2b0 [drm_kms_helper] [ 1591.479484] [c0000007dd55f8a0] [c008000007d949dc] drm_atomic_helper_suspend+0x154/0x1a0 [drm_kms_helper] [ 1591.479542] [c0000007dd55f920] [c0080000087f717c] dm_suspend+0x44/0xa0 [amdgpu] [ 1591.479589] [c0000007dd55f950] [c008000008621e2c] amdgpu_device_ip_suspend_phase1+0xe4/0x190 [amdgpu] [ 1591.479640] [c0000007dd55f9d0] [c008000008623ddc] amdgpu_device_ip_suspend+0x44/0xe0 [amdgpu] [ 1591.479674] [c0000007dd55fa10] [c00800000888de54] amdgpu_device_pre_asic_reset+0x248/0x28c [amdgpu] [ 1591.479712] [c0000007dd55fab0] [c00800000888e7b8] amdgpu_device_gpu_recover+0x2f0/0xb4c [amdgpu] [ 1591.479769] [c0000007dd55fb90] [c008000008779f3c] amdgpu_job_timedout+0x124/0x170 [amdgpu] [ 1591.479815] [c0000007dd55fc30] [c008000007651244] drm_sched_job_timedout+0x6c/0x110 [gpu_sched] [ 1591.479860] [c0000007dd55fc70] [c000000000154ee0] process_one_work+0x260/0x520 [ 1591.479903] [c0000007dd55fd10] [c000000000155228] worker_thread+0x88/0x5f0 [ 1591.479923] [c0000007dd55fdb0] [c00000000015f21c] kthread+0x19c/0x1b0 [ 1591.479953] [c0000007dd55fe20] [c00000000000bd54] 
ret_from_kernel_thread+0x5c/0x68 [ 1592.584699] PCI 0033:01:00.0#0000: EEH: 4200000 reads ignored for recovering device at location=CPU2 Slot1 (16x) driver=amdgpu [ 1592.584723] PCI 0033:01:00.0#0000: EEH: Might be infinite loop in amdgpu driver
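For anyone who wants to read the trace above more comfortably, here is a minimal Python sketch that splits a flattened log back into one entry per timestamp and pulls the PHB register dump into a dictionary. The helper names (`split_entries`, `phb_registers`) and both regular expressions are my own assumptions based on the line shapes visible in this report; they are not part of any kernel or firmware tooling.

```python
import re

# Matches firmware-style stamps such as "[20341.276752702,3]" and
# Linux dmesg stamps such as "[ 1580.231935]". Bracketed tokens without
# a decimal point (e.g. "[amdgpu]", "[8:3]", "[c0000007dd55eed0]") do not match.
STAMP = re.compile(r"\[\s*\d+\.\d+(?:,\d+)?\]")

# Matches register lines of the form "PHB#0033[8:3]: name = value".
PHB_REG = re.compile(r"PHB#[0-9A-Fa-f]+\[\d+:\d+\]:\s*(.+?)\s*=\s*(.+)$")


def split_entries(blob):
    """Return one string per timestamped entry in a flattened log.

    Any text before the first timestamp is dropped.
    """
    starts = [m.start() for m in STAMP.finditer(blob)]
    ends = starts[1:] + [len(blob)]
    return [blob[b:e].strip() for b, e in zip(starts, ends)]


def phb_registers(entries):
    """Collect register name -> raw value text from PHB dump entries."""
    regs = {}
    for entry in entries:
        m = PHB_REG.search(entry)
        if m:
            regs[m.group(1)] = m.group(2).strip()
    return regs


# Three entries copied from the trace in this report.
SAMPLE = (
    "[20341.277830701,3] PHB#0033[8:3]: brdgCtl = 00000002 "
    "[20341.277906614,3] PHB#0033[8:3]: deviceStatus = 00000020 "
    "[ 1580.231935] EEH: PHB#33 failure detected, location: N/A"
)

entries = split_entries(SAMPLE)
print(len(entries))
print(phb_registers(entries))
```

Values are kept as raw text rather than converted to integers because some entries (the PEST line, for example) carry two words after the equals sign.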
Stack decodes to:

arch/powerpc/include/asm/eeh.h:403 [if (EEH_POSSIBLE_ERROR(val, u32))]
drivers/gpu/drm/amd/amdgpu/vi.c:913 [RREG32(mmHDP_MEM_COHERENCY_FLUSH_CNTL)]
drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c:340 [amdgpu_asic_flush_hdp(adev, NULL)]
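To illustrate why the decode lands on the `EEH_POSSIBLE_ERROR(val, u32)` line: as far as I recall from mainline, that macro compares the value just read with the all-ones pattern for the given type (and also checks that EEH is enabled), since a frozen PE returns all ones on MMIO loads. The sketch below models only the all-ones comparison; the function name `looks_like_frozen_read` is hypothetical and the EEH-enabled check is left out.

```python
def looks_like_frozen_read(value, width_bits):
    """Return True when every bit of a width_bits-wide read is set.

    Simplified model of the all-ones test; the real kernel macro is
    assumed to additionally require that EEH is enabled.
    """
    all_ones = (1 << width_bits) - 1
    return (value & all_ones) == all_ones


print(looks_like_frozen_read(0xFFFFFFFF, 32))  # a 32-bit read of all ones
print(looks_like_frozen_read(0x00000020, 32))  # an ordinary register value
print(looks_like_frozen_read(65535, 16))       # 0xFFFF seen as a 16-bit field
```

The last call is only an observation: the repeated `ret is 65535` in the powerplay messages equals 0xFFFF, which would be consistent with reads from a frozen device, though nothing in this report confirms how that value is produced.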
This doesn't look related to the first one. The first one is a Vega 10 ASIC according to the description; the second one is from an older VI ASIC. mmHDP_MEM_COHERENCY_FLUSH_CNTL is a register that the driver uses to flush and invalidate the cache on the framebuffer BAR (for CPU access to the framebuffer). This particular code path has been in the driver for years.
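The mismatch is visible in the trace itself through the `vi_flush_hdp` frame. A rough Python sketch of that kind of sanity check follows; it scans a trace for amdgpu symbols and reports which ASIC family their prefixes hint at. The prefix table is an assumption: only the `vi_` to VI association comes from this report, while the `soc15_` entry is my guess based on mainline file naming, and `asic_hints` is a hypothetical helper. Checking lspci remains the reliable way to identify the card.

```python
import re

# Hypothetical prefix table. Only "vi_" -> VI is supported by this report;
# "soc15_" -> Vega-class parts is an assumption from mainline file names.
PREFIX_HINTS = {
    "vi_": "VI (e.g. Polaris)",
    "soc15_": "SOC15 (e.g. Vega 10)",
}

# Matches frames such as "vi_flush_hdp+0xa0/0xc0 [amdgpu]".
SYMBOL = re.compile(r"\b([a-z0-9_]+)\+0x[0-9a-f]+/0x[0-9a-f]+ \[amdgpu\]")


def asic_hints(trace):
    """Return the set of ASIC families hinted at by amdgpu symbol prefixes."""
    hints = set()
    for symbol in SYMBOL.findall(trace):
        for prefix, family in PREFIX_HINTS.items():
            if symbol.startswith(prefix):
                hints.add(family)
    return hints


# Two frames copied from the EEH call trace in this report.
TRACE = (
    "[ 1580.232080] EEH: [0000000026fdad67] amdgpu_mm_rreg+0x20c/0x250 [amdgpu] "
    "[ 1580.232134] EEH: [0000000087736ee4] vi_flush_hdp+0xa0/0xc0 [amdgpu]"
)

print(asic_hints(TRACE))
```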
Yes, my fault, sorry about that -- it was a different box which, unbeknownst to me, had a different GPU (note to self: check lspci before decoding a trace next time). To top it off, this particular fault seems to be caused by a faulty GPU -- letting it cool overnight fixes the problem temporarily. I still need to verify whether the Vega fails on 5.4.0, as one of the patches leading up to 5.4.0 resolved the similar software lockup I had been seeing on this Polaris card.
Thus far I have not been able to reproduce on 5.4.0 stable. At this point I'm going to assume it was fixed somewhere in the rc merge process.