Bug 200139 - amdgpu lockup after resume from sleep
Summary: amdgpu lockup after resume from sleep
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: Video(DRI - non Intel)
Hardware: Intel Linux
Importance: P1 normal
Assignee: drivers_video-dri
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-06-19 13:05 UTC by Joern Hoffmann
Modified: 2020-07-10 16:20 UTC (History)

See Also:
Kernel Version: 4.17.2
Subsystem:
Regression: No
Bisected commit-id:


Attachments
HWInfo (519.29 KB, text/plain)
2018-06-19 13:05 UTC, Joern Hoffmann

Description Joern Hoffmann 2018-06-19 13:05:24 UTC
Created attachment 276689
HWInfo

I have observed a GPU lockup when the system resumes from sleep. The duration of the sleep doesn't matter. The problem occurs every time the system is put to sleep.

I was able to narrow the problem down a little. When I switch to the console and then put the system to sleep, the system comes up properly (with a warning trace in an amdgpu function). If I then switch back to the login manager or to the desktop, the GPU faults and eventually hangs. See logs below.

I can also reproduce the problem with kernel 4.16.13. Furthermore, it doesn't matter whether amdgpu.dc is enabled or disabled.

System
----------
Linux 4.17.2
Debian Unstable
X.Org 1.20
Mesa 18.1.1
Radeon RX 580 Series (POLARIS10, DRM 3.25.0, 4.17.2, LLVM 6.0.0)
CPU Intel Core i7-8700K
MB Asus Prime Z370-A


Kernel log after the resume from console:
-----------------------------------------
Jun 19 14:24:39 moc kernel: sd 0:0:0:0: [sda] Starting disk
Jun 19 14:24:39 moc kernel: [drm] PCIE GART of 256M enabled (table at 0x000000F400040000).
Jun 19 14:24:39 moc kernel: WARNING: CPU: 7 PID: 28047 at drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:725 amdgpu_dm_display_resume+0x213/0x220 [amdgpu]
Jun 19 14:24:39 moc kernel: Modules linked in: vmnet(OE) vmw_vsock_vmci_transport(E) vsock(E) vmw_vmci(E) vmmon(OE) fuse(E) joydev(E) hid_cherry(E) hid_generic(E) usbhid(E) hid(E) intel_rapl(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) coretemp(E) kvm_intel(E) kvm(E) ir
Jun 19 14:24:39 moc kernel:  asus_wmi(E) evdev(E) efi_pstore(E) intel_uncore(E) sparse_keymap(E) wmi_bmof(E) mxm_wmi(E) i2c_algo_bit(E) rfkill(E) sg(E) intel_rapl_perf(E) iTCO_wdt(E) efivars(E) snd(E) mei_me(E) iTCO_vendor_support(E) soundcore(E) mei(E) shpchp(E) wmi(E) v
Jun 19 14:24:39 moc kernel:  btrfs(E) zstd_decompress(E) zstd_compress(E) xxhash(E) raid10(E) raid456(E) async_raid6_recov(E) async_memcpy(E) async_pq(E) async_xor(E) async_tx(E) xor(E) raid6_pq(E) libcrc32c(E) crc32c_generic(E) raid1(E) raid0(E) multipath(E) linear(E) md
Jun 19 14:24:39 moc kernel: CPU: 7 PID: 28047 Comm: kworker/u24:7 Tainted: G           OE     4.17.2 #1
Jun 19 14:24:39 moc kernel: Hardware name: System manufacturer System Product Name/PRIME Z370-A, BIOS 0805 05/18/2018
Jun 19 14:24:39 moc kernel: Workqueue: events_unbound async_run_entry_fn
Jun 19 14:24:39 moc kernel: RIP: 0010:amdgpu_dm_display_resume+0x213/0x220 [amdgpu]
Jun 19 14:24:39 moc kernel: RSP: 0000:ffffaadd4447fd60 EFLAGS: 00010202
Jun 19 14:24:39 moc kernel: RAX: 0000000000000002 RBX: ffff96d7a48b0000 RCX: 0000000000000006
Jun 19 14:24:39 moc kernel: RDX: 0000000000000006 RSI: ffff96d6915a2c80 RDI: ffff96d7898f7800
Jun 19 14:24:39 moc kernel: RBP: ffff96d79fb9d800 R08: 0000000000000000 R09: ffffffffc14a7174
Jun 19 14:24:39 moc kernel: R10: ffffe4dea0a9a840 R11: 0000000000000001 R12: 0000000000000000
Jun 19 14:24:39 moc kernel: R13: ffff96d7a5e43800 R14: ffff96d7a9ca8d40 R15: ffffffffb4695dbb
Jun 19 14:24:39 moc kernel: FS:  0000000000000000(0000) GS:ffff96d7ae3c0000(0000) knlGS:0000000000000000
Jun 19 14:24:39 moc kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 19 14:24:39 moc kernel: CR2: 0000000000000000 CR3: 00000003aa80a001 CR4: 00000000003606e0
Jun 19 14:24:39 moc kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jun 19 14:24:39 moc kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Jun 19 14:24:39 moc kernel: Call Trace:
Jun 19 14:24:39 moc kernel:  amdgpu_device_ip_resume_phase2+0x45/0xb0 [amdgpu]
Jun 19 14:24:39 moc kernel:  amdgpu_device_resume+0xbf/0x380 [amdgpu]
Jun 19 14:24:39 moc kernel:  ? pci_pm_freeze+0xd0/0xd0
Jun 19 14:24:39 moc kernel:  ? pci_pm_freeze+0xd0/0xd0
Jun 19 14:24:39 moc kernel:  dpm_run_callback+0x4d/0x130
Jun 19 14:24:39 moc kernel:  device_resume+0x97/0x190
Jun 19 14:24:39 moc kernel:  async_resume+0x19/0x40
Jun 19 14:24:39 moc kernel:  async_run_entry_fn+0x39/0x160
Jun 19 14:24:39 moc kernel:  process_one_work+0x17b/0x360
Jun 19 14:24:39 moc kernel:  worker_thread+0x2e/0x390
Jun 19 14:24:39 moc kernel:  ? process_one_work+0x360/0x360
Jun 19 14:24:39 moc kernel:  kthread+0x113/0x130
Jun 19 14:24:39 moc kernel:  ? kthread_create_worker_on_cpu+0x70/0x70
Jun 19 14:24:39 moc kernel:  ret_from_fork+0x35/0x40
Jun 19 14:24:39 moc kernel: Code: 00 7f ac 48 89 ef e8 dd df a5 ff 48 c7 83 90 aa 00 00 00 00 00 00 89 c5 48 89 df e8 c8 17 00 00 89 e8 5b 5d 41 5c 41 5d 41 5e c3 <0f> 0b e9 48 ff ff ff 0f 0b eb a5 66 90 0f 1f 44 00 00 53 48 89 
Jun 19 14:24:39 moc kernel: ---[ end trace c39336409cdb2ae3 ]---
Jun 19 14:24:39 moc kernel: [drm] UVD and UVD ENC initialized successfully.
Jun 19 14:24:39 moc kernel: ixgbe 0000:03:00.0: Multiqueue Enabled: Rx Queue count = 12, Tx Queue count = 12 XDP Queue count = 0


Log after switching to X11
---------------------------
Jun 19 14:29:13 moc kernel: amdgpu 0000:01:00.0: GPU fault detected: 147 0x0a304401
Jun 19 14:29:13 moc kernel: amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x08404D46
Jun 19 14:29:13 moc kernel: amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x08044001
Jun 19 14:29:13 moc kernel: amdgpu 0000:01:00.0: VM fault (0x01, vmid 4, pasid 0) at page 138431814, read from 'TC5' (0x54433500) (68)
Jun 19 14:29:13 moc kernel: amdgpu 0000:01:00.0: GPU fault detected: 146 0x0000480c
Jun 19 14:29:13 moc kernel: amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x08404D46
Jun 19 14:29:13 moc kernel: amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x08044001
Jun 19 14:29:13 moc kernel: amdgpu 0000:01:00.0: VM fault (0x01, vmid 4, pasid 0) at page 138431814, read from 'TC5' (0x54433500) (68)
Jun 19 14:29:14 moc kernel: amdgpu 0000:01:00.0: GPU fault detected: 147 0x0a304401
Jun 19 14:29:14 moc kernel: amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x08404D46
Jun 19 14:29:14 moc kernel: amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x08044001
Jun 19 14:29:14 moc kernel: amdgpu 0000:01:00.0: VM fault (0x01, vmid 4, pasid 0) at page 138431814, read from 'TC5' (0x54433500) (68)
Jun 19 14:29:14 moc kernel: amdgpu 0000:01:00.0: GPU fault detected: 147 0x0a304401
Jun 19 14:29:14 moc kernel: amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x08404D46
Jun 19 14:29:14 moc kernel: amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x08044001
Jun 19 14:29:14 moc kernel: amdgpu 0000:01:00.0: VM fault (0x01, vmid 4, pasid 0) at page 138431814, read from 'TC5' (0x54433500) (68)
Jun 19 14:29:14 moc kernel: amdgpu 0000:01:00.0: GPU fault detected: 146 0x0000480c
Jun 19 14:29:14 moc kernel: amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0E40C60C
Jun 19 14:29:14 moc kernel: amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x08048001
Jun 19 14:29:14 moc kernel: amdgpu 0000:01:00.0: VM fault (0x01, vmid 4, pasid 0) at page 239126028, read from 'TC4' (0x54433400) (72)
Jun 19 14:29:14 moc kernel: amdgpu 0000:01:00.0: GPU fault detected: 146 0x0000480c
Jun 19 14:29:14 moc kernel: amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
Jun 19 14:29:14 moc kernel: amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0804800C
Jun 19 14:29:14 moc kernel: amdgpu 0000:01:00.0: VM fault (0x0c, vmid 4, pasid 0) at page 0, read from 'TC4' (0x54433400) (72)
Jun 19 14:29:14 moc kernel: amdgpu 0000:01:00.0: GPU fault detected: 147 0x0a304401
Jun 19 14:29:14 moc kernel: amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x08404D46
Jun 19 14:29:14 moc kernel: amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x08044001
Jun 19 14:29:14 moc kernel: amdgpu 0000:01:00.0: VM fault (0x01, vmid 4, pasid 0) at page 138431814, read from 'TC5' (0x54433500) (68)
Jun 19 14:29:14 moc kernel: amdgpu 0000:01:00.0: GPU fault detected: 146 0x0000480c
Jun 19 14:29:14 moc kernel: amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
Jun 19 14:29:14 moc kernel: amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0804800C
Jun 19 14:29:14 moc kernel: amdgpu 0000:01:00.0: VM fault (0x0c, vmid 4, pasid 0) at page 0, read from 'TC4' (0x54433400) (72)
Jun 19 14:29:14 moc kernel: amdgpu 0000:01:00.0: GPU fault detected: 147 0x0a304401
Jun 19 14:29:14 moc kernel: amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x08404D46
Jun 19 14:29:14 moc kernel: amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x08044001
Jun 19 14:29:14 moc kernel: amdgpu 0000:01:00.0: VM fault (0x01, vmid 4, pasid 0) at page 138431814, read from 'TC5' (0x54433500) (68)
Jun 19 14:29:14 moc kernel: amdgpu 0000:01:00.0: GPU fault detected: 147 0x0a304401
Jun 19 14:29:14 moc kernel: amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x08404D46
Jun 19 14:29:14 moc kernel: amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x08044001
Jun 19 14:29:14 moc kernel: amdgpu 0000:01:00.0: VM fault (0x01, vmid 4, pasid 0) at page 138431814, read from 'TC5' (0x54433500) (68)
Jun 19 14:29:24 moc kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, last signaled seq=384604, last emitted seq=384605
Jun 19 14:29:24 moc kernel: [drm] IP block:gfx_v8_0 is hung!
Jun 19 14:29:24 moc kernel: [drm] GPU recovery disabled.
-- Reboot --
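
For anyone comparing these faults with other reports: the numbers in each "VM fault" line appear to be derived from the two register dumps printed just before it. The sketch below is only an illustration of that relationship. The bit positions (protections in bits 0-7, memory client ID in bits 12-19, read/write flag in bit 24, VMID in bits 25-28) are an assumption inferred from the values in this log, not taken from the register specification, and decode_vm_fault is a hypothetical helper name.

```python
def decode_vm_fault(addr, status, mc_client):
    """Decode one fault using bit positions inferred from this log (assumed)."""
    return {
        # The driver prints the fault address register as a decimal page number.
        "page": addr,
        # Low byte of the status register: 0x01 or 0x0c in this log.
        "protections": status & 0xFF,
        # Trailing number in the log line: (68) or (72).
        "memory_client_id": (status >> 12) & 0xFF,
        # Assumed read/write flag; every fault in this log is a read.
        "access": "write" if (status >> 24) & 0x1 else "read",
        "vmid": (status >> 25) & 0xF,
        # The client word is four ASCII bytes, e.g. 0x54433500 -> "TC5".
        "client": mc_client.to_bytes(4, "big").rstrip(b"\x00").decode("ascii"),
    }


if __name__ == "__main__":
    # Values copied from the first fault logged after switching to X11.
    print(decode_vm_fault(0x08404D46, 0x08044001, 0x54433500))
```

With the values from the log, address 0x08404D46 corresponds to page 138431814 and 0x0E40C60C to page 239126028, which matches the printed lines. The same few addresses from vmid 4 repeat until the gfx ring times out.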
Comment 1 Christian König 2018-06-19 13:24:48 UTC
Duplicate of bug https://bugzilla.kernel.org/show_bug.cgi?id=199959
