Bug 216173 - amdgpu [gfxhub] page fault (src_id:0 ring:173 vmid:1 pasid:32769, for process Xorg pid 2994 thread Xorg:cs0 pid 3237)
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: Video(DRI - non Intel)
Hardware: i386 Linux
Importance: P1 high
Assignee: drivers_video-dri
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2022-06-25 23:52 UTC by Witold Baryluk
Modified: 2022-06-29 13:09 UTC
CC List: 2 users

See Also:
Kernel Version: 5.19-rc3
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
lspci -vvv output (173.01 KB, text/plain)
2022-06-25 23:53 UTC, Witold Baryluk
dmesg-amdgpu-fail-5.19.0-rc3.txt (148.20 KB, text/plain)
2022-06-25 23:53 UTC, Witold Baryluk
dmesg 5.19-rc2 (163.45 KB, text/plain)
2022-06-25 23:54 UTC, Witold Baryluk
dpkg-l.txt (1.48 MB, text/plain)
2022-06-25 23:54 UTC, Witold Baryluk
config-5.19.0-rc3.txt (203.43 KB, text/plain)
2022-06-25 23:56 UTC, Witold Baryluk
inxi-full-systeminfo.txt (7.04 KB, text/plain)
2022-06-25 23:57 UTC, Witold Baryluk

Description Witold Baryluk 2022-06-25 23:52:53 UTC
This appears to be a regression in 5.19-rc3 (rc2 is also affected; I didn't test anything earlier in the 5.19 cycle). 5.18.7 works fine. Both are custom builds. There are also no issues on 5.18.0.

Debian, amd64.

44:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 21 [Radeon RX 6800/6800 XT / 6900 XT] (rev c0)

CPU: AMD Threadripper 2950X, stock
Memory: 8x32GB ECC
Motherboard: MSI MEG Creation X399


Booting looks fine, but when the Xorg server starts, the screen looks corrupted, and within seconds it freezes and stops responding.

Dmesg output:


[  140.683672] amdgpu 0000:44:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:173 vmid:1 pasid:32769, for process Xorg pid 2994 thread Xorg:cs0 pid 3237)
[  140.683678] amdgpu 0000:44:00.0: amdgpu:   in page starting at address 0x0000800106ef5000 from client 0x1b (UTCL2)
[  140.683681] amdgpu 0000:44:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x0014115B
[  140.683682] amdgpu 0000:44:00.0: amdgpu:      Faulty UTCL2 client ID: TCP (0x8)
[  140.683684] amdgpu 0000:44:00.0: amdgpu:      MORE_FAULTS: 0x1
[  140.683685] amdgpu 0000:44:00.0: amdgpu:      WALKER_ERROR: 0x5
[  140.683686] amdgpu 0000:44:00.0: amdgpu:      PERMISSION_FAULTS: 0x5
[  140.683686] amdgpu 0000:44:00.0: amdgpu:      MAPPING_ERROR: 0x1
[  140.683687] amdgpu 0000:44:00.0: amdgpu:      RW: 0x1
...
[  151.015508] gmc_v10_0_process_interrupt: 699 callbacks suppressed
...
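
The decoded fields in the fault dump above can be reproduced from the raw GCVM_L2_PROTECTION_FAULT_STATUS value. Below is a minimal Python sketch of that decoding. The bit layout (MORE_FAULTS in bit 0, WALKER_ERROR in bits 1-3, PERMISSION_FAULTS in bits 4-7, MAPPING_ERROR in bit 8, CID in bits 9-17, RW in bit 18, VMID in bits 20-23) is an assumption about the GC 10.x register, not something stated in this report, and the names FIELDS and decode_fault_status are made up for illustration.

```python
# Assumed field layout of GCVM_L2_PROTECTION_FAULT_STATUS: name -> (shift, width).
# These bit positions are an assumption; verify them against the register headers.
FIELDS = {
    "MORE_FAULTS": (0, 1),
    "WALKER_ERROR": (1, 3),
    "PERMISSION_FAULTS": (4, 4),
    "MAPPING_ERROR": (8, 1),
    "CID": (9, 9),
    "RW": (18, 1),
    "VMID": (20, 4),
}


def decode_fault_status(status: int) -> dict:
    """Split the raw status word into its (assumed) bit fields."""
    return {
        name: (status >> shift) & ((1 << width) - 1)
        for name, (shift, width) in FIELDS.items()
    }


if __name__ == "__main__":
    # Raw value taken from the dmesg output in this report.
    for name, value in decode_fault_status(0x0014115B).items():
        print(f"{name}: {value:#x}")
```

With the value from this report (0x0014115B), the assumed layout yields the same numbers the driver printed: client ID 0x8, WALKER_ERROR 0x5, PERMISSION_FAULTS 0x5, MAPPING_ERROR 0x1, RW 0x1, and VMID 1.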


Eventually the GPU resets, but it is still not usable:

[  161.261520] amdgpu 0000:44:00.0: amdgpu: IH ring buffer overflow (0x0008D620, 0x00002680, 0x0000D640)
[  161.270648] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=100, emitted seq=103
[  161.270854] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 2994 thread Xorg:cs0 pid 3237
[  161.271004] amdgpu 0000:44:00.0: amdgpu: GPU reset begin!
[  161.830407] amdgpu 0000:44:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
[  161.830517] [drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* KGQ disable failed
[  162.084366] [drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* failed to halt cp gfx
[  162.101328] [drm] free PSP TMR buffer
[  162.149879] CPU: 15 PID: 188 Comm: kworker/u128:14 Tainted: G        W   E     5.19.0-rc3 #1
[  162.149883] Hardware name: Micro-Star International Co., Ltd. MS-7B92/MEG X399 CREATION (MS-7B92), BIOS 1.30 03/25/2019
[  162.149884] Workqueue: amdgpu-reset-dev drm_sched_job_timedout [gpu_sched]
[  162.149890] Call Trace:
[  162.149892]  <TASK>
[  162.149893]  dump_stack_lvl+0x34/0x45
[  162.149898]  amdgpu_do_asic_reset+0x1b/0x3db [amdgpu]
[  162.150047]  amdgpu_device_gpu_recover_imp.cold+0x57e/0x910 [amdgpu]
[  162.150194]  amdgpu_job_timedout+0x14b/0x180 [amdgpu]
[  162.150323]  ? finish_task_switch.isra.0+0x7d/0x270
[  162.150326]  drm_sched_job_timedout+0x5b/0xf0 [gpu_sched]
[  162.150330]  process_one_work+0x1ab/0x300
[  162.150332]  worker_thread+0x48/0x3c0
[  162.150334]  ? rescuer_thread+0x3c0/0x3c0
[  162.150336]  kthread+0xd1/0x100
[  162.150338]  ? kthread_complete_and_exit+0x20/0x20
[  162.150339]  ret_from_fork+0x1f/0x30
[  162.150342]  </TASK>
[  162.150351] amdgpu 0000:44:00.0: amdgpu: MODE1 reset
[  162.150354] amdgpu 0000:44:00.0: amdgpu: GPU mode1 reset
[  162.150417] amdgpu 0000:44:00.0: amdgpu: GPU smu mode1 reset
[  162.653371] amdgpu 0000:44:00.0: amdgpu: GPU reset succeeded, trying to resume
[  162.653516] [drm] PCIE GART of 512M enabled (table at 0x0000008000300000).
[  162.653537] [drm] VRAM is lost due to GPU reset!
[  162.653541] [drm] PSP is resuming...
[  162.834166] [drm] reserve 0xa00000 from 0x8001000000 for PSP TMR
[  162.948850] amdgpu 0000:44:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
[  162.948853] amdgpu 0000:44:00.0: amdgpu: SMU is resuming...
[  162.948884] amdgpu 0000:44:00.0: amdgpu: use vbios provided pptable
[  163.025704] amdgpu 0000:44:00.0: amdgpu: SMU is resumed successfully!
[  163.027473] [drm] DMUB hardware initialized: version=0x02020003
[  163.280274] [drm] kiq ring mec 2 pipe 1 q 0
[  163.284624] [drm] VCN decode and encode initialized successfully(under DPG Mode).
[  163.284906] [drm] JPEG decode initialized successfully.
[  163.284926] amdgpu 0000:44:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[  163.284928] amdgpu 0000:44:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[  163.284930] amdgpu 0000:44:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[  163.284931] amdgpu 0000:44:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0
[  163.284932] amdgpu 0000:44:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0
[  163.284934] amdgpu 0000:44:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
[  163.284935] amdgpu 0000:44:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
[  163.284936] amdgpu 0000:44:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0
[  163.284937] amdgpu 0000:44:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0
[  163.284938] amdgpu 0000:44:00.0: amdgpu: ring kiq_2.1.0 uses VM inv eng 11 on hub 0
[  163.284940] amdgpu 0000:44:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
[  163.284941] amdgpu 0000:44:00.0: amdgpu: ring sdma1 uses VM inv eng 13 on hub 0
[  163.284942] amdgpu 0000:44:00.0: amdgpu: ring sdma2 uses VM inv eng 14 on hub 0
[  163.284943] amdgpu 0000:44:00.0: amdgpu: ring sdma3 uses VM inv eng 15 on hub 0
[  163.284944] amdgpu 0000:44:00.0: amdgpu: ring vcn_dec_0 uses VM inv eng 0 on hub 1
[  163.284945] amdgpu 0000:44:00.0: amdgpu: ring vcn_enc_0.0 uses VM inv eng 1 on hub 1
[  163.284947] amdgpu 0000:44:00.0: amdgpu: ring vcn_enc_0.1 uses VM inv eng 4 on hub 1
[  163.284948] amdgpu 0000:44:00.0: amdgpu: ring vcn_dec_1 uses VM inv eng 5 on hub 1
[  163.284949] amdgpu 0000:44:00.0: amdgpu: ring vcn_enc_1.0 uses VM inv eng 6 on hub 1
[  163.284950] amdgpu 0000:44:00.0: amdgpu: ring vcn_enc_1.1 uses VM inv eng 7 on hub 1
[  163.284951] amdgpu 0000:44:00.0: amdgpu: ring jpeg_dec uses VM inv eng 8 on hub 1
[  163.292565] amdgpu 0000:44:00.0: amdgpu: recover vram bo from shadow start
[  163.292579] amdgpu 0000:44:00.0: amdgpu: recover vram bo from shadow done
[  163.292582] [drm] Skip scheduling IBs!
[  163.292583] [drm] Skip scheduling IBs!
[  163.292598] amdgpu 0000:44:00.0: amdgpu: GPU reset(3) succeeded!
[  163.292618] [drm] Skip scheduling IBs!
[  163.292626] [drm] Skip scheduling IBs!
[  163.292629] [drm] Skip scheduling IBs!
[  163.989966] usb usb8-port1: Cannot enable. Maybe the USB cable is bad?
[  166.265393] amdgpu_cs_ioctl: 3200 callbacks suppressed
[  166.265397] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  166.265812] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  166.282284] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  166.283327] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  171.486759] amdgpu_cs_ioctl: 65 callbacks suppressed
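
The negative numbers in the log above are Linux errno values. As a reading aid, here is a small Python sketch that pulls the code out of such a line and names it. The table covers only the two codes seen in this report (110 is ETIMEDOUT and 125 is ECANCELED on Linux), and ERRNO_NAMES, ERR_RE and error_name are hypothetical names used for illustration only.

```python
import re

# Linux errno values for the two codes that appear in the log above.
ERRNO_NAMES = {
    110: "ETIMEDOUT",  # operation timed out
    125: "ECANCELED",  # operation canceled
}

# Matches a negative code at the end of a line, as "(-110)" or " -125!".
ERR_RE = re.compile(r"(?:\((-\d+)\)|\s(-\d+)!)\s*$")


def error_name(line: str):
    """Return the symbolic errno for a dmesg line, or None if no code is found."""
    match = ERR_RE.search(line)
    if match is None:
        return None
    code = -int(match.group(1) or match.group(2))
    return ERRNO_NAMES.get(code, f"errno {code}")


if __name__ == "__main__":
    print(error_name("*ERROR* ring kiq_2.1.0 test failed (-110)"))
    print(error_name("*ERROR* Failed to initialize parser -125!"))
```

Read this way, the KIQ ring test timed out during the reset, and the later command submissions from Xorg were canceled rather than executed. That seems consistent with the session staying unusable after the reset reported success, although the log alone does not prove the cause.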

Comment 1 Witold Baryluk 2022-06-25 23:53:15 UTC
Created attachment 301271
lspci -vvv output

Comment 2 Witold Baryluk 2022-06-25 23:53:48 UTC
Created attachment 301272
dmesg-amdgpu-fail-5.19.0-rc3.txt

Comment 3 Witold Baryluk 2022-06-25 23:54:11 UTC
Created attachment 301273
dmesg 5.19-rc2

Comment 4 Witold Baryluk 2022-06-25 23:54:38 UTC
Created attachment 301274
dpkg-l.txt

Comment 5 Witold Baryluk 2022-06-25 23:56:18 UTC
Created attachment 301275
config-5.19.0-rc3.txt

Comment 6 Witold Baryluk 2022-06-25 23:57:10 UTC
Created attachment 301276
inxi-full-systeminfo.txt

Comment 7 Witold Baryluk 2022-06-26 00:06:06 UTC
The issue can happen even before logging in from the display manager to the desktop environment.

I use lightdm.

Comment 8 Witold Baryluk 2022-06-26 00:09:44 UTC
# cat /sys/kernel/debug/dri/0/amdgpu_firmware_info 
VCE feature version: 0, firmware version: 0x00000000
UVD feature version: 0, firmware version: 0x00000000
MC feature version: 0, firmware version: 0x00000000
ME feature version: 38, firmware version: 0x0000003e
PFP feature version: 38, firmware version: 0x00000056
CE feature version: 38, firmware version: 0x00000024
RLC feature version: 1, firmware version: 0x0000005b
RLC SRLC feature version: 0, firmware version: 0x00000000
RLC SRLG feature version: 0, firmware version: 0x00000000
RLC SRLS feature version: 0, firmware version: 0x00000000
MEC feature version: 38, firmware version: 0x00000058
MEC2 feature version: 38, firmware version: 0x00000058
SOS feature version: 0, firmware version: 0x00210862
ASD feature version: 553648218, firmware version: 0x2100005a
TA XGMI feature version: 0x00000000, firmware version: 0x2000000b
TA RAS feature version: 0x00000000, firmware version: 0x1b00012a
TA HDCP feature version: 0x00000000, firmware version: 0x1700001f
TA DTM feature version: 0x00000000, firmware version: 0x12000009
TA RAP feature version: 0x00000000, firmware version: 0x0700000e
TA SECUREDISPLAY feature version: 0x00000000, firmware version: 0x00000000
SMC feature version: 0, program: 0, firmware version: 0x003a4700 (58.71.0)
SDMA0 feature version: 52, firmware version: 0x0000004c
SDMA1 feature version: 52, firmware version: 0x0000004c
SDMA2 feature version: 52, firmware version: 0x0000004c
SDMA3 feature version: 52, firmware version: 0x0000004c
VCN feature version: 0, firmware version: 0x0210d02a
DMCU feature version: 0, firmware version: 0x00000000
DMCUB feature version: 0, firmware version: 0x02020003
TOC feature version: 0, firmware version: 0x00000000
VBIOS version: 113-69XB6SSB1-D01
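
The SMC line above carries both a raw hex value and a dotted version. The Python sketch below shows how the two relate, assuming the raw value packs major, minor and revision into bits 16-23, 8-15 and 0-7. That packing is an assumption, though it agrees with the "(58.71.0)" printed here, and SMC_RE and smc_version are hypothetical names for illustration.

```python
import re

# Matches the SMC line of amdgpu_firmware_info and captures the raw hex version.
SMC_RE = re.compile(
    r"^SMC feature version: \d+, program: \d+, "
    r"firmware version: (0x[0-9a-fA-F]+)"
)


def smc_version(line: str):
    """Return (major, minor, revision) from an SMC line, or None if it does not match."""
    match = SMC_RE.match(line)
    if match is None:
        return None
    raw = int(match.group(1), 16)
    # Assumed packing: one byte each for major, minor and revision.
    return (raw >> 16) & 0xFF, (raw >> 8) & 0xFF, raw & 0xFF


if __name__ == "__main__":
    line = "SMC feature version: 0, program: 0, firmware version: 0x003a4700 (58.71.0)"
    print(".".join(str(part) for part in smc_version(line)))
```

The same approach could be used to compare firmware versions between a working and a failing boot, since the other lines of the dump use a similar "firmware version: 0x..." suffix.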

Comment 9 Witold Baryluk 2022-06-29 00:03:30 UTC
Bisected:

c9cad937c0c58618fe5b0310fd539a854dc1ae95 is the first bad commit
commit c9cad937c0c58618fe5b0310fd539a854dc1ae95
Author: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com>
Date:   Fri Apr 8 04:18:43 2022 +0530

    drm/amdgpu: add drm buddy support to amdgpu
