Bug 205089 - amdgpu : drm:amdgpu_cs_ioctl : Failed to initialize parser -125
Status: RESOLVED CODE_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: Video(DRI - non Intel)
Hardware: All Linux
Importance: P1 normal
Assignee: drivers_video-dri
Reported: 2019-10-05 11:48 UTC by Bruno Jacquet
Modified: 2020-07-25 07:35 UTC
Kernel Version: 5.3.2
Tree: Mainline
Regression: No


Attachments
dmesg of fence timeout error (94.94 KB, text/plain)
2019-10-12 19:06 UTC, Bruno Jacquet

Description Bruno Jacquet 2019-10-05 11:48:18 UTC
Hello,

I am experiencing freezes with kernel 5.3.2 and amdgpu on a Vega 64 card.

This happens during games (I experience it on CS:GO) but it is a bit random and takes time to eventually trigger.
Once it triggers my dmesg is filled with errors:


[ 9156.537524] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[ 9156.747176] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[ 9156.747224] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[ 9156.883220] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[ 9156.883285] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!

When it happens, the image hangs and the PC is unresponsive. Sometimes I manage to switch to a TTY, but then the screen is corrupted.

HW:
- AMD Ryzen 2700X CPU
- AMD RX Vega 64

SW:
- Kernel 5.3.2
- Mesa 19.2.0
Comment 1 Alex Deucher 2019-10-07 03:16:20 UTC
The GPU has been reset, so you need to restart your desktop environment to continue. The error messages appear because the kernel is rejecting commands from userspace: applications need to recreate their contexts after a GPU reset. For this to work smoothly, things like desktop compositors would need to use the OpenGL robustness extensions and recreate their contexts after a GPU reset. Unfortunately, no desktop compositors do this at the moment.
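
A note for readers of the log: the -125 in the message is a negated Linux errno value. On Linux, errno 125 is ECANCELED ("Operation canceled"), which matches the explanation above that the kernel is rejecting submissions from contexts that predate the reset. Below is a minimal Python sketch that decodes such lines. The function name and the small errno table are illustrative assumptions, not part of any amdgpu tooling; the table hard-codes a few generic Linux errno values so the result does not depend on the platform the script runs on.

```python
import re

# Subset of generic Linux errno values, hard-coded on purpose so the
# sketch gives the same answer on non-Linux hosts.
LINUX_ERRNO = {
    12: "ENOMEM",
    22: "EINVAL",
    62: "ETIME",
    110: "ETIMEDOUT",
    125: "ECANCELED",
}

# Matches the error text printed by amdgpu_cs_ioctl in the logs above.
PARSER_ERR = re.compile(r"Failed to initialize parser (-?\d+)!")


def decode_parser_error(line):
    """Return the errno name for a 'Failed to initialize parser' line, or None."""
    match = PARSER_ERR.search(line)
    if match is None:
        return None
    code = abs(int(match.group(1)))
    return LINUX_ERRNO.get(code, f"unknown errno {code}")


line = "[ 9156.537524] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!"
print(decode_parser_error(line))  # prints "ECANCELED"
```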
Comment 2 Bruno Jacquet 2019-10-08 17:50:13 UTC
If I understand you right this means there is still another issue that caused the GPU reset. And this issue in particular is just a consequence of the reset not being properly handled?
Comment 3 Alex Deucher 2019-10-08 18:23:24 UTC
(In reply to Bruno Jacquet from comment #2)
> If I understand you right this means there is still another issue that
> caused the GPU reset. And this issue in particular is just a consequence of
> the reset not being properly handled?

The GPU reset succeeded. However, since the GPU has been reset, the contents of the memory (e.g., VRAM) that the application was using are undefined. So the application needs to use an API-level interface (e.g., the OpenGL robustness extensions or Vulkan's device-lost error, VK_ERROR_DEVICE_LOST) to query whether the GPU was reset and, if so, re-initialize its buffers.
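
The query-and-recreate pattern described above can be sketched roughly as follows. This is only a simulation of the control flow: FakeContext and run_frames are hypothetical names, and no real GL binding is used. A real application would call glGetGraphicsResetStatus() (OpenGL 4.5 / GL_KHR_robustness) where the stub method is called, and would re-upload its buffers after recreating the context.

```python
# Reset status values as defined by OpenGL 4.5 / GL_KHR_robustness.
GL_NO_ERROR = 0
GL_GUILTY_CONTEXT_RESET = 0x8253
GL_INNOCENT_CONTEXT_RESET = 0x8254
GL_UNKNOWN_CONTEXT_RESET = 0x8255


class FakeContext:
    """Hypothetical stand-in for a GL context; replays a scripted status list."""

    def __init__(self, statuses):
        self._statuses = list(statuses)

    def get_graphics_reset_status(self):
        # A real application would call glGetGraphicsResetStatus() here.
        return self._statuses.pop(0) if self._statuses else GL_NO_ERROR


def run_frames(make_context, frames):
    """Run `frames` iterations, recreating the context whenever a reset is reported.

    Returns how many times the context had to be recreated.
    """
    ctx = make_context()
    recreated = 0
    for _ in range(frames):
        if ctx.get_graphics_reset_status() != GL_NO_ERROR:
            # The old context and its buffers are unusable after a reset:
            # create a fresh context and re-upload all resources.
            ctx = make_context()
            recreated += 1
        # ... draw the frame with ctx ...
    return recreated


# First context reports an innocent reset on its second frame; the
# replacement context never reports one.
contexts = iter([FakeContext([GL_NO_ERROR, GL_INNOCENT_CONTEXT_RESET]), FakeContext([])])
print(run_frames(lambda: next(contexts), 4))  # prints 1
```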
Comment 4 Bruno Jacquet 2019-10-08 20:15:26 UTC
Okay, I got this, but should I investigate the initial GPU reset cause?
Comment 5 Alex Deucher 2019-10-08 20:19:21 UTC
If you could come up with a reproducible test case, that would help for tracking down why it's hanging in the first place.
Comment 6 Bruno Jacquet 2019-10-12 19:05:12 UTC
Hello Alex,

Well, my test case is still very random, but I finally managed to get the full dmesg. The initial error seems to be this:
[34856.817554] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out or interrupted!
[34858.320812] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=6337674, emitted seq=6337676
[34858.320854] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process csgo_linux64 pid 12587 thread csgo_linux:cs0 pid 12595
[34858.320857] amdgpu 0000:1f:00.0: GPU reset begin!
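
When comparing timeout lines like the one above across reports, it can help to extract the ring name and the gap between the two sequence numbers. The sketch below assumes that "signaled seq" is the last fence the ring completed and "emitted seq" is the last one submitted, so their difference is the number of submissions still outstanding when the timeout fired. The function name is illustrative.

```python
import re

# Matches the message printed by amdgpu_job_timedout in the log above.
RING_TIMEOUT = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)


def pending_fences(line):
    """Return (ring, emitted - signaled) for a ring timeout line, or None."""
    match = RING_TIMEOUT.search(line)
    if match is None:
        return None
    outstanding = int(match.group("emitted")) - int(match.group("signaled"))
    return match.group("ring"), outstanding


line = (
    "[34858.320812] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* "
    "ring gfx timeout, signaled seq=6337674, emitted seq=6337676"
)
print(pending_fences(line))  # prints ('gfx', 2)
```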
Comment 7 Bruno Jacquet 2019-10-12 19:06:23 UTC
Created attachment 285483
dmesg of fence timeout error
Comment 8 Andreas Schneider 2019-10-14 17:20:11 UTC
I've hit the same error when trying to run vkdt [1], the darktable RAW image developer prototype written in Vulkan.

I can reliably reproduce the issue with it.

Kernel 5.3.4
Mesa 19.1.7
Vulkan 1.1.123

After compiling use:

./vkdt -g default-darkroom.cfg -d all path/to/RAW_images

[1] https://github.com/hanatos/vkdt
Comment 9 Andreas Schneider 2019-10-14 19:07:50 UTC
I totally forgot to mention that the GPU is an RX 470.
Comment 10 Bruno Jacquet 2020-04-27 16:58:17 UTC
With a more recent stack, it seems I am no longer experiencing this.
Kernel 5.4.35 and Mesa 20.0.5 seem stable for me.

Andreas, did you try upgrading your SW components to see if you still have the issue?
Comment 11 Andreas Schneider 2020-04-28 07:25:35 UTC
Yes, seems to work. I think this can be closed.
Comment 12 Bruno Jacquet 2020-04-28 08:01:36 UTC
OK, closing.
Comment 13 Lech 2020-07-25 07:35:43 UTC
Jul 25 09:19:54 lech-ryzen-vega kernel: [37627.065966] [drm:amdgpu_dm_commit_planes.constprop.0 [amdgpu]] *ERROR* Waiting for fences timed out!
Jul 25 09:19:54 lech-ryzen-vega kernel: [37631.935858] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=1228554, emitted seq=1228556
Jul 25 09:19:54 lech-ryzen-vega kernel: [37631.935939] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process HeroesOfTheStor pid 28617 thread HeroesOfTheStor pid 28691
Jul 25 09:19:54 lech-ryzen-vega kernel: [37631.935948] amdgpu 0000:0b:00.0: GPU reset begin!
Jul 25 09:19:54 lech-ryzen-vega kernel: [37632.181860] [drm:amdgpu_dm_commit_planes.constprop.0 [amdgpu]] *ERROR* Waiting for fences timed out!
Jul 25 09:19:54 lech-ryzen-vega kernel: [37632.312215] amdgpu 0000:0b:00.0: GPU BACO reset
Jul 25 09:19:55 lech-ryzen-vega kernel: [37632.888325] amdgpu 0000:0b:00.0: GPU reset succeeded, trying to resume
Jul 25 09:19:55 lech-ryzen-vega kernel: [37632.888485] [drm] PCIE GART of 512M enabled (table at 0x000000F400900000).
Jul 25 09:19:55 lech-ryzen-vega kernel: [37632.888509] [drm] VRAM is lost due to GPU reset!
Jul 25 09:19:55 lech-ryzen-vega kernel: [37632.888833] [drm] PSP is resuming...
Jul 25 09:19:55 lech-ryzen-vega kernel: [37633.076488] [drm] reserve 0x400000 from 0xf5fe800000 for PSP TMR
Jul 25 09:19:55 lech-ryzen-vega kernel: [37633.255659] [drm] kiq ring mec 2 pipe 1 q 0
Jul 25 09:19:56 lech-ryzen-vega kernel: [37634.373718] snd_hda_intel 0000:0b:00.1: azx_get_response timeout, switching to polling mode: last cmd=0x00af2d00
Jul 25 09:19:56 lech-ryzen-vega kernel: [37634.373723] snd_hda_intel 0000:0b:00.1: spurious response 0x0:0x0, last cmd=0xaf2d00
Jul 25 09:19:56 lech-ryzen-vega kernel: [37634.373726] snd_hda_intel 0000:0b:00.1: spurious response 0x0:0x0, last cmd=0xaf2d00
Jul 25 09:19:56 lech-ryzen-vega kernel: [37634.373728] snd_hda_intel 0000:0b:00.1: spurious response 0x233:0x0, last cmd=0xaf2d00
Jul 25 09:19:56 lech-ryzen-vega kernel: [37634.373730] snd_hda_intel 0000:0b:00.1: spurious response 0x0:0x0, last cmd=0xaf2d00
Jul 25 09:19:56 lech-ryzen-vega kernel: [37634.373731] snd_hda_intel 0000:0b:00.1: spurious response 0x1:0x0, last cmd=0xaf2d00
Jul 25 09:19:56 lech-ryzen-vega kernel: [37634.373733] snd_hda_intel 0000:0b:00.1: spurious response 0x0:0x0, last cmd=0xaf2d00
Jul 25 09:19:56 lech-ryzen-vega kernel: [37634.373735] snd_hda_intel 0000:0b:00.1: spurious response 0x0:0x0, last cmd=0xaf2d00
Jul 25 09:19:56 lech-ryzen-vega kernel: [37634.373736] snd_hda_intel 0000:0b:00.1: spurious response 0x0:0x0, last cmd=0xaf2d00
Jul 25 09:19:56 lech-ryzen-vega kernel: [37634.373738] snd_hda_intel 0000:0b:00.1: spurious response 0x0:0x0, last cmd=0xaf2d00
Jul 25 09:19:56 lech-ryzen-vega kernel: [37634.373739] snd_hda_intel 0000:0b:00.1: spurious response 0x0:0x0, last cmd=0xaf2d00
Jul 25 09:19:57 lech-ryzen-vega kernel: [37635.377702] snd_hda_intel 0000:0b:00.1: No response from codec, disabling MSI: last cmd=0x00a72d01
Jul 25 09:19:58 lech-ryzen-vega kernel: [37636.393677] snd_hda_intel 0000:0b:00.1: No response from codec, resetting bus: last cmd=0x00a72d01
Jul 25 09:19:59 lech-ryzen-vega kernel: [37637.397658] snd_hda_intel 0000:0b:00.1: azx_get_response timeout, switching to single_cmd mode: last cmd=0x00b77701
Jul 25 09:19:59 lech-ryzen-vega kernel: [37637.419432] [drm] UVD and UVD ENC initialized successfully.
Jul 25 09:20:00 lech-ryzen-vega kernel: [37637.519135] [drm] VCE initialized successfully.
Jul 25 09:20:00 lech-ryzen-vega kernel: [37637.519149] amdgpu 0000:0b:00.0: ring gfx uses VM inv eng 0 on hub 0
Jul 25 09:20:00 lech-ryzen-vega kernel: [37637.519151] amdgpu 0000:0b:00.0: ring comp_1.0.0 uses VM inv eng 1 on hub 0
Jul 25 09:20:00 lech-ryzen-vega kernel: [37637.519153] amdgpu 0000:0b:00.0: ring comp_1.1.0 uses VM inv eng 4 on hub 0
Jul 25 09:20:00 lech-ryzen-vega kernel: [37637.519155] amdgpu 0000:0b:00.0: ring comp_1.2.0 uses VM inv eng 5 on hub 0
Jul 25 09:20:00 lech-ryzen-vega kernel: [37637.519156] amdgpu 0000:0b:00.0: ring comp_1.3.0 uses VM inv eng 6 on hub 0
Jul 25 09:20:00 lech-ryzen-vega kernel: [37637.519158] amdgpu 0000:0b:00.0: ring comp_1.0.1 uses VM inv eng 7 on hub 0
Jul 25 09:20:00 lech-ryzen-vega kernel: [37637.519159] amdgpu 0000:0b:00.0: ring comp_1.1.1 uses VM inv eng 8 on hub 0
Jul 25 09:20:00 lech-ryzen-vega kernel: [37637.519161] amdgpu 0000:0b:00.0: ring comp_1.2.1 uses VM inv eng 9 on hub 0
Jul 25 09:20:00 lech-ryzen-vega kernel: [37637.519162] amdgpu 0000:0b:00.0: ring comp_1.3.1 uses VM inv eng 10 on hub 0
Jul 25 09:20:00 lech-ryzen-vega kernel: [37637.519164] amdgpu 0000:0b:00.0: ring kiq_2.1.0 uses VM inv eng 11 on hub 0
Jul 25 09:20:00 lech-ryzen-vega kernel: [37637.519166] amdgpu 0000:0b:00.0: ring sdma0 uses VM inv eng 0 on hub 1
Jul 25 09:20:00 lech-ryzen-vega kernel: [37637.519167] amdgpu 0000:0b:00.0: ring page0 uses VM inv eng 1 on hub 1
Jul 25 09:20:00 lech-ryzen-vega kernel: [37637.519169] amdgpu 0000:0b:00.0: ring sdma1 uses VM inv eng 4 on hub 1
Jul 25 09:20:00 lech-ryzen-vega kernel: [37637.519170] amdgpu 0000:0b:00.0: ring page1 uses VM inv eng 5 on hub 1
Jul 25 09:20:00 lech-ryzen-vega kernel: [37637.519171] amdgpu 0000:0b:00.0: ring uvd_0 uses VM inv eng 6 on hub 1
Jul 25 09:20:00 lech-ryzen-vega kernel: [37637.519173] amdgpu 0000:0b:00.0: ring uvd_enc_0.0 uses VM inv eng 7 on hub 1
Jul 25 09:20:00 lech-ryzen-vega kernel: [37637.519174] amdgpu 0000:0b:00.0: ring uvd_enc_0.1 uses VM inv eng 8 on hub 1
Jul 25 09:20:00 lech-ryzen-vega kernel: [37637.519176] amdgpu 0000:0b:00.0: ring vce0 uses VM inv eng 9 on hub 1
Jul 25 09:20:00 lech-ryzen-vega kernel: [37637.519177] amdgpu 0000:0b:00.0: ring vce1 uses VM inv eng 10 on hub 1
Jul 25 09:20:00 lech-ryzen-vega kernel: [37637.519179] amdgpu 0000:0b:00.0: ring vce2 uses VM inv eng 11 on hub 1
Jul 25 09:20:00 lech-ryzen-vega kernel: [37637.519180] [drm] ECC is not present.
Jul 25 09:20:00 lech-ryzen-vega kernel: [37637.519182] [drm] SRAM ECC is not present.
Jul 25 09:20:00 lech-ryzen-vega kernel: [37637.520993] [drm] recover vram bo from shadow start
Jul 25 09:20:00 lech-ryzen-vega kernel: [37637.522435] [drm] recover vram bo from shadow done
Jul 25 09:20:00 lech-ryzen-vega kernel: [37637.522437] [drm] Skip scheduling IBs!
Jul 25 09:20:00 lech-ryzen-vega kernel: [37637.522438] [drm] Skip scheduling IBs!
Jul 25 09:20:00 lech-ryzen-vega kernel: [37637.522466] amdgpu 0000:0b:00.0: GPU reset(2) succeeded!
Jul 25 09:20:00 lech-ryzen-vega kernel: [37637.522477] [drm] Skip scheduling IBs!
Jul 25 09:20:00 lech-ryzen-vega kernel: [37637.522479] [drm] Skip scheduling IBs!
Jul 25 09:20:00 lech-ryzen-vega kernel: [37637.522481] [drm] Skip scheduling IBs!
Jul 25 09:20:00 lech-ryzen-vega kernel: [37637.522482] [drm] Skip scheduling IBs!
Jul 25 09:20:00 lech-ryzen-vega kernel: [37637.522484] [drm] Skip scheduling IBs!
Jul 25 09:20:00 lech-ryzen-vega kernel: [37637.522485] [drm] Skip scheduling IBs!
Jul 25 09:20:00 lech-ryzen-vega kernel: [37637.522487] [drm] Skip scheduling IBs!
Jul 25 09:20:00 lech-ryzen-vega kernel: [37637.522488] [drm] Skip scheduling IBs!
Jul 25 09:20:00 lech-ryzen-vega kernel: [37637.522489] [drm] Skip scheduling IBs!
Jul 25 09:20:00 lech-ryzen-vega kernel: [37637.522491] [drm] Skip scheduling IBs!
Jul 25 09:20:00 lech-ryzen-vega kernel: [37637.522492] [drm] Skip scheduling IBs!
Jul 25 09:20:00 lech-ryzen-vega kernel: [37637.522493] [drm] Skip scheduling IBs!
Jul 25 09:20:00 lech-ryzen-vega kernel: [37637.522770] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Jul 25 09:20:10 lech-ryzen-vega kernel: [37648.127879] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Jul 25 09:20:10 lech-ryzen-vega kernel: [37648.129190] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Jul 25 09:20:10 lech-ryzen-vega kernel: [37648.162337] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Jul 25 09:20:10 lech-ryzen-vega kernel: [37648.164145] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Jul 25 09:20:10 lech-ryzen-vega kernel: [37648.164261] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Jul 25 09:20:10 lech-ryzen-vega kernel: [37648.167924] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Jul 25 09:20:10 lech-ryzen-vega kernel: [37648.168801] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!

HW: 
Vega 56 
Ryzen 3600X

SW:
5.7.1-050701-generic x86_64
Mesa 20.2.0-devel (git-14a12b7 2020-07-24 focal-oibaf-ppa)

You can safely reopen it.
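
For long logs like the one in this comment, a small script can condense the sequence into how long the reset took and how many submissions were rejected afterwards. The following is a minimal sketch: the function name is illustrative, it handles only a single reset per log, and it relies on the message strings exactly as they appear above, which may differ in other kernel versions.

```python
import re

# Kernel timestamp, e.g. "[37631.935948]" or "[ 9156.537524]".
TIMESTAMP = re.compile(r"\[\s*(\d+\.\d+)\]")
RESET_DONE = re.compile(r"GPU reset\(\d+\) succeeded!")


def summarize_reset(lines):
    """Return (reset_duration_seconds or None, number of parser errors)."""
    begin = end = None
    parser_errors = 0
    for line in lines:
        match = TIMESTAMP.search(line)
        if match is None:
            continue
        stamp = float(match.group(1))
        if "GPU reset begin!" in line and begin is None:
            begin = stamp
        elif RESET_DONE.search(line):
            end = stamp
        elif "Failed to initialize parser" in line:
            parser_errors += 1
    duration = end - begin if begin is not None and end is not None else None
    return duration, parser_errors


sample = [
    "Jul 25 09:19:54 lech-ryzen-vega kernel: [37631.935948] amdgpu 0000:0b:00.0: GPU reset begin!",
    "Jul 25 09:20:00 lech-ryzen-vega kernel: [37637.522466] amdgpu 0000:0b:00.0: GPU reset(2) succeeded!",
    "Jul 25 09:20:00 lech-ryzen-vega kernel: [37637.522770] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!",
    "Jul 25 09:20:10 lech-ryzen-vega kernel: [37648.127879] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!",
]
duration, errors = summarize_reset(sample)
print(f"reset took {duration:.3f} s, {errors} rejected submissions")
```

On the four sample lines, taken from the log above, the reset takes roughly 5.6 seconds and two submissions are rejected.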
