Bug 206141

Summary: VCE UVD ring test failed -110
Product: Drivers
Reporter: Janpieter Sollie (janpieter.sollie)
Component: Video(DRI - non Intel)
Assignee: drivers_video-dri
Status: RESOLVED INVALID
Severity: low
CC: alexdeucher, thong.thai
Priority: P1
Hardware: x86-64
OS: Linux
Kernel Version: 5.4.10
Subsystem:
Regression: No
Bisected commit-id:
Attachments:
  part of DMESG that looks relevant
  Kernel. Config
  dmesg with 2 GPUs

Description Janpieter Sollie 2020-01-09 09:18:07 UTC
Created attachment 286705 [details]
part of DMESG that looks relevant

While booting my PC, amdgpu complains that it cannot execute the UVD and VCE ring tests on a Fiji GPU (R9 Nano).
The error code -110 points to a timeout.
Maybe R9 Nano needs more time to initialize the UVD decoder in UEFI mode?
-------------------
[    7.270335] amdgpu 0000:0a:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on uvd (-110).
[    8.063987] Generic FE-GE Realtek PHY r8169-500:00: attached PHY driver [Generic FE-GE Realtek PHY] (mii_bus:phy_addr=r8169-500:00, irq=IGNORE)
[    8.211306] r8169 0000:05:00.0 eth0: Link is Down
[    8.400079] amdgpu 0000:0a:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on vce0 (-110).
[    8.400084] [drm:process_one_work] *ERROR* ib ring test failed (-110).
--------------------
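For readers decoding the log lines above, here is a minimal sketch of how the engine name and error code can be pulled out of an "IB test failed" message and mapped to an errno name. The errno table is a small hand-picked subset of the generic Linux values as used on x86-64 (an assumption for other architectures), and the helper names are hypothetical, not part of any amdgpu tooling.

```python
import re

# Subset of generic Linux errno values (as used on x86-64); hand-picked for illustration.
LINUX_ERRNO_NAMES = {
    5: "EIO",
    12: "ENOMEM",
    16: "EBUSY",
    22: "EINVAL",
    110: "ETIMEDOUT",
}

# Matches messages such as: "*ERROR* IB test failed on uvd (-110)."
IB_FAIL_RE = re.compile(r"IB test failed on (\w+) \((-?\d+)\)")


def decode_kernel_error(code: int) -> str:
    """Map a negative kernel return code (e.g. -110) to its errno name."""
    return LINUX_ERRNO_NAMES.get(-code, f"unknown errno {-code}")


def parse_ib_failure(line: str):
    """Return (ring_name, errno_name) for an IB test failure line, else None."""
    match = IB_FAIL_RE.search(line)
    if match is None:
        return None
    ring, code = match.group(1), int(match.group(2))
    return ring, decode_kernel_error(code)


line = ("[    7.270335] amdgpu 0000:0a:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] "
        "*ERROR* IB test failed on uvd (-110).")
print(parse_ib_failure(line))
```

Both failing rings in the excerpt (uvd and vce0) decode to the same timeout code this way.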
Please note that this is not critical to me at all: I only use this card for OpenCL calculations via ROCm, so I couldn't care less, but I still think it's worth a look.
Comment 1 Janpieter Sollie 2020-01-11 22:02:36 UTC
Tried to increase the timeout value (timeout << 1), but it did not help. Moving on ...
Comment 2 Janpieter Sollie 2020-01-13 13:08:43 UTC
Tried to change ALL functions where ETIMEDOUT is specified (in drivers/gpu/drm/amd/amdgpu/) to use timeout << 2, but nothing changed. Am I looking at the wrong function here?
Comment 3 Alex Deucher 2020-01-13 16:39:07 UTC
The relevant function is amdgpu_ib_ring_tests(). However, if the relevant engines are in some bad state, increasing the timeout won't help.
Comment 4 Janpieter Sollie 2020-01-14 06:23:39 UTC
Hi Alex, 
Thank you for the feedback. Tried that one as well. 
Since I had already multiplied the timeout by 4, I guess it is the bad state then.
Is there any way I could reset the state before the ring tests begin?
FYI: I have already tried different firmware revisions, and some of them seem to affect the situation: sometimes UVD fails, sometimes VCE fails, but none of them seems to work for both.
Comment 5 Thong Thai 2020-01-14 19:27:07 UTC
Hi Janpieter,

I was unsuccessful in trying to recreate your issue. 
- Running Linux stable, 5.4.10, verified the IB tests are running
- Same video card, same VBIOS, same firmware
- Tried with/without a display connected
- Tried with/without rocm-dkms

Trying to see if I'm missing anything else: what motherboard / CPU are you using? And are you using any special kernel parameters?

---

[    4.702561] [drm] initializing kernel modesetting (FIJI 0x1002:0x7300 0x1002:0x0B36 0xCA).
[    4.702569] [drm] register mmio base: 0xFCF00000
[    4.702569] [drm] register mmio size: 262144
[    4.702577] [drm] add ip block number 0 <vi_common>
[    4.702578] [drm] add ip block number 1 <gmc_v8_0>
[    4.702578] [drm] add ip block number 2 <tonga_ih>
[    4.702579] [drm] add ip block number 3 <gfx_v8_0>
[    4.702580] [drm] add ip block number 4 <sdma_v3_0>
[    4.702580] [drm] add ip block number 5 <powerplay>
[    4.702581] [drm] add ip block number 6 <dm>
[    4.702582] [drm] add ip block number 7 <uvd_v6_0>
[    4.702582] [drm] add ip block number 8 <vce_v3_0>
[    4.702758] amdgpu 0000:07:00.0: No more image in the PCI ROM
[    4.702777] ATOM BIOS: 113-C8820200-107
[    4.702789] [drm] UVD is enabled in physical mode
[    4.702789] [drm] VCE enabled in physical mode
[    4.702811] [drm] vm size is 64 GB, 2 levels, block size is 10-bit, fragment size is 9-bit
[    4.702818] amdgpu 0000:07:00.0: VRAM: 4096M 0x000000F400000000 - 0x000000F4FFFFFFFF (4096M used)
[    4.702819] amdgpu 0000:07:00.0: GART: 1024M 0x000000FF00000000 - 0x000000FF3FFFFFFF
[    4.702823] [drm] Detected VRAM RAM=4096M, BAR=256M
[    4.702824] [drm] RAM width 512bits HBM
[    4.702876] [TTM] Zone  kernel: Available graphics memory: 8201336 KiB
[    4.702877] [TTM] Zone   dma32: Available graphics memory: 2097152 KiB
[    4.702877] [TTM] Initializing pool allocator
[    4.702880] [TTM] Initializing DMA pool allocator
[    4.702907] [drm] amdgpu: 4096M of VRAM memory ready
[    4.702909] [drm] amdgpu: 4096M of GTT memory ready.
[    4.702925] [drm] GART: num cpu pages 262144, num gpu pages 262144
[    4.702988] [drm] PCIE GART of 1024M enabled (table at 0x000000F4001D5000).
[    4.743132] [drm] Chained IB support enabled!
[    4.764245] amdgpu: [powerplay] hwmgr_sw_init smu backed is fiji_smu
[    4.768880] [drm] Found UVD firmware Version: 1.91 Family ID: 12
[    4.768885] [drm] UVD ENC is disabled
[    4.772092] [drm] Found VCE firmware Version: 55.2 Binary ID: 3
[    4.837386] [drm] dce110_link_encoder_construct: Failed to get encoder_cap_info from VBIOS with error code 4!
[    4.848088] [drm] Display Core initialized with v3.2.48!
[    4.848823] [drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
[    4.848824] [drm] Driver supports precise vblank timestamp query.
[    4.874937] [drm] UVD initialized successfully.
[    4.974949] [drm] VCE initialized successfully.
[    4.976370] [drm] Cannot find any crtc or sizes
[    4.978116] [drm] Initialized amdgpu 3.35.0 20150101 for 0000:07:00.0 on minor 0
Comment 6 Janpieter Sollie 2020-01-14 19:50:29 UTC
Created attachment 286813 [details]
Kernel. Config

Hi Thong,
I use efibootmgr, so there are no kernel arguments from a bootloader, but there are a few in the config (attached here).
Hope it helps!
Also, I ordered another R9 Nano to rule out the possibility of hardware failure. It will be available in a few days. I'll keep you updated.
Comment 7 Janpieter Sollie 2020-01-17 16:34:35 UTC
Created attachment 286863 [details]
dmesg with 2 GPUs

OK, so this definitely looks like a HW failure.
I also tried to copy the FW from the working GPU to the broken GPU, but it did not help.
Is it possible to disable the UVD/VCE engine on the original GPU?
I mean, it's not used anyway, so I might as well disable it completely to avoid these errors.
Comment 8 Alex Deucher 2020-01-17 21:16:45 UTC
(In reply to Janpieter Sollie from comment #7)
> Is it possible to disable the UVD/VCE engine on the original GPU?
> I mean, it's not used anyway, so I might as well disable it completely to
> avoid these errors.

Yes.  Set the amdgpu.ip_block_mask module parameter on the kernel command line in grub.  Each bit refers to an IP.  From your log:

[    3.987749] [drm] add ip block number 0 <vi_common>
[    3.987750] [drm] add ip block number 1 <gmc_v8_0>
[    3.987751] [drm] add ip block number 2 <tonga_ih>
[    3.987752] [drm] add ip block number 3 <gfx_v8_0>
[    3.987753] [drm] add ip block number 4 <sdma_v3_0>
[    3.987753] [drm] add ip block number 5 <powerplay>
[    3.987754] [drm] add ip block number 6 <dm>
[    3.987755] [drm] add ip block number 7 <uvd_v6_0>
[    3.987755] [drm] add ip block number 8 <vce_v3_0>

Bits 7 and 8 are for uvd and vce, so you can append amdgpu.ip_block_mask=0x7f to enable only blocks 0-6.
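
The mask arithmetic above can be sketched as follows: each "add ip block number N <name>" line contributes bit N, and the bits for blocks to be disabled are left cleared. This is only an illustrative helper (the function name and its prefix-matching rule are assumptions, not part of the driver); block numbering can differ per GPU and kernel version, so the mask should always be derived from the actual log.

```python
import re

# Matches dmesg lines such as: "[drm] add ip block number 7 <uvd_v6_0>"
IP_BLOCK_RE = re.compile(r"add ip block number (\d+) <(\w+)>")


def ip_block_mask(dmesg_text, disabled_prefixes=("uvd", "vce")):
    """Build a bitmask with one bit set per IP block that should stay enabled."""
    mask = 0
    for number, name in IP_BLOCK_RE.findall(dmesg_text):
        if not name.startswith(tuple(disabled_prefixes)):
            mask |= 1 << int(number)
    return mask


LOG = """
[    3.987749] [drm] add ip block number 0 <vi_common>
[    3.987750] [drm] add ip block number 1 <gmc_v8_0>
[    3.987751] [drm] add ip block number 2 <tonga_ih>
[    3.987752] [drm] add ip block number 3 <gfx_v8_0>
[    3.987753] [drm] add ip block number 4 <sdma_v3_0>
[    3.987753] [drm] add ip block number 5 <powerplay>
[    3.987754] [drm] add ip block number 6 <dm>
[    3.987755] [drm] add ip block number 7 <uvd_v6_0>
[    3.987755] [drm] add ip block number 8 <vce_v3_0>
"""

print(f"amdgpu.ip_block_mask={ip_block_mask(LOG):#x}")  # amdgpu.ip_block_mask=0x7f
```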
Comment 9 Janpieter Sollie 2020-01-18 06:40:36 UTC
Thank you Alex! You helped the environment by making it unnecessary to trash a GPU :p.
Anyway, this bug is definitely solved, thank you all! If there's anything I could do to pay back for the support, let me know.
Janpieter