Created attachment 286705 [details]
part of dmesg that looks relevant

While booting my PC, amdgpu complains that it cannot execute the UVD and VCE ring tests on a Fiji GPU (R9 Nano). The error code -110 points to a timeout. Maybe the R9 Nano needs more time to initialize the UVD decoder in UEFI mode?

-------------------
[ 7.270335] amdgpu 0000:0a:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on uvd (-110).
[ 8.063987] Generic FE-GE Realtek PHY r8169-500:00: attached PHY driver [Generic FE-GE Realtek PHY] (mii_bus:phy_addr=r8169-500:00, irq=IGNORE)
[ 8.211306] r8169 0000:05:00.0 eth0: Link is Down
[ 8.400079] amdgpu 0000:0a:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on vce0 (-110).
[ 8.400084] [drm:process_one_work] *ERROR* ib ring test failed (-110).
--------------------

Please note that this is not critical to me at all: I only use this card for OpenCL calculations via ROCm, so I couldn't care less, but I still think it's worth a look.
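As a rough illustration of how the -110 in the log above can be read, the sketch below pulls the failing rings out of the dmesg excerpt and maps the code to its Linux errno name. The helper name `failed_ib_tests` and the small lookup table are hypothetical, made up for this example; the only fact relied on is that errno 110 is ETIMEDOUT in the generic Linux errno header.

```python
import re

# Linux value from include/uapi/asm-generic/errno.h, hard-coded here so the
# sketch does not depend on the errno numbering of the machine running it.
LINUX_ERRNO_NAMES = {110: "ETIMEDOUT"}

# Excerpt copied from the dmesg output in the report.
LOG = """\
[ 7.270335] amdgpu 0000:0a:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on uvd (-110).
[ 8.211306] r8169 0000:05:00.0 eth0: Link is Down
[ 8.400079] amdgpu 0000:0a:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on vce0 (-110).
[ 8.400084] [drm:process_one_work] *ERROR* ib ring test failed (-110).
"""


def failed_ib_tests(text):
    """Return (ring, code, errno_name) for every per-ring IB test failure."""
    pattern = re.compile(r"IB test failed on (\w+) \((-?\d+)\)")
    results = []
    for ring, code_text in pattern.findall(text):
        code = int(code_text)
        # The kernel reports errors as negative errno values.
        name = LINUX_ERRNO_NAMES.get(-code, "unknown")
        results.append((ring, code, name))
    return results


failures = failed_ib_tests(LOG)
for ring, code, name in failures:
    print(f"{ring}: {code} ({name})")
```

The summary line from process_one_work is deliberately not matched, since it does not name a ring.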
Tried to increase the timeout value (timeout << 1), but it did not help. Moving on ...
Tried changing ALL functions where ETIMEDOUT is specified (in drivers/gpu/drm/amd/amdgpu/) to use timeout << 2, but nothing changed. Am I looking at the wrong function here?
The relevant function is amdgpu_ib_ring_tests(). However, if the relevant engines are in some bad state, increasing the timeout won't help.
Hi Alex,

Thank you for the feedback. I tried that one as well. Since I already multiplied the timeout by 4, I guess it is the bad state then. Is there any way I could reset the state before the ring tests begin?

FYI: I've already tried different firmware revisions, and some of them seem to affect the situation: sometimes UVD fails, sometimes VCE fails, but none of them works for both.
Hi Janpieter,

I was unable to reproduce your issue.
- Running Linux stable, 5.4.10, verified the IB tests are running
- Same video card, same VBIOS, same firmware
- Tried with/without a display connected
- Tried with/without rocm-dkms

Trying to see if I'm missing anything else: what motherboard / CPU are you using? And are you using any special kernel parameters?

---
[ 4.702561] [drm] initializing kernel modesetting (FIJI 0x1002:0x7300 0x1002:0x0B36 0xCA).
[ 4.702569] [drm] register mmio base: 0xFCF00000
[ 4.702569] [drm] register mmio size: 262144
[ 4.702577] [drm] add ip block number 0 <vi_common>
[ 4.702578] [drm] add ip block number 1 <gmc_v8_0>
[ 4.702578] [drm] add ip block number 2 <tonga_ih>
[ 4.702579] [drm] add ip block number 3 <gfx_v8_0>
[ 4.702580] [drm] add ip block number 4 <sdma_v3_0>
[ 4.702580] [drm] add ip block number 5 <powerplay>
[ 4.702581] [drm] add ip block number 6 <dm>
[ 4.702582] [drm] add ip block number 7 <uvd_v6_0>
[ 4.702582] [drm] add ip block number 8 <vce_v3_0>
[ 4.702758] amdgpu 0000:07:00.0: No more image in the PCI ROM
[ 4.702777] ATOM BIOS: 113-C8820200-107
[ 4.702789] [drm] UVD is enabled in physical mode
[ 4.702789] [drm] VCE enabled in physical mode
[ 4.702811] [drm] vm size is 64 GB, 2 levels, block size is 10-bit, fragment size is 9-bit
[ 4.702818] amdgpu 0000:07:00.0: VRAM: 4096M 0x000000F400000000 - 0x000000F4FFFFFFFF (4096M used)
[ 4.702819] amdgpu 0000:07:00.0: GART: 1024M 0x000000FF00000000 - 0x000000FF3FFFFFFF
[ 4.702823] [drm] Detected VRAM RAM=4096M, BAR=256M
[ 4.702824] [drm] RAM width 512bits HBM
[ 4.702876] [TTM] Zone kernel: Available graphics memory: 8201336 KiB
[ 4.702877] [TTM] Zone dma32: Available graphics memory: 2097152 KiB
[ 4.702877] [TTM] Initializing pool allocator
[ 4.702880] [TTM] Initializing DMA pool allocator
[ 4.702907] [drm] amdgpu: 4096M of VRAM memory ready
[ 4.702909] [drm] amdgpu: 4096M of GTT memory ready.
[ 4.702925] [drm] GART: num cpu pages 262144, num gpu pages 262144
[ 4.702988] [drm] PCIE GART of 1024M enabled (table at 0x000000F4001D5000).
[ 4.743132] [drm] Chained IB support enabled!
[ 4.764245] amdgpu: [powerplay] hwmgr_sw_init smu backed is fiji_smu
[ 4.768880] [drm] Found UVD firmware Version: 1.91 Family ID: 12
[ 4.768885] [drm] UVD ENC is disabled
[ 4.772092] [drm] Found VCE firmware Version: 55.2 Binary ID: 3
[ 4.837386] [drm] dce110_link_encoder_construct: Failed to get encoder_cap_info from VBIOS with error code 4!
[ 4.848088] [drm] Display Core initialized with v3.2.48!
[ 4.848823] [drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
[ 4.848824] [drm] Driver supports precise vblank timestamp query.
[ 4.874937] [drm] UVD initialized successfully.
[ 4.974949] [drm] VCE initialized successfully.
[ 4.976370] [drm] Cannot find any crtc or sizes
[ 4.978116] [drm] Initialized amdgpu 3.35.0 20150101 for 0000:07:00.0 on minor 0
Created attachment 286813 [details]
Kernel. Config

Hi Thong,

I use efibootmgr, so there are no kernel arguments passed by a bootloader, but there are a few in the config (attached here). Hope it helps!

Also, I ordered another R9 Nano to rule out the possibility of hardware failure. It will be available in a few days. I'll keep you updated.
Created attachment 286863 [details]
dmesg with 2 GPUs

OK, so this definitely looks like a HW failure. I also tried to copy the FW from the working GPU to the broken GPU, but it did not help.
Is it possible to disable the UVD/VCE engine on the original GPU?
I mean, it's not used anyway, so I might as well disable it completely to avoid these errors.
(In reply to Janpieter Sollie from comment #7)
> Is it possible to disable the UVD/VCE engine on the original GPU?
> I mean, it's not used anyway, so I might as well disable it completely to
> avoid these errors.

Yes. Set the amdgpu.ip_block_mask module parameter on the kernel command line in grub. Each bit refers to an IP. From your log:

[ 3.987749] [drm] add ip block number 0 <vi_common>
[ 3.987750] [drm] add ip block number 1 <gmc_v8_0>
[ 3.987751] [drm] add ip block number 2 <tonga_ih>
[ 3.987752] [drm] add ip block number 3 <gfx_v8_0>
[ 3.987753] [drm] add ip block number 4 <sdma_v3_0>
[ 3.987753] [drm] add ip block number 5 <powerplay>
[ 3.987754] [drm] add ip block number 6 <dm>
[ 3.987755] [drm] add ip block number 7 <uvd_v6_0>
[ 3.987755] [drm] add ip block number 8 <vce_v3_0>

Bits 7 and 8 are for UVD and VCE, so you can append amdgpu.ip_block_mask=0x7f to enable only blocks 0-6.
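The mask arithmetic described above can be sketched in a few lines of Python: each "add ip block number N" line contributes bit N, and the bits for the blocks to be disabled are left clear. The function names `parse_ip_blocks` and `ip_block_mask` are hypothetical helpers for this illustration, not part of any amdgpu tooling, and the block numbering must always be taken from the dmesg of the machine in question, since it can differ between GPUs and kernel versions.

```python
import re

# IP block list copied from the dmesg excerpt quoted in this comment.
DMESG = """\
[ 3.987749] [drm] add ip block number 0 <vi_common>
[ 3.987750] [drm] add ip block number 1 <gmc_v8_0>
[ 3.987751] [drm] add ip block number 2 <tonga_ih>
[ 3.987752] [drm] add ip block number 3 <gfx_v8_0>
[ 3.987753] [drm] add ip block number 4 <sdma_v3_0>
[ 3.987753] [drm] add ip block number 5 <powerplay>
[ 3.987754] [drm] add ip block number 6 <dm>
[ 3.987755] [drm] add ip block number 7 <uvd_v6_0>
[ 3.987755] [drm] add ip block number 8 <vce_v3_0>
"""


def parse_ip_blocks(text):
    """Map each IP block number to its name, as printed by the driver."""
    pattern = re.compile(r"add ip block number (\d+) <([^>]+)>")
    return {int(number): name for number, name in pattern.findall(text)}


def ip_block_mask(blocks, disabled_prefixes):
    """Set one bit per block, skipping names that start with a disabled prefix."""
    prefixes = tuple(disabled_prefixes)
    mask = 0
    for number, name in blocks.items():
        if prefixes and name.startswith(prefixes):
            continue  # leave this block's bit clear so it is not enabled
        mask |= 1 << number
    return mask


blocks = parse_ip_blocks(DMESG)
mask = ip_block_mask(blocks, ["uvd", "vce"])
print(f"amdgpu.ip_block_mask={mask:#x}")
```

With UVD (bit 7) and VCE (bit 8) left out, bits 0 through 6 remain set, which is the 0x7f value given above.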
Thank you Alex! You helped the environment by making it unnecessary to trash a GPU :p. Anyway, this bug is definitely solved, thank you all! If there's anything I could do to pay back for the support, let me know. Janpieter