Bug 211349 - IB test failed on sdma0 ! AMDGPU driver for Raven APU (ryzen 2400G) hangs!
Summary: IB test failed on sdma0 ! AMDGPU driver for Raven APU (ryzen 2400G) hangs!
Status: RESOLVED ANSWERED
Alias: None
Product: Drivers
Classification: Unclassified
Component: Video(DRI - non Intel)
Hardware: i386 Linux
Importance: P1 normal
Assignee: drivers_video-dri
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2021-01-26 12:24 UTC by bolando
Modified: 2021-01-28 05:41 UTC
CC List: 2 users

See Also:
Kernel Version: 5.10.9
Subsystem:
Regression: No
Bisected commit-id:


Attachments
Main Error (5.93 KB, text/plain)
2021-01-26 12:46 UTC, bolando
Kernel log without kfd fd kfd: added device 1002:15dd and amdgpu: Topology: Add APU node [0x0:0x0] (88.28 KB, text/plain)
2021-01-26 12:48 UTC, bolando
Kernel config file (113.62 KB, text/plain)
2021-01-26 12:49 UTC, bolando
GCC version (365 bytes, text/plain)
2021-01-26 12:49 UTC, bolando

Description bolando 2021-01-26 12:24:46 UTC
Loading the AMDGPU driver hangs the system every time, with "IB test failed on sdma0" in the kernel log. Is the AMDGPU driver not supported on 32-bit systems? My system is LFS-10.0 on a Biostar B450GT mainboard (latest BIOS, dated 12/08/2020) with a Ryzen 5 2400G CPU and 32 GB of dual-channel DDR4 memory. I have tested many kernel versions and amdgpu firmware versions, but the system hangs every time I run modprobe amdgpu. I have also installed Ubuntu 20.04 LTS (with the 5.8.0-25 kernel), and AMDGPU works very well there.
Comment 1 bolando 2021-01-26 12:46:32 UTC
Created attachment 294855 [details]
Main Error
Comment 2 bolando 2021-01-26 12:48:12 UTC
Created attachment 294857 [details]
Kernel log without kfd fd kfd: added device 1002:15dd   and amdgpu: Topology: Add APU node [0x0:0x0]
Comment 3 bolando 2021-01-26 12:49:08 UTC
Created attachment 294859 [details]
Kernel config file
Comment 4 bolando 2021-01-26 12:49:37 UTC
Created attachment 294861 [details]
GCC version
Comment 5 bolando 2021-01-26 12:54:55 UTC
I have tested many kernels and firmware versions, but all of them failed! Compared with the Ubuntu 20.04 LTS kernel log, my kernel log lacks "kfd: added device" and "amdgpu 0000:06:00.0: amdgpu: Topology: Add APU node [0x0:0x0]". It looks like an sdma0 and VCN bug, with some memory faults. This is the main error log:


Jan 26 09:58:07 Pink kernel: [   69.141903] [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on sdma0 (-110).
Jan 26 09:58:07 Pink kernel: [   69.145985] amdgpu 0000:06:00.0: amdgpu: [gfxhub0] no-retry page fault (src_id:0 ring:157 vmid:10 pasid:0, for process  pid 0 thread  pid 0)
Jan 26 09:58:07 Pink kernel: [   69.146002] amdgpu 0000:06:00.0: amdgpu:   in page starting at address 0x000000f40021b000 from client 27
Jan 26 09:58:07 Pink kernel: [   69.146012] amdgpu 0000:06:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00A00B3A
Jan 26 09:58:07 Pink kernel: [   69.146020] amdgpu 0000:06:00.0: amdgpu: ^I Faulty UTCL2 client ID: CPC (0x5)
Jan 26 09:58:07 Pink kernel: [   69.146027] amdgpu 0000:06:00.0: amdgpu: ^I MORE_FAULTS: 0x0
Jan 26 09:58:07 Pink kernel: [   69.146033] amdgpu 0000:06:00.0: amdgpu: ^I WALKER_ERROR: 0x5
Jan 26 09:58:07 Pink kernel: [   69.146040] amdgpu 0000:06:00.0: amdgpu: ^I PERMISSION_FAULTS: 0x3
Jan 26 09:58:07 Pink kernel: [   69.146046] amdgpu 0000:06:00.0: amdgpu: ^I MAPPING_ERROR: 0x1
Jan 26 09:58:07 Pink kernel: [   69.146052] amdgpu 0000:06:00.0: amdgpu: ^I RW: 0x0
Jan 26 09:58:07 Pink kernel: [   69.146067] amdgpu 0000:06:00.0: amdgpu: [gfxhub0] no-retry page fault (src_id:0 ring:157 vmid:10 pasid:0, for process  pid 0 thread  pid 0)
Jan 26 09:58:07 Pink kernel: [   69.146077] amdgpu 0000:06:00.0: amdgpu:   in page starting at address 0x000000f40021b000 from client 27
Jan 26 09:58:07 Pink kernel: [   69.146086] amdgpu 0000:06:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00A00B3A
Jan 26 09:58:07 Pink kernel: [   69.146094] amdgpu 0000:06:00.0: amdgpu: ^I Faulty UTCL2 client ID: CPC (0x5)
Jan 26 09:58:07 Pink kernel: [   69.146100] amdgpu 0000:06:00.0: amdgpu: ^I MORE_FAULTS: 0x0
Jan 26 09:58:07 Pink kernel: [   69.146106] amdgpu 0000:06:00.0: amdgpu: ^I WALKER_ERROR: 0x5
Jan 26 09:58:07 Pink kernel: [   69.146112] amdgpu 0000:06:00.0: amdgpu: ^I PERMISSION_FAULTS: 0x3
Jan 26 09:58:07 Pink kernel: [   69.146118] amdgpu 0000:06:00.0: amdgpu: ^I MAPPING_ERROR: 0x1
Jan 26 09:58:07 Pink kernel: [   69.146124] amdgpu 0000:06:00.0: amdgpu: ^I RW: 0x0
Jan 26 09:58:07 Pink kernel: [   69.146514] mce: [Hardware Error]: Machine check events logged
Jan 26 09:58:07 Pink kernel: [   69.146526] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 20: dc2030000001085b
Jan 26 09:58:07 Pink kernel: [   69.146533] mce: [Hardware Error]: TSC 52c0cc15a4 ADDR 7ffcffffff40 SYND 5b240204 IPID 2e00000000 
Jan 26 09:58:07 Pink kernel: [   69.146545] mce: [Hardware Error]: PROCESSOR 2:810f10 TIME 1611655087 SOCKET 0 APIC 0 microcode 8101016
Jan 26 09:58:08 Pink kernel: [   70.150550] [drm:vcn_v1_0_set_powergating_state [amdgpu]] *ERROR* VCN decode not responding, trying to reset the VCPU!!!
Jan 26 09:58:10 Pink kernel: [   71.172270] [drm:vcn_v1_0_set_powergating_state [amdgpu]] *ERROR* VCN decode not responding, trying to reset the VCPU!!!
Jan 26 09:58:11 Pink kernel: [   72.193987] [drm:vcn_v1_0_set_powergating_state [amdgpu]] *ERROR* VCN decode not responding, trying to reset the VCPU!!!
Jan 26 09:58:12 Pink kernel: [   73.215700] [drm:vcn_v1_0_set_powergating_state [amdgpu]] *ERROR* VCN decode not responding, trying to reset the VCPU!!!
Jan 26 09:58:13 Pink kernel: [   74.237417] [drm:vcn_v1_0_set_powergating_state [amdgpu]] *ERROR* VCN decode not responding, trying to reset the VCPU!!!
Jan 26 09:58:14 Pink kernel: [   75.259129] [drm:vcn_v1_0_set_powergating_state [amdgpu]] *ERROR* VCN decode not responding, trying to reset the VCPU!!!
Jan 26 09:58:15 Pink kernel: [   76.280848] [drm:vcn_v1_0_set_powergating_state [amdgpu]] *ERROR* VCN decode not responding, trying to reset the VCPU!!!
Jan 26 09:58:16 Pink kernel: [   77.302559] [drm:vcn_v1_0_set_powergating_state [amdgpu]] *ERROR* VCN decode not responding, trying to reset the VCPU!!!
Jan 26 09:58:17 Pink kernel: [   78.324274] [drm:vcn_v1_0_set_powergating_state [amdgpu]] *ERROR* VCN decode not responding, trying to reset the VCPU!!!
Jan 26 09:58:18 Pink kernel: [   79.345988] [drm:vcn_v1_0_set_powergating_state [amdgpu]] *ERROR* VCN decode not responding, trying to reset the VCPU!!!
Jan 26 09:58:18 Pink kernel: [   79.366046] [drm:vcn_v1_0_set_powergating_state [amdgpu]] *ERROR* VCN decode not responding, giving up!!!
Jan 26 09:58:18 Pink kernel: [   79.366067] [drm:amdgpu_device_ip_set_powergating_state [amdgpu]] *ERROR* set_powergating_state of IP block <vcn_v1_0> failed -1
Jan 26 09:58:18 Pink kernel: [   79.366137] amdgpu 0000:06:00.0: amdgpu: [mmhub0] no-retry page fault (src_id:0 ring:16 vmid:0 pasid:0, for process  pid 0 thread  pid 0)
Jan 26 09:58:18 Pink kernel: [   79.366150] amdgpu 0000:06:00.0: amdgpu:   in page starting at address 0x0000000000000000 from client 18
Jan 26 09:58:18 Pink kernel: [   79.366159] amdgpu 0000:06:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000420
Jan 26 09:58:18 Pink kernel: [   79.366166] amdgpu 0000:06:00.0: amdgpu: ^I Faulty UTCL2 client ID: VCN (0x2)
Jan 26 09:58:18 Pink kernel: [   79.366172] amdgpu 0000:06:00.0: amdgpu: ^I MORE_FAULTS: 0x0
Jan 26 09:58:18 Pink kernel: [   79.366177] amdgpu 0000:06:00.0: amdgpu: ^I WALKER_ERROR: 0x0
Jan 26 09:58:18 Pink kernel: [   79.366182] amdgpu 0000:06:00.0: amdgpu: ^I PERMISSION_FAULTS: 0x2
Jan 26 09:58:18 Pink kernel: [   79.366187] amdgpu 0000:06:00.0: amdgpu: ^I MAPPING_ERROR: 0x0
Jan 26 09:58:18 Pink kernel: [   79.366192] amdgpu 0000:06:00.0: amdgpu: ^I RW: 0x0
Jan 26 09:58:19 Pink kernel: [   80.405920] amdgpu 0000:06:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on vcn_dec (-110).
Jan 26 09:58:20 Pink kernel: [   81.429922] amdgpu 0000:06:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on vcn_enc0 (-110).
Jan 26 09:58:21 Pink kernel: [   82.453922] amdgpu 0000:06:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on vcn_enc1 (-110).
Jan 26 09:59:18 Pink kernel: Kernel logging (proc) stopped.
Jan 26 09:59:18 Pink kernel: Kernel log daemon terminating.
Jan 26 10:00:07 Pink kernel: klogd 1.5.1, log source = /proc/kmsg started.
Comment 6 Alex Deucher 2021-01-26 15:06:24 UTC
Does it work properly on 64bit?
Comment 7 bolando 2021-01-26 16:29:11 UTC
Yes. Maybe the Ubuntu kernel applies some patch? Or does the AMDGPU driver only work on x86_64? The radeon driver works well on a 32-bit kernel: I have Radeon graphics cards with Caicos and Oland chipsets, and both are driven perfectly on LFS-10.0 (i686).
Comment 8 Alex Deucher 2021-01-26 17:04:37 UTC
(In reply to bolando from comment #7)
> Yes,maybe Ubuntu kernel applyed some patch? Otherwise AMDGPU driver only
> worked on X86_64 ? The radeon drivers worked well on 32bit kernel. I have
> Caicos and Oland chipset radeon graphic cards,all be drived perfect on
> LFS-10.0 i686 arch!

It should work in theory.  That said, we don't do regular validation of 32 bit any more.
Comment 9 bolando 2021-01-26 17:37:04 UTC
(In reply to Alex Deucher from comment #8)
> (In reply to bolando from comment #7)
> > Yes,maybe Ubuntu kernel applyed some patch? Otherwise AMDGPU driver only
> > worked on X86_64 ? The radeon drivers worked well on 32bit kernel. I have
> > Caicos and Oland chipset radeon graphic cards,all be drived perfect on
> > LFS-10.0 i686 arch!
> 
> It should work in theory.  That said, we don't do regular validation of 32
> bit any more.

Thanks for your reply. Since the driver is developed for general-purpose use, AMDGPU should work on 32-bit architectures, but I don't know what is wrong with it. For me, the AMDGPU driver lacks kfd, the APU node topology, and amdgpudrmfb (the fb0 interface). I want to know how to fix it. The firmware and kernel are nearly the newest; 5.10.9 does more when resetting the GPU and shows more debug information than 5.8.0.
Comment 10 Alex Deucher 2021-01-26 17:39:49 UTC
Does setting CONFIG_HSA_AMD=n fix it?
Comment 11 bolando 2021-01-26 17:49:24 UTC
There is no HSA_AMD option; it is only available for 64-bit kernels.
Comment 12 Michel Dänzer 2021-01-27 08:50:36 UTC
I'd recommend running a 64-bit kernel, even if all user-space is 32-bit.
Comment 13 bolando 2021-01-27 10:54:25 UTC
(In reply to Michel Dänzer from comment #12)
> I'd recommend running a 64-bit kernel, even if all user-space is 32-bit.

I have just finished building LFS-10.0 after several weeks; it was very hard work. If I enable 64-bit kernel support, I will need to recompile everything on LFS-10.0 over the coming weeks. Are there any other solutions for the 32-bit architecture?
Comment 14 Michel Dänzer 2021-01-27 10:57:13 UTC
> If enable 64bit kernel support,I need to recompile everything on LFS10.0 in
> next weeks.

You shouldn't. 32-bit user-space works fine with a 64-bit kernel.
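To see which combination a given system is running, a small Python sketch along these lines can help. It compares the machine name reported by the kernel with the pointer size of the running process. KERNEL_64BIT and the helper names are hypothetical, and the set of machine names is illustrative rather than exhaustive. One caveat: a process started under a 32-bit personality (for example via linux32) is told the machine is i686 even on a 64-bit kernel, so treat the result as a hint.

```python
import os
import struct

# Machine names treated as 64-bit here. This set is an illustrative
# assumption, not an exhaustive list.
KERNEL_64BIT = {"x86_64", "amd64", "aarch64", "ppc64", "ppc64le"}


def userspace_bits():
    """Pointer size of the running Python interpreter, in bits."""
    return struct.calcsize("P") * 8


def kernel_is_64bit(machine):
    """True if the kernel-reported machine name is in the 64-bit set."""
    return machine in KERNEL_64BIT


def describe(machine, bits):
    """Human-readable summary of kernel and user-space bitness."""
    kernel = "64-bit" if kernel_is_64bit(machine) else "32-bit"
    return f"{kernel} kernel ({machine}), {bits}-bit user-space process"


if __name__ == "__main__":
    # os.uname() is available on Unix-like systems only.
    print(describe(os.uname().machine, userspace_bits()))
```

On a 32-bit user-space booted with an x86_64 kernel, a 32-bit Python build would be expected to report a 64-bit kernel together with a 32-bit process.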
Comment 15 bolando 2021-01-27 14:07:31 UTC
(In reply to Michel Dänzer from comment #14)
> > If enable 64bit kernel support,I need to recompile everything on LFS10.0 in
> > next weeks.
> 
> You shouldn't. 32-bit user-space works fine with a 64-bit kernel.

Thanks for the reply. My LFS-10.0 is built for 32-bit, so I couldn't select a 64-bit kernel config when recompiling the Linux kernel; only a 32-bit kernel could be built. I really want to know: will the 32-bit architecture no longer be supported by the AMDGPU driver from now on?
Comment 16 Alex Deucher 2021-01-27 15:21:38 UTC
(In reply to bolando from comment #15)
> (In reply to Michel Dänzer from comment #14)
> > > If enable 64bit kernel support,I need to recompile everything on LFS10.0 in
> > > next weeks.
> > 
> > You shouldn't. 32-bit user-space works fine with a 64-bit kernel.
> 
> Thanks for reply.My LFS-10.0 is built for 32bit,I couldn't select 64bit
> kernel config when recompile the Linux kernel.Only 32bit kernel could
> build.I really want to know that the 32bit arch won't be supported by AMDGPU
> drivers from now on?

Anecdotally it works for some people.  It may depend on the platform and device.
Comment 17 bolando 2021-01-27 15:41:47 UTC
(In reply to Alex Deucher from comment #16)
> (In reply to bolando from comment #15)
> > (In reply to Michel Dänzer from comment #14)
> > > > If enable 64bit kernel support,I need to recompile everything on LFS10.0 in
> > > > next weeks.
> > > 
> > > You shouldn't. 32-bit user-space works fine with a 64-bit kernel.
> > 
> > Thanks for reply.My LFS-10.0 is built for 32bit,I couldn't select 64bit
> > kernel config when recompile the Linux kernel.Only 32bit kernel could
> > > build.I really want to know that the 32bit arch won't be supported by AMDGPU
> > > drivers from now on?
> 
> Anecdotally it works for some people.  It may depend on the platform and
> device.
Are you from the AMDGPU driver development team? I reviewed the 5.10.11 kernel changelog and found your name!
So it only works anecdotally on 32-bit systems? It seems that few people use 32-bit systems. The LFS book doesn't recommend building an x86_64 system, so I built a 32-bit system. Newer kernels do work better with the AMDGPU driver, so hopefully one day I can use the Raven APU with a new Linux kernel. Thanks a lot!
Comment 18 bolando 2021-01-28 04:36:25 UTC
I have compiled an x86_64 kernel for my LFS-10.0. Booting with the x86_64 kernel, the screen froze again when the amdgpu driver was loaded. I checked the kernel log; everything seems OK, but there is no amdgpudrmfb. I tried to start X11, but it failed with no suitable modes.
Comment 19 Alex Deucher 2021-01-28 04:48:11 UTC
Please attach your xorg log and dmesg output.  Note that if you want an fbdev interface for the console, you need to enable CONFIG_DRM_FBDEV_EMULATION=y in your config.
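As a quick way to check the options mentioned in this report, here is a minimal Python sketch that reads a kernel .config and reports how each option is set. OPTIONS, SAMPLE, parse_config and report are hypothetical names, and SAMPLE is made-up illustrative input, not the reporter's config. The .config path is passed on the command line; if the kernel was built with CONFIG_IKCONFIG_PROC, the same text can be obtained by decompressing /proc/config.gz.

```python
import re
import sys

# Options discussed in this report; the list is illustrative.
OPTIONS = [
    "CONFIG_64BIT",
    "CONFIG_DRM_AMDGPU",
    "CONFIG_DRM_FBDEV_EMULATION",
    "CONFIG_HSA_AMD",
]

# Hypothetical .config fragment used when no file is given.
SAMPLE = """\
CONFIG_DRM_AMDGPU=m
# CONFIG_DRM_FBDEV_EMULATION is not set
"""


def parse_config(text):
    """Map option name to its value ('y', 'm', a literal, or 'n' if unset)."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        unset = re.fullmatch(r"# (CONFIG_\w+) is not set", line)
        if unset:
            values[unset.group(1)] = "n"
            continue
        assigned = re.fullmatch(r"(CONFIG_\w+)=(.*)", line)
        if assigned:
            values[assigned.group(1)] = assigned.group(2)
    return values


def report(text, options=OPTIONS):
    """Return the value of each requested option, or 'absent' if not listed."""
    values = parse_config(text)
    return {name: values.get(name, "absent") for name in options}


if __name__ == "__main__":
    if len(sys.argv) > 1:
        with open(sys.argv[1]) as handle:
            config_text = handle.read()
    else:
        config_text = SAMPLE
    for name, value in report(config_text).items():
        print(f"{name}: {value}")
```

An option reported as "absent" does not appear in the file at all, which is what happens when its dependencies are not met for the selected architecture.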
Comment 20 bolando 2021-01-28 04:51:12 UTC
Everything is OK! I recompiled the 5.10.8 x86_64 kernel, and AMDGPU now loads successfully! It seems the 32-bit kernel is not supported now. I think the AMDGPU driver needs IOMMU and HSA support, which the 32-bit kernel does not provide.
Thanks to everyone who replied and helped me! Thanks a lot!
Comment 21 bolando 2021-01-28 05:41:18 UTC
(In reply to Michel Dänzer from comment #14)
> > If enable 64bit kernel support,I need to recompile everything on LFS10.0 in
> > next weeks.
> 
> You shouldn't. 32-bit user-space works fine with a 64-bit kernel.

Your advice was very effective! I used Ubuntu to compile the x86_64 kernel. Thanks!
