Bug 205497 - Attempt to read amd gpu id causes a freeze
Summary: Attempt to read amd gpu id causes a freeze
Status: RESOLVED CODE_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: Video(DRI - non Intel)
Hardware: All Linux
Importance: P1 normal
Assignee: drivers_video-dri
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-11-12 05:20 UTC by Luya Tshimbalanga
Modified: 2019-11-30 01:12 UTC
CC List: 5 users

See Also:
Kernel Version: 5.3.9
Tree: Mainline
Regression: No


Attachments
Script from radeontop to read AMD gpu ids (646 bytes, application/x-shellscript)
2019-11-12 05:20 UTC, Luya Tshimbalanga
possible fix (1.06 KB, patch)
2019-11-12 14:54 UTC, Alex Deucher
possible fix (4.29 KB, patch)
2019-11-12 15:12 UTC, Alex Deucher
possible fix (1.50 KB, patch)
2019-11-14 16:41 UTC, Alex Deucher
dmesg from AMD Raven Ridge Ryzen 2500U (90.94 KB, text/plain)
2019-11-16 18:52 UTC, Luya Tshimbalanga
amdgpu firmware info (1.09 KB, text/plain)
2019-11-16 18:53 UTC, Luya Tshimbalanga
Screenshot of radeontop running with patched kernel (137.43 KB, image/png)
2019-11-16 19:01 UTC, Luya Tshimbalanga

Description Luya Tshimbalanga 2019-11-12 05:20:26 UTC
Created attachment 285871
Script from radeontop to read AMD gpu ids

Running a utility named radeontop on an AMD APU causes a freeze while it attempts to read amdgpu ids. The script is attached.
It would be nice to have a better method for identifying AMD GPU cards.
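
For tools that only need the PCI ids of AMD GPUs, one option that avoids touching GPU registers is to read the ids the kernel already exports through sysfs. The following is a minimal sketch, not the attached radeontop script: the function name list_amd_gpu_ids and the drm_root parameter are made up for illustration, and it assumes the usual /sys/class/drm/cardN/device/{vendor,device} layout.

```python
from pathlib import Path

AMD_VENDOR_ID = 0x1002  # PCI vendor ID assigned to AMD/ATI


def list_amd_gpu_ids(drm_root="/sys/class/drm"):
    """Return (card_name, device_id) pairs for AMD GPUs found under drm_root.

    Hypothetical helper for illustration; it only reads text files and
    never accesses GPU registers.
    """
    found = []
    for card in sorted(Path(drm_root).glob("card[0-9]*")):
        vendor_file = card / "device" / "vendor"
        device_file = card / "device" / "device"
        if not (vendor_file.is_file() and device_file.is_file()):
            # Entries such as card0-HDMI-A-1 do not expose these files here.
            continue
        vendor = int(vendor_file.read_text().strip(), 16)
        device = int(device_file.read_text().strip(), 16)
        if vendor == AMD_VENDOR_ID:
            found.append((card.name, device))
    return found


if __name__ == "__main__":
    for name, device_id in list_amd_gpu_ids():
        print(f"{name}: 1002:{device_id:04x}")
```

Calling list_amd_gpu_ids() with no arguments reads the live sysfs tree; passing another directory makes it easy to exercise the function against a fake tree.
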
Comment 1 albertogomezmarin 2019-11-12 08:29:19 UTC
It happens for me too with Vega integrated graphics. The system freezes completely with the utility running and no graphics load.
Comment 2 clst 2019-11-12 11:58:22 UTC
I think this might be a regression, since radeontop worked fine with 4.19 on my Acer Nitro with Ryzen 5 2500U Raven + Polaris RX 560.

The freezes are also not instant: I get anywhere from a few seconds to a few minutes before it hangs (this might depend on load).

Some more information might be here: https://github.com/clbr/radeontop/issues/87
Comment 3 Alex Deucher 2019-11-12 14:54:16 UTC
Created attachment 285881
possible fix

Assuming radeontop uses the info ioctl to query the registers, this patch should fix it. If it mmaps the register BAR directly, there's nothing you can do. Accessing registers while the gfx block is off will lead to garbage data and possibly hang the chip.
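
The hazard described above can be illustrated with a toy model. This is only a sketch of the general idea of keeping the gfx block powered for the duration of a register read; it is not the attached patch, and ToyGpu and all of its methods are hypothetical names that do not exist in the kernel or in libdrm.

```python
from contextlib import contextmanager


class ToyGpu:
    """Hypothetical stand-in for a GPU whose gfx block powers down when idle."""

    def __init__(self, registers):
        self._registers = dict(registers)
        self.gfx_powered = False   # start idle, with the gfx block switched off
        self._gfxoff_inhibit = 0   # number of callers currently forbidding gfxoff

    @contextmanager
    def gfxoff_disabled(self):
        """Keep the gfx block powered for the duration of the with-block."""
        self._gfxoff_inhibit += 1
        self.gfx_powered = True
        try:
            yield self
        finally:
            self._gfxoff_inhibit -= 1
            if self._gfxoff_inhibit == 0:
                self.gfx_powered = False  # free to power down again

    def read_register_raw(self, offset):
        """Unprotected read; the model raises where real hardware may
        return garbage or hang."""
        if not self.gfx_powered:
            raise RuntimeError("gfx block is off; register read is unsafe")
        return self._registers[offset]

    def read_registers(self, offsets):
        """Protected path: inhibit gfxoff, read, then release."""
        with self.gfxoff_disabled():
            return [self.read_register_raw(o) for o in offsets]


if __name__ == "__main__":
    gpu = ToyGpu({0x2004: 0x1234})
    print(gpu.read_registers([0x2004]))
```

In this model a caller that goes through read_registers is always safe, while a caller that uses read_register_raw directly (the analogue of mapping the BAR) gets no protection.
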
Comment 4 Alex Deucher 2019-11-12 15:12:06 UTC
Created attachment 285883
possible fix

Updated patch to handle cached registers properly.
Comment 5 V.I.S. 2019-11-12 16:52:23 UTC
Hi. Please add the patch for 4.19.x LTS kernels too.

Thanks.
Comment 6 Alex Deucher 2019-11-14 16:41:07 UTC
Created attachment 285923
possible fix

Better fix.
Comment 7 Trek 2019-11-15 08:45:31 UTC
As users reported, this bug should only affect kernels 5.2+.

By default, radeontop calls amdgpu_read_mm_registers, amdgpu_query_info and amdgpu_query_sensor_info, but a command-line option can force it to read the BAR from /dev/mem.

There is a kernel dump at https://github.com/clbr/radeontop/issues/87#issuecomment-529267244

Thank you for the patch, but I cannot test it because my hardware (KAVERI) is not affected.
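
To make the direct-BAR path concrete, the sketch below shows the mechanics of mapping a file at an address-like offset and reading one 32-bit value. It is not radeontop's code: read_reg32, base_addr and reg_offset are illustrative names, and the example is meant to be run against an ordinary file. Pointed at /dev/mem it would need root privileges and would bypass the driver entirely, so a driver-side fix to the ioctl path would not cover it.

```python
import mmap
import struct


def read_reg32(path, base_addr, reg_offset):
    """Read one little-endian 32-bit value at base_addr + reg_offset from path.

    Hypothetical helper for illustration only.
    """
    addr = base_addr + reg_offset
    granularity = mmap.ALLOCATIONGRANULARITY
    # mmap offsets must be a multiple of the allocation granularity,
    # so map from the aligned start and index into the mapping.
    map_start = addr - (addr % granularity)
    delta = addr - map_start
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), delta + 4,
                       access=mmap.ACCESS_READ, offset=map_start) as m:
            return struct.unpack_from("<I", m, delta)[0]
```

Nothing in this function knows whether the device behind the mapping is powered, which is why the direct path cannot be made safe from the reading side alone.
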
Comment 8 V.I.S. 2019-11-15 08:58:08 UTC
Please see https://github.com/lestofante/ksysguard-gpu/issues/4

The same issue occurs on the 4.19.x LTS kernel.
Comment 9 Trek 2019-11-15 09:07:57 UTC
Thanks, I was not aware of it. Could it be different hardware from the systems on which kernels 4.19/5.1 work?
Comment 10 V.I.S. 2019-11-15 09:15:11 UTC
AMD Ryzen 5 2600G + AMD RX560 (multiseat system); in my case the system froze after a few days on kernel 4.19.83.
Comment 11 Alex Deucher 2019-11-15 15:43:49 UTC
(In reply to Trek from comment #7)
> by default, radeontop calls amdgpu_read_mm_registers, amdgpu_query_info and
> amdgpu_query_sensor_info, but it can be forced by the command line to read
> BAR from /dev/mem

If you access the BAR directly you will likely have problems in certain power saving modes.

Can someone test the patch?
Comment 12 V.I.S. 2019-11-15 16:35:16 UTC
I need approximately 3-5 days for testing, because this bug is intermittent.
Comment 13 Luya Tshimbalanga 2019-11-16 03:32:05 UTC
(In reply to Alex Deucher from comment #11)
> (In reply to Trek from comment #7)
> > by default, radeontop calls amdgpu_read_mm_registers, amdgpu_query_info and
> > amdgpu_query_sensor_info, but it can be forced by the command line to read
> > BAR from /dev/mem
> 
> If you access the BAR directly you will likely have problems in certain
> power saving modes.
> 
> Can someone test the patch?

Currently building on https://copr.fedorainfracloud.org/coprs/luya/kernel-amgpu-gfxoff/build/1095660/
Comment 14 Trek 2019-11-16 06:30:13 UTC
(In reply to Alex Deucher from comment #11)
> If you access the BAR directly you will likely have problems in certain
> power saving modes.
Thank you, I'll add a warning message when accessing the BAR directly.
Comment 15 Luya Tshimbalanga 2019-11-16 18:52:37 UTC
Created attachment 285947
dmesg from AMD Raven Ridge Ryzen 2500U

dmesg from the latest kernel git snapshot
Comment 16 Luya Tshimbalanga 2019-11-16 18:53:30 UTC
Created attachment 285949
amdgpu firmware info

Firmware information for amdgpu as installed on the test system
Comment 17 Luya Tshimbalanga 2019-11-16 19:01:32 UTC
Created attachment 285951
Screenshot of radeontop running with patched kernel

Running radeontop with the patched test kernel, I can confirm that the patch fixes the freeze: it no longer occurs, and the card is correctly picked up.
Comment 18 Luya Tshimbalanga 2019-11-16 20:06:08 UTC
I read another bug report, https://bugzilla.kernel.org/show_bug.cgi?id=204689, taken from the amd-gfx mailing list. Could that issue be related?

Anyway, radeontop still runs with the patched kernel. There is no noticeable freeze; I tested with Blender rendering the old Ryzen CPU 3D model using GPU compute on rocm-opencl (which needs optimization compared to amdgpu-pro-opencl).

Alex, would it be possible to submit the patch to patchwork.kernel.org? Thanks.
Comment 19 Alex Deucher 2019-11-18 15:13:51 UTC
(In reply to Luya Tshimbalanga from comment #18)
> Reading another bug report on
> https://bugzilla.kernel.org/show_bug.cgi?id=204689 taken from amdgfx mailing
> list, could that issue related?

Not likely.
Comment 20 Luya Tshimbalanga 2019-11-30 01:12:24 UTC
I confirm the fix landed in kernel 5.4. Thanks, Alex, for the quick investigation. Closing this report.
