Bug 208647 - persistent amdgpu: [mmhub] page faults
Summary: persistent amdgpu: [mmhub] page faults
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: Video(DRI - non Intel)
Hardware: All Linux
Importance: P1 normal
Assignee: drivers_video-dri
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-07-21 14:35 UTC by Jay Foad
Modified: 2020-09-28 15:18 UTC
CC List: 3 users

See Also:
Kernel Version: 5.4.0-42-generic
Subsystem:
Regression: No
Bisected commit-id:


Attachments
output of journalctl -b-5 -k (151.16 KB, text/plain)
2020-07-21 15:34 UTC, Jay Foad
dmesg on 5.9.0-RC7 (242.33 KB, text/plain)
2020-09-28 15:18 UTC, Stefan Winter

Description Jay Foad 2020-07-21 14:35:45 UTC
Whenever X is running I get persistent page faults like this:

Jul 21 15:19:16 jay-X470-AORUS-ULTRA-GAMING kernel: amdgpu 0000:0c:00.0: amdgpu: [mmhub] page fault (src_id:0 ring:169 vmid:0 pasid:0, for process  pid 0 thread  pid 0)
Jul 21 15:19:16 jay-X470-AORUS-ULTRA-GAMING kernel: amdgpu 0000:0c:00.0: amdgpu:   in page starting at address 0x00000000fffb0000 from client 18
Jul 21 15:19:16 jay-X470-AORUS-ULTRA-GAMING kernel: amdgpu 0000:0c:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00041F52
Jul 21 15:19:16 jay-X470-AORUS-ULTRA-GAMING kernel: amdgpu 0000:0c:00.0: amdgpu:          Faulty UTCL2 client ID: 0xf
Jul 21 15:19:16 jay-X470-AORUS-ULTRA-GAMING kernel: amdgpu 0000:0c:00.0: amdgpu:          MORE_FAULTS: 0x0
Jul 21 15:19:16 jay-X470-AORUS-ULTRA-GAMING kernel: amdgpu 0000:0c:00.0: amdgpu:          WALKER_ERROR: 0x1
Jul 21 15:19:16 jay-X470-AORUS-ULTRA-GAMING kernel: amdgpu 0000:0c:00.0: amdgpu:          PERMISSION_FAULTS: 0x5
Jul 21 15:19:16 jay-X470-AORUS-ULTRA-GAMING kernel: amdgpu 0000:0c:00.0: amdgpu:          MAPPING_ERROR: 0x1
Jul 21 15:19:16 jay-X470-AORUS-ULTRA-GAMING kernel: amdgpu 0000:0c:00.0: amdgpu:          RW: 0x1
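
For anyone triaging similar reports, the decoded fields in the log above can be reproduced from the raw GCVM_L2_PROTECTION_FAULT_STATUS value. The following is only a sketch: the bit positions in `FIELDS` are an assumption, not something stated in this report, and they were chosen because they reproduce the values the kernel printed for 0x00041F52. The function name `decode_fault_status` is hypothetical.

```python
# Assumed bit layout of GCVM_L2_PROTECTION_FAULT_STATUS as name: (shift, width).
# This layout is an assumption that matches the decoded values in the log
# excerpt; it should be checked against the register headers for the ASIC.
FIELDS = {
    "MORE_FAULTS": (0, 1),
    "WALKER_ERROR": (1, 3),
    "PERMISSION_FAULTS": (4, 4),
    "MAPPING_ERROR": (8, 1),
    "CID": (9, 9),  # printed by the kernel as "Faulty UTCL2 client ID"
    "RW": (18, 1),
}


def decode_fault_status(status: int) -> dict:
    """Extract each assumed bit field from the raw status word."""
    return {
        name: (status >> shift) & ((1 << width) - 1)
        for name, (shift, width) in FIELDS.items()
    }


if __name__ == "__main__":
    # Raw value taken from the log excerpt in this report.
    for name, value in decode_fault_status(0x00041F52).items():
        print(f"{name}: {value:#x}")
```

With the value from this report, the sketch yields the same six fields the kernel logged (client ID 0xf, WALKER_ERROR 0x1, PERMISSION_FAULTS 0x5, and so on), which is a quick way to sanity-check a status word copied from another machine.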

Sometimes I get several of these per second. Sometimes there are none for a few minutes.
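
To put numbers on "several per second", the kernel log can be tallied by timestamp. This is a minimal sketch that assumes the log lines look like the excerpt above (syslog-style timestamp followed by the `[mmhub] page fault` header line); the helper name `faults_per_second` is hypothetical.

```python
import re
import sys
from collections import Counter

# Matches only the first line of each fault report, in the format shown in
# the excerpt above, and captures the "Mon DD HH:MM:SS" timestamp.
FAULT_RE = re.compile(
    r"^(\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s.*amdgpu: \[mmhub\] page fault"
)


def faults_per_second(lines):
    """Count mmhub page fault header lines per one-second timestamp."""
    counts = Counter()
    for line in lines:
        match = FAULT_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    # Reads kernel log text from standard input.
    for stamp, count in sorted(faults_per_second(sys.stdin).items()):
        print(f"{stamp}  {count}")
```

It could be fed from the journal, for example by piping `journalctl -k` into the script (the file name is up to you). Note that sorting is by timestamp string, so output order is only chronological within a single month.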

If I boot into runlevel 3 (i.e. without starting X) I get one of these during boot, but then there are no more after that.

I'm running Ubuntu 20.04 but I also saw this on 18.04.

Kernel version is 5.4.0-42-generic but I also saw this with 5.3.0-51-generic.

I'm using the amdgpu-pro drivers.

Graphics card is a Navi 10.

Motherboard is a Gigabyte X470 AORUS ULTRA GAMING.

CPU is an AMD Ryzen 9 3900X.

A very similar-sounding bug was reported here: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1888116
Comment 1 Alex Deucher 2020-07-21 15:13:14 UTC
This is most likely a userspace issue (e.g., mesa).  The kernel driver is just the messenger.
Comment 2 Jay Foad 2020-07-21 15:15:07 UTC
Wouldn't there normally be a useful pid in the first line if it came from userspace?
Comment 3 Nicolai Hähnle 2020-07-21 15:22:12 UTC
Hi Alex, I asked Jay to report this because (1) the fact that there's a fault during boot is suspicious and points in the direction of this being the kernel's fault and (2) the fact that it's an *mmhub* fault is even more suspicious.

Certainly this seems to happen without Mesa video encode/decode activity, so it can't really be Mesa's (or any graphics driver's) fault.

Someone suggested that audio support also goes through mmhub and that it may be related. I have no idea if that's true.
Comment 4 Alex Deucher 2020-07-21 15:26:44 UTC
Please attach your full dmesg output and xorg log (if using X).
Comment 5 Jay Foad 2020-07-21 15:34:06 UTC
Created attachment 290439
output of journalctl -b-5 -k
Comment 6 Stefan Winter 2020-09-28 15:17:29 UTC
FWIW, this still (or again) happens on 5.9.0-RC7 with a Navi 10. It did not happen on 5.8.6 with a slightly different .config, though.

Attaching a full dmesg. Note that the page faults start very shortly after the snd_hda_intel initialization, which activates amdgpu.
Comment 7 Stefan Winter 2020-09-28 15:18:23 UTC
Created attachment 292697
dmesg on 5.9.0-RC7

dmesg
