Bug 199653 - [AMDGPU][DC] BUG: unable to handle kernel NULL pointer dereference (trace decoded)
Status: RESOLVED INVALID
Alias: None
Product: Drivers
Classification: Unclassified
Component: Video(DRI - non Intel)
Hardware: All Linux
Importance: P1 high
Assignee: drivers_video-dri
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-05-08 09:09 UTC by Marcus Husar
Modified: 2018-08-17 11:37 UTC

See Also:
Kernel Version: drm-next-4.18-wip (agd5f, AMDGPU)
Subsystem:
Regression: No
Bisected commit-id:


Attachments
Call trace of crash decoded with decode_stacktrace.sh (7.15 KB, text/plain)
2018-05-08 09:09 UTC, Marcus Husar
Original call trace of crash (5.11 KB, text/plain)
2018-05-08 09:10 UTC, Marcus Husar
Journal log with amdgpu.dc_log=1 drm.debug=6 (4.42 MB, text/plain)
2018-05-08 09:13 UTC, Marcus Husar
Call trace of i2c designware warning (taints kernel) (2.71 KB, text/plain)
2018-05-08 09:13 UTC, Marcus Husar

Description Marcus Husar 2018-05-08 09:09:59 UTC
Created attachment 275827
Call trace of crash decoded with decode_stacktrace.sh

kernel: BUG: unable to handle kernel NULL pointer dereference at 00000000000002e4

This happens multiple times a day on my machine and leads to a complete system freeze. Yesterday I was lucky and got a stack trace.

It mostly happens while browsing the web with Firefox (WebRender enabled, XWayland, GNOME Shell) when the cursor moves or rotates, but it can happen anywhere and at any time.

The kernel in use is built from branch drm-next-4.18-wip@92fb374 of Alex Deucher's tree (agd5f, AMD). See: git://people.freedesktop.org/~agd5f/linux.

My machine:
 * Hardware name: Acer Swift SF315-41/Becks_RR, BIOS V1.04 01/09/2018
 * Ryzen Mobile 2500U
 * Firmware: VCN: 1.73 (latest available version)
 * My kernel is tainted because the i2c designware driver emits a warning during
   boot. This should be unrelated to AMDGPU
   (see attachment i2c_designware_trace.txt).

Please ask if anything else is needed.
Comment 1 Marcus Husar 2018-05-08 09:10:46 UTC
Created attachment 275829
Original call trace of crash
Comment 2 Marcus Husar 2018-05-08 09:13:12 UTC
Created attachment 275831
Journal log with amdgpu.dc_log=1 drm.debug=6
Comment 3 Marcus Husar 2018-05-08 09:13:52 UTC
Created attachment 275833
Call trace of i2c designware warning (taints kernel)
Comment 4 James Le Cuirot 2018-05-29 12:25:04 UTC
I have almost exactly the same hardware, but with the Ryzen 7 (2700U). I also get multiple freezes a day. Many thanks for the additional info; I never managed to capture any myself. I did try kdump, but it never triggers, even when forcing a kernel panic. I'm now running openSUSE 15.0 with kernels from the "stable" repository (currently 4.16.12-1.g39c7522) and recent Mesa 18.2 prerelease builds.
Comment 5 James Le Cuirot 2018-06-16 16:30:05 UTC
After seeing amdgpu.vm_update_mode=3 mentioned in bug #199749, I tried it but it didn't help.
Comment 6 Andrey Grodzovsky 2018-06-16 18:23:03 UTC
Those two are unrelated bugs.
Comment 7 James Le Cuirot 2018-06-20 19:57:28 UTC
I had high hopes for 4.18-rc1, but alas it froze after a few hours. :( I know having the latest firmware is important, so I have that too. These openSUSE packages (some unofficial) are installed:

kernel-default-4.18.rc1-1.1.gfa9e020
kernel-firmware-20180606-35.1
libdrm2-2.4.99~git20180511-lp150.1.1
Mesa-18.2.0~git20180619-lp150.16.1
Comment 8 James Le Cuirot 2018-07-13 20:03:49 UTC
OP also filed a freedesktop.org bug report with more information.

https://bugs.freedesktop.org/104817

Still the same with 4.18-rc4. :(
Comment 9 Marcus Husar 2018-08-06 10:33:48 UTC
From the freedesktop.org bug:

It seems to me that this is in fact a CPU-related problem. I have not had any problems since July 25, and my system is pretty stable. What helped was adding idle=nomwait to my GRUB command line. This has fixed those problems for me.

Please try to add idle=nomwait to your GRUB command line. I think this bug can be closed.
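
For anyone unsure how to add the parameter, here is a minimal Python sketch of the edit. It assumes the distribution keeps its kernel parameters in a double-quoted GRUB_CMDLINE_LINUX_DEFAULT line (typically in /etc/default/grub); that location and the helper name add_kernel_param are assumptions for illustration, not something stated in this report. The function only transforms text and does not write to any file.

```python
import re


def add_kernel_param(grub_defaults: str, param: str = "idle=nomwait") -> str:
    """Return the GRUB defaults text with `param` appended to
    GRUB_CMDLINE_LINUX_DEFAULT, unless it is already present."""
    # Assumption: the value is wrapped in double quotes on a single line.
    pattern = re.compile(
        r'^(GRUB_CMDLINE_LINUX_DEFAULT=")([^"]*)(")', re.MULTILINE
    )

    def _append(match: re.Match) -> str:
        params = match.group(2).split()
        if param not in params:
            params.append(param)
        return match.group(1) + " ".join(params) + match.group(3)

    return pattern.sub(_append, grub_defaults)


# Hypothetical file content, used only to show the transformation.
example = 'GRUB_TIMEOUT=5\nGRUB_CMDLINE_LINUX_DEFAULT="quiet splash"\n'
print(add_kernel_param(example))
```

After changing the real file, the GRUB configuration still has to be regenerated with the distribution's own tool (for example grub2-mkconfig or update-grub, depending on the distribution) and the machine rebooted before the parameter takes effect.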
Comment 10 James Le Cuirot 2018-08-06 11:05:51 UTC
I added idle=nomwait recently and that has fixed it for me too. I thought I had already tried this, though I'm not sure. Perhaps there were two issues and the other one has since been fixed.
Comment 11 Marcus Husar 2018-08-17 11:37:07 UTC
See comments #9 and #10. The kernel parameter idle=nomwait fixed this bug for me. It seems to be a CPU-related problem.
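
To confirm that the parameter is active on the running kernel, the boot command line can be inspected. The sketch below is one possible check in Python; the helper name has_kernel_param is hypothetical, and reading /proc/cmdline assumes a Linux system with procfs mounted.

```python
def has_kernel_param(cmdline: str, param: str = "idle=nomwait") -> bool:
    """Return True if `param` appears as a whole word in the kernel command line."""
    return param in cmdline.split()


if __name__ == "__main__":
    try:
        # /proc/cmdline holds the parameters the running kernel was booted with.
        with open("/proc/cmdline") as handle:
            print(has_kernel_param(handle.read()))
    except OSError:
        print("could not read /proc/cmdline")
```

The check compares whole space-separated tokens, so a different value such as idle=poll does not count as a match.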
