Bug 208489

Summary: amdgpu: kernel oops when overclocking Vega M GPU (i7-8809G)
Product: Drivers
Reporter: crab2313
Component: Video(DRI - non Intel)
Assignee: drivers_video-dri
Status: RESOLVED CODE_FIX
Severity: normal
CC: crab2313
Priority: P1
Hardware: x86-64
OS: Linux
Kernel Version: 5.7.7
Subsystem:
Regression: No
Bisected commit-id:
Attachments: full kernel dmesg

Description crab2313 2020-07-07 21:24:01 UTC
Created attachment 290165: full kernel dmesg

CPU: Intel(R) Core(TM) i7-8809G CPU @ 3.10GHz
Intel Hades Canyon NUC Kit


❯ cat /sys/bus/pci/drivers/amdgpu/0000:01:00.0/pp_od_clk_voltage
OD_SCLK:
0:        225MHz        750mV
1:        400MHz        750mV
2:        535MHz        750mV
3:        715MHz        750mV
4:        960MHz        750mV
5:       1080MHz        750mV
6:       1140MHz        750mV
7:       1250MHz        750mV
OD_MCLK:
0:        300MHz        750mV
1:        500MHz        750mV
2:        800MHz        800mV
OD_RANGE:
SCLK:     225MHz       1600MHz
MCLK:     300MHz       1000MHz
VDDC:     750mV         750mV
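
As a sanity check on the reproduction further down, the overdrive command 's 7 1250 750' can be compared against the OD_RANGE limits printed above. The sketch below does that in Python. The helper names (parse_ranges, request_in_range) are made up for illustration, and the table text is a trimmed copy of the output above. It assumes the 's <level> <MHz> <mV>' form of the command.

```python
import re

# Trimmed copy of the pp_od_clk_voltage output shown in this report.
OD_TABLE = """\
OD_SCLK:
7:       1250MHz        750mV
OD_RANGE:
SCLK:     225MHz       1600MHz
MCLK:     300MHz       1000MHz
VDDC:     750mV         750mV
"""


def parse_ranges(text):
    """Return {'SCLK': (lo, hi), ...} from the OD_RANGE section (hypothetical helper)."""
    ranges = {}
    in_range = False
    for line in text.splitlines():
        line = line.strip()
        if line == "OD_RANGE:":
            in_range = True
            continue
        if in_range:
            m = re.match(r"(\w+):\s+(\d+)\w+\s+(\d+)\w+$", line)
            if m:
                ranges[m.group(1)] = (int(m.group(2)), int(m.group(3)))
    return ranges


def request_in_range(command, ranges):
    """Check an 's'/'m' overdrive command against the advertised limits."""
    kind, _level, clock, voltage = command.split()
    clock_key = {"s": "SCLK", "m": "MCLK"}[kind]
    clk_lo, clk_hi = ranges[clock_key]
    vdd_lo, vdd_hi = ranges["VDDC"]
    return clk_lo <= int(clock) <= clk_hi and vdd_lo <= int(voltage) <= vdd_hi


ranges = parse_ranges(OD_TABLE)
print(request_in_range("s 7 1250 750", ranges))
```

The requested values are inside the advertised range and are identical to what level 7 already holds, so the write does not ask for anything the table does not already contain.
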

After running the following script:

#!/bin/sh
sudo sh -c "echo 's 7 1250 750' > /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/pp_od_clk_voltage"
sudo sh -c "echo 'c' > /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/pp_od_clk_voltage"

the kernel oopses. Relevant part of dmesg:


[    4.932714] Bluetooth: RFCOMM TTY layer initialized
[    4.932722] Bluetooth: RFCOMM socket layer initialized
[    4.932725] Bluetooth: RFCOMM ver 1.11
[    9.120298] rfkill: input handler enabled
[    9.922018] fuse: init (API version 7.31)
[   10.492078] rfkill: input handler disabled
[   12.680512] wlp6s0: authenticate with 50:d2:f5:f1:12:ed
[   12.690803] wlp6s0: send auth to 50:d2:f5:f1:12:ed (try 1/3)
[   12.728470] wlp6s0: authenticated
[   12.728864] wlp6s0: associate with 50:d2:f5:f1:12:ed (try 1/3)
[   12.759696] wlp6s0: RX AssocResp from 50:d2:f5:f1:12:ed (capab=0x31 status=0 aid=2)
[   12.762966] wlp6s0: associated
[   13.100624] IPv6: ADDRCONF(NETDEV_CHANGE): wlp6s0: link becomes ready
[  606.958453] BUG: unable to handle page fault for address: ffff9032a4c849a4
[  606.958455] #PF: supervisor read access in kernel mode
[  606.958456] #PF: error_code(0x0000) - not-present page
[  606.958457] PGD 173c01067 P4D 173c01067 PUD 0 
[  606.958459] Oops: 0000 [#1] PREEMPT SMP PTI
[  606.958460] CPU: 7 PID: 2337 Comm: bash Not tainted 5.7.7-zen1-1-zen #1
[  606.958461] Hardware name: Intel Corporation NUC8i7HVK/NUC8i7HVB, BIOS HNKBLi70.86A.0054.2019.0214.1350 02/14/2019
[  606.958528] RIP: 0010:phm_find_closest_vddci+0x3b/0x60 [amdgpu]
[  606.958529] Code: c0 eb 09 48 83 c0 01 48 39 d0 74 19 44 0f b7 44 c3 0c 89 c5 66 41 39 f0 72 e9 44 89 c0 5b 5d c3 bd ff ff ff ff 0f 1f 44 00 00 <44> 0f b7 44 eb 0c 5b 5d 44 89 c0 c3 48 c7 c6 f0 c1 96 c0 48 c7 c7
[  606.958530] RSP: 0018:ffffa3ecc18ff948 EFLAGS: 00010246
[  606.958531] RAX: 00000000000002ee RBX: ffff902aa4c849a0 RCX: 0000000000000008
[  606.958532] RDX: 0000000000000000 RSI: 0000000000000226 RDI: ffff902aa4c849a0
[  606.958532] RBP: 00000000ffffffff R08: ffffa3ecc18ff9d4 R09: 0000000000000029
[  606.958533] R10: 000000000000e401 R11: 0000000000000000 R12: ffff902aa6861600
[  606.958534] R13: ffff902aa4c84000 R14: ffff902aa4c85301 R15: ffffa3ecc18ff9d4
[  606.958534] FS:  00007fe8233c0b80(0000) GS:ffff902aaedc0000(0000) knlGS:0000000000000000
[  606.958535] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  606.958536] CR2: ffff9032a4c849a4 CR3: 0000000467468005 CR4: 00000000003606e0
[  606.958536] Call Trace:
[  606.958587]  vegam_get_dependency_volt_by_clk.isra.0+0x8e/0x220 [amdgpu]
[  606.958637]  vegam_populate_all_graphic_levels+0x26a/0x960 [amdgpu]
[  606.958686]  smu7_set_power_state_tasks+0x77c/0x12b0 [amdgpu]
[  606.958734]  phm_set_power_state+0x5a/0x80 [amdgpu]
[  606.958784]  psm_adjust_power_state_dynamic+0xca/0x1d0 [amdgpu]
[  606.958831]  hwmgr_handle_task+0x49/0xf0 [amdgpu]
[  606.958882]  pp_dpm_dispatch_tasks+0x3a/0x60 [amdgpu]
[  606.958915]  amdgpu_set_pp_od_clk_voltage+0x3cb/0x490 [amdgpu]
[  606.958921]  kernfs_fop_write+0xce/0x1b0
[  606.958923]  vfs_write+0x10a/0x420
[  606.958925]  __x64_sys_write+0x6d/0xf0
[  606.958926]  do_syscall_64+0x4e/0x160
[  606.958928]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  606.958930] RIP: 0033:0x7fe823523b57
[  606.958931] Code: 0c 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
[  606.958931] RSP: 002b:00007ffea3ba5e88 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  606.958932] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fe823523b57
[  606.958933] RDX: 0000000000000002 RSI: 00005648e5dda620 RDI: 0000000000000001
[  606.958934] RBP: 00005648e5dda620 R08: 000000000000000a R09: 0000000000000001
[  606.958934] R10: 00005648e5d20870 R11: 0000000000000246 R12: 0000000000000002
[  606.958935] R13: 00007fe8235f4500 R14: 0000000000000002 R15: 00007fe8235f4700
[  606.958936] Modules linked in: ccm fuse rfcomm xt_CHECKSUM xt_MASQUERADE xt_conntrack cmac algif_hash ipt_REJECT nf_reject_ipv4 algif_skcipher af_alg xt_tcpudp ip6table_mangle ip6table_nat iptable_mangle iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter tun mousedev input_leds hid_generic joydev usbhid hid xpad ff_memless bridge stp llc bnep msr intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal intel_powerclamp coretemp iwlmvm kvm_intel snd_hda_codec_realtek kvm snd_hda_codec_generic iTCO_wdt mac80211 iTCO_vendor_support irqbypass 8250_dw ledtrig_audio snd_hda_codec_hdmi mei_hdcp nls_iso8859_1 tps6598x crct10dif_pclmul typec libarc4 wmi_bmof crc32_pclmul nls_cp437 ghash_clmulni_intel intel_wmi_thunderbolt vfat aesni_intel snd_hda_intel btusb fat btrtl snd_intel_dspcfg iwlwifi btbcm crypto_simd cryptd glue_helper snd_hda_codec intel_cstate btintel intel_uncore snd_hda_core intel_rapl_perf pcspkr
[  606.958955]  e1000e i2c_i801 cfg80211 snd_hwdep bluetooth igb snd_pcm mei_me ecdh_generic intel_lpss_pci rfkill dca snd_timer ecc intel_lpss mei idma64 intel_pch_thermal snd tpm_crb soundcore wmi tpm_tis i2c_multi_instantiate tpm_tis_core evdev tpm rng_core mac_hid tcp_bbr sch_cake sg crypto_user ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 sdhci_pci cqhci xhci_pci sdhci crc32c_intel xhci_hcd mmc_core amdgpu gpu_sched i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops cec rc_core drm agpgart
[  606.958968] CR2: ffff9032a4c849a4
[  606.958970] ---[ end trace d28ac9f0a176b773 ]---
[  606.959020] RIP: 0010:phm_find_closest_vddci+0x3b/0x60 [amdgpu]
[  606.959021] Code: c0 eb 09 48 83 c0 01 48 39 d0 74 19 44 0f b7 44 c3 0c 89 c5 66 41 39 f0 72 e9 44 89 c0 5b 5d c3 bd ff ff ff ff 0f 1f 44 00 00 <44> 0f b7 44 eb 0c 5b 5d 44 89 c0 c3 48 c7 c6 f0 c1 96 c0 48 c7 c7
[  606.959022] RSP: 0018:ffffa3ecc18ff948 EFLAGS: 00010246
[  606.959023] RAX: 00000000000002ee RBX: ffff902aa4c849a0 RCX: 0000000000000008
[  606.959023] RDX: 0000000000000000 RSI: 0000000000000226 RDI: ffff902aa4c849a0
[  606.959024] RBP: 00000000ffffffff R08: ffffa3ecc18ff9d4 R09: 0000000000000029
[  606.959024] R10: 000000000000e401 R11: 0000000000000000 R12: ffff902aa6861600
[  606.959025] R13: ffff902aa4c84000 R14: ffff902aa4c85301 R15: ffffa3ecc18ff9d4
[  606.959026] FS:  00007fe8233c0b80(0000) GS:ffff902aaedc0000(0000) knlGS:0000000000000000
[  606.959027] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  606.959027] CR2: ffff9032a4c849a4 CR3: 0000000467468005 CR4: 00000000003606e0
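
The faulting address can be reproduced from the register dump. The bytes marked in the Code line, `<44> 0f b7 44 eb 0c`, decode to `movzwl 0xc(%rbx,%rbp,8),%r8d`, and the preceding `bd ff ff ff ff` is `mov $0xffffffff,%ebp`. The Python sketch below recomputes that effective address from RBX and RBP and compares it with CR2. The helper name effective_address is made up for illustration. Reading this as an index of -1 into an empty table is an assumption drawn from the registers; the log does not state it.

```python
MASK64 = (1 << 64) - 1


def effective_address(base, index, scale, disp):
    """x86-64 base + index*scale + displacement, wrapped to 64 bits."""
    return (base + index * scale + disp) & MASK64


# Values copied from the oops above.
RBX = 0xFFFF902AA4C849A0  # base register of the faulting load
RBP = 0x00000000FFFFFFFF  # index register: 32-bit -1, zero-extended
CR2 = 0xFFFF9032A4C849A4  # faulting address reported by the CPU

fault = effective_address(RBX, RBP, 8, 0x0C)
print(hex(fault), fault == CR2)

# With a sign-extended -1 the load would have landed just before the
# table; the zero-extended index moves it 2**32 * 8 bytes further on.
signed = effective_address(RBX, -1, 8, 0x0C)
print(hex(fault - signed))
```

The computed address matches CR2, and it lies exactly 0x800000000 bytes past where an index of -1 would point. That fits a lookup that fell back to entry `count - 1` on a table with zero entries, using a 32-bit unsigned index.
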
Comment 1 crab2313 2020-08-01 17:02:17 UTC
OK. The fix has been pushed upstream.