Created attachment 257353 [details] lspci output
I have a Gentoo system with kernel 4.12.0, and after a while the display goes black. After that the system is still accessible via ssh, but any attempt to restart the X server results in a freeze. I have to reset the machine to get a working system again. This happens very often, at varying intervals.
The video card is an AMD Radeon R9 380 (Sapphire Nitro):
01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Tonga PRO [Radeon R9 285/380] (rev f1)
I have downloaded and tested the new firmware from http://people.freedesktop.org/~agd5f/radeon_ucode/tonga/
It seems to be a problem with the amdgpu driver. I can see a variety of errors in dmesg:
[ 8.088336] AMD-Vi: Event logged [
[ 8.088337] IO_PAGE_FAULT device=01:00.0 domain=0x000f address=0x000000f4001a8300 flags=0x0010]
[ 8.642870] amdgpu: [powerplay] Can't find requested voltage id in vdd_dep_on_sclk table!
This message seems to be the last before the video goes black.
[drm:0xffffffffa00d4f5d] *ERROR* Couldn't read SADs: 0
I can provide any additional information required. Thanks in advance for your help.
Created attachment 257355 [details] dmesg output
Created attachment 257357 [details] configuration used in the kernel
Created attachment 257359 [details] lspci verbose output for the vga card
Created attachment 257361 [details] kernel.log (fragment)
Created attachment 257363 [details] some debug msgs for amdgpu powerplay code
I don't know if this is useful, but I added some extra prints to drivers/gpu/drm/amd/powerplay/hwmgr/hwmgr.c to see which ID it is looking for. It seems to be looking for ID 65281 (int) and can't find it.
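A side note on the number above: the debug print shows the ID in decimal, which hides its structure. The short Python sketch below only converts the value from the comment to hex; the variable names are mine. My assumption, which should be checked against the driver headers, is that IDs of the form 0xffNN are placeholder (virtual) voltage IDs rather than real voltages, which would fit the later remark that the message is harmless.

```python
# Value reported by the extra debug print added to hwmgr.c (from the comment above).
requested_id = 65281

# Show the same ID in hex, the form in which driver constants are usually written.
print(hex(requested_id))  # 0xff01

# Split into high and low byte. Reading 0xff as a "virtual ID" marker is an
# assumption on my part, not something the log confirms.
high_byte, low_byte = divmod(requested_id, 256)
print(hex(high_byte), low_byte)  # 0xff 1
```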
Is this a regression from older kernel versions? If yes, can you bisect?
(In reply to Olaf H B from comment #0)
> [ 8.642870] amdgpu: [powerplay] Can't find requested voltage id in
> vdd_dep_on_sclk table!
This is harmless.
(In reply to Michel Dänzer from comment #6)
> Is this a regression from older kernel versions? If yes, can you bisect?
I don't think this is a regression, because I had similar issues with kernel 4.11.
I got another freeze a few minutes ago. This appears in kernel.log (complete file in the attachment):
Jul 6 13:18:35 hiperion kernel: [47523.933893] general protection fault: 0000 [#1] SMP
Jul 6 13:18:35 hiperion kernel: [47523.934043] Modules linked in: kvm_amd kvm irqbypass aesni_intel aes_x86_64 amdgpu crypto_simd cryptd glue_helper input_leds fam15h_power k10temp mfd_core backlight ttm acpi_cpufreq
Jul 6 13:18:35 hiperion kernel: [47523.934565] CPU: 2 PID: 12681 Comm: Timer Not tainted 4.12.0 #3
Jul 6 13:18:35 hiperion kernel: [47523.934746] Hardware name: To be filled by O.E.M. To be filled by O.E.M./M5A97 R2.0, BIOS 2603 06/26/2015
Jul 6 13:18:35 hiperion kernel: [47523.935031] task: ffff88031ba00d00 task.stack: ffffc900015b8000
Jul 6 13:18:35 hiperion kernel: [47523.935146] RIP: 0010:0xffffffff8106674e
Jul 6 13:18:35 hiperion kernel: [47523.935178] RSP: 0018:ffffc900015bbcd8 EFLAGS: 00010002
Jul 6 13:18:35 hiperion kernel: [47523.935221] RAX: ffffffff81808320 RBX: ffff8804294aa700 RCX: 0000000000000001
Jul 6 13:18:35 hiperion kernel: [47523.935276] RDX: 0000000000000010 RSI: 0000000000000000 RDI: ffff8804294aa700
Jul 6 13:18:35 hiperion kernel: [47523.935332] RBP: ffffc900015bbd20 R08: 0000000000000041 R09: 0000000000000001
Jul 6 13:18:35 hiperion kernel: [47523.935387] R10: ffffea000c6fed00 R11: 0000000000000000 R12: 0000000000000001
Jul 6 13:18:35 hiperion kernel: [47523.935443] R13: ffff8804294aadbc R14: 0000000000000046 R15: 0000000000018740
Jul 6 13:18:35 hiperion kernel: [47523.935499] FS: 00007f2b6bb57700(0000) GS:ffff88043ec80000(0000) knlGS:0000000000000000
Jul 6 13:18:35 hiperion kernel: [47523.935561] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul 6 13:18:35 hiperion kernel: [47523.935607] CR2: 00007f53897ea000 CR3: 00000003c397f000 CR4: 00000000000406e0
Jul 6 13:18:35 hiperion kernel: [47523.935662] Call Trace:
Jul 6 13:18:35 hiperion kernel: [47523.935685] 0xffffffff81066aad
Jul 6 13:18:35 hiperion kernel: [47523.935712] 0xffffffff8114a7c1
Jul 6 13:18:35 hiperion kernel: [47523.935738] ? 0xffffffff81066aa0
Jul 6 13:18:35 hiperion kernel: [47523.935766] 0xffffffff8107a1cd
Jul 6 13:18:35 hiperion kernel: [47523.935792] 0xffffffff8107a70f
Jul 6 13:18:35 hiperion kernel: [47523.935818] 0xffffffff8113e761
Jul 6 13:18:35 hiperion kernel: [47523.935845] 0xffffffff81134dff
Jul 6 13:18:35 hiperion kernel: [47523.935872] 0xffffffff8113569e
Jul 6 13:18:35 hiperion kernel: [47523.935898] ? 0xffffffff810adcf8
Jul 6 13:18:35 hiperion kernel: [47523.935925] 0xffffffff81136b8a
Jul 6 13:18:35 hiperion kernel: [47523.935952] 0xffffffff816b96a0
Jul 6 13:18:35 hiperion kernel: [47523.935978] RIP: 0033:0x00007f2b8be0fe5d
Jul 6 13:18:35 hiperion kernel: [47523.936010] RSP: 002b:00007f2b6bb56790 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
Jul 6 13:18:35 hiperion kernel: [47523.936070] RAX: ffffffffffffffda RBX: 00007f2b8bd3edb0 RCX: 00007f2b8be0fe5d
Jul 6 13:18:35 hiperion kernel: [47523.936125] RDX: 0000000000000001 RSI: 00007f2b6bb567a7 RDI: 000000000000001b
Jul 6 13:18:35 hiperion kernel: [47523.936180] RBP: 00000000000000fa R08: 00007f2b76687908 R09: 0000000000000000
Jul 6 13:18:35 hiperion kernel: [47523.936234] R10: 000000000000002b R11: 0000000000000293 R12: 0000000000000005
Jul 6 13:18:35 hiperion kernel: [47523.936289] R13: 0000000000000000 R14: 00007f2b6bb56a28 R15: 00002b38d797ff91
Jul 6 13:18:35 hiperion kernel: [47523.936368] Code: 88 83 04 04 00 00 0f 85 66 01 00 00 83 bb 1c 03 00 00 01 0f 8e b8 01 00 00 48 8b 43 68 ba 10 00 00 00 8b 73 50 44 89 e1 48 89 df <ff> 50 40 48 8d 93 20 03 00 00 41 89 c0 44 89 c0 48 0f a3 02 0f
Jul 6 13:18:35 hiperion kernel: [47523.936551] RIP: 0xffffffff8106674e RSP: ffffc900015bbcd8
Jul 6 13:18:35 hiperion kernel: [47523.956831] ---[ end trace de9deca7f7bcb1e8 ]---
Created attachment 257391 [details] kern.log
Can you get a log with symbols?
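For context on this request: the trace above contains only raw addresses such as 0xffffffff8106674e, so nobody can tell which functions were involved. If the System.map from the exact same kernel build is still available, addresses in the core kernel image can be mapped back to "symbol+offset" by hand. The following is a minimal Python sketch of that lookup. The helper names and the sample map entries are invented for illustration, so the printed result says nothing about the real function at that address, and the approach does not cover addresses inside loadable modules.

```python
import bisect

def load_symbols(lines):
    """Parse System.map-style lines ("<hex addr> <type> <name>") into a sorted list."""
    symbols = []
    for line in lines:
        parts = line.split()
        if len(parts) >= 3:
            symbols.append((int(parts[0], 16), parts[2]))
    symbols.sort()
    return symbols

def resolve(symbols, address):
    """Return "name+0xoffset" for the closest symbol at or below address."""
    starts = [start for start, _ in symbols]
    index = bisect.bisect_right(starts, address) - 1
    if index < 0:
        return hex(address)  # below the first known symbol
    start, name = symbols[index]
    return f"{name}+{address - start:#x}"

# Hypothetical System.map excerpt: these names and addresses are made up.
SAMPLE_MAP = """\
ffffffff81000000 T example_text_start
ffffffff81066000 T example_func_a
ffffffff81066a00 T example_func_b
""".splitlines()

symbols = load_symbols(SAMPLE_MAP)
print(resolve(symbols, 0xffffffff8106674e))  # example_func_a+0x74e (with the invented map)
```

With a real map, `load_symbols(open("System.map"))` would replace the sample data. A kernel built with CONFIG_KALLSYMS prints symbol names in the trace directly, which is the simpler route.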
(In reply to Alex Deucher from comment #10)
> Can you get a log with symbols?
I have recompiled the kernel with symbols, but so far I have been unable to reproduce the error. The last error I had on this machine seems to be unrelated to the amdgpu driver:
[ 2986.484313] csgo_linux64[25324]: segfault at 21400000000 ip 0000021400000000 sp 00007f5ebd6a49e0 error 14 in renderD128[7f5eb9809000+20000]
[ 3378.568512] BUG: unable to handle kernel paging request at 000000000001abbe
[ 3378.568674] IP: ipv4_mtu+0x4d/0x70
[ 3378.568730] PGD 347257067
[ 3378.568738] P4D 347257067
[ 3378.568782] PUD 0
[ 3378.568827]
[ 3378.568897] Oops: 0002 [#1] SMP
[ 3378.568944] Modules linked in: kvm_amd kvm amdgpu irqbypass aesni_intel aes_x86_64 crypto_simd cryptd glue_helper mfd_core backlight ttm acpi_cpufreq
[ 3378.569155] CPU: 7 PID: 25155 Comm: CIPCServer::Thr Not tainted 4.12.0 #6
[ 3378.569231] Hardware name: To be filled by O.E.M. To be filled by O.E.M./M5A97 R2.0, BIOS 2603 06/26/2015
[ 3378.569345] task: ffff88042ae50000 task.stack: ffffc90000ea4000
[ 3378.569424] RIP: 0010:ipv4_mtu+0x4d/0x70
[ 3378.569499] RSP: 0018:ffffc90000ea7ab0 EFLAGS: 00010212
[ 3378.569582] RAX: 000000000000ffff RBX: ffff880347f3b700 RCX: 0000000000010000
[ 3378.569665] RDX: ffffffff81ad1cc0 RSI: 0000000000000002 RDI: ffff880347e45700
[ 3378.569743] RBP: ffffc90000ea7ae8 R08: ffff88042147689c R09: 0000000000000000
[ 3378.569839] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88041fb62b00
[ 3378.569917] R13: ffffffff81f0e7c0 R14: 0000000000000000 R15: ffff88042acee000
[ 3378.569993] FS: 0000000000000000(0000) GS:ffff88042e400000(0063) knlGS:00000000ea4b1b40
[ 3378.570076] CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033
[ 3378.570139] CR2: 000000000001abbe CR3: 000000034711b000 CR4: 00000000000406e0
[ 3378.570213] Call Trace:
[ 3378.570254] ? ip_finish_output+0x191/0x2f0
[ 3378.570305] ? nf_hook_slow+0xa0/0xf0
[ 3378.570353] ip_output+0x7d/0x250
[ 3378.570398] ? ip_output+0x99/0x250
[ 3378.570448] ? ip_fragment.constprop.41+0x80/0x80
[ 3378.570503] ip_local_out+0x48/0x80
[ 3378.570559] ip_queue_xmit+0x1e5/0x5a0
[ 3378.570635] ? ip_queue_xmit+0x5/0x5a0
[ 3378.570708] ? tcp_v4_md5_lookup+0x13/0x20
[ 3378.570757] tcp_transmit_skb+0x4ee/0x9a0
[ 3378.570813] tcp_send_ack+0x100/0x180
[ 3378.570860] tcp_cleanup_rbuf+0x67/0x100
[ 3378.570908] tcp_recvmsg+0x45d/0xa70
[ 3378.570958] inet_recvmsg+0x5b/0x1e0
[ 3378.571009] ? __fget_light+0x24/0x70
[ 3378.571057] SYSC_recvfrom+0xff/0x180
[ 3378.571104] ? eventfd_read+0x40/0x80
[ 3378.571156] ? __vfs_read+0x28/0x110
[ 3378.571204] SyS_recv+0x14/0x20
[ 3378.571247] compat_SyS_socketcall+0x36d/0x420
[ 3378.571299] do_fast_syscall_32+0x93/0x1f0
[ 3378.571359] entry_SYSCALL_compat+0x40/0x45
[ 3378.571408] RIP: 0023:0xf7735af9
[ 3378.571448] RSP: 002b:00000000ea4b0a70 EFLAGS: 00000286 ORIG_RAX: 0000000000000066
[ 3378.571531] RAX: ffffffffffffffda RBX: 000000000000000a RCX: 00000000ea4b0a80
[ 3378.571605] RDX: 00000000ebdd14f8 RSI: 00000000e1f8b600 RDI: 0000000000000000
[ 3378.571687] RBP: 0000000000000009 R08: 0000000000000000 R09: 0000000000000000
[ 3378.571761] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ 3378.571838] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 3378.571949] Code: 57 28 48 83 e2 fc 8b 42 04 85 c0 75 1d f6 02 04 48 8b 47 18 8b 88 18 02 00 00 75 10 81 f9 ff ff 00 00 b8 ff ff 00 00 0f 46 c1 5d <d3> 80 bf ab 00 00 00 00 74 e7 81 f9 40 02 00 00 b8 40 02 00 00
[ 3378.572256] RIP: ipv4_mtu+0x4d/0x70 RSP: ffffc90000ea7ab0
[ 3378.572315] CR2: 000000000001abbe
[ 3378.592534] ---[ end trace 78177a6dc6a07526 ]---
Got it. I have the evidence with symbols. The video output is frozen, but the cursor can still be moved. Although I can't switch to a tty via Ctrl+Alt+Fn, I can use the computer via ssh. This is the error in dmesg:
[138113.241913] traps: Web Content[30166] general protection ip:7f9da4a25db8 sp:7ffcad46f582 error:0 in libxul.so[7f9da25e3000+4653000]
[147737.047512] traps: cinnamon[4977] trap invalid opcode ip:7fd7d11ed9ef sp:7ffe27732750 error:0 in libmozjs-24.so[7fd7d1173000+2ad000]
[164593.296815] cinnamon[6950]: segfault at d60 ip 00007f29d717264e sp 00007fff55010f68 error 4 in libcairo.so.2.11400.8[7f29d70fb000+122000]
[164632.261892] kernel tried to execute NX-protected page - exploit attempt? (uid: 0)
[164632.262154] BUG: unable to handle kernel paging request at ffffffffa01e0216
[164632.262442] IP: pp_ip_funcs+0xfcb6/0xfffffffffffb1aa0 [amdgpu]
[164632.262678] PGD 1e24067
[164632.262681] P4D 1e24067
[164632.262842] PUD 1e25063
[164632.263005] PMD 4233d2067
[164632.263166] PTE 8000000417334161
[164632.263430] Oops: 0011 [#1] SMP
[164632.263465] Modules linked in: kvm_amd kvm amdgpu irqbypass aesni_intel aes_x86_64 crypto_simd cryptd glue_helper mfd_core acpi_cpufreq backlight ttm
[164632.263627] CPU: 3 PID: 2797 Comm: amdgpu_cs:0 Not tainted 4.12.0 #7
[164632.263688] Hardware name: To be filled by O.E.M. To be filled by O.E.M./M5A97 R2.0, BIOS 2603 06/26/2015
[164632.263776] task: ffff8804289c6600 task.stack: ffffc90002f50000
[164632.263892] RIP: 0010:pp_ip_funcs+0xfcb6/0xfffffffffffb1aa0 [amdgpu]
[164632.263955] RSP: 0018:ffffc90002f53d68 EFLAGS: 00010286
[164632.264013] RAX: 0000000000000000 RBX: ffff880425046800 RCX: 0000000000000001
[164632.264080] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8804289c6600
[164632.264148] RBP: ffffffff814bed80 R08: 0000000000000000 R09: 0000000000000000
[164632.264215] R10: 0000000000000000 R11: 0000000000000000 R12: 00000000c0186444
[164632.264282] R13: ffff880427323000 R14: ffffc90002f53db0 R15: ffffc90002f53e60
[164632.264350] FS: 00007f69bb16a700(0000) GS:ffff88042dc00000(0000) knlGS:0000000000000000
[164632.264425] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[164632.264482] CR2: ffffffffa01e0216 CR3: 000000041717d000 CR4: 00000000000406e0
[164632.264552] Call Trace:
[164632.264624] ? amdgpu_cs_find_mapping+0xa0/0xa0 [amdgpu]
[164632.264681] ? trace_hardirqs_on+0xd/0x10
[164632.264757] ? amdgpu_drm_ioctl+0x59/0xa0 [amdgpu]
[164632.264809] ? do_vfs_ioctl+0x9e/0x6c0
[164632.264849] ? __fget+0x10c/0x210
[164632.264886] ? __fget+0x5/0x210
[164632.264923] ? SyS_ioctl+0x4c/0x90
[164632.264962] ? entry_SYSCALL_64_fastpath+0x18/0xad
[164632.265037] Code: 00 41 4d 44 47 50 55 5f 47 45 4d 5f 4d 4d 41 50 00 41 4d 44 47 50 55 5f 43 54 58 00 41 4d 44 47 50 55 5f 42 4f 5f 4c 49 53 54 00 <41> 4d 44 47 50 55 5f 43 53 00 41 4d 44 47 50 55 5f 49 4e 46 4f
[164632.265389] RIP: pp_ip_funcs+0xfcb6/0xfffffffffffb1aa0 [amdgpu] RSP: ffffc90002f53d68
[164632.265462] CR2: ffffffffa01e0216
[164632.285886] ---[ end trace 185809306e39b4c5 ]---
After further testing I started to think that the problem might be a hardware issue. I then replaced the memory, and the issue seems to be fixed. Sorry for the noise, guys.