Bug 196273

Summary: Loss of video output and system freezes *ERROR* Couldn't read SADs: 0
Product: Drivers
Reporter: Olaf H B (olaf)
Component: Video(DRI - non Intel)
Assignee: drivers_video-dri
Status: RESOLVED INVALID
Severity: normal
CC: alexdeucher
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 4.12.0
Subsystem:
Regression: No
Bisected commit-id:
Attachments: lspci output
dmesg output
configuration used in the kernel
lspci verbose output for the vga card
kernel.log (fragment)
some debug msgs for amdgpu powerplay code
kern.log

Description Olaf H B 2017-07-04 17:49:22 UTC
Created attachment 257353 [details]
lspci output

I have a Gentoo system with kernel 4.12.0, and after a while the display goes black. After that the system is still accessible via ssh, but any attempt to restart the X server results in a freeze.

I have to reset the machine in order to get a working system again.

This happens very often, at irregular intervals.


The video card is an AMD Radeon R9 380 (Sapphire Nitro).

01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Tonga PRO [Radeon R9 285/380] (rev f1)


I have downloaded and tested with the new firmware from 

http://people.freedesktop.org/~agd5f/radeon_ucode/tonga/

It seems to be a problem with the amdgpu driver.

I can see a variety of errors in dmesg:

[    8.088336] AMD-Vi: Event logged [
[    8.088337] IO_PAGE_FAULT device=01:00.0 domain=0x000f address=0x000000f4001a8300 flags=0x0010]


[    8.642870] amdgpu: [powerplay] Can't find requested voltage id in vdd_dep_on_sclk table!



This message seems to be the last one before the video goes black:

[drm:0xffffffffa00d4f5d] *ERROR* Couldn't read SADs: 0

I can provide any additional information required. Thanks for your help in advance.
Comment 1 Olaf H B 2017-07-04 17:50:06 UTC
Created attachment 257355 [details]
dmesg output
Comment 2 Olaf H B 2017-07-04 17:50:46 UTC
Created attachment 257357 [details]
configuration used in the kernel
Comment 3 Olaf H B 2017-07-04 17:51:16 UTC
Created attachment 257359 [details]
lspci verbose output for the vga card
Comment 4 Olaf H B 2017-07-04 17:51:44 UTC
Created attachment 257361 [details]
kernel.log (fragment)
Comment 5 Olaf H B 2017-07-04 17:52:34 UTC
Created attachment 257363 [details]
some debug msgs for amdgpu powerplay code

I don't know if this is useful, but I added some extra prints to the code in

drivers/gpu/drm/amd/powerplay/hwmgr/hwmgr.c

in order to see which ID it is looking for.

It seems to be looking for ID 65281 (int) and can't find it.
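
For reference, the same ID is easier to compare against the driver headers when written in hexadecimal. A minimal Python sketch of the conversion follows; the variable names are illustrative only, and whether 0xff01 corresponds to one of the driver's virtual voltage IDs is an assumption that would need checking against the source.

```python
# Value reported by the extra debug prints (taken from the comment above).
requested_id = 65281

# Split into high and low bytes to make the 16-bit layout visible.
high_byte, low_byte = divmod(requested_id, 0x100)

print(f"{requested_id} = {requested_id:#06x}")
print(f"high byte = {high_byte:#04x}, low byte = {low_byte:#04x}")
```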
Comment 6 Michel Dänzer 2017-07-05 01:33:12 UTC
Is this a regression from older kernel versions? If yes, can you bisect?
Comment 7 Alex Deucher 2017-07-05 14:53:57 UTC
(In reply to Olaf H B from comment #0)
> 
> [    8.642870] amdgpu: [powerplay] Can't find requested voltage id in
> vdd_dep_on_sclk table!

This is harmless.
Comment 8 Olaf H B 2017-07-06 18:39:19 UTC
(In reply to Michel Dänzer from comment #6)
> Is this a regression from older kernel versions? If yes, can you bisect?

I don't think this is a regression, because I had similar issues with kernel 4.11.

I got another freeze a few minutes ago.

This appears in kernel.log (the complete file is attached):

Jul  6 13:18:35 hiperion kernel: [47523.933893] general protection fault: 0000 [#1] SMP
Jul  6 13:18:35 hiperion kernel: [47523.934043] Modules linked in: kvm_amd kvm irqbypass aesni_intel aes_x86_64 amdgpu crypto_simd cryptd glue_helper input_leds fam15h_power k10temp mfd_core backlight ttm acpi_cpufreq
Jul  6 13:18:35 hiperion kernel: [47523.934565] CPU: 2 PID: 12681 Comm: Timer Not tainted 4.12.0 #3
Jul  6 13:18:35 hiperion kernel: [47523.934746] Hardware name: To be filled by O.E.M. To be filled by O.E.M./M5A97 R2.0, BIOS 2603 06/26/2015
Jul  6 13:18:35 hiperion kernel: [47523.935031] task: ffff88031ba00d00 task.stack: ffffc900015b8000
Jul  6 13:18:35 hiperion kernel: [47523.935146] RIP: 0010:0xffffffff8106674e
Jul  6 13:18:35 hiperion kernel: [47523.935178] RSP: 0018:ffffc900015bbcd8 EFLAGS: 00010002
Jul  6 13:18:35 hiperion kernel: [47523.935221] RAX: ffffffff81808320 RBX: ffff8804294aa700 RCX: 0000000000000001
Jul  6 13:18:35 hiperion kernel: [47523.935276] RDX: 0000000000000010 RSI: 0000000000000000 RDI: ffff8804294aa700
Jul  6 13:18:35 hiperion kernel: [47523.935332] RBP: ffffc900015bbd20 R08: 0000000000000041 R09: 0000000000000001
Jul  6 13:18:35 hiperion kernel: [47523.935387] R10: ffffea000c6fed00 R11: 0000000000000000 R12: 0000000000000001
Jul  6 13:18:35 hiperion kernel: [47523.935443] R13: ffff8804294aadbc R14: 0000000000000046 R15: 0000000000018740
Jul  6 13:18:35 hiperion kernel: [47523.935499] FS:  00007f2b6bb57700(0000) GS:ffff88043ec80000(0000) knlGS:0000000000000000
Jul  6 13:18:35 hiperion kernel: [47523.935561] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul  6 13:18:35 hiperion kernel: [47523.935607] CR2: 00007f53897ea000 CR3: 00000003c397f000 CR4: 00000000000406e0
Jul  6 13:18:35 hiperion kernel: [47523.935662] Call Trace:
Jul  6 13:18:35 hiperion kernel: [47523.935685]  0xffffffff81066aad
Jul  6 13:18:35 hiperion kernel: [47523.935712]  0xffffffff8114a7c1
Jul  6 13:18:35 hiperion kernel: [47523.935738]  ? 0xffffffff81066aa0
Jul  6 13:18:35 hiperion kernel: [47523.935766]  0xffffffff8107a1cd
Jul  6 13:18:35 hiperion kernel: [47523.935792]  0xffffffff8107a70f
Jul  6 13:18:35 hiperion kernel: [47523.935818]  0xffffffff8113e761
Jul  6 13:18:35 hiperion kernel: [47523.935845]  0xffffffff81134dff
Jul  6 13:18:35 hiperion kernel: [47523.935872]  0xffffffff8113569e
Jul  6 13:18:35 hiperion kernel: [47523.935898]  ? 0xffffffff810adcf8
Jul  6 13:18:35 hiperion kernel: [47523.935925]  0xffffffff81136b8a
Jul  6 13:18:35 hiperion kernel: [47523.935952]  0xffffffff816b96a0
Jul  6 13:18:35 hiperion kernel: [47523.935978] RIP: 0033:0x00007f2b8be0fe5d
Jul  6 13:18:35 hiperion kernel: [47523.936010] RSP: 002b:00007f2b6bb56790 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
Jul  6 13:18:35 hiperion kernel: [47523.936070] RAX: ffffffffffffffda RBX: 00007f2b8bd3edb0 RCX: 00007f2b8be0fe5d
Jul  6 13:18:35 hiperion kernel: [47523.936125] RDX: 0000000000000001 RSI: 00007f2b6bb567a7 RDI: 000000000000001b
Jul  6 13:18:35 hiperion kernel: [47523.936180] RBP: 00000000000000fa R08: 00007f2b76687908 R09: 0000000000000000
Jul  6 13:18:35 hiperion kernel: [47523.936234] R10: 000000000000002b R11: 0000000000000293 R12: 0000000000000005
Jul  6 13:18:35 hiperion kernel: [47523.936289] R13: 0000000000000000 R14: 00007f2b6bb56a28 R15: 00002b38d797ff91
Jul  6 13:18:35 hiperion kernel: [47523.936368] Code: 88 83 04 04 00 00 0f 85 66 01 00 00 83 bb 1c 03 00 00 01 0f 8e b8 01 00 00 48 8b 43 68 ba 10 00 00 00 8b 73 50 44 89 e1 48 89 df <ff> 50 40 48 8d 93 20 03 00 00 41 89 c0 44 89 c0 48 0f a3 02 0f
Jul  6 13:18:35 hiperion kernel: [47523.936551] RIP: 0xffffffff8106674e RSP: ffffc900015bbcd8
Jul  6 13:18:35 hiperion kernel: [47523.956831] ---[ end trace de9deca7f7bcb1e8 ]---
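
The trace above contains only raw addresses because the kernel was built without symbol information. As a stopgap, such addresses can be matched against the System.map of the exact same build. The following is a minimal Python sketch, assuming a System.map in the usual "address type name" format. The function names and the two symbols in the example are hypothetical placeholders, not taken from this kernel, and the approach only covers built-in kernel text, not module addresses.

```python
import bisect
import re

# Matches a bare kernel text address as printed in an unsymbolized trace.
ADDR_RE = re.compile(r"\b0x(ffffffff[0-9a-f]{8})\b")


def extract_addresses(log_text):
    """Return the kernel addresses found in an oops log, in order of appearance."""
    return [int(m.group(1), 16) for m in ADDR_RE.finditer(log_text)]


def load_symbols(system_map_text):
    """Parse 'address type name' lines into a list sorted by address."""
    symbols = []
    for line in system_map_text.splitlines():
        parts = line.split()
        if len(parts) == 3:
            symbols.append((int(parts[0], 16), parts[2]))
    symbols.sort()
    return symbols


def resolve(addr, symbols):
    """Map an address to 'symbol+0xoffset' using the nearest preceding symbol."""
    starts = [start for start, _ in symbols]
    i = bisect.bisect_right(starts, addr) - 1
    if i < 0:
        return hex(addr)  # address lies below every known symbol
    start, name = symbols[i]
    return f"{name}+{addr - start:#x}"


# Hypothetical System.map excerpt; a real one comes from the kernel build.
SAMPLE_MAP = """\
ffffffff81066000 T hypothetical_func_a
ffffffff81066aa0 T hypothetical_func_b
"""

# Two lines copied from the trace above.
SAMPLE_LOG = """\
RIP: 0010:0xffffffff8106674e
 0xffffffff81066aad
"""

symbols = load_symbols(SAMPLE_MAP)
for address in extract_addresses(SAMPLE_LOG):
    print(f"{address:#x} -> {resolve(address, symbols)}")
```

A kernel rebuilt with symbol support prints the names directly, which is more reliable than this kind of after-the-fact lookup.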
Comment 9 Olaf H B 2017-07-06 18:40:45 UTC
Created attachment 257391 [details]
kern.log
Comment 10 Alex Deucher 2017-07-06 18:59:57 UTC
Can you get a log with symbols?
Comment 11 Olaf H B 2017-07-10 22:48:15 UTC
(In reply to Alex Deucher from comment #10)
> Can you get a log with symbols?

I have recompiled the kernel with symbols, but so far I have been unable to reproduce the error.


The last error I had with this machine seems to be unrelated to the amdgpu driver:


[ 2986.484313] csgo_linux64[25324]: segfault at 21400000000 ip 0000021400000000 sp 00007f5ebd6a49e0 error 14 in renderD128[7f5eb9809000+20000]
[ 3378.568512] BUG: unable to handle kernel paging request at 000000000001abbe
[ 3378.568674] IP: ipv4_mtu+0x4d/0x70
[ 3378.568730] PGD 347257067 
[ 3378.568738] P4D 347257067 
[ 3378.568782] PUD 0 
[ 3378.568827] 
[ 3378.568897] Oops: 0002 [#1] SMP
[ 3378.568944] Modules linked in: kvm_amd kvm amdgpu irqbypass aesni_intel aes_x86_64 crypto_simd cryptd glue_helper mfd_core backlight ttm acpi_cpufreq
[ 3378.569155] CPU: 7 PID: 25155 Comm: CIPCServer::Thr Not tainted 4.12.0 #6
[ 3378.569231] Hardware name: To be filled by O.E.M. To be filled by O.E.M./M5A97 R2.0, BIOS 2603 06/26/2015
[ 3378.569345] task: ffff88042ae50000 task.stack: ffffc90000ea4000
[ 3378.569424] RIP: 0010:ipv4_mtu+0x4d/0x70
[ 3378.569499] RSP: 0018:ffffc90000ea7ab0 EFLAGS: 00010212
[ 3378.569582] RAX: 000000000000ffff RBX: ffff880347f3b700 RCX: 0000000000010000
[ 3378.569665] RDX: ffffffff81ad1cc0 RSI: 0000000000000002 RDI: ffff880347e45700
[ 3378.569743] RBP: ffffc90000ea7ae8 R08: ffff88042147689c R09: 0000000000000000
[ 3378.569839] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88041fb62b00
[ 3378.569917] R13: ffffffff81f0e7c0 R14: 0000000000000000 R15: ffff88042acee000
[ 3378.569993] FS:  0000000000000000(0000) GS:ffff88042e400000(0063) knlGS:00000000ea4b1b40
[ 3378.570076] CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
[ 3378.570139] CR2: 000000000001abbe CR3: 000000034711b000 CR4: 00000000000406e0
[ 3378.570213] Call Trace:
[ 3378.570254]  ? ip_finish_output+0x191/0x2f0
[ 3378.570305]  ? nf_hook_slow+0xa0/0xf0
[ 3378.570353]  ip_output+0x7d/0x250
[ 3378.570398]  ? ip_output+0x99/0x250
[ 3378.570448]  ? ip_fragment.constprop.41+0x80/0x80
[ 3378.570503]  ip_local_out+0x48/0x80
[ 3378.570559]  ip_queue_xmit+0x1e5/0x5a0
[ 3378.570635]  ? ip_queue_xmit+0x5/0x5a0
[ 3378.570708]  ? tcp_v4_md5_lookup+0x13/0x20
[ 3378.570757]  tcp_transmit_skb+0x4ee/0x9a0
[ 3378.570813]  tcp_send_ack+0x100/0x180
[ 3378.570860]  tcp_cleanup_rbuf+0x67/0x100
[ 3378.570908]  tcp_recvmsg+0x45d/0xa70
[ 3378.570958]  inet_recvmsg+0x5b/0x1e0
[ 3378.571009]  ? __fget_light+0x24/0x70
[ 3378.571057]  SYSC_recvfrom+0xff/0x180
[ 3378.571104]  ? eventfd_read+0x40/0x80
[ 3378.571156]  ? __vfs_read+0x28/0x110
[ 3378.571204]  SyS_recv+0x14/0x20
[ 3378.571247]  compat_SyS_socketcall+0x36d/0x420
[ 3378.571299]  do_fast_syscall_32+0x93/0x1f0
[ 3378.571359]  entry_SYSCALL_compat+0x40/0x45
[ 3378.571408] RIP: 0023:0xf7735af9
[ 3378.571448] RSP: 002b:00000000ea4b0a70 EFLAGS: 00000286 ORIG_RAX: 0000000000000066
[ 3378.571531] RAX: ffffffffffffffda RBX: 000000000000000a RCX: 00000000ea4b0a80
[ 3378.571605] RDX: 00000000ebdd14f8 RSI: 00000000e1f8b600 RDI: 0000000000000000
[ 3378.571687] RBP: 0000000000000009 R08: 0000000000000000 R09: 0000000000000000
[ 3378.571761] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ 3378.571838] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 3378.571949] Code: 57 28 48 83 e2 fc 8b 42 04 85 c0 75 1d f6 02 04 48 8b 47 18 8b 88 18 02 00 00 75 10 81 f9 ff ff 00 00 b8 ff ff 00 00 0f 46 c1 5d <d3> 80 bf ab 00 00 00 00 74 e7 81 f9 40 02 00 00 b8 40 02 00 00 
[ 3378.572256] RIP: ipv4_mtu+0x4d/0x70 RSP: ffffc90000ea7ab0
[ 3378.572315] CR2: 000000000001abbe
[ 3378.592534] ---[ end trace 78177a6dc6a07526 ]---
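
The numbers after "Oops:" and "error" in the two messages above are x86 page-fault error codes, printed in hexadecimal. A rough Python sketch for reading them follows. The function name is made up for illustration, and only the five lowest bits are covered.

```python
def decode_pf_error(code):
    """Describe an x86 page-fault error code as printed after 'Oops:' or 'error'."""
    if code & 0x10:          # bit 4: fault happened on an instruction fetch
        access = "instruction fetch"
    elif code & 0x02:        # bit 1: write access
        access = "write"
    else:
        access = "read"
    mode = "user" if code & 0x04 else "kernel"          # bit 2: privilege level
    cause = "protection violation" if code & 0x01 else "page not present"  # bit 0
    description = f"{mode}-mode {access}, {cause}"
    if code & 0x08:          # bit 3: reserved bit set in a page-table entry
        description += ", reserved bit set"
    return description


# Codes taken from the log above: the kernel oops and the user-space segfault.
for label, code in (("Oops: 0002", 0x0002), ("error 14", 0x14)):
    print(f"{label}: {decode_pf_error(code)}")
```

With these values, 0x0002 reads as a kernel-mode write to a non-present page (the ipv4_mtu oops) and 0x14 as a user-mode instruction fetch from a non-present page (the csgo_linux64 segfault, where ip equals the faulting address).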
Comment 12 Olaf H B 2017-07-11 00:27:21 UTC
Got it.

I now have a trace with symbols.

The video output is frozen but the cursor still moves, although I can't switch to a tty via Ctrl+Alt+Fn.

I can still use the computer via ssh.

This is the error in dmesg:


[138113.241913] traps: Web Content[30166] general protection ip:7f9da4a25db8 sp:7ffcad46f582 error:0 in libxul.so[7f9da25e3000+4653000]
[147737.047512] traps: cinnamon[4977] trap invalid opcode ip:7fd7d11ed9ef sp:7ffe27732750 error:0 in libmozjs-24.so[7fd7d1173000+2ad000]
[164593.296815] cinnamon[6950]: segfault at d60 ip 00007f29d717264e sp 00007fff55010f68 error 4 in libcairo.so.2.11400.8[7f29d70fb000+122000]
[164632.261892] kernel tried to execute NX-protected page - exploit attempt? (uid: 0)
[164632.262154] BUG: unable to handle kernel paging request at ffffffffa01e0216
[164632.262442] IP: pp_ip_funcs+0xfcb6/0xfffffffffffb1aa0 [amdgpu]
[164632.262678] PGD 1e24067 
[164632.262681] P4D 1e24067 
[164632.262842] PUD 1e25063 
[164632.263005] PMD 4233d2067 
[164632.263166] PTE 8000000417334161

[164632.263430] Oops: 0011 [#1] SMP
[164632.263465] Modules linked in: kvm_amd kvm amdgpu irqbypass aesni_intel aes_x86_64 crypto_simd cryptd glue_helper mfd_core acpi_cpufreq backlight ttm
[164632.263627] CPU: 3 PID: 2797 Comm: amdgpu_cs:0 Not tainted 4.12.0 #7
[164632.263688] Hardware name: To be filled by O.E.M. To be filled by O.E.M./M5A97 R2.0, BIOS 2603 06/26/2015
[164632.263776] task: ffff8804289c6600 task.stack: ffffc90002f50000
[164632.263892] RIP: 0010:pp_ip_funcs+0xfcb6/0xfffffffffffb1aa0 [amdgpu]
[164632.263955] RSP: 0018:ffffc90002f53d68 EFLAGS: 00010286
[164632.264013] RAX: 0000000000000000 RBX: ffff880425046800 RCX: 0000000000000001
[164632.264080] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8804289c6600
[164632.264148] RBP: ffffffff814bed80 R08: 0000000000000000 R09: 0000000000000000
[164632.264215] R10: 0000000000000000 R11: 0000000000000000 R12: 00000000c0186444
[164632.264282] R13: ffff880427323000 R14: ffffc90002f53db0 R15: ffffc90002f53e60
[164632.264350] FS:  00007f69bb16a700(0000) GS:ffff88042dc00000(0000) knlGS:0000000000000000
[164632.264425] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[164632.264482] CR2: ffffffffa01e0216 CR3: 000000041717d000 CR4: 00000000000406e0
[164632.264552] Call Trace:
[164632.264624]  ? amdgpu_cs_find_mapping+0xa0/0xa0 [amdgpu]
[164632.264681]  ? trace_hardirqs_on+0xd/0x10
[164632.264757]  ? amdgpu_drm_ioctl+0x59/0xa0 [amdgpu]
[164632.264809]  ? do_vfs_ioctl+0x9e/0x6c0
[164632.264849]  ? __fget+0x10c/0x210
[164632.264886]  ? __fget+0x5/0x210
[164632.264923]  ? SyS_ioctl+0x4c/0x90
[164632.264962]  ? entry_SYSCALL_64_fastpath+0x18/0xad
[164632.265037] Code: 00 41 4d 44 47 50 55 5f 47 45 4d 5f 4d 4d 41 50 00 41 4d 44 47 50 55 5f 43 54 58 00 41 4d 44 47 50 55 5f 42 4f 5f 4c 49 53 54 00 <41> 4d 44 47 50 55 5f 43 53 00 41 4d 44 47 50 55 5f 49 4e 46 4f 
[164632.265389] RIP: pp_ip_funcs+0xfcb6/0xfffffffffffb1aa0 [amdgpu] RSP: ffffc90002f53d68
[164632.265462] CR2: ffffffffa01e0216
[164632.285886] ---[ end trace 185809306e39b4c5 ]---
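
The bytes on the "Code:" line of the trace above all fall in the printable ASCII range or are zero, which is unusual for machine code. A minimal Python sketch to view them as text follows; the helper names are made up for illustration. Together with the "NX-protected page" message, the result suggests that the CPU jumped into string data rather than executable code, which would fit a corrupted pointer. That interpretation is an assumption, not a confirmed diagnosis.

```python
# "Code:" line copied from the trace above (timestamp removed).
CODE_LINE = (
    "Code: 00 41 4d 44 47 50 55 5f 47 45 4d 5f 4d 4d 41 50 00 41 4d 44 47 "
    "50 55 5f 43 54 58 00 41 4d 44 47 50 55 5f 42 4f 5f 4c 49 53 54 00 <41> "
    "4d 44 47 50 55 5f 43 53 00 41 4d 44 47 50 55 5f 49 4e 46 4f"
)


def decode_code_line(code_line):
    """Return (raw bytes, index of the faulting byte) from an oops 'Code:' line."""
    tokens = code_line.split("Code:", 1)[1].split()
    raw = bytearray()
    fault_index = None
    for i, token in enumerate(tokens):
        # The byte at the faulting instruction pointer is wrapped in <>.
        if token.startswith("<") and token.endswith(">"):
            fault_index = i
            token = token[1:-1]
        raw.append(int(token, 16))
    return bytes(raw), fault_index


def printable(raw):
    """Render bytes as ASCII, replacing non-printable bytes with a dot."""
    return "".join(chr(b) if 32 <= b < 127 else "." for b in raw)


raw, fault_index = decode_code_line(CODE_LINE)
print(printable(raw))
print(f"faulting byte at offset {fault_index}: {raw[fault_index]:#04x}")
```

The decoded text consists of NUL-separated names such as AMDGPU_GEM_MMAP and AMDGPU_CS, and the marked byte is the first letter of one of them.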
Comment 13 Olaf H B 2017-07-13 18:48:59 UTC
After further testing I started to think that the problem might be a hardware issue.

Then I replaced the memory and the issue seems to be fixed.

Sorry for the noise, guys.