Bug 196273 - Loss of video output and system freezes *ERROR* Couldn't read SADs: 0
Summary: Loss of video output and system freezes *ERROR* Couldn't read SADs: 0
Status: RESOLVED INVALID
Alias: None
Product: Drivers
Classification: Unclassified
Component: Video(DRI - non Intel)
Hardware: All Linux
Importance: P1 normal
Assignee: drivers_video-dri
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-07-04 17:49 UTC by Olaf H B
Modified: 2017-07-13 18:48 UTC
CC List: 1 user

See Also:
Kernel Version: 4.12.0
Subsystem:
Regression: No
Bisected commit-id:


Attachments
lspci output (2.88 KB, text/plain)
2017-07-04 17:49 UTC, Olaf H B
dmesg output (223.52 KB, text/plain)
2017-07-04 17:50 UTC, Olaf H B
configuration used in the kernel (98.48 KB, text/plain)
2017-07-04 17:50 UTC, Olaf H B
lspci verbose output for the vga card (1.18 KB, text/plain)
2017-07-04 17:51 UTC, Olaf H B
kernel.log (fragment) (953 bytes, text/plain)
2017-07-04 17:51 UTC, Olaf H B
some debug msgs for amdgpu powerplay code (5.82 KB, text/plain)
2017-07-04 17:52 UTC, Olaf H B
kern.log (422.06 KB, text/plain)
2017-07-06 18:40 UTC, Olaf H B

Description Olaf H B 2017-07-04 17:49:22 UTC
Created attachment 257353
lspci output

I have a Gentoo system running kernel 4.12.0, and after a while the display goes black. After that the system is still accessible via ssh, but any attempt to restart the X server results in a freeze.

I have to reset the system to get it working again.

This happens very often, at irregular intervals.


The video card is an AMD Radeon R9 380 (Sapphire Nitro).

01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Tonga PRO [Radeon R9 285/380] (rev f1)


I have downloaded and tested with the new firmware from 

http://people.freedesktop.org/~agd5f/radeon_ucode/tonga/

It seems to be a problem with the amdgpu driver.

I can see a variety of errors in dmesg:

[    8.088336] AMD-Vi: Event logged [
[    8.088337] IO_PAGE_FAULT device=01:00.0 domain=0x000f address=0x000000f4001a8300 flags=0x0010]


[    8.642870] amdgpu: [powerplay] Can't find requested voltage id in vdd_dep_on_sclk table!



This message seems to be the last one logged before the video goes black:

[drm:0xffffffffa00d4f5d] *ERROR* Couldn't read SADs: 0

I can provide any additional information required. Thanks for your help in advance.
Comment 1 Olaf H B 2017-07-04 17:50:06 UTC
Created attachment 257355
dmesg output
Comment 2 Olaf H B 2017-07-04 17:50:46 UTC
Created attachment 257357
configuration used in the kernel
Comment 3 Olaf H B 2017-07-04 17:51:16 UTC
Created attachment 257359
lspci verbose output for the vga card
Comment 4 Olaf H B 2017-07-04 17:51:44 UTC
Created attachment 257361
kernel.log (fragment)
Comment 5 Olaf H B 2017-07-04 17:52:34 UTC
Created attachment 257363
some debug msgs for amdgpu powerplay code

I don't know if this is useful, but I added some extra prints to the code in

drivers/gpu/drm/amd/powerplay/hwmgr/hwmgr.c

to see which ID it is looking for.

It seems to be looking for ID 65281 (int), which it can't find.
Comment 6 Michel Dänzer 2017-07-05 01:33:12 UTC
Is this a regression from older kernel versions? If yes, can you bisect?
Comment 7 Alex Deucher 2017-07-05 14:53:57 UTC
(In reply to Olaf H B from comment #0)
> 
> [    8.642870] amdgpu: [powerplay] Can't find requested voltage id in
> vdd_dep_on_sclk table!

This is harmless.
Comment 8 Olaf H B 2017-07-06 18:39:19 UTC
(In reply to Michel Dänzer from comment #6)
> Is this a regression from older kernel versions? If yes, can you bisect?

I don't think this is a regression, because I had similar issues with kernel 4.11.

I got another freeze a few minutes ago.

This appears in kernel.log (the complete file is attached):

Jul  6 13:18:35 hiperion kernel: [47523.933893] general protection fault: 0000 [#1] SMP
Jul  6 13:18:35 hiperion kernel: [47523.934043] Modules linked in: kvm_amd kvm irqbypass aesni_intel aes_x86_64 amdgpu crypto_simd cryptd glue_helper input_leds fam15h_power k10temp mfd_core backlight ttm acpi_cpufreq
Jul  6 13:18:35 hiperion kernel: [47523.934565] CPU: 2 PID: 12681 Comm: Timer Not tainted 4.12.0 #3
Jul  6 13:18:35 hiperion kernel: [47523.934746] Hardware name: To be filled by O.E.M. To be filled by O.E.M./M5A97 R2.0, BIOS 2603 06/26/2015
Jul  6 13:18:35 hiperion kernel: [47523.935031] task: ffff88031ba00d00 task.stack: ffffc900015b8000
Jul  6 13:18:35 hiperion kernel: [47523.935146] RIP: 0010:0xffffffff8106674e
Jul  6 13:18:35 hiperion kernel: [47523.935178] RSP: 0018:ffffc900015bbcd8 EFLAGS: 00010002
Jul  6 13:18:35 hiperion kernel: [47523.935221] RAX: ffffffff81808320 RBX: ffff8804294aa700 RCX: 0000000000000001
Jul  6 13:18:35 hiperion kernel: [47523.935276] RDX: 0000000000000010 RSI: 0000000000000000 RDI: ffff8804294aa700
Jul  6 13:18:35 hiperion kernel: [47523.935332] RBP: ffffc900015bbd20 R08: 0000000000000041 R09: 0000000000000001
Jul  6 13:18:35 hiperion kernel: [47523.935387] R10: ffffea000c6fed00 R11: 0000000000000000 R12: 0000000000000001
Jul  6 13:18:35 hiperion kernel: [47523.935443] R13: ffff8804294aadbc R14: 0000000000000046 R15: 0000000000018740
Jul  6 13:18:35 hiperion kernel: [47523.935499] FS:  00007f2b6bb57700(0000) GS:ffff88043ec80000(0000) knlGS:0000000000000000
Jul  6 13:18:35 hiperion kernel: [47523.935561] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul  6 13:18:35 hiperion kernel: [47523.935607] CR2: 00007f53897ea000 CR3: 00000003c397f000 CR4: 00000000000406e0
Jul  6 13:18:35 hiperion kernel: [47523.935662] Call Trace:
Jul  6 13:18:35 hiperion kernel: [47523.935685]  0xffffffff81066aad
Jul  6 13:18:35 hiperion kernel: [47523.935712]  0xffffffff8114a7c1
Jul  6 13:18:35 hiperion kernel: [47523.935738]  ? 0xffffffff81066aa0
Jul  6 13:18:35 hiperion kernel: [47523.935766]  0xffffffff8107a1cd
Jul  6 13:18:35 hiperion kernel: [47523.935792]  0xffffffff8107a70f
Jul  6 13:18:35 hiperion kernel: [47523.935818]  0xffffffff8113e761
Jul  6 13:18:35 hiperion kernel: [47523.935845]  0xffffffff81134dff
Jul  6 13:18:35 hiperion kernel: [47523.935872]  0xffffffff8113569e
Jul  6 13:18:35 hiperion kernel: [47523.935898]  ? 0xffffffff810adcf8
Jul  6 13:18:35 hiperion kernel: [47523.935925]  0xffffffff81136b8a
Jul  6 13:18:35 hiperion kernel: [47523.935952]  0xffffffff816b96a0
Jul  6 13:18:35 hiperion kernel: [47523.935978] RIP: 0033:0x00007f2b8be0fe5d
Jul  6 13:18:35 hiperion kernel: [47523.936010] RSP: 002b:00007f2b6bb56790 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
Jul  6 13:18:35 hiperion kernel: [47523.936070] RAX: ffffffffffffffda RBX: 00007f2b8bd3edb0 RCX: 00007f2b8be0fe5d
Jul  6 13:18:35 hiperion kernel: [47523.936125] RDX: 0000000000000001 RSI: 00007f2b6bb567a7 RDI: 000000000000001b
Jul  6 13:18:35 hiperion kernel: [47523.936180] RBP: 00000000000000fa R08: 00007f2b76687908 R09: 0000000000000000
Jul  6 13:18:35 hiperion kernel: [47523.936234] R10: 000000000000002b R11: 0000000000000293 R12: 0000000000000005
Jul  6 13:18:35 hiperion kernel: [47523.936289] R13: 0000000000000000 R14: 00007f2b6bb56a28 R15: 00002b38d797ff91
Jul  6 13:18:35 hiperion kernel: [47523.936368] Code: 88 83 04 04 00 00 0f 85 66 01 00 00 83 bb 1c 03 00 00 01 0f 8e b8 01 00 00 48 8b 43 68 ba 10 00 00 00 8b 73 50 44 89 e1 48 89 df <ff> 50 40 48 8d 93 20 03 00 00 41 89 c0 44 89 c0 48 0f a3 02 0f
Jul  6 13:18:35 hiperion kernel: [47523.936551] RIP: 0xffffffff8106674e RSP: ffffc900015bbcd8
Jul  6 13:18:35 hiperion kernel: [47523.956831] ---[ end trace de9deca7f7bcb1e8 ]---
Comment 9 Olaf H B 2017-07-06 18:40:45 UTC
Created attachment 257391
kern.log
Comment 10 Alex Deucher 2017-07-06 18:59:57 UTC
Can you get a log with symbols?
Comment 11 Olaf H B 2017-07-10 22:48:15 UTC
(In reply to Alex Deucher from comment #10)
> Can you get a log with symbols?

I have recompiled the kernel with symbols, but so far I have been unable to reproduce the error.


The last error I had on this machine seems to be unrelated to the amdgpu driver:


[ 2986.484313] csgo_linux64[25324]: segfault at 21400000000 ip 0000021400000000 sp 00007f5ebd6a49e0 error 14 in renderD128[7f5eb9809000+20000]
[ 3378.568512] BUG: unable to handle kernel paging request at 000000000001abbe
[ 3378.568674] IP: ipv4_mtu+0x4d/0x70
[ 3378.568730] PGD 347257067 
[ 3378.568738] P4D 347257067 
[ 3378.568782] PUD 0 
[ 3378.568827] 
[ 3378.568897] Oops: 0002 [#1] SMP
[ 3378.568944] Modules linked in: kvm_amd kvm amdgpu irqbypass aesni_intel aes_x86_64 crypto_simd cryptd glue_helper mfd_core backlight ttm acpi_cpufreq
[ 3378.569155] CPU: 7 PID: 25155 Comm: CIPCServer::Thr Not tainted 4.12.0 #6
[ 3378.569231] Hardware name: To be filled by O.E.M. To be filled by O.E.M./M5A97 R2.0, BIOS 2603 06/26/2015
[ 3378.569345] task: ffff88042ae50000 task.stack: ffffc90000ea4000
[ 3378.569424] RIP: 0010:ipv4_mtu+0x4d/0x70
[ 3378.569499] RSP: 0018:ffffc90000ea7ab0 EFLAGS: 00010212
[ 3378.569582] RAX: 000000000000ffff RBX: ffff880347f3b700 RCX: 0000000000010000
[ 3378.569665] RDX: ffffffff81ad1cc0 RSI: 0000000000000002 RDI: ffff880347e45700
[ 3378.569743] RBP: ffffc90000ea7ae8 R08: ffff88042147689c R09: 0000000000000000
[ 3378.569839] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88041fb62b00
[ 3378.569917] R13: ffffffff81f0e7c0 R14: 0000000000000000 R15: ffff88042acee000
[ 3378.569993] FS:  0000000000000000(0000) GS:ffff88042e400000(0063) knlGS:00000000ea4b1b40
[ 3378.570076] CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
[ 3378.570139] CR2: 000000000001abbe CR3: 000000034711b000 CR4: 00000000000406e0
[ 3378.570213] Call Trace:
[ 3378.570254]  ? ip_finish_output+0x191/0x2f0
[ 3378.570305]  ? nf_hook_slow+0xa0/0xf0
[ 3378.570353]  ip_output+0x7d/0x250
[ 3378.570398]  ? ip_output+0x99/0x250
[ 3378.570448]  ? ip_fragment.constprop.41+0x80/0x80
[ 3378.570503]  ip_local_out+0x48/0x80
[ 3378.570559]  ip_queue_xmit+0x1e5/0x5a0
[ 3378.570635]  ? ip_queue_xmit+0x5/0x5a0
[ 3378.570708]  ? tcp_v4_md5_lookup+0x13/0x20
[ 3378.570757]  tcp_transmit_skb+0x4ee/0x9a0
[ 3378.570813]  tcp_send_ack+0x100/0x180
[ 3378.570860]  tcp_cleanup_rbuf+0x67/0x100
[ 3378.570908]  tcp_recvmsg+0x45d/0xa70
[ 3378.570958]  inet_recvmsg+0x5b/0x1e0
[ 3378.571009]  ? __fget_light+0x24/0x70
[ 3378.571057]  SYSC_recvfrom+0xff/0x180
[ 3378.571104]  ? eventfd_read+0x40/0x80
[ 3378.571156]  ? __vfs_read+0x28/0x110
[ 3378.571204]  SyS_recv+0x14/0x20
[ 3378.571247]  compat_SyS_socketcall+0x36d/0x420
[ 3378.571299]  do_fast_syscall_32+0x93/0x1f0
[ 3378.571359]  entry_SYSCALL_compat+0x40/0x45
[ 3378.571408] RIP: 0023:0xf7735af9
[ 3378.571448] RSP: 002b:00000000ea4b0a70 EFLAGS: 00000286 ORIG_RAX: 0000000000000066
[ 3378.571531] RAX: ffffffffffffffda RBX: 000000000000000a RCX: 00000000ea4b0a80
[ 3378.571605] RDX: 00000000ebdd14f8 RSI: 00000000e1f8b600 RDI: 0000000000000000
[ 3378.571687] RBP: 0000000000000009 R08: 0000000000000000 R09: 0000000000000000
[ 3378.571761] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ 3378.571838] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 3378.571949] Code: 57 28 48 83 e2 fc 8b 42 04 85 c0 75 1d f6 02 04 48 8b 47 18 8b 88 18 02 00 00 75 10 81 f9 ff ff 00 00 b8 ff ff 00 00 0f 46 c1 5d <d3> 80 bf ab 00 00 00 00 74 e7 81 f9 40 02 00 00 b8 40 02 00 00 
[ 3378.572256] RIP: ipv4_mtu+0x4d/0x70 RSP: ffffc90000ea7ab0
[ 3378.572315] CR2: 000000000001abbe
[ 3378.592534] ---[ end trace 78177a6dc6a07526 ]---
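
As a reading aid for the "Oops:" lines in these traces: the hex value after "Oops:" is the x86 page-fault error code, and its low bits describe the access that faulted. A minimal Python sketch of decoding it follows; the function name and the label strings are illustrative and not taken from any kernel tool.

```python
# Meaning of each x86 page-fault error-code bit, lowest bit first.
# Each entry is (label if the bit is set, label if it is clear or None).
PF_BITS = [
    ("protection violation", "page not present"),  # bit 0 (P)
    ("write", "read"),                             # bit 1 (W)
    ("user mode", "kernel mode"),                  # bit 2 (U)
    ("reserved bit set", None),                    # bit 3 (RSVD)
    ("instruction fetch", None),                   # bit 4 (I)
]


def decode_pf_error(code_hex):
    """Return human-readable labels for a hex page-fault error code."""
    code = int(code_hex, 16)
    labels = []
    for bit, (if_set, if_clear) in enumerate(PF_BITS):
        label = if_set if code & (1 << bit) else if_clear
        if label:
            labels.append(label)
    return labels


print(decode_pf_error("0002"))
# ['page not present', 'write', 'kernel mode']
print(decode_pf_error("0011"))
# ['protection violation', 'read', 'kernel mode', 'instruction fetch']
```

Read this way, "Oops: 0002" is a kernel-mode write to a page that is not present, and "Oops: 0011" is a kernel-mode instruction fetch from a protected page, which agrees with the "kernel tried to execute NX-protected page" message logged next to it.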
Comment 12 Olaf H B 2017-07-11 00:27:21 UTC
Got it.

I now have a log with symbols.

The video output is frozen but the cursor still moves. I can't switch to a tty via Ctrl+Alt+Fn, but I can use the computer via ssh.

This is the error in dmesg:


[138113.241913] traps: Web Content[30166] general protection ip:7f9da4a25db8 sp:7ffcad46f582 error:0 in libxul.so[7f9da25e3000+4653000]
[147737.047512] traps: cinnamon[4977] trap invalid opcode ip:7fd7d11ed9ef sp:7ffe27732750 error:0 in libmozjs-24.so[7fd7d1173000+2ad000]
[164593.296815] cinnamon[6950]: segfault at d60 ip 00007f29d717264e sp 00007fff55010f68 error 4 in libcairo.so.2.11400.8[7f29d70fb000+122000]
[164632.261892] kernel tried to execute NX-protected page - exploit attempt? (uid: 0)
[164632.262154] BUG: unable to handle kernel paging request at ffffffffa01e0216
[164632.262442] IP: pp_ip_funcs+0xfcb6/0xfffffffffffb1aa0 [amdgpu]
[164632.262678] PGD 1e24067 
[164632.262681] P4D 1e24067 
[164632.262842] PUD 1e25063 
[164632.263005] PMD 4233d2067 
[164632.263166] PTE 8000000417334161

[164632.263430] Oops: 0011 [#1] SMP
[164632.263465] Modules linked in: kvm_amd kvm amdgpu irqbypass aesni_intel aes_x86_64 crypto_simd cryptd glue_helper mfd_core acpi_cpufreq backlight ttm
[164632.263627] CPU: 3 PID: 2797 Comm: amdgpu_cs:0 Not tainted 4.12.0 #7
[164632.263688] Hardware name: To be filled by O.E.M. To be filled by O.E.M./M5A97 R2.0, BIOS 2603 06/26/2015
[164632.263776] task: ffff8804289c6600 task.stack: ffffc90002f50000
[164632.263892] RIP: 0010:pp_ip_funcs+0xfcb6/0xfffffffffffb1aa0 [amdgpu]
[164632.263955] RSP: 0018:ffffc90002f53d68 EFLAGS: 00010286
[164632.264013] RAX: 0000000000000000 RBX: ffff880425046800 RCX: 0000000000000001
[164632.264080] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8804289c6600
[164632.264148] RBP: ffffffff814bed80 R08: 0000000000000000 R09: 0000000000000000
[164632.264215] R10: 0000000000000000 R11: 0000000000000000 R12: 00000000c0186444
[164632.264282] R13: ffff880427323000 R14: ffffc90002f53db0 R15: ffffc90002f53e60
[164632.264350] FS:  00007f69bb16a700(0000) GS:ffff88042dc00000(0000) knlGS:0000000000000000
[164632.264425] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[164632.264482] CR2: ffffffffa01e0216 CR3: 000000041717d000 CR4: 00000000000406e0
[164632.264552] Call Trace:
[164632.264624]  ? amdgpu_cs_find_mapping+0xa0/0xa0 [amdgpu]
[164632.264681]  ? trace_hardirqs_on+0xd/0x10
[164632.264757]  ? amdgpu_drm_ioctl+0x59/0xa0 [amdgpu]
[164632.264809]  ? do_vfs_ioctl+0x9e/0x6c0
[164632.264849]  ? __fget+0x10c/0x210
[164632.264886]  ? __fget+0x5/0x210
[164632.264923]  ? SyS_ioctl+0x4c/0x90
[164632.264962]  ? entry_SYSCALL_64_fastpath+0x18/0xad
[164632.265037] Code: 00 41 4d 44 47 50 55 5f 47 45 4d 5f 4d 4d 41 50 00 41 4d 44 47 50 55 5f 43 54 58 00 41 4d 44 47 50 55 5f 42 4f 5f 4c 49 53 54 00 <41> 4d 44 47 50 55 5f 43 53 00 41 4d 44 47 50 55 5f 49 4e 46 4f 
[164632.265389] RIP: pp_ip_funcs+0xfcb6/0xfffffffffffb1aa0 [amdgpu] RSP: ffffc90002f53d68
[164632.265462] CR2: ffffffffa01e0216
[164632.285886] ---[ end trace 185809306e39b4c5 ]---
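
The "Code:" bytes in this oops fall almost entirely in the printable ASCII range, so it may be worth decoding them as text rather than disassembling them. A minimal Python sketch, assuming the bytes are NUL-separated C strings (the helper name is made up for illustration):

```python
def decode_code_bytes(code_line):
    """Decode the hex bytes of an oops 'Code:' line as NUL-separated ASCII."""
    # The <..> pair only marks the byte at the faulting RIP; drop the markers.
    tokens = code_line.replace("<", "").replace(">", "").split()
    raw = bytes(int(t, 16) for t in tokens)
    # Split on NUL terminators and keep the non-empty pieces.
    return [s.decode("ascii", errors="replace") for s in raw.split(b"\x00") if s]


CODE = (
    "00 41 4d 44 47 50 55 5f 47 45 4d 5f 4d 4d 41 50 00 41 4d 44 47 50 55 "
    "5f 43 54 58 00 41 4d 44 47 50 55 5f 42 4f 5f 4c 49 53 54 00 <41> 4d 44 "
    "47 50 55 5f 43 53 00 41 4d 44 47 50 55 5f 49 4e 46 4f"
)

print(decode_code_bytes(CODE))
# ['AMDGPU_GEM_MMAP', 'AMDGPU_CTX', 'AMDGPU_BO_LIST', 'AMDGPU_CS', 'AMDGPU_INFO']
```

The decoded strings look like amdgpu ioctl names, with the faulting RIP pointing at the start of "AMDGPU_CS". That would suggest the CPU jumped into string data instead of code, which fits the NX-protected page message above; this is an interpretation of the log, not something confirmed elsewhere in this report.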
Comment 13 Olaf H B 2017-07-13 18:48:59 UTC
After further testing I started to suspect that the problem might be a hardware issue.

I then replaced the memory, and the issue seems to be fixed.

Sorry for the noise, guys.
