Bug 196337

Summary: [amdgpu][carrizo] Re-enable GFX PG breaks Carrizo
Product: Drivers
Component: Video(DRI - non Intel)
Reporter: Johannes Hirte (johannes.hirte)
Assignee: drivers_video-dri
Status: NEW
Severity: normal
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 4.12.0-09301-g3b06b1a7448e
Subsystem:
Regression: No
Bisected commit-id:
Attachments: dmesg, Xorg.0.log

Description Johannes Hirte 2017-07-12 08:48:43 UTC
With the current drm updates, my system immediately reboots when X starts. Bisecting first pointed me to:

2dc80b00652f2a08f3f1a01e668e3c7ad716f55f is the first bad commit
commit 2dc80b00652f2a08f3f1a01e668e3c7ad716f55f
Author: Shirish S <shirish.s@amd.com>
Date:   Thu May 25 10:05:25 2017 +0530

    drm/amdgpu: optimize amdgpu driver load & resume time
    
    amdgpu_device_resume() & amdgpu_device_init() have a high
    time consuming call of amdgpu_late_init() which sets the
    clock_gating state of all IP blocks and is blocking.
    This patch defers only this setting of clock gating state
    operation to post resume of amdgpu driver but ideally before
    the UI comes up or in some cases post ui as well.
    
    With this change the resume time of amdgpu_device comes down
    from 1.299s to 0.199s which further helps in reducing the overall
    system resume time.
    
    V1: made the optimization applicable during driver load as well.
    
    TEST:(For ChromiumOS on STONEY only)
    * UI comes up
    * amdgpu_late_init() call gets called consistently and no errors reported.
    
    Signed-off-by: Shirish S <shirish.s@amd.com>
    Reviewed-by: Huang Rui <ray.huang@amd.com>
    Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
    Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

:040000 040000 a7c011a1238a16f134e62f7f891660218387f834 66059629924642e897de024bea19d28856e6ae44 M      drivers

The commit before it doesn't work either: it hangs on the console with a white cursor when X starts. A second bisect led me to:

commit 4fdca894bb3be01829bb620b6157cafee9f956a6
Author: Alex Deucher <alexander.deucher@amd.com>
Date:   Fri Mar 17 15:27:06 2017 -0400

    Revert "drm/amd/amdgpu: Disable GFX_PG on Carrizo until compute issues solved"
    
    Re-enable GFX PG.  It's working properly with MEC now that KIQ is
    enabled.
    
    Reviewed-by: Samuel  Li <samuel.li@amd.com>
    
    This reverts commit e9ef19aa1bdeac380662a112f1d03a7c3477527f.

Reverting this commit (disabling GFX PG again) made my system work again.

The system is an HP ProBook 645 G2 with an AMD PRO A10-8700B.
Comment 1 Johannes Hirte 2017-08-04 13:58:17 UTC
This is still an issue with linux-4.13-rc3. I can log in via ssh, so I can provide the logs. Showing blocked tasks gives me this:

[  620.046142] sysrq: SysRq : Show Blocked State
[  620.046155]   task                        PC stack   pid father
[  620.046194] disk_cache:0    D    0  2446      1 0x00000006
[  620.046201] Call Trace:
[  620.046214]  __schedule+0x217/0x710
[  620.046222]  ? preempt_count_add+0x6f/0xb0
[  620.046268]  ? _raw_spin_lock_irqsave+0x18/0x50
[  620.046274]  schedule+0x3b/0x90
[  620.046280]  amd_sched_entity_fini+0x95/0xe0
[  620.046286]  ? wait_woken+0x80/0x80
[  620.046292]  amdgpu_vm_fini+0x38/0x280
[  620.046297]  amdgpu_driver_postclose_kms+0x81/0x1f0
[  620.046302]  ? drm_master_release+0x61/0x110
[  620.046306]  drm_release+0x260/0x380
[  620.046312]  __fput+0xd4/0x1e0
[  620.046317]  ____fput+0x9/0x10
[  620.046322]  task_work_run+0x71/0x90
[  620.046327]  do_exit+0x2c1/0xb00
[  620.046332]  ? futex_wait+0x149/0x230
[  620.046336]  do_group_exit+0x3e/0xb0
[  620.046341]  get_signal+0x25b/0x610
[  620.046347]  do_signal+0x23/0x5e0
[  620.046352]  ? __schedule+0x217/0x710
[  620.046356]  ? preempt_count_add+0x6f/0xb0
[  620.046362]  ? strlcpy+0x36/0x50
[  620.046367]  exit_to_usermode_loop+0x53/0x90
[  620.046371]  syscall_return_slowpath+0x53/0x60
[  620.046374]  entry_SYSCALL_64_fastpath+0x92/0x94
[  620.046379] RIP: 0033:0x7fd3ebc0323d
[  620.046382] RSP: 002b:00007fd3dcc6bbb0 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
[  620.046386] RAX: fffffffffffffe00 RBX: 0000000001bf0cd8 RCX: 00007fd3ebc0323d
[  620.046389] RDX: 0000000000000000 RSI: 0000000000000080 RDI: 0000000001bf0d00
[  620.046392] RBP: 0000000000000000 R08: 0000000000000000 R09: cccccccccccccccd
[  620.046395] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[  620.046398] R13: 0000000001bf0cb0 R14: 0000000000000000 R15: 0000000001bf0d00
Comment 2 Johannes Hirte 2017-08-04 13:59:00 UTC
Created attachment 257809 [details]
dmesg
Comment 3 Johannes Hirte 2017-08-04 13:59:18 UTC
Created attachment 257811 [details]
Xorg.0.log
Comment 4 Johannes Hirte 2017-08-25 19:09:34 UTC
I've added pg_mask=0xFFFFFFFE as suggested by Tom St Denis at https://lists.freedesktop.org/archives/amd-gfx/2017-August/012329.html

This works for me, so I don't need to modify the code anymore. But this really should be fixed upstream. Please tell me if you need more information or have something for me to test.
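
For readers wondering why 0xFFFFFFFE helps: the following is a minimal sketch of the mask arithmetic, not driver code. It assumes that bit 0 of pg_mask corresponds to GFX power gating (AMD_PG_SUPPORT_GFX_PG in the kernel's amd_shared.h) and that the default mask is 0xFFFFFFFF; both should be checked against the kernel tree in use. The names GFX_PG_BIT, DEFAULT_MASK, WORKAROUND_MASK and gfx_pg_enabled are hypothetical and exist only for this illustration.

```python
# Assumption: bit 0 of the amdgpu pg_mask is the GFX power-gating flag
# (AMD_PG_SUPPORT_GFX_PG in amd_shared.h). Verify against your kernel tree.
GFX_PG_BIT = 1 << 0

# Assumption: the driver's default mask leaves every power-gating feature enabled.
DEFAULT_MASK = 0xFFFFFFFF
# The value used as a workaround in this report.
WORKAROUND_MASK = 0xFFFFFFFE


def gfx_pg_enabled(pg_mask: int) -> bool:
    """Return True if the (assumed) GFX PG bit is set in a 32-bit pg_mask."""
    return bool(pg_mask & 0xFFFFFFFF & GFX_PG_BIT)


# Clearing only bit 0 of the default mask yields the workaround value,
# so all other power-gating flags stay as they were.
derived_mask = DEFAULT_MASK & ~GFX_PG_BIT

print(gfx_pg_enabled(DEFAULT_MASK))
print(gfx_pg_enabled(WORKAROUND_MASK))
print(hex(derived_mask))
```

Under these assumptions the workaround disables only GFX PG, which matches the effect of reverting commit 4fdca894bb3b, while leaving the remaining power-gating features untouched.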