Bug 205545

Summary: Linux >=5.3.11 < 5.4.0 --> i915: Resetting rcs0 for hang on rcs0
Product: Drivers
Component: Video(Other)
Status: RESOLVED MOVED
Severity: high
Priority: P1
Hardware: Intel
OS: Linux
Kernel Version: >=5.3.11 <5.4.0
Regression: No
Reporter: Ale (alpiturchi)
Assignee: drivers_video-other
CC: alpiturchi, hi-angel, kernel, marek.bartosiewicz, waltercool
Subsystem:
Bisected commit-id:

Description Ale 2019-11-16 09:13:34 UTC
I recently upgraded from kernel 5.3.10 to 5.3.11, so it's just a minor upgrade.

Curiously, when I launch LibreOffice, the system freezes for a few seconds and I get this in the dmesg output:

i915 0000:00:02.0: Resetting rcs0 for hang on rcs0	

This is 100% reproducible.

After switching back to 5.3.10, the problem disappears.
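
To check whether a given kernel build shows the hang, the reset message can be counted in the kernel log. This is a minimal sketch, assuming the message format quoted above; the function name count_i915_resets is made up for illustration.

```shell
# Count i915 engine-reset messages in kernel log text read from stdin.
# The pattern is an assumption based on the single dmesg line quoted
# in this report; other kernels may word the message differently.
count_i915_resets() {
    # grep -c prints the number of matching lines; it exits non-zero
    # when the count is 0, so "|| true" keeps the function's status clean.
    grep -c -E 'i915 [0-9a-f:.]+: Resetting [a-z0-9]+ for hang on [a-z0-9]+' || true
}
```

On a live system this could be run as `dmesg | count_i915_resets` right after opening a document. Reading dmesg may require root, depending on the kernel.dmesg_restrict setting.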

Here are my system specs:
heavensdoor ~ # inxi -Fzm
System:    Host: heavensdoor Kernel: 5.3.10-gentoo x86_64 bits: 64 Desktop: Xfce 4.14.1 Distro: Gentoo Base System release 2.6 
Machine:   Type: Laptop System: Notebook product: P9XXRC v: N/A serial: <filter> 
           Mobo: Notebook model: P9XXRC serial: <filter> UEFI: INSYDE v: 1.07.04 date: 05/03/2019 
Battery:   ID-1: BAT0 charge: 59.2 Wh condition: 59.2/56.2 Wh (105%) 
Memory:    RAM: total: 31.08 GiB used: 904.0 MiB (2.8%) 
           RAM Report: permissions: Unable to run dmidecode. Root privileges required. 
CPU:       Topology: 6-Core model: Intel Core i7-9750H bits: 64 type: MT MCP L2 cache: 12.0 MiB 
           Speed: 800 MHz min/max: 800/4500 MHz Core speeds (MHz): 1: 800 2: 801 3: 800 4: 800 5: 800 6: 800 7: 800 8: 801 
           9: 800 10: 800 11: 800 12: 800 
Graphics:  Device-1: Intel UHD Graphics 630 driver: i915 v: kernel 
           Device-2: NVIDIA TU116M [GeForce GTX 1660 Mobile] driver: N/A 
           Display: server: X.Org 1.20.5 driver: intel unloaded: modesetting,vesa resolution: 1920x1080~144Hz 
           OpenGL: renderer: Mesa DRI Intel UHD Graphics 630 (Coffeelake 3x8 GT2) v: 4.5 Mesa 19.1.8 
Audio:     Device-1: Intel Cannon Lake PCH cAVS driver: snd_hda_intel 
           Device-2: NVIDIA driver: snd_hda_intel 
           Sound Server: ALSA v: k5.3.10-gentoo 
Network:   Device-1: Qualcomm Atheros AR9462 Wireless Network Adapter driver: ath9k 
           IF: wlp8s0 state: up mac: <filter> 
           Device-2: Realtek RTL8111/8168/8411 PCI Express Gigabit Ethernet driver: r8169 
           IF: enp9s0 state: down mac: <filter> 
           IF-ID-1: sit0 state: down mac: <filter> 
           IF-ID-2: tunl0 state: down mac: <filter> 
Drives:    Local Storage: total: 1.38 TiB used: 172.60 GiB (12.3%) 
           ID-1: /dev/nvme0n1 vendor: Samsung model: SSD 970 PRO 512GB size: 476.94 GiB 
           ID-2: /dev/sda vendor: Samsung model: SSD 860 EVO 1TB size: 931.51 GiB 
Partition: ID-1: / size: 100.00 GiB used: 6.66 GiB (6.7%) fs: btrfs dev: /dev/dm-0 
           ID-2: /home size: 276.45 GiB used: 4.57 GiB (1.7%) fs: btrfs dev: /dev/dm-1 
Sensors:   System Temperatures: cpu: 56.0 C mobo: N/A 
           Fan Speeds (RPM): N/A 
Info:      Processes: 295 Uptime: 46m Shell: bash inxi: 3.0.36
Comment 1 Ale 2019-11-20 15:27:24 UTC
I used git bisect from vanilla sources following this:
https://wiki.gentoo.org/wiki/Kernel_git-bisect

Finally, I found the offending commit:

77fc9100fc5768ca01ca2dd2cc5a515a4723a58a is the first bad commit
commit 77fc9100fc5768ca01ca2dd2cc5a515a4723a58a
Author: Jon Bloomfield <jon.bloomfield@intel.com>
Date:   Thu Sep 27 10:23:17 2018 -0700

    drm/i915/cmdparser: Use explicit goto for error paths
   
    commit 0546a29cd884fb8184731c79ab008927ca8859d0 upstream.
   
    In the next patch we will be adding a second valid
    termination condition which will require a small
    amount of refactoring to share logic with the BB_END
    case.
   
    Refactor all error conditions to jump to a dedicated
    exit path, with 'break' reserved only for a successful
    parse.
   
    Cc: Tony Luck <tony.luck@intel.com>
    Cc: Dave Airlie <airlied@redhat.com>
    Cc: Takashi Iwai <tiwai@suse.de>
    Cc: Tyler Hicks <tyhicks@canonical.com>
    Signed-off-by: Jon Bloomfield <jon.bloomfield@intel.com>
    Reviewed-by: Chris Wilson <chris.p.wilson@intel.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

 drivers/gpu/drm/i915/i915_cmd_parser.c | 25 +++++++++++++------------
 1 file changed, 13 insertions(+), 12 deletions(-)

Prior to this commit the hangs do not happen. After it, the system hangs after just opening a LibreOffice document 2-3 times.

This is the bisect log

heavensdoor linux # git bisect log
git bisect start
# bad: [dada86c5aaa8f2305bf8a8bf9014b60603f9f013] Linux 5.3.11
git bisect bad dada86c5aaa8f2305bf8a8bf9014b60603f9f013
# good: [b260a0862e3a9fccdac23ec3b783911b098c1c74] Linux 5.3.10
git bisect good b260a0862e3a9fccdac23ec3b783911b098c1c74
# good: [9fd8ecf10b9cf44efc6d70791df3e1857598fb76] net: stmmac: Fix the problem of tso_xmit
git bisect good 9fd8ecf10b9cf44efc6d70791df3e1857598fb76
# good: [cf0ccb042e9ea3e6a156b753292b4d3e80bfeef9] hv_netvsc: Fix error handling in netvsc_attach()
git bisect good cf0ccb042e9ea3e6a156b753292b4d3e80bfeef9
# good: [7819546459c63b98fe14d43137ef8e4eebeb78f4] drm/i915: Remove Master tables from cmdparser
git bisect good 7819546459c63b98fe14d43137ef8e4eebeb78f4
# bad: [981d3a01c29b03b512604f21196af3ec7a14987f] x86/speculation/taa: Add mitigation for TSX Async Abort
git bisect bad 981d3a01c29b03b512604f21196af3ec7a14987f
# bad: [bdb4e778f43a07e0d51354c4b9a8a17306ec4b85] drm/i915/cmdparser: Ignore Length operands during command matching
git bisect bad bdb4e778f43a07e0d51354c4b9a8a17306ec4b85
# good: [41e79b82c420f88c709f53dd2f3e61e0c01d511b] drm/i915: Allow parsing of unsized batches
git bisect good 41e79b82c420f88c709f53dd2f3e61e0c01d511b
# bad: [77fc9100fc5768ca01ca2dd2cc5a515a4723a58a] drm/i915/cmdparser: Use explicit goto for error paths
git bisect bad 77fc9100fc5768ca01ca2dd2cc5a515a4723a58a
# good: [4b75b05cb098b15658a91fae3be29a58a1cfa2a1] drm/i915: Add gen9 BCS cmdparsing
git bisect good 4b75b05cb098b15658a91fae3be29a58a1cfa2a1
# first bad commit: [77fc9100fc5768ca01ca2dd2cc5a515a4723a58a] drm/i915/cmdparser: Use explicit goto for error paths
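
A saved bisect log like the one above ends with a "# first bad commit" line. As a rough sketch, the hash can be pulled out of such a log with a small helper; the function name first_bad_commit and the file name bisect.log are made up for illustration.

```shell
# Print the hash from the "# first bad commit: [<hash>] ..." line of a
# git bisect log read from stdin. Prints nothing if no such line exists.
first_bad_commit() {
    sed -n 's/^# first bad commit: \[\([0-9a-f]*\)\].*/\1/p'
}
```

For example, after `git bisect log > bisect.log`, running `git show --stat "$(first_bad_commit < bisect.log)"` should display the commit in question. The same saved log can also be fed to `git bisect replay bisect.log` to restore the bisect state in another clone.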
Comment 2 Ale 2019-11-21 19:21:15 UTC
Problem still present in 5.3.12
Comment 3 Ale 2019-11-26 17:31:07 UTC
Problem still present in 5.3.13 but solved somehow in 5.4.0
Comment 4 Marek 2019-11-27 09:16:33 UTC
5.4.0 is still bad on my Skylake laptop. It keeps the Intel GPU powered on 100% of the time and never enters any RC6 state:

                    |             GPU     |
                    |                     |
                    | Powered On 100,0%   |
                    | RC6         0,0%    |
                    | RC6p        0,0%    |
                    | RC6pp       0,0%    |
Comment 5 Ale 2019-11-27 21:01:54 UTC
I discovered that 5.4.0 now hangs the notebook completely. There are no logs, just a freeze that requires a hard reset.
It's a bit too much for me to debug this, so I'll stick with 5.3.10 for a while...

Please, is any kernel developer watching this?
Comment 6 Jani Nikula 2019-11-28 15:08:55 UTC
(In reply to Ale from comment #5)
> Please, is any kernel developer watching this?

No. Please file all i915 bugs at [1].

[1] https://bugs.freedesktop.org/enter_bug.cgi?product=DRI&component=DRM/Intel
Comment 7 Jani Nikula 2019-12-02 11:32:48 UTC
(In reply to Jani Nikula from comment #6)
> (In reply to Ale from comment #5)
> > Please, is any kernel developer watching this?
> 
> No. Please file all i915 bugs at [1].
> 
> [1]
> https://bugs.freedesktop.org/enter_bug.cgi?product=DRI&component=DRM/Intel

Ahem, just moved to https://gitlab.freedesktop.org/drm/intel/issues/new
Comment 8 Konstantin Kharlamov 2020-01-10 13:49:28 UTC
I think it is this one: https://gitlab.freedesktop.org/drm/intel/issues/673