Bug 202063 - [Regression] Spinlock not released on kernel 4.9.147 by i915, CPU stuck
Status: RESOLVED CODE_FIX
Alias: None
Product: Memory Management
Classification: Unclassified
Component: Other
Hardware: x86-64 Linux
Importance: P1 high
Assignee: Andrew Morton
Duplicates: 202295
Reported: 2018-12-25 17:47 UTC by ValdikSS
Modified: 2019-02-07 13:58 UTC
CC List: 9 users
Kernel Version: 4.9.150
Tree: Mainline
Regression: Yes


Attachments
Proposed fix from David Airlie (967 bytes, patch)
2019-01-23 02:47 UTC, Will Deacon

Description ValdikSS 2018-12-25 17:47:55 UTC
With kernel 4.9.147, starting graphics using Xorg on my laptop hangs the graphics subsystem entirely: you cannot switch between VTs and the screen is not refreshed.
Everything works fine with the previous version, 4.9.146.

Sending dmesg and journalctl -f over ssh to another server gives me the following:

[  +3.890763] NMI watchdog: BUG: soft lockup - CPU#2 stuck for 23s! [Xorg:2590]
[  +0.000003] Modules linked in: ccm xt_comment xt_owner ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ebtable_nat ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle devlink ip6table_raw ip6table_security iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat iptable_mangle iptable_raw iptable_security nf_conntrack ip_set nfnetlink ebtable_filter ebtables ip6table_filter ip6_tables jc42 sunrpc vfat fat btusb btrtl btbcm btintel bluetooth arc4 iwlmvm intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp snd_hda_codec_hdmi mei_wdt kvm_intel mac80211 iTCO_wdt iTCO_vendor_support kvm snd_hda_codec_conexant snd_hda_codec_generic irqbypass snd_hda_intel intel_cstate intel_uncore snd_hda_codec iwlwifi intel_rapl_perf snd_hda_core snd_hwdep snd_seq cfg80211 snd_seq_device
[  +0.000031]  snd_pcm mei_me i2c_i801 mei lpc_ich i2c_smbus snd_timer shpchp thinkpad_acpi snd wmi soundcore rfkill tpm_tis tpm_tis_core tpm sch_fq tcp_bbr binfmt_misc btrfs xor raid6_pq dm_crypt i915 i2c_algo_bit crct10dif_pclmul drm_kms_helper crc32_pclmul crc32c_intel drm sdhci_pci ghash_clmulni_intel sdhci e1000e mmc_core serio_raw ptp pps_core fjes video
[  +0.000021] CPU: 2 PID: 2590 Comm: Xorg Tainted: G        W       4.9.147 #1
[  +0.000000] Hardware name: LENOVO 4286CTO/4286CTO, BIOS 8DET76WW (1.46 ) 06/21/2018
[  +0.000002] task: ffff94534da40000 task.stack: ffff9f7ac2250000
[  +0.000000] RIP: 0010:[<ffffffff990ef5d4>]  [<ffffffff990ef5d4>] queued_spin_lock_slowpath+0x54/0x1a0
[  +0.000006] RSP: 0018:ffff9f7ac2253b58  EFLAGS: 00000202
[  +0.000000] RAX: 0000000000000101 RBX: ffff9453511e3600 RCX: ffff9453416f3060
[  +0.000001] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff9453511e3610
[  +0.000001] RBP: ffff9f7ac2253b58 R08: ffff945341ab98b0 R09: ffffffffc0430cd0
[  +0.000001] R10: ffffd41b4805bc00 R11: 0000000000000000 R12: ffff94533fb5d880
[  +0.000000] R13: ffff94533fb5d9c0 R14: ffff94534a4fee00 R15: ffff94534d2cede8
[  +0.000002] FS:  00007fefb09c4ac0(0000) GS:ffff94535e280000(0000) knlGS:0000000000000000
[  +0.000001] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.000000] CR2: 000055b71de41400 CR3: 000000020c73e000 CR4: 0000000000060670
[  +0.000001] Stack:
[  +0.000002]  ffff9f7ac2253b68 ffffffff99833191 ffff9f7ac2253b90 ffffffffc043248b
[  +0.000002]  ffff94534d2cec38 ffff945341ab0000 ffff945341ab2860 ffff9f7ac2253bc8
[  +0.000002]  ffffffffc043383b ffff94534d2cec38 ffff94534d2cede0 ffff9f7ac2253db0
[  +0.000002] Call Trace:
[  +0.000004]  [<ffffffff99833191>] _raw_spin_lock+0x21/0x30
[  +0.000025]  [<ffffffffc043248b>] i915_gem_request_retire+0xab/0x1c0 [i915]
[  +0.000016]  [<ffffffffc043383b>] i915_gem_request_alloc+0x1ab/0x280 [i915]
[  +0.000014]  [<ffffffffc0420ac4>] i915_gem_do_execbuffer.isra.49+0x5e4/0x1a00 [i915]
[  +0.000003]  [<ffffffff99833500>] ? syscall_return_via_sysret+0x44/0x4d
[  +0.000002]  [<ffffffff99833564>] ? __switch_to_asm+0x34/0x70
[  +0.000014]  [<ffffffffc04222cb>] i915_gem_execbuffer2+0xeb/0x240 [i915]
[  +0.000010]  [<ffffffffc0351eeb>] drm_ioctl+0x21b/0x480 [drm]
[  +0.000014]  [<ffffffffc04221e0>] ? i915_gem_execbuffer+0x300/0x300 [i915]
[  +0.000003]  [<ffffffff99271c28>] do_vfs_ioctl+0xa8/0x610
[  +0.000001]  [<ffffffff9927220a>] SyS_ioctl+0x7a/0x90
[  +0.000002]  [<ffffffff99003c29>] do_syscall_64+0x79/0x180
[  +0.000002]  [<ffffffff9983344e>] entry_SYSCALL_64_after_swapgs+0x58/0xc6
[  +0.000001] Code: 74 3f 40 30 f6 85 f6 75 61 f0 0f ba 2f 08 8b 07 72 58 89 c2 30 e6 a9 00 00 ff ff 0f 85 45 01 00 00 85 d2 74 0e 8b 07 84 c0 74 08 <f3> 90 8b 07 84 c0 75 f8 b8 01 00 00 00 5d 66 89 07 c3 f3 90 eb

This is a Lenovo ThinkPad X220 laptop with Intel HD 3000 (Sandy Bridge) graphics. Since 4.9.147 contains only a single i915 commit, and that commit does not touch Sandy Bridge, I assume the problem is in the spinlock patches.
Comment 1 ValdikSS 2018-12-29 21:46:38 UTC
Also happens on 4.9.148.
Comment 2 ValdikSS 2019-01-15 15:21:18 UTC
Still happens on 4.9.150.
Comment 3 paulmck 2019-01-15 16:05:57 UTC
On Sat, Dec 29, 2018 at 09:46:38PM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=202063
> 
> --- Comment #1 from ValdikSS (iam@valdikss.org.ru) ---
> Also happens on 4.9.148.

Could you please try bisecting between 4.9.146 and 4.9.147?  That should
help pinpoint the offending commit.

							Thanx, Paul
Comment 4 ValdikSS 2019-01-15 16:07:46 UTC
I'm pretty sure that the problem is in the spinlock patch series. Do you want me to determine the exact patch in the patchset?
Comment 5 paulmck 2019-01-15 17:29:52 UTC
On Tue, Jan 15, 2019 at 04:07:46PM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=202063
> 
> --- Comment #4 from ValdikSS (iam@valdikss.org.ru) ---
> I'm pretty sure that the problem in spinlock patch series. Do you want me to
> determine exact patch in the patchset?

Use whatever variant of bisection you like.  As long as it finds the
offending commit, it is no skin off my teeth.  ;-)

							Thanx, Paul
Comment 6 ValdikSS 2019-01-18 07:27:12 UTC
git bisect start
# bad: [bbfc30f29cb328111fec12975ded8223ecc8e1a5] Linux 4.9.147
git bisect bad bbfc30f29cb328111fec12975ded8223ecc8e1a5
# good: [0cff89461d557239296735d18b5a144c8f4b151b] Linux 4.9.146
git bisect good 0cff89461d557239296735d18b5a144c8f4b151b
# bad: [3e5d4c14a7427dc2a24737c8dcc61688870d737a] mac80211_hwsim: fix module init error paths for netlink
git bisect bad 3e5d4c14a7427dc2a24737c8dcc61688870d737a
# good: [af20483dbd7c2a01f7874191524fc0397b9d3bec] Revert "drm/rockchip: Allow driver to be shutdown on reboot/kexec"
git bisect good af20483dbd7c2a01f7874191524fc0397b9d3bec
# bad: [60668f3cddf1b25a954b198cade0ce726a6853ab] locking/qspinlock: Merge 'struct __qspinlock' into 'struct qspinlock'
git bisect bad 60668f3cddf1b25a954b198cade0ce726a6853ab
# good: [d395117fac7943da6966ccbac3b95651f5581f15] IB/hfi1: Remove race conditions in user_sdma send path
git bisect good d395117fac7943da6966ccbac3b95651f5581f15
# good: [48c42d4dfec408760d15acc334d91208a6b2262e] locking/qspinlock: Ensure node is initialised before updating prev->next
git bisect good 48c42d4dfec408760d15acc334d91208a6b2262e
# good: [8e5b3bcc5291092aaac4cadc0b5fb46182172ed3] locking/qspinlock: Bound spinning on pending->locked transition in slowpath
git bisect good 8e5b3bcc5291092aaac4cadc0b5fb46182172ed3
# first bad commit: [60668f3cddf1b25a954b198cade0ce726a6853ab] locking/qspinlock: Merge 'struct __qspinlock' into 'struct qspinlock'

60668f3cddf1b25a954b198cade0ce726a6853ab is the first bad commit
commit 60668f3cddf1b25a954b198cade0ce726a6853ab
Author: Will Deacon <will.deacon@arm.com>
Date:   Tue Dec 18 23:10:43 2018 +0100

    locking/qspinlock: Merge 'struct __qspinlock' into 'struct qspinlock'
    
    commit 625e88be1f41b53cec55827c984e4a89ea8ee9f9 upstream.
    
    'struct __qspinlock' provides a handy union of fields so that
    subcomponents of the lockword can be accessed by name, without having to
    manage shifts and masks explicitly and take endianness into account.
    
    This is useful in qspinlock.h and also potentially in arch headers, so
    move the 'struct __qspinlock' into 'struct qspinlock' and kill the extra
    definition.
    
    Signed-off-by: Will Deacon <will.deacon@arm.com>
    Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Acked-by: Waiman Long <longman@redhat.com>
    Acked-by: Boqun Feng <boqun.feng@gmail.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: linux-arm-kernel@lists.infradead.org
    Cc: paulmck@linux.vnet.ibm.com
    Link: http://lkml.kernel.org/r/1524738868-31318-3-git-send-email-will.deacon@arm.com
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Signed-off-by: Sasha Levin <sashal@kernel.org>
Comment 7 paulmck 2019-01-18 18:45:56 UTC
On Fri, Jan 18, 2019 at 07:27:12AM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=202063
> 
> --- Comment #6 from ValdikSS (iam@valdikss.org.ru) ---
> git bisect start
> # bad: [bbfc30f29cb328111fec12975ded8223ecc8e1a5] Linux 4.9.147
> git bisect bad bbfc30f29cb328111fec12975ded8223ecc8e1a5
> # good: [0cff89461d557239296735d18b5a144c8f4b151b] Linux 4.9.146
> git bisect good 0cff89461d557239296735d18b5a144c8f4b151b
> # bad: [3e5d4c14a7427dc2a24737c8dcc61688870d737a] mac80211_hwsim: fix module
> init error paths for netlink
> git bisect bad 3e5d4c14a7427dc2a24737c8dcc61688870d737a
> # good: [af20483dbd7c2a01f7874191524fc0397b9d3bec] Revert "drm/rockchip: Allow
> driver to be shutdown on reboot/kexec"
> git bisect good af20483dbd7c2a01f7874191524fc0397b9d3bec
> # bad: [60668f3cddf1b25a954b198cade0ce726a6853ab] locking/qspinlock: Merge
> 'struct __qspinlock' into 'struct qspinlock'
> git bisect bad 60668f3cddf1b25a954b198cade0ce726a6853ab
> # good: [d395117fac7943da6966ccbac3b95651f5581f15] IB/hfi1: Remove race
> conditions in user_sdma send path
> git bisect good d395117fac7943da6966ccbac3b95651f5581f15
> # good: [48c42d4dfec408760d15acc334d91208a6b2262e] locking/qspinlock: Ensure
> node is initialised before updating prev->next
> git bisect good 48c42d4dfec408760d15acc334d91208a6b2262e
> # good: [8e5b3bcc5291092aaac4cadc0b5fb46182172ed3] locking/qspinlock: Bound
> spinning on pending->locked transition in slowpath
> git bisect good 8e5b3bcc5291092aaac4cadc0b5fb46182172ed3
> # first bad commit: [60668f3cddf1b25a954b198cade0ce726a6853ab]
> locking/qspinlock: Merge 'struct __qspinlock' into 'struct qspinlock'

Thank you!

Does this happen on mainline?  As in, is this a bug in mainline or a
bug in backporting a fix?  Does reverting this patch in -stable make
the problem go away?

Adding Boqun on CC, as the rest are CCed on the bugzilla.

							Thanx, Paul

Comment 8 ValdikSS 2019-01-18 18:51:15 UTC
I was running kernels 4.18 and 4.19 (up to 4.19.15) and everything is fine. I think the bug is only in 4.9.147+. I have not tested other LTS kernels (4.4, 4.14).
Comment 9 paulmck 2019-01-18 19:57:04 UTC
On Fri, Jan 18, 2019 at 06:51:15PM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=202063
> 
> --- Comment #8 from ValdikSS (iam@valdikss.org.ru) ---
> I was running kernels 4.18 and 4.19 (up to 4.19.15) and everything is fine. I
> think that's a bug only in 4.9.147+. Have not tested other LTS kernels (4.4,
> 4.14)

OK, I will bite...  Perhaps this commit should not have been backported
to 4.9-stable in the first place.  So does reverting it in 4.9.147+ help?

								Thanx, Paul
Comment 10 ValdikSS 2019-01-18 19:58:51 UTC
Sorry, I already deleted the kernel sources and can't try right now. Better to wait for the commit author's reply.
Comment 11 ValdikSS 2019-01-19 18:53:30 UTC
This commit does not revert cleanly.
Comment 12 Yill Din 2019-01-21 00:04:04 UTC
Reverting c6bcf40f769294a80c64213f9175ccd408d64532 through c3b6e79fbf295c9cda4dd1828a8f0593cad53d48 allows this kernel and 4.9.151 to work here.
Comment 13 Dave Airlie 2019-01-22 18:29:07 UTC
Okay, I looked at this with Will yesterday and got distracted by the fact that CONFIG_PARAVIRT_SPINLOCKS must not be set for it to happen.

It appears the first chunk of the indicated patch is what causes it.

diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h
index e07cc206919d..eaba08076030 100644
--- a/arch/x86/include/asm/qspinlock.h
+++ b/arch/x86/include/asm/qspinlock.h
@@ -14,7 +14,7 @@
  */
 static inline void native_queued_spin_unlock(struct qspinlock *lock)
 {
-       smp_store_release(&lock->locked, 0);
+       smp_store_release((u8 *)lock, 0);
 }

This seems to fix it for me. Comparing the disassembly of a working and a failing build:

[airlied@carbonite linux]$ diff ../works-obj ../fails-obj 
317c317
<      224:	c6 43 10 00          	movb   $0x0,0x10(%rbx)
---
>      224:     c6 43 13 00             movb   $0x0,0x13(%rbx)
492c492
<      3a0:	41 c6 44 24 10 00    	movb   $0x0,0x10(%r12)
---
>      3a0:     41 c6 44 24 13 00       movb   $0x0,0x13(%r12)
1558c1558
<      dde:	c6 80 ac a4 00 00 00 	movb   $0x0,0xa4ac(%rax)
---
>      dde:     c6 80 af a4 00 00 00    movb   $0x0,0xa4af(%rax)


Which doesn't look good: the failing build clears the byte at offset 3 of the lock word instead of the byte at offset 0.
Comment 14 Dave Airlie 2019-01-22 18:43:02 UTC
There is a missing byteorder.h include somewhere.
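
To illustrate why a missing byteorder.h include would produce the 0x10 vs 0x13 offsets seen in the disassembly, here is a minimal user-space sketch. It assumes the merged 'struct qspinlock' selects its field layout with #ifdef __LITTLE_ENDIAN and falls back to the big-endian layout when that macro is not defined. The names qspinlock_le, qspinlock_be, locked_offset_le(), locked_offset_be() and clear_byte() are hypothetical and exist only for this example; this is not the kernel code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed layout when __LITTLE_ENDIAN is defined: 'locked' is byte 0. */
struct qspinlock_le {
	union {
		uint32_t val;
		struct {
			uint8_t locked;
			uint8_t pending;
		};
		struct {
			uint16_t locked_pending;
			uint16_t tail;
		};
	};
};

/* Assumed fallback layout when the macro is missing: 'locked' is byte 3. */
struct qspinlock_be {
	union {
		uint32_t val;
		struct {
			uint16_t tail;
			uint16_t locked_pending;
		};
		struct {
			uint8_t reserved[2];
			uint8_t pending;
			uint8_t locked;
		};
	};
};

/* Byte offset of 'locked' in each layout. */
size_t locked_offset_le(void)
{
	return offsetof(struct qspinlock_le, locked);
}

size_t locked_offset_be(void)
{
	return offsetof(struct qspinlock_be, locked);
}

/*
 * Clear one byte of a 32-bit lock word, the way
 * "movb $0x0,off(%reg)" does in the disassembly above.
 */
uint32_t clear_byte(uint32_t word, size_t off)
{
	unsigned char bytes[sizeof(word)];

	assert(off < sizeof(word));
	memcpy(bytes, &word, sizeof(word));
	bytes[off] = 0;
	memcpy(&word, bytes, sizeof(word));
	return word;
}
```

The two offsets differ by 3, the same difference as between 0x10(%rbx) and 0x13(%rbx). On a little-endian machine, a lock word whose locked byte is set has the value 1; clearing byte 0 releases it, while clearing byte 3 leaves it unchanged, so the next caller of _raw_spin_lock() spins forever, as in the soft lockup trace.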
Comment 15 gordan 2019-01-22 18:58:28 UTC
*** Bug 202295 has been marked as a duplicate of this bug. ***
Comment 16 Will Deacon 2019-01-23 02:47:19 UTC
Created attachment 280683
Proposed fix from David Airlie

Please can you try the attached fix from David Airlie?
Comment 17 Will Deacon 2019-01-26 22:13:10 UTC
Fix committed in 4.9.153.
Comment 18 ValdikSS 2019-02-07 13:58:47 UTC
Works now, thanks.
