Bug 75561

Summary: kernel BUG at drivers/misc/mei/hbm.c:787
Product: Drivers Reporter: Julian Sikorski (belegdol)
Component: OtherAssignee: drivers_other
Status: NEW ---    
Severity: high CC: alan, dparsons, szg00000, tomas.winkler, tomasw
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 3.14.2 Subsystem:
Regression: No Bisected commit-id:
Attachments: journalctl -b -1 output

Description Julian Sikorski 2014-05-06 05:18:51 UTC
Created attachment 135191 [details]
journalctl -b -1 output

Deal all,

since upgrading to 3.14.1, I started seeing the following bug:

maj 05 17:53:37 snowball2 kernel: mei_me 0000:00:16.0: less data available than length=00000002.
maj 05 17:53:37 snowball2 kernel: mei_me 0000:00:16.0: unexpected reset: dev_state = INIT_CLIENTS
maj 05 17:53:37 snowball2 kernel: mei_me 0000:00:16.0: hbm: properties response: state mismatch, [1, 1]
maj 05 17:53:37 snowball2 kernel: mei_me 0000:00:16.0: unexpected reset: dev_state = INIT_CLIENTS
maj 05 17:53:37 snowball2 kernel: mei_me 0000:00:16.0: hbm: start: state mismatch, [1, 3]
maj 05 17:53:37 snowball2 kernel: ------------[ cut here ]------------
maj 05 17:53:37 snowball2 kernel: kernel BUG at drivers/misc/mei/hbm.c:787!
maj 05 17:53:38 snowball2 kernel: invalid opcode: 0000 [#1] SMP 
maj 05 17:53:37 snowball2 NetworkManager[912]: <info> NetworkManager state is now DISCONNECTED
maj 05 17:53:37 snowball2 NetworkManager[912]: <info> (wlan0): device state change: unmanaged -> unavailable (reason 'managed') [10 20 2]
maj 05 17:53:37 snowball2 NetworkManager[912]: <info> (wlan0): bringing up device.
maj 05 17:53:37 snowball2 NetworkManager[912]: <info> (wlan0) supports 5 scan SSIDs
maj 05 17:53:37 snowball2 NetworkManager[912]: <info> (wlan0): supplicant interface state: starting -> ready
maj 05 17:53:37 snowball2 NetworkManager[912]: <info> (wlan0): device state change: unavailable -> disconnected (reason 'supplicant-available') [20 30 42]
maj 05 17:53:37 snowball2 NetworkManager[912]: <info> (wlan0): supplicant interface state: ready -> disconnected
maj 05 17:53:37 snowball2 NetworkManager[912]: <info> (wlan0) supports 5 scan SSIDs
maj 05 17:53:38 snowball2 kernel: Modules linked in: dm_crypt nf_conntrack_netbios_ns nf_conntrack_broadcast ccm bnep bluetooth 6lowpan_iphc ip6t_rpfilter ip6t_REJECT xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw nvidia(POF) fuse arc4 iwldvm mac80211 snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic iTCO_wdt iTCO_vendor_support x86_pkg_temp_thermal coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel usb_storage snd_hda_intel microcode snd_hda_codec snd_hwdep snd_seq joydev snd_seq_device iwlwifi
maj 05 17:53:38 snowball2 kernel:  snd_pcm serio_raw cfg80211 jmb38x_ms drm jme sdhci_pci i2c_i801 mii sdhci rfkill mmc_core snd_timer memstick i2c_core mei_me snd lpc_ich mei mfd_core soundcore shpchp nfsd auth_rpcgss nfs_acl lockd sunrpc binfmt_misc firewire_ohci firewire_core crc_itu_t wmi video
maj 05 17:53:38 snowball2 kernel: CPU: 3 PID: 8875 Comm: irq/51-mei_me Tainted: PF          O 3.14.2-200.fc20.x86_64 #1
maj 05 17:53:38 snowball2 kernel: Hardware name: CLEVO                             P150HMx/P150HMx, BIOS 4.6.4 08/09/2011
maj 05 17:53:38 snowball2 kernel: task: ffff88031bc51300 ti: ffff88031c0e2000 task.ti: ffff88031c0e2000
maj 05 17:53:38 snowball2 kernel: RIP: 0010:[<ffffffffa012b5d0>]  [<ffffffffa012b5d0>] mei_hbm_dispatch+0x650/0xea0 [mei]
maj 05 17:53:38 snowball2 kernel: RSP: 0000:ffff88031c0e3cf8  EFLAGS: 00010293
maj 05 17:53:38 snowball2 kernel: RAX: 0000000000000000 RBX: ffff88031d872000 RCX: 0000000000000003
maj 05 17:53:38 snowball2 kernel: RDX: ffffc90005b9c004 RSI: ffffc90005b9c004 RDI: 00000000801c1c1d
maj 05 17:53:38 snowball2 kernel: RBP: ffff88031c0e3d30 R08: 0000000000000000 R09: 0000000000000001
maj 05 17:53:38 snowball2 kernel: R10: 00000000000299a4 R11: 0000000000040000 R12: ffff88031d87230a
maj 05 17:53:38 snowball2 kernel: R13: ffff88031d87250c R14: ffff88031d872200 R15: ffffffff810e6340
maj 05 17:53:38 snowball2 kernel: FS:  0000000000000000(0000) GS:ffff88032f4c0000(0000) knlGS:0000000000000000
maj 05 17:53:38 snowball2 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
maj 05 17:53:38 snowball2 kernel: CR2: 0000000000000000 CR3: 0000000001c0c000 CR4: 00000000000407e0
maj 05 17:53:38 snowball2 kernel: Stack:
maj 05 17:53:38 snowball2 kernel:  ffffffff810a4522 ffff880200000000 ffff88031d872000 ffff88031c0e3d9c
maj 05 17:53:38 snowball2 kernel:  ffff88031d87250c ffff88031d872200 ffffffff810e6340 ffff88031c0e3d88
maj 05 17:53:38 snowball2 kernel:  ffffffffa012c180 000000031c0e3d70 0000000000000293 00000000ffffffb9
maj 05 17:53:38 snowball2 kernel: Call Trace:
maj 05 17:53:38 snowball2 kernel:  [<ffffffff810a4522>] ? insert_work+0x62/0xa0
maj 05 17:53:38 snowball2 kernel:  [<ffffffff810e6340>] ? irq_finalize_oneshot.part.30+0xe0/0xe0
maj 05 17:53:38 snowball2 kernel:  [<ffffffffa012c180>] mei_irq_read_handler+0x2a0/0x6e0 [mei]
maj 05 17:53:38 snowball2 kernel:  [<ffffffff810e6340>] ? irq_finalize_oneshot.part.30+0xe0/0xe0
maj 05 17:53:38 snowball2 kernel:  [<ffffffffa0123f88>] mei_me_irq_thread_handler+0x148/0x300 [mei_me]
maj 05 17:53:38 snowball2 kernel:  [<ffffffff816f2482>] ? __schedule+0x2e2/0x740
maj 05 17:53:38 snowball2 kernel:  [<ffffffff810e6360>] irq_thread_fn+0x20/0x50
maj 05 17:53:38 snowball2 kernel:  [<ffffffff810e67ef>] irq_thread+0x13f/0x170
maj 05 17:53:38 snowball2 kernel:  [<ffffffff810e63f0>] ? irq_forced_thread_fn+0x60/0x60
maj 05 17:53:38 snowball2 kernel:  [<ffffffff810e66b0>] ? irq_thread_check_affinity+0xf0/0xf0
maj 05 17:53:38 snowball2 kernel:  [<ffffffff810ae211>] kthread+0xe1/0x100
maj 05 17:53:38 snowball2 kernel:  [<ffffffff810ae130>] ? insert_kthread_work+0x40/0x40
maj 05 17:53:38 snowball2 kernel:  [<ffffffff816fef7c>] ret_from_fork+0x7c/0xb0
maj 05 17:53:38 snowball2 kernel:  [<ffffffff810ae130>] ? insert_kthread_work+0x40/0x40
maj 05 17:53:38 snowball2 kernel: Code: 37 f5 ff ff 84 c0 0f 84 af 01 00 00 f6 05 42 cf 00 00 04 0f 85 12 06 00 00 48 89 df e8 aa 68 00 00 e9 25 fc ff ff 0f 1f 44 00 00 <0f> 0b 66 0f 1f 44 00 00 f6 05 33 d0 00 00 04 0f 85 88 06 00 00 
maj 05 17:53:38 snowball2 kernel: RIP  [<ffffffffa012b5d0>] mei_hbm_dispatch+0x650/0xea0 [mei]
maj 05 17:53:38 snowball2 kernel:  RSP <ffff88031c0e3cf8>
maj 05 17:53:38 snowball2 kernel: ---[ end trace 422e659790487932 ]---
maj 05 17:53:38 snowball2 kernel: BUG: unable to handle kernel paging request at ffffffffffffffd8
maj 05 17:53:38 snowball2 kernel: IP: [<ffffffff810ae800>] kthread_data+0x10/0x20
maj 05 17:53:38 snowball2 kernel: PGD 1c0f067 PUD 1c11067 PMD 0
maj 05 17:53:38 snowball2 acpid[831]: client connected from 908[0:0]
maj 05 17:53:38 snowball2 acpid[831]: 1 client rule loaded
maj 05 17:53:38 snowball2 kernel: Oops: 0000 [#2] SMP 
maj 05 17:53:38 snowball2 gnome-session[1764]: (realcrypt:2873): Gtk-CRITICAL **: IA__gtk_widget_get_direction: assertion 'GTK_IS_WIDGET (widget)' failed
maj 05 17:53:38 snowball2 kernel: Modules linked in: dm_crypt nf_conntrack_netbios_ns nf_conntrack_broadcast ccm bnep bluetooth 6lowpan_iphc ip6t_rpfilter ip6t_REJECT xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw nvidia(POF) fuse arc4 iwldvm mac80211 snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic iTCO_wdt iTCO_vendor_support x86_pkg_temp_thermal coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel usb_storage snd_hda_intel microcode snd_hda_codec snd_hwdep snd_seq joydev snd_seq_device iwlwifi
maj 05 17:53:38 snowball2 kernel:  snd_pcm serio_raw cfg80211 jmb38x_ms drm jme sdhci_pci i2c_i801 mii sdhci rfkill mmc_core snd_timer memstick i2c_core mei_me snd lpc_ich mei mfd_core soundcore shpchp nfsd auth_rpcgss nfs_acl lockd sunrpc binfmt_misc firewire_ohci firewire_core crc_itu_t wmi video
maj 05 17:53:38 snowball2 kernel: CPU: 3 PID: 8875 Comm: irq/51-mei_me Tainted: PF     D    O 3.14.2-200.fc20.x86_64 #1
maj 05 17:53:38 snowball2 kernel: Hardware name: CLEVO                             P150HMx/P150HMx, BIOS 4.6.4 08/09/2011
maj 05 17:53:38 snowball2 kernel: task: ffff88031bc51300 ti: ffff88031c0e2000 task.ti: ffff88031c0e2000
maj 05 17:53:38 snowball2 kernel: RIP: 0010:[<ffffffff810ae800>]  [<ffffffff810ae800>] kthread_data+0x10/0x20
maj 05 17:53:38 snowball2 kernel: RSP: 0000:ffff88031c0e3a08  EFLAGS: 00010202
maj 05 17:53:38 snowball2 kernel: RAX: 0000000000000000 RBX: ffff88031bc51300 RCX: 0000000000000015
maj 05 17:53:38 snowball2 kernel: RDX: ffff88031c0e3e80 RSI: 0000000000000000 RDI: ffff88031bc51300
maj 05 17:53:38 snowball2 kernel: RBP: ffff88031c0e3a08 R08: 0000000000000000 R09: 0000000000000001
maj 05 17:53:38 snowball2 kernel: R10: 0000000000000001 R11: ffffffff8112192b R12: ffffffff81ed8520
maj 05 17:53:38 snowball2 kernel: R13: ffff88031bc51968 R14: ffff88031bc51300 R15: ffff88031bc51300
maj 05 17:53:38 snowball2 kernel: FS:  0000000000000000(0000) GS:ffff88032f4c0000(0000) knlGS:0000000000000000
maj 05 17:53:38 snowball2 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
maj 05 17:53:38 snowball2 kernel: CR2: ffffffffffffffd8 CR3: 0000000001c0c000 CR4: 00000000000407e0
maj 05 17:53:38 snowball2 kernel: Stack:
maj 05 17:53:38 snowball2 kernel:  ffff88031c0e3a28 ffffffff810e6413 0000000000000000 ffffffff81ed8520
maj 05 17:53:38 snowball2 kernel:  ffff88031c0e3a58 ffffffff810aafd7 000000000000000b 0000000000000000
maj 05 17:53:38 snowball2 kernel:  0000000000000246 0000000000000000 ffff88031c0e3ae8 ffffffff8108c754
maj 05 17:53:38 snowball2 kernel: Call Trace:
maj 05 17:53:38 snowball2 kernel:  [<ffffffff810e6413>] irq_thread_dtor+0x23/0xb0
maj 05 17:53:38 snowball2 kernel:  [<ffffffff810aafd7>] task_work_run+0xa7/0xe0
maj 05 17:53:38 snowball2 kernel:  [<ffffffff8108c754>] do_exit+0x2c4/0xa30
maj 05 17:53:38 snowball2 kernel:  [<ffffffff816eb1ac>] ? printk+0x77/0x8e
maj 05 17:53:38 snowball2 kernel:  [<ffffffff816f704c>] oops_end+0x9c/0xe0
maj 05 17:53:38 snowball2 kernel:  [<ffffffff81017fcb>] die+0x4b/0x70
maj 05 17:53:38 snowball2 kernel:  [<ffffffff816f68e0>] do_trap+0x60/0x170
maj 05 17:53:38 snowball2 kernel:  [<ffffffff810e6340>] ? irq_finalize_oneshot.part.30+0xe0/0xe0
maj 05 17:53:38 snowball2 kernel:  [<ffffffff810e6340>] ? irq_finalize_oneshot.part.30+0xe0/0xe0
maj 05 17:53:38 snowball2 kernel:  [<ffffffff810150aa>] do_invalid_op+0xaa/0xe0
maj 05 17:53:38 snowball2 kernel:  [<ffffffffa012b5d0>] ? mei_hbm_dispatch+0x650/0xea0 [mei]
maj 05 17:53:38 snowball2 kernel:  [<ffffffff810caa32>] ? enqueue_task_fair+0x412/0x660
maj 05 17:53:38 snowball2 kernel:  [<ffffffff810c2885>] ? sched_clock_cpu+0x85/0xc0
maj 05 17:53:38 snowball2 kernel:  [<ffffffff8170075e>] invalid_op+0x1e/0x30
maj 05 17:53:38 snowball2 kernel:  [<ffffffff810e6340>] ? irq_finalize_oneshot.part.30+0xe0/0xe0
maj 05 17:53:38 snowball2 kernel:  [<ffffffffa012b5d0>] ? mei_hbm_dispatch+0x650/0xea0 [mei]
maj 05 17:53:38 snowball2 kernel:  [<ffffffffa012afbf>] ? mei_hbm_dispatch+0x3f/0xea0 [mei]
maj 05 17:53:38 snowball2 kernel:  [<ffffffff810a4522>] ? insert_work+0x62/0xa0
maj 05 17:53:38 snowball2 kernel:  [<ffffffff810e6340>] ? irq_finalize_oneshot.part.30+0xe0/0xe0
maj 05 17:53:38 snowball2 kernel:  [<ffffffffa012c180>] mei_irq_read_handler+0x2a0/0x6e0 [mei]
maj 05 17:53:38 snowball2 kernel:  [<ffffffff810e6340>] ? irq_finalize_oneshot.part.30+0xe0/0xe0
maj 05 17:53:38 snowball2 kernel:  [<ffffffffa0123f88>] mei_me_irq_thread_handler+0x148/0x300 [mei_me]
maj 05 17:53:38 snowball2 kernel:  [<ffffffff816f2482>] ? __schedule+0x2e2/0x740
maj 05 17:53:38 snowball2 kernel:  [<ffffffff810e6360>] irq_thread_fn+0x20/0x50
maj 05 17:53:38 snowball2 kernel:  [<ffffffff810e67ef>] irq_thread+0x13f/0x170
maj 05 17:53:38 snowball2 kernel:  [<ffffffff810e63f0>] ? irq_forced_thread_fn+0x60/0x60
maj 05 17:53:38 snowball2 kernel:  [<ffffffff810e66b0>] ? irq_thread_check_affinity+0xf0/0xf0
maj 05 17:53:38 snowball2 kernel:  [<ffffffff810ae211>] kthread+0xe1/0x100
maj 05 17:53:38 snowball2 kernel:  [<ffffffff810ae130>] ? insert_kthread_work+0x40/0x40
maj 05 17:53:38 snowball2 kernel:  [<ffffffff816fef7c>] ret_from_fork+0x7c/0xb0
maj 05 17:53:38 snowball2 kernel:  [<ffffffff810ae130>] ? insert_kthread_work+0x40/0x40
maj 05 17:53:38 snowball2 kernel: Code: 00 48 89 e5 5d 48 8b 40 c8 48 c1 e8 02 83 e0 01 c3 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 48 8b 87 28 04 00 00 55 48 89 e5 <48> 8b 40 d8 5d c3 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 
maj 05 17:53:38 snowball2 kernel: RIP  [<ffffffff810ae800>] kthread_data+0x10/0x20
maj 05 17:53:38 snowball2 kernel:  RSP <ffff88031c0e3a08>
maj 05 17:53:38 snowball2 kernel: CR2: ffffffffffffffd8
maj 05 17:53:38 snowball2 kernel: ---[ end trace 422e659790487933 ]---
maj 05 17:53:38 snowball2 kernel: Fixing recursive fault but reboot is needed!

The bug is initially invisible, but will cause the system to freeze on next suspend attempt. Full journalctl log is attached. Please let me know if more info is needed.
Comment 1 Alan 2014-05-19 11:13:56 UTC
+ Thomas Winkler
Comment 2 Tomas Winkler 2014-05-19 11:57:09 UTC
Can you check you have this patch applied ?
commit b1b94b5d387e3a1f034c308e22f9295828d7174a
Author: Tomas Winkler <tomas.winkler@intel.com>
Date:   Mon Mar 3 00:21:28 2014 +0200
 
    mei: me: do not reset when less than expected data is received
    
    There is a race in ME hardware between data copy for host and interrupt
    delivery. An interrupt can be delivered prior to whole data copied for the
    host to read but rather then going trough the reset we just merely need to
    wait for the next interrupt.
    
    The bug is visible in read/write stress with multiple connections per client
    
    This is a regression caused as a side effect of the commit:
    commit 544f94601409653f07ae6e22d4a39e3a90dceead
    mei: do not run reset flow from the interrupt thread
    
    Signed-off-by: Tomas Winkler <tomas.winkler@intel.com>
    Cc: stable <stable@vger.kernel.org> # 3.14
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Comment 3 Julian Sikorski 2014-05-20 05:32:17 UTC
The bug just occurred again with 3.14.4-200.fc20.x86_64. It does not look like the patch you are referring to is included in that kernel, neither as a part of stable queue nor as a Fedora patch.
Comment 4 Tomas Winkler 2014-05-20 07:47:28 UTC
Can I ask you to cherry pick this patch and test it?
Thanks
Comment 5 Julian Sikorski 2014-05-20 08:07:21 UTC
I will, but keep in mind it took me 2 weeks to reproduce the bug and I'm not even sure what caused it. The laptop suspended when streaming a movie from an external usb drive to a raspberry pi over nfs, can this be it?
Comment 6 Tomas Winkler 2014-05-20 09:54:31 UTC
I believe the only traffic between driver and mei device would be during suspend/resume in your case the streaming is not relevant.
Comment 7 Drew Parsons 2014-06-13 13:16:42 UTC
I experience this bug in the 3.14 kernels.  It's reliably reproducible after a couple of suspend/resume cycles.  As Tomas reported, the first resume seems to trigger the kernel oops, and the system freeze happens at the second resume.

Successful suspend/resume is restored with the b1b94b5d patch (applied manually to 3.14.3 source).  A "less data available than length=" message is written to the kernel log in place of the kernel oops. 

There is still some odd behaviour:  the system immediately jumps back to suspend after the first resume. But after triggering resume again, my system does then resume and operate properly (no freeze).  

I haven't yet tested a 3.15 kernel.
Comment 8 Drew Parsons 2014-07-31 05:27:48 UTC
The bug seems to be fixed later in 3.14, without the b1b94b5d patch.  

In 3.14.13 (Debian 3.14.13-2) I'm no longer experiencing the freeze even after repeated suspend/resume cycles.