Bug 199541

Summary: irq/124-mei_me - Call Trace: <4>[ 56.184872] mei_irq_read_handler+0x26d/0x650 [mei]
Product: Drivers Reporter: Martin Peres (martin.peres)
Component: Other Assignee: drivers_other
Status: RESOLVED CODE_FIX    
Severity: normal CC: claudio.glickman, tomasw
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 4.16.0-rc7 Subsystem:
Regression: No Bisected commit-id:
Attachments: Patch

Description Martin Peres 2018-04-27 14:09:33 UTC
Hi,

We caught this backtrace multiple times in pstore in the Intel-GFX-CI system (seen on SKL, KBL, CFL):

<4>[  266.280526] Modules linked in: vgem snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic i915 x86_pkg_temp_thermal intel_powerclamp coretemp crct10dif_pclmul crc32_pclmul snd_hda_intel ghash_clmulni_intel snd_hda_codec e1000e snd_hwdep snd_hda_core snd_pcm mei_me mei prime_numbers
<4>[  266.280542] CPU: 6 PID: 4144 Comm: irq/130-mei_me Tainted: G     U           4.16.0-rc7-ga0e39233b887-drmtip_21+ #1
<4>[  266.280544] Hardware name: System manufacturer System Product Name/Z170 PRO GAMING, BIOS 0802 09/02/2015
<4>[  266.280548] RIP: 0010:mei_hbm_dispatch+0x17e/0xc10 [mei]
<4>[  266.280549] RSP: 0018:ffffb75740dc7d98 EFLAGS: 00010297
<4>[  266.280551] RAX: 0000000000000000 RBX: ffff9971440e5d40 RCX: 0000000000000003
<4>[  266.280552] RDX: ffffb75740189004 RSI: ffffb75740189004 RDI: 000000008014140c
<4>[  266.280553] RBP: ffff9971440e60f8 R08: ffff99713f03b0f8 R09: 000000005e09e028
<4>[  266.280554] R10: ffffb75740dc7e20 R11: 0000000000000001 R12: ffffb75740dc7e30
<4>[  266.280555] R13: ffff9971440e62f8 R14: ffff9971440e5d40 R15: ffffffffac0f9e60
<4>[  266.280556] FS:  0000000000000000(0000) GS:ffff997155d80000(0000) knlGS:0000000000000000
<4>[  266.280557] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[  266.280558] CR2: 000055d7d568cd40 CR3: 000000004c210006 CR4: 00000000003606e0
<4>[  266.280559] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>[  266.280560] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
<4>[  266.280561] Call Trace:
<4>[  266.280566]  mei_irq_read_handler+0x26d/0x650 [mei]
<4>[  266.280570]  ? rcu_read_lock_sched_held+0x6f/0x80
<4>[  266.280573]  ? irq_thread+0x90/0x1e0
<4>[  266.280576]  mei_me_irq_thread_handler+0x3e8/0xa70 [mei_me]
<4>[  266.280579]  ? irq_thread+0xc5/0x1e0
<4>[  266.280581]  ? irq_thread+0x90/0x1e0
<4>[  266.280583]  irq_thread_fn+0x16/0x40
<4>[  266.280585]  irq_thread+0x172/0x1e0
<4>[  266.280587]  ? irq_forced_thread_fn+0x60/0x60
<4>[  266.280590]  ? wake_threads_waitq+0x30/0x30
<4>[  266.280593]  kthread+0xfb/0x130
<4>[  266.280595]  ? irq_thread_dtor+0x90/0x90
<4>[  266.280597]  ? _kthread_create_on_node+0x60/0x60
<4>[  266.280601]  ret_from_fork+0x3a/0x50
<4>[  266.280605] Code: 8b 3b be 01 00 00 00 e8 11 df 24 ec 31 c0 e9 ef fe ff ff 3c 8a 0f 84 39 03 00 00 3c 90 0f 84 7c 01 00 00 3c 87 0f 84 fb 03 00 00 <0f> 0b 3c 03 0f 84 ba 00 00 00 3c 07 75 f2 0f 1f 44 00 00 48 8b 
<1>[  266.280649] RIP: mei_hbm_dispatch+0x17e/0xc10 [mei] RSP: ffffb75740dc7d98
<4>[  266.280664] ---[ end trace 9faad50c8ef83572 ]---
<5>[  266.286016] sd 0:0:0:0: [sda] Synchronizing SCSI cache
<5>[  266.293372] sd 0:0:0:0: [sda] Stopping disk
<4>[  267.576854] sched: RT throttling activated
<1>[  268.467906] BUG: unable to handle kernel NULL pointer dereference at 0000000000000006
<1>[  268.467917] IP: 0x6
<6>[  268.467919] PGD 0 P4D 0 
<4>[  268.467928] Oops: 0010 [#2] PREEMPT SMP PTI
<0>[  268.467933] Dumping ftrace buffer:
<0>[  268.467940]    (ftrace buffer empty)
<4>[  268.467943] Modules linked in: vgem snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic i915 x86_pkg_temp_thermal intel_powerclamp coretemp crct10dif_pclmul crc32_pclmul snd_hda_intel ghash_clmulni_intel snd_hda_codec e1000e snd_hwdep snd_hda_core snd_pcm mei_me mei prime_numbers
<4>[  268.467990] CPU: 6 PID: 4144 Comm: irq/130-mei_me Tainted: G     UD          4.16.0-rc7-ga0e39233b887-drmtip_21+ #1
<4>[  268.467993] Hardware name: System manufacturer System Product Name/Z170 PRO GAMING, BIOS 0802 09/02/2015
<4>[  268.467996] RIP: 0010:0x6
<4>[  268.468000] RSP: 0018:ffffb75740dc7e98 EFLAGS: 00010282
<4>[  268.468005] RAX: ffffb75740dc7ec8 RBX: ffff99713f03afd8 RCX: 0000000000000001
<4>[  268.468008] RDX: 0000000080000001 RSI: 0000000000000001 RDI: ffffb75740dc7ec8
<4>[  268.468011] RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000000
<4>[  268.468013] R10: 0000000000000000 R11: ffff99713f03a840 R12: ffff99713f03a840
<4>[  268.468016] R13: ffffffffad05d735 R14: ffff99713f03b038 R15: 0000000000000000
<4>[  268.468020] FS:  0000000000000000(0000) GS:ffff997155d80000(0000) knlGS:0000000000000000
<4>[  268.468023] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[  268.468026] CR2: 0000000000000006 CR3: 000000004c210006 CR4: 00000000003606e0
<4>[  268.468028] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>[  268.468031] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
<4>[  268.468033] Call Trace:
<4>[  268.468042]  ? task_work_run+0x88/0xb0
<4>[  268.468050]  ? do_exit+0x314/0xd30
<4>[  268.468058]  ? kthread+0xfb/0x130
<4>[  268.468067]  ? rewind_stack_do_exit+0x17/0x20
<4>[  268.468078] Code:  Bad RIP value.
<1>[  268.468091] RIP: 0x6 RSP: ffffb75740dc7e98
<4>[  268.468093] CR2: 0000000000000006

Sorry for not having the top line, the backtrace was apparently longer than the maximum size of the pstore...
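
Since the header line of the first trace is missing, the "Code:" line is one way to recover what kind of fault it was. The sketch below pulls out the opcode bytes at the faulting RIP (the byte wrapped in <..>) and checks whether they are 0f 0b, the x86 ud2 instruction. A ud2 at RIP is consistent with a deliberate BUG()/WARN()-style assertion in mei_hbm_dispatch rather than a stray memory access. The helper name bytes_at_rip is hypothetical, introduced only for this illustration.

```python
import re

# "Code:" line copied from the first trace in this report (log prefix removed).
CODE_LINE = (
    "8b 3b be 01 00 00 00 e8 11 df 24 ec 31 c0 e9 ef fe ff ff 3c 8a 0f 84 "
    "39 03 00 00 3c 90 0f 84 7c 01 00 00 3c 87 0f 84 fb 03 00 00 <0f> 0b 3c "
    "03 0f 84 ba 00 00 00 3c 07 75 f2 0f 1f 44 00 00 48 8b"
)

# x86 encoding of the ud2 (undefined instruction) opcode.
UD2 = bytes([0x0F, 0x0B])


def bytes_at_rip(code_line, count=2):
    """Return `count` opcode bytes starting at the byte marked with <..>.

    Hypothetical helper: the kernel marks the byte at the faulting RIP
    by wrapping it in angle brackets. Returns None if no marker is found.
    """
    tokens = code_line.split()
    for i, tok in enumerate(tokens):
        match = re.fullmatch(r"<([0-9a-f]{2})>", tok)
        if match:
            selected = [match.group(1)] + tokens[i + 1 : i + count]
            return bytes(int(b, 16) for b in selected)
    return None


at_rip = bytes_at_rip(CODE_LINE)
is_ud2 = at_rip == UD2
print(" ".join(f"{b:02x}" for b in at_rip), "-> ud2" if is_ud2 else "-> not ud2")
```

This only classifies the faulting instruction. Mapping the offset mei_hbm_dispatch+0x17e back to a source line still needs the matching mei.ko with debug info.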

Here are the relevant logs:
 - https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_1/fi-cfl-s2/igt@gem_softpin@noreloc-s3.html
 - https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_11/fi-skl-6700k2/igt@kms_vblank@pipe-b-ts-continuation-dpms-suspend.html
 - https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_12/fi-kbl-7500u/igt@kms_plane@plane-panning-bottom-right-suspend-pipe-a-planes.html
 - https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_21/fi-skl-6700k2/igt@kms_vblank@pipe-c-ts-continuation-dpms-suspend.html

Original bug report (not super relevant): https://bugs.freedesktop.org/show_bug.cgi?id=105524
Comment 1 Tomas Winkler 2018-05-03 07:28:45 UTC
Can you please also provide the config file?
Comment 2 Tomas Winkler 2018-05-03 07:29:03 UTC
(In reply to Tomas Winkler from comment #1)
> Can you please provide also the config file.
Comment 3 Martin Peres 2018-05-03 18:06:04 UTC
Sure, sorry!

Here it is: https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_21/kconfig.txt
Comment 4 claudio.glickman 2018-05-15 06:47:38 UTC
Not able to replicate over 100 suspend/resume iterations with the latest drm-tip 4.17.0-rc4+. Can you please update? Is the issue occurring frequently?
Comment 5 Martin Peres 2018-05-17 20:19:37 UTC
Sorry for the delay!

We see the problem in about 5% of our runs, with each run executing ~80 suspend tests.

On our fi-kbl-7500u, the reproduction rate is about once every 10-15 runs. The same can be found on our fi-skl-6700k2 machine. However, the other SKL/KBL/CFL platforms do not see the problem just yet...
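
A rough back-of-the-envelope check suggests why 100 suspend/resume iterations may simply be too few to hit this. The sketch assumes (this is not stated in the thread) that each suspend test performs one suspend/resume cycle and that every cycle fails independently with the same probability. Under those assumptions, a 5% hit rate per ~80-test run gives only about a 6% chance of seeing the failure in 100 cycles, and several thousand cycles would be needed for a 95% chance.

```python
import math

# Figures taken from this thread.
RUN_HIT_RATE = 0.05      # ~5% of CI runs show the problem
CYCLES_PER_RUN = 80      # ~80 suspend tests per run (assumed: 1 cycle each)
LOCAL_ITERATIONS = 100   # iterations tried in the failed reproduction attempt


def per_cycle_probability(run_hit_rate, cycles_per_run):
    """Per-cycle failure probability implied by a per-run hit rate.

    Assumes independent, identically likely failures on every cycle.
    """
    return 1.0 - (1.0 - run_hit_rate) ** (1.0 / cycles_per_run)


def hit_probability(p_cycle, cycles):
    """Probability of at least one failure within `cycles` cycles."""
    return 1.0 - (1.0 - p_cycle) ** cycles


def cycles_for_confidence(p_cycle, confidence):
    """Smallest cycle count giving at least `confidence` chance of a hit."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_cycle))


p_cycle = per_cycle_probability(RUN_HIT_RATE, CYCLES_PER_RUN)
p_local = hit_probability(p_cycle, LOCAL_ITERATIONS)
needed = cycles_for_confidence(p_cycle, 0.95)

print(f"per-cycle failure probability: {p_cycle:.5f}")
print(f"chance of a hit in {LOCAL_ITERATIONS} cycles: {p_local:.1%}")
print(f"cycles needed for a 95% chance: {needed}")
```

The numbers are only as good as the independence assumption. If the failure depends on a specific test or machine state, as the per-machine differences above hint, the real odds could be quite different.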

Here is the description of all our platforms (dmidecode, ...): https://intel-gfx-ci.01.org/hardware/

I hope this helps!
Comment 6 Tomas Winkler 2018-05-29 07:24:07 UTC
Created attachment 276249 [details]
Patch
Comment 7 Tomas Winkler 2018-05-29 07:24:38 UTC
(In reply to Tomas Winkler from comment #6)
> Created attachment 276249 [details]
This patch should hopefully resolve the issue.
Comment 8 Tomas Winkler 2018-05-29 11:57:15 UTC
(In reply to Tomas Winkler from comment #6)
> Created attachment 276249 [details]
This patch should hopefully resolve the issue.

Please don't publish by yourself, just report if this resolves your issue.
Comment 9 Martin Peres 2018-05-29 12:05:55 UTC
(In reply to Tomas Winkler from comment #8)
> (In reply to Tomas Winkler from comment #6)
> > Created attachment 276249 [details]
> This patch should hopefully resolve the issue.
> 
> Please don't publish by yourself, just report if this resolves your issue.

Yes, of course. Our CI system picks up emails from mailing lists, and we have a dedicated list for testing things without annoying other developers. What happened was that I screwed up my git send-email command (apparently, adding all the --no-*-cc options was not enough to remove all Cc:s), which resulted in stable@ and the original author receiving this email, which was not intended for them...

In any case, this passed the initial CI run, so I added the patch to our integration tree (a separate branch called core-for-CI), and I will report back by the end of the week to tell you if we can see a reduction in reproduction rate.

Sorry again for the noise.
Comment 10 Tomas Winkler 2018-06-06 07:49:47 UTC
Understood. Martin, is the issue resolved for you?
Can we add your Tested-by?
Comment 11 Martin Peres 2018-06-06 08:33:36 UTC
We have not been able to reproduce the issue since we applied the patch. So:

Tested-by: Martin Peres <martin.peres@linux.intel.com>

Thanks a lot!
Comment 12 Martin Peres 2019-01-16 11:53:20 UTC
The patch has landed, but we did not close this bug. Thanks!