Hi,

We caught this backtrace multiple times in pstore in the Intel-GFX-CI system (seen on SKL, KBL, CFL):

<4>[ 266.280526] Modules linked in: vgem snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic i915 x86_pkg_temp_thermal intel_powerclamp coretemp crct10dif_pclmul crc32_pclmul snd_hda_intel ghash_clmulni_intel snd_hda_codec e1000e snd_hwdep snd_hda_core snd_pcm mei_me mei prime_numbers
<4>[ 266.280542] CPU: 6 PID: 4144 Comm: irq/130-mei_me Tainted: G U 4.16.0-rc7-ga0e39233b887-drmtip_21+ #1
<4>[ 266.280544] Hardware name: System manufacturer System Product Name/Z170 PRO GAMING, BIOS 0802 09/02/2015
<4>[ 266.280548] RIP: 0010:mei_hbm_dispatch+0x17e/0xc10 [mei]
<4>[ 266.280549] RSP: 0018:ffffb75740dc7d98 EFLAGS: 00010297
<4>[ 266.280551] RAX: 0000000000000000 RBX: ffff9971440e5d40 RCX: 0000000000000003
<4>[ 266.280552] RDX: ffffb75740189004 RSI: ffffb75740189004 RDI: 000000008014140c
<4>[ 266.280553] RBP: ffff9971440e60f8 R08: ffff99713f03b0f8 R09: 000000005e09e028
<4>[ 266.280554] R10: ffffb75740dc7e20 R11: 0000000000000001 R12: ffffb75740dc7e30
<4>[ 266.280555] R13: ffff9971440e62f8 R14: ffff9971440e5d40 R15: ffffffffac0f9e60
<4>[ 266.280556] FS: 0000000000000000(0000) GS:ffff997155d80000(0000) knlGS:0000000000000000
<4>[ 266.280557] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[ 266.280558] CR2: 000055d7d568cd40 CR3: 000000004c210006 CR4: 00000000003606e0
<4>[ 266.280559] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>[ 266.280560] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
<4>[ 266.280561] Call Trace:
<4>[ 266.280566] mei_irq_read_handler+0x26d/0x650 [mei]
<4>[ 266.280570] ? rcu_read_lock_sched_held+0x6f/0x80
<4>[ 266.280573] ? irq_thread+0x90/0x1e0
<4>[ 266.280576] mei_me_irq_thread_handler+0x3e8/0xa70 [mei_me]
<4>[ 266.280579] ? irq_thread+0xc5/0x1e0
<4>[ 266.280581] ? irq_thread+0x90/0x1e0
<4>[ 266.280583] irq_thread_fn+0x16/0x40
<4>[ 266.280585] irq_thread+0x172/0x1e0
<4>[ 266.280587] ? irq_forced_thread_fn+0x60/0x60
<4>[ 266.280590] ? wake_threads_waitq+0x30/0x30
<4>[ 266.280593] kthread+0xfb/0x130
<4>[ 266.280595] ? irq_thread_dtor+0x90/0x90
<4>[ 266.280597] ? _kthread_create_on_node+0x60/0x60
<4>[ 266.280601] ret_from_fork+0x3a/0x50
<4>[ 266.280605] Code: 8b 3b be 01 00 00 00 e8 11 df 24 ec 31 c0 e9 ef fe ff ff 3c 8a 0f 84 39 03 00 00 3c 90 0f 84 7c 01 00 00 3c 87 0f 84 fb 03 00 00 <0f> 0b 3c 03 0f 84 ba 00 00 00 3c 07 75 f2 0f 1f 44 00 00 48 8b
<1>[ 266.280649] RIP: mei_hbm_dispatch+0x17e/0xc10 [mei] RSP: ffffb75740dc7d98
<4>[ 266.280664] ---[ end trace 9faad50c8ef83572 ]---
<5>[ 266.286016] sd 0:0:0:0: [sda] Synchronizing SCSI cache
<5>[ 266.293372] sd 0:0:0:0: [sda] Stopping disk
<4>[ 267.576854] sched: RT throttling activated
<1>[ 268.467906] BUG: unable to handle kernel NULL pointer dereference at 0000000000000006
<1>[ 268.467917] IP: 0x6
<6>[ 268.467919] PGD 0 P4D 0
<4>[ 268.467928] Oops: 0010 [#2] PREEMPT SMP PTI
<0>[ 268.467933] Dumping ftrace buffer:
<0>[ 268.467940] (ftrace buffer empty)
<4>[ 268.467943] Modules linked in: vgem snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic i915 x86_pkg_temp_thermal intel_powerclamp coretemp crct10dif_pclmul crc32_pclmul snd_hda_intel ghash_clmulni_intel snd_hda_codec e1000e snd_hwdep snd_hda_core snd_pcm mei_me mei prime_numbers
<4>[ 268.467990] CPU: 6 PID: 4144 Comm: irq/130-mei_me Tainted: G UD 4.16.0-rc7-ga0e39233b887-drmtip_21+ #1
<4>[ 268.467993] Hardware name: System manufacturer System Product Name/Z170 PRO GAMING, BIOS 0802 09/02/2015
<4>[ 268.467996] RIP: 0010:0x6
<4>[ 268.468000] RSP: 0018:ffffb75740dc7e98 EFLAGS: 00010282
<4>[ 268.468005] RAX: ffffb75740dc7ec8 RBX: ffff99713f03afd8 RCX: 0000000000000001
<4>[ 268.468008] RDX: 0000000080000001 RSI: 0000000000000001 RDI: ffffb75740dc7ec8
<4>[ 268.468011] RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000000
<4>[ 268.468013] R10: 0000000000000000 R11: ffff99713f03a840 R12: ffff99713f03a840
<4>[ 268.468016] R13: ffffffffad05d735 R14: ffff99713f03b038 R15: 0000000000000000
<4>[ 268.468020] FS: 0000000000000000(0000) GS:ffff997155d80000(0000) knlGS:0000000000000000
<4>[ 268.468023] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[ 268.468026] CR2: 0000000000000006 CR3: 000000004c210006 CR4: 00000000003606e0
<4>[ 268.468028] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>[ 268.468031] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
<4>[ 268.468033] Call Trace:
<4>[ 268.468042] ? task_work_run+0x88/0xb0
<4>[ 268.468050] ? do_exit+0x314/0xd30
<4>[ 268.468058] ? kthread+0xfb/0x130
<4>[ 268.468067] ? rewind_stack_do_exit+0x17/0x20
<4>[ 268.468078] Code: Bad RIP value.
<1>[ 268.468091] RIP: 0x6 RSP: ffffb75740dc7e98
<4>[ 268.468093] CR2: 0000000000000006

Sorry for not having the top line, the backtrace was apparently longer than the maximum size of the pstore...

Here are the relevant logs:
- https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_1/fi-cfl-s2/igt@gem_softpin@noreloc-s3.html
- https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_11/fi-skl-6700k2/igt@kms_vblank@pipe-b-ts-continuation-dpms-suspend.html
- https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_12/fi-kbl-7500u/igt@kms_plane@plane-panning-bottom-right-suspend-pipe-a-planes.html
- https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_21/fi-skl-6700k2/igt@kms_vblank@pipe-c-ts-continuation-dpms-suspend.html

Original bug report (not super relevant): https://bugs.freedesktop.org/show_bug.cgi?id=105524
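Since the top line of the first oops is missing, the "Code:" line is one of the few hints about what trapped. In that line the byte in angle brackets marks the instruction at RIP. Here it is 0f 0b, which is the x86 UD2 opcode, the instruction the kernel uses for BUG()-style traps. That would be consistent with the "D" taint flag and the "[#2]" oops count in the second trace, but it is only an inference from the bytes, not something the logs state. A minimal sketch for pulling those bytes out of a "Code:" line (the helper name faulting_bytes is made up for illustration):

```python
import re

# "Code:" line copied from the first backtrace above.
CODE_LINE = (
    "Code: 8b 3b be 01 00 00 00 e8 11 df 24 ec 31 c0 e9 ef fe ff ff "
    "3c 8a 0f 84 39 03 00 00 3c 90 0f 84 7c 01 00 00 3c 87 0f 84 fb 03 00 00 "
    "<0f> 0b 3c 03 0f 84 ba 00 00 00 3c 07 75 f2 0f 1f 44 00 00 48 8b"
)

UD2 = bytes.fromhex("0f0b")  # x86 UD2 opcode


def faulting_bytes(code_line, count=2):
    """Return `count` bytes starting at the <xx>-marked byte (the byte at RIP).

    Hypothetical helper: returns None if the line has no marked byte.
    """
    tokens = code_line.split()[1:]  # drop the leading "Code:" token
    for i, tok in enumerate(tokens):
        m = re.fullmatch(r"<([0-9a-f]{2})>", tok)
        if m:
            following = tokens[i + 1:i + count]
            return bytes.fromhex(m.group(1) + "".join(following))
    return None


at_rip = faulting_bytes(CODE_LINE)
is_ud2 = at_rip == UD2
print(at_rip.hex(" "), "-> UD2" if is_ud2 else "-> not UD2")
```

The second trace has no usable "Code:" bytes ("Bad RIP value."), so the same helper returns None for it.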
Can you please also provide the config file?
(In reply to Tomas Winkler from comment #1)
> Can you please provide also the config file.
Sure, sorry! Here it is: https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_21/kconfig.txt
Not able to replicate over 100 suspend/resume iterations with the latest drm-tip 4.17.0-rc4+. Can you please update? Is the issue occurring frequently?
Sorry for the delay! We see the problem in about 5% of our runs, with each run executing ~80 suspend tests. On our fi-kbl-7500u, the reproduction rate is about once every 10-15 runs. The same can be found on our fi-skl-6700k2 machine. However, the other SKL/KBL/CFL platforms do not see the problem just yet... Here is the description of all our platforms (dmidecode, ...): https://intel-gfx-ci.01.org/hardware/ I hope this helps!
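To put the numbers above next to the 100-iteration attempt: a rough back-of-the-envelope sketch, assuming (and this is only an assumption) that every suspend test fails independently with the same probability and that one suspend/resume iteration is comparable to one CI suspend test. Under those assumptions, 5% of runs failing at ~80 suspend tests per run implies a very small per-test rate, so 100 clean iterations are the expected outcome rather than evidence that the bug is gone.

```python
import math


def per_test_failure_rate(run_failure_rate, tests_per_run):
    """Per-test failure probability implied by a per-run failure rate,
    assuming independent, identically likely failures (an assumption)."""
    return 1.0 - (1.0 - run_failure_rate) ** (1.0 / tests_per_run)


def prob_no_failure(per_test_rate, iterations):
    """Probability of seeing zero failures in `iterations` independent tests."""
    return (1.0 - per_test_rate) ** iterations


def iterations_for_confidence(per_test_rate, confidence):
    """Smallest iteration count that hits the bug at least once with the
    given probability."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - per_test_rate))


# Figures from the comment above: ~5% of runs fail, ~80 suspend tests per run.
p = per_test_failure_rate(0.05, 80)
clean_100 = prob_no_failure(p, 100)
needed_95 = iterations_for_confidence(p, 0.95)

print(f"per-test failure rate:      {p:.5f}")
print(f"P(100 clean iterations):    {clean_100:.3f}")
print(f"iterations for 95% chance:  {needed_95}")
```

With these inputs the chance of 100 clean iterations comes out at roughly 94%, and it would take several thousand iterations to be reasonably sure of hitting the issue on an average machine. The machines with a higher rate (about once every 10-15 runs) would need fewer.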
Created attachment 276249 [details] Patch
(In reply to Tomas Winkler from comment #6)
> Created attachment 276249 [details]

This patch should hopefully resolve the issue.
(In reply to Tomas Winkler from comment #6)
> Created attachment 276249 [details]

This patch should hopefully resolve the issue.

Please don't publish by yourself, just report if this resolves your issue.
(In reply to Tomas Winkler from comment #8)
> (In reply to Tomas Winkler from comment #6)
> > Created attachment 276249 [details]
>
> This patch should hopefully resolve the issue.
>
> Please don't publish by yourself, just report if this resolves your issue.

Yes, of course. Our CI system picks up emails from mailing lists, and we have a dedicated list for testing things without annoying other developers. What happened is that I got my git send-email command wrong (apparently, adding all the --no-*-cc options was not enough to remove every Cc:), so stable@ and the original author received an email that was not intended for them...

In any case, the patch passed the initial CI run, so I added it to our integration tree (a separate branch called core-for-CI). I will report back by the end of the week on whether the reproduction rate has gone down.

Sorry again for the noise.
Understood. Martin, is the issue resolved for you? Can we add your Tested-by?
We have not been able to reproduce the issue since we applied the patch. So:

Tested-by: Martin Peres <martin.peres@linux.intel.com>

Thanks a lot!
The patch has landed, but we did not close this bug. Thanks!