Per this comment¹ the project has 2 bugtrackers, so I'm reporting it here too.

This is reproducible on two different systems:

1. Dell Inspiron 5767, Arch Linux, kernel 5.11.15
2. MacBook 2013, Fedora, kernel 5.11.11

Not sure if there's anything useful I can attach, except perhaps a photo of the standard `sleepgraph` output. The systems do not actually go to sleep: their screens stay lit and the last content remains visible on the screen. The last entry from `sleepgraph` is

```
SUSPEND START
```

## Steps to reproduce

1. Execute `sudo sleepgraph -f -m mem -multi 1 1` *(the command is taken [from here](https://01.org/blogs/2018/improving-linux-suspend-resume-performance))*

### Expected

The system goes to sleep for some time and then wakes up, as if no `-f` option had been provided.

### Actual

The system freezes with the last content visible on the screen.

1: https://github.com/intel/pm-graph/issues/12#issuecomment-502809924
Interesting, so the ftrace callgraph behavior is having problems. Does it work OK with -dev? -dev uses ftrace kprobes, whereas -f uses ftrace callgraph; I just want to be sure it's not a problem with all of ftrace.
Thank you for the reply! Just tested: no, executing `sudo sleepgraph -dev -m mem -multi 1 1` causes no problems; the system wakes up fine.
OK, I've verified that -f hangs here on every machine with a kernel 5.10.0-rc3 or newer. So the issue is something added to the kernel between 5.10.0-rc2 and 5.10.0-rc3. I'm bisecting it now.
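For reference, a bisect over that window would look roughly like the following. This is a generic sketch of the process described, not the exact commands used, and it assumes a mainline checkout plus the ability to build, boot, and test each candidate kernel.

```sh
# Sketch of the bisect described above, assuming the v5.10-rc2..v5.10-rc3 window.
cd linux
git bisect start
git bisect bad  v5.10-rc3     # ftrace callgraph mode hangs here
git bisect good v5.10-rc2     # ftrace callgraph mode still works here

# For each kernel git checks out: build it, boot it, and run
#   sudo sleepgraph -f -m mem -multi 1 1
# then mark the result:
git bisect bad                # if the system hung
git bisect good               # if suspend/resume completed
# Repeat until git names the first bad commit, then clean up:
git bisect reset
```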
Here's the offending commit: ftrace callgraph works before it and fails after it. I'll post a bug report tomorrow to see if we can get this fixed.

    commit b02414c8f045ab3b9afc816c3735bc98c5c3d262
    Author: Steven Rostedt (VMware) <rostedt@goodmis.org>
    Date:   Mon Nov 2 15:31:27 2020 -0500

        ring-buffer: Fix recursion protection transitions between interrupt context

        The recursion protection of the ring buffer depends on preempt_count()
        to be correct. But it is possible that the ring buffer gets called
        after an interrupt comes in but before it updates the preempt_count().
        This will trigger a false positive in the recursion code.

        Use the same trick from the ftrace function callback recursion code
        which uses a "transition" bit that gets set, to allow for a single
        recursion for to handle transitions between contexts.

        Cc: stable@vger.kernel.org
        Fixes: 567cd4da54ff4 ("ring-buffer: User context bit recursion checking")
        Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>

     kernel/trace/ring_buffer.c | 58 ++++++++++++++++++++++++++++++++++++----------
     1 file changed, 46 insertions(+), 12 deletions(-)
This is cool, thank you very much for the bisection!
Oh, by the way, I added the author of the offending commit to CC. I didn't mention it in the previous comment because I thought it hadn't worked (the "Add" text field offered no completion for their email and there was no button to confirm, so I assumed it would fail because they don't have a Bugzilla account; apparently it silently worked anyway when I posted the previous comment).
Did you already submit a bug to the mailing list? If so, could you post the link? I'm not on the CC (my email is todd.e.brandt@intel.com).
Well, you said you were going to send the report, so I did not do anything in that regard ☺ Should I? If so, what mailing list would be appropriate? And isn't this bugtracker connected to the mailing lists? They usually are, so creating another report on the site might be sufficient, but I don't know how it works on bugzilla.kernel.org specifically…
No, please don't report it, I just misunderstood your last comment. I'm still working on a script that reproduces it. It looks like a very specific set of circumstances has to occur for the hang to happen; sleepgraph initiates them, and I'm still narrowing down why. On the plus side, there might be a temporary fix to sleepgraph that gets around the issue. I'll update you when I've got it, thanks.
FYI, thanks for Cc'ing me. I'm looking into it too. I can reproduce it, and removing the above commit from the bisect makes it work again. I'm about to take the rest of the day off, so I may not be able to look into it more until tomorrow. But I expect to find what it is about that commit that causes the bug this week.
Created attachment 296493 [details]
tempfix.patch

This is the temporary fix for the issue; it's caused by a conflict between the ftrace callgraph and the device_pm_callback trace events.
Hi Steven, I've narrowed it down to a specific conflict: when you use the ftrace callgraph at the same time as the device_pm_callback_start/end trace events, the system hangs on 5.10.0-rc3 and newer. The attached fix shows exactly where the conflict is.
Created attachment 296495 [details]
reproduce.sh

This script reproduces the issue with the smallest set of commands possible.
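For context (this is not the attached script, just a minimal sketch of the kind of reproduction being described): enable the function_graph tracer together with the device_pm_callback trace events, then suspend to RAM. Paths assume tracefs is mounted at /sys/kernel/tracing (older setups expose it under /sys/kernel/debug/tracing), and the global trace clock line is an assumption about what sleepgraph selects. On affected kernels this is expected to hang the machine, so only run it on a test box.

```sh
#!/bin/sh
# Hedged sketch of a minimal reproducer, not the attached reproduce.sh.
T=/sys/kernel/tracing

echo global         > $T/trace_clock     # assumption: sleepgraph uses the global clock
echo function_graph > $T/current_tracer
echo 1 > $T/events/power/device_pm_callback_start/enable
echo 1 > $T/events/power/device_pm_callback_end/enable
echo 1 > $T/tracing_on

# Suspend to RAM; on affected kernels (5.10.0-rc3 and newer) this hangs.
echo mem > /sys/power/state
```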
In the meantime, Konstantin, you can apply that tempfix.patch to pm-graph and it will work again. Unfortunately the timeline itself will be lacking device detail, but the callgraph data will still be readable. Try using -ftop instead, as it doesn't rely on device trace events; it just produces one massive callgraph over the whole of suspend_devices_and_enter.
Thank you very much Todd for your research and the workaround!
Thank you for catching this and posting it, Konstantin! If you hadn't, it might have been quite a while longer before we finally caught it.
I looked a little more into this, and the above patch is fixing the symptom and not the disease. I found that if the function graph tracer traces dev_driver_string(), which the above trace events use, it locks up. Is it that the function graph tracer can't be called in some part of the suspend cycle, due to it "hijacking" the return pointer? And since the trace clock uses the "global" clock, which takes a spin lock, that too can be traced in this "critical" section.

That is, without applying any patches, I can make the script work with the following command:

 # echo native_queued_spin_lock_slowpath dev_driver_string > /sys/kernel/tracing/set_ftrace_notrace

Perhaps, instead of removing those trace events, you could add these functions to the set_ftrace_notrace file? Of course, you may need to continue to maintain what not to trace.

Also, FYI, keep a lookout for trace-cruncher. We are working on a Python plugin to interact with libtracefs, where you have native Python commands to communicate with the ftrace tracefs directory.

https://github.com/vmware/trace-cruncher
https://git.kernel.org/pub/scm/libs/libtrace/libtracefs.git/
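To make that concrete relative to the reproducer sketch earlier in the thread, the workaround is a single write that has to land before tracing starts. A hedged example follows, again assuming tracefs at /sys/kernel/tracing; note that a tool which resets the tracing state on startup would need to do this write itself.

```sh
T=/sys/kernel/tracing

# Steven's workaround: keep function_graph away from the two functions that
# get traced inside the problematic suspend context. Must be set before
# tracing is enabled.
echo native_queued_spin_lock_slowpath dev_driver_string > $T/set_ftrace_notrace

echo function_graph > $T/current_tracer
echo 1 > $T/events/power/device_pm_callback_start/enable
echo 1 > $T/events/power/device_pm_callback_end/enable
echo mem > /sys/power/state      # now expected to suspend and resume normally

echo > $T/set_ftrace_notrace     # an empty write clears the filter afterwards
```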
Hi Steven, that does appear to work, thanks.

However, this is only a temporary fix. There shouldn't be any functions traversed by ftrace that result in a system hang, so I hope you can get this fixed in the kernel itself. I'll add this temporary fix to pm-graph master so as to cover kernels from 5.10.0-rc3 through whichever version your future fix lands in. Thanks.
On Wed, 28 Apr 2021 21:24:47 +0000 bugzilla-daemon@bugzilla.kernel.org wrote:

> --- Comment #18 from Todd Brandt (todd.e.brandt@intel.com) ---
> Hi Steven, that does appear to work, thanks.
>
> However, this is only a temporary fix. There shouldn't be any functions
> traversed by ftrace that result in a system hang, so I hope you can get
> this fixed in the kernel itself.

It's not that simple. Those functions are not the issue. The problem is the state of the kernel when the functions get traced. It could be *any* function that gets traced in that state that causes the hang/crash.

To fix this in the kernel, it would need some coordination between myself and the power management folks. My testing showed that the problem is that those trace events that you commented out are called within a critical section of suspend to RAM.

[ 561.185688] trace_device_pm_callback_end+0x7f/0xd0
[ 561.185693] dpm_run_callback+0x5e/0xc0
[ 561.185703] ? ftrace_graph_caller+0xa0/0xa0
[ 561.185707] __device_suspend+0x11e/0x450
[ 561.185718] dpm_suspend+0x13d/0x1e0
[ 561.185728] dpm_suspend_start+0x77/0x80
[ 561.185736] suspend_devices_and_enter+0x10e/0x650
[ 561.185750] pm_suspend.cold+0x329/0x374
[ 561.185758] state_store+0x71/0xd0
[ 561.185768] kernfs_fop_write_iter+0x127/0x1c0
[ 561.185778] new_sync_write+0x11f/0x1b0
[ 561.185798] vfs_write+0x2be/0x3b0
[ 561.185809] ksys_write+0x68/0xe0
[ 561.185819] do_syscall_64+0x33/0x40
[ 561.185826] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 561.185831] RIP: 0033:0x7f5d9b3a94e7

I haven't dug deeper, but my assumption here (and I can be wrong) is that the magic the function graph tracer does, replacing the return address here with the address of the trampoline, may be causing issues.

I'll dig a little deeper to see if I can find what it is about this context that breaks function graph tracing.
I've CCed Raphael on the bug; if linux-pm support is required, we'll need to work through him.
Can commit b02414c8 easily be reverted for now, as it introduces a regression in Linux 5.10?
(In reply to Paul Menzel from comment #21)
> Can commit b02414c8 easily be reverted for now, as it introduces a
> regression in Linux 5.10?

No, because reverting that commit will re-introduce the bug that was fixed, causing a regression in 5.12.
The bug is that the function graph tracer is called in a location that causes it to crash. That commit just opened the door to the bug. It was dumb luck that this wasn't hit by another path.
> No, because reverting that commit will re-introduce the bug that was fixed,
> causing a regression in 5.12.

I believe Linux’s “No regression rule” explicitly states that’s not a reason. You do not break userspace, even if it means another bug gets fixed later. Linus has commented a lot on that. Otherwise you make it hard for people to follow the mainline series.

PS: Sorry about hijacking this bug. I won’t comment any further.
(In reply to Paul Menzel from comment #24)
> > No, because reverting that commit will re-introduce the bug that was
> > fixed, causing a regression in 5.12.
>
> I believe Linux’s “No regression rule” explicitly states that’s not a
> reason. You do not break userspace, even if it means another bug gets fixed
> later. Linus has commented a lot on that. Otherwise you make it hard for
> people to follow the mainline series.

But the bug that was fixed has been fixed in 5.10, 5.11, and 5.12 (which is now released), and in many stable kernels now. To revert this, we would need to re-introduce the fixed bug back into 5.10, 5.11, and 5.12 and send it to the stable kernels.

> PS: Sorry about hijacking this bug. I won’t comment any further.

The fixed bug may not have been as catastrophic (it does not crash the kernel), but in my opinion it is a worse bug. It causes missed events. There were complaints that the trace did not make sense due to events not showing up where it was obvious that they should have, and data was unnecessarily lost.

And that fix is really unrelated to the current bug (which happens only if you are doing function graph tracing during suspend and resume). There's a bug where, if function graph tracing occurs in a specific place in the suspend/resume code, it can crash. This is a long-known issue. Function graph tracing requires modifying the return address on the stack in order to trace the return of the code, and for some reason this is dangerous during the suspend and resume cycle. The only reason the other commit is related to this is that we stopped hiding an event that would crash the system.

Let's fix the real bug. There's currently a workaround for this small use case that crashes. I'm looking into it, but reverting a real fix for 3 major releases and several stable releases and ignoring the real bug is not a solution!
(In reply to Steven Rostedt from comment #25)
> (In reply to Paul Menzel from comment #24)
> > > No, because reverting that commit will re-introduce the bug that was
> > > fixed, causing a regression in 5.12.
> >
> > I believe Linux’s “No regression rule” explicitly states that’s not a
> > reason. You do not break userspace, even if it means another bug gets
> > fixed later. Linus has commented a lot on that. Otherwise you make it
> > hard for people to follow the mainline series.
>
> But the bug that was fixed has been fixed in 5.10, 5.11, and 5.12 (which is
> now released), and in many stable kernels now. To revert this, we would need
> to re-introduce the fixed bug back into 5.10, 5.11, and 5.12 and send it to
> the stable kernels.
>
> > PS: Sorry about hijacking this bug. I won’t comment any further.
>
> The fixed bug may not have been as catastrophic (it does not crash the
> kernel), but in my opinion it is a worse bug. It causes missed events. There
> were complaints that the trace did not make sense due to events not showing
> up where it was obvious that they should have, and data was unnecessarily
> lost.
>
> And that fix is really unrelated to the current bug (which happens only if
> you are doing function graph tracing during suspend and resume). There's a
> bug where, if function graph tracing occurs in a specific place in the
> suspend/resume code, it can crash. This is a long-known issue. Function
> graph tracing requires modifying the return address on the stack in order
> to trace the return of the code, and for some reason this is dangerous
> during the suspend and resume cycle. The only reason the other commit is
> related to this is that we stopped hiding an event that would crash the
> system.

If I can chime in: as I understand from the explanations, both bugs discussed are similar in the sense that they relate to userspace debugging interfaces for the kernel. If we apply the "no regression" rule, then we can't revert the commit at the moment, because that would regress userspace for 5.10, 5.11, and 5.12. It's sad that the regression happened, but, well, these things happen, and in my opinion "now", when we have found it, is too late for reverting the commit.

> I'm looking into it, but reverting a real fix for 3 major releases and
> several stable releases and ignoring the real bug is not a solution!

Oh, this is amazing, thank you for looking into this!
(In reply to Konstantin Kharlamov from comment #26)
> If we apply the "no regression" rule, then we can't revert the commit at
> the moment, because that would regress userspace for 5.10, 5.11, and 5.12.

It's actually worse than that. That commit also fixed a regression. To revert it, we would need to revert the updates that caused the regression it fixed, and it would spiral into many more commits.

> It's sad that the regression happened, but, well, these things happen, and
> in my opinion "now", when we have found it, is too late for reverting the
> commit.

Yeah, but tracing sensitive locations, especially something as sensitive as suspend/resume, always brings up strange corner cases.

> > I'm looking into it, but reverting a real fix for 3 major releases and
> > several stable releases and ignoring the real bug is not a solution!
>
> Oh, this is amazing, thank you for looking into this!

It's my job as the maintainer of ftrace ;-) There could be another bug within function graph tracing, and I want to know what that might be.
Created attachment 296549 [details]
Use trylock in trace_global_clock

I think I found the bug, and it is related to the commit, but the commit is still fine.

As printk fails here, I added Peter Zijlstra's "force_early_printk" series:

  https://lore.kernel.org/lkml/20170928121823.430053219@infradead.org/

which skips all the cruft that printk does and just writes straight to the serial port. And in doing so, I found this:

NMI watchdog: Watchdog detected hard LOCKUP on cpu 7
Modules linked in: xt_CHECKSUM xt_MASQUERADE ip6table_mangle ip6table_nat iptable_mangle iptable_nat nf_nat ebtable_filter ebtables tun bridge stp llc vsock vmw_vmci ipt_REJECT nf_reject_ipv4 iptable_filter ip6t_REJECT nf_reject_ipv6 xt_state xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip6table_filter ip6_tables snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_hda_intel snd_intel_dspcfg snd_hda_codec joydev snd_hwdep snd_hda_core intel_rapl_msr intel_rapl_common iTCO_wdt x86_pkg_temp_thermal intel_powerclamp hp_wmi iTCO_vendor_support sparse_keymap rfkill wmi_bmof coretemp snd_seq snd_seq_device kvm_intel i915 snd_pcm mei_me snd_timer kvm snd mei soundcore i2c_i801 e1000e irqbypass lpc_ich i2c_algo_bit i2c_smbus drm_kms_helper crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel serio_raw tpm_infineon video wmi drm ip_tables
irq event stamp: 0
hardirqs last enabled at (0): [<0000000000000000>] 0x0
hardirqs last disabled at (0): [<ffffffff950d967f>] copy_process+0x92f/0x1d70
softirqs last enabled at (0): [<ffffffff950d967f>] copy_process+0x92f/0x1d70
softirqs last disabled at (0): [<0000000000000000>] 0x0
CPU: 7 PID: 1800 Comm: python3 Not tainted 5.12.0-rc3-test+ #204
Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v03.03 07/14/2016
RIP: 0010:native_queued_spin_lock_slowpath+0x69/0x200
Code: 2b 08 0f 92 c0 0f b6 c0 c1 e0 08 89 c2 8b 03 30 e4 09 d0 a9 00 01 ff ff 0f 85 14 01 00 00 85 c0 74 0e 8b 03 84 c0 74 08 f3 90 <8b> 03 84 c0 75 f8 b8 01 00 00 00 66 89 03 5b 5d 41 5c c3 8b 37 ba
RSP: 0018:ffffb3c9812e3950 EFLAGS: 00000002
RAX: 0000000000140101 RBX: ffffffff98137dc8 RCX: 000000000000001f
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff98137dc8
RBP: ffffffff98137dc8 R08: 00000029413d9eee R09: 0000000000000001
R10: 0000000000000000 R11: 0000000000000001 R12: 00000029413d9eee
R13: 00000000000003e8 R14: 0000000000000d60 R15: 0000000000000043
FS:  00007fe0c8ebd740(0000) GS:ffff9d5c5aa00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000564b1a9cd4d8 CR3: 000000000c080003 CR4: 00000000001706e0
Call Trace:
 trace_clock_global+0x91/0xa0
 __rb_reserve_next+0x237/0x460
 ring_buffer_lock_reserve+0x12a/0x3f0
 trace_buffer_lock_reserve+0x10/0x50
 __trace_graph_return+0x1f/0x80
 trace_graph_return+0xb7/0xf0
 ? trace_clock_global+0x91/0xa0
 ftrace_return_to_handler+0x8b/0xf0
 ? pv_hash+0xa0/0xa0
 return_to_handler+0x15/0x30
 ? ftrace_graph_caller+0xa0/0xa0
 ? trace_clock_global+0x91/0xa0
 ? __rb_reserve_next+0x237/0x460
 ? ring_buffer_lock_reserve+0x12a/0x3f0
 ? trace_event_buffer_lock_reserve+0x3c/0x120
 ? trace_event_buffer_reserve+0x6b/0xc0
 ? trace_event_raw_event_device_pm_callback_start+0x125/0x2d0
 ? dpm_run_callback+0x3b/0xc0
 ? pm_ops_is_empty+0x50/0x50
 ? platform_get_irq_byname_optional+0x90/0x90
 ? trace_device_pm_callback_start+0x82/0xd0
 ? dpm_run_callback+0x49/0xc0
 ? ftrace_graph_caller+0xa0/0xa0
 ? __device_suspend+0x11e/0x450
 ? dpm_suspend+0x13d/0x1e0
 ? dpm_suspend_start+0x77/0x80
 ? suspend_devices_and_enter+0x10e/0x650
 ? pm_suspend.cold+0x329/0x374
 ? state_store+0x71/0xd0
 ? kernfs_fop_write_iter+0x127/0x1c0
 ? new_sync_write+0x11f/0x1b0
 ? vfs_write+0x2be/0x3b0
 ? ksys_write+0x68/0xe0
 ? do_syscall_64+0x33/0x40

The above shows the following:

 ring_buffer_lock_reserve() {
   trace_clock_global() {
     arch_spin_lock() {
       queued_spin_lock_slowpath() {
         /* lock taken */
         (something else gets traced by the function graph tracer)
         ring_buffer_lock_reserve() {
           trace_clock_global() {
             arch_spin_lock() {
               queued_spin_lock_slowpath() {
                 /* DEAD LOCK! */

I updated the logic a bit in trace_clock_global() and this fixes the issue for me. Can you try this one?
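A side note on why trace_clock_global() shows up here at all: it is the implementation of the "global" trace clock, so this path is only taken when that clock is selected in tracefs. Assuming (not confirmed in this thread) that sleepgraph picks the global clock to get timestamps ordered across CPUs, the active clock can be checked, and for a quick test the spinlock path can be sidestepped, like so:

```sh
# Show the available trace clocks; the active one is shown in [brackets].
cat /sys/kernel/tracing/trace_clock

# Testing-only assumption: the default per-CPU "local" clock never calls
# trace_clock_global(), so it avoids this spinlock entirely, at the cost of
# timestamps that are not globally ordered across CPUs.
echo local > /sys/kernel/tracing/trace_clock
```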
Thank you, I confirm the attached patch fixes the issue. After booting into a 5.12 kernel with the patch applied and executing the previously troublesome command

  sudo sleepgraph -f -m mem -multi 1 1

my system goes to sleep and then wakes up correctly; no hangs observed. The patch is

Tested-by: Konstantin Kharlamov <hi-angel@yandex.ru>
Steven, well done! I removed my tempfix patch from the sleepgraph code, then I tested your patch on 5.12 and ran the callgraph on both freeze and mem, and both appear to work well.

  sudo sleepgraph -m freeze -rtcwake 15 -f
  sudo sleepgraph -m freeze -rtcwake 15 -ftop
  sudo sleepgraph -m mem -rtcwake 15 -f
  sudo sleepgraph -m mem -rtcwake 15 -ftop

I also ran the full pm-graph test-all script, which tests all possible sleepgraph features (tools/testall.sh in the pm-graph repo), and all appears well. So your fix is confirmed:

Tested-by: Todd Brandt <todd.e.brandt@linux.intel.com>