Bug 212761 - -f option hangs the system
Summary: -f option hangs the system
Status: RESOLVED CODE_FIX
Alias: None
Product: Tools
Classification: Unclassified
Component: pm-graph
Hardware: All
OS: Linux
Importance: P1 normal
Assignee: Virtual Assignee for pm-graph
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2021-04-23 17:53 UTC by Konstantin Kharlamov
Modified: 2021-10-13 23:27 UTC
CC List: 4 users

See Also:
Kernel Version: 5.11.11
Subsystem:
Regression: No
Bisected commit-id:


Attachments
tempfix.patch (600 bytes, text/plain)
2021-04-27 17:13 UTC, Todd Brandt
reproduce.sh (624 bytes, application/x-shellscript)
2021-04-27 17:21 UTC, Todd Brandt
Use trylock in trace_global_clock (1.76 KB, patch)
2021-04-30 14:43 UTC, Steven Rostedt

Description Konstantin Kharlamov 2021-04-23 17:53:38 UTC
Per this comment¹, the project has two bug trackers, so I'm reporting it here too.

This is reproducible on two different systems:

1. Dell Inspiron 5767, Archlinux, Kernel version 5.11.15
2. Macbook 2013, Fedora, Kernel version 5.11.11

Not sure if there's anything useful I can attach; perhaps a photo of the standard `sleepgraph` output.

The systems do not actually go to sleep: their screens stay lit and the last content remains visible. The last entry from `sleepgraph` is

```
SUSPEND START
```

## Steps to reproduce 

1. Execute `sudo sleepgraph -f -m mem -multi 1 1` *(the command is taken [from here](https://01.org/blogs/2018/improving-linux-suspend-resume-performance))*

### Expected

The system will go to sleep for some time and then wake up, as if no `-f` option was provided.

### Actual

The system freezes with the last content visible on the screen.

¹ https://github.com/intel/pm-graph/issues/12#issuecomment-502809924
Comment 1 Todd Brandt 2021-04-26 17:31:02 UTC
Interesting, so the ftrace callgraph behavior is having problems. Does it work OK with -dev? -dev uses ftrace kprobes, whereas -f uses ftrace callgraph; I just want to be sure it's not a problem with all of ftrace.
Comment 2 Konstantin Kharlamov 2021-04-26 17:53:38 UTC
Thank you for the reply!

Just tested: no, executing the

    sudo sleepgraph -dev -m mem -multi 1 1

causes no problems; the system wakes up fine.
Comment 3 Todd Brandt 2021-04-26 20:25:33 UTC
OK, I've verified that -f hangs here on every machine with kernel 5.10.0-rc3 or newer. So the issue is something added to the kernel between 5.10.0-rc2 and 5.10.0-rc3. I'm bisecting it now.
Comment 4 Todd Brandt 2021-04-27 04:01:04 UTC
Here's the offending commit: ftrace callgraph works before it and fails after it. I'll post a bug report tomorrow to see if we can get this fixed.

commit b02414c8f045ab3b9afc816c3735bc98c5c3d262
Author: Steven Rostedt (VMware) <rostedt@goodmis.org>
Date:   Mon Nov 2 15:31:27 2020 -0500

    ring-buffer: Fix recursion protection transitions between interrupt context

    The recursion protection of the ring buffer depends on preempt_count() to be
    correct. But it is possible that the ring buffer gets called after an
    interrupt comes in but before it updates the preempt_count(). This will
    trigger a false positive in the recursion code.

    Use the same trick from the ftrace function callback recursion code which
    uses a "transition" bit that gets set, to allow for a single recursion for
    to handle transitions between contexts.

    Cc: stable@vger.kernel.org
    Fixes: 567cd4da54ff4 ("ring-buffer: User context bit recursion checking")
    Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>

 kernel/trace/ring_buffer.c | 58 ++++++++++++++++++++++++++++++++++++----------
 1 file changed, 46 insertions(+), 12 deletions(-)
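
To illustrate the "transition bit" trick the commit message describes, here is a simplified, compilable user-space sketch; the names only approximate kernel/trace/ring_buffer.c, and the real code derives the context bit from preempt_count() rather than taking it as a parameter:

```
#include <stdio.h>

/* Illustrative sketch of the "transition bit" recursion check; names are
 * approximations of kernel/trace/ring_buffer.c, not the actual kernel code. */
enum rb_ctx { RB_CTX_TRANSITION, RB_CTX_NMI, RB_CTX_IRQ, RB_CTX_SOFTIRQ, RB_CTX_NORMAL };

struct cpu_buffer {
	unsigned int current_context;	/* one bit per context currently writing an event */
};

/* Returns 0 if the event may be recorded, 1 if it is rejected as recursion. */
static int trace_recursive_lock(struct cpu_buffer *cb, enum rb_ctx bit)
{
	unsigned int val = cb->current_context;

	if (val & (1U << bit)) {
		/*
		 * The bit for this context is already set: either genuine
		 * recursion, or an interrupt arrived before preempt_count()
		 * was updated, so the nested event reports the same context.
		 * Allow exactly one extra level via the transition bit.
		 */
		bit = RB_CTX_TRANSITION;
		if (val & (1U << bit))
			return 1;
	}

	cb->current_context = val | (1U << bit);
	return 0;
}

int main(void)
{
	struct cpu_buffer cb = { 0 };

	/* A normal-context event, then two nested events that still report
	 * "normal" context: the first is allowed via the transition bit,
	 * the second is rejected as real recursion. */
	int a = trace_recursive_lock(&cb, RB_CTX_NORMAL);
	int b = trace_recursive_lock(&cb, RB_CTX_NORMAL);
	int c = trace_recursive_lock(&cb, RB_CTX_NORMAL);

	printf("%d %d %d\n", a, b, c);	/* prints: 0 0 1 */
	return 0;
}
```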
Comment 5 Konstantin Kharlamov 2021-04-27 06:12:44 UTC
This is cool, thank you very much for the bisection!
Comment 6 Konstantin Kharlamov 2021-04-27 07:01:10 UTC
Oh, by the way, I added the author of the offending commit to CC. I didn't mention it in the previous comment because I thought it hadn't worked (the "Add" text field has no completion for their email and there was no button to add, so I assumed it would fail because they have no Bugzilla account; apparently it silently worked anyway when I posted the previous comment).
Comment 7 Todd Brandt 2021-04-27 15:49:51 UTC
Did you already submit a bug report to the mailing list? If so, could you post the link? I'm not on the CC (my email is todd.e.brandt@intel.com).
Comment 8 Konstantin Kharlamov 2021-04-27 16:02:21 UTC
Well, you said you were going to send the report, so I did not do anything in that regard ☺ Should I? If so, which mailing list would be appropriate? And isn't this bug tracker connected to the mailing lists? They usually are, so a report on this site might be sufficient, but I don't know how it works on bugzilla.kernel.org specifically…
Comment 9 Todd Brandt 2021-04-27 16:21:46 UTC
No, please don't report it, I just misunderstood your last comment.

I'm still working on a script that reproduces it. It looks like there is a very specific set of circumstances that have to occur for the hang to happen; sleepgraph initiates them, and I'm still narrowing down why. On the plus side, there might be a temporary fix to sleepgraph that gets around the issue. I'll update you when I've got it, thanks.
Comment 10 Steven Rostedt 2021-04-27 16:31:31 UTC
FYI, thanks for Cc'ing me. I'm looking into it too. I can reproduce it, and removing the above commit from the bisect makes it work again.

I'm about to take the rest of the day off, so I may not be able to look into it further until tomorrow. But I expect to find out this week what it is about that commit that causes the bug.
Comment 11 Todd Brandt 2021-04-27 17:13:38 UTC
Created attachment 296493 [details]
tempfix.patch

This is the temporary fix for the issue; it's caused by a conflict between ftrace callgraph and the device_pm_callback trace events.
Comment 12 Todd Brandt 2021-04-27 17:15:36 UTC
Hi Steven, I've narrowed it down to a specific conflict: when you use ftrace callgraph at the same time as the device_pm_callback_start/end trace events, the system hangs on 5.10.0-rc3 and newer. The attached fix shows exactly where the conflict is.
Comment 13 Todd Brandt 2021-04-27 17:21:23 UTC
Created attachment 296495 [details]
reproduce.sh

This script reproduces the issue with the smallest set of commands possible.
Comment 14 Todd Brandt 2021-04-27 17:25:10 UTC
In the meantime, Konstantin, you can apply that tempfix.patch to pm-graph and it will work again. Unfortunately the timeline itself will lack device detail, but the callgraph data will still be readable. Try using -ftop instead, as it doesn't rely on device trace events; it just produces one massive callgraph over the whole of suspend_devices_and_enter.
Comment 15 Konstantin Kharlamov 2021-04-27 19:13:50 UTC
Thank you very much Todd for your research and the workaround!
Comment 16 Todd Brandt 2021-04-27 22:49:03 UTC
Thank you for catching this and posting it Konstantin! If you hadn't, it might have been quite a while longer until we finally caught it.
Comment 17 Steven Rostedt 2021-04-28 14:12:56 UTC
I looked a little more into this, and the above patch is fixing the symptom and not the disease.

I found that if the function graph tracer traces dev_driver_string(), which the above trace events use, it locks up.

Is it that the function graph tracer can't be called in some parts of the suspend cycle, due to it "hijacking" the return pointer?

And since the trace clock uses the "global" clock, which takes a spinlock, that too can be traced in this "critical" section.

That is, without applying any patches, I can make the script work with the following command:

 # echo native_queued_spin_lock_slowpath dev_driver_string > /sys/kernel/tracing/set_ftrace_notrace

Perhaps, instead of removing those trace events, you could add these functions to the set_ftrace_notrace file? Of course, you may need to keep maintaining the list of what not to trace.

Also, FYI, keep an eye on trace-cruncher. We are working on a Python plugin to interact with libtracefs, giving you native Python commands to communicate with the ftrace tracefs directory.

https://github.com/vmware/trace-cruncher
https://git.kernel.org/pub/scm/libs/libtrace/libtracefs.git/
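
For reference, the same workaround can also be applied programmatically; a minimal sketch, assuming tracefs is mounted at /sys/kernel/tracing (older setups may use /sys/kernel/debug/tracing) and the program runs as root:

```
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	/* Equivalent to the echo command above: exclude these two functions
	 * from function (graph) tracing. */
	FILE *f = fopen("/sys/kernel/tracing/set_ftrace_notrace", "w");

	if (!f) {
		perror("set_ftrace_notrace");
		return EXIT_FAILURE;
	}
	fprintf(f, "native_queued_spin_lock_slowpath dev_driver_string\n");
	fclose(f);
	return EXIT_SUCCESS;
}
```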
Comment 18 Todd Brandt 2021-04-28 21:24:47 UTC
Hi Steven, that does appear to work, thanks.

However, this is only a temporary fix. There shouldn't be any functions traversed by ftrace that result in a system hang, so I hope you can get this fixed in the kernel itself.

I'll add this temporary fix to pm-graph master so as to cover kernels 5.10.0-rc3 through whichever version your future fix is posted in. Thanks.
Comment 19 Steven Rostedt 2021-04-28 21:54:06 UTC
On Wed, 28 Apr 2021 21:24:47 +0000
bugzilla-daemon@bugzilla.kernel.org wrote:

> --- Comment #18 from Todd Brandt (todd.e.brandt@intel.com) ---
> Hi Steven, that does appear to work, thanks.
> 
> However this is only a temporary fix. There shouldn't be any functions
> traversed by ftrace that result in a system hang. So I hope you can get this
> fixed in the kernel itself.

It's not that simple. Those functions are not the issue. The problem is the
state of the kernel when the functions get traced. It could be *any*
function that gets traced in that state that causes the hang/crash.

To fix this in the kernel, it would need some coordination between myself
and the power management folks. My testing showed that the problem is that
those trace events that you commented out are called within a critical
section of suspend to ram.

[  561.185688]  trace_device_pm_callback_end+0x7f/0xd0
[  561.185693]  dpm_run_callback+0x5e/0xc0
[  561.185703]  ? ftrace_graph_caller+0xa0/0xa0
[  561.185707]  __device_suspend+0x11e/0x450
[  561.185718]  dpm_suspend+0x13d/0x1e0
[  561.185728]  dpm_suspend_start+0x77/0x80
[  561.185736]  suspend_devices_and_enter+0x10e/0x650
[  561.185750]  pm_suspend.cold+0x329/0x374
[  561.185758]  state_store+0x71/0xd0
[  561.185768]  kernfs_fop_write_iter+0x127/0x1c0
[  561.185778]  new_sync_write+0x11f/0x1b0
[  561.185798]  vfs_write+0x2be/0x3b0
[  561.185809]  ksys_write+0x68/0xe0
[  561.185819]  do_syscall_64+0x33/0x40
[  561.185826]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  561.185831] RIP: 0033:0x7f5d9b3a94e7

I haven't dug deeper, but my assumption here (and I can be wrong) is that
the magic that the function graph tracer does with replacing the return
address here with the address of the trampoline may be causing issues.

I'll dig a little deeper to see if I can find what is with this context
that breaks function graph tracing.
Comment 20 Todd Brandt 2021-04-28 22:05:39 UTC
I've CCed Raphael on the bug; if linux-pm support is required we'll need to work through him.
Comment 21 Paul Menzel 2021-04-29 19:10:45 UTC
Can commit b02414c8 easily be reverted for now, as it introduces a regression in Linux 5.10?
Comment 22 Steven Rostedt 2021-04-29 19:29:13 UTC
(In reply to Paul Menzel from comment #21)
> Can commit b02414c8 easily be reverted for now, as it introduces a
> regression in Linux 5.10?

No, because reverting that commit will re-introduce the bug that was fixed, causing a regression in 5.12.
Comment 23 Steven Rostedt 2021-04-29 19:31:37 UTC
The bug is that the function graph tracer is called in a location that causes it to crash. That commit just opened the door to the bug. It was dumb luck that this wasn't hit by another path.
Comment 24 Paul Menzel 2021-04-30 07:52:31 UTC
> No, because reverting that commit will re-introduce the bug that was fixed,
> causing a regression in 5.12.

I believe Linux’s “no regression” rule explicitly states that's not a reason. You do not break userspace, even if it means another bug gets fixed later. Linus has commented a lot on that. Otherwise you make it hard for people to follow the mainline series.

PS: Sorry about hijacking this bug. I won’t comment any further.
Comment 25 Steven Rostedt 2021-04-30 11:22:04 UTC
(In reply to Paul Menzel from comment #24)
> > No, because reverting that commit will re-introduce the bug that was fixed,
> > causing a regression in 5.12.
> 
> I believe, Linux’ “No regression rule” explicitly states, that’s not a
> reason. You do not break userspace, even if it means another bug gets fixed
> later. Linus commented a lot of on that. Otherwise you make it hard for
> people to follow the mainline series.

But the bug that commit fixed has been fixed in 5.10, 5.11 and 5.12 (which is now released), and in many stable kernels now. To revert it, we would need to re-introduce the fixed bug into 5.10, 5.11 and 5.12 and send that to the stable kernels.

> 
> PS: Sorry, about hijacking this bug. I won’t comment any further.


The fixed bug may not have been as catastrophic (it does not crash the kernel), but in my opinion it is a worse bug. It causes missed events. There were complaints that the trace did not make sense because events did not show up where they obviously should have, and data was unnecessarily lost.

And that fix is really unrelated to the current bug (which happens only if you are doing function graph tracing during suspend and resume). There's a bug where, if function graph tracing occurs in a specific place in the suspend/resume code, it can crash. This is a long-known issue. Function graph tracing requires modifying the return address on the stack in order to trace the return of the code, and for some reason this is dangerous during the suspend and resume cycle. The only reason the other commit is related to this is that it stopped hiding an event that would crash the system.

Let's fix the real bug. There's currently a workaround for this small use case that crashes. I'm looking into it, but reverting a real fix for three major releases and several stable releases while ignoring the real bug is not a solution!
Comment 26 Konstantin Kharlamov 2021-04-30 11:38:41 UTC
(In reply to Steven Rostedt from comment #25)
> (In reply to Paul Menzel from comment #24)
> > > No, because reverting that commit will re-introduce the bug that was fixed,
> > > causing a regression in 5.12.
> >
> > I believe, Linux’ “No regression rule” explicitly states, that’s not a
> > reason. You do not break userspace, even if it means another bug gets fixed
> > later. Linus commented a lot of on that. Otherwise you make it hard for
> > people to follow the mainline series.
>
> But the bug fixed has been fixed in 5.10, 5.11 and 5.12 (which is now
> released), and for many stable kernels now. To revert this, we will need to
> re-introduce the fixed bug back to 5.10, 5.11, 5.12 and send it to the
> stable kernels.
>
> >
> > PS: Sorry, about hijacking this bug. I won’t comment any further.
>
>
> The fixed bug may not have been as catastrophic (does not crash the kernel)
> but in my opinion, it is a worse bug. It causes missed events. There were
> complaints that the trace did not make sense due to events not showing up
> where it was obvious that they should have, and data was unnecessarily lost.
>
> And that fix is really unrelated to the current bug (which happens only if
> you are doing function graph tracing during suspend and resume). There's a
> bug where if function graph tracing occurs in a specific place in the
> suspend/resume code, it can crash. This is a long known issue. Function
> graph tracing requires modifying the return address on the stack in order to
> trace the return of the code, and for some reason, this is dangerous during
> the suspend and resume cycle. The only reason the other commit is related to
> this, is that we stopped hiding an event that would crash the system.

If I may chime in: as I understand from the explanations, both bugs discussed are similar in the sense that they relate to userspace debugging interfaces for the kernel. If we apply the "no regression" rule, then we can't revert the commit at this point, because that would regress userspace for 5.10, 5.11 and 5.12.

It's sad that the regression happened, but, well, these things happen, and in my opinion, now that we have found it, it is too late to revert the commit.

> I'm looking into it, but reverting a real fix for 3 major releases and
> several
> stable releases and ignore the real bug is not a solution!

Oh, this is amazing, thank you for looking into this!
Comment 27 Steven Rostedt 2021-04-30 11:46:29 UTC
(In reply to Konstantin Kharlamov from comment #26)
> If we apply the "no regression" rule, then we can't revert
> the commit at the moment, because that would regress userspace for 5.10,
> 5.11, 5.12.

It's actually worse than that. That commit also fixed a regression. To revert it, we would need to revert the updates that caused the regression it fixed, and that would spiral into many more commits.

> 
> It's sad that regression happened, but, well, these things happen, and in my
> opinion "now" that we found it, is too late for reverting the commit.

Yeah, but tracing sensitive locations, especially something as sensitive as suspend/resume, always brings up strange corner cases.

> 
> > I'm looking into it, but reverting a real fix for 3 major releases and
> > several
> > stable releases and ignore the real bug is not a solution!
> 
> Oh, this amazing, thank you for looking into this!

It's my job as a maintainer of ftrace ;-)

There could be another bug within function graph tracing, and I want to know what that might be.
Comment 28 Steven Rostedt 2021-04-30 14:43:55 UTC
Created attachment 296549 [details]
Use trylock in trace_global_clock

I think I found the bug, and it is related to the commit, but the commit is still fine.

As printk fails here, I added Peter Zijlstra's "force_early_printk" series:

  https://lore.kernel.org/lkml/20170928121823.430053219@infradead.org/

which skips all the cruft that printk does and just writes straight to the serial port. In doing so, I found this:

NMI watchdog: Watchdog detected hard LOCKUP on cpu 7
Modules linked in:c xt_CHECKSUMc xt_MASQUERADEc ip6table_manglec ip6table_natc iptable_manglec iptable_natc nf_natc ebtable_filterc ebtablesc tunc bridgec stpc llcc vsockc vmw_vmcic ipt_REJECTc nf_reject_ipv4c iptable_filterc ip6t_REJECTc nf_reject_ipv6c xt_statec xt_conntrackc nf_conntrackc nf_defrag_ipv6c nf_defrag_ipv4c ip6table_filterc ip6_tablesc snd_hda_codec_hdmic snd_hda_codec_realtekc snd_hda_codec_genericc ledtrig_audioc snd_hda_intelc snd_intel_dspcfgc snd_hda_codecc joydevc snd_hwdepc snd_hda_corec intel_rapl_msrc intel_rapl_commonc iTCO_wdtc x86_pkg_temp_thermalc intel_powerclampc hp_wmic iTCO_vendor_supportc sparse_keymapc rfkillc wmi_bmofc coretempc snd_seqc snd_seq_devicec kvm_intelc i915c snd_pcmc mei_mec snd_timerc kvmc sndc meic soundcorec i2c_i801c e1000ec irqbypassc lpc_ichc i2c_algo_bitc i2c_smbusc drm_kms_helperc crct10dif_pclmulc crc32_pclmulc crc32c_intelc ghash_clmulni_intelc serio_rawc tpm_infineonc videoc wmic drmc ip_tablesc
irq event stamp: 0
hardirqs last  enabled at (0): [<0000000000000000>] 0x0
hardirqs last disabled at (0): [<ffffffff950d967f>] copy_process+0x92f/0x1d70
softirqs last  enabled at (0): [<ffffffff950d967f>] copy_process+0x92f/0x1d70
softirqs last disabled at (0): [<0000000000000000>] 0x0
CPU: 7 PID: 1800 Comm: python3 Not tainted 5.12.0-rc3-test+ #204
Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v03.03 07/14/2016
RIP: 0010:native_queued_spin_lock_slowpath+0x69/0x200
Code: 2b 08 0f 92 c0 0f b6 c0 c1 e0 08 89 c2 8b 03 30 e4 09 d0 a9 00 01 ff ff 0f 85 14 01 00 00 85 c0 74 0e 8b 03 84 c0 74 08 f3 90 <8b> 03 84 c0 75 f8 b8 01 00 00 00 66 89 03 5b 5d 41 5c c3 8b 37 ba
RSP: 0018:ffffb3c9812e3950 EFLAGS: 00000002c
RAX: 0000000000140101 RBX: ffffffff98137dc8 RCX: 000000000000001f
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff98137dc8
RBP: ffffffff98137dc8 R08: 00000029413d9eee R09: 0000000000000001
R10: 0000000000000000 R11: 0000000000000001 R12: 00000029413d9eee
R13: 00000000000003e8 R14: 0000000000000d60 R15: 0000000000000043
FS:  00007fe0c8ebd740(0000) GS:ffff9d5c5aa00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000564b1a9cd4d8 CR3: 000000000c080003 CR4: 00000000001706e0
Call Trace:
 trace_clock_global+0x91/0xa0
 __rb_reserve_next+0x237/0x460
 ring_buffer_lock_reserve+0x12a/0x3f0
 trace_buffer_lock_reserve+0x10/0x50
 __trace_graph_return+0x1f/0x80
 trace_graph_return+0xb7/0xf0
 ? trace_clock_global+0x91/0xa0
 ftrace_return_to_handler+0x8b/0xf0
 ? pv_hash+0xa0/0xa0
 return_to_handler+0x15/0x30
 ? ftrace_graph_caller+0xa0/0xa0
 ? trace_clock_global+0x91/0xa0
 ? __rb_reserve_next+0x237/0x460
 ? ring_buffer_lock_reserve+0x12a/0x3f0
 ? trace_event_buffer_lock_reserve+0x3c/0x120
 ? trace_event_buffer_reserve+0x6b/0xc0
 ? trace_event_raw_event_device_pm_callback_start+0x125/0x2d0
 ? dpm_run_callback+0x3b/0xc0
 ? pm_ops_is_empty+0x50/0x50
 ? platform_get_irq_byname_optional+0x90/0x90
 ? trace_device_pm_callback_start+0x82/0xd0
 ? dpm_run_callback+0x49/0xc0
 ? ftrace_graph_caller+0xa0/0xa0
 ? __device_suspend+0x11e/0x450
 ? dpm_suspend+0x13d/0x1e0
 ? dpm_suspend_start+0x77/0x80
 ? suspend_devices_and_enter+0x10e/0x650
 ? pm_suspend.cold+0x329/0x374
 ? state_store+0x71/0xd0
 ? kernfs_fop_write_iter+0x127/0x1c0
 ? new_sync_write+0x11f/0x1b0
 ? vfs_write+0x2be/0x3b0
 ? ksys_write+0x68/0xe0
 ? do_syscall_64+0x33/0x40

The above shows the following:

ring_buffer_lock_reserve() {
  trace_clock_global() {
    arch_spin_lock() {
      queued_spin_lock_slowpath() {
        /* lock taken */
        (something else gets traced by function graph tracer)
          ring_buffer_lock_reserve() {        
            trace_clock_global() {
              arch_spin_lock() {
                queued_spin_lock_slowpath() {
                /* DEAD LOCK! */

I updated the logic a bit in trace_clock_global() and this fixes the issue for me. Can you try this one?
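
The attached patch is not inlined in this thread; the sketch below only illustrates the trylock idea in trace_clock_global() (kernel-style pseudocode; the names approximate kernel/trace/trace_clock.c and the actual attachment may differ):

```
/* Illustrative sketch only, not the attached patch: never spin on the
 * global-clock lock, since the lock path itself can be traced; if the
 * trylock fails, fall back to the locally derived timestamp. */
u64 notrace trace_clock_global(void)
{
	unsigned long flags;
	u64 now, prev_time;

	raw_local_irq_save(flags);

	now = sched_clock_cpu(raw_smp_processor_id());

	/* Keep the returned value monotonic with the last recorded global time. */
	prev_time = READ_ONCE(trace_clock_struct.prev_time);
	if ((s64)(now - prev_time) < 0)
		now = prev_time;

	/*
	 * Only update the global state if the lock is free; spinning here is
	 * what produced the deadlock shown above.
	 */
	if (arch_spin_trylock(&trace_clock_struct.lock)) {
		if ((s64)(now - trace_clock_struct.prev_time) > 0)
			trace_clock_struct.prev_time = now;
		arch_spin_unlock(&trace_clock_struct.lock);
	}

	raw_local_irq_restore(flags);

	return now;
}
```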
Comment 29 Konstantin Kharlamov 2021-04-30 15:20:59 UTC
Thank you, I confirm the attached patch fixes the issue. After booting into a 5.12 kernel with the patch applied and executing the previously troublesome command

    sudo sleepgraph -f -m mem -multi 1 1

my system goes to sleep and then wakes up correctly, no hangs observed.

The patch is 

Tested-by: Konstantin Kharlamov <hi-angel@yandex.ru>
Comment 30 Todd Brandt 2021-04-30 17:17:42 UTC
Steven, well done! I removed my tempfix patch from the sleepgraph code, then tested your patch on 5.12 and ran callgraph on both freeze and mem; both appear to work well.

sudo sleepgraph -m freeze -rtcwake 15 -f
sudo sleepgraph -m freeze -rtcwake 15 -ftop
sudo sleepgraph -m mem -rtcwake 15 -f
sudo sleepgraph -m mem -rtcwake 15 -ftop

I also ran the full pm-graph test-all script, which tests all possible sleepgraph features (tools/testall.sh in the pm-graph repo), and all appears well.

So your fix is confirmed:

Tested-by: Todd Brandt <todd.e.brandt@linux.intel.com>
