Bug 209317

Summary: ftrace kernel self test failure on RISC-V on 5.8, regression from 5.4.0
Product: Tracing/Profiling
Reporter: Colin Ian King (colin.king)
Component: Ftrace
Assignee: Steven Rostedt (rostedt)
Status: NEW
Severity: high    
Priority: P1    
Hardware: Other   
OS: Linux   
Kernel Version: 5.8.0 Tree: Mainline
Regression: Yes

Description Colin Ian King 2020-09-18 14:33:50 UTC
Running the kernel ftrace self tests triggers an oops on RISC-V with 5.8.0. This did not occur on 5.4.0.

How to reproduce:

Test: linux/tools/testing/selftests/ftrace 

run with sudo ftrace -vvv

function_graph ftrace test basic2.tc causes the oops:

[  455.134397] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
[  455.134451] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
[  455.134508] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
[  455.134564] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
[  455.134607] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
[  455.134649] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
[  455.134754] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
[  455.135305] Oops [#1]
[  455.145642] Modules linked in: binfmt_misc ofpart redboot cmdlinepart cfi_cmdset_0001 cfi_probe cfi_util gen_probe physmap map_funcs chipreg virtio_rng mtd uio_pdrv_genirq uio sch_fq_codel drm drm_panel_orientation_quirks backlight ip_tables x_tables autofs4 virtio_net net_failover failover virtio_blk
[  455.196320] CPU: 2 PID: 20 Comm: migration/2 Tainted: G        W         5.8.0-2-generic #2
[  455.197001] epc: 0000000000000000 ra : ffffffe0002b7acc sp : ffffffe1f5c8bd70
[  455.197581]  gp : ffffffe0017223a0 tp : ffffffe1f5c76780 t0 : ffffffe1f5c8bd78
[  455.222311]  t1 : ffffffe0002769a0 t2 : ffffffe1f5c8be00 s0 : ffffffe1f5c8bd80
[  455.223098]  s1 : ffffffe1f1063ba0 a0 : ffffffe0002b7ba6 a1 : ffffffe0002769a0
[  455.223846]  a2 : ffffffe1f5c8be00 a3 : 0000000000000010 a4 : 0000000000000002
[  455.224507]  a5 : ffffffe1fecbfad8 a6 : 00000000000000ff a7 : 0000000000000001
[  455.225351]  s2 : ffffffe1f1063bc4 s3 : ffffffffffffffff s4 : ffffffe001724210
[  455.226199]  s5 : 0000000000000001 s6 : 0000000000000003 s7 : 0000000000000003
[  455.226924]  s8 : 0000000000000004 s9 : 0000000000000002 s10: 0000000000000000
[  455.227522]  s11: 0000000000000003 t3 : 000000000000b67e t4 : 00000000000192f5
[  455.228111]  t5 : ffffffe0017221a0 t6 : ffffffe000c02d24
[  455.228542] status: 0000000000000100 badaddr: 0000000000000000 cause: 000000000000000c
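
The cause value on the last line of the dump can be decoded by hand. The sketch below is a hypothetical helper (decode_cause and EXCEPTIONS are names made up for illustration, not kernel code) that maps the value to the synchronous exception names listed in the RISC-V privileged specification. With cause 0xc and both epc and badaddr at zero, the dump is consistent with an instruction fetch from address 0, that is, a jump through a NULL pointer.

```python
# Synchronous exception codes from the RISC-V privileged specification
# (scause with the interrupt bit clear). Illustrative helper, not kernel code.
EXCEPTIONS = {
    0: "instruction address misaligned",
    1: "instruction access fault",
    2: "illegal instruction",
    3: "breakpoint",
    4: "load address misaligned",
    5: "load access fault",
    6: "store/AMO address misaligned",
    7: "store/AMO access fault",
    8: "environment call from U-mode",
    9: "environment call from S-mode",
    12: "instruction page fault",
    13: "load page fault",
    15: "store/AMO page fault",
}


def decode_cause(cause, xlen=64):
    """Return a readable name for a raw scause value."""
    interrupt_bit = 1 << (xlen - 1)  # top bit distinguishes interrupts
    if cause & interrupt_bit:
        return "interrupt %d" % (cause & ~interrupt_bit)
    return EXCEPTIONS.get(cause, "reserved/unknown exception %d" % cause)


# Value taken from the "cause:" field of the oops above.
print(decode_cause(0x000000000000000C))  # prints "instruction page fault"
```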
Comment 1 Colin Ian King 2020-09-18 14:47:56 UTC
Occurs in 5.8.8 too.
Comment 2 Colin Ian King 2020-09-24 16:59:23 UTC
regression between 5.6 (ok) and 5.7 (crashes)
Comment 3 Colin Ian King 2020-09-26 19:11:30 UTC
This is a RISC-V specific issue, bisected down to:

cfafe260137418d0265d0df3bb18dc494af2b43e is the first bad commit
commit cfafe260137418d0265d0df3bb18dc494af2b43e
Author: Atish Patra <atish.patra@wdc.com>
Date: Tue Mar 17 18:11:43 2020 -0700

    RISC-V: Add supported for ordered booting method using HSM
Comment 4 Colin Ian King 2020-09-26 22:02:35 UTC
Issue still in 5.9-rc6
Comment 5 Steven Rostedt 2020-09-28 15:13:44 UTC
On Sat, 26 Sep 2020 22:02:35 +0000
bugzilla-daemon@bugzilla.kernel.org wrote:

> https://bugzilla.kernel.org/show_bug.cgi?id=209317
> 
> --- Comment #4 from Colin Ian King (colin.king@canonical.com) ---
> Issue still in 5.9-rc6
> 


Atish,

As the issue bisects down to your commit, care to take a look at this?
(And take ownership of this bug.)

-- Steve
Comment 6 Atish.Patra 2020-09-28 17:25:40 UTC
On Mon, 2020-09-28 at 11:13 -0400, Steven Rostedt wrote:
> On Sat, 26 Sep 2020 22:02:35 +0000
> bugzilla-daemon@bugzilla.kernel.org wrote:
> 
> > https://bugzilla.kernel.org/show_bug.cgi?id=209317
> > 
> > --- Comment #4 from Colin Ian King (colin.king@canonical.com) ---
> > Issue still in 5.9-rc6
> > 
> 
> Atish,
> 
> As the issue bisects down to your commit, care to take a look at
> this?
> (And take ownership of this bug.)
> 

Yes. I am already looking into this. Colin informed me about the bug
over the weekend.

I couldn't change the ownership as I am not part of the editbugs group.
I have sent an email to helpdesk@kernel.org for access.

> -- Steve
Comment 7 Atish Patra 2020-10-03 17:33:20 UTC
Hi Alan and Zong,
I initially suspected that ftrace broke between v5.6 and v5.7, as Colin pointed out.
I couldn't find any way in which the HSM patch is related. Zong's ftrace
patching code was also merged in that release.
However, I was able to reproduce the issue on an older kernel (v5.4)
as well, on both QEMU and Unleashed hardware.
Here are the steps:

mount -t debugfs none /sys/kernel/debug/
cd /sys/kernel/debug/tracing
echo function_graph > current_tracer
echo function > current_tracer

It works the first time with function_graph, but writing any other
tracer afterwards crashes immediately.
Can you take a look to check whether the bug is in the ftrace infrastructure code?
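
For repeated testing, the steps above can be wrapped in a small script. This is only a sketch: switch_tracers is a hypothetical helper name, it assumes debugfs is already mounted, and the tracing directory is passed in as a parameter so the function can be tried against a scratch directory first.

```python
from pathlib import Path


def switch_tracers(tracing_dir, tracers=("function_graph", "function")):
    """Write each tracer name to current_tracer in order, like `echo X > current_tracer`."""
    current = Path(tracing_dir) / "current_tracer"
    written = []
    for name in tracers:
        # On an affected kernel the second write is where the crash was observed.
        current.write_text(name + "\n")
        written.append(name)
    return written
```

On a target system this would be called as root with switch_tracers("/sys/kernel/debug/tracing"). Expect the machine to oops on an affected kernel, so run it only on a test system.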

Comment 8 zong.li 2020-10-05 06:08:59 UTC
Hi Atish,

I can set aside some time to look at it together. If anyone here
fixes it or has ideas, please share the information. Thanks.

Comment 9 Atish Patra 2020-10-05 18:47:20 UTC
On Sun, Oct 4, 2020 at 11:08 PM Zong Li <zong.li@sifive.com> wrote:
>
> Hi Atish,
>
> I can take out some time to take a look at it together, if anyone here
> fixes it or has ideas, please share the information, thanks.
>

Thanks. Here is an observation, in case it helps.

Across kernels, the panic trace seems to indicate that one of the
first two functions after patching is corrupted:
rcu_momentary_dyntick_idle or stop_machine_yield [1].

[1] https://elixir.bootlin.com/linux/v5.9-rc7/source/kernel/stop_machine.c#L213

I suspect the nop was not replaced with the correct auipc+jalr pair?
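
To make the auipc+jalr suspicion easier to check against a memory dump, here is a minimal sketch of how such a pair encodes a PC-relative call through ra. It is not the kernel's patching code: make_call and decode_target are hypothetical names, the base opcodes are derived from the RISC-V ISA encoding, and the addresses in the example are made up.

```python
AUIPC_RA = 0x00000097  # auipc ra, 0
JALR_RA = 0x000080E7   # jalr ra, 0(ra)


def make_call(caller, callee):
    """Return the (auipc, jalr) instruction words for a call from caller to callee."""
    offset = callee - caller
    if not (-(1 << 31) - 0x800 <= offset < (1 << 31) - 0x800):
        raise ValueError("offset out of auipc+jalr range")
    lo12 = offset & 0xFFF
    # jalr sign-extends its 12-bit immediate, so round the upper part up
    # when bit 11 of the offset is set.
    hi20 = ((offset + 0x800) >> 12) & 0xFFFFF
    return AUIPC_RA | (hi20 << 12), JALR_RA | (lo12 << 20)


def decode_target(caller, auipc, jalr):
    """Recover the call target from an (auipc, jalr) pair located at caller."""
    hi = auipc & 0xFFFFF000
    if hi & 0x80000000:
        hi -= 1 << 32
    lo = (jalr >> 20) & 0xFFF
    if lo & 0x800:
        lo -= 0x1000
    return caller + hi + lo


# Made-up addresses, only to show the round trip.
auipc, jalr = make_call(0x1000, 0x1000 + 0x12345FFC)
assert decode_target(0x1000, auipc, jalr) == 0x1000 + 0x12345FFC
```

If the two words found at a patched call site do not decode back to the expected tracer entry point, that would support the theory that the nop replacement went wrong.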
