207519 – perf in 5.4 tracing kernel and user page faults and tlb flushes will panic the kernel

Status:	NEW

Alias:	None

Product:	Tracing/Profiling
Classification:	Unclassified
Component:	Kernel Perf
Hardware:	Intel Linux

Importance:	P1 blocking
Assignee:	Frederic Weisbecker

URL:	https://bugs.launchpad.net/ubuntu/+so...
Keywords:

Depends on:
Blocks:

Reported:	2020-04-30 11:00 UTC by Colin Ian King
Modified:	2020-04-30 16:01 UTC
CC List:	2 users

See Also:
Kernel Version:	5.4 and later
Subsystem:
Regression:	Yes
Bisected commit-id:

Attachments
full stack dump (82.17 KB, text/plain) 2020-04-30 11:20 UTC, Colin Ian King

Description Colin Ian King 2020-04-30 11:00:39 UTC

I originally triggered this with stress-ng V0.11.08 by running: sudo stress-ng --perf --cpu 1 -t 10

I've since pushed a commit that stops using the TLB flush event, which avoids this issue for the moment.

I've worked through all the perf event combinations and found that the kernel panic occurs with the following events:

sudo perf record -eexceptions:page_fault_user,exceptions:page_fault_kernel,tlb:tlb_flush sleep 1

Bisecting the kernel I found that this issue occurred when the following commit landed in the kernel:

commit 763802b53a427ed3cbd419dbba255c414fdd9e7c
Author: Joerg Roedel <jroedel@suse.de>
Date:   Sat Mar 21 18:22:41 2020 -0700

    x86/mm: split vmalloc_sync_all()
    
This reproduces 100% of the time on x86-64, both in a VM and on real hardware.

Comment 1 Colin Ian King 2020-04-30 11:20:20 UTC

Created attachment 288837
full stack dump

The top of the stack dump (attached) shows that it's a stack overflow:


[   22.163398] BUG: stack guard page was hit at (____ptrval____) (stack is (____ptrval____)..(____ptrval____))
[   22.165204] kernel stack overflow (double-fault): 0000 [#1] SMP PTI
[   22.166729] CPU: 3 PID: 935 Comm: perf Not tainted 5.4.0-28-generic #32-Ubuntu
[   22.168813] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-1ubuntu1 04/01/2014
[   22.171263] RIP: 0010:perf_trace_x86_exceptions+0x44/0xf0
[   22.172769] Code: 83 ec 18 48 8b 5f 78 65 48 8b 04 25 28 00 00 00 48 89 45 d0 31 c0 65 48 03 1d 00 0c f9 68 48 8b 87 80 00 00 00 48 85 c0 75 08 <48> 8b 03 48 85 c0 74 74 bf 24 00 00 00 48 8d 55 c4 48 8d 75 c8 e8
[   22.176573] RSP: 0018:ffff978f00838020 EFLAGS: 00010046
[   22.177569] RAX: 0000000000000000 RBX: ffffb78effdcab70 RCX: 0000000000000000
[   22.178800] RDX: ffff978f008380b8 RSI: ffffb78effdcab70 RDI: ffffffff9863e620
[   22.179993] RBP: ffff978f00838060 R08: 0000000000000000 R09: 0000000000000000
[   22.181188] R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff9863e620
[   22.182698] R13: 0000000000000000 R14: ffffb78effdcab70 R15: ffff978f008380b8
[   22.184019] FS:  00007ff4818af780(0000) GS:ffff892b7db80000(0000) knlGS:0000000000000000
[   22.185592] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   22.186732] CR2: ffff978f00837ff8 CR3: 000000007d5d8000 CR4: 00000000000006e0
[   22.188100] Call Trace:
[   22.188689]  do_page_fault+0xca/0xe0
[   22.189493]  do_async_page_fault+0x39/0x70
[   22.190388]  async_page_fault+0x34/0x40
[   22.191233] RIP: 0010:perf_trace_x86_exceptions+0x44/0xf0
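
As a rough sanity check of the "stack guard page was hit" reading, the register values above can be compared directly. The sketch below is only an illustration: the helper names are hypothetical, it assumes 4 KiB pages, and because the real stack bounds are hidden as (____ptrval____) it can only show that CR2 falls in the page directly below the page holding RSP, which is consistent with an overflow rather than proof of one.

```python
import re

PAGE_SIZE = 4096  # assumption: 4 KiB pages, the usual x86-64 base page size

# Two lines copied from the oops output above.
OOPS = """\
[   22.176573] RSP: 0018:ffff978f00838020 EFLAGS: 00010046
[   22.186732] CR2: ffff978f00837ff8 CR3: 000000007d5d8000 CR4: 00000000000006e0
"""


def parse_register(log: str, name: str) -> int:
    """Return the 64-bit hex value following 'NAME:' (hypothetical helper).

    An optional segment selector such as '0018:' before the value is skipped.
    """
    pattern = rf"\b{name}:\s*(?:[0-9a-f]{{4}}:)?([0-9a-f]{{16}})"
    match = re.search(pattern, log)
    if match is None:
        raise ValueError(f"{name} not found in log")
    return int(match.group(1), 16)


def fault_in_page_below_stack_pointer(rsp: int, cr2: int) -> bool:
    """True if cr2 lies in the page immediately below the page containing rsp."""
    rsp_page = rsp & ~(PAGE_SIZE - 1)
    return rsp_page - PAGE_SIZE <= cr2 < rsp_page


rsp = parse_register(OOPS, "RSP")
cr2 = parse_register(OOPS, "CR2")
print(f"RSP - CR2 = {rsp - cr2:#x}")
print("fault in page below RSP:", fault_in_page_below_stack_pointer(rsp, cr2))
```

Here the faulting address is 0x28 bytes below RSP and sits 8 bytes under a page boundary, which fits the kernel's own "stack guard page was hit" message.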

Comment 2 Colin Ian King 2020-04-30 11:41:52 UTC

Finally got a full stack dump:

Comment 3 Colin Ian King 2020-04-30 11:42:23 UTC

This still occurs on 5.7-rc2 and on today's linux-next tip.

Comment 4 Joerg Roedel 2020-04-30 15:05:40 UTC

Please have a look here:

https://lore.kernel.org/lkml/20200430141120.GA8135@suse.de/

and

https://lore.kernel.org/lkml/20200430145057.GB8135@suse.de/


This is likely the same issue.

Comment 5 Colin Ian King 2020-04-30 16:01:50 UTC

Yep, it definitely is the same issue.
