Bug 211151 - WARNING: CPU: 1 PID: 101841 at arch/powerpc/mm/fault.c:229 __do_page_fault+0x6b4/0xcd0
Summary: WARNING: CPU: 1 PID: 101841 at arch/powerpc/mm/fault.c:229 __do_page_fault+0x...
Status: CLOSED CODE_FIX
Alias: None
Product: Platform Specific/Hardware
Classification: Unclassified
Component: PPC-64 (show other bugs)
Hardware: All Linux
: P1 normal
Assignee: Jens Axboe
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2021-01-12 09:57 UTC by Zorro Lang
Modified: 2021-02-16 09:50 UTC (History)
3 users (show)

See Also:
Kernel Version: v5.11-rc3+
Subsystem:
Regression: No
Bisected commit-id:


Attachments

Description Zorro Lang 2021-01-12 09:57:25 UTC
Description of problem:
xfstests always hit below kernel bug on ppc64le (kernel supports io_uring):

[  556.472666] ------------[ cut here ]------------
[  556.472686] Bug: Read fault blocked by KUAP!
[  556.472697] WARNING: CPU: 1 PID: 101841 at arch/powerpc/mm/fault.c:229 __do_page_fault+0x6b4/0xcd0
[  556.472728] Modules linked in: bonding rfkill sunrpc pseries_rng xts uio_pdrv_genirq vmx_crypto uio ip_tables xfs libcrc32c sd_mod t10_pi sg ibmvscsi ibmveth scsi_transport_srp
[  556.472816] CPU: 1 PID: 101841 Comm: io_wqe_worker-0 Tainted: G        W         5.11.0-rc3+ #2
[  556.472830] NIP:  c00000000009e7e4 LR: c00000000009e7e0 CTR: 0000000000000000
[  556.472842] REGS: c000000016367090 TRAP: 0700   Tainted: G        W          (5.11.0-rc3+)
[  556.472853] MSR:  8000000000021033 <SF,ME,IR,DR,RI,LE>  CR: 48022424  XER: 00000001
[  556.472901] CFAR: c0000000001822ac IRQMASK: 1 
               GPR00: c00000000009e7e0 c000000016367330 c0000000023fc300 0000000000000020 
               GPR04: c000000001e3c2b8 0000000000000001 0000000000000027 c0000007fbcccc90 
               GPR08: 0000000000000023 0000000000000000 c000000024ed0900 fffffffffc464a58 
               GPR12: 0000000000002000 c00000001ecaf280 c0000000001caee8 c000000014d547c0 
               GPR16: 0000000000000000 0000000000000000 0000000000000000 c000000002454018 
               GPR20: c000000001336480 bfffffffffffffff 0000000000000000 c00000000b0e5800 
               GPR24: a8aaaaaaaaaaaaaa 0000000000000000 0000000000200000 c00000002cc38880 
               GPR28: 000001000e3c9310 c0000000013424c0 c0000000163674a0 c000000001e0d2c0 
[  556.473125] NIP [c00000000009e7e4] __do_page_fault+0x6b4/0xcd0
[  556.473139] LR [c00000000009e7e0] __do_page_fault+0x6b0/0xcd0
[  556.473152] Call Trace:
[  556.473168] [c000000016367330] [c00000000009e7e0] __do_page_fault+0x6b0/0xcd0 (unreliable)
[  556.473198] [c0000000163673e0] [c00000000009ee3c] do_page_fault+0x3c/0x120
[  556.473216] [c000000016367430] [c00000000000c848] handle_page_fault+0x10/0x2c
[  556.473232] --- interrupt: 300 at iov_iter_fault_in_readable+0x148/0x6f0
[  556.473245] NIP:  c0000000008e8228 LR: c0000000008e834c CTR: 0000000000000000
[  556.473257] REGS: c0000000163674a0 TRAP: 0300   Tainted: G        W          (5.11.0-rc3+)
[  556.473268] MSR:  800000000280b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 44008482  XER: 00000001
[  556.473339] CFAR: c0000000008e81f0 DAR: 000001000e3c9310 DSISR: 00200000 IRQMASK: 0 
               GPR00: c0000000008e834c c000000016367740 c0000000023fc300 0000000000000000 
               GPR04: c00000002cc389e0 0000000000000001 00000007fa4b0000 c0000000025bc520 
               GPR08: 00000007fa4b0000 0000000000000200 fcffffffffffffff ffffffffffea2ad8 
               GPR12: 0000000000008000 c00000001ecaf280 c0000000001caee8 c000000014d547c0 
               GPR16: 0000000000000000 0000000000000000 0000000000000000 c000000002454018 
               GPR20: c000000001336480 bfffffffffffffff 0000000000000000 c00000000b0e5800 
               GPR24: a8aaaaaaaaaaaaaa fcffffffffffffff 00000000000004b1 00000000000004b1 
               GPR28: c00000000b0e5888 000001000e3c97c0 0000000000000000 000001000e3c9310 
[  556.473667] NIP [c0000000008e8228] iov_iter_fault_in_readable+0x148/0x6f0
[  556.473688] LR [c0000000008e834c] iov_iter_fault_in_readable+0x26c/0x6f0
[  556.473708] --- interrupt: 300
[  556.473722] [c0000000163677e0] [c0000000007154a0] iomap_write_actor+0xc0/0x280
[  556.473770] [c000000016367880] [c00000000070fc94] iomap_apply+0x1c4/0x780
[  556.473804] [c000000016367990] [c000000000710330] iomap_file_buffered_write+0xa0/0x120
[  556.473839] [c0000000163679e0] [c00800000040791c] xfs_file_buffered_aio_write+0x314/0x5e0 [xfs]
[  556.474053] [c000000016367a90] [c0000000006d74bc] io_write+0x10c/0x460
[  556.474101] [c000000016367bb0] [c0000000006d80e4] io_issue_sqe+0x8d4/0x1200
[  556.474132] [c000000016367c70] [c0000000006d8ad0] io_wq_submit_work+0xc0/0x250
[  556.474161] [c000000016367cb0] [c0000000006e2578] io_worker_handle_work+0x498/0x800
[  556.474192] [c000000016367d40] [c0000000006e2cdc] io_wqe_worker+0x3fc/0x4f0
[  556.474223] [c000000016367da0] [c0000000001cb0a4] kthread+0x1c4/0x1d0
[  556.474254] [c000000016367e10] [c00000000000dbf0] ret_from_kernel_thread+0x5c/0x6c
[  556.474286] Instruction dump:
[  556.474310] e87e0100 481287f1 60000000 2fa30000 419e01ec 408e0400 3c82fef4 388461d0 
[  556.474395] 3c62fef4 386362d0 480e3a69 60000000 <0fe00000> 3860000b 4bfffa08 3d220006 
[  556.474479] irq event stamp: 1280
[  556.474505] hardirqs last  enabled at (1279): [<c0000000005a0104>] __slab_free+0x3e4/0x570
[  556.474540] hardirqs last disabled at (1280): [<c000000000008a04>] data_access_common_virt+0x1a4/0x1c0
[  556.474565] softirqs last  enabled at (536): [<c00000000107cdfc>] __do_softirq+0x6ac/0x7f4
[  556.474590] softirqs last disabled at (437): [<c00000000019179c>] irq_exit+0x2ec/0x320
[  556.474615] ---[ end trace 4c1967c400992302 ]---
[  556.482797] Kernel attempted to read user page (1000e3f9210) - exploit attempt? (uid: 0)

Steps to Reproduce:
Many xfstests sub-cases can reproduce this bug, likes generic/013, generic/051 etc ... These cases all use xfstests fsstress or fsx which contains IO_URING test.

Additional info:
A comment from Dave Chinner talked about this bug:
"Oh, this is likely an io_uring bug. The buffered write has been handed to a io_uring worker thread, and then it's trying to fault in a user page from a different mm context triggering a "user read page" error from a context that should never access userspace memory...."

So I'd like to CC io_uring people in this bug.
Comment 1 Jens Axboe 2021-01-26 20:31:25 UTC
Just noticed this one. Seems to be because the powerpc fault code does
this:

if (unlikely(!is_user && bad_kernel_fault(regs, error_code, address, is_write)))
	return SIGSEGV;

where is_user is based on user_mod(regs) only. This doesn't work for
io_uring, where the helper thread can assume the user app identity and
could perform this fault just fine.

Not sure what the best fix is, would have to ask the powerpc folks.
Something basic like the below might make it work, but this is really an
arch question.

diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index 8961b44f350c..2176f9ffebcc 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -395,7 +395,7 @@ static int __do_page_fault(struct pt_regs *regs, unsigned long address,
 	struct mm_struct *mm = current->mm;
 	unsigned int flags = FAULT_FLAG_DEFAULT;
  	int is_exec = TRAP(regs) == 0x400;
-	int is_user = user_mode(regs);
+	int is_user = user_mode(regs) || current->flags & PF_IO_WORKER;
 	int is_write = page_fault_is_write(error_code);
 	vm_fault_t fault, major = 0;
 	bool kprobe_fault = kprobe_page_fault(regs, 11);
Comment 2 Zorro Lang 2021-01-27 04:36:50 UTC
(In reply to Jens Axboe from comment #1)
> Just noticed this one. Seems to be because the powerpc fault code does
> this:
> 
> if (unlikely(!is_user && bad_kernel_fault(regs, error_code, address,
> is_write)))
>       return SIGSEGV;
> 
> where is_user is based on user_mod(regs) only. This doesn't work for
> io_uring, where the helper thread can assume the user app identity and
> could perform this fault just fine.
> 
> Not sure what the best fix is, would have to ask the powerpc folks.
> Something basic like the below might make it work, but this is really an
> arch question.
> 
> diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
> index 8961b44f350c..2176f9ffebcc 100644
> --- a/arch/powerpc/mm/fault.c
> +++ b/arch/powerpc/mm/fault.c
> @@ -395,7 +395,7 @@ static int __do_page_fault(struct pt_regs *regs,
> unsigned long address,
>       struct mm_struct *mm = current->mm;
>       unsigned int flags = FAULT_FLAG_DEFAULT;
>       int is_exec = TRAP(regs) == 0x400;
> -     int is_user = user_mode(regs);
> +     int is_user = user_mode(regs) || current->flags & PF_IO_WORKER;
>       int is_write = page_fault_is_write(error_code);
>       vm_fault_t fault, major = 0;
>       bool kprobe_fault = kprobe_page_fault(regs, 11);

Thanks Jens, I just changed this bug Component to "PPC-64". Feel free to cc powerpc folks if you know anyone about that, or you might like to send your patch to powerpc maillist to ask them directly. (I can help to ask too, if you would not like to take that time.)
Comment 3 Jens Axboe 2021-01-27 04:53:23 UTC
I actually think the patch should be even simpler, something ala:

diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index 8961b44f350c..5a4d6af04c99 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -417,9 +417,12 @@ static int __do_page_fault(struct pt_regs *regs, unsigned long address,
 	/*
 	 * The kernel should never take an execute fault nor should it
 	 * take a page fault to a kernel address or a page fault to a user
-	 * address outside of dedicated places
+	 * address outside of dedicated places. Use mm to decide if this is
+	 * valid or not, it's perfectly legitimate to have a kernel thread
+	 * take a fault as long as it's performed kthread_use_mm() prior. An
+	 * example of that would be io_uring helper threads.
 	 */
-	if (unlikely(!is_user && bad_kernel_fault(regs, error_code, address, is_write)))
+	if (unlikely(!mm && bad_kernel_fault(regs, error_code, address, is_write)))
 		return SIGSEGV;
 
 	/*
Comment 4 Jens Axboe 2021-01-27 04:54:05 UTC
Feel free to send out a feeler with the patch, would not mind someone else driving this and I have no access to test any of this. Seems like you do, perhaps you can test the patch as well?
Comment 5 Zorro Lang 2021-01-27 05:44:15 UTC
(In reply to Jens Axboe from comment #4)
> Feel free to send out a feeler with the patch, would not mind someone else
> driving this and I have no access to test any of this. Seems like you do,
> perhaps you can test the patch as well?

Yes, I have test env. which can 100% reproduce this issue. Let me check if the latest mainline linux fixed this bug, last time I tested this bug on 5.11-rc4, it's rc5 now. If the bug is still there, I'll send this patch with your name to powerpc list to ask.
Comment 6 Christophe Leroy 2021-01-27 16:34:16 UTC
(In reply to Jens Axboe from comment #1)
> Just noticed this one. Seems to be because the powerpc fault code does
> this:
> 
> if (unlikely(!is_user && bad_kernel_fault(regs, error_code, address,
> is_write)))
>       return SIGSEGV;
> 
> where is_user is based on user_mod(regs) only. This doesn't work for
> io_uring, where the helper thread can assume the user app identity and
> could perform this fault just fine.
> 
> Not sure what the best fix is, would have to ask the powerpc folks.
> Something basic like the below might make it work, but this is really an
> arch question.
> 

I don't understand why testing is_user would be an issue.
KUAP purpose it to block any unallowed access from kernel to user memory (Equivalent to SMAP on x86). So it really must be based on MSR_PR bit, that is what is_user provides.

If the kernel access is legitimate, kernel should have opened userspace access then you shouldn't get this "Bug: Read fault blocked by KUAP!".

As far as I understand, the fault occurs in iov_iter_fault_in_readable() which calls fault_in_pages_readable()
And fault_in_pages_readable() uses __get_user() so it is a legitimate access and you really should get a KUAP fault.

So the problem is somewhere else, I think you proposed patch just hides the problem, it doesn't fix it.

Can you provide your vmlinux binary together with your .config ?
Comment 7 Michael Ellerman 2021-01-29 13:09:35 UTC
I can't reproduce here on a Power9 bare metal machine.

What hardware are you on?
And you're running tests/generic/013 from xfstests from kernel.org?
What defconfig?
Comment 8 Zorro Lang 2021-01-29 18:16:14 UTC
(In reply to Michael Ellerman from comment #7)
> I can't reproduce here on a Power9 bare metal machine.
> 
> What hardware are you on?
> And you're running tests/generic/013 from xfstests from kernel.org?
> What defconfig?

Yes, clone it from:
# git clone git://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git

And I use local.config file as below:
FSTYP=xfs
TEST_DEV=/dev/sda5
TEST_DIR=/mnt/xfstests/test
SCRATCH_DEV=/dev/sda3
SCRATCH_MNT=/mnt/xfstests/scratch
LOGWRITES_DEV=/dev/sda6
MKFS_OPTIONS="-m crc=1,finobt=1,reflink=1,rmapbt=1,bigtime=1,inobtcount=1"
TEST_FS_MOUNT_OPTS=""

I tested on a P9 Virtual machine which base on a P9 bare metal machine. 
System:
Host Hypervisor	PowerVM
Vendor	IBM
Model	9009-42A
Memory	8192 MB
NUMA Nodes	1

CPU:
Vendor	IBM
Model Name	POWER9 (architected), altivec supported
Hyper	True
Arch(s)	ppc64le

Disks:
Model	Size	Logical sector size	Physical sector size
VDASD	107.37 GB / 100.00 GiB	512 bytes	512 bytes

Devices:
Description	                Type	Bus     Driver
Virtual I/O device (vty)	generic	vio	hvc_console
Virtual I/O device (v-scsi)	generic	vio	ibmvscsi

I don't know if you enabled IO_URING?
CONFIG_IO_URING=y
CONFIG_IO_WQ=y

To reproduce this bug, io_uring is necessary. So kernel must support it, and xfstests must be built with liburing-devel to make xfstests/ltp/fsstress to do io_uring test, then generic/013 or generic/051 or other cases which use fsstress will trigger this bug.

Thanks,
Zorro
Comment 9 Michael Ellerman 2021-02-16 09:50:41 UTC
Fixed in 8c511eff1827 ("powerpc/kuap: Allow kernel thread to access userspace after kthread_use_mm")

Note You need to log in before you can comment on or make changes to this bug.