Bug 204819

Summary: KASAN still got problems loading some modules at boot
Product: Platform Specific/Hardware Reporter: Erhard F. (erhard_f)
Component: PPC-32Assignee: platform_ppc-32
Status: RESOLVED INVALID    
Severity: normal CC: christophe.leroy, michael
Priority: P1    
Hardware: PPC-32   
OS: Linux   
See Also: https://bugzilla.kernel.org/show_bug.cgi?id=204479
Kernel Version: 5.3-rc8 Subsystem:
Regression: No Bisected commit-id:
Attachments: dmesg (5.3-rc8, PowerMac G4 DP)
kernel .config (5.3-rc8, PowerMac G4 DP)
dmesg (5.3-rc8, PowerMac G4 DP)
kernel_page_tables (5.3-rc8, PowerMac G4 DP)
objdump sungem (5.3-rc8, PowerMac G4 DP)
Manually fixed kernel_page_tables
dmesg v2 (5.3-rc8 + ptdump patch)
kernel_page_tables v2 (5.3-rc8 + ptdump patch)
objdump firewire-ohci v2 (5.3-rc8 + ptdump patch)
objdump usbcore v2 (5.3-rc8 + ptdump patch)
dmesg v3 (5.3-rc8 + ptdump patch, NO SMP)
kernel_page_tables v3 (5.3-rc8 + ptdump patch, NO SMP)
objdump usbcore v3 (5.3-rc8 + ptdump patch, NO SMP)
objdump ohci-hcd v3 (5.3-rc8 + ptdump patch, NO SMP)
objdump firewire-ohci v3 (5.3-rc8 + ptdump patch, NO SMP)
objdump ehci-pci v3 (5.3-rc8 + ptdump patch, NO SMP)

Description Erhard F. 2019-09-11 22:47:36 UTC
Created attachment 284931 [details]
dmesg (5.3-rc8, PowerMac G4 DP)

I tried finding a minimal kernel .config for bug #204479 to rule out other influences.

What I found was, that KASAN shows different BUG: Unable to handle kernel data access at 0x.......", depending on which module gets loaded/used first, like:

[...]
Sep 12 00:12:09 T600 kernel: BUG: Unable to handle kernel data access at 0xfe205150
Sep 12 00:12:09 T600 kernel: Faulting instruction address: 0xc00261d8
Sep 12 00:12:09 T600 kernel: Oops: Kernel access of bad area, sig: 11 [#1]
Sep 12 00:12:09 T600 kernel: BE PAGE_SIZE=4K MMU=Hash SMP NR_CPUS=2 PowerMac
Sep 12 00:12:09 T600 kernel: Modules linked in: sungem_phy(+) usb_common
Sep 12 00:12:09 T600 kernel: CPU: 0 PID: 117 Comm: systemd-udevd Tainted: G        W         5.3.0-rc8 #3
Sep 12 00:12:09 T600 kernel: NIP:  c00261d8 LR: c02355c4 CTR: c0236c6c
Sep 12 00:12:09 T600 kernel: REGS: eb7d5ae0 TRAP: 0300   Tainted: G        W          (5.3.0-rc8)
Sep 12 00:12:09 T600 kernel: MSR:  00009032 <EE,ME,IR,DR,RI>  CR: 48228828  XER: 20000000
Sep 12 00:12:09 T600 kernel: DAR: fe205150 DSISR: 0a000000 
                             GPR00: c0236cb0 eb7d5b98 ebbd2320 fe205150 00000000 00000007 c01083dc 00000003 
                             GPR08: e991025c 1e205150 00000000 f1013000 c0236c6c 007d6980 eb7d5eb8 eb7d5eb0 
                             GPR16: f10504c0 00000124 f102a880 f103d000 c010426c c144c480 00000011 eb9ef9a0 
                             GPR24: f102a8dc f102a8e0 f1013000 f102a810 0000002d 00000001 0000003c f1028a80 
Sep 12 00:12:09 T600 kernel: NIP [c00261d8] __memset+0x10/0xc0
Sep 12 00:12:09 T600 kernel: LR [c02355c4] kasan_unpoison_shadow+0x20/0x40
Sep 12 00:12:09 T600 kernel: Call Trace:
Sep 12 00:12:09 T600 kernel: [eb7d5b98] [ee802420] 0xee802420 (unreliable)
Sep 12 00:12:09 T600 kernel: [eb7d5ba8] [c0236cb0] __asan_register_globals+0x44/0x68
Sep 12 00:12:09 T600 kernel: [eb7d5bc8] [c01083e8] do_init_module+0x134/0x2c4
Sep 12 00:12:09 T600 kernel: [eb7d5bf8] [c010b1b8] load_module+0x2b60/0x2d1c
Sep 12 00:12:09 T600 kernel: [eb7d5df8] [c010b5d8] sys_finit_module+0x100/0x140
Sep 12 00:12:09 T600 kernel: [eb7d5f38] [c001a274] ret_from_syscall+0x0/0x34
Sep 12 00:12:09 T600 kernel: --- interrupt: c00 at 0x24f2c4
                                 LR = 0x43e35c
Sep 12 00:12:09 T600 kernel: Instruction dump:
Sep 12 00:12:09 T600 kernel: 7c062396 7d4021d6 7d080214 7cca3050 90e30000 91030004 7cc33378 4e800020 
Sep 12 00:12:09 T600 kernel: 28050004 418000a0 5084442e 5084801e <90830000> 4d820020 70600003 7ca02a14 
Sep 12 00:12:09 T600 kernel: ---[ end trace 00769ab21374844d ]---
[...]
This changes depending on which modules I build for the kernel. So I suspect KASAN to still not get the module loading quite right for all modules. At least on my system (PowerMac G4 DP).
Comment 1 Erhard F. 2019-09-11 22:48:21 UTC
Created attachment 284933 [details]
kernel .config (5.3-rc8, PowerMac G4 DP)
Comment 2 Christophe Leroy 2019-09-12 05:09:13 UTC
Could you please provide the content of /sys/kernel/debug/kernel_page_tables together with the dmesg, and also an 'objdump -h' of the failing modules ?
Comment 3 Erhard F. 2019-09-13 15:51:35 UTC
Created attachment 284949 [details]
dmesg (5.3-rc8, PowerMac G4 DP)
Comment 4 Erhard F. 2019-09-13 15:52:05 UTC
Created attachment 284951 [details]
kernel_page_tables (5.3-rc8, PowerMac G4 DP)
Comment 5 Erhard F. 2019-09-13 15:52:42 UTC
Created attachment 284953 [details]
objdump sungem (5.3-rc8, PowerMac G4 DP)
Comment 6 Christophe Leroy 2019-09-14 08:15:32 UTC
Aaarrgghhh ! kernel_page_tables addresses are wrong. Ok, I'll manage.

For next time, please apply https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/arch/powerpc/mm/ptdump?h=next-20190904&id=7c7a532ba3fc51bf9527d191fb410786c1fdc73c
Comment 7 Christophe Leroy 2019-09-14 08:21:40 UTC
Created attachment 284961 [details]
Manually fixed kernel_page_tables
Comment 8 Christophe Leroy 2019-09-14 08:26:21 UTC
You get:

Sep 13 17:43:49 T600 kernel: BUG: Unable to handle kernel data access at 0xfe205150

However this area is properly mapped RW:

0xfe205000-0xfe205fff  0x297d6000         4K        rw       present           dirty  accessed                         

There must be something wrong with hash table update.
Comment 9 Erhard F. 2019-09-14 14:19:13 UTC
I can apply the patch but how do I apply the manually fixed kernel_page_tables? Not quite sure how to proceed.
Comment 10 Christophe Leroy 2019-09-14 16:47:05 UTC
No no, what I mean by 'manually fixed kernel_page_tables' is that I fixed the file you attached so that it has the same values as what it would have had with the patch applied. Don't worry.
Comment 11 Christophe Leroy 2019-09-16 05:50:45 UTC
I have carefully reviewed the flushing calls and have not been able to identify any issue.

Maybe a SMP issue ? Does this problem also happen without CONFIG_SMP ?
Comment 12 Erhard F. 2019-09-16 11:10:53 UTC
Created attachment 284991 [details]
dmesg v2 (5.3-rc8 + ptdump patch)
Comment 13 Erhard F. 2019-09-16 11:11:41 UTC
Created attachment 284993 [details]
kernel_page_tables v2 (5.3-rc8 + ptdump patch)
Comment 14 Erhard F. 2019-09-16 11:12:35 UTC
Created attachment 284995 [details]
objdump firewire-ohci v2 (5.3-rc8 + ptdump patch)
Comment 15 Erhard F. 2019-09-16 11:13:17 UTC
Created attachment 284997 [details]
objdump usbcore v2 (5.3-rc8 + ptdump patch)
Comment 16 Erhard F. 2019-09-16 11:13:50 UTC
Created attachment 284999 [details]
dmesg v3 (5.3-rc8 + ptdump patch, NO SMP)
Comment 17 Erhard F. 2019-09-16 11:14:36 UTC
Created attachment 285001 [details]
kernel_page_tables v3 (5.3-rc8 + ptdump patch, NO SMP)
Comment 18 Erhard F. 2019-09-16 11:15:17 UTC
Created attachment 285003 [details]
objdump usbcore v3 (5.3-rc8 + ptdump patch, NO SMP)
Comment 19 Erhard F. 2019-09-16 11:16:01 UTC
Created attachment 285005 [details]
objdump ohci-hcd v3 (5.3-rc8 + ptdump patch, NO SMP)
Comment 20 Erhard F. 2019-09-16 11:16:27 UTC
Created attachment 285007 [details]
objdump firewire-ohci v3 (5.3-rc8 + ptdump patch, NO SMP)
Comment 21 Erhard F. 2019-09-16 11:16:49 UTC
Created attachment 285009 [details]
objdump ehci-pci v3 (5.3-rc8 + ptdump patch, NO SMP)
Comment 22 Erhard F. 2019-09-16 11:20:58 UTC
(In reply to Christophe Leroy from comment #11)
> I have carefully reviewed the flushing calls and have not been able to
> identify any issue.
> 
> Maybe a SMP issue ? Does this problem also happen without CONFIG_SMP ?
For my 1st (v2) retest I applied the "powerpc/ptdump: Fix addresses display on PPC32" patch you linked on top of -rc8.

For my 2nd (v3) retest I applied the patch and disabled SMP in the .config.

As you can see the issue still persists without SMP. Also different 'problematic' modules popup even without SMP.
Comment 23 Christophe Leroy 2019-09-16 11:32:23 UTC
Thanks. The third one is interesting, as it shows :

Sep 16 12:57:39 T600 kernel: BUG: Unable to handle kernel data access at 0xfe295404

0xfe295000-0xfe295fff  0x00d05000         4K  user  r        present                  accessed                         


So unlike the other cases, this one shows that the area is still read-only and pointing to the zero shadow area.
Wondering whether for the other ones the area is not allocated to a further module loader after the failing one.

Are you able to dump kernel_page_tables just after each module load failure before loading any additional module ?
Comment 25 Erhard F. 2019-09-16 13:39:53 UTC
(In reply to Christophe Leroy from comment #24)
> Do you confirm that the two patches done in the scope of bug #204479 are
> included in your config, especially the first one:
> https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/
> arch/powerpc/mm/kasan/kasan_init_32.
> c?id=663c0c9496a69f80011205ba3194049bcafd681d
> https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/
> arch/powerpc/mm/kasan/kasan_init_32.
> c?id=45ff3c55958542c3b76075d59741297b8cb31cbb
Ahh, now we are getting closer...

No, I did the testing with a vanilla 5.3-rc8 kernel assuming that

powerpc/kasan: Fix shadow area set up for modules
powerpc/kasan: Fix parallel loading of modules

were already in 5.3-rc8. Now I realize I was wrong.

In -rc8 the latest one (2019-07-31) is:

powerpc/kasan: fix early boot failure on PPC32

The other two patches are dated 2019-08-20 and not in 5.3-rc8 or the newly released 5.3.

Sorry for the noise! I will close the bug for now and retest as soon as the 2 patches are in 5.4-rcx or 5.3.x. If it still seems valid that time I will re-open.
Comment 26 Michael Ellerman 2019-09-18 05:54:54 UTC
If you want to test things that are not quite in mainline you can always try the "merge" branch of the powerpc tree. That is a merge of master, next and fixes, so should have any fixes that are in the pipeline.