Bug 216715

Summary: kernel 6.1-rc5 + KASAN_OUTLINE fails to boot at very early stage when DEBUG_PAGEALLOC_ENABLE_DEFAULT is enabled (PowerMac G4 3,6)
Product: Platform Specific/Hardware Reporter: Erhard F. (erhard_f)
Component: PPC-32Assignee: platform_ppc-32
Status: RESOLVED OBSOLETE    
Severity: normal CC: christophe.leroy
Priority: P1    
Hardware: PPC-32   
OS: Linux   
Kernel Version: 6.1-rc5 Subsystem:
Regression: No Bisected commit-id:
Attachments: kernel .config (6.1-rc5, PowerMac G4 DP)
kernel .config (6.1-rc6, PowerMac G4 DP)
kernel dmesg (6.1-rc6, PowerMac G4 DP) - succesful boot
kernel dmesg (6.1-rc6, PowerMac G4 DP) - BUG: KASAN: slab-out-of-bounds
303256: kernel dmesg (6.1-rc6, PowerMac G4 DP) - BUG: KASAN: slab-out-of-bounds, 2nd time
dmesg (6.4-rc4, PowerMac G4 DP) - succesful boot
kernel .config (6.4-rc3, PowerMac G4 DP)

Description Erhard F. 2022-11-21 00:19:04 UTC
Created attachment 303245 [details]
kernel .config (6.1-rc5, PowerMac G4 DP)

Kernel 6.1-rc5 fails to boot on my PowerMac G4 3,6 at a very early stage (OpenFirmware console, white background) when built with CONFIG_KASAN_OUTLINE=y and CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT=y.

Same kernel .config without CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT just boots fine, even when Outline KASAN is enabled and CONFIG_DEBUG_PAGEALLOC=y is set. Only CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT seems to provoke the issue.
Comment 1 Christophe Leroy 2022-11-21 08:19:26 UTC
I have no problem booting 6.1-rc5 on QEMU with your config.

What happens if you remove CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT and instead pass debug_pagealloc=on as parameter to the kernel ?
Comment 2 Erhard F. 2022-11-22 00:12:17 UTC
Created attachment 303254 [details]
kernel .config (6.1-rc6, PowerMac G4 DP)

Somehow I had a hard time reproducing this...

Finally it turned out that this is not 100% deterministic. I'll attach my new -rc6 .config which provokes the early boot failure best.

The same kernel can run into this early boot issue, or it can fail boot later with "BUG: KASAN: slab-out-of-bounds in filemap_map_pages+0x2c8/0x5d0" or it can boot up just fine.

I had the impression the early boot issue is provoked easier when CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT=y is set or debug_pagealloc=on is set as boot parameter as you suggested. Also it seems to be provoked easier when CONFIG_NETCONSOLE=y is used and properly configured (perhaps also linked to my yet-to-be-posted netconsole bug?).
Comment 3 Erhard F. 2022-11-22 00:14:31 UTC
Created attachment 303255 [details]
kernel dmesg (6.1-rc6, PowerMac G4 DP) - succesful boot
Comment 4 Erhard F. 2022-11-22 00:19:36 UTC
Created attachment 303256 [details]
kernel dmesg (6.1-rc6, PowerMac G4 DP) - BUG: KASAN: slab-out-of-bounds

Rarely but sometimes boot fails due to a KASAN error like this:

[...]
==================================================================
BUG: KASAN: slab-out-of-bounds in filemap_map_pages+0x2c8/0x5d0
Read of size 4 at addr c336c170 by task systemd/112

CPU: 1 PID: 112 Comm: systemd Tainted: G                TN 6.1.0-rc6-PMacG4 #30
Hardware name: PowerMac3,6 7455 0x80010303 PowerMac
Call Trace:
[f1973c60] [c0d3b790] dump_stack_lvl+0x60/0xa4 (unreliable)
[f1973c80] [c033d124] print_report+0x154/0x4f4
[f1973cd0] [c033ce40] kasan_report+0xe8/0x218
[f1973d40] [c0284234] filemap_map_pages+0x2c8/0x5d0
[f1973e00] [c02dfb54] handle_mm_fault+0xb58/0x10bc
[f1973ed0] [c002b994] ___do_page_fault+0x228/0x814
[f1973f10] [c002c1d4] do_page_fault+0x28/0x6c
[f1973f30] [c00044a8] InstructionAccess_virt+0x114/0x118
--- interrupt: 400 at 0xa7452df4
NIP:  a7452df4 LR: a7452974 CTR: 00000000
REGS: f1973f40 TRAP: 0400   Tainted: G                TN  (6.1.0-rc6-PMacG4)
MSR:  4000d032 <EE,PR,ME,IR,DR,RI>  CR: 480042e0  XER: 00000000

GPR00: 00000000 afdd8cf0 a7b4a5a0 00000000 a7452974 480042e0 00000000 a7452df4 
GPR08: 0000d032 400c13f8 400c1509 400c161a 400c12e7 00b2fc1c 00000000 00000000 
GPR16: 00000000 00000000 afdd8f20 afdd8e98 00000000 00000001 00000000 00000000 
GPR24: 00000000 02084ffc 00000000 afdd8f1c 02084b80 02084810 a757fff4 00000001 
NIP [a7452df4] 0xa7452df4
LR [a7452974] 0xa7452974
--- interrupt: 400

Allocated by task 1:
 kasan_set_track+0x44/0x94
 __kasan_slab_alloc+0xa0/0xe8
 kmem_cache_alloc+0x224/0x684
 __kernfs_new_node+0xe8/0x354
 kernfs_new_node+0x84/0xfc
 __kernfs_create_file+0x50/0x204
 sysfs_add_file_mode_ns+0xf4/0x1f0
 internal_create_group+0x1f0/0x620
 sysfs_slab_add+0x23c/0x2dc
 __kmem_cache_create+0x1bc/0x584
 kmem_cache_create_usercopy+0x250/0x39c
 btrfs_init_cachep+0x48/0x1e0
 init_btrfs_fs+0x50/0x2b0
 do_one_initcall+0xc0/0x34c
 kernel_init_freeable+0x2c0/0x400
 kernel_init+0x28/0x178
 ret_from_kernel_thread+0x5c/0x64

The buggy address belongs to the object at c336c118
 which belongs to the cache kernfs_node_cache of size 88
The buggy address is located 0 bytes to the right of
 88-byte region [c336c118, c336c170)

The buggy address belongs to the physical page:
page:eee4bb30 refcount:1 mapcount:0 mapping:00000000 index:0x0 pfn:0x336c
flags: 0x200(slab|zone=0)
raw: 00000200 00000100 00000122 c1852720 00000000 001e003c ffffffff 00000001
raw: 00000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 c336c000: fc 00 00 00 00 00 00 00 00 00 00 00 fc fc fc fc
 c336c080: fc fc 00 00 00 00 00 00 00 00 00 00 00 fc fc fc
>c336c100: fc fc fc 00 00 00 00 00 00 00 00 00 00 00 fc fc
                                                     ^
 c336c180: fc fc fc fc 00 00 00 00 00 00 00 00 00 00 00 fc
 c336c200: fc fc fc fc fc 00 00 00 00 00 00 00 00 00 00 00
==================================================================
Disabling lock debugging due to kernel taint
get_swap_device: Bad swap file entry 00000010
get_swap_device: Bad swap file entry 00000010
get_swap_device: Bad swap file entry 00000010
get_swap_device: Bad swap file entry 00000010
get_swap_device: Bad swap file entry 00000010
get_swap_device: Bad swap file entry 00000010
systemd[1]: Mounting FUSE Control File System...
get_swap_device: Bad swap file entry 00000010
get_swap_device: Bad swap file entry 00000010
get_swap_device: Bad swap file entry 00000010
get_swap_device: Bad swap file entry 00000010
** 257 printk messages dropped **
get_swap_device: Bad swap file entry 00000010
get_swap_device: Bad swap file entry 00000010
get_swap_device: Bad swap file entry 00000010
systemd-journald[106]: /dev/kmsg buffer overrun, some messages lost.
get_swap_device: Bad swap file entry 00000010
[...]


After the 1st time this happened I checked swap to be sure it's not a disk errors fault. But the swap partition showed no sign of error:

 # swapoff /dev/sdb6 && mkswap -f -L swap --verbose -c /dev/sdb6
0 beschädigte Seiten
mkswap: /dev/sdb6: Warnung: Alte swap-Signatur wird gelöscht.
Auslagerungsbereich Version 1 wird angelegt, Größe = 8 GiB (8589930496 Bytes)
LABEL=swap, UUID=aa150610-5987-492c-b9db-d3ea3f34a6be
Comment 5 Erhard F. 2022-11-22 00:34:38 UTC
Created attachment 303257 [details]
303256: kernel dmesg (6.1-rc6, PowerMac G4 DP) - BUG: KASAN: slab-out-of-bounds, 2nd time

Here a slightly different kernel is used with SMP off. KASAN hit happened instantly after trying to build glibc:

[...]
BUG: KASAN: slab-out-of-bounds in set_pte_at+0x9c/0x16c
Read of size 4 at addr c322519c by task emerge/296

CPU: 0 PID: 296 Comm: emerge Tainted: G                TN 6.1.0-rc6-PMacG4s #31
Hardware name: PowerMac3,6 7455 0x80010303 PowerMac
Call Trace:
[f2aeb760] [c0ca8ecc] dump_stack_lvl+0x34/0x74 (unreliable)
[f2aeb780] [c02d2800] print_report+0x154/0x4f4
[f2aeb7d0] [c02d2508] kasan_report+0xec/0x230
[f2aeb830] [c002657c] set_pte_at+0x9c/0x16c
[f2aeb860] [c02744a8] copy_page_range+0x4ec/0xee0
[f2aeb990] [c005608c] dup_mm+0x70c/0x8ac
[f2aebab0] [c0057eec] copy_process+0x19b0/0x2c10
[f2aebba0] [c0059424] kernel_clone+0xd0/0x4c4
[f2aebc40] [c0059efc] sys_clone+0xfc/0x154
[f2aebd10] [c0017c60] system_call_exception+0x104/0x1ac
[f2aebf30] [c001f1ac] ret_from_syscall+0x0/0x2c
--- interrupt: c00 at 0xa7572df4
NIP:  a7572df4 LR: a7572974 CTR: a7747ee8
REGS: f2aebf40 TRAP: 0c00   Tainted: G                TN  (6.1.0-rc6-PMacG4s)
MSR:  0200f932 <VEC,EE,PR,FP,ME,IR,DR,RI>  CR: 48882244  XER: 00000000

GPR00: 00000078 af879140 a7c2d5a0 01200011 00000000 00000000 00000000 a7c26088 
GPR08: 00000000 00000002 a6e31870 a781c724 48882242 008bfff4 00000000 00a0243c 
GPR16: 00000000 016b591e 016b5910 af879304 31afa123 00000001 00000000 a7c284fc 
GPR24: 00000000 00000000 a58ddc88 a7ae3128 a7bc1f08 a6ee0118 a769fff4 00000001 
NIP [a7572df4] 0xa7572df4
LR [a7572974] 0xa7572974
--- interrupt: c00

Allocated by task 1:
 kasan_set_track+0x44/0x94
 __kasan_slab_alloc+0xa0/0xe8
 kmem_cache_alloc+0x1e8/0x664
 __kernfs_new_node+0xe8/0x354
 kernfs_new_node+0x84/0xfc
 __kernfs_create_file+0x50/0x204
 sysfs_add_file_mode_ns+0xf4/0x1f0
 internal_create_group+0x1f0/0x620
 sysfs_slab_add+0x23c/0x2dc
 __kmem_cache_create+0x14c/0x510
 kmem_cache_create_usercopy+0x250/0x39c
 btrfs_init_cachep+0x48/0x1e8
 init_btrfs_fs+0x50/0x2b0
 do_one_initcall+0xc0/0x34c
 kernel_init_freeable+0x2a0/0x3e0
 kernel_init+0x28/0x174
 ret_from_kernel_thread+0x5c/0x64

The buggy address belongs to the object at c32251a0
 which belongs to the cache kernfs_node_cache of size 88
The buggy address is located 4 bytes to the left of
 88-byte region [c32251a0, c32251f8)

The buggy address belongs to the physical page:
page:eee50d34 refcount:1 mapcount:0 mapping:00000000 index:0x0 pfn:0x3225
flags: 0x200(slab|zone=0)
raw: 00000200 00000100 00000122 c1843d20 00000000 001e003c ffffffff 00000001
raw: 00000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 c3225080: fc fc 00 00 00 00 00 00 00 00 00 00 00 fc fc fc
 c3225100: fc fc fc 00 00 00 00 00 00 00 00 00 00 00 fc fc
>c3225180: fc fc fc fc 00 00 00 00 00 00 00 00 00 00 00 fc
                    ^
 c3225200: fc fc fc fc fc 00 00 00 00 00 00 00 00 00 00 00
 c3225280: fc fc fc fc fc fc 00 00 00 00 00 00 00 00 00 00
==================================================================
Disabling lock debugging due to kernel taint
_swap_info_get: Bad swap file entry 10005281
BUG: Bad page map in process emerge  pte:00528124 pmd:03225000
addr:af85a000 vm_flags:00100173 anon_vma:c8fb3bd8 mapping:00000000 index:affdf
file:(null) fault:0x0 mmap:0x0 read_folio:0x0
CPU: 0 PID: 300 Comm: emerge Tainted: G    B           TN 6.1.0-rc6-PMacG4s #31
Hardware name: PowerMac3,6 7455 0x80010303 PowerMac
Call Trace:
[f2f838c0] [c0ca8ecc] dump_stack_lvl+0x34/0x74 (unreliable)
[f2f838e0] [c02717e8] print_bad_pte+0x2e8/0x364
[f2f83970] [c027579c] unmap_page_range+0x900/0xa30
[f2f83a30] [c027607c] unmap_vmas+0x1d8/0x2cc
[f2f83b30] [c0283a68] exit_mmap+0x154/0x2f0
[f2f83c50] [c00542e8] mmput+0x98/0x244
[f2f83c80] [c005f784] do_exit+0x434/0xdc0
[f2f83d00] [c0060318] do_group_exit+0x64/0x100
[f2f83d30] [c00603e4] __wake_up_parent+0x0/0x4c
[f2f83d50] [c0017c60] system_call_exception+0x104/0x1ac
[f2f83f30] [c001f1ac] ret_from_syscall+0x0/0x2c
--- interrupt: c00 at 0xa7572ec0
[...]
Comment 6 Christophe Leroy 2023-05-19 18:49:00 UTC
Would be nice to give it a new try with KCSAN enabled.

To get KCSAN on powerpc/32, apply following series: https://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=354731
Comment 7 Erhard F. 2023-05-23 22:34:53 UTC
Created attachment 304310 [details]
dmesg (6.4-rc4, PowerMac G4 DP) - succesful boot

KASAN and KCSAN can't be enabled at the same time as it seems. If I select KASAN, KCSAN is automatically disabled.

But in this case that does not seem to be a problem. Kernel 6.4-rc4 boots fine with KASAN_OUTLINE and DEBUG_PAGEALLOC_ENABLE_DEFAULT enabled. Tested it about a dozen times to be sure.

I'll close here when kernel 6.4.0 is released and this issue is still gone for good.
Comment 8 Erhard F. 2023-05-23 22:35:28 UTC
Created attachment 304311 [details]
kernel .config (6.4-rc3, PowerMac G4 DP)
Comment 9 Erhard F. 2024-04-19 08:45:34 UTC
Cannot reproduce on kernels v6.8.x and current v6.9-rc4, outline + inline KASAN work as intended.

So closing here.