Bug 211537
Description
Erhard F.
2021-02-03 01:40:13 UTC
Created attachment 295045 [details]
kernel dmesg (5.10.12, eMachines E620 - successful boot)
Created attachment 295047 [details]
kernel .config (5.10.12, eMachines E620)
Created attachment 295235 [details]
kernel dmesg (5.11-rc7, eMachines E620)
Found out that this does not happen at all on kernel 5.4.x series. Took some time, but I was able to bisect the issue. 'bad' bisects showed up at the 1st boot most of the time, sometimes at the 3rd or 4th boot. I rated a bisect 'good' if the machine booted to desktop 20 times in a row with no issues (or issues not connected to this bug). The bisect revealed the following commit: 3202fa62fb43087387c65bfa9c100feffac74aa6 is the first bad commit commit 3202fa62fb43087387c65bfa9c100feffac74aa6 Author: Kees Cook <keescook@chromium.org> Date: Wed Apr 1 21:04:27 2020 -0700 slub: relocate freelist pointer to middle of object In a recent discussion[1] with Vitaly Nikolenko and Silvio Cesare, it became clear that moving the freelist pointer away from the edge of allocations would likely improve the overall defensive posture of the inline freelist pointer. My benchmarks show no meaningful change to performance (they seem to show it being faster), so this looks like a reasonable change to make. Instead of having the freelist pointer at the very beginning of an allocation (offset 0) or at the very end of an allocation (effectively offset -sizeof(void *) from the next allocation), move it away from the edges of the allocation and into the middle. This provides some protection against small-sized neighboring overflows (or underflows), for which the freelist pointer is commonly the target. (Large or well controlled overwrites are much more likely to attack live object contents, instead of attempting freelist corruption.) The vaunted kernel build benchmark, across 5 runs. Before: Mean: 250.05 Std Dev: 1.85 and after, which appears mysteriously faster: Mean: 247.13 Std Dev: 0.76 Attempts at running "sysbench --test=memory" show the change to be well in the noise (sysbench seems to be pretty unstable here -- it's not really measuring allocation). Hackbench is more allocation-heavy, and while the std dev is above the difference, it looks like may manifest as an improvement as well: 20 runs of "hackbench -g 20 -l 1000", before: Mean: 36.322 Std Dev: 0.577 and after: Mean: 36.056 Std Dev: 0.598 [1] https://twitter.com/vnik5287/status/1235113523098685440 Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Christoph Lameter <cl@linux.com> Cc: Vitaly Nikolenko <vnik@duasynt.com> Cc: Silvio Cesare <silvio.cesare@gmail.com> Cc: Christoph Lameter <cl@linux.com>Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Link: http://lkml.kernel.org/r/202003051624.AAAC9AECC@keescook Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> mm/slub.c | 7 +++++++ 1 file changed, 7 insertions(+) Created attachment 296499 [details]
bisect.log
Created attachment 296501 [details]
kernel .config (final git bisect, eMachines E620)
Interesting! This looks like some kind of use-after-free problem that is corrupting the freelist in seemingly "unallocated" memory. There have been similar issues detected by this commit: https://bugzilla.kernel.org/show_bug.cgi?id=207383 https://lore.kernel.org/lkml/YIGsJ9LlFquvBI2iWPKhJwjKBwDUr_C-38oVpLJJHJ5rDCY_Zrrv392o6UPNxHoeQrcpLYC9U4fZdpD9ilz6Amg2IxkSexGLQMCQIBek8rc=@protonmail.com/ Given the radeon bit, I wonder if there is another issue with GPU drivers, but if it happens at boot, maybe it's related to ath5k. Created attachment 296539 [details] kernel dmesg (5.12, eMachines E620) (In reply to Kees Cook from comment #7) > Given the radeon bit, I wonder if there is another issue with GPU drivers, > but if it happens at boot, maybe it's related to ath5k. Thanks for the hint! I gave kernel v5.12 + debugging options + KASAN a try and now I am getting this: [...] [drm] radeon kernel modesetting enabled. [drm] initializing kernel modesetting (RS690 0x1002:0x791F 0x107B:0x0184 0x00). resource sanity check: requesting [mem 0x000c0000-0x000dffff], which spans more than PCI Bus 0000:00 [mem 0x000ce000-0x000cffff window] caller pci_map_rom+0x13f/0x284 mapping multiple BARs ATOM BIOS: ATI radeon 0000:01:05.0: VRAM: 256M 0x00000000BE000000 - 0x00000000CDFFFFFF (256M used) radeon 0000:01:05.0: GTT: 512M 0x0000000080000000 - 0x000000009FFFFFFF [drm] Detected VRAM RAM=256M, BAR=256M [drm] RAM width 128bits DDR [TTM] Zone kernel: Available graphics memory: 1622862 KiB [drm] radeon: 256M of VRAM memory ready [drm] radeon: 512M of GTT memory ready. [drm] GART: num cpu pages 131072, num gpu pages 131072 ================================================================== BUG: KASAN: slab-out-of-bounds in radeon_atombios_parse_power_table_1_3+0x1837/0x1998 [radeon] Write of size 4 at addr ffff88810178d858 by task systemd-udevd/157 CPU: 0 PID: 157 Comm: systemd-udevd Not tainted 5.12.0-E620 #50 Hardware name: eMachines eMachines E620 /Nile , BIOS V1.03 09/30/2008 Call Trace: dump_stack+0xa5/0xe6 print_address_description.constprop.0+0x18/0x239 ? radeon_atombios_parse_power_table_1_3+0x1837/0x1998 [radeon] kasan_report+0x170/0x1a8 ? radeon_atombios_parse_power_table_1_3+0x1837/0x1998 [radeon] radeon_atombios_parse_power_table_1_3+0x1837/0x1998 [radeon] ? mark_held_locks+0xa4/0xbb ? test_bit+0x23/0x2e ? hlock_class+0x42/0xf5 ? mark_lock+0x172/0x662 ? generic_exec_single+0x166/0x1a9 ? radeon_atombios_lookup_gpio+0x26d/0x26d [radeon] ? print_usage_bug+0x320/0x320 ? cpumask_next_and+0x18/0x19 ? generic_exec_single+0x1a9/0x1a9 ? __phys_addr_nodebug+0x7a/0x7a ? __cpa_flush_tlb+0x60/0x60 ? atom_parse_data_header+0xf7/0x26d [radeon] ? lookup_address_in_pgd+0x191/0x1ba radeon_atombios_get_power_modes+0x144/0x1888 [radeon] ? __change_page_attr_set_clr+0xf6b/0xf6b ? test_bit+0x23/0x2e ? radeon_atombios_parse_pplib_non_clock_info+0x9d3/0x9d3 [radeon] ? set_page_memtype+0x81/0xa4 radeon_pm_init+0x1019/0x1904 [radeon] ? set_memory_uc+0x3d/0x62 rs690_init+0x76e/0x84a [radeon] radeon_device_init+0x1c1a/0x21e5 [radeon] ? __kasan_kmalloc_large+0x7c/0x83 radeon_driver_load_kms+0xf5/0x30b [radeon] drm_dev_register+0x255/0x4a0 [drm] radeon_pci_probe+0x246/0x2f6 [radeon] pci_device_probe+0x1aa/0x294 really_probe+0x30e/0x850 ? pm_runtime_put_noidle+0x19/0x5b driver_probe_device+0xe6/0x135 ? device_driver_attach+0xf8/0xf8 device_driver_attach+0xc1/0xf8 __driver_attach+0x13f/0x146 bus_for_each_dev+0xfa/0x146 ? next_device+0x39/0x39 bus_add_driver+0x2b3/0x447 driver_register+0x242/0x2c1 ? 0xffffffffc0b80000 do_one_initcall+0x149/0x2fd ? test_bit.constprop.0+0x2a/0x2a ? slab_post_alloc_hook+0x4a/0x186 ? trace_kmalloc+0x9a/0xc7 ? kasan_unpoison+0x25/0x31 do_init_module+0x1ae/0x573 load_module+0x4dee/0x5cca ? rcu_read_lock_sched_held+0x73/0xc8 ? module_frob_arch_sections+0x6/0x6 ? kernel_read_file+0x25c/0x3fd ? __vmalloc_node+0xa4/0xb1 ? kernel_read_file+0x38b/0x3fd ? __ia32_sys_fsconfig+0xd3/0xd3 ? kernel_read_file_from_fd+0x57/0x74 __do_sys_finit_module+0xf1/0x140 ? __do_sys_init_module+0x184/0x184 ? account_locked_vm+0x75/0x75 ? rcu_read_lock_held_common+0x7e/0x7e do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7f21b2959e79 Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d bf 0f 0c 00 f7 d8 64 89 01 48 RSP: 002b:00007ffd14d5a388 EFLAGS: 00000246 ORIG_RAX: 0000000000000139 RAX: ffffffffffffffda RBX: 0000563b414190f0 RCX: 00007f21b2959e79 RDX: 0000000000000000 RSI: 0000563b41427360 RDI: 000000000000001c RBP: 0000000000020000 R08: 0000000000000000 R09: 0000563b4141a2c0 R10: 000000000000001c R11: 0000000000000246 R12: 0000563b41427360 R13: 0000000000000000 R14: 0000563b414191e0 R15: 0000563b414190f0 Allocated by task 135: kasan_save_stack+0x1b/0x3c kasan_set_track+0x1c/0x21 ____kasan_kmalloc+0x68/0x72 kobject_get_path+0x9a/0x151 kobject_uevent_env+0x202/0x9e8 kobject_synth_uevent+0x208/0x50e uevent_store+0x15/0x35 kernfs_fop_write_iter+0x2b3/0x35d new_sync_write+0x143/0x1d0 vfs_write+0x191/0x240 ksys_write+0x107/0x192 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xae The buggy address belongs to the object at ffff88810178d6c8 which belongs to the cache kmalloc-96 of size 96 The buggy address is located 304 bytes to the right of 96-byte region [ffff88810178d6c8, ffff88810178d728) The buggy address belongs to the page: page:000000009ba3a72f refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff88810178d348 pfn:0x10178c head:000000009ba3a72f order:1 compound_mapcount:0 flags: 0x8000000000010200(slab|head) raw: 8000000000010200 ffffea0004056688 ffff888100040e08 ffff888100042f40 raw: ffff88810178d348 0000000000120005 00000001ffffffff 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff88810178d700: fb fb fb fb fb fc fc fc fc fc fc fc fc fc fc fc ffff88810178d780: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc >ffff88810178d800: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ^ ffff88810178d880: fc 00 00 00 00 00 00 00 00 00 00 00 00 fc fc fc ffff88810178d900: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ================================================================== Disabling lock debugging due to kernel taint [...] Don't know about ATH5K but DRM_RADEON sure seems to play an important role. I get this KASAN hit 100% of the time at boot. Also it does not matter whether ATH5K is built or not. If built as a module it does get loaded later than DRM_RADEON. Created attachment 296541 [details]
kernel .config (5.12, eMachines E620)
To be sure I disabled DRM_RADEON while enabling ATH5K. The machine boots fine like that without any stacktrace or KASAN hit (tried about 10 times). So it's DRM_RADEON after all I guess, somehow connected to the commit I bisected. Alex needs to take a look at this, but of hand I would say that looks like a combination of on old VBIOS and a bug in the VBIOS parser. Maybe try to update the VBIOS as well. (In reply to Christian König from comment #11) > Maybe try to update the VBIOS as well. The machine is an Acer eMachines E620 laptop. System BIOS is already latest (V1.03) which is available for that machine. The IGP is an Radeon XPress 1200. Where would I get a proper VBIOS to update? # lspci -vv -s 01:05.0 01:05.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] RS690M [Radeon Xpress 1200/1250/1270] (prog-if 00 [VGA controller]) Subsystem: Gateway, Inc. RS690M [Radeon Xpress 1200/1250/1270] Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx- Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- Latency: 64, Cache Line Size: 32 bytes Interrupt: pin A routed to IRQ 255 Region 0: Memory at d0000000 (64-bit, prefetchable) [size=256M] Region 2: Memory at f0100000 (64-bit, non-prefetchable) [size=64K] Region 4: I/O ports at 9400 [size=256] Region 5: Memory at f0000000 (32-bit, non-prefetchable) [size=1M] Expansion ROM at 000c0000 [virtual] [disabled] [size=128K] Capabilities: [50] Power Management version 2 Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-) Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME- Capabilities: [80] MSI: Enable- Count=1/1 Maskable- 64bit+ Address: 0000000000000000 Data: 0000 Kernel modules: radeon BIOS shows current VBIOS as: V010.055.000.049.029773 This appears to be near drivers/gpu/drm/radeon/radeon_atombios.c line 2248, and I see halfway through it stops using "state_index - 1", so I bet this is it: rdev->pm.power_state[state_index].flags &= ~RADEON_PM_STATE_SINGLE_DISPLAY_ONLY; See if this patch fixes your problem... Created attachment 296589 [details]
attempt to fix off-by-one index in default power state
Comment on attachment 296589 [details]
attempt to fix off-by-one index in default power state
Oops, typo.
Created attachment 296591 [details]
attempt to fix off-by-one index in default power state
(In reply to Kees Cook from comment #14) > This appears to be near drivers/gpu/drm/radeon/radeon_atombios.c line 2248, > and I see halfway through it stops using "state_index - 1", so I bet this is > it: > > > rdev->pm.power_state[state_index].flags &= > ~RADEON_PM_STATE_SINGLE_DISPLAY_ONLY; > > See if this patch fixes your problem... Yes, it did! Thanks! I applied your 2 patches you posted later on dri-devel: [PATCH 1/2] drm/radeon: Fix off-by-one power_state index heap overwrite [PATCH 2/2] drm/radeon: Avoid power table parsing memory leaks The machine boots now on the patched kernel 5.12.0 without provoking this KASAN hit. Also no other issues at boot. Did a few reboots to test it, looks good so far... However having carefully read your patches I also enabled kemleak. ;) Which shows me this: unreferenced object 0xffff888102622e08 (size 96): comm "systemd-udevd", pid 152, jiffies 4294895220 (age 792.274s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 40 26 92 04 80 88 ff ff ........@&...... 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<0000000075dbd345>] __kmalloc+0x176/0x1e7 [<00000000806812e2>] radeon_atombios_parse_power_table_1_3+0x43f/0x1b1f [radeon] [<000000006e896262>] radeon_atombios_get_power_modes+0x144/0x1888 [radeon] [<00000000edc0acbb>] radeon_pm_init+0x1019/0x1904 [radeon] [<00000000f21bf64c>] rs690_init+0x76e/0x84a [radeon] [<000000001e47a877>] radeon_device_init+0x1c1a/0x21e5 [radeon] [<00000000f60f93e0>] radeon_driver_load_kms+0xf5/0x30b [radeon] [<00000000bccd090f>] drm_dev_register+0x255/0x4a0 [drm] [<0000000050f51b30>] radeon_pci_probe+0x246/0x2f6 [radeon] [<00000000b624f961>] pci_device_probe+0x1aa/0x294 [<000000001686585e>] really_probe+0x30e/0x850 [<000000009ecb9009>] driver_probe_device+0xe6/0x135 [<0000000099ebf8a7>] device_driver_attach+0xc1/0xf8 [<00000000d735bfe4>] __driver_attach+0x13f/0x146 [<0000000050bac4e1>] bus_for_each_dev+0xfa/0x146 [<0000000001aea795>] bus_add_driver+0x2b3/0x447 unreferenced object 0xffff888004922640 (size 64): comm "systemd-udevd", pid 152, jiffies 4294895220 (age 792.274s) hex dump (first 32 bytes): 00 00 00 00 40 9c 00 00 00 00 00 00 00 00 00 00 ....@........... 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<000000000736e960>] kmem_cache_alloc_trace+0x122/0x171 [<00000000827c1b4e>] radeon_atombios_parse_power_table_1_3+0x546/0x1b1f [radeon] [<000000006e896262>] radeon_atombios_get_power_modes+0x144/0x1888 [radeon] [<00000000edc0acbb>] radeon_pm_init+0x1019/0x1904 [radeon] [<00000000f21bf64c>] rs690_init+0x76e/0x84a [radeon] [<000000001e47a877>] radeon_device_init+0x1c1a/0x21e5 [radeon] [<00000000f60f93e0>] radeon_driver_load_kms+0xf5/0x30b [radeon] [<00000000bccd090f>] drm_dev_register+0x255/0x4a0 [drm] [<0000000050f51b30>] radeon_pci_probe+0x246/0x2f6 [radeon] [<00000000b624f961>] pci_device_probe+0x1aa/0x294 [<000000001686585e>] really_probe+0x30e/0x850 [<000000009ecb9009>] driver_probe_device+0xe6/0x135 [<0000000099ebf8a7>] device_driver_attach+0xc1/0xf8 [<00000000d735bfe4>] __driver_attach+0x13f/0x146 [<0000000050bac4e1>] bus_for_each_dev+0xfa/0x146 [<0000000001aea795>] bus_add_driver+0x2b3/0x447 Created attachment 296597 [details]
kmemleak output (5.12 + patches, eMachines E620)
Ah, just to make sure I understand: you saw this kmem leak report when running with the two patches I sent to LKML? (In reply to Kees Cook from comment #20) > Ah, just to make sure I understand: you saw this kmem leak report when > running with the two patches I sent to LKML? Yes, exactly. Created attachment 296709 [details]
kernel dmesg (5.13-rc1, eMachines E620)
Created attachment 296711 [details]
kmemleak output (5.13-rc1, eMachines E620)
Looks like the patch already went into v5.13-rc1, thanks!
With v5.13-rc1 only the memory leak remains:
unreferenced object 0xffff888102a316c8 (size 96):
comm "systemd-udevd", pid 144, jiffies 4294892558 (age 1127.210s)
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 40 82 55 08 80 88 ff ff ........@.U.....
01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace:
[<00000000be239a59>] __kmalloc+0x163/0x1d4
[<00000000bc8a83b6>] radeon_atombios_parse_power_table_1_3+0x43f/0x1b1f [radeon]
[<000000004cdc8d73>] radeon_atombios_get_power_modes+0x144/0x1888 [radeon]
[<000000000a0a16d2>] radeon_pm_init+0xffd/0x18cc [radeon]
[<00000000f37d806c>] rs690_init+0x76e/0x84a [radeon]
[<00000000933cf3a9>] radeon_device_init+0x1c1a/0x21c1 [radeon]
[<00000000f00e9a19>] radeon_driver_load_kms+0xf5/0x30b [radeon]
[<00000000a9fe0c11>] drm_dev_register+0x255/0x4a0 [drm]
[<000000005015ac7d>] radeon_pci_probe+0x246/0x2f6 [radeon]
[<0000000018efccda>] pci_device_probe+0x1aa/0x294
[<00000000350850d3>] really_probe+0x30e/0x850
[<0000000083391671>] driver_probe_device+0x112/0x161
[<000000004d1e121b>] device_driver_attach+0xc1/0xf8
[<00000000ad7d8e9a>] __driver_attach+0x16b/0x172
[<0000000078f0f9cf>] bus_for_each_dev+0xfa/0x146
[<000000002f8d374c>] bus_add_driver+0x2b3/0x447
unreferenced object 0xffff888008558240 (size 64):
comm "systemd-udevd", pid 144, jiffies 4294892558 (age 1127.210s)
hex dump (first 32 bytes):
00 00 00 00 40 9c 00 00 00 00 00 00 00 00 00 00 ....@...........
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace:
[<0000000094907ef6>] kmem_cache_alloc_trace+0x10f/0x15e
[<000000000667fd5a>] radeon_atombios_parse_power_table_1_3+0x546/0x1b1f [radeon]
[<000000004cdc8d73>] radeon_atombios_get_power_modes+0x144/0x1888 [radeon]
[<000000000a0a16d2>] radeon_pm_init+0xffd/0x18cc [radeon]
[<00000000f37d806c>] rs690_init+0x76e/0x84a [radeon]
[<00000000933cf3a9>] radeon_device_init+0x1c1a/0x21c1 [radeon]
[<00000000f00e9a19>] radeon_driver_load_kms+0xf5/0x30b [radeon]
[<00000000a9fe0c11>] drm_dev_register+0x255/0x4a0 [drm]
[<000000005015ac7d>] radeon_pci_probe+0x246/0x2f6 [radeon]
[<0000000018efccda>] pci_device_probe+0x1aa/0x294
[<00000000350850d3>] really_probe+0x30e/0x850
[<0000000083391671>] driver_probe_device+0x112/0x161
[<000000004d1e121b>] device_driver_attach+0xc1/0xf8
[<00000000ad7d8e9a>] __driver_attach+0x16b/0x172
[<0000000078f0f9cf>] bus_for_each_dev+0xfa/0x146
[<000000002f8d374c>] bus_add_driver+0x2b3/0x447
As the original problem got fixed by Kees and the fix already went to affected stable and LTS kernels I'll close here. Thanks again! I'll open a new bug for the remaining memory leak and link here. |