Bug 211537

Summary: [bisected] invalid opcode: 0000 [#1] SMP NOPTI, RIP: 0010:kfree+0x115/0x230
Product: Memory Management Reporter: Erhard F. (erhard_f)
Component: OtherAssignee: Andrew Morton (akpm)
Status: RESOLVED CODE_FIX    
Severity: normal CC: christian.koenig, kees, kees
Priority: P1    
Hardware: x86-64   
OS: Linux   
See Also: https://bugzilla.kernel.org/show_bug.cgi?id=211707
https://bugzilla.kernel.org/show_bug.cgi?id=213373
Kernel Version: 5.10.12 Subsystem:
Regression: No Bisected commit-id:
Attachments: kernel dmesg (5.10.12, eMachines E620)
kernel dmesg (5.10.12, eMachines E620 - successful boot)
kernel .config (5.10.12, eMachines E620)
kernel dmesg (5.11-rc7, eMachines E620)
bisect.log
kernel .config (final git bisect, eMachines E620)
kernel dmesg (5.12, eMachines E620)
kernel .config (5.12, eMachines E620)
attempt to fix off-by-one index in default power state
attempt to fix off-by-one index in default power state
kmemleak output (5.12 + patches, eMachines E620)
kernel dmesg (5.13-rc1, eMachines E620)
kmemleak output (5.13-rc1, eMachines E620)

Description Erhard F. 2021-02-03 01:40:13 UTC
Created attachment 295043 [details]
kernel dmesg (5.10.12, eMachines E620)

I am getting this on my 'new' laptop, an Acer eMachines E620 during boot. It does not happen every time, about 1/3 times the machine boots just ok. When it succeeds booting It runs for days without flaws, so I don't think this is due to an hardware flaw.

I used netconsole to get the dmesg as the machine won't even reach the console when the crash happens.

[...]
    7.552116] invalid opcode: 0000 [#1] SMP NOPTI
[    7.552197] CPU: 0 PID: 781 Comm: openrc-run.sh Tainted: G        W         5.10.12-gentoo-E620 #2
[    7.552270] Hardware name: eMachines        eMachines E620  /Nile       , BIOS V1.03       09/30/2008
[    7.552343] RIP: 0010:kfree+0x115/0x230
[    7.552406] Code: 89 c5 e8 75 ea ff ff 48 8b 00 0f ba e0 09 72 63 e8 1f f4 ff ff 41 89 c4 48 8b 45 00 0f ba e0 10 72 0a 48 8b 45 08 a8 01 75 02 <0f> 0b 44 89 e1 48 c7 c2 00 f0 ff ff be 06 00 00 00 48 d3 e2 48 c7
[    7.552492] RSP: 0018:ffffb42f40267e10 EFLAGS: 00010246
[    7.552557] RAX: ffffd61280ee8d88 RBX: 0000000000000004 RCX: 000000008010000d
[    7.552625] RDX: 4000000000000000 RSI: ffffffffba1360b0 RDI: ffffd61280ee8d80
[    7.552689] RBP: ffffd61280ee8d80 R08: ffffffffb91bebdf R09: 0000000000000000
[    7.552756] R10: ffff8fe2c1047ac8 R11: 0000000000000000 R12: 0000000000000000
[    7.552819] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000100
[    7.552880] FS:  00007fe80eff6b68(0000) GS:ffff8fe339c00000(0000) knlGS:0000000000000000
[    7.552950] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    7.553011] CR2: 00007fe80eec7bc0 CR3: 0000000038012000 CR4: 00000000000006f0
[    7.553075] Call Trace:
[    7.553140]  __free_fdtable+0x16/0x1f
[    7.553201]  put_files_struct+0x81/0x9b
[    7.553264]  do_exit+0x433/0x94d
[    7.553327]  do_group_exit+0xa6/0xa6
[    7.553388]  __x64_sys_exit_group+0xf/0xf
[    7.553973]  do_syscall_64+0x33/0x40
[    7.553973]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[    7.553973] RIP: 0033:0x7fe80ef64bea
[    7.553973] Code: Unable to access opcode bytes at RIP 0x7fe80ef64bc0.
[    7.553973] RSP: 002b:00007ffdb1c47528 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
[    7.553973] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007fe80ef64bea
[    7.553973] RDX: 00007fe80ef64f60 RSI: 0000000000000000 RDI: 0000000000000000
[    7.553973] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000000
[    7.553973] R10: 00007fe80ee2c620 R11: 0000000000000246 R12: 00007fe80eff41e0
[    7.553973] R13: 00000000ffffffff R14: 0000000000000024 R15: 00007fe80edf9cd0
[    7.553973] Modules linked in: radeon(+) ath5k(+) snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio evdev input_leds psmouse led_class snd_hda_intel snd_intel_dspcfg k8temp hwmon snd_hda_codec i2c_algo_bit mac80211 snd_hwdep snd_hda_core drm_kms_helper snd_pcm ath cfbfillrect ohci_pci syscopyarea cfbimgblt snd_timer sysfillrect ehci_pci ohci_hcd ehci_hcd sysimgblt fb_sys_fops snd cfbcopyarea soundcore fb usbcore i2c_piix4 sr_mod cdrom usb_common font fbdev ttm video cfg80211 drm battery drm_panel_orientation_quirks ac backlight rfkill libarc4 thermal button processor
[    7.557627] ---[ end trace eaa8cedcac091bed ]---
[    7.557688] RIP: 0010:kfree+0x115/0x230
[    7.557749] Code: 89 c5 e8 75 ea ff ff 48 8b 00 0f ba e0 09 72 63 e8 1f f4 ff ff 41 89 c4 48 8b 45 00 0f ba e0 10 72 0a 48 8b 45 08 a8 01 75 02 <0f> 0b 44 89 e1 48 c7 c2 00 f0 ff ff be 06 00 00 00 48 d3 e2 48 c7
[    7.557840] RSP: 0018:ffffb42f40267e10 EFLAGS: 00010246
[    7.557902] RAX: ffffd61280ee8d88 RBX: 0000000000000004 RCX: 000000008010000d
[    7.557962] RDX: 4000000000000000 RSI: ffffffffba1360b0 RDI: ffffd61280ee8d80
[    7.558029] RBP: ffffd61280ee8d80 R08: ffffffffb91bebdf R09: 0000000000000000
[    7.558091] R10: ffff8fe2c1047ac8 R11: 0000000000000000 R12: 0000000000000000
[    7.558152] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000100
[    7.558219] FS:  00007fe80eff6b68(0000) GS:ffff8fe339c00000(0000) knlGS:0000000000000000
[    7.558294] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    7.558355] CR2: 00007fe80eec7bc0 CR3: 0000000038012000 CR4: 00000000000006f0
[    7.558420] Fixing recursive fault but reboot is needed!
[    7.706793] general protection fault, probably for non-canonical address 0x908cb23a5a626f90: 0000 [#2] SMP NOPTI
[    7.706885] CPU: 1 PID: 613 Comm: udevd Tainted: G      D W         5.10.12-gentoo-E620 #2
[    7.706945] Hardware name: eMachines        eMachines E620  /Nile       , BIOS V1.03       09/30/2008
[    7.707013] RIP: 0010:get_freepointer+0x6/0x1a
[    7.707066] Code: 89 f0 48 89 d6 48 89 ca 49 8b 48 30 48 83 ef 70 48 c7 c0 fb ff ff ff 48 85 c9 74 07 e8 21 9e 87 00 48 98 c3 8b 47 28 48 01 c6 <48> 8b 06 48 33 87 b0 00 00 00 48 89 f2 48 0f ca 48 31 d0 c3 48 8b
[    7.707144] RSP: 0018:ffffb42f4073ba80 EFLAGS: 00010286
[    7.707200] RAX: 0000000000000030 RBX: 000000000000016e RCX: 0000000000000352
[    7.707255] RDX: 0000000000000000 RSI: 908cb23a5a626f90 RDI: ffff8fe2c1042a00
[    7.707300] RBP: ffff8fe2c1042a00 R08: 908cb23a5a626f60 R09: 0000000000000351
[    7.707300] R10: ffff8fe339d35880 R11: ffff8fe339d35940 R12: 0000000000000dc0
[    7.707300] R13: ffff8fe2c1042a00 R14: 0000000000000060 R15: ffffffffc0765861
[    7.707300] FS:  00007fa577d41b68(0000) GS:ffff8fe339d00000(0000) knlGS:0000000000000000
[    7.707300] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    7.707300] CR2: 00007f4786a3fc30 CR3: 0000000003acc000 CR4: 00000000000006e0
[    7.707300] Call Trace:
[    7.707300]  __kmalloc+0xe6/0x1a5
[    7.707300]  ath5k_eeprom_read_pcal_info_2413+0x63e/0x7ba [ath5k]
[    7.707300]  ? ath5k_eeprom_read_pcal_info_5112+0x423/0x423 [ath5k]
[    7.707300]  ath5k_eeprom_init+0xda1/0x1683 [ath5k]
[    7.707300]  ath5k_hw_init+0x48a/0x6d6 [ath5k]
[    7.707300]  ath5k_init_ah+0x246/0xb93 [ath5k]
[    7.707300]  ? _dev_info+0x67/0x7e
[    7.707300]  ath5k_pci_probe+0x1c3/0x221 [ath5k]
[    7.707300]  pci_device_probe+0x95/0x104
[    7.707300]  really_probe+0x144/0x326
[    7.707300]  driver_probe_device+0x63/0x92
[    7.707300]  device_driver_attach+0x37/0x50
[    7.707300]  __driver_attach+0x8d/0x93
[    7.707300]  ? device_driver_attach+0x50/0x50
[    7.707300]  bus_for_each_dev+0x71/0xa7
[    7.707300]  bus_add_driver+0x106/0x1b7
[    7.707300]  driver_register+0x99/0xd2
[    7.707300]  ? 0xffffffffc05ec000
[    7.707300]  do_one_initcall+0xe2/0x24a
[    7.707300]  ? trace_kmalloc+0x88/0xc7
[    7.707300]  ? kmem_cache_alloc_trace+0x130/0x157
[    7.707300]  do_init_module+0x56/0x1f5
[    7.707300]  __do_sys_finit_module+0x94/0xbb
[    7.707300]  do_syscall_64+0x33/0x40
[    7.707300]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[    7.707300] RIP: 0033:0x7fa577cd1358
[    7.707300] Code: 4c 89 4c 24 48 4c 8b 4c 24 60 48 89 54 24 10 48 8b 7e 08 48 89 74 24 18 48 8b 51 18 c7 44 24 08 08 00 00 00 48 8b 76 10 0f 05 <48> 89 c7 e8 10 0e fe ff 48 83 c4 58 c3 66 2e 0f 1f 84 00 00 00 00
[    7.707300] RSP: 002b:00007ffcefb6f9a0 EFLAGS: 00000206 ORIG_RAX: 0000000000000139
[    7.707300] RAX: ffffffffffffffda RBX: 0000563da7667f10 RCX: 00007fa577cd1358
[    7.707300] RDX: 0000000000000000 RSI: 00007fa577c448e7 RDI: 0000000000000010
[    7.707300] RBP: 0000000000020000 R08: 0000000000000000 R09: 00007fa577ce8dbd
[    7.707300] R10: 0000000000000010 R11: 0000000000000206 R12: 0000000000000000
[    7.707300] R13: 00007fa5772ded80 R14: 00007fa577c448e7 R15: 0000000000000000
[    7.707300] Modules linked in: radeon(+) ath5k(+) snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio evdev input_leds psmouse led_class snd_hda_intel snd_intel_dspcfg k8temp hwmon snd_hda_codec i2c_algo_bit mac80211 snd_hwdep snd_hda_core drm_kms_helper snd_pcm ath cfbfillrect ohci_pci syscopyarea cfbimgblt snd_timer sysfillrect ehci_pci ohci_hcd ehci_hcd sysimgblt fb_sys_fops snd cfbcopyarea soundcore fb usbcore i2c_piix4 sr_mod cdrom usb_common font fbdev ttm video cfg80211 drm battery drm_panel_orientation_quirks ac backlight rfkill libarc4 thermal button processor
[    7.709927] ---[ end trace eaa8cedcac091bee ]---
[...]

Happens on 5.10.x and 5.11-rc kernels. Have not checked earlier kernels yet. The lockdep warning seen earlier in the dmesg happens every time but does not prevent the machine from booting (it only is problematic if HPET is used as clocksource. see bug #211327).
Comment 1 Erhard F. 2021-02-03 01:44:02 UTC
Created attachment 295045 [details]
kernel dmesg (5.10.12, eMachines E620 - successful boot)
Comment 2 Erhard F. 2021-02-03 01:44:52 UTC
Created attachment 295047 [details]
kernel .config (5.10.12, eMachines E620)
Comment 3 Erhard F. 2021-02-11 14:29:29 UTC
Created attachment 295235 [details]
kernel dmesg (5.11-rc7, eMachines E620)
Comment 4 Erhard F. 2021-04-27 19:51:36 UTC
Found out that this does not happen at all on kernel 5.4.x series. Took some time, but I was able to bisect the issue.

'bad' bisects showed up at the 1st boot most of the time, sometimes at the 3rd or 4th boot. I rated a bisect 'good' if the machine booted to desktop 20 times in a row with no issues (or issues not connected to this bug). The bisect revealed the following commit:


3202fa62fb43087387c65bfa9c100feffac74aa6 is the first bad commit
commit 3202fa62fb43087387c65bfa9c100feffac74aa6
Author: Kees Cook <keescook@chromium.org>
Date:   Wed Apr 1 21:04:27 2020 -0700

    slub: relocate freelist pointer to middle of object
    
    In a recent discussion[1] with Vitaly Nikolenko and Silvio Cesare, it
    became clear that moving the freelist pointer away from the edge of
    allocations would likely improve the overall defensive posture of the
    inline freelist pointer.  My benchmarks show no meaningful change to
    performance (they seem to show it being faster), so this looks like a
    reasonable change to make.
    
    Instead of having the freelist pointer at the very beginning of an
    allocation (offset 0) or at the very end of an allocation (effectively
    offset -sizeof(void *) from the next allocation), move it away from the
    edges of the allocation and into the middle.  This provides some
    protection against small-sized neighboring overflows (or underflows), for
    which the freelist pointer is commonly the target.  (Large or well
    controlled overwrites are much more likely to attack live object contents,
    instead of attempting freelist corruption.)
    
    The vaunted kernel build benchmark, across 5 runs. Before:
    
            Mean: 250.05
            Std Dev: 1.85
    
    and after, which appears mysteriously faster:
    
            Mean: 247.13
            Std Dev: 0.76
    
    Attempts at running "sysbench --test=memory" show the change to be well in
    the noise (sysbench seems to be pretty unstable here -- it's not really
    measuring allocation).
    
    Hackbench is more allocation-heavy, and while the std dev is above the
    difference, it looks like may manifest as an improvement as well:
    
    20 runs of "hackbench -g 20 -l 1000", before:
    
            Mean: 36.322
            Std Dev: 0.577
    
    and after:
    
            Mean: 36.056
            Std Dev: 0.598
    
    [1] https://twitter.com/vnik5287/status/1235113523098685440
    
    Signed-off-by: Kees Cook <keescook@chromium.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Acked-by: Christoph Lameter <cl@linux.com>
    Cc: Vitaly Nikolenko <vnik@duasynt.com>
    Cc: Silvio Cesare <silvio.cesare@gmail.com>
    Cc: Christoph Lameter <cl@linux.com>Cc: Pekka Enberg <penberg@kernel.org>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
    Link: http://lkml.kernel.org/r/202003051624.AAAC9AECC@keescook
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

 mm/slub.c | 7 +++++++
 1 file changed, 7 insertions(+)
Comment 5 Erhard F. 2021-04-27 19:52:03 UTC
Created attachment 296499 [details]
bisect.log
Comment 6 Erhard F. 2021-04-27 19:53:02 UTC
Created attachment 296501 [details]
kernel .config (final git bisect, eMachines E620)
Comment 7 Kees Cook 2021-04-28 18:01:15 UTC
Interesting! This looks like some kind of use-after-free problem that is corrupting the freelist in seemingly "unallocated" memory. There have been similar issues detected by this commit:

https://bugzilla.kernel.org/show_bug.cgi?id=207383
https://lore.kernel.org/lkml/YIGsJ9LlFquvBI2iWPKhJwjKBwDUr_C-38oVpLJJHJ5rDCY_Zrrv392o6UPNxHoeQrcpLYC9U4fZdpD9ilz6Amg2IxkSexGLQMCQIBek8rc=@protonmail.com/

Given the radeon bit, I wonder if there is another issue with GPU drivers, but if it happens at boot, maybe it's related to ath5k.
Comment 8 Erhard F. 2021-04-29 11:40:38 UTC
Created attachment 296539 [details]
kernel dmesg (5.12, eMachines E620)

(In reply to Kees Cook from comment #7)
> Given the radeon bit, I wonder if there is another issue with GPU drivers,
> but if it happens at boot, maybe it's related to ath5k.
Thanks for the hint! I gave kernel v5.12 + debugging options + KASAN a try and now I am getting this:

[...]
[drm] radeon kernel modesetting enabled.
[drm] initializing kernel modesetting (RS690 0x1002:0x791F 0x107B:0x0184 0x00).
resource sanity check: requesting [mem 0x000c0000-0x000dffff], which spans more than PCI Bus 0000:00 [mem 0x000ce000-0x000cffff window]
caller pci_map_rom+0x13f/0x284 mapping multiple BARs
ATOM BIOS: ATI
radeon 0000:01:05.0: VRAM: 256M 0x00000000BE000000 - 0x00000000CDFFFFFF (256M used)
radeon 0000:01:05.0: GTT: 512M 0x0000000080000000 - 0x000000009FFFFFFF
[drm] Detected VRAM RAM=256M, BAR=256M
[drm] RAM width 128bits DDR
[TTM] Zone  kernel: Available graphics memory: 1622862 KiB
[drm] radeon: 256M of VRAM memory ready
[drm] radeon: 512M of GTT memory ready.
[drm] GART: num cpu pages 131072, num gpu pages 131072
==================================================================
BUG: KASAN: slab-out-of-bounds in radeon_atombios_parse_power_table_1_3+0x1837/0x1998 [radeon]
Write of size 4 at addr ffff88810178d858 by task systemd-udevd/157

CPU: 0 PID: 157 Comm: systemd-udevd Not tainted 5.12.0-E620 #50
Hardware name: eMachines        eMachines E620  /Nile       , BIOS V1.03       09/30/2008
Call Trace:
 dump_stack+0xa5/0xe6
 print_address_description.constprop.0+0x18/0x239
 ? radeon_atombios_parse_power_table_1_3+0x1837/0x1998 [radeon]
 kasan_report+0x170/0x1a8
 ? radeon_atombios_parse_power_table_1_3+0x1837/0x1998 [radeon]
 radeon_atombios_parse_power_table_1_3+0x1837/0x1998 [radeon]
 ? mark_held_locks+0xa4/0xbb
 ? test_bit+0x23/0x2e
 ? hlock_class+0x42/0xf5
 ? mark_lock+0x172/0x662
 ? generic_exec_single+0x166/0x1a9
 ? radeon_atombios_lookup_gpio+0x26d/0x26d [radeon]
 ? print_usage_bug+0x320/0x320
 ? cpumask_next_and+0x18/0x19
 ? generic_exec_single+0x1a9/0x1a9
 ? __phys_addr_nodebug+0x7a/0x7a
 ? __cpa_flush_tlb+0x60/0x60
 ? atom_parse_data_header+0xf7/0x26d [radeon]
 ? lookup_address_in_pgd+0x191/0x1ba
 radeon_atombios_get_power_modes+0x144/0x1888 [radeon]
 ? __change_page_attr_set_clr+0xf6b/0xf6b
 ? test_bit+0x23/0x2e
 ? radeon_atombios_parse_pplib_non_clock_info+0x9d3/0x9d3 [radeon]
 ? set_page_memtype+0x81/0xa4
 radeon_pm_init+0x1019/0x1904 [radeon]
 ? set_memory_uc+0x3d/0x62
 rs690_init+0x76e/0x84a [radeon]
 radeon_device_init+0x1c1a/0x21e5 [radeon]
 ? __kasan_kmalloc_large+0x7c/0x83
 radeon_driver_load_kms+0xf5/0x30b [radeon]
 drm_dev_register+0x255/0x4a0 [drm]
 radeon_pci_probe+0x246/0x2f6 [radeon]
 pci_device_probe+0x1aa/0x294
 really_probe+0x30e/0x850
 ? pm_runtime_put_noidle+0x19/0x5b
 driver_probe_device+0xe6/0x135
 ? device_driver_attach+0xf8/0xf8
 device_driver_attach+0xc1/0xf8
 __driver_attach+0x13f/0x146
 bus_for_each_dev+0xfa/0x146
 ? next_device+0x39/0x39
 bus_add_driver+0x2b3/0x447
 driver_register+0x242/0x2c1
 ? 0xffffffffc0b80000
 do_one_initcall+0x149/0x2fd
 ? test_bit.constprop.0+0x2a/0x2a
 ? slab_post_alloc_hook+0x4a/0x186
 ? trace_kmalloc+0x9a/0xc7
 ? kasan_unpoison+0x25/0x31
 do_init_module+0x1ae/0x573
 load_module+0x4dee/0x5cca
 ? rcu_read_lock_sched_held+0x73/0xc8
 ? module_frob_arch_sections+0x6/0x6
 ? kernel_read_file+0x25c/0x3fd
 ? __vmalloc_node+0xa4/0xb1
 ? kernel_read_file+0x38b/0x3fd
 ? __ia32_sys_fsconfig+0xd3/0xd3
 ? kernel_read_file_from_fd+0x57/0x74
 __do_sys_finit_module+0xf1/0x140
 ? __do_sys_init_module+0x184/0x184
 ? account_locked_vm+0x75/0x75
 ? rcu_read_lock_held_common+0x7e/0x7e
 do_syscall_64+0x33/0x40
 entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f21b2959e79
Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d bf 0f 0c 00 f7 d8 64 89 01 48
RSP: 002b:00007ffd14d5a388 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
RAX: ffffffffffffffda RBX: 0000563b414190f0 RCX: 00007f21b2959e79
RDX: 0000000000000000 RSI: 0000563b41427360 RDI: 000000000000001c
RBP: 0000000000020000 R08: 0000000000000000 R09: 0000563b4141a2c0
R10: 000000000000001c R11: 0000000000000246 R12: 0000563b41427360
R13: 0000000000000000 R14: 0000563b414191e0 R15: 0000563b414190f0

Allocated by task 135:
 kasan_save_stack+0x1b/0x3c
 kasan_set_track+0x1c/0x21
 ____kasan_kmalloc+0x68/0x72
 kobject_get_path+0x9a/0x151
 kobject_uevent_env+0x202/0x9e8
 kobject_synth_uevent+0x208/0x50e
 uevent_store+0x15/0x35
 kernfs_fop_write_iter+0x2b3/0x35d
 new_sync_write+0x143/0x1d0
 vfs_write+0x191/0x240
 ksys_write+0x107/0x192
 do_syscall_64+0x33/0x40
 entry_SYSCALL_64_after_hwframe+0x44/0xae

The buggy address belongs to the object at ffff88810178d6c8
 which belongs to the cache kmalloc-96 of size 96
The buggy address is located 304 bytes to the right of
 96-byte region [ffff88810178d6c8, ffff88810178d728)
The buggy address belongs to the page:
page:000000009ba3a72f refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff88810178d348 pfn:0x10178c
head:000000009ba3a72f order:1 compound_mapcount:0
flags: 0x8000000000010200(slab|head)
raw: 8000000000010200 ffffea0004056688 ffff888100040e08 ffff888100042f40
raw: ffff88810178d348 0000000000120005 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 ffff88810178d700: fb fb fb fb fb fc fc fc fc fc fc fc fc fc fc fc
 ffff88810178d780: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>ffff88810178d800: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
                                                    ^
 ffff88810178d880: fc 00 00 00 00 00 00 00 00 00 00 00 00 fc fc fc
 ffff88810178d900: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
==================================================================
Disabling lock debugging due to kernel taint
[...]

Don't know about ATH5K but DRM_RADEON sure seems to play an important role. I get this KASAN hit 100% of the time at boot. Also it does not matter whether ATH5K is built or not. If built as a module it does get loaded later than DRM_RADEON.
Comment 9 Erhard F. 2021-04-29 11:45:10 UTC
Created attachment 296541 [details]
kernel .config (5.12, eMachines E620)
Comment 10 Erhard F. 2021-04-29 15:41:28 UTC
To be sure I disabled DRM_RADEON while enabling ATH5K. The machine boots fine like that without any stacktrace or KASAN hit (tried about 10 times).

So it's DRM_RADEON after all I guess, somehow connected to the commit I bisected.
Comment 11 Christian König 2021-04-29 16:25:12 UTC
Alex needs to take a look at this, but of hand I would say that looks like a combination of on old VBIOS and a bug in the VBIOS parser.

Maybe try to update the VBIOS as well.
Comment 12 Erhard F. 2021-04-29 19:22:59 UTC
(In reply to Christian König from comment #11)
> Maybe try to update the VBIOS as well.
The machine is an Acer eMachines E620 laptop. System BIOS is already latest (V1.03) which is available for that machine. The IGP is an Radeon XPress 1200. Where would I get a proper VBIOS to update?

 # lspci -vv -s 01:05.0
01:05.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] RS690M [Radeon Xpress 1200/1250/1270] (prog-if 00 [VGA controller])
	Subsystem: Gateway, Inc. RS690M [Radeon Xpress 1200/1250/1270]
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 64, Cache Line Size: 32 bytes
	Interrupt: pin A routed to IRQ 255
	Region 0: Memory at d0000000 (64-bit, prefetchable) [size=256M]
	Region 2: Memory at f0100000 (64-bit, non-prefetchable) [size=64K]
	Region 4: I/O ports at 9400 [size=256]
	Region 5: Memory at f0000000 (32-bit, non-prefetchable) [size=1M]
	Expansion ROM at 000c0000 [virtual] [disabled] [size=128K]
	Capabilities: [50] Power Management version 2
		Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [80] MSI: Enable- Count=1/1 Maskable- 64bit+
		Address: 0000000000000000  Data: 0000
	Kernel modules: radeon
Comment 13 Erhard F. 2021-04-29 19:30:10 UTC
BIOS shows current VBIOS as: V010.055.000.049.029773
Comment 14 Kees Cook 2021-05-02 23:08:43 UTC
This appears to be near drivers/gpu/drm/radeon/radeon_atombios.c line 2248, and I see halfway through it stops using "state_index - 1", so I bet this is it:


                rdev->pm.power_state[state_index].flags &=
                        ~RADEON_PM_STATE_SINGLE_DISPLAY_ONLY;

See if this patch fixes your problem...
Comment 15 Kees Cook 2021-05-02 23:09:20 UTC
Created attachment 296589 [details]
attempt to fix off-by-one index in default power state
Comment 16 Kees Cook 2021-05-02 23:11:07 UTC
Comment on attachment 296589 [details]
attempt to fix off-by-one index in default power state

Oops, typo.
Comment 17 Kees Cook 2021-05-02 23:11:27 UTC
Created attachment 296591 [details]
attempt to fix off-by-one index in default power state
Comment 18 Erhard F. 2021-05-03 11:28:37 UTC
(In reply to Kees Cook from comment #14)
> This appears to be near drivers/gpu/drm/radeon/radeon_atombios.c line 2248,
> and I see halfway through it stops using "state_index - 1", so I bet this is
> it:
> 
> 
>                 rdev->pm.power_state[state_index].flags &=
>                         ~RADEON_PM_STATE_SINGLE_DISPLAY_ONLY;
> 
> See if this patch fixes your problem...
Yes, it did! Thanks!

I applied your 2 patches you posted later on dri-devel:
[PATCH 1/2] drm/radeon: Fix off-by-one power_state index heap overwrite
[PATCH 2/2] drm/radeon: Avoid power table parsing memory leaks

The machine boots now on the patched kernel 5.12.0 without provoking this KASAN hit. Also no other issues at boot. Did a few reboots to test it, looks good so far...

However having carefully read your patches I also enabled kemleak. ;) Which shows me this:

unreferenced object 0xffff888102622e08 (size 96):
  comm "systemd-udevd", pid 152, jiffies 4294895220 (age 792.274s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 40 26 92 04 80 88 ff ff  ........@&......
    01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<0000000075dbd345>] __kmalloc+0x176/0x1e7
    [<00000000806812e2>] radeon_atombios_parse_power_table_1_3+0x43f/0x1b1f [radeon]
    [<000000006e896262>] radeon_atombios_get_power_modes+0x144/0x1888 [radeon]
    [<00000000edc0acbb>] radeon_pm_init+0x1019/0x1904 [radeon]
    [<00000000f21bf64c>] rs690_init+0x76e/0x84a [radeon]
    [<000000001e47a877>] radeon_device_init+0x1c1a/0x21e5 [radeon]
    [<00000000f60f93e0>] radeon_driver_load_kms+0xf5/0x30b [radeon]
    [<00000000bccd090f>] drm_dev_register+0x255/0x4a0 [drm]
    [<0000000050f51b30>] radeon_pci_probe+0x246/0x2f6 [radeon]
    [<00000000b624f961>] pci_device_probe+0x1aa/0x294
    [<000000001686585e>] really_probe+0x30e/0x850
    [<000000009ecb9009>] driver_probe_device+0xe6/0x135
    [<0000000099ebf8a7>] device_driver_attach+0xc1/0xf8
    [<00000000d735bfe4>] __driver_attach+0x13f/0x146
    [<0000000050bac4e1>] bus_for_each_dev+0xfa/0x146
    [<0000000001aea795>] bus_add_driver+0x2b3/0x447
unreferenced object 0xffff888004922640 (size 64):
  comm "systemd-udevd", pid 152, jiffies 4294895220 (age 792.274s)
  hex dump (first 32 bytes):
    00 00 00 00 40 9c 00 00 00 00 00 00 00 00 00 00  ....@...........
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<000000000736e960>] kmem_cache_alloc_trace+0x122/0x171
    [<00000000827c1b4e>] radeon_atombios_parse_power_table_1_3+0x546/0x1b1f [radeon]
    [<000000006e896262>] radeon_atombios_get_power_modes+0x144/0x1888 [radeon]
    [<00000000edc0acbb>] radeon_pm_init+0x1019/0x1904 [radeon]
    [<00000000f21bf64c>] rs690_init+0x76e/0x84a [radeon]
    [<000000001e47a877>] radeon_device_init+0x1c1a/0x21e5 [radeon]
    [<00000000f60f93e0>] radeon_driver_load_kms+0xf5/0x30b [radeon]
    [<00000000bccd090f>] drm_dev_register+0x255/0x4a0 [drm]
    [<0000000050f51b30>] radeon_pci_probe+0x246/0x2f6 [radeon]
    [<00000000b624f961>] pci_device_probe+0x1aa/0x294
    [<000000001686585e>] really_probe+0x30e/0x850
    [<000000009ecb9009>] driver_probe_device+0xe6/0x135
    [<0000000099ebf8a7>] device_driver_attach+0xc1/0xf8
    [<00000000d735bfe4>] __driver_attach+0x13f/0x146
    [<0000000050bac4e1>] bus_for_each_dev+0xfa/0x146
    [<0000000001aea795>] bus_add_driver+0x2b3/0x447
Comment 19 Erhard F. 2021-05-03 11:29:52 UTC
Created attachment 296597 [details]
kmemleak output (5.12 + patches, eMachines E620)
Comment 20 Kees Cook 2021-05-05 22:29:44 UTC
Ah, just to make sure I understand: you saw this kmem leak report when running with the two patches I sent to LKML?
Comment 21 Erhard F. 2021-05-05 23:11:54 UTC
(In reply to Kees Cook from comment #20)
> Ah, just to make sure I understand: you saw this kmem leak report when
> running with the two patches I sent to LKML?
Yes, exactly.
Comment 22 Erhard F. 2021-05-10 16:12:42 UTC
Created attachment 296709 [details]
kernel dmesg (5.13-rc1, eMachines E620)
Comment 23 Erhard F. 2021-05-10 16:13:18 UTC
Created attachment 296711 [details]
kmemleak output (5.13-rc1, eMachines E620)

Looks like the patch already went into v5.13-rc1, thanks!

With v5.13-rc1 only the memory leak remains:
unreferenced object 0xffff888102a316c8 (size 96):
  comm "systemd-udevd", pid 144, jiffies 4294892558 (age 1127.210s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 40 82 55 08 80 88 ff ff  ........@.U.....
    01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<00000000be239a59>] __kmalloc+0x163/0x1d4
    [<00000000bc8a83b6>] radeon_atombios_parse_power_table_1_3+0x43f/0x1b1f [radeon]
    [<000000004cdc8d73>] radeon_atombios_get_power_modes+0x144/0x1888 [radeon]
    [<000000000a0a16d2>] radeon_pm_init+0xffd/0x18cc [radeon]
    [<00000000f37d806c>] rs690_init+0x76e/0x84a [radeon]
    [<00000000933cf3a9>] radeon_device_init+0x1c1a/0x21c1 [radeon]
    [<00000000f00e9a19>] radeon_driver_load_kms+0xf5/0x30b [radeon]
    [<00000000a9fe0c11>] drm_dev_register+0x255/0x4a0 [drm]
    [<000000005015ac7d>] radeon_pci_probe+0x246/0x2f6 [radeon]
    [<0000000018efccda>] pci_device_probe+0x1aa/0x294
    [<00000000350850d3>] really_probe+0x30e/0x850
    [<0000000083391671>] driver_probe_device+0x112/0x161
    [<000000004d1e121b>] device_driver_attach+0xc1/0xf8
    [<00000000ad7d8e9a>] __driver_attach+0x16b/0x172
    [<0000000078f0f9cf>] bus_for_each_dev+0xfa/0x146
    [<000000002f8d374c>] bus_add_driver+0x2b3/0x447
unreferenced object 0xffff888008558240 (size 64):
  comm "systemd-udevd", pid 144, jiffies 4294892558 (age 1127.210s)
  hex dump (first 32 bytes):
    00 00 00 00 40 9c 00 00 00 00 00 00 00 00 00 00  ....@...........
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<0000000094907ef6>] kmem_cache_alloc_trace+0x10f/0x15e
    [<000000000667fd5a>] radeon_atombios_parse_power_table_1_3+0x546/0x1b1f [radeon]
    [<000000004cdc8d73>] radeon_atombios_get_power_modes+0x144/0x1888 [radeon]
    [<000000000a0a16d2>] radeon_pm_init+0xffd/0x18cc [radeon]
    [<00000000f37d806c>] rs690_init+0x76e/0x84a [radeon]
    [<00000000933cf3a9>] radeon_device_init+0x1c1a/0x21c1 [radeon]
    [<00000000f00e9a19>] radeon_driver_load_kms+0xf5/0x30b [radeon]
    [<00000000a9fe0c11>] drm_dev_register+0x255/0x4a0 [drm]
    [<000000005015ac7d>] radeon_pci_probe+0x246/0x2f6 [radeon]
    [<0000000018efccda>] pci_device_probe+0x1aa/0x294
    [<00000000350850d3>] really_probe+0x30e/0x850
    [<0000000083391671>] driver_probe_device+0x112/0x161
    [<000000004d1e121b>] device_driver_attach+0xc1/0xf8
    [<00000000ad7d8e9a>] __driver_attach+0x16b/0x172
    [<0000000078f0f9cf>] bus_for_each_dev+0xfa/0x146
    [<000000002f8d374c>] bus_add_driver+0x2b3/0x447
Comment 24 Erhard F. 2021-06-08 23:56:47 UTC
As the original problem got fixed by Kees and the fix already went to affected stable and LTS kernels I'll close here. Thanks again!

I'll open a new bug for the remaining memory leak and link here.