Bug 217669

Summary: radeon.ko/i586: BUG: kernel NULL pointer dereference, address: 00000004
Product: Linux Reporter: Taketo Kabe (kkabe)
Component: KernelAssignee: Virtual assignee for kernel bugs (linux-kernel)
Status: NEW ---    
Severity: normal CC: bagasdotme
Priority: P3    
Hardware: i386   
OS: Linux   
Kernel Version: 6.4.3 Subsystem:
Regression: Yes Bisected commit-id: b39181f7c690
Attachments: Activate #ifdef FTRACE_MCOUNT_MAX_OFFSET only if defined as nonzero
check and return if vblank->worker was null in drm_vblank_flush_worker

Description Taketo Kabe 2023-07-13 13:00:24 UTC
Origin report: https://gitlab.freedesktop.org/drm/amd/-/issues/2615

Compile with CONFIG_DRM_RADEON=m on 32bit,
kernel panics on radeon.ko load.

On slow machine, there is 70 second window between
login: prompt on raw VGA tty1 and
radeon.ko KMS console load.

When logging in on tty1 during this window, kernel panics.
(Fast machine loads KMS console before login: prompt, so the problem is masked)

[  466.298267] [drm] radeon kernel modesetting enabled.
[  466.371716] radeon 0000:01:00.0: vgaarb: deactivate vga console
[  466.476920] Console: switching to colour dummy device 80x25
[  466.661163] [drm] initializing kernel modesetting (RV280 0x1002:0x5960 0x148C:0x2094 0x01).
[  466.672175] [drm] Forcing AGP to PCI mode
[  466.848372] [drm] Generation 2 PCI interface, using max accessible memory
[  466.859389] radeon 0000:01:00.0: VRAM: 128M 0x00000000C0000000 - 0x00000000C7FFFFFF (128M used)
[  466.869191] radeon 0000:01:00.0: GTT: 512M 0x00000000A0000000 - 0x00000000BFFFFFFF
[  467.022337] [drm] Detected VRAM RAM=128M, BAR=128M
[  467.041498] [drm] RAM width 64bits DDR
[  467.153587] workqueue: Failed to create a rescuer kthread for wq "ttm": -EINTR
[  467.154126] [drm:radeon_ttm_init [radeon]] *ERROR* failed initializing buffer object driver(-12).
[  467.176714] radeon 0000:01:00.0: Fatal error during GPU init
[  467.361714] [drm] radeon: finishing device.
[  467.431560] [drm] radeon: cp finalized
[  467.495402] radeon: probe of 0000:01:00.0 failed with error -12

In this instance, tty1 console freezes.
Comment 1 Taketo Kabe 2023-07-13 13:00:52 UTC
While bisecting, I got a clearer panic:

[  469.825305] BUG: kernel NULL pointer dereference, address: 00000004
[  469.830502] #PF: supervisor read access in kernel mode
[  469.830502] #PF: error_code(0x0000) - not-present page
[  469.830502] *pde = 00000000
[  469.830502] Oops: 0000 [#1] PREEMPT SMP
[  469.830502] CPU: 0 PID: 365 Comm: systemd-udevd Not tainted 5.14.0-221.el9.v1.i586 #1
[  469.830502] Hardware name: System Manufacturer System Name/ALADDIN5, BIOS 0626 07/15/95
[  469.830502] EIP: _raw_spin_lock_irqsave+0x1f/0x40
[  469.830502] Code: cc cc cc cc cc cc cc 3e cc cc cc 3e 55 89 c1 89 55 89 c1 89 5b fa 64 ff 5b fa 64 ff c2 31 d2 be c2 31 d2 be 89 d0 3e 0f 89 d0 <3e> 0f 89 d8 5b 5e 89 d8 5b 5e 26 00 90 89 26 00 90 89 b7 15 75 ff
[  469.830502] EAX: 00000000 EBX: 00000246 ECX: 00000004 EDX: 00000000
[  469.830502] ESI: 00000001 EDI: c3e71c40 EBP: c3e71c34 ESP: c3e71c2c
[  469.830502] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 EFLAGS: 00010046
[  469.830502] CR0: 80050033 CR2: 00000004 CR3: 057fa000 CR4: 00000010
[  469.830502] Call Trace:
[  469.830502]  kthread_flush_worker+0x52/0xa0
[  469.830502]  ? kthread_should_park+0x40/0x40
[  469.830502]  drm_crtc_vblank_off+0x1d7/0x230 [drm]
[  469.830502]  radeon_crtc_dpms+0x197/0x1b0 [radeon]
[  469.830502]  radeon_crtc_disable+0x16/0xa0 [radeon]
[  469.830502]  __drm_helper_disable_unused_functions+0x74/0xc0 [drm_kms_helper]
[  469.830502]  drm_helper_disable_unused_functions+0x3c/0x50 [drm_kms_helper]
[  469.830502]  radeon_fbdev_init+0xb0/0x130 [radeon]
[  469.830502]  radeon_modeset_init+0x25d/0x320 [radeon]
[  469.830502]  radeon_driver_load_kms+0xc4/0x240 [radeon]
[  469.830502]  drm_dev_register+0xb4/0x1a0 [drm]
[  469.830502]  radeon_pci_probe+0xc0/0x100 [radeon]
[  469.830502]  pci_device_probe+0xbc/0x150
[  469.830502]  really_probe+0xb7/0x350
[  469.830502]  __driver_probe_device+0x109/0x1e0
[  469.830502]  driver_probe_device+0x1f/0x90
[  469.830502]  __driver_attach+0x8a/0x1b0
[  469.830502]  ? __device_attach_driver+0x100/0x100
[  469.830502]  bus_for_each_dev+0x58/0x90
[  469.830502]  driver_attach+0x19/0x20
[  469.830502]  ? __device_attach_driver+0x100/0x100
[  469.830502]  bus_add_driver+0x12f/0x1d0
[  469.830502]  driver_register+0x79/0xd0
[  469.830502]  ? 0xf7dde000
[  469.830502]  __pci_register_driver+0x52/0x60
[  469.830502]  radeon_module_init+0x5c/0x1000 [radeon]
[  469.830502]  do_one_initcall+0x3e/0x1c0
[  469.830502]  ? __vunmap+0x20b/0x2a0
[  469.830502]  ? __vunmap+0x20b/0x2a0
[  469.830502]  ? kmem_cache_alloc_trace+0x38/0x440
[  469.830502]  do_init_module+0x52/0x260
[  469.830502]  load_module+0x930/0x9b0
[  469.830502]  __ia32_sys_init_module+0x15d/0x180
[  469.830502]  do_int80_syscall_32+0x2e/0x80
[  469.830502]  entry_INT80_32+0xf0/0xf0
[  469.830502] EIP: 0xb79e7e4e
[  469.830502] Code: 0f 83 d6 06 00 00 c3 66 90 66 90 90 57 56 53 8b 7c 24 20 8b 74 24 1c 8b 54 24 18 8b 4c 24 14 8b 5c 24 10 b8 80 00 00 00 cd 80 <5b> 5e 5f 3d 01 f0 ff ff 0f 83 a4 06 00 00 c3 66 90 90 53 8b 54 24
[  469.830502] EAX: ffffffda EBX: b5526010 ECX: 0020d79c EDX: b7c26274
[  469.830502] ESI: b7c20295 EDI: b7c2ddd8 EBP: 018af7c0 ESP: bfd2f810
[  469.830502] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00000292
[  469.830502] Modules linked in: nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib radeon(+) nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 gpu_sched drm_buddy i2c_algo_bit drm_display_helper cec drm_ttm_helper ppdev ttm rfkill ip_set nf_tables libcrc32c nfnetlink drm_kms_helper pcspkr syscopyarea e100 sysfillrect parport_pc sysimgblt mii fb_sys_fops parport qrtr drm fuse ext4 mbcache jbd2 sd_mod t10_pi sr_mod crc64_rocksoft_generic cdrom crc64_rocksoft crc64 sg ata_generic pata_ali libata serio_raw
[  469.830502] CR2: 0000000000000004
[  469.830502] ---[ end trace 30555bd5ee4bee23 ]---
[  469.830502] EIP: _raw_spin_lock_irqsave+0x1f/0x40
[  469.830502] Code: cc cc cc cc cc cc cc 3e cc cc cc 3e 55 89 c1 89 55 89 c1 89 5b fa 64 ff 5b fa 64 ff c2 31 d2 be c2 31 d2 be 89 d0 3e 0f 89 d0 <3e> 0f 89 d8 5b 5e 89 d8 5b 5e 26 00 90 89 26 00 90 89 b7 15 75 ff
[  469.830502] EAX: 00000000 EBX: 00000246 ECX: 00000004 EDX: 00000000
[  469.830502] ESI: 00000001 EDI: c3e71c40 EBP: c3e71c34 ESP: c3e71c2c
[  469.830502] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 EFLAGS: 00010046
[  469.830502] CR0: 80050033 CR2: 00000004 CR3: 057fa000 CR4: 00000010
[  469.830502] Kernel panic - not syncing: Fatal exception
[  469.830502] Kernel Offset: disabled
[  469.830502] ---[ end Kernel panic - not syncing: Fatal exception ]---
Comment 2 Taketo Kabe 2023-07-13 13:01:26 UTC
analysis(short):

vblank->worker was NULL in
drivers/gpu/drm/drm_vblank.c:drm_crtc_vblank_off(), but
no one afterwards checked it for NULL.

kthread_queue_work() do dereferecne "worker" as

raw_spin_lock_irqsave(&worker->lock, flags);

which results as dereferencing &NULL->lock == 0x00000004.
I'm not sure whether vblank->worker == NULL is a correct behavior or not.
Comment 3 Taketo Kabe 2023-07-13 13:01:52 UTC
analysis(long):

drivers/gpu/drm/drm_vblank.c
void drm_crtc_vblank_off(struct drm_crtc *crtc)
{
        struct drm_device *dev = crtc->dev;
        unsigned int pipe = drm_crtc_index(crtc);
        struct drm_vblank_crtc *vblank = &dev->vblank[pipe];
        struct drm_pending_vblank_event *e, *t;
        ktime_t now;
        u64 seq;
...
        /* Wait for any vblank work that's still executing to finish */
        drm_vblank_flush_worker(vblank);	<<<
}

drivers/gpu/drm/drm_internal.h
static inline void drm_vblank_flush_worker(struct drm_vblank_crtc *vblank)
{
        kthread_flush_worker(vblank->worker);	<<< NULL for unknown reason
}

kernel/kthread.c
void kthread_flush_worker(struct kthread_worker *worker)
{
        struct kthread_flush_work fwork = {
                KTHREAD_WORK_INIT(fwork.work, kthread_flush_work_fn),
                COMPLETION_INITIALIZER_ONSTACK(fwork.done),
        };

        kthread_queue_work(worker, &fwork.work);	<<<< passes NULL
        wait_for_completion(&fwork.done);
}

kernel/kthread.c
bool kthread_queue_work(struct kthread_worker *worker,
                        struct kthread_work *work)
{
        bool ret = false;
        unsigned long flags;

        raw_spin_lock_irqsave(&worker->lock, flags);	<<<< dereference NULL->lock==0x00000004
        if (!queuing_blocked(worker, work)) {
                kthread_insert_work(worker, work, &worker->work_list);
                ret = true;
        }
        raw_spin_unlock_irqrestore(&worker->lock, flags);
        return ret;
}

include/linux/kthread.h
struct kthread_worker {
        unsigned int            flags;
        raw_spinlock_t          lock;		<<< offset 0x00000004
        struct list_head        work_list;
        struct list_head        delayed_work_list;
        struct task_struct      *task;
        struct kthread_work     *current_work;
};
Comment 4 Taketo Kabe 2023-07-13 13:03:13 UTC
Created attachment 304625 [details]
Activate #ifdef FTRACE_MCOUNT_MAX_OFFSET only if defined as nonzero
Comment 5 Taketo Kabe 2023-07-13 13:03:52 UTC
(attach patch-FTRACE_MCOUNT_MAX_OFFSET-defined.patch)

I succeeded to bisect down the regressing commit found in kernel-5.18.0-rc2:

b39181f7c690 (refs/bisect/bad) ftrace: Add FTRACE_MCOUNT_MAX_OFFSET to avoid adding weak function

This at a glance does not relate to drm/kms code.

The attached patch effectively reverts the commit for 32bit.
This fixed the problem on kernel-5.18.0, but not enough for kernel-6.4.3 .
Comment 6 Taketo Kabe 2023-07-13 13:05:22 UTC
Created attachment 304626 [details]
check and return if vblank->worker was null in drm_vblank_flush_worker

Attached patch sort of fixes the problem; it does not panic and
KMS console works, but printk is triggered 4 times on radeon.ko load and
when VGA connector is plugged in.

I am sort of at loss now; I need advice from people which knows better.
Comment 7 Bagas Sanjaya 2023-07-14 02:40:13 UTC
Can you also file report on freedesktop gitlab [1]?

[1]: https://gitlab.freedesktop.org/drm/amd/-/issues
Comment 8 Bagas Sanjaya 2023-07-14 02:43:36 UTC
(In reply to Bagas Sanjaya from comment #7)
> Can you also file report on freedesktop gitlab [1]?
> 
> [1]: https://gitlab.freedesktop.org/drm/amd/-/issues

Oops, I don't see your gitlab issue link on the TS. Sorry for inconvenience.
Comment 9 Taketo Kabe 2023-07-14 06:45:17 UTC
regressions@leemhuis.info sed in <55a3bbb1-5b3c-f454-b529-8ee9944cc67c@leemhuis.
info>

>> That being said: That commit is not in 5.18, as Steve noticed:
>>
>> >> #regzbot introduced: b39181f7c6907d
>> https://bugzilla.kernel.org/show_bug.c
gi?id=217669
>> >> #regzbot title: FTRACE_MCOUNT_MAX_OFFSET causes kernel NULL pointer
>> derefe
rence and virtual console (tty1) freeze
>> > That commit was added in 5.19.
>> >
>> > So I'm confused about why it's mentioned. Was it backported?
>>
>> Taketo Kabe, could you please help to clean this confusion up? Did you
>> mean 5.19 in https://bugzilla.kernel.org/show_bug.cgi?id=217669#c5 ? And
>> BTW: did you really use a vanilla kernel for your bisection?

Reporter Me:
I bisected using freedesktop.org kernel tree, which git commit ID is
in sync with kernel.org
but version number in ./Makefile could be slighty behind.

Patch in
https://bugzilla.kernel.org/show_bug.cgi?id=217669#c4
fixed the problem in freedesktop.org kernel 5.18.0-rc2 .
This may explain that in kernel.org tree, the said commit is in kernel-5.19.