Bug 207561 - DRM? broke for AMDGPU in 5.6.10 (worked in 5.6.6)
Status: RESOLVED CODE_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: Video(Other)
Hardware: x86-64 Linux
Importance: P1 blocking
Assignee: other_other
Reported: 2020-05-03 17:33 UTC by Artem S. Tashkinov
Modified: 2020-05-08 22:14 UTC

Kernel Version: 5.6.10
Regression: No

Description Artem S. Tashkinov 2020-05-03 17:33:05 UTC
I've just upgraded from 5.6.6 compiled with GCC 9.1 (Fedora 31) to 5.6.10 compiled with GCC 10 (Fedora 32).

Audio is completely broken.

There is a kernel oops on boot. I have no idea what to make of it.

[    2.874973] [drm] amdgpu: 6128M of VRAM memory ready
[    2.874975] [drm] amdgpu: 6128M of GTT memory ready.
[    2.874993] [drm] GART: num cpu pages 131072, num gpu pages 131072
[    2.875263] [drm] PCIE GART of 512M enabled (table at 0x0000008000300000).
[    2.884270] BUG: kernel NULL pointer dereference, address: 0000000000000026
[    2.884276] #PF: supervisor read access in kernel mode
[    2.884278] #PF: error_code(0x0000) - not-present page
[    2.884280] PGD 0 P4D 0 
[    2.884284] Oops: 0000 [#1] PREEMPT SMP NOPTI
[    2.884286] CPU: 0 PID: 2594 Comm: comp_1.2.1 Not tainted 5.6.10-az2 #3
[    2.884287] Hardware name: System manufacturer System Product Name/TUF GAMING X570-PLUS (WI-FI), BIOS 1405 11/19/2019
[    2.884292] RIP: 0010:__kthread_should_park+0x0/0x20
[    2.884295] Code: ff ff e8 e3 6a 04 00 e9 b4 fe ff ff 48 89 df e8 b6 69 04 00 84 c0 0f 84 79 ff ff ff e9 04 ff ff ff 66 0f 1f 84 00 00 00 00 00 <f6> 47 26 20 74 12 48 8b 87 f8 04 00 00 48 8b 00 48 c1 e8 02 83 e0
[    2.884300] RSP: 0018:ffffb00b00a27e30 EFLAGS: 00010246
[    2.884302] RAX: 7fffffffffffffff RBX: ffff9d55885c93b0 RCX: 00000000abe66f7e
[    2.884304] RDX: 0000000000000001 RSI: 0000000000000202 RDI: 0000000000000000
[    2.884306] RBP: ffffb00b00a27e70 R08: 0000000000000000 R09: ffffffffbec09100
[    2.884308] R10: ffff9d5597a0c380 R11: 0000000000000294 R12: ffff9d55885c9538
[    2.884310] R13: ffff9d559716f880 R14: ffff9d55885c93b0 R15: ffffb00b007c77f0
[    2.884312] FS:  0000000000000000(0000) GS:ffff9d559ee00000(0000) knlGS:0000000000000000
[    2.884315] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    2.884316] CR2: 0000000000000026 CR3: 0000000fce6f0000 CR4: 0000000000340ef0
[    2.884318] Call Trace:
[    2.884321]  ? drm_sched_get_cleanup_job+0x44/0x140 [gpu_sched]
[    2.884323]  drm_sched_main+0x52/0x440 [gpu_sched]
[    2.884325]  kthread+0x117/0x130
[    2.884327]  ? drm_sched_get_cleanup_job+0x140/0x140 [gpu_sched]
[    2.884328]  ? kthread_create_worker_on_cpu+0x40/0x40
[    2.884331]  ret_from_fork+0x22/0x40
[    2.884332] Modules linked in: irqbypass snd_hda_codec_realtek(+) snd_hda_codec_generic snd_hda_codec_hdmi snd_hda_intel snd_intel_dspcfg crct10dif_pclmul amdgpu(+) snd_hda_codec crc32_pclmul snd_hwdep crc32c_intel snd_hda_core efi_pstore ghash_clmulni_intel snd_seq gpu_sched aesni_intel i2c_algo_bit snd_seq_device glue_helper input_leds drm_kms_helper crypto_simd ccp snd_pcm cryptd led_class efivars syscopyarea wmi_bmof pcspkr rng_core sysfillrect r8169 sysimgblt sha256_generic fb_sys_fops k10temp libsha256 ttm snd_timer sr_mod realtek cdrom i2c_piix4 xhci_pci snd sha1_generic libphy xhci_hcd wmi 8250 8250_base serial_core evdev drm backlight ip_tables x_tables ipv6 nf_defrag_ipv6
[    2.884353] CR2: 0000000000000026
[    2.884355] ---[ end trace 81a5d931ae80b751 ]---
[    2.908803] [drm] use_doorbell being set to: [true]
[    2.920851] [drm] use_doorbell being set to: [true]
[    2.957062] [drm] Found VCN firmware Version ENC: 1.7 DEC: 4 VEP: 0 Revision: 17
[    2.957825] [drm] PSP loading VCN firmware
[    3.017921] RIP: 0010:__kthread_should_park+0x0/0x20
[    3.017924] Code: ff ff e8 e3 6a 04 00 e9 b4 fe ff ff 48 89 df e8 b6 69 04 00 84 c0 0f 84 79 ff ff ff e9 04 ff ff ff 66 0f 1f 84 00 00 00 00 00 <f6> 47 26 20 74 12 48 8b 87 f8 04 00 00 48 8b 00 48 c1 e8 02 83 e0
[    3.017929] RSP: 0018:ffffb00b00a27e30 EFLAGS: 00010246
[    3.017931] RAX: 7fffffffffffffff RBX: ffff9d55885c93b0 RCX: 00000000abe66f7e
[    3.017933] RDX: 0000000000000001 RSI: 0000000000000202 RDI: 0000000000000000
[    3.017935] RBP: ffffb00b00a27e70 R08: 0000000000000000 R09: ffffffffbec09100
[    3.017937] R10: ffff9d5597a0c380 R11: 0000000000000294 R12: ffff9d55885c9538
[    3.017938] R13: ffff9d559716f880 R14: ffff9d55885c93b0 R15: ffffb00b007c77f0
[    3.017941] FS:  0000000000000000(0000) GS:ffff9d559ee00000(0000) knlGS:0000000000000000
[    3.017943] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    3.017945] CR2: 0000000000000026 CR3: 0000000fce6f0000 CR4: 0000000000340ef0
[    3.047138] input: HDA Digital PCBeep as /devices/pci0000:00/0000:00:08.1/0000:0b:00.4/sound/card1/input10
[    3.047183] input: HD-Audio Generic Front Mic as /devices/pci0000:00/0000:00:08.1/0000:0b:00.4/sound/card1/input11
[    3.047242] input: HD-Audio Generic Rear Mic as /devices/pci0000:00/0000:00:08.1/0000:0b:00.4/sound/card1/input12
[    3.047280] input: HD-Audio Generic Line as /devices/pci0000:00/0000:00:08.1/0000:0b:00.4/sound/card1/input13
[    3.047506] input: HD-Audio Generic Line Out Front as /devices/pci0000:00/0000:00:08.1/0000:0b:00.4/sound/card1/input14
[    3.047708] input: HD-Audio Generic Line Out Surround as /devices/pci0000:00/0000:00:08.1/0000:0b:00.4/sound/card1/input15
[    3.048082] input: HD-Audio Generic Line Out CLFE as /devices/pci0000:00/0000:00:08.1/0000:0b:00.4/sound/card1/input16
[    3.048278] input: HD-Audio Generic Front Headphone as /devices/pci0000:00/0000:00:08.1/0000:0b:00.4/sound/card1/input17

I'll now try compiling 5.6.6, and then try GCC 9.1 instead.
Comment 1 Artem S. Tashkinov 2020-05-03 17:45:21 UTC
OK, both issues are gone with 5.6.6 compiled with GCC 10.

Looks like we have two regressions:

Audio broke.
DRM broke for my RX 5600 XT.
Comment 2 Artem S. Tashkinov 2020-05-03 17:58:01 UTC
Looks like the same issue is discussed here:

https://lkml.org/lkml/2020/4/10/545

Alex, am I correct that the patch

https://cgit.freedesktop.org/drm/drm-misc/commit/?h=drm-misc-fixes&id=8623b5255ae7ccaf276aac3920787bf575fa6b37

should fix my issue?
Comment 3 Artem S. Tashkinov 2020-05-03 18:00:04 UTC
Is this patch scheduled for 5.6.11?
Comment 4 Linus Torvalds 2020-05-03 18:01:40 UTC
Bisection? 

The oops looks like __kthread_should_park() called with a NULL argument.

That should have been fixed by commit 8623b5255ae7 ("drm/scheduler: fix drm_sched_get_cleanup_job") in mainline.

I'm not sure why this started showing up in -stable.
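
As a sanity check on that reading, the faulting bytes marked `<f6> 47 26 20` in the Code: line can be decoded by hand. Below is a minimal Python sketch, assuming the standard x86-64 encoding of `test r/m8, imm8` with an 8-bit displacement. The helper name `decode_test_rm8_imm8` is made up for illustration and handles only this one instruction form.

```python
# Register names indexed by the 3-bit ModRM.rm field (no REX prefix present).
REGS = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"]


def decode_test_rm8_imm8(code: bytes):
    """Decode `test byte ptr [reg + disp8], imm8` (opcode F6 /0, mod=01).

    Hypothetical helper: it handles only this single encoding, which is
    all that is needed for the bytes in the oops.
    """
    opcode, modrm, disp8, imm8 = code[:4]
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if opcode != 0xF6 or reg != 0 or mod != 1 or rm == 4:
        raise ValueError("not an F6 /0 instruction with a plain disp8 operand")
    # The 8-bit displacement is sign-extended.
    disp = disp8 - 256 if disp8 >= 0x80 else disp8
    return REGS[rm], disp, imm8


base_reg, disp, mask = decode_test_rm8_imm8(bytes.fromhex("f6472620"))
rdi = 0x0  # RDI value taken from the register dump in the description
fault_addr = rdi + disp
print(f"testb ${mask:#x}, {disp:#x}(%{base_reg}) -> address {fault_addr:#018x}")
```

The computed address equals the CR2 value in the oops (0x26) only because RDI, the first argument register, is 0, which is consistent with the function having been handed a NULL pointer.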
Comment 5 Linus Torvalds 2020-05-03 18:03:29 UTC
Can you try to bisect the audio breakage, that seems to be something else.

You'd need to add that oneliner from commit 8623b5255ae7 to avoid the drm breakage.
Comment 6 Artem S. Tashkinov 2020-05-03 19:07:42 UTC
(In reply to Linus Torvalds from comment #5)
> Can you try to bisect the audio breakage, that seems to be something else.
> 
> You'd need to add that oneliner from commit 8623b5255ae7 to avoid the drm
> breakage.

So, after three compilations:

Kernel 5.6.6  compiled with GCC 10 (Fedora 32): all is fine.
Kernel 5.6.10 compiled with GCC 10 (Fedora 32): DRM bug/ALSA broken.
Kernel 5.6.10 compiled with GCC 9  (Fedora 31): all is fine.

Looks like GCC 10 generates invalid code once again, at least its Fedora 32 version. GCC 10 hasn't yet been formally released.

I'm now running 5.6.10 compiled with GCC 9 and everything is OK.

No idea what to do next. Perhaps it's worth closing this bug report as RESOLVED/INVALID. And bug 207563 as well.
Comment 7 Linus Torvalds 2020-05-03 19:16:41 UTC
(In reply to Artem S. Tashkinov from comment #6)
> 
> So, after three compilations:
> 
> Kernel 5.6.6  compiled with GCC 10 (Fedora 32): all is fine.
> Kernel 5.6.10 compiled with GCC 10 (Fedora 32): DRM bug/ALSA broken.
> Kernel 5.6.10 compiled with GCC 9  (Fedora 31): all is fine.
> 
> Looks like GCC 10 generates invalid code once again, at least its Fedora 32
> version. GCC 10 hasn't yet been formally released.

Uhhuh. Potential compiler bugs are not fun to chase.

It might be a real kernel bug that is just exposed by the compiler change, of course, but ...

Is there any obvious change in the ALSA output to give a hint of where the breakage might be?
Comment 8 Artem S. Tashkinov 2020-05-03 19:25:09 UTC
(In reply to Linus Torvalds from comment #7)
> Uhhuh. Potential compiler bugs are not fun to chase.
> 
> It might be a real kernel bug that is just exposed by the compiler change,
> of course, but ...
> 
> Is there any obvious change in the ALSA output to give a hint of where the
> breakage might be?

But ... I'm now running 5.6.10 compiled with GCC 9.3 and everything works. :-) 

There are ALSA-related changes between 5.6.6 and 5.6.10, but I'm not sure they are relevant if the problem is down to the compiler. My thinking is that if GCC 10 miscompiles the kernel, a single miscompilation could cause breakage all over the place. I really don't know what to do.

I'm compiling the vanilla kernel using the default GCC 10 compiler in Fedora 32, so I guess you could take a look at that. I can attach my .config if that's of any help.
Comment 9 Linus Torvalds 2020-05-03 19:26:42 UTC
(In reply to Artem S. Tashkinov from comment #8)
> > 
> > Is there any obvious change in the ALSA output to give a hint of where the
> > breakage might be?
> 
> But ... I'm now running 5.6.10 compiled with GCC 9.3 and everything works.
> :-) 

I meant between the (working) gcc-9 and (broken) gcc-10 case.

Apply the drm one-liner fix to get past that one (assuming it does fix the gcc10 case for drm).

           Linus
Comment 10 Artem S. Tashkinov 2020-05-03 19:41:50 UTC
(In reply to Linus Torvalds from comment #9)
> (In reply to Artem S. Tashkinov from comment #8)
> > > 
> > > Is there any obvious change in the ALSA output to give a hint of where
> > > the breakage might be?
> > 
> > But ... I'm now running 5.6.10 compiled with GCC 9.3 and everything works.
> > :-) 
> 
> I meant between the (working) gcc-9 and (broken) gcc-10 case.
> 
> Apply the drm one-liner fix to get past that one (assuming it does
> fix the gcc10 case for drm).
> 

With the applied one liner everything works as intended with GCC 10. I'm baffled.
Comment 11 Linus Torvalds 2020-05-03 19:48:43 UTC
(In reply to Artem S. Tashkinov from comment #10)
> 
> With the applied one liner everything works as intended with GCC 10. I'm
> baffled.

Ok, then the sound problem was likely just due to the oops.

The oops might just have broken some kthread functionality that sound depended on or whatever.

I think you can close this, although you should probably make sure the stable people know about that oneliner fix.
Comment 12 Artem S. Tashkinov 2020-05-03 19:54:58 UTC
(In reply to Linus Torvalds from comment #11)
> (In reply to Artem S. Tashkinov from comment #10)
> > 
> > With the applied one liner everything works as intended with GCC 10. I'm
> > baffled.
> 
> Ok, then the sound problem was likely just due to the oops.
> 
> The oops might just have broken some kthread functionality that sound
> depended on or whatever.
> 
> I think you can close this, although you should probably make sure the
> stable people know about that oneliner fix.

I'd like to hear from Alex Deucher, as I'm not even sure why kernel 5.6.10 compiled with GCC 9.3 works without this patch, while the same kernel compiled with GCC 10 breaks.

I guess he can also propose this oneliner for stable.
Comment 13 Linus Torvalds 2020-05-03 20:00:44 UTC
(In reply to Artem S. Tashkinov from comment #12)
> 
> I'd like to hear from Alex Deucher as I'm not even sure why kernel
> 5.6.10/GCC 9.3 without this patch works, but the same kernel compiled with
> GCC 10 breaks.

Well, it is touted as a fix for a race.

So just timing differences may make the race happen or not.

And a compiler change will obviously cause timing differences.

So that part I'm not surprised about, although getting Alex to say "yes, backport it", is likely a good idea anyway.

That said, bugzilla often gets a lot less attention than just emailing people. Hint hint.

            Linus
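
To illustrate why timing alone can decide whether this oops fires, here is a userspace Python model of the race. It is only an analogy, not kernel code: the class `Sched`, the function `run_once`, and the `worker_first` and `use_current` flags are invented for the example, and events are used to force each ordering deterministically. It also rests on an assumption about the fix: my understanding of commit 8623b5255ae7 is that it replaces the check on `sched->thread` with a check on the current thread, and the `use_current` variant models that.

```python
import threading


class Sched:
    """Stand-in for the scheduler object; `thread` is filled in by the creator."""

    def __init__(self):
        self.thread = None


def run_once(worker_first: bool, use_current: bool) -> bool:
    """Return True if the worker observed a usable thread handle."""
    sched = Sched()
    assigned = threading.Event()   # set once sched.thread has been assigned
    observed = threading.Event()   # set once the worker has looked
    result = []

    def worker():
        if not worker_first:
            assigned.wait()        # creator wins the race
        # Racy variant reads the shared field; assumed fix asks for "current".
        handle = threading.current_thread() if use_current else sched.thread
        result.append(handle is not None)
        observed.set()

    t = threading.Thread(target=worker)
    t.start()                      # the new thread may start running right away
    if worker_first:
        observed.wait()            # worker wins the race
    sched.thread = t               # the assignment the worker is racing against
    assigned.set()
    t.join()
    return result[0]


print("creator first, shared field:", run_once(False, False))
print("worker first,  shared field:", run_once(True, False))
print("worker first,  current     :", run_once(True, True))
```

In the model, only the ordering where the worker runs before the assignment sees an empty handle, and checking the current thread is immune to the ordering. A different compiler can shift which ordering happens in practice without either build being miscompiled.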
Comment 14 Alex Deucher 2020-05-03 21:46:00 UTC
Yes, backport it :)
Comment 15 Artem S. Tashkinov 2020-05-03 21:59:58 UTC
(In reply to Alex Deucher from comment #14)
> Yes, backport it :)

Tested-by: Artem S. Tashkinov

Please submit for stable (5.6.11).

Thank you!
