199405 – Linux 4.17-rc1 stack dumps that were not present in 4.16.2 kernel related to Thunderbolt and related subsystems.

Status:	NEW

Alias:	None

Product:	Other
Classification:	Unclassified
Component:	Modules
Hardware:	All Linux

Importance:	P1 normal
Assignee:	other_modules

URL:
Keywords:

Depends on:
Blocks:

Reported:	2018-04-16 05:08 UTC by Nicholas Johnson
Modified:	2018-04-16 05:09 UTC
CC List:	0 users

See Also:
Kernel Version:	4.17-rc1
Subsystem:
Regression:	No
Bisected commit-id:

Attachments
dmesg output (104.64 KB, text/plain) 2018-04-16 05:08 UTC, Nicholas Johnson

Description Nicholas Johnson 2018-04-16 05:08:39 UTC

Created attachment 275393: dmesg output

I have attached the dmesg output as the file "dmesg-report".

Hardware:
Dell XPS 9370 i7-8650U (not a typo, it is not the common i7-8550U) 16GB DDR3L.
    -Thunderbolt controller = 8086:15D2 AKA JHL6540 | NVM28 firmware
Gigabyte Aorus Thunderbolt 3 Gaming Box External Graphics
    -Thunderbolt controller = 8086:1577 AKA DSL6540 | NVM25 firmware
    -With AMD Radeon R9 Nano (Fiji XT) installed inside

The kernel came from here: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.17-rc1/

The faults occur when I try to approve (authorize) the external graphics enclosure. Sometimes initialization hangs in the ATOM BIOS stage, with output like the following (the same behavior as on the 4.16.2 kernel):

[Apr 6 13:31] [drm] amdgpu kernel modesetting enabled.
[ +0.000272] [drm] initializing kernel modesetting (FIJI 0x1002:0x7300 0x1002:0x0B36 0xCA).
[ +0.000050] [drm] register mmio base: 0xC4000000
[ +0.000001] [drm] register mmio size: 262144
[ +0.000020] [drm] probing gen 2 caps for device 8086:1578 = 1715c43/e
[ +0.000004] [drm] probing mlw for device 8086:1578 = 1715c43
[ +0.000010] [drm] UVD is enabled in physical mode
[ +0.000001] [drm] VCE enabled in physical mode
[ +1.617919] ATOM BIOS: 113-C8820200-107
[ +0.000016] [drm] GPU posting now...
[ +5.370805] [drm:atom_op_jump [amdgpu]] *ERROR* atombios stuck in loop for more than 5secs aborting
[ +0.000022] [drm:amdgpu_atom_execute_table_locked [amdgpu]] *ERROR* atombios stuck executing B456 (len 304, WS 4, PS 0) @ 0xB51B
[ +0.000019] [drm:amdgpu_atom_execute_table_locked [amdgpu]] *ERROR* atombios stuck executing B104 (len 183, WS 0, PS 8) @ 0xB17E
[ +0.000002] amdgpu 0000:3c:00.0: gpu post error!
[ +0.000002] amdgpu 0000:3c:00.0: Fatal error during GPU init
[ +0.000017] [drm] amdgpu: finishing device.
[ +0.000198] amdgpu: probe of 0000:3c:00.0 failed with error -22
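
For triage, the error lines in an excerpt like the one above can be pulled out together with their elapsed time. The following is only a minimal sketch: it assumes the `[ +seconds]` delta-timestamp format shown in the excerpt, and the function name `error_lines` is made up for illustration.

```python
import re

# Matches delta-timestamp lines such as "[ +5.370805] message".
DELTA_LINE = re.compile(r"^\[\s*\+(\d+\.\d+)\]\s*(.*)$")

def error_lines(log_text):
    """Return (elapsed_seconds, message) for every delta line mentioning 'error'.

    Elapsed time is the running sum of the deltas, so it is measured from
    the first absolute-timestamp line, which itself is skipped.
    """
    elapsed = 0.0
    hits = []
    for line in log_text.splitlines():
        match = DELTA_LINE.match(line.strip())
        if not match:
            continue  # absolute-timestamp lines and anything unparseable
        elapsed += float(match.group(1))
        message = match.group(2)
        if "error" in message.lower():
            hits.append((round(elapsed, 6), message))
    return hits
```

Run against the excerpt, this would list the two "atombios stuck" lines, "gpu post error!", "Fatal error during GPU init", and the failed probe.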

You can get past this by approving the device on one port, letting it fail, unplugging the cable, and then trying again on the other port.
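
For anyone trying to reproduce this, the approval step can be scripted. This is a sketch, not the exact procedure used for this report: it assumes the kernel's Thunderbolt sysfs interface under /sys/bus/thunderbolt/devices, a security level of "user" (where writing 1 to the `authorized` attribute approves a device), and root privileges. The helper names `list_devices` and `authorize` are hypothetical.

```python
from pathlib import Path

# Assumption: default location of the Thunderbolt bus in sysfs.
SYSFS_ROOT = Path("/sys/bus/thunderbolt/devices")

def list_devices(root=SYSFS_ROOT):
    """Return {entry_name: (device_name, authorized)} for entries with an 'authorized' attribute."""
    result = {}
    for dev in sorted(Path(root).iterdir()):
        auth_file = dev / "authorized"
        if not auth_file.is_file():
            continue  # entries without this attribute (e.g. domains) are skipped
        name_file = dev / "device_name"
        name = name_file.read_text().strip() if name_file.is_file() else "?"
        result[dev.name] = (name, auth_file.read_text().strip())
    return result

def authorize(device, root=SYSFS_ROOT):
    """Approve a device by writing '1' to its 'authorized' attribute (needs root on real sysfs)."""
    (Path(root) / device / "authorized").write_text("1")
```

Calling `list_devices()` before and after `authorize(...)` shows whether the approval took effect; the entry name to pass (such as `0-1`) depends on the port the enclosure is plugged into.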

However, in 4.16.2, unplugging the Thunderbolt 3 cable from the GPU enclosure after it failed to initialize never caused any issues. With 4.17-rc1, it often causes stack dumps.

There was an earlier stack dump in thunderbolt.ko itself, but it was marked "tainted" even though I had not done anything that I would expect to taint the kernel, and I am not sure what tainted it. Either way, I know nobody is interested in tainted dumps. The one in the attached file is not tainted.
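
To find out what tainted a kernel, the bitmask in /proc/sys/kernel/tainted can be decoded. The sketch below covers only a subset of the flags from the kernel's tainted-kernels documentation, and the names `decode_taint` and `read_taint` are made up for illustration. One possible explanation, not confirmed by the attached log, is the W flag: it is set as soon as the kernel emits a warning, so any dump that follows a first warning is reported as tainted.

```python
# (bit, letter, meaning): a subset of the kernel's documented taint flags.
TAINT_FLAGS = [
    (0, "P", "proprietary module was loaded"),
    (1, "F", "module was force-loaded"),
    (2, "S", "kernel running on an out-of-spec system"),
    (3, "R", "module was force-unloaded"),
    (4, "M", "machine check exception reported"),
    (5, "B", "bad page referenced or unexpected page flags"),
    (6, "U", "taint requested by userspace"),
    (7, "D", "kernel died recently (OOPS or BUG)"),
    (8, "A", "ACPI table overridden by user"),
    (9, "W", "kernel issued a warning"),
    (10, "C", "staging driver was loaded"),
    (11, "I", "workaround for a platform firmware bug applied"),
    (12, "O", "out-of-tree module was loaded"),
    (13, "E", "unsigned module was loaded"),
    (14, "L", "soft lockup occurred"),
    (15, "K", "kernel has been live patched"),
]

def decode_taint(value):
    """Return [(letter, meaning)] for every known flag set in the bitmask."""
    return [(letter, meaning) for bit, letter, meaning in TAINT_FLAGS
            if value & (1 << bit)]

def read_taint(path="/proc/sys/kernel/tainted"):
    """Read the current taint bitmask of the running kernel."""
    with open(path) as handle:
        return int(handle.read().strip())
```

On the affected machine, `decode_taint(read_taint())` run right after the first dump would show which flags were set.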

I don't know whether the amdgpu.ko driver can take down the Thunderbolt driver. If it cannot, then it seems likely that the patches from Mika Westerberg (Intel Corp) adding Titan Ridge support and other fixes, which had been pending for months and were finally merged in 4.17-rc1, may have introduced bugs.

If any more information is needed, I can provide it.

Thank you!
