Bug 217514 - [amdgpu] system doesn't boot after linux-firmware 2023-05-23 ffe1a41e
Status: RESOLVED PATCH_ALREADY_AVAILABLE
Alias: None
Product: Drivers
Classification: Unclassified
Component: Video(DRI - non Intel)
Hardware: All Linux
Importance: P3 normal
Assignee: drivers_video-dri
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2023-05-31 16:35 UTC by rLy
Modified: 2023-05-31 19:44 UTC

See Also:
Kernel Version:
Subsystem:
Regression: No
Bisected commit-id:


Attachments
softlockup (400.57 KB, text/plain)
2023-05-31 16:35 UTC, rLy
amdgpu_error (250.87 KB, text/plain)
2023-05-31 16:36 UTC, rLy

Description rLy 2023-05-31 16:35:25 UTC
Created attachment 304361
softlockup

Updating linux-firmware to the latest git version causes my PC to lock up during boot. I have a 3900X paired with a 7900 XTX, running Arch Linux with the 6.3.4 xanmod kernel (this also happens with the kernel from the core repo) and mesa 23.1.1, if that matters.
During boot I see the following error printed and the system locks up completely; only a hard reset helps:
`May 31 07:20:40 valhalla kernel: watchdog: BUG: soft lockup - CPU#5 stuck for 26s! [swapper/5:0]`

It is accompanied by many amdgpu errors in the journal (both the soft lockup and the amdgpu errors are followed by a stack trace):
```
May 31 07:20:44 valhalla kernel: amdgpu 0000:0c:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:9 pasid:32768, for process  pid 0 thread  pid 0)
May 31 07:20:44 valhalla kernel: amdgpu 0000:0c:00.0: amdgpu:   in page starting at address 0x0000ffff0021a000 from client 10
May 31 07:20:44 valhalla kernel: amdgpu 0000:0c:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00900831
May 31 07:20:44 valhalla kernel: amdgpu 0000:0c:00.0: amdgpu:          Faulty UTCL2 client ID: CPF (0x4)
May 31 07:20:44 valhalla kernel: amdgpu 0000:0c:00.0: amdgpu:          MORE_FAULTS: 0x1
May 31 07:20:44 valhalla kernel: amdgpu 0000:0c:00.0: amdgpu:          WALKER_ERROR: 0x0
May 31 07:20:44 valhalla kernel: amdgpu 0000:0c:00.0: amdgpu:          PERMISSION_FAULTS: 0x3
May 31 07:20:44 valhalla kernel: amdgpu 0000:0c:00.0: amdgpu:          MAPPING_ERROR: 0x0
May 31 07:20:44 valhalla kernel: amdgpu 0000:0c:00.0: amdgpu:          RW: 0x0

```
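
For anyone cross-checking the decoded fields against the raw register value, here is a minimal Python sketch that splits `GCVM_L2_PROTECTION_FAULT_STATUS` into its fields. The bit positions in `FIELDS` are an assumption, not taken from this report; they were chosen because they reproduce the values the driver printed for `0x00900831`. The `CID` name and `decode_fault_status` helper are hypothetical.

```python
# Assumed bit layout (shift, width) for GCVM_L2_PROTECTION_FAULT_STATUS.
# This layout is an assumption; it is only checked against the one value
# printed in the log above (0x00900831).
FIELDS = {
    "MORE_FAULTS": (0, 1),
    "WALKER_ERROR": (1, 3),
    "PERMISSION_FAULTS": (4, 4),
    "MAPPING_ERROR": (8, 1),
    "CID": (9, 9),  # the "Faulty UTCL2 client ID" in the log
    "RW": (18, 1),
}


def decode_fault_status(status):
    """Return a dict of field name -> value extracted from the raw status."""
    return {
        name: (status >> shift) & ((1 << width) - 1)
        for name, (shift, width) in FIELDS.items()
    }


decoded = decode_fault_status(0x00900831)
for name, value in decoded.items():
    print(f"{name}: {value:#x}")
```

Under that assumed layout the decoded values match the log: client ID 0x4, `MORE_FAULTS` 0x1, `PERMISSION_FAULTS` 0x3, and the remaining fields 0x0.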

The full journal log is in the "softlockup" attachment.

The issues start with [this commit, ffe1a41e](https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/commit/?id=ffe1a41e2ddbc39109b12d95dcac282d90eba8fc). With that commit I do not get the soft lockup mentioned above; instead, after the initramfs loads, the BIOS splash screen comes back and the system stays stuck there.
Different amdgpu errors (followed by a stack trace) appear in this case:
```
May 31 09:18:37 valhalla kernel: amdgpu 0000:0c:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000006 SMN_C2PMSG_82:0x00000000
May 31 09:18:37 valhalla kernel: amdgpu 0000:0c:00.0: amdgpu: Failed to enable requested dpm features!
May 31 09:18:37 valhalla kernel: amdgpu 0000:0c:00.0: amdgpu: Failed to setup smc hw!
May 31 09:18:37 valhalla kernel: [drm:amdgpu_device_init [amdgpu]] *ERROR* hw_init of IP block <smu> failed -62
May 31 09:18:37 valhalla kernel: amdgpu 0000:0c:00.0: amdgpu: amdgpu_device_ip_init failed
May 31 09:18:37 valhalla kernel: amdgpu 0000:0c:00.0: amdgpu: Fatal error during GPU init
May 31 09:18:37 valhalla kernel: amdgpu 0000:0c:00.0: amdgpu: amdgpu: finishing device.
```
The logs for this case are in the "amdgpu_error" attachment.
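
As a reading aid for the `hw_init of IP block <smu> failed -62` line: kernel functions usually return negated errno values, and on Linux errno 62 is `ETIME` ("Timer expired"), which fits the SMU not answering the previous command. The sketch below pulls the block name and code out of such a line. The `explain_hw_init_failure` helper and the small lookup table are hypothetical, written only for illustration, and the table hard-codes a few Linux generic errno numbers rather than using the local platform's.

```python
import re

# A few Linux generic errno numbers (hard-coded on purpose, since errno
# numbering differs between operating systems).
LINUX_ERRNO_NAMES = {12: "ENOMEM", 22: "EINVAL", 62: "ETIME", 110: "ETIMEDOUT"}

# Matches the amdgpu message format seen in the journal excerpt above.
PATTERN = re.compile(r"hw_init of IP block <(?P<block>[^>]+)> failed (?P<code>-?\d+)")


def explain_hw_init_failure(line):
    """Return (block, code, errno_name) for a matching line, else None."""
    match = PATTERN.search(line)
    if match is None:
        return None
    code = int(match.group("code"))
    # The kernel reports negated errno values, so look up the absolute value.
    return match.group("block"), code, LINUX_ERRNO_NAMES.get(abs(code), "unknown")


line = (
    "May 31 09:18:37 valhalla kernel: [drm:amdgpu_device_init [amdgpu]] "
    "*ERROR* hw_init of IP block <smu> failed -62"
)
result = explain_hw_init_failure(line)
print(result)
```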

Note that at the end of that log the system appears to be running, but since I only saw the BIOS splash screen I rebooted via SysRq (REISUB).

The commit after ffe1a41e ([56832557](https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/commit/?id=568325574a3b6148f3296984aa24fcd1fb4b912c), or possibly the one after that, [39d6fcc7](https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/commit/?id=39d6fcc73100ae4aeeec0194bbf102c672673edd); I'm not sure at the moment) gets past the splash screen, but that is where the soft lockup starts to happen.
Comment 1 rLy 2023-05-31 16:36:06 UTC
Created attachment 304362
amdgpu_error
Comment 3 rLy 2023-05-31 19:44:33 UTC
(In reply to Alex Deucher from comment #2)
> Does this kernel change fix the issues?
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5ee33d905f89c18d4b33da6e5eefdae6060502df

Well, it turns out Arch updated the kernel to 6.3.5 yesterday evening, and xanmod did so this morning, which I hadn't noticed earlier (I first encountered the issue about 2 days ago). That version already includes this patch, and it is indeed working now. I guess this can be closed.
Thank you!
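
For others hitting this on the 6.3.y series, a rough way to check whether the running kernel is at least 6.3.5 (the version reported above as already containing the patch) is to compare the numeric prefix of the release string. This is only a sketch: `release_at_least` is a hypothetical helper, and the comparison says nothing about other kernel series or about distributions that backport the fix to an older version number.

```python
import platform
import re


def release_at_least(release, minimum):
    """Compare the leading 'X.Y[.Z]' of a kernel release string to a tuple."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    # A missing patch level (e.g. '6.4-rc1') is treated as 0.
    version = tuple(int(part) if part else 0 for part in match.groups())
    return version >= minimum


# Suffixes such as '-arch1-1' or '-xanmod1' are ignored by the regex.
print(release_at_least("6.3.5-arch1-1", (6, 3, 5)))
print(release_at_least("6.3.4-xanmod1", (6, 3, 5)))

# On a Linux machine, platform.release() returns the same string as `uname -r`.
print(platform.release())
```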
