Bug 214509 - Kernel fails to boot on Apple 2018+ Macs due to bug in NVMe driver which arose in 5.14.6 mainline and 5.10.67 LTS.
Status: RESOLVED CODE_FIX
Alias: None
Product: IO/Storage
Classification: Unclassified
Component: NVMe
Hardware: All Linux
Importance: P1 high
Assignee: IO/NVME Virtual Default Assignee
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2021-09-24 04:32 UTC by gargaditya08
Modified: 2021-10-04 14:06 UTC

See Also:
Kernel Version: 5.14.6 and 5.10.67
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
bridgeOS crash log (the security chip) (30.34 KB, text/plain)
2021-09-24 05:06 UTC, orlandoch.dev
log when booting, before the crash (1.46 KB, text/plain)
2021-09-24 05:34 UTC, orlandoch.dev
Proposed patch seems to fix the bug. (8.35 KB, patch)
2021-09-25 04:35 UTC, gargaditya08

Description gargaditya08 2021-09-24 04:32:48 UTC
Changes introduced in kernels 5.10.67 and 5.14.6 cause them to crash bridgeOS, which runs on the T2 chip in Apple Macs. Kernels 5.10.66 and 5.14.5 are not affected and boot correctly.
Comment 1 orlandoch.dev 2021-09-24 05:06:59 UTC
Created attachment 298939
bridgeOS crash log (the security chip)

I can reproduce this on the same model, running Arch Linux with the stock LTS 5.10.67 kernel and no DKMS modules. After the kernel starts, it freezes after a few debug messages (I'll take a photo of these and try to type them out in another comment here soon), and then the computer shuts off.

The T2 security processor, which runs "bridgeOS", panics when this kernel version boots, and that crashes the computer. There is a crash log for bridgeOS (I got it by booting into macOS after Linux crashed). It's probably not helpful for us, but I've attached it anyway.
Comment 2 orlandoch.dev 2021-09-24 05:34:03 UTC
Created attachment 298941
log when booting, before the crash

This is what is printed when booting 5.10.67. The kernel command line is empty, so there shouldn't be any unsafe options there. After the IOAPIC message it hangs for a few seconds, then the fans spin up (a symptom of the T2 security chip panicking), and then the computer shuts off.

Older, working kernels print the same messages as in this attachment, but they don't crash and boot properly.
Comment 3 orlandoch.dev 2021-09-24 09:47:06 UTC
I've isolated it to this commit on the LTS tree: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.10.y&id=240a7025a6f89f9596c36134bd07f3855c56c712


On the main tree, the equivalent commit is https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e7006de6c23803799be000a5dcce4d916a36541a .


On these MacBooks, the SSD is a PCIe device that is part of the T2 chip.

04:00.0 Mass storage controller: Apple Inc. ANS2 NVMe Controller (rev 01)


A quirk for this SSD was added in 5.4 (https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=66341331ba0d2de4ff421cdc401a1e34de50502a). I don't know whether that is related to this issue.
Comment 4 orlandoch.dev 2021-09-24 10:22:37 UTC
The nvme commit that broke it has this:

> This means that we are giving up some possible queue depth as 12 bits
> allow for a maximum queue depth of 4095 instead of 65536, however we
> never create such long queues anyways so no real harm done.

The one that added support for the Apple SSD has:

> This adds support for Apple weird implementation of NVME in their
> 2018 or later machines. It accounts for the twice-as-big SQ entries
> for the IO queues

If these are talking about the same queues then losing the larger queue size might be the issue.
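
For reference, the trade-off described in the first quote can be sketched roughly as below. This is only an illustration, not the kernel code: it assumes the commit reserves the upper 4 bits of the 16-bit command_id for a generation counter and keeps the lower 12 bits for the tag, and the names `TAG_BITS`, `GEN_BITS`, and `pack_cid` are made up for this example.

```python
# Hypothetical sketch: a 16-bit command_id split into a 4-bit generation
# counter (upper bits) and a 12-bit tag (lower bits). The layout is an
# assumption based on the quoted commit message; all names are illustrative.
CID_BITS = 16
TAG_BITS = 12
GEN_BITS = CID_BITS - TAG_BITS

TAG_MASK = (1 << TAG_BITS) - 1
GEN_MASK = (1 << GEN_BITS) - 1


def pack_cid(genctr, tag):
    """Combine a generation counter and a tag into one 16-bit command_id."""
    if not 0 <= tag <= TAG_MASK:
        raise ValueError("tag does not fit in 12 bits")
    return ((genctr & GEN_MASK) << TAG_BITS) | tag


max_tag = TAG_MASK           # largest tag that still fits in 12 bits
cid_gen0 = pack_cid(0, 5)    # generation 0: the cid equals the tag
cid_gen1 = pack_cid(1, 5)    # generation 1: same tag, but a much larger cid
```

Under this assumed layout, the usable tag range shrinks to 12 bits, and the value sent to the controller is no longer a small number even when the tag is.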
Comment 5 orlandoch.dev 2021-09-24 12:22:34 UTC
Looking at the T2's panic log, it looks like it's probably not the queue sizes:

assert failed: [7447]:command id out of range error (cid = 4120), status_reg: 0x2000

This might be the extra bits that the commit packs into the upper part of the command_id making it too high?
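
A quick way to check this guess is to split the reported cid into its two fields. The sketch below assumes the layout from the offending commit is a 4-bit counter in the upper bits and a 12-bit tag in the lower bits; the `unpack_cid` helper is hypothetical and only meant to illustrate the arithmetic.

```python
# Hypothetical helper: split a 16-bit command_id into (counter, tag),
# assuming the upper 4 bits hold a counter and the lower 12 bits the tag.
def unpack_cid(cid):
    genctr = (cid >> 12) & 0xF
    tag = cid & 0xFFF
    return genctr, tag


# The cid reported in the T2 panic log.
genctr, tag = unpack_cid(4120)
print(hex(4120), genctr, tag)
```

If that layout is right, cid = 4120 is a small tag with a non-zero counter in front of it. A controller that expects the command ID to stay below some small bound (an assumption; the log does not state the limit) would then reject it even though the tag itself is in range.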
Comment 6 gargaditya08 2021-09-24 14:23:03 UTC
Indeed the commit (https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e7006de6c23803799be000a5dcce4d916a36541a) has broken the kernels for Apple SSDs. This one needs to be reverted to the original version.
Comment 7 gargaditya08 2021-09-25 04:35:32 UTC
Created attachment 298967
Proposed patch seems to fix the bug.
