Since kernels 5.10.67 and 5.14.6, booting Linux crashes the bridgeOS found on Apple Macs. The issue does not occur on 5.10.66 and 5.14.5, which boot correctly.
Created attachment 298939
bridgeOS crash log (the security chip)

I can reproduce this on the same model, running Arch Linux with the stock LTS 5.10.67 kernel and no DKMS modules. After the kernel starts, it freezes after a few debug messages (I'll take a photo of these and try to type them out in another comment here soon), and then the computer shuts off. The T2 security processor, which runs "bridgeOS", panics when Linux boots on this version, and that crashes the computer. There is a crash log for bridgeOS (I got it by booting into macOS after Linux crashed). It's probably not helpful for us, but I've attached it anyway.
Created attachment 298941
log when booting, before the crash

This is what is printed when booting 5.10.67. The kernel command line is empty, so there shouldn't be any unsafe options there. After the IOAPIC message it hangs for a few seconds, then the fans spin up to high speed (a symptom of the T2 security chip panicking), and then the computer shuts off. Older, working kernels print the same messages as in this attachment, but they don't crash and go on to boot properly.
I've isolated it to this commit on the LTS tree:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.10.y&id=240a7025a6f89f9596c36134bd07f3855c56c712

On the main tree, the equivalent commit is:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e7006de6c23803799be000a5dcce4d916a36541a

These MacBooks have the SSD as a PCIe device which is part of the T2 chip:

04:00.0 Mass storage controller: Apple Inc. ANS2 NVMe Controller (rev 01)

There was a quirk for this SSD added in 5.4:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=66341331ba0d2de4ff421cdc401a1e34de50502a

I don't know whether that is related to this issue.
The NVMe commit that broke it has this:

> This means that we are giving up some possible queue depth as 12 bits
> allow for a maximum queue depth of 4095 instead of 65536, however we
> never create such long queues anyways so no real harm done.

The one that added support for the Apple SSD has:

> This adds support for Apple weird implementation of NVME in their
> 2018 or later machines. It accounts for the twice-as-big SQ entries
> for the IO queues

If these are talking about the same queues, then losing the larger queue size might be the issue.
Looking at the T2's panic log, it's probably not the queue sizes:

assert failed: [7447]:command id out of range error (cid = 4120), status_reg: 0x2000

Could it be that the extra bits the commit packs into the command_id make the value too high for the T2?
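The arithmetic behind that guess can be sketched as follows. This is only an illustration, not the kernel code: it assumes the tag occupies the low 12 bits of the 16-bit command id (the 12-bit figure comes from the commit message quoted earlier in this thread) and that whatever the commit adds lives in the remaining upper 4 bits (an assumption made here for illustration). The helper name `split_cid` is hypothetical.

```python
# Sketch only: the bit layout is an assumption, see the note above.
CID_BITS = 16   # NVMe command identifiers are 16 bits wide
TAG_BITS = 12   # "12 bits" for the tag, per the quoted commit message

def split_cid(cid):
    """Split a command id into (upper_bits, tag) under the assumed layout."""
    tag = cid & ((1 << TAG_BITS) - 1)   # low 12 bits
    upper = cid >> TAG_BITS             # remaining 4 bits
    return upper, tag

# The value reported in the bridgeOS assert.
upper, tag = split_cid(4120)
print(f"cid=4120 -> upper bits={upper}, tag={tag}")

# Number of distinct values a 12-bit tag can take, versus the full 16 bits.
print(1 << TAG_BITS, 1 << CID_BITS)
```

Under these assumptions, 4120 is 4096 + 24, so the tag itself would be a small number (24) and the value only looks out of range because one of the upper bits is set. That fits the idea that the T2 validates the whole command id against its queue size, although the panic log alone does not confirm this.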
Indeed, the commit (https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e7006de6c23803799be000a5dcce4d916a36541a) has broken these kernels on machines with Apple SSDs. It needs to be reverted.
Created attachment 298967
Proposed patch

This seems to fix the bug.