Bug 63591 - mishandled corruption of primary GPT table, failure to boot
Status: NEW
Alias: None
Product: File System
Classification: Unclassified
Component: Other
Hardware: All Linux
Importance: P1 normal
Assignee: fs_other
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2013-10-24 04:31 UTC by Chris Murphy
Modified: 2019-05-30 23:15 UTC
CC List: 5 users

See Also:
Kernel Version: 3.11.5-302
Subsystem:
Regression: No
Bisected commit-id:


Attachments
altered LBA 2 (512 bytes, application/octet-stream)
2013-10-24 04:37 UTC, Chris Murphy
rdsosreport.txt (49.40 KB, text/plain)
2013-12-04 08:02 UTC, Chris Murphy

Description Chris Murphy 2013-10-24 04:31:37 UTC
After altering a single byte within LBA 2 (the primary GPT table), the computer fails to boot: systemd reports a failed dependency for /sysroot. Relevant kernel message:

kernel: vda: unknown partition table

Because a one-byte change causes this, I'm guessing the kernel does compute and compare checksums, and determines that the primary GPT is invalid. But the kernel then does not fall back to the valid backup GPT. The primary GPT header is intact and passes its own checksum, so the location of the backup GPT should be reliable information.
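
The checks guessed at here can be sketched as follows. This is a minimal illustration in Python, not the kernel's code: it assumes the 92-byte header layout from the UEFI specification, and `make_header` and `check_gpt` are hypothetical helper names. It builds a header over an in-memory entry array, then shows that a one-byte change in the array leaves the header CRC valid while the array CRC fails.

```python
import struct
import zlib

# UEFI GPT header layout (little-endian), 92 bytes in total.
HEADER_FMT = "<8sIIIIQQQQ16sQIII"
HEADER_SIZE = struct.calcsize(HEADER_FMT)  # 92
CRC_OFFSET = 16  # offset of the header CRC32 field


def make_header(entries, my_lba=1, alternate_lba=2047, entry_size=128):
    """Build a header whose CRC fields match `entries` (hypothetical helper)."""
    fields = [
        b"EFI PART", 0x00010000, HEADER_SIZE, 0, 0,    # signature .. reserved
        my_lba, alternate_lba, 34, alternate_lba - 33,  # LBAs
        bytes(16), 2, len(entries) // entry_size, entry_size,
        zlib.crc32(entries),                            # entry array CRC32
    ]
    raw = struct.pack(HEADER_FMT, *fields)
    fields[3] = zlib.crc32(raw)  # header CRC is computed with its own field zeroed
    return struct.pack(HEADER_FMT, *fields)


def check_gpt(header, entries):
    """Return (header_ok, entries_ok) for a raw header and entry array."""
    f = struct.unpack(HEADER_FMT, header[:HEADER_SIZE])
    zeroed = header[:CRC_OFFSET] + bytes(4) + header[CRC_OFFSET + 4:HEADER_SIZE]
    header_ok = f[0] == b"EFI PART" and zlib.crc32(zeroed) == f[3]
    entries_ok = zlib.crc32(entries) == f[13]
    return header_ok, entries_ok


entries = bytearray(range(256)) * 64       # 128 entries x 128 bytes of test data
header = make_header(bytes(entries))
print(check_gpt(header, bytes(entries)))   # (True, True)

entries[0x14] ^= 0x01                      # alter a single byte in the array
print(check_gpt(header, bytes(entries)))   # (True, False)
```

With the header still valid, its alternate LBA field remains a trustworthy pointer to the backup copy, which is the point made above.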
Comment 1 Chris Murphy 2013-10-24 04:37:31 UTC
Created attachment 112161
altered LBA 2

Full sector, primary GPT table. Offset 0x14 changed from 0x45 to 0x46, i.e. EFI is changed to FFI in the BIOS boot partition type GUID, so this is non-critical information about a non-critical partition.
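
To make the change concrete: the BIOS boot partition type GUID, 21686148-6449-6E6F-744E-656564454649, is stored on disk in the mixed-endian GUID layout and reads as ASCII text. The sketch below uses Python's standard `uuid` and `zlib` modules and searches for the `EFI` bytes instead of relying on a fixed offset; it shows the altered byte and that the CRC32 over the 16-byte field changes.

```python
import uuid
import zlib

BIOS_BOOT = uuid.UUID("21686148-6449-6E6F-744E-656564454649")

on_disk = bytearray(BIOS_BOOT.bytes_le)   # GUIDs are stored mixed-endian on disk
print(bytes(on_disk))                     # b'Hah!IdontNeedEFI'

pos = on_disk.index(b"EFI")               # locate the 'E' (0x45)
altered = bytearray(on_disk)
altered[pos] = 0x46                       # 'E' -> 'F'
print(bytes(altered))                     # b'Hah!IdontNeedFFI'

# A single-byte change alters the CRC32, so a checksum recorded over the
# original bytes no longer matches.
print(zlib.crc32(on_disk) != zlib.crc32(altered))  # True
```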
Comment 2 Chris Murphy 2013-12-04 08:00:56 UTC
For reasons I don't understand, none of the corruptions I've tried trigger any of the messages in block/partitions/efi.c. I separately corrupted only the primary GPT header, the primary GPT header CRC, the primary GPT table, the primary GPT table array CRC, and likewise each of these for the secondary GPT. The message I get in every case comes from block/partitions/check.c: "unknown partition table".

The result is boot failure: the systemd unit dev-vda2.device times out and I end up in emergency.target. There is code to identify various problems with a GPT, note them, and gracefully continue as long as either the primary or backup GPT is valid, and in every test case I corrupted only one element at a time. So this failure seems unintended, and the "unknown partition table" message is bogus.
Comment 3 Chris Murphy 2013-12-04 08:02:47 UTC
Created attachment 117371
rdsosreport.txt

3.11.10-300.fc20.x86_64

Attaching rdsosreport for the latest example failure, which was corruption of the primary GPT header CRC: one of the CRC's 4 byte values was changed by one. Yet instead of being identified as having a bad CRC, the result is "unknown partition table".
Comment 4 Davidlohr Bueso 2013-12-04 22:53:41 UTC
If your primary GPT is corrupted, then the only way to tell Linux to use the alternate/backup is to use the 'gpt' kernel parameter. This is indeed *undocumented*, or at least the documentation is incomplete. Note that this option will also skip the MBR checks. Please try using this option and also enable debugging; you should see something like "Primary GPT is invalid, using alternate GPT" in dmesg.
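
The behaviour described above can be modelled in a few lines. This is a simplified sketch of the decision as this comment describes it, not a transcription of block/partitions/efi.c; `select_gpt` and its arguments are hypothetical names, and `force_gpt` stands for booting with the 'gpt' kernel parameter.

```python
def select_gpt(primary_ok, alternate_ok, force_gpt=False):
    """Pick which GPT copy to use, or None for "unknown partition table"."""
    if primary_ok:
        return "primary"
    # With a bad primary, the alternate is only consulted if the user opted in.
    if force_gpt and alternate_ok:
        return "alternate"
    return None


print(select_gpt(primary_ok=False, alternate_ok=True))                  # None
print(select_gpt(primary_ok=False, alternate_ok=True, force_gpt=True))  # alternate
```

The first call corresponds to the failure reported in this bug: a valid backup exists, but without the parameter it is never used.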
Comment 5 Alan 2013-12-05 00:20:17 UTC
That strikes me as a bug, given that the intention of the GPT backup appears to be resilience, and most GPT system end users won't be clued up enough to fiddle with kernel parameters.
Comment 6 Davidlohr Bueso 2013-12-05 00:24:27 UTC
I agree, but the rationale behind this is to protect against devices which misreport their size; it forces the user to decide explicitly to use the alternate GPT when they still have a valid primary header.
Comment 7 Chris Murphy 2013-12-18 08:23:44 UTC
The thing is, we're talking about 1 bit of corruption in either the header or table. The primary GPT is actually OK except that it doesn't pass its checksum, so it has a statistically good chance of still booting. But instead, the kernel clearly knows the primary GPT is corrupt, and then face plants. I don't think it's OK to make the whole point of GPT utterly useless by default, as a workaround for broken hardware that lies about its size.

However, I wonder how misreported device size causes the alternate to be used, or how we're avoiding possibly significant data loss if the drive misbehaves this much?
