Bug 63591

Summary: mishandled corruption of primary GPT table, failure to boot
Product: File System
Reporter: Chris Murphy (bugzilla)
Component: Other
Assignee: fs_other
Status: NEW
Severity: normal
CC: alan, davidlohr, Matt_Domsch, rtguille, samuel-kbugs
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 3.11.5-302
Subsystem:
Regression: No
Bisected commit-id:
Attachments: altered LBA 2, rdsosreport.txt

Description Chris Murphy 2013-10-24 04:31:37 UTC
After a single byte within LBA 2 (the primary GPT table) is altered, the computer fails to boot and systemd complains of a failed dependency for /sysroot. Relevant kernel message:

kernel: vda: unknown partition table

Because a one-byte change causes this, I'm guessing the kernel does compute and compare checksums, and determines that the primary GPT is invalid. But the kernel then does not fall back to the valid backup GPT. The primary GPT header is intact and passes its own checksum, so the location of the backup GPT should be reliable information.
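
To illustrate the two checksums involved, here is a minimal Python sketch of how a GPT header CRC and a partition entry array CRC can be verified. It follows the 92-byte header layout from the UEFI specification; the function names are made up for this example and it is not the kernel's code.

```python
import struct
import zlib

# UEFI GPT header, little-endian, 92 bytes:
# signature, revision, header size, header CRC32, reserved, my LBA,
# alternate LBA, first usable LBA, last usable LBA, disk GUID,
# partition entry LBA, number of entries, entry size, entry array CRC32
GPT_HEADER_FMT = "<8sIIIIQQQQ16sQIII"
GPT_HEADER_SIZE = struct.calcsize(GPT_HEADER_FMT)


def header_crc_ok(header: bytes) -> bool:
    """Check the header's own CRC32 (computed with the CRC field zeroed)."""
    if len(header) < GPT_HEADER_SIZE:
        return False
    fields = struct.unpack(GPT_HEADER_FMT, header[:GPT_HEADER_SIZE])
    signature, header_size, stored_crc = fields[0], fields[2], fields[3]
    if signature != b"EFI PART":
        return False
    if not GPT_HEADER_SIZE <= header_size <= len(header):
        return False
    # The CRC field lives at bytes 16..19 and is treated as zero.
    zeroed = header[:16] + b"\x00" * 4 + header[20:header_size]
    return zlib.crc32(zeroed) & 0xFFFFFFFF == stored_crc


def array_crc_ok(header: bytes, entries: bytes) -> bool:
    """Check the partition entry array against the CRC32 stored in the header."""
    if len(header) < GPT_HEADER_SIZE:
        return False
    fields = struct.unpack(GPT_HEADER_FMT, header[:GPT_HEADER_SIZE])
    num_entries, entry_size, stored_crc = fields[11], fields[12], fields[13]
    length = num_entries * entry_size
    if len(entries) < length:
        return False
    return zlib.crc32(entries[:length]) & 0xFFFFFFFF == stored_crc
```

With the corruption described here, header_crc_ok() would still succeed on LBA 1 while array_crc_ok() would fail, because the changed byte sits in the entry array starting at LBA 2.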
Comment 1 Chris Murphy 2013-10-24 04:37:31 UTC
Created attachment 112161
altered LBA 2

Full sector, primary GPT table. Offset 0x14 changed from 0x45 to 0x46, i.e. EFI is changed to FFI in the BIOS boot partition's type GUID; so this is non-critical information about a non-critical partition.
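
For readers unfamiliar with the "EFI" reference: the BIOS boot partition type GUID is 21686148-6449-6E6F-744E-656564454649, and its on-disk bytes spell out an ASCII string. A small Python sketch, using only the standard uuid module, shows the string and what the one-byte change does to the GUID. The variable names are just for illustration.

```python
import uuid

# BIOS boot partition type GUID as written in text form.
BIOS_BOOT_TYPE_GUID = uuid.UUID("21686148-6449-6E6F-744E-656564454649")

# GPT stores GUIDs in mixed-endian order, which is what bytes_le gives.
original = BIOS_BOOT_TYPE_GUID.bytes_le

# Simulate the edit from the attachment: 'E' (0x45) becomes 'F' (0x46).
altered = original.replace(b"EFI", b"FFI")
altered_guid = uuid.UUID(bytes_le=altered)

print(original)      # on-disk bytes of the real type GUID
print(altered)       # on-disk bytes after the edit
print(altered_guid)  # no longer the BIOS boot type GUID
```

The altered GUID is still a syntactically valid GUID, so the only thing that reveals the change is the CRC32 over the partition entry array.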
Comment 2 Chris Murphy 2013-12-04 08:00:56 UTC
For reasons I don't understand, none of the corruptions I've tried trigger any of the messages in block/partitions/efi.c. I tried various corruptions, separately, of only the primary GPT header, primary GPT header CRC, primary GPT table, primary GPT table array CRC, and so on for the secondary GPT as well. The message I get in every case is in block/partitions/check.c, which is "unknown partition table".

The result is boot failure: systemd's dev-vda2.device times out and I end up in emergency.target. There is code to identify various problems with a GPT, note them, and gracefully continue as long as either the primary or the backup GPT is valid (and in every test case I corrupted only one element at a time). So it seems this failure is unintended, and the "unknown partition table" message is bogus.
Comment 3 Chris Murphy 2013-12-04 08:02:47 UTC
Created attachment 117371
rdsosreport.txt

3.11.10-300.fc20.x86_64

Attaching rdsosreport for the latest example failure, which was corruption of the primary GPT header CRC: one of its four byte values was changed by one. Yet instead of the header being identified as having a bad CRC, the result is "unknown partition table".
Comment 4 Davidlohr Bueso 2013-12-04 22:53:41 UTC
If your primary GPT is corrupted, then the only way to tell Linux to use the alternate/backup is to use the 'gpt' kernel parameter. This is indeed *undocumented*, or at least the documentation is incomplete. Note that this option will also skip the MBR checks. Please try using this option and also enable debugging; you should see something like "Primary GPT is invalid, using alternate GPT" in dmesg.
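
As a rough model of the behavior described in this comment, the selection can be sketched in Python as below. This is a simplification based only on what is said in this thread, not a transcription of block/partitions/efi.c; choose_gpt and its parameters are hypothetical names, and the MBR checks are left out.

```python
from typing import Optional


def choose_gpt(primary_valid: bool, backup_valid: bool,
               force_gpt: bool) -> Optional[str]:
    """Return which GPT copy would be used, or None for no usable table.

    force_gpt stands for booting with the 'gpt' kernel parameter.
    """
    if primary_valid:
        return "primary"
    # Assumption from this thread: the backup is only consulted
    # when the user explicitly asks for it with 'gpt'.
    if force_gpt and backup_valid:
        return "backup"
    return None


# Corrupted primary, intact backup, default boot: nothing usable is found,
# which matches the "unknown partition table" outcome reported here.
print(choose_gpt(primary_valid=False, backup_valid=True, force_gpt=False))
# Same disk, booted with 'gpt': the backup is used.
print(choose_gpt(primary_valid=False, backup_valid=True, force_gpt=True))
```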
Comment 5 Alan 2013-12-05 00:20:17 UTC
That strikes me as a bug, given that the intention of the GPT backup appears to be resilience and most end users of GPT systems won't be clued up enough to fiddle with kernel parameters.
Comment 6 Davidlohr Bueso 2013-12-05 00:24:27 UTC
I agree, but the rationale behind this is to protect against devices which misreport their size, and to force the user to decide to use the alternate GPT when they still have a valid primary header.
Comment 7 Chris Murphy 2013-12-18 08:23:44 UTC
The thing is, we're talking about 1 bit of corruption in either the header or table. The primary GPT is actually OK except that it doesn't pass its checksum, so it has a statistically good chance of still booting. But instead, the kernel clearly knows the primary GPT is corrupt, and then face plants. I don't think it's OK to, by default, make the whole point of GPT utterly useless as a workaround for broken hardware that lies about its size.

However, I wonder how misreported device size causes the alternate to be used, or how we're avoiding possibly significant data loss if the drive misbehaves this much?