Bug 63591
| Summary: | mishandled corruption of primary GPT table, failure to boot | | |
|---|---|---|---|
| Product: | File System | Reporter: | Chris Murphy (bugzilla) |
| Component: | Other | Assignee: | fs_other |
| Status: | NEW | | |
| Severity: | normal | CC: | alan, davidlohr, Matt_Domsch, rtguille, samuel-kbugs |
| Priority: | P1 | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Kernel Version: | 3.11.5-302 | Subsystem: | |
| Regression: | No | Bisected commit-id: | |
| Attachments: | altered LBA 2, rdsosreport.txt | | |
Description
Chris Murphy
2013-10-24 04:31:37 UTC
Created attachment 112161
altered LBA 2
Full sector of the primary GPT table. The byte at offset 0x14 was changed from 0x45 to 0x46, i.e. "EFI" became "FFI" in the BIOS boot partition's type GUID (PartitionTypeGUID), so the change affects non-critical information about a non-critical partition.
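For readers unfamiliar with the layout: the BIOS boot partition type GUID 21686148-6449-6E6F-744E-656564454649 is stored on disk as the ASCII bytes "Hah!IdontNeedEFI", and the GPT header carries a CRC32 over the whole partition entry array, so even a cosmetic one-byte change invalidates that checksum. A minimal sketch, assuming the common 128-byte entry size and a made-up, otherwise empty entry; the helper names are hypothetical and the byte position is found by searching for "EFI" rather than taken from the attachment:

```python
import uuid
import zlib

# BIOS boot partition type GUID; GPT stores the first three fields little-endian.
BIOS_BOOT_TYPE = uuid.UUID("21686148-6449-6E6F-744E-656564454649").bytes_le

ENTRY_SIZE = 128  # assumption: the common GPT partition entry size


def make_entry(type_guid: bytes) -> bytes:
    """Build a hypothetical partition entry: type GUID followed by zero padding."""
    return type_guid + bytes(ENTRY_SIZE - len(type_guid))


def flip_efi_to_ffi(entry: bytes) -> bytes:
    """Change the 'E' of the trailing 'EFI' (0x45) to 'F' (0x46)."""
    pos = entry.index(b"EFI")
    return entry[:pos] + b"F" + entry[pos + 1:]


good = make_entry(BIOS_BOOT_TYPE)
bad = flip_efi_to_ffi(good)

print(good[:16])  # on-disk bytes of the type GUID
print(bad[:16])   # same bytes after the one-byte change
print(zlib.crc32(good) != zlib.crc32(bad))  # the entry array CRC no longer matches
```

The point of the sketch is only that the partition itself stays perfectly usable while the stored array CRC stops matching.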
For reasons I don't understand, none of the corruptions I've tried trigger any of the messages in block/partitions/efi.c. I corrupted, separately, only the primary GPT header, the primary GPT header CRC, the primary GPT table, the primary GPT table array CRC, and so on for the secondary GPT as well. The message I get in every case comes from block/partitions/check.c: "unknown partition table". The result is boot failure: systemd's dev-vda2.device times out and I end up in emergency.target. There is code to identify various problems with a GPT, note them, and gracefully continue as long as either the primary or backup GPT is valid, and in all test cases I corrupted only one element at a time. So this failure seems unintended, and the "unknown partition table" message is bogus.

Created attachment 117371
rdsosreport.txt
3.11.10-300.fc20.x86_64
Attaching the rdsosreport for the latest example failure, which was corruption of the primary GPT header CRC: one of the four CRC bytes was changed by one. Yet instead of being identified as having a bad CRC, the result is again "unknown partition table".
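The header CRC test described here can be reproduced outside the kernel. The sketch below follows the GPT header layout from the UEFI specification (signature at offset 0, header size at offset 12, header CRC32 at offset 16, computed with the CRC field zeroed). The field values in `build_header` are made up, and both function names are hypothetical; this is not the code in block/partitions/efi.c.

```python
import struct
import zlib

HEADER_SIZE = 92  # size of a revision 1.0 GPT header


def build_header(my_lba=1, alternate_lba=2047, entries_crc=0) -> bytes:
    """Build a hypothetical GPT header with made-up values and a valid CRC."""
    fields = struct.pack(
        "<8sIIIIQQQQ16sQIII",
        b"EFI PART",    # signature
        0x00010000,     # revision 1.0
        HEADER_SIZE,    # header size
        0,              # header CRC32, zero while computing
        0,              # reserved
        my_lba, alternate_lba,
        34, 2014,       # first/last usable LBA (made up)
        bytes(16),      # disk GUID (made up)
        2,              # partition entry array LBA
        128, 128,       # number of entries, size of each entry
        entries_crc,    # CRC32 of the partition entry array
    )
    crc = zlib.crc32(fields)
    return fields[:16] + struct.pack("<I", crc) + fields[20:]


def header_crc_ok(header: bytes) -> bool:
    """Recompute the CRC32 with the CRC field zeroed and compare to the stored one."""
    if header[:8] != b"EFI PART":
        return False
    size = struct.unpack_from("<I", header, 12)[0]
    if size < HEADER_SIZE or size > len(header):
        return False
    stored = struct.unpack_from("<I", header, 16)[0]
    zeroed = header[:16] + bytes(4) + header[20:size]
    return zlib.crc32(zeroed) == stored


good = build_header()
bad = bytearray(good)
bad[16] = (bad[16] + 1) % 256  # change one of the four CRC bytes by one

print(header_crc_ok(good))        # header as built
print(header_crc_ok(bytes(bad)))  # header with the altered CRC byte
```

Every other field in the altered header is intact, which is why a specific "bad CRC" diagnosis would be expected rather than a generic failure.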
If your primary GPT is corrupted, then the only way to tell Linux to use the alternate/backup is to use the 'gpt' kernel parameter - this is indeed *undocumented*, or at least incompletely documented. Note that this option will also skip the MBR checks. Please try using this option and also enable debugging - you should see something like "Primary GPT is invalid, using alternate GPT" in dmesg.

That strikes me as a bug, given that the intention of the GPT backup appears to be resilience and most GPT system end users won't be clued up enough to fiddle with kernel parameters.

I agree, but the rationale behind this is to protect against devices which misreport their size, and it forces the user to decide to use the alternate GPT when they still have a valid primary header.

The thing is, we're talking about 1 bit of corruption in either the header or table. The primary GPT is actually OK except that it doesn't pass its checksum, so it has a statistically good chance of still booting. But instead, the kernel clearly knows the primary GPT is corrupt, and then face plants. I don't think it's OK to, by default, make the whole point of GPT utterly useless as a workaround for broken hardware that lies about its size. However, I wonder how a misreported device size causes the alternate to be used, or how we're avoiding possibly significant data loss if the drive misbehaves this much?
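The policy being debated in the comments above can be summarized as a small decision function. This is a simplified model of what the comments describe, not the kernel's actual code: the function name `pick_gpt` and its return strings are hypothetical, and the MBR checks mentioned earlier are left out.

```python
def pick_gpt(primary_ok: bool, alternate_ok: bool, force_gpt: bool) -> str:
    """Return which GPT would be used under the policy described in this thread."""
    if primary_ok:
        return "primary"
    # Primary is invalid: per the comments, the alternate is only
    # considered when the 'gpt' kernel parameter is given.
    if force_gpt and alternate_ok:
        return "alternate"
    return "unknown partition table"


# The reporter's test cases: primary corrupted, backup intact.
print(pick_gpt(primary_ok=False, alternate_ok=True, force_gpt=False))
print(pick_gpt(primary_ok=False, alternate_ok=True, force_gpt=True))
```

Under this model a valid backup is ignored unless the user opts in, which matches the reported boot failure with a single corrupted element.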