Bug 73761 - efibootmgr OOPS
Summary: efibootmgr OOPS
Status: CLOSED INVALID
Alias: None
Product: EFI
Classification: Unclassified
Component: Other (show other bugs)
Hardware: All Linux
: P1 normal
Assignee: EFI Virtual User
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2014-04-09 17:33 UTC by Andy Lutomirski
Modified: 2016-01-13 22:34 UTC (History)
1 user (show)

See Also:
Kernel Version: 3.14
Tree: Mainline
Regression: No


Attachments
Complete kernel logs, instrumented (253.21 KB, text/plain)
2014-04-10 17:43 UTC, Andy Lutomirski
Details
Complete kernel logs, instrumented (107.57 KB, text/plain)
2014-04-15 15:19 UTC, Andy Lutomirski
Details
Logs w/ EFI page table (138.88 KB, text/plain)
2014-05-07 22:40 UTC, Andy Lutomirski
Details

Description Andy Lutomirski 2014-04-09 17:33:52 UTC
This is 100% reproducible for me on Fedora's 3.13.7 kernel and on mainline 3.14.

$ sudo efibootmgr -v
BootCurrent: 0008
Timeout: 1 seconds
BootOrder: 0003,0008,0000,0001
Boot0000* SATA2:PLDS DVD+/-RW DH-16A6S  	BIOS(10,0,00)
Boot0001* SATA1:INTEL SSDSC2BB600G4     	BIOS(11,0,00)
Boot0003* grub	HD(1,800,40000,75f192ff-2a82-4e4e-b83c-c41f3bb39847)File(\EFI\grub\grubx64.efi)
Boot0008* UEFI: Built-in EFI Shell 	Vendor(5023b95c-db26-429b-a648-bd47664c8012,)AMBO
Boot000A* Hard Drive 	BIOS(2,0,00)AMGOAMNO........o.I.N.T.E.L. .S.S.D.S.C.2.B.B.6.0.0.G.4....................A...........................>..Gd-.;.A..MQ..L.T.B.L.W.3.3.4.3.A.0.F.W.0.6.T.0.N.G. . ......AMBO

$ sudo efibootmgr -B -b 0

[   68.204983] general protection fault: 0000 [#1] SMP 
[   68.205151] Modules linked in: fuse ipt_MASQUERADE xt_CHECKSUM tun bnep ip6t_rpfilter bluetooth 6lowpan_iphc ip6t_REJECT cfg80211 xt_conntrack rfkill ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw f71882fg xfs vfat fat libcrc32c snd_hda_codec_realtek snd_hda_codec_generic x86_pkg_temp_thermal snd_hda_codec_hdmi coretemp iTCO_wdt kvm_intel snd_hda_intel gpio_ich mxm_wmi iTCO_vendor_support kvm snd_hda_codec snd_hwdep snd_seq sb_edac snd_seq_device edac_core snd_pcm microcode joydev serio_raw pcspkr i2c_i801 mei_me lpc_ich mei snd_timer
[   68.207630]  mfd_core snd wmi shpchp soundcore binfmt_misc dm_crypt radeon hid_logitech_dj i2c_algo_bit drm_kms_helper ttm drm e1000e firewire_ohci crct10dif_pclmul crc32_pclmul crc32c_intel firewire_core ptp ghash_clmulni_intel i2c_core pps_core crc_itu_t
[   68.208394] CPU: 8 PID: 3290 Comm: efibootmgr Not tainted 3.14.0+ #862
[   68.208594] Hardware name: MSI MS-7760/X79A-GD65 (8D) (MS-7760), BIOS V1.8 10/18/2012
[   68.208837] task: ffff880443502440 ti: ffff880443708000 task.ti: ffff880443708000
[   68.209067] RIP: 0010:[<fffffffeecedcc9e>]  [<fffffffeecedcc9e>] 0xfffffffeecedcc9e
[   68.209308] RSP: 0018:ffff880443709ad0  EFLAGS: 00010046
[   68.209472] RAX: 0000000000000000 RBX: fffffffeecee6a78 RCX: 000000000000001d
[   68.209691] RDX: 0000000000000000 RSI: 0000000000000007 RDI: 8000000000000009
[   68.209911] RBP: ffffffff81c37400 R08: 0000000000000000 R09: 0000000000000001
[   68.210131] R10: fffffffedbc1ff00 R11: 0000000000000008 R12: fffffffeecee5a40
[   68.210350] R13: 0000000000000000 R14: 8000000000000000 R15: 000000000000180e
[   68.210571] FS:  00007f2ddddd5740(0000) GS:ffff88045fd00000(0000) knlGS:0000000000000000
[   68.210821] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   68.210997] CR2: 0000003a5cf48c30 CR3: 000000000009b000 CR4: 00000000000407e0
[   68.211216] Stack:
[   68.211277]  0000000000000000 fffffffeecee5a40 0000000000000000 0000000000000007
[   68.211522]  8000000000000009 fffffffeecedfbcf fffffffeecee6a78 ffffffff81c37400
[   68.211765]  0000000000000007 0000000000000011 ffff880442b6e000 fffffffeecee6a78
[   68.212011] Call Trace:
[   68.212091]  [<ffffffff81302609>] ? security_context_to_sid_core+0x189/0x260
[   68.212310]  [<ffffffff811c8f8c>] ? kmem_cache_alloc_trace+0x3c/0x200
[   68.212508]  [<ffffffff8107f811>] ? efi_call5+0x71/0xf0
[   68.212666]  [<ffffffff8107f71c>] ? efi_call4+0x6c/0xf0
[   68.212829]  [<ffffffff811e2063>] ? do_dentry_open+0x203/0x2e0
[   68.213008]  [<ffffffff8107eb49>] ? virt_efi_set_variable+0x49/0x60
[   68.213200]  [<ffffffff8107ee5d>] ? efi_query_variable_store+0x12d/0x1f0
[   68.213409]  [<ffffffff8159a622>] ? efivar_entry_set+0xb2/0x140
[   68.213590]  [<ffffffff8159ac17>] ? efivar_store_raw+0x147/0x280
[   68.213776]  [<ffffffff812eb0d8>] ? security_capable+0x18/0x20
[   68.225363]  [<ffffffff81091da9>] ? ns_capable+0x29/0x50
[   68.237037]  [<ffffffff8159b2db>] ? efivar_attr_store+0x4b/0x70
[   68.248831]  [<ffffffff8125cccd>] ? sysfs_kf_write+0x3d/0x50
[   68.260716]  [<ffffffff81260be2>] ? kernfs_fop_write+0xd2/0x140
[   68.260724]  [<ffffffff811e49fa>] ? vfs_write+0xba/0x1e0
[   68.260727]  [<ffffffff811f0379>] ? putname+0x29/0x40
[   68.260730]  [<ffffffff811e5455>] ? SyS_write+0x55/0xd0
[   68.260735]  [<ffffffff8170f7e9>] ? system_call_fastpath+0x16/0x1b
[   68.260757] Code: 00 eb 2b b8 ab aa aa aa f7 25 ff 99 00 00 d1 ea 8d 04 12 49 03 c0 48 3b c8 75 09 c6 05 dd 99 00 00 00 eb 0a 48 8b 05 3a 95 00 00 <ff> 50 48 48 83 c4 28 c3 cc cc 48 89 5c 24 08 48 89 6c 24 10 48 
[   68.260763] RIP  [<fffffffeecedcc9e>] 0xfffffffeecedcc9e
[   68.260764]  RSP <ffff880443709ad0>
[   68.372690] ---[ end trace 71f0aedf03f8e192 ]---
Comment 1 Andy Lutomirski 2014-04-09 17:59:31 UTC
I think the stack trace is screwy past efi_call5.  gdb shows:

   0xffffffff811c8f7f <+47>:	mov    %r10,-0x30(%rbp)
   0xffffffff811c8f83 <+51>:	mov    %rdi,-0x38(%rbp)
   0xffffffff811c8f87 <+55>:	callq  0xffffffff817034c0 <_cond_resched>
   0xffffffff811c8f8c <+60>:	mov    -0x38(%rbp),%r9
   0xffffffff811c8f90 <+64>:	mov    -0x30(%rbp),%r10
   0xffffffff811c8f94 <+68>:	nopl   0x0(%rax,%rax,1)
   0xffffffff811c8f99 <+73>:	mov    %r9,%r15

What's a call to cond_resched doing in here?

For security_context_to_sid_core, I see:

   0xffffffff81302609 <+393>:	mov    %r15d,%eax
   0xffffffff8130260c <+396>:	mov    -0x30(%rbp),%rdx
   0xffffffff81302610 <+400>:	xor    %gs:0x28,%rdx
   0xffffffff81302619 <+409>:	jne    0xffffffff813026d3 <security_context_to_sid_core+595>
   0xffffffff8130261f <+415>:	add    $0x68,%rsp
   0xffffffff81302623 <+419>:	pop    %rbx

so the address is between instructions.  Presumably some random crap is on the stack.

My motherboard is an MSI X79A-GD65 (8D) (MS-7760) with BIOS revision 4.6, and that BIOS is fairly crappy.
Comment 2 Andy Lutomirski 2014-04-09 18:08:26 UTC
efi=old_map makes no difference.
Comment 3 Matt Fleming 2014-04-10 11:37:43 UTC
Andy, could you upload the dmesg from your machine? I'd like to figure out what specific region the RIP is in at the time of the fault.

Is there a known working kernel version where this oops doesn't occur?

It looks like our "For the love of God don't write too much data to the NVRAM" workaround from efi_query_variable_store() is causing this. Apparently your firmware tried to jump through a NULL pointer,

$ ./scripts/decodecode < /tmp/oops
[ 68.260757] Code: 00 eb 2b b8 ab aa aa aa f7 25 ff 99 00 00 d1 ea 8d 04 12 49 03 c0 48 3b c8 75 09 c6 05 dd 99 00 00 00 eb 0a 48 8b 05 3a 95 00 00 <ff> 50 48 48 83 c4 28 c3 cc cc 48 89 5c 24 08 48 89 6c 24 10 48
All code
========
   0:	00 eb                	add    %ch,%bl
   2:	2b b8 ab aa aa aa    	sub    -0x55555555(%rax),%edi
   8:	f7 25 ff 99 00 00    	mull   0x99ff(%rip)        # 0x9a0d
   e:	d1 ea                	shr    %edx
  10:	8d 04 12             	lea    (%rdx,%rdx,1),%eax
  13:	49 03 c0             	add    %r8,%rax
  16:	48 3b c8             	cmp    %rax,%rcx
  19:	75 09                	jne    0x24
  1b:	c6 05 dd 99 00 00 00 	movb   $0x0,0x99dd(%rip)        # 0x99ff
  22:	eb 0a                	jmp    0x2e
  24:	48 8b 05 3a 95 00 00 	mov    0x953a(%rip),%rax        # 0x9565
  2b:*	ff 50 48             	callq  *0x48(%rax)		<-- trapping instruction
  2e:	48 83 c4 28          	add    $0x28,%rsp
  32:	c3                   	retq   
  33:	cc                   	int3   
  34:	cc                   	int3   
  35:	48 89 5c 24 08       	mov    %rbx,0x8(%rsp)
  3a:	48 89 6c 24 10       	mov    %rbp,0x10(%rsp)
  3f:	48                   	rex.W


Can you create new variables/new boot entries with efibootmgr? It would also be useful if you could figure out which efi.set_variable() call in efi_query_variable_store() causes this oops.
Comment 4 Andy Lutomirski 2014-04-10 17:43:27 UTC
Created attachment 131881 [details]
Complete kernel logs, instrumented
Comment 5 Andy Lutomirski 2014-04-10 17:57:35 UTC
It's the first set_variable call.  The dmesg I attached is from a kernel with this patch:

diff --git a/arch/x86/platform/efi/efi.c b/arch/x86/platform/efi/efi.c
index b97acec..7a7face 100644
--- a/arch/x86/platform/efi/efi.c
+++ b/arch/x86/platform/efi/efi.c
@@ -1159,6 +1159,7 @@ efi_status_t efi_query_variable_store(u32 attributes, unsi
                if (!dummy)
                        return EFI_OUT_OF_RESOURCES;
 
+               printk(KERN_INFO "set_variable %ld %p\n", (long)dummy_size, (voi
                status = efi.set_variable(efi_dummy_name, &EFI_DUMMY_GUID,
                                          EFI_VARIABLE_NON_VOLATILE |
                                          EFI_VARIABLE_BOOTSERVICE_ACCESS |
@@ -1170,6 +1171,7 @@ efi_status_t efi_query_variable_store(u32 attributes, unsi
                         * This should have failed, so if it didn't make sure
                         * that we delete it...
                         */
+                       printk(KERN_INFO "set_variable zero %p\n", (void*)dummy)
                        efi.set_variable(efi_dummy_name, &EFI_DUMMY_GUID,
                                         EFI_VARIABLE_NON_VOLATILE |
                                         EFI_VARIABLE_BOOTSERVICE_ACCESS |
Comment 6 Matt Fleming 2014-04-15 12:44:45 UTC
Andy, did you definitely upload the correct log? All I can see is a file with loads of btrfs messages.
Comment 7 Andy Lutomirski 2014-04-15 15:19:21 UTC
Created attachment 132371 [details]
Complete kernel logs, instrumented

I have too many files called dmesg.txt.  This should be the right one.
Comment 8 Matt Fleming 2014-05-07 13:17:51 UTC
Sorry Andy, could you boot a v3.15-rc kernel and enable CONFIG_EFI_PGT_DUMP? I need the output to figure out which region the firmware was in when it crashed (I forgot that with the new EFI memmap stuff regions aren't mapped at a linear offset).
Comment 9 Andy Lutomirski 2014-05-07 15:40:11 UTC
Sure.

Note that, on 3.14, the bug is masked by the change to skip reading the BGRT payload if the valid bit isn't set.
Comment 10 Andy Lutomirski 2014-05-07 16:26:22 UTC
Ignore the comment about BGRTs.  I somehow mixed up the bugs in my head.
Comment 11 Andy Lutomirski 2014-05-07 22:40:46 UTC
Created attachment 135341 [details]
Logs w/ EFI page table

For better or for worse, efibootmgr just successfully deleted an entry on this kernel.
Comment 12 Matt Fleming 2014-11-01 09:23:43 UTC
Andy, are you still seeing issues here?
Comment 13 Andy Lutomirski 2016-01-05 01:37:02 UTC
I just triggered this issue again on a stock Fedora 23 live USB stick on the same machine.  I can't give a proper log because the machine hung hard with only a useless tiny bit of log message at the bottom.

RIP = 0xfffffffee14dcc93
Comment 14 Andy Lutomirski 2016-01-13 22:34:55 UTC
I just manged to crash the builtin firmware setup screen a few times while manipulating boot options.  I'm closing this as INVALID since it's most likely a severe UEFI firmware bug and there's probably nothing that Linux can do about it.

Note You need to log in before you can comment on or make changes to this bug.