Bug 47121
Summary: | UEFI boot panics on a new Samsung Series 9 laptop throwing a machine check exception | ||
---|---|---|---|
Product: | Platform Specific/Hardware | Reporter: | Alessandro Crismani (alessandro.crismani) |
Component: | x86-64 | Assignee: | platform_x86_64 (platform_x86_64) |
Status: | RESOLVED CODE_FIX | ||
Severity: | normal | CC: | alan, clancy.kieran+kernel, florian, greg, hpa, johnyjacksonjames, matt, mike.bakhterev, mike, mjg59-kernel, mus.svz, orzel, phemmer+kernel, r.maassen60, sergio91pt, tu.hoang.a, ucelsanicin, vortahupsa |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 3.5.3 and 3.6-rc4 | Subsystem: | |
Regression: | No | Bisected commit-id: | |
Attachments: |
Picture of the machine check exception error thrown at boot
Higher quality picture capturing the panic
dmesg log when booting using the kernel efi stub (3.6-rc4)
Dmesg output when booting using the UEFI enabled grub of Ubuntu 12.04.1 |
Created attachment 79381 [details]
Higher quality picture capturing the panic
Alessandro, are you able to boot the machine successfully using a UEFI boot loader instead of the EFI boot stub directly? Also, if you could attach a copy of dmesg to this bug report next time you successfully boot using the EFI boot stub, that would be very helpful.
If I remember correctly, the latest Ubuntu live CD booted ok using its UEFI enabled grub. At the same time, I remember having had problems with previous iterations of Ubuntu (those were tested with an older BIOS, which may have caused more trouble). I'll attach a dmesg log later today, both for my install that uses the kernel stub and for an Ubuntu live CD that uses grub. Before I collect such information, which kernel version should I use for obtaining the stub log?
If possible, I would recommend using a v3.6-rc4 kernel.
Created attachment 79621 [details]
dmesg log when booting using the kernel efi stub (3.6-rc4)
Log from dmesg when booting using the kernel efi stub. The kernel is at version 3.6-rc4. It took me 8 attempts before I got to the login screen, so the success-to-failure ratio of the boot sequence is quite low.
I'll provide the dmesg log using Ubuntu + grub (or any other distribution that you may recommend) tomorrow, given that I lack a USB DVD reader or a USB drive at the moment.
Created attachment 79891 [details]
Dmesg output when booting using the UEFI enabled grub of Ubuntu 12.04.1
Are you able to successfully boot with grub every time? To me this seems like a hardware error, but if that is the case then I would expect you to see errors most of the time.
You were right. I installed grub on my system, and those MCE errors are also thrown when booting with it. I am wondering why Ubuntu's grub seems to be more resilient. Could it be that booting from the laptop's SSD is fast enough to cause problems for the boot sequence, compared to booting from the live CD?
It's difficult to say what is causing the MCEs, and so it's difficult to say why one boot scenario throws MCEs more reliably than a different one. You could try swapping out the RAM to see if that makes a difference (you've possibly got a bad RAM module), or trying different components to see if the problem disappears. Other than that, I would suggest trying to get your laptop replaced/repaired if you don't have the option of running with a legacy BIOS, or turning on the legacy BIOS if you do.
Fortunately, I can switch to a legacy BIOS based boot system. The legacy BIOS boots without any problems using the Syslinux bootloader (and I believe any other solution works too). Hence, I do not think that I have faulty RAM modules, and I suspect that the UEFI code in the BIOS is the one to blame! Does it make sense?
Yep, that is entirely possible.
I'll add some details here. MCE code is "MCE: 5 Bank 6: be2000000003110a". And here is a guy who hit the same MCE code when playing with PCI memory regions caching: http://stackoverflow.com/questions/11254023/how-to-do-mmap-for-cacheable-pcie-bar
Another thing to check, Matt, would be whether flushing the TLB on all processors after every EFI call fixes the problem. There are strict rules on aliased mappings, and if the firmware is adding its own but not flushing the TLB after they are gone then that would cause MCEs in some situations. Alan
Alan, that's interesting.
Mikhail, Alessandro, could you please build a kernel from the 'efi/tlbflush' branch at git://git.kernel.org/pub/scm/linux/kernel/git/mfleming/linux.git and see whether the MCE errors disappear?
Unfortunately it doesn't help. Maybe my laptop really has bad quality UEFI code, and it just fails in such a horrible way.
Gak, not much else I can think to try short of turning off all of the 2MB/4MB page usage.
Excuse me for the delay. I've tested efi/tlbflush, and it does not help. To my knowledge this is not a TLB problem. One Intel guy noted: The SDM can help here. See volume 3A, section 15.9.2 "Compound Error Codes". The low 16 bits of the status in this case are 0001 0001 0000 1010. This tells us that you have a cache error in L2 cache severe enough that the processor has begun filtering further error reports. And the stackoverflow question is about playing with caching policy. Maybe the MTRRs are configured incorrectly?
Hi. Maybe this will help to clarify things. My distro now uses the 3.6.2 kernel, and the machine does not boot at all, even from a USB drive. Earlier kernels sometimes (frequently enough) were able to boot from USB 2.0 flash drives.
I have the same laptop and was experiencing the exact same issue. However, I was able to resolve it by blacklisting the "samsung_laptop" module. Apparently it was getting loaded by udev (or something else) automatically. With the module loading, I could boot maybe one out of every 15 times. With it blacklisted I've probably rebooted about 20 times now without any issues.
Patrick, thank you for that bit of info. That's very helpful. In light of this, the MCEs actually make a lot of sense because they can be caused under UEFI when the kernel pokes around in memory regions it's not supposed to. And reading the samsung-laptop code, that's exactly what it does. Mikhail, Alessandro, can you please try disabling the samsung-laptop module and see if you have better results?
Yes, Linux 3.6.3 boots with samsung-laptop blacklisted.
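For readers who want to check the status decoding quoted above, here is a minimal Python sketch that splits the reported value be2000000003110a into its fields. The bit positions and the TT/LL tables follow my reading of the SDM's compound error code layout for cache hierarchy errors (000F 0001 RRRR TTLL) and are assumptions to verify against the manual; the function names `decode_cache_error` and `set_flags` are made up for this illustration.

```python
# Status value reported in this bug ("MCE: 5 Bank 6: be2000000003110a").
STATUS = 0xBE2000000003110A

# Upper flag bits of the status register (assumed positions, see the SDM).
FLAGS = {63: "VAL", 62: "OVER", 61: "UC", 60: "EN",
         59: "MISCV", 58: "ADDRV", 57: "PCC"}

# Transaction type and cache level encodings of a cache hierarchy error.
TT = {0b00: "instruction", 0b01: "data", 0b10: "generic"}
LL = {0b00: "L0", 0b01: "L1", 0b10: "L2", 0b11: "generic"}


def decode_cache_error(status):
    """Decode the low 16 bits if they match 000F 0001 RRRR TTLL, else None."""
    code = status & 0xFFFF
    # Ignore the F bit (bit 12) when matching the cache-error pattern.
    if (code & 0xEF00) != 0x0100:
        return None
    return {
        "code_bits": format(code, "016b"),
        "filtering": bool(code & (1 << 12)),  # the F bit
        "request": (code >> 4) & 0xF,         # RRRR, 0 means generic error
        "transaction": TT.get((code >> 2) & 0x3, "reserved"),
        "level": LL[code & 0x3],
    }


def set_flags(status):
    """Return the names of the upper flag bits that are set."""
    return [name for bit, name in FLAGS.items() if status & (1 << bit)]


if __name__ == "__main__":
    print(decode_cache_error(STATUS))
    print(set_flags(STATUS))
```

Under these assumptions the low 16 bits decode to an L2 cache error with the filtering bit set, which agrees with the explanation quoted from the Intel engineer.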
Additional info: I tried to blacklist this module on Linux 3.4.x and there was no effect. I did not try this on 3.5. As far as I understand, this module provides, among other things (?), keyboard backlight support. And when the kernel was booted under UEFI, the MCE could always be triggered by an attempt to change the keyboard backlight level with the command:
# echo 0 > /sys/devices/platform/samsung/leds/samsung::kbd_backlight/brightness
*** Bug 49161 has been marked as a duplicate of this bug. ***
Maybe this can be helpful: I booted from an Ubuntu 12.10 live USB stick and the error occurred, but when I disabled the SSD boot entry the live USB stick ran normally.
Just updating as I'm not sure this should be marked as WILL_NOT_FIX any more. There was an internal discussion a few weeks ago about having the samsung-laptop module do a run-time check to see if UEFI is enabled and to not load if so. But I'm not sure if there was any final verdict on that discussion, or whether I just got removed from the email CC list. Could we get an update on whether the run-time check will be implemented, or if this will indeed remain as WILL_NOT_FIX? Also, it would be nice if the samsung-laptop code could be fixed so that it could be used without generating the MCE errors. Could this be proposed as the final solution to the issue, with the run-time check just an intermediary fix?
Adding some more people to the Cc list. A runtime check is indeed the best way to fix this issue. I'm not sure it's possible to fix the samsung-laptop module to run on UEFI machines.
I'm no longer the maintainer of this driver, but this was already discussed on the linux-efi mailing list and it's a bug in the core kernel EFI code: it should be failing the memory mapping of this area. It's not a driver bug in itself.
Of course the driver does expose the bug, which makes this a pain, but the core is where this should be fixed properly. There is not much we can do in the driver itself, unless someone wants to add a blacklist of machines this driver should not bind to.
The core can't simply refuse access to these address ranges, because some hardware requires it for VESA. Yes, this is ridiculous.
Yes, the legacy area is, and probably will always be, special. The driver needs to handle it in this case.
There will no doubt be lots of other similar things of this nature, hence the need for an "on_drugs()" or similar test for running in EFI mode.
But samsung-laptop provides important functionality: rfkill, keyboard backlight, performance level management... If the module does not load, this stuff will not work. Is it possible to solve this problem in another way?
I had the same problem on my new Series 9 900x4d. I tried to install Arch Linux via UEFI boot and got the same kernel panic. Blacklisting samsung-laptop allowed me to boot and install the system. The strange thing is that I now cannot reproduce the problem anymore, not even with the USB drive that I used for the installation (where the panic definitely happened).
The panic doesn't happen anymore because the driver simply isn't loaded anymore. "modprobe samsung-laptop" gives me: modprobe: ERROR: could not insert 'samsung_laptop': No such device. I also noticed that the aesni_intel module isn't loaded and /proc/cpuinfo doesn't report the aes flag, although the processor definitely supports AES-NI [1]. Modprobing aesni_intel gives me the same "No such device" error message. Could this be related? I'm going to try to boot via non-UEFI and see if that changes anything. [1] http://ark.intel.com/products/65707
Ah, I got it. The samsung-laptop module is only loaded if "OS Mode Selection" is set to "UEFI and CSM OS". It is not loaded when it is set to "UEFI OS".
Took me a while to figure this out because changing the setting only seems to have effect when the machine is actually powered off and not just rebooted. aesni_intel isn't loaded either way, not even with non-UEFI boot, so I guess this is another bug. Can anyone confirm this? I upgraded to the latest BIOS yesterday btw, before even trying to boot the Arch Linux installation media.
On Sat, Nov 24, 2012 at 11:07:45AM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> But samsung-laptop provides important functionality: rfkill, keyboard
> backlight, performance level management... If the module will not load this
> stuff will not work. Is it possible to solve this problem in another way?
None of that functionality will probably be supported by this driver anyway on UEFI laptops, those options will come from ACPI, so you will need those ACPI drivers loaded to control them.
I was having the same problem on Samsung Ultrabook Series 9 as well. I employed the same workaround by blacklisting samsung-laptop kernel module. As a result of that, keyboard backlight under low light condition did not work. However, with the recent kernel 3.6.8-1 and linux-firmware 20121118-1 on Archlinux, keyboard backlight is working even with samsung-laptop blacklisted. Maybe we don't need samsung-laptop kernel module at all?
> However, with the recent kernel 3.6.8-1 and linux-firmware 20121118-1 on
> Archlinux, keyboard backlight is working even with samsung-laptop
> blacklisted.
> Maybe we don't need samsung-laptop kernel module at all?
It's not working on series 9 New with the same kernel/firmware.
(In reply to comment #33)
> aesni_intel isn't loaded either way, not even with non-UEFI boot, so I guess
> this is another bug. Can anyone confirm this?
>
> I upgraded to the latest BIOS yesterday btw, before even trying to boot the
> Arch Linux installation media.
I have P04AAC (something like that) BIOS.
And samsung-laptop is loading (crashing the system when in UEFI mode), aesni_intel is loading too.
No longer working on 3.6.9-1 and 3.6.10-1. The only kernel version in which I had ambient light sensor to work without samsung-laptop is 3.6.8-1.
Folks, A fix for the samsung-laptop module has been merged into Linus' tree and is available in v3.8-rc6. It has also been marked for the stable trees. The fix is,
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commit;h=e0094244e41c4d0c7ad69920681972fc45d8ce34
and its prerequisite,
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commit;h=83e68189745ad931c2afd45d8ee3303929233e7f
Fixed in v3.8.
A patch referencing this bug report has been merged in Linux v3.8:
commit 1de63d60cd5b0d33a812efa455d5933bf1564a51
Author: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Date: Thu Feb 14 09:12:52 2013 +0900
efi: Clear EFI_RUNTIME_SERVICES rather than EFI_BOOT by "noefi" boot parameter
A patch referencing a commit somehow associated to this bug report has been merged in Linux v3.9-rc1:
commit fb834c7acc5e140cf4f9e86da93a66de8c0514da
Author: Matt Fleming <matt.fleming@intel.com>
Date: Wed Feb 20 20:36:12 2013 +0000
x86, efi: Make "noefi" really disable EFI runtime serivces |
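The run-time check discussed in this thread lives in the kernel, but the same question ("was this system booted through UEFI?") can be approximated from user space, for example to decide whether the samsung-laptop blacklist workaround is still needed on an unfixed kernel. The sketch below is only an illustration and not the merged fix: it assumes that the kernel exposes the directory /sys/firmware/efi on EFI boots, and the helper names `booted_via_uefi` and `should_load_samsung_laptop` are hypothetical.

```python
import os


def booted_via_uefi(sysfs_root="/sys"):
    """Return True if <sysfs_root>/firmware/efi exists as a directory.

    Assumption: the kernel creates this directory only when it was
    booted through EFI, so its absence indicates a legacy BIOS/CSM boot.
    """
    return os.path.isdir(os.path.join(sysfs_root, "firmware", "efi"))


def should_load_samsung_laptop(sysfs_root="/sys"):
    """Mirror the policy from this thread: skip the driver on UEFI boots."""
    return not booted_via_uefi(sysfs_root)


if __name__ == "__main__":
    if should_load_samsung_laptop():
        print("legacy boot detected: loading samsung-laptop should be safe")
    else:
        print("UEFI boot detected: keep samsung-laptop blacklisted")
```

The `sysfs_root` parameter exists only so the check can be exercised against a scratch directory instead of the live system.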
Created attachment 79361 [details]
Picture of the machine check exception error thrown at boot
Hi everybody, I am trying to boot my new Samsung Series 9 NP900X4C laptop using the kernel EFI stub. The laptop supports UEFI. Unfortunately, most of the time the boot process panics and throws a machine check exception (please find a picture of it attached to the report; the picture has horrible quality, and I hope to get one with a decent camera later today). Only a few times have I seen the boot process succeed. I do not know whether the bug is in the kernel stub or in the UEFI firmware of the laptop itself (which appears to be buggy, even though Samsung has already pushed two BIOS updates), though I'd love to see if I can be of any help fixing this, given that I would rather boot through UEFI than through BIOS + syslinux. Please let me know if you need more information; I'll be glad to come back with whatever log is needed once instructed. Cheers, Alessandro