Bug 47121 - UEFI boot panics on a new Samsung Series 9 laptop throwing a machine check exception
Status: RESOLVED CODE_FIX
Alias: None
Product: Platform Specific/Hardware
Classification: Unclassified
Component: x86-64
Hardware: All Linux
Importance: P1 normal
Assignee: platform_x86_64@kernel-bugs.osdl.org
URL:
Keywords:
Duplicates: 49161
Depends on:
Blocks:
 
Reported: 2012-09-06 13:14 UTC by Alessandro Crismani
Modified: 2021-10-15 17:59 UTC (History)
CC List: 18 users

See Also:
Kernel Version: 3.5.3 and 3.6-rc4
Subsystem:
Regression: No
Bisected commit-id:


Attachments
Picture of the machine check exception error thrown at boot (303.02 KB, image/jpeg)
2012-09-06 13:14 UTC, Alessandro Crismani
Higher quality picture capturing the panic (564.90 KB, image/jpeg)
2012-09-06 17:52 UTC, Alessandro Crismani
dmesg log when booting using the kernel efi stub (3.6-rc4) (55.95 KB, text/x-log)
2012-09-10 19:11 UTC, Alessandro Crismani
Dmesg output when booting using the UEFI enabled grub of Ubuntu 12.04.1 (57.90 KB, text/x-log)
2012-09-12 06:37 UTC, Alessandro Crismani

Description Alessandro Crismani 2012-09-06 13:14:59 UTC
Created attachment 79361 [details]
Picture of the machine check exception error thrown at boot

Hi everybody,

I am trying to boot my new Samsung Series 9 NP900X4C laptop using the kernel EFI stub. The laptop supports UEFI.

Unfortunately, most of the time the boot process panics and throws a machine check exception (please find a picture of it attached to the report; the picture is of horrible quality, and I hope to get one with a decent camera later today). Only a few times have I seen the boot process succeed.

I do not know whether the bug is in the kernel stub or in the UEFI firmware of the laptop itself (which appears to be buggy, even though Samsung has already pushed two BIOS updates), but I'd love to help fix this, given that I would rather boot through UEFI than through BIOS + syslinux.

Please let me know if you need more information; I'll be glad to come back with whatever logs you ask for.

Cheers,
Alessandro
Comment 1 Alessandro Crismani 2012-09-06 17:52:12 UTC
Created attachment 79381 [details]
Higher quality picture capturing the panic
Comment 2 Matt Fleming 2012-09-10 11:44:10 UTC
Alessandro,

Are you able to boot the machine successfully using a UEFI boot loader instead of the EFI boot stub directly? Also, if you could attach a copy of dmesg to this bug report next time you successfully boot using the EFI boot stub, that would be very helpful.
Comment 3 Alessandro Crismani 2012-09-10 12:42:03 UTC
If I remember correctly, the latest Ubuntu live CD booted ok using its UEFI enabled grub. At the same time, I remember having had problems with previous iterations of Ubuntu (those were tested with an older BIOS, which may have caused more trouble).

I'll attach a dmesg log later today, both for my install that uses the kernel stub and for an Ubuntu live CD, that uses grub.

Before I collect such information, which kernel version should I use for obtaining the stub log?
Comment 4 Matt Fleming 2012-09-10 13:13:11 UTC
If possible, I would recommend using a v3.6-rc4 kernel.
Comment 5 Alessandro Crismani 2012-09-10 19:11:29 UTC
Created attachment 79621 [details]
dmesg log when booting using the kernel efi stub (3.6-rc4)

Log from dmesg when booting using the kernel EFI stub. The kernel is at version 3.6-rc4. It took me 8 attempts before I got to the login screen, so the success-to-failure ratio of the boot sequence is quite low.
Comment 6 Alessandro Crismani 2012-09-10 19:12:48 UTC
I'll provide the dmesg log using Ubuntu + grub (or any other distribution that you may recommend) tomorrow, given that I lack a USB DVD reader or a USB drive at the moment.
Comment 7 Alessandro Crismani 2012-09-12 06:37:54 UTC
Created attachment 79891 [details]
Dmesg output when booting using the UEFI enabled grub of Ubuntu 12.04.1
Comment 8 Matt Fleming 2012-09-12 16:24:20 UTC
Are you able to successfully boot with grub every time? To me this seems like a hardware error, but if that is the case then I would expect you to see errors most of the time.
Comment 9 Alessandro Crismani 2012-09-13 14:06:05 UTC
You were right. I installed grub on my system, and those MCE errors are also thrown when booting with it.

I am wondering why Ubuntu's grub seems to be more resilient. Could it be that booting from the laptop's SSD is fast enough to cause problems for the boot sequence, compared to booting from the live CD?
Comment 10 Matt Fleming 2012-09-17 09:30:10 UTC
It's difficult to say what is causing the MCEs, and so it's difficult to say why one boot scenario throws MCEs more reliably than a different one.

You could try swapping out the RAM to see if that makes a difference, since you may have a bad RAM module, or try different components to see if the problem disappears. Other than that, I would suggest turning on the legacy BIOS if you have that option, or trying to get your laptop replaced/repaired if you don't.
Comment 11 Alessandro Crismani 2012-09-17 09:50:29 UTC
Fortunately, I can switch to a legacy BIOS based boot system.

The legacy BIOS boots without any problems using the Syslinux bootloader (and I believe any other solution works too). Hence, I do not think that I have faulty RAM modules, and I suspect that the UEFI code in the BIOS is the one to blame! Does it make sense?
Comment 12 Matt Fleming 2012-09-19 09:11:34 UTC
Yep, that is entirely possible.
Comment 13 Mikhail 2012-10-08 15:27:01 UTC
I'll add some details here. The MCE code is "MCE: 5 Bank 6: be2000000003110a". And here is someone who hit the same MCE code when playing with caching of PCI memory regions:

http://stackoverflow.com/questions/11254023/how-to-do-mmap-for-cacheable-pcie-bar
Comment 14 Alan 2012-10-08 15:34:40 UTC
Another thing to check, Matt, would be whether flushing the TLB on all processors after every EFI call fixes the problem. There are strict rules on aliased mappings, and if the firmware is adding its own but not flushing the TLB after they are gone, then that would cause MCEs in some situations.

Alan
Comment 15 Matt Fleming 2012-10-08 18:58:56 UTC
Alan, that's interesting.

Mikhail, Alessandro, could you please build a kernel from the 'efi/tlbflush' branch at git://git.kernel.org/pub/scm/linux/kernel/git/mfleming/linux.git and see whether the MCE errors disappear?
Comment 16 Alessandro Crismani 2012-10-12 07:23:54 UTC
Unfortunately, it doesn't help.

Maybe my laptop really has poor-quality UEFI code, and it just fails in such a horrible way.
Comment 17 Alan 2012-10-12 15:27:29 UTC
Gak, not much else I can think of to try, short of turning off all of the 2MB/4MB page usage.
Comment 18 Mikhail 2012-10-13 04:38:01 UTC
Excuse me for the delay. I've tested efi/tlbflush, and it does not help. To my knowledge, this is not a TLB problem. One Intel guy noted:

The SDM can help here. See volume 3A, section 15.9.2 "Compound Error Codes".

The low 16 bits of the status in this case are 0001 0001 0000 1010

This tells us that you have a cache error in L2 cache severe enough that the
processor has begun filtering further error reports.

And the Stack Overflow question is about playing with caching policy. Maybe the MTRRs are configured incorrectly?
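
The decoding quoted above can be reproduced by hand. The following is only a minimal sketch, assuming the status word from comment 13 follows the architectural IA32_MCi_STATUS layout and the cache hierarchy compound error code format (000F 0001 RRRR TTLL) from the SDM section cited above. The function name and the dictionary keys are made up for illustration; this is not a replacement for a real decoder such as mcelog.

```python
# Sketch: decode an MCi_STATUS value such as the one reported in this bug.
# Assumption: architectural bit layout; model-specific fields are not decoded.

STATUS_FLAGS = {
    63: "VAL",    # register contains valid information
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # uncorrected error
    60: "EN",     # error reporting was enabled
    59: "MISCV",  # MCi_MISC register is valid
    58: "ADDRV",  # MCi_ADDR register is valid
    57: "PCC",    # processor context corrupt
}

LEVELS = {0b00: "L0", 0b01: "L1", 0b10: "L2", 0b11: "generic"}
TYPES = {0b00: "instruction", 0b01: "data", 0b10: "generic"}
REQUESTS = {
    0: "ERR", 1: "RD", 2: "WR", 3: "DRD", 4: "DWR",
    5: "IRD", 6: "PREFETCH", 7: "EVICT", 8: "SNOOP",
}


def decode_mci_status(status):
    """Split a 64-bit MCi_STATUS value into flags and error-code fields."""
    flags = [name for bit, name in STATUS_FLAGS.items() if (status >> bit) & 1]
    code = status & 0xFFFF
    result = {
        "flags": flags,
        "mca_code": code,
        "model_specific_code": (status >> 16) & 0xFFFF,
    }
    # Cache hierarchy errors have the form 000F 0001 RRRR TTLL:
    # bits 15:13 are zero, bit 12 is the filter bit, bits 11:8 are 0001.
    if code & 0xEF00 == 0x0100:
        result["kind"] = "cache hierarchy error"
        result["filtered"] = bool(code & 0x1000)
        result["request"] = REQUESTS.get((code >> 4) & 0xF, "reserved")
        result["type"] = TYPES.get((code >> 2) & 0x3, "reserved")
        result["level"] = LEVELS[code & 0x3]
    else:
        result["kind"] = "other (not decoded in this sketch)"
    return result


if __name__ == "__main__":
    for key, value in decode_mci_status(0xBE2000000003110A).items():
        print(key, "=", value)
```

For be2000000003110a this reports a cache hierarchy error at level L2 with the filter bit set, which matches the reading given above. It says nothing about the cause, only about what the hardware logged.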
Comment 19 Mikhail 2012-10-19 14:11:15 UTC
Hi. Maybe this will help to clarify things. My distro now uses the 3.6.2 kernel, and the machine does not boot at all, even from a USB drive. Earlier kernels were sometimes (frequently enough) able to boot from USB 2.0 flash drives.
Comment 20 Patrick Hemmer 2012-10-26 04:32:42 UTC
I have the same laptop and was experiencing the exact same issue. However I was able to resolve it by blacklisting the "samsung_laptop" module. Apparently it was getting loaded by udev (or something else) automatically.

With the module loading, I could boot maybe one out of every 15 times. With it blacklisted I've probably rebooted about 20 times now without any issues.
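
For anyone wanting to try the same workaround: blacklisting is normally done with a "blacklist samsung_laptop" line in a file under /etc/modprobe.d/ (the exact file name is up to you, and your distribution may also need the initramfs rebuilt). The sketch below is only an illustration of how such a file is read. The helper names and the example file content are hypothetical, and it assumes the usual modprobe.d rules that lines starting with '#' are comments and that '-' and '_' are interchangeable in module names.

```python
# Sketch: check whether a module is blacklisted in modprobe.d-style text.
# Assumption: only plain "blacklist <name>" lines matter for this check.

def blacklisted_modules(conf_text):
    """Return the set of module names blacklisted in the given text."""
    names = set()
    for line in conf_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "blacklist":
            # modprobe treats '-' and '_' as the same character
            names.add(parts[1].replace("-", "_"))
    return names


def is_blacklisted(module, conf_text):
    """True if 'module' appears in a blacklist line of conf_text."""
    return module.replace("-", "_") in blacklisted_modules(conf_text)


# Hypothetical content of a file such as /etc/modprobe.d/samsung.conf
EXAMPLE_CONF = """\
# Keep samsung-laptop from being auto-loaded on UEFI boots
blacklist samsung_laptop
"""

if __name__ == "__main__":
    print(is_blacklisted("samsung-laptop", EXAMPLE_CONF))
```

Note that a blacklist entry only stops automatic loading through module aliases; an explicit "modprobe samsung-laptop" still loads the module.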
Comment 21 Matt Fleming 2012-10-26 10:54:44 UTC
Patrick, thank you for that bit of info. That's very helpful.

In light of this, the MCEs actually make a lot of sense because they can be caused under UEFI when the kernel pokes around in memory regions it's not supposed to. And reading the samsung-laptop code, that's exactly what it does.

Mikhail, Alessandro, can you please try disabling the samsung-laptop module and see if you have better results?
Comment 22 Mikhail 2012-10-28 09:18:16 UTC
Yes, Linux 3.6.3 boots with samsung-laptop blacklisted. Additional info: I tried to blacklist this module on Linux 3.4.x and there was no effect. I did not try this on 3.5.

As far as I understand, this module provides, among other things (?), keyboard backlight support. And when the kernel was booted under UEFI, the MCE could always be triggered by attempting to change the keyboard backlight level with this command:

# echo 0 > /sys/devices/platform/samsung/leds/samsung::kbd_backlight/brightness
Comment 23 Tu Hoang 2012-11-21 17:09:44 UTC
*** Bug 49161 has been marked as a duplicate of this bug. ***
Comment 24 Randolph Maaßen 2012-11-21 17:18:08 UTC
Maybe this can be helpful: I booted from an Ubuntu 12.10 live USB stick and the error occurred, but when I disabled the SSD boot entry the live USB stick ran normally.
Comment 25 Patrick Hemmer 2012-11-21 17:19:15 UTC
Just updating as I'm not sure this should be marked as WILL_NOT_FIX any more.
There was an internal discussion a few weeks ago about having the samsung-laptop module do a run-time check to see if UEFI is enabled and to not load if so. But I'm not sure if there was any final verdict on that discussion, or whether I just got removed from the email CC list.

Could we get an update on whether the run-time check will be implemented, or if this will indeed remain as WILL_NOT_FIX?

Also it would be nice if the samsung-laptop code could be fixed so that it could be used without generating the MCE errors. Could this be proposed as the final solution to the issue, with the run-time check just an intermediary fix?
Comment 26 Matt Fleming 2012-11-21 17:49:08 UTC
Adding some more people to the Cc list.

A runtime check is indeed the best way to fix this issue. I'm not sure it's possible to fix the samsung-laptop module to run on UEFI machines.
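
Until such a check exists in the driver, the same decision can be approximated from userspace. The sketch below assumes the common convention that the kernel exposes /sys/firmware/efi only when it was booted via EFI. The function names are hypothetical, and this is a userspace analogue for illustration, not the kernel-side check being discussed here.

```python
import os

# Sketch: decide from userspace whether the running kernel was booted via EFI.
# Assumption: /sys/firmware/efi exists only on EFI boots.

def booted_via_efi(sysfs_root="/sys"):
    """Return True if <sysfs_root>/firmware/efi exists as a directory."""
    return os.path.isdir(os.path.join(sysfs_root, "firmware", "efi"))


def should_load_samsung_laptop(sysfs_root="/sys"):
    """Hypothetical policy helper: only allow the module on non-EFI boots."""
    return not booted_via_efi(sysfs_root)


if __name__ == "__main__":
    if should_load_samsung_laptop():
        print("legacy BIOS boot: samsung-laptop can be loaded")
    else:
        print("EFI boot: keep samsung-laptop blacklisted")
```

The sysfs_root parameter exists only so the helper can be pointed at a test directory; on a real system the default is what you want.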
Comment 27 Greg Kroah-Hartman 2012-11-21 17:58:32 UTC
I'm no longer the maintainer of this driver, but this was already discussed on the linux-efi mailing list, and it's a bug in the core kernel EFI code: it should be failing the memory mapping of this area. It's not a driver bug in itself.

Of course the driver does expose the bug, which makes this a pain, but the core is where this should be fixed properly, not much we can do in the driver itself, unless someone wants to add a blacklist of machines this driver should not bind to.
Comment 28 Matthew Garrett 2012-11-21 18:24:48 UTC
The core can't simply refuse access to these address ranges, because some hardware requires it for VESA. Yes, this is ridiculous.
Comment 29 H. Peter Anvin 2012-11-21 18:26:48 UTC
Yes, the legacy area is, and probably will always be, special.
Comment 30 Alan 2012-11-21 21:00:33 UTC
The driver needs to handle it in this case. There will no doubt be lots of other similar things of this nature, hence the need for an "on_drugs()" or similar test for running in EFI mode.
Comment 31 Mikhail 2012-11-24 11:07:45 UTC
But samsung-laptop provides important functionality: rfkill, keyboard backlight, performance level management... If the module does not load, this stuff will not work. Is it possible to solve this problem in another way?
Comment 32 mus.svz 2012-11-24 11:18:51 UTC
I had the same problem on my new Series 9 900x4d. Tried to install Arch Linux via UEFI boot and got the same kernel panic. Blacklisting samsung-laptop allowed me to boot and install the system.

The strange thing is that I now cannot reproduce the problem anymore, not even with the USB drive that I used for the installation (where the panic definitely happened). The panic doesn't happen anymore because the driver simply isn't loaded anymore.

"modprobe samsung-laptop" gives me:
modprobe: ERROR: could not insert 'samsung_laptop': No such device

I also noticed that the aesni_intel module isn't loaded and /proc/cpuinfo doesn't report the aes flag, although the processor definitely supports AES-NI [1]. modprobing aesni_intel gives me the same "No such device" error message. Could this be related?

I'm going to try to boot via non-UEFI and see if that changes anything.

[1] http://ark.intel.com/products/65707
Comment 33 mus.svz 2012-11-24 11:49:18 UTC
Ah, I got it. The samsung-laptop module is only loaded if "OS Mode Selection" is set to "UEFI and CSM OS". It is not loaded when it is set to "UEFI OS". It took me a while to figure this out because changing the setting only seems to take effect when the machine is actually powered off and not just rebooted.

aesni_intel isn't loaded either way, not even with non-UEFI boot, so I guess this is another bug. Can anyone confirm this?

I upgraded to the latest BIOS yesterday btw, before even trying to boot the Arch Linux installation media.
Comment 34 Greg Kroah-Hartman 2012-11-24 17:51:38 UTC
On Sat, Nov 24, 2012 at 11:07:45AM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> But samsung-laptop provides important functionality: rfkill, keyboard
> backlight, performance level management... If the module will not load this
> stuff will not work. Is it possible to solve this problem in another way?

None of that functionality will probably be supported by this driver
anyway on UEFI laptops, those options will come from ACPI, so you will
need those ACPI drivers loaded to control them.
Comment 35 Tu Hoang 2012-11-29 23:12:58 UTC
I was having the same problem on a Samsung Ultrabook Series 9 as well. I employed the same workaround of blacklisting the samsung-laptop kernel module. As a result, the keyboard backlight did not work in low-light conditions.

However, with the recent kernel 3.6.8-1 and linux-firmware 20121118-1 on Archlinux, keyboard backlight is working even with samsung-laptop blacklisted. Maybe we don't need samsung-laptop kernel module at all?
Comment 36 Mikhail 2012-12-01 11:50:43 UTC
> However, with the recent kernel 3.6.8-1 and linux-firmware 20121118-1 on
> Archlinux, keyboard backlight is working even with samsung-laptop
> blacklisted.
> Maybe we don't need samsung-laptop kernel module at all?

It's not working on series 9 New with the same kernel/firmware.

(In reply to comment #33)

> aesni_intel isn't loaded either way, not even with non-UEFI boot, so I guess
> this is another bug. Can anyone confirm this?
> 
> I upgraded to the latest BIOS yesterday btw, before even trying to boot the
> Arch Linux installation media.

I have P04AAC (something like that) BIOS. And samsung-laptop is loading (crashing the system when in UEFI mode), aesni_intel is loading too.
Comment 37 Tu Hoang 2012-12-14 18:42:48 UTC
No longer working on 3.6.9-1 and 3.6.10-1. The only kernel version in which I got the ambient light sensor to work without samsung-laptop is 3.6.8-1.
Comment 38 Matt Fleming 2013-02-01 08:14:31 UTC
Folks,

A fix for the samsung-laptop module has been merged into Linus' tree and is available in v3.8-rc6. It has also been marked for the stable trees.

The fix is,
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commit;h=e0094244e41c4d0c7ad69920681972fc45d8ce34

and its prerequisite,

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commit;h=83e68189745ad931c2afd45d8ee3303929233e7f
Comment 39 Matt Fleming 2013-02-20 19:33:51 UTC
Fixed in v3.8.
Comment 40 Florian Mickler 2013-02-23 10:29:00 UTC
A patch referencing this bug report has been merged in Linux v3.8:

commit 1de63d60cd5b0d33a812efa455d5933bf1564a51
Author: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Date:   Thu Feb 14 09:12:52 2013 +0900

    efi: Clear EFI_RUNTIME_SERVICES rather than EFI_BOOT by "noefi" boot parameter
Comment 41 Florian Mickler 2013-03-04 21:24:17 UTC
A patch referencing a commit somehow associated to this bug report has been merged in Linux v3.9-rc1:

commit fb834c7acc5e140cf4f9e86da93a66de8c0514da
Author: Matt Fleming <matt.fleming@intel.com>
Date:   Wed Feb 20 20:36:12 2013 +0000

    x86, efi: Make "noefi" really disable EFI runtime serivces
