Bug 11259

Summary: oops when closing laptop lid unless CONFIG_ACPI_VIDEO=n or CONFIG_SMP=n -- HP Compaq 6720s
Product: ACPI
Reporter: Dmitri Gribenko (gribozavr)
Component: Power-Video
Assignee: Matthew Garrett (mjg59-kernel)
Status: CLOSED PATCH_ALREADY_AVAILABLE
Severity: normal
CC: acpi-bugzilla, adamw, bjorn.helgaas, gi1242, lenb, mjg59-kernel, patrick.kuijvenhoven, tbm, trenn
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 2.6.25, 2.6.26, 2.6.27-rc2
Subsystem:
Regression: No
Bisected commit-id:
Attachments:
kernel messages
lspci -vxxx
acpidump
kernel config
oops with acpid debugging turned off
Modified DSDT for temporary overriding for testing
grep . /sys/firmware/acpi/interrupts/*
KMS-dependent patch for issue
Reserve low 64K on HPs

Description Dmitri Gribenko 2008-08-06 10:38:47 UTC
Latest working kernel version: unknown
Earliest failing kernel version: 2.6.26
Distribution: Debian lenny amd64
Hardware Environment: HP Compaq 6720s

==============================
Problem Description:
When I close the laptop lid, I get a panic. I first got it on the stock Debian 2.6.25 kernel.  Then I compiled vanilla 2.6.26, 2.6.27-rc1 and 2.6.27-rc2 kernels and got the same panic.  With the kernel config file for 2.6.27-rc2 (will post it below) the problem is 100% reproducible, as it is on the other kernels I mentioned.

I have discovered that if I turn off CONFIG_ACPI_VIDEO, the problem goes away and I can't trigger it anymore.  It also goes away if I disable SMP.

I get this panic when the system is booted normally (with gnome, hal, acpid etc. running), as well as in single-user mode (only init + root shell started).  All kernel logs I will post below were captured with netconsole in single-user mode.  netconsole is the only option for me, because the machine always hangs irrecoverably after the oops and the kernel log isn't saved to disk.

==============================
Steps to reproduce:
Enable CONFIG_ACPI_VIDEO and SMP, boot, close the lid.
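
A minimal sketch, assuming the kernel config is available as text, of how to check whether a given build has both options from the steps above enabled. The helper names and the sample config are hypothetical and not taken from the attached config.

```python
import re


def option_state(config_text, name):
    """Return the value of a CONFIG_ option ('y', 'm', ...) or 'n' if unset."""
    match = re.search(rf"^{re.escape(name)}=(\S+)$", config_text, re.MULTILINE)
    return match.group(1) if match else "n"


def lid_bug_preconditions(config_text):
    """True if both options named in the reproduction steps are enabled."""
    video = option_state(config_text, "CONFIG_ACPI_VIDEO") in ("y", "m")
    smp = option_state(config_text, "CONFIG_SMP") == "y"
    return video and smp


# Hypothetical config excerpt, for illustration only.
SAMPLE_CONFIG = """\
CONFIG_SMP=y
CONFIG_ACPI=y
CONFIG_ACPI_VIDEO=m
# CONFIG_ACPI_DEBUG is not set
"""

if __name__ == "__main__":
    print(lid_bug_preconditions(SAMPLE_CONFIG))
```

A build where either option is "not set" would report False, which matches the two configurations in which the reporter could no longer trigger the panic.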
Comment 1 Dmitri Gribenko 2008-08-06 11:00:15 UTC
Created attachment 17104 [details]
kernel messages

I did:
rmmod fan
rmmod battery
rmmod thermal
rmmod ac
echo 0x1f >/sys/module/acpi/parameters/debug_level
(I should note that when I turn ACPI debugging on this way, the panic does not always happen on the first lid close, but sometimes only on the second or third)

Then I closed the lid @81 sec.  No panic, I opened the lid @91 sec.  Closed it again @96 sec.  I never managed to get a backtrace before.
Comment 2 Dmitri Gribenko 2008-08-06 11:01:26 UTC
Created attachment 17105 [details]
lspci -vxxx
Comment 3 Dmitri Gribenko 2008-08-06 11:02:33 UTC
Created attachment 17106 [details]
acpidump
Comment 4 Dmitri Gribenko 2008-08-06 11:04:58 UTC
Created attachment 17107 [details]
kernel config
Comment 5 Dmitri Gribenko 2008-08-06 11:19:50 UTC
Created attachment 17108 [details]
oops with acpid debugging turned off

When acpi debugging is off, I always get the following message:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000046

And it is different from what I get when acpi debugging is on.  So, just for the record, I post this one, too.
Comment 6 Thomas Renninger 2008-08-06 15:57:37 UTC
Created attachment 17112 [details]
Modified DSDT for temporary overriding for testing

Can you compile this modified DSDT into your kernel?
You need:
CONFIG_ACPI_CUSTOM_DSDT_FILE
pointing to the downloaded attachment
CONFIG_ACPI_CUSTOM_DSDT
set to yes.
Double check in dmesg whether overriding the DSDT worked.

Does everything work fine now?
If not, it is best to try again, this time passing log_buf_len=... and acpi.debug_level=0x21F
to get even more output. There are still a lot of gaps in the log.
Comment 7 Dmitri Gribenko 2008-08-06 18:03:03 UTC
2.6.27-rc2 kernel messages (w/ modified DSDT), compressed with bzip2
http://rapidshare.com/files/135420903/log39.bz2.html

Unfortunately with modified DSDT I get the same oops (BUG: unable to handle kernel NULL pointer dereference at 0000000000000046) if ACPI debugging is off.

Every time I triggered it with ACPI debugging on, it didn't print any oops message (I tried X times), the laptop just hung and didn't respond to anything.

Some messages are out of order (approximately @40-60 sec).  I didn't sort the file, I post it as it was captured with netconsole.

Kernel config is all the same, only custom DSDT was turned on.

In this run:
@0.013878: ACPI: Table DSDT replaced by host OS
@120.713494: rmmod fan battery thermal ac
@127.480872: lid closed
@136.779098: lid opened
@143.424001: lid closed
@152.671701: lid opened
@161.638761: lid closed ==> it hung
Comment 8 Dmitri Gribenko 2008-08-06 18:10:29 UTC
2.6.27-rc2 kernel messages (w/ modified DSDT), compressed with bzip2
http://rapidshare.com/files/135421985/log38.bz2.html

I decided to post one more log, where the problem was triggered the first time the lid was closed (just in case).

In this run:
@0.013790: ACPI: Table DSDT replaced by host OS
@123.865109: rmmod fan battery thermal ac
@132.556135: lid closed ==> it hung
Comment 9 Zhang Rui 2008-08-06 18:16:59 UTC
would you please run "echo 1 > /proc/acpi/video/XXX/DOS" before closing the lid and see if it helps?
Comment 10 Dmitri Gribenko 2008-08-06 18:57:37 UTC
> I tried X times
Excuse me, I composed the message before I finished testing.  I tried 5 times then.

> would you please run "echo 1 > /proc/acpi/video/XXX/DOS" before closing the
> lid
> and see if it helps?
Yes, it seems to help (with modified DSDT, didn't try with my original, but can do that if needed).  But when I run that with ACPI debugging on, no kernel messages are printed when lid is closed/opened.  (Does it ignore lid events after that command?)  If I start acpid and then acpi_listen, I don't see any messages coming about lid button.
Comment 11 Zhang Rui 2008-08-11 20:04:34 UTC
(In reply to comment #10)
> > would you please run "echo 1 > /proc/acpi/video/XXX/DOS" before closing the
> lid
> > and see if it helps?
> Yes, it seems to help (with modified DSDT, didn't try with my original, but
> can
> do that if needed).  But when I run that with ACPI debugging on, no kernel
> messages are printed when lid is closed/opened.  (Does it ignore lid events
> after that command?)
I don't think so.
Please run "grep . /sys/firmware/acpi/interrupts/*" both before and after closing the lid and attach the output.
Comment 12 Zhang Rui 2008-08-11 20:51:53 UTC
as it works well in 2.6.26, could you please use git-bisect to figure out which commit introduces this regression?
Comment 13 Dmitri Gribenko 2008-08-12 10:48:03 UTC
Created attachment 17189 [details]
grep . /sys/firmware/acpi/interrupts/*

If I run "echo 1 > /proc/acpi/video/XXX/DOS", then "grep . /sys/firmware/acpi/interrupts/*" produces the same output before closing the lid, while it is closed, and after opening it (output attached).
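
A minimal sketch of how two such snapshots could be compared mechanically, assuming each line of the grep output has the form "path: count" (possibly followed by extra fields). The counter values in the sample are hypothetical and are not taken from the attachment.

```python
def parse_counters(grep_output):
    """Map counter name (e.g. 'gpe18') to its integer count."""
    counters = {}
    for line in grep_output.splitlines():
        path, sep, value = line.partition(":")
        fields = value.split()
        # Skip lines that do not look like "path: <number> ...".
        if not sep or not fields or not fields[0].isdigit():
            continue
        counters[path.rsplit("/", 1)[-1]] = int(fields[0])
    return counters


def changed_counters(before, after):
    """Return {name: delta} for every counter that differs between snapshots."""
    old, new = parse_counters(before), parse_counters(after)
    return {
        name: new[name] - old.get(name, 0)
        for name in new
        if new[name] != old.get(name, 0)
    }


# Hypothetical snapshots, for illustration only.
BEFORE = """\
/sys/firmware/acpi/interrupts/gpe18:       3
/sys/firmware/acpi/interrupts/gpe_all:      40
/sys/firmware/acpi/interrupts/sci:      40
"""
AFTER = """\
/sys/firmware/acpi/interrupts/gpe18:       4
/sys/firmware/acpi/interrupts/gpe_all:      41
/sys/firmware/acpi/interrupts/sci:      41
"""

if __name__ == "__main__":
    print(changed_counters(BEFORE, AFTER))
```

For the situation described above, where the output is identical in all three states, changed_counters would return an empty dictionary.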

I'd like to use git-bisect, but I can't find a version that works.

2.6.8, 2.6.18 and 2.6.22 panicked during early boot initialization. Sometimes, but not always, I see 'acpi' in function names in the backtraces. The 'video' module, which is built by CONFIG_ACPI_VIDEO, isn't loaded at that point yet, so I think those panics are caused by a different bug.

2.6.23, 2.6.23.17 fail to build.

2.6.24, 2.6.26, 2.6.27-{rc1,rc2} all panic when lid is closed.

I understand that this probably doesn't help, but that is everything that I could find out.

Thank you for your time and efforts.
Comment 14 Zhang Rui 2008-08-12 20:16:38 UTC
So this problem exists on all tested kernel versions; clearing the regression flag.
Comment 15 Len Brown 2008-08-22 11:01:10 UTC
Rui,
CONFIG_SMP=n sounds like a clue.

Perhaps Linux sometimes handles an event on CPU1 but
some AML (and/or SMM) on this box assumes that it must
be running on CPU0?
Comment 16 Zhang Rui 2008-08-26 02:31:34 UTC
okay.

Dmitri, will you please boot with "noirqbalance" and check if it helps?
Comment 17 Dmitri Gribenko 2008-08-26 05:32:04 UTC
No, noirqbalance doesn't help (I can post kernel log if you need it).

But booting with maxcpus=1 on a SMP kernel helps.
Comment 18 Zhang Rui 2008-10-31 00:06:30 UTC
Please
1. boot with "noirqbalance"
2. and disable the irqbalance service after boot
and try again.
Comment 19 Zhang Rui 2008-11-24 22:44:59 UTC
ping Dmitri, any update?
Comment 20 Zhang Rui 2008-12-03 18:44:41 UTC
I can reproduce this problem.
The system hangs on an SMI call in the _L18 method.
Several workarounds are available:
1. _DOS=1. In this case, GPIO 0x18 is routed to SMI mode, so no GPE 0x18 fires when closing the lid.
2. Disable GPE 0x18 manually when _DOS=0. This works the same way as workaround 1.
3. CONFIG_SMP=n or maxcpus=1. This is weird; I suspect that the oops has something to do with the number of the CPU that triggers the SMI.
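
For readers unfamiliar with the _DOS values used in this thread (0, 1 and later 7), here is a hedged sketch of how the argument decodes. It is based on my reading of the _DOS description in the ACPI specification; the assumption that the value written to /proc/acpi/video/XXX/DOS is passed through as that argument unchanged is mine, and the dictionary and function names are hypothetical.

```python
# Meaning of bits 1:0 of the _DOS argument, paraphrased from the ACPI spec.
DISPLAY_SWITCH = {
    0: "firmware does not switch outputs; it records the request and notifies the OS",
    1: "firmware switches the active display output automatically",
    2: "_DGS values are locked; firmware neither switches nor notifies",
    3: "firmware does not switch outputs; it only sends switch notifications",
}


def decode_dos(value):
    """Split a _DOS argument (0..7) into its two documented fields."""
    if not 0 <= value <= 7:
        raise ValueError("_DOS argument must be in the range 0..7")
    return {
        "display_switch": DISPLAY_SWITCH[value & 0x3],
        # Bit 2 set means firmware should NOT adjust LCD brightness
        # automatically on AC/DC power changes.
        "firmware_auto_brightness": not (value & 0x4),
    }


if __name__ == "__main__":
    for value in (0, 1, 7):
        print(value, decode_dos(value))
```

Under this reading, workaround 1 hands display switching back to the firmware, which fits the observation that the lid GPE no longer reaches the OS.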

Dmitri,
please do the test in comment #18.
Comment 21 Patrick Kuijvenhoven 2008-12-29 06:15:53 UTC
My laptop (an HP 6710b) is quite similar to Dmitri's. I also have the same symptoms. Hope I can help.

I'm running the proposed kernel for Ubuntu 8.10 (a.k.a. Intrepid Ibex), which currently is based on version "2.6.27-11-generic".

I've booted with the "noirqbalance" boot option twice. After both boots, I looked at /proc/interrupts and noticed the interrupts were still being "balanced". I do not have an "irqbalance" service running. In the kernel config, the CONFIG_IRQBALANCE line is commented out.

First boot: closing the lid resulted in a "freeze". Screen was there, but I was unable to use the keyboard.
Second boot: closing the lid resulted in an "oops". Dump was quite like http://launchpadlibrarian.net/18617502/16102008130.jpg

These bugreports might interest you as well:
https://bugs.launchpad.net/ubuntu/+source/acpid/+bug/157691
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/228399

Two workarounds I use (only one of these is enough to prevent the "oops"):
# echo 0 > /sys/devices/system/cpu/cpu1/online
# echo 7 > /proc/acpi/video/C098/DOS

Please let me know how I can help.
Comment 22 Gautam Iyer 2008-12-29 10:39:30 UTC
I've had the same problem on an HP 2710p ever since 2.6.24/25. Just today I ran it on a 2.6.28 kernel, with X86_CHECK_BIOS_CORRUPTION=y and I found the following in /var/log/messages:

	Corrupted low memory at c000fe2c (fe2c phys) = 00000010
	Corrupted low memory at c000fe38 (fe38 phys) = 00000ff0
	Corrupted low memory at c000fe40 (fe40 phys) = 000006d0
	Corrupted low memory at c000fe4c (fe4c phys) = 00000800
	Corrupted low memory at c000fe50 (fe50 phys) = cff3007b
	Corrupted low memory at c000fe54 (fe54 phys) = 000fffff
	Corrupted low memory at c000fe5c (fe5c phys) = 8f9300d8
	Corrupted low memory at c000fe60 (fe60 phys) = 000fffff
	Corrupted low memory at c000fe64 (fe64 phys) = 01366000
	Corrupted low memory at c000fe68 (fe68 phys) = cf000000
	Corrupted low memory at c000fe6c (fe6c phys) = 000fffff
	Corrupted low memory at c000fe74 (fe74 phys) = 00820000
	Corrupted low memory at c000fe78 (fe78 phys) = 000007ff
	Corrupted low memory at c000fe7c (fe7c phys) = c0424000
	Corrupted low memory at c000fe80 (fe80 phys) = 008b0080
	Corrupted low memory at c000fe84 (fe84 phys) = 0000206b
	Corrupted low memory at c000fe88 (fe88 phys) = c17f7800
	Corrupted low memory at c000fe8c (fe8c phys) = c17f4000
	Corrupted low memory at c000fe94 (fe94 phys) = c0424000
	Corrupted low memory at c000fea4 (fea4 phys) = 00008000
	Corrupted low memory at c000fea8 (fea8 phys) = 00820000
	Corrupted low memory at c000feac (feac phys) = 000000ff
	Corrupted low memory at c000feb0 (feb0 phys) = c17f4000
	Corrupted low memory at c000feb4 (feb4 phys) = cf000000
	Corrupted low memory at c000feb8 (feb8 phys) = 000fffff
	Corrupted low memory at c000fec0 (fec0 phys) = cff3007b
	Corrupted low memory at c000fec4 (fec4 phys) = 000fffff
	Corrupted low memory at c000fecc (fecc phys) = cf9b0060
	Corrupted low memory at c000fed0 (fed0 phys) = 000fffff
	Corrupted low memory at c000fed8 (fed8 phys) = cf930068
	Corrupted low memory at c000fedc (fedc phys) = 000fffff
	Corrupted low memory at c000fee4 (fee4 phys) = fedb0000
	Corrupted low memory at c000fee8 (fee8 phys) = 00000001
	Corrupted low memory at c000feec (feec phys) = 00050000
	Corrupted low memory at c000fef8 (fef8 phys) = fedb0000
	Corrupted low memory at c000fefc (fefc phys) = 00030100
	Corrupted low memory at c000ff14 (ff14 phys) = 000000ff
	Corrupted low memory at c000ff18 (ff18 phys) = 0000778e
	Corrupted low memory at c000ff5c (ff5c phys) = 000000f3
	Corrupted low memory at c000ff64 (ff64 phys) = 00000008
	Corrupted low memory at c000ff6c (ff6c phys) = 000000b2
	Corrupted low memory at c000ff74 (ff74 phys) = 000000b2
	Corrupted low memory at c000ff7c (ff7c phys) = f70edd2c
	Corrupted low memory at c000ff84 (ff84 phys) = f700f960
	Corrupted low memory at c000ff8c (ff8c phys) = 000000b2
	Corrupted low memory at c000ff94 (ff94 phys) = f70edddc
	Corrupted low memory at c000ffa4 (ffa4 phys) = 00b20003
	Corrupted low memory at c000ffa8 (ffa8 phys) = 0000007b
	Corrupted low memory at c000ffac (ffac phys) = 00000060
	Corrupted low memory at c000ffb0 (ffb0 phys) = 00000068
	Corrupted low memory at c000ffb4 (ffb4 phys) = 0000007b
	Corrupted low memory at c000ffb8 (ffb8 phys) = 000000d8
	Corrupted low memory at c000ffc4 (ffc4 phys) = 00000080
	Corrupted low memory at c000ffc8 (ffc8 phys) = 00000400
	Corrupted low memory at c000ffd0 (ffd0 phys) = ffff0ff0
	Corrupted low memory at c000ffd8 (ffd8 phys) = c022afc0
	Corrupted low memory at c000ffe8 (ffe8 phys) = 00000246
	Corrupted low memory at c000fff0 (fff0 phys) = 00494000
	Corrupted low memory at c000fff8 (fff8 phys) = 8005003b
	------------[ cut here ]------------
	WARNING: at arch/x86/kernel/setup.c:718 check_for_bios_corruption+0xb6/0xc0()
	Memory corruption detected in low memory
	Modules linked in: b43 ssb rfkill mac80211 crc32 input_polldev i915 drm libafs(P) fuse ipv6 autofs4 xt_recent ipt_addrtype xt_multiport xt_mac ipt_MASQUERADE xt_state xt_tcpudp ipt_REJECT ipt_LOG xt_limit iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables xt_iprange x_tables nf_conntrack_ftp nf_conntrack snd_pcm_oss snd_mixer_oss snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device ext2 mbcache arc4 ecb rng_core bitrev loop acpi_cpufreq ehci_hcd uhci_hcd sdhci_pci usbcore sdhci mmc_core intelfb fb i2c_algo_bit cfbcopyarea snd_hda_intel snd_pcm video backlight output snd_timer snd_page_alloc wmi battery 8250_pnp 8250 serial_core button e1000e leds_hp_disk led_class i2c_core cfbimgblt cfbfillrect snd_hwdep intel_agp ac serio_raw snd soundcore agpgart evdev [last unloaded: input_polldev]
	Pid: 0, comm: swapper Tainted: P       A   2.6.28 #4
	Call Trace:
	[<c012a656>] warn_slowpath+0x76/0x90
	[<c011eb85>] place_entity+0xb5/0xd0
	[<c0141611>] up+0x11/0x40
	[<c012ae1d>] release_console_sem+0x19d/0x1d0
	[<c011dcf6>] activate_task+0x16/0x20
	[<c0123700>] try_to_wake_up+0x20/0x180
	[<c013d2ab>] autoremove_wake_function+0x1b/0x50
	[<c011e11b>] __wake_up_common+0x4b/0x80
	[<c012b3ab>] printk+0x1b/0x20
	[<c0107986>] check_for_bios_corruption+0xb6/0xc0
	[<c0107995>] periodic_check_for_corruption+0x5/0x30
	[<c0133274>] run_timer_softirq+0x144/0x1b0
	[<c0107990>] periodic_check_for_corruption+0x0/0x30
	[<c0107990>] periodic_check_for_corruption+0x0/0x30
	[<c012efe4>] __do_softirq+0x94/0x160
	[<c012f0e7>] do_softirq+0x37/0x40
	[<c012f479>] irq_exit+0x59/0x80
	[<c0106462>] do_IRQ+0x52/0x90
	[<c0118aba>] read_hpet+0xa/0x10
	[<c014332d>] getnstimeofday+0x3d/0xe0
	[<c01047a3>] common_interrupt+0x23/0x28
	[<c014007b>] __run_hrtimer+0x9b/0xa0
	[<c024920c>] acpi_idle_enter_bm+0x28d/0x2f8
	[<c02b69a9>] cpuidle_idle_call+0x69/0xc0
	[<c010273b>] cpu_idle+0x4b/0xa0
	---[ end trace 2ff63db346bda773 ]---

When the system doesn't crash, the above is repeatable, and produces a corruption in the same regions of the memory. When the system crashes, I get a complete freeze (though once or twice I've got an Oops).

The documentation says I can use memmap to stop the kernel from using this memory. Will try that next...

GI
Comment 23 Gautam Iyer 2008-12-29 11:34:32 UTC
(In reply to comment #22)

> The documentation says I can use memmap to stop the kernel from using this
> memory. Will try that next...

OK. From the logs above, I gathered that the memory between 0xc000fe2c--0xc000ffff is corrupt. Accordingly, I rebooted with memmap=0x01d4$0xc000fe2c. I get exactly the same message as above in my syslog though, so perhaps I did something wrong.
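
A hedged sketch of the address arithmetic involved. Each log line above prints both a kernel virtual address (c000fe2c) and the corresponding physical address (fe2c). As far as I know, memmap=nn$ss marks the region from ss to ss+nn of the memory map as reserved and so expects a physical start address, which would make the 0xc000fe2c start one possible reason the option had no effect; the thread does not confirm this. The 4-byte word size and the rounding to 4 KiB pages are assumptions of mine, and all function names are hypothetical.

```python
import re

LINE = re.compile(r"Corrupted low memory at ([0-9a-f]+) \(([0-9a-f]+) phys\)")

WORD_SIZE = 4      # assumption: each reported value is one 32-bit word
PAGE_SIZE = 0x1000  # assumption: round the reservation to 4 KiB pages


def corrupted_physical_range(log_text):
    """Return (start, end) covering every reported physical address; end is exclusive."""
    phys = [int(m.group(2), 16) for m in LINE.finditer(log_text)]
    if not phys:
        raise ValueError("no corruption lines found")
    return min(phys), max(phys) + WORD_SIZE


def page_align(start, end):
    """Round start down and end up to page boundaries."""
    mask = PAGE_SIZE - 1
    return start & ~mask, (end + mask) & ~mask


def memmap_reserve_arg(start, end):
    """Format a memmap=nn$ss argument (size, then physical start)."""
    return f"memmap={end - start:#x}${start:#x}"


# Three lines copied from the log in comment 22 (first, middle, last).
SAMPLE_LOG = """\
Corrupted low memory at c000fe2c (fe2c phys) = 00000010
Corrupted low memory at c000fe80 (fe80 phys) = 008b0080
Corrupted low memory at c000fff8 (fff8 phys) = 8005003b
"""

if __name__ == "__main__":
    start, end = corrupted_physical_range(SAMPLE_LOG)
    print(memmap_reserve_arg(*page_align(start, end)))
```

The size computed in the comment is consistent with the log: 0x10000 - 0xfe2c is 0x1d4, so only the start address differs from what this sketch would produce before page rounding.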

Should I file a new bug report with the above trace info?

GI
Comment 24 Zhang Rui 2009-02-23 01:13:43 UTC
*** Bug 12328 has been marked as a duplicate of this bug. ***
Comment 25 Matthew Garrett 2009-02-23 19:43:48 UTC
Created attachment 20343 [details]
KMS-dependent patch for issue

This appears to fix the problem on my 2510p, but obviously requires KMS.
Comment 26 Matthew Garrett 2009-02-23 20:01:03 UTC
Created attachment 20344 [details]
Reserve low 64K on HPs

And this is the less-invasive patch that ought to paper over the bug, with luck. Could someone test this?
Comment 27 Matthew Garrett 2009-03-06 17:25:52 UTC
Hm. No, simply reserving the low 64K doesn't work.
Comment 28 Zhang Rui 2009-03-23 00:17:18 UTC
patch already available.
http://patchwork.kernel.org/patch/13147/
Comment 29 Len Brown 2009-04-02 04:29:43 UTC
patch mentioned in comment #28 is in the acpi tree
"ACPI: Populate DIDL before registering ACPI video device on Intel"
Comment 30 Dmitri Gribenko 2009-04-02 09:40:31 UTC
I confirm that this patch fixes the bug. (Tested on Linux kernel 2.6.29)
Comment 31 Len Brown 2009-04-07 03:07:24 UTC
shipped in 2.6.30 merge window (2.6.29-git14)

closed

commit 74a365b3f354fafc537efa5867deb7a9fadbfe27
Author: Matthew Garrett <mjg59@srcf.ucam.org>
Date:   Thu Mar 19 21:35:39 2009 +0000

    ACPI: Populate DIDL before registering ACPI video device on Intel
Comment 32 Zhang Rui 2009-07-20 01:50:54 UTC
*** Bug 13751 has been marked as a duplicate of this bug. ***
Comment 33 Bjorn Helgaas 2009-07-29 22:41:40 UTC
This is apparently slightly different from bug 13751, because the comment #28 patch fixed the problem Dmitri was seeing, but others still saw crashes even with that patch.

I just added another patch to bug 13751 to address those crashes, and it might address the problem Patrick and Gautam are seeing.
Comment 34 Adam Williamson 2009-09-24 17:48:40 UTC
The fix which was applied for this issue has its own problem. It just assumes the workaround should be applied to any Intel video adapter:

	for_each_pci_dev(dev) {
		if ((dev->class >> 8) != PCI_CLASS_DISPLAY_VGA)
			continue;
		if (dev->vendor != PCI_VENDOR_ID_INTEL)
			continue;
		pci_read_config_dword(dev, 0xfc, &address);
		if (!address)
			continue;
		return 1;
	}

However, this also catches GMA 500 adapters, which don't use the Intel driver. (If you're lucky), they use the psb driver, which doesn't behave like intel and shouldn't have this 'workaround' applied. When it is applied, it causes psb not to work right unless the workaround 'IgnoreACPI' X config option is set:

http://swiss.ubuntuforums.org/showpost.php?p=7781689&postcount=17

so this check should be refined not to catch GMA 500 (and, possibly, Moorestown in future, depending on how Moorestown is handled) adapters, I think. I'll see if I can provide a refinement to do this.
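
To make the matching logic of the quoted loop easier to follow, here is a user-space sketch of the same three tests applied to plain dictionaries. It is an illustration only, not kernel code: the device records are hypothetical, except that 0x7f6a3010 is the register 0xfc value reported for the Poulsbo device later in this thread. The kernel's dev->class holds a 24-bit value (base class, subclass, programming interface), so shifting right by 8 leaves the 16-bit class/subclass pair.

```python
PCI_CLASS_DISPLAY_VGA = 0x0300  # base class 0x03, subclass 0x00
PCI_VENDOR_ID_INTEL = 0x8086


def has_intel_opregion(devices):
    """Mirror of the quoted loop: any Intel VGA device with a non-zero 0xfc register."""
    for dev in devices:
        if (dev["class"] >> 8) != PCI_CLASS_DISPLAY_VGA:
            continue
        if dev["vendor"] != PCI_VENDOR_ID_INTEL:
            continue
        if not dev["config_0xfc"]:
            continue
        return True
    return False


# Hypothetical device records, for illustration only.
INTEL_VGA = {"vendor": 0x8086, "class": 0x030000, "config_0xfc": 0x7F6A3010}
INTEL_VGA_NO_REGION = {"vendor": 0x8086, "class": 0x030000, "config_0xfc": 0}
OTHER_VGA = {"vendor": 0x10DE, "class": 0x030000, "config_0xfc": 0x1234}
INTEL_ETHERNET = {"vendor": 0x8086, "class": 0x020000, "config_0xfc": 0x1234}

if __name__ == "__main__":
    print(has_intel_opregion([INTEL_ETHERNET, OTHER_VGA, INTEL_VGA]))
```

Note that nothing in the test looks at the PCI device ID, which is why every Intel VGA adapter with that register populated is treated the same way regardless of which driver ends up binding to it.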
Comment 35 Adam Williamson 2009-09-24 18:06:33 UTC
update from mjg:

<mjg59> adamw: It looks like it implements opregion, to be honest
<mjg59> adamw: In which case the workaround is correct, but the poulsbo DRM driver is deficient

(...confirmatory diagnosis)

<mjg59> adamw: Would actually be pretty easy - you just need to pull in the intel opregion code from the i915 drm, and change the backlight register writes
<mjg59> Oh, and add support to the interrupt handler
<mjg59> I've no idea how poulsbo would doorbell
<mjg59> adamw: 0x7f6a3010 (the contents of PCI conf register 0xfc on your device) is ACPI non-volatile storage
<mjg59> adamw: So it certainly looks like opregion. You get to keep both pieces.
<mjg59> The kernel behaviour is correct, the poulsbo DRM is just shit. Sorry.

So, bottom line, opregion support should be patched into the poulsbo DRM module. Which is theoretically possible, as that bit is actually open source. I don't have the skills to do it, though. Just updating this bug for anyone who stumbles across this so the correct info is here.