Latest working kernel version: unknown
Earliest failing kernel version: 2.6.26
Distribution: Debian lenny amd64
Hardware Environment: HP Compaq 6720s
==============================
Problem Description:
When I close the laptop lid, I get a panic. I first got it on the Debian 2.6.25 stock kernel. Then I compiled vanilla 2.6.26, 2.6.27-rc1 and 2.6.27-rc2 kernels and got the same panic. With the kernel config file for 2.6.27-rc2 (will post it below) the problem is 100% reproducible (and it is 100% reproducible on the other kernels I mentioned, too).

I have discovered that if I turn off CONFIG_ACPI_VIDEO, the problem goes away and I can't trigger it anymore. It also goes away if I disable SMP.

I get this panic when the system is booted normally (with gnome, hal, acpid etc. running), as well as in single-user mode (only init + root shell started). All kernel logs I will post below were captured with netconsole in single-user mode. netconsole is the only option for me, because the machine always hangs irrecoverably after the oops and the kernel log isn't saved to disk.
==============================
Steps to reproduce:
Enable CONFIG_ACPI_VIDEO and SMP, boot, close the lid.
Created attachment 17104 [details] kernel messages

I did:
rmmod fan
rmmod battery
rmmod thermal
rmmod ac
echo 0x1f >/sys/module/acpi/parameters/debug_level

(I should note that when I turn on ACPI debugging this way, the panic does not always happen on the first lid close, but on the second or third.)

Then I closed the lid @81 sec. No panic. I opened the lid @91 sec and closed it again @96 sec. I never managed to get a backtrace before.
Created attachment 17105 [details] lspci -vxxx
Created attachment 17106 [details] acpidump
Created attachment 17107 [details] kernel config
Created attachment 17108 [details] oops with acpid debugging turned off

When ACPI debugging is off, I always get the following message:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000046
It is different from what I get when ACPI debugging is on. So, just for the record, I am posting this one, too.
Created attachment 17112 [details] Modified DSDT for temporary overriding for testing

Can you compile this modified DSDT into your kernel? You need:
CONFIG_ACPI_CUSTOM_DSDT_FILE pointing to the downloaded attachment
CONFIG_ACPI_CUSTOM_DSDT set to yes.
Double-check in dmesg whether overriding the DSDT worked.

Does everything work fine now? If not, it would be best to try again, this time passing log_buf_len=... and acpi.debug_level=0x21F to get even more output. There are still a lot of gaps in the log.
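For the "double-check in dmesg" step, a minimal sketch of one way to do it. The message text is the one that shows up in the 2.6.27-rc2 logs posted later in this thread ("ACPI: Table DSDT replaced by host OS"); other kernel versions may word it differently, and the function name `dsdt_overridden` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: reads a kernel log on stdin and succeeds if it
# contains the DSDT override message seen in this thread's logs.
# Assumption: the wording matches your kernel version.
dsdt_overridden() {
    grep -q 'Table DSDT replaced by host OS'
}

# Usage (commented out so that sourcing this file has no side effects):
#   dmesg | dsdt_overridden && echo "custom DSDT in use" || echo "override not seen"
```

If the message has already scrolled out of the ring buffer, the check reports a miss even though the override worked, which is one more reason to pass a larger log_buf_len.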
2.6.27-rc2 kernel messages (w/ modified DSDT), compressed with bzip2
http://rapidshare.com/files/135420903/log39.bz2.html

Unfortunately, with the modified DSDT I get the same oops (BUG: unable to handle kernel NULL pointer dereference at 0000000000000046) if ACPI debugging is off. Every time I triggered it with ACPI debugging on, it didn't print any oops message (I tried X times); the laptop just hung and didn't respond to anything.

Some messages are out of order (approximately @40-60 sec). I didn't sort the file; I am posting it as it was captured with netconsole.

The kernel config is the same, except that the custom DSDT was turned on.

In this run:
@0.013878: ACPI: Table DSDT replaced by host OS
@120.713494: rmmod fan battery thermal ac
@127.480872: lid closed
@136.779098: lid opened
@143.424001: lid closed
@152.671701: lid opened
@161.638761: lid closed ==> it hung
2.6.27-rc2 kernel messages (w/ modified DSDT), compressed with bzip2
http://rapidshare.com/files/135421985/log38.bz2.html

I decided to post one more log, where the problem was triggered the first time the lid was closed (just in case).

In this run:
@0.013790: ACPI: Table DSDT replaced by host OS
@123.865109: rmmod fan battery thermal ac
@132.556135: lid closed ==> it hung
would you please run "echo 1 > /proc/acpi/video/XXX/DOS" before closing the lid and see if it helps?
> I tried X times
Excuse me, I composed the message before I finished testing. I had tried 5 times by then.

> would you please run "echo 1 > /proc/acpi/video/XXX/DOS" before closing the
> lid and see if it helps?
Yes, it seems to help (with the modified DSDT; I didn't try with my original, but can do that if needed). But when I run that with ACPI debugging on, no kernel messages are printed when the lid is closed/opened. (Does it ignore lid events after that command?) If I start acpid and then acpi_listen, I don't see any messages coming in about the lid button.
(In reply to comment #10)
> > would you please run "echo 1 > /proc/acpi/video/XXX/DOS" before closing the
> > lid and see if it helps?
> Yes, it seems to help (with modified DSDT, didn't try with my original, but
> can do that if needed). But when I run that with ACPI debugging on, no kernel
> messages are printed when lid is closed/opened. (Does it ignore lid events
> after that command?)

I don't think so.
Please run "grep . /sys/firmware/acpi/interrupts/*" both before and after closing the lid and attach the output.
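To make the before/after comparison easier to read, here is a rough sketch that saves the same "grep ." output to two files and prints only the counters that changed. The helper names `snapshot` and `changed_counters` and the /tmp file names are made up for illustration; the directory argument exists only so the sketch can be tried against a scratch directory.

```shell
#!/bin/sh
# Hypothetical helpers, not part of any existing tool.

# snapshot OUTFILE [DIR]: save "grep . DIR/*" (same command as requested
# above) to OUTFILE. DIR defaults to /sys/firmware/acpi/interrupts.
snapshot() {
    grep . "${2:-/sys/firmware/acpi/interrupts}"/* > "$1"
}

# changed_counters BEFORE AFTER: print only the lines that differ.
changed_counters() {
    diff "$1" "$2" | grep '^[<>]'
}

# Usage (commented out; run by hand around the lid close):
#   snapshot /tmp/irq.before
#   ... close and reopen the lid ...
#   snapshot /tmp/irq.after
#   changed_counters /tmp/irq.before /tmp/irq.after
```

If nothing is printed, no SCI or GPE counter moved between the two snapshots.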
as it works well in 2.6.26, could you please use git-bisect to figure out which commit introduces this regression?
Created attachment 17189 [details] grep . /sys/firmware/acpi/interrupts/*

If I run "echo 1 > /proc/acpi/video/XXX/DOS", then "grep . /sys/firmware/acpi/interrupts/*" produces the same output before closing the lid, while it is closed, and after opening it (output attached).

I'd like to use git-bisect, but I can't find a version that works.
2.6.8, 2.6.18, 2.6.22 panicked during early boot initialization. Sometimes, but not always, I see 'acpi' in function names in the backtraces. The 'video' module, which is built by CONFIG_ACPI_VIDEO, isn't loaded at that point yet, so I think those panics are caused by some different bug.
2.6.23, 2.6.23.17 fail to build.
2.6.24, 2.6.26, 2.6.27-{rc1,rc2} all panic when the lid is closed.

I understand that this probably doesn't help, but that is everything I could find out.

Thank you for your time and efforts.
so this problem exists on all kernel versions, clear the regression flag.
Rui, CONFIG_SMP=n sounds like a clue. Perhaps Linux sometimes handles an event on CPU1 but some AML (and/or SMM) on this box assumes that it must be running on CPU0?
okay. dmitri, will you please boot with "noirqbalance" and check if it helps?
No, noirqbalance doesn't help (I can post kernel log if you need it). But booting with maxcpus=1 on a SMP kernel helps.
please
1. boot with "noirqbalance"
2. and disable the irqbalance service after boot
and try again.
ping Dmitri, any update?
I can reproduce this problem, and the system hangs on an SMI call in the _L18 method.

Several workarounds are available:
1. _DOS=1. In this case, GPIO 0x18 is routed to SMI mode, thus no GPE0x18 when closing the lid.
2. Disable GPE0x18 manually when _DOS=0. This works the same way as workaround 1.
3. CONFIG_SMP=n or maxcpus=1. This is weird; I suspect that the oops has something to do with the number of the CPU that triggers the SMI.

Dmitri, please do the test in comment #18.
My laptop (an HP 6710b) is quite similar to Dmitri's, and I have the same symptoms. I hope I can help.

I'm running the proposed kernel for Ubuntu 8.10 (a.k.a. Intrepid Ibex), which is currently based on version "2.6.27-11-generic".

I've booted with the "noirqbalance" boot option twice. After both boots, I looked at /proc/interrupts and noticed the interrupts were still being "balanced". I do not have an "irqbalance" service running. In the kernel config, the CONFIG_IRQBALANCE line is commented out.

First boot: closing the lid resulted in a "freeze". The screen was still on, but I was unable to use the keyboard.
Second boot: closing the lid resulted in an "oops". The dump was quite like http://launchpadlibrarian.net/18617502/16102008130.jpg

These bug reports might interest you as well:
https://bugs.launchpad.net/ubuntu/+source/acpid/+bug/157691
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/228399

Two workarounds I use (either one alone is enough to prevent the "oops"):
# echo 0 > /sys/devices/system/cpu/cpu1/online
# echo 7 > /proc/acpi/video/C098/DOS

Please let me know how I can help.
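The video device name under /proc/acpi/video differs per machine (C098 here, XXX in the earlier comments), so a small sketch that applies either workaround without hard-coding the name may be handy. The function names `offline_cpu` and `set_dos` are made up for illustration, and the optional root arguments exist only so the functions can be tried against a scratch directory. Both need root on a real system, and neither is a fix.

```shell
#!/bin/sh
# Hypothetical helpers wrapping the two workarounds above.

# offline_cpu N [SYSFS_ROOT]: take CPU N offline (SYSFS_ROOT defaults to /sys).
offline_cpu() {
    f="${2:-/sys}/devices/system/cpu/cpu$1/online"
    [ -w "$f" ] || { echo "cannot write $f" >&2; return 1; }
    echo 0 > "$f"
}

# set_dos VALUE [VIDEO_ROOT]: write VALUE to the DOS file of every ACPI
# video device found (VIDEO_ROOT defaults to /proc/acpi/video).
set_dos() {
    for f in "${2:-/proc/acpi/video}"/*/DOS; do
        [ -e "$f" ] || continue
        echo "$1" > "$f" && echo "wrote $1 to $f"
    done
}

# Usage (commented out; pick one, as root):
#   offline_cpu 1
#   set_dos 7
```

Neither setting survives a reboot, so it would have to be reapplied from an init script until a kernel fix is available.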
I've had the same problem on an HP 2710p ever since 2.6.24/25. Just today I ran it on a 2.6.28 kernel with X86_CHECK_BIOS_CORRUPTION=y, and I found the following in /var/log/messages:

Corrupted low memory at c000fe2c (fe2c phys) = 00000010
Corrupted low memory at c000fe38 (fe38 phys) = 00000ff0
Corrupted low memory at c000fe40 (fe40 phys) = 000006d0
Corrupted low memory at c000fe4c (fe4c phys) = 00000800
Corrupted low memory at c000fe50 (fe50 phys) = cff3007b
Corrupted low memory at c000fe54 (fe54 phys) = 000fffff
Corrupted low memory at c000fe5c (fe5c phys) = 8f9300d8
Corrupted low memory at c000fe60 (fe60 phys) = 000fffff
Corrupted low memory at c000fe64 (fe64 phys) = 01366000
Corrupted low memory at c000fe68 (fe68 phys) = cf000000
Corrupted low memory at c000fe6c (fe6c phys) = 000fffff
Corrupted low memory at c000fe74 (fe74 phys) = 00820000
Corrupted low memory at c000fe78 (fe78 phys) = 000007ff
Corrupted low memory at c000fe7c (fe7c phys) = c0424000
Corrupted low memory at c000fe80 (fe80 phys) = 008b0080
Corrupted low memory at c000fe84 (fe84 phys) = 0000206b
Corrupted low memory at c000fe88 (fe88 phys) = c17f7800
Corrupted low memory at c000fe8c (fe8c phys) = c17f4000
Corrupted low memory at c000fe94 (fe94 phys) = c0424000
Corrupted low memory at c000fea4 (fea4 phys) = 00008000
Corrupted low memory at c000fea8 (fea8 phys) = 00820000
Corrupted low memory at c000feac (feac phys) = 000000ff
Corrupted low memory at c000feb0 (feb0 phys) = c17f4000
Corrupted low memory at c000feb4 (feb4 phys) = cf000000
Corrupted low memory at c000feb8 (feb8 phys) = 000fffff
Corrupted low memory at c000fec0 (fec0 phys) = cff3007b
Corrupted low memory at c000fec4 (fec4 phys) = 000fffff
Corrupted low memory at c000fecc (fecc phys) = cf9b0060
Corrupted low memory at c000fed0 (fed0 phys) = 000fffff
Corrupted low memory at c000fed8 (fed8 phys) = cf930068
Corrupted low memory at c000fedc (fedc phys) = 000fffff
Corrupted low memory at c000fee4 (fee4 phys) = fedb0000
Corrupted low memory at c000fee8 (fee8 phys) = 00000001
Corrupted low memory at c000feec (feec phys) = 00050000
Corrupted low memory at c000fef8 (fef8 phys) = fedb0000
Corrupted low memory at c000fefc (fefc phys) = 00030100
Corrupted low memory at c000ff14 (ff14 phys) = 000000ff
Corrupted low memory at c000ff18 (ff18 phys) = 0000778e
Corrupted low memory at c000ff5c (ff5c phys) = 000000f3
Corrupted low memory at c000ff64 (ff64 phys) = 00000008
Corrupted low memory at c000ff6c (ff6c phys) = 000000b2
Corrupted low memory at c000ff74 (ff74 phys) = 000000b2
Corrupted low memory at c000ff7c (ff7c phys) = f70edd2c
Corrupted low memory at c000ff84 (ff84 phys) = f700f960
Corrupted low memory at c000ff8c (ff8c phys) = 000000b2
Corrupted low memory at c000ff94 (ff94 phys) = f70edddc
Corrupted low memory at c000ffa4 (ffa4 phys) = 00b20003
Corrupted low memory at c000ffa8 (ffa8 phys) = 0000007b
Corrupted low memory at c000ffac (ffac phys) = 00000060
Corrupted low memory at c000ffb0 (ffb0 phys) = 00000068
Corrupted low memory at c000ffb4 (ffb4 phys) = 0000007b
Corrupted low memory at c000ffb8 (ffb8 phys) = 000000d8
Corrupted low memory at c000ffc4 (ffc4 phys) = 00000080
Corrupted low memory at c000ffc8 (ffc8 phys) = 00000400
Corrupted low memory at c000ffd0 (ffd0 phys) = ffff0ff0
Corrupted low memory at c000ffd8 (ffd8 phys) = c022afc0
Corrupted low memory at c000ffe8 (ffe8 phys) = 00000246
Corrupted low memory at c000fff0 (fff0 phys) = 00494000
Corrupted low memory at c000fff8 (fff8 phys) = 8005003b
------------[ cut here ]------------
WARNING: at arch/x86/kernel/setup.c:718 check_for_bios_corruption+0xb6/0xc0()
Memory corruption detected in low memory
Modules linked in: b43 ssb rfkill mac80211 crc32 input_polldev i915 drm libafs(P) fuse ipv6 autofs4 xt_recent ipt_addrtype xt_multiport xt_mac ipt_MASQUERADE xt_state xt_tcpudp ipt_REJECT ipt_LOG xt_limit iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables xt_iprange x_tables nf_conntrack_ftp nf_conntrack snd_pcm_oss snd_mixer_oss snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device ext2 mbcache arc4 ecb rng_core bitrev loop acpi_cpufreq ehci_hcd uhci_hcd sdhci_pci usbcore sdhci mmc_core intelfb fb i2c_algo_bit cfbcopyarea snd_hda_intel snd_pcm video backlight output snd_timer snd_page_alloc wmi battery 8250_pnp 8250 serial_core button e1000e leds_hp_disk led_class i2c_core cfbimgblt cfbfillrect snd_hwdep intel_agp ac serio_raw snd soundcore agpgart evdev [last unloaded: input_polldev]
Pid: 0, comm: swapper Tainted: P A 2.6.28 #4
Call Trace:
 [<c012a656>] warn_slowpath+0x76/0x90
 [<c011eb85>] place_entity+0xb5/0xd0
 [<c0141611>] up+0x11/0x40
 [<c012ae1d>] release_console_sem+0x19d/0x1d0
 [<c011dcf6>] activate_task+0x16/0x20
 [<c0123700>] try_to_wake_up+0x20/0x180
 [<c013d2ab>] autoremove_wake_function+0x1b/0x50
 [<c011e11b>] __wake_up_common+0x4b/0x80
 [<c012b3ab>] printk+0x1b/0x20
 [<c0107986>] check_for_bios_corruption+0xb6/0xc0
 [<c0107995>] periodic_check_for_corruption+0x5/0x30
 [<c0133274>] run_timer_softirq+0x144/0x1b0
 [<c0107990>] periodic_check_for_corruption+0x0/0x30
 [<c0107990>] periodic_check_for_corruption+0x0/0x30
 [<c012efe4>] __do_softirq+0x94/0x160
 [<c012f0e7>] do_softirq+0x37/0x40
 [<c012f479>] irq_exit+0x59/0x80
 [<c0106462>] do_IRQ+0x52/0x90
 [<c0118aba>] read_hpet+0xa/0x10
 [<c014332d>] getnstimeofday+0x3d/0xe0
 [<c01047a3>] common_interrupt+0x23/0x28
 [<c014007b>] __run_hrtimer+0x9b/0xa0
 [<c024920c>] acpi_idle_enter_bm+0x28d/0x2f8
 [<c02b69a9>] cpuidle_idle_call+0x69/0xc0
 [<c010273b>] cpu_idle+0x4b/0xa0
---[ end trace 2ff63db346bda773 ]---

When the system doesn't crash, the above is repeatable and produces corruption in the same regions of memory. When the system crashes, I get a complete freeze (though once or twice I've got an Oops).

The documentation says I can use memmap to stop the kernel from using this memory. Will try that next...

GI
(In reply to comment #22)
> The documentation says I can use memmap to stop the kernel from using this
> memory. Will try that next...

OK. From the logs above, I gathered that the memory between 0xc000fe2c--0xc000ffff is corrupt. Accordingly, I rebooted with memmap=0x01d4$0xc000fe2c. I get the exact same message (as above) in my syslog though, so perhaps I did something wrong.

Should I file a new bug report with the above trace info?

GI
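One thing that may be worth re-checking here, offered as an assumption rather than a diagnosis: as far as I understand the kernel-parameters documentation, the start address in memmap=nn$ss is a physical address, whereas 0xc000fe2c is the kernel virtual address from the first column of the log (the physical one is the "fe2c phys" value). A small sketch of the arithmetic, with a made-up helper name `memmap_arg`:

```shell
#!/bin/sh
# Hypothetical helper: build a memmap=SIZE$START argument from the first
# and last corrupted *physical* addresses (the "(xxxx phys)" column).
# Assumption: memmap=nn$ss takes a physical start address.
memmap_arg() {
    start=$(( $1 ))
    end=$(( $2 ))
    # Size is inclusive of both end points.
    printf 'memmap=0x%x$0x%x\n' $(( end - start + 1 )) "$start"
}

# Region from the log above: fe2c phys through ffff phys.
memmap_arg 0xfe2c 0xffff
```

The size comes out the same as the one used above (0x1d4); only the start address differs. Depending on the bootloader, the "$" may need escaping in its config file. Whether reserving the region changes what the corruption check reports is a separate question that this sketch does not answer.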
*** Bug 12328 has been marked as a duplicate of this bug. ***
Created attachment 20343 [details] KMS-dependent patch for issue

This appears to fix the problem on my 2510p, but obviously requires KMS.
Created attachment 20344 [details] Reserve low 64K on HPs

And this is the less-invasive patch that ought to paper over the bug, with luck. Could someone test this?
Hm. No, simply reserving the low 64K doesn't work.
patch already available. http://patchwork.kernel.org/patch/13147/
patch mentioned in comment #28 is in the acpi tree "ACPI: Populate DIDL before registering ACPI video device on Intel"
I confirm that this patch fixes the bug. (Tested on Linux kernel 2.6.29)
shipped in 2.6.30 merge window (2.6.29-git14)
closed

commit 74a365b3f354fafc537efa5867deb7a9fadbfe27
Author: Matthew Garrett <mjg59@srcf.ucam.org>
Date: Thu Mar 19 21:35:39 2009 +0000

    ACPI: Populate DIDL before registering ACPI video device on Intel
*** Bug 13751 has been marked as a duplicate of this bug. ***
This is apparently slightly different from bug 13751, because the comment #28 patch fixed the problem Dmitri was seeing, but others still saw crashes even with that patch. I just added another patch to bug 13751 to address those crashes, and it might fix the problem Patrick and Gautam are seeing.
The fix which was applied for this issue has its own problem. It just assumes the workaround should be applied to any Intel video adapter:

        for_each_pci_dev(dev) {
                if ((dev->class >> 8) != PCI_CLASS_DISPLAY_VGA)
                        continue;

                if (dev->vendor != PCI_VENDOR_ID_INTEL)
                        continue;

                pci_read_config_dword(dev, 0xfc, &address);

                if (!address)
                        continue;

                return 1;
        }

However, this also catches GMA 500 adapters, which don't use the Intel driver. (If you're lucky), they use the psb driver, which doesn't behave like intel and shouldn't have this 'workaround' applied. When it is applied, it causes psb not to work right unless the workaround 'IgnoreACPI' X config option is set:

http://swiss.ubuntuforums.org/showpost.php?p=7781689&postcount=17

So I think this check should be refined not to catch GMA 500 (and, possibly, Moorestown in future, depending on how Moorestown is handled) adapters. I'll see if I can provide a refinement to do this.
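To see from user space whether a given machine would match the check quoted above, here is a rough sysfs-based sketch. It only mirrors the three conditions (VGA class, Intel vendor, non-zero dword at config offset 0xfc) and is not the kernel code itself. The function name `has_intel_opregion` is made up, and reading config space beyond the first 64 bytes normally requires root.

```shell
#!/bin/sh
# Hypothetical helper mirroring the quoted kernel loop from user space.
# $1 = PCI devices directory (defaults to /sys/bus/pci/devices).
# Prints the first matching device path and returns 0, else returns 1.
has_intel_opregion() {
    for d in "${1:-/sys/bus/pci/devices}"/*; do
        [ -r "$d/class" ] || continue
        # (class >> 8) == 0x0300 means the sysfs value starts with 0x0300.
        case "$(cat "$d/class")" in 0x0300*) ;; *) continue ;; esac
        [ "$(cat "$d/vendor")" = "0x8086" ] || continue
        # Four bytes at config offset 0xfc (252); empty if unreadable.
        addr=$(od -An -tx1 -j252 -N4 "$d/config" 2>/dev/null | tr -d ' \n')
        [ -n "$addr" ] && [ "$addr" != "00000000" ] || continue
        echo "$d"
        return 0
    done
    return 1
}

# Usage (commented out; run as root):
#   has_intel_opregion && echo "workaround would apply"
```

A match only says that the workaround would be applied on that machine; it says nothing about whether the driver bound to the device copes with it.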
update from mjg:

<mjg59> adamw: It looks like it implements opregion, to be honest
<mjg59> adamw: In which case the workaround is correct, but the poulsbo DRM driver is deficient
(...confirmatory diagnosis)
<mjg59> adamw: Would actually be pretty easy - you just need to pull in the intel opregion code from the i915 drm, and change the backlight register writes
<mjg59> Oh, and add support to the interrupt handler
<mjg59> I've no idea how poulsbo would doorbell
<mjg59> adamw: 0x7f6a3010 (the contents of PCI conf register 0xfc on your device) is ACPI non-volatile storage
<mjg59> adamw: So it certainly looks like opregion. You get to keep both pieces.
<mjg59> The kernel behaviour is correct, the poulsbo DRM is just shit. Sorry.

So, bottom line, opregion support should be patched into the poulsbo DRM module. Which is theoretically possible, as that bit is actually open source. I don't have the skills to do it, though. Just updating this bug for anyone who stumbles across this so the correct info is here.