Bug 22462
Summary: 2.6.33.1 Regression: Unexpected power-off during boot or when resuming from suspend on Vostro V13
Product: Platform Specific/Hardware
Reporter: Guillaume Pothier (gpothier)
Component: x86-64
Assignee: platform_x86_64 (platform_x86_64)
Status: RESOLVED DUPLICATE
Severity: normal
CC: acpi-bugzilla, andrea.cimitan, bjorn.helgaas, jbarnes, peponnet.cyril, rjw, xcoulon
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 2.6.33.1
Subsystem:
Regression: Yes
Bisected commit-id:
Bug Depends on:
Bug Blocks: 7216
Attachments:
Output of dmesg
Output of lsmod
Config file
ubuntu config that reproduces the bug even with latest git
diff between the ubuntu broken config and a working one
this *always* crashes during boot with bluetooth disabled in the bios
dump during a broken suspend/resume
this config crashes
diff between the a broken config and a working one
Description
Guillaume Pothier
2010-11-08 13:07:03 UTC
I did a bit more testing, and it seems the regression occurred between 2.6.33 and 2.6.33.1. I'm now trying to bisect to find the precise commit.

I'm bisecting now, but it takes a long time so I wanted to make this known: in the previous comment I mentioned that the regression occurred between 2.6.33 and 2.6.33.1, but it seems that is only true for the Ubuntu mainline kernels downloaded from:
http://kernel.ubuntu.com/~kernel-ppa/mainline/v2.6.33/
http://kernel.ubuntu.com/~kernel-ppa/mainline/v2.6.33.1-lucid/
Compiling from kernel.org's git, the regression is between 2.6.34-rc2 and 2.6.34-rc6. I don't know what the difference could be between Ubuntu's mainline kernels and the ones from git, except maybe the config.

Here's the result of the bisect. Note that the last step did not build:
gpothier@tadzim:linux-git$ git bisect skip
There are only 'skip'ped commits left to test.
The first bad commit could be any of:
bb910a7040e90a0ca3d3e8245d6d5c128a5d1287
522dba7134d6b2e5821d3457f7941ec34f668e6d
We cannot bisect more!
The commit message for the first candidate is "PCI/PM Runtime: Make runtime PM of PCI devices inactive by default", and the message for the second candidate is "Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6".
My testing process was to consider a build "good" if I could do at least 10 suspend/resume cycles in a row. Is there anything else I can do to help?
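Because the power-off is intermittent, a "good" verdict after 10 clean cycles is only probabilistic. The following is a minimal sketch, assuming every suspend/resume cycle fails independently with the same probability (a simplification that was not measured in this report), of how likely a faulty kernel is to be mislabeled "good" during a bisect step. The function names and the example rates are hypothetical.

```python
import math


def false_good_probability(p_fail: float, cycles: int) -> float:
    """Chance that a faulty kernel survives `cycles` clean cycles in a row.

    Assumes each cycle fails independently with probability `p_fail`
    (an illustrative assumption, not a measured property of this bug).
    """
    if not 0.0 <= p_fail <= 1.0:
        raise ValueError("p_fail must be between 0 and 1")
    if cycles < 0:
        raise ValueError("cycles must be non-negative")
    return (1.0 - p_fail) ** cycles


def cycles_needed(p_fail: float, max_false_good: float) -> int:
    """Smallest cycle count that keeps the mislabel chance <= max_false_good."""
    if not 0.0 < p_fail < 1.0:
        raise ValueError("p_fail must be strictly between 0 and 1")
    if not 0.0 < max_false_good < 1.0:
        raise ValueError("max_false_good must be strictly between 0 and 1")
    return math.ceil(math.log(max_false_good) / math.log(1.0 - p_fail))


if __name__ == "__main__":
    # Hypothetical per-cycle failure rates, for illustration only.
    for p in (0.1, 0.3, 0.5):
        print(p, false_good_probability(p, 10), cycles_needed(p, 0.05))
```

Under this model, a hypothetical 30% per-cycle failure rate leaves roughly a 3% chance that ten clean cycles hide a bad kernel, while a 10% rate leaves about a 35% chance. A lower failure rate therefore calls for more cycles per bisect step.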
Here is the full bisect log:
git bisect start
# bad: [f6f94e2ab1b33f0082ac22d71f66385a60d8157f] Linux 2.6.36
git bisect bad f6f94e2ab1b33f0082ac22d71f66385a60d8157f
# good: [57d54889cd00db2752994b389ba714138652e60c] Linux 2.6.34-rc1
git bisect good 57d54889cd00db2752994b389ba714138652e60c
# skip: [e9a5f426b85e429bffaee4e0b086b1e742a39fa6] CPU: Avoid using unititialized error variable in disable_nonboot_cpus()
git bisect skip e9a5f426b85e429bffaee4e0b086b1e742a39fa6
# bad: [fd4dc88e46c4d9dd845ffef50a975ceea110fd85] staging: hv: Fix error checking in channel.c
git bisect bad fd4dc88e46c4d9dd845ffef50a975ceea110fd85
# bad: [c6c352371c1ce486a62f4eb92e545b05cfcef76b] ARM: 5965/1: Fix soft lockup in at91 udc driver
git bisect bad c6c352371c1ce486a62f4eb92e545b05cfcef76b
# bad: [11130736c99c37e253f45b2d3fd30b07313f83c6] ACPI: processor: refactor internal map_lapic_id()
git bisect bad 11130736c99c37e253f45b2d3fd30b07313f83c6
# skip: [91e013827c0bcbb187ecf02213c5446b6f62d445] Merge branch 'master' into for-linus
git bisect skip 91e013827c0bcbb187ecf02213c5446b6f62d445
# skip: [c251c7f738cd94eb3a1febda318078c661eccb4d] drivers/net/tulip/eeprom.c: fix bogus "(null)" in tulip init messages
git bisect skip c251c7f738cd94eb3a1febda318078c661eccb4d
# good: [0636b33c5f2fac4e274464ae6867805f080fc433] Merge branches 'cxgb3', 'ipoib', 'misc' and 'nes' into for-next
git bisect good 0636b33c5f2fac4e274464ae6867805f080fc433
# skip: [41dcc17c735d4e99a91002b09850d0f09ee4ab4b] ARM: mach-shmobile: pfc-sh7377: modify KEYIN settings
git bisect skip 41dcc17c735d4e99a91002b09850d0f09ee4ab4b
# bad: [21768639be419d00275ac4e58b863361d0c24ee4] edac: mpc85xx mask ecc syndrome correctly
git bisect bad 21768639be419d00275ac4e58b863361d0c24ee4
# bad: [2afb18981739a1426af2a6c952e03c5966b3dfc6] broadsheetfb: add MMIO hooks
git bisect bad 2afb18981739a1426af2a6c952e03c5966b3dfc6
# bad: [416d8d2888db392c562fb8afaf9136730ef0da9e] drivers/block/floppy.c: remove REPEAT macro
git bisect bad 416d8d2888db392c562fb8afaf9136730ef0da9e
# bad: [045f98363080ddbbcef6b8b8306ec58a818406a0] drivers/block/floppy.c: remove used once CHECK_READY macro
git bisect bad 045f98363080ddbbcef6b8b8306ec58a818406a0
# bad: [516a82422209e078345d0ca54b16793d7bfd4782] sdio: recognize io card without powercycle
git bisect bad 516a82422209e078345d0ca54b16793d7bfd4782
# bad: [522dba7134d6b2e5821d3457f7941ec34f668e6d] Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6
git bisect bad 522dba7134d6b2e5821d3457f7941ec34f668e6d
# good: [51d0f6d1f50349579f007adf5c0b51aaedd93b94] Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
git bisect good 51d0f6d1f50349579f007adf5c0b51aaedd93b94
# skip: [bb910a7040e90a0ca3d3e8245d6d5c128a5d1287] PCI/PM Runtime: Make runtime PM of PCI devices inactive by default
git bisect skip bb910a7040e90a0ca3d3e8245d6d5c128a5d1287

The change made by "PCI/PM Runtime: Make runtime PM of PCI devices inactive by default" has no effect on the behavior of PCI devices that don't support runtime PM. Can you check if 2.6.37-rc2 fixes the problem for you, please?

Thanks for taking a look at this issue. The problem persists in 2.6.37-rc2. Poweroff occurs either during boot or when resuming from suspend.

What happens if you do:
# echo core > /sys/power/pm_test
# echo mem > /sys/power/state
(it should carry out a fake suspend-resume cycle and return to the command prompt in about 5-10 sec.)? Does it reboot too?

Yes, I got a poweroff at the first attempt (with 2.6.37-rc2). The poweroff occurred after about 10s.

Does it also happen with the following test (after a fresh boot):
# echo devices > /sys/power/pm_test
# echo mem > /sys/power/state

Yes, it still happens with "echo devices" instead of "echo core".

Does it happen 100% of the time? Please attach a dmesg log and the output of lsmod from your system.

Created attachment 38692 [details]
Output of dmesg
Created attachment 38702 [details]
Output of lsmod
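The two pm_test runs requested above (first "core", then "devices") can be scripted so that several cycles run back to back. This is a minimal sketch, not a tested reproducer for this machine: it assumes it runs as root on a kernel that exposes /sys/power/pm_test, and the helper names and the `power_dir` parameter are hypothetical, added so the logic can be exercised against a scratch directory.

```python
from pathlib import Path

# Test levels accepted by /sys/power/pm_test on kernels with PM debugging.
PM_TEST_LEVELS = ("none", "core", "processors", "platform", "devices", "freezer")


def run_pm_test(level: str, power_dir: str = "/sys/power") -> None:
    """Select a pm_test level, then request one suspend-to-RAM test cycle."""
    if level not in PM_TEST_LEVELS:
        raise ValueError(f"unknown pm_test level: {level!r}")
    root = Path(power_dir)
    (root / "pm_test").write_text(level + "\n")
    # On a real system this write returns only after the test cycle ends.
    (root / "state").write_text("mem\n")


def run_cycles(level: str, cycles: int, power_dir: str = "/sys/power") -> int:
    """Run `cycles` test cycles in a row and return how many completed."""
    completed = 0
    for _ in range(cycles):
        run_pm_test(level, power_dir)
        completed += 1
    return completed


if __name__ == "__main__":
    # Mirrors the second test in the thread; an unexpected power-off
    # simply kills the script before it can print anything.
    print(run_cycles("devices", 10), "cycles completed")
```

Writing "none" to pm_test afterwards restores normal suspend behavior.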
Requested information attached. It doesn't happen 100% of the time, but I've never been able to perform more than 5-6 suspend/resume cycles with the faulty kernels.

I have been playing with BIOS settings and found a way to trigger it 100% of the time (assuming it is the same bug) by disabling bluetooth. Here is the original comment about this:
I did some more testing and although it seems several devices cause suspend/resume problems, the biggest issue is with bluetooth and USB. I went to the BIOS settings and disabled all the integrated devices. In that configuration I did 10 suspend/resume cycles without problem. Then I started enabling a few devices. With WLAN enabled I had one lock-up out of 10 suspend/resume cycles, but it was a lockup during suspend, not during resume, so it is probably another issue. However, enabling bluetooth makes it impossible to suspend/resume more than once or twice (after the second resume, the machine powers off during resume). But the strangest thing is this: disabling bluetooth but enabling the external USB ports systematically triggers the unexpected power-off during boot, and the system becomes unbootable.

Correction: I just tested that again and disabling bluetooth doesn't seem to have any effect anymore on 2.6.37-rc2. With bluetooth disabled I had 4 poweroffs out of 10 boots, and with bluetooth enabled I had 3 out of 10. I don't think there is a statistically significant difference.

Same issue on my brand new Dell Vostro V13. If you need any info, I'm glad to help.

Hmm. Does booting with pci=nocrs make things better? Also please try to get the contents of /proc/iomem before suspend and after a successful resume, if possible.

pci=nocrs seems to mitigate the problem. Actually, I was going to say it makes the problem disappear because I was able to perform 15 suspend/resume cycles, but when attempting the 16th to get the contents of /proc/iomem, it powered off...
Anyway, here are the contents of /proc/iomem before and after a suspend/resume cycle, on 2.6.37-rc without pci=nocrs (or do you need it with pci=nocrs?):
gpothier@tadzim:~$ cat /proc/iomem
00000000-0000ffff : reserved
00010000-0009d3ff : System RAM
0009d400-0009ffff : reserved
000a0000-000bffff : PCI Bus 0000:00
000d2000-000d3fff : reserved
000d4000-000d7fff : PCI Bus 0000:00
000d8000-000dbfff : PCI Bus 0000:00
000dc000-000fffff : reserved
00100000-bbaa0fff : System RAM
01000000-015880b5 : Kernel code
015880b6-01a9d9cf : Kernel data
01b89000-01c6f57f : Kernel bss
bbaa1000-bbaa6fff : reserved
bbaa7000-bbbb9fff : System RAM
bbbba000-bbc0efff : reserved
bbc0f000-bbd07fff : System RAM
bbd08000-bbf0efff : reserved
bbf0f000-bbf18fff : System RAM
bbf19000-bbf1efff : reserved
bbf1f000-bbf64fff : System RAM
bbf65000-bbf9efff : ACPI Non-volatile Storage
bbf9f000-bbfe4fff : System RAM
bbfe5000-bbffefff : ACPI Tables
bbfff000-bbffffff : System RAM
c0000000-dfffffff : PCI Bus 0000:00
c8000000-c9ffffff : PCI Bus 0000:09
c8000000-c8001fff : 0000:09:00.0
c8000000-c8001fff : iwlagn
ca000000-cbffffff : PCI Bus 0000:07
cc000000-cdffffff : PCI Bus 0000:05
ce000000-cfffffff : PCI Bus 0000:03
d0000000-dfffffff : 0000:00:02.0
e0000000-efffffff : PCI MMCONFIG 0000 [bus 00-ff]
e0000000-efffffff : pnp 00:0a
f0000000-febfffff : PCI Bus 0000:00
f0000000-f1ffffff : PCI Bus 0000:01
f2000000-f3ffffff : PCI Bus 0000:03
f4000000-f7ffffff : PCI Bus 0000:05
f4000000-f4003fff : 0000:05:00.0
f4000000-f4003fff : r8169
f6000000-f6000fff : 0000:05:00.0
f6000000-f6000fff : r8169
f7fe0000-f7ffffff : 0000:05:00.0
f8000000-f9ffffff : PCI Bus 0000:07
fa000000-fbffffff : PCI Bus 0000:09
fc000000-fdffffff : PCI Bus 0000:01
fe000000-fe3fffff : 0000:00:02.0
fe400000-fe4fffff : 0000:00:02.1
fe700000-fe703fff : 0000:00:1b.0
fe700000-fe703fff : ICH HD audio
fe704000-fe7047ff : 0000:00:1f.2
fe704000-fe7047ff : ahci
fe704800-fe704bff : 0000:00:1a.7
fe704800-fe704bff : ehci_hcd
fe704c00-fe704fff : 0000:00:1d.7
fe704c00-fe704fff : ehci_hcd
febfe000-febfefff : Intel Flush Page
febfff00-febfffff : 0000:00:1f.3
fec00000-fec003ff : IOAPIC 0
fed00000-fed003ff : HPET 0
fed00000-fed003ff : pnp 00:03
fed10000-fed13fff : pnp 00:0a
fed18000-fed18fff : pnp 00:0a
fed19000-fed19fff : pnp 00:0a
fed1c000-fed1ffff : pnp 00:0a
fed20000-fed3ffff : pnp 00:0a
fed45000-fed8ffff : pnp 00:0a
fee00000-fee00fff : Local APIC
100000000-13fffffff : System RAM
gpothier@tadzim:~$ sudo pm-suspend
[sudo] password for gpothier:
gpothier@tadzim:~$ cat /proc/iomem
00000000-0000ffff : reserved
00010000-0009d3ff : System RAM
0009d400-0009ffff : reserved
000a0000-000bffff : PCI Bus 0000:00
000d2000-000d3fff : reserved
000d4000-000d7fff : PCI Bus 0000:00
000d8000-000dbfff : PCI Bus 0000:00
000dc000-000fffff : reserved
00100000-bbaa0fff : System RAM
01000000-015880b5 : Kernel code
015880b6-01a9d9cf : Kernel data
01b89000-01c6f57f : Kernel bss
bbaa1000-bbaa6fff : reserved
bbaa7000-bbbb9fff : System RAM
bbbba000-bbc0efff : reserved
bbc0f000-bbd07fff : System RAM
bbd08000-bbf0efff : reserved
bbf0f000-bbf18fff : System RAM
bbf19000-bbf1efff : reserved
bbf1f000-bbf64fff : System RAM
bbf65000-bbf9efff : ACPI Non-volatile Storage
bbf9f000-bbfe4fff : System RAM
bbfe5000-bbffefff : ACPI Tables
bbfff000-bbffffff : System RAM
c0000000-dfffffff : PCI Bus 0000:00
c8000000-c9ffffff : PCI Bus 0000:09
c8000000-c8001fff : 0000:09:00.0
c8000000-c8001fff : iwlagn
ca000000-cbffffff : PCI Bus 0000:07
cc000000-cdffffff : PCI Bus 0000:05
ce000000-cfffffff : PCI Bus 0000:03
d0000000-dfffffff : 0000:00:02.0
e0000000-efffffff : PCI MMCONFIG 0000 [bus 00-ff]
e0000000-efffffff : pnp 00:0a
f0000000-febfffff : PCI Bus 0000:00
f0000000-f1ffffff : PCI Bus 0000:01
f2000000-f3ffffff : PCI Bus 0000:03
f4000000-f7ffffff : PCI Bus 0000:05
f4000000-f4003fff : 0000:05:00.0
f4000000-f4003fff : r8169
f6000000-f6000fff : 0000:05:00.0
f6000000-f6000fff : r8169
f7fe0000-f7ffffff : 0000:05:00.0
f8000000-f9ffffff : PCI Bus 0000:07
fa000000-fbffffff : PCI Bus 0000:09
fc000000-fdffffff : PCI Bus 0000:01
fe000000-fe3fffff : 0000:00:02.0
fe400000-fe4fffff : 0000:00:02.1
fe700000-fe703fff : 0000:00:1b.0
fe700000-fe703fff : ICH HD audio
fe704000-fe7047ff : 0000:00:1f.2
fe704000-fe7047ff : ahci
fe704800-fe704bff : 0000:00:1a.7
fe704800-fe704bff : ehci_hcd
fe704c00-fe704fff : 0000:00:1d.7
fe704c00-fe704fff : ehci_hcd
febfe000-febfefff : Intel Flush Page
febfff00-febfffff : 0000:00:1f.3
fec00000-fec003ff : IOAPIC 0
fed00000-fed003ff : HPET 0
fed00000-fed003ff : pnp 00:03
fed10000-fed13fff : pnp 00:0a
fed18000-fed18fff : pnp 00:0a
fed19000-fed19fff : pnp 00:0a
fed1c000-fed1ffff : pnp 00:0a
fed20000-fed3ffff : pnp 00:0a
fed45000-fed8ffff : pnp 00:0a
fee00000-fee00fff : Local APIC
100000000-13fffffff : System RAM

This very likely is a problem with the allocation of PCI resources which the BIOS apparently doesn't give us useful information about.

Guillaume, why did you skip so many commits during the bisection? Also, would it be possible to bisect again, starting from good=v2.6.34-rc2 and bad=v2.6.34-rc6 (if that's what you're still seeing)?

I had to skip commits because either the kernel didn't compile, or produced a kernel panic on boot. I'll try to bisect again, but as it takes a long time, would it be useful if I try first with 2.6.37-rc2 minus the "PCI/PM Runtime: Make runtime PM of PCI devices inactive by default" commit?

Yes, please do.

All the reports I've seen are on the Dell Vostro V13:
Anoop P B (launchpad 588194 reporter)
Kenneth (launchpad 588194 comment 22)
gpothier (launchpad 588194 comment 24, kernel.org 22462 reporter)
znotdead (launchpad 588194 comment 25)
Emile (launchpad 588194 comment 40)
Alejandro Arcos (launchpad 588194 comment 43)
Andrea Cimitan (launchpad 588194 comment 45, kernel.org comment 15)
No /proc/iomem difference from before/after (comment 18). pci=nocrs should only affect PCI resource assignment.
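The before/after comparison of /proc/iomem can also be done mechanically instead of by eye. The following is a minimal sketch, assuming the two dumps have been saved as text with one entry per line; `parse_iomem` and `diff_iomem` are hypothetical helper names, not an existing tool.

```python
from __future__ import annotations


def parse_iomem(text: str) -> list[tuple[int, int, str]]:
    """Parse /proc/iomem text into (start, end, name) tuples.

    Lines that are not "start-end : name" entries (shell prompts,
    blank lines) are skipped. Nesting depth is ignored.
    """
    entries = []
    for line in text.splitlines():
        if " : " not in line:
            continue
        span, name = line.strip().split(" : ", 1)
        try:
            start, end = (int(part, 16) for part in span.split("-", 1))
        except ValueError:
            continue  # not a hexadecimal address range
        entries.append((start, end, name))
    return entries


def diff_iomem(before: str, after: str) -> dict[str, list[tuple[int, int, str]]]:
    """Return the entries that exist in only one of the two dumps."""
    old, new = parse_iomem(before), parse_iomem(after)
    return {
        "removed": [entry for entry in old if entry not in new],
        "added": [entry for entry in new if entry not in old],
    }


if __name__ == "__main__":
    # Hypothetical file names holding the two saved dumps.
    with open("iomem-before.txt") as f_before, open("iomem-after.txt") as f_after:
        changes = diff_iomem(f_before.read(), f_after.read())
    for kind, entries in changes.items():
        for start, end, name in entries:
            print(f"{kind}: {start:08x}-{end:08x} : {name}")
```

Two dumps with the same entries yield empty "removed" and "added" lists, which corresponds to the "no difference" observation above.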
From the dmesg in attachment 38692 [details], I think we do three assignments (the 02.0 "Flush Page" actually doesn't show in dmesg, but it's in the iomem above):
pci 0000:00:1f.3: BAR 0: assigned [mem 0xfebfff00-0xfebfffff 64bit]
pci 0000:05:00.0: BAR 6: assigned [mem 0xf7fe0000-0xf7ffffff pref]
pci 0000:00:02.0: Flush Page assigned [mem 0xfebfe000-0xfebfefff]
00:1f.3: i2c-i801 SMBus controller; we assign BAR 0, but I don't think you (Guillaume) have the driver loaded, so I doubt this matters.
05:00.0: r8169 NIC; we assign the option ROM, but I don't think the driver looks at it, so I doubt this matters either.
00:02.0: i915 video; hmm... intel-gtt.c allocates this flush page, and this *does* matter. It'd be interesting to see if removing intel_gtt from the picture makes a difference.
The btusb_intr_complete complaints in the dmesg are also interesting. Clearly *something* is wrong there, and it looks like it's connected via the 00:1a.0 USB controller, which we didn't touch.

If it can help: on SLED11 GA, fully updated, the laptop freezes during rsync transfers over gigabit full duplex (about 4GB). When rebooting we can hear error beeps from the main board, which indicate an NVRAM error according to Dell Support. I haven't tested rsync transfers on SLED11 SP1.
The power loss at boot / resume and the occasional loss of the network interface never occur on SLED11 GA; they only occur since we started using the SLED11 SP1 kernel.
SLED11 GA : 2.6.27.45
SLED11 SP1 : 2.6.32.24
I tried to bisect too, but I gave up.
Personally, I thought all this was a network-side issue (sometimes after a poweroff on boot, I don't see any network card in the BIOS).
I also know that some people experience network interface loss when dual booting (Linux first, resume from sleep, then reboot into Windows). Sometimes the network interface is missing, in the BIOS too, which may cause Linux to power off the system when loading the kernel.
(In reply to comment #25)
> The power loss at boot / resume and the occasional loss of the network
> interface never occur on SLED11 GA; they only occur since we started using
> the SLED11 SP1 kernel.

This bug is for tracking this particular issue, which seems to have been introduced way after 2.6.27. Please report the earlier issues separately.

> SLED11 GA : 2.6.27.45
> SLED11 SP1 : 2.6.32.24
>
> I tried to bisect too, but I gave up.
>
> Personally, I thought all this was a network-side issue (sometimes after a
> poweroff on boot, I don't see any network card in the BIOS).
>
> I also know that some people experience network interface loss when dual
> booting (Linux first, resume from sleep, then reboot into Windows). Sometimes
> the network interface is missing, in the BIOS too, which may cause Linux to
> power off the system when loading the kernel.

That may be a result of the network adapter's PCI resources overlapping with something else.

*** Bug 24192 has been marked as a duplicate of this bug. ***

Sorry for the silence, I've been very busy this month. I just tested with the latest kernel (cloned today from kernel.org), and it seems the problem got fixed: I went through 20 suspend/resume cycles without a hiccup.

(In reply to comment #28)
> Sorry for the silence, I've been very busy this month. I just tested with the
> latest kernel (cloned today from kernel.org), and it seems the problem got
> fixed: I went through 20 suspend/resume cycles without a hiccup.

With mainline? I did a
git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git linux-2.6
compiled it on Ubuntu Maverick using the config from the previous kernel, and it fails to resume. Which config are you using? Which kernel did you grab?

@Andrea: Yes, with mainline. I followed the instructions from this page: https://wiki.kubuntu.org/KernelTeam/GitKernelBuild
So that's a fresh clone from 2 days ago. When did you clone?
I also took the config from a previous kernel (2.6.31-22); I'll attach it for reference.

Created attachment 42232 [details]
Config file
(In reply to comment #31)
> Created an attachment (id=42232) [details]
> Config file

This works (though I had to adjust it to the latest git). Now compiling with the Ubuntu default config, to see if it's the config file that breaks it.

OK, using the Ubuntu config it crashes with git. So it's not fixed, but something in the Ubuntu config is triggering the bug. I'll attach the Ubuntu config, and a diff between the Ubuntu config and the config that does not reproduce the bug.

Created attachment 42282 [details]
ubuntu config that reproduces the bug even with latest git
Created attachment 42292 [details]
diff between the ubuntu broken config and a working one
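A diff like the one attached here can be produced option by option rather than as a raw text diff. The following is a minimal sketch, assuming two kernel .config files in the usual "CONFIG_X=value" / "# CONFIG_X is not set" syntax. The helper names are hypothetical, and treating an option that is absent from a file as "n" is a simplification. The kernel tree's scripts/diffconfig does a similar job.

```python
import re

_SET = re.compile(r"^(CONFIG_\w+)=(.*)$")
_UNSET = re.compile(r"^# (CONFIG_\w+) is not set$")


def parse_config(text: str) -> dict:
    """Map each option in a kernel .config to its value ("n" when unset)."""
    options = {}
    for raw in text.splitlines():
        line = raw.strip()
        match = _SET.match(line)
        if match:
            options[match.group(1)] = match.group(2)
            continue
        match = _UNSET.match(line)
        if match:
            options[match.group(1)] = "n"
    return options


def diff_configs(broken: str, working: str) -> dict:
    """Return {option: (broken_value, working_value)} for differing options.

    An option missing from one file is reported as "n" there, which is
    an approximation made for this sketch.
    """
    b, w = parse_config(broken), parse_config(working)
    return {
        name: (b.get(name, "n"), w.get(name, "n"))
        for name in sorted(set(b) | set(w))
        if b.get(name, "n") != w.get(name, "n")
    }


if __name__ == "__main__":
    # Hypothetical file names for the two configs being compared.
    with open("config-broken") as f_broken, open("config-working") as f_working:
        for name, (bad, good) in diff_configs(f_broken.read(), f_working.read()).items():
            print(f"{name}: broken={bad} working={good}")
```

The resulting list of differing options is a natural starting point for narrowing down which setting triggers the power-off.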
Created attachment 42422 [details]
this *always* crashes during boot with bluetooth disabled in the bios
Created attachment 42432 [details]
dump during a broken suspend/resume
I don't know if those are useful or not
One thing that might make a difference is that your failing kernels have CONFIG_HIGHMEM64G set. Can you please set CONFIG_HIGHMEM4G instead in your failing .config and see if that helps?

Also, and this is related, CONFIG_X86_PAE=y looks like it might be the culprit.

Other suspicious .config options apparently set in the failing .configs are:
CONFIG_ACPI_BLACKLIST_YEAR=2000
CONFIG_ACPI_APEI=y
CONFIG_X86_APM_BOOT=y
CONFIG_X86_SPEEDSTEP_ICH=y
CONFIG_X86_SPEEDSTEP_SMI=y
CONFIG_INTEL_IDLE=y
There are probably more, but I'd start from checking the above (and the ones in the two previous comments).

Created attachment 42522 [details]
this config crashes
(In reply to comment #40)
> Other suspicious .config options apparently set in the failing .configs are:
>
> CONFIG_ACPI_BLACKLIST_YEAR=2000
> CONFIG_ACPI_APEI=y
> CONFIG_X86_APM_BOOT=y
> CONFIG_X86_SPEEDSTEP_ICH=y
> CONFIG_X86_SPEEDSTEP_SMI=y
> CONFIG_INTEL_IDLE=y
>
> There are probably more, but I'd start from checking the above (and the
> ones in the two previous comments).

Tried, it crashes. See above.

Created attachment 42562 [details]
diff between the a broken config and a working one
continue doing a config-bisect :) here is a diff between a broken kernel and a kernel that works (strange, it doesn't seem to touch acpi)
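The config-bisect mentioned here can be sketched as a halving search over the options that differ between the two configs. This is only an illustration under strong assumptions: exactly one option is responsible, options can be flipped independently (Kconfig dependencies are ignored), and the caller supplies a hypothetical `is_bad` callback that builds and boot-tests a kernel. An intermittent failure like this one would also need several test cycles per step.

```python
from typing import Callable, Sequence


def config_bisect(candidates: Sequence[str],
                  is_bad: Callable[[frozenset], bool]) -> str:
    """Narrow a list of differing options down to a single culprit.

    `is_bad(flipped)` is expected to test the working config with only the
    options in `flipped` switched to their broken-config values, and to
    return True when that kernel shows the failure. Assumes a single
    culprit with no interaction between options.
    """
    remaining = list(candidates)
    if not remaining:
        raise ValueError("no candidate options to bisect")
    while len(remaining) > 1:
        half = remaining[: len(remaining) // 2]
        if is_bad(frozenset(half)):
            remaining = half                    # culprit is in the tested half
        else:
            remaining = remaining[len(half):]   # culprit is in the other half
    return remaining[0]


if __name__ == "__main__":
    # Stand-in test: pretends a made-up option is the culprit.
    options = ["CONFIG_A", "CONFIG_B", "CONFIG_C", "CONFIG_D", "CONFIG_E"]
    print(config_bisect(options, lambda flipped: "CONFIG_D" in flipped))
```

If the failure only appears when several options are combined, this search converges on a wrong answer, so the single-culprit assumption needs checking against the final result.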
(In reply to comment #43)
> Created an attachment (id=42562) [details]
> diff between the a broken config and a working one
>
> continue doing a config-bisect :) here is a diff between a broken kernel and a
> kernel that works (strange, it doesn't seem to touch acpi)

The strange thing is that if I apply that diff to the ubuntu config it doesn't work :D

First, in what sense doesn't it work? Do you mean it doesn't build? Second, I _think_ we can rule out things that are built as modules in your failing .config, unless one of these modules is actually loaded. Would it be possible to check if they aren't loaded?

The patch attached to this bug fixes the issue: https://bugzilla.kernel.org/show_bug.cgi?id=15416
The patch is: https://bugzilla.kernel.org/attachment.cgi?id=41862

Cool, thanks for the information.

*** This bug has been marked as a duplicate of bug 15416 ***