Bug 41722
Summary: | Boot hang unless acpi=off - Clevo M5X0JE | ||
---|---|---|---|
Product: | ACPI | Reporter: | Bjorn Helgaas (bjorn) |
Component: | Config-Other | Assignee: | acpi_config-other |
Status: | CLOSED WILL_FIX_LATER | ||
Severity: | normal | CC: | bjorn, lenb, rbrito, rjw, rui.zhang, tianyu.lan |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
URL: | http://thread.gmane.org/gmane.linux.kernel/1181761/foc | ||
Kernel Version: | 3.1.0-rc3 | Subsystem: | |
Regression: | No | Bisected commit-id: | |
Attachments: |
last messages with "ignore loglevel acpi.debug_layer=0xffffffff acpi.debug_level=0xffffffff"
ignore_loglevel acpi.debug_layer=0x59f acpi.debug_level=0xffffffff
dmesg (acpi=off, with PCI transparent bridge change)
acpidump
dmidecode
boot with kernel 2.4.27
boot with kernel 2.6.8
Output of hwdirect with the MSRs for the Clevo |
Description
Bjorn Helgaas
2011-08-25 16:49:22 UTC
Dear Bjorn,

I am attaching here two digital photos of a *vanilla* kernel (without that PCI bridge sizing patch reverted). The kernel is compiled with:

CONFIG_ACPI_EC_DEBUGFS=m
CONFIG_ACPI_DEBUG=y
CONFIG_ACPI_DEBUG_FUNC_TRACE=y

Hope this can supply sufficient information. If not, I can still try other stuff. Just let me know and I'll do my best.

Thanks.

Created attachment 70192 [details]
last messages with "ignore loglevel acpi.debug_layer=0xffffffff acpi.debug_level=0xffffffff"
Created attachment 70202 [details]
ignore_loglevel acpi.debug_layer=0x59f acpi.debug_level=0xffffffff
Just for the record, this notebook still has a Windows Vista partition in it and I can try to extract whatever ACPI information is exposed there so that we can have more information regarding what the hardware configuration is. Would this help?

Do you know whether any version of Linux boots on this system without "acpi=off"?

I did find dmesg logs at https://bugs.launchpad.net/ubuntu/+source/linux/+bug/343675 (2.6.28) and https://bugs.launchpad.net/ubuntu/+source/linux/+bug/446911 (2.6.31), which suggests that linux booted with ACPI enabled on *some* Clevo M5X0JE systems.

I assume Windows still boots on this system (to make sure this isn't hardware that broke and prevents both Windows and Linux from booting).

Based on your debug output, I think the hang happens when we do an outb() to I/O port 0x142e to enable ACPI mode. Do you have any CardBus or other devices you could remove, to make sure it's not device-related?

Created attachment 70282 [details]
dmesg (acpi=off, with PCI transparent bridge change)
Created attachment 70292 [details]
acpidump
Created attachment 70302 [details]
dmidecode
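For readers following the outb() discussion: Linux's ACPI core switches the platform into ACPI mode by writing the FADT's ACPI_ENABLE value to the I/O port named in the FADT's SMI_CMD field. The following is a minimal sketch, assuming you have a raw FADT table (signature "FACP") as binary data, for example extracted from the acpidump attachment. The function name `parse_fadt_smi` is hypothetical, and the field offsets are taken from the FADT layout in the ACPI specification. If port 0x142e from the discussion is indeed this machine's SMI_CMD port, it would appear in the `smi_cmd` field.

```python
import struct


def parse_fadt_smi(fadt: bytes) -> dict:
    """Extract the fields used in the ACPI-enable handshake from a raw FADT.

    Offsets follow the ACPI specification's FADT layout:
      46: SCI_INT (2 bytes), 48: SMI_CMD (4 bytes),
      52: ACPI_ENABLE (1 byte), 53: ACPI_DISABLE (1 byte).
    """
    if len(fadt) < 54:
        raise ValueError("FADT too short to contain SMI_CMD/ACPI_ENABLE")
    if fadt[0:4] != b"FACP":
        raise ValueError("not a FADT (expected signature 'FACP')")
    # "<" selects little-endian with no padding, so the fields are
    # read back-to-back starting at offset 46.
    sci_int, smi_cmd, acpi_enable, acpi_disable = struct.unpack_from(
        "<HIBB", fadt, 46
    )
    return {
        "sci_int": sci_int,
        "smi_cmd": smi_cmd,            # I/O port the OS writes to
        "acpi_enable": acpi_enable,    # value written to enter ACPI mode
        "acpi_disable": acpi_disable,  # value written to leave ACPI mode
    }
```

Calling it on the bytes of a FADT file and printing `hex(result["smi_cmd"])` and `hex(result["acpi_enable"])` shows which port and value the kernel would use for the write that hangs.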
Hi, Bjorn.

On Aug 25 2011, bugzilla-daemon@bugzilla.kernel.org wrote:
> --- Comment #5 from Bjorn Helgaas <bhelgaas@google.com> 2011-08-25 19:48:13 ---
> Do you know whether any version of Linux boots on this system without
> "acpi=off"?

I don't know, but I can try. The only problem is that udev has some restrictions on the kernel version. :-(

> I did find dmesg logs at
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/343675 (2.6.28) and
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/446911 (2.6.31), which
> suggests that linux booted with ACPI enabled on *some* Clevo M5X0JE systems.

As per the previous bug, the only version of Linux that I got and that worked before was 2.6.24 and I don't quite remember if I needed acpi=off or not, but I can try it again, of course, and let you know.

> I assume Windows still boots on this system (to make sure this isn't
> hardware that broke and prevents both Windows and Linux from booting).

No, the hardware doesn't appear to be broken. And yes, Windows works fine (from the little about Windows that I know). I chose to keep it here, as that's what my mother used to use (but she tells me that she would prefer to switch). Everything works well in Windows and I can try to boot with some logging enabled, if I happen to know where debug messages are sent/put once Windows boots.

> Based on your debug output, I think the hang happens when we do an outb() to
> I/O port 0x142e to enable ACPI mode. Do you have any CardBus or other devices
> you could remove, to make sure it's not device-related?

While the notebook has a cardbus bay, it is not in use right now. The only things that are plugged in right now are the power cord and the ethernet cable. I don't have a null cable here to use a serial console or something... I can try to see if a network console works, but I guess that a PCI card would be sort of a chicken-and-egg problem, wouldn't it?

Just as a side comment, do these PCI and ACPI problems always go hand in hand?
I will try to do a binary search with Linux versions to see if there is any one that can work without acpi=off being passed.

Thanks, Bjorn,

(In reply to comment #8)
> Created an attachment (id=70302) [details]
> dmidecode

I just tried to read the disassembly of the DSDT of the output of dmidecode and I couldn't compile it because the iasl compiler throws (among others) this error:

DSDT.dsl   898:     0xFFF00000, // Length
Error    4117 -              ^ Length is larger than Min/Max window

If this language has any kind of scope (it seems that the braces are there for this very reason), then it is inside the "Method (_CRS, 0, NotSerialized)" section, which, from the name, I understand is something like a function, right? And I suppose that that _CRS stands for Current Resource Settings, right?

You told me before that I should have passed pci=use_crs to the kernel, but it seems that we have a problem with the DSDT table, if I am not mistaken...

Does this (admittedly superficial) analysis help?

For this purpose, you don't need to worry about udev. That's a user-space thing, and if you get far enough to run into udev failures, you already know that ACPI is working.

The https://bugs.launchpad.net/ubuntu/+source/linux/+bug/343675 report is particularly interesting because it has the same BIOS version as you. If you could lay hands on an Ubuntu 9.04 CD and try that, that would be a good start. You shouldn't need to do a full install -- just a live CD should be fine.

Another thing to look at is your BIOS SETUP settings. If there's a "reset everything to defaults", please try that.

Per Len Brown, the outb() triggers SMM (a special part of the BIOS), and the hang is most likely in that SMM code, which is very difficult to debug. I think the console info you have now is fine; no need to worry about netconsole or serial console.

I've been working a lot of problems at the intersection of PCI and ACPI, but in this case, I think the transparent bridge problem is unrelated to the ACPI enable problem. The DSDT _CRS error you see is one of those things at the intersection of PCI and ACPI, but the hang happens long before we would look at that information.

Created attachment 70442 [details]
boot with kernel 2.4.27
Here is a boot with kernel 2.4.27 (from Debian 3.1). It boots with no options passed, but I guess that this doesn't help, as it says there "ACPI: Interpreter disabled".
If it does help, then I may try to find a way to get the logs out of that machine, but as that kernel version has almost no supported drivers for this machine, it will take me some time.
Created attachment 70452 [details]
boot with kernel 2.6.8
Hi, Bjorn.
Also from Debian 3.1, this is a boot with kernel 2.6.8, without passing any parameters.
I think that I can say that no moderately recent kernel 2.6 ever boots on this machine without acpi being turned off.
Would remote access to this machine help? I can, of course, give you ssh access so that you can play with it.
Thanks.
So it looks like 2.4.27 doesn't even try to enable ACPI mode, and 2.6.8 hangs the same way as 3.1.0-rc3.

Can you try 2.6.28? If that doesn't boot, try the Ubuntu 9.04 kernel, which was reported to boot on your machine/BIOS.

I don't think remote access would help much unless we had remote power control and remote console access.

Dear Bjorn,

Sorry for not responding promptly but I had some family members with health problems. Things are settling right now and I think that I can be faster now.

(In reply to comment #14)
> So it looks like 2.4.27 doesn't even try to enable ACPI mode, and 2.6.8 hangs
> the same way as 3.1.0-rc3.

Yes, exactly.

> Can you try 2.6.28? If that doesn't boot, try the Ubuntu 9.04 kernel, which
> was reported to boot on your machine/BIOS.

Unfortunately, I tried to boot with that version of the Ubuntu live CD and it crashed the same way. I may try it once again, paying more attention if it behaves differently in any kind of way, though.

Oh, talking about PCI and ACPI failing, could you take a look at issue #41132 to see if you can help with another machine that stopped working after a commit that I bisected?

Thanks,

Rogério Brito.

hmm, what's the status of this bug? does the problem still exist in the latest upstream kernel?

bug closed as there is no response from the bug reporter. please feel free to reopen it if the problem still exists in the latest upstream kernel.

Just for the record, I can pick up where we left off. Just let me know what I should do next.

Thanks,

Rogério Brito.

> https://bugzilla.kernel.org/show_bug.cgi?id=41722

This system boots Windows fine, but hangs when Linux does an outb(0x142e) to enable ACPI mode. Len suggests that this enters SMM mode, where the BIOS does stuff which we basically cannot debug.

Len did hazard a guess that MTRR config is a possible difference between Linux and Windows. I think we can figure out the Linux MTRR config from the dmesg of a boot with "acpi=off".

Rogério, would you mind attaching that log here (it's OK if you need to apply the transparent bridge reversion patch). Then, since I think you do have Windows on the box, can you use a free trial version of AIDA64 (http://www.aida64.com/) to collect a complete system report and attach that as well? With luck, we can extract the MTRR config under Windows from that.

Ping, Rogério, do you want to work on this anymore? I don't blame you if you've given up ;)

Hi, Bjorn.

On Oct 05 2012, bugzilla-daemon@bugzilla.kernel.org wrote:
> --- Comment #20 from Bjorn Helgaas <bhelgaas@google.com> 2012-10-05 16:39:54 ---
> Ping, Rogério, do you want to work on this anymore? I don't blame you if
> you've given up ;)

Yes, I still want to work on this. To make things speedier, what do you think if we scheduled some time so that we could both be online and I could give you some feedback instantaneously? Do you think it would be a good idea? I can even show you (say, via a Google Hangout) how the machine behaves.

I won't have time today, but will try to reserve time whenever you want, if you think that this would be productive.

Thanks,

Great! Let's start by collecting the dmesg log and AIDA64 report I mentioned in comment #19. I really want to see the MTRR config under Windows, but I don't think AIDA64 shows that.

There's a HWDIRECT tool (http://www.eprotek.com/pg_hwdirect.html) that claims to do it. I haven't tried it, but there is a free trial version of it as well. If you could try it out and attach any MTRR info you can find under Windows, that would be great.

Dear Bjorn,

On Fri, Oct 5, 2012 at 2:46 PM, <bugzilla-daemon@bugzilla.kernel.org> wrote:
> --- Comment #22 from Bjorn Helgaas <bhelgaas@google.com> 2012-10-05 17:46:02 ---
> Great! Let's start by collecting the dmesg log and AIDA64 report I mentioned
> in comment #19. I really want to see the MTRR config under Windows, but I
> don't think AIDA64 shows that.

I had that collected once, but I forgot to upload it. :) I suspect that some of the files are too big for bugzilla to hold (especially the full report). Compressing it may work, but I would prefer to send them to you in private, if you don't mind.

> There's a HWDIRECT tool
> (http://www.eprotek.com/pg_hwdirect.html) that claims to do it. I haven't
> tried it, but there is a free trial version of it as well. If you could try it
> out and attach any MTRR info you can find under Windows, that would be great.

I am downloading it now, but I read in their homepage that it doesn't allow you to save the results to file in the trial version. I guess that I will have to circumvent that somehow. :)

Regarding us meeting, I am fine with the times that you gave me, but I will get in touch in private mail for details.

Thanks for all the directions,

Rogério.

--
Rogério Brito : rbrito@{ime.usp.br,gmail.com} : GPG key 4096R/BCFCAAAA
http://rb.doesntexist.org/blog : Projects : https://github.com/rbrito/
DebianQA: http://qa.debian.org/developer.php?login=rbrito%40ime.usp.br

Created attachment 85141 [details]
Output of hwdirect with the MSRs for the Clevo
Dear Bjorn and others,
At last, here is a dump of what HWDIRECT sees running under windows.
Not exactly the most user-friendly program (took me a while to see how to grab the contents of the MSRs), but, now, at least, we can compare what Windows sees with what Linux sees.
And, more importantly, why Linux hangs during ACPI init.
If more information is needed, please let me know.
Thanks in advance,
Rogério Brito.
Here's my meager comparison of MTRR settings between Linux and Windows. I didn't see any differences. Under Windows, I see this:

  00000-7FFFF 06 WB
  80000-9FFFF 06 WB
    The above correspond to Linux's 00000-9FFFF write-back

  A0000-BFFFF 00 UC
    The above correspond to Linux's A0000-BFFFF uncachable

  C0000-C7FFF 05 WP
  C8000-CFFFF 05 WP
  D0000-D3FFF 05 WP
    The above correspond to Linux's C0000-D3FFF write-protect

  D4000-D7FFF 00 UC
  D8000-DFFFF 00 UC
  E0000-E3FFF 00 UC
    The above correspond to Linux's D4000-E3FFF uncachable

  E4000-E7FFF 05 WP
  E8000-EFFFF 05 WP
  F0000-F7FFF 05 WP
  F8000-FFFFF 05 WP
    The above correspond to Linux's E4000-FFFFF write-protect

My raw notes, including the specific registers where I extracted the ranges, are below.

I'm afraid I don't have any ideas about other things to try. The only ways I can think of to make progress are to (1) see if the manufacturer can help (unlikely since this is so old), (2) see if any version of Linux can enable ACPI successfully and figure out what changed, or (3) see if any other instance of this machine can boot Linux and figure out what's special about this particular one.
Linux:

  MTRR default type: uncachable
  MTRR fixed ranges enabled:
    00000-9FFFF write-back
    A0000-BFFFF uncachable
    C0000-D3FFF write-protect
    D4000-E3FFF uncachable
    E4000-FFFFF write-protect
  MTRR variable ranges enabled:
    0 base 0000000000 mask FF80000000 write-back
    1 disabled
    2 disabled
    3 disabled
    4 disabled
    5 disabled
    6 disabled
    7 disabled

Windows:

  000000FE 00000508 00000000 MTRRCAP (read-only)
  00000200 00000006 00000000 MTRR_PHYSBASE0 type 06 WB base 0
  00000201 80000800 000000FF MTRR_PHYSMASK0 valid = 1 mask = ff80000.000
  00000202 00000000 00000000 MTRR_PHYSBASE1
  00000203 00000000 00000000 MTRR_PHYSMASK1
  00000204 00000000 00000000 MTRR_PHYSBASE2
  00000205 00000000 00000000 MTRR_PHYSMASK2
  00000206 00000000 00000000 MTRR_PHYSBASE3
  00000207 00000000 00000000 MTRR_PHYSMASK3
  00000208 00000000 00000000 MTRR_PHYSBASE4
  00000209 00000000 00000000 MTRR_PHYSMASK4
  0000020A 00000000 00000000 MTRR_PHYSBASE5
  0000020B 00000000 00000000 MTRR_PHYSMASK5
  0000020C 00000000 00000000 MTRR_PHYSBASE6
  0000020D 00000000 00000000 MTRR_PHYSMASK6
  0000020E 00000000 00000000 MTRR_PHYSBASE7
  0000020F 00000000 00000000 MTRR_PHYSMASK7
  00000250 06060606 06060606 MTRR_FIX64K_00000 00000-7FFFF 06 WB
  00000258 06060606 06060606 MTRR_FIX16K_80000 80000-9FFFF 06 WB
  00000259 00000000 00000000 MTRR_FIX16K_A0000 A0000-BFFFF 00 UC
  00000268 05050505 05050505 MTRR_FIX4K_C0000  C0000-C7FFF 05 WP
  00000269 05050505 05050505 MTRR_FIX4K_C8000  C8000-CFFFF 05 WP
  0000026A 05050505 00000000 MTRR_FIX4K_D0000  D0000-D3FFF 05 WP
                                               D4000-D7FFF 00 UC
  0000026B 00000000 00000000 MTRR_FIX4K_D8000  D8000-DFFFF 00 UC
  0000026C 00000000 05050505 MTRR_FIX4K_E0000  E0000-E3FFF 00 UC
                                               E4000-E7FFF 05 WP
  0000026D 05050505 05050505 MTRR_FIX4K_E8000  E8000-EFFFF 05 WP
  0000026E 05050505 05050505 MTRR_FIX4K_F0000  F0000-F7FFF 05 WP
  0000026F 05050505 05050505 MTRR_FIX4K_F8000  F8000-FFFFF 05 WP
  000002FF 00000C00 00000000 MTRRdefType 1100 0000 0000
    E = 1 MTRR enable
    FE = 1 Fixed Range enable
    Type = 0 UC Default memory type
  00000277 00070106 00070106 CR_PAT, Page Attribute Table
    PA0 = 06 WB
    PA1 = 01 WC
    PA2 = 07 UC- (MTRR WC overrides)
    PA3 = 00 UC
    PA4 = 06 WB
    PA5 = 01 WC
    PA6 = 07 UC- (MTRR WC overrides)
    PA7 = 00 UC

Maybe the ACPI folks have some ideas here.

Rogério Brito, is there any chance that you can try the latest kernel to see if the problem still exists? BTW, would you please give a detailed description about the problem you got, as I can not get this in comment #0.

Since no response for more than one month, close this bug as WILL_FIX_LATER. Please feel free to reopen it if available.
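The fixed-range MTRR comparison earlier in the thread can be reproduced mechanically. The following is a minimal sketch, not the method anyone in the thread used. It assumes that the two data columns in the HWDIRECT notes are the low and then the high 32 bits of each MSR (an inference from the annotated D0000 and E0000 rows), and that each fixed-range MSR holds eight one-byte memory types, lowest address in the lowest byte. The names `WINDOWS_FIXED`, `decode_fixed`, and `merge` are hypothetical.

```python
# Memory type encodings used by MTRRs.
MTRR_TYPES = {0x00: "UC", 0x01: "WC", 0x04: "WT", 0x05: "WP", 0x06: "WB"}

# Fixed-range MSR index -> (base address, size of each of its 8 sub-ranges).
FIXED_MSRS = {
    0x250: (0x00000, 0x10000),  # MTRR_FIX64K_00000
    0x258: (0x80000, 0x4000),   # MTRR_FIX16K_80000
    0x259: (0xA0000, 0x4000),   # MTRR_FIX16K_A0000
    0x268: (0xC0000, 0x1000),   # MTRR_FIX4K_C0000
    0x269: (0xC8000, 0x1000),
    0x26A: (0xD0000, 0x1000),
    0x26B: (0xD8000, 0x1000),
    0x26C: (0xE0000, 0x1000),
    0x26D: (0xE8000, 0x1000),
    0x26E: (0xF0000, 0x1000),
    0x26F: (0xF8000, 0x1000),
}

# Values copied from the Windows notes; (low dword, high dword) is an
# assumed column order, not something the tool output states.
WINDOWS_FIXED = {
    0x250: (0x06060606, 0x06060606),
    0x258: (0x06060606, 0x06060606),
    0x259: (0x00000000, 0x00000000),
    0x268: (0x05050505, 0x05050505),
    0x269: (0x05050505, 0x05050505),
    0x26A: (0x05050505, 0x00000000),
    0x26B: (0x00000000, 0x00000000),
    0x26C: (0x00000000, 0x05050505),
    0x26D: (0x05050505, 0x05050505),
    0x26E: (0x05050505, 0x05050505),
    0x26F: (0x05050505, 0x05050505),
}


def decode_fixed(msrs):
    """Expand each fixed-range MSR into its eight (start, end, type) ranges."""
    ranges = []
    for index in sorted(msrs):
        low, high = msrs[index]
        base, step = FIXED_MSRS[index]
        value = (high << 32) | low
        for i in range(8):
            mtype = (value >> (8 * i)) & 0xFF
            start = base + i * step
            ranges.append((start, start + step - 1, MTRR_TYPES.get(mtype, hex(mtype))))
    return ranges


def merge(ranges):
    """Coalesce adjacent ranges of the same type, as the Linux dmesg output does."""
    merged = []
    for start, end, mtype in ranges:
        if merged and merged[-1][2] == mtype and merged[-1][1] + 1 == start:
            merged[-1] = (merged[-1][0], end, mtype)
        else:
            merged.append((start, end, mtype))
    return merged


for start, end, mtype in merge(decode_fixed(WINDOWS_FIXED)):
    print(f"{start:05X}-{end:05X} {mtype}")
```

Under those assumptions the merged ranges line up with the five fixed ranges in the Linux dmesg (write-back, uncachable, write-protect, uncachable, write-protect), which is consistent with the conclusion that the two configurations do not differ.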