Bug 41722 - Boot hang unless acpi=off - Clevo M5X0JE
Summary: Boot hang unless acpi=off - Clevo M5X0JE
Status: CLOSED WILL_FIX_LATER
Alias: None
Product: ACPI
Classification: Unclassified
Component: Config-Other
Hardware: All Linux
Importance: P1 normal
Assignee: acpi_config-other
URL: http://thread.gmane.org/gmane.linux.k...
Keywords:
Depends on:
Blocks:
 
Reported: 2011-08-25 16:49 UTC by Bjorn Helgaas
Modified: 2013-11-25 07:40 UTC

See Also:
Kernel Version: 3.1.0-rc3
Subsystem:
Regression: No
Bisected commit-id:


Attachments
last messages with "ignore_loglevel acpi.debug_layer=0xffffffff acpi.debug_level=0xffffffff" (807.94 KB, image/jpeg)
2011-08-25 17:14 UTC, Rogério Brito
ignore_loglevel acpi.debug_layer=0x59f acpi.debug_level=0xffffffff (811.74 KB, image/jpeg)
2011-08-25 17:16 UTC, Rogério Brito
dmesg (acpi=off, with PCI transparent bridge change) (44.10 KB, text/plain)
2011-08-25 21:20 UTC, Bjorn Helgaas
acpidump (76.56 KB, text/plain)
2011-08-25 21:20 UTC, Bjorn Helgaas
dmidecode (7.23 KB, text/plain)
2011-08-25 21:20 UTC, Bjorn Helgaas
boot with kernel 2.4.27 (804.72 KB, image/jpeg)
2011-08-27 04:03 UTC, Rogério Brito
boot with kernel 2.6.8 (745.95 KB, image/jpeg)
2011-08-27 04:11 UTC, Rogério Brito
Output of hwdirect with the MSRs for the Clevo (151.86 KB, text/x-log)
2012-10-28 18:49 UTC, Rogério Brito

Description Bjorn Helgaas 2011-08-25 16:49:22 UTC
Booting 3.1.0-rc3 on this notebook hangs at this point:

  (...)
  CPU: Mobile AMD Sempron (tm) Processor 3400+ stepping 02
  ACPI: Core revision 20110623

Please turn on CONFIG_ACPI_DEBUG=y, boot with "ignore_loglevel acpi.debug_layer=0xffffffff acpi.debug_level=0xffffffff", and attach whatever console information you can collect (digital photo, hand transcription, etc).

If that generates too much output, try acpi.debug_layer=0x59f instead.
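
For readers wondering what the narrower mask selects: acpi.debug_layer is a bitmask of ACPICA components. The sketch below decodes a mask into component names. The bit assignments are quoted from memory of the kernel's ACPI debug documentation and are an assumption here; please verify them against include/acpi/acoutput.h for the kernel in use. The helper name decode_layers is made up for illustration.

```python
# Assumed ACPICA component ("layer") bits; check include/acpi/acoutput.h
# before relying on them.
ACPI_LAYERS = {
    0x0001: "ACPI_UTILITIES",
    0x0002: "ACPI_HARDWARE",
    0x0004: "ACPI_EVENTS",
    0x0008: "ACPI_TABLES",
    0x0010: "ACPI_NAMESPACE",
    0x0020: "ACPI_PARSER",
    0x0040: "ACPI_DISPATCHER",
    0x0080: "ACPI_EXECUTER",
    0x0100: "ACPI_RESOURCES",
    0x0200: "ACPI_CA_DEBUGGER",
    0x0400: "ACPI_OS_SERVICES",
    0x0800: "ACPI_CA_DISASSEMBLER",
}


def decode_layers(mask):
    """Return the names of the known layer bits that are set in mask."""
    return [name for bit, name in sorted(ACPI_LAYERS.items()) if mask & bit]


# Decode the reduced mask suggested in this report.
for name in decode_layers(0x59F):
    print(name)
```

If the table above is right, 0x59f leaves out the parser and dispatcher components, which would explain why it produces less output than 0xffffffff.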
Comment 1 Rogério Brito 2011-08-25 17:07:46 UTC
Dear Bjorn,

I am attaching here two digital photos of a *vanilla* kernel (without that PCI bridge sizing patch reverted). The kernel is compiled with:

CONFIG_ACPI_EC_DEBUGFS=m
CONFIG_ACPI_DEBUG=y
CONFIG_ACPI_DEBUG_FUNC_TRACE=y


Hope this can supply sufficient information. If not, I can still try other stuff. Just let me know and I'll do my best.


Thanks.
Comment 2 Rogério Brito 2011-08-25 17:14:57 UTC
Created attachment 70192
last messages with "ignore_loglevel acpi.debug_layer=0xffffffff acpi.debug_level=0xffffffff"
Comment 3 Rogério Brito 2011-08-25 17:16:47 UTC
Created attachment 70202
ignore_loglevel acpi.debug_layer=0x59f acpi.debug_level=0xffffffff
Comment 4 Rogério Brito 2011-08-25 18:29:39 UTC
Just for the record, this notebook still has a Windows Vista partition in it and I can try to extract whatever ACPI information is exposed there so that we can have more information regarding what the hardware configuration is.

Would this help?
Comment 5 Bjorn Helgaas 2011-08-25 19:48:13 UTC
Do you know whether any version of Linux boots on this system without "acpi=off"?  

I did find dmesg logs at https://bugs.launchpad.net/ubuntu/+source/linux/+bug/343675 (2.6.28) and https://bugs.launchpad.net/ubuntu/+source/linux/+bug/446911 (2.6.31), which suggests that linux booted with ACPI enabled on *some* Clevo M5X0JE systems.

I assume Windows still boots on this system (to make sure this isn't hardware that broke and prevents both Windows and Linux from booting).

Based on your debug output, I think the hang happens when we do an outb() to I/O port 0x142e to enable ACPI mode.  Do you have any CardBus or other devices you could remove, to make sure it's not device-related?
Comment 6 Bjorn Helgaas 2011-08-25 21:20:09 UTC
Created attachment 70282
dmesg (acpi=off, with PCI transparent bridge change)
Comment 7 Bjorn Helgaas 2011-08-25 21:20:37 UTC
Created attachment 70292
acpidump
Comment 8 Bjorn Helgaas 2011-08-25 21:20:53 UTC
Created attachment 70302
dmidecode
Comment 9 Rogério Brito 2011-08-25 21:33:03 UTC
Hi, Bjorn.

On Aug 25 2011, bugzilla-daemon@bugzilla.kernel.org wrote:
> --- Comment #5 from Bjorn Helgaas <bhelgaas@google.com>  2011-08-25 19:48:13
> ---
> Do you know whether any version of Linux boots on this system without
> "acpi=off"?

I don't know, but I can try. The only problem is that udev has some
restrictions on the kernel version. :-(

> I did find dmesg logs at
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/343675 (2.6.28) and
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/446911 (2.6.31), which
> suggests that linux booted with ACPI enabled on *some* Clevo M5X0JE systems.

As per the previous bug, the only version of Linux that I got and that
worked before was 2.6.24 and I don't quite remember if I needed acpi=off or
not, but I can try it again, of course, and let you know.

> I assume Windows still boots on this system (to make sure this isn't
> hardware that broke and prevents both Windows and Linux from booting).

No, the hardware doesn't appear to be broken. And yes, Windows works fine
(from the little about Windows that I know). I chose to keep it here, as
that's what my mother used to use (but she tells me that she would prefer to
switch).

Everything works well in Windows and I can try to boot with some logging
enabled, if I happen to know where debug messages are sent/put once Windows
boots.

> Based on your debug output, I think the hang happens when we do an outb() to
> I/O port 0x142e to enable ACPI mode.  Do you have any CardBus or other
> devices
> you could remove, to make sure it's not device-related?

While the notebook has a CardBus bay, it is not in use right now. The only
things that are plugged in right now are the power cord and the ethernet
cable.

I don't have a null-modem cable here to use a serial console or something... I can
try to see if a network console works, but I guess that a PCI card would be
sort of a chicken-and-egg problem, wouldn't it?

Just as a side comment, do these PCI and ACPI problems always go hand in
hand?

I will try to do a binary search with Linux versions to see if there is any
one that can work without acpi=off being passed.


Thanks,
Comment 10 Rogério Brito 2011-08-25 21:54:46 UTC
Bjorn,

(In reply to comment #8)
> Created an attachment (id=70302) [details]
> dmidecode

I just tried to disassemble and recompile the DSDT from the attached dump, but the iasl compiler reports (among others) this error:

DSDT.dsl   898:                         0xFFF00000,         // Length
Error    4117 -                                  ^ Length is larger than Min/Max window

If this language has any kind of scope (the braces seem to be there for this very reason), then the error is inside the "Method (_CRS, 0, NotSerialized)" section, which, from the name, I understand is something like a function, right?

And I suppose that that _CRS stands for Current Resource Settings, right? You told me before that I should have passed pci=use_crs to the kernel, but it seems that we have a problem with the DSDT table, if I am not mistaken...

Does this (admittedly superficial) analysis help?
Comment 11 Bjorn Helgaas 2011-08-25 22:05:03 UTC
For this purpose, you don't need to worry about udev.  That's a user-space
thing, and if you get far enough to run into udev failures, you already know
that ACPI is working.

The https://bugs.launchpad.net/ubuntu/+source/linux/+bug/343675 report is
particularly interesting because it has the same BIOS version as you.  If you
could lay hands on an Ubuntu 9.04 CD and try that, that would be a good start. 
You shouldn't need to do a full install -- just a live CD should be fine.

Another thing to look at is your BIOS SETUP settings.  If there's a "reset
everything to defaults", please try that.

Per Len Brown, the outb() triggers SMM (a special part of the BIOS), and the
hang is most likely in that SMM code, which is very difficult to debug.

I think the console info you have now is fine; no need to worry about
netconsole or serial console.

I've been working a lot of problems at the intersection of PCI and ACPI, but in
this case, I think the transparent bridge problem is unrelated to the ACPI
enable problem.

The DSDT _CRS error you see is one of those things at the intersection of PCI and ACPI, but the hang happens long before we would look at that information.
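
To make the failure point concrete, here is a toy model of the ACPI enable handshake as the ACPI specification describes it: the OS writes the FADT's ACPI_ENABLE value to the SMI_CMD port, firmware takes an SMI, and the OS then waits for SCI_EN to appear in the PM1 control register. This is only a sketch and not the kernel's code. Port 0x142e is the one named in this report; the ACPI_ENABLE value (0xA1), the PM1a control port (0x1004), and the FakeFirmware class are hypothetical stand-ins.

```python
SCI_EN = 0x0001  # bit 0 of the PM1 control register


class FakeFirmware:
    """Toy stand-in for port I/O (hypothetical, for illustration only)."""

    def __init__(self, smi_cmd, acpi_enable, pm1a_cnt, hangs=False):
        self.smi_cmd = smi_cmd
        self.acpi_enable = acpi_enable
        self.pm1a_cnt = pm1a_cnt
        self.hangs = hangs
        self.pm1 = 0

    def outb(self, value, port):
        # Writing ACPI_ENABLE to SMI_CMD triggers the SMI handler.
        if port == self.smi_cmd and value == self.acpi_enable:
            if self.hangs:
                # On real hardware the CPU would simply never come back.
                raise RuntimeError("SMI handler never returned")
            self.pm1 |= SCI_EN

    def inw(self, port):
        return self.pm1 if port == self.pm1a_cnt else 0xFFFF


def enable_acpi(fw, smi_cmd, acpi_enable, pm1a_cnt, retries=3000):
    """Request ACPI mode, then poll until firmware reports SCI_EN."""
    fw.outb(acpi_enable, smi_cmd)
    for _ in range(retries):
        if fw.inw(pm1a_cnt) & SCI_EN:
            return True
    return False


good = FakeFirmware(0x142E, 0xA1, 0x1004)
print(enable_acpi(good, 0x142E, 0xA1, 0x1004))
```

The point of the model is that the hang sits inside the outb() itself: control is in firmware at that moment, so nothing the kernel logs afterwards can describe it.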
Comment 12 Rogério Brito 2011-08-27 04:03:45 UTC
Created attachment 70442
boot with kernel 2.4.27

Here is a boot with kernel 2.4.27 (from Debian 3.1). It boots with no options passed, but I guess that this doesn't help, as it says there "ACPI: Interpreter disabled".

If it does help, then I may try to find a way to get the logs out of that machine, but as that kernel version has almost no supported drivers for this machine, it will take me some time.
Comment 13 Rogério Brito 2011-08-27 04:11:43 UTC
Created attachment 70452
boot with kernel 2.6.8

Hi, Bjorn.

Also from Debian 3.1, this is a boot with kernel 2.6.8, without passing any parameters.

I think that I can say that no moderately recent kernel 2.6 ever boots on this machine without acpi being turned off.

Would remote access to this machine help? I can, of course, give you ssh access so that you can play with it.

Thanks.
Comment 14 Bjorn Helgaas 2011-09-02 23:55:35 UTC
So it looks like 2.4.27 doesn't even try to enable ACPI mode, and 2.6.8 hangs the same way as 3.1.0-rc3.

Can you try 2.6.28?  If that doesn't boot, try the Ubuntu 9.04 kernel, which was reported to boot on your machine/BIOS.

I don't think remote access would help much unless we had remote power control and remote console access.
Comment 15 Rogério Brito 2011-09-06 22:22:44 UTC
Dear Bjorn,

Sorry for not responding promptly but I had some family members with health problems. Things are settling right now and I think that I can be faster now.

(In reply to comment #14)
> So it looks like 2.4.27 doesn't even try to enable ACPI mode, and 2.6.8 hangs
> the same way as 3.1.0-rc3.

Yes, exactly.

> Can you try 2.6.28?  If that doesn't boot, try the Ubuntu 9.04 kernel, which
> was reported to boot on your machine/BIOS.

Unfortunately, I tried to boot with that version of the Ubuntu live CD and it crashed the same way. I may try it once again, paying more attention if it behaves differently in any kind of way, though.

Oh, talking about PCI and ACPI failing, could you take a look at issue #41132 to see if you can help with another machine that stopped working after a commit that I bisected?


Thanks,

Rogério Brito.
Comment 16 Zhang Rui 2012-01-18 05:39:52 UTC
Hmm, what's the status of this bug?
Does the problem still exist in the latest upstream kernel?
Comment 17 Zhang Rui 2012-05-24 08:07:51 UTC
Bug closed as there is no response from the bug reporter.
Please feel free to reopen it if the problem still exists in the latest upstream kernel.
Comment 18 Rogério Brito 2012-06-22 10:29:03 UTC
Just for the record, I can pick up where we left off. Just let me know what I should do next.

Thanks, Rogério Brito.
Comment 19 Bjorn Helgaas 2012-06-22 19:32:40 UTC
> https://bugzilla.kernel.org/show_bug.cgi?id=41722

This system boots Windows fine, but hangs when Linux does an
outb(0x142e) to enable ACPI mode.  Len suggests that this enters SMM
mode, where the BIOS does stuff which we basically cannot debug.  Len
did hazard a guess that MTRR config is a possible difference between
Linux and Windows.

I think we can figure out the Linux MTRR config from the dmesg of a
boot with "acpi=off".  Rogério, would you mind attaching that log here?
(It's OK if you need to apply the transparent bridge reversion patch.)

Then, since I think you do have Windows on the box, can you use a free
trial version of AIDA64 (http://www.aida64.com/) to collect a complete
system report and attach that as well?  With luck, we can extract the
MTRR config under Windows from that.
Comment 20 Bjorn Helgaas 2012-10-05 16:39:54 UTC
Ping, Rogério, do you want to work on this anymore?  I don't blame you if you've given up ;)
Comment 21 Rogério Brito 2012-10-05 17:13:17 UTC
Hi, Bjorn.

On Oct 05 2012, bugzilla-daemon@bugzilla.kernel.org wrote:
> --- Comment #20 from Bjorn Helgaas <bhelgaas@google.com>  2012-10-05 16:39:54
> ---
> Ping, Rogério, do you want to work on this anymore?  I don't blame you if
> you've given up ;)

Yes, I still want to work on this. To make things speedier, what do you
think if we scheduled some time so that we could both be online and I could
give you some feedback instantaneously?

Do you think it would be a good idea? I can even show you (say, via a Google
Hangout) how the machine behaves.

I won't have time today, but will try to reserve time whenever you want, if
you think that this would be productive.


Thanks,
Comment 22 Bjorn Helgaas 2012-10-05 17:46:02 UTC
Great!  Let's start by collecting the dmesg log and AIDA64 report I mentioned in comment #19.  I really want to see the MTRR config under Windows, but I don't think AIDA64 shows that.  There's a HWDIRECT tool (http://www.eprotek.com/pg_hwdirect.html) that claims to do it.  I haven't tried it, but there is a free trial version of it as well.  If you could try it out and attach any MTRR info you can find under Windows, that would be great.
Comment 23 Rogério Brito 2012-10-11 19:54:15 UTC
Dear Bjorn,

On Fri, Oct 5, 2012 at 2:46 PM,  <bugzilla-daemon@bugzilla.kernel.org> wrote:
> --- Comment #22 from Bjorn Helgaas <bhelgaas@google.com>  2012-10-05 17:46:02
> ---
> Great!  Let's start by collecting the dmesg log and AIDA64 report I mentioned
> in comment #19.  I really want to see the MTRR config under Windows, but I
> don't think AIDA64 shows that.

I had that collected once, but I forgot to upload it. :)

I suspect that some of the files are too big for bugzilla to hold
(especially the full report). Compressing it may work, but I would
prefer to send them to you in private, if you don't mind.

> There's a HWDIRECT tool
> (http://www.eprotek.com/pg_hwdirect.html) that claims to do it.  I haven't
> tried it, but there is a free trial version of it as well.  If you could try
> it
> out and attach any MTRR info you can find under Windows, that would be great.

I am downloading it now, but I read on their homepage that it doesn't
allow you to save the results to file in the trial version. I guess
that I will have to circumvent that somehow. :)

Regarding us meeting, I am fine with the times that you gave me, but I
will get in touch in private mail for details.


Thanks for all the directions,

Rogério.
-- 
Rogério Brito : rbrito@{ime.usp.br,gmail.com} : GPG key 4096R/BCFCAAAA
http://rb.doesntexist.org/blog : Projects : https://github.com/rbrito/
DebianQA: http://qa.debian.org/developer.php?login=rbrito%40ime.usp.br
Comment 24 Rogério Brito 2012-10-28 18:49:30 UTC
Created attachment 85141
Output of hwdirect with the MSRs for the Clevo

Dear Bjorn and others,

At last, here is a dump of what HWDIRECT sees running under Windows.

Not exactly the most user-friendly program (it took me a while to see how to grab the contents of the MSRs), but now, at least, we can compare what Windows sees with what Linux sees and, more importantly, try to work out why Linux hangs during ACPI init.

If more information is needed, please let me know.


Thanks in advance,

Rogério Brito.
Comment 25 Bjorn Helgaas 2013-01-30 21:01:20 UTC
Here's my meager comparison of MTRR settings between Linux and Windows.  I didn't see any differences.  Under Windows, I see this:

      00000-7FFFF  06  WB
      80000-9FFFF  06  WB

The above corresponds to Linux's 00000-9FFFF write-back.

      A0000-BFFFF  00  UC

The above corresponds to Linux's A0000-BFFFF uncachable.

      C0000-C7FFF  05  WP
      C8000-CFFFF  05  WP
      D0000-D3FFF  05  WP

The above corresponds to Linux's C0000-D3FFF write-protect.

      D4000-D7FFF  00  UC
      D8000-DFFFF  00  UC
      E0000-E3FFF  00  UC

The above corresponds to Linux's D4000-E3FFF uncachable.

      E4000-E7FFF  05  WP
      E8000-EFFFF  05  WP
      F0000-F7FFF  05  WP
      F8000-FFFFF  05  WP

The above corresponds to Linux's E4000-FFFFF write-protect.

My raw notes, including the specific registers where I extracted the ranges, are below.

I'm afraid I don't have any ideas about other things to try.  The only ways I can think of to make progress are to (1) see if the manufacturer can help (unlikely since this is so old), (2) see if any version of Linux can enable ACPI successfully and figure out what changed, or (3) see if any other instance of this machine can boot Linux and figure out what's special about this particular one.

Linux:
      MTRR default type: uncachable
      MTRR fixed ranges enabled:
        00000-9FFFF write-back
        A0000-BFFFF uncachable
        C0000-D3FFF write-protect
        D4000-E3FFF uncachable
        E4000-FFFFF write-protect
      MTRR variable ranges enabled:
        0 base 0000000000 mask FF80000000 write-back
        1 disabled
        2 disabled
        3 disabled
        4 disabled
        5 disabled
        6 disabled
        7 disabled
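
As a cross-check on variable range 0, the base/mask pair can be turned back into an address range. This is a small sketch, assuming a 40-bit physical address width (inferred from the 10-hex-digit mask that Linux prints) and a contiguous mask; the function name variable_range is made up for illustration.

```python
PHYS_BITS = 40  # assumption: inferred from the 10-hex-digit mask above


def variable_range(base, mask, phys_bits=PHYS_BITS):
    """Return (start, end) of the region described by a base/mask pair.

    Only valid for contiguous masks, which is the usual case.
    """
    full = (1 << phys_bits) - 1
    size = (~mask & full) + 1   # the cleared mask bits give the region size
    start = base & mask
    return start, start + size - 1


start, end = variable_range(0x0000000000, 0xFF80000000)
print(f"{start:010X}-{end:010X}  {(end - start + 1) >> 20} MiB write-back")
```

Under those assumptions, range 0 is the first 2 GiB of physical memory, marked write-back.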

Windows:
    000000FE    00000508    00000000    MTRRCAP (read-only)
    00000200    00000006    00000000    MTRR_PHYSBASE0
      type 06 WB
      base 0
    00000201    80000800    000000FF    MTRR_PHYSMASK0
      valid = 1
      mask = FF80000000
    00000202    00000000    00000000    MTRR_PHYSBASE1
    00000203    00000000    00000000    MTRR_PHYSMASK1
    00000204    00000000    00000000    MTRR_PHYSBASE2
    00000205    00000000    00000000    MTRR_PHYSMASK2
    00000206    00000000    00000000    MTRR_PHYSBASE3
    00000207    00000000    00000000    MTRR_PHYSMASK3
    00000208    00000000    00000000    MTRR_PHYSBASE4
    00000209    00000000    00000000    MTRR_PHYSMASK4
    0000020A    00000000    00000000    MTRR_PHYSBASE5
    0000020B    00000000    00000000    MTRR_PHYSMASK5
    0000020C    00000000    00000000    MTRR_PHYSBASE6
    0000020D    00000000    00000000    MTRR_PHYSMASK6
    0000020E    00000000    00000000    MTRR_PHYSBASE7
    0000020F    00000000    00000000    MTRR_PHYSMASK7
    00000250    06060606    06060606    MTRR_FIX64K_00000
      00000-7FFFF  06  WB
    00000258    06060606    06060606    MTRR_FIX16K_80000
      80000-9FFFF  06  WB
    00000259    00000000    00000000    MTRR_FIX16K_A0000
      A0000-BFFFF  00  UC
    00000268    05050505    05050505    MTRR_FIX4K_C0000
      C0000-C7FFF  05  WP
    00000269    05050505    05050505    MTRR_FIX4K_C8000
      C8000-CFFFF  05  WP
    0000026A    05050505    00000000    MTRR_FIX4K_D0000
      D0000-D3FFF  05  WP
      D4000-D7FFF  00  UC
    0000026B    00000000    00000000    MTRR_FIX4K_D8000
      D8000-DFFFF  00  UC
    0000026C    00000000    05050505    MTRR_FIX4K_E0000
      E0000-E3FFF  00  UC
      E4000-E7FFF  05  WP
    0000026D    05050505    05050505    MTRR_FIX4K_E8000
      E8000-EFFFF  05  WP
    0000026E    05050505    05050505    MTRR_FIX4K_F0000
      F0000-F7FFF  05  WP
    0000026F    05050505    05050505    MTRR_FIX4K_F8000
      F8000-FFFFF  05  WP
    000002FF    00000C00    00000000    MTRRdefType
      1100 0000 0000
      E = 1     MTRR enable
      FE = 1    Fixed Range enable
      Type = 0  UC Default memory type
    00000277    00070106    00070106    CR_PAT, Page Attribute Table
      PA0 = 06  WB
      PA1 = 01  WC
      PA2 = 07  UC- (MTRR WC overrides)
      PA3 = 00  UC
      PA4 = 06  WB
      PA5 = 01  WC
      PA6 = 07  UC- (MTRR WC overrides)
      PA7 = 00  UC
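
The hand decoding in the notes above can be reproduced mechanically. This sketch assumes the dump lists the low 32 bits of each MSR first and the high 32 bits second, which matches the annotations above (for example MSR 0x26A). Each fixed-range MTRR holds eight one-byte memory types, lowest byte first. The function names are made up for illustration.

```python
MTRR_TYPES = {0x00: "UC", 0x01: "WC", 0x04: "WT", 0x05: "WP", 0x06: "WB"}


def decode_fixed(base, step, low, high):
    """Decode one fixed-range MTRR into merged (start, end, type) ranges.

    base: first address covered by the register
    step: bytes per sub-range (0x10000, 0x4000 or 0x1000)
    low, high: the two 32-bit halves, in the order the dump lists them
    """
    value = (high << 32) | low
    ranges = []
    for i in range(8):
        mem_type = MTRR_TYPES.get((value >> (8 * i)) & 0xFF, "??")
        start = base + i * step
        end = start + step - 1
        if ranges and ranges[-1][2] == mem_type:
            # Same type as the previous sub-range: extend it.
            ranges[-1] = (ranges[-1][0], end, mem_type)
        else:
            ranges.append((start, end, mem_type))
    return ranges


def decode_deftype(value):
    """Split MTRRdefType into its enable bits and default memory type."""
    return {
        "enabled": bool(value & (1 << 11)),        # E
        "fixed_enabled": bool(value & (1 << 10)),  # FE
        "default": MTRR_TYPES.get(value & 0xFF, "??"),
    }


# MTRR_FIX4K_D0000 (MSR 0x26A) as listed in the Windows dump.
for start, end, mem_type in decode_fixed(0xD0000, 0x1000, 0x05050505, 0x00000000):
    print(f"{start:05X}-{end:05X}  {mem_type}")
print(decode_deftype(0x00000C00))
```

Running the same function over the other fixed-range rows gives a quick way to re-check the comparison against the Linux dmesg output.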
Comment 26 Bjorn Helgaas 2013-02-13 17:58:21 UTC
Maybe the ACPI folks have some ideas here.
Comment 27 Zhang Rui 2013-10-14 10:39:29 UTC
Rogério Brito,
is there any chance that you can try the latest kernel to see if the problem still exists?
BTW, would you please give a detailed description about the problem you got, as I can not get this in comment #0.
Comment 28 Lan Tianyu 2013-11-25 07:40:47 UTC
Since there has been no response for more than one month, I am closing this bug as WILL_FIX_LATER. Please feel free to reopen it when you are available to test again.
