Bug 10117

Summary: MacBookPro1,1 hangs on boot (pci=nommconf helps)
Product: ACPI Reporter: Rafael J. Wysocki (rjw)
Component: Config-OtherAssignee: Venkatesh Pallipadi (venki)
Status: CLOSED CODE_FIX    
Severity: normal CC: bunk, crmafra, greg, jbarnes, kedgedev, kernel, tglx, vegard.nossum, venki
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.25-rc Subsystem:
Regression: Yes Bisected commit-id:
Bug Depends on:    
Bug Blocks: 9832    
Attachments: dmesg output after boot succeded
kernel config
cpuidle test patch
dmesg with Venkatesh's test patch
Put some printks in hpet code
dmesg -rc9 with hpet=disable
dmesg which uses hpet, hangs frequently
dmesg

Description Rafael J. Wysocki 2008-02-26 16:10:41 UTC
Subject         : 2.6.25-current-git hangs on boot
Submitter       : Soeren Sonnenburg <kernel@nn7.de>
Date            : 2008-02-23 18:55
References      : http://lkml.org/lkml/2008/2/23/263

This entry is being used for tracking a regression from 2.6.24.  Please don't
close it until the problem is fixed in the mainline.

Note: This issue seems to be independent of bug #10093.
Comment 1 Jesse Barnes 2008-03-13 10:13:08 UTC
If pci=nommconf prevents the hang, this could be the infamous PCI decode bug.  The fix for that should be upstream now though...
Comment 2 Rafael J. Wysocki 2008-03-13 12:50:06 UTC
Soeren, could you test the current mainline, please?
Comment 3 Soeren Sonnenburg 2008-03-21 08:02:22 UTC
the 2 times I just booted git 457fb605834504af294916411be128a9b21fc3f6 it came pass this point. but I would need to run it more often than that to be sure the bug is gone, but the kernel is still not stable enough to allow this
Comment 4 Jesse Barnes 2008-03-21 08:52:35 UTC
If the problem isn't fixed, it should hang at boot everytime, so hopefully that means it's fixed.  I guess you can close this out after a bit more testing.
Comment 5 Soeren Sonnenburg 2008-03-22 12:18:54 UTC
well no, it is not 100% reproducible... and indeed booting another time with current git I got it to hang again :(
Comment 6 Soeren Sonnenburg 2008-03-25 04:33:17 UTC
also in a4083c9271e0a697278e089f2c0b9a95363ada0a this bug is still present. happens about every third boot... the last outputted lines are:

[...]
ACPI: ACPI0007:01 is registered as cooling_device1
ACPI: Processor [CPU1] (supports 8 throttling states)
Comment 7 Jesse Barnes 2008-03-25 10:14:25 UTC
Ok, this must not be the bug I was thinking of then, since that one consistently hangs the box at bootup.  Can you get more details about precisely where the hang occurs?  Maybe by enabling ACPI debugging or adding some printks in those paths?
Comment 8 Soeren Sonnenburg 2008-03-25 22:43:25 UTC
I have no idea where to add these printk's ... I would need some pointer/guidance... 
Comment 9 Jesse Barnes 2008-03-27 10:26:53 UTC
Sorry about the delay, the response got buried in my inbox.  You could pass 'initcall_debug' on your boot line to get some more info.  Also enabling ACPI_DEBUG and ACPI_DEBUG_FUNC_TRACE in your kernel config might give us a clue.
Comment 10 Rafael J. Wysocki 2008-04-04 02:39:46 UTC
References : http://lkml.org/lkml/2008/4/4/41

Soeren said:

I rebooted >10 times and couldn't trigger this hang with current
mainstream anymore.... so I guess it is gone.

Closing (please reopen if the issue reappears).
Comment 11 Carlos R. Mafra 2008-04-06 15:21:49 UTC
I am sorry, but this hang still happens on my vaio with current 
git v2.6.25-rc8-139-ge315c12.

I have a picture showing the last messages here:
http://www.ift.unesp.br/users/crmafra/dsc04673.jpg

Hm, how do I reopen this bug? (I can't unselect the 
"Leave as CLOSED CODE_FIX" below)
Comment 12 Carlos R. Mafra 2008-04-07 04:27:28 UTC
Created attachment 15649 [details]
dmesg output after boot succeded
Comment 13 Carlos R. Mafra 2008-04-07 04:29:49 UTC
Created attachment 15650 [details]
kernel config 

This is the kernel config which triggers this bug.

I don't have the acpidump right now, but it is posted at the end of this
email:
http://lkml.org/lkml/2008/2/11/364
Comment 14 Rafael J. Wysocki 2008-04-09 14:06:01 UTC
The problem is present in 2.6.25-rc8-git7.

References : http://lkml.org/lkml/2008/4/9/69
Comment 15 Len Brown 2008-04-09 18:20:10 UTC
looks like the pci=nommconf thing was a false positive
looks like this still happens, but only sometimes
however, when it hangs, it is always at ACPI processor driver load time
re-assigning to ACPI/processor-power.

looks like max_cstate=2 did not make this go away
looks like max_cstate=1 may make this go away

please build with
CONFIG_THERMAL=n
CONFIG_ACPI_THERMAL=n

just because these print the last lines seen.  It may be hanging
due to C-state and this will simply delete the last two lines seen
but have no actual effect on the hang.
Comment 16 Carlos R. Mafra 2008-04-10 09:30:10 UTC
I tested 2.6.25-rc8-0139-g(something) with 
CONFIG_THERMAL=n
CONFIG_ACPI_THERMAL=n

and it hung at the 8th boot. 

However, the last lines printed when it hung  were the same:

ACPI: Processor [CPU1] (supports 8 throttling states)

Hm, -rc8-0139 is not the latest anymore...should I test the v2.6.25-rc8-229-gf4be31e? (I ask because I feel a bit unconfortable
booting my laptop too many times in a row..I don't know, maybe it
will break or something)
Comment 17 Venkatesh Pallipadi 2008-04-10 10:52:20 UTC
No need to check with latest git. There are no changes that I know of that can affect this bug
Comment 18 Venkatesh Pallipadi 2008-04-10 13:53:28 UTC
Created attachment 15727 [details]
cpuidle test patch

Carlos,

Can you please try the attached patch and report back. If it hangs, picture of last few messages will help.
Also, try using vga=6 boot option, which gives us more number of lines of text on console (may need config changes with VIDEO_SELECT enabled).
Comment 19 Carlos R. Mafra 2008-04-11 05:34:19 UTC
I tested your patch and it continues to hang.

Unfortunately I lost my digital camera so I can't take a picture, but the
last message was the same, although using vga=6 there were more messages before.
I had started to copy them by hand, but I noticed that the messages before that
last "supports 8 throttling states" are the same as they appear in the dmesg
when it boots ok. They are (copied from the dmesg)

  IO window: 4000-4fff
  MEM window: 0xfa000000-0xfbffffff
  PREFETCH window: 0x00000000f4000000-0x00000000f5ffffff
PCI: Bridge: 0000:00:1c.4
  IO window: 5000-5fff
  MEM window: 0xfc200000-0xfc2fffff
  PREFETCH window: disabled.
PCI: Bus 10, cardbus bridge: 0000:09:03.0
  IO window: 0x00006000-0x000060ff
  IO window: 0x00006400-0x000064ff
  PREFETCH window: 0x88000000-0x8bffffff
  MEM window: 0x90000000-0x93ffffff
PCI: Bridge: 0000:00:1e.0
  IO window: 6000-6fff
  MEM window: 0xfc300000-0xfc3fffff
  PREFETCH window: 0x0000000088000000-0x000000008bffffff
ACPI: PCI Interrupt 0000:00:1c.0[A] -> GSI 17 (level, low) -> IRQ 17
PCI: Setting latency timer of device 0000:00:1c.0 to 64
ACPI: PCI Interrupt 0000:00:1c.1[B] -> GSI 16 (level, low) -> IRQ 16
PCI: Setting latency timer of device 0000:00:1c.1 to 64
ACPI: PCI Interrupt 0000:00:1c.2[C] -> GSI 18 (level, low) -> IRQ 18
PCI: Setting latency timer of device 0000:00:1c.2 to 64
ACPI: PCI Interrupt 0000:00:1c.4[A] -> GSI 17 (level, low) -> IRQ 17
PCI: Setting latency timer of device 0000:00:1c.4 to 64
PCI: Setting latency timer of device 0000:00:1e.0 to 64
ACPI: PCI Interrupt 0000:09:03.0[A] -> GSI 16 (level, low) -> IRQ 16
NET: Registered protocol family 2
IP route cache hash table entries: 65536 (order: 7, 524288 bytes)
TCP established hash table entries: 262144 (order: 10, 4194304 bytes)
TCP bind hash table entries: 65536 (order: 9, 2097152 bytes)
TCP: Hash tables configured (established 262144 bind 65536)
TCP reno registered
Unpacking initramfs... done
Freeing initrd memory: 444k freed
Simple Boot Flag at 0x36 set to 0x1
io scheduler noop registered
io scheduler cfq registered (default)
pci 0000:00:02.0: Boot video device
ACPI: AC Adapter [ADP1] (on-line)
Switched to high resolution mode on CPU 1
ACPI: Battery Slot [BAT0] (battery present)
input: Lid Switch as /devices/LNXSYSTM:00/device:00/PNP0C0D:00/input/input0
Switched to high resolution mode on CPU 0
ACPI: Lid Switch [LID0]
input: Power Button (CM) as /devices/LNXSYSTM:00/device:00/PNP0C0C:00/input/input1
ACPI: Power Button (CM) [PWRB]
ACPI: SSDT 7F6D8C52, 02A1 (r1   Sony     VAIO 20070704 PTL  20050624)
ACPI: SSDT 7F6D8709, 04B7 (r1   Sony     VAIO 20070704 PTL  20050624)
Monitor-Mwait will be used to enter C-1 state
Monitor-Mwait will be used to enter C-2 state
Monitor-Mwait will be used to enter C-3 state
ACPI: CPU0 (power states: C1[C1] C2[C2] C3[C3])
ACPI: ACPI0007:00 is registered as cooling_device0
ACPI: Processor [CPU0] (supports 8 throttling states)
ACPI: SSDT 7F6D8EF3, 00E4 (r1   Sony     VAIO 20070704 PTL  20050624)
ACPI: SSDT 7F6D8BC0, 0092 (r1   Sony     VAIO 20070704 PTL  20050624)
ACPI: CPU1 (power states: C1[C1] C2[C2] C3[C3])
ACPI: ACPI0007:01 is registered as cooling_device1
ACPI: Processor [CPU1] (supports 8 throttling states)

Regarding your patch, the printks added don't appear before the hang occurs but
only later (when the boot is ok). Looking at the dmesg (with your patch) the
lines following the "hang point" are these ones (I added CONFIG_THERMAL=y
back again to not damage my laptop)

ACPI: LNXTHERM:01 is registered as thermal_zone0
ACPI: Thermal Zone [TZ00] (52 C)
ACPI: LNXTHERM:02 is registered as thermal_zone1
ACPI: Thermal Zone [TZ01] (52 C)
hpet_resources: 0xfed00000 is busy
Generic RTC Driver v1.07
Linux agpgart interface v0.103
ACPI: PCI Interrupt 0000:09:03.2[C] -> GSI 18 (level, low) -> IRQ 18
sony-laptop: Sony Notebook Control Driver v0.6.
input: Sony Vaio Keys as /devices/LNXSYSTM:00/device:00/PNP0A08:00/device:29/SNY5001:00/input/input2
input: Sony Vaio Jogdial as /devices/virtual/input/input3
sony-laptop: detected Sony Vaio FZ Series
PNP: PS/2 Controller [PNP0303:PS2K,PNP0f13:PS2M] at 0x60,0x64 irq 1,12
i8042.c: Detected active multiplexing controller, rev 1.1.
serio: i8042 KBD port at 0x60,0x64 irq 1
serio: i8042 AUX0 port at 0x60,0x64 irq 12
serio: i8042 AUX1 port at 0x60,0x64 irq 12
serio: i8042 AUX2 port at 0x60,0x64 irq 12
serio: i8042 AUX3 port at 0x60,0x64 irq 12
mice: PS/2 mouse device common for all mice
input: AT Translated Set 2 keyboard as /devices/platform/i8042/serio0/input/input4
input: PS/2 Mouse as /devices/platform/i8042/serio4/input/input5
input: AlpsPS/2 ALPS GlidePoint as /devices/platform/i8042/serio4/input/input6
cpuidle: installing idle handler
cpuidle: using governor ladder
cpuidle: installing idle handler
cpuidle: using governor menu
cpuidle: using menu governor
sdhci: Secure Digital Host Controller Interface driver

So lots of things happen between the hang point and the printks in your patch.

Then I wanted to check what the kernel is doing after the "supports 8 throttling states" line of deatch and put a lot of printks in various places in
processor_{idle, core, throttling}.c, because I didn't know where to put them.
So I almost every function in these files I put things like these:

function_X(foo){

       printk(KERN_INFO "beginning of function_X");
       ...
       printk(KERN_INFO "end of function_X");
}

It turned out that all these printks appeared before the hang, and the
last line kept being "supports 8 throttling..."

However, I could make the kernel print something after the usual last line
with

printk(KERN_INFO "Before acpi_processor_throttling_init\n"); <--added
acpi_processor_throttling_init();
printk(KERN_INFO ""After acpi_processor_throttling_init\n"\n"); <--added

in processor_core.c:1102

Then the kernel hung with these last messages:

ACPI: Processor [CPU1] (supports 8 throttling states)
Before acpi_processor_throttling_init
After acpi_processor_throttling_init

So at least we know that acpi_processor_throttling_init is executed
after the usual hang and it doesn't by itself cause the hang.

I also put a printk in the beginning of 
static int __init cpuidle_init(void) at the end of cpuidle/cpuidle.c
and found that it does not get printed when the laptop hangs.

I also noticed that with vga=6 the hang is much more frequent than with
my usual vga=0x0364. It now hangs with 80-90% chance.

I don't know if this long comment helps anything, but I wish it does!
Comment 20 Carlos R. Mafra 2008-04-11 05:41:23 UTC
Created attachment 15733 [details]
dmesg with Venkatesh's test patch

This is the dmesg with Venkatesh's patch when the boot succeeds. Note that
the printks in the test patch appear after the usual last line message which
appear when the boot hangs.
Comment 21 Venkatesh Pallipadi 2008-04-11 06:16:11 UTC
This is indeed a strange bug.

Those messages coming later means, cpuidle is not even utilized at the point of hang. Can you try one more thing, just to isolate the problem.
Try boot parameter "idle=mwait" (with any latest kernel you have - I mean with or without the last patch). Does it still hang?
Comment 22 Carlos R. Mafra 2008-04-11 10:23:09 UTC
Right, this is strange! 

After writing the previous comment I went back home and started bisecting...

So far, the earliest bad kernel after 2.6.24 is 2.6.24-4891-gaca7f96 (it hung in 
the first try). 

I also had good indications that 2.6.24-4877-gb406ac6 is good (did not hang after
19 boots).

There are a few x86 commits between these two points, but I can't say if
they are really related to the bug. 

Do you know if the problem can be there?
I am very afraid of this result, because I know Ingo boots thousands of times
his machines and it is just me having trouble to boot my laptop, and this
preliminary bisect points to x86...

But I won't have new test results until tomorrow, I am back to the institute now.
Comment 23 Venkatesh Pallipadi 2008-04-11 10:42:05 UTC
This is not particular to your laptop. I also have another bug report which is similar to this one and probably has the same root cause; bug #10377
Only reason I havent marked them duplicate it that in #10377, actual hang happens at a different point and only with the battery. But, hang symptoms are pretty similar. Also, Soeren who originally reported this bug saw exactly same probelm as you are seeing. Only that he stopped seeing it later. Probably there is some timing-ness involved in this. I tried on many laptops here to reproduce with little success.

Regarding the git bisect. None of the patches in the bisect range looks related to this problem. May be commit e3ed910 can be potentially related (though I dont think it is). I understand the problem here is not very reproducable and we cannot be sure of the actual good bisect. Thanks for all the debug effort in root causing this one....
Comment 24 Carlos R. Mafra 2008-04-11 14:18:29 UTC
Ok, thanks for the encouragement! :-)

Another observation from my notes is that the last messages were a bit different
during the git bisection marathon.

For example, with 2.6.24-6483-g3f655ef it prints more info after the 
"8 throttling states", there are also these ones: 

LNXTHERM: 01 is registered as thermal_zone
Thermal Zone [TZ00] (50 C)
Exception (thermal-0473): AE_NOT_FOUND, <7> Switched to high resolution mode on
CPU1
Invalid active threshold [0] [20070126]
Exception (thermal-0473): AE_NOT_FOUND, Invalid active threshold [1] [20070126]
LNXTHERM: 02 is registered as thermal_zone1
Thermal Zone [TZ01] (50 C)

and then it hung.

So I noticed the common "Switched to high resolution mode on CPU1" from 
the description of bug #10377
Comment 25 Carlos R. Mafra 2008-04-12 10:40:10 UTC
I was crazy in trying to bisect this elusive bug.

Today I kept booting one version to test the bisection range from comment #22.

The kernel version 2.6.24-4887-gafadcd7 was doing just fine up to the 30th boot,
but it hung at the 31st.

So that is a crazy bug, and I think everything I've said before about "it is
probably good" is unreliable. Maybe even with CPUIDLE=n it will hang after
enough boots.

I will keep using -rc9 from now on, and update this bugzilla with more info
if I have more data points.

Thanks everyone for the help!
Comment 26 Venkatesh Pallipadi 2008-04-12 10:49:50 UTC
That was what I was afraid of.

Can you try rc9 with "idle=mwait" boot option and see whether you can reproduce the bug (with ot without CPUIDLE, shouldn't matter).
Comment 27 Carlos R. Mafra 2008-04-12 10:56:44 UTC
I will try "idle=mwait" next.

I just wanted to settle things with the bisect I did, that's why I didn't even
boot -rc9 yet.

But I will report if idle=mwait helps, after testing at least 30 boots :-(
It may take some time tough.

Thanks.
Comment 28 Carlos R. Mafra 2008-04-13 07:11:11 UTC
I have more observations, and I think I may have nailed it down.

Vanilla -rc9 kernel fails very frequently (~90%) if I use vga=6, and ~10% with
vga=0x0364 (strange and I don't know why, but it is happening).

Vanilla -rc9 with "idle=mwait" hung at 4th boot.

I noticed that booting -rc9 with hpet=disable makes a lot of difference, 42 
boots so far and no hang. Contrasted with the very high hang count of -rc9 
that could mean something.

So I tried to check what are the differences in the dmesg in the 
hpet=disable case and without disabling hpet.

The message "Switched to high resolution mode on CPU1" appears before
the hang point if I don't use hpet=disable, and it appears much later
in the boot log if I use hpet=disable.

I also wrote a test patch to put printks in hpet-relevant code, which I will
attach in this bugzilla. 

This is from the dmesg of a bad kernel which could finish the boot process 
(using vga=0x0364 to increase the chances) and _without_ using the 
hpet=disable option, and with my printks-patch:

************* Using hpet **************************
ACPI: Power Button (CM) [PWRB]
ACPI: SSDT 7F6D8C52, 02A1 (r1   Sony     VAIO 20070704 PTL  20050624)
ACPI: SSDT 7F6D8709, 04B7 (r1   Sony     VAIO 20070704 PTL  20050624)
hpet_legacy_set_mode
hpet_legacy_set_mode: oneshot
Switched to high resolution mode on CPU 1
Switched to high resolution mode on CPU 0
Monitor-Mwait will be used to enter C-1 state
Monitor-Mwait will be used to enter C-2 state
Monitor-Mwait will be used to enter C-3 state
ACPI: CPU0 (power states: C1[C1] C2[C2] C3[C3])
ACPI: ACPI0007:00 is registered as cooling_device0
ACPI: Processor [CPU0] (supports 8 throttling states)
ACPI: SSDT 7F6D8EF3, 00E4 (r1   Sony     VAIO 20070704 PTL  20050624)
ACPI: SSDT 7F6D8BC0, 0092 (r1   Sony     VAIO 20070704 PTL  20050624)
ACPI: CPU1 (power states: C1[C1] C2[C2] C3[C3])
ACPI: ACPI0007:01 is registered as cooling_device1
ACPI: Processor [CPU1] (supports 8 throttling states)
ACPI: LNXTHERM:01 is registered as thermal_zone0
ACPI: Thermal Zone [TZ00] (55 C)
ACPI: LNXTHERM:02 is registered as thermal_zone1
ACPI: Thermal Zone [TZ01] (55 C)
hpet_init
hpet_acpi_add
hpet_resources
hpet_resources: 0xfed00000 is busy
*****************************************

And this is using hpet=disable

********* hpet=disable **************
ACPI: Power Button (CM) [PWRB]
ACPI: SSDT 7F6D8C52, 02A1 (r1   Sony     VAIO 20070704 PTL  20050624)
ACPI: SSDT 7F6D8709, 04B7 (r1   Sony     VAIO 20070704 PTL  20050624)
Monitor-Mwait will be used to enter C-1 state
Monitor-Mwait will be used to enter C-2 state
Monitor-Mwait will be used to enter C-3 state
ACPI: CPU0 (power states: C1[C1] C2[C2] C3[C3])
ACPI: ACPI0007:00 is registered as cooling_device0
ACPI: Processor [CPU0] (supports 8 throttling states)
ACPI: SSDT 7F6D8EF3, 00E4 (r1   Sony     VAIO 20070704 PTL  20050624)
ACPI: SSDT 7F6D8BC0, 0092 (r1   Sony     VAIO 20070704 PTL  20050624)
ACPI: CPU1 (power states: C1[C1] C2[C2] C3[C3])
ACPI: ACPI0007:01 is registered as cooling_device1
ACPI: Processor [CPU1] (supports 8 throttling states)
ACPI: LNXTHERM:01 is registered as thermal_zone0
ACPI: Thermal Zone [TZ00] (54 C)
ACPI: LNXTHERM:02 is registered as thermal_zone1
ACPI: Thermal Zone [TZ01] (54 C)
hpet_init
hpet_acpi_add
hpet_resources
hpet_resources
hpet_acpi_add: no address or irqs in _CRS
Generic RTC Driver v1.07
Linux agpgart interface v0.103
ACPI: PCI Interrupt 0000:09:03.2[C] -> GSI 18 (level, low) -> IRQ 18
sony-laptop: Sony Notebook Control Driver v0.6.
input: Sony Vaio Keys as /devices/LNXSYSTM:00/device:00/PNP0A08:00/device:29/SNY5001:00/input/input2
input: Sony Vaio Jogdial as /devices/virtual/input/input3
sony-laptop: detected Sony Vaio FZ Series
PNP: PS/2 Controller [PNP0303:PS2K,PNP0f13:PS2M] at 0x60,0x64 irq 1,12
i8042.c: Detected active multiplexing controller, rev 1.1.
serio: i8042 KBD port at 0x60,0x64 irq 1
serio: i8042 AUX0 port at 0x60,0x64 irq 12
serio: i8042 AUX1 port at 0x60,0x64 irq 12
serio: i8042 AUX2 port at 0x60,0x64 irq 12
serio: i8042 AUX3 port at 0x60,0x64 irq 12
mice: PS/2 mouse device common for all mice
input: AT Translated Set 2 keyboard as /devices/platform/i8042/serio0/input/input4
cpuidle: using governor ladder
cpuidle: using governor menu
sdhci: Secure Digital Host Controller Interface driver
sdhci: Copyright(c) Pierre Ossman
ricoh-mmc: Ricoh MMC Controller disabling driver
ricoh-mmc: Copyright(c) Philip Langdale
Advanced Linux Sound Architecture Driver Version 1.0.16rc2 (Thu Jan 31 16:40:16 2008 UTC).
ALSA device list:
  No soundcards found.
TCP cubic registered
NET: Registered protocol family 1
NET: Registered protocol family 17
NET: Registered protocol family 15
 evxface-0154 [00] install_fixed_event_ha: Enabled fixed event 4, Handler=ffffffff80368345
BIOS EDD facility v0.16 2004-Jun-25, 6 devices found
Freeing unused kernel memory: 236k freed
SCSI subsystem initialized
No dock devices found.
libata version 3.00 loaded.
ahci 0000:00:1f.2: version 3.0
ACPI: PCI Interrupt 0000:00:1f.2[B] -> GSI 19 (level, low) -> IRQ 19
Switched to high resolution mode on CPU 1
Switched to high resolution mode on CPU 0  <----|
                                                |
                                                |
                                           Notice how far from the
                                           usual hang point!
                                               
Do these observations help? I know that in the similar
bug #10377, this "Switched to high resolution mode" line was the last
one...

I will also attach both dmesg logs, using hpet=disable and otherwise, just
in case anyone wants to check them.

Heh, I've already learnt a lot of things while looking at the code to put
the printks, but I am afraid I can't go much further now.

Thanks!
Comment 29 Carlos R. Mafra 2008-04-13 07:16:00 UTC
Created attachment 15741 [details]
Put some printks in hpet code

I wanted to check what functions related to hpet were executed before the
hang point, so I put a lot of printks in various places.
Comment 30 Carlos R. Mafra 2008-04-13 07:21:45 UTC
Created attachment 15742 [details]
dmesg -rc9 with hpet=disable

Using hpet=disable this -rc9 kernel survived 42 boots so far.

Notice that "Switched to high resolution mode on CPU{0,1}" appears much later.

I can't guarantee it won't fail at the 43rd boot, tough. But without using
hpet=disable the vanilla -rc9 hangs with ~90% chance using vga=6
Comment 31 Carlos R. Mafra 2008-04-13 07:24:44 UTC
Created attachment 15743 [details]
dmesg which uses hpet, hangs frequently

This is not using hpet=disable.

It could finish booting because I wrote vga=0x0364, otherwise it fails
very often.
Comment 32 Venkatesh Pallipadi 2008-04-14 18:34:05 UTC
*** Bug 10377 has been marked as a duplicate of this bug. ***
Comment 33 Mark Lord 2008-04-15 06:21:19 UTC
This bug is probably the very same issue I have noticed here since 2.6.21 or so, when the high-resolution timer rework and NO_HZ entered the kernel.

It comes and goes on various machines here, never easily reproducable, and different kernel .configs make it go away, and return, at random.

When the lockups occur, it is *always* around the point where the "Switched to high resolution mode on CPU X" messages.  Always.

For me, using CONFIG_HIGH_RES_TIMERS=n always makes it go away.

Since that time, the same bug has been variously reported by others.

On 14/12/07, Arun Thomas discovered that if he waited 8+ minutes (for some timer to wrap), the kernel would then continue initializing.  In response, Thomas Gleixner had this to say:

"The problem is caused by an SMI during the calibration routine. We
really need to come up with a solid solution which does not rely on
the periodic timer coming in, when there is something else (HPET,
pm_timer) available."

But I don't know if it ever went any further than that.

It would be good to see someone knowledgeable in that area finally dig around and fix this issue, as it makes reliable unattended/remote boot somewhat dicey.

My own bug 10391 is probably a duplicate of this one.
Comment 34 Mark Lord 2008-04-15 06:27:30 UTC
Arun Thomas's thread: http://lkml.org/lkml/2007/12/14/60
Comment 35 Mark Lord 2008-04-15 06:33:14 UTC
Original bug report thread (2.6.21) from April-2007:
http://lkml.org/lkml/2007/4/30/464
Comment 36 Rafael J. Wysocki 2008-04-15 13:30:41 UTC
Dropped from the list of recent regressions.
Comment 37 Rafael J. Wysocki 2008-04-15 14:00:35 UTC
Sorry, I didn't notice that bug #10377 was closed as a duplicate of this one.

It looks like we started to track a couple of different things in this entry.
Comment 38 Adrian Bunk 2008-04-15 14:32:44 UTC
This bug should only be used for Soerens originally reported problem.

Please open separate bugs for the other issues discussed here (and copy the information from here to these bugs).
Comment 39 Adrian Bunk 2008-04-15 14:39:23 UTC
Soeren said:
I couldn't reproduce this one any longer...
Comment 40 Carlos R. Mafra 2008-04-15 14:46:17 UTC
Adrian, as far as I can tell Soeren's original bug is the same one which hits
me.

Please take a look at the References: http://lkml.org/lkml/2008/2/23/263
in the bug description, and at comment #6 from Soeren himself.

Then take a look at the picture in comment #11, from myself. The problem
is the same.

The fact that he does not see the problem anymore doesn't mean that this
bug was fixed.
Comment 41 Adrian Bunk 2008-04-15 15:00:01 UTC
(In reply to comment #40)
> Adrian, as far as I can tell Soeren's original bug is the same one which hits
> me.

I've seen more than once that different bugs had exactly the same symptoms.

I do not know whether you run into the same bug as Soeren, but it is better if you report your issue in an own bug.

We have seen in this bug Mark saying it was already present in older kernels although what Soeren reported was not present in 2.6.24 - that's exactly what can happen when different people discuss issues that might or might not be the same bug in one bug entry.
Comment 42 Mark Lord 2008-04-15 15:49:40 UTC
The nature of this bug, strongly suggests that it is impossible to know if it was "not present in 2.6.24".  Simply building the kernel with a one-line change in .config is enough to make it come or go.

Having all of the information collected together just might make it more likely that somebody familiar with the hirez timer code might actually recognize the problem.

Whatever.
Comment 43 Venkatesh Pallipadi 2008-04-15 15:58:57 UTC
My take would be to keep all the info at one place, as long as bug "looks" similar. We can split it out later if we think they are unrelated.

Specifically in this case, "hpet=disable" workaround in one bug helped the reporter on the other bug. There is no way for that to happen if all reports are in different bug number. It also helps anyone trying to analyze this thing to have all info that looks relevant at one place.
Comment 44 Rafael J. Wysocki 2008-04-15 16:15:03 UTC
I kind of agree, but the fact that the original problem is not reproducible any more by the original reporter strongly suggests that the other reports in this bug entry need not be directly related to it.

Thus, it seems reasonable to use a different bug entry for tracking the other cases while keeping this one closed.  Otherwise we sort of lose the information that the original reporter is unable to reproduce the problem any more.
Comment 45 Adrian Bunk 2008-04-15 17:04:07 UTC
(In reply to comment #43)
> My take would be to keep all the info at one place, as long as bug "looks"
> similar. We can split it out later if we think they are unrelated.

We have seen in this bug that people were talking about different issues when deciding e.g. whether it is on 32bit or on 64bit, a 2.6.25-rc regression, whether it is still present in 2.6.25-rc9, or whether bug #10377 is a duplicate.

> Specifically in this case, "hpet=disable" workaround in one bug helped the
> reporter on the other bug. There is no way for that to happen if all reports
> are in different bug number. It also helps anyone trying to analyze this
> thing
> to have all info that looks relevant at one place.

This can happen if you put notes in the bugs that one bug might be the same issue like this or that other one.
Comment 46 Vegard Nossum 2008-05-11 02:46:37 UTC
Hi, Did anybody try running the watchdog to see where this hangs? It should be done with the parameter nmi_watchdog=1. See Documentation/nmi_watchdog.txt for more information. -Vegard
Comment 47 Carlos R. Mafra 2008-05-11 13:07:36 UTC
Hi Vegard,

I tried with nmi_watchdog=1 in a kernel which used to hang frequently 
(2.6.25-rc9, one out of three), but then it did not hang after 12 boots. I will
try more times and let you know if it hangs with this boot option.

However I could get some traces by other means:

http://lkml.org/lkml/2008/5/11/138
 
Comment 48 De Ganseman Amaury 2008-05-21 05:48:52 UTC
I have the same problem here using macbook and a 2.6.25.4 
Comment 49 De Ganseman Amaury 2008-05-21 05:49:40 UTC
Created attachment 16226 [details]
dmesg

dmesg