Subject : 2.6.25-current-git hangs on boot Submitter : Soeren Sonnenburg <kernel@nn7.de> Date : 2008-02-23 18:55 References : http://lkml.org/lkml/2008/2/23/263 This entry is being used for tracking a regression from 2.6.24. Please don't close it until the problem is fixed in the mainline. Note: This issue seems to be independent of bug #10093.
If pci=nommconf prevents the hang, this could be the infamous PCI decode bug. The fix for that should be upstream now though...
Soeren, could you test the current mainline, please?
the 2 times I just booted git 457fb605834504af294916411be128a9b21fc3f6 it came pass this point. but I would need to run it more often than that to be sure the bug is gone, but the kernel is still not stable enough to allow this
If the problem isn't fixed, it should hang at boot everytime, so hopefully that means it's fixed. I guess you can close this out after a bit more testing.
well no, it is not 100% reproducible... and indeed booting another time with current git I got it to hang again :(
also in a4083c9271e0a697278e089f2c0b9a95363ada0a this bug is still present. happens about every third boot... the last outputted lines are: [...] ACPI: ACPI0007:01 is registered as cooling_device1 ACPI: Processor [CPU1] (supports 8 throttling states)
Ok, this must not be the bug I was thinking of then, since that one consistently hangs the box at bootup. Can you get more details about precisely where the hang occurs? Maybe by enabling ACPI debugging or adding some printks in those paths?
I have no idea where to add these printk's ... I would need some pointer/guidance...
Sorry about the delay, the response got buried in my inbox. You could pass 'initcall_debug' on your boot line to get some more info. Also enabling ACPI_DEBUG and ACPI_DEBUG_FUNC_TRACE in your kernel config might give us a clue.
References : http://lkml.org/lkml/2008/4/4/41 Soeren said: I rebooted >10 times and couldn't trigger this hang with current mainstream anymore.... so I guess it is gone. Closing (please reopen if the issue reappears).
I am sorry, but this hang still happens on my vaio with current git v2.6.25-rc8-139-ge315c12. I have a picture showing the last messages here: http://www.ift.unesp.br/users/crmafra/dsc04673.jpg Hm, how do I reopen this bug? (I can't unselect the "Leave as CLOSED CODE_FIX" below)
Created attachment 15649 [details] dmesg output after boot succeded
Created attachment 15650 [details] kernel config This is the kernel config which triggers this bug. I don't have the acpidump right now, but it is posted at the end of this email: http://lkml.org/lkml/2008/2/11/364
The problem is present in 2.6.25-rc8-git7. References : http://lkml.org/lkml/2008/4/9/69
looks like the pci=nommconf thing was a false positive looks like this still happens, but only sometimes however, when it hangs, it is always at ACPI processor driver load time re-assigning to ACPI/processor-power. looks like max_cstate=2 did not make this go away looks like max_cstate=1 may make this go away please build with CONFIG_THERMAL=n CONFIG_ACPI_THERMAL=n just because these print the last lines seen. It may be hanging due to C-state and this will simply delete the last two lines seen but have no actual effect on the hang.
I tested 2.6.25-rc8-0139-g(something) with CONFIG_THERMAL=n CONFIG_ACPI_THERMAL=n and it hung at the 8th boot. However, the last lines printed when it hung were the same: ACPI: Processor [CPU1] (supports 8 throttling states) Hm, -rc8-0139 is not the latest anymore...should I test the v2.6.25-rc8-229-gf4be31e? (I ask because I feel a bit unconfortable booting my laptop too many times in a row..I don't know, maybe it will break or something)
No need to check with latest git. There are no changes that I know of that can affect this bug
Created attachment 15727 [details] cpuidle test patch Carlos, Can you please try the attached patch and report back. If it hangs, picture of last few messages will help. Also, try using vga=6 boot option, which gives us more number of lines of text on console (may need config changes with VIDEO_SELECT enabled).
I tested your patch and it continues to hang. Unfortunately I lost my digital camera so I can't take a picture, but the last message was the same, although using vga=6 there were more messages before. I had started to copy them by hand, but I noticed that the messages before that last "supports 8 throttling states" are the same as they appear in the dmesg when it boots ok. They are (copied from the dmesg) IO window: 4000-4fff MEM window: 0xfa000000-0xfbffffff PREFETCH window: 0x00000000f4000000-0x00000000f5ffffff PCI: Bridge: 0000:00:1c.4 IO window: 5000-5fff MEM window: 0xfc200000-0xfc2fffff PREFETCH window: disabled. PCI: Bus 10, cardbus bridge: 0000:09:03.0 IO window: 0x00006000-0x000060ff IO window: 0x00006400-0x000064ff PREFETCH window: 0x88000000-0x8bffffff MEM window: 0x90000000-0x93ffffff PCI: Bridge: 0000:00:1e.0 IO window: 6000-6fff MEM window: 0xfc300000-0xfc3fffff PREFETCH window: 0x0000000088000000-0x000000008bffffff ACPI: PCI Interrupt 0000:00:1c.0[A] -> GSI 17 (level, low) -> IRQ 17 PCI: Setting latency timer of device 0000:00:1c.0 to 64 ACPI: PCI Interrupt 0000:00:1c.1[B] -> GSI 16 (level, low) -> IRQ 16 PCI: Setting latency timer of device 0000:00:1c.1 to 64 ACPI: PCI Interrupt 0000:00:1c.2[C] -> GSI 18 (level, low) -> IRQ 18 PCI: Setting latency timer of device 0000:00:1c.2 to 64 ACPI: PCI Interrupt 0000:00:1c.4[A] -> GSI 17 (level, low) -> IRQ 17 PCI: Setting latency timer of device 0000:00:1c.4 to 64 PCI: Setting latency timer of device 0000:00:1e.0 to 64 ACPI: PCI Interrupt 0000:09:03.0[A] -> GSI 16 (level, low) -> IRQ 16 NET: Registered protocol family 2 IP route cache hash table entries: 65536 (order: 7, 524288 bytes) TCP established hash table entries: 262144 (order: 10, 4194304 bytes) TCP bind hash table entries: 65536 (order: 9, 2097152 bytes) TCP: Hash tables configured (established 262144 bind 65536) TCP reno registered Unpacking initramfs... done Freeing initrd memory: 444k freed Simple Boot Flag at 0x36 set to 0x1 io scheduler noop registered io scheduler cfq registered (default) pci 0000:00:02.0: Boot video device ACPI: AC Adapter [ADP1] (on-line) Switched to high resolution mode on CPU 1 ACPI: Battery Slot [BAT0] (battery present) input: Lid Switch as /devices/LNXSYSTM:00/device:00/PNP0C0D:00/input/input0 Switched to high resolution mode on CPU 0 ACPI: Lid Switch [LID0] input: Power Button (CM) as /devices/LNXSYSTM:00/device:00/PNP0C0C:00/input/input1 ACPI: Power Button (CM) [PWRB] ACPI: SSDT 7F6D8C52, 02A1 (r1 Sony VAIO 20070704 PTL 20050624) ACPI: SSDT 7F6D8709, 04B7 (r1 Sony VAIO 20070704 PTL 20050624) Monitor-Mwait will be used to enter C-1 state Monitor-Mwait will be used to enter C-2 state Monitor-Mwait will be used to enter C-3 state ACPI: CPU0 (power states: C1[C1] C2[C2] C3[C3]) ACPI: ACPI0007:00 is registered as cooling_device0 ACPI: Processor [CPU0] (supports 8 throttling states) ACPI: SSDT 7F6D8EF3, 00E4 (r1 Sony VAIO 20070704 PTL 20050624) ACPI: SSDT 7F6D8BC0, 0092 (r1 Sony VAIO 20070704 PTL 20050624) ACPI: CPU1 (power states: C1[C1] C2[C2] C3[C3]) ACPI: ACPI0007:01 is registered as cooling_device1 ACPI: Processor [CPU1] (supports 8 throttling states) Regarding your patch, the printks added don't appear before the hang occurs but only later (when the boot is ok). Looking at the dmesg (with your patch) the lines following the "hang point" are these ones (I added CONFIG_THERMAL=y back again to not damage my laptop) ACPI: LNXTHERM:01 is registered as thermal_zone0 ACPI: Thermal Zone [TZ00] (52 C) ACPI: LNXTHERM:02 is registered as thermal_zone1 ACPI: Thermal Zone [TZ01] (52 C) hpet_resources: 0xfed00000 is busy Generic RTC Driver v1.07 Linux agpgart interface v0.103 ACPI: PCI Interrupt 0000:09:03.2[C] -> GSI 18 (level, low) -> IRQ 18 sony-laptop: Sony Notebook Control Driver v0.6. input: Sony Vaio Keys as /devices/LNXSYSTM:00/device:00/PNP0A08:00/device:29/SNY5001:00/input/input2 input: Sony Vaio Jogdial as /devices/virtual/input/input3 sony-laptop: detected Sony Vaio FZ Series PNP: PS/2 Controller [PNP0303:PS2K,PNP0f13:PS2M] at 0x60,0x64 irq 1,12 i8042.c: Detected active multiplexing controller, rev 1.1. serio: i8042 KBD port at 0x60,0x64 irq 1 serio: i8042 AUX0 port at 0x60,0x64 irq 12 serio: i8042 AUX1 port at 0x60,0x64 irq 12 serio: i8042 AUX2 port at 0x60,0x64 irq 12 serio: i8042 AUX3 port at 0x60,0x64 irq 12 mice: PS/2 mouse device common for all mice input: AT Translated Set 2 keyboard as /devices/platform/i8042/serio0/input/input4 input: PS/2 Mouse as /devices/platform/i8042/serio4/input/input5 input: AlpsPS/2 ALPS GlidePoint as /devices/platform/i8042/serio4/input/input6 cpuidle: installing idle handler cpuidle: using governor ladder cpuidle: installing idle handler cpuidle: using governor menu cpuidle: using menu governor sdhci: Secure Digital Host Controller Interface driver So lots of things happen between the hang point and the printks in your patch. Then I wanted to check what the kernel is doing after the "supports 8 throttling states" line of deatch and put a lot of printks in various places in processor_{idle, core, throttling}.c, because I didn't know where to put them. So I almost every function in these files I put things like these: function_X(foo){ printk(KERN_INFO "beginning of function_X"); ... printk(KERN_INFO "end of function_X"); } It turned out that all these printks appeared before the hang, and the last line kept being "supports 8 throttling..." However, I could make the kernel print something after the usual last line with printk(KERN_INFO "Before acpi_processor_throttling_init\n"); <--added acpi_processor_throttling_init(); printk(KERN_INFO ""After acpi_processor_throttling_init\n"\n"); <--added in processor_core.c:1102 Then the kernel hung with these last messages: ACPI: Processor [CPU1] (supports 8 throttling states) Before acpi_processor_throttling_init After acpi_processor_throttling_init So at least we know that acpi_processor_throttling_init is executed after the usual hang and it doesn't by itself cause the hang. I also put a printk in the beginning of static int __init cpuidle_init(void) at the end of cpuidle/cpuidle.c and found that it does not get printed when the laptop hangs. I also noticed that with vga=6 the hang is much more frequent than with my usual vga=0x0364. It now hangs with 80-90% chance. I don't know if this long comment helps anything, but I wish it does!
Created attachment 15733 [details] dmesg with Venkatesh's test patch This is the dmesg with Venkatesh's patch when the boot succeeds. Note that the printks in the test patch appear after the usual last line message which appear when the boot hangs.
This is indeed a strange bug. Those messages coming later means, cpuidle is not even utilized at the point of hang. Can you try one more thing, just to isolate the problem. Try boot parameter "idle=mwait" (with any latest kernel you have - I mean with or without the last patch). Does it still hang?
Right, this is strange! After writing the previous comment I went back home and started bisecting... So far, the earliest bad kernel after 2.6.24 is 2.6.24-4891-gaca7f96 (it hung in the first try). I also had good indications that 2.6.24-4877-gb406ac6 is good (did not hang after 19 boots). There are a few x86 commits between these two points, but I can't say if they are really related to the bug. Do you know if the problem can be there? I am very afraid of this result, because I know Ingo boots thousands of times his machines and it is just me having trouble to boot my laptop, and this preliminary bisect points to x86... But I won't have new test results until tomorrow, I am back to the institute now.
This is not particular to your laptop. I also have another bug report which is similar to this one and probably has the same root cause; bug #10377 Only reason I havent marked them duplicate it that in #10377, actual hang happens at a different point and only with the battery. But, hang symptoms are pretty similar. Also, Soeren who originally reported this bug saw exactly same probelm as you are seeing. Only that he stopped seeing it later. Probably there is some timing-ness involved in this. I tried on many laptops here to reproduce with little success. Regarding the git bisect. None of the patches in the bisect range looks related to this problem. May be commit e3ed910 can be potentially related (though I dont think it is). I understand the problem here is not very reproducable and we cannot be sure of the actual good bisect. Thanks for all the debug effort in root causing this one....
Ok, thanks for the encouragement! :-) Another observation from my notes is that the last messages were a bit different during the git bisection marathon. For example, with 2.6.24-6483-g3f655ef it prints more info after the "8 throttling states", there are also these ones: LNXTHERM: 01 is registered as thermal_zone Thermal Zone [TZ00] (50 C) Exception (thermal-0473): AE_NOT_FOUND, <7> Switched to high resolution mode on CPU1 Invalid active threshold [0] [20070126] Exception (thermal-0473): AE_NOT_FOUND, Invalid active threshold [1] [20070126] LNXTHERM: 02 is registered as thermal_zone1 Thermal Zone [TZ01] (50 C) and then it hung. So I noticed the common "Switched to high resolution mode on CPU1" from the description of bug #10377
I was crazy in trying to bisect this elusive bug. Today I kept booting one version to test the bisection range from comment #22. The kernel version 2.6.24-4887-gafadcd7 was doing just fine up to the 30th boot, but it hung at the 31st. So that is a crazy bug, and I think everything I've said before about "it is probably good" is unreliable. Maybe even with CPUIDLE=n it will hang after enough boots. I will keep using -rc9 from now on, and update this bugzilla with more info if I have more data points. Thanks everyone for the help!
That was what I was afraid of. Can you try rc9 with "idle=mwait" boot option and see whether you can reproduce the bug (with ot without CPUIDLE, shouldn't matter).
I will try "idle=mwait" next. I just wanted to settle things with the bisect I did, that's why I didn't even boot -rc9 yet. But I will report if idle=mwait helps, after testing at least 30 boots :-( It may take some time tough. Thanks.
I have more observations, and I think I may have nailed it down. Vanilla -rc9 kernel fails very frequently (~90%) if I use vga=6, and ~10% with vga=0x0364 (strange and I don't know why, but it is happening). Vanilla -rc9 with "idle=mwait" hung at 4th boot. I noticed that booting -rc9 with hpet=disable makes a lot of difference, 42 boots so far and no hang. Contrasted with the very high hang count of -rc9 that could mean something. So I tried to check what are the differences in the dmesg in the hpet=disable case and without disabling hpet. The message "Switched to high resolution mode on CPU1" appears before the hang point if I don't use hpet=disable, and it appears much later in the boot log if I use hpet=disable. I also wrote a test patch to put printks in hpet-relevant code, which I will attach in this bugzilla. This is from the dmesg of a bad kernel which could finish the boot process (using vga=0x0364 to increase the chances) and _without_ using the hpet=disable option, and with my printks-patch: ************* Using hpet ************************** ACPI: Power Button (CM) [PWRB] ACPI: SSDT 7F6D8C52, 02A1 (r1 Sony VAIO 20070704 PTL 20050624) ACPI: SSDT 7F6D8709, 04B7 (r1 Sony VAIO 20070704 PTL 20050624) hpet_legacy_set_mode hpet_legacy_set_mode: oneshot Switched to high resolution mode on CPU 1 Switched to high resolution mode on CPU 0 Monitor-Mwait will be used to enter C-1 state Monitor-Mwait will be used to enter C-2 state Monitor-Mwait will be used to enter C-3 state ACPI: CPU0 (power states: C1[C1] C2[C2] C3[C3]) ACPI: ACPI0007:00 is registered as cooling_device0 ACPI: Processor [CPU0] (supports 8 throttling states) ACPI: SSDT 7F6D8EF3, 00E4 (r1 Sony VAIO 20070704 PTL 20050624) ACPI: SSDT 7F6D8BC0, 0092 (r1 Sony VAIO 20070704 PTL 20050624) ACPI: CPU1 (power states: C1[C1] C2[C2] C3[C3]) ACPI: ACPI0007:01 is registered as cooling_device1 ACPI: Processor [CPU1] (supports 8 throttling states) ACPI: LNXTHERM:01 is registered as thermal_zone0 ACPI: Thermal Zone [TZ00] (55 C) ACPI: LNXTHERM:02 is registered as thermal_zone1 ACPI: Thermal Zone [TZ01] (55 C) hpet_init hpet_acpi_add hpet_resources hpet_resources: 0xfed00000 is busy ***************************************** And this is using hpet=disable ********* hpet=disable ************** ACPI: Power Button (CM) [PWRB] ACPI: SSDT 7F6D8C52, 02A1 (r1 Sony VAIO 20070704 PTL 20050624) ACPI: SSDT 7F6D8709, 04B7 (r1 Sony VAIO 20070704 PTL 20050624) Monitor-Mwait will be used to enter C-1 state Monitor-Mwait will be used to enter C-2 state Monitor-Mwait will be used to enter C-3 state ACPI: CPU0 (power states: C1[C1] C2[C2] C3[C3]) ACPI: ACPI0007:00 is registered as cooling_device0 ACPI: Processor [CPU0] (supports 8 throttling states) ACPI: SSDT 7F6D8EF3, 00E4 (r1 Sony VAIO 20070704 PTL 20050624) ACPI: SSDT 7F6D8BC0, 0092 (r1 Sony VAIO 20070704 PTL 20050624) ACPI: CPU1 (power states: C1[C1] C2[C2] C3[C3]) ACPI: ACPI0007:01 is registered as cooling_device1 ACPI: Processor [CPU1] (supports 8 throttling states) ACPI: LNXTHERM:01 is registered as thermal_zone0 ACPI: Thermal Zone [TZ00] (54 C) ACPI: LNXTHERM:02 is registered as thermal_zone1 ACPI: Thermal Zone [TZ01] (54 C) hpet_init hpet_acpi_add hpet_resources hpet_resources hpet_acpi_add: no address or irqs in _CRS Generic RTC Driver v1.07 Linux agpgart interface v0.103 ACPI: PCI Interrupt 0000:09:03.2[C] -> GSI 18 (level, low) -> IRQ 18 sony-laptop: Sony Notebook Control Driver v0.6. input: Sony Vaio Keys as /devices/LNXSYSTM:00/device:00/PNP0A08:00/device:29/SNY5001:00/input/input2 input: Sony Vaio Jogdial as /devices/virtual/input/input3 sony-laptop: detected Sony Vaio FZ Series PNP: PS/2 Controller [PNP0303:PS2K,PNP0f13:PS2M] at 0x60,0x64 irq 1,12 i8042.c: Detected active multiplexing controller, rev 1.1. serio: i8042 KBD port at 0x60,0x64 irq 1 serio: i8042 AUX0 port at 0x60,0x64 irq 12 serio: i8042 AUX1 port at 0x60,0x64 irq 12 serio: i8042 AUX2 port at 0x60,0x64 irq 12 serio: i8042 AUX3 port at 0x60,0x64 irq 12 mice: PS/2 mouse device common for all mice input: AT Translated Set 2 keyboard as /devices/platform/i8042/serio0/input/input4 cpuidle: using governor ladder cpuidle: using governor menu sdhci: Secure Digital Host Controller Interface driver sdhci: Copyright(c) Pierre Ossman ricoh-mmc: Ricoh MMC Controller disabling driver ricoh-mmc: Copyright(c) Philip Langdale Advanced Linux Sound Architecture Driver Version 1.0.16rc2 (Thu Jan 31 16:40:16 2008 UTC). ALSA device list: No soundcards found. TCP cubic registered NET: Registered protocol family 1 NET: Registered protocol family 17 NET: Registered protocol family 15 evxface-0154 [00] install_fixed_event_ha: Enabled fixed event 4, Handler=ffffffff80368345 BIOS EDD facility v0.16 2004-Jun-25, 6 devices found Freeing unused kernel memory: 236k freed SCSI subsystem initialized No dock devices found. libata version 3.00 loaded. ahci 0000:00:1f.2: version 3.0 ACPI: PCI Interrupt 0000:00:1f.2[B] -> GSI 19 (level, low) -> IRQ 19 Switched to high resolution mode on CPU 1 Switched to high resolution mode on CPU 0 <----| | | Notice how far from the usual hang point! Do these observations help? I know that in the similar bug #10377, this "Switched to high resolution mode" line was the last one... I will also attach both dmesg logs, using hpet=disable and otherwise, just in case anyone wants to check them. Heh, I've already learnt a lot of things while looking at the code to put the printks, but I am afraid I can't go much further now. Thanks!
Created attachment 15741 [details] Put some printks in hpet code I wanted to check what functions related to hpet were executed before the hang point, so I put a lot of printks in various places.
Created attachment 15742 [details] dmesg -rc9 with hpet=disable Using hpet=disable this -rc9 kernel survived 42 boots so far. Notice that "Switched to high resolution mode on CPU{0,1}" appears much later. I can't guarantee it won't fail at the 43rd boot, tough. But without using hpet=disable the vanilla -rc9 hangs with ~90% chance using vga=6
Created attachment 15743 [details] dmesg which uses hpet, hangs frequently This is not using hpet=disable. It could finish booting because I wrote vga=0x0364, otherwise it fails very often.
*** Bug 10377 has been marked as a duplicate of this bug. ***
This bug is probably the very same issue I have noticed here since 2.6.21 or so, when the high-resolution timer rework and NO_HZ entered the kernel. It comes and goes on various machines here, never easily reproducable, and different kernel .configs make it go away, and return, at random. When the lockups occur, it is *always* around the point where the "Switched to high resolution mode on CPU X" messages. Always. For me, using CONFIG_HIGH_RES_TIMERS=n always makes it go away. Since that time, the same bug has been variously reported by others. On 14/12/07, Arun Thomas discovered that if he waited 8+ minutes (for some timer to wrap), the kernel would then continue initializing. In response, Thomas Gleixner had this to say: "The problem is caused by an SMI during the calibration routine. We really need to come up with a solid solution which does not rely on the periodic timer coming in, when there is something else (HPET, pm_timer) available." But I don't know if it ever went any further than that. It would be good to see someone knowledgeable in that area finally dig around and fix this issue, as it makes reliable unattended/remote boot somewhat dicey. My own bug 10391 is probably a duplicate of this one.
Arun Thomas's thread: http://lkml.org/lkml/2007/12/14/60
Original bug report thread (2.6.21) from April-2007: http://lkml.org/lkml/2007/4/30/464
Dropped from the list of recent regressions.
Sorry, I didn't notice that bug #10377 was closed as a duplicate of this one. It looks like we started to track a couple of different things in this entry.
This bug should only be used for Soerens originally reported problem. Please open separate bugs for the other issues discussed here (and copy the information from here to these bugs).
Soeren said: I couldn't reproduce this one any longer...
Adrian, as far as I can tell Soeren's original bug is the same one which hits me. Please take a look at the References: http://lkml.org/lkml/2008/2/23/263 in the bug description, and at comment #6 from Soeren himself. Then take a look at the picture in comment #11, from myself. The problem is the same. The fact that he does not see the problem anymore doesn't mean that this bug was fixed.
(In reply to comment #40) > Adrian, as far as I can tell Soeren's original bug is the same one which hits > me. I've seen more than once that different bugs had exactly the same symptoms. I do not know whether you run into the same bug as Soeren, but it is better if you report your issue in an own bug. We have seen in this bug Mark saying it was already present in older kernels although what Soeren reported was not present in 2.6.24 - that's exactly what can happen when different people discuss issues that might or might not be the same bug in one bug entry.
The nature of this bug, strongly suggests that it is impossible to know if it was "not present in 2.6.24". Simply building the kernel with a one-line change in .config is enough to make it come or go. Having all of the information collected together just might make it more likely that somebody familiar with the hirez timer code might actually recognize the problem. Whatever.
My take would be to keep all the info at one place, as long as bug "looks" similar. We can split it out later if we think they are unrelated. Specifically in this case, "hpet=disable" workaround in one bug helped the reporter on the other bug. There is no way for that to happen if all reports are in different bug number. It also helps anyone trying to analyze this thing to have all info that looks relevant at one place.
I kind of agree, but the fact that the original problem is not reproducible any more by the original reporter strongly suggests that the other reports in this bug entry need not be directly related to it. Thus, it seems reasonable to use a different bug entry for tracking the other cases while keeping this one closed. Otherwise we sort of lose the information that the original reporter is unable to reproduce the problem any more.
(In reply to comment #43) > My take would be to keep all the info at one place, as long as bug "looks" > similar. We can split it out later if we think they are unrelated. We have seen in this bug that people were talking about different issues when deciding e.g. whether it is on 32bit or on 64bit, a 2.6.25-rc regression, whether it is still present in 2.6.25-rc9, or whether bug #10377 is a duplicate. > Specifically in this case, "hpet=disable" workaround in one bug helped the > reporter on the other bug. There is no way for that to happen if all reports > are in different bug number. It also helps anyone trying to analyze this > thing > to have all info that looks relevant at one place. This can happen if you put notes in the bugs that one bug might be the same issue like this or that other one.
Hi, Did anybody try running the watchdog to see where this hangs? It should be done with the parameter nmi_watchdog=1. See Documentation/nmi_watchdog.txt for more information. -Vegard
Hi Vegard, I tried with nmi_watchdog=1 in a kernel which used to hang frequently (2.6.25-rc9, one out of three), but then it did not hang after 12 boots. I will try more times and let you know if it hangs with this boot option. However I could get some traces by other means: http://lkml.org/lkml/2008/5/11/138
I have the same problem here using macbook and a 2.6.25.4
Created attachment 16226 [details] dmesg dmesg