Created attachment 20976 [details] Hanging boot log (at commit b21597d0268983f8f9e8b563494f75490403e948) I bisected a boot-hang on Acer TravelMate 66x laptop and ended on commit ff69f2bba67bd45514923aaedbf40fe351787c59. My system is: Acer TravelMate 66x 00:00.0 Host bridge [0600]: Intel Corporation 82852/82855 GM/GME/PM/GMV Processor to I/O Controller [8086:3580] (rev 02) 00:00.1 System peripheral [0880]: Intel Corporation 82852/82855 GM/GME/PM/GMV Processor to I/O Controller [8086:3584] (rev 02) 00:00.3 System peripheral [0880]: Intel Corporation 82852/82855 GM/GME/PM/GMV Processor to I/O Controller [8086:3585] (rev 02) 00:02.0 VGA compatible controller [0300]: Intel Corporation 82852/855GM Integrated Graphics Device [8086:3582] (rev 02) 00:02.1 Display controller [0380]: Intel Corporation 82852/855GM Integrated Graphics Device [8086:3582] (rev 02) 00:1d.0 USB Controller [0c03]: Intel Corporation 82801DB/DBL/DBM (ICH4/ICH4-L/ICH4-M) USB UHCI Controller #1 [8086:24c2] (rev 03) 00:1d.1 USB Controller [0c03]: Intel Corporation 82801DB/DBL/DBM (ICH4/ICH4-L/ICH4-M) USB UHCI Controller #2 [8086:24c4] (rev 03) 00:1d.2 USB Controller [0c03]: Intel Corporation 82801DB/DBL/DBM (ICH4/ICH4-L/ICH4-M) USB UHCI Controller #3 [8086:24c7] (rev 03) 00:1d.7 USB Controller [0c03]: Intel Corporation 82801DB/DBM (ICH4/ICH4-M) USB2 EHCI Controller [8086:24cd] (rev 03) 00:1e.0 PCI bridge [0604]: Intel Corporation 82801 Mobile PCI Bridge [8086:2448] (rev 83) 00:1f.0 ISA bridge [0601]: Intel Corporation 82801DBM (ICH4-M) LPC Interface Bridge [8086:24cc] (rev 03) 00:1f.1 IDE interface [0101]: Intel Corporation 82801DBM (ICH4-M) IDE Controller [8086:24ca] (rev 03) 00:1f.3 SMBus [0c05]: Intel Corporation 82801DB/DBL/DBM (ICH4/ICH4-L/ICH4-M) SMBus Controller [8086:24c3] (rev 03) 00:1f.5 Multimedia audio controller [0401]: Intel Corporation 82801DB/DBL/DBM (ICH4/ICH4-L/ICH4-M) AC'97 Audio Controller [8086:24c5] (rev 03) 00:1f.6 Modem [0703]: Intel Corporation 82801DB/DBL/DBM (ICH4/ICH4-L/ICH4-M) AC'97 Modem Controller [8086:24c6] (rev 03) 02:02.0 Ethernet controller [0200]: Broadcom Corporation BCM4401 100Base-T [14e4:4401] (rev 01) 02:04.0 Network controller [0280]: Intel Corporation PRO/Wireless LAN 2100 3B Mini PCI Adapter [8086:1043] (rev 04) 02:06.0 CardBus bridge [0607]: O2 Micro, Inc. OZ711EC1 SmartCardBus Controller [1217:7113] (rev 20) 02:06.1 CardBus bridge [0607]: O2 Micro, Inc. OZ711EC1 SmartCardBus Controller [1217:7113] (rev 20) 02:07.0 FireWire (IEEE 1394) [0c00]: Texas Instruments TSB43AB21 IEEE-1394a-2000 Controller (PHY/Link) [104c:8026] Attached is boot log captures with netconsole. When hang the bisk light remains on.
Created attachment 20977 [details] acpidump of the laptop
First-Bad-Commit : ff69f2bba67bd45514923aaedbf40fe351787c59
Does booting w/ "processor.max_cstate=0" change the behavior?
Hi, John How about adding the boot option of "notsc"? Thanks.
(In reply to comment #4) > How about adding the boot option of "notsc"? Might try it, but from the log it looks like the TSC is already being thrown out, so I suspect its the acpi_pm timer stalling in higher C states.
(In reply to comment #5) > (In reply to comment #4) > > How about adding the boot option of "notsc"? > > Might try it, but from the log it looks like the TSC is already being thrown > out, so I suspect its the acpi_pm timer stalling in higher C states. Does you meaning the acpi_pm timer can save the box from TSC issue which make box hang? John: the main contents of bad commit is quite simple as below, would you like to add some debug printk to check the kt2/kt1's value? - t1 = inl(acpi_gbl_FADT.xpm_timer_block.address); + kt1 = ktime_get_real(); acpi_idle_do_entry(cx); - t2 = inl(acpi_gbl_FADT.xpm_timer_block.address); + kt2 = ktime_get_real(); + idle_time = ktime_to_us(ktime_sub(kt2, kt1)); local_irq_enable(); cx->usage++; - return ticks_elapsed_in_us(t1, t2); + return idle_time;
HI(In reply to comment #5) > (In reply to comment #4) > > How about adding the boot option of "notsc"? > Might try it, but from the log it looks like the TSC is already being thrown > out, so I suspect its the acpi_pm timer stalling in higher C states. From the dmesg it seems that TSC is thrown out only after the CPU enters the deeper C-state(for example: C2/C3). Maybe it should be throw out before OS expectes to enter the high deeper C-state. On the previous kernel the ACPI PM timer is used to record the time interval that CPU stays in C-state. I don't think it will stall deeper C-state. Thanks.
Created attachment 20994 [details] processor.max_cstate=0 panic In reply to comment #3 Booting with processor.max_cstate=0 causes the panic when it wants to run the 'idle' task ... See attached log up to panic as caputred with netconsole. This looks like processor.max_cstate=0 triggers some (unrelated) bug.
Created attachment 20995 [details] notsc hangs too In reply to comment #4 Booting with notsc also causes the kernel to hang and there is no big difference between no extra commandline option and notsc ... Attached boot log as captured with netconsole.
In reply to comment #6 Adding printk() right after each assignment of kt1 and kt2, showing their respecitive value the kernel does not stall... I will try moving around the printk's hopefully getting some more insight (looks like inserting some delay - which printk definitely does - helps get the kernel to boot). Here a small log extract around the TSC marking: [ 6.679978] calling initialize_hashrnd+0x0/0x20 @ 1 [ 6.682232] initcall initialize_hashrnd+0x0/0x20 returned 0 after 4 usecs [ 6.685826] acpi_idle_enter_simple(): kt1 = 1239823286526093114 [ 6.688120] acpi_idle_enter_simple(): kt2 = 1239823286528384540 [ 6.690371] acpi_idle_enter_simple(): kt1 = 1239823286530635735 [ 6.695144] acpi_idle_enter_simple(): kt2 = 1239823286535411036 [ 6.697328] Marking TSC unstable due to TSC halts in idle [ 6.699588] acpi_idle_enter_simple(): kt1 = 1239823286539851942 [ 6.702067] acpi_idle_enter_simple(): kt2 = 1239823286542331873 [ 6.704466] acpi_idle_enter_simple(): kt1 = 1239823286544730781 [ 6.707222] acpi_idle_enter_simple(): kt2 = 1239823286547487137 [ 6.709393] ata1.00: configured for UDMA/100 [ 6.711502] async_waiting @ 743 [ 6.713593] async_continuing @ 743 after 0 usec [ 6.715792] scsi 0:0:0:0: Direct-Access ATA FUJITSU MHT2060A 006C PQ: 0 ANSI: 5 [ 6.718059] sd 0:0:0:0: Attached scsi generic sg0 type 0 [ 6.720216] initcall 2_async_port_probe+0x0/0xc0 returned 0 after 5545494 usecs [ 6.722325] calling 4_sd_probe_async+0x0/0x250 @ 743 [ 6.724565] async_continuing @ 750 after 5545210 usec [ 6.726893] input: AT Translated Set 2 keyboard as /devices/platform/i8042/serio0/input/input4 From this I could guess that without the printk's async_waiting could be bitten by the timing issue and thus causing the hang (just a guess, to be verified!)
Created attachment 20997 [details] log with printk - in _c1 [in _c1 only = hang, erverywhere=boot] Follow-up to comment #10 (log extract there contains a cyop&pase error, the first 2 acpi_idle_enter_simple should probably be acpi_idle_enter_c1 - I copied the printk but forgot to adjust the function name in them) I changed the printk() to happen only after assigning kt2. Keeping them in the 3 functions (acpi_idle_enter_simple, acpi_idle_enter_c1, acpi_idle_enter_bm) the kernel boots just fine (producing lots of log ;)) This is first part of attached log Keeping just the printk in acpi_idle_enter_c1 the kernel hangs, this is second part of attached log So the printk in acpi_idle_enter_simple inserts some delay needed to get the box to boot (or to skip some other non-recovering wait)...
Disabling async boot produces no change (forcing enable_async in kernel/async.c to 0). So I don't know what makes the printk in acpi_idle_enter_simple() prevent hang...
Hi, Bruno sorry for the incorrect parameters in comment #3. The boot option should be "processor.max_cstate=1" or "idle=poll". When the boot option of "processor.max_cstate=0" is added, the system will access the uninitialized memory region. So please try the boot option of "processor.max_cstate=1" or "idle=poll". thanks.
(In reply to comment #13) > Hi, Bruno > sorry for the incorrect parameters in comment #3. The boot option should > be > "processor.max_cstate=1" or "idle=poll". When the boot option of > "processor.max_cstate=0" is added, the system will access the uninitialized > memory region. > So please try the boot option of "processor.max_cstate=1" or "idle=poll". > thanks. Ah, good catch ykzhao! Sorry for the bad cmd.
Hi, Bruno Will you please try the boot option of "clocksource=acpi_pm" and see whether the box still hangs? Thanks.
Created attachment 21005 [details] [debug patch]: delete the conversion between idle time and sleep ticks in idle thread will you please try the debug patch and see whether the box can be booted? In the debug patch it is unnecessary to convert the idle_time to pm timer sleep tick again in idle thread. Only when the /proc/acpi/processor/*/power I/F is provoked, it will be converted. thanks.
*** Bug 13094 has been marked as a duplicate of this bug. ***
Hi ykzhao, I've just tested your debug patch on top of 2.6.30-rc2 on my Thinkpad X40 (see Bug 13094). The kernel still crashes in the same manner. But booting with clocksource=acpi_pm helps here, too. -Daniel
> > So the printk in acpi_idle_enter_simple inserts some delay needed to get the > box to boot (or to skip some other non-recovering wait)... Bruno: I am wondering the ktime_get_real really can return in above function. In fact, the ktime_get_real should call getnstimeofday() in kernel/time/timekeeping.c file if the CONFIG_GENERIC_TIME was set in kernel config. Maybe it hang in the following while cycle in getnstimeofday function: // do { // seq = read_seqbegin(&xtime_lock); *ts = xtime; /* read clocksource: */ cycle_now = clocksource_read(clock); /* calculate the delta since the last update_wall_time: */ cycle_delta = (cycle_now - clock->cycle_last) & clock->mask; /* convert to nanoseconds: */ nsecs = cyc2ns(clock, cycle_delta); // } while (read_seqretry(&xtime_lock, seq)); So would you like to comments the "do" "while" and "seq = .." line as above. and try again?
I've just gathered a quite strange backtrace in my dmesg about timers while suspending/resuming. This is on 2.6.30-rc2 with ff69... reverted. Maybe ff69 just brought this issue into the open. Anyway, I'll attach the complete dmesg. -Daniel
Created attachment 21011 [details] dmesg as described of 2.6.30-rc2 with ff69... reverted
In repley to comment #13 and comment #15: all 3 variants boot fine (keeping the single printk in acpi_idle_enter_c1) processor.max_cstate=1 still calls acpi_idle_enter_c1() but the message stating that TSC is unstable seems to happen a few acpi_idle_enter_c1() later. If log is useful I can attach. idle=poll does not call acpi_idle_enter_c1() at all clocksource=acpi_pm calls it exactly once before marking TSC as unstable.
Created attachment 21015 [details] Hang log for comments 16, 19 and 16+19 In reply to comment #16 and comment #19 In all combinations of suggestions from #16 and #19 the boot keeps hanging. The attached log is the concatenation of all 3 cases, captured via netconsole. - as suggested in comment #19 I commented out the loop but it kernel keeps hanging during boot. - as suggested in comment #16 I applied the patch but kernel keeps hanging during boot - a combination of both comments, loop commented and patch applied kernel keeps hanging during boot. Often, but not always the HDD led remains lighting when the kernel hangs.
Hi, Daniel Will you please try the boot option of "notsc" and see whether it still hangs? thanks.
Created attachment 21022 [details] [debug patch]: print which clocksource is used. Will you please try the debug patch and boot the system with/without the boot option of "notsc"? After the test, please attach the output log. Thanks.
On Fri, Apr 17, 2009 at 04:05:53AM +0000, bugzilla-daemon@bugzilla.kernel.org wrote: > --- Comment #25 from ykzhao <yakui.zhao@intel.com> 2009-04-17 04:05:52 --- > Created an attachment (id=21022) > --> (http://bugzilla.kernel.org/attachment.cgi?id=21022) > [debug patch]: print which clocksource is used. > > Will you please try the debug patch and boot the system with/without the boot > option of "notsc"? > After the test, please attach the output log. > Thanks. Hi ykzhao, In the course of preparing the logs you requested, I encountered a very strange effect. When I boot the machine with a usb-serial converter plugged into one usb port, everything is fine and the machine doesn't freeze. If I then unplug the usb-serial converter, the machine freezes within a few seconds on a bad kernel. I've tested quite a few of the bad kernel versions, all exhibiting the same behaviour. So I think it's unlikely that this is an unrelated regression. Furthermore, on a kernel where ff69.. is reverted, this does not happen, too. I hope this can help. -Daniel
On Fri, Apr 17, 2009 at 01:16:25AM +0000, bugzilla-daemon@bugzilla.kernel.org wrote: > --- Comment #24 from ykzhao <yakui.zhao@intel.com> 2009-04-17 01:16:24 --- > Hi, Daniel > Will you please try the boot option of "notsc" and see whether it still > hangs? > thanks. Booting 2.6.30-rc2-195-g9f76208 with "notsc" freezes (with the usb-serial converter unplugged). I've also retested whether "clocksource=acpi_pm" really works around the problem (or whether I had accidentally plugged in the serial converter). I can confirm that "clocksource=acpi_pm" really helps, even when the serial converter is unplugged. -Daniel
Daniel, Since your machine has a LAPIC and IOAPIC, please reproduce the hang with "nmi_watchdog=1" (or =2) and see if you can get a stack trace of the actual hang. Daniel, Bruno, Plese boot with yakui's patch in comment #25 to print out the clocksource. Looking at the hpet=force notsc hang in comment #9, we suspect that the actual hang in that case may be related to the hpet. Can you boot regularly with clocksource=hpet to see if that works at all? re: comment #8 - processor.max_cstate=0 panics system yakui to file a new (unrelated) bug and fix this boot option. re: marking the TSC unstable from idle we should be doing this at idle init, rather than idle run-time, i'll file a patch to fix that bug -- though it isn't clear if it is related to the actual failure at hand.
Created attachment 21065 [details] pach vs 2.6.30-rc2 This patch moves the TSC disable check to idle initialization time from (after the first) idle run-time. Please give it a go and attach the new dmesg.
On Tue, Apr 21, 2009 at 02:13:37AM +0000, bugzilla-daemon@bugzilla.kernel.org wrote: > --- Comment #28 from Len Brown <len.brown@intel.com> 2009-04-21 02:13:36 --- > Daniel, > Since your machine has a LAPIC and IOAPIC, please reproduce the hang > with "nmi_watchdog=1" (or =2) and see if you can get a stack trace > of the actual hang. I've just tried this, but unfortunately it seems to affect the freeze. Now the machine freezes after X switched the console to raw mode. So I can't see an eventual backtrace. And there's nothing in the netconsole, too. I've tried a few things to freeze the box in a more useful way but didn't succeed. But at least I managed to get the netconsole working, so expect the other logs shortly. -Daniel
I can reproduce this bug on a Thinkpad T42, where the boot freezes with 2.6.30-rc1 and -rc2. The kernel will boot OK if I remove "hpet=force" from the command line. Reverting ff69f2bba67bd45514923aaedbf40fe351787c59 does allow the system to boot, but certain timers appear to be faulty: one symptom being that the laptop fails to reliably associate to the wireless access point. Booting that kernel (with revert) without "hpet=force" on the command line gives normal behaviour. "hpet=force" is needed to get the HPET in the T42's ICH4 chipset to be recognised. This is a definite regression from 2.6.29.
Created attachment 21067 [details] dmesg without and with "notsc", clocksource debug patch applied The last few module loads are missing, because I couldn't manage to log in via ssh into the initramfs to get the first part of the dmesg _and_ capture the last few lines before the freeze via netconsole. But there's no clocksource lines in there, so I figured that's not the interesting part anyway.
Created attachment 21068 [details] dmesg without "hpet=force", clocksource debug patch applied I've tried "clocksource=hpet". This does not work. But removing "hpet=force" from my kernel cmdline results in a fine boot-up. I've added the dmesg output of that run. I need hpet=force to enable the hpet on my Thinkpad X40 (like on many other Thinkpads from that era).
On Tue, Apr 21, 2009 at 05:19:44AM +0000, bugzilla-daemon@bugzilla.kernel.org wrote: > --- Comment #29 from Len Brown <len.brown@intel.com> 2009-04-21 05:19:43 --- > Created an attachment (id=21065) > --> (http://bugzilla.kernel.org/attachment.cgi?id=21065) > pach vs 2.6.30-rc2 > > This patch moves the TSC disable check to idle initialization time > from (after the first) idle run-time. Please give it a go > and attach the new dmesg. I've just tested this patch, and the systems freezes in the same manner. Looking at the clocksource debug output, I think your patch has roughly the same effect as booting with clocksource=hpet, which crashes, too. -Daniel
"hpet=force" is being used on all 3 machines in this bug report. When "hpet=force" is not used, is the boot hang gone? If no, does the patch in comment #29 help that case? Re: comment #31 Are there other "hpet=force" regressions even when ff69f2bba is reverted from 2.6.30?
Created attachment 21072 [details] Logs of multiple boots I can confirm that disabling hpet on my system also makes the hang go away. I have not noticed any other regression (with/without hpet=force). Only thing I know since a long time is that local apic (which is disabled by BIOS and when enabled with lapic cmdline option) causes S3 suspend/resume to fail. But this one is most probably unrelated/offtopic. Adding lacpi cmdline option does not change hang behavior either. Attached netconsole produced log contains the following boot attemps (see also headers in the log): - According to comment #28 : boot with clocksource=hpet => hang - Using patch for #25 -> hang - Using patch for #25 and clocksource=hpet -> hang - Using patch for #25 and hpet=force removed -> boots - Using patch for #25 and patch for #29 but without hpet=force -> boots - Using patch for #25 and patch for #29, with hpet=force -> hang - Using patch for #25 and patch for #29, with hpet=force and lapic -> boots - Using patch for #25 and patch for #29 but without hpet=force and with lapic -> boots What remains to check is if using "hpet=force clocksource=hpet" boots properly or hangs with troublesome commit reverted.
On Tue, Apr 21, 2009 at 05:16:44PM +0000, bugzilla-daemon@bugzilla.kernel.org wrote: > Are there other "hpet=force" regressions even when ff69f2bba > is reverted from 2.6.30? I don't have any other regression on my Thinkpad X40 if I revert ff69. Furthermore I checked with $ cat /sys/devices/system/clocksource/clocksource0/current_clocksource hpet that 2.6.30-rc2 with ff69 reverted is indeed using hpet, which it is. -Daniel
Created attachment 21073 [details] Log with $subjet commit reverted, hpet=force and clocksource=hpet Missing check of comment #36 - using "hpet=force clocksource=hpet" boots properly with commit ff69f2bba67bd45514923aaedbf40fe351787c59 reverted (patch from #25 applied but not the one from #29 (this one requires commit ff69...). Log attached
reproduced on a Thinkpad T41 (ICH4) using 2.6.30-rc2-ccc5ff94c boot using no cmdline options tempts me with: HPET not enabled in BIOS. You might try hpet=force boot option But when I use hpet=force, boot hangs. But if I revert ff69f2bba, hpet=force succeeds.
Created attachment 21088 [details] patch vs 2.6.30-rc3 to root cause the failiure 2.6.20-rc3 + this patch booted on a t41 with "hpet=force" hangs. This identifies the hpet counter read immediately at the exit from C3 as causing a hang.
Created attachment 21089 [details] patch vs 2.6.30-rc3 to prevent the boot hang Please test this patch. It will allow hpet=force to be used on the ICH4 boxes, and it will still be used for clockevents, but it will not be available as a clocksource.
2.6.30-rc3 boots fine with hpet=force and patch from comment #41. With this setup my system is using acpi_pm as clocksource (availabled ones: acpi_pm jiffies tsc).
Created attachment 21097 [details] patch vs 2.6.30-rc3 to prevent the boot hang Please test this newer patch. It should allow "hpet=force" to again be used on the ICH4M machines, and the HPET will be used for both clocksource and clockevents.
I've just tested the patch in comment #43 and everything works flawlessly on my machine (Thinkpad X40) now. -Daniel
Patch from comment #43 work for me too (Acer TM660)
Len's commit f461ddea0af8b98e2b7940eba9c693b0ee44d64a fixes things for me, too. (There's a problem with ath5k associating with an access point, but that seems unrelated.)
Fixed by commit f461ddea0af8b98e2b7940eba9c693b0ee44d64a .
Hi all, Some further testing shows that Len's patch doesn't solve all the problems ff69.. caused on my Thinkpad X40. With ff69 reverted on top of 2.6.30-rc2, I can hibernate (suspend to ram) as often as I want. With Len's patch (-linus version v2.6.30-rc3-329-g0c8454f) I can resume only once. In the second cycle the machine hangs hard shortly after switching the power on (fan and hd spin up). It doesn't manage to set the blinking mode for the suspend led, indicating a resume in progress. I checked the clocksource=acpi_pm workaround, and this does not work for the resume issue, i.e. broken in the same way as with Len's patch. So at least for my setup, the breakage introduced by ff69 is not yet completely fixed. -Daniel
For me, multiple S3 cycles are possible in a row (tested from linux console [vesafb]; I'm not using KMS but I have i915 loaded so GPU can be resumed) Tested with today's linus kernel (commit e25c2c873f59c57cf1c2b1979cc8fb01958305ef)
On Mon, Apr 27, 2009 at 08:25:58PM +0000, bugzilla-daemon@bugzilla.kernel.org wrote: > --- Comment #49 from Bruno <bonbons67@internet.lu> 2009-04-27 20:25:57 --- > For me, multiple S3 cycles are possible in a row (tested from linux console > [vesafb]; I'm not using KMS but I have i915 loaded so GPU can be resumed) Thanks for testing this out, Bruno. I did some further testing but as of v2.6.30-rc4-288-g413 my problem seems to be fixed.