Bug 13087

Summary: hpet=force boot hang due to commit ff69f2bba67bd45514923aaedbf40fe351787c59
Product: ACPI Reporter: Bruno Prémont (bonbons)
Component: OtherAssignee: Venkatesh Pallipadi (venki)
Status: CLOSED CODE_FIX    
Severity: normal CC: acpi-bugzilla, alex.shi, cedric, daniel, john.stultz, lenb, pm, rjw, yakui.zhao
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.30-rc1 Subsystem:
Regression: Yes Bisected commit-id:
Bug Depends on:    
Bug Blocks: 13070    
Attachments: Hanging boot log (at commit b21597d0268983f8f9e8b563494f75490403e948)
acpidump of the laptop
processor.max_cstate=0 panic
notsc hangs too
log with printk - in _c1 [in _c1 only = hang, erverywhere=boot]
[debug patch]: delete the conversion between idle time and sleep ticks in idle thread
dmesg as described of 2.6.30-rc2 with ff69... reverted
Hang log for comments 16, 19 and 16+19
[debug patch]: print which clocksource is used.
pach vs 2.6.30-rc2
dmesg without and with "notsc", clocksource debug patch applied
dmesg without "hpet=force", clocksource debug patch applied
Logs of multiple boots
Log with $subjet commit reverted, hpet=force and clocksource=hpet
patch vs 2.6.30-rc3 to root cause the failiure
patch vs 2.6.30-rc3 to prevent the boot hang
patch vs 2.6.30-rc3 to prevent the boot hang

Description Bruno Prémont 2009-04-14 17:51:18 UTC
Created attachment 20976 [details]
Hanging boot log (at commit b21597d0268983f8f9e8b563494f75490403e948)

I bisected a boot-hang on Acer TravelMate 66x laptop and ended on
commit ff69f2bba67bd45514923aaedbf40fe351787c59.

My system is:
  Acer TravelMate 66x

00:00.0 Host bridge [0600]: Intel Corporation 82852/82855 GM/GME/PM/GMV Processor to I/O Controller [8086:3580] (rev 02)
00:00.1 System peripheral [0880]: Intel Corporation 82852/82855 GM/GME/PM/GMV Processor to I/O Controller [8086:3584] (rev 02)
00:00.3 System peripheral [0880]: Intel Corporation 82852/82855 GM/GME/PM/GMV Processor to I/O Controller [8086:3585] (rev 02)
00:02.0 VGA compatible controller [0300]: Intel Corporation 82852/855GM Integrated Graphics Device [8086:3582] (rev 02)
00:02.1 Display controller [0380]: Intel Corporation 82852/855GM Integrated Graphics Device [8086:3582] (rev 02)
00:1d.0 USB Controller [0c03]: Intel Corporation 82801DB/DBL/DBM (ICH4/ICH4-L/ICH4-M) USB UHCI Controller #1 [8086:24c2] (rev 03)
00:1d.1 USB Controller [0c03]: Intel Corporation 82801DB/DBL/DBM (ICH4/ICH4-L/ICH4-M) USB UHCI Controller #2 [8086:24c4] (rev 03)
00:1d.2 USB Controller [0c03]: Intel Corporation 82801DB/DBL/DBM (ICH4/ICH4-L/ICH4-M) USB UHCI Controller #3 [8086:24c7] (rev 03)
00:1d.7 USB Controller [0c03]: Intel Corporation 82801DB/DBM (ICH4/ICH4-M) USB2 EHCI Controller [8086:24cd] (rev 03)
00:1e.0 PCI bridge [0604]: Intel Corporation 82801 Mobile PCI Bridge [8086:2448] (rev 83)
00:1f.0 ISA bridge [0601]: Intel Corporation 82801DBM (ICH4-M) LPC Interface Bridge [8086:24cc] (rev 03)
00:1f.1 IDE interface [0101]: Intel Corporation 82801DBM (ICH4-M) IDE Controller [8086:24ca] (rev 03)
00:1f.3 SMBus [0c05]: Intel Corporation 82801DB/DBL/DBM (ICH4/ICH4-L/ICH4-M) SMBus Controller [8086:24c3] (rev 03)
00:1f.5 Multimedia audio controller [0401]: Intel Corporation 82801DB/DBL/DBM (ICH4/ICH4-L/ICH4-M) AC'97 Audio Controller [8086:24c5] (rev 03)
00:1f.6 Modem [0703]: Intel Corporation 82801DB/DBL/DBM (ICH4/ICH4-L/ICH4-M) AC'97 Modem Controller [8086:24c6] (rev 03)
02:02.0 Ethernet controller [0200]: Broadcom Corporation BCM4401 100Base-T [14e4:4401] (rev 01)
02:04.0 Network controller [0280]: Intel Corporation PRO/Wireless LAN 2100 3B Mini PCI Adapter [8086:1043] (rev 04)
02:06.0 CardBus bridge [0607]: O2 Micro, Inc. OZ711EC1 SmartCardBus Controller [1217:7113] (rev 20)
02:06.1 CardBus bridge [0607]: O2 Micro, Inc. OZ711EC1 SmartCardBus Controller [1217:7113] (rev 20)
02:07.0 FireWire (IEEE 1394) [0c00]: Texas Instruments TSB43AB21 IEEE-1394a-2000 Controller (PHY/Link) [104c:8026]


Attached is boot log captures with netconsole. When hang the bisk light remains on.
Comment 1 Bruno Prémont 2009-04-14 17:55:01 UTC
Created attachment 20977 [details]
acpidump of the laptop
Comment 2 Rafael J. Wysocki 2009-04-14 21:55:49 UTC
First-Bad-Commit : ff69f2bba67bd45514923aaedbf40fe351787c59
Comment 3 john stultz 2009-04-14 22:18:40 UTC
Does booting w/ "processor.max_cstate=0" change the behavior?
Comment 4 ykzhao 2009-04-15 01:09:46 UTC
Hi, John
    How about adding the boot option of "notsc"?
    Thanks.
Comment 5 john stultz 2009-04-15 01:42:51 UTC
(In reply to comment #4)
>     How about adding the boot option of "notsc"?

Might try it, but from the log it looks like the TSC is already being thrown out, so I suspect its the acpi_pm timer stalling in higher C states.
Comment 6 Alex Shi 2009-04-15 02:17:59 UTC
(In reply to comment #5)
> (In reply to comment #4)
> >     How about adding the boot option of "notsc"?
> 
> Might try it, but from the log it looks like the TSC is already being thrown
> out, so I suspect its the acpi_pm timer stalling in higher C states.

Does you meaning the acpi_pm timer can save the box from TSC issue which make box hang? 

John:
the main contents of bad commit is quite simple as below, would you like to add some debug printk to check the kt2/kt1's value? 

-       t1 = inl(acpi_gbl_FADT.xpm_timer_block.address);
+       kt1 = ktime_get_real();
        acpi_idle_do_entry(cx);
-       t2 = inl(acpi_gbl_FADT.xpm_timer_block.address);
+       kt2 = ktime_get_real();
+       idle_time =  ktime_to_us(ktime_sub(kt2, kt1));
 
        local_irq_enable();
        cx->usage++;
 
-       return ticks_elapsed_in_us(t1, t2);
+       return idle_time;
Comment 7 ykzhao 2009-04-15 13:29:52 UTC
HI(In reply to comment #5)
> (In reply to comment #4)
> >     How about adding the boot option of "notsc"?
> Might try it, but from the log it looks like the TSC is already being thrown
> out, so I suspect its the acpi_pm timer stalling in higher C states.

From the dmesg it seems that TSC is thrown out only after the CPU enters the deeper C-state(for example: C2/C3). Maybe it should be throw out before OS expectes to enter the high deeper C-state.

On the previous kernel the ACPI PM timer is used to record the time interval that CPU stays in C-state.  I don't think it will stall deeper C-state.

Thanks.
Comment 8 Bruno Prémont 2009-04-15 16:46:31 UTC
Created attachment 20994 [details]
processor.max_cstate=0 panic

In reply to comment #3

Booting with processor.max_cstate=0 causes the panic when it wants to run the 'idle' task ...
See attached log up to panic as caputred with netconsole.

This looks like processor.max_cstate=0 triggers some (unrelated) bug.
Comment 9 Bruno Prémont 2009-04-15 16:53:36 UTC
Created attachment 20995 [details]
notsc hangs too

In reply to comment #4

Booting with notsc also causes the kernel to hang and there is no big difference between no extra commandline option and notsc ...

Attached boot log as captured with netconsole.
Comment 10 Bruno Prémont 2009-04-15 17:35:51 UTC
In reply to comment #6

Adding printk() right after each assignment of kt1 and kt2, showing their respecitive value the kernel does not stall...

I will try moving around the printk's hopefully getting some more insight (looks like inserting some delay - which printk definitely does - helps get the kernel to boot).


Here a small log extract around the TSC marking:
[    6.679978] calling  initialize_hashrnd+0x0/0x20 @ 1
[    6.682232] initcall initialize_hashrnd+0x0/0x20 returned 0 after 4 usecs
[    6.685826] acpi_idle_enter_simple(): kt1 = 1239823286526093114
[    6.688120] acpi_idle_enter_simple(): kt2 = 1239823286528384540
[    6.690371] acpi_idle_enter_simple(): kt1 = 1239823286530635735
[    6.695144] acpi_idle_enter_simple(): kt2 = 1239823286535411036
[    6.697328] Marking TSC unstable due to TSC halts in idle
[    6.699588] acpi_idle_enter_simple(): kt1 = 1239823286539851942
[    6.702067] acpi_idle_enter_simple(): kt2 = 1239823286542331873
[    6.704466] acpi_idle_enter_simple(): kt1 = 1239823286544730781
[    6.707222] acpi_idle_enter_simple(): kt2 = 1239823286547487137
[    6.709393] ata1.00: configured for UDMA/100
[    6.711502] async_waiting @ 743
[    6.713593] async_continuing @ 743 after 0 usec
[    6.715792] scsi 0:0:0:0: Direct-Access     ATA      FUJITSU MHT2060A 006C PQ: 0 ANSI: 5
[    6.718059] sd 0:0:0:0: Attached scsi generic sg0 type 0
[    6.720216] initcall 2_async_port_probe+0x0/0xc0 returned 0 after 5545494 usecs
[    6.722325] calling  4_sd_probe_async+0x0/0x250 @ 743
[    6.724565] async_continuing @ 750 after 5545210 usec
[    6.726893] input: AT Translated Set 2 keyboard as /devices/platform/i8042/serio0/input/input4


From this I could guess that without the printk's async_waiting could be bitten by the timing issue and thus causing the hang (just a guess, to be verified!)
Comment 11 Bruno Prémont 2009-04-15 18:05:29 UTC
Created attachment 20997 [details]
log with printk - in _c1 [in _c1 only = hang, erverywhere=boot]

Follow-up to comment #10 (log extract there contains a cyop&pase error, the first 2 acpi_idle_enter_simple should probably be acpi_idle_enter_c1 - I copied the printk but forgot to adjust the function name in them)

I changed the printk() to happen only after assigning kt2.

Keeping them in the 3 functions (acpi_idle_enter_simple, acpi_idle_enter_c1, acpi_idle_enter_bm) the kernel boots just fine (producing lots of log ;))
This is first part of attached log

Keeping just the printk in acpi_idle_enter_c1 the kernel hangs, this is second part of attached log


So the printk in acpi_idle_enter_simple inserts some delay needed to get the box to boot (or to skip some other non-recovering wait)...
Comment 12 Bruno Prémont 2009-04-15 19:32:12 UTC
Disabling async boot produces no change (forcing enable_async in kernel/async.c to 0).

So I don't know what makes the printk in acpi_idle_enter_simple() prevent hang...
Comment 13 ykzhao 2009-04-16 02:09:03 UTC
Hi, Bruno
    sorry for the incorrect parameters in comment #3. The boot option should be "processor.max_cstate=1" or "idle=poll". When the boot option of "processor.max_cstate=0" is added, the system will access the uninitialized memory region.
    So please try the boot option of "processor.max_cstate=1" or "idle=poll".
    thanks.
Comment 14 john stultz 2009-04-16 02:16:09 UTC
(In reply to comment #13)
> Hi, Bruno
>     sorry for the incorrect parameters in comment #3. The boot option should
>     be
> "processor.max_cstate=1" or "idle=poll". When the boot option of
> "processor.max_cstate=0" is added, the system will access the uninitialized
> memory region.
>     So please try the boot option of "processor.max_cstate=1" or "idle=poll".
>     thanks.

Ah, good catch ykzhao! Sorry for the bad cmd.
Comment 15 ykzhao 2009-04-16 02:22:54 UTC
Hi, Bruno
    Will you please try the boot option of "clocksource=acpi_pm" and see whether the box still hangs?
    Thanks.
Comment 16 ykzhao 2009-04-16 03:21:03 UTC
Created attachment 21005 [details]
[debug patch]: delete the conversion between idle time and sleep ticks in idle thread

will you please try the debug patch and see whether the box can be booted?
    In the debug patch it is unnecessary to convert the idle_time to pm timer sleep tick again in idle thread. Only when the /proc/acpi/processor/*/power I/F is provoked, it will be converted.
    thanks.
Comment 17 Zhang Rui 2009-04-16 03:23:17 UTC
*** Bug 13094 has been marked as a duplicate of this bug. ***
Comment 18 Daniel Vetter 2009-04-16 09:10:05 UTC
Hi ykzhao,

I've just tested your debug patch on top of 2.6.30-rc2 on my Thinkpad X40 (see Bug 13094). The kernel still crashes in the same manner. But booting with clocksource=acpi_pm helps here, too.

-Daniel
Comment 19 Alex Shi 2009-04-16 10:06:33 UTC
> 
> So the printk in acpi_idle_enter_simple inserts some delay needed to get the
> box to boot (or to skip some other non-recovering wait)...

Bruno: 
I am wondering the ktime_get_real really can return in above function. In fact, the ktime_get_real should call getnstimeofday() in kernel/time/timekeeping.c file if the CONFIG_GENERIC_TIME was set in kernel config. 

Maybe it hang in the following while cycle in getnstimeofday function: 
//       do {
//                seq = read_seqbegin(&xtime_lock);

                *ts = xtime;

                /* read clocksource: */
                cycle_now = clocksource_read(clock);

                /* calculate the delta since the last update_wall_time: */
                cycle_delta = (cycle_now - clock->cycle_last) & clock->mask;

                /* convert to nanoseconds: */
                nsecs = cyc2ns(clock, cycle_delta);

//        } while (read_seqretry(&xtime_lock, seq));

So would you like to comments the "do" "while" and "seq = .." line as above. and try again?
Comment 20 Daniel Vetter 2009-04-16 12:30:33 UTC
I've just gathered a quite strange backtrace in my dmesg about timers while suspending/resuming. This is on 2.6.30-rc2 with ff69... reverted. Maybe ff69 just brought this issue into the open. Anyway, I'll attach the complete dmesg.

-Daniel
Comment 21 Daniel Vetter 2009-04-16 12:32:00 UTC
Created attachment 21011 [details]
dmesg as described of 2.6.30-rc2 with ff69... reverted
Comment 22 Bruno Prémont 2009-04-16 16:27:33 UTC
In repley to comment #13 and comment #15: all 3 variants boot fine (keeping the single printk in acpi_idle_enter_c1)

processor.max_cstate=1 still calls acpi_idle_enter_c1() but the message stating that TSC is unstable seems to happen a few acpi_idle_enter_c1() later.
If log is useful I can attach.

idle=poll does not call acpi_idle_enter_c1() at all

clocksource=acpi_pm calls it exactly once before marking TSC as unstable.
Comment 23 Bruno Prémont 2009-04-16 17:17:59 UTC
Created attachment 21015 [details]
Hang log for comments 16, 19 and 16+19

In reply to comment #16 and comment #19

In all combinations of suggestions from #16 and #19 the boot keeps hanging.
The attached log is the concatenation of all 3 cases, captured via netconsole.

- as suggested in comment #19 I commented out the loop but it kernel keeps
  hanging during boot.

- as suggested in comment #16 I applied the patch but kernel keeps hanging
  during boot

- a combination of both comments, loop commented and patch applied kernel keeps
  hanging during boot.

Often, but not always the HDD led remains lighting when the kernel hangs.
Comment 24 ykzhao 2009-04-17 01:16:24 UTC
Hi, Daniel
    Will you please try the boot option of "notsc" and see whether it still hangs?
    thanks.
Comment 25 ykzhao 2009-04-17 04:05:52 UTC
Created attachment 21022 [details]
[debug patch]: print which clocksource is used.

Will you please try the debug patch and boot the system with/without the boot option of "notsc"?
    After the test, please attach the output log.
    Thanks.
Comment 26 Daniel Vetter 2009-04-18 11:26:46 UTC
On Fri, Apr 17, 2009 at 04:05:53AM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> --- Comment #25 from ykzhao <yakui.zhao@intel.com>  2009-04-17 04:05:52 ---
> Created an attachment (id=21022)
>  --> (http://bugzilla.kernel.org/attachment.cgi?id=21022)
> [debug patch]: print which clocksource is used.
> 
> Will you please try the debug patch and boot the system with/without the boot
> option of "notsc"?
>     After the test, please attach the output log.
>     Thanks.

Hi ykzhao,

In the course of preparing the logs you requested, I encountered a very
strange effect. When I boot the machine with a usb-serial converter
plugged into one usb port, everything is fine and the machine doesn't
freeze. If I then unplug the usb-serial converter, the machine freezes
within a few seconds on a bad kernel. I've tested quite a few of the bad
kernel versions, all exhibiting the same behaviour. So I think it's
unlikely that this is an unrelated regression. Furthermore, on a kernel
where  ff69.. is reverted, this does not happen, too.

I hope this can help.

-Daniel
Comment 27 Daniel Vetter 2009-04-18 11:45:10 UTC
On Fri, Apr 17, 2009 at 01:16:25AM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> --- Comment #24 from ykzhao <yakui.zhao@intel.com>  2009-04-17 01:16:24 ---
> Hi, Daniel
>     Will you please try the boot option of "notsc" and see whether it still
> hangs?
>     thanks.

Booting 2.6.30-rc2-195-g9f76208 with "notsc" freezes (with the usb-serial
converter unplugged). I've also retested whether "clocksource=acpi_pm"
really works around the problem (or whether I had accidentally plugged in
the serial converter). I can confirm that "clocksource=acpi_pm" really
helps, even when the serial converter is unplugged.

-Daniel
Comment 28 Len Brown 2009-04-21 02:13:36 UTC
Daniel,
Since your machine has a LAPIC and IOAPIC, please reproduce the hang
with "nmi_watchdog=1" (or =2) and see if you can get a stack trace
of the actual hang.

Daniel, Bruno,
Plese boot with yakui's patch in comment #25 to print out the clocksource.

Looking at the hpet=force notsc hang in comment #9,
we suspect that the actual hang in that case may be related to the hpet.

Can you boot regularly with clocksource=hpet to see if that works at all?

re: comment #8 - processor.max_cstate=0 panics system
	yakui to file a new (unrelated) bug and fix this boot option.

re: marking the TSC unstable from idle
     we should be doing this at idle init, rather than idle run-time,
     i'll file a patch to fix that bug -- though it isn't clear if it
     is related to the actual failure at hand.
Comment 29 Len Brown 2009-04-21 05:19:43 UTC
Created attachment 21065 [details]
pach vs 2.6.30-rc2

This patch moves the TSC disable check to idle initialization time
from (after the first) idle run-time.  Please give it a go
and attach the new dmesg.
Comment 30 Daniel Vetter 2009-04-21 09:11:05 UTC
On Tue, Apr 21, 2009 at 02:13:37AM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> --- Comment #28 from Len Brown <len.brown@intel.com>  2009-04-21 02:13:36 ---
> Daniel,
> Since your machine has a LAPIC and IOAPIC, please reproduce the hang
> with "nmi_watchdog=1" (or =2) and see if you can get a stack trace
> of the actual hang.

I've just tried this, but unfortunately it seems to affect the freeze. Now
the machine freezes after X switched the console to raw mode. So I can't
see an eventual backtrace. And there's nothing in the netconsole, too.
I've tried a few things to freeze the box in a more useful way but didn't
succeed. But at least I managed to get the netconsole working, so expect
the other logs shortly.

-Daniel
Comment 31 Paul Martin 2009-04-21 10:41:20 UTC
I can reproduce this bug on a Thinkpad T42, where the boot freezes with 2.6.30-rc1 and -rc2. The kernel will boot OK if I remove "hpet=force" from the command line.

Reverting ff69f2bba67bd45514923aaedbf40fe351787c59 does allow the system to boot, but certain timers appear to be faulty: one symptom being that the laptop fails to reliably associate to the wireless access point. Booting that kernel (with revert) without "hpet=force" on the command line gives normal behaviour.

"hpet=force" is needed to get the HPET in the T42's ICH4 chipset to be recognised. This is a definite regression from 2.6.29.
Comment 32 Daniel Vetter 2009-04-21 13:37:15 UTC
Created attachment 21067 [details]
dmesg without and with "notsc", clocksource debug patch applied

The last few module loads are missing, because I couldn't manage to log in via ssh into the initramfs to get the first part of the dmesg _and_ capture the last few lines before the freeze via netconsole. But there's no clocksource lines in there, so I figured that's not the interesting part anyway.
Comment 33 Daniel Vetter 2009-04-21 13:46:25 UTC
Created attachment 21068 [details]
dmesg without "hpet=force", clocksource debug patch applied

I've tried "clocksource=hpet". This does not work. But removing "hpet=force" from my kernel cmdline results in a fine boot-up. I've added the dmesg output of that run. I need hpet=force to enable the hpet on my Thinkpad X40 (like on many other Thinkpads from that era).
Comment 34 Daniel Vetter 2009-04-21 14:51:44 UTC
On Tue, Apr 21, 2009 at 05:19:44AM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> --- Comment #29 from Len Brown <len.brown@intel.com>  2009-04-21 05:19:43 ---
> Created an attachment (id=21065)
>  --> (http://bugzilla.kernel.org/attachment.cgi?id=21065)
> pach vs 2.6.30-rc2
> 
> This patch moves the TSC disable check to idle initialization time
> from (after the first) idle run-time.  Please give it a go
> and attach the new dmesg.

I've just tested this patch, and the systems freezes in the same manner.
Looking at the clocksource debug output, I think your patch has roughly
the same effect as booting with clocksource=hpet, which crashes, too.

-Daniel
Comment 35 Len Brown 2009-04-21 17:16:43 UTC
"hpet=force" is being used on all 3 machines in this bug report.

When "hpet=force" is not used, is the boot hang gone?

If no, does the patch in comment #29 help that case?

Re: comment #31
Are there other "hpet=force" regressions even when ff69f2bba
is reverted from 2.6.30?
Comment 36 Bruno Prémont 2009-04-21 20:39:12 UTC
Created attachment 21072 [details]
Logs of multiple boots

I can confirm that disabling hpet on my system also makes the hang go away.

I have not noticed any other regression (with/without hpet=force). Only thing I know since a long time is that local apic (which is disabled by BIOS and when enabled with lapic cmdline option) causes S3 suspend/resume to fail. But this one is most probably unrelated/offtopic. Adding lacpi cmdline option does not change hang behavior either.

Attached netconsole produced log contains the following boot attemps (see also headers in the log):
- According to comment #28 : boot with clocksource=hpet => hang
- Using patch for #25 -> hang
- Using patch for #25 and clocksource=hpet -> hang
- Using patch for #25 and hpet=force removed -> boots
- Using patch for #25 and patch for #29 but without hpet=force -> boots
- Using patch for #25 and patch for #29, with hpet=force -> hang
- Using patch for #25 and patch for #29, with hpet=force and lapic -> boots
- Using patch for #25 and patch for #29 but without hpet=force and with
  lapic -> boots

What remains to check is if using "hpet=force clocksource=hpet" boots properly or hangs with troublesome commit reverted.
Comment 37 Daniel Vetter 2009-04-21 20:46:49 UTC
On Tue, Apr 21, 2009 at 05:16:44PM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> Are there other "hpet=force" regressions even when ff69f2bba
> is reverted from 2.6.30?

I don't have any other regression on my Thinkpad X40 if I revert ff69.
Furthermore I checked with

$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
hpet

that 2.6.30-rc2 with ff69 reverted is indeed using hpet, which it is.

-Daniel
Comment 38 Bruno Prémont 2009-04-21 21:03:21 UTC
Created attachment 21073 [details]
Log with $subjet commit reverted, hpet=force and clocksource=hpet

Missing check of comment #36 - using "hpet=force clocksource=hpet" boots properly with commit ff69f2bba67bd45514923aaedbf40fe351787c59 reverted (patch from #25 applied but not the one from #29 (this one requires commit ff69...).

Log attached
Comment 39 Len Brown 2009-04-22 02:26:36 UTC
reproduced on a Thinkpad T41 (ICH4) using 2.6.30-rc2-ccc5ff94c

boot using no cmdline options tempts me with:

HPET not enabled in BIOS. You might try hpet=force boot option

But when I use hpet=force, boot hangs.

But if I revert ff69f2bba, hpet=force succeeds.
Comment 40 Len Brown 2009-04-23 01:33:05 UTC
Created attachment 21088 [details]
patch vs 2.6.30-rc3 to root cause the failiure

2.6.20-rc3 + this patch booted on a t41 with "hpet=force" hangs.
This identifies the hpet counter read immediately at the exit from C3
as causing a hang.
Comment 41 Len Brown 2009-04-23 02:17:32 UTC
Created attachment 21089 [details]
patch vs 2.6.30-rc3 to prevent the boot hang

Please test this patch.

It will allow hpet=force to be used on the ICH4 boxes,
and it will still be used for clockevents, but it will not
be available as a clocksource.
Comment 42 Bruno Prémont 2009-04-23 19:08:29 UTC
2.6.30-rc3 boots fine with hpet=force and patch from comment #41.

With this setup my system is using acpi_pm as clocksource (availabled ones: acpi_pm jiffies tsc).
Comment 43 Len Brown 2009-04-23 23:10:22 UTC
Created attachment 21097 [details]
patch vs 2.6.30-rc3 to prevent the boot hang

Please test this newer patch.
It should allow "hpet=force" to again be used on the ICH4M machines,
and the HPET will be used for both clocksource and clockevents.
Comment 44 Daniel Vetter 2009-04-24 11:33:44 UTC
I've just tested the patch in comment #43 and everything works flawlessly on my machine (Thinkpad X40) now.

-Daniel
Comment 45 Bruno Prémont 2009-04-24 21:36:27 UTC
Patch from comment #43 work for me too (Acer TM660)
Comment 46 Paul Martin 2009-04-24 23:30:20 UTC
Len's commit f461ddea0af8b98e2b7940eba9c693b0ee44d64a fixes things for me, too.

(There's a problem with ath5k associating with an access point, but that seems unrelated.)
Comment 47 Rafael J. Wysocki 2009-04-25 21:31:05 UTC
Fixed by commit f461ddea0af8b98e2b7940eba9c693b0ee44d64a .
Comment 48 Daniel Vetter 2009-04-27 14:45:07 UTC
Hi all,

Some further testing shows that Len's  patch doesn't solve all the problems ff69.. caused on my Thinkpad X40.

With ff69 reverted on top of 2.6.30-rc2, I can hibernate (suspend to ram) as often as I want. With Len's patch (-linus version v2.6.30-rc3-329-g0c8454f) I can resume only once. In the second cycle the machine hangs hard shortly after switching the power on (fan and hd spin up). It doesn't manage to set the blinking mode for the suspend led, indicating a resume in progress. I checked the clocksource=acpi_pm workaround, and this does not work for the resume issue, i.e. broken in the same way as with Len's patch.

So at least for my setup, the breakage introduced by ff69 is not yet completely fixed.

-Daniel
Comment 49 Bruno Prémont 2009-04-27 20:25:57 UTC
For me, multiple S3 cycles are possible in a row (tested from linux console [vesafb]; I'm not using KMS but I have i915 loaded so GPU can be resumed)

Tested with today's linus kernel (commit e25c2c873f59c57cf1c2b1979cc8fb01958305ef)
Comment 50 Daniel Vetter 2009-05-07 09:05:10 UTC
On Mon, Apr 27, 2009 at 08:25:58PM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> --- Comment #49 from Bruno <bonbons67@internet.lu>  2009-04-27 20:25:57 ---
> For me, multiple S3 cycles are possible in a row (tested from linux console
> [vesafb]; I'm not using KMS but I have i915 loaded so GPU can be resumed)
Thanks for testing this out, Bruno. I did some further testing but as of
v2.6.30-rc4-288-g413 my problem seems to be fixed.