Bug 28612

Summary: regular soft lockup (hpet and C1E interaction)
Product: Platform Specific/Hardware Reporter: Alexandre Demers (alexandre.f.demers)
Component: x86-64 Assignee: platform_x86_64 (platform_x86_64)
Status: RESOLVED UNREPRODUCIBLE    
Severity: normal CC: dbrownell, florian, john.stultz, maciej.rutecki, rjw, tglx
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.37 Subsystem:
Regression: Yes Bisected commit-id:
Bug Depends on:    
Bug Blocks: 21782    
Attachments: kern.log
dmesg
dmesg before debug options
kern.log before debug options
dmesg with debug options
kern.log with debug options
2.6.38-rc4 kern.log
2.6.38-rc4 dmesg
kernel messages at log and while copying to USB
lspci config
kern.log on 2.6.38-rc5
dmesg-2.6.38-rc5

Description Alexandre Demers 2011-02-08 16:39:22 UTC
Created attachment 46862 [details]
kern.log

Using a 64 bit machine (3 cores). Every 2 boots or so, I get a soft lockup on that machine. It was not doing this before 2.6.35 and maybe 2.6.36 was OK, I can't remember when it all began.

I use the same kernel (but compiled for an ATOM CPU) on an Atom netbook and it has no problem.

I'll try to narrow it down a bit to see if it's related to a kernel version or to a compilation option (using specific AMD 64bit optimizations in the config now, while I was not using it in previous versions).

Meanwhile, I'm attaching kern.log and dmesg.
Comment 1 Alexandre Demers 2011-02-08 16:39:53 UTC
Created attachment 46872 [details]
dmesg
Comment 2 john stultz 2011-02-08 19:44:25 UTC
Do you see this issue booting with "notsc" ?
Comment 3 Alexandre Demers 2011-02-08 21:03:16 UTC
Yes, I do. I tried many times and it froze at about the same rate (1 boot in 2).

I downloaded latest 2.6.35 and 2.6.36. I'll compile both and install them with the same options I'm using on 2.6.37. If they all do the same, I'll try to change the optimizations to only use x86_64 without the Athlon specific ones.
Comment 4 Alexandre Demers 2011-02-09 01:12:46 UTC
Many hours later, here are the results. First, I recompiled 2.6.37 with some "kernel hacks" selected. Then I compiled 2.6.35.11 and 2.6.36.3. The system had no soft lock problem with the previous kernels.

I'm attaching two other kern.log and dmesg. The first ones were taken before I recompiled 2.6.37 with the added debug options. The second ones are the result of a kernel lock up with selected debug options.
Comment 5 Alexandre Demers 2011-02-09 01:13:58 UTC
Created attachment 46932 [details]
dmesg before debug options
Comment 6 Alexandre Demers 2011-02-09 01:15:42 UTC
Created attachment 46942 [details]
kern.log before debug options
Comment 7 Alexandre Demers 2011-02-09 01:16:25 UTC
Created attachment 46952 [details]
dmesg with debug options
Comment 8 Alexandre Demers 2011-02-09 01:17:26 UTC
Created attachment 46962 [details]
kern.log with debug options
Comment 9 Alexandre Demers 2011-02-09 04:26:26 UTC
It seems that sometimes, not always, when it hangs, if I unplug my mouse/keyboard USB receiver, it continues to load after that. But as I said, it doesn't always work.

I just finished compiling latest 2.6.38-rc4, restarted and it softlocked again. Since I enabled more debug options in that new release, I'm adding the kern.log and dmesg.

Should I add udev log or anything else?
Comment 10 Alexandre Demers 2011-02-09 04:27:41 UTC
Created attachment 46972 [details]
2.6.38-rc4 kern.log
Comment 11 Alexandre Demers 2011-02-09 04:28:00 UTC
Created attachment 46982 [details]
2.6.38-rc4 dmesg
Comment 12 Alexandre Demers 2011-02-11 02:08:07 UTC
Created attachment 47252 [details]
kernel messages at log and while copying to USB

Here are more messages about the boot lockup and the soft lockup while copying to USB devices.

I have a feeling it is related to USB detection. When only a few devices (2 or 3) are plugged into my computer, I'm often able to "unlock" the device detection by disconnecting one of the devices (usually my wireless keyboard/mouse transmitter).

However, when connecting more devices, this trick will most of the time fail.

Also, when copying to one of my USB drives (flash + 2 HDD), it will often end up as a total rediscovery of all my USB devices. At some point I thought it could be a mainboard problem (which it could be), but on the other hand, it doesn't happen under Windows.
Comment 13 Alexandre Demers 2011-02-13 03:30:12 UTC
Created attachment 47572 [details]
lspci config

Still there, attaching lspci output if this can be helpful
Comment 14 Alexandre Demers 2011-02-17 04:09:39 UTC
Ouch! rc5 is hurting even more. It softlocks at every boot now (4 on 4) and it's taking a lot more time before I can access my desktop. Attaching new kern.log and dmesg.
Comment 15 Alexandre Demers 2011-02-17 04:10:50 UTC
Created attachment 48062 [details]
kern.log on 2.6.38-rc5
Comment 16 Alexandre Demers 2011-02-17 04:11:17 UTC
Created attachment 48072 [details]
dmesg-2.6.38-rc5
Comment 17 john stultz 2011-02-17 18:42:54 UTC
Adding David Brownell since it seems connected to the OHCI.

Alexandre: Have you tried bisecting to try to narrow down the commit that is causing problems?
Comment 18 Alexandre Demers 2011-02-17 19:02:11 UTC
Haven't done it yet, but I intend to. I'll try to do it soon (within the next week), otherwise I won't be able to do it before April.

Here is my approach:
1- move back to a kernel without the problem (2.6.35 from what I see).
2- apply RCs to narrow my search to a limited number of commits.
3- PROFITS! (I couldn't resist) I mean find the culprit.

On that 3rd step, should I focus on something specific or not?
Comment 19 john stultz 2011-02-17 19:17:29 UTC
Narrowing it down to an rc number would be helpful, but if you're using git, git bisect allows you to just bisect all the changes between a known good commit and a known bad commit.

See: http://www.kernel.org/pub/software/scm/git/docs/user-manual.html#using-bisect
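
For readers unfamiliar with the idea, here is a minimal sketch of the search that git bisect automates. It is not git's implementation: it assumes a linear history, a known good first entry, a known bad last entry, and that every commit after the first bad one is also bad. The commit names and the `is_bad` test are hypothetical stand-ins for "build, boot, and see whether it hangs".

```python
from typing import Callable, List, Sequence


def first_bad(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the first bad commit by binary search.

    Assumes commits[0] is known good, commits[-1] is known bad, and
    everything after the first bad commit is also bad.
    """
    lo, hi = 0, len(commits) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid  # the culprit is at mid or earlier
        else:
            lo = mid  # the culprit is after mid
    return commits[hi]


# Hypothetical history of 1000 commits where commit 637 introduced the bug.
history: List[str] = [f"c{i}" for i in range(1000)]
tested: List[str] = []


def is_bad(commit: str) -> bool:
    """Stand-in for building and booting the kernel at this commit."""
    tested.append(commit)
    return int(commit[1:]) >= 637


culprit = first_bad(history, is_bad)
print(f"first bad commit: {culprit}, kernels tested: {len(tested)}")
```

The point is the cost: the number of kernels to build grows with the logarithm of the number of commits, so going straight from a good release to a bad one is not much more work than first narrowing it down to an rc.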
Comment 20 Alexandre Demers 2011-02-22 06:58:29 UTC
beginning of bisect: 2.6.34 seems clean -> 0 hang on 8 boots

Note to myself: the long hang on .38-rc5 could be unrelated; let's focus on the first hang, then investigate this new one.
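
Since the hang is intermittent, it is worth checking how much a clean run of 8 boots actually says. The sketch below assumes boots are independent and that an affected kernel hangs on roughly 1 boot in 2 (the rate reported in comment 3); both are simplifying assumptions, and the function names are only illustrative.

```python
def false_good_probability(hang_rate: float, boots: int) -> float:
    """Chance that an affected kernel shows no hang in `boots` independent boots."""
    return (1.0 - hang_rate) ** boots


def boots_needed(hang_rate: float, max_risk: float) -> int:
    """Smallest number of clean boots that keeps the false-good risk <= max_risk."""
    n = 1
    while false_good_probability(hang_rate, n) > max_risk:
        n += 1
    return n


# Assumed hang rate of 0.5 per boot, 8 boots per kernel as in this bisect.
risk = false_good_probability(0.5, 8)
print(f"risk of wrongly calling a bad kernel good after 8 boots: {risk:.4%}")
print(f"boots needed for <= 1% risk: {boots_needed(0.5, 0.01)}")
```

Under these assumptions 8 clean boots leave well under a 1% chance of mislabeling a bad kernel as good, so 8 boots per step looks like a reasonable budget. If the real hang rate is lower, more boots per step are needed.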
Comment 21 Alexandre Demers 2011-02-22 23:42:03 UTC
2.6.35 -> 0 hang on 8 boots. Moving to next version.
Comment 22 Alexandre Demers 2011-02-23 02:19:19 UTC
2.6.36 -> 0 hang on 8 boots.
Comment 23 Alexandre Demers 2011-02-23 05:02:03 UTC
2.6.37 -> 2 on 2. I'll now try to find where the problem comes from.
Comment 24 Alexandre Demers 2011-02-24 06:42:11 UTC
Well, bisecting gave me the following culprit:

f1c18071ad70e2a78ab31fc26a18fcfa954a05c6 is the first bad commit
commit f1c18071ad70e2a78ab31fc26a18fcfa954a05c6
Author: Thomas Gleixner <tglx@linutronix.de>
Date:   Mon Dec 13 12:43:23 2010 +0100

    x86: HPET: Chose a paranoid safe value for the ETIME check
    
    commit 995bd3bb5 (x86: Hpet: Avoid the comparator readback penalty)
    chose 8 HPET cycles as a safe value for the ETIME check, as we had the
    confirmation that the posted write to the comparator register is
    delayed by two HPET clock cycles on Intel chipsets which showed
    readback problems.
    
    After that patch hit mainline we got reports from machines with newer
    AMD chipsets which seem to have an even longer delay. See
    http://thread.gmane.org/gmane.linux.kernel/1054283 and
    http://thread.gmane.org/gmane.linux.kernel/1069458 for further
    information.
    
    Boris tried to come up with an ACPI based selection of the minimum
    HPET cycles, but this failed on a couple of test machines. And of
    course we did not get any useful information from the hardware folks.
    
    For now our only option is to chose a paranoid high and safe value for
    the minimum HPET cycles used by the ETIME check. Adjust the minimum ns
    value for the HPET clockevent accordingly.
    
    Reported-Bistected-and-Tested-by: Markus Trippelsdorf <markus@trippelsdorf.de>
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    LKML-Reference: <alpine.LFD.2.00.1012131222420.2653@localhost6.localdomain6>
    Cc: Simon Kirby <sim@hostway.ca>
    Cc: Borislav Petkov <bp@alien8.de>
    Cc: Andreas Herrmann <Andreas.Herrmann3@amd.com>
    Cc: John Stultz <johnstul@us.ibm.com>

:040000 040000 4bccf4aa759a3ece8485328d5cf8eae0feac0d1d 031816aca3d34c392e54d1dee7d3ceb13629c2a4 M	arch


Prior to that commit, there was no hang. (I got only one with the parent commit, but it did not behave like the hang I was looking for: it was a lot shorter AND happened only 1 time in 10 boots.)
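
To get a feel for what "HPET cycles" means in the commit message above, the sketch below converts a cycle count into nanoseconds. It assumes an HPET running at 14.31818 MHz, which is a common value but not one confirmed for this machine (the hardware reports its own period). The value 8 comes from the commit message. The value 128 is only an illustrative "paranoid" setting, since the thread does not state the number the commit actually chose.

```python
# Assumed HPET frequency; real hardware advertises its own counter period.
HPET_HZ = 14_318_180


def hpet_cycles_to_ns(cycles: int, hz: int = HPET_HZ) -> float:
    """Convert a number of HPET counter cycles into nanoseconds."""
    return cycles * 1e9 / hz


# 8 is from the commit message; 128 is an illustrative larger value.
for cycles in (8, 128):
    print(f"{cycles:4d} HPET cycles ~ {hpet_cycles_to_ns(cycles):8.1f} ns")
```

The larger the minimum cycle count, the further in the future the earliest programmable HPET event is, which is why the commit also has to raise the minimum ns value of the clockevent.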
Comment 25 john stultz 2011-02-24 07:27:26 UTC
CC'ing tglx
Comment 26 Thomas Gleixner 2011-02-24 12:34:23 UTC
On Thu, 24 Feb 2011, bugzilla-daemon@bugzilla.kernel.org wrote:
> Well, bisecting gave me the following culprit:
> 
> f1c18071ad70e2a78ab31fc26a18fcfa954a05c6 is the first bad commit
> commit f1c18071ad70e2a78ab31fc26a18fcfa954a05c6
> Author: Thomas Gleixner <tglx@linutronix.de>
> Date:   Mon Dec 13 12:43:23 2010 +0100
> 
>     x86: HPET: Chose a paranoid safe value for the ETIME check

Hmm. Can you please revert that patch on top of 2.6.37 and verify,
that it really cures the problem ?

Thanks,

	tglx
Comment 27 Alexandre Demers 2011-02-24 17:56:11 UTC
Hard reset to 2.6.37, then reverted the commit. Booted 8 times with no hang nor softlock detected.
Comment 28 Alexandre Demers 2011-02-26 17:07:00 UTC
Just to let you know I won't have access to internet for the next month if there is any development.
Comment 29 Florian Mickler 2011-03-29 21:11:07 UTC
Ping... is this still a problem in current 2.6.38.y/2.6.39-rc* ?
Comment 30 Alexandre Demers 2011-04-04 15:30:05 UTC
Hi Florian,

Sorry for the delay, as stated previously, I was away for the last month.
I'll be having a look with latest kernel soon (going back home today) and
I'll let you know as soon as possible.

Alex

Comment 31 Alexandre Demers 2011-04-05 14:02:02 UTC
Still there on stable 2.6.38 (tested stable, then tested reverting bad commit). I'm now going to test with 2.6.39-rc1.
Comment 32 Alexandre Demers 2011-04-05 16:48:40 UTC
2.6.39-rc1 also has the problem.

Whenever I revert the culprit commit on whatever kernel version, the problem goes away.
Comment 33 Florian Mickler 2011-04-09 17:39:19 UTC
Thanks for the update.

First-Bad-Commit: f1c18071ad70e2a78ab31fc26a18fcfa954a05c6
Comment 34 Alexandre Demers 2011-04-12 01:32:29 UTC
Maybe a little hint: I remember that somewhere between 2.6.32 and 2.6.35 (maybe before that), there was a problem with C1E enabled in the BIOS and the kernels at the time. I can't remember what was wrong (reboot, freeze, whatever), but I had to disable the C1E option. Then it got fixed and I re-enabled the option.

So I told myself "why not disable the option again, since it's playing with frequencies, changing the CPUs on halt and so on?" Guess what happened after I had changed the option and booted with a normal kernel (with the problematic commit)? It booted fine.

I hope this can help a bit more.
Comment 35 Alexandre Demers 2011-04-12 04:38:18 UTC
I found the related C1E info I was talking about:
http://kerneltrap.com/mailarchive/linux-kernel/2008/6/12/2105054/thread

Some patches were supposed to deal with a similar problem. Note that there is a comment about a key having to be pressed to help the box continue to work: "The patches work fine on systems which are not affected by the dreaded ATI chipset timer wreckage. On those which have the problem, the box needs help from the keyboard to continue working."

This is similar to what I was experiencing at the beginning (from time to time, disconnecting my "USB/PS2" keyboard allowed booting to continue).

So it seems we are somewhere similar to what is described, I think.
Comment 36 Alexandre Demers 2011-05-31 20:23:41 UTC
Still seen when enabling C1E in BIOS on 2.6.39.
Comment 37 Alexandre Demers 2011-07-16 03:28:29 UTC
I was wondering if it would make sense to create a hybrid solution. What I have in mind is to use the previous values for the default behavior, but if a given condition happens, switch to the current values and code. That way, for people like me, the old way would still work without causing soft lockups when C1E is enabled, and for others it would be the current way of doing it.

I think it wouldn't be too difficult to do, would it? And would it make sense?
Comment 38 Alexandre Demers 2011-07-19 17:03:45 UTC
By the way, still present under 3.0-rc7. Adding hpet=force clocksource=hpet to kernel params solves the problem when C1E is enabled.
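
For anyone trying the same workaround, here is a small sketch to confirm which clocksource the running kernel actually picked after booting with those parameters. It reads the standard sysfs clocksource files; the helper name `read_clocksource` is just an illustration, and it returns None when the file is not there (for example on a non-Linux system).

```python
from pathlib import Path
from typing import Optional

# Standard sysfs location of the clocksource selection on Linux.
CLOCKSOURCE_DIR = Path("/sys/devices/system/clocksource/clocksource0")


def read_clocksource(name: str = "current_clocksource",
                     base: Path = CLOCKSOURCE_DIR) -> Optional[str]:
    """Return the content of a clocksource sysfs file, or None if unreadable."""
    try:
        return (base / name).read_text().strip()
    except OSError:
        return None


print("current:  ", read_clocksource())
print("available:", read_clocksource("available_clocksource"))
```

With clocksource=hpet on the command line, the first line should report hpet if the kernel accepted the override.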


I'm pretty sure I'm not the only one hitting this bug, see the following:
http://ubuntuforums.org/showthread.php?t=1742834
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/776588

and maybe also (which I have to test)
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/776588
Comment 39 Alexandre Demers 2011-07-30 04:09:39 UTC
Very interestingly, it seems to hit only some CPUs. I just changed my CPU, moving from a Phenom X3 8550 to a Phenom II X6 1065T. It seems the bug just went away. I'll investigate this a bit more, but I know C1E management is handled differently on Phenom II.
Comment 40 Alexandre Demers 2013-08-02 21:12:34 UTC
Since I changed the motherboard, I can't continue looking at it. Closing this bug.