Created attachment 46862 [details] kern.log

Using a 64-bit machine (3 cores). Every 2 boots or so, I get a soft lockup on that machine. It was not doing this before 2.6.35, and maybe 2.6.36 was OK; I can't remember when it all began. I use the same kernel (but compiled for an Atom CPU) on an Atom netbook and it has no problem. I'll try to narrow it down a bit to see whether it's related to a kernel version or to a compilation option (I'm using specific AMD 64-bit optimizations in the config now, while I was not using them in previous versions). Meanwhile, I'm attaching kern.log and dmesg.
Created attachment 46872 [details] dmesg
Do you see this issue booting with "notsc" ?
Yes, I do. I tried many times and it froze at about the same rate (1 out of 2). I downloaded the latest 2.6.35 and 2.6.36. I'll compile both and install them with the same options I'm using on 2.6.37. If they all do the same, I'll try to change the optimizations to only use x86_64 without the Athlon-specific ones.
Many hours later, here are the results. First, I recompiled 2.6.37 with some "kernel hacking" options selected. Then I compiled 2.6.35.11 and 2.6.36.3. The system had no soft lockup problem with those previous kernels. I'm attaching two more sets of kern.log and dmesg. The first set was taken before I recompiled 2.6.37 with the added debug options. The second set is the result of a kernel lockup with the selected debug options.
Created attachment 46932 [details] dmesg before debug options
Created attachment 46942 [details] kern.log before debug options
Created attachment 46952 [details] dmesg with debug options
Created attachment 46962 [details] kern.log with debug options
It seems that sometimes (not always) when it hangs, if I unplug my mouse/keyboard USB receiver, booting continues after that. But as I said, it doesn't always work. I just finished compiling the latest 2.6.38-rc4, restarted, and it soft locked again. Since I enabled more debug options in that new release, I'm adding the kern.log and dmesg. Should I add the udev log or anything else?
Created attachment 46972 [details] 2.6.38-rc4 kern.log
Created attachment 46982 [details] 2.6.38-rc4 dmesg
Created attachment 47252 [details] kernel messages at log and while copying to USB

Here are more messages about the boot lockup and about soft lockups while copying to USB devices. I have a feeling it is related to USB detection. When only a few devices (2 or 3) are plugged into my computer, I'm often able to "unlock" the device detection by disconnecting one of the devices (usually my wireless keyboard/mouse transmitter). However, when more devices are connected, this trick fails most of the time. Also, copying to one of my USB drives (flash + 2 HDDs) will often end up in a complete rediscovery of all my USB devices. At some point I thought it could be a mainboard problem (which it still could be), but on the other hand, it doesn't happen under Windows.
Created attachment 47572 [details] lspci config

Still there; attaching lspci output in case it can be helpful.
Ouch! rc5 is hurting even more. It soft locks at every boot now (4 out of 4) and it's taking a lot more time before I can access my desktop. Attaching new kern.log and dmesg.
Created attachment 48062 [details] kern.log on 2.6.38-rc5
Created attachment 48072 [details] dmesg-2.6.38-rc5
Adding David Brownell since it seems connected to the OHCI. Alexandre: Have you tried bisecting to try to narrow down the commit that is causing problems?
Haven't done it yet, but I intend to do so. I'll try to do it soon (in the coming week), otherwise I won't be able to do it before April. Here is my approach:
1- move back to a kernel without the problem (2.6.35 from what I see).
2- apply the RC patches to narrow my search down to a limited number of commits.
3- PROFITS! (I couldn't resist) I mean, find the culprit.
On that 3rd step, should I focus on something specific or not?
Narrowing it down to an rc number would be helpful, but if you're using git, git bisect allows you to just bisect all the changes between a known-good commit and a known-bad commit. See: http://www.kernel.org/pub/software/scm/git/docs/user-manual.html#using-bisect
Beginning of bisect: 2.6.34 seems clean -> 0 hangs in 8 boots.
Note to self: the long hang on .38-rc5 could be unrelated; let's focus on the first hang, then investigate this new one.
2.6.35 -> 0 hangs in 8 boots. Moving to the next version.
2.6.36 -> 0 hangs in 8 boots.
2.6.37 -> 2 hangs in 2 boots. I'll now try to find where the problem comes from.
Well, bisecting gave me the following culprit:

f1c18071ad70e2a78ab31fc26a18fcfa954a05c6 is the first bad commit
commit f1c18071ad70e2a78ab31fc26a18fcfa954a05c6
Author: Thomas Gleixner <tglx@linutronix.de>
Date:   Mon Dec 13 12:43:23 2010 +0100

    x86: HPET: Chose a paranoid safe value for the ETIME check

    commit 995bd3bb5 (x86: Hpet: Avoid the comparator readback penalty) chose
    8 HPET cycles as a safe value for the ETIME check, as we had the
    confirmation that the posted write to the comparator register is delayed
    by two HPET clock cycles on Intel chipsets which showed readback problems.

    After that patch hit mainline we got reports from machines with newer AMD
    chipsets which seem to have an even longer delay. See
    http://thread.gmane.org/gmane.linux.kernel/1054283 and
    http://thread.gmane.org/gmane.linux.kernel/1069458 for further information.

    Boris tried to come up with an ACPI based selection of the minimum HPET
    cycles, but this failed on a couple of test machines. And of course we
    did not get any useful information from the hardware folks.

    For now our only option is to chose a paranoid high and safe value for
    the minimum HPET cycles used by the ETIME check. Adjust the minimum ns
    value for the HPET clockevent accordingly.

    Reported-Bistected-and-Tested-by: Markus Trippelsdorf <markus@trippelsdorf.de>
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    LKML-Reference: <alpine.LFD.2.00.1012131222420.2653@localhost6.localdomain6>
    Cc: Simon Kirby <sim@hostway.ca>
    Cc: Borislav Petkov <bp@alien8.de>
    Cc: Andreas Herrmann <Andreas.Herrmann3@amd.com>
    Cc: John Stultz <johnstul@us.ibm.com>

:040000 040000 4bccf4aa759a3ece8485328d5cf8eae0feac0d1d 031816aca3d34c392e54d1dee7d3ceb13629c2a4 M	arch

Prior to that commit, no hang (I got only one with the parent commit, but it was not behaving like the hang I was looking for; it was a lot shorter AND happened only 1 time in 10 boots).
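For reference, the check that commit touches is in arch/x86/kernel/hpet.c. The snippet below is only my paraphrase of how the code looks around 2.6.37, not the exact upstream diff; as I understand it, the commit raises the HPET_MIN_CYCLES threshold (which was 8 before) to a much larger "paranoid" value and bumps the clockevent's minimum delta to match.

/*
 * Paraphrase of hpet_next_event() in arch/x86/kernel/hpet.c around
 * 2.6.37 -- a sketch for discussion, not the exact upstream code.
 */
static int hpet_next_event(unsigned long delta,
			   struct clock_event_device *evt, int timer)
{
	u32 cnt;
	s32 res;

	cnt = hpet_readl(HPET_COUNTER);
	cnt += (u32) delta;
	hpet_writel(cnt, HPET_Tn_CMP(timer));

	/*
	 * The posted write to the comparator can be delayed by the
	 * chipset, so an event programmed too close to the current
	 * counter may already be in the past by the time the hardware
	 * sees it, and the interrupt is lost.  If the programmed event
	 * is closer than HPET_MIN_CYCLES, return -ETIME so the
	 * clockevents core retries with a larger delta.  The bisected
	 * commit raises HPET_MIN_CYCLES (previously 8).
	 */
	res = (s32)(cnt - hpet_readl(HPET_COUNTER));

	return res < HPET_MIN_CYCLES ? -ETIME : 0;
}

Reverting the commit effectively puts the threshold back to the old value, which matches what I see: with the revert, the hangs go away.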
CC'ing tglx
On Thu, 24 Feb 2011, bugzilla-daemon@bugzilla.kernel.org wrote:
> Well, bisecting gave me the following culprit:
>
> f1c18071ad70e2a78ab31fc26a18fcfa954a05c6 is the first bad commit
> commit f1c18071ad70e2a78ab31fc26a18fcfa954a05c6
> Author: Thomas Gleixner <tglx@linutronix.de>
> Date: Mon Dec 13 12:43:23 2010 +0100
>
>     x86: HPET: Chose a paranoid safe value for the ETIME check

Hmm. Can you please revert that patch on top of 2.6.37 and verify that it really cures the problem?

Thanks,

	tglx
Hard reset to 2.6.37, then reverted the commit. Booted 8 times with no hang or soft lockup detected.
Just to let you know, I won't have internet access for the next month, in case there is any development.
Ping... is this still a problem in current 2.6.38.y/2.6.39-rc* ?
Hi Florian,

Sorry for the delay; as stated previously, I was away for the last month. I'll have a look with the latest kernel soon (going back home today) and I'll let you know as soon as possible.

Alex
Still there on stable 2.6.38 (tested stable first, then tested with the bad commit reverted). I'm now going to test with 2.6.39-rc1.
2.6.39-rc1 also has the problem. Whenever I revert the culprit commit on whatever kernel version, the problem goes away.
Thanks for the update. First-Bad-Commit: f1c18071ad70e2a78ab31fc26a18fcfa954a05c6
Maybe a little hint: I remember that somewhere around 2.6.32 to 2.6.35 (maybe before that), there was a problem with C1E enabled in the BIOS and the kernels at the time. I can't remember exactly what was wrong (reboot, freeze, whatever), but I had to disable the C1E option. Then it got fixed and I re-enabled the option. So I told myself, "why not disable the option again, since it plays with frequencies, changes CPU behaviour on halt and so on?" Guess what happened after I changed the option and booted with a normal kernel (with the problematic commit)? It booted fine. I hope this helps a bit more.
I found the related C1E info I was talking about: http://kerneltrap.com/mailarchive/linux-kernel/2008/6/12/2105054/thread

Some patches were supposed to deal with a similar problem. It must be noted that there is a comment about a key having to be pressed to help the box continue to work: "The patches work fine on systems which are not affected by the dreaded ATI chipset timer wreckage. On those which have the problem, the box needs help from the keyboard to continue working." This is similar to what I was experiencing at the beginning (from time to time, disconnecting my keyboard (USB/PS2) would allow the boot to continue). So it seems we are in a situation similar to what is described there, I think.
Still seen when enabling C1E in BIOS on 2.6.39.
I was wondering if it would make sense to create a hybrid solution. What I have in mind is to use the previous values for the default behavior, but if a given condition is detected, switch to the current values and code. That way, for people like me, the old way would still work without creating soft lockups when C1E is enabled, and for others it would be the current way of doing things. I don't think it would be too difficult to make, would it? And would it make sense? A rough sketch of what I mean is below.
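To make the idea concrete, here is a rough, purely illustrative sketch. The detection helper hpet_needs_paranoid_min_delta() and the numeric values are made up; the real condition would have to be whatever actually identifies the chipsets with the delayed comparator write (or the C1E situation), and this is not a real patch.

/* Hypothetical sketch only -- names and numbers below are made up. */

#define HPET_MIN_CYCLES_DEFAULT		8	/* pre-f1c18071 value */
#define HPET_MIN_CYCLES_PARANOID	128	/* some suitably large value */

static unsigned int hpet_min_cycles = HPET_MIN_CYCLES_DEFAULT;

/*
 * Made-up helper: it would have to detect the systems that really need
 * the larger threshold (delayed comparator writes), e.g. via a chipset
 * quirk table, instead of penalising every machine.
 */
static bool hpet_needs_paranoid_min_delta(void)
{
	return false;	/* placeholder */
}

static void __init hpet_select_min_cycles(void)
{
	if (hpet_needs_paranoid_min_delta())
		hpet_min_cycles = HPET_MIN_CYCLES_PARANOID;
}

The ETIME check would then compare against hpet_min_cycles instead of the fixed HPET_MIN_CYCLES constant, and the clockevent's minimum delta would be derived from it in the same way.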
By the way, this is still present under 3.0-rc7. Adding hpet=force clocksource=hpet to the kernel parameters solves the problem when C1E is enabled. I'm pretty sure I'm not the only one hitting this bug; see the following:
http://ubuntuforums.org/showthread.php?t=1742834
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/776588
and maybe also (which I have to test)
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/776588
Very interestingly, it seems to hit only some CPUs. I just changed my CPU, moving from a Phenom X3 8550 to a Phenom II X6 1065T, and it seems the bug just went away. I'll investigate this a bit more, but I know C1E management is handled differently on the Phenom II.
Since I changed the motherboard, I can't continue looking at it. Closing this bug.