Bug 31122 - System freezes randomly
Summary: System freezes randomly
Status: RESOLVED INSUFFICIENT_DATA
Alias: None
Product: Drivers
Classification: Unclassified
Component: Platform
Hardware: All Linux
Importance: P1 high
Assignee: drivers_platform@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2011-03-15 15:52 UTC by Nicos Gollan
Modified: 2012-08-20 14:58 UTC
CC List: 3 users

See Also:
Kernel Version:
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
lspci output (172.87 KB, text/plain)
2011-03-16 18:29 UTC, Nicos Gollan
dmesg output 2.6.38, apic=debug, hpet=verbose (74.47 KB, text/plain)
2011-03-17 18:41 UTC, Nicos Gollan
dmesg output 2.6.35, apic=debug, hpet=verbose (67.53 KB, text/plain)
2011-03-17 18:41 UTC, Nicos Gollan
Timer instability (2.6.38) (62.20 KB, text/plain)
2011-03-21 08:47 UTC, Nicos Gollan
dmesg output 2.6.38, apic=debug, hpet=verbose (74.83 KB, text/plain)
2011-03-21 08:49 UTC, Nicos Gollan
dmesg output, 2.6.38.2 (60.75 KB, text/plain)
2011-04-09 10:35 UTC, Nicos Gollan
ls -lR of Debian patches in kernel package (7.62 KB, text/plain)
2011-04-19 07:05 UTC, Nicos Gollan

Description Nicos Gollan 2011-03-15 15:52:38 UTC
For reference, see Bug #28722 especially

https://bugzilla.kernel.org/attachment.cgi?id=47042 (dmesg output)

My system freezes hard after some time when using HPET as clocksource. Switching to the ACPI PM timer seems to resolve the problem. I've had freezes with all kernels from 2.6.36 to 2.6.38-rc6; the -rc6 runs stable with HPET disabled, and the release version does not seem to have any related changes (the only HPET changes are related to wakeup from S3, which I am not using).
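
When comparing clocksources like this, it helps to confirm which one the kernel is actually using at runtime rather than relying on the boot parameters alone. The following is a minimal sketch, assuming the usual sysfs files `current_clocksource` and `available_clocksource` under `/sys/devices/system/clocksource/clocksource0/`; the function names are illustrative and not part of any existing tool.

```python
from pathlib import Path

# Assumed location of the clocksource sysfs files on a Linux system.
SYSFS_DIR = Path("/sys/devices/system/clocksource/clocksource0")


def parse_clocksource_state(current_text, available_text):
    """Return (current, available) from the raw contents of the two sysfs files."""
    current = current_text.strip()
    available = available_text.split()
    return current, available


def read_clocksource_state(sysfs_dir=SYSFS_DIR):
    """Read both sysfs files and parse them."""
    current_text = (sysfs_dir / "current_clocksource").read_text()
    available_text = (sysfs_dir / "available_clocksource").read_text()
    return parse_clocksource_state(current_text, available_text)


if __name__ == "__main__":
    if SYSFS_DIR.is_dir():
        current, available = read_clocksource_state()
        print("current clocksource:", current)
        print("available clocksources:", ", ".join(available))
    else:
        print("no clocksource sysfs directory found")
```

Running this once after boot and again after a few hours of uptime would show whether the kernel has silently switched away from the requested clocksource.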

This is (still) a regression when comparing to 2.6.35 which runs rock solid.

Testing this is hard and risky: it sometimes takes 4-5 hours before the system crashes, I'm only observing it on a machine I need, and I've already lost data twice because of it.
Comment 1 herrmann.der.user 2011-03-16 14:04:48 UTC
Can you please boot 2.6.35 with apic=debug and hpet=verbose and then boot
2.6.38 with apic=debug and HPET disabled?

Also please provide lspci -nnxxxx (run as root).
And what's your mainboard?
Comment 2 Nicos Gollan 2011-03-16 18:29:46 UTC
Created attachment 50982 [details]
lspci output

Output of lspci -nnxxxx attached, the board is an Asus M4A89TD Pro (not the USB3 version; BIOS version 1006, which appears to be the most recent).
Comment 3 herrmann.der.user 2011-03-17 08:05:22 UTC
Sorry for not being specific enough.

Can you also please attach dmesg output when booting
- 2.6.35 with "apic=debug hpet=verbose"
- 2.6.38(-rc6) with "apic=debug hpet=verbose" (optionally disable HPET if it freezes during boot)

You are right, the fix for an SB800 issue that went into 2.6.38
after -rc6 was related to suspend/resume, but it's also worth
trying 2.6.38. And the requested dmesg output is required to see whether
your system would be affected by the SB800 APIC pin2 polarity
quirk at all.

Thanks, Andreas
Comment 4 Nicos Gollan 2011-03-17 18:41:07 UTC
Created attachment 51062 [details]
dmesg output 2.6.38, apic=debug, hpet=verbose

The lspci output was just the first serving; here's the dmesg output. The system boots alright with HPET under all kernels, but freezes eventually, sometimes after as little as 10 minutes, sometimes after 8-10 hours.
Comment 5 Nicos Gollan 2011-03-17 18:41:41 UTC
Created attachment 51072 [details]
dmesg output 2.6.35, apic=debug, hpet=verbose
Comment 6 Nicos Gollan 2011-03-19 19:23:19 UTC
In "happy" news, something just happened that makes me think it's not (only) a HPET issue.

The system froze completely (even the screen turned off!) when plugging in a USB card reader. That was on 2.6.38 release. The card reader itself works just fine.

So, apparently all kernels >= 2.6.36 are still broken on my system. I'm now actually afraid to even boot into Linux out of fear for my data.
Comment 7 Nicos Gollan 2011-03-19 20:29:24 UTC
I'm actually ashamed to write this.

I can reproduce the crash with the screen turning off by rubbing the cardreader on a wool pullover before plugging it in. Seems like (1) my case's front USB ports are not a pinnacle of electrical engineering, and (2) completely going AWOL is my mainboard's failure mode when seeing a static spike on USB.

I've tried that multiple times, and the screen always turned off, so it still leaves the issue of the system just freezing after a longer time.

Sorry for the noise.
Comment 8 Nicos Gollan 2011-03-21 08:47:49 UTC
Created attachment 51452 [details]
Timer instability (2.6.38)

And back to "normal" freezes, this time with the ACPI timer. The system just locked up without any interaction (nobody was even near the machine to zap it); the system also is not set up to go to sleep modes after some time of inactivity.

Note the last few lines in the kernel log:

Mar 21 09:09:21 elysium kernel: [ 1607.319988] hrtimer: interrupt took 635401502 ns
Mar 21 09:13:43 elysium kernel: [ 1868.762707] Clocksource tsc unstable (delta = 3926721646 ns)
Mar 21 09:16:31 elysium rsyslogd: [origin software="rsyslogd" swVersion="5.7.8" x-pid="1227" x-info="http://www.rsyslog.com"] rsyslogd was HUPed
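
The two timer messages above are easier to judge once the nanosecond values are converted to seconds. The following is a minimal sketch that scans log text for these two message formats, exactly as they appear in the excerpt; the function name and the tuple layout are illustrative assumptions.

```python
import re

# Patterns matching the two kernel messages quoted in the log excerpt.
HRTIMER_RE = re.compile(r"hrtimer: interrupt took (\d+) ns")
UNSTABLE_RE = re.compile(r"Clocksource (\S+) unstable \(delta = (-?\d+) ns\)")


def find_timer_anomalies(log_text):
    """Return a list of (kind, clocksource, seconds) tuples found in the log."""
    anomalies = []
    for line in log_text.splitlines():
        match = HRTIMER_RE.search(line)
        if match:
            anomalies.append(("hrtimer", None, int(match.group(1)) / 1e9))
            continue
        match = UNSTABLE_RE.search(line)
        if match:
            anomalies.append(("unstable", match.group(1), int(match.group(2)) / 1e9))
    return anomalies


SAMPLE = """\
Mar 21 09:09:21 elysium kernel: [ 1607.319988] hrtimer: interrupt took 635401502 ns
Mar 21 09:13:43 elysium kernel: [ 1868.762707] Clocksource tsc unstable (delta = 3926721646 ns)
"""

if __name__ == "__main__":
    for kind, source, seconds in find_timer_anomalies(SAMPLE):
        print(kind, source, f"{seconds:.3f} s")
```

For the excerpt, this shows a timer interrupt that took roughly 0.64 seconds and a TSC delta of roughly 3.93 seconds, both far beyond normal jitter.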
Comment 9 Nicos Gollan 2011-03-21 08:49:03 UTC
Created attachment 51462 [details]
dmesg output 2.6.38, apic=debug, hpet=verbose

dmesg with apic/HPET debug options set, this time with the release version.
Comment 10 Nicos Gollan 2011-04-09 10:32:02 UTC
Randomly froze again with 2.6.38.2

Anyone?
Comment 11 Nicos Gollan 2011-04-09 10:35:17 UTC
Created attachment 53892 [details]
dmesg output, 2.6.38.2
Comment 12 john stultz 2011-04-18 19:38:26 UTC
(In reply to comment #7)
> I'm actually ashamed to write this.
> 
> I can reproduce the crash with the screen turning off by rubbing the
> cardreader
> on a wool pullover before plugging it in. Seems like (1) my case's front USB
> ports are not a pinnacle of electrical engineering, and (2) completely going
> AWOL is my mainboard's failure mode when seeing a static spike on USB.
> 
> I've tried that multiple times, and the screen always turned off, so it still
> leaves the issue of the system just freezing after a longer time.

Just to be clear, does this sort of hang (ie: non-random, seemingly usb/electrical related) occur when using the ACPI PM timer as a clocksource?
Comment 13 Nicos Gollan 2011-04-18 19:56:45 UTC
(In reply to comment #12)

> Just to be clear, does this sort of hang (ie: non-random, seemingly
> usb/electrical related) occur when using the ACPI PM timer as a clocksource?

The non-random static zap always works as long as the system is powered up (BIOS setup, Linux with any clocksource, Windows), and it always had the same symptoms. I've stopped trying though after making sure a few times. It never happened when grounding the device before plugging it in. I'm very confident that it's an unrelated SNAFU.

The system is too stable with older kernels and other OSs to chalk the freezes up to static. With 2.6.38.2 and the PM timer, I'm getting audible timing issues when playing music, and sometimes even short visible freezes.
Comment 14 john stultz 2011-04-18 20:10:17 UTC
(In reply to comment #13)
> (In reply to comment #12)
> 
> > Just to be clear, does this sort of hang (ie: non-random, seemingly
> > usb/electrical related) occur when using the ACPI PM timer as a
> clocksource?
> 
> The non-random static zap always works as long as the system is powered up
> (BIOS setup, Linux with any clocksource, Windows), and it always had the same
> symptoms. I've stopped trying though after making sure a few times. It never
> happened when grounding the device before plugging it in. I'm very confident
> that it's an unrelated SNAFU.

Ok. So the non-random USB hangs are just bad hardware, and are a separate issue.

> The system is too stable with older kernels and other OSs to chalk the freezes
> up to static. With 2.6.38.2 and the PM timer, I'm getting audible timing
> issues
> when playing music, and sometimes even short visible freezes.

So again to try to untangle the issues:
1) There's a seemingly random hang that only happens with the HPET as the clocksource.

2) There are other strange timing issues and short term freezes seen with the ACPI PM clocksource.

Does booting with "nohz=off" change the second issue at all?
Comment 15 Nicos Gollan 2011-04-18 20:38:05 UTC
(In reply to comment #14)

> So again to try to untangle the issues:
> 1) There's a seemingly random hang that only happens with the HPET as the
> clocksource.

That's what I thought at first. The system would hang very often after 20 minutes or a few hours with HPET enabled. With the PM timer, it happens far less frequently (once every few days with 4-10 hours uptime per day), but it still happens.

I've also had the system run unattended without local user interaction for ~8-10 hours daily for a week without a freeze. (That opens a tangent into driver-land?)

> 2) There are other strange timing issues and short term freezes seen with the
> ACPI PM clocksource.
> 
> Does booting with "nohz=off" change the second issue at all?

No. 2.6.38.2 gives very choppy audio playback with and without nohz, 2.6.35 works fine. Both are running with the same kernel parameters, without HPET:

    hpet=disable clocksource=acpi_pm nmi_watchdog=1 splash
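
To verify that both kernels really boot with the same parameters, the running kernel's command line can be compared directly. The following is a minimal sketch, assuming the command line is available in `/proc/cmdline`; it splits on whitespace only, so quoted values containing spaces are not handled, and `parse_cmdline` is an illustrative name.

```python
from pathlib import Path


def parse_cmdline(cmdline):
    """Split a kernel command line into a dict; bare flags map to None."""
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


if __name__ == "__main__":
    proc_cmdline = Path("/proc/cmdline")
    if proc_cmdline.exists():
        params = parse_cmdline(proc_cmdline.read_text())
        # Show only the timer-related parameters discussed in this bug.
        for key in ("hpet", "clocksource", "nohz", "nmi_watchdog"):
            print(key, "=", params.get(key, "(not set)"))
```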
Comment 16 john stultz 2011-04-18 22:05:12 UTC
(In reply to comment #15)
> (In reply to comment #14)
> 
> > So again to try to untangle the issues:
> > 1) There's a seemingly random hang that only happens with the HPET as the
> > clocksource.
> 
> That's what I thought at first. The system would hang very often after 20
> minutes or a few hours with HPET enabled. With the PM timer, it happens far
> less
> frequently (once every few days with 4-10 hours uptime per day), but it still
> happens.

Ah. Yuck. So this is not clearly a timekeeping issue, but may just be more likely to trigger with the HPET clocksource.

More interestingly, looking at the 2.6.35 vs 2.6.38 dmesg logs, I see in the 2.6.38 log:
Override clocksource acpi_pm is not HRT compatible. Cannot switch while in HRT/NOHZ mode

So that means you're probably still using the TSC here. Further, I'm really not sure why the acpi_pm is not HRT compatible, as that isn't normally the case (sure enough, in 2.6.35, it does switch to using acpi_pm)
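
A rejected override like this is easy to miss in a long boot log. The following is a minimal sketch that searches saved dmesg text for the message quoted above; it matches only that exact wording, and the function name is an illustrative assumption.

```python
import re

# Matches the override failure message quoted from the 2.6.38 log.
OVERRIDE_RE = re.compile(r"Override clocksource (\S+) is not HRT compatible")


def rejected_overrides(dmesg_text):
    """Return the names of clocksources whose override the kernel refused."""
    return [match.group(1) for match in OVERRIDE_RE.finditer(dmesg_text)]


SAMPLE = (
    "Override clocksource acpi_pm is not HRT compatible. "
    "Cannot switch while in HRT/NOHZ mode\n"
)

if __name__ == "__main__":
    for name in rejected_overrides(SAMPLE):
        print("override rejected for clocksource:", name)
```

A non-empty result means the `clocksource=` boot parameter did not take effect, so test results attributed to that clocksource need to be re-checked.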

Are you building these kernels yourself? Or are these distribution kernels that may have additional patches applied?
Comment 17 Nicos Gollan 2011-04-19 07:05:55 UTC
Created attachment 54672 [details]
ls -lR of Debian patches in kernel package

(In reply to comment #16)

> More interestingly, looking at the 2.6.35 vs 2.6.38 dmesg logs, I see in the
> 2.6.38 log:
> Override clocksource acpi_pm is not HRT compatible. Cannot switch while in
> HRT/NOHZ mode

Ah, that's only visible with debugging output. Would be nice to see that one unconditionally. The other fun thing is that HPET wasn't even disabled for the debug dmesg output, and it still switched back to TSC.

> So that means you're probably still using the TSC here. Further, I'm really
> not
> sure why the acpi_pm is not HRT compatible, as that isn't normally the case
> (sure enough, in 2.6.35, it does switch to using acpi_pm)
>
> Are you building these kernels yourself? Or are these distribution kernels
> that
> may have additional patches applied?

Those are all Debian distribution kernels. I'm attaching a listing of the package's debian/patches directory.

(Tangent ahead)

Could any of that be related to the observed weirdness w.r.t. the NMI? On the affected kernels, I'm getting maybe 20-50 NMIs per CPU per *day* (the numbers are the same or at least very close to the PMI counts), while they're ticking up nicely on 2.6.35.
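
The NMI and PMI counts can be compared per CPU with a short script. The following is a minimal sketch, assuming the usual `/proc/interrupts` layout with one labelled row per interrupt type and one column per CPU; the sample rows are made up for illustration and are not taken from the affected system.

```python
from pathlib import Path


def per_cpu_counts(interrupts_text, label):
    """Return the per-CPU counters for one row label such as "NMI" or "PMI"."""
    prefix = label + ":"
    for line in interrupts_text.splitlines():
        fields = line.split()
        if fields and fields[0] == prefix:
            counts = []
            for field in fields[1:]:
                if not field.isdigit():
                    break  # the description text starts here
                counts.append(int(field))
            return counts
    return []


# Made-up sample in the assumed /proc/interrupts format.
SAMPLE = """\
           CPU0       CPU1
  0:         45          0   IO-APIC-edge      timer
NMI:         31         29   Non-maskable interrupts
PMI:         31         29   Performance monitoring interrupts
"""

if __name__ == "__main__":
    path = Path("/proc/interrupts")
    text = path.read_text() if path.exists() else SAMPLE
    print("NMI per CPU:", per_cpu_counts(text, "NMI"))
    print("PMI per CPU:", per_cpu_counts(text, "PMI"))
```

Sampling these counters at a fixed interval on both kernels would turn the "20-50 per day" observation into a directly comparable rate.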
