Bug 10729

Summary: Freezes with kernel 2.6.25, worked perfectly with 2.6.24
Product: Platform Specific/Hardware
Reporter: Andreas Juch (kernel-bt)
Component: x86-64
Assignee: David Brownell (dbrownell)
Status: CLOSED CODE_FIX
Severity: normal
CC: bunk, celejar, dbrownell, tglx
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 2.6.25-2-amd64
Subsystem:
Regression: Yes
Bisected commit-id:
Attachments:
 - dmesg on kernel 2.6.24
 - dmesg on kernel 2.6.25
 - configs and dmesg outputs of the tests i ran
 - kconfig tweaks

Description Andreas Juch 2008-05-16 20:13:20 UTC
Latest working kernel version: 2.6.24-1-amd64
Earliest failing kernel version: 2.6.25
Distribution: Debian testing/unstable
Hardware Environment:
Intel DG965WH Mainboard,
Intel C2D E6600 Processor,
3 GiB RAM,
using the internal Intel graphics,
only other H/W: Technotrend/Siemens DVB-S Card.
Software Environment: Debian testing/unstable with most packages from testing.
Problem Description:
The system freezes after some hours (about four or so) of uptime. Nothing can be done after that. The logs don't show any data about the crash. Kernel 2.6.24 works very well on the same machine, so a hardware failure seems unlikely. The dmesg outputs of .24 and .25 show lots of differences, but the messages closest to the crash are "rtc: lost n interrupts" (n is from 1 to 8). The attached dmesg outputs are from Debian kernels, but I tried 2.6.25 from kernel.org and got exactly the same freezes, so I think the Debian patches are irrelevant here. The freezes always happened during X.org sessions; sometimes I was working on the system, sometimes it froze while I was away from it. I'll attach both system logs as this is the only information I have about the freezes. Alt-sysrq-t has no effect.
Steps to reproduce:
Work at my PC or just let it run long enough :-(
Comment 1 Andreas Juch 2008-05-16 20:13:55 UTC
Created attachment 16169 [details]
dmesg on kernel 2.6.24
Comment 2 Andreas Juch 2008-05-16 20:14:47 UTC
Created attachment 16170 [details]
dmesg on kernel 2.6.25
Comment 3 Andrew Morton 2008-05-16 20:38:09 UTC
I marked this as a regression and tentatively recategorised it to x86_64.

Thomas, we've seen a few of those rtc messages - is it likely to be associated
with this problem?
Comment 4 Thomas Gleixner 2008-05-17 02:22:29 UTC
Yes, this might be the RTC changes in drivers/rtc vs. the legacy rtc code.

David ?
Comment 5 Andreas Juch 2008-05-18 12:06:57 UTC
Found it. I stopped chrony, the NTP synchronization daemon, and the system has now been running without the lost interrupts messages for four and a half hours. No freezes up to now, so this might be a duplicate of bug #10369. If it still crashes, I'll write a comment. Thanks for now!
Comment 6 David Brownell 2008-05-18 12:48:45 UTC
I speculate that the legacy RTC has HPET-related bugs still/again/...

One experiment would be to try using a proper "RTC framework" config with rtc-cmos statically linked.  Then try diddling HPET config options.

I don't know that anything has changed recently, but as usual it would be good to verify the current (2.6.26-rc2) kernel has these issues...
Comment 7 Andreas Juch 2008-05-18 21:56:54 UTC
I just tested 2.6.26-rc2 with the same config as 2.6.25 (make oldconfig) and it ran for 7 hours with chrony without freeze and without any lost interrupts messages. So the current development version doesn't seem to have that problem.

rtc-cmos doesn't seem to be used here. It's compiled as a module here and lsmod | grep rtc doesn't give any results. Here's a grep of the kernel rtc config with 2.6.26-rc2:
CONFIG_HPET_EMULATE_RTC=y
# CONFIG_HPET_RTC_IRQ is not set
CONFIG_RTC_LIB=m
CONFIG_RTC_CLASS=m
# RTC interfaces
CONFIG_RTC_INTF_SYSFS=y
CONFIG_RTC_INTF_PROC=y
CONFIG_RTC_INTF_DEV=y
# CONFIG_RTC_INTF_DEV_UIE_EMUL is not set
# CONFIG_RTC_DRV_TEST is not set
# I2C RTC drivers
CONFIG_RTC_DRV_DS1307=m
CONFIG_RTC_DRV_DS1374=m
CONFIG_RTC_DRV_DS1672=m
CONFIG_RTC_DRV_MAX6900=m
CONFIG_RTC_DRV_RS5C372=m
CONFIG_RTC_DRV_ISL1208=m
CONFIG_RTC_DRV_X1205=m
CONFIG_RTC_DRV_PCF8563=m
CONFIG_RTC_DRV_PCF8583=m
CONFIG_RTC_DRV_M41T80=m
# CONFIG_RTC_DRV_M41T80_WDT is not set
CONFIG_RTC_DRV_S35390A=m
# SPI RTC drivers
CONFIG_RTC_DRV_MAX6902=m
CONFIG_RTC_DRV_R9701=m
CONFIG_RTC_DRV_RS5C348=m
# Platform RTC drivers
CONFIG_RTC_DRV_CMOS=m
CONFIG_RTC_DRV_DS1511=m
CONFIG_RTC_DRV_DS1553=m
CONFIG_RTC_DRV_DS1742=m
CONFIG_RTC_DRV_STK17TA8=m
CONFIG_RTC_DRV_M48T86=m
CONFIG_RTC_DRV_M48T59=m
CONFIG_RTC_DRV_V3020=m
# on-CPU RTC drivers

diff -u between .25 and .26 rtc config:
--- /tmp/25.rtc	2008-05-19 06:49:54.000000000 +0200
+++ /tmp/26.rtc	2008-05-19 06:50:04.000000000 +0200
@@ -1,11 +1,7 @@
 CONFIG_HPET_EMULATE_RTC=y
-CONFIG_RTC=y
 # CONFIG_HPET_RTC_IRQ is not set
-CONFIG_SND_RTCTIMER=m
-CONFIG_SND_SEQ_RTCTIMER_DEFAULT=y
 CONFIG_RTC_LIB=m
 CONFIG_RTC_CLASS=m
-# Conflicting RTC option has been selected, check GEN_RTC and RTC
 # RTC interfaces
 CONFIG_RTC_INTF_SYSFS=y
 CONFIG_RTC_INTF_PROC=y

I don't know if I have time for playing with all HPET options since every option needs my computer to run for ~4 hours with it, but I'll try some of them.
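
The comparison of the RTC options can also be sketched as a small script. This is only an illustration in Python: the helper names `rtc_lines` and `rtc_diff` are made up, and the two config fragments are shortened copies of the options quoted in this comment rather than full kernel configs.

```python
import difflib


def rtc_lines(config_text):
    """Return the lines that mention RTC, similar to `grep RTC`."""
    return [line for line in config_text.splitlines() if "RTC" in line]


def rtc_diff(old_config, new_config):
    """Unified diff of the RTC-related lines of two config texts."""
    return list(difflib.unified_diff(
        rtc_lines(old_config), rtc_lines(new_config),
        fromfile="25.rtc", tofile="26.rtc", lineterm=""))


# Shortened fragments based on the options quoted in this comment.
CONFIG_25 = """\
CONFIG_HPET_EMULATE_RTC=y
CONFIG_RTC=y
# CONFIG_HPET_RTC_IRQ is not set
CONFIG_RTC_LIB=m
CONFIG_RTC_CLASS=m
"""

CONFIG_26 = """\
CONFIG_HPET_EMULATE_RTC=y
# CONFIG_HPET_RTC_IRQ is not set
CONFIG_RTC_LIB=m
CONFIG_RTC_CLASS=m
"""

if __name__ == "__main__":
    print("\n".join(rtc_diff(CONFIG_25, CONFIG_26)))
```

For real configs the two texts would be read from the respective .config files instead of the inline strings.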
Comment 8 Anonymous Emailer 2008-05-19 18:07:05 UTC
Reply-To: david-b@pacbell.net

On Sunday 18 May 2008, bugme-daemon@bugzilla.kernel.org wrote:
> I just tested 2.6.26-rc2 with the same config as 2.6.25 (make oldconfig) and
> it ran for 7 hours with chrony without freeze and without any lost interrupts
> messages. So the current development version doesn't seem to have that
> problem.
> 
> rtc-cmos doesn't seem to be used here. It's compiled as a module here and
> lsmod | grep rtc doesn't give any results.

You should try this with two different configs:

 - Your current RC2 config with the RTC framework, and rtc-cmos,
   statically linked;

 - Alternate config, without the RTC framework but with the legacy
   RTC driver statically linked

The freeze seemed to be related to the kernel handling the RTC
update IRQ.  Your "it works" configuration doesn't have the kernel
handling that IRQ, so it's not a good verification that the problem
went away ...
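
To tell which of these setups a given .config corresponds to, a rough classifier can help. The following is a minimal Python sketch under simplifying assumptions: the names `parse_config` and `rtc_setup` are hypothetical, only statically linked (`=y`) drivers are counted because a module that is never loaded does not handle the IRQ, and `CONFIG_RTC` is taken to be the legacy driver as in the configs quoted in this report.

```python
import re

# Matches lines such as "CONFIG_FOO=y"; "# CONFIG_FOO is not set" is skipped.
_OPTION = re.compile(r"^CONFIG_(\w+)=(\S+)$")


def parse_config(text):
    """Map option names (without the CONFIG_ prefix) to their values."""
    opts = {}
    for line in text.splitlines():
        match = _OPTION.match(line.strip())
        if match:
            opts[match.group(1)] = match.group(2)
    return opts


def rtc_setup(text):
    """Classify which RTC driver, if any, is statically linked."""
    opts = parse_config(text)
    framework = (opts.get("RTC_CLASS") == "y"
                 and opts.get("RTC_DRV_CMOS") == "y")
    legacy = opts.get("RTC") == "y"
    if framework and legacy:
        return "conflict"
    if framework:
        return "framework"
    if legacy:
        return "legacy"
    return "none built in"


# RTC options of the 2.6.26-rc2 test quoted above (rtc-cmos only as a module).
RC2_CONFIG = """\
CONFIG_HPET_EMULATE_RTC=y
# CONFIG_HPET_RTC_IRQ is not set
CONFIG_RTC_CLASS=m
CONFIG_RTC_DRV_CMOS=m
"""

if __name__ == "__main__":
    print(rtc_setup(RC2_CONFIG))
```

With the options quoted from the 2.6.26-rc2 test this reports that no RTC driver is built in, which matches the observation that the kernel was not handling the RTC IRQ in that run.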
Comment 9 Andreas Juch 2008-05-24 10:18:35 UTC
Ok. I tried the two configs with 2.6.25.4 and 2.6.26-rc3. Here are the results:

* the second config crashed with both kernels
* the first config had lost interrupts in kernel 2.6.25.4, but they seem to disappear in .26

I hope that you meant "device drivers/character devices/enhanced real time clock support" with "legacy RTC driver"... I'll attach the complete kernel config of every tested kernel and config and the dmesg output of each of them. I hope I've done it right this time.
Comment 10 Andreas Juch 2008-05-24 10:19:55 UTC
Created attachment 16267 [details]
configs and dmesg outputs of the tests i ran
Comment 11 David Brownell 2008-05-24 12:21:18 UTC
Created attachment 16269 [details]
kconfig tweaks

> Ok. I tried the two configs with 2.6.25.4 and 2.6.26rc3.
> Here are the results:
> 
> * the second config crashed with both kernels

Where "second config" is "legacy RTC driver in use",
also HPET ... but not CONFIG_HPET_RTC_IRQ.  I suggest
you enable that (char drivers, under HPET); near as
I can tell, disabling HPET_RTC_IRQ while HPET is in
use should not be permitted.

(If that checks out, this patch should probably get
merged ... given an ACK from someone more HPET-savvy
than me.)
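
The rule suggested above (the legacy RTC driver together with HPET should not be built without HPET_RTC_IRQ) can be sketched as a check over a .config text. This is only an illustration in Python and not the attached Kconfig patch: the function names are made up, and `CONFIG_HPET_EMULATE_RTC=y` is assumed to stand for "HPET is in use" because it is the HPET-related symbol quoted in this report.

```python
def enabled(config_text, symbol):
    """True if the line CONFIG_<symbol>=y appears in the config text."""
    wanted = f"CONFIG_{symbol}=y"
    return any(line.strip() == wanted for line in config_text.splitlines())


def hpet_rtc_irq_missing(config_text):
    """Detect the suspect combination: legacy RTC + HPET, no HPET_RTC_IRQ."""
    return (enabled(config_text, "RTC")                  # legacy RTC driver
            and enabled(config_text, "HPET_EMULATE_RTC")  # assumed: HPET in use
            and not enabled(config_text, "HPET_RTC_IRQ"))


# Options as reported for the freezing "second config" (legacy driver).
FREEZING = """\
CONFIG_HPET_EMULATE_RTC=y
CONFIG_RTC=y
# CONFIG_HPET_RTC_IRQ is not set
"""

# Same options with HPET_RTC_IRQ enabled, as suggested in this comment.
SUGGESTED = """\
CONFIG_HPET_EMULATE_RTC=y
CONFIG_RTC=y
CONFIG_HPET_RTC_IRQ=y
"""

if __name__ == "__main__":
    for name, text in (("freezing", FREEZING), ("suggested", SUGGESTED)):
        print(name, hpet_rtc_irq_missing(text))
```

A check like this only flags the configuration; whether a flagged kernel actually freezes still has to be verified by running it.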


> * the first config had lost interrupts in kernel 2.6.25.4,
>   but they seem to disappear in .26

Where "first config/2.6.25.4" is invalid but ends up using
the legacy RTC driver without HPET_RTC_IRQ ... while the
"first config/2.6.26-rc3" is valid and uses only the RTC
framework.


> I hope that you meant "device drivers/character devices/enhanced real time
> clock support" with "legacy RTC driver"

Yes.  That's also clarified in this patch.
Comment 12 Andreas Juch 2008-05-25 08:24:05 UTC
The patch seems to work. I have applied the patch to 2.6.25.4 and used the "goofy" config from the Debian 2.6.25 kernels with make oldconfig. HPET_RTC_IRQ became y automagically and the system has been running with chrony for 18.5 hours.

If you want, I can do the two test cases from above again with the patch, but I'll need a few days for that.

Thanks a lot for your help!

So the current working config is:
CONFIG_HPET_EMULATE_RTC=y
CONFIG_RTC=y
CONFIG_HPET_RTC_IRQ=y
CONFIG_SND_RTCTIMER=m
CONFIG_SND_SEQ_RTCTIMER_DEFAULT=y
CONFIG_RTC_LIB=m
CONFIG_RTC_CLASS=m
# Conflicting RTC option has been selected, check GEN_RTC and RTC
# RTC interfaces
Comment 13 Anonymous Emailer 2008-05-25 09:22:25 UTC
Reply-To: david-b@pacbell.net

On Sunday 25 May 2008, bugme-daemon@bugzilla.kernel.org wrote:
> So the current working config is:

Actually there are now two working configs.  The one you list
is the legacy driver config.  The other uses the newer code;
your previous note reported that it works ok.

The question in my mind is now:  does this "working" legacy
config behave right on hardware that *doesn't* have HPET?
Comment 14 Andreas Juch 2008-05-25 10:31:06 UTC
The newer code only worked in 2.6.26-rc3. With 2.6.25.4 I had lost interrupts, but I didn't wait hours to see whether it would crash.

I'm now testing that "working" legacy config with HPET disabled in BIOS. I don't know if that's equal to a system without HPET.
Comment 15 Anonymous Emailer 2008-05-25 12:13:14 UTC
Reply-To: david-b@pacbell.net

On Sunday 25 May 2008, bugme-daemon@bugzilla.kernel.org wrote:
> I'm now testing that "working" legacy config with HPET disabled in BIOS. I
> don't know if that's equal to a system without HPET.

If the kernel doesn't find and enable the HPET, it should
be equivalent to an HPET-less system.
Comment 16 Andreas Juch 2008-05-26 07:38:46 UTC
The system ran fine with HPET disabled for about 10 hours without lost interrupts or crash, so I guess it's ok on systems without HPET too.
Comment 17 celejar 2008-05-27 16:59:48 UTC
Hi,

I believe that I'm also seeing this bug on my laptop (Acer Aspire 690-2672,
Intel 82801G (ICH7 Family)).  Running Debian Sid, I get complete freezes (no
screen activity, no response to input, can't ssh in) with nothing particularly
helpful (I don't recall if I got the 'lost n interrupts' message) in the logs,
after various amounts of uptime, but always within about fifteen minutes.  I'm
pretty sure that it is this bug, since after disabling chrony, as per Andreas'
suggestion, the problem seems to have gone away.  Also, the earliest I see the
problem is immediately after init starts chrony.

Possibly useful information:

I) I'm seeing this with i386 (686)
II) The freezes come much earlier for me than the four hours reported by Andreas.
Comment 18 Laurent Luyckx 2008-06-04 14:31:08 UTC
Had the same issue on IBM x336 running Debian Etch.

The system stopped working (as described by the others in this report) after doing a manual ntpdate. I have a cron job doing it every hour and 10 minutes. The server crashed twice after 25 hours of uptime. I'm now back on 2.6.24.2 and everything is fine again.

This is my RTC config:

$ grep -i RTC linux-2.6.25.4/.config
CONFIG_HPET_EMULATE_RTC=y
CONFIG_RTC=y
# CONFIG_RTC_CLASS is not set
Comment 19 David Brownell 2008-06-05 22:38:04 UTC
So if you apply the Kconfig patch then rebuild your kernel, does it work?  If not, I'd assume it's a different bug.
Comment 20 Adrian Bunk 2008-06-13 00:34:22 UTC
fixed by commit e6d2bb2bacb43ff03b0f458108d71981d58e775a