Kernels starting with 2.6.36 crash periodically with:
BUG: scheduling while atomic: kworker/0:1/0/0x10000100
It's always in kworker. I cannot determine a usage pattern that provokes it, but all builds of 2.6.36 and 2.6.37 so far show the problem, while 2.6.35 seems to run perfectly stably. I'm using Debian builds of the kernels, so I have no idea which upstream versions they are based on (the changelog and distribution versioning are no help). I'm attaching a screenshot of the last console output. This is on a 6-core system, and I guess there are 3-4 more screens of that output, but there is no scrolling and no logging.
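For anyone trying to read the message: the kernel prints it as task name, PID, and the raw preempt_count in hex. Below is a minimal sketch for splitting such a line. The bit layout is an assumption (8 preemption bits, 8 softirq bits, 10 hardirq bits, one NMI bit, and PREEMPT_ACTIVE at 0x10000000, as I believe x86 kernels of this era use), and the function name `decode_sched_bug` is made up for illustration. Check the layout against the headers of the kernel actually running before drawing conclusions.

```python
import re

# ASSUMED preempt_count layout (x86, 2.6.3x era). These constants are not
# taken from the report; verify them against the kernel headers in use.
PREEMPT_MASK = 0x000000FF    # preemption-disable nesting depth
SOFTIRQ_MASK = 0x0000FF00    # softirq / bottom-half-disable count
SOFTIRQ_SHIFT = 8
HARDIRQ_MASK = 0x03FF0000    # hardirq nesting depth
HARDIRQ_SHIFT = 16
NMI_MASK = 0x04000000        # set while in NMI context
PREEMPT_ACTIVE = 0x10000000  # set while a preemption is in progress

# The task name may itself contain slashes (e.g. "kworker/0:1"), so the
# PID and the hex count are matched from the right-hand side.
PATTERN = re.compile(
    r"BUG: scheduling while atomic: "
    r"(?P<comm>.+)/(?P<pid>\d+)/(?P<count>0x[0-9a-fA-F]+)"
)


def decode_sched_bug(line):
    """Split a 'scheduling while atomic' line into its fields.

    Returns None if the line does not contain the message.
    """
    match = PATTERN.search(line)
    if match is None:
        return None
    count = int(match.group("count"), 16)
    return {
        "comm": match.group("comm"),
        "pid": int(match.group("pid")),
        "preempt_depth": count & PREEMPT_MASK,
        "softirq_count": (count & SOFTIRQ_MASK) >> SOFTIRQ_SHIFT,
        "hardirq_count": (count & HARDIRQ_MASK) >> HARDIRQ_SHIFT,
        "in_nmi": bool(count & NMI_MASK),
        "preempt_active": bool(count & PREEMPT_ACTIVE),
    }
```

Under that assumed layout, the value 0x10000100 from this report would decode to a softirq count of 1 with the PREEMPT_ACTIVE bit set, and zero in the other fields.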
Created attachment 47022 Console screenshot (2.6.36)
Created attachment 47032 dmesg (2.6.35, works)
Created attachment 47042 dmesg (2.6.37, crashed)
Thanks. I'll recategorise this under x86_64.
Small update: I've had a surprisingly crash-free time so far with 2.6.38-rc6 from Debian packages, which may be due to the EHCI fix. The other difference is that, because kbuild for .38 is missing on Debian, there are no VirtualBox modules.
Spoke too soon: 2.6.38-rc6 still freezes.
Created attachment 50092 dmesg from the latest freeze
Note the last bunch of messages. Is it possible that the system timer is deteriorating until it just stops?
I've been looking through the changelog for 2.6.36, and the HPET code apparently underwent a major overhaul in that release. Since the problem started with that kernel version, I now strongly suspect that the issue lies there, maybe in conjunction with a platform quirk. I'd really like some feedback on this one, since testing is nondeterministic (the system sometimes runs for days, sometimes it freezes after an hour) and also carries the risk of data loss (I've had my KDE configuration, address book, and other things on an XFS partition wiped twice now).
I'm definitely not getting the "scheduling while atomic" freezes anymore, and since I disabled use of the HPET (using the ACPI timer as clocksource instead), the system hasn't frozen once in a few days' worth of (non-contiguous) uptime. So I'm marking this one solved.
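For anyone who wants to check the same thing on their own machine: the comment above does not say how the switch was made, but the usual routes are the `clocksource=acpi_pm` boot parameter (optionally with `hpet=disable`) or writing `acpi_pm` to `current_clocksource` in sysfs as root. A minimal sketch for confirming which clocksource is active follows. The helper names are made up for illustration, and the `base` argument exists only so the functions can be pointed at a different directory.

```python
from pathlib import Path

# Standard sysfs location of the clocksource controls on Linux.
CLOCKSOURCE_DIR = Path("/sys/devices/system/clocksource/clocksource0")


def read_clocksources(base=CLOCKSOURCE_DIR):
    """Return (current, available) clocksource names read from sysfs."""
    base = Path(base)
    current = (base / "current_clocksource").read_text().strip()
    available = (base / "available_clocksource").read_text().split()
    return current, available


def hpet_in_use(base=CLOCKSOURCE_DIR):
    """True if the kernel is currently using the HPET as its clocksource."""
    current, _ = read_clocksources(base)
    return current == "hpet"
```

Calling `read_clocksources()` with no arguments reads the live system. If it reports `acpi_pm` as the current source, the workaround described here is in effect.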