Bug 13616 - IOAPIC -> kernel: BUG: soft lockup - CPU#1 stuck for 61s!
Summary: IOAPIC -> kernel: BUG: soft lockup - CPU#1 stuck for 61s!
Status: RESOLVED INVALID
Alias: None
Product: IO/Storage
Classification: Unclassified
Component: Other
Hardware: All Linux
Importance: P1 normal
Assignee: io_other
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2009-06-25 04:35 UTC by Lee Howard
Modified: 2009-07-22 14:36 UTC

See Also:
Kernel Version: 2.6.27.24-170.2.68.fc10.x86_64
Subsystem:
Regression: No
Bisected commit-id:


Attachments
error messages (3.28 KB, text/plain)
2009-06-25 04:36 UTC, Lee Howard

Description Lee Howard 2009-06-25 04:35:59 UTC
This Fedora 10 system (mostly operating as a mail server) had been running in production without any problem for possibly two weeks when it locked up with repeated messages of "BUG: soft lockup - CPU#1 stuck for 61s!"  The full message of the first instance is attached in "messages.txt".
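
For anyone comparing their own logs against this report, here is a minimal sketch (Python, not taken from the attached output) of tallying such messages per CPU. It assumes the log contains the exact message format quoted above; the function name summarize_soft_lockups is hypothetical.

```python
import re
from collections import Counter

# Matches the message quoted in this report, e.g.
# "BUG: soft lockup - CPU#1 stuck for 61s!"
SOFT_LOCKUP = re.compile(r"BUG: soft lockup - CPU#(\d+) stuck for (\d+)s!")


def summarize_soft_lockups(log_text):
    """Return {cpu: (count, longest_stall_seconds)} for soft lockup lines."""
    counts = Counter()
    longest = {}
    for match in SOFT_LOCKUP.finditer(log_text):
        cpu = int(match.group(1))
        seconds = int(match.group(2))
        counts[cpu] += 1
        # Track the longest reported stall for each CPU.
        longest[cpu] = max(longest.get(cpu, 0), seconds)
    return {cpu: (counts[cpu], longest[cpu]) for cpu in counts}
```

Feeding it the text of a syslog file would show whether the stalls are confined to one CPU or spread across several, which may help narrow down an interrupt-routing problem.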

As I couldn't reboot remotely I had to drive to the datacenter and reset the system.  It then ran fine for another two days until today when it locked up again.  I reset it again and this time updated the kernel to 2.6.27.25-170.2.72.fc10.x86_64.

No more than 20 minutes later I had to return to the datacenter, and then again another 20 minutes after that.  Each time I had to reset the system because the console was unresponsive.

Ultimately I added "noapic" to the kernel boot parameters, and I haven't had an issue now for three hours.  (I'm still running 2.6.27.25-170.2.72.fc10.x86_64.)
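
A minimal sketch for confirming that the running kernel actually booted with the parameter, assuming a Linux system where /proc/cmdline exposes the boot command line. The helper names has_boot_param and running_cmdline are illustrative only.

```python
def has_boot_param(cmdline, param="noapic"):
    """True if `param` appears as a standalone token on the kernel command line."""
    # Splitting on whitespace avoids false matches on similar names
    # such as "nolapic".
    return param in cmdline.split()


def running_cmdline(path="/proc/cmdline"):
    """Read the command line the running kernel was booted with."""
    with open(path) as fh:
        return fh.read().strip()
```

On the affected machine, has_boot_param(running_cmdline()) should be true after a reboot with the new setting; if it is false, the bootloader change did not take effect.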

It would be purely speculative for me to guess as to why it ran fine for two weeks and then for two days but ultimately could not last an hour.  I suppose it's possible that our mail traffic has increased (it probably has, as it gets closer to the end of the month).  I suppose it's possible that changes in 2.6.27.25-170.2.72.fc10.x86_64 aggravated the problem.  And, I suppose that only three hours of uptime isn't completely conclusive about the "noapic" resolution.

That said, based on other reports I've seen (which led me to test "noapic"), it genuinely "feels" like this is an IOAPIC problem.

What information can I get you from the system?  Since the system is in production use, I would prefer the least intrusive method of obtaining it.

Thanks.
Comment 1 Lee Howard 2009-06-25 04:36:47 UTC
Created attachment 22087
error messages
Comment 2 Lee Howard 2009-07-22 14:36:44 UTC
After a few days, maybe a week, further lockups/crashes still occurred with the "noapic" setting.  So apparently "noapic" only made the problem occur less frequently; it was not a true workaround.

The fix was to review all CMOS/BIOS settings, especially those under ACPI.  I enabled "ACPI 3.0" instead of "ACPI 2.0".  I disabled NMI.  There were a few other changes that I can't recall from memory.  Since making these CMOS/BIOS setting changes I haven't had another problem.
