Bug 13616

Summary: IOAPIC -> kernel: BUG: soft lockup - CPU#1 stuck for 61s!
Product: IO/Storage
Reporter: Lee Howard (faxguy)
Component: Other
Assignee: io_other
Status: RESOLVED INVALID
Severity: normal
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 2.6.27.24-170.2.68.fc10.x86_64
Subsystem:
Regression: No
Bisected commit-id:
Attachments: error messages

Description Lee Howard 2009-06-25 04:35:59 UTC
This Fedora 10 system (mostly operating as a mail server) had been running in production without any problem for possibly two weeks when it locked up with repeated messages of "BUG: soft lockup - CPU#1 stuck for 61s!"  The full message of the first instance is attached in "messages.txt".
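
For anyone comparing their own logs against this report, here is a minimal sketch that counts soft lockup messages per CPU. It assumes the log lines contain the message in exactly the form quoted above; the function name `count_soft_lockups` is hypothetical and only for illustration.

```python
import re
from collections import Counter

# Matches the kernel message quoted in this report, e.g.
# "BUG: soft lockup - CPU#1 stuck for 61s!"
SOFT_LOCKUP = re.compile(r"BUG: soft lockup - CPU#(\d+) stuck for \d+s!")


def count_soft_lockups(lines):
    """Return a Counter mapping CPU number -> number of soft lockup messages."""
    counts = Counter()
    for line in lines:
        match = SOFT_LOCKUP.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts
```

It can be fed any iterable of lines, for example an open file handle on the saved "messages.txt".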

As I couldn't reboot remotely I had to drive to the datacenter and reset the system.  It then ran fine for another two days until today when it locked up again.  I reset it again and this time updated the kernel to 2.6.27.25-170.2.72.fc10.x86_64.

No more than 20 minutes later it locked up again and I had to return to the datacenter, and then once more about 20 minutes after that.  Each time I had to reset the system as the console was unresponsive.

Ultimately I added "noapic" to the kernel boot parameters, and I haven't had an issue now for three hours.  (I'm still running 2.6.27.25-170.2.72.fc10.x86_64.)
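
To confirm that a boot parameter such as "noapic" actually took effect after a reboot, the running kernel's command line can be read from /proc/cmdline. The following is a rough sketch; the helper names `has_boot_param` and `read_cmdline` are hypothetical.

```python
def has_boot_param(cmdline, name):
    """Return True if `name` appears as a standalone parameter or as name=value."""
    for token in cmdline.split():
        if token == name or token.startswith(name + "="):
            return True
    return False


def read_cmdline(path="/proc/cmdline"):
    """Read the command line the running Linux kernel was booted with."""
    with open(path) as f:
        return f.read().strip()
```

Calling `has_boot_param(read_cmdline(), "noapic")` on the running system reports whether the parameter is active. Matching whole tokens avoids confusing "noapic" with the similarly named "nolapic".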

It would be purely speculative for me to guess as to why it ran fine for two weeks and then for two days but ultimately could not last an hour.  I suppose it's possible that our mail traffic has increased (it probably has, as it gets closer to the end of the month).  I suppose it's possible that changes in 2.6.27.25-170.2.72.fc10.x86_64 aggravated the problem.  And, I suppose that only three hours of uptime isn't completely conclusive about the "noapic" resolution.

That said, based on other reports I've seen (which led me to test "noapic"), it genuinely "feels" like this is an IOAPIC problem.

What information can I get you from the system?  Since the system is in production use, the least intrusive way of obtaining that information would be preferred.

Thanks.
Comment 1 Lee Howard 2009-06-25 04:36:47 UTC
Created attachment 22087
error messages
Comment 2 Lee Howard 2009-07-22 14:36:44 UTC
After a few days, maybe a week, further lockups/crashes occurred even with the "noapic" setting.  So apparently "noapic" only made the problem occur less frequently; it was not a true workaround.

The fix was to review all CMOS/BIOS settings, especially those under ACPI.  I enabled "ACPI 3.0" instead of "ACPI 2.0".  I disabled NMI.  There were a few other changes that I can't recall.  Since making these CMOS/BIOS setting changes I haven't had another problem.