Bug 11418
Summary: | Many soft lockups | ||
---|---|---|---|
Product: | Other | Reporter: | Gu Rui (chaos.proton) |
Component: | Other | Assignee: | Thomas Gleixner (tglx) |
Status: | CLOSED CODE_FIX | ||
Severity: | normal | CC: | herrmann.der.user, rjw, tglx |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 2.6.27-rc4-git3 | Subsystem: | |
Regression: | Yes | Bisected commit-id: | |
Bug Depends on: | |||
Bug Blocks: | 11167 | ||
Attachments: |
the whole dmesg
potential fix for the problem
combo patch of clockevent fixes which are queued for mainline
failed to adjust min_delta_ns with Thomas's patch applyed
dmesg containing backtrace for soft lockups
Final version of the combo-patch against 2.6.27-5
1000ns boot up with Thomas' patch in Comment #24 applied |
Description
Gu Rui
2008-08-24 13:06:05 UTC
Created attachment 17418 [details]
the whole dmesg
Is this a regression? If that's the case, what's the latest known good kernel?

Can you please try the following kernel command line options:
highres=off nohz=off

(In reply to comment #2)
> Is this a regression? If that's the case, what's the latest known good
> kernel?
>
I don't know whether it's a regression or not, because I didn't turn on the kernel hacking configuration for my earlier kernel builds. So even if there were lockups, I couldn't have caught them.

(In reply to comment #3)
> Can you please try the following kernel command line options:
>
> highres=off nohz=off
>
Yes, the problem is gone with the two options. In menuconfig I see this description of High Resolution Timer Support:
This option enables high resolution timer support. If your hardware is not capable then this option only increases the size of the kernel image.
So according to the description, my box should work when this option is enabled. And I'm using a laptop, so I may want to enable Tickless System to save my battery. But if I enable _any_ of them, the kernel can lock up.

Can you try to boot with "hpet=disable" on the kernel command line ?
Thanks,
tglx

(In reply to comment #6)
> Can you try to boot with "hpet=disable" on the kernel command line ?
>
> Thanks,
> tglx
>
Yes, "hpet=disable" fixes the problem. I also found that when I enable the hpet, I lose my wireless network with "wlan0: No ProbeResp from current AP 00:1d:0f:66:fb:02 - assume out of range" in dmesg. It may be another problem, but I'll just note it here.
Thanks

Another incarnation of the HPET vs. AMD story. Andreas, any idea how to get to the root cause of this ?

Maybe it has a similar root cause like bug #11191 where I've seen that min_delta_ns for hpet is too small. Currently I try to reproduce soft lockups on a system where I've seen this before. If this is (reliably) reproducible I'll increase min_delta_ns for hpet and hope that this fixes the issue.
If the kernel programs hpet in one-shot mode but hpet_legacy_next_event fails with -ETIME when a rather small delta was passed, it might happen that the kernel loops for a long time (or forever?) until HPET_COUNTER is definitely greater than HPET_T0_CMP in hpet_legacy_next_event ...
Writing this I've observed another
"BUG: soft lockup - CPU#1 stuck for 89s! [uname:28197]"
So I am going to verify above idea.

Created attachment 17599 [details]
potential fix for the problem

x86: hpet: increase min_delta_ns to increase chance of successful programming in hpet_legacy_next_event

This fixes http://bugzilla.kernel.org/show_bug.cgi?id=11191 and most probably http://bugzilla.kernel.org/show_bug.cgi?id=11418 as well.

With c1e_idle hpet is frequently reprogrammed (in one-shot mode). If the delta for next timer event is very small the T0 comparator value is too close to the current HPET counter value and Linux repeatedly tries to reprogram the comparator.

On an HP tx1000 (with AMD Turion and nvidia MCP51) this caused
BUG: spinlock lockup on CPU#0
during boot. On other systems with other chipsets I've observed soft lockups, e.g.
BUG: soft lockup - CPU#1 stuck for 89s! [uname:28197]
Both symptoms vanished when I've increased min_delta_ns for hpet.

> Maybe it has a similar root cause like bug #11191 where I've seen that
> min_delta_ns for hpet is too small.
Yup. Same problem.
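
Editorial illustration: the reprogramming loop described above can be sketched as a small user-space simulation. This is not the kernel's hpet_legacy_next_event; struct sim_hpet, sim_next_event, sim_program_min_delta and the write_latency field are hypothetical names, and the assumption that a fixed number of counter ticks elapses per comparator write is made up for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-in for the HPET: the main counter advances by
 * write_latency ticks while a comparator write is in flight. */
struct sim_hpet {
    uint32_t counter;        /* models HPET_COUNTER */
    uint32_t cmp;            /* models HPET_T0_CMP */
    uint32_t write_latency;  /* assumed ticks elapsing per comparator write */
};

/* Arm the comparator delta ticks ahead. Report -ETIME if the counter has
 * already reached or passed the comparator once the write has landed. */
static int sim_next_event(struct sim_hpet *h, uint32_t delta)
{
    h->cmp = h->counter + delta;
    h->counter += h->write_latency;
    /* The signed 32-bit difference stays correct across counter wraparound. */
    return (int32_t)(h->counter - h->cmp) >= 0 ? -ETIME : 0;
}

/* Retry with the same minimum delta, as a caller without any fixup would.
 * Returns the number of attempts used, or -1 if max_tries was exhausted. */
static int sim_program_min_delta(struct sim_hpet *h, uint32_t min_delta,
                                 int max_tries)
{
    assert(max_tries > 0);
    for (int tries = 1; tries <= max_tries; tries++) {
        if (sim_next_event(h, min_delta) == 0)
            return tries;
    }
    return -1;
}
```

In this model, a minimum delta at or below the write latency never succeeds no matter how often it is retried, while a larger one succeeds on the first attempt. That matches the observation that raising min_delta_ns made the symptoms vanish.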
Thanks for Andreas's patch. But on my box, setting min_delta_ns to 0x40 did not fix the problems, i.e. there are still many soft lockups and the wlan0 problem. I set it to 0x60 and then the problems were gone (uptime 6 hours without a soft lockup or a lost wireless connection).
Thanks

Created attachment 17622 [details]
combo patch of clockevent fixes which are queued for mainline
Can you please try the attached patch. It has detection of the problem and adjusts the min value if it is too small.
Thanks,
tglx
Handled-By : Thomas Gleixner <tglx@linutronix.de>
Patch : http://bugzilla.kernel.org/attachment.cgi?id=17622&action=view

*** Bug 11191 has been marked as a duplicate of this bug. ***

*** Bug 11279 has been marked as a duplicate of this bug. ***

Created attachment 17633 [details]
failed to adjust min_delta_ns with Thomas's patch applyed
5000 nsec may be enough for reprogramming, since there seems to be no problem when purely applying Thomas's patch. So I set the default value to 4000 nsec. Then I observed a system freeze when working in Konsole. When I moved the mouse a bit, the system recovered (amazing..). I did a dmesg immediately, as shown in the attachment. Not long after, the whole system froze. Only the mouse could move, but there was no recovery. After about 1 min or so the system recovered by itself. No more info in dmesg. But I see the kernel doesn't double min_delta_ns, since it's still 4000 in 'cat /proc/timer_list'. I know little about kernel hacking and only a little about C. And sorry for my poor English.
Thanks
> 5000 nsec maybe enough for reprogramming since there seems no problem purely
> apply Thomas's patch. So I set the default value to 4000 nsec. Then I
> observed a system freeze when working in Konsole. Then I move the mouse a
> bit, the system recovered(amazing..). I did a dmesg immediately as shown in
> attachment. Not a long time after, the whole system froze. Only the mouse
> could move but there was no recovery. About 1 min or so the system recovered
> itself. No more info in dmesg. But I see the kernel doesn't double
> min_delta_ns since it's still 4000 in 'cat /proc/timer_list'. I know little
> about kernel hacking and only a little about C. And sorry for my poor
> English.
The lockup is a different problem. Before that the system did not
recover from such a situation, as it was simply stuck in the
reprogramming loop. To check whether the fixup of a
too small min_delta works, you should set it to 1000ns. That should
trigger the logic in the clock events code.
Can you test that please ?
Thanks,
tglx
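
Editorial illustration: a minimal sketch of the kind of fixup logic being tested here, assuming that a too-small minimum is detected by counting consecutive programming failures and that the minimum is then doubled, as the comments in this report describe. The names sim_clockevent, sim_program, sim_force_program, hw_min_ns and the MAX_RETRIES threshold are hypothetical and not taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical device model: programming succeeds only when the requested
 * delta is at least hw_min_ns, a value the caller does not know. */
struct sim_clockevent {
    uint64_t min_delta_ns;  /* current software minimum */
    uint64_t hw_min_ns;     /* smallest delta the hardware really accepts */
};

static int sim_program(const struct sim_clockevent *dev, uint64_t delta_ns)
{
    return delta_ns >= dev->hw_min_ns ? 0 : -1;
}

/* Assumed threshold: failures tolerated before the minimum is adjusted. */
enum { MAX_RETRIES = 3 };

/* Force an event at min_delta_ns. After MAX_RETRIES consecutive failures,
 * double min_delta_ns and start over. Returns the number of doublings. */
static int sim_force_program(struct sim_clockevent *dev)
{
    int doublings = 0;
    int failures = 0;

    assert(dev->min_delta_ns > 0);
    while (sim_program(dev, dev->min_delta_ns) != 0) {
        if (++failures >= MAX_RETRIES) {
            dev->min_delta_ns *= 2;
            doublings++;
            failures = 0;
        }
    }
    return doublings;
}
```

With a hardware minimum of 5000 ns in this model, a starting value of 1000 ns is doubled three times and ends at 8000 ns, while a starting value of 4000 ns is doubled once and also ends at 8000 ns.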
(In reply to comment #18)
> The lockup is a different problem. Before that the system did not
> recover from such a situation, as it was simply stuck in the
> reprogramming loop. To check whether the fixup of a
> too small min_delta works, you should set it to 1000ns. That should
> trigger the logic in the clock events code.
>
> Can you test that please ?
>
> Thanks,
>
> tglx
>
1000ns doesn't boot... The screen ends up at "ACPI:RTC can wake from S4" or something like that. I waited for about 1 min, then lost my patience and forced my box to shut down.
Thanks

That's strange. Yesterday I've tested whether fixup of min_delta of Thomas' patches works. And it worked. (I've used 0x500 as the initial value for min_delta_ns.)

FYI, one of my test machines still shows soft lockups with Thomas' patches applied. I guess it's another problem:

BUG: soft lockup - CPU#2 stuck for 166s! [swapper:0]
...
RIP: 0010:[<ffffffff8024ee5b>] [<ffffffff8024ee5b>] tick_nohz_restart_sched_tick+0x151/0x155
...
Call Trace:
[<ffffffff8020ac3e>] ? cpu_idle+0x94/0x9e

and

BUG: soft lockup - CPU#2 stuck for 116s! [top:10593]
...
RIP: 0010:[<ffffffff8029a752>] [<ffffffff8029a752>] __d_lookup+0xef/0x107
...
Call Trace:
...
[<ffffffff8020be6b>] ? system_call_fastpath+0x16/0x1b

The real weird thing is output of top/ps on this machine:

6 root RT -5 0 0 0 S 0 0.0 0:00.28 migration/1
7 root 15 -5 0 0 0 S 0 0.0 5124415h ksoftirqd/1
8 root RT -5 0 0 0 S 0 0.0 0:00.00 watchdog/1
...
18 root 15 -5 0 0 0 S 0 0.0 0:00.68 events/3
19 root 15 -5 0 0 0 S 0 0.0 19215:21 khelper
100 root 15 -5 0 0 0 S 0 0.0 0:00.04 kblockd/0

which is just bogus for ksoftirqd and khelper. Or is this just "normal" consequence of soft lockups? I'll try to debug what's going on here.

> RIP: 0010:[<ffffffff8029a752>] [<ffffffff8029a752>] __d_lookup+0xef/0x107
> ...
> Call Trace:
> ...
> [<ffffffff8020be6b>] ? system_call_fastpath+0x16/0x1b
Full backtrace please.
> The real weird thing is output of top/ps on this machine:
>
> 6 root RT -5 0 0 0 S 0 0.0 0:00.28 migration/1
> 7 root 15 -5 0 0 0 S 0 0.0 5124415h ksoftirqd/1
> 8 root RT -5 0 0 0 S 0 0.0 0:00.00 watchdog/1
> ...
> 18 root 15 -5 0 0 0 S 0 0.0 0:00.68 events/3
> 19 root 15 -5 0 0 0 S 0 0.0 19215:21 khelper
> 100 root 15 -5 0 0 0 S 0 0.0 0:00.04 kblockd/0
>
> which is just bogus for ksoftirqd and khelper. Or is this just "normal"
> consequence of soft lockups?
That's default in mainline. Don't remember why the softirqd is not running as -rt thread. Probably the many yield()s which are in the network code :(
Thanks,
tglx

Created attachment 17634 [details]
dmesg containing backtrace for soft lockups
> Full backtrace please.
See attachment.

Created attachment 17644 [details]
Final version of the combo-patch against 2.6.27-5
Found the real root cause. The other fixes are still valid, but I'm feeling stupid
Can you please test again?
Thanks,
tglx

Created attachment 17645 [details]
1000ns boot up with Thomas' patch in Comment #24 applied
Yup, 1000ns boots up on my box~;) We can see that each adjustment is followed by a traceback. But I have a question to discuss: why double min_delta_ns instead of increasing it by a static step? In my case 5000ns is enough, but the kernel jumps from 4000ns to 8000ns. Maybe increasing min_delta_ns by 1000ns per step would be reasonable, as the default is already 5000ns.
Thanks

complete patch series merged in mainline:
commit f5325225658737e6c9cb8e24373e2c281a90be2a

FYI, just want to tell you my final test results. Just fixing the min_delta issue I've still seen lots of soft lockups during a weekend test. Overnight I've tested the final combo patch (from comment #24) which also fixed the u32/long calculation issue. And no soft lockups were detected anymore.
Thanks Thomas. |
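
Editorial illustration: the trade-off raised in the question above (doubling versus a static 1000ns step) can be made concrete with a little arithmetic. The helpers adjust_by_doubling and adjust_by_step and the struct adjust_result are hypothetical; they only count how many adjustments each policy needs and where it ends up, and say nothing about what the merged patch series actually does.

```c
#include <assert.h>
#include <stdint.h>

struct adjust_result {
    uint64_t final_ns;  /* first value reached that is >= needed_ns */
    int steps;          /* number of adjustments it took */
};

/* Double the value until it is large enough. */
static struct adjust_result adjust_by_doubling(uint64_t start_ns,
                                               uint64_t needed_ns)
{
    struct adjust_result r = { start_ns, 0 };

    assert(start_ns > 0);
    while (r.final_ns < needed_ns) {
        r.final_ns *= 2;
        r.steps++;
    }
    return r;
}

/* Add a fixed step until the value is large enough. */
static struct adjust_result adjust_by_step(uint64_t start_ns,
                                           uint64_t needed_ns,
                                           uint64_t step_ns)
{
    struct adjust_result r = { start_ns, 0 };

    assert(step_ns > 0);
    while (r.final_ns < needed_ns) {
        r.final_ns += step_ns;
        r.steps++;
    }
    return r;
}
```

Starting from 4000 ns with 5000 ns needed, both policies take one adjustment, but doubling lands on 8000 ns while the fixed step lands exactly on 5000 ns. Starting from 1000 ns, doubling needs three adjustments and the fixed step needs four. The gap widens quickly when the starting value is far too small, which matters if every adjustment is accompanied by a traceback as observed above.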