Bug 11418 - Many soft lockups
Summary: Many soft lockups
Status: CLOSED CODE_FIX
Alias: None
Product: Other
Classification: Unclassified
Component: Other (show other bugs)
Hardware: All Linux
: P1 normal
Assignee: Thomas Gleixner
URL:
Keywords:
: 11191 11279 (view as bug list)
Depends on:
Blocks: Regressions-2.6.26
  Show dependency tree
 
Reported: 2008-08-24 13:06 UTC by Gu Rui
Modified: 2008-09-09 02:18 UTC (History)
3 users (show)

See Also:
Kernel Version: 2.6.27-rc4-git3
Tree: Mainline
Regression: Yes


Attachments
the whole dmesg (45.48 KB, application/octet-stream)
2008-08-24 13:06 UTC, Gu Rui
Details
potential fix for the problem (1.60 KB, patch)
2008-09-03 13:19 UTC, herrmann.der.user
Details | Diff
combo patch of clockevent fixes which are queued for mainline (7.97 KB, patch)
2008-09-04 09:11 UTC, Thomas Gleixner
Details | Diff
failed to adjust min_delta_ns with Thomas's patch applyed (44.68 KB, application/octet-stream)
2008-09-05 04:19 UTC, Gu Rui
Details
dmesg containing backtrace for soft lockups (28.70 KB, application/txt)
2008-09-05 06:38 UTC, herrmann.der.user
Details
Final version of the combo-patch against 2.6.27-5 (9.71 KB, patch)
2008-09-05 22:31 UTC, Thomas Gleixner
Details | Diff
1000ns boot up with Thomas' patch in Comment #24 applied (44.22 KB, text/plain)
2008-09-06 03:41 UTC, Gu Rui
Details

Description Gu Rui 2008-08-24 13:06:05 UTC
Sometimes my Konsole will have a slow respond for the keyboard. It may also stop at half of dmesg output(or other process) at times, when I move the mouse, it would be back to continue. I got some debug messages:
[  437.711014] BUG: soft lockup - CPU#1 stuck for 74s! [swapper:0]                                                            
[  437.711014] Modules linked in: fuse ricoh_mmc                                                                              
[  437.711014]                                                                                                                
[  437.711014] Pid: 0, comm: swapper Not tainted (2.6.27-rc4-g9-debug-00130-g6450f65 #14)                                     
[  437.711014] EIP: 0060:[<c0142829>] EFLAGS: 00000292 CPU: 1                                                                 
[  437.711014] EIP is at tick_nohz_restart_sched_tick+0x39/0x140                                                              
[  437.711014] EAX: e990dc7b EBX: 01020800 ECX: 00000001 EDX: 00000000                                                        
[  437.711014] ESI: 00000001 EDI: c1714fa0 EBP: 00000001 ESP: f784bf94                                                        
[  437.711014]  DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068                                                                  
[  437.711014] CR0: 8005003b CR2: b800c000 CR3: 26b7c000 CR4: 000006d0                                                        
[  437.711014] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000                                                        
[  437.711014] DR6: ffff0ff0 DR7: 00000400                                                                                    
[  437.711014]  [<c0101d27>] ? cpu_idle+0x97/0x120                                                                            
[  437.711014]  =======================                                                                                       
[  544.224041] BUG: soft lockup - CPU#1 stuck for 66s! [syslogd:1219]                                                         
[  544.224041] Modules linked in: fuse ricoh_mmc                                                                              
[  544.224041]                                                                                                                
[  544.224041] Pid: 1219, comm: syslogd Not tainted (2.6.27-rc4-g9-debug-00130-g6450f65 #14)                                  
[  544.224041] EIP: 0060:[<c04b55bb>] EFLAGS: 00000202 CPU: 1                                                                 
[  544.224041] EIP is at _spin_unlock_irq+0xb/0x30                                                                            
[  544.224041] EAX: 00000001 EBX: c1718240 ECX: 00000000 EDX: 00004645                                                        
[  544.224041] ESI: f6266e00 EDI: f7831c80 EBP: 00000000 ESP: f6107ac8                                                        
[  544.224041]  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068                                                                  
[  544.224041] CR0: 8005003b CR2: b09b5000 CR3: 36d56000 CR4: 000006d0                                                        
[  544.224041] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000                                                        
[  544.224041] DR6: ffff0ff0 DR7: 00000400                                                                                    
[  544.224041]  [<c012004a>] finish_task_switch+0x2a/0xb0                                                                     
[  544.224041]  [<c04b384b>] schedule+0x26b/0x7b0                                                                             
[  544.224041]  [<c01f6b09>] journal_dirty_metadata+0x89/0x150                                                                
[  544.224041]  [<c01f2132>] __ext3_journal_dirty_metadata+0x22/0x60                                                          
[  544.224041]  [<c01f6b09>] journal_dirty_metadata+0x89/0x150                                                                
[  544.224041]  [<c04b3fa5>] schedule_timeout+0x75/0xc0                                                                       
[  544.224041]  [<c01956a7>] do_select+0x397/0x490                                                                            
[  544.224041]  [<c0195d10>] __pollwait+0x0/0xe0                                                                              
[  544.224041]  [<c011e9d0>] default_wake_function+0x0/0x10                                                                   
[  544.224041]  [<c01f2132>] __ext3_journal_dirty_metadata+0x22/0x60                                                          
[  544.224041]  [<c01f5a89>] journal_get_write_access+0x29/0x40                                                               
[  544.224041]  [<c01e572f>] ext3_mark_iloc_dirty+0x1ff/0x340                                                                 
[  544.224041]  [<c01e5c39>] ext3_mark_inode_dirty+0x29/0x40                                                                  
[  544.224041]  [<c01f548a>] journal_stop+0x11a/0x1b0                                                                         
[  544.224041]  [<c01679dd>] balance_dirty_pages_ratelimited_nr+0x5d/0x380                                                    
[  544.224041]  [<c01f548a>] journal_stop+0x11a/0x1b0                                                                         
[  544.224041]  [<c01e7ea0>] ext3_ordered_write_end+0xf0/0x150                                                                
[  544.224041]  [<c0128d83>] current_fs_time+0x13/0x20                                                                        
[  544.224041]  [<c01615a3>] generic_file_buffered_write+0x1c3/0x660                                                          
[  544.224041]  [<c0128d83>] current_fs_time+0x13/0x20                                                                        
[  544.224041]  [<c04b5594>] _spin_unlock_irqrestore+0x14/0x30                                                                
[  544.224041]  [<c013f9fa>] clocksource_get_next+0x3a/0x40                                                                   
[  544.224041]  [<c013e329>] update_wall_time+0x259/0x7e0                                                                     
[  544.224041]  [<c0116e2a>] read_hpet+0xa/0x10                                                                               
[  544.224041]  [<c013df87>] getnstimeofday+0x37/0xe0                                                                         
[  544.224041]  [<c012155d>] task_tick_fair+0x3d/0xa0                                                                         
[  544.224041]  [<c013a8c4>] hrtimer_forward+0x104/0x150                                                                      
[  544.224041]  [<c0116e2a>] read_hpet+0xa/0x10                                                                               
[  544.224041]  [<c013df87>] getnstimeofday+0x37/0xe0                                                                         
[  544.224041]  [<c013b8a9>] hrtimer_run_pending+0x29/0x110                                                                   
[  544.224041]  [<c0141062>] clockevents_program_event+0xa2/0x140                                                             
[  544.224041]  [<c04b55c0>] _spin_unlock_irq+0x10/0x30                                                                       
[  544.224041]  [<c012dabe>] run_timer_softirq+0x16e/0x1a0                                                                    
[  544.224041]  [<c0195978>] core_sys_select+0x1d8/0x300                                                                      
[  544.224041]  [<c01023eb>] setup_sigcontext+0x10b/0x190                                                                     
[  544.224041]  [<c04b55c0>] _spin_unlock_irq+0x10/0x30                                                                       
[  544.224041]  [<c0102814>] do_notify_resume+0x3a4/0x770                                                                     
[  544.224041]  [<c0116e2a>] read_hpet+0xa/0x10                                                                               
[  544.224041]  [<c013df87>] getnstimeofday+0x37/0xe0                                                                         
[  544.224041]  [<c013aba2>] enqueue_hrtimer+0x72/0xf0                                                                        
[  544.224041]  [<c04b5594>] _spin_unlock_irqrestore+0x14/0x30                                                                
[  544.224041]  [<c013b316>] hrtimer_start+0xf6/0x190                                                                         
[  544.224041]  [<c0195e3b>] sys_select+0x4b/0x1b0                                                                            
[  544.224041]  [<c0128706>] alarm_setitimer+0x36/0x70                                                                        
[  544.224041]  [<c010339a>] syscall_call+0x7/0xb                                                                             
[  544.224041]  =======================                                                                                       
[ 1933.378183] BUG: soft lockup - CPU#1 stuck for 68s! [swapper:0]                                                            
[ 1933.378183] Modules linked in: fuse ricoh_mmc                                                                              
[ 1933.378183]                                                                                                                
[ 1933.378183] Pid: 0, comm: swapper Not tainted (2.6.27-rc4-g9-debug-00130-g6450f65 #14)                                     
[ 1933.378183] EIP: 0060:[<c0142829>] EFLAGS: 00000292 CPU: 1                                                                 
[ 1933.378183] EIP is at tick_nohz_restart_sched_tick+0x39/0x140                                                              
[ 1933.378183] EAX: 26529b04 EBX: 01020800 ECX: 00000001 EDX: 00000000                                                        
[ 1933.378183] ESI: 00000001 EDI: c1714fa0 EBP: 00000001 ESP: f784bf94                                                        
[ 1933.378183]  DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068                                                                  
[ 1933.378183] CR0: 8005003b CR2: a3554000 CR3: 00676000 CR4: 000006d0                                                        
[ 1933.378183] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000                                                        
[ 1933.378183] DR6: ffff0ff0 DR7: 00000400                                                                                    
[ 1933.378183]  [<c0101d27>] ? cpu_idle+0x97/0x120                                                                            
[ 1933.378183]  =======================                                                                                       
[ 2714.154007] BUG: soft lockup - CPU#1 stuck for 70s! [swapper:0]                                                            
[ 2714.154007] Modules linked in: fuse ricoh_mmc                                                                              
[ 2714.154007]                                                                                                                
[ 2714.154007] Pid: 0, comm: swapper Not tainted (2.6.27-rc4-g9-debug-00130-g6450f65 #14)                                     
[ 2714.154007] EIP: 0060:[<c0142829>] EFLAGS: 00000292 CPU: 1                                                                 
[ 2714.154007] EIP is at tick_nohz_restart_sched_tick+0x39/0x140                                                              
[ 2714.154007] EAX: f027bf20 EBX: 01020800 ECX: 00000001 EDX: 00000000                                                        
[ 2714.154007] ESI: 00000001 EDI: c1714fa0 EBP: 00000001 ESP: f784bf94                                                        
[ 2714.154007]  DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068                                                                  
[ 2714.154007] CR0: 8005003b CR2: b800c000 CR3: 1d842000 CR4: 000006d0                                                        
[ 2714.154007] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000                                                        
[ 2714.154008] DR6: ffff0ff0 DR7: 00000400                                                                                    
[ 2714.154008]  [<c0101d27>] ? cpu_idle+0x97/0x120                                                                            
[ 2714.154008]  =======================                                                                                       
[ 3304.220009] BUG: soft lockup - CPU#1 stuck for 112s! [swapper:0]                                                           
[ 3304.220009] Modules linked in: fuse ricoh_mmc                                                                              
[ 3304.220009]                                                                                                                
[ 3304.220009] Pid: 0, comm: swapper Not tainted (2.6.27-rc4-g9-debug-00130-g6450f65 #14)                                     
[ 3304.220009] EIP: 0060:[<c0142829>] EFLAGS: 00000292 CPU: 1                                                                 
[ 3304.220009] EIP is at tick_nohz_restart_sched_tick+0x39/0x140                                                              
[ 3304.220009] EAX: 52d470f6 EBX: 01020800 ECX: 00000001 EDX: 00000000                                                        
[ 3304.220009] ESI: 00000001 EDI: c1714fa0 EBP: 00000001 ESP: f784bf94                                                        
[ 3304.220009]  DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068                                                                  
[ 3304.220009] CR0: 8005003b CR2: b800c000 CR3: 1d9d5000 CR4: 000006d0                                                        
[ 3304.220009] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000                                                        
[ 3304.220009] DR6: ffff0ff0 DR7: 00000400                                                                                    
[ 3304.220009]  [<c0101d27>] ? cpu_idle+0x97/0x120                                                                            
[ 3304.220009]  =======================                                                                                       
[ 3642.407022] APIC error on CPU0: 00(40)                                                                                     
[ 3763.952016] APIC error on CPU0: 40(40)                                                                                     
[ 4438.217465] BUG: soft lockup - CPU#1 stuck for 83s! [syslogd:1219]                                                         
[ 4438.217465] Modules linked in: fuse ricoh_mmc                                                                              
[ 4438.217465]                                                                                                                
[ 4438.217465] Pid: 1219, comm: syslogd Not tainted (2.6.27-rc4-g9-debug-00130-g6450f65 #14)                                  
[ 4438.217465] EIP: 0060:[<c04b55bb>] EFLAGS: 00000286 CPU: 1                                                                 
[ 4438.217465] EIP is at _spin_unlock_irq+0xb/0x30                                                                            
[ 4438.217465] EAX: 00000001 EBX: c1718240 ECX: 00000000 EDX: 0000e4e3                                                        
[ 4438.217465] ESI: f6c77a00 EDI: f7831c80 EBP: 00000000 ESP: f6107ac8                                                        
[ 4438.217465]  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068                                                                  
[ 4438.217465] CR0: 8005003b CR2: a5e37004 CR3: 36d56000 CR4: 000006d0                                                        
[ 4438.217465] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000                                                        
[ 4438.217465] DR6: ffff0ff0 DR7: 00000400                                                                                    
[ 4438.217465]  [<c012004a>] finish_task_switch+0x2a/0xb0                                                                     
[ 4438.217465]  [<c04b384b>] schedule+0x26b/0x7b0                                                                             
[ 4438.217465]  [<c04b3fa5>] schedule_timeout+0x75/0xc0
[ 4438.217465]  [<c01956a7>] do_select+0x397/0x490
[ 4438.217465]  [<c0195d10>] __pollwait+0x0/0xe0
[ 4438.217465]  [<c011e9d0>] default_wake_function+0x0/0x10
[ 4438.217465]  [<c01f2132>] __ext3_journal_dirty_metadata+0x22/0x60
[ 4438.217465]  [<c01f5a89>] journal_get_write_access+0x29/0x40
[ 4438.217465]  [<c01e572f>] ext3_mark_iloc_dirty+0x1ff/0x340
[ 4438.217465]  [<c01e5c39>] ext3_mark_inode_dirty+0x29/0x40
[ 4438.217465]  [<c01f548a>] journal_stop+0x11a/0x1b0
[ 4438.217465]  [<c01679dd>] balance_dirty_pages_ratelimited_nr+0x5d/0x380
[ 4438.217465]  [<c01f548a>] journal_stop+0x11a/0x1b0
[ 4438.217465]  [<c01e7ea0>] ext3_ordered_write_end+0xf0/0x150
[ 4438.217465]  [<c01615a3>] generic_file_buffered_write+0x1c3/0x660
[ 4438.217465]  [<c01a68cc>] __mark_inode_dirty+0x5c/0x1a0
[ 4438.217465]  [<c0161f7a>] __generic_file_aio_write_nolock+0x28a/0x4f0
[ 4438.217465]  [<c0190067>] do_lookup+0x67/0x1d0
[ 4438.217465]  [<c0195978>] core_sys_select+0x1d8/0x300
[ 4438.217465]  [<c01023eb>] setup_sigcontext+0x10b/0x190
[ 4438.217465]  [<c04b55c0>] _spin_unlock_irq+0x10/0x30
[ 4438.217465]  [<c0268627>] rb_insert_color+0x77/0xe0
[ 4438.217465]  [<c013aba2>] enqueue_hrtimer+0x72/0xf0
[ 4438.217465]  [<c04b5594>] _spin_unlock_irqrestore+0x14/0x30
[ 4438.217465]  [<c013b316>] hrtimer_start+0xf6/0x190
[ 4438.217465]  [<c0195e3b>] sys_select+0x4b/0x1b0
[ 4438.217465]  [<c0128706>] alarm_setitimer+0x36/0x70
[ 4438.217465]  [<c010339a>] syscall_call+0x7/0xb
[ 4438.217465]  =======================
[ 4599.071020] APIC error on CPU0: 40(40)
Comment 1 Gu Rui 2008-08-24 13:06:45 UTC
Created attachment 17418 [details]
the whole dmesg
Comment 2 Rafael J. Wysocki 2008-08-27 05:53:30 UTC
Is this a regression?  If that's the case, what's the latest known good kernel?
Comment 3 Thomas Gleixner 2008-08-27 10:55:29 UTC
Can you please try the following kernel command line options:

highres=off nohz=off
Comment 4 Gu Rui 2008-08-27 23:35:39 UTC
(In reply to comment #2)
> Is this a regression?  If that's the case, what's the latest known good
> kernel?
> 

I don't know whether it's a regression or not because I didn't turn up the kernel hack configuration for my early kernel builds. So even if there were lockups, I couldn't catch them.
Comment 5 Gu Rui 2008-08-27 23:46:03 UTC
(In reply to comment #3)
> Can you please try the following kernel command line options:
> 
> highres=off nohz=off
> 

Yes, the problem has gone with the two options. In menuconfig I see the description on High Resolution Timer Support:

This option enables high resolution timer support. If your hardware is not capable then this option only increases the size of the kernel image.

So I think my box should work when this configuration is enabled according to the description. And I'm using Laptop so maybe I want to enable Tickless System to save my bettery. But if I enable _any_ of them, the kernel could lockup.
Comment 6 Thomas Gleixner 2008-08-28 03:45:04 UTC
Can you try to boot with "hpet=disable" on the kernel command line ?

Thanks,
	tglx
Comment 7 Gu Rui 2008-08-29 10:04:56 UTC
(In reply to comment #6)
> Can you try to boot with "hpet=disable" on the kernel command line ?
> 
> Thanks,
>         tglx
> 

Yes, "hpet=disable" can fix the problem. I also found that when I enable the hpet, I will lose my wireless network with " wlan0: No ProbeResp from current AP 00:1d:0f:66:fb:02 - assume out of range" in dmesg. It's maybe an other problem but I just note it here.

Thanks
Comment 8 Thomas Gleixner 2008-08-29 15:12:56 UTC
Another incarnation of the HPET vs. AMD story.

Andreas, any idea how to get to the root cause of this ?
Comment 9 herrmann.der.user 2008-09-03 12:24:47 UTC
Maybe it has a similar root cause like bug #11191 where I've seen that min_delta_ns for hpet is too small.

Currently I try to reproduce soft lockups on a system where I've seen this before.
If this is (reliably) reproducible I'll increase min_delta_ns for hpet and hope that this fixes the issue.

If the kernel programs hpet in one-shot mode but hpet_legacy_next_event fails
with -ETIME when a rather small delta was passed, it might happen that the
kernel loops for a long time (or forever?) until HPET_COUNTER is definitely greater than HPET_T0_CMP in hpet_legacy_next_event ...

Writing this I've observed another

"BUG: soft lockup - CPU#1 stuck for 89s! [uname:28197]"

So I am going to verify above idea.
Comment 10 herrmann.der.user 2008-09-03 13:19:06 UTC
Created attachment 17599 [details]
potential fix for the problem

   x86: hpet: increase min_delta_ns to increase chance of successful programming in hpet_legacy_next_event
    
    This fixes http://bugzilla.kernel.org/show_bug.cgi?id=11191
    and most probably http://bugzilla.kernel.org/show_bug.cgi?id=11418
    as well.
    
    With c1e_idle hpet is frequently reprogrammed (in one-shot mode).
    
    If the delta for next timer event is very small the T0 comparator
    value is too close to the current HPET counter value and Linux
    repeatedly tries to reprogram the comparator.
    
    On an HP tx1000 (with AMD Turion and nvidia MCP51) this caused
    
      BUG: spinlock lockup on CPU#0
    
    during boot. On other systems with other chipsets I've observed soft
    lockups, e.g.
    
      BUG: soft lockup - CPU#1 stuck for 89s! [uname:28197]
    
    Both symptoms vanished when I've increased min_delta_ns for hpet.
Comment 11 Thomas Gleixner 2008-09-03 14:41:50 UTC
> Maybe it has a similar root cause like bug #11191 where I've seen that
> min_delta_ns for hpet is too small.

Yup. Same problem.
Comment 12 Gu Rui 2008-09-04 08:35:28 UTC
Thanks for Andreas's patch. But on my box, set min_delta_ns to 0x40 could not fix the problems, i.e, there is still many soft lockups and wlan0 problem. I set it to 0x60 then the problems gone.(uptime 6 hours without soft lockup and wireless connection lost)

Thanks
Comment 13 Thomas Gleixner 2008-09-04 09:11:31 UTC
Created attachment 17622 [details]
combo patch of clockevent fixes which are queued for mainline

Can you please try the attached patch. It has detection of the problem and adjusts the min value if it is too small.

Thanks,

       tglx
Comment 14 Rafael J. Wysocki 2008-09-04 14:59:40 UTC
Handled-By : Thomas Gleixner <tglx@linutronix.de>
Patch : http://bugzilla.kernel.org/attachment.cgi?id=17622&action=view
Comment 15 Rafael J. Wysocki 2008-09-04 15:35:01 UTC
*** Bug 11191 has been marked as a duplicate of this bug. ***
Comment 16 Thomas Gleixner 2008-09-05 03:45:02 UTC
*** Bug 11279 has been marked as a duplicate of this bug. ***
Comment 17 Gu Rui 2008-09-05 04:19:50 UTC
Created attachment 17633 [details]
failed to adjust min_delta_ns with Thomas's patch applyed

5000 nsec maybe enough for reprogramming since there seems no problem purely apply Thomas's patch. So I set the default value to 4000 nsec. Then I observed a system freeze when working in Konsole. Then I move the mouse a bit, the system recovered(amazing..). I did a dmesg immediately as shown in attachment. Not a long time after, the whole system froze. Only the mouse could move but there was no recovery. About 1 min or so the system recovered itself. No more info in dmesg. But I see the kernel doesn't doublemin_delta_ns since it's still 4000 in 'cat /proc/timer_list'. I know little about kernel hacking and only a little about C. And sorry for my poor English.

Thanks
Comment 18 Thomas Gleixner 2008-09-05 05:28:47 UTC
> 5000 nsec maybe enough for reprogramming since there seems no problem purely
> apply Thomas's patch. So I set the default value to 4000 nsec. Then I
> observed
> a system freeze when working in Konsole. Then I move the mouse a bit, the
> system recovered(amazing..). I did a dmesg immediately as shown in
> attachment.
> Not a long time after, the whole system froze. Only the mouse could move but
> there was no recovery. About 1 min or so the system recovered itself. No more
> info in dmesg. But I see the kernel doesn't doublemin_delta_ns since it's
> still
> 4000 in 'cat /proc/timer_list'. I know little about kernel hacking and only a
> little about C. And sorry for my poor English.

The lockup is a different problem. Before that the system did not
recover from such a situation, as it was simply stuck in the
reprogramming loop. To check whether the fixup of a
too small min_delta works, you should set it to 1000ns. That should
trigger the logic in the clock events code.

Can you test that please ?

Thanks,

	tglx
Comment 19 Gu Rui 2008-09-05 05:49:44 UTC
(In reply to comment #18)
> The lockup is a different problem. Before that the system did not
> recover from such a situation, as it was simply stuck in the
> reprogramming loop. To check whether the fixup of a
> too small min_delta works, you should set it to 1000ns. That should
> trigger the logic in the clock events code.
> 
> Can you test that please ?
> 
> Thanks,
> 
>         tglx
> 

1000ns doesn't boot...The screen end up at "ACPI:RTC can wake from S4" or sort of this.I wait for about 1 min then I lost my patience to shutdown my box with force.

Thanks
Comment 20 herrmann.der.user 2008-09-05 06:09:41 UTC
That's strange.
Yesterday I've tested whether fixup of min_delta of Thomas' patches works.
And it worked. (I've used 0x500 as the initial value for min_delta_ns.)
Comment 21 herrmann.der.user 2008-09-05 06:12:00 UTC
FYI, one of my test machines still shows soft lockups with Thomas' patches applied. I guess it's another problem:

BUG: soft lockup - CPU#2 stuck for 166s! [swapper:0]
...
RIP: 0010:[<ffffffff8024ee5b>]  [<ffffffff8024ee5b>] tick_nohz_restart_sched_tic\
k+0x151/0x155
...
Call Trace:
 [<ffffffff8020ac3e>] ? cpu_idle+0x94/0x9e

and

BUG: soft lockup - CPU#2 stuck for 116s! [top:10593]
...
RIP: 0010:[<ffffffff8029a752>]  [<ffffffff8029a752>] __d_lookup+0xef/0x107
...
Call Trace:
...
 [<ffffffff8020be6b>] ? system_call_fastpath+0x16/0x1b

The real weird thing is output of top/ps on this machine:

    6 root      RT  -5     0    0    0 S    0  0.0   0:00.28 migration/1
    7 root      15  -5     0    0    0 S    0  0.0  5124415h ksoftirqd/1
    8 root      RT  -5     0    0    0 S    0  0.0   0:00.00 watchdog/1
...
   18 root      15  -5     0    0    0 S    0  0.0   0:00.68 events/3
   19 root      15  -5     0    0    0 S    0  0.0  19215:21 khelper
  100 root      15  -5     0    0    0 S    0  0.0   0:00.04 kblockd/0

which is just bogus for ksoftirqd and khelper. Or is this just "normal"
consequence of soft lockups?
I'll try to debug what's going on here.
Comment 22 Thomas Gleixner 2008-09-05 06:26:51 UTC
> RIP: 0010:[<ffffffff8029a752>]  [<ffffffff8029a752>] __d_lookup+0xef/0x107
> ...
> Call Trace:
> ...
>  [<ffffffff8020be6b>] ? system_call_fastpath+0x16/0x1b

Full backtrace please.
 
> The real weird thing is output of top/ps on this machine:
> 
>     6 root      RT  -5     0    0    0 S    0  0.0   0:00.28 migration/1
>     7 root      15  -5     0    0    0 S    0  0.0  5124415h ksoftirqd/1
>     8 root      RT  -5     0    0    0 S    0  0.0   0:00.00 watchdog/1
> ...
>    18 root      15  -5     0    0    0 S    0  0.0   0:00.68 events/3
>    19 root      15  -5     0    0    0 S    0  0.0  19215:21 khelper
>   100 root      15  -5     0    0    0 S    0  0.0   0:00.04 kblockd/0
> 
> which is just bogus for ksoftirqd and khelper. Or is this just "normal"
> consequence of soft lockups?

That's default in mainline. Don't remember why the softirqd is not
running as -rt thread. Probably the many yield()s which are in the
network code :(

Thanks,
	tglx
Comment 23 herrmann.der.user 2008-09-05 06:38:28 UTC
Created attachment 17634 [details]
dmesg containing backtrace for soft lockups

> Full backtrace please.

See attachment.
Comment 24 Thomas Gleixner 2008-09-05 22:31:06 UTC
Created attachment 17644 [details]
Final version of the combo-patch against 2.6.27-5

Found the real root cause. The other fixes are still valid, but I'm feeling stupid
Comment 25 Thomas Gleixner 2008-09-05 22:32:34 UTC
Can you please test again?

Thanks,
 
        tglx
Comment 26 Gu Rui 2008-09-06 03:41:57 UTC
Created attachment 17645 [details]
1000ns boot up with Thomas' patch in Comment  #24 applied

Yup, 1000ns boot up on my box~;) We can see that each adjustment followed by a traceback. 

But I have a question to discuss: Why not increase the min_delta_ns by a static step but double it? In my case, 5000ns is enough, but the kernel jump from 4000ns to 8000ns. Maybe we increase min_delta_ns by 1000ns per step is reasonable as the default is already 5000ns.

Thanks
Comment 27 Thomas Gleixner 2008-09-08 10:14:42 UTC
complete patch series merged in mainline:

commit f5325225658737e6c9cb8e24373e2c281a90be2a
Comment 28 herrmann.der.user 2008-09-09 02:18:23 UTC
FYI, just want to tell you my final test results.

Just fixing the min_delta issue I've still seen lots of soft lockups during a
weekend test.

Overnight I've tested the final combo patch (from comment #24) which also
fixed the u32/long calculation issue.
And no soft lockups where detected anymore.

Thanks Thomas.

Note You need to log in before you can comment on or make changes to this bug.