Bug 37382

Summary: machine hang due to task rebalance after about 209 days uptime
Product: Process Management Reporter: kinwin (kinwin2008)
Component: SchedulerAssignee: Ingo Molnar (mingo)
Severity: normal CC: akpm, alan
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: Subsystem:
Regression: No Bisected commit-id:
Attachments: crash stack trace
kernel configuration

Description kinwin 2011-06-13 01:50:32 UTC
Created attachment 61762 [details]
crash stack trace

About 5 percent of our machines running on linux hangs after 209 days running, kernel crash stack trace in attachment1 [details].
  when we disasemble the machine code, we found it meet divide-by-zero error.where
the sentence is at: 

   kernel/sched.c :: update_sg_lb_stats:
  sgs->avg_load = (sgs->group_load * SCHED_LOAD_SCALE) / group->cpu_power;

it seemes like when update_group_power, the power is calculated as zero.

kernel configure attached follow.
Comment 1 kinwin 2011-06-13 01:56:21 UTC
Created attachment 61772 [details]
kernel configuration
Comment 2 Andrew Morton 2011-06-20 23:00:35 UTC
(apparently the scheduler code is unmaintained)

I believe this was later fixed by

        if (!power)
                power = 1;

in kernel/sched_fair.c:update_cpu_power().  Either we forgot to backport that fix into or we weren't maintaining the 2.6.32.x stream by the time the fix was merged.