Bug 37382 - machine hang due to task rebalance after about 209 days uptime
Product: Process Management
Classification: Unclassified
Component: Scheduler
Hardware: All Linux
Importance: P1 normal
Assigned To: Ingo Molnar
Reported: 2011-06-13 01:50 UTC by kinwin
Modified: 2012-08-24 14:38 UTC

See Also:
Kernel Version:
Tree: Mainline
Regression: No

crash stack trace (31.09 KB, image/png)
2011-06-13 01:50 UTC, kinwin
kernel configuration (41.40 KB, application/octet-stream)
2011-06-13 01:56 UTC, kinwin

Description kinwin 2011-06-13 01:50:32 UTC
Created attachment 61762
crash stack trace

About 5 percent of our machines running Linux hang after about 209 days of uptime. The kernel crash stack trace is in attachment 1.
  When we disassembled the machine code, we found that it hit a divide-by-zero error.
The statement is at:

   kernel/sched.c :: update_sg_lb_stats:
  sgs->avg_load = (sgs->group_load * SCHED_LOAD_SCALE) / group->cpu_power;

It seems that the power is calculated as zero in update_group_power.

The kernel configuration is attached below.
Comment 1 kinwin 2011-06-13 01:56:21 UTC
Created attachment 61772
kernel configuration
Comment 2 Andrew Morton 2011-06-20 23:00:35 UTC
(apparently the scheduler code is unmaintained)

I believe this was later fixed by

        if (!power)
                power = 1;

in kernel/sched_fair.c:update_cpu_power().  Either we forgot to backport that fix into the 2.6.32.x stream, or we weren't maintaining that stream by the time the fix was merged.
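
For readers following along, here is a minimal user-space sketch of why that clamp avoids the trap reported above. It is not the kernel code: clamp_power() and avg_load() are hypothetical helpers written for illustration, and the value 1024 for SCHED_LOAD_SCALE is an assumption. The sketch only shows that a divisor of zero is replaced by 1 before the average-load division runs.

```c
/* Assumed scale factor, for illustration only. */
#define SCHED_LOAD_SCALE 1024UL

/* Hypothetical stand-in for the guard quoted above:
 * never let a computed power of zero reach a division. */
static unsigned long clamp_power(unsigned long power)
{
        if (!power)
                power = 1;
        return power;
}

/* Hypothetical helper mirroring the shape of the failing statement:
 * avg_load = (group_load * SCHED_LOAD_SCALE) / cpu_power.
 * The divisor is clamped first, so it can never be zero. */
static unsigned long avg_load(unsigned long group_load,
                              unsigned long cpu_power)
{
        return (group_load * SCHED_LOAD_SCALE) / clamp_power(cpu_power);
}
```

With a non-zero power the result is unchanged by the clamp. With a power of zero the division still completes, yielding a very large average load instead of a divide error.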
