Bug 37382 - machine hang due to task rebalance after about 209 days uptime
Alias: None
Product: Process Management
Classification: Unclassified
Component: Scheduler
Hardware: All Linux
Importance: P1 normal
Assignee: Ingo Molnar
Depends on:
Reported: 2011-06-13 01:50 UTC by kinwin
Modified: 2012-08-24 14:38 UTC

See Also:
Kernel Version:
Regression: No
Bisected commit-id:

crash stack trace (31.09 KB, image/png)
2011-06-13 01:50 UTC, kinwin
kernel configuration (41.40 KB, application/octet-stream)
2011-06-13 01:56 UTC, kinwin

Description kinwin 2011-06-13 01:50:32 UTC
Created attachment 61762
crash stack trace

About 5 percent of our Linux machines hang after about 209 days of uptime. The kernel crash stack trace is in attachment 1.
When we disassembled the machine code, we found that it hits a divide-by-zero error. The offending statement is:

   kernel/sched.c :: update_sg_lb_stats:
  sgs->avg_load = (sgs->group_load * SCHED_LOAD_SCALE) / group->cpu_power;

It seems that when update_group_power runs, the power is calculated as zero.
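
To illustrate the failure mode described above, here is a minimal user-space sketch, not the kernel source. The struct and the function compute_avg_load are hypothetical stand-ins; only the division expression is taken from the report, and SCHED_LOAD_SCALE is assumed to be 1024 (1 << 10). With a zero divisor, the integer division is undefined in C and traps as a divide error on x86, so the sketch refuses to divide instead.

```c
#include <stdbool.h>

/* Assumed value: 1 << 10 in the 2.6.32-era scheduler. */
#define SCHED_LOAD_SCALE 1024UL

/* Hypothetical, reduced stand-in for the scheduler's per-group statistics. */
struct group_stats {
    unsigned long group_load;
    unsigned long avg_load;
};

/*
 * Same expression as in the report, but the zero divisor is detected
 * up front. Returns false and leaves avg_load untouched when cpu_power
 * is zero; returns true after storing the result otherwise.
 */
static bool compute_avg_load(struct group_stats *sgs, unsigned long cpu_power)
{
    if (cpu_power == 0)
        return false;

    sgs->avg_load = (sgs->group_load * SCHED_LOAD_SCALE) / cpu_power;
    return true;
}
```

In the kernel there is no such check at the division site, which is why a group power of zero takes the whole machine down rather than producing a bad load value.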

The kernel configuration is attached below.
Comment 1 kinwin 2011-06-13 01:56:21 UTC
Created attachment 61772
kernel configuration
Comment 2 Andrew Morton 2011-06-20 23:00:35 UTC
(apparently the scheduler code is unmaintained)

I believe this was later fixed by

        if (!power)
                power = 1;

in kernel/sched_fair.c:update_cpu_power().  Either we forgot to backport that fix into the 2.6.32.x stream, or we weren't maintaining that stream by the time the fix was merged.
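
For anyone carrying a local patch, the effect of that guard can be sketched in user space as follows. This is an illustration under assumptions, not the upstream code: scale_power_clamped and its factor parameter are hypothetical names, and the scale-then-shift step is assumed to resemble what update_cpu_power() does before the check. The point is only that the result is clamped to at least 1, so a later division by the power cannot trap.

```c
/* Assumed fixed-point parameters: a factor of 1024 means "100%". */
#define SCHED_LOAD_SHIFT 10
#define SCHED_LOAD_SCALE (1UL << SCHED_LOAD_SHIFT)

/*
 * Hypothetical helper: scale a power value by factor/1024 and never
 * return zero. The final check mirrors the fix quoted above.
 */
static unsigned long scale_power_clamped(unsigned long power,
                                         unsigned long factor)
{
    power *= factor;
    power >>= SCHED_LOAD_SHIFT;

    /* The guard from the fix: a zero power would later be a zero divisor. */
    if (!power)
        power = 1;

    return power;
}
```

A factor of 0, or any product that shifts down to 0, yields 1 rather than 0. The resulting load average is inaccurate in that case, but the machine stays up.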
