Bug 9632
Summary: | kernel BUG at kernel/timer.c:606 | ||
---|---|---|---|
Product: | Timers | Reporter: | Badalian Slava (slavon.net) |
Component: | Other | Assignee: | john stultz (john.stultz) |
Status: | CLOSED CODE_FIX | ||
Severity: | high | CC: | bunk, jarkao2, marek, oleg |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 2.6.23.12 | Subsystem: | |
Regression: | --- | Bisected commit-id: | |
Attachments: | debug patch to detect the corrupted pending timers |
Description
Badalian Slava
2007-12-25 01:20:29 UTC
Looks like the pending timer was corrupted. Say it was freed/reused without del_timer(). I don't know how to add myself to CC list and send the patch at the same time. Will send the debug patch in the next comment. Created attachment 14183 [details]
debug patch to detect the corrupted pending timers
WARNING: the patch is not tested.
Still, ant chance you can reproduce the bug with this patch?
Hmm.. PC must be HA. i can't test paches on it =( i replace BUG_ON(xxx) to if (xxx) panic(); i think its help to me and i wait for fixing bug (reboot is better when freeze). Hi Leigh Sharpe had the ?same? problem as you maybe having right now. Read on here: http://lists.openwall.net/netdev/2007/12/05/43 He solved the problem by modifying order of commands in his script. Below is his original reply to my inquiry: ---------------------------- My issue wasn't with an SMP kernel. I only had one CPU in the machine I was playing with. I actually found the reason for the problem I was having. It seems that when adding qdiscs or classes, you need to do it in the right order. If you add filters to a qdisc, redirecting traffic to a class or qdisc which has not yet been set up, the kernel will crash. If classes are set up prior to adding the filters, it doesn't cause a problem. Leigh ---------------------------- Good luck Badalian. Order Class add Qdisc add Filter add first rule qdisc add dev eth1 root handle 1 htb default 7 I think this order could especially matter with ifb which is used by Leigh and Marek, but even if changing the order helps it's not the proper bug fix, after all. BTW, Slava, could you add here some more details like .config and maybe a bit more about these rules (addresses masked of course)? I especially wonder if you are using something like ifb, vlan or bonding? Of course the outcome from Oleg's patch should be still the most interesting. One more BTW: Oleg, your patch looks very interesting (clever) and I hope you'll manage to add this to the kernel under some new or existing debugging option - it seems to be very useful considering the 'optimistic' way of deleting timers in many places. But, maybe I miss something, it seems you could try to add 'recovering' of these timers after the buggy one, e.g. with some backwards loop? On 12/28, bugme-daemon@bugzilla.kernel.org wrote: > > ------- Comment #8 from jarkao2@gmail.com 2007-12-28 00:34 ------- > I hope you'll manage to add this to the kernel under some new or > existing debugging option Will try to do. The patch is simple, but unfortunately it needs a lot of #ifdef's to minimize the impact without CONFIG_DEBUG_XXX. > But, maybe > I miss something, it seems you could try to add 'recovering' of > these timers after the buggy one, e.g. with some backwards loop? Yes Jarek. I didn't dare to do this right now, just did a minimal hack to catch the bug. It also need other changes. Say, __run_timers() should also check the timer. Btw, please look at the first attachment at http://bugzilla.kernel.org/show_bug.cgi?id=9180 This patch is similar, and it does recover the list in run_workqueue(). But it is not "complete" as well. Oleg. > Will try to do. The patch is simple, but unfortunately it needs a lot > of #ifdef's to minimize the impact without CONFIG_DEBUG_XXX. After you catch the idea it could look like simple... But, it took some time for me... I thought about using a static table for this, like in lockdep, that's why I called your way clever. And I 'personally' like ifdefs: they make it harder to read the code at the beginning, but IMHO they make it easier e.g. to debug it later, when you can easilly skip what doesn't matter for sure. But it would be very bad to abstain from adding debugging (and tolerate such misterious crashes) for such esthetical reasons. > This patch is similar, and it does recover the list in run_workqueue(). > But it is not "complete" as well. Yes, I've thought about mentionning here workqueue too... Of course, this is needed as well - maybe a bit less after fixing the cancelling in workqueues... Probably some more solutions could be reimplemented too (lockdep checks for del_timer_sync?!). I also wonder why there is no such simple thing as del_timer_last() or _exit(), which could be required in all xyz_exit() or xyz_destroy() functions to mark the timer_list structure can't be used by mod_timer() anymore, without init_timer()?! There could be also considered if checking the function field only in such a debugging is reliable enough: maybe some checksum would be even better. But, of course, there is no need to wait with adding a basic debugging like this now (especially to -mm). Any additional features or improvements could be done later. Thanks! > [...] maybe some checksum would be even better [...]
As a matter of fact the simplest and most realiable should be storing
of some pointer or key, which could be verified by a call to mm if
it's still valid (this patrt of memory not kfreed in the meantime),
but I don't know how much I'm dreaming with this...
Hmmm... in last kernels system work normal... i think need close bug... i reopen if bug still in kernel and i can reproduce it. |