Bug 10068 - timer.c crash using WI-FI (current process: firefox)
Summary: timer.c crash using WI-FI (current process: firefox)
Status: CLOSED PATCH_ALREADY_AVAILABLE
Alias: None
Product: Networking
Classification: Unclassified
Component: Wireless
Hardware: All Linux
Importance: P1 high
Assignee: networking_wireless@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2008-02-22 11:16 UTC by Marco Zaccheria
Modified: 2008-04-16 13:46 UTC
CC List: 7 users

See Also:
Kernel Version: 2.6.24.2
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
debug patch (1.74 KB, patch)
2008-02-24 22:47 UTC, Thomas Gleixner
(timer) objects debug facility (15.56 KB, patch)
2008-02-26 14:56 UTC, Thomas Gleixner
(timer) objects debug facility v2 (15.56 KB, patch)
2008-02-26 15:00 UTC, Thomas Gleixner
(timer) objects debug facility v3(aka picked-the-right-file-this-time) (15.67 KB, text/x-patch)
2008-02-26 16:33 UTC, Thomas Gleixner

Description Marco Zaccheria 2008-02-22 11:16:38 UTC
Latest working kernel version: 2.6.19.2
Earliest failing kernel version: 2.6.24.2
Distribution: Debian Lenny/Sid
Hardware Environment: athlon XP 2400+ using a zd1211 device (driver zd1211rw)
Software Environment: X11 with Gnome; crashed while using firefox (iceweasel)

Problem Description:
System crashes completely. It seems related to wireless network usage: I've used my system several times without connecting the wifi device (and without any other network interface enabled) and had no crash.
I haven't seen the problem on the 2.6.19.2 kernel, I think because the zd1211rw driver didn't work for my card there.
Here's the log (not flushed to disk!!!)

------------------------------

Kernel BUG at kernel/timer.c: 607!
Invalid opcode: 0000 [#1]
Modules linked in: cpufreq_stats nls_cp437 sbp2 scsi_mod loop zd1211rw ieee80211softmac parport_pc parport ohci1394 snd_intel8x0 ieee1394 sis900 ehci_hcd ide_cd cdrom fan asus_acpi backlight battery ac

Pid 3239, comm: firefox-bin Not tainted (2.6.24.2 #1)
EIP:0060 :[<c011e54b>] EFLAGS:00210007 CPU:0
EIP is at cascade+0x3b/0x57
EAX:0 EBX:0 ECX:5 EDX:d9eb3ca4
ESI:5 EDI:c0485640 EBP:d9ecdf30 ESP:d9ecdf30
DS:007b ES:007b FS:0000 GS:0033 SS:0068

...

Call trace

[<c011e6ad>] run_timer_softirq+0x55/0x141
[<c012b8e3>] tick_handle_periodic+0xf/0x54
[<c011bdcc>] __do_softirq+0x35/0x75
[<c011be2e>] do_softirq+0x22/0x26
[<c01055b0>] do_IRQ+0x58/0x6b
[<c033b1a7>] schedule+0x1f0/0x20a
[<c01045e7>] common_interrupt+0x23/0x28

Kernel Panic - not syncing: Fatal exception in interrupt




Steps to reproduce:
Stress network
Comment 1 Thomas Gleixner 2008-02-24 22:41:51 UTC
Doh, some stupid code is calling init_timer() on an enqueued timer. I'll whip up a patch which allows us to debug this.
Comment 2 Thomas Gleixner 2008-02-24 22:47:06 UTC
Created attachment 14974 [details]
debug patch

Marco, can you please apply the attached patch and provide the debug output ?

Thanks,
       tglx
Comment 3 Anonymous Emailer 2008-02-25 16:29:25 UTC
Reply-To: akpm@linux-foundation.org

On Fri, 22 Feb 2008 11:16:40 -0800 (PST) bugme-daemon@bugzilla.kernel.org wrote:

> http://bugzilla.kernel.org/show_bug.cgi?id=10068
> 
>            Summary: timer.c crash using WI-FI (current process: firefox)
>            Product: Timers
>            Version: 2.5
>      KernelVersion: 2.6.24.2
>           Platform: All
>         OS/Version: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: blocking
>           Priority: P1
>          Component: Other
>         AssignedTo: johnstul@us.ibm.com
>         ReportedBy: zacmarco@yahoo.it
> 
> 
> Latest working kernel version: 2.6.19.2
> Earliest failing kernel version: 2.6.24.2
> Distribution: Debian Lenny/Sid
> Hardware Environment: athlon XP 2400+ using a zd1211 device (driver zd1211rw)
> Software Environment: X11 with Gnome; crashed while using firefox (iceweasel)
> 
> Problem Description:
> System crashes completely. It seems related to wireless network usage, I've
> used my system several times without connecting the wifi device (and without
> any other network interface enabled).
> I haven't found the problem on 2.6.19.2 kernel I think because zd1211rw
> driver
> didn't work for my card
> Here's the log (not flushed to disk!!!)
> 
> ------------------------------
> 
> Kernel BUG at kernel/timer.c: 607!
> Invalid opcode: 0000 [#1]
> Modules linked in: cpufreq_stats nls_cp437 sbp2 scsi_mod loop zd1211rw
> ieee80211softmac parport_pc parport ohci1394 snd_intel8x0 ieee1394 sis900
> ehci_hcd ide_cd cdrom fan asus_acpi backlight battery ac
> 
> Pid 3239, comm: firefox-bin Not tainted (2.6.24.2 #1)
> EIP:0060 :[<c011e54b>] EFLAGS:00210007 CPU:0
> EIP is at cascade+0x3b/0x57
> EAX:0 EBX:0 ECX:5 EDX:d9eb3ca4
> ESI:5 EDI:c0485640 EBP:d9ecdf30 ESP:d9ecdf30
> DS:007b ES:007b FS:0000 GS:0033 SS:0068
> 
> ...
> 
> Call trace
> 
> [<c011e6ad>] run_timer_softirq+0x55/0x141
> [<c012b8e3>] tick_handle_periodic+0xf/0x54
> [<c011bdcc>] __do_softirq+0x35/0x75
> [<c011be2e>] do_softirq+0x22/0x26
> [<c01055b0>] do_IRQ+0x58/0x6b
> [<c033b1a7>] schedule+0x1f0/0x20a
> [<c01045e7>] common_interrupt+0x23/0x28
> 
> Kernel Panic - not syncing: Fatal exception in interrupt
> 

urgh.

Yes, it's probably a wireless driver bug.  But look at the BUG_ON():

static int cascade(tvec_base_t *base, tvec_t *tv, int index)
{
	/* cascade all the timers from tv up one level */
	struct timer_list *timer, *tmp;
	struct list_head tv_list;

	list_replace_init(tv->vec + index, &tv_list);

	/*
	 * We are removing _all_ timers from the list, so we
	 * don't have to detach them individually.
	 */
	list_for_each_entry_safe(timer, tmp, &tv_list, entry) {
		BUG_ON(tbase_get_base(timer->base) != base);
		internal_add_timer(base, timer);
	}

	return index;
}

if we're going to detect some bug, we should provide _some_ information
telling the poor programmer what he did wrong!  This one is very obscure.

Seems we found a timer on CPU A's list, but the timer thinks it's on CPU
B's list.  Or not on a list at all.

Question is: what sequence of timer interface calls could have caused this
to occur?  And can we add a check for that bug at the time where it occurs,
rather than later on in the timer interrupt handler?
Comment 4 Oleg Nesterov 2008-02-25 17:04:05 UTC
On 02/25, Andrew Morton wrote:
>
> Question is: what sequence of timer interface calls could have caused this
> to occur?  And can we add a check for that bug at the time where it occurs,
> rather than later on in the timer interrupt handler?

Most probably the pending timer was corrupted. Say it was freed/reused
without del_timer(), or re-initialized.

Marco, could you try this patch
	http://bugzilla.kernel.org/attachment.cgi?id=14183
?

see also http://bugzilla.kernel.org/attachment.cgi?id=14183

Thomas's patch can also help, but if the pending timer was overwritten,
->init_site could be dirtied too.

Oleg.
Comment 5 Oleg Nesterov 2008-02-25 17:26:59 UTC
On 02/26, Oleg Nesterov wrote:
>
> Marco, could you try this patch
>       http://bugzilla.kernel.org/attachment.cgi?id=14183
> ?
> 
> see also http://bugzilla.kernel.org/attachment.cgi?id=14183

Argh. It can't be applied because of
	"time: clean hungarian notation from timers"
	commit a6fa8e5a6172a5a5bc06ed04f34e50b36c978127

Please find the re-diff below. Hopefully it still works, but it doesn't
like CONFIG_HOTPLUG_CPU.

Oleg.

--- MM/include/linux/timer.h~TMR_DBG	2008-02-17 23:40:09.000000000 +0300
+++ MM/include/linux/timer.h	2008-02-26 04:07:15.000000000 +0300
@@ -8,6 +8,7 @@
 struct tvec_base;
 
 struct timer_list {
+	void (*next_func)(unsigned long);
 	struct list_head entry;
 	unsigned long expires;
 
--- MM/kernel/timer.c~TMR_DBG	2008-02-17 23:41:28.000000000 +0300
+++ MM/kernel/timer.c	2008-02-26 04:14:15.000000000 +0300
@@ -58,12 +58,19 @@ EXPORT_SYMBOL(jiffies_64);
 #define TVN_MASK (TVN_SIZE - 1)
 #define TVR_MASK (TVR_SIZE - 1)
 
+struct xxx {
+	void (*next_func)(unsigned long);
+	struct list_head list;
+};
+
+#define tox(p)	list_entry((p), struct timer_list, entry)
+
 struct tvec {
-	struct list_head vec[TVN_SIZE];
+	struct xxx vec[TVN_SIZE];
 };
 
 struct tvec_root {
-	struct list_head vec[TVR_SIZE];
+	struct xxx vec[TVR_SIZE];
 };
 
 struct tvec_base {
@@ -256,7 +263,7 @@ static void internal_add_timer(struct tv
 {
 	unsigned long expires = timer->expires;
 	unsigned long idx = expires - base->timer_jiffies;
-	struct list_head *vec;
+	struct xxx *vec;
 
 	if (idx < TVR_SIZE) {
 		int i = expires & TVR_MASK;
@@ -291,7 +298,9 @@ static void internal_add_timer(struct tv
 	/*
 	 * Timers are FIFO:
 	 */
-	list_add_tail(&timer->entry, vec);
+	list_add_tail(&timer->entry, &vec->list);
+	timer->next_func = tox(timer->entry.next)->function;
+	tox(timer->entry.prev)->next_func = timer->function;
 }
 
 #ifdef CONFIG_TIMER_STATS
@@ -351,6 +360,7 @@ static inline void detach_timer(struct t
 {
 	struct list_head *entry = &timer->entry;
 
+	tox(entry->prev)->next_func = timer->next_func;
 	__list_del(entry->prev, entry->next);
 	if (clear_pending)
 		entry->next = NULL;
@@ -594,15 +604,22 @@ static int cascade(struct tvec_base *bas
 	/* cascade all the timers from tv up one level */
 	struct timer_list *timer, *tmp;
 	struct list_head tv_list;
+	void (*func)(unsigned long) = tv->vec[index].next_func;
 
-	list_replace_init(tv->vec + index, &tv_list);
+	list_replace_init(&tv->vec[index].list, &tv_list);
 
 	/*
 	 * We are removing _all_ timers from the list, so we
 	 * don't have to detach them individually.
 	 */
 	list_for_each_entry_safe(timer, tmp, &tv_list, entry) {
-		BUG_ON(tbase_get_base(timer->base) != base);
+		if (tbase_get_base(timer->base) != base || timer->function != func) {
+			print_symbol(KERN_CRIT "ERR!! 1 %s\n", (unsigned long)func);
+			print_symbol(KERN_CRIT "ERR!! 2 %s\n", (unsigned long)timer->function);
+			printk(KERN_CRIT "ERR!! 3 %p %p\n", base, timer->base);
+			break;
+		}
+		func = timer->next_func;
 		internal_add_timer(base, timer);
 	}
 
@@ -624,8 +641,8 @@ static inline void __run_timers(struct t
 
 	spin_lock_irq(&base->lock);
 	while (time_after_eq(jiffies, base->timer_jiffies)) {
-		struct list_head work_list;
-		struct list_head *head = &work_list;
+		struct xxx work_list;
+		struct list_head *head = &work_list.list;
 		int index = base->timer_jiffies & TVR_MASK;
 
 		/*
@@ -637,7 +654,7 @@ static inline void __run_timers(struct t
 					!cascade(base, &base->tv4, INDEX(2)))
 			cascade(base, &base->tv5, INDEX(3));
 		++base->timer_jiffies;
-		list_replace_init(base->tv1.vec + index, &work_list);
+		list_replace_init(&base->tv1.vec[index].list, &work_list.list);
 		while (!list_empty(head)) {
 			void (*fn)(unsigned long);
 			unsigned long data;
@@ -1264,13 +1281,13 @@ static int __cpuinit init_timers_cpu(int
 	spin_lock_init(&base->lock);
 
 	for (j = 0; j < TVN_SIZE; j++) {
-		INIT_LIST_HEAD(base->tv5.vec + j);
-		INIT_LIST_HEAD(base->tv4.vec + j);
-		INIT_LIST_HEAD(base->tv3.vec + j);
-		INIT_LIST_HEAD(base->tv2.vec + j);
+		INIT_LIST_HEAD(&base->tv5.vec[j].list);
+		INIT_LIST_HEAD(&base->tv4.vec[j].list);
+		INIT_LIST_HEAD(&base->tv3.vec[j].list);
+		INIT_LIST_HEAD(&base->tv2.vec[j].list);
 	}
 	for (j = 0; j < TVR_SIZE; j++)
-		INIT_LIST_HEAD(base->tv1.vec + j);
+		INIT_LIST_HEAD(&base->tv1.vec[j].list);
 
 	base->timer_jiffies = jiffies;
 	return 0;
Comment 6 Thomas Gleixner 2008-02-26 00:24:07 UTC
On Mon, 25 Feb 2008, bugme-daemon@bugzilla.kernel.org wrote:
>
> if we're going to detect some bug, we should provide _some_ information
> telling the poor programmer what he did wrong!  This one is very obscure.
> 
> Seems we found a timer on CPU A's list, but the timer thinks it's on CPU
> B's list.  Or not on a list at all.

The timer was enqueued while some stupid code called init_timer()
 
> Question is: what sequence of timer interface calls could have caused this
> to occur?  And can we add a check for that bug at the time where it occurs,
> rather than later on in the timer interrupt handler?
 
I'm looking into that, but it's pretty hard to detect that in
init_timer() reliably. 

We had this other problem with bluetooth as well, where a timer was
not deleted before the data structure which contained the timer was
freed. The problem in both cases is that the timer list is corrupted
and we have no chance to detect it _before_ the shit hits the fan.

Thanks,
	tglx
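
To illustrate the first failure mode described above (init_timer() called on a timer that is still enqueued), here is a minimal userspace sketch. It is not kernel code: sim_timer, sim_init_timer(), sim_del_timer(), sim_mod_timer() and bucket_is_sane() are hypothetical stand-ins, written on the assumption that "pending" is encoded as entry.next != NULL and that re-initialising clears that pointer without unlinking the node.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal circular doubly linked list, modelled on the kernel's list_head. */
struct list_head {
	struct list_head *next, *prev;
};

static void list_init(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

static void list_add_tail(struct list_head *node, struct list_head *head)
{
	node->prev = head->prev;
	node->next = head;
	head->prev->next = node;
	head->prev = node;
}

/* Hypothetical stand-in for struct timer_list: only the list linkage. */
struct sim_timer {
	struct list_head entry;
};

/* Marks the timer "not pending" WITHOUT unlinking it from any list. */
static void sim_init_timer(struct sim_timer *t)
{
	t->entry.next = NULL;
}

static bool sim_timer_pending(const struct sim_timer *t)
{
	return t->entry.next != NULL;
}

/* Unlinks the timer, but only if the timer itself believes it is pending. */
static void sim_del_timer(struct sim_timer *t)
{
	if (!sim_timer_pending(t))
		return;
	t->entry.prev->next = t->entry.next;
	t->entry.next->prev = t->entry.prev;
	t->entry.next = NULL;
}

/* Dequeues the timer if pending, then queues it on the given bucket. */
static void sim_mod_timer(struct sim_timer *t, struct list_head *bucket)
{
	sim_del_timer(t);
	list_add_tail(&t->entry, bucket);
}

/* True if walking from head returns to head with consistent back links. */
static bool bucket_is_sane(const struct list_head *head, int max_nodes)
{
	const struct list_head *pos = head;
	int i;

	for (i = 0; i <= max_nodes; i++) {
		if (pos->next == NULL || pos->next->prev != pos)
			return false;
		pos = pos->next;
		if (pos == head)
			return true;
	}
	return false;
}

/* Returns whether bucket A is still a well-formed list afterwards. */
static bool run_scenario(bool reinit_while_queued)
{
	struct list_head bucket_a, bucket_b;
	struct sim_timer t;

	list_init(&bucket_a);
	list_init(&bucket_b);
	sim_init_timer(&t);
	sim_mod_timer(&t, &bucket_a);

	if (reinit_while_queued)
		sim_init_timer(&t);	/* the bug: t is still on bucket A */

	sim_mod_timer(&t, &bucket_b);	/* buggy case: A is never unlinked */

	assert(bucket_is_sane(&bucket_b, 1));
	return bucket_is_sane(&bucket_a, 1);
}
```

In this model run_scenario(false) leaves bucket A empty and well formed, while run_scenario(true) leaves bucket A pointing at a node that now lives on bucket B. A walk that starts at A then wanders into B's list head and never gets back to A, which is the kind of corruption a later list walk trips over long after the offending call.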
Comment 7 Marco Zaccheria 2008-02-26 00:33:35 UTC
Hi all!
Yesterday evening I tried the kernel with Thomas' patch for about 5 hours but, sorry, it didn't crash!
This evening (I can't try it here at work because I don't have my PC) I'll try your latest patch and give you feedback.
Thanks
Comment 8 Marco Zaccheria 2008-02-26 13:36:01 UTC
I've reproduced the bug (this time in a workqueue method called during a sw interrupt), but this time I have no debug output because the log wasn't flushed to disk and, after some minutes, it started continuously printing this message on screen:

zd1211rw 4-5:1.0: Could not allocate skb.

Now I'm going to try your latest patch.

Another question: is it normal that the wireless driver periodically (about every 30 seconds) prints the message

SoftMAC: Open Authentication completed with 00:01:38:8e:5f:43

with the AP's MAC address?

THX
Comment 9 Thomas Gleixner 2008-02-26 14:05:15 UTC
> I've reproduced the bug (this time in a workqueue method called during a sw
> interrupt), but this time I have no debug output, because log wasn't flushed
> to disk and, this time, after some minutes, it started printing continuously
> on screen this message:

Hang on for a couple of minutes. I created a debug patch, which should
allow the box to survive and tell us exactly where the wreckage
happens. I'm right now testing it myself with a couple of known timer
wreckage variants to make sure that it works.

Thanks,
	tglx
Comment 10 Thomas Gleixner 2008-02-26 14:56:44 UTC
Created attachment 15017 [details]
(timer) objects debug facility

Please remove the previous debug patches and apply this one.

Enable CONFIG_DEBUG_OBJECT_OPS and CONFIG_DEBUG_OBJECT_TIMERS

CONFIG_DEBUG_OBJECT_FREE is optional (I guess your problem is covered by the two above options already)

Thanks,
       tglx
Comment 11 Thomas Gleixner 2008-02-26 15:00:29 UTC
Created attachment 15018 [details]
 (timer) objects debug facility v2

Doh, forgot to refresh the patch before uploading.
Comment 12 Thomas Gleixner 2008-02-26 16:33:48 UTC
Created attachment 15019 [details]
(timer) objects debug facility v3(aka picked-the-right-file-this-time)

/me feels really stupid

I really should stay away from GUI tools that require selecting a file by mouse click when I'm tired.

I have not found a sane way to create a bugzilla attachment via mail :( Pointers are welcome !

Sorry for the noise.

    tglx
Comment 13 Marco Zaccheria 2008-02-28 01:36:51 UTC
Hi!
I've tried your patch but it crashed again!!!
Sorry, I don't have any debug output because yesterday evening I didn't have enough time to reproduce the bug again.
I hope I can give you more information this evening.
Can you tell me what I can do to help you more? (apart from writing in better English ;) )
Comment 14 Oleg Nesterov 2008-02-28 02:08:25 UTC
On 02/28, bugme-daemon@bugzilla.kernel.org wrote:
>
> http://bugzilla.kernel.org/show_bug.cgi?id=10068
> 
> ------- Comment #13 from zacmarco@yahoo.it  2008-02-28 01:36 -------
> Hi!
> I've tried your patch but it crashed again!!!

Do you mean you still see the same BUG_ON() with Thomas' patch applied?

In that case, perhaps you can try the patch I sent. It is not as generic
as Thomas's, it is just a quick and dirty hack to catch this particular BUG().

BTW, thanks a lot for your efforts ;)

Oleg.
Comment 15 Marco Zaccheria 2008-02-28 12:55:22 UTC
Ok.
Here's the BUG_ON() output (with Thomas' patch):

BUG: unable to handle kernel NULL pointer dereference at virtual address 00000014
printing eip: c0123781 *pde = 00000000

The call stack seems the same as other tests.

Yesterday I tried to patch the sources with Oleg's patch but the "patch" command gave me an error (as if the starting file was not the same). May I apply your patch to the official 2.6.24.2 file?

By the way, in a few minutes I'll try to re-patch it (right now, I'm VERY VERY VERY sorry to be working on Windows; it's the only way I can stay connected).
Comment 16 Thomas Gleixner 2008-02-28 14:15:48 UTC
On Thu, 28 Feb 2008, bugme-daemon@bugzilla.kernel.org wrote:
> ------- Comment #15 from zacmarco@yahoo.it  2008-02-28 12:55 -------
> Ok.
> Here's the BUG_ON() output (with Thomas' patch):
> 
> BUG: unable to handle kernel NULL pointer dereference at virtual address
> 00000014
> printing eip: c0123781 *pde = 00000000

Which CONFIG options did you enable ?

Thanks,
	tglx
Comment 17 Oleg Nesterov 2008-02-29 00:20:37 UTC
On 02/28, bugme-daemon@bugzilla.kernel.org wrote:
>
> http://bugzilla.kernel.org/show_bug.cgi?id=10068
> 
> ------- Comment #15 from zacmarco@yahoo.it  2008-02-28 12:55 -------
> 
> Yesterday I tried to patch the sources with Oleg's patch but the "patch"
> command gave me an error (as if the starting file was not the same). May I
> apply your patch to the official 2.6.24.2 file?

Ah, so you are using 2.6.24. In that case please use the first patch

	http://bugzilla.kernel.org/attachment.cgi?id=14183

sorry for the confusion!

Oleg.
Comment 18 Marco Zaccheria 2008-03-05 11:42:22 UTC
Patch applied but... it crashed instantly!

For Thomas: I've tried to add the options you told me about (with your patch applied), but it seems that they disappeared when I ran the "make" command. I added these options to the .config. Is that right?
Comment 19 Thomas Gleixner 2008-03-05 12:44:53 UTC
Hmm, the options probably disappear because they depend on CONFIG_DEBUG_KERNEL, which is probably not set in your .config.

Please either use "make menuconfig" or try adding the following to your .config:

CONFIG_DEBUG_KERNEL=y
CONFIG_DEBUG_OBJECT_OPS=y
CONFIG_DEBUG_OBJECT_TIMERS=y
CONFIG_DEBUG_OBJECT_FREE=y

If that does not help, please attach your .config. I'll fix it for you.

Thanks,
       tglx
Comment 20 Oleg Nesterov 2008-03-05 12:45:47 UTC
On 03/05, bugme-daemon@bugzilla.kernel.org wrote:
>
> http://bugzilla.kernel.org/show_bug.cgi?id=10068
> 
> ------- Comment #18 from zacmarco@yahoo.it  2008-03-05 11:42 -------
> Patch applied but... it crashed instantly!

Not that I am really surprised, but it was tested (not by me).
Could you be more verbose, what exactly happens? Please send
me privately include/linux/timer.h + kernel/timer.c with this
patch applied. And .config, please.

(to avoid a possible confusion, Thomas's patch is better, but
 in case it can't catch this bug...)

Oleg.
Comment 21 Marco Zaccheria 2008-03-09 15:53:51 UTC
Sorry for the delay, I was busy last week.
Today I tried Thomas' patch (I had a problem with the debugobjects.c file when applying the patch, but I solved it by cutting and pasting all the code lines from the patch file). The system stays up, but I found the following trace in the system logs:

ODEBUG: init active object: db112e94 timer_list
WARNING: at lib/debugobjects.c:63 debug_print_object()
Pid: 2023, comm: softmac Not tainted 2.6.24.2 #10
 [<c01c5181>] debug_object_op+0x89/0xe0
 [<c0120168>] init_timer+0x18/0x40
 [<e098f813>] ieee80211softmac_auth_req+0x6b/0x9c [ieee80211softmac]
 [<e0991543>] ieee80211softmac_assoc_work+0x292/0x392 [ieee80211softmac]
 [<e0991643>] ieee80211softmac_assoc_notify_scan+0x0/0x10 [ieee80211softmac]
 [<e0991ab6>] ieee80211softmac_notify_callback+0x40/0x48 [ieee80211softmac]
 [<e0991a76>] ieee80211softmac_notify_callback+0x0/0x48 [ieee80211softmac]
 [<e0991978>] ieee80211softmac_call_events_locked+0xdc/0xee [ieee80211softmac]
 [<e0991643>] ieee80211softmac_assoc_notify_scan+0x0/0x10 [ieee80211softmac]
 [<e0991a76>] ieee80211softmac_notify_callback+0x0/0x48 [ieee80211softmac]
 [<c01250bf>] run_workqueue+0x6b/0xdf
 [<c0335f0f>] schedule+0x1f0/0x20a
 [<c01256b2>] worker_thread+0x0/0xc2
 [<c0125766>] worker_thread+0xb4/0xc2
 [<c0127baa>] autoremove_wake_function+0x0/0x33
 [<c01256b2>] worker_thread+0x0/0xc2
 [<c0127a4a>] kthread+0x36/0x5c
 [<c0127a14>] kthread+0x0/0x5c
 [<c0104757>] kernel_thread_helper+0x7/0x10
 =======================

I hope this trace helps you.
Comment 22 Oleg Nesterov 2008-03-09 16:12:49 UTC
On 03/09, bugme-daemon@bugzilla.kernel.org wrote:
>
> http://bugzilla.kernel.org/show_bug.cgi?id=10068
> 
> ------- Comment #21 from zacmarco@yahoo.it  2008-03-09 15:53 -------
> Sorry for the delay, I was busy last week.
> Today I've tried Thomas' patch (I had a problem on the debugobjects.c file
> applying the patch, but I've solved the problem cutting and pasting all code
> lines from the patch file). The system remains up, but I've found the
> following
> trace on system logs:
> 
> ODEBUG: init active object: db112e94 timer_list
> WARNING: at lib/debugobjects.c:63 debug_print_object()
> Pid: 2023, comm: softmac Not tainted 2.6.24.2 #10
>  [<c01c5181>] debug_object_op+0x89/0xe0
>  [<c0120168>] init_timer+0x18/0x40
>  [<e098f813>] ieee80211softmac_auth_req+0x6b/0x9c [ieee80211softmac]
>  [<e0991543>] ieee80211softmac_assoc_work+0x292/0x392 [ieee80211softmac]
>  [<e0991643>] ieee80211softmac_assoc_notify_scan+0x0/0x10 [ieee80211softmac]
>  [<e0991ab6>] ieee80211softmac_notify_callback+0x40/0x48 [ieee80211softmac]
>  [<e0991a76>] ieee80211softmac_notify_callback+0x0/0x48 [ieee80211softmac]
>  [<e0991978>] ieee80211softmac_call_events_locked+0xdc/0xee
>  [ieee80211softmac]
>  [<e0991643>] ieee80211softmac_assoc_notify_scan+0x0/0x10 [ieee80211softmac]
>  [<e0991a76>] ieee80211softmac_notify_callback+0x0/0x48 [ieee80211softmac]
>  [<c01250bf>] run_workqueue+0x6b/0xdf
>  [<c0335f0f>] schedule+0x1f0/0x20a
>  [<c01256b2>] worker_thread+0x0/0xc2
>  [<c0125766>] worker_thread+0xb4/0xc2
>  [<c0127baa>] autoremove_wake_function+0x0/0x33
>  [<c01256b2>] worker_thread+0x0/0xc2
>  [<c0127a4a>] kthread+0x36/0x5c
>  [<c0127a14>] kthread+0x0/0x5c
>  [<c0104757>] kernel_thread_helper+0x7/0x10
>  =======================
> 
> I hope I could help you with this trace

Thanks a lot! this does help.

might be related to

	[Bug 8937] BUG prempt in workqueue.c
	http://bugzilla.kernel.org/show_bug.cgi?id=8937

Oleg.
Comment 23 Thomas Gleixner 2008-03-10 02:28:48 UTC
(In reply to comment #21)
> Sorry for the delay, I was busy last week.
> Today I've tried Thomas' patch (I had a problem on the debugobjects.c file
> applying the patch, but I've solved the problem cutting and pasting all code
> lines from the patch file). The system remains up, but I've found the
> following
> trace on system logs:

Yep, that's the intention of the patch: to keep the system alive and, at the same time, point out the place where the problem happens.
 
> ODEBUG: init active object: db112e94 timer_list

That's what I suspected in http://bugzilla.kernel.org/show_bug.cgi?id=10068#c1

> WARNING: at lib/debugobjects.c:63 debug_print_object()
 
> I hope I could help you with this trace

Yes, it should give the ieee80211 developer enough information to fix it.

Thanks,
       tglx
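
For readers following along, the idea behind a check such as "init active object" can be sketched in userspace as below. This is only an illustration and not the code from the attached patch: the table layout and the names track_init(), track_activate() and track_deactivate() are hypothetical, and the sketch assumes the tracker simply remembers a state per object address and refuses to initialise an object that is still active.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

enum obj_state { OBJ_NONE, OBJ_INIT, OBJ_ACTIVE };

/* One tracked object: its address and the last state we saw for it. */
struct tracked_obj {
	const void *addr;
	enum obj_state state;
};

#define MAX_TRACKED 16
static struct tracked_obj tracked[MAX_TRACKED];
static int warning_count;

/* Finds the entry for addr, or claims a free slot; NULL if the table is full. */
static struct tracked_obj *lookup(const void *addr)
{
	struct tracked_obj *free_slot = NULL;
	size_t i;

	assert(addr != NULL);	/* NULL marks an unused slot */

	for (i = 0; i < MAX_TRACKED; i++) {
		if (tracked[i].addr == addr)
			return &tracked[i];
		if (tracked[i].addr == NULL && free_slot == NULL)
			free_slot = &tracked[i];
	}
	if (free_slot != NULL) {
		free_slot->addr = addr;
		free_slot->state = OBJ_NONE;
	}
	return free_slot;
}

/* Returns 0 if initialising is fine, -1 if the object is still active. */
static int track_init(const void *addr)
{
	struct tracked_obj *obj = lookup(addr);

	if (obj == NULL)
		return 0;	/* table full: object goes untracked */
	if (obj->state == OBJ_ACTIVE) {
		warning_count++;
		fprintf(stderr, "init active object: %p\n", addr);
		return -1;	/* caller keeps running; state is untouched */
	}
	obj->state = OBJ_INIT;
	return 0;
}

/* Records that the object has been queued (for a timer: enqueued). */
static void track_activate(const void *addr)
{
	struct tracked_obj *obj = lookup(addr);

	if (obj != NULL)
		obj->state = OBJ_ACTIVE;
}

/* Records that the object has been dequeued again. */
static void track_deactivate(const void *addr)
{
	struct tracked_obj *obj = lookup(addr);

	if (obj != NULL)
		obj->state = OBJ_INIT;
}
```

The point of the design is that the offending call is reported where it happens, with the caller's stack still available, and the bad initialisation can be refused so that the system keeps running instead of crashing later in the timer softirq.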
Comment 24 Marco Zaccheria 2008-03-31 10:12:17 UTC
Hi all. I'd like to know if you have any news on this bug. Since it seems related to the ieee80211 driver, is there a related bug in that area?

Thanks a lot
Comment 25 Johannes Berg 2008-04-01 05:13:34 UTC
I'm sorry, but personally I'm being drowned in work (and in kernel stuff, mac80211 is really keeping me busy enough) and don't have time to fix bugs in ieee80211 right now, especially considering that ieee80211 has been removed in 2.6.25. I apologise.
Comment 26 Adrian Bunk 2008-04-01 05:24:25 UTC
Marco, does 2.6.25-rc7 work for you?
Comment 27 Marco Zaccheria 2008-04-10 10:40:26 UTC
I've tried the 2.6.25-rc8 kernel.
In .config, I enabled MAC80211 and disabled softmac.
Judging by the kernel log, it seems OK:

...
usb 4-5: new high speed USB device using ehci_hcd and address 3
usb 4-5: configuration #1 chosen from 1 choice
usb 4-5: reset high speed USB device using ehci_hcd and address 3
zd1211rw 4-5:1.0: phy1
Apr 10 19:23:22 ZacMobile kernel: usb 4-5: new high speed USB device using ehci_hcd and address 3
Apr 10 19:23:22 ZacMobile kernel: usb 4-5: configuration #1 chosen from 1 choice
Apr 10 19:23:22 ZacMobile kernel: usb 4-5: reset high speed USB device using ehci_hcd and address 3
Apr 10 19:23:22 ZacMobile kernel: zd1211rw 4-5:1.0: phy1
udev: renamed network interface wmaster0 to eth2
Apr 10 19:23:22 ZacMobile kernel: udev: renamed network interface wmaster0 to eth2

...

but if I run iwconfig, it reports that eth2 has no wireless extensions, while there is another interface, named wmaster0_renamed.

I can't bring up either of the two interfaces.
Comment 28 John W. Linville 2008-04-10 13:19:11 UTC
Udev is misnaming your interfaces.  Often this can be fixed by simply deleting /etc/udev/rules.d/70-persistent-net.rules and letting udev recreate it after a reboot.
Comment 29 Marco Zaccheria 2008-04-10 15:39:00 UTC
OK, I've done it and everything seems to work (now I have to verify that it will not crash!).
One question: is there a way to rename the interface? It doesn't seem to work after a rename (through udev).

Thanks a lot!
Comment 30 Marco Zaccheria 2008-04-11 13:07:03 UTC
System stays up! EUREKA!

:)

Thank you so much!!!
Comment 31 Marco Zaccheria 2008-04-14 01:39:53 UTC
So... what about the status of this BUG?
Comment 32 John W. Linville 2008-04-16 13:46:04 UTC
This is my interpretation of the status...correct me if I'm wrong... :-)
