Bug 10473

Summary: Infinite loop "b44: eth0: powering down PHY"
Product: Drivers
Component: Network
Status: CLOSED CODE_FIX
Severity: high
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 2.6.25
Subsystem:
Regression: Yes
Bisected commit-id:
Reporter: Nelson A. de Oliveira (naoliv)
Assignee: Jeff Garzik (jgarzik)
CC: admin, albcamus, androsyn, gcosta, kirill, mingo, tglx
Bug Depends on:
Bug Blocks: 9832
Attachments: Current .config file
             lspci output
             config from 2.6.26-rc4-git3

Description Nelson A. de Oliveira 2008-04-17 17:08:26 UTC
Latest working kernel version: 2.6.24.4 (from Debian)
Earliest failing kernel version: 2.6.25 (vanilla)
Distribution: Debian
Problem Description:

While booting the new 2.6.25 kernel, it enters an infinite loop displaying "b44: eth0: powering down PHY".
The system isn't frozen, as the magic SysRq keys work, but it just keeps displaying those messages. However, I am unable to dump any information using SysRq, as the b44(...) messages scroll by too fast.

I will attach lspci output and my .config
Comment 1 Nelson A. de Oliveira 2008-04-17 17:09:11 UTC
Created attachment 15794
Current .config file
Comment 2 Nelson A. de Oliveira 2008-04-17 17:11:35 UTC
Created attachment 15795
lspci output
Comment 3 Anonymous Emailer 2008-04-17 17:12:58 UTC
Reply-To: akpm@linux-foundation.org


(switched to email.  Please respond via emailed reply-to-all, not via the
bugzilla web interface).

Apparently a regression.
Comment 4 Michael Buesch 2008-04-18 07:07:36 UTC
CCed Gary (the b44 maintainer).
Not sure why I am actually CCed :)


Can you add a dump_stack() call to the b44_halt() function and post the resulting logs?
Comment 5 Nelson A. de Oliveira 2008-04-18 08:23:43 UTC
Hi!

On Fri, Apr 18, 2008 at 11:06 AM, Michael Buesch <mb@bu3sch.de> wrote:
>  Can you add a dump_stack() call to the b44_halt() function and post the
>  resulting logs?

What I get is:

Pid: 4, comm: ksoftirqd/0 Tainted: GF 2.6.25-naoliv1 #2
[<f8992420>] [<b0231ffd>] [<b02380b9>] [<b011ed60>] [<b01060f2>]
[<b011f059>] [<b011efe1>] [<b01297c8>] [<b0129790>] [<b010473f>]

Something tells me that this won't help much and that I probably need
to enable something debug-related (what do I need to enable, please?)

BTW, is there a way to "pause" the messages after dump_stack()?

Thank you!

Best regards,
Nelson
Comment 6 Michael Buesch 2008-04-18 08:33:20 UTC
On Friday 18 April 2008 17:23:06 Nelson A. de Oliveira wrote:
> What I get is:
> 
> Pid: 4, comm: ksoftirqd/0 Tainted: GF 2.6.25-naoliv1 #2
> [<f8992420>] [<b0231ffd>] [<b02380b9>] [<b011ed60>] [<b01060f2>]
> [<b011f059>] [<b011efe1>] [<b01297c8>] [<b0129790>] [<b010473f>]

Ehm, please enable CONFIG_KALLSYMS.

> BTW, is there a way to "pause" the messages after dump_stack()?

mdelay(1000) will delay one second. But it will kill the system, basically.
Comment 7 Nelson A. de Oliveira 2008-04-18 10:12:39 UTC
Hi!

On Fri, Apr 18, 2008 at 12:32 PM, Michael Buesch <mb@bu3sch.de> wrote:
>  Ehm, please enable CONFIG_KALLSYMS.

Right. Sorry.

Here it is:

Pid: 4, comm: ksoftirqd/0 Tainted: GF 2.6.25-naoliv #4

[<f899df84>] b44_halt+0x68/0x7f [b44]

[<f899f432>] b44_poll+0x36a/0x405 [b44]

[<b02393ad>] net_rx_action+0x63/0x131

[<b011ee60>] __do_softirq+0x5a/0xa5

[<b01061e2>] do_softirq+0x52/0x84

[<b011f159>] ksoftirqd+0x78/0x110

[<b011f0e1>] ksoftirqd+0x0/0x110

[<b01298d4>] kthread+0x38/0x60

[<b012989c>] kthread+0x0/0x60

[<b010474b>] kernel_thread_helper+0x7/0x10

Anything else that I can do to help, please?

Thank you!

Best regards,
Nelson
Comment 8 Michael Buesch 2008-04-18 10:20:28 UTC
On Friday 18 April 2008 19:12:34 Nelson A. de Oliveira wrote:
> Anything else that I can do to help, please?

Please apply this patch and send me the messages.

Index: wireless-testing/drivers/net/b44.c
===================================================================
--- wireless-testing.orig/drivers/net/b44.c	2008-04-15 12:40:17.000000000 +0200
+++ wireless-testing/drivers/net/b44.c	2008-04-18 19:18:02.000000000 +0200
@@ -866,6 +866,7 @@ static int b44_poll(struct napi_struct *
 	if (bp->istat & ISTAT_ERRORS) {
 		unsigned long flags;
 
+printk(KERN_ERR "b44_poll: istat = 0x%08X\n", bp->istat);
 		spin_lock_irqsave(&bp->lock, flags);
 		b44_halt(bp);
 		b44_init_rings(bp);
Comment 9 Nelson A. de Oliveira 2008-04-18 10:44:21 UTC
Hi!

On Fri, Apr 18, 2008 at 2:19 PM, Michael Buesch <mb@bu3sch.de> wrote:
> On Friday 18 April 2008 19:12:34 Nelson A. de Oliveira wrote:
>  > Anything else that I can do to help, please?
>
>  Please apply this patch and send me the messages.

b44_poll: istat = 0x00000400

b44: eth0: powering down PHY


Pid: 0, comm: swapper Not tainted 2.6.25-naoliv1 #4

[<f899df84>] b44_halt+0x68/0x7f [b44]

[<f88f4440>] b44_poll+0x378/0x415 [b44]

[<b010453b>] common_interrupt+0x23/0x28

[<b02393ad>] net_rx_action+0x63/0x131

[<b011ee60>] __do_softirq+0x5a/0xa5

[<b01061e2>] do_softirq+0x52/0x84

[<b013d666>] handle_fasteoi_irq+0x0/0xad

[<b011ed3e>] irq_exit+0x35/0x76

[<b01062ad>] do_IRQ+0x99/0xb0

[<b010453b>] common_interrupt+0x23/0x28

[<f886700b>] acpi_idle_enter_bm+0x28c/0x2fd6 [processor]

[<b022602b>] cpuidle_idle_call+0x55/0x86

[<b0225fd6>] cpuidle_idle_call+0x0/0x86

[<b01028d3>] cpu_idle+0x8c/0xbc

I have increased the delay now.  This is the first message that
appears. It seems that after some time it starts to display the other
lines from my last email (Pid 4, comm: ksoftirqd/0 ...).

Best regards,
Nelson
Comment 10 Michael Buesch 2008-04-18 11:00:30 UTC
On Friday 18 April 2008 19:43:57 Nelson A. de Oliveira wrote:
> b44_poll: istat = 0x00000400

Hm, a descriptor error. Smells like my DMA fix actually broke this, dammit.
On which architecture are you running?

> I have increased the delay now.  This is the first message that
> appears. It seems that after some time it starts to display the other
> lines from my last email (Pid 4, comm: ksoftirqd/0 ...).

I'm always only interested in the first message of one type :)
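
For reference, the istat value reported above can be decoded bit by bit. The following is a minimal userspace sketch, not driver code. The ISTAT_* values are quoted from memory of drivers/net/b44.h and should be checked against your tree, and istat_describe_errors() is a hypothetical helper introduced only for illustration. The only reading confirmed in this thread is that 0x00000400 means a descriptor error.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Error bits of the b44 interrupt status register.
 * ASSUMPTION: values recalled from drivers/net/b44.h, verify before use. */
#define ISTAT_DSCE  0x00000400u /* descriptor error */
#define ISTAT_DATAE 0x00000800u /* data error */
#define ISTAT_DPE   0x00001000u /* descriptor protocol error */
#define ISTAT_RDU   0x00002000u /* receive descriptor underflow */
#define ISTAT_RFO   0x00004000u /* receive FIFO overflow */
#define ISTAT_TFU   0x00008000u /* transmit FIFO underflow */

struct istat_bit {
	uint32_t mask;
	const char *name;
};

static const struct istat_bit istat_error_bits[] = {
	{ ISTAT_DSCE,  "DSCE"  },
	{ ISTAT_DATAE, "DATAE" },
	{ ISTAT_DPE,   "DPE"   },
	{ ISTAT_RDU,   "RDU"   },
	{ ISTAT_RFO,   "RFO"   },
	{ ISTAT_TFU,   "TFU"   },
};

/* Hypothetical helper: write a space-separated list of the error bits set
 * in istat into buf and return how many error bits were found. */
static int istat_describe_errors(uint32_t istat, char *buf, size_t len)
{
	size_t used = 0;
	int found = 0;

	if (len == 0)
		return 0;
	buf[0] = '\0';

	for (size_t i = 0; i < sizeof(istat_error_bits) / sizeof(istat_error_bits[0]); i++) {
		int n;

		if (!(istat & istat_error_bits[i].mask))
			continue;
		n = snprintf(buf + used, len - used, "%s%s",
			     found ? " " : "", istat_error_bits[i].name);
		if (n < 0 || (size_t)n >= len - used)
			break; /* buffer too small, stop here */
		used += (size_t)n;
		found++;
	}
	return found;
}
```

Passing 0x00000400 to istat_describe_errors() reports only the DSCE bit, which matches the descriptor error diagnosis above.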
Comment 11 Nelson A. de Oliveira 2008-04-18 11:10:09 UTC
On Fri, Apr 18, 2008 at 2:59 PM, Michael Buesch <mb@bu3sch.de> wrote:
>  > b44_poll: istat = 0x00000400
>
>  Hm, a descriptor error. Smells like my DMA fix actually broke this, dammit.
>  On which architecture are you running?

i386 here.

>  > I have increased the delay now.  This is the first message that
>  > appears. It seems that after some time it starts to display the other
>  > lines from my last email (Pid 4, comm: ksoftirqd/0 ...).
>
>  I'm always only interested in the first message of one type :)

Right :-)

Best regards,
Nelson
Comment 12 Michael Buesch 2008-04-18 11:19:25 UTC
On Friday 18 April 2008 20:09:36 Nelson A. de Oliveira wrote:
> On Fri, Apr 18, 2008 at 2:59 PM, Michael Buesch <mb@bu3sch.de> wrote:
> >  > b44_poll: istat = 0x00000400
> >
> >  Hm, a descriptor error. Smells like my DMA fix actually broke this, dammit.
> >  On which architecture are you running?
> 
> i386 here.

Hm, I tested my patch on i386.
So I'm not sure what's going on, actually. And the patch was pretty
trivial and I really can't find a bug in it.
So you say 2.6.24 was still working?
Comment 13 Nelson A. de Oliveira 2008-04-18 12:03:16 UTC
Hi!

On Fri, Apr 18, 2008 at 3:18 PM, Michael Buesch <mb@bu3sch.de> wrote:
>  Hm, I tested my patch on i386.
>  So I'm not sure what's going on, actually. And the patch was pretty
>  trivial and I really can't find a bug in it.
>  So you say 2.6.24 was still working?

Strange... compiled 2.6.24.4, 2.6.24 and 2.6.23 here and they are all
stopping with this:

b44: eth0: Link is up at 100 Mbps, full duplex.
b44: eth0: Flow control is off for TX and off for RX.

And it seems to keep waiting for something. The system isn't frozen
(as CTRL+ALT+DEL kills the running processes and correctly reboots the
machine).

With Debian's 2.6.24.4 it is working.
With vanilla 2.6.25 and my config it just enters an infinite loop of
"b44: eth0: powering down PHY".
Can different GCC versions cause this? Can a bad .config file cause
things like that? (I have been using this .config for a long time and it
has always worked correctly, at least until now)

Thank you!

Best regards,
Nelson
Comment 14 Michael Buesch 2008-04-18 12:20:18 UTC
On Friday 18 April 2008 21:02:37 Nelson A. de Oliveira wrote:
> Strange... compiled 2.6.24.4, 2.6.24 and 2.6.23 here and they are all
> stopping with this:
> 
> b44: eth0: Link is up at 100 Mbps, full duplex.
> b44: eth0: Flow control is off for TX and off for RX.
>
> And it seems to keep waiting for something. The system isn't frozen
> (as CTRL+ALT+DEL kills the running processes and correctly reboots the
> machine).

Well. 2.6.24 didn't have this message. But it could still have the actual
bug, of course. So can you try applying my printk patch to a broken 2.6.24
kernel and see whether it triggers the message or not? Under normal
circumstances this codepath should never trigger.

> With Debian's 2.6.24.4 it is working.
> With vanilla 2.6.25 and my config it just enters an infinite loop of
> "b44: eth0: powering down PHY".

This message was added in 2.6.25. That doesn't mean the
bug was also added in 2.6.25, of course.

> Can different GCC versions cause this? Can a bad .config file cause
> things like that? (I have been using this .config for a long time and it
> has always worked correctly, at least until now)

Well, possible, although unlikely.

Can you try bisecting the bug? Yeah, I know about the lwn article [1] that
says bisecting is baaaaaad (tm), but my opinion is different. :)
It's an excellent tool for efficiently finding patches that caused bugs.
But take care to really check whether the device _works_ or not. Just looking
at the actual "powering down PHY" will _not_ be enough, as that was only
recently added, as I said.

[1] http://lwn.net/Articles/278137/
Comment 15 Nelson A. de Oliveira 2008-04-18 13:38:42 UTC
Hi!

On Fri, Apr 18, 2008 at 4:19 PM, Michael Buesch <mb@bu3sch.de> wrote:
>  Well. 2.6.24 didn't have this message. But it could still have the actual
>  bug, of course. So can you try applying my printk patch to a broken 2.6.24
>  kernel and see whether it triggers the message or not? Under normal
>  circumstances this codepath should never trigger.

No b44_poll message is printed when using your patch on 2.6.24, 2.6.23 and 2.6.21.

>  Can you try bisecting the bug? Yeah, I know about the lwn article [1] that
>  says bisecting is baaaaaad (tm), but my opinion is different. :)
>  It's an excellent tool for efficiently finding patches that caused bugs.
>  But take care to really check whether device _works_ or not. Just looking
>  at the actual "powering down PHY" will _not_ be enough, as that was only
>  recently added, as I said.

Sure. I will do this when I arrive at home (Can you point me to some
URL to read and do the bisections, please?).
What I saw with 2.6.24, 2.6.23 and 2.6.21 is that the interface seems
to be up, getting an IP via DHCP (I can ping from another machine),
but it stays waiting for something after printing

b44: eth0: Link is up at 100 Mbps, full duplex.
b44: eth0: Flow control is off for TX and off for RX.

Thank you!

Best regards,
Nelson
Comment 16 Jike Song 2008-04-21 01:30:27 UTC
(In reply to comment #15)
Hi Nelson, here is a git bisect guide from kernel.org:

http://www.kernel.org/doc/local/git-quick.html#bisect

Thanks,


Comment 17 Nelson A. de Oliveira 2008-04-21 11:14:51 UTC
Hi!

I have tried to do a bisect here (thank you Jike Song for the link).
Marked 2.6.20 as good and master as bad. On the first test, I've got this:

(...)
BUG: unable to handle kernel NULL pointer dereference at virtual address 00000000
 printing eip:
b01b6265
*pde = 00000000
Oops: 0000 [#1]
PREEMPT SMP
Modules linked in: b44(F) mousedev(F) iwl3945(F) ehci_hcd(F)
mac80211(F) snd_hda_intel(F) thermal(F) i2c_i801(F) ac(F) ssb(F)
snd_pcm(F) snd_timer(F) uhci_hcd(F) psmouse(F) evdev(F) battery(F)
button(F) processor(F) mii(F) usbcore(F) snd(F) snd_page_alloc(F)
sg(F) sr_mod(F) cdrom(F)
CPU:    0
EIP:    0060:[<b01b6265>]    Tainted: GF       VLI
EFLAGS: 00010246   (2.6.23-naoliv1 #1)
EIP is at strlen+0x8/0x11
eax: 00000000   ebx: f7429000   ecx: ffffffff   edx: f76b6cb0
esi: 00000000   edi: 00000000   ebp: 00000000   esp: f76b6ca0
ds: 007b   es: 007b   fs: 00d8  gs: 0033  ss: 0068
Process modprobe (pid: 692, ti=f76b6000 task=f76c6000 task.ti=f76b6000)
Stack: f75a2000 b01b3254 f785f200 b02d3e5b b02cb0da f7856200 b01b324a f7426688
       b02cb0da f7856200 f88f61f0 b0310c8c f785f200 f88f8e28 b0207d4e f74266a8
       f88f8d9c f7426600 f7426600 00000000 f7453400 b0206d8b f7426688 b02c4e14
Call Trace:
 [<b01b3254>] kobject_uevent_env+0x276/0x383
 [<b01b324a>] kobject_uevent_env+0x26c/0x383
 [<b0207d4e>] bus_add_device+0xad/0xdc
 [<b0206d8b>] device_add+0x2a0/0x45e
 [<f88f36f1>] ssb_attach_queued_buses+0x1a2/0x297 [ssb]
 [<f88f3b2f>] ssb_bus_register+0x120/0x185 [ssb]
 [<f88f4ac2>] ssb_pci_get_invariants+0x0/0x281 [ssb]
 [<f88f3bf3>] ssb_bus_pcibus_register+0x24/0x47 [ssb]
 [<b01bb856>] pci_set_master+0x54/0x58
 [<f88f52b1>] ssb_pcihost_probe+0x5e/0x89 [ssb]
 [<b01bd0ff>] pci_device_probe+0x36/0x55
 [<b020857e>] driver_probe_device+0xc5/0x148
 [<b02890a5>] klist_next+0x58/0x6d
 [<b02086dc>] __driver_attach+0x49/0x7f
 [<b0207ba8>] bus_for_each_dev+0x35/0x57
 [<b02083f2>] driver_attach+0x16/0x18
 [<b0208693>] __driver_attach+0x0/0x7f
 [<b0207e56>] bus_add_driver+0x6d/0x17d
 [<b01bd249>] __pci_register_driver+0x55/0x81
 [<f881d01f>] b44_init+0x1f/0x48 [b44]
 [<b013cdcc>] sys_init_module+0x1545/0x1619
 [<b0103e9a>] sysenter_past_esp+0x5f/0x85
 =======================
Code: f0 48 5e c3 56 89 d1 89 c6 83 ec 04 31 d2 89 c8 88 c4 ac 38 e0
75 03 8d 56 ff 84 c0 75 f4 5e 89 d0 5e c3 57 83 c9 ff 89 c7 31 c0 <f2>
ae f7 d1 49 5f 89 c8 c3 57 89 c7 89 d0 31 d2 85 c9 74 0c f2
EIP: [<b01b6265>] strlen+0x8/0x11 SS:ESP 0068:f76b6ca0
hub 1-2:1.0: hub_port_status failed (err = -71)
hub 1-2:1.0: hub_port_status failed (err = -71)
hub 1-2:1.0: hub_port_status failed (err = -71)
hub 1-2:1.0: hub_port_status failed (err = -71)
Clocksource tsc unstable (delta = -162081422 ns)
usb 5-2: new high speed USB device using ehci_hcd and address 2
usb 5-2: configuration #1 chosen from 1 choice
hub 5-2:1.0: USB hub found
hub 5-2:1.0: 4 ports detected
sysfs: duplicate filename 'bInterfaceNumber' can not be created
WARNING: at fs/sysfs/dir.c:425 sysfs_add_one()
 [<b018bebc>] sysfs_add_one+0x54/0xb8
 [<b018ba00>] sysfs_add_file+0x42/0x6a
 [<b018d115>] sysfs_create_group+0x84/0xe7
 [<b0206f3f>] device_add+0x454/0x45e
 [<f88ca72a>] usb_create_sysfs_intf_files+0x24/0x98 [usbcore]
 [<f88c7295>] usb_set_configuration+0x48f/0x4a9 [usbcore]
 [<f88cdcdb>] generic_probe+0x50/0x91 [usbcore]
 [<f88c8784>] usb_probe_device+0x32/0x37 [usbcore]
 [<b020857e>] driver_probe_device+0xc5/0x148
 [<b02890a5>] klist_next+0x58/0x6d
 [<b0207aa8>] bus_for_each_drv+0x35/0x5c
 [<b020867f>] device_attach+0x5e/0x72
 [<b0208601>] __device_attach+0x0/0x5
 [<b0207a24>] bus_attach_device+0x26/0x75
 [<b0206d92>] device_add+0x2a7/0x45e
 [<f88c2c1a>] usb_new_device+0x4d/0x8a [usbcore]
 [<f88c3746>] hub_thread+0x702/0xa8f [usbcore]
 [<b012fd84>] autoremove_wake_function+0x0/0x33
 [<f88c3044>] hub_thread+0x0/0xa8f [usbcore]
 [<b012fcb7>] kthread+0x38/0x5d
 [<b012fc7f>] kthread+0x0/0x5d
 [<b0104abb>] kernel_thread_helper+0x7/0x10
 =======================
(...)

This one is probably 2.6.23.
After some time the system continued to boot, but without a network interface.
So I marked it as bad.
The next bisect step failed to compile, so I marked it bad as well. I bisected
again and it failed, and then again it failed :-(
My git-bisect log is:

git-bisect start
# good: [62d0cfcb27cf755cebdc93ca95dabc83608007cd] Linux 2.6.20
git-bisect good 62d0cfcb27cf755cebdc93ca95dabc83608007cd
# bad: [3925e6fc1f774048404fdd910b0345b06c699eb4] Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6
git-bisect bad 3925e6fc1f774048404fdd910b0345b06c699eb4
# bad: [3749c66c67fb5c257771815c186bc32290cacf44] Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm
git-bisect bad 3749c66c67fb5c257771815c186bc32290cacf44
# bad: [b11115c15351faba978ce1b9e75068e77f6ef48d] serial_core.h: include <linux/sysrq.h>
git-bisect bad b11115c15351faba978ce1b9e75068e77f6ef48d
# bad: [1936502d00ae6c2aa3931c42f6cf54afaba094f2] [NET_SCHED] qdisc: avoid transmit softirq on watchdog wakeup
git-bisect bad 1936502d00ae6c2aa3931c42f6cf54afaba094f2

What else can I do, please?

Thank you very much!

Best regards,
Nelson
Comment 18 Michael Buesch 2008-04-21 11:21:55 UTC
On Monday 21 April 2008 20:14:44 Nelson A. de Oliveira wrote:
> This one is probably 2.6.23.
> After some time the system continued to boot, but without network interface.
> So marked it as bad.

That probably was a mistake

> What else can I do, please?

You can try latest git. I was told it has a feature
to tell bisect "I don't know" instead of "good" or "bad".
This can be used if a test kernel doesn't compile, or does
fail because of some other bug.

You can also manually bisect the stuff between your known-good
version of b44 and the bad one. There were only a couple of patches.
You can extract them with git and revert them one by one and see
when it does start working again.
I think it was something like 5 patches or so. Nothing too time consuming.
Comment 19 Nelson A. de Oliveira 2008-04-21 20:02:22 UTC
Hi!

Maybe this can help:
Using a new .config, I started to enable/disable options and test.
What I found is that if I enable "3G/1G user/kernel split", the kernel
works (it boots normally, the network interface works, etc). If I
select "3G/1G user/kernel split (for full 1G low memory)" I get the
infinite loop of "b44: eth0: powering down PHY".

Working config file (on 2.6.25) is attached.
Diff to the non-working is below:

--- working_config	2008-04-21 23:42:40.000000000 -0300
+++ not_working_config	2008-04-21 23:55:28.000000000 -0300
@@ -1,7 +1,7 @@
 #
 # Automatically generated make config: don't edit
 # Linux kernel version: 2.6.25
-# Mon Apr 21 23:28:49 2008
+# Mon Apr 21 23:43:04 2008
 #
 # CONFIG_64BIT is not set
 CONFIG_X86_32=y
@@ -228,12 +228,12 @@
 # CONFIG_NOHIGHMEM is not set
 CONFIG_HIGHMEM4G=y
 # CONFIG_HIGHMEM64G is not set
-CONFIG_VMSPLIT_3G=y
-# CONFIG_VMSPLIT_3G_OPT is not set
+# CONFIG_VMSPLIT_3G is not set
+CONFIG_VMSPLIT_3G_OPT=y
 # CONFIG_VMSPLIT_2G is not set
 # CONFIG_VMSPLIT_2G_OPT is not set
 # CONFIG_VMSPLIT_1G is not set
-CONFIG_PAGE_OFFSET=0xC0000000
+CONFIG_PAGE_OFFSET=0xB0000000
 CONFIG_HIGHMEM=y
 CONFIG_ARCH_FLATMEM_ENABLE=y
 CONFIG_ARCH_SPARSEMEM_ENABLE=y

Can this be the cause?

Thank you!

Best regards,
Nelson
Comment 20 Michael Buesch 2008-04-22 06:36:07 UTC
On Tuesday 22 April 2008 05:01:54 Nelson A. de Oliveira wrote:
> Hi!
> 
> Maybe this can help:
> Using a new .config, I started to enable/disable options and test.
> What I found is that if I enable "3G/1G user/kernel split", the kernel
> works (it boots normally, the network interface works, etc). If I
> select "3G/1G user/kernel split (for full 1G low memory)" I get the
> infinite loop of "b44: eth0: powering down PHY".

Ah, so this bug isn't actually caused by a patch but rather by a
different config option.
I think we can't do much about it, currently. The device has strange
memory requirements and changing the split does actually break it.
This cannot be fixed until Andi Kleen's mask-allocator is merged.
This "bug" has always been there.
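
The comment above does not say which memory requirement is violated. One plausible reading, an assumption not confirmed anywhere in this thread, is that the device can only DMA to the low 1 GiB of bus addresses (a 30-bit mask). The sketch below only does the arithmetic for the two PAGE_OFFSET values in the config diff. kernel_window_mib(), dma_range_ok() and ASSUMED_DMA_MASK are hypothetical names used for illustration, not kernel symbols.

```c
#include <stdint.h>

/* Size in MiB of the kernel's part of a 32-bit virtual address space,
 * i.e. everything from PAGE_OFFSET up to 4 GiB. */
static uint32_t kernel_window_mib(uint32_t page_offset)
{
	uint64_t bytes = ((uint64_t)1 << 32) - page_offset;

	return (uint32_t)(bytes >> 20);
}

/* ASSUMPTION (not stated in this thread): the device can only address
 * the low 1 GiB of bus addresses, i.e. it has a 30-bit DMA mask. */
#define ASSUMED_DMA_MASK 0x3fffffffULL

/* Return 1 if the buffer [addr, addr + len) lies entirely below the mask. */
static int dma_range_ok(uint64_t addr, uint64_t len)
{
	return len != 0 && addr + len - 1 <= ASSUMED_DMA_MASK;
}
```

With CONFIG_PAGE_OFFSET=0xC0000000 the kernel window is 1024 MiB, and since part of it is set aside for vmalloc and similar mappings, directly mapped low memory ends below 1 GiB, so under the assumed mask any low-memory buffer passes dma_range_ok(). With 0xB0000000 the window grows to 1280 MiB, so low memory can reach 1 GiB or beyond and an ordinary allocation may land outside what the device can address.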

Comment 21 Aaron Sethman 2008-05-30 09:48:58 UTC
I'm also getting this on 2.6.26-rc4 and -rc4-git3.  I'm running on x86-64. 2.6.25 worked okay for me.  
Comment 22 Aaron Sethman 2008-05-30 09:51:09 UTC
Created attachment 16339
config from 2.6.26-rc4-git3
Comment 23 Kirill A. Shutemov 2008-06-12 00:59:47 UTC
> I'm also getting this on 2.6.26-rc4 and -rc4-git3.  I'm running on x86-64.
> 2.6.25 worked okay for me. 

The same symptoms.

I have tried to bisect the bug:

> git bisect log                                                                
git-bisect start
# good: [4b119e21d0c66c22e8ca03df05d9de623d0eb50f] Linux 2.6.25
git-bisect good 4b119e21d0c66c22e8ca03df05d9de623d0eb50f
# bad: [38e80121bd7d0c493072442ac7eddcba165a07a8] Merge git://git.infradead.org/battery-2.6
git-bisect bad 38e80121bd7d0c493072442ac7eddcba165a07a8
# bad: [7ae44cfa7ab29b277691327e8de790d7b880722f] [ALSA] snd-powermac: style awacs.s and awacs.h
git-bisect bad 7ae44cfa7ab29b277691327e8de790d7b880722f
# good: [7cea51be4e91edad05bd834f3235b45c57783f0d] security: fix up documentation for security_module_enable
git-bisect good 7cea51be4e91edad05bd834f3235b45c57783f0d
# good: [8f19ca1341a6d89bd96e2e69e6e10f46d3258089] x86: unify gfp masks
git-bisect good 8f19ca1341a6d89bd96e2e69e6e10f46d3258089
# bad: [9a64388d83f6ef08dfff405a9d122e3dbcb6bf38] Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc
git-bisect bad 9a64388d83f6ef08dfff405a9d122e3dbcb6bf38
# bad: [85b375a613085b78531ec86369a51c2f3b922f95] Merge branch 'for-linus' of master.kernel.org:/home/rmk/linux-2.6-arm
git-bisect bad 85b375a613085b78531ec86369a51c2f3b922f95
# good: [d1964dab60ce7c104dd21590e987a8787db18051] Merge branches 'arm', 'at91', 'ep93xx', 'iop', 'ks8695', 'misc', 'mxc', 'ns9x', 'orion', 'pxa', 'sa1100', 's3c' and 'sparsemem' into devel
git-bisect good d1964dab60ce7c104dd21590e987a8787db18051
# good: [d1964dab60ce7c104dd21590e987a8787db18051] Merge branches 'arm', 'at91', 'ep93xx', 'iop', 'ks8695', 'misc', 'mxc', 'ns9x', 'orion', 'pxa', 'sa1100', 's3c' and 'sparsemem' into devel
git-bisect good d1964dab60ce7c104dd21590e987a8787db18051
# good: [486fdae21458bd9f4e125099bb3c38a4064e450e] sched: build fix
git-bisect good 486fdae21458bd9f4e125099bb3c38a4064e450e
# good: [fd9be4ce2e1eb407a8152f823698cc0d652bbec8] Merge branch 'ro-bind.b6' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6
git-bisect good fd9be4ce2e1eb407a8152f823698cc0d652bbec8
# good: [32ab2cb9415f341913e3f33ef7566ca6e92ef283] ARM: OMAP2: Move clock.h to clock24xx.h
git-bisect good 32ab2cb9415f341913e3f33ef7566ca6e92ef283
# good: [3760d31f11bfbd0ead9eaeb8573e0602437a9d7c] ARM: OMAP2: New DPLL clock framework
git-bisect good 3760d31f11bfbd0ead9eaeb8573e0602437a9d7c
# good: [3760d31f11bfbd0ead9eaeb8573e0602437a9d7c] ARM: OMAP2: New DPLL clock framework
git-bisect good 3760d31f11bfbd0ead9eaeb8573e0602437a9d7c
# bad: [34d0559178393547505ec9492321255405f4e441] x86: UV startup of slave cpus
git-bisect bad 34d0559178393547505ec9492321255405f4e441
# bad: [da60cab4dd922cd933e82bace490f6155a32a90e] x86: return conditional to mmu
git-bisect bad da60cab4dd922cd933e82bace490f6155a32a90e

It seems related to DMA.

> git bisect visualize --pretty=oneline |cat
da60cab4dd922cd933e82bace490f6155a32a90e x86: return conditional to mmu
aa99b16faadcc9a5b6bd9550fda117a8e9e46d26 x86: remove kludge from x86_64
Comment 24 Glauber Costa 2008-06-12 06:58:51 UTC
We have a bunch of fixes in the x86 tree that do not appear to be in Linus's tree.
Since you are using git anyway, would you mind testing it? If it works, we might want to cherry-pick them for Linus, since they'll be associated with a regression.

thanx
Comment 25 Kirill A. Shutemov 2008-06-12 07:02:57 UTC
Fixed. Patch http://lkml.org/lkml/2008/6/12/227
Comment 26 Ingo Molnar 2008-06-12 07:05:56 UTC
> We have a bunch of fixes in the x86 tree that does not appear to be in 
> linus. Since you are using git anyway, would you mind testing it? If 
> it works, we might want to cherry-pick them for linus since they'll be 
> associated with a regression.

to check that, pick up tip/master, as per:

  http://people.redhat.com/mingo/tip.git/README
Comment 27 Alexander S. Titov 2008-09-13 09:31:48 UTC
Fedora 9, kernel 2.6.27-rc6 x86.
Same problem with Broadcom NIC.