Bug 14691 - Complete NAPI IRQ lockup in b44 driver: code fundamentally incompatible with netconsole requirements
Status: CLOSED CODE_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: Network
Hardware: All Linux
Importance: P1 blocking
Assignee: drivers_network@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2009-11-25 20:15 UTC by Andreas Mohr
Modified: 2012-06-14 15:59 UTC

See Also:
Kernel Version: 2.6.32-rc8
Subsystem:
Regression: No
Bisected commit-id:


Attachments
successfully ratelimited problematic b44 interrupt handler error message (844 bytes, patch)
2009-11-25 21:13 UTC, Andreas Mohr

Description Andreas Mohr 2009-11-25 20:15:32 UTC
Hi,

I chose severity "blocking" (== "Blocks development and/or testing work") because netconsole is my primary window to the world on this router, and the box crashes multiple times when loading various drivers. For me no productive work is possible, and there have been multiple reports of this issue since at least 2004.

The kernel version I'm testing on is 2.6.30.9 (OpenWrt MIPSEL, on ASUS WL-500gP v2), but since the b44 interrupt handler is unchanged, I assume the problem is still unfixed.

Frequently, during moderate activity such as loading modules like ftdi_sio or usb_audio (especially with kobject debugging enabled), the network LED starts blinking like crazy, and when I re-listen to the netconsole output there is simply a flood of
%s: Error, poll already scheduled
and the box is DEAD (again! *curse*).

Productive debugging of multiple USB and ftdi_sio issues is simply impossible with this kind of problem remaining.

See also:
"RE:hostname freeze" (2004!!): http://lists.linuxcoding.com/rhl/2004/msg51048.html

"b44 driver suspend/resume (was Re: [ACPI] Re: Re: various problems with Acer TM654, suspend, ACAD, radeon)": http://osdir.com/ml/network.general/2004-08/msg00208.html

"acpi and b44: irq disabled, b44: Error, poll already scheduled" (2004!!):
http://lkml.indiana.edu/hypermail/linux/kernel/0408.3/0078.html

And a very related fix and long discussion is at:
"[PATCH 2.6.30-rc4] r8169: avoid losing MSI interrupts":
http://kerneltrap.org/mailarchive/linux-netdev/2009/5/23/5791863

I'm very uncertain whether it is a good idea for the interrupt handler to simply scream this error message on every interrupt without taking any actual remedial action.
HAH! Now I've got it: because we're on netconsole, this error message gets sent _immediately_ (out through the netconsole device, i.e. b44 itself!!). That very network transmission triggers the IRQ handler at a time when the previously scheduled NAPI processing has had no chance to run yet, thus napi_schedule_prep() fails, thus the error message is printed again, thus... ad nauseam!
Or, IOW, now I'm _certain_ that it is _NOT_ a good idea to scream this message on every IRQ without doing anything about it.
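
The feedback loop described above can be illustrated with a rough user-space model. This is only a sketch under stated assumptions: the struct and function names (model_nic, model_irq, model_run) are hypothetical and do not exist in the b44 driver, and the model assumes that every logged error message results in exactly one follow-up interrupt before the NAPI poll gets a chance to run.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space model; none of these names exist in b44. */
struct model_nic {
    bool napi_scheduled;   /* stands in for the "poll scheduled" state */
    unsigned long irqs;    /* interrupts handled so far */
    unsigned long errors;  /* "poll already scheduled" messages emitted */
};

/*
 * Model of one interrupt. Returns true if the handler logged an error,
 * which over netconsole means a packet was sent and (by assumption)
 * another interrupt follows before the poll has run.
 */
static bool model_irq(struct model_nic *nic)
{
    nic->irqs++;
    if (!nic->napi_scheduled) {
        /* Corresponds to napi_schedule_prep() succeeding. */
        nic->napi_scheduled = true;
        return false;
    }
    /* Poll is still pending: the handler only complains. */
    nic->errors++;
    return true;
}

/*
 * Deliver one interrupt and follow the chain of follow-up interrupts,
 * giving up once max_irqs interrupts have been handled in total.
 * Returns true if the chain ended by itself, false if it was still
 * going at the cap (the modelled lockup).
 */
static bool model_run(struct model_nic *nic, unsigned long max_irqs)
{
    assert(max_irqs > 0);
    while (nic->irqs < max_irqs) {
        if (!model_irq(nic))
            return true;   /* no message, so no follow-up interrupt */
    }
    return false;
}
```

In this model the first interrupt schedules the poll and the chain stops. Any interrupt that arrives while the poll is still pending never terminates on its own, because nothing in the handler clears the pending state or suppresses the message.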


I want to report this issue now to make sure it can get the attention it deserves, but for now I'll try to come up with a bandaid locally (simply silencing the error message) in order to be able to debug the other driver issues mentioned above.

Thanks!
Comment 1 Andreas Mohr 2009-11-25 20:35:35 UTC
Now testing

                } else {
                        /* netconsole fix!!:
                         * without ratelimiting, this message would:
                         * immediately find its way out to b44 netconsole ->
                         * new IRQ -> re-encounter failed napi_schedule_prep()
                         * -> error message -> ad nauseam -> box LOCKUP.
                         * See thread "r8169: avoid losing MSI interrupts"
                         * for further inspiration. */
                        if (printk_ratelimit())
                                printk(KERN_ERR PFX
                                        "%s: Error, poll already scheduled; "
                                        "istat 0x%x, imask 0x%x\n",
                                                dev->name, istat, imask);
                }

If this works, then I'll send a patch.
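
For illustration, the effect of rate limiting can be sketched in user space as a burst-per-interval limiter. This is a simplified model and not the kernel implementation: model_ratelimit and model_ratelimit_ok are hypothetical names, time is passed in explicitly as a tick count, and the interval and burst values are whatever the caller chooses.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified user-space model of a burst/interval limiter. */
struct model_ratelimit {
    unsigned long interval;  /* window length in ticks */
    unsigned int burst;      /* messages allowed per window */
    unsigned long begin;     /* tick at which the current window began */
    unsigned int printed;    /* messages allowed in this window */
    unsigned int missed;     /* messages suppressed in this window */
    bool started;            /* false until the first call */
};

/* Returns true if a message may be printed at tick "now". */
static bool model_ratelimit_ok(struct model_ratelimit *rl, unsigned long now)
{
    assert(rl->interval > 0);

    /* Open a new window on first use or once the old one has expired. */
    if (!rl->started || now - rl->begin >= rl->interval) {
        rl->started = true;
        rl->begin = now;
        rl->printed = 0;
        rl->missed = 0;
    }

    if (rl->printed < rl->burst) {
        rl->printed++;
        return true;
    }

    /* Over budget: stay silent, so no further packet is generated. */
    rl->missed++;
    return false;
}
```

Once the burst is used up, the limiter returns false and nothing is printed, so no further netconsole packet goes out and the chain of error message, transmission and new interrupt is broken until the next window opens.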
Comment 2 Andreas Mohr 2009-11-25 21:13:22 UTC
Created attachment 23937
successfully ratelimited problematic b44 interrupt handler error message

A new flash image using this patch managed to work properly, i.e. it did spit out mostly normal debug messages with some remaining ratelimited NAPI poll error messages interspersed. However, access to the box was still crippled, most definitely due to the very high rate of ftdi_sio serial byte debugging logs sent over b44 netconsole.
All in all certainly a success.
Will submit patch to LKML now.
Comment 3 Andrew Morton 2009-11-25 21:17:03 UTC
(switched to email.  Please respond via emailed reply-to-all, not via the
bugzilla web interface).

On Wed, 25 Nov 2009 20:15:33 GMT
bugzilla-daemon@bugzilla.kernel.org wrote:

> http://bugzilla.kernel.org/show_bug.cgi?id=14691
> 
>            Summary: Complete NAPI IRQ lockup in b44 driver: code
>                     fundamentally incompatible with netconsole
>                     requirements
>            Product: Drivers
>            Version: 2.5
>     Kernel Version: 2.6.32-rc8
>           Platform: All
>         OS/Version: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: blocking
>           Priority: P1
>          Component: Network
>         AssignedTo: drivers_network@kernel-bugs.osdl.org
>         ReportedBy: andi@lisas.de
>         Regression: No
Comment 4 Andreas Mohr 2009-11-30 19:39:20 UTC
Improved patch created by David Miller (thanks!).
From my POV, this can be considered handled.
http://lkml.org/lkml/2009/11/30/51
http://patchwork.kernel.org/patch/63627/

Andreas Mohr
