Bug 10922

Summary: firewire: kernel locks up after bus generation roll-over
Product: Drivers Reporter: Stefan Richter (stefanr)
Component: IEEE1394Assignee: drivers_ieee1394
Status: CLOSED CODE_FIX    
Severity: normal CC: jarod, kernel
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: all Subsystem:
Regression: --- Bisected commit-id:
Bug Depends on:    
Bug Blocks: 8828    
Attachments: screenshot of last kernel messages
screenshot of "page allocation failure"
messages captured with netconsole

Description Stefan Richter 2008-06-15 13:50:11 UTC
Steps to reproduce:
Trigger bus resets with firecontrol or any other way.
After the bus generation went from 255 to 0 or from 0 to 1, the machine freezes.
This has been observed with different controllers.
Comment 1 Stefan Richter 2008-06-15 13:51:28 UTC
Created attachment 16494 [details]
screenshot of last kernel messages
Comment 2 Stefan Richter 2008-06-15 14:05:31 UTC
As seen in the screenshot, the bug does not neccesarily happen directly at generation 0.  Here it happened at generation 1.

In generation 255:
    1394commander forces a bus reset.
Generation counter increments from 255 to 0.
Synthesized PHY packet is recorded, selfIDs are logged.
In generation 0:
    fw-core sends a phy config packet to get the gap count down.
    fw-core forces a bus reset.
Generation counter increments from 0 to 1.
Synthesized PHY packet is recorded, selfIDs are logged.
Kernel locks up.
Comment 3 Stefan Richter 2008-06-16 09:30:40 UTC
Created attachment 16507 [details]
screenshot of "page allocation failure"

Yesterday's screenshot was with Mac mini + 6-port hub.

Today I tested first the mini alone.  It survived two generation roll-overs.  When I then plugged in a FireWire disk at generation 2 or 5 or so, several trace dumps quickly scrolled by until it all froze.  Attached is the last visible screen.

swapper: page allocation failure. order:0, mode:0x20
...Call trace:
__alloc_pages_internal
cache_alloc_refill
lock_timer_base
_spin_lock_irqrestore
:firewire_core:fw_node_event
:firewire_core:report_found_node
kmem_cache_alloc
:firewire_core:fw_node_event
:firewire_core:for_each_fw_node
:firewire_core:fw_core_handle_bus_reset
_spin_unlock_irqrestore
tasklet_action

It would be good to know the first one of all the trace dumps.  Hopefully a netconsole will gather something.
Comment 4 Stefan Richter 2008-06-16 09:55:13 UTC
Created attachment 16508 [details]
messages captured with netconsole

Same scenario as in the previous comment, now all kernel messages captured via netconsole.  All the dumps are of type "swapper: page allocation failure. order:0, mode:0x20".
Comment 5 Stefan Richter 2008-06-16 11:01:38 UTC
The bug evidently happens if
  - the bus reset tasklet was executed at least 255 times or so, /and/
  - the topology code is actually doing something, e.g. scheduling "found node" events or scheduling of bus manager work.

If only the local node is present, the bus reset tasklet can be executed much more often without apparent problem.  This includes the rather light-weight "node updated" events for the local mode.

In the scenario as per comment 1 and 2, i.e. PC plus a hub which has selfID.link_active cleared, only "node updated" events for the local node and the scheduling of bus manager work happen.

Whether the hang occurs inside or outside of the topology code is not yet clear to me.  It does not happen in the bus manager work, because if I shorten fw_card_bm_work() to a mere return;, the kernel still hangs after more than circa 255 bus resets with the hub connected.
Comment 6 Stefan Richter 2008-06-16 11:14:31 UTC
A probably less interesting footnote:  Now I got this last sigh a while after the last selfID logging (generation 2 after roll-over):

BUG: spinlock lockup on CPU#1, 1394commander/6551, ffff8100680c04d0
Pid: 6551, comm: 1394commander Not tainted 2.6.26-rc6 #4

Call Trace:
 [<ffffffff802f552c>] _raw_spin_lock+0xf9/0x120
 [<...>] spin_lock_irqsave+0x3a/0x42
 [<...>] :firewire_core:fw_device_op_release+0x9c/0xdd
 [<...>] __fput+0xbd/0x182
 [<...>] filp_close+0x5d/0x65
 [<...>] do_exit+0x234/0x60e

This is obviously because 1394commander spins on card->lock while something else hangs inside a card->lock protected region.
Comment 7 Stefan Richter 2008-07-05 01:12:27 UTC
PS to comment 5:
> The bug evidently happens if
>   - the bus reset tasklet was executed at least 255 times or so, /and/
>   - the topology code is actually doing something, e.g. scheduling
>     "found node" events or scheduling of bus manager work.
> 
> If only the local node is present, the bus reset tasklet can be executed
> much more often without apparent problem.  This includes the rather
> light-weight "node updated" events for the local mode.

If there were more than about 255 resets happening without any other node connected, and then another node is plugged in, the lock-up happens immediately.

It does not matter whether one external node is present or several external nodes are there, it always takes >255 (>256?) bus resets.
Comment 8 Stefan Richter 2008-07-05 01:34:14 UTC
The crash can be prevented by
  - rmmod'ing and insmod'ing firewire-ohci
or
  - unbinding and rebinding the controller's PCI device from/to firewire-ohci
    through the modules 'unbind' and 'bind' sysfs attributes
before ~250 bus resets happened.

This means that either firewire-core's or firewire-ohci's or both drivers' shutdown routines or initialization routines clean up sufficiently to keep clear of the bug for another ~250 bus resets.
Comment 9 Stefan Richter 2008-09-07 00:22:54 UTC
Downstream bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=244576
  - nForce2 controller goes into bus reset loop; loop causes hang
  - Sony DCR-TRV27 miniDV camcorder causes bus reset loop; loop causes hang

https://bugs.gentoo.org/show_bug.cgi?id=235817
  - Sony DCR-HC37 miniDV camcorder causes bus reset loop; loop causes hang
Comment 10 Stefan Richter 2008-10-16 13:04:43 UTC
Candidate fix:  http://marc.info/?l=linux1394-devel&m=122418676518400
Comment 11 Stefan Richter 2008-10-16 13:49:15 UTC
Fix verified and committed to linux1394-2.6.git.
Will be offered to mainline in a few days or so, and after that to -stable.
Comment 12 Stefan Richter 2008-10-27 15:18:35 UTC
Fix merged in 2.6.28-rc2-git1.
Comment 13 Stefan Richter 2008-10-27 15:23:10 UTC
Hmm, seems more like 2.6.28-rc2-git2.
Will post the fix to -stable too within the hour.