Bug 10922
| Summary: | firewire: kernel locks up after bus generation roll-over | | |
|---|---|---|---|
| Product: | Drivers | Reporter: | Stefan Richter (stefanr) |
| Component: | IEEE1394 | Assignee: | drivers_ieee1394 |
| Status: | CLOSED CODE_FIX | | |
| Severity: | normal | CC: | jarod, kernel |
| Priority: | P1 | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Kernel Version: | all | Subsystem: | |
| Regression: | --- | Bisected commit-id: | |
| Bug Depends on: | | | |
| Bug Blocks: | 8828 | | |
| Attachments: | screenshot of last kernel messages; screenshot of "page allocation failure" messages captured with netconsole | | |
Description
Stefan Richter
2008-06-15 13:50:11 UTC
Created attachment 16494 [details]
screenshot of last kernel messages
As seen in the screenshot, the bug does not necessarily happen directly at generation 0. Here it happened at generation 1.

In generation 255:
- 1394commander forces a bus reset.
- Generation counter increments from 255 to 0. Synthesized PHY packet is recorded, selfIDs are logged.

In generation 0:
- fw-core sends a phy config packet to get the gap count down.
- fw-core forces a bus reset.
- Generation counter increments from 0 to 1. Synthesized PHY packet is recorded, selfIDs are logged.

Kernel locks up.
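The "phy config packet to get the gap count down" mentioned above is a single-quadlet PHY packet (transmitted together with its bitwise inverse) whose field layout is defined by IEEE 1394/1394a. Purely as an illustration of that layout, and not as firewire-core's actual code, a minimal sketch with a hypothetical helper name:

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Illustrative only: first quadlet of an IEEE 1394 PHY configuration
 * packet.  Bits 31-30 = 00 (packet type), bits 29-24 = root phy_ID,
 * bit 23 = R (force root), bit 22 = T (gap_count field is valid),
 * bits 21-16 = gap_count, remaining bits zero.  On the wire the quadlet
 * is followed by its bitwise inverse.  The function name is hypothetical,
 * not the firewire-core API.
 */
static uint32_t phy_config_quadlet(unsigned root_id, int force_root,
                                   unsigned gap_count)
{
	uint32_t q = 0;

	q |= (root_id & 0x3fu) << 24;
	if (force_root)
		q |= 1u << 23;             /* R bit */
	q |= 1u << 22;                     /* T bit */
	q |= (gap_count & 0x3fu) << 16;

	return q;                          /* second quadlet would be ~q */
}

int main(void)
{
	/* example: only lower the gap count to 5; do not force a root */
	printf("0x%08x\n", phy_config_quadlet(0, 0, 5));
	return 0;
}
```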
screenshot of "page allocation failure"
Yesterday's screenshot was with Mac mini + 6-port hub.
Today I first tested the mini alone. It survived two generation roll-overs. When I then plugged in a FireWire disk at generation 2 or 5 or so, several trace dumps quickly scrolled by until everything froze. Attached is the last visible screen.
swapper: page allocation failure. order:0, mode:0x20
...Call trace:
__alloc_pages_internal
cache_alloc_refill
lock_timer_base
_spin_lock_irqrestore
:firewire_core:fw_node_event
:firewire_core:report_found_node
kmem_cache_alloc
:firewire_core:fw_node_event
:firewire_core:for_each_fw_node
:firewire_core:fw_core_handle_bus_reset
_spin_unlock_irqrestore
tasklet_action
It would be good to know the first one of all the trace dumps. Hopefully a netconsole will gather something.
Created attachment 16508 [details]
messages captured with netconsole
Same scenario as in the previous comment, now all kernel messages captured via netconsole. All the dumps are of type "swapper: page allocation failure. order:0, mode:0x20".
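For context on the repeated "order:0, mode:0x20" dumps: in kernels of this vintage, mode 0x20 is GFP_ATOMIC, i.e. an allocation made from atomic (here: tasklet) context that may neither sleep nor reclaim memory and therefore fails outright once the free-page reserves are gone. A minimal sketch of that allocation pattern, with hypothetical names and not taken from firewire-core:

```c
#include <linux/slab.h>
#include <linux/interrupt.h>

/*
 * Illustrative only, not firewire-core code: a tasklet runs in softirq
 * context, so any allocation it makes must use GFP_ATOMIC (0x20 in
 * 2.6.26-era kernels).  Such an allocation can neither sleep nor start
 * reclaim; under memory pressure it returns NULL, which the kernel
 * reports as "page allocation failure ... mode:0x20".
 */
struct example_event {
	int node_id;
};

static void example_tasklet_func(unsigned long data)
{
	struct example_event *e;

	e = kmalloc(sizeof(*e), GFP_ATOMIC);
	if (!e)
		return;                 /* the failure path seen in the logs */

	e->node_id = (int)data;
	/* ... hand the event off to process context ... */
	kfree(e);
}

static DECLARE_TASKLET(example_tasklet, example_tasklet_func, 0);
```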
The bug evidently happens if
- the bus reset tasklet was executed at least 255 times or so, /and/
- the topology code is actually doing something, e.g. scheduling "found node" events or scheduling of bus manager work.

If only the local node is present, the bus reset tasklet can be executed much more often without apparent problem. This includes the rather light-weight "node updated" events for the local node. In the scenario as per comments 1 and 2, i.e. PC plus a hub which has selfID.link_active cleared, only "node updated" events for the local node and the scheduling of bus manager work happen.

Whether the hang occurs inside or outside of the topology code is not yet clear to me. It does not happen in the bus manager work, because even when I shorten fw_card_bm_work() to a mere return;, the kernel still hangs after more than circa 255 bus resets with the hub connected.

A probably less interesting footnote: now I got this last sigh a while after the last selfID logging (generation 2 after roll-over):

BUG: spinlock lockup on CPU#1, 1394commander/6551, ffff8100680c04d0
Pid: 6551, comm: 1394commander Not tainted 2.6.26-rc6 #4
Call Trace:
[<ffffffff802f552c>] _raw_spin_lock+0xf9/0x120
[<...>] spin_lock_irqsave+0x3a/0x42
[<...>] :firewire_core:fw_device_op_release+0x9c/0xdd
[<...>] __fput+0xbd/0x182
[<...>] filp_close+0x5d/0x65
[<...>] do_exit+0x234/0x60e

This is obviously because 1394commander spins on card->lock while something else hangs inside a card->lock protected region.

PS to comment 5:
> The bug evidently happens if
> - the bus reset tasklet was executed at least 255 times or so, /and/
> - the topology code is actually doing something, e.g. scheduling
>   "found node" events or scheduling of bus manager work.
>
> If only the local node is present, the bus reset tasklet can be executed
> much more often without apparent problem. This includes the rather
> light-weight "node updated" events for the local node.

If more than about 255 resets happen without any other node connected and another node is then plugged in, the lock-up happens immediately.

It does not matter whether one external node or several external nodes are present; it always takes >255 (>256?) bus resets (see the illustrative sketch at the end of this report).

The crash can be prevented by
- rmmod'ing and insmod'ing firewire-ohci, or
- unbinding and rebinding the controller's PCI device from/to firewire-ohci through the driver's 'unbind' and 'bind' sysfs attributes
before ~250 bus resets have happened. This means that the shutdown or initialization routines of firewire-core, of firewire-ohci, or of both clean up sufficiently to keep clear of the bug for another ~250 bus resets.

Downstream bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=244576
- nForce2 controller goes into bus reset loop; loop causes hang
- Sony DCR-TRV27 miniDV camcorder causes bus reset loop; loop causes hang

https://bugs.gentoo.org/show_bug.cgi?id=235817
- Sony DCR-HC37 miniDV camcorder causes bus reset loop; loop causes hang

Candidate fix:
http://marc.info/?l=linux1394-devel&m=122418676518400

Fix verified and committed to linux1394-2.6.git. It will be offered to mainline in a few days or so, and after that to -stable.

Fix merged in 2.6.28-rc2-git1. Hmm, seems more like 2.6.28-rc2-git2. Will post the fix to -stable too within the hour.
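A hedged illustration of the ">255 (>256?) bus resets" threshold noted above: one mechanism consistent with that number, and with for_each_fw_node() appearing in the traces, would be an 8-bit per-node traversal marker that is compared against a per-card counter which wraps after 256 bus resets and then collides with values still stored in stale nodes. The toy program below demonstrates only that aliasing effect; the names are hypothetical, and this is neither the firewire-core data structure nor the committed fix.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for a per-node "seen in this traversal" marker. */
struct toy_node {
	uint8_t color;
};

int main(void)
{
	/* Node last marked in traversal #3; never updated again. */
	struct toy_node stale = { .color = 3 };
	uint8_t card_color = 3;     /* per-card traversal counter */
	int resets;

	for (resets = 1; resets <= 300; resets++) {
		card_color++;       /* u8: wraps from 255 back to 0 */

		/* A "was this node already handled?" style comparison. */
		if (stale.color == card_color) {
			printf("after %d bus resets the stale marker "
			       "aliases with the current color (%u)\n",
			       resets, (unsigned)card_color);
			break;
		}
	}
	return 0;
}
```

Running it prints that the collision occurs after 256 resets, matching the observed threshold.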