Bug 15077

Summary: firewire-net: panic in prio_tree_left (was in fwnet_write_complete)
Product: Drivers
Reporter: Stefan Richter (stefanr)
Component: IEEE1394
Assignee: drivers_ieee1394
Status: CLOSED CODE_FIX
Severity: normal
CC: basinilya
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 2.6.32
Subsystem:
Regression: No
Bisected commit-id:
Attachments: screenshot of panic in fwnet_write_complete
             new screenshot
             not a patch
             dmesg with comments

Description Stefan Richter 2010-01-17 15:11:59 UTC
Created attachment 24607 [details]
screenshot of panic in fwnet_write_complete

Reported at http://marc.info/?l=linux1394-devel&m=126341793319333 on 2010-01-13:

>>>
I run Archlinux x86; kernel 2.6.32.2-2 is from its repo.
The system falls into a kernel panic when I copy lots of files (10Gb)
from a laptop with rsync or other programs via firewire. This happens in
90% of cases within 1 minute of copying.
Moving the mouse or hitting keys on the keyboard makes the kernel panic sooner.
A lower MTU value has no effect.
I have had no kernel panic yet with the "-maxcpus=1" boot option.

cpu is C2D E6550
--------------------
# lspci
...
04:03.0 FireWire (IEEE 1394): Agere Systems FW322/323 (rev 70)
--------------------

traces look like this (see the screenshot: http://www.postimage.org/image.php?v=Pq1AO6Dr ):
--------------------
...
Call Trace:
 [<xxxxxxxx>] ? ip_rcv_finish+0xeb/0x390
 [<xxxxxxxx>] ? close_transaction+0xda/0xf0 [firewire_core]
 [<xxxxxxxx>] ? handle_at_packet+0xa1/0x100 [firewire_ohci]
...
EIP: [<xxxxxxxx>] fwnet_write_complete+0x2c/0x160 [firewire_net] SS:ESP 0068:f7079e40
...
<<<
Comment 1 leniviy 2010-01-17 18:41:27 UTC
Created attachment 24608 [details]
new screenshot
Comment 2 leniviy 2010-01-17 18:46:00 UTC
Created attachment 24609 [details]
not a patch

Hi. Thanks for creating the bug ticket.
I partially followed your recommendation. Instead of calling
INIT_LIST_HEAD(&ptask->pt_link), I initialize prev and next to 0, so that I
can tell when the ptask is broken.
Note that the attached patch fixes nothing; it just shows what is wrong.
See the new screenshot: http://bugzilla.kernel.org/attachment.cgi?id=24608
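
For readers unfamiliar with the trick described in comment 2, here is a minimal userspace sketch of why zeroing the link pointers makes a never-queued ptask visible. It is not the firewire-net code: struct list_head is re-declared locally, and init_list_head, poison_list_head, list_add_tail_node and checked_list_del are hypothetical names chosen for illustration. The assumption is that the completion path unlinks the ptask through its pt_link member.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's struct list_head (illustration only). */
struct list_head {
    struct list_head *next, *prev;
};

/* Same effect as the kernel's INIT_LIST_HEAD(): the node points at itself. */
static void init_list_head(struct list_head *h)
{
    h->next = h;
    h->prev = h;
}

/* Debugging variant from comment 2: set both pointers to 0 instead. */
static void poison_list_head(struct list_head *h)
{
    h->next = NULL;
    h->prev = NULL;
}

/* Append node at the tail of an initialized list. */
static void list_add_tail_node(struct list_head *node, struct list_head *head)
{
    assert(head->prev != NULL); /* head must have been initialized */
    node->prev = head->prev;
    node->next = head;
    head->prev->next = node;
    head->prev = node;
}

/*
 * Hypothetical checked unlink. A self-linked node unlinks "successfully"
 * even if it was never queued, which hides the bug. A node with NULL
 * pointers is reported instead (the kernel would oops at this point).
 */
static bool checked_list_del(struct list_head *node)
{
    if (node->next == NULL || node->prev == NULL)
        return false; /* node was never put on a list */
    node->next->prev = node->prev;
    node->prev->next = node->next;
    init_list_head(node);
    return true;
}
```

With init_list_head, unlinking a node that was never queued only rewrites the node's own pointers, so nothing flags the problem. With poison_list_head, the same unlink is caught immediately, which is the point of the diagnostic patch.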
Comment 3 Stefan Richter 2010-01-18 22:21:33 UTC
Proposed patch:  http://lkml.org/lkml/2010/1/18/438
Comment 4 leniviy 2010-01-19 19:20:24 UTC
Tried the patch. The problem remains, but the symptoms changed:

1) At first I started rsync on tty1. All worked fine until I switched to X. It showed me the desktop and the system hung. The keyboard LEDs were not blinking.

2) Started rsync in gnome-terminal. All worked fine until I switched to tty1. It flooded several screens with call traces and hung.

3) After the 3rd restart, /sys/class/firewire0 was not created and dmesg says nothing, although firewire_net loaded with no error messages.

...still playing.

From the beginning there was another problem; I planned to report it separately, but maybe they're connected: very often firewire_net is not auto-loaded, and when I try to modprobe it, or to ifconfig firewire0, or to ping, dmesg says:
 firewire_ohci: isochronous cycle inconsistent
or
 firewire_core: giving up on config rom for node id ffc1
and the only solution is to reboot and try again.
Comment 5 Anonymous Emailer 2010-01-19 23:50:00 UTC
Reply-To: stefanr@s5r6.in-berlin.de

bugzilla-daemon@bugzilla.kernel.org wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=15077
> 
> 
> --- Comment #4 from leniviy <basinilya@gmail.com>  2010-01-19 19:20:24 ---
> Tried the patch. Problem stays, symptoms changed:
> 1)
> At first I started rsync on tty1. All worked fine until I switched to X. It
> showed me the desktop and the system hung. Keyboard leds not blinking.
> 
> 2) started rsync in gnome-terminal. All worked fine until I switched to tty1.
> It flooded several screens with call traces and hung.

Both could be the same kmemcache corruption bug that I saw, or they could
be something else entirely.

> 3) After the 3rd restart /sys/class/firewire0 not created and dmesg says
> nothing, although firewire_net loaded with no error messages.

Most likely unrelated.

> ...still playing.
> 
> From the beginning there was another problem; I planned to report it
> separately, but maybe they're connected: very often firewire_net not
> auto-loaded and when I try to modprobe it, or to ifconfig firewire0 or to
> ping,
> dmesg says:
>  firewire_ohci: isochronous cycle inconsistent

Shouldn't be an issue if it only happens on bus topology changes.  It is
known to happen when the cycle master changes.  It should not affect
firewire-net.

> or
>  firewire_core: giving up on config rom for node id ffc1
> and the only solution is to reboot and try again.

In this situation,
# modprobe -r firewire-ohci
# modprobe firewire-ohci debug=7
could give more information.  This could be unreliable hardware.

The PC which gave up had the Agere FW322/323, right?  What controller is
on the peer?
Comment 6 leniviy 2010-01-23 23:20:10 UTC
Created attachment 24691 [details]
dmesg with comments

I think http://lkml.org/lkml/2010/1/18/438 fixes this very bug. 

But you were probably right about memory corruption. Fixing this bug just revealed that other one.
Comment 7 Stefan Richter 2010-03-06 15:22:44 UTC
The fwnet_write_complete fix from comment 3 was merged into 2.6.33.  Renaming this bug to the one in prio_tree_left per comment 4 / comment 6 (attachment 24691 [details]).  Also note the new crash in cache_free_debugcheck per comment 5 (http://lkml.org/lkml/2010/1/18/488).
Comment 8 Stefan Richter 2010-10-25 13:48:05 UTC
Candidate fixes by Clemens Ladisch:
http://thread.gmane.org/gmane.linux.kernel.firewire.devel/14502
Comment 9 Stefan Richter 2010-10-31 12:40:57 UTC
The patches from comment 8 work for me.  firewire-net is still not performing very well and sometimes connections break down entirely (workaround: reload firewire-ohci), but there are no crashes anymore.
Comment 10 Stefan Richter 2010-11-06 08:31:40 UTC
The fixes were merged into mainline and will appear in 2.6.37-rc2. They were also submitted for inclusion in the currently active stable series.