Bug 9617

Summary: panic in :firewire_ohci:handle_at_packet (was oops with macbook in target disk mode (firewire-ohci))
Product: Drivers Reporter: Johannes Berg (johannes)
Component: IEEE1394Assignee: Stefan Richter (stefanr)
Status: CLOSED CODE_FIX    
Severity: high CC: jarod
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.24-rc5-g2d0811f5-dirty Subsystem:
Regression: --- Bisected commit-id:
Attachments: image showing the oops
panic in :firewire_ohci:handle_at_packet
last log messages before lock-up (now with AR/At event logging)
another shot of the same panic
messages without DEBUG_PAGEALLOC
crash log with some debug printks
invalid patch

Description Johannes Berg 2007-12-21 16:03:39 UTC
Most recent kernel where this bug did not occur: unknown
Distribution: debian/unstable(+experimental)
Hardware Environment: powerbook G4
Software Environment: gnome?

kernel version is wireless-2.6#everything tree plus some local wireless patches

Problem Description:

System oopses as soon as I plug in a macbook in target-disk mode (with
firewire-ohci).

Steps to reproduce:
unload ohci1394, load firewire-ohci
plug in cable connected to macbook in target-disk mode

Unlike with ohci1394, it oopses without mounting anything.
Comment 1 Johannes Berg 2007-12-21 16:04:22 UTC
Created attachment 14144 [details]
image showing the oops

This one shows the oops with firewire-ohci, has the other stack's modules loaded but not ohci1394 so those aren't used.
Comment 2 Stefan Richter 2007-12-22 12:38:09 UTC
If you have got a few older kernels laying around and can afford to reboot, it'd be nice if you could check whether the oops also happens with 2.6.23 or a late 2.6.23-rc (and if yes, if it also happens with 2.6.22.y or 2.6.22 or late 2.6.22-rc), depending on what you can conveniently boot.
Comment 3 Johannes Berg 2007-12-22 13:25:39 UTC
I seem to still have these. Want to choose a few?

vmlinux-2.6.23-g0895e91d-dirty
vmlinux-2.6.23-g398a9bfb-dirty
vmlinux-2.6.23-g4bcbd552-dirty
vmlinux-2.6.23-rc2-g5f743bbc-dirty
vmlinux-2.6.23-rc2-g91e4d6f6-dirty
vmlinux-2.6.23-rc2-gf2b262a6-dirty
vmlinux-2.6.23-rc3-g0347a7c8-dirty
vmlinux-2.6.23-rc3-g16f54da2-dirty
vmlinux-2.6.23-rc3-gdf5fc53d-dirty
vmlinux-2.6.23-rc4-g60faa0c2-dirty
vmlinux-2.6.23-rc4-nohz-g60faa0c2-dirty
vmlinux-2.6.23-rc6-gccad6302-dirty
vmlinux-2.6.23-rc6-gcd799e16-dirty
vmlinux-2.6.23-rc8-g428ff550-dirty
vmlinux-2.6.23-rc9-g2d46d1e0-dirty
vmlinux-2.6.24-g301df407-dirty
vmlinux-2.6.24-rc1-dirty
vmlinux-2.6.24-rc1-g2655e2ce-dirty
vmlinux-2.6.24-rc1-gb55d1b18-dirty
vmlinux-2.6.24-rc1-gc8593884-dirty
vmlinux-2.6.24-rc1-gec3b67c1-dirty
vmlinux-2.6.24-rc2-dirty
vmlinux-2.6.24-rc2-g41aa1d66-dirty
vmlinux-2.6.24-rc2-gc4ca6d40-dirty
vmlinux-2.6.24-rc3-g301df407-dirty
vmlinux-2.6.24-rc3-gad8f7a9b-dirty
vmlinux-2.6.24-rc3-gcc05246d-dirty
vmlinux-2.6.24-rc5-g2d0811f5-dirty
vmlinux-2.6.24-rc5-g50443c08-dirty
vmlinux-2.6.24-rc5-gb8b5f6a2-dirty
Comment 4 Stefan Richter 2007-12-22 13:46:55 UTC
vmlinux-2.6.23-rc9-g2d46d1e0-dirty should be fine, if you didn't dirty it too much. :-)

If this oopses too, then 2.6.22.15 might be a better test subject than vmlinux-2.6.23-g0895e91d-dirty, unless you can confirm that commit b8f106570f20a7a98a18a14c2d11d65f581c2808 "firewire: fw-sbp2: remove unused struct member" is not an ancestor of g0895e91d-dirty.  (This was one of the patches that went into the 2.6.22...2.6.23-rc1 merge window.)
Comment 5 Stefan Richter 2007-12-22 15:40:03 UTC
FWIW, if I connect an Intel Mac mini in target disk mode with an i586 SMP box running 2.6.24-rc6 plus what's in linux1394-2.6.git, I merely get:

Dec 23 00:29:00 stein firewire_core: created new fw device fw3 (0 config rom retries, S400)
Dec 23 00:29:00 stein scsi2 : SBP-2 IEEE-1394
Dec 23 00:29:00 stein scsi3 : SBP-2 IEEE-1394
Dec 23 00:29:00 stein firewire_sbp2: logged in to fw3.0 LUN 0000 (0 retries)
Dec 23 00:29:00 stein scsi 2:0:0:0: Direct-Access-RBC AAPL     FireWire Target  0000 PQ: 0 ANSI: 3
Dec 23 00:29:00 stein sd 2:0:0:0: [sdd] Attached SCSI removable disk
Dec 23 00:29:00 stein sd 2:0:0:0: Attached scsi generic sg3 type 14
Dec 23 00:29:00 stein firewire_sbp2: logged in to fw3.1 LUN 0000 (0 retries)
Dec 23 00:29:00 stein scsi 3:0:0:0: Direct-Access-RBC AAPL     FireWire Target  0000 PQ: 0 ANSI: 3
Dec 23 00:29:00 stein sd 3:0:0:0: [sde] 117210240 512-byte hardware sectors (60012 MB)
Dec 23 00:29:00 stein sd 3:0:0:0: [sde] Test WP failed, assume Write Enabled
Dec 23 00:29:00 stein sd 3:0:0:0: [sde] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Dec 23 00:29:00 stein sd 3:0:0:0: [sde] 117210240 512-byte hardware sectors (60012 MB)
Dec 23 00:29:00 stein sd 3:0:0:0: [sde] Test WP failed, assume Write Enabled
Dec 23 00:29:00 stein sd 3:0:0:0: [sde] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Dec 23 00:29:00 stein sde: sde1 sde2 sde3
Dec 23 00:29:00 stein sd 3:0:0:0: [sde] Attached SCSI disk
Dec 23 00:29:00 stein sd 3:0:0:0: Attached scsi generic sg4 type 14

I can access the sde disk just fine.  I don't know what the sdd disk is supposed to be; I remember that older Macs (Titanium PowerBook in particular) only presented a single logical unit.
Comment 6 Johannes Berg 2007-12-22 23:51:40 UTC
Do you have slab debugging enabled on that machine? I'll give the suggested kernel a try later today, b8f106570f20a7a98a18a14c2d11d65f581c2808 is an ancestor of 0895e91d. The dirtying is only related to wireless and powerpc suspend work :)
Comment 7 Johannes Berg 2007-12-23 00:09:25 UTC
Based on that I don't understand the firewire code I'd guess that somehow I get two completions for a single transaction packet that has been freed by its own callback. Can that happen at all? I can stick in a driver_data->packet = NULL into handle_at_packet to test for that I guess...

johannes
Comment 8 Johannes Berg 2007-12-23 00:34:13 UTC
Exactly the same thing happens with vmlinux-2.6.23-rc9-g2d46d1e0-dirty.
Comment 9 Stefan Richter 2007-12-23 04:01:24 UTC
> Do you have slab debugging enabled on that machine?

Not yet.
Comment 10 Stefan Richter 2008-02-28 08:25:55 UTC
Created attachment 15061 [details]
panic in :firewire_ohci:handle_at_packet

similar but not exactly same panic on x86_64 SMP PC with FW323 controller and an old portable FireWire CD-RW
Comment 11 Jarod Wilson 2008-02-28 08:36:27 UTC
Identical panic as Stefan, dual quad-core opteron, Ti TSB43AB22/A controller onboard a Tyan motherboard, with a bus-powered 2.5" hard disk.

firewire_sbp2: fw4.0: sbp2_scsi_abort
sd 9:0:0:0: [sdc] Result: hostbyte=DID_BUS_BUSY driverbyte=DRIVER_OK,SUGGEST_OK
end_request: I/O error, dev sdc, sector 12423
Buffer I/O error on device sdc1, logical block 1545
lost page write due to I/O error on sdc1
sd 9:0:0:0: [sdc] Result: hostbyte=DID_BUS_BUSY driverbyte=DRIVER_OK,SUGGEST_OK
end_request: I/O error, dev sdc, sector 12431
sd 9:0:0:0: [sdc] Result: hostbyte=DID_BUS_BUSY driverbyte=DRIVER_OK,SUGGEST_OK
end_request: I/O error, dev sdc, sector 12447
Buffer I/O error on device sdc1, logical block 1548
lost page write due to I/O error on sdc1
Aborting journal on device sdc1.
general protection fault: 0000 [1] SMP DEBUG_PAGEALLOC
CPU 2 
Modules linked in: ipt_MASQUERADE iptable_nat nf_nat bridge rfcomm l2cap bluetooth autofs4 nfs lockd nfs_acl sunrpc nf_conntrack_ipv4 ipt_REJECT iptable_filter ip_tables nf_conntrack_ipv6 xt_state nf_conntrack xt_tcpudp ip6t_ipv6header ip6t_REJECT ip6table_filter ip6_tables x_tables ipv6 ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi scsi_transport_iscsi cpufreq_ondemand powernow_k8 freq_table ext4dev jbd2 crc16 loop dm_multipath kvm_amd kvm snd_hda_intel snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd_page_alloc floppy snd_hwdep snd forcedeth pcspkr soundcore i2c_nforce2 io_edgeport shpchp i2c_core usbserial sr_mod cdrom button sg firewire_sbp2 firewire_ohci firewire_core crc_itu_t pata_amd dm_snapshot dm_zero dm_mirror dm_mod sata_nv pata_acpi ata_generic libata sd_mod scsi_mod ext3 jbd mbcache uhci_hcd ohci_hcd ehci_hcd [last unloaded: scsi_wait_scan]
Pid: 0, comm: swapper Not tainted 2.6.25-0.67.rc3.fc9.fw #1
RIP: 0010:[<ffffffff880fe5d1>]  [<ffffffff880fe5d1>] :firewire_ohci:handle_at_packet+0xa1/0xb1
RSP: 0018:ffff81023e69fe80  EFLAGS: 00010283
RAX: 0000000000000002 RBX: ffff81023e6ba650 RCX: 0000000000cccbc2
RDX: 0000000000000002 RSI: ffff81023d2ee1b0 RDI: ffff81023e6ba650
RBP: ffff81023e69fea0 R08: ffff81023d2ee658 R09: 0000000000000000
R10: ffffffff88106166 R11: 0000000000002002 R12: ffff8100bf8bedf0
R13: ffff81023d2ee1b0 R14: ffff8100bf8bedf0 R15: ffff8100bf8be000
FS:  00007fef7a5236f0(0000) GS:ffff81023fc81960(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 000000000011fb80 CR3: 000000023c12f000 CR4: 00000000000006a0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process swapper (pid: 0, threadinfo ffff81023e698000, task ffff81023e694000)
Stack:  ffff81023e69feb0 ffff8100bf8bedf0 ffff8100bf8be000 ffff81023d2ef0b8
 ffff81023e69fee0 ffffffff880feeca 0000000000000009 ffff81023d2ef0f8
 0000000000000000 0000000000000009 0000000000000002 ffff810001142200
Call Trace:
 <IRQ>  [<ffffffff880feeca>] :firewire_ohci:context_tasklet+0x6a/0xe1
 [<ffffffff810396ef>] tasklet_action+0x5e/0xb2
 [<ffffffff8103959c>] __do_softirq+0x5f/0xe3
 [<ffffffff8100d1ac>] call_softirq+0x1c/0x28
 [<ffffffff8100eaa7>] do_softirq+0x39/0x8a
 [<ffffffff810394fc>] irq_exit+0x4e/0x8f
 [<ffffffff8100ec66>] do_IRQ+0x144/0x166
 [<ffffffff811fa36e>] ? cpuidle_idle_call+0x0/0xa2
 [<ffffffff8100c536>] ret_from_intr+0x0/0xf
 <EOI>  [<ffffffff811fa40b>] ? cpuidle_idle_call+0x9d/0xa2
 [<ffffffff811fa3aa>] ? cpuidle_idle_call+0x3c/0xa2
 [<ffffffff811fa36e>] ? cpuidle_idle_call+0x0/0xa2
 [<ffffffff8100b094>] ? cpu_idle+0xa4/0xce
 [<ffffffff81295c1c>] ? start_secondary+0x3c0/0x3cc


Code: 21 c7 43 40 13 00 00 00 eb 18 c7 43 40 14 00 00 00 eb 0f 8d 42 f0 89 43 40 eb 07 c7 43 40 10 00 00 00 8b 53 40 4c 89 ee 48 89 df <ff> 53 38 b8 01 00 00 00 5a 5b 41 5c 41 5d c9 c3 55 48 89 e5 41 
RIP  [<ffffffff880fe5d1>] :firewire_ohci:handle_at_packet+0xa1/0xb1
 RSP <ffff81023e69fe80>
---[ end trace 7bf37a0e55355b79 ]---
Kernel panic - not syncing: Aiee, killing interrupt handler!
APIC error on CPU2: 08(00)
Comment 12 Jarod Wilson 2008-03-12 14:51:07 UTC
My panic is gone with a one-liner, just posted a bit ago:

http://lkml.org/lkml/2008/3/12/355
Comment 13 Stefan Richter 2008-03-12 17:06:52 UTC
Created attachment 15230 [details]
last log messages before lock-up (now with AR/At event logging)

The actual panic message didn't make it onto the screen anymore in this example.

Nothing sticks out.  We see the successful completion of the login and of the SCSI inquiry (each ending in a status write from the target of size 8), and there seem to be two other SCSI requests being completed (each with a status write from the target of size 10).  The AT event which ends in the panic was not shown anymore.
Comment 14 Stefan Richter 2008-03-12 17:11:01 UTC
Created attachment 15231 [details]
another shot of the same panic

Again with AR/AT DMA event logging.  It his hard to say which of the events before the panic didn't make it to the screen anymore, presumably quite a few.  Maybe I should set up netconsole or firescope after all.
Comment 15 Stefan Richter 2008-03-12 17:13:01 UTC
Re comment #12:
> My panic is gone with a one-liner,

Alas mine isn't.
Comment 16 Stefan Richter 2008-03-12 17:42:13 UTC
Created attachment 15233 [details]
messages without DEBUG_PAGEALLOC

These are the AR and AT DMA events encountered with the crashing hardware combination if I switch DEBUG_PAGEALLOC off, i.e. the panic doesn't happen.  We see the same events as in attachment 15230 [details], followed by seemingly more of the same.  Sometimes the CD-RW (node ffc0) is a little bit quicker to send write responses, sometimes a little bit slower.  In the latter case we see overlapping split transactions, which shouldn't matter.

What may be significant though is that other SBP-2 devices which I regularly use don't use split transactions for write transactions from host to target.  They use unified transactions instead, where the ACK for the write request stands in for a write response and thus already completes the transaction.

IOW, crash-provoking LSI based device:

    AT host -> target, ack_pending , BW req
    AR target -> host, ack_complete, W resp

Other device which doesn't provoke crashes even with DEBUG_PAGEALLOC enabled:

    AT host -> target, ack_complete, BW req

Furthermore, there are much fewer quadlet writes to the fetch agent reset register with the non-crashing devices.  Almost only the 8 bystes sized write requests to the ORB pointer register happen, and 8 bytes sized status writes from the target (compared to the interspersed 10 bytes sized status writes from the crash-provoking LSI based CD-RW).
Comment 17 Stefan Richter 2008-03-14 11:15:41 UTC
A Mac mini in target mode also causes split transactions, like the crash-provoking LSI based devices but unlike all my other SBP-2 devices which I inspected so far.
Comment 18 Stefan Richter 2008-03-15 05:34:57 UTC
Another note on how to reproduce the panic:

If firewire-sbp2 is blacklisted and loaded manually after plugging in a dangerous SBP-2 target, the panic does not happen (or at least not immediately after the SBP-2 login).
Comment 19 Stefan Richter 2008-03-15 09:32:10 UTC
Created attachment 15277 [details]
crash log with some debug printks

So I figured that setting up netconsole might not be such a bad idea after all.

Let's just look at the AR and AT events.  ffc0 is the CD-RW, ffc4 is the local node.

firewire_sbp2: fw1.0: logged in to LUN 0000 (0 retries)
AT spd 2 tl 17, ffc4 -> ffc0, ack_pending , QW req, fffff0000210 = 000c800f
AR spd 2 tl 17, ffc0 -> ffc4, ack_complete, W resp

This is a write transaction, performed as a split transaction.  I.e. it consists of write request and write response.  We know to which request a response belongs by looking at the transaction label, "tl", here 0x17.

AT spd 2 tl 18, ffc4 -> ffc0, ack_pending , QW req, fffff0010004 = 00000000
AR spd 2 tl 18, ffc0 -> ffc4, ack_complete, W resp

Ditto.

AR spd 2 tl 19, ffc0 -> ffc4, ack_complete, W resp
AT spd 2 tl 19, ffc4 -> ffc0, ack_pending , BW req, fffff0010008 8,0

Again a split transaction.  It's now a block write, not a quadlet write.  But the more interesting thing is:  The reception of the response is logged before transmission of the request.  In reality, the request went over the bus before the response, if we assume for a moment that my CD-RW is not clairvoyant.

AR spd 2 tl 00, ffc0 -> ffc4, ack_complete, BW req, 000100000000 8,0

This is a unified transaction.  I.e. the ACK for the request already completes the transaction, and no response is sent anymore.

scsi 3:0:0:0: CD-ROM            TEAC     CD-W28E          1.1A PQ: 0 ANSI: 0 CCS
AR spd 2 tl 1a, ffc0 -> ffc4, ack_complete, W resp
AT spd 2 tl 1a, ffc4 -> ffc0, ack_pending , BW req, fffff0010008 8,0

Again a split transaction with request and response logged in the wrong order.

AR spd 2 tl 01, ffc0 -> ffc4, ack_complete, BW req, 000100000000 10,0

Unified transaction.

AT spd 2 tl 1b, ffc4 -> ffc0, ack_pending , QW req, fffff0010004 = 00000000
AR spd 2 tl 1b, ffc0 -> ffc4, ack_complete, W resp

Split transaction, logged in the real order.

AT spd 2 tl 1c, ffc4 -> ffc0, ack_pending , BW req, fffff0010008 8,0
AR spd 2 tl 1c, ffc0 -> ffc4, ack_complete, W resp

Ditto.

AR spd 2 tl 00, ffc0 -> ffc4, ack_complete, BW req, 000100000000 10,0

Unified transaction.

AR spd 2 tl 1d, ffc0 -> ffc4, ack_complete, W resp
AT spd 6b6b6b6b tl 1a, 6b6b -> 6b6b, ack_pending , QR resp = 6b6b6b6b
= handle_at_packet -- calling packet->callback =
general protection fault: 0000 [1] 

Boom!

This is most certainly a split transaction with request and response logged in the wrong order.  We can only guess what kind of transaction this last one was.  

Anyway.  What apparently happened was that 
fw_core_handle_response() which was called by the AR handler did some nice cleanup while the AT handler for the matching request was still pending.
Comment 20 Stefan Richter 2008-03-15 10:03:03 UTC
Created attachment 15278 [details]
invalid patch

> What apparently happened was that  fw_core_handle_response() which was
> called by the AR handler did some nice cleanup while the AT handler for
> the matching request was still pending.

No.  That's not it.  At least the attached patch which reorders the scheduling of AT and AR tasklets does not help.

Whatever happened, the one obvious thing is that the AT handler advanced into an uninitialized descriptor buffer.
Comment 21 Stefan Richter 2008-03-15 11:05:31 UTC
> Whatever happened, the one obvious thing is that the AT handler advanced
> into an uninitialized descriptor buffer.

No, the AT debug printk only accesses the packet which higher levels allocated, not fw-ohci's AT DMA descriptor buffer.  Someone already freed the transaction into which this packet was embedded.
Comment 22 Stefan Richter 2008-03-15 15:46:15 UTC
The interrupt events arrive in the expected order.  But the AR response tasklet may nevertheless be executed before the AT request tasklet.  Crash with interrupt event logging enabled:

firewire_ohci: IRQ 00000001 AT_req
firewire_ohci: IRQ 00000020 AR_resp
AR spd 2 tl 08, ffc0 -> ffc4, ack_complete, W resp
AT spd 6b6b6b6b tl 1a, 6b6b -> 6b6b, ack_pending , QR resp = 6b6b6b6b
general protection fault: 0000 [1]
Comment 23 Stefan Richter 2008-03-15 17:16:33 UTC
One wonders what happened between comment #9 (2007-12-23) and #10 (2008-02-28) and whether the FireWire maintainer never heard of DEBUG_PAGEALLOC before.  Anyway, there is a proposed patch now:  http://lkml.org/lkml/2008/3/15/123

Johannes, Jarod, does this patch work for you?  It fixes access to my two LSI based CD-RWs.  I have yet to stress-test it with some other hardware.
Comment 24 Jarod Wilson 2008-03-15 18:07:20 UTC
Forgot to add here that my one-liner in comment 12 was a red herring, that particular system's panic was apparently resolved by another patch that made sure we properly ensured dma coherency on systems with memory mapped over the 4GB mark. 

However, I'm still able to reproduce a panic in handle_at_packet by connecting my laptop to a mac mini in target disk mode, and will test out the patch in comment #23 here soonish...
Comment 25 Johannes Berg 2008-03-16 08:57:55 UTC
The patch in comment #23 fixes the problem for me, thanks!
Comment 26 Jarod Wilson 2008-03-16 20:14:16 UTC
Indeed, no more panic hooking to the mac mini in target disk mode either.