Bug 9617
Summary: | panic in :firewire_ohci:handle_at_packet (was oops with macbook in target disk mode (firewire-ohci)) | ||
---|---|---|---|
Product: | Drivers | Reporter: | Johannes Berg (johannes) |
Component: | IEEE1394 | Assignee: | Stefan Richter (stefanr) |
Status: | CLOSED CODE_FIX | ||
Severity: | high | CC: | jarod |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 2.6.24-rc5-g2d0811f5-dirty | Subsystem: | |
Regression: | --- | Bisected commit-id: | |
Attachments: |
image showing the oops
panic in :firewire_ohci:handle_at_packet last log messages before lock-up (now with AR/At event logging) another shot of the same panic messages without DEBUG_PAGEALLOC crash log with some debug printks invalid patch |
Description
Johannes Berg
2007-12-21 16:03:39 UTC
Created attachment 14144 [details]
image showing the oops
This one shows the oops with firewire-ohci, has the other stack's modules loaded but not ohci1394 so those aren't used.
If you have got a few older kernels laying around and can afford to reboot, it'd be nice if you could check whether the oops also happens with 2.6.23 or a late 2.6.23-rc (and if yes, if it also happens with 2.6.22.y or 2.6.22 or late 2.6.22-rc), depending on what you can conveniently boot. I seem to still have these. Want to choose a few? vmlinux-2.6.23-g0895e91d-dirty vmlinux-2.6.23-g398a9bfb-dirty vmlinux-2.6.23-g4bcbd552-dirty vmlinux-2.6.23-rc2-g5f743bbc-dirty vmlinux-2.6.23-rc2-g91e4d6f6-dirty vmlinux-2.6.23-rc2-gf2b262a6-dirty vmlinux-2.6.23-rc3-g0347a7c8-dirty vmlinux-2.6.23-rc3-g16f54da2-dirty vmlinux-2.6.23-rc3-gdf5fc53d-dirty vmlinux-2.6.23-rc4-g60faa0c2-dirty vmlinux-2.6.23-rc4-nohz-g60faa0c2-dirty vmlinux-2.6.23-rc6-gccad6302-dirty vmlinux-2.6.23-rc6-gcd799e16-dirty vmlinux-2.6.23-rc8-g428ff550-dirty vmlinux-2.6.23-rc9-g2d46d1e0-dirty vmlinux-2.6.24-g301df407-dirty vmlinux-2.6.24-rc1-dirty vmlinux-2.6.24-rc1-g2655e2ce-dirty vmlinux-2.6.24-rc1-gb55d1b18-dirty vmlinux-2.6.24-rc1-gc8593884-dirty vmlinux-2.6.24-rc1-gec3b67c1-dirty vmlinux-2.6.24-rc2-dirty vmlinux-2.6.24-rc2-g41aa1d66-dirty vmlinux-2.6.24-rc2-gc4ca6d40-dirty vmlinux-2.6.24-rc3-g301df407-dirty vmlinux-2.6.24-rc3-gad8f7a9b-dirty vmlinux-2.6.24-rc3-gcc05246d-dirty vmlinux-2.6.24-rc5-g2d0811f5-dirty vmlinux-2.6.24-rc5-g50443c08-dirty vmlinux-2.6.24-rc5-gb8b5f6a2-dirty vmlinux-2.6.23-rc9-g2d46d1e0-dirty should be fine, if you didn't dirty it too much. :-) If this oopses too, then 2.6.22.15 might be a better test subject than vmlinux-2.6.23-g0895e91d-dirty, unless you can confirm that commit b8f106570f20a7a98a18a14c2d11d65f581c2808 "firewire: fw-sbp2: remove unused struct member" is not an ancestor of g0895e91d-dirty. (This was one of the patches that went into the 2.6.22...2.6.23-rc1 merge window.) FWIW, if I connect an Intel Mac mini in target disk mode with an i586 SMP box running 2.6.24-rc6 plus what's in linux1394-2.6.git, I merely get: Dec 23 00:29:00 stein firewire_core: created new fw device fw3 (0 config rom retries, S400) Dec 23 00:29:00 stein scsi2 : SBP-2 IEEE-1394 Dec 23 00:29:00 stein scsi3 : SBP-2 IEEE-1394 Dec 23 00:29:00 stein firewire_sbp2: logged in to fw3.0 LUN 0000 (0 retries) Dec 23 00:29:00 stein scsi 2:0:0:0: Direct-Access-RBC AAPL FireWire Target 0000 PQ: 0 ANSI: 3 Dec 23 00:29:00 stein sd 2:0:0:0: [sdd] Attached SCSI removable disk Dec 23 00:29:00 stein sd 2:0:0:0: Attached scsi generic sg3 type 14 Dec 23 00:29:00 stein firewire_sbp2: logged in to fw3.1 LUN 0000 (0 retries) Dec 23 00:29:00 stein scsi 3:0:0:0: Direct-Access-RBC AAPL FireWire Target 0000 PQ: 0 ANSI: 3 Dec 23 00:29:00 stein sd 3:0:0:0: [sde] 117210240 512-byte hardware sectors (60012 MB) Dec 23 00:29:00 stein sd 3:0:0:0: [sde] Test WP failed, assume Write Enabled Dec 23 00:29:00 stein sd 3:0:0:0: [sde] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA Dec 23 00:29:00 stein sd 3:0:0:0: [sde] 117210240 512-byte hardware sectors (60012 MB) Dec 23 00:29:00 stein sd 3:0:0:0: [sde] Test WP failed, assume Write Enabled Dec 23 00:29:00 stein sd 3:0:0:0: [sde] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA Dec 23 00:29:00 stein sde: sde1 sde2 sde3 Dec 23 00:29:00 stein sd 3:0:0:0: [sde] Attached SCSI disk Dec 23 00:29:00 stein sd 3:0:0:0: Attached scsi generic sg4 type 14 I can access the sde disk just fine. I don't know what the sdd disk is supposed to be; I remember that older Macs (Titanium PowerBook in particular) only presented a single logical unit. Do you have slab debugging enabled on that machine? I'll give the suggested kernel a try later today, b8f106570f20a7a98a18a14c2d11d65f581c2808 is an ancestor of 0895e91d. The dirtying is only related to wireless and powerpc suspend work :) Based on that I don't understand the firewire code I'd guess that somehow I get two completions for a single transaction packet that has been freed by its own callback. Can that happen at all? I can stick in a driver_data->packet = NULL into handle_at_packet to test for that I guess... johannes Exactly the same thing happens with vmlinux-2.6.23-rc9-g2d46d1e0-dirty. > Do you have slab debugging enabled on that machine?
Not yet.
Created attachment 15061 [details]
panic in :firewire_ohci:handle_at_packet
similar but not exactly same panic on x86_64 SMP PC with FW323 controller and an old portable FireWire CD-RW
Identical panic as Stefan, dual quad-core opteron, Ti TSB43AB22/A controller onboard a Tyan motherboard, with a bus-powered 2.5" hard disk. firewire_sbp2: fw4.0: sbp2_scsi_abort sd 9:0:0:0: [sdc] Result: hostbyte=DID_BUS_BUSY driverbyte=DRIVER_OK,SUGGEST_OK end_request: I/O error, dev sdc, sector 12423 Buffer I/O error on device sdc1, logical block 1545 lost page write due to I/O error on sdc1 sd 9:0:0:0: [sdc] Result: hostbyte=DID_BUS_BUSY driverbyte=DRIVER_OK,SUGGEST_OK end_request: I/O error, dev sdc, sector 12431 sd 9:0:0:0: [sdc] Result: hostbyte=DID_BUS_BUSY driverbyte=DRIVER_OK,SUGGEST_OK end_request: I/O error, dev sdc, sector 12447 Buffer I/O error on device sdc1, logical block 1548 lost page write due to I/O error on sdc1 Aborting journal on device sdc1. general protection fault: 0000 [1] SMP DEBUG_PAGEALLOC CPU 2 Modules linked in: ipt_MASQUERADE iptable_nat nf_nat bridge rfcomm l2cap bluetooth autofs4 nfs lockd nfs_acl sunrpc nf_conntrack_ipv4 ipt_REJECT iptable_filter ip_tables nf_conntrack_ipv6 xt_state nf_conntrack xt_tcpudp ip6t_ipv6header ip6t_REJECT ip6table_filter ip6_tables x_tables ipv6 ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi scsi_transport_iscsi cpufreq_ondemand powernow_k8 freq_table ext4dev jbd2 crc16 loop dm_multipath kvm_amd kvm snd_hda_intel snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd_page_alloc floppy snd_hwdep snd forcedeth pcspkr soundcore i2c_nforce2 io_edgeport shpchp i2c_core usbserial sr_mod cdrom button sg firewire_sbp2 firewire_ohci firewire_core crc_itu_t pata_amd dm_snapshot dm_zero dm_mirror dm_mod sata_nv pata_acpi ata_generic libata sd_mod scsi_mod ext3 jbd mbcache uhci_hcd ohci_hcd ehci_hcd [last unloaded: scsi_wait_scan] Pid: 0, comm: swapper Not tainted 2.6.25-0.67.rc3.fc9.fw #1 RIP: 0010:[<ffffffff880fe5d1>] [<ffffffff880fe5d1>] :firewire_ohci:handle_at_packet+0xa1/0xb1 RSP: 0018:ffff81023e69fe80 EFLAGS: 00010283 RAX: 0000000000000002 RBX: ffff81023e6ba650 RCX: 0000000000cccbc2 RDX: 0000000000000002 RSI: ffff81023d2ee1b0 RDI: ffff81023e6ba650 RBP: ffff81023e69fea0 R08: ffff81023d2ee658 R09: 0000000000000000 R10: ffffffff88106166 R11: 0000000000002002 R12: ffff8100bf8bedf0 R13: ffff81023d2ee1b0 R14: ffff8100bf8bedf0 R15: ffff8100bf8be000 FS: 00007fef7a5236f0(0000) GS:ffff81023fc81960(0000) knlGS:0000000000000000 CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b CR2: 000000000011fb80 CR3: 000000023c12f000 CR4: 00000000000006a0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process swapper (pid: 0, threadinfo ffff81023e698000, task ffff81023e694000) Stack: ffff81023e69feb0 ffff8100bf8bedf0 ffff8100bf8be000 ffff81023d2ef0b8 ffff81023e69fee0 ffffffff880feeca 0000000000000009 ffff81023d2ef0f8 0000000000000000 0000000000000009 0000000000000002 ffff810001142200 Call Trace: <IRQ> [<ffffffff880feeca>] :firewire_ohci:context_tasklet+0x6a/0xe1 [<ffffffff810396ef>] tasklet_action+0x5e/0xb2 [<ffffffff8103959c>] __do_softirq+0x5f/0xe3 [<ffffffff8100d1ac>] call_softirq+0x1c/0x28 [<ffffffff8100eaa7>] do_softirq+0x39/0x8a [<ffffffff810394fc>] irq_exit+0x4e/0x8f [<ffffffff8100ec66>] do_IRQ+0x144/0x166 [<ffffffff811fa36e>] ? cpuidle_idle_call+0x0/0xa2 [<ffffffff8100c536>] ret_from_intr+0x0/0xf <EOI> [<ffffffff811fa40b>] ? cpuidle_idle_call+0x9d/0xa2 [<ffffffff811fa3aa>] ? cpuidle_idle_call+0x3c/0xa2 [<ffffffff811fa36e>] ? cpuidle_idle_call+0x0/0xa2 [<ffffffff8100b094>] ? cpu_idle+0xa4/0xce [<ffffffff81295c1c>] ? start_secondary+0x3c0/0x3cc Code: 21 c7 43 40 13 00 00 00 eb 18 c7 43 40 14 00 00 00 eb 0f 8d 42 f0 89 43 40 eb 07 c7 43 40 10 00 00 00 8b 53 40 4c 89 ee 48 89 df <ff> 53 38 b8 01 00 00 00 5a 5b 41 5c 41 5d c9 c3 55 48 89 e5 41 RIP [<ffffffff880fe5d1>] :firewire_ohci:handle_at_packet+0xa1/0xb1 RSP <ffff81023e69fe80> ---[ end trace 7bf37a0e55355b79 ]--- Kernel panic - not syncing: Aiee, killing interrupt handler! APIC error on CPU2: 08(00) My panic is gone with a one-liner, just posted a bit ago: http://lkml.org/lkml/2008/3/12/355 Created attachment 15230 [details]
last log messages before lock-up (now with AR/At event logging)
The actual panic message didn't make it onto the screen anymore in this example.
Nothing sticks out. We see the successful completion of the login and of the SCSI inquiry (each ending in a status write from the target of size 8), and there seem to be two other SCSI requests being completed (each with a status write from the target of size 10). The AT event which ends in the panic was not shown anymore.
Created attachment 15231 [details]
another shot of the same panic
Again with AR/AT DMA event logging. It his hard to say which of the events before the panic didn't make it to the screen anymore, presumably quite a few. Maybe I should set up netconsole or firescope after all.
Re comment #12: > My panic is gone with a one-liner, Alas mine isn't. Created attachment 15233 [details] messages without DEBUG_PAGEALLOC These are the AR and AT DMA events encountered with the crashing hardware combination if I switch DEBUG_PAGEALLOC off, i.e. the panic doesn't happen. We see the same events as in attachment 15230 [details], followed by seemingly more of the same. Sometimes the CD-RW (node ffc0) is a little bit quicker to send write responses, sometimes a little bit slower. In the latter case we see overlapping split transactions, which shouldn't matter. What may be significant though is that other SBP-2 devices which I regularly use don't use split transactions for write transactions from host to target. They use unified transactions instead, where the ACK for the write request stands in for a write response and thus already completes the transaction. IOW, crash-provoking LSI based device: AT host -> target, ack_pending , BW req AR target -> host, ack_complete, W resp Other device which doesn't provoke crashes even with DEBUG_PAGEALLOC enabled: AT host -> target, ack_complete, BW req Furthermore, there are much fewer quadlet writes to the fetch agent reset register with the non-crashing devices. Almost only the 8 bystes sized write requests to the ORB pointer register happen, and 8 bytes sized status writes from the target (compared to the interspersed 10 bytes sized status writes from the crash-provoking LSI based CD-RW). A Mac mini in target mode also causes split transactions, like the crash-provoking LSI based devices but unlike all my other SBP-2 devices which I inspected so far. Another note on how to reproduce the panic: If firewire-sbp2 is blacklisted and loaded manually after plugging in a dangerous SBP-2 target, the panic does not happen (or at least not immediately after the SBP-2 login). Created attachment 15277 [details]
crash log with some debug printks
So I figured that setting up netconsole might not be such a bad idea after all.
Let's just look at the AR and AT events. ffc0 is the CD-RW, ffc4 is the local node.
firewire_sbp2: fw1.0: logged in to LUN 0000 (0 retries)
AT spd 2 tl 17, ffc4 -> ffc0, ack_pending , QW req, fffff0000210 = 000c800f
AR spd 2 tl 17, ffc0 -> ffc4, ack_complete, W resp
This is a write transaction, performed as a split transaction. I.e. it consists of write request and write response. We know to which request a response belongs by looking at the transaction label, "tl", here 0x17.
AT spd 2 tl 18, ffc4 -> ffc0, ack_pending , QW req, fffff0010004 = 00000000
AR spd 2 tl 18, ffc0 -> ffc4, ack_complete, W resp
Ditto.
AR spd 2 tl 19, ffc0 -> ffc4, ack_complete, W resp
AT spd 2 tl 19, ffc4 -> ffc0, ack_pending , BW req, fffff0010008 8,0
Again a split transaction. It's now a block write, not a quadlet write. But the more interesting thing is: The reception of the response is logged before transmission of the request. In reality, the request went over the bus before the response, if we assume for a moment that my CD-RW is not clairvoyant.
AR spd 2 tl 00, ffc0 -> ffc4, ack_complete, BW req, 000100000000 8,0
This is a unified transaction. I.e. the ACK for the request already completes the transaction, and no response is sent anymore.
scsi 3:0:0:0: CD-ROM TEAC CD-W28E 1.1A PQ: 0 ANSI: 0 CCS
AR spd 2 tl 1a, ffc0 -> ffc4, ack_complete, W resp
AT spd 2 tl 1a, ffc4 -> ffc0, ack_pending , BW req, fffff0010008 8,0
Again a split transaction with request and response logged in the wrong order.
AR spd 2 tl 01, ffc0 -> ffc4, ack_complete, BW req, 000100000000 10,0
Unified transaction.
AT spd 2 tl 1b, ffc4 -> ffc0, ack_pending , QW req, fffff0010004 = 00000000
AR spd 2 tl 1b, ffc0 -> ffc4, ack_complete, W resp
Split transaction, logged in the real order.
AT spd 2 tl 1c, ffc4 -> ffc0, ack_pending , BW req, fffff0010008 8,0
AR spd 2 tl 1c, ffc0 -> ffc4, ack_complete, W resp
Ditto.
AR spd 2 tl 00, ffc0 -> ffc4, ack_complete, BW req, 000100000000 10,0
Unified transaction.
AR spd 2 tl 1d, ffc0 -> ffc4, ack_complete, W resp
AT spd 6b6b6b6b tl 1a, 6b6b -> 6b6b, ack_pending , QR resp = 6b6b6b6b
= handle_at_packet -- calling packet->callback =
general protection fault: 0000 [1]
Boom!
This is most certainly a split transaction with request and response logged in the wrong order. We can only guess what kind of transaction this last one was.
Anyway. What apparently happened was that
fw_core_handle_response() which was called by the AR handler did some nice cleanup while the AT handler for the matching request was still pending.
Created attachment 15278 [details] invalid patch > What apparently happened was that fw_core_handle_response() which was > called by the AR handler did some nice cleanup while the AT handler for > the matching request was still pending. No. That's not it. At least the attached patch which reorders the scheduling of AT and AR tasklets does not help. Whatever happened, the one obvious thing is that the AT handler advanced into an uninitialized descriptor buffer. > Whatever happened, the one obvious thing is that the AT handler advanced
> into an uninitialized descriptor buffer.
No, the AT debug printk only accesses the packet which higher levels allocated, not fw-ohci's AT DMA descriptor buffer. Someone already freed the transaction into which this packet was embedded.
The interrupt events arrive in the expected order. But the AR response tasklet may nevertheless be executed before the AT request tasklet. Crash with interrupt event logging enabled: firewire_ohci: IRQ 00000001 AT_req firewire_ohci: IRQ 00000020 AR_resp AR spd 2 tl 08, ffc0 -> ffc4, ack_complete, W resp AT spd 6b6b6b6b tl 1a, 6b6b -> 6b6b, ack_pending , QR resp = 6b6b6b6b general protection fault: 0000 [1] One wonders what happened between comment #9 (2007-12-23) and #10 (2008-02-28) and whether the FireWire maintainer never heard of DEBUG_PAGEALLOC before. Anyway, there is a proposed patch now: http://lkml.org/lkml/2008/3/15/123 Johannes, Jarod, does this patch work for you? It fixes access to my two LSI based CD-RWs. I have yet to stress-test it with some other hardware. Forgot to add here that my one-liner in comment 12 was a red herring, that particular system's panic was apparently resolved by another patch that made sure we properly ensured dma coherency on systems with memory mapped over the 4GB mark. However, I'm still able to reproduce a panic in handle_at_packet by connecting my laptop to a mac mini in target disk mode, and will test out the patch in comment #23 here soonish... The patch in comment #23 fixes the problem for me, thanks! Indeed, no more panic hooking to the mac mini in target disk mode either. |