Bug 8735

Summary: BUG: scheduling while atomic
Product: Drivers Reporter: Gregor Jasny (gjasny)
Component: IEEE1394Assignee: Stefan Richter (stefanr)
Status: CLOSED CODE_FIX    
Severity: normal    
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: Todays linux-2.6 git pull (2.6.22+) Subsystem:
Regression: --- Bisected commit-id:
Attachments: [PATCH] firewire: fw-ohci: fix "scheduling while atomic"

Description Gregor Jasny 2007-07-11 05:20:17 UTC
Hi,

today I tried the new Juju Firewire stack on my MacMini.
The whole Linux system is located on the external fw-disk.
During boot I see the following bug. Afterward the driver works fine.

firewire_core: created new fw device fw0 (0 config rom retries, S400)
scsi6 : SBP-2 IEEE-1394
firewire_core: created new fw device fw1 (0 config rom retries, S400)
firewire_core: phy config: card 0, new root=ffc1, gap_count=5
firewire_ohci: context_stop: still active (0x00000411)
BUG: scheduling while atomic: events/0/0x00000100/7
 [<c036096d>] __sched_text_start+0x4dd/0x690
 [<c0124aa2>] __mod_timer+0x92/0xb0
 [<c03612bb>] schedule_timeout+0x4b/0xd0
 [<c0124820>] process_timeout+0x0/0x10
 [<c0124ad5>] msleep+0x15/0x20
 [<c02a4e7f>] context_stop+0x5f/0x70
 [<c02a678b>] bus_reset_tasklet+0xfb/0x1c0
 [<c01215b2>] tasklet_action+0x42/0xa0
 [<c01211f2>] __do_softirq+0x82/0x100
 [<c01212a7>] do_softirq+0x37/0x40
 [<c0105330>] do_IRQ+0x40/0x80
 [<c0105330>] do_IRQ+0x40/0x80
 [<c010342b>] common_interrupt+0x23/0x28
 [<c0362445>] _spin_unlock_irqrestore+0x5/0x10
 [<c02a5d9e>] at_context_transmit+0x2be/0x450
 [<c01249e8>] try_to_del_timer_sync+0x48/0x50
 [<c03612c2>] schedule_timeout+0x52/0xd0
 [<c0124967>] lock_timer_base+0x27/0x60
 [<c0124aa2>] __mod_timer+0x92/0xb0
 [<c02a2394>] fw_send_request+0x194/0x1f0
 [<c02a08a5>] fw_card_bm_work+0x2a5/0x320
 [<c02a7030>] complete_transaction+0x0/0x70
 [<c02a720a>] sbp2_send_orb+0xca/0xe0
 [<c02a7030>] complete_transaction+0x0/0x70
 [<c02a741e>] sbp2_send_management_orb+0xfe/0x230
 [<c02a7980>] sbp2_login+0x60/0x1d0
 [<c02a7920>] sbp2_login+0x0/0x1d0
 [<c012af33>] run_workqueue+0x73/0x100
 [<c012e640>] autoremove_wake_function+0x0/0x50
 [<c012b85c>] worker_thread+0x9c/0x100
 [<c012e640>] autoremove_wake_function+0x0/0x50
 [<c012b7c0>] worker_thread+0x0/0x100
 [<c012e362>] kthread+0x42/0x70
 [<c012e320>] kthread+0x0/0x70
 [<c010362b>] kernel_thread_helper+0x7/0x1c
 =======================
firewire_sbp2: management write failed, rcode 0x12
Initializing USB Mass Storage driver...
firewire_sbp2: logged in to sbp2 unit fw1.0 (0 retries)
firewire_sbp2:  - management_agent_address:    0xfffff0030000
firewire_sbp2:  - command_block_agent_address: 0xfffff0100000
firewire_sbp2:  - status write address:        0x000100000000
scsi 6:0:0:0: Direct-Access-RBC SAMSUNG  SP2514N          VF10 PQ: 0 ANSI: 4

The kernel version is todays linux-2.6 pull.

If you need more information, please contact me.

Thanks,
Gregor
Comment 1 Stefan Richter 2007-07-12 13:33:19 UTC
Created attachment 12017 [details]
[PATCH] firewire: fw-ohci: fix "scheduling while atomic"

Does this work for you?

Also, is it a PPC Mini or Intel Mini?  (I have an Intel Mini but didn't encounter the bug yet.)
Comment 2 Gregor Jasny 2007-07-13 07:56:44 UTC
Hi Stefan,

It's an early Core Duo Mini (T2300@1.66GHz).

I've booted several times without any BUG/OOPS. Can you please push the patch into the stable series?

During operation my sytem froze for about 5 seconds and logged the following
lines to syslog:
Jul 11 15:31:11 Mini kernel: firewire_sbp2: status write for unknown orb
Jul 11 15:31:41 Mini kernel: firewire_sbp2: sbp2_scsi_abort

I've never observed this behaviour with the old stack. Is there anything I can do to debug this?

Thanks,
Gregor
Comment 3 Stefan Richter 2007-07-13 11:54:02 UTC
I'm going to send the patch to Linus soon, then to -stable.

----

I've got a 1.66 GHz Core Duo Mini too, but from after the product revision when they discontinued the Core Solo type and added the 1.83 GHz type.  From lspci:

00:00.0 Host bridge: Intel Corporation Mobile 945GM/PM/GMS, 943/940GML and 945GT Express Memory Controller Hub (rev 03)
...
03:03.0 FireWire (IEEE 1394): Agere Systems FW323 (rev 61)

I guess it looks the same on your Mini.  But I replaced the Core Duo by a Core 2 Duo and am running an x86-64 build on it.  It's not my primary machine, hence not extensively tested.

----

About the "status write for unknown orb" ...pause... "sbp2_scsi_abort":  This is known to happen sometimes but the reason is unclear.  Perhaps there was a bus reset, which may happen sporadically if the bus is electrically unstable.  There might be hidden bugs in bus reset handling.  Was there a line like "phy config: card 2, new root=ffc1, gap_count=5" near the "...unknown orb" message?  This message is only logged for some bus resets though, not for all.

Alas I don't get those "...unkown orb" messages with my own setups, or at least didn't notice it so far.  If I get similar errors at all, then early during device discovery and usually ending in total inability to start using the disk.

The "sbp2_scsi_abort" later is caused by the Linux SCSI stack calling into sbp2 to abort the command after timeout.  It's because sbp2 was previously unable to finish the command properly because it didn't recognize the status write from the device for what it was.

The fact that your system froze is because you have the root filesystem on that disk, and there is no IO going on until the timeout and sbp2_scsi_abort.

This may take a while to debug, especially as I'm currently busy with projects entirely unrelated to Linux.  I don't know how if Kristian Høgsberg is available at the moment.

Please open a new bug for this and point to your comment #2 here as description. But also make a note what brand/model of SBP-2 enclosure this is, and if possible which SBP-2 bridge chip is in there.
Comment 4 Stefan Richter 2007-07-19 00:41:05 UTC
Fix was committed to Linus' tree and sent to the -stable team.

On "status write for unknown orb":  I saw this once during heavy traffic on an otherwise perfectly working setup, but without subsequent I/O error.  I.e. I'm able to reproduce it at least partially on rare occasions.