Bug 9219 - b43 crash at PCI initialization
Summary: b43 crash at PCI initialization
Status: CLOSED CODE_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: network-wireless
Hardware: All Linux
Importance: P1 normal
Assignee: Michael Buesch
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2007-10-24 12:41 UTC by Christian Casteyde
Modified: 2007-10-27 08:19 UTC

See Also:
Kernel Version: 2.6.24-rc1
Subsystem:
Regression: ---
Bisected commit-id:


Attachments
Screenshot of the crash at boot. (259.78 KB, image/jpeg)
2007-10-24 12:48 UTC, Christian Casteyde

Description Christian Casteyde 2007-10-24 12:41:48 UTC
Most recent kernel where this bug did not occur: 2.6.24-rc1
This is not a regression, since it's a new driver (b43) crash.
I'm trying to boot with the old driver (bcm43xx) to check.
Distribution: Bluewhite64
Hardware Environment: AMD Athlon 64, Broadcom 4306rev2 Wifi chipset
Software Environment: Linux
Problem Description:
The laptop crashes at boot with the b43 driver (more precisely, its dependency on ssb) as soon as the hardware is seen, apparently in PCI initialization.

Steps to reproduce: Compile 2.6.24+b43/b43legacy, boot on specific hardware.

The crash call stack is attached as a screen photo (I'm too lazy to copy it).
Comment 1 Christian Casteyde 2007-10-24 12:48:12 UTC
Created attachment 13264
Screenshot of the crash at boot.
Comment 2 Christian Casteyde 2007-10-24 13:06:47 UTC
Looking at the call stack, this seems to be a "Sonics Silicon Backplane" problem (CONFIG_SSB), which is needed by both b43 drivers.
Old bcm43xx driver works as in previous kernels (with locking problems from softmac, but usable).
Comment 3 Christian Casteyde 2007-10-24 13:23:52 UTC
Oops, sorry for the first line:
Most recent kernel where this bug did not occur: N/A
Comment 4 Michael Buesch 2007-10-24 13:49:14 UTC
I don't see why this crashes.
I'm sorry. Please try to gather more information.
Comment 5 Christian Casteyde 2007-10-24 15:08:47 UTC
Maybe an x86/x86_64 merge problem in the PCI code?
Looking at the code, dev must be NULL for this to crash.
But that alone cannot set RIP to NULL, so there must be an f_ops called through a NULL pointer somewhere in an inlined function called from pcibios_enable_device.

I can add printks (I would see it just before the log) to check some things. Apart from dev NULLiness, do you see anything that could be interesting to trace?
Comment 6 Christian Casteyde 2007-10-24 15:11:10 UTC
I'll also try to load it as a module tomorrow. If it succeeds, it may be that the device initialization comes too early in the PCI init code. Otherwise, I may be in a better position to look inside the system.
Comment 7 Michael Buesch 2007-10-24 15:15:38 UTC
Yes, please try as module.
I never saw this problem, but I'm always compiling it as module.
Comment 8 Christian Casteyde 2007-10-25 11:35:12 UTC
The results:
As a module, the driver loads. So this is a static-only crash, so probably some module initialization problem.
If you can't reproduce it with the driver built in (which is possible now that it has been merged into mainline), I can run more tests for you if you want. I'm waiting for your answer before adding printks to all the driver functions.

However, I've got another crash when removing the b43 module (ssb was still there). Say 10-20s after, a kernel thread crashed with the following:


Oct 25 20:16:47 athor kernel: b43-phy0 ERROR: Firmware file "b43/ucode5.fw" not found or load failed.
Oct 25 20:16:47 athor kernel: b43-phy0 ERROR: You must go to http://linuxwireless.org/en/users/Drivers/bcm43xx#devicefirmware and download the correct firmware (version 4).
<!-- OK, I unload the module, to reload it after moving some firmware file -->
<!-- then : -->
Oct 25 20:17:11 athor kernel: Unable to handle kernel paging request at ffffffff880243df RIP:
Oct 25 20:17:11 athor kernel:  [<ffffffff8036da39>] strcmp+0x9/0x20
Oct 25 20:17:11 athor kernel: PGD 203067 PUD 207063 PMD 56d9067 PTE 0
Oct 25 20:17:11 athor kernel: Oops: 0000 [1] PREEMPT
Oct 25 20:17:11 athor kernel: CPU 0
Oct 25 20:17:11 athor kernel: Modules linked in:
Oct 25 20:17:11 athor kernel: Pid: 5, comm: events/0 Not tainted 2.6.24-rc1 #6
Oct 25 20:17:11 athor kernel: RIP: 0010:[<ffffffff8036da39>]  [<ffffffff8036da39>] strcmp+0x9/0x20
Oct 25 20:17:11 athor kernel: RSP: 0018:ffff810002877d70  EFLAGS: 00010082
Oct 25 20:17:11 athor kernel: RAX: ffffffff807875e0 RBX: ffffffff808a6640 RCX: 7800000000000000
Oct 25 20:17:11 athor kernel: RDX: 0000000000000000 RSI: ffffffff806fab3a RDI: ffffffff880243df
Oct 25 20:17:11 athor kernel: RBP: ffff810002877d70 R08: ffffffff807875e0 R09: 0000000000000000
Oct 25 20:17:11 athor kernel: R10: ffffffff80246723 R11: 0000000000000001 R12: ffffffff808a78a0
Oct 25 20:17:11 athor kernel: R13: ffffffff808a67a0 R14: 0000000000000000 R15: ffffffff806fab3a
Oct 25 20:17:11 athor kernel: FS:  00002adfd2135d30(0000) GS:ffffffff80790000(0000) knlGS:0000000000000000
Oct 25 20:17:11 athor kernel: CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
Oct 25 20:17:11 athor kernel: CR2: ffffffff880243df CR3: 000000000539b000 CR4: 00000000000006e0
Oct 25 20:17:11 athor kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Oct 25 20:17:11 athor kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Oct 25 20:17:11 athor kernel: Process events/0 (pid: 5, threadinfo ffff810002876000, task ffff810002874000)
Oct 25 20:17:11 athor kernel: Stack:  ffff810002877db0 ffffffff802551f9 0000000000000000 0000000000029ac0
Oct 25 20:17:11 athor kernel:  ffffffff808a78a0 ffff810002877e60 ffffffff80926020 ffff810002874000
Oct 25 20:17:11 athor kernel:  ffff810002877e20 ffffffff80258226 0000000200000000 0000000000000000
Oct 25 20:17:11 athor kernel: Call Trace:
Oct 25 20:17:11 athor kernel:  [<ffffffff802551f9>] count_matching_names+0x59/0xc0
Oct 25 20:17:11 athor kernel:  [<ffffffff80258226>] __lock_acquire+0x5b6/0x1080
Oct 25 20:17:11 athor kernel:  [<ffffffff805bb5cb>] _spin_unlock_irq+0x2b/0x60
Oct 25 20:17:11 athor kernel:  [<ffffffff80527c80>] rt_check_expire+0x0/0x160
Oct 25 20:17:11 athor kernel:  [<ffffffff80258d47>] lock_acquire+0x57/0x80
Oct 25 20:17:11 athor kernel:  [<ffffffff80246723>] run_workqueue+0x103/0x230
Oct 25 20:17:11 athor kernel:  [<ffffffff80246767>] run_workqueue+0x147/0x230
Oct 25 20:17:11 athor kernel:  [<ffffffff8024733a>] worker_thread+0xca/0x130
Oct 25 20:17:11 athor kernel:  [<ffffffff8024b240>] autoremove_wake_function+0x0/0x40
Oct 25 20:17:11 athor kernel:  [<ffffffff80247270>] worker_thread+0x0/0x130
Oct 25 20:17:11 athor kernel:  [<ffffffff8024ae7d>] kthread+0x4d/0x80
Oct 25 20:17:11 athor kernel:  [<ffffffff8020c608>] child_rip+0xa/0x12
Oct 25 20:17:11 athor kernel:  [<ffffffff8020c1c3>] restore_args+0x0/0x30
Oct 25 20:17:11 athor kernel:  [<ffffffff8024af82>] kthreadd+0xd2/0x150
Oct 25 20:17:11 athor kernel:  [<ffffffff8024ae30>] kthread+0x0/0x80
Oct 25 20:17:11 athor kernel:  [<ffffffff8020c5fe>] child_rip+0x0/0x12
Oct 25 20:17:11 athor kernel:
Oct 25 20:17:11 athor kernel:
Oct 25 20:17:11 athor kernel: Code: 0f b6 17 89 d0 2a 06 48 ff c6 84 c0 75 04 84 d2 75 eb c9 0f
Oct 25 20:17:11 athor kernel: RIP  [<ffffffff8036da39>] strcmp+0x9/0x20
Oct 25 20:17:11 athor kernel:  RSP <ffff810002877d70>
Oct 25 20:17:11 athor kernel: CR2: ffffffff880243df
<!--here I rebooted-->

Sorry, the call stack is nearly useless. However, it is in lockdep code, where a string is not valid, so to me this is clearly a spinlock, lock, or similar object that is not properly initialized by the driver, or not freed at module unload. It looks like another init/exit problem, which may be related and worth inspecting, I think.
Comment 9 Michael Buesch 2007-10-25 13:45:14 UTC
Crap initcalls dependency. I have no idea how to properly fix that.
Initcalls are crap.
Comment 10 Michael Buesch 2007-10-25 14:10:21 UTC
I'll try to explain the problem with the initcall a bit:

ssb init must be subsys_initcall, as it has to init before the drivers (b43, b44...) are initialized. But when ssb is on a PCIbus host device, it also needs to init _after_ PCI. So the problem is that some PCI code also seems to use subsys_initcall and so it depends on link order if pci or ssb is initialized first. If ssb is initialized before pci it goes boom.

So which initcall should we select for ssb?

#define core_initcall(fn)               __define_initcall("1",fn,1)
#define core_initcall_sync(fn)          __define_initcall("1s",fn,1s)
#define postcore_initcall(fn)           __define_initcall("2",fn,2)
#define postcore_initcall_sync(fn)      __define_initcall("2s",fn,2s)
#define arch_initcall(fn)               __define_initcall("3",fn,3)
#define arch_initcall_sync(fn)          __define_initcall("3s",fn,3s)
#define subsys_initcall(fn)             __define_initcall("4",fn,4)
#define subsys_initcall_sync(fn)        __define_initcall("4s",fn,4s)
#define fs_initcall(fn)                 __define_initcall("5",fn,5)
#define fs_initcall_sync(fn)            __define_initcall("5s",fn,5s)
#define rootfs_initcall(fn)             __define_initcall("rootfs",fn,rootfs)
#define device_initcall(fn)             __define_initcall("6",fn,6)
#define device_initcall_sync(fn)        __define_initcall("6s",fn,6s)
#define late_initcall(fn)               __define_initcall("7",fn,7)
#define late_initcall_sync(fn)          __define_initcall("7s",fn,7s)

The driver is device_initcall, of course.
Well, we could probably work around that by selecting either fs_initcall or rootfs_initcall, but that seems to be a bad hack to me.
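
To make the ordering argument concrete, here is a minimal userspace sketch of it. This is not kernel code: the enum values, the struct, and the helper names (order_initcalls, position_of, boot_order_ok) are all made up for illustration, and the stable sort only stands in for "levels run in order, and calls on the same level run in link order". It assumes PCI sits at the subsys level and b43 at the device level, as described above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified initcall levels (hypothetical values; only the relative order matters). */
enum initcall_level {
    LEVEL_SUBSYS,
    LEVEL_FS,
    LEVEL_ROOTFS,
    LEVEL_DEVICE
};

struct initcall {
    const char *name;
    enum initcall_level level;
};

/* Stable insertion sort: calls on the same level keep their link order. */
static void order_initcalls(struct initcall *calls, size_t n)
{
    for (size_t i = 1; i < n; i++) {
        struct initcall cur = calls[i];
        size_t j = i;
        while (j > 0 && calls[j - 1].level > cur.level) {
            calls[j] = calls[j - 1];
            j--;
        }
        calls[j] = cur;
    }
}

/* Index of the named call in the ordered table, or n if it is absent. */
static size_t position_of(const struct initcall *calls, size_t n, const char *name)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(calls[i].name, name) == 0)
            return i;
    return n;
}

/*
 * Returns 1 if the resulting boot order is pci -> ssb -> b43, else 0.
 * ssb_linked_first models a link order that places ssb ahead of pci.
 */
static int boot_order_ok(enum initcall_level ssb_level, int ssb_linked_first)
{
    struct initcall calls[3];
    size_t n = 0;

    if (ssb_linked_first)
        calls[n++] = (struct initcall){ "ssb", ssb_level };
    calls[n++] = (struct initcall){ "pci", LEVEL_SUBSYS };
    if (!ssb_linked_first)
        calls[n++] = (struct initcall){ "ssb", ssb_level };
    calls[n++] = (struct initcall){ "b43", LEVEL_DEVICE };
    assert(n == 3);

    order_initcalls(calls, n);

    size_t pci = position_of(calls, n, "pci");
    size_t ssb = position_of(calls, n, "ssb");
    size_t b43 = position_of(calls, n, "b43");
    return pci < ssb && ssb < b43;
}
```

In this model, boot_order_ok(LEVEL_SUBSYS, 1) fails because ssb and pci share a level and link order decides, while LEVEL_FS or LEVEL_ROOTFS gives the required order regardless of link order.
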

Anyway, could you check if the crash on modinit goes away, if you use rootfs_initcall in drivers/ssb/main.c?

The other crash on rmmod seems to be some leaked resource (some workqueue that has not been properly terminated) on this firmware error condition. Some buggy error path.
Comment 11 Christian Casteyde 2007-10-26 10:59:31 UTC
It does not crash with rootfs_initcall attribute in ssb/main.c.
The diagnosis is right.
Comment 12 Michael Buesch 2007-10-26 13:36:06 UTC
Ok, well. Thanks for testing. So I think I will submit a patch and add a comment why we use rootfs_initcall here. It's a bad solution, but I don't see a better one.

I'll look into the rmmod crash later.
Comment 13 Christian Casteyde 2007-10-27 08:11:49 UTC
For the record, I got another workqueue crash just after issuing:
ifconfig wlan0 down
...
iwconfig ...
(the crash happened between the two commands, and it panicked when I insisted).
This may be the same as the modprobe -r crash; I guess my old test with a loop of ifconfig up/down could trigger it.
Should I open another bug report for this second error? I suppose you are aware of it anyway...
Oh, by the way, I was trying to connect to a Zydas USB key in ad-hoc mode, and I didn't manage to. Is ad-hoc mode supported by the b43 driver? (I never managed to connect with bcm43xx either, but I could with a Ralink, so this is b* dependent.)
Comment 14 Michael Buesch 2007-10-27 08:19:53 UTC
Please open an entry for each bug. This is getting confusing.
