Bug 16263
Summary: | ath5k - failed to initialize device | ||
---|---|---|---|
Product: | Drivers | Reporter: | NiTr0 (nitr0) |
Component: | PCI | Assignee: | Bjorn Helgaas (bjorn) |
Status: | RESOLVED CODE_FIX | ||
Severity: | normal | CC: | akpm, bjorn.helgaas, bjorn, jbarnes, linville, me, rui.zhang |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 2.6.34 | Subsystem: | |
Regression: | Yes | Bisected commit-id: | |
Bug Depends on: | |||
Bug Blocks: | 56331 | ||
Attachments: |
dmesg output of 2.6.34 kernel
acpi dump + dmesg output Everest report under WinPE dmesg log test patch to use BIOS assignment as a last resort dmesg output of patched kernel updated patch to fall back to BIOS assignments new dmesg from patched kernel /proc/iomem |
That is a big gap, and not a lot of information about what failed. Are you in a position to do some builds driven by git bisect? Yes, I can build & run anything that you need for obtaining information about trouble. OK, please do a bisection of the linux-2.6 tree between 2.6.32 and 2.6.34: git bisect start -- drivers/net/wireless/ath git bisect bad v2.6.34 git bisect good v2.6.32 And, of course, marking each build as good or bad after rebooting and testing. You don't have to limit the path, but with such a large gap that might be helpful in narrowing down the error more quickly. What does that list as the first bad commit? Hm... At second bisect step of linux/kernel/git/torvalds/linux-2.6.git (ath5k: fix setup for CAB queue) I've got kernel panic after running hostapd - but device looks to be initialized and started in ad-hoc successfully. I mark it as good (because searched bug isn't present in this revision) and continue to building git bisects, right? Sorry for that lame questions - I'm novice in large projects debugging/development. I would say mark anything good as long as modprobe doesn't result in the aforementioned PCI iomap error. The particular start up sequence hasn't changed since 2007 so my first suspicion would lie in the platform. Another thing that might be interesting is contents of /proc/iomem in working and non-working kernels. Just to clarify above -- if you have a build error, don't mark it good - you can use 'git bisect skip' in that case. But if you have a building kernel that doesn't have the iomap error, then you can mark it good regardless of other things. Thanks for info. I'm continue trying to find broken commit. Here is 'bad' /proc/iomem (2.6.34-stable): 00000000-0000ffff : reserved 00010000-0009f3ff : System RAM 0009f400-0009ffff : reserved 000f0000-000fffff : reserved 00100000-3bedffff : System RAM 01000000-013ed83e : Kernel code 013ed83f-018b98bf : Kernel data 019a0000-01a9d863 : Kernel bss 3bee0000-3bee2fff : ACPI Non-volatile Storage 3bee3000-3beeffff : ACPI Tables 3bef0000-3befffff : reserved 3bf00000-3bffffff : RAM buffer 3c000000-3fffffff : reserved 3c000000-3fffffff : pnp 00:0e f0000000-f3ffffff : PCI MMCONFIG 0000 [bus 00-3f] f0000000-f3ffffff : reserved f0000000-f3ffffff : pnp 00:0d fec00000-ffffffff : reserved fec00000-fec003ff : IOAPIC 0 fec80000-fecbffff : PCI Bus 0000:00 fec80000-fec9ffff : 0000:00:0d.0 feca0000-feca3fff : 0000:00:05.0 feca4000-feca4fff : 0000:00:02.0 feca4000-feca4fff : ohci_hcd feca5000-feca5fff : 0000:00:07.0 feca5000-feca5fff : forcedeth feca6000-feca6fff : 0000:00:08.0 feca6000-feca6fff : sata_nv feca7000-feca7fff : 0000:00:08.1 feca7000-feca7fff : sata_nv feca8000-feca80ff : 0000:00:02.1 feca8000-feca80ff : ehci_hcd feca9000-feca9fff : 0000:01:06.0 feca9000-feca9fff : e100 fecb0000-fecbffff : 0000:01:06.0 fee00000-feefffff : pnp 00:0e fee00000-fee00fff : Local APIC feff0000-feff03ff : HPET 0 feff0000-feff00ff : pnp 00:0e fefff000-feffffff : pnp 00:0e fff80000-fff80fff : pnp 00:0e fff90000-fffbffff : pnp 00:0e fffed000-fffeffff : pnp 00:0e ffff0000-ffffffff : pnp 00:0e 'Good' /proc/iomem (2.6.33-rc5-00982-gf6f2230): 00000000-0000ffff : reserved 00010000-0009f3ff : System RAM 0009f400-0009ffff : reserved 000f0000-000fffff : reserved 00100000-3bedffff : System RAM 01000000-0142fdba : Kernel code 0142fdbb-01aba7ff : Kernel data 01ba9000-01cae273 : Kernel bss 3bee0000-3bee2fff : ACPI Non-volatile Storage 3bee3000-3beeffff : ACPI Tables 3bef0000-3befffff : reserved 3bf00000-3bffffff : RAM buffer 3c000000-3fffffff : reserved 3c000000-3fffffff : pnp 00:0e 40000000-4001ffff : 0000:00:0d.0 e0000000-efffffff : 0000:00:0d.0 e0000000-efffffff : nvidiafb f0000000-f3ffffff : PCI MMCONFIG 0000 [bus 00-3f] f0000000-f3ffffff : reserved f0000000-f3ffffff : pnp 00:0d fb000000-fbffffff : 0000:00:0d.0 fb000000-fbffffff : nvidiafb fc000000-fcffffff : 0000:00:0d.0 fc000000-fcffffff : nvidiafb fd800000-fd8fffff : PCI Bus 0000:01 fd8a0000-fd8bffff : 0000:01:06.0 fd8a0000-fd8bffff : e100 fd8e0000-fd8effff : 0000:01:08.0 fd8e0000-fd8effff : ath5k fd8ff000-fd8fffff : 0000:01:06.0 fd8ff000-fd8fffff : e100 fd900000-fd9fffff : PCI Bus 0000:04 fda00000-fdafffff : PCI Bus 0000:04 fdb00000-fdbfffff : PCI Bus 0000:03 fdc00000-fdcfffff : PCI Bus 0000:03 fdd00000-fddfffff : PCI Bus 0000:02 fde00000-fdefffff : PCI Bus 0000:02 fdf00000-fdffffff : PCI Bus 0000:01 fdf00000-fdf0ffff : 0000:01:06.0 fe024000-fe027fff : 0000:00:05.0 fe02b000-fe02bfff : 0000:00:08.1 fe02b000-fe02bfff : sata_nv fe02c000-fe02cfff : 0000:00:08.0 fe02c000-fe02cfff : sata_nv fe02d000-fe02dfff : 0000:00:07.0 fe02d000-fe02dfff : forcedeth fe02e000-fe02e0ff : 0000:00:02.1 fe02e000-fe02e0ff : ehci_hcd fe02f000-fe02ffff : 0000:00:02.0 fe02f000-fe02ffff : ohci_hcd fec00000-ffffffff : reserved fec00000-fec003ff : IOAPIC 0 fee00000-feefffff : pnp 00:0e fee00000-fee00fff : Local APIC feff0000-feff03ff : HPET 0 feff0000-feff00ff : pnp 00:0e fefff000-feffffff : pnp 00:0e fff80000-fff80fff : pnp 00:0e fff90000-fffbffff : pnp 00:0e fffed000-fffeffff : pnp 00:0e ffff0000-ffffffff : pnp 00:0e I've finished bisecting kernel - it looks that trouble is really somewhere in PCI/PNP code, and ath5k trouble with memory resources was involved by it. Here is log: git bisect start '--' 'drivers/net/wireless/ath' # bad: [e40152ee1e1c7a63f4777791863215e3faa37a86] Linus 2.6.34 git bisect bad e40152ee1e1c7a63f4777791863215e3faa37a86 # good: [22763c5cf3690a681551162c15d34d935308c8d7] Linux 2.6.32 git bisect good 22763c5cf3690a681551162c15d34d935308c8d7 # good: [c37919bfe0a5c1bee9a31701a31e05a2f8840936] ath9k_hw: Fix AR_GPIO_INPUT_EN_VAL_BT_PRIORITY_BB and its shift value in 0x4054 git bisect good c37919bfe0a5c1bee9a31701a31e05a2f8840936 # good: [a951ae2176b982574ffa197455db6c89359fd5eb] ath5k: fix setup for CAB queue git bisect good a951ae2176b982574ffa197455db6c89359fd5eb # good: [f6f223039c0d0683bdea1eabd35b309e10311a60] Merge branch 'master' of ssh://master.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next-2.6 git bisect good f6f223039c0d0683bdea1eabd35b309e10311a60 # bad: [a3b980fd1391e75068ae25f3817728b27bfdb04c] ath5k: fix TSF reset git bisect bad a3b980fd1391e75068ae25f3817728b27bfdb04c # good: [19bc291c99f018bd4f2c38bbf69144086dca903f] Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next-2.6 git bisect good 19bc291c99f018bd4f2c38bbf69144086dca903f # bad: [a9f042cbe5284f34ccff15f3084477e11b39b17b] ath9k: fix lockdep warning when unloading module git bisect bad a9f042cbe5284f34ccff15f3084477e11b39b17b # good: [7bfbae10dc10a5c94a780d117a57e875d77e8e5a] ath9k: disable RIFS search for AR91xx based chips git bisect good 7bfbae10dc10a5c94a780d117a57e875d77e8e5a How can I find what commit in kernel tree breaks ath5k on my system? I must do 'git bisect reset', then 'git bisect start' (without path), and mark bad & good commits - 'git bisect bad a9f042cbe5284f34ccff15f3084477e11b39b17b', 'git bisect good 7bfbae10dc10a5c94a780d117a57e875d77e8e5a' ? (In reply to comment #8) > > How can I find what commit in kernel tree breaks ath5k on my system? I must > do > 'git bisect reset', then 'git bisect start' (without path), and mark bad & > good > commits - 'git bisect bad a9f042cbe5284f34ccff15f3084477e11b39b17b', 'git > bisect good 7bfbae10dc10a5c94a780d117a57e875d77e8e5a' ? I think that's right. But it looks like that bisection simply hasn't finished yet - keep going? What makes you believe that PnP/PCI changes caused this? > I think that's right. But it looks like that bisection simply hasn't > finished > yet - keep going? Hm, yes, you're right, I ewas confusing by string '0 bisects remaining' after marking last commit. But I'm really missed last step. I'll run build now. > What makes you believe that PnP/PCI changes caused this? Different /proc/iomem; also in dmesg nvidiafb.ko also prints warnings about memory resources. And in that ath5k commits range it looks like there is no commits that can break something. OK, thanks. I cc'ed Jesse and Bjorn - maybe it's something they already knew about. Wow. This looks like a spectacular train wreck from my change that automatically turns on "pci=use_crs" for recent boxes (7bc5e3f2be32a). I reassigned it to myself, but I don't see how to change the category. If you boot with "pci=nocrs", it should work around it for now, while we figure out the real fix. The problem is that we found this: ACPI: PCI Root Bridge [PCI0] (0000:00) pci_root PNP0A08:00: host bridge window [mem 0xfec80000-0xfecbffff] pci 0000:00:04.0: PCI bridge to [bus 01-01] (subtractive decode) pci 0000:00:04.0: bridge window [mem 0xfd800000-0xfd8fffff] pci 0000:01:08.0: reg 10: [mem 0xfd8e0000-0xfd8effff] The 01:08.0 BAR is inside the 00:04.0 bridge window, so that's fine. However, the bridge window itself is outside the host bridge window, so we tried to move things around and of course failed, because we only know about the one tiny window through the host bridge. The next question is how Windows works. Is there any chance you have Windows on this box as well and can collect the corresponding bridge window and BAR information from the Device Manager, System Information, the "SIV" tool or similar? Failing that, please give as many details about your box as possible, and maybe we can find somebody else who does have Windows on a similar box. The patch here: https://bugzilla.kernel.org/show_bug.cgi?id=15533#c5 will dump more details about the _CRS information we get from the BIOS. It's possible there's something there that Linux just doesn't know how to handle yet. Would you mind trying that patch (with the parameters mentioned in that comment) and attaching the dmesg log? It's OK to boot with "pci=nocrs" so your system works.) Finally, please collect the DSDT (directions here: http://kernel.org/pub/linux/kernel/people/helgaas/debug) and attach it here. Maybe the BIOS marks that host bridge as subtractive decode. I haven't seen that yet, but I think it might be possible, and I don't think Linux would deal well with it. I tried to run Windows PE on box - it looks like it can't run PCI bridges because it can't see resources. Log of Everest report is attached (SIV doesn't run - one dll was missed). Also ACPI dump and dmesg of patched kernel is attached too. With "pci=nocrs" system runs OK. Created attachment 26918 [details]
acpi dump + dmesg output
Created attachment 26919 [details]
Everest report under WinPE
Thanks, this is great info, and I"ll look in more detail next week. But I don't see the dmesg log in comment #14; did that get missed? Created attachment 26920 [details]
dmesg log
Oops, I missed to add it into archive.
Now I attached it.
re-assign to Bjorn Here's what I see from the dmesg and the Everest report. The BAR values printed by Linux are before any modification, so WinPE should start with the same assignments. The "WinPE:" comments at the right are from the Everest report. It appears that WinPE moves devices into the PNP0A08 window until that window is full. The remaining devices are left at the addresses set by BIOS. Linux does the same thing, EXCEPT that if there are devices that don't fit in the PNP0A08 window, we disable them instead of trying the addresses from BIOS. ACPI: PCI Root Bridge [PCI0] (0000:00) pci_root PNP0A08:00: host bridge window [mem 0xfec80000-0xfecbffff] pci 00:02.0: reg 10: [mem 0xfe02f000-0xfe02ffff] WinPE: 0xfecbf000-0xfecbffff pci 00:02.1: reg 10: [mem 0xfe02e000-0xfe02e0ff] WinPE: 0xfecbef00-0xfecbefff pci 00:05.0: reg 10: [mem 0xfe024000-0xfe027fff] pci 00:07.0: reg 10: [mem 0xfe02d000-0xfe02dfff] WinPE: 0xfecbb000-0xfecbbfff pci 00:08.0: reg 24: [mem 0xfe02c000-0xfe02cfff] WinPE: 0xfecbd000-0xfecbdfff pci 00:08.1: reg 24: [mem 0xfe02b000-0xfe02bfff] WinPE: 0xfecbc000-0xfecbcfff pci 00:0d.0: reg 10: [mem 0xfb000000-0xfbffffff] pci 00:0d.0: reg 14: [mem 0xe0000000-0xefffffff] pci 00:0d.0: reg 1c: [mem 0xfc000000-0xfcffffff] pci 00:0d.0: reg 30: [mem 0x00000000-0x0001ffff] pci 01:06.0: reg 10: [mem 0xfd8ff000-0xfd8fffff] WinPE: 0xfecba000-0xfecbafff pci 01:06.0: reg 18: [mem 0xfd8a0000-0xfd8bffff] WinPE: 0xfec80000-0xfec9ffff pci 01:06.0: reg 30: [mem 0x00000000-0x0000ffff] pci 01:08.0: reg 10: [mem 0xfd8e0000-0xfd8effff] ** Linux disables, WinPE leaves alone pci 00:04.0: PCI bridge to [bus 01-01] (subtractive decode) pci 00:04.0: bridge window [mem 0xfd800000-0xfd8fffff] WinPE: closed? pci 00:04.0: bridge window [mem 0xfdf00000-0xfdffffff] WinPE: closed? pci 00:09.0: PCI bridge to [bus 02-02] pci 00:09.0: bridge window [mem 0xfde00000-0xfdefffff] pci 00:09.0: bridge window [mem 0xfdd00000-0xfddfffff] pci 00:0b.0: PCI bridge to [bus 03-03] pci 00:0b.0: bridge window [mem 0xfdc00000-0xfdcfffff] pci 00:0b.0: bridge window [mem 0xfdb00000-0xfdbfffff] pci 00:0c.0: PCI bridge to [bus 04-04] pci 00:0c.0: bridge window [mem 0xfda00000-0xfdafffff] pci 00:0c.0: bridge window [mem 0xfd900000-0xfd9fffff] I'll work up a patch to make Linux use the existing BAR values instead of disabling the device, but it's going to take me a while because I'm catching up from being on vacation. Created attachment 27093 [details]
test patch to use BIOS assignment as a last resort
Here's a test patch to make Linux behave more like WinPE in this situation. If you could try it and attach the dmesg log, that would be great! If you make the attachment "text/plain", it's easier to read it in a browser.
Created attachment 27101 [details]
dmesg output of patched kernel
Here is dmesg output. There are error with nvidiafb, but WiFi card seems to be working - at least it's initialized as device.
Here's the call trace outline showing that we discard the BIOS BAR addresses long before we try to reassign unassigned resources: pci_subsys_init (subsys_initcall) pcibios_init pcibios_resource_survey pcibios_allocate_resources pci_claim_resource "no compatible bridge window ..." <** here's where we throw away the BIOS address **> <reserve motherboard resources> pcibios_assign_resources (fs_initcall, after motherboard resources) pci_assign_unassigned_resources pci_bus_assign_resources __pci_bus_assign_resources pbus_assign_resources_sorted __assign_resources_sorted pci_assign_resource "can't assign ..." <** here's where we want the BIOS address again **> The separation between the pci_claim_resource() failure and pci_assign_resource() is the reason we have to store the BIOS BAR address somewhere. Created attachment 27105 [details]
updated patch to fall back to BIOS assignments
Updated patch. I made an off-by-one error that caused the nvidia issues. Sorry, I should have caught this in my testing. Can you try this one?
Created attachment 27107 [details]
new dmesg from patched kernel
Now all looks OK.
Created attachment 27108 [details]
/proc/iomem
Should be resolved by this upstream commit, which should appear in 2.6.35-rc6: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=58c84eda07560a6b75b03e8d3b26d6eddfc14011 This commit appeared in 2.6.35, so this should be resolved. |
Created attachment 26876 [details] dmesg output of 2.6.34 kernel After upgrade from 2.6.32.7 to 2.6.34 (configured by 'make oldconfig' with 2.6.32.7 .config as base - so almost all settings are same) I have broken wifi in my Linux box. Here is part of dmesg: ACPI: PCI Interrupt Link [APC1] enabled at IRQ 16 alloc irq_desc for 16 on node 0 alloc kstat_irqs on node 0 ath5k 0000:01:08.0: PCI INT A -> Link[APC1] -> GSI 16 (level, low) -> IRQ 16 ath5k 0000:01:08.0: cannot remap PCI memory region ath5k 0000:01:08.0: PCI INT A disabled ath5k: probe of 0000:01:08.0 failed with error -5 Wireless card - D-Link DWL-520, system - Athlon II @ MCP61 board; output of lspci for wireless card: 01:08.0 Ethernet controller: Atheros Communications Inc. Atheros AR5001X+ Wireless Network Adapter (rev 01) Subsystem: D-Link System Inc Device 3a73 Flags: medium devsel, IRQ 16 Memory at <ignored> (32-bit, non-prefetchable) Capabilities: [44] Power Management version 2 Kernel modules: ath5k dmesg output is attached. P.S. If this is caused by broken PCI management driver - move this bugreport to proper category.