Bug 16263 - ath5k - failed to initialize device
Summary: ath5k - failed to initialize device
Status: RESOLVED CODE_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: PCI (show other bugs)
Hardware: All Linux
: P1 normal
Assignee: Bjorn Helgaas
URL:
Keywords:
Depends on:
Blocks: 56331
  Show dependency tree
 
Reported: 2010-06-21 09:51 UTC by NiTr0
Modified: 2013-04-09 06:23 UTC (History)
7 users (show)

See Also:
Kernel Version: 2.6.34
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
dmesg output of 2.6.34 kernel (11.92 KB, application/octet-stream)
2010-06-21 09:51 UTC, NiTr0
Details
acpi dump + dmesg output (47.25 KB, application/x-compressed)
2010-06-23 21:14 UTC, NiTr0
Details
Everest report under WinPE (175.88 KB, text/plain)
2010-06-23 21:15 UTC, NiTr0
Details
dmesg log (38.98 KB, application/octet-stream)
2010-06-23 21:40 UTC, NiTr0
Details
test patch to use BIOS assignment as a last resort (4.15 KB, patch)
2010-07-13 23:07 UTC, Bjorn Helgaas
Details | Diff
dmesg output of patched kernel (47.82 KB, text/plain)
2010-07-14 13:03 UTC, NiTr0
Details
updated patch to fall back to BIOS assignments (3.55 KB, patch)
2010-07-14 20:20 UTC, Bjorn Helgaas
Details | Diff
new dmesg from patched kernel (46.78 KB, text/plain)
2010-07-14 22:07 UTC, NiTr0
Details
/proc/iomem (1.76 KB, text/plain)
2010-07-14 22:08 UTC, NiTr0
Details

Description NiTr0 2010-06-21 09:51:59 UTC
Created attachment 26876 [details]
dmesg output of 2.6.34 kernel

After upgrade from 2.6.32.7 to 2.6.34 (configured by 'make oldconfig' with 2.6.32.7 .config as base - so almost all settings are same) I have broken wifi in my Linux box. Here is part of dmesg:

ACPI: PCI Interrupt Link [APC1] enabled at IRQ 16
  alloc irq_desc for 16 on node 0
  alloc kstat_irqs on node 0
ath5k 0000:01:08.0: PCI INT A -> Link[APC1] -> GSI 16 (level, low) -> IRQ 16
ath5k 0000:01:08.0: cannot remap PCI memory region
ath5k 0000:01:08.0: PCI INT A disabled
ath5k: probe of 0000:01:08.0 failed with error -5

Wireless card - D-Link DWL-520, system - Athlon II @ MCP61 board; output of lspci for wireless card:

01:08.0 Ethernet controller: Atheros Communications Inc. Atheros AR5001X+ Wireless Network Adapter (rev 01)
        Subsystem: D-Link System Inc Device 3a73
        Flags: medium devsel, IRQ 16
        Memory at <ignored> (32-bit, non-prefetchable)
        Capabilities: [44] Power Management version 2
        Kernel modules: ath5k

dmesg output is attached.

P.S. If this is caused by broken PCI management driver - move this bugreport to proper category.
Comment 1 John W. Linville 2010-06-21 13:16:34 UTC
That is a big gap, and not a lot of information about what failed.  Are you in a position to do some builds driven by git bisect?
Comment 2 NiTr0 2010-06-21 13:21:03 UTC
Yes, I can build & run anything that you need for obtaining information about trouble.
Comment 3 John W. Linville 2010-06-21 13:50:53 UTC
OK, please do a bisection of the linux-2.6 tree between 2.6.32 and 2.6.34:

git bisect start -- drivers/net/wireless/ath
git bisect bad v2.6.34
git bisect good v2.6.32

And, of course, marking each build as good or bad after rebooting and testing.  You don't have to limit the path, but with such a large gap that might be helpful in narrowing down the error more quickly.  What does that list as the first bad commit?
Comment 4 NiTr0 2010-06-21 20:33:30 UTC
Hm... At second bisect step of linux/kernel/git/torvalds/linux-2.6.git (ath5k: fix setup for CAB queue) I've got kernel panic after running hostapd - but device looks to be initialized and started in ad-hoc successfully. I mark it as good (because searched bug isn't present in this revision) and continue to building git bisects, right?
Sorry for that lame questions - I'm novice in large projects debugging/development.
Comment 5 Bob Copeland 2010-06-21 20:50:58 UTC
I would say mark anything good as long as modprobe doesn't result in the aforementioned PCI iomap error.

The particular start up sequence hasn't changed since 2007 so my first suspicion would lie in the platform.

Another thing that might be interesting is contents of /proc/iomem in working and non-working kernels.
Comment 6 Bob Copeland 2010-06-21 20:53:42 UTC
Just to clarify above -- if you have a build error, don't mark it good - you can use 'git bisect skip' in that case.  But if you have a building kernel that doesn't have the iomap error, then you can mark it good regardless of other things.
Comment 7 NiTr0 2010-06-21 21:16:42 UTC
Thanks for info. I'm continue trying to find broken commit.

Here is 'bad' /proc/iomem (2.6.34-stable):
00000000-0000ffff : reserved
00010000-0009f3ff : System RAM
0009f400-0009ffff : reserved
000f0000-000fffff : reserved
00100000-3bedffff : System RAM
  01000000-013ed83e : Kernel code
  013ed83f-018b98bf : Kernel data
  019a0000-01a9d863 : Kernel bss
3bee0000-3bee2fff : ACPI Non-volatile Storage
3bee3000-3beeffff : ACPI Tables
3bef0000-3befffff : reserved
3bf00000-3bffffff : RAM buffer
3c000000-3fffffff : reserved
  3c000000-3fffffff : pnp 00:0e
f0000000-f3ffffff : PCI MMCONFIG 0000 [bus 00-3f]
  f0000000-f3ffffff : reserved
    f0000000-f3ffffff : pnp 00:0d
fec00000-ffffffff : reserved
  fec00000-fec003ff : IOAPIC 0
  fec80000-fecbffff : PCI Bus 0000:00
    fec80000-fec9ffff : 0000:00:0d.0
    feca0000-feca3fff : 0000:00:05.0
    feca4000-feca4fff : 0000:00:02.0
      feca4000-feca4fff : ohci_hcd
    feca5000-feca5fff : 0000:00:07.0
      feca5000-feca5fff : forcedeth
    feca6000-feca6fff : 0000:00:08.0
      feca6000-feca6fff : sata_nv
    feca7000-feca7fff : 0000:00:08.1
      feca7000-feca7fff : sata_nv
    feca8000-feca80ff : 0000:00:02.1
      feca8000-feca80ff : ehci_hcd
    feca9000-feca9fff : 0000:01:06.0
      feca9000-feca9fff : e100
    fecb0000-fecbffff : 0000:01:06.0
  fee00000-feefffff : pnp 00:0e
    fee00000-fee00fff : Local APIC
  feff0000-feff03ff : HPET 0
    feff0000-feff00ff : pnp 00:0e
  fefff000-feffffff : pnp 00:0e
  fff80000-fff80fff : pnp 00:0e
  fff90000-fffbffff : pnp 00:0e
  fffed000-fffeffff : pnp 00:0e
  ffff0000-ffffffff : pnp 00:0e

'Good' /proc/iomem (2.6.33-rc5-00982-gf6f2230):
00000000-0000ffff : reserved
00010000-0009f3ff : System RAM
0009f400-0009ffff : reserved
000f0000-000fffff : reserved
00100000-3bedffff : System RAM
  01000000-0142fdba : Kernel code
  0142fdbb-01aba7ff : Kernel data
  01ba9000-01cae273 : Kernel bss
3bee0000-3bee2fff : ACPI Non-volatile Storage
3bee3000-3beeffff : ACPI Tables
3bef0000-3befffff : reserved
3bf00000-3bffffff : RAM buffer
3c000000-3fffffff : reserved
  3c000000-3fffffff : pnp 00:0e
40000000-4001ffff : 0000:00:0d.0
e0000000-efffffff : 0000:00:0d.0
  e0000000-efffffff : nvidiafb
f0000000-f3ffffff : PCI MMCONFIG 0000 [bus 00-3f]
  f0000000-f3ffffff : reserved
    f0000000-f3ffffff : pnp 00:0d
fb000000-fbffffff : 0000:00:0d.0
  fb000000-fbffffff : nvidiafb
fc000000-fcffffff : 0000:00:0d.0
  fc000000-fcffffff : nvidiafb
fd800000-fd8fffff : PCI Bus 0000:01
  fd8a0000-fd8bffff : 0000:01:06.0
    fd8a0000-fd8bffff : e100
  fd8e0000-fd8effff : 0000:01:08.0
    fd8e0000-fd8effff : ath5k
  fd8ff000-fd8fffff : 0000:01:06.0
    fd8ff000-fd8fffff : e100
fd900000-fd9fffff : PCI Bus 0000:04
fda00000-fdafffff : PCI Bus 0000:04
fdb00000-fdbfffff : PCI Bus 0000:03
fdc00000-fdcfffff : PCI Bus 0000:03
fdd00000-fddfffff : PCI Bus 0000:02
fde00000-fdefffff : PCI Bus 0000:02
fdf00000-fdffffff : PCI Bus 0000:01
  fdf00000-fdf0ffff : 0000:01:06.0
fe024000-fe027fff : 0000:00:05.0
fe02b000-fe02bfff : 0000:00:08.1
  fe02b000-fe02bfff : sata_nv
fe02c000-fe02cfff : 0000:00:08.0
  fe02c000-fe02cfff : sata_nv
fe02d000-fe02dfff : 0000:00:07.0
  fe02d000-fe02dfff : forcedeth
fe02e000-fe02e0ff : 0000:00:02.1
  fe02e000-fe02e0ff : ehci_hcd
fe02f000-fe02ffff : 0000:00:02.0
  fe02f000-fe02ffff : ohci_hcd
fec00000-ffffffff : reserved
  fec00000-fec003ff : IOAPIC 0
  fee00000-feefffff : pnp 00:0e
    fee00000-fee00fff : Local APIC
  feff0000-feff03ff : HPET 0
    feff0000-feff00ff : pnp 00:0e
  fefff000-feffffff : pnp 00:0e
  fff80000-fff80fff : pnp 00:0e
  fff90000-fffbffff : pnp 00:0e
  fffed000-fffeffff : pnp 00:0e
  ffff0000-ffffffff : pnp 00:0e
Comment 8 NiTr0 2010-06-22 13:00:02 UTC
I've finished bisecting kernel - it looks that trouble is really somewhere in PCI/PNP code, and ath5k trouble with memory resources was involved by it.

Here is log:
git bisect start '--' 'drivers/net/wireless/ath'
# bad: [e40152ee1e1c7a63f4777791863215e3faa37a86] Linus 2.6.34
git bisect bad e40152ee1e1c7a63f4777791863215e3faa37a86
# good: [22763c5cf3690a681551162c15d34d935308c8d7] Linux 2.6.32
git bisect good 22763c5cf3690a681551162c15d34d935308c8d7
# good: [c37919bfe0a5c1bee9a31701a31e05a2f8840936] ath9k_hw: Fix AR_GPIO_INPUT_EN_VAL_BT_PRIORITY_BB and its shift value in 0x4054
git bisect good c37919bfe0a5c1bee9a31701a31e05a2f8840936
# good: [a951ae2176b982574ffa197455db6c89359fd5eb] ath5k: fix setup for CAB queue
git bisect good a951ae2176b982574ffa197455db6c89359fd5eb
# good: [f6f223039c0d0683bdea1eabd35b309e10311a60] Merge branch 'master' of ssh://master.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next-2.6
git bisect good f6f223039c0d0683bdea1eabd35b309e10311a60
# bad: [a3b980fd1391e75068ae25f3817728b27bfdb04c] ath5k: fix TSF reset
git bisect bad a3b980fd1391e75068ae25f3817728b27bfdb04c
# good: [19bc291c99f018bd4f2c38bbf69144086dca903f] Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next-2.6
git bisect good 19bc291c99f018bd4f2c38bbf69144086dca903f
# bad: [a9f042cbe5284f34ccff15f3084477e11b39b17b] ath9k: fix lockdep warning when unloading module
git bisect bad a9f042cbe5284f34ccff15f3084477e11b39b17b
# good: [7bfbae10dc10a5c94a780d117a57e875d77e8e5a] ath9k: disable RIFS search for AR91xx based chips
git bisect good 7bfbae10dc10a5c94a780d117a57e875d77e8e5a

How can I find what commit in kernel tree breaks ath5k on my system? I must do 'git bisect reset', then 'git bisect start' (without path), and mark bad & good commits - 'git bisect bad a9f042cbe5284f34ccff15f3084477e11b39b17b', 'git bisect good 7bfbae10dc10a5c94a780d117a57e875d77e8e5a' ?
Comment 9 Andrew Morton 2010-06-22 20:57:52 UTC
(In reply to comment #8)
>
> How can I find what commit in kernel tree breaks ath5k on my system? I must
> do
> 'git bisect reset', then 'git bisect start' (without path), and mark bad &
> good
> commits - 'git bisect bad a9f042cbe5284f34ccff15f3084477e11b39b17b', 'git
> bisect good 7bfbae10dc10a5c94a780d117a57e875d77e8e5a' ?

I think that's right.   But it looks like that bisection simply hasn't finished yet - keep going?

What makes you believe that PnP/PCI changes caused this?
Comment 10 NiTr0 2010-06-22 21:14:07 UTC
> I think that's right.   But it looks like that bisection simply hasn't
> finished
> yet - keep going?
Hm, yes, you're right, I ewas confusing by string '0 bisects remaining' after marking last commit. But I'm really missed last step. 
I'll run build now.
 
> What makes you believe that PnP/PCI changes caused this?
Different /proc/iomem; also in dmesg nvidiafb.ko also prints warnings about memory resources. And in that ath5k commits range it looks like there is no commits that can break something.
Comment 11 Andrew Morton 2010-06-22 22:14:24 UTC
OK, thanks.  I cc'ed Jesse and Bjorn - maybe it's something they already knew about.
Comment 12 Bjorn Helgaas 2010-06-23 17:38:55 UTC
Wow.  This looks like a spectacular train wreck from my change that automatically turns on "pci=use_crs" for recent boxes (7bc5e3f2be32a).

I reassigned it to myself, but I don't see how to change the category.

If you boot with "pci=nocrs", it should work around it for now, while we figure out the real fix.

The problem is that we found this:

  ACPI: PCI Root Bridge [PCI0] (0000:00)
  pci_root PNP0A08:00: host bridge window [mem 0xfec80000-0xfecbffff]
  pci 0000:00:04.0: PCI bridge to [bus 01-01] (subtractive decode)
  pci 0000:00:04.0:   bridge window [mem 0xfd800000-0xfd8fffff]
  pci 0000:01:08.0: reg 10: [mem 0xfd8e0000-0xfd8effff]

The 01:08.0 BAR is inside the 00:04.0 bridge window, so that's fine.  However, the bridge window itself is outside the host bridge window, so we tried to move things around and of course failed, because we only know about the one tiny window through the host bridge.

The next question is how Windows works.  Is there any chance you have Windows on this box as well and can collect the corresponding bridge window and BAR information from the Device Manager, System Information, the "SIV" tool or similar?  Failing that, please give as many details about your box as possible, and maybe we can find somebody else who does have Windows on a similar box.

The patch here: https://bugzilla.kernel.org/show_bug.cgi?id=15533#c5 will dump more details about the _CRS information we get from the BIOS.  It's possible there's something there that Linux just doesn't know how to handle yet.  Would you mind trying that patch (with the parameters mentioned in that comment) and attaching the dmesg log?  It's OK to boot with "pci=nocrs" so your system works.)

Finally, please collect the DSDT (directions here: http://kernel.org/pub/linux/kernel/people/helgaas/debug) and attach it here.  Maybe the BIOS marks that host bridge as subtractive decode.  I haven't seen that yet, but I think it might be possible, and I don't think Linux would deal well with it.
Comment 13 NiTr0 2010-06-23 21:13:30 UTC
I tried to run Windows PE on box - it looks like it can't run PCI bridges because it can't see resources.
Log of Everest report is attached (SIV doesn't run - one dll was missed). Also ACPI dump and dmesg of patched kernel is attached too.
With "pci=nocrs" system runs OK.
Comment 14 NiTr0 2010-06-23 21:14:37 UTC
Created attachment 26918 [details]
acpi dump + dmesg output
Comment 15 NiTr0 2010-06-23 21:15:21 UTC
Created attachment 26919 [details]
Everest report under WinPE
Comment 16 Bjorn Helgaas 2010-06-23 21:34:11 UTC
Thanks, this is great info, and I"ll look in more detail next week.  But I don't see the dmesg log in comment #14; did that get missed?
Comment 17 NiTr0 2010-06-23 21:40:31 UTC
Created attachment 26920 [details]
dmesg log

Oops, I missed to add it into archive.
Now I attached it.
Comment 18 Zhang Rui 2010-06-29 03:07:10 UTC
re-assign to Bjorn
Comment 19 Bjorn Helgaas 2010-07-09 15:47:50 UTC
Here's what I see from the dmesg and the Everest report.  The BAR values printed by Linux are before any modification, so WinPE should start with the same assignments.  The "WinPE:" comments at the right are from the Everest report.

It appears that WinPE moves devices into the PNP0A08 window until that window is full.  The remaining devices are left at the addresses set by BIOS.

Linux does the same thing, EXCEPT that if there are devices that don't fit in the PNP0A08 window, we disable them instead of trying the addresses from BIOS.

ACPI: PCI Root Bridge [PCI0] (0000:00)
pci_root PNP0A08:00: host bridge window [mem 0xfec80000-0xfecbffff]
pci 00:02.0: reg 10: [mem 0xfe02f000-0xfe02ffff]  WinPE: 0xfecbf000-0xfecbffff
pci 00:02.1: reg 10: [mem 0xfe02e000-0xfe02e0ff]  WinPE: 0xfecbef00-0xfecbefff
pci 00:05.0: reg 10: [mem 0xfe024000-0xfe027fff]
pci 00:07.0: reg 10: [mem 0xfe02d000-0xfe02dfff]  WinPE: 0xfecbb000-0xfecbbfff
pci 00:08.0: reg 24: [mem 0xfe02c000-0xfe02cfff]  WinPE: 0xfecbd000-0xfecbdfff
pci 00:08.1: reg 24: [mem 0xfe02b000-0xfe02bfff]  WinPE: 0xfecbc000-0xfecbcfff
pci 00:0d.0: reg 10: [mem 0xfb000000-0xfbffffff]
pci 00:0d.0: reg 14: [mem 0xe0000000-0xefffffff]
pci 00:0d.0: reg 1c: [mem 0xfc000000-0xfcffffff]
pci 00:0d.0: reg 30: [mem 0x00000000-0x0001ffff]
pci 01:06.0: reg 10: [mem 0xfd8ff000-0xfd8fffff]  WinPE: 0xfecba000-0xfecbafff
pci 01:06.0: reg 18: [mem 0xfd8a0000-0xfd8bffff]  WinPE: 0xfec80000-0xfec9ffff
pci 01:06.0: reg 30: [mem 0x00000000-0x0000ffff]
pci 01:08.0: reg 10: [mem 0xfd8e0000-0xfd8effff]  ** Linux disables, WinPE leaves alone
pci 00:04.0: PCI bridge to [bus 01-01] (subtractive decode)
pci 00:04.0:   bridge window [mem 0xfd800000-0xfd8fffff]  WinPE: closed?
pci 00:04.0:   bridge window [mem 0xfdf00000-0xfdffffff]  WinPE: closed?
pci 00:09.0: PCI bridge to [bus 02-02]
pci 00:09.0:   bridge window [mem 0xfde00000-0xfdefffff]
pci 00:09.0:   bridge window [mem 0xfdd00000-0xfddfffff]
pci 00:0b.0: PCI bridge to [bus 03-03]
pci 00:0b.0:   bridge window [mem 0xfdc00000-0xfdcfffff]
pci 00:0b.0:   bridge window [mem 0xfdb00000-0xfdbfffff]
pci 00:0c.0: PCI bridge to [bus 04-04]
pci 00:0c.0:   bridge window [mem 0xfda00000-0xfdafffff]
pci 00:0c.0:   bridge window [mem 0xfd900000-0xfd9fffff]

I'll work up a patch to make Linux use the existing BAR values instead of disabling the device, but it's going to take me a while because I'm catching up from being on vacation.
Comment 20 Bjorn Helgaas 2010-07-13 23:07:58 UTC
Created attachment 27093 [details]
test patch to use BIOS assignment as a last resort

Here's a test patch to make Linux behave more like WinPE in this situation.  If you could try it and attach the dmesg log, that would be great!  If you make the attachment "text/plain", it's easier to read it in a browser.
Comment 21 NiTr0 2010-07-14 13:03:20 UTC
Created attachment 27101 [details]
dmesg output of patched kernel

Here is dmesg output. There are error with nvidiafb, but WiFi card seems to be working - at least it's initialized as device.
Comment 22 Bjorn Helgaas 2010-07-14 20:12:07 UTC
Here's the call trace outline showing that we discard the BIOS BAR addresses long before we try to reassign unassigned resources:

    pci_subsys_init (subsys_initcall)
      pcibios_init
        pcibios_resource_survey
          pcibios_allocate_resources
            pci_claim_resource
              "no compatible bridge window ..."
            <** here's where we throw away the BIOS address **>

    <reserve motherboard resources>

    pcibios_assign_resources (fs_initcall, after motherboard resources)
      pci_assign_unassigned_resources
        pci_bus_assign_resources
          __pci_bus_assign_resources
            pbus_assign_resources_sorted
              __assign_resources_sorted
                pci_assign_resource
                  "can't assign ..."
                <** here's where we want the BIOS address again **>

The separation between the pci_claim_resource() failure and pci_assign_resource() is the reason we have to store the BIOS BAR address somewhere.
Comment 23 Bjorn Helgaas 2010-07-14 20:20:09 UTC
Created attachment 27105 [details]
updated patch to fall back to BIOS assignments

Updated patch.  I made an off-by-one error that caused the nvidia issues.  Sorry, I should have caught this in my testing.  Can you try this one?
Comment 24 NiTr0 2010-07-14 22:07:58 UTC
Created attachment 27107 [details]
new dmesg from patched kernel

Now all looks OK.
Comment 25 NiTr0 2010-07-14 22:08:48 UTC
Created attachment 27108 [details]
/proc/iomem
Comment 26 Bjorn Helgaas 2010-07-19 15:38:56 UTC
Should be resolved by this upstream commit, which should appear in 2.6.35-rc6:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=58c84eda07560a6b75b03e8d3b26d6eddfc14011
Comment 27 Bjorn Helgaas 2011-06-13 23:21:17 UTC
This commit appeared in 2.6.35, so this should be resolved.

Note You need to log in before you can comment on or make changes to this bug.