Bug 15960 - I/O port range not assigned, BIOS allocation gets lost
Summary: I/O port range not assigned, BIOS allocation gets lost
Status: CLOSED CODE_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: PCI
Hardware: All Linux
Importance: P1 high
Assignee: drivers_pci@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks: 15310
Reported: 2010-05-11 15:09 UTC by Clemens Ladisch
Modified: 2012-04-04 15:05 UTC
6 users

See Also:
Kernel Version: 2.6.34-rc1
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
kernel log with mainline (54.93 KB, text/plain)
2010-10-19 18:50 UTC, Peter Sääf
patch with a proposed fix, and some debug messages in case the fix fails (3.43 KB, patch)
2010-10-19 23:33 UTC, Ram Pai
log with the pci_xonar_debug patch (57.56 KB, text/plain)
2010-10-20 14:48 UTC, Peter Sääf
patch to identify if the issue is hotplug related. (1.77 KB, patch)
2010-10-20 17:49 UTC, Ram Pai
patch to identify if the issue is hotplug related. (1.83 KB, patch)
2010-10-20 21:13 UTC, Ram Pai
Updated log (60.03 KB, text/plain)
2010-10-21 15:36 UTC, Peter Sääf
patch with a proposed fix, and some debug messages in case the fix fails (9.52 KB, patch)
2011-01-14 07:19 UTC, Ram Pai
dmesg output with last patch (79.86 KB, text/plain)
2011-01-14 23:24 UTC, Peter Sääf
patch with a proposed fix, and some debug messages in case the fix fails (10.74 KB, patch)
2011-01-15 01:21 UTC, Ram Pai
New log (97.82 KB, text/plain)
2011-01-15 14:54 UTC, Peter Sääf
patch with a proposed fix, and some debug messages in case the fix fails (11.05 KB, patch)
2011-01-17 19:22 UTC, Ram Pai
yinghai's patch. Allocate unallocated resources. (4.26 KB, patch)
2011-01-17 19:28 UTC, Ram Pai
Log with both patches applied (95.19 KB, text/plain)
2011-01-17 22:25 UTC, Peter Sääf

Description Clemens Ladisch 2010-05-11 15:09:38 UTC
(bug report originally from <https://bugtrack.alsa-project.org/alsa-bug/view.php?id=4982>:)

After upgrading the kernel from 2.6.32 to 2.6.34-rc4, my Xonar DX stopped working. The kernel log has:

[kernel] AV200 0000:09:04.0: PCI INT A -> GSI 17 (level, low) -> IRQ 17
[kernel] invalid PCI I/O range
[kernel] AV200 0000:09:04.0: PCI INT A disabled
[kernel] ALSA device list:
[kernel] No soundcards found.

According to git bisect the culprit is (from 2.6.34-rc1):

commit 977d17bb1749517b353874ccdc9b85abc7a58c2a
Author: Yinghai Lu <yinghai@kernel.org>
Date: Fri Jan 22 01:02:24 2010 -0800
PCI: update bridge resources to get more big ranges in PCI assign unssigned


lkml thread ("[regression, bisected] Xonar DX invalid PCI I/O range since  977d17bb174"):
http://lkml.org/lkml/2010/4/19/20
Comment 1 Andrew Morton 2010-05-12 21:50:18 UTC
(switched to email.  Please respond via emailed reply-to-all, not via the
bugzilla web interface).

On Tue, 11 May 2010 15:09:41 GMT
bugzilla-daemon@bugzilla.kernel.org wrote:

> https://bugzilla.kernel.org/show_bug.cgi?id=15960
> 
>            Summary: I/O port range not assigned, BIOS allocation gets lost
>            Product: Drivers
>            Version: 2.5
>     Kernel Version: 2.6.34-rc1
>           Platform: All
>         OS/Version: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: high
>           Priority: P1
>          Component: PCI
>         AssignedTo: drivers_pci@kernel-bugs.osdl.org
>         ReportedBy: clemens@ladisch.de
>                 CC: yinghai@kernel.org
>         Regression: Yes
> 
> 
> (bug report originally from
> <https://bugtrack.alsa-project.org/alsa-bug/view.php?id=4982>:)
> 
> After upgrading kernel from 2.6.32 to 2.6.34-rc4 my Xonar DX stopped working.
> Kernel log has:
> 
> [kernel] AV200 0000:09:04.0: PCI INT A -> GSI 17 (level, low) -> IRQ 17
> [kernel] invalid PCI I/O range
> [kernel] AV200 0000:09:04.0: PCI INT A disabled
> [kernel] ALSA device list:
> [kernel] No soundcards found.
> 
> According to git bisect the culprit is (from 2.6.34-rc1):
> 
> commit 977d17bb1749517b353874ccdc9b85abc7a58c2a
> Author: Yinghai Lu <yinghai@kernel.org>
> Date: Fri Jan 22 01:02:24 2010 -0800
> PCI: update bridge resources to get more big ranges in PCI assign unssigned
> 
> 
> lkml thread ("[regression, bisected] Xonar DX invalid PCI I/O range since 
> 977d17bb174"):
> http://lkml.org/lkml/2010/4/19/20
> 

Guys?  It's getting awfully close to 2.6.34 - should we just revert it?

Clemens, have you confirmed that reverting
977d17bb1749517b353874ccdc9b85abc7a58c2a from current mainline fixes
things?
Comment 2 Rafael J. Wysocki 2010-05-12 22:43:14 UTC
First-Bad-Commit : 977d17bb1749517b353874ccdc9b85abc7a58c2a
Comment 3 Jesse Barnes 2010-05-12 23:06:42 UTC
On Wed, 12 May 2010 14:49:07 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:

> 
> (switched to email.  Please respond via emailed reply-to-all, not via the
> bugzilla web interface).
> 
> On Tue, 11 May 2010 15:09:41 GMT
> bugzilla-daemon@bugzilla.kernel.org wrote:
> 
> > https://bugzilla.kernel.org/show_bug.cgi?id=15960
> > 
> >            Summary: I/O port range not assigned, BIOS allocation gets lost
> >            Product: Drivers
> >            Version: 2.5
> >     Kernel Version: 2.6.34-rc1
> >           Platform: All
> >         OS/Version: Linux
> >               Tree: Mainline
> >             Status: NEW
> >           Severity: high
> >           Priority: P1
> >          Component: PCI
> >         AssignedTo: drivers_pci@kernel-bugs.osdl.org
> >         ReportedBy: clemens@ladisch.de
> >                 CC: yinghai@kernel.org
> >         Regression: Yes
> > 
> > 
> > (bug report originally from
> > <https://bugtrack.alsa-project.org/alsa-bug/view.php?id=4982>:)
> > 
> > After upgrading kernel from 2.6.32 to 2.6.34-rc4 my Xonar DX stopped
> working.
> > Kernel log has:
> > 
> > [kernel] AV200 0000:09:04.0: PCI INT A -> GSI 17 (level, low) -> IRQ 17
> > [kernel] invalid PCI I/O range
> > [kernel] AV200 0000:09:04.0: PCI INT A disabled
> > [kernel] ALSA device list:
> > [kernel] No soundcards found.
> > 
> > According to git bisect the culprit is (from 2.6.34-rc1):
> > 
> > commit 977d17bb1749517b353874ccdc9b85abc7a58c2a
> > Author: Yinghai Lu <yinghai@kernel.org>
> > Date: Fri Jan 22 01:02:24 2010 -0800
> > PCI: update bridge resources to get more big ranges in PCI assign unssigned
> > 
> > 
> > lkml thread ("[regression, bisected] Xonar DX invalid PCI I/O range since 
> > 977d17bb174"):
> > http://lkml.org/lkml/2010/4/19/20
> > 
> 
> Guys?  It's getting awfully close to 2.6.34 - should we just revert it?
> 
> Clemens, have you confirmed that reverting
> 977d17bb1749517b353874ccdc9b85abc7a58c2a from current mainline fixes
> things?

I haven't heard back from Yinghai on a related bug, so reverting is
definitely a possibility.  But we do have a patch to work around the
issue as well; it needs to be fixed though since "pci=try=" doesn't
make much sense ("pci=realloc=" or similar would be better).  Either
way though it's risky; I'm afraid of breaking a now working setup by
reverting the patch or changing the default.

Yinghai, maybe there's an easy way to avoid reassigning everything if
the result isn't much better (i.e. devices are still missing BARs)?
That might avoid situations like this...
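Jesse's suggestion can be made concrete as a simple acceptance rule. A reassignment pass replaces the firmware layout only if it leaves strictly fewer BARs without space. This is a minimal userspace sketch under that assumption. `struct bar_state`, `count_missing()` and `prefer_reassignment()` are hypothetical names for illustration, not kernel code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-BAR state: size == 0 means the BAR is not implemented,
 * assigned != 0 means it received an address range. */
struct bar_state {
	unsigned long size;
	int assigned;
};

/* Count BARs that need space but did not get any. */
static size_t count_missing(const struct bar_state *bars, size_t n)
{
	size_t missing = 0;

	assert(bars != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		if (bars[i].size != 0 && !bars[i].assigned)
			missing++;
	return missing;
}

/* Accept the reassigned layout only if it leaves strictly fewer BARs
 * unassigned than the firmware layout; otherwise keep what the BIOS did. */
static int prefer_reassignment(const struct bar_state *fw,
			       const struct bar_state *re, size_t n)
{
	return count_missing(re, n) < count_missing(fw, n);
}
```

Under this rule, a pass that fixes one BAR but loses another is rejected, so the BIOS assignment survives. Here the Xonar's I/O range is the kind of allocation that got lost.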
Comment 5 Anonymous Emailer 2010-05-12 23:11:45 UTC
Reply-To: yinghai.lu@oracle.com

On 05/12/2010 03:05 PM, Jesse Barnes wrote:
> On Wed, 12 May 2010 14:49:07 -0700
> Andrew Morton <akpm@linux-foundation.org> wrote:
> 
>>
>> (switched to email.  Please respond via emailed reply-to-all, not via the
>> bugzilla web interface).
>>
>> On Tue, 11 May 2010 15:09:41 GMT
>> bugzilla-daemon@bugzilla.kernel.org wrote:
>>
>>> https://bugzilla.kernel.org/show_bug.cgi?id=15960
>>>
>>>            Summary: I/O port range not assigned, BIOS allocation gets lost
>>>            Product: Drivers
>>>            Version: 2.5
>>>     Kernel Version: 2.6.34-rc1
>>>           Platform: All
>>>         OS/Version: Linux
>>>               Tree: Mainline
>>>             Status: NEW
>>>           Severity: high
>>>           Priority: P1
>>>          Component: PCI
>>>         AssignedTo: drivers_pci@kernel-bugs.osdl.org
>>>         ReportedBy: clemens@ladisch.de
>>>                 CC: yinghai@kernel.org
>>>         Regression: Yes
>>>
>>>
>>> (bug report originally from
>>> <https://bugtrack.alsa-project.org/alsa-bug/view.php?id=4982>:)
>>>
>>> After upgrading kernel from 2.6.32 to 2.6.34-rc4 my Xonar DX stopped
>>> working.
>>> Kernel log has:
>>>
>>> [kernel] AV200 0000:09:04.0: PCI INT A -> GSI 17 (level, low) -> IRQ 17
>>> [kernel] invalid PCI I/O range
>>> [kernel] AV200 0000:09:04.0: PCI INT A disabled
>>> [kernel] ALSA device list:
>>> [kernel] No soundcards found.
>>>
>>> According to git bisect the culprit is (from 2.6.34-rc1):
>>>
>>> commit 977d17bb1749517b353874ccdc9b85abc7a58c2a
>>> Author: Yinghai Lu <yinghai@kernel.org>
>>> Date: Fri Jan 22 01:02:24 2010 -0800
>>> PCI: update bridge resources to get more big ranges in PCI assign unssigned
>>>
>>>
>>> lkml thread ("[regression, bisected] Xonar DX invalid PCI I/O range since 
>>> 977d17bb174"):
>>> http://lkml.org/lkml/2010/4/19/20
>>>
>>
>> Guys?  It's getting awfully close to 2.6.34 - should we just revert it?
>>
>> Clemens, have you confirmed that reverting
>> 977d17bb1749517b353874ccdc9b85abc7a58c2a from current mainline fixes
>> things?
> 
> I haven't heard back from Yinghai on a related bug, so reverting is
> definitely a possibility.  But we do have a patch to work around the
> issue as well; it needs to be fixed though since "pci=try=" doesn't
> make much sense ("pci=realloc=" or similar would be better).  Either
> way though it's risky; I'm afraid of breaking a now working setup by
> reverting the patch or changing the default.
> 
> Yinghai, maybe there's an easy way to avoid reassigning everything if
> the result isn't much better (i.e. devices are still missing BARs)?
> That might avoid situations like this...
> 
pci=realloc sounds like clearing all bridge BARs and assigning new ones.

YH
Comment 6 Linus Torvalds 2010-05-12 23:26:07 UTC
On Wed, 12 May 2010, Jesse Barnes wrote:
> 
> I haven't heard back from Yinghai on a related bug, so reverting is
> definitely a possibility.

Let's just revert the thing, and never bring it back. It was a mistake, 
just admit it. No sane setup has this problem, and the whole "let's 
reassign everything" is so fragile that it should _never_ have been done 
by default.

We already had an issue with the "reassign everything" code deciding to do 
it just because a ROM resource didn't fit. We fixed that one, iirc, but it 
shows how fragile the whole thing was.

So my suggestion:

 - revert it (but I agree with Andrew that we should get confirmation that 
   reverting it makes things work)

 - REMEMBER that this doesn't work, and NEVER EVER do it again

 - if somebody wants to reassign all the PCI resources, make that be a 
   special thing that requires actual user interaction.

> I'm afraid of breaking a now working setup by reverting the patch or 
> changing the default.

What "now working setup"? It didn't work in 2.6.33 either. The rule about 
regressions is that IT DOES NOT MATTER HOW MANY THINGS YOU FIX! If it 
breaks something that used to work, it's buggy, and needs to be reverted.

It's that simple. Something that got fixed by that patch simply DOES NOT 
MATTER. It's not an argument for not reverting. The only thing that 
matters is that it broke something, and thus needs to be reverted or 
fixed, and looking at the code, we can be pretty sure that 'fixed' is 
likely not even an option.

			Linus
Comment 7 Peter Sääf 2010-05-13 00:03:03 UTC
On Wed, 2010-05-12 at 16:20 -0700, Jesse Barnes wrote:
> Ok, that's fine.  Let's pull it once we get confirmation that the
> revert works.
> 
> Clemens or Peter, can you make sure reverting
> 977d17bb1749517b353874ccdc9b85abc7a58c2a gets your sound device back?
> 
> Thanks,

Yup, reverting from current mainline fixes things for me.

Cheers,
Peter
Comment 8 Linus Torvalds 2010-05-13 00:09:30 UTC
On Thu, 13 May 2010, Peter Henriksson wrote:
> 
> Yup, reverting from current mainline fixes things for me.

Ok. Jesse, do you want to do it and push it to me (or put another way: "Do 
you have other things pending?") or should I just do the revert?

		Linus
Comment 9 Jesse Barnes 2010-05-13 00:22:12 UTC
On Wed, 12 May 2010 15:49:45 -0700 (PDT)
Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Wed, 12 May 2010, Jesse Barnes wrote:
> > 
> > I haven't heard back from Yinghai on a related bug, so reverting is
> > definitely a possibility.
> 
> Let's just revert the thing, and never bring it back. It was a mistake, 
> just admit it. No sane setup has this problem, and the whole "let's 
> reassign everything" is so fragile that it should _never_ have been done 
> by default.
> 
> We already had an issue with the "reassign everything" code deciding to do 
> it just because a ROM resource didn't fit. We fixed that one, iirc, but it 
> shows how fragile the whole thing was.

Right.

> 
> So my suggestion:
> 
>  - revert it (but I agree with Andrew that we should get confirmation that 
>    reverting it makes things work)
> 
>  - REMEMBER that this doesn't work, and NEVER EVER do it again
> 
>  - if somebody wants to reassign all the PCI resources, make that be a 
>    special thing that requires actual user interaction.

Reassigning everything was a bad idea, but I wouldn't go so far as to
say "never reassign" either.  In some cases it's unavoidable whether
due to firmware bugs or just more devices in the system than the
firmware could handle.  But we obviously need a more focused approach
to handle those situations; doing it all at once like this wreaks too
much havoc.

> > I'm afraid of breaking a now working setup by reverting the patch or 
> > changing the default.
> 
> What "now working setup"? It didn't work in 2.6.33 either. The rule about 
> regressions is that IT DOES NOT MATTER HOW MANY THINGS YOU FIX! If it 
> breaks something that used to work, it's buggy, and needs to be reverted.
> 
> It's that simple. Something that got fixed by that patch simply DOES NOT 
> MATTER. It's not an argument for not reverting. The only thing that 
> matters is that it broke something, and thus needs to be reverted or 
> fixed, and looking at the code, we can be pretty sure that 'fixed' is 
> likely not even an option.

Ok, that's fine.  Let's pull it once we get confirmation that the
revert works.

Clemens or Peter, can you make sure reverting
977d17bb1749517b353874ccdc9b85abc7a58c2a gets your sound device back?

Thanks,
Comment 11 Jesse Barnes 2010-05-13 01:19:09 UTC
On Wed, 12 May 2010 17:06:47 -0700 (PDT)
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> 
> 
> On Thu, 13 May 2010, Peter Henriksson wrote:
> > 
> > Yup, reverting from current mainline fixes things for me.
> 
> Ok. Jesse, do you want to do it and push it to me (or put another way: "Do 
> you have other things pending?") or should I just do the revert?

Nope, nothing else pending atm, please just revert.

Thanks,
Comment 13 Rafael J. Wysocki 2010-05-13 20:34:11 UTC
Fixed by commit 769d9968e42c995eaaf61ac5583d998f32e0769a.
Comment 14 Ram Pai 2010-06-23 16:39:06 UTC
I have two SRIOV adapters, a Mellanox ConnectX-2 and a ServerEngines BE3; both fail MMIO allocation for their SRIOV BARs. A git bisect leads to this reverted patch :( . Of course, this is on a platform where the BIOS is unaware of SRIOV BARs, hence the OS is forced to assign the resources.

Any chance this patch can be brought back, and the specific issue with
Xonar DX debugged and fixed? I am willing to volunteer.

Thanks,
RP
Comment 15 Yinghai Lu 2010-06-23 18:02:43 UTC
On 06/23/2010 09:39 AM, bugzilla-daemon@bugzilla.kernel.org wrote:
> --- Comment #14 from Ram Pai <linuxram@us.ibm.com>  2010-06-23 16:39:06 ---
> I have two SRIOV adapters, a Mellanox ConnectX-2 and a ServerEngines BE3; both
> fail MMIO allocation for their SRIOV BARs. A git bisect leads to this reverted
> patch :( . Of course, this is on a platform where the BIOS is unaware of SRIOV
> BARs, hence the OS is forced to assign the resources.
> 
> Any chance this patch can be brought back, and the specific issue with
> Xonar DX debugged and fixed? I am willing to volunteer.
> 

Three options:
1. don't do the second try for I/O ports
2. use pci=try=2
3. restore the bridge resources if repeated tries fail.

Yinghai
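Options 2 and 3 can be combined into one loop. It caps the number of assignment passes; pci=try=2 in option 2 appears to serve as that cap. If every pass still leaves failures, the loop puts the firmware assignment back, which is option 3.

This is a minimal userspace model, not the kernel's implementation. `struct realloc_ops`, `assign_with_retries()` and the toy context are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical hooks: try_assign() performs one assignment pass (attempt 0
 * keeps the firmware bridge windows, later attempts may release more of
 * them) and returns how many resources still failed; restore() puts the
 * firmware assignment back. */
struct realloc_ops {
	int (*try_assign)(int attempt, void *ctx);
	void (*restore)(void *ctx);
};

/* Returns the number of passes used, or -1 if every pass left failures and
 * the firmware assignment was restored (option 3). max_tries plays the
 * role of the try count from option 2. */
static int assign_with_retries(const struct realloc_ops *ops, void *ctx,
			       int max_tries)
{
	assert(ops && ops->try_assign && ops->restore && max_tries > 0);
	for (int attempt = 0; attempt < max_tries; attempt++) {
		if (ops->try_assign(attempt, ctx) == 0)
			return attempt + 1;
	}
	ops->restore(ctx);
	return -1;
}

/* Toy context for exercising the loop: a pass succeeds once 'attempt'
 * reaches 'succeeds_at'; 'restored' counts restore() calls. */
struct toy_ctx {
	int succeeds_at;
	int restored;
};

static int toy_try(int attempt, void *ctx)
{
	const struct toy_ctx *c = ctx;

	return attempt >= c->succeeds_at ? 0 : 1;
}

static void toy_restore(void *ctx)
{
	struct toy_ctx *c = ctx;

	c->restored++;
}
```

The restore step matters for a case like the Xonar DX. Failing to improve things must not leave the system worse off than the BIOS left it.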
Comment 16 Ram Pai 2010-10-18 23:44:09 UTC
Peter,

Do you still have your box with the Xonar sound card? If so, can I get the dmesg output from the latest mainline kernel? I have an inkling that your kernel, even now, prints messages related to memory resource assignment failures. Fortunately they do not affect your sound card.

RP
Comment 17 Peter Sääf 2010-10-19 18:50:56 UTC
Created attachment 34162
kernel log with mainline

Ram,
Kernel log with current mainline.
Comment 18 Ram Pai 2010-10-19 21:16:16 UTC
Thanks Peter,

As suspected, I am seeing the following messages in your log. 

Oct 19 20:43:13 [kernel] pci 0000:04:00.0: BAR 8: can't assign mem (size 0xc00000)
Oct 19 20:43:13 [kernel] pci 0000:05:01.0: BAR 8: can't assign mem (size 0x200000)
Oct 19 20:43:13 [kernel] pci 0000:05:01.0: BAR 9: can't assign mem pref (size 0x200000)
Oct 19 20:43:13 [kernel] pci 0000:05:02.0: BAR 8: can't assign mem (size 0x400000)
Oct 19 20:43:13 [kernel] pci 0000:05:03.0: BAR 8: can't assign mem (size 0x400000)
Oct 19 20:43:13 [kernel] pci 0000:05:02.0: BAR 7: can't assign io (size 0x1000)
Oct 19 20:43:13 [kernel] pci 0000:05:03.0: BAR 7: can't assign io (size 0x1000)

Ok. Let me post a proposed test patch later today and see if it cures these messages.
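The failures quoted above are easier to read once the resource indices are decoded and summed. This sketch assumes the kernel was built without CONFIG_PCI_IOV, which the 7/8/9 numbering in the log suggests. Under that assumption, index 7 is a bridge's I/O window, 8 its memory window and 9 its prefetchable memory window. The enum and struct names are illustrative only.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: no CONFIG_PCI_IOV, so resource indices 0-5 are standard BARs,
 * 6 is the expansion ROM, and 7-9 are a bridge's I/O, memory and
 * prefetchable memory windows. */
enum { BRIDGE_IO = 7, BRIDGE_MEM = 8, BRIDGE_PREF = 9 };

struct failed_res {
	const char *dev;     /* bridge that failed, as printed in the log */
	int idx;             /* the "BAR n" number from the log */
	unsigned long size;  /* requested window size */
};

/* The seven failures quoted above from Peter's log. */
static const struct failed_res failures[] = {
	{ "0000:04:00.0", BRIDGE_MEM,  0xc00000 },
	{ "0000:05:01.0", BRIDGE_MEM,  0x200000 },
	{ "0000:05:01.0", BRIDGE_PREF, 0x200000 },
	{ "0000:05:02.0", BRIDGE_MEM,  0x400000 },
	{ "0000:05:03.0", BRIDGE_MEM,  0x400000 },
	{ "0000:05:02.0", BRIDGE_IO,   0x1000 },
	{ "0000:05:03.0", BRIDGE_IO,   0x1000 },
};

/* Total space the failed requests of one window type would have needed
 * (alignment is ignored, so the real requirement can be larger). */
static unsigned long total_for(int idx)
{
	unsigned long sum = 0;

	assert(idx >= BRIDGE_IO && idx <= BRIDGE_PREF);
	for (size_t i = 0; i < sizeof(failures) / sizeof(failures[0]); i++)
		if (failures[i].idx == idx)
			sum += failures[i].size;
	return sum;
}
```

The sums come to 0x1600000 (22 MiB) of non-prefetchable memory, 0x200000 (2 MiB) of prefetchable memory and 0x2000 (8 KiB) of I/O. None of it could be placed.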
Comment 19 Ram Pai 2010-10-19 23:33:53 UTC
Created attachment 34192
patch with a proposed fix, and some debug messages in case the fix fails

Peter,

Attached is a patch that might fix the cause of the failure messages on your setup. Can you apply it to the latest mainline kernel and boot with the debug command line option? If the problem is fixed, you should not see the error messages, and your Xonar device should still function properly. In any case, please send me the dmesg output. I have added some debug messages to the patch that might give me some insight in case the problem persists.

thanks a bunch,
RP
Comment 20 Peter Sääf 2010-10-20 14:48:23 UTC
Created attachment 34242
log with the pci_xonar_debug patch

Attaching log with the patch applied. The Xonar is still working.

Regards,
Peter
Comment 21 Ram Pai 2010-10-20 17:49:25 UTC
Created attachment 34272 [details]
patch to identify if the issue is hotplug related.

Ok. I still see the same error messages that are captured in comment #18.

On further analysis, it certainly looks like an issue where the BIOS has not allocated enough space for hotplug bridges. The OS attempts to allocate some minimal space, 4K of I/O and a 2M memory window, to each of those hotplug bridges and fails.

However, to prove the above theory, can you try the attached patch? It goes on top of the earlier patch. If possible, run the kernel with the debug option and send me the dmesg output.

Thanks, I hope this will be the last iteration towards identifying the root cause of the problem.
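The minimal-allocation behavior Ram describes can be sketched as a sizing rule. Each window gets whatever the children need, but never less than a hotplug minimum, rounded up to the window granularity. The 4 KiB I/O and 2 MiB memory minimums come from the comment above. The 4 KiB and 1 MiB granularities are the PCI-to-PCI bridge I/O and memory window granularities. The function names are hypothetical, and this is not the kernel's sizing code.

```c
#include <assert.h>

/* Minimums from the comment above (assumed, for illustration). */
#define HP_MIN_IO   0x1000UL    /* 4 KiB of I/O */
#define HP_MIN_MEM  0x200000UL  /* 2 MiB of memory */
/* PCI-to-PCI bridge window granularity: I/O 4 KiB, memory 1 MiB. */
#define IO_GRAN     0x1000UL
#define MEM_GRAN    0x100000UL

/* Round x up to a power-of-two granularity. */
static unsigned long align_up(unsigned long x, unsigned long gran)
{
	assert(gran != 0 && (gran & (gran - 1)) == 0);
	return (x + gran - 1) & ~(gran - 1);
}

/* Window size: what the children need, but at least the hotplug minimum,
 * rounded up to the window granularity. */
static unsigned long window_size(unsigned long child_need,
				 unsigned long hp_min, unsigned long gran)
{
	unsigned long need = child_need > hp_min ? child_need : hp_min;

	return align_up(need, gran);
}
```

An empty hotplug bridge (child_need == 0) therefore still asks for 0x1000 of I/O. That matches the BAR 7 requests of 0000:05:02.0 and 0000:05:03.0 in the log, and it is why such bridges compete for the same scarce I/O space as real devices.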
Comment 22 Peter Sääf 2010-10-20 19:03:42 UTC
I get a kernel oops with that patch on top. Perhaps the warnings below are related? I'm not a programmer, so I don't know what to make of it.

  CC      drivers/pci/setup-bus.o
drivers/pci/setup-bus.c: In function ‘pbus_size_io’:
drivers/pci/setup-bus.c:417: warning: format ‘%x’ expects type ‘unsigned int’, but argument 4 has type ‘resource_size_t’
drivers/pci/setup-bus.c:437: warning: passing argument 2 of ‘dev_printk’ from incompatible pointer type
include/linux/device.h:646: note: expected ‘const struct device *’ but argument is of type ‘struct pci_dev *’
drivers/pci/setup-bus.c:437: warning: too few arguments for format
drivers/pci/setup-bus.c: In function ‘pbus_size_mem’:
drivers/pci/setup-bus.c:491: warning: format ‘%x’ expects type ‘unsigned int’, but argument 4 has type ‘resource_size_t’
drivers/pci/setup-bus.c:518: warning: passing argument 2 of ‘dev_printk’ from incompatible pointer type
include/linux/device.h:646: note: expected ‘const struct device *’ but argument is of type ‘struct pci_dev *’
drivers/pci/setup-bus.c:518: warning: too few arguments for format
Comment 23 Ram Pai 2010-10-20 21:13:46 UTC
Created attachment 34292
patch to identify if the issue is hotplug related.

Sorry Peter. I fixed all those compiler warnings. Try this new patch instead of the last one. Thanks.
RP
Comment 24 Peter Sääf 2010-10-21 15:36:43 UTC
Created attachment 34312
Updated log

Thanks. Works better now. Log attached.
Comment 25 Ram Pai 2010-10-21 16:57:03 UTC
Ok. As suspected, those are hotplug bridges to which the BIOS has not allocated
any memory resource or io ports since there are no devices behind them.

However, the OS tries to allocate the minimum amount. Unfortunately, since there are not enough resources available to satisfy all the hotplug bridges, the allocation fails.

Jesse: How is this supposed to work?  Should we ignore pre-allocation to hotplug bridges if there are not enough resources available?
Comment 26 Jesse Barnes 2010-10-21 17:15:20 UTC
Yes, I think we should.  Bridges with nothing behind them at boot should have the lowest priority when it comes to allocating resources.
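One way to apply Jesse's priority rule is to sort window requests so that bridges with devices behind them come first and empty hotplug bridges last, then grant them from the available space in that order. This is a minimal sketch that ignores alignment. `struct win_req`, `cmp_req()` and `allocate()` are hypothetical names. The larger-first ordering within each group is an added heuristic, not something stated in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical request for one bridge window. */
struct win_req {
	unsigned long size;
	int has_devices;   /* 0: empty hotplug bridge at boot */
	int granted;
};

/* Occupied bridges sort before empty ones; within a group, larger
 * requests go first. */
static int cmp_req(const void *a, const void *b)
{
	const struct win_req *x = a, *y = b;

	if (x->has_devices != y->has_devices)
		return y->has_devices - x->has_devices;
	if (x->size != y->size)
		return x->size < y->size ? 1 : -1;
	return 0;
}

/* Grant requests in priority order from 'avail' bytes. Returns how many
 * requests could not be granted. The array is reordered in place. */
static size_t allocate(struct win_req *reqs, size_t n, unsigned long avail)
{
	size_t failed = 0;

	assert(reqs != NULL);
	qsort(reqs, n, sizeof(*reqs), cmp_req);
	for (size_t i = 0; i < n; i++) {
		if (reqs[i].size <= avail) {
			avail -= reqs[i].size;
			reqs[i].granted = 1;
		} else {
			reqs[i].granted = 0;
			failed++;
		}
	}
	return failed;
}
```

When space runs out, the request left without a window is an empty hotplug bridge rather than a bridge with a working device behind it.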
Comment 27 Ram Pai 2010-10-21 17:44:26 UTC
Ok. Jesse, so the plan of action is to fix this bug and bring back Yinghai's patch?

If yes, I will provide the patches. But this will surely be viewed as yet another band-aid by Linus. Linus, are you listening?

Anyway I will continue the discussion on lkml.
RP
Comment 28 Yinghai Lu 2010-10-21 19:17:48 UTC
On 10/21/2010 10:44 AM, bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=15960
> 
> 
> 
> 
> 
> --- Comment #27 from Ram Pai <linuxram@us.ibm.com>  2010-10-21 17:44:26 ---
> Ok. Jesse, the plan of action is to fix this bug, and bring back Yinghai's
> patch?
> 
I have an updated version locally.

[PATCH] pci: update bridge resources to get more big ranges in PCI assign unssigned

BIOS separates IO ranges between several IOHs, and on some slots, BIOS assigns
resources to a bridge, but stops assigning resources to the device under that
bridge, because the device needs a big resource.

So:
1. allocate resources and record the failed device resources
2. clear the BIOS assigned resources of the parent bridge of failing device
3. go back and call pci assign unassigned
4. if it still fails, go up the tree, clear more bridges, and try again

Use pci=realloc=mask to control it with io or mmio.
	1: io
	2: mmio
	3: io and mmio

Default is 0: disable all of reallocation

Signed-off-by: Yinghai Lu <yinghai@kernel.org>

---
 Documentation/kernel-parameters.txt |    4 +
 drivers/pci/pci.c                   |    7 ++
 drivers/pci/pci.h                   |    2 
 drivers/pci/setup-bus.c             |  115 +++++++++++++++++++++++++++++++++++-
 4 files changed, 125 insertions(+), 3 deletions(-)

Index: linux-2.6/drivers/pci/setup-bus.c
===================================================================
--- linux-2.6.orig/drivers/pci/setup-bus.c
+++ linux-2.6/drivers/pci/setup-bus.c
@@ -36,6 +36,8 @@ struct resource_list_x {
 	unsigned long flags;
 };
 
+unsigned long pci_realloc_mask = IORESOURCE_MEM;
+
 static void add_to_failed_list(struct resource_list_x *head,
 				 struct pci_dev *dev, struct resource *res)
 {
@@ -102,7 +104,8 @@ static void __assign_resources_sorted(st
 		res = list->res;
 		idx = res - &list->dev->resource[0];
 		if (pci_assign_resource(list->dev, idx)) {
-			if (fail_head && !pci_is_root_bus(list->dev->bus))
+			if (fail_head && !pci_is_root_bus(list->dev->bus) &&
+			    (res->flags & pci_realloc_mask))
 				add_to_failed_list(fail_head, list->dev, res);
 			res->start = 0;
 			res->end = 0;
@@ -830,11 +833,63 @@ static void pci_bus_dump_resources(struc
 	}
 }
 
+static int __init pci_bus_get_depth(struct pci_bus *bus)
+{
+	int depth = 0;
+	struct pci_dev *dev;
+
+	list_for_each_entry(dev, &bus->devices, bus_list) {
+		int ret;
+		struct pci_bus *b = dev->subordinate;
+		if (!b)
+			continue;
+
+		ret = pci_bus_get_depth(b);
+		if (ret + 1 > depth)
+			depth = ret + 1;
+	}
+
+	return depth;
+}
+static int __init pci_get_max_depth(void)
+{
+	int depth = 0;
+	struct pci_bus *bus;
+
+	list_for_each_entry(bus, &pci_root_buses, node) {
+		int ret;
+
+		ret = pci_bus_get_depth(bus);
+		if (ret > depth)
+			depth = ret;
+	}
+
+	return depth;
+}
+
+/*
+ * first try will not touch pci bridge res
+ * second and later try will clear small leaf bridge res
+ */
 void __init
 pci_assign_unassigned_resources(void)
 {
 	struct pci_bus *bus;
+	int tried_times = 0;
+	enum release_type rel_type = leaf_only;
+	struct resource_list_x head, *list;
+	unsigned long type_mask = IORESOURCE_IO | IORESOURCE_MEM |
+				  IORESOURCE_PREFETCH;
+	unsigned long failed_type;
+	int pci_try_num;
+	int max_depth = pci_get_max_depth();
+
+	head.next = NULL;
 
+	pci_try_num = max_depth + 1;
+	printk(KERN_DEBUG "PCI: max bus depth: %d\n", max_depth);
+
+again:
 	/* Depth first, calculate sizes and alignments of all
 	   subordinate buses. */
 	list_for_each_entry(bus, &pci_root_buses, node) {
@@ -842,10 +897,64 @@ pci_assign_unassigned_resources(void)
 	}
 	/* Depth last, allocate resources and update the hardware. */
 	list_for_each_entry(bus, &pci_root_buses, node) {
-		pci_bus_assign_resources(bus);
-		pci_enable_bridges(bus);
+		__pci_bus_assign_resources(bus, &head);
+	}
+	tried_times++;
+
+	/* any device complain? */
+	if (!head.next)
+		goto enable_and_dump;
+	failed_type = 0;
+	for (list = head.next; list;) {
+		failed_type |= list->flags;
+		list = list->next;
+	}
+
+	/* if reach the limit, don't want to try more */
+	failed_type &= type_mask;
+	if (tried_times >= pci_try_num) {
+		free_failed_list(&head);
+		goto enable_and_dump;
 	}
 
+	printk(KERN_DEBUG "PCI: No. %d try to assign unassigned res\n",
+			 tried_times + 1);
+
+	/* third times and later will not check if it is leaf */
+	if ((tried_times + 1) > 2)
+		rel_type = whole_subtree;
+
+	/*
+	 * Try to release leaf bridge's resources that doesn't fit resource of
+	 * child device under that bridge
+	 */
+	for (list = head.next; list;) {
+		bus = list->dev->bus;
+		pci_bus_release_bridge_resources(bus, list->flags & type_mask,
+						  rel_type);
+		list = list->next;
+	}
+	/* restore size and flags */
+	for (list = head.next; list;) {
+		struct resource *res = list->res;
+
+		res->start = list->start;
+		res->end = list->end;
+		res->flags = list->flags;
+		if (list->dev->subordinate)
+			res->flags = 0;
+
+		list = list->next;
+	}
+	free_failed_list(&head);
+
+	goto again;
+
+enable_and_dump:
+	/* Depth last, update the hardware. */
+	list_for_each_entry(bus, &pci_root_buses, node)
+		pci_enable_bridges(bus);
+
 	/* dump the resource on buses */
 	list_for_each_entry(bus, &pci_root_buses, node) {
 		pci_bus_dump_resources(bus);
Index: linux-2.6/Documentation/kernel-parameters.txt
===================================================================
--- linux-2.6.orig/Documentation/kernel-parameters.txt
+++ linux-2.6/Documentation/kernel-parameters.txt
@@ -1932,6 +1932,10 @@ and is between 256 and 4096 characters.
 				for broken drivers that don't call it.
 		skip_isa_align	[X86] do not align io start addr, so can
 				handle more pci cards
+		realloc=n	set mask for reallocating PCI bridge resources
+				1: I/O port
+				2: memory (default)
+				3: I/O port and memory
 		firmware	[ARM] Do not re-enumerate the bus but instead
 				just use the configuration from the
 				bootloader. This is currently used on
Index: linux-2.6/drivers/pci/pci.c
===================================================================
--- linux-2.6.orig/drivers/pci/pci.c
+++ linux-2.6/drivers/pci/pci.c
@@ -2997,6 +2997,13 @@ static int __init pci_setup(char *str)
 				pci_no_aer();
 			} else if (!strcmp(str, "nodomains")) {
 				pci_no_domains();
+			} else if (!strncmp(str, "realloc=", 8)) {
+				int realloc = memparse(str + 8, &str);
+				if (realloc > 0) {
+					/* IORESOURCE_IO is bit 8, IORESOURCE_MEM is bit 9 */
+					pci_realloc_mask = realloc << 8;
+					pci_realloc_mask &= IORESOURCE_IO | IORESOURCE_MEM;
+				}
 			} else if (!strncmp(str, "cbiosize=", 9)) {
 				pci_cardbus_io_size = memparse(str + 9, &str);
 			} else if (!strncmp(str, "cbmemsize=", 10)) {
Index: linux-2.6/drivers/pci/pci.h
===================================================================
--- linux-2.6.orig/drivers/pci/pci.h
+++ linux-2.6/drivers/pci/pci.h
@@ -224,6 +224,8 @@ static inline int pci_ari_enabled(struct
 	return bus->self && bus->self->ari_enabled;
 }
 
+extern unsigned long pci_realloc_mask;
+
 #ifdef CONFIG_PCI_QUIRKS
 extern int pci_is_reassigndev(struct pci_dev *dev);
 resource_size_t pci_specified_resource_alignment(struct pci_dev *dev);
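The retry logic in the patch above can be modeled outside the kernel. The sketch below is a simplified, hypothetical model: `assign_pass()` stands in for `__pci_bus_assign_resources()` filling the failed list, and it simply pretends to succeed once a given number of release passes have happened. It keeps the patch's policy of at most `max_depth + 1` attempts, releasing only leaf bridge windows before the second attempt and whole subtrees from the third attempt on.

```c
#include <assert.h>
#include <stdio.h>

/* Mirrors the two release policies named in the patch. */
enum release_type { leaf_only, whole_subtree };

/*
 * Policy from the patch: attempt 2 releases leaf bridges only;
 * attempt 3 and later release the whole subtree.
 */
static enum release_type release_policy(int next_attempt)
{
	return next_attempt > 2 ? whole_subtree : leaf_only;
}

/*
 * Hypothetical model of one assignment pass: it keeps failing until
 * `releases_needed` release passes have been done.
 * Returns 1 if some device failed, 0 if everything was assigned.
 */
static int assign_pass(int releases_done, int releases_needed)
{
	return releases_done < releases_needed;
}

/*
 * Retry loop modeled on pci_assign_unassigned_resources(): at most
 * max_depth + 1 attempts. Returns the number of attempts used and
 * sets *ok to 1 on success, 0 if the limit was reached.
 */
static int assign_with_retries(int max_depth, int releases_needed, int *ok)
{
	int try_num = max_depth + 1;
	int tried = 0;

	for (;;) {
		/* Before attempt N, N - 1 release passes have happened. */
		int failed = assign_pass(tried, releases_needed);

		tried++;
		if (!failed) {
			*ok = 1;
			return tried;
		}
		if (tried >= try_num) {
			/* Give up; the patch then enables bridges as they are. */
			*ok = 0;
			return tried;
		}
		printf("try %d: release %s\n", tried + 1,
		       release_policy(tried + 1) == leaf_only ?
		       "leaf bridges only" : "whole subtree");
	}
}
```

For example, with a bus depth of 2 and a setup that needs one release pass, the loop succeeds on the second attempt; with a depth of 1 and a setup that never fits, it stops after two attempts. Tying the limit to the bus depth presumably gives each level of bridges one chance to be released while still guaranteeing that the loop terminates.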
Comment 29 Ram Pai 2011-01-14 07:19:20 UTC
Created attachment 43502 [details]
patch with a proposed fix, and some debug messages in case the fix fails

Peter, 

Can you please try the attached patch and provide the dmesg output?

This patch has a proposed fix to your issue.

Hope you still have your setup to test this out,
thanks,
RP
Comment 30 Peter Sääf 2011-01-14 23:24:57 UTC
Created attachment 43652 [details]
dmesg output with last patch

Latest patch appears to work fine.
Comment 31 Ram Pai 2011-01-15 01:21:58 UTC
Created attachment 43672 [details]
patch with a proposed fix, and some debug messages in case the fix fails

Peter,

     Thanks for the dmesg. Your testing exposed a small problem with my patch.

Jan 15 00:15:55 darwin kernel: pci 0000:05:02.0: BAR 7: can't assign io (size 0x0)
Jan 15 00:15:55 darwin kernel: pci 0000:05:03.0: BAR 7: can't assign io (size 0x0)

     I have fixed the issue. Hopefully this time we won't see any of those failure messages. Once we reach that state, we will have to apply Yinghai's patch on top and confirm that your machine does not regress.

    For now, I request you to apply my new patch and provide me the dmesg output.

thanks,
RP
Comment 32 Peter Sääf 2011-01-15 14:54:17 UTC
Created attachment 43682 [details]
New log
Comment 33 Ram Pai 2011-01-17 19:22:51 UTC
Created attachment 43862 [details]
patch with a proposed fix, and some debug messages in case the fix fails

Peter,
      OK, it all looks good. Now we have to apply Yinghai's patch and check for regressions. If your device continues to work, we are good.
     BTW: I have tweaked my patch to take care of one other small issue.

     Apply my patch first. 
     I will attach Yinghai's patch. Apply that one next.

thanks in advance for your help,
RP
Comment 34 Ram Pai 2011-01-17 19:28:00 UTC
Created attachment 43872 [details]
yinghai's patch.  Allocate unallocated resources.

Yinghai,

   I have modified your patch to only do the allocation retries. The realloc_mask code can go in a separate patch, and I don't think we need that code right now.
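For reference, the `realloc=n` option from the earlier patch (the part left out of this version) turns `n` into a resource-type mask by shifting it left by 8 bits, so bit 0 of `n` lands on `IORESOURCE_IO` (0x100) and bit 1 on `IORESOURCE_MEM` (0x200). A standalone sketch of that computation follows; the flag values are those defined in the kernel's `include/linux/ioport.h`, and `apply_realloc()` is a hypothetical helper, not a kernel function.

```c
#include <assert.h>

/* Resource type flag values as defined in include/linux/ioport.h. */
#define IORESOURCE_IO  0x00000100UL
#define IORESOURCE_MEM 0x00000200UL

/*
 * Hypothetical helper mirroring the realloc= branch of pci_setup():
 * a positive value is shifted so that bit 0 becomes IORESOURCE_IO and
 * bit 1 becomes IORESOURCE_MEM; any other bits are masked off.
 * A value <= 0 leaves the current mask untouched.
 */
static unsigned long apply_realloc(unsigned long current_mask, long realloc)
{
	if (realloc > 0) {
		unsigned long mask = (unsigned long)realloc << 8;

		return mask & (IORESOURCE_IO | IORESOURCE_MEM);
	}
	return current_mask;
}
```

Values outside 1 to 3 are masked down, so `realloc=4` would produce an empty mask, while 0 leaves the existing mask unchanged, matching the `if (realloc > 0)` check in the patch.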

RP
Comment 35 Peter Sääf 2011-01-17 22:25:15 UTC
Created attachment 43892 [details]
Log with both patches applied
Comment 36 Peter Sääf 2011-01-17 22:28:57 UTC
I should have mentioned. The Xonar DX appears to be working as it should.
Comment 37 Ram Pai 2011-01-17 22:47:58 UTC
Thanks a lot!
 
The dmesg log also confirms that all devices got the resources they needed.

I think we are good. Unless I get pushed in another direction, I hope not to trouble you with more testing :)

thanks,
RP
Comment 38 Florian Mickler 2011-03-29 00:13:17 UTC
A patch referencing this bug report has been merged in v2.6.38-8876-g036a982:

commit c8adf9a3e873eddaaec11ac410a99ef6b9656938
Author: Ram Pai <linuxram@us.ibm.com>
Date:   Mon Feb 14 17:43:20 2011 -0800

    PCI: pre-allocate additional resources to devices only after successful allocation of essential resources.
Comment 39 Florian Mickler 2011-05-19 07:23:59 UTC
A patch referencing a commit referencing this bug report has been merged in v2.6.39:

commit 93d2175d3d31f11ba04fcfa0e9a496a1b4bc8b34
Author: Yinghai Lu <yinghai@kernel.org>
Date:   Fri May 13 18:06:17 2011 -0700

    PCI: Clear bridge resource flags if requested size is 0
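The `BAR 7: can't assign io (size 0x0)` messages in Comment 31 appear to come from bridge I/O windows that needed no space at all. Judging only from this commit's title, the idea is to drop the type flags of a bridge window whose requested size is 0, so the assignment pass skips it instead of reporting a failure. The sketch below illustrates that idea with a hypothetical, simplified `struct bridge_window`; it is not the kernel's `struct resource` or the actual code of the commit.

```c
#include <assert.h>

/* Hypothetical, simplified bridge window; not the kernel's struct resource. */
struct bridge_window {
	unsigned long start;
	unsigned long end;     /* inclusive end, as in struct resource */
	unsigned long flags;   /* nonzero type flags mean "allocate me" */
};

/*
 * Size a bridge window from what its children need. A request of 0
 * clears the flags so a later pass treats the window as unused
 * instead of failing with "can't assign io (size 0x0)".
 */
static void size_bridge_window(struct bridge_window *w, unsigned long size)
{
	if (size == 0) {
		w->start = 0;
		w->end = 0;
		w->flags = 0;
		return;
	}
	/* Placeholder range; a real allocator would pick the address. */
	w->start = 0;
	w->end = size - 1;
}

/* Hypothetical check used by an assignment pass. */
static int window_needs_assignment(const struct bridge_window *w)
{
	return w->flags != 0;
}
```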
Comment 40 Florian Mickler 2011-08-23 20:20:47 UTC
A patch referencing a commit referencing this bug report has been merged in Linux v3.1-rc3:

commit be768912a49b10b68e96fbd8fa3cab0adfbd3091
Author: Yinghai Lu <yinghai@kernel.org>
Date:   Mon Jul 25 13:08:38 2011 -0700

    PCI: honor child buses add_size in hot plug configuration
Comment 41 Yinghai Lu 2012-01-13 23:33:42 UTC
Clemens,

Can you check whether our changes break your setup?

please check:

	git://git.kernel.org/pub/scm/linux/kernel/git/yinghai/linux-yinghai.git for-pci2

Thanks
Comment 42 Florian Mickler 2012-04-04 15:05:22 UTC
A patch referencing this bug report has been merged in Linux v3.4-rc1:

commit 0c5be0cb0edfe3b5c4b62eac68aa2aa15ec681af
Author: Yinghai Lu <yinghai@kernel.org>
Date:   Thu Feb 23 19:23:29 2012 -0800

    PCI: Retry on IORESOURCE_IO type allocations
