Bug 111901 - Pre-2008 boards may fail to enable PATA device(s) via pata_amd | pata_acpi unless CRS is turned off by pci=nocrs
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: PCI
Hardware: All Linux
Importance: P1 normal
Assignee: drivers_pci@kernel-bugs.osdl.org
URL: https://bugs.launchpad.net/ubuntu/+so...
Keywords:
Depends on:
Blocks:
 
Reported: 2016-02-04 22:44 UTC by Andreas E
Modified: 2016-03-09 22:03 UTC

See Also:
Kernel Version: 4.3
Subsystem:
Regression: No
Bisected commit-id:


Attachments
.tgz of lspci -vv & hwinfo output from Biostar M7NCD Ultra motherboard (86.39 KB, application/octet-stream)
2016-02-05 20:01 UTC, Felix Miata
tar of 3x dmesg : 4.2.0 pci=crs + 4.2.0 pci=nocrs + 4.4.0 (150.00 KB, application/x-tar)
2016-02-10 00:12 UTC, Andreas E

Description Andreas E 2016-02-04 22:44:22 UTC
With a LOT of help from Oleg B. (Launchpad), we figured out that bug 111151 was actually a regression, but not a true bug.

Let me copy the error message again from the now-closed report:

pata_amd 0000:00:09.0: can't enable device: BAR 0 [io 0x01f0-0x01f7] not claimed
pata_amd: probe of 0000:00:09.0 failed with error -22

Actual reason for this is CRS, which has been enabled by default since commit 3d9fecf6bfb8b12bc2f9a4c7109895a2a2bb9436 on 9 June 2015.
Our old machines do not like this at all. :)

The man who came to our rescue was Jiang Liu, with commit 4d6b4e69a245e9df4b84dba387596086cb66887d on 14 October 2015.
Of course, it WORKS now, but all the earlier kernels are still broken.

So I'd call it a regression.

Now the question is: 

Is this supposed to remain like this?
Are you guys simply of the opinion that if you use an old machine, it's only fair that you have to work around this with pci=nocrs?
Or would you decide otherwise?
Comment 1 Andreas E 2016-02-04 22:47:36 UTC
P.S. Above all, it is not certain whether 4.4.x will actually make it into Ubuntu 16.04 (Xenial Xerus) final in April 2016. It may, or it may not...
Comment 2 Alan 2016-02-05 10:18:45 UTC
Not Intel - AMD 8)

Please email the author of the patch and ask them to submit it to stable citing the bugs. Don't expect a reply for a couple of weeks if Jiang Liu is in China as it is Chinese new year.
Comment 3 Oleg Blashchuk 2016-02-05 18:53:27 UTC
(In reply to Alan from comment #2)
> Not Intel - AMD 8)

Actually, it's not because of AMD; it seems to be a problem with the nVidia nForce2 Ultra 400 chipset. I've got plenty of legacy AMD-based computers, mainly on various VIA chipsets, and none of them are affected by this issue. Even computers with the nVidia nForce2 IGP chipset work fine. It seems that the Ultra 400 claims to support address ranges above 4GB, but the IDE controller on the corresponding south bridge does not actually support working in that area.
Comment 4 Felix Miata 2016-02-05 20:01:58 UTC
Created attachment 202991
.tgz of lspci -vv & hwinfo output from Biostar M7NCD Ultra motherboard

adding CC here as 
https://bugzilla.redhat.com/show_bug.cgi?id=1302071#c22
may have been this issue with this NForce2 400 Ultra motherboard:
http://www.biostar-usa.com/mbdetails.asp?model=M7NCD%20ULTRA
Comment 5 Andreas E 2016-02-05 21:06:47 UTC
to comment 2.

Alan, I set 'Intel' as a collective term supposed to include both i386 and x86-64. 
(at least I _thought_ the i originally stood for Intel)

Currently it's set to 64bit only. I'm on 32bit, though, FYI.

$10,000 question:
How can we set _both_ i386 and x86-64 in Hardware drop-down menu?
Looks impossible to me.
Comment 6 Andreas E 2016-02-05 21:12:51 UTC
and to comment 3:

FYI, I've got a DFI nfII Ultra-AL here, which, tadaa, indeed has an NForce2 Ultra _400_ chipset (+nForce2 MCP (Master Control Program)). So your assumption seems spot-on.

End Of Line.
Comment 7 Bjorn Helgaas 2016-02-08 20:50:47 UTC
Moving to drivers/pci category.
Comment 8 Bjorn Helgaas 2016-02-08 21:23:49 UTC
Hi, Andreas.  There isn't much information here for me to investigate.  I can't tell whether something is broken in v4.4, or whether something got fixed in v4.4 and you want to identify the fix and backport it to an older kernel.

1) Every regression is a bug.

2) A regression is a bug where something that used to work in an older kernel is broken in a newer kernel.

3) It is a bug if a system requires "pci=nocrs" to boot.  Our objective is to boot every system without any PCI kernel options.

I assume the problem is that v4.4 gives the following message and your pata_amd device doesn't work:

  pata_amd 0000:00:09.0: can't enable device: BAR 0 [io 0x01f0-0x01f7] not claimed

You marked this as a regression in v4.4, so I assume this used to work, but is broken in v4.4.  And apparently booting v4.4 with "pci=nocrs" is a workaround (this would still be a bug that we want to fix).

Please attach:

  - complete dmesg log of v4.4 with "pci=nocrs"
  - complete dmesg log of older, working kernel

Felix, I looked at your attachment.  Your logs do show a pata_amd device, but they don't show any problem with it, so I can't tell what problem you're seeing, and I can't tell whether it's the same as what Andreas is seeing.
Comment 9 Andreas E 2016-02-08 22:12:21 UTC
> I assume the problem is that v4.4 gives the following message and your
> pata_amd device doesn't work:
>
>   pata_amd 0000:00:09.0: can't enable device: BAR 0 [io 0x01f0-0x01f7] not
>   claimed

__Nooooo__. It's exactly the other way round, Bjorn. :) 
This is for kernels LESS than 4.4.0.
The 4.4.0's do NOT show that and they work.
BUT as you all know, 4.4.0 may still be called "unstable" for now.

BTW it's always these confusing fields...
Well Bugzilla asked me for 'Kernel Version'. That reads, "tell me the kernel version you're __currently using__". That IS 4.4.0, and indeed...

> something got fixed in v4.4 and you want to identify the fix and backport it
> to an older kernel.

THAT would be the best idea. I'm a rather unselfish person. Whilst I might continue using 4.4.0 on my private box, I could think of less well-off countries where so-called "old" boards are _still_ used in servers these days.
And this might help 1000s, if they (for safety reasons) would like to stick to a kernel _branch_ but simply agree on upgrading to another _patch level_ that has the fix included.

So yes, I'd vote for a backport.


> 2) A regression is a bug where something that used to work in an older kernel
> is broken in a newer kernel.

Thanks for the extra lesson in bug'ology. :P 
So "regression" would have been valid if I had lied about my kernel version and put in 4.3.5, whilst in reality using 4.4.0.


>3) It is a bug if a system requires "pci=nocrs" to boot. 

Yes it is. This bug applies to kernels LESS than 4.4.0, mind you once more, e. g. 4.3.5.
Comment 10 Bjorn Helgaas 2016-02-08 22:40:19 UTC
This is a bug report.  We want to know the version of the kernel with the bug.  I changed this to v4.3 and marked it as "not a regression."

If you want to locate the fix so it can be backported, you could bisect between v4.3 and v4.4 and figure out where it started working.
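
For readers unfamiliar with bisecting toward a fix rather than toward a breakage, the search can be sketched as below. This is only an illustration of the idea, not the git implementation: `first_fixed` and the `is_fixed` callback are hypothetical names, and the sketch assumes the commits are ordered oldest to newest, the first one is known broken (v4.3), the last one is known working (v4.4), and the behavior flips exactly once in between. In practice `is_fixed` would mean "build this commit, boot it without pci=nocrs, and check whether pata_amd probes successfully".

```python
def first_fixed(commits, is_fixed):
    """Return the index of the first commit for which is_fixed() is True.

    Assumptions (hypothetical, for illustration only):
      - commits are ordered oldest -> newest
      - commits[0] is known broken, commits[-1] is known fixed
      - there is exactly one broken -> fixed transition
    """
    lo, hi = 0, len(commits) - 1  # lo: known broken, hi: known fixed
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_fixed(commits[mid]):
            hi = mid   # the fix is at mid or earlier
        else:
            lo = mid   # the fix is after mid
    return hi


# Toy stand-in for a commit range: commit number 37 carries the fix.
commits = list(range(100))
print(first_fixed(commits, lambda c: c >= 37))
```

Each step halves the remaining range, so even a range of several thousand commits needs only a dozen or so test boots.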
Comment 11 Andreas E 2016-02-10 00:12:46 UTC
Created attachment 203251
tar of 3x dmesg : 4.2.0 pci=crs + 4.2.0 pci=nocrs + 4.4.0
Comment 12 Andreas E 2016-02-10 00:18:50 UTC
Alright, here we go!

FAIL : 4.2.0-16 + pci=crs (default)

SUCC : 4.2.0-16 + pci=nocrs (custom)

SUCC : 4.4.0-2 (default)

p.s. Sorry for the delay, I must have messed up something while re-installing the 4.2 kernel. Turned out that my system _absolutely_ needs linux-image-extra (i. e. the additional modules) and otherwise would simply bail out with "Can't find root device". THIS IS A LIE. It can find root device, but it needs a special driver to continue which is only in linux-image-extra. Wish there was a way that the kernel messages could finally distinguish between a root device simply __not found__ and modularized drivers that it may need at early startup. Especially newbies might have 50% more chance of figuring out why it can't boot.
Comment 13 Andreas E 2016-02-10 00:21:35 UTC
P.P.S. Why the hell is EISA support still compiled in there?!! EISA is yet another 10 years older than my whole machine (which is "old" itself).
Comment 14 Bjorn Helgaas 2016-03-09 22:03:59 UTC
Questions of "can't find root device" and EISA support are distractions.  If something needs to be done about them, they should be separate bug reports.

The PATA problem ("pata_amd 0000:00:09.0: can't enable device: BAR 0 [io 0x01f0-0x01f7] not claimed") is resolved in v4.4.  As far as I can tell, the only remaining question is whether we can identify a fix to be backported to older kernels, so that's what I'll try here.

In v4.2 we default to using _CRS:

  acpi PNP0A03:00: host bridge window expanded to [io  0x0000-0x0cfe]; [io  0x0000-0x0cfe window] ignored
  acpi PNP0A03:00: ignoring host bridge window [io  0x0000-0x0cfe] (conflicts with PCI conf1 [io  0x0cf8-0x0cff])
  pci_bus 0000:00: root bus resource [io  0x0d00-0xffff window]

In v4.2 with "pci=nocrs":

  acpi PNP0A03:00: host bridge window [io  0x0cf0-0x0cf3] (ignored)
  acpi PNP0A03:00: host bridge window [io  0x0000-0x0cfe window] (ignored)
  acpi PNP0A03:00: host bridge window [io  0x0d00-0xffff window] (ignored)
  pci_bus 0000:00: root bus resource [io  0x0000-0xffff]

In v4.4:

  acpi PNP0A03:00: host bridge window expanded to [io  0x0000-0x0cff]; [io  0x0000-0x0cfe window] ignored
  pci_bus 0000:00: root bus resource [io  0x0000-0x0cff]
  pci_bus 0000:00: root bus resource [io  0x0cf0-0x0cf3]
  pci_bus 0000:00: root bus resource [io  0x0d00-0xffff window]
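
One way to read these three excerpts against the failing BAR is sketched below. It assumes that the excerpts list every I/O root bus resource for each boot, and that "not claimed" roughly means "no root bus window contains the BAR"; both are simplifying assumptions for illustration, and the helper names `contains` and `claimable` are made up here, not kernel functions.

```python
# Root bus I/O resources as quoted in the log excerpts above (inclusive ranges).
ROOT_IO = {
    "v4.2 (_CRS)": [(0x0D00, 0xFFFF)],
    "v4.2 pci=nocrs": [(0x0000, 0xFFFF)],
    "v4.4 (_CRS)": [(0x0000, 0x0CFF), (0x0CF0, 0x0CF3), (0x0D00, 0xFFFF)],
}

# BAR 0 of 0000:00:09.0, taken from the pata_amd error message.
BAR0 = (0x01F0, 0x01F7)


def contains(window, res):
    """True if the inclusive range `res` lies entirely inside `window`."""
    return window[0] <= res[0] and res[1] <= window[1]


def claimable(windows, res):
    """True if at least one window fully contains `res`."""
    return any(contains(w, res) for w in windows)


for name, windows in ROOT_IO.items():
    print(f"{name}: BAR 0 inside a root bus window -> {claimable(windows, BAR0)}")
```

Under those assumptions, only the v4.2 boot with _CRS leaves the legacy IDE ports 0x01f0-0x01f7 outside every root bus window, which matches the FAIL/SUCC results reported in comment 12.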

I think the problem is that _CRS gives us these I/O windows:

  [io  0x0cf0-0x0cf3]
  [io  0x0000-0x0cfe window]
  [io  0x0d00-0xffff window]

and Linux isn't handling them correctly.  It looks like v4.2 expands [io  0x0cf0-0x0cf3] to [io  0x0000-0x0cfe], and then discards the whole thing because it conflicts with [io  0x0cf8-0x0cff].  That 0xcf8-0xcff region is hard-coded into x86 in pci_direct_probe().  v4.4 works because it expands that region a bit more, to [io  0x0000-0x0cff], which doesn't conflict because it completely contains [io  0x0cf8-0x0cff].

I don't know yet why v4.4 expands it differently.
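
The difference between the two expanded windows can be shown with a small range check. This is a simplified model of the behavior described above, not the kernel's resource code: it treats a window as conflicting when it intersects the reserved PCI conf1 region without fully containing it. The function names are hypothetical.

```python
def overlaps(a, b):
    """True if the inclusive ranges a and b share at least one address."""
    return a[0] <= b[1] and b[0] <= a[1]


def contains(outer, inner):
    """True if `inner` lies entirely inside `outer`."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def conflicts(window, reserved):
    """Simplified rule: intersecting without containing counts as a conflict."""
    return overlaps(window, reserved) and not contains(window, reserved)


CONF1 = (0x0CF8, 0x0CFF)       # PCI conf1 ports
V42_WINDOW = (0x0000, 0x0CFE)  # window as expanded by v4.2
V44_WINDOW = (0x0000, 0x0CFF)  # window as expanded by v4.4

print("v4.2 window conflicts with conf1:", conflicts(V42_WINDOW, CONF1))
print("v4.4 window conflicts with conf1:", conflicts(V44_WINDOW, CONF1))
```

The v4.2 window ends one port short of 0x0cff, so it cuts through the conf1 region; the v4.4 window covers 0x0cf8-0x0cff completely and can simply be its parent.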
