Bug 16228

Summary: BUG/boot failure on Dell Precision T3500 (pci/ahci_stop_engine)
Product: Drivers Reporter: Brian Bloniarz (brian.bloniarz)
Component: PCIAssignee: Bjorn Helgaas (bjorn.helgaas)
Status: CLOSED CODE_FIX    
Severity: normal CC: avi, bjorn.helgaas, chemobejk, dchris, florian, jpl, kernelbmw2017, linux-bugs, maciej.rutecki, Matt_Domsch, rezwanul_kabir, rjw, tanmoylaskar, yinghai
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.35-rc3 Subsystem:
Regression: Yes Bisected commit-id:
Bug Depends on:    
Bug Blocks: 16055    
Attachments: Boot log 2.6.35-rc3 (failure leading to ahci_stop_engine BUG)
Boot log 2.6.34 (timeout waiting for root device)
Boot log 2.6.33.5 (successful)
Boot log 2.6.35-rc3 w/ pci=nocrs (successful)
allow BAR override
/proc/iomem as it appears on 2.6.33.5
Boot log 2.6.35-rc3 w/ pci=bar=0000:00:1f.2[24]=0xfa000000 (successful)
Boot log 2.6.35-rc3 w/ pci=bar=0000:00:1f.2[24]=0xff970800 (fail)
Bootlog 2.6.35-rc3 with Yinghai's reserve_region_with_split_check_child patch
Bootlog 2.6.35-rc3 with Yinghai's -v4 patch
dmidecode output from the machine
print PNP resources
dmesg, acpidump & DSDT after applying PNP patch
Everest report of Windows resources
allocate from top down within a region
allocate from top down within a region (v2)
dmesg log from 2.6.34 with patch V2 applied
dmesg log from 2.6.34 with patch V3 applied and pci=use_crs
alloc top-down (v3)
align to end of free space
use highest available gap first
use highest bridge windows first
handling of e820 reserved entries
dmesg log from 2.6.34 with patch V4 applied and pci=use_crs
/proc/iomem from 2.6.34 with patch V4 applied and pci=use_crs
avoid e820 regions

Description Brian Bloniarz 2010-06-16 17:57:11 UTC
Created attachment 26811 [details]
Boot log 2.6.35-rc3 (failure leading to ahci_stop_engine BUG)

2.6.35-rc3 does not boot on my Dell Precision T3500 due to
PCI issues, culminating in:

> BUG: unable to handle kernel paging request at ffffc90012621018

in ahci_stop_engine. Other things I've tried:

v2.6.35-rc3, pci=nocrs: boots successfully.

v2.6.35-rc1: has same BUG as v2.6.35-rc3

v2.6.34: fails to boot (timeout on the root device), but no BUG.

v2.6.33: boots successfully

The logs from these boots are attached.
Comment 1 Brian Bloniarz 2010-06-16 17:58:24 UTC
Created attachment 26812 [details]
Boot log 2.6.34 (timeout waiting for root device)
Comment 2 Brian Bloniarz 2010-06-16 17:59:07 UTC
Created attachment 26813 [details]
Boot log 2.6.33.5 (successful)
Comment 3 Brian Bloniarz 2010-06-16 17:59:40 UTC
Created attachment 26814 [details]
Boot log 2.6.35-rc3 w/ pci=nocrs (successful)
Comment 4 Bjorn Helgaas 2010-06-16 22:57:12 UTC
Created attachment 26819 [details]
allow BAR override

Your root device is connected via AHCI, so the 2.6.34 timeout looks
like it's caused by the same issue of the 00:1f.2 AHCI controller
not working.

We read IRQ info from PCI config space, so the fact that we did read
that tells us nothing about the BAR being moved (my mistake).

My only theory is that there may be two BIOS bugs:
  1) BIOS left 0000:00:1f.2 at [mem 0xff970000-0xff9707ff], which is
     outside all the windows on the PNP0A03:00 host bridge, and
  2) The reported host bridge window [mem 0xbff00000-0xdfffffff] is actually
     wrong, and putting the device at [mem 0xbff00000-0xbff007ff] doesn't
     work.

Linux looks for unused areas starting at low addresses, and Windows
starts from the top and goes down.  If the beginning of that host
bridge window is incorrect, Linux would be more likely to trip over it
than Windows.

Maybe you could use this patch and experiment with placing the device
at different places.  I think the easiest thing is to boot 2.6.33.5 first
and collect the /proc/iomem output (please also attach that here).  That
will help find unused areas.

After you find a possible spot (I can help, given the /proc/iomem output),
you can try things like "pci=bar=0000:00:1f.2[24]=0xdfff0000" to reassign
that BAR 5.
Comment 5 Brian Bloniarz 2010-06-17 15:19:34 UTC
Created attachment 26825 [details]
/proc/iomem as it appears on 2.6.33.5
Comment 6 Bjorn Helgaas 2010-06-17 15:59:20 UTC
Thanks, Brian.  Would you mind making your attachments "text/plain",
so firefox can easily open them?

Here are some things to try.  These are working down from high addresses,
so I think they're more like what Windows would do.

  pci=bar=0000:00:1f.2[24]=0xff980800
  pci=bar=0000:00:1f.2[24]=0xff970800
  pci=bar=0000:00:1f.2[24]=0xfed9f800
  pci=bar=0000:00:1f.2[24]=0xfbfff800
Comment 7 Brian Bloniarz 2010-06-17 16:03:01 UTC
Created attachment 26826 [details]
Boot log 2.6.35-rc3 w/ pci=bar=0000:00:1f.2[24]=0xfa000000 (successful)

Could you give me some ideas for addresses to try? I don't know much about PCI.

My first shot in the dark was "pci=bar=0000:00:1f.2[24]=0xfa000000", and that boots successfully. I'm attaching the bootlog from that, if that helps.
Comment 8 Bjorn Helgaas 2010-06-17 16:37:42 UTC
That's perfect, thanks!

The E820 memory map from BIOS reports a range that overlaps the first
megabyte (0xbff00000-0xbfffffff) of that host bridge window:

  BIOS-e820: 0000000000100000 - 00000000bfdf9c00 (usable)
  BIOS-e820: 00000000bfdf9c00 - 00000000bfe4bc00 (ACPI NVS)
  BIOS-e820: 00000000bfe4bc00 - 00000000bfe4dc00 (ACPI data)
  BIOS-e820: 00000000bfe4dc00 - 00000000c0000000 (reserved)
  pci_root PNP0A03:00: host bridge window [mem 0xbff00000-0xdfffffff]

The 0x100000-0xbfdf9c00 range is system RAM, and I would expect that
the ACPI NVS and data area is also RAM, and the last reserved piece is
probably RAM, too, because RAM tends to end at nicely aligned boundaries.

Table 14-1 in the ACPI 4.0 spec says "reserved" ranges from the E820
table are "unsuitable for a standard device to use as a device memory
space."

I think maybe Linux should regard those reserved ranges as unavailable
for allocation to PCI devices, even if they happen to be included in a
host bridge window.  What do you think, Yinghai?
Comment 9 Brian Bloniarz 2010-06-17 16:46:19 UTC
Awesome, sounds like you have a theory. I also went through the 4 locations you suggested, I have boot logs for those.

pci=bar=0000:00:1f.2[24]=0xff980800: succeeds

pci=bar=0000:00:1f.2[24]=0xff970800: fails
It gives identical BUG behavior. However, the log also contains:
[    3.283225] pci 0000:00:1f.2: no compatible bridge window for [mem 0xff970800-0xff970fff]
So is there something funny about that address?

pci=bar=0000:00:1f.2[24]=0xfed9f800: succeeds

pci=bar=0000:00:1f.2[24]=0xfbfff800: succeeds.
Comment 10 Brian Bloniarz 2010-06-17 16:48:29 UTC
Created attachment 26827 [details]
Boot log 2.6.35-rc3 w/ pci=bar=0000:00:1f.2[24]=0xff970800 (fail)
Comment 11 Bjorn Helgaas 2010-06-17 16:52:11 UTC
Oops, the 0xff970800 failure is my fault, I should have known that
would fail, because it's not inside the [mem 0xff97c000-0xff97ffff]
window.  I was looking at the /proc/iomem from 2.6.33, and it had the
device at 0xff970000, so I figure we could put it right after that.
The only reason the device was there is that 2.6.33 just ignores the
host bridge windows.

Anyway, I think we just need to figure out the best way to ignore the
0xbff00000-0xbfffffff range.  Hopefully we can find a generic way that
will fix similar issues on other boxes, too.
Comment 12 Yinghai Lu 2010-06-17 18:43:15 UTC
On 06/17/2010 09:37 AM, bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=16228
> 
> 
> Bjorn Helgaas <bjorn.helgaas@hp.com> changed:
> 
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>                  CC|                            |yinghai@kernel.org
> 
> 
> 
> 
> --- Comment #8 from Bjorn Helgaas <bjorn.helgaas@hp.com>  2010-06-17 16:37:42
> ---
> That's perfect, thanks!
> 
> The E820 memory map from BIOS reports a range that overlaps the first
> megabyte (0xbff00000-0xbfffffff) of that host bridge window:
> 
>   BIOS-e820: 0000000000100000 - 00000000bfdf9c00 (usable)
>   BIOS-e820: 00000000bfdf9c00 - 00000000bfe4bc00 (ACPI NVS)
>   BIOS-e820: 00000000bfe4bc00 - 00000000bfe4dc00 (ACPI data)
>   BIOS-e820: 00000000bfe4dc00 - 00000000c0000000 (reserved)
>   pci_root PNP0A03:00: host bridge window [mem 0xbff00000-0xdfffffff]
> 
> The 0x100000-0xbfdf9c00 range is system RAM, and I would expect that
> the ACPI NVS and data area is also RAM, and the last reserved piece is
> probably RAM, too, because RAM tends to end at nicely aligned boundaries.
> 
> Table 14-1 in the ACPI 4.0 spec says "reserved" ranges from the E820
> table are "unsuitable for a standard device to use as a device memory
> space."
> 
> I think maybe Linux should regard those reserved ranges as unavailable
> for allocation to PCI devices, even if they happen to be included in a
> host bridge window.  What do you think, Yinghai?

for above 1M area, 
current kernel will honor setting from HW register and later will use insert_resource_expand_to_fit()
with e820 reserved entries.
so in this case will have one big 0xbfe4dc00 - 0xe000000 reserved entry in /proc/iomem as parent of
0xbff00000 - 0xe0000000

Sane BIOS should not put allocated range to PCI bridge/device into e820 table as reserved entries.

But there IS some BIOS doing that.
So Linus decided to trust setting from PCI bar at first.

later for pci_assign_unsigned, We should avoid those range.

one solution is expanding my one of old version for below 1M handling to:
use reserve_region_with_split instead.
will get /proc/iomem
bfe4dc00 - bfefffff reserved
bff00000 - dfffffff PCI BUS #00
  bff00000 - bfffffff reserved
  ...

so will use bff0000 - bfffffff reserved as holder to prevent those range to be allocated to unassigned devices.
reserve_region_with_split need to need be updated a little bit to make sure it will not put holder on range with children already.

something like this

Subject: [PATCH] x86, resource: Add reserve_region_with_split_check_child()

It will cover the whole region to BUSY, except that some regions that have
children under them.

those children normally is PCI bar but it is falling into E820_RESERVED.
We can not put BUSY on them, otherwise driver can not use pci_request_region()
later

/proc/iomem will have
00010000-00094fff : System RAM
00095000-0009ffff : reserved
000a0000-000bffff : PCI Bus 0000:00
  000a0000-000bffff : reserved
000c0000-000cffff : reserved
000d0000-000dffff : PCI Bus 0000:00
  000d0000-000dffff : reserved
000e0000-000fffff : reserved

-v2: Add function pointer to put string comparing with caller
-v3: expand to use it above 1M resources.

Tested-by: Guenter Roeck <guenter.roeck@ericsson.com>
Tested-by: Andy Isaacson <adi@hexapodia.org>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>

---
 arch/x86/kernel/e820.c |   15 ++++++++++++---
 include/linux/ioport.h |    3 +++
 kernel/resource.c      |   29 ++++++++++++++++++++++++-----
 3 files changed, 39 insertions(+), 8 deletions(-)

Index: linux-2.6/arch/x86/kernel/e820.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/e820.c
+++ linux-2.6/arch/x86/kernel/e820.c
@@ -1094,7 +1094,7 @@ void __init e820_reserve_resources(void)
 		 * pci device BAR resource and insert them later in
 		 * pcibios_resource_survey()
 		 */
-		if (e820.map[i].type != E820_RESERVED || res->start < (1ULL<<20)) {
+		if (e820.map[i].type != E820_RESERVED) {
 			res->flags |= IORESOURCE_BUSY;
 			insert_resource(&iomem_resource, res);
 		}
@@ -1128,6 +1128,14 @@ static unsigned long ram_alignment(resou
 
 #define MAX_RESOURCE_SIZE ((resource_size_t)-1)
 
+static int __init check_func(struct resource *cf)
+{
+	if (strstr(cf->name, "PCI Bus"))
+		return 1;
+
+        return 0;
+}
+
 void __init e820_reserve_resources_late(void)
 {
 	int i;
@@ -1135,8 +1143,9 @@ void __init e820_reserve_resources_late(
 
 	res = e820_res;
 	for (i = 0; i < e820.nr_map; i++) {
-		if (!res->parent && res->end)
-			insert_resource_expand_to_fit(&iomem_resource, res);
+		if (!res->parent && res->end) {
+			reserve_region_with_split_check_child(&iomem_resource, res->start, res->end, res->name, check_func);
+		}
 		res++;
 	}
 
Index: linux-2.6/include/linux/ioport.h
===================================================================
--- linux-2.6.orig/include/linux/ioport.h
+++ linux-2.6/include/linux/ioport.h
@@ -120,6 +120,9 @@ void release_child_resources(struct reso
 extern void reserve_region_with_split(struct resource *root,
 			     resource_size_t start, resource_size_t end,
 			     const char *name);
+void reserve_region_with_split_check_child(struct resource *root,
+			     resource_size_t start, resource_size_t end,
+			     const char *name, int (*check_func)(struct resource *cf));
 extern struct resource *insert_resource_conflict(struct resource *parent, struct resource *new);
 extern int insert_resource(struct resource *parent, struct resource *new);
 extern void insert_resource_expand_to_fit(struct resource *root, struct resource *new);
Index: linux-2.6/kernel/resource.c
===================================================================
--- linux-2.6.orig/kernel/resource.c
+++ linux-2.6/kernel/resource.c
@@ -607,9 +607,14 @@ int adjust_resource(struct resource *res
 	return result;
 }
 
+static int __init check_func_nop(struct resource *cf)
+{
+	return 1;
+}
+
 static void __init __reserve_region_with_split(struct resource *root,
 		resource_size_t start, resource_size_t end,
-		const char *name)
+		const char *name, bool check_child, int (*check_func)(struct resource *cf))
 {
 	struct resource *parent = root;
 	struct resource *conflict;
@@ -631,13 +636,18 @@ static void __init __reserve_region_with
 	kfree(res);
 
 	/* conflict covered whole area */
-	if (conflict->start <= start && conflict->end >= end)
+	if (conflict->start <= start && conflict->end >= end) {
+		if (check_child && !conflict->child && check_func(conflict))
+			__reserve_region_with_split(conflict, start, end, name, false, check_func_nop);
 		return;
+	}
 
 	if (conflict->start > start)
-		__reserve_region_with_split(root, start, conflict->start-1, name);
+		__reserve_region_with_split(root, start, conflict->start-1, name, check_child, check_func);
 	if (conflict->end < end)
-		__reserve_region_with_split(root, conflict->end+1, end, name);
+		__reserve_region_with_split(root, conflict->end+1, end, name, check_child, check_func);
+	if (check_child && !conflict->child && check_func(conflict))
+		__reserve_region_with_split(conflict, conflict->start, conflict->end, name, false, check_func_nop);
 }
 
 void __init reserve_region_with_split(struct resource *root,
@@ -645,7 +655,16 @@ void __init reserve_region_with_split(st
 		const char *name)
 {
 	write_lock(&resource_lock);
-	__reserve_region_with_split(root, start, end, name);
+	__reserve_region_with_split(root, start, end, name, false, check_func_nop);
+	write_unlock(&resource_lock);
+}
+
+void __init reserve_region_with_split_check_child(struct resource *root,
+		resource_size_t start, resource_size_t end,
+		const char *name, int (*check_func)(struct resource *cf))
+{
+	write_lock(&resource_lock);
+	__reserve_region_with_split(root, start, end, name, true, check_func);
 	write_unlock(&resource_lock);
 }
Comment 13 Brian Bloniarz 2010-06-17 22:26:27 UTC
Created attachment 26832 [details]
Bootlog 2.6.35-rc3 with Yinghai's reserve_region_with_split_check_child patch

I tried booting with this reserve_region_with_split_check_child() patch, my machine still doesn't boot. The bootlog & /proc/iomem is attached if that helps at all. It seems like it tries to use the same address as the unpatched 2.6.35-rc3?
Comment 14 Yinghai Lu 2010-06-17 23:37:52 UTC
On 06/17/2010 03:26 PM, bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=16228
> 
> 
> 
> 
> 
> --- Comment #13 from Brian Bloniarz <phunge0@hotmail.com>  2010-06-17
> 22:26:27 ---
> Created an attachment (id=26832)
>  --> (https://bugzilla.kernel.org/attachment.cgi?id=26832)
> Bootlog 2.6.35-rc3 with Yinghai's reserve_region_with_split_check_child patch
> 
> I tried booting with this reserve_region_with_split_check_child() patch, my
> machine still doesn't boot. The bootlog & /proc/iomem is attached if that
> helps
> at all. It seems like it tries to use the same address as the unpatched
> 2.6.35-rc3?
> 

you got
bfe4dc00-bfefffff : reserved
bff00000-dfffffff : PCI Bus 0000:00
  bff00000-bff007ff : 0000:00:1f.2
    bff00000-bff007ff : ahci
  c0000000-cfffffff : PCI Bus 0000:02


should have
bfe4dc00-bfefffff : reserved
bff00000-dfffffff : PCI Bus 0000:00
  bff00000-bfffffff : reserved
  c0000000-cfffffff : PCI Bus 0000:02

please check this version

Subject: [PATCH] x86, resource: Add reserve_region_with_split_check_child()

It will cover the whole region to BUSY, except that some regions that have
children under them.

those children normally is PCI bar but it is falling into E820_RESERVED.
We can not put BUSY on them, otherwise driver can not use pci_request_region()
later

/proc/iomem will have
00010000-00094fff : System RAM
00095000-0009ffff : reserved
000a0000-000bffff : PCI Bus 0000:00
  000a0000-000bffff : reserved
000c0000-000cffff : reserved
000d0000-000dffff : PCI Bus 0000:00
  000d0000-000dffff : reserved
000e0000-000fffff : reserved

-v2: Add function pointer to put string comparing with caller
-v3: expand to use it above 1M resources.
-v4: handle partial overlap case

Tested-by: Guenter Roeck <guenter.roeck@ericsson.com>
Tested-by: Andy Isaacson <adi@hexapodia.org>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>

---
 arch/x86/kernel/e820.c |   15 ++++++++++++---
 include/linux/ioport.h |    3 +++
 kernel/resource.c      |   35 ++++++++++++++++++++++++++++++-----
 3 files changed, 45 insertions(+), 8 deletions(-)

Index: linux-2.6/arch/x86/kernel/e820.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/e820.c
+++ linux-2.6/arch/x86/kernel/e820.c
@@ -1094,7 +1094,7 @@ void __init e820_reserve_resources(void)
 		 * pci device BAR resource and insert them later in
 		 * pcibios_resource_survey()
 		 */
-		if (e820.map[i].type != E820_RESERVED || res->start < (1ULL<<20)) {
+		if (e820.map[i].type != E820_RESERVED) {
 			res->flags |= IORESOURCE_BUSY;
 			insert_resource(&iomem_resource, res);
 		}
@@ -1128,6 +1128,14 @@ static unsigned long ram_alignment(resou
 
 #define MAX_RESOURCE_SIZE ((resource_size_t)-1)
 
+static int __init check_func(struct resource *cf)
+{
+	if (strstr(cf->name, "PCI Bus"))
+		return 1;
+
+        return 0;
+}
+
 void __init e820_reserve_resources_late(void)
 {
 	int i;
@@ -1135,8 +1143,9 @@ void __init e820_reserve_resources_late(
 
 	res = e820_res;
 	for (i = 0; i < e820.nr_map; i++) {
-		if (!res->parent && res->end)
-			insert_resource_expand_to_fit(&iomem_resource, res);
+		if (!res->parent && res->end) {
+			reserve_region_with_split_check_child(&iomem_resource, res->start, res->end, res->name, check_func);
+		}
 		res++;
 	}
 
Index: linux-2.6/include/linux/ioport.h
===================================================================
--- linux-2.6.orig/include/linux/ioport.h
+++ linux-2.6/include/linux/ioport.h
@@ -120,6 +120,9 @@ void release_child_resources(struct reso
 extern void reserve_region_with_split(struct resource *root,
 			     resource_size_t start, resource_size_t end,
 			     const char *name);
+void reserve_region_with_split_check_child(struct resource *root,
+			     resource_size_t start, resource_size_t end,
+			     const char *name, int (*check_func)(struct resource *cf));
 extern struct resource *insert_resource_conflict(struct resource *parent, struct resource *new);
 extern int insert_resource(struct resource *parent, struct resource *new);
 extern void insert_resource_expand_to_fit(struct resource *root, struct resource *new);
Index: linux-2.6/kernel/resource.c
===================================================================
--- linux-2.6.orig/kernel/resource.c
+++ linux-2.6/kernel/resource.c
@@ -607,9 +607,14 @@ int adjust_resource(struct resource *res
 	return result;
 }
 
+static int __init check_func_nop(struct resource *cf)
+{
+	return 1;
+}
+
 static void __init __reserve_region_with_split(struct resource *root,
 		resource_size_t start, resource_size_t end,
-		const char *name)
+		const char *name, bool check_child, int (*check_func)(struct resource *cf))
 {
 	struct resource *parent = root;
 	struct resource *conflict;
@@ -631,13 +636,24 @@ static void __init __reserve_region_with
 	kfree(res);
 
 	/* conflict covered whole area */
-	if (conflict->start <= start && conflict->end >= end)
+	if (conflict->start <= start && conflict->end >= end) {
+		if (check_child && !conflict->child && check_func(conflict))
+			__reserve_region_with_split(conflict, start, end, name, false, check_func_nop);
 		return;
+	}
 
 	if (conflict->start > start)
-		__reserve_region_with_split(root, start, conflict->start-1, name);
+		__reserve_region_with_split(root, start, conflict->start-1, name, check_child, check_func);
 	if (conflict->end < end)
-		__reserve_region_with_split(root, conflict->end+1, end, name);
+		__reserve_region_with_split(root, conflict->end+1, end, name, check_child, check_func);
+
+	if (check_child && check_func(conflict)) {
+		resource_size_t common_start, common_end;
+
+		common_start = max(start, conflict->start);
+		common_end = min(end, conflict->end);
+		__reserve_region_with_split(conflict, common_start, common_end, name, check_child, check_func);
+	}
 }
 
 void __init reserve_region_with_split(struct resource *root,
@@ -645,7 +661,16 @@ void __init reserve_region_with_split(st
 		const char *name)
 {
 	write_lock(&resource_lock);
-	__reserve_region_with_split(root, start, end, name);
+	__reserve_region_with_split(root, start, end, name, false, check_func_nop);
+	write_unlock(&resource_lock);
+}
+
+void __init reserve_region_with_split_check_child(struct resource *root,
+		resource_size_t start, resource_size_t end,
+		const char *name, int (*check_func)(struct resource *cf))
+{
+	write_lock(&resource_lock);
+	__reserve_region_with_split(root, start, end, name, true, check_func);
 	write_unlock(&resource_lock);
 }
Comment 15 Brian Bloniarz 2010-06-18 14:06:03 UTC
Created attachment 26844 [details]
Bootlog 2.6.35-rc3 with Yinghai's -v4 patch

This -v4 patch did the trick for me! The bootlog & /proc/iomem is attached. /proc/iomem shows up as you said it should.
Comment 16 Jesse Barnes 2010-06-18 17:38:21 UTC
On Thu, 17 Jun 2010 16:36:39 -0700
Yinghai Lu <yinghai@kernel.org> wrote:

> On 06/17/2010 03:26 PM, bugzilla-daemon@bugzilla.kernel.org wrote:
> > https://bugzilla.kernel.org/show_bug.cgi?id=16228
> > 
> > 
> > 
> > 
> > 
> > --- Comment #13 from Brian Bloniarz <phunge0@hotmail.com>  2010-06-17
> 22:26:27 ---
> > Created an attachment (id=26832)
> >  --> (https://bugzilla.kernel.org/attachment.cgi?id=26832)
> > Bootlog 2.6.35-rc3 with Yinghai's reserve_region_with_split_check_child
> patch
> > 
> > I tried booting with this reserve_region_with_split_check_child() patch, my
> > machine still doesn't boot. The bootlog & /proc/iomem is attached if that
> helps
> > at all. It seems like it tries to use the same address as the unpatched
> > 2.6.35-rc3?
> > 
> 
> you got
> bfe4dc00-bfefffff : reserved
> bff00000-dfffffff : PCI Bus 0000:00
>   bff00000-bff007ff : 0000:00:1f.2
>     bff00000-bff007ff : ahci
>   c0000000-cfffffff : PCI Bus 0000:02
> 
> 
> should have
> bfe4dc00-bfefffff : reserved
> bff00000-dfffffff : PCI Bus 0000:00
>   bff00000-bfffffff : reserved
>   c0000000-cfffffff : PCI Bus 0000:02
> 
> please check this version
> 
> Subject: [PATCH] x86, resource: Add reserve_region_with_split_check_child()
> 
> It will cover the whole region to BUSY, except that some regions that have
> children under them.
> 
> those children normally is PCI bar but it is falling into E820_RESERVED.
> We can not put BUSY on them, otherwise driver can not use
> pci_request_region()
> later
> 
> /proc/iomem will have
> 00010000-00094fff : System RAM
> 00095000-0009ffff : reserved
> 000a0000-000bffff : PCI Bus 0000:00
>   000a0000-000bffff : reserved
> 000c0000-000cffff : reserved
> 000d0000-000dffff : PCI Bus 0000:00
>   000d0000-000dffff : reserved
> 000e0000-000fffff : reserved
> 
> -v2: Add function pointer to put string comparing with caller
> -v3: expand to use it above 1M resources.
> -v4: handle partial overlap case
> 
> Tested-by: Guenter Roeck <guenter.roeck@ericsson.com>
> Tested-by: Andy Isaacson <adi@hexapodia.org>
> Signed-off-by: Yinghai Lu <yinghai@kernel.org>
> 
> ---
>  arch/x86/kernel/e820.c |   15 ++++++++++++---
>  include/linux/ioport.h |    3 +++
>  kernel/resource.c      |   35 ++++++++++++++++++++++++++++++-----
>  3 files changed, 45 insertions(+), 8 deletions(-)
> 
> Index: linux-2.6/arch/x86/kernel/e820.c
> ===================================================================
> --- linux-2.6.orig/arch/x86/kernel/e820.c
> +++ linux-2.6/arch/x86/kernel/e820.c
> @@ -1094,7 +1094,7 @@ void __init e820_reserve_resources(void)
>                * pci device BAR resource and insert them later in
>                * pcibios_resource_survey()
>                */
> -             if (e820.map[i].type != E820_RESERVED || res->start <
> (1ULL<<20)) {
> +             if (e820.map[i].type != E820_RESERVED) {
>                       res->flags |= IORESOURCE_BUSY;
>                       insert_resource(&iomem_resource, res);

I thought this hunk was already upstream?  If not, it probably should
be, didn't we decide to mark the lowest 1M region as busy to avoid
trouble?

The rest has been commented on iirc, and looks pretty ugly.  checking
for the string "PCI Bus" seems like a band-aid over the real problem at
best...
Comment 17 Matt Domsch 2010-06-18 19:38:36 UTC
Rez, can you assist?
Comment 18 Yinghai Lu 2010-06-18 21:27:54 UTC
On 06/18/2010 10:36 AM, Jesse Barnes wrote:
> On Thu, 17 Jun 2010 16:36:39 -0700
> Yinghai Lu <yinghai@kernel.org> wrote:
> 
>> On 06/17/2010 03:26 PM, bugzilla-daemon@bugzilla.kernel.org wrote:
>>> https://bugzilla.kernel.org/show_bug.cgi?id=16228
>>>
>>>
>>>
>>>
>>>
>>> --- Comment #13 from Brian Bloniarz <phunge0@hotmail.com>  2010-06-17
>>> 22:26:27 ---
>>> Created an attachment (id=26832)
>>>  --> (https://bugzilla.kernel.org/attachment.cgi?id=26832)
>>> Bootlog 2.6.35-rc3 with Yinghai's reserve_region_with_split_check_child
>>> patch
>>>
>>> I tried booting with this reserve_region_with_split_check_child() patch, my
>>> machine still doesn't boot. The bootlog & /proc/iomem is attached if that
>>> helps
>>> at all. It seems like it tries to use the same address as the unpatched
>>> 2.6.35-rc3?
>>>
>>
>> you got
>> bfe4dc00-bfefffff : reserved
>> bff00000-dfffffff : PCI Bus 0000:00
>>   bff00000-bff007ff : 0000:00:1f.2
>>     bff00000-bff007ff : ahci
>>   c0000000-cfffffff : PCI Bus 0000:02
>>
>>
>> should have
>> bfe4dc00-bfefffff : reserved
>> bff00000-dfffffff : PCI Bus 0000:00
>>   bff00000-bfffffff : reserved
>>   c0000000-cfffffff : PCI Bus 0000:02
>>
>> please check this version
>>
>> Subject: [PATCH] x86, resource: Add reserve_region_with_split_check_child()
>>
>> It will cover the whole region to BUSY, except that some regions that have
>> children under them.
>>
>> those children normally is PCI bar but it is falling into E820_RESERVED.
>> We can not put BUSY on them, otherwise driver can not use
>> pci_request_region()
>> later
>>
>> /proc/iomem will have
>> 00010000-00094fff : System RAM
>> 00095000-0009ffff : reserved
>> 000a0000-000bffff : PCI Bus 0000:00
>>   000a0000-000bffff : reserved
>> 000c0000-000cffff : reserved
>> 000d0000-000dffff : PCI Bus 0000:00
>>   000d0000-000dffff : reserved
>> 000e0000-000fffff : reserved
>>
>> -v2: Add function pointer to put string comparing with caller
>> -v3: expand to use it above 1M resources.
>> -v4: handle partial overlap case
>>
>> Tested-by: Guenter Roeck <guenter.roeck@ericsson.com>
>> Tested-by: Andy Isaacson <adi@hexapodia.org>
>> Signed-off-by: Yinghai Lu <yinghai@kernel.org>
>>
>> ---
>>  arch/x86/kernel/e820.c |   15 ++++++++++++---
>>  include/linux/ioport.h |    3 +++
>>  kernel/resource.c      |   35 ++++++++++++++++++++++++++++++-----
>>  3 files changed, 45 insertions(+), 8 deletions(-)
>>
>> Index: linux-2.6/arch/x86/kernel/e820.c
>> ===================================================================
>> --- linux-2.6.orig/arch/x86/kernel/e820.c
>> +++ linux-2.6/arch/x86/kernel/e820.c
>> @@ -1094,7 +1094,7 @@ void __init e820_reserve_resources(void)
>>               * pci device BAR resource and insert them later in
>>               * pcibios_resource_survey()
>>               */
>> -            if (e820.map[i].type != E820_RESERVED || res->start <
>> (1ULL<<20)) {
>> +            if (e820.map[i].type != E820_RESERVED) {
>>                      res->flags |= IORESOURCE_BUSY;
>>                      insert_resource(&iomem_resource, res);
> 
> I thought this hunk was already upstream?  If not, it probably should
> be, didn't we decide to mark the lowest 1M region as busy to avoid
> trouble?

yes, that one only handle below 1M problem.

and this one handle above 1M problem and below 1M problem.

> 
> The rest has been commented on iirc, and looks pretty ugly.  checking
> for the string "PCI Bus" seems like a band-aid over the real problem at
> best...

that seems to be the only way that tell that resource is belong to bridge.

Thanks

Yinghai
Comment 19 Bjorn Helgaas 2010-06-21 17:54:27 UTC
> > ... checking
> > for the string "PCI Bus" seems like a band-aid over the real problem at
> > best...
> 
> that seems to be the only way that tell that resource is belong to bridge.

Whether it's the only way or not, it's still unacceptably ugly, IMHO.
And the whole reserve_region_with_split_check_child() thing is just
as bad.  There's no way we can maintain that.

The resource management framework is based on hierarchical, nested
resources.  That makes good sense in most cases, where you have
devices that respond at certain address ranges, and any upstream
bridges have to have windows that completely enclose the device
ranges.

But I don't think it makes quite as much sense for things like the
E820 map.  In the E820 map, there's no concept of "one range == one
device."  There's not even a concept of devices.  So I'm not sure
we can enforce the idea that a device region must have a certain
relationship with an E820 region.  As far as I can tell, there are
no requirements about how the BIOS lays out the E820 map.  For example,
it may use two E820 entries to describe a single contiguous region.

The assumption that an E820 entry must have a hierarchical relationship
with device ranges has led to things like insert_resource_expand_to_fit()
and the split between e820_reserve_resources() and e820_reserve_resources_late().
Both of those feel like ad hoc workarounds for system configurations
that are probably completely acceptable.

I don't know whether we can fix resources to handle E820 reservations
naturally, or whether we need some kind of separate map (I hope not,
because that sounds like a tangle in itself).  I'd like to see what
Windows does with this case ... I vaguely remember that it might keep
overlapping ranges that Linux would complain about.
Comment 20 Brian Bloniarz 2010-07-06 19:11:49 UTC
Hi all, thanks for your help so far. Is there any update on this bug? I will only have access to the machine that exhibits the problem for another 2 weeks if there are patches to be tested.
Comment 21 Rafael J. Wysocki 2010-07-08 23:10:52 UTC
Handled-By : Bjorn Helgaas <bjorn.helgaas@hp.com>
Comment 22 Bjorn Helgaas 2010-07-16 20:35:44 UTC
Sorry for the delay in looking at this.  I know it's a high-priority regression, and I certainly haven't forgotten about it.  I won't have a proposal before you lose access to the machine, but I hope the Dell folks on the CC: list will be able to reproduce the problem and test fixes. It's possible that a BIOS change would cover up the Linux problem, so would you mind collecting the BIOS version before you lose your machine, Brian?  Attaching the output of "dmidecode" should be enough and doesn't require a reboot.
Comment 23 Brian Bloniarz 2010-07-21 20:08:17 UTC
Created attachment 27191 [details]
dmidecode output from the machine
Comment 24 Stefan Becker 2010-08-03 15:42:47 UTC
Your dmidecode shows that your are using BIOS A01. I see this problem even with the latest Dell T3500 BIOS A07. So there is (as of today) no BIOS fix.

FYI: Fedora Bug Report <https://bugzilla.redhat.com/show_bug.cgi?id=620313>. Fedora is currently working on moving from 2.6.33 to 2.6.34.x for Fedora 13.
Comment 25 Bjorn Helgaas 2010-08-05 18:46:38 UTC
If anybody has Windows on a box that shows this failure, could you
please attach an Everest report of the hardware (free trial version
at http://www.lavalys.com/)?

Both Linux and Windows try to move PCI devices into host bridge windows.
Linux allocates window space from the bottom up, so it does this:

  pci_root PNP0A03:00: host bridge window [mem 0xbff00000-0xdfffffff]
  pci 0000:00:1f.2: no compatible bridge window for [mem 0xff970000-0xff9707ff]
  pci 0000:00:1f.2: BAR 5: assigned [mem 0xbff00000-0xbff007ff]

That doesn't work, I think because 0xbff00000 is actually RAM.

Windows allocates window space from the top down, so it won't use the
0xbff00000 area unless the window is almost completely consumed by
other devices.  I expected that even in that case, Windows would avoid
the 0xbfe4dc00-0xc0000000 area because it's marked "reserved" in the
E820 map.  However, I've been experimenting with Windows 2008 R2 on
qemu, and it does *not* seem to avoid E820 "reserved" areas when
moving PCI devices.

If Windows doesn't avoid E820 reserved areas, there must be another
reason, such as a PNP device, that would keep it from putting a PCI
device at 0xbff00000.  Otherwise, if we put a couple 256M graphics
cards in the box to consume the host bridge window, Windows would
have the same problem we're seeing with Linux.
Comment 26 Bjorn Helgaas 2010-08-10 18:05:41 UTC
Created attachment 27398 [details]
print PNP resources

Since nobody seems to have Windows on one of these boxes, could someone
please add this patch and post the resulting dmesg log?  It's fine to
use "pci=nocrs" so the machine boots.  And please collect the log with
the "dmesg" command or use the "ignore_loglevel" option so the PCI resource
information, e.g., "pci 0000:00:1f.2: reg 24: [mem 0xff970000-0xff9707ff]",
is included.

Also, please collect the DSDT so I can make sure we account for all the
resources used by ACPI devices.  Instructions are here:
  http://kernel.org/pub/linux/kernel/people/helgaas/debug
Comment 27 Stefan Becker 2010-08-11 07:49:51 UTC
Created attachment 27403 [details]
dmesg, acpidump & DSDT after applying PNP patch

Archive with the 3 requested files.

NOTE: the acpixtract from pmtools-20071106 behaves differently from the instructions you provided on the web page. I had to use

     acpixtract -sDSDT acpidump.asc

and the resulting file was named DSDT.dat.

I can't provide any debug info from Windows, because none of my machines has ever been installed with Windows :-)
Comment 28 Bjorn Helgaas 2010-08-11 23:04:16 UTC
Created attachment 27414 [details]
Everest report of Windows resources

Everest report from a T3500 with A07 BIOS.

I don't see any devices in the BFF00000-BFFFFFFF range (except the
PNP0A03 host bridge window itself), so I don't think anything prevents
Windows from placing a PCI device there.  If that happened, I think
that device probably would not work.

But it would be hard to force that to happen, because Windows allocates
space from the top down, and there's a lot of available space above
BFF00000 that would be used first.  Here's my accounting of used and
available space under Windows:

    BFF00000-F7FFFFFF PNP0A03:00 (host bridge to bus 00)
        BFF00000-BFFFFFFF available (0x100000 = 1M)
        C0000000-DFFFFFFF 00:03.0 Root Port 3 (bridge to bus 02, 512M window)
            C0000000-DFFFFFFF 02:00.0 NVidia FX 580 (512M)
        E0000000-F3CFFFFF available (0x13cfffff = 316M)
        F3D00000-F3DFFFFF 00:1c.5 Root Port 6 (bridge to bus 05, 1M window)
            F3D00000-F3DDFFFF available (0xe0000 = 896K on bus 05)
            F3DE0000-F3DEFFFF 05:00.0 Broadcom 57xx NIC (64K)
            F3DF0000-F3DFFFFF 05:00.0 Broadcom 57xx NIC (64K)
        F3E00000-F3EFFFFF 00:01.0 Root Port 1 (bridge to bus 01, 1M window))
            F3E00000-F3EFFFFF available (1M on bus 01)
        F3F00000-F3FFFFFF 00:07.0 Root Port 7 (bridge to bus 03, 1M window))
            F3F00000-F3FFFFFF available (1M on bus 03)
        F4000000-F7EFFFFF 00:03.0 Root Port 3 (bridge to bus 02, 63M window))
            F4000000-F5FFFFFF 02:00.0 NVidia FX 580 (32M)
            F6000000-F6FFFFFF 02:00.0 NVidia FX 580 (16M)
            F7000000-F7EFFFFF available (0xf00000 = 15M on bus 02)
        F7F00000-F7FF9FFF available (0xfa000 = 1000K on bus 00)
        F7FFA000-F7FFA3FF 00:1a.0 USB Enhanced Host Controller (0x400 = 1K)
        F7FFA400-F7FFAFFF available (0xc00 = 3072 bytes on bus 00)
        F7FFB000-F7FFB0FF 0:1f.3 SMBus Controller (0x100 = 256 bytes)
        F7FFB100-F7FFBFFF available (0xf00 = 3840 bytes on bus 00)
        F7FFC000-F7FFFFFF 00:1b.0 Audio Controller (0x4000 = 16K)

    FED20000-FED9FFFF PNP0A03:00 (host bridge window to bus 00, 512K)
        FED20000-FED9FFFF available (0x80000 = 512K)

    FF97C000-FF97FFFF PNP0A03:00 (host bridge window to bus 00, 16K)
        FF97C000-FF97FFFF available (0x4000 = 16K)

    FF980000-FF980FFF PNP0A03:00 (host bridge window to bus 00, 4K)
        FF980000-FF9803FF 00:1d.7 USB Enhanced Host Controller (0x400 = 1K)
        FF980400-FF9807FF available (0x400 == 1K)
        FF980800-FF980FFF 00:1f.2 SATA RAID Controller (0x800 = 2K)
Comment 29 Stefan Becker 2010-09-02 07:15:03 UTC
Any progress on this matter? Do you have a patch that I could test with 2.6.34 or 2.6.35?

FYI: Fedora has pushed a 2.6.34 package to F13 stable updates, i.e. normal users will now run into this problem.
Comment 30 Bjorn Helgaas 2010-09-03 23:17:33 UTC
Created attachment 28952 [details]
allocate from top down within a region

Thanks for the Fedora heads up.  I already got a new report, and I'm
afraid we'll get a lot more.

I think the fix for this is to do what Windows does, namely:

  1) Allocate new space starting with the top window, going
     to the bottom one, and
  2) Within each window, allocate from the top to the bottom.

This patch only implements 2), but I think it's probably enough to
work around the immediate problem.
Comment 31 Bjorn Helgaas 2010-09-05 04:17:09 UTC
Created attachment 28992 [details]
 allocate from top down within a region (v2)

Oops, a kind soul tested the previous patch and reported that it failed.
Here's another try.
Comment 32 Bjorn Helgaas 2010-09-05 04:21:26 UTC
Here's the Fedora 13 report of the same problem:
https://bugzilla.redhat.com/show_bug.cgi?id=620313
Comment 33 Stefan Becker 2010-09-06 07:22:11 UTC
Created attachment 29072 [details]
dmesg log from 2.6.34 with patch V2 applied

With patch V2 applied 2.6.34 boots up fine on my T3500 (dmesg log attached).
Comment 34 Bjorn Helgaas 2010-09-06 13:58:31 UTC
Stefan, your machine isn't using CRS, possibly because it's older than
2008.  We do open the 1c.0 bridge windows to bus 04, and it looks like
we're doing the right thing there, but there's no device on bus 04, so
it really doesn't make a difference either way.  I would think this
machine would boot either with or without "pci=nocrs", even without
my patch.

I would guess that if you boot with "pci=use_crs", your machine will
fail to boot vanilla 2.6.34.  And hopefully, if you add my patch, it
*will* boot.  If so, I'd like to see the dmesg from the boot with my
patch.  I have another report that the patch isn't sufficient, so if
it fails, it would help to see a serial console log or video to see
where it goes wrong.  Thanks!
Comment 35 Stefan Becker 2010-09-06 14:27:19 UTC
(In reply to comment #34)
> Stefan, your machine isn't using CRS, possibly because it's older than
> 2008.  We do open the 1c.0 bridge windows to bus 04, and it looks like
> we're doing the right thing there, but there's no device on bus 04, so
> it really doesn't make a difference either way.  I would think this
> machine would boot either with or without "pci=nocrs", even without
> my patch.

I can assure you that the F13 2.6.34 kernel does not boot without "pci=nocrs" on the T3500. That's why I opened Fedora bug #620313.

With patch V2 applied to the F13 kernel RPM the machine boots without pci=nocrs, i.e. it works out of the box.

The machine was bought last year (2009) and I have installed the latest Dell BIOS (A07). So I'm not sure how your 2008 rule would apply.

> I would guess that if you boot with "pci=use_crs", your machine will
> fail to boot vanilla 2.6.34.  And hopefully, if you add my patch, it
> *will* boot.  If so, I'd like to see the dmesg from the boot with my
> patch.  I have another report that the patch isn't sufficient, so if

I'll try to boot with "pci=use_crs" with the plain Fedora package and the one with the patch.
Comment 36 Stefan Becker 2010-09-06 15:15:34 UTC
(In reply to comment #35)
> With patch V2 applied to the F13 kernel RPM the machine boots without
> pci=nocrs, i.e. it works out of the box.

*blush* I picked up the next-in-line F13 2.6.34 RPM that has "pci=nocrs" by default :-/ No wonder that it boots and doesn't show _CRS...

I'll try with "pci=use_crs" ASAP.
Comment 37 Stefan Becker 2010-09-08 04:48:22 UTC
Confirmed: patch V2 does not fix the issue for me. I'll try the latest debug patch attached to the Fedora bz ASAP.
Comment 38 Stefan Becker 2010-09-10 07:17:47 UTC
Created attachment 29492 [details]
dmesg log from 2.6.34 with patch V3 applied and pci=use_crs

As the Fedora bug has been closed with pci=nocrs as workaround, let's continue the discussion here.

Patch V3 fixes the issue on my T3500:

Kernel command line: ro root=/dev/mapper/VolGroup00-LogVol00 rhgb quiet SYSFONT=latarcyrheb-sun16 LANG=en_US.UTF-8 KEYTABLE=fi intel_iommu=off pci=use_crs
PCI: Using host bridge windows from ACPI; if necessary, use "pci=nocrs" and report a bug
pci 0000:00:1f.2: reg 10: [io  0xfe00-0xfe07]
pci 0000:00:1f.2: reg 14: [io  0xfe10-0xfe13]
pci 0000:00:1f.2: reg 18: [io  0xfe20-0xfe27]
pci 0000:00:1f.2: reg 1c: [io  0xfe30-0xfe33]
pci 0000:00:1f.2: reg 20: [io  0xfec0-0xfedf]
pci 0000:00:1f.2: reg 24: [mem 0xff970000-0xff9707ff]
pci 0000:00:1f.2: PME# supported from D3hot
pci 0000:00:1f.2: PME# disabled
pci 0000:00:1f.2: no compatible bridge window for [mem 0xff970000-0xff9707ff]
pci 0000:00:1f.2: BAR 5: assigned [mem 0xff980800-0xff980fff]
pci 0000:00:1f.2: BAR 5: set to [mem 0xff980800-0xff980fff] (PCI address [0xff980800-0xff980fff]
ahci 0000:00:1f.2: version 3.0
ahci 0000:00:1f.2: PCI INT C -> GSI 20 (level, low) -> IRQ 20
ahci 0000:00:1f.2: irq 53 for MSI/MSI-X
ahci 0000:00:1f.2: AHCI 0001.0200 32 slots 6 ports 3 Gbps 0x3f impl SATA mode
ahci 0000:00:1f.2: flags: 64bit ncq sntf led clo pio ccc ems sxs 
ahci 0000:00:1f.2: setting latency timer to 64
Comment 39 Bjorn Helgaas 2010-09-10 21:52:28 UTC
Thanks a lot for testing this, Stefan!  Based on the Everest report in
comment 28, it looks like we're now doing the same thing as Windows,
and in fact, we chose the identical address for the 00:1f.2 SATA device.

I'll polish up the V3 patches and post them soon.
Comment 40 Bjorn Helgaas 2010-09-15 19:48:25 UTC
Created attachment 30172 [details]
alloc top-down (v3)

This is the patch Stefan tested in comment 38.  (This was originally
attached to https://bugzilla.redhat.com/show_bug.cgi?id=620313, but I'm
also attaching it here for completeness.)
Comment 41 Bjorn Helgaas 2010-09-15 19:51:31 UTC
Created attachment 30182 [details]
align to end of free space

This is a SeaBIOS patch I used with Windows 7 on qemu to verify that
given an area of available space, Windows uses the *end*, not the
beginning.
Comment 42 Bjorn Helgaas 2010-09-15 19:54:49 UTC
Created attachment 30192 [details]
use highest available gap first

This is a SeaBIOS patch I used with Windows 7 on qemu to verify that
given a window with several areas available, e.g., several gaps between
children of the resource, Windows uses the highest area first.
Comment 43 Bjorn Helgaas 2010-09-15 19:56:48 UTC
Created attachment 30202 [details]
use highest bridge windows first

This is a SeaBIOS patch I used with Windows 7 on qemu to verify that
given a PCI host bridge with several windows, Windows fills up the
highest windows first.
Comment 44 Bjorn Helgaas 2010-09-16 15:35:44 UTC
Stefan, could you please attach the /proc/iomem contents corresponding to
your test in comment 38?  I'd like to understand what we're doing with
this E820 reserved entry:

  BIOS-e820: 00000000bfe4dc00 - 00000000c0000000 (reserved)
Comment 45 Bjorn Helgaas 2010-09-16 16:46:34 UTC
Created attachment 30242 [details]
handling of e820 reserved entries

This SeaBIOS patch shows two interesting things:

  1) Windows 7 ignores E820 reserved entries when it moves PCI devices.

  2) When an E820 reserved entry conflicts with a host bridge window,
     Linux *expands* the E820 entry to contain the entire window.  Then
     we insert it, so it becomes the parent of the window, which means
     we effectively ignore the reservation.

On these Dell machines, the E820 map contains a reserved entry that
conflicts with a host bridge window:

  BIOS-e820: 00000000bfe4dc00 - 00000000c0000000 (reserved)
  pci_root PNP0A03:00: host bridge window [mem 0xbff00000-0xf7ffffff]

As a result, I think we'll see this in Stefan's /proc/iomem:

  bfe4dc00-f7ffffff reserved (expanded)
    bff00000-f7ffffff PCI Bus 0000:00

When we move the 1f.2 BAR into the window, the area at 0xbff00000 looks
like it's available because we allocate from the window.  The e820 reserved
area is now the parent of the window, so it has no effect on the allocation.
Comment 46 Stefan Becker 2010-09-20 07:52:26 UTC
Created attachment 30662 [details]
dmesg log from 2.6.34 with patch V4 applied and pci=use_crs

Results with latest patch sent to LKML.
Comment 47 Stefan Becker 2010-09-20 07:54:14 UTC
Created attachment 30672 [details]
/proc/iomem from 2.6.34 with patch V4 applied and pci=use_crs   

As with patch V3 the system boots fine with pci=use_crs.
Comment 48 Rafael J. Wysocki 2010-09-20 21:02:25 UTC
On Monday, September 20, 2010, Bjorn Helgaas wrote:
> On Monday, September 20, 2010 01:57:17 pm Rafael J. Wysocki wrote:
> > This message has been generated automatically as a part of a report
> > of regressions introduced between 2.6.34 and 2.6.35.
> > 
> > The following bug entry is on the current list of known regressions
> > introduced between 2.6.34 and 2.6.35.  Please verify if it still should
> > be listed and let the tracking team know (either way).
> > 
> > 
> > Bug-Entry   : http://bugzilla.kernel.org/show_bug.cgi?id=16228
> > Subject             : BUG/boot failure on Dell Precision T3500
> (pci/ahci_stop_engine)
> > Submitter   : Brian Bloniarz <brian.bloniarz@gmail.com>
> > Date                : 2010-06-16 17:57 (97 days old)
> > Handled-By  : Bjorn Helgaas <bjorn.helgaas@hp.com>
> 
> The four patches here: https://patchwork.kernel.org/patch/189182/
> should resolve this problem.  They've been tested by Stefan Becker.
Comment 50 Stefan Becker 2010-10-01 11:15:16 UTC
FYI: I just updated my T3500 to the latest BIOS A08 from Dell. It still doesn't boot 2.6.34 without the proposed patches. So still no BIOS fix...
Comment 51 Bjorn Helgaas 2010-10-01 17:35:32 UTC
Thanks for the update.  I think it's critical that we make Linux tolerate
this issue.  Even if there were a BIOS fix, we shouldn't require users to
update their machines.
Comment 52 Avi Carmi 2010-10-12 16:06:28 UTC
same issue here, Dell Precision T3500 BIOS A07 (will upgrade to A08 as soon as I post) after upgrading to 10.10 Maverick.

I'll have to recall how to apply patches and compile the kernel, last time I've done it was years ago RHEL 5 or 6, so spoiled now by Ubuntu but found this yet in another thread: 

https://bugs.edge.launchpad.net/ubuntu/+source/linux/+bug/647043/comments/14

which has a link to a patched test kernel, with the same list of patches:

> I had Lisa do a series of tests. We subsequently discovered a series of
> patches that have been submitted 
> upstream but are a bit invasive (ie. they won't land in upstream 2.6.36, and
> will likely have to wait for 2.6.37).
>
> http://marc.info/?l=linux-kernel&m=128476278029918&w=2
> Patch : https://patchwork.kernel.org/patch/189182/
> Patch : https://patchwork.kernel.org/patch/189232/
> Patch : https://patchwork.kernel.org/patch/189242/
> Patch : https://patchwork.kernel.org/patch/189252/
>
> I've subsequently built a test kernel with the above patches applied. Lisa
> has confirmed this fixes the issue 
> and the USB ports work with this test kernel. The test kernel can be found at
> the following:
>
> http://people.canonical.com/~ogasawara/lp647043/i386/
>
> I'll have to discuss with the Ubuntu Kernel SRU team if they will qualify for
> a Stable Release Update to Maverick.

-avi
Comment 53 Avi Carmi 2010-10-12 18:02:57 UTC
just FYI, did the A08 BIOS upgrade, and as mentioned earlier, the A08 BIOS update did not resolve the problem.

pci=nocrs does work with the 10.10 Ubuntu kernel and A08 BIOS 

found a report that A08 does not work with the older 10.04 Kernel, but I did not try the older kernel with A08.

did not yet try the test patched kernel, since I can boot with pci=nocrs and I need to get some real work done...

-avi
Comment 54 Florian Mickler 2010-10-31 18:26:05 UTC
Do these need to go into stable?

commit 1af3c2e45e7a641e774bbb84fa428f2f0bf2d9c9
Author: Bjorn Helgaas <bjorn.helgaas@hp.com>
Date:   Tue Oct 26 15:41:54 2010 -0600

    x86: allocate space within a region top-down



commit dc9887dc02e37bcf83f4e792aa14b07782ef54cf
Author: Bjorn Helgaas <bjorn.helgaas@hp.com>
Date:   Tue Oct 26 15:41:44 2010 -0600

    x86/PCI: allocate space from the end of a region, not the beginning



commit b126b4703afa4010b161784a43650337676dd03b
Author: Bjorn Helgaas <bjorn.helgaas@hp.com>
Date:   Tue Oct 26 15:41:39 2010 -0600

    PCI: allocate bus resources from the top down
Comment 55 Bjorn Helgaas 2010-11-02 16:01:20 UTC
These fixes appeared in 2.6.37-rc1.  Confirmation that they still work
would be good, then after we get more runtime, we might consider putting
the whole lot into stable (the patches mentioned in comment 54, plus
several related ones).
Comment 56 Stefan Becker 2010-11-03 06:45:11 UTC
I tested patch V5 on top of kernel-2.6.34.7-59.fc13.x86_64. The T3500 booted with pci=use_crs.
Comment 57 Bernhard M. Wiedemann 2010-11-05 15:19:51 UTC
@Bjorn: tested 2.6.37-rc1 from yesterday's Linus git repo and it works fine on my T3500 (2.6.34 to 36 had the above problems). So good fix.
Comment 58 jpl 2010-12-01 12:09:19 UTC
I'd just like to say that I hope this fix can be backported into stable.  There are a fairly large number of people unable to boot a new kernel after upgrading to Ubuntu 10.10, and the problems seem related:

http://ubuntuforums.org/showthread.php?t=1595809
Comment 59 Tanmoy Laskar 2010-12-08 13:38:13 UTC
Hi,

I'm one of those people - I've been running 2.6.32.26 on a 64-bit Maverick Installation on my Dell T3500, since I cannot boot with 2.6.35 (and have been unable to do so since upgrading to 10.10 in October). Any chance that this will be backported before Natty is released?

Thanks.
Comment 60 Bjorn Helgaas 2010-12-08 16:51:17 UTC
The most likely change for -stable would be to make "pci=nocrs" the
default, as discussed here:

  https://bugzilla.kernel.org/show_bug.cgi?id=22132#c8

That could easily be done in the Ubuntu distro kernel as well, and
probably doesn't depend on the change being in -stable itself.  I'm
not involved in Ubuntu kernel maintenance, so I don't know what it
would take to make that happen.

In the meantime, using "pci=nocrs" manually should be a workaround
that would allow you to boot newer kernels.
Comment 61 Bjorn Helgaas 2010-12-16 22:24:55 UTC
Created attachment 40472 [details]
avoid e820 regions

We've gone around and around on this, and the current proposal is to
apply the attached patch to 2.6.37-rc6.

If anybody could validate that this fixes the bug, that would be great.
Comment 62 Bernhard M. Wiedemann 2010-12-17 12:44:14 UTC
@Bjorn:  tested current git both with and without your patch and both appear to work on Dell T3500
Comment 63 Bjorn Helgaas 2010-12-17 15:52:40 UTC
Thanks, Bernhard!  Current git (e.g., a3383e8372c) without my patch
works because it allocates top-down and avoids the 0xbfe4dc00-0xc0000000
reserved region that way.  With my patch, it allocates bottom-up and
avoids that region because it's mentioned in the E820 table.