Bug 23542 - HP 2530p accelerated graphics causes spontaneous power off/on
Summary: HP 2530p accelerated graphics causes spontaneous power off/on
Status: CLOSED CODE_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: PCI (show other bugs)
Hardware: All Linux
: P1 normal
Assignee: Bjorn Helgaas
URL:
Keywords:
Depends on:
Blocks: 21782
  Show dependency tree
 
Reported: 2010-11-22 17:12 UTC by Bjorn Helgaas
Modified: 2010-12-19 11:41 UTC (History)
5 users (show)

See Also:
Kernel Version: 2.6.37-rc1
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
dmesg diff (-good, +bad) (69.22 KB, text/plain)
2010-11-22 17:12 UTC, Bjorn Helgaas
Details
/proc/iomem diff (-good, +bad) (2.66 KB, text/plain)
2010-11-22 22:48 UTC, Bjorn Helgaas
Details
patch to avoid PCI allocations in 1M below 4GB (1.88 KB, patch)
2010-11-22 22:52 UTC, Bjorn Helgaas
Details | Diff
v2 patch to avoid PCI allocations in 1M below 4GB (2.30 KB, patch)
2010-11-23 20:18 UTC, Bjorn Helgaas
Details | Diff
dmesg diff (-good, +bad) (26.78 KB, text/plain)
2010-12-03 17:59 UTC, Dan Williams
Details
full dmesg from bad kernel 2.6.36.1-10.fc15 (65.91 KB, text/plain)
2010-12-03 18:00 UTC, Dan Williams
Details
/proc/iomem from a good kernel (2.6.36-4.fc15) (2.53 KB, text/plain)
2010-12-06 15:26 UTC, Dan Williams
Details
patch to avoid allocating PNP resources (8.24 KB, patch)
2010-12-08 17:54 UTC, Bjorn Helgaas
Details | Diff
Fedora 2.6.36 kernel patch fixing original 2530p issue (2.30 KB, patch)
2010-12-09 16:27 UTC, Dan Williams
Details | Diff
Fedora 2.6.36 kernel CRS fixes patch that caused the problem originally (21.33 KB, patch)
2010-12-09 16:28 UTC, Dan Williams
Details | Diff
My attempt to backport your PNP avoidance patch to the Fedora 2.6.36.2 kernel on top of the two above patches (8.38 KB, patch)
2010-12-09 16:29 UTC, Dan Williams
Details | Diff
v3 patch to avoid top 1M below 4G (1.56 KB, patch)
2010-12-14 17:39 UTC, Bjorn Helgaas
Details | Diff
SeaBIOS/qemu test patch (2.21 KB, patch)
2010-12-15 21:41 UTC, Bjorn Helgaas
Details | Diff
avoid E820 regions (8.84 KB, patch)
2010-12-16 20:38 UTC, Bjorn Helgaas
Details | Diff

Description Bjorn Helgaas 2010-11-22 17:12:27 UTC
Created attachment 37872 [details]
dmesg diff (-good, +bad)

Matthew Garrett reports that my _CRS patches break the 2530p:

  Once it starts doing accelerated operations there's a high probability
  that it'll suddenly display garbage all over the screen, suddenly power
  down and power back on after a few seconds

Matthew found this in Fedora, introduced between these two kernel versions:

  -Linux version 2.6.36-4.fc13.x86_64 (mjg59@ihatethathostname.lab.bos.redhat.com) (gcc version 4.4.4 20100630 (Red Hat 4.4.4-10) (GCC) ) #1 SMP Tue Nov 16 17:41:39 EST 2010
  +Linux version 2.6.36-4.fc15.x86_64 (mockbuild@x86-19.phx2.fedoraproject.org) (gcc version 4.5.1 20101112 (Red Hat 4.5.1-5) (GCC) ) #1 SMP Tue Nov 16 07:31:02 UTC 2010

I reported this against 2.6.37-rc1 because I think it's probably caused
by commit 1af3c2e45, which appeared in .37-rc1 but was probably backported
to fc15.
Comment 1 Bjorn Helgaas 2010-11-22 22:48:22 UTC
Created attachment 37892 [details]
/proc/iomem diff (-good, +bad)

Here's the interesting bit:

   fee01000-ffffffff : PCI Bus 0000:00
  +  ff900000-ffafffff : PCI Bus 0000:03
  +  ffb00000-ffcfffff : PCI Bus 0000:02
  +  ffd00000-ffefffff : PCI Bus 0000:01
     fffa0000-fffa6fff : reserved
  +  fffff000-ffffffff : Intel Flush Page

Putting the "Intel Flush Page" at the very last page before 4G is not
going to work.  That page contains the ROM processor restart vector,
so I don't think accesses to it will go to the PCI bus.
Comment 2 Bjorn Helgaas 2010-11-22 22:52:46 UTC
Created attachment 37902 [details]
patch to avoid PCI allocations in 1M below 4GB
Comment 3 Rafael J. Wysocki 2010-11-22 23:18:26 UTC
Handled-By : Bjorn Helgaas <bjorn.helgaas@hp.com>
Comment 4 Bjorn Helgaas 2010-11-23 20:18:04 UTC
Created attachment 37972 [details]
v2 patch to avoid PCI allocations in 1M below 4GB

The previous version would make host bridge windows above 4GB completely
unusable.
Comment 5 Florian Mickler 2010-11-30 10:51:35 UTC
References: https://patchwork.kernel.org/patch/365092/
Comment 6 Bjorn Helgaas 2010-11-30 17:28:19 UTC
New information: there is an ACPI INT0800 device that is using the
0xff000000-0xffffffff range:

  [mjg59@2530p 00:03]$ cat id 
  INT0800
  [mjg59@2530p 00:03]$ cat resources 
  state = active
  mem 0xff000000-0xffffffff

That's enough to tell us we shouldn't put any PCI devices in that
range.

(It's still sort of dubious that the PNP0A08:00 window:
  PNP0A08:00: host bridge window [mem 0xfee01000-0xffffffff]
overlaps the INT0800 range, but Windows must handle that, so we'll
have to manage it somehow, too.)

Unfortunately, Linux ignores the resources used by ACPI devices
(except PNP0C01 and PNP0C02 motherboard devices), so the INT0800
device didn't save us in this case.  This is a Linux defect, but
it's too complicated to fix right now.
Comment 7 Dan Williams 2010-12-03 15:53:20 UTC
I'm still getting this error with 2.6.36.1-10.fc15.x86_64 on my 2530p which contains the above patch; I've got 4GB of RAM in this machine.  Before the fixes for this the graphics would corrupt and a reboot would occur within 30 seconds of logging into the desktop.  With the 2010-11-23 patch it happens about 5 minutes after doing so.  mjg59 says he only has 2GB of RAM which could be why he's not seeing the problem anymore but I still am.
Comment 8 Bjorn Helgaas 2010-12-03 16:14:06 UTC
Thanks a lot for testing this, Dan.  Can you attach your
2.6.36.1-10.fc15.x86_64 dmesg please?  I just want to double-check
that the patch is doing what I expected.  If it's not too much
trouble, you might also try it with the 0xfff00000 changed to
0xffe00000, since it sounds like that might be safer for really
old machines.

But I doubt that 0xffe00000 will make a difference; we might just
have to bite the bullet and figure out how to avoid all ACPI devices.
Comment 9 Dan Williams 2010-12-03 17:03:47 UTC
mjg59 asked me for /proc/iomem diffs.

'bad' == 2.6.36.1-10.fc15.x86_64
'good' == 2.6.36-4.fc15 with csr patch reverted

[dcbw@dcbw ~]$ diff -u badiomem.txt goodiomem.txt 
--- badiomem.txt        2010-12-03 09:55:39.668107911 -0600
+++ goodiomem.txt       2010-12-03 09:53:06.616013000 -0600
@@ -4,9 +4,9 @@
 000a0000-000bffff : PCI Bus 0000:00
 000ef000-000fffff : reserved
 00100000-b4f2efff : System RAM
-  01000000-0146d3bc : Kernel code
-  0146d3bd-01b7c61f : Kernel data
-  01c6f000-01e3cf07 : Kernel bss
+  01000000-0146c33c : Kernel code
+  0146c33d-01b7c59f : Kernel data
+  01c6f000-01e3cf47 : Kernel bss
 b4f2f000-b4f30fff : reserved
 b4f31000-b5d6ffff : System RAM
 b5d70000-b5d7ffff : ACPI Non-volatile Storage
@@ -48,6 +48,10 @@
   d4828000-d4828fff : 0000:00:19.0
     d4828000-d4828fff : e1000e
   d4829000-d4829fff : 0000:00:03.3
+  d482a000-d482afff : Intel Flush Page
+  d4900000-d4afffff : PCI Bus 0000:01
+  d4b00000-d4cfffff : PCI Bus 0000:02
+  d4d00000-d4efffff : PCI Bus 0000:03
 e0000000-efffffff : PCI MMCONFIG 0000 [bus 00-ff]
   e0000000-efffffff : reserved
     e0000000-efffffff : pnp 00:01
@@ -68,9 +72,5 @@
 fee00000-fee00fff : Local APIC
   fee00000-fee00fff : reserved
 fee01000-ffffffff : PCI Bus 0000:00
-  ff800000-ff9fffff : PCI Bus 0000:03
-  ffa00000-ffbfffff : PCI Bus 0000:02
-  ffc00000-ffdfffff : PCI Bus 0000:01
-  ffe7f000-ffe7ffff : Intel Flush Page
   ffe80000-ffffffff : reserved
 100000000-13bffffff : System RAM
Comment 10 Dan Williams 2010-12-03 17:59:57 UTC
Created attachment 38952 [details]
dmesg diff (-good, +bad)
Comment 11 Dan Williams 2010-12-03 18:00:46 UTC
Created attachment 38962 [details]
full dmesg from bad kernel 2.6.36.1-10.fc15
Comment 12 Dan Williams 2010-12-03 18:02:35 UTC
let me know if there's anything more you need, thanks!
Comment 13 Dan Williams 2010-12-06 15:26:02 UTC
Created attachment 39142 [details]
/proc/iomem from a good kernel (2.6.36-4.fc15)
Comment 14 Bjorn Helgaas 2010-12-08 17:54:24 UTC
Created attachment 39262 [details]
patch to avoid allocating PNP resources

Linus thought it was ugly to hack up pcibios_align_resource() to avoid
regions, and as usual, he's right.  Here's another try that takes a little
different approach -- instead of avoiding the hardcoded 2MB range just
below 4GB, this avoids all PNP device resources.  The INT0800 device
should be enough for this particular issue.

This is still ugly in that the PNP devices should be directly in the
iomem_resource tree like PCI devices are, but we can't do that quite yet.
Comment 15 Dan Williams 2010-12-09 03:18:53 UTC
Ok, I'll try to backport this to current Fedora kernels (2.6.36.1 + the CRS patch + the "never allocate pci from the last 1M below 4G" patch) and see where I get.
Comment 16 Bjorn Helgaas 2010-12-09 16:00:59 UTC
Just to be clear, my current proposed patches do not include a
"never allocate pci from the last 1M below 4G" patch.  The patch
in comment 14 avoids PNP resources, which should be enough to
solve the 2530p problem because there's an INT0800 device that
claims 0xff000000-0xffffffff.

Sorry for the backporting mess; there are many patches between
2.6.36.1 and now.  If you want to point me at the patches you're
applying to 2.6.36.1, I'd be happy to give you my opinions.
Comment 17 Dan Williams 2010-12-09 16:26:21 UTC
Yeah, I found the "never allocate from last 1M" patch never got upstream, but it was applied to Fedora 2.6.36 kernels to fix the 2530p issue originally (and it does for mjg59's machine with only 2G).  I haven't yet booted the kernel I built, but I'll attach the patches that are applied.

In the end I think we do want to backport this to 2.6.36 and send to stable@ if we can.
Comment 18 Dan Williams 2010-12-09 16:27:05 UTC
Created attachment 39622 [details]
Fedora 2.6.36 kernel patch fixing original 2530p issue
Comment 19 Dan Williams 2010-12-09 16:28:20 UTC
Created attachment 39632 [details]
Fedora 2.6.36 kernel CRS fixes patch that caused the problem originally
Comment 20 Dan Williams 2010-12-09 16:29:32 UTC
Created attachment 39642 [details]
My attempt to backport your PNP avoidance patch to the Fedora 2.6.36.2 kernel on top of the two above patches
Comment 21 Dan Williams 2010-12-09 16:32:22 UTC
Hmm, looking at my backport this morning I missed a bit in pcibios_align_resource() that should have gotten moved to arch_remove_reservations().  Other than that, comments on my backport attempt?
Comment 22 Bjorn Helgaas 2010-12-09 16:57:14 UTC
I think you should drop the patch from comment 18.  I don't plan
to push that upstream.  If we decide we do need to ignore that last
1M or 2M, I'll put that code in arch_remove_reservations() instead
of in pcibios_align_resource().

You mentioned the code that moved from
pcibios_align_resource() to arch_remove_reservations(); that change
*is* in your comment 20 patch.  But of course, that patch will change
a little bit if you drop the comment 18 patch.
Comment 23 Dan Williams 2010-12-10 03:18:06 UTC
So I've run at least an hour with the patches posted above and had no issues so far.  Without the recent PNP resource patch things would fail within 10 minutes.  I'll keep running with this kernel for a day or two before we declare unconditional success :)  Thanks!  I'll let you know.
Comment 24 Dan Williams 2010-12-10 19:50:25 UTC
Still working fine after 6 hours, no issues so far.  I think it's fixed.
Comment 25 Florian Mickler 2010-12-10 22:43:45 UTC
What patches do you have tested? Just so we don't loose track.
Comment 26 Bjorn Helgaas 2010-12-14 17:39:44 UTC
Created attachment 40142 [details]
v3 patch to avoid top 1M below 4G

It's too late in the .37 cycle to consider the previous patch.
Here's the patch Linus proposed, an alternate way to avoid that
last 1MB.  Can anybody test it?
Comment 27 Bjorn Helgaas 2010-12-15 21:41:07 UTC
Created attachment 40292 [details]
SeaBIOS/qemu test patch

This is just for documentation.  Booting Windows 7 on qemu
with this SeaBIOS patch shows that Windows ignored the last
1MB of memory below 4GB, even if it's not mentioned in the
E820 table or any ACPI device _CRS method.
Comment 28 Bjorn Helgaas 2010-12-16 20:38:05 UTC
Created attachment 40462 [details]
avoid E820 regions

Boy, I hope we converge on a solution here soon...

The current proposal for .37 is to revert the top-down allocation
patches and add this additional patch.

If you're based on 2.6.36, you would want to add these patches, which
went in after .36 and will stay in .37:

  a1862e3 resources: handle overflow when aligning start of available area
  6909ba1 resources: ensure callback doesn't allocate outside available space
  5d6b1fa resources: factor out resource_clip() to simplify find_resource()
  a9cea01 resources: add a default alignf to simplify find_resource()

and add the patch I'm attaching here.

Note You need to log in before you can comment on or make changes to this bug.