Bug 71391 - GM45: Regression due to incorrect graphics base of stolen memory.
Summary: GM45: Regression due to incorrect graphics base of stolen memory.
Status: RESOLVED DOCUMENTED
Alias: None
Product: Drivers
Classification: Unclassified
Component: Video(DRI - Intel) (show other bugs)
Hardware: All Linux
: P1 normal
Assignee: Chris Wilson
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2014-03-02 10:38 UTC by Pär Lindfors
Modified: 2014-03-04 22:10 UTC (History)
4 users (show)

See Also:
Kernel Version: 3.14
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
/proc/iomem broken kernel (2.88 KB, application/octet-stream)
2014-03-03 12:19 UTC, Pär Lindfors
Details
/proc/iomem working kernel (2.83 KB, application/octet-stream)
2014-03-03 12:20 UTC, Pär Lindfors
Details
dmesg 3.14.0-rc4 (bad) (58.13 KB, text/plain)
2014-03-03 20:42 UTC, Pär Lindfors
Details
dmesg 3.14.0-rc4 with 17fec8 reverted (good) (58.08 KB, text/plain)
2014-03-03 20:43 UTC, Pär Lindfors
Details
intel_reg_dumper while using crtc 0 (13.65 KB, application/octet-stream)
2014-03-04 19:40 UTC, Pär Lindfors
Details

Description Pär Lindfors 2014-03-02 10:38:46 UTC
In kernel v3.12 and later render cache flushing is broken on CRTC 0 on
my Lenovo x200s, with GM45 graphics. A bisect show that the first bad
commit is:

commit 17fec8a08698bcab98788e1e89f5b8e7502ababd
Author: Chris Wilson <chris@chris-wilson.co.uk>

    drm/i915: Use Graphics Base of Stolen Memory on all gen3+


Bad kernel versions (all have the above commit):
v3.12.0, v3.12.13, v3.13.0, 3.13.5, 3.14.0-rc4+(d8efcf)

Good kernel versions:
v3.11.0, v3.11.10
v3.13.5 with 17fec8 reverted.


I get the exact same symtops that I had one year ago in bug
53541. Your diagnose based on the symptoms that time was "Lack of
render cache flushing". That bug was in 3.6, got fixed in 3.9 and the
fix had to do with the base of stolen memory that time as well:

commit e12a2d53ae45a69aea499b64f75e7222cca0f12f
Author: Chris Wilson <chris@chris-wilson.co.uk>

    drm/i915: Fix detection of base of stolen memory
Comment 1 Daniel Vetter 2014-03-03 07:30:10 UTC
Can you please attach the output of /proc/iomem for both a broken and working kernel?
Comment 2 Pär Lindfors 2014-03-03 12:19:47 UTC
Created attachment 127821 [details]
/proc/iomem broken kernel
Comment 3 Pär Lindfors 2014-03-03 12:20:21 UTC
Created attachment 127831 [details]
/proc/iomem working kernel
Comment 4 Pär Lindfors 2014-03-03 12:22:55 UTC
Files uploaded. Apparently, on a working kernel I don't have any Stolen Memory in /proc/iomem or vga+ in /proc/ioports.


diff -u broken-3.13.5/iomem working-3.13.5-with-17fec8-reverted/iomem
--- broken-3.13.5/iomem 2014-03-03 13:00:33.375990509 +0100
+++ working-3.13.5-with-17fec8-reverted/iomem   2014-03-03 13:02:38.552319778 +0100
@@ -32,7 +32,6 @@
 bd9ff000-bd9fffff : System RAM
 bda00000-bdbfffff : RAM buffer
 bdc00000-bfffffff : reserved
-  be000000-bfffffff : Graphics Stolen Memory
 c0000000-febfffff : PCI Bus 0000:00
   c0000000-c01fffff : PCI Bus 0000:02
   c0200000-c03fffff : PCI Bus 0000:02
diff -u broken-3.13.5/ioports working-3.13.5-with-17fec8-reverted/ioports
--- broken-3.13.5/ioports       2014-03-03 13:00:33.375990509 +0100
+++ working-3.13.5-with-17fec8-reverted/ioports 2014-03-03 13:02:38.552319778 +0100
@@ -12,7 +12,6 @@
   00a0-00a1 : pic2
   00c0-00df : dma2
   00f0-00ff : fpu
-  03c0-03df : vga+
   0800-080f : pnp 00:01
 0cf8-0cff : PCI conf1
 0d00-ffff : PCI Bus 0000:00
Comment 5 Ville Syrjala 2014-03-03 12:52:42 UTC
What do these show?

setpci -s 0:0.0 0xa0.w 0xa2.w 0xa4.l 0xa8.l 0xac.l 0xb0.l
setpci -s 0:2.0 0x52.w 0x5c.l
Comment 6 Pär Lindfors 2014-03-03 13:14:27 UTC
This is while running the working kernel, 3.13.5 with 17fec8 reverted.

# setpci -s 0:0.0 0xa0.w 0xa2.w 0xa4.l 0xa8.l 0xac.l 0xb0.l
0020
13c0
00000000
00000000
00000000
0000c000
# setpci -s 0:2.0 0x52.w 0x5c.l
0b50
be000000
Comment 7 Daniel Vetter 2014-03-03 13:28:25 UTC
Sounds like the pci subsystem has a hard time allocating some slot for the cache flushing page. Adding Björn since iirc he has a bunch of patches in-flight to make intel-gtt.c use the right functions and improve pci resource allocation in general.
Comment 8 Chris Wilson 2014-03-03 14:05:11 UTC
So to confirm it is the use of stolen that causes the upset (rather than some random interaction with us reserving stolen), please try:

diff --git a/drivers/gpu/drm/i915/i915_gem_stolen.c b/drivers/gpu/drm/i915/i915_gem_stolen.c
index d58b4e287e32..db0a75930ca4 100644
--- a/drivers/gpu/drm/i915/i915_gem_stolen.c
+++ b/drivers/gpu/drm/i915/i915_gem_stolen.c
@@ -214,6 +214,8 @@ int i915_gem_init_stolen(struct drm_device *dev)
        struct drm_i915_private *dev_priv = dev->dev_private;
        int bios_reserved = 0;
 
+       return 0;
+
        if (dev_priv->gtt.stolen_size == 0)
                return 0;
 

Afaik, the only use-case for stolen on your machine is the fbcon framebuffer and ringbuffers.

Can you please grab a video of the failure mode?
Comment 9 Ville Syrjala 2014-03-03 14:11:01 UTC
(In reply to Pär Lindfors from comment #6)
> # setpci -s 0:2.0 0x52.w 0x5c.l
> 0b50
> be000000

That seems to indicate you have vt-d enabled. If there's a BIOS knob for turning it off, you should try it.
Comment 10 Bjorn Helgaas 2014-03-03 16:50:39 UTC
Strictly speaking, I think i915_stolen_to_physical() should use pcibios_bus_to_resource() to convert the bus address read from config space dword 0x5c:

  /* Read Graphics Base of Stolen Memory directly */
  pci_read_config_dword(dev->pdev, 0x5c, &base);

to the CPU physical address expected by devm_request_mem_region().  But (1) that won't make a difference on your machine because bus and CPU physical address should be identical on your machine, and (2) 17fec8a08698 didn't change that anyway.

The patches Daniel mentioned in comment #7 are on the branch at 96702be56037 ("Merge branch 'pci/resource' into next") and appeared in v3.14-rc1.  If PCI was unable to allocate a cache flushing page, the dmesg log should have a clue.  Can you attach the complete dmesg logs for the working and non-working cases?  The ideal would be something like v3.14-rc4 and v3.14-rc4 with 17fec8 reverted.
Comment 11 Pär Lindfors 2014-03-03 17:07:26 UTC
Chris: Don't know what tree your patch is against, it did not
apply. So I added that same return 0 code manually to that
function on d8efcf38b13df3e9e889cf7cc214cb85dc53600c (Linus
master from 2014-02-28). This simple patch makes the problem go
away.

By "grab a video" do you mean with a camera pointed at the
laptop? If so, then yes I can do that.

Ville: Disabling vt-d did not help.

# setpci -s 0:2.0 0x52.w 0x5c.l
0350
be000000
Comment 12 Daniel Vetter 2014-03-03 17:21:23 UTC
Björn: I've misread things and the resource allocation actually works in both cases. Also Pär confirmed that it's actually an issue with the stolen memory range, not with the flushing reg pci bar. Sorry for the noise.
Comment 13 Pär Lindfors 2014-03-03 20:42:52 UTC
Created attachment 127941 [details]
dmesg 3.14.0-rc4 (bad)
Comment 14 Pär Lindfors 2014-03-03 20:43:31 UTC
Created attachment 127951 [details]
dmesg 3.14.0-rc4 with 17fec8 reverted (good)
Comment 15 Pär Lindfors 2014-03-03 20:52:37 UTC
From Daniels latest comment I guess you maybe didn't need the
dmesg output anymore, but here you go anyway.

One thing I noticed while preparing these is that the bug is a
whole lot worse in 3.14-rc4 than in 3.13.5 (and earlier). In
earlier versions the screen locks up for a few seconds, then
continues, but in 3.14-rc4 it never refreshes.
Comment 16 Chris Wilson 2014-03-03 21:04:45 UTC
(In reply to Pär Lindfors from comment #11)
> Chris: Don't know what tree your patch is against, it did not
> apply. So I added that same return 0 code manually to that
> function on d8efcf38b13df3e9e889cf7cc214cb85dc53600c (Linus
> master from 2014-02-28). This simple patch makes the problem go
> away.

Thanks. That just confirms using the stolen memory is conflicting with something else.
 
> By "grab a video" do you mean with a camera pointed at the
> laptop? If so, then yes I can do that.

Yes. What we will be looking for here is whether there is any pattern to the corruption. For example, if it is regular, it may be a scanout access violation. If it is located to one area, we may only conflict over a few pages (and we need to reserve a fraction of the stolen area for whomever). Or it may be dirty cachelines, in which case we have angered the GPU gods.
Comment 17 Bjorn Helgaas 2014-03-03 21:18:41 UTC
(In reply to Pär Lindfors from comment #15)
> From Daniels latest comment I guess you maybe didn't need the
> dmesg output anymore, but here you go anyway.

Right, but thanks anyway!  I compared them just out of curiosity, and there are no PCI core resource-related differences.  I did notice that you're using "pcie_aspm=force"; if that makes any difference for you, I consider it a bug, since users shouldn't have to discover and use that flag.  Of course, that would be totally separate from this GM45 issue, so if you want to follow up on that, please do it in a new bugzilla.
Comment 18 Pär Lindfors 2014-03-04 19:19:03 UTC
(In reply to Bjorn Helgaas from comment #17)
> I did notice that you're
> using "pcie_aspm=force"; if that makes any difference for you, I consider it
> a bug, since users shouldn't have to discover and use that flag.  Of course,
> that would be totally separate from this GM45 issue, so if you want to
> follow up on that, please do it in a new bugzilla.

Thanks. I added that option years ago, when trying to minimize
power usage. I know for sure that it was needed to enable ASPM
back then, but I don't remember if it actually saved any
power. When I have more time I will try to remember to
investigate if I still need it, and report a new bug if
necessary.
Comment 19 Pär Lindfors 2014-03-04 19:23:36 UTC
(In reply to Chris Wilson from comment #16)
> Yes. What we will be looking for here is whether there is any pattern to the
> corruption. For example, if it is regular, it may be a scanout access
> violation. If it is located to one area, we may only conflict over a few
> pages (and we need to reserve a fraction of the stolen area for whomever).
> Or it may be dirty cachelines, in which case we have angered the GPU gods.

I shot a video. Not the best video quality, but it shows the bug
clearly. The file is 8.7M which is too large for bugzilla, so it is
available here:

http://www.lysator.liu.se/~paran/kernel/bugzilla/71391/71391-video1-720p.mkv
Comment 20 Daniel Vetter 2014-03-04 19:29:28 UTC
Looks like fbc, but that shouldn't be enabled ...

Can you please reproduce the issue (i.e. put lvds onto crtc 0) and grab the output of intel_reg_dumper (as root) from the intel-gpu-tools package. Most distros have this pre-built, otherwise please build from sources.
Comment 21 Pär Lindfors 2014-03-04 19:40:18 UTC
Created attachment 128011 [details]
intel_reg_dumper while using crtc 0
Comment 22 Daniel Vetter 2014-03-04 20:04:50 UTC
Indeed, fbc seems to be enabled. Can you please check whether the bug goes away when booting with i915.enable_fbc=0. Please double-check in /sys/modules/i915/parameters/ that the enable_fbc knob is set correct. Note that it's recently renamed, so check with modinfo i915 which name you're kernel is using.
Comment 23 Pär Lindfors 2014-03-04 20:58:25 UTC
Booting with i915.i915_enable_fbc=0 makes the bug go away.


One comment that may or may not be relevant. With kernel 3.10 I
get a kernel log message like:

[drm] not enough stolen space for compressed buffer (need 6291456 bytes), disabling
[drm] hint: you may be able to increase stolen memory size in the BIOS to avoid this

However I don't get this with any later kernel versions, this
includes the kernels where I have reverted the stolen memory
code.
Comment 24 Daniel Vetter 2014-03-04 22:10:37 UTC
fbc is known-broken in a bunch of cases. I guess something is enabling it somewhere (check for optinos in /etc/modprobe.d/) and now that we've fixed up stolen memory detection for gm45 it finally blows up.

Remove that bogus option and go back to the default (which is no fbc for gm45) and all should be fine.

We really should start to throw stern warnings into dmesg when people run with non-default options ...

Note You need to log in before you can comment on or make changes to this bug.