Bug 71391
Summary: | GM45: Regression due to incorrect graphics base of stolen memory. | ||
---|---|---|---|
Product: | Drivers | Reporter: | Pär Lindfors (paran) |
Component: | Video(DRI - Intel) | Assignee: | Chris Wilson (chris) |
Status: | RESOLVED DOCUMENTED | ||
Severity: | normal | CC: | bjorn, daniel, intel-gfx-bugs, ville.syrjala |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 3.14 | Subsystem: | |
Regression: | Yes | Bisected commit-id: | |
Attachments: |
/proc/iomem broken kernel
/proc/iomem working kernel dmesg 3.14.0-rc4 (bad) dmesg 3.14.0-rc4 with 17fec8 reverted (good) intel_reg_dumper while using crtc 0 |
Description
Pär Lindfors
2014-03-02 10:38:46 UTC
Can you please attach the output of /proc/iomem for both a broken and working kernel? Created attachment 127821 [details]
/proc/iomem broken kernel
Created attachment 127831 [details]
/proc/iomem working kernel
Files uploaded. Apparently, on a working kernel I don't have any Stolen Memory in /proc/iomem or vga+ in /proc/ioports. diff -u broken-3.13.5/iomem working-3.13.5-with-17fec8-reverted/iomem --- broken-3.13.5/iomem 2014-03-03 13:00:33.375990509 +0100 +++ working-3.13.5-with-17fec8-reverted/iomem 2014-03-03 13:02:38.552319778 +0100 @@ -32,7 +32,6 @@ bd9ff000-bd9fffff : System RAM bda00000-bdbfffff : RAM buffer bdc00000-bfffffff : reserved - be000000-bfffffff : Graphics Stolen Memory c0000000-febfffff : PCI Bus 0000:00 c0000000-c01fffff : PCI Bus 0000:02 c0200000-c03fffff : PCI Bus 0000:02 diff -u broken-3.13.5/ioports working-3.13.5-with-17fec8-reverted/ioports --- broken-3.13.5/ioports 2014-03-03 13:00:33.375990509 +0100 +++ working-3.13.5-with-17fec8-reverted/ioports 2014-03-03 13:02:38.552319778 +0100 @@ -12,7 +12,6 @@ 00a0-00a1 : pic2 00c0-00df : dma2 00f0-00ff : fpu - 03c0-03df : vga+ 0800-080f : pnp 00:01 0cf8-0cff : PCI conf1 0d00-ffff : PCI Bus 0000:00 What do these show? setpci -s 0:0.0 0xa0.w 0xa2.w 0xa4.l 0xa8.l 0xac.l 0xb0.l setpci -s 0:2.0 0x52.w 0x5c.l This is while running the working kernel, 3.13.5 with 17fec8 reverted. # setpci -s 0:0.0 0xa0.w 0xa2.w 0xa4.l 0xa8.l 0xac.l 0xb0.l 0020 13c0 00000000 00000000 00000000 0000c000 # setpci -s 0:2.0 0x52.w 0x5c.l 0b50 be000000 Sounds like the pci subsystem has a hard time allocating some slot for the cache flushing page. Adding Björn since iirc he has a bunch of patches in-flight to make intel-gtt.c use the right functions and improve pci resource allocation in general. So to confirm it is the use of stolen that causes the upset (rather than some random interaction with us reserving stolen), please try: diff --git a/drivers/gpu/drm/i915/i915_gem_stolen.c b/drivers/gpu/drm/i915/i915_gem_stolen.c index d58b4e287e32..db0a75930ca4 100644 --- a/drivers/gpu/drm/i915/i915_gem_stolen.c +++ b/drivers/gpu/drm/i915/i915_gem_stolen.c @@ -214,6 +214,8 @@ int i915_gem_init_stolen(struct drm_device *dev) struct drm_i915_private *dev_priv = dev->dev_private; int bios_reserved = 0; + return 0; + if (dev_priv->gtt.stolen_size == 0) return 0; Afaik, the only use-case for stolen on your machine is the fbcon framebuffer and ringbuffers. Can you please grab a video of the failure mode? (In reply to Pär Lindfors from comment #6) > # setpci -s 0:2.0 0x52.w 0x5c.l > 0b50 > be000000 That seems to indicate you have vt-d enabled. If there's a BIOS knob for turning it off, you should try it. Strictly speaking, I think i915_stolen_to_physical() should use pcibios_bus_to_resource() to convert the bus address read from config space dword 0x5c: /* Read Graphics Base of Stolen Memory directly */ pci_read_config_dword(dev->pdev, 0x5c, &base); to the CPU physical address expected by devm_request_mem_region(). But (1) that won't make a difference on your machine because bus and CPU physical address should be identical on your machine, and (2) 17fec8a08698 didn't change that anyway. The patches Daniel mentioned in comment #7 are on the branch at 96702be56037 ("Merge branch 'pci/resource' into next") and appeared in v3.14-rc1. If PCI was unable to allocate a cache flushing page, the dmesg log should have a clue. Can you attach the complete dmesg logs for the working and non-working cases? The ideal would be something like v3.14-rc4 and v3.14-rc4 with 17fec8 reverted. Chris: Don't know what tree your patch is against, it did not apply. So I added that same return 0 code manually to that function on d8efcf38b13df3e9e889cf7cc214cb85dc53600c (Linus master from 2014-02-28). This simple patch makes the problem go away. By "grab a video" do you mean with a camera pointed at the laptop? If so, then yes I can do that. Ville: Disabling vt-d did not help. # setpci -s 0:2.0 0x52.w 0x5c.l 0350 be000000 Björn: I've misread things and the resource allocation actually works in both cases. Also Pär confirmed that it's actually an issue with the stolen memory range, not with the flushing reg pci bar. Sorry for the noise. Created attachment 127941 [details]
dmesg 3.14.0-rc4 (bad)
Created attachment 127951 [details]
dmesg 3.14.0-rc4 with 17fec8 reverted (good)
From Daniels latest comment I guess you maybe didn't need the dmesg output anymore, but here you go anyway. One thing I noticed while preparing these is that the bug is a whole lot worse in 3.14-rc4 than in 3.13.5 (and earlier). In earlier versions the screen locks up for a few seconds, then continues, but in 3.14-rc4 it never refreshes. (In reply to Pär Lindfors from comment #11) > Chris: Don't know what tree your patch is against, it did not > apply. So I added that same return 0 code manually to that > function on d8efcf38b13df3e9e889cf7cc214cb85dc53600c (Linus > master from 2014-02-28). This simple patch makes the problem go > away. Thanks. That just confirms using the stolen memory is conflicting with something else. > By "grab a video" do you mean with a camera pointed at the > laptop? If so, then yes I can do that. Yes. What we will be looking for here is whether there is any pattern to the corruption. For example, if it is regular, it may be a scanout access violation. If it is located to one area, we may only conflict over a few pages (and we need to reserve a fraction of the stolen area for whomever). Or it may be dirty cachelines, in which case we have angered the GPU gods. (In reply to Pär Lindfors from comment #15) > From Daniels latest comment I guess you maybe didn't need the > dmesg output anymore, but here you go anyway. Right, but thanks anyway! I compared them just out of curiosity, and there are no PCI core resource-related differences. I did notice that you're using "pcie_aspm=force"; if that makes any difference for you, I consider it a bug, since users shouldn't have to discover and use that flag. Of course, that would be totally separate from this GM45 issue, so if you want to follow up on that, please do it in a new bugzilla. (In reply to Bjorn Helgaas from comment #17) > I did notice that you're > using "pcie_aspm=force"; if that makes any difference for you, I consider it > a bug, since users shouldn't have to discover and use that flag. Of course, > that would be totally separate from this GM45 issue, so if you want to > follow up on that, please do it in a new bugzilla. Thanks. I added that option years ago, when trying to minimize power usage. I know for sure that it was needed to enable ASPM back then, but I don't remember if it actually saved any power. When I have more time I will try to remember to investigate if I still need it, and report a new bug if necessary. (In reply to Chris Wilson from comment #16) > Yes. What we will be looking for here is whether there is any pattern to the > corruption. For example, if it is regular, it may be a scanout access > violation. If it is located to one area, we may only conflict over a few > pages (and we need to reserve a fraction of the stolen area for whomever). > Or it may be dirty cachelines, in which case we have angered the GPU gods. I shot a video. Not the best video quality, but it shows the bug clearly. The file is 8.7M which is too large for bugzilla, so it is available here: http://www.lysator.liu.se/~paran/kernel/bugzilla/71391/71391-video1-720p.mkv Looks like fbc, but that shouldn't be enabled ... Can you please reproduce the issue (i.e. put lvds onto crtc 0) and grab the output of intel_reg_dumper (as root) from the intel-gpu-tools package. Most distros have this pre-built, otherwise please build from sources. Created attachment 128011 [details]
intel_reg_dumper while using crtc 0
Indeed, fbc seems to be enabled. Can you please check whether the bug goes away when booting with i915.enable_fbc=0. Please double-check in /sys/modules/i915/parameters/ that the enable_fbc knob is set correct. Note that it's recently renamed, so check with modinfo i915 which name you're kernel is using. Booting with i915.i915_enable_fbc=0 makes the bug go away. One comment that may or may not be relevant. With kernel 3.10 I get a kernel log message like: [drm] not enough stolen space for compressed buffer (need 6291456 bytes), disabling [drm] hint: you may be able to increase stolen memory size in the BIOS to avoid this However I don't get this with any later kernel versions, this includes the kernels where I have reverted the stolen memory code. fbc is known-broken in a bunch of cases. I guess something is enabling it somewhere (check for optinos in /etc/modprobe.d/) and now that we've fixed up stolen memory detection for gm45 it finally blows up. Remove that bogus option and go back to the default (which is no fbc for gm45) and all should be fine. We really should start to throw stern warnings into dmesg when people run with non-default options ... |