Bug 71611

Summary: agp/intel: can't ioremap flush page - no chipset flushing
Product: Drivers
Reporter: Paul Bolle (pebolle)
Component: PCI
Assignee: drivers_pci (drivers_pci)
Status: RESOLVED CODE_FIX
Severity: normal
CC: bjorn, svenjoac
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 3.14-rc1
Subsystem:
Regression: Yes
Bisected commit-id:
Attachments: dmesg of v3.14-rc5 boot
iomem (v3.13.2)
iomem (v3.14-rc1)
dmesg of v3.14-rc5 boot with Bjorn's debug patch

Description Paul Bolle 2014-03-06 20:19:34 UTC
Booting v3.14-rc1 on an (outdated) ThinkPad X41 triggers a kernel
error:
    pci 0000:00:02.0: can't ioremap flush page - no chipset flushing

That is this pci device:
     lspci | grep 00:02.0
     00:02.0 VGA compatible controller: Intel Corporation Mobile 915GM/GMS/910GML Express Graphics Controller (rev 03)

I can't remember seeing that error before. It is apparently printed by
drivers/char/agp/intel-gtt.c.

Reported at https://lkml.org/lkml/2014/2/8/201

Bjorn Helgaas asked me to "attach complete dmesg logs of the working and broken kernels to it". But the logs of the working kernel (v3.13.y based) and broken kernel (v3.14-rcy based) are identical up to that error.
Comment 1 Paul Bolle 2014-03-07 09:35:20 UTC
Created attachment 128401
dmesg of v3.14-rc5 boot

At Bjorn's request I added the first 30 seconds of dmesg of a v3.14-rc5 boot.

Since the dmesg output of v3.13.2 (the last v3.13 I have installed) and v3.14-rc5 is identical up to the error we're investigating here, I've not bothered to attach the v3.13.2 dmesg.
Comment 2 Bjorn Helgaas 2014-03-07 17:04:01 UTC
Created attachment 128461
iomem (v3.13.2)
Comment 3 Bjorn Helgaas 2014-03-07 17:04:46 UTC
Created attachment 128471
iomem (v3.14-rc1)

The diff between /proc/iomem on v3.13.2 and v3.14-rc1 is:
--- iomem-3.13.2	2014-02-08 21:14:30.214030591 +0100
+++ iomem-3.14-rc1	2014-02-08 21:07:22.041189158 +0100
@@ -11,16 +11,13 @@
   000e0000-000effff : Extension ROM
   000f0000-000fffff : System ROM
 00100000-7f6dffff : System RAM
-  00400000-009af63a : Kernel code
-  009af63b-00c932ff : Kernel data
-  00d4f000-00e4dfff : Kernel bss
+  00400000-009c57bf : Kernel code
+  009c57c0-00cb6aff : Kernel data
+  00d78000-00e74fff : Kernel bss
 7f6e0000-7f6f4fff : ACPI Tables
 7f6f5000-7f6fffff : ACPI Non-volatile Storage
 7f700000-7fffffff : reserved
   7f800000-7fffffff : Graphics Stolen Memory
-80000000-801fffff : PCI Bus 0000:02
-80200000-8027ffff : 0000:00:02.1
-80280000-80280fff : Intel Flush Page
 a0000000-a003ffff : 0000:00:02.0
 a0040000-a00403ff : 0000:00:1d.7
   a0040000-a00403ff : ehci_hcd
Comment 4 Paul Bolle 2014-03-08 10:57:28 UTC
Created attachment 128581
dmesg of v3.14-rc5 boot with Bjorn's debug patch

I've recompiled v3.14-rc5 with a debug patch (see https://lkml.org/lkml/2014/3/7/484 ). Log with the messages that this patch adds is attached.

It appears that commit 04f982beb900f37bc216d63c9dbc5bdddb4a3d3a ("Merge branch 'pci/msi' into next") is good while commit 96702be560374ee7e7139a34cab03554129abbb4 ("Merge branch 'pci/resource' into next") is bad. Just as Bjorn expected. I'll continue bisecting, but I suppose that by now it's clear to Bjorn where things went awry in the handful of commits between good and bad above.
Comment 5 Bjorn Helgaas 2014-03-08 14:00:25 UTC
Paul, you can stop bisecting.  I see what the problem is, but I probably won't be able to get you a patch to fix it until Monday.
Comment 6 Sven Joachim 2014-03-19 20:50:37 UTC
It seems this bug has been fixed in commit ac93ac7403493f8707b7734de9f40d5cb5db9045 ("PCI: Don't check resource_size() in pci_bus_alloc_resource()") and should be closed.

I saw the same message as Paul in 3.14-rc6 and can confirm that it's gone in -rc7.
Comment 7 Bjorn Helgaas 2014-03-19 20:54:44 UTC
Should be fixed by the following commit, which appeared in v3.14-rc7.  Thanks Paul and Sven!

http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=ac93ac740349
Comment 8 Bjorn Helgaas 2014-03-19 20:59:09 UTC
Here's my analysis of what I did wrong in f75b99d5a77d ("PCI: Enforce bus address limits in resource allocation"), where I introduced the regression:

The problem is basically that I used resource_size() to figure out
whether there's any available space.  resource_size() is res->end -
res->start + 1, so applying it to [mem 0x00000000-0xffffffff] returns
zero (the sum wraps around) in a kernel with 32-bit resource
addresses, i.e., with CONFIG_PHYS_ADDR_T_64BIT=n.
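
The wraparound can be reproduced in user space. The following is only a minimal sketch, not kernel code: size32(), size64() and check_wraparound() are hypothetical helpers that repeat the end - start + 1 arithmetic with a 32-bit and a 64-bit address type, standing in for resource_size_t with CONFIG_PHYS_ADDR_T_64BIT=n and =y respectively.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical user-space stand-ins for resource_size(); the names
 * size32() and size64() are made up for this illustration.
 */

/* 32-bit addresses: models CONFIG_PHYS_ADDR_T_64BIT=n. */
uint32_t size32(uint32_t start, uint32_t end)
{
	/* The result is truncated to 32 bits, so 0xffffffff + 1 becomes 0. */
	return end - start + 1;
}

/* 64-bit addresses: models CONFIG_PHYS_ADDR_T_64BIT=y. */
uint64_t size64(uint64_t start, uint64_t end)
{
	return end - start + 1;
}

void check_wraparound(void)
{
	/* [mem 0x00000000-0xffffffff] looks empty with 32-bit addresses... */
	assert(size32(0x00000000u, 0xffffffffu) == 0);

	/* ...but has the expected 4 GiB size with 64-bit addresses. */
	assert(size64(0x00000000u, 0xffffffffu) == 0x100000000ULL);

	/* A small window, such as the 4 KiB flush page, is unaffected. */
	assert(size32(0x80280000u, 0x80280fffu) == 0x1000);
}
```

Only a window that spans the entire 32-bit address space hits the wraparound, so a test such as "size == 0" misreads the largest possible window as an empty one.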