Bug 85491 - radeon 0000:01:00.0: Fatal error during GPU init
Status: RESOLVED CODE_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: PCI
Hardware: All Linux
Importance: P1 normal
Assignee: drivers_pci@kernel-bugs.osdl.org
 
Reported: 2014-10-02 21:36 UTC by Zermond
Modified: 2015-10-03 14:45 UTC

Kernel Version: 3.16.3-200.fc20.x86_64
Tree: Mainline
Regression: Yes


Attachments
journalctl (369.07 KB, application/octet-stream), 2014-10-02 21:36 UTC, Zermond
dmesg (72.82 KB, text/plain), 2014-10-02 21:36 UTC, Zermond
lspci (2.36 KB, text/plain), 2014-10-02 21:37 UTC, Zermond
uname-a with working kernel (118 bytes, application/octet-stream), 2014-10-02 21:37 UTC, Zermond
dmesg with 3.17 kernel (68.36 KB, text/plain), 2014-10-04 15:07 UTC, Zermond
journalctl with 3.17 kernel (172.80 KB, application/octet-stream), 2014-10-04 15:08 UTC, Zermond
logs for last working kernel (54.28 KB, application/octet-stream), 2014-11-11 19:44 UTC, Marek
logs for first not working kernel (60.34 KB, application/octet-stream), 2014-11-11 19:45 UTC, Marek
save_big_align for later (562 bytes, patch), 2014-12-04 05:38 UTC, Yinghai Lu
clear mmio64 flags when children device does not support it (2.02 KB, patch), 2014-12-06 00:27 UTC, Yinghai Lu
"cat /proc/iomem" and "lspci -vvv" (3.00 KB, application/x-zip-compressed), 2014-12-08 19:34 UTC, Marek
"cat /proc/iomem" and "lspci -vvv" for not working kernel 3.16.2 (3.03 KB, application/x-zip-compressed), 2014-12-08 19:35 UTC, Marek
output of "lspci -t" for kernel 3.15.10 (443 bytes, application/octet-stream), 2014-12-09 16:51 UTC, Marek
output of "dmesg" from branch from Yinghai Lu (132.61 KB, text/plain), 2014-12-09 18:33 UTC, Marek
AIDA64 report (467.48 KB, text/plain), 2014-12-22 15:38 UTC, Marek
clip resource under bridge (5.41 KB, patch), 2014-12-24 22:47 UTC, Yinghai Lu
photo of screen with kernel panic (1.78 MB, image/jpeg), 2014-12-25 15:56 UTC, Marek
clip resource under bridge (5.47 KB, patch), 2014-12-25 21:09 UTC, Yinghai Lu
dmesg log for kernel 3.18.1 and pci_bridge_clip_v3.patch (66.12 KB, text/plain), 2014-12-25 22:35 UTC, Marek
Arch linux 4.2-4 and rv610 dmesg (56.62 KB, text/plain), 2015-09-20 19:00 UTC, niam
Vanilla 4.2.2 patch test = WORKS! commit a4ad03352739c96842af5d06387595665cdd875e (141.54 KB, application/octet-stream), 2015-10-03 14:25 UTC, niam

Description Zermond 2014-10-02 21:36:23 UTC
Created attachment 152241 [details]
journalctl

Hello,

When I upgrade the kernel from 3.11.* to 3.16.*, I get the error in the subject (radeon 0000:01:00.0: Fatal error during GPU init). With kernel 3.11.* everything works.

I have attached the journalctl output (from the last boot with kernel 3.16.*), uname -a, dmesg, and lspci.
Comment 1 Zermond 2014-10-02 21:36:48 UTC
Created attachment 152251 [details]
dmesg
Comment 2 Zermond 2014-10-02 21:37:05 UTC
Created attachment 152261 [details]
lspci
Comment 3 Zermond 2014-10-02 21:37:36 UTC
Created attachment 152271 [details]
uname-a with working kernel
Comment 4 Alex Deucher 2014-10-02 21:43:05 UTC
Can you narrow down when the problem started?  Even better, can you bisect?
Comment 5 Zermond 2014-10-03 05:06:48 UTC
(In reply to Alex Deucher from comment #4)
> Can you narrow down when the problem started?  Even better, can you bisect?

As soon as I updated the kernel. With version 3.11 everything was fine; with version 3.16 the problem appears.
Comment 6 Alex Deucher 2014-10-03 13:30:28 UTC
(In reply to Zermond from comment #5)
> (In reply to Alex Deucher from comment #4)
> > Can you narrow down when the problem started?  Even better, can you bisect?
> 
> As soon as I updated the kernel. With version 3.11 everything was fine; with
> version 3.16 the problem appears.

Can you narrow it down any more than that?  Does 3.12 work ok?  3.13? etc.
Comment 7 Zermond 2014-10-04 15:06:03 UTC
(In reply to Alex Deucher from comment #6)
> (In reply to Zermond from comment #5)
> > (In reply to Alex Deucher from comment #4)
> > > Can you narrow down when the problem started?  Even better, can you
> > > bisect?
> > 
> > As soon as I updated the kernel. With version 3.11 everything was fine; with
> > version 3.16 the problem appears.
> 
> Can you narrow it down any more than that?  Does 3.12 work ok?  3.13? etc.

I'm sorry, I do not know how to use an older kernel.
I installed 3.17, but it also did not work.
I set up the boot loader to boot kernel 3.17 in runlevel 1 and captured dmesg and journalctl -xn, which I am attaching.
Comment 8 Zermond 2014-10-04 15:07:48 UTC
Created attachment 152401 [details]
dmesg with 3.17 kernel
Comment 9 Zermond 2014-10-04 15:08:12 UTC
Created attachment 152411 [details]
journalctl with 3.17 kernel
Comment 10 Marek 2014-11-11 19:43:57 UTC
Hi all, I have a similar problem and tried to find a solution, without success. Everything worked fine up to kernel 3.15.10-201, but after upgrading to 3.16.2-200 (and with every later kernel up to 3.16.7-200) the VESA driver is used instead of the radeon driver (low resolution, and the KDE GUI is a bit laggy, probably because GPU acceleration is not used). My description may not be accurate, but I will be happy to answer any of your questions. I have attached the output of journalctl, lsmod, dmesg and Xorg.log for the last working and the first non-working kernel. I am using Fedora 20 x64 on an ASUS M51Se notebook with ATI Radeon HD3470 graphics.
Comment 11 Marek 2014-11-11 19:44:45 UTC
Created attachment 157331 [details]
logs for last working kernel
Comment 12 Marek 2014-11-11 19:45:19 UTC
Created attachment 157341 [details]
logs for first not working kernel
Comment 13 Michel Dänzer 2014-11-13 03:35:33 UTC
Marek, can you bisect or otherwise narrow down what kernel change caused the problem for you?
Comment 14 Marek 2014-11-17 15:35:58 UTC
I can try, but this will be the first time I am going to do this. I have read this article: https://wiki.ubuntu.com/Kernel/KernelBisection and I am going to proceed accordingly.
Comment 15 Marek 2014-11-25 10:21:03 UTC
Hi, I also tried Kubuntu with a new kernel (newer than 3.15) and it did not work (earlier kernel versions worked with Kubuntu too), so this is not a Fedora-specific problem. The result of bisection is that the first bad commit is: 

e5558d1a516fa6924fa8d53152b665d4c26f142e Merge branches 'dma-api', 'pci/virtualization', 'pci/msi', 'pci/misc' and 'pci/resource' into next

I took a look at the code that was changed, but it is (for now) far beyond my abilities to come to a conclusion or even a guess. I am a Java developer and have also written a few small C programs in the past, so if needed I could help with testing or debugging.
Comment 16 Michel Dänzer 2014-11-26 03:32:38 UTC
(In reply to Marek from comment #15)
> The result of bisection is that the first bad commit is: 
> 
> e5558d1a516fa6924fa8d53152b665d4c26f142e Merge branches 'dma-api',
> 'pci/virtualization', 'pci/msi', 'pci/misc' and 'pci/resource' into next

In general, if the result of a bisection is a merge commit, it indicates something might have gone wrong during the bisection. In this case, I suspect the problem might not happen every time even with affected kernels, so you need to test several times before declaring a commit good.

You can double-check this by testing commit e5558d1a516fa6924fa8d53152b665d4c26f142e again several times. Does it happen every time? If yes, test its parent commit(s) again several times. Does it never happen? If the answer to either question is no, I'm afraid you need to start the bisection again.
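
The retest advice above can be sketched as a small helper. This is only an illustration in Python: `boot_ok` is a hypothetical callback standing in for "boot this kernel and check whether radeon initialised", and the default of five attempts is an arbitrary assumption, not a number suggested in this thread.

```python
from typing import Callable


def classify_commit(boot_ok: Callable[[], bool], attempts: int = 5) -> str:
    """Classify a commit for bisection when the failure may be intermittent.

    One failed boot is enough to call the commit "bad"; it is only called
    "good" if every one of the attempts succeeds.
    """
    for _ in range(attempts):
        if not boot_ok():
            return "bad"
    return "good"


if __name__ == "__main__":
    # Hypothetical outcomes of successive boots of one commit.
    outcomes = iter([True, True, False, True, True])
    print(classify_commit(lambda: next(outcomes)))
```

The asymmetry is the point: a single good boot proves nothing about an intermittent failure, while a single bad boot is conclusive.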
Comment 17 Dave Airlie 2014-11-26 03:52:08 UTC
This looks like a PCI regression:

Oct 03 03:49:35 localhost.localdomain kernel: pci 0000:01:00.0: can't claim BAR 0 [mem 0xc0000000-0xcfffffff pref]: no compatible bridge window

Oct 03 03:49:35 localhost.localdomain kernel: pci 0000:01:00.0: BAR 0: can't assign mem pref (size 0x10000000)
Oct 03 03:49:35 localhost.localdomain kernel: pci 0000:01:00.0: BAR 0: trying firmware assignment [mem size 0x10000000 pref]
Oct 03 03:49:35 localhost.localdomain kernel: pci 0000:01:00.0: BAR 0: [mem size 0x10000000 pref] conflicts with PCI Bus 0000:02 [mem 0xc0000000-0xc01fffff]

Bjorn?

Dave.
Comment 18 Marek 2014-11-27 07:58:59 UTC
It also seemed odd to me that the result of the bisect was a merge, so (before Michel's comment) I built one of the parents of this bad merge commit and then merged the other parents into it one by one:

Parent: 518a6a34f645897ec3440e5cbcf53ced3493ee1c - good - I started with this one
Parent: 14574674e461077a9f4dd5eae050f622e8b8c084 - good - I merged this to the commit above
Parent: 3cb30b73ad71b384c6289243d4ccd31ab90bce6f - good - I merged this to the commit above
Parent: 034cd97ebda4062eb4402a6cf963ccd262caa86a - good - I merged this to the commit above
Parent: 9edbcd2252b5ef148177c9f2c11a56469cf5db52 - good - I merged this to the commit above
Parent: 67d29b5c6c40e91b124695e9250c2fd24915e24a - bad

After Dave's comment I decided to merge commit 67d29b5c6c40e91b124695e9250c2fd24915e24a last. Based on this, I think the commit we are looking for is probably one of the commits between 67d29b5c6c40e91b124695e9250c2fd24915e24a and 0b2d70764bb39242dcc49c0ebd10fcb8258ce5fa.
Comment 19 Yinghai Lu 2014-12-03 21:55:06 UTC
[    0.113672] pci 0000:01:00.0: reg 0x10: [mem 0xc0000000-0xcfffffff pref]
[    0.113683] pci 0000:01:00.0: reg 0x14: [io  0xa000-0xa0ff]
[    0.113695] pci 0000:01:00.0: reg 0x18: [mem 0xfdff0000-0xfdffffff]
[    0.113729] pci 0000:01:00.0: reg 0x30: [mem 0xfdfc0000-0xfdfdffff pref]
[    0.113776] pci 0000:01:00.0: supports D1 D2
[    0.115016] pci 0000:00:01.0: PCI bridge to [bus 01]
[    0.115022] pci 0000:00:01.0:   bridge window [io  0x8000-0xafff]
[    0.115027] pci 0000:00:01.0:   bridge window [mem 0xfdf00000-0xfdffffff]
[    0.115034] pci 0000:00:01.0:   bridge window [mem 0xbdf00000-0xddefffff 64bit pref]

So the kernel rejects the firmware-assigned address for BAR 0 of 01:00.0,

and later it cannot allocate a new one...

[    0.169448] pci 0000:01:00.0: BAR 0: no space for [mem size 0x10000000 pref]
[    0.169453] pci 0000:01:00.0: BAR 0: trying firmware assignment [mem size 0x10000000 pref]
[    0.169458] pci 0000:01:00.0: BAR 0: [mem size 0x10000000 pref] conflicts with PCI Bus 0000:02 [mem 0xc0000000-0xc01fffff]
[    0.169461] pci 0000:01:00.0: BAR 0: failed to assign [mem size 0x10000000 pref]

and it says

[    0.169657] pci_bus 0000:00: Some PCI device resources are unassigned, try booting with pci=realloc

So will booting with pci=realloc fix the problem?

Also, in the old kernel:

[    0.187601] pci 0000:00:01.0: BAR 15: assigned [mem 0xc0000000-0xcfffffff pref]
[    0.187634] pci 0000:00:01.0: BAR 15: can't assign mem pref (size 0x10000000)
[    0.187638] pci 0000:00:01.0: failed to add 10000000 res[15]=[mem 0xc0000000-0xcfffffff pref]
[    0.187643] pci 0000:01:00.0: BAR 0: assigned [mem 0xc0000000-0xcfffffff pref]

That means the old kernel has pci=realloc enabled by default.
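
The address arithmetic behind the log lines in this comment can be checked with a few lines of Python. This is only a sketch: the `Range` helper is a made-up name for illustration, and the numbers are copied from the dmesg excerpts above.

```python
from typing import NamedTuple


class Range(NamedTuple):
    """An address range with an inclusive end, as printed in dmesg."""
    start: int
    end: int

    def size(self) -> int:
        return self.end - self.start + 1

    def contains(self, other: "Range") -> bool:
        return self.start <= other.start and other.end <= self.end

    def overlaps(self, other: "Range") -> bool:
        return self.start <= other.end and other.start <= self.end


# Values copied from the log lines quoted in this comment.
bar0 = Range(0xC0000000, 0xCFFFFFFF)         # 01:00.0 reg 0x10
pref_window = Range(0xBDF00000, 0xDDEFFFFF)  # 00:01.0 64bit pref bridge window
bus02 = Range(0xC0000000, 0xC01FFFFF)        # "PCI Bus 0000:02" in the conflict message

if __name__ == "__main__":
    print(hex(bar0.size()))            # size requested for BAR 0
    print(pref_window.contains(bar0))  # is the firmware BAR inside the BIOS window?
    print(bar0.overlaps(bus02))        # does the firmware address collide with bus 02?
```

The BAR is 0x10000000 bytes and lies numerically inside the window the BIOS programmed, so the firmware layout is self-consistent. The later "conflicts with PCI Bus 0000:02" message arises because that 0xc0000000 range has already been handed to another bus by the time the firmware assignment is retried.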
Comment 20 Yinghai Lu 2014-12-04 05:38:40 UTC
Created attachment 159601 [details]
save_big_align for later

This avoids 0xc0000000 being taken early by a device other than 00:01.0.

It needs to be used together with booting with "pci=realloc".
Comment 21 Bjorn Helgaas 2014-12-05 18:49:38 UTC
Zermond, Marek, is there any chance you could test these two commits:

5b28541552ef PCI: Restrict 64-bit prefetchable bridge windows to 64-bit resources
14c8530dbc1b PCI: Support BAR sizes up to 8GB

I think 5b28541552ef is probably what broke this.  14c8530dbc1b is the preceding commit and I suspect it will work.
Comment 22 Marek 2014-12-05 21:46:36 UTC
I will test that, but probably not before Sunday evening.
Comment 23 Yinghai Lu 2014-12-06 00:27:39 UTC
Created attachment 159861 [details]
clear mmio64 flags when children device does not support it

Please apply this patch only on 3.17 or 3.18.
Comment 24 Marek 2014-12-07 21:36:21 UTC
Bjorn: you were right - commit 14c8530dbc1b works and 5b28541552ef doesn't.

Yinghai: I applied your patch "clear mmio64 flags when..." to versions 3.17.0 and 3.18.0_rc7, and both patched versions work correctly.
Comment 25 Wei Yang 2014-12-08 09:18:55 UTC
Hi, Zermond & Marek

I am willing to take a look at this one.

If possible, would you mind providing some information on both the bad one and the good one?

What I need is:

cat /proc/iomem
lspci -vvv

Thanks for your efforts in advance.
Comment 26 Marek 2014-12-08 19:34:35 UTC
Created attachment 160081 [details]
"cat /proc/iomem" and "lspci -vvv"
Comment 27 Marek 2014-12-08 19:35:51 UTC
Created attachment 160091 [details]
"cat /proc/iomem" and "lspci -vvv" for not working kernel 3.16.2
Comment 28 Marek 2014-12-08 19:41:15 UTC
Hi Wei, of course - attachment https://bugzilla.kernel.org/attachment.cgi?id=160081 is for the working kernel 3.15.10 (sorry, I didn't add a sufficient comment on that attachment)
Comment 29 Wei Yang 2014-12-09 02:00:16 UTC
(In reply to Marek from comment #26)
> Created attachment 160081 [details]
> "cat /proc/iomem" and "lspci -vvv"

Thanks, give me some time to take a look.
Hope I could help :-)
Comment 30 Wei Yang 2014-12-09 02:54:08 UTC
(In reply to Marek from comment #28)
> Hi Wei, of course - attachment
> https://bugzilla.kernel.org/attachment.cgi?id=160081 is for working kernel
> 3.15.10 (sorry, I didn't add sufficient comment on that attachment)

Hmm... could I ask for one more piece of information?

The output of lspci -t would be helpful for me to understand the topology of the pci tree :-)
Comment 31 Marek 2014-12-09 16:50:53 UTC
(In reply to Wei Yang from comment #30)
> (In reply to Marek from comment #28)
> > Hi Wei, of course - attachment
> > https://bugzilla.kernel.org/attachment.cgi?id=160081 is for working kernel
> > 3.15.10 (sorry, I didn't add sufficient comment on that attachment)
> 
> Hmm... could I ask for one more piece of information?
> 
> The output of lspci -t would be helpful for me to understand the topology of
> the pci tree :-)

sure - it is from version 3.15.10
Comment 32 Marek 2014-12-09 16:51:31 UTC
Created attachment 160271 [details]
output of "lspci -t" for kernel 3.15.10
Comment 33 Marek 2014-12-09 18:33:34 UTC
Created attachment 160281 [details]
output of "dmesg" from branch from Yinghai Lu

git://git.kernel.org/pub/scm/linux/kernel/git/yinghai/linux-yinghai.git
branch: for-pci-allocate-fit-3.18
Comment 34 Yinghai Lu 2014-12-09 19:38:53 UTC
(In reply to Marek from comment #33)
> Created attachment 160281 [details]
> output of "dmesg" from branch from Yinghai Lu
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/yinghai/linux-yinghai.git
> branch: for-pci-allocate-fit-3.18

Thanks Marek,

That is working as expected.
Comment 35 Bjorn Helgaas 2014-12-19 22:17:16 UTC
The problem is that BIOS programmed an invalid Root Port window leading to the Radeon device.  The window contains the Radeon device, so the configuration actually *works* fine, but the window is invalid because it either overlaps system RAM or starts below the upstream host bridge window, so Linux discards it:

Zermond's system:
  acpi PNP0A08:00: host bridge window [mem 0xc0000000-0xffffffff] (ignored)
  pci_bus 0000:00: root bus resource [mem 0x00000000-0xfffffffff]
  pci 0000:00:01.0:   bridge window [mem 0xbdf00000-0xddefffff 64bit pref]  # invalid Root Port window
  pci 0000:00:01.0: can't claim BAR 15 [mem 0xbdf00000-0xddefffff 64bit pref]: address conflict with System RAM [mem 0x00100000-0xbff9ffff]
  pci 0000:01:00.0: can't claim BAR 0 [mem 0xc0000000-0xcfffffff pref]: no compatible bridge window  # Radeon

Marek's system:
  pci_bus 0000:00: root bus resource [mem 0xc0000000-0xffffffff]   (from _CRS)
  pci 0000:00:01.0:   bridge window [mem 0xbdf00000-0xddefffff 64bit pref]  # invalid Root Port window
  pci 0000:00:01.0: can't claim BAR 15 [mem 0xbdf00000-0xddefffff 64bit pref]: no compatible bridge window
  pci 0000:01:00.0: can't claim BAR 0 [mem 0xc0000000-0xcfffffff pref]: no compatible bridge window  # Radeon

The Root Port window had the same problem before 5b28541552ef, of course, since BIOS set it up.  But before 5b28541552ef, Linux assigned a valid window big enough for the Radeon:

  pci 0000:00:01.0:   bridge window [mem 0xc0000000-0xcfffffff pref]

After 5b28541552ef, we won't put a 64-bit window below 4GB, so we assign space above 4GB:

  pci 0000:00:01.0:   bridge window [mem 0x140000000-0x1401fffff 64bit pref]

which is not usable by Radeon, since it only has a 32-bit BAR.
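
The checks described in this comment can be written out as a short Python sketch. The function names are invented for illustration, alignment requirements are ignored, and the addresses are the ones quoted above. This is not how the kernel implements the checks.

```python
FOUR_GB = 1 << 32
RADEON_BAR0_SIZE = 0x10000000  # from "BAR 0: ... (size 0x10000000)"


def overlaps(a, b):
    """True if two (start, end) ranges with inclusive ends share any address."""
    return a[0] <= b[1] and b[0] <= a[1]


def usable_by_32bit_bar(window, bar_size):
    """A 32-bit BAR needs a window that ends below 4 GB and is large enough.

    Alignment is deliberately ignored in this sketch.
    """
    start, end = window
    return end < FOUR_GB and (end - start + 1) >= bar_size


bios_window = (0xBDF00000, 0xDDEFFFFF)    # Root Port window programmed by BIOS
system_ram = (0x00100000, 0xBFF9FFFF)     # Zermond's system
host_bridge = (0xC0000000, 0xFFFFFFFF)    # Marek's system, from _CRS
old_window = (0xC0000000, 0xCFFFFFFF)     # assigned before 5b28541552ef
new_window = (0x140000000, 0x1401FFFFF)   # assigned after 5b28541552ef

if __name__ == "__main__":
    print(overlaps(bios_window, system_ram))                  # conflict with RAM
    print(bios_window[0] < host_bridge[0])                    # starts below host bridge window
    print(usable_by_32bit_bar(old_window, RADEON_BAR0_SIZE))
    print(usable_by_32bit_bar(new_window, RADEON_BAR0_SIZE))
```

With these numbers the BIOS window fails both validity checks, the old window passes the 32-bit check, and the new window fails it twice over: it lies above 4 GB and is only 0x200000 bytes long.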
Comment 36 Bjorn Helgaas 2014-12-20 01:02:21 UTC
Zermond, Marek, do either of you have Windows on this box?  If so, I'm interested in an AIDA64 dump.  You can get a free trial version at http://www.aida64.com/downloads

My suspicion is that Windows trims the Root Port window to fit inside the host bridge window.  That would leave everything working almost identically to how the BIOS configured it, so it would be sort of a minimal change.
Comment 37 Marek 2014-12-22 15:38:23 UTC
Created attachment 161621 [details]
AIDA64 report
Comment 38 Marek 2014-12-22 15:44:21 UTC
Here you are Bjorn, I am not sure whether this is the dump you have in mind - if it's not then let me know and I will give you what you need.
Comment 39 Yinghai Lu 2014-12-24 22:47:59 UTC
Created attachment 161761 [details]
clip resource under bridge

Please test it on top of v3.18.
Comment 40 Yinghai Lu 2014-12-24 22:55:25 UTC
(In reply to Marek from comment #37)
> Created attachment 161621 [details]
> AIDA64 report

So Windows changes the window from [mem 0xbdf00000-0xddefffff 64bit pref] to C0000000-DFFFFFFF?

      Driver Description                                Mobile Intel(R) PM965/GM965/GL960/GS965 Express PCI Express Root Port - 2A01
      Driver Date                                       6/21/2006
      Driver Version                                    6.1.7601.17514
      Driver Provider                                   Microsoft
      INF File                                          machine.inf
      Hardware ID                                       PCI\VEN_8086&DEV_2A01&SUBSYS_2A018086&REV_03
      Location Information                              PCI bus 0, device 1, function 0
      PCI Device                                        Intel GL960/GM965/PM965 Chipset - PCI Express Root Port [C-0]

    Device Resources:
      IRQ                                               131071
      Memory                                            000A0000-000BFFFF
      Memory                                            C0000000-DFFFFFFF
      Memory                                            FDF00000-FDFFFFFF
      Port                                              03B0-03BB
      Port                                              03C0-03DF
      Port                                              8000-AFFF
Comment 41 Marek 2014-12-25 15:56:08 UTC
Created attachment 161801 [details]
photo of screen with kernel panic

After applying the second version of Yinghai's patch (pci_bridge_clip_v2.patch) to versions 3.18.0 and 3.18.1, the boot process stops with a kernel panic. I have attached a photo of the screen with the log. (I need to find a way to get the full log.)
Comment 42 Yinghai Lu 2014-12-25 21:09:58 UTC
Created attachment 161841 [details]
clip resource under bridge
Comment 43 Yinghai Lu 2014-12-25 21:10:20 UTC
Please check the updated patch.
Comment 44 Marek 2014-12-25 22:35:11 UTC
Created attachment 161851 [details]
dmesg log for kernel 3.18.1 and pci_bridge_clip_v3.patch

Yinghai's patch v3 was tested on kernel version 3.18.1 and it works fine. I attached output of dmesg.
Comment 45 gob_iron 2015-01-05 21:12:19 UTC
I've just submitted a bug report https://bugzilla.kernel.org/show_bug.cgi?id=90831
regarding my kernel's inability to claim BAR 0, although under slightly different circumstances and with an NVidia card.

Would you be able to tell me if the patch will help?  To get to the stage at which the kernel reports that it can't claim BAR 0, I had to set pci=use_crs on the grub cmdline.

Thanks.
Comment 46 Bjorn Helgaas 2015-03-09 21:51:48 UTC
This should be resolved by the following commits, which appeared in v3.19:

  3f2f4dc456e9 ("PCI: Pass bridge device, not bus, when updating bridge windows")
  0f7e7aee2f37 ("PCI: Add pci_bus_clip_resource() to clip to fit upstream window")
  8505e729a2f6 ("PCI: Add pci_claim_bridge_resource() to clip window if necessary")
  851b09369255 ("x86/PCI: Clip bridge windows to fit in upstream windows")

There are commits similar to 851b09369255 for arches other than x86.
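
The idea of clipping a bridge window to fit its upstream window amounts to taking the intersection of two ranges. The following Python sketch shows only that arithmetic, using the window values from this bug. The function name is made up, and the real pci_bus_clip_resource() also has to match resource types and walk the upstream bus resources, which is not modelled here.

```python
from typing import Optional, Tuple

Range = Tuple[int, int]  # (start, end), end inclusive


def clip_to_upstream(window: Range, upstream: Range) -> Optional[Range]:
    """Return the part of `window` that lies inside `upstream`, or None if nothing is left."""
    start = max(window[0], upstream[0])
    end = min(window[1], upstream[1])
    if start > end:
        return None
    return (start, end)


bios_window = (0xBDF00000, 0xDDEFFFFF)  # invalid Root Port window from BIOS
host_bridge = (0xC0000000, 0xFFFFFFFF)  # host bridge window from _CRS

clipped = clip_to_upstream(bios_window, host_bridge)

if __name__ == "__main__":
    print([hex(x) for x in clipped])
    print(hex(clipped[1] - clipped[0] + 1))  # size of the clipped window
```

The clipped window is 0xc0000000-0xddefffff, which still contains the Radeon BAR at 0xc0000000-0xcfffffff. Its size, 0x1df00000, is the same figure that appears in a "clipped to" log message quoted later in this thread.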
Comment 47 niam 2015-09-20 12:50:26 UTC
Hi all,
Please advise whether this bug was fixed permanently or is present again since the 4.1 kernel.

I see the following with an RV610 on an ASUS F7SR laptop during kernel boot:
cat journalctlb  | grep -Ei 'radeon|drm'
Sep 06 16:00:59 h4os kernel: [drm] Initialized drm 1.1.0 20060810
Sep 06 16:00:59 h4os kernel: [drm] radeon kernel modesetting enabled.
Sep 06 16:00:59 h4os kernel: [drm] initializing kernel modesetting (RV610 0x1002:0x94C9 0x1043:0x15B2).
Sep 06 16:00:59 h4os kernel: [drm] register mmio base: 0xFDEF0000
Sep 06 16:00:59 h4os kernel: [drm] register mmio size: 65536
Sep 06 16:00:59 h4os kernel: radeon 0000:01:00.0: VRAM: 256M 0x0000000000000000 - 0x000000000FFFFFFF (256M used)
Sep 06 16:00:59 h4os kernel: radeon 0000:01:00.0: GTT: 512M 0x0000000010000000 - 0x000000002FFFFFFF
Sep 06 16:00:59 h4os kernel: [drm] Detected VRAM RAM=256M, BAR=0M
Sep 06 16:00:59 h4os kernel: [drm] RAM width 64bits DDR
Sep 06 16:00:59 h4os kernel: radeon 0000:01:00.0: Fatal error during GPU init
Sep 06 16:00:59 h4os kernel: [drm] radeon: finishing device.
Sep 06 16:00:59 h4os kernel: [drm] radeon: ttm finalized
Sep 06 16:00:59 h4os kernel: radeon: probe of 0000:01:00.0 failed with error -12

All details are collected on the Arch forum: https://bbs.archlinux.org/viewtopic.php?id=202078
Comment 48 Yinghai Lu 2015-09-20 17:11:02 UTC
Hi Niam,

Please post the boot log with "debug ignore_loglevel", so we can find out whether
the PCI resource allocation causes the problem.
Comment 49 niam 2015-09-20 19:00:38 UTC
Created attachment 188021 [details]
Arch linux 4.2-4 and  rv610 dmesg

I have this problem with the RV610 on all kernels from 4.1 onward, while 3.19 works perfectly.
The dmesg is attached. It was captured with the kernel booted with: debug ignore_loglevel log_buf_len=10M print_fatal_signals=1 LOGLEVEL=8 earlyprintk=vga,keep sched_debug

The problem is the following:
 radeon 0000:01:00.0: Fatal error during GPU init
[    0.694918] [drm] radeon: finishing device.
[    0.702587] [drm] radeon: ttm finalized
[    0.702776] radeon: probe of 0000:01:00.0 failed with error -12
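
The "error -12" in the last line is a negated errno value, which is how kernel functions report failures. A quick way to look up the symbolic name is Python's standard `errno` module. The snippet below is just a lookup aid and says nothing about where in the radeon driver the error originates.

```python
import errno
import os

probe_result = -12  # from "probe of 0000:01:00.0 failed with error -12"

code = -probe_result
name = errno.errorcode[code]     # symbolic errno name
description = os.strerror(code)  # platform-specific message text

if __name__ == "__main__":
    print(name, "-", description)
```

Errno 12 is ENOMEM, which fits a driver that could not get the memory resource it needed for the card.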
Comment 50 niam 2015-09-20 19:02:09 UTC
(In reply to Yinghai Lu from comment #48)
> Hi Niam,
> 
> Please post boot log with "debug ignore_loglevel", so we can find out if
> the pci resource allocation cause the problem.

Please see https://bugzilla.kernel.org/attachment.cgi?id=188021
Comment 51 niam 2015-09-20 19:08:45 UTC
Actually, probably these lines: 

ACPI : EC: GPE = 0x1c, I/O: command/status = 0x66, data = 0x62
[    0.221255] vgaarb: setting as boot device: PCI:0000:01:00.0
[    0.221255] vgaarb: device added: PCI:0000:01:00.0,decodes=io+mem,owns=io+mem,locks=none
[    0.221255] vgaarb: loaded
[    0.221255] vgaarb: bridge control possible 0000:01:00.0
[    0.221255] PCI: Using ACPI for IRQ routing
[    0.228084] PCI: pci_cache_line_size set to 64 bytes
[    0.228100] pci 0000:00:01.0: can't claim BAR 15 [mem 0xbdf00000-0xddefffff 64bit pref]: no compatible bridge window
[    0.228119] pci 0000:00:01.0: [mem size 0x20000000 64bit pref] clipped to [mem size 0x1df00000 64bit pref]
[    0.228137] pci 0000:00:01.0:   bridge window [mem size 0x1df00000 64bit pref]
[    0.228155] pci 0000:00:01.0: can't claim BAR 15 [mem size 0x1df00000 64bit pref]: no address assigned
[    0.228180] pci 0000:01:00.0: can't claim BAR 0 [mem 0xc0000000-0xcfffffff pref]: no compatible bridge window

show that something goes wrong with the initialization of the GPU.
The question is what to do and how to fix this in new kernels.
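
For anyone checking their own logs for the same symptom, the "can't claim BAR" lines can be pulled out with a small Python script. This is a sketch: the regular expression is written against the message format seen in this thread and may not match other kernel versions, and `find_unclaimed` is a made-up helper name.

```python
import re

# Matches messages such as:
#   pci 0000:01:00.0: can't claim BAR 0 [mem 0xc0000000-0xcfffffff pref]: no compatible bridge window
# Lines without an address range ("[mem size 0x... ]") are skipped on purpose.
CLAIM_RE = re.compile(
    r"pci (?P<dev>[0-9a-f:.]+): can't claim BAR (?P<bar>\d+) "
    r"\[mem (?P<start>0x[0-9a-f]+)-(?P<end>0x[0-9a-f]+)[^\]]*\]: (?P<reason>.+)$",
    re.MULTILINE,
)


def find_unclaimed(log: str):
    """Return (device, bar, start, end, reason) for each failed claim that has an address."""
    return [
        (m["dev"], int(m["bar"]), int(m["start"], 16), int(m["end"], 16), m["reason"])
        for m in CLAIM_RE.finditer(log)
    ]


SAMPLE = """\
[    0.228100] pci 0000:00:01.0: can't claim BAR 15 [mem 0xbdf00000-0xddefffff 64bit pref]: no compatible bridge window
[    0.228155] pci 0000:00:01.0: can't claim BAR 15 [mem size 0x1df00000 64bit pref]: no address assigned
[    0.228180] pci 0000:01:00.0: can't claim BAR 0 [mem 0xc0000000-0xcfffffff pref]: no compatible bridge window
"""

if __name__ == "__main__":
    for dev, bar, start, end, reason in find_unclaimed(SAMPLE):
        print(f"{dev} BAR {bar}: {start:#x}-{end:#x} ({reason})")
```

Run against the lines quoted in this comment, it reports the bridge window on 00:01.0 (BAR 15) and BAR 0 of the Radeon at 01:00.0.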
Comment 52 Yinghai Lu 2015-09-21 01:28:34 UTC
Please check this patch:

commit a4ad03352739c96842af5d06387595665cdd875e
Author: Bjorn Helgaas <bhelgaas@google.com>
Date:   Fri Sep 18 17:15:01 2015 -0500

    PCI: Clear IORESOURCE_UNSET when clipping a bridge window
    
    c770cb4cb505 ("PCI: Mark invalid BARs as unassigned") sets IORESOURCE_UNSET
    if we fail to claim a resource.  If we tried to claim a bridge window,
    failed, clipped the window, and tried to claim the clipped window, we
    failed again because of IORESOURCE_UNSET.
    
    When pci_bus_clip_resource() clips a bridge window to fit inside an
    upstream window, we're reassigning the window, so clear the
    IORESOURCE_UNSET flag.  Also clear IORESOURCE_UNSET in our copy of the
    unclipped window so we can see exactly what the original window was and how
    it now fits inside the upstream window.
    
    Fixes: c770cb4cb505 ("PCI: Mark invalid BARs as unassigned")
    Based-on-patch-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
    Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
    CC: stable@vger.kernel.org	# 4.1+

diff --git a/drivers/pci/bus.c b/drivers/pci/bus.c
index 6fbd3f2..d3346d2 100644
--- a/drivers/pci/bus.c
+++ b/drivers/pci/bus.c
@@ -256,6 +256,8 @@ bool pci_bus_clip_resource(struct pci_dev *dev, int idx)
 
 		res->start = start;
 		res->end = end;
+		res->flags &= ~IORESOURCE_UNSET;
+		orig_res.flags &= ~IORESOURCE_UNSET;
 		dev_printk(KERN_DEBUG, &dev->dev, "%pR clipped to %pR\n",
 				 &orig_res, res);
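
The failure sequence described in the commit message (claim fails, window is clipped, second claim fails because of IORESOURCE_UNSET) can be replayed with a toy model in Python. Everything here is a simplification for illustration: `Resource`, `claim` and `clip` are stand-ins rather than kernel APIs, and the flag value is assumed to match include/linux/ioport.h.

```python
from dataclasses import dataclass

IORESOURCE_UNSET = 0x20000000  # assumed to match include/linux/ioport.h


@dataclass
class Resource:
    start: int
    end: int  # inclusive
    flags: int = 0


def claim(res: Resource, upstream: Resource) -> bool:
    """Toy claim: refuse unset resources, then require containment in `upstream`."""
    if res.flags & IORESOURCE_UNSET:
        return False  # corresponds to "no address assigned"
    if upstream.start <= res.start and res.end <= upstream.end:
        return True
    res.flags |= IORESOURCE_UNSET  # behaviour the commit message attributes to c770cb4cb505
    return False


def clip(res: Resource, upstream: Resource, clear_unset: bool) -> None:
    """Toy clip: shrink `res` to fit `upstream`; optionally clear UNSET as the patch does."""
    res.start = max(res.start, upstream.start)
    res.end = min(res.end, upstream.end)
    if clear_unset:
        res.flags &= ~IORESOURCE_UNSET


def claim_with_clip(clear_unset: bool) -> bool:
    upstream = Resource(0xC0000000, 0xFFFFFFFF)
    window = Resource(0xBDF00000, 0xDDEFFFFF)
    if claim(window, upstream):
        return True
    clip(window, upstream, clear_unset)
    return claim(window, upstream)


if __name__ == "__main__":
    print("without the fix:", claim_with_clip(clear_unset=False))
    print("with the fix:   ", claim_with_clip(clear_unset=True))
```

Without clearing the flag the second claim is refused even though the clipped window now fits; with the two added lines from the patch it succeeds.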
Comment 53 niam 2015-09-21 09:59:06 UTC
(In reply to Yinghai Lu from comment #52)
> please check patch: 
> 
> commit a4ad03352739c96842af5d06387595665cdd875e
> Author: Bjorn Helgaas <bhelgaas@google.com>
> Date:   Fri Sep 18 17:15:01 2015 -0500
> 
>     PCI: Clear IORESOURCE_UNSET when clipping a bridge window
>     
>     c770cb4cb505 ("PCI: Mark invalid BARs as unassigned") sets
> IORESOURCE_UNSET
>     if we fail to claim a resource.  If we tried to claim a bridge window,
>     failed, clipped the window, and tried to claim the clipped window, we
>     failed again because of IORESOURCE_UNSET.
>     
>     When pci_bus_clip_resource() clips a bridge window to fit inside an
>     upstream window, we're reassigning the window, so clear the
>     IORESOURCE_UNSET flag.  Also clear IORESOURCE_UNSET in our copy of the
>     unclipped window so we can see exactly what the original window was and
> how
>     it now fits inside the upstream window.
>     
>     Fixes: c770cb4cb505 ("PCI: Mark invalid BARs as unassigned")
>     Based-on-patch-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
>     Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
>     CC: stable@vger.kernel.org        # 4.1+
> 
> diff --git a/drivers/pci/bus.c b/drivers/pci/bus.c
> index 6fbd3f2..d3346d2 100644
> --- a/drivers/pci/bus.c
> +++ b/drivers/pci/bus.c
> @@ -256,6 +256,8 @@ bool pci_bus_clip_resource(struct pci_dev *dev, int idx)
>  
>               res->start = start;
>               res->end = end;
> +             res->flags &= ~IORESOURCE_UNSET;
> +             orig_res.flags &= ~IORESOURCE_UNSET;
>               dev_printk(KERN_DEBUG, &dev->dev, "%pR clipped to %pR\n",
>                                &orig_res, res);

Thank you, but the main question is whether it will be included in the mainline vanilla kernel, and from which version.
If you need me to test it first, I need to build a custom kernel with it, because the kernels I mentioned before were stock prebuilt kernels for Arch Linux.
Comment 54 Yinghai Lu 2015-09-21 21:10:07 UTC
Niam,

Please try to build a custom kernel with that patch.
It should be applied on top of a kernel after v4.1.
Comment 55 niam 2015-10-03 14:25:10 UTC
Created attachment 189381 [details]
Vanilla 4.2.2 patch test = WORKS! commit a4ad03352739c96842af5d06387595665cdd875e

Testing commit a4ad03352739c96842af5d06387595665cdd875e on vanilla 4.2.2 under Arch = WORKS!
Comment 56 niam 2015-10-03 14:27:42 UTC
(In reply to Yinghai Lu from comment #54)
> Niam,
> 
> Please try to build one customized kernel with that patch.
> should be on top of kernel after v4.1

I have tested with the vanilla 4.2.2 kernel + .config from Arch 4.2.2-1 = WORKS! The GPU is now detected; attached is a dmesg with ignore_loglevel confirming this.
Please note that this patch is still NOT included in the main line kernel! :(
Comment 57 Bjorn Helgaas 2015-10-03 14:45:11 UTC
> Please note that this patch is still NOT included in the main line kernel! :(

This patch appeared in v4.3-rc3:

http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=b838b39e930a

It will be released in v4.3, and it's marked for stable, so it will likely be backported to the stable kernels for v4.1 and later.
