Bug 194761 - amdgpu driver breaks on Oland (SI)
Summary: amdgpu driver breaks on Oland (SI)
Status: RESOLVED CODE_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: Video(DRI - non Intel)
Hardware: x86-64 Linux
Importance: P1 normal
Assignee: drivers_video-dri
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-03-02 08:57 UTC by Jean Delvare
Modified: 2017-10-23 11:07 UTC
CC List: 10 users

See Also:
Kernel Version: 4.10
Tree: Mainline
Regression: Yes


Attachments
Checkerboard effect on glxgears (53.03 KB, image/jpeg)
2017-03-02 08:57 UTC, Jean Delvare
possible fix (1.73 KB, patch)
2017-05-26 16:38 UTC, Alex Deucher
possible fix (1.75 KB, patch)
2017-06-02 16:18 UTC, Alex Deucher
another possible fix (1.11 KB, patch)
2017-06-02 20:35 UTC, Alex Deucher
dmesg output (69.37 KB, text/plain)
2017-06-09 18:51 UTC, siyia
drm/amdgpu/gfx6: [TEST] fix tiling setup for oland (for v4.11) (1.27 KB, patch)
2017-06-12 07:39 UTC, Jean Delvare
Bumped Patch version number for linux 4.12 (1.27 KB, patch)
2017-07-03 14:37 UTC, siyia
possible final fix? (1.64 KB, patch)
2017-07-28 14:16 UTC, Marek Olšák
possible final fix? (4.91 KB, patch)
2017-07-28 14:31 UTC, Marek Olšák
[PATCH] drm/amdgpu: revert tile table update for oland (for kernels v4.10 to v4.12) (11.74 KB, patch)
2017-09-04 08:11 UTC, Jean Delvare
[PATCH] drm/amdgpu: revert tile table update for oland (for kernels v4.13 and up) (10.46 KB, patch)
2017-09-11 15:39 UTC, Jean Delvare

Description Jean Delvare 2017-03-02 08:57:30 UTC
Created attachment 255045 [details]
Checkerboard effect on glxgears

After upgrading my kernel from v4.9.11 to v4.10.0, all 3D applications are broken. The most visible problem is a checkerboard effect, with blinking of some areas. See the attached screenshot of glxgears for an example. The bug only happens when I use the amdgpu driver (with CONFIG_DRM_AMDGPU_SI=y.) If I use the radeon driver instead then all is fine.

There are no error messages in the kernel log, nor in /var/log/Xorg.0.log. However I could spot differences between the kernel logs:

-Linux version 4.9.11-1-default (geeko@buildhost) (gcc version 6.3.1 20170202 [gcc-6-branch revision 245119] (SUSE Linux) ) #1 SMP PREEMPT Sat Feb 18 17:59:27 UTC 2017 (cf9c670)
+Linux version 4.10.1-1-default (geeko@buildhost) (gcc version 6.3.1 20170202 [gcc-6-branch revision 245119] (SUSE Linux) ) #1 SMP PREEMPT Sun Feb 26 12:43:10 UTC 2017 (1ecd5af)

 [drm] GPU post is not needed
 amdgpu 0000:01:00.0: VRAM: 1024M 0x0000000000000000 - 0x000000003FFFFFFF (1024M used)
-amdgpu 0000:01:00.0: GTT: 3991M 0x0000000040000000 - 0x00000001397847FF
+amdgpu 0000:01:00.0: GTT: 1024M 0x0000000040000000 - 0x000000007FFFFFFF
 [drm] Detected VRAM RAM=1024M, BAR=256M
 [drm] RAM width 64bits DDR3
 [drm] amdgpu: 1024M of VRAM memory ready
-[drm] amdgpu: 3991M of GTT memory ready.
-[drm] GART: num cpu pages 1021828, num gpu pages 1021828
-amdgpu 0000:01:00.0: PCIE GART of 3991M enabled (table at 0x0000000000040000).
+[drm] amdgpu: 1024M of GTT memory ready.
+[drm] GART: num cpu pages 262144, num gpu pages 262144
+amdgpu 0000:01:00.0: PCIE GART of 1024M enabled (table at 0x0000000000040000).

-[drm] Initialized amdgpu 3.8.0 20150101 for 0000:01:00.0 on minor 0
+[drm] Initialized amdgpu 3.9.0 20150101 for 0000:01:00.0 on minor 0

So something changed in the detection of the available memory. I don't know if this is related to the bug. As a comparison point, the radeon driver disagrees with both the old amdgpu driver and the new amdgpu driver:

radeon 0000:01:00.0: VRAM: 1024M 0x0000000000000000 - 0x000000003FFFFFFF (1024M used)
radeon 0000:01:00.0: GTT: 2048M 0x0000000040000000 - 0x00000000BFFFFFFF
[drm] Detected VRAM RAM=1024M, BAR=256M
[drm] RAM width 64bits DDR
[drm] radeon: 1024M of VRAM memory ready
[drm] radeon: 2048M of GTT memory ready.
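
The page counts in the logs above are consistent with the logged GTT address ranges if GART pages are 4 KiB. That page size is an assumption here, not something the log states. A minimal C sketch of the arithmetic (the helper names are illustrative, not driver functions):

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: GART pages are 4 KiB, which matches the logged page counts. */
#define GART_PAGE_SIZE 4096ULL

/* Number of whole pages covered by the inclusive address range [start, end]. */
static inline uint64_t gtt_pages(uint64_t start, uint64_t end)
{
	return (end - start + 1) / GART_PAGE_SIZE;
}

/* Size of the same range in whole MiB, as printed in the "GTT: ...M" line. */
static inline uint64_t gtt_mib(uint64_t start, uint64_t end)
{
	return (end - start + 1) >> 20;
}
```

Applied to the ranges above, the v4.9 amdgpu range gives 1021828 pages (3991M), the v4.10 amdgpu range gives 262144 pages (1024M), and the radeon range gives 524288 pages (2048M).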
Comment 1 flora.cui 2017-03-02 09:37:33 UTC
Reverting this commit should fix this issue.
commit c7bc82efa60dd0ba220fbfc1bed5edc79e831a3e
Refs: v4.9-272-gc7bc82e
Author:     Flora Cui <Flora.Cui@amd.com>
AuthorDate: Thu Dec 15 16:29:31 2016 +0800
Commit:     Alex Deucher <alexander.deucher@amd.com>
CommitDate: Wed Dec 21 15:37:47 2016 -0500

    drm/amdgpu: update tile table for oland/hainan

    Change-Id: Ib8b66559d662cbe8a9f40f7405a4a052c189f604
    Signed-off-by: Flora Cui <Flora.Cui@amd.com>
    Reviewed-by: Junwei Zhang <Jerry.Zhang@amd.com>
    Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
    Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

Alternatively, applying the following commit series should fix this issue too.
* 8873e69 - drm/amd/gfx6: update gb_addr_config <Flora Cui> 2017-02-09 14:18:05 +0800
* 0882518 - drm/amdgpu: update HAINAN_GB_ADDR_CONFIG_GOLDEN <Flora Cui> 2017-02-09 14:18:04 +0800
* 04cab25 - drm/amdgpu: update VERDE_GB_ADDR_CONFIG_GOLDEN <Flora Cui> 2017-02-09 14:18:04 +0800
* a99ecc6 - drm/amdgpu: refine si_read_register <Flora Cui> 2017-02-09 14:18:03 +0800
* ffb3c1c - drm/amdgpu/gfx6: clean up spi configuration <Flora Cui> 2017-02-09 14:18:03 +0800
* d8a4e76 - drm/amdgpu/gfx6: clean up cu configuration <Flora Cui> 2017-02-09 14:18:02 +0800
* 1abd532 - drm/amdgpu/gfx6: clean up rb configuration <Flora Cui> 2017-02-09 14:18:02 +0800
Comment 2 Jean Delvare 2017-03-02 15:12:05 UTC
Thanks for the quick reply. I confirm that kernel v4.10.1 with commit f8d9422ef80c ("drm/amdgpu: update tile table for oland/hainan") reverted works again. Not sure why your commit ID is different but that's clearly the same patch.

How do we fix this in the 4.10 stable kernel branch? Simply push the reverted patch to stable@?
Comment 3 Alex Deucher 2017-03-02 16:16:39 UTC
Either push the revert to stable, or pull the additional fixes back from 4.11.
Comment 4 Jean Delvare 2017-03-02 16:23:07 UTC
I know both are possible ;-) But I don't know what the benefits of each option are. I'd go for just reverting one patch because it seems simpler, but to be honest I have no idea what that patch was trying to achieve, so I don't know if we lose anything important by reverting it.
Comment 5 Alex Deucher 2017-03-02 16:24:53 UTC
It's just an optimization.
Comment 6 Jean Delvare 2017-03-02 17:23:40 UTC
OK, let's just revert then. I have just sent a revert patch to stable@.
Comment 7 Jean Delvare 2017-03-15 11:30:33 UTC
Fixed in kernel v4.10.3 by reverting commit 8d9422ef80c ("drm/amdgpu: update tile table for oland/hainan").

Fixed in kernel v4.11 by the commits mentioned in comment #1.
Comment 8 Jean Delvare 2017-05-19 06:16:09 UTC
This is NOT fixed in v4.11. The claims in comment #1 are incorrect, the patch series mentioned in that comment doesn't fix the problem.

Should we revert commit 8d9422ef80c ("drm/amdgpu: update tile table for oland/hainan") permanently?
Comment 9 Jean Delvare 2017-05-19 06:35:57 UTC
Sorry there was a missing digit in the commit ID, I obviously meant f8d9422ef80c ("drm/amdgpu: update tile table for oland/hainan".)
Comment 10 flora.cui 2017-05-19 09:43:20 UTC
glxgears works on oland with alex's 4.11 branch (HEAD: commit 3a7370c - drm/amdgpu: correct emit frame size for vcn dec/enc ring).
Comment 11 Jean Delvare 2017-05-19 12:45:08 UTC
Can you please point me to the exact tree and branch I am supposed to test?

Also, do you happen to know which commits in that branch are fixing the problem?
Comment 12 flora.cui 2017-05-22 02:32:20 UTC
https://cgit.freedesktop.org/~agd5f/linux/log/?h=amd-staging-4.11

I didn't check the commits. The commits in #commit1 should work.
Comment 13 Jean Delvare 2017-05-22 09:26:54 UTC
Assuming you mean "in comment #1", well no, they do not work for me. Have you tested? I naively trusted you, and as a result agreed not to get the problematic commit reverted upstream, and now users are hitting a regression when upgrading to kernel v4.11. This isn't good.

Now we need to figure out whether it is actually fixed upstream or not. No more assumptions. And if it is fixed, knowing which commit fixed it would let us push the fix to the 4.11 stable tree. However I am not going to waste time bisecting it. If nobody knows how it was fixed, I will push the same revert to 4.11 that was pushed to 4.10, and be done with it.
Comment 14 flora.cui 2017-05-22 09:34:18 UTC
I don't know why it fails on your side; at least it works on my test PC. Anyway, you could revert it if this troubles you.
Comment 15 Jean Delvare 2017-05-22 20:57:17 UTC
And it's not just me, see:

https://bugzilla.opensuse.org/show_bug.cgi?id=1039806
Comment 16 Jean Delvare 2017-05-23 14:58:33 UTC
I tested Alex's branch amd-staging-4.11 as of yesterday (top commit c285c73f2213) and it does NOT fix the bug.

Did it occur to you that there may be a simple bug in commit c7bc82efa60d ("drm/amdgpu: update tile table for oland/hainan")? Can anyone familiar with the code review this commit to see if we missed something obvious?
Comment 17 flora.cui 2017-05-24 09:56:25 UTC
I'll continue to check this issue. Meanwhile you could revert the guilty commit as a quick fix. The tiling table is an optimization from the hw team. By the way, have you tried the amdgpu-pro driver? Maybe that's the difference between our setups.
Comment 18 Jean Delvare 2017-05-24 11:49:59 UTC
I did not try the amdgpu-pro driver. My distribution (SUSE) doesn't appear to be supported, and the download page doesn't list my hardware (Radeon R5 240) as supported by this driver anyway.
Comment 19 Berg 2017-05-24 12:55:29 UTC
I confirm:
Checkerboard effect when using amdgpu on Oland is back.

With amdgpu-pro I have seen the same effect for a long time (even though my R7 240 is on the list of supported devices). However, Arch Linux is not a supported distribution and there is only a user package in the AUR, so it is difficult to make claims here.
Comment 20 Michel Dänzer 2017-05-26 02:05:31 UTC
Flora, why did you add me to Cc? I get notifications via the dri-devel mailing list.
Comment 21 flora.cui 2017-05-26 08:06:39 UTC
Hi Michel,
This issue might have something to do with Mesa. I added you to the Cc list because you are an expert on the open source UMD.
It can be reproduced with oland + ubuntu + xinit + glxgears + mesa. Reverting two commits ("drm/amd/amdgpu: Clean up GFX6 tilemode programming" and "drm/amdgpu: update tile table for oland/hainan") fixes it.
It can't be reproduced with the amdgpu-pro driver (with the same kernel). Could you help investigate further?
Comment 22 Michel Dänzer 2017-05-26 08:25:16 UTC
Not really I'm afraid. Maybe Marek or Nicolai has an idea.
Comment 23 Marek Olšák 2017-05-26 15:00:29 UTC
Hi,

(BTW I'm from AMD)

First of all, if a kernel change breaks Mesa, the kernel change should be reverted. Kernel changes should never ever break existing userspace. There are released versions of Mesa out there that we can't change with commits to the master branch. That's why. Kernel changes can't break Mesa ever!

Now about the commit. Yes, the commit is wrong! I've checked internal hw docs and there are 2 variants of Oland: Oland64 and Oland128. The number may refer to the bus width. Oland128 uses the same config as Cape Verde, which is P4_8x16. Oland64 uses the same config as Hainan, which is P2. The commit only switches the config from Oland128 to Oland64, so it seems to fix Oland64 and at the same time break Oland128.

I think the real fix is to use the old version of the tile mode table for Oland128, and the new version for Oland64.
Comment 24 Alex Deucher 2017-05-26 16:38:46 UTC
Created attachment 256727 [details]
possible fix

Does this patch help?
Comment 25 Jean Delvare 2017-05-30 09:02:03 UTC
As my wife upgraded her hardware last week, the problematic card is now sitting on my desk. Give me a few hours to install it on my own system and reproduce the bug, then I can test your candidate fix and report.
Comment 26 Jean Delvare 2017-05-30 13:19:00 UTC
Hmmm. I backported the patch to kernel 4.11.3, but it did not help. Then I swapped the conditions, and it worked. That is, adev->mc.vram_width < 128 mapped to Cape Verde and adev->mc.vram_width >= 128 mapped to Hainan. Does that make any sense? I do not claim I know what I'm doing.

My Oland card has adev->mc.vram_width == 64 and is apparently happy with the Cape Verde configuration (I'd need to test further to confirm there is no negative side effect.)

amdgpu 0000:01:00.0: VRAM: 1024M 0x0000000000000000 - 0x000000003FFFFFFF (1024M used)
amdgpu 0000:01:00.0: GTT: 1024M 0x0000000040000000 - 0x000000007FFFFFFF
[drm] Detected VRAM RAM=1024M, BAR=256M
[drm] RAM width 64bits DDR3

I can provide additional information about my card, just tell me what you need to know.
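
The swapped condition described in this comment can be sketched as follows. This is only an illustration of what was tested at this point, not the patch itself: the enum labels and function name are hypothetical, and `vram_width` stands in for `adev->mc.vram_width` in bits.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical labels for the two tile mode tables discussed in this bug. */
enum tile_config {
	TILE_CONFIG_P2,      /* Hainan-style table */
	TILE_CONFIG_P4_8X16, /* Cape Verde-style table */
};

/*
 * Selection with the conditions swapped relative to the first candidate fix:
 * a bus narrower than 128 bits maps to the Cape Verde table, and a bus of
 * 128 bits or more maps to the Hainan table.
 */
static inline enum tile_config oland_tile_config_swapped(uint32_t vram_width)
{
	return vram_width < 128 ? TILE_CONFIG_P4_8X16 : TILE_CONFIG_P2;
}
```

For the card in this comment (`vram_width == 64`), the sketch selects the Cape Verde (P4_8x16) table, which is the configuration that worked.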
Comment 27 Marek Olšák 2017-05-30 13:59:06 UTC
Well then I wonder what the number in Oland128 means if it's not the bus width.
Comment 28 Jean Delvare 2017-05-30 14:41:33 UTC
I don't know how reliable it is, but Wikipedia claims that the Radeon R5 240 has a 128-bit bus to video RAM. Is it possible that the amdgpu driver is getting it wrong?
Comment 29 Berg 2017-05-31 11:35:32 UTC
(In reply to Jean Delvare from comment #28)
> I don't know how reliable it is, but Wikipedia claims that the Radeon R5 240
> has a 128-bit bus to video RAM.
As the owner of an R7 240, I confirm: yes, it has a 128-bit bus, and the R5 240 is the same.
Comment 30 Jean Delvare 2017-06-01 11:53:26 UTC
(In reply to Marek Olšák from comment #27)
> Well then I wonder what the number in Oland128 means if it's not the bus
> width.

Who can answer this question?
Comment 31 Alex Deucher 2017-06-01 20:04:04 UTC
Flora, what is the bus width on your oland card?
Comment 32 flora.cui 2017-06-02 02:28:36 UTC
(In reply to Alex Deucher from comment #31)
> Flora, what is the bus width on your oland card?

128bits
Comment 33 Alex Deucher 2017-06-02 16:18:09 UTC
Created attachment 256831 [details]
possible fix

Based on the feedback on this bug, it appears the logic should be reversed.  I verified the vram width detection and that is correct.
Comment 34 Marek Olšák 2017-06-02 20:26:34 UTC
BTW, adev->gfx.config.mc_arb_ramcfg is not set for SI. That might cause addrlib to do something incorrect.
Comment 35 Alex Deucher 2017-06-02 20:35:54 UTC
Created attachment 256845 [details]
another possible fix

This sets mc_arb_ramcfg properly for userspace.
Comment 36 Jean Delvare 2017-06-03 10:27:21 UTC
The patch in comment #33 is what I came up with (see comment #26) and it works OK for me. Note that the second part of the patch isn't needed though, the added condition will always be true if you reach it.
Comment 37 Jean Delvare 2017-06-03 10:34:03 UTC
Alex, is the fix in comment #35 possibly related to this bug, or you just happened to find it while reading the code?
Comment 38 Marek Olšák 2017-06-03 10:37:13 UTC
The second possible fix is a bug fix by itself (not just for this issue). It seems like an important fix affecting Mesa, and it may change the outcome of the first fix (which we are still uncertain of).
Comment 39 siyia 2017-06-09 13:33:18 UTC
After kernel 4.11.4, 2D apps and videos work perfectly with amdgpu and an R7 240. However, any game, whether OpenGL or Vulkan, native or under Wine, shows the chessboard artifacts.
Comment 40 Marek Olšák 2017-06-09 13:45:22 UTC
(In reply to siyia from comment #39)
> After kernel 4.11.4 2d apps and videos work perfectly with amdgpu and r7
> 240,however any game either with opengl or vulkan,native or with wine gets
> the chessboard effect artifacts.

Can you test the second possible fix and if it doesn't help, can you test both of them?
Comment 41 siyia 2017-06-09 14:25:00 UTC
another possible fix patch?
Comment 42 Marek Olšák 2017-06-09 14:28:45 UTC
Yes, second == another.
Comment 43 siyia 2017-06-09 14:34:38 UTC
Damn, I could have a few weeks back with Arch, but my current distro does not have an easy way to compile from source and apply patches to kernels.
Comment 44 Marek Olšák 2017-06-09 15:06:06 UTC
I don't have Oland and I presume Alex doesn't have Oland either, so there is nothing we can do. Whoever tests the patches will make the decision whether or not they will be included in the kernel and which ones.
Comment 45 siyia 2017-06-09 16:59:32 UTC
I ll see what i can do and will post back.
Comment 46 siyia 2017-06-09 17:37:08 UTC
Ok i am compiling linux 4.11.4 with the second patch only, will post results after installing it and booting it with amdgpu
Comment 47 siyia 2017-06-09 17:37:50 UTC
This will take a while.....
Comment 48 siyia 2017-06-09 18:40:55 UTC
Still happens with second patch only
Comment 49 Alex Deucher 2017-06-09 18:46:17 UTC
(In reply to siyia from comment #48)
> Still happens with second patch only

Can you attach your demsg output?
Comment 50 Alex Deucher 2017-06-09 18:46:34 UTC
dmesg even.
Comment 51 siyia 2017-06-09 18:51:58 UTC
Created attachment 256935 [details]
dmesg output
Comment 52 siyia 2017-06-09 18:57:29 UTC
The first patch fails at compilation, so I cannot use it.
Comment 53 siyia 2017-06-09 19:06:12 UTC
When i use the first patch i get:

patching file drivers/gpu/drm/amd/amdgpu/gfx_v6_0.c
Hunk #1 FAILED at 412.
Hunk #2 FAILED at 636.
2 out of 2 hunks FAILED -- saving rejects to file drivers/gpu/drm/amd/amdgpu/gfx_v6_0.c.rej
==> ERROR: A failure occurred in prepare().
    Aborting...
Comment 54 Alex Deucher 2017-06-09 19:08:44 UTC
(In reply to siyia from comment #52)
> The first patch fails at compilation cannot use it.

[    0.882491] [drm] RAM width 128bits DDR3

Yours is the 128bit version.

What patch did you try?  Please try:

https://bugzilla.kernel.org/attachment.cgi?id=256831&action=diff
or
https://bugzilla.kernel.org/attachment.cgi?id=256727&action=diff

They are the same patch, just with the logic reversed.

Alex
Comment 55 siyia 2017-06-09 19:10:32 UTC
Which one is supposed to work on my card?
Comment 56 siyia 2017-06-09 19:11:07 UTC
i tried another possible fix from attachments
Comment 57 Alex Deucher 2017-06-09 19:13:19 UTC
https://bugzilla.kernel.org/attachment.cgi?id=256831&action=diff
is what fixes it for others, but I suspect
https://bugzilla.kernel.org/attachment.cgi?id=256727&action=diff
is what will fix it for you.
Comment 58 siyia 2017-06-09 19:17:09 UTC
patching file drivers/gpu/drm/amd/amdgpu/gfx_v6_0.c
Hunk #1 FAILED at 412.
Hunk #2 FAILED at 636.
2 out of 2 hunks FAILED -- saving rejects to file drivers/gpu/drm/amd/amdgpu/gfx_v6_0.c.rej

https://bugzilla.kernel.org/attachment.cgi?id=256727&action=diff

fails
Comment 59 Marek Olšák 2017-06-09 19:46:27 UTC
So it looks like 64-bit Oland needs the P4 config and 128-bit Oland needs the P2 config. Maybe the MC_ARB_RAMCFG patch will help with other SI chips if it doesn't help Oland.
Comment 60 Alex Deucher 2017-06-09 19:52:45 UTC
(In reply to Marek Olšák from comment #59)
> So it looks like 64-bit Oland needs the P4 config and 128-bit Oland needs
> the P2 config. Maybe the MC_ARB_RAMCFG patch will help with other SI chips
> if it doesn't help Oland.

The Windows driver uses the P2 config for all Oland and Hainan parts regardless of memory bandwidth.
Comment 61 Jean Delvare 2017-06-12 07:26:25 UTC
For completeness, I tried with only the mc_arb_ramcfg fix on my R5 240 (reportedly 64-bit Oland) and it does not work. Actually that patch doesn't seem to have any effect for me (but it is obviously correct so should be applied anyway.)

What fixes it for me is using the P4_8x16 config. If the Windows driver really uses the P2 config for that part, then something else must be wrong in the Linux driver. But then again this contradicts Marek's findings in comment #23 that the hardware documentation claims some Oland variants use the P4_8x16 config. We still don't know for sure what Oland64 and Oland128 stand for, if not the memory bus width.

I am available and willing to help with any test or question you may have.
Comment 62 Jean Delvare 2017-06-12 07:39:53 UTC
Created attachment 256949 [details]
drm/amdgpu/gfx6: [TEST] fix tiling setup for oland (for v4.11)

Siyia, this is a test patch for kernel v4.11.4. It uses the P4_8x16 config for all Oland variants, please test it and report if it fixes the problem for you.

This is most likely NOT suitable for upstream, it is only a test patch. At least assuming some Oland cards are working with the current code.
Comment 63 siyia 2017-06-12 10:47:16 UTC
Can't you contact AMD about what Oland64 and Oland128 mean?
Comment 64 siyia 2017-06-12 11:09:59 UTC
Latest patch fixed it for me with kernel 4.11.4.
Comment 65 siyia 2017-06-12 13:09:46 UTC
https://bugzilla.kernel.org/attachment.cgi?id=256949 works as it should with kernel 4.11.4. I tested both OpenGL and Vulkan 3D apps and didn't notice any regressions.
Comment 66 siyia 2017-06-15 14:12:07 UTC
Regression persists with kernel 4.11.5, but https://bugzilla.kernel.org/attachment.cgi?id=256949 thankfully works and fixes the issue.
Comment 67 Jean Delvare 2017-06-26 07:30:34 UTC
We are still waiting for someone at AMD to contact their hardware documentation team to clarify what Oland64 and Oland128 refer to, and to explain how the Windows driver can possibly use the P2 tile table for all Oland parts (comment #60) when the hardware documentation claims some of them must use the P4_8x16 tile table (comment #23) and experiments with the Linux driver confirm this is the case (comment #26, comment #64.)
Comment 68 siyia 2017-06-26 12:30:59 UTC
Probably some mistake in the documentation, otherwise it doesn't make sense, unless they use some magic lol. By the way, with kernel 4.11.6 the problem got worse: now the desktop environment is also rendered incorrectly.
Comment 69 siyia 2017-07-03 14:37:32 UTC
Created attachment 257315 [details]
Bumped Patch version number for linux 4.12

Patch still works for linux 4.12.0
Comment 70 Alexander Tsoy 2017-07-06 15:35:12 UTC
For me P4_8x16 config fixed this issue.

$ sudo dmesg | egrep 'RAM width|OLAND'
[    1.670667] [drm] initializing kernel modesetting (OLAND 0x1002:0x6613 0x174B:0xE266 0x00).
[    2.754782] [drm] RAM width 128bits GDDR5
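
Several comments in this thread read the bus width off the `RAM width` kernel log line. A minimal C sketch of extracting that number from such a line, assuming the message format shown above (the function name is illustrative):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Extract the bus width from a kernel log line such as
 * "[drm] RAM width 128bits GDDR5". Returns the width in bits,
 * or -1 if the line does not contain that message.
 */
static inline int parse_ram_width(const char *line)
{
	const char *p = strstr(line, "RAM width ");
	int width;

	if (p == NULL)
		return -1;
	if (sscanf(p, "RAM width %dbits", &width) != 1)
		return -1;
	return width;
}
```

For the line above this returns 128; a line without the `RAM width` message returns -1.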
Comment 71 Alexander Tsoy 2017-07-06 16:00:29 UTC
(In reply to Alexander Tsoy from comment #70)
> For me P4_8x16 config fixed this issue.
> 
> $ sudo dmesg | egrep 'RAM width|OLAND'
> [    1.670667] [drm] initializing kernel modesetting (OLAND 0x1002:0x6613
> 0x174B:0xE266 0x00).
> [    2.754782] [drm] RAM width 128bits GDDR5

To clarify a bit. I've applied the following two patches on top of 4.11.9:
https://bugzilla.kernel.org/attachment.cgi?id=256727
https://bugzilla.kernel.org/attachment.cgi?id=256845

So I guess the following patch should also work for me:
https://bugzilla.kernel.org/attachment.cgi?id=256949
Comment 72 Marek Olšák 2017-07-07 22:12:04 UTC
Additional piece of the puzzle to the Oland mystery:

1) Oland uses Verde's GB_ADDR_CONFIG in one place:
	case CHIP_OLAND:
		...
		gb_addr_config = VERDE_GB_ADDR_CONFIG_GOLDEN;
	...
	WREG32(mmGB_ADDR_CONFIG, gb_addr_config);

// #define VERDE_GB_ADDR_CONFIG_GOLDEN         0x02010002

See: (1 << VERDE_GB_ADDR_CONFIG_GOLDEN.NUM_PIPES) == 4 (P4)


2) Oland uses GB_ADDR_CONFIG with (1 << NUM_PIPES) == 4 (P4) in golden settings, also same as Verde:

static const u32 verde_golden_rlc_registers[] =
{
	mmGB_ADDR_CONFIG, 0xffffffff, 0x02010002, (P4)
...
static const u32 oland_golden_rlc_registers[] =
{
	mmGB_ADDR_CONFIG, 0xffffffff, 0x02010002, (P4)


I see two issues:
- GB_ADDR_CONFIG matches Verde in both places
- GB_ADDR_CONFIG is set in two places (duplicated code?)
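
The P4 reading above can be checked with a small C sketch. It assumes NUM_PIPES is the 3-bit field at bit 0 of GB_ADDR_CONFIG; the macro and function names are illustrative, not the driver's own definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: NUM_PIPES is the 3-bit field at bit 0 of GB_ADDR_CONFIG. */
#define NUM_PIPES_SHIFT 0
#define NUM_PIPES_MASK  0x7u

/* Pipe count encoded in a GB_ADDR_CONFIG value: 1 << NUM_PIPES. */
static inline unsigned int gb_addr_config_pipes(uint32_t gb_addr_config)
{
	return 1u << ((gb_addr_config & NUM_PIPES_MASK) >> NUM_PIPES_SHIFT);
}
```

Under that assumption, 0x02010002 decodes to 4 pipes, which matches the P4 annotation in both places listed above.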
Comment 73 siyia 2017-07-25 13:10:48 UTC
Any further news about this issue? Why don't you just use the P4 config for the affected cards as a fix?
Comment 74 Rich 2017-07-28 09:06:32 UTC
Agree with the above. Not sure why such a huge regression has not just been reverted until a suitable fix has been implemented.
Comment 75 Zoltan Boszormenyi 2017-07-28 09:26:04 UTC
(In reply to Rich from comment #74)
> Agree with the above. Not sure why such a huge regression has not just been
> reverted until a suitable fix has been implemented.

It seems the QA process has holes @ AMD.
See bug #170741 , this is a regression since Linux 4.4-rc4 (not Radeon related).
Comment 76 Marek Olšák 2017-07-28 14:16:02 UTC
Created attachment 257745 [details]
possible final fix?

Can you please test this patch?
Comment 77 Alexander Tsoy 2017-07-28 14:28:47 UTC
Can't test right now. BTW, why is the HAINAN_GB_ADDR_CONFIG_GOLDEN value different in different headers?

./amdgpu/sid.h:#define HAINAN_GB_ADDR_CONFIG_GOLDEN        0x02010001
./amdgpu/si_enums.h:#define HAINAN_GB_ADDR_CONFIG_GOLDEN        0x02011003

VERDE_GB_ADDR_CONFIG_GOLDEN and TAHITI_GB_ADDR_CONFIG_GOLDEN values are the same, for example:

./amdgpu/sid.h:#define VERDE_GB_ADDR_CONFIG_GOLDEN         0x12010002
./amdgpu/si_enums.h:#define VERDE_GB_ADDR_CONFIG_GOLDEN         0x02010002

./amdgpu/sid.h:#define TAHITI_GB_ADDR_CONFIG_GOLDEN        0x12011003
./amdgpu/si_enums.h:#define TAHITI_GB_ADDR_CONFIG_GOLDEN        0x12011003
Comment 78 Marek Olšák 2017-07-28 14:31:21 UTC
Created attachment 257747 [details]
possible final fix?

Please try this patch instead.
Comment 79 siyia 2017-07-28 17:00:35 UTC
Will test it when i get my hands on 4.12.4 kernel
Comment 80 Jean Delvare 2017-07-30 06:21:52 UTC
I tested the patch from comment #78 and unfortunately I have to report that it doesn't fix the problem for me, checkerboard effect is still present. Nevertheless these cleanups look good so I think they should go upstream, assuming they don't break anything else.

I have noticed that the values of VERDE_GB_ADDR_CONFIG_GOLDEN and HAINAN_GB_ADDR_CONFIG_GOLDEN which were kept do not match the ones in the radeon driver. I tried using the ones from the radeon driver instead but it did not help.
Comment 81 Alexander Tsoy 2017-07-31 14:17:22 UTC
(In reply to Jean Delvare from comment #80)
> I tested the patch from comment #78 and unfortunately I have to report that
> it doesn't fix the problem for me, checkerboard effect is still present.
It doesn't fix the problem for me as well.
Comment 82 Zoltan Boszormenyi 2017-08-01 05:42:57 UTC
This might be a related fix from Jean Delvare:
https://lists.freedesktop.org/archives/dri-devel/2017-July/148751.html
Comment 83 Rich 2017-08-23 14:47:55 UTC
For anyone reading this, this is still broken in Kernel 4.12.8 (currently latest).

Really concerned that this bug doesn't get fixed before 4.9 is no longer the latest LONGTERM. Still wondering about my previous comment re: just reverting the offending commit due to the severity of the regression before a proper fix can be implemented.
Comment 84 Jean Delvare 2017-09-04 08:11:41 UTC
Created attachment 258183 [details]
[PATCH] drm/amdgpu: revert tile table update for oland (for kernels v4.10 to v4.12)
Comment 85 Jean Delvare 2017-09-11 15:39:04 UTC
Created attachment 258301 [details]
[PATCH] drm/amdgpu: revert tile table update for oland (for kernels v4.13 and up)
Comment 86 Jean Delvare 2017-10-03 07:26:35 UTC
Fixed in kernel v4.14:

commit 4cf97582b46f123a4b7cd88d999f1806c2eb4093
Author: Jean Delvare <jdelvare@suse.de>
Date:   Mon Sep 11 17:43:56 2017 +0200

    drm/amdgpu: revert tile table update for oland
    
The root cause is probably somewhere else but at least Oland cards are usable again.
Comment 87 yousifjkadom 2017-10-20 14:26:37 UTC
Hi. I have Lenovo ThinkPad e550 with 2 VGA:

1) shared: Intel Corporate HD 5500

2) dedicated VGA, Radeon R7 M265 2GB

I'm using Fedora 26 X64 bit on this laptop.

Currently, the Linux kernel supports the Radeon R7 M265 (which is Southern Islands), but this support is disabled by default because it is experimental. So, currently, users must build a customized kernel to enable it, which is beyond many users like me. Kindly see this link:
https://forums.fedoraforum.org/showthread.php?t=315139

I asked in the Fedora forum about this and they informed me that this bug (which you kindly fixed) is a blocker against enabling kernel support for my Radeon VGA by default.

I read that it is planned to enable kernel support by default in kernel 4.15.x

But, now I see that you kindly fixed this bug at kernel version 4.14.x !

Please, does fixing this bug mean that kernel support for my Radeon R7 M265 dedicated VGA will be enabled by default, or will it be deferred until kernel 4.15.x?

Best.
Comment 88 Michel Dänzer 2017-10-20 14:30:20 UTC
(In reply to yousifjkadom from comment #87)

Everything you say applies to the amdgpu kernel driver, but in the meantime, your card should work with the radeon kernel driver. Doesn't it?
Comment 89 yousifjkadom 2017-10-20 19:20:52 UTC
@Michel Dänzer

Many thanks for your kind & rapid reply !

No! No! I used Fedora 24 for 11 months and then upgraded from Fedora 24 to Fedora 26, and to this day my Radeon VGA does not work at all!
Comment 90 Jean Delvare 2017-10-23 11:07:57 UTC
yousifjkadom, your problem has nothing to do with this bug, please create a new bug for it.
