Bug 194761
Summary: | amdgpu driver breaks on Oland (SI) | ||
---|---|---|---|
Product: | Drivers | Reporter: | Jean Delvare (jdelvare) |
Component: | Video(DRI - non Intel) | Assignee: | drivers_video-dri |
Status: | RESOLVED CODE_FIX | ||
Severity: | normal | CC: | alexander, alexdeucher, berg, eutychios23, flora.cui, forums0, maraeo, nhaehnle, yousifjkadom, zboszor |
Priority: | P1 | ||
Hardware: | x86-64 | ||
OS: | Linux | ||
Kernel Version: | 4.10 | Subsystem: | |
Regression: | Yes | Bisected commit-id: | |
Attachments: |
Checkerboard effect on glxgears
possible fix
possible fix
another possible fix
dmesg output
drm/amdgpu/gfx6: [TEST] fix tiling setup for oland (for v4.11)
Bumped Patch version number for linux 4.12
possible final fix?
possible final fix?
[PATCH] drm/amdgpu: revert tile table update for oland (for kernels v4.10 to v4.12)
[PATCH] drm/amdgpu: revert tile table update for oland (for kernels v4.13 and up) |
Reverting this commit should fix this issue:

commit c7bc82efa60dd0ba220fbfc1bed5edc79e831a3e
Refs: v4.9-272-gc7bc82e
Author: Flora Cui <Flora.Cui@amd.com>
AuthorDate: Thu Dec 15 16:29:31 2016 +0800
Commit: Alex Deucher <alexander.deucher@amd.com>
CommitDate: Wed Dec 21 15:37:47 2016 -0500

    drm/amdgpu: update tile table for oland/hainan

    Change-Id: Ib8b66559d662cbe8a9f40f7405a4a052c189f604
    Signed-off-by: Flora Cui <Flora.Cui@amd.com>
    Reviewed-by: Junwei Zhang <Jerry.Zhang@amd.com>
    Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
    Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

Alternatively, applying this commit series should fix the issue too:

* 8873e69 - drm/amd/gfx6: update gb_addr_config <Flora Cui> 2017-02-09 14:18:05 +0800
* 0882518 - drm/amdgpu: update HAINAN_GB_ADDR_CONFIG_GOLDEN <Flora Cui> 2017-02-09 14:18:04 +0800
* 04cab25 - drm/amdgpu: update VERDE_GB_ADDR_CONFIG_GOLDEN <Flora Cui> 2017-02-09 14:18:04 +0800
* a99ecc6 - drm/amdgpu: refine si_read_register <Flora Cui> 2017-02-09 14:18:03 +0800
* ffb3c1c - drm/amdgpu/gfx6: clean up spi configuration <Flora Cui> 2017-02-09 14:18:03 +0800
* d8a4e76 - drm/amdgpu/gfx6: clean up cu configuration <Flora Cui> 2017-02-09 14:18:02 +0800
* 1abd532 - drm/amdgpu/gfx6: clean up rb configuration <Flora Cui> 2017-02-09 14:18:02 +0800

Thanks for the quick reply. I confirm that kernel v4.10.1 with commit f8d9422ef80c ("drm/amdgpu: update tile table for oland/hainan") reverted works again. Not sure why your commit ID is different but that's clearly the same patch. How do we fix this in the 4.10 stable kernel branch? Simply push the reverted patch to stable@?

Either push the revert to stable, or pull the additional fixes back from 4.11.

I know both are possible ;-) But I don't know what the benefits of each option are. I'd go for just reverting one patch because it seems simpler, but to be honest I have no idea what that patch was trying to achieve, so I don't know if we lose anything important by reverting it.
It's just an optimization.

OK, let's just revert then. I have just sent a revert patch to stable@.

Fixed in kernel v4.10.3 by reverting commit 8d9422ef80c ("drm/amdgpu: update tile table for oland/hainan"). Fixed in kernel v4.11 by the commits mentioned in comment #1.

This is NOT fixed in v4.11. The claims in comment #1 are incorrect, the patch series mentioned in that comment doesn't fix the problem. Should we revert commit 8d9422ef80c ("drm/amdgpu: update tile table for oland/hainan") permanently?

Sorry, there was a missing digit in the commit ID, I obviously meant f8d9422ef80c ("drm/amdgpu: update tile table for oland/hainan").

glxgears works on oland with Alex's 4.11 branch (HEAD: commit 3a7370c - drm/amdgpu: correct emit frame size for vcn dec/enc ring).

Can you please point me to the exact tree and branch I am supposed to test? Also, do you happen to know which commits in that branch are fixing the problem?

https://cgit.freedesktop.org/~agd5f/linux/log/?h=amd-staging-4.11
I didn't check the commits. The commits in #commit1 should work.

Assuming you mean "in comment #1", well no, they do not work for me. Have you tested? I naively trusted you, and as a result agreed not to get the problematic commit reverted upstream, and now users are hitting a regression when upgrading to kernel v4.11. This isn't good. Now we need to figure out whether it is actually fixed upstream or not. No more assumptions. And if it is fixed, knowing which commit fixed it would let us push the fix to the 4.11 stable tree. However I am not going to waste time bisecting it. If nobody knows how it was fixed, I will push the same revert to 4.11 that was pushed to 4.10, and be done with it.

I don't know why it fails on your side, at least it works with my test PC. Anyway, you could revert it if this troubles you.
And it's not just me, see: https://bugzilla.opensuse.org/show_bug.cgi?id=1039806

I tested Alex's branch amd-staging-4.11 as of yesterday (top commit c285c73f2213) and it does NOT fix the bug. Did it occur to you that there may be a simple bug in commit c7bc82efa60d ("drm/amdgpu: update tile table for oland/hainan")? Can anyone familiar with the code review this commit to see if we missed something obvious?

I'll continue to check this issue. Meanwhile you could revert the guilty commit as a quick fix. The tiling table is an optimization from the hw team. By the way, have you tried the amdgpu-pro driver? Maybe that's the difference we have.

I did not try the amdgpu-pro driver. My distribution (SUSE) doesn't appear to be supported, and the download page doesn't list my hardware (Radeon R5 240) as supported by this driver anyway.

I confirm: the checkerboard effect when using amdgpu on Oland is back. With amdgpu-pro the same effect has been present for a long time (despite the fact that my R7 240 is in the list of supported devices), but Arch Linux is not supported as a distro, there is only a user package in AUR, so it is difficult to make claims here.

Flora, why did you add me to Cc? I get notifications via the dri-devel mailing list.

Hi Michel, this issue might have something to do with Mesa. I added you to the Cc list because you are an expert on the open source UMD. It can be reproduced with oland + ubuntu + xinit + glxgears + mesa. Reverting both commits (drm/amd/amdgpu: Clean up GFX6 tilemode programming, and drm/amdgpu: update tile table for oland/hainan) could fix it. It can't be reproduced with the amdgpu-pro driver (with the same kernel). Could you help to investigate further?

Not really I'm afraid. Maybe Marek or Nicolai has an idea.

Hi, (BTW I'm from AMD) First of all, if a kernel change breaks Mesa, the kernel change should be reverted. Kernel changes should never ever break existing userspace. There are released versions of Mesa out there that we can't change with commits to the master branch. That's why.
Kernel changes can't break Mesa ever!

Now about the commit. Yes, the commit is wrong! I've checked internal hw docs and there are 2 variants of Oland: Oland64 and Oland128. The number may refer to the bus width. Oland128 uses the same config as Cape Verde, which is P4_8x16. Oland64 uses the same config as Hainan, which is P2. The commit only switches the config from Oland128 to Oland64, so it seems to fix Oland64 and at the same time break Oland128.

I think the real fix is to use the old version of the tile mode table for Oland128, and the new version for Oland64.

Created attachment 256727 [details]
possible fix
Does this patch help?
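A minimal sketch of the selection rule proposed above, assuming the "64" and "128" in Oland64/Oland128 refer to the VRAM bus width in bits. The enum and function names are hypothetical and are not taken from the attached patch; the real driver programs full tile mode tables rather than returning a single value.

```c
#include <assert.h>

/* Hypothetical names for illustration; not the driver's real identifiers. */
enum pipe_config {
    PIPE_CONFIG_P2,      /* configuration the thread attributes to Hainan */
    PIPE_CONFIG_P4_8X16  /* configuration the thread attributes to Cape Verde */
};

/*
 * Rule proposed in the comment above: treat a 128-bit (or wider) Oland like
 * Cape Verde and a narrower one like Hainan. The 128-bit threshold assumes
 * that "Oland128" refers to the VRAM bus width.
 */
static enum pipe_config oland_pipe_config(unsigned int vram_width_bits)
{
    assert(vram_width_bits > 0); /* a zero width would mean detection failed */
    return vram_width_bits >= 128 ? PIPE_CONFIG_P4_8X16 : PIPE_CONFIG_P2;
}
```

Under this rule a card whose log reports a 128-bit RAM width would get the P4_8x16 configuration and a 64-bit card would get P2. Whether that mapping is the right one is exactly what the testing in this bug is meant to establish.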
As my wife upgraded her hardware last week, the problematic card is now sitting on my desk. Give me a few hours to install it on my own system and reproduce the bug, then I can test your candidate fix and report.

Hmmm. I backported the patch to kernel 4.11.3, but it did not help. Then I swapped the conditions, and it worked. That is, adev->mc.vram_width < 128 mapped to Cape Verde and adev->mc.vram_width >= 128 mapped to Hainan. Does that make any sense? I do not claim I know what I'm doing. My Oland card has adev->mc.vram_width == 64 and is apparently happy with the Cape Verde configuration (I'd need to test further to confirm there is no negative side effect.)

amdgpu 0000:01:00.0: VRAM: 1024M 0x0000000000000000 - 0x000000003FFFFFFF (1024M used)
amdgpu 0000:01:00.0: GTT: 1024M 0x0000000040000000 - 0x000000007FFFFFFF
[drm] Detected VRAM RAM=1024M, BAR=256M
[drm] RAM width 64bits DDR3

I can provide additional information about my card, just tell me what you need to know.

Well then I wonder what the number in Oland128 means if it's not the bus width.

I don't know how reliable it is, but Wikipedia claims that the Radeon R5 240 has a 128-bit bus to video RAM. Is it possible that the amdgpu driver is getting it wrong?

(In reply to Jean Delvare from comment #28)
> I don't know how reliable it is, but Wikipedia claims that the Radeon R5 240
> has a 128-bit bus to video RAM.
I confirm as the owner of an R7 240: yes, it has a 128-bit bus, and the R5 240 is the same.

(In reply to Marek Olšák from comment #27)
> Well then I wonder what the number in Oland128 means if it's not the bus
> width.
Who can answer this question?

Flora, what is the bus width on your oland card?

(In reply to Alex Deucher from comment #31)
> Flora, what is the bus width on your oland card?
128bits

Created attachment 256831 [details]
possible fix
Based on the feedback on this bug, it appears the logic should be reversed. I verified the vram width detection and that is correct.
BTW, adev->gfx.config.mc_arb_ramcfg is not set for SI. That might cause addrlib to do something incorrect. Created attachment 256845 [details]
another possible fix
This sets mc_arb_ramcfg properly for userspace.
The patch in comment #33 is what I came up with (see comment #26) and it works OK for me. Note that the second part of the patch isn't needed though, the added condition will always be true if you reach it.

Alex, is the fix in comment #35 possibly related to this bug, or did you just happen to find it while reading the code?

The second possible fix is a bug fix by itself (not just for this issue), seems like an important fix affecting Mesa, and may change the outcome of the first fix (which we are still uncertain of).

After kernel 4.11.4, 2D apps and videos work perfectly with amdgpu and the R7 240. However, any game, either with OpenGL or Vulkan, native or with Wine, gets the chessboard effect artifacts.

(In reply to siyia from comment #39)
> After kernel 4.11.4 2d apps and videos work perfectly with amdgpu and r7
> 240,however any game either with opengl or vulkan,native or with wine gets
> the chessboard effect artifacts.
Can you test the second possible fix and if it doesn't help, can you test both of them?

another possible fix patch?

Yes, second == another.

Damn, I could have a few weeks back with Arch, but my current distro does not have an easy way to compile from source and apply patches to kernels.

I don't have Oland and I presume Alex doesn't have Oland either, so there is nothing we can do. Whoever tests the patches will make the decision whether or not they will be included in the kernel and which ones.

I'll see what I can do and will post back.

OK, I am compiling Linux 4.11.4 with the second patch only. I will post results after installing it and booting it with amdgpu.

This will take a while.....

Still happens with second patch only

(In reply to siyia from comment #48)
> Still happens with second patch only
Can you attach your demsg output?

dmesg even.

Created attachment 256935 [details]
dmesg output
The first patch fails at compilation, so I cannot use it. When I use the first patch I get:

patching file drivers/gpu/drm/amd/amdgpu/gfx_v6_0.c
Hunk #1 FAILED at 412.
Hunk #2 FAILED at 636.
2 out of 2 hunks FAILED -- saving rejects to file drivers/gpu/drm/amd/amdgpu/gfx_v6_0.c.rej
==> ERROR: A failure occurred in prepare(). Aborting...

(In reply to siyia from comment #52)
> The first patch fails at compilation cannot use it.
[ 0.882491] [drm] RAM width 128bits DDR3
Yours is the 128bit version. What patch did you try? Please try:
https://bugzilla.kernel.org/attachment.cgi?id=256831&action=diff
or
https://bugzilla.kernel.org/attachment.cgi?id=256727&action=diff
They are the same patch, just with the logic reversed.

Alex, which one is supposed to work on my card? I tried "another possible fix" from the attachments.

https://bugzilla.kernel.org/attachment.cgi?id=256831&action=diff is what fixes it for others, but I suspect https://bugzilla.kernel.org/attachment.cgi?id=256727&action=diff is what will fix it for you.

patching file drivers/gpu/drm/amd/amdgpu/gfx_v6_0.c
Hunk #1 FAILED at 412.
Hunk #2 FAILED at 636.
2 out of 2 hunks FAILED -- saving rejects to file drivers/gpu/drm/amd/amdgpu/gfx_v6_0.c.rej
https://bugzilla.kernel.org/attachment.cgi?id=256727&action=diff fails

So it looks like 64-bit Oland needs the P4 config and 128-bit Oland needs the P2 config. Maybe the MC_ARB_RAMCFG patch will help with other SI chips if it doesn't help Oland.

(In reply to Marek Olšák from comment #59)
> So it looks like 64-bit Oland needs the P4 config and 128-bit Oland needs
> the P2 config. Maybe the MC_ARB_RAMCFG patch will help with other SI chips
> if it doesn't help Oland.
The Windows driver uses the P2 config for all Oland and Hainan parts regardless of memory bandwidth.

For completeness, I tried with only the mc_arb_ramcfg fix on my R5 240 (reportedly 64-bit Oland) and it does not work.
Actually that patch doesn't seem to have any effect for me (but it is obviously correct so should be applied anyway.) What fixes it for me is using the P4_8x16 config. If the Windows driver really uses the P2 config for that part, then something else must be wrong in the Linux driver. But then again this contradicts Marek's findings in comment #23 that the hardware documentation claims some Oland variants use the P4_8x16 config. We still don't know for sure what Oland64 and Oland128 stand for, if not the memory bus width. I am available and willing to help with any test or question you may have. Created attachment 256949 [details]
drm/amdgpu/gfx6: [TEST] fix tiling setup for oland (for v4.11)
Siyia, this is a test patch for kernel v4.11.4. It uses the P4_8x16 config for all Oland variants, please test it and report if it fixes the problem for you.
This is most likely NOT suitable for upstream, it is only a test patch. At least assuming some Oland cards are working with the current code.
Can't you contact AMD about what Oland64 and Oland128 mean?

Latest patch fixed it for me with kernel 4.11.4.

https://bugzilla.kernel.org/attachment.cgi?id=256949 works as it should with kernel 4.11.4. I tested both OpenGL and Vulkan 3D apps and didn't notice any regressions.

Regression persists with kernel 4.11.5, but https://bugzilla.kernel.org/attachment.cgi?id=256949 thankfully works and fixes the issue.

We are still waiting for someone at AMD to contact their hardware documentation team to clarify what Oland64 and Oland128 refer to, and to explain how the Windows driver can possibly use the P2 tile table for all Oland parts (comment #60) when the hardware documentation claims some of them must use the P4_8x16 tile table (comment #23) and experiments with the Linux driver confirm this is the case (comment #26, comment #64.)

Probably some mistake in the documentation, otherwise it doesn't make sense, unless they use some magic lol. By the way, with kernel 4.11.6 the problem got worse: now the desktop environment is also rendered incorrectly.

Created attachment 257315 [details]
Bumped Patch version number for linux 4.12
Patch still works for linux 4.12.0
For me P4_8x16 config fixed this issue.

$ sudo dmesg | egrep 'RAM width|OLAND'
[ 1.670667] [drm] initializing kernel modesetting (OLAND 0x1002:0x6613 0x174B:0xE266 0x00).
[ 2.754782] [drm] RAM width 128bits GDDR5

(In reply to Alexander Tsoy from comment #70)
> For me P4_8x16 config fixed this issue.
>
> $ sudo dmesg | egrep 'RAM width|OLAND'
> [ 1.670667] [drm] initializing kernel modesetting (OLAND 0x1002:0x6613
> 0x174B:0xE266 0x00).
> [ 2.754782] [drm] RAM width 128bits GDDR5
To clarify a bit. I've applied the following two patches on top of 4.11.9:
https://bugzilla.kernel.org/attachment.cgi?id=256727
https://bugzilla.kernel.org/attachment.cgi?id=256845
So I guess the following patch should also work for me:
https://bugzilla.kernel.org/attachment.cgi?id=256949

Additional piece of the puzzle to the Oland mystery:

1) Oland uses Verde's GB_ADDR_CONFIG in one place:

case CHIP_OLAND:
    ...
    gb_addr_config = VERDE_GB_ADDR_CONFIG_GOLDEN;
    ...
    WREG32(mmGB_ADDR_CONFIG, gb_addr_config);

// #define VERDE_GB_ADDR_CONFIG_GOLDEN 0x02010002
See: (1 << VERDE_GB_ADDR_CONFIG_GOLDEN.NUM_PIPES) == 4 (P4)

2) Oland uses GB_ADDR_CONFIG with (1 << NUM_PIPES) == 4 (P4) in golden settings, also same as Verde:

static const u32 verde_golden_rlc_registers[] = {
    mmGB_ADDR_CONFIG, 0xffffffff, 0x02010002, (P4)
    ...
static const u32 oland_golden_rlc_registers[] = {
    mmGB_ADDR_CONFIG, 0xffffffff, 0x02010002, (P4)

I see two issues:
- GB_ADDR_CONFIG matches Verde in both places
- GB_ADDR_CONFIG is set in two places (duplicated code?)

Any further news about this issue? Why don't you just use the P4 config for the afflicted cards as a fix?

Agree with the above. Not sure why such a huge regression has not just been reverted until a suitable fix has been implemented.

(In reply to Rich from comment #74)
> Agree with the above. Not sure why such a huge regression has not just been
> reverted until a suitable fix has been implemented.
It seems the QA process has holes @ AMD.
See bug #170741, this is a regression since Linux 4.4-rc4 (not Radeon related).

Created attachment 257745 [details]
possible final fix?
Can you please test this patch?
Can't test right now. BTW, why is the HAINAN_GB_ADDR_CONFIG_GOLDEN value different in different headers?

./amdgpu/sid.h:#define HAINAN_GB_ADDR_CONFIG_GOLDEN 0x02010001
./amdgpu/si_enums.h:#define HAINAN_GB_ADDR_CONFIG_GOLDEN 0x02011003

VERDE_GB_ADDR_CONFIG_GOLDEN and TAHITI_GB_ADDR_CONFIG_GOLDEN values are the same, for example:

./amdgpu/sid.h:#define VERDE_GB_ADDR_CONFIG_GOLDEN 0x12010002
./amdgpu/si_enums.h:#define VERDE_GB_ADDR_CONFIG_GOLDEN 0x02010002
./amdgpu/sid.h:#define TAHITI_GB_ADDR_CONFIG_GOLDEN 0x12011003
./amdgpu/si_enums.h:#define TAHITI_GB_ADDR_CONFIG_GOLDEN 0x12011003

Created attachment 257747 [details]
possible final fix?
Please try this patch instead.
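To compare the GB_ADDR_CONFIG values quoted in the comments above, the pipe count can be decoded by hand. This is only a sketch: it assumes NUM_PIPES occupies the low three bits of the register, which is consistent with the (1 << NUM_PIPES) == 4 reading of 0x02010002 given earlier in this bug. The macro and function names are illustrative, not the driver's.

```c
#include <stdint.h>

/* Assumption: NUM_PIPES is the 3-bit field at bit 0 of GB_ADDR_CONFIG. */
#define SKETCH_NUM_PIPES_MASK  0x7u
#define SKETCH_NUM_PIPES_SHIFT 0

/* Returns the pipe count encoded in a GB_ADDR_CONFIG value: 1 << NUM_PIPES. */
static unsigned int gb_addr_config_pipes(uint32_t gb_addr_config)
{
    unsigned int field =
        (gb_addr_config >> SKETCH_NUM_PIPES_SHIFT) & SKETCH_NUM_PIPES_MASK;

    return 1u << field;
}
```

Under that assumption, 0x02010002 and 0x12010002 both decode to 4 pipes, 0x02010001 decodes to 2 pipes, and 0x02011003 decodes to 8 pipes, so the two HAINAN_GB_ADDR_CONFIG_GOLDEN definitions quoted above would not describe the same pipe count.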
Will test it when I get my hands on the 4.12.4 kernel.

I tested the patch from comment #78 and unfortunately I have to report that it doesn't fix the problem for me, checkerboard effect is still present. Nevertheless these cleanups look good so I think they should go upstream, assuming they don't break anything else. I have noticed that the values of VERDE_GB_ADDR_CONFIG_GOLDEN and HAINAN_GB_ADDR_CONFIG_GOLDEN which were kept do not match the ones in the radeon driver. I tried using the ones from the radeon driver instead but it did not help.

(In reply to Jean Delvare from comment #80)
> I tested the patch from comment #78 and unfortunately I have to report that
> it doesn't fix the problem for me, checkerboard effect is still present.
It doesn't fix the problem for me either.

This might be a related fix from Jean Delvare: https://lists.freedesktop.org/archives/dri-devel/2017-July/148751.html

For anyone reading this, this is still broken in kernel 4.12.8 (currently latest). I am really concerned that this bug won't be fixed before 4.9 is no longer the latest LONGTERM. Still wondering about my previous comment re: just reverting the offending commit due to the severity of the regression before a proper fix can be implemented.

Created attachment 258183 [details]
[PATCH] drm/amdgpu: revert tile table update for oland (for kernels v4.10 to v4.12)
Created attachment 258301 [details]
[PATCH] drm/amdgpu: revert tile table update for oland (for kernels v4.13 and up)
Fixed in kernel v4.14:

commit 4cf97582b46f123a4b7cd88d999f1806c2eb4093
Author: Jean Delvare <jdelvare@suse.de>
Date: Mon Sep 11 17:43:56 2017 +0200

    drm/amdgpu: revert tile table update for oland

The root cause is probably somewhere else but at least Oland cards are usable again.

Hi. I have a Lenovo ThinkPad e550 with 2 VGA:
1) shared: Intel Corporate HD 5500
2) dedicated VGA: Radeon R7 M265 2GB
I'm using 64-bit Fedora 26 on this laptop. Currently, the Linux kernel supports the Radeon R7 M265 (which is Southern Islands), but this support is disabled by default because it is experimental. So, currently, users must use a customized kernel to enable such support, which is beyond many users like me. Kindly see this link: https://forums.fedoraforum.org/showthread.php?t=315139
I asked in the Fedora forum about this and they informed me that this bug (which you kindly fixed) is a blocker against enabling kernel support for my Radeon VGA by default. I read that it is planned to enable kernel support by default in kernel 4.15.x, but now I see that you kindly fixed this bug in kernel version 4.14.x! Does fixing this bug mean that kernel support for my Radeon R7 M265 dedicated VGA will be enabled by default, or will it be deferred until kernel 4.15.x? Best.

(In reply to yousifjkadom from comment #87)
Everything you say applies to the amdgpu kernel driver, but in the meantime, your card should work with the radeon kernel driver. Doesn't it?

@Michel Dänzer Many thanks for your kind and rapid reply! No! Since I was on Fedora 24, which I used for 11 months before upgrading to Fedora 26, and until now, my Radeon VGA has not worked at all!

yousifjkadom, your problem has nothing to do with this bug, please create a new bug for it. |
Created attachment 255045 [details]
Checkerboard effect on glxgears

After upgrading my kernel from v4.9.11 to v4.10.0, all 3D applications are broken. The most visible problem is a checkerboard effect, with blinking of some areas. See the attached screenshot of glxgears for an example.

The bug only happens when I use the amdgpu driver (with CONFIG_DRM_AMDGPU_SI=y.) If I use the radeon driver instead then all is fine.

There are no error messages in the kernel log, nor in /var/log/Xorg.0.log. However I could spot differences between the kernel logs:

-Linux version 4.9.11-1-default (geeko@buildhost) (gcc version 6.3.1 20170202 [gcc-6-branch revision 245119] (SUSE Linux) ) #1 SMP PREEMPT Sat Feb 18 17:59:27 UTC 2017 (cf9c670)
+Linux version 4.10.1-1-default (geeko@buildhost) (gcc version 6.3.1 20170202 [gcc-6-branch revision 245119] (SUSE Linux) ) #1 SMP PREEMPT Sun Feb 26 12:43:10 UTC 2017 (1ecd5af)
 [drm] GPU post is not needed
 amdgpu 0000:01:00.0: VRAM: 1024M 0x0000000000000000 - 0x000000003FFFFFFF (1024M used)
-amdgpu 0000:01:00.0: GTT: 3991M 0x0000000040000000 - 0x00000001397847FF
+amdgpu 0000:01:00.0: GTT: 1024M 0x0000000040000000 - 0x000000007FFFFFFF
 [drm] Detected VRAM RAM=1024M, BAR=256M
 [drm] RAM width 64bits DDR3
 [drm] amdgpu: 1024M of VRAM memory ready
-[drm] amdgpu: 3991M of GTT memory ready.
-[drm] GART: num cpu pages 1021828, num gpu pages 1021828
-amdgpu 0000:01:00.0: PCIE GART of 3991M enabled (table at 0x0000000000040000).
+[drm] amdgpu: 1024M of GTT memory ready.
+[drm] GART: num cpu pages 262144, num gpu pages 262144
+amdgpu 0000:01:00.0: PCIE GART of 1024M enabled (table at 0x0000000000040000).
-[drm] Initialized amdgpu 3.8.0 20150101 for 0000:01:00.0 on minor 0
+[drm] Initialized amdgpu 3.9.0 20150101 for 0000:01:00.0 on minor 0

So something changed in the detection of the available memory. I don't know if this is related to the bug.
As a comparison point, the radeon driver disagrees with both the old amdgpu driver and the new amdgpu driver:

radeon 0000:01:00.0: VRAM: 1024M 0x0000000000000000 - 0x000000003FFFFFFF (1024M used)
radeon 0000:01:00.0: GTT: 2048M 0x0000000040000000 - 0x00000000BFFFFFFF
[drm] Detected VRAM RAM=1024M, BAR=256M
[drm] RAM width 64bits DDR
[drm] radeon: 1024M of VRAM memory ready
[drm] radeon: 2048M of GTT memory ready.
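
The GTT sizes and GART page counts in these logs can be cross-checked from the printed address ranges. The following is a small sketch with hypothetical helper names; the 4096-byte GPU page size is an assumption that is not stated in the logs.

```c
#include <stdint.h>

/* Assumption: one GPU page is 4096 bytes. */
#define SKETCH_GPU_PAGE_SIZE 4096ull

/* Size in bytes of an address range whose end address is inclusive. */
static uint64_t range_bytes(uint64_t start, uint64_t end_inclusive)
{
    return end_inclusive - start + 1;
}

/* Whole mebibytes in a byte count (rounds down, as the "M" figures appear to). */
static uint64_t bytes_to_mib(uint64_t bytes)
{
    return bytes >> 20;
}

/* Whole GPU pages in a byte count (rounds down). */
static uint64_t bytes_to_pages(uint64_t bytes)
{
    return bytes / SKETCH_GPU_PAGE_SIZE;
}
```

For the v4.10 amdgpu range 0x40000000 - 0x7FFFFFFF this gives 1024 MiB and 262144 pages, and for the v4.9 range 0x40000000 - 0x1397847FF it gives 3991 MiB and 1021828 pages, matching the amdgpu log lines. The radeon range 0x40000000 - 0xBFFFFFFF gives 2048 MiB.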