Bug 194867 - DRM BUG while initializing cape verde (2nd card)
Summary: DRM BUG while initializing cape verde (2nd card)
Status: RESOLVED CODE_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: Video(DRI - non Intel)
Hardware: x86-64 Linux
Importance: P1 normal
Assignee: drivers_video-dri
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-03-13 09:50 UTC by Janpieter Sollie
Modified: 2017-03-16 09:33 UTC

See Also:
Kernel Version: 4.11-rc2
Tree: Mainline
Regression: No


Attachments
zip file with all listed attachments (43.90 KB, application/x-zip-compressed)
2017-03-13 09:50 UTC, Janpieter Sollie
patch 1/2 (1.04 KB, patch)
2017-03-15 13:57 UTC, Alex Deucher
patch 2/2 (3.45 KB, patch)
2017-03-15 13:57 UTC, Alex Deucher
/proc/kmsg output (10.82 KB, text/plain)
2017-03-16 08:10 UTC, Janpieter Sollie

Description Janpieter Sollie 2017-03-13 09:50:12 UTC
Created attachment 255215
zip file with all listed attachments

There seems to be a logic error in how the amdgpu module specifies the memory sizes for TTM on the SI architecture: while the Fiji card boots fine, the Cape Verde card triggers a kernel BUG.
The dmesg output, the .config and a proposed patch are in the attachment.
The problem lies in linux/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c: the computed p_size ends up as 0 when the page shift is too large for the size being shifted.
I managed to work around the problem by changing the expression "adev->gds.mem.total_size >> PAGE_SHIFT" in amdgpu_ttm_init to "(adev->gds.mem.total_size >> PAGE_SHIFT) + 1", and doing the same for "adev->gds.gws.total_size" and "adev->gds.oa.total_size", though I am not sure this is the correct solution. The problem (I guess) is that my SI card is limited in memory while PAGE_SHIFT is 12.
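To illustrate the arithmetic described above, here is a minimal, standalone sketch. It is not the driver code and not the contents of the attached patches: the helper names and the example byte counts are hypothetical, and it only assumes a page shift of 12 (4 KiB pages). It compares the truncating shift, the "+ 1" workaround, and a conventional round-up.

```c
#include <stdint.h>

/* Assumption: 4 KiB pages, i.e. a page shift of 12. */
#define EXAMPLE_PAGE_SHIFT 12
#define EXAMPLE_PAGE_SIZE  ((uint64_t)1 << EXAMPLE_PAGE_SHIFT)

/* Truncating conversion: any size below one page becomes 0 pages. */
static uint64_t pages_truncated(uint64_t bytes)
{
    return bytes >> EXAMPLE_PAGE_SHIFT;
}

/* The workaround from this report: always add one page.
 * Never returns 0, but over-counts sizes that are exact page multiples. */
static uint64_t pages_plus_one(uint64_t bytes)
{
    return (bytes >> EXAMPLE_PAGE_SHIFT) + 1;
}

/* Conventional round-up: the smallest page count that covers 'bytes'.
 * Still returns 0 for a size of 0. */
static uint64_t pages_rounded_up(uint64_t bytes)
{
    return (bytes + EXAMPLE_PAGE_SIZE - 1) >> EXAMPLE_PAGE_SHIFT;
}
```

With a hypothetical size of 64 bytes, pages_truncated() yields 0 while the other two yield 1. With 8192 bytes, pages_plus_one() yields 3 where 2 pages would suffice, which is one reason the "+ 1" change may not be the right fix.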
Comment 1 Alex Deucher 2017-03-15 13:57:16 UTC
Created attachment 255263
patch 1/2

Does this patch set fix the issue?
Comment 2 Alex Deucher 2017-03-15 13:57:32 UTC
Created attachment 255265
patch 2/2
Comment 3 Janpieter Sollie 2017-03-16 08:09:02 UTC
We're one step further: see the triplefault.txt output.
I set the kernel verbosity to 7 and did a modprobe amdgpu (the module is blacklisted). The error is gone, but the machine hits a triple fault (I suspect it does; don't blame me if it doesn't) and because of that it reboots immediately without a panic. Should I file a new bug for that, or can you have a look at it?
Note that this does not happen with dpm disabled.
Comment 4 Janpieter Sollie 2017-03-16 08:10:16 UTC
Created attachment 255283
/proc/kmsg output
Comment 5 Michel Dänzer 2017-03-16 08:34:12 UTC
Did this work with kernel 4.10 or older?
Comment 6 Janpieter Sollie 2017-03-16 09:20:31 UTC
No, the output is exactly the same: after the 4 ring tests, it reboots.
Comment 7 Michel Dänzer 2017-03-16 09:23:02 UTC
That should be tracked in a separate report then.
Comment 8 Janpieter Sollie 2017-03-16 09:33:48 UTC
Is there any information besides my kernel config, lsmod, /proc/kmsg and lspci that you need to handle this as a new report? I get annoyed myself by users who come to me saying 'it doesn't work and I know nothing about it', so I'd like to provide all the info you need.
