Bug 194867

Summary: DRM BUG while initializing cape verde (2nd card)
Product: Drivers
Reporter: Janpieter Sollie (janpieter.sollie)
Component: Video(DRI - non Intel)
Assignee: drivers_video-dri
Status: RESOLVED CODE_FIX
Severity: normal
CC: alexdeucher
Priority: P1
Hardware: x86-64
OS: Linux
Kernel Version: 4.11-rc2
Subsystem:
Regression: No
Bisected commit-id:
Attachments:
  zip file with all listed attachments
  patch 1/2
  patch 2/2
  /proc/kmsg output

Description Janpieter Sollie 2017-03-13 09:50:12 UTC
Created attachment 255215 [details]
zip file with all listed attachments

There seems to be a logic error in how the amdgpu module specifies the memory sizes for ttm on the SI architecture: the Fiji card boots fine, but the Cape Verde card triggers a kernel BUG.
dmesg, .config and a proposed patch are in the attachment.
The problem lies in linux/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c: the computed p_size is reduced to 0 when the shift by PAGE_SHIFT is too big for the size being shifted.
I managed to work around the problem by changing the expression "adev->gds.mem.total_size >> PAGE_SHIFT" in amdgpu_ttm_init to "(adev->gds.mem.total_size >> PAGE_SHIFT) + 1", and doing the same for "adev->gds.gws.total_size" and "adev->gds.oa.total_size", though I am not sure this is the correct solution. The problem is (I guess) that my SI card is limited in memory and PAGE_SHIFT is 12.
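
For illustration only, the arithmetic behind this description can be sketched in standalone C. This is not the driver code and not the fix that was eventually applied; the macro and function names below are hypothetical, and the sketch assumes 4 KiB pages (a page shift of 12, as on x86-64). It shows that any byte size smaller than one page becomes 0 pages after a plain right shift, how the "+ 1" workaround changes the result, and how a round-up would differ.

```c
#include <stdint.h>

/* Assumption: 4 KiB pages, i.e. a page shift of 12 as on x86-64. */
#define DEMO_PAGE_SHIFT 12u
#define DEMO_PAGE_SIZE  (UINT64_C(1) << DEMO_PAGE_SHIFT)

/* Plain right shift: rounds down, so anything below 4096 bytes gives 0. */
uint64_t pages_truncated(uint64_t bytes)
{
    return bytes >> DEMO_PAGE_SHIFT;
}

/* Shape of the workaround from the report: always add one page. */
uint64_t pages_plus_one(uint64_t bytes)
{
    return (bytes >> DEMO_PAGE_SHIFT) + 1;
}

/*
 * Hypothetical alternative: round up to whole pages.
 * A size of 0 still yields 0 pages here, unlike the "+ 1" variant.
 * (Sizes within one page of UINT64_MAX would overflow; ignored in this sketch.)
 */
uint64_t pages_rounded_up(uint64_t bytes)
{
    return (bytes + DEMO_PAGE_SIZE - 1) >> DEMO_PAGE_SHIFT;
}
```

With these helpers, a 4095-byte region is 0 pages when truncated but 1 page with either alternative, while a 0-byte region is 1 page only with the "+ 1" variant. Which behaviour is right for the driver depends on what the memory manager expects for an empty region, which this sketch does not try to answer.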
Comment 1 Alex Deucher 2017-03-15 13:57:16 UTC
Created attachment 255263 [details]
patch 1/2

Does this patch set fix the issue?
Comment 2 Alex Deucher 2017-03-15 13:57:32 UTC
Created attachment 255265 [details]
patch 2/2
Comment 3 Janpieter Sollie 2017-03-16 08:09:02 UTC
We're one step further: see the triplefault.txt output.
I set the kernel verbosity to 7 and did a modprobe amdgpu (the module is blacklisted). The error is gone, but the machine hits a triple fault (at least I suspect it does; don't blame me if it doesn't) and because of that it reboots immediately without a panic. Should I file a new bug for that, or can you have a look at it?
Note that this does not happen with dpm disabled.
Comment 4 Janpieter Sollie 2017-03-16 08:10:16 UTC
Created attachment 255283 [details]
/proc/kmsg output
Comment 5 Michel Dänzer 2017-03-16 08:34:12 UTC
Did this work with kernel 4.10 or older?
Comment 6 Janpieter Sollie 2017-03-16 09:20:31 UTC
No, the output is exactly the same: after the 4 ring tests, it reboots.
Comment 7 Michel Dänzer 2017-03-16 09:23:02 UTC
That should be tracked in a separate report then.
Comment 8 Janpieter Sollie 2017-03-16 09:33:48 UTC
Is there any information besides my kernel config, lsmod, /proc/kmsg and lspci that you need in order to handle this as a new report? I get annoyed myself by users who come to me saying 'it doesn't work and I know nothing about it', so I'd like to provide every bit of info you people need.