Created attachment 255215 [details]
zip file with all listed attachments
There seems to be a logic error in how the memory sizes for TTM are specified in the amdgpu module on the SI architecture:
while the Fiji card boots fine, the Cape Verde card hits a kernel BUG.
dmesg, .config and a proposed patch are in the attachment.
The problem lies in linux/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c: the computed p_size drops to 0 when the page shift is too big for the size being shifted.
I managed to work around the problem by changing the expression "adev->gds.mem.total_size >> PAGE_SHIFT" in amdgpu_ttm_init to "(adev->gds.mem.total_size >> PAGE_SHIFT) + 1", and doing the same for "adev->gds.gws.total_size" and "adev->gds.oa.total_size", though I am not sure this is the correct solution. The problem is that my SI card is limited in memory (I guess) and the page shift is 12.
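To illustrate the arithmetic described above, here is a minimal standalone sketch (not the driver code). It assumes a page shift of 12 as mentioned in the report; the function names and the 64-byte size used in the note below are hypothetical and chosen only to show the truncation. It compares the plain shift, the "+ 1" workaround, and a round-up variant, which is only one possible alternative and not necessarily what the attached patch does.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 4 KiB pages, i.e. the page shift of 12 from the report. */
#define EXAMPLE_PAGE_SHIFT 12
#define EXAMPLE_PAGE_SIZE  (UINT64_C(1) << EXAMPLE_PAGE_SHIFT)

/* Plain shift: truncates, so any size below one page yields 0 pages. */
uint64_t pages_truncated(uint64_t bytes)
{
    return bytes >> EXAMPLE_PAGE_SHIFT;
}

/* The workaround from the report: always add one page after shifting. */
uint64_t pages_plus_one(uint64_t bytes)
{
    return (bytes >> EXAMPLE_PAGE_SHIFT) + 1;
}

/* Hypothetical alternative: round up to the next whole page. */
uint64_t pages_rounded_up(uint64_t bytes)
{
    return (bytes + EXAMPLE_PAGE_SIZE - 1) >> EXAMPLE_PAGE_SHIFT;
}

/* Small self-check with made-up sizes (not values read from hardware). */
void example_checks(void)
{
    assert(pages_truncated(64) == 0);    /* the failing case: p_size of 0 */
    assert(pages_plus_one(64) == 1);     /* workaround gives one page */
    assert(pages_rounded_up(64) == 1);   /* round-up also gives one page */
    assert(pages_plus_one(4096) == 2);   /* workaround over-allocates here */
    assert(pages_rounded_up(4096) == 1); /* round-up stays exact */
}
```

With a size smaller than one page, the plain shift gives 0, which matches the p_size of 0 described above. The "+ 1" form avoids the zero but reports one extra page whenever the size is already page-aligned, which may be one reason to doubt it is the right fix.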
Created attachment 255263 [details]
Does this patch set fix the issue?
Created attachment 255265 [details]
We're one step further: see the triplefault.txt output.
I set the kernel verbosity to 7 and did a modprobe amdgpu (the module is blacklisted). The error is gone, but the machine appears to hit a triple fault (I suspect it does, but I can't confirm it) and because of that it immediately reboots without a panic. Should I file a new bug for that, or can you have a look at it?
Note that this does not happen with dpm disabled.
Created attachment 255283 [details]
Did this work with kernel 4.10 or older?
No, the output is exactly the same: after the 4 ring tests, it reboots.
That should be tracked in a separate report then.
Is there any information besides my kernel config, lsmod, /proc/kmsg and lspci that you need to handle this as a new report? I get annoyed myself by users who come to me saying 'it doesn't work and I know nothing about it', so I'd like to provide all the info you need.