Bug 196635

Summary: amdgpu clinfo hangs with SI
Product: Drivers
Reporter: Janpieter Sollie (janpieter.sollie)
Component: Video(DRI - non Intel)
Assignee: drivers_video-dri
Status: RESOLVED PATCH_ALREADY_AVAILABLE    
Severity: normal    
Priority: P1    
Hardware: x86-64   
OS: Linux   
See Also: https://bugzilla.kernel.org/show_bug.cgi?id=194899
Kernel Version: 4.13-rc4
Subsystem:
Regression: No
Bisected commit-id:
Attachments: dmesg output, lspci output, kernel config, working dmesg, working clinfo

Description Janpieter Sollie 2017-08-10 15:13:40 UTC
Created attachment 257863 [details]
dmesg output

The clinfo command no longer works since I started testing 4.13-rc4 on my PC.
The kernel error message is in the attachment.
I'm using the amdgpu-pro libraries (only the libraries, not the kernel driver) on a dual Opteron with an R9 Nano and an HD 7700 installed.
I must say I was very happy that DPM is working now, but the clinfo calls not working is a bit of a bummer :(
Comment 1 Janpieter Sollie 2017-08-10 15:14:56 UTC
Created attachment 257865 [details]
lspci output
Comment 2 Janpieter Sollie 2017-08-10 16:33:19 UTC
Created attachment 257867 [details]
kernel config
Comment 3 Janpieter Sollie 2017-08-11 03:31:45 UTC
Created attachment 257881 [details]
working dmesg

dmesg with amdgpu.dpm=0 seems to initialize the device correctly
Comment 4 Janpieter Sollie 2017-08-11 03:33:47 UTC
Created attachment 257883 [details]
working clinfo

clinfo output with amdgpu.dpm=0
Comment 5 Janpieter Sollie 2017-08-11 03:37:09 UTC
The system works with dpm=0. I attached some info about the working system. Please note that I DO NOT use the amdgpu-pro kernel module, only its libraries.
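A minimal sketch for confirming which amdgpu options a given boot actually used, by parsing a kernel command line string. The function name module_params and the sample command line are hypothetical; on a running Linux system the real string can be read from /proc/cmdline.

```python
def module_params(cmdline: str, module: str) -> dict:
    """Collect `module.param=value` tokens from a kernel command line string."""
    params = {}
    prefix = module + "."
    for token in cmdline.split():
        # Only tokens such as "amdgpu.dpm=0" are of interest here.
        if token.startswith(prefix) and "=" in token:
            key, value = token[len(prefix):].split("=", 1)
            params[key] = value
    return params


# Hypothetical command line, for illustration only.
sample = "BOOT_IMAGE=/vmlinuz root=/dev/sda1 ro amdgpu.dpm=0 quiet"
amdgpu_params = module_params(sample, "amdgpu")
dpm_disabled = amdgpu_params.get("dpm") == "0"
```

Recording this next to each dmesg makes it easier to tell the dpm=0 and default boots apart later.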
Comment 6 Michel Dänzer 2017-08-14 10:07:52 UTC
Can you bisect the kernel?
Comment 7 Janpieter Sollie 2017-08-15 06:14:54 UTC
I'm not a kernel developer, but I am willing to help you where I can. What do you need from the bisection?
Comment 8 Michel Dänzer 2017-08-15 06:25:33 UTC
No need to be a developer, just to compile and test a number of kernel Git commits. Search for "git bisect howto".
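For readers new to the idea, here is a rough sketch of the search that bisection performs. It is only an illustration in Python, not how git implements it: the commit names and the is_bad test are made up, and it assumes a linear history where the first commit is known good, the last is known bad, and every commit after the first bad one is also bad. With git itself, the corresponding steps are `git bisect start`, then `git bisect good` or `git bisect bad` after building and testing each suggested commit.

```python
from typing import Callable, Sequence


def first_bad(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Binary-search for the first bad commit.

    Assumes commits[0] is known good and commits[-1] is known bad.
    """
    lo, hi = 0, len(commits) - 1  # lo: last known good, hi: first known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid  # the culprit is at mid or earlier
        else:
            lo = mid  # the culprit is after mid
    return commits[hi]


# Hypothetical history of 16 commits where "c11" introduced the problem.
commits = [f"c{i}" for i in range(16)]
tested = []


def is_bad(commit: str) -> bool:
    tested.append(commit)  # stands in for "build this kernel, boot, run clinfo"
    return int(commit[1:]) >= 11


culprit = first_bad(commits, is_bad)
```

Each step halves the remaining range, so even a few thousand commits between two release candidates need only a dozen or so kernel builds.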
Comment 9 Janpieter Sollie 2017-08-16 05:34:10 UTC
I just browsed through a few howtos:
It won't be easy to point to the problem: in 4.10, it hit a triple fault and then crashed with DPM enabled. Do you want a bisection from that one (see bug 194899) to the current status, or do I need to do something else?
Comment 10 Michel Dänzer 2017-08-16 06:17:21 UTC
(In reply to Janpieter Sollie from comment #9)
> It won't be easy to point to the problem: in 4.10, it hit a triple fault and
> then crashed with dpm enabled. do you want a bisection from that one(see
> 194899) to the current status or do I need to do something else?

Hmm, I guess I misread the bug description as meaning it worked properly before. If that's not the case, there's probably no point in bisecting.
Comment 11 Janpieter Sollie 2017-08-22 11:33:29 UTC
The problem SEEMS to be with CIK support and the upgrade to rc6 ...
disabling CIK support in my kernel and upgrading it to rc6 solved the problem.
Probably CIK and SI are not really cooperating properly yet.
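A small sketch for checking how SI and CIK support are set in a kernel .config, so that two test kernels can be compared. It assumes CONFIG_DRM_AMDGPU_SI and CONFIG_DRM_AMDGPU_CIK are the options that gate SI and CIK support in amdgpu; the config excerpt below is hypothetical and is not the kernel config attached to this bug.

```python
from typing import Optional


def config_value(config_text: str, option: str) -> Optional[str]:
    """Return "y", "m" or another value for an option, "n" if it is
    explicitly not set, or None if the option does not appear at all."""
    for line in config_text.splitlines():
        line = line.strip()
        if line == f"# {option} is not set":
            return "n"
        if line.startswith(option + "="):
            return line.split("=", 1)[1]
    return None


# Hypothetical excerpt, for illustration only.
sample_config = """\
CONFIG_DRM_AMDGPU=m
CONFIG_DRM_AMDGPU_SI=y
# CONFIG_DRM_AMDGPU_CIK is not set
"""

si_support = config_value(sample_config, "CONFIG_DRM_AMDGPU_SI")
cik_support = config_value(sample_config, "CONFIG_DRM_AMDGPU_CIK")
```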
Comment 12 Michel Dänzer 2017-08-23 01:20:17 UTC
(In reply to Janpieter Sollie from comment #11)
> disabling CIK support in my kernel and upgrading it to rc6 solved the
> problem.
> Probably CIK and SI are not really cooperating properly yet.

Weird, does rc6 still work with CIK support enabled?
Comment 13 Janpieter Sollie 2017-08-25 03:57:53 UTC
Yes.
But I really think the problem is in the application layer:
I do not see any errors in dmesg when running clinfo,
but when I run the application I'm developing, I see the following errors in dmesg:
[31637.263268] amdgpu 0000:41:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0014 address=0x00000000ff9f4000 flags=0x0000]
[31637.263379] amdgpu 0000:41:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0014 address=0x00000000ff9e4000 flags=0x0000]
... and the application hangs.
The interesting part here is: to make sure the driver does not "accidentally work", I added a Polaris device to the system. The amdgpu driver recognised the Polaris, Fiji and SI devices, but only the SI gives these faults.
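
To check which device the faults come from, the IO_PAGE_FAULT lines can be grouped by PCI address. This is only a sketch: the regular expression is written against the two lines quoted above and may need adjusting for other kernels' message formats, and the names FAULT_RE and parse_faults are made up for this example.

```python
import re
from collections import Counter

# Pattern modelled on the two dmesg lines quoted in this comment.
FAULT_RE = re.compile(
    r"\[\s*(?P<time>\d+\.\d+)\]\s+amdgpu\s+"
    r"(?P<dev>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]):\s+"
    r"AMD-Vi: Event logged \[IO_PAGE_FAULT "
    r"domain=(?P<domain>0x[0-9a-f]+) "
    r"address=(?P<addr>0x[0-9a-f]+) "
    r"flags=(?P<flags>0x[0-9a-f]+)\]"
)


def parse_faults(log: str) -> list:
    """Return one dict per IO_PAGE_FAULT line found in the log text."""
    faults = []
    for line in log.splitlines():
        m = FAULT_RE.search(line)
        if m:
            faults.append({
                "time": float(m.group("time")),
                "dev": m.group("dev"),
                "domain": int(m.group("domain"), 16),
                "addr": int(m.group("addr"), 16),
                "flags": int(m.group("flags"), 16),
            })
    return faults


log = """\
[31637.263268] amdgpu 0000:41:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0014 address=0x00000000ff9f4000 flags=0x0000]
[31637.263379] amdgpu 0000:41:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0014 address=0x00000000ff9e4000 flags=0x0000]
"""

faults = parse_faults(log)
per_device = Counter(f["dev"] for f in faults)
```

Matching the PCI addresses in per_device against the lspci output shows which card each fault belongs to.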

Do you know how I can figure out whether this is a kernel / midline / application layer problem?