Bug 196635 - amdgpu clinfo hangs with SI
Summary: amdgpu clinfo hangs with SI
Status: RESOLVED PATCH_ALREADY_AVAILABLE
Alias: None
Product: Drivers
Classification: Unclassified
Component: Video(DRI - non Intel)
Hardware: x86-64 Linux
Importance: P1 normal
Assignee: drivers_video-dri
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-08-10 15:13 UTC by Janpieter Sollie
Modified: 2017-08-25 03:57 UTC

See Also:
Kernel Version: 4.13-rc4
Subsystem:
Regression: No
Bisected commit-id:


Attachments
dmesg output (109.60 KB, text/plain)
2017-08-10 15:13 UTC, Janpieter Sollie
lspci output (4.56 KB, text/plain)
2017-08-10 15:14 UTC, Janpieter Sollie
kernel config (97.12 KB, text/x-mpsub)
2017-08-10 16:33 UTC, Janpieter Sollie
working dmesg (83.56 KB, text/plain)
2017-08-11 03:31 UTC, Janpieter Sollie
working clinfo (11.07 KB, text/plain)
2017-08-11 03:33 UTC, Janpieter Sollie

Description Janpieter Sollie 2017-08-10 15:13:40 UTC
Created attachment 257863
dmesg output

The clinfo command no longer works since I tested 4.13-rc4 on my PC.
The kernel error message is in the attachment.
I'm using the amdgpu-pro libraries (only the libraries, not the kernel driver) on a dual Opteron with an R9 Nano and an HD 7700 installed.
I must say I was very happy that DPM is working now, but the clinfo calls not working is a bit of a bummer :(
Comment 1 Janpieter Sollie 2017-08-10 15:14:56 UTC
Created attachment 257865
lspci output
Comment 2 Janpieter Sollie 2017-08-10 16:33:19 UTC
Created attachment 257867
kernel config
Comment 3 Janpieter Sollie 2017-08-11 03:31:45 UTC
Created attachment 257881
working dmesg

With amdgpu.dpm=0, the device seems to initialize correctly (see dmesg).
Comment 4 Janpieter Sollie 2017-08-11 03:33:47 UTC
Created attachment 257883
working clinfo

clinfo output with amdgpu.dpm=0
Comment 5 Janpieter Sollie 2017-08-11 03:37:09 UTC
The system works with amdgpu.dpm=0. I attached some info about the working system. Please note that I DO NOT use the amdgpu-pro kernel module, only its libraries.
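
To confirm which amdgpu.dpm value a given boot actually used, the kernel command line can be inspected. The following is a minimal Python sketch, not something taken from this report. It assumes the parameter was passed on the kernel command line (a value set through a modprobe options file would not show up there) and that the last occurrence wins. The function name and the sample command line are made up for illustration.

```python
from typing import Optional


def module_param_from_cmdline(cmdline: str, module: str, param: str) -> Optional[str]:
    """Return the value of module.param on a kernel command line, or None.

    Assumption: if the parameter appears more than once, the last one wins.
    """
    prefix = f"{module}.{param}="
    value = None
    for token in cmdline.split():
        if token.startswith(prefix):
            value = token[len(prefix):]
    return value


# Hypothetical command line; on a live system, read /proc/cmdline instead.
example = "BOOT_IMAGE=/vmlinuz-4.13.0-rc4 root=/dev/sda1 ro amdgpu.dpm=0"
print(module_param_from_cmdline(example, "amdgpu", "dpm"))
```

On a running machine, the string would come from open("/proc/cmdline").read() rather than the hard-coded example.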
Comment 6 Michel Dänzer 2017-08-14 10:07:52 UTC
Can you bisect the kernel?
Comment 7 Janpieter Sollie 2017-08-15 06:14:54 UTC
I'm not a kernel developer, but I am willing to help where I can. What do you need from the bisection?
Comment 8 Michel Dänzer 2017-08-15 06:25:33 UTC
No need to be a developer, just to compile and test a number of kernel Git commits. Search for "git bisect howto".
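
For readers unfamiliar with the idea, bisection is a binary search over the commit history: each build-and-test step halves the range that can contain the first bad commit. The Python sketch below only illustrates that principle and is not how git implements it. It assumes a linear, oldest-to-newest list of commits with a known-good first entry and a known-bad last entry, and the commit names and the is_bad test are hypothetical. In practice the same loop is driven by hand with "git bisect start", "git bisect good" and "git bisect bad".

```python
from typing import Callable, Sequence


def first_bad(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Binary-search for the first bad commit.

    Assumes commits[0] is known good, commits[-1] is known bad, and every
    commit after the first bad one is also bad.
    """
    good, bad = 0, len(commits) - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(commits[mid]):  # stands in for: build, boot, run the test
            bad = mid
        else:
            good = mid
    return commits[bad]


# Hypothetical history of 16 commits in which c11 introduced the problem.
history = [f"c{i}" for i in range(16)]
print(first_bad(history, lambda c: int(c[1:]) >= 11))
```

With 16 commits this needs only four test builds, which is why bisecting even a full release cycle stays manageable.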
Comment 9 Janpieter Sollie 2017-08-16 05:34:10 UTC
I just browsed through a few howtos.
It won't be easy to pinpoint the problem: in 4.10, it hit a triple fault and then crashed with DPM enabled. Do you want a bisection from that one (see 194899) to the current state, or do I need to do something else?
Comment 10 Michel Dänzer 2017-08-16 06:17:21 UTC
(In reply to Janpieter Sollie from comment #9)
> It won't be easy to point to the problem: in 4.10, it hit a triple fault and
> then crashed with dpm enabled. do you want a bisection from that one(see
> 194899) to the current status or do I need to do something else?

Hmm, I guess I misread the bug description as meaning it worked properly before. If that's not the case, there's probably no point in bisecting.
Comment 11 Janpieter Sollie 2017-08-22 11:33:29 UTC
The problem SEEMS to be related to CIK support and the kernel version:
disabling CIK support in my kernel and upgrading to rc6 solved the problem.
Probably CIK and SI are not really cooperating properly yet.
Comment 12 Michel Dänzer 2017-08-23 01:20:17 UTC
(In reply to Janpieter Sollie from comment #11)
> disabling CIK support in my kernel and upgrading it to rc6 solved the
> problem.
> Probably CIK and SI are not really cooperating properly yet.

Weird, does rc6 still work with CIK support enabled?
Comment 13 Janpieter Sollie 2017-08-25 03:57:53 UTC
Yes.
But I really think the problem is in the application layer:
I do not see any errors in dmesg when running clinfo,
but when I run the application I'm developing, I see the following errors in dmesg:
[31637.263268] amdgpu 0000:41:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0014 address=0x00000000ff9f4000 flags=0x0000]
[31637.263379] amdgpu 0000:41:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0014 address=0x00000000ff9e4000 flags=0x0000]
... and the application hangs.
The interesting part here is: to make sure the driver does not "accidentally work", I added a Polaris device to the system. The amdgpu driver recognised the Polaris, Fiji and SI devices, but only the SI device gives these faults.

Do you know how I can figure out whether this is a kernel, middle-layer or application-layer problem?
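
One way to check which device the faults belong to is to group the AMD-Vi IO_PAGE_FAULT lines by PCI address and compare that address with the lspci output. The Python sketch below is only an illustration: the regular expression is modelled on the two log lines quoted above and may need adjusting for other kernel versions, and the function name is made up.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List

# Pattern modelled on the two dmesg lines quoted in this comment.
FAULT_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+(?P<driver>\S+)\s+"
    r"(?P<bdf>[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]):\s+"
    r"AMD-Vi: Event logged \[IO_PAGE_FAULT "
    r"domain=(?P<domain>0x[0-9a-fA-F]+) "
    r"address=(?P<address>0x[0-9a-fA-F]+) "
    r"flags=(?P<flags>0x[0-9a-fA-F]+)\]"
)


def faults_by_device(lines: Iterable[str]) -> Dict[str, List[int]]:
    """Map each PCI address to the list of faulting addresses logged for it."""
    result: Dict[str, List[int]] = defaultdict(list)
    for line in lines:
        match = FAULT_RE.search(line)
        if match:
            result[match.group("bdf")].append(int(match.group("address"), 16))
    return dict(result)


sample = [
    "[31637.263268] amdgpu 0000:41:00.0: AMD-Vi: Event logged "
    "[IO_PAGE_FAULT domain=0x0014 address=0x00000000ff9f4000 flags=0x0000]",
    "[31637.263379] amdgpu 0000:41:00.0: AMD-Vi: Event logged "
    "[IO_PAGE_FAULT domain=0x0014 address=0x00000000ff9e4000 flags=0x0000]",
]
for bdf, addresses in faults_by_device(sample).items():
    print(bdf, [hex(a) for a in addresses])
```

Feeding it the full dmesg output would show whether every fault comes from the same PCI address, here 0000:41:00.0.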
