Bug 209163

Summary: amdgpu: The CS has been cancelled because the context is lost
Product: Memory Management
Reporter: Satish patel (satish.in)
Component: Other
Assignee: drivers_video-dri
Status: NEW
Severity: high
CC: alexdeucher, christian.koenig, satish.in
Priority: P1
Hardware: x86-64
OS: Linux
Kernel Version: 4.9.118
Subsystem:
Regression: No
Bisected commit-id:
Attachments:
    dmesg log
    AMDGPU version information
    Mesa_opencl version information
    lspci information
    VRAM Utilization screen shot

Description Satish patel 2020-09-05 13:56:17 UTC
Created attachment 292355: dmesg log

I get this error after running the application continuously.
Comment 1 Satish patel 2020-09-05 14:05:09 UTC
Created attachment 292357: AMDGPU version information
Comment 2 Satish patel 2020-09-05 14:05:42 UTC
Created attachment 292359: Mesa_opencl version information
Comment 3 Satish patel 2020-09-05 14:15:41 UTC
Created attachment 292361: lspci information
Comment 4 Christian König 2020-09-09 11:03:54 UTC
This is expected behavior. Your application is trying to use more memory than is physically available:

[71804.930003] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Not enough memory for command submission!

That is most likely a bug in the application, e.g. a memory leak.
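
To see when these errors begin in a saved dmesg log, the message can be matched with a short script. The following is a minimal Python sketch, not something taken from the driver or its tooling: the function name `find_cs_oom_timestamps` is illustrative, and it assumes the log keeps the bracketed timestamp format shown in the excerpt above.

```python
import re

# Matches the amdgpu command-submission out-of-memory line as it appears
# in the dmesg excerpt, capturing the kernel timestamp in seconds.
ERROR_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+"
    r"\[drm:amdgpu_cs_ioctl \[amdgpu\]\] \*ERROR\* "
    r"Not enough memory for command submission!"
)


def find_cs_oom_timestamps(log_text):
    """Return the timestamps (seconds since boot) of every matching error."""
    return [float(m.group("ts")) for m in ERROR_RE.finditer(log_text)]
```

Passing the contents of the attached dmesg log to `find_cs_oom_timestamps` would give the points in uptime at which command submissions started to fail, which can then be compared with how long the application had been running.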
Comment 5 Satish patel 2020-09-09 18:17:09 UTC
(In reply to Christian König from comment #4)
> This is expected behavior, your application tries to use more memory than
> physical available:
> 
> [71804.930003] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Not enough memory for
> command submission!
> 
> That is most likely a bug in the application, e.g. a memory leak.

Dear Mr. König,

Thanks for your reply. I would like to point out that the same application runs for up to 10 days, as long as physical memory and swap are not fully utilized, on CentOS 7 (GNOME display) with kernel 3.10.0-1127.el7.x86_64.

On kernel 4.9.118, however, the same application fails with "amdgpu: The CS has been cancelled because the context is lost" even though the system is using only 75% of its 5.83 GB of physical memory and 1% of its 15 GB swap partition. Why does the system crash (the display flickers and the touch screen stops responding) instead of using the swap area? CPU and memory utilization are still reported when I monitor the machine from another system.
Comment 6 Christian König 2020-09-09 18:26:48 UTC
You are running out of VRAM, not system memory.

Can you test this on an up to date kernel as well?
Comment 7 Satish patel 2020-09-10 04:08:29 UTC
Created attachment 292449: VRAM Utilization screen shot

Attached is a screenshot of the VRAM utilization at the time of the error, taken from the output of: cat /sys/kernel/debug/dri/0/amdgpu_vram_mm
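
For tracking VRAM growth over many hours rather than in a single snapshot, a small polling script can help. The sketch below is only an illustration: it assumes a newer kernel whose amdgpu driver exposes the `mem_info_vram_used` and `mem_info_vram_total` sysfs files (values in bytes), and that the GPU is `card0`. The function names are hypothetical, and these files may not be present on an older kernel such as 4.9.

```python
import time
from pathlib import Path

# Assumed location; the card index (card0) may differ on your system.
DEFAULT_DEVICE_DIR = Path("/sys/class/drm/card0/device")


def read_vram_usage(device_dir=DEFAULT_DEVICE_DIR):
    """Return (used_bytes, total_bytes) read from the amdgpu sysfs counters."""
    device_dir = Path(device_dir)
    used = int((device_dir / "mem_info_vram_used").read_text().strip())
    total = int((device_dir / "mem_info_vram_total").read_text().strip())
    return used, total


def format_usage(used, total):
    """Render usage in MiB plus a percentage of total VRAM."""
    mib = 1024 * 1024
    percent = 100.0 * used / total if total else 0.0
    return f"{used // mib} MiB / {total // mib} MiB ({percent:.1f}%)"


def monitor(samples, interval_s=60.0, device_dir=DEFAULT_DEVICE_DIR):
    """Print one timestamped usage line per sample, sleeping between samples."""
    for i in range(samples):
        used, total = read_vram_usage(device_dir)
        print(time.strftime("%Y-%m-%d %H:%M:%S"), format_usage(used, total), flush=True)
        if i + 1 < samples:
            time.sleep(interval_s)
```

Calling, for example, `monitor(samples=1440, interval_s=60)` would log one line per minute for a day. Usage that climbs steadily while the application's workload stays constant would be consistent with a leak.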
Comment 8 Satish patel 2020-09-10 04:19:29 UTC
(In reply to Christian König from comment #6)
> You are running out of VRAM, not system memory.
> 
> Can you test this on an up to date kernel as well?

Is there an amdgpu module parameter that prevents the driver from using the full VRAM? The same application runs on the same hardware under the GNOME desktop (CentOS 7) with kernel 3.10.xx.1127.

I get the error after about 19 hours when running the same application under X Windows, whereas it runs for more than 7 days with the operating system and kernel version above.
Comment 9 Christian König 2020-09-10 08:39:09 UTC
Try amdgpu.vramlimit=512 on the kernel command line to limit the available VRAM to 512MB.

The problem is certainly some kind of memory leak.

You need to test an up to date kernel, like 5.8 or even better the latest bleeding edge amd-staging-drm-next branch.
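
After rebooting with the parameter, it can be worth confirming that it actually reached the kernel command line. The following is a minimal Python sketch under stated assumptions: the helper name `parse_vramlimit` is hypothetical, the value is assumed to be a plain decimal number of megabytes, and the command line is read from `/proc/cmdline`.

```python
from pathlib import Path


def parse_vramlimit(cmdline):
    """Return the amdgpu.vramlimit value (in MB) found in a kernel command
    line string, or None if no valid occurrence is present.

    If the parameter appears more than once, the last valid one is returned.
    """
    value = None
    for token in cmdline.split():
        key, sep, raw = token.partition("=")
        if key == "amdgpu.vramlimit" and sep:
            try:
                value = int(raw)
            except ValueError:
                # Ignore values this sketch cannot parse as decimal integers.
                continue
    return value


def current_vramlimit(cmdline_path="/proc/cmdline"):
    """Read the running kernel's command line and extract the limit."""
    return parse_vramlimit(Path(cmdline_path).read_text())
```

On the test system, `current_vramlimit()` would be expected to return 512 once the bootloader entry carries `amdgpu.vramlimit=512`, and `None` if the option was not picked up.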