Bug 210123

Summary: [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] - flip_done timed out with vmwgfx
Product: Drivers
Reporter: Stefan Mayr (stefan+kernel)
Component: Video(DRI - non Intel)
Assignee: drivers_video-dri
Status: NEW
Severity: normal
CC: tiwai, zackr
Priority: P1
Hardware: x86-64
OS: Linux
See Also: https://bugzilla.kernel.org/show_bug.cgi?id=208373
Kernel Version: 5.3.18-24.9.1
Subsystem:
Regression: No
Bisected commit-id:

Description Stefan Mayr 2020-11-09 21:05:09 UTC
Since we upgraded SUSE Linux Enterprise Server 15 from kernel 4.12.14-197.45.1 (SLES15 SP1) to 5.3.18-24.9.1 (SLES15 SP2) or later, we have been seeing the following error messages on some virtual machines:

[102215.857602] [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [CRTC:38:crtc-0] flip_done timed out
[102226.097847] [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [PLANE:34:plane-0] flip_done timed out
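
To count how often this happens across many virtual machines, the two messages above can be matched with a small script. This is only a sketch: the name `parse_flip_done` and the regular expression are illustrative, and they assume the message format is exactly as in the two lines above.

```python
import re

# Matches kernel log lines of the form quoted above, e.g.
#   [102215.857602] [drm:<function> [<module>]] *ERROR* [CRTC:38:crtc-0] flip_done timed out
# Assumption: the format is exactly that of the two sample lines in this report.
FLIP_DONE_RE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+"
    r"\[drm:(?P<func>\w+) \[(?P<module>\w+)\]\]\s+"
    r"\*ERROR\*\s+"
    r"\[(?P<kind>CRTC|PLANE):(?P<id>\d+):(?P<name>[^\]]+)\]\s+"
    r"flip_done timed out\s*$"
)


def parse_flip_done(line):
    """Return a dict describing a flip_done timeout line, or None if it does not match."""
    m = FLIP_DONE_RE.match(line)
    if m is None:
        return None
    return {
        "timestamp": float(m.group("ts")),   # seconds since boot
        "function": m.group("func"),
        "module": m.group("module"),
        "kind": m.group("kind"),             # "CRTC" or "PLANE"
        "object_id": int(m.group("id")),
        "name": m.group("name"),
    }


# The two lines from this report, used as sample input.
SAMPLE_LINES = [
    "[102215.857602] [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] "
    "*ERROR* [CRTC:38:crtc-0] flip_done timed out",
    "[102226.097847] [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] "
    "*ERROR* [PLANE:34:plane-0] flip_done timed out",
]

for sample in SAMPLE_LINES:
    print(parse_flip_done(sample))
```

Feeding the function lines from the kernel log instead of `SAMPLE_LINES` gives one record per timeout, which makes it easier to compare affected and unaffected machines.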

SUSE support also provided us with some more recent kernels:
- 5.3.18 with updated modules 
- 5.8.15
- 5.9.1

The issue stays the same. All affected machines are running in runlevel 3. The only graphical "thing" is the boot screen, and when this error appears in the logs, this screen is sitting at an empty login prompt.
All virtual machines are running in a VMware environment on ESXi hosts with versions between 6.0.x and 6.7.x. We could not track it down to specific ESXi versions, or to load on the ESXi host or the virtual machine. It happens on different versions, and on loaded as well as idle hosts and virtual machines.

The issue goes away when we add vmwgfx to the grub module_blacklist.
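
For reference, the workaround amounts to booting with `module_blacklist=vmwgfx` on the kernel command line. A minimal sketch for verifying this on a running machine follows; the helper names `blacklisted_modules` and `vmwgfx_is_blacklisted` are made up for illustration, and the script assumes the boot parameters can be read from `/proc/cmdline`.

```python
def blacklisted_modules(cmdline):
    """Collect module names from every module_blacklist= parameter.

    The kernel parameter takes a comma-separated list of module names.
    """
    modules = set()
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if key == "module_blacklist" and sep:
            modules.update(name for name in value.split(",") if name)
    return modules


def vmwgfx_is_blacklisted(cmdline):
    """True if vmwgfx appears in a module_blacklist= parameter."""
    return "vmwgfx" in blacklisted_modules(cmdline)


if __name__ == "__main__":
    # Assumption: running on Linux, where /proc/cmdline holds the boot parameters.
    try:
        with open("/proc/cmdline") as fh:
            print(vmwgfx_is_blacklisted(fh.read()))
    except OSError:
        print("could not read /proc/cmdline")
```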

I know our kernel versions are somewhat SUSE-specific. But what changed between 4.12 and 5.3 and later that may cause this message between drm and vmwgfx?
Comment 1 Stefan Mayr 2020-11-09 21:16:24 UTC
#208373 seems similar, but for us it started with an older kernel version
Comment 2 Michel Dänzer 2020-11-11 09:24:08 UTC
(In reply to Stefan Mayr from comment #1)
> #208373 seems similar, but for us it started with an older kernel version

It's about amdgpu, unlikely to be directly related.
Comment 3 Stefan Mayr 2020-11-20 10:22:00 UTC
I did some more tests with kernel versions provided by SUSE:

Kernel 5.0.13 - 6 days without issues
Kernel 5.2.14 - 2 days until we got the error message

Today I installed 5.1.16, and we are waiting to see whether this version shows the error or not.
Comment 4 Stefan Mayr 2020-11-20 10:38:32 UTC
I had another look at bug 208373: the initial reporter also uses vmwgfx, and amdgpu is only mentioned in bug 208373, comment 2
Comment 5 Zack Rusin 2020-11-25 16:54:31 UTC
Do you know if there are any errors in the vmware.log? It sounds like it's a guest-isolated bug in vmwgfx, but it'd be great to be able to take a look at the vmware.log from one of the sessions with errors.
Comment 6 Stefan Mayr 2020-11-25 21:45:59 UTC
Kernel 5.1.16 also showed the error message after 2 days.

So this seems to be triggered by changes between 5.0.13 and 5.1.16. I'm out of SUSE Kernels to narrow it down even more.

I also checked for a vmware.log on this host but I couldn't find one. The open-vm-tools that are part of SLES15 are installed. Other logfiles of vmtoolsd don't show any error messages.
Comment 7 Stefan Mayr 2021-01-28 19:25:14 UTC
Same issue with Kernel 5.10.9