Bug 210123 - [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] - flip_done timed out with vmwgfx
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: Video(DRI - non Intel)
Hardware: x86-64 Linux
Importance: P1 normal
Assignee: drivers_video-dri
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-11-09 21:05 UTC by Stefan Mayr
Modified: 2021-01-28 19:25 UTC
CC List: 2 users

See Also:
Kernel Version: 5.3.18-24.9.1
Subsystem:
Regression: No
Bisected commit-id:


Description Stefan Mayr 2020-11-09 21:05:09 UTC
Since we upgraded SUSE Linux Enterprise Server 15 from kernel 4.12.14-197.45.1 (SLES15 SP1) to 5.3.18-24.9.1 (SLES15 SP2) or later, we have been seeing the following error messages on some virtual machines:

[102215.857602] [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [CRTC:38:crtc-0] flip_done timed out
[102226.097847] [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [PLANE:34:plane-0] flip_done timed out
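
For anyone who wants to collect these messages across several machines, the two lines above can be picked out of a kernel log with a small script. This is only a minimal sketch and not part of the original report: the function name `parse_flip_done_timeouts`, the dictionary keys, and the regular expression are assumptions derived solely from the two sample lines shown above, so other message variants may not match.

```python
import re

# Pattern modeled on the two sample lines in this report (an assumption;
# other kernels may format the message differently).
FLIP_DONE_RE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+"                 # [102215.857602]
    r"\[drm:(?P<func>\w+) \[(?P<module>\w+)\]\]\s+"  # [drm:func [module]]
    r"\*ERROR\*\s+"
    r"\[(?P<kind>[A-Z]+):(?P<obj_id>\d+):(?P<name>[^\]]+)\]\s+"  # [CRTC:38:crtc-0]
    r"flip_done timed out"
)


def parse_flip_done_timeouts(lines):
    """Return one dict per 'flip_done timed out' line found in `lines`."""
    events = []
    for line in lines:
        match = FLIP_DONE_RE.match(line.strip())
        if match is None:
            continue  # not a flip_done timeout message
        events.append({
            "timestamp": float(match.group("ts")),  # seconds since boot
            "function": match.group("func"),
            "module": match.group("module"),
            "kind": match.group("kind"),            # e.g. CRTC or PLANE
            "object_id": int(match.group("obj_id")),
            "name": match.group("name"),
        })
    return events


# The two lines quoted in this report.
SAMPLE_LINES = [
    "[102215.857602] [drm:drm_atomic_helper_wait_for_dependencies "
    "[drm_kms_helper]] *ERROR* [CRTC:38:crtc-0] flip_done timed out",
    "[102226.097847] [drm:drm_atomic_helper_wait_for_dependencies "
    "[drm_kms_helper]] *ERROR* [PLANE:34:plane-0] flip_done timed out",
]

if __name__ == "__main__":
    for event in parse_flip_done_timeouts(SAMPLE_LINES):
        print(event)
```

Applied to the two sample lines, the timestamps differ by about 10.24 seconds, with the CRTC message logged first and the PLANE message second.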

SUSE support also provided us with some newer kernels:
- 5.3.18 with updated modules
- 5.8.15
- 5.9.1

The issue persists with all of them. All affected machines are running in runlevel 3. The only graphical "thing" is the boot screen, and when this error appears in the logs, this screen is sitting at an empty login prompt.
All virtual machines are running in a VMware environment on ESXi hosts with versions between 6.0.x and 6.7.x. We could not track it down to specific ESXi versions or to load on the ESXi host or the virtual machine. It happens on different ESXi versions, and on loaded as well as idle hosts and virtual machines.

The issue goes away when we add vmwgfx to module_blacklist on the kernel command line in GRUB.

I know our kernel versions are somewhat SUSE-specific. But what changed between 4.12 and 5.3 (and later) that may cause this message between drm and vmwgfx?
Comment 1 Stefan Mayr 2020-11-09 21:16:24 UTC
Bug #208373 seems similar, but for us it started with an older kernel version.
Comment 2 Michel Dänzer 2020-11-11 09:24:08 UTC
(In reply to Stefan Mayr from comment #1)
> #208373 seems similar but for us it started with an older kernel version

It's about amdgpu, unlikely to be directly related.
Comment 3 Stefan Mayr 2020-11-20 10:22:00 UTC
I did some more tests with kernel versions provided by SUSE:

Kernel 5.0.13 - 6 days without issues
Kernel 5.2.14 - 2 days until we got the error message

Today I installed 5.1.16; we will wait and see whether this version shows the error.
Comment 4 Stefan Mayr 2020-11-20 10:38:32 UTC
I had another look at bug 208373: the initial reporter also uses vmwgfx, and amdgpu is only mentioned in bug 208373, comment 2.
Comment 5 Zack Rusin 2020-11-25 16:54:31 UTC
Do you know if there are any errors in the vmware.log? It sounds like a guest-isolated bug in vmwgfx, but it would be great to be able to take a look at the vmware.log from one of the sessions with errors.
Comment 6 Stefan Mayr 2020-11-25 21:45:59 UTC
Kernel 5.1.16 also showed the error message, after 2 days.

So this seems to be triggered by changes between 5.0.13 and 5.1.16. I'm out of SUSE kernels to narrow it down any further.
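
The narrowing above can be written down as a small sketch that takes the per-version observations from this thread and reports the last version without the error and the first version with it. The helper names `version_key` and `regression_window` are hypothetical, and the approach assumes that plain version ordering is meaningful, which is only approximately true for SUSE kernels with backported patches. It also treats "no error within the observation period" as good, which is weaker than a proof for an intermittent issue.

```python
def version_key(version):
    """Turn '5.1.16' into (5, 1, 16) so versions sort numerically."""
    return tuple(int(part) for part in version.split("."))


def regression_window(results):
    """Return (last_good, first_bad) or None if the data is inconsistent.

    `results` maps a version string to True if the error was seen.
    """
    good = [v for v, failed in results.items() if not failed]
    bad = [v for v, failed in results.items() if failed]
    if not good or not bad:
        return None  # need at least one of each to form a window
    last_good = max(good, key=version_key)
    first_bad = min(bad, key=version_key)
    if version_key(last_good) >= version_key(first_bad):
        return None  # a good version newer than a bad one: no clean window
    return last_good, first_bad


# Observations reported in this bug (base versions only, SUSE suffixes dropped).
OBSERVATIONS = {
    "4.12.14": False,  # SLES15 SP1 kernel, no error reported
    "5.0.13": False,   # 6 days without issues
    "5.1.16": True,    # error after 2 days
    "5.2.14": True,    # error after 2 days
    "5.3.18": True,
    "5.8.15": True,
    "5.9.1": True,
}

if __name__ == "__main__":
    print(regression_window(OBSERVATIONS))
```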

I also checked for a vmware.log on this host but couldn't find one. The installed open-vm-tools are the ones that are part of SLES15. Other log files of vmtoolsd don't show any error messages.
Comment 7 Stefan Mayr 2021-01-28 19:25:14 UTC
Same issue with kernel 5.10.9.
