Bug 216303

Summary: Commit ee7a69aa38d87a3bbced7b8245c732c05ed0c6ec broke legacy frame buffer with NVIDIA
Product: Drivers Reporter: reserv0
Component: Video(Other)Assignee: drivers_video-other
Status: RESOLVED CODE_FIX    
Severity: normal CC: 1700011628, aplattner, arclnx, dburger, hujq, javier, jeremy, jhs, kernel, mail, rhysperry111, svenstaro, ulf.norberg
Priority: P1    
Hardware: i386   
OS: Linux   
URL: https://github.com/NVIDIA/open-gpu-kernel-modules/issues/341
Kernel Version: 5.18.13 Subsystem:
Regression: No Bisected commit-id:
Attachments: Commit revert that fixes the bug
Reverts to un-break v6.0.6
Fix for fbdev and Linux v6.1

Description reserv0 2022-07-29 10:29:46 UTC
Created attachment 301506 [details]
Commit revert that fixes the bug

After updating to (vanilla) v5.18.13, a PC (Core-i7 9700K) with a NVIDIA GTX 1070Ti card (using the proprietary NVIDIA driver) stops displaying the text consoles properly after X is started (the video mode is obviously wrong when switching to a virtual console).

I traced the bug down to the ee7a69aa38d87a3bbced7b8245c732c05ed0c6ec commit discussed here: https://patchwork.freedesktop.org/patch/488499/

After reverting that commit (see attached patch), everything works beautifully again.

The system in question is NOT using the simplefb frame buffer (because it boots under BIOS/CSM, not UEFI), but the VESA one.

The relevant defines of the config file are as follow:

CONFIG_DRM_I915=m
CONFIG_DRM_I915_FORCE_PROBE=""
CONFIG_DRM_I915_CAPTURE_ERROR=y
CONFIG_DRM_I915_COMPRESS_ERROR=y
CONFIG_DRM_I915_USERPTR=y
CONFIG_DRM_I915_REQUEST_TIMEOUT=20000
CONFIG_DRM_I915_FENCE_TIMEOUT=10000
CONFIG_DRM_I915_USERFAULT_AUTOSUSPEND=250
CONFIG_DRM_I915_HEARTBEAT_INTERVAL=2500
CONFIG_DRM_I915_PREEMPT_TIMEOUT=640
CONFIG_DRM_I915_MAX_REQUEST_BUSYWAIT=8000
CONFIG_DRM_I915_STOP_TIMEOUT=100
CONFIG_DRM_I915_TIMESLICE_DURATION=1
CONFIG_DRM_PANEL=y
CONFIG_DRM_BRIDGE=y
CONFIG_DRM_PANEL_BRIDGE=y
CONFIG_DRM_PANEL_ORIENTATION_QUIRKS=y
CONFIG_DRM_NOMODESET=y
CONFIG_FB_CMDLINE=y
CONFIG_FB_NOTIFY=y
CONFIG_FB=y
CONFIG_FB_SYS_FOPS=m
CONFIG_FB_DEFERRED_IO=y
CONFIG_FB_VESA=y
CONFIG_FB_EFI=y
CONFIG_HDMI=y
CONFIG_VGA_CONSOLE=y
CONFIG_DUMMY_CONSOLE=y
CONFIG_DUMMY_CONSOLE_COLUMNS=80
CONFIG_DUMMY_CONSOLE_ROWS=25
CONFIG_FRAMEBUFFER_CONSOLE=y
CONFIG_FRAMEBUFFER_CONSOLE_LEGACY_ACCELERATION=y
CONFIG_FRAMEBUFFER_CONSOLE_DETECT_PRIMARY=y
Comment 1 Artem S. Tashkinov 2022-07-31 12:26:14 UTC
CC'ing Javier Martinez Canillas
Comment 2 Javier Martinez Canillas 2022-08-01 00:49:16 UTC
There isn't much we can do here really, the NVIDIA proprietary driver is an out-of-tree module and you should report to them if there are any issues.

For this particular problem, they were relying on registered framebuffer devices to bind to the fbcon to have VT support.

But that solution is fragile. Instead, is the DRM driver responsibility to register an emulated framebuffer device if fbcon and VT has to be supported.

Commit ee7a69aa38d8 ("fbdev: Disable sysfb device registration when removing conflicting FBs") fixed a race since, since otherwise it would allow to register platform devices for firmware provided framebuffers even after a native video driver was registered.
Comment 4 reserv0 2022-08-01 18:22:18 UTC
@Javier
Well, this new code introduces a new bug ruining existing drivers (that always worked beautifully and for many years as far as I am concerned, so they are not as "fragile" as you imply, obviously) to fix another bug in others... I'd say it is inadequate, to say the least. Rejecting the fault on NVIDIA is not really fair, IMHO.

At least, you could add a check for the VESA frame buffer usage (disabling that line when it is in use) or add a boot parameter option to disable this "fix"...

@Joël
Thank you. I back-linked your bug report on NVIDIA's git to this bug report.
Comment 5 Javier Martinez Canillas 2022-08-01 19:12:23 UTC
(In reply to reserv0 from comment #4)
> @Javier
> Well, this new code introduces a new bug ruining existing drivers (that
> always worked beautifully and for many years as far as I am concerned, so

That commit is not introducing a bug, what is doing is disabling the sysfb,
that registers platform devices that bind to drivers using a firmware provided
framebuffer. This includes both efifb (for EFI GOP) and vesafb (for BIOS/VESA).

> they are not as "fragile" as you imply, obviously) to fix another bug in

The actual fact that are not working after the mentioned commit does show that
the approach *was* fragile. Since relied on other drivers to have a fbcon / VT.

> others... I'd say it is inadequate, to say the least. Rejecting the fault on
> NVIDIA is not really fair, IMHO.
>

As mentioned. It is the Nvidia driver fault because it should register all the
needed interfaces. This includes not only a DRI device for KMS/DRM but also a
fbdev, if they want to support fbcon and virtual terminals.

That's what all other DRM drivers do, no other driver AFAICT was relying on a
different fbdev driver for this. So the real fix would be for the Nvidia driver
to register an emulated fbdev as mentioned.
 
> At least, you could add a check for the VESA frame buffer usage (disabling
> that line when it is in use) or add a boot parameter option to disable this
> "fix"...
>

I don't know why you think this is only related to EFI and not legacy BIOS/VESA.
All drivers that make use of firmware-provided framebuffers should be unregistered
when a real video driver takes over the graphics device.
Comment 6 reserv0 2022-08-01 22:25:35 UTC
@Javier

1.- Until you committed your "fix" (more like a dirty workaround) for the race condition, the frame buffer has been working just fine with all NVIDIA drivers I used in the past 15+ years.

2.- Since your "fix" introduces a regression that breaks *all* existing NVIDIA drivers (including for legacy, no more maintained drivers that still allow to use old cards), this commit should, logically, be reverted until a proper, non breaking fix is found. Or, at the very least, a kill switch should allow to prevent this "fix" to kick in when a NVIDIA driver is in use... A boot parameter, for example...

3.- You assumption following which "the Nvidia driver should register all the
needed interfaces" is apparently not clearly a requirement (or was not one in the past), as shows the comment posted here:
https://github.com/NVIDIA/open-gpu-kernel-modules/issues/341#issuecomment-1201748789
Comment 7 Javier Martinez Canillas 2022-08-02 01:41:25 UTC
(In reply to reserv0 from comment #6)
> @Javier
> 
> 1.- Until you committed your "fix" (more like a dirty workaround) for the
> race condition, the frame buffer has been working just fine with all NVIDIA
> drivers I used in the past 15+ years.
> 

Honestly, you don't seem to understand what the problem that commit fixed is, why that change was introduced and why the Nvidia driver stopped working after it. So I'll just stop answering to this bugzilla.

But regardless of that, the fact that you consider that a change affecting an out-of-tree module could be considered a mainline regression shows that you also don't understand how development in Linux kernel works.

Feel free to read https://www.kernel.org/doc/Documentation/process/stable-api-nonsense.rst if this isn't clear to you.
Comment 8 reserv0 2022-08-02 08:08:48 UTC
@Javier:

I do not care (neither any NVIDIA car owner would) about what problem your so-called "fix" is supposed to solve, since that problem did not impact us. However, I do care for the BUG, *your* change introduced for ALL past (including, legacy ones, that will never be updated any more) and present NVIDIA drivers, suddenly breaking all systems using them.

The MINIMUM would be to implement a kill switch for that "fix", so that EXISTING software won't stay BROKEN FOREVER.

Linux is suffering from ideological "principles", instead of adopting a pragmatic approach to such issues... And some stupid hostility towards third party software. So SAD ! :-(
Comment 9 Javier Martinez Canillas 2022-08-02 16:25:52 UTC
NVIDIA owners should be reporting problems related to NVIDIA drivers to NVIDIA.
Comment 10 reserv0 2022-08-02 17:18:49 UTC
@Javier:

> NVIDIA owners should be reporting problems related to NVIDIA drivers to
> NVIDIA.
Perfectly irresponsible reply, thank you! 

Congratulation, anyway... I do not think any other kernel developer will be able to brag about breaking that many systems with just a one liner patch to the kernel code!!!
Comment 11 Alberto Ruiz 2022-08-03 16:03:42 UTC
(In reply to reserv0 from comment #10)
> @Javier:
> 
> > NVIDIA owners should be reporting problems related to NVIDIA drivers to
> > NVIDIA.
> Perfectly irresponsible reply, thank you! 
> 
> Congratulation, anyway... I do not think any other kernel developer will be
> able to brag about breaking that many systems with just a one liner patch to
> the kernel code!!!

Let me get this right.

A for profit company that you bought a product from doesn't have an in-tree driver and contribute to the overall maintenance of the DRM stack and you get angry and call names at the volunteer developer trying to do the right thing from the DRM stack because the off tree driver?

The kernel doesn't provide any API guarantees to off tree drivers. This is upstream policy. Upstream is not a product, downstream that provide a patch will not be broken. you are wasting energy on a transient problem caused by your GPU vendor.

The solution you are after goes through either waiting for NVIDIA to update their drivers to update your kernel or file a bug against your distro or patch your kernel.

Stop alienating volunteer engineers with your petty rants.
Comment 12 reserv0 2022-08-03 16:55:37 UTC
> you are wasting energy on a transient problem caused by your GPU vendor.

This is total non-sense! Are you a fake news adept? Turning tables and pretending the cause is the effect and vice versa?

YOU are wasting Linux users' time by breaking drivers that worked beautifully for decades. This is NOT NVIDIA which suddenly introduced such a huge a compatibility issue, but a kernel devel one-liner "patch".

As a developer myself, would I introduce such a regression in one of my pieces of software, I would either fix it, provide a workaround for it, or get fired for breaking things that used to work just fine adn for incompetency!

As for Linux "policy" of systematically breaking every driver third party vendors are providing, guess what? It's one of the very reasons why people do not want to waste their time with Linux!

But I'll stop wasting my time with you guys. Keep breaking what does not need fixing, and you will end up killing Linux and all the good wills (yes, including the good wills by hardware vendors).
Comment 13 Alberto Ruiz 2022-08-03 17:23:08 UTC
(In reply to reserv0 from comment #12)
> YOU are wasting Linux users' time by breaking drivers that worked
> beautifully for decades. This is NOT NVIDIA which suddenly introduced such a
> huge a compatibility issue, but a kernel devel one-liner "patch".

NVIDIA drivers (and pretty much every off tree driver) break all the time against mainline, NVIDIA (and any off tree driver) has to keep track of the kernel regularly so that drivers keep building. This is common practice over those decades you talk about, you just didn't notice because NVIDIA and downstreams play their part.

> As a developer myself, would I introduce such a regression in one of my
> pieces of software, I would either fix it, provide a workaround for it, or
> get fired for breaking things that used to work just fine adn for
> incompetency!

Ok.

> As for Linux "policy" of systematically breaking every driver third party
> vendors are providing, guess what? It's one of the very reasons why people
> do not want to waste their time with Linux!

Third party vendors know very well that if they want an ever working driver without having to track mainline changes they have to contribute their driver upstream. NVIDIA knows this. You are their customer please talk to them if you have concerns about their upstream engagement.

> But I'll stop wasting my time with you guys. Keep breaking what does not
> need fixing, and you will end up killing Linux and all the good wills (yes,
> including the good wills by hardware vendors).

Sure, have a nice day.
Comment 14 Artem S. Tashkinov 2022-08-04 11:54:02 UTC
This is ugly beyond hope.
Comment 15 Jingyuan Deng 2022-08-08 08:26:41 UTC
(In reply to Javier Martinez Canillas from comment #2)
> There isn't much we can do here really, the NVIDIA proprietary driver is an
> out-of-tree module and you should report to them if there are any issues.
> 
> For this particular problem, they were relying on registered framebuffer
> devices to bind to the fbcon to have VT support.
> 
> But that solution is fragile. Instead, is the DRM driver responsibility to
> register an emulated framebuffer device if fbcon and VT has to be supported.
> 
> Commit ee7a69aa38d8 ("fbdev: Disable sysfb device registration when removing
> conflicting FBs") fixed a race since, since otherwise it would allow to
> register platform devices for firmware provided framebuffers even after a
> native video driver was registered.

But I have to say similar problems happen on amd device and Loongson device after 5.18.13 as I know. I cannot tell whether it is caused by this commit, but I have to say many users find tty display problem these day and it is not a nvidia-only problem!
Comment 17 Jingyuan Deng 2022-08-09 06:51:06 UTC
(In reply to Jingyuan Deng from comment #15)
> (In reply to Javier Martinez Canillas from comment #2)
> > There isn't much we can do here really, the NVIDIA proprietary driver is an
> > out-of-tree module and you should report to them if there are any issues.
> > 
> > For this particular problem, they were relying on registered framebuffer
> > devices to bind to the fbcon to have VT support.
> > 
> > But that solution is fragile. Instead, is the DRM driver responsibility to
> > register an emulated framebuffer device if fbcon and VT has to be
> supported.
> > 
> > Commit ee7a69aa38d8 ("fbdev: Disable sysfb device registration when
> removing
> > conflicting FBs") fixed a race since, since otherwise it would allow to
> > register platform devices for firmware provided framebuffers even after a
> > native video driver was registered.
> 
> But I have to say similar problems happen on amd device and Loongson device
> after 5.18.13 as I know. I cannot tell whether it is caused by this commit,
> but I have to say many users find tty display problem these day and it is
> not a nvidia-only problem!

See https://bugzilla.kernel.org/show_bug.cgi?id=216331 
Similar problem on amd device
Comment 18 Javier Martinez Canillas 2022-08-09 10:40:03 UTC
(In reply to Jingyuan Deng from comment #17)
> (In reply to Jingyuan Deng from comment #15)
> > (In reply to Javier Martinez Canillas from comment #2)
> > > There isn't much we can do here really, the NVIDIA proprietary driver is
> an
> > > out-of-tree module and you should report to them if there are any issues.
> > > 
> > > For this particular problem, they were relying on registered framebuffer
> > > devices to bind to the fbcon to have VT support.
> > > 
> > > But that solution is fragile. Instead, is the DRM driver responsibility
> to
> > > register an emulated framebuffer device if fbcon and VT has to be
> > supported.
> > > 
> > > Commit ee7a69aa38d8 ("fbdev: Disable sysfb device registration when
> > removing
> > > conflicting FBs") fixed a race since, since otherwise it would allow to
> > > register platform devices for firmware provided framebuffers even after a
> > > native video driver was registered.
> > 
> > But I have to say similar problems happen on amd device and Loongson device
> > after 5.18.13 as I know. I cannot tell whether it is caused by this commit,
> > but I have to say many users find tty display problem these day and it is
> > not a nvidia-only problem!
> 
> See https://bugzilla.kernel.org/show_bug.cgi?id=216331 
> Similar problem on amd device

but does reverting ee7a69aa38d8 ("fbdev: Disable sysfb device registration when removing conflicting FBs") make the issue to go away? It's not clear that's the same problem.
Comment 19 Joël Müller 2022-08-14 18:43:00 UTC
Is there a solution on the way? How long I have to patch my kernel by hand?
Comment 20 Joël Müller 2022-09-29 11:52:23 UTC
We have located the issue and it's the framebuffer from the BMC. When we blacklist the ast module the kernel boots properly.

Full disscusion https://bbs.archlinux.org/viewtopic.php?id=279971
Comment 21 reserv0 2022-10-24 08:40:12 UTC
Problem solved for me in Linux v6.0.3.
Comment 22 reserv0 2022-10-27 22:47:06 UTC
And it's broken again in v6.0.4!!!

I have to revert commit 9d69ef1838150c7d87afc1a87aa658c637217585 to make it work...
Comment 23 reserv0 2022-10-29 13:03:02 UTC
And now, v6.0.6 makes it even worst...

In order for the frame buffer console to work with NVIDIA drivers, I now need to revert both 9d69ef1838150c7d87afc1a87aa658c637217585 and 25a6688f27ff54f97adf7cce1d7e18c38bf51eb4...

Could we please *stop this madness* and revert the code (that worked just fine fore over 15 years) to what it was before it got totally and irremediably broken, i.e. before the *regression* introduced in v5.18.13?

Obviously, that code change which was supposed to fix *another* bug is simply inadequate, so let's revert it all until a *proper fix* is found for that other bug, that won't break something that has always worked fine so far!
Comment 24 reserv0 2022-10-29 13:05:14 UTC
Created attachment 303105 [details]
Reverts to un-break v6.0.6

Added the patch I am using to un-break v6.0.6 (and return to the v6.0.3 code that did work).
Comment 25 Ulf 2022-11-18 20:11:19 UTC
I also suffer from this issue and have been logging in on a black console for a while now.  By preventing the driver for my integrated graphics (i915) from loading (as suggested in #20) I do get a console (using bootparam i915.modeset=0 gives similar results).  But I need the integrated "graphics", i915, for sound through HDMI.  Any suggestions on how to solve this?
Comment 26 reserv0 2022-12-12 13:17:56 UTC
And v6.1 is now broken beyond repair...

The *only* solutions left to me now, if I want to run a recent Linux kernel, is to disable entirely the i915 driver, either at kernel configure time, or in the kernel blacklist option, or in the BIOS (by disabling the iGPU there)...

It means I cannot use any more the iGPU with a second monitor or for OpenCL computing! >:-(
Comment 27 reserv0 2022-12-22 22:11:04 UTC
Created attachment 303457 [details]
Fix for fbdev and Linux v6.1

I had another look at v6.1 and found out that all what was needed was to remove the call to sysfb_disable() in drivers/video/aperture.c to restore full frame buffer functionality with both the Intel iGPU and the NVIDIA GPU active.

Patch attached.
Comment 28 Aaron Plattner 2023-01-10 21:35:47 UTC
I've been looking at this from the NVIDIA side.

If I understand the regression commit ee7a69aa3 correctly, the original problem it solves is a race between sysfb_init registering a platform device for the system boot console, and a drm driver calling into do_remove_conflicting_framebuffers: if sysfb_init is called before drm initializes, then remove_conflicting_framebuffers will remove the platform device sysfb adds. But if drm initializes before sysfb_init is called, then there will be no platform device to remove and then sysfb will still add one later. Is that a correct understanding of the motivation behind the new call to sysfb_disable in this change?

To recap this bug, the problem here is that there are two graphics devices in the system, one of which is displaying a framebuffer console that was set up by the boot firmware. In this bug report, the boot console is on an NVIDIA device and the other device is an Intel integrated GPU, but we've also seen this problem on server or workstation systems that have an ASPEED device alongside an NVIDIA one. In this situation, the following sequence happens:

 1. EFI firmware sets a mode on the NVIDIA GPU and uses it for the boot
    console.
 2. The kernel starts, reads the EFI boot payload, and loads efifb.
 3. The kernel calls sysfb_init, which creates a platform device for the efifb
    console on the NVIDIA GPU.
 4. The kernel detects the Intel GPU and loads i915, which registers its own
    framebuffer. Since this is not the boot device,
    remove_conflicting_framebuffers does not find the efifb framebuffer as
    conflicting and does not remove it.

Commit ee7a69aa3 adds a new step 3.5:

 3.5. Call sysfb_disable, which removes the efifb framebuffer console on the
   NVIDIA GPU even though it does not conflict with the i915 framebuffer.

This is what breaks the console on these configurations, regardless of whether or not the NVIDIA driver is installed or loaded.

I'm looking at making the NVIDIA driver install its own framebuffer console in order to work around this problem, but that will take a little while to develop and get it into production. In the meantime, would it make sense to make sysfb_disable a little smarter about which device the boot framebuffer is on and only disable itself if a framebuffer that actually conflicts with it is being enabled? This would also help affected users who choose not to install the NVIDIA driver at all.
Comment 29 Javier Martinez Canillas 2023-01-11 17:39:54 UTC
Daniel posted a patch series that should handle this case:

https://patchwork.kernel.org/project/dri-devel/list/?series=711019
Comment 30 Jani Nikula 2023-05-26 14:36:40 UTC
commit 5ae3716cfdcd286268133867f67d0803847acefc
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Thu Apr 6 15:21:07 2023 +0200

    video/aperture: Only remove sysfb on the default vga pci device