Bug 216303
Summary: | Commit ee7a69aa38d87a3bbced7b8245c732c05ed0c6ec broke legacy frame buffer with NVIDIA | ||
---|---|---|---|
Product: | Drivers | Reporter: | reserv0 |
Component: | Video(Other) | Assignee: | drivers_video-other |
Status: | RESOLVED CODE_FIX | ||
Severity: | normal | CC: | 1700011628, aplattner, arclnx, dburger, hujq, javier, jeremy, jhs, kernel, mail, rhysperry111, svenstaro, ulf.norberg |
Priority: | P1 | ||
Hardware: | i386 | ||
OS: | Linux | ||
URL: | https://github.com/NVIDIA/open-gpu-kernel-modules/issues/341 | ||
Kernel Version: | 5.18.13 | Subsystem: | |
Regression: | No | Bisected commit-id: | |
Attachments: |
Commit revert that fixes the bug
Reverts to un-break v6.0.6 Fix for fbdev and Linux v6.1 |
Description
reserv0
2022-07-29 10:29:46 UTC
CC'ing Javier Martinez Canillas There isn't much we can do here really, the NVIDIA proprietary driver is an out-of-tree module and you should report to them if there are any issues. For this particular problem, they were relying on registered framebuffer devices to bind to the fbcon to have VT support. But that solution is fragile. Instead, is the DRM driver responsibility to register an emulated framebuffer device if fbcon and VT has to be supported. Commit ee7a69aa38d8 ("fbdev: Disable sysfb device registration when removing conflicting FBs") fixed a race since, since otherwise it would allow to register platform devices for firmware provided framebuffers even after a native video driver was registered. @Javier Well, this new code introduces a new bug ruining existing drivers (that always worked beautifully and for many years as far as I am concerned, so they are not as "fragile" as you imply, obviously) to fix another bug in others... I'd say it is inadequate, to say the least. Rejecting the fault on NVIDIA is not really fair, IMHO. At least, you could add a check for the VESA frame buffer usage (disabling that line when it is in use) or add a boot parameter option to disable this "fix"... @Joël Thank you. I back-linked your bug report on NVIDIA's git to this bug report. (In reply to reserv0 from comment #4) > @Javier > Well, this new code introduces a new bug ruining existing drivers (that > always worked beautifully and for many years as far as I am concerned, so That commit is not introducing a bug, what is doing is disabling the sysfb, that registers platform devices that bind to drivers using a firmware provided framebuffer. This includes both efifb (for EFI GOP) and vesafb (for BIOS/VESA). > they are not as "fragile" as you imply, obviously) to fix another bug in The actual fact that are not working after the mentioned commit does show that the approach *was* fragile. Since relied on other drivers to have a fbcon / VT. > others... I'd say it is inadequate, to say the least. Rejecting the fault on > NVIDIA is not really fair, IMHO. > As mentioned. It is the Nvidia driver fault because it should register all the needed interfaces. This includes not only a DRI device for KMS/DRM but also a fbdev, if they want to support fbcon and virtual terminals. That's what all other DRM drivers do, no other driver AFAICT was relying on a different fbdev driver for this. So the real fix would be for the Nvidia driver to register an emulated fbdev as mentioned. > At least, you could add a check for the VESA frame buffer usage (disabling > that line when it is in use) or add a boot parameter option to disable this > "fix"... > I don't know why you think this is only related to EFI and not legacy BIOS/VESA. All drivers that make use of firmware-provided framebuffers should be unregistered when a real video driver takes over the graphics device. @Javier 1.- Until you committed your "fix" (more like a dirty workaround) for the race condition, the frame buffer has been working just fine with all NVIDIA drivers I used in the past 15+ years. 2.- Since your "fix" introduces a regression that breaks *all* existing NVIDIA drivers (including for legacy, no more maintained drivers that still allow to use old cards), this commit should, logically, be reverted until a proper, non breaking fix is found. Or, at the very least, a kill switch should allow to prevent this "fix" to kick in when a NVIDIA driver is in use... A boot parameter, for example... 3.- You assumption following which "the Nvidia driver should register all the needed interfaces" is apparently not clearly a requirement (or was not one in the past), as shows the comment posted here: https://github.com/NVIDIA/open-gpu-kernel-modules/issues/341#issuecomment-1201748789 (In reply to reserv0 from comment #6) > @Javier > > 1.- Until you committed your "fix" (more like a dirty workaround) for the > race condition, the frame buffer has been working just fine with all NVIDIA > drivers I used in the past 15+ years. > Honestly, you don't seem to understand what the problem that commit fixed is, why that change was introduced and why the Nvidia driver stopped working after it. So I'll just stop answering to this bugzilla. But regardless of that, the fact that you consider that a change affecting an out-of-tree module could be considered a mainline regression shows that you also don't understand how development in Linux kernel works. Feel free to read https://www.kernel.org/doc/Documentation/process/stable-api-nonsense.rst if this isn't clear to you. @Javier: I do not care (neither any NVIDIA car owner would) about what problem your so-called "fix" is supposed to solve, since that problem did not impact us. However, I do care for the BUG, *your* change introduced for ALL past (including, legacy ones, that will never be updated any more) and present NVIDIA drivers, suddenly breaking all systems using them. The MINIMUM would be to implement a kill switch for that "fix", so that EXISTING software won't stay BROKEN FOREVER. Linux is suffering from ideological "principles", instead of adopting a pragmatic approach to such issues... And some stupid hostility towards third party software. So SAD ! :-( NVIDIA owners should be reporting problems related to NVIDIA drivers to NVIDIA. @Javier:
> NVIDIA owners should be reporting problems related to NVIDIA drivers to
> NVIDIA.
Perfectly irresponsible reply, thank you!
Congratulation, anyway... I do not think any other kernel developer will be able to brag about breaking that many systems with just a one liner patch to the kernel code!!!
(In reply to reserv0 from comment #10) > @Javier: > > > NVIDIA owners should be reporting problems related to NVIDIA drivers to > > NVIDIA. > Perfectly irresponsible reply, thank you! > > Congratulation, anyway... I do not think any other kernel developer will be > able to brag about breaking that many systems with just a one liner patch to > the kernel code!!! Let me get this right. A for profit company that you bought a product from doesn't have an in-tree driver and contribute to the overall maintenance of the DRM stack and you get angry and call names at the volunteer developer trying to do the right thing from the DRM stack because the off tree driver? The kernel doesn't provide any API guarantees to off tree drivers. This is upstream policy. Upstream is not a product, downstream that provide a patch will not be broken. you are wasting energy on a transient problem caused by your GPU vendor. The solution you are after goes through either waiting for NVIDIA to update their drivers to update your kernel or file a bug against your distro or patch your kernel. Stop alienating volunteer engineers with your petty rants. > you are wasting energy on a transient problem caused by your GPU vendor.
This is total non-sense! Are you a fake news adept? Turning tables and pretending the cause is the effect and vice versa?
YOU are wasting Linux users' time by breaking drivers that worked beautifully for decades. This is NOT NVIDIA which suddenly introduced such a huge a compatibility issue, but a kernel devel one-liner "patch".
As a developer myself, would I introduce such a regression in one of my pieces of software, I would either fix it, provide a workaround for it, or get fired for breaking things that used to work just fine adn for incompetency!
As for Linux "policy" of systematically breaking every driver third party vendors are providing, guess what? It's one of the very reasons why people do not want to waste their time with Linux!
But I'll stop wasting my time with you guys. Keep breaking what does not need fixing, and you will end up killing Linux and all the good wills (yes, including the good wills by hardware vendors).
(In reply to reserv0 from comment #12) > YOU are wasting Linux users' time by breaking drivers that worked > beautifully for decades. This is NOT NVIDIA which suddenly introduced such a > huge a compatibility issue, but a kernel devel one-liner "patch". NVIDIA drivers (and pretty much every off tree driver) break all the time against mainline, NVIDIA (and any off tree driver) has to keep track of the kernel regularly so that drivers keep building. This is common practice over those decades you talk about, you just didn't notice because NVIDIA and downstreams play their part. > As a developer myself, would I introduce such a regression in one of my > pieces of software, I would either fix it, provide a workaround for it, or > get fired for breaking things that used to work just fine adn for > incompetency! Ok. > As for Linux "policy" of systematically breaking every driver third party > vendors are providing, guess what? It's one of the very reasons why people > do not want to waste their time with Linux! Third party vendors know very well that if they want an ever working driver without having to track mainline changes they have to contribute their driver upstream. NVIDIA knows this. You are their customer please talk to them if you have concerns about their upstream engagement. > But I'll stop wasting my time with you guys. Keep breaking what does not > need fixing, and you will end up killing Linux and all the good wills (yes, > including the good wills by hardware vendors). Sure, have a nice day. This is ugly beyond hope. (In reply to Javier Martinez Canillas from comment #2) > There isn't much we can do here really, the NVIDIA proprietary driver is an > out-of-tree module and you should report to them if there are any issues. > > For this particular problem, they were relying on registered framebuffer > devices to bind to the fbcon to have VT support. > > But that solution is fragile. Instead, is the DRM driver responsibility to > register an emulated framebuffer device if fbcon and VT has to be supported. > > Commit ee7a69aa38d8 ("fbdev: Disable sysfb device registration when removing > conflicting FBs") fixed a race since, since otherwise it would allow to > register platform devices for firmware provided framebuffers even after a > native video driver was registered. But I have to say similar problems happen on amd device and Loongson device after 5.18.13 as I know. I cannot tell whether it is caused by this commit, but I have to say many users find tty display problem these day and it is not a nvidia-only problem! (In reply to Jingyuan Deng from comment #15) > (In reply to Javier Martinez Canillas from comment #2) > > There isn't much we can do here really, the NVIDIA proprietary driver is an > > out-of-tree module and you should report to them if there are any issues. > > > > For this particular problem, they were relying on registered framebuffer > > devices to bind to the fbcon to have VT support. > > > > But that solution is fragile. Instead, is the DRM driver responsibility to > > register an emulated framebuffer device if fbcon and VT has to be > supported. > > > > Commit ee7a69aa38d8 ("fbdev: Disable sysfb device registration when > removing > > conflicting FBs") fixed a race since, since otherwise it would allow to > > register platform devices for firmware provided framebuffers even after a > > native video driver was registered. > > But I have to say similar problems happen on amd device and Loongson device > after 5.18.13 as I know. I cannot tell whether it is caused by this commit, > but I have to say many users find tty display problem these day and it is > not a nvidia-only problem! See https://bugzilla.kernel.org/show_bug.cgi?id=216331 Similar problem on amd device (In reply to Jingyuan Deng from comment #17) > (In reply to Jingyuan Deng from comment #15) > > (In reply to Javier Martinez Canillas from comment #2) > > > There isn't much we can do here really, the NVIDIA proprietary driver is > an > > > out-of-tree module and you should report to them if there are any issues. > > > > > > For this particular problem, they were relying on registered framebuffer > > > devices to bind to the fbcon to have VT support. > > > > > > But that solution is fragile. Instead, is the DRM driver responsibility > to > > > register an emulated framebuffer device if fbcon and VT has to be > > supported. > > > > > > Commit ee7a69aa38d8 ("fbdev: Disable sysfb device registration when > > removing > > > conflicting FBs") fixed a race since, since otherwise it would allow to > > > register platform devices for firmware provided framebuffers even after a > > > native video driver was registered. > > > > But I have to say similar problems happen on amd device and Loongson device > > after 5.18.13 as I know. I cannot tell whether it is caused by this commit, > > but I have to say many users find tty display problem these day and it is > > not a nvidia-only problem! > > See https://bugzilla.kernel.org/show_bug.cgi?id=216331 > Similar problem on amd device but does reverting ee7a69aa38d8 ("fbdev: Disable sysfb device registration when removing conflicting FBs") make the issue to go away? It's not clear that's the same problem. Is there a solution on the way? How long I have to patch my kernel by hand? We have located the issue and it's the framebuffer from the BMC. When we blacklist the ast module the kernel boots properly. Full disscusion https://bbs.archlinux.org/viewtopic.php?id=279971 Problem solved for me in Linux v6.0.3. And it's broken again in v6.0.4!!! I have to revert commit 9d69ef1838150c7d87afc1a87aa658c637217585 to make it work... And now, v6.0.6 makes it even worst... In order for the frame buffer console to work with NVIDIA drivers, I now need to revert both 9d69ef1838150c7d87afc1a87aa658c637217585 and 25a6688f27ff54f97adf7cce1d7e18c38bf51eb4... Could we please *stop this madness* and revert the code (that worked just fine fore over 15 years) to what it was before it got totally and irremediably broken, i.e. before the *regression* introduced in v5.18.13? Obviously, that code change which was supposed to fix *another* bug is simply inadequate, so let's revert it all until a *proper fix* is found for that other bug, that won't break something that has always worked fine so far! Created attachment 303105 [details]
Reverts to un-break v6.0.6
Added the patch I am using to un-break v6.0.6 (and return to the v6.0.3 code that did work).
I also suffer from this issue and have been logging in on a black console for a while now. By preventing the driver for my integrated graphics (i915) from loading (as suggested in #20) I do get a console (using bootparam i915.modeset=0 gives similar results). But I need the integrated "graphics", i915, for sound through HDMI. Any suggestions on how to solve this? And v6.1 is now broken beyond repair... The *only* solutions left to me now, if I want to run a recent Linux kernel, is to disable entirely the i915 driver, either at kernel configure time, or in the kernel blacklist option, or in the BIOS (by disabling the iGPU there)... It means I cannot use any more the iGPU with a second monitor or for OpenCL computing! >:-( Created attachment 303457 [details]
Fix for fbdev and Linux v6.1
I had another look at v6.1 and found out that all what was needed was to remove the call to sysfb_disable() in drivers/video/aperture.c to restore full frame buffer functionality with both the Intel iGPU and the NVIDIA GPU active.
Patch attached.
I've been looking at this from the NVIDIA side. If I understand the regression commit ee7a69aa3 correctly, the original problem it solves is a race between sysfb_init registering a platform device for the system boot console, and a drm driver calling into do_remove_conflicting_framebuffers: if sysfb_init is called before drm initializes, then remove_conflicting_framebuffers will remove the platform device sysfb adds. But if drm initializes before sysfb_init is called, then there will be no platform device to remove and then sysfb will still add one later. Is that a correct understanding of the motivation behind the new call to sysfb_disable in this change? To recap this bug, the problem here is that there are two graphics devices in the system, one of which is displaying a framebuffer console that was set up by the boot firmware. In this bug report, the boot console is on an NVIDIA device and the other device is an Intel integrated GPU, but we've also seen this problem on server or workstation systems that have an ASPEED device alongside an NVIDIA one. In this situation, the following sequence happens: 1. EFI firmware sets a mode on the NVIDIA GPU and uses it for the boot console. 2. The kernel starts, reads the EFI boot payload, and loads efifb. 3. The kernel calls sysfb_init, which creates a platform device for the efifb console on the NVIDIA GPU. 4. The kernel detects the Intel GPU and loads i915, which registers its own framebuffer. Since this is not the boot device, remove_conflicting_framebuffers does not find the efifb framebuffer as conflicting and does not remove it. Commit ee7a69aa3 adds a new step 3.5: 3.5. Call sysfb_disable, which removes the efifb framebuffer console on the NVIDIA GPU even though it does not conflict with the i915 framebuffer. This is what breaks the console on these configurations, regardless of whether or not the NVIDIA driver is installed or loaded. I'm looking at making the NVIDIA driver install its own framebuffer console in order to work around this problem, but that will take a little while to develop and get it into production. In the meantime, would it make sense to make sysfb_disable a little smarter about which device the boot framebuffer is on and only disable itself if a framebuffer that actually conflicts with it is being enabled? This would also help affected users who choose not to install the NVIDIA driver at all. Daniel posted a patch series that should handle this case: https://patchwork.kernel.org/project/dri-devel/list/?series=711019 commit 5ae3716cfdcd286268133867f67d0803847acefc Author: Daniel Vetter <daniel.vetter@ffwll.ch> Date: Thu Apr 6 15:21:07 2023 +0200 video/aperture: Only remove sysfb on the default vga pci device |