Bug 60852 - Random memory corruption reported after writing to /sys/module/drm_kms_helper/parameters/edid_firmware
Summary: Random memory corruption reported after writing to /sys/module/drm_kms_helper...
Status: RESOLVED CODE_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: Video(DRI - Intel) (show other bugs)
Hardware: x86-64 Linux
: P3 normal
Assignee: Imre Deak
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2013-09-05 00:05 UTC by Alex Villacis Lasso
Modified: 2014-09-24 12:03 UTC (History)
2 users (show)

See Also:
Kernel Version: 3.11
Subsystem:
Regression: No
Bisected commit-id:


Attachments
dmesg output showing the memory corruption report (59.22 KB, text/plain)
2013-09-05 00:05 UTC, Alex Villacis Lasso
Details
Configuration file used to build the kernel (128.24 KB, text/plain)
2013-09-05 00:05 UTC, Alex Villacis Lasso
Details

Description Alex Villacis Lasso 2013-09-05 00:05:10 UTC
Created attachment 107420 [details]
dmesg output showing the memory corruption report

Distribution: Fedora 18 x86_64
Kernel Version: 3.11

My machine has a flat panel display that does not report EDID correctly to my video card. I found the edid_firmware parameter to drm_kms_helper, and I tried the following line in rc.local:

echo -n edid/1280x1024.bin > /sys/module/drm_kms_helper/parameters/edid_firmware

Sometimes, nothing bad happens. Other times, I get a memory corruption backtrace like the one in the attached dmesg report.
Comment 1 Alex Villacis Lasso 2013-09-05 00:05:58 UTC
Created attachment 107421 [details]
Configuration file used to build the kernel
Comment 2 Chris Wilson 2013-09-05 10:38:10 UTC
You need to set that parameter on the command line for it to have effect for LVDS. I guess no one checked whether changing it on the fly was safe...
Comment 3 Jani Nikula 2013-09-05 13:54:19 UTC
(In reply to Chris Wilson from comment #2)
> You need to set that parameter on the command line for it to have effect for
> LVDS. I guess no one checked whether changing it on the fly was safe...

Hmm, it looked like a list corruption in the firmware loader to me. Which shouldn't happen regardless of whether it's safe to set edid_firmware on the fly.

Alex, please see if you can reproduce the problem with this patch:

diff --git a/drivers/gpu/drm/drm_edid_load.c b/drivers/gpu/drm/drm_edid_load.c
index 271b42b..0a54bdd7 100644
--- a/drivers/gpu/drm/drm_edid_load.c
+++ b/drivers/gpu/drm/drm_edid_load.c
@@ -152,7 +152,7 @@ static u8 *edid_load(struct drm_connector *connector, const char *name,
 		goto out;
 	}
 
-	err = request_firmware(&fw, name, &pdev->dev);
+	err = 1;
 	platform_device_unregister(pdev);
 
 	if (err) {
@@ -231,7 +231,6 @@ static u8 *edid_load(struct drm_connector *connector, const char *name,
 	    name, connector_name);
 
 relfw_out:
-	release_firmware(fw);
 
 out:
 	if (err)
Comment 4 Jani Nikula 2013-09-05 14:01:11 UTC
Side note, it seems wasteful to do the full firmware loading dance of checking the firmware cache, checking the kernel builtin firmware blobs, trying to load the firmware directly from kernel, asking the user space helper to please give us the firmware, and finally checking the DRM built-in EDID blobs... every time 
drm_helper_probe_single_connector_modes() gets run with the edid_firmware parameter set.
Comment 5 Alex Villacis Lasso 2013-09-05 14:14:17 UTC
(In reply to Chris Wilson from comment #2)
> You need to set that parameter on the command line for it to have effect for
> LVDS. I guess no one checked whether changing it on the fly was safe...

In mainline kernel with Fedora 18, if I set the kernel parameter, the kernel hangs for more than a minute before mounting the root filesystem and booting normally. The distro kernel does not hang (3.10.10), but I already downloaded the source RPM and cannot find which patch, if any, solves this.
Comment 6 Jani Nikula 2013-09-05 14:28:15 UTC
(In reply to Alex Villacis Lasso from comment #5)
> (In reply to Chris Wilson from comment #2)
> > You need to set that parameter on the command line for it to have effect
> for
> > LVDS. I guess no one checked whether changing it on the fly was safe...
> 
> In mainline kernel with Fedora 18, if I set the kernel parameter, the kernel
> hangs for more than a minute before mounting the root filesystem and booting
> normally. The distro kernel does not hang (3.10.10), but I already
> downloaded the source RPM and cannot find which patch, if any, solves this.

Wild guess, the distro kernel has CONFIG_FW_LOADER_USER_HELPER=n.
Comment 7 Alex Villacis Lasso 2013-09-05 15:52:28 UTC
It does. I thought I made a point of basing the compiled kernel config off the distro kernel config except in the places I wanted to change, and I did not intend to set CONFIG_FW_LOADER_USER_HELPER in my config. If I set CONFIG_FW_LOADER_USER_HELPER=n in my config and recompile, could that influence whether the bug might trigger, even before applying the debug patch?
Comment 8 Jani Nikula 2013-09-06 04:33:44 UTC
(In reply to Alex Villacis Lasso from comment #7)
> It does. I thought I made a point of basing the compiled kernel config off
> the distro kernel config except in the places I wanted to change, and I did
> not intend to set CONFIG_FW_LOADER_USER_HELPER in my config. If I set
> CONFIG_FW_LOADER_USER_HELPER=n in my config and recompile, could that
> influence whether the bug might trigger, even before applying the debug
> patch?

With CONFIG_FW_LOADER_USER_HELPER=n the kernel doesn't try to ask the user space to supply the firmware (=edid in this case). Which also means there's no 60 second timeout for the userspace reply which apparently doesn't happen. And, *if* the problem is in the firmware loader code, that might vastly influence whether the bug might trigger.

Now, the 60 second timeout isn't this bug. (It might be considered a different bug.) This bug is the list corruption you see. All else being equal, if you can reproduce it with the patch, it's in DRM. If the patch makes the problem go away, it's in the firmware loader. So we'd appreciate you trying the debug patch, thanks. :)
Comment 9 Alex Villacis Lasso 2013-09-09 02:24:15 UTC
I have compiled a 3.11 kernel with CONFIG_FW_LOADER_USER_HELPER=y, with the patch applied, and without the edid_firmware boot parameter. After three boots and manually writing the string to the sysfs file, I am yet unable to trigger the bug. However, I did notice that the bug was random, so I will keep testing.
Comment 10 Jani Nikula 2013-09-09 06:32:42 UTC
(In reply to Alex Villacis Lasso from comment #9)
> I have compiled a 3.11 kernel with CONFIG_FW_LOADER_USER_HELPER=y, with the
> patch applied, and without the edid_firmware boot parameter. After three
> boots and manually writing the string to the sysfs file, I am yet unable to
> trigger the bug. However, I did notice that the bug was random, so I will
> keep testing.

Please also try with the rc.local approach you had earlier (just to try to reproduce the bug; this is not the recommended approach for regular use).
Comment 11 Jani Nikula 2013-09-09 06:41:30 UTC
Alternatively, please try this patch on top of 3.11, with CONFIG_FW_LOADER_USER_HELPER=y and the rc.local thing:

diff --git a/drivers/gpu/drm/drm_edid_load.c b/drivers/gpu/drm/drm_edid_load.c
index 271b42b..7a0a1d3 100644
--- a/drivers/gpu/drm/drm_edid_load.c
+++ b/drivers/gpu/drm/drm_edid_load.c
@@ -155,6 +155,13 @@ static u8 *edid_load(struct drm_connector *connector, const
        err = request_firmware(&fw, name, &pdev->dev);
        platform_device_unregister(pdev);
 
+       if (err)
+               goto out;
+       else {
+               err = -EINVAL;
+               goto relfw_out;
+       }
+
        if (err) {
                i = 0;
                while (i < GENERIC_EDIDS && strcmp(name, generic_edid_name[i]))

If you can reproduce the issue with this, the problem is in the firmware loader.
Comment 12 Alex Villacis Lasso 2013-09-10 14:10:04 UTC
(In reply to Jani Nikula from comment #11)
> Alternatively, please try this patch on top of 3.11, with
> CONFIG_FW_LOADER_USER_HELPER=y and the rc.local thing:
> 
> diff --git a/drivers/gpu/drm/drm_edid_load.c
> b/drivers/gpu/drm/drm_edid_load.c
> index 271b42b..7a0a1d3 100644
> --- a/drivers/gpu/drm/drm_edid_load.c
> +++ b/drivers/gpu/drm/drm_edid_load.c
> @@ -155,6 +155,13 @@ static u8 *edid_load(struct drm_connector *connector,
> const
>         err = request_firmware(&fw, name, &pdev->dev);
>         platform_device_unregister(pdev);
>  
> +       if (err)
> +               goto out;
> +       else {
> +               err = -EINVAL;
> +               goto relfw_out;
> +       }
> +
>         if (err) {
>                 i = 0;
>                 while (i < GENERIC_EDIDS && strcmp(name,
> generic_edid_name[i]))
> 
> If you can reproduce the issue with this, the problem is in the firmware
> loader.

Not exactly what I expected. I recompiled the test kernel with this patch. The boot process ran normally, except that there was no backtrace at that moment, and the screen resolution failed to change. However, after a few hours, I attempted to shut down, and then the system hung on the shutdown splash screen. The keyboard was unresponsive. I could not even dismiss the splash screen to check whether there was a backtrace behind.
Comment 13 Jani Nikula 2014-08-14 11:34:38 UTC
Please retry recent kernels and report back. There's been some changes in the area. Thanks.
Comment 14 Jani Nikula 2014-09-12 09:35:44 UTC
Ping for the re-test.
Comment 15 Jani Nikula 2014-09-24 12:03:18 UTC
Timeout, closing. Assumed fixed by some of the changes in the edid firmware loader. Please reopen if the problem reappears with recent kernels.

Note You need to log in before you can comment on or make changes to this bug.