Bug 107651 - Computer Hung by Attempt to Turn-on Non-Existence KBD Light
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: Other
Hardware: All Linux
Importance: P1 high
Assignee: drivers_other
Reported: 2015-11-10 09:57 UTC by Daniel Kian Mc Kiernan
Modified: 2018-12-26 16:00 UTC
Kernel Version: 4.1.3 and up (not present in 4.0.8)
Tree: Fedora
Regression: No


Attachments
attachment-4523-0.html (3.57 KB, text/html)
2018-06-15 01:54 UTC, Harish
attachment-16526-0.html (5.60 KB, text/html)
2018-06-15 08:10 UTC, Harish
log.txt (106.74 KB, text/plain)
2018-07-29 14:24 UTC, Jirka Novak

Description Daniel Kian Mc Kiernan 2015-11-10 09:57:19 UTC
This is Bug 1253523 at bugzilla.redhat.com.

On a Dell Inspiron Mini 1012, and presumably on other machines, systemd attempts to turn on a nonexistent keyboard backlight. This completely hangs the computer during boot-up.

Iván Jiménez pinned down what was happening, and provided a work-around using

     systemctl mask systemd-backlight@leds\:dell\:\:kbd_backlight.service

He notes that the problem is associated with the presence of a file

     /var/lib/systemd/backlight/platform-dell-laptop:leds:dell::kbd_backlight

which he conjectures is created by

     systemd-backlight@.service
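
The relationship between the LED name, the saved-state file, and the unit to be masked can be sketched as follows. This is only a minimal Python sketch built from the two names quoted above; the helper names are hypothetical and are not part of systemd, and it assumes that the backslashes in the systemctl command are merely shell quoting for the colons.

```python
import os

# Directory and LED name as they appear in the paths quoted in this report.
STATE_DIR = "/var/lib/systemd/backlight"
LED_NAME = "dell::kbd_backlight"


def backlight_unit(led_name):
    """Return the systemd-backlight@ instance name for an LED (hypothetical helper)."""
    return f"systemd-backlight@leds:{led_name}.service"


def state_file(led_name, state_dir=STATE_DIR):
    """Return the saved-state path, following the pattern quoted above."""
    return f"{state_dir}/platform-dell-laptop:leds:{led_name}"


def has_saved_state(led_name, state_dir=STATE_DIR):
    """True if a saved-state file for this LED is present."""
    return os.path.exists(state_file(led_name, state_dir))
```

Calling has_saved_state(LED_NAME) on an affected machine would be expected to return True if the file named above is present; backlight_unit(LED_NAME) gives the unit name to pass to systemctl mask.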

At systemd issues report #1792, Lennart Poettering commented “Well, userspace should not be able to make the system freeze like this. If it can do that, that's a kernel bug, and in this case a bug in the dell laptop driver, apparently. Could you please file a bug against that driver in the kernel bugzilla, and explain the situation there? Thanks!” and closed the issue.
Comment 1 Daniel Kian Mc Kiernan 2016-06-01 06:01:56 UTC
It appears that the only good served in reporting this bug was in learning that bug reports are as much ignored here as they are everywhere else that bugzilla is used. 

It's simply a very bad sign when any group of developers use bugzilla; it is typically used to create an illusion of responsiveness.
Comment 2 Daniel Kian Mc Kiernan 2016-06-23 04:21:07 UTC
Mr Jiménez reports that this bug persists with kernel 4.5.5-300.fc24.x86_64.
Comment 3 Daniel Kian Mc Kiernan 2016-10-22 20:12:04 UTC
Ankit Rastogi notes this same problem on the Dell Studio 14Z (with the same work-around).

Jóhann B. Guðmundsson suggests that the problem was introduced with the following commit:

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=6cff8d60aa0aba5583ecda09984dbcb2f24cc28d
Comment 4 Daniel Kian Mc Kiernan 2017-04-12 21:02:18 UTC
FWIW, this bug is still found in kernel-4.10.8-200.fc25.x86_64, and presumably in other recent versions of the kernel, as it is not being worked.
Comment 5 Eric 2017-08-10 19:50:43 UTC
Same issue with newer kernels: 4.4 fails, and 4.9.0.3 fails. It looks like the kernel is trying to include more drivers, but is failing miserably.

CentOS 7 with old 3.13 works without any issue. Old kernel in Debian Jessie works.
Comment 6 Adam Hunt 2017-10-30 18:53:22 UTC
I just hit this bug while installing Debian on an old Dell Mini 1021.

I was finally able to boot the machine successfully by adding "systemd.restore_state=0" to the kernel command line. Once the machine was up and running, masking the systemd-backlight@.service instance for the keyboard backlight seems to have 'masked' the issue.
 
     sudo systemctl mask systemd-backlight@leds\:dell\:\:kbd_backlight.service
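
Whether a given boot actually used that parameter can be checked against the contents of /proc/cmdline. A minimal Python sketch, with a hypothetical helper name; it only looks for the exact "systemd.restore_state=0" form quoted above and ignores any other spelling that systemd might accept.

```python
def restore_state_disabled(cmdline):
    """True if systemd.restore_state=0 is among the kernel command-line arguments."""
    for arg in cmdline.split():
        key, _, value = arg.partition("=")
        if key == "systemd.restore_state" and value == "0":
            return True
    return False


# Typical use on a running system (reads the real kernel command line):
#     with open("/proc/cmdline") as f:
#         print(restore_state_disabled(f.read()))
```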
Comment 7 Daniel Kian Mc Kiernan 2018-03-03 06:06:57 UTC
Because this bug is not being worked, it of course persists in kernel 4.15.x.
Comment 8 Harish 2018-05-27 13:19:15 UTC
I also hit this bug with 4.15.x and 4.16.x kernels on a Dell Latitude E6430.

Surprisingly, when I added the "debug systemd.journald.forward_to_console=1" boot parameters, the bug did not occur and the system booted normally.

Also, I was not able to reproduce this bug consistently.

If I add only the "debug" boot parameter, the chance of hitting this bug is greatly reduced (say, a 50% chance).

If someone can tell me a way to collect logs or any other information about this bug, I can provide that.

Distro: openSUSE Tumbleweed.
Current kernel version: 4.16.11
Comment 9 Daniel Kian Mc Kiernan 2018-05-28 00:01:16 UTC
(In reply to Harish from comment #8)

> If someone can tell me a way to collect logs or any other information about
> this bug, I can provide that

No logging can occur once the bug bites. 

The bug was introduced in this commit: 

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6cff8d60aa0aba5583ecda09984dbcb2f24cc28d 

The developers simply choose not to patch the code.  Bugzilla reports primarily serve as sinks into which bug reports are drained so that developers can do things other than fixing bugs.
Comment 10 Eric 2018-06-14 22:05:23 UTC
I built kernel 4.9.0 from source and excluded drivers/platform/x86/dell-laptop.c.
My Dell laptop is finally able to boot flawlessly. The function keys for adjusting screen brightness and such all work properly, so what is the point of having this dell-laptop.c anyway?

Newer Linux kernels keep adding "nice to have" but buggy drivers. dell-laptop.c is totally broken for Dell laptops, and the only cure is to remove it from the source.
Comment 11 Daniel Kian Mc Kiernan 2018-06-14 23:44:15 UTC
(In reply to Eric from comment #10)
<snip>
> Functional keys to adjust
> screen brightness and such all working properly, what is the point of having
> this dell-laptop.c anyway? 
> 
> Newer Linux kernel keeps adding "nice to have" but buggy drivers.
> dell-laptop.c is totally broken for dell laptop and only cure is to remove
> it from source.

One point of this code is identified by the bug; some Dell laptops have keyboard lights, and the code was meant to activate them, which would be helpful when they were present.  Unfortunately, it seems never to have occurred to the developer to test or otherwise to determine what would happen if the laptop did _not_ have such lights. 

And now we're in the all-too-familiar phase of passive-aggressive refusal to fix a bug.  Developers dig-in their heels when users commit one of two offenses: 

[1] Failing to provide as much information as they can. 

[2] Providing as much information as they can. 

In our case, we committed the second offense.
Comment 12 Eric 2018-06-15 00:27:23 UTC
(In reply to Daniel Kian Mc Kiernan from comment #11)

Agreed. The word "kernel" is defined as "the core of a computer's operating system", but the kbd light should be nowhere near the core Linux system, especially when the code is written in such a way that no error message is displayed and the whole boot process is stopped.

Different distros tweak the kernel here and there, so maybe the best way to solve this is to report it to Debian/Red Hat so that these distros can exclude that buggy file from the build until the bug is fixed.
Comment 13 Eric 2018-06-15 01:43:33 UTC
I just figured out that I don't need to rebuild the kernel every time. Removing the kernel module .ko file alone works fine for me: 

sudo rm /lib/modules/$(uname -r)/kernel/drivers/platform/x86/dell-laptop.ko
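
To confirm after a reboot that the module is really no longer loaded, the module list in /proc/modules can be inspected. A minimal Python sketch with a hypothetical helper name; it assumes the usual /proc/modules layout (module name in the first column) and that the module is listed with an underscore, as dell_laptop.

```python
def module_loaded(proc_modules_text, name="dell_laptop"):
    """Return True if `name` is listed in text read from /proc/modules."""
    wanted = name.replace("-", "_")  # assumed: names are listed with underscores
    for line in proc_modules_text.splitlines():
        fields = line.split()
        # Only the first column is the module name; later columns list dependents.
        if fields and fields[0] == wanted:
            return True
    return False


# Typical use on a running system:
#     with open("/proc/modules") as f:
#         print(module_loaded(f.read()))
```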
Comment 14 Harish 2018-06-15 01:54:22 UTC
Created attachment 276565 [details]
attachment-4523-0.html

Just adding a few points that I have understood about this bug:

* The buggy kernel module is not the only part that causes this bug.
  - There is a systemd service which tries to store & restore backlight state.
  - This systemd service is the actual triggering point of this bug.
  - I couldn't disable this service on my system, since some other service was automatically reverting my changes at each system boot (probably the GNOME session).
  - Finally, I disabled this service by creating a read-only file in /var/lib/systemd/backlight, which prevents this systemd service from saving & restoring backlight state. After that, everything began to work correctly.


Comment 15 Daniel Kian Mc Kiernan 2018-06-15 02:39:05 UTC
(In reply to Harish from comment #14)
<snip>
>   - This systemd service is the actual triggering point of this bug
>   - I couldnt disable this service in my system since some other service
> was automatically reverting my changes at each time system boot ( Probably
> gnome session )
>   - Finally i disabled this service by creating a readonly file in
> /var/lib/systemd/backlight which prevents this systemd service from saving
> & restoring backlight state. after that everything began to work correctly

As I said in my initial bug report above, this matter was raised at systemd issues report #1792, and Lennart Poettering commented “Well, userspace should not be able to make the system freeze like this. If it can do that, that's a kernel bug, and in this case a bug in the dell laptop driver, apparently. Could you please file a bug against that driver in the kernel bugzilla, and explain the situation there? Thanks!” 

Also, please note that (as reported above) Iván Jiménez pinned down what was happening, and provided a work-around using

     systemctl mask systemd-backlight@leds\:dell\:\:kbd_backlight.service

which is a cleaner work-around.
Comment 16 Daniel Kian Mc Kiernan 2018-06-15 02:41:03 UTC
(In reply to Eric from comment #12)
<snip>
> Different distro used to tweak the kernel here and there, maybe best way to
> solve this is to log it in debian/redhat so these distro can exclude that
> buggy file from the build, until the bug is fixed.

The Fedora-based distributions are going to follow the lead of Fedora, whose developers will follow the lead here.
Comment 17 Harish 2018-06-15 08:10:59 UTC
Created attachment 276567 [details]
attachment-16526-0.html

> “Well, userspace should not be able to make the system freeze like this. If
> it can do that, that's a kernel bug, and in this case a bug in the dell
> laptop driver, apparently. Could you please file a bug against that driver
> in the kernel bugzilla, and explain the situation there? Thanks!”

As I described in my previous mail, it was hard to debug this issue because the bug did not occur when logging was enabled.
Still, I will try my best to identify the module associated with this issue and will report it.

-Thanks



Comment 18 Daniel Kian Mc Kiernan 2018-06-15 10:07:42 UTC
(In reply to Harish from comment #17)
<snip> 
> As i described in previous mail, it was had to debug this issue because,
> this bug was not happening when logging enabled.
> Still I will try my best to identify the module associated with this issue
> and will report it

Please review the previous comments.  The module has been explicitly identified as dell-laptop.ko.  The commit in which the bug was introduced has been explicitly identified as well.
Comment 19 Eric 2018-06-15 12:16:48 UTC
(In reply to Harish from comment #17)

I think the bug was certainly introduced by that commit. But I am curious about this observer effect, which may help to narrow down which line of code is responsible. There are only about 1000 lines of changes, and the comments refer to another, official Dell driver library.
Comment 20 Jirka Novak 2018-07-21 15:09:31 UTC
Hello,

I have the same issue on a Dell Precision 3530 with Fedora 28 and kernels 4.17.3 and 4.17.6. There is a small difference between 4.17.3 and 4.17.6: with .3 there is about a 10% probability that the system will boot, whereas with .6 there is no way to boot the system.
I'm able to boot with acpi=off, but that has consequences: no ACPI :-)
When I masked systemd-backlight@leds\:dell\:\:kbd_backlight.service, the issue was resolved.

I'm observing the issue on the new Precision 3530, but I have an older Latitude E6520 on which the same kernels worked fine. I can therefore offer a testing/troubleshooting/regression-testing environment to anyone who wants to test a new patch to fix the issue.
Comment 21 Jirka Novak 2018-07-21 15:47:50 UTC
Hello,

Does anyone know what is really wrong in the dell-laptop.c source? I mean, does it enter an infinite loop because it does not check whether the backlight is supported by the hardware, or is there some kind of kernel lockup, or something else?
I might try to correct it, and knowing this would save me time...
Comment 22 Daniel Kian Mc Kiernan 2018-07-22 00:14:24 UTC
(In reply to Jirka Novak from comment #21)
> 
> does anyone knows what is really wrong in dell-laptop.c source?

Knowing just that would bring one within minutes of a solution. 

In the absence of an emulator of the affected systems (which would be impractical to create), there is no way to log the process. 

What remains is poring over the code that creates the problem. 

Doing so would be a short task for the developer, as he possesses a preëxisting familiarity with the code and would know where to look.  But he just grunts and ignores the issue. 

Anyone else would have to engage in a long and tedious task of learning his or her way around the code in order to deduce where the problem might be.  One or more of the commentors here seems none-the-less to have planned to do as much, but apparently has not been able to give-over as much time as this task would require of persons other than the developer.
Comment 23 Jirka Novak 2018-07-29 14:24:19 UTC
Created attachment 277593 [details]
log.txt

Hello,

  I checked the issue. I added a few messages to dell-laptop.c and
dell-smbios-base.c.
  I found that the kernel hang is caused by the
dell-smbios-base.c/dell_smbios_call() function. This function guards
SMBIOS calls with a mutex and is called from multiple kernel threads
during kernel startup. The mutex works fine, but I found that at some point
one SMBIOS call does not finish, so the mutex stays locked and the kernel hangs.

  The hanging SMBIOS call is the request for the state of the rfkill hwswitch in
dell-laptop.c/dell_rfkill_set(); it is the second call of
dell_send_request() in this function.

  I attached a log from one boot; look for the dell_laptop and dell_smbios
messages. For example, "AA 610 ..." means that the message was generated by
kernel thread 610.
  You can find:
dell_smbios: AA 677 dell_smbios_call 17 11 locked
...
dell_smbios: AA 610 dell_smbios_call 04 11 trylock

  PID 677 gets the lock, calls SMBIOS, and never returns. PID 610 tries to
lock the mutex and hangs there forever.
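
The pattern described above (one caller holding a lock around a call that never returns, and a second caller blocking on the same lock) can be reproduced in user space. The following is only a minimal Python analogy, not the kernel code: all names are hypothetical, threads stand in for the kernel threads, and the second caller uses a timeout purely so that the demonstration itself cannot hang.

```python
import threading


def simulate_stuck_call(timeout=0.1):
    """Return (acquired_while_stuck, acquired_after_release)."""
    mutex = threading.Lock()              # stands in for the mutex around SMBIOS calls
    holder_has_lock = threading.Event()   # set once the first caller owns the mutex
    firmware_returns = threading.Event()  # stays unset while the simulated call hangs

    def stuck_caller():
        # Plays the role of PID 677: takes the lock, then waits on a call
        # that does not return.
        with mutex:
            holder_has_lock.set()
            firmware_returns.wait()

    def second_caller():
        # Plays the role of PID 610, except that it gives up after `timeout`
        # seconds instead of waiting forever.
        if not mutex.acquire(timeout=timeout):
            return False
        mutex.release()
        return True

    holder = threading.Thread(target=stuck_caller, daemon=True)
    holder.start()
    holder_has_lock.wait()

    while_stuck = second_caller()    # the lock is held by the stuck caller
    firmware_returns.set()           # pretend the call finally returned
    holder.join()
    after_release = second_caller()  # the lock is free again
    return while_stuck, after_release
```

In this sketch the second caller fails to get the lock while the first call is stuck and succeeds once it returns. Without the timeout, it would wait indefinitely, which matches the behaviour reported for PID 610.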

  I don't know whether the hang is caused by the kernel or the Dell BIOS. Is it
possible to ask a more skilled developer for help in finding out?

  BTW, I'm not sure that every problem reported in this thread is caused by the
same sequence of calls. It seems odd to me that the same issue would be
observed on such a wide range of Dell hardware for such a long time.

Best regards,

Jirka Novak
