Bug 206365 - kernel NULL pointer dereference when charger is unplugged
Summary: kernel NULL pointer dereference when charger is unplugged
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: USB (show other bugs)
Hardware: x86-64 Linux
: P1 high
Assignee: Heikki Krogerus
URL:
Keywords:
: 206633 206635 (view as bug list)
Depends on:
Blocks:
 
Reported: 2020-01-31 08:22 UTC by Andrea Gagliardi La Gala
Modified: 2020-04-02 13:56 UTC (History)
2 users (show)

See Also:
Kernel Version: 5.3.4-300.fc31.x86_64
Tree: Mainline
Regression: No


Attachments
lshw output (94.35 KB, text/plain)
2020-01-31 08:22 UTC, Andrea Gagliardi La Gala
Details
test (1.15 KB, patch)
2020-02-13 13:17 UTC, Heikki Krogerus
Details | Diff
dp_removal_fix.diff (446 bytes, patch)
2020-02-17 15:17 UTC, Heikki Krogerus
Details | Diff
dmesg showing the oops still happening with 5.5.10 + 2 patches (from Red Hat bugzilla 1785972) (113.69 KB, text/plain)
2020-03-20 10:16 UTC, Hans de Goede
Details
Trace after null deref oops in Fedora 5.5.5 kernel with 2 ucsi fixes included (2.32 KB, text/plain)
2020-03-20 10:17 UTC, Hans de Goede
Details
5.5.13-arch1 dmesg (81.28 KB, text/plain)
2020-03-26 15:46 UTC, vincent.datrier
Details

Description Andrea Gagliardi La Gala 2020-01-31 08:22:53 UTC
Created attachment 287039 [details]
lshw output

Every time I unplug my charger, I get a oops about a NULL pointer dereference. The computer then slows down to a crawl, up until it fully freezes. ABRT tries to get a report, but never gets to write any data. I installed kdump and forced the panic on oops setting to get as much data as possible. I initially thought that tlp was the culprit; deactivating it yielded no improvements. Then, reading dmesg while troubleshooting I noticed lockdown was blocking direct writes to registers. I disabled Secure Boot but the problem still is there.
Comment 1 Andrea Gagliardi La Gala 2020-01-31 08:27:16 UTC
1. Did it work previously in Fedora? If so, what kernel version did the issue
   *first* appear?  Old kernels are available for download at
   https://koji.fedoraproject.org/koji/packageinfo?packageID=8 :

I only tried Fedora 30 and 31; both have the issue. Interestingly, I used Manjaro before and never had this specific problem.


2. Can you reproduce this issue? If so, please provide the steps to reproduce
   the issue below:

Boot on AC power, unplug the charger. ``dmesg -w`` shows the oops the second I unplug the charger.


3. Does this problem occur with the latest Rawhide kernel? To install the
   Rawhide kernel, run ``sudo dnf install fedora-repos-rawhide`` followed by
   ``sudo dnf update --enablerepo=rawhide kernel``:

I cannot download it; dnf complains about keys for rawhide not matching the repo (signing with Fedora 32 keys).


4. Are you running any modules that not shipped with directly Fedora's kernel?:

I enabled Intel's GuC firmware loading but disabled it, thinking this issue appeared after this modification. With the kernel completely untainted the issue remains.


5. Please attach the kernel logs. You can get the complete kernel log
   for a boot with ``journalctl --no-hostname -k > dmesg.txt``. If the
   issue occurred on a previous boot, use the journalctl ``-b`` flag.


Here's the oops:

[ 1110.200260] BUG: kernel NULL pointer dereference, address: 0000000000000080
[ 1110.200268] #PF: supervisor read access in kernel mode
[ 1110.200271] #PF: error_code(0x0000) - not-present page
[ 1110.200274] PGD 0 P4D 0 
[ 1110.200281] Oops: 0000 [#1] SMP PTI
[ 1110.200288] CPU: 0 PID: 12998 Comm: kworker/0:0 Kdump: loaded Not tainted 5.3.4-300.fc31.x86_64 #1
[ 1110.200291] Hardware name: ASUSTeK COMPUTER INC. UX370UAR/UX370UAR, BIOS UX370UAR.310 04/17/2019
[ 1110.200303] Workqueue: events ucsi_connector_change [typec_ucsi]
[ 1110.200312] RIP: 0010:ucsi_displayport_remove_partner+0xa/0x20 [typec_ucsi]
[ 1110.200318] Code: 38 00 c7 43 28 00 00 00 00 48 83 c7 10 5b e9 2d 9b 01 d3 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 0f 1f 44 00 00 48 85 ff 74 0f <48> 8b 47 78 48 c7 00 00 00 00 00 c6 40 3d 00 c3 66 0f 1f 44 00 00
[ 1110.200322] RSP: 0018:ffff9e2d84ec7df8 EFLAGS: 00010202
[ 1110.200326] RAX: 0000000000000008 RBX: ffff90fe43b70170 RCX: 00000000000056d3
[ 1110.200329] RDX: 00000000000056d2 RSI: 0000000000000001 RDI: 0000000000000008
[ 1110.200332] RBP: 0000000000000000 R08: ffffffff94528880 R09: ffff9e2d84ec7cf0
[ 1110.200334] R10: ffff90fecc3be1ff R11: 0000000000000000 R12: ffff90fe43b70170
[ 1110.200337] R13: 0000000000000001 R14: ffff90fe43b702c0 R15: ffff90fe43b70038
[ 1110.200341] FS:  0000000000000000(0000) GS:ffff90fe4ea00000(0000) knlGS:0000000000000000
[ 1110.200344] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1110.200348] CR2: 0000000000000080 CR3: 000000038e40a001 CR4: 00000000003606f0
[ 1110.200350] Call Trace:
[ 1110.200361]  ucsi_unregister_altmodes+0x7b/0x90 [typec_ucsi]
[ 1110.200370]  ucsi_unregister_partner.part.0+0x13/0x30 [typec_ucsi]
[ 1110.200377]  ucsi_connector_change+0x247/0x340 [typec_ucsi]
[ 1110.200389]  process_one_work+0x19d/0x340
[ 1110.200397]  worker_thread+0x50/0x3b0
[ 1110.200404]  kthread+0xfb/0x130
[ 1110.200411]  ? process_one_work+0x340/0x340
[ 1110.200416]  ? kthread_park+0x80/0x80
[ 1110.200425]  ret_from_fork+0x35/0x40
[ 1110.200431] Modules linked in: squashfs zstd_decompress loop rfcomm fuse ccm xt_CHECKSUM xt_MASQUERADE nf_nat_tftp nf_conntrack_tftp tun bridge stp llc nf_conntrack_netbios_ns nf_conntrack_broadcast xt_CT ip6t_REJECT nf_reject_ipv6 ip6t_rpfilter ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat ebtable_broute ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat iptable_mangle iptable_raw iptable_security nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nfnetlink ebtable_filter ebtables cmac ip6table_filter ip6_tables iptable_filter bnep sunrpc vfat fat uvcvideo btusb btrtl btbcm btintel videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 bluetooth videobuf2_common videodev mc ecdh_generic ecc snd_soc_skl snd_soc_hdac_hda snd_hda_ext_core snd_soc_skl_ipc joydev snd_soc_sst_ipc iwlmvm snd_hda_codec_hdmi snd_soc_sst_dsp snd_soc_acpi_intel_match snd_soc_acpi x86_pkg_temp_thermal intel_powerclamp coretemp mac80211 snd_soc_core snd_hda_codec_realtek
[ 1110.200494]  snd_hda_codec_generic ledtrig_audio snd_compress spi_pxa2xx_platform kvm_intel ac97_bus mei_hdcp dw_dmac iTCO_wdt hid_multitouch snd_pcm_dmaengine snd_hda_intel typec_displayport iTCO_vendor_support libarc4 intel_rapl_msr snd_hda_codec gpio_keys kvm iwlwifi snd_hda_core snd_hwdep irqbypass snd_seq intel_cstate intel_uncore snd_seq_device snd_pcm intel_rapl_perf cfg80211 asus_nb_wmi hid_sensor_accel_3d snd_timer asus_wmi sparse_keymap wmi_bmof hid_sensor_trigger snd hid_sensor_iio_common industrialio_triggered_buffer mei_me kfifo_buf soundcore intel_xhci_usb_role_switch rfkill i2c_i801 mei industrialio roles idma64 processor_thermal_device ucsi_acpi typec_ucsi intel_rapl_common cros_ec_ishtp intel_lpss_pci cros_ec_core intel_lpss intel_soc_dts_iosf intel_pch_thermal typec int3403_thermal int340x_thermal_zone soc_button_array int3400_thermal asus_wireless acpi_pad acpi_thermal_rel binfmt_misc ip_tables xfs libcrc32c dm_crypt hid_sensor_hub intel_ishtp_loader intel_ishtp_hid i915
[ 1110.200553]  i2c_algo_bit drm_kms_helper crct10dif_pclmul crc32_pclmul crc32c_intel drm nvme ghash_clmulni_intel serio_raw nvme_core intel_ish_ipc intel_ishtp wmi i2c_hid video pinctrl_sunrisepoint pinctrl_intel
[ 1110.200575] CR2: 0000000000000080

I attached my lshw output and vmcore-dmesg.
Comment 2 Andrea Gagliardi La Gala 2020-01-31 08:28:25 UTC
Seems to happen in particular on ASUS laptops, eg. UX370UAR and UX391UA.
Comment 3 Greg Kroah-Hartman 2020-01-31 08:30:37 UTC
On Fri, Jan 31, 2020 at 08:22:53AM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=206365
> 
>             Bug ID: 206365
>            Summary: kernel NULL pointer dereference when charger is
>                     unplugged
>            Product: Drivers
>            Version: 2.5
>     Kernel Version: 5.3.4-300.fc31.x86_64

5.3 is quite old now, please try a 5.4 kernel or newer.
Comment 4 Hans de Goede 2020-02-02 08:41:30 UTC
Note this was orignally reported in Fedora's bugzilla:
https://bugzilla.redhat.com/show_bug.cgi?id=1762031

I asked the reporter to file a bug here so that Heikki can take a look at it.
Comment 5 Heikki Krogerus 2020-02-03 13:34:33 UTC
Can you please attach trace output from the driver:

        # Unload all UCSI modules
        modprobe -r ucsi_acpi

        # At this point you should plug-in the problematic device

        # Reload the UCSI core module
        modprobe typec_ucsi

        # Enable UCSI tracing
        echo 1 > /sys/kernel/debug/tracing/events/ucsi/enable

        # Now reload the ACPI glue driver
        modprobe ucsi_acpi

        # Unplug the problematic device so that you see the error

        # Finally dump the trace output
        cat /sys/kernel/debug/tracing/trace
Comment 6 Hans de Goede 2020-02-09 15:30:35 UTC
Heikki,

This seems to be impacting quite a few users (in a few different variations), Fedora has received multiple bug reports about this:
https://bugzilla.redhat.com/show_bug.cgi?id=1762031
https://bugzilla.redhat.com/show_bug.cgi?id=1785972
https://bugzilla.redhat.com/show_bug.cgi?id=1798810
https://bugzilla.redhat.com/show_bug.cgi?id=1800913

These show 3 different ucsi related backtraces:

https://retrace.fedoraproject.org/faf/reports/2828061/  :
 ucsi_notify+0x105/0x110 [typec_ucsi]
 process_one_work+0x1b5/0x360
 worker_thread+0x50/0x3c0

https://retrace.fedoraproject.org/faf/reports/bthash/6c143da44e62f214a2018303d35cb7fc42c873d1
 ucsi_unregister_altmodes+0x7b/0x90 [typec_ucsi]
 ucsi_unregister_partner.part.0+0x13/0x30 [typec_ucsi]
 ucsi_connector_change+0x247/0x340 [typec_ucsi]
 process_one_work+0x19d/0x380
 worker_thread+0x50/0x3b0

https://retrace.fedoraproject.org/faf/reports/bthash/7dfd85f0d1be45835314ad403618a3a1a7c07a1f
 ucsi_notify+0x105/0x110 [typec_ucsi]
 acpi_ev_notify_dispatch+0x45/0x5a
 acpi_os_execute_deferred+0x16/0x20
 process_one_work+0x1b5/0x360
 worker_thread+0x50/0x3c0

I've asked all Fedora bug-reporters to add themselves to the Cc here and to provide the info you have requested in comment 5.
Comment 7 Heikki Krogerus 2020-02-13 13:17:28 UTC
Created attachment 287349 [details]
test

I received the logs I requested on the usb-linux ml, and I think I can see the race. This is the fix I proposed.
Comment 8 Hans de Goede 2020-02-14 09:43:32 UTC
Thanks for the patch Heikki, I've created a (test) build of a Fedora kernel with the patch added and I've asked all reports (see the bug links above) to test this kernel.
Comment 9 Hans de Goede 2020-02-15 20:30:37 UTC
Heikki, I've gotten some feedback on the patch, it seems that it does not help (or at least does not cover all use cases). The reporter testing the patch has been following your instructions from comment 5. One interesting datapoint is that "modprobe -r ucsi_acpi" as per the debug instruction oopses itself.

There are logs of the oops on rmmod as wel as the requested tracesin the Fedora bug:
https://bugzilla.redhat.com/show_bug.cgi?id=1785972

This is a public bug so you should be able to view it and the attachments.
Comment 10 Heikki Krogerus 2020-02-17 15:17:45 UTC
Created attachment 287439 [details]
dp_removal_fix.diff

Because there are several issues here, each fix should tackle individual problem. This next one I'm hoping will take care of one of the NULL pointer dereference issues when removing the alternate modes.
Comment 11 Hans de Goede 2020-02-21 19:02:57 UTC
Thanks for the new fix, I've done a new test kernel-build for Fedora and asked the reporters of the various Fedora bugs to test the new test-build.
Comment 12 Hans de Goede 2020-02-23 13:50:37 UTC
Good news, it seems that the last patch (combined with the original "test" patch) fixes this. Quoting from: https://bugzilla.redhat.com/show_bug.cgi?id=1785972

"I had been able to reproduce this by

* plug external monitor
* unplug external monitor
* wait a few seconds
* watch kernel go up in flames

No crash after many cycles now."
Comment 13 Jet Ong 2020-02-24 00:31:47 UTC
*** Bug 206635 has been marked as a duplicate of this bug. ***
Comment 14 Jet Ong 2020-02-24 00:32:43 UTC
*** Bug 206633 has been marked as a duplicate of this bug. ***
Comment 15 Hans de Goede 2020-03-14 14:17:31 UTC
I'm afraid that the 2 patches attached here and recently submitted upstream for this are not a complete fix. At least one Fedora user is still seeing issues with a kernel with those 2 patches added.

He has collected the extra info requested in comment 5 here and attached those to the bugzilla.redhat.com bug where I've provided the test-kernel to the user:

https://bugzilla.redhat.com/show_bug.cgi?id=1785972

This is a public bug, so you should be able to get the info directly from the bug if not let me know.

The user who is still seeing problems reports that when enabling tracing 2 plug + unplug cycles are necessary to get the oops rather then one, so this likely is a timing issue?

Here is a dmesg from the user from after the oops after the second unplug:
https://bugzilla.redhat.com/attachment.cgi?id=1665271

And here is the requested trace output:
https://bugzilla.redhat.com/attachment.cgi?id=1665272

Let me know if you have trouble getting these from bugzilla.redhat.com, then I will attach them here.

Hmm, looking at the oops it seems that the second patch here should fix it ?

I will add the 2 patches to the Fedora kernels and ask people to test again.
Comment 16 vincent.datrier 2020-03-20 10:02:46 UTC
(In reply to Heikki Krogerus from comment #5)
> Can you please attach trace output from the driver:
> 
>         # Unload all UCSI modules
>         modprobe -r ucsi_acpi
> 
>         # At this point you should plug-in the problematic device
> 
>         # Reload the UCSI core module
>         modprobe typec_ucsi
> 
>         # Enable UCSI tracing
>         echo 1 > /sys/kernel/debug/tracing/events/ucsi/enable
> 
>         # Now reload the ACPI glue driver
>         modprobe ucsi_acpi
> 
>         # Unplug the problematic device so that you see the error
> 
>         # Finally dump the trace output
>         cat /sys/kernel/debug/tracing/trace

Hi Heikki,

I was one of the reporters of this bug. I since switched to Solus and can't figure out how to patch the included kernel; I'm migrating (again!) to Arch this time so I can play with the patches.

# tracer: nop
#
# entries-in-buffer/entries-written: 3/3   #P:8
#
#                              _-----=> irqs-off
#                             / _----=> need-resched
#                            | / _---=> hardirq/softirq
#                            || / _--=> preempt-depth
#                            ||| /     delay
#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
#              | |       |   ||||       |         |
     kworker/5:2-279   [005] ....    91.833930: ucsi_register_altmode: partner alt mode: svid ff01, mode 1 vdo 1
     kworker/5:2-279   [005] ....    91.873504: ucsi_register_port: port0 status: change=0000, opmode=3, connected=1, sourcing=0, partner_flags=1, partner_type=1, request_data_obj=430384e1, BC status=0
     kworker/5:2-279   [005] ....    91.925506: ucsi_register_port: port1 status: change=0000, opmode=1, connected=0, sourcing=1, partner_flags=1, partner_type=2, request_data_obj=00000000, BC status=0

Hope this helps! I'll try your patches as soon as I have my new setup running on this laptop.
Comment 17 Hans de Goede 2020-03-20 10:16:02 UTC
Created attachment 287981 [details]
dmesg showing the oops still happening with 5.5.10 + 2 patches (from Red Hat bugzilla 1785972)

Hi Heikki,

I've added the 2 patches from this bug (in the form you submitted them upstream) to Fedora's 5.5.10 kernel. Unfortunately the NULL deref oops is still happening, see the attached dmesg. So it looks like we are still missing a check somewhere. This is being tracked in a public bug in Red Hat's bugzilla here:
https://bugzilla.redhat.com/show_bug.cgi?id=1785972

I'm attaching the dmesg from the user reporting this still happening with 5.5.10 + the 2 patches here.

Note the same user earlier tested a test-build of a Fedora 5.5.x kernel with the 2 patches added and also collected the traces you requested. You should be able to access those directly from RH's bugzilla, but I will also attach the trace here.

Regards,

Hans
Comment 18 Hans de Goede 2020-03-20 10:17:23 UTC
Created attachment 287983 [details]
Trace after null deref oops in Fedora 5.5.5 kernel with 2 ucsi fixes included
Comment 19 Heikki Krogerus 2020-03-24 12:34:40 UTC
Thanks Hans.

From the trace I can see there is still something odd going on. The firmware is reporting four modes for the DisplayPort? The DisplayPort Alt Mode standard defines only one mode.

That could be something that is causing the problem... or not. I need to stare at the code again (I'm still not able to reproduce this).
Comment 20 vincent.datrier 2020-03-26 15:46:19 UTC
Created attachment 288077 [details]
5.5.13-arch1 dmesg

Hey Heikki,

I tried your two patches, as they are now in the newly released Arch Linux 5.5.13
kernel. I still get an oops when unplugging, except the behavior seems more sporadic now. I cannot pinpoint a specific scenario where I did not get an oops, however; it mostly oopses, but sometimes it does not anymore.

My trace is the same as the one I posted about a week ago; there is no record of the charger getting unplugged. Attached is the complete latest dmesg I recovered.

If you need me to do any test, let me know! With the current world events I have plenty of time to tinker.

Thanks for all your hard work!
Comment 21 Heikki Krogerus 2020-04-02 09:42:19 UTC
I'm going to buy the charger or dock so I can reproduce this.

Can you tell me the exact model of your charger or dock?
Comment 22 vincent.datrier 2020-04-02 13:56:04 UTC
Heikki,

I'm not certain it comes from the charger; two wholly different chargers yield the same result, and ditto with a USB-C stick. Here's the trace with a USB stick:

# tracer: nop
#
# entries-in-buffer/entries-written: 3/3   #P:8
#
#                              _-----=> irqs-off
#                             / _----=> need-resched
#                            | / _---=> hardirq/softirq
#                            || / _--=> preempt-depth
#                            ||| /     delay
#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
#              | |       |   ||||       |         |
     kworker/3:1-13119 [003] .... 19393.338874: ucsi_register_port: port0 status: change=0000, opmode=1, connected=0, sourcing=1, partner_flags=1, partner_type=2, request_data_obj=00000000, BC status=0
     kworker/3:1-13119 [003] .... 19393.462172: ucsi_register_altmode: partner alt mode: svid ff01, mode 1 vdo 1
     kworker/3:1-13119 [003] .... 19393.521871: ucsi_register_port: port1 status: change=0000, opmode=4, connected=1, sourcing=1, partner_flags=1, partner_type=2, request_data_obj=00000000, BC status=0

I don't have access to the second charger, it's at work which I cannot go to anymore, but the one I have here is the stock ASUS ADP-45EW C rev. A01. The USB stick is a SanDisk dual-port USB-A/USB-C (I have several and they all make the laptop crash).

In the meantime I blacklisted the two ucsi related modules, to no ill effect so far. The laptop I'm using is the refreshed UX370UA (UX370UAR) with the latest UEFI (Version 311). Let me know if there is anything I can try out next to help you out!

Note You need to log in before you can comment on or make changes to this bug.