Bug 105251
Summary: | [Dell XPS 13 9343] Random kernel hangs at boot | ||
---|---|---|---|
Product: | Drivers | Reporter: | darkbasic (darkbasic) |
Component: | Input Devices | Assignee: | drivers_input-devices |
Status: | RESOLVED UNREPRODUCIBLE | ||
Severity: | normal | CC: | a-kernel, aaron.lu, andrej, angiolucci, danjared, greg.martyn, keith.waller, krims0n32, leho, mail, mail, mail, mattia.b89, mika.westerberg, rbreaves, saunders.52, victorien.desmars, will.leonardi |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 4.3-rc3 | Subsystem: | |
Regression: | No | Bisected commit-id: | |
Attachments: |
boot log with "debug acpi.debug_level=0x2003" (warn, error and tables debug enabled)
boot log with "debug acpi.debug_level=0x2003" (warn, error and tables debug enabled)
log with modprobe.blacklist=snd_soc_rl6347a
log with CONFIG_SND_SOC=n
log with CONFIG_SND_SOC=n
lspci -vv output for Audio devices on Dell Inc. XPS 13 9343
attachment-12574-0.html
attachment-15810-0.html
kernel panic bios A07 |
Created attachment 189041 [details]
boot log with "debug acpi.debug_level=0x2003" (warn, error and tables debug enabled)
Re-uploaded with text/plain.
Created attachment 189361 [details]
log with modprobe.blacklist=snd_soc_rl6347a
I tried to blacklist the crashing module (snd_soc_rl6347a) without any success: it still loads. I booted kernel with modprobe.blacklist=snd_soc_rl6347a and I put "blacklist snd_soc_rl6347a" in /etc/modprobe.d/blacklist.conf. I even tried to recompile the kernel without that module, but there is no option in "make nconfig" to disable it, while adding "# CONFIG_SND_SOC_RL6347A is not set" to the config doesn't help because it keeps showing CONFIG_SND_SOC_RL6347A=m in "make nconfig" if I search for it with F8.
I attached another log with modprobe.blacklist=snd_soc_rl6347a plus modprobe.d entries.
Created attachment 189471 [details]
log with CONFIG_SND_SOC=n
I compiled kernel with CONFIG_SND_SOC=n to avoid snd_soc_rl6347a being loaded. Now I don't get any Oops but it doesn't even reach the login: instead it goes into emergency mode!
Created attachment 189481 [details]
log with CONFIG_SND_SOC=n
Please disregard my previous comment, because the failure was due to file loss: the system sometimes hangs while shutting down (I think because of this bug), so some modules of my freshly compiled kernel were lost. I recompiled the kernel once again and synced the disks just to be sure: I still get random crashes even with CONFIG_SND_SOC=n (see the new log).
I also have a Dell XPS13 9343 (as shown in the Product Name entry of the DMI info), but I don't have this problem with Fedora's v4.1.x kernels. What can I say? Perhaps try another distro, like Fedora 22?
Hi Aaron, thanks for your answer. Today I also got the answer from Dell, which told me they don't have such an issue with Ubuntu. I'm starting to think this has something to do with my configuration, maybe the combo dm-crypt + btrfs which is pretty unusual. I will try both Ubuntu and Fedora soon, and then I will try to reinstall Arch onto a USB stick.
Right, dm-crypt+btrfs is indeed unusual. I heard that btrfs is not quite stable, but I have never used it. Maybe you need to try another FS.
I heard you can even get filesystem corruption with btrfs and SATA Aggressive Link Power Management (ALPM), in fact I immediately disabled it as soon as I installed tlp: https://github.com/linrunner/TLP/issues/128
I will try another filesystem, but I hope btrfs is not the culprit because I need its advanced features like snapshotting and compression.
- I tried the Ubuntu 15.10 live CD and it didn't crash.
- Then I tried to install Ubuntu 15.10 on a flash drive, using dm-crypt and btrfs, and it didn't crash.
- Finally I tried the Arch live CD 2015.10.01 and sometimes it crashed!!!
- Last but not least I tried to compile kernel 4.3-rc5 (the Arch way) using Ubuntu's CONFIG file: it still crashes.
So this isn't a matter of dm-crypt/btrfs, and it isn't a matter of kernel configuration either. Probably the culprit is either:
A) Arch's userspace (but it seems unlikely)
B) mkinitcpio (Arch's initramfs creation tool). Maybe it loads different modules than Ubuntu/Fedora in the initramfs and this is enough to trigger the Oops.
I cannot be sure of (B) because I need an initramfs for disk encryption, so I cannot get rid of it and simply use a static kernel.
@Dell Please download the Arch Live CD from [1], flash it to a USB drive using dd, then boot it using EFI at least 20 times (the crash is much more difficult to reproduce with the live CD). If you cannot manage to reproduce it within 20 boots you probably aren't affected; either way, let me know. I have the hidpi one, with i7-5600U, Intel Wifi and 256GB SSD.
[1] https://www.archlinux.org/download/
My config:
- XPS 13 dev edition, full HD
- A4 BIOS
- Replaced Broadcom wifi with Intel wifi
- kernel 4.1.6, Arch Linux: no crashes
My machine loads snd_soc_rt286 instead of snd_soc_rl6347a (the rl6347a module was added in 4.2).
The reason why the snd_soc_rl6347a module gets loaded even when blacklisted is probably that it gets loaded as a dependency of other modules. To keep the module out of the initramfs, some configuration has to be edited so that blacklisted modules are not included in it or pulled into it. This is explained in more detail at https://wiki.archlinux.org/index.php/Kernel_modules#Blacklisting
Try blacklisting a few more sound modules: snd_soc_rl6347a / snd_soc_rt286, snd_pcm, snd_soc_core, i2c_core, snd_timer, snd, soundcore, snd_compress, snd_pcm_dmaengine. On my system these modules are loaded and would trigger snd_soc_rt286 to be loaded according to lsmod.
Also, could you load a 4.1 kernel, since 4.1 does not have that sound module, and provide a boot log?
You probably did this already:
- checked RAM (memtest) and flash (badblocks) for faults
- compiled the kernel using a different gcc compiler version, without any custom compiler variables/flags
- booted without intel-ucode.img (kernel command line initrd)
Created attachment 190141 [details]
lspci -vv output for Audio devices on Dell Inc. XPS 13 9343
If you read comment #4 I compiled the kernel with CONFIG_SND_SOC=n to avoid snd_soc_rl6347a being loaded and I still get random crashes (see attachment 189481 [details]).
Regarding badblocks: I get the Oops even from the Arch Live CD.
Regarding memtest: I did run it on Windows and it passed the test, but I will try it once again with memtest86 just to be sure.
Regarding intel-ucode.img: I will check if the Arch Live CD boots with ucode, but I guess it doesn't.
Okay cool, just to confirm: you get random crashes even without loading intel-ucode.img at boot? There does not seem to be a ucode update yet for Broadwell according to the latest download: https://downloadcenter.intel.com/search?keyword=microcode
Yes, I confirm. I am aware that there is still no Broadwell ucode, but loading (missing) ucode doesn't hurt anyway because it will simply do nothing if there isn't any new ucode available. I just put it there to be ready when new ucode gets released.
By the way, did you ever get this i915 trace in dmesg?
Sometimes I get it after booting, but this is harmless anyway (it doesn't crash my system):
[ 42.350664] ------------[ cut here ]------------
[ 42.350683] WARNING: CPU: 0 PID: 1299 at drivers/gpu/drm/i915/intel_uncore.c:620 hsw_unclaimed_reg_debug+0x6d/0x90 [i915]()
[ 42.350685] Unclaimed register detected before reading register 0x223a0
[ 42.350687] Modules linked in: rfcomm cmac ecb ctr ccm bnep nls_iso8859_1 nls_cp437 vfat fat snd_usb_audio snd_usbmidi_lib snd_rawmidi snd_seq_device cdc_ether usbnet r8152 mii btusb btrtl btbcm btintel bluetooth crc16 uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_core v4l2_common videodev media snd_hda_codec_hdmi arc4 nvram msr joydev mousedev iwlmvm iTCO_wdt dell_laptop iTCO_vendor_support dell_wmi hid_multitouch sparse_keymap dcdbas mac80211 intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp evdev input_leds mac_hid kvm_intel iwlwifi kvm psmouse rtsx_pci_ms serio_raw memstick pcspkr cfg80211 dell_led snd_hda_codec_realtek rfkill i2c_i801 lpc_ich snd_hda_codec_generic mei_me i915 mei shpchp wmi battery snd_hda_intel drm_kms_helper snd_hda_codec drm snd_hda_core snd_hwdep intel_gtt
[ 42.350740] snd_pcm snd_timer i2c_hid int3403_thermal dw_dmac snd dw_dmac_core snd_soc_sst_acpi gpio_lynxpoint tpm_crb soundcore tpm i2c_designware_platform i2c_algo_bit video processor_thermal_device intel_soc_dts_iosf spi_pxa2xx_platform i2c_designware_core iosf_mbi int3402_thermal 8250_dw int3400_thermal int340x_thermal_zone acpi_thermal_rel acpi_pad ac fan acpi_als thermal kfifo_buf industrialio button processor sch_fq_codel ip_tables x_tables btrfs xor raid6_pq sha256_ssse3 sha256_generic hmac drbg ansi_cprng algif_skcipher af_alg hid_generic usbhid hid dm_crypt dm_mod sd_mod rtsx_pci_sdmmc atkbd libps2 crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper ehci_pci cryptd xhci_pci ehci_hcd ahci xhci_hcd libahci libata scsi_mod
[ 42.350784] rtsx_pci usbcore usb_common i8042 serio sdhci_acpi sdhci led_class mmc_core
[ 42.350792] CPU: 0 PID: 1299 Comm: akonadi_archive Not tainted 4.2.3-1-ARCH #1
[ 42.350793] Hardware name: Dell Inc. XPS 13 9343/0F5KF3, BIOS A05 07/14/2015
[ 42.350795] 0000000000000000 00000000e54b1f4a ffff88021e403c88 ffffffff8156c0ca
[ 42.350798] 0000000000000000 ffff88021e403ce0 ffff88021e403cc8 ffffffff81074886
[ 42.350801] ffff88021e403cd8 ffff880213280000 00000000000223a0 ffff880213284398
[ 42.350803] Call Trace:
[ 42.350805] <IRQ> [<ffffffff8156c0ca>] dump_stack+0x4c/0x6e
[ 42.350814] [<ffffffff81074886>] warn_slowpath_common+0x86/0xc0
[ 42.350817] [<ffffffff81074915>] warn_slowpath_fmt+0x55/0x70
[ 42.350828] [<ffffffffa0637c8d>] hsw_unclaimed_reg_debug+0x6d/0x90 [i915]
[ 42.350838] [<ffffffffa063d429>] gen6_read32+0x59/0x1a0 [i915]
[ 42.350846] [<ffffffffa050a0d8>] ? send_vblank_event+0x78/0x110 [drm]
[ 42.350856] [<ffffffffa062fe08>] intel_lrc_irq_handler+0x38/0x220 [i915]
[ 42.350866] [<ffffffffa0626aa7>] gen8_gt_irq_handler+0x217/0x240 [i915]
[ 42.350874] [<ffffffffa0626b4c>] gen8_irq_handler+0x7c/0x560 [i915]
[ 42.350877] [<ffffffff810cc109>] handle_irq_event_percpu+0x39/0x1b0
[ 42.350880] [<ffffffff810cc2c7>] handle_irq_event+0x47/0x70
[ 42.350882] [<ffffffff810cf46c>] handle_edge_irq+0xbc/0x150
[ 42.350886] [<ffffffff81017c42>] handle_irq+0x22/0x40
[ 42.350889] [<ffffffff815740bf>] do_IRQ+0x4f/0xe0
[ 42.350891] [<ffffffff8157216b>] common_interrupt+0x6b/0x6b
[ 42.350892] <EOI>
[ 42.350894] ---[ end trace edb7049d993b12ff ]---
Some more thoughts:
- reset BIOS to default
- downgrade to A04 / try A06 when released
If distributions other than Arch Linux do not yield the kernel oops, as you mentioned on LKML, I would take the kernel and modules of that distribution and test them with Arch Linux.
Same warning regarding i915.
I tried to reset the BIOS: it didn't help. I also passed several runs of memtest86: memory is fine.
- I tried to use Ubuntu's kernel with Ubuntu's initramfs on Arch, but somehow I didn't manage to boot the system.
- So I kept the Ubuntu kernel and generated a new initramfs for it: I still get crashes at boot, maybe a bit less frequently.
- Then I decided to build a kernel statically without using an initramfs at all: I don't get crashes anymore!!!
So Ubuntu's kernel *IS* bugged, but somehow only Arch's initramfs exposes the bug. Unfortunately my laptop uses dm-crypt, so I *need* an initramfs to boot it (I was booting a flash drive in my tests).
I forgot to say that if I boot my static kernel *with* an initramfs it still crashes.
I made a custom initramfs removing the "base" hook and it doesn't crash anymore. So, whatever the problem is, it lies in the base hook.
Sounds like an Arch Linux specific problem? Maybe discussing it in Arch's forum or support list is more appropriate.
ERRATA CORRIGE: omitting the "base" hook simply prevented the subsequent hooks from being loaded, so it wasn't the culprit! I studied mkinitcpio's behaviour a bit and, after removing all the hooks, I started re-adding them from the first one: each time I added the next hook and rebooted 5 times to check whether it was still crashing. *The real culprit is the "keyboard" hook.*
Unfortunately I need the keyboard hook to be able to type my dm-crypt password, but on the wiki, regarding the keyboard hook, I read: "As a side effect, modules for some non-keyboard input devices might be added to, but this should not be relied on." In fact lots of useless modules were loaded into the initramfs by the keyboard hook, while I just need "atkbd" to be able to type the password (plus some other modules if I want to be able to type it from a USB keyboard).
Now that I'm able to properly boot my laptop I can "bisect" which module induces the crash. Stay tuned.
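A minimal sketch of the kind of /etc/mkinitcpio.conf this finding points to: the "keyboard" hook is dropped and only the keyboard modules are listed explicitly. The hook list is an assumption (a typical Arch default of that period plus "encrypt" for dm-crypt), not the reporter's actual file; the module names atkbd, usbhid and hid-generic are the ones mentioned in this thread.

```shell
# /etc/mkinitcpio.conf (sketch; mkinitcpio sources this file as shell)

# Load only what is needed to type the dm-crypt passphrase:
# atkbd for the built-in keyboard, usbhid + hid-generic for a USB keyboard.
MODULES="atkbd usbhid hid-generic"

# Assumed hook list: a period-typical default plus "encrypt", with the
# "keyboard" hook deliberately left out so that it cannot pull
# hid-multitouch into the initramfs.
HOOKS="base udev autodetect modconf block encrypt filesystems fsck"
```

After editing the file the image has to be regenerated, for example with `mkinitcpio -p linux` for the stock Arch kernel preset.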
@Aaron It *isn't* an Arch-related issue (as I said even Ubuntu's kernel is bugged): it's just that Arch *exposes* the bug while Ubuntu doesn't, so I want it fixed upstream.
(In reply to darkbasic from comment #19)
> Now that I'm able to properly boot my laptop I can "bisect" which module
> induces the crash. Stay tuned.
Good, thanks.
>
> @Aaron
> It *isn't* an arch related issue (as I said even Ubuntu's kernel is bugged):
> it's just that Arch *exposes* the bug while Ubuntu doesn't, so I want it
> fixed upstream.
OK.
I finished bisecting: the culprit is "hid-multitouch". If you load hid-multitouch into the initramfs it leads to random crashes. If you load the module later it does no harm. Only Arch dm-crypt users are affected by this issue because they need the mkinitcpio "keyboard" hook to be able to type the password, and it loads hid-multitouch into the initramfs. This is a bug in the kernel and it *isn't* distro specific.
@Dell It's your turn to fix the bug now, I already spent several dozen hours debugging it and there is really nothing more I can do with my limited kernel knowledge.
P.S. I left the "NEEDINFO" status because I don't know which "VERIFIED" substatus to choose.
Are you able to reproduce the crash without dm-crypt stuff? hid-multitouch belongs to the input subsystem, I'm moving the bug there.
@Mika I just double checked to be 100% sure: yes, I can reproduce it without dm-crypt stuff: you just need to load hid-multitouch into the initramfs.
Does it crash if you just run 'modprobe hid-multitouch' from the initramfs? If so, can you provide dmesg of the crash?
How should I get a command prompt to be able to 'modprobe hid-multitouch' while still being in the initramfs?
Add break=premount to the kernel cmd line.
No, it doesn't crash at this stage. Probably loading other modules *AFTER* hid-multitouch gets your system crashed.
I can confirm it's hid-multitouch based on the error messages I've seen.
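To check whether a given initramfs image actually contains hid-multitouch, the file listing of the image can be filtered. The helper below is only an illustrative sketch: `listing_has_module` is a hypothetical name, it reads one path per line on stdin, and the image path in the usage comment is an assumption.

```shell
#!/bin/sh
# Succeed if a file listing (one path per line on stdin) contains the given
# kernel module. Dashes and underscores are treated as equivalent, because
# "hid-multitouch" (file name) and "hid_multitouch" (lsmod name) refer to
# the same module.
listing_has_module() {
    # Turn every "-" or "_" in the name into the bracket expression [-_].
    pattern=$(printf '%s' "$1" | sed 's/[-_]/[-_]/g')
    # Match <name>.ko, optionally followed by a compression suffix (.gz, .xz).
    grep -Eq "(^|/)${pattern}\.ko(\.[a-z0-9]+)?\$"
}

# Usage on Arch (image path is an assumption):
#   lsinitcpio /boot/initramfs-linux.img | listing_has_module hid-multitouch \
#       && echo "hid-multitouch is in the initramfs"
```

For manual experiments, booting with break=premount on the kernel command line, as suggested above, gives a shell inside the initramfs where `modprobe hid-multitouch` and `dmesg` can be run by hand.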
We have two Dell laptops here with touchscreens that both error randomly during startup. One is running Fedora 22 (w. dm-crypt+btrfs), and one is Fedora 23 (w. dm-crypt+ext4).
I would just like to add this happens for me too on a Fedora 4.2.6-300.fc23.x86_64 kernel with or without dm-crypt usage.
Forgot to add some more details:
XPS13 w/ Infinity touch display
i7-5600U
256GB SSD
BIOS A05
I'm pretty sure it only happens with the touch model.
Same problem here. It happens sometimes (randomly, I guess) to me on a Dell Inspiron 7348 with touchscreen. Fedora 23, no dm-crypt.
BIOS A07 features some shiny new random kernel panics with kernel 4.4 (but not 4.3), during and after boot. I suggest you stick to A05. As usual nobody from Dell will care about it. Next laptop will definitely be a Lenovo.
Sadly an older BIOS version raises another problem on the Inspiron 7348, since BIOS A07 fixes a serious problem with the CPU fan in this model. Dell also removed BIOS versions older than A07 for this model from their website.
It seems that I'll need to choose between a broken boot process, or overheating and noise problems with the CPU fan.
"Next laptop will definitely be a Lenovo" (2)
Created attachment 196081 [details] attachment-12574-0.html
No problem - thanks for letting me know!
On Mon, Nov 30, 2015 at 11:39 <bugzilla-daemon@bugzilla.kernel.org> wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=105251
>
> --- Comment #35 from Vinícius Reis <angiolucci@gmail.com> ---
> Sadly an older bios version raises another problem on Inspiron 7348,since
> bios
> A07 fix a serious problem with cpu fan in this model. Dell also removed
> bios
> versions older than A07 for this model from their website.
>
> It seems that I'll need to choose between a broken boot process, or
> overheating
> and noise problems with cpu fan.
>
> "Next laptop will definitely be a Lenovo" (2)
>
> --
> You are receiving this mail because:
> You are on the CC list for the bug.
Comment on attachment 196081 [details]
attachment-12574-0.html
[deleted]
Sorry, I did not mean to be rude. Dell is doing a good job offering Linux support on their machines and of course, as a user, I really enjoy it.
What I wanted to say in my previous comment is basically 'proceed with caution when downgrading your system BIOS as a workaround for this issue, since on some hardware (like mine), it can cause other problems, like overheating and fan noise issues'.
Cheers!
Created attachment 196151 [details] attachment-15810-0.html
Apologies, I managed to auto respond to the email by accident. That was not an intentional response.
On Mon, Nov 30, 2015 at 14:06 <bugzilla-daemon@bugzilla.kernel.org> wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=105251
>
> --- Comment #38 from Vinícius Reis <angiolucci@gmail.com> ---
> Sorry, I did not mean to be rude. Dell is doing a good job offerig Linux
> support on their machines and of course as a user, I really enjoy it.
>
> What I wanted to say on my previous comment is basically 'proceed with
> caution
> on downgrading your system bios as a workaround for this issue, since in
> some
> hardware (like mine), it can cause other problems, like overheating and fan
> noise issues'.
>
> Cheers!
>
> --
> You are receiving this mail because:
> You are on the CC list for the bug.
Created attachment 196251 [details]
kernel panic bios A07
This is 100% reproducible in my computer. Every boot with the serial console attached leads to a kernel panic. If I boot without the serial console it randomly panics later. Somehow the slower boot due to the serial console leads to a more easily reproducible issue.
I am experiencing the same problem on a late 2015 Dell XPS 13 (9350), but in my case it is only reproducible after 4.2.5, so it may not be related at all. Anyway, I came across bug 109081, which describes the same symptoms for another computer with Intel's Skylake chipset (the same as Dell 9350). Bug 109081, comment 13 suggests disabling the intel_idle driver with intel_idle.max_cstate=0, which works in my case, so maybe some of you could try that to see whether or not these bugs are related.
The new bug with BIOS A07 is fixed with 4.4-rc8. Finally we have working i2s audio without having to put up with random kernel panics :)
I'm currently testing the tip in comment #41 by Johannes Larsen. Four days without boot crashes. So far, so good =) It's just a test; I'm not sure the boot crashes are gone, since they were happening at random. I'll test it for 3 or 4 more weeks.
My computer crashed on the 5th testing day with "intel_idle.max_cstate=0", while booting Fedora. So, it did not solve the problem =/
Thanks a lot for the detailed report, the pointer to hid-multitouch saved the day for me! Here's my bug report: https://bugzilla.redhat.com/show_bug.cgi?id=1297188
Blacklisting hid-multitouch and loading it in a systemd service later in the boot process (wanted by multi-user and loaded before display-manager) resolved the issue. And it also keeps both the touchpad and the touchscreen fully operational.
You're welcome, hopefully if you manage to backport this bug in Ubuntu even Dell will acknowledge the bug. Maybe. I also suggest you have a look at the Arch wiki for future issues: Arch's community is quite active around this laptop. https://wiki.archlinux.org/index.php/Dell_XPS_13_(2015)
I seem to be getting a similar issue on my Lenovo Yoga 700 11-inch. Doing a similar delayed load of hid-multitouch mostly solved the issue (it went from 75% failed boots to 10%, and yes, I spent about ten minutes rebooting to check). And even that failed boot with hid-multitouch delayed seemed to be a similar "hid drivers loading causing i2c to mess up" error.
I'm a Dell 9343 owner (non-touchscreen); tell me if I can help you!
Non-touchscreen models don't suffer from this problem, you're lucky ;)
I had the same issue on my Dell Inspiron 7548 with touchscreen (with Manjaro). After months of searching for the solution to my problem (I thought it was caused by the radeon driver), I finally found this thread. Removing "keyboard" from the HOOKS in /etc/mkinitcpio.conf and instead using MODULES="atkbd.ko usbhid hid-generic" apparently solved my Linux kernel oops at boot. Thank you so much!
Don't forget to contact Dell and tell them how angry you are: they are CCed on this bug report, they know that MANY of their laptops suffer from this problem and they simply don't care.
I believe I am experiencing this issue as well on an XPS 13 9343 Full HD (non-touch) model. At times I will have issues booting into Ubuntu 16, and it usually starts after the hid multitouch trackpad errors out and becomes non-responsive. I suspect though that this may not be a software error. I have similar issues running other OSes as well. I wish I had a solution, but seriously, leaving it alone for a few hours or a day seems to fix it... granted, that is not a fix I like. It feels like my laptop will eventually give out on me, but the issue will go away for 4-6 weeks at a time.
I rarely suffered from this issue in the past, but now I am absolutely sure I haven't had any kernel hang at boot during these months. Probably the BIOS update (A09) or the kernel one (4.4) has solved the issue. I think we can mark this bug as fixed!
Ok, I will get rid of the workaround and I will confirm it.
Sorry for the late reply, but I confirm that the issue is no longer reproducible. I'm closing. |
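The delayed-load workaround reported earlier in this thread (blacklist hid-multitouch, then load it from a systemd service wanted by multi-user and ordered before display-manager) might look roughly like the sketch below. The helper name `write_late_load_files`, the unit name, the file names and the /sbin/modprobe path are all assumptions for illustration; none of them come from the original comments.

```shell
#!/bin/sh
# Sketch of the "blacklist now, load later" workaround. The optional first
# argument is a root directory, so the files can be written to a scratch
# location for inspection instead of the live system.
write_late_load_files() {
    root=${1:-/}
    mkdir -p "$root/etc/modprobe.d" "$root/etc/systemd/system"

    # "blacklist" only stops alias-based autoloading; an explicit
    # "modprobe hid-multitouch" later on still loads the module.
    printf 'blacklist hid_multitouch\n' \
        > "$root/etc/modprobe.d/hid-multitouch.conf"

    # Oneshot unit: wanted by multi-user.target and ordered before the
    # display manager, as described in the comment above.
    cat > "$root/etc/systemd/system/hid-multitouch-late.service" <<'EOF'
[Unit]
Description=Load hid-multitouch late in the boot process
Before=display-manager.service

[Service]
Type=oneshot
ExecStart=/sbin/modprobe hid-multitouch

[Install]
WantedBy=multi-user.target
EOF
}

# Dry run into a scratch directory:
#   write_late_load_files /tmp/hid-late-test
```

If used for real, the unit would still have to be enabled with `systemctl enable hid-multitouch-late.service`, and depending on the distribution the initramfs may need to be regenerated so that the blacklist is honoured there as well.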
Created attachment 189031 [details]
boot log with "debug acpi.debug_level=0x2003" (warn, error and tables debug enabled)
This laptop suffers from random kernel hangs at boot: I tested kernels 4.1.8, 4.2.1 and 4.3-rc3 and they are all affected. Every time I turn on my laptop I have to boot several times to reach the sddm login, because I often get hangs which prevent the system from booting. Once booted I don't have any problems until the next boot.
With "acpi=off" the system boots flawlessly, while with "acpi=ht" I still get random hangs, so I guess the problem is in the ACPI table parsing code itself, or perhaps the SMP code.
I recompiled kernel 4.3-rc3 with CONFIG_ACPI_DEBUG=y and attached the full log from a serial console, booting with "debug acpi.debug_level=0x2003" (warn, error and tables debug enabled).
Distro is Arch Linux. Please let me know if you need further logs.