I have my ext4 /home dir mounted on lvm partition /dev/vg1/lv_home which is on mdadm raid10 device with 3 samsung evo 850 SSD disks. After I upgraded to 4.14.20 the computer hang with the HDD led light shining full (not blinking) as soon as I logged in (eg accesed the /home partition). I was able to get a dmesg out which shows an OOPS. The machine hanged as sonn any IO tried to access /home and a clean poweroff was not possible. [ 14.226693] ata1.00: exception Emask 0x10 SAct 0x10000 SErr 0x400101 action 0x6 frozen [ 14.226694] ata1.00: irq_stat 0x08000000, interface fatal error [ 14.226695] ata1: SError: { RecovData UnrecovData Handshk } [ 14.226696] ata1.00: failed command: WRITE FPDMA QUEUED [ 14.226698] ata1.00: cmd 61/e0:80:58:10:5c/00:00:0a:00:00/40 tag 16 ncq dma 114688 out [ 14.226699] ata1.00: status: { DRDY } [ 14.226701] ata1: hard resetting link [ 14.537049] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300) [ 14.538626] ata1.00: supports DRM functions and may not be fully accessible [ 14.539258] ata1.00: disabling queued TRIM support [ 14.540875] ata1.00: supports DRM functions and may not be fully accessible [ 14.541376] ata1.00: disabling queued TRIM support [ 14.542646] ata1.00: configured for UDMA/133 [ 14.542662] ata1: EH complete [ 15.036750] ------------[ cut here ]------------ [ 15.036764] WARNING: CPU: 0 PID: 0 at /home/buildozer/aports/main/linux-vanilla/src/linux-4.14/kernel/rcu/tree.c:2725 rcu_process_callbacks+0x370/0x421 [ 15.036766] Modules linked in: ctr ccm nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo xt_addrtype xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_conntrack_ipv4 nf_nat nf_defrag_ipv4 xt_conntrack nf_conntrack libcrc32c crc32c_generic ipt_REJECT nf_reject_ipv4 xt_tcpudp br_netfilter bridge stp llc ebtable_filter ebtables overlay exportfs ip6table_filter ip6_tables iptable_filter ip_tables x_tables ipv6 bnep joydev mousedev hid_logitech_hidpp hid_logitech_dj nls_utf8 nls_cp437 vfat fat snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic arc4 input_leds btusb btrtl btbcm btintel bluetooth hid_apple hid_generic ecdh_generic iwlmvm mac80211 crct10dif_pclmul ghash_clmulni_intel iwlwifi pcbc coretemp tun deadline_iosched cfg80211 snd_hda_intel aesni_intel [ 15.036845] igb kvm_intel aes_x86_64 snd_hda_codec hwmon crypto_simd kvm dca glue_helper rfkill snd_hda_core cryptd irqbypass intel_cstate snd_hwdep intel_rapl_perf snd_pcm af_packet snd_timer e1000e psmouse snd serio_raw ptp soundcore pcspkr iTCO_wdt pps_core shpchp iTCO_vendor_support mei_me mei intel_pch_thermal thermal fan evdev efivarfs raid10 usbhid hid dm_mod dax crc32_pclmul crc32c_intel ahci libahci libata i2c_i801 xhci_pci xhci_hcd usbcore i915 video drm_kms_helper drm intel_gtt agpgart i2c_algo_bit fb_sys_fops syscopyarea sysfillrect sysimgblt i2c_core nvme nvme_core wmi button loop raid1 ext4 crc16 mbcache jbd2 sd_mod scsi_mod [ 15.036928] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.14.20-0-vanilla #1-Alpine [ 15.036930] Hardware name: Gigabyte Technology Co., Ltd. To be filled by O.E.M./Z170N-WIFI-CF, BIOS F6 10/26/2015 [ 15.036933] task: fffffffface124c0 task.stack: fffffffface00000 [ 15.036938] RIP: 0010:rcu_process_callbacks+0x370/0x421 [ 15.036941] RSP: 0018:ffff8ff771c03f00 EFLAGS: 00010002 [ 15.036944] RAX: 0000000000000000 RBX: ffff8ff771c21940 RCX: 000000018040003b [ 15.036947] RDX: ffffffffffffd801 RSI: ffff8ff771c03f10 RDI: ffff8ff771c21978 [ 15.036949] RBP: fffffffface4f340 R08: 0000000000000001 R09: 0000000000000100 [ 15.036951] R10: 0000000000000001 R11: 0000000000000000 R12: ffff8ff771c21978 [ 15.036954] R13: fffffffface124c0 R14: 7fffffffffffffff R15: fffffffffffffff9 [ 15.036957] FS: 0000000000000000(0000) GS:ffff8ff771c00000(0000) knlGS:0000000000000000 [ 15.036959] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 15.036962] CR2: 00007f0ba7bdf078 CR3: 00000002afe0a006 CR4: 00000000003606b0 [ 15.036964] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 15.036966] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 15.036968] Call Trace: [ 15.036973] <IRQ> [ 15.036982] ? rebalance_domains+0xf2/0x237 [ 15.036988] __do_softirq+0xfb/0x268 [ 15.036995] ? sched_clock+0x5/0x8 [ 15.037000] irq_exit+0x62/0xa1 [ 15.037005] smp_apic_timer_interrupt+0xbf/0xf7 [ 15.037009] apic_timer_interrupt+0x98/0xa0 [ 15.037012] </IRQ> [ 15.037017] RIP: 0010:cpuidle_enter_state+0x149/0x20d [ 15.037019] RSP: 0018:fffffffface03e80 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10 [ 15.037023] RAX: ffff8ff771c20c40 RBX: ffff8ff771c28fc8 RCX: 000000000000001f [ 15.037025] RDX: 000000038041cb3c RSI: 0000000000000000 RDI: 0000000000000000 [ 15.037027] RBP: 0000000000000005 R08: 00000666643562fe R09: 0000000000000014 [ 15.037029] R10: fffffffface03e60 R11: 000000000000165c R12: 0000000000000000 [ 15.037032] R13: fffffffface124c0 R14: fffffffface7ff60 R15: fffffffface80158 [ 15.037041] do_idle+0x11a/0x190 [ 15.037046] cpu_startup_entry+0x6f/0x71 [ 15.037052] start_kernel+0x44d/0x46d [ 15.037064] secondary_startup_64+0xa5/0xb0 [ 15.037070] Code: 8b 93 90 00 00 00 48 2b 15 6f 25 da 00 48 39 d0 7d 07 48 89 83 90 00 00 00 48 83 7b 38 00 0f 94 c2 48 85 c0 0f 94 c0 38 c2 74 02 <0f> ff 48 8b 3c 24 57 9d 0f 1f 44 00 00 4c 89 e7 e8 e5 26 00 00 [ 15.037142] ---[ end trace 486866c4a775ee92 ]--- [ 22.250602] logitech-hidpp-device 0003:046D:4060.0008: HID++ 4.5 device connected. This does not happen with 4.14.19, so I looked at the commit log and tried revert commit c561093ed6843684690436dea034af53b462cfe5 (scsi: core: Ensure that the SCSI error handler gets woken up). With that commit reverted machine again worked as normal. Linux ncopa-desktop 4.14.20-1-vanilla #2-Alpine SMP Wed Feb 21 15:11:34 CET 2018 x86_64 GNU/Linux dmesg output from my kernel with the regression reverted: [ 20.910014] ata1.00: exception Emask 0x10 SAct 0x0 SErr 0x400101 action 0x6 frozen [ 20.910018] ata1.00: irq_stat 0x08000000, interface fatal error [ 20.910022] ata1: SError: { RecovData UnrecovData Handshk } [ 20.910026] ata1.00: failed command: WRITE DMA [ 20.910036] ata1.00: cmd ca/00:08:00:d0:5e/00:00:00:00:00/ed tag 2 dma 4096 out [ 20.910039] ata1.00: status: { DRDY } [ 20.910048] ata1: hard resetting link [ 21.221989] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300) [ 21.223512] ata1.00: supports DRM functions and may not be fully accessible [ 21.225118] ata1.00: supports DRM functions and may not be fully accessible [ 21.226387] ata1.00: configured for UDMA/133 [ 21.226402] ata1: EH complete [ 21.259887] ata1: limiting SATA link speed to 3.0 Gbps [ 21.259898] ata1.00: exception Emask 0x10 SAct 0x0 SErr 0x400100 action 0x6 frozen [ 21.259902] ata1.00: irq_stat 0x08000000, interface fatal error [ 21.259908] ata1: SError: { UnrecovData Handshk } [ 21.259914] ata1.00: failed command: WRITE DMA [ 21.259929] ata1.00: cmd ca/00:e0:20:10:5c/00:00:00:00:00/ea tag 3 dma 114688 out [ 21.259933] ata1.00: status: { DRDY } [ 21.259946] ata1: hard resetting link [ 21.571618] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 320) [ 21.573260] ata1.00: supports DRM functions and may not be fully accessible [ 21.575300] ata1.00: supports DRM functions and may not be fully accessible [ 21.576892] ata1.00: configured for UDMA/133 [ 21.576913] ata1: EH complete [ 22.124450] logitech-hidpp-device 0003:046D:4060.0008: HID++ 4.5 device connected. So problem seems to happen when there are one or more ata errors. I have seen those errors on 4.9 kernel too, but they don't seem to do any harm, until 4.14.20 kernel.
Those two patches fixes the issue: https://patchwork.kernel.org/patch/10181165/ https://patchwork.kernel.org/patch/10233617/
Getting a very similar Warning on 4.15.7 when dealing with an old unhappy disk. TL;DR the above patches applied cleanly and fixed this on 4.15.7 for me as well. Mar 3 20:42:00 redsun kernel: [ 750.861060] ata4.00: exception Emask 0x11 SAct 0x80000 SErr 0x680100 action 0x6 frozen Mar 3 20:42:00 redsun kernel: [ 750.861063] ata4.00: irq_stat 0x48000008, interface fatal error Mar 3 20:42:00 redsun kernel: [ 750.861066] ata4: SError: { UnrecovData 10B8B BadCRC Handshk } Mar 3 20:42:00 redsun kernel: [ 750.861069] ata4.00: failed command: READ FPDMA QUEUED Mar 3 20:42:00 redsun kernel: [ 750.861078] ata4.00: cmd 60/00:98:c0:c5:c3/01:00:02:00:00/40 tag 19 ncq dma 131072 in Mar 3 20:42:00 redsun kernel: [ 750.861078] res 40/00:98:c0:c5:c3/00:00:02:00:00/40 Emask 0x10 (ATA bus error) Mar 3 20:42:00 redsun kernel: [ 750.861080] ata4.00: status: { DRDY } Mar 3 20:42:39 redsun kernel: [ 789.396122] WARNING: CPU: 3 PID: 1926 at /usr/src/linux-stable/kernel/rcu/tree.c:2792 rcu_process_callbacks+0x482/0x4a0 Mar 3 20:42:39 redsun kernel: [ 789.396124] Modules linked in: cmac cifs fscache xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp tun bridge ebtable_filter ebtables ip6table_f ilter ip6_tables iptable_filter ip_tables x_tables ipv6 cfg80211 rfkill 8021q garp stp llc fuse amdkfd amd_iommu_v2 amdgpu snd_hda_codec_realtek snd_hda_codec_generic snd_hda_codec_hdmi snd_hda_intel snd_hda_codec snd_hda_core chash ttm drm_kms_helper snd_hwdep i2c_dev uas drm igb snd_pcm snd_timer agpgart wmi_bmof sn d mxm_wmi kvm_amd usb_storage fb_sys_fops syscopyarea sysfillrect kvm soundcore ptp sysimgblt pps_core dca i2c_algo_bit alx mdio ccp crct10dif_pclmul crc32_pclmul evdev crc32c_intel shpchp i2c_piix4 i2c_core Mar 3 20:42:39 redsun kernel: [ 789.396173] efi_pstore efivars ghash_clmulni_intel k10temp hwmon button wmi acpi_cpufreq loop hid_generic usbhid hid uhci_hcd ohci_pci ohci_hcd ehci_hcd xhci_pci xhci_hcd Mar 3 20:42:39 redsun kernel: [ 789.396188] CPU: 3 PID: 1926 Comm: Web Content Tainted: G W 4.15.7 #21 Mar 3 20:42:39 redsun kernel: [ 789.396189] Hardware name: Gigabyte Technology Co., Ltd. AX370-Gaming 5/AX370-Gaming 5, BIOS F21 02/08/2018 Mar 3 20:42:39 redsun kernel: [ 789.396193] RIP: 0010:rcu_process_callbacks+0x482/0x4a0 Mar 3 20:42:39 redsun kernel: [ 789.396194] RSP: 0000:ffff94857ecc3f10 EFLAGS: 00010002 Mar 3 20:42:39 redsun kernel: [ 789.396197] RAX: ffffffffffffd800 RBX: ffff94857ece1800 RCX: 0000000000004601 Mar 3 20:42:39 redsun kernel: [ 789.396198] RDX: 0000000000000001 RSI: ffff94857ecc3f18 RDI: ffff94857ece1838 Mar 3 20:42:39 redsun kernel: [ 789.396199] RBP: ffffffffa983cb40 R08: 0000000000024f30 R09: ffffffffa80fbdcb Mar 3 20:42:39 redsun kernel: [ 789.396200] R10: fffff3a0c85f5e00 R11: 0000000000000326 R12: ffff94857ece1838 Mar 3 20:42:39 redsun kernel: [ 789.396201] R13: 0000000000000246 R14: 7fffffffffffffff R15: fffffffffffffffc Mar 3 20:42:39 redsun kernel: [ 789.396204] FS: 00007f0492a279c0(0000) GS:ffff94857ecc0000(0000) knlGS:0000000000000000 Mar 3 20:42:39 redsun kernel: [ 789.396205] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Mar 3 20:42:39 redsun kernel: [ 789.396207] CR2: 00007f04656dc000 CR3: 000000026ceaa000 CR4: 00000000003406e0 Mar 3 20:42:39 redsun kernel: [ 789.396208] Call Trace: Mar 3 20:42:39 redsun kernel: [ 789.396211] <IRQ> Mar 3 20:42:39 redsun kernel: [ 789.396216] __do_softirq+0xe0/0x2dd Mar 3 20:42:39 redsun kernel: [ 789.396220] irq_exit+0xae/0xb0 Mar 3 20:42:39 redsun kernel: [ 789.396223] smp_apic_timer_interrupt+0x76/0x130 Mar 3 20:42:39 redsun kernel: [ 789.396225] apic_timer_interrupt+0x7d/0x90 Mar 3 20:42:39 redsun kernel: [ 789.396227] </IRQ> Mar 3 20:42:39 redsun kernel: [ 789.396229] RIP: 0033:0x7f0490e38707 Mar 3 20:42:39 redsun kernel: [ 789.396230] RSP: 002b:00007ffe7bb45690 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff11 Mar 3 20:42:39 redsun kernel: [ 789.396232] RAX: 00007f045c4af910 RBX: 00007f047fc08000 RCX: ffffffffff78f144 Mar 3 20:42:39 redsun kernel: [ 789.396233] RDX: 00007f0490e386f0 RSI: 00007f045c4af940 RDI: 00007f047fc08000 Mar 3 20:42:39 redsun kernel: [ 789.396234] RBP: 00007ffe7bb46a40 R08: 00007f045f77f800 R09: 0000000000000000 Mar 3 20:42:39 redsun kernel: [ 789.396235] R10: 000000000000000a R11: 00007f045fc113ca R12: 00007f045c4af970 Mar 3 20:42:39 redsun kernel: [ 789.396237] R13: 00007f045c4b0078 R14: 00007ffe7bb456f0 R15: 0000000000000000 Mar 3 20:42:39 redsun kernel: [ 789.396238] Code: 8b 1d e3 66 86 01 48 85 db 74 1b 48 8b 03 48 8b 7b 08 48 83 c3 18 48 89 ee e8 6b 6f f0 00 48 8b 03 48 85 c0 75 e8 e9 b6 fb ff ff <0f> 0b e9 df fd ff ff 0f 0b e9 d6 fc ff ff 4c 89 ee 4c 89 e7 e8 Mar 3 20:42:39 redsun kernel: [ 789.396279] ---[ end trace 9f6cbfdd9555c3ff ]--- The two patches linked above applied cleanly, and the link can now be reset and the disk can continue to be used: Mar 4 01:19:21 redsun kernel: [ 500.894429] ata4.00: exception Emask 0x11 SAct 0x40000001 SErr 0x680100 action 0x6 frozen Mar 4 01:19:21 redsun kernel: [ 500.894432] ata4.00: irq_stat 0x48000008, interface fatal error Mar 4 01:19:21 redsun kernel: [ 500.894435] ata4: SError: { UnrecovData 10B8B BadCRC Handshk } Mar 4 01:19:21 redsun kernel: [ 500.894438] ata4.00: failed command: READ FPDMA QUEUED Mar 4 01:19:21 redsun kernel: [ 500.894446] ata4.00: cmd 60/00:00:28:f8:7a/01:00:05:00:00/40 tag 0 ncq dma 131072 in Mar 4 01:19:21 redsun kernel: [ 500.894446] res 40/00:00:28:f8:7a/00:00:05:00:00/40 Emask 0x10 (ATA bus error) Mar 4 01:19:21 redsun kernel: [ 500.894448] ata4.00: status: { DRDY } Mar 4 01:19:21 redsun kernel: [ 500.894451] ata4.00: failed command: READ FPDMA QUEUED Mar 4 01:19:21 redsun kernel: [ 500.894459] ata4.00: cmd 60/00:f0:28:f7:7a/01:00:05:00:00/40 tag 30 ncq dma 131072 in Mar 4 01:19:21 redsun kernel: [ 500.894459] res 40/00:00:28:f8:7a/00:00:05:00:00/40 Emask 0x10 (ATA bus error) Mar 4 01:19:21 redsun kernel: [ 500.894461] ata4.00: status: { DRDY } Mar 4 01:22:04 redsun kernel: [ 664.145825] ata4.00: exception Emask 0x11 SAct 0x400000 SErr 0x680100 action 0x6 frozen Mar 4 01:22:04 redsun kernel: [ 664.145828] ata4.00: irq_stat 0x48000008, interface fatal error Mar 4 01:22:04 redsun kernel: [ 664.145831] ata4: SError: { UnrecovData 10B8B BadCRC Handshk } Mar 4 01:22:04 redsun kernel: [ 664.145834] ata4.00: failed command: READ FPDMA QUEUED Mar 4 01:22:04 redsun kernel: [ 664.145842] ata4.00: cmd 60/30:b0:e8:26:4e/00:00:03:00:00/40 tag 22 ncq dma 24576 in Mar 4 01:22:04 redsun kernel: [ 664.145842] res 40/00:b0:e8:26:4e/00:00:03:00:00/40 Emask 0x10 (ATA bus error) Mar 4 01:22:04 redsun kernel: [ 664.145844] ata4.00: status: { DRDY } Mar 4 01:22:25 redsun kernel: [ 685.225653] ata4: limiting SATA link speed to 3.0 Gbps Mar 4 01:22:25 redsun kernel: [ 685.225658] ata4.00: exception Emask 0x11 SAct 0x1800 SErr 0x600100 action 0x6 frozen Mar 4 01:22:25 redsun kernel: [ 685.225660] ata4.00: irq_stat 0x48000008, interface fatal error Mar 4 01:22:25 redsun kernel: [ 685.225663] ata4: SError: { UnrecovData BadCRC Handshk } Mar 4 01:22:25 redsun kernel: [ 685.225666] ata4.00: failed command: READ FPDMA QUEUED Mar 4 01:22:25 redsun kernel: [ 685.225674] ata4.00: cmd 60/00:58:90:96:d6/01:00:08:00:00/40 tag 11 ncq dma 131072 in Mar 4 01:22:25 redsun kernel: [ 685.225674] res 40/00:58:90:96:d6/00:00:08:00:00/40 Emask 0x10 (ATA bus error) Mar 4 01:22:25 redsun kernel: [ 685.225676] ata4.00: status: { DRDY } Mar 4 01:22:25 redsun kernel: [ 685.225678] ata4.00: failed command: READ FPDMA QUEUED Mar 4 01:22:25 redsun kernel: [ 685.225686] ata4.00: cmd 60/00:60:90:97:d6/01:00:08:00:00/40 tag 12 ncq dma 131072 in Mar 4 01:22:25 redsun kernel: [ 685.225686] res 40/00:58:90:96:d6/00:00:08:00:00/40 Emask 0x10 (ATA bus error) Mar 4 01:22:25 redsun kernel: [ 685.225688] ata4.00: status: { DRDY }
This has been fixed by the following commit, which is in James' tree but not yet in Linus' tree nor any of the stable trees: 3be8828fc507 ("scsi: core: Avoid that ATA error handling can trigger a kernel hang or oops"). See also https://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi.git/commit/?h=fixes&id=3be8828fc507cdafe7040a3dcf361a2bcd8e305b. A pull request that includes that commit has been sent to Linus. See also https://lkml.org/lkml/2018/3/6/767.
*** Bug 198923 has been marked as a duplicate of this bug. ***
Commit 3be8828fc507 is now upstream but is not yet in any of the stable trees.
Kernels 4.15.10 and 4.14.27 include patch "scsi: core: Avoid that ATA error handling can trigger a kernel hang or oops".
Many thanks for the updates. I upgraded to 4.15.10 now and the issue I had before (as reported in the closed duplicate) appears to be fixed with that patch.
4.14.27 kernel fixes the issue I had on my system. Thank you!