Bug 198861

Summary: Regression causes kernel OOPS and hang in SCSI error report
Product: IO/Storage Reporter: ncopa
Component: SCSIAssignee: linux-scsi (linux-scsi)
Status: RESOLVED CODE_FIX    
Severity: normal CC: buschmann, bvanassche, dennis-busch, fan4326, flyser42, jskier, kernel, kionmaru, mabo, nielson.peter, sgh
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 4.14.20 Subsystem:
Regression: No Bisected commit-id:

Description ncopa 2018-02-21 15:43:42 UTC
I have my ext4 /home dir mounted on lvm partition /dev/vg1/lv_home which is on mdadm raid10 device with 3 samsung evo 850 SSD disks.

After I upgraded to 4.14.20 the computer hang with the HDD led light shining full (not blinking) as soon as I logged in (eg accesed the /home partition).

I was able to get a dmesg out which shows an OOPS. The machine hanged as sonn any IO tried to access /home and a clean poweroff was not possible.

[   14.226693] ata1.00: exception Emask 0x10 SAct 0x10000 SErr 0x400101 action 0x6 frozen
[   14.226694] ata1.00: irq_stat 0x08000000, interface fatal error
[   14.226695] ata1: SError: { RecovData UnrecovData Handshk }
[   14.226696] ata1.00: failed command: WRITE FPDMA QUEUED
[   14.226698] ata1.00: cmd 61/e0:80:58:10:5c/00:00:0a:00:00/40 tag 16 ncq dma 114688 out
[   14.226699] ata1.00: status: { DRDY }
[   14.226701] ata1: hard resetting link
[   14.537049] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[   14.538626] ata1.00: supports DRM functions and may not be fully accessible
[   14.539258] ata1.00: disabling queued TRIM support
[   14.540875] ata1.00: supports DRM functions and may not be fully accessible
[   14.541376] ata1.00: disabling queued TRIM support
[   14.542646] ata1.00: configured for UDMA/133
[   14.542662] ata1: EH complete
[   15.036750] ------------[ cut here ]------------
[   15.036764] WARNING: CPU: 0 PID: 0 at /home/buildozer/aports/main/linux-vanilla/src/linux-4.14/kernel/rcu/tree.c:2725 rcu_process_callbacks+0x370/0x421
[   15.036766] Modules linked in: ctr ccm nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo xt_addrtype xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_conntrack_ipv4 nf_nat nf_defrag_ipv4 xt_conntrack nf_conntrack libcrc32c crc32c_generic ipt_REJECT nf_reject_ipv4 xt_tcpudp br_netfilter bridge stp llc ebtable_filter ebtables overlay exportfs ip6table_filter ip6_tables iptable_filter ip_tables x_tables ipv6 bnep joydev mousedev hid_logitech_hidpp hid_logitech_dj nls_utf8 nls_cp437 vfat fat snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic arc4 input_leds btusb btrtl btbcm btintel bluetooth hid_apple hid_generic ecdh_generic iwlmvm mac80211 crct10dif_pclmul ghash_clmulni_intel iwlwifi pcbc coretemp tun deadline_iosched cfg80211 snd_hda_intel aesni_intel
[   15.036845]  igb kvm_intel aes_x86_64 snd_hda_codec hwmon crypto_simd kvm dca glue_helper rfkill snd_hda_core cryptd irqbypass intel_cstate snd_hwdep intel_rapl_perf snd_pcm af_packet snd_timer e1000e psmouse snd serio_raw ptp soundcore pcspkr iTCO_wdt pps_core shpchp iTCO_vendor_support mei_me mei intel_pch_thermal thermal fan evdev efivarfs raid10 usbhid hid dm_mod dax crc32_pclmul crc32c_intel ahci libahci libata i2c_i801 xhci_pci xhci_hcd usbcore i915 video drm_kms_helper drm intel_gtt agpgart i2c_algo_bit fb_sys_fops syscopyarea sysfillrect sysimgblt i2c_core nvme nvme_core wmi button loop raid1 ext4 crc16 mbcache jbd2 sd_mod scsi_mod
[   15.036928] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.14.20-0-vanilla #1-Alpine
[   15.036930] Hardware name: Gigabyte Technology Co., Ltd. To be filled by O.E.M./Z170N-WIFI-CF, BIOS F6 10/26/2015
[   15.036933] task: fffffffface124c0 task.stack: fffffffface00000
[   15.036938] RIP: 0010:rcu_process_callbacks+0x370/0x421
[   15.036941] RSP: 0018:ffff8ff771c03f00 EFLAGS: 00010002
[   15.036944] RAX: 0000000000000000 RBX: ffff8ff771c21940 RCX: 000000018040003b
[   15.036947] RDX: ffffffffffffd801 RSI: ffff8ff771c03f10 RDI: ffff8ff771c21978
[   15.036949] RBP: fffffffface4f340 R08: 0000000000000001 R09: 0000000000000100
[   15.036951] R10: 0000000000000001 R11: 0000000000000000 R12: ffff8ff771c21978
[   15.036954] R13: fffffffface124c0 R14: 7fffffffffffffff R15: fffffffffffffff9
[   15.036957] FS:  0000000000000000(0000) GS:ffff8ff771c00000(0000) knlGS:0000000000000000
[   15.036959] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   15.036962] CR2: 00007f0ba7bdf078 CR3: 00000002afe0a006 CR4: 00000000003606b0
[   15.036964] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   15.036966] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   15.036968] Call Trace:
[   15.036973]  <IRQ>
[   15.036982]  ? rebalance_domains+0xf2/0x237
[   15.036988]  __do_softirq+0xfb/0x268
[   15.036995]  ? sched_clock+0x5/0x8
[   15.037000]  irq_exit+0x62/0xa1
[   15.037005]  smp_apic_timer_interrupt+0xbf/0xf7
[   15.037009]  apic_timer_interrupt+0x98/0xa0
[   15.037012]  </IRQ>
[   15.037017] RIP: 0010:cpuidle_enter_state+0x149/0x20d
[   15.037019] RSP: 0018:fffffffface03e80 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10
[   15.037023] RAX: ffff8ff771c20c40 RBX: ffff8ff771c28fc8 RCX: 000000000000001f
[   15.037025] RDX: 000000038041cb3c RSI: 0000000000000000 RDI: 0000000000000000
[   15.037027] RBP: 0000000000000005 R08: 00000666643562fe R09: 0000000000000014
[   15.037029] R10: fffffffface03e60 R11: 000000000000165c R12: 0000000000000000
[   15.037032] R13: fffffffface124c0 R14: fffffffface7ff60 R15: fffffffface80158
[   15.037041]  do_idle+0x11a/0x190
[   15.037046]  cpu_startup_entry+0x6f/0x71
[   15.037052]  start_kernel+0x44d/0x46d
[   15.037064]  secondary_startup_64+0xa5/0xb0
[   15.037070] Code: 8b 93 90 00 00 00 48 2b 15 6f 25 da 00 48 39 d0 7d 07 48 89 83 90 00 00 00 48 83 7b 38 00 0f 94 c2 48 85 c0 0f 94 c0 38 c2 74 02 <0f> ff 48 8b 3c 24 57 9d 0f 1f 44 00 00 4c 89 e7 e8 e5 26 00 00 
[   15.037142] ---[ end trace 486866c4a775ee92 ]---
[   22.250602] logitech-hidpp-device 0003:046D:4060.0008: HID++ 4.5 device connected.



This does not happen with 4.14.19, so I looked at the commit log and tried revert commit c561093ed6843684690436dea034af53b462cfe5 (scsi: core: Ensure that the SCSI error handler gets woken up).

With that commit reverted machine again worked as normal.

Linux ncopa-desktop 4.14.20-1-vanilla #2-Alpine SMP Wed Feb 21 15:11:34 CET 2018 x86_64 GNU/Linux

dmesg output from my kernel with the regression reverted:

[   20.910014] ata1.00: exception Emask 0x10 SAct 0x0 SErr 0x400101 action 0x6 frozen
[   20.910018] ata1.00: irq_stat 0x08000000, interface fatal error
[   20.910022] ata1: SError: { RecovData UnrecovData Handshk }
[   20.910026] ata1.00: failed command: WRITE DMA
[   20.910036] ata1.00: cmd ca/00:08:00:d0:5e/00:00:00:00:00/ed tag 2 dma 4096 out
[   20.910039] ata1.00: status: { DRDY }
[   20.910048] ata1: hard resetting link
[   21.221989] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[   21.223512] ata1.00: supports DRM functions and may not be fully accessible
[   21.225118] ata1.00: supports DRM functions and may not be fully accessible
[   21.226387] ata1.00: configured for UDMA/133
[   21.226402] ata1: EH complete
[   21.259887] ata1: limiting SATA link speed to 3.0 Gbps
[   21.259898] ata1.00: exception Emask 0x10 SAct 0x0 SErr 0x400100 action 0x6 frozen
[   21.259902] ata1.00: irq_stat 0x08000000, interface fatal error
[   21.259908] ata1: SError: { UnrecovData Handshk }
[   21.259914] ata1.00: failed command: WRITE DMA
[   21.259929] ata1.00: cmd ca/00:e0:20:10:5c/00:00:00:00:00/ea tag 3 dma 114688 out
[   21.259933] ata1.00: status: { DRDY }
[   21.259946] ata1: hard resetting link
[   21.571618] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
[   21.573260] ata1.00: supports DRM functions and may not be fully accessible
[   21.575300] ata1.00: supports DRM functions and may not be fully accessible
[   21.576892] ata1.00: configured for UDMA/133
[   21.576913] ata1: EH complete
[   22.124450] logitech-hidpp-device 0003:046D:4060.0008: HID++ 4.5 device connected.

So problem seems to happen when there are one or more ata errors.

I have seen those errors on 4.9 kernel too, but they don't seem to do any harm, until 4.14.20 kernel.
Comment 1 ncopa 2018-02-22 17:20:12 UTC
Those two patches fixes the issue:

https://patchwork.kernel.org/patch/10181165/
https://patchwork.kernel.org/patch/10233617/
Comment 2 JMD 2018-03-04 09:24:27 UTC
Getting a very similar Warning on 4.15.7 when dealing with an old unhappy disk. TL;DR the above patches applied cleanly and fixed this on 4.15.7 for me as well.

Mar  3 20:42:00 redsun kernel: [  750.861060] ata4.00: exception Emask 0x11 SAct 0x80000 SErr 0x680100 action 0x6 frozen
Mar  3 20:42:00 redsun kernel: [  750.861063] ata4.00: irq_stat 0x48000008, interface fatal error
Mar  3 20:42:00 redsun kernel: [  750.861066] ata4: SError: { UnrecovData 10B8B BadCRC Handshk }
Mar  3 20:42:00 redsun kernel: [  750.861069] ata4.00: failed command: READ FPDMA QUEUED
Mar  3 20:42:00 redsun kernel: [  750.861078] ata4.00: cmd 60/00:98:c0:c5:c3/01:00:02:00:00/40 tag 19 ncq dma 131072 in
Mar  3 20:42:00 redsun kernel: [  750.861078]          res 40/00:98:c0:c5:c3/00:00:02:00:00/40 Emask 0x10 (ATA bus error)
Mar  3 20:42:00 redsun kernel: [  750.861080] ata4.00: status: { DRDY }
Mar  3 20:42:39 redsun kernel: [  789.396122] WARNING: CPU: 3 PID: 1926 at /usr/src/linux-stable/kernel/rcu/tree.c:2792 rcu_process_callbacks+0x482/0x4a0
Mar  3 20:42:39 redsun kernel: [  789.396124] Modules linked in: cmac cifs fscache xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp tun bridge ebtable_filter ebtables ip6table_f
ilter ip6_tables iptable_filter ip_tables x_tables ipv6 cfg80211 rfkill 8021q garp stp llc fuse amdkfd amd_iommu_v2 amdgpu snd_hda_codec_realtek snd_hda_codec_generic snd_hda_codec_hdmi snd_hda_intel snd_hda_codec snd_hda_core chash ttm drm_kms_helper snd_hwdep i2c_dev uas drm igb snd_pcm snd_timer agpgart wmi_bmof sn
d mxm_wmi kvm_amd usb_storage fb_sys_fops syscopyarea sysfillrect kvm soundcore ptp sysimgblt pps_core dca i2c_algo_bit alx mdio ccp crct10dif_pclmul crc32_pclmul evdev crc32c_intel shpchp i2c_piix4 i2c_core
Mar  3 20:42:39 redsun kernel: [  789.396173]  efi_pstore efivars ghash_clmulni_intel k10temp hwmon button wmi acpi_cpufreq loop hid_generic usbhid hid uhci_hcd ohci_pci ohci_hcd ehci_hcd xhci_pci xhci_hcd
Mar  3 20:42:39 redsun kernel: [  789.396188] CPU: 3 PID: 1926 Comm: Web Content Tainted: G        W        4.15.7 #21
Mar  3 20:42:39 redsun kernel: [  789.396189] Hardware name: Gigabyte Technology Co., Ltd. AX370-Gaming 5/AX370-Gaming 5, BIOS F21 02/08/2018
Mar  3 20:42:39 redsun kernel: [  789.396193] RIP: 0010:rcu_process_callbacks+0x482/0x4a0
Mar  3 20:42:39 redsun kernel: [  789.396194] RSP: 0000:ffff94857ecc3f10 EFLAGS: 00010002
Mar  3 20:42:39 redsun kernel: [  789.396197] RAX: ffffffffffffd800 RBX: ffff94857ece1800 RCX: 0000000000004601
Mar  3 20:42:39 redsun kernel: [  789.396198] RDX: 0000000000000001 RSI: ffff94857ecc3f18 RDI: ffff94857ece1838
Mar  3 20:42:39 redsun kernel: [  789.396199] RBP: ffffffffa983cb40 R08: 0000000000024f30 R09: ffffffffa80fbdcb
Mar  3 20:42:39 redsun kernel: [  789.396200] R10: fffff3a0c85f5e00 R11: 0000000000000326 R12: ffff94857ece1838
Mar  3 20:42:39 redsun kernel: [  789.396201] R13: 0000000000000246 R14: 7fffffffffffffff R15: fffffffffffffffc
Mar  3 20:42:39 redsun kernel: [  789.396204] FS:  00007f0492a279c0(0000) GS:ffff94857ecc0000(0000) knlGS:0000000000000000
Mar  3 20:42:39 redsun kernel: [  789.396205] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar  3 20:42:39 redsun kernel: [  789.396207] CR2: 00007f04656dc000 CR3: 000000026ceaa000 CR4: 00000000003406e0
Mar  3 20:42:39 redsun kernel: [  789.396208] Call Trace:
Mar  3 20:42:39 redsun kernel: [  789.396211]  <IRQ>
Mar  3 20:42:39 redsun kernel: [  789.396216]  __do_softirq+0xe0/0x2dd
Mar  3 20:42:39 redsun kernel: [  789.396220]  irq_exit+0xae/0xb0
Mar  3 20:42:39 redsun kernel: [  789.396223]  smp_apic_timer_interrupt+0x76/0x130
Mar  3 20:42:39 redsun kernel: [  789.396225]  apic_timer_interrupt+0x7d/0x90
Mar  3 20:42:39 redsun kernel: [  789.396227]  </IRQ>
Mar  3 20:42:39 redsun kernel: [  789.396229] RIP: 0033:0x7f0490e38707
Mar  3 20:42:39 redsun kernel: [  789.396230] RSP: 002b:00007ffe7bb45690 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff11
Mar  3 20:42:39 redsun kernel: [  789.396232] RAX: 00007f045c4af910 RBX: 00007f047fc08000 RCX: ffffffffff78f144
Mar  3 20:42:39 redsun kernel: [  789.396233] RDX: 00007f0490e386f0 RSI: 00007f045c4af940 RDI: 00007f047fc08000
Mar  3 20:42:39 redsun kernel: [  789.396234] RBP: 00007ffe7bb46a40 R08: 00007f045f77f800 R09: 0000000000000000
Mar  3 20:42:39 redsun kernel: [  789.396235] R10: 000000000000000a R11: 00007f045fc113ca R12: 00007f045c4af970
Mar  3 20:42:39 redsun kernel: [  789.396237] R13: 00007f045c4b0078 R14: 00007ffe7bb456f0 R15: 0000000000000000
Mar  3 20:42:39 redsun kernel: [  789.396238] Code: 8b 1d e3 66 86 01 48 85 db 74 1b 48 8b 03 48 8b 7b 08 48 83 c3 18 48 89 ee e8 6b 6f f0 00 48 8b 03 48 85 c0 75 e8 e9 b6 fb ff ff <0f> 0b e9 df fd ff ff 0f 0b e9 d6 fc ff ff 4c 89 ee 4c 89 e7 e8 
Mar  3 20:42:39 redsun kernel: [  789.396279] ---[ end trace 9f6cbfdd9555c3ff ]---

The two patches linked above applied cleanly, and the link can now be reset and the disk can continue to be used:

Mar  4 01:19:21 redsun kernel: [  500.894429] ata4.00: exception Emask 0x11 SAct 0x40000001 SErr 0x680100 action 0x6 frozen
Mar  4 01:19:21 redsun kernel: [  500.894432] ata4.00: irq_stat 0x48000008, interface fatal error
Mar  4 01:19:21 redsun kernel: [  500.894435] ata4: SError: { UnrecovData 10B8B BadCRC Handshk }
Mar  4 01:19:21 redsun kernel: [  500.894438] ata4.00: failed command: READ FPDMA QUEUED
Mar  4 01:19:21 redsun kernel: [  500.894446] ata4.00: cmd 60/00:00:28:f8:7a/01:00:05:00:00/40 tag 0 ncq dma 131072 in
Mar  4 01:19:21 redsun kernel: [  500.894446]          res 40/00:00:28:f8:7a/00:00:05:00:00/40 Emask 0x10 (ATA bus error)
Mar  4 01:19:21 redsun kernel: [  500.894448] ata4.00: status: { DRDY }
Mar  4 01:19:21 redsun kernel: [  500.894451] ata4.00: failed command: READ FPDMA QUEUED
Mar  4 01:19:21 redsun kernel: [  500.894459] ata4.00: cmd 60/00:f0:28:f7:7a/01:00:05:00:00/40 tag 30 ncq dma 131072 in
Mar  4 01:19:21 redsun kernel: [  500.894459]          res 40/00:00:28:f8:7a/00:00:05:00:00/40 Emask 0x10 (ATA bus error)
Mar  4 01:19:21 redsun kernel: [  500.894461] ata4.00: status: { DRDY }
Mar  4 01:22:04 redsun kernel: [  664.145825] ata4.00: exception Emask 0x11 SAct 0x400000 SErr 0x680100 action 0x6 frozen
Mar  4 01:22:04 redsun kernel: [  664.145828] ata4.00: irq_stat 0x48000008, interface fatal error
Mar  4 01:22:04 redsun kernel: [  664.145831] ata4: SError: { UnrecovData 10B8B BadCRC Handshk }
Mar  4 01:22:04 redsun kernel: [  664.145834] ata4.00: failed command: READ FPDMA QUEUED
Mar  4 01:22:04 redsun kernel: [  664.145842] ata4.00: cmd 60/30:b0:e8:26:4e/00:00:03:00:00/40 tag 22 ncq dma 24576 in
Mar  4 01:22:04 redsun kernel: [  664.145842]          res 40/00:b0:e8:26:4e/00:00:03:00:00/40 Emask 0x10 (ATA bus error)
Mar  4 01:22:04 redsun kernel: [  664.145844] ata4.00: status: { DRDY }
Mar  4 01:22:25 redsun kernel: [  685.225653] ata4: limiting SATA link speed to 3.0 Gbps
Mar  4 01:22:25 redsun kernel: [  685.225658] ata4.00: exception Emask 0x11 SAct 0x1800 SErr 0x600100 action 0x6 frozen
Mar  4 01:22:25 redsun kernel: [  685.225660] ata4.00: irq_stat 0x48000008, interface fatal error
Mar  4 01:22:25 redsun kernel: [  685.225663] ata4: SError: { UnrecovData BadCRC Handshk }
Mar  4 01:22:25 redsun kernel: [  685.225666] ata4.00: failed command: READ FPDMA QUEUED
Mar  4 01:22:25 redsun kernel: [  685.225674] ata4.00: cmd 60/00:58:90:96:d6/01:00:08:00:00/40 tag 11 ncq dma 131072 in
Mar  4 01:22:25 redsun kernel: [  685.225674]          res 40/00:58:90:96:d6/00:00:08:00:00/40 Emask 0x10 (ATA bus error)
Mar  4 01:22:25 redsun kernel: [  685.225676] ata4.00: status: { DRDY }
Mar  4 01:22:25 redsun kernel: [  685.225678] ata4.00: failed command: READ FPDMA QUEUED
Mar  4 01:22:25 redsun kernel: [  685.225686] ata4.00: cmd 60/00:60:90:97:d6/01:00:08:00:00/40 tag 12 ncq dma 131072 in
Mar  4 01:22:25 redsun kernel: [  685.225686]          res 40/00:58:90:96:d6/00:00:08:00:00/40 Emask 0x10 (ATA bus error)
Mar  4 01:22:25 redsun kernel: [  685.225688] ata4.00: status: { DRDY }
Comment 3 Bart Van Assche 2018-03-07 17:12:00 UTC
This has been fixed by the following commit, which is in James' tree but not yet in Linus' tree nor any of the stable trees: 3be8828fc507 ("scsi: core: Avoid that ATA error handling can trigger a kernel hang or oops"). See also https://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi.git/commit/?h=fixes&id=3be8828fc507cdafe7040a3dcf361a2bcd8e305b. A pull request that includes that commit has been sent to Linus. See also https://lkml.org/lkml/2018/3/6/767.
Comment 4 Theodore Tso 2018-03-08 03:50:31 UTC
*** Bug 198923 has been marked as a duplicate of this bug. ***
Comment 5 Bart Van Assche 2018-03-09 15:56:38 UTC
Commit 3be8828fc507 is now upstream but is not yet in any of the stable trees.
Comment 6 Bart Van Assche 2018-03-15 13:50:29 UTC
Kernels 4.15.10 and 4.14.27 include patch "scsi: core: Avoid that ATA error handling can trigger a kernel hang or oops".
Comment 7 Dennis Busch 2018-03-16 19:23:05 UTC
Many thanks for the updates. I upgraded to 4.15.10 now and the issue I had before (as reported in the closed duplicate) appears to be fixed with that patch.
Comment 8 ncopa 2018-03-25 21:51:12 UTC
4.14.27 kernel fixes the issue I had on my system. Thank you!