Bug 89261

Summary: ata1.00: failed command: READ FPDMA QUEUED
Product: IO/Storage Reporter: Sumit Rai (sumitrai96)
Component: Serial ATAAssignee: Tejun Heo (tj)
Status: NEW ---    
Severity: normal CC: alan, am1n, spamtrap1, sumitrai96, szg00000
Priority: P1    
Hardware: Intel   
OS: Linux   
Kernel Version: 3.17.4-300.fc21.x86_64 Subsystem:
Regression: No Bisected commit-id:
Attachments: dmesg
lspci -nn
hdparm -I from Crucial MX100 512GB

Description Sumit Rai 2014-12-05 04:25:58 UTC
I am seeing the below error in dmesg:

[ 3657.844533] ata1.00: exception Emask 0x0 SAct 0x8000000 SErr 0x50000 action 0x6 frozen
[ 3657.844543] ata1: SError: { PHYRdyChg CommWake }
[ 3657.844549] ata1.00: failed command: READ FPDMA QUEUED
[ 3657.844559] ata1.00: cmd 60/20:d8:98:84:52/00:00:0c:00:00/40 tag 27 ncq 16384 in
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 3657.844564] ata1.00: status: { DRDY }
[ 3657.844570] ata1: hard resetting link
[ 3658.149326] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[ 3658.149803] ata1.00: ACPI cmd ef/10:03:00:00:00:a0 (SET FEATURES) filtered out
[ 3658.149840] ata1.00: supports DRM functions and may not be fully accessible
[ 3658.158491] ata1.00: ACPI cmd ef/10:03:00:00:00:a0 (SET FEATURES) filtered out
[ 3658.158528] ata1.00: supports DRM functions and may not be fully accessible
[ 3658.166701] ata1.00: configured for UDMA/133
[ 3658.177338] ata1.00: device reported invalid CHS sector 0
[ 3658.177381] ata1: EH complete

Machine Details:
[    0.000000] DMI: Apple Inc. MacBookPro8,1/Mac-94245B3640C91C81, BIOS    MBP81.88Z.0047.B27.1201241646 01/24/12
OS: Fedora 21
SSD: Crucial_CT512MX100

[root@localhost ~]# smartctl -H /dev/sda
smartctl 6.2 2014-07-16 r3952 [x86_64-linux-3.17.4-300.fc21.x86_64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
Comment 1 Sumit Rai 2014-12-05 04:28:10 UTC
Created attachment 159761 [details]
dmesg
Comment 2 Sumit Rai 2014-12-05 04:28:43 UTC
Created attachment 159771 [details]
lspci -nn
Comment 3 Alan 2014-12-08 15:26:36 UTC
[ 3657.844533] ata1.00: exception Emask 0x0 SAct 0x8000000 SErr 0x50000 action 0x6 frozen
[ 3657.844543] ata1: SError: { PHYRdyChg CommWake }

That is the device disconnecting and reconnecting by the look of things.  Check if there is newer firmware available for your drive,but failing that you may need to see whether running with NCQ or with trim disabled helps.
Comment 4 Sumit Rai 2014-12-09 21:19:43 UTC
(In reply to Alan from comment #3)
> [ 3657.844533] ata1.00: exception Emask 0x0 SAct 0x8000000 SErr 0x50000
> action 0x6 frozen
> [ 3657.844543] ata1: SError: { PHYRdyChg CommWake }
> 
> That is the device disconnecting and reconnecting by the look of things. 
> Check if there is newer firmware available for your drive,but failing that
> you may need to see whether running with NCQ or with trim disabled helps.

I havn't been able to get newer firmware from Manufacturer.
Some folks facing similar issue on same ssd model at https://github.com/linrunner/TLP/issues/84 are suggesting it's ALPM issue, and chaging link_power_management_policy to medium or max helps.
I am testing with echo max_performance > /sys/class/scsi_host/host0/link_power_management_policy to see if it fixes the issue.

NCQ and TRIM are both enabled.
I am reluctant to disable NCQ since It might significantly lower performance.
I will try to reproduce with TRIM or NCQ disabled if issue occurs again in max_power.
Comment 5 Alan 2014-12-10 08:13:42 UTC
If you do figure out the trigger then please update the bug accordingly and include the hdparm -I output for the drive. That way if we've got multiple reporters we can blacklist the problem features of the drive
Comment 6 Sumit Rai 2014-12-28 01:58:13 UTC
Changing ALPM settings to max performance does seem reduce the frequency of errors to significant degree, but it doesn't go away completely.
I am trying to reproduce it with min_power and ncq disabled.
Comment 7 Sumit Rai 2015-01-07 09:23:23 UTC
Disabling ncq with libata.force=noncq doesn't resolve this issue either with min_power or max_performance.

[srai@localhost ~]$ dmesg | grep ncq
[    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-3.17.4-300.fc21.x86_64 root=/dev/mapper/fedora_20-root ro selinux=0 kvm-intel.nested=1 libata.force=noncq
[    0.000000] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-3.17.4-300.fc21.x86_64 root=/dev/mapper/fedora_20-root ro selinux=0 kvm-intel.nested=1 libata.force=noncq
[    2.719890] ahci 0000:00:1f.2: flags: 64bit ncq sntf pm led clo pio slum part ems apst 
[    3.035244] ata1.00: FORCE: horkage modified (noncq)
[    3.042855] ata2.00: FORCE: horkage modified (noncq)

I am moving on to see if disabling TRIM helps.
Comment 8 Steve Holland 2015-06-03 22:28:55 UTC
Created attachment 178781 [details]
hdparm -I from Crucial MX100 512GB

I am observing the same problem on a Thinkpad (S1) Yoga with 
the same Crucial MX100 512G SSD as Sumit. 

In my case the initial error is
ata1: SError: { CommWake }
without the PHYRdyChange, but everything else is similar. 

For me changing the ALPM setting to max performance seems to have  solved the issue. (Upgrading thd drive firmware might have helped too). 

Based on the comments of multiple users in the linrunner bug report
https://github.com/linrunner/TLP/issues/84
and also multiple reports of a similar problem under Windows on Crucial's own support forums, I would think a blacklisting would be in order.