Bug 71371
Summary: | [PATCH]Crucial M500, broken "queued TRIM" support | ||
---|---|---|---|
Product: | IO/Storage | Reporter: | Marios Andreopoulos (opensource) |
Component: | Serial ATA | Assignee: | Tejun Heo (tj) |
Status: | RESOLVED OBSOLETE | ||
Severity: | normal | CC: | ag+services, alan, anton.wd, bdowning, faulty.lee, frederik.hofe, hmh, hugh, i, jacobecc9, kernel, kernel, maarten.fonville, maugarta, mkp, pali, ravikiran, rhill, rickyxt, smf-linux, stevenhoneyman, sumitrai96, szg00000, tcl_de, v10lator, woggling, xarafaxz |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | >=3.12 | Subsystem: | |
Regression: | No | Bisected commit-id: | |
Attachments: |
libata-core.c patch
dmesg_ioerr |
Thanks. See Documentation/SubmittingPatches and send it to linux-ide@vger.kernel.org with a Signed-off-by line and it should get into the kernel. Alan

Thank you too. It was the first patch I ever sent and it was accepted! Feel free to change the status to resolved if you see fit.

Once it's hit the kernel proper. That way it is less likely to get lost.

Crucial recently released a new firmware update for the M500, MU05, see http://forum.crucial.com/t5/Solid-State-Drives-SSD/Feedback-Thread-Firmware-MU05-for-the-M500/td-p/146872 Do you know if this update fixes Queued TRIM?

Looks like it is fixed in MU05 and a patch is ready in libata/for-3.15-fixes. See: http://patchwork.ozlabs.org/patch/336214/

(In reply to Pali Rohár from comment #5)
> Looks like it is fixed in MU05 and patch is ready in libata/for-3.15-fixes
> See: http://patchwork.ozlabs.org/patch/336214/
Great. Thank you!

I really don't know *why* this happens, but I have two Intel NUCs with Crucial M500 disks of 128Gb with their new MU05 firmware. When I run kernel 3.14.4 from the kernel-ppa-mainline of Ubuntu, this 'fix' to the blacklist is included. The expectation would be that Queued TRIM works and no errors appear. Instead, the exact errors from the first post *do* appear and the disk acts very slow. When I switch back to an older kernel (with the old blacklists) the problems disappear. So, there are several hypotheses:
1) Crucial did not fix the bug for my specific model (product: Crucial_CT120M50 version: MU05)
2) Somehow magically the firmware on my SSD is broken on both of my NUCs
3) Enabling Queued TRIM has exposed another, previously unknown bug.
It would be best if more people with other M500 SSDs could check whether they also have this problem. If so, the blacklist would have to be made stricter again and/or Crucial will have to be notified about this problem.

I have a Crucial_CT480M500SSD1 with firmware MU05 and the problem still persists. I have to revert to the patch as it was before the firmware exclusion. If not, it can get you REAL TROUBLE after deleting large files, BROKEN PARTITION TABLE included.

We're going to blacklist all firmware versions until we figure this out. However, it would be great if people that experienced problems could forward me any error messages from syslog so I can forward them to Micron.

Created attachment 137921 [details]
dmesg_ioerr
Just regular dmesg 2 minutes after boot.
I got this in my dmesg when the errors occurred: [ 51.802288] ata4.00: exception Emask 0x1 SAct 0x34000000 SErr 0x0 action 0x6 frozen [ 51.802294] ata4.00: irq_stat 0x40000008 [ 51.802299] ata4.00: failed command: READ FPDMA QUEUED [ 51.802307] ata4.00: cmd 60/08:d0:88:da:9f/00:00:09:00:00/40 tag 26 ncq 4096 in [ 51.802307] res 40/00:e4:00:00:00/00:00:00:00:00/a0 Emask 0x1 (device error) [ 51.802311] ata4.00: status: { DRDY } [ 51.802314] ata4.00: failed command: SEND FPDMA QUEUED [ 51.802319] ata4.00: cmd 64/01:e0:00:00:00/00:00:00:00:00/a0 tag 28 ncq 512 out [ 51.802319] res 40/00:e4:00:00:00/00:00:00:00:00/a0 Emask 0x1 (device error) [ 51.802322] ata4.00: status: { DRDY } [ 51.802324] ata4.00: failed command: READ FPDMA QUEUED [ 51.802329] ata4.00: cmd 60/08:e8:f0:49:47/00:00:07:00:00/40 tag 29 ncq 4096 in [ 51.802329] res 40/00:00:00:00:00/00:00:00:00:00/40 Emask 0x5 (timeout) [ 51.802332] ata4.00: status: { DRDY } [ 51.802337] ata4: hard resetting link [ 52.122361] ata4: SATA link up 6.0 Gbps (SStatus 133 SControl 300) [ 52.122808] ata4.00: supports DRM functions and may not be fully accessible [ 52.129420] ata4.00: supports DRM functions and may not be fully accessible [ 52.135470] ata4.00: configured for UDMA/133 [ 52.135479] ata4.00: device reported invalid CHS sector 0 [ 52.135482] ata4.00: device reported invalid CHS sector 0 [ 52.135498] sd 3:0:0:0: [sda] [ 52.135501] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE [ 52.135504] sd 3:0:0:0: [sda] [ 52.135505] Sense Key : Aborted Command [current] [descriptor] [ 52.135510] Descriptor sense data with sense descriptors (in hex): [ 52.135512] 72 0b 00 00 00 00 00 0c 00 0a 80 00 00 00 00 00 [ 52.135525] 00 00 00 00 [ 52.135530] sd 3:0:0:0: [sda] [ 52.135532] Add. 
Sense: No additional sense information [ 52.135535] sd 3:0:0:0: [sda] CDB: [ 52.135537] Write same(16): 93 08 00 00 00 00 02 25 48 48 00 00 00 08 00 00 [ 52.135552] end_request: I/O error, dev sda, sector 35997768 [ 52.135571] EXT4-fs (sda1): discard request in group:135 block:24585 count:1 failed with -5 [ 52.135572] ata4: EH complete I just booted into 3.14.5-1-ARCH (previously running 3.14.1-1-ARCH) and almost immediately got this: [ 170.458943] ata1.00: exception Emask 0x1 SAct 0xc00 SErr 0x0 action 0x6 frozen [ 170.458948] ata1.00: irq_stat 0x40000008 [ 170.458952] ata1.00: failed command: SEND FPDMA QUEUED [ 170.458958] ata1.00: cmd 64/01:50:00:00:00/00:00:00:00:00/a0 tag 10 ncq 512 out res 40/00:54:00:00:00/00:00:00:00:00/a0 Emask 0x1 (device error) [ 170.458961] ata1.00: status: { DRDY } [ 170.458963] ata1.00: failed command: READ FPDMA QUEUED [ 170.458968] ata1.00: cmd 60/08:58:28:fd:93/00:00:07:00:00/40 tag 11 ncq 4096 in res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x5 (timeout) [ 170.458971] ata1.00: status: { DRDY } [ 170.458975] ata1: hard resetting link [ 170.779052] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300) [ 170.779426] ACPI Error: [DSSP] Namespace lookup failure, AE_NOT_FOUND (20131218/psargs-359) [ 170.779434] ACPI Error: Method parse/execution failed [\_SB_.PCI0.SAT0.SPT0._GTF] (Node ffff8807fe053230), AE_NOT_FOUND (20131218/psparse-536) [ 170.779478] ata1.00: supports DRM functions and may not be fully accessible [ 170.786162] ACPI Error: [DSSP] Namespace lookup failure, AE_NOT_FOUND (20131218/psargs-359) [ 170.786170] ACPI Error: Method parse/execution failed [\_SB_.PCI0.SAT0.SPT0._GTF] (Node ffff8807fe053230), AE_NOT_FOUND (20131218/psparse-536) [ 170.786212] ata1.00: supports DRM functions and may not be fully accessible [ 170.792588] ata1.00: configured for UDMA/133 [ 170.792595] ata1.00: device reported invalid CHS sector 0 [ 170.792597] ata1.00: device reported invalid CHS sector 0 [ 170.792608] sd 0:0:0:0: [sda] [ 
170.792610] Result: hostbyte=0x00 driverbyte=0x08 [ 170.792612] sd 0:0:0:0: [sda] [ 170.792614] Sense Key : 0xb [current] [descriptor] [ 170.792617] Descriptor sense data with sense descriptors (in hex): [ 170.792619] 72 0b 00 00 00 00 00 0c 00 0a 80 00 00 00 00 00 [ 170.792627] 00 00 00 00 [ 170.792631] sd 0:0:0:0: [sda] [ 170.792632] ASC=0x0 ASCQ=0x0 [ 170.792634] sd 0:0:0:0: [sda] CDB: [ 170.792636] cdb[0]=0x93: 93 08 00 00 00 00 09 67 2b c8 00 00 00 08 00 00 [ 170.792646] end_request: I/O error, dev sda, sector 157756360 [ 170.792667] EXT4-fs (dm-1): discard request in group:469 block:25465 count:1 failed with -5 [ 170.792675] ata1: EH complete I updated my SSD to MU05 a while ago: [ 1.095937] ata1.00: ATA-9: Crucial_CT480M500SSD1, MU05, max UDMA/133 [ 1.095941] ata1.00: 937703088 sectors, multi 16: LBA48 NCQ (depth 31/32), AA I remounted everything nodiscard, hopefully nothing has already gotten corrupted. @Martin K. Petersen: Any news about fixing this FW bug? I forwarded the relevant logs and information to my contacts at Micron. They were looking into it but I haven't gotten any updates recently. I'll poke them again... @Martin K. Petersen: But have they at least confirmed that they are actively working on it? I payed about 500 bucks for my 1TB M500 so I should be able to expect some support from them. I've checked that 3.15.4 is still blacklisting all M500 firmware, which is a good thing. I'm getting M500 over the weekend. Btw if someone can share the steps to replicate this, I'll give it a try. @Martin K. Petersen: anything new? @Martin K. Petersen. I was wondering if you heard anything back from Micron or if there is at least an official acknowledgement that they are looking into it? I have three M500's (1x960Gb, 2x 480GB) and would be nice to know if they are definitely looking into it. Such a shame that this affects only Crucial drives.. My worry is they may not be actively working on M500 as they released two drives after M500.. 
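For anyone collecting logs to forward, as requested earlier in this thread, a small filter can summarise which commands the libata error handler reported as failed. This is only a sketch: it assumes dmesg output with one message per line in the `[ seconds.micros] ataN.NN: failed command: ...` shape seen in the excerpts above, and the function name and sample text are illustrative. In the excerpts, the failing discard shows up as `SEND FPDMA QUEUED` (command byte 64 in the `cmd` line).

```python
import re
from collections import Counter

# Matches lines such as:
#   [   51.802299] ata4.00: failed command: READ FPDMA QUEUED
FAILED_CMD = re.compile(
    r"^\[\s*\d+\.\d+\]\s+(ata\d+(?:\.\d+)?): failed command: (.+?)\s*$",
    re.MULTILINE,
)


def count_failed_commands(dmesg_text):
    """Count (port, command) pairs taken from 'failed command:' lines."""
    return Counter(FAILED_CMD.findall(dmesg_text))


# Sample lines copied from one of the dmesg excerpts in this thread.
SAMPLE = """\
[ 51.802299] ata4.00: failed command: READ FPDMA QUEUED
[ 51.802314] ata4.00: failed command: SEND FPDMA QUEUED
[ 51.802324] ata4.00: failed command: READ FPDMA QUEUED
[ 51.802337] ata4: hard resetting link
"""

if __name__ == "__main__":
    for (port, command), count in count_failed_commands(SAMPLE).items():
        print(port, command, count)
```

Feeding it the text of `dmesg` (saved to a file, for example) gives a quick view of whether queued TRIM commands are among the failures before the full log is attached.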
IMHO if Crucial isn't actively working on the M5x0 firmware, they should just release a small firmware update that only disables queued TRIM and be done with it.

I have had a Crucial M550 for almost a month, but TBH there is no firmware available for it on the website.

I've just started getting this issue since upgrading from 3.16.3 -> 3.17.0.

Previously, in 3.16.3, I had no issues (because of the blacklist):
[ 3.879291] ata5.00: disabling queued TRIM support
[ 3.879294] ata5.00: ATA-9: Crucial_CT240M500SSD3, MU05, max UDMA/133

Now, in 3.17.0, I don't get the same message about disabling TRIM (and instead get the "discard request failed" errors as above).

Comparing the two versions of libata-core.c I noticed this, which looks strange (i.e. backwards)... but I'll take a closer look:

@@ -4261,10 +4327,10 @@
 	ata_id_c_string(dev->id, model_rev, ATA_ID_FW_REV, sizeof(model_rev));
 	while (ad->model_num) {
-		if (glob_match(model_num, ad->model_num)) {
+		if (!glob_match(model_num, ad->model_num)) {
 			if (ad->model_rev == NULL)
 				return ad->horkage;
-			if (glob_match(model_rev, ad->model_rev))
+			if (!glob_match(model_rev, ad->model_rev))
 				return ad->horkage;
 		}
 		ad++;

This looks like a regression somehow related to the introduction of glob.c with commit 428ac5fc056e06dc0b4ed82d5979add9a8c62b35.

I tried a few different things, and could fix the issue by reverting just that change (this way was easier to test!):

$ cp ../linux-3.16.3/drivers/ata/libata-core.c drivers/ata/libata-core.c
$ sed -i 's/glob_match/&_old/g' drivers/ata/libata-core.c

so glob.c remained, but libata-core used the old function. This worked: NCQ TRIM got disabled as it should.

I'm not sure where to go from here - glob.c works fine with the self tests, and that's the only function that has changed between the two releases. I tagged an "else" on the end of the part that should disable NCQ TRIM (to make sure it wasn't trying to disable it and just failing) - the if statement returns false.
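To make the lookup loop in the diff easier to reason about, here is a small userspace model of it. It is a sketch, not the kernel code: Python's `fnmatch.fnmatchcase` stands in for the kernel's `glob_match`, the flag value and table entries are illustrative, and the `swapped_args` helper is a hypothetical failure mode chosen for demonstration (this comment does not establish the actual root cause of the 3.17.0 regression).

```python
from fnmatch import fnmatchcase

NO_NCQ_TRIM = 0x1  # illustrative flag value, not the kernel's constant

# (model pattern, firmware pattern or None, flags).
# Illustrative entries, loosely based on patterns quoted in this thread.
BLACKLIST = [
    ("Micron_M500*", None, NO_NCQ_TRIM),
    ("Crucial_CT???M500SSD*", None, NO_NCQ_TRIM),
    ("Crucial_CT???M550SSD*", "MU01*", NO_NCQ_TRIM),
]


def lookup_flags(model, firmware, match=fnmatchcase):
    """Return the flags of the first entry matching model and firmware.

    match(text, pattern) must return True when text matches pattern.
    An entry with no firmware pattern applies to every firmware revision.
    """
    for model_pat, fw_pat, flags in BLACKLIST:
        if match(model, model_pat):
            if fw_pat is None or match(firmware, fw_pat):
                return flags
    return 0


def swapped_args(text, pattern):
    """Hypothetical bug: the matcher receives its arguments in the wrong order."""
    return fnmatchcase(pattern, text)


if __name__ == "__main__":
    print(lookup_flags("Crucial_CT240M500SSD3", "MU05"))
    print(lookup_flags("Crucial_CT240M500SSD3", "MU05", match=swapped_args))
```

With the correct matcher the M500 is found in the table; with the arguments exchanged the model string is treated as a wildcard-free pattern, nothing matches, and the lookup silently returns no flags. That is consistent with the symptom described here (no "disabling queued TRIM support" message), but any change in the matcher's return convention or argument order would need to be checked against the real source.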
There's now a patch to fix the 3.17.0 libata-core blacklist bug, available here for people needing it before the merge: http://www.spinics.net/lists/linux-ide/msg49669.html

Hello. Does this bug affect the ADATA SP920 too? It seems to be a mostly identical SSD.

I was also getting similar errors on a Crucial_CT512MX100 (FYI: https://bugzilla.kernel.org/show_bug.cgi?id=89261). Someone suggested it's an ALPM issue, so after setting the SATA link power management policy to max_performance at all times (even on battery power), I am no longer getting the error.

echo max_performance > /sys/class/scsi_host/host0/link_power_management_policy

Can anyone please check if this fixes the problem for you?

Sumit: That's really good data. I'll pass that along to the Micron folks.

There's a new firmware for M550 SSDs. The changelog reads: "Corrected error handling NCQ Trim Commands". So is anyone able to confirm that M550 SSDs with MU02 firmware can safely be removed from the blacklist?

If only Crucial would please also fix this for the M500. The hardware is not *that* old that it wouldn't be expected to get new firmware, especially for a crucial error like this.

On Wed, Mar 11, 2015, at 09:12, bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=71371
>
> --- Comment #28 from Maarten Fonville <maarten.fonville@gmail.com> ---
> If only please Crucial would also fix this fo the M500, the hardware is not
> *that* old that it wouldn't be expected to get new firmware, especially for a
> crucial error like this.
Well, they just issued a fix for the MX100, with mostly the same changelog as the one for the M550. It is supposed to fix the NCQ TRIM horkage on the MX100 as well as ALPM-related issues. There is some hope they will get to the M500. But M500 owners should make a pest of themselves demanding the fix through the official Crucial support channels, I think, if it is to become reality.
(In reply to Thomas from comment #27) > So is anyone able to confirm that M550 SSDs with MU02 firmware can safely be > removed from the blacklist? I just decided to try this for myself, so I applied this small patch (warning: quick&dirty, not meaned to fix this bug, might cause data corruption if Crucial SSDs other than the MD550 are installed) : diff -Nru linux-3.19.old/drivers/ata/libata-core.c linux-3.19/drivers/ata/libata-core.c --- linux-3.19.old/drivers/ata/libata-core.c 2015-02-09 03:54:22.000000000 +0100 +++ linux-3.19/drivers/ata/libata-core.c 2015-03-11 12:29:15.196686954 +0100 @@ -4235,7 +4235,7 @@ /* devices that don't properly handle queued TRIM commands */ { "Micron_M[56]*", NULL, ATA_HORKAGE_NO_NCQ_TRIM | ATA_HORKAGE_ZERO_AFTER_TRIM, }, - { "Crucial_CT*SSD*", NULL, ATA_HORKAGE_NO_NCQ_TRIM, }, + { "Crucial_CT???M550SSD*", "MU01*", ATA_HORKAGE_NO_NCQ_TRIM, }, /* * As defined, the DRAT (Deterministic Read After Trim) and RZAT Compiled the kernel, rebootet, started file options on the SSD (copying files to it and removing files) with discard as a mount option as well as issued manual TRIM commands (via fstrim -v). So far everything seems fine: No errors in dmesg or any other signs of data corruption. Ofc. all of this was with the MU02 firmware: $ hdparm -i /dev/sda /dev/sda: Model=Crucial_CT128M550SSD1, FwRev=MU02, -SNIP- (In reply to Henrique de Moraes Holschuh from comment #29) > On Wed, Mar 11, 2015, at 09:12, bugzilla-daemon@bugzilla.kernel.org > wrote: > There is some hope they will get to the M500. But M500 owners should > make a pest of themselves demanding the fix through the Crucial official > support channels, I think, if it is to become reality. OK, I started to write again in the Crucial Forums about this error. I urge anyone who has the hardware then to also make themselves heard. On Wed, Mar 11, 2015, at 13:34, bugzilla-daemon@bugzilla.kernel.org wrote: > OK, I started to write again in the Crucial Forums about this error. 
I did say official support channels. The Crucial forums are a "users helping users" thing; even the Crucial people there will tell you to use other channels if you need to get Crucial to actually act on something. No, I don't have the proper contact info on hand, but I suppose the people who do RMAs should be able to send you to the proper support channel...

I think we should refine the queued-TRIM blacklist to cover the M550 and MX100 only before MU02, tell people about it so that they can test it, and merge it later after some testing. Also, it would be really nice to be able to override the TRIM support mode on the kernel command line (off/auto/non-queued/queued).

I'm working with the Micron folks to get this resolved. Waiting for one more piece of info and then I'll update our blacklist rules for Crucial/Micron drives. Hopefully within a day or two. I know this has been a long time coming but it was hard to root cause due to Linux being the only OS actually issuing queued TRIM commands.

(In reply to Martin K. Petersen from comment #33)
> I'm working with the Micron folks to get this resolved. Waiting for one more
> piece of info and then I'll update our blacklist rules for Crucial/Micron
> drives. Hopefully within a day or two.
I have been running my Crucial M550 with the MU02 firmware and with a kernel patched to disable the blacklist (see comment #30) for 3 days without any data corruption, so I would say it has been fixed. Anyway, I understand why you want to confirm it with Micron before updating the blacklist rules. Any news on this?

(In reply to Martin K. Petersen from comment #33)
> I'm working with the Micron folks to get this resolved. Waiting for one more
> piece of info and then I'll update our blacklist rules for Crucial/Micron
> drives. Hopefully within a day or two.
I was having this issue on an MX100 drive installed in a 13 inch MacBookPro8,1. I checked the Crucial site and they have a firmware upgrade for the MX100 as well (MU01 -> MU02).
But when I tried to update the firmware via bootable cd I am getting below error. Firmware Update on /dev/sda failed with status 13. Device Name: /dev/sda Firmware Update on /dev/sda Faild! STATUS_CODE: 13 Your system will now return to normal operation following a reboot. I would really appreciate it if you could please pass this information to micron folks. Latest update: M500: The problems *should* be fixed in MU05. My MU05 is working fine under stress even with ALPM enabled but I know that several of you are seeing problems. Unless I hear otherwise I'm going to leave M500 on the blacklist. M510/M550/MX100: MU05 does contain the relevant fixes. I will blacklist MU01. M600/MX200: Shipped with the fixes in place and all released firmware versions are good to go. BX100: These drives do not support queued TRIM. (In reply to Martin K. Petersen from comment #36) > Latest update: > > M500: The problems *should* be fixed in MU05. My MU05 is working fine under > stress even with ALPM enabled but I know that several of you are seeing > problems. Unless I hear otherwise I'm going to leave M500 on the blacklist. I can confirm M500 MU05 is broken/"iffy" - it's what caused me to notice this within minutes of updating: https://bugzilla.kernel.org/show_bug.cgi?id=71371#c21 Please leave M500 blacklisted until MU06 is out & tested (if they ever do...) I'm also experiencing this issue on Ubuntu 14.04 LTS using a Crucial_CT240M500SSD1 with MU05 firmware. 
I constantly get the following messages in dmesg: 1198.426617] ata4: SATA link up 1.5 Gbps (SStatus 113 SControl 310) [ 1198.427031] ata4.00: supports DRM functions and may not be fully accessible [ 1198.429981] ata4.00: disabling queued TRIM support [ 1198.433634] ata4.00: supports DRM functions and may not be fully accessible [ 1198.436581] ata4.00: disabling queued TRIM support [ 1198.439868] ata4.00: configured for UDMA/33 [ 1198.439917] ata4: EH complete [ 1200.510758] ata4: exception Emask 0x50 SAct 0x0 SErr 0x4090800 action 0xe frozen [ 1200.510766] ata4: irq_stat 0x00400040, connection status changed [ 1200.510772] ata4: SError: { HostInt PHYRdyChg 10B8B DevExch } [ 1200.510781] ata4: hard resetting link And then the follow message periodically: [ 1172.514336] ata4.00: failed command: WRITE FPDMA QUEUED [ 1172.514338] ata4.00: cmd 61/88:b0:90:7b:79/03:00:17:00:00/40 tag 22 ncq 462848 out [ 1172.514338] res 40/00:4c:10:c2:76/00:00:17:00:00/40 Emask 0x50 (ATA bus error) [ 1172.514339] ata4.00: status: { DRDY } The computer runs fine if I'm just browsing or playing music but if I perform multiple disk operations, like downloading a torrent and then waching a movie the performance of the system becomes very slow and frequently freezes the computer. I have to hard reset to recover. Will applying the patch above resolve this? @Martin K. Petersen: Anything new from Micron for M500? Will they fix this problem? @Pali: Since M500 is a couple of generations old I personally doubt they'll update it. Queued TRIM is a niche feature to begin with and unqueued TRIM works fine on these drives. Just to add another data point I have a Micron (not Crucial) M550 which seems to work fine after removing the blacklist entry. Model=Micron_M550_2.5"_7mm_512GB, FwRev=DL04, SerialNo=14260C70054B I don't know if there are any firmware updates for this drive - I can't find them on Micron's site and the Crucial update refuses to apply. 
Installed a brand new Crucial_CT250MX200SSD1 (MU01) this morning and had the following problem: [ 1.065905] ata3: SATA max UDMA/133 abar m1024@0xfe8ffc00 port 0xfe8ffe00 irq 22 [ 1.390156] ata3: SATA link down (SStatus 0 SControl 300) [ 1.397783] ata3: exception Emask 0x10 SAct 0x0 SErr 0x40d0002 action 0xe frozen [ 1.397784] ata3: irq_stat 0x00400040, connection status changed [ 1.397789] ata3: hard resetting link [ 3.000238] ata3: SATA link down (SStatus 0 SControl 300) [ 3.004857] ata3: EH complete [ 3.014985] ata3: exception Emask 0x10 SAct 0x0 SErr 0x40d0002 action 0xe frozen [ 3.019752] ata3: irq_stat 0x00400040, connection status changed [ 3.024583] ata3: limiting SATA link speed to 1.5 Gbps [ 3.029362] ata3: hard resetting link [ 3.753998] ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 310) [ 3.764610] ata3.00: supports DRM functions and may not be fully accessible [ 3.775535] ata3.00: ATA-10: Crucial_CT250MX200SSD1, MU01, max UDMA/133 [ 3.785821] ata3.00: 488397168 sectors, multi 16: LBA48 NCQ (depth 31/32), AA [ 3.797454] ata3.00: supports DRM functions and may not be fully accessible [ 3.809415] ata3.00: configured for UDMA/133 [ 3.819663] ata3: EH complete [ 3.839699] ata3.00: Enabling discard_zeroes_data [ 3.889700] ata3.00: Enabling discard_zeroes_data [ 3.919664] ata3.00: Enabling discard_zeroes_data I appear to have fixed the problem by adding the following to libata-core.c: --- linux-4.1.1/drivers/ata/libata-core.c 2015-06-22 06:05:43.000000000 +0100 +++ linux-4.1.2/drivers/ata/libata-core.c 2015-07-11 10:04:48.644600782 +0100 @@ -4235,6 +4235,8 @@ ATA_HORKAGE_ZERO_AFTER_TRIM, }, { "Crucial_CT*MX100*", "MU01", ATA_HORKAGE_NO_NCQ_TRIM | ATA_HORKAGE_ZERO_AFTER_TRIM, }, + { "Crucial_CT*MX200*", "MU01", ATA_HORKAGE_NO_NCQ_TRIM | + ATA_HORKAGE_ZERO_AFTER_TRIM, }, { "Samsung SSD 8*", NULL, ATA_HORKAGE_NO_NCQ_TRIM | ATA_HORKAGE_ZERO_AFTER_TRIM, }, Was that the correct thing to do ? 
@rickyxt, @Stuart Foster: The errors you reported look like a different issue than the originally reported in this bug report. It could be related to NCQ TRIM enabled together with ALPM, as it does look like ALPM-caused issues. Or it could be just ALPM-related. Note that marginal SATA cables or connectors can cause this kind of spurious link down problems as well, so please verify that. We know for a fact that the Crucial and Micron SATA SSDs react differently depending on link speed and ALPM level. Slowdown issues (no data corruption) were reported in the Crucial forums, and the slowdowns "went away" when the device was connected to a very fast Intel 6Gbps SATA port (and still showed up on some other 6Gbps ports from other vendors). The slowdowns also went away when ALPM was entirely disabled (regardless of SATA port type/speed/vendor). So, the SSDs did not like ALPM enabled depending on the motherboard (and SATA port configuration) it was connected to. These issues are supposedly fixed by the latest firmware updates for the M550, MX and BX SSDs, but they might not have gone away entirely... In rickyxt's case, the M500 is likely to still have some of the ALPM bugs, as it did not get the latest round of firmware fixes (which would have fixed the NCQ TRIM bug as well). It is likely a very good idea to run M500 SSDs with ALPM disabled. That's how I run mine. In Stuart Foster's case, it is surprising that disabling NCQ trim helped avoid this possibly new issue on the MX200. Still, disabling NCQ TRIM is _always safe_ as far as we know, so it would be a good idea for Stuart to keep it blacklisted for production. Stuart, if you'd like to help track down the issue you noticed in the MX200 MU01 firmware, maybe you could check whether you can easily reproduce the issue, and whether it happens with the four combinations of ALPM enabled/disabled and NCQ TRIM enabled/disabled ? 
The cable may be the issue, as I have now experienced the problem once with the original cable with my patched kernel. I would like to investigate further before I condemn the cable though. During the last few hours the unpatched kernel has been booting most of the time without showing any issues. This is different to the behaviour this morning, when the unpatched kernel borked on every boot. Interestingly, I have not yet seen the issue with two other SATA cables that I have tried.

I would like to try the ALPM enabled/disabled test you suggested. What is the easiest way to turn it off? Can this be done from the kernel boot parameters?

On Sat, 11 Jul 2015, bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=71371
>
> --- Comment #44 from Stuart Foster <smf.linux@ntlworld.com> ---
> The cable may be the issue as I have currently experienced the problem once
> with the original cable with my patched kernel. I would liked to investigate
> further before I condemn the cable though. During the last few hours the
> unpached kernel is now booting most of the time without showing any issues.
> This is different to the behaviour this morning when the unpatched kernel
> borked on every boot. Interestingly I have not yet seen the issue with two
> other sata cables that I have tried.
>
> I would like to try the ALPM enable/disabled test you suggested. What is the
> easiest way to turn it off, can this be done from the kernel boot parameters
> ?
You can change the ALPM state at runtime in sysfs:

/sys/class/scsi_host/host*/link_power_management_policy

Set it to "max_performance" to disable ALPM. Set it to "medium_power" to enable.

Do note that it applies to the entire SCSI host, not to a device. Use, e.g., "lsscsi" to get the host/device mapping. SCSI device [6:0:0:0] would be on host 6.
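When running the four-way ALPM/NCQ TRIM comparison, it helps to record the current policy of every host before changing anything. The sketch below reads the sysfs attribute described above. The function names are illustrative, the base directory is a parameter so the code can be exercised against a scratch directory, and writing a new policy (which needs root) is deliberately left out.

```python
from pathlib import Path

SYSFS_SCSI_HOST = Path("/sys/class/scsi_host")


def read_alpm_policies(base=SYSFS_SCSI_HOST):
    """Map each hostN directory to its link_power_management_policy value.

    Hosts that do not expose the attribute are skipped.
    """
    policies = {}
    for host_dir in sorted(Path(base).glob("host*")):
        attr = host_dir / "link_power_management_policy"
        if attr.is_file():
            policies[host_dir.name] = attr.read_text().strip()
    return policies


def host_of_scsi_address(address):
    """Return the host name for an lsscsi-style address such as '[6:0:0:0]'."""
    host_number = address.strip("[]").split(":")[0]
    return f"host{int(host_number)}"


if __name__ == "__main__":
    for host, policy in read_alpm_policies().items():
        print(host, policy)
```

Combining the two helpers, `read_alpm_policies().get(host_of_scsi_address("[6:0:0:0]"))` gives the policy that applies to the drive shown by lsscsi at that address, or `None` if the host has no such attribute.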
I have run the ALPM/NCQ tests as suggested (4.1.1 kernel ALPM disabled by default) with a known good cable and I have found no issues (the only difference on my tests was that with NCQ disabled the results were insignificantly faster (0.3%) on a 10 minute random read/write test). The cable that was giving me my original problem interestingly only fails with the new MX200 drive and only on one orientation, if I swap it end for end it works perfectly like the other 5 cables I have tried. So in the bin with it and apologies for wasting your time. Stuart: Thanks for the update. FYI: The MX200 supports queued TRIM and shares implementation with M550 and M600. I am not aware of any issues with the latest firmware releases for these drives. (In reply to Martin K. Petersen from comment #47) > Stuart: Thanks for the update. > > FYI: The MX200 supports queued TRIM and shares implementation with M550 and > M600. I am not aware of any issues with the latest firmware releases for > these drives. I don't have any M550 or M600's but I do have some M4's and M500's the only thing I see with the M500's on Linux 4.1.2 (ASUS M5A97 PRO and a Lenovo laptop) is: [ 2.657287] ata3: SATA max UDMA/133 abar m1024@0xfeb0b000 port 0xfeb0b200 irq 19 [ 2.978232] ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300) [ 2.978555] ata3.00: supports DRM functions and may not be fully accessible [ 2.978606] ata3.00: failed to get NCQ Send/Recv Log Emask 0x1 [ 2.978606] ata3.00: ATA-9: Crucial_CT240M500SSD1, MU05, max UDMA/133 [ 2.978608] ata3.00: 468862128 sectors, multi 16: LBA48 NCQ (depth 31/32), AA [ 2.978651] ata3.00: failed to get Identify Device Data, Emask 0x1 [ 2.978907] ata3.00: supports DRM functions and may not be fully accessible [ 2.978956] ata3.00: failed to get NCQ Send/Recv Log Emask 0x1 [ 2.979001] ata3.00: failed to get Identify Device Data, Emask 0x1 [ 2.979003] ata3.00: configured for UDMA/133 [ 2.992665] ata3.00: Enabling discard_zeroes_data [ 2.992858] ata3.00: Enabling 
discard_zeroes_data [ 2.993128] ata3.00: Enabling discard_zeroes_data Thanks I had a request for a customer site review from Crucial after checking the firmware status of all my Crucial SSD's and I happen to comment on the M500 MU05 firmware status and I unexpectedly got the following reply: Hello Stuart, Thank you for your recent survey. I am very sorry to hear that you are disappointed that we haven't updated our M500 MU05 Firmware. We have no plans to release an updated version I am afraid as this is not a widespread issue and only affects a small number of Linux customers. I appreciate some users are having an issue but it is an extremely small number and not affecting a high number of M500's. Our drives have built in garbage collection which is in effect the same as TRIM. This cleans the SSD as it sits idle so this contributes to a clean and healthy drive. If you have any further questions or need further help, just let me know. Kind Regards James Anderson eCommerce Sales and Support Agent Tel: 0800 013 0330 Fax: 0800 013 0336 mailto:crucialeusupport@micron.com Join the Crucial Community Forums: http://forum.crucial.com/ Facebook: http://www.facebook.com/CrucialMemory Twitter: http://twitter.com/crucialmemory YouTube: http://www.youtube.com/crucialmemory Micron Consumer Products Group, a division of Micron Europe Ltd Registered in England, Company No. 02341071 Registered Office: L’Avenir, Opladen Way, Bagshot Road, Bracknell, Berkshire, UK RG12 0PH (In reply to Stuart Foster from comment #49) > I had a request for a customer site review from Crucial after checking the > firmware status of all my Crucial SSD's and I happen to comment on the M500 > MU05 firmware status and I unexpectedly got the following reply: > > Hello Stuart, > > Thank you for your recent survey. I am very sorry to hear that you are > disappointed that we haven't updated our M500 MU05 Firmware. 
> > We have no plans to release an updated version I am afraid as this is not a > widespread issue and only affects a small number of Linux customers. I > appreciate some users are having an issue but it is an extremely small > number and not affecting a high number of M500's. > > Our drives have built in garbage collection which is in effect the same as > TRIM. This cleans the SSD as it sits idle so this contributes to a clean and > healthy drive. > > > If you have any further questions or need further help, just let me know. > > Kind Regards > > > James Anderson > eCommerce Sales and Support Agent > Tel: 0800 013 0330 > Fax: 0800 013 0336 > mailto:crucialeusupport@micron.com > > Join the Crucial Community > Forums: http://forum.crucial.com/ > Facebook: http://www.facebook.com/CrucialMemory > Twitter: http://twitter.com/crucialmemory > YouTube: http://www.youtube.com/crucialmemory > > Micron Consumer Products Group, a division of Micron Europe Ltd Registered > in England, Company No. 02341071 Registered Office: L’Avenir, Opladen Way, > Bagshot Road, Bracknell, Berkshire, UK RG12 0PH Thanks for sharing that. It's disappointing, especially considering whatever is broken (and now patched) for the M550 probably shares most of the code with the M500. I guess they think they can make more money by getting us to upgrade. They lost me as a customer a year ago - I won't be buying their products again! I have M550 with Fedora Rawhide installed inside. After updated to MU02 it runs faster than before, and I've only met one time that failed to start the rootfs after new kernel installed. But There is an enclosure for SSD to connect to USB 3.0 port for personal work, I seldom connect my SSD via SATA so I haven't enabled discard in fstab, any suggestions for case like this? |
Created attachment 127781 [details]
libata-core.c patch

Crucial M500 and Micron M500 have broken "queued TRIM" support which leads to file system corruption. Currently the kernel disables "queued TRIM", but not for all models of the drive. It disables it only for models "Micron_M500*" and "Crucial_CT???M500SSD1" (wildcards as in drivers/ata/libata-core.c).

I have a Crucial M500 mSATA drive, which shows up as "Crucial_CT240M500SSD3", and for the last few months I have been suffering from frequent filesystem corruption. Disabling the queued TRIM command seems to fix the problem. I've been testing it only for the last 20 hours, but I don't see the usual dmesg messages about failed TRIM commands that I had until now.

The errors I got until now, usually once during every boot and then periodically, were like this:

[ 62.283352] ata1.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x6
[ 62.283356] ata1.00: irq_stat 0x40000008
[ 62.283358] ata1.00: failed command: SEND FPDMA QUEUED
[ 62.283363] ata1.00: cmd 64/01:00:00:00:00/00:00:00:00:00/a0 tag 0 ncq 512 out
[ 62.283363] res 00/00:00:00:00:00/00:00:00:00:00/00 Emask 0x403 (HSM violation) <F>
[ 62.283368] ata1: hard resetting link
[ 62.590063] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[ 62.590374] ata1.00: supports DRM functions and may not be fully accessible
[ 62.596841] ata1.00: supports DRM functions and may not be fully accessible
[ 62.602819] ata1.00: configured for UDMA/133
[ 62.613085] sd 0:0:0:0: [sda]
[ 62.613089] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[ 62.613091] sd 0:0:0:0: [sda]
[ 62.613092] Sense Key : Aborted Command [current] [descriptor]
[ 62.613094] Descriptor sense data with sense descriptors (in hex):
[ 62.613095] 72 0b 00 00 00 00 00 0c 00 0a 80 00 00 00 00 00
[ 62.613100] 00 00 00 00
[ 62.613102] sd 0:0:0:0: [sda]
[ 62.613104] Add. Sense: No additional sense information
[ 62.613105] sd 0:0:0:0: [sda] CDB:
[ 62.613106] Write same(16): 93 08 00 00 00 00 0f 57 56 b8 00 00 00 28 00 00
[ 62.613112] end_request: I/O error, dev sda, sector 257382072
[ 62.613128] EXT4-fs (dm-4): discard request in group:577 block:17623 count:5 failed with -5
[ 62.613130] ata1: EH complete
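The coverage gap described in this report can be checked quickly in userspace. The snippet is a sketch under assumptions: Python's `fnmatch.fnmatchcase` approximates the kernel's glob matching for these simple `*` and `?` patterns, the two default patterns are the ones quoted in the report, and the widened pattern in the usage note is hypothetical rather than the content of the attached patch.

```python
from fnmatch import fnmatchcase

# Model patterns quoted in the report (from drivers/ata/libata-core.c).
BLACKLISTED_MODELS = ["Micron_M500*", "Crucial_CT???M500SSD1"]


def is_blacklisted(model, patterns=BLACKLISTED_MODELS):
    """Return True if the model string matches any of the glob patterns."""
    return any(fnmatchcase(model, pattern) for pattern in patterns)


if __name__ == "__main__":
    for model in ("Crucial_CT480M500SSD1", "Crucial_CT240M500SSD3"):
        print(model, is_blacklisted(model))
```

The 2.5" model ending in `SSD1` matches, while the mSATA model ending in `SSD3` falls through both patterns and so keeps queued TRIM enabled. A hypothetical widened pattern such as `Crucial_CT???M500SSD*` would cover both: `is_blacklisted("Crucial_CT240M500SSD3", ["Crucial_CT???M500SSD*"])` evaluates to `True`.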