Bug 60731
| Summary: | MacBook Air 6,2 ata command timeout prevents boot | | |
|---|---|---|---|
| Product: | IO/Storage | Reporter: | sam.attwell |
| Component: | Serial ATA | Assignee: | Jeff Garzik (jgarzik) |
| Status: | NEW --- | | |
| Severity: | high | CC: | adrien-xx-kernel-bz, adrien.lassere, afiestas, aim, alex, alexfeinman, andrew.cobb, ben, bpshacklett, dorin, funtoos, gwhite, june, kaloz, kernel-bugzilla, lasse.makholm, levex, luke.marsden, miek, satadru, stephen, szg00000, tj, uriahheep, vingarzan |
| Priority: | P1 | | |
| Hardware: | x86-64 | | |
| OS: | Linux | | |
| Kernel Version: | 3.11.0-rc4 | Subsystem: | |
| Regression: | No | Bisected commit-id: | |

Attachments:
dmesg output for succesful boot
dmesg output for unsuccessful boot
3.11-rc7 dmesg with ncq
MacBookAir6,2 DSDT Dissasembly
MacBookAir6,2 Mac OS X 10.9 dmesg
MacBookAir6,2 System Information SATA Report
ahci: add NO_NCQ flag to this controller
ahci: add NO_NCQ flag to this controller
dmesg log from kernel-3.12 with proposed patch applied
ahci-noncq.patch
0001-ahci-disable-NCQ-on-Samsung-pci-e-SSDs-on-macbooks.patch
samsung-1600-nomsi.patch
Description
sam.attwell
2013-08-11 22:49:34 UTC
Created attachment 107180 [details]
dmesg output for succesful boot
Booting on the MacBook Air 6,2 fails most of the time after an ATA command timeout with NCQ enabled.
MacBook Air 6,2
Intel Core i7-4650U 1.7GHz
8GB RAM
128GB SSD
3.11.0-rc4
Appending libata.force=noncq to the kernel command line seems to resolve the problem.
Successful boot dmesg attached (kern.log doesn't report on failed boot here).
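To confirm that the workaround is really in effect on a given boot, something like the following could be used. It is only a sketch: the function name `cmdline_has_noncq` is made up for illustration, and it only recognises the two spellings that appear in this report (`libata.force=noncq` and `libata.force=1:noncq`, possibly inside a comma-separated list).

```shell
# Return 0 if the given kernel command line disables NCQ through libata.force.
# Hypothetical helper; recognises libata.force=noncq and libata.force=<port>:noncq.
cmdline_has_noncq() {
    # Intentionally unquoted: split the command line into words.
    for _arg in $1; do
        case "$_arg" in
            libata.force=*)
                _val=${_arg#libata.force=}
                # Wrap in commas so each list entry can be matched as a whole.
                case ",$_val," in
                    *,noncq,*|*:noncq,*) return 0 ;;
                esac
                ;;
        esac
    done
    return 1
}
```

Usage on a running system would be along the lines of `cmdline_has_noncq "$(cat /proc/cmdline)" && echo "NCQ disabled via libata.force"`.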
(In reply to sam.attwell from comment #1)
> Created attachment 107180 [details]
> dmesg output for succesful boot
>
> Booting on the MacBook Air 6,2 fails most of the time after ata command for
> ncq.
>
> MacBook Air 6,2
> Intel Core i7-4650U 1.7GHz
> 8GB RAM
> 128GB SSD
> 3.11.0-rc4
>
> Appending libata.force=noncq seems to resolve the problem.
>
> Successful boot dmesg attached (kern.log doesn't report on failed boot here).

Here's lspci:
00:00.0 Host bridge [0600]: Intel Corporation Haswell-ULT DRAM Controller [8086:0a04] (rev 09)
00:02.0 VGA compatible controller [0300]: Intel Corporation Haswell-ULT Integrated Graphics Controller [8086:0a26] (rev 09)
00:03.0 Audio device [0403]: Intel Corporation Device [8086:0a0c] (rev 09)
00:14.0 USB controller [0c03]: Intel Corporation Lynx Point-LP USB xHCI HC [8086:9c31] (rev 04)
00:16.0 Communication controller [0780]: Intel Corporation Lynx Point-LP HECI #0 [8086:9c3a] (rev 04)
00:1b.0 Audio device [0403]: Intel Corporation Lynx Point-LP HD Audio Controller [8086:9c20] (rev 04)
00:1c.0 PCI bridge [0604]: Intel Corporation Lynx Point-LP PCI Express Root Port 1 [8086:9c10] (rev e4)
00:1c.1 PCI bridge [0604]: Intel Corporation Lynx Point-LP PCI Express Root Port 2 [8086:9c12] (rev e4)
00:1c.2 PCI bridge [0604]: Intel Corporation Lynx Point-LP PCI Express Root Port 3 [8086:9c14] (rev e4)
00:1c.4 PCI bridge [0604]: Intel Corporation Lynx Point-LP PCI Express Root Port 5 [8086:9c18] (rev e4)
00:1c.5 PCI bridge [0604]: Intel Corporation Lynx Point-LP PCI Express Root Port 6 [8086:9c1a] (rev e4)
00:1f.0 ISA bridge [0601]: Intel Corporation Lynx Point-LP LPC Controller [8086:9c43] (rev 04)
00:1f.3 SMBus [0c05]: Intel Corporation Lynx Point-LP SMBus Controller [8086:9c22] (rev 04)
02:00.0 Multimedia controller [0480]: Broadcom Corporation Device [14e4:1570]
03:00.0 Network controller [0280]: Broadcom Corporation Device [14e4:43a0] (rev 03)
04:00.0 SATA controller [0106]: Samsung Electronics Co Ltd Device [144d:1600] (rev 01)
05:00.0 PCI bridge [0604]: Intel Corporation DSL3510 Thunderbolt Port [Cactus Ridge] [8086:1547] (rev 03)
06:00.0 PCI bridge [0604]: Intel Corporation DSL3510 Thunderbolt Port [Cactus Ridge] [8086:1547] (rev 03)
06:03.0 PCI bridge [0604]: Intel Corporation DSL3510 Thunderbolt Port [Cactus Ridge] [8086:1547] (rev 03)
06:04.0 PCI bridge [0604]: Intel Corporation DSL3510 Thunderbolt Port [Cactus Ridge] [8086:1547] (rev 03)
06:05.0 PCI bridge [0604]: Intel Corporation DSL3510 Thunderbolt Port [Cactus Ridge] [8086:1547] (rev 03)
06:06.0 PCI bridge [0604]: Intel Corporation DSL3510 Thunderbolt Port [Cactus Ridge] [8086:1547] (rev 03)
07:00.0 System peripheral [0880]: Intel Corporation DSL3510 Thunderbolt Port [Cactus Ridge] [8086:1547] (rev 03)
08:00.0 PCI bridge [0604]: Intel Corporation DSL3510 Thunderbolt Controller [Cactus Ridge] [8086:1549]
09:00.0 PCI bridge [0604]: Intel Corporation DSL3510 Thunderbolt Controller [Cactus Ridge] [8086:1549]
0a:00.0 Ethernet controller [0200]: Broadcom Corporation NetXtreme BCM57762 Gigabit Ethernet PCIe [14e4:1682]

Created attachment 107181 [details]
dmesg output for unsuccessful boot
Here's an edited dmesg for unsuccessful boot.
Created attachment 107355 [details]
3.11-rc7 dmesg with ncq
The problem is still there with -rc7. Sometimes the system fails to boot up, sometimes it causes "pauses". I've attached dmesg from -rc7.
Hi,

Can you try the following kernel parameter?
"libata.force=3.0"

All,

I am also running a MacBookAir6,2 and I think I am experiencing some variant of this problem.

(In reply to Levente Kurusa from comment #5)
> Can you try the following kernel parameter? "libata.force=3.0"

I've tested with kernel-3.12.0-0.rc7.git2.1.fc21.x86_64 and kernel-3.11.6-200.fc19.x86_64. I tried libata.force=3.0 with both kernels and it prevents the system from booting 100% of the time. (I do not know where to get further debugging information in that case.)

With no libata kernel flags, I definitely find the same sequence of error messages in dmesg as reported by Sam and Imre. Using libata.force=noncq suppresses the error messages and smooths out the booting process.

Thanks!

All,

I am still experiencing this issue with the 3.12 release. The kernel often fails to boot, and if it does boot it will frequently hang up the whole system with IO waits, followed by a flood of ata messages in dmesg.

Please advise on further troubleshooting steps? I am still using libata.force=1:noncq to make things work, but this seems like an ugly workaround!

Thanks.

Hi,

Can you try it with another operating system? (win xp, mac, etc.)
There are a lot of chips with broken NCQ support. I will look into these, in the meantime please try to test it with another OS. Also, try to change some BIOS options as well.

Levente,

Thanks for your quick reply!

(In reply to Levente Kurusa from comment #8)
> Hi,
>
> Can you try it with another operating system? (win xp, mac, etc.)
> There are a lot of chips with broken NCQ support. I will look into these, in
> the meantime please try to test it with another OS. Also, try to change some
> BIOS options as well.

Per my earlier post (in comment #6) I am testing on a MacBookAir6,2. It has a Samsung SSD right on the PCIe bus: http://www.ifixit.com/Teardown/MacBook+Air+13-Inch+Mid+2013+Teardown/15042#s49088

I think it goes without saying that this hardware works great on Mac OS X. ;) I can confirm it works flawlessly on OS X. (Also, no BIOS.)

If NCQ needs to be disabled for this class of hardware, shouldn't the kernel automatically detect that? The default behavior is very bad for novice users.

Thanks,

We have several controllers where NCQ is disabled. In case it works on Mac OS X, can you post the disassembled DSDT table? The reason is that it could be some kind of a stupid _OSI related bug, which we had in the past.

(In reply to Levente Kurusa from comment #10)
> can you post the disassembled DSDT table?

I will gladly post that information if you can link me to instructions on how to do that?

(In reply to Alex Markley from comment #11)
> (In reply to Levente Kurusa from comment #10)
> > can you post the disassembled DSDT table?
>
> I will gladly post that information if you can link me to instructions on
> how to do that?

Pardon me. :-)

cat /sys/firmware/acpi/tables/DSDT > dsdt.aml
iasl -d dsdt.aml

This will dump out a dsdt.dsl, please post it.

Created attachment 113951 [details]
MacBookAir6,2 DSDT Dissasembly
Here was the command output. (Note the warning!)
Intel ACPI Component Architecture
ASL Optimizing Compiler version 20130823-64 [Oct 8 2013]
Copyright (c) 2000 - 2013 Intel Corporation
Loading Acpi table from file dsdt.aml
Acpi table [DSDT] successfully installed and loaded
Pass 1 parse of [DSDT]
Pass 2 parse of [DSDT]
Parsing Deferred Opcodes (Methods/Buffers/Packages/Regions)
Parsing completed
Found 6 external control methods, reparsing with new information
Pass 1 parse of [DSDT]
Pass 2 parse of [DSDT]
Parsing Deferred Opcodes (Methods/Buffers/Packages/Regions)
Parsing completed
Disassembly completed
ASL Output: dsdt.dsl - 278845 bytes
iASL Warning: There were 6 external control methods found during
disassembly, but additional ACPI tables to resolve these externals
were not specified. The resulting disassembler output file may not
compile because the disassembler did not know how many arguments
to assign to these methods. To specify the tables needed to resolve
external control method references, use the one of the following
example iASL invocations:
iasl -e <ssdt1.aml,ssdt2.aml...> -d <dsdt.aml>
iasl -e <dsdt.aml,ssdt2.aml...> -d <ssdt1.aml>
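The warning above suggests passing the SSDTs to iasl with -e so the external control methods can be resolved. A minimal sketch of building that invocation follows; the helper name `iasl_dsdt_command` and the `ssdt*.aml` file naming are assumptions for illustration, and the function only prints the command rather than running it.

```shell
# Print an iasl command line that disassembles dsdt.aml in the given directory,
# adding "-e" with every ssdt*.aml found there (hypothetical file naming).
iasl_dsdt_command() {
    _dir=$1
    _ssdts=
    for _f in "$_dir"/ssdt*.aml; do
        # Skip the unexpanded pattern when no SSDT copies exist.
        [ -e "$_f" ] || continue
        # Join the paths with commas, as in the iASL example invocation.
        _ssdts=${_ssdts:+$_ssdts,}$_f
    done
    if [ -n "$_ssdts" ]; then
        printf 'iasl -e %s -d %s\n' "$_ssdts" "$_dir/dsdt.aml"
    else
        printf 'iasl -d %s\n' "$_dir/dsdt.aml"
    fi
}
```

This assumes the SSDTs were first copied out next to dsdt.aml in the same way as the DSDT (on Linux they are usually exposed alongside it under /sys/firmware/acpi/tables, though the exact file names vary by machine).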
Couple of debugging steps I would do:

Try the following kernel params (without the enclosing quotes):
"acpi_osi="!Linux""
"acpi_osi=! acpi_osi="Darwin""
"acpi_osi= acpi_osi="Darwin""

Also, it might be worth checking that Mac OS X uses NCQ. Unfortunately, I do not know how to find that information in Mac.

Levente,

Thanks for providing some debugging steps. Can you clarify for me, am I testing with all of those kernel parameters on the same command line? Or should I boot once for each acpi_osi parameter and record results with each one?

I will look into the NCQ issue on OS X. It's entirely possible that some system logs will make mention of it. I can boot into OS X and inspect the logs.

Thanks.

(In reply to Alex Markley from comment #15)
> Levente,
>
> Thanks for providing some debugging steps. Can you clarify for me, am I
> testing with all of those kernel parameters on the same command line? Or
> should I boot once for each acpi_osi parameter and record results with each
> one?
>

For each boot attempt, take one line. :-)
Post the results as well.

> I will look into the NCQ issue on OS X. It's entirely possible that some
> system logs will make mention of it. I can boot into OS X and inspect the
> logs.

Yes please. If you can post something like a 'dmesg' for Mac that'd be great.

Created attachment 113961 [details]
MacBookAir6,2 Mac OS X 10.9 dmesg
Here are the contents of the Mac OS X dmesg buffer after booting on my MacBookAir6,2.
Created attachment 113971 [details]
MacBookAir6,2 System Information SATA Report
This is a copy/paste from the OS X System Information report on SATA devices. Notice that NCQ is explicitly mentioned as being supported.
(In reply to Alex Markley from comment #18)
> Created attachment 113971 [details]
> MacBookAir6,2 System Information SATA Report
>
> This is a copy/paste from the OS X System Information report on SATA
> devices. Notice that NCQ is explicitly mentioned as being supported.

Supported doesn't mean it is in use. I have not found anything that might be problematic in the Mac 'dmesg'. I will check the chip's datasheet if it differs from the ahci docs.

I tried booting with all three sets of acpi_osi= parameters. (I also of course removed libata.force=1:noncq from my boot parameters.)

In all three cases, I ran into severe NCQ-related command timeouts preventing system boot.

This was all tested with kernel 3.12 release.

Please advise on further troubleshooting steps. :)

Created attachment 113981 [details]
ahci: add NO_NCQ flag to this controller
Please try to apply this patch with:
patch drivers/ata/ahci.c patch.patch
then recompile the kernel and install it with:
make defconfig && make -j4 && make install
(commands must be executed from the kernel source tree's root)
After this, please report the dmesg and the results! :-)
(In reply to Levente Kurusa from comment #21)
> Created attachment 113981 [details]
> ahci: add NO_NCQ flag to this controller
>
> Please try to apply this patch with:
>
> patch drivers/ata/ahci.c patch.patch
>
> then recompile the kernel and install it with:
> make defconfig && make -j4 && make install
>
> (commands must be executed from the kernel source tree's root)
> After this, please report the dmesg and the results! :-)

Please disregard. Wrong patch attached.

Created attachment 113991 [details]
ahci: add NO_NCQ flag to this controller
Please do what I posted with this patch.
I will go ahead and test this patch ASAP.

However, my question is this: Would there be any hope of NCQ support being fixed rather than blacklisted? I don't know much about ATA but wouldn't it be desirable to enable the full command set?

Also, with regards to investigating the hardware more closely, would it help if I disassemble the laptop and get part numbers off of the SSD module? I believe the SSD module is considered a "user serviceable" part, so I can gain access to it easily.

This is a great question. On one hand, Linux's NCQ support may be broken, because we see lots of controllers not working on Linux. On the other hand, a lot of chips do work. So it is kinda difficult to tell right now what the problem is.

No, this is not the drive's fault. This is the controller's (or Linux's) fault most likely.

(In reply to Levente Kurusa from comment #26)
> No, this is not the drive's fault. This is the controller's (or Linux's)
> fault most likely.

On this device the ATA controller is directly on the board with the flash chips. This SSD module is a card that plugs directly into the PCIe bus.

That's why I thought a scan of the board might help. :)

(In reply to Alex Markley from comment #27)
> (In reply to Levente Kurusa from comment #26)
> > No, this is not the drive's fault. This is the controller's (or Linux's)
> > fault most likely.
>
> On this device the ATA controller is directly on the board with the flash
> chips. This SSD module is a card that plugs directly into the PCIe bus.
>

lspci told us what controller controls the SSD so there is no need.

How is the patch doing?

(In reply to Levente Kurusa from comment #28)
> How is the patch doing?

Levente,

Thanks for your patience! I was able to test the patch today. I tested it against kernel-3.12 and unfortunately it does NOT appear to successfully disable NCQ. I will attach a dmesg illustrating the problem.

Thanks.

Created attachment 114581 [details]
dmesg log from kernel-3.12 with proposed patch applied
Given the controller is intel ahci, it's highly unlikely that the controller or driver is at fault. Prolly best to blacklist the ssd device from ata_device_blacklist[].

Thanks.

(In reply to Tejun Heo from comment #31)
> Given the controller is intel ahci, it's highly unlikely that the controller
> or driver is at fault. Prolly best to blacklist the ssd device from
> ata_device_blacklist[].
>
> Thanks.

The controller picked up by the ahci module is: [144d:1600], which is not an Intel one. Also, it might be worth dumping the _GTF, maybe it has some chip-specific initialization sequence. If I recall correctly we had something like that years ago.

Ooh, right, this is the new sata express thing, so, yeah, then the controller could actually be at fault here. I wonder whether macos uses ncq at all. Is there any way to find out which commands are being used there?

Thanks.

I'd be glad to perform any troubleshooting / debugging steps on Mac OS X to determine answers to some of these questions. I just need some help figuring out _what_ steps to take. :) I'm competent with the terminal, but I'm not an expert in OS X by any means...

Heh and I have no clue whatsoever about mac. The best thing would be snooping what commands are going to the device but seeing whether there's significant bandwidth difference between queue depth 1 and queue depth 32 reads using io benchmark tools could be a useful indicator.

Thanks.

Just FYI, at least two other recent Apple laptop models use the same SATA controller for their SSD's: the Macbook Pro Retina 11,1 and the Macbook Pro Retina 11,3 (probably the 11,2 as well)...

Here are indications that this is the case, please see the lspci's on the following pages:
Macbook Pro 11,3: https://bbs.archlinux.org/viewtopic.php?pid=1368083#p1368083
Macbook Pro 11,1: https://wiki.debian.org/InstallingDebianOn/Apple/MacBookPro/11-1#lspci

I am ready to provide my assistance regarding this issue as well.

Thanks,
uriah

I have a Macbook Pro 11,3 with the same issue. If I can be of any help, don't hesitate to let me know. I'm generally running the latest version of the kernel from git, so building a kernel is no issue.

Created attachment 126161 [details]
ahci-noncq.patch
Hmmm... no idea why Levente's previous patch didn't work. Can someone please try the attached patch and report dmesg?
Thanks.
Guys, please post us your queue depths if Tejun's patch is not working.

cat /sys/block/sda/device/queue_depth
(where sda is your main SSD/HDD)

Also, try increasing your queue depth via echoing 1,2,4,etc to it. But, I am not sure if it will work with a force=noncq. Tejun, do you think it could work? It might give some extra hints if disabling NCQ will not work. I'd say if it turns out that a QD of 32 does not work, we might be up and running with a QD of 8 or so. That will still have some performance gains.

Thanks,
Levente Kurusa

I applied Tejun's patch and unfortunately I can't say that I see a difference, other than actually being able to boot well without libata.force=noncq. I mean the speed is still dropping from 8-900 MB/sec to <1MB/sec. I have a MacBook Pro 11,3.

➜ ~ cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-3.13.0-8-generic.efi.signed root=UUID=aa56465a-a88b-4902-8c54-6dd18749bcce ro allow-discards root_trim=yes quiet splash rootflags=data=writeback crashkernel=384M-:128M vt.handoff=7
➜ ~ cat /sys/block/sda/device/queue_depth
1
➜ ~ echo 2 > /sys/block/sda/device/queue_depth
echo: write error: invalid argument

No clue why the echo fails. As you can see from above, I did not have libata.force=noncq. So, any other way to change the QD?

Cheers&Thanks,
-Dragos

based on the patch, it seems that NCQ is disabled by it... Well, not what I was looking for to be honest. I mean can we do something other than blacklisting this controller? I have Mac OS X and Win7 running on this machine, so let me know if I can help with any traces. Needless to say, only in Linux I have terrible performance.

Excellent, do you know any way so that we can make sure that win7 or mac OS X actually USES (not SUPPORTS, but USES it) NCQ?

Oh, and I have an idea which might be worth dumping, but unfortunately I don't know how to dump the _GTF field. Tejun, maybe you know how to do that? Reason I want to do so, is because IIRC _GTF is used to set up the devices, right? If so, it is entirely possible that the ACPI BIOS might want to write some kind of init-sequence to the device. Currently, (looking at dmesg) we don't write the contents of _GTF to the device since its length (8) is bigger than the count of IDE registers (7)...

Just a thought.

Thanks,
Levente Kurusa

That is a good question. So I booted in Windows and HD-Tune seems to think that it does support NCQ. Performance varies of course with the block size: for 8MB blocks it averages close to 1GB/s, still getting 2-300MB/s with 64KB blocks. Throughput is not very stable, but nothing to complain about.

An interesting thing was that HD-Tune reported it to be running quite hot at 55C. In Linux it ranges between 41C and 53C. I did a test and there seems to be no correlation between temperature and performance. So I'd rule that out.

The behavior for me is that on frequent operations it grinds down with fsync() sys calls taking 1-2 seconds to complete for every small file. I have the feeling then that doing something stupid and counterintuitive (e.g. cat /dev/sda) actually makes the performance of fsync better. Then if there is no disk activity for a while it comes back to sane performance.

Levente, you can access _GTF using acpidump. I'm a bit skeptical whether _GTF would be doing anything drastic tho. Note that the controller is embedded with the drive and the problem is likely to be between the controller and host rather than what it represents as the attached drive, I suspect. For now, blacklisting ncq is probably the right thing to do so that those machines can at least boot.

Thanks.

Created attachment 126611 [details]
0001-ahci-disable-NCQ-on-Samsung-pci-e-SSDs-on-macbooks.patch
Applied to libata/for-3.14-fixes. Thanks.
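With NCQ blacklisted for this controller, the queue_depth check used earlier in the thread is a quick way to see what a running kernel ended up with. A rough sketch, assuming only that a depth of 1 means commands are issued one at a time and that anything higher means queuing is active; the function name is hypothetical:

```shell
# Classify a value read from /sys/block/<dev>/device/queue_depth.
# Hypothetical helper: prints "off" for a depth of 0 or 1, "on" for larger
# depths, and "unknown" for anything that is not a plain number.
ncq_state_from_depth() {
    case "$1" in
        ''|*[!0-9]*) echo unknown ;;
        0|1)         echo off ;;
        *)           echo on ;;
    esac
}
```

For example: `ncq_state_from_depth "$(cat /sys/block/sda/device/queue_depth)"`, where sda is the SSD as in the comments above.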
I'm also affected by this on a new MacBook Pro 11,3.

With a vanilla 3.14 kernel (that includes the above patch and thus disables NCQ for this controller by default), the problems are far from gone.

With NCQ disabled, the system runs and boots mostly without (I/O-related) problems but for certain workloads, performance drops through the floor. It seems to bottom out at about 3 writes/second.

Typically this happens during write and fsync() intensive (and presumably single-threaded) workloads such as during apt-get install|upgrade or running resize2fs.

I see no errors in dmesg or any other signs of anything wrong - except lousy performance...

I'm wondering if NCQ is a bit of a red herring here and merely aggravates a different root problem. I don't, however, have the knowledge to know where to look instead...

Please let us know if there is anything we can do to help to figure this out. Be it testing patches/running tests/diagnostics/etc...

Here is an iostat log snippet illustrating the poor I/O performance while apt is installing a few big packages:

Device: wrqm/s w/s wkB/s w_await svctm %util
sda 0.00 0.00 0.00 0.00 0.00 100.00
sda 23.00 11.00 200.00 753.82 90.91 100.00
sda 0.00 7.00 136.00 1005.14 142.86 100.00
sda 10.00 9.00 328.00 840.00 111.11 100.00
sda 29.00 11.00 124.00 556.36 90.91 100.00
sda 0.00 18.00 132.00 1474.67 55.56 100.00
sda 0.00 4.00 16.00 1591.00 250.00 100.00
sda 0.00 7.00 36.00 1840.00 142.86 100.00
sda 0.00 0.00 0.00 0.00 0.00 100.00
sda 15.00 7.00 52.00 481.71 142.86 100.00
sda 0.00 4.00 28.00 568.00 250.00 100.00
sda 21.00 10.00 144.00 461.20 100.00 100.00
sda 11.00 10.00 356.00 492.80 100.00 100.00
sda 1.00 3.00 24.00 1112.00 333.33 100.00
sda 18.00 8.00 132.00 549.00 125.00 100.00
sda 0.00 3.00 16.00 872.00 333.33 100.00
sda 27.00 5.00 64.00 892.00 200.00 100.00
sda 0.00 5.00 72.00 20.00 200.80 100.40
sda 6.00 18.00 644.00 545.78 55.56 100.00
sda 17.00 17.00 132.00 191.76 58.82 100.00
sda 141.00 296.00 3976.00 17.26 3.20 94.80
sda 0.00 3.00 8.00 945.33 333.33 100.00
sda 22.00 1.00 0.00 28.00 1000.00 100.00
sda 7.00 8.00 444.00 704.00 125.00 100.00
sda 0.00 5.00 8.00 691.20 200.00 100.00
sda 0.00 5.00 12.00 548.80 200.00 100.00
sda 22.00 8.00 104.00 711.50 125.00 100.00
sda 37.00 55.00 700.00 375.85 18.18 100.00
sda 43.00 72.00 416.00 55.50 13.67 98.40
sda 143.00 170.00 1352.00 1.39 5.57 95.20
sda 26.00 5.00 112.00 308.80 200.00 100.00
sda 0.00 3.00 4.00 757.33 333.33 100.00
sda 6.00 8.00 1468.00 561.00 125.00 100.00
sda 20.00 4.00 76.00 734.00 250.00 100.00
sda 14.00 4.00 88.00 842.00 250.00 100.00
sda 0.00 0.00 0.00 0.00 0.00 100.00

And here with all the columns: http://pastebin.com/yYrcxMu4

Hi,

can you please post a dmesg with the patch applied?

Thanks
Levente Kurusa

fsync performance is bad because SYNCHRONIZE CACHE / FLUSH CACHE takes anywhere from 1.2 to .02 seconds (I'm basing this off of the dmesg timestamps between the issued command and ahci_hw_interrupt). I tend to get much more predictable performance by just turning off write caching for the SSD (or by manually disabling the flush command), but that obviously isn't a real solution.

I haven't been able to figure out the underlying cause of these issues, but I have an affected machine (MBP 11,3), and some kernel development experience so if anyone could point me in the right direction (or any direction really) that would be very helpful.

Thanks,
Brennan

Also of interest: disabling MSI on the SSD fixes NCQ but the poor FLUSH_EXT performance remains.

I tried Brennan's idea of disabling MSI on the SSD, using device_nomsi=0x1600144d on the kernel command line. It does seem to work.

A thought I had was to try nobarrier as an ext4 mount option on my / filesystem. It restores performance, though at the cost of potential reliability. It is a trade off I am willing to make for now. I do break off /home as its own filesystem. My really valuable data is there. Worst case I reinstall the OS.

I would guess OSX doesn't do write barriers.

Created attachment 150741 [details]
samsung-1600-nomsi.patch
The attached patch plugs MSI and allows NCQ. Can someone please verify that the patch makes NCQ work w/o any extra kernel params?
Thanks!
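For kernels without this patch, the comments above used device_nomsi=0x1600144d. Comparing that value with the lspci ID [144d:1600] suggests it is the device ID followed by the vendor ID. The sketch below builds the value on that assumption only; the ordering is inferred from these two values in the thread, the helper name is made up, and whether a given kernel accepts device_nomsi at all is not confirmed here.

```shell
# Turn an lspci "vendor:device" pair into the 0x<device><vendor> form seen in
# this thread. Hypothetical helper; the field ordering is an inference.
nomsi_id() {
    _vendor=${1%%:*}   # text before the colon, e.g. 144d
    _device=${1##*:}   # text after the colon, e.g. 1600
    printf '0x%s%s\n' "$_device" "$_vendor"
}
```

Usage: `echo "device_nomsi=$(nomsi_id 144d:1600)"`.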
(In reply to Tejun Heo from comment #53)

appears to be working fine for me.

(In reply to Tejun Heo from comment #53)

confirmed working here, too

I have a MBP 11,2 and the patch made no difference. Still get high IO wait.

Kernel compile time before: 9.15.98
Kernel compile time after: 9.20.20.

I also noticed when launching chromium I get extremely high IO wait.

I *DO* not run with libata.force=noncq.

11,1 not 11,2. (MBP 13")

Are you sure NCQ is active? Did you also disable barriers for ext4?

okay.. added barrier=0 into fstab and now iowait issue seems to have gone away.

Just booted into my "non-patched" kernel which has NCQ disabled, and the iowait issues have also gone away.

Why does disabling barriers make everything so much better? It is also a dangerous place to be in, I was hoping for a performance boost without disabling barriers.

queue_depth in patched kernel = 31
queue_depth in non-patched kernel = 1

Thanks for your time.

W/o barrier, journal is kinda useless and you're risking corrupting the filesystem across power loss, which may or may not be okay depending on your use case, I guess. I'll apply the patch to libata/for-3.18-fixes w/ stable cc'd.

Thanks.

Patch applied to libata/for-3.18-fixes w/ stable cc'd.

http://lkml.kernel.org/g/20141027143052.GM4436@htj.dyndns.org

Thanks.

Another affected user here, anything I can do to give debug information please ask... :/

To clarify: I am affected by the poor performance of FLUSH_EXT; the MSI/NCQ issue is fixed (or worked around) on 3.19.

On MBP 11,3 (Samsung PCIe SSD 144d:1600) with kernel 3.19 (Ubuntu 15.04) the problem remains. High IOwait when performing any disk activity. The worst offenders are launching Chrome to restore a large number of windows and dpkg (e.g. installing linux-headers, which apparently involves writing many small files)

Example iostat when launching chrome:

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 31.00 4.00 7.00 200.00 380.00 105.45 6.53 697.45 1104.00 465.14 90.91 100.00

avg-cpu: %user %nice %system %iowait %steal %idle
7.65 0.00 1.13 46.55 0.00 44.67

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 1.00 27.00 13.00 17.00 684.00 936.50 108.03 6.37 252.27 196.92 294.59 33.33 100.00

avg-cpu: %user %nice %system %iowait %steal %idle
3.03 0.00 0.38 52.84 0.00 43.74

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 1.00 19.00 5.00 7.00 324.00 236.00 93.33 9.06 842.67 916.80 789.71 83.33 100.00

avg-cpu: %user %nice %system %iowait %steal %idle
2.02 0.00 0.51 36.36 0.00 61.11

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 1.00 0.00 2.00 5.00 260.00 84.00 98.29 13.07 980.00 1148.00 912.80 142.86 100.00

This goes on for several minutes.

Setting barrier=0 in fstab solves the problem, but as others pointed out, it makes the fs vulnerable to corruption and is not really a solution

Another user affected by this. The 'sync' takes several seconds to finish. The 4k random IO performance is horrendous (in KB/s). The whole point of an SSD is to improve the 4k random IO performance...:)

The question is: where is this slow FLUSH_EXT performance resulting from for this SSD? How can we debug this? It certainly is not an issue when booted into OS X.

apt-get install brings my MacBook Pro 15" mid-2014 to an unusable stutter for many minutes. Chromium also often displays the status line "Waiting for AppCache..." for several to ten seconds when visiting any website. To summarize, if I understand correctly, the present workaround in layman's terms is:

1. Add barrier=0 option to /etc/fstab. This risks filesystem corruption in the event of a power loss.
2. Add device_nomsi=0x1600144d to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub and run grub-mkconfig -o /boot/grub/grub.cfg.

(In reply to Stephen Niedzielski from comment #66)
> apt-get install brings my MacBook Pro 15" mid-2014 to an unusable stutter
> for many minutes. Chromium also often displays the status line "Waiting for
> AppCache..." for several to ten seconds when visiting any website. To
> summarize, if I understand correctly, the present workaround in layman's
> terms is:
>
> 1. Add barrier=0 option to /etc/fstab. This risks filesystem corruption in
> the event of a power loss.
> 2. Add device_nomsi=0x1600144d to GRUB_CMDLINE_LINUX_DEFAULT in
> /etc/default/grub and run grub-mkconfig -o /boot/grub/grub.cfg.

Yes, this pretty much summarizes my experiences

(In reply to Stephen Niedzielski from comment #66)
> apt-get install brings my MacBook Pro 15" mid-2014 to an unusable stutter
> for many minutes. Chromium also often displays the status line "Waiting for
> AppCache..." for several to ten seconds when visiting any website. To
> summarize, if I understand correctly, the present workaround in layman's
> terms is:
>
> 1. Add barrier=0 option to /etc/fstab. This risks filesystem corruption in
> the event of a power loss.
> 2. Add device_nomsi=0x1600144d to GRUB_CMDLINE_LINUX_DEFAULT in
> /etc/default/grub and run grub-mkconfig -o /boot/grub/grub.cfg.

Modern kernels should only need 1. I'll look into this more in the near future -- I stopped working on it when school started for me.

Thank you for the confirmation, Alex and Brennan. I appreciate it. I've only tried with both changes made on Linux 3.19.0. Chromium has been working much better and apt-get works night and day better. I've disabled suspend, which seemed to fail sporadically for me before the change, and am hoping regular backups will be adequate in the event of a corrupted filesystem.

I tried with just barrier=0 on 3.13.0-rc7 and that works too.

Running Debian testing debian-installer with kernel 4.1.3-1 and I note that the high I/O wait condition is still a problem, even on MacBookPro11,1 systems.

I should also note that this issue prompted me to use btrfs in conjunction due to its faster speed than ext4 (prior to finding this bug), which eventually led to filesystem-killing corruption over time. Still trying to isolate whether or not this was simply btrfs failing, or something relating to this bug.

At some point, I think when I upgraded my distro, barrier=0 seemed to stop working for me so I also set device_nomsi=0x1600144d again. At any rate, the two changes _seem_ to be working fine.

Are there any updates to this issue for newer kernels? I have a MacbookPro11,3 as listed above on kernel 4.4.0 and I'm getting the exact same issues with slow small writes as listed previously. Setting device_nomsi=0x1600144d doesn't appear to do anything, and setting libata.force=noncq seems to be a partial workaround... but performance is much much higher in OS X or Windows 10.

Amusingly, booting the machine in virtualbox from OS X and just giving virtualbox access to the raw disk partitions through OS X allows disk access to work just fine.

Any information I can give to help find a way to resolve this issue?

I use a MacBookPro11,3 with kernel v4.2.0-23. I haven't done a comparison with / without in a while, but just device_nomsi=0x1600144d seems to work fine. I do notice slowdowns when I'm installing packages via apt-get but it's nothing dramatic like before. Make sure you run update-grub :) I don't think I've tried libata.force=noncq.

The device_nomsi=0x1600144d option doesn't seem to be making any difference on kernel 4.4. Do you really notice any drive benchmark changes with it enabled? Does that change at all with a newer kernel?

Is there any way to fix msi on this device? I imagine all newer MBPs will have this sort of issue...

I haven't run any benchmarks or tried with v4.4.

This is a MacBookPro11,1 running Ubuntu 16.10 with kernel 4.8.0-34-generic. The onboard SSD is:

04:00.0 SATA controller: Samsung Electronics Co Ltd Apple PCIe SSD (rev 01)

I'm seeing the same high iowait times (observed in the await column in `iostat -xz 1` output), which can be triggered easily by apt or other disk-intensive workloads such as bonnie++ (`bonnie++ -d /tmp -s 128 -b -r 0` is enough to trigger it).

I tried adding `device_nomsi=0x1600144d` to the kernel boot args first; this _might_ have made a slight improvement to iowait when running bonnie++, I _think_ it reduced the pegged-at-100% `%util` column. But I was eyeballing it, so this might have been confirmation bias.

Then I added `libata.force=noncq` to the kernel boot args and this reduced the output of `cat /sys/block/sda/device/queue_depth` from 31 to 1, and I _think_ it made another _slight_ improvement, but things were still pretty rough.

Finally, I added `barrier=0` to my /etc/fstab for the / partition, and (predictably) the overall performance of the machine skyrocketed. It's a night-and-day difference.

There's clearly something still amiss with the system when the first two workarounds are in place. I'm OK with running with the dangerous `barrier=0` setting, but it would be great for other users to not have to debug and find this thread to get it fixed :)

I'm happy to provide debugging data on request and/or to engage in an interactive debugging session if anyone's interested in trying to get to the bottom of this.

Some additional context: I was able to compare `await` times on this machine versus a Thinkpad X230 running the same kernel and the same bonnie++ test. `await` times were 100ms+ on the MacBookPro11,1 and sub-5ms on the Thinkpad X230, which has a SAMSUNG SSD 830 Series (CXM03B1Q) in it.
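Several of the reports above compare await figures by eye. A small filter along the following lines could make such comparisons less subjective. It is a sketch only: the function name `slow_writes` is hypothetical, and it assumes iostat output shaped like the samples pasted in this thread, locating the `w_await` column by its header name because the column layout differs between the pasted samples.

```shell
# Read iostat-style text on stdin and print "<device> <w_await>" for every
# device row whose w_await exceeds the limit (in ms) given as the first argument.
slow_writes() {
    awk -v limit="$1" '
        # Header row: remember which column holds w_await.
        $1 == "Device:" || $1 == "Device" {
            col = 0
            for (i = 1; i <= NF; i++) if ($i == "w_await") col = i
            next
        }
        # Data rows: skip short lines (such as avg-cpu blocks) and compare numerically.
        col && NF >= col && $col + 0 > limit { print $1, $col }
    '
}
```

For example, `iostat -x 1 10 | slow_writes 100` would list only the samples where writes waited longer than 100 ms, which is roughly the gap reported above between the MacBookPro11,1 and the Thinkpad X230.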