Bug 189471 - Acer Aspire Switch Alpha 12 (SA5-271) SSD not detected, all ATA ports are DUMMY + Dirty Fix
Status: NEW
Alias: None
Product: IO/Storage
Classification: Unclassified
Component: Serial ATA
Hardware: Intel Linux
Importance: P1 normal
Assignee: Tejun Heo
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-12-02 05:59 UTC by Sui Chen
Modified: 2017-05-08 17:48 UTC

See Also:
Kernel Version: 4.8
Subsystem:
Regression: No
Bisected commit-id:


Attachments
lspci_vvv (10.52 KB, text/plain)
2016-12-02 05:59 UTC, Sui Chen
acer-kernel-ahci.patch (1.07 KB, patch)
2017-01-19 04:48 UTC, Damian Ivanov
patch for ahci.c using DMI Match (1.36 KB, patch)
2017-01-19 20:44 UTC, Sui Chen
dmidecode output (11.30 KB, text/plain)
2017-01-19 20:45 UTC, Sui Chen
proposed patch on January 24 (3.34 KB, patch)
2017-01-24 22:51 UTC, Sui Chen
revised patch on May 6 (2.01 KB, patch)
2017-05-06 13:27 UTC, Sui Chen

Description Sui Chen 2016-12-02 05:59:36 UTC
Created attachment 246591
lspci_vvv

Hello,

I recently bought an Acer Aspire Switch Alpha 12 (Model Number: SA5-271) 2-in-1 convertible computer. This computer has an Intel Skylake i5-6200U processor and a Lite-On CV1-8B256 SSD.

I noticed that the kernel intermittently fails to detect the SSD as /dev/sda. The failure can sometimes be cleared by changing seemingly unrelated settings in the BIOS (such as clearing the secure boot databases), or worked around with a "dirty hack" in libahci.c (tested on kernel 4.8.11, https://cdn.kernel.org/pub/linux/kernel/v4.x/linux-4.8.11.tar.xz ). When the SSD is not detected, the boot stops with the alert "Gave Up waiting for root device. Alert! /dev/disk/by-uuid/ does not exist. Dropping to a shell." Typing "blkid" in the initramfs shell shows no devices either. This is very frustrating, since it can happen on more than 50% of boots.

When the SSD is detected the following relevant lines can be seen in the dmesg output:
   [    1.347021] ahci 0000:00:17.0: version 3.0
   [    1.365569] ahci 0000:00:17.0: AHCI 0001.0301 32 slots 3 ports 6 Gbps 0x7 impl SATA mode
   [    1.367519] ahci 0000:00:17.0: flags: 64bit ncq pm led clo only pio slum part deso sadm sds apst
   [    1.377574] scsi host0: ahci
   [    1.379465] scsi host1: ahci
   [    1.381373] scsi host2: ahci
   [    1.383060] ata1: SATA max UDMA/133 abar m2048@0xb1648000 port 0xb1648100 irq 124
   [    1.384665] ata2: SATA max UDMA/133 abar m2048@0xb1648000 port 0xb1648180 irq 124
   [    1.386305] ata3: SATA max UDMA/133 abar m2048@0xb1648000 port 0xb1648200 irq 124

However, when the SSD is NOT detected:
   [    1.337065] ahci 0000:00:17.0: version 3.0
-> [    1.343206] ahci 0000:00:17.0: implemented port map (0x7) contains more ports than nr_ports (2), using nr_ports
-> [    1.351165] ahci 0000:00:17.0: AHCI 0001.0301 32 slots 2 ports 6 Gbps 0x0 impl SATA mode
   [    1.352323] ahci 0000:00:17.0: flags: 64bit ncq pm led clo only pio slum part deso sadm sds apst
   [    1.355960] scsi host0: ahci
   [    1.357292] scsi host1: ahci
-> [    1.358405] ata1: DUMMY
-> [    1.359466] ata2: DUMMY

Comparing the marked lines with the working boot shows three differences when the SSD is not detected: 1) nr_ports is 2 instead of 3; 2) ata1 and ata2 are both DUMMY; and 3) the capability registers report a different number of ports, and the implemented port map is printed as 0x0 instead of 0x7.
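
One possible reading of the failing log, offered as a hedged sketch rather than a confirmed diagnosis: the driver cross-checks the Ports Implemented bitmap against the port count in CAP, and discards the bitmap when the two disagree. The code below is a simplified userspace paraphrase of that check as it appears to work in the 4.8-era libahci.c, not the verbatim kernel source. The function names `count_bits` and `effective_port_map` are invented for illustration.

```c
#include <stdint.h>

/* CAP.NP (bits 4:0) holds the number of ports minus one. */
unsigned int nr_ports(uint32_t cap)
{
    return (cap & 0x1f) + 1;
}

/* Count the ports flagged as implemented in the port map. */
unsigned int count_bits(uint32_t map)
{
    unsigned int n = 0;

    for (; map; map &= map - 1)
        n++;
    return n;
}

/*
 * Hypothetical helper: returns the port map the driver would end up using.
 * vers is the AHCI version register, e.g. 0x10301 for "AHCI 0001.0301".
 */
uint32_t effective_port_map(uint32_t cap, uint32_t port_map, uint32_t vers)
{
    unsigned int n = nr_ports(cap);

    /* The map claims more ports than CAP.NP allows: distrust the map. */
    if (port_map && count_bits(port_map) > n)
        port_map = 0;

    /* Assumption: only controllers older than AHCI 1.3 get a map rebuilt from NP. */
    if (!port_map && vers < 0x10300)
        port_map = (n >= 32) ? 0xffffffffu : (1u << n) - 1;

    return port_map;
}
```

Under these assumptions, `effective_port_map(0xC734FF01, 0x7, 0x10301)` yields 0x0, which would match the "0x0 impl" line above: with an empty map, every port is treated as unimplemented and shows up as DUMMY.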



Adding the following lines to the function ahci_save_initial_config() in libahci.c, at line 453, seems to fix this for now:
    if ((cap & 0xC734FF00) == 0xC734FF00) {
        dev_info(dev, "Forcing CAP to 0xC734FF02 and port_map to 0x7!\n");
        hpriv->saved_cap = cap = 0xC734FF02;
        hpriv->saved_port_map = port_map = 0x7;
    }

The version I used is 4.8.11 ( https://cdn.kernel.org/pub/linux/kernel/v4.x/linux-4.8.11.tar.xz ).

The code forces port_map to 0x7 and saved_cap to 0xC734FF02. When the SSD is detected, the CAP register holds 0xC734FF02; when detection fails, CAP is either 0xC734FF01 or 0xC734FF00.
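
The three CAP values differ only in their lowest bits, which is where the AHCI specification places the port count. As a rough illustration, the sketch below decodes the fields that the dmesg lines print (ports, command slots, link speed). The struct and function names are hypothetical; the bit positions follow the AHCI CAP register layout (NP in bits 4:0, NCS in bits 12:8, ISS in bits 23:20).

```c
#include <stdint.h>

/* Hypothetical container for the CAP fields that appear in the dmesg line. */
struct ahci_cap_fields {
    unsigned int ports;     /* CAP.NP  (bits 4:0)  + 1 */
    unsigned int slots;     /* CAP.NCS (bits 12:8) + 1 */
    unsigned int speed_gen; /* CAP.ISS (bits 23:20): 1, 2, 3 = 1.5, 3, 6 Gbps */
};

struct ahci_cap_fields decode_cap(uint32_t cap)
{
    struct ahci_cap_fields f;

    f.ports = (cap & 0x1f) + 1;
    f.slots = ((cap >> 8) & 0x1f) + 1;
    f.speed_gen = (cap >> 20) & 0xf;
    return f;
}
```

Decoded this way, 0xC734FF02 gives 3 ports, 32 slots and generation 3 (6 Gbps), in line with the working boot, while 0xC734FF01 and 0xC734FF00 give 2 ports and 1 port respectively. All three values satisfy the `(cap & 0xC734FF00) == 0xC734FF00` test used in the hack.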

I would greatly appreciate it if you could enlighten me on these questions:
- This fix of forcing magic values into the CAP register is very ad hoc and superficial; there should be a better way of doing this. How should I continue digging into the cause? (For example, should I find a way to dump the state changes in the controller?)
- Is the current approach a bad way of solving this? Could it damage the computer?

Thanks!
PS: Sorry for sending this multiple times. I sent it to the mailing list and to the maintainer listed at the top of the source file, and then realized I should come here to Bugzilla.
Comment 1 Damian Ivanov 2017-01-17 07:35:02 UTC
Same issue here, but my laptop has a Kingston SSD, so the issue seems to be with the controller.
Comment 2 Damian Ivanov 2017-01-18 07:31:44 UTC
Copr for Fedora 25:
https://copr.fedorainfracloud.org/coprs/damianatorrpm/acer_kernel/
Comment 3 Damian Ivanov 2017-01-18 11:50:51 UTC
To boot with the SSD detected, you need to run a custom kernel. I have built one: https://copr.fedorainfracloud.org/coprs/damianatorrpm/acer_kernel/
I have also created a custom Fedora .iso incorporating this kernel:
https://drive.google.com/drive/folders/0B_wtRVB2Z4pvWFpwVDUyTWlBcVU

With that, installation is possible, but the bootloader doesn't work.
I tried GRUB in both UEFI and legacy mode (including reinstalling grub2 from a chroot after installation), as well as rEFInd, systemd-boot and elilo. It is a BIOS bug.
So here is the solution:

How to install Fedora 25 (deletes Windows)
1) Download the .iso with my custom kernel.
2) Set the boot mode in the BIOS to legacy, not UEFI.
3) Make a bootable USB stick from the .iso you downloaded and boot from it.
4) Open GNOME Disks and format the entire drive (not just the partitions) with an MBR partition table. Do not use GPT or it won't work.
5) Install Fedora.

*Note: The custom kernel RPM sets Epoch to 1, which means the kernel will never be replaced by an update from the standard repo.
Comment 4 Tejun Heo 2017-01-18 18:36:33 UTC
Looks like the BIOS is messing up. I can't think of workarounds other than forcing the port map (or CAP) for the affected machines. ahci already contains a bunch of machine-specific workarounds. Can you please create a patch to match your system and apply the necessary workaround?

Thanks.
Comment 5 Damian Ivanov 2017-01-19 04:48:11 UTC
Created attachment 252401
acer-kernel-ahci.patch
Comment 6 Tejun Heo 2017-01-19 20:28:48 UTC
Thanks for the patch but you can't match the CAP value and override it. If you run "dmidecode" as root, it will print out a bunch of identification information. "Product Name" in "System Information" or "Base Board Information" is usually a good field to match. This is how other system-specific workarounds are applied too. If you have trouble creating a patch, please attach the dmidecode output. I can do the patch.
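
For illustration only, here is a userspace sketch of the kind of match being suggested. In the kernel, this is normally expressed as a `dmi_system_id` table with `DMI_MATCH()` entries passed to `dmi_check_system()`; the code below merely imitates the substring comparison against the same fields as exposed under /sys/class/dmi/id. The strings "Acer" and "SA5-271" are assumptions, since the dmidecode output is not reproduced in this thread, and all function names are hypothetical.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Substring comparison, similar in spirit to a non-exact DMI match. */
bool dmi_field_contains(const char *field, const char *needle)
{
    return field != NULL && needle != NULL && strstr(field, needle) != NULL;
}

/* Read one line from a sysfs DMI attribute; returns false if unreadable. */
bool read_dmi_attr(const char *path, char *buf, size_t len)
{
    FILE *f = fopen(path, "r");
    bool ok;

    if (f == NULL || len == 0)
        return false;
    ok = fgets(buf, (int)len, f) != NULL;
    fclose(f);
    if (ok)
        buf[strcspn(buf, "\n")] = '\0'; /* strip the trailing newline */
    return ok;
}

/* Assumed identification strings; check them against real dmidecode output. */
bool is_affected_machine(const char *vendor, const char *product)
{
    return dmi_field_contains(vendor, "Acer") &&
           dmi_field_contains(product, "SA5-271");
}
```

A caller could fill two buffers with `read_dmi_attr("/sys/class/dmi/id/sys_vendor", ...)` and `read_dmi_attr("/sys/class/dmi/id/product_name", ...)` and pass them to `is_affected_machine()`. Keying the workaround on the machine identity rather than on the CAP value keeps it from firing on unrelated controllers that happen to report the same register contents.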

Thanks!
Comment 7 Sui Chen 2017-01-19 20:44:18 UTC
Created attachment 252461
patch for ahci.c using DMI Match
Comment 8 Sui Chen 2017-01-19 20:45:17 UTC
Created attachment 252471
dmidecode output
Comment 9 Sui Chen 2017-01-19 20:46:13 UTC
Hi all,
I created a patch using the DMI_MATCH routine but I'm not sure if I did it correctly so I attached the dmidecode output as well.
Thanks!
Comment 10 Tejun Heo 2017-01-19 20:54:58 UTC
Sui, generally looks good to me but can you please make the following changes?

* Separate it out into its own function as other workarounds do.
* Other info messages aren't capitalized. Maybe drop the capitalization here too?
* Please add comments explaining what's going on and link back to this bz.

Once the patch is updated, can you please format the patch according to Documentation/process/submitting-patches.rst? There's a lot in there but it'd basically look like

  Subject: ahci: ONE LINE DESCRIPTION

  PATCH DESCRIPTION.

  Link: http://LINK_TO_THIS_BUG
  Signed-off-by: YOUR_NAME <YOUR_EMAIL>
  ---
  ACTUAL PATCH

Thanks!
Comment 11 Tejun Heo 2017-01-19 20:55:35 UTC
Looks correct to me. If it triggers correctly on your machine, it should be fine.
Comment 12 Sui Chen 2017-01-24 22:51:24 UTC
Created attachment 253091
proposed patch on January 24
Comment 13 Sui Chen 2017-01-24 22:54:12 UTC
I made the changes, formatted the patch, and added Damian, as he has tested the patch. I found that some existing workarounds are not in their own functions (such as the "MCP65 revision A1 and A2 can't do MSI" one), but I still put the SA5-271 workaround into its own function.

Thanks!
Comment 14 Sui Chen 2017-01-26 02:57:59 UTC
Update:

There is a BIOS update here: https://www.acer.com/ac/en/US/content/support-product/6806?b=1

After flashing 1.04, the workaround doesn't seem to work anymore. The system seems to hang at boot for a different reason: the ATA1 channel comes up and then goes down again.

As a result I downgraded to 1.03.
Comment 15 Sui Chen 2017-01-26 04:46:02 UTC
Correction:

What I described in Comment #14 is a rare event, but it can happen with both BIOS versions 1.03 and 1.04. I did some more tests with version 1.04 and it seems the workaround can correctly trigger and the computer can successfully boot into the system most of the time. It seems the BIOS version is irrelevant to the occurrences of the rare event.

I'll keep watching and see if there are any clues to the rare event.

Sorry about the confusion!
Comment 16 Damian Ivanov 2017-01-26 10:46:13 UTC
So far I have only tested the original patch (not in its own function).
I have been running BIOS 1.04 without any issues for a week or so now.
Comment 17 Sui Chen 2017-05-06 13:19:07 UTC
(In reply to Widen-Damian Ivanov from comment #16)
> So far I have only tested the original patch (not in its own function).
> I have been running BIOS 1.04 without any issues for a week or so now.

Hi, I'm back and I think I found what happened in Comment #14. It looks like it is caused by n_ports being set to 1 at this line in the function ahci_init_one() in ahci.c:

  n_ports = max(ahci_nr_ports(hpriv->cap), fls(hpriv->port_map));

In this case, only ATA1 is probed; ATA2 is not. The system then fails to find the SSD on ATA1 and concludes there is no SSD.
Setting hpriv->port_map and hpriv->cap before n_ports is computed, so that n_ports becomes 2 or 3, seems to fix this issue. (The current fix sets these two values after n_ports is computed.)
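
To make the ordering problem concrete, here is a small userspace sketch of the n_ports computation quoted above. It assumes `ahci_nr_ports()` returns the low five bits of CAP plus one and that `fls()` returns the 1-based position of the highest set bit (0 for an empty map); `fls32` and `host_port_count` are stand-in names, not kernel functions.

```c
#include <stdint.h>

/* CAP.NP (bits 4:0) holds the number of ports minus one. */
unsigned int nr_ports(uint32_t cap)
{
    return (cap & 0x1f) + 1;
}

/* 1-based index of the highest set bit; 0 when no bit is set. */
unsigned int fls32(uint32_t x)
{
    unsigned int pos = 0;

    while (x) {
        pos++;
        x >>= 1;
    }
    return pos;
}

/* Stand-in for: n_ports = max(ahci_nr_ports(cap), fls(port_map)) */
unsigned int host_port_count(uint32_t cap, uint32_t port_map)
{
    unsigned int a = nr_ports(cap);
    unsigned int b = fls32(port_map);

    return a > b ? a : b;
}
```

With CAP ending in 00 and an empty port map, `host_port_count(0xC734FF00, 0x0)` is 1, so only one port would be allocated and correcting the values afterwards cannot add more. With the forced values, `host_port_count(0xC734FF02, 0x7)` is 3.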

The case where n_ports is set to 1 may be triggered by booting into Windows and then rebooting into Linux. I'm using BIOS 1.04.

I'll test this fix for some more time to confirm that it works reliably.
Comment 18 Sui Chen 2017-05-06 13:27:34 UTC
Created attachment 256241
revised patch on May 6

The SA5-271 workaround is moved before the point where n_ports is set. This seems to make the workaround more reliable on a boot that immediately follows a reboot from Windows.
Comment 19 Tejun Heo 2017-05-08 17:48:26 UTC
Sui Chen, can you post the patch to linux-ide@vger.kernel.org w/ proper patch description and Signed-off-by and cc me?

Thanks!
