Bug 217114 - Tiger Lake SATA Controller not operating correctly, failing to populate partitions in /dev
Status: RESOLVED CODE_FIX
Alias: None
Product: IO/Storage
Classification: Unclassified
Component: Serial ATA
Hardware: Intel Linux
Importance: P1 high
Assignee: Tejun Heo
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2023-03-02 11:25 UTC by emmi
Modified: 2024-05-08 15:17 UTC
CC List: 6 users

See Also:
Kernel Version: 6.2.1
Subsystem:
Regression: No
Bisected commit-id:


Attachments
lspci -vvv on kernel 6.1.12 ASUS Vivobook 15 X513EAN (28.72 KB, text/plain)
2023-03-05 07:15 UTC, Vitalii Solomonov
lspci -vvv on kernel 6.2.1 ASUS Vivobook 15 X513EAN (28.72 KB, text/plain)
2023-03-05 07:15 UTC, Vitalii Solomonov
dmesg on kernel 6.1.12 ASUS Vivobook 15 X513EAN (80.95 KB, text/plain)
2023-03-05 07:16 UTC, Vitalii Solomonov
dmesg on kernel 6.2.1 ASUS Vivobook 15 X513EAN (76.26 KB, text/plain)
2023-03-05 07:16 UTC, Vitalii Solomonov
dmesg VMD enabled kernel 6.8.0-rc2 (66.45 KB, text/plain)
2024-01-31 20:32 UTC, Vitalii Solomonov
dmesg VMD disabled kernel 6.8.0-rc2 (62.89 KB, text/plain)
2024-01-31 20:33 UTC, Vitalii Solomonov
rdsoreport VMD enabled kernel 6.8.0-rc2 (175.75 KB, text/plain)
2024-01-31 20:33 UTC, Vitalii Solomonov
rdsoreport VMD disabled kernel 6.8.0-rc2 (176.74 KB, text/plain)
2024-01-31 20:34 UTC, Vitalii Solomonov
dmesg VMD enabled kernel 6.8.0-rc2 CONFIG_SATA_MOBILE_LPM_POLICY=3 (80.15 KB, text/plain)
2024-02-01 03:46 UTC, Vitalii Solomonov
journalctl VMD enabled kernel 6.8.0-rc2 CONFIG_SATA_MOBILE_LPM_POLICY=3 (356.02 KB, application/octet-stream)
2024-02-01 03:46 UTC, Vitalii Solomonov
Bind LPM policy to [8086:a0d3] with enabled VMD (90.28 KB, text/plain)
2024-02-02 08:41 UTC, Jian-Hong Pan
dmesg with binding LPM policy with patch "ata: ahci: Add force LPM policy quirk for ASUS B1400CEAE" on enabled VMD machine (92.00 KB, text/plain)
2024-02-06 07:50 UTC, Jian-Hong Pan
dmesg with binding LPM policy with PCI IDs matching on enabled VMD machine (90.37 KB, text/plain)
2024-02-06 07:54 UTC, Jian-Hong Pan

Description emmi 2023-03-02 11:25:00 UTC
As per the kernel problem reported at https://bbs.archlinux.org/viewtopic.php?id=283906,

Commit 104ff59af73aba524e57ae0fef70121643ff270e seems to have broken Intel Tiger Lake SATA controllers in a way that prevents boot, as the sysroot partition will not be found. 

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=104ff59af73aba524e57ae0fef70121643ff270e
Comment 1 schwagsucks 2023-03-02 17:31:53 UTC
As some people reported in the referenced Arch forum thread, this seems to have started in 6.1.13; 6.1.12 boots as expected.

The problem is that SATA disks are no longer recognized, which is why the sysroot partition can't be found.

My primary disk is NVMe, so as long as I remove all SATA references from my fstab I can boot, but I then can't mount the SATA partitions because the devices are not present in /dev.

Any attempt to boot with a SATA disk in fstab results in a boot failure and an emergency shell.
Comment 2 schwagsucks 2023-03-02 19:31:28 UTC
I can provide any details required.

My SATA controller:
10000:e0:17.0 SATA controller: Intel Corporation Tiger Lake-LP SATA Controller (rev 20) (prog-if 01 [AHCI 1.0])
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0
	Interrupt: pin A routed to IRQ 146
	Region 0: Memory at 50100000 (32-bit, non-prefetchable) [size=8K]
	Region 1: Memory at 50102800 (32-bit, non-prefetchable) [size=256]
	Region 5: Memory at 50102000 (32-bit, non-prefetchable) [size=2K]
	Capabilities: [80] MSI: Enable+ Count=1/1 Maskable- 64bit-
		Address: fee01000  Data: 0000
	Capabilities: [70] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot+,D3cold-)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [a8] SATA HBA v1.0 BAR4 Offset=00000004
	Kernel driver in use: ahci
Comment 3 Damien Le Moal 2023-03-03 08:04:11 UTC
Is it possible for you to capture and post the ata/ahci-related messages from a bad boot?
Comment 4 Damien Le Moal 2023-03-03 08:10:10 UTC
And can you try booting with libata.force=nolpm to check?
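For anyone repeating this test, it helps to confirm that the parameter really reached the kernel. Below is a minimal user-space sketch (not kernel code) that checks a command line string, such as the contents of /proc/cmdline, for a libata.force value of nolpm. The helper name cmdline_forces_nolpm is hypothetical, and the parsing assumes space-separated parameters whose values are comma-separated and may carry an "ID:" prefix.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: return true if the given kernel command line carries
 * libata.force= with a "nolpm" value. Values are assumed to be
 * comma-separated and may be prefixed with a port/device ID such as "1.00:".
 */
static bool cmdline_forces_nolpm(const char *cmdline)
{
	static const char key[] = "libata.force=";
	const char *p = cmdline;

	while ((p = strstr(p, key)) != NULL) {
		/* Only accept the key at the start of a parameter. */
		bool at_start = (p == cmdline) || p[-1] == ' ';

		p += sizeof(key) - 1;
		if (!at_start)
			continue;

		/* Walk the comma-separated values up to the next space. */
		while (*p && *p != ' ' && *p != '\n') {
			size_t len = strcspn(p, ", \n");
			const char *val = p;
			size_t vlen = len;
			/* Skip an optional "ID:" prefix inside this item. */
			const char *colon = memchr(p, ':', len);

			if (colon) {
				val = colon + 1;
				vlen = len - (size_t)(val - p);
			}
			if (vlen == 5 && strncmp(val, "nolpm", 5) == 0)
				return true;

			p += len;
			if (*p == ',')
				p++;
		}
	}
	return false;
}
```

Reading /proc/cmdline into a buffer and passing it to this function is enough to tell whether a failed boot was really tested with LPM disabled.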
Comment 5 emmi 2023-03-03 08:34:28 UTC
(In reply to Damien Le Moal from comment #4)
> And can you try booting with libata.force=nolpm to check ?

As per the linked forum thread, this does not fix the issue (most of the additional information is on pages 2 and 3 of that thread).


(In reply to Damien Le Moal from comment #3)
> Is it possible for you to get and post the ata/ahci related messages during
> a bad boot ?

Nope. No sysroot means no console, and attempts to load a console prior to the root mount fail for me (probably because sulogin etc. are restricted).
Comment 6 Damien Le Moal 2023-03-03 08:42:32 UTC
(In reply to emmi from comment #5)
> (In reply to Damien Le Moal from comment #4)
> > And can you try booting with libata.force=nolpm to check ?
> 
> As per the forum thread attached, this does not correct the issue (most of
> the extraneous information is on pages 2 and 3 of that thread)

Missed that. Will have a look.

> (In reply to Damien Le Moal from comment #3)
> > Is it possible for you to get and post the ata/ahci related messages during
> > a bad boot ?
> 
> Nope, no sysroot means no console and attempts to load a console prior to
> root mount fail for me (probably because of sulogin etc being restricted)

Can you use a serial console to capture the messages?
Comment 7 emmi 2023-03-03 08:48:14 UTC
(In reply to Damien Le Moal from comment #6)
> (In reply to emmi from comment #5)
> > (In reply to Damien Le Moal from comment #4)
> > > And can you try booting with libata.force=nolpm to check ?
> > 
> > As per the forum thread attached, this does not correct the issue (most of
> > the extraneous information is on pages 2 and 3 of that thread)
> 
> Missed that. Will have a look.
> 
> > (In reply to Damien Le Moal from comment #3)
> > > Is it possible for you to get and post the ata/ahci related messages
> during
> > > a bad boot ?
> > 
> > Nope, no sysroot means no console and attempts to load a console prior to
> > root mount fail for me (probably because of sulogin etc being restricted)
> 
> Can you use a serial console to capture the messages ?

Personally, I cannot without disassembling my laptop and likely soldering to test pads: it's somewhere between an ultrabook and a laptop, and thus doesn't have integrated serial connectivity.
Comment 8 Damien Le Moal 2023-03-03 08:51:09 UTC
Ah, OK. This is a laptop... Too bad. These error messages would be really useful for coming up with a better solution than reverting the patch causing the issue. Maybe a video of the screen? (Disable rhgb or any other graphical boot and add the earlycon kernel parameter. You should then be able to see the boot messages and errors.)
Comment 9 emmi 2023-03-03 09:07:53 UTC
(In reply to Damien Le Moal from comment #8)
> ah. OK. This is a laptop... Too bad. These error messages would be really
> useful to come up with a better solution than reverting the patch causing
> the issue. May be a screen video ? (disable rhgb or any other graphic boot
> stuff and add earlycon kernel parameter. You should be able to see the boot
> messages & errors).

I'm not currently able to, as I'm not at home, but I'm sure some others will be able to provide that data...
Comment 10 Damien Le Moal 2023-03-03 09:21:55 UTC
I am going to send a revert to Linus & stable now. We can figure out how to correctly enable LPM for this adapter during the 6.3 cycle.
Comment 11 Damien Le Moal 2023-03-03 10:33:52 UTC
Revert sent. Probably will be picked up in 6.1.15.
Comment 12 emmi 2023-03-05 00:22:11 UTC
Revert accepted
Comment 13 Vitalii Solomonov 2023-03-05 07:15:21 UTC
Created attachment 303855
lspci -vvv on kernel 6.1.12 ASUS Vivobook 15 X513EAN
Comment 14 Vitalii Solomonov 2023-03-05 07:15:56 UTC
Created attachment 303856
lspci -vvv on kernel 6.2.1 ASUS Vivobook 15 X513EAN
Comment 15 Vitalii Solomonov 2023-03-05 07:16:34 UTC
Created attachment 303857
dmesg on kernel 6.1.12 ASUS Vivobook 15 X513EAN
Comment 16 Vitalii Solomonov 2023-03-05 07:16:51 UTC
Created attachment 303858
dmesg on kernel 6.2.1 ASUS Vivobook 15 X513EAN
Comment 17 Vitalii Solomonov 2023-03-05 07:21:28 UTC
I have an ASUS Vivobook 15 X513EAN laptop and can confirm that. I have / on nvme and /home on sda. 6.1.12 works fine; on 6.2.0, /dev is not populated with the sda devices.
I'm attaching my full dmesg and lspci output for both kernels. What additional logs are needed?
Comment 18 Daniel Drake 2024-01-30 18:46:15 UTC
Is anyone previously affected by this issue available to participate in some testing/diagnosis of the root cause here? This would involve building kernels with specific patches applied and getting logs (so you need to be using the SATA disk as secondary, booting from another device).

We have identified that the lack of LPM mode is likely causing all platforms using this chipset to excessively drain the battery in suspend mode (see bug #218394). We want to fix that for all, but obviously need to avoid the breakage that was discovered here.

The platform we have (Asus B1400) works just fine with a SATA disk added in LPM mode, so we aren't able to diagnose there.
Comment 19 Vitalii Solomonov 2024-01-31 06:20:31 UTC
I'm not experienced in this kind of testing, but I'm willing to participate.
My only concern is that my hardware could be damaged. If the chance of that happening is considered low, I'm ready.

In my current configuration I boot from the SSD (nvme), and the HDD (sda) is my secondary drive. Will that require any changes?
Comment 20 Daniel Drake 2024-01-31 11:15:53 UTC
Thanks Vitalii!

First thing to do would be to reproduce the bug on a newer kernel version, preferably 6.8-rc2.
You'll need to apply the original patch: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/patch/?id=104ff59af73aba524e57ae0fef70121643ff270e

Then confirm that the new kernel boots without the SATA disk being detected, and capture dmesg logs for good measure.

Assuming that is the case, please then go into your BIOS setup menu and try to disable Intel VMD. Refer to this video: https://www.youtube.com/watch?v=_Ft9KBTC2kk
It may warn about "malfunction" or similar - that can be ignored in the case of Linux, which does not support the RAID mode that exists beneath this function.

With VMD disabled, boot up again and see if the SATA disk is detected, capturing new dmesg logs regardless of success/failure.

If you additionally have Windows installed on this device, please then re-enable VMD in the BIOS before attempting your next Windows boot.
Comment 21 Vitalii Solomonov 2024-01-31 20:32:41 UTC
Created attachment 305797
dmesg VMD enabled kernel 6.8.0-rc2
Comment 22 Vitalii Solomonov 2024-01-31 20:33:10 UTC
Created attachment 305798
dmesg VMD disabled kernel 6.8.0-rc2
Comment 23 Vitalii Solomonov 2024-01-31 20:33:48 UTC
Created attachment 305799
rdsoreport VMD enabled kernel 6.8.0-rc2
Comment 24 Vitalii Solomonov 2024-01-31 20:34:17 UTC
Created attachment 305800
rdsoreport VMD disabled kernel 6.8.0-rc2
Comment 25 Vitalii Solomonov 2024-02-01 03:46:31 UTC
Created attachment 305804
dmesg VMD enabled kernel 6.8.0-rc2 CONFIG_SATA_MOBILE_LPM_POLICY=3
Comment 26 Vitalii Solomonov 2024-02-01 03:46:59 UTC
Created attachment 305805
journalctl VMD enabled kernel 6.8.0-rc2 CONFIG_SATA_MOBILE_LPM_POLICY=3
Comment 27 Jian-Hong Pan 2024-02-02 08:41:24 UTC
Created attachment 305814
Bind LPM policy to [8086:a0d3] with enabled VMD

If I revert commit 6210038aeaf4 ("ata: ahci: Revert "ata: ahci: Add Tiger Lake UP{3,4} AHCI controller"") on top of kernel v6.8-rc2 (resolving the conflict, of course), then the SATA HDD disappears!? The issue reproduces with both CONFIG_SATA_MOBILE_LPM_POLICY=3 and 0 on the ASUS B1400CEAE.

$ dmesg | grep SATA
[    0.783211] ahci 10000:e0:17.0: AHCI 0001.0301 32 slots 1 ports 6 Gbps 0x1 impl SATA mode
[    0.783399] ata1: SATA max UDMA/133 abar m2048@0x76102000 port 0x76102100 irq 144 lpm-pol 3
[    1.096685] ata1: SATA link down (SStatus 4 SControl 300)
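The "SStatus 4 SControl 300" values in the last line are hexadecimal register dumps, and splitting them into fields makes the failure easier to read. The sketch below is user-space C with hypothetical helper names (scr_decode, sstatus_det_name, scontrol_ipm_name). It assumes the usual SATA layout of DET in bits 3:0, SPD in bits 7:4 and IPM in bits 11:8.

```c
#include <stdint.h>

/* Fields shared by the SStatus and SControl registers. */
struct scr_fields {
	unsigned det;	/* bits 3:0:  device detection */
	unsigned spd;	/* bits 7:4:  interface speed */
	unsigned ipm;	/* bits 11:8: interface power management */
};

/* Split a raw SStatus/SControl value into its three 4-bit fields. */
static struct scr_fields scr_decode(uint32_t scr)
{
	struct scr_fields f = {
		.det = scr & 0xf,
		.spd = (scr >> 4) & 0xf,
		.ipm = (scr >> 8) & 0xf,
	};
	return f;
}

/* Meaning of SStatus.DET. */
static const char *sstatus_det_name(unsigned det)
{
	switch (det) {
	case 0x0: return "no device detected, Phy communication not established";
	case 0x1: return "device detected, Phy communication not established";
	case 0x3: return "device detected, Phy communication established";
	case 0x4: return "Phy offline (interface disabled or in BIST loopback)";
	default:  return "reserved";
	}
}

/* Meaning of SControl.IPM for the basic values 0..3. */
static const char *scontrol_ipm_name(unsigned ipm)
{
	switch (ipm) {
	case 0x0: return "no power-state restrictions";
	case 0x1: return "transitions to Partial disabled";
	case 0x2: return "transitions to Slumber disabled";
	case 0x3: return "transitions to Partial and Slumber disabled";
	default:  return "other";
	}
}
```

Under those assumptions, SStatus 0x4 decodes to DET=4 (Phy offline) and SControl 0x300 decodes to IPM=3, so the log shows a port whose Phy is offline rather than a port with nothing attached (which would read DET=0).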
Comment 28 Jian-Hong Pan 2024-02-06 07:50:18 UTC
Created attachment 305838
dmesg with binding LPM policy with patch "ata: ahci: Add force LPM policy quirk for ASUS B1400CEAE" on enabled VMD machine

Bound the LPM policy using the patch "ata: ahci: Add force LPM policy quirk for ASUS B1400CEAE" [1] on top of kernel v6.8-rc2, with the following debug messages added:

diff --git a/drivers/ata/ahci.c b/drivers/ata/ahci.c
index 7ecd56c8262a..b910c7856d08 100644
--- a/drivers/ata/ahci.c
+++ b/drivers/ata/ahci.c
@@ -1677,8 +1676,10 @@ static void ahci_update_initial_lpm_policy(struct ata_port *ap,
 
 
        /* Ignore processing for chipsets that don't use policy */
-       if (!(hpriv->flags & AHCI_HFLAG_USE_LPM_POLICY))
+       if (!(hpriv->flags & AHCI_HFLAG_USE_LPM_POLICY)) {
+               dev_info(ap->dev, "%s: do not use LPM policy\n", __func__);
                return;
+       }
 
        /* user modified policy via module param */
        if (mobile_lpm_policy != -1) {
@@ -1696,6 +1697,7 @@ static void ahci_update_initial_lpm_policy(struct ata_port *ap,
 update_policy:
        if (policy >= ATA_LPM_UNKNOWN && policy <= ATA_LPM_MIN_POWER)
                ap->target_lpm_policy = policy;
+       dev_info(ap->dev, "%s: policy %d\n", __func__, policy);
 }
 
 static void ahci_intel_pcs_quirk(struct pci_dev *pdev, struct ahci_host_priv *hpriv)
@@ -1706,12 +1708,16 @@ static void ahci_intel_pcs_quirk(struct pci_dev *pdev, struct ahci_host_priv *hp
        /*
         * Only apply the 6-port PCS quirk for known legacy platforms.
         */
-       if (!id || id->vendor != PCI_VENDOR_ID_INTEL)
+       if (!id || id->vendor != PCI_VENDOR_ID_INTEL) {
+               dev_info(&pdev->dev, "%s: not Intel, the vendor is 0x%08x\n", __func__, id ? id->vendor : 0);
                return;
+       }
 
        /* Skip applying the quirk on Denverton and beyond */
-       if (((enum board_ids) id->driver_data) >= board_ahci_pcs7)
+       if (((enum board_ids) id->driver_data) >= board_ahci_pcs7) {
+               dev_info(&pdev->dev, "%s: skip\n", __func__);
                return;
+       }
 
        /*
         * port_map is determined from PORTS_IMPL PCI register which is
@@ -1722,8 +1728,10 @@ static void ahci_intel_pcs_quirk(struct pci_dev *pdev, struct ahci_host_priv *hp
         * before the OS boots.
         */
        pci_read_config_word(pdev, PCS_6, &tmp16);
+       dev_info(&pdev->dev, "%s: PCS_6 is 0x%04x", __func__, tmp16);
        if ((tmp16 & hpriv->port_map) != hpriv->port_map) {
                tmp16 |= hpriv->port_map;
+               dev_info(&pdev->dev, "%s: write PCS_6 with 0x%04x", __func__, tmp16);
                pci_write_config_word(pdev, PCS_6, tmp16);
        }
 }
@@ -1998,6 +2006,7 @@ static int ahci_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
        if (rc)
                return rc;
 
+       dev_info(&pdev->dev, "%s: probed\n", __func__);
        pm_runtime_put_noidle(&pdev->dev);
        return 0;
 }
diff --git a/drivers/ata/libahci.c b/drivers/ata/libahci.c
index 1a63200ea437..7e4f349554eb 100644
--- a/drivers/ata/libahci.c
+++ b/drivers/ata/libahci.c
@@ -812,6 +812,7 @@ static int ahci_set_lpm(struct ata_link *link, enum ata_lpm_policy policy,
        struct ahci_port_priv *pp = ap->private_data;
        void __iomem *port_mmio = ahci_port_base(ap);
 
+       ata_link_info(link, "%s: policy=%d\n", __func__, policy);
        if (policy != ATA_LPM_MAX_POWER) {
                /* wakeup flag only applies to the max power policy */
                hints &= ~ATA_LPM_WAKE_ONLY;
@@ -1533,6 +1534,12 @@ int ahci_check_ready(struct ata_link *link)
 {
        void __iomem *port_mmio = ahci_port_base(link->ap);
        u8 status = readl(port_mmio + PORT_TFDATA) & 0xFF;
+       u32 cur = 0;
+
+       sata_scr_read(link, SCR_STATUS, &cur);
+
+       ata_link_info(link, "BUSY ? %d (status: %#x) SStatus.DET: %#x\n",
+                     status & ATA_BUSY, status, cur & 0xf);
 
        return ata_check_ready(status);
 }
diff --git a/drivers/ata/libata-sata.c b/drivers/ata/libata-sata.c
index 0fb1934875f2..4bcedd46bcfa 100644
--- a/drivers/ata/libata-sata.c
+++ b/drivers/ata/libata-sata.c
@@ -344,6 +344,7 @@ int sata_link_resume(struct ata_link *link, const unsigned int *params,
        if (!(rc = sata_scr_read(link, SCR_ERROR, &serror)))
                rc = sata_scr_write(link, SCR_ERROR, serror);
 
+       ata_link_info(link, "%s: rc=%d", __func__, rc);
        return rc != -EINVAL ? rc : 0;
 }
 EXPORT_SYMBOL_GPL(sata_link_resume);
@@ -378,6 +379,7 @@ int sata_link_scr_lpm(struct ata_link *link, enum ata_lpm_policy policy,
        if (rc)
                return rc;
 
+       ata_link_info(link, "%s: policy is %d and original scontrol 0x%08x\n", __func__, policy, scontrol);
        switch (policy) {
        case ATA_LPM_MAX_POWER:
                /* disable all LPM transitions */
@@ -422,6 +424,7 @@ int sata_link_scr_lpm(struct ata_link *link, enum ata_lpm_policy policy,
                WARN_ON(1);
        }
 
+       ata_link_info(link, "%s: write scontrol 0x%08x\n", __func__, scontrol);
        rc = sata_scr_write(link, SCR_CONTROL, scontrol);
        if (rc)
                return rc;
@@ -586,9 +589,12 @@ int sata_link_hardreset(struct ata_link *link, const unsigned int *timing,
        rc = sata_link_resume(link, timing, deadline);
        if (rc)
                goto out;
+
        /* if link is offline nothing more to do */
-       if (ata_phys_link_offline(link))
+       if (ata_phys_link_offline(link)) {
+               ata_link_info(link, "%s: ata_phys_link_offline is True\n", __func__);
                goto out;
+       }
 
        /* Link is online.  From this point, -ENODEV too is an error. */
        if (online)
@@ -616,12 +622,15 @@ int sata_link_hardreset(struct ata_link *link, const unsigned int *timing,
        rc = 0;
        if (check_ready)
                rc = ata_wait_ready(link, deadline, check_ready);
+
+       ata_link_info(link, "%s: is %d\n", __func__, rc);
  out:
        if (rc && rc != -EAGAIN) {
                /* online is set iff link is online && reset succeeded */
                if (online)
                        *online = false;
        }
+       ata_link_info(link, "%s: is %s line, returns %d\n", __func__, (online && *online) ? "on" : "off", rc);
        return rc;
 }
 EXPORT_SYMBOL_GPL(sata_link_hardreset);


[1]: https://patchwork.kernel.org/project/linux-pci/patch/20240130095933.14158-1-jhp@endlessos.org/
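To read the "policy %d" and "lpm-pol 3" values these debug messages produce, the sketch below lists the policy numbers as I understand enum ata_lpm_policy around v6.8. The SKETCH_ names and both helpers are hypothetical; only the numeric order is meant to mirror the kernel, and the descriptive strings are paraphrases, not the kernel's sysfs strings.

```c
/* Assumed to mirror the order of enum ata_lpm_policy around v6.8. */
enum sketch_lpm_policy {
	SKETCH_LPM_UNKNOWN,			/* 0: keep firmware settings */
	SKETCH_LPM_MAX_POWER,			/* 1: LPM disabled */
	SKETCH_LPM_MED_POWER,			/* 2 */
	SKETCH_LPM_MED_POWER_WITH_DIPM,		/* 3 */
	SKETCH_LPM_MIN_POWER_WITH_PARTIAL,	/* 4 */
	SKETCH_LPM_MIN_POWER,			/* 5 */
};

/* Same range check as the update_policy label in the diff above. */
static int sketch_lpm_policy_valid(int policy)
{
	return policy >= SKETCH_LPM_UNKNOWN && policy <= SKETCH_LPM_MIN_POWER;
}

/* Human-readable paraphrase of each policy value. */
static const char *sketch_lpm_policy_name(int policy)
{
	switch (policy) {
	case SKETCH_LPM_UNKNOWN:		return "keep firmware settings";
	case SKETCH_LPM_MAX_POWER:		return "max performance";
	case SKETCH_LPM_MED_POWER:		return "medium power";
	case SKETCH_LPM_MED_POWER_WITH_DIPM:	return "medium power with DIPM";
	case SKETCH_LPM_MIN_POWER_WITH_PARTIAL:	return "min power with partial";
	case SKETCH_LPM_MIN_POWER:		return "min power";
	default:				return "invalid";
	}
}
```

With this mapping, CONFIG_SATA_MOBILE_LPM_POLICY=3 corresponds to medium power with DIPM, and 0 leaves the firmware settings in place.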
Comment 29 Jian-Hong Pan 2024-02-06 07:54:35 UTC
Created attachment 305839
dmesg with binding LPM policy with PCI IDs matching on enabled VMD machine

Bind LPM policy with PCI IDs matching [8086:a0d3] based on kernel v6.8-rc2 and the same debug messages in comment #28.
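For readers unfamiliar with what "PCI IDs matching" means here, the following is a rough user-space sketch of the idea: a table keyed by vendor and device ID that decides whether a controller gets the LPM policy. It is not the ahci driver's actual table. The struct, table and function names are hypothetical, and only the [8086:a0d3] pair comes from this report.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical table entry: one PCI vendor/device pair plus a flag. */
struct sketch_pci_id {
	uint16_t vendor;
	uint16_t device;
	int use_lpm_policy;	/* 1 if the LPM policy should be applied */
};

static const struct sketch_pci_id sketch_ids[] = {
	{ 0x8086, 0xa0d3, 1 },	/* the [8086:a0d3] controller from this bug */
};

/* Return the matching entry, or NULL if the device is not listed. */
static const struct sketch_pci_id *sketch_match(uint16_t vendor,
						uint16_t device)
{
	size_t i;

	for (i = 0; i < sizeof(sketch_ids) / sizeof(sketch_ids[0]); i++) {
		if (sketch_ids[i].vendor == vendor &&
		    sketch_ids[i].device == device)
			return &sketch_ids[i];
	}
	return NULL;
}
```

Matching on IDs alone applies the policy to every machine with this controller, whereas the quirk patch from comment #28 targets one specific laptop model. That difference is what the two attached logs compare.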
