Bug 217114
Description
emmi
2023-03-02 11:25:00 UTC
As some people in the referenced Arch forum post reported, this seems to have started in 6.1.13; 6.1.12 loads as expected. The problem is that the SATA disks are no longer recognized, which is why the reported sysroot partition can't be found. My primary disk is NVMe, and as long as I remove all SATA references from my fstab I can boot, but then I can't mount the device partitions because the devices are not present in /dev. Any attempt to boot with a SATA disk in fstab results in a boot failure with an emergency shell.

I can provide any details required.

My SATA controller:

10000:e0:17.0 SATA controller: Intel Corporation Tiger Lake-LP SATA Controller (rev 20) (prog-if 01 [AHCI 1.0])
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0
	Interrupt: pin A routed to IRQ 146
	Region 0: Memory at 50100000 (32-bit, non-prefetchable) [size=8K]
	Region 1: Memory at 50102800 (32-bit, non-prefetchable) [size=256]
	Region 5: Memory at 50102000 (32-bit, non-prefetchable) [size=2K]
	Capabilities: [80] MSI: Enable+ Count=1/1 Maskable- 64bit-
		Address: fee01000  Data: 0000
	Capabilities: [70] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot+,D3cold-)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [a8] SATA HBA v1.0 BAR4 Offset=00000004
	Kernel driver in use: ahci

Is it possible for you to get and post the ata/ahci related messages during a bad boot ?

And can you try booting with libata.force=nolpm to check ?

(In reply to Damien Le Moal from comment #4)
> And can you try booting with libata.force=nolpm to check ?

As per the forum thread attached, this does not correct the issue (most of the extraneous information is on pages 2 and 3 of that thread)

(In reply to Damien Le Moal from comment #3)
> Is it possible for you to get and post the ata/ahci related messages during
> a bad boot ?

Nope, no sysroot means no console and attempts to load a console prior to root mount fail for me (probably because of sulogin etc being restricted)

(In reply to emmi from comment #5)
> (In reply to Damien Le Moal from comment #4)
> > And can you try booting with libata.force=nolpm to check ?
>
> As per the forum thread attached, this does not correct the issue (most of
> the extraneous information is on pages 2 and 3 of that thread)

Missed that. Will have a look.

> (In reply to Damien Le Moal from comment #3)
> > Is it possible for you to get and post the ata/ahci related messages during
> > a bad boot ?
>
> Nope, no sysroot means no console and attempts to load a console prior to
> root mount fail for me (probably because of sulogin etc being restricted)

Can you use a serial console to capture the messages ?

(In reply to Damien Le Moal from comment #6)
> (In reply to emmi from comment #5)
> > (In reply to Damien Le Moal from comment #4)
> > > And can you try booting with libata.force=nolpm to check ?
> >
> > As per the forum thread attached, this does not correct the issue (most of
> > the extraneous information is on pages 2 and 3 of that thread)
>
> Missed that. Will have a look.
>
> > (In reply to Damien Le Moal from comment #3)
> > > Is it possible for you to get and post the ata/ahci related messages during
> > > a bad boot ?
> >
> > Nope, no sysroot means no console and attempts to load a console prior to
> > root mount fail for me (probably because of sulogin etc being restricted)
>
> Can you use a serial console to capture the messages ?

Personally I cannot without disassembling my laptop and likely soldering test pads, since it's somewhere between an ultrabook and a laptop, and thus doesn't have integrated serial connectivity.

ah. OK. This is a laptop... Too bad. These error messages would be really useful to come up with a better solution than reverting the patch causing the issue. May be a screen video ? (disable rhgb or any other graphic boot stuff and add earlycon kernel parameter. You should be able to see the boot messages & errors).

(In reply to Damien Le Moal from comment #8)
> ah. OK. This is a laptop... Too bad. These error messages would be really
> useful to come up with a better solution than reverting the patch causing
> the issue. May be a screen video ? (disable rhgb or any other graphic boot
> stuff and add earlycon kernel parameter. You should be able to see the boot
> messages & errors).

I'm not currently able to as I'm not at home, but I'm sure some others would be able to provide that data...

I am going to send a revert to Linus & stable now. We can figure out how to correctly enable LPM for this adapter during the 6.3 cycle.

Revert sent. Probably will be picked up in 6.1.15.

Revert accepted

Created attachment 303855 [details]
lspci -vvv on kernel 6.1.12 ASUS Vivobook 15 X513EAN
Created attachment 303856 [details]
lspci -vvv on kernel 6.2.1 ASUS Vivobook 15 X513EAN
Created attachment 303857 [details]
dmesg on kernel 6.1.12 ASUS Vivobook 15 X513EAN
Created attachment 303858 [details]
dmesg on kernel 6.2.1 ASUS Vivobook 15 X513EAN
I have an ASUS Vivobook 15 X513EAN laptop and can confirm that. I have / on nvme and /home on sda. 6.1.12 works fine; on 6.2.0, /dev/ is not populated with sda devices. I'm attaching my full dmesg and lspci output for both kernels. What additional logs are needed?

Is anyone previously affected by this issue available to participate in some testing/diagnosis of the root cause here? This would involve building kernels with specific patches applied and getting logs (so you need to be using the SATA disk as secondary, booting from another device).

We have identified that the lack of LPM mode is likely causing all platforms using this chipset to excessively drain the battery in suspend mode (see bug #218394). We want to fix that for all, but obviously need to avoid the breakage that was discovered here. The platform we have (Asus B1400) works just fine with a SATA disk added in LPM mode, so we aren't able to diagnose there.

I'm not experienced in such testing, but I'm willing to participate. My only concern is that my hardware could be damaged. If the chance of this happening is considered low, I'm ready. In my current configuration I'm booting from the SSD (nvme), and the HDD (sda) is my secondary drive. Will it require any changes?

Thanks Vitalii! First thing to do would be to reproduce the bug on a newer kernel version, preferably 6.8-rc2. You'll need to apply the original patch:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/patch/?id=104ff59af73aba524e57ae0fef70121643ff270e

Then confirm that such new kernel boots without SATA detected, and capture dmesg logs for good measure.

Assuming that is the case, please then go into your BIOS setup menu and try to disable Intel VMD. Refer to this video: https://www.youtube.com/watch?v=_Ft9KBTC2kk
It may warn about "malfunction" or similar - that can be ignored in the case of Linux, which does not support the RAID mode that exists beneath this function.

With VMD disabled, boot up again and see if the SATA disk is detected, capturing new dmesg logs regardless of success/failure.

If you additionally have Windows installed on this device, please then re-enable VMD in the BIOS before attempting your next Windows boot.

Created attachment 305797 [details]
dmesg VMD enabled kernel 6.8.0-rc2
Created attachment 305798 [details]
dmesg VMD disabled kernel 6.8.0-rc2
Created attachment 305799 [details]
rdsoreport VMD enabled kernel 6.8.0-rc2
Created attachment 305800 [details]
rdsoreport VMD disabled kernel 6.8.0-rc2
Created attachment 305804 [details]
dmesg VMD enabled kernel 6.8.0-rc2 CONFIG_SATA_MOBILE_LPM_POLICY=3
Created attachment 305805 [details]
journalctl VMD enabled kernel 6.8.0-rc2 CONFIG_SATA_MOBILE_LPM_POLICY=3
Created attachment 305814 [details]
Bind LPM policy to [8086:a0d3] with enabled VMD
If I revert commit 6210038aeaf4 ("ata: ahci: Revert "ata: ahci: Add Tiger Lake UP{3,4} AHCI controller"") on top of kernel v6.8-rc2 (after fixing the conflict, of course), then the SATA HDD disappears. The issue reproduces on the ASUS B1400CEAE with both CONFIG_SATA_MOBILE_LPM_POLICY=3 and CONFIG_SATA_MOBILE_LPM_POLICY=0.
$ dmesg | grep SATA
[ 0.783211] ahci 10000:e0:17.0: AHCI 0001.0301 32 slots 1 ports 6 Gbps 0x1 impl SATA mode
[ 0.783399] ata1: SATA max UDMA/133 abar m2048@0x76102000 port 0x76102100 irq 144 lpm-pol 3
[ 1.096685] ata1: SATA link down (SStatus 4 SControl 300)
Created attachment 305838 [details]
dmesg with binding LPM policy with patch "ata: ahci: Add force LPM policy quirk for ASUS B1400CEAE" on enabled VMD machine

Bind LPM policy with the patch "ata: ahci: Add force LPM policy quirk for ASUS B1400CEAE" [1] based on kernel v6.8-rc2. Also, add debug messages as:

diff --git a/drivers/ata/ahci.c b/drivers/ata/ahci.c
index 7ecd56c8262a..b910c7856d08 100644
--- a/drivers/ata/ahci.c
+++ b/drivers/ata/ahci.c
@@ -1677,8 +1676,10 @@ static void ahci_update_initial_lpm_policy(struct ata_port *ap,
 	/* Ignore processing for chipsets that don't use policy */
-	if (!(hpriv->flags & AHCI_HFLAG_USE_LPM_POLICY))
+	if (!(hpriv->flags & AHCI_HFLAG_USE_LPM_POLICY)) {
+		dev_info(ap->dev, "%s: do not use LPM policy\n", __func__);
 		return;
+	}
 	/* user modified policy via module param */
 	if (mobile_lpm_policy != -1) {
@@ -1696,6 +1697,7 @@ static void ahci_update_initial_lpm_policy(struct ata_port *ap,
 update_policy:
 	if (policy >= ATA_LPM_UNKNOWN && policy <= ATA_LPM_MIN_POWER)
 		ap->target_lpm_policy = policy;
+	dev_info(ap->dev, "%s: policy %d\n", __func__, policy);
 }
 static void ahci_intel_pcs_quirk(struct pci_dev *pdev, struct ahci_host_priv *hpriv)
@@ -1706,12 +1708,16 @@ static void ahci_intel_pcs_quirk(struct pci_dev *pdev, struct ahci_host_priv *hp
 	/*
 	 * Only apply the 6-port PCS quirk for known legacy platforms.
 	 */
-	if (!id || id->vendor != PCI_VENDOR_ID_INTEL)
+	if (!id || id->vendor != PCI_VENDOR_ID_INTEL) {
+		dev_info(&pdev->dev, "%s: not Intel, the vendor is 0x%08x\n", __func__, id->vendor);
 		return;
+	}
 	/* Skip applying the quirk on Denverton and beyond */
-	if (((enum board_ids) id->driver_data) >= board_ahci_pcs7)
+	if (((enum board_ids) id->driver_data) >= board_ahci_pcs7) {
+		dev_info(&pdev->dev, "%s: skip\n", __func__);
 		return;
+	}
 	/*
 	 * port_map is determined from PORTS_IMPL PCI register which is
@@ -1722,8 +1728,10 @@ static void ahci_intel_pcs_quirk(struct pci_dev *pdev, struct ahci_host_priv *hp
 	 * before the OS boots.
 	 */
 	pci_read_config_word(pdev, PCS_6, &tmp16);
+	dev_info(&pdev->dev, "%s: PCS_6 is 0x%04x", __func__, tmp16);
 	if ((tmp16 & hpriv->port_map) != hpriv->port_map) {
 		tmp16 |= hpriv->port_map;
+		dev_info(&pdev->dev, "%s: write PCS_6 with 0x%04x", __func__, tmp16);
 		pci_write_config_word(pdev, PCS_6, tmp16);
 	}
 }
@@ -1998,6 +2006,7 @@ static int ahci_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
 	if (rc)
 		return rc;
+	dev_info(&pdev->dev, "%s: probed\n", __func__);
 	pm_runtime_put_noidle(&pdev->dev);
 	return 0;
 }
diff --git a/drivers/ata/libahci.c b/drivers/ata/libahci.c
index 1a63200ea437..7e4f349554eb 100644
--- a/drivers/ata/libahci.c
+++ b/drivers/ata/libahci.c
@@ -812,6 +812,7 @@ static int ahci_set_lpm(struct ata_link *link, enum ata_lpm_policy policy,
 	struct ahci_port_priv *pp = ap->private_data;
 	void __iomem *port_mmio = ahci_port_base(ap);
+	ata_link_info(link, "%s: policy=%d\n", __func__, policy);
 	if (policy != ATA_LPM_MAX_POWER) {
 		/* wakeup flag only applies to the max power policy */
 		hints &= ~ATA_LPM_WAKE_ONLY;
@@ -1533,6 +1534,12 @@ int ahci_check_ready(struct ata_link *link)
 {
 	void __iomem *port_mmio = ahci_port_base(link->ap);
 	u8 status = readl(port_mmio + PORT_TFDATA) & 0xFF;
+	u32 cur = 0;
+
+	sata_scr_read(link, SCR_STATUS, &cur);
+
+	ata_link_info(link, "BUSY ? %d (status: %#x) SStatus.DET: %#x\n",
+		      status & ATA_BUSY, status, cur & 0xf);
 	return ata_check_ready(status);
 }
diff --git a/drivers/ata/libata-sata.c b/drivers/ata/libata-sata.c
index 0fb1934875f2..4bcedd46bcfa 100644
--- a/drivers/ata/libata-sata.c
+++ b/drivers/ata/libata-sata.c
@@ -344,6 +344,7 @@ int sata_link_resume(struct ata_link *link, const unsigned int *params,
 	if (!(rc = sata_scr_read(link, SCR_ERROR, &serror)))
 		rc = sata_scr_write(link, SCR_ERROR, serror);
+	ata_link_info(link, "%s: rc=%d", __func__, rc);
 	return rc != -EINVAL ? rc : 0;
 }
 EXPORT_SYMBOL_GPL(sata_link_resume);
@@ -378,6 +379,7 @@ int sata_link_scr_lpm(struct ata_link *link, enum ata_lpm_policy policy,
 	if (rc)
 		return rc;
+	ata_link_info(link, "%s: policy is %d and original scontrol 0x%08x\n", __func__, policy, scontrol);
 	switch (policy) {
 	case ATA_LPM_MAX_POWER:
 		/* disable all LPM transitions */
@@ -422,6 +424,7 @@ int sata_link_scr_lpm(struct ata_link *link, enum ata_lpm_policy policy,
 		WARN_ON(1);
 	}
+	ata_link_info(link, "%s: write scontrol 0x%08x\n", __func__, scontrol);
 	rc = sata_scr_write(link, SCR_CONTROL, scontrol);
 	if (rc)
 		return rc;
@@ -586,9 +589,12 @@ int sata_link_hardreset(struct ata_link *link, const unsigned int *timing,
 	rc = sata_link_resume(link, timing, deadline);
 	if (rc)
 		goto out;
+
 	/* if link is offline nothing more to do */
-	if (ata_phys_link_offline(link))
+	if (ata_phys_link_offline(link)) {
+		ata_link_info(link, "%s: ata_phys_link_offline is True\n", __func__);
 		goto out;
+	}
 	/* Link is online. From this point, -ENODEV too is an error. */
 	if (online)
@@ -616,12 +622,15 @@ int sata_link_hardreset(struct ata_link *link, const unsigned int *timing,
 	rc = 0;
 	if (check_ready)
 		rc = ata_wait_ready(link, deadline, check_ready);
+
+	ata_link_info(link, "%s: is %d\n", __func__, rc);
 out:
 	if (rc && rc != -EAGAIN) {
 		/* online is set iff link is online && reset succeeded */
 		if (online)
 			*online = false;
 	}
+	ata_link_info(link, "%s: is %s line, returns %d\n", __func__, *online? "on":"off", rc);
 	return rc;
 }
 EXPORT_SYMBOL_GPL(sata_link_hardreset);

[1]: https://patchwork.kernel.org/project/linux-pci/patch/20240130095933.14158-1-jhp@endlessos.org/

Created attachment 305839 [details]
dmesg with binding LPM policy with PCI IDs matching on enabled VMD machine

Bind LPM policy with PCI IDs matching [8086:a0d3] based on kernel v6.8-rc2 and the same debug messages in comment #28.