I have noticed that after each shutdown SMART would log +1 on each raw_value of drives that count power related issues. Here are some of the smart values where i noticed increase among different drives POR_Recovery_Count Unexpect_Power_Loss_Ct Power-Off_Retract_Count That doesn't happen on previous Kernels.. and it stops as soon as i switch back to LTS. I seeked through the Bug Reports but couldn't really find the issue. So i don't know if this has been reported or patched already.
Please perform git bisect: https://docs.kernel.org/admin-guide/bug-bisect.html BTW, 6.6-rc7 is out, it's worth giving it a try first.
This has been reported as being introduced in stable 6.5.8 [1]. Reverting commit bbbf096ea227607cbb348155eeda7af71af1a35b is reported to solve the issue [2]. bbbf096ea227607cbb348155eeda7af71af1a35b [3] is mainline commit aa3998dbeb3abce63653b7f6d4542e7dcd022590 [4] which was introduced in 6.6-rc4 which matches the original report. [1]: https://bugs.archlinux.org/task/80064 [2]: https://bugs.archlinux.org/task/80064#comment223100 [3]: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=bbbf096ea227607cbb348155eeda7af71af1a35b [4]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=aa3998dbeb3abce63653b7f6d4542e7dcd022590
CC'ing Damien Le Moal, the author of the bad commit.
(In reply to Artem S. Tashkinov from comment #3) > CC'ing Damien Le Moal, the author of the bad commit. Bad commit ? Please measure your words... This commit fixes a very real and serious issue with system/suspend resume. But sure, it looks like it did introduce a regression that went unnoticed, which seems to be that ATA drives are not being suspended before a system shutdown. I will have a look tomorrow (too late for me today). In the mean time, please test with the latest 6.6-rc7 to confirm if this issue is still present.
I can reproduce the issue with 6.6-rc7. Do you need any outputs?
(In reply to loqs from comment #5) > I can reproduce the issue with 6.6-rc7. Do you need any outputs? No, nothing needed for now. Looking into reproducing this locally now. Thanks.
(In reply to Totallyreal Name from comment #0) > I have noticed that after each shutdown SMART would log +1 on each raw_value > of drives that count power related issues. > > Here are some of the smart values where i noticed increase among different > drives > > POR_Recovery_Count > Unexpect_Power_Loss_Ct None of the drives I have show these 2 attributes... > Power-Off_Retract_Count This one I see it, but it always increments for me, with old and new kernels, regardless of patch aa3998dbeb3a applied or not. hdsentinel says about this attribute: The 192 Power Off Retract Count attribute usually increases on non-standard power OFF cycles, for example caused by sudden power failure, power loss or hard reset. This is usually lower than the start/stop count, ideally 0. So the increase could be due to the hard reset on device scan on boot. Not entirely sure since checking smart before we actually boot the kernel is not easy to do :) > That doesn't happen on previous Kernels.. and it stops as soon as i switch > back to LTS. > I seeked through the Bug Reports but couldn't really find the issue. So i > don't know if this has been reported or patched already. As mentioned above, I am having trouble recreating the same behavior as you are seeing, and the drives I have do not show all the same attributes as yours, so hard to check everything. However, I did find an issue: libata does not call ata_dev_power_set_standby() on system shutdown with a real PC with aa3998dbeb3a applied. But it does with a qemu VM, which is weird. This may be related to ACPI and how system shutdown is handled (qemu uses ACPI S5 apparently). Digging into that to restore the old behavior and have the drive put to standby mode before shutting down the system.
The `POR_Recovery_Count` increments for SATA SSDs too with the latest stable kernel (6.5.8). > [Power-Off_Retract_Count] always increments for me If it's much smaller than `12 Power_Cycle_Count`, then it wasn't always like that for you either. Aside of SMART properties, when powering down the system, the disk makes a sound associated with a sudden power loss. It's easy to tell the difference between that and a normal spinning down.
(In reply to tgn-ff from comment #8) > The `POR_Recovery_Count` increments for SATA SSDs too with the latest stable > kernel (6.5.8). > > > [Power-Off_Retract_Count] always increments for me > > If it's much smaller than `12 Power_Cycle_Count`, then it wasn't always like > that for you either. 12 Power_Cycle_Count -O--CK 100 100 000 - 46 192 Power-Off_Retract_Count -O--CK 100 100 000 - 628 193 Load_Cycle_Count -O--C- 100 100 000 - 628 Nope. Not the case. This is a development machine that I rarely power off but often reset due to kernel bugs during testing. That should explain the very large difference I have. The fact remains that I do see both attributes incremented with a power cycle with and without the patch. > Aside of SMART properties, when powering down the system, the disk makes a > sound associated with a sudden power loss. It's easy to tell the difference > between that and a normal spinning down. OK, so this is a strong indicator that you have the issue I mentioned: with the patch applied, libata does not spin down the drive on system shutdown. Can you see a message like this on shutdown ? [ 485.255105] sd 0:0:0:0: [sda] Synchronizing SCSI cache [ 485.258028] ata1.00: Entering standby power mode If you have the issue, the second message will be missing. (note: it may be hard to capture this if you cannot setup a serial console to another machine. Taking a video of the screen when shutting down may do the trick though. Also note that you may need to boot with "loglevel=7" added to your kernel boot parameters to see anything.)
(In reply to tgn-ff from comment #8) > Aside of SMART properties, when powering down the system, the disk makes a > sound associated with a sudden power loss. It's easy to tell the difference > between that and a normal spinning down. Could you try the for-6.7 branch of the libata tree: git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/libata.git https://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/libata.git This branch has commit 5b6fba546da2 ("ata: libata-core: Detach a port devices on shutdown") which should fix the issue. However, I did not mark that commit for stable backport, which will need to be done if this solves the issue. The other thing is that this commit is effective for pci devices only and not for platform devices. So I need to improve this.
Created attachment 305289 [details] Candidate fix patch
I just attached a candidate fix patch to this case. The patch is on top of Linus tree (mainline 6.6.-rc7). Can you try it please ?
(In reply to Damien Le Moal from comment #4) > (In reply to Artem S. Tashkinov from comment #3) > > CC'ing Damien Le Moal, the author of the bad commit. > > Bad commit ? Please measure your words... This commit fixes a very real and > serious issue with system/suspend resume. > > But sure, it looks like it did introduce a regression that went unnoticed, > which seems to be that ATA drives are not being suspended before a system > shutdown. I will have a look tomorrow (too late for me today). In the mean > time, please test with the latest 6.6-rc7 to confirm if this issue is still > present. That's how git calls it, that's how people normally call the commit that led to a regression. I meant nothing personal, I didn't mean it was actually "bad".
(In reply to Artem S. Tashkinov from comment #13) > That's how git calls it, that's how people normally call the commit that led > to a regression. I meant nothing personal, I didn't mean it was actually > "bad". No worries, I did not take it personally :) I know that git bisect uses the "bad" term for marking commits, but when discussing one, I find it clearer to explain that a commit introduced a regression rather than using "bad" as in "that commit is garbage", as the latter case also happens... Anyway, I found the issue, which I did not see initially because I tested the entire patch series for the suspend/resume fixes & changes and that series has a patch to suspend drives on shutdown, but for AHCI only. Other adapters would have the issue and that patch was not part of the fixes, which was another mistake. So it is good that I got this bugzilla case. This should be fixed before 6.6.0 is out and I will adapt the patch for AHCI queued for 6.7 accordingly.
I hadn't noticed the patch you attached here, so I tried the 5b6fba546da246b3d0dd8465c07783e22629cc53 commit you mentioned before and it seems to fix it. The SMART properties don't increase and the disk sounds like it's spinning down normally. I will apply the attached patch, but compilation will take a while.
(In reply to Damien Le Moal from comment #12) > I just attached a candidate fix patch to this case. The patch is on top of > Linus tree (mainline 6.6.-rc7). Can you try it please ? That fixes the issues for me. Tested with 6.6-rc7 and 6.5.8 Tested-by: Stephen M Smith <stephenmsmith@blueyonder.co.uk>
A second vote of confidence for the attached patch from me.