Bug 218038 - bbbf096ea227607cbb348155eeda7af71af1a35b results in "dirty" shutdown
Summary: bbbf096ea227607cbb348155eeda7af71af1a35b results in "dirty" shutdown
Status: NEW
Alias: None
Product: IO/Storage
Classification: Unclassified
Component: Serial ATA (show other bugs)
Hardware: All Linux
: P3 normal
Assignee: Tejun Heo
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2023-10-22 21:43 UTC by Totallyreal Name
Modified: 2023-10-25 10:58 UTC (History)
3 users (show)

See Also:
Kernel Version: 6.6-rc4
Subsystem:
Regression: Yes
Bisected commit-id: bbbf096ea227607cbb348155eeda7af71af1a35b


Attachments
Candidate fix patch (3.77 KB, patch)
2023-10-25 06:57 UTC, Damien Le Moal
Details | Diff

Description Totallyreal Name 2023-10-22 21:43:31 UTC
I have noticed that after each shutdown SMART would log +1 on each raw_value of drives that count power related issues.

Here are some of the smart values where i noticed increase among different drives

POR_Recovery_Count 
Unexpect_Power_Loss_Ct  
Power-Off_Retract_Count 

That doesn't happen on previous Kernels.. and it stops as soon as i switch back to LTS.
I seeked through the Bug Reports but couldn't really find the issue. So i don't know if this has been reported or patched already.
Comment 1 Artem S. Tashkinov 2023-10-23 07:08:13 UTC
Please perform git bisect:

https://docs.kernel.org/admin-guide/bug-bisect.html

BTW, 6.6-rc7 is out, it's worth giving it a try first.
Comment 2 loqs 2023-10-24 10:11:42 UTC
This has been reported as being introduced in stable 6.5.8 [1].  Reverting commit 	bbbf096ea227607cbb348155eeda7af71af1a35b is reported to solve the issue [2]. bbbf096ea227607cbb348155eeda7af71af1a35b [3] is mainline commit aa3998dbeb3abce63653b7f6d4542e7dcd022590 [4] which was introduced in 6.6-rc4 which matches the original report.

[1]: https://bugs.archlinux.org/task/80064
[2]: https://bugs.archlinux.org/task/80064#comment223100
[3]: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=bbbf096ea227607cbb348155eeda7af71af1a35b
[4]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=aa3998dbeb3abce63653b7f6d4542e7dcd022590
Comment 3 Artem S. Tashkinov 2023-10-24 10:53:14 UTC
CC'ing Damien Le Moal, the author of the bad commit.
Comment 4 Damien Le Moal 2023-10-24 11:34:41 UTC
(In reply to Artem S. Tashkinov from comment #3)
> CC'ing Damien Le Moal, the author of the bad commit.

Bad commit ? Please measure your words... This commit fixes a very real and serious issue with system/suspend resume.

But sure, it looks like it did introduce a regression that went unnoticed, which seems to be that ATA drives are not being suspended before a system shutdown. I will have a look tomorrow (too late for me today). In the mean time, please test with the latest 6.6-rc7 to confirm if this issue is still present.
Comment 5 loqs 2023-10-24 12:22:22 UTC
I can reproduce the issue with 6.6-rc7.  Do you need any outputs?
Comment 6 Damien Le Moal 2023-10-24 23:47:55 UTC
(In reply to loqs from comment #5)
> I can reproduce the issue with 6.6-rc7.  Do you need any outputs?

No, nothing needed for now. Looking into reproducing this locally now. Thanks.
Comment 7 Damien Le Moal 2023-10-25 02:19:59 UTC
(In reply to Totallyreal Name from comment #0)
> I have noticed that after each shutdown SMART would log +1 on each raw_value
> of drives that count power related issues.
> 
> Here are some of the smart values where i noticed increase among different
> drives
> 
> POR_Recovery_Count 
> Unexpect_Power_Loss_Ct  

None of the drives I have show these 2 attributes...

> Power-Off_Retract_Count 

This one I see it, but it always increments for me, with old and new kernels, regardless of patch aa3998dbeb3a applied or not. hdsentinel says about this attribute:

The 192 Power Off Retract Count attribute usually increases on non-standard power OFF cycles, for example caused by sudden power failure, power loss or hard reset. This is usually lower than the start/stop count, ideally 0.

So the increase could be due to the hard reset on device scan on boot. Not entirely sure since checking smart before we actually boot the kernel is not easy to do :) 

> That doesn't happen on previous Kernels.. and it stops as soon as i switch
> back to LTS.
> I seeked through the Bug Reports but couldn't really find the issue. So i
> don't know if this has been reported or patched already.

As mentioned above, I am having trouble recreating the same behavior as you are seeing, and the drives I have do not show all the same attributes as yours, so hard to check everything. However, I did find an issue: libata does not call ata_dev_power_set_standby() on system shutdown with a real PC with aa3998dbeb3a applied. But it does with a qemu VM, which is weird. This may be related to ACPI and how system shutdown is handled (qemu uses ACPI S5 apparently). Digging into that to restore the old behavior and have the drive put to standby mode before shutting down the system.
Comment 8 tgn-ff 2023-10-25 03:03:24 UTC
The `POR_Recovery_Count` increments for SATA SSDs too with the latest stable kernel (6.5.8).

> [Power-Off_Retract_Count] always increments for me

If it's much smaller than `12 Power_Cycle_Count`, then it wasn't always like that for you either.


Aside of SMART properties, when powering down the system, the disk makes a sound associated with a sudden power loss. It's easy to tell the difference between that and a normal spinning down.
Comment 9 Damien Le Moal 2023-10-25 04:39:23 UTC
(In reply to tgn-ff from comment #8)
> The `POR_Recovery_Count` increments for SATA SSDs too with the latest stable
> kernel (6.5.8).
> 
> > [Power-Off_Retract_Count] always increments for me
> 
> If it's much smaller than `12 Power_Cycle_Count`, then it wasn't always like
> that for you either.

 12 Power_Cycle_Count       -O--CK   100   100   000    -    46
192 Power-Off_Retract_Count -O--CK   100   100   000    -    628
193 Load_Cycle_Count        -O--C-   100   100   000    -    628

Nope. Not the case. This is a development machine that I rarely power off but often reset due to kernel bugs during testing. That should explain the very large difference I have. The fact remains that I do see both attributes incremented with a power cycle with and without the patch.
 
> Aside of SMART properties, when powering down the system, the disk makes a
> sound associated with a sudden power loss. It's easy to tell the difference
> between that and a normal spinning down.

OK, so this is a strong indicator that you have the issue I mentioned: with the patch applied, libata does not spin down the drive on system shutdown. Can you see a message like this on shutdown ?

[  485.255105] sd 0:0:0:0: [sda] Synchronizing SCSI cache
[  485.258028] ata1.00: Entering standby power mode

If you have the issue, the second message will be missing.

(note: it may be hard to capture this if you cannot setup a serial console to another machine. Taking a video of the screen when shutting down may do the trick though. Also note that you may need to boot with "loglevel=7" added to your kernel boot parameters to see anything.)
Comment 10 Damien Le Moal 2023-10-25 05:54:03 UTC
(In reply to tgn-ff from comment #8)
> Aside of SMART properties, when powering down the system, the disk makes a
> sound associated with a sudden power loss. It's easy to tell the difference
> between that and a normal spinning down.

Could you try the for-6.7 branch of the libata tree:

git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/libata.git
https://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/libata.git

This branch has commit 5b6fba546da2 ("ata: libata-core: Detach a port devices on shutdown") which should fix the issue. However, I did not mark that commit for stable backport, which will need to be done if this solves the issue. The other thing is that this commit is effective for pci devices only and not for platform devices. So I need to improve this.
Comment 11 Damien Le Moal 2023-10-25 06:57:49 UTC
Created attachment 305289 [details]
Candidate fix patch
Comment 12 Damien Le Moal 2023-10-25 06:58:40 UTC
I just attached a candidate fix patch to this case. The patch is on top of Linus tree (mainline 6.6.-rc7). Can you try it please ?
Comment 13 Artem S. Tashkinov 2023-10-25 07:39:01 UTC
(In reply to Damien Le Moal from comment #4)
> (In reply to Artem S. Tashkinov from comment #3)
> > CC'ing Damien Le Moal, the author of the bad commit.
> 
> Bad commit ? Please measure your words... This commit fixes a very real and
> serious issue with system/suspend resume.
> 
> But sure, it looks like it did introduce a regression that went unnoticed,
> which seems to be that ATA drives are not being suspended before a system
> shutdown. I will have a look tomorrow (too late for me today). In the mean
> time, please test with the latest 6.6-rc7 to confirm if this issue is still
> present.

That's how git calls it, that's how people normally call the commit that led to a regression. I meant nothing personal, I didn't mean it was actually "bad".
Comment 14 Damien Le Moal 2023-10-25 07:54:12 UTC
(In reply to Artem S. Tashkinov from comment #13)
> That's how git calls it, that's how people normally call the commit that led
> to a regression. I meant nothing personal, I didn't mean it was actually
> "bad".

No worries, I did not take it personally :) I know that git bisect uses the "bad" term for marking commits, but when discussing one, I find it clearer to explain that a commit introduced a regression rather than using "bad" as in "that commit is garbage", as the latter case also happens...

Anyway, I found the issue, which I did not see initially because I tested the entire patch series for the suspend/resume fixes & changes and that series has a patch to suspend drives on shutdown, but for AHCI only. Other adapters would have the issue and that patch was not part of the fixes, which was another mistake. So it is good that I got this bugzilla case. This should be fixed before 6.6.0 is out and I will adapt the patch for AHCI queued for 6.7 accordingly.
Comment 15 tgn-ff 2023-10-25 08:09:47 UTC
I hadn't noticed the patch you attached here, so I tried the 5b6fba546da246b3d0dd8465c07783e22629cc53 commit you mentioned before and it seems to fix it. The SMART properties don't increase and the disk sounds like it's spinning down normally.

I will apply the attached patch, but compilation will take a while.
Comment 16 loqs 2023-10-25 08:11:54 UTC
(In reply to Damien Le Moal from comment #12)
> I just attached a candidate fix patch to this case. The patch is on top of
> Linus tree (mainline 6.6.-rc7). Can you try it please ?

That fixes the issues for me.  Tested with 6.6-rc7 and 6.5.8
Tested-by: Stephen M Smith <stephenmsmith@blueyonder.co.uk>
Comment 17 tgn-ff 2023-10-25 10:58:06 UTC
A second vote of confidence for the attached patch from me.

Note You need to log in before you can comment on or make changes to this bug.