Bug 216087

Summary: Only Toshiba MG08/MG09 SATA hard disks disappear after resume from S3 suspend (reproducible) - AMD Ryzen 5 PRO 4650G
Product: Power Management Reporter: Nils (nilsdev)
Component: Hibernation/SuspendAssignee: Rafael J. Wysocki (rjw)
Status: REOPENED ---    
Severity: high CC: bvanassche, damien.lemoal, damien.lemoal, mario.limonciello, pmenzel+bugzilla.kernel.org, regressions, rui.zhang
Priority: P1    
Hardware: AMD   
OS: Linux   
Kernel Version: 5.19.8 Subsystem:
Regression: No Bisected commit-id:
Attachments: lspci -nn
journalctl -k -b 13
journalctl -k -b 10

Description Nils 2022-06-06 21:49:17 UTC
Every time I resume from S3 / suspend to disk one particular SATA hard disk disappears. It is always the same disk. Other disks are not affected. I swapped SATA port, power and SATA data cables to no avail.

Upon resume from `systemctl suspend` (wake by key press) one of a number of hard disks, here sd 2 / sdb, gives the following error messages (I am giving a few lines of context):

```
[174232.762381] ata5: found unknown device (class 0)
[174232.762411] ata6: found unknown device (class 0)
[174233.228546] ata3: SATA link down (SStatus 0 SControl 300)
[174237.422376] ata6: softreset failed (device not ready)
[174237.422391] ata5: softreset failed (device not ready)
[174238.290401] ata4: found unknown device (class 0)
[174238.450381] ata4: softreset failed (device not ready)
[174238.450414] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[174238.454123] ata1.00: configured for UDMA/133
[174238.608510] ata3: SATA link down (SStatus 0 SControl 300)
[174238.608521] ata3.00: disable device
[174238.608540] sd 2:0:0:0: rejecting I/O to offline device
[174238.608547] ata3.00: detaching (SCSI 2:0:0:0)
[174238.608554] sd 2:0:0:0: [sdb] Start/Stop Unit failed: Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[174238.608558] sd 2:0:0:0: PM: dpm_run_callback(): scsi_bus_resume+0x0/0x80 returns -5
[174238.608567] sd 2:0:0:0: PM: failed to resume async: error -5
[174238.630402] sd 2:0:0:0: [sdb] Synchronizing SCSI cache
[174238.630423] sd 2:0:0:0: [sdb] Synchronize Cache(10) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[174238.630424] sd 2:0:0:0: [sdb] Stopping disk
[174238.630428] sd 2:0:0:0: [sdb] Start/Stop Unit failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[174242.774382] ata6: found unknown device (class 0)
[174242.774417] ata5: found unknown device (class 0)
[174243.174399] ata6: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[174243.266633] ata6.00: configured for UDMA/133
[174243.294401] ata5: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[174243.387415] ata5.00: configured for UDMA/133
[174243.802389] ata4: found unknown device (class 0)
[174245.942401] ata4: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[174246.128870] ata4.00: configured for UDMA/133
```

The disk does NOT reappear after `echo - - - > /sys/class/scsi_host/host3/scan` (tried all `host[0-9]`), `/dev/sdb` stays gone.

I am on openSUSE Tumbleweed with frequent updates. The disk worked fine for months until very recently.  The disk tests fine with a SMART long test and I backed up 8 TB from it without problems after a complete shutdown and reboot. 

Hardware: Chipset AMD X570, CPU AMD Ryzen 5 PRO 4650G, disk TOSHIBA MG08ACA16TE.

Very likely OK on: 5.16.15-1-default
Definitely broken: 5.18.1-1-default

I tested the same setup but connecting the disk by USB3.  The disk in this case "survives" suspend and resume.
Comment 1 Nils 2022-06-06 22:30:31 UTC
Perhaps related: https://www.linuxquestions.org/questions/linux-hardware-18/sata-disk-problems-up-after-resume-from-memory-4175532754/  (only first half of thread, from 2015, not documented as fixed)
Comment 2 Nils 2022-06-06 22:31:09 UTC
Perhaps related: https://bugzilla.kernel.org/show_bug.cgi?id=76681
Comment 3 Nils 2022-06-11 16:35:26 UTC
I can confirm that this issue still exists in Linux 5.18.2-1-default.
Comment 4 Nils 2022-06-11 16:40:12 UTC
Second hard disk tested: TOSHIBA MG09ACA1, same problem.

What distinguishes these affected disks from the 5 other, non affected disks?

 * both larger than the others (nominal 16 and 18 vs 10 TB)
 * both from Toshiba (but so is one other though that is old)

Hope this helps track the issue.
Comment 5 Nils 2022-06-29 08:08:09 UTC
I can confirm that this issue still exists in Linux 5.18.6-1-default.

Since this apparently is a regression I hope it is easy to find and fix.
Comment 6 Mario Limonciello (AMD) 2022-06-29 17:12:30 UTC
As you have a working and failing config, can you please bisect this?
Comment 7 Nils 2022-06-30 08:19:59 UTC
OK, I will try installing an old kernel, though that may be difficult (rolling distro). This is a production machine so it may take a few days.
Comment 8 Nils 2022-06-30 13:15:10 UTC
Tested 5.17.9-1-default (oldest one available in openSUSE Tumbleweed History): The problem already exists in this kernel version.

The disks that failed are Toshiba MG09ACA18TE and MG08ACA16TE.  For example, Toshiba DT01ACA300 does not exhibit the problem.  The affected ones are the largest disks in the system.

Meanwhile, is there an X570 chipset configuration option that I can try out as a workaround in the BIOS?
Comment 9 Mario Limonciello (AMD) 2022-06-30 13:41:17 UTC
> Tested 5.17.9-1-default (oldest one available in openSUSE Tumbleweed
> History): The problem already exists in this kernel version.

So this isn't a regression you're saying?  Ideally what we need to see is a bisect down to an individual commit (which would mean compiling from source https://www.kernel.org/doc/html/latest/admin-guide/bug-bisect.html).

> Meanwhile, is there an X570 chipset configuration option that I can try out
> as a workaround in the BIOS?

You'd have to check with your hardware manufacturer on what is exposed in their BIOS related to SATA.
Comment 10 Nils 2022-06-30 14:31:58 UTC
I believe this is a regression.  I have narrowed it down now.

Definitely affected:  5.17.1-1-default through 5.18.6-1-default
Very likely OK:       5.16.15-1-default

So while this is not a perfect disection I hope the changes in the related drivers (perhaps for X570's SATA controller?) are few between 5.16.15 and 5.17.1.

Not blaming you (!) but had I been given the request for bisection testing 3.5 weeks ago when I first posted this perhaps I could have accessed enough older kernels to narrow it down much more.

Now I extracted the information from past logs, I did:

foreach n ( `seq 1 14` )
foreach? journalctl -k -b -$n | egrep '(scsi_bus_resume|@buildhost)'
foreach? echo was $n
foreach? echo =============
foreach? end

5.16.15-1-default boots did not show the resume problem.  I did find traces of suspend (S3) activity but no error.

5.17.1-1-default was the first one that did show errors.
Comment 11 Mario Limonciello (AMD) 2022-06-30 20:32:45 UTC
> 5.16.15-1-default boots did not show the resume problem.  I did find traces
> of suspend (S3) activity but no error.
> 5.17.1-1-default was the first one that did show errors.

How about checking out kernel sources and do 2x build with tags at 5.16.0 and 5.17.0.  If you can fail with 5.17.0 but pass with 5.16.0, you should be able to do proper bisect.
Comment 12 Nils 2022-07-08 17:26:11 UTC
Thanks for the suggestion.  However, I am afraid I cannot do that.  I cannot deploy a custom kernel on this production machine.

However, it is not so bad because the regression occurred between 5.16.15 and 5.17.1 !

The diff should therefore be small, especially focused only on those code files relevant here.

Would that be some sort of X570 chipset SATA driver component?  I wonder whether it's the disk size or some other feature of the affected disks that hinders wakeup/accessibility of these drives?  The drives do spin up during wakeup from S3 but stay lost to the driver.

If you point me to the relevant source files I can try to do a static analysis - though the original coder would be much more competent to do that.
Comment 13 Mario Limonciello (AMD) 2022-07-08 17:35:53 UTC
> However, it is not so bad because the regression occurred between 5.16.15 and
> 5.17.1 !

Unfortunately that is two different stable trees, so it can be a collection of stable commits that caused it, or it could be stuff from the 5.17 merge window.  The 5.16.0 to 5.17.0 jump will have thousands of commits.  The 5.16.14 to 5.16.15 will be a lot smaller (probably a hundred or so).

> If you point me to the relevant source files I can try to do a static
> analysis - though the original coder would be much more competent to do that.

Assuming it was in drivers/ata it's not exactly a small diff..

$ git diff v5.16 v5.17 drivers/ata | wc -l
6572

Assuming it's a problem in drivers/ata here is how many commits to wade through:
$ git log --oneline v5.16..v5.17 drivers/ata    | wc -l
113

Sometimes these types of problems are outside the subsystem though.  This could easily be something like an ACPICA change that caused it.
As an example recently A very innocuous looking change to return -ENODEV in what seemed to be an error path for a GPIO driver completely broke resume from s0i3 for all AMD systems.

A bisect really is your best bet.
Comment 14 Mario Limonciello (AMD) 2022-07-08 17:38:01 UTC
BTW - is this your ATA controller?  I didn't see a log added so I'm not 100% sure.

> PCI_VDEVICE(AMD, 0x7801)

If so, this might have been caused by a17ab7aba5df4135ef77d7f6d7105e1ea414936f
Comment 15 Paul Menzel 2022-07-09 09:54:41 UTC
Please attach the output of `lspci -nn`, and the Linux logs for the last working and first non-working situation.
Comment 16 Nils 2022-07-10 12:03:48 UTC
Created attachment 301383 [details]
lspci -nn
Comment 17 Nils 2022-07-10 16:05:57 UTC
(In reply to Paul Menzel from comment #15)
> Please attach the output of `lspci -nn`, and the Linux logs for the last
> working and first non-working situation.

Do you need specific lines, may I grep something please?  The files are very large and likely contain personal data:

# wc jou*
   9096  106183  742028 journalctl-k-b-10-5.17.1-1-default-hdb_lost
   1792   21641  162138 journalctl-k-b-11-5.16.15-1-default-hdb_not_lost
   1571   18787  139323 journalctl-k-b-12-5.16.15-1-default-hdb_not_lost
   1676   19921  146721 journalctl-k-b-13-5.16.15-1-default-hdb_not_lost
  14135  166532 1190210 total

It may take a long time to sift through these files and remove personal data.
Comment 18 Mario Limonciello (AMD) 2022-07-10 16:08:31 UTC
Your SATA controller from lspci output is 0x7901, so it's not caused by a17ab7aba5df4135ef77d7f6d7105e1ea414936f.

1527f69204fe35f341cb599f1cb01bd02daf4374 would be suspect, but that landed in 5.16 which you said 5.16.y is OK.
Comment 19 Damien Le Moal 2022-07-10 23:25:33 UTC
Likely related to this problem:

https://bugzilla.kernel.org/show_bug.cgi?id=215880
Comment 20 Paul Menzel 2022-07-11 07:43:57 UTC
(In reply to Nils from comment #17)
> (In reply to Paul Menzel from comment #15)
> > Please attach the output of `lspci -nn`, and the Linux logs for the last
> > working and first non-working situation.
> 
> Do you need specific lines, may I grep something please?  The files are very
> large and likely contain personal data:
> 
> # wc jou*
>    9096  106183  742028 journalctl-k-b-10-5.17.1-1-default-hdb_lost
>    1792   21641  162138 journalctl-k-b-11-5.16.15-1-default-hdb_not_lost
>    1571   18787  139323 journalctl-k-b-12-5.16.15-1-default-hdb_not_lost
>    1676   19921  146721 journalctl-k-b-13-5.16.15-1-default-hdb_not_lost
>   14135  166532 1190210 total
> 
> It may take a long time to sift through these files and remove personal data.

Hmm, `journalctl -k` should only contain Linux kernel messages with no personal information. It should contain the same messages as the output of `dmesg`.
Comment 21 Mario Limonciello (AMD) 2022-07-11 16:27:28 UTC
Damien, thanks for the insight!
Nils, can you try the patches at the end of that bug report?
Comment 22 Nils 2022-07-12 17:49:57 UTC
> Nils, can you try the patches at the end of that bug report?

I am afraid this is a production machine and I cannot install a custom kernel.  Can I detect if I am affected by that particular issue in a running system?
Comment 23 Nils 2022-07-12 18:14:42 UTC
(In reply to Paul Menzel from comment #20)
> Hmm, `journalctl -k` should only contain Linux kernel messages with no
> personal information. It should contain the same messages as the output of
> `dmesg`.

I am editing the files now but am finding many serial numbers, names etc.  After 20 minutes I am at 17%.  This may take a long time.  I will have to continue on another day.
Comment 24 Nils 2022-07-12 22:25:43 UTC
Created attachment 301405 [details]
journalctl -k -b 13

Linux 5.16.15-1-default - hdb not lost during resume from S3.
Comment 25 Nils 2022-07-12 22:26:30 UTC
Created attachment 301406 [details]
journalctl -k -b 10

Linux 5.17.1-1-default - hdb lost during resume from S3.
Comment 26 Bart Van Assche 2022-07-15 18:53:01 UTC
Would it be possible to test Martin Petersen's for-next branch, which includes a fix for "Resume process hangs for 5-6 seconds starting sometime in 5.16" ("scsi: sd: Rework asynchronous resume support")? See also https://git.kernel.org/pub/scm/linux/kernel/git/mkp/scsi.git/log/?h=for-next
Comment 27 Nils 2022-07-17 00:50:50 UTC
(In reply to Bart Van Assche from comment #26)
> Would it be possible to test Martin Petersen's for-next branch, which
> includes a fix for "Resume process hangs for 5-6 seconds starting sometime
> in 5.16" ("scsi: sd: Rework asynchronous resume support")? See also
> https://git.kernel.org/pub/scm/linux/kernel/git/mkp/scsi.git/log/?h=for-next

I am afraid this is a production machine and I cannot install a custom kernel.  Can I detect if I am affected by that particular issue in a running system?
Comment 28 Bart Van Assche 2022-07-17 23:03:49 UTC
Now that I have had the time to read all the messages above, the following information looks very interesting to me:

Linux 5.16.15-1-default - hdb not lost during resume from S3.
Linux 5.17.1-1-default - hdb lost during resume from S3.

The following changes from me have been included in the sd driver in each of these kernel versions:

$ PAGER= git log --author="Bart Van Assche" --reverse --format=oneline v5.15..v5.16.15 drivers/scsi/sd.c
a19a93e4c6a98c9c0f2f5a6db76846f10d7d1f85 scsi: core: pm: Rely on the device driver core for async power management
1c9575326a4a09bdb82c104ed1ab4e932c7a1f49 scsi: sd: Rename sd_resume() into sd_resume_system()
9131bff6a9f1d4c6ab2d4673974a9659d8d24dc7 scsi: core: pm: Only runtime resume if necessary

$ PAGER= git log --author="Bart Van Assche" --reverse --format=oneline v5.15..v5.17.1 drivers/scsi/sd.c
a19a93e4c6a98c9c0f2f5a6db76846f10d7d1f85 scsi: core: pm: Rely on the device driver core for async power management
1c9575326a4a09bdb82c104ed1ab4e932c7a1f49 scsi: sd: Rename sd_resume() into sd_resume_system()
9131bff6a9f1d4c6ab2d4673974a9659d8d24dc7 scsi: core: pm: Only runtime resume if necessary
db33028647a3eca9e3e6fccf170d75f3b56a466c scsi: Remove superfluous #include <linux/async.h> directives

In other words, the patch "scsi: core: pm: Rely on the device driver core for async power management" has been included in both kernel versions and the fix for that patch ("scsi: sd: Rework asynchronous resume support") is included in neither v5.16.15 nor in v5.17.1. My conclusion is that it is *unlikely* that https://bugzilla.kernel.org/show_bug.cgi?id=215880 is related to this issue.

Damien, I found the following in comment #1:

[174233.228546] ata3: SATA link down (SStatus 0 SControl 300)
[174238.608510] ata3: SATA link down (SStatus 0 SControl 300)
[174238.608521] ata3.00: disable device
[174238.608540] sd 2:0:0:0: rejecting I/O to offline device
[174238.608547] ata3.00: detaching (SCSI 2:0:0:0)
[174238.608554] sd 2:0:0:0: [sdb] Start/Stop Unit failed: Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[174238.608558] sd 2:0:0:0: PM: dpm_run_callback(): scsi_bus_resume+0x0/0x80 returns -5
[174238.608567] sd 2:0:0:0: PM: failed to resume async: error -5
[174238.630402] sd 2:0:0:0: [sdb] Synchronizing SCSI cache
[174238.630423] sd 2:0:0:0: [sdb] Synchronize Cache(10) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[174238.630424] sd 2:0:0:0: [sdb] Stopping disk
[174238.630428] sd 2:0:0:0: [sdb] Start/Stop Unit failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK

Do you perhaps know what can cause a START STOP UNIT command to be sent to a SATA disk before the SATA link is up?
Comment 29 Damien Le Moal 2022-07-18 11:17:38 UTC
(In reply to Bart Van Assche from comment #28)
> Now that I have had the time to read all the messages above, the following
> information looks very interesting to me:
> 
> Linux 5.16.15-1-default - hdb not lost during resume from S3.
> Linux 5.17.1-1-default - hdb lost during resume from S3.
> 
> The following changes from me have been included in the sd driver in each of
> these kernel versions:
> 
> $ PAGER= git log --author="Bart Van Assche" --reverse --format=oneline
> v5.15..v5.16.15 drivers/scsi/sd.c
> a19a93e4c6a98c9c0f2f5a6db76846f10d7d1f85 scsi: core: pm: Rely on the device
> driver core for async power management
> 1c9575326a4a09bdb82c104ed1ab4e932c7a1f49 scsi: sd: Rename sd_resume() into
> sd_resume_system()
> 9131bff6a9f1d4c6ab2d4673974a9659d8d24dc7 scsi: core: pm: Only runtime resume
> if necessary
> 
> $ PAGER= git log --author="Bart Van Assche" --reverse --format=oneline
> v5.15..v5.17.1 drivers/scsi/sd.c
> a19a93e4c6a98c9c0f2f5a6db76846f10d7d1f85 scsi: core: pm: Rely on the device
> driver core for async power management
> 1c9575326a4a09bdb82c104ed1ab4e932c7a1f49 scsi: sd: Rename sd_resume() into
> sd_resume_system()
> 9131bff6a9f1d4c6ab2d4673974a9659d8d24dc7 scsi: core: pm: Only runtime resume
> if necessary
> db33028647a3eca9e3e6fccf170d75f3b56a466c scsi: Remove superfluous #include
> <linux/async.h> directives
> 
> In other words, the patch "scsi: core: pm: Rely on the device driver core
> for async power management" has been included in both kernel versions and
> the fix for that patch ("scsi: sd: Rework asynchronous resume support") is
> included in neither v5.16.15 nor in v5.17.1. My conclusion is that it is
> *unlikely* that https://bugzilla.kernel.org/show_bug.cgi?id=215880 is
> related to this issue.
> 
> Damien, I found the following in comment #1:
> 
> [174233.228546] ata3: SATA link down (SStatus 0 SControl 300)
> [174238.608510] ata3: SATA link down (SStatus 0 SControl 300)
> [174238.608521] ata3.00: disable device
> [174238.608540] sd 2:0:0:0: rejecting I/O to offline device
> [174238.608547] ata3.00: detaching (SCSI 2:0:0:0)
> [174238.608554] sd 2:0:0:0: [sdb] Start/Stop Unit failed: Result:
> hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
> [174238.608558] sd 2:0:0:0: PM: dpm_run_callback(): scsi_bus_resume+0x0/0x80
> returns -5
> [174238.608567] sd 2:0:0:0: PM: failed to resume async: error -5
> [174238.630402] sd 2:0:0:0: [sdb] Synchronizing SCSI cache
> [174238.630423] sd 2:0:0:0: [sdb] Synchronize Cache(10) failed: Result:
> hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
> [174238.630424] sd 2:0:0:0: [sdb] Stopping disk
> [174238.630428] sd 2:0:0:0: [sdb] Start/Stop Unit failed: Result:
> hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
> 
> Do you perhaps know what can cause a START STOP UNIT command to be sent to a
> SATA disk before the SATA link is up?

Resume issue the start stop unit command, which triggered the link-up/resume code in ATA, which failed, causing the start stop unit command to fail. That command is not sent "after" the fact that the link is down. The problem here is that the device PHY/link are not ready. Bart patches addressed a similar problem found on another system. It would be really great if you could try them to see if they fix your issue.
Comment 30 Nils 2022-07-25 18:51:14 UTC
I would love to be able to test the kernel! However, I cannot install code or software from another source than the given one (signed bu openSUSE and everything).

I hope the patch for 215880 will be merged soon. Us being on a rolling release I will likely be able to test it soon after.  Will let you know.
Comment 31 Nils 2022-08-15 11:01:41 UTC
I can confirm that this issue still exists in Linux 5.19.0.
Comment 32 Bart Van Assche 2022-08-15 16:55:24 UTC
Please share the output of the following command:

find /sys -name proc_name | xargs grep -aH .
Comment 33 Nils 2022-08-16 16:07:29 UTC
(In reply to Bart Van Assche from comment #32)
> Please share the output of the following command:
> 
> find /sys -name proc_name | xargs grep -aH .

# find /sys -name proc_name | xargs grep -aH .
/sys/devices/pci0000:00/0000:00:01.2/0000:01:00.0/0000:02:09.0/0000:0a:00.0/ata1/host0/scsi_host/host0/proc_name:ahci
/sys/devices/pci0000:00/0000:00:01.2/0000:01:00.0/0000:02:09.0/0000:0a:00.0/ata4/host3/scsi_host/host3/proc_name:ahci
/sys/devices/pci0000:00/0000:00:01.2/0000:01:00.0/0000:02:09.0/0000:0a:00.0/ata2/host1/scsi_host/host1/proc_name:ahci
/sys/devices/pci0000:00/0000:00:01.2/0000:01:00.0/0000:02:09.0/0000:0a:00.0/ata3/host2/scsi_host/host2/proc_name:ahci
/sys/devices/pci0000:00/0000:00:01.2/0000:01:00.0/0000:02:0a.0/0000:0b:00.0/ata6/host5/scsi_host/host5/proc_name:ahci
/sys/devices/pci0000:00/0000:00:01.2/0000:01:00.0/0000:02:0a.0/0000:0b:00.0/ata5/host4/scsi_host/host4/proc_name:ahci
/sys/devices/pci0000:00/0000:00:08.2/0000:0f:00.1/ata8/host7/scsi_host/host7/proc_name:ahci
/sys/devices/pci0000:00/0000:00:08.2/0000:0f:00.0/ata7/host6/scsi_host/host6/proc_name:ahci

This is after an S3 suspend which means sdb2 is invisible.
Comment 34 Nils 2022-08-16 16:08:11 UTC
PS: sdb I mean. The second hard disk.
Comment 35 Nils 2022-09-24 12:04:18 UTC
I can confirm that this issue still exists in Linux 5.19.8.
Comment 36 Nils 2022-09-24 12:05:19 UTC
Sorry, wrong status, was looking for CONFIRMED but VERIFIED is not CONFIRMED :-(
Comment 37 Damien Le Moal 2022-09-26 02:01:23 UTC
Could you try this patch ?

https://lore.kernel.org/linux-ide/20220921155753.231770-1-niklas.cassel@wdc.com/

You did mention that the problem is on a production machine. I hope you can test this on a similar machine not in production.