Bug 91921

Summary: WD 2TB Caviar black causes hard lock on resume S2RAM
Product: Power Management Reporter: Dragon32 (markcscott)
Component: Hibernation/Suspend Assignee: Aaron Lu (aaron.lu)
Status: CLOSED CODE_FIX    
Severity: high CC: aaron.lu, lenb, markcscott, rui.zhang
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 3.13.7 to 4.0.5 Tree: Mainline
Regression: Yes
Attachments: dmesg output for comment 3
echo platform > pm_test result
DMESG output

Description Dragon32 2015-01-23 16:06:12 UTC
Hi,

I currently have an issue where a pair of WD 2TB Caviar Black HDDs are causing a hard crash on resume from S2RAM. I initially thought the issue was down to an Nvidia driver problem, but subsequent investigation narrowed the troublesome hardware down to a pair of WD drives in my system. The models are:

WD2003FZEX-00Z4SA0

and

WD2002FAEX-0

The setup worked with openSUSE 13.1 + kernel 3.11 and successfully resumed from RAM.

Please note both these drives are NTFS, do NOT support APM (which I think is the issue) and I boot legacy BIOS (too lazy to do UEFI).

To give some history, please see my initial posting on the Opensuse bugzilla:

"I recently experimented with kernel 3.18 on opensuse 13.1 but that resulted in a resume failure on S2RAM, see here for posting:

https://forums.opensuse.org/showthread.php/503846-Failure-to-resume-from-Suspend-to-RAM-Kernel-3-18-1-1-1-g5f2f35e-on-OS-13-1-x64

Oh well I thought time to move onto 13.2 however this now exhibits the same behaviour as 13.1 + 3.18 kernel (13.1 + 3.11 kernel OK) in that it will not resume from S2RAM and is hard locked so only a reset will do. This was a clean install on a brand new SSD drive with home copied so no cruft left over." 


For further info and full discussion with opensuse devs please follow the link to the filed opensuse bugreport as it lists the steps at which the culprit was found:

https://bugzilla.opensuse.org/show_bug.cgi?id=913105

So far this bug occurs with the following kernels that I have tested: 3.16.6, 3.16.7, 3.18.1, 3.18.3. Kernel 3.11 is currently the last known good kernel; I have not yet tested kernels between 3.11 and 3.16.6.

Please advise of any further information required.

Thanks
Comment 1 Dragon32 2015-01-24 14:36:25 UTC
A little bit more info

Using the debug method:

sh -c "sync && echo 1 > /sys/power/pm_trace && pm-suspend"

The magic number returned is:

[    2.941809]   Magic number: 0:278:890
[    2.961872]   hash matches ../drivers/base/power/main.c:736

however this does not match any PCI device, so using:

cat /sys/power/pm_trace_dev_match

The resulting output is:

bsg
scsi_device
scsi_disk
sd
pci

Thanks
Comment 2 Dragon32 2015-01-26 22:38:36 UTC
With reference to comment 1 PM_trace was performed on Opensuse 13.2 using kernel 3.16.7

Thanks
Comment 3 Aaron Lu 2015-01-29 06:17:26 UTC
If it is a disk related issue, you should be able to trigger it with devices level testing. Check this document:
https://www.kernel.org/doc/Documentation/power/basic-pm-debugging.txt

Simply put, you can do:
# cd /sys/power
# echo devices > pm_test
# echo mem > state

and then the test should fail. With luck, you should be able to see some messages about the error.
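
The test levels from that document can be walked through one at a time, stopping at the first one that fails to resume. The following is only a rough sketch, assuming root and a kernel built with CONFIG_PM_DEBUG; the function name and the PM_DIR variable are made up for illustration and are not part of any kernel interface:

```shell
#!/bin/sh
# Hypothetical helper: run one pm_test level, then reset pm_test to "none".
# PM_DIR is an illustrative override; on a real system it is /sys/power.
PM_DIR="${PM_DIR:-/sys/power}"

run_pm_test_level() {
    level="$1"
    echo "Testing level: $level"
    echo "$level" > "$PM_DIR/pm_test" || return 1
    # With pm_test set, the suspend only goes as deep as the chosen level,
    # waits a few seconds, and then resumes on its own.
    echo mem > "$PM_DIR/state"
    status=$?
    # Restore normal suspend behaviour afterwards.
    echo none > "$PM_DIR/pm_test"
    return "$status"
}

# Levels from shallowest to deepest, as listed in basic-pm-debugging.txt:
#   for l in freezer devices platform processors core; do
#       run_pm_test_level "$l" || break
#   done
```

If the machine hard locks at some level, the loop never gets to print the next "Testing level" line, so the last level printed is the one to report.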
Comment 4 Dragon32 2015-01-29 07:04:16 UTC
Created attachment 165121 [details]
dmesg output for comment 3

Hi Aaron and thank you for the reply.

I have tried the test you suggested in comment 3 but it didn't seem to generate any errors and successfully re-awoke, please find attached full dmesg output. The relevant time lines are 1714 to 1728.

Thanks
Comment 5 Aaron Lu 2015-01-29 07:51:17 UTC
What about the next level, platform? Keep going through the levels until the problem occurs.
Comment 6 Dragon32 2015-01-29 13:24:24 UTC
Created attachment 165131 [details]
echo platform  > pm_test result

Hi Lu, 

apologies - quickly did the test without reading too far. Thank you for your patient hand holding :)

The platform test successfully resumed but there were some error messages which I don't fully comprehend that have been discussed here regarding "BAR 14: can't assign mem:". No idea if they are relevant.

http://www.gossamer-threads.com/lists/linux/kernel/1894740

lspci:

00:00.0 Host bridge: Intel Corporation Xeon E3-1200 v3 Processor DRAM Controller (rev 06)
00:01.0 PCI bridge: Intel Corporation Xeon E3-1200 v3/4th Gen Core Processor PCI Express x16 Controller (rev 06)
00:14.0 USB controller: Intel Corporation 8 Series/C220 Series Chipset Family USB xHCI (rev 05)
00:16.0 Communication controller: Intel Corporation 8 Series/C220 Series Chipset Family MEI Controller #1 (rev 04)
00:16.3 Serial controller: Intel Corporation 8 Series/C220 Series Chipset Family KT Controller (rev 04)
00:1a.0 USB controller: Intel Corporation 8 Series/C220 Series Chipset Family USB EHCI #2 (rev 05)
00:1b.0 Audio device: Intel Corporation 8 Series/C220 Series Chipset High Definition Audio Controller (rev 05)
00:1c.0 PCI bridge: Intel Corporation 8 Series/C220 Series Chipset Family PCI Express Root Port #1 (rev d5)
00:1c.2 PCI bridge: Intel Corporation 8 Series/C220 Series Chipset Family PCI Express Root Port #3 (rev d5)
00:1c.3 PCI bridge: Intel Corporation 8 Series/C220 Series Chipset Family PCI Express Root Port #4 (rev d5)
00:1d.0 USB controller: Intel Corporation 8 Series/C220 Series Chipset Family USB EHCI #1 (rev 05)
00:1f.0 ISA bridge: Intel Corporation H87 Express LPC Controller (rev 05)
00:1f.2 SATA controller: Intel Corporation 8 Series/C220 Series Chipset Family 6-port SATA Controller 1 [AHCI mode] (rev 05)
00:1f.3 SMBus: Intel Corporation 8 Series/C220 Series Chipset Family SMBus Controller (rev 05)
01:00.0 VGA compatible controller: NVIDIA Corporation GM107 [GeForce GTX 750] (rev a2)
01:00.1 Audio device: NVIDIA Corporation Device 0fbc (rev a1)
03:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 06)
04:00.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 41)
Comment 7 Aaron Lu 2015-01-30 05:12:06 UTC
The BAR problem doesn't seem relevant.
Please test the next level.
Comment 8 Dragon32 2015-01-30 08:51:58 UTC
Hi Aaron

I should have added yesterday that I tried all steps and resume from RAM was successful in all of them; the only error messages are the ones supplied. What I have noticed is that it may take a slightly longer interval between suspend and wake to reproduce the bug. If I manually reawaken the system with a key press within 5 seconds, it reawakens successfully, as I suspect full suspend-to-RAM power down is not yet achieved within 5 seconds; it seems to need about 7-8 seconds to get resume to fail. I don't know where I can change the auto re-awaken time from 5 to, say, 10 seconds to check this.

Thanks
Comment 9 Dragon32 2015-01-30 09:54:44 UTC
Tested kernel 3.19-rc6 on openSUSE 13.2 and the bug is still present; changed bug info to reflect this.
Comment 10 Dragon32 2015-01-31 14:46:36 UTC
Tested a couple more kernel versions: the last known good kernel I can find is 3.12.37 and the first known bad kernel is 3.13.7.

It would appear I have exactly the same bug symptoms as this post:

https://bugs.freedesktop.org/show_bug.cgi?id=86115

Thanks
Comment 11 Dragon32 2015-02-01 08:31:49 UTC
Created attachment 165401 [details]
DMESG output

Hi Aaron

If I use:

echo 0 > /sys/power/pm_async

then try suspend to RAM, wake up is successful. I note there are then a couple of COMRESET errors which correspond to the WD 2TB drives and there are some errors relating to USB. Please see attached DMESG output.

Thanks for your patience :)
Comment 12 Dragon32 2015-02-01 08:43:57 UTC
NB

echo 0 > /sys/power/pm_async

only seems to work once or twice at best then fails again to resume from suspend.
Comment 13 Aaron Lu 2015-02-02 02:04:27 UTC
You can change the delay time value from the default 5000ms to 10000ms here and then see if this makes the problem occur:

diff --git a/kernel/power/suspend.c b/kernel/power/suspend.c
index c347e3ce3a55..fa3ecd0e1108 100644
--- a/kernel/power/suspend.c
+++ b/kernel/power/suspend.c
@@ -208,8 +208,8 @@ static int suspend_test(int level)
 {
 #ifdef CONFIG_PM_DEBUG
        if (pm_test_level == level) {
-               printk(KERN_INFO "suspend debug: Waiting for 5 seconds.\n");
-               mdelay(5000);
+               printk(KERN_INFO "suspend debug: Waiting for 10 seconds.\n");
+               mdelay(10000);
                return 1;
        }
 #endif /* !CONFIG_PM_DEBUG */
Comment 14 Aaron Lu 2015-02-02 08:31:36 UTC
I suppose v3.12 works and v3.13 doesn't? If so, can you please use Linus' git tree to do a git bisect to find the offending commit?
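
For anyone new to bisecting, the procedure could be sketched roughly as below. This assumes a clone of Linus' tree as the current directory and that v3.12 really is good and v3.13 really is bad; the function names and the GIT variable are illustrative only (GIT=echo gives a dry run that just prints the commands):

```shell
#!/bin/sh
# Illustrative override: set GIT=echo to print the commands instead of running them.
GIT="${GIT:-git}"

# Start a bisect: first argument is the known bad tag, second the known good one.
start_bisect() {
    $GIT bisect start &&
    $GIT bisect bad "$1" &&
    $GIT bisect good "$2"
}

# After building and booting the kernel git checked out, try S2RAM and record
# the outcome: "ok" if the machine resumed, anything else if it locked up.
mark_result() {
    if [ "$1" = "ok" ]; then
        $GIT bisect good
    else
        $GIT bisect bad
    fi
}

# Typical use:
#   start_bisect v3.13 v3.12
#   ...build, install, reboot, suspend/resume...
#   mark_result ok      # or: mark_result hang
```

Each round roughly halves the remaining range of commits, and git prints the first bad commit once the range is exhausted.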
Comment 15 Dragon32 2015-02-02 10:02:13 UTC
Hi Aaron, I'm a chemist by profession (not a programmer) and a git bisect is currently beyond my skill set. I'll ask over at openSUSE to see if someone can walk me through this; I've never built a kernel in my life. I will also need to build it with development files to use the proprietary Nvidia driver, as the GPL nvidia driver fails to reawaken my video card.

Lars Muller (one of the Samba devs) has run into the exact same bug as myself and I'll see if I can ask for some help there.

Please read this bug report at freedesktop and have a look at the attachments; he seems to have the same COMRESET issue.

https://bugs.freedesktop.org/show_bug.cgi?id=86115
Comment 16 Aaron Lu 2015-03-13 03:03:36 UTC
The COMRESET failure doesn't matter, since the ata port finally survived and your disk is found and properly detected.

It doesn't seem there is a problem with the hard disk, and Lars Muller is seeking help on the graphics side.

So do you still have a problem with resume?
Comment 17 Dragon32 2015-03-13 10:59:38 UTC
Hi Aaron, 

thanks, yes I'm still having problems with resume from suspend to ram.

I've just not had time yet to get around to learning how to git bisect and then try those bisects. Wife and small children :)
Comment 18 Aaron Lu 2015-03-23 06:35:00 UTC
Did you try the suggestions there:
https://bugs.freedesktop.org/show_bug.cgi?id=86115#c15
i.e. try to boot with modprobe.blacklist=nvidia to see if that makes any difference.

BTW, what about nouveau? Does that work for you?
Comment 19 Dragon32 2015-03-23 16:58:58 UTC
Hi Aaron

I've tried several different HDD and SSD combinations of other drives ranging from 180gb to 1gb in size and no other drive causes this issue apart from the WD 2TB caviar blacks. I'm a bit mystified as to how the Nvidia Video driver is going to affect just one particular species of hard drive ?

Yes, nouveau is broken for me and fails on resume from suspend to RAM, but the Nvidia binary blob is fine so that isn't really an issue for me. It's the problem with my WD 2TB Caviar Blacks that is currently an issue, as I cannot work around that bug at the moment other than by purchasing an alternative type of 2TB drive.

Thanks
Comment 20 Aaron Lu 2015-03-24 03:16:43 UTC
(In reply to Dragon32 from comment #19)
> Hi Aaron
> 
> I've tried several different HDD and SSD combinations of other drives
> ranging from 180gb to 1gb in size and no other drive causes this issue apart
> from the WD 2TB caviar blacks. I'm a bit mystified as to how the Nvidia
> Video driver is going to affect just one particular species of hard drive ?

Do you mean when you use other disks, S2Ram works well?
Hmm...perhaps a device issue?
Comment 21 Dragon32 2015-03-25 09:09:16 UTC
Yes, S2RAM works for other SSDs and HDDs; it's a quirk of the WD Blacks.

They are two different generations of WD blacks:

WD2003FZEX-00Z4SA0
WD2002FAEX-0
Comment 22 Aaron Lu 2015-03-31 03:13:07 UTC
We need the bisect to go further, please do that when you have time.
I'll close it for the moment; if you have additional information, feel free to re-open it.
Comment 23 Dragon32 2015-06-13 12:01:01 UTC
Hi Aaron

I've some good news on this one, this has been successfully fixed. Using this test build kindly provided by Takashi Iwai I can now successfully resume from s2ram:

http://download.opensuse.org/repositories/home:/tiwai:/bnc934397/standard/

Log notes:
- Update config files: extend CONFIG_DPM_WATCHDOG_TIMEOUT to 60 (bnc#934397)
- commit f382e20

Thanks
Comment 24 Dragon32 2015-06-13 15:15:24 UTC
For further info please also see this bug report:

https://bugzilla.opensuse.org/show_bug.cgi?id=934397
Comment 25 Aaron Lu 2015-06-16 02:02:29 UTC
Patch sent out by Takashi:

https://patchwork.kernel.org/patch/6608411/
Comment 26 Len Brown 2015-07-22 00:25:32 UTC
4.2-rc1 has:

commit fff3b16d2754a061a3549c4307a186423a0128fd
Author: Takashi Iwai <tiwai@suse.de>
Date:   Thu Jun 25 00:35:16 2015 +0200

    PM / sleep: Increase default DPM watchdog timeout to 60
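
To see whether a given kernel build carries the larger timeout, the DPM watchdog options can be read from its config file. This is only a sketch: the function name is made up, and the default path /boot/config-$(uname -r) is an assumption that holds on many distributions but not all:

```shell
#!/bin/sh
# Hypothetical helper: print the DPM watchdog settings from a kernel config file.
# With no argument it reads /boot/config-<running kernel>, which is assumed to exist.
show_dpm_watchdog() {
    config="${1:-/boot/config-$(uname -r)}"
    grep '^CONFIG_DPM_WATCHDOG' "$config"
}

# Typical use:
#   show_dpm_watchdog
#   show_dpm_watchdog /path/to/other/config
```

On a kernel with the fix and the watchdog enabled, the output would be expected to include CONFIG_DPM_WATCHDOG_TIMEOUT=60 unless the distribution overrides it.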