Hi,
I currently have an issue where a pair of WD 2TB Caviar Black HDDs cause a hard crash on resume from S2RAM. I initially thought the issue was down to an Nvidia driver problem, but subsequent investigation narrowed the troublesome hardware down to a pair of WD drives in my system. The models are:
WD2003FZEX-00Z4SA0
WD2002FAEX-0
The setup worked with openSUSE 13.1 + kernel 3.11 and successfully resumed from RAM. Please note that both of these drives are NTFS and do NOT support APM (which I think is the issue), and that I boot via legacy BIOS (too lazy to do UEFI).
To give some history, please see my initial posting on the openSUSE Bugzilla:
"I recently experimented with kernel 3.18 on openSUSE 13.1 but that resulted in a resume failure on S2RAM, see here for the posting: https://forums.opensuse.org/showthread.php/503846-Failure-to-resume-from-Suspend-to-RAM-Kernel-3-18-1-1-1-g5f2f35e-on-OS-13-1-x64 Oh well, I thought, time to move on to 13.2. However, this now exhibits the same behaviour as 13.1 + 3.18 kernel (13.1 + 3.11 kernel is OK) in that it will not resume from S2RAM and is hard locked, so only a reset will do. This was a clean install on a brand new SSD drive with home copied, so no cruft left over."
For further info and the full discussion with the openSUSE devs, please follow the link to the filed openSUSE bug report, as it lists the steps by which the culprit was found: https://bugzilla.opensuse.org/show_bug.cgi?id=913105
So far this bug occurs with the following kernels that I have tested: 3.16.6, 3.16.7, 3.18.1, 3.18.3. Kernel 3.11 is currently the last known good kernel; I have not yet tested kernels between 3.11 and 3.16.6.
Please advise of any further information required.
Thanks
A little bit more info.
Using the debug method:
sh -c "sync && echo 1 > /sys/power/pm_trace && pm-suspend"
the magic number returned is:
[ 2.941809] Magic number: 0:278:890
[ 2.961872] hash matches ../drivers/base/power/main.c:736
However, this does not match any PCI device, so using:
cat /sys/power/pm_trace_dev_match
the resulting output is:
bsg
scsi_device
scsi_disk
sd
pci
Thanks
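
For anyone repeating the pm_trace step above, the two relevant lines can be filtered out of the kernel log after the reboot. This is only a minimal sketch: the helper name `pm_trace_lines` is made up for illustration, and it assumes the log uses the same "Magic number:" and "hash matches" wording shown in this comment.

```shell
# Hypothetical helper: print only the pm_trace-related lines of a kernel log.
# Reads the files given as arguments, or stdin when no file is given.
pm_trace_lines() {
    grep -E 'Magic number:|hash matches' "$@"
}

# Typical use after the reboot that follows a failed resume:
#   dmesg | pm_trace_lines
```

The device names themselves still come from /sys/power/pm_trace_dev_match, as in the comment above; the filter only saves scrolling through the full dmesg output.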
With reference to comment 1: the pm_trace test was performed on openSUSE 13.2 using kernel 3.16.7.
Thanks
If it is a disk-related issue, you should be able to trigger it with device-level testing. Check this document: https://www.kernel.org/doc/Documentation/power/basic-pm-debugging.txt
Simply put, you can do:
# cd /sys/power
# echo devices > pm_test
# echo mem > state
and then the test should fail. With luck, you should be able to see some messages about the error.
Created attachment 165121: dmesg output for comment 3
Hi Aaron, and thank you for the reply. I have tried the test you suggested in comment 3, but it didn't seem to generate any errors and the system woke up again successfully. Please find the full dmesg output attached. The relevant timestamps are 1714 to 1728.
Thanks
What about the next level, platform? Keep going through the levels until the problem occurs.
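
Stepping through the levels can be scripted. The following is a rough sketch, not a tested procedure for this machine: `run_pm_test` is a hypothetical helper, the optional directory argument exists only so the function can be tried against a scratch directory, and on a real system it has to run as root against /sys/power. The level names are the ones listed in basic-pm-debugging.txt.

```shell
# Hypothetical helper: run one pm_test level.
# $1 = test level
# $2 = directory holding pm_test and state (defaults to /sys/power)
run_pm_test() {
    pm_level=$1
    pm_dir=${2:-/sys/power}
    echo "$pm_level" > "$pm_dir/pm_test" || return 1
    # Writing "mem" starts the suspend; in test mode the kernel waits
    # about 5 seconds and then resumes by itself.
    echo mem > "$pm_dir/state"
    pm_status=$?
    # Switch test mode off again so that a later suspend is a real one.
    echo none > "$pm_dir/pm_test"
    return "$pm_status"
}

# Levels from basic-pm-debugging.txt, shallowest first:
#   for level in freezer devices platform processors core; do
#       run_pm_test "$level" || break
#   done
```

If the machine hard-locks at some level, the loop simply never returns, and the last level written to the console or log is the one to report.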
Created attachment 165131: echo platform > pm_test result
Hi Lu, apologies, I quickly did the test without reading too far. Thank you for your patient hand holding :)
The platform test successfully resumed, but there were some error messages which I don't fully comprehend. They have been discussed here, regarding "BAR 14: can't assign mem:". No idea if they are relevant.
http://www.gossamer-threads.com/lists/linux/kernel/1894740
lspci:
00:00.0 Host bridge: Intel Corporation Xeon E3-1200 v3 Processor DRAM Controller (rev 06)
00:01.0 PCI bridge: Intel Corporation Xeon E3-1200 v3/4th Gen Core Processor PCI Express x16 Controller (rev 06)
00:14.0 USB controller: Intel Corporation 8 Series/C220 Series Chipset Family USB xHCI (rev 05)
00:16.0 Communication controller: Intel Corporation 8 Series/C220 Series Chipset Family MEI Controller #1 (rev 04)
00:16.3 Serial controller: Intel Corporation 8 Series/C220 Series Chipset Family KT Controller (rev 04)
00:1a.0 USB controller: Intel Corporation 8 Series/C220 Series Chipset Family USB EHCI #2 (rev 05)
00:1b.0 Audio device: Intel Corporation 8 Series/C220 Series Chipset High Definition Audio Controller (rev 05)
00:1c.0 PCI bridge: Intel Corporation 8 Series/C220 Series Chipset Family PCI Express Root Port #1 (rev d5)
00:1c.2 PCI bridge: Intel Corporation 8 Series/C220 Series Chipset Family PCI Express Root Port #3 (rev d5)
00:1c.3 PCI bridge: Intel Corporation 8 Series/C220 Series Chipset Family PCI Express Root Port #4 (rev d5)
00:1d.0 USB controller: Intel Corporation 8 Series/C220 Series Chipset Family USB EHCI #1 (rev 05)
00:1f.0 ISA bridge: Intel Corporation H87 Express LPC Controller (rev 05)
00:1f.2 SATA controller: Intel Corporation 8 Series/C220 Series Chipset Family 6-port SATA Controller 1 [AHCI mode] (rev 05)
00:1f.3 SMBus: Intel Corporation 8 Series/C220 Series Chipset Family SMBus Controller (rev 05)
01:00.0 VGA compatible controller: NVIDIA Corporation GM107 [GeForce GTX 750] (rev a2)
01:00.1 Audio device: NVIDIA Corporation Device 0fbc (rev a1)
03:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 06)
04:00.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 41)
The BAR problem doesn't seem relevant. Please test the next level.
Hi Aaron,
I should have added yesterday that I tried all the steps and resume from RAM was successful in every one of them; the only error messages are the ones supplied. What I have noticed is that reproducing the bug may need a slightly longer interval between suspend and wake-up. If I manually wake the system with a key press within 5 seconds, it resumes successfully, and I suspect that is because the full suspend-to-RAM power down has not yet been reached in that time. It seems to need about 7-8 seconds before resume fails. I don't know where I can change the automatic wake-up time from 5 to, say, 10 seconds to check this.
Thanks
Tested kernel 3.19-rc6 on openSUSE 13.2 and the bug is still present; changed the bug info to reflect this.
Tested a couple more kernel versions: the last known good kernel I can find is 3.12.37 and the first known bad kernel is 3.13.7.
It would appear I have exactly the same bug symptoms as in this report: https://bugs.freedesktop.org/show_bug.cgi?id=86115
Thanks
Created attachment 165401: DMESG output
Hi Aaron,
If I use:
echo 0 > /sys/power/pm_async
and then try suspend to RAM, wake-up is successful. I note there are then a couple of COMRESET errors which correspond to the WD 2TB drives, and there are some errors relating to USB. Please see the attached dmesg output.
Thanks for your patience :)
NB: echo 0 > /sys/power/pm_async only seems to work once or twice at best; after that, resume from suspend fails again.
You can change the delay time value from the default 5000ms to 10000ms here and then see if this makes the problem occur:
diff --git a/kernel/power/suspend.c b/kernel/power/suspend.c
index c347e3ce3a55..fa3ecd0e1108 100644
--- a/kernel/power/suspend.c
+++ b/kernel/power/suspend.c
@@ -208,8 +208,8 @@ static int suspend_test(int level)
 {
 #ifdef CONFIG_PM_DEBUG
 	if (pm_test_level == level) {
-		printk(KERN_INFO "suspend debug: Waiting for 5 seconds.\n");
-		mdelay(5000);
+		printk(KERN_INFO "suspend debug: Waiting for 10 seconds.\n");
+		mdelay(10000);
 		return 1;
 	}
 #endif /* !CONFIG_PM_DEBUG */
I suppose v3.12 works and v3.13 doesn't? If so, can you please use Linus' git tree to do a git bisect to find the offending commit?
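
A bisect session for this request could look roughly like the sketch below. It is an outline under assumptions rather than exact instructions: `start_bisect` is a hypothetical helper, `~/linux` stands for wherever Linus' tree is cloned, the tags v3.12 and v3.13 follow the supposition above, and the build, install and reboot steps between rounds are left out because they depend on the distribution.

```shell
# Hypothetical helper: begin a bisect in the kernel tree at $1,
# with $2 as the last good and $3 as the first bad revision.
start_bisect() {
    git -C "$1" bisect start "$3" "$2"
}

# Sketch of the whole session (paths and build steps are placeholders):
#   start_bisect ~/linux v3.12 v3.13
#   ... build, install and boot the revision git checked out, test S2RAM ...
#   git -C ~/linux bisect good     # if resume worked
#   git -C ~/linux bisect bad      # if resume hard-locked
#   ... repeat until git prints "<sha> is the first bad commit" ...
#   git -C ~/linux bisect log      # record of the session, useful to attach
#   git -C ~/linux bisect reset    # return to the original checkout
```

Each round halves the remaining range of commits, so the number of kernels to build grows only with the logarithm of the number of commits between the two tags.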
Hi Aaron,
I'm a chemist by profession (not a programmer) and a git bisect is currently beyond my skill set. I'll ask over at openSUSE to see if someone can walk me through this; I've never built a kernel in my life. I will also need to build it with development files in order to use the proprietary Nvidia driver, as the GPL Nvidia driver fails to reawaken my video card.
Lars Muller (one of the Samba devs) has run into the exact same bug as I have, and I'll see if I can ask for some help there. Please read this bug report at freedesktop and have a look at the attachments; he seems to have the same COMRESET issue.
https://bugs.freedesktop.org/show_bug.cgi?id=86115
The COMRESET failure doesn't matter, since the ATA port finally survived and your disk was found and properly detected. It doesn't seem that there is a problem with the hard disk, and Lars Muller is seeking help on the graphics side. So do you still have a problem with resume?
Hi Aaron, thanks, yes I'm still having problems with resume from suspend to ram. I've just not had time yet to get around to learning how to git bisect and then try those bisects. Wife and small children :)
Did you try the suggestions there: https://bugs.freedesktop.org/show_bug.cgi?id=86115#c15 i.e. try to boot with modprobe.blacklist=nvidia to see if that makes any difference. BTW, what about nouveau? Does that work for you?
Hi Aaron,
I've tried several different HDD and SSD combinations of other drives ranging from 180GB to 1GB in size, and no other drive causes this issue apart from the WD 2TB Caviar Blacks. I'm a bit mystified as to how the Nvidia video driver is going to affect just one particular species of hard drive?
Yes, nouveau is broken for me and fails on resume from suspend to RAM, but the Nvidia binary blob is fine, so that isn't really an issue for me. It's the problem with my WD 2TB Caviar Blacks that is currently the issue, as I cannot work around that bug at the moment other than by purchasing an alternative type of 2TB drive.
Thanks
(In reply to Dragon32 from comment #19)
> Hi Aaron
>
> I've tried several different HDD and SSD combinations of other drives
> ranging from 180gb to 1gb in size and no other drive causes this issue apart
> from the WD 2TB caviar blacks. I'm a bit mystified as to how the Nvidia
> Video driver is going to affect just one particular species of hard drive ?
Do you mean when you use other disks, S2RAM works well? Hmm...perhaps a device issue?
Yes, S2RAM works for other SSDs and HDDs; it's a quirk of the WD Blacks. They are two different generations of WD Blacks:
WD2003FZEX-00Z4SA0
WD2002FAEX-0
We need the bisect to go further, so please do that when you have time. I'll close it for the moment; if you have additional information, feel free to re-open it.
Hi Aaron,
I've some good news on this one: this has been successfully fixed. Using this test build kindly provided by Takashi Iwai, I can now successfully resume from S2RAM:
http://download.opensuse.org/repositories/home:/tiwai:/bnc934397/standard/
Log notes:
- Update config files: extend CONFIG_DPM_WATCHDOG_TIMEOUT to 60 (bnc#934397)
- commit f382e20
Thanks
For further info please also see this bug report: https://bugzilla.opensuse.org/show_bug.cgi?id=934397
Patch sent out by Takashi: https://patchwork.kernel.org/patch/6608411/
4.2-rc1 has:
commit fff3b16d2754a061a3549c4307a186423a0128fd
Author: Takashi Iwai <tiwai@suse.de>
Date:   Thu Jun 25 00:35:16 2015 +0200
    PM / sleep: Increase default DPM watchdog timeout to 60
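
To see which watchdog timeout a given kernel was built with, the value can be read from its config file. A minimal sketch, with assumptions: `dpm_watchdog_timeout` is a hypothetical helper, and the default path /boot/config-<release> is only where many distributions install the config; on other systems the file may live elsewhere.

```shell
# Hypothetical helper: print the DPM watchdog timeout (in seconds)
# recorded in a kernel config file.
# $1 = config file; defaults to /boot/config-<running kernel>, which is
#      an assumption about where the distribution installs it.
dpm_watchdog_timeout() {
    cfg=${1:-/boot/config-$(uname -r)}
    sed -n 's/^CONFIG_DPM_WATCHDOG_TIMEOUT=//p' "$cfg"
}

# Example:
#   dpm_watchdog_timeout
#   dpm_watchdog_timeout /boot/config-of-some-other-kernel
```

An empty result means the option is not set in that config, for instance because CONFIG_DPM_WATCHDOG is disabled.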