Bug 82291

Summary: (Xeon(R) CPU E5-2697 v2 )reboot(fixed)after resume, and deadlock on dma_lock when suspend(root caused)
Product: Power Management Reporter: Janek Kozicki (janek.listy.mailowe)
Component: Hibernation/SuspendAssignee: Chen Yu (yu.c.chen)
Status: CLOSED CODE_FIX    
Severity: high CC: cosurgi, janek.listy.mailowe, rui.zhang, yu.c.chen
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 4.2 Subsystem:
Regression: No Bisected commit-id:
Attachments: 1. magic numbers 2. dmesg 3.lspci
config file for kernel 4.2
dmesg after failed resume, with hibernation done using Method 1)
dmesg after failed resume, with hibernation done using Method 2)
dmesg after failed resume, with hibernation done using Method 3)
dmesg after failed resume, with hibernation done using Method 4)
Video of resume failure using Method 1 and 2
Video of resume failure using Method 3 and 4
Video of resume success and failure using Method suggested by Chen Yu in Comment 30
dmesg after successful resume, using Method suggested by Chen Yu in Comment 30
dmesg after failed resume (2nd try), using Method suggested by Chen Yu in Comment 30
Dmesg before doing hibernation, in response to Comment 38
Dmesg after doing hibernation, in response to Comment 38
Screen photo no_console_suspend with all messages before to failure (1/4)
Screen photo no_console_suspend with all messages before failure (2/4)
Screen photo no_console_suspend with all messages before failure (3/4)
Screen photo no_console_suspend with all messages before failure (4/4)
config file for kernel 4.2 with USB keyboard (so that init=/bin/bash works) and CONFIG_SATA_AHCI=y
Screen photo init=/bin/bash with all messages before failure (1/6)
Screen photo init=/bin/bash with all messages before failure (2/6)
Screen photo init=/bin/bash with all messages before failure (3/6)
Screen photo init=/bin/bash with all messages before failure (4/6)
Screen photo init=/bin/bash with all messages before failure (5/6)
Screen photo init=/bin/bash with all messages before failure (6/6)
Dmesg after doing resume with init=/bin/bash
debug patch to track resume after cpu offline
kernel panic with parameter "text" in kernel command line 1/2
kernel panic with parameter "text" in kernel command line 2/2
screenshot with new debug info 1st try 1/2
screenshot with new debug info 1st try 2/2
lucky success screenshot with s4 patch 1/7
lucky success screenshot with s4 patch 2/7
lucky success screenshot with s4 patch 3/7
lucky success screenshot with s4 patch 4/7
lucky success screenshot with s4 patch 5/7
lucky success screenshot with s4 patch 6/7
lucky success screenshot with s4 patch 7/7
lucky success screenshot with s4 patch - dmesg output
screenshots of failure on second try with s4 patch 1/7
screenshots of failure on second try with s4 patch 2/7
screenshots of failure on second try with s4 patch 3/7
screenshots of failure on second try with s4 patch 4/7
screenshots of failure on second try with s4 patch 5/7
screenshots of failure on second try with s4 patch 6/7
screenshots of failure on second try with s4 patch 7/7
dmesg after second failure with s4 patch, with pm_trace
trial patch to enable the temporary direct-mapping for non-boot cpus before they die
freeze with nonboot.diff 1/8
freeze with nonboot.diff 2/8
freeze with nonboot.diff 3/8
freeze with nonboot.diff 4/8
freeze with nonboot.diff 5/8
freeze with nonboot.diff 6/8
freeze with nonboot.diff 7/8
freeze with nonboot.diff 8/8
freeze with nonboot.diff 2/8 (as jpg, mistakenly uploaded as text earlier)
force nonboot cpus to switch to safe pagetable when restoring page frames
Dmesg after successful resume with hibernate_mapping.diff
Dmesg after failed resume with hibernate_mapping.diff 1
Dmesg after failed resume with hibernate_mapping.diff 2
Dmesg after failed resume with hibernate_mapping.diff 3
Screenshot of successful resume with hibernate_mapping.diff 1/17
Screenshot of successful resume with hibernate_mapping.diff 2/17
Screenshot of successful resume with hibernate_mapping.diff 3/17
Screenshot of successful resume with hibernate_mapping.diff 4/17
Screenshot of successful resume with hibernate_mapping.diff 5/17
Screenshot of successful resume with hibernate_mapping.diff 6/17
Screenshot of successful resume with hibernate_mapping.diff 7/17
Screenshot of successful resume with hibernate_mapping.diff 8/17
Screenshot of successful resume with hibernate_mapping.diff 9/17
Screenshot of successful resume with hibernate_mapping.diff 10/17
Screenshot of successful resume with hibernate_mapping.diff 11/17
Screenshot of successful resume with hibernate_mapping.diff 12/17
Screenshot of successful resume with hibernate_mapping.diff 13/17
Screenshot of successful resume with hibernate_mapping.diff 14/17
Screenshot of successful resume with hibernate_mapping.diff 15/17
Screenshot of successful resume with hibernate_mapping.diff 16/17
Screenshot of successful resume with hibernate_mapping.diff 17/17
Screenshot of failed 1 resume with hibernate_mapping.diff 1/14
Screenshot of failed 1 resume with hibernate_mapping.diff 2/14
Screenshot of failed 1 resume with hibernate_mapping.diff 3/14
Screenshot of failed 1 resume with hibernate_mapping.diff 4/14
Screenshot of failed 1 resume with hibernate_mapping.diff 5/14
Screenshot of failed 1 resume with hibernate_mapping.diff 6/14
Screenshot of failed 1 resume with hibernate_mapping.diff 7/14
Screenshot of failed 1 resume with hibernate_mapping.diff 8/14
Screenshot of failed 1 resume with hibernate_mapping.diff 9/14
Screenshot of failed 1 resume with hibernate_mapping.diff 10/14
Screenshot of failed 1 resume with hibernate_mapping.diff 11/14
Screenshot of failed 1 resume with hibernate_mapping.diff 12/14
Screenshot of failed 1 resume with hibernate_mapping.diff 13/14
Screenshot of failed 1 resume with hibernate_mapping.diff 14/14
Screenshot of failed 2 resume with hibernate_mapping.diff 1/2
Screenshot of failed 2 resume with hibernate_mapping.diff 2/2
Screenshot of failed 3 resume with hibernate_mapping.diff 1/5
Screenshot of failed 3 resume with hibernate_mapping.diff 2/5
Screenshot of failed 3 resume with hibernate_mapping.diff 3/5
Screenshot of failed 3 resume with hibernate_mapping.diff 4/5
Screenshot of failed 3 resume with hibernate_mapping.diff 5/5
Screenshot of failed 4 resume with hibernate_mapping.diff 1/9
Screenshot of failed 4 resume with hibernate_mapping.diff 2/9
Screenshot of failed 4 resume with hibernate_mapping.diff 3/9
Screenshot of failed 4 resume with hibernate_mapping.diff 4/9
Screenshot of failed 4 resume with hibernate_mapping.diff 5/9
Screenshot of failed 4 resume with hibernate_mapping.diff 6/9
Screenshot of failed 4 resume with hibernate_mapping.diff 7/9
Screenshot of failed 4 resume with hibernate_mapping.diff 8/9
Screenshot of failed 4 resume with hibernate_mapping.diff 9/9
Disable the warn_on invoking when nonboot cpus are not ready for exception handling
dmesg after failed resume with forbid_rcu patch
dmesg after succesful resume with forbid_rcu patch
screenshot of critical moment during successful resume with forbid_rcu patch 1/2
screenshot of critical moment during successful resume with forbid_rcu patch 2/2
screenshot of critical moment during failed resume with forbid_rcu patch 1/2
screenshot of critical moment during failed resume with forbid_rcu patch 2/2
kernel config for 4.6.0-rc7
messages before reboot on failed resume
messages after freeze during 2nd hibernation
resume failure messages 4.7-rc2 with patch/9158227/
resume failure dmesg PM_TRACE 4.7-rc2 with patch/9158227/
hibernation failure messages 4.7-rc2 with patch/9158227/
hibernation failure PM_TRACE dmesg 4.7-rc2 with patch/9158227/
hibernation failure messages2 4.7-rc2 with patch/9158227/
hibernation failure PM_TRACE dmesg2 4.7-rc2 with patch/9158227/
resume failure4 messages 4.7-rc2 with patch/9158227/
resume failed photo 0: 4.7-rc3 + patch/9172981
resume failed photo 1: 4.7-rc3 + patch/9172981
resume failed dmesg 1 (PM_TRACE): 4.7-rc3 + patch/9172981
resume failed photo 2: 4.7-rc3 + patch/9172981
resume failed dmesg 2 (PM_TRACE): 4.7-rc3 + patch/9172981
hibernation failed photo 3: 4.7-rc3 + patch/9172981
hibernation failed dmesg 3 (PM_TRACE): 4.7-rc3 + patch/9172981
resume failed photo 4: 4.7-rc3 + patch/9172981
resume failed dmesg 4 (PM_TRACE): 4.7-rc3 + patch/9172981
2nd hibernate attempt failed messages 4.7-rc5+patch9202321
2nd hibernate attempt failed dmesg after reboot 4.7-rc5+patch9202321
hibernation snapshot option
we check what is after i8237A_resume during hibernation: it is ioapic_resume
track i8237
v2 debug i8237a
dmesg 4.7-rc5+patch_9202321+attachment_221771_v2_debug_i8237a
freeze photo 4.7-rc5+patch_9202321+attachment_221771_v2_debug_i8237a
debug - show lock trace when deadlock
dmesg after 1000 invocations of echo q > /proc/sysrq-trigger; 4.7-rc5+patch9202321+attachment221801
kernel config for 4.7-rc5+patch9202321+attachment221801
dma lock debug
2nd hibernation freeze photo
System.map for 4.7.0-rc5 + patch9202321 + attachment221841
test dma_lock, dump more lockup info
4.7-rc5 + patch/9202321 + attachment 221881
system freeze photo, 4.7-rc5 + patch/9202321 + attachment 221881
echo q > /proc/sysrq-trigger after 1rs hibernate-resume cycle.
echo q > /proc/sysrq-trigger after 1rs hibernate-resume cycle. next photo
echo q > /proc/sysrq-trigger after 1rs hibernate-resume cycle. I presses ctrl-alt-delete dozen times and I got this:
config used for latest recompile without ARCH_USE_QUEUED_SPINLOCKS
kernel config for 4.7-rc5+patch9202321+debug-qspinlock

Description Janek Kozicki 2014-08-13 09:40:57 UTC
Created attachment 146411 [details]
1. magic numbers 2. dmesg 3.lspci

I did full testing using instructions in both files Rafael suggested:
  Documentation/power/s2ram.txt
  Documentation/power/basic-pm-debugging.txt
So I used CONFIG_PM_TRACE_RTC=y.

I did about a dozen of hibernate/resume attempts, always with:
  /bin/sync
  echo 1 > /sys/power/pm_trace
  echo reboot > /sys/power/disk
  echo disk > /sys/power/state

then after a failed resume I recorded magic number(s) from realtime clock - see attachment.

The info from `cat /sys/power/pm_trace_dev_match` varied between:
  pci
  acpi
  acpi

and just this:
  pci_bus

In case if you need some other info, a full dmesg from one of
reboots after a failed resume is in the attachment.


Now I tried to identify the problem. There are two magic numbers
recorded in realtime clock:

pci     0000:ff:0d.1: hash matches
pci_bus 0000:ff: hash matches

The 0000:ff: is strange, like no particular piece of hardware driver is wrong, but whole bus?

The 0000:ff:0d.1 from lspci is a following thing:

ff:0d.1 System peripheral [0880]: Intel Corporation Ivytown Unicast Registers [8086:0ee3] (rev 04)
        Subsystem: Super Micro Computer Inc Device [15d9:0628]
        Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-

Also in the attachment is my full lspci output.

So that's the best I could do so far. Currently I am waiting for more
instructions from you on how to proceed.

please help. If there's anything I could do for debugging, please let me know.

PS: I am pretty sure that my hardware is not broken.
PSS: this bug is also reported on http://lkml.iu.edu//hypermail/linux/kernel/1408.1/00372.html

best regards
Janek Kozicki
Comment 1 Janek Kozicki 2014-08-14 21:15:51 UTC
Can you tell me where in the sources should I look into?
Comment 2 Zhang Rui 2014-08-20 01:33:12 UTC
is there any earlier kernel release that suspend-to-disk works for you?
can you please try 3.17-rc1 kernel?
Comment 3 Janek Kozicki 2014-08-20 11:26:03 UTC
From earlier kernels I have tried following:

3.2    -  20% failures
3.12   - 100% failures
3.13   -  40% failures
3.16   -  40% failures

Since 3.2 was working best for me, I used it for months, occassionally trying the latest kernel, then downgrading back because things were getting worse.

But I don't know about 3.3 to 3.11 - there were serious problems with dkms and debian to get it to work. So I couldn't compile them.

I am compiling 3.17-rc1 now, will report shortly.
Comment 4 Janek Kozicki 2014-08-20 14:55:00 UTC
Alright, I did 12 tries with 3.17-rc1, I noted 66% failures, details follow.

The tries 1-6 used hibernation method uswsusp while tries 7-12 used echo disk > /sys/power/state

Try 1: (first attempt)
1.1: hibernate command:
       touch /forcefsck
       /bin/sync
       echo 1 > /sys/power/pm_trace
       /usr/sbin/hibernate  # from uswsusp
1.2: result: SUCCESS
1.3: notes: an oops message was observed after resume:
       Nov 24 13:22:12 absurd kernel: [   96.454394] PM: restore of devices complete after 968.078 msecs
       Nov 24 13:22:12 absurd kernel: [   96.627999] Restarting kernel threads ... done.
       Nov 24 13:22:12 absurd kernel: [   96.628726] PM: Basic memory bitmaps freed
       Nov 24 13:22:12 absurd kernel: [   96.628785] Restarting tasks ... done.
       Nov 24 13:22:12 absurd kernel: [   96.632911] BUG: unable to handle kernel NULL pointer dereference at           (null)
       Nov 24 13:22:12 absurd kernel: [   96.633045] IP: [<ffffffff810ab51c>] swsusp_free+0x1c/0x180
       Nov 24 13:22:12 absurd kernel: [   96.633143] PGD 846de2067 PUD 84c5cb067 PMD 0 
       Nov 24 13:22:12 absurd kernel: [   96.633308] Oops: 0000 [#1] SMP 
       Nov 24 13:22:12 absurd kernel: [   96.633434] Modules linked in: parport_pc ppdev lp parport bnep rfcomm bluetooth rfkill binfmt_misc nfsd auth_rpcgss oid_registry nfs_acl nfs fscache lockd sunrpc loop raid1 hid_logitech_dj snd_hda_codec_hdmi nvidia(PO) x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm joydev usb_storage hid_generic crct10dif_pclmul snd_hda_codec_ca0132 iTCO_wdt crc32_pclmul iTCO_vendor_support usbhid ghash_clmulni_intel mxm_wmi evdev hid aesni_intel snd_hda_intel aes_x86_64 ablk_helper cryptd glue_helper lrw snd_hda_controller gf128mul snd_hda_codec snd_hwdep pcspkr snd_pcm sb_edac edac_core snd_seq snd_seq_device snd_timer i2c_i801 lpc_ich mei_me mfd_core snd mei ioatdma soundcore ipmi_si ipmi_msghandler wmi tpm_tis tpm processor thermal_sys button ext4 mbcache crc16 jbd2 sr_mod cdrom sg sd_mod crc_t10dif crct10dif_common dm_mod md_mod crc32c_intel isci ahci libsas libahci scsi_transport_sas libata scsi_mod usbcore igb usb_common i2c_algo_bit dca ptp pps_core [last unloaded: ehci_hc
       Nov 24 13:22:12 absurd kernel: d]
       Nov 24 13:22:12 absurd kernel: [   96.637863] CPU: 13 PID: 6417 Comm: s2disk Tainted: P           O   3.17.0-rc1-vanilla.1 #1
       Nov 24 13:22:12 absurd kernel: [   96.637916] Hardware name: Supermicro X9DR3-F/X9DR3-F, BIOS 3.0a 07/31/2013
       Nov 24 13:22:12 absurd kernel: [   96.637967] task: ffff88084fdf2d20 ti: ffff880828cd4000 task.ti: ffff880828cd4000
       Nov 24 13:22:12 absurd kernel: [   96.638020] RIP: 0010:[<ffffffff810ab51c>]  [<ffffffff810ab51c>] swsusp_free+0x1c/0x180
       Nov 24 13:22:12 absurd kernel: [   96.638124] RSP: 0018:ffff880828cd7e88  EFLAGS: 00010246
       Nov 24 13:22:12 absurd kernel: [   96.638183] RAX: 0000000000000000 RBX: ffff88105522de00 RCX: 0000000000000001
       Nov 24 13:22:12 absurd kernel: [   96.638245] RDX: ffff88105522de10 RSI: ffff88105522de00 RDI: 0000000000000000
       Nov 24 13:22:12 absurd kernel: [   96.638307] RBP: 0000000000000010 R08: 0000000000000000 R09: 0000000000000000
       Nov 24 13:22:12 absurd kernel: [   96.638368] R10: ffff88105522de10 R11: 0000000000000246 R12: ffffea0000000000
       Nov 24 13:22:12 absurd kernel: [   96.638431] R13: ffff880857315be0 R14: ffff881057d1ac18 R15: ffff88105767f060
       Nov 24 13:22:12 absurd kernel: [   96.638493] FS:  00007f67d3898700(0000) GS:ffff88107fca0000(0000) knlGS:0000000000000000
       Nov 24 13:22:12 absurd kernel: [   96.638567] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       Nov 24 13:22:12 absurd kernel: [   96.638626] CR2: 0000000000000000 CR3: 0000000850b16000 CR4: 00000000001407e0
       Nov 24 13:22:12 absurd kernel: [   96.638689] Stack:
       Nov 24 13:22:12 absurd kernel: [   96.638745]  ffff88105522de00 0000000000000010 ffff88105767f060 ffffffff810afc35
       Nov 24 13:22:12 absurd kernel: [   96.638976]  ffff88105522de00 ffffffff811a639b 00007fff1d865990 ffff88105522de10
       Nov 24 13:22:12 absurd kernel: [   96.639206]  ffff880856910800 0000000000000000 ffffffff81a088f0 ffff88084fdf3370
       Nov 24 13:22:12 absurd kernel: [   96.639436] Call Trace:
       Nov 24 13:22:12 absurd kernel: [   96.639496]  [<ffffffff810afc35>] ? snapshot_release+0x25/0xb0
       Nov 24 13:22:12 absurd kernel: [   96.639561]  [<ffffffff811a639b>] ? __fput+0xdb/0x200
       Nov 24 13:22:12 absurd kernel: [   96.639627]  [<ffffffff81082e67>] ? task_work_run+0x97/0xd0
       Nov 24 13:22:12 absurd kernel: [   96.639692]  [<ffffffff815120fd>] ? __schedule+0x2cd/0x770
       Nov 24 13:22:12 absurd kernel: [   96.639758]  [<ffffffff81013ad9>] ? do_notify_resume+0x89/0xc0
       Nov 24 13:22:12 absurd kernel: [   96.639824]  [<ffffffff815165a2>] ? int_signal+0x12/0x17
       Nov 24 13:22:12 absurd kernel: [   96.639883] Code: 01 75 ec 01 c0 c3 66 0f 1f 84 00 00 00 00 00 41 54 48 8b 05 e7 09 96 00 49 bc 00 00 00 00 00 ea ff ff 48 8b 3d ce 09 96 00 55 53 <48> 8b 10 48 89 50 18 48 8b 52 20 48 c7 40 28 00 00 00 00 c7 40 
       Nov 24 13:22:12 absurd kernel: [   96.642553] RIP  [<ffffffff810ab51c>] swsusp_free+0x1c/0x180
       Nov 24 13:22:12 absurd kernel: [   96.642653]  RSP <ffff880828cd7e88>
       Nov 24 13:22:12 absurd kernel: [   96.642710] CR2: 0000000000000000
       Nov 24 13:22:12 absurd kernel: [   96.642776] ---[ end trace 5b06cd867a37ec6b ]---


Try 2: (second attempt - continuing from previous resume)
1.1: hibernate command:
       touch /forcefsck
       /bin/sync
       echo 1 > /sys/power/pm_trace
       /usr/sbin/hibernate  # from uswsusp
1.2: result: FAILURE
1.3: notes:
       - freeze during hibernation process, photo: http://www.pg.gda.pl/~jkozicki/debug/fail2.jpg
       - sysrq-sub combo doesn't work.
       - Magic number after cold reboot:
       [   79.241815] input: Winbond Electronics Corp Hermon USB hidmouse Device as /devices/pci0000:00/0000:00:1a.0/usb1/1-1/1-1.6/1-1.6:1.1/0003:0557:2221.000A/inp[    2.252234] NET: Registered protocol family 17
       [    2.252247] mpls_gso: MPLS GSO support
       [    2.255029] registered taskstats version 1
       [    2.256015]   Magic number: 0:36:482
       [    2.256276] rtc_cmos 00:05: setting system clock to 2040-11-24 12:27:56 UTC (2237372876)
       [    2.256399] PM: Checking hibernation image partition /dev/disk/by-uuid/04e6765a-fe0c-44e7-9165-219563f6171e
       [    2.256403] PM: Hibernation image not present or could not be loaded.
       [    2.257767] Freeing unused kernel memory: 1012K (ffffffff818c0000 - ffffffff819bd000)
       [    2.257770] Write protecting the kernel read-only data: 8192k
       [    2.260382] Freeing unused kernel memory: 908K (ffff88000151d000 - ffff880001600000)
       [    2.261118] Freeing unused kernel memory: 272K (ffff8800017bc000 - ffff880001800000)
       [    2.307397] udevd[227]: starting version 175


Try 3: (first attempt)
1.1: hibernate command:
       touch /forcefsck
       /bin/sync
       echo 1 > /sys/power/pm_trace
       /usr/sbin/hibernate  # from uswsusp
1.2: result: SUCCESS
1.3: notes: an oops message was observed after resume:
       Aug 20 14:06:38 absurd kernel: [   74.346888] Restarting tasks ... done.
       Aug 20 14:06:38 absurd kernel: [   74.348080] BUG: unable to handle kernel NULL pointer dereference at           (null)
       Aug 20 14:06:38 absurd kernel: [   74.348086] IP: [<ffffffff810ab51c>] swsusp_free+0x1c/0x180
       Aug 20 14:06:38 absurd kernel: [   74.348087] PGD 1050233067 PUD 1055594067 PMD 0 
       Aug 20 14:06:38 absurd kernel: [   74.348088] Oops: 0000 [#1] SMP 
       Aug 20 14:06:38 absurd kernel: [   74.348118] Modules linked in: parport_pc ppdev lp parport bnep rfcomm bluetooth rfkill binfmt_misc nfsd auth_rpcgss oid_registry nfs_acl nfs fscache lockd sunrpc loop snd_hda_codec_hdmi x86_pkg_temp_thermal intel_powerclamp coretemp snd_hda_codec_ca0132 kvm_intel hid_logitech_dj kvm nvidia(PO) iTCO_wdt snd_hda_intel iTCO_vendor_support crct10dif_pclmul snd_hda_controller crc32_pclmul ghash_clmulni_intel mxm_wmi aesni_intel aes_x86_64 evdev ablk_helper joydev cryptd glue_helper lrw snd_hda_codec gf128mul pcspkr snd_hwdep snd_pcm sb_edac edac_core snd_seq i2c_i801 snd_seq_device mei_me snd_timer lpc_ich mei mfd_core snd ioatdma soundcore ipmi_si ipmi_msghandler tpm_tis wmi processor tpm thermal_sys button ext4 mbcache crc16 jbd2 dm_mod usb_storage raid1 hid_generic usbhid hid md_mod sg sd_mod sr_mod crc_t10dif cdrom crct10dif_common crc32c_intel isci ahci libsas libahci libata scsi_transport_sas usbcore scsi_mod usb_common igb i2c_algo_bit dca ptp pps_core [last unloaded: ehci_hc
       Aug 20 14:06:38 absurd kernel: d]
       Aug 20 14:06:38 absurd kernel: [   74.348127] CPU: 0 PID: 6496 Comm: s2disk Tainted: P           O   3.17.0-rc1-vanilla.1 #1
       Aug 20 14:06:38 absurd kernel: [   74.348127] Hardware name: Supermicro X9DR3-F/X9DR3-F, BIOS 3.0a 07/31/2013
       Aug 20 14:06:38 absurd kernel: [   74.348128] task: ffff8808553680d0 ti: ffff880855464000 task.ti: ffff880855464000
       Aug 20 14:06:38 absurd kernel: [   74.348130] RIP: 0010:[<ffffffff810ab51c>]  [<ffffffff810ab51c>] swsusp_free+0x1c/0x180
       Aug 20 14:06:38 absurd kernel: [   74.348131] RSP: 0018:ffff880855467e88  EFLAGS: 00010246
       Aug 20 14:06:38 absurd kernel: [   74.348132] RAX: 0000000000000000 RBX: ffff880857ca9500 RCX: 0000000000000001
       Aug 20 14:06:38 absurd kernel: [   74.348132] RDX: ffff880857ca9510 RSI: ffff880857ca9500 RDI: 0000000000000000
       Aug 20 14:06:38 absurd kernel: [   74.348133] RBP: 0000000000000010 R08: 0000000000000000 R09: 0000000000000000
       Aug 20 14:06:38 absurd kernel: [   74.348133] R10: ffff880857ca9510 R11: 0000000000000246 R12: ffffea0000000000
       Aug 20 14:06:38 absurd kernel: [   74.348134] R13: ffff8808556a5be0 R14: ffff881057cfa258 R15: ffff88105769e060
       Aug 20 14:06:38 absurd kernel: [   74.348135] FS:  00007f23e6082700(0000) GS:ffff88087fc00000(0000) knlGS:0000000000000000
       Aug 20 14:06:38 absurd kernel: [   74.348135] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       Aug 20 14:06:38 absurd kernel: [   74.348136] CR2: 0000000000000000 CR3: 0000001051d50000 CR4: 00000000001407f0
       Aug 20 14:06:38 absurd kernel: [   74.348136] Stack:
       Aug 20 14:06:38 absurd kernel: [   74.348137]  ffff880857ca9500 0000000000000010 ffff88105769e060 ffffffff810afc35
       Aug 20 14:06:38 absurd kernel: [   74.348138]  ffff880857ca9500 ffffffff811a639b 00007fff75200760 ffff880857ca9510
       Aug 20 14:06:38 absurd kernel: [   74.348139]  ffff8810504c7800 0000000000000000 ffffffff81a088f0 ffff880855368720
       Aug 20 14:06:38 absurd kernel: [   74.348140] Call Trace:
       Aug 20 14:06:38 absurd kernel: [   74.348143]  [<ffffffff810afc35>] ? snapshot_release+0x25/0xb0
       Aug 20 14:06:38 absurd kernel: [   74.348145]  [<ffffffff811a639b>] ? __fput+0xdb/0x200
       Aug 20 14:06:38 absurd kernel: [   74.348149]  [<ffffffff81082e67>] ? task_work_run+0x97/0xd0
       Aug 20 14:06:38 absurd kernel: [   74.348151]  [<ffffffff815120fd>] ? __schedule+0x2cd/0x770
       Aug 20 14:06:38 absurd kernel: [   74.348155]  [<ffffffff81013ad9>] ? do_notify_resume+0x89/0xc0
       Aug 20 14:06:38 absurd kernel: [   74.348157]  [<ffffffff815165a2>] ? int_signal+0x12/0x17
       Aug 20 14:06:38 absurd kernel: [   74.348167] Code: 01 75 ec 01 c0 c3 66 0f 1f 84 00 00 00 00 00 41 54 48 8b 05 e7 09 96 00 49 bc 00 00 00 00 00 ea ff ff 48 8b 3d ce 09 96 00 55 53 <48> 8b 10 48 89 50 18 48 8b 52 20 48 c7 40 28 00 00 00 00 c7 40 
       Aug 20 14:06:38 absurd kernel: [   74.348169] RIP  [<ffffffff810ab51c>] swsusp_free+0x1c/0x180
       Aug 20 14:06:38 absurd kernel: [   74.348169]  RSP <ffff880855467e88>
       Aug 20 14:06:38 absurd kernel: [   74.348170] CR2: 0000000000000000
       Aug 20 14:06:38 absurd kernel: [   74.348172] ---[ end trace c578a0aa8bdfaef5 ]---
       Aug 20 14:06:38 absurd kernel: [   78.133088] igb 0000:02:00.0 eth0: igb: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
       Aug 20 14:06:38 absurd kernel: [   78.237831] ata5: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
       Aug 20 14:06:38 absurd kernel: [   78.237942] ata6: SATA link up 3.0 Gbps (SStatus 123 SControl 300)

Try 4: (second attempt - continuing from previous resume)
1.1: hibernate command:
       touch /forcefsck
       /bin/sync
       echo 1 > /sys/power/pm_trace
       /usr/sbin/hibernate  # from uswsusp
1.2: result: FAILURE
1.3: notes:
       - freeze during hibernation process, photo: http://www.pg.gda.pl/~jkozicki/debug/fail4.jpg
       - sysrq-sub combo doesn't work.
       - Magic number after cold reboot:
       [    2.258946] TCP: cubic registered
       [    2.259494] NET: Registered protocol family 10
       [    2.260395] mip6: Mobile IPv6
       [    2.260402] NET: Registered protocol family 17
       [    2.260414] mpls_gso: MPLS GSO support
       [    2.263057] registered taskstats version 1
       [    2.264765]   Magic number: 0:483:431
       [    2.265089] rtc_cmos 00:05: setting system clock to 2040-11-24 12:25:17 UTC (2237372717)
       [    2.265138] PM: Checking hibernation image partition /dev/disk/by-uuid/04e6765a-fe0c-44e7-9165-219563f6171e
       [    2.265141] PM: Hibernation image not present or could not be loaded.
       [    2.266502] Freeing unused kernel memory: 1012K (ffffffff818c0000 - ffffffff819bd000)
       [    2.266505] Write protecting the kernel read-only data: 8192k
       [    2.268416] Freeing unused kernel memory: 908K (ffff88000151d000 - ffff880001600000)
       [    2.269346] Freeing unused kernel memory: 272K (ffff8800017bc000 - ffff880001800000)
       [    2.367566] udevd[227]: starting version 175
       [    2.634517] pps_core: LinuxPPS API ver. 1 registered
       [    2.634520] pps_core: Software ver. 5.3.6 - Copyright 2005-2007 Rodolfo Giometti <giometti@linux.it>
       [    2.634992] PTP clock support registered
       [    2.635302] dca service started, version 1.12.1


Try 5: (first attempt)
1.1: hibernate command:
       touch /forcefsck
       /bin/sync
       echo 1 > /sys/power/pm_trace
       /usr/sbin/hibernate  # from uswsusp
1.2: result: SUCCESS
1.3: notes: an oops message was observed after resume:
       Aug 20 14:14:00 absurd kernel: [  121.246668] sd 5:0:0:0: [sdc] Starting disk
       Aug 20 14:14:00 absurd kernel: [  121.247069] sd 6:0:0:0: [sdd] Starting disk
       Aug 20 14:14:00 absurd kernel: [  121.261848] PM: restore of devices complete after 963.529 msecs
       Aug 20 14:14:00 absurd kernel: [  121.435668] Restarting kernel threads ... done.
       Aug 20 14:14:00 absurd kernel: [  121.436507] PM: Basic memory bitmaps freed
       Aug 20 14:14:00 absurd kernel: [  121.436570] Restarting tasks ... done.
       Aug 20 14:14:00 absurd kernel: [  121.440762] BUG: unable to handle kernel NULL pointer dereference at           (null)
       Aug 20 14:14:00 absurd kernel: [  121.440896] IP: [<ffffffff810ab51c>] swsusp_free+0x1c/0x180
       Aug 20 14:14:00 absurd kernel: [  121.440991] PGD 104e3c3067 PUD 1051053067 PMD 0 
       Aug 20 14:14:00 absurd kernel: [  121.441156] Oops: 0000 [#1] SMP 
       Aug 20 14:14:00 absurd kernel: [  121.441283] Modules linked in: parport_pc ppdev lp parport bnep rfcomm bluetooth rfkill binfmt_misc nfsd auth_rpcgss oid_registry nfs_acl nfs fscache lockd sunrpc loop snd_hda_codec_hdmi nvidia(PO) x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel snd_hda_codec_ca0132 hid_logitech_dj kvm snd_hda_intel snd_hda_controller iTCO_wdt crct10dif_pclmul snd_hda_codec crc32_pclmul iTCO_vendor_support mxm_wmi ghash_clmulni_intel evdev aesni_intel joydev aes_x86_64 ablk_helper cryptd glue_helper lrw gf128mul pcspkr snd_hwdep snd_pcm sb_edac edac_core i2c_i801 snd_seq snd_seq_device snd_timer mei_me lpc_ich snd mfd_core mei soundcore ioatdma tpm_tis tpm ipmi_si ipmi_msghandler wmi processor thermal_sys button ext4 mbcache crc16 jbd2 dm_mod usb_storage raid1 hid_generic usbhid hid md_mod sg sd_mod sr_mod cdrom crc_t10dif crct10dif_common crc32c_intel isci ahci libsas libahci libata scsi_transport_sas usbcore igb scsi_mod usb_common i2c_algo_bit dca ptp pps_core [last unloaded: ehci_hc
       Aug 20 14:14:00 absurd kernel: d]
       Aug 20 14:14:00 absurd kernel: [  121.445711] CPU: 9 PID: 6406 Comm: s2disk Tainted: P           O   3.17.0-rc1-vanilla.1 #1
       Aug 20 14:14:00 absurd kernel: [  121.445765] Hardware name: Supermicro X9DR3-F/X9DR3-F, BIOS 3.0a 07/31/2013
       Aug 20 14:14:00 absurd kernel: [  121.445817] task: ffff8808578e6920 ti: ffff88084ffd0000 task.ti: ffff88084ffd0000
       Aug 20 14:14:00 absurd kernel: [  121.445869] RIP: 0010:[<ffffffff810ab51c>]  [<ffffffff810ab51c>] swsusp_free+0x1c/0x180
       Aug 20 14:14:00 absurd kernel: [  121.445974] RSP: 0018:ffff88084ffd3e88  EFLAGS: 00010246
       Aug 20 14:14:00 absurd kernel: [  121.446033] RAX: 0000000000000000 RBX: ffff881054f7bb00 RCX: 0000000000000001
       Aug 20 14:14:00 absurd kernel: [  121.446095] RDX: ffff881054f7bb10 RSI: ffff881054f7bb00 RDI: 0000000000000000
       Aug 20 14:14:00 absurd kernel: [  121.446157] RBP: 0000000000000010 R08: 0000000000000000 R09: 0000000000000000
       Aug 20 14:14:00 absurd kernel: [  121.446218] R10: ffff881054f7bb10 R11: 0000000000000246 R12: ffffea0000000000
       Aug 20 14:14:00 absurd kernel: [  121.446279] R13: ffff88105825b920 R14: ffff881057d1ac18 R15: ffff88105767f060
       Aug 20 14:14:00 absurd kernel: [  121.446341] FS:  00007f11f5d15700(0000) GS:ffff88107fc20000(0000) knlGS:0000000000000000
       Aug 20 14:14:00 absurd kernel: [  121.446415] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       Aug 20 14:14:00 absurd kernel: [  121.446474] CR2: 0000000000000000 CR3: 0000001050f52000 CR4: 00000000001407e0
       Aug 20 14:14:00 absurd kernel: [  121.446537] Stack:
       Aug 20 14:14:00 absurd kernel: [  121.446593]  ffff881054f7bb00 0000000000000010 ffff88105767f060 ffffffff810afc35
       Aug 20 14:14:00 absurd kernel: [  121.446824]  ffff881054f7bb00 ffffffff811a639b 00007fffecd777f0 ffff881054f7bb10
       Aug 20 14:14:00 absurd kernel: [  121.447055]  ffff88105431bb80 0000000000000000 ffffffff81a088f0 ffff8808578e6f70
       Aug 20 14:14:00 absurd kernel: [  121.447288] Call Trace:
       Aug 20 14:14:00 absurd kernel: [  121.447347]  [<ffffffff810afc35>] ? snapshot_release+0x25/0xb0
       Aug 20 14:14:00 absurd kernel: [  121.447413]  [<ffffffff811a639b>] ? __fput+0xdb/0x200
       Aug 20 14:14:00 absurd kernel: [  121.447480]  [<ffffffff81082e67>] ? task_work_run+0x97/0xd0
       Aug 20 14:14:00 absurd kernel: [  121.447545]  [<ffffffff815120fd>] ? __schedule+0x2cd/0x770
       Aug 20 14:14:00 absurd kernel: [  121.447611]  [<ffffffff81013ad9>] ? do_notify_resume+0x89/0xc0
       Aug 20 14:14:00 absurd kernel: [  121.447677]  [<ffffffff815165a2>] ? int_signal+0x12/0x17
       Aug 20 14:14:00 absurd kernel: [  121.447736] Code: 01 75 ec 01 c0 c3 66 0f 1f 84 00 00 00 00 00 41 54 48 8b 05 e7 09 96 00 49 bc 00 00 00 00 00 ea ff ff 48 8b 3d ce 09 96 00 55 53 <48> 8b 10 48 89 50 18 48 8b 52 20 48 c7 40 28 00 00 00 00 c7 40 
       Aug 20 14:14:00 absurd kernel: [  121.450406] RIP  [<ffffffff810ab51c>] swsusp_free+0x1c/0x180
       Aug 20 14:14:00 absurd kernel: [  121.450506]  RSP <ffff88084ffd3e88>
       Aug 20 14:14:00 absurd kernel: [  121.450562] CR2: 0000000000000000
       Aug 20 14:14:00 absurd kernel: [  121.450627] ---[ end trace 8ddb5d09ec800dd3 ]---
       Aug 20 14:14:00 absurd kernel: [  121.605553] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
       Aug 20 14:14:00 absurd kernel: [  121.605657] ata3: SATA link down (SStatus 0 SControl 300)
       Aug 20 14:14:00 absurd kernel: [  121.605738] ata4: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
       Aug 20 14:14:00 absurd kernel: [  121.605821] ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
       Aug 20 14:14:00 absurd kernel: [  121.605906] ata5: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
       Aug 20 14:14:00 absurd kernel: [  121.606327] ata1.00: configured for UDMA/133
       Aug 20 14:14:00 absurd kernel: [  121.608668] ata4.00: configured for UDMA/100


Try 6: (second attempt - continuing from previous resume)
1.1: hibernate command:
       touch /forcefsck
       /bin/sync
       echo 1 > /sys/power/pm_trace
       /usr/sbin/hibernate  # from uswsusp
1.2: result: FAILURE
1.3: notes:
       - freeze during hibernation process, photo: http://www.pg.gda.pl/~jkozicki/debug/fail6.jpg
       - sysrq-sub combo doesn't work.
       - Magic number after cold reboot:
       [    2.255604] TCP: cubic registered
       [    2.256179] NET: Registered protocol family 10
       [    2.257103] mip6: Mobile IPv6
       [    2.257111] NET: Registered protocol family 17
       [    2.257122] mpls_gso: MPLS GSO support
       [    2.259864] registered taskstats version 1
       [    2.261393]   Magic number: 0:930:380
       [    2.261397]   hash matches drivers/base/power/main.c:814
       [    2.261442] pci_bus 0000:ff: hash matches
       [    2.261720] rtc_cmos 00:05: setting system clock to 2040-11-24 12:23:46 UTC (2237372626)
       [    2.261818] PM: Checking hibernation image partition /dev/disk/by-uuid/04e6765a-fe0c-44e7-9165-219563f6171e
       [    2.261822] PM: Hibernation image not present or could not be loaded.
       [    2.263129] Freeing unused kernel memory: 1012K (ffffffff818c0000 - ffffffff819bd000)
       [    2.263131] Write protecting the kernel read-only data: 8192k
       [    2.266239] Freeing unused kernel memory: 908K (ffff88000151d000 - ffff880001600000)
       [    2.267175] Freeing unused kernel memory: 272K (ffff8800017bc000 - ffff880001800000)


Try 7: (first attempt)
1.1: hibernate command:
       touch /forcefsck
       /bin/sync
       echo 1 > /sys/power/pm_trace
       echo reboot > /sys/power/disk
       echo disk > /sys/power/state
1.2: result: FAILURE
1.3: notes:
       - reboot at last stage of resume (similar to my problems with 3.16)
       - Magic number after cold reboot:
       [    2.253587] mpls_gso: MPLS GSO support
       [    2.256238] registered taskstats version 1
       [    2.257930]   Magic number: 0:930:740
       [    2.257933]   hash matches drivers/base/power/main.c:814
       [    2.257981] pci 0000:ff:0d.2: hash matches
       [    2.258016] acpi LNXCPU:69: hash matches
       [    2.258044] acpi LNXCPU:3c: hash matches
       [    2.258271] rtc_cmos 00:05: setting system clock to 2060-06-22 15:43:40 UTC (2855144620)
       [    2.258388] PM: Checking hibernation image partition /dev/disk/by-uuid/04e6765a-fe0c-44e7-9165-219563f6171e
       [    2.258393] PM: Hibernation image not present or could not be loaded.
       [    2.259907] Freeing unused kernel memory: 1012K (ffffffff818c0000 - ffffffff819bd000)
       [    2.259910] Write protecting the kernel read-only data: 8192k

Try 8: (first attempt)
1.1: hibernate command:
       touch /forcefsck
       /bin/sync
       echo 1 > /sys/power/pm_trace
       echo reboot > /sys/power/disk
       echo disk > /sys/power/state
1.2: result: SUCCESS
1.3: notes: I didn't find anything curious in dmesg or /var/log/syslog

Try 9: (second attempt - continuing from previous resume)
1.1: hibernate command:
       touch /forcefsck
       /bin/sync
       echo 1 > /sys/power/pm_trace
       echo reboot > /sys/power/disk
       echo disk > /sys/power/state
1.2: result: FAILURE
1.3: notes:
       - reboot at last stage of resume (similar to my problems with 3.16)
       - Magic number after cold reboot:
       [    2.254560] mip6: Mobile IPv6
       [    2.254567] NET: Registered protocol family 17
       [    2.254579] mpls_gso: MPLS GSO support
       [    2.257358] registered taskstats version 1
       [    2.259071]   Magic number: 0:930:740
       [    2.259074]   hash matches drivers/base/power/main.c:814
       [    2.259114] pci 0000:ff:0d.2: hash matches
       [    2.259141] acpi LNXCPU:69: hash matches
       [    2.259158] acpi LNXCPU:3c: hash matches
       [    2.259391] rtc_cmos 00:05: setting system clock to 2060-06-22 15:43:39 UTC (2855144619)
       [    2.259492] PM: Checking hibernation image partition /dev/disk/by-uuid/04e6765a-fe0c-44e7-9165-219563f6171e
       [    2.259496] PM: Hibernation image not present or could not be loaded.

Try 10: (first attempt)
1.1: hibernate command:
       touch /forcefsck
       /bin/sync
       echo 1 > /sys/power/pm_trace
       echo reboot > /sys/power/disk
       echo disk > /sys/power/state
1.2: result: FAILURE
1.3: notes:
       - reboot at last stage of resume (similar to my problems with 3.16)
       - Magic number after cold reboot:
       [    2.253086] mpls_gso: MPLS GSO support
       [    2.255827] registered taskstats version 1
       [    2.257519]   Magic number: 0:930:740
       [    2.257522]   hash matches drivers/base/power/main.c:814
       [    2.257562] pci 0000:ff:0d.2: hash matches
       [    2.257589] acpi LNXCPU:69: hash matches
       [    2.257606] acpi LNXCPU:3c: hash matches
       [    2.257847] rtc_cmos 00:05: setting system clock to 2060-06-22 15:43:41 UTC (2855144621)
       [    2.257947] PM: Checking hibernation image partition /dev/disk/by-uuid/04e6765a-fe0c-44e7-9165-219563f6171e

Try 11: (first attempt)
1.1: hibernate command:
       touch /forcefsck
       /bin/sync
       echo 1 > /sys/power/pm_trace
       echo reboot > /sys/power/disk
       echo disk > /sys/power/state
1.2: result: FAILURE
1.3: notes:
       - reboot at last stage of resume (similar to my problems with 3.16)
       - Magic number after cold reboot:
       [    2.254092] NET: Registered protocol family 17
       [    2.254104] mpls_gso: MPLS GSO support
       [    2.256838] registered taskstats version 1
       [    2.258506]   Magic number: 0:930:740
       [    2.258510]   hash matches drivers/base/power/main.c:814
       [    2.258546] pci 0000:ff:0d.2: hash matches
       [    2.258582] acpi LNXCPU:69: hash matches
       [    2.258599] acpi LNXCPU:3c: hash matches
       [    2.258854] rtc_cmos 00:05: setting system clock to 2060-06-22 15:43:46 UTC (2855144626)
       [    2.258959] PM: Checking hibernation image partition /dev/disk/by-uuid/04e6765a-fe0c-44e7-9165-219563f6171e
       [    2.258963] PM: Hibernation image not present or could not be loaded.
       [    2.260266] Freeing unused kernel memory: 1012K (ffffffff818c0000 - ffffffff819bd000)
       [    2.260268] Write protecting the kernel read-only data: 8192k

Try 12: (first attempt)
1.1: hibernate command:
       touch /forcefsck
       /bin/sync
       echo 1 > /sys/power/pm_trace
       echo reboot > /sys/power/disk
       echo disk > /sys/power/state
1.2: result: FAILURE
1.3: notes:
       - reboot at last stage of resume (similar to my problems with 3.16)
       - Magic number after cold reboot:
       [    2.254617] mpls_gso: MPLS GSO support
       [    2.257259] registered taskstats version 1
       [    2.258919]   Magic number: 0:930:740
       [    2.258923]   hash matches drivers/base/power/main.c:814
       [    2.258971] pci 0000:ff:0d.2: hash matches
       [    2.259006] acpi LNXCPU:69: hash matches
       [    2.259024] acpi LNXCPU:3c: hash matches
       [    2.259276] rtc_cmos 00:05: setting system clock to 2060-06-22 15:43:49 UTC (2855144629)
       [    2.259388] PM: Checking hibernation image partition /dev/disk/by-uuid/04e6765a-fe0c-44e7-9165-219563f6171e
       [    2.259392] PM: Hibernation image not present or could not be loaded.

please let me know what else should I do.

Here is full lspci output, so that you can identify Magic numbers with their respective drivers: http://www.pg.gda.pl/~jkozicki/debug/lspci.txt

best regards
Janek Kozicki
Comment 5 Janek Kozicki 2014-08-21 08:43:46 UTC
you can see that with uswsusp the second attempt always fails during hibernation.

And with `echo disk > /sys/power/state` it's random and fails on resume.
Comment 6 Janek Kozicki 2014-08-23 20:15:14 UTC
I am really willing to help in debugging this, please just tell me for example how could I find the offending file (or line) in kernel code.

Or maybe anything else, that I could do here?
Comment 7 Zhang Rui 2014-08-24 06:09:03 UTC
please attach the output of /etc/fstab
Comment 8 Janek Kozicki 2014-08-25 10:00:48 UTC
$ cat /etc/fstab | grep -v "#"

UUID=1b1cb5fb-b820-4281-badb-fe60995879cf /  ext4    noatime,commit=600,defaults,errors=remount-ro  0  1
/dev/mapper/lvr-home /home                   ext4    noatime,commit=600,defaults  0  2
tmpfs                /tmp                    tmpfs   defaults,noatime,mode=1777   0  0
/dev/mapper/lvr-usr  /usr                    ext4    noatime,commit=600,defaults  0  2
/dev/mapper/lvr-var  /var                    ext4    noatime,commit=600,defaults  0  2
UUID=04e6765a-fe0c-44e7-9165-219563f6171e none swap  sw                           0  0
/dev/sr0             /media/cdrom0           udf,iso9660 user,noauto              0  0

UUID=C83E5DEA3E5DD254 /mnt/win7_c  auto user,defaults,noauto 0 0
/dev/sdc1       /mnt/usb       auto     rw,user,noauto       0 0
/dev/sdc1       /mnt/sdc1      auto     rw,user,noauto       0 0
/dev/sdc2       /mnt/sdc2      auto     rw,user,noauto       0 0
/dev/sdd1       /mnt/sdd1      auto     rw,user,noauto       0 0
/dev/sde1       /mnt/sde1      auto     rw,user,noauto       0 0
/dev/sdf1       /mnt/sdf1      auto     rw,user,noauto       0 0
/dev/sdg1       /mnt/sdg1      auto     rw,user,noauto       0 0
/dev/sdc        /mnt/usb-flop  auto     rw,user,noauto       0 0
server:/server /server                  nfs  ro,hard,intr,user,noauto,noexec,vers=3 0  2
star:/stuff /stuff                      nfs  rw,hard,intr,user,noauto,noexec,vers=3 0  2
server:/server/cache /cache             nfs  rw,soft,intr,user,noauto,noexec,vers=3 0  2


###
---
nfs mounts were unmounted during my tests.
Comment 9 Janek Kozicki 2014-08-25 13:30:45 UTC
all in /mnt/ were unmounted too.

Those in /dev/mapper are LVM on mdadm raid1.

swap 04e6765a-fe0c-44e7-9165-219563f6171e is /dev/sda2 SSD (no lvm, no raid)

root / 1b1cb5fb-b820-4281-badb-fe60995879cf is /dev/sda1 SSD (no lvm, no raid)

/tmp is on tmpfs.
Comment 10 Janek Kozicki 2014-09-01 19:52:53 UTC
Hi,

I wish to help with this bug. Is there any other kind of info that I can provide? Or any kind of patch that you would like me to run/try to provide more debug output?

best regards
Janek Kozicki
Comment 11 Rafael J. Wysocki 2014-09-01 22:57:00 UTC
Does suspend-to-RAM (that is, "echo mem > /sys/power/state") work on this machine?
Comment 12 Janek Kozicki 2014-09-01 23:25:34 UTC
I never tried suspend to RAM and now I admit it's strange, I got this:

root@absurd:~# touch /forcefsck
root@absurd:~# /bin/sync
root@absurd:~# echo 1 > /sys/power/pm_trace
root@absurd:~# echo mem > /sys/power/state
bash: echo: write error: Invalid argument
root@absurd:~# whoami 
root
root@absurd:~# cat /sys/power/state
freeze standby disk


So I think I'd say that it doesn't work. Looks like it's not supported. Here's my full .config file:

 http://www.pg.gda.pl/~jkozicki/debug/config-3.16.0-vanilla.3+trtc

Also I have a question - how would I bring back my PC to life in case if suspend to RAM worked? Press the power button? Do I need to enable anything in BIOS for that to work?

best regards
Janek Kozicki
Comment 13 Janek Kozicki 2014-09-09 11:42:24 UTC
Is there any patch that you would like me to try, or anything else that I could do?
Comment 14 Janek Kozicki 2014-10-11 11:59:26 UTC
I tested kernel 3.17 extensively, here's my short report:

1. suspend to ram `echo mem > /sys/power/state` isn't working (but I only need suspend to disk, so it's ok)
2. `/usr/sbin/hibernate` yields 50% success rate (10 tries)
   5 success, 5 failures
   First cycle almost always works (failed only once).
   Second cycle almost always fails (worked only once).
   $ cat /sys/power/pm_trace_dev_match
         pci_bus
   The error is always the same as with `echo disk > /sys/power/state`
   in kernel 3.16.
   I never get the kind of error reported in 3.16 when using
   `/usr/sbin/hibernate`. So that's good news.
3. `echo disk > /sys/power/state` yields 20% success rate (10 tries)
   2 success, 8 failures
   Only 2 successes, in both cases it was the first cycle.
   $ cat /sys/power/pm_trace_dev_match
      pci
      acpi
      acpi
   The error reported seems similar to when using `/usr/sbin/hibernate` but
   pm_trace_dev_match gives different device matches.


----------- Detailed report:

Ad.1. suspend to ram `echo mem > /sys/power/state` isn't working
$ touch /forcefsck
$ /bin/sync
$ echo 1 > /sys/power/pm_trace
$ echo mem > /sys/power/state
bash: echo: write error: Invalid argument
$ cat /sys/power/state
freeze standby disk

Ad.2. try 1/10 (1st cycle): `/usr/sbin/hibernate`
* hibernate command:
       touch /forcefsck
       /bin/sync
       echo 1 > /sys/power/pm_trace
       /usr/sbin/hibernate  # from uswsusp
* result: FAILURE, same failure as with `echo disk > /sys/power/state` in kernel 3.16
* notes:
       - reboot at last stage of resume (similar to my problems with 3.16)
       $ cat /sys/power/pm_trace_dev_match
         pci_bus
       - Magic number after cold reboot:
[    2.261241] mip6: Mobile IPv6
[    2.261250] NET: Registered protocol family 17
[    2.261264] mpls_gso: MPLS GSO support
[    2.263938] registered taskstats version 1
[    2.265577]   Magic number: 0:930:380
[    2.265583]   hash matches drivers/base/power/main.c:814
[    2.265672] pci_bus 0000:ff: hash matches
[    2.265974] rtc_cmos 00:05: setting system clock to 2040-11-24 12:23:16 UTC (2237372596)
[    2.266118] PM: Checking hibernation image partition /dev/disk/by-uuid/04e6765a-fe0c-44e7-9165-219563f6171e
[    2.266121] PM: Hibernation image not present or could not be loaded.
[    2.267517] Freeing unused kernel memory: 1008K (ffffffff818bf000 - ffffffff819bb000)

Ad.2. try 2/10 (1st cycle): `/usr/sbin/hibernate`
* hibernate command:
       touch /forcefsck
       /bin/sync
       echo 1 > /sys/power/pm_trace
       /usr/sbin/hibernate  # from uswsusp
* result: SUCCESS
* notes: nothing suspicious in dmesg. Looks OK.

Ad.2. try 3/10 (2nd cycle): `/usr/sbin/hibernate`
* hibernate command:
       touch /forcefsck
       /bin/sync
       echo 1 > /sys/power/pm_trace
       /usr/sbin/hibernate  # from uswsusp
* result: FAILURE, same failure as with `echo disk > /sys/power/state` in kernel 3.16
* notes:
       - reboot at last stage of resume (similar to my problems with 3.16)
       $ cat /sys/power/pm_trace_dev_match
         pci_bus
       - Magic number after cold reboot:
[    2.245337] NET: Registered protocol family 17
[    2.245350] mpls_gso: MPLS GSO support
[    2.248109] registered taskstats version 1
[    2.249982]   Magic number: 0:930:380
[    2.249988]   hash matches drivers/base/power/main.c:814
[    2.250075] pci_bus 0000:ff: hash matches
[    2.250383] rtc_cmos 00:05: setting system clock to 2040-11-24 12:23:15 UTC (2237372595)
[    2.250529] PM: Checking hibernation image partition /dev/disk/by-uuid/04e6765a-fe0c-44e7-9165-219563f6171e
[    2.250533] PM: Hibernation image not present or could not be loaded.
[    2.251905] Freeing unused kernel memory: 1008K (ffffffff818bf000 - ffffffff819bb000)

Ad.2. try 4/10 (1st cycle): `/usr/sbin/hibernate`
* hibernate command:
       touch /forcefsck
       /bin/sync
       echo 1 > /sys/power/pm_trace
       /usr/sbin/hibernate  # from uswsusp
* result: FAILURE, same failure as with `echo disk > /sys/power/state` in kernel 3.16
* notes:
       - reboot at last stage of resume (similar to my problems with 3.16)
       $ cat /sys/power/pm_trace_dev_match
         pci_bus
       - Magic number after cold reboot:
[    2.252689] mip6: Mobile IPv6
[    2.252698] NET: Registered protocol family 17
[    2.252711] mpls_gso: MPLS GSO support
[    2.255346] registered taskstats version 1
[    2.257239]   Magic number: 0:930:380
[    2.257244]   hash matches drivers/base/power/main.c:814
[    2.257328] pci_bus 0000:ff: hash matches
[    2.257578] rtc_cmos 00:05: setting system clock to 2040-11-24 12:23:32 UTC (2237372612)
[    2.257720] PM: Checking hibernation image partition /dev/disk/by-uuid/04e6765a-fe0c-44e7-9165-219563f6171e
[    2.257724] PM: Hibernation image not present or could not be loaded.

Ad.2. try 5/10 (1st cycle): `/usr/sbin/hibernate`
* hibernate command:
       touch /forcefsck
       /bin/sync
       echo 1 > /sys/power/pm_trace
       /usr/sbin/hibernate  # from uswsusp
* result: SUCCESS
* notes: nothing suspicious in dmesg. Looks OK.

Ad.2. try 6/10 (2nd cycle): `/usr/sbin/hibernate`
* hibernate command:
       touch /forcefsck
       /bin/sync
       echo 1 > /sys/power/pm_trace
       /usr/sbin/hibernate  # from uswsusp
* result: FAILURE, same failure as with `echo disk > /sys/power/state` in kernel 3.16
* notes:
       - reboot at last stage of resume (similar to my problems with 3.16)
       $ cat /sys/power/pm_trace_dev_match
         pci_bus
       - Magic number after cold reboot:
[    2.257226] mip6: Mobile IPv6
[    2.257235] NET: Registered protocol family 17
[    2.257249] mpls_gso: MPLS GSO support
[    2.259897] registered taskstats version 1
[    2.261664]   Magic number: 0:930:380
[    2.261670]   hash matches drivers/base/power/main.c:814
[    2.261758] pci_bus 0000:ff: hash matches
[    2.262064] rtc_cmos 00:05: setting system clock to 2040-11-24 12:23:47 UTC (2237372627)
[    2.262207] PM: Checking hibernation image partition /dev/disk/by-uuid/04e6765a-fe0c-44e7-9165-219563f6171e
[    2.262211] PM: Hibernation image not present or could not be loaded.

Ad.2. try 7/10 (1st cycle): `/usr/sbin/hibernate`
* hibernate command:
       touch /forcefsck
       /bin/sync
       echo 1 > /sys/power/pm_trace
       /usr/sbin/hibernate  # from uswsusp
* result: SUCCESS
* notes: nothing suspicious in dmesg. Looks OK.

Ad.2. try 8/10 (2nd cycle): `/usr/sbin/hibernate`
* hibernate command:
       touch /forcefsck
       /bin/sync
       echo 1 > /sys/power/pm_trace
       /usr/sbin/hibernate  # from uswsusp
* result: SUCCESS.
* notes:  Attention: it's a 2nd cycle, and it worked. Since it's exceptional
       I am attaching full log http://www.pg.gda.pl/~jkozicki/debug/2nd_cycle.txt
       Maybe it will be useful.
       - Not sure if it means anything, but I did this:
       $ cat /sys/power/pm_trace_dev_match
       vc
       pci_bus

Ad.2. try 9/10 (3rd cycle): `/usr/sbin/hibernate`
* hibernate command:
       touch /forcefsck
       /bin/sync
       echo 1 > /sys/power/pm_trace
       /usr/sbin/hibernate  # from uswsusp
* result: FAILURE, same failure as with `echo disk > /sys/power/state` in kernel 3.16
* notes:
       - reboot at last stage of resume (similar to my problems with 3.16)
       $ cat /sys/power/pm_trace_dev_match
       # empty output
       - Magic number after cold reboot:
[    2.256557] mip6: Mobile IPv6
[    2.256566] NET: Registered protocol family 17
[    2.256579] mpls_gso: MPLS GSO support
[    2.259195] registered taskstats version 1
[    2.261003]   Magic number: 0:36:482
[    2.261350] rtc_cmos 00:05: setting system clock to 2040-11-24 12:29:58 UTC (2237372998)
[    2.261496] PM: Checking hibernation image partition /dev/disk/by-uuid/04e6765a-fe0c-44e7-9165-219563f6171e
[    2.261500] PM: Hibernation image not present or could not be loaded.

Ad.2. try 10/10 (1st cycle): `/usr/sbin/hibernate`
* hibernate command:
       touch /forcefsck
       /bin/sync
       echo 1 > /sys/power/pm_trace
       /usr/sbin/hibernate  # from uswsusp
* result: SUCCESS
* notes: nothing suspicious in dmesg. Looks OK.

----
Ad.3. try 1/10 (2nd cycle): `echo disk > /sys/power/state`
* hibernate command:
       touch /forcefsck
       /bin/sync
       echo 1 > /sys/power/pm_trace
       echo reboot > /sys/power/disk
       echo disk > /sys/power/state
* result: FAILURE, same failure as with `echo disk > /sys/power/state` in kernel 3.16
* notes:
       - reboot at last stage of resume (similar to my problems with 3.16)
       $ cat /sys/power/pm_trace_dev_match
       pci
       acpi
       acpi
       - Magic number after cold reboot:
[    2.256859] mip6: Mobile IPv6
[    2.256868] NET: Registered protocol family 17
[    2.256882] mpls_gso: MPLS GSO support
[    2.259492] registered taskstats version 1
[    2.261287]   Magic number: 0:930:740
[    2.261294]   hash matches drivers/base/power/main.c:814
[    2.261386] pci 0000:ff:0d.2: hash matches
[    2.261438] acpi LNXCPU:69: hash matches
[    2.261444] acpi LNXCPU:3c: hash matches
[    2.261702] rtc_cmos 00:05: setting system clock to 2060-06-22 15:43:43 UTC (2855144623)
[    2.261843] PM: Checking hibernation image partition /dev/disk/by-uuid/04e6765a-fe0c-44e7-9165-219563f6171e

Ad.3. try 2/10 (1st cycle): `echo disk > /sys/power/state`
* hibernate command:
       touch /forcefsck
       /bin/sync
       echo 1 > /sys/power/pm_trace
       echo reboot > /sys/power/disk
       echo disk > /sys/power/state
* result: FAILURE, same failure as with `echo disk > /sys/power/state` in kernel 3.16
* notes:
       - reboot at last stage of resume (similar to my problems with 3.16)
       $ cat /sys/power/pm_trace_dev_match
       pci
       acpi
       acpi
       - Magic number after cold reboot:
[    2.256498] mip6: Mobile IPv6
[    2.256507] NET: Registered protocol family 17
[    2.256521] mpls_gso: MPLS GSO support
[    2.259159] registered taskstats version 1
[    2.261031]   Magic number: 0:930:740
[    2.261037]   hash matches drivers/base/power/main.c:814
[    2.261122] pci 0000:ff:0d.2: hash matches
[    2.261173] acpi LNXCPU:69: hash matches
[    2.261179] acpi LNXCPU:3c: hash matches
[    2.261391] rtc_cmos 00:05: setting system clock to 2060-06-22 15:43:48 UTC (2855144628)
[    2.261534] PM: Checking hibernation image partition /dev/disk/by-uuid/04e6765a-fe0c-44e7-9165-219563f6171e
[    2.261538] PM: Hibernation image not present or could not be loaded.

Ad.3. try 3/10 (1st cycle): `echo disk > /sys/power/state`
* hibernate command:
       touch /forcefsck
       /bin/sync
       echo 1 > /sys/power/pm_trace
       echo reboot > /sys/power/disk
       echo disk > /sys/power/state
* result: FAILURE, same failure as with `echo disk > /sys/power/state` in kernel 3.16
* notes:
       - reboot at last stage of resume (similar to my problems with 3.16)
       $ cat /sys/power/pm_trace_dev_match
       pci
       acpi
       acpi
       - Magic number after cold reboot:
[    2.252296] mip6: Mobile IPv6
[    2.252305] NET: Registered protocol family 17
[    2.252319] mpls_gso: MPLS GSO support
[    2.254949] registered taskstats version 1
[    2.256793]   Magic number: 0:930:740
[    2.256799]   hash matches drivers/base/power/main.c:814
[    2.256883] pci 0000:ff:0d.2: hash matches
[    2.256936] acpi LNXCPU:69: hash matches
[    2.256942] acpi LNXCPU:3c: hash matches
[    2.257151] rtc_cmos 00:05: setting system clock to 2060-06-22 15:43:46 UTC (2855144626)
[    2.257301] PM: Checking hibernation image partition /dev/disk/by-uuid/04e6765a-fe0c-44e7-9165-219563f6171e

Ad.3. try 4/10 (1st cycle): `echo disk > /sys/power/state`
* hibernate command:
       touch /forcefsck
       /bin/sync
       echo 1 > /sys/power/pm_trace
       echo reboot > /sys/power/disk
       echo disk > /sys/power/state
* result: FAILURE, same failure as with `echo disk > /sys/power/state` in kernel 3.16
* notes:
       - reboot at last stage of resume (similar to my problems with 3.16)
       $ cat /sys/power/pm_trace_dev_match
       pci
       acpi
       acpi
       - Magic number after cold reboot:
[    2.254739] mip6: Mobile IPv6
[    2.254748] NET: Registered protocol family 17
[    2.254761] mpls_gso: MPLS GSO support
[    2.257410] registered taskstats version 1
[    2.259051]   Magic number: 0:930:740
[    2.259056]   hash matches drivers/base/power/main.c:814
[    2.259136] pci 0000:ff:0d.2: hash matches
[    2.259188] acpi LNXCPU:69: hash matches
[    2.259208] acpi LNXCPU:3c: hash matches
[    2.259439] rtc_cmos 00:05: setting system clock to 2060-06-22 15:43:42 UTC (2855144622)
[    2.259596] PM: Checking hibernation image partition /dev/disk/by-uuid/04e6765a-fe0c-44e7-9165-219563f6171e
[    2.259600] PM: Hibernation image not present or could not be loaded.

Ad.3. try 5/10 (1st cycle): `echo disk > /sys/power/state`
* hibernate command:
       touch /forcefsck
       /bin/sync
       echo 1 > /sys/power/pm_trace
       echo reboot > /sys/power/disk
       echo disk > /sys/power/state
* result: SUCCESS. 
* notes: Since it is exceptional that we have a succes, here's a full log, albeit
         I'm afraid there's nothing useful. http://www.pg.gda.pl/~jkozicki/debug/success_log2.txt

Ad.3. try 6/10 (2nd cycle): `echo disk > /sys/power/state`
* hibernate command:
       touch /forcefsck
       /bin/sync
       echo 1 > /sys/power/pm_trace
       echo reboot > /sys/power/disk
       echo disk > /sys/power/state
* result: FAILURE, same failure as with `echo disk > /sys/power/state` in kernel 3.16
* notes:
       - reboot at last stage of resume (similar to my problems with 3.16)
       $ cat /sys/power/pm_trace_dev_match
       pci
       acpi
       acpi
       - Magic number after cold reboot:
[    2.258429] mip6: Mobile IPv6
[    2.258438] NET: Registered protocol family 17
[    2.258450] mpls_gso: MPLS GSO support
[    2.261139] registered taskstats version 1
[    2.262826]   Magic number: 0:930:740
[    2.262832]   hash matches drivers/base/power/main.c:814
[    2.262911] pci 0000:ff:0d.2: hash matches
[    2.262965] acpi LNXCPU:69: hash matches
[    2.262984] acpi LNXCPU:3c: hash matches
[    2.263235] rtc_cmos 00:05: setting system clock to 2060-06-22 15:43:49 UTC (2855144629)
[    2.263380] PM: Checking hibernation image partition /dev/disk/by-uuid/04e6765a-fe0c-44e7-9165-219563f6171e
[    2.263384] PM: Hibernation image not present or could not be loaded.

Ad.3. try 7/10 (1st cycle): `echo disk > /sys/power/state`
* hibernate command:
       touch /forcefsck
       /bin/sync
       echo 1 > /sys/power/pm_trace
       echo reboot > /sys/power/disk
       echo disk > /sys/power/state
* result: SUCCESS. 
* notes: Couldn't find anything useful in syslog. Same as in try 5/10.

Ad.3. try 8/10 (2nd cycle): `echo disk > /sys/power/state`
* hibernate command:
       touch /forcefsck
       /bin/sync
       echo 1 > /sys/power/pm_trace
       echo reboot > /sys/power/disk
       echo disk > /sys/power/state
* result: FAILURE, same failure as with `echo disk > /sys/power/state` in kernel 3.16
* notes:
       - reboot at last stage of resume (similar to my problems with 3.16)
       $ cat /sys/power/pm_trace_dev_match
       pci
       acpi
       acpi
       - Magic number after cold reboot:
[    2.254200] mip6: Mobile IPv6
[    2.254208] NET: Registered protocol family 17
[    2.254221] mpls_gso: MPLS GSO support
[    2.256943] registered taskstats version 1
[    2.258860]   Magic number: 0:930:740
[    2.258866]   hash matches drivers/base/power/main.c:814
[    2.258946] pci 0000:ff:0d.2: hash matches
[    2.258998] acpi LNXCPU:69: hash matches
[    2.259018] acpi LNXCPU:3c: hash matches
[    2.259261] rtc_cmos 00:05: setting system clock to 2060-06-22 15:43:48 UTC (2855144628)
[    2.259408] PM: Checking hibernation image partition /dev/disk/by-uuid/04e6765a-fe0c-44e7-9165-219563f6171e
[    2.259412] PM: Hibernation image not present or could not be loaded.

Ad.3. try 9/10 (1st cycle): `echo disk > /sys/power/state`
* hibernate command:
       touch /forcefsck
       /bin/sync
       echo 1 > /sys/power/pm_trace
       echo reboot > /sys/power/disk
       echo disk > /sys/power/state
* result: FAILURE, same failure as with `echo disk > /sys/power/state` in kernel 3.16
* notes:
       - reboot at last stage of resume (similar to my problems with 3.16)
       $ cat /sys/power/pm_trace_dev_match
       pci
       acpi
       acpi
       - Magic number after cold reboot:
[    2.255673] mip6: Mobile IPv6
[    2.255682] NET: Registered protocol family 17
[    2.255696] mpls_gso: MPLS GSO support
[    2.258427] registered taskstats version 1
[    2.260349]   Magic number: 0:930:740
[    2.260355]   hash matches drivers/base/power/main.c:814
[    2.260436] pci 0000:ff:0d.2: hash matches
[    2.260487] acpi LNXCPU:69: hash matches
[    2.260507] acpi LNXCPU:3c: hash matches
[    2.260743] rtc_cmos 00:05: setting system clock to 2060-06-22 15:43:49 UTC (2855144629)
[    2.260889] PM: Checking hibernation image partition /dev/disk/by-uuid/04e6765a-fe0c-44e7-9165-219563f6171e
[    2.260893] PM: Hibernation image not present or could not be loaded.

Ad.3. try 10/10 (1st cycle): `echo disk > /sys/power/state`
* hibernate command:
       touch /forcefsck
       /bin/sync
       echo 1 > /sys/power/pm_trace
       echo reboot > /sys/power/disk
       echo disk > /sys/power/state
* result: FAILURE, same failure as with `echo disk > /sys/power/state` in kernel 3.16
* notes:
       - reboot at last stage of resume (similar to my problems with 3.16)
       $ cat /sys/power/pm_trace_dev_match
       pci
       acpi
       acpi
       - Magic number after cold reboot:
[    2.256766] mip6: Mobile IPv6
[    2.256775] NET: Registered protocol family 17
[    2.256788] mpls_gso: MPLS GSO support
[    2.259420] registered taskstats version 1
[    2.261311]   Magic number: 0:930:740
[    2.261316]   hash matches drivers/base/power/main.c:814
[    2.261401] pci 0000:ff:0d.2: hash matches
[    2.261452] acpi LNXCPU:69: hash matches
[    2.261458] acpi LNXCPU:3c: hash matches
[    2.261660] rtc_cmos 00:05: setting system clock to 2060-06-22 15:43:46 UTC (2855144626)
[    2.261808] PM: Checking hibernation image partition /dev/disk/by-uuid/04e6765a-fe0c-44e7-9165-219563f6171e
[    2.261812] PM: Hibernation image not present or could not be loaded.

Files linked:
http://www.pg.gda.pl/~jkozicki/debug/2nd_cycle.txt
http://www.pg.gda.pl/~jkozicki/debug/success_log2.txt

I hope that my debugging attepmts vere helpful. If there's anything more I can do, please just let me know.

best regards
Janek Kozicki
Comment 15 Janek Kozicki 2014-10-16 18:19:24 UTC
In my previous report the main good news was that with `/usr/sbin/hibernate` the first cycle almost always worked (only one failure).
Comment 16 Janek Kozicki 2015-01-05 09:54:20 UTC
I did 20 tests on kernel 3.18.1:

With hibernation method "/usr/sbin/hibernate" I had  4 successess and 6 failures (One time it was a 2nd in row success). The failures were following:

5 times it was a reboot on resume. The debug info:
  Magic number: 0:930:380  
  hash matches drivers/base/power/main.c:814 
  cat /sys/power/pm_trace_dev_match
  → pci_bus

1 time it was a freeze right before bringing up X server. The debug info:
  Magic number: 0:483:431   (no hash matches)
  cat /sys/power/pm_trace_dev_match
  → thermal

Then with hibernation method "echo reboot > /sys/power/disk; echo disk > /sys/power/state" I had 3 successes, 7 failures (zero 2nd successes). The failures were following:

7 times it was a reboot on resume. The debug info:
  Magic number: 0:930:740
  hash matches drivers/base/power/main.c:814
  cat /sys/power/pm_trace_dev_match
  → pci
  → acpi
  → acpi

Overall I think that nothing has changed in kernel 3.18.1 since my last test on 3.17.

Is there anything I could do to help with fixing this problem?

best regards
Janek Kozicki
Comment 17 Janek Kozicki 2015-09-04 12:56:07 UTC
Hi,

I just checked the recently released kernel 4.2, and the problem still persists. Is there anything I can do to help resolving this problem? I really need working hibernation to disk.

best regards
Janek Kozicki
Comment 18 J. Kozicki 2015-09-04 13:09:29 UTC
Hi,

I just checked the recently released kernel 4.2, and the problem still persists. Is there anything I can do to help resolving this problem? I really need working hibernation to disk.

NOTE: I just subscribed with another email address - I was worried that I wasn't receiving your replies.

best regards
Janek Kozicki
Comment 19 Chen Yu 2015-09-11 07:19:53 UTC
Hi, guys, can you please attached the dmesg after you failed to resume from hibernation? and can you please confirm if your kernel config is built with 
CONFIG_SATA_AHCI=y rather than CONFIG_SATA_AHCI=m ? (plz also attach your kernel config file)
Thanks!

Yu
Comment 20 Janek Kozicki 2015-09-11 13:54:51 UTC
Created attachment 187341 [details]
config file for kernel 4.2
Comment 21 Janek Kozicki 2015-09-11 13:55:38 UTC
Created attachment 187351 [details]
dmesg after failed resume, with hibernation done using Method 1)
Comment 22 Janek Kozicki 2015-09-11 13:56:11 UTC
Created attachment 187361 [details]
dmesg after failed resume, with hibernation done using Method 2)
Comment 23 Janek Kozicki 2015-09-11 13:56:35 UTC
Created attachment 187371 [details]
dmesg after failed resume, with hibernation done using Method 3)
Comment 24 Janek Kozicki 2015-09-11 13:57:00 UTC
Created attachment 187381 [details]
dmesg after failed resume, with hibernation done using Method 4)
Comment 25 Janek Kozicki 2015-09-11 13:59:50 UTC
Created attachment 187391 [details]
Video of resume failure using Method 1 and 2
Comment 26 Janek Kozicki 2015-09-11 14:01:16 UTC
Created attachment 187401 [details]
Video of resume failure using Method 3 and 4
Comment 27 Janek Kozicki 2015-09-11 14:04:18 UTC
Hi,

thank you very much for your reply. I am using kernel 4.2, here's the md5sum:

  3d5ea06d767e2f35c999eeadafc76523  linux-4.2.tar.xz

plus following diff:
  https://git.kernel.org/cgit/linux/kernel/git/tj/wq.git/commit/?h=for-4.3&id=1dadafa86a779884f14a6e7a3ddde1a57b0a0a65

This diff is necessary for nvidia-kernel-dkms debian package, and is irrelevant,
because resume from disk isn't working even without nvidia.

Next, I am compiling the kernel with these commands:

  unp linux-4.2.tar.xz
  cd linux-4.2
  make menuconfig
  fakeroot make-kpkg --initrd --append-to-version=-vanilla.3 kernel_image kernel_headers -j38
  cd ..
  dpkg -i linux-headers-4.2.0-vanilla.3_4.2.0-vanilla.3-10.00.Custom_amd64.deb linux-image-4.2.0-vanilla.3_4.2.0-vanilla.3-10.00.Custom_amd64.deb

I have attached the kernel config to this bug report (see the top of this bugreport webpage) under filename config-4.2.0-vanilla.3

Then I have tried to hibernate and resume my computer using these four methods:

Method 1)
  touch /forcefsck
  /bin/sync
  echo 1 > /sys/power/pm_trace
  /usr/sbin/hibernate
  
Method 2)
  touch /forcefsck
  /bin/sync
  echo 1 > /sys/power/pm_trace
  /usr/sbin/s2disk
  
Method 3)
  touch /forcefsck
  /bin/sync
  echo 1 > /sys/power/pm_trace
  echo reboot > /sys/power/disk
  echo disk > /sys/power/state
 
Method 4) 
  touch /forcefsck
  /bin/sync
  echo 1 > /sys/power/pm_trace
  echo shutdown > /sys/power/disk
  echo disk > /sys/power/state
  
All of those methods have failed upon resume and PC has restarted right before
the resume was about to finish.

The command /usr/sbin/hibernate comes from debian package hibernate, version: 2.0+15+g88d54a
The command /usr/sbin/s2disk    comes from debian package uswsusp,   version: 1.0+20120915-6
(I have debian jessie installed here)

The methods 3) and 4) do not depend on debian packages, they just issue
commands to kernel, and the error is a little different.


The errors:

The dmesg output for each of these hibernate methods is attached under file names failure_using_method_*.dmesg

You will see in these *dmesg files that errors in method 1 and 2 are similar, and errors in methods 3 and 4 are also similar but different from 1,2.


Finally, I have made videos of my PC trying to perform the resume:

http://janek.kozicki.pl/hibernate_problems/resume_failue_method_1_2.mov
http://janek.kozicki.pl/hibernate_problems/resume_failue_method_3_4.mov

I made only two videos, because the same things appear on screen with method 1
and 2, and the same messages appear with method 3 and 4.


Attachments (added to the bug report):
1. config-4.2.0-vanilla.3
2. failure_using_method_1.dmesg
3. failure_using_method_2.dmesg
4. failure_using_method_3.dmesg
5. failure_using_method_4.dmesg
6. resume_failue_method_1_2.mov
7. resume_failue_method_3_4.mov 


I really hope that you can help me to solve this problem.

If there is anything that I could do please tell me!
Janek Kozicki
Comment 28 Janek Kozicki 2015-09-11 14:07:29 UTC
Oh, and yes - all my tests above were done with CONFIG_SATA_AHCI=y as you can see in the attached kernel config.
Comment 29 Janek Kozicki 2015-09-19 13:49:03 UTC
Hi,

is there anything else that I could do here to provide more information?
Maybe try some other kernel options, or some patch?

I hope that the videos, dmesgs and kernel config that I attached to this bug info were useful for you.

Please, could you help me with this resume problem?

best regards
Janek Kozicki
Comment 30 Chen Yu 2015-10-22 02:46:38 UTC
Hi,

according to your log

[    2.586072] PM: Starting manual resume from disk
[    2.586075] PM: Hibernation image partition 8:2 present
[    2.586075] PM: Looking for hibernation image.
[    2.586234] PM: Image not found (code -22)
[    2.586235] PM: Hibernation image not present or could not be loaded.
I suspect there is something wrong with UUID naming, so I think since you have compiled SATA as built-in, can you please also help to test:

1. append 'init=/bin/bash' in your grub command line, and
2. after boot up, swapon /dev/sda3 (this should be your swap partition, and
    there must be ONLY one swap partition in your system
3. echo disk > /sys/power/state
4. boot up again, with 'init=/bin/bash resume=/dev/sda3'

Thanks!
Comment 31 Janek Kozicki 2015-10-23 23:18:09 UTC
Created attachment 190981 [details]
Video of resume success and failure using Method suggested by Chen Yu in Comment 30

Video of resume success and failure using Method suggested by Chen Yu in Comment 30.
Comment 32 Janek Kozicki 2015-10-23 23:19:17 UTC
Created attachment 190991 [details]
dmesg after successful resume, using Method suggested by Chen Yu in Comment 30

dmesg after successful resume, using Method suggested by Chen Yu in Comment 30
Comment 33 Janek Kozicki 2015-10-23 23:19:56 UTC
Created attachment 191001 [details]
dmesg after failed resume (2nd try), using Method suggested by Chen Yu in Comment 30

dmesg after failed resume (2nd try), using Method suggested by Chen Yu in Comment 30
Comment 34 Janek Kozicki 2015-10-23 23:24:12 UTC
Comment on attachment 190981 [details]
Video of resume success and failure using Method suggested by Chen Yu in Comment 30

High resolution version of this super-lucky video: http://janek.kozicki.pl/hibernate_problems/resume_success_and_failure_HIGH_RES.mp4
Comment 35 Janek Kozicki 2015-10-23 23:36:13 UTC
Hi,

I am so happy that you are interested in my problem.

Let me be clear - the four dmesgs that I have attached to this bug report are dmesgs _after_ failed resume.

It means that my PC tried to resume and due to some problem linux kernel has set the clock to the error code (because I am using flag `echo 1 > /sys/power/pm_trace`) and performed reboot right before _finalizing_(*) the resume process. On attached video for Method 1 and 2 you can see the actual progress printed on console in percents as it is reading the hibernation image from swapfile, in Method 3 and 4 (from my Comment 27 above) it is not printing anything on the console, but the timing is the same. These videos confirm that indeed the hibernation image is found. But of course I did what you told me to do, read about it below (**).

The important information in these dmesgs is the debug information left from `echo 1 > /sys/power/pm_trace`, which is the "Magic number", like this one: 

[    2.052032]   Magic number: 0:788:178

or this one: 

[    2.043685]   Magic number: 0:162:740

The hibernation image is not to be found, because when hibernation fails, the image is cleared and it is not to be found again (not clearing the image would induce a second attempt, resulting in an endless loop of rebooting and failing to resume).

(*) I am fairly certain the reboot happens right before _finalizing_ the resume process, because in older kernels it was sporadically working, like this:
   3.2    -  20% failures
   3.12   - 100% failures
   3.13   -  40% failures
   3.16   -  40% failures
   3.17   -  50% failures
   3.18.1 -  60% failures
   4.2.0  -  95% failures (on another attempt I was super-lucky and I have this single success on a video, attached to this bugreport and in http://janek.kozicki.pl/hibernate_problems/resume_success_and_failure_HIGH_RES.mp4 )

I don't think I could call this bug a regression. I think that when it worked it was rather random, than something intentionally modified in the linux kernel code. Also - I did only about 10 to 20 tries for each of those kernel versions (except kernel 3.2 which I used for months).

The important thing that I want to say about this _finalizing_(*) thing is that when I was using kernel 3.2 I got accustomed to the mental tension at the end of resume process and thinking: "Will it work this time, or not?". And 80% of time right after this _finalizing_ point I have seen my desktop, and 20% of time I have seen my PC rebooting. I used kernel 3.2 for many months, then I had to upgrade to a newer kernel :)

(**) OK, so I did what you asked me to do. 
(1) First - I couldn't use `init=/bin/bash` because of bug https://bugs.launchpad.net/ubuntu/+source/linux/+bug/553400 - keyboard isn't working with init=/bin/bash (I am using debian 8, but apparently this bug affects ubuntu too). But I think that this bug is irrelevant to this hibernate/resume problem here, so instead of `init=/bin/bash` I used `single`.

(2) second - my swap partition is /dev/sda2, there is an `fdisk -l` output below, as well as my /dev/disk/by-uuid/ and /etc/fstab. I am referring to swap always by using UUID so that it is found regardless of whether it happens to be sda2 or sdb2. On this PC it is always sda2, but few years ago on different PC (when using tuxonice) I had hibernation/resume problems (in kernel 2.29) because on each PC boot the hard drives were assigned randomly to /dev/sda, /dev/sdb and /dev/sdc. The hibernation worked (BTW: it worked 100% of times for many years) only when I used UUID. So now I use UUID always. But as you requested I booted using these arguments `resume=/dev/sda2 single'. And then `echo disk > /sys/power/state`. On the video you can see that it is reading the image (it prints progress in percents) then reboots right before _finalizing_ the resume process.

(3) I am attaching a video which shows exactly what happens when I did what you asked me to do. I was super-lucky, because on another attempt (the previous video of failed attempt was too poor quality to show it, and I decided to make another video) I had one success! Then a failure on 2nd try. Please have a look at this video, in mplayer you can play it at a 5 times the normal speed by pressing ']' key in mplayer during the playback (so you don't waste too much time). Better if you see the 22MB version http://janek.kozicki.pl/hibernate_problems/resume_success_and_failure_HIGH_RES.mp4 which I could not attach to the bugreport, because of 22MB filesize.

(4) I am attaching the dmesg after succesfull resume. Maybe it will be useful.

(5) I am also attaching dmesg after failed resume with the "Magic number" but of course with message "PM: Hibernation image not present" because it is after a reboot. I cannot obtain a dmesg from the failed resume after it is succesfully finding and reading the hibernation image, because the PC reboots itself :) The "Magic number" here that we got here is similar to one of previously reported magic numbers.

I hope that you can help me with this problem. If you have any other idea about what should I do, please tell me.

best regards
Janek Kozicki


PS: below is some extra info (I am using mdadm raid partition + LVM)


# fdisk -l

Device     Boot     Start       End   Sectors   Size Id Type
/dev/sda1            2048   2099199   2097152     1G 83 Linux
/dev/sda2         2099200 174065663 171966464    82G 82 Linux swap / Solaris
/dev/sda3  *    174065664 257951743  83886080    40G  7 HPFS/NTFS/exFAT
/dev/sda4       257951744 937703087 679751344 324.1G fd Linux raid autodetect

Device     Boot     Start        End   Sectors   Size Id Type
/dev/sdb1            2048  679753391 679751344 324.1G fd Linux raid autodetect
/dev/sdb2       679753392 1000215215 320461824 152.8G  7 HPFS/NTFS/exFAT

Device     Boot Start        End    Sectors  Size Id Type
/dev/sdc1        2048 3907029167 3907027120  1.8T 83 Linux


# ls -l /dev/disk/by-uuid/
lrwxrwxrwx 1 root root 10 Oct 23 21:49 04e6765a-fe0c-44e7-9165-219563f6171e -> ../../sda2
lrwxrwxrwx 1 root root 10 Oct 23 21:49 1b1cb5fb-b820-4281-badb-fe60995879cf -> ../../sda1
lrwxrwxrwx 1 root root 10 Oct 23 21:49 1fdefccc-b0a0-4371-a762-2dc7b8766cf1 -> ../../dm-2
lrwxrwxrwx 1 root root 10 Oct 23 21:49 C83E5DEA3E5DD254 -> ../../sda3
lrwxrwxrwx 1 root root 10 Oct 23 21:49 DC781BFD781BD55C -> ../../sdb2
lrwxrwxrwx 1 root root 10 Oct 23 21:49 c70e70d3-02e5-4f14-8e2b-0eb31e8e9e9c -> ../../dm-1
lrwxrwxrwx 1 root root 10 Oct 23 21:49 dff0c57d-511c-4a0a-8b07-b1a7d0ab6b99 -> ../../sdc1
lrwxrwxrwx 1 root root 10 Oct 23 21:49 e4f9d3b5-3ace-415e-87c0-a697d6afa3de -> ../../dm-0

# /etc/fstab is in my "Comment 8" above, but I paste it again, so You don't have to search for it:
# cat /etc/fstab | grep -v "#"

UUID=1b1cb5fb-b820-4281-badb-fe60995879cf /  ext4    noatime,commit=600,defaults,errors=remount-ro  0  1
/dev/mapper/lvr-home /home                   ext4    noatime,commit=600,defaults  0  2
tmpfs                /tmp                    tmpfs   defaults,noatime,mode=1777   0  0
/dev/mapper/lvr-usr  /usr                    ext4    noatime,commit=600,defaults  0  2
/dev/mapper/lvr-var  /var                    ext4    noatime,commit=600,defaults  0  2
UUID=04e6765a-fe0c-44e7-9165-219563f6171e none swap  sw                           0  0
/dev/sr0             /media/cdrom0           udf,iso9660 user,noauto              0  0

UUID=C83E5DEA3E5DD254 /mnt/win7_c  auto user,defaults,noauto 0 0
/dev/sdc1       /mnt/usb       auto     rw,user,noauto       0 0
/dev/sdc1       /mnt/sdc1      auto     rw,user,noauto       0 0
/dev/sdc2       /mnt/sdc2      auto     rw,user,noauto       0 0
/dev/sdd1       /mnt/sdd1      auto     rw,user,noauto       0 0
/dev/sde1       /mnt/sde1      auto     rw,user,noauto       0 0
/dev/sdf1       /mnt/sdf1      auto     rw,user,noauto       0 0
/dev/sdg1       /mnt/sdg1      auto     rw,user,noauto       0 0
/dev/sdc        /mnt/usb-flop  auto     rw,user,noauto       0 0
server:/server /server                  nfs  ro,hard,intr,user,noauto,noexec,vers=3 0  2
star:/stuff /stuff                      nfs  rw,hard,intr,user,noauto,noexec,vers=3 0  2
server:/server/cache /cache             nfs  rw,soft,intr,user,noauto,noexec,vers=3 0  2
Comment 36 Chen Yu 2015-11-02 15:48:04 UTC
Hi,Janek 
thanks for your  information in detail, but after I downloaded the http://janek.kozicki.pl/hibernate_problems/resume_success_and_failure_HIGH_RES.mp4  and attachement in #Comment 34,
unfortunately I can not play it(I used different kinds of media players on windows 7, none of they work), can you plz help to confirm? or provide the same format as 'Video of resume failure using Method 1 and 2', since I can play it.

Yu
Comment 37 Janek Kozicki 2015-11-02 17:19:06 UTC
Hi, this video was made using different equipment. I cannot encode it as MOV, but now I used MsMPEG instead. Can you please confirm that this new encoding works for you? Please download this file:
http://janek.kozicki.pl/hibernate_problems/resume_success_and_failure_HIGH_RES.mpg

please check if in your video player you can increase the playing speed so that you don't have to wait too long.

best regards
Janek Kozicki
Comment 38 Chen Yu 2015-11-03 06:28:17 UTC
OK, I see.
so it appears to me that, at 14:09:15, the system reboots immediately after restoring all the images, but we don't know what happened after restoring, to let linux print more info, please also append 'no_console_suspend ignore_loglevel nomodeset modprobe.blacklist=i915' on top of 'resume=/dev/sda2 single'

BTW, I also notice that, there is a oops at #Comment 4, which panics at:
[<ffffffff810ab51c>] swsusp_free
does it appear recently?

Yu
Comment 39 Janek Kozicki 2015-11-04 00:06:28 UTC
Created attachment 191981 [details]
Dmesg before doing hibernation, in response to Comment 38
Comment 40 Janek Kozicki 2015-11-04 00:07:06 UTC
Created attachment 191991 [details]
Dmesg after doing hibernation, in response to Comment 38
Comment 41 Janek Kozicki 2015-11-04 00:08:46 UTC
Created attachment 192001 [details]
Screen photo no_console_suspend with all messages before to failure (1/4)
Comment 42 Janek Kozicki 2015-11-04 00:09:17 UTC
Created attachment 192011 [details]
Screen photo no_console_suspend with all messages before failure (2/4)
Comment 43 Janek Kozicki 2015-11-04 00:09:44 UTC
Created attachment 192021 [details]
Screen photo no_console_suspend with all messages before failure (3/4)
Comment 44 Janek Kozicki 2015-11-04 00:10:14 UTC
Created attachment 192031 [details]
Screen photo no_console_suspend with all messages before failure (4/4)
Comment 45 Janek Kozicki 2015-11-04 00:21:09 UTC
Hi,

I did as you requested, the full video is on:
http://janek.kozicki.pl/hibernate_problems/resume_failure_no_console_suspend_HIGH_RES.mpg

also I did screenshots of this video right before the reboot and I have added them as attachments to this bugreport. There are some "Broke affinity" messages right before the reboot happens.

Regarding the 'oops' messages - I have not seen them recently. They only happened after successful resume (in older kernels), and I had only one successful resume recently - it's the one recorded on previous video, in attached "dmesg after successful resume, using Method suggested by Chen Yu in Comment 30" there are no oops messages.

I hope that my debug info will be useful, and that you can tell me more about what should I do to help to solve this problem.

best regards
Janek Kozicki
Comment 46 Janek Kozicki 2015-11-24 20:15:29 UTC
Hi,

Do you need any more kind of information of debug info?
Maybe I need try some other kernel options, or some patch?

Or maybe I should try latest kernel?

Could you please help me with this resume problem?

best regards
Janek Kozicki
Comment 47 Chen Yu 2015-11-26 03:16:57 UTC
Hi,
According to #Comment 35, it does not react to keyboard when boot with 'init=/bin/bash', right? I checked the thread you refered to, it suggested a USB problem, and do you use usb-keyboard? if so,plz compile your kernel with the following config as build-in:
CONFIG_USB_XHCI_HCD = y
CONFIG_USB_EHCI_HCD = y
CONFIG_USB_UHCI_HCD = y
CONFIG_SATA_AHCI = y
then try 'init=/bin/bash', since the result using this option can narrow down our scope for next step.

BTW I just noticed that you have pm_trace enabled, sorry I missed it before, I'll look at these numbers.

Yu
Comment 48 Janek Kozicki 2015-12-25 13:45:38 UTC
Created attachment 198241 [details]
config file for kernel 4.2 with USB keyboard (so that init=/bin/bash works) and CONFIG_SATA_AHCI=y
Comment 49 Janek Kozicki 2015-12-25 13:47:14 UTC
Created attachment 198251 [details]
Screen photo init=/bin/bash with all messages before failure (1/6)
Comment 50 Janek Kozicki 2015-12-25 13:47:37 UTC
Created attachment 198261 [details]
Screen photo init=/bin/bash with all messages before failure (2/6)
Comment 51 Janek Kozicki 2015-12-25 13:48:04 UTC
Created attachment 198271 [details]
Screen photo init=/bin/bash with all messages before failure (3/6)
Comment 52 Janek Kozicki 2015-12-25 13:48:34 UTC
Created attachment 198281 [details]
Screen photo init=/bin/bash with all messages before failure (4/6)
Comment 53 Janek Kozicki 2015-12-25 13:48:58 UTC
Created attachment 198291 [details]
Screen photo init=/bin/bash with all messages before failure (5/6)
Comment 54 Janek Kozicki 2015-12-25 13:49:20 UTC
Created attachment 198301 [details]
Screen photo init=/bin/bash with all messages before failure (6/6)
Comment 55 Janek Kozicki 2015-12-25 13:53:57 UTC
Created attachment 198321 [details]
Dmesg after doing resume with init=/bin/bash
Comment 56 Janek Kozicki 2015-12-25 14:07:49 UTC
Hi,

I am sorry for late reply, I had a master thesis defense and a very big exam. But now I will reply within one day if there is anything you need.

I did as you requested: I enabled USB keyboard and CONFIG_SATA_AHCI = y was already enabled. Now init=/bin/bash worked.

I attached the full kernel config in attachment 198241 [details] and I made a video as previously of whole resume process, please download a high resolution version from here: http://janek.kozicki.pl/hibernate_problems/resume_failure_init_bin_bash_HIGH_RES.mpg

I also made screenshots of this video and added them as attachements, there you can see that hibernation image was read succesfully to 100% (at speed 520 MB/sec, because it's an SDD disk) and again some "Broke affinity for irq" messages appear right before the resume failure which ends in a reboot.

I also attached dmesg after resume failure with pm_trace, this time with this error:

[    2.151426]   Magic number: 0:712:790

I hope that you can still help me, or maybe Rafael Wysocki could look at this problem too?

best regards
Janek Kozicki
Comment 57 Janek Kozicki 2016-01-11 15:46:06 UTC
Hi,

I just got another idea - maybe we could modify the kernel source so that it prints more debug information right before the reboot? I would record it on the video. 

It is because I suspect that the "Broke affinity for irq" message isn't maybe  helpful enough.

I could even put a sprintf between each line in the kernel source, so we would locate the exact line where the reboot happens. Please just tell me which file to modify, and which function in it, and what sprintf command to use every second line there so that it will print messages for us on the screen.

Do you think we could do that?

best regards
Janek Kozicki
Comment 58 Chen Yu 2016-01-11 15:51:41 UTC
I'll return to this thread this week. too many bugs at present :-)
Comment 59 Janek Kozicki 2016-01-11 15:58:15 UTC
Great! Thanks. I will reply within one day, maybe even within few hours (unless I'm asleep here in Europe) with whatever you need.
Comment 60 Chen Yu 2016-01-16 11:23:56 UTC
Hi,  
I've downloaded your video and it is a good start for debugging if init=/bin/bash works. The cpu restarts after the nonboot cpus been put offline, but there should be more debug info if we enable them in grub, so can you please use:
init=/bin/bash nomodeset text no_console_suspend ignore_loglevel resume=/dev/sda2 initcall_debug dyndbg='file drivers/base/syscore.c +p'
and appy the following debug patch. Please also provide the video and screenshot as you did before, thanks.
Comment 61 Janek Kozicki 2016-01-16 14:45:16 UTC
Hi, thank you very much for your reply.

I cannot find the patch which you have mentioned in Comment 60. Should I try to use

init=/bin/bash nomodeset text no_console_suspend ignore_loglevel resume=/dev/sda2 initcall_debug dyndbg='file drivers/base/syscore.c +p'

without a patch?

best regards
Janek Kozicki
Comment 62 Chen Yu 2016-01-16 14:59:24 UTC
wow, I forgot to upload, here it is
Comment 63 Chen Yu 2016-01-16 15:00:22 UTC
Created attachment 199821 [details]
debug patch to track resume after cpu offline
Comment 64 Chen Yu 2016-01-16 15:01:36 UTC
(In reply to Janek Kozicki from comment #61)
> Hi, thank you very much for your reply.
> 
> I cannot find the patch which you have mentioned in Comment 60. Should I try
> to use
> 
> init=/bin/bash nomodeset text no_console_suspend ignore_loglevel
> resume=/dev/sda2 initcall_debug dyndbg='file drivers/base/syscore.c +p'
> 
> without a patch?
> 
> best regards
> Janek Kozicki

sure, you can first try without this patch.
Comment 65 Janek Kozicki 2016-01-16 22:28:42 UTC
Created attachment 199881 [details]
kernel panic with parameter "text" in kernel command line 1/2
Comment 66 Janek Kozicki 2016-01-16 22:29:18 UTC
Created attachment 199891 [details]
kernel panic with parameter "text" in kernel command line 2/2
Comment 67 Janek Kozicki 2016-01-16 22:35:32 UTC
Created attachment 199901 [details]
screenshot with new debug info 1st try 1/2
Comment 68 Janek Kozicki 2016-01-16 22:36:02 UTC
Created attachment 199911 [details]
screenshot with new debug info 1st try 2/2
Comment 69 Janek Kozicki 2016-01-16 22:47:55 UTC
Created attachment 199921 [details]
lucky success screenshot with s4 patch 1/7
Comment 70 Janek Kozicki 2016-01-16 22:48:17 UTC
Created attachment 199931 [details]
lucky success screenshot with s4 patch 2/7
Comment 71 Janek Kozicki 2016-01-16 22:48:40 UTC
Created attachment 199941 [details]
lucky success screenshot with s4 patch 3/7
Comment 72 Janek Kozicki 2016-01-16 22:49:01 UTC
Created attachment 199951 [details]
lucky success screenshot with s4 patch 4/7
Comment 73 Janek Kozicki 2016-01-16 22:49:26 UTC
Created attachment 199961 [details]
lucky success screenshot with s4 patch 5/7
Comment 74 Janek Kozicki 2016-01-16 22:49:51 UTC
Created attachment 199971 [details]
lucky success screenshot with s4 patch 6/7
Comment 75 Janek Kozicki 2016-01-16 22:50:17 UTC
Created attachment 199981 [details]
lucky success screenshot with s4 patch 7/7
Comment 76 Janek Kozicki 2016-01-16 22:54:48 UTC
Created attachment 199991 [details]
lucky success screenshot with s4 patch - dmesg output
Comment 77 Janek Kozicki 2016-01-16 23:03:44 UTC
Created attachment 200001 [details]
screenshots of failure on second try with s4 patch 1/7
Comment 78 Janek Kozicki 2016-01-16 23:04:16 UTC
Created attachment 200011 [details]
screenshots of failure on second try with s4 patch 2/7
Comment 79 Janek Kozicki 2016-01-16 23:04:49 UTC
Created attachment 200021 [details]
screenshots of failure on second try with s4 patch 3/7
Comment 80 Janek Kozicki 2016-01-16 23:05:13 UTC
Created attachment 200031 [details]
screenshots of failure on second try with s4 patch 4/7
Comment 81 Janek Kozicki 2016-01-16 23:05:40 UTC
Created attachment 200041 [details]
screenshots of failure on second try with s4 patch 5/7
Comment 82 Janek Kozicki 2016-01-16 23:06:23 UTC
Created attachment 200051 [details]
screenshots of failure on second try with s4 patch 6/7
Comment 83 Janek Kozicki 2016-01-16 23:06:56 UTC
Created attachment 200061 [details]
screenshots of failure on second try with s4 patch 7/7
Comment 84 Janek Kozicki 2016-01-16 23:09:57 UTC
Created attachment 200071 [details]
dmesg after second failure with s4 patch, with pm_trace
Comment 85 Janek Kozicki 2016-01-16 23:24:23 UTC
Hi,

I used the patch and first I used the parameters as you said in Comment 60 (plus 'ro' parameter which was always there):

ro init=/bin/bash nomodeset text no_console_suspend ignore_loglevel resume=/dev/sda2 initcall_debug dyndbg='file drivers/base/syscore.c +p'

, but I had kernel panic, see first two screenshots in attachment 199881 [details] and attachment 199891 [details]

Then I removed parameter "text" from linux kernel command line and kernel panic did not appear again so I was able to do some real debugging. I noticed that previously you told me to use parameter modprobe.blacklist=i915 and now you didn't, so I did two runs. One with this parameter and one without (while both of those runs were without "text" parameter which caused kernel panic).

Kernel config is exactly the same as in attachment 198241 [details] for kernel 4.2

1. OK, so the first run was with following kernel command line 

ro resume=/dev/sda2 init=/bin/bash no_console_suspend ignore_loglevel nomodeset modprobe.blacklist=i915 initcall_debug dyndbg='file drivers/base/syscore.c +p'

the full video you will find at http://janek.kozicki.pl/hibernate_problems/resume_failure_with_s4_patch_HIGH_RES_1.mpg

the screenshots of critical moment are in attachment 199901 [details] and attachment 199911 [details] . You can see that it manages to print the last line "PM: before jump back to origianl kernel..." that you added in the patch, then it reboots.

2. Then I wanted only to make sure that removing the line modprobe.blacklist=i915 changes nothing so I booted kernel with these parameters:

ro resume=/dev/sda2 init=/bin/bash no_console_suspend ignore_loglevel nomodeset initcall_debug dyndbg='file drivers/base/syscore.c +p'

and two things happened:
 - I was again(??) super lucky because the resume was succesfull, and we can see more messages on the screen after your line "PM: before jump back to origianl kernel..."
 - after resume was complete the battery in my camera ran out. Fortunately it ran out after resume was complete. So I charged the battery and continued to record a second hibernate/resume process (which failed this time) in next video.

So there are two more videos for you to see, the succesfull resume: http://janek.kozicki.pl/hibernate_problems/resume_success_with_s4_patch_HIGH_RES_2.mpg
and failed one: http://janek.kozicki.pl/hibernate_problems/resume_failure_with_s4_patch_HIGH_RES_3.mpg

the screenshots around the critical moment during success are from attachment 199921 [details] up to 199981 . In case if dmesg after successful resume would be useful for you, it is in attachment 199991 [details]

And screenshots during second failure start at attachment 200001 [details] and go up to attachment 200061 [details]
Also, the full dmesg after this failed resume is in attachment 200071 [details] , with following pm_trace output:

[    2.220545]   Magic number: 0:265:841


OK so that's everything I did.

Now a some comments on my own - I consider it very unlikely to be so very lucky that the only two cases when hibernate resume worked were both recorded on a video. I really did more than 100 tries during past months with kernel 4.2 and I really only had those two successes, both of them recorded on a videocamera. One was in Comment 34, and another happened today. Therefore I think that those very rare successes are connected somehow with the sterile environment of init=/bin/bash and small amount of RAM used during these successfull tests. Perhaps that could guide us towards finding a solution.

My second comment: it seems that the reboot happened after passing control to the resumed kernel. If you give me a patch that prints messages there, I will record it too. But please note: the messages that will be printed MUST be printed before any other messages that we have seen on the succesfull video, because "PM: before jump back to origianl kernel..." is the last message that we see before reboot. While on succesfull resume, after about half a second a flood of messages appears, which you can see on the video. The critical moment happens during those 500ms as seen in the video.


OK. I really hope that we can move forward with this new debug info.

best regards
Janek Kozicki
Comment 86 Chen Yu 2016-01-17 10:02:38 UTC
Created attachment 200121 [details]
trial patch to enable the temporary direct-mapping for non-boot cpus before they die

Hi,
This is a possible fix for a "spurious" wakeup when non-boot cpus are put into sleep, please help try if this patch works for you. Related link:
https://bugzilla.kernel.org/show_bug.cgi?id=106371

if this patch does not work for you, I'll add more debug info...

thanks
yu
Comment 87 Janek Kozicki 2016-01-17 13:14:02 UTC
Created attachment 200131 [details]
freeze with nonboot.diff 1/8
Comment 88 Janek Kozicki 2016-01-17 13:14:33 UTC
Created attachment 200141 [details]
freeze with nonboot.diff 2/8
Comment 89 Janek Kozicki 2016-01-17 13:15:08 UTC
Created attachment 200151 [details]
freeze with nonboot.diff 3/8
Comment 90 Janek Kozicki 2016-01-17 13:15:52 UTC
Created attachment 200161 [details]
freeze with nonboot.diff 4/8
Comment 91 Janek Kozicki 2016-01-17 13:16:33 UTC
Created attachment 200171 [details]
freeze with nonboot.diff 5/8
Comment 92 Janek Kozicki 2016-01-17 13:17:09 UTC
Created attachment 200181 [details]
freeze with nonboot.diff 6/8
Comment 93 Janek Kozicki 2016-01-17 13:17:35 UTC
Created attachment 200191 [details]
freeze with nonboot.diff 7/8
Comment 94 Janek Kozicki 2016-01-17 13:18:01 UTC
Created attachment 200201 [details]
freeze with nonboot.diff 8/8
Comment 95 Janek Kozicki 2016-01-17 13:19:27 UTC
Created attachment 200211 [details]
freeze with nonboot.diff 2/8 (as jpg, mistakenly uploaded as text earlier)
Comment 96 Janek Kozicki 2016-01-17 13:21:00 UTC
Hi,

it is actually worse with nonboot.diff patch. Computer freezes during resume and keeps printing some stacktrace every 120 seconds. Full video is at

http://www.pg.gda.pl/~jkozicki/hibernate_problems/resume_freeze_with_nonboot_diff.mpg

screenshots of critical moment start in attachment 200131 [details]

Please note that the control does not even reach your message "PM: before jump back to origianl kernel..."

best regards
Janek Kozicki
Comment 97 Janek Kozicki 2016-01-17 13:41:27 UTC
Oh, I forgot to mention - on that video in Comment 96 I am pressing multiple times altsysrq-s altsysrq-u altsysrq-b but it has no effect at all.
Comment 98 Chen Yu 2016-01-18 01:34:38 UTC
OK, there is some problem with my previous patch that, it does not set up vmalloc area mappings, so when cpu tries to access vmalloc address there would be a expected page fault. I'll try to fix the patch. thanks
Comment 99 Chen Yu 2016-01-24 16:16:21 UTC
Created attachment 201331 [details]
force nonboot cpus to switch to safe pagetable when restoring page frames

Hi, 
please help try this patch to see if it works for you? thanks
Comment 100 Janek Kozicki 2016-01-25 19:35:31 UTC
Created attachment 201451 [details]
Dmesg after successful resume with hibernate_mapping.diff
Comment 101 Janek Kozicki 2016-01-25 19:36:22 UTC
Created attachment 201461 [details]
Dmesg after failed resume with hibernate_mapping.diff 1
Comment 102 Janek Kozicki 2016-01-25 19:36:59 UTC
Created attachment 201471 [details]
Dmesg after failed resume with hibernate_mapping.diff 2
Comment 103 Janek Kozicki 2016-01-25 19:37:35 UTC
Created attachment 201481 [details]
Dmesg after failed resume with hibernate_mapping.diff 3
Comment 104 Janek Kozicki 2016-01-25 19:40:15 UTC
Created attachment 201491 [details]
Screenshot of successful resume with hibernate_mapping.diff 1/17
Comment 105 Janek Kozicki 2016-01-25 19:40:43 UTC
Created attachment 201501 [details]
Screenshot of successful resume with hibernate_mapping.diff 2/17
Comment 106 Janek Kozicki 2016-01-25 19:41:07 UTC
Created attachment 201511 [details]
Screenshot of successful resume with hibernate_mapping.diff 3/17
Comment 107 Janek Kozicki 2016-01-25 19:41:30 UTC
Created attachment 201521 [details]
Screenshot of successful resume with hibernate_mapping.diff 4/17
Comment 108 Janek Kozicki 2016-01-25 19:41:58 UTC
Created attachment 201531 [details]
Screenshot of successful resume with hibernate_mapping.diff 5/17
Comment 109 Janek Kozicki 2016-01-25 19:42:26 UTC
Created attachment 201541 [details]
Screenshot of successful resume with hibernate_mapping.diff 6/17
Comment 110 Janek Kozicki 2016-01-25 19:42:52 UTC
Created attachment 201551 [details]
Screenshot of successful resume with hibernate_mapping.diff 7/17
Comment 111 Janek Kozicki 2016-01-25 19:43:13 UTC
Created attachment 201561 [details]
Screenshot of successful resume with hibernate_mapping.diff 8/17
Comment 112 Janek Kozicki 2016-01-25 19:43:38 UTC
Created attachment 201571 [details]
Screenshot of successful resume with hibernate_mapping.diff 9/17
Comment 113 Janek Kozicki 2016-01-25 19:44:02 UTC
Created attachment 201581 [details]
Screenshot of successful resume with hibernate_mapping.diff 10/17
Comment 114 Janek Kozicki 2016-01-25 19:44:30 UTC
Created attachment 201591 [details]
Screenshot of successful resume with hibernate_mapping.diff 11/17
Comment 115 Janek Kozicki 2016-01-25 19:44:52 UTC
Created attachment 201601 [details]
Screenshot of successful resume with hibernate_mapping.diff 12/17
Comment 116 Janek Kozicki 2016-01-25 19:45:12 UTC
Created attachment 201611 [details]
Screenshot of successful resume with hibernate_mapping.diff 13/17
Comment 117 Janek Kozicki 2016-01-25 19:45:34 UTC
Created attachment 201621 [details]
Screenshot of successful resume with hibernate_mapping.diff 14/17
Comment 118 Janek Kozicki 2016-01-25 19:45:56 UTC
Created attachment 201631 [details]
Screenshot of successful resume with hibernate_mapping.diff 15/17
Comment 119 Janek Kozicki 2016-01-25 19:46:17 UTC
Created attachment 201641 [details]
Screenshot of successful resume with hibernate_mapping.diff 16/17
Comment 120 Janek Kozicki 2016-01-25 19:46:37 UTC
Created attachment 201661 [details]
Screenshot of successful resume with hibernate_mapping.diff 17/17
Comment 121 Janek Kozicki 2016-01-25 19:48:44 UTC
Created attachment 201681 [details]
Screenshot of failed 1 resume with hibernate_mapping.diff 1/14
Comment 122 Janek Kozicki 2016-01-25 19:49:19 UTC
Created attachment 201691 [details]
Screenshot of failed 1 resume with hibernate_mapping.diff 2/14
Comment 123 Janek Kozicki 2016-01-25 19:49:46 UTC
Created attachment 201701 [details]
Screenshot of failed 1 resume with hibernate_mapping.diff 3/14
Comment 124 Janek Kozicki 2016-01-25 19:50:09 UTC
Created attachment 201711 [details]
Screenshot of failed 1 resume with hibernate_mapping.diff 4/14
Comment 125 Janek Kozicki 2016-01-25 19:50:36 UTC
Created attachment 201721 [details]
Screenshot of failed 1 resume with hibernate_mapping.diff 5/14
Comment 126 Janek Kozicki 2016-01-25 19:51:04 UTC
Created attachment 201731 [details]
Screenshot of failed 1 resume with hibernate_mapping.diff 6/14
Comment 127 Janek Kozicki 2016-01-25 19:51:31 UTC
Created attachment 201741 [details]
Screenshot of failed 1 resume with hibernate_mapping.diff 7/14
Comment 128 Janek Kozicki 2016-01-25 19:51:58 UTC
Created attachment 201751 [details]
Screenshot of failed 1 resume with hibernate_mapping.diff 8/14
Comment 129 Janek Kozicki 2016-01-25 19:52:19 UTC
Created attachment 201761 [details]
Screenshot of failed 1 resume with hibernate_mapping.diff 9/14
Comment 130 Janek Kozicki 2016-01-25 19:52:42 UTC
Created attachment 201771 [details]
Screenshot of failed 1 resume with hibernate_mapping.diff 10/14
Comment 131 Janek Kozicki 2016-01-25 19:53:04 UTC
Created attachment 201781 [details]
Screenshot of failed 1 resume with hibernate_mapping.diff 11/14
Comment 132 Janek Kozicki 2016-01-25 19:53:30 UTC
Created attachment 201791 [details]
Screenshot of failed 1 resume with hibernate_mapping.diff 12/14
Comment 133 Janek Kozicki 2016-01-25 19:53:51 UTC
Created attachment 201801 [details]
Screenshot of failed 1 resume with hibernate_mapping.diff 13/14
Comment 134 Janek Kozicki 2016-01-25 19:54:13 UTC
Created attachment 201811 [details]
Screenshot of failed 1 resume with hibernate_mapping.diff 14/14
Comment 135 Janek Kozicki 2016-01-25 19:55:59 UTC
Created attachment 201821 [details]
Screenshot of failed 2 resume with hibernate_mapping.diff 1/2
Comment 136 Janek Kozicki 2016-01-25 19:56:22 UTC
Created attachment 201831 [details]
Screenshot of failed 2 resume with hibernate_mapping.diff 2/2
Comment 137 Janek Kozicki 2016-01-25 19:58:04 UTC
Created attachment 201841 [details]
Screenshot of failed 3 resume with hibernate_mapping.diff 1/5
Comment 138 Janek Kozicki 2016-01-25 19:58:31 UTC
Created attachment 201851 [details]
Screenshot of failed 3 resume with hibernate_mapping.diff 2/5
Comment 139 Janek Kozicki 2016-01-25 19:58:54 UTC
Created attachment 201861 [details]
Screenshot of failed 3 resume with hibernate_mapping.diff 3/5
Comment 140 Janek Kozicki 2016-01-25 19:59:20 UTC
Created attachment 201871 [details]
Screenshot of failed 3 resume with hibernate_mapping.diff 4/5
Comment 141 Janek Kozicki 2016-01-25 19:59:43 UTC
Created attachment 201881 [details]
Screenshot of failed 3 resume with hibernate_mapping.diff 5/5
Comment 142 Janek Kozicki 2016-01-25 20:01:13 UTC
Created attachment 201891 [details]
Screenshot of failed 4 resume with hibernate_mapping.diff 1/9
Comment 143 Janek Kozicki 2016-01-25 20:01:41 UTC
Created attachment 201901 [details]
Screenshot of failed 4 resume with hibernate_mapping.diff 2/9
Comment 144 Janek Kozicki 2016-01-25 20:02:12 UTC
Created attachment 201911 [details]
Screenshot of failed 4 resume with hibernate_mapping.diff 3/9
Comment 145 Janek Kozicki 2016-01-25 20:02:40 UTC
Created attachment 201921 [details]
Screenshot of failed 4 resume with hibernate_mapping.diff 4/9
Comment 146 Janek Kozicki 2016-01-25 20:03:05 UTC
Created attachment 201931 [details]
Screenshot of failed 4 resume with hibernate_mapping.diff 5/9
Comment 147 Janek Kozicki 2016-01-25 20:03:35 UTC
Created attachment 201941 [details]
Screenshot of failed 4 resume with hibernate_mapping.diff 6/9
Comment 148 Janek Kozicki 2016-01-25 20:04:07 UTC
Created attachment 201951 [details]
Screenshot of failed 4 resume with hibernate_mapping.diff 7/9
Comment 149 Janek Kozicki 2016-01-25 20:04:39 UTC
Created attachment 201961 [details]
Screenshot of failed 4 resume with hibernate_mapping.diff 8/9
Comment 150 Janek Kozicki 2016-01-25 20:05:06 UTC
Created attachment 201971 [details]
Screenshot of failed 4 resume with hibernate_mapping.diff 9/9
Comment 151 Janek Kozicki 2016-01-25 20:07:57 UTC
Hi

I did 11 tries, out of which 6 were successful (54% success rate). I was quite impressed. Some of those tests I did in non-sterile conditions without init=/bin/bash trying to pretend that I use the computer in normal fashion. Following are the results, of course, with init=/bin/bash

OK. In attachment 201451 [details], attachment 201461 [details], attachment 201471 [details], attachment 201481 [details] you will find dmesgs after success and failure. The pm_trace result (Magic numbers) from failure dmesgs were:

./failure1_2016_01_24.txt:[    2.222016]   Magic number: 0:712:790
./failure2_2016_01_24.txt:[    2.217617]   Magic number: 0:712:790
./failure3_2016_01_25.txt:[    2.217158]   Magic number: 0:712:790

1. For first 3 tries I used following kernel command line: 

ro resume=/dev/sda2 init=/bin/bash no_console_suspend ignore_loglevel nomodeset initcall_debug dyndbg='file drivers/base/syscore.c +p'

You will find the video here: http://www.pg.gda.pl/~jkozicki/hibernate_problems/resume1_with_hibernate_mapping_diff.mpg
The critical moments on video are at 5:27 (success), 9:18 (failure), 15:34 (failure)

Screenshots from video of success during critical moment start at attachment 201491 [details]
* you can see it passes through message "CPU 31 is up", in attachment 201601 [details] then a flood of other messages appears in attachment 201611 [details]
* In attachment 201451 [details] you will find dmesg after this success.

Next screenshots from first failure start at attachment 201681 [details]
* you can see that last message there was "CPU 29 is up" in attachment 201801 [details]

Next screenshots from second failure start at attachment 201821 [details] (I only upload two screenshots, just before failure)
* you can see that last message there was "CPU 16 is up" in attachment 201831 [details]

2. For next two tries I used kernel command line with extra modprobe.blacklist=i915:

ro resume=/dev/sda2 init=/bin/bash no_console_suspend ignore_loglevel nomodeset modprobe.blacklist=i915 initcall_debug dyndbg='file drivers/base/syscore.c +p'

You will find video here: http://www.pg.gda.pl/~jkozicki/hibernate_problems/resume2_with_hibernate_mapping_diff.mpg
The critical moments are at 3:13 (failure) and 7:51 (failure)

Screenshots from failure start at attachment 201841 [details]
* you can see that last message there was "CPU 14 is up" in attachment 201871 [details]
  
Screenshots from next failure start at attachment 201891 [details]
* you can see that last message there was "CPU 20 is up" in attachment 201971 [details]


OK, so that's all I have done here. My thoughts are that we made some great progress here, because the problem now appears to be after swapping control to resumed image. If you need any kind of more debug messages please let me know.

best regards
Janek Kozicki
Comment 152 Janek Kozicki 2016-01-27 18:03:32 UTC
Hi,

sorry, maybe my previous email was unnecessarily too long. The short version is that I uploaded screenshots with four different failures, and the last message in each case was:

"CPU 29 is up"
"CPU 16 is up"
"CPU 14 is up"
"CPU 20 is up"

So we are dealing here with some random factor that randomly causes resume failure upon waking CPUs.

Also I uploaded dmesgs, more screenshots and videos:

http://www.pg.gda.pl/~jkozicki/hibernate_problems/resume1_with_hibernate_mapping_diff.mpg
http://www.pg.gda.pl/~jkozicki/hibernate_problems/resume2_with_hibernate_mapping_diff.mpg


best regards
Janek Kozicki
Comment 153 Chen Yu 2016-01-30 12:38:34 UTC
Created attachment 202401 [details]
Disable the warn_on invoking when nonboot cpus are not ready for exception handling

Hi Janek,
can you help try this patch?(do not apply hibernate_mapping.diff, that patch might be incorrectly designed).

There is a obvious bug in irq handler when rcu is involved during S4 resuming, and it might break(halt/reboot) the system by a implicit BUG_ON,my goodness..
Comment 154 Janek Kozicki 2016-01-31 15:55:50 UTC
Hi,

Unfortunately forbid_rcu_during_s4.diff does not help. I used following kernel command line:

ro resume=/dev/sda2 init=/bin/bash no_console_suspend ignore_loglevel nomodeset initcall_debug dyndbg='file drivers/base/syscore.c +p'

I used kernel 4.2 with following patches:
+ s4_debug_after_cpuoffline.diff from comment 63
+ forbid_rcu_during_s4.diff      from comment 153
+ https://git.kernel.org/cgit/linux/kernel/git/tj/wq.git/commit/?h=for-4.3&id=1dadafa86a779884f14a6e7a3ddde1a57b0a0a65

kernel config was the same as before in attachment 198241 [details].

I tried 3 times:

1,2: Two times upon resume the computer rebooted right after printing your message from s4_debug_after_cpuoffline.diff: "PM: before jump back to origianl kernel..." The pm_trace result (Magic number from dmesg) from those two times was:
  ./failure1_2016_01_30.txt:[    2.213576]   Magic number: 0:712:790
  ./failure2_2016_01_30.txt:[    2.224373]   Magic number: 0:712:790

3: one time I had a success.


Then just out of curiosity I tried two times with added
+ hibernate_mapping.diff         from comment 99 (which you told me to not use)

And turns out that you were right, because using both hibernate_mapping.diff + forbid_rcu_during_s4.diff resulted two failures:

1. on first failure it rebooted after printing "CPU 19 is up", just like in comment 152, the pm_trace Magic number from dmesg was:
./failure3_2016_01_30.txt:[    2.215363]   Magic number: 0:162:740

2. on second failure it was a complete freeze after writing "Kick cpu:31" (alt-sysrq did not work)


I can upload videos of those tests, as well as screenshots and dmesgs, because I have it all ready for you if you need them (like all the times before). Please let me know if you need me to upload them. I do not upload these right now, because maybe what I wrote above is just enough for you.

If course if you need any kind of more debug info I will be very happy to help.

best regards
Janek Kozicki
Comment 155 Janek Kozicki 2016-02-10 14:25:24 UTC
Created attachment 203311 [details]
dmesg after failed resume with forbid_rcu patch
Comment 156 Janek Kozicki 2016-02-10 14:26:01 UTC
Created attachment 203321 [details]
dmesg after succesful resume with forbid_rcu patch
Comment 157 Janek Kozicki 2016-02-10 14:27:36 UTC
Created attachment 203331 [details]
screenshot of critical moment during successful resume with forbid_rcu patch 1/2
Comment 158 Janek Kozicki 2016-02-10 14:28:08 UTC
Created attachment 203341 [details]
screenshot of critical moment during successful resume with forbid_rcu patch 2/2
Comment 159 Janek Kozicki 2016-02-10 14:30:42 UTC
Created attachment 203351 [details]
screenshot of critical moment during failed resume with forbid_rcu patch 1/2
Comment 160 Janek Kozicki 2016-02-10 14:31:17 UTC
Created attachment 203361 [details]
screenshot of critical moment during failed resume with forbid_rcu patch 2/2
Comment 161 Janek Kozicki 2016-02-10 14:33:30 UTC
Hi,

I hope that you have seen my answer from 10 days ago. I did four more tests hoping to get some more useful debug info. I got exactly the same results. This time I upload them, although I don't know if those are necessary (you didn't say).

Again I used kernel 4.2 with following patches:
+ s4_debug_after_cpuoffline.diff from comment 63
+ forbid_rcu_during_s4.diff      from comment 153
+ https://git.kernel.org/cgit/linux/kernel/git/tj/wq.git/commit/?h=for-4.3&id=1dadafa86a779884f14a6e7a3ddde1a57b0a0a65

kernel config was the same as before in attachment 198241 [details]

I have uploaded the screenshot just before failure in attachment 203351 [details] and two screenshots just before and just after the same moment, but during a success in attachment 203331 [details] and attachment 203341 [details]
Also there are two dmesgs from success (attachment 203321 [details]) and failure (attachment 203311 [details])
You can download a full video from: http://www.pg.gda.pl/~jkozicki/hibernate_problems/resume_with_forbid_rcu.mpg

Please tell me how to proceed. I am so desperate to have working hibernation on my computer. I will try any patches that you throw at me, I want to give all kind of debug information that we can get from the computer.

best regards
Janek Kozicki
Comment 162 Chen Yu 2016-02-22 01:47:30 UTC
Yes, I saw your info, I did not reply to you because currently I've setup a server enviroment similar to yours(with 48 cpus), it seems sometimes I can reproduce your problem in my system, apparently the hibernation might not be robust enough for dealing with server platform, I'll do some investigation on it.
Comment 163 Janek Kozicki 2016-02-22 10:47:36 UTC
Wow, this is great news! Thank you very much.

It may be possible that processor brand, or speed is also important. I have 32 cores of CPU E5-2687W clocked 3.4GHz. Maybe for example it will be useful to throttle down all CPUs to 800MHz (or slower) while resume is going on?

Also, maybe that will be useful for you:


processor       : 31
vendor_id       : GenuineIntel
cpu family      : 6
model           : 62
model name      : Intel(R) Xeon(R) CPU E5-2687W v2 @ 3.40GHz
stepping        : 4
microcode       : 0x416
cpu MHz         : 3207.023
cache size      : 25600 KB
physical id     : 1
siblings        : 16
core id         : 11
cpu cores       : 8
apicid          : 55
initial apicid  : 55
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm ida arat epb pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt
bugs            :
bogomips        : 6801.77
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:
Comment 164 Janek Kozicki 2016-02-22 11:37:10 UTC
I just got some other idea:

there are physically two CPU dies on my motherboard. Maybe before activating the multitude of cores in a CPU the whole CPU die must be activated first?

And the same goes about deactivating them? First disable all cores, then disable the whole CPU die?

I mean, maybe it crashes because some subsystem in a CPU die (not CPU-cores) that is responsible for waking/managing cores is still active, while it shouldn't?

best regards
Janek Kozicki
Comment 165 Janek Kozicki 2016-03-15 11:41:51 UTC
Hi,

did you have any progress with this problem? I am using currently your hibernate_mapping.diff patch. But in fact in barely helps - during past month I had only 1 success over about 20 (real-world usage conditions) tries.

Do you need to me try something? Or maybe should I upgrade my kernel (I still use 4.2.0)? Or anything else?

best regards
Janek Kozicki
Comment 166 Janek Kozicki 2016-03-23 09:35:47 UTC
Hi,

it just occurred to me, that maybe you should compare the resume code with that in kernel 3.2. Because (as you can see in Comment 2 ) with kernel 3.2 had 80% success rate for many months.

I cannot downgrade now, due to many dependencies in my system, but I would if I could, just to have higher success rate.

best regards
Janek Kozicki
Comment 167 Janek Kozicki 2016-03-23 09:37:13 UTC
oops, I meant Comment 3
Comment 168 Janek Kozicki 2016-04-29 11:00:48 UTC
Hi,

sorry for bothering you this much, but could you tell me how is your progress with your 48 CPU workstation with respect to this bug? Is there really anything I could do to help here?

thank you very much for your efforts
Janek Kozicki
Comment 169 Chen Yu 2016-05-11 09:19:46 UTC
According to https://bugzilla.kernel.org/show_bug.cgi?id=116941,
there is a fix in 4.5, so would you please try if latest kernel 4.6.0-rc7 work for you? thanks.
Comment 170 Janek Kozicki 2016-05-12 16:51:38 UTC
Created attachment 216131 [details]
kernel config for 4.6.0-rc7
Comment 171 Janek Kozicki 2016-05-12 16:52:40 UTC
Created attachment 216141 [details]
messages before reboot on failed resume
Comment 172 Janek Kozicki 2016-05-12 16:53:34 UTC
Created attachment 216151 [details]
messages after freeze during 2nd hibernation
Comment 173 Janek Kozicki 2016-05-12 16:55:02 UTC
Hi,

thank you very much for your reply. I tried kernel 4.6.0-rc7 with following kernel command line:

ro resume=/dev/sda2 init=/bin/bash no_console_suspend ignore_loglevel nomodeset modprobe.blacklist=i915 initcall_debug dyndbg='file drivers/base/syscore.c +p'

and the kernel config in the attachment 216131 [details] which was copied from previous kernel 4.2 and contains the options that you requested:

CONFIG_USB_XHCI_HCD=y
CONFIG_USB_EHCI_HCD=y
CONFIG_USB_UHCI_HCD=y
CONFIG_SATA_AHCI=y
CONFIG_PM_TRACE_RTC=y

I tried 6 times, and in total I had 4 failures and 2 successes. The failures are of two different kinds:

1. the failure during resume (like earlier in this bugreport)
   - the video from failed resume is at http://janek.kozicki.pl/hibernate_problems/failed_resume_kernel_4_6_rc7.mpg
   - messages right before reboot on failed resume is in attachment 216141 [details]

2. there is always a complete system freeze during 2nd hibernation (when it is attempted for the 2nd time "alt-sysrq s u b" is not working)
   - the video of freeze during hibernation is at http://janek.kozicki.pl/hibernate_problems/failed_2nd_hibernate_kernel_4_6_rc7.mpg
   - messages after freeze during 2nd hibernation is in attachment 216151 [details]

please let me know if there is anything more that I could do here.
Janek Kozicki
Comment 174 Chen Yu 2016-06-08 21:44:34 UTC
Hanek, could you please test with the following patch on top of latest kernel?
https://patchwork.kernel.org/patch/9158227/
Comment 175 Janek Kozicki 2016-06-09 20:36:27 UTC
Created attachment 219431 [details]
resume failure messages 4.7-rc2 with patch/9158227/
Comment 176 Janek Kozicki 2016-06-09 20:37:53 UTC
Created attachment 219441 [details]
resume failure dmesg PM_TRACE 4.7-rc2 with patch/9158227/
Comment 177 Janek Kozicki 2016-06-09 20:40:47 UTC
Created attachment 219451 [details]
hibernation failure messages 4.7-rc2 with patch/9158227/
Comment 178 Janek Kozicki 2016-06-09 20:42:44 UTC
Created attachment 219461 [details]
hibernation failure PM_TRACE dmesg 4.7-rc2 with patch/9158227/
Comment 179 Janek Kozicki 2016-06-09 20:47:07 UTC
Created attachment 219471 [details]
hibernation failure messages2 4.7-rc2 with patch/9158227/
Comment 180 Janek Kozicki 2016-06-09 20:49:54 UTC
Created attachment 219481 [details]
hibernation failure PM_TRACE dmesg2 4.7-rc2 with patch/9158227/
Comment 181 Janek Kozicki 2016-06-09 20:51:26 UTC
Created attachment 219491 [details]
resume failure4 messages 4.7-rc2 with patch/9158227/
Comment 182 Janek Kozicki 2016-06-09 21:04:57 UTC
Hi,

Thank you very much for your reply. Unfortunately that patch didn't help.

I tried kernel 4.7-rc2 with https://patchwork.kernel.org/patch/9158227/ and same .config and command line as in Comment 173.

I tried 6 times and I had 2 successes and 4 failures. Including always a failure on 2nd hibernation attempt (after a successful resume). This time I also recorded the pm_trace magic numbers.

1. resume failure: photo of the last messages on the screen is on the attachment 219431 [details]
   the dmesg from this failure is in attachment 219441 [details]
   [    5.509189]   Magic number: 13:867:789

2. next I had a successful resume, so I tried to hibernate for the 2nd time, and it froze during the hibernation (I had to press the reset button, alt-sysrq didn't work). The last messages on the screen are in attachment 219451 [details] and the dmesg with pm_trace after I pressed the reset button is in attachment 219461 [details]
   [    5.569650]   Magic number: 13:481:227

3. next I had a successful resume and a failed hibernate on second time with messages in attachment 219471 [details] which are very similar to previous messages on failed 2nd hibernate and dmesg with PM_TRACE in attachment 219481 [details]
   [    5.548926]   Magic number: 13:481:227

4. finally I had a failed resume with last messages captured in the video just before the reboot are in the attachment 219491 [details] , I didn't check pm_trace this time though, unless that's this number (which I have in my dmesg right now as I am writing this report):
   [    2.222662]   Magic number: 13:867:789

I have also an almost complete video documentation of my today's attempts in case if you need to know some other details here.

This concludes my test of 4.7-rc2 with patch/9158227/.

=====
Also, I don't know if that's worth mentioning, but in last month I had four successful resumes in a row which is so much unlikely and improbable according to my previous statistics that I might have found something that can help us. After doing testing in Comment 173 with kernel 4.6-rc7 I had problems with getting dkms to compile nvidia drivers for 4.6-rc7. And because I want to use my 3 LCD screens I went back to 4.2 with working nvidia drivers. The exact same version that I used in Comment 151. I use this version because I had best statistics with it so far. This is your hibernate_mapping.diff from Comment 99. To summarize it is:

kernel 4.2 with
- hibernate_mapping.diff (Comment 99)
- small nvidia related patch https://git.kernel.org/cgit/linux/kernel/git/tj/wq.git/commit/?h=for-4.3&id=1dadafa86a779884f14a6e7a3ddde1a57b0a0a65

The only unusual thing that I did before those four successful resumes was that I started windows 7 and then clicked "reboot computer" in the windows start menu. Afterwards I did let linux to perform resume from hibernation. On fifth attempt this failed. Probability of four successful resumes according to my statistics from Comment 151 is only 8%.

Concluding my observations regarding kernel 4.2: maybe windows 7 is setting some CPU registers or some other bits when it is rebooting the computer in such a way that linux can perform a successful resume.

best regards
Janek Kozicki
Comment 183 Chen Yu 2016-06-16 00:57:29 UTC
Janek,
Recently Rafael has fixed a resume issue due to incorrect page table copy-back, without which might copy the wrong pages to original address, would you please check if the following patch work for you(it should be in the coming upstream kernel)
https://patchwork.kernel.org/patch/9172981/
Comment 184 Janek Kozicki 2016-06-17 12:44:23 UTC
Created attachment 220441 [details]
resume failed photo 0: 4.7-rc3 + patch/9172981
Comment 185 Janek Kozicki 2016-06-17 12:45:10 UTC
Created attachment 220451 [details]
resume failed photo 1: 4.7-rc3 + patch/9172981
Comment 186 Janek Kozicki 2016-06-17 12:46:00 UTC
Created attachment 220461 [details]
resume failed dmesg 1 (PM_TRACE): 4.7-rc3 + patch/9172981
Comment 187 Janek Kozicki 2016-06-17 12:46:33 UTC
Created attachment 220471 [details]
resume failed photo 2: 4.7-rc3 + patch/9172981
Comment 188 Janek Kozicki 2016-06-17 12:47:27 UTC
Created attachment 220481 [details]
resume failed dmesg 2 (PM_TRACE): 4.7-rc3 + patch/9172981
Comment 189 Janek Kozicki 2016-06-17 12:48:09 UTC
Created attachment 220491 [details]
hibernation failed photo 3: 4.7-rc3 + patch/9172981
Comment 190 Janek Kozicki 2016-06-17 12:48:43 UTC
Created attachment 220501 [details]
hibernation failed dmesg 3 (PM_TRACE): 4.7-rc3 + patch/9172981
Comment 191 Janek Kozicki 2016-06-17 12:49:19 UTC
Created attachment 220511 [details]
resume failed photo 4: 4.7-rc3 + patch/9172981
Comment 192 Janek Kozicki 2016-06-17 12:49:56 UTC
Created attachment 220521 [details]
resume failed dmesg 4 (PM_TRACE): 4.7-rc3 + patch/9172981
Comment 193 Janek Kozicki 2016-06-17 12:56:19 UTC
Hi,

thank you very much for another thing to try. Unfortunately this didn't help either.

I tried https://patchwork.kernel.org/patch/9172981/ with kernel 4.7-rc3 using exactly the same command line and methodology as in comment 182 and comment 173.

I tried 5 times and I had 1 success and 4 failures. Including one failure on 2nd attempt (after that single successful resume).

0. failed resume photo in attachment 220441 [details], no dmesg this time.
1. again failed resume photo in attachment 220451 [details], dmesg in attachment 220461 [details]
   [    5.470778]   Magic number: 13:867:789
2. failed resume photo in attachment 220471 [details], dmesg in attachment 220481 [details]
   [    5.517879]   Magic number: 13:317:739
3. failed hibernation phton in attachment 220491 [details], dmesg in attachment 220501 [details]
   [    5.558095]   Magic number: 13:928:176
4. failed resume photo in attachment 220511 [details], dmesg in attachment 220521 [details]
   [    2.217589]   Magic number: 13:317:739

please let me know if there is anything I could do here. I have full video documentation of these attempts in case if you need it.

How is your 48-CPU workstation - can you still reproduce the bug there? How much RAM does it have? I have 64GB and maybe having huge RAM is somehow related to this problem?

best regards
Janek Kozicki
Comment 194 Chen Yu 2016-07-01 02:38:41 UTC
Janek, do you have a chance to test a new version of patch:
https://patchwork.kernel.org/patch/9202321/
I test this patch on my xeon machine, it passed a 10 rounds of hibernation, although I did not test too much.
thanks for your effort.
Comment 195 Chen Yu 2016-07-01 02:55:40 UTC
BTW, I'm using 'nomodeset text' to bypass the video, you can check by previous method that boots up a minimal system. (4.7-rc5). cross the finger...
Comment 196 Janek Kozicki 2016-07-01 08:57:00 UTC
Hi,

thanks a lot for another chance to try. I am compiling kernel  	4.7-rc5 with patch/9202321   right now. I will let you know how it works in few hours. In comment 193, 182, 173 I was using kernel command line:

ro resume=/dev/sda2 init=/bin/bash no_console_suspend ignore_loglevel nomodeset modprobe.blacklist=i915 initcall_debug dyndbg='file drivers/base/syscore.c +p'

I tried to use 'nomodeset text' earlier but it produced kernel panic in Comment 85, I will try it again.
Comment 197 Chen Yu 2016-07-01 09:06:25 UTC
ok, just using the params you once used to record the video is ok.
Actually when I recheck the code today, I found another potential problem that,
the nonboot cpus may jump into address with inconsistent page table, which might cause the system reset due to broken address space. Anyway, please first check if this patch makes things better, and I'll dig into the code to confirm if above issue exist and how to fix it.
Comment 198 Chen Yu 2016-07-01 09:07:47 UTC
BTW, when the problem was reproduced, it reboot after some cpus are brought back online , right?
Comment 199 Janek Kozicki 2016-07-01 09:59:02 UTC
It definitely was like this in kernel 4.2, eg. see attachment 201801 [details]

But in recent kernels, I don't know, the last one that I tried 4.7-rc3 eg. in attachment 220451 [details] it just reboots, but it is not printing that some "CPU is up" so I don't know if it reboots when some CPU is up. Or if it reboots because of something different.
Comment 200 Janek Kozicki 2016-07-01 10:01:00 UTC
ok, kernel compiled, I'm trying right now. Will be back in less than an hour or so.
Comment 201 Chen Yu 2016-07-01 10:06:16 UTC
(In reply to Janek Kozicki from comment #199)
> It definitely was like this in kernel 4.2, eg. see attachment 201801 [details]
> [details]
> 
> But in recent kernels, I don't know, the last one that I tried 4.7-rc3 eg.
> in attachment 220451 [details] it just reboots, but it is not printing that
> some "CPU is up" so I don't know if it reboots when some CPU is up. Or if it
> reboots because of something different.

Well, if it did not go to cpu is up, maybe it triggered exception at other place, one of the reason is because recently an commit has caused a regression, and this problem can be fixed by 
https://patchwork.kernel.org/patch/9208541/ (not upstream yet).
Comment 202 Janek Kozicki 2016-07-01 11:29:37 UTC
Wow, there is a big improvement with 4.7-rc5 + patch/9202321

I had 10 successful hibernate+resume cycles in a row!

Also I had 10 failed "2nd hibernate" attempts in a row, it always failed with this magic number: 
[    5.521951]   Magic number: 13:928:176

I am now encoding the video of the 10th run and I will upload the "failed 2nd hibernate" messages screenshot + dmesg in few minutes.

Should I try 4.7-rc5 + patch/9202321 + patch/9208541 ?
Comment 203 Janek Kozicki 2016-07-01 11:39:39 UTC
Created attachment 221631 [details]
2nd hibernate attempt failed messages 4.7-rc5+patch9202321
Comment 204 Janek Kozicki 2016-07-01 11:40:17 UTC
Created attachment 221641 [details]
2nd hibernate attempt failed dmesg after reboot 4.7-rc5+patch9202321
Comment 205 Janek Kozicki 2016-07-01 11:44:09 UTC
OK. I uploaded the video to http://janek.kozicki.pl/hibernate_problems/2nd_hibernate_failed_4.7-rc5_patch9202321.mpg  on this video you will see: a successful resume and a failed 2nd hibernate.

Also I uploaded a photo + dmesg in last two attachments. Those are about the failed 2nd hibernate.

Now I am compiling 4.7-rc5 + patch/9202321 + patch/9208541 and I will let you know in an hour or so.
Comment 206 Janek Kozicki 2016-07-01 13:02:38 UTC
https://patchwork.kernel.org/patch/9208541/ is an improvement too!

I had 4 successful "2nd hibernates" with it, and 2 failed "2nd hibernates".

When 2nd hibernate fails it produces exactly the same error message as in attachment 221631 [details] , I forgot to check PM_TRACE this time though. My guess is that it would be the same as in Comment 202. But of course I can check it if you need me to.

Also I uploaded the full video of my attempts with 4.7-rc5 + patch/9202321 + patch/9208541 it is:

http://janek.kozicki.pl/hibernate_problems/2nd_hibernate_sometimes_failed_4.7-rc5_patch9202321_patch9208541.mpg

To summarize shortly:

4.7-rc5 + patch/9202321   produces:
* 10 consecutive successes on 1st hibernate+resume in a row
* 10 consecutive failures  on 2nd hibernate

4.7-rc5 + patch/9202321 + patch/9208541  produces:
* 2 consecutive successes on 1st         hibernate+resume in a row
* 4 consecutive successes on 2nd,3rd,4th hibernate+resume in a row
* 2 failures on 2nd hibernate attempts (actually it was a 2nd, and a 5th attempt)

Currently that's the best kernel in the history of this bug for me.
Let's push it a little further and make it 100% successful! :)
Comment 207 Chen Yu 2016-07-01 14:13:16 UTC
OK, the patch at #Comment 194 looks address the resume-reboot issue, and the hibernate suspend failure may be another new issue, could you check the following step:

echo core > /sys/power/pm_test
echo disk > /sys/power/state
and wait for 5 minutes, if everything is ok, the system will come back.
Comment 208 Chen Yu 2016-07-01 14:30:19 UTC
plz wait a moment, I'll provide a debug patch.
Comment 209 Chen Yu 2016-07-01 14:42:20 UTC
Created attachment 221651 [details]
hibernation snapshot option
Comment 210 Chen Yu 2016-07-01 14:44:14 UTC
OK, please firstly provide the dmesg of this log:
echo core > /sys/power/pm_test
echo disk > /sys/power/state
wait for 5 seconds, the system will return.

let's see what should be the next phase after i8237_resume
(without #Comment 208 applied)
Comment 211 Janek Kozicki 2016-07-01 16:44:14 UTC
I am not sure if I understood you correctly:

I will do this during "2nd hibernation attempt":

#   echo core > /sys/power/pm_test
#   echo disk > /sys/power/state

then I will send you dmesg.

And I will do this with: 4.7-rc5 + patch/9202321 +  attachment 221651 [details] ; correct?

I am compiling kernel now. When I do the thing described above I can try more variations, like 4.7-rc5 + patch/9202321 + patch/9208541  + attachment 221651 [details], or trying to do this on "1st hibernation attempt".
Comment 212 Janek Kozicki 2016-07-01 17:54:13 UTC
Created attachment 221671 [details]
we check what is after i8237A_resume during hibernation: it is ioapic_resume
Comment 213 Janek Kozicki 2016-07-01 17:56:47 UTC
OK, I used 4.7-rc5 + patch/9202321 +  attachment 221651 [details] and I did as you asked me to:

   echo core > /sys/power/pm_test
   echo disk > /sys/power/state

However I had to do this on "1st hibernation attempt" because at "2nd hibernation" it froze and alt-sysrq didn't work.

The dmesg is in attachment 221671 [details].

[   94.323870] PM: Calling i8237A_resume+0x0/0x90
[   94.324008] PM: Calling ioapic_resume+0x0/0xa0

The next phase after i8237A_resume is ioapic_resume

Is this exactly what you needed?
Comment 214 Chen Yu 2016-07-02 01:58:38 UTC
(In reply to Janek Kozicki from comment #213)
> OK, I used 4.7-rc5 + patch/9202321 +  attachment 221651 [details] and I did
> as you asked me to:
> 
>    echo core > /sys/power/pm_test
>    echo disk > /sys/power/state
> 
> However I had to do this on "1st hibernation attempt" because at "2nd
> hibernation" it froze and alt-sysrq didn't work.
> 
> The dmesg is in attachment 221671 [details].
> 
> [   94.323870] PM: Calling i8237A_resume+0x0/0x90
> [   94.324008] PM: Calling ioapic_resume+0x0/0xa0
> 
> The next phase after i8237A_resume is ioapic_resume
> 
> Is this exactly what you needed?
Yes.
BTW, if without any patch applied(especially without patch/9202321),
will this problem reproduce too(hibernatin failed before ioapic_resume)?
Comment 215 Janek Kozicki 2016-07-02 09:30:57 UTC
Yes, this problem reproduces without patch/9202321 in fact we have stumbled upon it as far back as in kernel 4.6.0-rc7. We just didn't pay much attention to it, because it was difficult to get that far (past the first resume). Here is a short summary:

* 4.6.0-rc7 without any extra patches see attachment 216151 [details] in Comment 173
* 4.7-rc2   with patch/9158227/ see attachment 219471 [details] in Comment 182
* 4.7-rc3   with patch/9172981/ see attachment 220491 [details] in Comment 193

However it is interesting to see, that this problem did not occur in kernel 4.2 in Comment 35. Hard to tell if it's a regression though, because things were much more random back then.

PS: If you are viewing this message through web interface https://bugzilla.kernel.org/show_bug.cgi?id=82291 the words like "attachment 220491 [details]" are converted to urls to the photo of those messages "PM: Calling i8237A_resume+0x0/0x90".
Comment 216 Chen Yu 2016-07-02 11:38:37 UTC
Created attachment 221761 [details]
track i8237
Comment 217 Chen Yu 2016-07-02 11:41:39 UTC
OK, I've attached a debug patch to track the code path when suspend failed, please provide dmesg after boot up, and the picture when the problem is reproduced. (without other patches, just easy to reproduce is ok)
Comment 218 Chen Yu 2016-07-02 11:45:53 UTC
And another method is to disabled i8239a by setting CONFIG_ISA_DMA_API=n in kernel and recompile to see if there is any difference.
Comment 219 Janek Kozicki 2016-07-02 11:56:15 UTC
OK, in order to reproduce the hibernation suspend freeze problem I need to compile 4.7-rc5 + patch/9202321 (otherwise I wouldn't get past the first resume), then with your attachment 221761 [details] to track i8237.

So I will compile kernel twice:

* 1st I will compile kernel:
    4.7-rc5 + patch/9202321 + attachment 221761 [details]
  and I will give dmesg after bootup and picture after freeze

* 2nd I will compile kernel:
    4.7-rc5 + patch/9202321
  with CONFIG_ISA_DMA_API=n to see if there's any difference.

I am compiling now, give me about an hour to finish tests.
Comment 220 Chen Yu 2016-07-02 13:17:44 UTC
It looks like my debug patch would cause exception, please wait a minute.
Comment 221 Janek Kozicki 2016-07-02 13:18:39 UTC
Ah, ok. I was just about to reboot, because compilation of both kernels just finished. So I will wait for your reply.
Comment 222 Chen Yu 2016-07-02 13:46:54 UTC
Created attachment 221771 [details]
v2 debug i8237a
Comment 223 Janek Kozicki 2016-07-02 13:47:47 UTC
OK, I will use this one instead.
Comment 224 Janek Kozicki 2016-07-02 15:13:34 UTC
OK, so I have full results:

* 4.7-rc5 + patch/9202321 + attachment 221771 [details]
  - I did a successful suspend resume cycle
  - I saved dmesg afterwards
  - then I initiated "2nd hibernation" cycle and got system freeze

  I am uploading the dmesg and the photo in next post.

* 4.7-rc5 + patch/9202321 with option CONFIG_ISA_DMA_API=n 
  - I did 10 consecutive successful suspend-resume cycles in a row. It looks like it totally works! Maybe I could use my computer with CONFIG_ISA_DMA_API=n ? Currently it seems to work.
Comment 225 Janek Kozicki 2016-07-02 15:15:25 UTC
Created attachment 221781 [details]
dmesg 4.7-rc5+patch_9202321+attachment_221771_v2_debug_i8237a

please note that this dmesg is after first successful resume. The system freezes on 2nd attempt. So the messages from second attempt will be in the next attachment on a photo.
Comment 226 Janek Kozicki 2016-07-02 15:16:27 UTC
Created attachment 221791 [details]
freeze photo 4.7-rc5+patch_9202321+attachment_221771_v2_debug_i8237a
Comment 227 Janek Kozicki 2016-07-02 15:46:19 UTC
It seems to me that it freezes on DMA1_RESET

[   79.626621] PM: Calling i8237A_resume+0x0/0x150
[   79.626692] PM: i8237A_resume before dma lock
[   79.626764] PM: i8237A_resume before DMA1_RESET
Comment 228 Janek Kozicki 2016-07-02 15:47:03 UTC
uh, sorry. It freezes on "dma lock" :)
Comment 229 Chen Yu 2016-07-02 15:54:56 UTC
(In reply to Janek Kozicki from comment #228)
> uh, sorry. It freezes on "dma lock" :)

correct, there is a deadlock. let me think about it..
Comment 230 Chen Yu 2016-07-02 17:10:24 UTC
Created attachment 221801 [details]
debug - show lock trace when deadlock

please set CONFIG_DEBUG_SPINLOCK = y, and with this patch + 9202321 applied.

1. after boot up, try hibernate/resume, then if the system succeed to resume back for the first time, try:
echo q > /proc/sysrq    // for at least 10 times,
to see if it is reproduced, and gather possible dump message if 
a deadlock occurs.

2. if echo q > /proc/sysrq can not reproduce the problem, then
you can just try hibernate/resume and capture the screen for possible
output.

(too late.. go to sleep now, will be back tomorrow)
Comment 231 Janek Kozicki 2016-07-02 21:27:22 UTC
Hi,

I did exactly as you said, and I couldn't reproduce the problem. Even the 2nd hibernation started to work (I did it 3 times in a row).

It looks like your attachment 221801 [details] somehow fixed this bug.

To make everything clear it was:
  4.7-rc5 + patch/9202321 + attachment 221801 [details]
And CONFIG_DEBUG_SPINLOCK = y

I did invoke `echo q > /proc/sysrq-trigger` 1000 times using a bash loop:

  for i in {1..1000}; do echo q > /proc/sysrq-trigger ; done

In next two posts I am attaching the dmesg (but no freeze unfortunately) and also the kernel config, so that you can make sure that CONFIG_DEBUG_SPINLOCK = y , CONFIG_ISA_DMA_API=y  and other stuff you might needed is enabled.
Comment 232 Janek Kozicki 2016-07-02 21:30:32 UTC
Created attachment 221811 [details]
dmesg after 1000 invocations of echo q > /proc/sysrq-trigger; 4.7-rc5+patch9202321+attachment221801 [details]
Comment 233 Janek Kozicki 2016-07-02 21:32:16 UTC
Created attachment 221821 [details]
kernel config for 4.7-rc5+patch9202321+attachment221801 [details]
Comment 234 Chen Yu 2016-07-03 00:30:15 UTC
(In reply to Janek Kozicki from comment #231)
> Hi,
> 
> I did exactly as you said, and I couldn't reproduce the problem. Even the
> 2nd hibernation started to work (I did it 3 times in a row).
> 
> It looks like your attachment 221801 [details] somehow fixed this bug.
That is a debug patch. With DEBUG_SPINLOCK set, the implementation of spinlock
will change a bit, and maybe the previous deadlock was triggered in a special sequence
> 
> To make everything clear it was:
>   4.7-rc5 + patch/9202321 + attachment 221801 [details]
> And CONFIG_DEBUG_SPINLOCK = y
> 
> I did invoke `echo q > /proc/sysrq-trigger` 1000 times using a bash loop:
> 
>   for i in {1..1000}; do echo q > /proc/sysrq-trigger ; done
> 
> In next two posts I am attaching the dmesg (but no freeze unfortunately) and
> also the kernel config, so that you can make sure that CONFIG_DEBUG_SPINLOCK
> = y , CONFIG_ISA_DMA_API=y  and other stuff you might needed is enabled.
ok, please do not use sysrq, and use a script to do the hibernation/resume test. You can use something like:

while(loop < 1000)
rtcwake -m disk -s 30
sleep 30
Comment 235 Chen Yu 2016-07-03 03:33:24 UTC
Created attachment 221841 [details]
dma lock debug

OK, if it can not be reproduced when CONFIG_DEBUG_SPINLOCK is set, then let's revert it to your original config.

And please help do the following:
1. patch9202321 + this patch attached. without CONFIG_DEBUG_SPINLOCK.
    CONFIG_ISA_DMA_API=y

2. try to reproduce it, and provide the picture.
Comment 236 Chen Yu 2016-07-03 03:35:50 UTC
(In reply to Janek Kozicki from comment #224)
> * 4.7-rc5 + patch/9202321 with option CONFIG_ISA_DMA_API=n 
>   - I did 10 consecutive successful suspend-resume cycles in a row. It looks
> like it totally works! Maybe I could use my computer with
> CONFIG_ISA_DMA_API=n ? Currently it seems to work.

If you disable this option, maybe some DMA features are not enabled
Comment 237 Janek Kozicki 2016-07-03 07:43:43 UTC
I just woke up. I am preparing to compile kernel with original config from Comment 206 with:

4.7-rc5 + patch/9202321 + attachment 221841 [details]

are you sure that I need to wait 30 seconds between each call to rtcwake, which itself waits 30 seconds? Executing this 1000 times will take 16 hours.
Comment 238 Chen Yu 2016-07-03 08:40:53 UTC
(In reply to Janek Kozicki from comment #237)
> I just woke up. I am preparing to compile kernel with original config from
> Comment 206 with:
> 
> 4.7-rc5 + patch/9202321 + attachment 221841 [details]
> 
> are you sure that I need to wait 30 seconds between each call to rtcwake,
> which itself waits 30 seconds? Executing this 1000 times will take 16 hours.

no need to wait for 30 seconds.
Comment 239 Chen Yu 2016-07-03 08:43:09 UTC
(In reply to Janek Kozicki from comment #237)
> I just woke up. I am preparing to compile kernel with original config from
> Comment 206 with:
> 
> 4.7-rc5 + patch/9202321 + attachment 221841 [details]
> 
> are you sure that I need to wait 30 seconds between each call to rtcwake,
> which itself waits 30 seconds? Executing this 1000 times will take 16 hours.
You can try your previous method, if it can not be reproduced easily, we can use rtcwake.
Comment 240 Janek Kozicki 2016-07-03 08:56:32 UTC
Created attachment 221861 [details]
2nd hibernation freeze photo

yeah. I did use the previous method, mostly because I didn't know the command off the top of my head to mount /usr, where /usr/sbin/rtcwake resides. That's because I use mdadm raid + lvm2 config, and /etc/init/lvm2 start just didn't do the trick :) I will need to google this next time, so that I could mount /usr.

So here's what I did: 
  4.7-rc5 + patch/9202321 + attachment 221841 [details]
  - do one hibernate-resume cycle
  - invoke 2nd hibernate, make photo of system freeze. It's in this attachment.
I did this twice actually, and the messages were exactly the same both times.

So now with this debug patch we can reproduce the system freeze on 2nd hibernate.
Comment 241 Janek Kozicki 2016-07-03 08:59:55 UTC
if this photo is good for you, maybe we can skip using rtcwake, for now.
Comment 242 Janek Kozicki 2016-07-03 09:40:48 UTC
Created attachment 221871 [details]
System.map for 4.7.0-rc5 + patch9202321 + attachment221841 [details]

I don't know, maybe the System.map will be useful for you to check this:

ffffffff813cf3c0 T unregister_syscore_ops
ffffffff813cf400 T syscore_resume
ffffffff813cf580 T syscore_suspend

just in case I am attaching it.

Also, I tried to invoke `echo q > /proc/sysrq-trigger` but it did nothing in this kernel. Even after the first hibernate-resume cycle. The system did freeze though, with the same messages, when I tried again to do 2nd hibernate.
Comment 243 Chen Yu 2016-07-03 12:48:16 UTC
(In reply to Janek Kozicki from comment #242)
> Created attachment 221871 [details]
> System.map for 4.7.0-rc5 + patch9202321 + attachment221841 [details]
> 
> I don't know, maybe the System.map will be useful for you to check this:
> 
> ffffffff813cf3c0 T unregister_syscore_ops
> ffffffff813cf400 T syscore_resume
> ffffffff813cf580 T syscore_suspend
> 
> just in case I am attaching it.
OK, actually above result is reasonable, it shows the last_dma_owner who have grasped this dma lock is syscore_resume->i8237a_resume itself, after the 1st system resume.

But what the strange thing is, there is no current_dma_owner who currently hold the lock, which means, the last_dma_owner has already released this lock in 1st resume, but when the i8237_resume tries to grasp the lock for the 2nd time, it just hangs there.
1. After we set DEBUG_SPINLOCK, the problem disappeared, which shows that , there might be a problem in spinlock implementation, or
2. the content of dma lock is wrong after 1st resume?
3. cache/tlb problem?
> 
> Also, I tried to invoke `echo q > /proc/sysrq-trigger` but it did nothing in
> this kernel. Even after the first hibernate-resume cycle. The system did
> freeze though, with the same messages, when I tried again to do 2nd
> hibernate.

The patch above does not have the sysrq test, I'll add it to a new debug patch, please help test after the 1st resume.
Comment 244 Janek Kozicki 2016-07-03 13:13:46 UTC
of course, I am ready for more testing :) Just waiting for your debug patch.
Comment 245 Chen Yu 2016-07-03 16:10:53 UTC
Created attachment 221881 [details]
test dma_lock, dump more lockup info

please apply this patch, and
1. suspend to disk and resume for 1st,provide the dmesg.
2. suspend to disk for the 2nd(may hang), provide picture.

3. after above test, reboot the system,
4. suspend to disk and resume for 1st
5. echo q > /proc/sysrq-trigger to see if it can be reproduced, if hang please provide the picture.
Comment 246 Janek Kozicki 2016-07-03 16:11:57 UTC
OK, compiling kernel now.
Comment 247 Janek Kozicki 2016-07-03 17:10:12 UTC
Created attachment 221891 [details]
4.7-rc5 + patch/9202321 + attachment 221881 [details]

OK, so I did as you said with kernel 4.7-rc5 + patch/9202321 + attachment 221881 [details]

using the point numbering from your post above:

1. dmesg after 1st resume is in this attachment
2. photo of system freeze upon 2nd hibernation attempt is in next attachment

3. I rebooted system
4. I did 1st resume
5. two photos of system are in next two attachments. Interesting thing is that the system did not freeze after `echo q > /proc/sysrq-trigger`. The command just did not exit, and I was able to do `sysrq-s u b` to reboot. I suppose I would be even able to switch consoles and perform `shutdown -r` from another console, but consoles do not exist in init=/bin/bash environment.
Comment 248 Janek Kozicki 2016-07-03 17:12:21 UTC
Created attachment 221901 [details]
system freeze photo,  4.7-rc5 + patch/9202321 + attachment 221881 [details]
Comment 249 Janek Kozicki 2016-07-03 17:15:22 UTC
Created attachment 221911 [details]
echo q > /proc/sysrq-trigger after 1rs hibernate-resume cycle.

oops, I noticed that web interface for bugzilla interpreted the point numbers as attachment numbering.
Comment 250 Janek Kozicki 2016-07-03 17:16:49 UTC
Created attachment 221921 [details]
echo q > /proc/sysrq-trigger after 1rs hibernate-resume cycle. next photo
Comment 251 Janek Kozicki 2016-07-03 17:20:13 UTC
Created attachment 221931 [details]
echo q > /proc/sysrq-trigger after 1rs hibernate-resume cycle. I presses ctrl-alt-delete dozen times and I got this:

before pressing alt-sysrq s u b   I pressed ctrl-alt-del to see what happens, and I got this kind of panic message.
Comment 252 Janek Kozicki 2016-07-06 09:39:13 UTC
Hi,

just in case if my comment 247 was a bit too messy I am retyping it, but with proper attachment numbering:

kernel 4.7-rc5 + patch/9202321 + attachment 221881 [details]

using the point numbering from your post above:

1. dmesg after 1st resume is in attachment 221891 [details]
2. photo of system freeze upon 2nd hibernation attempt is in attachment 221901 [details]

3. I rebooted system
4. I did 1st resume
5. two photos of system after `echo q > /proc/sysrq-trigger` are in attachment 221911 [details] and attachment 221921 [details]. Interesting thing is that the system did not freeze after `echo q > /proc/sysrq-trigger`. The command just did not exit, and I was able to do `sysrq-s u b` to reboot.
6. I pressed ctrl-alt-del multiple times after that and I got panic message shown in attachment 221931 [details]

I hope this helps you with this weird problem.

I was wondering that maybe the dma channel gets locked during first hibernation, but the lock owner is somehow destroyed without unlocking first. Hence it is locked, but there is no owner of the lock. (in C++ I would say that class destructor is missing ;)
Comment 253 Chen Yu 2016-07-11 07:11:12 UTC
(In reply to Janek Kozicki from comment #252)
> Hi,
> 
> just in case if my comment 247 was a bit too messy I am retyping it, but
> with proper attachment numbering:
> 
> kernel 4.7-rc5 + patch/9202321 + attachment 221881 [details]
> 
> using the point numbering from your post above:
> 
> 1. dmesg after 1st resume is in attachment 221891 [details]
> 2. photo of system freeze upon 2nd hibernation attempt is in attachment
> 221901 [details]
> 
> 3. I rebooted system
> 4. I did 1st resume
> 5. two photos of system after `echo q > /proc/sysrq-trigger` are in
> attachment 221911 [details] and attachment 221921 [details]. Interesting
> thing is that the system did not freeze after `echo q >
> /proc/sysrq-trigger`. The command just did not exit, and I was able to do
> `sysrq-s u b` to reboot.
> 6. I pressed ctrl-alt-del multiple times after that and I got panic message
> shown in attachment 221931 [details]
> 
> I hope this helps you with this weird problem.
> 
> I was wondering that maybe the dma channel gets locked during first
> hibernation, but the lock owner is somehow destroyed without unlocking
> first. Hence it is locked, but there is no owner of the lock. (in C++ I
> would say that class destructor is missing ;)
Well, I think the log has showed that, the owner has already released the lock during resume, see:

[   58.565802] PM: Calling i8237A_resume+0x0/0x180
[   58.565875] last call addr is [          (null)]
[   58.565948] (1,bash) try to lock ffffffff81c9d140!
[   58.566021] i8237 grasp the lock
[   58.566153] (1,bash) just unlock ffffffff81c9d140!
[   58.566227] PM: Calling ioapic_resume+0x0/0xa0

It looks like using queue spinlock might cause this problem, I'm trying to write a patch to dump the queue spinlock elements, would you please try disable queue spinlock by setting CONFIG_QUEUED_SPINLOCKS=y and recompile the kernel, to see if there is any difference?
Comment 254 Chen Yu 2016-07-11 07:12:40 UTC
oops, should be CONFIG_QUEUED_SPINLOCKS=n
Comment 255 Janek Kozicki 2016-07-11 11:54:49 UTC
Hi,

I have a problem with setting CONFIG_QUEUED_SPINLOCKS=n, I edit to config file with vim, then when I start compilation I notice it is reverted back to CONFIG_QUEUED_SPINLOCKS=y.
Comment 256 Janek Kozicki 2016-07-11 12:01:28 UTC
even setting

CONFIG_ARCH_USE_QUEUED_SPINLOCKS=n
CONFIG_QUEUED_SPINLOCKS=n

does not help, they are both reverted to =y after I start compiling.
Comment 257 Chen Yu 2016-07-11 14:08:32 UTC
vim ./arch/x86/Kconfig

please delete this line and try again:
select ARCH_USE_QUEUED_SPINLOCKS
Comment 258 Janek Kozicki 2016-07-11 15:25:51 UTC
OK, I tried with 4.7.0-rc5 + patch/9202321 and without line select ARCH_USE_QUEUED_SPINLOCKS

It works - I did 9 consecutive suspend-resume cycles.
Comment 259 Chen Yu 2016-07-12 01:49:35 UTC
OK, I'll use your config and try if it can be reproduced in my enviroment.
BTW, are you using janek.listy.mailowe@gmail.com? I suggest we communicated offline, g+ ? then it might be more efficient?
Comment 260 Chen Yu 2016-07-12 07:42:56 UTC
OK, I've added your account, let's communicate on im.
Comment 261 Janek Kozicki 2016-07-12 08:27:37 UTC
Created attachment 222941 [details]
config used for latest recompile without ARCH_USE_QUEUED_SPINLOCKS
Comment 262 Janek Kozicki 2016-07-16 11:13:46 UTC
Created attachment 224051 [details]
kernel config for 4.7-rc5+patch9202321+debug-qspinlock
Comment 263 Chen Yu 2016-07-16 15:19:56 UTC
according to the System.map:
ffffffff81c9d080 b sleep_time_bin
ffffffff81c9d100 B dma_spin_lock

this might be caused by memory overwritten by pm_trace, which sets rtc to an unmeaning persist date, thus timekeeping_resume incorrectly overwrite the sleep_time_bin array, which just set the dma_spin_lock lock byte to 1, caused this problem. We might need to fix  tk_debug_account_sleep_time.
Comment 264 Chen Yu 2016-07-18 01:38:45 UTC
OK, Janek, I've sent the patch to a wider scope and let's wait for people's suggestion/feedback.
https://patchwork.kernel.org/patch/9233895/
I've marked this bug as resolved.
Comment 265 Chen Yu 2016-08-24 07:46:16 UTC
Commit-ID:  a4f8f6667f099036c88f231dcad4cf233652c824