Bug 188281 - [x86_64] [Nvidia driver triggered] IRQ affinity broken during hibernation. Toshiba SATELLITE PRO L770-116
Status: NEEDINFO
Alias: None
Product: Drivers
Classification: Unclassified
Component: Video(DRI - non Intel)
Hardware: Intel Linux
Importance: P1 normal
Assignee: Chen Yu
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-11-21 18:24 UTC by Thomas Mitterfellner
Modified: 2017-12-27 23:34 UTC

See Also:
Kernel Version: 4.8.9-1.gbe1f097-default
Subsystem:
Regression: No
Bisected commit-id:


Attachments
journal entries of failed hibernation event (156.73 KB, text/plain)
2016-11-21 18:24 UTC, Thomas Mitterfellner
journal entries of successful hibernation event (after resume from sleep) (222.04 KB, text/plain)
2016-11-21 18:25 UTC, Thomas Mitterfellner
Hardware info (hwinfo --all) (48.24 KB, text/plain)
2016-11-21 18:28 UTC, Thomas Mitterfellner
OpenSUSE default kernel config (4.4.57) (174.33 KB, text/x-mpsub)
2017-04-12 17:24 UTC, Thomas Mitterfellner
fix irq affinity during cpu offline (1.53 KB, text/plain)
2017-04-21 12:09 UTC, Chen Yu
Restore pci status for devices, otherwise the msi/msix affinity will get lost (436 bytes, text/plain)
2017-05-25 08:17 UTC, Chen Yu
Restore pci status for devices, otherwise the msi/msix affinity will get lost (443 bytes, text/plain)
2017-05-25 08:20 UTC, Chen Yu
Force power on the pci devices and restore their status across S4 (495 bytes, text/plain)
2017-05-27 09:28 UTC, Chen Yu
no irq fixing debug (1.98 KB, text/plain)
2017-05-31 06:30 UTC, Chen Yu
dmesg output (65.34 KB, text/plain)
2017-05-31 14:53 UTC, Thomas Mitterfellner
proc interrupts (3.36 KB, text/plain)
2017-05-31 14:54 UTC, Thomas Mitterfellner
lspci output (42.38 KB, text/plain)
2017-05-31 14:55 UTC, Thomas Mitterfellner
dmesg output after resume from hibernate (73.75 KB, text/plain)
2017-06-01 12:38 UTC, Thomas Mitterfellner
Fix irq affinity during CPU offline[v2] (2.08 KB, text/plain)
2017-06-20 16:16 UTC, Chen Yu
dmesg output after resume from hibernate 4.12.0-rc6 (72.27 KB, text/plain)
2017-06-23 17:10 UTC, Thomas Mitterfellner
dmesg out after 4th or 5th hibernation (111.84 KB, text/plain)
2017-06-24 19:00 UTC, Thomas Mitterfellner
all-in-one-debug-nvidia (4.89 KB, text/plain)
2017-06-26 04:28 UTC, Chen Yu
Logs from tests requested in comment #118 (108.91 KB, application/x-bzip)
2017-06-26 17:02 UTC, Thomas Mitterfellner

Description Thomas Mitterfellner 2016-11-21 18:24:44 UTC
Created attachment 245421 [details]
journal entries of failed hibernation event

(Original bug reported for openSUSE Leap 42.2, kernel 4.4.27: https://bugzilla.opensuse.org/show_bug.cgi?id=1010794)

When I try to hibernate (kernel 4.8.9), the computer freezes and has to be turned off eventually.

This is what I first see on the text console (these lines appear during ~20-40 sec):
do_IRQ: 3.67 No irq handler for vector
do_IRQ: 1.51 No irq handler for vector
ata1.00: revalidation failed (errno=-5)
ata3.00: revalidation failed (errno=-5)
do_IRQ: 1.51 No irq handler for vector
do_IRQ: 1.51 No irq handler for vector
do_IRQ: 1.51 No irq handler for vector
ata3.00: revalidation failed (errno=-5)
ata1.00: revalidation failed (errno=-5)

Then, all of a sudden, all these file system related messages appear, something like this (but I suspect that these are just follow-up errors, not the cause):

XFS (sda3): metadata I/O error, block 0x40 ("xfs_buf_isdone_callbacks") errors numblks 32
EXT4-fs (sda2): Delayed block allocation failed for inode 210 at logical offset 1210 with max blocks 2 with error 5
EXT4-fs (sda2): This should not happen!! Data will be lost 
Aborting journal on device sda2-8.
Buffer I/O error on dev sda2, logical block 7897088, lost sync page write
JBD2: Error -5 detected when updating journal superblock for -8.

Then the fan starts to blow and I have to turn the computer off.

BUT, the interesting thing is: when I first suspend the system to RAM (sleep) and resume, which works flawlessly, and do hibernation *after* that, it works. So suspending and resuming seems to (re)set some state which allows for hibernation to work afterwards.
Comment 1 Thomas Mitterfellner 2016-11-21 18:25:40 UTC
Created attachment 245431 [details]
journal entries of successful hibernation event (after resume from sleep)
Comment 2 Thomas Mitterfellner 2016-11-21 18:28:37 UTC
Created attachment 245441 [details]
Hardware info (hwinfo --all)
Comment 3 Thomas Mitterfellner 2016-11-21 18:48:39 UTC
One more note: this was definitely working with kernel 3.16.7-48.1 (openSUSE 13.2) and I *think* it was also working with kernel 4.1.34-33.1 (openSUSE Leap 42.1), but I'm not sure about the latter.
Comment 4 Thomas Mitterfellner 2016-11-21 21:33:28 UTC
I also tried this with nouveau only (rather than with the proprietary nvidia driver), with similar results. The difference is that I do not see the text console with the "do_IRQ: 3.67 No irq handler for vector" etc. messages but only a graphical screen with the mouse cursor. Then, after some 20-40 seconds, there is a short graphical message in bold letters (I couldn't read it fully) which says something about the screen locker not working, then I see the browser I had opened before hibernation again, including the text cursor, but no interaction is possible.
When I switch to a text console and try to log in, after pressing enter after having typed the login name, I immediately get a message like "Buffer I/O error on dev sda2, logical block 7897088" and can only power off.
Comment 5 Thomas Mitterfellner 2016-11-21 21:40:33 UTC
I forgot to add: with nouveau, hibernation also fails after having suspended and resumed first (unlike with nvidia).
Comment 6 Thomas Mitterfellner 2016-11-28 22:02:30 UTC
I can confirm now that hibernation works with 4.1.35, so looks like a regression.
Comment 7 Zhang Rui 2016-11-29 08:50:04 UTC
can you please run git bisect to find out which commit introduces the problem?
Comment 8 Thomas Mitterfellner 2016-11-29 12:06:21 UTC
Actually, I had already thought about this, but I would need a pointer to some instructions on how to do that best (probably, if it matters, how to do it on openSUSE). In principle, I know how to use git (although I haven't used the bisect feature, but I'm sure that's not too hard to use); what I would need to know is how to configure the kernel in order to get one that's similar to a default openSUSE one. It's been years since I've last done that. Could you help me here?
Comment 9 Thomas Mitterfellner 2016-11-29 20:20:48 UTC
OK, now I am completely confused: I downgraded the distribution to openSUSE Leap 42.1 (kernel-4.1.34), compiled the kernel from the git repo git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git using tag v4.4.27 and the config file from git://kernel.opensuse.org/kernel-source.git, tag rpm-4.4.27-1, installed and booted that kernel and guess what – hibernate and resume works!

What is going on here?
Comment 10 Takashi Iwai 2016-11-29 21:06:32 UTC
(In reply to Thomas Mitterfellner from comment #9)
> OK, now I am completely confused: I downgraded the distribution to openSUSE
> Leap 42.1 (kernel-4.1.34), compiled the kernel from the git repo
> git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git using
> tag v4.4.27 and the config file from
> git://kernel.opensuse.org/kernel-source.git, tag rpm-4.4.27-1, installed and
> booted that kernel and guess what – hibernate and resume works!
> 
> What is going on here?

openSUSE Leap 42.2 default kernel contains quite a few backports in addition to stable 4.4.x kernel, and it's possible that some of them do harm.  Please try to install kernel-vanilla.rpm from Leap repo, and check whether it works with this kernel (it should).

Also, if not done yet, please test with the latest upstream kernel, too.  You can try the kernels from OBS Kernel:stable and Kernel:HEAD repositories.  They should be installable on Leap system (but use nouveau for testing).
  http://download.opensuse.org/repositories/Kernel:/HEAD/standard/
  http://download.opensuse.org/repositories/Kernel:/stable/standard/

If the recent upstream shows the problem, then it's some of the recent changes that triggers the issue.  If the problem is only with openSUSE Leap 42.2 kernel, it must be the wrong patch there.
Comment 11 Zhang Rui 2016-11-30 01:52:25 UTC
(In reply to Thomas Mitterfellner from comment #9)
> OK, now I am completely confused: I downgraded the distribution to openSUSE
> Leap 42.1 (kernel-4.1.34), compiled the kernel from the git repo
> git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git using
> tag v4.4.27 and the config file from
> git://kernel.opensuse.org/kernel-source.git, tag rpm-4.4.27-1, installed and
> booted that kernel and guess what – hibernate and resume works!
> 
> What is going on here?

do you mean the problem is gone, with latest 4.4 stable and kernel config file from opensuse?
hmmm, does the problem still exist in the Latest upstream kernel?
Comment 12 Thomas Mitterfellner 2016-11-30 08:26:03 UTC
(In reply to Zhang Rui from comment #11)
> do you mean the problem is gone, with latest 4.4 stable and kernel config
> file from opensuse?

Yes, the problem is gone when I compile the kernel from the source from git.kernel.org, and using the openSUSE config file.
Comment 13 Thomas Mitterfellner 2016-11-30 09:00:01 UTC
(In reply to Takashi Iwai from comment #10)
> openSUSE Leap 42.2 default kernel contains quite a few backports in addition
> to stable 4.4.x kernel, and it's possible that some of them do harm.  Please
> try to install kernel-vanilla.rpm from Leap repo, and check whether it works
> with this kernel (it should).

What I don't understand is: I tried 4.8.9-1.gbe1f097-default from the kernel:stable repo on 42.2. If I understood it right, this is the upstream kernel without openSUSE specific patches, right? With this one, hibernation did not work. I know it's unlikely, but could it be a compiler bug?

What I could try while I'm still on 42.1, is to compile the 4.4.27 kernel not from git.kernel.org, but from kernel.opensuse.org repo.
Comment 14 Takashi Iwai 2016-11-30 09:12:04 UTC
(In reply to Thomas Mitterfellner from comment #13)
> (In reply to Takashi Iwai from comment #10)
> > openSUSE Leap 42.2 default kernel contains quite a few backports in
> addition
> > to stable 4.4.x kernel, and it's possible that some of them do harm. 
> Please
> > try to install kernel-vanilla.rpm from Leap repo, and check whether it
> works
> > with this kernel (it should).
> 
> What I don't understand is: I tried 4.8.9-1.gbe1f097-default from the
> kernel:stable repo on 42.2. If I understood it right, this is the upstream
> kernel without openSUSE specific patches, right? With this one, hibernation
> did not work. I know it's unlikely, but could it be a compiler bug?

No, no, it's VERY likely.  It implies that there is a regression in the recent kernel, and we took the regression to Leap default kernel together with fixes.

> What I could try while I'm still on 42.1, is to compile the 4.4.27 kernel
> not from git.kernel.org, but from kernel.opensuse.org repo.

Just try kernel-vanilla 4.4.x from Leap 42.2 as requested.

If it works, then try kernel-vanilla 4.8.x from OBS Kernel:stable.  If 4.4.x kernel-vanilla works while 4.8.x kernel-vanilla doesn't, it really indicates the upstream regression.
Comment 15 Thomas Mitterfellner 2016-11-30 20:43:21 UTC
> If it works, then try kernel-vanilla 4.8.x from OBS Kernel:stable.  If 4.4.x
> kernel-vanilla works while 4.8.x kernel-vanilla doesn't, it really indicates
> the upstream regression.

I did it. With interesting results:
kernel-vanilla-4.4.35-2.1.gb38a36f: hibernation works
kernel-vanilla-4.8.11-2.1.ge617052: hibernation does *not* work

(tested 2 times with each kernel, i.e. 4 separate boot procedures)

Just so I understand: the leap 42.2 default kernel contains some backports from the main kernel tree which obviously brings in the regression in the hibernation feature?

So should I bisect now between 4.4 and 4.8? Or can this be narrowed down to some versions/patches?
Comment 16 Chen Yu 2016-12-15 03:48:18 UTC
(In reply to Thomas Mitterfellner from comment #15)
> > If it works, then try kernel-vanilla 4.8.x from OBS Kernel:stable.  If
> 4.4.x
> > kernel-vanilla works while 4.8.x kernel-vanilla doesn't, it really
> indicates
> > the upstream regression.
> 
> I did it. With interesting results:
> kernel-vanilla-4.4.35-2.1.gb38a36f: hibernation works
> kernel-vanilla-4.8.11-2.1.ge617052: hibernation does *not* work
> 
> (tested 2 times with each kernel, i.e. 4 separate boot procedures)
> 
> Just so I understand: the leap 42.2 default kernel contains some backports
> from the main kernel tree which obviously brings in the regression in the
> hibernation feature?
> 
> So should I bisect now between 4.4 and 4.8? Or can this be narrowed down to
> some versions/patches?
BTW, I remember there have been several fixes in the hibernation code during the past few months. I'd suggest also checking whether this issue can be reproduced with upstream kernel 4.9, since I'm afraid some of the fixes have not gone into stable yet.
Comment 17 Thomas Mitterfellner 2016-12-15 08:16:49 UTC
OK, thanks for the information, I will try that. I had a hard time bisecting the kernel; the kernel would not boot to an X session when I compiled it with make localyesconfig, so I had to compile all modules (~40 min/build). Many of the kernels would not boot at all (I did >25 kernel builds) and at some point it seems I marked a kernel as bad because it would not hibernate but it seems this was a different problem so I did not find the real bad commit in the end.
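For the bisection trouble described above, here is a minimal sketch of how each test boot could be recorded. It assumes a clone of the mainline tree and uses v4.4 and v4.8 as good and bad endpoints, which is only an assumption based on the kernel-vanilla results earlier in this report. The helper names start_bisect and record_result and the GIT override variable are hypothetical. The point is `git bisect skip`: it tells git that a revision could not be tested, so a kernel that does not boot (or fails for an unrelated reason) is not marked as bad.

```shell
#!/bin/bash
# Sketch only. GIT can be overridden (e.g. GIT=echo) to preview the commands.
GIT=${GIT:-git}

# Start bisecting between a known-bad and a known-good revision.
start_bisect() {
    local bad=$1 good=$2
    "$GIT" bisect start "$bad" "$good"
}

# Record the outcome of one test boot.
#   works  -> hibernation succeeded
#   fails  -> the hibernation freeze was reproduced
#   noboot -> kernel did not boot, or failed for an unrelated reason
record_result() {
    case $1 in
        works)  "$GIT" bisect good ;;
        fails)  "$GIT" bisect bad ;;
        noboot) "$GIT" bisect skip ;;
        *)      echo "usage: record_result works|fails|noboot" >&2; return 2 ;;
    esac
}

# Example (run inside the kernel tree):
#   start_bisect v4.8 v4.4
#   ...build, boot, try to hibernate...
#   record_result noboot
```

Skipped revisions can make the final answer a small range of commits rather than a single one, but that is usually better than a wrong single commit.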
Comment 18 Chen Yu 2016-12-21 07:54:18 UTC
(In reply to Thomas Mitterfellner from comment #17)
> OK, thanks for the information, I will try that. I had a hard time bisecting
> the kernel; the kernel would not boot to an X session when I compiled it
> with make localyesconfig, so I had to compile all modules (~40 min/build).
> Many of the kernels would not boot at all (I did >25 kernel builds) and at
> some point it seems I marked a kernel as bad because it would not hibernate
> but it seems this was a different problem so I did not find the real bad
> commit in the end.
Yes, since there were several different known issues in the upstream kernel that were fixed only recently, I'm afraid bisecting would give inconsistent results, and bisecting is also exhausting.
So I was thinking, could you check the following commits one by one?

commit 406f992e4a372dafbe3c2cff7efbb2002a5c8ebd
Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Date:   Thu Jul 14 03:55:23 2016 +0200

    x86 / hibernate: Use hlt_play_dead() when resuming from hibernation

Introduced in v4.8-rc1.


commit 65c0554b73c920023cc8998802e508b798113b46
Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Date:   Thu Jun 30 18:11:41 2016 +0200

    x86/power/64: Fix kernel text mapping corruption during image restoration

Introduced in v4.7-rc7


commit e4630fdd47637168927905983205d7b7c5c08c09
Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Date:   Mon Aug 8 15:31:31 2016 +0200

    x86/power/64: Always create temporary identity mapping correctly

Introduced in v4.8-rc2


Note that none of them went into the stable tree, although they are critical fixes.
Comment 19 Thomas Mitterfellner 2016-12-21 12:30:13 UTC
Should I try to cherry-pick these commits e.g. into the kernel-vanilla-4.8.11 source (in which I found the hibernation feature to be broken) or should I check out the mentioned commit IDs and compile those?

Another thought: could this be some bug in xfs related code (I'm using xfs for my home directory)? Because some of the non-booting kernels (including the upstream kernel 4.9 I wanted to test) seemed to have a problem mounting my home directory and reported metadata corruption for which I should do xfs_repair (I did. Still, it wouldn't mount). At one point my file system got corrupted and I had to boot to an emergency system to do xfs_repair, which fixed things (boy, was I worried!).
Comment 20 Chen Yu 2016-12-23 03:09:05 UTC
(In reply to Thomas Mitterfellner from comment #19)


> Another thought: could this be some bug in xfs related code (I'm using xfs
> for my home directory)? Because some of the non-booting kernels (including
> the upstream kernel 4.9 I wanted to test) seemed to have a problem mounting
> my home directory and reported metadata corruption for which I should do
> xfs_repair (I did. Still, it wouldn't mount). At one point my file system
> got corrupted and I had to boot to an emergency system to do xfs_repair,
> which fixed things (boy, was I worried!).
OK. Well, then let's check by appending 'init=/bin/bash' to the kernel command line in the grub menu. After booting to a simple shell, you can enable swap with swapon /dev/sdaX (where sdaX is your swap partition), then check whether hibernation works better.

> Should I try to cherry-pick these commits e.g. into the
> kernel-vanilla-4.8.11 source (in which I found the hibernation feature to be
> broken) or should I check out the mentioned commit IDs and compile those?
> 
Second, please
git pull https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
(not the stable tree) and check whether hibernation works.
If it does, then the above patches might already have fixed it. If not,
please git revert the above 3 patches one by one, build a separate
image for each, and check whether reverting any of them makes things better.
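The one-by-one revert test could be organized roughly as follows. This is only a sketch: the helper name make_revert_branches and the GIT override variable are hypothetical, and the base branch name "master" is an assumption about the local clone. Each resulting branch has exactly one of the commits reverted and still has to be built and booted by hand; a revert may also stop with conflicts on a newer tree.

```shell
#!/bin/bash
# Sketch only. GIT can be overridden (e.g. GIT=echo) to preview the commands.
GIT=${GIT:-git}

# Create one branch per commit, each with exactly that commit reverted
# on top of the given base revision.
make_revert_branches() {
    local base=$1 commit
    shift
    for commit in "$@"; do
        "$GIT" checkout -b "revert-${commit:0:12}" "$base" || return 1
        "$GIT" revert --no-edit "$commit" || return 1
    done
}

# Example, using the three commits listed in comment 18:
#   make_revert_branches master \
#       406f992e4a372dafbe3c2cff7efbb2002a5c8ebd \
#       65c0554b73c920023cc8998802e508b798113b46 \
#       e4630fdd47637168927905983205d7b7c5c08c09
```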
Comment 21 Chen Yu 2016-12-23 03:15:52 UTC
I'm also wondering whether pm_test would give us more clues.

Please try freezer, devices, platform, processors and core in /sys/power/pm_test,
one after another.
If all of the above work, please check
# echo test_resume > /sys/power/disk
# echo disk > /sys/power/state
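The pm_test sequence requested above could be scripted along these lines. This is a sketch, not a tested procedure for this machine: the function name run_pm_tests is hypothetical, the directory argument exists only so the function can be tried against a scratch directory, and on a real system it has to run as root. In test mode, writing "disk" to the state file should return after a short delay instead of powering off.

```shell
#!/bin/bash
# Step through the pm_test levels in the order given in the comment above.
# power_dir defaults to /sys/power; writing there requires root.
run_pm_tests() {
    local power_dir=${1:-/sys/power} level
    for level in freezer devices platform processors core; do
        echo "testing pm_test level: $level"
        echo "$level" > "$power_dir/pm_test" || return 1
        echo disk > "$power_dir/state" || return 1
    done
    # Switch test mode off again so that a later hibernation is a real one.
    echo none > "$power_dir/pm_test"
}

# Example (as root):
#   run_pm_tests
```

If one level hangs, the last "testing pm_test level" line printed shows how far the sequence got.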
Comment 22 Chen Yu 2017-01-02 09:20:18 UTC
Hi Thomas, do you have time to check the suggestions in comment 20?
Comment 23 Thomas Mitterfellner 2017-01-02 13:08:16 UTC
I built kernel 4.10.0-rc2-41 from https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git now. It wouldn't boot to a graphical login but I tried to hibernate with systemctl hibernate. That did not work, same symptoms as reported before (hibernating like this works with 4.1.36).

I will try to figure out how to use pm_test. I suppose https://www.kernel.org/doc/Documentation/power/basic-pm-debugging.txt contains the info necessary.
Comment 24 Thomas Mitterfellner 2017-01-02 13:41:53 UTC
I tested all of the pm_test modes (freezer, devices, platform, processors, core) now with 4.10.0-rc2-41. All of them seem to work, i.e. for all of them, I get back to my command line after ~5 seconds; although, starting from 'platform' mode, I get a message like this:

[  4.4.4377391] usb 3-1.2: device descriptor read/64, error -32

But I always get that upon hibernation, even for kernels where hibernation does work.

I also did test_resume, again, no problem.
Comment 25 Thomas Mitterfellner 2017-01-02 15:32:18 UTC
I tried now adding init=/bin/bash to the boot parameters of kernel 4.10.0-rc2-41 and only booting to a minimal prompt. I called swapon and mounted my home drive, the xfs file system. Hibernating via echo disk > /sys/power/state worked more or less, i.e. it hibernated, but after resuming, I got to the normal grub boot menu where I could select the kernel to boot. After selecting it, I came back to the prompt as I had left it before hibernation.
Then I manually modprobed the modules which are loaded when booting that kernel normally (129 modules) in batches and tried hibernating after each batch – no problem; hibernation worked even after I had loaded all the 129 modules.

Now I don't know what the difference between booting 'normally' and adding init=/bin/bash and then loading all the modules is. The only difference I could see was that the root fs was mounted as ro when booting with init=/bin/bash mode. Is it possible to mount it as rw?
Comment 26 Chen Yu 2017-01-03 15:53:30 UTC
Yes, after booting into the minimal environment, try
mount -o remount,rw /dev/sdaX /
and check whether hibernation works.

Another clue is that init=/bin/bash bypasses the graphics driver. So it might be helpful to blacklist your graphics driver during a normal bootup and check whether that makes things better.
Comment 27 Thomas Mitterfellner 2017-01-04 13:55:59 UTC
I tried remounting / as rw (and mounting my xfs home dir), hibernation works. Also when loading nouveau it does hibernate and resume correctly. I will try to unload the graphics driver with 'normal' bootup (or blacklist it). Since it doesn't boot to X anyway, this shouldn't be a problem, I guess.
Comment 28 Thomas Mitterfellner 2017-01-04 14:21:24 UTC
Nope, neither with nouveau loaded nor unloaded hibernation works.

The first message I get is 

do_IRQ: 1.35 No irq handler for vector

Where does that come from? Is it related?
Comment 29 Thomas Mitterfellner 2017-01-08 11:18:42 UTC
What puzzles me most is that doing one suspend-to-RAM/resume cycle makes hibernation work (this is my current workaround). How could I figure out what changes after STR, i.e. which log files/sys files/log messages would be relevant?
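One way to look for state that changes across a suspend-to-RAM cycle is to record the IRQ affinity settings before and after and compare them. The sketch below assumes the standard /proc/irq/<n>/smp_affinity_list files; the function name snapshot_irq_affinity and the output file names are hypothetical, and there is no guarantee that the relevant difference shows up in these files.

```shell
#!/bin/bash
# Print one "irq <n> -> <cpu list>" line per IRQ found under the given
# directory (default /proc/irq), sorted by IRQ number.
snapshot_irq_affinity() {
    local irq_dir=${1:-/proc/irq} f irq
    for f in "$irq_dir"/*/smp_affinity_list; do
        [ -r "$f" ] || continue
        irq=$(basename "$(dirname "$f")")
        printf 'irq %s -> %s\n' "$irq" "$(cat "$f")"
    done | sort -n -k2,2
}

# Example:
#   snapshot_irq_affinity > before_str.txt
#   ...suspend to RAM and resume...
#   snapshot_irq_affinity > after_str.txt
#   diff before_str.txt after_str.txt
```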
Comment 30 Thomas Mitterfellner 2017-04-08 22:01:15 UTC
Note that this suspend-to-RAM/resume seems to work only a couple of times, e.g. I did suspend/resume once, then hibernate/resume twice successfully, but upon the third hibernation I got that erroneous behavior again (revalidation failed (errno=-5) etc.).

I read here (https://ubuntuforums.org/showthread.php?t=767668) that they fixed the revalidation error in a different context by using the acpi=off kernel parameter. Indeed, hibernation does work when I do that but I have to push the power button once manually to actually turn off the power (which is not 100% satisfactory but better than the system crashing; I don't know whether turning off acpi altogether has other side effects though, e.g. on fan control).
Comment 31 Thomas Mitterfellner 2017-04-08 22:10:57 UTC
(In reply to Thomas Mitterfellner from comment #30)
> Note that this suspend-to-RAM/resume seems to work only a couple of times

Clarification: suspend-to-RAM/resume always works, but fixing hibernation by doing STR first only shows limited effect.
Comment 32 Chen Yu 2017-04-09 03:35:22 UTC
(In reply to Thomas Mitterfellner from comment #30)
> Note that this suspend-to-RAM/resume seems to work only a couple of times,
> e.g. I did suspend/resume once, then hibernate/resume twice successfully,
> but upon the third hibernation I got that erroneous behavior again
> (revalidation failed (errno=-5) etc.).
> 
> I read here (https://ubuntuforums.org/showthread.php?t=767668) that they
> fixed the revalidation error in a different context by using the acpi=off
> kernel parameter. 
With acpi=off, the final state of hibernation is neither S4 nor S5, so the behavior might be different, and once ACPI is off, many drivers might not work.

(In reply to Thomas Mitterfellner from comment #28)
> Nope, neither with nouveau loaded nor unloaded hibernation works.
> 
> The first message I get is 
> 
> do_IRQ: 1.35 No irq handler for vector
> 
> Where does that come from? Is it related?
This is triggered when the system receives an interrupt and enters do_IRQ, but finds that this interrupt has not been set up yet (no IRQ number has been assigned to it). In short, there is no corresponding handler for vector 35, which was issued by the APIC.
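To see which CPUs and vectors are affected, the messages can be summarized from a saved log. The sketch below assumes that the two numbers in "do_IRQ: 1.35" are the CPU number and the vector number, which matches the explanation above (vector 35) but should be checked against the kernel source for the version in use. The function name summarize_unhandled_vectors is hypothetical.

```shell
#!/bin/bash
# Read kernel log text on stdin and print one line per (cpu, vector) pair
# found in "do_IRQ: <cpu>.<vector> No irq handler for vector" messages,
# together with how often that pair occurred.
summarize_unhandled_vectors() {
    sed -n 's/.*do_IRQ: \([0-9][0-9]*\)\.\([0-9][0-9]*\) No irq handler for vector.*/\1 \2/p' |
        sort -n -k1,1 -k2,2 |
        uniq -c |
        awk '{ printf "cpu=%s vector=%s count=%s\n", $2, $3, $1 }'
}

# Example:
#   dmesg | summarize_unhandled_vectors
```

A single pair that repeats many times would point at one device whose interrupt is being delivered to a CPU that has no handler installed for it.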

BTW, do you mean the problem is that the system cannot hibernate, rather than that it fails to restore?
Comment 33 Thomas Mitterfellner 2017-04-09 10:01:22 UTC
Yes, the problem is that it does not hibernate.
IF hibernation works, resuming from hibernation has never been a problem (at least not that I remember).
Comment 34 Chen Yu 2017-04-12 15:18:07 UTC
(In reply to Thomas Mitterfellner from comment #33)
> Yes, the problem is that it does not hibernate.
> IF hibernation works, resuming from hibernation has never been a problem (at
> least not that I remember).

Could you provide your kernel config?
Comment 35 Thomas Mitterfellner 2017-04-12 17:24:49 UTC
Created attachment 255869 [details]
OpenSUSE default kernel config (4.4.57)

I attached the openSUSE default kernel config (4.4.57). With this kernel (which has some backports), hibernation fails.

Using the 4.4.57 vanilla kernel btw., hibernation kind of works (I /sometimes/ get a kernel panic when hibernating, indicated by the system not shutting down and the caps lock light blinking, but I suppose that's a different issue; resuming sometimes works after that has happened, although I haven't found out under which condition it does not)
Comment 36 Chen Yu 2017-04-17 05:15:18 UTC
I also saw a similar warning on one of my machines:
do_IRQ: xxx No irq handler for vector
It happens because my ahci driver has incorrectly set its irq affinity to a CPU without any handler for that irq. This stopped my machine from writing the snapshot to the disk and hung the system due to a flood of unhandled irqs. To confirm whether you have the same issue as I saw, could you please test the debug patch below with these steps:
1. apply the patch on top of upstream kernel with

https://github.com/yu-chen-surf/linux-pm/commit/8da5c620da6e6a87eb190246857265a179a6d951

2. Boot up the patched kernel, and provide your full logs using this script:
./dump_vector.sh bootup_vectors.log
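(replace the value of cpu_number below with the number of CPUs on your system)

#!/bin/bash
# Dump the per-CPU vector slots exposed by the debug patch into a log file.
cpu_number=4	# placeholder: set this to the number of CPUs on your system
log_file="$(pwd)/$1"
echo > "${log_file}"
n=0

# The relative cpuN paths below are resolved from this directory.
cd /sys/devices/system/cpu || exit 1
while [ "$n" -lt "$cpu_number" ]
do
	grep . cpu${n}/vector/slot*/* --exclude-dir=cpu${n}/vector/slot*/power >> "${log_file}"
	echo -e "\n" >> "${log_file}"
	n=$(( n+1 ))
done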



3. echo processors > /sys/power/pm_test
   echo disk > /sys/power/state

4. 
./dump_vector.sh pm_test_processors_vectors.log


5. Please provide bootup_vectors.log, pm_test_processors_vectors.log, and dmesg
Comment 37 Chen Yu 2017-04-20 05:46:14 UTC
Hi Thomas, I think I know why writing to the disk fails during hibernation. The offender is the irqbalance daemon: it incorrectly sets the ahci host's irq affinity during hibernation, which causes the disk writes to fail. Please disable the irqbalance daemon and try again. Thanks.
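A quick way to check for the situation described above is to list the IRQs whose affinity does not include the boot CPU. This is a sketch under two assumptions: that only CPU 0 stays online while the hibernation image is written, and that the standard /proc/irq/<n>/smp_affinity_list files reflect the setting in question. The function names are hypothetical.

```shell
#!/bin/bash
# Return 0 if a CPU list such as "0-3,5" contains the given CPU number.
affinity_includes_cpu() {
    local list=$1 cpu=$2 part lo hi
    local IFS=,
    for part in $list; do
        lo=${part%-*}    # "0-3" -> 0, "5" -> 5
        hi=${part#*-}    # "0-3" -> 3, "5" -> 5
        if [ "$cpu" -ge "$lo" ] && [ "$cpu" -le "$hi" ]; then
            return 0
        fi
    done
    return 1
}

# Print the IRQ numbers whose affinity list does not include the given CPU
# (default: CPU 0), reading from the given directory (default: /proc/irq).
irqs_not_on_cpu() {
    local irq_dir=${1:-/proc/irq} cpu=${2:-0} f
    for f in "$irq_dir"/*/smp_affinity_list; do
        [ -r "$f" ] || continue
        affinity_includes_cpu "$(cat "$f")" "$cpu" ||
            basename "$(dirname "$f")"
    done
}

# Example: compare the output with irqbalance running and after stopping it.
#   irqs_not_on_cpu
```

An IRQ listed here is not necessarily a problem, since the kernel is expected to migrate interrupts when a CPU goes offline; the list only narrows down which devices to look at.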
Comment 38 Chen Yu 2017-04-21 12:09:13 UTC
Created attachment 255945 [details]
fix irq affinity during cpu offline

Thomas, could you help test the attached patch?
irqbalance is just one condition that triggers the problem. There might be a bug
in x86's irq handling.
Comment 39 Thomas Mitterfellner 2017-04-21 15:58:15 UTC
I tried to boot up with the kernel parameter acpi_irq_nobalance, but still an irqbalance process was running. I killed that with signal 9 and then tried to hibernate. That did not work – same problem as before. I will try to apply the patch; I hope I'll succeed.
Comment 40 Thomas Mitterfellner 2017-04-21 20:45:15 UTC
I used my current kernel's source (4.4.57-18.3-default) and applied your patch to it, then built it with the kernel's config from /boot. With this modified kernel, hibernation works!
Comment 41 Thomas Mitterfellner 2017-04-21 20:48:57 UTC
I forgot to say thank you for looking into the problem and for fixing it! I hope that the patch will somehow make it to the kernel source and eventually to openSUSE's 42.2 default kernel.
Comment 42 Takashi Iwai 2017-04-22 07:00:48 UTC
I can merge it to the openSUSE / SUSE branch once the patch is submitted to upstream.  (And the patch should be marked with Cc to stable, too.)
Chen, could you update the bugzilla when you submit it?  Thanks!
Comment 43 Chen Yu 2017-04-24 07:58:55 UTC
OK, sent here:
https://patchwork.kernel.org/patch/9695771/
But I'd like to get some suggestions from the maintainers first and then we can Cc stable.
Comment 44 Thomas Mitterfellner 2017-05-09 12:57:07 UTC
Did you hear anything about the patch from the maintainers yet? I'd like to report back that so far, hibernation is working flawlessly using my patched kernel and I haven't experienced any side-effects.
Comment 45 Chen Yu 2017-05-25 08:16:26 UTC
That patch is just to "lighten" the issue; the root cause fix should be the one attached below. Please check whether it works for you without any other patch applied:
Comment 46 Chen Yu 2017-05-25 08:17:19 UTC
Created attachment 256707 [details]
Restore pci status for devices, otherwise the msi/msix affinity will get lost
Comment 47 Chen Yu 2017-05-25 08:20:17 UTC
Created attachment 256709 [details]
Restore pci status for devices, otherwise the msi/msix affinity will get lost
Comment 48 Thomas Mitterfellner 2017-05-25 19:26:34 UTC
I applied that patch (256709) to a fresh kernel source 4.4.62-18.6 (openSUSE Leap 42.2 default kernel), built and installed using the openSUSE default config for that kernel, but it does not make hibernation work correctly. When trying to hibernate, I got a "no IRQ handler for vector message" and the system did not power off. Only the other errors concerning the file system did not show (but I tried it only once).
Comment 49 Thomas Mitterfellner 2017-05-25 19:39:57 UTC
When I hibernate and power off manually, then resume, then hibernate again, automatic power-off works (similar to what I described earlier: when doing one _suspend_ and resume cycle, hibernation sometimes worked). 
N.B.: The patch applied with an offset of some 20 lines in the function pci_pm_thaw_noirq. Without knowing what it should do, my wild guess is that this seems to suggest something is done upon _resuming_ from hibernation. But the original issue shows the problems when going _to_ hibernation.

So the original patch you posted fixes my issues better (I'll keep using this).
Comment 50 Manuel Krause 2017-05-25 20:07:06 UTC
Sorry for me bailing in. I'm searching for fixes for failing S2RAM and S2DISK causes and fixes for my system ATM, earlier with 4.10 and now 4.11 kernels.
By coincidence I've stumbled upon and read this topic and applied the fix from Comment 47 to my 4.11.2 kernel (without the one that Thomas prefers).
After a third resume from disk it locked up my system all of a sudden after already relogged into and using the KDE. No logs available unfortunately.
So, there are likely to be more root causes to be fixed that still remain in the shadow.
I'm now also going to apply the earlier "lighten"ing patch and that to 4.11.3.

Best regards, Manuel Krause
Comment 51 Chen Yu 2017-05-26 01:02:28 UTC
(In reply to Thomas Mitterfellner from comment #49)
> When I hibernate and power off manually, then resume, then hibernate again,
> automatic power-off works (similar to what I described earlier: when doing
> one _suspend_ and resume cycle, hibernation sometimes worked). 
> N.B.: The patch applied with an offset of some 20 lines in the function
> pci_pm_thaw_noirq. Without knowing what it should do, my wild guess is that
> this seems to suggest something is done upon _resuming_ from hibernation. But the
> original issue shows the problems when going _to_ hibernation.
> 
> So the original patch you posted fixes my issues better (I'll keep using
> this).
The fix is in pci_pm_thaw_noirq because this function is invoked before the system writes the snapshot to the disk, and we saw "no irq handler" during this stage because the irq affinity setting was lost. This patch is meant to restore the irq affinity at that point. Could you please check on the latest upstream kernel? I was previously using 4.11-rc8; you can clone the latest one with
git clone https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Comment 52 Chen Yu 2017-05-26 01:15:32 UTC
(In reply to Manuel Krause from comment #50)
> Sorry for me bailing in. I'm searching for fixes for failing S2RAM and
> S2DISK causes and fixes for my system ATM, earlier with 4.10 and now 4.11
> kernels.
> By coincidence I've stumbled upon and read this topic and applied the fix
> from Comment 47 to my 4.11.2 kernel (without the one that Thomas prefers).
> After a third resume from disk it locked up my system all of a sudden after
> already relogged into and using the KDE. No logs available unfortunately.
> So, there are likely to be more root causes to be fixed that still remain in
> the shadow.
> I'm now also going to apply the earlier "lighten"ing patch and that to
> 4.11.3.
> 
> Best regards, Manuel Krause
Without the patch applied, do you see the same phenomenon? There is no need to test the earlier one; yours might be a different issue from Thomas's.
Please refer to the kernel's Documentation/power/basic-pm-debugging.txt
to test the different pm_test modes, and finally the test_resume mode.
Comment 53 Manuel Krause 2017-05-26 19:57:05 UTC
(In reply to Chen Yu from comment #52)
> (In reply to Manuel Krause from comment #50)
> w/o the patch applied, is it of the same phenomenon? No need to test the
> earlier one, it might be of another issue than Thomas. 
> Please refer to the kernel's Documentation/power/basic-pm-debugging.txt 
> to test different pm_test mode, and finally the test_resume mode.

I'm sorry, I don't suffer from Thomas' symptoms, but I stumbled upon the "fix" patch while reading through the hibernation-related bugs and gave it a test run on my machine along with the new 4.11.3.
And if it breaks things after resume, with the system fully up, a more complete fix likely needs more attention. I hope this doesn't sound too harsh. At least I can add that the patch from Comment 43 doesn't harm my system.
Let's read Thomas' new findings with a more recent kernel.
Comment 54 Thomas Mitterfellner 2017-05-26 22:54:35 UTC
I built kernel 4.11 from the source you mentioned (https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git) with the patch applied. Unfortunately, it does not boot up (kernel panic, caps lock flashing). I suppose taking the config from 4.4.62 and taking the default options for the remaining options is not a valid way of compiling a running kernel. I don't know how to proceed now.

If I understood it correctly, you say the former patch is not addressing the "root cause" but doing plumbing downstream? But if the latter patch does not fix the problem on 4.4.62, can it be the 'proper' fix anyway?
Comment 55 Chen Yu 2017-05-27 03:13:25 UTC
(In reply to Manuel Krause from comment #53)
> (In reply to Chen Yu from comment #52)
> > (In reply to Manuel Krause from comment #50)
> > w/o the patch applied, is it of the same phenomenon? No need to test the
> > earlier one, it might be of another issue than Thomas. 
> > Please refer to the kernel's Documentation/power/basic-pm-debugging.txt 
> > to test different pm_test mode, and finally the test_resume mode.
> 
> I'm sorry, I don't suffer from Thomas' symptoms, but stumbled upon the "fix"
> patch while reading through the Hibernation related bugs and also gave it a
> test run on my machine along with new 4.11.3.
> And if it breaks things after resume and system fully up, there's likely to
> be more attention needed for a more complete fix. I hope this doesn't sound
> too harsh. At least I can add, that the patch from Comment 43 doesn't harm
> my system.
> Let's read Thomas' new findings with a more actual kernel.
OK, got it. So the patch from comment 47 broke S4 on your platform.
Could you please also check whether setting the pm_test mode to 'processors' works without any patch applied on 4.11.3?

Please run several rounds (at least 3):

# echo processors > /sys/power/pm_test
# echo disk > /sys/power/state
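
For anyone repeating this, the rounds could be scripted roughly as below. This is only a sketch: run_pm_test_rounds and the PM_SYSFS override are names made up for illustration, it has to run as root, and it assumes a kernel built with CONFIG_PM_DEBUG so that /sys/power/pm_test exists.

```shell
# Sketch only: repeat the pm_test hibernation cycle several times.
# PM_SYSFS is a hypothetical override so the function can be dry-run
# against a scratch directory; on a real system it stays /sys/power.
PM_SYSFS="${PM_SYSFS:-/sys/power}"

run_pm_test_rounds() {
    mode="$1"          # pm_test mode, e.g. processors or core
    rounds="${2:-3}"   # at least 3 rounds were asked for
    i=1
    while [ "$i" -le "$rounds" ]; do
        echo "round $i: pm_test=$mode"
        echo "$mode" > "$PM_SYSFS/pm_test" || return 1
        echo disk > "$PM_SYSFS/state" || return 1
        i=$((i + 1))
    done
    # Switch test mode off again so a later hibernation is a real one.
    echo none > "$PM_SYSFS/pm_test"
}
```

Calling "run_pm_test_rounds processors 3" would match the request above; checking dmesg after each round is still a manual step.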
Comment 56 Chen Yu 2017-05-27 03:21:18 UTC
(In reply to Thomas Mitterfellner from comment #54)
> I built kernel 4.11 from the source you mentioned
> (https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git) with
> the patch applied. Unfortunately, it does not boot up (kernel panic, caps
> lock flashing). I suppose taking the config from 4.4.62 and taking the
> default options for the remaining options is not a valid way of compiling a
> running kernel. I don't know how to proceed now.
May I know where it panics? I found previously that you might have to disable
module signature checking, otherwise you get a panic during boot:
CONFIG_MODULE_SIG should not be set. You can turn it off via make menuconfig:
Enable loadable module support ->
Module signature verification
> 
> If I understood it correctly, you say the former patch is not the addressing
> the "root cause" but doing plumbing downstream? But if the latter patch does
> not fix the problem on 4.4.62 – can it be the 'proper' fix anyway?
Then the situation might be different in 4.4.62, for example if the code flow is different. Normally we do not debug on a distribution's kernel source code, but if you still encounter the boot issue on the latest kernel, I can add a debug patch and debug on your 4.4.62.
Comment 57 Chen Yu 2017-05-27 09:28:41 UTC
Created attachment 256743 [details]
Force power on the pci devices and restore their status across S4

Please help check if this patch makes things better...
Comment 58 Chen Yu 2017-05-27 09:29:10 UTC
Comment 57 is a revised version of Comment 47
Comment 59 Chen Yu 2017-05-27 09:31:23 UTC
Manuel, please check Comment 55 first, and then check Comment 57.
Thomas, please check Comment 56 first, and if that does not work, please check the patch from Comment 57.
Thanks!
Comment 60 Thomas Mitterfellner 2017-05-27 12:47:32 UTC
CONFIG_MODULE_SIG is not enabled in the .config file for the 4.11 kernel. I'm going to try patch 256743 on kernel 4.4.62.

As for the kernel panic I get with the 4.11 kernel: I would have to take a picture with my camera and attach that, as the kernel panic happens extremely early in the boot process and I can't find any log for it.
Comment 61 Manuel Krause 2017-05-27 19:52:52 UTC
(In reply to Chen Yu from comment #55)
> OK, got it. So comment 47 broken S4 on your platform.
> Could you please also check if pm_test mode setting to 'processors' works
> w/o any patch applied on 4.11.3?
> 
> several rounds:( at least 3 rounds)
> 
> # echo processors > /sys/power/pm_test
> # echo disk > /sys/power/state

I really don't want to screw up this thread!
BTW, I needed to add PM_DEBUG to my custom .config first, and I hope that's sufficient for our testing here.
I've given it 10 rounds with different RAM and system loads. A log is available, but there is _NO_ relevant error or message in it. I ran 10 rounds because the real-world S2DISK is likely to fail at attempt No. 6.

Now, I'm going to test Comment 57/ patch 256743.
Comment 62 Thomas Mitterfellner 2017-05-27 20:52:34 UTC
With patch 256743 (Comment 57) I get the same behavior as with the unrevised patch: it displays "No irq handler for vector" but does not power off, so I have to do that manually (N.B.: I think the CPU load is also high, as the fan starts blowing). When I turn the power on again, I get back to the previous state, i.e. hibernation _kind of_ works (but it's not really satisfying).
Comment 63 Chen Yu 2017-05-28 14:35:03 UTC
(In reply to Manuel Krause from comment #61)
> (In reply to Chen Yu from comment #55)
> > OK, got it. So comment 47 broken S4 on your platform.
> > Could you please also check if pm_test mode setting to 'processors' works
> > w/o any patch applied on 4.11.3?
> > 
> > several rounds:( at least 3 rounds)
> > 
> > # echo processors > /sys/power/pm_test
> > # echo disk > /sys/power/state
> 
> I really dont' want to screw up this thread!
> BTW, I've needed to add PM_DEBUG to my custom .config first, and I hope it's
> sufficient for our testing on this here.
> 've given it 10 rounds with different RAM and system loads. Log available,
> but there is _NO_ relevant error/ message in it. I've taken 10 rounds, as
> the real-world S2DISK is likely to fail at attempt No.6.
Do you mean that without patch 256709 it failed to hibernate on the 6th try, but with the patch applied it failed on the 3rd try? Well, a regression is critical if my patch broke something, so I need to figure out whether the patch should be regarded as "no help" for the hibernation failure or as "breaks something".
> 
> Now, I'm going to test Comment 57/ patch 256743.
Patch 256743 might not help, as Thomas replied previously. Please wait a moment; I'll try to debug Thomas's situation and see why 256743 does not help.
Comment 64 Manuel Krause 2017-05-28 15:16:46 UTC
(In reply to Chen Yu from comment #63)
> Do you mean, w/o the patch 256709, it failed to hibernate on 6th try, but
> with the patch applied, it failed at 3nd try? Well, regression is critical,
> if my patch broken something, so I need to figure it out, if the patch is
> regarded as "No help" for the hibernation failure, or "Breaks something".

No, no. Your patches may be good.
Plain vanilla kernels, for a while now, haven't survived a 6th hibernate/resume. That's why I was and still am curious about this topic and any useful patches.
Your proposed testing from Comment 55 hasn't shown any anomalies without any of your patches applied.

> patch 256743 might not help as Thomas replied previously. Please wait a
> moment, I'll try to debug on Thomas's situation and see why 256743 does not
> help.

Please focus on Thomas' results. I'm only here to check whether something else breaks. 
Currently I'm at the 5th hibernation/resume cycle with your latest patch. So far everything is fine. Give me some more hours for at least 10 real-world hibernations for a fair result. (And I'm considering filing a separate bug eventually, so you don't need to cover possibly two issues in one.)
Comment 65 Manuel Krause 2017-05-28 15:20:09 UTC
(In reply to Thomas Mitterfellner from comment #62)
> With patch 256743 (Comment 57) I get the same behavior as with the unrevised
> patch: it displays no IRQ handler for vector but does not power off, I have
> to do it manually (N.B.: I think the CPU load is also high as the fan starts
> blowing). When I turn the power on again, I get to the previous state, i.e.
> hibernation _kind of_ works (but it's not really satisfying).

Can you please add which kernel you tested it with? Thank you!
Comment 66 Manuel Krause 2017-05-28 15:56:03 UTC
@ Chen Yu: In some way you were still right with one of your last questions: the fix from Comment 47 / 256709 (at least) didn't improve things. Nothing more, nothing less. That was with 4.11.2. Now I'm on 4.11.3.

Another thing I've noticed: with the current patch, resume from hibernation is much faster, measured up to the point where the desktop becomes interactive again. I have no numbers for comparison ATM, but please keep it in mind. Mmmh... was it your patch or 4.11.3?
Comment 67 Thomas Mitterfellner 2017-05-28 17:49:02 UTC
(In reply to Manuel Krause from comment #65)
> Can you, please, add the info with which kernel you've tested it? Thank you!

That's again openSUSE kernel 4.4.62-18.6-default.
Comment 68 Manuel Krause 2017-05-28 17:54:19 UTC
@ Chen Yu:
You should definitely consider publishing for kernel inclusion:
Comment 57/ patch 256743.
(ATM for me it doesn't matter from which actual bug it evolved.)

I'm now at 10 successful hibernations+resumes, which never occurred before with vanilla kernels (4.9/4.10/4.11).

Additionally, this fix speeds up resumes from disk enormously. I come from the TuxOnIce user base and know what resume speeds are possible, and this one makes in-kernel hibernation resume almost as fast as the former TOI.
Comment 69 Manuel Krause 2017-05-28 18:06:06 UTC
(In reply to Thomas Mitterfellner from comment #67)
> (In reply to Manuel Krause from comment #65)
> > Can you, please, add the info with which kernel you've tested it? Thank
> you!
> 
> That's again openSUSE kernel 4.4.62-18.6-default.

Why don't you fetch from http://download.opensuse.org/repositories/Kernel:/stable/standard/x86_64/ or such? 
Browse, find and get the sources and the appropriate .config.
I could only provide you with the .config files that worked with my recent kernels with my system. They may fail with yours.
Comment 70 Thomas Mitterfellner 2017-05-29 20:46:23 UTC
OK, I applied the patch 256743 to kernel 4.11.3-1.g7262353 from openSUSE's kernel-stable repo now, with the same result as for 4.4.62: automatic power-off does not work and only one IRQ message is shown:

do_IRQ: 3.51 No irq handler for vector

After powering off manually and powering on again, the previous state is restored, so hibernation _kind of_ works.
Comment 71 Manuel Krause 2017-05-29 22:18:50 UTC
(In reply to Thomas Mitterfellner from comment #70)
> OK, I applied the patch 256743 to kernel 4.11.3-1.g7262353 from openSUSE's
> kernel-stable repo now, with the same result as for 4.4.62: automatic
> power-off does not work and only one IRQ message is shown:
> 
> do_IRQ: 3.51 No irq handler for vector
> 
> After powering off manually and powering on again, the previous state is
> restored, so hibernation _kind of_ works.

Many thanks for your time and work on this!

Is the "3.51" perhaps an IRQ that can be found in /proc/interrupts or somewhere else in your dmesg/journal?

I should add that my kernels include the VRQ patch from Alfred Chen and the BFQ disk I/O scheduler. The message I get at resume is:
[ 1445.677911] PM: Saving platform NVS memory
[ 1445.681655] Disabling non-boot CPUs ...
[ 1445.700995] Broke affinity for irq 28 <---------------------
[ 1445.702023] smpboot: CPU 1 is now offline
[ 1445.709383] PM: Creating hibernation image:

The IRQ named in "Broke affinity" is (from /proc/interrupts):
 28:      17790      22692   PCI-MSI 512000-edge      ahci[0000:00:1f.2]

and the corresponding device (from lspci -v):
00:1f.2 SATA controller: Intel Corporation 82801IBM/IEM (ICH9M/ICH9M-E) 4 port SATA Controller [AHCI mode] (rev 03) (prog-if 01 [AHCI 1.0])
	Subsystem: Hewlett-Packard Company Device 30dd
	Flags: bus master, 66MHz, medium devsel, latency 0, IRQ 28
	I/O ports at 60e8 [size=8]
	I/O ports at 60fc [size=4]
	I/O ports at 60e0 [size=8]
	I/O ports at 60f8 [size=4]
	I/O ports at 6000 [size=32]
	Memory at d8804000 (32-bit, non-prefetchable) [size=2K]
	Capabilities: [80] MSI: Enable+ Count=1/16 Maskable- 64bit-
	Capabilities: [70] Power Management version 3
	Capabilities: [a8] SATA HBA v1.0
	Capabilities: [b0] PCI Advanced Features
	Kernel driver in use: ahci

But resuming does work well.
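
The lookup from a "Broke affinity for irq N" message to its /proc/interrupts row can be done with a small helper like the one below. It is only a sketch; irq_line is a made-up name, and the optional second argument is there so a saved copy of the table can be searched instead of the live file.

```shell
# Sketch only: print the /proc/interrupts row for one IRQ number.
# irq_line is a hypothetical helper; the second argument lets it read
# a saved copy of the table instead of the live file.
irq_line() {
    irq="$1"
    table="${2:-/proc/interrupts}"
    # Rows start with optional spaces, then the IRQ number and a colon.
    grep -E "^ *${irq}:" "$table"
}
```

For the message above that would be "irq_line 28"; the PCI address at the end of the row (0000:00:1f.2 here) is what to look for in the lspci -v output.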

@Thomas: To summarise, your remaining problems with 4.11.3 + the 256743 patch are one "do_IRQ" message and automatic power-off NOT working upon hibernation?
Comment 72 Manuel Krause 2017-05-29 22:51:22 UTC
(In reply to Manuel Krause from comment #68)
> @ Chen Yu:
> You should definitely consider publishing for kernel inclusion:
> Comment 57/ patch 256743.
> (ATM for me it doesn't matter from which actual bug it evolved.)
> 
> I'm now at 10 successful hibernations+resumes, what never occurred before
> with vanilla kernels (4.9/4.10/4.11).
> 
> Additionally, this fix extremely speeds up resumes from disk. I'm coming
> from TuxOnIce users, know about possible resume speeds, and this one makes
> in-kernel hiberation resuming almost as fast as former TOI.

PLEASE, FORGET ALL WRITTEN IN MY COMMENT 68 (cited above).

Tonight I felt the need to cross-check against the .config and kernel changes I've made recently. Those were (as written above) CONFIG_PM_DEBUG and your patches.

I can now safely say that adding CONFIG_PM_DEBUG=y is the reason for the greatly sped-up resume from hibernation. And nothing else. (But to prove reliability without your patches I'd still need more rounds.)

To my understanding, it's quite a mess to put code that speeds up resuming from hibernation behind a debugging option. Maybe it's only a one-line #ifdef or so. Can you please investigate, or delegate to the people in charge?
Comment 73 Thomas Mitterfellner 2017-05-30 08:03:52 UTC
(In reply to Manuel Krause from comment #71)
> (In reply to Thomas Mitterfellner from comment #70)
> > OK, I applied the patch 256743 to kernel 4.11.3-1.g7262353 from openSUSE's
> > kernel-stable repo now, with the same result as for 4.4.62: automatic
> > power-off does not work and only one IRQ message is shown:
> > 
> > do_IRQ: 3.51 No irq handler for vector
> > 
> > After powering off manually and powering on again, the previous state is
> > restored, so hibernation _kind of_ works.
> 
> Many thanks for your time and work on this!
> 
> Is the "3.51" an IRQ maybe readable in /proc/interrupts or somewhere else in
> your dmesg/ journal?

I will try to check, but let me add this: when I tried a second hibernation after that one, it was no longer IRQ 3.51, but IRQ 2.227.

> @Thomas: To summarise, your remaining problems with 4.11.3 + 256743 patch
> are one "do_IRQ" message and NOT working auto power-off upon hibernation?

Yes, that's correct (and it's the same with 4.4.62; not sure it's the same IRQs, though).
Comment 74 Chen Yu 2017-05-30 15:09:46 UTC
(In reply to Manuel Krause from comment #72)
> (In reply to Manuel Krause from comment #68)
> > @ Chen Yu:
> > You should definitely consider publishing for kernel inclusion:
> > Comment 57/ patch 256743.
> > (ATM for me it doesn't matter from which actual bug it evolved.)
> > 
> > I'm now at 10 successful hibernations+resumes, what never occurred before
> > with vanilla kernels (4.9/4.10/4.11).
> > 
> > Additionally, this fix extremely speeds up resumes from disk. I'm coming
> > from TuxOnIce users, know about possible resume speeds, and this one makes
> > in-kernel hiberation resuming almost as fast as former TOI.
> 
> PLEASE, FORGET ALL WRITTEN IN MY COMMENT 68 (cited above).
> 
> Tonight I felt the need to cross-check with the .config and kernel changes
> I've made recently. That was (as written above) CONFIG_PM_DEBUG and your
> patches.
> 
> I'm now safe to say, adding CONFIG_PM_DEBUG=y is the reason for highly
> speeded up resume from hibernation. And nothing else. (But for a proof of
> reliability without your patches I'd still need more rounds.)
> 
> For my understanding, it's a really heavy mess to put code that speeds up
> resuming from hibernation under the choice of debugging. Maybe it's only a
> one-liner #ifdef or so. Can you please investigate or delegate to the people
> in charge?
Sure, but let's focus on the IRQ affinity issue for now. Please refer to this link for why this issue is quite critical; I've already seen it on my servers:
https://patchwork.kernel.org/patch/9748013/
Comment 75 Chen Yu 2017-05-30 15:13:46 UTC
(In reply to Thomas Mitterfellner from comment #73)
> (In reply to Manuel Krause from comment #71)
> > (In reply to Thomas Mitterfellner from comment #70)
> > > OK, I applied the patch 256743 to kernel 4.11.3-1.g7262353 from
> openSUSE's
> > > kernel-stable repo now, with the same result as for 4.4.62: automatic
> > > power-off does not work and only one IRQ message is shown:
> > > 
> > > do_IRQ: 3.51 No irq handler for vector
> > > 
> > > After powering off manually and powering on again, the previous state is
> > > restored, so hibernation _kind of_ works.
> > 
> > Many thanks for your time and work on this!
> > 
> > Is the "3.51" an IRQ maybe readable in /proc/interrupts or somewhere else
> in
> > your dmesg/ journal?
> 
> I will try to check, but let me add this: when I tried a second hibernation
> after that one, it was no longer IRQ 3.51, but IRQ 2.227.
> 
> > @Thomas: To summarise, your remaining problems with 4.11.3 + 256743 patch
> > are one "do_IRQ" message and NOT working auto power-off upon hibernation?
> 
> Yes, that's correct (and it's the same with 4.4.62; not sure it's the same
> IRQs, though).
OK, please stick to one fixed kernel version, the 4.11.3-1.g7262353 you mentioned. I'll provide a debug patch soon. Thanks.
Comment 76 Chen Yu 2017-05-31 06:29:03 UTC
Thomas, I've built the stable 4.11.3 using your kernel config 255869 from Comment 35. Before patch 256709 from Comment 47 was applied, I saw the "No irq" issues which kill the system; with that patch applied, the "No irq" issue was gone and everything worked perfectly. Could you confirm that you applied that patch and recompiled the kernel with it?
Anyway, please apply this debug patch to your 4.11.3, without any other patch applied, and do the following to verify whether the patch has taken effect:
1. cd into the 4.11.3 source directory
2. patch -p1 < debug_no_irq.diff
3. re-compile/install the kernel
4. after booting into the patched 4.11.3, provide your dmesg.log
5. provide the output of cat /proc/interrupts
6. provide the output of lspci -vvxx
7. echo 1 > /proc/sys/kernel/restore_pci
8. echo disk > /sys/power/state to see if there is still a "No irq" issue during S4. (Note: all the patches I mentioned above are meant to fix the "No irq" issue and have nothing to do with other problems.)
Comment 77 Chen Yu 2017-05-31 06:30:55 UTC
Created attachment 256805 [details]
no irq fixing debug
Comment 78 Manuel Krause 2017-05-31 12:21:01 UTC
(In reply to Chen Yu from comment #74)
> (In reply to Manuel Krause from comment #72)

> Sure, but let's focus on the IRQ affinity issue for now, please refer to
> this link for why this issue is quite critical, and I've already saw this on
> my servers:
> https://patchwork.kernel.org/patch/9748013/

O.k., of course. BTW, could you briefly comment on the functional difference between patch 256709 and 256743, please? Thank you.
Comment 79 Chen Yu 2017-05-31 13:59:32 UTC
(In reply to Manuel Krause from comment #78)
> (In reply to Chen Yu from comment #74)
> > (In reply to Manuel Krause from comment #72)
> 
> > Sure, but let's focus on the IRQ affinity issue for now, please refer to
> > this link for why this issue is quite critical, and I've already saw this
> on
> > my servers:
> > https://patchwork.kernel.org/patch/9748013/
> 
> O.k., of course. BTW, can you make a short comment on the functional
> difference of patch 256709 versus 256743, please? Thank you.

256743 forcibly powers on the PCI devices and restores their status
during hibernation, while 256709 only restores the status.
Comment 80 Thomas Mitterfellner 2017-05-31 14:51:57 UTC
(In reply to Chen Yu from comment #76)
> Could you confirm if
> you have applied that patch and recompiled the kernel with the patch?

Yes, I just did it again and tested hibernation again. I still get the do_IRQ 3.51 message and no power-off

> Anyway, please apply this debug patch on your 4.11.3, w/o any other patch
> applied, and do the following to verify if the patch has taken effect or not:
> 4. after booting up to the patched 4.11.3, provide your dmesg.log 
> 5. provide your cat /proc/interrupts
> 6. provide your lspci -vvxx

> 8. echo disk > /sys/power/state to see if there is still "No irq" issue
> during S4. (notice: all the patch I mentioned above is to fix the "No irq"
> issue, nothing to do with other problems)

Still saw the issue.
Comment 81 Thomas Mitterfellner 2017-05-31 14:53:26 UTC
Created attachment 256811 [details]
dmesg output
Comment 82 Thomas Mitterfellner 2017-05-31 14:54:34 UTC
Created attachment 256813 [details]
proc interrupts
Comment 83 Thomas Mitterfellner 2017-05-31 14:55:06 UTC
Created attachment 256815 [details]
lspci output
Comment 84 Manuel Krause 2017-06-01 00:46:58 UTC
(In reply to Thomas Mitterfellner from comment #81)
> Created attachment 256811 [details]
> dmesg output

Can you please also attach a dmesg taken _after_ hibernation & resuming? This would hopefully provide the additional data reported by patch 256805. (I don't understand why dmesg was asked for before resuming, BTW.)
Apart from this added debug output and the tunable needed to enable it, the patch is just 256709.
I don't want to exhaust your patience, but did/does the 256743 patch change your experience?
Comment 85 Manuel Krause 2017-06-01 00:53:18 UTC
Please forget my last question. I've just re-read your answer to it in Comment 73.
Comment 86 Manuel Krause 2017-06-01 01:42:48 UTC
(In reply to Thomas Mitterfellner from comment #80)
> 
> Still saw the issue.

Mmmh. Another proposal for your testing, although Chen Yu might hurt me for it, and I can't promise any better result:
Have you ever tried issuing "echo 0 > /sys/power/pm_async" (default is 1) before suspending? Some people say it changes the "timeline" of when drivers get loaded for the better, and some say it doesn't. (I personally benefited from this some months ago, when my gfx driver "wanted" to be loaded in a dedicated timeframe.)

Best wishes and best regards, 
Manuel Krause
Comment 87 Chen Yu 2017-06-01 03:35:13 UTC
3.51 means that vector[51] on CPU 3 has no irq handler.
The driver using this interrupt is the NVidia driver,
as you can see from the logs in dmesg:
__assign_irq_vector: vector[51] assigned to cpus[3] with irq[39]
IRQ 39 is used by the NVidia driver.
How about blacklisting this driver? I cannot see it in my stable tree source code.
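
To make the CPU.vector reading of such a warning explicit, the pair can be split out with a small helper like the one below. It is only a sketch and parse_do_irq is a made-up name; mapping the vector to an IRQ still relies on the __assign_irq_vector lines printed by the debug patch.

```shell
# Sketch only: split the "CPU.vector" pair out of a do_IRQ warning.
# parse_do_irq is a hypothetical helper name.
parse_do_irq() {
    # Expects a line such as: do_IRQ: 3.51 No irq handler for vector
    # Prints nothing if the line is not a do_IRQ warning.
    printf '%s\n' "$1" |
        sed -n 's/.*do_IRQ: \([0-9][0-9]*\)\.\([0-9][0-9]*\) No irq handler for vector.*/cpu=\1 vector=\2/p'
}
```

For example, parse_do_irq 'do_IRQ: 3.51 No irq handler for vector' prints "cpu=3 vector=51", and the later "2.227" message would read as CPU 2, vector 227.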
Comment 88 Chen Yu 2017-06-01 05:10:04 UTC
(In reply to Chen Yu from comment #87)
> 3.51 means, the vector[51] on cpu3 has no irq handler.
> And the corresponding driver in use on this interrupt is the NVidia driver,
> as you will see the logs in dmesg:
> __assign_irq_vector: vector[51] assigned to cpus[3] with irq[39]
> the irq 39 is used by the NVidia driver
> How about blacklist this driver? I can not see it in my stable tree source
> code
[Please blacklist Nvidia with 256805 applied]
BTW, I'm testing on top of this stable tree from  https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git
Where did you download the 4.11.3-1.g7262353? I can not find it at 
http://download.opensuse.org/repositories/Kernel:/stable/standard/x86_64/
Comment 89 Thomas Mitterfellner 2017-06-01 10:59:50 UTC
I tried debug patch 256805 on kernel 4.11.3-1.g7262353 (which I got from the openSUSE kernel-stable repo) with nouveau now. I only get to an X screen with a mouse pointer, but the login manager does not show (that's bad!). When I hibernated via echo disk > /sys/power/state, that worked without problems. Yet I'm not sure whether that's a good test, since the system did not run the way I'm used to.

Also, it doesn't really help since I'd rather use the proprietary nvidia driver. With the first patch you posted, everything was fine using the nvidia driver.

@Manuel: I tried echo 0 > /sys/power/pm_async before hibernating; that did not change anything.
Comment 90 Chen Yu 2017-06-01 11:46:38 UTC
Thanks for your testing. This was to verify whether the nvidia driver has scribbled over the IRQ MSI affinity across suspend()/resume(); unfortunately the answer seems to be yes. Even worse, the nvidia driver is closed-source, so I cannot figure out how it touches the MSI unless someone at Nvidia provides some information. So from my point of view, patch 256805 has fixed the PCI MSI issues for all the open-source drivers, and it should be good material for upstream (for example, you did not see the ahci/sata warnings anymore).

As for the first patch you mentioned, to be honest, I really don't know what the connection is between that patch and the "no irq handler" issue; I proposed it only because I found something that might be unreasonable behaviour in the current code. Let me ping the maintainer about that patch if you really want it to get merged.
Comment 91 Thomas Mitterfellner 2017-06-01 12:38:21 UTC
Created attachment 256829 [details]
dmesg output after resume from hibernate
Comment 92 Thomas Mitterfellner 2017-06-01 12:50:13 UTC
I have now applied your debug patch attachment 256805 [details] and manually applied 256743 (without that I could not hibernate) to kernel 4.11.3-4.g7bbd095-default. I did a hibernation with manual power-off and resume, then dmesg (see attachment).
Comment 93 Thomas Mitterfellner 2017-06-01 13:03:21 UTC
To follow up on my previous comment, I also applied patch 255945 (the first you posted) to 4.11.3-4.g7bbd095-default (in addition to the two other patches), and it makes hibernation plus automatic power-off work. Should I also attach /proc/interrupts or dmesg or anything?
Comment 94 Manuel Krause 2017-06-02 11:19:33 UTC
(In reply to Thomas Mitterfellner from comment #93)
> To follow up on my previous comment, I also applied patch 255945 (the first
> you posted) to 4.11.3-4.g7bbd095-default (in addition to the two other
> patches), and it makes hibernation plus automatic power-off work. Should I
> also attach /proc/interrupts or dmesg or anything?

So, now I'm completely confused. Please let me ask some questions:
* You've ended up adding the code from all three patches?
* The very first patch attachment 255945 [details] is responsible for automatic power-off to work?
* The debug patch attachment 256805 [details] (with issuing echo 1 > /proc/sys/kernel/restore_pci) together with the very first patch are not sufficient to make hibernation work for you (so you've manually added the changes from patch attachment 256743 [details])?
* But either of the latter two patches fixes the "No IRQ handler" issue?

Part of my confusion is that no message including "restore pci status for" appears in your post-hibernation dmesg. Did you issue "echo 1 > /proc/sys/kernel/restore_pci" when using debug patch attachment 256805 [details]?

Thank you in advance for a little more clarification.
Comment 95 Manuel Krause 2017-06-02 17:00:51 UTC
BTW, I'm ATM doing some reliability testing on my system with patch attachment 255945 [details] AND patch attachment 256743 [details]. 
CONFIG_PM_DEBUG=y led to the high resume speed from hibernation, and the latter patch led to higher reliability than the plain patch attachment 256709 [details] (which is the base of the debug patch attachment 256805 [details]).

Let me remark:
* 255945 makes use of "irq_set_affinity_locked(data, affinity, true);" instead of previous "chip->irq_set_affinity(data, affinity, true);"
* 256743 puts the action to "pci_pm_default_resume_early(pci_dev);" where the "pci_restore_state(pci_dev);" is included.
Comment 96 Chen Yu 2017-06-12 07:28:11 UTC
(In reply to Thomas Mitterfellner from comment #93)
> To follow up on my previous comment, I also applied patch 255945 (the first
> you posted) to 4.11.3-4.g7bbd095-default (in addition to the two other
> patches), and it makes hibernation plus automatic power-off work. Should I
> also attach /proc/interrupts or dmesg or anything?

tlgx has replied to me on https://patchwork.kernel.org/patch/9695771/ and given me some advice. I'm looking at the IRQ code carefully to find a safe solution, so that no regression is introduced.
Comment 97 Manuel Krause 2017-06-13 15:18:36 UTC
(In reply to Chen Yu from comment #96)
> tlgx has replied me on https://patchwork.kernel.org/patch/9695771/ and gave
> me some advice, I'm looking at the IRQ code carefully to find a safe
> solution in case any regression is introduced.

Thank you very much in advance for your investigation, work and time. Hopefully your good work will fix the possible "issue behind". :-) Good luck!
Comment 98 Chen Yu 2017-06-20 16:16:56 UTC
Created attachment 257097 [details]
Fix irq affinity during CPU offline[v2]

Okay, I've finally looked through the irq affinity logic. Please test this patch (it is an improvement of 255945) together with the pci restore patch:
https://patchwork.kernel.org/patch/9748017/
on top of the latest kernel. Thanks.
Comment 99 Chen Yu 2017-06-20 16:20:53 UTC
(In reply to Manuel Krause from comment #95)
> BTW, I'm ATM doing some reliability testing on my system with patch
> attachment 255945 [details] AND patch attachment 256743 [details]. 
> The CONFIG_PM_DEBUG=y led to high resume speed from hibernation and the
> latter patch led to higher reliability than plain patch attachment 256709 [details]
> [details] (what's the base of the debug patch attachment 256805 [details]).
> 
> Let me remark:
> * 255945 makes use of "irq_set_affinity_locked(data, affinity, true);"
> instead of previous "chip->irq_set_affinity(data, affinity, true);"
Yes.
> * 256743 puts the action to "pci_pm_default_resume_early(pci_dev);" where
> the "pci_restore_state(pci_dev);" is included.
It puts the pci_restore_state(pci_dev) call into pci_pm_thaw_noirq().
Comment 100 Thomas Mitterfellner 2017-06-21 16:10:46 UTC
I applied that patch (257097, 0001-Fix-irq-affinity-during-cpu-offline.patch) to kernel 4.11.3, but I must report that it does not fix my hibernation issue. I get all these no IRQ handler messages and finally the revalidation failed stuff and the file system related messages.
Comment 101 Chen Yu 2017-06-22 02:54:53 UTC
(In reply to Thomas Mitterfellner from comment #100)
> I applied that patch (257097,
> 0001-Fix-irq-affinity-during-cpu-offline.patch) to kernel 4.11.3, but I must
> report that it does not fix my hibernation issue. I get all these no IRQ
> handler messages and finally the revalidation failed stuff and the file
> system related messages.
Please apply not only 257097 but also the patch https://patchwork.kernel.org/patch/9748017/, and please use the latest vanilla upstream kernel (not the distribution repo), which can be cloned via
git clone https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Please also compile the kernel with pm_debug enabled.
By using the vanilla kernel, you can test S4 without rebooting the system and narrow down the scope as follows:

1. First, please test:
    echo core > /sys/power/pm_test
    echo disk > /sys/power/state
    The system will suspend and wait for 5 seconds before waking up.
    If there is no problem during this test (no irq warning), please reboot
    the system into a fresh environment.

2. After boot up:
    echo test_resume > /sys/power/disk
    echo disk > /sys/power/state

    This goes deeper than the previous test. If there is no problem during
    this test (no irq warning), then we are done. If there is a "No irq handler"
    warning, please paste the warning info here. I'll provide a debug patch
    according to the log.
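
The warning asked for above can be pulled out of a saved dmesg log with a small helper. This is only a sketch: it assumes the x86 message has the form "do_IRQ: CPU.VECTOR No irq handler for vector" (so 3.51 would mean CPU 3, vector 51), and the function name no_irq_vectors is made up for illustration.

```shell
# Read dmesg text on stdin and print one "CPU VECTOR" pair per distinct
# "No irq handler for vector" warning.
# Assumption: the kernel prints "do_IRQ: <cpu>.<vector> No irq handler for vector".
no_irq_vectors() {
    sed -n 's/.*do_IRQ: \([0-9][0-9]*\)\.\([0-9][0-9]*\) No irq handler for vector.*/\1 \2/p' \
        | sort -u
}
```

Possible usage after a failed test: dmesg | no_irq_vectors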
Comment 102 Thomas Mitterfellner 2017-06-22 08:43:22 UTC
I had tried it with the *vanilla* kernel source provided by my distribution repos (this should be without any distribution specific patches) – is this OK? Because I vaguely remember that I couldn't boot the upstream kernel I compiled from https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Anyway, I'll try to get to it tonight.
Comment 103 Thomas Mitterfellner 2017-06-22 20:01:50 UTC
I used 4.11.6-1.ge566a4a, applied PCI-PM-Restore-the-status-of-PCI-devices-across-hibernation.patch and 0001-Fix-irq-affinity-during-cpu-offline.patch.

The first test went fine, I didn't see a no IRQ handler message.

The second test failed, mentioning IRQ 3.51 again.

Do you need an additional log file?
Comment 104 Manuel Krause 2017-06-22 21:06:51 UTC
Of course, please post logs!

Your kernel is quite current. No need to go to Linus' most recent repository, even if fancy developers ask for it. ^^

Have you done the "challenge", that Chen Yu proposed in Comment 101 ?
Comment 105 Manuel Krause 2017-06-22 21:10:45 UTC
And, does your system properly shut down at hibernation?
Comment 106 Manuel Krause 2017-06-22 21:20:16 UTC
@Thomas:
Just ignore my last question in Comment 104, I've simply misunderstood your reply, that you've tested them both.
Comment 107 Chen Yu 2017-06-23 05:34:23 UTC
(In reply to Thomas Mitterfellner from comment #102)
> I had tried it with the *vanilla* kernel source provided by my distribution
> repos (this should be without any distribution specific patches) – is this
> OK?
Should be Okay, if there is no modification from the distribution. Where did you get the package 4.11.6-1.ge566a4a? Is it 
http://download.opensuse.org/repositories/Kernel:/HEAD/standard/x86_64/
with the source code at:
http://download.opensuse.org/repositories/Kernel:/HEAD/standard/src/
?
The reasons why I insist on using the upstream kernel are:
1. I need to look up the code for debugging
2. the problem might be fixed already.
But as you have problems running the vanilla kernel (although in comment 12 you mentioned you could boot using the source code from git.kernel.org)
Comment 108 Chen Yu 2017-06-23 06:02:49 UTC
(In reply to Chen Yu from comment #107)
> (In reply to Thomas Mitterfellner from comment #102)
> > I had tried it with the *vanilla* kernel source provided by my distribution
> > repos (this should be without any distribution specific patches) – is this
> > OK?
> Should be Okay, if there is no modification from the distribution. Where did
> you get the package 4.11.6-1.ge566a4a? Is it 
> http://download.opensuse.org/repositories/Kernel:/HEAD/standard/x86_64/
> with the source code at:
> http://download.opensuse.org/repositories/Kernel:/HEAD/standard/src/
> ?
How about installing this version to check if it boots and reproduces the problem?
Comment 109 Thomas Mitterfellner 2017-06-23 06:48:51 UTC
I used the kernel source from the kernel-stable repo: http://download.opensuse.org/repositories/Kernel:/stable/standard/src/

I can use the 4.12.rc5 version too, though I highly doubt it will be fixed in this one. The patches to the 4.11.6 version had 5 and 12 lines offset, btw.
Comment 110 Chen Yu 2017-06-23 06:55:53 UTC
(In reply to Thomas Mitterfellner from comment #109)
> I used the kernel source from the kernel-stable repo:
> http://download.opensuse.org/repositories/Kernel:/stable/standard/src/
> 
> I can use the 4.12.rc5 version too, though I highly doubt it will be fixed
> in this one. The patches to the 4.11.6 version had 5 and 12 lines offset,
> btw.
Yes, please use 4.12.rc5. I understand this version is unlikely to fix your problem, but it is good news for me, as I can prepare a debug patch more easily.
Comment 111 Chen Yu 2017-06-23 06:57:55 UTC
(In reply to Chen Yu from comment #110)
> (In reply to Thomas Mitterfellner from comment #109)
> > I used the kernel source from the kernel-stable repo:
> > http://download.opensuse.org/repositories/Kernel:/stable/standard/src/
> > 
> > I can use the 4.12.rc5 version too, though I highly doubt it will be fixed
> > in this one. The patches to the 4.11.6 version had 5 and 12 lines offset,
> > btw.
> Yes, please use 4.12.rc5,I understand this version will unlikely fix your
> problem, but it is a good news to me, as I can prepare debug patch more
> easily.
kernel-vanilla-4.12.rc5-1.1.g270295f.x86_64.rpm  this one should be ok
Comment 112 Thomas Mitterfellner 2017-06-23 17:10:23 UTC
Created attachment 257143 [details]
dmesg output after resume from hibernate 4.12.0-rc6

Tested both patches applied to 4.12.0-rc6-1.gffd9401-default – I still get the "no IRQ handler for vector" message (3.51). I had to shut down manually. When I powered on again, the hibernation image could be loaded, though. I'll post dmesg output – tell me if you need something else.
Comment 113 Thomas Mitterfellner 2017-06-24 09:48:06 UTC
Note that after the first hibernation, the IRQ for which the handler is lost changes; I think it was 3.133 or 1.131. Meanwhile I have hibernated three times, and I always powered off manually after the hard disk light went out.
Comment 114 Chen Yu 2017-06-24 15:40:32 UTC
(In reply to Thomas Mitterfellner from comment #113)
> Note that after the first hibernation, the IRQ for which the handle is lost
> changes, I think it was 3.133 or 1.131. Meanwhile I hibernated three times,
> I always powered off manually after the harddisk light went out.
What do you mean by "first hibernation"? You mentioned in Comment 112 that the no irq warning is 3.51, which is the nvidia driver, but why does it become a 3.133 or 1.131 warning? Currently I'm writing a quirk to deal with the nvidia device, but if it is 3.51, it is another direction. Please confirm that both the 257097 patch and
https://patchwork.kernel.org/patch/9748017/ have been applied on your 4.12.0-rc6-1.gffd9401-default
Comment 115 Chen Yu 2017-06-24 15:41:39 UTC
(In reply to Chen Yu from comment #114)
> (In reply to Thomas Mitterfellner from comment #113)
> > Note that after the first hibernation, the IRQ for which the handle is lost
> > changes, I think it was 3.133 or 1.131. Meanwhile I hibernated three times,
> > I always powered off manually after the harddisk light went out.
> What do you mean by "first hibernation"?  you mentioned in Comment 112 the
> no irq warning is 3.51, which is the nvidia driver,  but why it becomes
> 3.133 or 1.131 warning?  Currently I'm writing a quirk to deal with nvidia
> device, but if it is 3.51, it is another direction. Please confirm both the
> 257097 patch  and 
> https://patchwork.kernel.org/patch/9748017/ have been applied on your
> 4.12.0-rc6-1.gffd9401-default

s/but if it is 3.51/but if it is 3.133 or 1.131/
Comment 116 Thomas Mitterfellner 2017-06-24 18:56:06 UTC
Sorry, that was a typo: it's always 3.something (not 1.something), but the .something changes between the first and the second hibernation.

I can confirm I applied both of the patches you named.

By first hibernation I mean: I do hibernate, then I get the 3.51 IRQ message. Then I manually power-off, then resume. Then I hibernate again (second hibernation). This time the IRQ message has changed from 3.51 to 3.133 (or 3.131).
Comment 117 Thomas Mitterfellner 2017-06-24 19:00:12 UTC
Created attachment 257167 [details]
dmesg out after 4th or 5th hibernation

OK, the .something seems to change with each hibernation. Now, after 4th or 5th it is 3.198. Attached is the dmesg output after a couple of hibernations (with manual power-off each time).
Comment 118 Chen Yu 2017-06-26 04:28:31 UTC
Created attachment 257175 [details]
all-in-one-debug-nvidia

To avoid confusion, this patch combines the following fixes and debug code:
1. irq affinity setting fix during fixup_irqs
2. pci status restore during S4
3. debug facility for nvidia GF108M [GeForce GT 525M]

Please apply this patch on top of 4.12.0-rc6-1.gffd9401-default and read the
test steps carefully, thanks:

1. provide your boot up dmesg logs: bootup_dmesg_pm_test_core.log

   echo 0 > /sys/power/pm_async
   echo 1 > /sys/kernel/debug/tracing/events/power/device_pm_callback_start/enable
   echo 1 > /sys/kernel/debug/tracing/events/power/device_pm_callback_end/enable
   echo 1 > /sys/kernel/debug/tracing/tracing_on

   echo core > /sys/power/pm_test
   echo disk > /sys/power/state

   After wakeup (I think it should wake up fine, as you have already tested this test mode before), gather:
   dmesg_pm_test_core.log
   cat /sys/kernel/debug/tracing/trace > trace_pm_test_core.log
   
2. *REBOOT* the system.

    provide your boot up dmesg logs: bootup_dmesg_all_replace.log

   echo 0 > /sys/power/pm_async
   echo 1 > /sys/kernel/debug/tracing/events/power/device_pm_callback_start/enable
   echo 1 > /sys/kernel/debug/tracing/events/power/device_pm_callback_end/enable
   echo 1 > /sys/kernel/debug/tracing/tracing_on

   echo test_resume > /sys/power/disk
   echo 1 > /proc/sys/kernel/pci_thaw_noirq_replace
   echo 1 > /proc/sys/kernel/pci_thaw_replace
   echo disk > /sys/power/state
   and wait a while to see if it can be woken up, and if there is any warning of
   "No irq handler".

Tips: this test step replaces pci_thaw_noirq with pci_restore_noirq and
    pci_thaw with pci_restore, which is what test 1 does. Finding out the
    difference between pci_restore() and pci_thaw(), and between
    pci_restore_noirq() and pci_thaw_noirq(), is the key to working around your
    nvidia problem. Although this test might fail, please provide the No irq number (3.51 etc.).


   upload your:
   dmesg_all_replace.log
   cat /sys/kernel/debug/tracing/trace > trace_all_replace.log



3. *REBOOT* the system.
   provide your boot up dmesg logs: bootup_dmesg_only_replace_thaw_noirq.log

   echo 0 > /sys/power/pm_async
   echo 1 > /sys/kernel/debug/tracing/events/power/device_pm_callback_start/enable
   echo 1 > /sys/kernel/debug/tracing/events/power/device_pm_callback_end/enable
   echo 1 > /sys/kernel/debug/tracing/tracing_on

   echo test_resume > /sys/power/disk
   echo 1 > /proc/sys/kernel/pci_thaw_noirq_replace
   
   echo disk > /sys/power/state
   and wait a while to see if it can be woken up, and if there is any warning of
   "No irq handler".

   upload your:
   dmesg_only_replace_thaw_noirq.log
   cat /sys/kernel/debug/tracing/trace > trace_only_replace_thaw_noirq.log

Although this test might fail, please provide the No irq number, 3.51 etc.



4. *REBOOT* the system.
   provide your boot up dmesg logs: bootup_dmesg_only_replace_thaw.log

   echo 0 > /sys/power/pm_async
   echo 1 > /sys/kernel/debug/tracing/events/power/device_pm_callback_start/enable
   echo 1 > /sys/kernel/debug/tracing/events/power/device_pm_callback_end/enable
   echo 1 > /sys/kernel/debug/tracing/tracing_on

   echo test_resume > /sys/power/disk
   echo 1 > /proc/sys/kernel/pci_thaw_replace
   
   echo disk > /sys/power/state
   and wait a while to see if it can be woken up, and if there is any warning of
   "No irq handler".

   upload your:
   dmesg_only_replace_thaw.log
   cat /sys/kernel/debug/tracing/trace > trace_only_replace_thaw.log


Although this test might fail, please provide the No irq number, 3.51 etc.
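
The four steps above share the same tracing setup and differ only in which knobs are set before "echo disk > /sys/power/state". A rough sketch of that common part as shell functions follows. The function names and the PM_ROOT prefix (meant only for dry runs against a scratch directory) are hypothetical, and the pci_thaw*_replace entries exist only with the debug patch from attachment 257175 applied.

```shell
# Write VALUE ($2) to PATH ($1). PM_ROOT is a hypothetical prefix for dry runs;
# leave it unset to write to the real /sys and /proc files (needs root).
pm_write() {
    printf '%s\n' "$2" > "${PM_ROOT:-}$1"
}

# Settings shared by every test step: synchronous PM plus callback tracing.
pm_trace_setup() {
    pm_write /sys/power/pm_async 0
    pm_write /sys/kernel/debug/tracing/events/power/device_pm_callback_start/enable 1
    pm_write /sys/kernel/debug/tracing/events/power/device_pm_callback_end/enable 1
    pm_write /sys/kernel/debug/tracing/tracing_on 1
}

# Select one of the four variants: core | all_replace | only_thaw_noirq | only_thaw
pm_select_variant() {
    case "$1" in
        core)
            pm_write /sys/power/pm_test core ;;
        all_replace)
            pm_write /sys/power/disk test_resume
            pm_write /proc/sys/kernel/pci_thaw_noirq_replace 1
            pm_write /proc/sys/kernel/pci_thaw_replace 1 ;;
        only_thaw_noirq)
            pm_write /sys/power/disk test_resume
            pm_write /proc/sys/kernel/pci_thaw_noirq_replace 1 ;;
        only_thaw)
            pm_write /sys/power/disk test_resume
            pm_write /proc/sys/kernel/pci_thaw_replace 1 ;;
        *)
            echo "unknown variant: $1" >&2
            return 1 ;;
    esac
}
```

After pm_trace_setup and, for example, pm_select_variant only_thaw, the hibernation itself would still be started by hand with "echo disk > /sys/power/state", following a fresh reboot for each variant as described in the steps.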
Comment 119 Thomas Mitterfellner 2017-06-26 17:02:44 UTC
Created attachment 257183 [details]
Logs from tests requested in comment #118

I did all the tests requested in comment #118 using kernel source linux-4.12.0-rc6-1.gffd9401 and patch nvidia_irq_fix_debug_all.txt (attachment 257175 [details]).

All of the tests resumed without problems and in none of the tests did I see a "No IRQ handler" message. The only message I saw was something like "psmouse serio1 synaptics: hardware seems to be different" (it was displayed too briefly to read exactly).
Comment 120 Chen Yu 2017-06-30 02:42:24 UTC
(In reply to Thomas Mitterfellner from comment #119)
> Created attachment 257183 [details]
> Logs from tests requested in comment #118
> 
> I did all the tests requested in comment #118 using kernel source
> linux-4.12.0-rc6-1.gffd9401 and patch nvidia_irq_fix_debug_all.txt
> (attachment 257175 [details]).
> 
> All of the test resumed without problems and in none of the tests did I see
> a "No IRQ handler" message. The only message I saw was something like
> "psmouse serio1 synaptics: hardware seems to be different" (it was displayed
> too shortly to read exactly).

Thanks! Then how about this test mode (using linux-4.12.0-rc6-1.gffd9401 and the patch nvidia_irq_fix_debug_all):

5. *REBOOT* the system.
   provide your boot up dmesg logs: bootup_dmesg_default_test_resume.log
   and cat /proc/interrupts > proc_interrupts_test_resume_default.log

   echo 0 > /sys/power/pm_async
   echo 1 > /sys/kernel/debug/tracing/events/power/device_pm_callback_start/enable
   echo 1 > /sys/kernel/debug/tracing/events/power/device_pm_callback_end/enable
   echo 1 > /sys/kernel/debug/tracing/tracing_on

   echo test_resume > /sys/power/disk
   
   echo disk > /sys/power/state
   and wait a while to see if it can be woken up, and if there is any warning of
   "No irq handler".

   If there is no problem, please upload the following logs after resume (no need to upload if it failed):
   dmesg_test_resume_default.log
   cat /sys/kernel/debug/tracing/trace > trace_test_resume_default.log
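
To find the nvidia line in a saved /proc/interrupts snapshot, such as the proc_interrupts_test_resume_default.log requested above, something like the following might help. It is a sketch that assumes the usual layout where the IRQ number is the first column and the driver name is the last one; the function name irq_for_driver is made up.

```shell
# Read a /proc/interrupts snapshot on stdin and print the IRQ number of each
# line whose last column equals the given driver name.
irq_for_driver() {
    awk -v drv="$1" '$NF == drv { sub(/:$/, "", $1); print $1 }'
}
```

Possible usage: irq_for_driver nvidia < proc_interrupts_test_resume_default.log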
Comment 121 Thomas Mitterfellner 2017-07-01 08:57:54 UTC
This test failed. The message shown was the usual no IRQ handler message (3.51).
Comment 122 Manuel Krause 2017-08-04 19:42:20 UTC
So quiet on here? All on vacation? ;-)

@Chen Yu: Any conclusions from the results that Thomas presented?
Comment 123 Chen Yu 2017-09-04 07:57:40 UTC
Hi,
I do not have much of a clue at present because I do not have the source code from nvidia. The only workaround I can think of may be a quirk for GF108M, but it might not be accepted by the community, because this should be an issue in the nvidia driver, which behaves like a black box to me. Do you guys know how we can contact the nvidia driver developers?
Comment 124 Thomas Mitterfellner 2017-09-04 10:56:44 UTC
The only thing I've found is the nvidia developer forum. But so far, I couldn't even get a response to my postings from anyone there (e.g. https://devtalk.nvidia.com/default/topic/1020305/hibernation-broken-with-384-95/). Maybe you'll have more luck.
Comment 125 Thomas Mitterfellner 2017-12-27 23:34:37 UTC
So the latest status with nvidia driver 384.95 is that after invoking hibernation the screen briefly goes to text mode (no messages about irqs, just a blinking cursor), then the screen goes black but the computer is not shut down. After ~1 minute, Caps Lock starts flashing (which indicates a kernel panic IIRC). I've tested this up to kernel 4.14.9-1.g9423ca2.

Could someone help me to debug this behavior, please (i.e. what should I do in order to get more information)?
