Since kernel 5.14 (probably since 5.14-rc5, in fact), the touchpad no longer works after resume from suspend to RAM. It was working with the 5.13 kernel.
I've bisected the bug, first bad commit is:
b3e29642548258c7cd2cb3326a776fff84cd6b69 is the first bad commit
Merge: 8f4ef88 498d0dd
Author: Jiri Kosina <email@example.com>
Date: Wed Jun 30 09:15:15 2021 +0200
Merge branch 'for-5.14/multitouch' into for-linus
- patch series that ensures that hid-multitouch driver disables touch and
button-press reporting on hid-mt devices during suspend when the device is
not configured as a wakeup-source, from Hans de Goede
This seems to be related to the observed behaviour.
git bisect log:
git bisect start
# good: [62fb9874f5da54fdb243003b386128037319b219] Linux 5.13
git bisect good 62fb9874f5da54fdb243003b386128037319b219
# bad: [7d2a07b769330c34b4deabeed939325c77a7ec2f] Linux 5.14
git bisect bad 7d2a07b769330c34b4deabeed939325c77a7ec2f
# bad: [406254918b232db198ed60f5bf1f8b84d96bca00] Merge tag 'perf-tools-for-v5.14-2021-07-01' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux
git bisect bad 406254918b232db198ed60f5bf1f8b84d96bca00
# bad: [a6eaf3850cb171c328a8b0db6d3c79286a1eba9d] Merge tag 'sched-urgent-2021-06-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
git bisect bad a6eaf3850cb171c328a8b0db6d3c79286a1eba9d
# good: [31e798fd6f0ff0acdc49c1a358b581730936a09a] Merge tag 'media/v5.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media
git bisect good 31e798fd6f0ff0acdc49c1a358b581730936a09a
# good: [5e6928249b81b4d8727ab6a4037a171d15455cb0] Merge tag 'acpi-5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
git bisect good 5e6928249b81b4d8727ab6a4037a171d15455cb0
# good: [ebb81c14543fb43cb2e1f2bfb5d32f5e390cf895] Merge tag 'mailbox-v5.14' of git://git.linaro.org/landing-teams/working/fujitsu/integration
git bisect good ebb81c14543fb43cb2e1f2bfb5d32f5e390cf895
# bad: [df04fbe8680bfe07f3d7487eccff9f768bb02533] Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid
git bisect bad df04fbe8680bfe07f3d7487eccff9f768bb02533
# good: [72fbcac2f40e690e1a5584358750e546a2678c2c] platform/x86: intel_cht_int33fe: Move to its own subfolder
git bisect good 72fbcac2f40e690e1a5584358750e546a2678c2c
# good: [33197bd3e82f5c60487e53d4a291dc2e6031833f] Merge branch 'for-5.14/intel-ish' into for-linus
git bisect good 33197bd3e82f5c60487e53d4a291dc2e6031833f
# good: [e60d726f5d8ccc85f18b9f1f6839112dc8c58fb8] Merge tag 'tpmdd-next-v5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jarkko/linux-tpmdd
git bisect good e60d726f5d8ccc85f18b9f1f6839112dc8c58fb8
# bad: [b3e29642548258c7cd2cb3326a776fff84cd6b69] Merge branch 'for-5.14/multitouch' into for-linus
git bisect bad b3e29642548258c7cd2cb3326a776fff84cd6b69
# good: [cbe5b6b6a77ad262d9f9a56962c9b1ac2f91c0f5] HID: lg-g15: Add support for the Logitech Z-10 speakers
git bisect good cbe5b6b6a77ad262d9f9a56962c9b1ac2f91c0f5
# good: [622d97cf7f2b4efb36bec3c85b5c1db5e3dfd586] HID: logitech-dj: Implement may_wakeup ll-driver callback
git bisect good 622d97cf7f2b4efb36bec3c85b5c1db5e3dfd586
# good: [8f4ef88ebadefcf16b7f616f8af940465c44bea2] Merge branch 'for-5.14/logitech' into for-linus
git bisect good 8f4ef88ebadefcf16b7f616f8af940465c44bea2
# good: [498d0ddc6ae931e4e79a57c56b6dd4576aa435b6] HID: multitouch: Disable event reporting on suspend when the device is not a wakeup-source
git bisect good 498d0ddc6ae931e4e79a57c56b6dd4576aa435b6
# first bad commit: [b3e29642548258c7cd2cb3326a776fff84cd6b69] Merge branch 'for-5.14/multitouch' into for-linus
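For readers who want to summarize a bisect log like the one above, here is a minimal sketch. It assumes the log uses the comment lines that `git bisect log` writes (`# good: [sha] subject`, `# bad: [sha] subject`, `# first bad commit: [sha] subject`); the function name `summarize_bisect_log` and the shortened sample are only illustrative.

```python
import re

# Matches the comment lines found in a bisect log, e.g.
# "# good: [<40-hex sha>] <subject>"
VERDICT_RE = re.compile(
    r"^# (good|bad|first bad commit): \[([0-9a-f]{40})\] (.*)$"
)

def summarize_bisect_log(text):
    """Split a bisect log into good commits, bad commits and the first bad one."""
    good, bad, first_bad = [], [], None
    for line in text.splitlines():
        m = VERDICT_RE.match(line.strip())
        if not m:
            continue  # plain "git bisect ..." command lines are skipped
        verdict, sha, subject = m.groups()
        if verdict == "good":
            good.append((sha, subject))
        elif verdict == "bad":
            bad.append((sha, subject))
        else:
            first_bad = (sha, subject)
    return good, bad, first_bad

# Shortened excerpt of the log above, for illustration only.
sample = """\
git bisect start
# good: [62fb9874f5da54fdb243003b386128037319b219] Linux 5.13
git bisect good 62fb9874f5da54fdb243003b386128037319b219
# bad: [7d2a07b769330c34b4deabeed939325c77a7ec2f] Linux 5.14
git bisect bad 7d2a07b769330c34b4deabeed939325c77a7ec2f
# first bad commit: [b3e29642548258c7cd2cb3326a776fff84cd6b69] Merge branch 'for-5.14/multitouch' into for-linus
"""

good, bad, first_bad = summarize_bisect_log(sample)
print(len(good), len(bad), first_bad[0][:10])
```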
Note: The laptop has no key/button to enable/disable the touchpad.
Created attachment 299151 [details]
dmesg for 5.13.9 (good)
Created attachment 299153 [details]
dmesg for 5.14 (bad)
Touchpad is an ELAN0718:00 04F3:30FD
Update with kernel 5.15-rc7
Still present. However, I noticed that dmesg outputs interesting messages saying there was a transfer from the ELAN touchpad while suspended; it failed to suspend, then failed to resume.
The interesting thing is if I try to suspend a second time, at resume the touchpad manages to resume.
Appending dmesg output for reference.
Created attachment 299369 [details]
dmesg output for 2 successive suspend/resume with 5.15-rc7 (first fail)
Hi, I'm the Linux kernel's regression tracker. I stumbled on this report today; sorry that none of the developers took a closer look at this. This can happen with reports in bugzilla.kernel.org, as most of the Linux kernel subsystems expect issues to be reported elsewhere. But whatever:
Please let me know if the issue is still present with 5.16-rc5. If it is, it needs to be reported to the developers involved with the patches that are causing this. Ideally you do this yourself, as outlined in https://www.kernel.org/doc/html/latest/admin-guide/reporting-issues.html If you don't want to do that, I can forward your report.
Hello, thanks for the answer.
The problem is still present in 5.16-rc7, but it actually got worse (and is now critical).
a) After the first suspend attempt (closing the lid), the laptop doesn't suspend (only the screen is blanked).
b) When I reopen the lid, the touchpad stays disabled.
c) When I try to suspend a second time (closing the lid), the system freezes completely (without any info in dmesg) and I need to hard-reset it.
I'll open a separate bug report for this one, when bisected, and will try to post it to the suspend maintainer together with the mailing list (I'm not accustomed to mailing lists).
(In reply to Christian Casteyde from comment #6)
> Hello, thanks for the answer.
> The problem is still present in 5.16-rc7, but it actually got worse (and
> is now critical).
> a) After the first suspend attempt (closing the lid), the laptop doesn't
> suspend (only the screen is blanked).
> b) When I reopen the lid, the touchpad stays disabled.
> c) When I try to suspend a second time (closing the lid), the system
> freezes completely (without any info in dmesg) and I need to hard-reset it.
> I'll open a separate bug report for this one, when bisected, and will try to
> post it to the suspend maintainer together with the mailing list (I'm not
> accustomed to mailing lists).
This *might* be a follow-up error. In your case I'd try to report this problem to the mailing list (I can also forward it if you would prefer this). But I fear you will then be asked to redo part of the bisection (over the range before and after all those HID changes), as the merge commit is unlikely to be the culprit.
FYI, if I remember correctly, reverting this single commit fixed the issue, and the diff made sense (it tries to disable the touchpad before going to sleep to prevent unexpected events from it while suspending... if I understood the code correctly).
Ohh, okay, I see, that merge includes code changes. So do you want me to forward it, or do you want to write the mail yourself?
I'll let you write the mail for this one, just to see how this should be done.
Thanks a lot.
For the record: I forwarded the issue with this mail:
Thank you for bringing this to my attention Thorsten.
So the mentioned merge commit consists of these 4 separate commits:
"HID: core: Add hid_hw_may_wakeup() function"
"HID: usbhid: Implement may_wakeup ll-driver callback"
"HID: logitech-dj: Implement may_wakeup ll-driver callback"
"HID: multitouch: Disable event reporting on suspend when the device is not a wakeup-source"
The first 3 implement a new "hid_hw_may_wakeup()" function, which by itself does not do anything.
The last one uses that function to disable touch + button-press reporting on suspend. Note that the code to re-enable them was already there: before this change, latency would be set to high on suspend and low on resume, and the call setting it low on resume also re-enables the touch + button-press reporting.
So this last commit is the only functional change in the series.
Christian, for starters, can you try 5.14 or 5.15 with commit 498d0ddc6ae931e4e79a57c56b6dd4576aa435b6 reverted and confirm that just reverting that single commit fixes 5.14 / 5.15?
Note that the 5.16 issue sounds unrelated and is likely a new issue, I'm afraid. I see that you said you are going to bisect this, so I guess that you have already come to the same conclusion. And thank you very much for both the bisect for this bug as well as for the new one. Bisecting takes quite some time, but the results of it are really, really useful for us kernel folks to get bugs pinpointed and fixed.
I confirm reverting this commit on kernel 5.15.14 fixes the issue:
However, I really think this commit is OK; the real problem is that the laptop always fails to suspend the first time (and this behavior is not a regression).
I think that the sequence is:
1) Entering suspend
2) Disable touchpad (new commit)
3) Fails to suspend (unknown reason)
4) Do not restore touchpad on error path (this is a bug).
Whereas when I suspend the laptop the second time:
1) Entering suspend
2) Disable touchpad (which does nothing as it is already disabled)
3) Succeeds to enter sleep mode
4) Wake up
5) Restore the touchpad
Hence, when I close and reopen the lid of the laptop again, I get the touchpad back.
Apart from the first suspend failure, the problem is that when an error occurs while suspending, the touchpad state is not restored.
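The two sequences above can be written down as a toy model. This is only a sketch of the hypothesis, not the kernel's actual code paths: the `Touchpad` class, the `suspend_cycle` function and the `restore_on_error` switch are hypothetical names introduced for illustration.

```python
class Touchpad:
    """Toy stand-in for the touchpad's event reporting state."""

    def __init__(self):
        self.reporting = True

    def suspend(self):
        # Step 2: reporting is disabled before going to sleep.
        self.reporting = False

    def resume(self):
        # Step 5: reporting is restored on wake-up.
        self.reporting = True


def suspend_cycle(touchpad, late_suspend_ok, restore_on_error):
    """Run one suspend attempt; return True if the system really slept."""
    touchpad.suspend()
    if not late_suspend_ok:
        # Step 3: the suspend fails. Step 4 is the suspected bug: whether
        # the error path puts the touchpad back into its previous state.
        if restore_on_error:
            touchpad.resume()
        return False
    # Successful sleep followed by a normal wake-up.
    touchpad.resume()
    return True


tp = Touchpad()
first = suspend_cycle(tp, late_suspend_ok=False, restore_on_error=False)
after_first = tp.reporting    # touchpad left disabled by the failed attempt
second = suspend_cycle(tp, late_suspend_ok=True, restore_on_error=False)
after_second = tp.reporting   # touchpad restored by the normal resume
print(first, after_first, second, after_second)
```

With `restore_on_error=True` the touchpad would already be usable after the first, failed attempt, which is the behaviour expected here.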
In reply to comment 13. Ah I see, ok.
So looking at the "dmesg output for 2 successive suspend/resume with 5.15-rc7 (first fail)" attachment this gets printed wrt the suspend failure:
[ 325.180446] [drm] free PSP TMR buffer
[ 325.383095] PM: late suspend of devices failed
But no errors are being shown before, so why the late suspend failed / which device failed to late suspend is not clear.
After this the PM-core resumes devices again, but it seems this resume after a failed suspend does not honor the normal resume ordering constraints, leading to:
[ 325.439972] ------------[ cut here ]------------
[ 325.439975] i2c_designware AMDI0010:03: Transfer while suspended
[ 325.440068] i2c_transfer+0x6c/0xc0
[ 325.440072] __i2c_hid_command+0x1a5/0x3a0
[ 325.440078] ? schedule+0x54/0xc0
[ 325.440083] ? i2c_hid_set_power+0x4a/0x100
[ 325.440087] ? __irq_put_desc_unlock+0x13/0x40
[ 325.440092] i2c_hid_set_power+0x4a/0x100
[ 325.440097] i2c_hid_core_resume+0x98/0xe0
[ 325.440101] ? acpi_subsys_resume_early+0x50/0x50
[ 325.440107] dpm_run_callback+0x1d/0xf0
[ 325.440113] device_resume+0xa4/0x220
[ 325.440117] async_resume+0x14/0x50
Since the I2C-HID device is a child device of the i2c-controller, it should always be resumed *after* the controller, so this should never happen.
I wonder if the issue is that it is the i2c-controller's suspend method which fails (so it will not get resumed since it never suspended) and that upon that failure it fails to clear the bit which the "Transfer while suspended" check checks.
drivers/i2c/busses/i2c-designware-platdrv.c does use SET_LATE_SYSTEM_SLEEP_PM_OPS(), but its suspend/resume handlers never return a non-zero value.
I think we need some more verbose logging to make sense of this.
Can you try a 5.14 or 5.15 kernel with CONFIG_PM_SLEEP_DEBUG=y and CONFIG_DYNAMIC_DEBUG=y set in the .config,
as well as with:
dyndbg="file drivers/base/power/*.c +p" log_buf_len=50M
added to the kernel command line, and then gather dmesg output after 2 suspend/resume cycles?
Hopefully that will help to shed some light on both the "PM: late suspend of devices failed" error as well as the resume ordering issue when that error is hit.
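Before rebuilding, it can be handy to check that both options are really enabled in the .config. A minimal sketch, assuming the usual `NAME=y` / `# NAME is not set` layout of a kernel .config; the helper name `missing_options` and the sample text are made up for illustration.

```python
REQUIRED = ("CONFIG_PM_SLEEP_DEBUG", "CONFIG_DYNAMIC_DEBUG")

def missing_options(config_text, required=REQUIRED):
    """Return the required options that are not set to 'y' in a .config text."""
    enabled = set()
    for line in config_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            # Skips blank lines and '# CONFIG_FOO is not set' comments.
            continue
        name, _, value = line.partition("=")
        if value == "y":
            enabled.add(name)
    return [opt for opt in required if opt not in enabled]

# Made-up sample; a real check would read the .config from the build tree.
sample = """\
CONFIG_PM_SLEEP=y
CONFIG_PM_SLEEP_DEBUG=y
# CONFIG_DYNAMIC_DEBUG is not set
"""

print(missing_options(sample))  # → ['CONFIG_DYNAMIC_DEBUG']
```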
Here are the dmesg outputs (the first just after boot, the second after the first suspend attempt, the third after resume from a successful suspend).
Created attachment 300262 [details]
dmesg after boot - 5.15 with commit reverted
Created attachment 300263 [details]
dmesg after first suspend request
Created attachment 300264 [details]
dmesg after 2nd suspend and resume
Created attachment 300265 [details]
lspci -v for info
Created attachment 300266 [details]
Sorry for butting in here, but it’s strange that the merge commit b3e2964254 is the first bad commit. From the replies to Hans’ question, which say that reverting commit 498d0ddc6a fixes the issue, it seems like 498d0ddc6a should be the culprit. Could you retest that commit please?
(In reply to Paul Menzel from comment #22)
> it seems like 498d0ddc6a should be the culprit.
It is, afaics; see comment 13. I forgot to tell regzbot about it, sorry, will do that now.
(In reply to Hans de Goede from comment #12)
> Thank you for bringing this to my attention Thorsten.
yw, but FWIW: I'm very sorry for not CCing you when I forwarded the issue to the list, must have slipped through for some reason :-/
Christian, thank you for the logs.
Unfortunately I still see no indication why the first suspend gets aborted with a:
"PM: late suspend of devices failed"
And it seems that the debug printing has changed the timing so that the:
"i2c_designware AMDI0010:03: Transfer while suspended"
error no longer happens.
I'm afraid I have no idea how to debug this any further.
(In reply to Hans de Goede from comment #24)
> I'm afraid I have no idea how to debug this any further.
So what do we do now? Just to be sure: I assume reverting is out of the question? Is there anyone else (someone more familiar with the suspend code? rafael maybe?) we could ask for help?
I'm quite convinced the original commit is right.
IMHO, what is missing is correct handling of suspend failure (whatever the cause): if something is disabled before going to suspend, it must be restored if suspend fails.
I don't know if the suspend framework makes that error handling easy, but it makes sense to roll back any action started in the suspend process if a step prevents this process from completing.
As far as the suspend failure on my laptop is concerned, this is another bug and not a regression.
If that’s a second issue, to avoid confusion, can you please create a new bug, and reference it here?
OK, I've filed a separate bug here:
(In reply to Thorsten Leemhuis from comment #25)
> So what do we do now? Just to be sure: I assume reverting is out of the
Right, so there are really 2 issues here, and neither is caused by the patch in question:
1. On the first suspend/resume the suspend fails
2. Sometimes while unrolling the failed suspend, there is an ordering issue causing the hid-driver to try and turn the touchpad back on before the i2c-controller has resumed.
Unfortunately there is nothing in the logs explaining 1. and 2. seems to go away when enabling suspend debug-logging, making it very tricky to fix 2.
Also note that after a second suspend resume where 1. (and thus also 2.) do not happen the touchpad starts working again.
> Is there anyone else (someone more familiar with the suspend code?
> rafael maybe?) we could ask for help?
Asking Rafael if he has any idea how to further debug either issue would be a good idea.
(In reply to Christian Casteyde from comment #26)
> I don't know if the suspend framework makes that error handling easy, but it
> makes sense to rollback any action started in the suspend process if a step
> prevents this process from completing.
The suspend framework already automatically resumes all devices during a suspend failure, but for some reason an ordering issue is hit when doing this which makes the hid-mt code re-enable the touchpad before the i2c-controller is resumed:
[ 325.439975] i2c_designware AMDI0010:03: Transfer while suspended
That is the main reason I asked you to test with some extra debugging but during the run with the extra debugging this path was not hit. So it seems that this is timing dependent and enabling the debugging changes the timing enough to not hit the ordering issue.
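To make the ordering problem concrete, here is a toy model. It is a sketch only and does not mirror the real driver code: the classes `I2cController` and `I2cHidTouchpad` and the function `resume_all` are hypothetical, and the raised exception merely stands in for the "Transfer while suspended" warning seen in the log.

```python
class I2cController:
    """Toy parent device that owns the 'suspended' flag."""

    def __init__(self):
        self.suspended = True

    def resume(self):
        self.suspended = False

    def transfer(self):
        if self.suspended:
            raise RuntimeError("Transfer while suspended")
        return "ok"


class I2cHidTouchpad:
    """Toy child device that needs the bus to re-enable reporting."""

    def __init__(self, controller):
        self.controller = controller
        self.reporting = False

    def resume(self):
        self.controller.transfer()  # the power-on command goes over the bus
        self.reporting = True


def resume_all(devices):
    """Resume devices in the given order and collect any errors."""
    errors = []
    for dev in devices:
        try:
            dev.resume()
        except RuntimeError as exc:
            errors.append(str(exc))
    return errors


# Wrong order: the child resumes before its parent controller.
ctrl = I2cController()
pad = I2cHidTouchpad(ctrl)
wrong_order_errors = resume_all([pad, ctrl])
wrong_order_reporting = pad.reporting

# Expected order: the parent controller resumes first.
ctrl2 = I2cController()
pad2 = I2cHidTouchpad(ctrl2)
right_order_errors = resume_all([ctrl2, pad2])
right_order_reporting = pad2.reporting

print(wrong_order_errors, wrong_order_reporting)
print(right_order_errors, right_order_reporting)
```

In the wrong order the touchpad's resume fails and reporting stays off, which matches the symptom of the touchpad staying dead after the failed suspend.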
Hmm, I wonder if this perhaps is just an issue with the controller resume and the touchpad resume running from different threads and the touchpad code path hitting a stale value of the suspended boolean in the CPU cache. It is a bit of a long shot, but can you try a kernel with the patch which I'm attaching next added and see if that fixes the issue with the touchpad not working after the first failed suspend?
Created attachment 300360 [details]
[PATCH] i2c: designware: Lock the driver while setting the suspended flag
Created attachment 300374 [details]
[PATCH] i2c: designware: Lock the driver while setting the suspended flag
I had also added the test patch to my own local kernels and I just noticed it causes a deadlock, oops.
Here is a new, improved patch which does not have this problem. Note that, as discussed before, this may help with the touchpad not working after the first failed suspend/resume. I do not expect it to actually fix the first suspend failing (and I'm also not at all sure it will help with the touchpad either).
The patch applies to neither 5.15 nor 5.16; I get the following rejects:
patching file drivers/i2c/busses/i2c-designware-pcidrv.c
Hunk #1 FAILED at 213.
Hunk #2 FAILED at 228.
2 out of 2 hunks FAILED -- saving rejects to file drivers/i2c/busses/i2c-designware-pcidrv.c.rej
Created attachment 300375 [details]
Created attachment 300404 [details]
[PATCH for 5.16.6] i2c: designware: Lock the driver while setting the suspended flag
Sorry about that, here is a new version against 5.16.5, please give this a try.
Fortunately (or unfortunately?), I've just upgraded my distribution to Slackware 15, with which I cannot reproduce the problem anymore (even with my own kernel builds). Actually, many things have been updated, from the X11 stack (I now have the right AMD GPU driver, fully functional) to the system libraries (libinput instead of synaptics, etc.), so it may be quite impossible now to see what was causing the problem.
Anyway, I tested 5.16.7 with and without the patch. The result is still a failure to suspend the first time, but at resume the touchpad is alive now.
This can also be due to the race condition which I could not reproduce...
I attached dmesg output for both tests, but I think we can close this bug report as not reproducible anymore.
Thanks for all
Created attachment 300409 [details]
dmesg for 5.16.7 without patch
Created attachment 300410 [details]
dmesg for 5.16.7 with patch
Closing as I cannot reproduce anymore.
Thanks for all.