Bug 198233 - [bisected 1f7f51a63114bab3a05920f4b1343154e95e2cb6] System freezes after hibernate or suspend[pm_test failed at processors mode] - Macmini 6,2
Status: RESOLVED CODE_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: Sound(ALSA)
Hardware: Intel Linux
Importance: P1 high
Assignee: Chen Yu
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-12-22 11:37 UTC by Johannes Geiss
Modified: 2020-04-27 02:44 UTC
CC List: 2 users

See Also:
Kernel Version: 4.16.3
Subsystem:
Regression: No
Bisected commit-id:


Attachments
System boot messages. (69.45 KB, text/plain)
2018-01-15 18:26 UTC, Johannes Geiss
Freezer test was OK. (546 bytes, text/plain)
2018-01-15 18:27 UTC, Johannes Geiss
Devices test was OK. (1.88 KB, text/plain)
2018-01-15 18:27 UTC, Johannes Geiss
Platform test was OK. (10.16 KB, text/plain)
2018-01-15 18:27 UTC, Johannes Geiss
Reverted patch of 1f7f51a63114bab3a05920f4b1343154e95e2cb6 (1.55 KB, patch)
2018-02-26 19:16 UTC, Johannes Geiss

Description Johannes Geiss 2017-12-22 11:37:12 UTC
After a hibernation or a suspend the system resumes and the screen correctly shows the X11/console session, but it is frozen. No clock is running and no cursor (in the console case) is blinking. Keyboard and mouse input have no effect. No process seems to be running. Sometimes a remote ssh login is possible, but not always.

The freeze happens both with X11 running and without it (console mode).

The only thing that still works is the Magic SysRq key: Alt-ScrollLock-R, S, U, B.

Hardware: Macmini 6,2
Comment 1 Chen Yu 2018-01-11 01:43:43 UTC
Is it always reproducible with any kernel version since you bought the Macmini? Could you please try the different pm_test modes to check at which stage the hibernation/suspend fails? Since you mentioned that sysrq works, at least the 'core' stage has passed, so the possible failing stages are:
"freezer" "devices" "platform" "processors"

Please help check 
1. echo 1 > /sys/power/pm_debug_messages

2. echo freezer > /sys/power/pm_test
3. echo mem > /sys/power/state
wait for 5 seconds, does it work?

4. echo devices > /sys/power/pm_test
5. echo mem > /sys/power/state
wait for 5 seconds, does it work?

6. echo platform > /sys/power/pm_test
7. echo mem > /sys/power/state

wait for 5 seconds, does it work?

8. echo processors > /sys/power/pm_test
9. echo mem > /sys/power/state

and provide the dmesg if possible.
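
The steps above can be sketched as one loop. This is only a rough sketch, not a tested tool: PM_ROOT is a hypothetical override (not a kernel interface) so the loop can be dry-run against a scratch directory, and run_pm_tests is just an illustrative name. It assumes that the write to "state" returns once the test cycle has finished.

```shell
#!/bin/sh
# Sketch of the pm_test sequence above. PM_ROOT is a hypothetical override
# (not a kernel interface) so the loop can be dry-run against a scratch
# directory; on a real system it defaults to /sys/power and needs root.
PM_ROOT="${PM_ROOT:-/sys/power}"

run_pm_tests() {
    # Turn on the extra PM debug output in dmesg.
    echo 1 > "$PM_ROOT/pm_debug_messages" || return 1
    for mode in freezer devices platform processors; do
        echo "== pm_test: $mode"
        echo "$mode" > "$PM_ROOT/pm_test" || return 1
        # Assumption: in a test mode this write returns once the kernel
        # has run the cycle up to that stage and resumed.
        echo mem > "$PM_ROOT/state" || { echo "$mode: failed"; return 1; }
        echo "$mode: ok"
    done
    # Restore normal suspend behaviour.
    echo none > "$PM_ROOT/pm_test"
}
```

Run it as root with "run_pm_tests; dmesg | tail -n 50". The last "== pm_test:" line without a matching "ok" should point at the stage that did not come back.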
Comment 2 Johannes Geiss 2018-01-15 18:26:39 UTC
Created attachment 273633
System boot messages.
Comment 3 Johannes Geiss 2018-01-15 18:27:09 UTC
Created attachment 273635
Freezer test was OK.
Comment 4 Johannes Geiss 2018-01-15 18:27:32 UTC
Created attachment 273637
Devices test was OK.
Comment 5 Johannes Geiss 2018-01-15 18:27:51 UTC
Created attachment 273639
Platform test was OK.
Comment 6 Johannes Geiss 2018-01-15 18:28:15 UTC
Processors test fails.

I saw only one message from "dmesg -w":

PM: suspend entry (deep)

After wake-up the system shows the screen (X11) again, but the clock, the xterm cursor, the mouse pointer, the Caps Lock LED and ssh access are all dead. The system seems to be completely frozen. Only Alt-SysRq-R/S/U/B works.
Comment 7 Johannes Geiss 2018-01-15 18:30:28 UTC
Some more info:

Kernel version 4.9.72 works fine.
With version 4.12.12 only suspend works (hibernate reports a CRC error at reboot).
Version 4.15.0-rc7 fails as described.
Comment 8 Chen Yu 2018-01-22 08:11:00 UTC
(In reply to Johannes Geiss from comment #5)
> Created attachment 273639
> Platform test was OK.

One clue from this log is that there are some thunderbolt errors during resume. If this driver does not work well, say, does not react to any command after resume, then the next time the system tries to suspend it might have problems freezing all the kernel threads, no matter what the pm_test mode is, and this could explain why sysrq works when testing processors mode.
So let's first check whether processors mode works with thunderbolt removed or blacklisted on the kernel command line in grub.
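
One possible way to do this, assuming thunderbolt is built as a module, is to append modprobe.blacklist=thunderbolt to the kernel command line in grub. The helper below is only a sketch for verifying, after the reboot, that the parameter really reached the kernel; is_blacklisted is a made-up name, not an existing tool.

```shell
#!/bin/sh
# is_blacklisted MODULE [CMDLINE_FILE]
# Succeeds if MODULE is listed in a modprobe.blacklist= parameter of the
# given kernel command line (default: /proc/cmdline).
# Hypothetical helper for illustration only.
is_blacklisted() {
    # One parameter per line, keep the blacklist entries, then split the
    # comma-separated module list and look for an exact match.
    tr ' ' '\n' < "${2:-/proc/cmdline}" |
        grep '^modprobe\.blacklist=' |
        cut -d= -f2 | tr ',' '\n' | grep -qx "$1"
}
```

For example, "is_blacklisted thunderbolt && echo blacklisted", together with "lsmod | grep thunderbolt" printing nothing, would indicate that the driver stayed out of the way for the processors test.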
Comment 9 Johannes Geiss 2018-02-26 19:14:18 UTC
It took me a while, but now I have figured out what causes the issue, though I do not really understand it.

I bisected it down to commit 1f7f51a63114bab3a05920f4b1343154e95e2cb6.

After bisecting, I took the current linux-git (origin/master) and reverted the changes to these two files:

sound/pci/hda/hda_codec.c
sound/pci/hda/patch_hdmi.c

The changes are:

diff --git a/sound/pci/hda/hda_codec.c b/sound/pci/hda/hda_codec.c
index e018ecbf78a8..dad76f1663b2 100644
--- a/sound/pci/hda/hda_codec.c
+++ b/sound/pci/hda/hda_codec.c
@@ -3231,7 +3231,7 @@ int snd_hda_codec_build_pcms(struct hda_codec *codec)
 
                dev = get_empty_pcm_device(bus, cpcm->pcm_type);
                if (dev < 0) {
-                       cpcm->device = SNDRV_PCM_INVALID_DEVICE;
+                       /*cpcm->device = SNDRV_PCM_INVALID_DEVICE;*/
                        continue; /* no fatal error */
                }
                cpcm->device = dev;

and:

diff --git a/sound/pci/hda/patch_hdmi.c b/sound/pci/hda/patch_hdmi.c
index b4f1b6e88305..e064a0cda3a7 100644
--- a/sound/pci/hda/patch_hdmi.c
+++ b/sound/pci/hda/patch_hdmi.c
@@ -2148,7 +2148,7 @@ static int generic_hdmi_build_jack(struct hda_codec *codec, int pcm_idx)
 static int generic_hdmi_build_controls(struct hda_codec *codec)
 {
        struct hdmi_spec *spec = codec->spec;
-       int dev, err;
+       int err;
        int pin_idx, pcm_idx;
 
 
@@ -2176,13 +2176,11 @@ static int generic_hdmi_build_controls(struct hda_codec *codec)
                        return err;
                snd_hda_spdif_ctls_unassign(codec, pcm_idx);
 
-               dev = get_pcm_rec(spec, pcm_idx)->device;
-               if (dev != SNDRV_PCM_INVALID_DEVICE) {
-                       /* add control for ELD Bytes */
-                       err = hdmi_create_eld_ctl(codec, pcm_idx, dev);
-                       if (err < 0)
-                               return err;
-               }
+               /* add control for ELD Bytes */
+               err = hdmi_create_eld_ctl(codec, pcm_idx,
+                                       get_pcm_rec(spec, pcm_idx)->device);
+               if (err < 0)
+                       return err;
        }
 
        for (pin_idx = 0; pin_idx < spec->num_pins; pin_idx++) {

With this change, suspend and hibernate have worked fine two times so far. I will keep an eye on this.
Comment 10 Johannes Geiss 2018-02-26 19:16:13 UTC
Created attachment 274469
Reverted patch of 1f7f51a63114bab3a05920f4b1343154e95e2cb6

Edits I described in my previous comment.
Comment 11 Chen Yu 2018-02-26 19:59:11 UTC
Thanks for your work, Johannes. If the offender is confirmed to be 1f7f51a63114bab3a05920f4b1343154e95e2cb6, then I think it's OK to send the update to alsa-devel@alsa-project.org.
Comment 12 Johannes Geiss 2018-04-23 06:56:52 UTC
Sent notification to alsa-devel@alsa-project.org about this bug.
Comment 13 Johannes Geiss 2018-04-29 10:52:15 UTC
I ran an extended test with kernel 4.16.3 and narrowed it down to the following:

Kernel 4.16.3:
- Test freezer: OK
- Test devices: OK
- Test platform: failed
- Test processors: failed

Kernel 4.16.3 with CONFIG_SND_DYNAMIC_MINORS=y:
- Test freezer: OK
- Test devices: OK
- Test platform: OK
- Test processors: OK
- Test systemctl suspend: OK

Kernel 4.16.3 without CONFIG_SND_DYNAMIC_MINORS, but with this patch:
-- 8< --
diff --git a/sound/pci/hda/patch_hdmi.c b/sound/pci/hda/patch_hdmi.c
--- a/sound/pci/hda/patch_hdmi.c
+++ b/sound/pci/hda/patch_hdmi.c
@@ -1383,6 +1383,8 @@ static void hdmi_pcm_setup_pin(struct hdmi_spec *spec,
         pcm = get_pcm_rec(spec, per_pin->pcm_idx);
     else
         return;
+    if (!pcm->pcm)
+        return;
     if (!test_bit(per_pin->pcm_idx, &spec->pcm_in_use))
         return;
 
@@ -2151,8 +2153,13 @@ static int generic_hdmi_build_controls(struct hda_codec *codec)
     int dev, err;
     int pin_idx, pcm_idx;
 
-
     for (pcm_idx = 0; pcm_idx < spec->pcm_used; pcm_idx++) {
+        if (!get_pcm_rec(spec, pcm_idx)->pcm) {
+            /* no PCM; mark this not to be selected */
+            set_bit(pcm_idx, &spec->pcm_bitmap);
+            continue;
+        }
+
         err = generic_hdmi_build_jack(codec, pcm_idx);
         if (err < 0)
             return err;
- Test freezer: OK
- Test devices: OK
- Test platform: failed
- Test processors: failed
- Test systemctl suspend: failed

The tests above were done using X11 display. The following were done at the console (Alt-Shift-F1):
- Test freezer: OK
- Test devices: failed sometimes, OK sometimes
- Test platform: failed
- Test processors: failed

No SSH connection to the hung machine is possible, but ping works.
The console seems to be frozen, except the cursor is blinking.
Alt-F2, F3, etc. work, but getty is not starting. Only a blinking cursor is visible. Nothing more.
NumLock, CapsLock work. The LEDs switch on/off.
Alt-F7 (switch to the X11 screen) shows a black screen. Going back with Alt-Shift-F1 is not possible.
Alt-SysRq-T shows no activity.
Alt-SysRq-B reboots the machine.

Booting the machine in systemd's rescue.target as default.target:
- Test freezer: OK
- Test devices: OK
- Test platform: OK
- Test processors: OK

Booting the machine in systemd's multi-user.target (no X11) as default.target:
- Test freezer: OK
- Test devices: OK
- Test platform: failed
- Test processors: failed

Booting the machine in an empty systemd's multi-user.target as default.target:
- Test freezer: OK
- Test devices: OK
- Test platform: OK
- Test processors: OK

Restoring the services in multi-user.target.wants showed that pulseaudio.service seems to be the problem.
Booting the machine with only pulseaudio in multi-user.target.wants:
- Test freezer: OK
- Test devices: failed
- Test platform: failed
- Test processors: failed
- If pulseaudio.service is stopped after a reboot, the processors test is OK.
Comment 14 Chen Yu 2018-05-30 01:33:24 UTC
Johannes, is there any update from the alsa-devel@alsa-project.org about this bug? I think you can also Cc Wang YanQing <udknight@gmail.com> in that loop for help.
Comment 15 Johannes Geiss 2018-06-01 08:40:00 UTC
(In reply to Chen Yu from comment #14)
> Johannes, is there any update from the alsa-devel@alsa-project.org about
> this bug?

No, I have no more information from alsa-devel.

But I can say that the bug disappears when I enable CONFIG_SND_DYNAMIC_MINORS=y.

I also use pulseaudio in system mode (which is not recommended, I know). I disabled all services in multi-user.target.wants except pulseaudio.service. Then the bug happened at suspend/hibernate. When I also disabled pulseaudio.service, the bug disappeared.

So, as far as I am concerned, this bug is resolved because I enabled CONFIG_SND_DYNAMIC_MINORS.
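
For anyone who wants to check whether their kernel already has this option, here is a small sketch. has_dynamic_minors is a made-up helper name, and where the config file lives (/boot/config-$(uname -r) or /proc/config.gz) depends on the distribution and kernel build, so treat those paths as assumptions.

```shell
#!/bin/sh
# has_dynamic_minors CONFIG_FILE
# Succeeds if the given kernel config (plain text or .gz) has
# CONFIG_SND_DYNAMIC_MINORS=y. Hypothetical helper for illustration only.
has_dynamic_minors() {
    case "$1" in
        *.gz) zcat "$1" ;;   # e.g. /proc/config.gz, if the kernel provides it
        *)    cat "$1" ;;    # e.g. /boot/config-$(uname -r) on some distros
    esac | grep -q '^CONFIG_SND_DYNAMIC_MINORS=y$'
}
```

Usage would be something like: has_dynamic_minors "/boot/config-$(uname -r)" && echo "dynamic minors enabled".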
