Bug 203837 - Booting kernel under KVM immediately freezes host
Summary: Booting kernel under KVM immediately freezes host
Status: NEW
Alias: None
Product: Platform Specific/Hardware
Classification: Unclassified
Component: PPC-64
Hardware: PPC-64 Linux
Importance: P1 blocking
Assignee: platform_ppc-64
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-06-06 22:59 UTC by Shawn Anastasio
Modified: 2019-06-11 19:42 UTC
CC List: 2 users

See Also:
Kernel Version: v5.2-rc2
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
Guest kernel config (170.54 KB, text/plain)
2019-06-06 22:59 UTC, Shawn Anastasio

Description Shawn Anastasio 2019-06-06 22:59:12 UTC
Created attachment 283133 [details]
Guest kernel config

When booting kernel v5.2-rc2 (and confirmed up to 156c05917) in a VM on a
POWER9 host running kernel 5.1.7, the host immediately locks up and
becomes unresponsive to the point of requiring a hard reset.

The last guest kernel message printed to the screen before the
host locks up is:

[    0.013940] smp: Bringing up secondary CPUs ...

Due to the nature of the bug, it is very difficult to bisect, since a manual
host reset is required each time the bug is encountered. Also, my only
POWER machine is my primary workstation.

The bug has also been confirmed on other host kernel versions (down to 5.0.x).
With the guest kernel downgraded to 5.1.0, the issue is not present.

The guest kernel .config is attached.
Comment 1 Paul Mackerras 2019-06-07 05:42:39 UTC
I have tried but not succeeded in replicating this problem.

I have tried 5.2-rc3 in the host with the config I usually use, plus 5.2-rc3 in the guest with that same config. That boots just fine.

With 5.2-rc3 in the host and my usual config, and 5.2-rc3 in the guest compiled with the config attached to this bug, the guest gets a kernel panic due to being unable to mount root. It looks like it never manages to load virtio-blk for some reason.

With the config attached to this bug, I did once see the guest stop outputting messages after the message about bringing up CPUs. The host was still running just fine, and top in the host showed the qemu-system-ppc64 process using 100% of a CPU, consistent with the guest being in an infinite loop.

I think we need more details about the machine where the crash is occurring - host kernel config, details of VM config (qemu command line or libvirt xml), etc.
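The reporter's actual VM configuration is not given in this bug; as a hypothetical illustration of the kind of detail being requested, a qemu command line for a KVM guest on a POWER9 host might look like the following (all values here are assumptions, not the reporter's setup). The SMP topology is particularly relevant, since the guest hangs while bringing up secondary CPUs.

```shell
# Hypothetical example only -- the reporter's real invocation is unknown.
# -smp matters here because the hang occurs at "Bringing up secondary CPUs".
qemu-system-ppc64 \
    -machine pseries -accel kvm \
    -smp 4 -m 4G \
    -kernel vmlinux \
    -drive file=guest.img,if=virtio \
    -nographic
```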
Comment 2 Paul Mackerras 2019-06-07 06:29:26 UTC
Just tried 5.1.7 in the host and got the guest locking up during boot. In xmon I see one cpu in pmdp_invalidate and another in handle_mm_fault. It seems very possible this is the bug that Nick Piggin's recent patch series fixes ("powerpc/64s: Fix THP PMD collapse serialisation"):

http://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=112348
Comment 3 npiggin 2019-06-10 06:30:17 UTC
bugzilla-daemon@bugzilla.kernel.org wrote on June 7, 2019 4:29 pm:
> https://bugzilla.kernel.org/show_bug.cgi?id=203837
> 
> --- Comment #2 from Paul Mackerras (paulus@ozlabs.org) ---
> Just tried 5.1.7 in the host and got the guest locking up during boot.
> In xmon I see one cpu in pmdp_invalidate and another in handle_mm_fault.
> It seems very possible this is the bug that Nick Piggin's recent patch
> series fixes ("powerpc/64s: Fix THP PMD collapse serialisation"):
> 
> http://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=112348

It's worth a try, although the bug was introduced around 4.20 and
I wasn't able to trigger it on radix, but other timing changes
could cause it to trigger I suppose.

pdbg (https://github.com/open-power/pdbg) is a useful tool for your
BMC that can often get the CPU registers out even for bad crashes;
this might help to narrow down the problem without bisecting.

Thanks,
Nick
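A rough sketch of the kind of pdbg session Nick is suggesting, run from the BMC after the host hangs, might look like the following. The subcommand names are an assumption based on the pdbg project's documentation and may differ between pdbg versions; treat this as an outline rather than a verified recipe.

```shell
# Hypothetical outline of a post-hang pdbg session on the BMC.
# Subcommand names are assumptions; consult the pdbg README for your version.
pdbg -a probe    # enumerate available processors/cores/threads
pdbg -a stop     # halt instruction execution on all threads
pdbg -a regs     # dump register state of every thread for analysis
pdbg -a start    # resume execution (if desired)
```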
Comment 4 Shawn Anastasio 2019-06-11 19:42:25 UTC
I have applied Nick's patchset to 5.1.7 but the issue still occurs.

As for using pdbg, I'm aware of the tool's existence but I'm not sure how
I would effectively use it to diagnose this issue. If anybody has some
pointers, it'd be appreciated.
