Bug 216040 - reboot stalls on x86 with __common_interrupt: 5.33 No irq handler for vector
Summary: reboot stalls on x86 with __common_interrupt: 5.33 No irq handler for vector
Status: NEW
Alias: None
Product: Platform Specific/Hardware
Classification: Unclassified
Component: x86-64 (show other bugs)
Hardware: Intel Linux
: P1 low
Assignee: platform_x86_64@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2022-05-27 18:05 UTC by Luis Chamberlain
Modified: 2022-05-27 18:05 UTC (History)
0 users

See Also:
Kernel Version: v5.17-rc7
Subsystem:
Regression: No
Bisected commit-id:


Attachments

Description Luis Chamberlain 2022-05-27 18:05:38 UTC
There is an odd bug which is creeping up when testing v5.17-rc7 with stalls on reboot at least once out of 10,000 reboots. This is done by using the reboot-limit test workflow on kdevops:

https://github.com/linux-kdevops/kdevops

The goal behind the reboot-limit test workflow is simple, to boot a guest and reboot it until a "steady state test goal", in this case 100. The test alone reboots the guest in a loop 100 times. So all together if the "test steady state" is reached the guest would reboot successfully 10,000 times.

Since a guest is used the a debian-testing host is used on 5.16.0-6-amd64. The guest is also debian-testing, but the kernel tested varies. The first issue spotted was a failure due to the package crony on debian and this has been reported along with a workaround:

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1011533

The workaround is:

--- /etc/network/if-up.d/chrony.old	2022-05-24 16:40:53.112439882 +0000
+++ /etc/network/if-up.d/chrony	2022-05-24 16:41:23.452471796 +0000
@@ -5,6 +5,7 @@
 [ -x /usr/sbin/chronyd ] || exit 0
 
 if [ -e /run/chrony/chronyd.pid ]; then
+    systemctl is-system-running --wait
     chronyc onoffline > /dev/null 2>&1
 fi


If this is applied to a debian-testing guest and you continue to test further, you end up next with a kernel stall on reboot on the shutdown side of the guest. The only thing visible on the kernel log is:

[  OK  ] Stopped System Logging Service.
[  OK  ] Stopped Unattended Upgrades Shutdown.
[  OK  ] Stopped Regular background program processing daemon.
[  OK  ] Stopped Getty on tty1.
[  OK  ] Stopped Serial Getty on ttyS0.
[   16.148956] __common_interrupt: 5.33 No irq handler for vector

I see a similar bug reported earlier in 2015 [0] and another in 2019 [1] and another in 2020 [2], each time with a slight new variation. This was also discussed on linux-kernel [3] and it would seem the conclusion is that this was a BIOS bug last. Since the latest debian-testing qemu (1:6.2+dfsg-3) is used and this issue is observed I thought this *might* be worth reporting and following up on, as it would be unclear how to fix this if it is a BIOS bug. Also, by scraping the console I see some hints since the last successful reboot which might be telling of what might have caused this odd issue.

This is also worth reporting now given that we have a framework to reproduce this bug easily, with kdevops:

make menuconfig # select upstream kernel v5.17-rc7, enable reboot-limit
make
make linux
make reboot-limit
make reboot-limit-baseline-loop

You may want to have sudo virsh console on the guest while this is going. I detected this issue on reboot number 8,919. I can tell this as the .kernel-ci.ok is 89 (89 loops of the target 100 steady state) and the right before the failure on we see on the file .kernel-ci.fail.log:

TASK [reboot-limit : Write current updated boot count variable file to be 19] ***
changed: [rebootlimit]

OK so scraping a bit before all this on the console we see a successful shutdown and reboot effort:

[  OK  ] Reached target System Shutdown.
[  OK  ] Reached target Late Shutdown Services.
[  OK  ] Finished System Reboot.
[  OK  ] Reached target System Reboot.
[   17.841097] watchdog: watchdog0: watchdog did not stop!
[   17.849080] systemd-shutdown[1]: Using hardware watchdog 'i6300ESB timer', version 0, device /dev/watchdog0
[   17.854078] systemd-shutdown[1]: Watchdog running with a timeout of 10min.
[   17.860686] systemd-shutdown[1]: Syncing filesystems and block devices.
[   17.863357] systemd-shutdown[1]: Sending SIGTERM to remaining processes...
[   17.870148] systemd-journald[305]: Received SIGTERM from PID 1 (systemd-shutdow).
[   17.878312] systemd-shutdown[1]: Sending SIGKILL to remaining processes...
[   17.885497] systemd-shutdown[1]: Unmounting file systems.
[   17.888346] [1294]: Remounting '/' read-only with options 'discard,errors=remount-ro'.
[   17.902643] EXT4-fs (vda1): re-mounted. Quota mode: none.
[   17.909250] systemd-shutdown[1]: All filesystems unmounted.
[   17.910541] systemd-shutdown[1]: Deactivating swaps.
[   17.912321] systemd-shutdown[1]: All swaps deactivated.
[   17.913455] systemd-shutdown[1]: Detaching loop devices.
[   17.914739] systemd-shutdown[1]: All loop devices detached.
[   17.917730] systemd-shutdown[1]: Stopping MD devices.
[   17.919164] systemd-shutdown[1]: All MD devices stopped.
[   17.920657] systemd-shutdown[1]: Detaching DM devices.
[   17.922267] systemd-shutdown[1]: All DM devices detached.
[   17.923251] systemd-shutdown[1]: All filesystems, swaps, loop devices, MD devices and DM devices detached.
[   17.938460] systemd-shutdown[1]: Syncing filesystems and block devices.
[   17.940495] systemd-shutdown[1]: Rebooting.
[   17.941687] kvm: exiting hardware virtualization
[   19.431304] reboot: Restarting system
[   19.432218] reboot: machine restart

Then we see the bootup happen:

  Booting `Debian GNU/Linux, with Linux 5.17.0-rc7'

Loading Linux 5.17.0-rc7 ...
Loading initial ramdisk ...
[    0.000000] Linux version 5.17.0-rc7 (vagrant@rebootlimit) (gcc (Debian 11.3.0-1) 11.3.0, GNU ld (GNU Binutils for Debian) 2.38) #1 SMP PREEMPT Thu May 19 16:26:43 UTC 2022
[    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-5.17.0-rc7 root=UUID=bf24432d-9da3-4cc2-a83b-4c7b0a8472f3 ro console=tty0 console=tty1 console=ttyS0,38400n8 elevator=noop scsi_mod.use_blk_mq=Y net.ifnames=0 biosdevname=0 crashkernel=384M-:128M
[    0.000000] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
[    0.000000] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
[    0.000000] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
[    0.000000] x86/fpu: Supporting XSAVE feature 0x020: 'AVX-512 opmask'
[    0.000000] x86/fpu: Supporting XSAVE feature 0x040: 'AVX-512 Hi256'
[    0.000000] x86/fpu: Supporting XSAVE feature 0x080: 'AVX-512 ZMM_Hi256'
[    0.000000] x86/fpu: Supporting XSAVE feature 0x200: 'Protection Keys User registers'
... etc

Then we see bootup is completing:

[    4.969633] random: fast init done
Begin: Loading essential drivers ... done.
Begin: Running /scripts/init-premount ... done.
Begin: Mounting root file system ... Begin: Running /scripts/local-top ... done.
Begin: Running /scripts/local-premount ... [    5.242857] raid6: avx512x4 gen() 16798 MB/s
[    5.310868] raid6: avx512x2 gen() 17096 MB/s
[    5.378871] raid6: avx512x1 gen() 17076 MB/s
[    5.446848] raid6: avx2x4   gen() 18396 MB/s
[    5.514844] raid6: avx2x2   gen() 16922 MB/s
[    5.582858] raid6: avx2x1   gen() 13455 MB/s
[    5.583848] raid6: using algorithm avx2x4 gen() 18396 MB/s
[    5.650841] raid6: .... xor() 7335 MB/s, rmw enabled
[    5.651736] raid6: using avx512x2 recovery algorithm
[    5.653261] xor: automatically using best checksumming function   avx
[    5.707415] Btrfs loaded, crc32c=crc32c-intel, zoned=yes, fsverity=no
Scanning for Btrfs filesystems
[    5.727981] BTRFS: device label data devid 1 transid 23555 /dev/nvme0n1 scanned by btrfs (221)
done.
Begin: Will now check root file system ... fsck from util-linux 2.38
[/sbin/fsck.ext4 (1) -- /dev/vda1] fsck.ext4 -a -C0 /dev/vda1
/dev/vda1: clean, 48678/1310720 files, 2038899/5242624 blocks
done.
[    5.835747] EXT4-fs (vda1): mounted filesystem with ordered data mode. Quota mode: none.
done.
Begin: Running /scripts/local-bottom ... GROWROOT: NOCHANGE: partition 1 could only be grown by -33 [fudge=2048]
done.
Begin: Running /scripts/init-bottom ... done.
[    5.947812] Not activating Mandatory Access Control as /sbin/tomoyo-init does not exist.
[    5.985388] systemd[1]: Inserted module 'autofs4'
[    6.001400] systemd[1]: systemd 250.4-1 running in system mode (+PAM +AUDIT +SELINUX +APPARMOR +IMA +SMACK +SECCOMP +GCRYPT +GNUTLS -OPENSSL +ACL +BLKID +CURL +ELFUTILS +FIDO2 +IDN2 -IDN +IPTC +KMOD +LIBCRYPTSETUP +LIBFDISK +PCRE2 -PWQUALITY -P11KIT -QRENCODE +BZIP2 +LZ4 +XZ +ZLIB +ZSTD -BPF_FRAMEWORK -XKBCOMMON +UTMP +SYSVINIT default-hierarchy=unified)
[    6.008955] systemd[1]: Detected virtualization kvm.
[    6.010253] systemd[1]: Detected architecture x86-64.


Welcome to Debian GNU/Linux bookworm/sid!


I see some lvmconfig complaint though:

[    6.013547] systemd[1]: Hostname set to <rebootlimit>.
[    6.047030] random: lvmconfig: uninitialized urandom read (4 bytes read)
[    6.143811] systemd[1]: Queued start job for default target Graphical Interface.
[    6.147891] systemd[1]: Created slice Slice /system/getty.

Also odd:

[    6.164355] systemd[1]: Arbitrary Executable File Formats File System Automount Point was skipped because of a failed condition check (ConditionPathExists=/proc/sys/fs/binfmt_misc).
[    6.167824] systemd[1]: Reached target Local Encrypted Volumes.

There was an issue with the binfmt sysctl move out of kernel/sysct.c but that was fixed via commit b42bc9a3c511 ("Fix regression due to "fs: move binfmt_misc sysctl to its own file") merged on v5.17-rc4:

mcgrof@sumo ~/linux (git::master)$ git describe --contains b42bc9a3c5115c3102a4923776bbeed3b191f2db
v5.17-rc4~37

There was one further fix for this I had but it was for a missing kmemleak detection thing, through commit a3580ac9b7a394 ("fs/file_table: fix adding missing kmemleak_not_leak()") so not really something which I suspect would create this boot stall as it is just telling kmemleak to ignore this as a  suspect as we don't free stuff later.

Then later another lvconfig complaint:

[    6.256025] random: lvm: uninitialized urandom read (4 bytes read)

Hrm, and then:

[    6.264704] systemd[1]: File System Check on Root Device was skipped because of a failed condition check (ConditionPathExists=!/run/initramfs/fsck-root).

I also see, worth noting:

[  OK  ] Started Journal Service.
         Mounting Kernel Configuration File System...
         Starting Apply Kernel Variables...
[    6.356431] EXT4-fs (vda1): re-mounted. Quota mode: none.
[  OK  ] Mounted Kernel Configuration File System.
[  OK  ] Finished Remount Root and Kernel File Systems.
[  OK  ] Finished Apply Kernel Variables.
         Starting Flush Journal to Persistent Storage...
         Starting Load/Save Random Seed[    6.382722] systemd-journald[302]: Received client request to flush runtime journal.
...

Finally the bootup completes:

Debian GNU/Linux bookworm/sid rebootlimit ttyS0

And then down we go again...

We end up with only this on the kernel log:

[  OK  ] Removed slice Slice /system/modprobe.
[  OK  ] Stopped target Graphical Interface.
[  OK  ] Stopped target Multi-User System.
[  OK  ] Stopped target Login Prompts.
[  OK  ] Stopped target RPC Port Mapper.
[  OK  ] Stopped target Timer Units.
[  OK  ] Stopped Daily apt upgrade and clean activities.
[  OK  ] Stopped Daily apt download activities.
[  OK  ] Stopped Daily dpkg database backup timer.
[  OK  ] Stopped Periodic ext4 Onli…ata Check for All Filesystems.
[  OK  ] Stopped Discard unused blocks once a week.
[  OK  ] Stopped Daily rotation of log files.
[  OK  ] Stopped Daily exim4-base housekeeping.
[  OK  ] Stopped Daily man-db regeneration.
[  OK  ] Stopped Daily Cleanup of Temporary Directories.
[  OK  ] Closed LVM2 poll daemon socket.
         Stopping Availability of block devices...
         Stopping Regular background program processing daemon...
         Stopping topping Serial Getty on ttyS0...
         Stopping Simple Network Ma…ent Protocol (SNMP) Daemon....
         Stopping OpenBSD Secure Shell server...
         Stopping Load/Save Random Seed...
         Stopping Unattended Upgrades Shutdown...
[  OK  ] Stopped System Logging Service.
[  OK  ] Stopped Unattended Upgrades Shutdown.
[  OK  ] Stopped Regular background program processing daemon.
[  OK  ] Stopped Getty on tty1.
[  OK  ] Stopped Serial Getty on ttyS0.
[   16.148956] __common_interrupt: 5.33 No irq handler for vector
[   16.150960] __common_interrupt: 5.33 No irq handler for vector
[   16.151880] __common_interrupt: 5.33 No irq handler for vector
[   16.152731] __common_interrupt: 5.33 No irq handler for vector
[   16.153616] __common_interrupt: 5.33 No irq handler for vector
[   16.154465] __common_interrupt: 5.33 No irq handler for vector
[   16.155317] __common_interrupt: 5.33 No irq handler for vector
[   16.156145] __common_interrupt: 5.33 No irq handler for vector
[   16.157020] __common_interrupt: 5.33 No irq handler for vector
[   16.157913] __common_interrupt: 5.33 No irq handler for vector

And the reboot stalls there.

[0] https://bugzilla.kernel.org/show_bug.cgi?id=87531
[1] https://bugzilla.kernel.org/show_bug.cgi?id=201617
[2] https://bugzilla.kernel.org/show_bug.cgi?id=208881
[3] https://lore.kernel.org/all/c270b5af-1720-7dc6-cdd5-e9d8611a1704@linuxfoundation.org/

Note You need to log in before you can comment on or make changes to this bug.