Created attachment 289551 [details] kernel .config (kernel 4.15-rc8+, Shuttle XPC FS51, Pentium 4 HT) Kernels >4.14.x show weird sorts of problems (swap partition fails to start, USB errors, WLAN errors) when Hyperthreading is enabled. Most of the time the machine won't finish booting and crashes with stacktraces like this: [...] ? usb_probe_interface+0x192/0x192 [usbcore] usb_runtime_suspend +0x21/0x52 [usbcore] __rpm_callback+0xb4/0x141 ? usb_probe_interface+0x192/0x192 [usbcore] ? usb_probe_interface+0x192/0x192 [usbcore] rpm_callbackk+0x46/0x63 ? usb_probe_interface+0x192/0x192 [usbcore] rpm_suspend+0x25f/0x39b ? _raw_spin_lock_irqsave+0x2f/0x38 __pm_runtime_suspend+0x4b/0x5f usb_runtime_idle+0x1e/0x26 [usbcore] __rpm_callback+0xb4/0x141 ? usb_runtime_resume+0x12/0x12 [usbcore] rpm_idle+0x12e/0x187 pm_runtime_work+0x43/0x77 process_one_work+0x215/0x36c worker_thread+0x164/0x1fb kthread+0xce/0xd0 ? rescuer_thread+0x1e5/0x1e5 ? kthread_destroy_worker+0x4e/0x4e ret_from_fork+0x19/0x24 Kernel Offset: 0x1e000000 from 0xc1000000 (reloca ?? 000000-0xf7b7dfff) Rebooting in 120 seconds.. With Hyperthreading disabled in BIOS, the very same kernel just works fine. I was curious and did a git-bisect which lead to the following offending commit: # git bisect bad | tee -a ~/bisect02.log a0c9259dc4e1923a98356967ce8b732da1979df8 is the first bad commit commit a0c9259dc4e1923a98356967ce8b732da1979df8 Author: Thomas Gleixner <tglx@linutronix.de> Date: Wed Jan 17 16:01:47 2018 +0100 irq/matrix: Spread interrupts on allocation Keith reported an issue with vector space exhaustion on a server machine which is caused by the i40e driver allocating 168 MSI interrupts when the driver is initialized, even when most of these interrupts are not used at all. The x86 vector allocation code tries to avoid the immediate allocation with the reservation mode, but the card uses MSI and does not support MSI entry masking, which prevents reservation mode and requires immediate vector allocation. The matrix allocator is a bit naive and prefers the first CPU in the cpumask which describes the possible target CPUs for an allocation. That results in allocating all 168 vectors on CPU0 which later causes vector space exhaustion when the NVMe driver tries to allocate managed interrupts on each CPU for the per CPU queues. Avoid this by finding the CPU which has the lowest vector allocation count to spread out the non managed interrupt accross the possible target CPUs. Fixes: 2f75d9e1c905 ("genirq: Implement bitmap matrix allocator") Reported-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Keith Busch <keith.busch@intel.com> Link: https://lkml.kernel.org/r/alpine.DEB.2.20.1801171557330.1777@nanos kernel/irq/matrix.c | 20 ++++++++++++++------ 1 file changed, 14 insertions(+), 6 deletions(-) Some data about the machine: # inxi -b System: Kernel: 4.14.178-gentoo-Pentium4 i686 bits: 32 Console: tty 0 Distro: Gentoo Base System release 2.6 Machine: Type: Desktop Mobo: Shuttle model: FS51 serial: N/A BIOS: Phoenix v: 6.00 PG date: 12/02/2003 CPU: Single Core: Intel Pentium 4 type: MT speed: 3063 MHz Graphics: Device-1: AMD RV350 [Radeon 9550/9600/X1050 Series] driver: N/A Display: server: X.org 1.20.7 driver: ati,radeon unloaded: fbdev,modesetting tty: 104x53 Network: Device-1: Ralink RT2500 Wireless 802.11bg driver: rt2500pci Device-2: Realtek RTL-8100/8101L/8139 PCI Fast Ethernet Adapter driver: 8139too Drives: Local Storage: total: 76.69 GiB used: 4.43 GiB (5.8%) Info: Processes: 79 Uptime: 3h 29m Memory: 1.97 GiB used: 329.1 MiB (16.3%) Init: systemd Shell: bash inxi: 3.0.38 # lscpu Architecture: i686 CPU op-mode(s): 32-bit Byte Order: Little Endian Address sizes: 36 bits physical, 32 bits virtual CPU(s): 2 On-line CPU(s) list: 0,1 Thread(s) per core: 2 Core(s) per socket: 1 Socket(s): 1 Vendor ID: GenuineIntel CPU family: 15 Model: 2 Model name: Intel(R) Pentium(R) 4 CPU 3.06GHz Stepping: 7 CPU MHz: 3062.686 BogoMIPS: 6127.59 Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe pebs bts cpuid cid xtpr
Created attachment 289553 [details] kernel dmesg (kernel 4.15-rc8+, Shuttle XPC FS51, Pentium 4 HT)
Created attachment 289555 [details] kernel dmesg (kernel 4.14.178, Shuttle XPC FS51, Pentium 4 HT)
Created attachment 289557 [details] bisect.log
Created attachment 295173 [details] dmesg (kernel 5.11-rc7, Shuttle XPC FS51, Pentium 4) Now with a proper stacktrace made with netconsole. The "inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage." which shows up at the beginning is bug #211575 and shows up seperate from HyperThreadine enabled.
Created attachment 295175 [details] kernel .config (kernel 5.11-rc7, Shuttle XPC FS51, Pentium 4)
On Tue, Feb 09 2021 at 22:53, bugzilla-daemon wrote: > dmesg (kernel 5.11-rc7, Shuttle XPC FS51, Pentium 4) > > Now with a proper stacktrace made with netconsole. > > The "inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage." which shows up at > the beginning is bug #211575 and shows up seperate from HyperThreadine > enabled. Yes, that's unrelated and an issue with the 8139 netconsole. That's an odd failure. It's stuck waiting for an interrupt to synchronize. Can you try to boot with: irqaffinity=0 on the kernel command line? If that makes it boot then you might try to move e.g. the sound interrupt to CPU1 via /proc/irq/$IRQ/smp_affinity[_list]. Thanks, tglx
(In reply to Thomas Gleixner from comment #6) > That's an odd failure. It's stuck waiting for an interrupt to > synchronize. Can you try to boot with: > > irqaffinity=0 Yes, that did the trick! The machine runs fine so far. > If that makes it boot then you might try to move e.g. the sound > interrupt to CPU1 via /proc/irq/$IRQ/smp_affinity[_list]. How to do that at boot time? Most of the time the machine won't finish booting, being able ro run user scripts afterwards.
On Thu, Feb 11 2021 at 15:43, bugzilla-daemon wrote: >> That's an odd failure. It's stuck waiting for an interrupt to >> synchronize. Can you try to boot with: >> >> irqaffinity=0 > Yes, that did the trick! The machine runs fine so far. > >> If that makes it boot then you might try to move e.g. the sound >> interrupt to CPU1 via /proc/irq/$IRQ/smp_affinity[_list]. > How to do that at boot time? Most of the time the machine won't finish > booting, > being able ro run user scripts afterwards. Not at boot time. With that command line option the machine seems to run. So you have a working system, right? If so , can you please provide the output of # cat /proc/interrupts Thanks, tglx
(In reply to Thomas Gleixner from comment #8) > If so , can you please provide the output of > > # cat /proc/interrupts > > Thanks, > > tglx # cat /proc/interrupts CPU0 CPU1 0: 40 0 IO-APIC 2-edge timer 8: 0 0 IO-APIC 8-edge rtc0 9: 0 0 IO-APIC 9-fasteoi acpi 14: 6672 0 IO-APIC 14-edge pata_sis 15: 0 0 IO-APIC 15-edge pata_sis 16: 67 0 IO-APIC 16-fasteoi radeon 17: 2775 0 IO-APIC 17-fasteoi rt2500pci 18: 1789 0 IO-APIC 18-fasteoi eth0, snd_intel8x0 19: 1 0 IO-APIC 19-fasteoi firewire_ohci 20: 0 0 IO-APIC 20-fasteoi ohci_hcd:usb2 21: 0 0 IO-APIC 21-fasteoi ohci_hcd:usb3 22: 0 0 IO-APIC 22-fasteoi ohci_hcd:usb4 23: 0 0 IO-APIC 23-fasteoi ehci_hcd:usb1 NMI: 3 3 Non-maskable interrupts LOC: 16611 15587 Local timer interrupts SPU: 0 0 Spurious interrupts PMI: 2 2 Performance monitoring interrupts IWI: 0 0 IRQ work interrupts RTR: 0 0 APIC ICR read retries RES: 5782 12575 Rescheduling interrupts CAL: 794 871 Function call interrupts TLB: 805 549 TLB shootdowns TRM: 0 0 Thermal event interrupts THR: 0 0 Threshold APIC interrupts MCE: 0 0 Machine check exceptions MCP: 1 1 Machine check polls ERR: 0 MIS: 0 PIN: 0 0 Posted-interrupt notification event NPI: 0 0 Nested posted-interrupt event PIW: 0 0 Posted-interrupt wakeup event
On Tue, Feb 23 2021 at 12:35, bugzilla-daemon wrote: > https://bugzilla.kernel.org/show_bug.cgi?id=208097 > > --- Comment #9 from Erhard F. (erhard_f@mailbox.org) --- > (In reply to Thomas Gleixner from comment #8) >> If so , can you please provide the output of >> >> # cat /proc/interrupts >> >> Thanks, >> >> tglx > # cat /proc/interrupts > CPU0 CPU1 > 0: 40 0 IO-APIC 2-edge timer > 8: 0 0 IO-APIC 8-edge rtc0 > 9: 0 0 IO-APIC 9-fasteoi acpi > 14: 6672 0 IO-APIC 14-edge pata_sis > 15: 0 0 IO-APIC 15-edge pata_sis > 16: 67 0 IO-APIC 16-fasteoi radeon > 17: 2775 0 IO-APIC 17-fasteoi rt2500pci > 18: 1789 0 IO-APIC 18-fasteoi eth0, snd_intel8x0 > 19: 1 0 IO-APIC 19-fasteoi firewire_ohci > 20: 0 0 IO-APIC 20-fasteoi ohci_hcd:usb2 > 21: 0 0 IO-APIC 21-fasteoi ohci_hcd:usb3 > 22: 0 0 IO-APIC 22-fasteoi ohci_hcd:usb4 > 23: 0 0 IO-APIC 23-fasteoi ehci_hcd:usb1 Ok. So all device interrupts end up on CPU 0 which is what we told the kernel to do with that commandline option. What happens if you do the following: 1) Boot with irqaffinity=0 on the command line 2) run: echo 1 >/proc/irq/18/smp_affinity_list 3) Shut down the network interface and/or use sound on and off In theory this should lead to a similar situation as you saw without that command line option. Thanks, tglx
(In reply to Thomas Gleixner from comment #10) > What happens if you do the following: > > 1) Boot with irqaffinity=0 on the command line > > 2) run: echo 1 >/proc/irq/18/smp_affinity_list > > 3) Shut down the network interface and/or use sound on and off A few monents I run "echo 1 >/proc/irq/18/smp_affinity_list" I get this: ------------[ cut here ]------------ NETDEV WATCHDOG: eth0 (8139too): transmit queue 0 timed out WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:442 dev_watchdog+0x15f/0x1ad Modules linked in: auth_rpcgss nfsv4 dns_resolver nfs lockd grace sunrpc nfs_ssc rt2500pci eeprom_93cx6 rt2x00pci rt2x00mmio radeon rt2x00lib led_class mac80211 hwmon i2c_algo_bit drm_ttm_helper cfg80211 ttm drm_kms_helper cfbfillrect snd_intel8x0 syscopyarea snd_ac97_codec ohci_pci cfbimgblt ehci_pci ohci_hcd sysfillrect ehci_hcd ac97_bus sysimgblt fb_sys_fops snd_pcm cfbcopyarea usbcore evdev fb firewire_ohci firewire_core fan snd_timer thermal snd 8250 font sr_mod fbdev cdrom 8250_base serial_core rfkill soundcore button usb_common libarc4 crc_itu_t i2c_sis96x drm configfs fuse drm_panel_orientation_quirks backlight CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.11.1-gentoo-Pentium4 #2 Hardware name: /FS51, BIOS 6.00 PG 12/02/2003 EIP: dev_watchdog+0x15f/0x1ad Code: 3d 79 ad a9 c4 00 75 34 c6 05 79 ad a9 c4 01 8b 45 f0 e8 cd 32 fd ff 56 50 8d 83 fc fc ff ff 50 68 fc 2f 9a c4 e8 e5 b6 0b 00 <0f> 0b 83 c4 10 eb 0b 46 05 00 02 00 00 e9 02 ff ff ff 8b 83 18 fe EAX: 0000003b EBX: c108c304 ECX: f63c41fc EDX: fffff000 ESI: 00000000 EDI: ffff1d40 EBP: c1083f44 ESP: c1083f24 DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00210282 CR0: 80050033 CR2: 0164e02c CR3: 05fd5000 CR4: 000006d0 Call Trace: <SOFTIRQ> ? pfifo_fast_change_tx_queue_len+0x193/0x193 call_timer_fn+0xfe/0x201 __run_timers+0x134/0x159 ? pfifo_fast_change_tx_queue_len+0x193/0x193 ? rt2500pci_rxdone_tasklet+0x4e/0x52 [rt2500pci] ? tasklet_action_common.constprop.0+0x67/0xa3 run_timer_softirq+0x14/0x27 __do_softirq+0x15f/0x307 ? __entry_text_end+0x5/0x5 call_on_stack+0x40/0x46 </SOFTIRQ> ? __irq_exit_rcu+0x4f/0x85 ? irq_exit_rcu+0x8/0x11 ? sysvec_apic_timer_interrupt+0x3e/0x4b ? handle_exception+0x10e/0x10e ? do_idle+0xb7/0x1c3 ? ldsem_down_write+0x1f/0x1f ? rebalance_domains+0x125/0x292 ? default_idle+0xa/0xc ? open_ctree+0xe03/0x134f ? sysvec_call_function_single+0x49/0x49 ? default_idle+0xa/0xc ? arch_cpu_idle+0xd/0xf ? default_idle_call+0x48/0x74 ? do_idle+0xb7/0x1c3 ? cpu_startup_entry+0x19/0x1b ? rest_init+0x11d/0x120 ? arch_call_rest_init+0x8/0xb ? start_kernel+0x40d/0x41b ? i386_start_kernel+0x43/0x45 ? startup_32_smp+0x164/0x168 irq event stamp: 14810 hardirqs last enabled at (14810): [<c4680e4a>] net_rx_action+0x75/0x250 hardirqs last disabled at (14809): [<c4680e27>] net_rx_action+0x52/0x250 softirqs last enabled at (14794): [<c477b29f>] __do_softirq+0x2d7/0x307 softirqs last disabled at (14807): [<c420fc81>] call_on_stack+0x40/0x46 ---[ end trace 0035e9e10037fdfe ]--- This is also the last netconsole message shown, the machine is not usable via ssh after that.
Only guessing around, but could bug #207353 be relevant too? It's about lockdep warnings regarding netdev on the same machine.
On Thu, Mar 04 2021 at 00:23, bugzilla-daemon wrote: > https://bugzilla.kernel.org/show_bug.cgi?id=208097 > > --- Comment #12 from Erhard F. (erhard_f@mailbox.org) --- > Only guessing around, but could bug #207353 be relevant too? It's about > lockdep > warnings regarding netdev on the same machine. No, that's an independent issue which has nothing to do with those P4 oddities.