Bug 208097
Summary: | [Bisected][Regression] kernels greater than 4.14.x fail to boot with Hyperthreading enabled (Shuttle XPC SS51G w. Pentium 4 HT) | ||
---|---|---|---|
Product: | Platform Specific/Hardware | Reporter: | Erhard F. (erhard_f) |
Component: | i386 | Assignee: | platform_i386 |
Status: | NEW --- | ||
Severity: | normal | CC: | tglx |
Priority: | P1 | ||
Hardware: | i386 | ||
OS: | Linux | ||
Kernel Version: | 5.7.0 | Subsystem: | |
Regression: | Yes | Bisected commit-id: | |
Attachments: |
kernel .config (kernel 4.15-rc8+, Shuttle XPC FS51, Pentium 4 HT)
kernel dmesg (kernel 4.15-rc8+, Shuttle XPC FS51, Pentium 4 HT)
kernel dmesg (kernel 4.14.178, Shuttle XPC FS51, Pentium 4 HT)
bisect.log
dmesg (kernel 5.11-rc7, Shuttle XPC FS51, Pentium 4)
kernel .config (kernel 5.11-rc7, Shuttle XPC FS51, Pentium 4) |
Description
Erhard F.
2020-06-07 16:34:33 UTC
Created attachment 289553 [details]
kernel dmesg (kernel 4.15-rc8+, Shuttle XPC FS51, Pentium 4 HT)
Created attachment 289555 [details]
kernel dmesg (kernel 4.14.178, Shuttle XPC FS51, Pentium 4 HT)
Created attachment 289557 [details]
bisect.log
Created attachment 295173 [details]
dmesg (kernel 5.11-rc7, Shuttle XPC FS51, Pentium 4)

Now with a proper stacktrace made with netconsole.

The "inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage." which shows up at the beginning is bug #211575 and occurs independently of whether Hyperthreading is enabled.

Created attachment 295175 [details]
kernel .config (kernel 5.11-rc7, Shuttle XPC FS51, Pentium 4)
On Tue, Feb 09 2021 at 22:53, bugzilla-daemon wrote:
> dmesg (kernel 5.11-rc7, Shuttle XPC FS51, Pentium 4)
>
> Now with a proper stacktrace made with netconsole.
>
> The "inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage." which shows up at
> the beginning is bug #211575 and occurs independently of whether
> Hyperthreading is enabled.

Yes, that's unrelated and an issue with the 8139 netconsole.

That's an odd failure. It's stuck waiting for an interrupt to
synchronize. Can you try to boot with:

    irqaffinity=0

on the kernel command line?

If that makes it boot then you might try to move e.g. the sound
interrupt to CPU1 via /proc/irq/$IRQ/smp_affinity[_list].

Thanks,

        tglx

(In reply to Thomas Gleixner from comment #6)
> That's an odd failure. It's stuck waiting for an interrupt to
> synchronize. Can you try to boot with:
>
>     irqaffinity=0

Yes, that did the trick! The machine runs fine so far.

> If that makes it boot then you might try to move e.g. the sound
> interrupt to CPU1 via /proc/irq/$IRQ/smp_affinity[_list].

How to do that at boot time? Most of the time the machine won't finish booting, so it never gets far enough to run user scripts.

On Thu, Feb 11 2021 at 15:43, bugzilla-daemon wrote:
>> That's an odd failure. It's stuck waiting for an interrupt to
>> synchronize. Can you try to boot with:
>>
>>     irqaffinity=0
> Yes, that did the trick! The machine runs fine so far.
>
>> If that makes it boot then you might try to move e.g. the sound
>> interrupt to CPU1 via /proc/irq/$IRQ/smp_affinity[_list].
> How to do that at boot time? Most of the time the machine won't finish
> booting, so it never gets far enough to run user scripts.

Not at boot time. With that command line option the machine seems to run. So you have a working system, right?
If so, can you please provide the output of

    # cat /proc/interrupts

Thanks,

        tglx

(In reply to Thomas Gleixner from comment #8)
> If so, can you please provide the output of
>
>     # cat /proc/interrupts
>
> Thanks,
>
>         tglx

# cat /proc/interrupts
           CPU0       CPU1
  0:         40          0   IO-APIC   2-edge      timer
  8:          0          0   IO-APIC   8-edge      rtc0
  9:          0          0   IO-APIC   9-fasteoi   acpi
 14:       6672          0   IO-APIC  14-edge      pata_sis
 15:          0          0   IO-APIC  15-edge      pata_sis
 16:         67          0   IO-APIC  16-fasteoi   radeon
 17:       2775          0   IO-APIC  17-fasteoi   rt2500pci
 18:       1789          0   IO-APIC  18-fasteoi   eth0, snd_intel8x0
 19:          1          0   IO-APIC  19-fasteoi   firewire_ohci
 20:          0          0   IO-APIC  20-fasteoi   ohci_hcd:usb2
 21:          0          0   IO-APIC  21-fasteoi   ohci_hcd:usb3
 22:          0          0   IO-APIC  22-fasteoi   ohci_hcd:usb4
 23:          0          0   IO-APIC  23-fasteoi   ehci_hcd:usb1
NMI:          3          3   Non-maskable interrupts
LOC:      16611      15587   Local timer interrupts
SPU:          0          0   Spurious interrupts
PMI:          2          2   Performance monitoring interrupts
IWI:          0          0   IRQ work interrupts
RTR:          0          0   APIC ICR read retries
RES:       5782      12575   Rescheduling interrupts
CAL:        794        871   Function call interrupts
TLB:        805        549   TLB shootdowns
TRM:          0          0   Thermal event interrupts
THR:          0          0   Threshold APIC interrupts
MCE:          0          0   Machine check exceptions
MCP:          1          1   Machine check polls
ERR:          0
MIS:          0
PIN:          0          0   Posted-interrupt notification event
NPI:          0          0   Nested posted-interrupt event
PIW:          0          0   Posted-interrupt wakeup event

On Tue, Feb 23 2021 at 12:35, bugzilla-daemon wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=208097
>
> --- Comment #9 from Erhard F. (erhard_f@mailbox.org) ---
> (In reply to Thomas Gleixner from comment #8)
>> If so, can you please provide the output of
>>
>>     # cat /proc/interrupts
>>
>> Thanks,
>>
>>         tglx
> # cat /proc/interrupts
>            CPU0       CPU1
>   0:         40          0   IO-APIC   2-edge      timer
>   8:          0          0   IO-APIC   8-edge      rtc0
>   9:          0          0   IO-APIC   9-fasteoi   acpi
>  14:       6672          0   IO-APIC  14-edge      pata_sis
>  15:          0          0   IO-APIC  15-edge      pata_sis
>  16:         67          0   IO-APIC  16-fasteoi   radeon
>  17:       2775          0   IO-APIC  17-fasteoi   rt2500pci
>  18:       1789          0   IO-APIC  18-fasteoi   eth0, snd_intel8x0
>  19:          1          0   IO-APIC  19-fasteoi   firewire_ohci
>  20:          0          0   IO-APIC  20-fasteoi   ohci_hcd:usb2
>  21:          0          0   IO-APIC  21-fasteoi   ohci_hcd:usb3
>  22:          0          0   IO-APIC  22-fasteoi   ohci_hcd:usb4
>  23:          0          0   IO-APIC  23-fasteoi   ehci_hcd:usb1

Ok. So all device interrupts end up on CPU 0 which is what we told the kernel to do with that commandline option.

What happens if you do the following:

  1) Boot with irqaffinity=0 on the command line

  2) run: echo 1 >/proc/irq/18/smp_affinity_list

  3) Shut down the network interface and/or use sound on and off

In theory this should lead to a similar situation as you saw without that command line option.
Thanks,

        tglx

(In reply to Thomas Gleixner from comment #10)
> What happens if you do the following:
>
>   1) Boot with irqaffinity=0 on the command line
>
>   2) run: echo 1 >/proc/irq/18/smp_affinity_list
>
>   3) Shut down the network interface and/or use sound on and off

A few moments after I run "echo 1 >/proc/irq/18/smp_affinity_list" I get this:

------------[ cut here ]------------
NETDEV WATCHDOG: eth0 (8139too): transmit queue 0 timed out
WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:442 dev_watchdog+0x15f/0x1ad
Modules linked in: auth_rpcgss nfsv4 dns_resolver nfs lockd grace sunrpc nfs_ssc rt2500pci eeprom_93cx6 rt2x00pci rt2x00mmio radeon rt2x00lib led_class mac80211 hwmon i2c_algo_bit drm_ttm_helper cfg80211 ttm drm_kms_helper cfbfillrect snd_intel8x0 syscopyarea snd_ac97_codec ohci_pci cfbimgblt ehci_pci ohci_hcd sysfillrect ehci_hcd ac97_bus sysimgblt fb_sys_fops snd_pcm cfbcopyarea usbcore evdev fb firewire_ohci firewire_core fan snd_timer thermal snd 8250 font sr_mod fbdev cdrom 8250_base serial_core rfkill soundcore button usb_common libarc4 crc_itu_t i2c_sis96x drm configfs fuse drm_panel_orientation_quirks backlight
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.11.1-gentoo-Pentium4 #2
Hardware name: /FS51, BIOS 6.00 PG 12/02/2003
EIP: dev_watchdog+0x15f/0x1ad
Code: 3d 79 ad a9 c4 00 75 34 c6 05 79 ad a9 c4 01 8b 45 f0 e8 cd 32 fd ff 56 50 8d 83 fc fc ff ff 50 68 fc 2f 9a c4 e8 e5 b6 0b 00 <0f> 0b 83 c4 10 eb 0b 46 05 00 02 00 00 e9 02 ff ff ff 8b 83 18 fe
EAX: 0000003b EBX: c108c304 ECX: f63c41fc EDX: fffff000
ESI: 00000000 EDI: ffff1d40 EBP: c1083f44 ESP: c1083f24
DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00210282
CR0: 80050033 CR2: 0164e02c CR3: 05fd5000 CR4: 000006d0
Call Trace:
 <SOFTIRQ>
 ? pfifo_fast_change_tx_queue_len+0x193/0x193
 call_timer_fn+0xfe/0x201
 __run_timers+0x134/0x159
 ? pfifo_fast_change_tx_queue_len+0x193/0x193
 ? rt2500pci_rxdone_tasklet+0x4e/0x52 [rt2500pci]
 ? tasklet_action_common.constprop.0+0x67/0xa3
 run_timer_softirq+0x14/0x27
 __do_softirq+0x15f/0x307
 ? __entry_text_end+0x5/0x5
 call_on_stack+0x40/0x46
 </SOFTIRQ>
 ? __irq_exit_rcu+0x4f/0x85
 ? irq_exit_rcu+0x8/0x11
 ? sysvec_apic_timer_interrupt+0x3e/0x4b
 ? handle_exception+0x10e/0x10e
 ? do_idle+0xb7/0x1c3
 ? ldsem_down_write+0x1f/0x1f
 ? rebalance_domains+0x125/0x292
 ? default_idle+0xa/0xc
 ? open_ctree+0xe03/0x134f
 ? sysvec_call_function_single+0x49/0x49
 ? default_idle+0xa/0xc
 ? arch_cpu_idle+0xd/0xf
 ? default_idle_call+0x48/0x74
 ? do_idle+0xb7/0x1c3
 ? cpu_startup_entry+0x19/0x1b
 ? rest_init+0x11d/0x120
 ? arch_call_rest_init+0x8/0xb
 ? start_kernel+0x40d/0x41b
 ? i386_start_kernel+0x43/0x45
 ? startup_32_smp+0x164/0x168
irq event stamp: 14810
hardirqs last enabled at (14810): [<c4680e4a>] net_rx_action+0x75/0x250
hardirqs last disabled at (14809): [<c4680e27>] net_rx_action+0x52/0x250
softirqs last enabled at (14794): [<c477b29f>] __do_softirq+0x2d7/0x307
softirqs last disabled at (14807): [<c420fc81>] call_on_stack+0x40/0x46
---[ end trace 0035e9e10037fdfe ]---

This is also the last netconsole message shown; the machine is not reachable via ssh after that.

Only guessing around, but could bug #207353 be relevant too? It's about lockdep warnings regarding netdev on the same machine.

On Thu, Mar 04 2021 at 00:23, bugzilla-daemon wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=208097
>
> --- Comment #12 from Erhard F. (erhard_f@mailbox.org) ---
> Only guessing around, but could bug #207353 be relevant too? It's about
> lockdep warnings regarding netdev on the same machine.

No, that's an independent issue which has nothing to do with those P4 oddities.
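
The two checks used in this thread (reading the per-CPU distribution from /proc/interrupts, and moving one IRQ to CPU1) can be sketched in shell roughly as follows. This is a minimal illustration and was not posted in the bug: the function names irq_totals and cpu_to_mask are hypothetical helpers, and the sketch assumes the /proc/interrupts layout shown above (a header row of CPU names followed by numbered device IRQ rows).

```shell
#!/bin/sh
# Hypothetical helper: sum the per-CPU counts of the numbered (device) IRQ
# rows in /proc/interrupts-style text. Reads the file given as $1 and
# defaults to /proc/interrupts. Rows such as NMI: or LOC: are skipped.
irq_totals() {
    awk '
        NR == 1 { ncpu = NF; next }          # header row: CPU0 CPU1 ...
        $1 ~ /^[0-9]+:$/ {                   # numbered device IRQs only
            for (i = 1; i <= ncpu; i++) total[i] += $(i + 1)
        }
        END {
            for (i = 1; i <= ncpu; i++) printf "CPU%d %d\n", i - 1, total[i]
        }
    ' "${1:-/proc/interrupts}"
}

# Hypothetical helper: turn a single CPU number into the hexadecimal bitmask
# that /proc/irq/$IRQ/smp_affinity expects. smp_affinity_list takes the
# plain CPU number instead.
cpu_to_mask() {
    printf '%x\n' $((1 << $1))
}
```

With irqaffinity=0 in effect, irq_totals should report all device interrupts on CPU0. As root, writing "$(cpu_to_mask 1)" to /proc/irq/18/smp_affinity should be equivalent to the "echo 1 >/proc/irq/18/smp_affinity_list" step from the thread, since CPU 1 corresponds to mask 2.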