Hi all, We have a weird serious bug on systems that uses Haswell Architecture and "real" serial ports /dev/ttyS*. Hardware: some embedded device with "Intel(R) Celeron(R) 2980U @ 1.60GHz", we tried with microcode 0x23 and 0x24. Also on a HP Elite 840 G1" (i5-4300U). Both are members of the Haswell architecture. We can plug a different CPU module (different INTEL families) into the embedded device, namely "Intel(R) Atom(TM) CPU N455 @ 1.66GHz", which is ATOM Pineview architecture. With identical kernels used, we do not get the error. Kernel flavors: happens with distro kernels (Debian, Ubuntu, Fedora). Common factor seems that the kernels are >= 4.9.x. But also with upstream stable kernels, we used 4.13.x, 4.14.x, 4.18.x, even with 4.18.16. The embedded device also behaves strange (e.g. we had once MCEs with a 32b kernel, which went away when using a 64 kernel). We also sometimes get an error in AUFS with the same timestamp as the do_IRQ-message. We do not understand what AUFS has to do with hardware interrupts. However, we do not want to concentrate on this yet. We think that strange message from a mainland kernel itself is worthwhile to be tracked. If some interrupt gets hardwired, there is some chance that some memory gets corrupted. The MCE disappeared after switching from 32b kernel to 64b kernel. Also the HP Elite 840 G1 (using 64b kernel) does show (but 100x less) this problem (with do_IRQ: 0.38 message - different position of the serial port: ttyS4). Deterministic reproduction of "do_IRQ: 0.39 No irq handler for vector" message. We found a way always to reproduce it: the message triggers when we close the serial port. printk messages indicate that after the IER is cleared, and even after synchronize_irq() in serial8250_do_shutdown(), the error still happens. Sometimes even a "stty </dev/ttyS1" is enough to show such a behavior, because it already opens/closes the port. But it happens only sometimes. A deterministic way to reproduce the error is to use a tool called "stress-ng", using version with various stressors. Some newer versions (e.g. the one used from Debian, 0.07.16-1) just opens all the files in /dev, runs a fstat() on them, and closes them again. All this is executed in a loop and runs very fast. This has the side-effect that /dev/ttyS* are opened/closed very fast. And that shows the error message very frequently: [ 6.558244] e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx [ 17.048154] fuse init (API version 7.27) [ 17.248215] do_IRQ: 0.39 No irq handler for vector [ 17.249622] do_IRQ: 0.39 No irq handler for vector [ 17.252415] do_IRQ: 0.39 No irq handler for vector [ 17.253698] do_IRQ: 0.39 No irq handler for vector [ 18.528774] do_IRQ: 0.39 No irq handler for vector [ 18.532305] do_IRQ: 0.39 No irq handler for vector [ 18.532540] do_IRQ: 0.39 No irq handler for vector [ 18.606916] do_IRQ: 0.39 No irq handler for vector [ 20.227241] random: crng init done Here, stress-ng was executed just for few seconds. Unfortunately, from time to time, the exact same setup makes the error scarce, e.g. it can happen that we don't see the error at least for 15 minutes. So when running this overnight we had between 1500 and 30000 of this messages in dmesg/journal. One thing noted is that "noapic=1" in the kernel command line makes the error go away (using classical PIC HW config instead IOAPIC). Also using the Atom CPU with the older architecture (Pineview) makes the error go away, but that one is now EOL. :-( Any advice on how to proceed further? Best Regards, Zoran (Holger Schurig is the main system architect)
Hello all, New info about this problem: kernel 4.9 doesn't show the issue. The same is true for later kernels up to 4.13. There was a compilation issue with 4.14 (which later was solved, something unrelated with tools/objcopy when compiling for a different architecture), and because of this glitch a git bisect was done between v4.13 and v4.15. This is the outcome: $ git bisect log # bad: [d8a5b80568a9cb66810e75b182018e9edb68e8ff] Linux 4.15 # good: [569dbb88e80deb68974ef6fdd6a13edb9d686261] Linux 4.13 git bisect start 'v4.15' 'v4.13' # good: [425a08c67317acee103b3ad58f57c762e8834faf] mlxsw: spectrum_router: Prepare for large adjacency groups git bisect good 425a08c67317acee103b3ad58f57c762e8834faf # bad: [e60e1ee60630cafef5e430c2ae364877e061d980] Merge tag 'drm-for-v4.15' of git://people.freedesktop.org/~airlied/linux git bisect bad e60e1ee60630cafef5e430c2ae364877e061d980 # bad: [4008e6a9bcee2f3b61bb11951de0fb0ed764cb91] Merge branch 'i2c/for-4.15' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/wsa/linux git bisect bad 4008e6a9bcee2f3b61bb11951de0fb0ed764cb91 # bad: [3c073991eb417b6f785ddc6afbbdc369eb84aa6a] Merge tag 'devprop-4.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm git bisect bad 3c073991eb417b6f785ddc6afbbdc369eb84aa6a # good: [2101dd64b304b034862f5ca40877c41b7ccb9c5e] Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gerg/m68knommu git bisect good 2101dd64b304b034862f5ca40877c41b7ccb9c5e # good: [d6ec9d9a4def52a5094237564eaf6f6979fd7a27] Merge branch 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip git bisect good d6ec9d9a4def52a5094237564eaf6f6979fd7a27 # good: [7d58e1c9059eefe0066c5acf2ffa582f6f0180e3] Merge branch 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip git bisect good 7d58e1c9059eefe0066c5acf2ffa582f6f0180e3 # good: [990a848d537e4da966907c8ccec95bc568f2911c] Merge branches 'pm-devfreq' and 'pm-tools' git bisect good 990a848d537e4da966907c8ccec95bc568f2911c # bad: [25e960efc63852b84d1c3739aef586285b177395] PCI/MSI: Set MSI_FLAG_MUST_REACTIVATE in core code git bisect bad 25e960efc63852b84d1c3739aef586285b177395 # skip: [029c6e1c9df776fe1b2ba756a28fb65e9f9e9f69] x86/vector: Store the single CPU targets in apic data git bisect skip 029c6e1c9df776fe1b2ba756a28fb65e9f9e9f69 - no network - no USB keyboard - therefore "git bisect skip" # bad: [d6ffc6ac83b1f9f12652d89b9cb5bcbfbea7796c] x86/vector: Respect affinity mask in irq descriptor git bisect bad d6ffc6ac83b1f9f12652d89b9cb5bcbfbea7796c # good: [ef9e56d894eab99a33a06b96ba8057afa67d3702] x86/ioapic: Remove obsolete post hotplug update git bisect good ef9e56d894eab99a33a06b96ba8057afa67d3702 # good: [8d1e3dca7de6e8513872799a748a1d47d8dce60d] x86/vector: Add tracepoints for vector management git bisect good 8d1e3dca7de6e8513872799a748a1d47d8dce60d # good: [5ba204a1817ba95a7b24dbe8ef2c7ddd4cea886e] iommu/amd: Reevaluate vector configuration on activate() git bisect good 5ba204a1817ba95a7b24dbe8ef2c7ddd4cea886e # good: [4900be83602b6be07366d3e69f756c1959f4169a] x86/vector/msi: Switch to global reservation mode git bisect good 4900be83602b6be07366d3e69f756c1959f4169a # bad: [2cffad7bad83157f89332872015f4305d2ac09ac] x86/irq: Simplify hotplug vector accounting git bisect bad 2cffad7bad83157f89332872015f4305d2ac09ac # bad: [464d12309e1b5829597793db551ae8ecaecf4036] x86/vector: Switch IOAPIC to global reservation mode git bisect bad 464d12309e1b5829597793db551ae8ecaecf4036 # first bad commit: [464d12309e1b5829597793db551ae8ecaecf4036] x86/vector: Switch IOAPIC to global reservation mode Best Regards, Zoran (Holger Schurig is the main system architect)
Id addition to Mr. Holger discoveries (two topics described above), Bernd and me tried to test another theory, which I came upon Thursday, November 1st, 2018. That we need actually to play with i915 driver, and somehow to swap on fly i915 with frame buffer (in other words to disable GDM on fly), and see how it will influence this problem? On Friday, November 2nd, Bernd and me sat together for couple of hours with the embedded platform (silicon HSW Celeron 2980U), using stress-ng fstat() version 0.07.16 (one which opens and closes every file in /dev directory in the loop). The OS is Debian 4.18.10, and executing it in normal fashion, we very soon introduced the problem (per Mr. Holger descriptions). Intensive problem, and the first attachment (GDM_START-HTOP.jpg) emerged from it. You all can see that the both CPUs are hogged/swamped up to almost 100%. The number of DO_IRQ messages was high. Then, we issued a command: # systemctl stop getty@tty7, and then, GDM stopped working almost immediately. And the following happened (after some short transition time) with HTOP: GDM_STOP-HTOP.jpg . As you all see the load dramatically decreased, showing that the problem diminished almost by ~ 50x, but was still present!? I assume that i915 was swapped by frame buffer driver (please, correct me if I am wrong). For some (unknown???) reason, PCIe 00:02.0 VGA influences legacy serial ports (NOT PCIe ones) /dev/ttyS0 (IOAPIC int4) and /dev/ttyS1 (IOAPIC int3). And the GFX int is intA routed to int44 (picture PCIe-00-02-0.jpg). At the end, we issued another command: # systemctl start getty@tty7, which almost immediately brought GDM back, and after some short time brought system to the initial state: GDM_START-HTOP.jpg . It seems that some of these patches between 4.13 and 4.15 influenced and mixed delivery of IOAPIC GFX interrupts to LAPIC serial ports' interrupts??? I attached all three pictures for your observation. I'll try on Monday, November 5th, 2018. to test this problem using strace program (tough one, but necessary). New idea... Which appeared/popped up at the end of Friday. Thank you, Zoran
Created attachment 279311 [details] GDM_START-HTOP.jpg - high load while GDM active
Created attachment 279313 [details] GDM_STOP-HTOP.jpg - low load while GDM inactive
Created attachment 279315 [details] PCIe-00-02-0 - PCIe 00:02.0 using cli command lspci -vv
Hello, Today, I have tried to test more extensively this issue. To do this, I did the following: [A] Disabled /dev/ttyS0 (I/O 0x3F8, irq 4) using BIOS; [B] Decided NOT to use stress-ng, just to try to use serial /dev/ttyS1 commands; [C] Play with CLI commands, using simple bash online commands and scripts; [D] The same customer BIOS used (MCU 0x23). While using customer Linux dlvt83 4.18.10 (Thu October 18, 2018.) x86_64 image, the following is true: The issue is easy to reproduce using the following commands: # stty -F /dev/ttyS1 (one message 50/50 - roughly every second time) Changing baud rate on fly with # stty -F /dev/ttyS1 1200 to stty -F /dev/ttyS1 576000 does not change anything in timing and occurrence frequency; Executing online script # while true; do "ABC" > /dev/ttyS1; sleep 10; done& introduces in 80% cases IRQ message, only single one after 10 sec. (sleep time interval); The irq 3 is mapped to vector 39, BUT ONLY on one local LAPIC (CPU1 irq3 LAPIC 100% count in my case, CPU0 irq3 LAPIC is rock solid 0% count); While using Linux Debian 4.9.0/7 - amd64 Debian 4.9.110-1 (2018-07-05), the following is true: The issue is NOT reproducible at all (with above commands and scripts. also from previous testing with stress-ng); The irq 3 is visible on both local LAPICs (CPU0 irq3 LAPIC had 80% count of the interrupts in my case, CPU1 irq3 LAPIC around 20% count); _______ This prompted me to start thinking that some patch is changing delivery in IOAPIC mode, I/O APIC domain to LAPICs to be deterministic, so there is no race effect in between which core should take INT handling. Obvious change. So, either: [1] APIC MADT table is all wrong from BIOS (but Debian distro from early July 2018. works); [2] Some patch in kernel, patching IOAPIC mode (from late July 2018.) delivers this blow! Please, do read https://wiki.osdev.org/IOAPIC It seems that IOREDTBL [delivery mode] register was changed, from low priority delivery mode (001) to fixed priority delivery mode (000). Delivery Mode (8 - 10) How the interrupt will be sent to the CPU(s). It can be 000 (Fixed), 001 (Lowest Priority), 010 (SMI), 100 (NMI), 101 (INIT) and 111 (ExtINT). Most of the cases you want Fixed mode, or Lowest Priority if you don't want to suspend a high priority task on some important Processor/Core/Thread. All comments and suggestions are very welcome! Thank you, Zoran
New info emerged from the various platforms' testing, for last few days. Here are the findings: So far, we found this bug on the following platforms (Ubuntu, kernel used: 4.15.0-38-generic #41~16.04.1-Ubuntu SMP Wed Oct 10 UTC 2018 x86_64 GNU/Linux): [1] CPU IVB: i7-3520M, PCH Cougar Canyon Point (series C210 - INTEL series 7); [2] CPU HSW: 2980U, i5-4300U, PCH Lynx Point (series C220 - INTEL series 8); [3] CPU BDW: i3-5010U, PCH Lynx Point (series C220 - INTEL series 8); Foreign info (from Mellanox): CPU BDW-DE (CPUID 0x50654, my best guess), PCH Wildcat Point (series C230 - INTEL series 9) works without problems, according to Mellanox. Please, if you need more detailed info, I will be more than glad to provide to you info about cases 1/2/3. Thank you, Zoran
New entry, with much better explanation what is going on there. Please, do NOT close this entry! [BUG-Continued] "do_IRQ: 0.39 No irq handler for vector" from a 16550 port https://bugzilla.kernel.org/show_bug.cgi?id=201709 Thank you, Zoran
7.36 No irq handler for vector I have the same problem on a Lenovo ThinkServer TS140, Haswell E3-1245v3 4C8T, Intel C226. A test program runs test connections every minute between a Digi Portserver TS (dgrp), some ttyUSB (ftdi_sio) serial ports and ttyS0. Every minute the console and dmesg are spammed with this "irq handler" message. I had always considered the problem to be dgrp. When I saw this thread I shut down the part of the test that runs ttyS0 and the dmesg spam stopped. The test connections from Digi to ttyUSB continue to run and no dmesg spam. Does not happen in Arch 4.14. Does happen in Arch 4.19. I have it marked as starting in kernel 4.16. https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1781016 https://www.spinics.net/lists/linux-serial/msg32407.html https://www.phoronix.com/forums/forum/linux-graphics-x-org-drivers/open-source-amd-linux/1057042-no-irq-handler-for-vector
Hello Chris, Please, read the full report on the following kernel bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=201709 You can also read about this problem traversing the following message thread: https://marc.info/?t=154140376100010&r=1&w=2 Additional analyses performed (and patch making/introducing the problem identified): Used git log commands to compare differences between two taged kernels: 4.13 and 4.15. The problem/patch identified by Holger Schurig is the following: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=464d12309e1b5829597793db551ae8ecaecf4036 Zoran
tglx>So one hint is that it's only a few boards and not a really widespread problem. I already read all that. All I want is to provide a platform that is a lot more common.
We observe the same issue on Adlink COM Express Type 6 Express-HL motherboard (QM87 chipset) with Core i7-4700EQ and Express-SLE motherboards (CM236 chipset) with Xeon E3-1505L when we open and close or write to the serial port. The bisect show us that the issue was caused by a commit around d6ffc6ac83b1 "x86/vector: Respect affinity mask in irq descriptor" which seems to be in line with Zoran finding (the 464d12309e1b is 2nd commit before the 6ffc6ac83b1). It was difficult to pick the exact commit because it was taking several minutes to trigger "No irq handler for vector" issue.
Same here on a Linux 5.1.16-arch1 x86_64. Closing serial port triggers that dmesg line. Intel(R) Core(TM) i5-6500TE in a Neousys Technology Inc. / Nuvo-5000 / NVS-5000
> Same here on a Linux 5.1.16-arch1 x86_64. > Closing serial port triggers that dmesg line. It's not a real issue. It turned out to be a harmless warning at least for the serial case. We've finally got enough debug information to root cause and fix it. The changes are in Linus tree and will be merged back to stable sooner than later. Thanks, tglx