Bug 201617
Summary: | [BUG] "do_IRQ: 0.39 No irq handler for vector" from a 16550 port | ||
---|---|---|---|
Product: | Platform Specific/Hardware | Reporter: | Zoran (zoran.stojsavljevic) |
Component: | x86-64 | Assignee: | platform_x86_64 (platform_x86_64) |
Status: | NEW --- | ||
Severity: | blocking | CC: | alexandre.jaud, jbuchert+kbugs, jt, linuxkernel.severach, lukas.redlinger, rostan, zoran.stojsavljevic |
Priority: | P1 | ||
Hardware: | x86-64 | ||
OS: | Linux | ||
Kernel Version: | From 4.15 up to the latest (for now vanilla kernel 4.20-rc2) | Subsystem: | |
Regression: | No | Bisected commit-id: | |
Attachments: |
GDM_START-HTOP.jpg - high load while GDM active
GDM_STOP-HTOP.jpg - low load while GDM inactive PCIe-00-02-0 - PCIe 00:02.0 using cli command lspci -vv |
Description
Zoran
2018-11-04 12:23:34 UTC
Hello all, New info about this problem: kernel 4.9 doesn't show the issue. The same is true for later kernels up to 4.13. There was a compilation issue with 4.14 (which later was solved, something unrelated with tools/objcopy when compiling for a different architecture), and because of this glitch a git bisect was done between v4.13 and v4.15. This is the outcome: $ git bisect log # bad: [d8a5b80568a9cb66810e75b182018e9edb68e8ff] Linux 4.15 # good: [569dbb88e80deb68974ef6fdd6a13edb9d686261] Linux 4.13 git bisect start 'v4.15' 'v4.13' # good: [425a08c67317acee103b3ad58f57c762e8834faf] mlxsw: spectrum_router: Prepare for large adjacency groups git bisect good 425a08c67317acee103b3ad58f57c762e8834faf # bad: [e60e1ee60630cafef5e430c2ae364877e061d980] Merge tag 'drm-for-v4.15' of git://people.freedesktop.org/~airlied/linux git bisect bad e60e1ee60630cafef5e430c2ae364877e061d980 # bad: [4008e6a9bcee2f3b61bb11951de0fb0ed764cb91] Merge branch 'i2c/for-4.15' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/wsa/linux git bisect bad 4008e6a9bcee2f3b61bb11951de0fb0ed764cb91 # bad: [3c073991eb417b6f785ddc6afbbdc369eb84aa6a] Merge tag 'devprop-4.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm git bisect bad 3c073991eb417b6f785ddc6afbbdc369eb84aa6a # good: [2101dd64b304b034862f5ca40877c41b7ccb9c5e] Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gerg/m68knommu git bisect good 2101dd64b304b034862f5ca40877c41b7ccb9c5e # good: [d6ec9d9a4def52a5094237564eaf6f6979fd7a27] Merge branch 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip git bisect good d6ec9d9a4def52a5094237564eaf6f6979fd7a27 # good: [7d58e1c9059eefe0066c5acf2ffa582f6f0180e3] Merge branch 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip git bisect good 7d58e1c9059eefe0066c5acf2ffa582f6f0180e3 # good: [990a848d537e4da966907c8ccec95bc568f2911c] Merge branches 'pm-devfreq' and 'pm-tools' git bisect good 990a848d537e4da966907c8ccec95bc568f2911c # bad: [25e960efc63852b84d1c3739aef586285b177395] PCI/MSI: Set MSI_FLAG_MUST_REACTIVATE in core code git bisect bad 25e960efc63852b84d1c3739aef586285b177395 # skip: [029c6e1c9df776fe1b2ba756a28fb65e9f9e9f69] x86/vector: Store the single CPU targets in apic data git bisect skip 029c6e1c9df776fe1b2ba756a28fb65e9f9e9f69 - no network - no USB keyboard - therefore "git bisect skip" # bad: [d6ffc6ac83b1f9f12652d89b9cb5bcbfbea7796c] x86/vector: Respect affinity mask in irq descriptor git bisect bad d6ffc6ac83b1f9f12652d89b9cb5bcbfbea7796c # good: [ef9e56d894eab99a33a06b96ba8057afa67d3702] x86/ioapic: Remove obsolete post hotplug update git bisect good ef9e56d894eab99a33a06b96ba8057afa67d3702 # good: [8d1e3dca7de6e8513872799a748a1d47d8dce60d] x86/vector: Add tracepoints for vector management git bisect good 8d1e3dca7de6e8513872799a748a1d47d8dce60d # good: [5ba204a1817ba95a7b24dbe8ef2c7ddd4cea886e] iommu/amd: Reevaluate vector configuration on activate() git bisect good 5ba204a1817ba95a7b24dbe8ef2c7ddd4cea886e # good: [4900be83602b6be07366d3e69f756c1959f4169a] x86/vector/msi: Switch to global reservation mode git bisect good 4900be83602b6be07366d3e69f756c1959f4169a # bad: [2cffad7bad83157f89332872015f4305d2ac09ac] x86/irq: Simplify hotplug vector accounting git bisect bad 2cffad7bad83157f89332872015f4305d2ac09ac # bad: [464d12309e1b5829597793db551ae8ecaecf4036] x86/vector: Switch IOAPIC to global reservation mode git bisect bad 464d12309e1b5829597793db551ae8ecaecf4036 # first bad commit: [464d12309e1b5829597793db551ae8ecaecf4036] x86/vector: Switch IOAPIC to global reservation mode Best Regards, Zoran (Holger Schurig is the main system architect) Id addition to Mr. Holger discoveries (two topics described above), Bernd and me tried to test another theory, which I came upon Thursday, November 1st, 2018. That we need actually to play with i915 driver, and somehow to swap on fly i915 with frame buffer (in other words to disable GDM on fly), and see how it will influence this problem? On Friday, November 2nd, Bernd and me sat together for couple of hours with the embedded platform (silicon HSW Celeron 2980U), using stress-ng fstat() version 0.07.16 (one which opens and closes every file in /dev directory in the loop). The OS is Debian 4.18.10, and executing it in normal fashion, we very soon introduced the problem (per Mr. Holger descriptions). Intensive problem, and the first attachment (GDM_START-HTOP.jpg) emerged from it. You all can see that the both CPUs are hogged/swamped up to almost 100%. The number of DO_IRQ messages was high. Then, we issued a command: # systemctl stop getty@tty7, and then, GDM stopped working almost immediately. And the following happened (after some short transition time) with HTOP: GDM_STOP-HTOP.jpg . As you all see the load dramatically decreased, showing that the problem diminished almost by ~ 50x, but was still present!? I assume that i915 was swapped by frame buffer driver (please, correct me if I am wrong). For some (unknown???) reason, PCIe 00:02.0 VGA influences legacy serial ports (NOT PCIe ones) /dev/ttyS0 (IOAPIC int4) and /dev/ttyS1 (IOAPIC int3). And the GFX int is intA routed to int44 (picture PCIe-00-02-0.jpg). At the end, we issued another command: # systemctl start getty@tty7, which almost immediately brought GDM back, and after some short time brought system to the initial state: GDM_START-HTOP.jpg . It seems that some of these patches between 4.13 and 4.15 influenced and mixed delivery of IOAPIC GFX interrupts to LAPIC serial ports' interrupts??? I attached all three pictures for your observation. I'll try on Monday, November 5th, 2018. to test this problem using strace program (tough one, but necessary). New idea... Which appeared/popped up at the end of Friday. Thank you, Zoran Created attachment 279311 [details]
GDM_START-HTOP.jpg - high load while GDM active
Created attachment 279313 [details]
GDM_STOP-HTOP.jpg - low load while GDM inactive
Created attachment 279315 [details]
PCIe-00-02-0 - PCIe 00:02.0 using cli command lspci -vv
Hello, Today, I have tried to test more extensively this issue. To do this, I did the following: [A] Disabled /dev/ttyS0 (I/O 0x3F8, irq 4) using BIOS; [B] Decided NOT to use stress-ng, just to try to use serial /dev/ttyS1 commands; [C] Play with CLI commands, using simple bash online commands and scripts; [D] The same customer BIOS used (MCU 0x23). While using customer Linux dlvt83 4.18.10 (Thu October 18, 2018.) x86_64 image, the following is true: The issue is easy to reproduce using the following commands: # stty -F /dev/ttyS1 (one message 50/50 - roughly every second time) Changing baud rate on fly with # stty -F /dev/ttyS1 1200 to stty -F /dev/ttyS1 576000 does not change anything in timing and occurrence frequency; Executing online script # while true; do "ABC" > /dev/ttyS1; sleep 10; done& introduces in 80% cases IRQ message, only single one after 10 sec. (sleep time interval); The irq 3 is mapped to vector 39, BUT ONLY on one local LAPIC (CPU1 irq3 LAPIC 100% count in my case, CPU0 irq3 LAPIC is rock solid 0% count); While using Linux Debian 4.9.0/7 - amd64 Debian 4.9.110-1 (2018-07-05), the following is true: The issue is NOT reproducible at all (with above commands and scripts. also from previous testing with stress-ng); The irq 3 is visible on both local LAPICs (CPU0 irq3 LAPIC had 80% count of the interrupts in my case, CPU1 irq3 LAPIC around 20% count); _______ This prompted me to start thinking that some patch is changing delivery in IOAPIC mode, I/O APIC domain to LAPICs to be deterministic, so there is no race effect in between which core should take INT handling. Obvious change. So, either: [1] APIC MADT table is all wrong from BIOS (but Debian distro from early July 2018. works); [2] Some patch in kernel, patching IOAPIC mode (from late July 2018.) delivers this blow! Please, do read https://wiki.osdev.org/IOAPIC It seems that IOREDTBL [delivery mode] register was changed, from low priority delivery mode (001) to fixed priority delivery mode (000). Delivery Mode (8 - 10) How the interrupt will be sent to the CPU(s). It can be 000 (Fixed), 001 (Lowest Priority), 010 (SMI), 100 (NMI), 101 (INIT) and 111 (ExtINT). Most of the cases you want Fixed mode, or Lowest Priority if you don't want to suspend a high priority task on some important Processor/Core/Thread. All comments and suggestions are very welcome! Thank you, Zoran New info emerged from the various platforms' testing, for last few days. Here are the findings: So far, we found this bug on the following platforms (Ubuntu, kernel used: 4.15.0-38-generic #41~16.04.1-Ubuntu SMP Wed Oct 10 UTC 2018 x86_64 GNU/Linux): [1] CPU IVB: i7-3520M, PCH Cougar Canyon Point (series C210 - INTEL series 7); [2] CPU HSW: 2980U, i5-4300U, PCH Lynx Point (series C220 - INTEL series 8); [3] CPU BDW: i3-5010U, PCH Lynx Point (series C220 - INTEL series 8); Foreign info (from Mellanox): CPU BDW-DE (CPUID 0x50654, my best guess), PCH Wildcat Point (series C230 - INTEL series 9) works without problems, according to Mellanox. Please, if you need more detailed info, I will be more than glad to provide to you info about cases 1/2/3. Thank you, Zoran New entry, with much better explanation what is going on there. Please, do NOT close this entry! [BUG-Continued] "do_IRQ: 0.39 No irq handler for vector" from a 16550 port https://bugzilla.kernel.org/show_bug.cgi?id=201709 Thank you, Zoran 7.36 No irq handler for vector I have the same problem on a Lenovo ThinkServer TS140, Haswell E3-1245v3 4C8T, Intel C226. A test program runs test connections every minute between a Digi Portserver TS (dgrp), some ttyUSB (ftdi_sio) serial ports and ttyS0. Every minute the console and dmesg are spammed with this "irq handler" message. I had always considered the problem to be dgrp. When I saw this thread I shut down the part of the test that runs ttyS0 and the dmesg spam stopped. The test connections from Digi to ttyUSB continue to run and no dmesg spam. Does not happen in Arch 4.14. Does happen in Arch 4.19. I have it marked as starting in kernel 4.16. https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1781016 https://www.spinics.net/lists/linux-serial/msg32407.html https://www.phoronix.com/forums/forum/linux-graphics-x-org-drivers/open-source-amd-linux/1057042-no-irq-handler-for-vector Hello Chris, Please, read the full report on the following kernel bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=201709 You can also read about this problem traversing the following message thread: https://marc.info/?t=154140376100010&r=1&w=2 Additional analyses performed (and patch making/introducing the problem identified): Used git log commands to compare differences between two taged kernels: 4.13 and 4.15. The problem/patch identified by Holger Schurig is the following: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=464d12309e1b5829597793db551ae8ecaecf4036 Zoran tglx>So one hint is that it's only a few boards and not a really widespread problem. I already read all that. All I want is to provide a platform that is a lot more common. We observe the same issue on Adlink COM Express Type 6 Express-HL motherboard (QM87 chipset) with Core i7-4700EQ and Express-SLE motherboards (CM236 chipset) with Xeon E3-1505L when we open and close or write to the serial port. The bisect show us that the issue was caused by a commit around d6ffc6ac83b1 "x86/vector: Respect affinity mask in irq descriptor" which seems to be in line with Zoran finding (the 464d12309e1b is 2nd commit before the 6ffc6ac83b1). It was difficult to pick the exact commit because it was taking several minutes to trigger "No irq handler for vector" issue. Same here on a Linux 5.1.16-arch1 x86_64. Closing serial port triggers that dmesg line. Intel(R) Core(TM) i5-6500TE in a Neousys Technology Inc. / Nuvo-5000 / NVS-5000 > Same here on a Linux 5.1.16-arch1 x86_64.
> Closing serial port triggers that dmesg line.
It's not a real issue. It turned out to be a harmless warning at least for
the serial case.
We've finally got enough debug information to root cause and fix it. The
changes are in Linus tree and will be merged back to stable sooner than
later.
Thanks,
tglx
|