Hello again,

This ticket is a continuation of ticket 201617:
https://bugzilla.kernel.org/show_bug.cgi?id=201617

Please do NOT merge these two tickets for now. Based on some testing, another theory has recently emerged (described here), and I would like to explore it in much more depth, with your help.

The tested platform features:

I use two Intel HSW (4th generation Core) mobile SKUs with the Lynx Point PCH:
[1] Celeron 2980U, 2C, no HYP (2 HW threads)
[2] i5-4300U, 2C, with HYP (4 HW threads)

The tested target is an external (baseboard) Winbond Super I/O chip, with one or both serial ports targeted:
/dev/ttyS0 legacy I/O 0x3f8 (hwirq4, baud rate 115200)
/dev/ttyS1 legacy I/O 0x2f8 (hwirq3, baud rate 115200)

I explored two modes of operation for both SKUs:
[1] Kernel command line addition: none (as is)
[2] Kernel command line addition: noapic

PM is enabled for the CPU, so all the C-states are available for use, namely:
C0  (no sleep)
C1  allowed
C1E allowed
C3  allowed
C6  allowed
C7  allowed
C8  allowed
C9  allowed
C10 allowed

The test application used:

stress-ng version 0.07.16, with quick C hacks/quirks that I introduced in the stress-fstat.c file (to make the serial ports fail as soon as possible). I also produced a special version of stress-fstat.c that tests only /dev/rtc0, but that driver has a very different architecture from the serial ones, so it does not fail: it uses a binary semaphore, so only one HW thread at a time is allowed to access the driver. This is very different from serial, where multiple HW threads can execute the same code, with the critical sections inside guarded by spinlocks. The /dev/rtc0 case is therefore very similar to testing the serial ports with only one HW thread.

The test use cases performed:

There are two IOAPICs in the Lynx Point PCH SKUs. With the kernel command line as is [case 1], the domain used for the first IOAPIC is the IO-APIC domain.
With noapic added [case 2], the first IOAPIC is bypassed, so the mode is legacy XT-PIC. In addition, there is one LAPIC associated with each enabled HW thread. This gives four (4) possible configurations:
[A] i5-4300U with both cores and HYP enabled: four (4) LAPICs in the CPU/core package
[B] HYP disabled: only two LAPICs enabled, one per core
[C] Only one core enabled, with HYP enabled: two LAPICs, one per thread of the enabled core
[D] Only one core enabled, HYP disabled: only one LAPIC enabled in the system

Overall, this gives five (5) groups of results from running the stress-ng tests (cases 2.A-D behave the same and are reported together):
[1.A] Serial interrupts arrive on all HW threads; do_IRQ: messages appear in bulk
[1.B] Serial interrupts arrive on the threads of both cores; do_IRQ: messages appear in bulk
[1.C] Serial interrupts arrive on both threads of the single core; roughly 100x fewer do_IRQ: messages (but they still appear)
[1.D] Serial interrupts arrive on the single enabled HW thread; no do_IRQ: messages (although there are reports of one do_IRQ: message per several hours)
[2.A-D] Since noapic is added to the kernel command line, the IO-APIC domain is bypassed (the IOAPIC in the uncore PCH package is not used) and the legacy XT-PIC domain is used (legacy INT lines from PCH to CPU). The XT-PIC entries are routed from the PCH to only one designated LAPIC in the system. Although all four threads might be present, ONLY one thread shows interrupts. No problems there (no do_IRQ: messages at all during days of testing)!

Additional analyses performed:

I used git log to compare the differences between two tagged kernels: 4.13 and 4.15.
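The comparison between the two tags can be reproduced with something like the sketch below. It is only an illustration, not the exact commands used for this report: the helper names git_log_args and list_commits are hypothetical, the tag names v4.13 and v4.15 assume the usual upstream kernel tags, and the path filter assumes the arch/x86/kernel/apic directory of a kernel checkout.

```python
import subprocess


def git_log_args(old_tag, new_tag, path):
    """Build the git argv that lists commits in old_tag..new_tag touching path."""
    return [
        "git", "log", "--oneline", "--no-merges",
        f"{old_tag}..{new_tag}",   # commits reachable from new_tag but not old_tag
        "--", path,                # restrict the log to one directory
    ]


def list_commits(repo_dir, old_tag="v4.13", new_tag="v4.15",
                 path="arch/x86/kernel/apic"):
    """Run git log inside repo_dir and return one summary line per commit."""
    result = subprocess.run(
        git_log_args(old_tag, new_tag, path),
        cwd=repo_dir, capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines()
```

Calling list_commits("/path/to/linux") on a kernel checkout would return the one-line summaries of the non-merge commits between the two tags that touch that directory, which makes it easier to walk through the patches around a given date.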
The problem patch identified by Holger Schurig is the following:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=464d12309e1b5829597793db551ae8ecaecf4036

I also started analysing all the patches in the time-wise vicinity of that one, in the kernel directory ...arch/x86/kernel/apic, in particular the patches submitted on September 13th, 2017. There are a lot of patches dealing with the IOAPIC's internal timers, for various purposes. A significant reshuffling of the structures and code was also done, moving some code to different functions and files.

Theories which came out of the above testing/analysis:

My best guess is that there is some race condition in the first IOAPIC HW block, at the HW level. The structures rearranged on September 13th, 2017 show that the code architecture is better structured, but this does not necessarily mean that the HW threads execute faster code when handling the IOAPIC. This rearranging revealed/uncovered some HW conditions not seen before (on IVB's Cougar Canyon Point and HSW's Lynx Point).

Solution proposed:

Not yet completely known/proven. Investigation continues.

_______

If you need a better explanation of any part of this report, or of the testing code used, please just ask. I will explain the requested parts in more detail and post charts, the testing code used, and the output from some CLI commands.

Thank you,
Zoran Stojsavljevic