Bug 197769
Summary: | kernel panic IO-APIC + timer (AMD CPU) from 4.13 onwards | ||
---|---|---|---|
Product: | Timers | Reporter: | p_c_chan |
Component: | Interval Timers | Assignee: | timers_interval-timers |
Status: | RESOLVED CODE_FIX | ||
Severity: | high | CC: | ecm4, iissmart, matzes, perdigao1, rvelascog |
Priority: | P1 | ||
Hardware: | x86-64 | ||
OS: | Linux | ||
Kernel Version: | 4.13 onwards | Subsystem: | |
Regression: | Yes | Bisected commit-id: | |
Attachments: |
This the good dmesg from 4.12.14
this is hwinfo, also from 4.12.14
Crash Screenshot 4.13.12
screenshot on 4.19.63 |
Description
p_c_chan
2017-11-03 23:03:44 UTC
Comment on attachment 260503 [details]
This the good dmesg from 4.12.14
In particular the dmesg from 4.12.14 shows
[ 0.015180] ..TIMER: vector=0x30 apic1=0 pin1=0 apic2=-1 pin2=-1
[ 0.016000] ..MP-BIOS bug: 8254 timer not connected to IO-APIC
[ 0.016000] ...trying to set up timer (IRQ0) through the 8259A ...
[ 0.016000] ..... (found apic 0 pin 0) ...
[ 0.018000] ....... failed.
[ 0.018000] ...trying to set up timer as Virtual Wire IRQ...
[ 0.018000] ..... failed.
[ 0.018000] ...trying to set up timer as ExtINT IRQ...
[ 0.028870] ..... works.
For 4.13 or newer, trying to set up timer as ExtINT IRQ fails, giving us the kernel panic.
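The fallback order visible in the log above can be sketched as a small user-space C model. This is only an illustration of the sequence the messages suggest, not the kernel's check_timer() code; the enum, the function name pick_timer_route and the works[] flags are hypothetical names made up for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical labels for the four attempts printed in the dmesg above. */
enum timer_route {
    ROUTE_IOAPIC_PIN = 0,   /* direct IO-APIC pin ("..TIMER: ... pin1=0") */
    ROUTE_VIA_8259A,        /* "(IRQ0) through the 8259A" */
    ROUTE_VIRTUAL_WIRE,     /* "as Virtual Wire IRQ" */
    ROUTE_EXTINT,           /* "as ExtINT IRQ" */
    ROUTE_NONE              /* nothing delivered ticks: the panic case */
};

/*
 * Try each route in the order the log shows and return the first one
 * that works. works[] is an assumption of this model: one flag per
 * route, true if timer ticks arrive when the timer is wired that way.
 */
enum timer_route pick_timer_route(const bool works[ROUTE_NONE])
{
    assert(works != NULL);
    for (int r = ROUTE_IOAPIC_PIN; r < ROUTE_NONE; r++) {
        if (works[r])
            return (enum timer_route)r;
    }
    return ROUTE_NONE;
}
```

For the 4.12.14 log only the last flag would be true, so the model returns ROUTE_EXTINT. On the affected machines with 4.13 or newer all four flags are false, and ROUTE_NONE corresponds to the "IO-APIC + timer doesn't work!" panic.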
I am having the exact same problem as of 4.13.x. 4.12 and before worked fine. I am able to boot with the noapic option, but would prefer not to have to do that. I have also only been running this way for a matter of hours, so I am not sure how stable it will be.

From kernel panic:
...trying to set up timer as ExtINT IRQ...
.....failed :(.
Kernel panic - not syncing: IO-APIC + timer doesn't work! Boot with apic=debug and send a report. Then try booting with the 'noapic' option.
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.13.12-100.fc25.x86_64 #1
Hardware name: HP Pavilion 061 EW172AV-ABA a1530e/NAGAMI2, BIOS 3.11 09/19/2006

CPU Info:
$ cat /proc/cpuinfo
processor : 0
vendor_id : AuthenticAMD
cpu family : 15
model : 47
model name : AMD Athlon(tm) 64 Processor 3500+
stepping : 2
cpu MHz : 1000.000
cache size : 512 KB
physical id : 0
siblings : 1
core id : 0
cpu cores : 1
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt lm 3dnowext 3dnow rep_good nopl cpuid pni lahf_lm 3dnowprefetch vmmcall
bugs : fxsave_leak sysret_ss_attrs null_seg swapgs_fence
bogomips : 2004.03
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp tm stc

Good to know I am not the only one. My desktop is also an HP Pavilion. It is called a1677c.

Created attachment 260707 [details]
this is hwinfo, also from 4.12.14
Hit the panic the same way using 4.14.0. The problem has been there ever since it was introduced in 4.13.0.

Created attachment 260717 [details]
Crash Screenshot 4.13.12
Problem also exists in 4.15-rc3. Wonder when a fix for this would be available.

I am having the same problem. With the previous kernel, the system would give an error message and continue, but now it throws a kernel panic message and hangs. My computer has an Asus M2N4-SLI motherboard. I have tried the 'noapic' option and many variations using nolapic, acpi=off/[others], pci=off/biosirq and other similar options, and sometimes it boots but mouse and keyboard don't work. Reverting to a previous kernel boots ok.

Problem still exists in 4.14.14. There are some apic related commits but they do not fix this issue.

I am still facing this issue as well.... unable to upgrade past version 4.12 on my system due to the bug.

Four months later, still nobody looks at this issue. We just keep adding features and don't even care if it works. :(

5 months after, status is still new!???

Ubuntu live DVDs are using the new kernels and I can't boot those any more.

Status is still new!???

Seems FSC D2461 aka FSC Esprimo P5615 is also affected.

(In reply to p_c_chan from comment #14)
> Ubuntu live DVDs are using the new kernels and I can't boot those any more.
>
> Status is still new!???

Did you manage to identify/contact one of the developers of the code changes that you've isolated?

No, I don't know about any developers that I can contact. It looked like nobody cares. The problem was reported last year and the status is still new.

My desktop got a bad update last weekend, the nvidia module wouldn't load. Then I made a bad move and tried recompiling the kernel. The problem turned out to be that kernel modules would be built with the wrong format. No modules would modprobe after recompilation! I could not even log in with the command line after the reboot. I was running 4.12.14. Hence it is not really the fault of this bug yet.

Then when I decided to reinstall the desktop with Ubuntu, this bug hit me hard. Ubuntu is using 4.13 even for the oldest available 2016 LTS! I burnt a pile of DVDs and none of them could get through this kernel panic.

(Some of the new gcc's probably use pic or something in compiling kernel modules. I could not identify which one and it was getting late. I ended up downgrading the desktop from unstable Debian to stable. Anyway, luckily sshd, sftpd, ethernet and apt still worked.)

I am surprised Ubuntu did not catch this bug. LTS releases are supposed to be stable for everybody. Perhaps we have to wait for Red Hat to catch it. I guess Red Hat should be big enough to come up with a fix or ask for a fix.

Maybe this bug is limited to some older systems with hardware/chipset/BIOS that is no longer commonly used (at least by kernel developers and distribution makers). I think someone who is interested in solving the regression has to do some more analysis and try to identify and contact the developer who made the offending changes. Unfortunately I'm short on time ;-) and at the moment content (to some extent) with one of the latest LTS kernels (4.9.110).

Debian is still using 4.9. That works. Ubuntu is on 4.13 and above already. That fails to boot unless we modify the iso before burning it onto DVD. It was a single big change from upstream. Not sure how I can debug it, as it fails so early in boot time, to find out what and who did the damage.

My desktop is an HP with a standalone nvidia graphics card, not so noname, and it still has enough horses for running Linux for web and YouTube. I actually downgraded it to stable Debian, staying away from the leading edge after what happened last weekend.

Anyway, updating is not always safe. I virtually bricked my old Samsung tab 10.1 last week trying to bring it to 7.1. It failed to load the 7.1 zip. I haven't found a way to put the old image or the official stock image back on from outside, even though CWM is still working. :(

Seems this Debian bug report is at least very similar:

Debian Bug report logs - #883294
linux-image-4.13.0-1-amd64: Kernel panic prevents boot: regression (apic)
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=883294

---

@ p_c_chan: Since you've already isolated the offending commit (did you use git bisect and have you checked it, or is it only a suspicion?), there is a contact for this commit (and APIC):
Thomas Gleixner <tglx@linutronix.de> (maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT))

Looks similar.
"> > - Instead of not using IO-APIC completly, could you try to boot with
> > kernel parameter "no_timer_check" ?"
Did no_timer_check work?
I don't know about git bisect. I tried narrowing it down manually bit by bit, the diff is way too big.
(In reply to p_c_chan from comment #21)
> Looks similar.
>
> "> > - Instead of not using IO-APIC completly, could you try to boot with
> > > kernel parameter "no_timer_check" ?"
>
> Did no_timer_check work?

No, it does not work for me (4.15), but perhaps that is due to other bugs, who knows.

> I don't know about git bisect. I tried narrowing it down manually bit by
> bit, the diff is way too big.

So how did you find out that this commit is the one? Did you check out one kernel with the suspicious commit and one before that commit, compile them both, and can you prove that one is working and the other is not, so that this is the offending commit?

The very last 4.12.14 is ok. The very first 4.13-rc I found failed. I think I had tried rc1.

(In reply to p_c_chan from comment #23)
> The very last 4.12.14 is ok. The very first 4.13-rc I found failed. I think
> I had tried rc1.

For me (on FSC Esprimo P5615):

The last working mainline kernel:
commit 1b044f1cfc65a7d90b209dfabd57e16d98b58c5b
Date: Mon Jul 3 16:14:51 2017 -0700
Merge branch 'timers-core-for-linus'

The first one with the bug (the commit p_c_chan reported for this bug-report):
commit 03ffbcdd7898c0b5299efeb9f18de927487ec1cf
Date: Mon Jul 3 16:50:31 2017 -0700
Merge branch 'irq-core-for-linus'

Interesting info. I am not very good at spotting changes in git submissions. Maybe someone is. Looking through the latest kernel and searching for "MP-BIOS", I found it in the following file, which suggests it could be the culprit. I remember from the past that this is surely related to the apic, consistent with the filename.

linux-4.17.5\arch\x86\kernel\apic\io_apic.c
Line 2153:
    if (!no_pin1)
        apic_printk(APIC_QUIET, KERN_ERR "..MP-BIOS bug: "
                    "8254 timer not connected to IO-APIC\n");

Just tried 4.17.6. Still fails, i.e. isn't fixed.

(In reply to p_c_chan from comment #26)
> Just tried 4.17.6. Still fails, i.e. isn't fixed.

Same is true for FSC Esprimo P5615.

I tried using the no_timer_check kernel option today on my system and it didn't work; it hangs early in the boot process with:

[ 0.002000] Spectre V2 : Spectre v2 mitigation: Filling RSB on context switch

So I had to go back to the noapic option. This is on Fedora 27 with 4.17.7.

Failed in 4.20-rc3 too.

Failed in 5.1-rc5 too.

I'm guessing this only affects a small subset of machines, but is there any way to get this bug corrected??

Ah, it has been almost 2 years since I raised this issue in another report before raising this for AMD. The problem still exists in any kernel after 4.12. The report still has the new status. :( It looks like bugzilla is dead.

I spent some time last night adding printk's to timer_irq_works() and compiling kernels. It showed that we really do not have any increase of jiffies from ExtINT in 4.19.63. In this new 4.19 it complained something about a missing vector when using ExtINT, right before it fails. That's something new and may point us to the bug. I'll dig into the setup for using ExtINT. In comparison, we receive the expected (within +/- 2) jiffies from ExtINT in 4.9.186. Something is broken going across 4.13.

Created attachment 284079 [details]
screenshot on 4.19.63
From 4.19.63. Added printk to show jiffies.
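The jiffies check described above can be pictured with a small user-space C sketch. It is not the kernel's timer_irq_works(); the function name timer_ticks_ok, the MIN_TICKS threshold of 4 and the sample values are assumptions chosen for illustration. The idea is simply that the tick counter must have advanced by more than a few ticks across a short wait, compared in a way that survives counter wrap-around.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed threshold for this sketch: more than 4 ticks must be seen. */
#define MIN_TICKS 4UL

/*
 * Return true if the tick counter advanced by more than MIN_TICKS
 * between the two samples. The subtraction is done on unsigned values
 * and the result reinterpreted as signed, so the comparison still
 * works when the counter wraps around between the samples.
 */
bool timer_ticks_ok(unsigned long jiffies_before, unsigned long jiffies_after)
{
    return (long)((jiffies_before + MIN_TICKS) - jiffies_after) < 0;
}

/* Illustrative values only; the thread does not give exact counts. */
void timer_check_examples(void)
{
    /* No increase at all, as reported for ExtINT on 4.19.63. */
    assert(!timer_ticks_ok(1000, 1000));
    /* Roughly ten ticks seen (assumed count), the working case. */
    assert(timer_ticks_ok(1000, 1010));
}
```

A counter that never moves fails the check regardless of how the timer is routed, which matches the observation that ExtINT produced no jiffies increase on 4.19.63.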
That error was coming from irq.c. Not sure why yet. Rework of vector management?

Any chance of having someone looking into this before the problem turns 2 years old?

Hello,

I am trying to use an old box with a new Linux. My MB is an Asus m2n4-sli with an AMD Athlon(tm) 64 X2 Dual Core Processor 4200+ and an Asus Nvidia GeForce 7600 GT GPU. Pretty old, but this box was working very well with older distros. Now I am trying to load a new Linux via USB or CD and nothing has worked. I've tried the newest stable Debian, Ubuntu, Lubuntu, Kali, OpenSuse, Fedora and Centos. All failed. I got a black screen or at most the error message:

8254 timer not connected to IO-APIC

My already installed Debian is working on this box, but I can't load another distro now. I guess this is a problem of old BIOS and hardware with a new kernel. I've seen this problem on several Linux sites and forums with no useful solutions. Do you have any idea about what could be done to solve this issue?

Thanks a bunch
Rod

Unfortunately this bug just sits here forever. I have to downgrade the kernel to 4.9.x, the highest stream still receiving updates.

On Mon, Sep 30 2019 at 22:06, bugzilla-daemon wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=197769
>
> --- Comment #36 from p_c_chan@hotmail.com ---
> Any chance of having someone looking into this before the problem turns 2
> years old?

Staring at it right now...

On Mon, Sep 21 2020 at 08:30, bugzilla-daemon wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=197769
>
> --- Comment #39 from Thomas Gleixner (tglx@linutronix.de) ---
> On Mon, Sep 30 2019 at 22:06, bugzilla-daemon wrote:
>> https://bugzilla.kernel.org/show_bug.cgi?id=197769
>>
>> --- Comment #36 from p_c_chan@hotmail.com ---
>> Any chance of having someone looking into this before the problem turns 2
>> years old?
>
> Staring at it right now...

can anyone please test the patch below?

Thanks,

tglx
---
--- a/arch/x86/kernel/apic/io_apic.c
+++ b/arch/x86/kernel/apic/io_apic.c
@@ -2243,6 +2243,7 @@ static inline void __init check_timer(vo
         legacy_pic->init(0);
         legacy_pic->make_irq(0);
         apic_write(APIC_LVT0, APIC_DM_EXTINT);
+        legacy_pic->unmask(0);

         unlock_ExtINT_logic();

Created attachment 292553 [details]
attachment-22006-0.html

Ok, I will check.

(In reply to Thomas Gleixner from comment #40)
> On Mon, Sep 21 2020 at 08:30, bugzilla-daemon wrote:
> > https://bugzilla.kernel.org/show_bug.cgi?id=197769
> >
> > --- Comment #39 from Thomas Gleixner (tglx@linutronix.de) ---
> > On Mon, Sep 30 2019 at 22:06, bugzilla-daemon wrote:
> >> https://bugzilla.kernel.org/show_bug.cgi?id=197769
> >>
> >> --- Comment #36 from p_c_chan@hotmail.com ---
> >> Any chance of having someone looking into this before the problem turns 2
> >> years old?
> >
> > Staring at it right now...
>
> can anyone please test the patch below?
>
> Thanks,
>
> tglx
> ---
> --- a/arch/x86/kernel/apic/io_apic.c
> +++ b/arch/x86/kernel/apic/io_apic.c
> @@ -2243,6 +2243,7 @@ static inline void __init check_timer(vo
>         legacy_pic->init(0);
>         legacy_pic->make_irq(0);
>         apic_write(APIC_LVT0, APIC_DM_EXTINT);
> +        legacy_pic->unmask(0);
>
>         unlock_ExtINT_logic();

TEST:

On FSC D2461 aka FSC Esprimo P5615
Tested with your patch (not yet tested w/o patch for this kernel version):

[ 0.000000] Linux version 5.9.0-rc6-custom
[ 0.000000] DMI: FUJITSU SIEMENS ESPRIMO P /D2461-A2, BIOS 6.00 R1.15.2461.A2 10/22/2007
[ 0.280852] APIC: Switch to symmetric I/O mode setup
[ 0.281533] ..TIMER: vector=0x30 apic1=0 pin1=0 apic2=-1 pin2=-1
[ 0.334662] ..MP-BIOS bug: 8254 timer not connected to IO-APIC
[ 0.334665] ...trying to set up timer (IRQ0) through the 8259A ...
[ 0.334668] ..... (found apic 0 pin 0) ...
[ 0.387795] ....... failed.
[ 0.387797] ...trying to set up timer as Virtual Wire IRQ...
[ 0.440896] ..... failed.
[ 0.440897] ...trying to set up timer as ExtINT IRQ...
[ 0.656870] ..... works.

On Tue, Sep 22 2020 at 08:46, bugzilla-daemon wrote:
> TEST:
>
> On FSC D2461 aka FSC Esprimo P5615
> Tested with your patch (not yet tested w/o patch for this kernel
> version):

It should be the same problem with an unmodified 5.9-rc kernel, but it would be nice if you could confirm.

Thanks,

tglx

(In reply to Thomas Gleixner from comment #43)
> On Tue, Sep 22 2020 at 08:46, bugzilla-daemon wrote:
> > TEST:
> >
> > On FSC D2461 aka FSC Esprimo P5615
> > Tested with your patch (not yet tested w/o patch for this kernel
> > version):
>
> It should be the same problem with an unmodified 5.9-rc kernel, but it
> would be nice if you could confirm.
>
> Thanks,
>
> tglx

Not sure if you got me right. So to make it more clear:

[ 0.656870] ..... works.

The test with the patched 5.9-rc6 kernel was successful for me (despite the "failed" messages, which were there before our problem occurred; see "This the good dmesg from 4.12.14" in the post from 2017-11-03 23:08:05 UTC).

-> With patch: No kernel panic, no need for noapic boot parameter and system still running. Just had to confirm that the problem without the patch is still there.

On Tue, Sep 22 2020 at 12:09, bugzilla-daemon wrote:
> Not sure if you got me right. So to make it more clear:

I did.

> -> With patch: No kernel panic, no need for noapic boot parameter and system
> still running. Just had to confirm that the problem without the patch is
> still there.

That's what I was asking for:

>> It should be the same problem with an unmodified 5.9-rc kernel, but it
>> would be nice if you could confirm.

unmodified == not patched

>
> >> It should be the same problem with an unmodified 5.9-rc kernel, but it
> >> would be nice if you could confirm.
>
> unmodified == not patched
TEST:
On FSC D2461 aka FSC Esprimo P5615
Tested without your patch =
TEST of unmodified 5.9-rc kernel for confirmation:
Only boots with "noapic"
[ 0.000000] Linux version 5.9.0-rc6-unpatched ...
[ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-5.9.0-rc6-unpatched ro noapic ...
[ 0.000000] DMI: FUJITSU SIEMENS ESPRIMO P /D2461-A2, BIOS 6.00 R1.15.2461.A2 10/22/2007
[ 0.277636] APIC: Switch to symmetric I/O mode setup
[ 0.277639] Not enabling interrupt remapping due to skipped IO-APIC setup
Thanks for your patch, good job!
Patched 5.9.0-RC6, it did work finally. Thank you very much. My HP boots OK, but I can't build nvidia-kernel-dkms. Dkms from Debian stable is likely too old for 5.9. I will probably settle on 4.19 or 5.4 for the long term.

Good: patched 4.19.146. It works. Good job. Please commit the patch to the longterm releases as well. Thanks.

4.19.149 is showing up with the fix. Works fine. Thank you. |
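For readers wondering why a single legacy_pic->unmask(0) line could matter, here is a rough user-space C model of an 8259A-style interrupt mask register. It is a sketch under assumptions, not kernel code: struct pic_model and its helper functions are hypothetical, and starting with every line masked is an assumption of the model rather than something stated in this thread. In such a mask register a set bit blocks the corresponding IRQ line, so a timer on IRQ0 that stays masked would never produce a tick.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one 8259A: bit n set in imr means IRQ n is masked. */
struct pic_model {
    uint8_t imr;
};

/* Assumption of this model: every line starts out masked. */
void pic_model_init(struct pic_model *pic)
{
    pic->imr = 0xff;
}

/* Clear the mask bit so the given IRQ line can be delivered. */
void pic_model_unmask(struct pic_model *pic, unsigned int irq)
{
    assert(irq < 8);
    pic->imr &= (uint8_t)~(1u << irq);
}

/* Count how many of `pulses` timer pulses on IRQ0 would get through. */
unsigned long pic_model_timer_ticks(const struct pic_model *pic,
                                    unsigned long pulses)
{
    bool irq0_masked = (pic->imr & 1u) != 0;
    return irq0_masked ? 0 : pulses;
}
```

In this model, ten timer pulses yield zero ticks until pic_model_unmask(&pic, 0) is called, after which all ten arrive. That mirrors the before and after behaviour testers reported with the patch, though the real cause inside the kernel may involve more than this sketch shows.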