Bug 1607

Summary: oops on boot if ACPI - 2.4 SMP specific - P3/VIA/Award
Product: ACPI Reporter: Diego Calleja (diegocg)
Component: Config-OtherAssignee: Len Brown (lenb)
Status: CLOSED CODE_FIX    
Severity: normal CC: acpi-bugzilla
Priority: P2    
Hardware: i386   
OS: Linux   
Kernel Version: 2.4.23-rc3 Subsystem:
Regression: --- Bisected commit-id:
Attachments: acpidmp output
dmi decode output
dmesg under 2.4 with acpi disabled (CONFIG_ACPI_BOOT enabled, though)
/proc/interrupt output under 2.4 with acpi disabled
dmesg output under 2.6 with acpi disabled
/proc/interrupts output under 2.6 with acpi disabled
dmesg output under 2.6 with acpi enabled
/proc/interrupts under 2.6 with acpi enabled
dmesg-2.6-acpi enabled pci=noacpi
/proc/interrupts 2.6-acpi enabled pci=noacpi
enable verbose ACPI debugging output
verbose dmesg output 1
verbose dmesg output 2
verbose dmesg output - NO hang
/proc/interrupts with the "verbose patch" - No hang
dmesg under 2.4.23 with acpi enabled and Shaohua's patch (http://bugme.osdl.org/attachment.cgi?id=1508)
/proc/interrupts in 2.4.23, acpi enabled and shaohua's patch (http://bugme.osdl.org/attachment.cgi?id=1508)
proposed patch to fix Invalid reference to IRQ 0
patch to enable verbose3 messages for thread tracing
dmesg in 2.4, acpi enabled, attachements 108 + 1570
/proc/interrupts, in 2.4 with acpi enabled, attachements 1508 + 1570
working 2.4.26-pre1 dmesg
/proc/interrupts fomr the 2.4.26-pre1 working kernel

Description Diego Calleja 2003-11-27 11:27:06 UTC
Distribution: Debian Sid
Hardware Environment: Dual P3, d6vaa elitegroup motherboard, 256 mb ram, ide discs

Problem Description:
I have this oops when I enable acpi in 2.4.23-rc3. It started happening at
least in 2.4.22. Sorry for not reporting it before, but usually I don't
use acpi (my bios seems to be screwed so usually I just turn it off) and
when it works alsa stops working (the sound card uses another interrupt
and esd doesn't seem to work very well with whatever changes there're), so
I sent this hoping it can solve problems to other people.

 
PCI: Using configuration type 1.
  tbxface-0117[03] acpi_load_tables	:ACPI tables sucesfully acquired
Parsing all control methods (lots of points)
Table[DSDT] (id F004) - 383 objects with 31 devices 83 methods 29 regions
ACPI Namespace sucesfully loaded at root c0369cfc
IOAPIC[0]: Set PCI routing entry (2-11 -> 0x81 -> IRQ11 Mode: 0 Active: 0)
evxfvent-0093 [04] acpi_enable: Transition to ACPI mode sucesfull
evgpeblk-0748 [06] ev_create_gpe_back: GPE-00 to 15 [_GPE] 2 regs at
0000000000004020 on int 11

ksymoops 2.4.9 on i686 2.6.0-test9-mm5.  Options used
     -V (default)
     -k /proc/ksyms (default)
     -l /proc/modules (default)
     -o /lib/modules/2.6.0-test9-mm5/ (default)
     -m /boot/System.map-2.4.23-rc3 (specified)

Error (regular_file): read_ksyms stat /proc/ksyms failed
No modules in ksyms, skipping objects
No ksyms, skipping lsmod
Unable to handle kernel NULL pointer dereference at virtual address 00000018
  00000018
*pde = 00000000
EIP:     0010:[<00000018>] Not Tainted
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010246
eax: 00000000 ebx: c01053d0 ecx: 00000010 edx: dffe8000
esi: dffe8000 edi: dffe8000 ebp: ffffe000 esp: dffe9fb0
Stack: c0105492 00000002 00000000 00000000 00000000 c011bf45 00000c93 00000282
       ffffffff 34000000 00000001 00000000 00000001 c035c12a 00000246 00000029
       000000ff c011c1bf 00000000 00000400
Call trace: [<c0105492>] [<c011bf45>] [<c011c1bf>]
Code: Bad EIP value


>>EIP; 00000018 Before first symbol   <=====

>>ebx; c01053d0 <default_idle+0/50>

Trace; c0105492 <cpu_idle+52/70>
Trace; c011bf45 <call_console_drivers+65/120>
Trace; c011c1bf <printk+14f/180>

<0>Kernel Panic: Attempted to kill idle task!

1 error issued.  Results may not be reliable.
Comment 1 Diego Calleja 2003-11-27 11:28:28 UTC
Created attachment 1545 [details]
acpidmp output
Comment 2 Diego Calleja 2003-11-27 11:29:10 UTC
Created attachment 1546 [details]
dmi decode output
Comment 3 Diego Calleja 2003-11-27 11:30:12 UTC
Created attachment 1547 [details]
dmesg under 2.4 with acpi disabled (CONFIG_ACPI_BOOT enabled, though)
Comment 4 Diego Calleja 2003-11-27 11:30:57 UTC
Created attachment 1548 [details]
/proc/interrupt output under 2.4 with acpi disabled
Comment 5 Diego Calleja 2003-11-27 11:31:41 UTC
Created attachment 1549 [details]
dmesg output under 2.6 with acpi disabled
Comment 6 Diego Calleja 2003-11-27 11:32:17 UTC
Created attachment 1550 [details]
/proc/interrupts output under 2.6 with acpi disabled
Comment 7 Diego Calleja 2003-11-27 11:33:01 UTC
Created attachment 1551 [details]
dmesg output under 2.6 with acpi enabled
Comment 8 Diego Calleja 2003-11-27 11:33:38 UTC
Created attachment 1552 [details]
/proc/interrupts under 2.6 with acpi enabled
Comment 9 Diego Calleja 2003-11-27 11:36:02 UTC
I forgot to say that the bug still happens if I disable the acpi debug stuff.
Comment 10 Shaohua 2003-11-27 18:55:55 UTC
same problem as Bug 1164, please refer to it
Comment 11 Len Brown 2003-11-29 22:08:06 UTC
> 2.6.0 -- acpi enabled: 
> rscreate-0346 [10] rs_create_pci_routing_: (PRT[1C].source_index) Need Integer, found 
Reference 
> pci_irq-0217 [07] acpi_pci_irq_add_prt  : Error evaluating _PRT [AE_BAD_DATA] 
 
Yes, this BIOS has malformed _PRT entries: 
                Package (0x04) 
                { 
                    0x0007FFFF, 
                    0x00, 
                    0x00, 
                    \_SB.PCI0.LNKA 
                }, 
 
Do 2.4.23 and 2.6.0 boot and run correctly if ACPI is booted with pci=noacpi? 
Current thinking is that enabling that option automatically is the "fix" when we see a BIOS error 
like this. 
 
But that doesn't explain why 2.6 boots and 2.4 takes an oops. 
The oops in bug #1164 was causd by the DEBUG printk, and is different here. 
 
Can you try a clean 2.4 build? 
cp .config /tmp 
make distclean clean 
cp /tmp/.config . 
make oldconfig 
make depend 
make bzImage modules 
 
thanks, 
-Len 
 
Comment 12 Diego Calleja 2003-12-01 10:50:04 UTC
(Sorry for the delay, the monitor broke)

I built a fresh (distclean, etc) 2.4.23, CONFIG_ACPI_DEBUG=y, pci=noacpi. Same
hang ;(

Yes, 2.6 with acpi enabled and pci=noacpi boots and works - attaching dmesg and
/proc/interrupts. Hell, even esd works well with pci=noacpi :)

acpid, however, fails (acpid-1.0.2, debian sid)

estel:/home/diego/wget/pmtools-20010730# acpid 
acpid: can't open /proc/acpi/event: Device or resource busy

strace:
old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) =
0x40161000
set_thread_area({entry_number:-1 -> 6, base_addr:0x401612a0, limit:1048575,
seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0,
useable:1}) = 0
munmap(0x40018000, 64916)               = 0
geteuid32()                             = 0
open("/proc/acpi/event", O_RDONLY)      = -1 EBUSY (Device or resource busy)
write(2, "acpid: can\'t open /proc/acpi/eve"..., 60acpid: can't open
/proc/acpi/event: Device or resource busy
) = 60
exit_group(1)                           = ?


Don't know if it's related, but it happens in 2.4 too.
Comment 13 Diego Calleja 2003-12-01 10:51:07 UTC
Created attachment 1579 [details]
dmesg-2.6-acpi enabled pci=noacpi
Comment 14 Diego Calleja 2003-12-01 10:52:04 UTC
Created attachment 1580 [details]
/proc/interrupts 2.6-acpi enabled pci=noacpi
Comment 15 Len Brown 2003-12-01 14:32:53 UTC
Let me know if I've got this right... 
The primary issue at hand is that 2.4.23 takes an oops at boot and hangs. 
This still occurs with a clean build, with !CONFIG_ACPI_DEBUG, and with pci=noacpi. 
 
It does not occur for 2.4.23 with acpi=off, or acpi=ht (just CONFIG_ACPI_BOOT) 
It does not occur on 2.6.0. 
 
Does the oops look the same with the latest 2.4.23 build? 
Can you capture the failing console with a serial line? 
 
One unusual thing in this system: 
ACPI: INT_SRC_OVR (bus[0] irq[0xb] global_irq[0xb] polarity[0x0] trigger[0x0]) 
IOAPIC[0]: Set PCI routing entry (2-11 -> 0x81 -> IRQ11 Mode: 0 Active: 0) 
 
I'd be surprised if ACPI events (such as power button) register interrupts 
when you do get the system up on 2.6 in ACPI mode. 
Please try the patch to bug #1351 and see if ACPI events start coming through. 
 
Comment 16 Len Brown 2003-12-01 14:36:05 UTC
Created attachment 1586 [details]
enable verbose ACPI debugging output

Please apply this patch to the failing 2.4.23 kernel,
built with CONFIG_ACPI_DEBUG, boot with acpi enabled
and report the last thing that comes out before the hang.

thanks,
-Len
Comment 17 Shaohua 2003-12-01 17:54:52 UTC
I think at least you should try if fixed DSDT can work in your box. (fix _PRT 
table).
Comment 18 Diego Calleja 2003-12-02 12:05:44 UTC
OK, I tested Len's patch
"Created an attachment (id=1586)
enable verbose ACPI debugging output"

Sorry, I don't have serial console ;(
I've caught two different messages (the first one isn't wrong, it seems the oops
got mixed with the messages).

Surprisingly, after booting several times with your patch (to check how
different were the messages), I rebooted once more, and then it _booted_, no
hang (I attach dmesg anc /proc/interrupts, of course the printk buffer was too
small to store the whole boot).
Comment 19 Diego Calleja 2003-12-02 12:07:32 UTC
Created attachment 1596 [details]
verbose dmesg output 1
Comment 20 Diego Calleja 2003-12-02 12:08:17 UTC
Created attachment 1597 [details]
verbose dmesg output 2
Comment 21 Diego Calleja 2003-12-02 12:10:50 UTC
Created attachment 1598 [details]
verbose dmesg output - NO hang
Comment 22 Diego Calleja 2003-12-02 12:13:33 UTC
Created attachment 1599 [details]
/proc/interrupts with the "verbose patch" - No hang
Comment 23 Len Brown 2003-12-02 19:24:13 UTC
Re: the 2.4 oops -- how did you capture this w/o a serial console? 
 
evgpeblk-0296 [08] ev_get_gpe_xrupt_block:---- Entry 
  utalloc-0409 [09] utcallocate          :---- Entry 00000040 
  utalloc-0432 [09] utcallocate          :---- Exit- dffea220 
  utalloc-0677 <1> unable to handle kernel NULL pointer dereference 
----Entry---Enfeatr20dffea220 
CPU:1 
EIP:0010[<00000018>] Not tainted 
EFLAGS:00010246 
        osl-0930 eax:00000000 ebx:c01053d0 ecx:00000010 edx:dffe4000 
        osl-0732 [09] os_acquire_lock   :esi:dffe4000 edi:dffe4000 ebp:ffffe000 esp:dffe5ff0 
[09] os_release_lock:   Process swapper (pid 0, stackpage=dffe5000)[09] os_release_lock    
:00000002 evgpeblk-0349 00000000 [08] os_acquire_lock        :00000282       osl-0742 
ffffffff----Exit- 
     34000000 00000001 ----Exit- 
 
drivers/acpi/events/evgpeblk.c is the same in 2.4 and 2.6 -- that would be too easy;-) 
 
nsaccess-0288 [07] ns_lookup             :---- Entry 
 nsaccess-0310 [07] ns_lookup             :Null scope prefix using root node (c035983c) 
 nsaccess-0375 [07] ns_lookup             :Path absolute from root[c035983c] 
 nsaccess-0486 [07] ns_lookup             :Simple pathname (1 segment, Flags=2) 
   nsdump-0081 [07] ns_print_name         :[_GPE] 
 
The 2nd one simply hanged here? 
 
So every time you repeat the boot it stops in a different place? 
Can you try booting 2.4.23 with "nosmp", and if it works try it a few times 
just to make sure? 
 
I think we could study many issues on this system... 
 
Applying the patch from bug #1351 to fix the ACPI interrupt is moot 
because linux chokes on the _PRT and ignores it -- effectively 
coming up in pci=noacpi mode.  If you're game to trying to get the 
system all the way up in ACPI mode, you'll need also Shaohua's patch 
from bug 1164 to hopefully have the kernel not discard the _PRT. 
 
thanks, 
-Len 
 
Comment 24 Diego Calleja 2003-12-03 10:33:13 UTC
Yes, the second one halted at that point. I can get different messages, too; but
most of them are like the first one (a couple of lines more/less), sometimes
without printk, sometimes with it. Sometimes they're different to #1 and #2.

Yes, nosmp works! 4/4 times I tested it just booted.

I'll try nosmp + Shaoua's patch now. 
Comment 25 Diego Calleja 2003-12-03 11:22:23 UTC
This patch: http://bugme.osdl.org/attachment.cgi?id=1508&action=view + nosmp
boots and works with acpi (dmesg and interrupts attached) but still I can't cat
/proc/acpi/event. Should I try
http://bugme.osdl.org/attachment.cgi?id=1570&action=view too?
Comment 26 Diego Calleja 2003-12-03 11:25:58 UTC
Created attachment 1610 [details]
dmesg under 2.4.23 with acpi enabled and Shaohua's patch (http://bugme.osdl.org/attachment.cgi?id=1508)
Comment 27 Diego Calleja 2003-12-03 11:27:11 UTC
Created attachment 1611 [details]
/proc/interrupts in 2.4.23, acpi enabled and shaohua's patch (http://bugme.osdl.org/attachment.cgi?id=1508)
Comment 28 Shaohua 2003-12-03 16:47:06 UTC
Created attachment 1614 [details]
proposed patch to fix  Invalid reference to IRQ 0

can you try the patch, it should fix the below error:
IOAPIC[0]: Invalid reference to IRQ 0/n<7>00:00:07[A] -> 2-10 -> IRQ 0
Comment 29 Len Brown 2003-12-03 21:50:53 UTC
Lets keep this bug focused on the primary issue -- the oops with the 2.4 ACPI+SMP kernel. 
 
Turns out that "nosmp" also disables IOAPIC initialization -- something that ACPI didn't expect. 
(and yet another bug;-) 
Don't bother with that "IRQ 0" patch, it will be insufficient to properly overcome the lack 
of IOAPIC initialization. 
 
I think it would be a worthwhile exercise to file an additional bug to see if we 
can get the uni-processor kernel running on this box in ACPI+APIC mode. 
To do so you'll want to build with  
 
CONFIG_X86_UP_APIC=y 
CONFIG_X86_UP_IOAPIC=y 
CONFIG_SMP=n 
CONFIG_ACPI=y 
You'll need (at least;-) the bug 1164 patch for the garbled _PRT entry 
and you'll need the bug 1351 patch for the ACPI interrupt in particular. 
 
That said, now to noodle on why 2 processors crash here when 2.6 works... 
 
Comment 30 Len Brown 2003-12-03 23:05:59 UTC
Created attachment 1615 [details]
patch to enable verbose3 messages for thread tracing

Please remove the previous "verbose ACPI debugging output" patch
and apply this one.  Boot it a few times with the SMP kernel and
capture the dmesg -- it may point out a multi-threaded issue.

thanks,
-Len
Comment 31 Diego Calleja 2003-12-10 15:02:53 UTC
Ok
For the comment #29:
I applied http://bugzilla.kernel.org/attachment.cgi?id=1570 and
http://bugzilla.kernel.org/attachment.cgi?id=1508 (i think they're those). It
boots, but still I can't cat /proc/acpi/event. dmesg and /proc/interrupts
attached, as usual
Comment 32 Diego Calleja 2003-12-10 15:04:14 UTC
Created attachment 1656 [details]
dmesg in 2.4, acpi enabled, attachements 108 + 1570
Comment 33 Diego Calleja 2003-12-10 15:05:45 UTC
Created attachment 1657 [details]
/proc/interrupts, in 2.4 with acpi enabled, attachements 1508 + 1570

(previous send needs s/108/1508/
Comment 34 Diego Calleja 2003-12-10 15:25:55 UTC
Now for 30: I applied http://bugzilla.kernel.org/attachment.cgi?id=1615 (without
any other patches, compiled from 0 as usual). I booted 3 times. In the third
attempt it booted (I'll attach the dmesg and /proc/interrupts if needed, but it
shouldn't be different than other already posted)

I've took a photo of the log. I can understand it pretty; I hope you can see it
well too. If not, I'll write down the whole (I don't mind doing it if needed).
I've taken 3 photos of each oops anyway.


pictures are here (i didn't want to put so much rubbish here)
1 oops:
http://www.terra.es/personal/diegocg/oops1-1.jpg
http://www.terra.es/personal/diegocg/oops1-2.jpg
http://www.terra.es/personal/diegocg/oops1-3.jpg

2 oops:
http://www.terra.es/personal/diegocg/oops2-1.jpg
http://www.terra.es/personal/diegocg/oops2-2.jpg
http://www.terra.es/personal/diegocg/oops2-3.jpg
Comment 35 Diego Calleja 2004-03-04 13:17:03 UTC
Well, I just tried 2.4.26-pre1 and it just works beautifully! Whatever you've
done, THANKS, I owe you a beer :)

(I know it has been a long time since my last report, but 2.6 works so well... )

I attach /proc/interrupts and dmesg. Sound still doesn't works properly (both
with alsa and OSS drivers) but I assume that's a different problem.

(just for curiosity, stracing esd i get:
select(6, NULL, [5], NULL, {0, 100})    = 1 (out [5], left {0, 10000})
write(5, "\0\0\0\0\0\0\304b\0\0Oj\0\0\250\17\0\0\214\246\0\0\16\220"..., 4096) =
4096
select(6, NULL, [5], NULL, {0, 100})    = 1 (out [5], left {0, 10000})
write(5, "\0\0)M\0\0002\337\0\0\207\217\0\0\277\247\0\0{\21\0\0\22"..., 4096) = 4096
select(6, NULL, [5], NULL, {0, 100})    = 0 (Timeout)
select(6, NULL, [5], NULL, {0, 100})    = 0 (Timeout)
select(6, NULL, [5], NULL, {0, 100})    = 0 (Timeout)
select(6, NULL, [5], NULL, {0, 100})    = 0 (Timeout)
select(6, NULL, [5], NULL, {0, 100})    = 0 (Timeout)
select(6, NULL, [5], NULL, {0, 100})    = 0 (Timeout)
(period)




Comment 36 Diego Calleja 2004-03-04 13:18:32 UTC
Created attachment 2283 [details]
working 2.4.26-pre1 dmesg
Comment 37 Diego Calleja 2004-03-04 13:19:10 UTC
Created attachment 2284 [details]
/proc/interrupts fomr the 2.4.26-pre1 working kernel
Comment 38 Diego Calleja 2004-03-09 02:36:24 UTC
I guess I'll close this one....