Latest working kernel version: v2.6.20 Earliest failing kernel version: v.2.6.22 Distribution: kernel.org, Ubuntu Hardware Environment: Dell PowerEdge 6300 with PERC 2 RAID (Adaptec) controller Software Environment: kernel Problem Description: Linux fails to boot because aacraid fails and no file system available. Steps to reproduce: Boot server with kernel later than v2.6.20 Dell PERC 2 RAID controller, latest firmware (2.8.0 build 6099) with 6 disks - 5x RAID-5, 1x spare. Logs being captured using a serial console connection. A *good* start with v2.6.20 reports: [ 6.681614] Adaptec aacraid driver (1.1-5[2423]-mh3) [ 6.686794] ACPI: PCI Interrupt 0000:03:03.0[A] -> GSI 18 (level, low) -> IRQ 17 [ 6.695162] FDC 0 is a National Semiconductor PC87306 [ 6.724207] AAC0: kernel 2.8-0[6089] [ 6.727976] AAC0: monitor 2.8-0[6089] [ 6.731702] AAC0: bios 2.8-0[6089] [ 6.735174] AAC0: serial 8a0376 [ 6.738794] scsi0 : percraid [ 6.742287] ACPI: PCI Interrupt 0000:02:04.0[A] -> <3>hub 1-0:1.0: over-current change on port 1 [ 6.742810] scsi 0:0:0:0: Direct-Access DELL Array1 V1.0 PQ: 0 ANSI: 2 [ 6.751893] scsi 0:0:1:0: Direct-Access DELL Archive V1.0 PQ: 0 ANSI: 2 A *bad* start with v2.6.22+ reports: [ 152.474463] BUG: soft lockup - CPU#0 stuck for 61s! [modprobe:2096] [ 152.474463] [ 152.474463] Pid: 2096, comm: modprobe Not tainted (2.6.25-rc8-custom #1) [ 152.474463] EIP: 0060:[<c0209db0>] EFLAGS: 00000293 CPU: 0 [ 152.474463] EIP is at native_read_tsc+0x0/0x10 [ 152.474463] EAX: 00000474 EBX: b8fd8e27 ECX: 02a52000 EDX: 0000004a [ 152.474463] ESI: 00000aac EDI: 0142f9cb EBP: f54dda84 ESP: f7c5dd1c [ 152.474463] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 [ 152.474463] CR0: 8005003b CR2: 080f91cf CR3: 37a60000 CR4: 000006d0 [ 152.474463] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000 [ 152.474463] DR6: ffff0ff0 DR7: 00000400 [ 152.474463] [<c0305067>] ? delay_tsc+0x17/0x20 [ 152.474463] [<c0305016>] ? __delay+0x6/0x10 [ 152.474463] [<f8a5aa40>] ? aac_fib_send+0x220/0x2d0 [aacraid] [ 152.474463] [<f8a569c4>] ? aac_get_adapter_info+0x74/0x680 [aacraid] [ 152.474463] [<c021937b>] ? __resched_task+0x5b/0x70 [ 152.474463] [<c021ccda>] ? try_to_wake_up+0x6a/0x100 [ 152.474463] [<f8a5d55a>] ? aac_probe_one+0x23a/0x4a4 [aacraid] [ 152.474463] [<f8a5af50>] ? aac_command_thread+0x0/0x6d0 [aacraid] [ 152.474463] [<c0310146>] ? pci_device_probe+0x56/0x80 [ 152.474463] [<c0367948>] ? driver_probe_device+0x88/0x170 [ 152.474463] [<c0367b9e>] ? __driver_attach+0x9e/0xa0 [ 152.474463] [<c0366cea>] ? bus_for_each_dev+0x3a/0x60 [ 152.474463] [<c03100f0>] ? pci_device_probe+0x0/0x80 [ 152.474463] [<c03677c6>] ? driver_attach+0x16/0x20 [ 152.474463] [<c0367b00>] ? __driver_attach+0x0/0xa0 [ 152.474463] [<c0367674>] ? bus_add_driver+0x1a4/0x210 [ 152.474463] [<c0310090>] ? pci_device_remove+0x0/0x40 [ 152.474463] [<c03100f0>] ? pci_device_probe+0x0/0x80 [ 152.474463] [<c0367d3b>] ? driver_register+0x3b/0xf0 [ 152.474463] [<c040b744>] ? _spin_unlock_irqrestore+0x4/0x10 [ 152.474463] [<c031034d>] ? __pci_register_driver+0x3d/0x80 [ 152.474463] [<f890a033>] ? aac_init+0x33/0x74 [aacraid] [ 152.474463] [<c024696e>] ? sys_init_module+0x13e/0x1c40 [ 152.474463] [<c040d37f>] ? do_page_fault+0x13f/0x670 [ 152.474463] [<c02294ec>] ? irq_exit+0x3c/0x70 [ 152.474463] [<c0204d76>] ? syscall_call+0x7/0xb [ 152.474463] ======================= v2.6.20 runs stable. v2.6.22+ all fail in the same way. There are also "nobody cared" IRQ faults: [ 17.155571] irq 10: nobody cared (try booting with the "irqpoll" option) [ 17.155571] Pid: 0, comm: swapper Not tainted 2.6.25-rc8-custom #1 [ 17.155571] [<c025ad74>] __report_bad_irq+0x24/0x80 [ 17.155571] [<c0219e27>] __update_rq_clock+0x27/0x180 [ 17.155571] [<c025b040>] note_interrupt+0x270/0x2b0 [ 17.155571] [<c023c8c1>] getnstimeofday+0x31/0xc0 [ 17.155571] [<c025a2a5>] handle_IRQ_event+0x25/0x50 [ 17.155571] [<c025b9dd>] handle_fasteoi_irq+0xad/0xe0 [ 17.155571] [<c02071dd>] do_IRQ+0x3d/0x80 [ 17.155571] [<c020571f>] common_interrupt+0x23/0x28 [ 17.155571] [<c02300d8>] sys_rt_sigsuspend+0xc8/0xd0 [ 17.155571] [<c02039c2>] default_idle+0x52/0x80 [ 17.155571] [<c0203970>] default_idle+0x0/0x80 [ 17.155571] [<c020380d>] cpu_idle+0x5d/0xe0 [ 17.155571] ======================= [ 17.155571] handlers: [ 17.155571] [<f88cc180>] (ahc_linux_isr+0x0/0x250 [aic7xxx]) [ 17.155571] Disabling IRQ #10 I'm not sure if these lead to the aacraid failure or the two are unrelated. In a *bad* boot log I see these but I'm not sure if they are related to the error reports later: [ 0.910906] ACPI: PCI Root Bridge [PX0B] (0000:02) [ 0.912085] ACPI: Bus 0000:02 not present in PCI namespace [ 0.917111] ACPI: PCI Root Bridge [PX1A] (0000:03) [ 0.920085] ACPI: Bus 0000:03 not present in PCI namespace I'm trying to determine if those Bus 0000:02/03 references are the same as the lspci device addresses 02:* and 03:* (see later) because if they are it would show these two reports might be the root cause of the entire problem. System configuration: The PERC/2 controller is: 03:03.0 RAID bus controller [0104]: Digital Equipment Corporation DECchip 21554 [1011:0046] (rev 01) $ uname -a Linux PowerEdge6300 2.6.20-15-generic #2 SMP Sun Apr 15 07:36:31 UTC 2007 i686 GNU/Linux $ modinfo aacraid filename: /lib/modules/2.6.20-15-generic/kernel/drivers/scsi/aacraid/aacraid.ko version: 1.1-5[2423]-mh3 license: GPL description: Dell PERC2, 2/Si, 3/Si, 3/Di, Adaptec Advanced Raid Products, HP NetRAID-4M, IBM ServeRAID & ICP SCSI driver author: Red Hat Inc and Adaptec srcversion: 9F4AEF75C12F7128F830FA2 depends: scsi_mod vermagic: 2.6.20-15-generic SMP mod_unload 586 $ lspci -nnn 00:02.0 ISA bridge [0601]: Intel Corporation 82371AB/EB/MB PIIX4 ISA [8086:7110] (rev 02) 00:02.1 IDE interface [0101]: Intel Corporation 82371AB/EB/MB PIIX4 IDE [8086:7111] (rev 01) 00:02.2 USB Controller [0c03]: Intel Corporation 82371AB/EB/MB PIIX4 USB [8086:7112] (rev 01) 00:02.3 Bridge [0680]: Intel Corporation 82371AB/EB/MB PIIX4 ACPI [8086:7113] (rev 02) 00:04.0 VGA compatible controller [0300]: ATI Technologies Inc 3D Rage Pro [1002:4749] (rev 5c) 00:08.0 SCSI storage controller [0100]: Adaptec AHA-2940U2/U2W [9005:0010] 00:0a.0 PCI bridge [0604]: Intel Corporation 21154 PCI-to-PCI Bridge [8086:b154] 00:10.0 Host bridge [0600]: Intel Corporation 450NX - 82451NX Memory & I/O Controller [8086:84ca] (rev 03) 00:12.0 Host bridge [0600]: Intel Corporation 450NX - 82454NX/84460GX PCI Expander Bridge [8086:84cb] (rev 04) 00:13.0 Host bridge [0600]: Intel Corporation 450NX - 82454NX/84460GX PCI Expander Bridge [8086:84cb] (rev 04) 00:14.0 Host bridge [0600]: Intel Corporation 450NX - 82454NX/84460GX PCI Expander Bridge [8086:84cb] (rev 04) 01:04.0 Ethernet controller [0200]: Intel Corporation 82557/8/9 [Ethernet Pro 100] [8086:1229] (rev 0d) 01:05.0 Ethernet controller [0200]: Intel Corporation 82557/8/9 [Ethernet Pro 100] [8086:1229] (rev 0d) 02:04.0 SCSI storage controller [0100]: Adaptec AHA-2940U2/U2W / 7890/7891 [9005:001f] 02:06.0 SCSI storage controller [0100]: Adaptec AHA-2940U2/U2W / 7890/7891 [9005:001f] 02:08.0 SCSI storage controller [0100]: Adaptec AIC-7860 [9004:6078] (rev 03) 03:03.0 RAID bus controller [0104]: Digital Equipment Corporation DECchip 21554 [1011:0046] (rev 01) $ lsmod | grep aac aacraid 59652 2 scsi_mod 142348 8 st,sr_mod,sg,sd_mod,aacraid,aic7xxx,scsi_transport_spi,libata $ grep -i aac /var/log/kern.log Apr 3 18:07:41 PowerEdge6300 kernel: [ 6.394845] Adaptec aacraid driver (1.1-5[2423]-mh3) Apr 3 18:07:41 PowerEdge6300 kernel: [ 51.623757] AAC0: kernel 2.8-0[6089] Apr 3 18:07:41 PowerEdge6300 kernel: [ 51.623770] AAC0: monitor 2.8-0[6089] Apr 3 18:07:41 PowerEdge6300 kernel: [ 51.623779] AAC0: bios 2.8-0[6089] Apr 3 18:07:41 PowerEdge6300 kernel: [ 51.623787] AAC0: serial 8a0376 $ egrep -i 'scsi3|3:0:' /var/log/kern.log Apr 3 18:07:41 PowerEdge6300 kernel: [ 51.624202] scsi3 : percraid Apr 3 18:07:41 PowerEdge6300 kernel: [ 51.624823] scsi 3:0:0:0: Direct-Access DELL Array1 V1.0 PQ: 0 ANSI: 2 Apr 3 18:07:41 PowerEdge6300 kernel: [ 51.625185] scsi 3:0:1:0: Direct-Access DELL Archive V1.0 PQ: 0 ANSI: 2 Apr 3 18:07:41 PowerEdge6300 kernel: [ 66.973120] sd 3:0:0:0: Attached scsi removable disk sda Apr 3 18:07:41 PowerEdge6300 kernel: [ 66.974231] sd 3:0:1:0: Attached scsi removable disk sdb Apr 3 18:07:41 PowerEdge6300 kernel: [ 66.997669] sd 3:0:0:0: Attached scsi generic sg1 type 0 Apr 3 18:07:41 PowerEdge6300 kernel: [ 66.998217] sd 3:0:1:0: Attached scsi generic sg2 type 0 Apr 3 18:07:41 PowerEdge6300 kernel: [ 67.016451] sr0: scsi3-mmc drive: 17x/40x cd/rw xa/form2 cdda tray $ git-rev-list --pretty=oneline --reverse v2.6.20..v2.6.22 -- drivers/scsi/aacraid shows 32 commits between good and bad versions that affect aacraid. I've begun a bisect/test cycle but it will require 15 tests and the build time is very long. If the issue is outside aacraid then it'd take weeks to follow the bisect/test cycle for all commits between v2.6.20 and v2.6.22. If the issue is ACPI related $ git-rev-list --pretty=oneline --reverse v2.6.20..v2.6.22 -- drivers/acpi/pci_root.c shows 7 commits and $ git-rev-list --pretty=oneline --reverse v2.6.20..v2.6.22 -- drivers/acpi shows 277 commits. Related is bug #9133. I've tried all the suggestions in that with no difference in the observed problem. I've tried boot options noapic noacpi irqpoll and the various aacraid.* and scsi_mod.scan=sync. Related Ubuntu report is bug #149071 which might have a different cause although I began reporting there as it seemed remarkably close. I may open another Ubuntu bug report to run mirror this one as the cause seems different. https://bugs.launchpad.net/ubuntu/+source/linux/+bug/149071
Created attachment 15617 [details] v2.6.20 successful boot log
Created attachment 15618 [details] v.2.6.25-rc8 failed boot log
I found a mailing list report of a similar problem where ACPI was the cause. "kernel-2.6.23.1-30.fc8 problem with symbios scsi controllers" http://www.redhat.com/archives/rhl-devel-list/2007-October/msg02035.html That caused me try the boot option "pci=noacpi" and the kernel started. There are numerous other issues as a result so it doesn't solve the regression since v2.6.20 but it does narrow down the hunt for the culprit. In part the v2.6.25-rc8 boot log reports: [ 16.207231] Adaptec aacraid driver 1.1-5[2455]-ms [ 47.074484] AAC0: kernel 2.8-1[6099] [ 47.074498] AAC0: monitor 2.8-1[6099] [ 47.078494] AAC0: bios 2.8-1[6099] [ 47.082496] AAC0: serial 8A0376 [ 47.086509] scsi2 : percraid [ 47.091184] scsi 2:0:0:0: Direct-Access DELL Array1 V1.0 PQ: 0 ANSI: 2 [ 46.643256] e100: eth0: e100_probe: addr 0xfe5ff000, irq 11, MAC addr 00:02:b3:bb:d7:8c [ 47.095591] scsi 2:0:1:0: Direct-Access DELL Archive V1.0 PQ: 0 ANSI: 2 [ 46.670604] e100: eth1: e100_probe: addr 0xfe5fe000, irq 14, MAC addr 00:02:b3:bb:d7:8d [ 47.113160] scsi 2:1:0:0: Direct-Access IBM DMVS09M 0220 PQ: 0 ANSI: 3 [ 47.117449] scsi 2:1:1:0: Direct-Access IBM DMVS09M 0220 PQ: 0 ANSI: 3 [ 47.122638] scsi 2:1:2:0: Direct-Access IBM DMVS09M 0220 PQ: 0 ANSI: 3 [ 47.127812] scsi 2:1:3:0: Direct-Access IBM DMVS09M 0220 PQ: 0 ANSI: 3 [ 47.133005] scsi 2:1:4:0: Direct-Access IBM DMVS09M 0220 PQ: 0 ANSI: 3 [ 47.145724] scsi 2:1:5:0: Direct-Access HITACHI DK32DJ-18MC D4D4 PQ: 0 ANSI: 3 [ 46.749850] ata_piix 0000:00:10.0: no available native port [ 47.181601] scsi 2:1:6:0: Processor DELL 1x6 U2W SCSI BP 5.39 PQ: 0 ANSI: 2 [ 46.693844] PCI: Enabling device 0000:00:02.2 (0000 -> 0001) [ 46.693882] PCI: No IRQ known for interrupt pin D of device 0000:00:02.2. Probably buggy MP table.
Created attachment 15619 [details] v.2.6.25-rc8 pci=noacpi (successful) boot log
I'm attempting to re-assign this bug to ACPI > Config-Interrupts
v2.6.25-rc8 with "pci=noacpi" Output of Perl script dump_pirq $ wget http://linux.dell.com/files/tools/dump_pirq $ chmod 755 dump_pirq $ sudo ./dump_pirq Interrupt routing table found at address 0xfc7a0: Version 1.0, size 0x00d0 Interrupt router is device 00:02.0 PCI exclusive interrupt mask: 0x0000 [] Compatible router: vendor 0x8086 device 0x122e Device 00:02.0 (slot 0): INTA: link 0x64, irq mask 0x4000 [14] INTB: link 0x65, irq mask 0x8000 [15] INTD: link 0x63, irq mask 0x5cf8 [3,4,5,6,7,10,11,12,14] Device 02:04.0 (slot 0): INTA: link 0x65, irq mask 0x5cf8 [3,4,5,6,7,10,11,12,14] Device 02:06.0 (slot 0): INTA: link 0x66, irq mask 0x5cf8 [3,4,5,6,7,10,11,12,14] Device 02:08.0 (slot 0): INTA: link 0x64, irq mask 0x5cf8 [3,4,5,6,7,10,11,12,14] Device 00:06.0 (slot 1): INTA: link 0x69, irq mask 0x5cf8 [3,4,5,6,7,10,11,12,14] INTB: link 0x64, irq mask 0x5cf8 [3,4,5,6,7,10,11,12,14] INTC: link 0x67, irq mask 0x5cf8 [3,4,5,6,7,10,11,12,14] INTD: link 0x68, irq mask 0x5cf8 [3,4,5,6,7,10,11,12,14] Device 00:08.0 (slot 2): INTA: link 0x68, irq mask 0x5cf8 [3,4,5,6,7,10,11,12,14] INTB: link 0x69, irq mask 0x5cf8 [3,4,5,6,7,10,11,12,14] INTC: link 0x64, irq mask 0x5cf8 [3,4,5,6,7,10,11,12,14] INTD: link 0x67, irq mask 0x5cf8 [3,4,5,6,7,10,11,12,14] Device 00:0a.0 (slot 3): INTA: link 0x67, irq mask 0x5cf8 [3,4,5,6,7,10,11,12,14] INTB: link 0x68, irq mask 0x5cf8 [3,4,5,6,7,10,11,12,14] INTC: link 0x69, irq mask 0x5cf8 [3,4,5,6,7,10,11,12,14] INTD: link 0x64, irq mask 0x5cf8 [3,4,5,6,7,10,11,12,14] Device 03:01.0 (slot 4): INTA: link 0x63, irq mask 0x5cf8 [3,4,5,6,7,10,11,12,14] INTB: link 0x60, irq mask 0x5cf8 [3,4,5,6,7,10,11,12,14] INTC: link 0x61, irq mask 0x5cf8 [3,4,5,6,7,10,11,12,14] INTD: link 0x62, irq mask 0x5cf8 [3,4,5,6,7,10,11,12,14] Device 03:03.0 (slot 5): INTA: link 0x62, irq mask 0x5cf8 [3,4,5,6,7,10,11,12,14] INTB: link 0x63, irq mask 0x5cf8 [3,4,5,6,7,10,11,12,14] INTC: link 0x60, irq mask 0x5cf8 [3,4,5,6,7,10,11,12,14] INTD: link 0x61, irq mask 0x5cf8 [3,4,5,6,7,10,11,12,14] Device 03:05.0 (slot 6): INTA: link 0x61, irq mask 0x5cf8 [3,4,5,6,7,10,11,12,14] INTB: link 0x62, irq mask 0x5cf8 [3,4,5,6,7,10,11,12,14] INTC: link 0x63, irq mask 0x5cf8 [3,4,5,6,7,10,11,12,14] INTD: link 0x60, irq mask 0x5cf8 [3,4,5,6,7,10,11,12,14] Device 03:07.0 (slot 7): INTA: link 0x60, irq mask 0x5cf8 [3,4,5,6,7,10,11,12,14] INTB: link 0x61, irq mask 0x5cf8 [3,4,5,6,7,10,11,12,14] INTC: link 0x62, irq mask 0x5cf8 [3,4,5,6,7,10,11,12,14] INTD: link 0x63, irq mask 0x5cf8 [3,4,5,6,7,10,11,12,14] Interrupt router at 00:02.0: Intel 82371AB PIIX4/PIIX4E PCI-to-ISA bridge PIRQ1 (link 0x60): unrouted PIRQ2 (link 0x61): unrouted PIRQ3 (link 0x62): irq 5 PIRQ4 (link 0x63): unrouted Serial IRQ: [disabled] [quiet] [frame=21] [pulse=4] Level mask: 0x4e20 [5,9,10,11,14] $ cat /proc/interrupts CPU0 CPU1 CPU2 CPU3 0: 178 0 0 0 IO-APIC-edge timer 1: 33 0 2 1 IO-APIC-edge i8042 2: 0 0 0 0 XT-PIC-XT cascade 4: 235 0 1 0 IO-APIC-edge serial 6: 2 1 0 0 IO-APIC-edge floppy 7: 0 0 0 0 IO-APIC-edge parport0 8: 3 0 0 0 IO-APIC-edge rtc 9: 0 0 0 0 IO-APIC-edge acpi 11: 1938 1 0 0 IO-APIC-fasteoi eth0 12: 104 0 0 1 IO-APIC-edge i8042 14: 105 0 1 0 IO-APIC-fasteoi ata_piix, aic7xxx 15: 0 0 0 0 IO-APIC-edge ata_piix 18: 2390 1 0 0 IO-APIC-fasteoi aacraid 20: 228 1 0 0 IO-APIC-fasteoi aic7xxx 21: 14 1 0 0 IO-APIC-fasteoi aic7xxx 22: 14 0 1 0 IO-APIC-fasteoi aic7xxx NMI: 0 0 0 0 Non-maskable interrupts LOC: 19217 8829 11373 6098 Local timer interrupts RES: 3614 4379 3396 2418 Rescheduling interrupts CAL: 88 203 209 220 function call interrupts TLB: 1076 1121 1100 599 TLB shootdowns TRM: 0 0 0 0 Thermal event interrupts SPU: 0 0 0 0 Spurious interrupts ERR: 0 MIS: 0
Created attachment 15620 [details] PowerEdge 6300 acpidump -b (binary)
Created attachment 15621 [details] PowerEdge 6300 lspci -vxxx
Created attachment 15622 [details] PowerEdge 6300 pirq_mps Generated using pirq_mps from bug #8714 and downloaded from http://bugzilla.kernel.org/attachment.cgi?id=14134
Please re-attach the acpidump without any parameters when using acpidump. :)
Created attachment 15653 [details] PowerEdge 6300 acpidump (hex) Since the issue is a mismatch between the busses ACPI knows about, and those that PCI does, I'm currently building a debug kernel with printk()s of calls/returns for the ACPI/PCI bus discovery path: drivers/acpi/pci_root.c::acpi_pci_root_add() arch/x86/pci/acpi.c::pci_acpi_scan_root() drivers/pci/probe.c::pci_scan_bus_parented() drivers/pci/probe.c::pci_create_bus() drivers/pci/search.c::pci_find_bus() drivers/pci/search.c::pci_do_find_bus()
Please do not get distracted by this comment, the driver has a timeout of 3 minutes on the first interrupt driven command. This timeout is supposed to print a message: printk(KERN_ERR "aacraid: aac_fib_send: first asynchronous command timed out.\n" "Usually a result of a PCI interrupt routing problem;\n" "update mother board BIOS or consider utilizing one of\n" "the SAFE mode kernel options (acpi, apic etc)\n"); The assumption was that an udelay used in this loop would not result in a CPU stuck condition. Can one please advise me on the recommended changes in this loop in .../linux/drivers/scsi/aacraid/commsup.c so that we can report this condition accurately without harm? 3 minutes was considered a possible start-up time for the Adapter with a fully populated set of targets, so even without the ACPI problem reported here, we still have to ride through this worst-case Firmware Startup. Sincerely -- Mark Salyzyn
Created attachment 15677 [details] v.2.6.25-rc8 acpi-pci-bus debug failed boot log This log contains the printk() function entry/exit messages mentioned in comment #11. There are a *lot* of messages since the kernel command line was: "debug acpi.debug_level=0xff acpi.debug_layer=0x1f" Of the three root buses, the first one is always detected correctly. In this log that begins at 19.363070. The second root bus handling begins at 36.936161 and third root bus at 37.023840. Both the latter fail with a similar pattern: [ 36.910354] bus: 'acpi': driver_probe_device: matched device PNP0A03:01 with driver pci_root [ 36.914329] bus: 'acpi': really_probe: probing driver pci_root with device PNP0A03:01 [ 36.918350] nsutils-0869 [03] ns_get_node : _SEG, AE_NOT_FOUND [ 36.928169] Execute Method: [\_SB_.PX0B._BBN] (Node f5412d20) [ 36.936161] ACPI: PCI Root Bridge [PX0B] (0000:02) [ 36.938332] PCI: arch/x86/pci/acpi.c::pci_acpi_scan_root(device, 0, 2) [ 36.942336] PCI: drivers/pci/probe.c::pci_scan_bus_parented() [ 36.946329] PCI: drivers/pci/probe.c::pci_create_bus() [ 36.950331] PCI: drivers/pci/search.c::pci_find_bus(0, 2) [ 36.954331] PCI: drivers/pci/search.c::pci_do_find_bus(bus, 2) [ 36.958330] PCI: drivers/pci/search.c::pci_do_find_bus(bus, 2) [ 36.962330] PCI: drivers/pci/search.c::pci_do_find_bus()=NULL [ 36.966330] PCI: drivers/pci/search.c::pci_do_find_bus()=NULL [ 36.970332] PCI: drivers/pci/search.c::pci_do_find_bus(bus, 2) [ 36.974332] PCI: drivers/pci/search.c::pci_do_find_bus()=F5462C00 [ 36.978332] PCI: drivers/pci/search.c::pci_find_bus()=F5462C00 [ 36.982332] PCI: drivers/pci/probe.c::pci_create_bus()=NULL [ 36.986331] PCI: pci_scan_bus_parented()=NULL [ 36.990332] PCI: arch/x86/pci/acpi.c::pci_acpi_scan_root()=NULL [ 36.994333] ACPI: Bus 0000:02 not present in PCI namespace Unfortunately I neglected to report each of the exit paths from drivers/pci/probe.c::pci_create_bus() So I'm adding those, rebuilding, and will report back when I have a boot log with those included. Without that though, what I can't help wondering about is the mismatch between the ACPI (DSDT) allocated bus numbers (0, 2, 3) and the PNP0A03 numbers (0,1,2) - I'm hoping those are simply indexes into a PNP0A03 table with one for each discovered bus. E.g. [ 36.914329] bus: 'acpi': really_probe: probing driver pci_root with device PNP0A03:01 [ 36.936161] ACPI: PCI Root Bridge [PX0B] (0000:02) Of the four possible exit paths from drivers/pci/probe.c::pci_create_bus() the first, "err_out", would result in a kernel message "PCI: Bus %04x:%02x already known\n" and that doesn't show up and so the exit path must be one of the remaining three. The 2nd is a device_register() error (dev_reg_err) The 3rd is a device_register() error (class_dev_reg_err) The 4th is a device_create_file() error (dev_create_file_err)
Created attachment 15685 [details] v2.6.25-rc8 acpi-pci-bus debug "PINPOINTED" failed boot log Well it had to prove me wrong. The cause of the issue is "Bus already known", despite what I speculated in comment #13 (unlucky for some!). For some reason the call to pr_debug() in drivers/pci/probe.c::pci_create_bus() isn't resulting in a kernel message so it misled my reasoning as to the error. if (pci_find_bus(pci_domain_nr(b), bus)) { /* If we already got to this bus through a different bridge, ignore it */ pr_debug("PCI: Bus %04x:%02x already known\n", pci_domain_nr(b), bus); printk(KERN_INFO "PCI: drivers/pci/probe.c::pci_create_bus() Bus already known\n"); goto err_out; } [ 37.222374] bus: 'acpi': driver_probe_device: matched device PNP0A03:01 with driver pci_root [ 37.226348] bus: 'acpi': really_probe: probing driver pci_root with device PNP0A03:01 [ 37.230370] nsutils-0869 [03] ns_get_node : _SEG, AE_NOT_FOUND [ 37.238371] Execute Method: [\_SB_.PX0B._BBN] (Node f5412d20) [ 37.247233] ACPI: PCI Root Bridge [PX0B] (0000:02) [ 37.250352] PCI: arch/x86/pci/acpi.c::pci_acpi_scan_root(device, 0, 2) [ 37.254361] PCI: drivers/pci/probe.c::pci_scan_bus_parented() [ 37.258349] PCI: drivers/pci/probe.c::pci_create_bus() [ 37.262351] PCI: drivers/pci/search.c::pci_find_bus(0, 2) [ 37.266350] PCI: drivers/pci/search.c::pci_find_next_bus(0) [ 37.270351] PCI: drivers/pci/search.c::pci_find_next_bus(F5462800) [ 37.274351] PCI: drivers/pci/search.c::pci_do_find_bus(bus, 2) [ 37.278351] PCI: drivers/pci/search.c::pci_do_find_bus(bus, 2) [ 37.282350] PCI: drivers/pci/search.c::pci_do_find_bus()=NULL [ 37.286350] PCI: drivers/pci/search.c::pci_do_find_bus()=NULL [ 37.290351] PCI: drivers/pci/search.c::pci_find_next_bus(F5462800) [ 37.294352] PCI: drivers/pci/search.c::pci_find_next_bus(F5462C00) [ 37.298352] PCI: drivers/pci/search.c::pci_do_find_bus(bus, 2) [ 37.302352] PCI: drivers/pci/search.c::pci_do_find_bus()=F5462C00 [ 37.306352] PCI: drivers/pci/search.c::pci_find_bus()=F5462C00 [ 37.310352] PCI: drivers/pci/probe.c::pci_create_bus() Bus already known [ 37.314353] PCI: drivers/pci/probe.c::pci_create_bus()=NULL [ 37.318352] PCI: pci_scan_bus_parented()=NULL [ 37.322354] PCI: arch/x86/pci/acpi.c::pci_acpi_scan_root()=NULL [ 37.326355] ACPI: Bus 0000:02 not present in PCI namespace Looking back through the log for references to the buses I found: [ 4.382109] device: 'PNP0A03:00': device_add [ 4.384305] PM: Adding info for acpi:PNP0A03:00 [ 4.388291] bus: 'acpi': add device PNP0A03:00 [ 11.666574] device: 'PNP0A03:01': device_add [ 11.668762] PM: Adding info for acpi:PNP0A03:01 [ 11.672747] bus: 'acpi': add device PNP0A03:01 [ 11.824765] device: 'PNP0A03:02': device_add [ 11.828773] PM: Adding info for acpi:PNP0A03:02 [ 11.832757] bus: 'acpi': add device PNP0A03:02 [ 19.401234] device: 'pci0000:00': device_add [ 19.405256] PM: Adding info for No Bus:pci0000:00 [ 19.409237] device: '0000:00': device_add [ 19.413257] PM: Adding info for No Bus:0000:00 [ 19.425226] PCI: drivers/pci/probe.c::pci_create_bus()=F5462800 [ 19.481239] device: 'pci0000:02': device_add [ 19.485255] PM: Adding info for No Bus:pci0000:02 [ 19.489241] device: '0000:02': device_add [ 19.493261] PM: Adding info for No Bus:0000:02 [ 19.497312] PCI: drivers/pci/probe.c::pci_create_bus()=F5462C00 [ 19.593247] device: 'pci0000:03': device_add [ 19.597262] PM: Adding info for No Bus:pci0000:03 [ 19.601248] device: '0000:03': device_add [ 19.605267] PM: Adding info for No Bus:0000:03 [ 19.609310] PCI: drivers/pci/probe.c::pci_create_bus()=F550E000 The ACPI PNP0A03 devices are: PX0A: _UID 1, _BBN 0, _ADR 0x00120000, pci_bus=0xF5462800, device 0000:0 PX0B: _UID 2, _BBN 2?, _ADR_0x00130000, pci_bus=0xF5462C00, device 0000:2 PX1A: _UID 3, _BBN 3?, _ADR 0x00140000, pci_bus=0xF550E000, device 0000:3 The ? above indicates _BBN is calculated in a method and the values I show here are the ones reported by ACPI as bus number. So buses 0:2 and 0:3 are created when the first root bridge (PX0A) is scanned. That causes the later scan/create in ACPI/PCI to fail with the "Bus already known" reason. [ 37.306352] PCI: drivers/pci/search.c::pci_find_bus()=F5462C00 [ 37.310352] PCI: drivers/pci/probe.c::pci_create_bus() Bus already known [ 37.430360] PCI: drivers/pci/search.c::pci_find_bus()=F550E000 [ 37.434360] PCI: drivers/pci/probe.c::pci_create_bus() Bus already known Here's the call logic when 0000:0 has been created and is being scanned itself: [ 19.369259] ACPI: PCI Root Bridge [PX0A] (0000:00) [ 19.373231] PCI: arch/x86/pci/acpi.c::pci_acpi_scan_root(device, 0, 0) [ 19.377237] PCI: drivers/pci/probe.c::pci_scan_bus_parented() [ 19.381229] PCI: drivers/pci/probe.c::pci_create_bus() [ 19.385232] PCI: drivers/pci/search.c::pci_find_bus(0, 0) [ 19.389230] PCI: drivers/pci/search.c::pci_find_next_bus(0) [ 19.393230] PCI: drivers/pci/search.c::pci_find_next_bus(0) [ 19.397230] PCI: drivers/pci/search.c::pci_find_bus()=NULL [ 19.401234] device: 'pci0000:00': device_add [ 19.405256] PM: Adding info for No Bus:pci0000:00 [ 19.409237] device: '0000:00': device_add [ 19.413257] PM: Adding info for No Bus:0000:00 [ 19.425226] PCI: drivers/pci/probe.c::pci_create_bus()=F5462800 [ 19.425495] * Found PM-Timer Bug on the chipset. Due to workarounds for a bug, [ 19.425503] * this clock source is slow. Consider trying other clock sources [ 19.429300] pci 0000:00:02.3: quirk: region 0800-083f claimed by PIIX4 ACPI [ 19.433238] pci 0000:00:02.3: quirk: region 0850-085f claimed by PIIX4 SMB [ 19.437545] pci 0000:00:10.0: Searching for i450NX host bridges [ 19.441239] PCI: drivers/pci/probe.c::pci_scan_bus_parented() [ 19.445233] PCI: drivers/pci/probe.c::pci_create_bus() [ 19.449235] PCI: drivers/pci/search.c::pci_find_bus(0, 2) [ 19.453233] PCI: drivers/pci/search.c::pci_find_next_bus(0) [ 19.457234] PCI: drivers/pci/search.c::pci_find_next_bus(F5462800) [ 19.461235] PCI: drivers/pci/search.c::pci_do_find_bus(bus, 2) [ 19.465234] PCI: drivers/pci/search.c::pci_do_find_bus()=NULL [ 19.469235] PCI: drivers/pci/search.c::pci_find_next_bus(F5462800) [ 19.473235] PCI: drivers/pci/search.c::pci_find_next_bus(0) [ 19.477234] PCI: drivers/pci/search.c::pci_find_bus()=NULL [ 19.481239] device: 'pci0000:02': device_add [ 19.485255] PM: Adding info for No Bus:pci0000:02 [ 19.489241] device: '0000:02': device_add [ 19.493261] PM: Adding info for No Bus:0000:02 [ 19.497312] PCI: drivers/pci/probe.c::pci_create_bus()=F5462C00 So it appears the issue is caused by an interaction with the i450NX fix-up code in: arch/x86/pci/fixup.c::pci_fixup_i450nx() I've been looking at the git commit logs for this (formerly known as arch/i386/pci/fixup.c) and the ACPI/PCI source files but so far haven't noticed anything that wouls explain this - certainly not between v2.6.20 and v2.6.22. I think this is the point where the experts need to get involved.
I found this reference to the issue in AKM's 2.6.0 mm tree and the linux-scsi mailing list archive: "I can tell you what's going on here. This is a 450NX based motherboard. The 450NX chipset from Intel was the first chipset to have peer PCI busses. For backwards compatibility, some machine makers hacked their PCI BIOS to have a fake bridge device on PCI bus 0 that points to the same bus number as the peer bus. This way if the OS didn't know about the peer bus registers it would still find the devices by scanning behind the bridge. In this case we are scanning behind this fake bridge and then also scanning based upon the peer bus registers in the chipset, and as a result we are finding the device twice. In order to fix this problem you need to change the peer bus quirk code for the 450NX chipset to scan the list of bus 0 devices looking for a bridge that has the same config as the peer bus registers and if so delete the bridge from the list. That will avoid double scanning and will avoid having the PCI code try and configure sub busses via a fake bridge when it should do all configurations via the 450NX peer bus registers. -- Doug Ledford <dledford@redhat.com>" http://marc.info/?l=linux-scsi&m=106839680416899&w=2
I have created a patch of arch/x86/pci/fixup.c::pci_fixup_i450nx() that does a dmi_check_system() against a struct dmi_system_id with a DMI_MATCH pair. If the ident matches it *doesn't* do the i450NX secondary bus scan. I'll report back once the revised kernel has been built and tested.
Created attachment 15691 [details] v2.6.25-rc8 acpi-pci-bus debug "FIXED" *successful* boot log My patch appears to have fixed the issue, although I'd like some feedback on the patch itself and the boot log in case it could cause knock-on effects due to the way the buses are detected and scanned now.
Created attachment 15692 [details] Proposed patch to i450NX fix-up Please check this and consider alongside the kernel boot log attached to the previous comment to ensure no unexpected side effects.
Created attachment 15701 [details] v2.6.25-rc8 acpi-pci-bus *no-debug* "FIXED" successful boot log Another kernel log, this time without kernel DEBUG messages (still contains my trace-path printk() reports)
Created attachment 15702 [details] Proposed patch to i450NX fix-up Corrected the number of arguments used by DBG()
Re-assigning to TJ as this is not an ACPI bug.
Created attachment 15712 [details] Revised less-wordy patch as submitted to linux-pci mailing list
Created attachment 15730 [details] Much better ACPI patch from Matthew Wilcox This was provided on the linux-pci mailing list. I have tested it and it works and is much better than my DMI matching solution.
Zhao Yakui reported on linux-acpi that another bug #10124 and associated patch should fix this. I grabbed the patch mentioned: http://kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.25-rc8/2.6.25-rc8-mm2/broken-out/acpi-unneccessary-to-scan-the-pci-bus-already-scanned.patch and confirmed it does solve this issue too. Marking this bug as a duplicate. *** This bug has been marked as a duplicate of bug 10124 ***