Bug 10396 - BUG: soft lockup - CPU#0 stuck for 61s! [modprobe:2096]
Summary: BUG: soft lockup - CPU#0 stuck for 61s! [modprobe:2096]
Status: RESOLVED DUPLICATE of bug 10124
Alias: None
Product: Drivers
Classification: Unclassified
Component: PCI (show other bugs)
Hardware: All Linux
: P1 high
Assignee: TJ
URL:
Keywords:
Depends on:
Blocks: 56331
  Show dependency tree
 
Reported: 2008-04-05 08:53 UTC by TJ
Modified: 2013-04-09 06:23 UTC (History)
2 users (show)

See Also:
Kernel Version: v2.6.25-rc8
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
v2.6.20 successful boot log (19.87 KB, text/x-log)
2008-04-05 08:54 UTC, TJ
Details
v.2.6.25-rc8 failed boot log (53.46 KB, text/x-log)
2008-04-05 08:55 UTC, TJ
Details
v.2.6.25-rc8 pci=noacpi (successful) boot log (21.18 KB, text/x-log)
2008-04-05 09:44 UTC, TJ
Details
PowerEdge 6300 acpidump -b (binary) (13.47 KB, application/x-extension-bin)
2008-04-05 10:12 UTC, TJ
Details
PowerEdge 6300 lspci -vxxx (8.10 KB, text/plain)
2008-04-05 10:18 UTC, TJ
Details
PowerEdge 6300 pirq_mps (4.27 KB, text/x-log)
2008-04-05 10:20 UTC, TJ
Details
PowerEdge 6300 acpidump (hex) (62.50 KB, application/octet-stream)
2008-04-07 05:26 UTC, TJ
Details
v.2.6.25-rc8 acpi-pci-bus debug failed boot log (580.28 KB, application/octet-stream)
2008-04-08 13:58 UTC, TJ
Details
v2.6.25-rc8 acpi-pci-bus debug "PINPOINTED" failed boot log (593.00 KB, application/octet-stream)
2008-04-08 23:14 UTC, TJ
Details
v2.6.25-rc8 acpi-pci-bus debug "FIXED" *successful* boot log (631.48 KB, application/octet-stream)
2008-04-09 03:45 UTC, TJ
Details
Proposed patch to i450NX fix-up (1.66 KB, patch)
2008-04-09 04:01 UTC, TJ
Details | Diff
v2.6.25-rc8 acpi-pci-bus *no-debug* "FIXED" successful boot log (44.21 KB, application/octet-stream)
2008-04-09 06:49 UTC, TJ
Details
Proposed patch to i450NX fix-up (1.68 KB, patch)
2008-04-09 07:08 UTC, TJ
Details | Diff
Revised less-wordy patch as submitted to linux-pci mailing list (1.36 KB, patch)
2008-04-10 03:16 UTC, TJ
Details | Diff
Much better ACPI patch from Matthew Wilcox (1.18 KB, patch)
2008-04-10 16:06 UTC, TJ
Details | Diff

Description TJ 2008-04-05 08:53:09 UTC
Latest working kernel version: v2.6.20
Earliest failing kernel version: v.2.6.22
Distribution: kernel.org, Ubuntu
Hardware Environment: Dell PowerEdge 6300 with PERC 2 RAID (Adaptec) controller
Software Environment: kernel
Problem Description: Linux fails to boot because aacraid fails and no file system available.

Steps to reproduce: Boot server with kernel later than v2.6.20

Dell PERC 2 RAID controller, latest firmware (2.8.0 build 6099) with 6 disks - 5x RAID-5, 1x spare.

Logs being captured using a serial console connection.

A *good* start with v2.6.20 reports:

[    6.681614] Adaptec aacraid driver (1.1-5[2423]-mh3)

[    6.686794] ACPI: PCI Interrupt 0000:03:03.0[A] -> GSI 18 (level, low) -> IRQ 17

[    6.695162] FDC 0 is a National Semiconductor PC87306

[    6.724207] AAC0: kernel 2.8-0[6089] 

[    6.727976] AAC0: monitor 2.8-0[6089]

[    6.731702] AAC0: bios 2.8-0[6089]

[    6.735174] AAC0: serial 8a0376

[    6.738794] scsi0 : percraid

[    6.742287] ACPI: PCI Interrupt 0000:02:04.0[A] -> <3>hub 1-0:1.0: over-current change on port 1

[    6.742810] scsi 0:0:0:0: Direct-Access     DELL     Array1           V1.0 PQ: 0 ANSI: 2

[    6.751893] scsi 0:0:1:0: Direct-Access     DELL     Archive          V1.0 PQ: 0 ANSI: 2

A *bad* start with v2.6.22+ reports:

[  152.474463] BUG: soft lockup - CPU#0 stuck for 61s! [modprobe:2096]

[  152.474463] 

[  152.474463] Pid: 2096, comm: modprobe Not tainted (2.6.25-rc8-custom #1)

[  152.474463] EIP: 0060:[<c0209db0>] EFLAGS: 00000293 CPU: 0

[  152.474463] EIP is at native_read_tsc+0x0/0x10

[  152.474463] EAX: 00000474 EBX: b8fd8e27 ECX: 02a52000 EDX: 0000004a

[  152.474463] ESI: 00000aac EDI: 0142f9cb EBP: f54dda84 ESP: f7c5dd1c

[  152.474463]  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068

[  152.474463] CR0: 8005003b CR2: 080f91cf CR3: 37a60000 CR4: 000006d0

[  152.474463] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000

[  152.474463] DR6: ffff0ff0 DR7: 00000400

[  152.474463]  [<c0305067>] ? delay_tsc+0x17/0x20

[  152.474463]  [<c0305016>] ? __delay+0x6/0x10

[  152.474463]  [<f8a5aa40>] ? aac_fib_send+0x220/0x2d0 [aacraid]

[  152.474463]  [<f8a569c4>] ? aac_get_adapter_info+0x74/0x680 [aacraid]

[  152.474463]  [<c021937b>] ? __resched_task+0x5b/0x70

[  152.474463]  [<c021ccda>] ? try_to_wake_up+0x6a/0x100

[  152.474463]  [<f8a5d55a>] ? aac_probe_one+0x23a/0x4a4 [aacraid]

[  152.474463]  [<f8a5af50>] ? aac_command_thread+0x0/0x6d0 [aacraid]

[  152.474463]  [<c0310146>] ? pci_device_probe+0x56/0x80

[  152.474463]  [<c0367948>] ? driver_probe_device+0x88/0x170

[  152.474463]  [<c0367b9e>] ? __driver_attach+0x9e/0xa0

[  152.474463]  [<c0366cea>] ? bus_for_each_dev+0x3a/0x60

[  152.474463]  [<c03100f0>] ? pci_device_probe+0x0/0x80

[  152.474463]  [<c03677c6>] ? driver_attach+0x16/0x20

[  152.474463]  [<c0367b00>] ? __driver_attach+0x0/0xa0

[  152.474463]  [<c0367674>] ? bus_add_driver+0x1a4/0x210

[  152.474463]  [<c0310090>] ? pci_device_remove+0x0/0x40

[  152.474463]  [<c03100f0>] ? pci_device_probe+0x0/0x80

[  152.474463]  [<c0367d3b>] ? driver_register+0x3b/0xf0

[  152.474463]  [<c040b744>] ? _spin_unlock_irqrestore+0x4/0x10

[  152.474463]  [<c031034d>] ? __pci_register_driver+0x3d/0x80

[  152.474463]  [<f890a033>] ? aac_init+0x33/0x74 [aacraid]

[  152.474463]  [<c024696e>] ? sys_init_module+0x13e/0x1c40

[  152.474463]  [<c040d37f>] ? do_page_fault+0x13f/0x670

[  152.474463]  [<c02294ec>] ? irq_exit+0x3c/0x70

[  152.474463]  [<c0204d76>] ? syscall_call+0x7/0xb

[  152.474463]  =======================

v2.6.20 runs stable. v2.6.22+ all fail in the same way. There are also "nobody cared" IRQ faults:

[   17.155571] irq 10: nobody cared (try booting with the "irqpoll" option)

[   17.155571] Pid: 0, comm: swapper Not tainted 2.6.25-rc8-custom #1

[   17.155571]  [<c025ad74>] __report_bad_irq+0x24/0x80

[   17.155571]  [<c0219e27>] __update_rq_clock+0x27/0x180

[   17.155571]  [<c025b040>] note_interrupt+0x270/0x2b0

[   17.155571]  [<c023c8c1>] getnstimeofday+0x31/0xc0

[   17.155571]  [<c025a2a5>] handle_IRQ_event+0x25/0x50

[   17.155571]  [<c025b9dd>] handle_fasteoi_irq+0xad/0xe0

[   17.155571]  [<c02071dd>] do_IRQ+0x3d/0x80

[   17.155571]  [<c020571f>] common_interrupt+0x23/0x28

[   17.155571]  [<c02300d8>] sys_rt_sigsuspend+0xc8/0xd0

[   17.155571]  [<c02039c2>] default_idle+0x52/0x80

[   17.155571]  [<c0203970>] default_idle+0x0/0x80

[   17.155571]  [<c020380d>] cpu_idle+0x5d/0xe0

[   17.155571]  =======================

[   17.155571] handlers:

[   17.155571] [<f88cc180>] (ahc_linux_isr+0x0/0x250 [aic7xxx])

[   17.155571] Disabling IRQ #10

I'm not sure if these lead to the aacraid failure or the two are unrelated.

In a *bad* boot log I see these but I'm not sure if they are related to the error reports later:

[    0.910906] ACPI: PCI Root Bridge [PX0B] (0000:02)

[    0.912085] ACPI: Bus 0000:02 not present in PCI namespace

[    0.917111] ACPI: PCI Root Bridge [PX1A] (0000:03)

[    0.920085] ACPI: Bus 0000:03 not present in PCI namespace


I'm trying to determine if those Bus 0000:02/03 references are the same as the lspci device addresses 02:* and 03:* (see later) because if they are it would show these two reports might be the root cause of the entire problem.

System configuration:

The PERC/2 controller is:
03:03.0 RAID bus controller [0104]: Digital Equipment Corporation DECchip 21554 [1011:0046] (rev 01)

$ uname -a
Linux PowerEdge6300 2.6.20-15-generic #2 SMP Sun Apr 15 07:36:31 UTC 2007 i686 GNU/Linux

$ modinfo aacraid
filename: /lib/modules/2.6.20-15-generic/kernel/drivers/scsi/aacraid/aacraid.ko
version: 1.1-5[2423]-mh3
license: GPL
description: Dell PERC2, 2/Si, 3/Si, 3/Di, Adaptec Advanced Raid Products, HP NetRAID-4M, IBM ServeRAID & ICP SCSI driver
author: Red Hat Inc and Adaptec
srcversion: 9F4AEF75C12F7128F830FA2
depends: scsi_mod
vermagic: 2.6.20-15-generic SMP mod_unload 586

$ lspci -nnn
00:02.0 ISA bridge [0601]: Intel Corporation 82371AB/EB/MB PIIX4 ISA [8086:7110] (rev 02)
00:02.1 IDE interface [0101]: Intel Corporation 82371AB/EB/MB PIIX4 IDE [8086:7111] (rev 01)
00:02.2 USB Controller [0c03]: Intel Corporation 82371AB/EB/MB PIIX4 USB [8086:7112] (rev 01)
00:02.3 Bridge [0680]: Intel Corporation 82371AB/EB/MB PIIX4 ACPI [8086:7113] (rev 02)
00:04.0 VGA compatible controller [0300]: ATI Technologies Inc 3D Rage Pro [1002:4749] (rev 5c)
00:08.0 SCSI storage controller [0100]: Adaptec AHA-2940U2/U2W [9005:0010]
00:0a.0 PCI bridge [0604]: Intel Corporation 21154 PCI-to-PCI Bridge [8086:b154]
00:10.0 Host bridge [0600]: Intel Corporation 450NX - 82451NX Memory & I/O Controller [8086:84ca] (rev 03)
00:12.0 Host bridge [0600]: Intel Corporation 450NX - 82454NX/84460GX PCI Expander Bridge [8086:84cb] (rev 04)
00:13.0 Host bridge [0600]: Intel Corporation 450NX - 82454NX/84460GX PCI Expander Bridge [8086:84cb] (rev 04)
00:14.0 Host bridge [0600]: Intel Corporation 450NX - 82454NX/84460GX PCI Expander Bridge [8086:84cb] (rev 04)
01:04.0 Ethernet controller [0200]: Intel Corporation 82557/8/9 [Ethernet Pro 100] [8086:1229] (rev 0d)
01:05.0 Ethernet controller [0200]: Intel Corporation 82557/8/9 [Ethernet Pro 100] [8086:1229] (rev 0d)
02:04.0 SCSI storage controller [0100]: Adaptec AHA-2940U2/U2W / 7890/7891 [9005:001f]
02:06.0 SCSI storage controller [0100]: Adaptec AHA-2940U2/U2W / 7890/7891 [9005:001f]
02:08.0 SCSI storage controller [0100]: Adaptec AIC-7860 [9004:6078] (rev 03)
03:03.0 RAID bus controller [0104]: Digital Equipment Corporation DECchip 21554 [1011:0046] (rev 01)

$ lsmod | grep aac
aacraid 59652 2
scsi_mod 142348 8 st,sr_mod,sg,sd_mod,aacraid,aic7xxx,scsi_transport_spi,libata

$ grep -i aac /var/log/kern.log
Apr 3 18:07:41 PowerEdge6300 kernel: [ 6.394845] Adaptec aacraid driver (1.1-5[2423]-mh3)
Apr 3 18:07:41 PowerEdge6300 kernel: [ 51.623757] AAC0: kernel 2.8-0[6089]
Apr 3 18:07:41 PowerEdge6300 kernel: [ 51.623770] AAC0: monitor 2.8-0[6089]
Apr 3 18:07:41 PowerEdge6300 kernel: [ 51.623779] AAC0: bios 2.8-0[6089]
Apr 3 18:07:41 PowerEdge6300 kernel: [ 51.623787] AAC0: serial 8a0376

$ egrep -i 'scsi3|3:0:' /var/log/kern.log
Apr 3 18:07:41 PowerEdge6300 kernel: [ 51.624202] scsi3 : percraid
Apr 3 18:07:41 PowerEdge6300 kernel: [ 51.624823] scsi 3:0:0:0: Direct-Access DELL Array1 V1.0 PQ: 0 ANSI: 2
Apr 3 18:07:41 PowerEdge6300 kernel: [ 51.625185] scsi 3:0:1:0: Direct-Access DELL Archive V1.0 PQ: 0 ANSI: 2
Apr 3 18:07:41 PowerEdge6300 kernel: [ 66.973120] sd 3:0:0:0: Attached scsi removable disk sda
Apr 3 18:07:41 PowerEdge6300 kernel: [ 66.974231] sd 3:0:1:0: Attached scsi removable disk sdb
Apr 3 18:07:41 PowerEdge6300 kernel: [ 66.997669] sd 3:0:0:0: Attached scsi generic sg1 type 0
Apr 3 18:07:41 PowerEdge6300 kernel: [ 66.998217] sd 3:0:1:0: Attached scsi generic sg2 type 0
Apr 3 18:07:41 PowerEdge6300 kernel: [ 67.016451] sr0: scsi3-mmc drive: 17x/40x cd/rw xa/form2 cdda tray


$ git-rev-list --pretty=oneline --reverse v2.6.20..v2.6.22 -- drivers/scsi/aacraid

shows 32 commits between good and bad versions that affect aacraid. 

I've begun a bisect/test cycle but it will require 15 tests and the build time is very long. If the issue is outside aacraid then it'd take weeks to follow the bisect/test cycle for all commits between v2.6.20 and v2.6.22.

If the issue is ACPI related

$ git-rev-list --pretty=oneline --reverse v2.6.20..v2.6.22 -- drivers/acpi/pci_root.c

shows 7 commits and

$ git-rev-list --pretty=oneline --reverse v2.6.20..v2.6.22 -- drivers/acpi

shows 277 commits.

Related is bug #9133. I've tried all the suggestions in that with no difference in the observed problem. I've tried boot options noapic noacpi irqpoll and the various aacraid.* and scsi_mod.scan=sync.

Related Ubuntu report is bug #149071 which might have a different cause although I began reporting there as it seemed remarkably close. I may open another Ubuntu bug report to run mirror this one as the cause seems different.

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/149071
Comment 1 TJ 2008-04-05 08:54:23 UTC
Created attachment 15617 [details]
v2.6.20 successful boot log
Comment 2 TJ 2008-04-05 08:55:24 UTC
Created attachment 15618 [details]
v.2.6.25-rc8 failed boot log
Comment 3 TJ 2008-04-05 09:43:07 UTC
I found a mailing list report of a similar problem where ACPI was the cause.

"kernel-2.6.23.1-30.fc8 problem with symbios scsi controllers"

http://www.redhat.com/archives/rhl-devel-list/2007-October/msg02035.html

That caused me try the boot option "pci=noacpi" and the kernel started. There are numerous other issues as a result so it doesn't solve the regression since v2.6.20 but it does narrow down the hunt for the culprit.

In part the v2.6.25-rc8 boot log reports:

[   16.207231] Adaptec aacraid driver 1.1-5[2455]-ms



[   47.074484] AAC0: kernel 2.8-1[6099]
[   47.074498] AAC0: monitor 2.8-1[6099]
[   47.078494] AAC0: bios 2.8-1[6099]
[   47.082496] AAC0: serial 8A0376
[   47.086509] scsi2 : percraid
[   47.091184] scsi 2:0:0:0: Direct-Access     DELL     Array1           V1.0 PQ: 0 ANSI: 2
[   46.643256] e100: eth0: e100_probe: addr 0xfe5ff000, irq 11, MAC addr 00:02:b3:bb:d7:8c
[   47.095591] scsi 2:0:1:0: Direct-Access     DELL     Archive          V1.0 PQ: 0 ANSI: 2
[   46.670604] e100: eth1: e100_probe: addr 0xfe5fe000, irq 14, MAC addr 00:02:b3:bb:d7:8d
[   47.113160] scsi 2:1:0:0: Direct-Access     IBM      DMVS09M          0220 PQ: 0 ANSI: 3
[   47.117449] scsi 2:1:1:0: Direct-Access     IBM      DMVS09M          0220 PQ: 0 ANSI: 3
[   47.122638] scsi 2:1:2:0: Direct-Access     IBM      DMVS09M          0220 PQ: 0 ANSI: 3
[   47.127812] scsi 2:1:3:0: Direct-Access     IBM      DMVS09M          0220 PQ: 0 ANSI: 3
[   47.133005] scsi 2:1:4:0: Direct-Access     IBM      DMVS09M          0220 PQ: 0 ANSI: 3
[   47.145724] scsi 2:1:5:0: Direct-Access     HITACHI  DK32DJ-18MC      D4D4 PQ: 0 ANSI: 3
[   46.749850] ata_piix 0000:00:10.0: no available native port
[   47.181601] scsi 2:1:6:0: Processor         DELL     1x6 U2W SCSI BP  5.39 PQ: 0 ANSI: 2
[   46.693844] PCI: Enabling device 0000:00:02.2 (0000 -> 0001)
[   46.693882] PCI: No IRQ known for interrupt pin D of device 0000:00:02.2. Probably buggy MP table.
Comment 4 TJ 2008-04-05 09:44:00 UTC
Created attachment 15619 [details]
v.2.6.25-rc8 pci=noacpi (successful) boot log
Comment 5 TJ 2008-04-05 09:49:21 UTC
I'm attempting to re-assign this bug to ACPI > Config-Interrupts
Comment 6 TJ 2008-04-05 10:10:01 UTC
v2.6.25-rc8 with "pci=noacpi"

Output of Perl script dump_pirq

$ wget http://linux.dell.com/files/tools/dump_pirq
$ chmod 755 dump_pirq
$ sudo ./dump_pirq

Interrupt routing table found at address 0xfc7a0:
  Version 1.0, size 0x00d0
  Interrupt router is device 00:02.0
  PCI exclusive interrupt mask: 0x0000 []
  Compatible router: vendor 0x8086 device 0x122e

Device 00:02.0 (slot 0): 
  INTA: link 0x64, irq mask 0x4000 [14]
  INTB: link 0x65, irq mask 0x8000 [15]
  INTD: link 0x63, irq mask 0x5cf8 [3,4,5,6,7,10,11,12,14]

Device 02:04.0 (slot 0): 
  INTA: link 0x65, irq mask 0x5cf8 [3,4,5,6,7,10,11,12,14]

Device 02:06.0 (slot 0): 
  INTA: link 0x66, irq mask 0x5cf8 [3,4,5,6,7,10,11,12,14]

Device 02:08.0 (slot 0): 
  INTA: link 0x64, irq mask 0x5cf8 [3,4,5,6,7,10,11,12,14]

Device 00:06.0 (slot 1): 
  INTA: link 0x69, irq mask 0x5cf8 [3,4,5,6,7,10,11,12,14]
  INTB: link 0x64, irq mask 0x5cf8 [3,4,5,6,7,10,11,12,14]
  INTC: link 0x67, irq mask 0x5cf8 [3,4,5,6,7,10,11,12,14]
  INTD: link 0x68, irq mask 0x5cf8 [3,4,5,6,7,10,11,12,14]

Device 00:08.0 (slot 2): 
  INTA: link 0x68, irq mask 0x5cf8 [3,4,5,6,7,10,11,12,14]
  INTB: link 0x69, irq mask 0x5cf8 [3,4,5,6,7,10,11,12,14]
  INTC: link 0x64, irq mask 0x5cf8 [3,4,5,6,7,10,11,12,14]
  INTD: link 0x67, irq mask 0x5cf8 [3,4,5,6,7,10,11,12,14]

Device 00:0a.0 (slot 3): 
  INTA: link 0x67, irq mask 0x5cf8 [3,4,5,6,7,10,11,12,14]
  INTB: link 0x68, irq mask 0x5cf8 [3,4,5,6,7,10,11,12,14]
  INTC: link 0x69, irq mask 0x5cf8 [3,4,5,6,7,10,11,12,14]
  INTD: link 0x64, irq mask 0x5cf8 [3,4,5,6,7,10,11,12,14]

Device 03:01.0 (slot 4): 
  INTA: link 0x63, irq mask 0x5cf8 [3,4,5,6,7,10,11,12,14]
  INTB: link 0x60, irq mask 0x5cf8 [3,4,5,6,7,10,11,12,14]
  INTC: link 0x61, irq mask 0x5cf8 [3,4,5,6,7,10,11,12,14]
  INTD: link 0x62, irq mask 0x5cf8 [3,4,5,6,7,10,11,12,14]

Device 03:03.0 (slot 5): 
  INTA: link 0x62, irq mask 0x5cf8 [3,4,5,6,7,10,11,12,14]
  INTB: link 0x63, irq mask 0x5cf8 [3,4,5,6,7,10,11,12,14]
  INTC: link 0x60, irq mask 0x5cf8 [3,4,5,6,7,10,11,12,14]
  INTD: link 0x61, irq mask 0x5cf8 [3,4,5,6,7,10,11,12,14]

Device 03:05.0 (slot 6): 
  INTA: link 0x61, irq mask 0x5cf8 [3,4,5,6,7,10,11,12,14]
  INTB: link 0x62, irq mask 0x5cf8 [3,4,5,6,7,10,11,12,14]
  INTC: link 0x63, irq mask 0x5cf8 [3,4,5,6,7,10,11,12,14]
  INTD: link 0x60, irq mask 0x5cf8 [3,4,5,6,7,10,11,12,14]

Device 03:07.0 (slot 7): 
  INTA: link 0x60, irq mask 0x5cf8 [3,4,5,6,7,10,11,12,14]
  INTB: link 0x61, irq mask 0x5cf8 [3,4,5,6,7,10,11,12,14]
  INTC: link 0x62, irq mask 0x5cf8 [3,4,5,6,7,10,11,12,14]
  INTD: link 0x63, irq mask 0x5cf8 [3,4,5,6,7,10,11,12,14]

Interrupt router at 00:02.0: Intel 82371AB PIIX4/PIIX4E PCI-to-ISA bridge
  PIRQ1 (link 0x60): unrouted
  PIRQ2 (link 0x61): unrouted
  PIRQ3 (link 0x62): irq 5
  PIRQ4 (link 0x63): unrouted
  Serial IRQ: [disabled] [quiet] [frame=21] [pulse=4]
  Level mask: 0x4e20 [5,9,10,11,14]

$ cat /proc/interrupts 
           CPU0       CPU1       CPU2       CPU3       
  0:        178          0          0          0   IO-APIC-edge      timer
  1:         33          0          2          1   IO-APIC-edge      i8042
  2:          0          0          0          0    XT-PIC-XT        cascade
  4:        235          0          1          0   IO-APIC-edge      serial
  6:          2          1          0          0   IO-APIC-edge      floppy
  7:          0          0          0          0   IO-APIC-edge      parport0
  8:          3          0          0          0   IO-APIC-edge      rtc
  9:          0          0          0          0   IO-APIC-edge      acpi
 11:       1938          1          0          0   IO-APIC-fasteoi   eth0
 12:        104          0          0          1   IO-APIC-edge      i8042
 14:        105          0          1          0   IO-APIC-fasteoi   ata_piix, aic7xxx
 15:          0          0          0          0   IO-APIC-edge      ata_piix
 18:       2390          1          0          0   IO-APIC-fasteoi   aacraid
 20:        228          1          0          0   IO-APIC-fasteoi   aic7xxx
 21:         14          1          0          0   IO-APIC-fasteoi   aic7xxx
 22:         14          0          1          0   IO-APIC-fasteoi   aic7xxx
NMI:          0          0          0          0   Non-maskable interrupts
LOC:      19217       8829      11373       6098   Local timer interrupts
RES:       3614       4379       3396       2418   Rescheduling interrupts
CAL:         88        203        209        220   function call interrupts
TLB:       1076       1121       1100        599   TLB shootdowns
TRM:          0          0          0          0   Thermal event interrupts
SPU:          0          0          0          0   Spurious interrupts
ERR:          0
MIS:          0
Comment 7 TJ 2008-04-05 10:12:47 UTC
Created attachment 15620 [details]
PowerEdge 6300 acpidump -b (binary)
Comment 8 TJ 2008-04-05 10:18:51 UTC
Created attachment 15621 [details]
PowerEdge 6300 lspci -vxxx
Comment 9 TJ 2008-04-05 10:20:35 UTC
Created attachment 15622 [details]
PowerEdge 6300 pirq_mps

Generated using pirq_mps from bug #8714 and downloaded from

http://bugzilla.kernel.org/attachment.cgi?id=14134
Comment 10 Zhang Rui 2008-04-06 19:41:35 UTC
Please re-attach the acpidump without any parameters when using acpidump. :)
Comment 11 TJ 2008-04-07 05:26:28 UTC
Created attachment 15653 [details]
PowerEdge 6300 acpidump (hex)

Since the issue is a mismatch between the busses ACPI knows about, and those that PCI does, I'm currently building a debug kernel with printk()s of calls/returns for the ACPI/PCI bus discovery path:

drivers/acpi/pci_root.c::acpi_pci_root_add()
 arch/x86/pci/acpi.c::pci_acpi_scan_root()
  drivers/pci/probe.c::pci_scan_bus_parented()
   drivers/pci/probe.c::pci_create_bus()
    drivers/pci/search.c::pci_find_bus()
     drivers/pci/search.c::pci_do_find_bus()
Comment 12 Mark Salyzyn 2008-04-07 05:47:54 UTC
Please do not get distracted by this comment, the driver has a timeout of 3 minutes on the first interrupt driven command. This timeout is supposed to print a message:

        printk(KERN_ERR "aacraid: aac_fib_send: first asynchronous command timed out.\n"
          "Usually a result of a PCI interrupt routing problem;\n"
          "update mother board BIOS or consider utilizing one of\n"
          "the SAFE mode kernel options (acpi, apic etc)\n");

The assumption was that an udelay used in this loop would not result in a CPU stuck condition. Can one please advise me on the recommended changes in this loop in .../linux/drivers/scsi/aacraid/commsup.c so that we can report this condition accurately without harm? 3 minutes was considered a possible start-up time for the Adapter with a fully populated set of targets, so even without the ACPI problem reported here, we still have to ride through this worst-case Firmware Startup.

Sincerely -- Mark Salyzyn
Comment 13 TJ 2008-04-08 13:58:22 UTC
Created attachment 15677 [details]
v.2.6.25-rc8 acpi-pci-bus debug failed boot log

This log contains the printk() function entry/exit messages mentioned in comment #11.

There are a *lot* of messages since the kernel command line was: "debug acpi.debug_level=0xff acpi.debug_layer=0x1f"

Of the three root buses, the first one is always detected correctly. In this log that begins at 19.363070.

The second root bus handling begins at 36.936161 and third root bus at 37.023840.

Both the latter fail with a similar pattern:

[   36.910354] bus: 'acpi': driver_probe_device: matched device PNP0A03:01 with driver pci_root
[   36.914329] bus: 'acpi': really_probe: probing driver pci_root with device PNP0A03:01
[   36.918350]  nsutils-0869 [03] ns_get_node           : _SEG, AE_NOT_FOUND
[   36.928169] Execute Method: [\_SB_.PX0B._BBN] (Node f5412d20)
[   36.936161] ACPI: PCI Root Bridge [PX0B] (0000:02)
[   36.938332] PCI: arch/x86/pci/acpi.c::pci_acpi_scan_root(device, 0, 2)
[   36.942336] PCI: drivers/pci/probe.c::pci_scan_bus_parented()
[   36.946329] PCI: drivers/pci/probe.c::pci_create_bus()
[   36.950331] PCI: drivers/pci/search.c::pci_find_bus(0, 2)
[   36.954331] PCI: drivers/pci/search.c::pci_do_find_bus(bus, 2)
[   36.958330] PCI: drivers/pci/search.c::pci_do_find_bus(bus, 2)
[   36.962330] PCI: drivers/pci/search.c::pci_do_find_bus()=NULL
[   36.966330] PCI: drivers/pci/search.c::pci_do_find_bus()=NULL
[   36.970332] PCI: drivers/pci/search.c::pci_do_find_bus(bus, 2)
[   36.974332] PCI: drivers/pci/search.c::pci_do_find_bus()=F5462C00
[   36.978332] PCI: drivers/pci/search.c::pci_find_bus()=F5462C00
[   36.982332] PCI: drivers/pci/probe.c::pci_create_bus()=NULL
[   36.986331] PCI: pci_scan_bus_parented()=NULL
[   36.990332] PCI: arch/x86/pci/acpi.c::pci_acpi_scan_root()=NULL
[   36.994333] ACPI: Bus 0000:02 not present in PCI namespace

Unfortunately I neglected to report each of the exit paths from

drivers/pci/probe.c::pci_create_bus()

So I'm adding those, rebuilding, and will report back when I have a boot log with those included.

Without that though, what I can't help wondering about is the mismatch between the ACPI (DSDT) allocated bus numbers (0, 2, 3) and the PNP0A03 numbers (0,1,2) - I'm hoping those are simply indexes into a PNP0A03 table with one for each discovered bus. E.g.

[   36.914329] bus: 'acpi': really_probe: probing driver pci_root with device PNP0A03:01
[   36.936161] ACPI: PCI Root Bridge [PX0B] (0000:02)

Of the four possible exit paths from

drivers/pci/probe.c::pci_create_bus()

the first, "err_out", would result in a kernel message "PCI: Bus %04x:%02x already known\n" and that doesn't show up and so the exit path must be one of the remaining three.

The 2nd is a device_register() error (dev_reg_err)
The 3rd is a device_register() error (class_dev_reg_err)
The 4th is a device_create_file() error (dev_create_file_err)
Comment 14 TJ 2008-04-08 23:14:28 UTC
Created attachment 15685 [details]
v2.6.25-rc8 acpi-pci-bus debug "PINPOINTED" failed boot log

Well it had to prove me wrong. The cause of the issue is "Bus already known", despite what I speculated in comment #13 (unlucky for some!).

For some reason the call to pr_debug() in drivers/pci/probe.c::pci_create_bus() isn't resulting in a kernel message so it misled my reasoning as to the error.

if (pci_find_bus(pci_domain_nr(b), bus)) {
 /* If we already got to this bus through a different bridge, ignore it */
 pr_debug("PCI: Bus %04x:%02x already known\n", pci_domain_nr(b), bus);
 printk(KERN_INFO "PCI: drivers/pci/probe.c::pci_create_bus() Bus already known\n");
 goto err_out;
}

[   37.222374] bus: 'acpi': driver_probe_device: matched device PNP0A03:01 with driver pci_root
[   37.226348] bus: 'acpi': really_probe: probing driver pci_root with device PNP0A03:01
[   37.230370]  nsutils-0869 [03] ns_get_node           : _SEG, AE_NOT_FOUND
[   37.238371] Execute Method: [\_SB_.PX0B._BBN] (Node f5412d20)
[   37.247233] ACPI: PCI Root Bridge [PX0B] (0000:02)
[   37.250352] PCI: arch/x86/pci/acpi.c::pci_acpi_scan_root(device, 0, 2)
[   37.254361] PCI: drivers/pci/probe.c::pci_scan_bus_parented()
[   37.258349] PCI: drivers/pci/probe.c::pci_create_bus()
[   37.262351] PCI: drivers/pci/search.c::pci_find_bus(0, 2)
[   37.266350] PCI: drivers/pci/search.c::pci_find_next_bus(0)
[   37.270351] PCI: drivers/pci/search.c::pci_find_next_bus(F5462800)
[   37.274351] PCI: drivers/pci/search.c::pci_do_find_bus(bus, 2)
[   37.278351] PCI: drivers/pci/search.c::pci_do_find_bus(bus, 2)
[   37.282350] PCI: drivers/pci/search.c::pci_do_find_bus()=NULL
[   37.286350] PCI: drivers/pci/search.c::pci_do_find_bus()=NULL
[   37.290351] PCI: drivers/pci/search.c::pci_find_next_bus(F5462800)
[   37.294352] PCI: drivers/pci/search.c::pci_find_next_bus(F5462C00)
[   37.298352] PCI: drivers/pci/search.c::pci_do_find_bus(bus, 2)
[   37.302352] PCI: drivers/pci/search.c::pci_do_find_bus()=F5462C00
[   37.306352] PCI: drivers/pci/search.c::pci_find_bus()=F5462C00
[   37.310352] PCI: drivers/pci/probe.c::pci_create_bus() Bus already known
[   37.314353] PCI: drivers/pci/probe.c::pci_create_bus()=NULL
[   37.318352] PCI: pci_scan_bus_parented()=NULL
[   37.322354] PCI: arch/x86/pci/acpi.c::pci_acpi_scan_root()=NULL
[   37.326355] ACPI: Bus 0000:02 not present in PCI namespace

Looking back through the log for references to the buses I found:

[    4.382109] device: 'PNP0A03:00': device_add
[    4.384305] PM: Adding info for acpi:PNP0A03:00
[    4.388291] bus: 'acpi': add device PNP0A03:00

[   11.666574] device: 'PNP0A03:01': device_add
[   11.668762] PM: Adding info for acpi:PNP0A03:01
[   11.672747] bus: 'acpi': add device PNP0A03:01

[   11.824765] device: 'PNP0A03:02': device_add
[   11.828773] PM: Adding info for acpi:PNP0A03:02
[   11.832757] bus: 'acpi': add device PNP0A03:02

[   19.401234] device: 'pci0000:00': device_add
[   19.405256] PM: Adding info for No Bus:pci0000:00
[   19.409237] device: '0000:00': device_add
[   19.413257] PM: Adding info for No Bus:0000:00
[   19.425226] PCI: drivers/pci/probe.c::pci_create_bus()=F5462800

[   19.481239] device: 'pci0000:02': device_add
[   19.485255] PM: Adding info for No Bus:pci0000:02
[   19.489241] device: '0000:02': device_add
[   19.493261] PM: Adding info for No Bus:0000:02
[   19.497312] PCI: drivers/pci/probe.c::pci_create_bus()=F5462C00

[   19.593247] device: 'pci0000:03': device_add
[   19.597262] PM: Adding info for No Bus:pci0000:03
[   19.601248] device: '0000:03': device_add
[   19.605267] PM: Adding info for No Bus:0000:03
[   19.609310] PCI: drivers/pci/probe.c::pci_create_bus()=F550E000

The ACPI PNP0A03 devices are:

PX0A: _UID 1, _BBN 0,  _ADR 0x00120000, pci_bus=0xF5462800, device 0000:0
PX0B: _UID 2, _BBN 2?, _ADR_0x00130000, pci_bus=0xF5462C00, device 0000:2
PX1A: _UID 3, _BBN 3?, _ADR 0x00140000, pci_bus=0xF550E000, device 0000:3

The ? above indicates _BBN is calculated in a method and the values I show here are the ones reported by ACPI as bus number.

So buses 0:2 and 0:3 are created when the first root bridge (PX0A) is scanned. That causes the later scan/create in ACPI/PCI to fail with the "Bus already known" reason.

[   37.306352] PCI: drivers/pci/search.c::pci_find_bus()=F5462C00
[   37.310352] PCI: drivers/pci/probe.c::pci_create_bus() Bus already known

[   37.430360] PCI: drivers/pci/search.c::pci_find_bus()=F550E000
[   37.434360] PCI: drivers/pci/probe.c::pci_create_bus() Bus already known

Here's the call logic when 0000:0 has been created and is being scanned itself:

[   19.369259] ACPI: PCI Root Bridge [PX0A] (0000:00)
[   19.373231] PCI: arch/x86/pci/acpi.c::pci_acpi_scan_root(device, 0, 0)
[   19.377237] PCI: drivers/pci/probe.c::pci_scan_bus_parented()
[   19.381229] PCI: drivers/pci/probe.c::pci_create_bus()
[   19.385232] PCI: drivers/pci/search.c::pci_find_bus(0, 0)
[   19.389230] PCI: drivers/pci/search.c::pci_find_next_bus(0)
[   19.393230] PCI: drivers/pci/search.c::pci_find_next_bus(0)
[   19.397230] PCI: drivers/pci/search.c::pci_find_bus()=NULL
[   19.401234] device: 'pci0000:00': device_add
[   19.405256] PM: Adding info for No Bus:pci0000:00
[   19.409237] device: '0000:00': device_add
[   19.413257] PM: Adding info for No Bus:0000:00
[   19.425226] PCI: drivers/pci/probe.c::pci_create_bus()=F5462800
[   19.425495] * Found PM-Timer Bug on the chipset. Due to workarounds for a bug,
[   19.425503] * this clock source is slow. Consider trying other clock sources
[   19.429300] pci 0000:00:02.3: quirk: region 0800-083f claimed by PIIX4 ACPI
[   19.433238] pci 0000:00:02.3: quirk: region 0850-085f claimed by PIIX4 SMB

[   19.437545] pci 0000:00:10.0: Searching for i450NX host bridges

[   19.441239] PCI: drivers/pci/probe.c::pci_scan_bus_parented()
[   19.445233] PCI: drivers/pci/probe.c::pci_create_bus()
[   19.449235] PCI: drivers/pci/search.c::pci_find_bus(0, 2)
[   19.453233] PCI: drivers/pci/search.c::pci_find_next_bus(0)
[   19.457234] PCI: drivers/pci/search.c::pci_find_next_bus(F5462800)
[   19.461235] PCI: drivers/pci/search.c::pci_do_find_bus(bus, 2)
[   19.465234] PCI: drivers/pci/search.c::pci_do_find_bus()=NULL
[   19.469235] PCI: drivers/pci/search.c::pci_find_next_bus(F5462800)
[   19.473235] PCI: drivers/pci/search.c::pci_find_next_bus(0)
[   19.477234] PCI: drivers/pci/search.c::pci_find_bus()=NULL
[   19.481239] device: 'pci0000:02': device_add
[   19.485255] PM: Adding info for No Bus:pci0000:02
[   19.489241] device: '0000:02': device_add
[   19.493261] PM: Adding info for No Bus:0000:02
[   19.497312] PCI: drivers/pci/probe.c::pci_create_bus()=F5462C00

So it appears the issue is caused by an interaction with the i450NX fix-up code in:

arch/x86/pci/fixup.c::pci_fixup_i450nx()

I've been looking at the git commit logs for this (formerly known as arch/i386/pci/fixup.c) and the ACPI/PCI source files but so far haven't noticed anything that wouls explain this - certainly not between v2.6.20 and v2.6.22.

I think this is the point where the experts need to get involved.
Comment 15 TJ 2008-04-08 23:31:48 UTC
I found this reference to the issue in AKM's 2.6.0 mm tree and the linux-scsi mailing list archive:

"I can tell you what's going on here.  This is a 450NX based
motherboard.  The 450NX chipset from Intel was the first chipset to have
peer PCI busses.  For backwards compatibility, some machine makers
hacked their PCI BIOS to have a fake bridge device on PCI bus 0 that
points to the same bus number as the peer bus.  This way if the OS
didn't know about the peer bus registers it would still find the devices
by scanning behind the bridge.  In this case we are scanning behind this
fake bridge and then also scanning based upon the peer bus registers in
the chipset, and as a result we are finding the device twice.  In order
to fix this problem you need to change the peer bus quirk code for the
450NX chipset to scan the list of bus 0 devices looking for a bridge
that has the same config as the peer bus registers and if so delete the
bridge from the list.  That will avoid double scanning and will avoid
having the PCI code try and configure sub busses via a fake bridge when
it should do all configurations via the 450NX peer bus registers.

-- 
  Doug Ledford <dledford@redhat.com>"


http://marc.info/?l=linux-scsi&m=106839680416899&w=2
Comment 16 TJ 2008-04-09 00:30:51 UTC
I have created a patch of arch/x86/pci/fixup.c::pci_fixup_i450nx() that does a dmi_check_system() against a struct dmi_system_id with a DMI_MATCH pair. If the ident matches it *doesn't* do the i450NX secondary bus scan.

I'll report back once the revised kernel has been built and tested.
Comment 17 TJ 2008-04-09 03:45:00 UTC
Created attachment 15691 [details]
v2.6.25-rc8 acpi-pci-bus debug "FIXED" *successful* boot log

My patch appears to have fixed the issue, although I'd like some feedback on the patch itself and the boot log in case it could cause knock-on effects due to the way the buses are detected and scanned now.
Comment 18 TJ 2008-04-09 04:01:12 UTC
Created attachment 15692 [details]
Proposed patch to i450NX fix-up

Please check this and consider alongside the kernel boot log attached to the previous comment to ensure no unexpected side effects.
Comment 19 TJ 2008-04-09 06:49:11 UTC
Created attachment 15701 [details]
v2.6.25-rc8 acpi-pci-bus *no-debug* "FIXED" successful boot log

Another kernel log, this time without kernel DEBUG messages (still contains my trace-path printk() reports)
Comment 20 TJ 2008-04-09 07:08:40 UTC
Created attachment 15702 [details]
Proposed patch to i450NX fix-up

Corrected the number of arguments used by DBG()
Comment 21 Zhang Rui 2008-04-09 18:25:24 UTC
Re-assigning to TJ as this is not an ACPI bug.
Comment 22 TJ 2008-04-10 03:16:00 UTC
Created attachment 15712 [details]
Revised less-wordy patch as submitted to linux-pci mailing list
Comment 23 TJ 2008-04-10 16:06:06 UTC
Created attachment 15730 [details]
Much better ACPI patch from Matthew Wilcox

This was provided on the linux-pci mailing list. I have tested it and it works and is much better than my DMI matching solution.
Comment 24 TJ 2008-04-11 08:08:52 UTC
Zhao Yakui reported on linux-acpi that another bug #10124 and associated patch should fix this. I grabbed the patch mentioned:

http://kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.25-rc8/2.6.25-rc8-mm2/broken-out/acpi-unneccessary-to-scan-the-pci-bus-already-scanned.patch

and confirmed it does solve this issue too.

Marking this bug as a duplicate.

*** This bug has been marked as a duplicate of bug 10124 ***

Note You need to log in before you can comment on or make changes to this bug.