Bug 11045

Summary: Bug in MPT Fusion 2.6.26-rc7 unbootable
Product: SCSI Drivers
Reporter: Kurk (kurk)
Component: Other
Assignee: scsi_drivers-other
Status: CLOSED CODE_FIX
Severity: normal
CC: bunk
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 2.6.26-rc7
Subsystem:
Regression: Yes
Bisected commit-id:
Bug Depends on:
Bug Blocks: 10492
Attachments: Boot messages with quiet disabled, serial dump

Description Kurk 2008-07-06 11:22:06 UTC
Latest working kernel version: 2.6.25
Earliest failing kernel version: 2.6.26-rc7
Distribution: Debian (but vanilla kernel)
Hardware Environment: IBM xSeries 335
Software Environment: error and hangup at boot
Problem Description: MPT Fusion error, unbootable, see below
Steps to reproduce: see below
----
Detailed description:

Hi all,
I'm no kernel expert, so I hope I made no mistakes in this report. It seems to me that a bug was introduced into the MPT Fusion driver somewhere between 2.6.25 and 2.6.26-rc7.

I compiled 2.6.26-rc7 on a machine with an LSI53C1030 controller and it cannot boot. 2.6.25, built with basically the same config file, boots without problems.

I tried to forward-port the Fusion driver from 2.6.25 to 2.6.26-rc7 by simply copying the directory drivers/message/fusion/ from 2.6.25 over the one in 2.6.26-rc7, but unfortunately this does not compile, so I am stuck, unable to use 2.6.26 on this machine. (I have not actually tried versions of 2.6.26 earlier than rc7; I don't have much time right now.)

I connected a serial cable to capture the boot error messages and obtained two dumps on different boots. Both are pasted at the end of this post.


This is the verbose lspci of the controller (obtained with 2.6.25):
----------------------------------------
01:01.0 SCSI storage controller: LSI Logic / Symbios Logic 53c1030 PCI-X Fusion-MPT Dual Ultra320 SCSI (rev 07)
        Subsystem: IBM Unknown device 026d
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx-
        Status: Cap+ 66MHz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 72 (4250ns min, 4500ns max), Cache Line Size: 32 bytes
        Interrupt: pin A routed to IRQ 22
        Region 0: I/O ports at 2300 [size=256]
        Region 1: Memory at fbff0000 (64-bit, non-prefetchable) [size=64K]
        Region 3: Memory at fbfe0000 (64-bit, non-prefetchable) [size=64K]
        Capabilities: [50] Power Management version 2
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [58] Message Signalled Interrupts: Mask- 64bit+ Queue=0/0 Enable-
                Address: 0000000000000000  Data: 0000
        Capabilities: [68] PCI-X non-bridge device
                Command: DPERE- ERO- RBC=512 OST=1
                Status: Dev=01:01.0 64bit+ 133MHz+ SCD- USC- DC=simple DMMRBC=2048 DMOST=8 DMCRS=16 RSCEM- 266MHz- 533MHz-
        Kernel driver in use: mptspi
        Kernel modules: mptspi
----------------------------------------


This is an excerpt of the dmesg on 2.6.25 where the controller WORKS:
--------------------------------------------------------------------
Fusion MPT base driver 3.04.06
Copyright (c) 1999-2007 LSI Corporation
Fusion MPT SPI Host driver 3.04.06
...
mptbase: ioc0: Initiating bringup
...
ioc0: LSI53C1030 B2: Capabilities={Initiator}
Probing IDE interface ide1...
hdc: LG CD-ROM CRN-8245B, ATAPI CD/DVD-ROM drive
scsi0 : ioc0: LSI53C1030 B2, FwRev=01000e00h, Ports=1, MaxQ=222, IRQ=22
...
scsi0 : ioc0: LSI53C1030 B2, FwRev=01000e00h, Ports=1, MaxQ=222, IRQ=22
hdc: host max PIO4 wanted PIO255(auto-tune) selected PIO4
hdc: UDMA/33 mode selected
ide1 at 0x170-0x177,0x376 on irq 15
tg3.c:v3.90 (April 12, 2008)
ACPI: PCI Interrupt 0000:02:01.0[A] -> GSI 24 (level, low) -> IRQ 24
scsi 0:0:0:0: Direct-Access     IBM-ESXS DTN018C1UCDY10F  S23J PQ: 0 ANSI: 3
 target0:0:0: Beginning Domain Validation
 target0:0:0: Ending Domain Validation
 target0:0:0: FAST-80 WIDE SCSI 160.0 MB/s DT (12.5 ns, offset 127)
scsi 0:0:1:0: Direct-Access     IBM-ESXS DTN018C1UCDY10F  S23J PQ: 0 ANSI: 3
 target0:0:1: Beginning Domain Validation
...
ACPI: PCI Interrupt 0000:02:02.0[A] -> GSI 25 (level, low) -> IRQ 25
 target0:0:1: Ending Domain Validation
 target0:0:1: FAST-80 WIDE SCSI 160.0 MB/s DT (12.5 ns, offset 127)
...
hdc: ATAPI 24X CD-ROM drive, 128kB Cache
Uniform CD-ROM driver Revision: 3.20
scsi 0:0:8:0: Processor         IBM      25P3495a S320  1 1    PQ: 0 ANSI: 2
 target0:0:8: Beginning Domain Validation
 target0:0:8: Ending Domain Validation
 target0:0:8: asynchronous
Driver 'sd' needs updating - please use bus_type methods
sd 0:0:0:0: [sda] 35548320 512-byte hardware sectors (18201 MB)
sd 0:0:0:0: [sda] Write Protect is off
sd 0:0:0:0: [sda] Mode Sense: cb 00 00 08
sd 0:0:0:0: Attached scsi generic sg0 type 0
scsi 0:0:1:0: Attached scsi generic sg1 type 0
scsi 0:0:8:0: Attached scsi generic sg2 type 3
--------------------------------------------------------------------


This is an x86 32-bit build. Here is the excerpt of the .config file, grepping for FUSION:
------------------------------------
CONFIG_FUSION=y
CONFIG_FUSION_SPI=m
CONFIG_FUSION_FC=m
CONFIG_FUSION_SAS=m
CONFIG_FUSION_MAX_SGE=40
CONFIG_FUSION_CTL=m
CONFIG_FUSION_LAN=m
# CONFIG_FUSION_LOGGING is not set
------------------------------------
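
For anyone comparing configurations between the working and failing kernels, the excerpt above can also be produced programmatically. The following is only a minimal Python sketch, assuming the .config is the usual plain-text format with `CONFIG_X=value` lines and `# CONFIG_X is not set` comments; the function name `fusion_options` is hypothetical and not part of any kernel tooling.

```python
import re

# Patterns for the two line shapes found in a kernel .config file.
_SET = re.compile(r"^(CONFIG_\w+)=(.*)$")
_UNSET = re.compile(r"^# (CONFIG_\w+) is not set$")


def fusion_options(config_text):
    """Return a dict of FUSION-related options from .config text.

    Enabled options map to their value as a string ("y", "m", "40", ...);
    options reported as "is not set" map to None.
    """
    options = {}
    for line in config_text.splitlines():
        line = line.strip()
        match = _SET.match(line)
        if match:
            name, value = match.group(1), match.group(2)
        else:
            match = _UNSET.match(line)
            if not match:
                continue  # blank line or unrelated comment
            name, value = match.group(1), None
        if "FUSION" in name:
            options[name] = value
    return options


if __name__ == "__main__":
    sample = "\n".join([
        "CONFIG_SCSI=y",
        "CONFIG_FUSION=y",
        "CONFIG_FUSION_SPI=m",
        "CONFIG_FUSION_MAX_SGE=40",
        "# CONFIG_FUSION_LOGGING is not set",
    ])
    for name, value in fusion_options(sample).items():
        print(name, value)
```

Running the same function over the 2.6.25 and 2.6.26-rc7 config files and comparing the two dicts would show whether any Fusion option changed between the builds.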



This is the boot error output captured over the serial cable. I left it running for 8 minutes to get this. It loops, so the messages never end.
--------------------------------------------------------------------

ACPI: Resource is not an IRQ entry

ACPI: Resource is not an IRQ entry

ACPI: Resource is not an IRQ entry

ACPI: Resource is not an IRQ entry

ACPI: Resource is not an IRQ entry

ACPI: Resource is not an IRQ entry

ACPI: Resource is not an IRQ entry

ACPI: Resource is not an IRQ entry

ACPI: Resource is not an IRQ entry

ACPI: Resource is not an IRQ entry

ACPI: Resource is not an IRQ entry

ACPI: Resource is not an IRQ entry

ACPI: Resource is not an IRQ entry

ACPI: Resource is not an IRQ entry

ACPI: Resource is not an IRQ entry

ACPI: Resource is not an IRQ entry

ACPI: Resource is not an IRQ entry

ACPI: Resource is not an IRQ entry

ACPI: Resource is not an IRQ entry

ACPI: Resource is not an IRQ entry

ACPI: Resource is not an IRQ entry

ACPI: Resource is not an IRQ entry

ACPI: Resource is not an IRQ entry

ACPI: Resource is not an IRQ entry

ACPI: Resource is not an IRQ entry

ACPI: Resource is not an IRQ entry

ACPI: Resource is not an IRQ entry

mptbase: ioc0: ERROR - Doorbell ACK timeout (count=4999), IntStatus=80000009!

BUG: unable to handle kernel NULL pointer dereference at 0000034c

IP: [<f885cc5e>] :mptspi:mptspi_dv_renegotiate_work+0xa/0x9f

Oops: 0000 [#1] SMP

Modules linked in: ide_pci_generic(+) floppy mptspi(+) mptscsih ohci_hcd tg3 mptbase scsi_transport_spi usbcore serverworks ide_core ata_generic libata scsi_mod dock thermal processor fan thermal_sys



Pid: 9, comm: events/0 Not tainted (2.6.26-rc7 #1)

EIP: 0060:[<f885cc5e>] EFLAGS: 00010282 CPU: 0

EIP is at mptspi_dv_renegotiate_work+0xa/0x9f [mptspi]

EAX: f7a447c0 EBX: f7429900 ECX: f7a447c4 EDX: c1908988

ESI: f7a447c0 EDI: 0000034c EBP: f7429904 ESP: f7477f80

 DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068

Process events/0 (pid: 9, ti=f7476000 task=f744d770 task.ti=f7476000)

Stack: f744d8e0 c190b260 00000000 c1908984 f7429900 f7a447c0 f885cc54 f7429904

       c012f253 f7429900 c012f934 f742990c 00000000 c012f9e8 00000000 f744d770

       c0131bdc f7477fc4 f7477fc4 f7429900 c012f934 00000000 c0131b1b c0131ae3

Call Trace:

 [<f885cc54>] mptspi_dv_renegotiate_work+0x0/0x9f [mptspi]

 [<c012f253>] run_workqueue+0x75/0xf6

 [<c012f934>] worker_thread+0x0/0xbf

 [<c012f9e8>] worker_thread+0xb4/0xbf

 [<c0131bdc>] autoremove_wake_function+0x0/0x2b

 [<c012f934>] worker_thread+0x0/0xbf

 [<c0131b1b>] kthread+0x38/0x5d

 [<c0131ae3>] kthread+0x0/0x5d

 [<c0104573>] kernel_thread_helper+0x7/0x10

 =======================

Code: 70 e8 9e f8 ff ff 8b 47 70 e8 44 b7 fe ff 8b 47 70 5a 5b 5e 5f 5d e9 89 f8 ff ff 58 5b 5e 5f 5d c3 55 57 56 53 83 ec 10 8b 78 10 <8b> 2f e8 c7 98 90 c7 66 83 bf 96 02 00 00 00 8b 85 3c 01 00 00

EIP: [<f885cc5e>] mptspi_dv_renegotiate_work+0xa/0x9f [mptspi] SS:ESP 0068:f7477f80

---[ end trace e311270f757682e4 ]---

mptbase: ioc0: Initiating recovery

mptbase: ioc0: WARNING - IOC is in FAULT state!!!

mptbase: ioc0: WARNING -            FAULT code = 8112h

mptbase: ioc0: ERROR - Doorbell ACK timeout (count=4999), IntStatus=80000009!

mptbase: ioc0: Recovered from IOC FAULT

mptbase: ioc0: Initiating recovery

mptbase: ioc0: WARNING - IOC is in FAULT state!!!

mptbase: ioc0: WARNING -            FAULT code = 8112h

mptbase: ioc0: ERROR - Doorbell ACK timeout (count=4999), IntStatus=80000009!

mptbase: ioc0: Recovered from IOC FAULT

mptbase: ioc0: Initiating recovery

mptbase: ioc0: WARNING - IOC is in FAULT state!!!

mptbase: ioc0: WARNING -            FAULT code = 8112h

mptbase: ioc0: ERROR - Doorbell ACK timeout (count=4999), IntStatus=80000009!

mptbase: ioc0: Recovered from IOC FAULT

mptbase: ioc0: Initiating recovery

mptbase: ioc0: WARNING - IOC is in FAULT state!!!

mptbase: ioc0: WARNING -            FAULT code = 8112h

mptbase: ioc0: ERROR - Doorbell ACK timeout (count=4999), IntStatus=80000009!

mptbase: ioc0: Recovered from IOC FAULT

mptbase: ioc0: Initiating recovery

mptbase: ioc0: WARNING - IOC is in FAULT state!!!

mptbase: ioc0: WARNING -            FAULT code = 8112h

mptbase: ioc0: ERROR - Doorbell ACK timeout (count=4999), IntStatus=80000009!

mptbase: ioc0: Recovered from IOC FAULT

mptbase: ioc0: Initiating recovery

mptbase: ioc0: WARNING - IOC is in FAULT state!!!

mptbase: ioc0: WARNING -            FAULT code = 8112h

mptbase: ioc0: ERROR - Doorbell ACK timeout (count=4999), IntStatus=80000009!

mptbase: ioc0: Recovered from IOC FAULT

scsi0 : ioc0: LSI53C1030 B2, FwRev=01000e00h, Ports=1, MaxQ=222, IRQ=223

 target0:0:0: mptspi: ioc0: dma_alloc_coherent for parameters failed

hdc: ATAPI 24X CD-ROM drive, 128kB Cache

Uniform CD-ROM driver Revision: 3.20

mptscsih: ioc0: attempting task abort! (sc=f7862e80)

scsi 0:0:0:0: CDB: Inquiry: 12 00 00 00 24 00

mptscsih: ioc0: WARNING - TM Handler for type=1: IOC Not operational (0x40008112)!

mptscsih: ioc0: WARNING -  Issuing HardReset!!

mptbase: ioc0: Initiating recovery

mptbase: ioc0: WARNING - IOC is in FAULT state!!!

mptbase: ioc0: WARNING -            FAULT code = 8112h

md: raid1 personality registered for level 1

device-mapper: ioctl: 4.13.0-ioctl (2007-10-18) initialised: dm-devel@redhat.com

mptbase: ioc0: ERROR - Doorbell ACK timeout (count=4999), IntStatus=80000009!

scsi 0:0:0:0: mptscsih: ioc0: completing cmds: fw_channel 0, fw_id 0, sc=f7862e80, mf = f7a62da0, idx=f

mptbase: ioc0: Recovered from IOC FAULT

mptscsih: ioc0: task abort: FAILED (sc=f7862e80)

mptscsih: ioc0: attempting target reset! (sc=f7862e80)

scsi 0:0:0:0: CDB: Inquiry: 12 00 00 00 24 00

 target0:0:0: mptspi: ioc0: dma_alloc_coherent for parameters failed

 target0:0:0: FAST-5 WIDE SCSI 2.4 MB/s ST RTI WRFLOW PCOMP (844 ns, offset 68)

 target0:0:0: mptspi: ioc0: dma_alloc_coherent for parameters failed

mptscsih: ioc0: Issue of TaskMgmt failed!

mptscsih: ioc0: target reset: FAILED (sc=f7862e80)

mptscsih: ioc0: attempting bus reset! (sc=f7862e80)

scsi 0:0:0:0: CDB: Inquiry: 12 00 00 00 24 00

mptscsih: ioc0: WARNING - TM Handler for type=4: IOC Not operational (0x40008112)!

mptscsih: ioc0: WARNING -  Issuing HardReset!!

mptbase: ioc0: Initiating recovery

mptbase: ioc0: WARNING - IOC is in FAULT state!!!

mptbase: ioc0: WARNING -            FAULT code = 8112h

mptbase: ioc0: ERROR - Doorbell ACK timeout (count=4999), IntStatus=80000009!

mptbase: ioc0: Recovered from IOC FAULT

mptscsih: ioc0: bus reset: FAILED (sc=f7862e80)

mptscsih: ioc0: attempting host reset! (sc=f7862e80)

mptbase: ioc0: Initiating recovery

 target0:0:0: mptspi: ioc0: dma_alloc_coherent for parameters failed

 target0:0:0: FAST-5 WIDE SCSI 2.4 MB/s ST RTI WRFLOW PCOMP (844 ns, offset 68)

 target0:0:0: mptspi: ioc0: dma_alloc_coherent for parameters failed

mptbase: ioc0: ERROR - Doorbell ACK timeout (count=4999), IntStatus=80000009!

mptscsih: ioc0: host reset: SUCCESS (sc=f7862e80)

scsi 0:0:0:0: Device offlined - not ready after error recovery

 target0:0:1: mptspi: ioc0: dma_alloc_coherent for parameters failed

 target0:0:1: mptspi: ioc0: dma_alloc_coherent for parameters failed

 target0:0:1: FAST-5 WIDE SCSI 2.4 MB/s ST RTI WRFLOW PCOMP (844 ns, offset 68)

 target0:0:1: mptspi: ioc0: dma_alloc_coherent for parameters failed

mptscsih: ioc0: attempting task abort! (sc=f7862e80)

scsi 0:0:1:0: CDB: Inquiry: 12 00 00 00 24 00

mptscsih: ioc0: WARNING - TM Handler for type=1: IOC Not operational (0x40008112)!

mptscsih: ioc0: WARNING -  Issuing HardReset!!

mptbase: ioc0: Initiating recovery

mptbase: ioc0: WARNING - IOC is in FAULT state!!!

mptbase: ioc0: WARNING -            FAULT code = 8112h

mptbase: ioc0: ERROR - Doorbell ACK timeout (count=4999), IntStatus=80000009!

scsi 0:0:1:0: mptscsih: ioc0: completing cmds: fw_channel 0, fw_id 1, sc=f7862e80, mf = f7a62f80, idx=14

mptbase: ioc0: Recovered from IOC FAULT

mptscsih: ioc0: task abort: FAILED (sc=f7862e80)

mptscsih: ioc0: attempting target reset! (sc=f7862e80)

scsi 0:0:1:0: CDB: Inquiry: 12 00 00 00 24 00

 target0:0:1: mptspi: ioc0: dma_alloc_coherent for parameters failed

 target0:0:1: FAST-5 WIDE SCSI 2.4 MB/s ST RTI WRFLOW PCOMP (844 ns, offset 68)

 target0:0:1: mptspi: ioc0: dma_alloc_coherent for parameters failed

mptscsih: ioc0: Issue of TaskMgmt failed!

mptscsih: ioc0: target reset: FAILED (sc=f7862e80)

mptscsih: ioc0: attempting bus reset! (sc=f7862e80)

scsi 0:0:1:0: CDB: Inquiry: 12 00 00 00 24 00

mptscsih: ioc0: WARNING - TM Handler for type=4: IOC Not operational (0x40008112)!

mptscsih: ioc0: WARNING -  Issuing HardReset!!

mptbase: ioc0: Initiating recovery

mptbase: ioc0: WARNING - IOC is in FAULT state!!!

mptbase: ioc0: WARNING -            FAULT code = 8112h

mptbase: ioc0: ERROR - Doorbell ACK timeout (count=4999), IntStatus=80000009!

mptbase: ioc0: Recovered from IOC FAULT

mptscsih: ioc0: bus reset: FAILED (sc=f7862e80)

mptscsih: ioc0: attempting host reset! (sc=f7862e80)

mptbase: ioc0: Initiating recovery

 target0:0:1: mptspi: ioc0: dma_alloc_coherent for parameters failed

 target0:0:1: FAST-5 WIDE SCSI 2.4 MB/s ST RTI WRFLOW PCOMP (844 ns, offset 68)

 target0:0:1: mptspi: ioc0: dma_alloc_coherent for parameters failed

mptbase: ioc0: ERROR - Doorbell ACK timeout (count=4999), IntStatus=80000009!

mptscsih: ioc0: host reset: SUCCESS (sc=f7862e80)

scsi 0:0:1:0: Device offlined - not ready after error recovery

 target0:0:2: mptspi: ioc0: dma_alloc_coherent for parameters failed

 target0:0:2: mptspi: ioc0: dma_alloc_coherent for parameters failed

 target0:0:2: FAST-5 WIDE SCSI 2.4 MB/s ST RTI WRFLOW PCOMP (844 ns, offset 68)

 target0:0:2: mptspi: ioc0: dma_alloc_coherent for parameters failed

mptscsih: ioc0: attempting task abort! (sc=f7862e80)

scsi 0:0:2:0: CDB: Inquiry: 12 00 00 00 24 00

mptscsih: ioc0: WARNING - TM Handler for type=1: IOC Not operational (0x40008112)!

mptscsih: ioc0: WARNING -  Issuing HardReset!!

mptbase: ioc0: Initiating recovery

mptbase: ioc0: WARNING - IOC is in FAULT state!!!

mptbase: ioc0: WARNING -            FAULT code = 8112h

mptbase: ioc0: ERROR - Doorbell ACK timeout (count=4999), IntStatus=80000009!

scsi 0:0:2:0: mptscsih: ioc0: completing cmds: fw_channel 0, fw_id 2, sc=f7862e80, mf = f7a63160, idx=19

mptbase: ioc0: Recovered from IOC FAULT

mptscsih: ioc0: task abort: FAILED (sc=f7862e80)

mptscsih: ioc0: attempting target reset! (sc=f7862e80)

scsi 0:0:2:0: CDB: Inquiry: 12 00 00 00 24 00

 target0:0:2: mptspi: ioc0: dma_alloc_coherent for parameters failed

 target0:0:2: FAST-5 WIDE SCSI 2.4 MB/s ST RTI WRFLOW PCOMP (844 ns, offset 68)

 target0:0:2: mptspi: ioc0: dma_alloc_coherent for parameters failed

mptscsih: ioc0: Issue of TaskMgmt failed!

mptscsih: ioc0: target reset: FAILED (sc=f7862e80)

mptscsih: ioc0: attempting bus reset! (sc=f7862e80)

scsi 0:0:2:0: CDB: Inquiry: 12 00 00 00 24 00

mptscsih: ioc0: WARNING - TM Handler for type=4: IOC Not operational (0x40008112)!

mptscsih: ioc0: WARNING -  Issuing HardReset!!

mptbase: ioc0: Initiating recovery

mptbase: ioc0: WARNING - IOC is in FAULT state!!!

mptbase: ioc0: WARNING -            FAULT code = 8112h

mptbase: ioc0: ERROR - Doorbell ACK timeout (count=4999), IntStatus=80000009!

mptbase: ioc0: Recovered from IOC FAULT

mptscsih: ioc0: bus reset: FAILED (sc=f7862e80)

mptscsih: ioc0: attempting host reset! (sc=f7862e80)

mptbase: ioc0: Initiating recovery

 target0:0:2: mptspi: ioc0: dma_alloc_coherent for parameters failed

 target0:0:2: FAST-5 WIDE SCSI 2.4 MB/s ST RTI WRFLOW PCOMP (844 ns, offset 68)

 target0:0:2: mptspi: ioc0: dma_alloc_coherent for parameters failed

mptbase: ioc0: ERROR - Doorbell ACK timeout (count=4999), IntStatus=80000009!

mptscsih: ioc0: host reset: SUCCESS (sc=f7862e80)

scsi 0:0:2:0: Device offlined - not ready after error recovery

 target0:0:3: mptspi: ioc0: dma_alloc_coherent for parameters failed

 target0:0:3: mptspi: ioc0: dma_alloc_coherent for parameters failed

 target0:0:3: FAST-5 WIDE SCSI 2.4 MB/s ST RTI WRFLOW PCOMP (844 ns, offset 68)

 target0:0:3: mptspi: ioc0: dma_alloc_coherent for parameters failed

mptscsih: ioc0: attempting task abort! (sc=f7862e80)

scsi 0:0:3:0: CDB: Inquiry: 12 00 00 00 24 00

mptscsih: ioc0: WARNING - TM Handler for type=1: IOC Not operational (0x40008112)!

mptscsih: ioc0: WARNING -  Issuing HardReset!!

mptbase: ioc0: Initiating recovery

mptbase: ioc0: WARNING - IOC is in FAULT state!!!

mptbase: ioc0: WARNING -
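
As a reading aid for the oops in the dump above, the key values can be pulled out of the log text mechanically. The Python below is a hedged sketch, not a diagnosis: it assumes the 32-bit x86 oops layout shown in this report, and the helper name `summarize_oops` is hypothetical.

```python
import re


def summarize_oops(log_text):
    """Extract the fault address, faulting function and EDI from an oops.

    Returns None if the expected lines are not all present.
    """
    addr = re.search(r"NULL pointer dereference at ([0-9a-f]+)", log_text)
    where = re.search(r"EIP is at (\w+)\+0x([0-9a-f]+)/0x([0-9a-f]+)", log_text)
    edi = re.search(r"EDI: ([0-9a-f]+)", log_text)
    if not (addr and where and edi):
        return None
    return {
        "fault_address": int(addr.group(1), 16),
        "function": where.group(1),
        "offset": int(where.group(2), 16),  # bytes into the function
        "size": int(where.group(3), 16),    # total function size in bytes
        "edi": int(edi.group(1), 16),
    }


if __name__ == "__main__":
    sample = "\n".join([
        "BUG: unable to handle kernel NULL pointer dereference at 0000034c",
        "EIP is at mptspi_dv_renegotiate_work+0xa/0x9f [mptspi]",
        "ESI: f7a447c0 EDI: 0000034c EBP: f7429904 ESP: f7477f80",
    ])
    info = summarize_oops(sample)
    print(info["function"], hex(info["offset"]), hex(info["fault_address"]))
    print("fault address equals EDI:", info["fault_address"] == info["edi"])
```

In the dump above the fault address matches the EDI register (0000034c), which suggests the work function read through a small invalid pointer very early on, at offset 0xa. That is only an observation from the register dump, not a confirmed root cause.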

I obtained this one in 5 minutes on another boot with a very slightly different .config file (no changes to the MPT Fusion options). You can see that the stack trace is slightly different.
--------------------------------------------------------------------

ACPI: Resource is not an IRQ entry

ACPI: Resource is not an IRQ entry

ACPI: Resource is not an IRQ entry

ACPI: Resource is not an IRQ entry

ACPI: Resource is not an IRQ entry

ACPI: Resource is not an IRQ entry

ACPI: Resource is not an IRQ entry

ACPI: Resource is not an IRQ entry

ACPI: Resource is not an IRQ entry

ACPI: Resource is not an IRQ entry

ACPI: Resource is not an IRQ entry

ACPI: Resource is not an IRQ entry

ACPI: Resource is not an IRQ entry

ACPI: Resource is not an IRQ entry

ACPI: Resource is not an IRQ entry

ACPI: Resource is not an IRQ entry

ACPI: Resource is not an IRQ entry

ACPI: Resource is not an IRQ entry

ACPI: Resource is not an IRQ entry

ACPI: Resource is not an IRQ entry

ACPI: Resource is not an IRQ entry

ACPI: Resource is not an IRQ entry

ACPI: Resource is not an IRQ entry

ACPI: Resource is not an IRQ entry

ACPI: Resource is not an IRQ entry

ACPI: Resource is not an IRQ entry

ACPI: Resource is not an IRQ entry

mptbase: ioc0: ERROR - Doorbell ACK timeout (count=4999), IntStatus=80000009!

BUG: unable to handle kernel NULL pointer dereference at 0000034c

IP: [<f8856c5e>] :mptspi:mptspi_dv_renegotiate_work+0xa/0x9f

Oops: 0000 [#1] SMP

Modules linked in: ide_pci_generic(+) floppy mptspi(+) mptscsih mptbase scsi_transport_spi ohci_hcd usbcore tg3 serverworks ide_core ata_generic libata scsi_mod dock thermal processor fan



Pid: 9, comm: events/0 Not tainted (2.6.26-rc7 #3)

EIP: 0060:[<f8856c5e>] EFLAGS: 00010282 CPU: 0

EIP is at mptspi_dv_renegotiate_work+0xa/0x9f [mptspi]

EAX: f783f480 EBX: f7429900 ECX: f783f484 EDX: c1908548

ESI: f783f480 EDI: 0000034c EBP: f7429904 ESP: f7477f80

 DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068

Process events/0 (pid: 9, ti=f7476000 task=f744d770 task.ti=f7476000)

Stack: 00000000 c012cea5 f7429900 c1908544 f7429900 f783f480 f8856c54 f7429904

       c012c82a f7429900 c012cee7 f742990c 00000000 c012cf9b 00000000 f744d770

       c012f190 f7477fc4 f7477fc4 f7429900 c012cee7 00000000 c012f0cf c012f097

Call Trace:

 [<c012cea5>] queue_delayed_work_on+0x9a/0xa6

 [<f8856c54>] mptspi_dv_renegotiate_work+0x0/0x9f [mptspi]

 [<c012c82a>] run_workqueue+0x6c/0xe4

 [<c012cee7>] worker_thread+0x0/0xbf

 [<c012cf9b>] worker_thread+0xb4/0xbf

 [<c012f190>] autoremove_wake_function+0x0/0x2b

 [<c012cee7>] worker_thread+0x0/0xbf

 [<c012f0cf>] kthread+0x38/0x5d

 [<c012f097>] kthread+0x0/0x5d

 [<c01043c3>] kernel_thread_helper+0x7/0x10

 =======================

Code: 70 e8 9e f8 ff ff 8b 47 70 e8 44 37 ff ff 8b 47 70 5a 5b 5e 5f 5d e9 89 f8 ff ff 58 5b 5e 5f 5d c3 55 57 56 53 83 ec 10 8b 78 10 <8b> 2f e8 ff cc 90 c7 66 83 bf 96 02 00 00 00 8b 85 3c 01 00 00

EIP: [<f8856c5e>] mptspi_dv_renegotiate_work+0xa/0x9f [mptspi] SS:ESP 0068:f7477f80

---[ end trace e7ec2a28a4a72094 ]---

mptbase: ioc0: Initiating recovery

mptbase: ioc0: WARNING - IOC is in FAULT state!!!

mptbase: ioc0: WARNING -            FAULT code = 8112h

mptbase: ioc0: ERROR - Doorbell ACK timeout (count=4999), IntStatus=80000009!

mptbase: ioc0: Recovered from IOC FAULT

mptbase: ioc0: Initiating recovery

mptbase: ioc0: WARNING - IOC is in FAULT state!!!

mptbase: ioc0: WARNING -            FAULT code = 8112h

mptbase: ioc0: ERROR - Doorbell ACK timeout (count=4999), IntStatus=80000009!

mptbase: ioc0: Recovered from IOC FAULT

mptbase: ioc0: Initiating recovery

mptbase: ioc0: WARNING - IOC is in FAULT state!!!

mptbase: ioc0: WARNING -            FAULT code = 8112h

mptbase: ioc0: ERROR - Doorbell ACK timeout (count=4999), IntStatus=80000009!

mptbase: ioc0: Recovered from IOC FAULT

mptbase: ioc0: Initiating recovery

mptbase: ioc0: WARNING - IOC is in FAULT state!!!

mptbase: ioc0: WARNING -            FAULT code = 8112h

mptbase: ioc0: ERROR - Doorbell ACK timeout (count=4999), IntStatus=80000009!

mptbase: ioc0: Recovered from IOC FAULT

mptbase: ioc0: Initiating recovery

mptbase: ioc0: WARNING - IOC is in FAULT state!!!

mptbase: ioc0: WARNING -            FAULT code = 8112h

mptbase: ioc0: ERROR - Doorbell ACK timeout (count=4999), IntStatus=80000009!

mptbase: ioc0: Recovered from IOC FAULT

mptbase: ioc0: Initiating recovery

mptbase: ioc0: WARNING - IOC is in FAULT state!!!

mptbase: ioc0: WARNING -            FAULT code = 8112h

mptbase: ioc0: ERROR - Doorbell ACK timeout (count=4999), IntStatus=80000009!

mptbase: ioc0: Recovered from IOC FAULT

scsi0 : ioc0: LSI53C1030 B2, FwRev=01000e00h, Ports=1, MaxQ=222, IRQ=223

 target0:0:0: mptspi: ioc0: dma_alloc_coherent for parameters failed

hdc: ATAPI 24X CD-ROM drive, 128kB Cache

Uniform CD-ROM driver Revision: 3.20

mptscsih: ioc0: attempting task abort! (sc=f7858e80)

scsi 0:0:0:0: CDB: Inquiry: 12 00 00 00 24 00

mptscsih: ioc0: WARNING - TM Handler for type=1: IOC Not operational (0x40008112)!

mptscsih: ioc0: WARNING -  Issuing HardReset!!

mptbase: ioc0: Initiating recovery

mptbase: ioc0: WARNING - IOC is in FAULT state!!!

mptbase: ioc0: WARNING -            FAULT code = 8112h

md: raid1 personality registered for level 1

device-mapper: ioctl: 4.13.0-ioctl (2007-10-18) initialised: dm-devel@redhat.com

mptbase: ioc0: ERROR - Doorbell ACK timeout (count=4999), IntStatus=80000009!

scsi 0:0:0:0: mptscsih: ioc0: completing cmds: fw_channel 0, fw_id 0, sc=f7858e80, mf = f79e2da0, idx=f

mptbase: ioc0: Recovered from IOC FAULT

mptscsih: ioc0: task abort: FAILED (sc=f7858e80)

mptscsih: ioc0: attempting target reset! (sc=f7858e80)

scsi 0:0:0:0: CDB: Inquiry: 12 00 00 00 24 00

mptscsih: ioc0: Issue of TaskMgmt failed!

mptscsih: ioc0: target reset: FAILED (sc=f7858e80)

mptscsih: ioc0: attempting bus reset! (sc=f7858e80)

scsi 0:0:0:0: CDB: Inquiry: 12 00 00 00 24 00

mptscsih: ioc0: WARNING - TM Handler for type=4: IOC Not operational (0x40008112)!

mptscsih: ioc0: WARNING -  Issuing HardReset!!

mptbase: ioc0: Initiating recovery

mptbase: ioc0: WARNING - IOC is in FAULT state!!!

mptbase: ioc0: WARNING -            FAULT code = 8112h

mptbase: ioc0: ERROR - Doorbell ACK timeout (count=4999), IntStatus=80000009!

mptbase: ioc0: Recovered from IOC FAULT

mptscsih: ioc0: bus reset: FAILED (sc=f7858e80)

mptscsih: ioc0: attempting host reset! (sc=f7858e80)

mptbase: ioc0: Initiating recovery

mptbase: ioc0: ERROR - Doorbell ACK timeout (count=4999), IntStatus=80000009!

mptscsih: ioc0: host reset: SUCCESS (sc=f7858e80)

scsi 0:0:0:0: Device offlined - not ready after error recovery

 target0:0:1: mptspi: ioc0: dma_alloc_coherent for parameters failed

mptscsih: ioc0: attempting task abort! (sc=f7858080)

scsi 0:0:1:0: CDB: Inquiry: 12 00 00 00 24 00

mptscsih: ioc0: WARNING - TM Handler for type=1: IOC Not operational (0x40008112)!

mptscsih: ioc0: WARNING -  Issuing HardReset!!

mptbase: ioc0: Initiating recovery

mptbase: ioc0: WARNING - IOC is in FAULT state!!!

mptbase: ioc0: WARNING -            FAULT code = 8112h

mptbase: ioc0: ERROR - Doorbell ACK timeout (count=4999), IntStatus=80000009!

scsi 0:0:1:0: mptscsih: ioc0: completing cmds: fw_channel 0, fw_id 1, sc=f7858080, mf = f79e2f80, idx=14

mptbase: ioc0: Recovered from IOC FAULT

mptscsih: ioc0: task abort: FAILED (sc=f7858080)

mptscsih: ioc0: attempting target reset! (sc=f7858080)

scsi 0:0:1:0: CDB: Inquiry: 12 00 00 00 24 00

 target0:0:1: mptspi: ioc0: dma_alloc_coherent for parameters failed

 target0:0:1: asynchronous

 target0:0:1: mptspi: ioc0: dma_alloc_coherent for parameters failed

mptscsih: ioc0: Issue of TaskMgmt failed!

mptscsih: ioc0: target reset: FAILED (sc=f7858080)

mptscsih: ioc0: attempting bus reset! (sc=f7858080)

scsi 0:0:1:0: CDB: Inquiry: 12 00 00 00 24 00

mptscsih: ioc0: WARNING - TM Handler for type=4: IOC Not operational (0x40008112)!

mptscsih: ioc0: WARNING -  Issuing HardReset!!

mptbase: ioc0: Initiating recovery

mptbase: ioc0: WARNING - IOC is in FAULT state!!!

mptbase: ioc0: WARNING -            FAULT code = 8112h

mptbase: ioc0: ERROR - Doorbell ACK timeout (count=4999), IntStatus=80000009!

mptbase: ioc0: Recovered from IOC FAULT

mptscsih: ioc0: bus reset: FAILED (sc=f7858080)

mptscsih: ioc0: attempting host reset! (sc=f7858080)

mptbase: ioc0: Initiating recovery

 target0:0:1: mptspi: ioc0: dma_alloc_coherent for par
Comment 1 Anonymous Emailer 2008-07-06 12:34:53 UTC
Reply-To: akpm@linux-foundation.org


(switched to email.  Please respond via emailed reply-to-all, not via the
bugzilla web interface).

On Sun,  6 Jul 2008 11:22:08 -0700 (PDT) bugme-daemon@bugzilla.kernel.org wrote:

> http://bugzilla.kernel.org/show_bug.cgi?id=11045
> 
>            Summary: Bug in MPT Fusion 2.6.26-rc7 unbootable
>            Product: Drivers
>            Version: 2.5
>      KernelVersion: 2.6.26-rc7
>           Platform: All
>         OS/Version: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: normal
>           Priority: P1
>          Component: Other
>         AssignedTo: drivers_other@kernel-bugs.osdl.org
>         ReportedBy: kurk@shiftmail.org
> 
> 
> Latest working kernel version: 2.6.25
> Earliest failing kernel version: 2.6.26-rc7
> Distribution: Debian (but vanilla kernel)
> Hardware Environment: IBM xSeries 335
> Software Environment: error and hangup at boot
> Problem Description: MPT Fusion error, unbootable, see below
> Steps to reproduce: see below

We have two bugs here.  One in mpt-fusion and what I suspect is a
post-2.6.25 regression in ACPI.


> Detailed description:
> 
> Hi all,
> I'm no kernel expert, I hope I made no mistakes in this report. It seems to
> me that a bug was added to the MPT Fusion driver in 2.6.26 (rc7).
> 
> I compiled 2.6.26-rc7 on a machine with controller LSI53C1030 and it cannot
> boot. Doing the same with 2.6.25, basically the same config file, boots
> without problems.
> 
> I tried to forward-port the Fusion driver from 2.6.25 to 2.6.26-rc7 by simply
> copying over the directory drivers/message/fusion/ from 2.6.25 to 2.6.26-rc7
> but unfortunately this doesn't compile, so I am stuck not being able to use
> 2.6.26 on this machine (actually I have not tried versions of 2.6.26 earlier
> than rc7... I don't have much time now).
> 
> I connected a serial cable in order to obtain the boot error message. I
> obtained two of those on different boots. I will paste these at the end of
> this post.
> 
> 
> This is the verbose lspci of the controller (obtained with 2.6.25):
> ----------------------------------------
> 01:01.0 SCSI storage controller: LSI Logic / Symbios Logic 53c1030 PCI-X Fusion-MPT Dual Ultra320 SCSI (rev 07)
>         Subsystem: IBM Unknown device 026d
>         Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx-
>         Status: Cap+ 66MHz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
>         Latency: 72 (4250ns min, 4500ns max), Cache Line Size: 32 bytes
>         Interrupt: pin A routed to IRQ 22
>         Region 0: I/O ports at 2300 [size=256]
>         Region 1: Memory at fbff0000 (64-bit, non-prefetchable) [size=64K]
>         Region 3: Memory at fbfe0000 (64-bit, non-prefetchable) [size=64K]
>         Capabilities: [50] Power Management version 2
>                 Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
>                 Status: D0 PME-Enable- DSel=0 DScale=0 PME-
>         Capabilities: [58] Message Signalled Interrupts: Mask- 64bit+ Queue=0/0 Enable-
>                 Address: 0000000000000000  Data: 0000
>         Capabilities: [68] PCI-X non-bridge device
>                 Command: DPERE- ERO- RBC=512 OST=1
>                 Status: Dev=01:01.0 64bit+ 133MHz+ SCD- USC- DC=simple DMMRBC=2048 DMOST=8 DMCRS=16 RSCEM- 266MHz- 533MHz-
>         Kernel driver in use: mptspi
>         Kernel modules: mptspi
> ----------------------------------------
> 
> 
> This is an excerpt of the dmesg on 2.6.25 where the controller WORKS:
> --------------------------------------------------------------------
> Fusion MPT base driver 3.04.06
> Copyright (c) 1999-2007 LSI Corporation
> Fusion MPT SPI Host driver 3.04.06
> ...
> mptbase: ioc0: Initiating bringup
> ...
> ioc0: LSI53C1030 B2: Capabilities={Initiator}
> Probing IDE interface ide1...
> hdc: LG CD-ROM CRN-8245B, ATAPI CD/DVD-ROM drive
> scsi0 : ioc0: LSI53C1030 B2, FwRev=01000e00h, Ports=1, MaxQ=222, IRQ=22
> ...
> scsi0 : ioc0: LSI53C1030 B2, FwRev=01000e00h, Ports=1, MaxQ=222, IRQ=22
> hdc: host max PIO4 wanted PIO255(auto-tune) selected PIO4
> hdc: UDMA/33 mode selected
> ide1 at 0x170-0x177,0x376 on irq 15
> tg3.c:v3.90 (April 12, 2008)
> ACPI: PCI Interrupt 0000:02:01.0[A] -> GSI 24 (level, low) -> IRQ 24
> scsi 0:0:0:0: Direct-Access     IBM-ESXS DTN018C1UCDY10F  S23J PQ: 0 ANSI: 3
>  target0:0:0: Beginning Domain Validation
>  target0:0:0: Ending Domain Validation
>  target0:0:0: FAST-80 WIDE SCSI 160.0 MB/s DT (12.5 ns, offset 127)
> scsi 0:0:1:0: Direct-Access     IBM-ESXS DTN018C1UCDY10F  S23J PQ: 0 ANSI: 3
>  target0:0:1: Beginning Domain Validation
> ...
> ACPI: PCI Interrupt 0000:02:02.0[A] -> GSI 25 (level, low) -> IRQ 25
>  target0:0:1: Ending Domain Validation
>  target0:0:1: FAST-80 WIDE SCSI 160.0 MB/s DT (12.5 ns, offset 127)
> ...
> hdc: ATAPI 24X CD-ROM drive, 128kB Cache
> Uniform CD-ROM driver Revision: 3.20
> scsi 0:0:8:0: Processor         IBM      25P3495a S320  1 1    PQ: 0 ANSI: 2
>  target0:0:8: Beginning Domain Validation
>  target0:0:8: Ending Domain Validation
>  target0:0:8: asynchronous
> Driver 'sd' needs updating - please use bus_type methods
> sd 0:0:0:0: [sda] 35548320 512-byte hardware sectors (18201 MB)
> sd 0:0:0:0: [sda] Write Protect is off
> sd 0:0:0:0: [sda] Mode Sense: cb 00 00 08
> sd 0:0:0:0: Attached scsi generic sg0 type 0
> scsi 0:0:1:0: Attached scsi generic sg1 type 0
> scsi 0:0:8:0: Attached scsi generic sg2 type 3
> --------------------------------------------------------------------
> 
> 
> It is an x86 32bit PC compile. This is the excerpt of the .config file
> grepping for FUSION
> ------------------------------------
> CONFIG_FUSION=y
> CONFIG_FUSION_SPI=m
> CONFIG_FUSION_FC=m
> CONFIG_FUSION_SAS=m
> CONFIG_FUSION_MAX_SGE=40
> CONFIG_FUSION_CTL=m
> CONFIG_FUSION_LAN=m
> # CONFIG_FUSION_LOGGING is not set
> ------------------------------------
> 
> 
> 
> This is the boot error message obtained with serial cable. I left it running
> for 8 minutes for this. It loops so the message never ends.
> --------------------------------------------------------------------
> 
> ACPI: Resource is not an IRQ entry
> 
> ACPI: Resource is not an IRQ entry
> 
> ACPI: Resource is not an IRQ entry
> 
> ACPI: Resource is not an IRQ entry
> 
> ACPI: Resource is not an IRQ entry
> 
> ACPI: Resource is not an IRQ entry
> 
> ACPI: Resource is not an IRQ entry
> 
> ACPI: Resource is not an IRQ entry
> 
> ACPI: Resource is not an IRQ entry
> 
> ACPI: Resource is not an IRQ entry
> 
> ACPI: Resource is not an IRQ entry
> 
> ACPI: Resource is not an IRQ entry
> 
> ACPI: Resource is not an IRQ entry
> 
> ACPI: Resource is not an IRQ entry
> 
> ACPI: Resource is not an IRQ entry
> 
> ACPI: Resource is not an IRQ entry
> 
> ACPI: Resource is not an IRQ entry
> 
> ACPI: Resource is not an IRQ entry
> 
> ACPI: Resource is not an IRQ entry
> 
> ACPI: Resource is not an IRQ entry
> 
> ACPI: Resource is not an IRQ entry
> 
> ACPI: Resource is not an IRQ entry
> 
> ACPI: Resource is not an IRQ entry
> 
> ACPI: Resource is not an IRQ entry
> 
> ACPI: Resource is not an IRQ entry
> 
> ACPI: Resource is not an IRQ entry
> 
> ACPI: Resource is not an IRQ entry

The acpi problem.

> mptbase: ioc0: ERROR - Doorbell ACK timeout (count=4999), IntStatus=80000009!
> 
> BUG: unable to handle kernel NULL pointer dereference at 0000034c
> 
> IP: [<f885cc5e>] :mptspi:mptspi_dv_renegotiate_work+0xa/0x9f
> 
> Oops: 0000 [#1] SMP
> 
> Modules linked in: ide_pci_generic(+) floppy mptspi(+) mptscsih ohci_hcd tg3
> mptbase scsi_transport_spi usbcore serverworks ide_core ata_generic libata
> scsi_mod dock thermal processor fan thermal_sys
> 
> 
> 
> Pid: 9, comm: events/0 Not tainted (2.6.26-rc7 #1)
> 
> EIP: 0060:[<f885cc5e>] EFLAGS: 00010282 CPU: 0
> 
> EIP is at mptspi_dv_renegotiate_work+0xa/0x9f [mptspi]
> 
> EAX: f7a447c0 EBX: f7429900 ECX: f7a447c4 EDX: c1908988
> 
> ESI: f7a447c0 EDI: 0000034c EBP: f7429904 ESP: f7477f80
> 
>  DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
> 
> Process events/0 (pid: 9, ti=f7476000 task=f744d770 task.ti=f7476000)
> 
> Stack: f744d8e0 c190b260 00000000 c1908984 f7429900 f7a447c0 f885cc54
> f7429904
> 
>        c012f253 f7429900 c012f934 f742990c 00000000 c012f9e8 00000000
>        f744d770
> 
>        c0131bdc f7477fc4 f7477fc4 f7429900 c012f934 00000000 c0131b1b
>        c0131ae3
> 
> Call Trace:
> 
>  [<f885cc54>] mptspi_dv_renegotiate_work+0x0/0x9f [mptspi]
> 
>  [<c012f253>] run_workqueue+0x75/0xf6
> 
>  [<c012f934>] worker_thread+0x0/0xbf
> 
>  [<c012f9e8>] worker_thread+0xb4/0xbf
> 
>  [<c0131bdc>] autoremove_wake_function+0x0/0x2b
> 
>  [<c012f934>] worker_thread+0x0/0xbf
> 
>  [<c0131b1b>] kthread+0x38/0x5d
> 
>  [<c0131ae3>] kthread+0x0/0x5d
> 
>  [<c0104573>] kernel_thread_helper+0x7/0x10
> 
>  =======================
> 
> Code: 70 e8 9e f8 ff ff 8b 47 70 e8 44 b7 fe ff 8b 47 70 5a 5b 5e 5f 5d e9 89
> f8 ff ff 58 5b 5e 5f 5d c3 55 57 56 53 83 ec 10 8b 78 10 <8b> 2f e8 c7 98 90
> c7
> 66 83 bf 96 02 00 00 00 8b 85 3c 01 00 00
> 
> EIP: [<f885cc5e>] mptspi_dv_renegotiate_work+0xa/0x9f [mptspi] SS:ESP
> 0068:f7477f80
> 
> ---[ end trace e311270f757682e4 ]---

mpt-fusion shouldn't oops, no matter what acpi did to it.
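
For readers following the oops above, here is a minimal sketch (not part of the original thread) of how the key facts can be pulled out of the quoted lines: the faulting address, the module and function with its offset, and which registers hold the faulting address. The helper name `summarize_oops` and the dictionary keys are hypothetical, and the regular expressions assume the 2.6.26-era 32-bit oops layout shown in this report.

```python
import re

# Lines copied from the serial dump quoted in this report.
OOPS = """\
BUG: unable to handle kernel NULL pointer dereference at 0000034c
IP: [<f885cc5e>] :mptspi:mptspi_dv_renegotiate_work+0xa/0x9f
EAX: f7a447c0 EBX: f7429900 ECX: f7a447c4 EDX: c1908988
ESI: f7a447c0 EDI: 0000034c EBP: f7429904 ESP: f7477f80
"""


def summarize_oops(text):
    """Hypothetical helper: extract fault address, symbol and matching registers."""
    fault = int(re.search(r"dereference at ([0-9a-f]+)", text).group(1), 16)

    # "IP: [<addr>] :module:function+0xOFFSET/0xSIZE"
    ip = re.search(
        r"IP: \[<[0-9a-f]+>\] :(\w+):(\w+)\+0x([0-9a-f]+)/0x([0-9a-f]+)", text
    )

    # General-purpose registers as printed on the EAX/ESI lines.
    regs = {
        name: int(value, 16)
        for name, value in re.findall(r"\b(E[A-Z]{2}): ([0-9a-f]{8})", text)
    }

    return {
        "fault_address": fault,
        "module": ip.group(1),
        "symbol": ip.group(2),
        "offset": int(ip.group(3), 16),
        "size": int(ip.group(4), 16),
        # Registers whose value equals the faulting address.
        "registers_matching_fault": sorted(
            name for name, value in regs.items() if value == fault
        ),
    }


print(summarize_oops(OOPS))
```

On the lines quoted here, the only register equal to the faulting address is EDI, which is consistent with the marked bytes `<8b> 2f` in the Code: line being a load through EDI.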
Comment 2 Kurk 2008-07-06 13:28:48 UTC
The ACPI thing, if it is a bug (I didn't realize that), was I think introduced much earlier than 2.6.26. I think the same ACPI error strings are visible in 2.6.24 and probably even much earlier kernels. However, they don't prevent booting. Thank you.
Comment 3 Adrian Bunk 2008-07-06 13:50:58 UTC
Thanks for this information.

Please open a separate bug for the ACPI problem.
Comment 4 Kurk 2008-07-07 02:36:45 UTC
OK I have opened bug 11049 for the ACPI thing.
Thank you.
Comment 5 Anonymous Emailer 2008-07-08 02:03:34 UTC
Reply-To: akpm@linux-foundation.org


You removed everyone from cc.  Please don't do that - there's not much
point in asking me to do things - this bug is reported by
kurk@shiftmail.org.

I don't know what "we do not assist with compiling drivers" can possibly
mean.  Eric, can you please help here?


On Mon, 7 Jul 2008 07:28:00 -0600 "Support, Software" <support@lsi.com> wrote:

>  Unfortunately,  we do not assist with compiling drivers.
> 
> I would recommend updating the firmware and BIOS on the controllers you are
> using, so that the compiled driver could communicate with the controller
> better.
> 
> In order to point you to the correct package for the controller that is not
> taking the compiled driver, I will need for you to send me all of the numbers
> off of the front and back of the controller.
> 
> -----Original Message-----
> From: Andrew Morton [mailto:akpm@linux-foundation.org]
> Sent: Sunday, July 06, 2008 3:34 PM
> To: linux-scsi@vger.kernel.org; linux-acpi@vger.kernel.org
> Cc: bugme-daemon@bugzilla.kernel.org; Moore, Eric; Support, Software
> Subject: Re: [Bugme-new] [Bug 11045] New: Bug in MPT Fusion 2.6.26-rc7
> unbootable
> 
> 
> (switched to email.  Please respond via emailed reply-to-all, not via the
> bugzilla web interface).
> 
> On Sun,  6 Jul 2008 11:22:08 -0700 (PDT) bugme-daemon@bugzilla.kernel.org
> wrote:
> 
> > http://bugzilla.kernel.org/show_bug.cgi?id=11045
> >
> >            Summary: Bug in MPT Fusion 2.6.26-rc7 unbootable
> >            Product: Drivers
> >            Version: 2.5
> >      KernelVersion: 2.6.26-rc7
> >           Platform: All
> >         OS/Version: Linux
> >               Tree: Mainline
> >             Status: NEW
> >           Severity: normal
> >           Priority: P1
> >          Component: Other
> >         AssignedTo: drivers_other@kernel-bugs.osdl.org
> >         ReportedBy: kurk@shiftmail.org
> >
> >
> > Latest working kernel version: 2.6.25
> > Earliest failing kernel version: 2.6.26-rc7
> > Distribution: Debian (but vanilla kernel) Hardware Environment: IBM
> > xSeries 335 Software Environment: error and hangup at boot Problem
> > Description: MPT Fusion error, unbootable, see below Steps to
> > reproduce: see below
> 
> We have two bugs here.  One in mpt-fusion and what I suspect is a
> post-2.6.25 regression in ACPI.
> 
> 
> > Detailed description:
> >
> > Hi all,
> > I'm no kernel expert, I hope I made no mistakes in this report. It
> > seems to me that a bug was added to the MPT Fusion driver in 2.6.26 (rc7).
> >
> > I compiled 2.6.26-rc7 on a machine with controller LSI53C1080 and it
> > cannot boot. Doing the same with 2.6.25, basically the same config
> > file, boots without problems.
> >
> > I tried to forward-port the Fusion driver from 2.6.25 to 2.6.26-rc7 by
> > simply copying over the directory drivers/message/fusion/ from 2.6.25
> > to 2.6.26-rc7 but unfortunately this doesn't compile, so I am stuck
> > not being able to use
> > 2.6.26 on this machine (actually I have not tried versions of 2.6.26
> > earlier than rc7... I don't have much time now).
> >
> > I connected a serial cable in order to obtain the boot error message.
> > I obtained two of those on different boots. I will paste these at the
> > end of this post.
> >
> >
> > This is the verbose lspci of the controller (obtained with 2.6.25):
> > ----------------------------------------
> > 01:01.0 SCSI storage controller: LSI Logic / Symbios Logic 53c1030
> > PCI-X Fusion-MPT Dual Ultra320 SCSI (rev 07)
> >         Subsystem: IBM Unknown device 026d
> >         Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop-
> > ParErr+
> > Stepping- SERR+ FastB2B- DisINTx-
> >         Status: Cap+ 66MHz+ UDF- FastB2B- ParErr- DEVSEL=medium
> > >TAbort-
> > <TAbort- <MAbort- >SERR- <PERR- INTx-
> >         Latency: 72 (4250ns min, 4500ns max), Cache Line Size: 32 bytes
> >         Interrupt: pin A routed to IRQ 22
> >         Region 0: I/O ports at 2300 [size=256]
> >         Region 1: Memory at fbff0000 (64-bit, non-prefetchable) [size=64K]
> >         Region 3: Memory at fbfe0000 (64-bit, non-prefetchable) [size=64K]
> >         Capabilities: [50] Power Management version 2
> >                 Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA
> > PME(D0-,D1-,D2-,D3hot-,D3cold-)
> >                 Status: D0 PME-Enable- DSel=0 DScale=0 PME-
> >         Capabilities: [58] Message Signalled Interrupts: Mask- 64bit+
> > Queue=0/0
> > Enable-
> >                 Address: 0000000000000000  Data: 0000
> >         Capabilities: [68] PCI-X non-bridge device
> >                 Command: DPERE- ERO- RBC=512 OST=1
> >                 Status: Dev=01:01.0 64bit+ 133MHz+ SCD- USC- DC=simple
> > DMMRBC=2048 DMOST=8 DMCRS=16 RSCEM- 266MHz- 533MHz-
> >         Kernel driver in use: mptspi
> >         Kernel modules: mptspi
> > ----------------------------------------
> >
> >
> > This is an excerpt of the dmesg on 2.6.25 where the controller WORKS:
> > --------------------------------------------------------------------
> > Fusion MPT base driver 3.04.06
> > Copyright (c) 1999-2007 LSI Corporation Fusion MPT SPI Host driver
> > 3.04.06 ...
> > mptbase: ioc0: Initiating bringup
> > ...
> > ioc0: LSI53C1030 B2: Capabilities={Initiator} Probing IDE interface
> > ide1...
> > hdc: LG CD-ROM CRN-8245B, ATAPI CD/DVD-ROM drive scsi0 : ioc0:
> > LSI53C1030 B2, FwRev=01000e00h, Ports=1, MaxQ=222, IRQ=22 ...
> > scsi0 : ioc0: LSI53C1030 B2, FwRev=01000e00h, Ports=1, MaxQ=222,
> > IRQ=22
> > hdc: host max PIO4 wanted PIO255(auto-tune) selected PIO4
> > hdc: UDMA/33 mode selected
> > ide1 at 0x170-0x177,0x376 on irq 15
> > tg3.c:v3.90 (April 12, 2008)
> > ACPI: PCI Interrupt 0000:02:01.0[A] -> GSI 24 (level, low) -> IRQ 24
> > scsi 0:0:0:0: Direct-Access     IBM-ESXS DTN018C1UCDY10F  S23J PQ: 0 ANSI:
> 3
> >  target0:0:0: Beginning Domain Validation
> >  target0:0:0: Ending Domain Validation
> >  target0:0:0: FAST-80 WIDE SCSI 160.0 MB/s DT (12.5 ns, offset 127)
> > scsi 0:0:1:0: Direct-Access     IBM-ESXS DTN018C1UCDY10F  S23J PQ: 0 ANSI:
> 3
> >  target0:0:1: Beginning Domain Validation ...
> > ACPI: PCI Interrupt 0000:02:02.0[A] -> GSI 25 (level, low) -> IRQ 25
> >  target0:0:1: Ending Domain Validation
> >  target0:0:1: FAST-80 WIDE SCSI 160.0 MB/s DT (12.5 ns, offset 127)
> > ...
> > hdc: ATAPI 24X CD-ROM drive, 128kB Cache Uniform CD-ROM driver
> > Revision: 3.20
> > scsi 0:0:8:0: Processor         IBM      25P3495a S320  1 1    PQ: 0 ANSI:
> 2
> >  target0:0:8: Beginning Domain Validation
> >  target0:0:8: Ending Domain Validation
> >  target0:0:8: asynchronous
> > Driver 'sd' needs updating - please use bus_type methods sd 0:0:0:0:
> > [sda] 35548320 512-byte hardware sectors (18201 MB) sd 0:0:0:0: [sda]
> > Write Protect is off sd 0:0:0:0: [sda] Mode Sense: cb 00 00 08 sd
> > 0:0:0:0: Attached scsi generic sg0 type 0 scsi 0:0:1:0: Attached scsi
> > generic sg1 type 0 scsi 0:0:8:0: Attached scsi generic sg2 type 3
> > --------------------------------------------------------------------
> >
> >
> > It is an x86 32bit PC compile. This is the excerpt of the .config file
> > grepping for FUSION
> > ------------------------------------
> > CONFIG_FUSION=y
> > CONFIG_FUSION_SPI=m
> > CONFIG_FUSION_FC=m
> > CONFIG_FUSION_SAS=m
> > CONFIG_FUSION_MAX_SGE=40
> > CONFIG_FUSION_CTL=m
> > CONFIG_FUSION_LAN=m
> > # CONFIG_FUSION_LOGGING is not set
> > ------------------------------------
> >
> >
> >
> > This is the boot error message obtained with serial cable. I left it
> > running for 8 minutes for this. It loops so the message never ends.
> > --------------------------------------------------------------------
> >
> > ACPI: Resource is not an IRQ entry
> >
> > ACPI: Resource is not an IRQ entry
> >
> > ACPI: Resource is not an IRQ entry
> >
> > ACPI: Resource is not an IRQ entry
> >
> > ACPI: Resource is not an IRQ entry
> >
> > ACPI: Resource is not an IRQ entry
> >
> > ACPI: Resource is not an IRQ entry
> >
> > ACPI: Resource is not an IRQ entry
> >
> > ACPI: Resource is not an IRQ entry
> >
> > ACPI: Resource is not an IRQ entry
> >
> > ACPI: Resource is not an IRQ entry
> >
> > ACPI: Resource is not an IRQ entry
> >
> > ACPI: Resource is not an IRQ entry
> >
> > ACPI: Resource is not an IRQ entry
> >
> > ACPI: Resource is not an IRQ entry
> >
> > ACPI: Resource is not an IRQ entry
> >
> > ACPI: Resource is not an IRQ entry
> >
> > ACPI: Resource is not an IRQ entry
> >
> > ACPI: Resource is not an IRQ entry
> >
> > ACPI: Resource is not an IRQ entry
> >
> > ACPI: Resource is not an IRQ entry
> >
> > ACPI: Resource is not an IRQ entry
> >
> > ACPI: Resource is not an IRQ entry
> >
> > ACPI: Resource is not an IRQ entry
> >
> > ACPI: Resource is not an IRQ entry
> >
> > ACPI: Resource is not an IRQ entry
> >
> > ACPI: Resource is not an IRQ entry
> 
> The acpi problem.
> 
> > mptbase: ioc0: ERROR - Doorbell ACK timeout (count=4999),
> IntStatus=80000009!
> >
> > BUG: unable to handle kernel NULL pointer dereference at 0000034c
> >
> > IP: [<f885cc5e>] :mptspi:mptspi_dv_renegotiate_work+0xa/0x9f
> >
> > Oops: 0000 [#1] SMP
> >
> > Modules linked in: ide_pci_generic(+) floppy mptspi(+) mptscsih
> > ohci_hcd tg3 mptbase scsi_transport_spi usbcore serverworks ide_core
> > ata_generic libata scsi_mod dock thermal processor fan thermal_sys
> >
> >
> >
> > Pid: 9, comm: events/0 Not tainted (2.6.26-rc7 #1)
> >
> > EIP: 0060:[<f885cc5e>] EFLAGS: 00010282 CPU: 0
> >
> > EIP is at mptspi_dv_renegotiate_work+0xa/0x9f [mptspi]
> >
> > EAX: f7a447c0 EBX: f7429900 ECX: f7a447c4 EDX: c1908988
> >
> > ESI: f7a447c0 EDI: 0000034c EBP: f7429904 ESP: f7477f80
> >
> >  DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
> >
> > Process events/0 (pid: 9, ti=f7476000 task=f744d770 task.ti=f7476000)
> >
> > Stack: f744d8e0 c190b260 00000000 c1908984 f7429900 f7a447c0 f885cc54
> > f7429904
> >
> >        c012f253 f7429900 c012f934 f742990c 00000000 c012f9e8 00000000
> > f744d770
> >
> >        c0131bdc f7477fc4 f7477fc4 f7429900 c012f934 00000000 c0131b1b
> > c0131ae3
> >
> > Call Trace:
> >
> >  [<f885cc54>] mptspi_dv_renegotiate_work+0x0/0x9f [mptspi]
> >
> >  [<c012f253>] run_workqueue+0x75/0xf6
> >
> >  [<c012f934>] worker_thread+0x0/0xbf
> >
> >  [<c012f9e8>] worker_thread+0xb4/0xbf
> >
> >  [<c0131bdc>] autoremove_wake_function+0x0/0x2b
> >
> >  [<c012f934>] worker_thread+0x0/0xbf
> >
> >  [<c0131b1b>] kthread+0x38/0x5d
> >
> >  [<c0131ae3>] kthread+0x0/0x5d
> >
> >  [<c0104573>] kernel_thread_helper+0x7/0x10
> >
> >  =======================
> >
> > Code: 70 e8 9e f8 ff ff 8b 47 70 e8 44 b7 fe ff 8b 47 70 5a 5b 5e 5f
> > 5d e9 89
> > f8 ff ff 58 5b 5e 5f 5d c3 55 57 56 53 83 ec 10 8b 78 10 <8b> 2f e8 c7
> > 98 90 c7
> > 66 83 bf 96 02 00 00 00 8b 85 3c 01 00 00
> >
> > EIP: [<f885cc5e>] mptspi_dv_renegotiate_work+0xa/0x9f [mptspi] SS:ESP
> > 0068:f7477f80
> >
> > ---[ end trace e311270f757682e4 ]---
> 
> mpt-fusion shouldn't oops, no matter what acpi did to it.
> 
Comment 6 Anonymous Emailer 2008-07-08 07:09:23 UTC
Reply-To: James.Bottomley@HansenPartnership.com

On Tue, 2008-07-08 at 01:57 -0700, Andrew Morton wrote:
> You removed everyone from cc.  Please don't do that - there's not much
> point in asking me to do things - this bug is reported by
> kurk@shiftmail.org.
> 
> I don't know what "we do not assist with compiling drivers" can possibly
> mean.  Eric, can you please help here?

heh, well, cc'ing a support line on a technical bug report isn't
necessarily conducive to producing useful results ... what we're
discussing is probably already at level 4 or 5 (the real engineering
problems).  Support calls go in at levels 1-3 (as in consult manual and
spit out canned response before triaging for escalation).

That said, this line:

mptbase: ioc0: ERROR - Doorbell ACK timeout (count=4999), IntStatus=80000009!

is absolutely characteristic of a lost interrupt.

With the current LSI driver, we have two possible causes for this.  One
is the usual ACPI screw up that we never seem to be able to fix.  The
other is that the driver recently enabled MSI (commit
23a274c8a5adafc74a66f16988776fc7dd6f6e51 in v2.6.26-rc1).  For the
former, just follow the usual ACPI screw up recipe.  For the latter, you
should see this message in the boot up:

mptbase: ioc0: PCI-MSI enabled

MSI can be turned off again by using the module parameter
mpt_msi_enable=0.

Unfortunately, the true fix is to find out if the motherboard really has
a global MSI problem (and I know MSI works with the LSI because I have a
1030 in an ia64 system here working just fine) and add it to the PCI
quirks file as unable to use MSI.

James
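
The check James describes (look for the MSI message, then correlate it with the doorbell timeout) can be sketched as below. This is only an illustration under assumptions: the function name `triage` and the wording of its verdicts are hypothetical, and the sample log is assembled from the two message strings quoted in this thread rather than taken from a real capture.

```python
import re

# Message formats as quoted in this thread.
MSI_ENABLED = re.compile(r"mptbase: (ioc\d+): PCI-MSI enabled")
DOORBELL_TIMEOUT = re.compile(r"mptbase: (ioc\d+): ERROR - Doorbell ACK timeout")


def triage(log_text):
    """Hypothetical helper: map each timed-out IOC to a suggested next step."""
    msi = set(MSI_ENABLED.findall(log_text))
    timeouts = set(DOORBELL_TIMEOUT.findall(log_text))

    report = {}
    for ioc in sorted(timeouts):
        if ioc in msi:
            # MSI was switched on for this controller, so it is the first suspect.
            report[ioc] = "timeout with MSI enabled: try mpt_msi_enable=0"
        else:
            # No MSI message seen, so look at interrupt routing instead.
            report[ioc] = "timeout without MSI: suspect IRQ routing"
    return report


# Assembled sample, not a real boot log.
sample = """\
mptbase: ioc0: PCI-MSI enabled
mptbase: ioc0: ERROR - Doorbell ACK timeout (count=4999), IntStatus=80000009!
"""

print(triage(sample))
```

Note that the boot must not use the "quiet" kernel parameter, otherwise the informational MSI line may be missing from the captured log.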
Comment 7 Bjorn Helgaas 2008-07-08 09:52:31 UTC
On Tuesday 08 July 2008 08:08:46 am James Bottomley wrote:
> That said, this line:
> 
> mptbase: ioc0: ERROR - Doorbell ACK timeout (count=4999), IntStatus=80000009!
> 
> is absolutely characteristic of a lost interrupt.
> 
> With the current LSI driver, we have two possible causes for this.  One
> is the usual ACPI screw up that we never seem to be able to fix.

Which ACPI screw up is that?  And what's the usual recipe?

I know about the ancient "pci=routeirq" recipe, but as far as I know,
there are no current problems that require that.

> The 
> other is that the driver recently enabled MSI (commit
> 23a274c8a5adafc74a66f16988776fc7dd6f6e51 in v2.6.26-rc1).  For the
> former, just follow the usual ACPI screw up recipe.  For the latter, you
> should see this message in the boot up:
> 
> mptbase: ioc0: PCI-MSI enabled
> 
> MSI can be turned off again by using the module parameter
> mpt_msi_enable=0.
> 
> Unfortunately, the true fix is to find out if the motherboard really has
> a global MSI problem (and I know MSI works with the LSI because I have a
> 1030 in an ia64 system here working just fine) and add it to the PCI
> quirks file as unable to use MSI.
> 
> James
Comment 8 Anonymous Emailer 2008-07-08 10:24:03 UTC
Reply-To: James.Bottomley@HansenPartnership.com

On Tue, 2008-07-08 at 10:51 -0600, Bjorn Helgaas wrote:
> On Tuesday 08 July 2008 08:08:46 am James Bottomley wrote:
> > That said, this line:
> > 
> > mptbase: ioc0: ERROR - Doorbell ACK timeout (count=4999),
> IntStatus=80000009!
> > 
> > is absolutely characteristic of a lost interrupt.
> > 
> > With the current LSI driver, we have two possible causes for this.  One
> > is the usual ACPI screw up that we never seem to be able to fix.
> 
> Which ACPI screw up is that?  And what's the usual recipe?

The usual screw up where subtle ACPI breakage from release to release
causes some IRQs to get misrouted.

Usually you start with noacpi and cycle through the pci routing options

> I know about the ancient "pci=routeirq" recipe, but as far as I know,
> there are no current problems that require that.

If you actually read this bug report, you'll see there was a message

ACPI: Resource is not an IRQ entry

Just before the fusion IRQ failed to get delivered, so I think it's a
good indicator that there *are* ACPI problems ...

James
Comment 9 Bjorn Helgaas 2008-07-08 13:57:51 UTC
On Tuesday 08 July 2008 11:23:33 am James Bottomley wrote:
> On Tue, 2008-07-08 at 10:51 -0600, Bjorn Helgaas wrote:
> > Which ACPI screw up is that?  And what's the usual recipe?
> 
> The usual screw up where subtle ACPI breakage from release to release
> causes some IRQs to get misrouted.
> 
> Usually you start with noacpi and cycle through the pci routing options

Don't worry, I wasn't trying to talk you out of an ACPI bug report;
I just wanted to get enough specifics so I could see whether it was
something I could fix.

> If you actually read this bug report, you'll see there was a message
> 
> ACPI: Resource is not an IRQ entry
> 
> Just before the fusion IRQ failed to get delivered, so I think it's a
> good indicator that there *are* ACPI problems ...

These messages also happen with 2.6.25, where the MPT Fusion driver
worked, so Kurk opened a separate bugzilla,
  http://bugzilla.kernel.org/show_bug.cgi?id=11049
for them.

Yakui Zhao thinks the messages are harmless because they're
related to interrupt link devices that we don't use in IOAPIC mode,
and given that the driver works in 2.6.25, that seems plausible
to me.

Regardless, the messages are alarming and annoying.  I'd like
to understand them better, but I'll pursue that in the 11049
bugzilla.

Bjorn
Comment 10 Anonymous Emailer 2008-07-08 14:53:11 UTC
Reply-To: akpm@linux-foundation.org

On Tue, 8 Jul 2008 14:56:53 -0600 Bjorn Helgaas <bjorn.helgaas@hp.com> wrote:

> On Tuesday 08 July 2008 11:23:33 am James Bottomley wrote:
> > On Tue, 2008-07-08 at 10:51 -0600, Bjorn Helgaas wrote:
> > > Which ACPI screw up is that?  And what's the usual recipe?
> > 
> > The usual screw up where subtle ACPI breakage from release to release
> > causes some IRQs to get misrouted.
> > 
> > Usually you start with noacpi and cycle through the pci routing options
> 
> Don't worry, I wasn't trying to talk you out of an ACPI bug report;
> I just wanted to get enough specifics so I could see whether it was
> something I could fix.
> 
> > If you actually read this bug report, you'll see there was a message
> > 
> > ACPI: Resource is not an IRQ entry
> > 
> > Just before the fusion IRQ failed to get delivered, so I think it's a
> > good indicator that there *are* ACPI problems ...
> 
> These messages also happen with 2.6.25, where the MPT Fusion driver
> worked, so Kurk opened a separate bugzilla,
>   http://bugzilla.kernel.org/show_bug.cgi?id=11049
> for them.
> 
> Yakui Zhao thinks the messages are harmless because they're
> related to interrupt link devices that we don't use in IOAPIC mode,
> and given that the driver works in 2.6.25, that seems plausible
> to me.
> 
> Regardless, the messages are alarming and annoying.  I'd like
> to understand them better, but I'll pursue that in the 11049
> bugzilla.
> 

Let us not forget the other part of this report:

BUG: unable to handle kernel NULL pointer dereference at 0000034c
IP: [<f885cc5e>] :mptspi:mptspi_dv_renegotiate_work+0xa/0x9f
Oops: 0000 [#1] SMP
Comment 11 Anonymous Emailer 2008-07-08 14:57:48 UTC
Reply-To: James.Bottomley@HansenPartnership.com

On Tue, 2008-07-08 at 14:47 -0700, Andrew Morton wrote:
> On Tue, 8 Jul 2008 14:56:53 -0600 Bjorn Helgaas <bjorn.helgaas@hp.com> wrote:
> 
> > On Tuesday 08 July 2008 11:23:33 am James Bottomley wrote:
> > > On Tue, 2008-07-08 at 10:51 -0600, Bjorn Helgaas wrote:
> > > > Which ACPI screw up is that?  And what's the usual recipe?
> > > 
> > > The usual screw up where subtle ACPI breakage from release to release
> > > causes some IRQs to get misrouted.
> > > 
> > > Usually you start with noacpi and cycle through the pci routing options
> > 
> > Don't worry, I wasn't trying to talk you out of an ACPI bug report;
> > I just wanted to get enough specifics so I could see whether it was
> > something I could fix.
> > 
> > > If you actually read this bug report, you'll see there was a message
> > > 
> > > ACPI: Resource is not an IRQ entry
> > > 
> > > Just before the fusion IRQ failed to get delivered, so I think it's a
> > > good indicator that there *are* ACPI problems ...
> > 
> > These messages also happen with 2.6.25, where the MPT Fusion driver
> > worked, so Kurk opened a separate bugzilla,
> >   http://bugzilla.kernel.org/show_bug.cgi?id=11049
> > for them.
> > 
> > Yakui Zhao thinks the messages are harmless because they're
> > related to interrupt link devices that we don't use in IOAPIC mode,
> > and given that the driver works in 2.6.25, that seems plausible
> > to me.
> > 
> > Regardless, the messages are alarming and annoying.  I'd like
> > to understand them better, but I'll pursue that in the 11049
> > bugzilla.
> > 
> 
> Let us not forget the other part of this report:
> 
> BUG: unable to handle kernel NULL pointer dereference at 0000034c
> IP: [<f885cc5e>] :mptspi:mptspi_dv_renegotiate_work+0xa/0x9f
> Oops: 0000 [#1] SMP

That's fixed in the scsi-rc-fixes tree ... but it's a symptom, not a
cause.  If essential storage is on this adapter, the system will still
be unbootable.

James
Comment 12 Anonymous Emailer 2008-07-09 01:19:28 UTC
Reply-To: sathya.prakash@lsi.com

This may be a problem due to enabling MSI for SPI controllers. I have posted another message to the list providing the corrective patch, which is already in the scsi-misc tree.
If the problem goes away after setting the module parameter mpt_msi_enable=0 or after applying the patch http://marc.info/?l=linux-scsi&m=121131228827682&w=4 then it is likely due to MSI being enabled.


On Tue, Jul 08, 2008 at 05:57:35PM -0400, James Bottomley wrote:
> On Tue, 2008-07-08 at 14:47 -0700, Andrew Morton wrote:
> > On Tue, 8 Jul 2008 14:56:53 -0600 Bjorn Helgaas <bjorn.helgaas@hp.com>
> wrote:
> >
> > > On Tuesday 08 July 2008 11:23:33 am James Bottomley wrote:
> > > > On Tue, 2008-07-08 at 10:51 -0600, Bjorn Helgaas wrote:
> > > > > Which ACPI screw up is that?  And what's the usual recipe?
> > > >
> > > > The usual screw up where subtle ACPI breakage from release to release
> > > > causes some IRQs to get misrouted.
> > > >
> > > > Usually you start with noacpi and cycle through the pci routing options
> > >
> > > Don't worry, I wasn't trying to talk you out of an ACPI bug report;
> > > I just wanted to get enough specifics so I could see whether it was
> > > something I could fix.
> > >
> > > > If you actually read this bug report, you'll see there was a message
> > > >
> > > > ACPI: Resource is not an IRQ entry
> > > >
> > > > Just before the fusion IRQ failed to get delivered, so I think it's a
> > > > good indicator that there *are* ACPI problems ...
> > >
> > > These messages also happen with 2.6.25, where the MPT Fusion driver
> > > worked, so Kurk opened a separate bugzilla,
> > >   http://bugzilla.kernel.org/show_bug.cgi?id=11049
> > > for them.
> > >
> > > Yakui Zhao thinks the messages are harmless because they're
> > > related to interrupt link devices that we don't use in IOAPIC mode,
> > > and given that the driver works in 2.6.25, that seems plausible
> > > to me.
> > >
> > > Regardless, the messages are alarming and annoying.  I'd like
> > > to understand them better, but I'll pursue that in the 11049
> > > bugzilla.
> > >
> >
> > Let us not forget the other part of this report:
> >
> > BUG: unable to handle kernel NULL pointer dereference at 0000034c
> > IP: [<f885cc5e>] :mptspi:mptspi_dv_renegotiate_work+0xa/0x9f
> > Oops: 0000 [#1] SMP
> 
> That's fixed in the scsi-rc-fixes tree ... but it's a symptom, not a
> cause.  If essential storage is on this adapter, the system will still
> be unbootable.
> 
> James
Comment 13 Kurk 2008-07-09 01:40:01 UTC
>Hi,
>Can you please attach the complete log that you collected with a
>serial console to this bug report:
>  http://bugzilla.kernel.org/show_bug.cgi?id=11045
>
>The excerpts are missing important information about how the MPT
>interrupt is routed when the driver claims the device.  It's
>easier if we can just look at the complete log; sometimes it
>answers questions we didn't think of the first time around.
>
>Thanks,
>  Bjorn

Hi Bjorn,
what I pasted in is the complete log obtained with the serial cable. I did not modify it in any way or cut parts away. As I wrote, I obtained it twice, once stopping the boot after 8 minutes, the other time stopping after 5 (and the stack traces are slightly different in the two boots, as you can see). The error messages loop and never end, so it is necessary to stop the boot after a while. The other attachments, like dmesg (2.6.25), lspci (2.6.25) and .config, I did grep or cut, but not the serial dump.

(Is it OK if I reply from the web interface, or should I reply via email? Sorry, this is the first time I have used bugzilla.kernel.org)
Comment 14 Kurk 2008-07-10 06:57:50 UTC
Created attachment 16785 [details]
Boot messages with quiet disabled, serial dump
Comment 15 Kurk 2008-07-10 07:25:16 UTC
Bjorn,
you were right, the serial dump was not complete because there was the 
"quiet" option specified as kernel parameter.
I just uploaded the full (non-quiet) serial dump as attachment on the 
bugzilla web interface.
Thank you
Comment 16 Kurk 2008-07-10 07:25:22 UTC
Hi all,
Good news! James and Sathya were correct, the bug is related to MSI: 
specifying mpt_msi_enable=0 as option for the mptbase module solves the 
problem and the system can boot as usual.

Having said this, do you still want me to try a patch, or perform some 
additional test?

Just out of curiosity: do you intend to eventually modify the kernel so
as to "support" and work around buggy hardware like ours (IBM
xSeries 335), so that Linux can work out of the box even on this hardware?

Thanks, everybody, for your help
Comment 17 Kurk 2008-07-10 09:02:36 UTC
Prakash, Sathya wrote:
> This may  be a problem due to enabling MSI for SPI controllers. I have posted
> another message in the list providing the correction patch which is already
> in scsi-misc tree. 
> If the problem is gone with changing the module parameter mpt_msi_enable=0 or
> by applying the patch http://marc.info/?l=linux-scsi&m=121131228827682&w=4
> then it might be due to MSI enabling.
>   
More good news: I confirm that yes, the problem is also fixed by the
patch linked above by Sathya, and in that case there is no need to
specify the option mpt_msi_enable=0 for mptbase. Either one of the two
(patch or option) is enough to fix the problem.
It would be nice to see this patch in the final release of the 2.6.26 kernel.
Thank you
Comment 18 Anonymous Emailer 2008-07-10 16:50:40 UTC
Reply-To: akpm@linux-foundation.org

On Thu, 10 Jul 2008 16:52:01 +0200 kurk <kurk@shiftmail.org> wrote:

> Prakash, Sathya wrote:
> > This may  be a problem due to enabling MSI for SPI controllers. I have
> posted another message in the list providing the correction patch which is
> already in scsi-misc tree. 
> > If the problem is gone with changing the module parameter mpt_msi_enable=0
> or by applying the patch http://marc.info/?l=linux-scsi&m=121131228827682&w=4
> then it might be due to MSI enabling.
> >   
> Another good news: I confirm that yes, the problem is also fixed by the 
> patch linked above by Sathya, and in that case it is not needed to 
> specify option mpt_msi_enable=0 for mptbase. Any one of the two (patch 
> or option) is enough to fix the problem.
> It would be nice to see this patch in the final release of the 2.6.26 kernel
> Thank you

James, shouldn't we put that into 2.6.26?

That whole patch series looks pretty desirable actually..
Comment 19 Anonymous Emailer 2008-07-10 17:42:48 UTC
Reply-To: James.Bottomley@HansenPartnership.com

On Thu, 2008-07-10 at 16:44 -0700, Andrew Morton wrote:
> On Thu, 10 Jul 2008 16:52:01 +0200 kurk <kurk@shiftmail.org> wrote:
> 
> > Prakash, Sathya wrote:
> > > This may be a problem due to enabling MSI for SPI controllers. I have
> > > posted another message in the list providing the correction patch which is
> > > already in scsi-misc tree.
> > > If the problem is gone with changing the module parameter
> > > mpt_msi_enable=0 or by applying the patch
> > > http://marc.info/?l=linux-scsi&m=121131228827682&w=4 then it might be due to
> > > MSI enabling.
> > >
> > Another good news: I confirm that yes, the problem is also fixed by the 
> > patch linked above by Sathya, and in that case it is not needed to 
> > specify option mpt_msi_enable=0 for mptbase. Any one of the two (patch 
> > or option) is enough to fix the problem.
> > It would be nice to see this patch in the final release of the 2.6.26
> > kernel.
> > Thank you
> 
> James, shouldn't we put that into 2.6.26?

I'm still not sure ... if it's a fault on the board with MSI, then yes,
we need it in ... although the form would then be wrong because we
probably should be identifying the faulty parts and blacklisting them.

If it's actually a fault on the motherboard with MSI, then no, this
isn't the patch series that should go in; we need the motherboard strings
to blacklist it.

Unfortunately, I can't seem to get an answer out of LSI on this
question. It looks like the commit will cherry-pick easily enough ...
although now that I look at it, the parameter's description is wrong.

> That whole patch series looks pretty desirable actually..

Well, it was billed as a driver update ... and it has a lot more than
just trivial changes, so, as an eve-of-release quality matter, I'd tend
to say that wouldn't be a good idea.

James
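
The distinction drawn above, between a fault in the controller and a fault in the motherboard, can be illustrated with a small user-space model. This is only a sketch of the decision logic, not kernel code: the identifiers in both lists are hypothetical placeholders, since the thread names neither the motherboard strings nor the controller parts that would be blacklisted.

```python
# Hypothetical identifiers throughout: the thread names neither the
# motherboard strings nor the controller parts that would be listed.
MSI_BROKEN_MOTHERBOARDS = {("ExampleVendor", "ExampleBoard-1")}
MSI_BROKEN_CONTROLLERS = {"example-controller-a"}


def msi_allowed(board_vendor, board_name, controller):
    """Decide whether a driver may enable MSI for one controller."""
    # Platform check: a broken motherboard rules out MSI for every
    # device and every driver, not only the one that exposed the fault.
    if (board_vendor, board_name) in MSI_BROKEN_MOTHERBOARDS:
        return False
    # Device check: a broken controller part is excluded on any platform.
    if controller in MSI_BROKEN_CONTROLLERS:
        return False
    return True
```

Keeping the platform check separate from the device check is the point of the sketch: a motherboard entry protects any driver that later turns MSI on, whereas a per-driver switch only hides the fault for that one driver.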
Comment 20 Anonymous Emailer 2008-07-10 21:35:09 UTC
Reply-To: Sathya.Prakash@lsi.com

I did a recheck on this: except for the FC 919X and 929X boards, everything else should work fine with MSI. Hence the SPI boards (1030) should work with MSI, and the problem might be with the motherboard.
But we would like to keep MSI disabled for SPI controllers since we have not tested internally with MSI and FC enabled by default for them in our recent drivers.
So I would like to request that the patch to disable MSI for SPI & FC be pulled in.
-Thanks
Sathya

Comment 21 Anonymous Emailer 2008-07-11 07:06:04 UTC
Reply-To: James.Bottomley@HansenPartnership.com

On Fri, 2008-07-11 at 12:33 +0800, Prakash, Sathya wrote:
> I did a recheck on this, except FC 919X and 929X boards, everything
> else should work fine with MSI. Hence the SPI boards (1030) should
> work with MSI and the problem might be with the motherboard.

Right ... that's why I was asking ... my 1030 works fine with MSI.  If
there's a fault with the FC boards, then certainly they should have MSI
disabled.

The motherboard was my suspicion ... especially as older ones have SPI
and newer ones have SAS (and the older ones are most likely to have MSI
faults).  However, I think you can see from our point of view that if
the problem is the motherboard, disabling MSI in the fusion driver is the
wrong way to fix it.  If we do it this way, we'll promptly get another slew
of nasty bug reports for the next driver that enables MSI and doesn't work
on this platform.

> But we would like to keep the MSI disabled for SPI controllers since
> we have not tested internally with MSI and FC enabled by default for
> them in our recent drivers.
> So I would like to request to pull in the patch to disable MSI for SPI & FC.

Yes, we'll do that ... I'll also see if the PCI maintainer can determine
the information needed to blacklist the motherboards so that we don't
get this all over again with them and a different driver.

James
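
As a rough illustration of the change agreed on above (MSI off by default for SPI and FC), here is a minimal user-space sketch of per-bus-type defaults with an administrator override. It is not the driver's code: the names `BusType`, `DEFAULT_MSI` and `msi_enabled` are invented for this example, and leaving MSI on for SAS is an assumption that the thread does not state.

```python
from enum import Enum


class BusType(Enum):
    SPI = "spi"  # parallel SCSI, e.g. the 1030 boards discussed above
    FC = "fc"    # Fibre Channel, e.g. the 919X and 929X boards
    SAS = "sas"


# Hypothetical defaults: MSI off for SPI and FC as requested in the thread.
# Leaving it on for SAS is an assumption made only for this sketch.
DEFAULT_MSI = {BusType.SPI: False, BusType.FC: False, BusType.SAS: True}


def msi_enabled(bus, overrides=None):
    """Return True if MSI should be used for a controller on `bus`.

    `overrides` models an administrator setting (similar in spirit to a
    module option) and wins over the built-in default for that bus type.
    """
    if overrides and bus in overrides:
        return overrides[bus]
    return DEFAULT_MSI[bus]
```

With these defaults, `msi_enabled(BusType.SPI)` is false unless the caller passes an override such as `{BusType.SPI: True}`, which mirrors the idea that either a changed default or an explicit option avoids the boot failure.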