Latest working kernel version: 2.6.25
Earliest failing kernel version: 2.6.26-rc7
Distribution: Debian (but vanilla kernel)
Hardware Environment: IBM xSeries 335
Software Environment: error and hangup at boot
Problem Description: MPT Fusion error, unbootable, see below
Steps to reproduce: see below
----
Detailed description:

Hi all,
I'm no kernel expert; I hope I made no mistakes in this report. It seems to me that a bug was introduced into the MPT Fusion driver in 2.6.26 (rc7).

I compiled 2.6.26-rc7 on a machine with an LSI53C1030 controller and it cannot boot. 2.6.25, built with essentially the same config file, boots without problems.

I tried to forward-port the Fusion driver from 2.6.25 to 2.6.26-rc7 by simply copying the directory drivers/message/fusion/ from 2.6.25 over to 2.6.26-rc7, but unfortunately this doesn't compile, so I am stuck, unable to use 2.6.26 on this machine (I have not tried 2.6.26 versions earlier than rc7; I don't have much time now).

I connected a serial cable in order to capture the boot error messages, and obtained two of them on different boots. I will paste these at the end of this post.

This is the verbose lspci of the controller (obtained with 2.6.25):
----------------------------------------
01:01.0 SCSI storage controller: LSI Logic / Symbios Logic 53c1030 PCI-X Fusion-MPT Dual Ultra320 SCSI (rev 07)
        Subsystem: IBM Unknown device 026d
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx-
        Status: Cap+ 66MHz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 72 (4250ns min, 4500ns max), Cache Line Size: 32 bytes
        Interrupt: pin A routed to IRQ 22
        Region 0: I/O ports at 2300 [size=256]
        Region 1: Memory at fbff0000 (64-bit, non-prefetchable) [size=64K]
        Region 3: Memory at fbfe0000 (64-bit, non-prefetchable) [size=64K]
        Capabilities: [50] Power Management version 2
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [58] Message Signalled Interrupts: Mask- 64bit+ Queue=0/0 Enable-
                Address: 0000000000000000 Data: 0000
        Capabilities: [68] PCI-X non-bridge device
                Command: DPERE- ERO- RBC=512 OST=1
                Status: Dev=01:01.0 64bit+ 133MHz+ SCD- USC- DC=simple DMMRBC=2048 DMOST=8 DMCRS=16 RSCEM- 266MHz- 533MHz-
        Kernel driver in use: mptspi
        Kernel modules: mptspi
----------------------------------------

This is an excerpt of the dmesg on 2.6.25 where the controller WORKS:
--------------------------------------------------------------------
Fusion MPT base driver 3.04.06
Copyright (c) 1999-2007 LSI Corporation
Fusion MPT SPI Host driver 3.04.06
...
mptbase: ioc0: Initiating bringup
...
ioc0: LSI53C1030 B2: Capabilities={Initiator}
Probing IDE interface ide1...
hdc: LG CD-ROM CRN-8245B, ATAPI CD/DVD-ROM drive
scsi0 : ioc0: LSI53C1030 B2, FwRev=01000e00h, Ports=1, MaxQ=222, IRQ=22
...
scsi0 : ioc0: LSI53C1030 B2, FwRev=01000e00h, Ports=1, MaxQ=222, IRQ=22
hdc: host max PIO4 wanted PIO255(auto-tune) selected PIO4
hdc: UDMA/33 mode selected
ide1 at 0x170-0x177,0x376 on irq 15
tg3.c:v3.90 (April 12, 2008)
ACPI: PCI Interrupt 0000:02:01.0[A] -> GSI 24 (level, low) -> IRQ 24
scsi 0:0:0:0: Direct-Access IBM-ESXS DTN018C1UCDY10F S23J PQ: 0 ANSI: 3
target0:0:0: Beginning Domain Validation
target0:0:0: Ending Domain Validation
target0:0:0: FAST-80 WIDE SCSI 160.0 MB/s DT (12.5 ns, offset 127)
scsi 0:0:1:0: Direct-Access IBM-ESXS DTN018C1UCDY10F S23J PQ: 0 ANSI: 3
target0:0:1: Beginning Domain Validation
...
ACPI: PCI Interrupt 0000:02:02.0[A] -> GSI 25 (level, low) -> IRQ 25
target0:0:1: Ending Domain Validation
target0:0:1: FAST-80 WIDE SCSI 160.0 MB/s DT (12.5 ns, offset 127)
...
hdc: ATAPI 24X CD-ROM drive, 128kB Cache
Uniform CD-ROM driver Revision: 3.20
scsi 0:0:8:0: Processor IBM 25P3495a S320 1 1 PQ: 0 ANSI: 2
target0:0:8: Beginning Domain Validation
target0:0:8: Ending Domain Validation
target0:0:8: asynchronous
Driver 'sd' needs updating - please use bus_type methods
sd 0:0:0:0: [sda] 35548320 512-byte hardware sectors (18201 MB)
sd 0:0:0:0: [sda] Write Protect is off
sd 0:0:0:0: [sda] Mode Sense: cb 00 00 08
sd 0:0:0:0: Attached scsi generic sg0 type 0
scsi 0:0:1:0: Attached scsi generic sg1 type 0
scsi 0:0:8:0: Attached scsi generic sg2 type 3
--------------------------------------------------------------------

It is an x86 32bit PC compile. This is the excerpt of the .config file grepping for FUSION
------------------------------------
CONFIG_FUSION=y
CONFIG_FUSION_SPI=m
CONFIG_FUSION_FC=m
CONFIG_FUSION_SAS=m
CONFIG_FUSION_MAX_SGE=40
CONFIG_FUSION_CTL=m
CONFIG_FUSION_LAN=m
# CONFIG_FUSION_LOGGING is not set
------------------------------------

This is the boot error message obtained with serial cable. I left it running for 8 minutes for this. It loops so the message never ends.
-------------------------------------------------------------------- ACPI: Resource is not an IRQ entry ACPI: Resource is not an IRQ entry ACPI: Resource is not an IRQ entry ACPI: Resource is not an IRQ entry ACPI: Resource is not an IRQ entry ACPI: Resource is not an IRQ entry ACPI: Resource is not an IRQ entry ACPI: Resource is not an IRQ entry ACPI: Resource is not an IRQ entry ACPI: Resource is not an IRQ entry ACPI: Resource is not an IRQ entry ACPI: Resource is not an IRQ entry ACPI: Resource is not an IRQ entry ACPI: Resource is not an IRQ entry ACPI: Resource is not an IRQ entry ACPI: Resource is not an IRQ entry ACPI: Resource is not an IRQ entry ACPI: Resource is not an IRQ entry ACPI: Resource is not an IRQ entry ACPI: Resource is not an IRQ entry ACPI: Resource is not an IRQ entry ACPI: Resource is not an IRQ entry ACPI: Resource is not an IRQ entry ACPI: Resource is not an IRQ entry ACPI: Resource is not an IRQ entry ACPI: Resource is not an IRQ entry ACPI: Resource is not an IRQ entry mptbase: ioc0: ERROR - Doorbell ACK timeout (count=4999), IntStatus=80000009! 
BUG: unable to handle kernel NULL pointer dereference at 0000034c
IP: [<f885cc5e>] :mptspi:mptspi_dv_renegotiate_work+0xa/0x9f
Oops: 0000 [#1] SMP
Modules linked in: ide_pci_generic(+) floppy mptspi(+) mptscsih ohci_hcd tg3 mptbase scsi_transport_spi usbcore serverworks ide_core ata_generic libata scsi_mod dock thermal processor fan thermal_sys
Pid: 9, comm: events/0 Not tainted (2.6.26-rc7 #1)
EIP: 0060:[<f885cc5e>] EFLAGS: 00010282 CPU: 0
EIP is at mptspi_dv_renegotiate_work+0xa/0x9f [mptspi]
EAX: f7a447c0 EBX: f7429900 ECX: f7a447c4 EDX: c1908988
ESI: f7a447c0 EDI: 0000034c EBP: f7429904 ESP: f7477f80
DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
Process events/0 (pid: 9, ti=f7476000 task=f744d770 task.ti=f7476000)
Stack: f744d8e0 c190b260 00000000 c1908984 f7429900 f7a447c0 f885cc54 f7429904
       c012f253 f7429900 c012f934 f742990c 00000000 c012f9e8 00000000 f744d770
       c0131bdc f7477fc4 f7477fc4 f7429900 c012f934 00000000 c0131b1b c0131ae3
Call Trace:
 [<f885cc54>] mptspi_dv_renegotiate_work+0x0/0x9f [mptspi]
 [<c012f253>] run_workqueue+0x75/0xf6
 [<c012f934>] worker_thread+0x0/0xbf
 [<c012f9e8>] worker_thread+0xb4/0xbf
 [<c0131bdc>] autoremove_wake_function+0x0/0x2b
 [<c012f934>] worker_thread+0x0/0xbf
 [<c0131b1b>] kthread+0x38/0x5d
 [<c0131ae3>] kthread+0x0/0x5d
 [<c0104573>] kernel_thread_helper+0x7/0x10
 =======================
Code: 70 e8 9e f8 ff ff 8b 47 70 e8 44 b7 fe ff 8b 47 70 5a 5b 5e 5f 5d e9 89 f8 ff ff 58 5b 5e 5f 5d c3 55 57 56 53 83 ec 10 8b 78 10 <8b> 2f e8 c7 98 90 c7 66 83 bf 96 02 00 00 00 8b 85 3c 01 00 00
EIP: [<f885cc5e>] mptspi_dv_renegotiate_work+0xa/0x9f [mptspi] SS:ESP 0068:f7477f80
---[ end trace e311270f757682e4 ]---
mptbase: ioc0: Initiating recovery
mptbase: ioc0: WARNING - IOC is in FAULT state!!!
mptbase: ioc0: WARNING - FAULT code = 8112h
mptbase: ioc0: ERROR - Doorbell ACK timeout (count=4999), IntStatus=80000009!
mptbase: ioc0: Recovered from IOC FAULT
mptbase: ioc0: Initiating recovery
mptbase: ioc0: WARNING - IOC is in FAULT state!!!
mptbase: ioc0: WARNING - FAULT code = 8112h mptbase: ioc0: ERROR - Doorbell ACK timeout (count=4999), IntStatus=80000009! mptbase: ioc0: Recovered from IOC FAULT mptbase: ioc0: Initiating recovery mptbase: ioc0: WARNING - IOC is in FAULT state!!! mptbase: ioc0: WARNING - FAULT code = 8112h mptbase: ioc0: ERROR - Doorbell ACK timeout (count=4999), IntStatus=80000009! mptbase: ioc0: Recovered from IOC FAULT mptbase: ioc0: Initiating recovery mptbase: ioc0: WARNING - IOC is in FAULT state!!! mptbase: ioc0: WARNING - FAULT code = 8112h mptbase: ioc0: ERROR - Doorbell ACK timeout (count=4999), IntStatus=80000009! mptbase: ioc0: Recovered from IOC FAULT mptbase: ioc0: Initiating recovery mptbase: ioc0: WARNING - IOC is in FAULT state!!! mptbase: ioc0: WARNING - FAULT code = 8112h mptbase: ioc0: ERROR - Doorbell ACK timeout (count=4999), IntStatus=80000009! mptbase: ioc0: Recovered from IOC FAULT mptbase: ioc0: Initiating recovery mptbase: ioc0: WARNING - IOC is in FAULT state!!! mptbase: ioc0: WARNING - FAULT code = 8112h mptbase: ioc0: ERROR - Doorbell ACK timeout (count=4999), IntStatus=80000009! mptbase: ioc0: Recovered from IOC FAULT scsi0 : ioc0: LSI53C1030 B2, FwRev=01000e00h, Ports=1, MaxQ=222, IRQ=223 target0:0:0: mptspi: ioc0: dma_alloc_coherent for parameters failed hdc: ATAPI 24X CD-ROM drive, 128kB Cache Uniform CD-ROM driver Revision: 3.20 mptscsih: ioc0: attempting task abort! (sc=f7862e80) scsi 0:0:0:0: CDB: Inquiry: 12 00 00 00 24 00 mptscsih: ioc0: WARNING - TM Handler for type=1: IOC Not operational (0x40008112)! mptscsih: ioc0: WARNING - Issuing HardReset!! mptbase: ioc0: Initiating recovery mptbase: ioc0: WARNING - IOC is in FAULT state!!! mptbase: ioc0: WARNING - FAULT code = 8112h md: raid1 personality registered for level 1 device-mapper: ioctl: 4.13.0-ioctl (2007-10-18) initialised: dm-devel@redhat.com mptbase: ioc0: ERROR - Doorbell ACK timeout (count=4999), IntStatus=80000009! 
scsi 0:0:0:0: mptscsih: ioc0: completing cmds: fw_channel 0, fw_id 0, sc=f7862e80, mf = f7a62da0, idx=f
mptbase: ioc0: Recovered from IOC FAULT
mptscsih: ioc0: task abort: FAILED (sc=f7862e80)
mptscsih: ioc0: attempting target reset! (sc=f7862e80)
scsi 0:0:0:0: CDB: Inquiry: 12 00 00 00 24 00
target0:0:0: mptspi: ioc0: dma_alloc_coherent for parameters failed
target0:0:0: FAST-5 WIDE SCSI 2.4 MB/s ST RTI WRFLOW PCOMP (844 ns, offset 68)
target0:0:0: mptspi: ioc0: dma_alloc_coherent for parameters failed
mptscsih: ioc0: Issue of TaskMgmt failed!
mptscsih: ioc0: target reset: FAILED (sc=f7862e80)
mptscsih: ioc0: attempting bus reset! (sc=f7862e80)
scsi 0:0:0:0: CDB: Inquiry: 12 00 00 00 24 00
mptscsih: ioc0: WARNING - TM Handler for type=4: IOC Not operational (0x40008112)!
mptscsih: ioc0: WARNING - Issuing HardReset!!
mptbase: ioc0: Initiating recovery
mptbase: ioc0: WARNING - IOC is in FAULT state!!!
mptbase: ioc0: WARNING - FAULT code = 8112h
mptbase: ioc0: ERROR - Doorbell ACK timeout (count=4999), IntStatus=80000009!
mptbase: ioc0: Recovered from IOC FAULT
mptscsih: ioc0: bus reset: FAILED (sc=f7862e80)
mptscsih: ioc0: attempting host reset! (sc=f7862e80)
mptbase: ioc0: Initiating recovery
target0:0:0: mptspi: ioc0: dma_alloc_coherent for parameters failed
target0:0:0: FAST-5 WIDE SCSI 2.4 MB/s ST RTI WRFLOW PCOMP (844 ns, offset 68)
target0:0:0: mptspi: ioc0: dma_alloc_coherent for parameters failed
mptbase: ioc0: ERROR - Doorbell ACK timeout (count=4999), IntStatus=80000009!
mptscsih: ioc0: host reset: SUCCESS (sc=f7862e80)
scsi 0:0:0:0: Device offlined - not ready after error recovery
target0:0:1: mptspi: ioc0: dma_alloc_coherent for parameters failed
target0:0:1: mptspi: ioc0: dma_alloc_coherent for parameters failed
target0:0:1: FAST-5 WIDE SCSI 2.4 MB/s ST RTI WRFLOW PCOMP (844 ns, offset 68)
target0:0:1: mptspi: ioc0: dma_alloc_coherent for parameters failed
mptscsih: ioc0: attempting task abort! (sc=f7862e80)
scsi 0:0:1:0: CDB: Inquiry: 12 00 00 00 24 00
mptscsih: ioc0: WARNING - TM Handler for type=1: IOC Not operational (0x40008112)!
mptscsih: ioc0: WARNING - Issuing HardReset!!
mptbase: ioc0: Initiating recovery
mptbase: ioc0: WARNING - IOC is in FAULT state!!!
mptbase: ioc0: WARNING - FAULT code = 8112h
mptbase: ioc0: ERROR - Doorbell ACK timeout (count=4999), IntStatus=80000009!
scsi 0:0:1:0: mptscsih: ioc0: completing cmds: fw_channel 0, fw_id 1, sc=f7862e80, mf = f7a62f80, idx=14
mptbase: ioc0: Recovered from IOC FAULT
mptscsih: ioc0: task abort: FAILED (sc=f7862e80)
mptscsih: ioc0: attempting target reset! (sc=f7862e80)
scsi 0:0:1:0: CDB: Inquiry: 12 00 00 00 24 00
target0:0:1: mptspi: ioc0: dma_alloc_coherent for parameters failed
target0:0:1: FAST-5 WIDE SCSI 2.4 MB/s ST RTI WRFLOW PCOMP (844 ns, offset 68)
target0:0:1: mptspi: ioc0: dma_alloc_coherent for parameters failed
mptscsih: ioc0: Issue of TaskMgmt failed!
mptscsih: ioc0: target reset: FAILED (sc=f7862e80)
mptscsih: ioc0: attempting bus reset! (sc=f7862e80)
scsi 0:0:1:0: CDB: Inquiry: 12 00 00 00 24 00
mptscsih: ioc0: WARNING - TM Handler for type=4: IOC Not operational (0x40008112)!
mptscsih: ioc0: WARNING - Issuing HardReset!!
mptbase: ioc0: Initiating recovery
mptbase: ioc0: WARNING - IOC is in FAULT state!!!
mptbase: ioc0: WARNING - FAULT code = 8112h
mptbase: ioc0: ERROR - Doorbell ACK timeout (count=4999), IntStatus=80000009!
mptbase: ioc0: Recovered from IOC FAULT
mptscsih: ioc0: bus reset: FAILED (sc=f7862e80)
mptscsih: ioc0: attempting host reset! (sc=f7862e80)
mptbase: ioc0: Initiating recovery
target0:0:1: mptspi: ioc0: dma_alloc_coherent for parameters failed
target0:0:1: FAST-5 WIDE SCSI 2.4 MB/s ST RTI WRFLOW PCOMP (844 ns, offset 68)
target0:0:1: mptspi: ioc0: dma_alloc_coherent for parameters failed
mptbase: ioc0: ERROR - Doorbell ACK timeout (count=4999), IntStatus=80000009!
mptscsih: ioc0: host reset: SUCCESS (sc=f7862e80)
scsi 0:0:1:0: Device offlined - not ready after error recovery
target0:0:2: mptspi: ioc0: dma_alloc_coherent for parameters failed
target0:0:2: mptspi: ioc0: dma_alloc_coherent for parameters failed
target0:0:2: FAST-5 WIDE SCSI 2.4 MB/s ST RTI WRFLOW PCOMP (844 ns, offset 68)
target0:0:2: mptspi: ioc0: dma_alloc_coherent for parameters failed
mptscsih: ioc0: attempting task abort! (sc=f7862e80)
scsi 0:0:2:0: CDB: Inquiry: 12 00 00 00 24 00
mptscsih: ioc0: WARNING - TM Handler for type=1: IOC Not operational (0x40008112)!
mptscsih: ioc0: WARNING - Issuing HardReset!!
mptbase: ioc0: Initiating recovery
mptbase: ioc0: WARNING - IOC is in FAULT state!!!
mptbase: ioc0: WARNING - FAULT code = 8112h
mptbase: ioc0: ERROR - Doorbell ACK timeout (count=4999), IntStatus=80000009!
scsi 0:0:2:0: mptscsih: ioc0: completing cmds: fw_channel 0, fw_id 2, sc=f7862e80, mf = f7a63160, idx=19
mptbase: ioc0: Recovered from IOC FAULT
mptscsih: ioc0: task abort: FAILED (sc=f7862e80)
mptscsih: ioc0: attempting target reset! (sc=f7862e80)
scsi 0:0:2:0: CDB: Inquiry: 12 00 00 00 24 00
target0:0:2: mptspi: ioc0: dma_alloc_coherent for parameters failed
target0:0:2: FAST-5 WIDE SCSI 2.4 MB/s ST RTI WRFLOW PCOMP (844 ns, offset 68)
target0:0:2: mptspi: ioc0: dma_alloc_coherent for parameters failed
mptscsih: ioc0: Issue of TaskMgmt failed!
mptscsih: ioc0: target reset: FAILED (sc=f7862e80)
mptscsih: ioc0: attempting bus reset! (sc=f7862e80)
scsi 0:0:2:0: CDB: Inquiry: 12 00 00 00 24 00
mptscsih: ioc0: WARNING - TM Handler for type=4: IOC Not operational (0x40008112)!
mptscsih: ioc0: WARNING - Issuing HardReset!!
mptbase: ioc0: Initiating recovery
mptbase: ioc0: WARNING - IOC is in FAULT state!!!
mptbase: ioc0: WARNING - FAULT code = 8112h
mptbase: ioc0: ERROR - Doorbell ACK timeout (count=4999), IntStatus=80000009!
mptbase: ioc0: Recovered from IOC FAULT
mptscsih: ioc0: bus reset: FAILED (sc=f7862e80)
mptscsih: ioc0: attempting host reset! (sc=f7862e80)
mptbase: ioc0: Initiating recovery
target0:0:2: mptspi: ioc0: dma_alloc_coherent for parameters failed
target0:0:2: FAST-5 WIDE SCSI 2.4 MB/s ST RTI WRFLOW PCOMP (844 ns, offset 68)
target0:0:2: mptspi: ioc0: dma_alloc_coherent for parameters failed
mptbase: ioc0: ERROR - Doorbell ACK timeout (count=4999), IntStatus=80000009!
mptscsih: ioc0: host reset: SUCCESS (sc=f7862e80)
scsi 0:0:2:0: Device offlined - not ready after error recovery
target0:0:3: mptspi: ioc0: dma_alloc_coherent for parameters failed
target0:0:3: mptspi: ioc0: dma_alloc_coherent for parameters failed
target0:0:3: FAST-5 WIDE SCSI 2.4 MB/s ST RTI WRFLOW PCOMP (844 ns, offset 68)
target0:0:3: mptspi: ioc0: dma_alloc_coherent for parameters failed
mptscsih: ioc0: attempting task abort! (sc=f7862e80)
scsi 0:0:3:0: CDB: Inquiry: 12 00 00 00 24 00
mptscsih: ioc0: WARNING - TM Handler for type=1: IOC Not operational (0x40008112)!
mptscsih: ioc0: WARNING - Issuing HardReset!!
mptbase: ioc0: Initiating recovery
mptbase: ioc0: WARNING - IOC is in FAULT state!!!
mptbase: ioc0: WARNING -

I obtained the next one after 5 minutes on another boot, with a very slightly different .config file (the MPT Fusion options were unchanged). You can see that the stack trace is slightly different.
-------------------------------------------------------------------- ACPI: Resource is not an IRQ entry ACPI: Resource is not an IRQ entry ACPI: Resource is not an IRQ entry ACPI: Resource is not an IRQ entry ACPI: Resource is not an IRQ entry ACPI: Resource is not an IRQ entry ACPI: Resource is not an IRQ entry ACPI: Resource is not an IRQ entry ACPI: Resource is not an IRQ entry ACPI: Resource is not an IRQ entry ACPI: Resource is not an IRQ entry ACPI: Resource is not an IRQ entry ACPI: Resource is not an IRQ entry ACPI: Resource is not an IRQ entry ACPI: Resource is not an IRQ entry ACPI: Resource is not an IRQ entry ACPI: Resource is not an IRQ entry ACPI: Resource is not an IRQ entry ACPI: Resource is not an IRQ entry ACPI: Resource is not an IRQ entry ACPI: Resource is not an IRQ entry ACPI: Resource is not an IRQ entry ACPI: Resource is not an IRQ entry ACPI: Resource is not an IRQ entry ACPI: Resource is not an IRQ entry ACPI: Resource is not an IRQ entry ACPI: Resource is not an IRQ entry mptbase: ioc0: ERROR - Doorbell ACK timeout (count=4999), IntStatus=80000009! 
BUG: unable to handle kernel NULL pointer dereference at 0000034c
IP: [<f8856c5e>] :mptspi:mptspi_dv_renegotiate_work+0xa/0x9f
Oops: 0000 [#1] SMP
Modules linked in: ide_pci_generic(+) floppy mptspi(+) mptscsih mptbase scsi_transport_spi ohci_hcd usbcore tg3 serverworks ide_core ata_generic libata scsi_mod dock thermal processor fan
Pid: 9, comm: events/0 Not tainted (2.6.26-rc7 #3)
EIP: 0060:[<f8856c5e>] EFLAGS: 00010282 CPU: 0
EIP is at mptspi_dv_renegotiate_work+0xa/0x9f [mptspi]
EAX: f783f480 EBX: f7429900 ECX: f783f484 EDX: c1908548
ESI: f783f480 EDI: 0000034c EBP: f7429904 ESP: f7477f80
DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
Process events/0 (pid: 9, ti=f7476000 task=f744d770 task.ti=f7476000)
Stack: 00000000 c012cea5 f7429900 c1908544 f7429900 f783f480 f8856c54 f7429904
       c012c82a f7429900 c012cee7 f742990c 00000000 c012cf9b 00000000 f744d770
       c012f190 f7477fc4 f7477fc4 f7429900 c012cee7 00000000 c012f0cf c012f097
Call Trace:
 [<c012cea5>] queue_delayed_work_on+0x9a/0xa6
 [<f8856c54>] mptspi_dv_renegotiate_work+0x0/0x9f [mptspi]
 [<c012c82a>] run_workqueue+0x6c/0xe4
 [<c012cee7>] worker_thread+0x0/0xbf
 [<c012cf9b>] worker_thread+0xb4/0xbf
 [<c012f190>] autoremove_wake_function+0x0/0x2b
 [<c012cee7>] worker_thread+0x0/0xbf
 [<c012f0cf>] kthread+0x38/0x5d
 [<c012f097>] kthread+0x0/0x5d
 [<c01043c3>] kernel_thread_helper+0x7/0x10
 =======================
Code: 70 e8 9e f8 ff ff 8b 47 70 e8 44 37 ff ff 8b 47 70 5a 5b 5e 5f 5d e9 89 f8 ff ff 58 5b 5e 5f 5d c3 55 57 56 53 83 ec 10 8b 78 10 <8b> 2f e8 ff cc 90 c7 66 83 bf 96 02 00 00 00 8b 85 3c 01 00 00
EIP: [<f8856c5e>] mptspi_dv_renegotiate_work+0xa/0x9f [mptspi] SS:ESP 0068:f7477f80
---[ end trace e7ec2a28a4a72094 ]---
mptbase: ioc0: Initiating recovery
mptbase: ioc0: WARNING - IOC is in FAULT state!!!
mptbase: ioc0: WARNING - FAULT code = 8112h
mptbase: ioc0: ERROR - Doorbell ACK timeout (count=4999), IntStatus=80000009!
mptbase: ioc0: Recovered from IOC FAULT mptbase: ioc0: Initiating recovery mptbase: ioc0: WARNING - IOC is in FAULT state!!! mptbase: ioc0: WARNING - FAULT code = 8112h mptbase: ioc0: ERROR - Doorbell ACK timeout (count=4999), IntStatus=80000009! mptbase: ioc0: Recovered from IOC FAULT mptbase: ioc0: Initiating recovery mptbase: ioc0: WARNING - IOC is in FAULT state!!! mptbase: ioc0: WARNING - FAULT code = 8112h mptbase: ioc0: ERROR - Doorbell ACK timeout (count=4999), IntStatus=80000009! mptbase: ioc0: Recovered from IOC FAULT mptbase: ioc0: Initiating recovery mptbase: ioc0: WARNING - IOC is in FAULT state!!! mptbase: ioc0: WARNING - FAULT code = 8112h mptbase: ioc0: ERROR - Doorbell ACK timeout (count=4999), IntStatus=80000009! mptbase: ioc0: Recovered from IOC FAULT mptbase: ioc0: Initiating recovery mptbase: ioc0: WARNING - IOC is in FAULT state!!! mptbase: ioc0: WARNING - FAULT code = 8112h mptbase: ioc0: ERROR - Doorbell ACK timeout (count=4999), IntStatus=80000009! mptbase: ioc0: Recovered from IOC FAULT mptbase: ioc0: Initiating recovery mptbase: ioc0: WARNING - IOC is in FAULT state!!! mptbase: ioc0: WARNING - FAULT code = 8112h mptbase: ioc0: ERROR - Doorbell ACK timeout (count=4999), IntStatus=80000009! mptbase: ioc0: Recovered from IOC FAULT scsi0 : ioc0: LSI53C1030 B2, FwRev=01000e00h, Ports=1, MaxQ=222, IRQ=223 target0:0:0: mptspi: ioc0: dma_alloc_coherent for parameters failed hdc: ATAPI 24X CD-ROM drive, 128kB Cache Uniform CD-ROM driver Revision: 3.20 mptscsih: ioc0: attempting task abort! (sc=f7858e80) scsi 0:0:0:0: CDB: Inquiry: 12 00 00 00 24 00 mptscsih: ioc0: WARNING - TM Handler for type=1: IOC Not operational (0x40008112)! mptscsih: ioc0: WARNING - Issuing HardReset!! mptbase: ioc0: Initiating recovery mptbase: ioc0: WARNING - IOC is in FAULT state!!! 
mptbase: ioc0: WARNING - FAULT code = 8112h
md: raid1 personality registered for level 1
device-mapper: ioctl: 4.13.0-ioctl (2007-10-18) initialised: dm-devel@redhat.com
mptbase: ioc0: ERROR - Doorbell ACK timeout (count=4999), IntStatus=80000009!
scsi 0:0:0:0: mptscsih: ioc0: completing cmds: fw_channel 0, fw_id 0, sc=f7858e80, mf = f79e2da0, idx=f
mptbase: ioc0: Recovered from IOC FAULT
mptscsih: ioc0: task abort: FAILED (sc=f7858e80)
mptscsih: ioc0: attempting target reset! (sc=f7858e80)
scsi 0:0:0:0: CDB: Inquiry: 12 00 00 00 24 00
mptscsih: ioc0: Issue of TaskMgmt failed!
mptscsih: ioc0: target reset: FAILED (sc=f7858e80)
mptscsih: ioc0: attempting bus reset! (sc=f7858e80)
scsi 0:0:0:0: CDB: Inquiry: 12 00 00 00 24 00
mptscsih: ioc0: WARNING - TM Handler for type=4: IOC Not operational (0x40008112)!
mptscsih: ioc0: WARNING - Issuing HardReset!!
mptbase: ioc0: Initiating recovery
mptbase: ioc0: WARNING - IOC is in FAULT state!!!
mptbase: ioc0: WARNING - FAULT code = 8112h
mptbase: ioc0: ERROR - Doorbell ACK timeout (count=4999), IntStatus=80000009!
mptbase: ioc0: Recovered from IOC FAULT
mptscsih: ioc0: bus reset: FAILED (sc=f7858e80)
mptscsih: ioc0: attempting host reset! (sc=f7858e80)
mptbase: ioc0: Initiating recovery
mptbase: ioc0: ERROR - Doorbell ACK timeout (count=4999), IntStatus=80000009!
mptscsih: ioc0: host reset: SUCCESS (sc=f7858e80)
scsi 0:0:0:0: Device offlined - not ready after error recovery
target0:0:1: mptspi: ioc0: dma_alloc_coherent for parameters failed
mptscsih: ioc0: attempting task abort! (sc=f7858080)
scsi 0:0:1:0: CDB: Inquiry: 12 00 00 00 24 00
mptscsih: ioc0: WARNING - TM Handler for type=1: IOC Not operational (0x40008112)!
mptscsih: ioc0: WARNING - Issuing HardReset!!
mptbase: ioc0: Initiating recovery
mptbase: ioc0: WARNING - IOC is in FAULT state!!!
mptbase: ioc0: WARNING - FAULT code = 8112h
mptbase: ioc0: ERROR - Doorbell ACK timeout (count=4999), IntStatus=80000009!
scsi 0:0:1:0: mptscsih: ioc0: completing cmds: fw_channel 0, fw_id 1, sc=f7858080, mf = f79e2f80, idx=14
mptbase: ioc0: Recovered from IOC FAULT
mptscsih: ioc0: task abort: FAILED (sc=f7858080)
mptscsih: ioc0: attempting target reset! (sc=f7858080)
scsi 0:0:1:0: CDB: Inquiry: 12 00 00 00 24 00
target0:0:1: mptspi: ioc0: dma_alloc_coherent for parameters failed
target0:0:1: asynchronous
target0:0:1: mptspi: ioc0: dma_alloc_coherent for parameters failed
mptscsih: ioc0: Issue of TaskMgmt failed!
mptscsih: ioc0: target reset: FAILED (sc=f7858080)
mptscsih: ioc0: attempting bus reset! (sc=f7858080)
scsi 0:0:1:0: CDB: Inquiry: 12 00 00 00 24 00
mptscsih: ioc0: WARNING - TM Handler for type=4: IOC Not operational (0x40008112)!
mptscsih: ioc0: WARNING - Issuing HardReset!!
mptbase: ioc0: Initiating recovery
mptbase: ioc0: WARNING - IOC is in FAULT state!!!
mptbase: ioc0: WARNING - FAULT code = 8112h
mptbase: ioc0: ERROR - Doorbell ACK timeout (count=4999), IntStatus=80000009!
mptbase: ioc0: Recovered from IOC FAULT
mptscsih: ioc0: bus reset: FAILED (sc=f7858080)
mptscsih: ioc0: attempting host reset! (sc=f7858080)
mptbase: ioc0: Initiating recovery
target0:0:1: mptspi: ioc0: dma_alloc_coherent for par
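
The difference between the two call traces in this report can be checked mechanically rather than by eye. The following is only a minimal sketch, assuming the serial-console output has been saved as plain text; the function names `trace_symbols` and `extra_frames` are made up for this illustration, and the two embedded traces are shortened copies of the ones in the report.

```python
import re

# Matches call-trace entries such as "[<c012f253>] run_workqueue+0x75/0xf6".
FRAME_RE = re.compile(r"\[<[0-9a-f]+>\]\s+(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+")

def trace_symbols(oops_text):
    """Return the function names listed after 'Call Trace:', in order."""
    _, _, tail = oops_text.partition("Call Trace:")
    # Stop at the "=======" separator so the trailing EIP line is not counted.
    tail = tail.split("=======", 1)[0]
    return [m.group(1) for m in FRAME_RE.finditer(tail)]

def extra_frames(first, second):
    """Symbols that appear in `second` but nowhere in `first`."""
    seen = set(first)
    return [name for name in second if name not in seen]

# Shortened copies of the two traces (hypothetical variable names).
boot1 = """Call Trace:
 [<f885cc54>] mptspi_dv_renegotiate_work+0x0/0x9f [mptspi]
 [<c012f253>] run_workqueue+0x75/0xf6
 [<c012f934>] worker_thread+0x0/0xbf
 =======================
"""
boot2 = """Call Trace:
 [<c012cea5>] queue_delayed_work_on+0x9a/0xa6
 [<f8856c54>] mptspi_dv_renegotiate_work+0x0/0x9f [mptspi]
 [<c012c82a>] run_workqueue+0x6c/0xe4
 [<c012cee7>] worker_thread+0x0/0xbf
 =======================
"""

print(extra_frames(trace_symbols(boot1), trace_symbols(boot2)))
```

On these excerpts the only extra symbol in the second trace is `queue_delayed_work_on`; both oopses report the same faulting function and offset.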
Reply-To: akpm@linux-foundation.org

(switched to email. Please respond via emailed reply-to-all, not via the bugzilla web interface).

On Sun, 6 Jul 2008 11:22:08 -0700 (PDT) bugme-daemon@bugzilla.kernel.org wrote:

> http://bugzilla.kernel.org/show_bug.cgi?id=11045
>
> Summary: Bug in MPT Fusion 2.6.26-rc7 unbootable
> Product: Drivers
> Version: 2.5
> KernelVersion: 2.6.26-rc7
> Platform: All
> OS/Version: Linux
> Tree: Mainline
> Status: NEW
> Severity: normal
> Priority: P1
> Component: Other
> AssignedTo: drivers_other@kernel-bugs.osdl.org
> ReportedBy: kurk@shiftmail.org
>
>
> Latest working kernel version: 2.6.25
> Earliest failing kernel version: 2.6.26-rc7
> Distribution: Debian (but vanilla kernel)
> Hardware Environment: IBM xSeries 335
> Software Environment: error and hangup at boot
> Problem Description: MPT Fusion error, unbootable, see below
> Steps to reproduce: see below

We have two bugs here. One in mpt-fusion and what I suspect is a post-2.6.25 regression in ACPI.

> Detailed description:
>
> Hi all,
> I'm no kernel expert, I hope I made no mistakes in this report. It seems to me
> that a bug was added to the MPT Fusion driver in 2.6.26 (rc7).
>
> I compiled 2.6.26-rc7 on a machine with controller LSI53C1030 and it cannot
> boot. Doing the same with 2.6.25, basically the same config file, boots
> without problems.
>
> I tried to forward-port the Fusion driver from 2.6.25 to 2.6.26-rc7 by simply
> copying over the directory drivers/message/fusion/ from 2.6.25 to 2.6.26-rc7
> but unfortunately this doesn't compile, so I am stuck not being able to use
> 2.6.26 on this machine (actually I have not tried versions of 2.6.26 earlier
> than rc7... I don't have much time now).
>
> I connected a serial cable in order to obtain the boot error message. I
> obtained two of those on different boots. I will paste these at the end of
> this post.
> [...]
>
> This is the boot error message obtained with serial cable. I left it running
> for 8 minutes for this.
> It loops so the message never ends.
> --------------------------------------------------------------------
> ACPI: Resource is not an IRQ entry
> ACPI: Resource is not an IRQ entry
> [...]

The acpi problem.

> mptbase: ioc0: ERROR - Doorbell ACK timeout (count=4999), IntStatus=80000009!
> BUG: unable to handle kernel NULL pointer dereference at 0000034c
> IP: [<f885cc5e>] :mptspi:mptspi_dv_renegotiate_work+0xa/0x9f
> Oops: 0000 [#1] SMP
> Modules linked in: ide_pci_generic(+) floppy mptspi(+) mptscsih ohci_hcd tg3 mptbase scsi_transport_spi usbcore serverworks ide_core ata_generic libata scsi_mod dock thermal processor fan thermal_sys
>
> Pid: 9, comm: events/0 Not tainted (2.6.26-rc7 #1)
> EIP: 0060:[<f885cc5e>] EFLAGS: 00010282 CPU: 0
> EIP is at mptspi_dv_renegotiate_work+0xa/0x9f [mptspi]
> EAX: f7a447c0 EBX: f7429900 ECX: f7a447c4 EDX: c1908988
> ESI: f7a447c0 EDI: 0000034c EBP: f7429904 ESP: f7477f80
> DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
> Process events/0 (pid: 9, ti=f7476000 task=f744d770 task.ti=f7476000)
> Stack: f744d8e0 c190b260 00000000 c1908984 f7429900 f7a447c0 f885cc54 f7429904
>        c012f253 f7429900 c012f934 f742990c 00000000 c012f9e8 00000000 f744d770
>        c0131bdc f7477fc4 f7477fc4 f7429900 c012f934 00000000 c0131b1b c0131ae3
> Call Trace:
>  [<f885cc54>] mptspi_dv_renegotiate_work+0x0/0x9f [mptspi]
>  [<c012f253>] run_workqueue+0x75/0xf6
>  [<c012f934>] worker_thread+0x0/0xbf
>  [<c012f9e8>] worker_thread+0xb4/0xbf
>  [<c0131bdc>] autoremove_wake_function+0x0/0x2b
>  [<c012f934>] worker_thread+0x0/0xbf
>  [<c0131b1b>] kthread+0x38/0x5d
>  [<c0131ae3>] kthread+0x0/0x5d
>  [<c0104573>] kernel_thread_helper+0x7/0x10
>  =======================
> Code: 70 e8 9e f8 ff ff 8b 47 70 e8 44 b7 fe ff 8b 47 70 5a 5b 5e 5f 5d e9 89 f8 ff ff 58 5b 5e 5f 5d c3 55 57 56 53 83 ec 10 8b 78 10 <8b> 2f e8 c7 98 90 c7 66 83 bf 96 02 00 00 00 8b 85 3c 01 00 00
> EIP: [<f885cc5e>] mptspi_dv_renegotiate_work+0xa/0x9f [mptspi] SS:ESP 0068:f7477f80
> ---[ end trace e311270f757682e4 ]---

mpt-fusion shouldn't oops, no matter what acpi did to it.
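
For readers following the oops: on the `Code:` line, the byte in angle brackets marks where EIP pointed when the fault hit. The sketch below is only an illustration (the helper name `marked_byte` is made up here, not part of any kernel tooling). It locates that byte in the `Code:` line of the first oops and looks 0xa bytes back, where the `+0xa` offset says the function begins.

```python
def marked_byte(code_line):
    """Return (index, value) of the <xx>-marked byte in an oops 'Code:' line."""
    tokens = code_line.split(":", 1)[1].split()
    for index, token in enumerate(tokens):
        if token.startswith("<") and token.endswith(">"):
            return index, int(token[1:-1], 16)
    raise ValueError("no <xx> marker found in Code: line")

# Code: line copied from the first oops in this report.
CODE = (
    "Code: 70 e8 9e f8 ff ff 8b 47 70 e8 44 b7 fe ff 8b 47 70 5a 5b 5e 5f 5d "
    "e9 89 f8 ff ff 58 5b 5e 5f 5d c3 55 57 56 53 83 ec 10 8b 78 10 <8b> 2f "
    "e8 c7 98 90 c7 66 83 bf 96 02 00 00 00 8b 85 3c 01 00 00"
)

tokens = CODE.split(":", 1)[1].split()
index, value = marked_byte(CODE)

# Position and value of the byte EIP pointed at.
print(index, hex(value))
# Byte 0xa positions earlier, i.e. the function entry implied by "+0xa".
print(tokens[index - 0xa])
```

The byte found at the implied function entry is `55`, the x86 `push %ebp` opcode, which fits a function prologue. Read as 32-bit x86 code, `8b 78 10` would load EDI from offset 0x10 of EAX, and the marked `8b 2f` would then read through EDI. EDI holds 0000034c in the register dump, which matches the reported fault address.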
The ACPI thing, if it is a bug (I didn't realize that it was), was I think introduced much earlier than 2.6.26. I think the same ACPI error strings are visible in 2.6.24 and probably in much earlier kernels too. However, they don't prevent booting. Thank you.
Thanks for this information. Please open a separate bug for the ACPI problem.
OK I have opened bug 11049 for the ACPI thing. Thank you.
Reply-To: akpm@linux-foundation.org

You removed everyone from cc. Please don't do that - there's not much point in asking me to do things - this bug is reported by kurk@shiftmail.org.

I don't know what "we do not assist with compiling drivers" can possibly mean. Eric, can you please help here?

On Mon, 7 Jul 2008 07:28:00 -0600 "Support, Software" <support@lsi.com> wrote:

> Unfortunately, we do not assist with compiling drivers.
>
> I would recommend updating the firmware and BIOS on the controllers you are
> using, so that the compiled driver could communicate with the controller
> better.
>
> In order to point you to the correct package for the controller that is not
> taking the compiled driver, I will need for you to send me all of the numbers
> off of the front and back of the controller.
>
> -----Original Message-----
> From: Andrew Morton [mailto:akpm@linux-foundation.org]
> Sent: Sunday, July 06, 2008 3:34 PM
> To: linux-scsi@vger.kernel.org; linux-acpi@vger.kernel.org
> Cc: bugme-daemon@bugzilla.kernel.org; Moore, Eric; Support, Software
> Subject: Re: [Bugme-new] [Bug 11045] New: Bug in MPT Fusion 2.6.26-rc7 unbootable
>
> (switched to email. Please respond via emailed reply-to-all, not via the
> bugzilla web interface).
>
> On Sun, 6 Jul 2008 11:22:08 -0700 (PDT) bugme-daemon@bugzilla.kernel.org wrote:
>
> > http://bugzilla.kernel.org/show_bug.cgi?id=11045
> >
> > Summary: Bug in MPT Fusion 2.6.26-rc7 unbootable
> > Product: Drivers
> > Version: 2.5
> > KernelVersion: 2.6.26-rc7
> > Platform: All
> > OS/Version: Linux
> > Tree: Mainline
> > Status: NEW
> > Severity: normal
> > Priority: P1
> > Component: Other
> > AssignedTo: drivers_other@kernel-bugs.osdl.org
> > ReportedBy: kurk@shiftmail.org
>
> We have two bugs here. One in mpt-fusion and what I suspect is a
> post-2.6.25 regression in ACPI.
Reply-To: James.Bottomley@HansenPartnership.com

On Tue, 2008-07-08 at 01:57 -0700, Andrew Morton wrote:
> You removed everyone from cc. Please don't do that - there's not much
> point in asking me to do things - this bug is reported by
> kurk@shiftmail.org.
>
> I don't know what "we do not assist with compiling drivers" can possibly
> mean. Eric, can you please help here?

heh, well, cc'ing a support line on a technical bug report isn't necessarily conducive to producing useful results ... what we're discussing is probably already at level 4 or 5 (the real engineering problems). Support calls go in at levels 1-3 (as in consult manual and spit out canned response before triaging for escalation).

That said, this line:

mptbase: ioc0: ERROR - Doorbell ACK timeout (count=4999), IntStatus=80000009!

is absolutely characteristic of a lost interrupt.

With the current LSI driver, we have two possible causes for this. One is the usual ACPI screw up that we never seem to be able to fix. The other is that the driver recently enabled MSI (commit 23a274c8a5adafc74a66f16988776fc7dd6f6e51 in v2.6.26-rc1). For the former, just follow the usual ACPI screw up recipe. For the latter, you should see this message in the boot up:

mptbase: ioc0: PCI-MSI enabled

MSI can be turned off again by using the module parameter mpt_msi_enable=0.

Unfortunately, the true fix is to find out if the motherboard really has a global MSI problem (and I know MSI works with the LSI because I have a 1030 in an ia64 system here working just fine) and add it to the PCI quirks file as unable to use MSI.

James
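The two log lines James points to can be checked mechanically against a saved boot log. A minimal sketch in Python, assuming the log is available as plain text; the function name `triage` and the returned wording are made up for illustration, and the matched strings are the ones quoted in this thread.

```python
# Substrings quoted in the thread; both are printed by mptbase.
MSI_ENABLED = "PCI-MSI enabled"
DOORBELL_TIMEOUT = "Doorbell ACK timeout"


def triage(boot_log):
    """Classify a boot log by the two mptbase messages discussed above.

    Hypothetical helper: it only looks for substrings, nothing more.
    """
    mpt_lines = [line for line in boot_log.splitlines() if "mptbase:" in line]
    timed_out = any(DOORBELL_TIMEOUT in line for line in mpt_lines)
    msi_on = any(MSI_ENABLED in line for line in mpt_lines)
    if not timed_out:
        return "no doorbell timeout seen"
    if msi_on:
        return "timeout with MSI enabled: try mpt_msi_enable=0"
    return "timeout without MSI: suspect IRQ routing"


# Sample built from the two messages quoted in this thread.
SAMPLE_LOG = (
    "mptbase: ioc0: PCI-MSI enabled\n"
    "mptbase: ioc0: ERROR - Doorbell ACK timeout (count=4999), "
    "IntStatus=80000009!\n"
)
print(triage(SAMPLE_LOG))
```

When the first branch applies, the parameter can be passed at module load time, for example with a line such as `options mptbase mpt_msi_enable=0` in a modprobe configuration file; where that file lives depends on the distribution.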
On Tuesday 08 July 2008 08:08:46 am James Bottomley wrote:
> That said, this line:
>
> mptbase: ioc0: ERROR - Doorbell ACK timeout (count=4999), IntStatus=80000009!
>
> is absolutely characteristic of a lost interrupt.
>
> With the current LSI driver, we have two possible causes for this. One
> is the usual ACPI screw up that we never seem to be able to fix.

Which ACPI screw up is that? And what's the usual recipe?

I know about the ancient "pci=routeirq" recipe, but as far as I know, there are no current problems that require that.

> The
> other is that the driver recently enabled MSI (commit
> 23a274c8a5adafc74a66f16988776fc7dd6f6e51 in v2.6.26-rc1). For the
> former, just follow the usual ACPI screw up recipe. For the latter, you
> should see this message in the boot up:
>
> mptbase: ioc0: PCI-MSI enabled
>
> MSI can be turned off again by using the module parameter
> mpt_msi_enable=0.
>
> Unfortunately, the true fix is to find out if the motherboard really has
> a global MSI problem (and I know MSI works with the LSI because I have a
> 1030 in an ia64 system here working just fine) and add it to the PCI
> quirks file as unable to use MSI.
>
> James
Reply-To: James.Bottomley@HansenPartnership.com

On Tue, 2008-07-08 at 10:51 -0600, Bjorn Helgaas wrote:
> On Tuesday 08 July 2008 08:08:46 am James Bottomley wrote:
> > That said, this line:
> >
> > mptbase: ioc0: ERROR - Doorbell ACK timeout (count=4999), IntStatus=80000009!
> >
> > is absolutely characteristic of a lost interrupt.
> >
> > With the current LSI driver, we have two possible causes for this. One
> > is the usual ACPI screw up that we never seem to be able to fix.
>
> Which ACPI screw up is that? And what's the usual recipe?

The usual screw up where subtle ACPI breakage from release to release causes some IRQs to get misrouted.

Usually you start with noacpi and cycle through the pci routing options

> I know about the ancient "pci=routeirq" recipe, but as far as I know,
> there are no current problems that require that.

If you actually read this bug report, you'll see there was a message

ACPI: Resource is not an IRQ entry

Just before the fusion IRQ failed to get delivered, so I think it's a good indicator that there *are* ACPI problems ...

James
On Tuesday 08 July 2008 11:23:33 am James Bottomley wrote:
> On Tue, 2008-07-08 at 10:51 -0600, Bjorn Helgaas wrote:
> > Which ACPI screw up is that? And what's the usual recipe?
>
> The usual screw up where subtle ACPI breakage from release to release
> causes some IRQs to get misrouted.
>
> Usually you start with noacpi and cycle through the pci routing options

Don't worry, I wasn't trying to talk you out of an ACPI bug report; I just wanted to get enough specifics so I could see whether it was something I could fix.

> If you actually read this bug report, you'll see there was a message
>
> ACPI: Resource is not an IRQ entry
>
> Just before the fusion IRQ failed to get delivered, so I think it's a
> good indicator that there *are* ACPI problems ...

These messages also happen with 2.6.25, where the MPT Fusion driver worked, so Kurk opened a separate bugzilla,
http://bugzilla.kernel.org/show_bug.cgi?id=11049
for them.

Yakui Zhao thinks the messages are harmless because they're related to interrupt link devices that we don't use in IOAPIC mode, and given that the driver works in 2.6.25, that seems plausible to me.

Regardless, the messages are alarming and annoying. I'd like to understand them better, but I'll pursue that in the 11049 bugzilla.

Bjorn
Reply-To: akpm@linux-foundation.org

On Tue, 8 Jul 2008 14:56:53 -0600 Bjorn Helgaas <bjorn.helgaas@hp.com> wrote:

> Regardless, the messages are alarming and annoying. I'd like
> to understand them better, but I'll pursue that in the 11049
> bugzilla.

Let us not forget the other part of this report:

BUG: unable to handle kernel NULL pointer dereference at 0000034c
IP: [<f885cc5e>] :mptspi:mptspi_dv_renegotiate_work+0xa/0x9f
Oops: 0000 [#1] SMP
Reply-To: James.Bottomley@HansenPartnership.com

On Tue, 2008-07-08 at 14:47 -0700, Andrew Morton wrote:
> Let us not forget the other part of this report:
>
> BUG: unable to handle kernel NULL pointer dereference at 0000034c
> IP: [<f885cc5e>] :mptspi:mptspi_dv_renegotiate_work+0xa/0x9f
> Oops: 0000 [#1] SMP

That's fixed in the scsi-rc-fixes tree ... but it's a symptom, not a cause. If essential storage is on this adapter, the system will still be unbootable.

James
Reply-To: sathya.prakash@lsi.com

This may be a problem due to enabling MSI for SPI controllers. I have posted another message in the list providing the correction patch which is already in scsi-misc tree.

If the problem is gone with changing the module parameter mpt_msi_enable=0 or by applying the patch http://marc.info/?l=linux-scsi&m=121131228827682&w=4 then it might be due to MSI enabling.
>Hi,
>Can you please attach the complete log that you collected with a
>serial console to this bug report:
>  http://bugzilla.kernel.org/show_bug.cgi?id=11045
>
>The excerpts are missing important information about how the MPT
>interrupt is routed when the driver claims the device. It's
>easier if we can just look at the complete log; sometimes it
>answers questions we didn't think of the first time around.
>
>Thanks,
>  Bjorn

Hi Bjorn,
what I pasted in is absolutely the complete log obtained with the serial cable. I did not modify it in any way or cut away parts. As I wrote, I obtained it twice, once stopping the boot after 8 minutes, the other time stopping after 5 (and the stack traces are slightly different in the two boots, as you can see). The error messages loop and never end, so it is necessary to stop the boot after a while.

The other attachments, like dmesg (2.6.25), lspci (2.6.25) and .config: yes, I grepped or cut them, but not the serial dump.

(Is it ok if I reply from the web interface, or should I reply via email? Sorry, it's the first time I have used bugzilla.kernel.org)
Created attachment 16785: Boot messages with quiet disabled, serial dump
Bjorn, you were right, the serial dump was not complete because the "quiet" option was specified as a kernel parameter. I have just uploaded the full (non-quiet) serial dump as an attachment via the bugzilla web interface. Thank you
Hi all,
Good news! James and Sathya were correct, the bug is related to MSI: specifying mpt_msi_enable=0 as an option for the mptbase module solves the problem and the system can boot as usual.

Having said this, do you still want me to try a patch, or perform some additional tests?

Just out of curiosity: do you intend to eventually modify the kernel so as to "support" and work around buggy hardware like the one we have (IBM xSeries 335), so that Linux can work out of the box even on this hardware?

Thanks, everybody, for your help
Prakash, Sathya wrote:
> This may be a problem due to enabling MSI for SPI controllers. I have posted
> another message in the list providing the correction patch which is already
> in scsi-misc tree.
> If the problem is gone with changing the module parameter mpt_msi_enable=0 or
> by applying the patch http://marc.info/?l=linux-scsi&m=121131228827682&w=4
> then it might be due to MSI enabling.

More good news: I confirm that yes, the problem is also fixed by the patch linked above by Sathya, and in that case there is no need to specify the option mpt_msi_enable=0 for mptbase. Either one of the two (patch or option) is enough to fix the problem.

It would be nice to see this patch in the final release of the 2.6.26 kernel.

Thank you
Reply-To: akpm@linux-foundation.org

On Thu, 10 Jul 2008 16:52:01 +0200 kurk <kurk@shiftmail.org> wrote:

> Prakash, Sathya wrote:
> > This may be a problem due to enabling MSI for SPI controllers. I have
> > posted another message in the list providing the correction patch which is
> > already in scsi-misc tree.
> > If the problem is gone with changing the module parameter mpt_msi_enable=0
> > or by applying the patch http://marc.info/?l=linux-scsi&m=121131228827682&w=4
> > then it might be due to MSI enabling.
>
> Another good news: I confirm that yes, the problem is also fixed by the
> patch linked above by Sathya, and in that case it is not needed to
> specify option mpt_msi_enable=0 for mptbase. Any one of the two (patch
> or option) is enough to fix the problem.
> It would be nice to see this patch in the final release of the 2.6.26 kernel
> Thank you

James, shouldn't we put that into 2.6.26?

That whole patch series looks pretty desirable actually..
Reply-To: James.Bottomley@HansenPartnership.com

On Thu, 2008-07-10 at 16:44 -0700, Andrew Morton wrote:
> On Thu, 10 Jul 2008 16:52:01 +0200 kurk <kurk@shiftmail.org> wrote:
> > Another good news: I confirm that yes, the problem is also fixed by the
> > patch linked above by Sathya, and in that case it is not needed to
> > specify option mpt_msi_enable=0 for mptbase. Any one of the two (patch
> > or option) is enough to fix the problem.
> > It would be nice to see this patch in the final release of the 2.6.26
> > kernel
> > Thank you
>
> James, shouldn't we put that into 2.6.26?

I'm still not sure ... if it's a fault on the board with MSI, then yes, we need it in ... although the form would then be wrong because we probably should be identifying the faulty parts and blacklisting them. If it's actually a fault on the motherboard with MSI, then no, this isn't the patch series that should be in; we need the motherboard strings to blacklist it. Unfortunately, I can't seem to get an answer out of LSI on this question.

It looks like the commit will cherry pick easily enough ... although now I look at it the parameter's description is wrong.

> That whole patch series looks pretty desirable actually..

Well, it was billed as a driver update ... and it has a lot more than just trivial changes, so on an eve of release quality issue, I'd tend to say that wouldn't be a good idea.

James
Reply-To: Sathya.Prakash@lsi.com

I did a recheck on this, except FC 919X and 929X boards, everything else should work fine with MSI. Hence the SPI boards (1030) should work with MSI and the problem might be with the motherboard.

But we would like to keep the MSI disabled for SPI controllers since we have not tested internally with MSI and FC enabled by default for them in our recent drivers.

So I would like to request to pull in the patch to disable MSI for SPI & FC.

-Thanks
Sathya
Reply-To: James.Bottomley@HansenPartnership.com

On Fri, 2008-07-11 at 12:33 +0800, Prakash, Sathya wrote:
> I did a recheck on this, except FC 919X and 929X boards, everything
> else should work fine with MSI. Hence the SPI boards (1030) should
> work with MSI and the problem might be with the motherboard.

Right ... that's why I was asking ... my 1030 works fine with MSI. If there's a fault with the FC boards, then certainly they should have MSI disabled. The motherboard was my suspicion ... especially as older ones have SPI and newer ones have SAS (and the older ones are most likely to have MSI faults).

However, I think you can see from our point of view that if the problem is the motherboard, disabling MSI in the fusion is the wrong way to fix it. If we do it this way, we'll promptly get another slew of nasty bug reports for the next driver that enables MSI and doesn't work on this platform.

> But we would like to keep the MSI disabled for SPI controllers since
> we have not tested internally with MSI and FC enabled by default for
> them in our recent drivers.
> So I would like to request to pull in the patch to disable MSI for SPI & FC.

Yes, we'll do that ... I'll also see if the PCI maintainer can determine the information needed to blacklist the motherboards so that we don't get this all over again with them and a different driver.

James
Fixed by: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=27898988174bb211fd962ea73b9c6dc09f888705