Most recent kernel where this bug did not occur: 2.6.20.4 (see description)
Distribution: PLD Linux Distribution 2.0 (Ac)
Hardware Environment: Dell PowerEdge 1650 with Dell PowerEdge Expandable RAID Controller 3/Di (rev 01), 2 disks, mirror, PCI ID 1028:000a

Problem Description: The system does not boot at all using the aacraid SCSI driver with the Dell PERC 3/Di controller. The machine is currently running kernel 2.6.20.4 and it works perfectly. A few days ago I tried to upgrade to 2.6.22.9 and the system failed to boot. The driver correctly detects the controller and then the following messages are displayed on the console:

scsi 0:0:0:0 Direct-Access DELL mirror V1.0 PQ: 0 ANSI: 2
AAC:AAC received an unrecognized command [601]
aacraid: Host adapter abort request (0,1,0,0)
aacraid: Host adapter abort request (0,1,0,0)
aacraid: Host adapter reset request. SCSI hang ?
scsi 0:1:0:0 scsi: Device offlined - not ready after error recovery

... and so on for every scsi id and scsi host (0:1:1:0, 0:1:2:0, ... , 0:2:15:0). The system then fails with a kernel panic: unable to mount root filesystem, which is the obvious consequence. I have not tried building any 2.6.20.x kernel newer than .4. I tried 2.6.21.4 and 2.6.22.5 and they failed the same way 2.6.22.9 did.

Steps to reproduce: Try to boot with the root fs on a machine with a Dell PERC 3/Di.
Try loading the driver with aacraid.dacmode=0, aacraid.expose_physicals=0 or aacraid.nondasd=0 (any of these should turn off calls to ScsiPortCommand64). ScsiPortCommand64, or command 601, is the 64 bit scsi command issued to the physical targets. It looks like the PERC3/Di is indicating support for VM_CtHost[Read|Write]64 but does not have support for ScsiPortCommand64. These are supposed to be tied together. If you report that each of these flags individually resolves the problem, then I will have to add a QUIRK to the driver.
I've done some more testing and it seems that the controller does work after going through all those "aacraid: Host adapter abort request (0,1,0,0)" messages. The root FS mounting problem was caused by the initrd image and is not related to the module problem. After fixing my initrd I have the following situation:

1. When the aacraid module is loaded without options, I get abort requests for every scsi host (2) and every scsi id (15 for each host). Boot is successful but it takes over 20 minutes before scsi is initialized.
2. When loaded with dacmode=0 or expose_physicals=0, everything works perfectly.
3. When loaded with nondasd=0, the warnings are still there but only for scsi host 2 (and all 15 ids), so the controller initializes in about 10 minutes.

The problem isn't fatal, but waiting 10-20 minutes for scsi initialization isn't nice. Maybe the option expose_physicals=0 or dacmode=0 should be forced for this PCI ID?
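As a rough consistency check on the timings reported above, the per-target delay can be estimated. This is only a sketch: it assumes the delay is spread evenly over the probed targets and takes "over 20 minutes" and "about 10 minutes" at face value. The function name is made up for illustration.

```python
def seconds_per_target(total_minutes, hosts, ids_per_host):
    """Average delay per probed target, assuming an even spread."""
    return total_minutes * 60 / (hosts * ids_per_host)

# No module options: 2 hosts x 15 ids, roughly 20 minutes in total.
print(seconds_per_target(20, hosts=2, ids_per_host=15))  # 40.0
# nondasd=0: 1 host x 15 ids, roughly 10 minutes in total.
print(seconds_per_target(10, hosts=1, ids_per_host=15))  # 40.0
```

Both cases come out at about 40 seconds per target, which would fit each target costing one abort/reset recovery cycle of similar length. The input figures are approximate, so this is indicative only.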
Created attachment 13101
PERC has no 64 bit scsi passthrough function.

Please make sure you have the latest Firmware in your Adapter. I'd hate to make any changes just because you are using obsolete Firmware.

dacmode=0 is a performance hit for runtime I/O on systems that have 64 bit addressable memory. Thus it is required that this be at the user's option, to balance functionality and performance. expose_physicals=0 addresses only one aspect of this problem; it does not solve the fundamental issue, which can be triggered by other scenarios.

The fundamental problem is that the (specific Firmware Version?) PERC3/Di (and no doubt all its 31 bit quirk brethren) has no scsi passthrough access function for 64 bit kernels. This access is used for storage management, nondasd devices and dasd physical exports (this last issue is the only one visible to you). I will need to add a check that disables the access under this scenario, with a loss of the associated functionality :-(. Please try the enclosed patch to see if it mitigates the problem.

This would not be a problem if the system had only 32 bit addressable memory and the driver were left to select dacmode=0 automatically, with no loss of performance or functionality. However, the code for detecting that was rejected by the community as impossible to make reliable on all processor architectures (it works fine for x86) and is available only in the Adaptec supplied driver. I may also suggest you use the Adaptec supplied driver to solve your problem.
FYI, I had a similar problem with a client's server, but with a 32 bit kernel. The generic Debian kernel 2.6.18-5-686 works fine; for the custom-built 2.6.23.8 (32 bit also) the customer reports an error message identical or similar to: "device offlined not ready after error recovery". Unfortunately I have no access to the full boot output or a remote console, and only very limited communication options with the customer, so my ability to provide more details on what exactly is printed during the boot process is very limited.

Adding all three of aacraid.dacmode=0 aacraid.nondasd=0 aacraid.expose_physicals=0 makes it boot. I will attempt to narrow this down to the smallest subset of options later, but can't perform experiments at this time.

The controller as shown in lspci is:

04:08.0 PCI bridge: Intel Corporation 80303 I/O Processor PCI-to-PCI Bridge (rev 01)

and in lshw:

*-storage
     description: RAID bus controller
     product: PowerEdge Expandable RAID Controller 3/Di
     vendor: Dell
     physical id: 8.1
     bus info: pci@04:08.1
     version: 01
     width: 32 bits
     clock: 66MHz
     capabilities: storage bus_master cap_list
     configuration: driver=aacraid latency=32
     resources: iomemory:f0000000-f7ffffff irq:19
Obviously, that should read: 04:08.1 RAID bus controller: Dell PowerEdge Expandable RAID Controller 3/Di (rev 01) for the lspci output in the above post. Pasted the wrong line, need coffee. Apologies.
This may be firmware-related on the Dell PowerEdge 2650. I encountered this issue with the latest 2.6.18 kernel RPM in CentOS 5.1 (kernel version 2.6.18-53.1.13.el5xen, kernel-xen-2.6.18-53.1.13.el5.rpm) on a PowerEdge 2650. I upgraded my PowerEdge 2650 BIOS to the latest version (A21, available on dell.com as file PE2650_BIOS_LX_A21.BIN) and the AAC RAID firmware to the latest version (version 2.8.1.7692, available on dell.com as file RAID_FRMW_LX_R168380.BIN). This appears to have completely resolved the issue. So, from my standpoint, this doesn't warrant a kernel patch, but rather a firmware upgrade. Hope this helps somebody else out there encountering this same issue.
If specific firmware versions (e.g. less than X) for a given card are known to be problematic, the driver init routine should detect these and warn loudly, if not fail to load altogether. We did this with the megaraid_legacy driver a long time ago to avoid firmware-induced data corruption.
I also encountered this issue (with kernel 2.6.24, previous kernel was 2.6.15 but I suppose the problem was triggered by upgrading memory from 4GB to 6GB) on PowerEdge 2650 (with Dell PowerEdge Expandable RAID Controller 3/Di). I can confirm that firmware/BIOS update (from version 2.8.0.6092 to 2.8.1.7692) solved it.
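The "firmware less than X" check suggested earlier in this thread amounts to a dotted-version comparison. Below is a minimal sketch in Python (not driver code). The two version strings are the ones reported in this thread; using 2.8.1.7692 as the threshold is an assumption, since the true minimum good version is not known, and the function and constant names are hypothetical.

```python
def parse_version(text):
    """Turn a dotted version such as '2.8.1.7692' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))

# Assumption: the version reported as working in this thread is used as
# the threshold; the real cut-off between bad and good firmware is unknown.
ASSUMED_MIN_GOOD = parse_version("2.8.1.7692")

def firmware_is_suspect(version_text):
    """True if the firmware is older than the assumed threshold."""
    return parse_version(version_text) < ASSUMED_MIN_GOOD

print(firmware_is_suspect("2.8.0.6092"))  # True
print(firmware_is_suspect("2.8.1.7692"))  # False
```

Comparing tuples of integers avoids the pitfall of comparing the strings directly, where "2.8.10" would sort before "2.8.9".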
(I am copying this from https://bugzilla.redhat.com/show_bug.cgi?id=457552)

The discussion here resulted in the following patch: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=94cf6ba11b

Unless overridden by the module option dacmode=0, aac_scsi_32_64() is used for aac_adapter_scsi() on 64bit-DMA-capable systems for all controllers with AAC_QUIRK_SCSI_32 and AAC_OPT_SGMAP_HOST64 set. aac_scsi_32_64() always returns FAILED on 64bit-DMA-capable systems if the adapter has the AAC_OPT_SGMAP_HOST64 flag set and the physical memory is >4GB. Thus, on every system with 64bit DMA and >4GB memory, aac_adapter_scsi() will always fail (!) for each controller with AAC_QUIRK_SCSI_32 and AAC_OPT_SGMAP_HOST64. In practice, that means to me that AAC_QUIRK_SCSI_32 implies that you can't use >4GB memory (that agrees with the findings in comment #17).

Perhaps the Perc 3/Di has this limitation, but the 2120S and 2200S certainly don't. In the above-mentioned Red Hat bugzilla, the patch was modified by removing the 2120S and 2200S from the list of controllers with AAC_QUIRK_SCSI_32. Most probably, that isn't even sufficient. I would suggest removing AAC_QUIRK_SCSI_32 for all controllers except the Perc 3/Di.

Mark, how did you generate the list of controllers that you assigned AAC_QUIRK_SCSI_32 to? Why did you include the 2120S and 2200S?
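To make the condition described in this comment easier to follow, here is a small Python model of it. It is a sketch of the behaviour as described in the comment, not the kernel's actual code: the function name, parameter names and the 4GB constant are illustrative assumptions.

```python
FOUR_GB = 4 * 1024**3

def passthrough_fails(dacmode, dma64_capable, quirk_scsi_32,
                      opt_sgmap_host64, phys_mem_bytes):
    """Model of the behaviour described above; not the kernel's code."""
    # With dacmode=0, or without 64-bit DMA, the 32/64 path is not selected.
    if not dacmode or not dma64_capable:
        return False
    # The 32/64 path is only selected when both flags are set.
    if not (quirk_scsi_32 and opt_sgmap_host64):
        return False
    # On that path, commands fail once physical memory exceeds 4GB.
    return phys_mem_bytes > FOUR_GB

# 6GB of memory with both flags set: passthrough fails.
print(passthrough_fails(1, True, True, True, 6 * 1024**3))  # True
# Same controller with 4GB of memory: no failure.
print(passthrough_fails(1, True, True, True, 4 * 1024**3))  # False
# dacmode=0 overrides the selection.
print(passthrough_fails(0, True, True, True, 6 * 1024**3))  # False
```

Under this model, the 4GB to 6GB memory upgrade mentioned earlier in the thread is exactly the kind of change that would flip the result from working to failing.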