Created attachment 227821 [details]
dmesg (before igfx_off)

(I've already posted this on the vfio-users and linux-scsi mailing lists, so I'm just copying the description with some minor changes.)

Hi,

I have some issues with an Adaptec 3805 and found that they could be related to the IOMMU, as described here [1]. I've tried various options (iommu=pt etc.), but nothing seems to change the behaviour of the controller. I also tried the patch [2] from Alex Williamson on top of 4.6.5.

The controller seems to work fine (random r/w) for some hours under Windows 7 (that's on a different machine, an AMD Phenom) or with Clonezilla (on the same machine, that is: an Intel i5-2400 with an Intel DQ67SW motherboard).

I did some further digging and added intel_iommu=igfx_off (found in a thread about some other hardware that mentioned Sandy Bridge), and it seems to work: I just transferred 300 gigabytes^W^Wover 1 TB of data to a testing array. On the other hand, it still fills the logs with this message, over and over, every 10 seconds:

AAC: Host adapter dead -1

I've also tried various live 64-bit Linux distros:
1/2. Arch Bang (4.6.4-1-ARCH) and Ubuntu 16.04.1 (4.4.0-31-generic #50-Ubuntu) - prints errors every 10 seconds (afair I didn't test whether it works, as I expected it doesn't)
3. Ubuntu 14.04 (4.4.0-15-generic #31-Ubuntu) - doesn't print errors, works

--below info from before intel_iommu=igfx_off--

I could read the array (dd if=controller of=/dev/null), but not write.
Writing with dd, or trying to make a filesystem or partition or whatever, ended with messages similar to this:

DMAR: DRHD: handling fault status reg 3
DMAR: DMAR:[DMA Write] Request device [03:01.0] fault addr ffbb5000
DMAR:[fault reason 02] Present bit in context entry is clear

grub:
kernel /boot/vmlinuz root=*cut* enable_mtrr_cleanup intel_iommu=on rw vfio-pci.ids=1002:9460,1002:aa30,8086:1c26 vfio_iommu_type1.allow_unsafe_interrupts=1
(vfio-pci.ids are the GPU, GPU audio and USB, for passthrough to a Windows VM)

dmesg attached

~ # lspci -nnvs 03:0e.0
03:0e.0 RAID bus controller [0104]: Adaptec AAC-RAID [9005:0285]
        Subsystem: Adaptec 3805 [9005:02bc]
        Flags: bus master, stepping, 66MHz, medium devsel, latency 32, IRQ 18
        Memory at fb800000 (64-bit, non-prefetchable) [size=2M]
        Expansion ROM at fba00000 [disabled] [size=256K]
        Capabilities: [c0] Power Management version 2
        Capabilities: [d0] MSI: Enable- Count=1/2 Maskable- 64bit+
        Capabilities: [e0] PCI-X non-bridge device
        Kernel driver in use: aacraid

[1] https://www.redhat.com/archives/vfio-users/2016-July/msg00046.html
[2] https://www.redhat.com/archives/vfio-users/2016-July/msg00063.html
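One detail in the output above is easy to miss: the DMAR fault names requester 03:01.0, while lspci shows the controller at 03:0e.0. The following is a minimal Python sketch for pulling the requester out of such log lines and comparing it with a known device address. The regular expression and the helper name `parse_dmar_faults` are my own and were only checked against the three lines quoted in this report; other kernel versions may format the fault differently. The sketch only reports the mismatch; it does not by itself establish the root cause.

```python
import re

# Log lines copied from the report above.
LOG = """\
DMAR: DRHD: handling fault status reg 3
DMAR: DMAR:[DMA Write] Request device [03:01.0] fault addr ffbb5000
DMAR:[fault reason 02] Present bit in context entry is clear
"""

# Bus:device.function of the controller, as shown by lspci in the same report.
CONTROLLER_BDF = "03:0e.0"

# Assumed line format; matches the fault line quoted in this report.
FAULT_RE = re.compile(
    r"\[DMA (?P<op>Read|Write)\] Request device "
    r"\[(?P<bdf>[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\] "
    r"fault addr (?P<addr>[0-9a-f]+)"
)


def parse_dmar_faults(text):
    """Return (operation, requester BDF, fault address) for each fault line."""
    return [
        (m.group("op"), m.group("bdf"), int(m.group("addr"), 16))
        for m in FAULT_RE.finditer(text)
    ]


faults = parse_dmar_faults(LOG)
for op, bdf, addr in faults:
    relation = "matches" if bdf == CONTROLLER_BDF else "differs from"
    print(f"DMA {op} from {bdf} at {addr:#x}: {relation} controller {CONTROLLER_BDF}")
```

To check a full log instead of the quoted lines, the text of a saved dmesg dump can be passed to `parse_dmar_faults` in place of `LOG`.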
(I'm a kernel bugzilla newbie, and as a newbie I'm not sure how much attention the kernel bugzilla gets, or how it gets along with the mailing lists and the other way around. So I'm posting here my answers to David Carroll's questions [1] (snipped just a bit).)

# --
> Hi Piotr,
>
> You had indicated that a kernel using Alex Williamson's patch allowed
> you to use the system correctly. Is that true?

Hi David,

I had to review my findings, as I was a bit lost after trying so many different settings and live distros.

With intel_iommu=igfx_off on a vanilla 4.7.0 kernel:
- it works (Gentoo Linux) with Alex's patch
- it doesn't work (Gentoo Linux) without Alex's patch

> You also indicated that Ubuntu 14.04 worked, while 16.04 did not. Is
> that true?

Ubuntu 14.04 (speaking of Ubuntu here, I mean running the amd64 image from a USB stick; I'm not sure, but I think it is 14.04.4, as that was the latest available a week or so ago): it doesn't print "AAC: Host adapter dead -1" and it works.

Ubuntu 16.04 (as above; it is 16.04.1): every 10 seconds it prints the "dead adapter" message, but it works. When I added intel_iommu=on at boot, it didn't work (similar DMAR errors to those in the posted dmesg without Alex's patch).

(both Ubuntus are amd64)

> Looking at the aacraid driver shipped with the Ubuntu flavors seems to
> be the same version of the driver. Do either of those kernel's have
> Alex's patch applied for the 3805?

I just used live images, so, sadly, I just don't know.

> At this point, assuming the above statements are true, I would believe
> that you would get the best results when using a kernel with Alex's
> patches applied.

Yeah, but I would love to get rid of those messages printed every ~10 seconds, or at least know why they're there if the adapter seems to work fine. If Ubuntu 14.04 and 16.04 share the same driver, why does one of them print those messages and the other doesn't?
# --

[1] https://www.mail-archive.com/vfio-users@redhat.com/msg01747.html
Created attachment 228381 [details] dmesg vanilla 4.7
Created attachment 228391 [details] dmesg with mentioned Alex patch
I just created an account and wanted to post a confirmation of this bug. On a chassis with this controller, the "AAC: Host Adapter Dead -1" message did not appear at all on an Ubuntu 14.04.1 installation (OS installed, not live USB), but suddenly showed up after a fresh installation of 16.04 (amd64) on the same chassis, and it persists just like the OP says, every 10 s. Otherwise the controller works normally with a 4-disk RAID5 array, with no noticeable performance degradation as tested with hdparm -T.

Copying here the relevant lspci -vv stub in case it helps:

---(cut here)---
02:0e.0 RAID bus controller: Adaptec AAC-RAID
        Subsystem: Adaptec 3805
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping+ SERR+ FastB2B- DisINTx-
        Status: Cap+ 66MHz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 32 (250ns min, 250ns max), Cache Line Size: 32 bytes
        Interrupt: pin A routed to IRQ 18
        Region 0: Memory at e8000000 (64-bit, non-prefetchable) [size=2M]
        Capabilities: [c0] Power Management version 2
                Flags: PMEClk- DSI- D1+ D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [d0] MSI: Enable- Count=1/2 Maskable- 64bit+
                Address: 0000000000000000 Data: 0000
        Capabilities: [e0] PCI-X non-bridge device
                Command: DPERE- ERO- RBC=512 OST=4
                Status: Dev=02:0e.0 64bit+ 133MHz+ SCD- USC- DC=bridge DMMRBC=1024 DMOST=4 DMCRS=16 RSCEM- 266MHz- 533MHz-
        Kernel driver in use: aacraid
        Kernel modules: aacraid
Had the same error after upgrading from a 3.14.X kernel to 4.4.15:

[ 59.713349] AAC: Host adapter dead -1
[ 69.713352] AAC: Host adapter dead -1
[ 79.713347] AAC: Host adapter dead -1
[ 89.713344] AAC: Host adapter dead -1
[ 99.713344] AAC: Host adapter dead -1
[ 109.713340] AAC: Host adapter dead -1
[ 119.713341] AAC: Host adapter dead -1
[ 129.713338] AAC: Host adapter dead -1
[ 139.713344] AAC: Host adapter dead -1
[ 149.713336] AAC: Host adapter dead -1
[ 159.713336] AAC: Host adapter dead -1
[ 169.713333] AAC: Host adapter dead -1
[ 179.713332] AAC: Host adapter dead -1
[ 189.713329] AAC: Host adapter dead -1
[ 199.713331] AAC: Host adapter dead -1
[ 209.713336] AAC: Host adapter dead -1
[ 219.713331] AAC: Host adapter dead -1
[ 229.713329] AAC: Host adapter dead -1
[ 239.713332] AAC: Host adapter dead -1

Arrays were accessible. In my case it was an Adaptec 3405 RAID card.
I no longer have this system: I replaced the RAID card with an Adaptec 8405, and with the 8405 the problem doesn't happen.
I'm having this same 10-second issue with a 3805 on Ubuntu 16.04 LTS with kernel 4.4.0-34-generic.

Some dmesg samplings:

$ dmesg |tail --lines=50 |grep AAC
[46793.664043] AAC: Host adapter dead -1
[46803.664049] AAC: Host adapter dead -1
[46813.664046] AAC: Host adapter dead -1
[46823.664048] AAC: Host adapter dead -1
[46833.664048] AAC: Host adapter dead -1
[46843.664053] AAC: Host adapter dead -1
[46853.664045] AAC: Host adapter dead -1
[46863.664051] AAC: Host adapter dead -1
[46873.664043] AAC: Host adapter dead -1
[46883.664052] AAC: Host adapter dead -1
[46893.664048] AAC: Host adapter dead -1
[46903.664051] AAC: Host adapter dead -1
[46913.664049] AAC: Host adapter dead -1
[46923.664046] AAC: Host adapter dead -1
[46933.664050] AAC: Host adapter dead -1
[46943.664059] AAC: Host adapter dead -1
[46953.664035] AAC: Host adapter dead -1
[46963.664047] AAC: Host adapter dead -1
[46973.664047] AAC: Host adapter dead -1
[46983.664048] AAC: Host adapter dead -1
[46993.664067] AAC: Host adapter dead -1
[47003.664044] AAC: Host adapter dead -1
[47013.664050] AAC: Host adapter dead -1
[47023.664049] AAC: Host adapter dead -1
[47033.664059] AAC: Host adapter dead -1
[47043.664062] AAC: Host adapter dead -1
[47053.664044] AAC: Host adapter dead -1
[47063.664059] AAC: Host adapter dead -1
[47073.664054] AAC: Host adapter dead -1
[47083.664062] AAC: Host adapter dead -1
[47093.664094] AAC: Host adapter dead -1
[47103.664040] AAC: Host adapter dead -1
[47113.664047] AAC: Host adapter dead -1
[47123.664046] AAC: Host adapter dead -1
[47133.664048] AAC: Host adapter dead -1
[47143.664050] AAC: Host adapter dead -1
[47153.664041] AAC: Host adapter dead -1
[47163.664039] AAC: Host adapter dead -1
[47173.664059] AAC: Host adapter dead -1
[47183.664046] AAC: Host adapter dead -1
[47193.664047] AAC: Host adapter dead -1
[47203.664070] AAC: Host adapter dead -1
[47213.664054] AAC: Host adapter dead -1
[47223.664049] AAC: Host adapter dead -1
[47233.664048] AAC: Host adapter dead -1
[47243.664062] AAC: Host adapter dead -1
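The timestamps in the dmesg excerpt above can be used to check how regular the message really is. The following is a small Python sketch, using the first four lines of that excerpt as sample input; the helper name `message_intervals` and the regular expression are my own and assume the default dmesg timestamp format (seconds since boot in square brackets).

```python
import re
from statistics import mean

# First four lines of the dmesg excerpt above.
SAMPLE = """\
[46793.664043] AAC: Host adapter dead -1
[46803.664049] AAC: Host adapter dead -1
[46813.664046] AAC: Host adapter dead -1
[46823.664048] AAC: Host adapter dead -1
"""

# Assumes the default dmesg timestamp format: "[seconds.microseconds] text".
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\] AAC: Host adapter dead -1\s*$", re.MULTILINE)


def message_intervals(dmesg_text):
    """Return the gaps in seconds between consecutive 'dead adapter' messages."""
    stamps = [float(s) for s in LINE_RE.findall(dmesg_text)]
    return [later - earlier for earlier, later in zip(stamps, stamps[1:])]


intervals = message_intervals(SAMPLE)
print(f"{len(intervals)} intervals, mean {mean(intervals):.3f} s, "
      f"min {min(intervals):.3f} s, max {max(intervals):.3f} s")
```

For these sample lines, every gap is within a few microseconds of 10 seconds, which looks more like a fixed periodic check than a sporadic hardware fault. That reading is an inference from the timestamps only.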
I have the exact same issue on my server running with an Adaptec 3405. The root filesystem is on an SSD attached to that controller. Until I upgraded from 14.04 LTS to 16.04 LTS it was working fine.

The issue with the 'dead adapter' started when I first upgraded to a new kernel (from 4.2.0-41 to 4.2.0-42). To test whether a full system upgrade to 16.04 LTS would help, I upgraded the system to that release. But on that version the problem is still there and the system is not usable.

Now I'm running 16.04 LTS with the older kernel (4.2.0-41), and that's working just fine. But I want to use the newest kernel because that's how it should be :-)

Here is some controller info:

===
root@server:~# lspci -nnvs 03:0e.0
03:0e.0 RAID bus controller [0104]: Adaptec AAC-RAID [9005:0285]
        Subsystem: Adaptec 3405 [9005:02bb]
        Flags: bus master, stepping, 66MHz, medium devsel, latency 64, IRQ 18
        Memory at fda00000 (64-bit, non-prefetchable) [size=2M]
        Expansion ROM at fdcc0000 [disabled] [size=256K]
        Capabilities: [c0] Power Management version 2
        Capabilities: [d0] MSI: Enable- Count=1/2 Maskable- 64bit+
        Capabilities: [e0] PCI-X non-bridge device
        Kernel driver in use: aacraid
        Kernel modules: aacraid
===
The 3405 in my lab computer exhibits the same symptoms, also on Ubuntu 16.04 LTS with kernel 4.4.0-34-generic. So both the Adaptec 3805 and 3405 models are affected.
Created attachment 229741 [details]
dmesg vanilla 4.3.6 (with Alex patch)

I tried the latest released version of earlier vanilla 4.x kernel series. As 4.6.x prints the messages over and over, I skipped the 4.5 series and tried:
- 4.4.19 - prints dead messages
- 4.3.6 - works without messages (with Alex's patch [1], intel_iommu=igfx_off)

Not sure if this helps, but I hope it will. Any suggestions on how to proceed from here?

[1] https://www.redhat.com/archives/vfio-users/2016-July/msg00063.html
Created attachment 229751 [details] config-4.3.6
Created attachment 229761 [details] config-4.7.0
The same problem for me. Ubuntu LTS 16.04.

# uname -a
Linux trinity 4.4.0-36-generic #55-Ubuntu SMP Thu Aug 11 18:01:55 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

# arcconf getconfig 1
Controllers found: 1
----------------------------------------------------------------------
Controller information
----------------------------------------------------------------------
   Controller Status          : Optimal
   Channel description        : SAS/SATA
   Controller Model           : Adaptec 3805
   Controller Serial Number   : 8C211053127
<...>
   --------------------------------------------------------
   Controller Version Information
   --------------------------------------------------------
   BIOS                       : 5.2-0 (17342)
   Firmware                   : 5.2-0 (17342)
   Driver                     : 1.2-1 (41010)
   Boot Flash                 : 5.2-0 (17342)

$ w
 14:24:22 up 20:41 <...>

$ journalctl -b | grep 'AAC: Host adapter dead -1' | wc -l
7284
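The two numbers above (uptime 20:41 and 7284 logged messages since boot) are enough for a rough rate estimate. A short Python sketch of that arithmetic follows; the helper name `seconds_from_uptime` is my own, and the per-day figure assumes the rate stays constant.

```python
def seconds_from_uptime(hours, minutes):
    """Convert an 'up HH:MM' value, as printed by w, into seconds."""
    return hours * 3600 + minutes * 60


uptime_s = seconds_from_uptime(20, 41)   # "up 20:41" from w
message_count = 7284                     # from journalctl -b | grep ... | wc -l

avg_interval_s = uptime_s / message_count
per_day = 86400 / avg_interval_s         # assumes a constant rate

print(f"uptime: {uptime_s} s")
print(f"average interval: {avg_interval_s:.2f} s per message")
print(f"roughly {per_day:.0f} messages per day")
```

The average comes out slightly above 10 seconds, which is consistent with a 10-second period if the first message only appears some time after boot. That explanation is a guess; the counts alone do not confirm it.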
I have the exact same problem on Proxmox, which is based on Debian Jessie (Linux kernel 4.2.4), with an Adaptec RAID 31605. It repeats "AAC: Host adapter dead -1" every 10 seconds, but everything is working.

Is there any way to get it to stop repeating these messages? It may eventually max out the messages log. Does anyone know what is being done to patch this?
You can find the message in:
drivers/scsi/aacraid/commsup.c
In my 4.7.3 tree it is at line 1700.
(In reply to HE - IT Services from comment #13)
> I have the exact same problem on Proxmox which is using Debian Jesse Linux
> Kernel 4.2.4, using Adaptec Raid 31605, it repeats every 10 seconds, "AAC:
> Host adapter dead -1" but everything is working......
>
> Is there any way to get it to stop repeating these messages, because it may
> max out the messages log eventually. Does anyone know what is being done to
> patch this?

Correction: I found that the kernel is 4.4.13-2-pve, which would be 4.4.13.
(In reply to Matthias from comment #14)
> you can find it in:
> drivers/scsi/aacraid/commsup.c
> in my 4.7.3 it is in Line 1700

Thanks Matthias.
Created attachment 240661 [details]
aac.patch

(In reply to Matthias from comment #14)
> you can find it in:
> drivers/scsi/aacraid/commsup.c
> in my 4.7.3 it is in Line 1700

That works as a workaround, but it is not a solution.

This is the patch I use to be able to use my controller: Alex's patch + "STFU for logs".
The problem was introduced by the commit below. Reverting this commit on kernel 4.9.3 makes the problem go away.

https://lkml.org/lkml/2017/1/15/47

commit 78cbccd3bd683c295a44af8050797dc4a41376ff
Author: Raghava Aditya Renukunta <RaghavaAditya.Renukunta@microsemi.com>
Date:   Mon Apr 25 23:32:37 2016 -0700

    aacraid: Fix for KDUMP driver hang

    When KDUMP is triggered the driver first talks to the firmware in INTX
    mode, but the adapter firmware is still in MSIX mode. Therefore the first
    driver command hangs since the driver is waiting for an INTX response and
    firmware gives a MSIX response. If when the OS is installed on a RAID
    drive created by the adapter KDUMP will hang since the driver does not
    receive a response in sync mode.

    Fixed by: Change the firmware to INTX mode if it is in MSIX mode before
    sending the first sync command.

    Cc: stable@vger.kernel.org
    Signed-off-by: Raghava Aditya Renukunta <RaghavaAditya.Renukunta@microsemi.com>
    Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
    Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Fix: http://git.kernel.org/cgit/linux/kernel/git/jejb/scsi.git/patch/?id=8af8e1c22f9994bb1849c01d66c24fe23f9bc9a0
As of June 21, 2017, I am still seeing this in Debian Stretch running kernel 4.10.0-rc6. I have an IBM ServeRaid 8s, which is a rebranded Adaptec 4805. I see the message repeat about every second, so it really fills up syslog quickly.
I stopped seeing this message with my Adaptec 3805 card after updating to Debian Stretch with the 4.9 kernel. I think this should be closed.
With my Adaptec 3405, the message no longer appears with the Ubuntu 16.04.3 kernel (4.4.0-104-generic), nor with a self-compiled 4.13.0.