Bug 151661 - Adaptec 3405 3805 prints "AAC: Host adapter dead -1" every 10 seconds but works fine anyway
Status: NEW
Alias: None
Product: SCSI Drivers
Classification: Unclassified
Component: AACRAID
Hardware: All Linux
Importance: P1 normal
Assignee: scsi_drivers-aacraid
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-08-06 21:47 UTC by Piotr Szymaniak
Modified: 2018-01-08 17:44 UTC (History)
CC List: 9 users

See Also:
Kernel Version: 4.6.5, 4.7.0
Subsystem:
Regression: No
Bisected commit-id:


Attachments
dmesg (before igfx_off) (26.65 KB, application/octet-stream)
2016-08-06 21:47 UTC, Piotr Szymaniak
dmesg vanilla 4.7 (27.12 KB, text/plain)
2016-08-11 15:54 UTC, Piotr Szymaniak
dmesg with mentioned Alex patch (26.64 KB, text/plain)
2016-08-11 15:55 UTC, Piotr Szymaniak
dmesg vanilla 4.3.6 (with Alex patch) (26.44 KB, text/plain)
2016-08-22 21:52 UTC, Piotr Szymaniak
config-4.3.6 (92.64 KB, text/plain)
2016-08-22 21:54 UTC, Piotr Szymaniak
config-4.7.0 (96.11 KB, text/plain)
2016-08-22 21:54 UTC, Piotr Szymaniak
aac.patch (950 bytes, patch)
2016-10-03 20:45 UTC, Piotr Szymaniak

Description Piotr Szymaniak 2016-08-06 21:47:02 UTC
Created attachment 227821 [details]
dmesg (before igfx_off)

(I've already posted this on the vfio-users and linux-scsi mailing lists, so I'm just copying the description with some minor changes)

Hi,

I have some issues with an Adaptec 3805 and found that they could be related to the IOMMU, as described here [1]. I've tried various options (iommu=pt etc.), but nothing seems to change the behaviour of the controller. I also tried the patch [2] from Alex Williamson on top of 4.6.5. The controller seems to work fine (random r/w) for some hours under Windows 7 (on a different machine, an AMD Phenom) and with Clonezilla (on the same machine, an Intel i5-2400 on an Intel DQ67SW motherboard).

I did some further digging and added intel_iommu=igfx_off (found in a thread about other hardware that mentioned Sandy Bridge) and it seems to work: I just transferred over 1 TB of data to a test array. On the other hand, it still fills the logs with this message every 10 seconds:
AAC: Host adapter dead -1

I've also tried various 64-bit live Linux distros:
1/2. Arch Bang (4.6.4-1-ARCH) and Ubuntu 16.04.1 (4.4.0-31-generic
#50-Ubuntu) - print errors every 10 seconds (AFAIR I didn't test whether
the controller works, as I expected it wouldn't)
3. Ubuntu 14.04 (4.4.0-15-generic #31-Ubuntu) - doesn't print errors,
works


--below info from before intel_iommu=igfx_off--
I could read the array (dd if=controller of=/dev/null), but not write to it. Writing with dd, creating a filesystem, partitioning or any other write ended with messages similar to these:
DMAR: DRHD: handling fault status reg 3
DMAR: DMAR:[DMA Write] Request device [03:01.0] fault addr ffbb5000
DMAR:[fault reason 02] Present bit in context entry is clear

grub:
kernel /boot/vmlinuz root=*cut* enable_mtrr_cleanup intel_iommu=on rw
vfio-pci.ids=1002:9460,1002:aa30,8086:1c26 vfio_iommu_type1.allow_unsafe_interrupts=1

(vfio-pci.ids are GPU, GPU audio and USB for passthru to Windows VM)
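
To double-check which IOMMU options a given boot actually used, the command line can be split into parameters. A minimal sketch in Python, using the parameters quoted above as-is (the helper `parse_cmdline` is hypothetical; on a running system the same string can be read from /proc/cmdline):

```python
def parse_cmdline(cmdline):
    """Map each kernel parameter to the list of its comma-separated values.

    Bare flags such as "rw" map to an empty list; repeated parameters
    accumulate their values.
    """
    params = {}
    for token in cmdline.split():
        key, _, value = token.partition("=")
        values = params.setdefault(key, [])
        if value:
            values.extend(value.split(","))
    return params


# Parameters copied from the grub entry above ("root=*cut*" is left as posted).
cmdline = (
    "root=*cut* enable_mtrr_cleanup intel_iommu=on rw "
    "vfio-pci.ids=1002:9460,1002:aa30,8086:1c26 "
    "vfio_iommu_type1.allow_unsafe_interrupts=1"
)
params = parse_cmdline(cmdline)
print(params["intel_iommu"])
print("igfx_off" in params["intel_iommu"])
```

For this boot it prints `['on']` and `False`, which matches the note that this part of the report predates intel_iommu=igfx_off.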

dmesg attached

~ # lspci -nnvs 03:0e.0
03:0e.0 RAID bus controller [0104]: Adaptec AAC-RAID [9005:0285]
        Subsystem: Adaptec 3805 [9005:02bc]
        Flags: bus master, stepping, 66MHz, medium devsel, latency 32, IRQ 18
        Memory at fb800000 (64-bit, non-prefetchable) [size=2M]
        Expansion ROM at fba00000 [disabled] [size=256K]
        Capabilities: [c0] Power Management version 2
        Capabilities: [d0] MSI: Enable- Count=1/2 Maskable- 64bit+
        Capabilities: [e0] PCI-X non-bridge device
        Kernel driver in use: aacraid

[1] https://www.redhat.com/archives/vfio-users/2016-July/msg00046.html
[2] https://www.redhat.com/archives/vfio-users/2016-July/msg00063.html
Comment 1 Piotr Szymaniak 2016-08-11 15:53:26 UTC
(I'm a kernel bugzilla newbie, so I'm not sure how much attention the kernel bugzilla gets or how well it is kept in sync with the mailing lists. So I'm also posting my answers to David Carroll's questions [1] here (snipped just a bit).)

# -- 
> Hi Piotr,
>
> You had indicated that a kernel using Alex Williamson's patch allowed
> you to use the system correctly. Is that true?

Hi David,

I had to review my findings as I was a bit lost after trying so many
different settings and Live distros. With intel_iommu=igfx_off on a
vanilla 4.7.0 kernel:
- it works (Gentoo Linux) with Alex's patch
- it doesn't work (Gentoo Linux) without Alex's patch


> You also indicated that Ubuntu 14.04 worked, while 16.04 did not. Is
> that true?

Ubuntu 14.04 (by Ubuntu I mean the amd64 image run from a USB stick;
I'm not sure, but I think it is 14.04.4, as that was the latest
available a week or so ago) doesn't print "AAC: Host adapter dead -1"
and it works.

Ubuntu 16.04 (as above; it is 16.04.1) prints the "dead adapter" message
every 10 seconds, but works. When I add intel_iommu=on at boot it
doesn't work (DMAR errors similar to those in the dmesg without Alex's
patch).

(both Ubuntus are amd64)


> The aacraid driver shipped with both Ubuntu flavors seems to be the
> same version. Does either of those kernels have Alex's patch applied
> for the 3805?

I just used the Live images so, sadly, I just don't know.


> At this point, assuming the above statements are true, I would believe
> that you would get the best results when using a kernel with Alex's
> patches applied.

Yeah, but I would love to get rid of those messages printed every ~10
seconds, or at least know why they're there if the adapter seems to work
fine. If Ubuntu 14.04 and 16.04 share the same driver, why does one of
them print those messages while the other doesn't?
# -- 

[1] https://www.mail-archive.com/vfio-users@redhat.com/msg01747.html
Comment 2 Piotr Szymaniak 2016-08-11 15:54:39 UTC
Created attachment 228381 [details]
dmesg vanilla 4.7
Comment 3 Piotr Szymaniak 2016-08-11 15:55:14 UTC
Created attachment 228391 [details]
dmesg with mentioned Alex patch
Comment 4 Chris "SKip" Mac-Stoker 2016-08-12 05:42:22 UTC
.. just created an account and wanted to post a confirmation of this bug, on a chassis with this controller, the "AAC: Host Adapter Dead -1" did not appear at all on an Ubuntu 14.04.1 installation (OS installed, not Live USB), but suddenly showed up after a fresh installation of 16.04 (amd64) on the same chassis and persists just like the OP says, every 10s. Otherwise the controller works normally with a RAID5 4-disk array with no noticeable performance degradation as tested with hdparm -T.

Copying here the relevant lspci -vv stub in case it helps:

---(cut here)---

02:0e.0 RAID bus controller: Adaptec AAC-RAID
        Subsystem: Adaptec 3805
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping+ SERR+ FastB2B- DisINTx-
        Status: Cap+ 66MHz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 32 (250ns min, 250ns max), Cache Line Size: 32 bytes
        Interrupt: pin A routed to IRQ 18
        Region 0: Memory at e8000000 (64-bit, non-prefetchable) [size=2M]
        Capabilities: [c0] Power Management version 2
                Flags: PMEClk- DSI- D1+ D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [d0] MSI: Enable- Count=1/2 Maskable- 64bit+
                Address: 0000000000000000  Data: 0000
        Capabilities: [e0] PCI-X non-bridge device
                Command: DPERE- ERO- RBC=512 OST=4
                Status: Dev=02:0e.0 64bit+ 133MHz+ SCD- USC- DC=bridge DMMRBC=1024 DMOST=4 DMCRS=16 RSCEM- 266MHz- 533MHz-
        Kernel driver in use: aacraid
        Kernel modules: aacraid
Comment 5 Arkadiusz Miskiewicz 2016-08-12 06:17:51 UTC
Had the same error after upgrading from 3.14.X kernel to 4.4.15

[   59.713349] AAC: Host adapter dead -1
[   69.713352] AAC: Host adapter dead -1
[   79.713347] AAC: Host adapter dead -1
[   89.713344] AAC: Host adapter dead -1
[   99.713344] AAC: Host adapter dead -1
[  109.713340] AAC: Host adapter dead -1
[  119.713341] AAC: Host adapter dead -1
[  129.713338] AAC: Host adapter dead -1
[  139.713344] AAC: Host adapter dead -1
[  149.713336] AAC: Host adapter dead -1
[  159.713336] AAC: Host adapter dead -1
[  169.713333] AAC: Host adapter dead -1
[  179.713332] AAC: Host adapter dead -1
[  189.713329] AAC: Host adapter dead -1
[  199.713331] AAC: Host adapter dead -1
[  209.713336] AAC: Host adapter dead -1
[  219.713331] AAC: Host adapter dead -1
[  229.713329] AAC: Host adapter dead -1
[  239.713332] AAC: Host adapter dead -1

The arrays were accessible. In my case it was an Adaptec 3405 RAID card. I no longer have this system: I replaced the RAID card with an Adaptec 8405, and the problem doesn't happen with the 8405.
Comment 6 David Gravereaux 2016-08-14 22:39:32 UTC
I'm having this same 10 second issue with a 3805 on Ubuntu 16.04 LTS with kernel 4.4.0-34-generic

Some dmesg samplings:
$ dmesg |tail --lines=50 |grep AAC
[46793.664043] AAC: Host adapter dead -1
[46803.664049] AAC: Host adapter dead -1
[46813.664046] AAC: Host adapter dead -1
[46823.664048] AAC: Host adapter dead -1
[46833.664048] AAC: Host adapter dead -1
[46843.664053] AAC: Host adapter dead -1
[46853.664045] AAC: Host adapter dead -1
[46863.664051] AAC: Host adapter dead -1
[46873.664043] AAC: Host adapter dead -1
[46883.664052] AAC: Host adapter dead -1
[46893.664048] AAC: Host adapter dead -1
[46903.664051] AAC: Host adapter dead -1
[46913.664049] AAC: Host adapter dead -1
[46923.664046] AAC: Host adapter dead -1
[46933.664050] AAC: Host adapter dead -1
[46943.664059] AAC: Host adapter dead -1
[46953.664035] AAC: Host adapter dead -1
[46963.664047] AAC: Host adapter dead -1
[46973.664047] AAC: Host adapter dead -1
[46983.664048] AAC: Host adapter dead -1
[46993.664067] AAC: Host adapter dead -1
[47003.664044] AAC: Host adapter dead -1
[47013.664050] AAC: Host adapter dead -1
[47023.664049] AAC: Host adapter dead -1
[47033.664059] AAC: Host adapter dead -1
[47043.664062] AAC: Host adapter dead -1
[47053.664044] AAC: Host adapter dead -1
[47063.664059] AAC: Host adapter dead -1
[47073.664054] AAC: Host adapter dead -1
[47083.664062] AAC: Host adapter dead -1
[47093.664094] AAC: Host adapter dead -1
[47103.664040] AAC: Host adapter dead -1
[47113.664047] AAC: Host adapter dead -1
[47123.664046] AAC: Host adapter dead -1
[47133.664048] AAC: Host adapter dead -1
[47143.664050] AAC: Host adapter dead -1
[47153.664041] AAC: Host adapter dead -1
[47163.664039] AAC: Host adapter dead -1
[47173.664059] AAC: Host adapter dead -1
[47183.664046] AAC: Host adapter dead -1
[47193.664047] AAC: Host adapter dead -1
[47203.664070] AAC: Host adapter dead -1
[47213.664054] AAC: Host adapter dead -1
[47223.664049] AAC: Host adapter dead -1
[47233.664048] AAC: Host adapter dead -1
[47243.664062] AAC: Host adapter dead -1
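
The "every 10 seconds" spacing can be read straight off the timestamps in output like the above. A minimal sketch in Python (an illustration only, not something posted in this thread; `dead_message_intervals` is a made-up name and the pattern assumes the printk timestamp format shown here):

```python
import re

# Matches lines such as "[46793.664043] AAC: Host adapter dead -1"
# and also padded timestamps such as "[   59.713349] ...".
DEAD_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+AAC: Host adapter dead -1\s*$")


def dead_message_intervals(lines):
    """Return the gaps in seconds between consecutive 'Host adapter dead' lines."""
    stamps = []
    for line in lines:
        m = DEAD_RE.match(line.strip())
        if m:
            stamps.append(float(m.group(1)))
    return [later - earlier for earlier, later in zip(stamps, stamps[1:])]


# First three lines of the sample above.
sample = [
    "[46793.664043] AAC: Host adapter dead -1",
    "[46803.664049] AAC: Host adapter dead -1",
    "[46813.664046] AAC: Host adapter dead -1",
]
intervals = dead_message_intervals(sample)
print([round(gap, 3) for gap in intervals])
```

For these three lines it prints `[10.0, 10.0]`. Feeding it the full `dmesg` output instead of the sample would show whether the spacing ever drifts.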
Comment 7 Harold Snel 2016-08-15 07:17:01 UTC
I have the exact same issue on my server running with an Adaptec 3405. The root filesystem is on an SSD attached to that controller. Until I upgraded from 14.04 LTS to 16.04 LTS it was working fine.

The 'dead adapter' issue first started after I upgraded to a new kernel (from 4.2.0-41 to 4.2.0-42). To test whether a full system upgrade would help, I upgraded to 16.04 LTS, but the problem is still there on that release and the system is not usable.

Now I'm running 16.04 LTS with the older kernel (4.2.0-41) and that works just fine. But I want to use the newest kernel because that's how it should be :-)

Here is some controller info:
===
root@server:~# lspci -nnvs 03:0e.0
03:0e.0 RAID bus controller [0104]: Adaptec AAC-RAID [9005:0285]
         Subsystem: Adaptec 3405 [9005:02bb]
         Flags: bus master, stepping, 66MHz, medium devsel, latency 64, IRQ 18
         Memory at fda00000 (64-bit, non-prefetchable) [size=2M]
         Expansion ROM at fdcc0000 [disabled] [size=256K]
         Capabilities: [c0] Power Management version 2
         Capabilities: [d0] MSI: Enable- Count=1/2 Maskable- 64bit+
         Capabilities: [e0] PCI-X non-bridge device
         Kernel driver in use: aacraid
         Kernel modules: aacraid
===
Comment 8 David Gravereaux 2016-08-17 18:43:41 UTC
The 3405 in my lab computer exhibits the same symptoms, also on Ubuntu 16.04 LTS with kernel 4.4.0-34-generic.

So both the Adaptec 3805 and 3405 models are affected.
Comment 9 Piotr Szymaniak 2016-08-22 21:52:53 UTC
Created attachment 229741 [details]
dmesg vanilla 4.3.6 (with Alex patch)

I tried the latest releases of earlier vanilla 4.x series. Since 4.6.x prints the messages over and over, I skipped the 4.5 series and tried:
- 4.4.19 - prints dead messages
- 4.3.6 - works without messages (with Alex's patch [1], intel_iommu=igfx_off)

Not sure if this helps, but I hope it does. Any suggestions on how to proceed from here?

[1] https://www.redhat.com/archives/vfio-users/2016-July/msg00063.html
Comment 10 Piotr Szymaniak 2016-08-22 21:54:16 UTC
Created attachment 229751 [details]
config-4.3.6
Comment 11 Piotr Szymaniak 2016-08-22 21:54:36 UTC
Created attachment 229761 [details]
config-4.7.0
Comment 12 a 2016-09-04 11:25:02 UTC
The same problem for me.

Ubuntu LTS 16.04.

# uname -a
Linux trinity 4.4.0-36-generic #55-Ubuntu SMP Thu Aug 11 18:01:55 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

# arcconf getconfig 1
Controllers found: 1
----------------------------------------------------------------------
Controller information
----------------------------------------------------------------------
   Controller Status                        : Optimal
   Channel description                      : SAS/SATA
   Controller Model                         : Adaptec 3805
   Controller Serial Number                 : 8C211053127
<...>
   --------------------------------------------------------
   Controller Version Information
   --------------------------------------------------------
   BIOS                                     : 5.2-0 (17342)
   Firmware                                 : 5.2-0 (17342)
   Driver                                   : 1.2-1 (41010)
   Boot Flash                               : 5.2-0 (17342)

$ w
 14:24:22 up 20:41 <...>

$ journalctl -b | grep 'AAC: Host adapter dead -1' | wc -l
7284
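
As a rough sanity check (an illustration added here, not something the reporter ran): with the uptime of 20:41 shown by `w` and 7284 matching journal lines for the current boot, the average spacing between messages can be worked out as below. The function name `seconds_per_message` is hypothetical.

```python
def seconds_per_message(uptime_hours, uptime_minutes, message_count):
    """Average number of seconds between messages over the given uptime."""
    uptime_seconds = uptime_hours * 3600 + uptime_minutes * 60
    return uptime_seconds / message_count


# Figures from the output above: up 20:41, 7284 matching lines since boot.
rate = seconds_per_message(20, 41, 7284)
print(f"{rate:.2f} s between messages on average")
```

This prints `10.22 s between messages on average`, which is close to the 10-second interval reported elsewhere in this thread.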
Comment 13 HE - IT Services 2016-09-14 10:09:55 UTC
I have the exact same problem on Proxmox, which is using the Debian Jessie Linux kernel 4.2.4, with an Adaptec RAID 31605. "AAC: Host adapter dead -1" repeats every 10 seconds, but everything is working.

Is there any way to stop these messages from repeating? They may eventually fill up the messages log. Does anyone know what is being done to patch this?
Comment 14 Matthias 2016-09-14 10:18:52 UTC
you can find it in:
drivers/scsi/aacraid/commsup.c
in my 4.7.3 it is in Line 1700
Comment 15 HE - IT Services 2016-09-14 10:20:35 UTC
(In reply to HE - IT Services from comment #13)
> I have the exact same problem on Proxmox which is using Debian Jessie Linux
> Kernel 4.2.4, using Adaptec Raid 31605, it repeats every 10 seconds, "AAC:
> Host adapter dead -1" but everything is working......
> 
> Is there any way to get it to stop repeating these messages, because it may
> max out the messages log eventually.  Does anyone know what is being done to
> patch this?

I found that the kernel is actually 4.4.13-2-pve, i.e. 4.4.13.
Comment 16 HE - IT Services 2016-09-14 10:22:09 UTC
(In reply to Matthias from comment #14)
> you can find it in:
> drivers/scsi/aacraid/commsup.c
> in my 4.7.3 it is in Line 1700

Thanks Matthias.
Comment 17 Piotr Szymaniak 2016-10-03 20:45:52 UTC
Created attachment 240661 [details]
aac.patch

(In reply to Matthias from comment #14)
> you can find it in:
> drivers/scsi/aacraid/commsup.c
> in my 4.7.3 it is in Line 1700

It works as a workaround, but it is not a solution.

This is the patch I apply to be able to use my controller: Alex's patch + "STFU for logs".
Comment 18 Arkadiusz Miskiewicz 2017-01-15 11:10:11 UTC
The problem was introduced by the commit below. Reverting this commit on kernel 4.9.3 makes the problem go away.

https://lkml.org/lkml/2017/1/15/47

commit 78cbccd3bd683c295a44af8050797dc4a41376ff
Author: Raghava Aditya Renukunta <RaghavaAditya.Renukunta@microsemi.com>
Date:   Mon Apr 25 23:32:37 2016 -0700

    aacraid: Fix for KDUMP driver hang

    When KDUMP is triggered the driver first talks to the firmware in INTX
    mode, but the adapter firmware is still in MSIX mode. Therefore the first
    driver command hangs since the driver is waiting for an INTX response and
    firmware gives a MSIX response. If when the OS is installed on a RAID
    drive created by the adapter KDUMP will hang since the driver does not
    receive a response in sync mode.

    Fixed by: Change the firmware to INTX mode if it is in MSIX mode before
    sending the first sync command.

    Cc: stable@vger.kernel.org
    Signed-off-by: Raghava Aditya Renukunta <RaghavaAditya.Renukunta@microsemi.com>
    Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
    Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Comment 20 Stuart Naifeh 2017-06-21 22:43:54 UTC
As of June 21, 2017, I am still seeing this in Debian Stretch running kernel 4.10.0-rc6. I have an IBM ServeRAID 8s, which is a rebranded Adaptec 4805. I see the message repeat about every second, so it really fills up syslog quickly.
Comment 21 Rodrigo Aguilera 2018-01-06 16:02:42 UTC
I stopped seeing this message after updating to Debian Stretch with the 4.9 kernel on my Adaptec 3805 card.

I think this should be closed.
Comment 22 Matthias 2018-01-08 17:44:21 UTC
With my Adaptec 3405, the message no longer appears with the Ubuntu 16.04.3 kernel (4.4.0-104-generic) or with a self-compiled 4.13.0.
