Bug 215788 - arcmsr driver on kernel 5.16 and up fails to initialize ARC-1280ML RAID controller
Summary: arcmsr driver on kernel 5.16 and up fails to initialize ARC-1280ML RAID contr...
Status: NEW
Alias: None
Product: SCSI Drivers
Classification: Unclassified
Component: Other (show other bugs)
Hardware: x86-64 Linux
: P1 normal
Assignee: scsi_drivers-other
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2022-04-01 10:50 UTC by Jernej Simončič
Modified: 2023-09-22 17:28 UTC (History)
7 users (show)

See Also:
Kernel Version: 5.16, 5.17.1
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
Bootup screenshot showing problem (46.45 KB, image/png)
2022-04-01 10:50 UTC, Jernej Simončič
Details

Description Jernej Simončič 2022-04-01 10:50:53 UTC
Created attachment 300675 [details]
Bootup screenshot showing problem

I have an Areca ARC-1280ML RAID controller in my home server, and it appears that something changed in kernel 5.16 causing the driver to hang with:

arcmsr0: abort device command of scsi id = 0 lun = 1
arcmsr0: abort device command of scsi id = 0 lun = 0
arcmsr0: abort device command of scsi id = 0 lun = 3
arcmsr: executing bus reset eh.....num_resets = 0, num_aborts = 3
arcmsr0: wait 'abort all outstanding command' timeout
arcmsr0: executing hw bus reset .....
arcmsr0: wait 'start adapter background rebuild' timeout
arcmsr: scsi bus reset eh returns with success
arcmsr0: abort device command of scsi id = 0 lun = 3
arcmsr: executing bus reset eh.....num_resets = 1, num_aborts = 4
arcmsr0: wait 'abort all outstanding command' timeout
arcmsr0: executing hw bus reset .....
arcmsr0: wait 'start adapter background rebuild' timeout
arcmsr: scsi bus reset eh returns with success

(this then repeats until system panics because it can't mount root)

When this happens, the card also stops responding on out-of-band network. With kernel 5.15 there are no problems.

I normally run bcachefs kernels, but I also tested with regular 5.17.1 kernel, where the same problem happens.
Comment 1 Jernej Simončič 2022-04-02 07:56:35 UTC
I just tested on another computer with ARC-1212 controller, same problem there.
Comment 2 Bart Van Assche 2022-04-05 05:08:24 UTC
Would it be possible to bisect this issue?
Comment 3 Jernej Simončič 2022-04-05 19:44:28 UTC
Done, it points at this commit:

e815d36548f01797ce381be8f0b74f4ba9befd15 is the first bad commit
commit e815d36548f01797ce381be8f0b74f4ba9befd15
Author: Damien Le Moal <damien.lemoal@wdc.com>
Date:   Wed Oct 27 11:22:20 2021 +0900

    scsi: sd: add concurrent positioning ranges support

    Add the sd_read_cpr() function to the sd scsi disk driver to discover
    if a device has multiple concurrent positioning ranges (i.e. multiple
    actuators on an HDD). The existence of VPD page B9h indicates if a
    device has multiple concurrent positioning ranges. The page content
    describes each range supported by the device.

    sd_read_cpr() is called from sd_revalidate_disk() and uses the block
    layer functions disk_alloc_independent_access_ranges() and
    disk_set_independent_access_ranges() to represent the set of actuators
    of the device as independent access ranges.

    The format of the Concurrent Positioning Ranges VPD page B9h is defined
    in section 6.6.6 of SBC-5.

    Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
    Reviewed-by: Hannes Reinecke <hare@suse.de>
    Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
    Reviewed-by: Keith Busch <kbusch@kernel.org>
    Link: https://lore.kernel.org/r/20211027022223.183838-3-damien.lemoal@wdc.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

 drivers/scsi/sd.c | 81 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 drivers/scsi/sd.h |  1 +
 2 files changed, 82 insertions(+)
Comment 4 Bart Van Assche 2022-04-05 19:51:36 UTC
Damien, can you take a look?
Comment 5 Damien Le Moal 2022-04-05 23:29:45 UTC
Can you try this patch:

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index a390679cf458..cecba3fcbc61 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -3216,6 +3216,7 @@ static int sd_revalidate_disk(struct gendisk *disk)
                        sd_read_block_limits(sdkp);
                        sd_read_block_characteristics(sdkp);
                        sd_zbc_read_zones(sdkp, buffer);
+                       sd_read_cpr(sdkp);
                }
 
                sd_print_capacity(sdkp, old_capacity);
@@ -3225,7 +3226,6 @@ static int sd_revalidate_disk(struct gendisk *disk)
                sd_read_app_tag_own(sdkp, buffer);
                sd_read_write_same(sdkp, buffer);
                sd_read_security(sdkp, buffer);
-               sd_read_cpr(sdkp);
        }
 
        /*
Comment 6 Jernej Simončič 2022-04-06 06:59:40 UTC
Doesn't help unfortunately.
Comment 7 Damien Le Moal 2022-04-06 08:26:33 UTC
On Wed, 2022-04-06 at 06:59 +0000, bugzilla-daemon@kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=215788
> 
> --- Comment #6 from Jernej Simončič (jernej-bugzilla.kernel@ena.si) ---
> Doesn't help unfortunately.

Hmm... And everything is OK if you comment out that function call ?

Can you post the output of these commands for the RAID disk ?

sg_inq /dev/sdX
sg_inq --vpd /dev/sdX
sg_vpd -E /dev/sdX
sg_logs -l -l /dev/sdX

And then last:

sg_vpd --force --page=0xb9 /dev/sdX

This last command could be the one crashing the HBA/drive so beware.
Your drive clearly should not be supporting vpd page 0xb9, so
sd_read_cpr() should be a nop, doing nothing. It does not seem to be
the case. Maybe this adapter uses page 0xb9 as a vendor specific one,
causing problems. The above commands will allow checking that.
Comment 8 Jernej Simončič 2022-04-06 16:35:04 UTC
Here are results for ARC-1212:

[root@sysrescue ~]# uname -a
Linux sysrescue 5.15.22-1-lts #1 SMP Tue, 08 Feb 2022 19:00:40 +0000 x86_64 GNU/Linux
[root@sysrescue ~]# sg_inq /dev/sdb
standard INQUIRY:
  PQual=0  PDT=0  RMB=0  LU_CONG=0  hot_pluggable=0  version=0x05  [SPC-3]
  [AERC=0]  [TrmTsk=0]  NormACA=0  HiSUP=0  Resp_data_format=2
  SCCS=0  ACC=0  TPGS=0  3PC=0  Protect=0  [BQue=0]
  EncServ=0  MultiP=0  [MChngr=0]  [ACKREQQ=0]  Addr16=1
  [RelAdr=0]  WBus16=1  Sync=0  [Linked=0]  [TranDis=0]  CmdQue=1
  [SPI: Clocking=0x3  QAS=0  IUS=0]
    length=96 (0x60)   Peripheral device type: disk
 Vendor identification: Areca
 Product identification: Storage
 Product revision level: R001
 Unit serial number: 42bc2c2180321188
[root@sysrescue ~]# sg_inq --vpd /dev/sdb
VPD INQUIRY, page code=0x00:
   Supported VPD pages:
     0x0        Supported VPD pages
     0x80       Unit serial number
     0x83       Device identification
     0xc7
[root@sysrescue ~]# sg_vpd -E /dev/sdb
Unit serial number VPD page:
  Unit serial number: 42bc2c2180321188

Device Identification VPD page:
  Addressed logical unit:
    designator type: EUI-64 based,  code set: Binary
      0x001b4d2008231188

Vendor VPD page=0xc0  failed to fetchVendor VPD page=0xc1  failed to fetchVendor VPD page=0xc2  failed to fetchVendor VPD page=0xc3  failed to fetchVendor VPD page=0xc4  failed to fetchVendor VPD page=0xc5  failed to fetch 00     00 c7 00 3c 00 00 00 00  00 00 00 00 00 00 00 00    ...<............
 10     00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00    ................
 20     00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00    ................
 30     00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00    ................

Vendor VPD page=0xc8  failed to fetchVendor VPD page=0xc9  failed to fetchVendor VPD page=0xca  failed to fetchVendor VPD page=0xd0  failed to fetchVendor VPD page=0xd1  failed to fetchVendor VPD page=0xd2  failed to fetch[root@sysrescue ~]# sg_logs -l -l /dev/sdb
    Areca     Storage           R001
log_sense: field in cdb illegal
sg_logs failed: Illegal request
[root@sysrescue ~]# sg_vpd --force --page=0xb9 /dev/sdb
VPD page=0xb7
fetching VPD page failed: Illegal request
sg_vpd failed: Illegal request


And for ARC-1280ML:

deepthought ~ # sg_inq /dev/sda
standard INQUIRY:
  PQual=0  PDT=0  RMB=0  LU_CONG=0  hot_pluggable=0  version=0x05  [SPC-3]
  [AERC=0]  [TrmTsk=0]  NormACA=0  HiSUP=0  Resp_data_format=2
  SCCS=0  ACC=0  TPGS=0  3PC=0  Protect=0  [BQue=0]
  EncServ=0  MultiP=0  [MChngr=0]  [ACKREQQ=0]  Addr16=1
  [RelAdr=0]  WBus16=1  Sync=0  [Linked=0]  [TranDis=0]  CmdQue=1
  [SPI: Clocking=0x3  QAS=0  IUS=0]
    length=96 (0x60)   Peripheral device type: disk
 Vendor identification: Areca
 Product identification: System
 Product revision level: R001
 Unit serial number: 0000003927378925
deepthought ~ # sg_inq /dev/sdX
sg_inq: error opening file: /dev/sdX: No such file or directory
sg_inq failed: No such file or directory
deepthought ~ # sg_inq --vpd /dev/sdX
sg_inq: error opening file: /dev/sdX: No such file or directory
sg_inq failed: No such file or directory
deepthought ~ # sg_vpd -E /dev/sdX
sg_vpd failed: No such file or directory
deepthought ~ # sg_logs -l -l /dev/sdX^C
deepthought ~ # sg_inq /dev/sda
standard INQUIRY:
  PQual=0  PDT=0  RMB=0  LU_CONG=0  hot_pluggable=0  version=0x05  [SPC-3]
  [AERC=0]  [TrmTsk=0]  NormACA=0  HiSUP=0  Resp_data_format=2
  SCCS=0  ACC=0  TPGS=0  3PC=0  Protect=0  [BQue=0]
  EncServ=0  MultiP=0  [MChngr=0]  [ACKREQQ=0]  Addr16=1
  [RelAdr=0]  WBus16=1  Sync=0  [Linked=0]  [TranDis=0]  CmdQue=1
  [SPI: Clocking=0x3  QAS=0  IUS=0]
    length=96 (0x60)   Peripheral device type: disk
 Vendor identification: Areca
 Product identification: System
 Product revision level: R001
 Unit serial number: 0000003927378925
deepthought ~ # sg_inq --vpd /dev/sda
VPD INQUIRY, page code=0x00:
   Supported VPD pages:
     0x0        Supported VPD pages
     0x80       Unit serial number
     0x83       Device identification
     0xc7
deepthought ~ # sg_vpd -E /dev/sda
@Unit serial number VPD page:
  Unit serial number: 0000003927378925

@Device Identification VPD page:
  Addressed logical unit:
    designator type: EUI-64 based,  code set: Binary
      0x001b4d2305766800

Vendor VPD page=0xc0  failed to fetchVendor VPD page=0xc1  failed to fetchVendor VPD page=0xc2  failed to fetchVendor VPD page=0xc3  failed to fetchVendor VPD page=0xc4  failed to fetchVendor VPD page=0xc5  failed to fetch 00     00 c7 00 3c 00 00 00 00  00 00 00 00 00 00 00 00    ...<............
 10     00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00    ................
 20     00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00    ................
 30     00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00    ................

Vendor VPD page=0xc8  failed to fetchVendor VPD page=0xc9  failed to fetchVendor VPD page=0xca  failed to fetchVendor VPD page=0xd0  failed to fetchVendor VPD page=0xd1  failed to fetchVendor VPD page=0xd2  failed to fetchdeepthought ~ # sg_logs -l -l /dev/sda
    Areca     System            R001
log_sense: field in cdb illegal
sg_logs failed: Illegal request
deepthought ~ # sg_vpd --force --page=0xb9 /dev/sda
VPD page=0xb7
fetching VPD page failed: Illegal request
sg_vpd failed: Illegal request


Neither controller crashed on the last command.
Comment 9 Jernej Simončič 2022-04-06 16:43:14 UTC
Oh, and I forgot to mention, yes, no problems if I comment out that call (did that yesterday already with 5.17.1 kernel release as a test).
Comment 10 Damien Le Moal 2022-04-06 22:49:44 UTC
Arg. I am baffled... No clue what is happening here. Since accessing the 0xb9 vpd page simply fails, sd_read_cpr() should be a nop and do nothing. I have no idea why it creates a problem. Let me go back to the code and check again.

It does seem to be that the adapter is crashing though, so it may not like the command sequence on initialization with that vpd page 0xb9 in the middle. Have you tried to check if there is a FW update for that HBA that may solve the issue ?
Comment 11 Damien Le Moal 2022-04-07 04:01:57 UTC
Martin Petersen (SCSI maintainer) suggested that you try out this code:

https://git.kernel.org/mkp/h/5.18/discovery

to see if it makes any difference. Can you give this a try please ?
Comment 12 Jernej Simončič 2022-04-07 07:10:44 UTC
Re firmware: both controllers are EOL by manufacturer and are running the last released firmware (1.49 for ARC-1280ML, 1.51 for ARC-1212).

5.18/discovery seems to work fine (I can see the volume and partitions on it).
Comment 13 Damien Le Moal 2022-04-07 07:18:20 UTC
That is great news ! So now we need to figure out which change in there avoids the problem (for backporting to stable). We will sort this out.
Comment 14 damien.lemoal 2022-04-07 07:20:50 UTC
+Martin

On 4/7/22 16:10, bugzilla-daemon@kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=215788
> 
> --- Comment #12 from Jernej Simončič (jernej-bugzilla.kernel@ena.si) ---
> Re firmware: both controllers are EOL by manufacturer and are running the
> last
> released firmware (1.49 for ARC-1280ML, 1.51 for ARC-1212).
> 
> 5.18/discovery seems to work fine (I can see the volume and partitions on
> it).

Martin,

Your series is the solution :)
I have not looked at it yet. I wonder what change you have that solves the
issue ? We should have that subset backported to stable if possible.
Comment 15 Chris Rodrigues 2022-04-21 06:52:53 UTC
I am having similar issues with my ARC-1231 card in OpenSuSE with kernel 5.17.  The card works fine in Windows but when I try to boot linux I get "ata8: softreset failed (device not ready)" along with errors from the arcmsr driver.


> christophocles@localhost:~> lsb-release -a
LSB Version:    n/a
Distributor ID: openSUSE
Description:    openSUSE Tumbleweed
Release:        20220419
Codename:       n/a
christophocles@localhost:~> uname -r
5.17.3-1-default
christophocles@localhost:~> dmesg | grep ata8
[    0.683123] ata8: SATA max UDMA/133 abar m8192@0xde400000 port 0xde400180 irq 16
[    6.998524] ata8: link is slow to respond, please be patient (ready=0)
[   10.730379] ata8: softreset failed (device not ready)
[   17.042164] ata8: link is slow to respond, please be patient (ready=0)
[   20.770497] ata8: softreset failed (device not ready)
[   27.081803] ata8: link is slow to respond, please be patient (ready=0)
[   55.840740] ata8: softreset failed (device not ready)
[   55.851912] ata8: limiting SATA link speed to 1.5 Gbps
[   60.916825] ata8: softreset failed (device not ready)
[   60.928005] ata8: reset failed, giving up
christophocles@localhost:~> dmesg | grep arcmsr
               arcmsr version v1.50.00.05-20210429
[   63.719008] arcmsr 0000:02:00.0: msi enabled
[   95.326504] arcmsr9: abort device command of scsi id = 6 lun = 7
[   98.990514] arcmsr: executing bus reset eh.....num_resets = 0, num_aborts = 1 
[  139.002496] arcmsr9: wait 'abort all outstanding command' timeout
[  139.017296] arcmsr9: executing hw bus reset .....
[  192.194497] arcmsr9: wait 'start adapter background                          rebuild' timeout 
[  192.208071] arcmsr: scsi bus reset eh returns with success
[  212.562498] arcmsr9: abort device command of scsi id = 6 lun = 7
[  216.210497] arcmsr: executing bus reset eh.....num_resets = 1, num_aborts = 2 
[  256.222496] arcmsr9: wait 'abort all outstanding command' timeout
[  256.235610] arcmsr9: executing hw bus reset .....
[  309.422496] arcmsr9: wait 'start adapter background                          rebuild' timeout 
[  309.435455] arcmsr: scsi bus reset eh returns with success
Comment 16 Damien Le Moal 2022-04-25 02:28:05 UTC
(In reply to Chris Rodrigues from comment #15)
> I am having similar issues with my ARC-1231 card in OpenSuSE with kernel
> 5.17.  The card works fine in Windows but when I try to boot linux I get
> "ata8: softreset failed (device not ready)" along with errors from the
> arcmsr driver.
> 
> 
> > christophocles@localhost:~> lsb-release -a
> LSB Version:    n/a
> Distributor ID: openSUSE
> Description:    openSUSE Tumbleweed
> Release:        20220419
> Codename:       n/a
> christophocles@localhost:~> uname -r
> 5.17.3-1-default
> christophocles@localhost:~> dmesg | grep ata8
> [    0.683123] ata8: SATA max UDMA/133 abar m8192@0xde400000 port 0xde400180
> irq 16
> [    6.998524] ata8: link is slow to respond, please be patient (ready=0)
> [   10.730379] ata8: softreset failed (device not ready)
> [   17.042164] ata8: link is slow to respond, please be patient (ready=0)
> [   20.770497] ata8: softreset failed (device not ready)
> [   27.081803] ata8: link is slow to respond, please be patient (ready=0)
> [   55.840740] ata8: softreset failed (device not ready)
> [   55.851912] ata8: limiting SATA link speed to 1.5 Gbps
> [   60.916825] ata8: softreset failed (device not ready)
> [   60.928005] ata8: reset failed, giving up
> christophocles@localhost:~> dmesg | grep arcmsr
>                arcmsr version v1.50.00.05-20210429
> [   63.719008] arcmsr 0000:02:00.0: msi enabled
> [   95.326504] arcmsr9: abort device command of scsi id = 6 lun = 7
> [   98.990514] arcmsr: executing bus reset eh.....num_resets = 0, num_aborts
> = 1 
> [  139.002496] arcmsr9: wait 'abort all outstanding command' timeout
> [  139.017296] arcmsr9: executing hw bus reset .....
> [  192.194497] arcmsr9: wait 'start adapter background                      
> rebuild' timeout 
> [  192.208071] arcmsr: scsi bus reset eh returns with success
> [  212.562498] arcmsr9: abort device command of scsi id = 6 lun = 7
> [  216.210497] arcmsr: executing bus reset eh.....num_resets = 1, num_aborts
> = 2 
> [  256.222496] arcmsr9: wait 'abort all outstanding command' timeout
> [  256.235610] arcmsr9: executing hw bus reset .....
> [  309.422496] arcmsr9: wait 'start adapter background                      
> rebuild' timeout 
> [  309.435455] arcmsr: scsi bus reset eh returns with success

If you can and know how to build a kernel, can you please try this code:

https://git.kernel.org/mkp/h/5.18/discovery

It seems to solve the issue.
Comment 17 Chris Rodrigues 2022-04-25 14:17:43 UTC
Damien, thanks but I already decided Tumbleweed wasn't for me and I installed the stable version Leap 15.3 with the 5.3 kernel and all is well.  So I won't be able to test the 5.18 patch.

Also, I will mention that the "device not ready" errors were completely unrelated to arcmsr.  They're still happening and cause slower boot time, but that's a separate issue with my system I still haven't figured out.
Comment 18 Ken Link 2023-09-21 21:13:21 UTC
I am also experiencing this, on an Areca 1882ix-24 with firmware V1.56 2021-01-12 and arcmsr driver version v1.50.0X.14-20230614 in Ubuntu 22.04 on kernel 6.2.0. I take it the 5.18/discovery branch hasn't been merged anywhere yet? What more testing needs to be done? How can I help?
Comment 19 Juhani Heinonen 2023-09-22 08:32:55 UTC
I just upgraded my kernel from 5.15 to 6.1.53 and this "executing hw bus reset" popped up after a few years of smooth sailing. I have Areva 1880ix-24 with firmware V1.56 2019-07-30 and arcmsr driver version v1.50.00.05-20210429. I am just reporting that mysteriously this error appeared after quite a while.
Comment 20 Jernej Simončič 2023-09-22 13:49:51 UTC
FWIW, I've been running bcachefs kernels, and I haven't experienced any further problems with arcmsr on neither 6.1.0, nor 6.5.0.
Comment 21 Damien Le Moal 2023-09-22 17:28:08 UTC
(In reply to Ken Link from comment #18)
> I am also experiencing this, on an Areca 1882ix-24 with firmware V1.56
> 2021-01-12 and arcmsr driver version v1.50.0X.14-20230614 in Ubuntu 22.04 on
> kernel 6.2.0. I take it the 5.18/discovery branch hasn't been merged
> anywhere yet? What more testing needs to be done? How can I help?

I do not recall if the discovery branch was applied. Would need to check again. Given that 6.2 kernel is not LTS and I am not sure what kind of patching Ubuntu does, could you try with the latest stable (6.5.4) or latest mainline (6.6-rc2) ?

Martin,

Did you apply that 5.18/discovery branch mentioned above ? It does not look like t is applied.

Note You need to log in before you can comment on or make changes to this bug.