Created attachment 300675 [details] Bootup screenshot showing problem I have an Areca ARC-1280ML RAID controller in my home server, and it appears that something changed in kernel 5.16 causing the driver to hang with: arcmsr0: abort device command of scsi id = 0 lun = 1 arcmsr0: abort device command of scsi id = 0 lun = 0 arcmsr0: abort device command of scsi id = 0 lun = 3 arcmsr: executing bus reset eh.....num_resets = 0, num_aborts = 3 arcmsr0: wait 'abort all outstanding command' timeout arcmsr0: executing hw bus reset ..... arcmsr0: wait 'start adapter background rebuild' timeout arcmsr: scsi bus reset eh returns with success arcmsr0: abort device command of scsi id = 0 lun = 3 arcmsr: executing bus reset eh.....num_resets = 1, num_aborts = 4 arcmsr0: wait 'abort all outstanding command' timeout arcmsr0: executing hw bus reset ..... arcmsr0: wait 'start adapter background rebuild' timeout arcmsr: scsi bus reset eh returns with success (this then repeats until system panics because it can't mount root) When this happens, the card also stops responding on out-of-band network. With kernel 5.15 there are no problems. I normally run bcachefs kernels, but I also tested with regular 5.17.1 kernel, where the same problem happens.
I just tested on another computer with ARC-1212 controller, same problem there.
Would it be possible to bisect this issue?
Done, it points at this commit: e815d36548f01797ce381be8f0b74f4ba9befd15 is the first bad commit commit e815d36548f01797ce381be8f0b74f4ba9befd15 Author: Damien Le Moal <damien.lemoal@wdc.com> Date: Wed Oct 27 11:22:20 2021 +0900 scsi: sd: add concurrent positioning ranges support Add the sd_read_cpr() function to the sd scsi disk driver to discover if a device has multiple concurrent positioning ranges (i.e. multiple actuators on an HDD). The existence of VPD page B9h indicates if a device has multiple concurrent positioning ranges. The page content describes each range supported by the device. sd_read_cpr() is called from sd_revalidate_disk() and uses the block layer functions disk_alloc_independent_access_ranges() and disk_set_independent_access_ranges() to represent the set of actuators of the device as independent access ranges. The format of the Concurrent Positioning Ranges VPD page B9h is defined in section 6.6.6 of SBC-5. Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Link: https://lore.kernel.org/r/20211027022223.183838-3-damien.lemoal@wdc.com Signed-off-by: Jens Axboe <axboe@kernel.dk> drivers/scsi/sd.c | 81 +++++++++++++++++++++++++++++++++++++++++++++++++++++++ drivers/scsi/sd.h | 1 + 2 files changed, 82 insertions(+)
Damien, can you take a look?
Can you try this patch: diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c index a390679cf458..cecba3fcbc61 100644 --- a/drivers/scsi/sd.c +++ b/drivers/scsi/sd.c @@ -3216,6 +3216,7 @@ static int sd_revalidate_disk(struct gendisk *disk) sd_read_block_limits(sdkp); sd_read_block_characteristics(sdkp); sd_zbc_read_zones(sdkp, buffer); + sd_read_cpr(sdkp); } sd_print_capacity(sdkp, old_capacity); @@ -3225,7 +3226,6 @@ static int sd_revalidate_disk(struct gendisk *disk) sd_read_app_tag_own(sdkp, buffer); sd_read_write_same(sdkp, buffer); sd_read_security(sdkp, buffer); - sd_read_cpr(sdkp); } /*
Doesn't help unfortunately.
On Wed, 2022-04-06 at 06:59 +0000, bugzilla-daemon@kernel.org wrote: > https://bugzilla.kernel.org/show_bug.cgi?id=215788 > > --- Comment #6 from Jernej Simončič (jernej-bugzilla.kernel@ena.si) --- > Doesn't help unfortunately. Hmm... And everything is OK if you comment out that function call ? Can you post the output of these commands for the RAID disk ? sg_inq /dev/sdX sg_inq --vpd /dev/sdX sg_vpd -E /dev/sdX sg_logs -l -l /dev/sdX And then last: sg_vpd --force --page=0xb9 /dev/sdX This last command could be the one crashing the HBA/drive so beware. Your drive clearly should not be supporting vpd page 0xb9, so sd_read_cpr() should be a nop, doing nothing. It does not seem to be the case. Maybe this adapter uses page 0xb9 as a vendor specific one, causing problems. The above commands will allow checking that.
Here are results for ARC-1212: [root@sysrescue ~]# uname -a Linux sysrescue 5.15.22-1-lts #1 SMP Tue, 08 Feb 2022 19:00:40 +0000 x86_64 GNU/Linux [root@sysrescue ~]# sg_inq /dev/sdb standard INQUIRY: PQual=0 PDT=0 RMB=0 LU_CONG=0 hot_pluggable=0 version=0x05 [SPC-3] [AERC=0] [TrmTsk=0] NormACA=0 HiSUP=0 Resp_data_format=2 SCCS=0 ACC=0 TPGS=0 3PC=0 Protect=0 [BQue=0] EncServ=0 MultiP=0 [MChngr=0] [ACKREQQ=0] Addr16=1 [RelAdr=0] WBus16=1 Sync=0 [Linked=0] [TranDis=0] CmdQue=1 [SPI: Clocking=0x3 QAS=0 IUS=0] length=96 (0x60) Peripheral device type: disk Vendor identification: Areca Product identification: Storage Product revision level: R001 Unit serial number: 42bc2c2180321188 [root@sysrescue ~]# sg_inq --vpd /dev/sdb VPD INQUIRY, page code=0x00: Supported VPD pages: 0x0 Supported VPD pages 0x80 Unit serial number 0x83 Device identification 0xc7 [root@sysrescue ~]# sg_vpd -E /dev/sdb Unit serial number VPD page: Unit serial number: 42bc2c2180321188 Device Identification VPD page: Addressed logical unit: designator type: EUI-64 based, code set: Binary 0x001b4d2008231188 Vendor VPD page=0xc0 failed to fetchVendor VPD page=0xc1 failed to fetchVendor VPD page=0xc2 failed to fetchVendor VPD page=0xc3 failed to fetchVendor VPD page=0xc4 failed to fetchVendor VPD page=0xc5 failed to fetch 00 00 c7 00 3c 00 00 00 00 00 00 00 00 00 00 00 00 ...<............ 10 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 20 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 30 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Vendor VPD page=0xc8 failed to fetchVendor VPD page=0xc9 failed to fetchVendor VPD page=0xca failed to fetchVendor VPD page=0xd0 failed to fetchVendor VPD page=0xd1 failed to fetchVendor VPD page=0xd2 failed to fetch[root@sysrescue ~]# sg_logs -l -l /dev/sdb Areca Storage R001 log_sense: field in cdb illegal sg_logs failed: Illegal request [root@sysrescue ~]# sg_vpd --force --page=0xb9 /dev/sdb VPD page=0xb7 fetching VPD page failed: Illegal request sg_vpd failed: Illegal request And for ARC-1280ML: deepthought ~ # sg_inq /dev/sda standard INQUIRY: PQual=0 PDT=0 RMB=0 LU_CONG=0 hot_pluggable=0 version=0x05 [SPC-3] [AERC=0] [TrmTsk=0] NormACA=0 HiSUP=0 Resp_data_format=2 SCCS=0 ACC=0 TPGS=0 3PC=0 Protect=0 [BQue=0] EncServ=0 MultiP=0 [MChngr=0] [ACKREQQ=0] Addr16=1 [RelAdr=0] WBus16=1 Sync=0 [Linked=0] [TranDis=0] CmdQue=1 [SPI: Clocking=0x3 QAS=0 IUS=0] length=96 (0x60) Peripheral device type: disk Vendor identification: Areca Product identification: System Product revision level: R001 Unit serial number: 0000003927378925 deepthought ~ # sg_inq /dev/sdX sg_inq: error opening file: /dev/sdX: No such file or directory sg_inq failed: No such file or directory deepthought ~ # sg_inq --vpd /dev/sdX sg_inq: error opening file: /dev/sdX: No such file or directory sg_inq failed: No such file or directory deepthought ~ # sg_vpd -E /dev/sdX sg_vpd failed: No such file or directory deepthought ~ # sg_logs -l -l /dev/sdX^C deepthought ~ # sg_inq /dev/sda standard INQUIRY: PQual=0 PDT=0 RMB=0 LU_CONG=0 hot_pluggable=0 version=0x05 [SPC-3] [AERC=0] [TrmTsk=0] NormACA=0 HiSUP=0 Resp_data_format=2 SCCS=0 ACC=0 TPGS=0 3PC=0 Protect=0 [BQue=0] EncServ=0 MultiP=0 [MChngr=0] [ACKREQQ=0] Addr16=1 [RelAdr=0] WBus16=1 Sync=0 [Linked=0] [TranDis=0] CmdQue=1 [SPI: Clocking=0x3 QAS=0 IUS=0] length=96 (0x60) Peripheral device type: disk Vendor identification: Areca Product identification: System Product revision level: R001 Unit serial number: 0000003927378925 deepthought ~ # sg_inq --vpd /dev/sda VPD INQUIRY, page code=0x00: Supported VPD pages: 0x0 Supported VPD pages 0x80 Unit serial number 0x83 Device identification 0xc7 deepthought ~ # sg_vpd -E /dev/sda @Unit serial number VPD page: Unit serial number: 0000003927378925 @Device Identification VPD page: Addressed logical unit: designator type: EUI-64 based, code set: Binary 0x001b4d2305766800 Vendor VPD page=0xc0 failed to fetchVendor VPD page=0xc1 failed to fetchVendor VPD page=0xc2 failed to fetchVendor VPD page=0xc3 failed to fetchVendor VPD page=0xc4 failed to fetchVendor VPD page=0xc5 failed to fetch 00 00 c7 00 3c 00 00 00 00 00 00 00 00 00 00 00 00 ...<............ 10 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 20 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 30 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Vendor VPD page=0xc8 failed to fetchVendor VPD page=0xc9 failed to fetchVendor VPD page=0xca failed to fetchVendor VPD page=0xd0 failed to fetchVendor VPD page=0xd1 failed to fetchVendor VPD page=0xd2 failed to fetchdeepthought ~ # sg_logs -l -l /dev/sda Areca System R001 log_sense: field in cdb illegal sg_logs failed: Illegal request deepthought ~ # sg_vpd --force --page=0xb9 /dev/sda VPD page=0xb7 fetching VPD page failed: Illegal request sg_vpd failed: Illegal request Neither controller crashed on the last command.
Oh, and I forgot to mention, yes, no problems if I comment out that call (did that yesterday already with 5.17.1 kernel release as a test).
Arg. I am baffled... No clue what is happening here. Since accessing the 0xb9 vpd page simply fails, sd_read_cpr() should be a nop and do nothing. I have no idea why it creates a problem. Let me go back to the code and check again. It does seem to be that the adapter is crashing though, so it may not like the command sequence on initialization with that vpd page 0xb9 in the middle. Have you tried to check if there is a FW update for that HBA that may solve the issue ?
Martin Petersen (SCSI maintainer) suggested that you try out this code: https://git.kernel.org/mkp/h/5.18/discovery to see if it makes any difference. Can you give this a try please ?
Re firmware: both controllers are EOL by manufacturer and are running the last released firmware (1.49 for ARC-1280ML, 1.51 for ARC-1212). 5.18/discovery seems to work fine (I can see the volume and partitions on it).
That is great news ! So now we need to figure out which change in there avoids the problem (for backporting to stable). We will sort this out.
+Martin On 4/7/22 16:10, bugzilla-daemon@kernel.org wrote: > https://bugzilla.kernel.org/show_bug.cgi?id=215788 > > --- Comment #12 from Jernej Simončič (jernej-bugzilla.kernel@ena.si) --- > Re firmware: both controllers are EOL by manufacturer and are running the > last > released firmware (1.49 for ARC-1280ML, 1.51 for ARC-1212). > > 5.18/discovery seems to work fine (I can see the volume and partitions on > it). Martin, Your series is the solution :) I have not looked at it yet. I wonder what change you have that solves the issue ? We should have that subset backported to stable if possible.
I am having similar issues with my ARC-1231 card in OpenSuSE with kernel 5.17. The card works fine in Windows but when I try to boot linux I get "ata8: softreset failed (device not ready)" along with errors from the arcmsr driver. > christophocles@localhost:~> lsb-release -a LSB Version: n/a Distributor ID: openSUSE Description: openSUSE Tumbleweed Release: 20220419 Codename: n/a christophocles@localhost:~> uname -r 5.17.3-1-default christophocles@localhost:~> dmesg | grep ata8 [ 0.683123] ata8: SATA max UDMA/133 abar m8192@0xde400000 port 0xde400180 irq 16 [ 6.998524] ata8: link is slow to respond, please be patient (ready=0) [ 10.730379] ata8: softreset failed (device not ready) [ 17.042164] ata8: link is slow to respond, please be patient (ready=0) [ 20.770497] ata8: softreset failed (device not ready) [ 27.081803] ata8: link is slow to respond, please be patient (ready=0) [ 55.840740] ata8: softreset failed (device not ready) [ 55.851912] ata8: limiting SATA link speed to 1.5 Gbps [ 60.916825] ata8: softreset failed (device not ready) [ 60.928005] ata8: reset failed, giving up christophocles@localhost:~> dmesg | grep arcmsr arcmsr version v1.50.00.05-20210429 [ 63.719008] arcmsr 0000:02:00.0: msi enabled [ 95.326504] arcmsr9: abort device command of scsi id = 6 lun = 7 [ 98.990514] arcmsr: executing bus reset eh.....num_resets = 0, num_aborts = 1 [ 139.002496] arcmsr9: wait 'abort all outstanding command' timeout [ 139.017296] arcmsr9: executing hw bus reset ..... [ 192.194497] arcmsr9: wait 'start adapter background rebuild' timeout [ 192.208071] arcmsr: scsi bus reset eh returns with success [ 212.562498] arcmsr9: abort device command of scsi id = 6 lun = 7 [ 216.210497] arcmsr: executing bus reset eh.....num_resets = 1, num_aborts = 2 [ 256.222496] arcmsr9: wait 'abort all outstanding command' timeout [ 256.235610] arcmsr9: executing hw bus reset ..... [ 309.422496] arcmsr9: wait 'start adapter background rebuild' timeout [ 309.435455] arcmsr: scsi bus reset eh returns with success
(In reply to Chris Rodrigues from comment #15) > I am having similar issues with my ARC-1231 card in OpenSuSE with kernel > 5.17. The card works fine in Windows but when I try to boot linux I get > "ata8: softreset failed (device not ready)" along with errors from the > arcmsr driver. > > > > christophocles@localhost:~> lsb-release -a > LSB Version: n/a > Distributor ID: openSUSE > Description: openSUSE Tumbleweed > Release: 20220419 > Codename: n/a > christophocles@localhost:~> uname -r > 5.17.3-1-default > christophocles@localhost:~> dmesg | grep ata8 > [ 0.683123] ata8: SATA max UDMA/133 abar m8192@0xde400000 port 0xde400180 > irq 16 > [ 6.998524] ata8: link is slow to respond, please be patient (ready=0) > [ 10.730379] ata8: softreset failed (device not ready) > [ 17.042164] ata8: link is slow to respond, please be patient (ready=0) > [ 20.770497] ata8: softreset failed (device not ready) > [ 27.081803] ata8: link is slow to respond, please be patient (ready=0) > [ 55.840740] ata8: softreset failed (device not ready) > [ 55.851912] ata8: limiting SATA link speed to 1.5 Gbps > [ 60.916825] ata8: softreset failed (device not ready) > [ 60.928005] ata8: reset failed, giving up > christophocles@localhost:~> dmesg | grep arcmsr > arcmsr version v1.50.00.05-20210429 > [ 63.719008] arcmsr 0000:02:00.0: msi enabled > [ 95.326504] arcmsr9: abort device command of scsi id = 6 lun = 7 > [ 98.990514] arcmsr: executing bus reset eh.....num_resets = 0, num_aborts > = 1 > [ 139.002496] arcmsr9: wait 'abort all outstanding command' timeout > [ 139.017296] arcmsr9: executing hw bus reset ..... > [ 192.194497] arcmsr9: wait 'start adapter background > rebuild' timeout > [ 192.208071] arcmsr: scsi bus reset eh returns with success > [ 212.562498] arcmsr9: abort device command of scsi id = 6 lun = 7 > [ 216.210497] arcmsr: executing bus reset eh.....num_resets = 1, num_aborts > = 2 > [ 256.222496] arcmsr9: wait 'abort all outstanding command' timeout > [ 256.235610] arcmsr9: executing hw bus reset ..... > [ 309.422496] arcmsr9: wait 'start adapter background > rebuild' timeout > [ 309.435455] arcmsr: scsi bus reset eh returns with success If you can and know how to build a kernel, can you please try this code: https://git.kernel.org/mkp/h/5.18/discovery It seems to solve the issue.
Damien, thanks but I already decided Tumbleweed wasn't for me and I installed the stable version Leap 15.3 with the 5.3 kernel and all is well. So I won't be able to test the 5.18 patch. Also, I will mention that the "device not ready" errors were completely unrelated to arcmsr. They're still happening and cause slower boot time, but that's a separate issue with my system I still haven't figured out.
I am also experiencing this, on an Areca 1882ix-24 with firmware V1.56 2021-01-12 and arcmsr driver version v1.50.0X.14-20230614 in Ubuntu 22.04 on kernel 6.2.0. I take it the 5.18/discovery branch hasn't been merged anywhere yet? What more testing needs to be done? How can I help?
I just upgraded my kernel from 5.15 to 6.1.53 and this "executing hw bus reset" popped up after a few years of smooth sailing. I have Areva 1880ix-24 with firmware V1.56 2019-07-30 and arcmsr driver version v1.50.00.05-20210429. I am just reporting that mysteriously this error appeared after quite a while.
FWIW, I've been running bcachefs kernels, and I haven't experienced any further problems with arcmsr on neither 6.1.0, nor 6.5.0.
(In reply to Ken Link from comment #18) > I am also experiencing this, on an Areca 1882ix-24 with firmware V1.56 > 2021-01-12 and arcmsr driver version v1.50.0X.14-20230614 in Ubuntu 22.04 on > kernel 6.2.0. I take it the 5.18/discovery branch hasn't been merged > anywhere yet? What more testing needs to be done? How can I help? I do not recall if the discovery branch was applied. Would need to check again. Given that 6.2 kernel is not LTS and I am not sure what kind of patching Ubuntu does, could you try with the latest stable (6.5.4) or latest mainline (6.6-rc2) ? Martin, Did you apply that 5.18/discovery branch mentioned above ? It does not look like t is applied.