Created attachment 278901: kernel logs for the last few days

Hardware: PM8003 PMC-Sierra Rev3 card with rev5 firmware, connected to NetApp DS4246 shelves.

The system was running fine on Ubuntu 18.04 with the 4.15 kernel. When I went to put another shelf into circulation, everything broke. Either 4.18.11-041811-generic or something else is causing the issue. I replaced the controller card with a backup and also tried a backup shelf, with the same result.

The disks all passed a badblocks check after 160 hours of scan time. Then, when I tried to format the SEAGATE ST33000650NS SM drives, they kept dropping out or the PC froze completely.

aio@aio:~$ sudo mkfs.ext4 -E lazy_itable_init=0,lazy_journal_init=0 -v -L SRD0NA1B0 -m 1 /dev/sdac1
mke2fs 1.44.1 (24-Mar-2018)
fs_types for mke2fs.conf resolution: 'ext4'
Filesystem label=SRD0NA1B0
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=0 blocks, Stripe width=0 blocks
183148544 inodes, 732566016 blocks
7325660 blocks (1.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=2881486848
22357 block groups
32768 blocks per group, 32768 fragments per group
8192 inodes per group
Filesystem UUID: 1f85920d-cd5a-4120-bb84-e290cb8c8808
Superblock backups stored on blocks:
    32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
    4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
    102400000, 214990848, 512000000, 550731776, 644972544

Allocating group tables: done
Writing inode tables: done
Creating journal (262144 blocks): done
Writing superblocks and filesystem accounting information:
Warning, had trouble writing out superblocks.
aio@aio:~$ sudo mkfs.ext4 -E lazy_itable_init=0,lazy_journal_init=0 -v -L SRD0NA1B0 -m 1 /dev/sdac1
[sudo] password for aio:
mke2fs 1.44.1 (24-Mar-2018)
fs_types for mke2fs.conf resolution: 'ext4'
Filesystem label=SRD0NA1B0
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=0 blocks, Stripe width=0 blocks
183148544 inodes, 732566016 blocks
7325660 blocks (1.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=2881486848
22357 block groups
32768 blocks per group, 32768 fragments per group
8192 inodes per group
Filesystem UUID: 6b4fa378-38aa-4c02-ad46-4e2876ce1cb1
Superblock backups stored on blocks:
    32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
    4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
    102400000, 214990848, 512000000, 550731776, 644972544

Allocating group tables: done
Writing inode tables: done
Creating journal (262144 blocks): done
Writing superblocks and filesystem accounting information: done
pm80xx mpi_ssp_completion 1874:sas IO status 0x24
[ 708.424035] pm80xx mpi_ssp_completion 1883:SAS Address of IO Failure Drive:500605ba00b9cca2
[ 708.424241] sd 1:0:8:0: Power-on or device reset occurred
[ 708.712116] pm80xx 0000:04:00.0: dev 500605ba00b9cca2 sent sense data, but stat(28) is not CHECK CONDITION
[ 709.072865] sas: Enter sas_scsi_recover_host busy: 1 failed: 1
[ 709.073843] sd 1:0:8:0: Power-on or device reset occurred
[ 709.074231] sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 1 tries: 1
[ 709.082924] pm80xx 0000:04:00.0: dev 500605ba00b9cca2 sent sense data, but stat(28) is not CHECK CONDITION
[ 709.083028] pm80xx 0000:04:00.0: dev 500605ba00b9cca2 sent sense data, but stat(28) is not CHECK CONDITION
[ 709.083030] pm80xx 0000:04:00.0: dev 500605ba00b9cca2 sent sense data, but stat(28) is not CHECK CONDITION
[ 709.083032] pm80xx 0000:04:00.0: dev 500605ba00b9cca2 sent sense data, but stat(28) is not CHECK CONDITION
[ 709.083148] pm80xx 0000:04:00.0: dev 500605ba00b9cca2 sent sense data, but stat(28) is not CHECK CONDITION
[ 709.083151] pm80xx 0000:04:00.0: dev 500605ba00b9cca2 sent sense data, but stat(28) is not CHECK CONDITION
[ 709.083259] pm80xx 0000:04:00.0: dev 500605ba00b9cca2 sent sense data, but stat(28) is not CHECK CONDITION
[ 709.473399] sas: Enter sas_scsi_recover_host busy: 7 failed: 7
[ 709.478894] sd 1:0:8:0: Power-on or device reset occurred
[ 709.479203] sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 7 tries: 1
[ 709.482057] pm80xx 0000:04:00.0: dev 500605ba00b9cca2 sent sense data, but stat(28) is not CHECK CONDITION
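To see which devices produce these warnings and how often, the attached logs can be summarised with a small script. This is only a rough sketch: the function name `count_sense_warnings` and the regular expression are my own, written against the message format shown in this report, and other kernel versions may word the message differently.

```python
import re
from collections import Counter

# Pattern written against the pm80xx warning text quoted in this report.
# The exact wording on other kernel versions is an assumption.
SENSE_RE = re.compile(
    r"pm80xx (?P<pci>[0-9a-f:.]+): dev (?P<addr>[0-9a-f]{16}) "
    r"sent sense data, but stat\((?P<stat>[0-9a-f]+)\) is not CHECK CONDITION"
)


def count_sense_warnings(log_text):
    """Count warnings per (SAS address, stat value) pair in a dmesg dump."""
    counts = Counter()
    for match in SENSE_RE.finditer(log_text):
        counts[(match.group("addr"), match.group("stat"))] += 1
    return counts


# A few lines copied from the log in this report.
sample = """\
[ 708.712116] pm80xx 0000:04:00.0: dev 500605ba00b9cca2 sent sense data, but stat(28) is not CHECK CONDITION
[ 709.072865] sas: Enter sas_scsi_recover_host busy: 1 failed: 1
[ 709.082924] pm80xx 0000:04:00.0: dev 500605ba00b9cca2 sent sense data, but stat(28) is not CHECK CONDITION
[ 709.083028] pm80xx 0000:04:00.0: dev 500605ba00b9cca2 sent sense data, but stat(28) is not CHECK CONDITION
"""

if __name__ == "__main__":
    for (addr, stat), n in count_sense_warnings(sample).most_common():
        print(f"{addr} stat({stat}): {n}")
```

Running the same function over the full attachment (for example the text of `dmesg` saved to a file) should show whether the warnings cluster on one drive or are spread across the shelf.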
It might also be related to the SEAGATE ST33000650NS drives. My other drives are HITACHI HUA723030ALA64SA, and when I used a spare it formatted with fewer kernel errors, although some of the same messages still appeared.

I have tried 4 spare PM8003 controllers (rev3 and rev5 cards), kernels 4.17 through 4.18, and different shelves, IOMs and cables. The only other thing left might be the PCI-e slot or a BIOS setting, but my old backup OS, Ubuntu 14.04.5 LTS, works.

I am unsure how to add another attachment, so the kernel log of the other trials is here: https://drive.google.com/file/d/1evzTbENvVRfgU_p5I125ADkbbC11dx77/view?usp=sharing
It might actually be part of this bug: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=740701
However, it might also have something to do with the serial number being read wrong? The drive is a SEAGATE ST33000650NS SM (NA01), Serial # 500605ba00b9be18, but the logs show:

[11125.292669] pm80xx 0000:08:00.0: dev 500605ba00b9be19 sent sense data, but stat(28) is not CHECK CONDITION
[11125.645491] sas: Enter sas_scsi_recover_host busy: 54 failed: 54
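For what it is worth, the address on the label and the address in the log differ only in the last hex digit. A quick check, as a sketch (the helper name `sas_address_delta` is made up for this comment):

```python
def sas_address_delta(a: str, b: str) -> int:
    """Return b - a, treating both 16-digit SAS addresses as hex integers."""
    return int(b, 16) - int(a, 16)


label_addr = "500605ba00b9be18"  # value printed on the drive
log_addr = "500605ba00b9be19"    # value reported by pm80xx in the kernel log

if __name__ == "__main__":
    print(sas_address_delta(label_addr, log_addr))  # prints 1
```

A difference of exactly 1 would be consistent with the log reporting a neighbouring address of the same device rather than a misread value, but that is an assumption on my part and not something confirmed in this thread.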
I have the same problem, using Arch Linux and kernel 5.0.2. Currently testing with 4.14. For me, one (sometimes two) drives randomly drop every few hours to days.
I will try to reproduce the issue on my setup to debug it further.
I am using an LSI / Avago 9302-16e 12Gb/s PCIe 3.0 SAS HBA RAID Controller 03-25688-00 without any issues. If you want anything tested, I can put the PM8003 PMC-Sierra back in.