Bug 217920

Summary: Marvell RAID Controller issues since 6.5.x
Product: Linux
Reporter: Timo Gurr (timo.gurr)
Component: Kernel
Assignee: Virtual assignee for kernel bugs (linux-kernel)
Status: NEW
Severity: normal
CC: bagasdotme, michael.melchert
Priority: P3
Hardware: AMD
OS: Linux
Kernel Version:
Subsystem:
Regression: No
Bisected commit-id:
Attachments: kernelconfig

Description Timo Gurr 2023-09-17 17:22:22 UTC
Created attachment 305122
kernelconfig

Hardware is an HPE ProLiant MicroServer Gen10 X3216 with:

# lspci | grep SATA
00:11.0 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 49)
01:00.0 SATA controller: Marvell Technology Group Ltd. 88SE9230 PCIe 2.0 x2 4-port SATA 6 Gb/s RAID Controller (rev 11)

# dmesg | grep ATA
[    0.015106] NODE_DATA(0) allocated [mem 0x1feffc000-0x1feffffff]
[    0.569868] ahci 0000:00:11.0: AHCI 0001.0300 32 slots 1 ports 6 Gbps 0x1 impl SATA mode
[    0.570560] ata1: SATA max UDMA/133 abar m1024@0xfeb69000 port 0xfeb69100 irq 19
[    0.581964] ahci 0000:01:00.0: AHCI 0001.0200 32 slots 8 ports 6 Gbps 0xff impl SATA mode
[    0.586488] ata2: SATA max UDMA/133 abar m2048@0xfea40000 port 0xfea40100 irq 28
[    0.586554] ata3: SATA max UDMA/133 abar m2048@0xfea40000 port 0xfea40180 irq 28
[    0.586617] ata4: SATA max UDMA/133 abar m2048@0xfea40000 port 0xfea40200 irq 28
[    0.586681] ata5: SATA max UDMA/133 abar m2048@0xfea40000 port 0xfea40280 irq 28
[    0.586742] ata6: SATA max UDMA/133 abar m2048@0xfea40000 port 0xfea40300 irq 28
[    0.586804] ata7: SATA max UDMA/133 abar m2048@0xfea40000 port 0xfea40380 irq 28
[    0.586866] ata8: SATA max UDMA/133 abar m2048@0xfea40000 port 0xfea40400 irq 28
[    0.586927] ata9: SATA max UDMA/133 abar m2048@0xfea40000 port 0xfea40480 irq 28
[    0.882680] ata1: SATA link down (SStatus 0 SControl 300)
[    0.896665] ata8: SATA link down (SStatus 0 SControl 310)
[    0.896979] ata7: SATA link down (SStatus 0 SControl 310)
[    0.897660] ata9: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
[    0.897986] ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[    0.899615] ata6: SATA link down (SStatus 0 SControl 310)
[    1.052964] ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[    1.312890] ata4: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[    1.477997] ata9.00: ATAPI: MARVELL VIRTUAL, 1.09, max UDMA/66
[    1.478613] ata3.00: ATA-10: WDC WD40EFZX-68AWUN0, 81.00B81, max UDMA/133
[    1.478720] ata4.00: ATA-10: WDC WD40EFZX-68AWUN0, 81.00A81, max UDMA/133
[    1.478912] ata2.00: ATA-9: Samsung SSD 840 EVO 120GB, EXT0DB6Q, max UDMA/133
[    1.482260] scsi 1:0:0:0: Direct-Access     ATA      Samsung SSD 840  DB6Q PQ: 0 ANSI: 5
[    1.483793] scsi 2:0:0:0: Direct-Access     ATA      WDC WD40EFZX-68A 0B81 PQ: 0 ANSI: 5
[    1.485746] scsi 3:0:0:0: Direct-Access     ATA      WDC WD40EFZX-68A 0A81 PQ: 0 ANSI: 5
[    1.520882] ata5: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[    1.521779] ata5.00: ATA-9: WDC WD30EFRX-68EUZN0, 82.00A82, max UDMA/133
[    1.523463] scsi 4:0:0:0: Direct-Access     ATA      WDC WD30EFRX-68E 0A82 PQ: 0 ANSI: 
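
For readers comparing the link lines above: the SStatus value appears to be printed in hexadecimal, and under the usual SATA/AHCI register layout (bits 3:0 DET, bits 7:4 SPD, bits 11:8 IPM) it can be decoded as in the minimal sketch below. The function name and the returned dictionary keys are illustrative assumptions, not part of any kernel interface.

```python
# Sketch: decode an SStatus value as printed in libata link messages.
# Assumes the value is hexadecimal and follows the SATA/AHCI layout:
# bits 3:0 = DET, bits 7:4 = SPD, bits 11:8 = IPM.
SPEEDS = {0: "no negotiated speed", 1: "1.5 Gbps", 2: "3.0 Gbps", 3: "6.0 Gbps"}


def decode_sstatus(hex_text):
    """Split an SStatus hex string (e.g. '133') into its three fields."""
    value = int(hex_text, 16)
    det = value & 0xF          # device detection state
    spd = (value >> 4) & 0xF   # negotiated interface speed
    ipm = (value >> 8) & 0xF   # interface power management state
    return {
        "det": det,
        "spd": spd,
        "ipm": ipm,
        "speed": SPEEDS.get(spd, "unknown"),
        "link_up": det == 3,   # DET 3: device present, PHY communication established
    }


if __name__ == "__main__":
    for raw in ("133", "113", "0"):
        print(raw, decode_sstatus(raw))
```

Read this way, SStatus 133 matches the 6.0 Gbps links on ata2 to ata5, and SStatus 113 matches the 1.5 Gbps link reported for ata9.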

I don't use the controller's RAID features; I use software RAID instead. The first port has an SSD with the operating system, and the other three ports have HDDs attached.

Recently I noticed unusually high load, and dmesg showed the following lines repeating constantly:

[396495.764520] ata9.00: configured for UDMA/66
[396496.092239] ata9: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
[396496.092584] ata9.00: configured for UDMA/66
[396496.420123] ata9: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
[396496.420464] ata9.00: configured for UDMA/66
[396496.748016] ata9: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
[396496.748320] ata9.00: configured for UDMA/66
[396497.076285] ata9: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
[396497.076609] ata9.00: configured for UDMA/66
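
A quick way to check whether a given boot shows this pattern is to count link-up messages per port in saved dmesg output. The following is only a sketch: the function names and the threshold of 3 are assumptions chosen for illustration, and the sample text is the excerpt above.

```python
import re
from collections import Counter

# Matches libata link-up lines such as:
# [396496.092239] ata9: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
LINK_UP = re.compile(r"\]\s+(ata\d+): SATA link up")


def count_link_ups(dmesg_text):
    """Return a Counter mapping port name (e.g. 'ata9') to link-up count."""
    counts = Counter()
    for line in dmesg_text.splitlines():
        match = LINK_UP.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def flapping_ports(dmesg_text, threshold=3):
    """List ports with at least `threshold` link-up events (threshold is an assumption)."""
    counts = count_link_ups(dmesg_text)
    return sorted(port for port, n in counts.items() if n >= threshold)


SAMPLE = """\
[396495.764520] ata9.00: configured for UDMA/66
[396496.092239] ata9: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
[396496.092584] ata9.00: configured for UDMA/66
[396496.420123] ata9: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
[396496.420464] ata9.00: configured for UDMA/66
[396496.748016] ata9: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
[396496.748320] ata9.00: configured for UDMA/66
[396497.076285] ata9: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
[396497.076609] ata9.00: configured for UDMA/66
"""

if __name__ == "__main__":
    print(count_link_ups(SAMPLE))
    print(flapping_ports(SAMPLE))
```

On a healthy boot each port should normally report link up only once or a handful of times, so a port that keeps accumulating events (here ata9, roughly three per second) stands out.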

At first I thought it was a disk issue, as I had already had some disks die and replaced them. However, with only the SSD connected I still received the same dmesg spam immediately during boot. I then guessed that the SSD was faulty, so I replaced my long-running

[    1.036030] ata2.00: ATA-9: SanDisk SDSSDP064G, 2.0.0, max UDMA/133

with an older spare one I had lying around (using Clonezilla to clone the drive)

[    1.478912] ata2.00: ATA-9: Samsung SSD 840 EVO 120GB, EXT0DB6Q, max UDMA/133

and still hit the same problem with that one. Thinking about what had changed recently besides distribution package updates, I remembered that I had upgraded from kernel 6.4.x to 6.5.x (kernels and their upgrades are manual on my distribution, so no package was used). I booted the system from an Arch Linux ISO, which used an older kernel and worked fine, then compiled a 6.4.x kernel again on the system, specifically the latest one, 6.4.16. I rebooted and everything is up and running fine again. After half a day I'm pretty sure that none of my hardware is faulty and that this is indeed a kernel issue/regression.

I hope I chose the correct component, as I wasn't sure whether it should be SCSI or IO/Storage instead. Please let me know if you need further details. I can't guarantee that I'll be able to do any actual testing such as bisecting, as the system is in production.
Comment 1 Bagas Sanjaya 2023-09-18 00:00:26 UTC
(In reply to Timo Gurr from comment #0)
> I hope I chose the correct component as I wasn't sure if it should be either
> SCSI or IO/Storage instead. Please let me know if you need further details.
> I can't guarantee to be able to do any actual testing like bisecting as I
> use the system in production.

Then please bisect on a test system that replicates your production
setup. If you need to learn how to bisect the kernel, see
Documentation/admin-guide/bug-bisect.rst.
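
To plan how many build-and-boot cycles a bisection might need, a rough estimate is the base-2 logarithm of the number of candidate commits. The sketch below assumes a purely linear binary search; the commit count used in the example is hypothetical, and the estimate that git bisect prints itself may differ for non-linear history.

```python
def bisect_steps(candidate_commits):
    """Worst-case number of tests for a binary search over the given commit count."""
    if candidate_commits < 1:
        raise ValueError("need at least one candidate commit")
    # (n - 1).bit_length() equals ceil(log2(n)) for every integer n >= 1.
    return (candidate_commits - 1).bit_length()


if __name__ == "__main__":
    # Hypothetical count of commits between the last good and first bad kernel.
    print(bisect_steps(15000))
```

With a hypothetical 15000 commits between the good and bad kernels this gives 14 steps, which is why bisecting between two adjacent releases is usually feasible even though each step needs a rebuild and reboot.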
Comment 2 Bagas Sanjaya 2023-09-18 07:52:54 UTC
Can you also test the proposed fix at [1]?

[1]: https://lore.kernel.org/linux-scsi/20230915022034.678121-1-dlemoal@kernel.org/
Comment 3 mmelchert 2023-09-18 15:51:39 UTC
(In reply to Bagas Sanjaya from comment #2)
> Can you also test the proposed fix at [1]?
> 
> [1]:
> https://lore.kernel.org/linux-scsi/20230915022034.678121-1-dlemoal@kernel.org/

I also had problems booting kernel 6.5. The only thing I could pinpoint
was that something went pear-shaped during SCSI (or ATA) device scanning.
I applied the proposed patches against kernel 6.5.3 and everything works
nicely!
Thanks for posting.
Comment 4 Timo Gurr 2023-11-06 17:05:12 UTC
(In reply to Bagas Sanjaya from comment #2)
> Can you also test the proposed fix at [1]?
> 
> [1]:
> https://lore.kernel.org/linux-scsi/20230915022034.678121-1-dlemoal@kernel.org/

I've been running kernel 6.6.0, which appears to carry the patch(es) you mentioned, on the system in question for the past few days and haven't experienced any issues so far: load looks normal and no such dmesg entries are appearing. Sorry it took me a while to get back to this, upgrade my machine, and report back here.