Created attachment 289227 [details]
last dmesg captured

When performing an arcconf operation (assign hot-spare) on an Adaptec 72405 SAS controller, the program crashes with a "segmentation fault". The driver is apparently not happy with it either: it becomes unresponsive, and the SCSI devices on the SAS controller can no longer be accessed.

Additional tricks to perform a PCI-level reset ultimately lead to a kernel panic:

linuxserver# echo 1 > /sys/bus/pci/devices/0000\:04\:00.0/reset
(wait a minute)
linuxserver# echo 1 > /sys/bus/pci/rescan
(wait a minute)
linuxserver# umount /data/* (where all SAS devices are mounted)
(hangs indefinitely)
linuxserver# echo auto > /sys/bus/pci/devices/0000\:04\:00.0/power/control
linuxserver# echo "0000:04:00.0" > /sys/bus/pci/drivers/aacraid/unbind
--PANIC--

I haven't been able to copy and paste the panic output yet; I'm working on a kexec kernel or crash dump.
The root directory is NOT on one of the SAS devices; it is on a generic SATA controller.
UPDATE: the host does not panic, but the whole IO system does not work any longer:
- network IO fails
- logon fails (hangs indefinitely)
- dmesg fails (hangs indefinitely)
- keyboard still works

I'd say a general IO error occurs (but why is there still USB keyboard input?), making the system unresponsive. Next time, I'll see whether I can still try a cat /dev/kmsg, but any use of kexec is off the table, I guess.
Created attachment 289229 [details] kernel .config file
Is this perhaps a recently introduced bug? If so, would it be possible to bisect this? See also https://www.kernel.org/doc/html/latest/admin-guide/bug-bisect.html.
Good idea. However, I have since cleared and rebuilt the storage array, and everything is working again. Any idea what this segfault means, so I can reproduce the state (host adapter reset) and cause the same error condition?
Is it possible to reproduce the kernel warning by running sg_reset -h /dev/sd... where /dev/sd... is a SCSI device controlled by an aacraid adapter? sg_reset is available in the sg3_utils package.
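To pick the right /dev/sd... nodes for that test, something like the following might help. It is only a sketch: it assumes the usual sysfs layout, where the resolved /sys/block/sdX path contains a hostN component, and it assumes the aacraid driver reports "aacraid" in /sys/class/scsi_host/hostN/proc_name. The function name and the optional sysfs-root argument are made up for illustration.

```shell
#!/bin/sh
# Print the /dev/sdX nodes whose SCSI host belongs to a given low-level driver.
# $1 = driver name as reported in proc_name (assumed to be "aacraid" here)
# $2 = sysfs root, defaults to /sys (hypothetical parameter, handy for dry runs)
list_disks_for_driver() {
    driver=$1
    root=${2:-/sys}
    for blk in "$root"/block/sd*; do
        [ -e "$blk" ] || continue
        # /sys/block/sdX is a symlink; the resolved path contains ".../hostN/..."
        real=$(readlink -f "$blk") || continue
        host=$(printf '%s\n' "$real" | sed -n 's|.*/\(host[0-9][0-9]*\)/.*|\1|p')
        [ -n "$host" ] || continue
        proc_name_file="$root/class/scsi_host/$host/proc_name"
        [ -r "$proc_name_file" ] || continue
        if [ "$(cat "$proc_name_file")" = "$driver" ]; then
            printf '/dev/%s\n' "${blk##*/}"
        fi
    done
}
```

Running list_disks_for_driver aacraid should then print the candidate devices, any one of which could be passed to sg_reset -h as described above.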
Sorry, I tried that, but the IOP reset succeeded. I even tried it while the array was doing an expansion operation, but no luck: it came back with no issues.
I figured out it was due to an insufficient +5V line, which made the devices malfunction. I added some extra +5V juice and it worked without any problem.

Nevertheless, is it an option to "isolate" the storage driver somewhat so the other PCI devices are kept up and running? There are still some points of investigation:
- If it's PCI related, why do the dedicated VGA + onboard USB still work?
- If it's storage subsystem related, why does network IO fail?
- If it's driver related, why is AHCI going down as well?

I guess this is not supposed to happen, so I'll see whether I can make it crash again, and eventually try to reset the whole PCI bus (and see whether that would help).
To update a bit: the problem reoccurred this morning. I can't access the PC right now, but I checked the remote syslog, and it displays something like: ... I know, the aacraid adapter panics, but why does it not reset the adapter and move on? Why does the telnet daemon segfault in libc?
Created attachment 290249 [details] attachment of remote syslog
I just verified: the device was mostly dead: no F12 to enter the kernel log, no Num Lock response from the keyboard LED, but still replying to ping... I have now locked the screen on tty12, so next time I *should* be able to see something.
The issue seems to be related to:

> [59502.794967] Call Trace:
> [59502.794967] _raw_spin_lock_irqsave+0x20/0x30
> [59502.794968] __scsi_iterate_devices+0x22/0x80
> [59502.794968] scsi_eh_ready_devs+0x129/0x7c0
> [59502.794968] ? __pm_runtime_resume+0x54/0x70
> [59502.794968] scsi_error_handler+0x394/0x3a0
> [59502.794969] kthread+0xf3/0x130
> [59502.794969] ? scsi_eh_get_sense+0x120/0x120
> [59502.794969] ? kthread_park+0x80/0x80
> [59502.794970] ret_from_fork+0x1f/0x30

As far as I can see, this stack blocks the entire SCSI subsystem. I do not see why: the scsi_error_handler runs in a separate kthread, so it *should* not block the IO subsystem ... but it definitely does: all storage devices on all SAS/SATA controllers (even USB) become inaccessible.

I managed to get a dmesg out of it, but "echo 1 > /sys/class/pci_bus/0000\:04/device/reset" never completed. This command was issued over a running SSH session. A new session could no longer be established. But it proves the PCI subsystem is partially intact.

Is it possible that _raw_spin_lock_irqsave hurts when the adapter is not ready yet, and as such locks a device but never completes?
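While a session like the SSH one above is still alive, it may be worth recording which tasks are stuck in uninterruptible sleep (state D), since that is what a hung umount or login usually looks like. The following is a rough sketch, not something taken from this report: it reads the State: line of /proc/PID/status, and the function name and the optional proc-root argument are hypothetical.

```shell
#!/bin/sh
# Print "PID NAME" for every task currently in uninterruptible sleep (state D).
# $1 = proc root, defaults to /proc (hypothetical parameter, handy for dry runs)
list_blocked_tasks() {
    root=${1:-/proc}
    for status in "$root"/[0-9]*/status; do
        [ -r "$status" ] || continue
        # Each status file has "Name:", "State:" and "Pid:" lines;
        # a task may exit while we read it, so errors are ignored.
        awk '
            /^Name:/  { name = $2 }
            /^State:/ { state = $2 }
            /^Pid:/   { pid = $2 }
            END { if (state == "D") print pid, name }
        ' "$status" 2>/dev/null
    done
}
```

If the magic SysRq interface is enabled, echo w > /proc/sysrq-trigger asks the kernel to log the stacks of blocked tasks as well, which could complement the list printed by list_blocked_tasks.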
The SAS adapter malfunction was due to a bad power supply; this was not a Linux issue.