Bug 207855 - arcconf host reset causes kernel panic -> driver crash?
Summary: arcconf host reset causes kernel panic -> driver crash?
Status: RESOLVED INVALID
Alias: None
Product: IO/Storage
Classification: Unclassified
Component: SCSI
Hardware: x86-64 Linux
Importance: P1 normal
Assignee: linux-scsi@vger.kernel.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-05-22 08:49 UTC by Janpieter Sollie
Modified: 2020-11-07 17:11 UTC

See Also:
Kernel Version: 5.6.13 - 5.7.8
Subsystem:
Regression: No
Bisected commit-id:


Attachments
last dmesg captured (27.64 KB, text/plain)
2020-05-22 08:49 UTC, Janpieter Sollie
kernel .config file (24.94 KB, application/gzip)
2020-05-22 09:40 UTC, Janpieter Sollie
attachment of remote syslog (2.91 KB, text/plain)
2020-07-13 11:47 UTC, Janpieter Sollie

Description Janpieter Sollie 2020-05-22 08:49:36 UTC
Created attachment 289227
last dmesg captured

When performing an arcconf operation (assigning a hot spare) on an Adaptec 72405 SAS controller, the program crashes with a "segmentation fault" error. The driver is apparently not happy with it either: it becomes unresponsive, making it impossible to access the SCSI devices on the SAS controller.
Further attempts to perform a PCI-level reset ultimately lead to a kernel panic:
linuxserver# echo 1 > /sys/bus/pci/devices/0000\:04\:00.0/reset
(wait a minute)
linuxserver# echo 1 > /sys/bus/pci/rescan
(wait a minute)
linuxserver# umount /data/* (where all SAS devices are mounted)
(hangs indefinitely) 
linuxserver# echo auto > /sys/bus/pci/devices/0000\:04\:00.0/power/control
linuxserver# echo "0000:04:00.0" > /sys/bus/pci/drivers/aacraid/unbind
--PANIC--

I haven't been able to copy and paste the panic output yet; I'm working on a kexec kernel or crash dump.
The root directory is NOT on one of the SAS devices; it is on a generic SATA controller.
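
The reset sequence above can be wrapped in a small script that checks each sysfs file before writing to it and only prints the commands by default. This is a sketch, not a tested recovery procedure: the function names `sysfs_write` and `aac_reset_sequence` and the `DRY_RUN` and `SYSFS` variables are made up for illustration, the waits and the umount step are left out, and the PCI address and sysfs files are the ones quoted in the report.

```shell
#!/bin/sh
# Sketch only. DRY_RUN (default 1) and SYSFS (default /sys) are
# illustrative knobs, not part of any existing tool.

# sysfs_write VALUE FILE: print the command in dry-run mode,
# otherwise write VALUE to FILE if FILE is writable.
sysfs_write() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "echo $1 > $2"
    elif [ -w "$2" ]; then
        echo "$1" > "$2"
    else
        echo "skip: $2 is not writable" >&2
        return 1
    fi
}

# Replay the steps from the report for one PCI device.
aac_reset_sequence() {
    dev="${1:-0000:04:00.0}"      # PCI address taken from the report
    sysfs="${SYSFS:-/sys}"
    sysfs_write 1 "$sysfs/bus/pci/devices/$dev/reset" || return 1
    sysfs_write 1 "$sysfs/bus/pci/rescan" || return 1
    sysfs_write auto "$sysfs/bus/pci/devices/$dev/power/control" || return 1
    sysfs_write "$dev" "$sysfs/bus/pci/drivers/aacraid/unbind"
}
```

Calling `aac_reset_sequence` with no arguments prints the four writes without touching the system; setting `DRY_RUN=0` performs them, which on the affected machine is exactly what preceded the hang.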
Comment 1 Janpieter Sollie 2020-05-22 09:19:56 UTC
UPDATE: the host does not panic, but the whole IO system does not work any longer:
- network IO fails
- logon fails (hangs indefinitely)
- dmesg fails (hangs indefinitely)
- keyboard still works
I'd say a general IO error occurs, making the system unresponsive (but why is there still USB keyboard input?). Next time, I'll see whether I can still run cat /dev/kmsg, but any use of kexec is off the table, I guess.
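
One way to keep a copy of the kernel log when dmesg itself hangs is to start streaming /dev/kmsg before triggering the problem. A rough sketch, assuming the destination does not depend on the failing IO path (a spare virtual console, for instance); `start_kmsg_capture` and its arguments are hypothetical names:

```shell
#!/bin/sh
# Sketch only: copy kernel log records to DEST in the background.
# start_kmsg_capture DEST [SOURCE]
#   DEST   - file or tty to append to
#   SOURCE - defaults to /dev/kmsg; overridable so the helper can be
#            tried out on an ordinary file
start_kmsg_capture() {
    dest="${1:?usage: start_kmsg_capture DEST [SOURCE]}"
    src="${2:-/dev/kmsg}"
    # On /dev/kmsg, cat blocks and keeps running as new records arrive.
    cat "$src" >> "$dest" 2>/dev/null &
    echo $!                       # PID of the background reader
}
```

The printed PID can be used to stop the reader later with kill. Records read from /dev/kmsg carry a "priority,sequence,timestamp,flags;" prefix before the message text.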
Comment 2 Janpieter Sollie 2020-05-22 09:40:09 UTC
Created attachment 289229
kernel .config file
Comment 3 Bart Van Assche 2020-05-23 16:29:09 UTC
Is this perhaps a recently introduced bug? If so, would it be possible to bisect this? See also https://www.kernel.org/doc/html/latest/admin-guide/bug-bisect.html.
Comment 4 Janpieter Sollie 2020-05-23 17:53:47 UTC
Good idea. However, I have since cleared and rebuilt the storage array, and everything is working again. Any idea what this segfault means, so that I can reproduce the state (host adapter reset) and trigger the same error condition?
Comment 5 Bart Van Assche 2020-05-23 18:38:45 UTC
Is it possible to reproduce the kernel warning by running sg_reset -h /dev/sd... where /dev/sd... is a SCSI device controlled by an aacraid adapter? sg_reset is available in the sg3_utils package.
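
To find which /dev/sd... nodes belong to an aacraid adapter, the SCSI host entries in sysfs can be walked. A minimal sketch, assuming that the adapter's host reports "aacraid" in its proc_name attribute and that disks appear under the usual hostN/device/target*/*/block/ layout; `list_disks_for_driver` and the `SYSFS` override are illustrative names:

```shell
#!/bin/sh
# Sketch only: list block devices behind SCSI hosts owned by a driver.
# SYSFS is an illustrative override for testing; the default is /sys.
list_disks_for_driver() {
    driver="${1:-aacraid}"
    sysfs="${SYSFS:-/sys}"
    for host in "$sysfs"/class/scsi_host/host*; do
        # Skip hosts without a readable proc_name or with another driver.
        [ -r "$host/proc_name" ] || continue
        [ "$(cat "$host/proc_name")" = "$driver" ] || continue
        for blk in "$host"/device/target*/*/block/*; do
            if [ -e "$blk" ]; then
                echo "/dev/${blk##*/}"
            fi
        done
    done
}
```

Any one of the printed paths could then be passed to sg_reset -h as suggested above.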
Comment 6 Janpieter Sollie 2020-05-23 19:09:32 UTC
Sorry, I tried that, but the IOP reset succeeded. I even tried it while the array was doing an expansion operation, but no luck: it came back with no issues.
Comment 7 Janpieter Sollie 2020-06-06 09:01:52 UTC
I figured out it was due to an insufficient +5V line, which made the devices malfunction. I added some extra +5V juice and it worked without any problem.
Nevertheless, is it an option to "isolate" the storage driver somewhat, so that the other PCI devices are kept up and running?
There are still some points of investigation:
- If it's PCI related, why do the dedicated VGA + onboard USB still work?
- If it's storage subsystem related, why does network IO fail?
- If it's driver related, why is AHCI going down as well?
I guess this is not supposed to happen, so I'll see whether I can make it crash again, and eventually try to reset the whole PCI bus (and see whether that would help)
Comment 8 Janpieter Sollie 2020-07-13 11:46:11 UTC
To update a bit:
The problem recurred this morning. I can't access the PC right now, but I checked the remote syslog, and it displays something like:
... I know, the aacraid adapter panics, but why does it not reset the adapter and move on? Why does the telnet daemon segfault in libc?
Comment 9 Janpieter Sollie 2020-07-13 11:47:00 UTC
Created attachment 290249
attachment of remote syslog
Comment 10 Janpieter Sollie 2020-07-13 19:10:53 UTC
I just verified: the device was mostly dead: no F12 to enter the kernel log, no Num Lock response from the keyboard LED, but it was still replying to ping...
I have now locked the screen on tty12, so next time I *should* be able to see something.
Comment 11 Janpieter Sollie 2020-07-15 08:45:00 UTC
the issue seems to be related to:
  
> [59502.794967] Call Trace:
> [59502.794967]  _raw_spin_lock_irqsave+0x20/0x30
> [59502.794968]  __scsi_iterate_devices+0x22/0x80
> [59502.794968]  scsi_eh_ready_devs+0x129/0x7c0
> [59502.794968]  ? __pm_runtime_resume+0x54/0x70
> [59502.794968]  scsi_error_handler+0x394/0x3a0
> [59502.794969]  kthread+0xf3/0x130
> [59502.794969]  ? scsi_eh_get_sense+0x120/0x120
> [59502.794969]  ? kthread_park+0x80/0x80
> [59502.794970]  ret_from_fork+0x1f/0x30
    
As far as I can see, this stack blocks the entire SCSI subsystem.
I do not see why: scsi_error_handler runs in a separate kthread, so it *should* not block the IO subsystem ... but it definitely does: all storage devices on all SAS/SATA controllers (even USB) become inaccessible. I managed to get a dmesg out of it, but "echo 1 > /sys/class/pci_bus/0000\:04/device/reset" never completed. This command was issued over a running SSH session; a new session could no longer be established. But it proves the PCI subsystem is partially intact.
  
Is it possible that _raw_spin_lock_irqsave causes trouble when the adapter is not ready yet, and as such takes a lock on a device but never completes?
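
When a trace like the one above points at the SCSI error handler, it can help to see which other tasks are stuck at the same moment. A sketch, assuming a procps-style ps is available; `filter_blocked` and `show_blocked_tasks` are hypothetical helper names:

```shell
#!/bin/sh
# Sketch only: list tasks in uninterruptible sleep (state "D").

# Reads `ps -eo pid,stat,wchan:32,comm` style input on stdin, drops the
# header line and keeps rows whose STAT column starts with "D".
filter_blocked() {
    awk 'NR > 1 && $2 ~ /^D/'
}

# Print blocked tasks together with the kernel function they wait in.
show_blocked_tasks() {
    ps -eo pid,stat,wchan:32,comm | filter_blocked
}
```

On kernels built with magic SysRq support, writing "w" to /proc/sysrq-trigger asks the kernel to dump the stacks of blocked tasks to the kernel log as well. A thread that is spinning on a lock counts as running, not blocked, so this listing would show the tasks waiting behind it and not the spinner itself.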
Comment 12 Janpieter Sollie 2020-11-07 17:11:06 UTC
The SAS adapter malfunction was due to a bad power supply; this was not a Linux issue.
