Bug 216059 - Scsi host number of Adaptec RAID controller changes upon a PCIe hotplug and re-insert
Status: RESOLVED INVALID
Alias: None
Product: SCSI Drivers
Classification: Unclassified
Component: Other
Hardware: All Linux
Importance: P1 normal
Assignee: scsi_drivers-other
Reported: 2022-06-02 06:53 UTC by Sagar
Modified: 2022-06-02 17:09 UTC

See Also:
Kernel Version: 4.18.11
Subsystem:
Regression: No
Bisected commit-id:


Attachments
The attachments contain the log files which capture before and after cases for a hotplug host number change (33.69 KB, application/x-zip-compressed)
2022-06-02 06:53 UTC, Sagar

Description Sagar 2022-06-02 06:53:26 UTC
Created attachment 301088
The attachments contain the log files which capture before and after cases for a hotplug host number change

Summary:
This issue concerns the smartpqi driver for Adaptec controllers, PCIe hotplug, and the resulting SCSI host number.


The Linux message log shows the SCSI host number (e.g. [14:2:0:0] storage - /dev/sg27) changing unexpectedly when a PCIe hot remove is quickly followed by a PCIe hot add. The problem appears when the two PCIe events occur in quick succession (i.e. less than 2 minutes apart). Because it depends on timing, the issue can appear intermittent. The problem has been root-caused to the kernel.

 

Investigation:
Debug prints were added to the kernel (4.18.11-hotplug-patch) in the scsi_add_host() and scsi_remove_host() routines. Per the debug prints in the log, the SCSI host number is released only after the PCIe hot add event, which forces the kernel to use a different host number.

(debug prints)
Line 48: [ 1811.461055] smartpqi 0000:b3:00.0: Debuggg . . . pqi_unregister_scsi function, before scsi_remove_host, shost->host_num=14   //smartpqi requests host num 14 to be removed
Line 83: [ 2012.125750]  (null): Debuggg . . shost->host_no before dev_set_name = host15
Line 84: [ 2012.126709] smartpqi 0000:b3:00.0: Debuggg . . . before scsi_add_host, shost->host_num=15 //upon hot add, kernel allocates host number 15, it should be 14
Line 132: [ 2014.181784] scsi host14: Debuggg . . in scsi_host_dev_release function shost_host_no to be removed = 14 //kernel finally frees host number 14, but it’s too late
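
The sequence in the debug prints above can be reproduced with a small Python model. This is only a sketch under stated assumptions, not the kernel's code: HostNumberAllocator, Host and simulate are hypothetical names, the allocator is assumed to hand out the lowest free number, and the extra reference that delays the release (for example an open /dev/sg node) is a guess, since the log does not show who holds it.

```python
class HostNumberAllocator:
    """Toy allocator that hands out the lowest free number (assumption)."""

    def __init__(self):
        self.in_use = set()

    def alloc(self):
        n = 0
        while n in self.in_use:
            n += 1
        self.in_use.add(n)
        return n

    def free(self, n):
        self.in_use.discard(n)


class Host:
    """Hypothetical stand-in for a reference-counted SCSI host object."""

    def __init__(self, allocator):
        self.allocator = allocator
        self.host_no = allocator.alloc()
        self.refcount = 1  # reference held by the driver

    def get(self):
        self.refcount += 1

    def put(self):
        self.refcount -= 1
        if self.refcount == 0:
            # Mirrors the release callback seen in the log: the number
            # goes back to the allocator only when the last reference drops.
            self.allocator.free(self.host_no)


def simulate(first_free=14):
    alloc = HostNumberAllocator()
    alloc.in_use.update(range(first_free))  # numbers below 14 belong to other hosts
    old = Host(alloc)    # controller registered as host14
    old.get()            # assumed extra holder, e.g. an open device node
    old.put()            # hot remove: driver drops its reference, one remains
    new = Host(alloc)    # hot add: 14 is still taken, so 15 is assigned
    old.put()            # last reference dropped, 14 is freed too late
    later = Host(alloc)  # a later registration gets 14 again
    return old.host_no, new.host_no, later.host_no
```

Under these assumptions simulate() returns (14, 15, 14): the re-added controller gets 15 because 14 is not returned to the allocator until the final reference is dropped.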

 

Conclusion:
The kernel does not release the host number immediately when the smartpqi driver calls scsi_remove_host(). If the PCIe cable is reconnected within 2 minutes, the kernel can unexpectedly assign a different host number. This can lead to applications accessing the wrong device.
This is a Linux kernel issue, and we will file a Bugzilla report against the Linux kernel.
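
As a possible way for applications to avoid depending on the host number, the SCSI host can be looked up from the controller's PCI address at the time of use. The Python sketch below is an illustration only: scsi_hosts_for_pci is a hypothetical helper, and it assumes the usual sysfs layout in which a hostN directory appears under the PCI device the host was registered against. The sysfs_root parameter exists only so the function can be exercised against a test directory.

```python
import glob
import os
import re


def scsi_hosts_for_pci(pci_addr, sysfs_root="/sys"):
    """Return the SCSI host numbers currently found under a PCI device.

    pci_addr is a full address such as "0000:b3:00.0" (taken from the log).
    """
    pattern = os.path.join(sysfs_root, "bus", "pci", "devices", pci_addr, "host*")
    hosts = []
    for path in glob.glob(pattern):
        # Keep only entries named exactly host<digits>.
        match = re.fullmatch(r"host(\d+)", os.path.basename(path))
        if match:
            hosts.append(int(match.group(1)))
    return sorted(hosts)


if __name__ == "__main__":
    # Prints whatever host number the controller currently has, if any.
    print(scsi_hosts_for_pci("0000:b3:00.0"))
```

Resolving the host this way on each access would tolerate a change from host14 to host15 after a hot add, provided the PCI address itself stays the same.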

 

Questions:
Will this be a problem for Amazon? (Wouldn’t they take several minutes to do this, since they have to be very careful when hot plugging?)
Do we need to consider other customers that might use PCIe hot plug in the future?
The problem is observed in kernel V4.18.11; would V5.04/V5.10 make a difference (should we test it ourselves)?

 

Consequence:
The application accesses the wrong device. Rebooting the system may still result in the wrong host number.
Comment 1 Sagar 2022-06-02 16:58:41 UTC
(In reply to Sagar from comment #0)
Comment 2 Sagar 2022-06-02 17:09:13 UTC
This issue can be ignored. I have filed another BZ instead.
