Bug 92351

Summary: mpt2sas fails to load for a LSISAS2008 card
Product: Drivers
Reporter: pjay
Component: PCI
Assignee: drivers_pci (drivers_pci)
Status: RESOLVED WILL_NOT_FIX
Severity: normal
CC: alan, bjorn
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 3.4.1
Tree: Mainline
Regression: Yes
Attachments: dmesg for 3.4.0 3.4.1 and 3.4.1 with pci=realloc=off
dmesg with 3.19 and LSI FW 18
lsipcivv
lspci from machine with LSI SAS2008 card running P17 FW
test patch #1

Description pjay 2015-01-30 16:16:54 UTC
Created attachment 165301 [details]
dmesg for 3.4.0 3.4.1 and 3.4.1 with pci=realloc=off

Up through 3.4.0, mpt2sas loads successfully. Starting in 3.4.1, the card is found but mpt2sas fails during the load; it will load if pci=realloc=off is on the kernel command line. The same problem exists in 3.19-rc6.
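
To confirm which reallocation setting a given boot actually used, the kernel command line can be inspected. The following is a minimal sketch, not part of the original report: the function name realloc_setting is hypothetical, and it assumes the option appears as pci=realloc... or inside a comma-separated pci= list.

```shell
# Print the pci=realloc setting found on a kernel command line.
# cmdline_file defaults to /proc/cmdline; the parameter exists only so the
# function can be pointed at a saved copy of a command line.
realloc_setting() {
    cmdline_file=${1:-/proc/cmdline}
    # One parameter per line, keep pci=..., then split its comma-separated options.
    setting=$(tr ' ' '\n' < "$cmdline_file" | grep '^pci=' |
        sed 's/^pci=//' | tr ',' '\n' | grep '^realloc' | tail -n 1)
    echo "${setting:-kernel default}"
}
```

Running realloc_setting with no argument on the affected machine would show whether the workaround was in effect for that boot.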
Comment 1 pjay 2015-02-19 23:34:24 UTC
Created attachment 167611 [details]
dmesg with 3.19 and LSI FW 18
Comment 2 pjay 2015-02-27 00:35:08 UTC
Created attachment 168351 [details]
lsipcivv
Comment 3 Bjorn Helgaas 2015-03-24 02:46:03 UTC
Hi Paul, can you please run "sudo lspci -vvv" and attach the output?  The output attached to comment #2 was not collected as root, so it doesn't contain the SR-IOV information I was looking for.

I think your 3.4.0 and 3.4.1 kernels are from here:

  http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.4-precise/linux-image-3.4.0-030400-generic_3.4.0-030400.201205210521_amd64.deb
  http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.4.1-quantal/linux-image-3.4.1-030401-generic_3.4.1-030401.201206041411_amd64.deb

Inside those packages are /boot/config-3.4.0-030400-generic and /boot/config-3.4.1-030401-generic.  The 3.4.0 config does not have CONFIG_PCI_REALLOC_ENABLE_AUTO set, while the 3.4.1 config has CONFIG_PCI_REALLOC_ENABLE_AUTO=y.
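
The same comparison can be repeated against any kernel config file. This is a rough sketch under the assumption that the config has been extracted from the package or is installed under /boot; the function name realloc_auto_enabled is made up for illustration.

```shell
# Report whether a kernel config file enables automatic PCI reallocation.
# Example: realloc_auto_enabled /boot/config-3.4.1-030401-generic
realloc_auto_enabled() {
    config_file=$1
    # A disabled option appears as "# CONFIG_... is not set" or is absent entirely.
    if grep -q '^CONFIG_PCI_REALLOC_ENABLE_AUTO=y$' "$config_file"; then
        echo "enabled"
    else
        echo "not set"
    fi
}
```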
Comment 4 pjay 2015-03-25 18:36:34 UTC
Created attachment 172371 [details]
lspci from machine with LSI SAS2008 card running P17 FW
Comment 5 Bjorn Helgaas 2016-10-27 22:46:54 UTC
See also downstream bug report https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1363313 and email thread at http://lkml.kernel.org/r/54C81B4E.7060900@nwtrail.com
Comment 6 Bjorn Helgaas 2016-10-27 23:15:56 UTC
The LSI SAS2008 has an SR-IOV capability.  The BIOS on your platform assigns space for the regular BARs, but not for the SR-IOV BARs:

  pci 0000:01:00.0: reg 10: [io  0xce00-0xceff]
  pci 0000:01:00.0: reg 14: [mem 0xfbdfc000-0xfbdfffff 64bit]   # BAR 1
  pci 0000:01:00.0: reg 1c: [mem 0xfbd80000-0xfbdbffff 64bit]   # BAR 3
  pci 0000:01:00.0: reg 30: [mem 0x00000000-0x0007ffff pref]
  pci 0000:01:00.0: reg 174: [mem 0x00000000-0x00003fff 64bit]
  pci 0000:01:00.0: reg 17c: [mem 0x00000000-0x0003ffff 64bit]

Starting with your 3.4.1 kernel, Linux tries to assign space for the SR-IOV BARs, and in the process, it moves the regular BARs as well:

  pci 0000:01:00.0: BAR 1: assigned [mem 0xeff40000-0xeff43fff 64bit]
  pci 0000:01:00.0: BAR 3: assigned [mem 0xefb00000-0xefb3ffff 64bit]
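
To spot this kind of move in other dmesg logs, the BIOS-assigned and kernel-assigned start addresses can be paired up. The sketch below is illustrative only: bar_moves is a hypothetical helper, it assumes the "reg XX:" and "BAR n: assigned" message formats shown above, and it only looks at the six standard memory BARs (config offsets 0x10 through 0x24).

```shell
# Print "BAR n: <BIOS start> -> <kernel start>" for each memory BAR Linux moved.
# Reads dmesg text on stdin; device is the PCI address, e.g. 0000:01:00.0.
bar_moves() {
    device=$1
    log=$(cat)
    printf '%s\n' "$log" |
    sed -n "s/.*pci $device: reg \([0-9a-f]*\): \[mem \(0x[0-9a-f]*\)-.*/\1 \2/p" |
    while read -r reg bios_start; do
        # "reg XX" is the config-space offset; BAR n lives at 0x10 + 4*n.
        offset=$(( 0x$reg ))
        [ "$offset" -ge 16 ] && [ "$offset" -le 36 ] || continue
        bar=$(( (offset - 16) / 4 ))
        new_start=$(printf '%s\n' "$log" |
            sed -n "s/.*pci $device: BAR $bar: assigned \[mem \(0x[0-9a-f]*\)-.*/\1/p" |
            tail -n 1)
        if [ -n "$new_start" ] && [ "$new_start" != "$bios_start" ]; then
            echo "BAR $bar: $bios_start -> $new_start"
        fi
    done
}
```

It could be invoked as `dmesg | bar_moves 0000:01:00.0`; an empty result would suggest the regular BARs stayed where the BIOS put them.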

Yinghai's theory is that the LSI device has firmware that caches the BAR values, and it doesn't notice when Linux changes the BARs.  That seems like a plausible theory to me.  We tried to test that theory with [1], a patch intended to reset the LSI device after changing the BARs.  That failed, but I don't think we really know why.

I'd like to try that again, like this:

  # echo 0000:01:00.0 > /sys/bus/pci/drivers/mpt2sas/unbind
  # echo 1 > /sys/bus/pci/devices/0000:01:00.0/reset
  # echo 0000:01:00.0 > /sys/bus/pci/drivers/mpt2sas/bind

This would be on a kernel that exhibits the problem.  If this doesn't help, it might also be interesting to try it on a kernel that *doesn't* exhibit the problem -- the driver should release the device, then claim it again after the reset without incident.

It looks like the mpt2sas driver has been replaced by mpt3sas, so if you test on a newer kernel than 3.4.x, you might have to replace "mpt2sas" above by "mpt3sas".
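
To avoid hard-coding the driver name, the three steps above can be wrapped so the currently bound driver is read from sysfs first. This is only a sketch: rebind_after_reset is a hypothetical helper, and the optional sysfs_root argument is an addition for trying it against a scratch directory. It relies on the standard "driver" symlink, "unbind"/"bind" files, and "reset" file, and must run as root on real hardware.

```shell
# Unbind, reset, and rebind a PCI device using whichever driver owns it now.
# Prints the driver name on success.
rebind_after_reset() {
    device=$1
    sysfs_root=${2:-/sys}
    dev_dir=$sysfs_root/bus/pci/devices/$device

    # The "driver" symlink names the bound driver (mpt2sas or mpt3sas here).
    driver_link=$(readlink "$dev_dir/driver") || {
        echo "no driver bound to $device" >&2
        return 1
    }
    driver=$(basename "$driver_link")
    drv_dir=$sysfs_root/bus/pci/drivers/$driver

    echo "$device" > "$drv_dir/unbind" || return 1
    echo 1 > "$dev_dir/reset" || return 1
    echo "$device" > "$drv_dir/bind" || return 1
    echo "$driver"
}
```

For example, `rebind_after_reset 0000:01:00.0` would perform the same sequence on either an mpt2sas or an mpt3sas kernel.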

[1] http://lkml.kernel.org/r/CAE9FiQUxD8g4KG9CYe1znb5Lw5skkq=9QSOgkmGUDbhsaDq2Qw@mail.gmail.com
Comment 7 Bjorn Helgaas 2016-10-27 23:21:15 UTC
Created attachment 242991 [details]
test patch #1

This is the patch Yinghai proposed [1] (I edited it slightly and rebased it to v4.9-rc1).  This is basically what you (Paul) tested at [2], so I expect it should work.  If it does, I'd like to get it merged and resolve this.

[1] http://lkml.kernel.org/r/CAE9FiQW=uDAdtwZt7j4uWgsCYoH3Oti76dqDCht4_neeWc3mVQ@mail.gmail.com
[2] http://lkml.kernel.org/r/54D7AE67.1090007@nwtrail.com
Comment 8 pjay 2016-10-31 02:53:31 UTC
I have a machine here still running this LSI card. It is on Ubuntu
kernel 3.13.0-100 with LSI firmware P19, and there is no
pci=realloc=off entry on the boot command line. It works correctly.
I've archived my emails about this, but I remember the problem occurred
with a single earlier firmware version; I just don't remember which one.
Comment 9 Bjorn Helgaas 2016-12-28 21:43:58 UTC
If I understand correctly, the problem occurs with LSI FW 18, but not with P19.

The workaround proposed in comment #7 would be required for FW 18.  It hasn't been tested, and it's a little messy, so I'm not going to apply it untested.  If upgrading to LSI FW P19 is sufficient, maybe we don't need the workaround anyway.

Please reopen if you disagree.