Bug 110681 - System hangs while trying to access vpd data on LSI Logic / Symbios Logic MegaRAID SAS 2108 controllers
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: PCI
Hardware: All Linux
Importance: P1 normal
Assignee: drivers_pci@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-01-11 19:09 UTC by Babu Moger
Modified: 2016-05-15 23:59 UTC

See Also:
Kernel Version: 4.4
Subsystem:
Regression: No
Bisected commit-id:


Attachments
dmesg file taken from crash (42.22 KB, text/plain)
2016-01-11 19:32 UTC, Babu Moger
RFC patch disable vpd access to few buggy devices (3.63 KB, application/octet-stream)
2016-01-11 21:22 UTC, Babu Moger

Description Babu Moger 2016-01-11 19:09:08 UTC
We have observed system panics when running lspci -vvv on a system with the following card installed:

LSI Logic / Symbios Logic MegaRAID SAS 2108.

We can also reproduce the panic by reading the VPD data directly through sysfs, for example:

cat /sys/bus/pci/drivers/megaraid_sas/0000:10:00.0/vpd
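Since reading the vpd file is what triggers the hang, it can help to find out which devices expose one without opening it. The following is a minimal sketch under stated assumptions, not part of the original report: the function name find_vpd_devices is hypothetical, and it assumes the standard sysfs layout where each PCI function directory contains vendor, device and (optionally) vpd attributes.

```python
import os


def find_vpd_devices(sysfs_root="/sys/bus/pci/devices"):
    """List (address, vendor_id, device_id) for PCI functions exposing a vpd file.

    Only the small 'vendor' and 'device' attributes are read. The 'vpd'
    attribute is checked for existence but never opened, so this does not
    issue any VPD accesses to the device.
    """
    found = []
    if not os.path.isdir(sysfs_root):
        return found
    for addr in sorted(os.listdir(sysfs_root)):
        dev_dir = os.path.join(sysfs_root, addr)
        if not os.path.exists(os.path.join(dev_dir, "vpd")):
            continue
        ids = []
        for attr in ("vendor", "device"):
            with open(os.path.join(dev_dir, attr)) as f:
                # sysfs prints these as hex strings such as "0x1000"
                ids.append(int(f.read().strip(), 16))
        found.append((addr, ids[0], ids[1]))
    return found
```

Calling find_vpd_devices() with no arguments walks the live sysfs tree; passing another directory makes it easy to try against a copied or synthetic tree.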


The panic stack looks like below.

CapabiliNON-RESUMABLE ERROR: Reporting on cpu 48 
NON-RESUMABLE ERROR: TPC [0x0000000000837624] <ehci_irq+0x24/0x4e0> 
NON-RESUMABLE ERROR: RAW 
[0030100000000001:00001291bc82a5e0:0000000202000080:ffffffffffffffff 
NON-RESUMABLE ERROR:       
0000000800300000:0000000100000000:0000000000000000:0000000000000000] 
NON-RESUMABLE ERROR: handle [0x0030100000000001] stick [0x00001291bc82a5e0] 
NON-RESUMABLE ERROR: type [precise nonresumable] 
NON-RESUMABLE ERROR: attrs [0x02000080] < ASI sp-faulted priv > 
NON-RESUMABLE ERROR: raddr [0xffffffffffffffff] 
NON-RESUMABLE ERROR: insn effective address [0x0000084001704024] 
NON-RESUMABLE ERROR: size [0x8] 
NON-RESUMABLE ERROR: asi [0x00] 
CPU: 48 PID: 0 Comm: swapper/48 Not tainted 4.1.9-16.el6uek.sparc64 #1 
task: fff8001faa7703a0 ti: fff8001faa778000 task.ti: fff8001faa778000 
TSTATE: 0000000080e01605 TPC: 0000000000837624 TNPC: 0000000000837628 Y: 
00000000    Not tainted 
TPC: <ehci_irq+0x24/0x4e0> 
g0: 00000000004208b0 g1: 0000000000000000 g2: 0000000000000000 g3: 
0000000000001104 
g4: fff8001faa7703a0 g5: fff8001ffeb56000 g6: fff8001faa778000 g7: 
fff8001fa8fcec00 
o0: 000000000000000e o1: 0000000000000000 o2: 0000000000000030 o3: 
0000000000000001 
o4: fff8000064b18ae0 o5: 0000000000000012 sp: fff8001fffd032a1 ret_pc: 
0000000000837610 
RPC: <ehci_irq+0x10/0x4e0> 
l0: fff8001fffa01260 l1: fff8001fa2f97240 l2: 0000084001704024 l3: 
fff8001faa77bc7c 
l4: 0000000000c5c000 l5: 0000000000000000 l6: fff8001fa2f9732c l7: 
0000000000000008 
i0: fff8001fa2f97000 i1: fff8001faa778008 i2: 0000000000004000 i3: 
0000000000000001 
i4: 0000000000000000 i5: 000000000000000e i6: fff8001fffd033d1 i7: 
000000000081496c 
I7: <usb_hcd_irq+0x2c/0x60> 
Call Trace: 
 [000000000081496c] usb_hcd_irq+0x2c/0x60 
 [00000000004b8790] handle_irq_event_percpu+0x70/0x220 
 [00000000004b8974] handle_irq_event+0x34/0x60 
 [00000000004bbf38] handle_fasteoi_irq+0x98/0x160 
 [00000000004b8684] generic_handle_irq+0x24/0x40 
 [00000000009b3910] handler_irq+0xb0/0x100 
 [00000000004208b4] tl0_irq5+0x14/0x20 
 [000000000042d51c] arch_cpu_idle+0x7c/0xa0 
 [00000000004ac740] cpuidle_idle_call+0x40/0xc0 
 [00000000004ac938] cpu_idle_loop+0x178/0x260 
 [00000000004aca34] cpu_startup_entry+0x14/0x40 
 [0000000000440900] smp_callin+0x100/0x140 
 [0000000000d6efc4] after_lock_tlb+0x1a8/0x1bc 
 [0000000000000000]           (null) 
Kernel panic - not syncing: Non-resumable error. 
CPU: 48 PID: 0 Comm: swapper/48 Not tainted 4.1.9-16.el6uek.sparc64 #1 
Call Trace: 
 [00000000009ad6d4] panic+0xb4/0x248 
 [0000000000429638] sun4v_nonresum_error+0xb8/0xe0 
 [000000000040742c] sun4v_nonres_mondo+0xc8/0xd8 
 [0000000000837624] ehci_irq+0x24/0x4e0 
 [000000000081496c] usb_hcd_irq+0x2c/0x60 
 [00000000004b8790] handle_irq_event_percpu+0x70/0x220 
 [00000000004b8974] handle_irq_event+0x34/0x60 
 [00000000004bbf38] handle_fasteoi_irq+0x98/0x160 
 [00000000004b8684] generic_handle_irq+0x24/0x40 
 [00000000009b3910] handler_irq+0xb0/0x100 
 [00000000004208b4] tl0_irq5+0x14/0x20 
 [000000000042d51c] arch_cpu_idle+0x7c/0xa0 
 [00000000004ac740] cpuidle_idle_call+0x40/0xc0 
 [00000000004ac938] cpu_idle_loop+0x178/0x260 
 [00000000004aca34] cpu_startup_entry+0x14/0x40 
 [0000000000440900] smp_callin+0x100/0x140 


Looking at the crash dump, I noticed the following stack. (Note that the panic did not occur at this location, though.)

 PID: 5274   TASK: ffff800fe1198680  CPU: 0   COMMAND: "cat"
#0 [ffff800fe25f6f81] switch_to_pc at 8d725c
#1 [ffff800fe25f70e1] pci_user_read_config_word at 6c4698
#2 [ffff800fe25f71a1] pci_vpd_pci22_wait at 6c4710
#3 [ffff800fe25f7261] pci_vpd_pci22_read at 6c4994
#4 [ffff800fe25f7321] pci_read_vpd at 6c3e90
#5 [ffff800fe25f73d1] read_vpd_attr at 6ccc78
#6 [ffff800fe25f7481] read at 5be478
#7 [ffff800fe25f7531] vfs_read at 54fdb0
#8 [ffff800fe25f75e1] sys_read at 54ff10
#9 [ffff800fe25f76a1] linux_sparc_syscall at 4060f4
TSTATE=0x8082000223 TT=0x16d TPC=0xfffffc0100295e28 TNPC=0xfffffc0100295e2c
 r0=0x0000000000000000  r1=0x0000000000000003  r2=0x000000000020aec0
 r3=0x000000000020aec4  r4=0x000000000b000000  r5=0x00000000033fffff
 r6=0x0000000000000001  r7=0xfffffc01000006f0 r24=0x0000000000000003
r25=0x000000000020e000 r26=0x0000000000008000 r27=0x0000000000000000
r28=0x0000000000000000 r29=0x0000000000000000 r30=0x000007feffb468d1
Comment 1 Bjorn Helgaas 2016-01-11 19:19:40 UTC
Please attach a complete dmesg log, including the panic.

Can you try accessing VPD on this device in a Windows system?  If Windows crashes too, it's a pretty good indication that the LSI device is defective and maybe we should just completely blacklist it.  If Windows shows some VPD data, that's a clue that we should be able to do at least that much.
Comment 2 Babu Moger 2016-01-11 19:32:23 UTC
Created attachment 199361
dmesg file taken from crash
Comment 3 Babu Moger 2016-01-11 19:35:49 UTC
(In reply to Bjorn Helgaas from comment #1)
> Please attach a complete dmesg log, including the panic.
 
 Done.
> 
> Can you try accessing VPD on this device in a Windows system?  If Windows
> crashes too, it's a pretty good indication that the LSI device is defective

 Sorry, we don't test on Windows machines.

> and maybe we should just completely blacklist it.  If Windows shows some VPD
> data, that's a clue that we should be able to do at least that much.

Looks like multiple vendors have this problem. Blacklisting may not be a good option.
Comment 4 Bjorn Helgaas 2016-01-11 19:54:25 UTC
Thanks for the log.  Do you test this card with any OS other than Linux?  Solaris?

By "blacklist", I meant a patch like what you proposed (http://lkml.kernel.org/r/1449174319-52798-1-git-send-email-babu.moger@oracle.com) that disables or limits VPD on that particular device.

Blacklisting is a last resort because it's hard to populate the list in the first place (somebody has to trip over the broken device, report it, and we have to figure out what's wrong and how to handle it), and it's hard to maintain over time.
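As a rough illustration of the blacklist idea described above, the lookup amounts to a table of vendor/device ID pairs that forces the usable VPD length to zero for known-bad devices. This is a user-space sketch under stated assumptions, not the RFC patch: the names VPD_BLACKLIST and effective_vpd_len are hypothetical, and the device ID 0x0079 is an assumed value for the MegaRAID SAS 2108 that should be checked against lspci -nn output on the affected system.

```python
# Hypothetical table of (vendor_id, device_id) pairs whose VPD must not be read.
# 0x1000 is the LSI Logic / Symbios Logic vendor ID; the device ID below is an
# assumption for illustration and is not taken from this bug report.
VPD_BLACKLIST = frozenset({
    (0x1000, 0x0079),
})


def effective_vpd_len(vendor, device, default_len, blacklist=VPD_BLACKLIST):
    """Return the number of VPD bytes a reader may access for this device.

    Blacklisted devices get 0, so any read returns no data without touching
    the hardware. All other devices keep their default length.
    """
    if (vendor, device) in blacklist:
        return 0
    return default_len
```

The maintenance cost mentioned above shows up directly in this design: every newly discovered broken device needs another entry in the table.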
Comment 5 Babu Moger 2016-01-11 21:20:44 UTC
(In reply to Bjorn Helgaas from comment #4)
> Thanks for the log.  Do you test this card with any OS other than Linux? 
> Solaris?
  No.
> 
> By "blacklist", I meant a patch like what you proposed
 
  Ok. Got it.

> (http://lkml.kernel.org/r/1449174319-52798-1-git-send-email-babu.moger@oracle.com)
> that disables or limits VPD on that particular device.
> 
> Blacklisting is a last resort because it's hard to populate the list in the
> first place (somebody has to trip over the broken device, report it, and we
> have to figure out what's wrong and how to handle it), and it's hard to
> maintain over time.

  I have posted an RFC patch. Please take a look. I will attach it here as well.
Comment 6 Babu Moger 2016-01-11 21:22:55 UTC
Created attachment 199371
RFC patch disable vpd access to few buggy devices
Comment 7 Martin Mansfield 2016-05-15 16:13:11 UTC
I have encountered this problem with the 3.10 kernel on CentOS 7 with an LSI 9240-4i: "lspci -vv" would hang the PC if the megasas driver failed to init correctly, but if the driver did init, "lspci -vv" would read a single zero byte of VPD and output an error. The link to the CentOS 7 bug report is

https://bugs.centos.org/view.php?id=10818

This patch to blacklist the device may be the best way to go, but another patch that limits the size of VPD reads has recently been committed. Has anyone tested that patch with a megasas card, without the blacklisting patch, to be certain that the card really is broken in that area?
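For context, limiting the size of a VPD read generally means walking the resource tags until the end tag and refusing to read past it. The sketch below shows that idea in user space; it is an illustration under stated assumptions, not the committed kernel patch. The function name vpd_size is hypothetical, and the tag values (0x02 identifier string, 0x10 VPD-R, 0x11 VPD-W, 0x0F end tag) follow the VPD resource format from the PCI specification.

```python
def vpd_size(data, max_len=32768):
    """Return the VPD length up to and including the end tag, or 0 if malformed.

    'data' is a bytes-like buffer holding raw VPD contents. 'max_len' caps the
    walk at the 32 KiB addressable through the VPD capability.
    """
    off = 0
    limit = min(len(data), max_len)
    while off < limit:
        header = data[off]
        if header & 0x80:
            # Large resource: 7-bit tag name followed by a 16-bit LE length.
            if off + 3 > limit:
                return 0
            name = header & 0x7F
            if name not in (0x02, 0x10, 0x11):
                return 0  # unknown large tag, treat the VPD as invalid
            length = data[off + 1] | (data[off + 2] << 8)
            off += 3 + length
        else:
            # Small resource: 4-bit tag name and 3-bit length in one byte.
            name = (header >> 3) & 0x0F
            length = header & 0x07
            off += 1 + length
            if name == 0x0F:
                # End tag reached; valid only if it fits inside the buffer.
                return off if off <= limit else 0
    return 0
```

A buffer of all zeros or all 0xFF bytes yields 0 here, which is the kind of result one would expect from a device that returns no usable VPD.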

If there is valid VPD in there, then the next question is: how should VPD access to a device be prevented if the driver failing to init causes VPD access to hang the PC?
Comment 8 Babu Moger 2016-05-15 23:59:52 UTC
(In reply to Martin Mansfield from comment #7)
> I have encountered this problem with the 3.10 on Centos 7 with a LSI 9240-4i
> where "lspci -vv" would hang the PC if the megasas driver failed to init
> correctly but if it did then "lspci -vv" would read a single zero byte of
> VPD and output an error. The link to the Centos 7 bug report is
> 
> https://bugs.centos.org/view.php?id=10818
> 
> This patch to blacklist the device may be the best way to go but there is
> another patch which has been recently committed which limits the size of VPD

Yes. I have tested these patches. It reads the vpd data and tries to figure out the actual length.

> read. Has anyone tested that patch with a megasas card without the
> blacklisting patch to be certain that the card really is broken in that area
> ?

What I have seen is that this device causes the system to hang as soon as we attempt to read the VPD. The only way to avoid that is to blacklist VPD access. I did not see any other alternative.

> 
> If there is valid VPD in there then the next question is, how should access
> to VPD in a device be prevented if the driver failing to init causes VPD
> access to hang the PC ?

There is no other way as far as I can tell.
