We have observed system panics with the lspci -vvv command when the following card is in the system:

LSI Logic / Symbios Logic MegaRAID SAS 2108

We can also recreate the panic by reading the VPD data directly, like below.

cat /sys/bus/pci/drivers/megaraid_sas/0000:10:00.0/vpd

The panic stack looks like below.

CapabiliNON-RESUMABLE ERROR: Reporting on cpu 48
NON-RESUMABLE ERROR: TPC [0x0000000000837624] <ehci_irq+0x24/0x4e0>
NON-RESUMABLE ERROR: RAW [0030100000000001:00001291bc82a5e0:0000000202000080:ffffffffffffffff
NON-RESUMABLE ERROR:      0000000800300000:0000000100000000:0000000000000000:0000000000000000]
NON-RESUMABLE ERROR: handle [0x0030100000000001] stick [0x00001291bc82a5e0]
NON-RESUMABLE ERROR: type [precise nonresumable]
NON-RESUMABLE ERROR: attrs [0x02000080] < ASI sp-faulted priv >
NON-RESUMABLE ERROR: raddr [0xffffffffffffffff]
NON-RESUMABLE ERROR: insn effective address [0x0000084001704024]
NON-RESUMABLE ERROR: size [0x8]
NON-RESUMABLE ERROR: asi [0x00]
CPU: 48 PID: 0 Comm: swapper/48 Not tainted 4.1.9-16.el6uek.sparc64 #1
task: fff8001faa7703a0 ti: fff8001faa778000 task.ti: fff8001faa778000
TSTATE: 0000000080e01605 TPC: 0000000000837624 TNPC: 0000000000837628 Y: 00000000 Not tainted
TPC: <ehci_irq+0x24/0x4e0>
g0: 00000000004208b0 g1: 0000000000000000 g2: 0000000000000000 g3: 0000000000001104
g4: fff8001faa7703a0 g5: fff8001ffeb56000 g6: fff8001faa778000 g7: fff8001fa8fcec00
o0: 000000000000000e o1: 0000000000000000 o2: 0000000000000030 o3: 0000000000000001
o4: fff8000064b18ae0 o5: 0000000000000012 sp: fff8001fffd032a1 ret_pc: 0000000000837610
RPC: <ehci_irq+0x10/0x4e0>
l0: fff8001fffa01260 l1: fff8001fa2f97240 l2: 0000084001704024 l3: fff8001faa77bc7c
l4: 0000000000c5c000 l5: 0000000000000000 l6: fff8001fa2f9732c l7: 0000000000000008
i0: fff8001fa2f97000 i1: fff8001faa778008 i2: 0000000000004000 i3: 0000000000000001
i4: 0000000000000000 i5: 000000000000000e i6: fff8001fffd033d1 i7: 000000000081496c
I7: <usb_hcd_irq+0x2c/0x60>
Call Trace:
 [000000000081496c] usb_hcd_irq+0x2c/0x60
 [00000000004b8790] handle_irq_event_percpu+0x70/0x220
 [00000000004b8974] handle_irq_event+0x34/0x60
 [00000000004bbf38] handle_fasteoi_irq+0x98/0x160
 [00000000004b8684] generic_handle_irq+0x24/0x40
 [00000000009b3910] handler_irq+0xb0/0x100
 [00000000004208b4] tl0_irq5+0x14/0x20
 [000000000042d51c] arch_cpu_idle+0x7c/0xa0
 [00000000004ac740] cpuidle_idle_call+0x40/0xc0
 [00000000004ac938] cpu_idle_loop+0x178/0x260
 [00000000004aca34] cpu_startup_entry+0x14/0x40
 [0000000000440900] smp_callin+0x100/0x140
 [0000000000d6efc4] after_lock_tlb+0x1a8/0x1bc
 [0000000000000000] (null)
Kernel panic - not syncing: Non-resumable error.
CPU: 48 PID: 0 Comm: swapper/48 Not tainted 4.1.9-16.el6uek.sparc64 #1
Call Trace:
 [00000000009ad6d4] panic+0xb4/0x248
 [0000000000429638] sun4v_nonresum_error+0xb8/0xe0
 [000000000040742c] sun4v_nonres_mondo+0xc8/0xd8
 [0000000000837624] ehci_irq+0x24/0x4e0
 [000000000081496c] usb_hcd_irq+0x2c/0x60
 [00000000004b8790] handle_irq_event_percpu+0x70/0x220
 [00000000004b8974] handle_irq_event+0x34/0x60
 [00000000004bbf38] handle_fasteoi_irq+0x98/0x160
 [00000000004b8684] generic_handle_irq+0x24/0x40
 [00000000009b3910] handler_irq+0xb0/0x100
 [00000000004208b4] tl0_irq5+0x14/0x20
 [000000000042d51c] arch_cpu_idle+0x7c/0xa0
 [00000000004ac740] cpuidle_idle_call+0x40/0xc0
 [00000000004ac938] cpu_idle_loop+0x178/0x260
 [00000000004aca34] cpu_startup_entry+0x14/0x40
 [0000000000440900] smp_callin+0x100/0x140

Looking at the crash dump, I noticed the following stack for the "cat" process. (Note that the panic did not occur at this location, though.)

PID: 5274 TASK: ffff800fe1198680 CPU: 0 COMMAND: "cat"
 #0 [ffff800fe25f6f81] switch_to_pc at 8d725c
 #1 [ffff800fe25f70e1] pci_user_read_config_word at 6c4698
 #2 [ffff800fe25f71a1] pci_vpd_pci22_wait at 6c4710
 #3 [ffff800fe25f7261] pci_vpd_pci22_read at 6c4994
 #4 [ffff800fe25f7321] pci_read_vpd at 6c3e90
 #5 [ffff800fe25f73d1] read_vpd_attr at 6ccc78
 #6 [ffff800fe25f7481] read at 5be478
 #7 [ffff800fe25f7531] vfs_read at 54fdb0
 #8 [ffff800fe25f75e1] sys_read at 54ff10
 #9 [ffff800fe25f76a1] linux_sparc_syscall at 4060f4
 TSTATE=0x8082000223 TT=0x16d TPC=0xfffffc0100295e28 TNPC=0xfffffc0100295e2c
 r0=0x0000000000000000 r1=0x0000000000000003 r2=0x000000000020aec0 r3=0x000000000020aec4
 r4=0x000000000b000000 r5=0x00000000033fffff r6=0x0000000000000001 r7=0xfffffc01000006f0
 r24=0x0000000000000003 r25=0x000000000020e000 r26=0x0000000000008000 r27=0x0000000000000000
 r28=0x0000000000000000 r29=0x0000000000000000 r30=0x000007feffb468d1
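For readers unfamiliar with the VPD handshake that the "cat" stack above is sitting in: per the PCI VPD capability, software writes an address with the F flag (bit 15 of the VPD address register) clear, and the device sets F once the data is ready. Below is a minimal userspace sketch of a bounded poll on that flag. It is not the kernel's pci_vpd_pci22_wait; the names vpd_wait, fake_dev, and fake_read are hypothetical, and the fake device only stands in for a config-space read.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define VPD_ADDR_F 0x8000u /* completion flag: bit 15 of the VPD address register */

/* Hypothetical stand-in for a config-space read of the VPD address register. */
typedef uint16_t (*read_vpd_addr_fn)(void *ctx);

/*
 * Poll until the F flag equals want_set, giving up after max_polls reads.
 * Returns 0 on success, -1 on timeout or bad arguments.
 */
int vpd_wait(read_vpd_addr_fn read_addr, void *ctx, bool want_set,
             unsigned max_polls)
{
    if (read_addr == NULL)
        return -1;

    for (unsigned i = 0; i < max_polls; i++) {
        bool is_set = (read_addr(ctx) & VPD_ADDR_F) != 0;

        if (is_set == want_set)
            return 0;
    }
    return -1;
}

/* Fake device for illustration: raises F after ready_after reads. */
struct fake_dev {
    unsigned reads;
    unsigned ready_after;
};

uint16_t fake_read(void *ctx)
{
    struct fake_dev *d = ctx;

    d->reads++;
    return d->reads > d->ready_after ? VPD_ADDR_F : 0;
}
```

Note that a bounded poll like this only guards against a flag that never changes. It would not help with the non-resumable error reported above, which is raised by the platform rather than by a timeout.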
Please attach a complete dmesg log, including the panic. Can you try accessing VPD on this device in a Windows system? If Windows crashes too, it's a pretty good indication that the LSI device is defective and maybe we should just completely blacklist it. If Windows shows some VPD data, that's a clue that we should be able to do at least that much.
Created attachment 199361 [details] dmesg file taken from crash
(In reply to Bjorn Helgaas from comment #1)
> Please attach a complete dmesg log, including the panic.

Done.

> Can you try accessing VPD on this device in a Windows system? If Windows
> crashes too, it's a pretty good indication that the LSI device is defective

Sorry, we don't test on Windows machines.

> and maybe we should just completely blacklist it. If Windows shows some VPD
> data, that's a clue that we should be able to do at least that much.

It looks like multiple vendors have this problem, so blacklisting may not be a good option.
Thanks for the log. Do you test this card with any OS other than Linux? Solaris? By "blacklist", I meant a patch like what you proposed (http://lkml.kernel.org/r/1449174319-52798-1-git-send-email-babu.moger@oracle.com) that disables or limits VPD on that particular device. Blacklisting is a last resort because it's hard to populate the list in the first place (somebody has to trip over the broken device, report it, and we have to figure out what's wrong and how to handle it), and it's hard to maintain over time.
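To make the "blacklist" idea concrete, here is a minimal userspace sketch of a per-device lookup that refuses VPD access for listed devices. It is not the proposed kernel patch. The names pci_id, vpd_blacklist, and vpd_access_allowed are hypothetical, and the single table entry (vendor 0x1000, device 0x0079) is an assumption about the MegaRAID SAS 2108's IDs, included only for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct pci_id {
    uint16_t vendor;
    uint16_t device;
};

/*
 * Illustrative table only. 0x1000:0x0079 is assumed here to identify the
 * MegaRAID SAS 2108; check lspci -nn on the affected system before relying
 * on it.
 */
static const struct pci_id vpd_blacklist[] = {
    { 0x1000, 0x0079 },
};

/* Return false if VPD access should be refused for this device. */
bool vpd_access_allowed(uint16_t vendor, uint16_t device)
{
    size_t n = sizeof(vpd_blacklist) / sizeof(vpd_blacklist[0]);

    for (size_t i = 0; i < n; i++) {
        if (vpd_blacklist[i].vendor == vendor &&
            vpd_blacklist[i].device == device)
            return false;
    }
    return true;
}
```

The sketch also shows the maintenance cost mentioned above: every broken device needs its own entry, and someone has to hit the failure first to know which IDs to add.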
(In reply to Bjorn Helgaas from comment #4)
> Thanks for the log. Do you test this card with any OS other than Linux?
> Solaris?

No.

> By "blacklist", I meant a patch like what you proposed

Ok. Got it.

> (http://lkml.kernel.org/r/1449174319-52798-1-git-send-email-babu.moger@oracle.com)
> that disables or limits VPD on that particular device.
>
> Blacklisting is a last resort because it's hard to populate the list in the
> first place (somebody has to trip over the broken device, report it, and we
> have to figure out what's wrong and how to handle it), and it's hard to
> maintain over time.

I have posted an RFC patch. Please take a look. I will attach it here as well.
Created attachment 199371 [details] RFC patch disable vpd access to few buggy devices
I have encountered this problem with the 3.10 kernel on CentOS 7 with an LSI 9240-4i. "lspci -vv" would hang the PC if the megasas driver failed to init correctly; if the driver did init, "lspci -vv" would read a single zero byte of VPD and output an error. The link to the CentOS 7 bug report is

https://bugs.centos.org/view.php?id=10818

This patch to blacklist the device may be the best way to go, but another patch has recently been committed which limits the size of the VPD read. Has anyone tested that patch with a megasas card, without the blacklisting patch, to be certain that the card really is broken in that area?

If there is valid VPD in there, then the next question is: how should access to VPD in a device be prevented if the driver failing to init causes VPD access to hang the PC?
(In reply to Martin Mansfield from comment #7)
> I have encountered this problem with the 3.10 on Centos 7 with a LSI 9240-4i
> where "lspci -vv" would hang the PC if the megasas driver failed to init
> correctly but if it did then "lspci -vv" would read a single zero byte of
> VPD and output an error. The link to the Centos 7 bug report is
>
> https://bugs.centos.org/view.php?id=10818
>
> This patch to blacklist the device may be the best way to go but there is
> another patch which has been recently committed which limits the size of VPD

Yes, I have tested those patches. They read the VPD data and try to work out its actual length.

> read. Has anyone tested that patch with a megasas card without the
> blacklisting patch to be certain that the card really is broken in that area
> ?

What I have seen is that this device causes the system to hang as soon as we attempt to read the VPD. The only way to avoid that is to blacklist VPD access; I did not see any alternative.

> If there is valid VPD in there then the next question is, how should access
> to VPD in a device be prevented if the driver failing to init causes VPD
> access to hang the PC ?

There is no other way as far as I can tell.
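As a rough illustration of what "figuring out the actual length" can mean, here is a userspace sketch that walks VPD resource tags in a buffer until it finds the End tag. It is a simplification, not the committed kernel patch. The function name vpd_size is hypothetical, and the sketch assumes only the three large-resource tags used by PCI VPD (0x82 identifier string, 0x90 VPD-R, 0x91 VPD-W) plus the End tag (0x78). Any other byte, including a leading 0x00 like the one reported in comment #7, is treated as invalid.

```c
#include <stddef.h>
#include <stdint.h>

#define VPD_LARGE_TAG 0x80u /* bit 7 set: large resource, 16-bit length follows */
#define VPD_END_TAG   0x78u /* small resource "End" tag with length 0 */

/*
 * Return the number of bytes up to and including the End tag,
 * or 0 if the buffer does not hold a well-formed tag chain.
 */
size_t vpd_size(const uint8_t *buf, size_t max_len)
{
    size_t off = 0;

    if (buf == NULL)
        return 0;

    while (off < max_len) {
        uint8_t tag = buf[off];

        if (tag & VPD_LARGE_TAG) {
            /* Accept only the large tags expected in PCI VPD. */
            if (tag != 0x82 && tag != 0x90 && tag != 0x91)
                return 0;
            /* Need the tag byte plus two little-endian length bytes. */
            if (max_len - off < 3)
                return 0;
            size_t len = (size_t)buf[off + 1] |
                         ((size_t)buf[off + 2] << 8);
            off += 3 + len;
        } else if (tag == VPD_END_TAG) {
            return off + 1;
        } else {
            /* Simplification: any other small tag is rejected. */
            return 0;
        }
    }
    /* Ran past the buffer without seeing an End tag. */
    return 0;
}
```

A size of 0 here would mean "do not expose VPD at all". This only helps if the first read from the device completes; it does not cover a device that hangs the system on any VPD read, which is the case described above.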