Bug 199779
| Summary: | [Hardware Error] kernel panic upon reboot on HP DL360 Gen9 | | |
|---|---|---|---|
| Product: | Drivers | Reporter: | Ryan Finnie (ryan) |
| Component: | PCI | Assignee: | drivers_pci (drivers_pci) |
| Status: | RESOLVED CODE_FIX | | |
| Severity: | normal | CC: | bjorn, okaya |
| Priority: | P1 | | |
| Hardware: | i386 | | |
| OS: | Linux | | |
| Kernel Version: | 4.15-rc1 | Subsystem: | |
| Regression: | Yes | Bisected commit-id: | |

Attachments:
- lspci -vv
- 4.17.0-rc5-next-20180517 dmesg
- pcie_pme_remove removed, crash
- lspci -t
- debug_patch.patch
- debug patch output
- shutdown log
Description
Ryan Finnie
2018-05-21 07:03:07 UTC
Can you test this patch? https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/drivers/pci/hotplug?id=d22b362184553899f7d6b6760899a77d3b2d7c1b

There is a known Intel erratum that we missed. Can you also share your dmesg?

Created attachment 276103 [details]
4.17.0-rc5-next-20180517 dmesg

Thanks, but same problem with that patch against 4.15. I even tried next-20180517 to be sure; no luck. The dmesg against next-20180517 has been attached.

Cool, I had my suspicions; that's why I asked for the dmesg. Your system doesn't seem to have the hotplug driver loaded, and the bugfix above is valid only if the hotplug driver is enabled. Something else must be happening. It looks like PME is the only PCIe port service driver loaded. Can you empty out this line to see if it makes any difference? Then we can start going deeper based on your test result. https://elixir.bootlin.com/linux/latest/ident/pcie_pme_remove

Created attachment 276111 [details]
pcie_pme_remove removed, crash

- .remove = pcie_pme_remove,

With that removed, the crash becomes:

```
[  115.008578] kernel BUG at drivers/pci/msi.c:352!
[  115.069730] invalid opcode: 0000 [#1] SMP PTI
[  115.127399] CPU: 15 PID: 1 Comm: systemd-shutdow Not tainted 4.17.0-rc5-next-20180517-custom #1
[  115.242735] Hardware name: HP ProLiant DL360 Gen9/ProLiant DL360 Gen9, BIOS P89 01/22/2018
[  115.351050] RIP: 0010:free_msi_irqs+0x17b/0x1b0
[  115.410250] Code: 84 e1 fe ff ff 45 31 f6 eb 11 41 83 c6 01 44 39 73 14 0f 86 ce fe ff ff 8b 7b 10 44 01 f7 e8 7c f4 bb ff 48 83 78 70 00 74 e0 <0f> 0b 49 8d b5 a0 00 00 00 e8 b7 a0 bc ff e9 cf fe ff ff 48 8b 78
[...]
```

Full output attached.

Oops. Can you comment out this line only? https://elixir.bootlin.com/linux/latest/source/drivers/pci/pcie/pme.c#L431 We have to call free_irq(); I went too aggressive at the problem.

Commented out "pcie_pme_suspend(srv);", and it's back to the original Hardware Error crash.

Weird. I'll come up with a debug patch. In the meantime, can you collect some more data on which other systems see this issue? Since you are the first one to report the problem, there must be something unique about your setup. Also, please attach sudo lspci -t output too.

Sure. I'm seeing this on a set of 4 DL360 Gen9s; I believe they were all purchased at the same time, around 2016. I'll look around for further machines I can test on, looking for:

1) DL360 Gen9s, but not from the same batch as these
2) Previous gens (not sure we have any older ones)
3) DL380 Gen9

Attaching lspci -t.

Created attachment 276113 [details]
lspci -t
Created attachment 276115 [details]
debug_patch.patch
I was able to test on another DL360 Gen9, received about a year after the ones I discovered this on: same problem. A DL380 Gen9 with similar specs also crashes. I was also able to test on a DL380 Gen10, which did *not* crash. In summary:

- Bad: DL360 Gen9 - BIOS P89 v2.56 (01/22/2018) - P440ar V6.30 (originals)
- Bad: DL360 Gen9 - BIOS P89 v2.52 (10/25/2017) - P440ar V6.06 (newer)
- Bad: DL380 Gen9 - BIOS P89 v2.52 (10/25/2017) - P440ar V6.06 (newer)
- Good: DL380 Gen10 - U30 v1.32 (02/01/2018) - P408i-a 1.04-0 (even newer)

Attached is the output from your debug patch on the original test system.

Created attachment 276121 [details]
debug patch output
Many thanks; let's try these tests. The debug prints are not giving me any clues. The error seems to be asynchronous to the code execution, so we'll have to find out by trial and error which change is confusing the HW. My bet is on the first one, followed by the third.

1. Comment out this line only: https://elixir.bootlin.com/linux/v4.17-rc6/source/drivers/pci/pcie/portdrv_core.c#L412
2. Comment out this line only: https://elixir.bootlin.com/linux/v4.17-rc6/source/drivers/pci/pcie/portdrv_pci.c#L148
3. Comment out the if block only: https://elixir.bootlin.com/linux/v4.17-rc6/source/drivers/pci/pcie/portdrv_pci.c#L142

Progress! #1 reboots correctly.

A) I had reverted the debug print patch; do you want me to add it back? Does it give you any extra insight?
B) Should I move on to #2 and #3?

No, this is enough. We now understand that disabling the bus master bit in the command register of the root port is causing the crash on your system. I suspect that the firmware is talking to the PCIe bus in parallel, and by disabling the bus master bit we are breaking the FW. Can you also attach the messages you are seeing during shutdown/reboot? The driver cleanup order could be important too.

Created attachment 276123 [details]
shutdown log
[`dmesg -n debug` added, otherwise normal systemd-obfuscated user messages]
This particular test machine is a MAAS server: 4 interfaces, 2 bonds, 2 bridges. It normally runs a KVM instance directly, but I don't have it set up to autoboot, to save time while testing.

Functionally, the other machines tested don't share a common operational trait: an OpenStack "smoosh" (nova-compute + n-c-c + neutron + swift + ceph, etc., in LXDs), a straight Apache archive server, and a standby firewall. Actually, they do all appear to be at least partially utilizing 10GbE interfaces (hopefully that's not a factor, since I'm not sure I can pull a straight gigabit machine out of active use to test on short notice).
Can you apply debug_patch.patch, plus #1 (comment out this line only: https://elixir.bootlin.com/linux/v4.17-rc6/source/drivers/pci/pcie/portdrv_core.c#L412), and collect the shutdown log one more time? I see quite a bit of driver shutdown activity from your network adapters. I want to see it relative to the port service driver shutdown, to learn which one happens first and which last.

I am not yet convinced that it is necessary for pcie_port_device_remove() to call pci_disable_device() on PCIe Root Ports and Switch Ports during a reboot. A similar question came up during discussion of pciehp timeouts during shutdown [1]. Eric Biederman had a good response [2] that I haven't had time to assimilate yet.

[1] https://lkml.kernel.org/r/8770820b-85a0-172b-7230-3a44524e6c9f@molgen.mpg.de
[2] https://lkml.kernel.org/r/87tvrrypkv.fsf@xmission.com

I think the motivation is to keep rogue transactions from the devices from hitting system memory while a new kernel is booted via kexec. It is not an issue when an IOMMU is not present, since the second kernel that is booting doesn't share the same address space. However, when an IOMMU is present, an adapter can corrupt the newly booting kernel. So you ideally want the bus master bit cleared for a clean boot. What is interesting is that kexec is already doing this job in pci_device_shutdown(), so this extra clear is unnecessary. I'll post a patch to remove it.

The change was merged for the 4.18 kernel: https://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi.git/commit/?id=0d98ba8d70b0070ac117452ea0b663e26bbf46bf This issue can be closed.

Ack, thank you for all your help.