Bug 73141
Summary: | Can't assign bridge window after resume | ||
---|---|---|---|
Product: | Drivers | Reporter: | Bjorn Helgaas (bjorn) |
Component: | PCI | Assignee: | drivers_pci (drivers_pci) |
Status: | NEW --- | ||
Severity: | normal | CC: | parag.lkml, qq |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
URL: | http://lkml.kernel.org/r/CAOULuOY5eryuH1KgaqiuGXyYiQd2h=TZoPe2M2+juts9XuzW6Q@mail.gmail.com | ||
Kernel Version: | v3.14-rc8 | Subsystem: | |
Regression: | No | Bisected commit-id: | |
Attachments: |
dmesg (daily Ubuntu build of v3.14-rc)
lspci -vvxxxx
acpidump with VTd enabled
ACPI namespace disassembly
bus check events and backtrace where warning occurs |
Description
Bjorn Helgaas
2014-03-29 16:08:11 UTC
Created attachment 130971 [details]
dmesg (daily Ubuntu build of v3.14-rc)
Created attachment 130981 [details]
lspci -vvxxxx
Created attachment 131011 [details]
acpidump with VTd enabled
My mistake on the VTd making a difference part: VTd does not make a difference, so only one acpidump is attached. Also, this message doesn't appear on the 3.11 Ubuntu LTS kernel, so this is possibly a regression (unless the Ubuntu LTS kernel has some PCI-related patches that mainline 3.11 doesn't).
Created attachment 131731 [details]
ACPI namespace disassembly
Disassembly of the acpidump from comment #3.
Created attachment 132381 [details]
bus check events and backtrace where warning occurs
Parag collected this backtrace, which also notes the Bus Check notifications received.
The "Bus check in hotplug_event()" message is printed before we call acpiphp_check_bridge(), so it looks like this is related to RP04.
Not sure how or if this is related, but recent kernels trigger the WARN_ON below on a fresh boot, which seems to be related to the PCIe bridge.
> if (pci_pcie_type(pdev) != PCI_EXP_TYPE_PCI_BRIDGE) { // this one triggers
[ 0.475907] WARNING: CPU: 2 PID: 1 at drivers/pci/search.c:46 pci_find_upstream_pcie_bridge+0xb6/0xd0()
[ 0.475908] Modules linked in:
[ 0.475910] CPU: 2 PID: 1 Comm: swapper/0 Not tainted 3.15.0-rc4+ #12
[ 0.475911] Hardware name: Hewlett-Packard HP Z230 Tower Workstation/1905, BIOS L51 v01.18 01/23/2014
[ 0.475912] 0000000000000009 ffff880445989d28 ffffffff816804a8 0000000000000000
[ 0.475914] ffff880445989d60 ffffffff810482cd ffff88044519a000 ffff8804451a0000
[ 0.475915] ffff8804451a0098 0000000000000000 0000000000000000 ffff880445989d70
[ 0.475916] Call Trace:
[ 0.475921] [<ffffffff816804a8>] dump_stack+0x45/0x56
[ 0.475924] [<ffffffff810482cd>] warn_slowpath_common+0x7d/0xa0
[ 0.475926] [<ffffffff810483aa>] warn_slowpath_null+0x1a/0x20
[ 0.475927] [<ffffffff8135bd16>] pci_find_upstream_pcie_bridge+0xb6/0xd0
[ 0.475931] [<ffffffff8157128e>] intel_iommu_add_device+0x3e/0x220
[ 0.475933] [<ffffffff81566f00>] ? bus_set_iommu+0x50/0x50
[ 0.475934] [<ffffffff81566f2a>] add_iommu_group+0x2a/0x50
[ 0.475937] [<ffffffff81422c43>] bus_for_each_dev+0x63/0xa0
[ 0.475939] [<ffffffff81566ef8>] bus_set_iommu+0x48/0x50
[ 0.475941] [<ffffffff81d691e3>] intel_iommu_init+0x547/0x590
[ 0.475944] [<ffffffff81d223ee>] ? memblock_find_dma_reserve+0x120/0x120
[ 0.475946] [<ffffffff81d22400>] pci_iommu_init+0x12/0x3c
[ 0.475949] [<ffffffff81000332>] do_one_initcall+0xf2/0x1a0
[ 0.475952] [<ffffffff81067d45>] ? parse_args+0x215/0x3e0
[ 0.475954] [<ffffffff81d1c029>] kernel_init_freeable+0x16c/0x1f1
[ 0.475955] [<ffffffff81d1b83d>] ? do_early_param+0x88/0x88
[ 0.475957] [<ffffffff81672470>] ? rest_init+0x80/0x80
[ 0.475958] [<ffffffff8167247e>] kernel_init+0xe/0xf0
[ 0.475960] [<ffffffff8168ffac>] ret_from_fork+0x7c/0xb0
[ 0.475962] [<ffffffff81672470>] ? rest_init+0x80/0x80
[ 0.475965] ---[ end trace 86136cb63c22d87f ]---
I think this WARN_ON is a different problem. Alex Williamson <alex.williamson@redhat.com> is working on a series of patches that should fix it. Here's a pointer (I think he might have updated it a little bit since then):
http://lkml.kernel.org/r/20140501160128.17512.23609.stgit@bling.home
I'm experiencing random oopses and userspace crashes after a suspend-resume cycle. I'm commenting on this bug because in dmesg I'm seeing the same BAR-related error messages after each suspend-resume cycle:
[ 36.324051] PM: Finishing wakeup.
[ 36.324104] r8169 0000:02:00.0: no hotplug settings from platform
[ 36.324213] pci_bus 0000:04: Allocating resources
[ 36.324339] pci 0000:03:00.0: bridge window [io 0x1000-0x0fff] to [bus 04] add_size 1000
[ 36.324343] pci 0000:03:00.0: bridge window [mem 0x00100000-0x000fffff 64bit pref] to [bus 04] add_size 200000
[ 36.324344] pci 0000:03:00.0: bridge window [mem 0x00100000-0x000fffff] to [bus 04] add_size 200000
[ 36.324347] pci 0000:03:00.0: res[14]=[mem 0x00100000-0x000fffff] get_res_add_size add_size 200000
[ 36.324349] pci 0000:03:00.0: res[15]=[mem 0x00100000-0x000fffff 64bit pref] get_res_add_size add_size 200000
[ 36.324350] pci 0000:03:00.0: res[13]=[io 0x1000-0x0fff] get_res_add_size add_size 1000
[ 36.324353] pci 0000:03:00.0: BAR 14: can't assign mem (size 0x200000)
[ 36.324825] pci 0000:03:00.0: BAR 15: can't assign mem pref (size 0x200000)
[ 36.324829] pci 0000:03:00.0: BAR 13: can't assign io (size 0x1000)
[ 36.324832] pci 0000:03:00.0: BAR 14: can't assign mem (size 0x200000)
[ 36.324833] pci 0000:03:00.0: BAR 15: can't assign mem pref (size 0x200000)
[ 36.324835] pci 0000:03:00.0: BAR 13: can't assign io (size 0x1000)
I'm seeing the same symptoms on two machines (with basically the same hardware). An example oops is as follows:
BUG: Bad page state in process khugepaged pfn:33e5f
page:ffffea0000cf97c0 count:67108864 mapcount:0 mapping: (null) index:0x0
page flags: 0x1ffff0000000000()
Modules linked in: uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_core snd_usb_audio videodev snd_usbmidi_lib usbmon joydev hid_microsoft pl2303 usbserial snd_hda_codec_realtek snd_hda_codec_hdmi snd_hda_intel snd_hda_codec snd_hwdep hid_generic snd_pcm snd_page_alloc x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel snd_seq_midi snd_seq_midi_event kvm bnep crct10dif_pclmul crc32_pclmul rfcomm ghash_clmulni_intel snd_rawmidi usbhid aesni_intel usb_storage hid bluetooth aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd snd_seq serio_raw snd_seq_device lpc_ich snd_timer i915 parport_pc snd ppdev mac_hid video drm_kms_helper mei_me mei lp parport shpchp soundcore drm i2c_algo_bit binfmt_misc nfsd auth_rpcgss nfs_acl nfs lockd sunrpc fscache nls_iso8859_1 raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq raid1 psmouse ahci libahci r8169 raid0 mii multipath linear
CPU: 0 PID: 51 Comm: khugepaged Not tainted 3.13.0-61-generic #100-Ubuntu
Hardware name: Gigabyte Technology Co., Ltd. H81M-HD3/H81M-HD3, BIOS FA 12/01/2014
0000000000000fee ffff880405135b68 ffffffff81723700 ffffea0000cf97c0
ffff880405135b80 ffffffff8171df66 0000000000000002 ffff880405135c60
ffffffff81158cfa ffff88041f5f7e08 0000000200000000 0000000100000000
Call Trace:
[<ffffffff81723700>] dump_stack+0x45/0x56
[<ffffffff8171df66>] bad_page.part.61+0xcf/0xe8
[<ffffffff81158cfa>] get_page_from_freelist+0x85a/0x930
[<ffffffff81158f54>] __alloc_pages_nodemask+0x184/0xb80
[<ffffffff8115d010>] ? release_pages+0x80/0x210
[<ffffffff811ab1cf>] khugepaged_scan_mm_slot+0x3bf/0xc90
[<ffffffff811abcef>] khugepaged+0x24f/0x460
[<ffffffff810ab390>] ? prepare_to_wait_event+0x100/0x100
[<ffffffff811abaa0>] ? khugepaged_scan_mm_slot+0xc90/0xc90
[<ffffffff8108b7d2>] kthread+0xd2/0xf0
[<ffffffff8108b700>] ? kthread_create_on_node+0x1c0/0x1c0
[<ffffffff81734228>] ret_from_fork+0x58/0x90
[<ffffffff8108b700>] ? kthread_create_on_node+0x1c0/0x1c0
On one of the machines, everything was working fine until the motherboard was replaced with the same model but a newer BIOS:
old: Gigabyte Technology Co., Ltd. H81M-HD3/H81M-HD3, BIOS F3 01/20/2014 (works ok)
new: Gigabyte Technology Co., Ltd. H81M-HD3/H81M-HD3, BIOS FA 12/01/2014 (shows the above errors, and crashes randomly after suspend-resume)
I can see in the logs for the older motherboard that the BARs were assigned correctly:
Aug 10 19:52:02 andrzej kernel: [32251.585921] pci 0000:03:00.0: res[14]=[mem 0x00100000-0x000fffff] get_res_add_size add_size 200000
Aug 10 19:52:02 andrzej kernel: [32251.585922] pci 0000:03:00.0: res[15]=[mem 0x00100000-0x000fffff 64bit pref] get_res_add_size add_size 200000
Aug 10 19:52:02 andrzej kernel: [32251.585923] pci 0000:03:00.0: res[13]=[io 0x1000-0x0fff] get_res_add_size add_size 1000
Aug 10 19:52:02 andrzej kernel: [32251.585926] pci 0000:03:00.0: BAR 14: assigned [mem 0xdfa00000-0xdfbfffff]
Aug 10 19:52:02 andrzej kernel: [32251.585928] pci 0000:03:00.0: BAR 15: assigned [mem 0xdfc00000-0xdfdfffff 64bit pref]
Aug 10 19:52:02 andrzej kernel: [32251.585931] pci 0000:03:00.0: BAR 13: assigned [io 0x2000-0x2fff]
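A note for anyone comparing the two logs: the ranges printed before assignment, such as [mem 0x00100000-0x000fffff], end one byte below where they start, which reads as an empty (not yet assigned) window, while add_size is the amount being requested. Below is a minimal Python sketch of that arithmetic, assuming the printed ranges are inclusive. The regex and the window_size helper are hypothetical names for illustration, not kernel code.

```python
import re

# Matches the range part of a resource print such as
# "[mem 0x00100000-0x000fffff 64bit pref]" or "[io 0x2000-0x2fff]".
RANGE_RE = re.compile(r"\[(io|mem)\s+0x([0-9a-f]+)-0x([0-9a-f]+)")


def window_size(text):
    """Return (kind, size_in_bytes) for the first resource range in text."""
    m = RANGE_RE.search(text)
    if m is None:
        raise ValueError("no resource range found in: %r" % text)
    kind = m.group(1)
    start = int(m.group(2), 16)
    end = int(m.group(3), 16)
    # Assumption: ranges are inclusive, so the size is end - start + 1.
    # An end that is one below the start therefore yields a size of 0.
    return kind, end - start + 1


if __name__ == "__main__":
    samples = [
        "pci 0000:03:00.0: res[14]=[mem 0x00100000-0x000fffff] get_res_add_size add_size 200000",
        "pci 0000:03:00.0: BAR 14: assigned [mem 0xdfa00000-0xdfbfffff]",
        "pci 0000:03:00.0: BAR 13: assigned [io 0x2000-0x2fff]",
    ]
    for line in samples:
        kind, size = window_size(line)
        print("%-3s size=%#x  <- %s" % (kind, size, line))
```

Under that assumption, the windows assigned on the old board are exactly as large as the add_size values requested (0x200000 for each mem window, 0x1000 for io).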
One more random note - since I don't really understand all this PCI and BAR stuff, I made some semi-random tests.
1. Calling "echo 1 > /sys/bus/pci/rescan" immediately after booting makes some dmesg noise:
[ 23.759389] i915 0000:00:02.0: BAR 6: [??? 0x00000000 flags 0x2] has bogus alignment
[ 23.759394] pci 0000:03:00.0: PCI bridge to [bus 04]
2. Calling the same command after a suspend-resume cycle results in the following:
[ 43.307132] pci 0000:03:00.0: bridge window [io 0x1000-0x0fff] to [bus 04] add_size 1000
[ 43.307135] pci 0000:03:00.0: bridge window [mem 0x00100000-0x000fffff 64bit pref] to [bus 04] add_size 200000
[ 43.307137] pci 0000:03:00.0: bridge window [mem 0x00100000-0x000fffff] to [bus 04] add_size 200000
[ 43.307142] pci 0000:03:00.0: res[13]=[io 0x1000-0x0fff] get_res_add_size add_size 1000
[ 43.307144] pcieport 0000:00:1c.3: bridge window [io 0x1000-0x0fff] to [bus 03-04] add_size 1000
[ 43.307145] pci 0000:03:00.0: res[15]=[mem 0x00100000-0x000fffff 64bit pref] get_res_add_size add_size 200000
[ 43.307146] pcieport 0000:00:1c.3: bridge window [mem 0x00100000-0x000fffff 64bit pref] to [bus 03-04] add_size 200000
[ 43.307147] pci 0000:03:00.0: res[14]=[mem 0x00100000-0x000fffff] get_res_add_size add_size 200000
[ 43.307148] pcieport 0000:00:1c.3: bridge window [mem 0x00100000-0x000fffff] to [bus 03-04] add_size 200000
[ 43.307155] i915 0000:00:02.0: BAR 6: [??? 0x00000000 flags 0x2] has bogus alignment
[ 43.307157] pcieport 0000:00:1c.3: res[14]=[mem 0x00100000-0x000fffff] get_res_add_size add_size 200000
[ 43.307158] pcieport 0000:00:1c.3: res[15]=[mem 0x00100000-0x000fffff 64bit pref] get_res_add_size add_size 200000
[ 43.307159] pcieport 0000:00:1c.3: res[13]=[io 0x1000-0x0fff] get_res_add_size add_size 1000
[ 43.307162] pcieport 0000:00:1c.3: BAR 14: assigned [mem 0xdfa00000-0xdfbfffff]
[ 43.307164] pcieport 0000:00:1c.3: BAR 15: assigned [mem 0xdfc00000-0xdfdfffff 64bit pref]
[ 43.307167] pcieport 0000:00:1c.3: BAR 13: assigned [io 0x2000-0x2fff]
[ 43.307169] pci 0000:03:00.0: res[14]=[mem 0x00100000-0x000fffff] get_res_add_size add_size 200000
[ 43.307170] pci 0000:03:00.0: res[15]=[mem 0x00100000-0x000fffff 64bit pref] get_res_add_size add_size 200000
[ 43.307171] pci 0000:03:00.0: res[13]=[io 0x1000-0x0fff] get_res_add_size add_size 1000
>>> [ 43.307173] pci 0000:03:00.0: BAR 14: assigned [mem 0xdfa00000-0xdfbfffff]
>>> [ 43.307174] pci 0000:03:00.0: BAR 15: assigned [mem 0xdfc00000-0xdfdfffff 64bit pref]
>>> [ 43.307175] pci 0000:03:00.0: BAR 13: assigned [io 0x2000-0x2fff]
[ 43.307176] pci 0000:03:00.0: PCI bridge to [bus 04]
[ 43.307179] pci 0000:03:00.0: bridge window [io 0x2000-0x2fff]
[ 43.307186] pci 0000:03:00.0: bridge window [mem 0xdfa00000-0xdfbfffff]
[ 43.307191] pci 0000:03:00.0: bridge window [mem 0xdfc00000-0xdfdfffff 64bit pref]
3. Subsequent suspend-resume cycles work fine without mentioning BAR 13-15 errors. I don't have any conclusive results regarding system stability.
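To compare runs like the three above without eyeballing dmesg, something along these lines could tally the outcome per device and BAR. This is a rough Python sketch that assumes the message format seen in this report ("BAR N: can't assign ..." and "BAR N: assigned ..."). The bar_outcomes helper and the SAMPLE text are made up for illustration.

```python
import re
from collections import defaultdict

# Device address (domain:bus:device.function) followed by a BAR result.
# Other BAR messages (e.g. "has bogus alignment") are deliberately skipped.
BAR_RE = re.compile(
    r"(?P<dev>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"BAR (?P<bar>\d+): (?P<what>can't assign|assigned)"
)


def bar_outcomes(log_text):
    """Map (device, BAR number) to the ordered list of outcomes in log_text."""
    outcomes = defaultdict(list)
    for line in log_text.splitlines():
        m = BAR_RE.search(line)
        if m is None:
            continue
        result = "failed" if m.group("what") == "can't assign" else "assigned"
        outcomes[(m.group("dev"), int(m.group("bar")))].append(result)
    return dict(outcomes)


# A few lines copied from the logs in this report.
SAMPLE = """\
[ 36.324353] pci 0000:03:00.0: BAR 14: can't assign mem (size 0x200000)
[ 36.324829] pci 0000:03:00.0: BAR 13: can't assign io (size 0x1000)
[ 43.307155] i915 0000:00:02.0: BAR 6: [??? 0x00000000 flags 0x2] has bogus alignment
[ 43.307162] pcieport 0000:00:1c.3: BAR 14: assigned [mem 0xdfa00000-0xdfbfffff]
[ 43.307173] pci 0000:03:00.0: BAR 14: assigned [mem 0xdfa00000-0xdfbfffff]
"""

if __name__ == "__main__":
    for (dev, bar), results in sorted(bar_outcomes(SAMPLE).items()):
        print(dev, "BAR", bar, "->", ", ".join(results))
```

Feeding it the saved dmesg text from before and after the manual rescan would show which bridge windows went from failed to assigned.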