Bug 59501
Description
Alexander E. Patrakov
2013-06-10 06:30:37 UTC
Oops - please ignore the "only the CD-ROM connected to the Marvell pata controller actually worked" phrase. The laptop only listed the PCI devices in lspci. Created attachment 104111 [details]
Dmesg from 3.10-rc4
Created attachment 104131 [details]
Dmesg from 3.9.4
Created attachment 104141 [details]
Config from 3.10-rc4
Attaching just in case if there is an obvious-for-you error
Thanks for bisecting and providing details information. The root cause should be that the relative initialization order has been changed unnoticed. I have sent you a draft patch to confirm the root cause. Thanks again, Alexander! Created attachment 104311 [details]
Dmesg after the proposed patch
I have applied the proposed patch on top of 3.10-rc4. However, it does not solve the whole docking issue.
First docking now works. First undocking works, too. However, the second docking attempt results in not rescanning the bus, not reacting to the "undock" button, and the subsequent manual rescanning attempt resulted in a BUG (see the attachment).
Created attachment 104321 [details]
Dmesg after the proposed patch
Reattaching dmesg from the same kernel, but with acpiphp.debug=1
Unrelated note: the fix also broke pata_marvell again. Before the patch, after manual rescan, it found the CD-ROM, but after the patch (and in 3.9, too) it just says:
[ 56.793335] pata_marvell 0000:1a:00.0: no available native port
[ 56.793463] pata_acpi 0000:1a:00.0: no available native port
I suspect a timing issue here and will track it in a separate bug.
Created attachment 104331 [details]
Dmesg illustrating incomplete undocking
Testcase: boot docked, undock, observe incomplete undocking (not all PCI devices disapperared), rescan bus manually, get a BUG.
Created attachment 104341 [details]
The patch we are talking about
As I am going to cross-reference this patch from another bug, I have to attach it.
Created attachment 104401 [details]
patch to fix device destroying order issue
Hi Alexander, Seems there are multiple issues behind this bug. I have identified another issue. Could you please help to test the attached patch and send out new log messages for further investigation? Regards! Gerry Created attachment 104411 [details]
Dmesg after two patches applied
Unfortunately, second the patch leads to non-rescanning of the PCI bus, and to some lockdep spew :(
There's another resource allocation related issue with this bug. Bridge 0000:08:00.0 only has 1M pref MMIO, [ 56.752280] pci 0000:08:00.0: PCI bridge to [bus 0a-1d] [ 56.752294] pci 0000:08:00.0: bridge window [io 0x3000-0x5fff] [ 56.752300] pci 0000:08:00.0: bridge window [mem 0xb0000000-0xc03fffff] [ 56.752310] pci 0000:08:00.0: bridge window [mem 0xd4400000-0xd44fffff 64bit pref] But devices under this bridge requires 640M pref MMIO, so fails to allocate so much MMIO resources. [ 56.779697] pci 0000:08:00.0: BAR 13: can't assign io (size 0xb000) [ 56.779694] pci 0000:08:00.0: BAR 14: assigned [mem 0xb0000000-0xc06fffff] [ 56.779692] pci 0000:08:00.0: BAR 15: can't assign mem pref (size 0x28000000) It has the same issue for IO port resource allocation too. And this bug is a consequence of resource allocation failure. [ 56.787966] ------------[ cut here ]------------ [ 56.787971] WARNING: at drivers/pci/pci.c:130 pci_ioremap_bar+0x69/0x70() [ 56.787972] Modules linked in: pata_marvell hid_elecom hidp fuse zram(C) rfcomm bnep rtsx_pci_sdmmc iTCO_wdt mmc_core iTCO_vendor_support rtsx_pci_ms memstick coretemp kvm_intel kvm joydev pcspkr uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_core qcserial usb_wwan arc4 videodev usbserial media btusb snd_hda_codec_hdmi bluetooth snd_hda_codec_realtek iwldvm mac80211 i915 snd_hda_intel sony_laptop intel_agp iwlwifi snd_hda_codec intel_gtt i2c_algo_bit drm_kms_helper snd_hwdep cfg80211 r8169 snd_pcm drm snd_page_alloc lpc_ich mii snd_timer rfkill agpgart rtsx_pci snd mfd_core i2c_i801 sha256_ssse3 sha256_generic dm_crypt raid0 md_mod crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd xhci_hcd ehci_pci ehci_hcd dm_mirror dm_region_hash [ 56.788017] dm_log dm_mod [last unloaded: microcode] [ 56.788022] CPU: 0 PID: 4 Comm: kworker/0:0 Tainted: G C 3.10.0-rc4 #2 [ 56.788023] Hardware name: Sony Corporation VPCZ23A4R/VAIO, BIOS R1013H5 05/21/2012 [ 56.788026] Workqueue: kacpi_hotplug _handle_hotplug_event_func [ 56.788028] ffffffff81a2a0fc ffff8802540d3858 ffffffff8165aa68 ffff8802540d3898 [ 56.788031] ffffffff8103c8cb ffff8802540d3878 ffff88025378d800 ffff8802268a6000 [ 56.788034] 0000000000000000 ffff88025378b000 0000000000000001 ffff8802540d38a8 [ 56.788037] Call Trace: [ 56.788042] [<ffffffff8165aa68>] dump_stack+0x19/0x1b [ 56.788046] [<ffffffff8103c8cb>] warn_slowpath_common+0x6b/0xa0 [ 56.788050] [<ffffffff8103c915>] warn_slowpath_null+0x15/0x20 [ 56.788053] [<ffffffff813831e9>] pci_ioremap_bar+0x69/0x70 [ 56.788060] [<ffffffffa0416bd6>] azx_first_init+0x56/0x600 [snd_hda_intel] [ 56.788064] [<ffffffff813868ef>] ? pci_get_domain_bus_and_slot+0x2f/0x70 It would be better to open another bug to solve this resource allocation issue once we have solve other bugs in dock driver. What's your thoughts? That resource allocation bug already exists in this bugzilla, it's bug 56531. Feel free to add comments to it. Jiang, Tejun Heo said in bug 59581 (that also applies to my laptop) that it is also PCI-related in its hotplugging part, could you please have a look at his comment? The ACPI BIOS is not in the optimal form, it provides 6 _EJD instances for devices under the same PCI hierarchy. I think just one _EJD for the top most P2P bridge should work, that does simplify the ACPI dock driver and acpiphp driver. But we have no chance to change the firmware, right? As I don't work for Sony, I can't change the firmware. But an ACPI table override is acceptable for me as a temporary solution. If this laptop needs a quirk - so be it. A patch to test is available at: http://patchwork.kernel.org/patch/2757731/ along with (on top of) http://patchwork.kernel.org/patch/2723671/ http://patchwork.kernel.org/patch/2723721/ Can you please try it? I have already tried it, see https://lkml.org/lkml/2013/6/21/13 Oops. here is what happened: In your mail, you advised me to apply your patch on top of "[1/4] and [4/4]", which I did, and the result is https://lkml.org/lkml/2013/6/21/13 In comment #18, you advised me to apply your patch on top of [1/4] and [2/4], which I will test soon. Sorry for the confusion. (In reply to comment #16) > The ACPI BIOS is not in the optimal form, it provides 6 _EJD instances for > devices under the same PCI hierarchy. I think just one _EJD for the top most > P2P bridge should work, that does simplify the ACPI dock driver and acpiphp > driver. But we have no chance to change the firmware, right? Well, I'm not sure if the ACPI tables are the problem here. It looks like find_dock_devices() should return AE_CTRL_DEPTH when it finds an object with _EJD pointing to the docking station being added to avoid adding the children of that object to the list too. Anyway, placing _EJD like this is not against the ACPI spec as far as I can say, so it's not a convincing argument for a vendor to update the BIOS. (In reply to comment #20) > Oops. here is what happened: > > In your mail, you advised me to apply your patch on top of "[1/4] and [4/4]", > which I did, and the result is https://lkml.org/lkml/2013/6/21/13 > > In comment #18, you advised me to apply your patch on top of [1/4] and [2/4], > which I will test soon. That was my mistake. I meant [1/4] and [4/4], sorry about that. :-( (In reply to comment #20) > Oops. here is what happened: > > In your mail, you advised me to apply your patch on top of "[1/4] and [4/4]", > which I did, and the result is https://lkml.org/lkml/2013/6/21/13 OK I'll reply to this message on the list, but let's use this bug entry for further debugging. Created attachment 105671 [details] ACPI / dock / PCI: Synchronous handling of dock events for PCI devices Well, I'd promised a different patch, but I changed my mind in the meantime. This one should be more-or-less functionally equivalent to the combination of Gerry's patches at http://patchwork.kernel.org/patch/2723721/ and http://patchwork.kernel.org/patch/2723751/ , if I haven't made any obvious mistakes. So, please apply on top of http://patchwork.kernel.org/patch/2723671/ http://patchwork.kernel.org/patch/2723791/ There are a few differences that shouldn't affect the system behavior, so please try this one and let me know if it works for you. Created attachment 105681 [details]
ACPI / dock / PCI: Synchronous handling of dock events for PCI devices
The previous version was somewhat racy, sorry about that.
Please try this one instead.
The patch results in a complete lockup in the following situation: boot undocked, dock, undock. I will capture the dmesg over netconsole later today and attach here. I'm on vocation from 06/22-06/25, probably no internet access so expect no response from me, sorry for the inconvenience. (In reply to comment #26) > The patch results in a complete lockup in the following situation: boot > undocked, dock, undock. I will capture the dmesg over netconsole later today > and attach here. Thanks! What happens if you boot docked and then undock? Does this work? If so, does a subsequent re-dock work? Created attachment 105721 [details]
ACPI / dock / PCI: Synchronous handling of dock events for PCI devices
Well, I did an obvious mistake, sorry.
With the previous version applied register_hotplug_dock_device() always returns an error code, but it must return 0 if successful.
I don't think that that's responsible for your lockup, however ... Created attachment 105731 [details]
Initially-docked dmesg
With linux-3.10-rc5 and the three patches you recommended, and with radeon blacklisted, here is the dmesg in the initially-docked case. The only problem is that the blue LED on the dock cable does not turn off until I kill all processes that use the radeon audio device in the dock station.
Will reboot now and test the initially-undocked case.
Created attachment 105741 [details]
Initially-undocked dmesg
In the initially-undocked case, with the new version of your patch and with radeon blacklisted, I could not reproduce the lockup. So, if everyone agrees with the three patches, let's consider the bug fixed.
Great! I'll make a series of the three patches and send it to the list, then. Can you please test one more thing for me in the meantime? Sure, please tell me what I need to test. Also feel free to add my Jabber account (same as email) to your contact list. Oops, given that Google has a strange position on XMPP federation, here is my other XMPP account: aep1980@talkr.im Created attachment 105751 [details]
ACPI / dock: Different list walk ordering in dock and undock cases
Thanks! I can use g+ too I suppose. :-)
Please apply this patch on top of the 3 patches you've just tested and check if it makes any difference on your system, for better or for worse.
It actually does three different things and should be split into three separate patches, but I think all of these changes make sense and I'd like to make them in 3.11 if they don't break things for you.
Created attachment 105761 [details]
Dmesg from initially-docked case
In the initially-docked case, the extra patch doesn't seem to break anything
Created attachment 105771 [details]
Dmesg from initially-undocked case
No problems in the initially-undocked case, either. A suspend-resume pair in the dmesg is a workaround for an unrelated i965 bug.
Thanks a lot for testing! Fixed by: 21a3101 ACPI / dock / PCI: Synchronous handling of dock events for PCI devices d66ecb7 PCI / ACPI: Use boot-time resource allocation rules during hotplug 94add0f ACPI / dock: Initialize ACPI dock subsystem upfront |