Bug 59501

Summary: Sony VAIO VPCZ23A4R: PCI bus is not rescanned on docking/undocking (worked in 3.9)
Product: ACPI Reporter: Alexander E. Patrakov (patrakov)
Component: Config-HotplugAssignee: Rafael J. Wysocki (rjw)
Status: CLOSED CODE_FIX    
Severity: normal CC: aaron.lu, bjorn, liuj97, rjw
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 3.10-rc4 Subsystem:
Regression: Yes Bisected commit-id:
Attachments: Dmesg from 3.10-rc4
Dmesg from 3.9.4
Config from 3.10-rc4
Dmesg after the proposed patch
Dmesg after the proposed patch
Dmesg illustrating incomplete undocking
The patch we are talking about
patch to fix device destroying order issue
Dmesg after two patches applied
ACPI / dock / PCI: Synchronous handling of dock events for PCI devices
ACPI / dock / PCI: Synchronous handling of dock events for PCI devices
ACPI / dock / PCI: Synchronous handling of dock events for PCI devices
Initially-docked dmesg
Initially-undocked dmesg
ACPI / dock: Different list walk ordering in dock and undock cases
Dmesg from initially-docked case
Dmesg from initially-undocked case

Description Alexander E. Patrakov 2013-06-10 06:30:37 UTC
The following commit (identified by git bisect) introduced a regression on my Sony VAIO VPCZ23A4R laptop:

commit 3b63aaa70e1ccc4b66d60acc78da09700706a703
Author: Jiang Liu <liuj97@gmail.com>
Date:   Fri Apr 12 05:44:26 2013 +0000

    PCI: acpiphp: Do not use ACPI PCI subdriver mechanism


The regression manifests itself as follows: boot the laptop without the dock station, then connect it, and the laptop does not see the PCI devices in the dock station. Before that commit, it did, but only the CD-ROM connected to the Marvell pata controller actually worked (see bug 56531).

A simple echo 1 > /sys/bus/pci/rescan is sufficient for the laptop to see the devices on the bus. Interestingly, after this commit with manual PCI bus rescan the CD-ROM started working (before that pata_marvell would only complain about "no native port available").

I will attach dmesgs both from the working and non-working kernels. Acpidump output is available from bug 55611.
Comment 1 Alexander E. Patrakov 2013-06-10 06:32:35 UTC
Oops - please ignore the "only the CD-ROM connected to the Marvell pata controller actually worked" phrase. The laptop only listed the PCI devices in lspci.
Comment 2 Alexander E. Patrakov 2013-06-10 06:37:55 UTC
Created attachment 104111 [details]
Dmesg from 3.10-rc4
Comment 3 Alexander E. Patrakov 2013-06-10 06:41:49 UTC
Created attachment 104131 [details]
Dmesg from 3.9.4
Comment 4 Alexander E. Patrakov 2013-06-10 06:42:51 UTC
Created attachment 104141 [details]
Config from 3.10-rc4

Attaching just in case if there is an obvious-for-you error
Comment 5 Jiang Liu 2013-06-10 17:50:19 UTC
Thanks for bisecting and providing details information. The root cause should be that the relative initialization order has been changed unnoticed. I have sent you a draft patch to confirm the root cause. Thanks again, Alexander!
Comment 6 Alexander E. Patrakov 2013-06-11 03:59:52 UTC
Created attachment 104311 [details]
Dmesg after the proposed patch

I have applied the proposed patch on top of 3.10-rc4. However, it does not solve the whole docking issue.

First docking now works. First undocking works, too. However, the second docking attempt results in not rescanning the bus, not reacting to the "undock" button, and the subsequent manual rescanning attempt resulted in a BUG (see the attachment).
Comment 7 Alexander E. Patrakov 2013-06-11 04:09:44 UTC
Created attachment 104321 [details]
Dmesg after the proposed patch

Reattaching dmesg from the same kernel, but with acpiphp.debug=1

Unrelated note: the fix also broke pata_marvell again. Before the patch, after manual rescan, it found the CD-ROM, but after the patch (and in 3.9, too) it just says:

[   56.793335] pata_marvell 0000:1a:00.0: no available native port
[   56.793463] pata_acpi 0000:1a:00.0: no available native port

I suspect a timing issue here and will track it in a separate bug.
Comment 8 Alexander E. Patrakov 2013-06-11 04:23:43 UTC
Created attachment 104331 [details]
Dmesg illustrating incomplete undocking

Testcase: boot docked, undock, observe incomplete undocking (not all PCI devices disapperared), rescan bus manually, get a BUG.
Comment 9 Alexander E. Patrakov 2013-06-11 04:37:03 UTC
Created attachment 104341 [details]
The patch we are talking about

As I am going to cross-reference this patch from another bug, I have to attach it.
Comment 10 Jiang Liu 2013-06-11 11:49:39 UTC
Created attachment 104401 [details]
patch to fix device destroying order issue
Comment 11 Jiang Liu 2013-06-11 11:51:24 UTC
Hi Alexander,
    Seems there are multiple issues behind this bug. I have identified another issue. Could you please help to test the attached patch and send out new log messages for further investigation?
Regards!
Gerry
Comment 12 Alexander E. Patrakov 2013-06-11 12:08:34 UTC
Created attachment 104411 [details]
Dmesg after two patches applied

Unfortunately, second the patch leads to non-rescanning of the PCI bus, and to some lockdep spew :(
Comment 13 Jiang Liu 2013-06-11 12:31:43 UTC
There's another resource allocation related issue with this bug.
Bridge 0000:08:00.0 only has 1M pref MMIO,
[   56.752280] pci 0000:08:00.0: PCI bridge to [bus 0a-1d]
[   56.752294] pci 0000:08:00.0:   bridge window [io  0x3000-0x5fff]
[   56.752300] pci 0000:08:00.0:   bridge window [mem 0xb0000000-0xc03fffff]
[   56.752310] pci 0000:08:00.0:   bridge window [mem 0xd4400000-0xd44fffff 64bit pref]

But devices under this bridge requires 640M pref MMIO, so fails to allocate so much MMIO resources.
[   56.779697] pci 0000:08:00.0: BAR 13: can't assign io (size 0xb000)
[   56.779694] pci 0000:08:00.0: BAR 14: assigned [mem 0xb0000000-0xc06fffff]
[   56.779692] pci 0000:08:00.0: BAR 15: can't assign mem pref (size 0x28000000)

It has the same issue for IO port resource allocation too.

And this bug is a consequence of resource allocation failure.
[   56.787966] ------------[ cut here ]------------
[   56.787971] WARNING: at drivers/pci/pci.c:130 pci_ioremap_bar+0x69/0x70()
[   56.787972] Modules linked in: pata_marvell hid_elecom hidp fuse zram(C) rfcomm bnep rtsx_pci_sdmmc iTCO_wdt mmc_core iTCO_vendor_support rtsx_pci_ms memstick coretemp kvm_intel kvm joydev pcspkr uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_core qcserial usb_wwan arc4 videodev usbserial media btusb snd_hda_codec_hdmi bluetooth snd_hda_codec_realtek iwldvm mac80211 i915 snd_hda_intel sony_laptop intel_agp iwlwifi snd_hda_codec intel_gtt i2c_algo_bit drm_kms_helper snd_hwdep cfg80211 r8169 snd_pcm drm snd_page_alloc lpc_ich mii snd_timer rfkill agpgart rtsx_pci snd mfd_core i2c_i801 sha256_ssse3 sha256_generic dm_crypt raid0 md_mod crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd xhci_hcd ehci_pci ehci_hcd dm_mirror dm_region_hash
[   56.788017]  dm_log dm_mod [last unloaded: microcode]
[   56.788022] CPU: 0 PID: 4 Comm: kworker/0:0 Tainted: G         C   3.10.0-rc4 #2
[   56.788023] Hardware name: Sony Corporation VPCZ23A4R/VAIO, BIOS R1013H5 05/21/2012
[   56.788026] Workqueue: kacpi_hotplug _handle_hotplug_event_func
[   56.788028]  ffffffff81a2a0fc ffff8802540d3858 ffffffff8165aa68 ffff8802540d3898
[   56.788031]  ffffffff8103c8cb ffff8802540d3878 ffff88025378d800 ffff8802268a6000
[   56.788034]  0000000000000000 ffff88025378b000 0000000000000001 ffff8802540d38a8
[   56.788037] Call Trace:
[   56.788042]  [<ffffffff8165aa68>] dump_stack+0x19/0x1b
[   56.788046]  [<ffffffff8103c8cb>] warn_slowpath_common+0x6b/0xa0
[   56.788050]  [<ffffffff8103c915>] warn_slowpath_null+0x15/0x20
[   56.788053]  [<ffffffff813831e9>] pci_ioremap_bar+0x69/0x70
[   56.788060]  [<ffffffffa0416bd6>] azx_first_init+0x56/0x600 [snd_hda_intel]
[   56.788064]  [<ffffffff813868ef>] ? pci_get_domain_bus_and_slot+0x2f/0x70


It would be better to open another bug to solve this resource allocation issue once we have solve other bugs in dock driver.

What's your thoughts?
Comment 14 Alexander E. Patrakov 2013-06-11 12:44:52 UTC
That resource allocation bug already exists in this bugzilla, it's bug 56531. Feel free to add comments to it.
Comment 15 Alexander E. Patrakov 2013-06-11 13:48:17 UTC
Jiang,

Tejun Heo said in bug 59581 (that also applies to my laptop) that it is also PCI-related in its hotplugging part, could you please have a look at his comment?
Comment 16 Jiang Liu 2013-06-11 16:08:30 UTC
The ACPI BIOS is not in the optimal form, it provides 6 _EJD instances for devices under the same PCI hierarchy. I think just one _EJD for the top most P2P bridge should work, that does simplify the ACPI dock driver and acpiphp driver. But we have no chance to change the firmware, right?
Comment 17 Alexander E. Patrakov 2013-06-11 16:36:12 UTC
As I don't work for Sony, I can't change the firmware. But an ACPI table override is acceptable for me as a temporary solution.

If this laptop needs a quirk - so be it.
Comment 18 Rafael J. Wysocki 2013-06-21 11:45:03 UTC
A patch to test is available at:

http://patchwork.kernel.org/patch/2757731/

along with (on top of)

http://patchwork.kernel.org/patch/2723671/
http://patchwork.kernel.org/patch/2723721/

Can you please try it?
Comment 19 Alexander E. Patrakov 2013-06-21 11:59:52 UTC
I have already tried it, see https://lkml.org/lkml/2013/6/21/13
Comment 20 Alexander E. Patrakov 2013-06-21 12:02:35 UTC
Oops. here is what happened:

In your mail, you advised me to apply your patch on top of "[1/4] and [4/4]", which I did, and the result is https://lkml.org/lkml/2013/6/21/13

In comment #18, you advised me to apply your patch on top of [1/4] and [2/4], which I will test soon.

Sorry for the confusion.
Comment 21 Rafael J. Wysocki 2013-06-21 12:03:22 UTC
(In reply to comment #16)
> The ACPI BIOS is not in the optimal form, it provides 6 _EJD instances for
> devices under the same PCI hierarchy. I think just one _EJD for the top most
> P2P bridge should work, that does simplify the ACPI dock driver and acpiphp
> driver. But we have no chance to change the firmware, right?

Well, I'm not sure if the ACPI tables are the problem here.

It looks like find_dock_devices() should return AE_CTRL_DEPTH when it finds an object with _EJD pointing to the docking station being added to avoid adding the children of that object to the list too.

Anyway, placing _EJD like this is not against the ACPI spec as far as I can say, so it's not a convincing argument for a vendor to update the BIOS.
Comment 22 Rafael J. Wysocki 2013-06-21 12:04:03 UTC
(In reply to comment #20)
> Oops. here is what happened:
> 
> In your mail, you advised me to apply your patch on top of "[1/4] and [4/4]",
> which I did, and the result is https://lkml.org/lkml/2013/6/21/13
> 
> In comment #18, you advised me to apply your patch on top of [1/4] and [2/4],
> which I will test soon.

That was my mistake.  I meant [1/4] and [4/4], sorry about that. :-(
Comment 23 Rafael J. Wysocki 2013-06-21 12:16:19 UTC
(In reply to comment #20)
> Oops. here is what happened:
> 
> In your mail, you advised me to apply your patch on top of "[1/4] and [4/4]",
> which I did, and the result is https://lkml.org/lkml/2013/6/21/13

OK

I'll reply to this message on the list, but let's use this bug entry for further debugging.
Comment 24 Rafael J. Wysocki 2013-06-21 22:09:21 UTC
Created attachment 105671 [details]
ACPI / dock / PCI: Synchronous handling of dock events for PCI devices

Well, I'd promised a different patch, but I changed my mind in the meantime.

This one should be more-or-less functionally equivalent to the combination of Gerry's patches at http://patchwork.kernel.org/patch/2723721/ and http://patchwork.kernel.org/patch/2723751/ , if I haven't made any obvious mistakes.

So, please apply on top of

http://patchwork.kernel.org/patch/2723671/
http://patchwork.kernel.org/patch/2723791/

There are a few differences that shouldn't affect the system behavior, so please try this one and let me know if it works for you.
Comment 25 Rafael J. Wysocki 2013-06-21 22:22:43 UTC
Created attachment 105681 [details]
ACPI / dock / PCI: Synchronous handling of dock events for PCI devices

The previous version was somewhat racy, sorry about that.

Please try this one instead.
Comment 26 Alexander E. Patrakov 2013-06-22 03:34:28 UTC
The patch results in a complete lockup in the following situation: boot undocked, dock, undock. I will capture the dmesg over netconsole later today and attach here.
Comment 27 Aaron Lu 2013-06-22 03:35:17 UTC
I'm on vocation from 06/22-06/25, probably no internet access so expect no response from me, sorry for the inconvenience.
Comment 28 Rafael J. Wysocki 2013-06-22 09:50:59 UTC
(In reply to comment #26)
> The patch results in a complete lockup in the following situation: boot
> undocked, dock, undock. I will capture the dmesg over netconsole later today
> and attach here.

Thanks!

What happens if you boot docked and then undock?  Does this work?  If so, does a subsequent re-dock work?
Comment 29 Rafael J. Wysocki 2013-06-22 10:23:44 UTC
Created attachment 105721 [details]
ACPI / dock / PCI: Synchronous handling of dock events for PCI devices

Well, I did an obvious mistake, sorry.

With the previous version applied register_hotplug_dock_device() always returns an error code, but it must return 0 if successful.
Comment 30 Rafael J. Wysocki 2013-06-22 11:06:57 UTC
I don't think that that's responsible for your lockup, however ...
Comment 31 Alexander E. Patrakov 2013-06-22 12:04:54 UTC
Created attachment 105731 [details]
Initially-docked dmesg

With linux-3.10-rc5 and the three patches you recommended, and with radeon blacklisted, here is the dmesg in the initially-docked case. The only problem is that the blue LED on the dock cable does not turn off until I kill all processes that use the radeon audio device in the dock station.

Will reboot now and test the initially-undocked case.
Comment 32 Alexander E. Patrakov 2013-06-22 12:16:24 UTC
Created attachment 105741 [details]
Initially-undocked dmesg

In the initially-undocked case, with the new version of your patch and with radeon blacklisted, I could not reproduce the lockup. So, if everyone agrees with the three patches, let's consider the bug fixed.
Comment 33 Rafael J. Wysocki 2013-06-22 12:58:36 UTC
Great!

I'll make a series of the three patches and send it to the list, then.

Can you please test one more thing for me in the meantime?
Comment 34 Alexander E. Patrakov 2013-06-22 13:04:54 UTC
Sure, please tell me what I need to test. Also feel free to add my Jabber account (same as email) to your contact list.
Comment 35 Alexander E. Patrakov 2013-06-22 13:07:45 UTC
Oops, given that Google has a strange position on XMPP federation, here is my other XMPP account: aep1980@talkr.im
Comment 36 Rafael J. Wysocki 2013-06-22 13:28:16 UTC
Created attachment 105751 [details]
ACPI / dock: Different list walk ordering in dock and undock cases

Thanks!  I can use g+ too I suppose. :-)

Please apply this patch on top of the 3 patches you've just tested and check if it makes any difference on your system, for better or for worse.

It actually does three different things and should be split into three separate patches, but I think all of these changes make sense and I'd like to make them in 3.11 if they don't break things for you.
Comment 37 Alexander E. Patrakov 2013-06-22 13:56:37 UTC
Created attachment 105761 [details]
Dmesg from initially-docked case

In the initially-docked case, the extra patch doesn't seem to break anything
Comment 38 Alexander E. Patrakov 2013-06-22 14:04:22 UTC
Created attachment 105771 [details]
Dmesg from initially-undocked case

No problems in the initially-undocked case, either. A suspend-resume pair in the dmesg is a workaround for an unrelated i965 bug.
Comment 39 Rafael J. Wysocki 2013-06-22 14:13:22 UTC
Thanks a lot for testing!
Comment 40 Rafael J. Wysocki 2013-06-27 20:21:51 UTC
Fixed by:

21a3101 ACPI / dock / PCI: Synchronous handling of dock events for PCI devices
d66ecb7 PCI / ACPI: Use boot-time resource allocation rules during hotplug
94add0f ACPI / dock: Initialize ACPI dock subsystem upfront