Bug 9961 - Oops with docking and undocking, device after docking not connected
Summary: Oops with docking and undocking, device after docking not connected
Status: RESOLVED CODE_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: PCI
Hardware: All Linux
Importance: P1 high
Assignee: Gary Hade
URL:
Keywords:
Duplicates: 10047
Depends on:
Blocks:
 
Reported: 2008-02-13 13:55 UTC by Pavel Kysilka
Modified: 2008-03-28 16:07 UTC (History)

See Also:
Kernel Version: 2.6.24, 2.6.25-rc1-00075-g10270d4
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
dmesg - oops docking (27.57 KB, application/octet-stream)
2008-02-13 13:57 UTC, Pavel Kysilka
dmesg - oops - undocking (30.24 KB, application/octet-stream)
2008-02-13 13:58 UTC, Pavel Kysilka
Warnings on docking (11.80 KB, text/plain)
2008-02-20 13:28 UTC, Paul Martin
possible fix (917 bytes, patch)
2008-03-20 13:22 UTC, Gary Hade

Description Pavel Kysilka 2008-02-13 13:55:09 UTC
Latest working kernel version: ~ 2.6.23-rc1 
Earliest failing kernel version: ~ 2.6.24-20937-ga4ffc0a
Distribution: debian testing
Hardware Environment: IBM ThinkPad A21m + ThinkPad Dock II
Software Environment:
Problem Description: After docking and undocking I get the oops shown in the attachments.
After docking, the devices on the dock are not displayed in the lspci tree.
Steps to reproduce: dock or undock the laptop


dmesg - kernel log messages

acpiphp_glue: bus exists... trim
acpiphp_glue: acpi_bus_trim return 0
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.DOCK._PRT]
BUG: unable to handle kernel NULL pointer dereference at 00000000
IP: [<c0233eb4>] pdev_sort_resources+0x94/0x140
*pde = 00000000 
Oops: 0000 [#1] 
Modules linked in: thermal fan processor pcmcia firmware_class irtty_sir sir_dev nsc_ircc irda crc_ccitt pcspkr uhci_hcd snd_cs46xx snd_seq_midi snd_rawmidi snd_ac97_codec ac97_bus e100 yenta_socket rsrc_nonstatic pcmcia_core button ac battery acpiphp dock pci_hotplug thinkpad_acpi hwmon rtc

Pid: 61, comm: kacpi_notify Not tainted (2.6.25-rc1-00075-g10270d4 #137)
EIP: 0060:[<c0233eb4>] EFLAGS: 00010246 CPU: 0
EIP is at pdev_sort_resources+0x94/0x140
EAX: 00000000 EBX: c79238c4 ECX: 00000000 EDX: 00000fff
ESI: 00000000 EDI: 00000007 EBP: c78f5e80 ESP: c78f5e20
 DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068
Process kacpi_notify (pid: 61, ti=c78f4000 task=c78e9aa0 task.ti=c78f4000)
Stack: c78119e0 00000001 c0246025 01000000 000003e5 c03f9bec c045cd08 00000000 
       c79238e0 c78f5e80 c7923800 c7923800 c78e0814 c78e0800 c78f5e80 c03d235d 
       c0258efb c78cd000 00000000 c78f5e98 c026284d 00000008 c7831734 c7923808 
Call Trace:
 [<c0246025>] acpi_os_signal_semaphore+0x42/0x54
 [<c03d235d>] pci_bus_assign_resources+0x3d/0x3c0
 [<c0258efb>] acpi_get_parent+0x63/0x6c
 [<c026284d>] acpi_bus_scan+0x6c/0x17b
 [<c8876ce3>] acpiphp_enable_slot+0x203/0x490 [acpiphp]
 [<c88770f4>] handle_hotplug_event_func+0x94/0x1a0 [acpiphp]
 [<c0256ead>] acpi_evaluate_object+0x225/0x230
 [<c03d8c0c>] __mutex_lock_slowpath+0xcc/0x180
 [<c8877060>] handle_hotplug_event_func+0x0/0x1a0 [acpiphp]
 [<c883621f>] hotplug_dock_devices+0x35/0xe1 [dock]
 [<c0246076>] acpi_os_execute_deferred+0x0/0x25
 [<c0246076>] acpi_os_execute_deferred+0x0/0x25
 [<c8836531>] dock_notify+0x72/0xb8 [dock]
 [<c024c509>] acpi_ev_notify_dispatch+0x49/0x52
 [<c0246093>] acpi_os_execute_deferred+0x1d/0x25
 [<c012694b>] run_workqueue+0x8b/0x110
 [<c01299a0>] autoremove_wake_function+0x0/0x40
 [<c01270ba>] worker_thread+0x7a/0xd0
 [<c01299a0>] autoremove_wake_function+0x0/0x40
 [<c0127040>] worker_thread+0x0/0xd0
 [<c0129622>] kthread+0x42/0x70
 [<c01295e0>] kthread+0x0/0x70
 [<c010428b>] kernel_thread_helper+0x7/0x1c
 =======================
Code: 00 00 8b 6c 24 24 40 83 ff 06 0f 4e c8 89 6c 24 1c eb 14 8d 74 26 00 8b 42 04 8b 2a 40 29 e8 39 c8 72 35 89 74 24 1c 8b 44 24 1c <8b> 30 31 c0 85 f6 74 ec 8b 56 04 8b 46 08 89 d5 05 74 01 00 00 
EIP: [<c0233eb4>] pdev_sort_resources+0x94/0x140 SS:ESP 0068:c78f5e20
---[ end trace 14b1defe2e8ac1cc ]---
usb 1-1: new full speed USB device using uhci_hcd and address 3
usb 1-1: new full speed USB device using uhci_hcd and address 4
Comment 1 Pavel Kysilka 2008-02-13 13:57:00 UTC
Created attachment 14799
dmesg - oops docking

Boot the laptop on battery power, then dock it.
Comment 2 Pavel Kysilka 2008-02-13 13:58:02 UTC
Created attachment 14800
dmesg - oops - undocking

Boot the laptop on AC power while docked. Push the docking button on the dock, then undock.
Comment 3 ykzhao 2008-02-19 16:54:26 UTC
Will you please attach the output of acpidump?
Thanks.
Comment 4 ykzhao 2008-02-19 16:56:43 UTC
*** Bug 10047 has been marked as a duplicate of this bug. ***
Comment 5 Paul Martin 2008-02-20 10:50:06 UTC
By using git-bisect, I've found the culprit. Reverting it cures the oops.

commit 8fa5913d54f3b1e09948e6a0db34da887e05ff1f
Author: Gary Hade <garyhade@us.ibm.com>
Date:   Wed Oct 3 15:55:51 2007 -0700

    PCI: remove transparent bridge sizing
    
    Remove transparent bridge sizing.
    
    Due to code in pci_read_bridge_bases() [drivers/pci/probe.c] the child
    bus of a transparent bridge already has access to the parent bus
    resources so transparent bridge sizing appears unnecessary.  The bridge
    sizing includes alignment and granularity adjustments that can cause
    significantly more memory to be reserved from the parant bus than
    required by devices on the child bus and allotted by _CRS.
    
    Signed-off-by: Gary Hade <gary.hade@us.ibm.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
index 5e5191e..401e03c 100644
--- a/drivers/pci/setup-bus.c
+++ b/drivers/pci/setup-bus.c
@@ -472,7 +472,12 @@ void pci_bus_size_bridges(struct pci_bus *bus)
                break;
 
        case PCI_CLASS_BRIDGE_PCI:
+               /* don't size subtractive decoding (transparent)
+                * PCI-to-PCI bridges */
+               if (bus->self->transparent)
+                       break;
                pci_bridge_check_ranges(bus);
+               /* fall through */
        default:
                pbus_size_io(bus);
                /* If the bridge supports prefetchable range, size it
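
For reference, the control-flow effect of the hunk above can be sketched in user-space C. This is only an illustrative model, not the kernel source: `steps_run`, the `bridge_class` enum and the `STEP_*` flags are hypothetical names, and the real sizing calls are reduced to flag bits. It shows that a transparent PCI-to-PCI bridge now leaves the switch before either the range check or the window sizing runs.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the bridge classes handled by the switch. */
enum bridge_class { BRIDGE_PCI, BRIDGE_OTHER };

/* Hypothetical flag bits recording which steps of the switch were reached. */
#define STEP_CHECK_RANGES 0x1u /* models pci_bridge_check_ranges(bus) */
#define STEP_SIZE_WINDOWS 0x2u /* models the sizing in the default case */

/* Returns the set of steps the patched switch would run for one bridge. */
static unsigned steps_run(enum bridge_class cls, bool transparent)
{
    unsigned steps = 0;

    switch (cls) {
    case BRIDGE_PCI:
        /* The added check: transparent bridges are not sized at all. */
        if (transparent)
            break;
        steps |= STEP_CHECK_RANGES;
        /* fall through */
    default:
        steps |= STEP_SIZE_WINDOWS;
        break;
    }
    return steps;
}
```

Under this model a transparent PCI bridge returns 0 (nothing is sized), a non-transparent PCI bridge runs both steps, and any other bridge class reaches only the default sizing step.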
Comment 6 Paul Martin 2008-02-20 13:28:45 UTC
Created attachment 14917
Warnings on docking

The above works for 2.6.24, but 2.6.25-rc2 (git head) gives many warnings about double registering objects. So, it looks like something needs to be prepared, but not registered, whereas before it was being registered without being prepared, hence the null pointer dereference.
Comment 7 Gary Hade 2008-02-20 17:52:26 UTC
Interesting.  Do you think that my transparent bridge sizing
removal change may have exposed a problem that is absent in
2.6.24 and later?
Comment 8 Zdenek Kabelac 2008-02-21 04:58:21 UTC
Maybe something like:

if (!bus || !bus->self || bus->self->transparent)
  break;

or

if (bus && bus->self && bus->self->transparent)
  break;

would at least help a little bit with the NULL oops?
Though there will be a bigger problem, I guess...
Comment 9 Paul Martin 2008-02-21 05:14:18 UTC
It's not there that it crashes. Because the bits below don't get executed, it crashes elsewhere due to something not being initialised.

Also, if you boot up with the dock connected, DMA (and possibly IRQs) from the PCI card in the dock don't seem to work properly. This is a regression from 2.6.23. I'll try to work out roughly where things start to go wrong, but it'll take a while.
Comment 10 Pavel Kysilka 2008-02-21 07:11:22 UTC
Hi,

here is the ACPI table of my laptop.

http://acpi.sourceforge.net/dsdt/view.php?id=689
Comment 11 Paul Martin 2008-02-22 03:31:40 UTC
My laptop's acpidump (T30) is already attached to bug #10047.
Comment 12 Gary Hade 2008-02-25 17:25:38 UTC
Comment #6 seems to indicate that my transparent bridge sizing patch
is no longer the prime suspect but if I can reproduce the problem 
with my T41p I will try to figure out why the Oops disappears when
my change is removed.  Unfortunately, I am having trouble locating
a Dock II.
Comment 13 Gary Hade 2008-03-05 11:05:23 UTC
My attempts to find a Dock II have been unsuccessful so far but we are
still searching...
Comment 14 Kristen 2008-03-13 10:39:35 UTC
Gary has found a dock, and has reproduced the problem.  He's looking into it now.
Comment 15 Gary Hade 2008-03-13 12:53:40 UTC
Yes, it was a bit of a challenge but someone finally located a
Dock II for me to use.  I received it yesterday.

After reproducing the problem on my T41p with 2.6.25-rc2, 2.6.24,
and 2.6.24.3 I tried 2.6.24-rc1 where the above mentioned "remove
transparent bridge sizing" change had entered mainline.  The problem
reproduced again with 2.6.24-rc1 but disappeared after reverting only
the "remove transparent bridge sizing" change.

I may have misunderstood Paul's Comment #6.  When he said "The
above works for 2.6.24" I thought he meant that 2.6.24 worked 
without reverting the "remove transparent bridge sizing" change.
He must have meant that 2.6.24 worked after reverting the change.
Paul, is this what you intended or is there a chance that
2.6.24 is behaving differently on your T30 than it is on my T41p?

So, it appears to me that the problem was introduced or exposed
by my change. <sigh>  I am working to find a solution ASAP.
Comment 16 Paul Martin 2008-03-13 14:29:09 UTC
2.6.23 works. 2.6.24 works after reverting the change. Isn't English such an ambiguous language? <grin>

2.6.25-rc3 and later do not see the dock at all as a dock. You can use it if you boot up whilst the laptop is docked, but then you cannot undock other than by rebooting. Also, in 2.6.25-rc2 and later some PCI bus interrupts and possibly DMA are not being correctly handled, resulting in the V4L2 stack consuming almost all the CPU.
Comment 17 Gary Hade 2008-03-13 15:45:01 UTC
(In reply to comment #16)
> 2.6.23 works. 2.6.24 works after reverting the change. Isn't English such an
> ambiguous language? <grin>

Especially when idiots like me are trying to understand it. :)
Thanks, I think we're on the same page now.
Comment 18 Paul Martin 2008-03-18 14:35:14 UTC
Vanilla 2.6.25-rc6, with PCI debug switched on:

ACPI: \_SB_.PCI0.PCI1.DOCK - docking
acpiphp_glue: handle_hotplug_event_func: Bus check notify on \_SB_.PCI0.PCI1.DOCK
PCI: Found 0000:02:03.0 [104c/ac22] 000604 01
PCI: Scanning behind PCI bridge 0000:02:03.0, config 000000, pass 0
PCI: Scanning behind PCI bridge 0000:02:03.0, config 000000, pass 1
PCI: Scanning bus 0000:08
PCI: Found 0000:08:01.0 [1095/0648] 000101 00
PCI: Found 0000:08:02.0 [104c/ac51] 000607 02
PCI: Found 0000:08:02.1 [104c/ac51] 000607 02
PCI: Fixups for bus 0000:08
PCI: Transparent bridge - 0000:02:03.0
PCI: Scanning behind PCI bridge 0000:08:02.0, config 000000, pass 0
PCI: Scanning behind PCI bridge 0000:08:02.1, config 000000, pass 0
PCI: Scanning behind PCI bridge 0000:08:02.0, config 000000, pass 1
PCI: Bus #09 (-#0c) is partially hidden behind transparent bridge #02 (-#08)
PCI: Scanning behind PCI bridge 0000:08:02.1, config 000000, pass 1
PCI: Bus #0d (-#10) is partially hidden behind transparent bridge #02 (-#08)
PCI: Bus scan for 0000:08 returning with max=10
PCI: Bus #08 (-#10) is partially hidden behind transparent bridge #02 (-#08)
acpiphp_glue: bus exists... trim
acpiphp_glue: acpi_bus_trim return 0
ath0: no IPv6 routers present
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.PCI1.DOCK._PRT]
BUG: unable to handle kernel NULL pointer dereference at 00000000
IP: [<c01dd7fe>] pdev_sort_resources+0x8a/0x116
*pde = 00000000 
Oops: 0000 [#1] PREEMPT SMP 
Modules linked in: tun aes_i586 aes_generic arc4 ecb ath5k mac80211 cfg80211 radeon drm rfcomm l2cap bluetooth ppdev lp ipv6 microcode ext3 jbd mbcache fuse dm_crypt crypto_blkcipher dm_mod cpufreq_stats speedstep_ich speedstep_lib thinkpad_acpi nvram acpiphp bay dock pcmcia firmware_class joydev battery yenta_socket rsrc_nonstatic pcmcia_core snd_intel8x0 ac snd_ac97_codec ac97_bus snd_pcm_oss snd_mixer_oss irtty_sir sir_dev snd_pcm snd_timer nsc_ircc button psmouse irda i2c_i801 snd intel_agp agpgart crc_ccitt serio_raw evdev parport_pc parport shpchp pci_hotplug iTCO_wdt soundcore snd_page_alloc rtc pcspkr xfs floppy e100 mii sg uhci_hcd usbcore sr_mod cdrom sd_mod thermal processor fan ata_piix libata scsi_mod radeonfb fb_ddc i2c_algo_bit i2c_core

Pid: 39, comm: kacpi_notify Not tainted (2.6.25-rc6 #14)
EIP: 0060:[<c01dd7fe>] EFLAGS: 00010246 CPU: 0
EIP is at pdev_sort_resources+0x8a/0x116
EAX: 00000000 EBX: 00000000 ECX: 00000000 EDX: 00000fff
ESI: f529bcc4 EDI: f74d1e70 EBP: f74d1e58 ESP: f74d1e38
 DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
Process kacpi_notify (pid: 39, ti=f74d0000 task=f74a6120 task.ti=f74d0000)
Stack: 00000000 f529bcf0 f74d1e70 f529bc00 00000007 f529bc00 f74a3414 f74a3400 
       f74d1e8c c02a9d9b f529bc08 f7422f90 f7422978 dd3bb400 00000000 f7d28488 
       f74a3400 f74d1e8c f7d28480 f7d4a348 f74a3400 f74d1ec4 f9b161ed f74a340c 
Call Trace:
 [<c02a9d9b>] ? pci_bus_assign_resources+0x59/0x341
 [<f9b161ed>] ? enable_device+0x20d/0x2d7 [acpiphp]
 [<f9b15707>] ? acpiphp_enable_slot+0x9e/0xe7 [acpiphp]
 [<f9b1585d>] ? handle_hotplug_event_func+0x63/0x101 [acpiphp]
 [<f9b15167>] ? post_dock_fixups+0x6c/0x79 [acpiphp]
 [<f9b157fa>] ? handle_hotplug_event_func+0x0/0x101 [acpiphp]
 [<f9b10168>] ? hotplug_dock_devices+0x39/0xe1 [dock]
 [<f9b10441>] ? dock_notify+0x75/0xc0 [dock]
 [<c01f8529>] ? acpi_ev_notify_dispatch+0x4f/0x5a
 [<c01f3400>] ? acpi_os_execute_deferred+0x20/0x2c
 [<c012eedd>] ? run_workqueue+0x78/0xfb
 [<c01f33e0>] ? acpi_os_execute_deferred+0x0/0x2c
 [<c012f739>] ? worker_thread+0xb6/0xc2
 [<c0131ac1>] ? autoremove_wake_function+0x0/0x30
 [<c012f683>] ? worker_thread+0x0/0xc2
 [<c01319ee>] ? kthread+0x3b/0x61
 [<c01319b3>] ? kthread+0x0/0x61
 [<c0105683>] ? kernel_thread_helper+0x7/0x10
 =======================
Code: 50 52 51 ff 75 f0 68 1e c5 34 c0 e8 e8 4c f4 ff 83 c4 1c e9 87 00 00 00 8b 7d e8 8d 42 01 83 7d f0 06 89 7d e0 0f 4e c8 8b 45 e0 <8b> 18 31 c0 85 db 74 29 8b 53 04 8b 43 08 89 d7 05 8c 01 00 00 
EIP: [<c01dd7fe>] pdev_sort_resources+0x8a/0x116 SS:ESP 0068:f74d1e38
---[ end trace b8b5e3b5c2b6062d ]---
Comment 19 Gary Hade 2008-03-19 14:05:02 UTC
I am still struggling with this.  I do not fully understand the root
cause yet but I'm betting that the hot-add of a "transparent" p2p bridge
(on Dock II) immediately below yet another "transparent" p2p bridge
(on ThinkPad), which creates a pretty complex cobweb of resource
references, has something to do with it.

    bus 00 --------------
                 |
              00:1e.0 - ThinkPad resident Intel 82801 Mobile PCI Bridge
                 |        "transparent" via pci_fixup_transparent_bridge()
                 |
    bus 02 --------------
                 |
              02:03.0 - Dock II resident Texas Instruments PCI2032 PCI
                 |      Docking Bridge
                 |         "transparent" due to programming interface == 0x01
                 |
    bus 09 --------------
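
The "programming interface == 0x01" test mentioned in the diagram can be illustrated with a small user-space sketch. It assumes the standard 24-bit PCI class code layout (base class, subclass, programming interface) and the class 0x0604 for PCI-to-PCI bridges; the function names are hypothetical and this is not the kernel's own code. Note that the 00:1e.0 bridge is instead marked transparent by the pci_fixup_transparent_bridge() quirk, which this sketch does not model.

```c
#include <stdbool.h>
#include <stdint.h>

/* Base class 0x06 (bridge), subclass 0x04 (PCI-to-PCI). */
#define CLASS_BRIDGE_PCI 0x0604u

/* class_code is the 24-bit value: base class, subclass, prog-if. */
static bool is_pci_to_pci_bridge(uint32_t class_code)
{
    return ((class_code >> 8) & 0xffffu) == CLASS_BRIDGE_PCI;
}

/*
 * A PCI-to-PCI bridge whose programming interface byte is 0x01 uses
 * subtractive decoding, i.e. it is "transparent".
 */
static bool is_transparent_bridge(uint32_t class_code)
{
    return is_pci_to_pci_bridge(class_code) &&
           (class_code & 0xffu) == 0x01u;
}
```

With this sketch, a class code of 0x060401 is reported as transparent, while 0x060400 (positive decode only) and 0x060701 (not a PCI-to-PCI bridge) are not.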

I have determined why we are not seeing the Oops during boot when
the Dock II is already attached to the laptop.  It is due to
pcibios_allocate_bus_resources() only being visited during boot.
The boot-time visit to
                  ...
   if (!r->start || !pr ||
       request_resource(pr, r) < 0) {
           printk(KERN_ERR "PCI: Cannot allocate "
                       "resource region %d "
                       "of bridge %s\n",
                       idx, pci_name(dev));
           /*
            * Something is wrong with the region.
            * Invalidate the resource to prevent
            * child resource allocations in this
            * range.
            */
           r->flags = 0;
   }
             ...
in pcibios_allocate_bus_resources() clears the Dock II resource flags
for regions 7, 8, and 9 as evidenced by the following messages which
appear both with and without transparent bridge sizing removed.
  PCI: Cannot allocate resource region 7 of bridge 0000:02:03.0
  PCI: Cannot allocate resource region 8 of bridge 0000:02:03.0
  PCI: Cannot allocate resource region 9 of bridge 0000:02:03.0

With the Dock II bridge resource flags for these regions cleared
the check:
    if (!(r->flags) || r->parent)
        continue;
in pdev_sort_resources() prevents execution from reaching
the list->next in the later:
   struct resource_list *ln = list->next;
where the NULL pointer dereference occurs in the hot-add case
after one trip through the body of the enclosing for loop.  In
the hot-add case, pcibios_allocate_bus_resources() is not visited
so the Dock II bridge resource flags are not cleared prior to
the call to pdev_sort_resources().
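
The interaction described above can be sketched as a small user-space model. It is not the kernel source: `struct res`, `struct res_list`, `enum sort_result` and `sort_one` are hypothetical names, only the flags/parent check and the list walk are modelled, and the model reports the point where the kernel would dereference NULL instead of crashing. It also assumes that a bridge window is aligned on its start address and that an unsized window has a start of 0, which is one way the loop can fail to insert before the list ends; the thread does not confirm that detail.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, heavily simplified stand-ins for the kernel structures. */
struct res {
    unsigned long start, end, flags;
    struct res *parent;
};

struct res_list {
    struct res *res;
    struct res_list *next;
};

enum sort_result { SORT_SKIPPED, SORT_INSERTED, SORT_WOULD_OOPS };

/* Models sorting one bridge window into the list that starts at head. */
static enum sort_result sort_one(struct res *r, struct res_list *head)
{
    struct res_list *list;
    unsigned long r_align;

    assert(head != NULL);

    /* Cleared flags (the boot-time case) mean the loop is never entered. */
    if (!r->flags || r->parent)
        return SORT_SKIPPED;

    r_align = r->start; /* assumption: window aligned on its start address */

    for (list = head; ; list = list->next) {
        struct res_list *ln;
        unsigned long align = 0;

        /* The kernel has no such check; list->next would oops here. */
        if (list == NULL)
            return SORT_WOULD_OOPS;

        ln = list->next;
        if (ln)
            align = ln->res->start;

        if (r_align > align) {
            struct res_list *tmp = malloc(sizeof(*tmp));

            if (!tmp)
                abort();
            tmp->res = r;
            tmp->next = ln;
            list->next = tmp;
            return SORT_INSERTED;
        }
    }
}
```

In this model a window with cleared flags is skipped, a sized window is inserted, and a window with flags set but a zero alignment walks off the end of an empty list after one trip through the loop body, which is where the hot-add oops would occur.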

If I am unable to come up with a solution for this that retains
the transparent bridge non-sizing in the next couple of days it
is likely that I will provide a patch that simply restores the
transparent bridge sizing.  I believe the resource shortage on
some of our systems that motivated the transparent bridge
non-sizing change no longer shows up when space is not allocated
by default for expansion ROMs.  A later change that removes
default allocation for expansion ROMs is already in mainline.
Comment 20 Gary Hade 2008-03-20 13:22:04 UTC
Created attachment 15357
possible fix

Pavel/Paul,
Please give this a try and let me know what you think.
Comment 21 Gary Hade 2008-03-20 13:37:57 UTC
Note that the proposed fix has the side effect of eliminating the
possibly confusing "PCI: Cannot allocate resource region ..." messages
that show up during boot when the ThinkPad is jointed to the Dock II.
These messages were seen both with and without the transparent bridge
sizing removal change.
Comment 22 Gary Hade 2008-03-20 13:41:00 UTC
(In reply to comment #21)
> Note that the proposed fix has the side effect of eliminating the
> possibly confusing "PCI: Cannot allocate resource region ..." messages
> that show up during boot when the ThinkPad is jointed to the Dock II.
                                                ^^^^^^^
                                                joined
Comment 23 Len Brown 2008-03-25 20:44:14 UTC
Gary,
Please e-mail the patch in comment #20 to greg-kh
if you have not already, as i'm not sure he scans bugzilla regularly.
On the assumption that your patch is in the right spot,
I'm moving this sighting to the drivers/pci category
while marking it RESOLVED to indicate a patch is available to test.
Comment 24 Gary Hade 2008-03-26 09:36:32 UTC
(In reply to comment #23)
> Gary,
> Please e-mail the patch in comment #20 to greg-kh
> if you have not already, as i'm not sure he scans bugzilla regularly.
> On the assumption that your patch is in the right spot,

Len, I have not sent the patch to Greg yet but had planned to
do so today.  However, I just noticed a discussion concerning
another more serious problem which is being blamed on my transparent
bridge sizing removal change:
  http://lkml.org/lkml/2008/3/26/94

> I'm moving this sighting to the drivers/pci category

Correct.

> while marking it RESOLVED to indicate a patch is available to test.

Also correct but the fix is now the patch that reverts the 
transparent bridge sizing removal change:
  http://lkml.org/lkml/2008/3/26/176
