Bug 42122

Summary: deadlock when using ipw2200
Product: Drivers Reporter: Witold Baryluk (witold.baryluk+kernel)
Component: network-wirelessAssignee: Stanislaw Gruszka (stf_xl)
Status: CLOSED CODE_FIX    
Severity: blocking CC: florian, linville, maciej.rutecki, rjw, stf_xl
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 3.1.0-rc4+ Subsystem:
Regression: Yes Bisected commit-id:
Bug Depends on:    
Bug Blocks: 40982, 7216    

Description Witold Baryluk 2011-08-31 15:41:15 UTC
After lots of hard work, i managed to intercept and write down kernel log messages.


Somewhere after enabling my encrypted LVM, mounting root, something in initrd is loading ipw2200 early in boot process. (ipw200 looks to be initializing in parallel to the radeon drm - however disableing modeseting doesn't change anything).

It looks like this

libipw: 802.11 data/management/control stack, git-xxx
libipw: Copyright (C) 2004-2005 Intel Corporation <jketreno@linux.intel.com>
ipw2200: Intel(R) PRO/Wireless 2200/2915 Network Driver, xyxyx
ipw2200: Copyright(c) 2003-2006 Intel Corporation
ipw2200 0000:0b:02.0: PCI INT A -> GSI 21 (level, low) -> IRQ 21
ipw2200: Detected Intel PRO/Wireless 2915ABG Network Connection
drm: something with radeon/drm/kms probably, irrelevant
==================================================
INFO: possible recursive locking detected
==================================================


modprobe/1095 is trying lock:
  (rtnl_mutex){+.+.+.}, at: [<c146ba7f>] rtnl_lock+0xf/0x20

but task is already holding lock:
  (rtnl_mutex){+.+.+.}, at: [<c146ba7f>] rtnl_lock+0xf/0x20

other info that might help us debug this:

 Possible unsafe locking scenario:

  CPU 0
  -----
  lock(rtnl_mutex);
  lock(rtnl_mutex);

  *** DEADLOCK ***

May be due to missing lock nesting notation

4 locks held by modprobe/1095:

#0: (&__lockdep_no_validate__){......}, at: [<c12fdfb8>] __driver_attach+0x48/0xa0
#1: (&__lockdep_no_validate__){......}, at: [<c12fdfc4>] __driver_attach+0x54/0xa0
#2: (rtnl_mutex){+.+.+.} at [<c146ba7f>] rtnl_lock+0xf/0x20
#2: (&priv->mutex){+.+.+.} at [<f8cc7bd1>] ipw_net_init+0x41/0x2f0

Call trace

  ? printk+0x18
  __lock_acquire+0x126c
  ? _raw_spin_unlock_irqrestore+0x31
  ? trace_hardirq_on_caller+0xec
  lock_acquire+0x78
  ? rtnl_lock+0xf
  mutex_lock_nested+0x44
  ? rtnl_lock+0xf
  ? mutex_unlocked+0x8
  ? rfkill_register+0x1ef
  rtnl_lock+0xf
  wiphy_register+0x469
  ? __kmalloc+0x115
  ? ipw_net_init+0xc0
  ipw_net_init+0x1b9
  register_netdevice+0xdd
  register_netdev+0x1b
  ipw_pci_probe+0xb4c
  pci_device_probe+0x8d
  ? sysfs_create_link+0x12
  driver_probe_device+0x66
  __driver_attach+0x91
  ? driver_probe_device+0x180
  bus_for_each_dev+0x41
  ? pci_device_probe+0x120
  driver_attach+0x1c
  ? driver_probe_device+0x180
  bus_add_driver+0x1a7
  ? pci_device_probe+0x180
  driver_register+0x65
  ? __raw_spin_lock_init+0x31
  __pci_register_driver+0x52
  ipw_init+0x31
  do_one_initcall+0x2f
  ? 0xf8cdefff
  sys_init_module+0xb06



After this, system have some problems with multiple subsystems, including udev timeouts, read-only file systems, etc, etc.


I belive that commit ecb443355 introduced this regression.

commit ecb4433550f0620f3d1471ae7099037ede30a91e
Author: Stanislaw Gruszka <sgruszka@redhat.com>
Date:   Fri Aug 12 14:00:59 2011 +0200

    mac80211: fix suspend/resume races with unregister hw


Thanks.
Comment 1 Witold Baryluk 2011-08-31 15:48:55 UTC
In fact I first performed git bisect to find offending commit first. Call trace proves that indeed it is a problem. there is rtnl_mutex deadlock, and ecb443355 introduced usage of rtnl_lock() in wiphy_register function, without proper nesting notation.

Writing down kernel log was hard, because boot_delay is not active in this stage of booting - it only affects up to the loading of initrd and/or mounting root fs AFAIK. Also just after call trace was displayed radeon changed mode settings, resolution and cleared screen, including Shift-PgDown/Up kernel log buffer). And becasue system have problems with file system, syslog isn't started yet, and boot is never ending, there is no way to dump kernel log.


Thanks.
Comment 2 Florian Mickler 2011-09-05 21:16:43 UTC
in drivers/net/wireless/ipw2x00/ipw2200.c ipw_net_init calls wiphy_register which will try to acquire the rtnl lock. 

Sadly though, ipw_net_init is appointed the .ndo_init function in the net_device_ops structure of the ipw2200 driver and that function is called from 
 register_netdevice in net/core/dev.c which has a big fat ASSERT_RTNL() above that call. 

ASSERT_RTNL() is a macro which dump_stack's if the rtnl_lock is not hold. Under the assumption that Stanislaw's change to mac80211 was tested and agreed upon it seems that ipw2200 is doing something weird here and shouldnt call wiphy_register in its ndo_init callback. 


Adding
Comment 3 Florian Mickler 2011-09-05 21:18:32 UTC
(s/Adding/Flo/)
Comment 4 John W. Linville 2012-01-13 19:04:32 UTC
Does this problem persist with kernel 3.2 or later?
Comment 5 Stanislaw Gruszka 2012-01-15 09:45:09 UTC
Yes it is, we fixed that:

commit 7cabafcea793c003503a118da58da358b0692930
Author: Stanislaw Gruszka <sgruszka@redhat.com>
Date:   Wed Sep 14 16:47:50 2011 +0200

    ipw2x00: fix rtnl mutex deadlock