Most recent kernel where this bug did not occur: didn't find any (tried 2.6.16.32, 2.6.18.1 and Debian kernel 2.6.17-2) Distribution: Debian etch userspace Hardware Environment: $ lspci 00:00.0 Host bridge: Intel Corporation 440BX/ZX/DX - 82443BX/ZX/DX Host bridge (rev 03) 00:01.0 PCI bridge: Intel Corporation 440BX/ZX/DX - 82443BX/ZX/DX AGP bridge (rev 03) 00:02.0 CardBus bridge: Texas Instruments PCI1450 (rev 03) 00:02.1 CardBus bridge: Texas Instruments PCI1450 (rev 03) 00:03.0 Ethernet controller: Intel Corporation 82557/8/9 [Ethernet Pro 100] (rev 09) 00:03.1 Serial controller: Xircom Mini-PCI V.90 56k Modem 00:05.0 Multimedia audio controller: Cirrus Logic CS 4614/22/24 [CrystalClear SoundFusion Audio Accelerator] (rev 01) 00:07.0 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ISA (rev 02) 00:07.1 IDE interface: Intel Corporation 82371AB/EB/MB PIIX4 IDE (rev 01) 00:07.2 USB Controller: Intel Corporation 82371AB/EB/MB PIIX4 USB (rev 01) 00:07.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 03) 01:00.0 VGA compatible controller: S3 Inc. 86C270-294 Savage/IX-MV (rev 11) $ lsusb Bus 001 Device 002: ID 0a12:0001 Cambridge Silicon Radio, Ltd Bluetooth Dongle (HCI mode) Bus 001 Device 001: ID 0000:0000 Software Environment: Debian etch userspace Problem Description: Kernel crash like this Nov 8 14:06:00 silu kernel: CPU: 0 Nov 8 14:06:00 silu kernel: EIP: 0060:[phys_startup_32+1700618595/-1073741824] Not tainte d VLI Nov 8 14:06:00 silu kernel: EFLAGS: 00010012 (2.6.18.2 #4) Nov 8 14:06:00 silu kernel: EIP is at 0x656d6163 Nov 8 14:06:00 silu kernel: eax: cdc35f28 ebx: cdc35f28 ecx: 00000000 edx: 00000003 Nov 8 14:06:00 silu kernel: esi: 6c2e6172 edi: 00000001 ebp: c3625c54 esp: c3625c34 Nov 8 14:06:00 silu kernel: ds: 007b es: 007b ss: 0068 Nov 8 14:06:00 silu kernel: Process pppd (pid: 3343, ti=c3624000 task=cdc2e570 task.ti=c3624000 ) Nov 8 14:06:00 silu kernel: Stack: c0116219 00000000 00000003 c77ee938 6b206567 c77ee938 000000 00 00000001 Nov 8 14:06:00 silu kernel: c3625c78 c011666a 00000000 00000000 00000003 00000282 c77ee8 00 00000000 Nov 8 14:06:00 silu kernel: c3625cb8 00000000 c01cb14f 00000000 c1250000 cf44ce00 c3625c b8 00000001 Nov 8 14:06:00 silu kernel: Call Trace: Nov 8 14:06:00 silu kernel: [__wake_up_common+47/83] __wake_up_common+0x2f/0x53 Nov 8 14:06:00 silu kernel: [__wake_up+42/61] __wake_up+0x2a/0x3d Nov 8 14:06:00 silu kernel: [release_dev+569/1518] release_dev+0x239/0x5ee Nov 8 14:06:00 silu kernel: [free_pages_bulk+27/491] free_pages_bulk+0x1b/0x1eb Nov 8 14:06:00 silu kernel: [_spin_lock_bh+8/24] _spin_lock_bh+0x8/0x18 Nov 8 14:06:00 silu kernel: [lock_sock+137/145] lock_sock+0x89/0x91 Nov 8 14:06:00 silu kernel: [lock_sock+137/145] lock_sock+0x89/0x91 Nov 8 14:06:00 silu kernel: [pg0+276501738/1070425088] inet6_destroy_sock+0x22/0x3a [ipv6] Nov 8 14:06:00 silu kernel: [tty_release+15/24] tty_release+0xf/0x18 Nov 8 14:06:00 silu kernel: [__fput+57/216] __fput+0x39/0xd8 Nov 8 14:06:00 silu kernel: [filp_close+78/84] filp_close+0x4e/0x54 Nov 8 14:06:00 silu kernel: [put_files_struct+101/167] put_files_struct+0x65/0xa7 Nov 8 14:06:00 silu kernel: [do_exit+458/1805] do_exit+0x1ca/0x70d Nov 8 14:06:00 silu kernel: [die+482/650] die+0x1e2/0x28a Nov 8 14:06:00 silu kernel: [die+613/650] die+0x265/0x28a Nov 8 14:06:00 silu kernel: [do_page_fault+948/1153] do_page_fault+0x3b4/0x481 Nov 8 14:06:00 silu kernel: [do_page_fault+0/1153] do_page_fault+0x0/0x481 Nov 8 14:06:00 silu kernel: [error_code+57/64] error_code+0x39/0x40 Nov 8 14:06:00 silu kernel: [pg0+276348877/1070425088] rfcomm_tty_chars_in_buffer+0x7/0x2f [rf comm] Nov 8 14:06:00 silu kernel: [tty_wait_until_sent+116/185] tty_wait_until_sent+0x74/0xb9 Nov 8 14:06:00 silu kernel: [default_wake_function+0/12] default_wake_function+0x0/0xc Nov 8 14:06:00 silu kernel: [n_tty_ioctl+0/1037] n_tty_ioctl+0x0/0x40d Nov 8 14:06:00 silu kernel: [set_termios+190/745] set_termios+0xbe/0x2e9 Nov 8 14:06:00 silu kernel: [do_path_lookup+522/549] do_path_lookup+0x20a/0x225 Nov 8 14:06:00 silu kernel: [path_lookup+15/17] path_lookup+0xf/0x11 Nov 8 14:06:00 silu kernel: [n_tty_ioctl+0/1037] n_tty_ioctl+0x0/0x40d Nov 8 14:06:00 silu kernel: [tty_ioctl+2893/2988] tty_ioctl+0xb4d/0xbac Nov 8 14:06:00 silu kernel: [unix_dgram_connect+106/287] unix_dgram_connect+0x6a/0x11f Nov 8 14:06:00 silu kernel: [do_wp_page+299/840] do_wp_page+0x12b/0x348 Nov 8 14:06:00 silu kernel: [__handle_mm_fault+1752/1856] __handle_mm_fault+0x6d8/0x740 Nov 8 14:06:00 silu kernel: [do_ioctl+71/93] do_ioctl+0x47/0x5d Nov 8 14:06:00 silu kernel: [vfs_ioctl+586/604] vfs_ioctl+0x24a/0x25c Nov 8 14:06:00 silu kernel: [do_page_fault+394/1153] do_page_fault+0x18a/0x481 Nov 8 14:06:00 silu kernel: [sys_ioctl+44/67] sys_ioctl+0x2c/0x43 Nov 8 14:06:00 silu kernel: [syscall_call+7/11] syscall_call+0x7/0xb Nov 8 14:06:00 silu kernel: Code: Bad EIP value. Nov 8 14:06:00 silu kernel: EIP: [phys_startup_32+1700618595/-1073741824] 0x656d6163 SS:ESP 006 8:c3625c34 Nov 8 14:06:00 silu kernel: <1>Fixing recursive fault but reboot is needed! Nov 8 14:07:54 silu kernel: SysRq : Emergency Sync Nov 8 14:07:54 silu kernel: Emergency Sync complete Steps to reproduce: Boot to run level 2, login as root on virtual terminal, run screen, open GPRS ppp connection with 'pon gprs-bt', open another shell to ping some host, open a third shell to do 'telinit 1'.
Created attachment 9483 [details] kern.log
Congratulations - you got the messiest-looking backtrace I've seen in a year. It _looks_ like it oopsed in the bluetooth->tty management code. But it could be the socket code too. Sigh. I recall we did fix one or two things in there. If you could, it would be good if you could test 2.6.19-rc5 or -rc6 and see if enabling CONFIG_UNWIND_INFO=y CONFIG_STACK_UNWIND=y gives us a more comprehensible backtrace. Thanks.
Created attachment 9492 [details] Oops screen shot with 2.6.19rc5 How on earth did I manage to get this crash to syslog? I must have played with magic sysreq and been very, very lucky...
Created attachment 10192 [details] sane crash log with 2.6.20-rc4 I tried 2.6.8 from Debian sarge and it crashes too. Then tried 2.6.20rc4 and it crashes. Then played kdb and I get to log the first null pointer, yeah! As I don't how to trace it back to C source and my asm sucks I just read lots of code and play with kprint. Making the dmesg buffer large and removing this tty thing helps a lot: diff -ru linux-2.6.20-rc4/drivers/char/tty_io.c linux-2.6.20-rc4-kdb/drivers/char/tty_io.c --- linux-2.6.20-rc4/drivers/char/tty_io.c 2007-01-07 07:45:51.000000000 +0200 +++ linux-2.6.20-rc4-kdb/drivers/char/tty_io.c 2007-01-25 21:23:05.000000000 +0200 @@ -2313,8 +2313,8 @@ if (!do_sleep) break; - printk(KERN_WARNING "release_dev: %s: read/write wait queue " - "active!\n", tty_name(tty, buf)); + /*printk(KERN_WARNING "release_dev: %s: read/write wait queue " + "active!\n", tty_name(tty, buf));*/ mutex_unlock(&tty_mutex); schedule(); } Still being clueless I stumble to http://git2.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=b6e557fbf1dbba8cfa667a25503e5dbd0e9330b7 and now there's a dumb cure: --- linux-2.6.20-rc4/net/bluetooth/rfcomm/tty.c 2007-01-07 07:45:51.000000000 +0200 +++ linux-2.6.20-rc4-kdb/net/bluetooth/rfcomm/tty.c 2007-01-25 22:26:44.000000000 +0200 @@ -659,13 +659,23 @@ static int rfcomm_tty_write(struct tty_struct *tty, const unsigned char *buf, int count) { + BT_DBG("tty %p count %d", tty, count); + + if (!tty) return 0; + struct rfcomm_dev *dev = (struct rfcomm_dev *) tty->driver_data; + + BT_DBG("dev %p", dev); + if (!dev) return 0; + struct rfcomm_dlc *dlc = dev->dlc; + + BT_DBG("dlc %p", dlc); + if (!dlc) return 0; + struct sk_buff *skb; int err = 0, sent = 0, size; - BT_DBG("tty %p count %d", tty, count); - To my non-kernel hacker eyes this is a bit dumb, since I'd hope tty, dlc and others would be protected by some lock. I hope someone knows better. -Mikko
Marcel, have you seen this?
Created attachment 10193 [details] rfcomm log Debug prints from rfcomm with the previous patch. I had rfcomm2 open with ppp/ip, l2ping'ed my phone and did 'telinit 1'. Shows (struct rfcomm_dev *) tty->driver_data as 0 in the end. -Mikko
Can you please make sure that 2.6.20-rc5 or later still produces these crashes for you. The rfcomm_tty_chars_in_buffer() routine should return 0 in cases where the the DLC has been detached from the TTY. So there should be no call to rfcomm_tty_write() with an unattached DLC.
Created attachment 10196 [details] crash log with 2.6.20-rc5 This I got with almost plain 2.6.20-rc5. Only these changes included: diff -ru linux-2.6.20-rc5/net/bluetooth/rfcomm/Makefile linux-2.6.20-rc5-mcf/net /bluetooth/rfcomm/Makefile --- linux-2.6.20-rc5/net/bluetooth/rfcomm/Makefile 2007-01-12 20:54:26.0000 00000 +0200 +++ linux-2.6.20-rc5-mcf/net/bluetooth/rfcomm/Makefile 2007-01-26 12:10:27.0000 00000 +0200 @@ -2,6 +2,8 @@ # Makefile for the Linux Bluetooth RFCOMM layer. # +CFLAGS += -DCONFIG_BT_RFCOMM_DEBUG + obj-$(CONFIG_BT_RFCOMM) += rfcomm.o rfcomm-y := core.o sock.o diff -ru linux-2.6.20-rc5/drivers/char/tty_io.c linux-2.6.20-rc5-mcf/drivers/cha r/tty_io.c --- linux-2.6.20-rc5/drivers/char/tty_io.c 2007-01-12 20:54:26.000000000 +0 200 +++ linux-2.6.20-rc5-mcf/drivers/char/tty_io.c 2007-01-26 12:33:39.000000000 +0 200 @@ -2313,8 +2313,8 @@ if (!do_sleep) break; - printk(KERN_WARNING "release_dev: %s: read/write wait queue " - "active!\n", tty_name(tty, buf)); + /*printk(KERN_WARNING "release_dev: %s: read/write wait queue " + "active!\n", tty_name(tty, buf));*/ mutex_unlock(&tty_mutex); schedule(); }
Like before, with 2.6.20-rc5 this patch helps: diff -ru ../linux-2.6.20-rc5/net/bluetooth/rfcomm/tty.c net/bluetooth/rfcomm/tty .c --- ../linux-2.6.20-rc5/net/bluetooth/rfcomm/tty.c 2007-01-12 20:54:26.0000 00000 +0200 +++ net/bluetooth/rfcomm/tty.c 2007-01-26 20:48:00.000000000 +0200 @@ -660,6 +660,9 @@ static int rfcomm_tty_write(struct tty_struct *tty, const unsigned char *buf, i nt count) { struct rfcomm_dev *dev = (struct rfcomm_dev *) tty->driver_data; + BT_DBG("dev %p", dev); + if (!dev) + return 0; struct rfcomm_dlc *dlc = dev->dlc; struct sk_buff *skb; int err = 0, sent = 0, size; And logs show up like this: [ 3224.384000] rfcomm_recv_mcc: c6d7bd60 type 0x38 cr 0 [ 3224.384000] rfcomm_recv_msc: dlci 2 cr 0 v24 0x8d [ 3224.384000] rfcomm_process_dlcs: session c6d7bd60 state 1 [ 3224.384000] rfcomm_process_tx: dlc c839ef38 state 8 cfc 40 rx_credits 23 tx_c redits 7 [ 3224.388000] rfcomm_tty_ioctl: tty caf4c7f8 cmd 0x5401 [ 3224.388000] rfcomm_tty_ioctl: TCGETS is not supported [ 3224.388000] rfcomm_tty_open: tty caf4c7f8 id 2 [ 3224.392000] rfcomm_l2data_ready: cc5a5bf8 bytes 4 [ 3224.392000] rfcomm_process_rx: session c6d7bd60 state 1 qlen 1 [ 3224.392000] rfcomm_recv_ua: session c6d7bd60 state 1 dlci 2 [ 3224.392000] __rfcomm_dlc_close: dlc c839ef38 state 9 dlci 2 err 0 session c6d 7bd60 [ 3224.392000] rfcomm_dlc_clear_timer: dlc c839ef38 state 9 [ 3224.392000] rfcomm_dlc_unlink: dlc c839ef38 refcnt 1 session c6d7bd60 [ 3224.392000] rfcomm_dlc_free: c839ef38 [ 3224.392000] rfcomm_process_dlcs: session c6d7bd60 state 1 [ 3224.392000] rfcomm_session_del: session c6d7bd60 state 1 [ 3224.392000] rfcomm_send_disc: c6d7bd60 dlci 0 [ 3224.392000] rfcomm_send_frame: session c6d7bd60 len 4 [ 3224.404000] rfcomm_tty_ioctl: tty caf4c7f8 cmd 0x5401 [ 3224.404000] rfcomm_tty_ioctl: TCGETS is not supported [ 3224.404000] rfcomm_tty_ioctl: tty caf4c7f8 cmd 0x5401 [ 3224.404000] rfcomm_tty_ioctl: TCGETS is not supported [ 3224.404000] rfcomm_tty_ioctl: tty caf4c7f8 cmd 0x5402 [ 3224.404000] rfcomm_tty_ioctl: TCSETS is not supported [ 3224.404000] rfcomm_tty_set_termios: tty caf4c7f8 termios c8597e00 [ 3224.404000] rfcomm_tty_ioctl: tty caf4c7f8 cmd 0x5401 [ 3224.404000] rfcomm_tty_ioctl: TCGETS is not supported [ 3224.404000] rfcomm_tty_ioctl: tty caf4c7f8 cmd 0x5409 [ 3224.420000] rfcomm_tty_write: dev 00000000 -------------------------------------^ [ 3224.420000] rfcomm_l2state_change: cc5a5bf8 state 9 [ 3226.440000] bus bluetooth: remove device acl00605762B2A5 [ 3226.440000] PM: Removing info for bluetooth:acl00605762B2A5
I guess best we can do right now is ask Miko to test with the latest kernel (2.6.22-rc5 as of now) and confirm if the problem still there. Thanks.
Created attachment 11833 [details] crash screen shot
The screenshot above is from 2.6.22-rc5. Perhaps someone could try to explain/decipher the locking logic of net/bluetooth/rfcomm/tty.c? At least I'm not getting it...
Created attachment 11835 [details] Keep rfcomm_dev in list until it is freed This patch fixes another kernel crash. It looks like it could also fix your case. Please apply it to 2.6.22-rc5 and try it.
Created attachment 11855 [details] crash with 2.6.22-rc5 + "Keep rfcomm_dev in list until it is freed" patch With the 'Keep rfcomm_dev in list until it is freed' patch the oops is a bit harder to recreate: after 'pon gprs', l2ping on another terminal and 'telinit 1' on another, I get tons of these: Jun 23 00:12:22 localhost kernel: release_dev: rfcomm2: read/write wait queue active! Jun 23 00:12:22 localhost last message repeated 260 times I got bored so I entered sysreq+k which resulted in a: Jun 23 00:12:22 localhost kernel: BUG: unable to handle kernel NULL pointer dere ference at virtual address 00000000 Jun 23 00:12:22 localhost kernel: printing eip: Jun 23 00:12:22 localhost kernel: 00000000 Jun 23 00:12:22 localhost kernel: *pde = 00000000 Jun 23 00:12:22 localhost kernel: Oops: 0000 [#2] Jun 23 00:12:22 localhost kernel: Modules linked in: ppp_async crc_ccitt ppp_gen eric slhc hci_usb binfmt_misc rfcomm l2cap bluetooth ipv6 button ac battery nls_ iso8859_1 nls_cp437 vfat fat fuse tsdev pcmcia firmware_class floppy rtc parport _pc parport psmouse i2c_piix4 i2c_core serio_raw intel_agp shpchp pci_hotplug ag pgart yenta_socket rsrc_nonstatic pcmcia_core evdev ipt_REJECT xt_tcpudp iptable _filter ip_tables x_tables ext3 jbd mbcache dm_mirror dm_snapshot dm_mod ide_gen eric ide_cd cdrom ide_disk uhci_hcd usbcore piix generic ide_core e100 mii therm al processor fan Jun 23 00:12:22 localhost kernel: CPU: 0 Jun 23 00:12:22 localhost kernel: EIP: 0060:[<00000000>] Not tainted VLI Jun 23 00:12:22 localhost kernel: EFLAGS: 00010006 (2.6.22-rc5 #9) Jun 23 00:12:22 localhost kernel: EIP is at _stext+0x3feff000/0x14 Jun 23 00:12:22 localhost kernel: eax: c9daff2c ebx: c9daff2c ecx: 00000000 edx: 00000003 Jun 23 00:12:22 localhost kernel: esi: 00000000 edi: 00000001 ebp: c6c7deb4 esp: c6c7de94 Jun 23 00:12:22 localhost kernel: ds: 007b es: 007b fs: 0000 gs: 0033 ss: 0068 Jun 23 00:12:22 localhost kernel: Process pppd (pid: 3722, ti=c6c7c000 task=c5c2 6070 task.ti=c6c7c000) Jun 23 00:12:22 localhost kernel: Stack: c01111fc 00000000 00000003 c6d0f520 000 00000 00000000 00000292 c6c7df1c Jun 23 00:12:22 localhost kernel: c6c7decc c0112008 00000000 00000000 c6d 0f400 00000001 00000000 c01ef270 Jun 23 00:12:22 localhost kernel: 00000000 c6c7df14 00000001 c0140768 642 f0001 c6c7df14 ca445518 c6fdc120 Jun 23 00:12:22 localhost kernel: Call Trace: Jun 23 00:12:22 localhost kernel: [__wake_up_common+49/79] __wake_up_common+0x3 1/0x4f Jun 23 00:12:22 localhost kernel: [__wake_up+30/38] __wake_up+0x1e/0x26 Jun 23 00:12:22 localhost kernel: [release_dev+626/1410] release_dev+0x272/0x58 2 Jun 23 00:12:22 localhost kernel: [do_wp_page+88/877] do_wp_page+0x58/0x36d Jun 23 00:12:22 localhost kernel: [__handle_mm_fault+1590/1713] __handle_mm_fau lt+0x636/0x6b1 Jun 23 00:12:22 localhost kernel: [sk_free+176/197] sk_free+0xb0/0xc5 Jun 23 00:12:22 localhost kernel: [tty_ioctl+0/2695] tty_ioctl+0x0/0xa87 Jun 23 00:12:22 localhost kernel: [tty_release+7/10] tty_release+0x7/0xa Jun 23 00:12:22 localhost kernel: [__fput+57/222] __fput+0x39/0xde Jun 23 00:12:22 localhost kernel: [filp_close+81/88] filp_close+0x51/0x58 Jun 23 00:12:22 localhost kernel: [sys_close+84/131] sys_close+0x54/0x83 Jun 23 00:12:22 localhost kernel: [syscall_call+7/11] syscall_call+0x7/0xb Jun 23 00:12:22 localhost kernel: [xfrm_parse_spi+58/210] xfrm_parse_spi+0x3a/0 xd2 Jun 23 00:12:22 localhost kernel: ======================= Jun 23 00:12:22 localhost kernel: Code: Bad EIP value. Jun 23 00:12:22 localhost kernel: EIP: [<00000000>] _stext+0x3feff000/0x14 SS:ES P 0068:c6c7de94 I'm thinking this is a race between tty and ioctl code. I whish I could understand which lock protects rfcomm_dev list and each rfcomm_dev.
Created attachment 11909 [details] Hangup tty's before releasing rfcomm_dev Read a kernel book, studied to code, negledged the family for a long time... And here comes a patch, which I think is a proper fix for this issue. The core problem is that rfcomm socket layer ioctl can release rfcomm_dev struct while rfcomm tty layer is still actively using it. Calling tty_vhangup is needed for a synchronous hangup before rfcomm_dev is freed.
Patch applied: http://git.kernel.org/?p=linux/kernel/git/holtmann/bluetooth-2.6.git;a=commitdiff;h=84950cf0ba02fd6a5defe2511bc41f9aa2237632