Bug 12802 - General protection fault on rmmod cx8800
Summary: General protection fault on rmmod cx8800
Status: RESOLVED CODE_FIX
Alias: None
Product: v4l-dvb
Classification: Unclassified
Component: cx88 (show other bugs)
Hardware: All Linux
: P1 normal
Assignee: Jean Delvare
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2009-03-02 04:38 UTC by Jean Delvare
Modified: 2009-07-02 12:22 UTC (History)
1 user (show)

See Also:
Kernel Version: 2.6.28.5
Subsystem:
Regression: No
Bisected commit-id:


Attachments
fix (413 bytes, patch)
2009-03-03 12:47 UTC, Marcin Slusarz
Details | Diff
Tested fix (2.22 KB, patch)
2009-03-03 13:24 UTC, Jean Delvare
Details | Diff

Description Jean Delvare 2009-03-02 04:38:50 UTC
Latest working kernel version: I don't rmmod cx8800 often enough to tell
Earliest failing kernel version: 2.6.28.5
Distribution: openSUSE 11.1
Hardware Environment: Intel x86-64
Software Environment:
Problem Description:

I have hit the following general protection fault when removing module cx8800:

Feb 15 18:30:09 hyperion kernel: cx88/2: unregistering cx8802 driver, type: dvb access: shared
Feb 15 18:30:09 hyperion kernel: cx88[0]/2: subsystem: 107d:665f, board: WinFast DTV1000-T [card=35]
Feb 15 18:30:09 hyperion kernel: cx88-mpeg driver manager 0000:02:04.2: PCI INT A disabled
Feb 15 18:30:09 hyperion kernel: cx8800 0000:02:04.0: PCI INT A disabled
Feb 15 18:30:09 hyperion kernel: general protection fault: 0000 [#1]
Feb 15 18:30:09 hyperion kernel: last sysfs file: /sys/devices/pnp0/00:08/i2c-adapter/i2c-3/3-004c/temp2_crit_alarm
Feb 15 18:30:09 hyperion kernel: CPU 0
Feb 15 18:30:09 hyperion kernel: Modules linked in: lm90 w83627ehf
  hwmon_vid i2c_parport isofs ip6t_LOG xt_tcpudp xt_pkttype ipt_LOG
  xt_limit binfmt_misc snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device
  nfs lockd sunrpc radeon drm nf_conntrack_ipv6 ip6t_REJECT xt_NOTRACK
  ipt_REJECT xt_state iptable_raw iptable_filter ip6table_mangle
  nf_conntrack_netbios_ns nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4
  ip_tables ip6table_filter ip6_tables x_tables ipv6 nls_iso8859_1
  nls_cp437 vfat fuse loop dm_mod dvb_pll cx22702 cx88_vp3054_i2c zr36060
  cx88xx snd_intel8x0 ir_common snd_ac97_codec saa7110 snd_pcsp tveeprom
  ac97_bus zr36067 videobuf_dvb i2c_algo_bit compat_ioctl32 dvb_core
  snd_pcm snd_timer v4l2_common videocodec parport_pc parport btcx_risc
  iTCO_wdt snd videobuf_dma_sg 8139too soundcore videodev videobuf_core
  v4l1_compat mii i2c_i801 snd_page_alloc sr_mod cdrom intel_agp button
  sg sd_mod ehci_hcd uhci_hcd usbcore edd ext3 mbcache jbd fan
  ide_pci_generic piix ide_core ata_generic ata_piix libata thermal
  processor thermal_sys hwmon [last unloaded: cx8800]
Feb 15 18:30:09 hyperion kernel: Pid: 5, comm: events/0 Not tainted 2.6.28.5 #5
Feb 15 18:30:09 hyperion kernel: RIP: 0010:[<ffffffffa0256c18>]  [<ffffffffa0256c18>] cx88_ir_work+0x32/0x236 [cx88xx]
Feb 15 18:30:09 hyperion kernel: RSP: 0000:ffff88003f86be20  EFLAGS: 00010202
Feb 15 18:30:09 hyperion kernel: RAX: 0000000000350010 RBX: ffff88003e1c0ec8 RCX: 0000000000000001
Feb 15 18:30:09 hyperion kernel: RDX: ffffffff80666ee0 RSI: ffffffff809fc750 RDI: ffff88003e1c0ec0
Feb 15 18:30:09 hyperion kernel: RBP: ffff88003f86be50 R08: 0000000000000001 R09: ffffffbff9b00198
Feb 15 18:30:09 hyperion kernel: R10: 0000000000000080 R11: 0000000000000001 R12: ffff88003e1c0ec0
Feb 15 18:30:09 hyperion kernel: R13: ffff88003e1c0c00 R14: 2f4065766f6d6572 R15: ffff88003f8201f0
Feb 15 18:30:09 hyperion kernel: FS:  0000000000000000(0000) GS:ffffffff805f3020(0000) knlGS:0000000000000000
Feb 15 18:30:09 hyperion kernel: CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
Feb 15 18:30:09 hyperion kernel: CR2: 00000000004042e0 CR3: 000000003ee63000 CR4: 00000000000006e0
Feb 15 18:30:09 hyperion kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Feb 15 18:30:09 hyperion kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Feb 15 18:30:10 hyperion kernel: Process events/0 (pid: 5, threadinfo ffff88003f86a000, task ffff88003f845140)
Feb 15 18:30:10 hyperion kernel: Stack:
Feb 15 18:30:10 hyperion kernel:  ffff88003f86be50 ffff88003e1c0ec8 ffff88003e1c0ec0 ffff88003f8201c0
Feb 15 18:30:10 hyperion kernel:  ffffffffa0256be6 ffff88003f8201f0 ffff88003f86bec0 ffffffff802426b0
Feb 15 18:30:10 hyperion kernel:  ffffffff8024268a ffffffff8042a910 ffffffffa0264444 0000000000000000
Feb 15 18:30:10 hyperion kernel: Call Trace:
Feb 15 18:30:10 hyperion kernel:  [<ffffffffa0256be6>] ? cx88_ir_work+0x0/0x236 [cx88xx]
Feb 15 18:30:10 hyperion kernel:  [<ffffffff802426b0>] run_workqueue+0xee/0x21f
Feb 15 18:30:10 hyperion kernel:  [<ffffffff8024268a>] ? run_workqueue+0xc8/0x21f
Feb 15 18:30:10 hyperion kernel:  [<ffffffff8042a910>] ? thread_return+0x30/0x1cd
Feb 15 18:30:10 hyperion kernel:  [<ffffffff80242c7e>] worker_thread+0xa6/0x111
Feb 15 18:30:10 hyperion kernel:  [<ffffffff802465b1>] ? autoremove_wake_function+0x0/0x3b
Feb 15 18:30:10 hyperion kernel:  [<ffffffff80242bd8>] ? worker_thread+0x0/0x111
Feb 15 18:30:10 hyperion kernel:  [<ffffffff80246306>] kthread+0x49/0x7b
Feb 15 18:30:10 hyperion kernel:  [<ffffffff8020c019>] child_rip+0xa/0x11
Feb 15 18:30:10 hyperion kernel:  [<ffffffff8020bb34>] ? restore_args+0x0/0x30
Feb 15 18:30:10 hyperion kernel:  [<ffffffff802462bd>] ? kthread+0x0/0x7b
Feb 15 18:30:10 hyperion kernel:  [<ffffffff8020c00f>] ? child_rip+0x0/0x11
Feb 15 18:30:10 hyperion kernel: Code: 56 41 55 41 54 53 48 83 ec 08 49 89 fc 4c 8d af 40 fd ff ff 4c 8b b7 40 fd ff ff 41 8b 85 48 03 00 00 c1 e8 02 89 c0 48 c1 e0 02 <49> 03 46 40 8b 18 41 8b 86 10 0a 00 00 83 f8 23 0f 84 fc 00 00
Feb 15 18:30:10 hyperion kernel: RIP  [<ffffffffa0256c18>] cx88_ir_work+0x32/0x236 [cx88xx]
Feb 15 18:30:10 hyperion kernel:  RSP <ffff88003f86be20>
Feb 15 18:30:10 hyperion kernel: ---[ end trace 45f7ffdbf475eafd ]---
Feb 15 18:30:10 hyperion kernel: pci 0000:02:01.0: PCI INT A disabled

This was with kernel 2.6.28.5 on x86-64. The keyboard started acting weirdly, first time it didn't respond to any key, second time it was like the return key was stuck. Both time I was able to reboot the machine with some insistence (SysRq or remote logging, killing X and reboot.)

It doesn't happen on each rmmod cx8800, but as I managed to reproduce it once, I guess I should be able to reproduce it again if needed. Please let me know if you need more information or want me to test something.

Steps to reproduce:
Run "rmmod cx8800 && modprobe cx8800" in a loop.
Comment 1 Marcin Slusarz 2009-03-03 12:47:14 UTC
Created attachment 20416 [details]
fix
Comment 2 Marcin Slusarz 2009-03-03 12:48:03 UTC
does attached patch fix this bug?
Comment 3 Jean Delvare 2009-03-03 13:23:32 UTC
In fact I already have a tested fix for this bug, I will attach it. I can test your fix as well, it is simpler, but I suspect it is still racy: ir->work could be run between del_timer_sync(&ir->timer) and cancel_work_sync(&ir->work), rearming the timer, couldn't it?
Comment 4 Jean Delvare 2009-03-03 13:24:13 UTC
Created attachment 20417 [details]
Tested fix
Comment 5 Marcin Slusarz 2009-03-03 13:51:58 UTC
if I understand comment of cancel_work_sync, my fix should be ok too

/**
 * cancel_work_sync - block until a work_struct's callback has terminated
 * @work: the work which is to be flushed
 *
 * Returns true if @work was pending.
 *
 * cancel_work_sync() will cancel the work if it is queued. If the work's
 * callback appears to be running, cancel_work_sync() will block until it
 * has completed.
 *
 * It is possible to use this function if the work re-queues itself. It can
 * cancel the work even if it migrates to another workqueue, however in that
 * case it only guarantees that work->func() has completed on the last queued
 * workqueue.
 *
 * cancel_work_sync(&delayed_work->work) should be used only if ->timer is not
 * pending, otherwise it goes into a busy-wait loop until the timer expires.
 *
 * The caller must ensure that workqueue_struct on which this work was last
 * queued can't be destroyed before this function returns.
 */
Comment 6 Marcin Slusarz 2009-03-03 13:55:28 UTC
note that cancel_delayed_work_sync has to deal with the same problem... and it does:

/**
 * cancel_delayed_work_sync - reliably kill off a delayed work.
 * @dwork: the delayed work struct
 *
 * Returns true if @dwork was pending.
 *
 * It is possible to use this function if @dwork rearms itself via queue_work()
 * or queue_delayed_work(). See also the comment for cancel_work_sync().
 */
Comment 7 Jean Delvare 2009-03-04 09:58:26 UTC
I tested your patch and it worked fine for me. However I am running a single-CPU system, so I can't guarantee that it would work as well on an SMP system. Just because I did not hit the race doesn't mean it's not there... The reasons why I am skeptical are:

1* While cancel_work_sync() is guaranteed to succeed on a work which re-queues itself, this isn't exactly what the original code (nor yours) does. Instead, the work arms a timer, and the timer in turn re-arms the work. I suspect the guarantee doesn't apply to this case.

2* Looking at __cancel_work_timer(), I see that it loops over trying to delete the timer and grab the pending work. This suggests that the first try could fail, while your own code does not include any such loop.

So my implementation looks safer. Given that it has the added benefit of making the code more compact and simple, I think we should go with it.
Comment 8 Marcin Slusarz 2009-03-06 04:48:59 UTC
I agree.

Shouldn't schedule_delayed_work (in cx88_ir_start) be called with second parameter set to 0? It's not a bug, but change in behavior.
Comment 9 Jean Delvare 2009-03-06 04:58:38 UTC
Good point. I don't think it makes any difference in practice, but 0 is equivalent to the original code. I'll do that.
Comment 10 Marcin Slusarz 2009-03-06 05:07:37 UTC
hmmm, lots of drivers use 

del_timer_sync() && flush_scheduled_work();
or
del_timer_sync() && cancel_work_sync();

lots of them are probably broken...
Comment 11 Jean Delvare 2009-03-06 05:16:00 UTC
I already have patches for cx88, ir-kbd-i2c, em28xx and saa6588. I hope to have them merged in the v4l-dvb tree soon, and from there pushed to Linus. If there are other drivers out there doing the same then they indeed probably need a similar fix.
Comment 12 Jean Delvare 2009-07-02 12:22:42 UTC
Patch went upstream in kernel 2.6.30:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=569b7ec73abf576f9a9e4070d213aadf2cce73cb

Also in stable kernels 2.6.29.2 and 2.6.27.25.

Note You need to log in before you can comment on or make changes to this bug.