Bug 15160
Summary: | futex: unable to handle kernel NULL pointer dereference | ||
---|---|---|---|
Product: | Other | Reporter: | Jerome Marchand (jmarchan) |
Component: | Other | Assignee: | other_other |
Status: | RESOLVED CODE_FIX | ||
Severity: | normal | CC: | alan, dvhart, greg |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | Subsystem: | ||
Regression: | No | Bisected commit-id: | |
Attachments: |
Reproducer
Fixed testcase and some trace_marker code |
I believe we may have two issues here. One is a legitimate WARN_ON() being printed by the kernel, the other is to identify the cause of the OOPS. The test case's for(;;) loop creates threads in both the parent and the child of the fork: for(;;) { mutex_init (mtx); for (i=0; i<PROCESSES; i++) { pid = fork(); if(!pid) break; pidtab[i] = pid; } for (i = 0 ; i < 5 ; i++) thread_create (tst1_func, addr); if (pid == 0) { sleep(10); } else { usleep(1); for (i=0; i<PROCESSES; i++) kill (pidtab[i], SIGINT); } pthread_mutex_destroy(mtx); But it only kills the threads (via their process) created in the child. This leaves threads that continue running across the for(;;) loop boundaries and more importantly, across the pthread_mutex_destroy() and mutex_init() calls. It is possible for one of these threads to acquire to the lock, then for the mutex to be destroyed and initialized (which sets the futex value to 0) at which point the next thread (in the new set of threads) to try to acquire the lock will do so in userspace via a cmpxchg. At this point the futex value disagrees with the pi_state->owner->pid and the kernel WARN_ON() mentioned above will continue to trigger for the duration of the test. This is evidenced by the mismatch happening between a lower value PID in pi_state->owner->pid and a higher value PID in the futex value. The second issue of course is the actual OOPS. I thought this might be related to the following patch which isn't in -tip yet: Subject: Re: [PATCH 2.6.33-rc5] futex_lock_pi() key refcnt fix Unfortunately, I've still been able to hit an OOPS with that applied. I'll continue to investigate. Can someone tell me if the thread create/destruction mismatch was intentional? From the if (!pid) break; line, I'm guessing it was unintentional. I'm going to try without threads spanning the mutex destroy/init cycle and see if that kills teh warnings and if I can still hit the OOPS. With the test case "fixed", I can hit a panic much sooner and without any warnings. BUG: unable to handle kernel NULL pointer dereference at 0000000000000620 IP: [<ffffffff813ee3ef>] _atomic_spin_lock_irq+0x35/0x5f PGD 403082067 PUD 403089067 PMD 0 Oops: 0002 [#1] PREEMPT SMP last sysfs file: /sys/devices/pci0000:00/0000:00:1c.0/0000:02:00.0/host0/port-0:0/end_device-0:0/target0:0:0/0:0:0:0/vendor CPU 4 Modules linked in: ipv6 autofs4 i2c_dev i2c_core hidp rfcomm l2cap bluetooth rfkill sunrpc dm_mirror dm_region_hash dm_log dm_multipath scsi_dh dm_mod video output sbs sbshc battery ac parport_pc lp parport sg bnx2 bnx2x crc32c libcrc32c button i5k_amb hwmon pata_acpi serio_raw i5000_edac ata_generic edac_core shpchp iTCO_wdt iTCO_vendor_support pcspkr ata_piix libata mptsas mptscsih scsi_transport_sas mptbase sd_mod scsi_mod ext3 jbd mbcache uhci_hcd ohci_hcd ehci_hcd Pid: 4341, comm: tst Not tainted 2.6.31.12-rt20 #1 IBM eServer BladeCenter HS21 -[7995AC1]- RIP: 0010:[<ffffffff813ee3ef>] [<ffffffff813ee3ef>] _atomic_spin_lock_irq+0x35/0x5f RSP: 0018:ffff8804034a7c88 EFLAGS: 00010046 RAX: 0000000000000100 RBX: 0000000000000620 RCX: 0000000079bc8c05 RDX: 0000000000000000 RSI: 00000000800010f5 RDI: 0000000000000001 RBP: ffff8804034a7ca8 R08: 0000000079bc8c05 R09: 00000000800010f5 R10: 0000000079bc8c05 R11: 0000000010001000 R12: ffff88040394a580 R13: ffff88040309bb88 R14: 00000000800010f5 R15: ffff88042bd2df50 FS: 00000000421d2940(0063) GS:ffff8800280f4000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000000000000620 CR3: 0000000403083000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process tst (pid: 4341, threadinfo ffff8804034a6000, task ffff8804039ee600) Stack: ffff88042bd2df40 ffff88040394a580 0000000079bc8c05 ffff88042bd2df40 <0> ffff8804034a7ee8 ffffffff8108fc7b ffffffff813ee3ea 0000000079bc8c05 <0> 000010f500000000 0000000010000000 0000000100000ac0 0000000000000000 Call Trace: [<ffffffff8108fc7b>] do_futex+0x6ac/0xbfc [<ffffffff813ee3ea>] ? _atomic_spin_lock_irq+0x30/0x5f [<ffffffff810cbff4>] ? rb_reserve_next_event+0x26b/0x323 [<ffffffff810cd35d>] ? ring_buffer_lock_reserve+0x141/0x1d4 [<ffffffff810cae7f>] ? rb_commit+0x2c/0x47 [<ffffffff810d0c10>] ? trace_buffer_lock_reserve+0x52/0x7c [<ffffffff810cd425>] ? ring_buffer_unlock_commit+0x35/0xb2 [<ffffffff810ca8b1>] ? ring_buffer_event_data+0x1c/0x32 [<ffffffff810d1230>] ? trace_function+0xa0/0xc1 [<ffffffff8100c01b>] ? system_call_fastpath+0x16/0x1b [<ffffffff810901dc>] ? sys_futex+0x11/0x157 [<ffffffff810902f0>] sys_futex+0x125/0x157 [<ffffffff8100c01b>] system_call_fastpath+0x16/0x1b Code: 1f 44 00 00 48 89 fb 65 48 8b 04 25 28 00 00 00 48 89 45 f0 31 c0 fa e8 5b 9f ce ff bf 01 00 00 00 e8 c8 2d 00 00 b8 00 01 00 00 <f0> 66 0f c1 03 38 e0 74 06 f3 90 8a 03 eb f6 48 8b 45 f0 65 48 RIP [<ffffffff813ee3ef>] _atomic_spin_lock_irq+0x35/0x5f RSP <ffff8804034a7c88> CR2: 0000000000000620 Created attachment 24868 [details]
Fixed testcase and some trace_marker code
The old test case created threads in both the parent and the child of the fork, but only killed the ones created in the child. This left threads running across the for(;;) loop iterations, which also spans the mutex_destroy/init() calls. This caused corruption of the futex value, which lead to the warnings. This test case can still crash the kernel, much faster than before, and it does so without the warnings.
With the testcase from comment #3 on tip/master + the futex_lock_pi ref counting patch, I see an Oops almost immediately. BUG: unable to handle kernel NULL pointer dereference at 00000000000005c0 IP: [<ffffffff81324277>] _raw_spin_lock_irq+0x26/0x3a PGD 40e65c067 PUD 40e65d067 PMD 0 Oops: 0002 [#1] PREEMPT SMP last sysfs file: /sys/class/net/eth3/address CPU 5 Pid: 4042, comm: tst Not tainted 2.6.33-rc6-tip+ #1 Server Blade/IBM eServer BladeCenter HS21 -[7995AC1]- RIP: 0010:[<ffffffff81324277>] [<ffffffff81324277>] _raw_spin_lock_irq+0x26/0x3a RSP: 0018:ffff88040e401c98 EFLAGS: 00010002 RAX: 0000000000000100 RBX: 00000000000005c0 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 0000000080000fca RDI: 0000000000000001 RBP: ffff88040e401ca8 R08: 0000000000000000 R09: 00000000e2b6670c R10: 0000000000000007 R11: 0000000010001000 R12: ffff88040e629b98 R13: ffff88040e460440 R14: ffff880429f996d0 R15: 0000000080000fca FS: 0000000041001940(0063) GS:ffff880028340000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 00000000000005c0 CR3: 000000040e65b000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process tst (pid: 4042, threadinfo ffff88040e400000, task ffff88040d79c080) Stack: ffff88040e460440 ffff880429f996c0 ffff88040e401ee8 ffffffff81067199 <0> ffff88040e5a8000 ffff8800283d3380 ffff88042cbbc0b8 00000fca00000000 <0> 0000000010000000 0000000100000046 ffff88002834ef20 ffff88042e4744c0 Call Trace: [<ffffffff81067199>] do_futex+0x6cd/0xbf6 [<ffffffff810b7a9a>] ? get_pageblock_flags_group+0x4/0x99 [<ffffffff810b951e>] ? free_hot_cold_page+0x1cc/0x1f9 [<ffffffff810b7a9a>] ? get_pageblock_flags_group+0x4/0x99 [<ffffffff810b951e>] ? free_hot_cold_page+0x1cc/0x1f9 [<ffffffff810b7a9a>] ? get_pageblock_flags_group+0x4/0x99 [<ffffffff8109ac2a>] ? trace_buffer_lock_reserve+0x22/0x56 [<ffffffff810e4d90>] ? kmem_cache_free+0x11/0x1db [<ffffffff8109ac2a>] ? trace_buffer_lock_reserve+0x22/0x56 [<ffffffff810676d3>] ? sys_futex+0x11/0x134 [<ffffffff81002b9b>] ? system_call_fastpath+0x16/0x1b [<ffffffff810676d3>] ? sys_futex+0x11/0x134 [<ffffffff81002b9b>] ? system_call_fastpath+0x16/0x1b [<ffffffff810676d3>] ? sys_futex+0x11/0x134 [<ffffffff810677d8>] sys_futex+0x116/0x134 [<ffffffff81002b9b>] system_call_fastpath+0x16/0x1b Code: f6 5f 5b c9 c3 55 48 89 e5 53 48 83 ec 08 e8 31 e6 cd ff 48 89 fb fa e8 7a ce d7 ff bf 01 00 00 00 e8 10 2e 00 00 b8 00 01 00 00 <f0> 66 0f c1 03 38 e0 74 06 f3 90 8a 03 eb f6 41 58 5b c9 c3 55 RIP [<ffffffff81324277>] _raw_spin_lock_irq+0x26/0x3a RSP <ffff88040e401c98> CR2: 00000000000005c0 Patches committed to tip: Commit-ID: 51246bfd189064079c54421507236fd2723b18f3 Gitweb: http://git.kernel.org/tip/51246bfd189064079c54421507236fd2723b18f3 Author: Thomas Gleixner <tglx@linutronix.de> AuthorDate: Tue, 2 Feb 2010 11:40:27 +0100 Committer: Thomas Gleixner <tglx@linutronix.de> CommitDate: Wed, 3 Feb 2010 15:13:22 +0100 futex: Handle user space corruption gracefully If the owner of a PI futex dies we fix up the pi_state and set pi_state->owner to NULL. When a malicious or just sloppy programmed user space application sets the futex value to 0 e.g. by calling pthread_mutex_init(), then the futex can be acquired again. A new waiter manages to enqueue itself on the pi_state w/o damage, but on unlock the kernel dereferences pi_state->owner and oopses. Prevent this by checking pi_state->owner in the unlock path. If pi_state->owner is not current we know that user space manipulated the futex value. Ignore the mess and return -EINVAL. This catches the above case and also the case where a task hijacks the futex by setting the tid value and then tries to unlock it. Reported-by: Jermome Marchand <jmarchan@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Darren Hart <dvhltc@us.ibm.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: <stable@kernel.org> Commit-ID: 59647b6ac3050dd964bc556fe6ef22f4db5b935c Gitweb: http://git.kernel.org/tip/59647b6ac3050dd964bc556fe6ef22f4db5b935c Author: Thomas Gleixner <tglx@linutronix.de> AuthorDate: Wed, 3 Feb 2010 09:33:05 +0100 Committer: Thomas Gleixner <tglx@linutronix.de> CommitDate: Wed, 3 Feb 2010 15:13:22 +0100 futex: Handle futex value corruption gracefully The WARN_ON in lookup_pi_state which complains about a mismatch between pi_state->owner->pid and the pid which we retrieved from the user space futex is completely bogus. The code just emits the warning and then continues despite the fact that it detected an inconsistent state of the futex. A conveniant way for user space to spam the syslog. Replace the WARN_ON by a consistency check. If the values do not match return -EINVAL and let user space deal with the mess it created. This also fixes the missing task_pid_vnr() when we compare the pi_state->owner pid with the futex value. Reported-by: Jermome Marchand <jmarchan@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Darren Hart <dvhltc@us.ibm.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: <stable@kernel.org> |
Created attachment 24756 [details] Reproducer An application using pthread mutexes with PTHREAD_PRIO_INHERIT protocol (this seems important as I can't reproduce the bug with an other protocol) makes the kernel crash. Here is the oops: BUG: unable to handle kernel NULL pointer dereference at 00000468 IP: [<c04568bf>] __lock_acquire+0x85/0xb62 *pdpt = 0000000035f67001 *pde = 0000000000000000 Oops: 0000 [#1] SMP last sysfs file: /sys/devices/pci0000:00/0000:00:02.0/0000:05:00.0/0000:06:00.0/0000:07:00.0/0000:08:00.0/irq Modules linked in: sunrpc iptable_filter ip_tables ip6table_filter ip6_tables x_tables ipv6 p4_clockmod dm_multipath scsi_dh joydev usb_storage dcdbas serio_raw pcspkr ehci_hcd uhci_hcd i5000_edac edac_core rtc_cmos rtc_core rtc_lib ext4 jbd2 mptsas mptscsih mptbase scsi_transport_sas [last unloaded: microcode] Pid: 19454, comm: tst Tainted: G W 2.6.33-rc5 #102 0DT097/PowerEdge 1950 EIP: 0060:[<c04568bf>] EFLAGS: 00010046 CPU: 0 EIP is at __lock_acquire+0x85/0xb62 EAX: 00000046 EBX: f67a44c0 ECX: 00000464 EDX: 00000000 ESI: 00000000 EDI: 00000000 EBP: f5e7fea8 ESP: f5e7fe44 DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 Process tst (pid: 19454, ti=f5e7f000 task=f67a44c0 task.ti=f5e7f000) Stack: f5e7fe84 c044c8eb 0000000d 00000000 c0d076b0 00000002 00000028 00000464 <0> 00000000 00003900 00000001 00000002 10001000 00000092 f5e7fe84 c0454e5a <0> f5e7fe9c c044c920 000001a5 3156330a f67a49ec c0962534 f67a44c0 00000000 Call Trace: [<c044c8eb>] ? sched_clock_cpu+0x125/0x12d [<c0454e5a>] ? trace_hardirqs_off+0xb/0xd [<c044c920>] ? cpu_clock+0x2d/0x4e [<c0457425>] ? lock_acquire+0x89/0xa2 [<c045b97c>] ? do_futex+0x506/0x67e [<c073980c>] ? _raw_spin_lock_irq+0x24/0x54 [<c045b97c>] ? do_futex+0x506/0x67e [<c045b97c>] ? do_futex+0x506/0x67e [<c0454e99>] ? lock_release_holdtime+0x2b/0xd0 [<c073c4d2>] ? do_page_fault+0x274/0x2a2 [<c045bbd5>] ? sys_futex+0xe1/0xf4 [<c040283b>] ? sysenter_exit+0xf/0x16 [<c040280c>] ? sysenter_do_call+0x12/0x32 Code: ba b6 0a 00 00 0f 85 9d 0a 00 00 e9 49 03 00 00 83 fa 07 76 0f e8 52 2e 11 00 68 e1 25 8a c0 e9 55 0a 00 00 85 d2 75 0a 8b 4d b8 <8b> 41 04 85 c0 75 12 8b 45 b8 31 c9 e8 b2 e9 ff ff 85 c0 0f 84 EIP: [<c04568bf>] __lock_acquire+0x85/0xb62 SS:ESP 0068:f5e7fe44 CR2: 0000000000000468 The crash was preceded by several occurences of the following warnings: WARNING: at kernel/rtmutex-debug.c:192 debug_rt_mutex_unlock+0x6a/0x6c() Hardware name: PowerEdge 1950 Modules linked in: sunrpc iptable_filter ip_tables ip6table_filter ip6_tables x_tables ipv6 p4_clockmod dm_multipath scsi_dh joydev usb_storage dcdbas serio_raw pcspkr ehci_hcd uhci_hcd i5000_edac edac_core rtc_cmos rtc_core rtc_lib ext4 jbd2 mptsas mptscsih mptbase scsi_transport_sas [last unloaded: microcode] Pid: 9253, comm: tst Tainted: G W 2.6.33-rc5 #102 Call Trace: [<c043222f>] warn_slowpath_common+0x65/0x7c [<c045c812>] ? debug_rt_mutex_unlock+0x6a/0x6c [<c0432253>] warn_slowpath_null+0xd/0x10 [<c045c812>] debug_rt_mutex_unlock+0x6a/0x6c [<c073928b>] rt_mutex_slowunlock+0x1e/0x153 [<c07393c8>] rt_mutex_unlock+0x8/0xa [<c045b9fc>] do_futex+0x586/0x67e [<c0454e99>] ? lock_release_holdtime+0x2b/0xd0 [<c073c4d2>] ? do_page_fault+0x274/0x2a2 [<c045bbd5>] sys_futex+0xe1/0xf4 [<c040283b>] ? sysenter_exit+0xf/0x16 [<c040280c>] sysenter_do_call+0x12/0x32 WARNING: at kernel/futex.c:534 lookup_pi_state+0xa3/0x214() Hardware name: PowerEdge 1950 Modules linked in: sunrpc iptable_filter ip_tables ip6table_filter ip6_tables x_tables ipv6 p4_clockmod dm_multipath scsi_dh joydev usb_storage dcdbas serio_raw pcspkr ehci_hcd uhci_hcd i5000_edac edac_core rtc_cmos rtc_core rtc_lib ext4 jbd2 mptsas mptscsih mptbase scsi_transport_sas [last unloaded: microcode] Pid: 9002, comm: tst Tainted: G W 2.6.33-rc5 #102 Call Trace: [<c043222f>] warn_slowpath_common+0x65/0x7c [<c045a535>] ? lookup_pi_state+0xa3/0x214 [<c0432253>] warn_slowpath_null+0xd/0x10 [<c045a535>] lookup_pi_state+0xa3/0x214 [<c0445e0f>] ? __task_pid_nr_ns+0x65/0x6c [<c045a767>] futex_lock_pi_atomic+0xc1/0x109 [<c045ae13>] futex_lock_pi+0xce/0x224 [<c044c6bf>] ? sched_clock_local+0x17/0x11e [<c044c8eb>] ? sched_clock_cpu+0x125/0x12d [<c045ba82>] do_futex+0x60c/0x67e [<c0454e5a>] ? trace_hardirqs_off+0xb/0xd [<c044c920>] ? cpu_clock+0x2d/0x4e [<c0454e6b>] ? lockstat_clock+0xf/0x12 [<c0454e99>] ? lock_release_holdtime+0x2b/0xd0 [<c0739ddc>] ? _raw_spin_unlock_irq+0x22/0x26 [<c04561a3>] ? trace_hardirqs_on_caller+0x100/0x121 [<c045bbd5>] sys_futex+0xe1/0xf4 [<c040283b>] ? sysenter_exit+0xf/0x16 [<c047047b>] ? audit_syscall_entry+0x113/0x135 [<c040280c>] sysenter_do_call+0x12/0x32 The issue is easily reproducible with the attached reproducer.