Bug 15160

Summary: futex: unable to handle kernel NULL pointer dereference
Product: Other Reporter: Jerome Marchand (jmarchan)
Component: OtherAssignee: other_other
Status: RESOLVED CODE_FIX    
Severity: normal CC: alan, dvhart, greg
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: Subsystem:
Regression: No Bisected commit-id:
Attachments: Reproducer
Fixed testcase and some trace_marker code

Description Jerome Marchand 2010-01-28 16:49:02 UTC
Created attachment 24756 [details]
Reproducer

An application using pthread mutexes with PTHREAD_PRIO_INHERIT protocol (this seems important as I can't reproduce the bug with an other protocol) makes the kernel crash.
Here is the oops:

BUG: unable to handle kernel NULL pointer dereference at 00000468
IP: [<c04568bf>] __lock_acquire+0x85/0xb62
*pdpt = 0000000035f67001 *pde = 0000000000000000 
Oops: 0000 [#1] SMP 
last sysfs file: /sys/devices/pci0000:00/0000:00:02.0/0000:05:00.0/0000:06:00.0/0000:07:00.0/0000:08:00.0/irq
Modules linked in: sunrpc iptable_filter ip_tables ip6table_filter ip6_tables x_tables ipv6 p4_clockmod dm_multipath scsi_dh joydev usb_storage dcdbas serio_raw pcspkr ehci_hcd uhci_hcd i5000_edac edac_core rtc_cmos rtc_core rtc_lib ext4 jbd2 mptsas mptscsih mptbase scsi_transport_sas [last unloaded: microcode]

Pid: 19454, comm: tst Tainted: G        W  2.6.33-rc5 #102 0DT097/PowerEdge 1950
EIP: 0060:[<c04568bf>] EFLAGS: 00010046 CPU: 0
EIP is at __lock_acquire+0x85/0xb62
EAX: 00000046 EBX: f67a44c0 ECX: 00000464 EDX: 00000000
ESI: 00000000 EDI: 00000000 EBP: f5e7fea8 ESP: f5e7fe44
 DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
Process tst (pid: 19454, ti=f5e7f000 task=f67a44c0 task.ti=f5e7f000)
Stack:
 f5e7fe84 c044c8eb 0000000d 00000000 c0d076b0 00000002 00000028 00000464
<0> 00000000 00003900 00000001 00000002 10001000 00000092 f5e7fe84 c0454e5a
<0> f5e7fe9c c044c920 000001a5 3156330a f67a49ec c0962534 f67a44c0 00000000
Call Trace:
 [<c044c8eb>] ? sched_clock_cpu+0x125/0x12d
 [<c0454e5a>] ? trace_hardirqs_off+0xb/0xd
 [<c044c920>] ? cpu_clock+0x2d/0x4e
 [<c0457425>] ? lock_acquire+0x89/0xa2
 [<c045b97c>] ? do_futex+0x506/0x67e
 [<c073980c>] ? _raw_spin_lock_irq+0x24/0x54
 [<c045b97c>] ? do_futex+0x506/0x67e
 [<c045b97c>] ? do_futex+0x506/0x67e
 [<c0454e99>] ? lock_release_holdtime+0x2b/0xd0
 [<c073c4d2>] ? do_page_fault+0x274/0x2a2
 [<c045bbd5>] ? sys_futex+0xe1/0xf4
 [<c040283b>] ? sysenter_exit+0xf/0x16
 [<c040280c>] ? sysenter_do_call+0x12/0x32
Code: ba b6 0a 00 00 0f 85 9d 0a 00 00 e9 49 03 00 00 83 fa 07 76 0f e8 52 2e 11 00 68 e1 25 8a c0 e9 55 0a 00 00 85 d2 75 0a 8b 4d b8 <8b> 41 04 85 c0 75 12 8b 45 b8 31 c9 e8 b2 e9 ff ff 85 c0 0f 84 
EIP: [<c04568bf>] __lock_acquire+0x85/0xb62 SS:ESP 0068:f5e7fe44
CR2: 0000000000000468


The crash was preceded by several occurences of the following warnings:

WARNING: at kernel/rtmutex-debug.c:192 debug_rt_mutex_unlock+0x6a/0x6c()
Hardware name: PowerEdge 1950
Modules linked in: sunrpc iptable_filter ip_tables ip6table_filter ip6_tables x_tables ipv6 p4_clockmod dm_multipath scsi_dh joydev usb_storage dcdbas serio_raw pcspkr ehci_hcd uhci_hcd i5000_edac edac_core rtc_cmos rtc_core rtc_lib ext4 jbd2 mptsas mptscsih mptbase scsi_transport_sas [last unloaded: microcode]
Pid: 9253, comm: tst Tainted: G        W  2.6.33-rc5 #102
Call Trace:
 [<c043222f>] warn_slowpath_common+0x65/0x7c
 [<c045c812>] ? debug_rt_mutex_unlock+0x6a/0x6c
 [<c0432253>] warn_slowpath_null+0xd/0x10
 [<c045c812>] debug_rt_mutex_unlock+0x6a/0x6c
 [<c073928b>] rt_mutex_slowunlock+0x1e/0x153
 [<c07393c8>] rt_mutex_unlock+0x8/0xa
 [<c045b9fc>] do_futex+0x586/0x67e
 [<c0454e99>] ? lock_release_holdtime+0x2b/0xd0
 [<c073c4d2>] ? do_page_fault+0x274/0x2a2
 [<c045bbd5>] sys_futex+0xe1/0xf4
 [<c040283b>] ? sysenter_exit+0xf/0x16
 [<c040280c>] sysenter_do_call+0x12/0x32


WARNING: at kernel/futex.c:534 lookup_pi_state+0xa3/0x214()
Hardware name: PowerEdge 1950
Modules linked in: sunrpc iptable_filter ip_tables ip6table_filter ip6_tables x_tables ipv6 p4_clockmod dm_multipath scsi_dh joydev usb_storage dcdbas serio_raw pcspkr ehci_hcd uhci_hcd i5000_edac edac_core rtc_cmos rtc_core rtc_lib ext4 jbd2 mptsas mptscsih mptbase scsi_transport_sas [last unloaded: microcode]
Pid: 9002, comm: tst Tainted: G        W  2.6.33-rc5 #102
Call Trace:
 [<c043222f>] warn_slowpath_common+0x65/0x7c
 [<c045a535>] ? lookup_pi_state+0xa3/0x214
 [<c0432253>] warn_slowpath_null+0xd/0x10
 [<c045a535>] lookup_pi_state+0xa3/0x214
 [<c0445e0f>] ? __task_pid_nr_ns+0x65/0x6c
 [<c045a767>] futex_lock_pi_atomic+0xc1/0x109
 [<c045ae13>] futex_lock_pi+0xce/0x224
 [<c044c6bf>] ? sched_clock_local+0x17/0x11e
 [<c044c8eb>] ? sched_clock_cpu+0x125/0x12d
 [<c045ba82>] do_futex+0x60c/0x67e
 [<c0454e5a>] ? trace_hardirqs_off+0xb/0xd
 [<c044c920>] ? cpu_clock+0x2d/0x4e
 [<c0454e6b>] ? lockstat_clock+0xf/0x12
 [<c0454e99>] ? lock_release_holdtime+0x2b/0xd0
 [<c0739ddc>] ? _raw_spin_unlock_irq+0x22/0x26
 [<c04561a3>] ? trace_hardirqs_on_caller+0x100/0x121
 [<c045bbd5>] sys_futex+0xe1/0xf4
 [<c040283b>] ? sysenter_exit+0xf/0x16
 [<c047047b>] ? audit_syscall_entry+0x113/0x135
 [<c040280c>] sysenter_do_call+0x12/0x32


The issue is easily reproducible with the attached reproducer.
Comment 1 Darren Hart 2010-02-02 01:05:48 UTC
I believe we may have two issues here. One is a legitimate WARN_ON() being printed by the kernel, the other is to identify the cause of the OOPS. The test case's for(;;) loop creates threads in both the parent and the child of the fork:

	for(;;) {
		mutex_init (mtx);

		for (i=0; i<PROCESSES; i++) {
			pid = fork();
			if(!pid) break;
			pidtab[i] = pid;
		}
		for (i = 0 ; i < 5 ; i++)
			thread_create (tst1_func, addr);

		if (pid == 0) {
			sleep(10);
		} else {
			usleep(1);
			for (i=0; i<PROCESSES; i++)
				kill (pidtab[i], SIGINT);
		}

		pthread_mutex_destroy(mtx);

But it only kills the threads (via their process) created in the child. This leaves threads that continue running across the for(;;) loop boundaries and more importantly, across the pthread_mutex_destroy() and mutex_init() calls. It is possible for one of these threads to acquire to the lock, then for the mutex to be destroyed and initialized (which sets the futex value to 0) at which point the next thread (in the new set of threads) to try to acquire the lock will do so in userspace via a cmpxchg. At this point the futex value disagrees with the pi_state->owner->pid and the kernel WARN_ON() mentioned above will continue to trigger for the duration of the test. This is evidenced by the mismatch happening between a lower value PID in pi_state->owner->pid and a higher value PID in the futex value.

The second issue of course is the actual OOPS. I thought this might be related to the following patch which isn't in -tip yet:

        Subject:  Re: [PATCH 2.6.33-rc5] futex_lock_pi() key refcnt fix

Unfortunately, I've still been able to hit an OOPS with that applied. I'll continue to investigate.

Can someone tell me if the thread create/destruction mismatch was intentional? From the if (!pid) break; line, I'm guessing it was unintentional. I'm going to try without threads spanning the mutex destroy/init cycle and see if that kills teh warnings and if I can still hit the OOPS.
Comment 2 Darren Hart 2010-02-02 01:11:51 UTC
With the test case "fixed", I can hit a panic much sooner and without any warnings.


BUG: unable to handle kernel NULL pointer dereference at 0000000000000620
IP: [<ffffffff813ee3ef>] _atomic_spin_lock_irq+0x35/0x5f
PGD 403082067 PUD 403089067 PMD 0 
Oops: 0002 [#1] PREEMPT SMP 
last sysfs file: /sys/devices/pci0000:00/0000:00:1c.0/0000:02:00.0/host0/port-0:0/end_device-0:0/target0:0:0/0:0:0:0/vendor
CPU 4 
Modules linked in: ipv6 autofs4 i2c_dev i2c_core hidp rfcomm l2cap bluetooth rfkill sunrpc dm_mirror dm_region_hash dm_log dm_multipath scsi_dh dm_mod video output sbs sbshc battery ac parport_pc lp parport sg bnx2 bnx2x crc32c libcrc32c button i5k_amb hwmon pata_acpi serio_raw i5000_edac ata_generic edac_core shpchp iTCO_wdt iTCO_vendor_support pcspkr ata_piix libata mptsas mptscsih scsi_transport_sas mptbase sd_mod scsi_mod ext3 jbd mbcache uhci_hcd ohci_hcd ehci_hcd
Pid: 4341, comm: tst Not tainted 2.6.31.12-rt20 #1 IBM eServer BladeCenter HS21 -[7995AC1]-
RIP: 0010:[<ffffffff813ee3ef>]  [<ffffffff813ee3ef>] _atomic_spin_lock_irq+0x35/0x5f
RSP: 0018:ffff8804034a7c88  EFLAGS: 00010046
RAX: 0000000000000100 RBX: 0000000000000620 RCX: 0000000079bc8c05
RDX: 0000000000000000 RSI: 00000000800010f5 RDI: 0000000000000001
RBP: ffff8804034a7ca8 R08: 0000000079bc8c05 R09: 00000000800010f5
R10: 0000000079bc8c05 R11: 0000000010001000 R12: ffff88040394a580
R13: ffff88040309bb88 R14: 00000000800010f5 R15: ffff88042bd2df50
FS:  00000000421d2940(0063) GS:ffff8800280f4000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000620 CR3: 0000000403083000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process tst (pid: 4341, threadinfo ffff8804034a6000, task ffff8804039ee600)
Stack:
 ffff88042bd2df40 ffff88040394a580 0000000079bc8c05 ffff88042bd2df40
<0> ffff8804034a7ee8 ffffffff8108fc7b ffffffff813ee3ea 0000000079bc8c05
<0> 000010f500000000 0000000010000000 0000000100000ac0 0000000000000000
Call Trace:
 [<ffffffff8108fc7b>] do_futex+0x6ac/0xbfc
 [<ffffffff813ee3ea>] ? _atomic_spin_lock_irq+0x30/0x5f
 [<ffffffff810cbff4>] ? rb_reserve_next_event+0x26b/0x323
 [<ffffffff810cd35d>] ? ring_buffer_lock_reserve+0x141/0x1d4
 [<ffffffff810cae7f>] ? rb_commit+0x2c/0x47
 [<ffffffff810d0c10>] ? trace_buffer_lock_reserve+0x52/0x7c
 [<ffffffff810cd425>] ? ring_buffer_unlock_commit+0x35/0xb2
 [<ffffffff810ca8b1>] ? ring_buffer_event_data+0x1c/0x32
 [<ffffffff810d1230>] ? trace_function+0xa0/0xc1
 [<ffffffff8100c01b>] ? system_call_fastpath+0x16/0x1b
 [<ffffffff810901dc>] ? sys_futex+0x11/0x157
 [<ffffffff810902f0>] sys_futex+0x125/0x157
 [<ffffffff8100c01b>] system_call_fastpath+0x16/0x1b
Code: 1f 44 00 00 48 89 fb 65 48 8b 04 25 28 00 00 00 48 89 45 f0 31 c0 fa e8 5b 9f ce ff bf 01 00 00 00 e8 c8 2d 00 00 b8 00 01 00 00 <f0> 66 0f c1 03 38 e0 74 06 f3 90 8a 03 eb f6 48 8b 45 f0 65 48 
RIP  [<ffffffff813ee3ef>] _atomic_spin_lock_irq+0x35/0x5f
 RSP <ffff8804034a7c88>
CR2: 0000000000000620
Comment 3 Darren Hart 2010-02-02 08:16:46 UTC
Created attachment 24868 [details]
Fixed testcase and some trace_marker code

The old test case created threads in both the parent and the child of the fork, but only killed the ones created in the child. This left threads running across the for(;;) loop iterations, which also spans the mutex_destroy/init() calls. This caused corruption of the futex value, which lead to the warnings. This test case can still crash the kernel, much faster than before, and it does so without the warnings.
Comment 4 Darren Hart 2010-02-02 08:22:54 UTC
With the testcase from comment #3 on tip/master + the futex_lock_pi ref counting patch, I see an Oops almost immediately.

BUG: unable to handle kernel NULL pointer dereference at 00000000000005c0
IP: [<ffffffff81324277>] _raw_spin_lock_irq+0x26/0x3a
PGD 40e65c067 PUD 40e65d067 PMD 0 
Oops: 0002 [#1] PREEMPT SMP 
last sysfs file: /sys/class/net/eth3/address
CPU 5 
Pid: 4042, comm: tst Not tainted 2.6.33-rc6-tip+ #1 Server Blade/IBM eServer BladeCenter HS21 -[7995AC1]-
RIP: 0010:[<ffffffff81324277>]  [<ffffffff81324277>] _raw_spin_lock_irq+0x26/0x3a
RSP: 0018:ffff88040e401c98  EFLAGS: 00010002
RAX: 0000000000000100 RBX: 00000000000005c0 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000080000fca RDI: 0000000000000001
RBP: ffff88040e401ca8 R08: 0000000000000000 R09: 00000000e2b6670c
R10: 0000000000000007 R11: 0000000010001000 R12: ffff88040e629b98
R13: ffff88040e460440 R14: ffff880429f996d0 R15: 0000000080000fca
FS:  0000000041001940(0063) GS:ffff880028340000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00000000000005c0 CR3: 000000040e65b000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process tst (pid: 4042, threadinfo ffff88040e400000, task ffff88040d79c080)
Stack:
 ffff88040e460440 ffff880429f996c0 ffff88040e401ee8 ffffffff81067199
<0> ffff88040e5a8000 ffff8800283d3380 ffff88042cbbc0b8 00000fca00000000
<0> 0000000010000000 0000000100000046 ffff88002834ef20 ffff88042e4744c0
Call Trace:
 [<ffffffff81067199>] do_futex+0x6cd/0xbf6
 [<ffffffff810b7a9a>] ? get_pageblock_flags_group+0x4/0x99
 [<ffffffff810b951e>] ? free_hot_cold_page+0x1cc/0x1f9
 [<ffffffff810b7a9a>] ? get_pageblock_flags_group+0x4/0x99
 [<ffffffff810b951e>] ? free_hot_cold_page+0x1cc/0x1f9
 [<ffffffff810b7a9a>] ? get_pageblock_flags_group+0x4/0x99
 [<ffffffff8109ac2a>] ? trace_buffer_lock_reserve+0x22/0x56
 [<ffffffff810e4d90>] ? kmem_cache_free+0x11/0x1db
 [<ffffffff8109ac2a>] ? trace_buffer_lock_reserve+0x22/0x56
 [<ffffffff810676d3>] ? sys_futex+0x11/0x134
 [<ffffffff81002b9b>] ? system_call_fastpath+0x16/0x1b
 [<ffffffff810676d3>] ? sys_futex+0x11/0x134
 [<ffffffff81002b9b>] ? system_call_fastpath+0x16/0x1b
 [<ffffffff810676d3>] ? sys_futex+0x11/0x134
 [<ffffffff810677d8>] sys_futex+0x116/0x134
 [<ffffffff81002b9b>] system_call_fastpath+0x16/0x1b
Code: f6 5f 5b c9 c3 55 48 89 e5 53 48 83 ec 08 e8 31 e6 cd ff 48 89 fb fa e8 7a ce d7 ff bf 01 00 00 00 e8 10 2e 00 00 b8 00 01 00 00 <f0> 66 0f c1 03 38 e0 74 06 f3 90 8a 03 eb f6 41 58 5b c9 c3 55 
RIP  [<ffffffff81324277>] _raw_spin_lock_irq+0x26/0x3a
 RSP <ffff88040e401c98>
CR2: 00000000000005c0
Comment 5 Darren Hart 2010-02-03 17:10:07 UTC
Patches committed to tip:

Commit-ID:  51246bfd189064079c54421507236fd2723b18f3
Gitweb:     http://git.kernel.org/tip/51246bfd189064079c54421507236fd2723b18f3
Author:     Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Tue, 2 Feb 2010 11:40:27 +0100
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Wed, 3 Feb 2010 15:13:22 +0100

futex: Handle user space corruption gracefully

If the owner of a PI futex dies we fix up the pi_state and set
pi_state->owner to NULL. When a malicious or just sloppy programmed
user space application sets the futex value to 0 e.g. by calling
pthread_mutex_init(), then the futex can be acquired again. A new
waiter manages to enqueue itself on the pi_state w/o damage, but on
unlock the kernel dereferences pi_state->owner and oopses.

Prevent this by checking pi_state->owner in the unlock path. If
pi_state->owner is not current we know that user space manipulated the
futex value. Ignore the mess and return -EINVAL.

This catches the above case and also the case where a task hijacks the
futex by setting the tid value and then tries to unlock it.

Reported-by: Jermome Marchand <jmarchan@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Darren Hart <dvhltc@us.ibm.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: <stable@kernel.org>





Commit-ID:  59647b6ac3050dd964bc556fe6ef22f4db5b935c
Gitweb:     http://git.kernel.org/tip/59647b6ac3050dd964bc556fe6ef22f4db5b935c
Author:     Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Wed, 3 Feb 2010 09:33:05 +0100
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Wed, 3 Feb 2010 15:13:22 +0100

futex: Handle futex value corruption gracefully

The WARN_ON in lookup_pi_state which complains about a mismatch
between pi_state->owner->pid and the pid which we retrieved from the
user space futex is completely bogus.

The code just emits the warning and then continues despite the fact
that it detected an inconsistent state of the futex. A conveniant way
for user space to spam the syslog.

Replace the WARN_ON by a consistency check. If the values do not match
return -EINVAL and let user space deal with the mess it created.

This also fixes the missing task_pid_vnr() when we compare the
pi_state->owner pid with the futex value.

Reported-by: Jermome Marchand <jmarchan@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Darren Hart <dvhltc@us.ibm.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: <stable@kernel.org>