Bug 42882

Summary: fsnotify_mark kthread oops in 3.2.9, list_add corruption warnings follow
Product: Other Reporter: Valentin Avram (valentin.avram)
Component: Loadable Security Modules (LSM)Assignee: Other/LSM (other_lsm)
Status: NEW ---    
Severity: normal CC: alan, kernel, mpagano
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 3.2.9 Subsystem:
Regression: No Bisected commit-id:
Attachments: The 3.2.9 kernel config used

Description Valentin Avram 2012-03-07 08:31:25 UTC
Created attachment 72549 [details]
The 3.2.9 kernel config used

Hello.

Old story background:
While trying to workaround a buggy old version of auditd, we used a script that used to restart the auditd daemon once a day. After a while (about a month), the kernel on that machine would just randomly freeze/panic on an auditd restart.

Fast forward to present:
Based on the story above, we test every new kernel we might use in production against a low-powered-auditd-restart-stress-test, that is with the server doing nothing, we just start and stop auditd with 5 seconds delay between operations.

Running kernel 3.2.9 (latest on kernel.org) with the configuration i will attach and using auditd-2.1.3 (one version behind the latest 2.2.0) for the "start-and-stop-stress-test", after about 5 minutes the kernel gives an oops on kernel thread fsnotify_mark. The EIP is either at
__list_del_entry+0x20/0xe0
or at
fsnotify_mark_destroy+0x87/0x130

After this oops the kernel thread [fsnotify_mark] (visible with ps) crashes.

Also, after this oops, a kernel compiled with DEBUG_INFO and DEBUG_LIST keeps logging warnings of "list_add corruption".

Kernels tested and affected are:
- 2.6.37-gentoo-r4
- 3.0.6-gentoo (extended test: after about an hour kernel reports CPU core stalls)
- 3.1.10-gentoo-r1
- 3.2.1-gentoo-r2
- 3.2.9

The OOPSes in 3.2.9:
1. in fsnotify_mark_destroy (auditd-loop started at 08:00:17)
2012-03-02T08:04:16.218595+02:00 quick158 kernel: BUG: unable to handle kernel NULL pointer dereference at 00000003
2012-03-02T08:04:16.218614+02:00 quick158 kernel: IP: [<c10f6a97>] fsnotify_mark_destroy+0x87/0x130
2012-03-02T08:04:16.218618+02:00 quick158 kernel: *pdpt = 0000000033fb5001 *pde = 0000000000000000 
2012-03-02T08:04:16.218620+02:00 quick158 kernel: Oops: 0002 [#1] SMP 
2012-03-02T08:04:16.218621+02:00 quick158 kernel: 
2012-03-02T08:04:16.218624+02:00 quick158 kernel: Pid: 641, comm: fsnotify_mark Not tainted 3.2.9-drbd-version3 #1 Dell Inc. PowerEdge 2950/0CX396
2012-03-02T08:04:16.218626+02:00 quick158 kernel: EIP: 0060:[<c10f6a97>] EFLAGS: 00010212 CPU: 5
2012-03-02T08:04:16.218628+02:00 quick158 kernel: EIP is at fsnotify_mark_destroy+0x87/0x130
2012-03-02T08:04:16.218630+02:00 quick158 kernel: EAX: f4d3a688 EBX: f47ddfa8 ECX: ffffffff EDX: f4d3a6c4
2012-03-02T08:04:16.218646+02:00 quick158 kernel: ESI: f47a6800 EDI: ffffffc3 EBP: f47ddfa8 ESP: f47ddf90
2012-03-02T08:04:16.218648+02:00 quick158 kernel: DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
2012-03-02T08:04:16.218651+02:00 quick158 kernel: Process fsnotify_mark (pid: 641, ti=f47dc000 task=f47a6800 task.ti=f47dc000)
2012-03-02T08:04:16.218652+02:00 quick158 kernel: Stack:
2012-03-02T08:04:16.218655+02:00 quick158 kernel: f47a6800 00000000 f47a6800 c10502c0 f47ddfa0 f47ddfa0 f4d3a6c4 f4e0c244
2012-03-02T08:04:16.218657+02:00 quick158 kernel: f4c47f68 00000000 c10f6a10 00000000 c104f9c4 00000000 00000000 00000000
2012-03-02T08:04:16.218660+02:00 quick158 kernel: 00000000 f47ddfd4 f47ddfd4 00000000 c104f950 f4c47f68 c15914f6 00000000
2012-03-02T08:04:16.218683+02:00 quick158 kernel: Call Trace:
2012-03-02T08:04:16.218687+02:00 quick158 kernel: [<c10502c0>] ? abort_exclusive_wait+0x90/0x90
2012-03-02T08:04:16.218689+02:00 quick158 kernel: [<c10f6a10>] ? fsnotify_put_mark+0x20/0x20
2012-03-02T08:04:16.218690+02:00 quick158 kernel: [<c104f9c4>] ? kthread+0x74/0x80
2012-03-02T08:04:16.218692+02:00 quick158 kernel: [<c104f950>] ? kthread_flush_work_fn+0x10/0x10
2012-03-02T08:04:16.218694+02:00 quick158 kernel: [<c15914f6>] ? kernel_thread_helper+0x6/0xd
2012-03-02T08:04:16.218698+02:00 quick158 kernel: Code: 30 18 8c c1 e8 5b e7 f5 ff 8b 54 24 18 8d 42 c4 39 da 8b 48 3c 8d 79 c4 75 0e eb 2d 90 8d b4 26 00 00 00 00 89 f8 89 ef 8b 68 40 <89> 69 04 89 4d 00 89 50 3c 89 50 40 e8 48 ff ff ff
 8b 4f 3c 8d 
2012-03-02T08:04:16.218703+02:00 quick158 kernel: EIP: [<c10f6a97>] fsnotify_mark_destroy+0x87/0x130 SS:ESP 0068:f47ddf90
2012-03-02T08:04:16.218704+02:00 quick158 kernel: CR2: 0000000000000003
2012-03-02T08:04:16.218706+02:00 quick158 kernel: ---[ end trace a37de83f69eaa3ad ]---

2. in __list_del_entry (auditd-loop started at 08:49:43)
2012-03-02T08:53:27.187303+02:00 quick158 kernel: [  301.240011] BUG: unable to handle kernel NULL pointer dereference at   (null)
2012-03-02T08:53:27.187318+02:00 quick158 kernel: [  301.240305] IP: [<c1238dd0>] __list_del_entry+0x20/0xe0
2012-03-02T08:53:27.187322+02:00 quick158 kernel: [  301.240481] *pdpt = 0000000000000000 *pde = f000ddc8f000ddc8 
2012-03-02T08:53:27.187324+02:00 quick158 kernel: [  301.240698] Oops: 0000 [#1] SMP 
2012-03-02T08:53:27.187325+02:00 quick158 kernel: [  301.240910] 
2012-03-02T08:53:27.187328+02:00 quick158 kernel: [  301.241030] Pid: 642, comm: fsnotify_mark Not tainted 3.2.9-drbd-version3 #1 Dell Inc. PowerEdge 2950/0CX396
2012-03-02T08:53:27.187330+02:00 quick158 kernel: [  301.241370] EIP: 0060:[<c1238dd0>] EFLAGS: 00010287 CPU: 6
2012-03-02T08:53:27.187332+02:00 quick158 kernel: [  301.241498] EIP is at __list_del_entry+0x20/0xe0
2012-03-02T08:53:27.187344+02:00 quick158 kernel: [  301.241623] EAX: f4fae544 EBX: f47cffa4 ECX: ffffffff EDX: 00000000
2012-03-02T08:53:27.187346+02:00 quick158 kernel: [  301.241751] ESI: f4fae544 EDI: f4fae508 EBP: f47cff7c ESP: f47cff64
2012-03-02T08:53:27.187349+02:00 quick158 kernel: [  301.241879]  DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
2012-03-02T08:53:27.187351+02:00 quick158 kernel: [  301.242005] Process fsnotify_mark (pid: 642, ti=f47ce000 task=f4f47c00 task.ti=f47ce000)
2012-03-02T08:53:27.187353+02:00 quick158 kernel: [  301.242207] Stack:
2012-03-02T08:53:27.187356+02:00 quick158 kernel: [  301.242327]  c10813c0 f47cffa4 f4f47c00 f4e70888 f47cff7c f47cffa4 f47cffb8 c10f6976
2012-03-02T08:53:27.187358+02:00 quick158 kernel: [  301.242882]  ffffffc3 f4f47c00 f4f47c00 00000000 f4f47c00 c10530c0 f47cff9c f47cff9c
2012-03-02T08:53:27.187361+02:00 quick158 kernel: [  301.243438]  f4fae544 f4fae544 f4c47f58 00000000 c10f68f0 f47cffe4 c1052834 00000000
2012-03-02T08:53:27.187362+02:00 quick158 kernel: [  301.243995] Call Trace:
2012-03-02T08:53:27.187365+02:00 quick158 kernel: [  301.244119]  [<c10813c0>] ? rcu_check_callbacks+0x110/0x110
2012-03-02T08:53:27.187367+02:00 quick158 kernel: [  301.244248]  [<c10f6976>] fsnotify_mark_destroy+0x86/0x120
2012-03-02T08:53:27.187369+02:00 quick158 kernel: [  301.244377]  [<c10530c0>] ? abort_exclusive_wait+0x80/0x80
2012-03-02T08:53:27.187372+02:00 quick158 kernel: [  301.244504]  [<c10f68f0>] ? fsnotify_put_mark+0x30/0x30
2012-03-02T08:53:27.187374+02:00 quick158 kernel: [  301.244631]  [<c1052834>] kthread+0x74/0x80
2012-03-02T08:53:27.187376+02:00 quick158 kernel: [  301.244756]  [<c10527c0>] ? kthread_flush_work_fn+0x10/0x10
2012-03-02T08:53:27.187379+02:00 quick158 kernel: [  301.244885]  [<c1582ab6>] kernel_thread_helper+0x6/0xd
2012-03-02T08:53:27.187383+02:00 quick158 kernel: [  301.245011] Code: 55 f4 8b 45 f8 e9 75 ff ff ff 90 55 89 e5 53 83 ec 14 8b 08 8b 50 04 81 f9 00 01 10 00 74 24 81 fa 00 02 20 00 0f 84 8e 00 00 00 <8b> 1a 39 d8 75 62 8b 59 04 39 d8 75 35 89 51 04 89 0a 83 c4 14 
2012-03-02T08:53:27.187386+02:00 quick158 kernel: [  301.248195] EIP: [<c1238dd0>] __list_del_entry+0x20/0xe0 SS:ESP 0068:f47cff64
2012-03-02T08:53:27.187388+02:00 quick158 kernel: [  301.248414] CR2: 0000000000000000
2012-03-02T08:53:27.187390+02:00 quick158 kernel: [  301.248538] ---[ end trace 15082dbfb353f84c ]---

After analyzing the warnings of list_add corruption, out of 805 occurences, there are at least 26 different types of traces+pointer values (null/not null). All of them occur either in:
- kernel thread of audit_prune_tree
kernel/audit_tree.c:    kthread_run(prune_tree_thread, NULL, "audit_prune_tree");
- auditctl (of auditd)

Kernel and auditd were compiled with gcc 4.5.3 and gcc flags -O2 -march=core2 -pipe -msse4.1 -fomit-frame-pointer.

quick158 ~ # gcc -v
Using built-in specs.
COLLECT_GCC=/usr/i686-pc-linux-gnu/gcc-bin/4.5.3/gcc
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/i686-pc-linux-gnu/4.5.3/lto-wrapper
Target: i686-pc-linux-gnu
Configured with: /var/tmp/portage/sys-devel/gcc-4.5.3-r1/work/gcc-4.5.3/configure --prefix=/usr --bindir=/usr/i686-pc-linux-gnu/gcc-bin/4.5.3 --includedir=/usr/lib/gcc/i686-pc-linux-gnu/4.5.3/include --datadir=/usr/share/gcc-data/i686-pc-linux-gnu/4.5.3 --mandir=/usr/share/gcc-data/i686-pc-linux-gnu/4.5.3/man --infodir=/usr/share/gcc-data/i686-pc-linux-gnu/4.5.3/info --with-gxx-include-dir=/usr/lib/gcc/i686-pc-linux-gnu/4.5.3/include/g++-v4 --host=i686-pc-linux-gnu --build=i686-pc-linux-gnu --disable-altivec --disable-fixed-point --without-ppl --without-cloog --disable-lto --enable-nls --without-included-gettext --with-system-zlib --disable-werror --enable-secureplt --disable-multilib --enable-libmudflap --disable-libssp --enable-libgomp --with-python-dir=/share/gcc-data/i686-pc-linux-gnu/4.5.3/python --enable-checking=release --disable-libgcj --with-arch=i686 --enable-languages=c,c++,fortran --enable-shared --enable-threads=posix --enable-__cxa_atexit --enable-clocale=gnu --enable-targets=all --with-bugurl=http://bugs.gentoo.org/ --with-pkgversion='Gentoo 4.5.3-r1 p1.0, pie-0.4.5'
Thread model: posix
gcc version 4.5.3 (Gentoo 4.5.3-r1 p1.0, pie-0.4.5)

Auditd is compiled plain, without any ldap, prelude or python support.

More data (traces, configs, oopses, screenshots) can be found at:
https://bugs.gentoo.org/show_bug.cgi?id=389405
https://www.redhat.com/archives/linux-audit/2012-January/msg00057.html
https://www.redhat.com/archives/linux-audit/2012-February/msg00007.html
https://www.redhat.com/archives/linux-audit/2012-March/msg00004.html

Thank you for your time.
Comment 1 Valentin Avram 2012-03-22 07:34:53 UTC
Reported to Gentoo:
https://bugs.gentoo.org/show_bug.cgi?id=389405

Reported to LKML:
https://lkml.org/lkml/2012/3/13/200

Reported to audit/redhat:
https://www.redhat.com/archives/linux-audit/2012-March/msg00004.html

I really hope somebody will at least notice and ack this problem.