Created attachment 72549 [details] The 3.2.9 kernel config used Hello. Old story background: While trying to workaround a buggy old version of auditd, we used a script that used to restart the auditd daemon once a day. After a while (about a month), the kernel on that machine would just randomly freeze/panic on an auditd restart. Fast forward to present: Based on the story above, we test every new kernel we might use in production against a low-powered-auditd-restart-stress-test, that is with the server doing nothing, we just start and stop auditd with 5 seconds delay between operations. Running kernel 3.2.9 (latest on kernel.org) with the configuration i will attach and using auditd-2.1.3 (one version behind the latest 2.2.0) for the "start-and-stop-stress-test", after about 5 minutes the kernel gives an oops on kernel thread fsnotify_mark. The EIP is either at __list_del_entry+0x20/0xe0 or at fsnotify_mark_destroy+0x87/0x130 After this oops the kernel thread [fsnotify_mark] (visible with ps) crashes. Also, after this oops, a kernel compiled with DEBUG_INFO and DEBUG_LIST keeps logging warnings of "list_add corruption". Kernels tested and affected are: - 2.6.37-gentoo-r4 - 3.0.6-gentoo (extended test: after about an hour kernel reports CPU core stalls) - 3.1.10-gentoo-r1 - 3.2.1-gentoo-r2 - 3.2.9 The OOPSes in 3.2.9: 1. in fsnotify_mark_destroy (auditd-loop started at 08:00:17) 2012-03-02T08:04:16.218595+02:00 quick158 kernel: BUG: unable to handle kernel NULL pointer dereference at 00000003 2012-03-02T08:04:16.218614+02:00 quick158 kernel: IP: [<c10f6a97>] fsnotify_mark_destroy+0x87/0x130 2012-03-02T08:04:16.218618+02:00 quick158 kernel: *pdpt = 0000000033fb5001 *pde = 0000000000000000 2012-03-02T08:04:16.218620+02:00 quick158 kernel: Oops: 0002 [#1] SMP 2012-03-02T08:04:16.218621+02:00 quick158 kernel: 2012-03-02T08:04:16.218624+02:00 quick158 kernel: Pid: 641, comm: fsnotify_mark Not tainted 3.2.9-drbd-version3 #1 Dell Inc. PowerEdge 2950/0CX396 2012-03-02T08:04:16.218626+02:00 quick158 kernel: EIP: 0060:[<c10f6a97>] EFLAGS: 00010212 CPU: 5 2012-03-02T08:04:16.218628+02:00 quick158 kernel: EIP is at fsnotify_mark_destroy+0x87/0x130 2012-03-02T08:04:16.218630+02:00 quick158 kernel: EAX: f4d3a688 EBX: f47ddfa8 ECX: ffffffff EDX: f4d3a6c4 2012-03-02T08:04:16.218646+02:00 quick158 kernel: ESI: f47a6800 EDI: ffffffc3 EBP: f47ddfa8 ESP: f47ddf90 2012-03-02T08:04:16.218648+02:00 quick158 kernel: DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 2012-03-02T08:04:16.218651+02:00 quick158 kernel: Process fsnotify_mark (pid: 641, ti=f47dc000 task=f47a6800 task.ti=f47dc000) 2012-03-02T08:04:16.218652+02:00 quick158 kernel: Stack: 2012-03-02T08:04:16.218655+02:00 quick158 kernel: f47a6800 00000000 f47a6800 c10502c0 f47ddfa0 f47ddfa0 f4d3a6c4 f4e0c244 2012-03-02T08:04:16.218657+02:00 quick158 kernel: f4c47f68 00000000 c10f6a10 00000000 c104f9c4 00000000 00000000 00000000 2012-03-02T08:04:16.218660+02:00 quick158 kernel: 00000000 f47ddfd4 f47ddfd4 00000000 c104f950 f4c47f68 c15914f6 00000000 2012-03-02T08:04:16.218683+02:00 quick158 kernel: Call Trace: 2012-03-02T08:04:16.218687+02:00 quick158 kernel: [<c10502c0>] ? abort_exclusive_wait+0x90/0x90 2012-03-02T08:04:16.218689+02:00 quick158 kernel: [<c10f6a10>] ? fsnotify_put_mark+0x20/0x20 2012-03-02T08:04:16.218690+02:00 quick158 kernel: [<c104f9c4>] ? kthread+0x74/0x80 2012-03-02T08:04:16.218692+02:00 quick158 kernel: [<c104f950>] ? kthread_flush_work_fn+0x10/0x10 2012-03-02T08:04:16.218694+02:00 quick158 kernel: [<c15914f6>] ? kernel_thread_helper+0x6/0xd 2012-03-02T08:04:16.218698+02:00 quick158 kernel: Code: 30 18 8c c1 e8 5b e7 f5 ff 8b 54 24 18 8d 42 c4 39 da 8b 48 3c 8d 79 c4 75 0e eb 2d 90 8d b4 26 00 00 00 00 89 f8 89 ef 8b 68 40 <89> 69 04 89 4d 00 89 50 3c 89 50 40 e8 48 ff ff ff 8b 4f 3c 8d 2012-03-02T08:04:16.218703+02:00 quick158 kernel: EIP: [<c10f6a97>] fsnotify_mark_destroy+0x87/0x130 SS:ESP 0068:f47ddf90 2012-03-02T08:04:16.218704+02:00 quick158 kernel: CR2: 0000000000000003 2012-03-02T08:04:16.218706+02:00 quick158 kernel: ---[ end trace a37de83f69eaa3ad ]--- 2. in __list_del_entry (auditd-loop started at 08:49:43) 2012-03-02T08:53:27.187303+02:00 quick158 kernel: [ 301.240011] BUG: unable to handle kernel NULL pointer dereference at (null) 2012-03-02T08:53:27.187318+02:00 quick158 kernel: [ 301.240305] IP: [<c1238dd0>] __list_del_entry+0x20/0xe0 2012-03-02T08:53:27.187322+02:00 quick158 kernel: [ 301.240481] *pdpt = 0000000000000000 *pde = f000ddc8f000ddc8 2012-03-02T08:53:27.187324+02:00 quick158 kernel: [ 301.240698] Oops: 0000 [#1] SMP 2012-03-02T08:53:27.187325+02:00 quick158 kernel: [ 301.240910] 2012-03-02T08:53:27.187328+02:00 quick158 kernel: [ 301.241030] Pid: 642, comm: fsnotify_mark Not tainted 3.2.9-drbd-version3 #1 Dell Inc. PowerEdge 2950/0CX396 2012-03-02T08:53:27.187330+02:00 quick158 kernel: [ 301.241370] EIP: 0060:[<c1238dd0>] EFLAGS: 00010287 CPU: 6 2012-03-02T08:53:27.187332+02:00 quick158 kernel: [ 301.241498] EIP is at __list_del_entry+0x20/0xe0 2012-03-02T08:53:27.187344+02:00 quick158 kernel: [ 301.241623] EAX: f4fae544 EBX: f47cffa4 ECX: ffffffff EDX: 00000000 2012-03-02T08:53:27.187346+02:00 quick158 kernel: [ 301.241751] ESI: f4fae544 EDI: f4fae508 EBP: f47cff7c ESP: f47cff64 2012-03-02T08:53:27.187349+02:00 quick158 kernel: [ 301.241879] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 2012-03-02T08:53:27.187351+02:00 quick158 kernel: [ 301.242005] Process fsnotify_mark (pid: 642, ti=f47ce000 task=f4f47c00 task.ti=f47ce000) 2012-03-02T08:53:27.187353+02:00 quick158 kernel: [ 301.242207] Stack: 2012-03-02T08:53:27.187356+02:00 quick158 kernel: [ 301.242327] c10813c0 f47cffa4 f4f47c00 f4e70888 f47cff7c f47cffa4 f47cffb8 c10f6976 2012-03-02T08:53:27.187358+02:00 quick158 kernel: [ 301.242882] ffffffc3 f4f47c00 f4f47c00 00000000 f4f47c00 c10530c0 f47cff9c f47cff9c 2012-03-02T08:53:27.187361+02:00 quick158 kernel: [ 301.243438] f4fae544 f4fae544 f4c47f58 00000000 c10f68f0 f47cffe4 c1052834 00000000 2012-03-02T08:53:27.187362+02:00 quick158 kernel: [ 301.243995] Call Trace: 2012-03-02T08:53:27.187365+02:00 quick158 kernel: [ 301.244119] [<c10813c0>] ? rcu_check_callbacks+0x110/0x110 2012-03-02T08:53:27.187367+02:00 quick158 kernel: [ 301.244248] [<c10f6976>] fsnotify_mark_destroy+0x86/0x120 2012-03-02T08:53:27.187369+02:00 quick158 kernel: [ 301.244377] [<c10530c0>] ? abort_exclusive_wait+0x80/0x80 2012-03-02T08:53:27.187372+02:00 quick158 kernel: [ 301.244504] [<c10f68f0>] ? fsnotify_put_mark+0x30/0x30 2012-03-02T08:53:27.187374+02:00 quick158 kernel: [ 301.244631] [<c1052834>] kthread+0x74/0x80 2012-03-02T08:53:27.187376+02:00 quick158 kernel: [ 301.244756] [<c10527c0>] ? kthread_flush_work_fn+0x10/0x10 2012-03-02T08:53:27.187379+02:00 quick158 kernel: [ 301.244885] [<c1582ab6>] kernel_thread_helper+0x6/0xd 2012-03-02T08:53:27.187383+02:00 quick158 kernel: [ 301.245011] Code: 55 f4 8b 45 f8 e9 75 ff ff ff 90 55 89 e5 53 83 ec 14 8b 08 8b 50 04 81 f9 00 01 10 00 74 24 81 fa 00 02 20 00 0f 84 8e 00 00 00 <8b> 1a 39 d8 75 62 8b 59 04 39 d8 75 35 89 51 04 89 0a 83 c4 14 2012-03-02T08:53:27.187386+02:00 quick158 kernel: [ 301.248195] EIP: [<c1238dd0>] __list_del_entry+0x20/0xe0 SS:ESP 0068:f47cff64 2012-03-02T08:53:27.187388+02:00 quick158 kernel: [ 301.248414] CR2: 0000000000000000 2012-03-02T08:53:27.187390+02:00 quick158 kernel: [ 301.248538] ---[ end trace 15082dbfb353f84c ]--- After analyzing the warnings of list_add corruption, out of 805 occurences, there are at least 26 different types of traces+pointer values (null/not null). All of them occur either in: - kernel thread of audit_prune_tree kernel/audit_tree.c: kthread_run(prune_tree_thread, NULL, "audit_prune_tree"); - auditctl (of auditd) Kernel and auditd were compiled with gcc 4.5.3 and gcc flags -O2 -march=core2 -pipe -msse4.1 -fomit-frame-pointer. quick158 ~ # gcc -v Using built-in specs. COLLECT_GCC=/usr/i686-pc-linux-gnu/gcc-bin/4.5.3/gcc COLLECT_LTO_WRAPPER=/usr/libexec/gcc/i686-pc-linux-gnu/4.5.3/lto-wrapper Target: i686-pc-linux-gnu Configured with: /var/tmp/portage/sys-devel/gcc-4.5.3-r1/work/gcc-4.5.3/configure --prefix=/usr --bindir=/usr/i686-pc-linux-gnu/gcc-bin/4.5.3 --includedir=/usr/lib/gcc/i686-pc-linux-gnu/4.5.3/include --datadir=/usr/share/gcc-data/i686-pc-linux-gnu/4.5.3 --mandir=/usr/share/gcc-data/i686-pc-linux-gnu/4.5.3/man --infodir=/usr/share/gcc-data/i686-pc-linux-gnu/4.5.3/info --with-gxx-include-dir=/usr/lib/gcc/i686-pc-linux-gnu/4.5.3/include/g++-v4 --host=i686-pc-linux-gnu --build=i686-pc-linux-gnu --disable-altivec --disable-fixed-point --without-ppl --without-cloog --disable-lto --enable-nls --without-included-gettext --with-system-zlib --disable-werror --enable-secureplt --disable-multilib --enable-libmudflap --disable-libssp --enable-libgomp --with-python-dir=/share/gcc-data/i686-pc-linux-gnu/4.5.3/python --enable-checking=release --disable-libgcj --with-arch=i686 --enable-languages=c,c++,fortran --enable-shared --enable-threads=posix --enable-__cxa_atexit --enable-clocale=gnu --enable-targets=all --with-bugurl=http://bugs.gentoo.org/ --with-pkgversion='Gentoo 4.5.3-r1 p1.0, pie-0.4.5' Thread model: posix gcc version 4.5.3 (Gentoo 4.5.3-r1 p1.0, pie-0.4.5) Auditd is compiled plain, without any ldap, prelude or python support. More data (traces, configs, oopses, screenshots) can be found at: https://bugs.gentoo.org/show_bug.cgi?id=389405 https://www.redhat.com/archives/linux-audit/2012-January/msg00057.html https://www.redhat.com/archives/linux-audit/2012-February/msg00007.html https://www.redhat.com/archives/linux-audit/2012-March/msg00004.html Thank you for your time.
Reported to Gentoo: https://bugs.gentoo.org/show_bug.cgi?id=389405 Reported to LKML: https://lkml.org/lkml/2012/3/13/200 Reported to audit/redhat: https://www.redhat.com/archives/linux-audit/2012-March/msg00004.html I really hope somebody will at least notice and ack this problem.