Running a machine with 8 dualcore Opterons 875 (16cores total), which makes it far more sensitive to locking-, timing- and racing bugs than the average linux- machine. Running a multi-threaded app (16 threads) that reads small portions of data from a large file (stored on remote server, accessed over NFS-udp), and write back to another file - same NFS-share. System crashes really hard (beyond softwatchdog- and oops-recovery) anywhere between 1 and 15 minutes. On 2.6.15.x, problem appeared first, was fixed by kernelpatches to file.c and pagelist.c. Went to 2.6.17.6 for the much improved multi-dualcore support, same problem appeared - unfortunately the original patch is integrated already, so it must be something else this time. Kernel does not OOPS, but it locks up - on all CPUs, according to the logs. Kernel BUG at mm/truncate.c:76 invalid opcode: 0000 [1] SMP CPU 14 Modules linked in: nfs netconsole sch_sfq cls_u32 sch_tbf sch_prio iptable_filter ip_tables x_tables nfsd exportfs lockd 8250 seri al_core ipv6 parport_pc lp parport autofs4 sunrpc w83627hf_wdt binfmt_misc xfs dm_mod video button battery ac ohci1394 ieee1394 oh ci_hcd ehci_hcd i2c_nforce2 i2c_core tg3 floppy ide_cd cdrom Pid: 8086, comm: dipfilter.x Not tainted 2.6.17.6 #1 RIP: 0010:[<ffffffff8025eb96>] <ffffffff8025eb96>{invalidate_complete_page+86} RSP: 0018:ffff810393c09ca8 EFLAGS: 00010002 RAX: 0000000000000825 RBX: ffff8105ff8d86f0 RCX: ffff8103f271bb08 RDX: 0000000000000000 RSI: ffff810393c09c48 RDI: ffff8103f271bd88 RBP: ffff8103f271bd70 R08: 0000000000000001 R09: 000000000000002c R10: 000000000000002c R11: ffff8105f8a26240 R12: ffff810393c09e08 R13: 0000000000000000 R14: ffff8103f271bd70 R15: 00000000000d1a2c FS: 00002accace13dc0(0000) GS:ffff810e001bd9c0(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 00002b050fbff000 CR3: 0000000ff7c21000 CR4: 00000000000006e0 Process dipfilter.x (pid: 8086, threadinfo ffff810393c08000, task ffff8103f8b70280) Stack: 00000000000d1a2b ffff8105ff8d86f0 0000000000000000 ffffffff8025f0c0 0000000000000000 00000000f271bc28 0000000000000000 ffffffffffffffff 000000000000000e 0000000000000000 Call Trace: <ffffffff8025f0c0>{invalidate_inode_pages2_range+320} <ffffffff882042d9>{:nfs:nfs_revalidate_mapping+105} <ffffffff88202e29>{:nfs:nfs_file_write+169} <ffffffff8027d710> {do_sync_write+208} <ffffffff802424e0>{autoremove_wake_function+0} <ffffffff80227a00> {default_wake_function+0} <ffffffff8027d80f>{vfs_write+191} <ffffffff8027d9a3>{sys_write+83} <ffffffff80209c06>{system_call+126} Code: 0f 0b 68 dc f2 46 80 c2 4c 00 48 89 df e8 18 74 ff ff f0 81 RIP <ffffffff8025eb96>{invalidate_complete_page+86} RSP <ffff810393c09ca8> NMI Watchdog detected LOCKUP on CPU 10 CPU 10 Modules linked in: nfs netconsole sch_sfq cls_u32 sch_tbf sch_prio iptable_filter ip_tables x_tables nfsd exportfs lockd 8250 seri al_core ipv6 parport_pc lp parport autofs4 sunrpc w83627hf_wdt binfmt_misc xfs dm_mod video button battery ac ohci1394 ieee1394 oh ci_hcd ehci_hcd i2c_nforce2 i2c_core tg3 floppy ide_cd cdrom Pid: 8106, comm: dipfilter.x Not tainted 2.6.17.6 #1 RIP: 0010:[<ffffffff80309579>] <ffffffff80309579>{__read_lock_failed+5} RSP: 0018:ffff8103f8647c08 EFLAGS: 00000097 RAX: ffff8103f271bd88 RBX: 00000000000d2431 RCX: ffff8103f8647d58 RDX: 0000000000000000 RSI: 00000000000d2431 RDI: ffff8103f271bd88 RBP: ffff8103f271bd70 R08: ffff8103f8646000 R09: 00000000ffffffff R10: 00000000d2864068 R11: 0000000000000001 R12: ffff8103f271bd70 R13: 0000000000001000 R14: 0000000000001000 R15: ffff8103f8e83be8 FS: 00002b15cc9e2dc0(0000) GS:ffff810a0016ad40(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 00007fff0b64dbd8 CR3: 0000000393c21000 CR4: 00000000000006e0 Process dipfilter.x (pid: 8106, threadinfo ffff8103f8646000, task ffff8103f8645820) Stack: ffffffff8044802a ffffffff802569b5 ffff8103f271bd70 ffff810bff67f678 00000000000d2431 ffffffff80256f7d ffffffff804479ec 0000000000000000 0000000000000000 00000000d2438000 Call Trace: <ffffffff8044802a>{.text.lock.spinlock+83} <ffffffff802569b5>{find_get_page+21} <ffffffff80256f7d> {do_generic_mapping_read+397} <ffffffff804479ec>{__up_wakeup+53} <ffffffff80257360>{file_read_actor+0} <ffffffff802591e9>{__generic_file_aio_read+425} <ffffffff802593f4> {generic_file_aio_read+52} <ffffffff88202aba>{:nfs:nfs_file_read+170} <ffffffff8027d490> {do_sync_read+208} <ffffffff802424e0>{autoremove_wake_function+0} <ffffffff80227a00> {default_wake_function+0} <ffffffff8027d58c>{vfs_read+188} <ffffffff8027d913>{sys_read+83} <ffffffff80209c06>{system_call+126} Code: 83 38 01 78 f9 f0 ff 08 0f 88 ed ff ff ff c3 90 90 90 90 90 console shuts up ... NMI Watchdog detected LOCKUP on CPU 0 <and so on for all CPUs>
*** This bug has been marked as a duplicate of 6854 ***