Latest working kernel version: N/A
Earliest failing kernel version: 2.6.27
Distribution: Fedora 10 x86
Hardware Environment: Intel 82801G chipset motherboard; occurs independently of NIC chipset, with all gigabit NICs of different kinds.
Software Environment: Fedora 10; tried a Fedora 11 kernel after the bug initially occurred on the 2.6.27 stable kernel from Fedora.

Problem Description:
Under severe, severe load, NFS will sometimes fall down the stairs, and when trying to find out why a thread was in state D, we find this lovely message in dmesg:

kernel BUG at fs/nfs/write.c:252!
invalid opcode: 0000 [#1] SMP
last sysfs file: /sys/devices/virtual/block/dm-0/range
CPU 1
Modules linked in: fuse cpufreq_stats autofs4 coretemp hwmon nfs lockd nfs_acl auth_rpcgss sunrpc ipv6 cpufreq_ondemand acpi_cpufreq freq_table ext4 jbd2 crc16 kvm_intel kvm snd_hda_codec_idt snd_hda_intel snd_hda_codec snd_hwdep snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq ppdev snd_seq_device snd_pcm_oss cpia_usb snd_mixer_oss cpia nouveau videodev pcspkr snd_pcm drm firewire_ohci i2c_i801 v4l1_compat parport_pc firewire_core v4l2_compat_ioctl32 i82975x_edac usb_storage iTCO_wdt e1000e edac_core crc_itu_t iTCO_vendor_support parport snd_timer i2c_core snd soundcore snd_page_alloc raid1
Pid: 313, comm: pdflush Tainted: G W 2.6.29-0.258.rc8.git2.fc11.x86_64 #1
RIP: 0010:[<ffffffffa03696ca>]  [<ffffffffa03696ca>] nfs_do_writepage+0xf5/0x18d [nfs]
RSP: 0018:ffff88012d4f9aa0  EFLAGS: 00010202
RAX: 0000000000000001 RBX: ffffe20002f117a0 RCX: 0000000000000015
RDX: 000000001eef1eee RSI: 0000000000000001 RDI: ffff8800276b1860
RBP: ffff88012d4f9ad0 R08: 0000000000000002 R09: ffff8800276b1860
R10: ffff88012d4f9a90 R11: 0000000000000246 R12: ffff880101015b40
R13: ffff8800276b1860 R14: ffff88012d4f9c50 R15: ffff8800276b19e0
FS:  0000000000000000(0000) GS:ffff88012fab2af0(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00000000f7d50b7c CR3: 0000000126df9000 CR4: 00000000000026e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process pdflush (pid: 313, threadinfo ffff88012d4f8000, task ffff88012e5ba350)
Stack:
 ffff88012e5ba350 ffffffff810b27ea ffffe20002f117a0 ffff88012d4f9e10
 ffff88012d517828 0000000000000001 ffff88012d4f9b00 ffffffffa0369b61
 ffff88012d4f9b00 0000000000000286 ffffe20002f117a0 ffffe20002f117a0
Call Trace:
 [<ffffffff810b27ea>] ? clear_page_dirty_for_io+0xbc/0xda
 [<ffffffffa0369b61>] nfs_writepages_callback+0x16/0x2b [nfs]
 [<ffffffff810b2c0c>] write_cache_pages+0x252/0x3d2
 [<ffffffffa0369b4b>] ? nfs_writepages_callback+0x0/0x2b [nfs]
 [<ffffffffa030558c>] ? rpc_execute+0x2d/0x31 [sunrpc]
 [<ffffffffa02ff62a>] ? rpc_run_task+0x5b/0x63 [sunrpc]
 [<ffffffffa036a7b6>] ? nfs_commit_list+0x1e4/0x254 [nfs]
 [<ffffffffa0369b24>] nfs_writepages+0x80/0xa7 [nfs]
 [<ffffffffa0369d3c>] ? nfs_flush_one+0x0/0xf4 [nfs]
 [<ffffffff810b2de5>] do_writepages+0x30/0x3f
 [<ffffffff810fd9ce>] __writeback_single_inode+0x18c/0x2d8
 [<ffffffff810fdefb>] generic_sync_sb_inodes+0x210/0x362
 [<ffffffff810fe245>] ? writeback_inodes+0x91/0x100
 [<ffffffff810fe25b>] writeback_inodes+0xa7/0x100
 [<ffffffff810b3493>] background_writeout+0x98/0xd1
 [<ffffffff810b3c8e>] pdflush+0x158/0x20d
 [<ffffffff810b33fb>] ? background_writeout+0x0/0xd1
 [<ffffffff810b3b36>] ? pdflush+0x0/0x20d
 [<ffffffff810609ad>] kthread+0x4d/0x78
 [<ffffffff810126aa>] child_rip+0xa/0x20
 [<ffffffff8100f88b>] ? __switch_to+0x190/0x398
 [<ffffffff81011fbe>] ? restore_args+0x0/0x30
 [<ffffffff81060960>] ? kthread+0x0/0x78
 [<ffffffff810126a0>] ? child_rip+0x0/0x20
Code: 89 e7 e8 37 cf ff ff 4c 89 e7 89 45 d8 e8 a8 cf ff ff 8b 45 d8 85 c0 74 ab eb 7f 41 f6 44 24 40 02 74 0c 4c 89 ef e8 2c d1 02 e1 <0f> 0b eb fe 48 89 df e8 9b 8e d4 e0 85 c0 75 47 48 8b 43 18 48
RIP  [<ffffffffa03696ca>] nfs_do_writepage+0xf5/0x18d [nfs]
 RSP <ffff88012d4f9aa0>

As you may imagine, the source of the write() call never gets a return and stays in state D until the machine is rebooted.

Steps to reproduce:
1) Put NFS under heavy, heavy load. Think gigabit point-to-point between a 10-disk RAID-6 server and this machine, torrenting the entire Debian Lenny CD/DVD set (which is actually how I produced it just now).
2) This happens at some point.
Tried Nick Piggin's patch for this race condition (the "try 3" variant; I'd link the thread, but linux-fsdevel appears to be temporarily down). It took a lot longer to reproduce, but the BUG still triggered.
I'm going to try his "incremental patch", though I don't think it'll fix this (based on my understanding of its intent). Is there any reasonable way I can instrument this and help debug it?
This bug should now be fixed in mainline by two commits:

b827e496c893de0c0f142abfaeb8730a2fd6b37f: mm: close page_mkwrite races
7fdf523067666b0eaff330f362401ee50ce187c4: NFS: Close page_mkwrite() races

Cheers,
  Trond
Created attachment 21235 [details]
mm: close page_mkwrite races

Change page_mkwrite to allow implementations to return with the page locked, and also change its callers (in the page fault paths) to hold the lock until the page is marked dirty. This allows the filesystem to have full control of page dirtying events coming from the VM.

Rather than simply holding the page locked over the page_mkwrite call, we call page_mkwrite with the page unlocked and allow the implementation to return with it locked, so filesystems can avoid lock-order (LOR) problems with the page lock.

The problem with the current scheme is this: a filesystem that wants to associate some metadata with a page for as long as the page is dirty will perform this manipulation in its ->page_mkwrite. It currently must then return with the page unlocked and may not hold any other locks (according to the existing page_mkwrite convention). In this window, the VM could write out the page, clearing the page's dirty state. The filesystem has no good way to detect that a dirty pte is about to be attached, so it will happily write out the page, at which point the filesystem may manipulate the metadata to reflect that the page is no longer dirty.

It is not always possible to perform the required metadata manipulation in ->set_page_dirty, because that function cannot block or fail. The filesystem may need to allocate some data structure, for example. And the VM cannot mark the pte dirty before page_mkwrite, because page_mkwrite is allowed to fail, so we must not allow any window where the page could be written to if page_mkwrite does fail.

This solution of holding the page locked over the three critical operations (page_mkwrite, setting the pte dirty, and finally setting the page dirty) closes the races nicely, preventing page cleaning for writeout from being initiated in that window. This provides the filesystem with strong synchronisation against the VM here.
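To make the caller-side change concrete, here is a minimal, hedged sketch of how a fault path could consume the new convention. This is not the actual mm/memory.c code from the attachment; example_fault_path_mkwrite is a made-up name and the pte manipulation is elided.

#include <linux/mm.h>
#include <linux/pagemap.h>

/*
 * Hedged, heavily simplified sketch of the fault-path side of the new
 * convention (not the real mm/memory.c code): ->page_mkwrite is called
 * on an unlocked page, the filesystem may return with the page locked
 * (signalled by VM_FAULT_LOCKED), and the lock is then held across
 * dirtying the pte and the page.
 */
static int example_fault_path_mkwrite(struct vm_area_struct *vma,
				      struct vm_fault *vmf)
{
	struct page *page = vmf->page;
	int ret;

	ret = vma->vm_ops->page_mkwrite(vma, vmf);
	if (ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE))
		return ret;

	/* Handlers that predate the change still return the page unlocked. */
	if (!(ret & VM_FAULT_LOCKED))
		lock_page(page);

	/*
	 * With the page lock held: make the pte writable and dirty
	 * (elided here), then mark the page dirty, and only then unlock.
	 * Writeback cannot clean the page anywhere inside this window.
	 */
	set_page_dirty(page);
	unlock_page(page);

	return 0;
}

The point is that the page lock is held continuously from page_mkwrite through set_page_dirty(), so writeback cannot slip in between the filesystem's bookkeeping and the dirtying of the pte and page.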
Created attachment 21236 [details]
NFS: Close page_mkwrite() races

Follow-up to Nick Piggin's patches to ensure that nfs_vm_page_mkwrite returns with the page lock held, and sets the VM_FAULT_LOCKED flag.
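For illustration only, here is a minimal sketch of the general shape of a ->page_mkwrite handler that follows this convention. It is not the attached NFS patch; example_setup_write() is a hypothetical stand-in for the filesystem's real per-page dirty bookkeeping (in NFS's case, setting up the write request).

#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/pagemap.h>

/* Hypothetical helper: stands in for the filesystem's real per-page
 * dirty bookkeeping, which may block, allocate memory, or fail. */
static int example_setup_write(struct file *filp, struct page *page)
{
	return 0;
}

/*
 * Hedged sketch of a ->page_mkwrite handler following the convention
 * above; this is NOT the attached NFS patch, just the general shape.
 */
static int example_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
{
	struct page *page = vmf->page;
	struct inode *inode = vma->vm_file->f_path.dentry->d_inode;

	lock_page(page);

	/* The page may have been truncated while we waited for the lock. */
	if (page->mapping != inode->i_mapping) {
		unlock_page(page);
		return VM_FAULT_NOPAGE;
	}

	/* Attach/allocate write metadata while the page cannot be cleaned. */
	if (example_setup_write(vma->vm_file, page) < 0) {
		unlock_page(page);
		return VM_FAULT_SIGBUS;
	}

	/*
	 * Success: keep the page locked and say so; the fault path will
	 * unlock it only after the pte and the page have been dirtied.
	 */
	return VM_FAULT_LOCKED;
}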
Marking this bug as CLOSED. Please reopen if the problem reoccurs.