Bug 12913

Summary: BUG on fs/nfs/write.c:252!
Product: File System
Reporter: Rich Ercolani (rercola)
Component: NFS
Assignee: Trond Myklebust (trondmy)
Status: CLOSED CODE_FIX
Severity: high
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 2.6.29-0.rc8.git2.fc11
Subsystem:
Regression: No
Bisected commit-id:
Attachments:
  mm: close page_mkwrite races
  NFS: Close page_mkwrite() races

Description Rich Ercolani 2009-03-21 13:44:04 UTC
Latest working kernel version: N/A
Earliest failing kernel version: 2.6.27
Distribution: Fedora 10 x86
Hardware Environment: Intel 82801G chipset motherboard; occurs independent of NIC chipset (tried several gigabit NICs of different kinds).
Software Environment: Fedora 10; tried a Fedora 11 kernel after the bug initially occurred on Fedora's 2.6.27 stable kernel.
Problem Description:
Under very heavy load, NFS sometimes falls over, and when trying to find out why a thread was stuck in state D, we find this lovely message in dmesg:
kernel BUG at fs/nfs/write.c:252!
invalid opcode: 0000 [#1] SMP 
last sysfs file: /sys/devices/virtual/block/dm-0/range
CPU 1 
Modules linked in: fuse cpufreq_stats autofs4 coretemp hwmon nfs lockd nfs_acl auth_rpcgss sunrpc ipv6 cpufreq_ondemand acpi_cpufreq freq_table ext4 jbd2 crc16 kvm_intel kvm snd_hda_codec_idt snd_hda_intel snd_hda_codec snd_hwdep snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq ppdev snd_seq_device snd_pcm_oss cpia_usb snd_mixer_oss cpia nouveau videodev pcspkr snd_pcm drm firewire_ohci i2c_i801 v4l1_compat parport_pc firewire_core v4l2_compat_ioctl32 i82975x_edac usb_storage iTCO_wdt e1000e edac_core crc_itu_t iTCO_vendor_support parport snd_timer i2c_core snd soundcore snd_page_alloc raid1
Pid: 313, comm: pdflush Tainted: G        W  2.6.29-0.258.rc8.git2.fc11.x86_64 #1         
RIP: 0010:[<ffffffffa03696ca>]  [<ffffffffa03696ca>] nfs_do_writepage+0xf5/0x18d [nfs]
RSP: 0018:ffff88012d4f9aa0  EFLAGS: 00010202
RAX: 0000000000000001 RBX: ffffe20002f117a0 RCX: 0000000000000015
RDX: 000000001eef1eee RSI: 0000000000000001 RDI: ffff8800276b1860
RBP: ffff88012d4f9ad0 R08: 0000000000000002 R09: ffff8800276b1860
R10: ffff88012d4f9a90 R11: 0000000000000246 R12: ffff880101015b40
R13: ffff8800276b1860 R14: ffff88012d4f9c50 R15: ffff8800276b19e0
FS:  0000000000000000(0000) GS:ffff88012fab2af0(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00000000f7d50b7c CR3: 0000000126df9000 CR4: 00000000000026e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process pdflush (pid: 313, threadinfo ffff88012d4f8000, task ffff88012e5ba350)
Stack:
 ffff88012e5ba350 ffffffff810b27ea ffffe20002f117a0 ffff88012d4f9e10
 ffff88012d517828 0000000000000001 ffff88012d4f9b00 ffffffffa0369b61
 ffff88012d4f9b00 0000000000000286 ffffe20002f117a0 ffffe20002f117a0
Call Trace:
 [<ffffffff810b27ea>] ? clear_page_dirty_for_io+0xbc/0xda
 [<ffffffffa0369b61>] nfs_writepages_callback+0x16/0x2b [nfs]
 [<ffffffff810b2c0c>] write_cache_pages+0x252/0x3d2
 [<ffffffffa0369b4b>] ? nfs_writepages_callback+0x0/0x2b [nfs]
 [<ffffffffa030558c>] ? rpc_execute+0x2d/0x31 [sunrpc]
 [<ffffffffa02ff62a>] ? rpc_run_task+0x5b/0x63 [sunrpc]
 [<ffffffffa036a7b6>] ? nfs_commit_list+0x1e4/0x254 [nfs]
 [<ffffffffa0369b24>] nfs_writepages+0x80/0xa7 [nfs]
 [<ffffffffa0369d3c>] ? nfs_flush_one+0x0/0xf4 [nfs]
 [<ffffffff810b2de5>] do_writepages+0x30/0x3f
 [<ffffffff810fd9ce>] __writeback_single_inode+0x18c/0x2d8
 [<ffffffff810fdefb>] generic_sync_sb_inodes+0x210/0x362
 [<ffffffff810fe245>] ? writeback_inodes+0x91/0x100
 [<ffffffff810fe25b>] writeback_inodes+0xa7/0x100
 [<ffffffff810b3493>] background_writeout+0x98/0xd1
 [<ffffffff810b3c8e>] pdflush+0x158/0x20d
 [<ffffffff810b33fb>] ? background_writeout+0x0/0xd1
 [<ffffffff810b3b36>] ? pdflush+0x0/0x20d
 [<ffffffff810609ad>] kthread+0x4d/0x78
 [<ffffffff810126aa>] child_rip+0xa/0x20
 [<ffffffff8100f88b>] ? __switch_to+0x190/0x398
 [<ffffffff81011fbe>] ? restore_args+0x0/0x30
 [<ffffffff81060960>] ? kthread+0x0/0x78
 [<ffffffff810126a0>] ? child_rip+0x0/0x20
Code: 89 e7 e8 37 cf ff ff 4c 89 e7 89 45 d8 e8 a8 cf ff ff 8b 45 d8 85 c0 74 ab eb 7f 41 f6 44 24 40 02 74 0c 4c 89 ef e8 2c d1 02 e1 <0f> 0b eb fe 48 89 df e8 9b 8e d4 e0 85 c0 75 47 48 8b 43 18 48 
RIP  [<ffffffffa03696ca>] nfs_do_writepage+0xf5/0x18d [nfs]
 RSP <ffff88012d4f9aa0>

As you may imagine, the write() call never returns, and the calling process stays in state D until the machine is rebooted.

Steps to reproduce:
1) Put NFS under very heavy load. Think gigabit point-to-point between a 10-disk RAID-6 server and this machine, torrenting the entire Debian Lenny CD/DVD set (which is actually how I reproduced it just now).
2) This happens at some point.
Comment 1 Rich Ercolani 2009-04-19 05:06:41 UTC
Tried Nick Piggin's patch for this race condition (the "try 3" variant; I'd link the thread, but linux-fsdevel appears to be temporarily down).

It took a lot longer, but the BUG still triggered.
Comment 2 Rich Ercolani 2009-04-20 04:57:24 UTC
I'm going to try his "incremental patch", though I don't think it'll fix this (based on my understanding of its intent).

Is there any reasonable way I can instrument this and help debug it?
Comment 3 Trond Myklebust 2009-05-05 18:46:54 UTC
This bug should now be fixed in mainline by the two commits

   b827e496c893de0c0f142abfaeb8730a2fd6b37f: mm: close page_mkwrite races
and
   7fdf523067666b0eaff330f362401ee50ce187c4: NFS: Close page_mkwrite() races

Cheers
  Trond
Comment 4 Trond Myklebust 2009-05-05 18:49:33 UTC
Created attachment 21235 [details]
mm: close page_mkwrite races

Change page_mkwrite to allow implementations to return with the page
locked, and also change its callers (in the page fault paths) to hold
the lock until the page is marked dirty.  This allows the filesystem to
have full control of page dirtying events coming from the VM.

Rather than simply holding the page locked over the page_mkwrite call, we
call page_mkwrite with the page unlocked and allow implementations to
return with it locked, so filesystems can avoid lock-order-reversal (LOR)
conditions against the page lock.

The problem with the old scheme is this: a filesystem that wants to
associate some metadata with a page for as long as the page is dirty will
perform this manipulation in its ->page_mkwrite.  Under the old
convention it must then return with the page unlocked, and it may not
hold any other locks.

In this window, the VM could initiate writeout of the page, clearing its
dirty flag.  The filesystem has no good way to detect that a dirty pte is
about to be attached, so it happily writes the page out, at which point
it may manipulate the metadata to reflect that the page is no longer
dirty.

It is not always possible to perform the required metadata manipulation in
->set_page_dirty, because that function cannot block or fail.  The
filesystem may need to allocate some data structure, for example.

And the VM cannot mark the pte dirty before page_mkwrite, because
page_mkwrite is allowed to fail, so we must not allow any window where the
page could be written to if page_mkwrite does fail.

This solution of holding the page locked over the three critical operations
(page_mkwrite, setting the pte dirty, and finally setting the page dirty)
closes the races nicely, preventing page cleaning for writeout from being
initiated in that window.  This gives the filesystem strong
synchronisation against the VM here.
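
To make the convention concrete, below is a minimal sketch of a filesystem ->page_mkwrite handler written against the post-patch rules. The function example_page_mkwrite() and its helper example_attach_metadata() are hypothetical illustrations, not code from the patch; only the locking pattern and the VM_FAULT_LOCKED return value come from the change described above.

#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/pagemap.h>

/*
 * Illustrative stub: allocate/attach whatever per-dirty-page state the
 * filesystem needs.  Blocking or failing is safe here, unlike in
 * ->set_page_dirty.
 */
static int example_attach_metadata(struct inode *inode, struct page *page)
{
	return 0;
}

/* Hypothetical ->page_mkwrite under the post-patch convention. */
static int example_page_mkwrite(struct vm_area_struct *vma,
				struct vm_fault *vmf)
{
	struct page *page = vmf->page;
	struct address_space *mapping = vma->vm_file->f_mapping;

	lock_page(page);

	/* The page may have been truncated while we slept on the lock. */
	if (page->mapping != mapping) {
		unlock_page(page);
		return VM_FAULT_NOPAGE;
	}

	if (example_attach_metadata(mapping->host, page)) {
		unlock_page(page);
		return VM_FAULT_SIGBUS;
	}

	/*
	 * Return with the page still locked; the fault path now sets the
	 * pte dirty and calls set_page_dirty() before unlocking, so
	 * writeout cannot clean the page in between.
	 */
	return VM_FAULT_LOCKED;
}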
Comment 5 Trond Myklebust 2009-05-05 18:50:53 UTC
Created attachment 21236 [details]
NFS: Close page_mkwrite() races

Follow-up to Nick Piggin's patches: ensure that nfs_vm_page_mkwrite
returns with the page lock held and sets the VM_FAULT_LOCKED flag.
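
For orientation, the shape of the fixed handler is roughly as follows. This is a paraphrase of the behaviour described above, not the verbatim patch, and it assumes the in-tree NFS helpers nfs_page_length(), nfs_flush_incompatible() and nfs_updatepage() available in fs/nfs at the time.

/*
 * Paraphrased shape of nfs_vm_page_mkwrite() after the fix (context:
 * fs/nfs/file.c; not the verbatim patch).  On success the page lock is
 * kept and VM_FAULT_LOCKED is returned, so the VM dirties the pte and
 * the page before the lock is dropped.
 */
static int nfs_vm_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
{
	struct page *page = vmf->page;
	struct file *filp = vma->vm_file;
	unsigned int pagelen;
	int ret = VM_FAULT_NOPAGE;

	lock_page(page);
	/* Bail out if the page was truncated or reclaimed meanwhile. */
	if (page->mapping != filp->f_path.dentry->d_inode->i_mapping)
		goto out_unlock;

	pagelen = nfs_page_length(page);
	if (pagelen == 0)
		goto out_unlock;

	if (nfs_flush_incompatible(filp, page) == 0 &&
	    nfs_updatepage(filp, page, 0, pagelen) == 0)
		return VM_FAULT_LOCKED;	/* page stays locked */

	ret = VM_FAULT_SIGBUS;
out_unlock:
	unlock_page(page);
	return ret;
}

Returning with the lock still held removes the window in which writeout could run between page_mkwrite and the pte being dirtied, which is the window the BUG reported above could fire in.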
Comment 6 Trond Myklebust 2010-02-03 21:04:19 UTC
Marking this bug as CLOSED. Please reopen if the problem reoccurs.