Bug 60788

Summary: xfstest generic/068 failure - kernel BUG at fs/buffer.c:3005!
Product: File System Reporter: Eric Whitney (enwlinux)
Component: ext4Assignee: fs_ext4 (fs_ext4)
Status: RESOLVED DUPLICATE    
Severity: normal CC: drJeckyll
Priority: P1    
Hardware: x86-64   
OS: Linux   
Kernel Version: 3.11-rc6 Subsystem:
Regression: No Bisected commit-id:
Attachments: kernel trace with timestamps and system name trimmed

Description Eric Whitney 2013-08-24 19:02:59 UTC
Created attachment 107298 [details]
kernel trace with timestamps and system name trimmed

Xfstest generic/068 fails roughly 5% of the time when run on a 3.11-rc6 x86-64 KVM guest against a test filesystem mounted with the data=journal option.  The failure is actually a test hang rather than an error exit.  When the hang occurs, a kernel failure in reported in the log: "kernel BUG at fs/buffer.c:3005!" (full trace attached).  Recovering from the bug requires rebooting and restarting the guest.  It also occurs on a Pandaboard ES running 3.11-rc6 at about the same frequency of occurrence.

Some evidence suggests that running additional tests prior to generic/068 increases the occurrence rate.  Running generic/062 prior to 068 increased the failure rate to 15% in set of 20 tests on the same KVM guest, for example.  Also, my sense is that 068 fails more frequently than 15% of the time when running the entire xfstests-bld data_journal test series from the top in normal regression, but I've not yet tried to quantify this.

The same bug has been seen on both x86_64 and ARM in previous releases and release candidates at least as far back as 3.8.  Jan Kara may have made the most recent comments related to this and another similar bug: http://www.spinics.net/lists/linux-ext4/msg36858.html (also, see bug id 60786).


Test system hardware configuration:
KVM x86_64 virtual machine (2 cores, 1 GB memory) on physical four core x86_64
Virtual system and test disk raw images located on two physical 147 GB SAS disks
Test file systems located in three 5.66 GB virtual disks


Test system software configuration:

Mainline kernel 3.11-rc6 on Ubuntu 12.10 x86-64 base
e2fsprogs, master branch 93061ea0f4d9f94579d09480f17dd47169c82081
xfsprogs, master branch 1a78b0840afcfbe01c12e6af04659388d62efeea
xfstests, master branch 6fe0ad6fafd76ef3d5b01639f22b747f4a317b63
xfstests-bld, master branch 723c23e1b2756b911dbb50bbd08c0afa5b836e3b
Comment 1 Ognian Tenchev 2014-03-23 11:39:47 UTC
uname -a
Linux storage 3.13.6-gentoo-dr #3 SMP Thu Mar 13 19:19:40 EET 2014 x86_64 AMD Phenom(tm) II X6 1090T Processor AuthenticAMD GNU/Linux

[Wed Mar 19 01:50:22 2014] ------------[ cut here ]------------
[Wed Mar 19 01:50:22 2014] kernel BUG at fs/buffer.c:3350!
[Wed Mar 19 01:50:22 2014] invalid opcode: 0000 [#1] SMP 
[Wed Mar 19 01:50:22 2014] Modules linked in: nfsd auth_rpcgss oid_registry exportfs nfs_acl lockd sunrpc fuse bridge stp llc iptable_filter xt_mark iptable_mangle ipt_MASQUERADE xt_nat xt_tcpudp ipv6 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack ip_tables x_tables snd_pcm_oss snd_mixer_oss saa7134_alsa tda827x tda8290 tuner saa7134 tveeprom videobuf_dma_sg videobuf_core v4l2_common videodev igb ptp pps_core i2c_algo_bit virtio_balloon virtio_pci virtio_ring vhost_net vhost virtio thermal fan cpufreq_userspace cpufreq_powersave cpufreq_ondemand cpufreq_conservative cpufreq_stats it87 hwmon_vid hwmon tun usbhid kvm_amd kvm sr_mod cdrom snd_hda_intel snd_ctxfi snd_hda_codec snd_pcm e100 microcode snd_page_alloc snd_timer ohci_pci ehci_pci ohci_hcd ehci_hcd acpi_cpufreq snd usbcore r8169
[Wed Mar 19 01:50:22 2014]  soundcore processor mii usb_common
[Wed Mar 19 01:50:22 2014] CPU: 0 PID: 443 Comm: kswapd0 Not tainted 3.13.6-gentoo-dr #3
[Wed Mar 19 01:50:22 2014] Hardware name: Gigabyte Technology Co., Ltd. GA-880GA-UD3H/GA-880GA-UD3H, BIOS F7 11/24/2010
[Wed Mar 19 01:50:22 2014] task: ffff88031194b700 ti: ffff8803117e0000 task.ti: ffff8803117e0000
[Wed Mar 19 01:50:22 2014] RIP: 0010:[<ffffffff8110e3dc>]  [<ffffffff8110e3dc>] free_buffer_head+0xd/0x29
[Wed Mar 19 01:50:22 2014] RSP: 0018:ffff8803117e1aa8  EFLAGS: 00010287
[Wed Mar 19 01:50:22 2014] RAX: ffff880168843800 RBX: ffff8801688437b8 RCX: 0000000000000000
[Wed Mar 19 01:50:22 2014] RDX: 0000000000000000 RSI: ffff8801688437b8 RDI: ffff8801688437b8
[Wed Mar 19 01:50:22 2014] RBP: ffff8803132788b8 R08: 0000000000000030 R09: 000000020bb750c0
[Wed Mar 19 01:50:22 2014] R10: ffffea000bb750c0 R11: ffff88031fffb078 R12: 0000000000000001
[Wed Mar 19 01:50:22 2014] R13: ffffea000b1f1c60 R14: 0000000000000001 R15: 0000000000000001
[Wed Mar 19 01:50:22 2014] FS:  00007f0a03c61700(0000) GS:ffff88031fc00000(0000) knlGS:0000000000000000
[Wed Mar 19 01:50:22 2014] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[Wed Mar 19 01:50:22 2014] CR2: 0000000000442d40 CR3: 000000031130a000 CR4: 00000000000007b0
[Wed Mar 19 01:50:22 2014] Stack:
[Wed Mar 19 01:50:22 2014]  ffffffff8110e695 ffff8801688437b8 ffffea000b1f1c40 ffff8803132788b8
[Wed Mar 19 01:50:22 2014]  ffff8803117e1dd0 ffffffff810b8204 ffff8803117e1c50 000000000000000e
[Wed Mar 19 01:50:22 2014]  0000000000000000 0000000000000000 0000000000000000 0000000000000000
[Wed Mar 19 01:50:22 2014] Call Trace:
[Wed Mar 19 01:50:22 2014]  [<ffffffff8110e695>] ? try_to_free_buffers+0x7e/0x92
[Wed Mar 19 01:50:22 2014]  [<ffffffff810b8204>] ? shrink_page_list+0x646/0x879
[Wed Mar 19 01:50:22 2014]  [<ffffffff810b8985>] ? shrink_inactive_list+0x24d/0x3e1
[Wed Mar 19 01:50:22 2014]  [<ffffffff810b9115>] ? shrink_lruvec+0x34a/0x4b2
[Wed Mar 19 01:50:22 2014]  [<ffffffff810eb8c1>] ? super_cache_count+0xa2/0xae
[Wed Mar 19 01:50:22 2014]  [<ffffffff810b764d>] ? shrink_slab+0x2d2/0x2e6
[Wed Mar 19 01:50:22 2014]  [<ffffffff810b764d>] ? shrink_slab+0x2d2/0x2e6
[Wed Mar 19 01:50:22 2014]  [<ffffffff810b9b70>] ? balance_pgdat+0x28e/0x3cd
[Wed Mar 19 01:50:22 2014]  [<ffffffff810b9f41>] ? kswapd+0x292/0x316
[Wed Mar 19 01:50:22 2014]  [<ffffffff810598bf>] ? __wake_up_sync+0x7/0x7
[Wed Mar 19 01:50:22 2014]  [<ffffffff810b9caf>] ? balance_pgdat+0x3cd/0x3cd
[Wed Mar 19 01:50:22 2014]  [<ffffffff81046d51>] ? kthread+0xca/0xd2
[Wed Mar 19 01:50:22 2014]  [<ffffffff81046c87>] ? kthread_create_on_node+0x160/0x160
[Wed Mar 19 01:50:22 2014]  [<ffffffff8138e17c>] ? ret_from_fork+0x7c/0xb0
[Wed Mar 19 01:50:22 2014]  [<ffffffff81046c87>] ? kthread_create_on_node+0x160/0x160
[Wed Mar 19 01:50:22 2014] Code: 31 ff 48 89 da 48 89 ee 48 89 04 24 e8 ec c4 f9 ff 85 c0 7e 02 0f 0b 48 83 c4 18 5b 5d c3 48 8d 47 48 48 39 47 48 48 89 fe 74 02 <0f> 0b 48 8b 3d 3b 50 57 00 e8 e3 50 fd ff 65 ff 0c 25 50 ed 00 
[Wed Mar 19 01:50:22 2014] RIP  [<ffffffff8110e3dc>] free_buffer_head+0xd/0x29
[Wed Mar 19 01:50:22 2014]  RSP <ffff8803117e1aa8>
[Wed Mar 19 01:50:22 2014] ---[ end trace d6d840c2b7dcee1a ]---

Occurred when making backup with rsync
Comment 2 Eric Whitney 2015-07-22 18:37:26 UTC
This bug is a duplicate of #60786.

A patch included in 4.2-rc1 has resolved this problem: "ext4: fix race between truncate and __ext4_journalled_writepage()" (bdf96838aea6)

The method used to force the bug to occur in the past was to make repeated runs of generic/068 on a test filesystem mounted with the data=journal option.  The test system would typically hang with a few tens of runs at the most.

1000 test runs completed normally both on x86_64 and 32 bit ARM running 4.2-rc1 kernels.  Regular regression runs containing generic/068 have also completed normally using 4.2-rc2 and -rc3 kernels on the same architectures.

*** This bug has been marked as a duplicate of bug 60786 ***