Bug 86051 - xfsdump gets stuck with 3.17
Summary: xfsdump gets stuck with 3.17
Status: RESOLVED CODE_FIX
Alias: None
Product: File System
Classification: Unclassified
Component: XFS
Hardware: x86-64 Linux
Importance: P1 high
Assignee: XFS Guru
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2014-10-11 14:02 UTC by Stefanos Harhalakis
Modified: 2014-11-16 14:30 UTC
CC List: 1 user

See Also:
Kernel Version: 3.17.0
Tree: Mainline
Regression: No



Description Stefanos Harhalakis 2014-10-11 14:02:37 UTC
Hi,

I recently built 3.17 and xfsdump started getting stuck. Once xfsdump is stuck it's unkillable, even with kill -9. The backtraces listed below come from this hang. I reproduced it twice (test, got stuck, reboot, xfs_repair (no errors), test, got stuck). I left the stuck xfsdump running for ~24 hours but nothing happened. I'm including two backtraces, the first and the last, but there were more in between. After the last one nothing more was printed, even though xfsdump was still stuck.

First backtrace:
Oct 11 03:53:31 hell kernel: INFO: task xfsdump:3269 blocked for more than 120 seconds.
Oct 11 03:53:31 hell kernel:      Not tainted 3.17.0-v2-v #34
Oct 11 03:53:31 hell kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 11 03:53:31 hell kernel: xfsdump         D 0000000000000001     0  3269   3252 0x00000080
Oct 11 03:53:31 hell kernel: ffff8802aa23f9a0 0000000000000002 000000000000a000 ffff8802accce180
Oct 11 03:53:31 hell kernel: ffff8802aa23ffd8 ffff880408e0c920 ffff8802accce180 ffff8802aa23f8e8
Oct 11 03:53:31 hell kernel: ffffffff8113e1b7 0000000001b56000 ffff8802aa23f978 ffff8802aa23f960
Oct 11 03:53:31 hell kernel: Call Trace:
Oct 11 03:53:31 hell kernel: [<ffffffff8113e1b7>] ? lru_cache_add_active_or_unevictable+0x27/0x90
Oct 11 03:53:31 hell kernel: [<ffffffffa033f7b1>] ? xfs_iext_bno_to_ext+0xa1/0x1b0 [xfs]
Oct 11 03:53:31 hell kernel: [<ffffffffa0324b88>] ? xfs_bmbt_get_all+0x18/0x20 [xfs]
Oct 11 03:53:31 hell kernel: [<ffffffffa031a4e8>] ? xfs_bmap_search_multi_extents+0xa8/0x130 [xfs]
Oct 11 03:53:31 hell kernel: [<ffffffff814be799>] schedule+0x29/0x70
Oct 11 03:53:31 hell kernel: [<ffffffff814c13b9>] schedule_timeout+0x179/0x200
Oct 11 03:53:31 hell kernel: [<ffffffff81137135>] ? get_page_from_freelist+0x3c5/0x6c0
Oct 11 03:53:31 hell kernel: [<ffffffff814c0544>] __down+0x64/0xa0
Oct 11 03:53:31 hell kernel: [<ffffffffa034d4db>] ? _xfs_buf_find+0x14b/0x2a0 [xfs]
Oct 11 03:53:31 hell kernel: [<ffffffff8108d674>] down+0x44/0x50
Oct 11 03:53:31 hell kernel: [<ffffffffa034d2fc>] xfs_buf_lock+0x3c/0xd0 [xfs]
Oct 11 03:53:31 hell kernel: [<ffffffffa034d4db>] _xfs_buf_find+0x14b/0x2a0 [xfs]
Oct 11 03:53:31 hell kernel: [<ffffffffa034d75a>] xfs_buf_get_map+0x2a/0x190 [xfs]
Oct 11 03:53:31 hell kernel: [<ffffffffa034e42c>] xfs_buf_read_map+0x2c/0x110 [xfs]
Oct 11 03:53:31 hell kernel: [<ffffffffa0379669>] xfs_trans_read_buf_map+0x1b9/0x460 [xfs]
Oct 11 03:53:31 hell kernel: [<ffffffffa033d3dd>] xfs_read_agi+0x8d/0xe0 [xfs]
Oct 11 03:53:31 hell kernel: [<ffffffffa033d464>] xfs_ialloc_read_agi+0x34/0xd0 [xfs]
Oct 11 03:53:31 hell kernel: [<ffffffffa036189b>] xfs_bulkstat+0x16b/0x4d0 [xfs]
Oct 11 03:53:31 hell kernel: [<ffffffffa0361590>] ? xfs_bulkstat_one_int+0x2e0/0x2e0 [xfs]
Oct 11 03:53:31 hell kernel: [<ffffffff811a3946>] ? dput+0x26/0x1b0
Oct 11 03:53:31 hell kernel: [<ffffffffa0357071>] xfs_ioc_bulkstat+0xd1/0x1a0 [xfs]
Oct 11 03:53:31 hell kernel: [<ffffffffa035967e>] xfs_file_ioctl+0x81e/0xb20 [xfs]
Oct 11 03:53:31 hell kernel: [<ffffffff810f443c>] ? acct_account_cputime+0x1c/0x20
Oct 11 03:53:31 hell kernel: [<ffffffff81079f1b>] ? account_system_time+0x8b/0x190
Oct 11 03:53:31 hell kernel: [<ffffffff812a8838>] ? lockref_put_or_lock+0x48/0x80
Oct 11 03:53:31 hell kernel: [<ffffffff8119f8b8>] do_vfs_ioctl+0x2c8/0x490
Oct 11 03:53:31 hell kernel: [<ffffffff8107a390>] ? vtime_account_user+0x40/0x60
Oct 11 03:53:31 hell kernel: [<ffffffff810e0c3c>] ? __audit_syscall_entry+0x9c/0xf0
Oct 11 03:53:31 hell kernel: [<ffffffff8119fb01>] SyS_ioctl+0x81/0xa0
Oct 11 03:53:31 hell kernel: [<ffffffff814c2ad3>] tracesys+0xe1/0xe6

Last backtrace:
Oct 11 04:11:31 hell kernel: INFO: task xfsdump:3269 blocked for more than 120 seconds.
Oct 11 04:11:31 hell kernel:      Not tainted 3.17.0-v2-v #34
Oct 11 04:11:31 hell kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 11 04:11:31 hell kernel: xfsdump         D 0000000000000001     0  3269   3252 0x00000080
Oct 11 04:11:31 hell kernel: ffff8802aa23f9a0 0000000000000002 000000000000a000 ffff8802accce180
Oct 11 04:11:31 hell kernel: ffff8802aa23ffd8 ffff880408e0c920 ffff8802accce180 ffff8802aa23f8e8
Oct 11 04:11:31 hell kernel: ffffffff8113e1b7 0000000001b56000 ffff8802aa23f978 ffff8802aa23f960
Oct 11 04:11:31 hell kernel: Call Trace:
Oct 11 04:11:31 hell kernel: [<ffffffff8113e1b7>] ? lru_cache_add_active_or_unevictable+0x27/0x90
Oct 11 04:11:31 hell kernel: [<ffffffffa033f7b1>] ? xfs_iext_bno_to_ext+0xa1/0x1b0 [xfs]
Oct 11 04:11:31 hell kernel: [<ffffffffa0324b88>] ? xfs_bmbt_get_all+0x18/0x20 [xfs]
Oct 11 04:11:31 hell kernel: [<ffffffffa031a4e8>] ? xfs_bmap_search_multi_extents+0xa8/0x130 [xfs]
Oct 11 04:11:31 hell kernel: [<ffffffff814be799>] schedule+0x29/0x70
Oct 11 04:11:31 hell kernel: [<ffffffff814c13b9>] schedule_timeout+0x179/0x200
Oct 11 04:11:31 hell kernel: [<ffffffff81137135>] ? get_page_from_freelist+0x3c5/0x6c0
Oct 11 04:11:31 hell kernel: [<ffffffff814c0544>] __down+0x64/0xa0
Oct 11 04:11:31 hell kernel: [<ffffffffa034d4db>] ? _xfs_buf_find+0x14b/0x2a0 [xfs]
Oct 11 04:11:31 hell kernel: [<ffffffff8108d674>] down+0x44/0x50
Oct 11 04:11:31 hell kernel: [<ffffffffa034d2fc>] xfs_buf_lock+0x3c/0xd0 [xfs]
Oct 11 04:11:31 hell kernel: [<ffffffffa034d4db>] _xfs_buf_find+0x14b/0x2a0 [xfs]
Oct 11 04:11:31 hell kernel: [<ffffffffa034d75a>] xfs_buf_get_map+0x2a/0x190 [xfs]
Oct 11 04:11:31 hell kernel: [<ffffffffa034e42c>] xfs_buf_read_map+0x2c/0x110 [xfs]
Oct 11 04:11:31 hell kernel: [<ffffffffa0379669>] xfs_trans_read_buf_map+0x1b9/0x460 [xfs]
Oct 11 04:11:31 hell kernel: [<ffffffffa033d3dd>] xfs_read_agi+0x8d/0xe0 [xfs]
Oct 11 04:11:31 hell kernel: [<ffffffffa033d464>] xfs_ialloc_read_agi+0x34/0xd0 [xfs]
Oct 11 04:11:31 hell kernel: [<ffffffffa036189b>] xfs_bulkstat+0x16b/0x4d0 [xfs]
Oct 11 04:11:31 hell kernel: [<ffffffffa0361590>] ? xfs_bulkstat_one_int+0x2e0/0x2e0 [xfs]
Oct 11 04:11:31 hell kernel: [<ffffffff811a3946>] ? dput+0x26/0x1b0
Oct 11 04:11:31 hell kernel: [<ffffffffa0357071>] xfs_ioc_bulkstat+0xd1/0x1a0 [xfs]
Oct 11 04:11:31 hell kernel: [<ffffffffa035967e>] xfs_file_ioctl+0x81e/0xb20 [xfs]
Oct 11 04:11:31 hell kernel: [<ffffffff810f443c>] ? acct_account_cputime+0x1c/0x20
Oct 11 04:11:31 hell kernel: [<ffffffff81079f1b>] ? account_system_time+0x8b/0x190
Oct 11 04:11:31 hell kernel: [<ffffffff812a8838>] ? lockref_put_or_lock+0x48/0x80
Oct 11 04:11:31 hell kernel: [<ffffffff8119f8b8>] do_vfs_ioctl+0x2c8/0x490
Oct 11 04:11:31 hell kernel: [<ffffffff8107a390>] ? vtime_account_user+0x40/0x60
Oct 11 04:11:31 hell kernel: [<ffffffff810e0c3c>] ? __audit_syscall_entry+0x9c/0xf0
Oct 11 04:11:31 hell kernel: [<ffffffff8119fb01>] SyS_ioctl+0x81/0xa0
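
Editorial note: in traces of this format, frames prefixed with "?" are addresses found on the stack that the unwinder could not confirm as part of the live call chain, so the unmarked frames are the ones to read first. The following is a minimal sketch, assuming the log format shown above, that keeps only the unmarked frames. The helper name reliable_frames and the regular expression are illustrative and not part of any kernel tooling.

```python
import re

# One frame of a 3.x-era trace line, e.g.
# "... kernel: [<ffffffff814be799>] schedule+0x29/0x70"
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(\?\s+)?([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)

def reliable_frames(log_text):
    """Return function names of frames not marked with '?', innermost first."""
    frames = []
    for line in log_text.splitlines():
        m = FRAME_RE.search(line)
        # group(1) is None when the optional "? " marker is absent
        if m and m.group(1) is None:
            frames.append(m.group(2))
    return frames

# A few lines copied from the first backtrace in this report.
sample = """\
Oct 11 03:53:31 hell kernel: [<ffffffffa031a4e8>] ? xfs_bmap_search_multi_extents+0xa8/0x130 [xfs]
Oct 11 03:53:31 hell kernel: [<ffffffff814be799>] schedule+0x29/0x70
Oct 11 03:53:31 hell kernel: [<ffffffff814c0544>] __down+0x64/0xa0
Oct 11 03:53:31 hell kernel: [<ffffffffa034d4db>] ? _xfs_buf_find+0x14b/0x2a0 [xfs]
Oct 11 03:53:31 hell kernel: [<ffffffff8108d674>] down+0x44/0x50
Oct 11 03:53:31 hell kernel: [<ffffffffa034d2fc>] xfs_buf_lock+0x3c/0xd0 [xfs]
Oct 11 03:53:31 hell kernel: [<ffffffffa033d3dd>] xfs_read_agi+0x8d/0xe0 [xfs]
Oct 11 03:53:31 hell kernel: [<ffffffffa036189b>] xfs_bulkstat+0x16b/0x4d0 [xfs]
"""

print(reliable_frames(sample))
```

Read innermost first, the unmarked frames show xfsdump sleeping in xfs_buf_lock, reached from xfs_bulkstat via xfs_read_agi, which is the same position in both the first and the last backtrace.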


My details:
* kernel 3.17.0 built by me
* xfs_repair version 3.2.1
* 1 cpu, 4 cores
* I don't have meminfo from when the problem happens
* Relevant /proc/mounts line: /dev/mapper/tera1-home /home xfs rw,noatime,attr2,inode64,noquota 0 0
* The relevant part of the layout is as follows (take a deep breath):
  * Two physical rotational disks
  * An SSD disk
  * md116 made of two partitions from these disks
  * md118 made of two partitions from these disks
  * A partition on the SSD disk that's used for bcache caching
  * bcache1 composed of md116 + ssd_part
  * bcache3 composed of md118 + ssd_part
  * LVM PV on bcache1
  * LVM PV on bcache3
  * LVM VG with bcache1 and bcache3
  * LVM LV on that VG
  * XFS partition on that LV
* Write cache is enabled

Thanks,
Stefanos
Comment 1 Eric Sandeen 2014-10-12 16:24:32 UTC
Commit c7cb51dc ("xfs: fix error handling at xfs_inumbers") caused a regression involving incomplete dumps. I'm not sure what the bug is yet. I haven't looked at the hang problem, but I think Dave also spotted an error handling flaw.
Comment 2 Eric Sandeen 2014-10-21 13:57:44 UTC
http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/xfs.git;a=commitdiff;h=a8b1ee8bafc765ebf029d03c5479a69aebff9693

fixed the incomplete dump problem.

Dave's patch

[PATCH] xfs: bulkstat doesn't release AGI buffer on error

on the list fixes the hang.
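
Editorial note: the patch title describes a buffer that is not released on an error path, and the backtraces above show xfsdump blocked in xfs_buf_lock while reading the AGI. The general pattern can be sketched in user space as below. This is only an illustration under stated assumptions, not the kernel code or the actual patch: Buffer, scan_buggy, scan_fixed, and can_lock are hypothetical names, and a threading.Lock stands in for the buffer lock.

```python
import threading

class Buffer:
    """Stand-in for a locked metadata buffer (hypothetical, not the XFS API)."""
    def __init__(self):
        self.lock = threading.Lock()

def scan_buggy(buf, fail):
    """Takes the buffer lock but forgets to drop it when an error occurs."""
    buf.lock.acquire()
    if fail:
        return -1           # error path returns with the lock still held
    buf.lock.release()
    return 0

def scan_fixed(buf, fail):
    """Drops the buffer lock on every exit path, including errors."""
    buf.lock.acquire()
    try:
        return -1 if fail else 0
    finally:
        buf.lock.release()

def can_lock(buf):
    """Non-blocking probe: True if the buffer lock is currently free."""
    got = buf.lock.acquire(blocking=False)
    if got:
        buf.lock.release()
    return got

leaky, clean = Buffer(), Buffer()
scan_buggy(leaky, fail=True)   # leaves the lock held
scan_fixed(clean, fail=True)   # releases the lock despite the error

print(can_lock(leaky), can_lock(clean))
```

In the sketch the probe is non-blocking so the example terminates. A caller that instead waits for the leaked lock would sleep indefinitely, which is consistent with the unkillable D-state task reported above.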
Comment 3 Stefanos Harhalakis 2014-11-16 14:30:05 UTC
Hi again,

I just tried 3.17.3 and the exact same xfsdump attempt that was failing managed to complete without problems. As such I'm marking this as RESOLVED.

Thanks!
