Hi,

I recently built 3.17 and xfsdump started getting stuck. Once xfsdump is stuck it's unkillable, even with kill -9. The backtraces listed below are caused by this.

I tested this twice (test, got stuck, reboot, xfs_repair (no errors), test, got stuck). I left xfsdump running for ~24 hours but nothing happened.

I'm including two backtraces, the first and the last, but there were more in between. After the last one nothing more was printed, even though xfsdump was still stuck.

First backtrace:

Oct 11 03:53:31 hell kernel: INFO: task xfsdump:3269 blocked for more than 120 seconds.
Oct 11 03:53:31 hell kernel: Not tainted 3.17.0-v2-v #34
Oct 11 03:53:31 hell kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 11 03:53:31 hell kernel: xfsdump D 0000000000000001 0 3269 3252 0x00000080
Oct 11 03:53:31 hell kernel: ffff8802aa23f9a0 0000000000000002 000000000000a000 ffff8802accce180
Oct 11 03:53:31 hell kernel: ffff8802aa23ffd8 ffff880408e0c920 ffff8802accce180 ffff8802aa23f8e8
Oct 11 03:53:31 hell kernel: ffffffff8113e1b7 0000000001b56000 ffff8802aa23f978 ffff8802aa23f960
Oct 11 03:53:31 hell kernel: Call Trace:
Oct 11 03:53:31 hell kernel: [<ffffffff8113e1b7>] ? lru_cache_add_active_or_unevictable+0x27/0x90
Oct 11 03:53:31 hell kernel: [<ffffffffa033f7b1>] ? xfs_iext_bno_to_ext+0xa1/0x1b0 [xfs]
Oct 11 03:53:31 hell kernel: [<ffffffffa0324b88>] ? xfs_bmbt_get_all+0x18/0x20 [xfs]
Oct 11 03:53:31 hell kernel: [<ffffffffa031a4e8>] ? xfs_bmap_search_multi_extents+0xa8/0x130 [xfs]
Oct 11 03:53:31 hell kernel: [<ffffffff814be799>] schedule+0x29/0x70
Oct 11 03:53:31 hell kernel: [<ffffffff814c13b9>] schedule_timeout+0x179/0x200
Oct 11 03:53:31 hell kernel: [<ffffffff81137135>] ? get_page_from_freelist+0x3c5/0x6c0
Oct 11 03:53:31 hell kernel: [<ffffffff814c0544>] __down+0x64/0xa0
Oct 11 03:53:31 hell kernel: [<ffffffffa034d4db>] ? _xfs_buf_find+0x14b/0x2a0 [xfs]
Oct 11 03:53:31 hell kernel: [<ffffffff8108d674>] down+0x44/0x50
Oct 11 03:53:31 hell kernel: [<ffffffffa034d2fc>] xfs_buf_lock+0x3c/0xd0 [xfs]
Oct 11 03:53:31 hell kernel: [<ffffffffa034d4db>] _xfs_buf_find+0x14b/0x2a0 [xfs]
Oct 11 03:53:31 hell kernel: [<ffffffffa034d75a>] xfs_buf_get_map+0x2a/0x190 [xfs]
Oct 11 03:53:31 hell kernel: [<ffffffffa034e42c>] xfs_buf_read_map+0x2c/0x110 [xfs]
Oct 11 03:53:31 hell kernel: [<ffffffffa0379669>] xfs_trans_read_buf_map+0x1b9/0x460 [xfs]
Oct 11 03:53:31 hell kernel: [<ffffffffa033d3dd>] xfs_read_agi+0x8d/0xe0 [xfs]
Oct 11 03:53:31 hell kernel: [<ffffffffa033d464>] xfs_ialloc_read_agi+0x34/0xd0 [xfs]
Oct 11 03:53:31 hell kernel: [<ffffffffa036189b>] xfs_bulkstat+0x16b/0x4d0 [xfs]
Oct 11 03:53:31 hell kernel: [<ffffffffa0361590>] ? xfs_bulkstat_one_int+0x2e0/0x2e0 [xfs]
Oct 11 03:53:31 hell kernel: [<ffffffff811a3946>] ? dput+0x26/0x1b0
Oct 11 03:53:31 hell kernel: [<ffffffffa0357071>] xfs_ioc_bulkstat+0xd1/0x1a0 [xfs]
Oct 11 03:53:31 hell kernel: [<ffffffffa035967e>] xfs_file_ioctl+0x81e/0xb20 [xfs]
Oct 11 03:53:31 hell kernel: [<ffffffff810f443c>] ? acct_account_cputime+0x1c/0x20
Oct 11 03:53:31 hell kernel: [<ffffffff81079f1b>] ? account_system_time+0x8b/0x190
Oct 11 03:53:31 hell kernel: [<ffffffff812a8838>] ? lockref_put_or_lock+0x48/0x80
Oct 11 03:53:31 hell kernel: [<ffffffff8119f8b8>] do_vfs_ioctl+0x2c8/0x490
Oct 11 03:53:31 hell kernel: [<ffffffff8107a390>] ? vtime_account_user+0x40/0x60
Oct 11 03:53:31 hell kernel: [<ffffffff810e0c3c>] ? __audit_syscall_entry+0x9c/0xf0
Oct 11 03:53:31 hell kernel: [<ffffffff8119fb01>] SyS_ioctl+0x81/0xa0
Oct 11 03:53:31 hell kernel: [<ffffffff814c2ad3>] tracesys+0xe1/0xe6

Last backtrace:

Oct 11 04:11:31 hell kernel: INFO: task xfsdump:3269 blocked for more than 120 seconds.
Oct 11 04:11:31 hell kernel: Not tainted 3.17.0-v2-v #34
Oct 11 04:11:31 hell kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 11 04:11:31 hell kernel: xfsdump D 0000000000000001 0 3269 3252 0x00000080
Oct 11 04:11:31 hell kernel: ffff8802aa23f9a0 0000000000000002 000000000000a000 ffff8802accce180
Oct 11 04:11:31 hell kernel: ffff8802aa23ffd8 ffff880408e0c920 ffff8802accce180 ffff8802aa23f8e8
Oct 11 04:11:31 hell kernel: ffffffff8113e1b7 0000000001b56000 ffff8802aa23f978 ffff8802aa23f960
Oct 11 04:11:31 hell kernel: Call Trace:
Oct 11 04:11:31 hell kernel: [<ffffffff8113e1b7>] ? lru_cache_add_active_or_unevictable+0x27/0x90
Oct 11 04:11:31 hell kernel: [<ffffffffa033f7b1>] ? xfs_iext_bno_to_ext+0xa1/0x1b0 [xfs]
Oct 11 04:11:31 hell kernel: [<ffffffffa0324b88>] ? xfs_bmbt_get_all+0x18/0x20 [xfs]
Oct 11 04:11:31 hell kernel: [<ffffffffa031a4e8>] ? xfs_bmap_search_multi_extents+0xa8/0x130 [xfs]
Oct 11 04:11:31 hell kernel: [<ffffffff814be799>] schedule+0x29/0x70
Oct 11 04:11:31 hell kernel: [<ffffffff814c13b9>] schedule_timeout+0x179/0x200
Oct 11 04:11:31 hell kernel: [<ffffffff81137135>] ? get_page_from_freelist+0x3c5/0x6c0
Oct 11 04:11:31 hell kernel: [<ffffffff814c0544>] __down+0x64/0xa0
Oct 11 04:11:31 hell kernel: [<ffffffffa034d4db>] ? _xfs_buf_find+0x14b/0x2a0 [xfs]
Oct 11 04:11:31 hell kernel: [<ffffffff8108d674>] down+0x44/0x50
Oct 11 04:11:31 hell kernel: [<ffffffffa034d2fc>] xfs_buf_lock+0x3c/0xd0 [xfs]
Oct 11 04:11:31 hell kernel: [<ffffffffa034d4db>] _xfs_buf_find+0x14b/0x2a0 [xfs]
Oct 11 04:11:31 hell kernel: [<ffffffffa034d75a>] xfs_buf_get_map+0x2a/0x190 [xfs]
Oct 11 04:11:31 hell kernel: [<ffffffffa034e42c>] xfs_buf_read_map+0x2c/0x110 [xfs]
Oct 11 04:11:31 hell kernel: [<ffffffffa0379669>] xfs_trans_read_buf_map+0x1b9/0x460 [xfs]
Oct 11 04:11:31 hell kernel: [<ffffffffa033d3dd>] xfs_read_agi+0x8d/0xe0 [xfs]
Oct 11 04:11:31 hell kernel: [<ffffffffa033d464>] xfs_ialloc_read_agi+0x34/0xd0 [xfs]
Oct 11 04:11:31 hell kernel: [<ffffffffa036189b>] xfs_bulkstat+0x16b/0x4d0 [xfs]
Oct 11 04:11:31 hell kernel: [<ffffffffa0361590>] ? xfs_bulkstat_one_int+0x2e0/0x2e0 [xfs]
Oct 11 04:11:31 hell kernel: [<ffffffff811a3946>] ? dput+0x26/0x1b0
Oct 11 04:11:31 hell kernel: [<ffffffffa0357071>] xfs_ioc_bulkstat+0xd1/0x1a0 [xfs]
Oct 11 04:11:31 hell kernel: [<ffffffffa035967e>] xfs_file_ioctl+0x81e/0xb20 [xfs]
Oct 11 04:11:31 hell kernel: [<ffffffff810f443c>] ? acct_account_cputime+0x1c/0x20
Oct 11 04:11:31 hell kernel: [<ffffffff81079f1b>] ? account_system_time+0x8b/0x190
Oct 11 04:11:31 hell kernel: [<ffffffff812a8838>] ? lockref_put_or_lock+0x48/0x80
Oct 11 04:11:31 hell kernel: [<ffffffff8119f8b8>] do_vfs_ioctl+0x2c8/0x490
Oct 11 04:11:31 hell kernel: [<ffffffff8107a390>] ? vtime_account_user+0x40/0x60
Oct 11 04:11:31 hell kernel: [<ffffffff810e0c3c>] ? __audit_syscall_entry+0x9c/0xf0
Oct 11 04:11:31 hell kernel: [<ffffffff8119fb01>] SyS_ioctl+0x81/0xa0

My details:

* kernel 3.17.0 built by me
* xfs_repair version 3.2.1
* 1 CPU, 4 cores
* I don't have meminfo from when the problem happens
* Relevant /proc/mounts line:
  /dev/mapper/tera1-home /home xfs rw,noatime,attr2,inode64,noquota 0 0
* The relevant part of the layout is as follows (take a deep breath):
  * Two physical rotational disks
  * An SSD disk
  * md116 made of two partitions from these disks
  * md118 made of two partitions from these disks
  * A partition on the SSD disk that's used for bcache caching
  * bcache1 composed of md116 + ssd_part
  * bcache3 composed of md118 + ssd_part
  * LVM PV on bcache1
  * LVM PV on bcache3
  * LVM VG with bcache1 and bcache3
  * LVM LV on that VG
  * XFS partition on that LV
* Write cache is enabled

Thanks,
Stefanos
Commit c7cb51dc ("xfs: fix error handling at xfs_inumbers") caused a regression involving incomplete dumps. I'm not sure what the bug is yet. I haven't looked at the hang problem, but I think Dave also spotted an error handling flaw.
The commit at http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/xfs.git;a=commitdiff;h=a8b1ee8bafc765ebf029d03c5479a69aebff9693 fixed the incomplete dump problem. Dave's patch "[PATCH] xfs: bulkstat doesn't release AGI buffer on error", posted on the list, fixes the hang.
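
For readers unfamiliar with this class of bug, here is a rough userspace sketch of what the patch title describes: a lock is taken on a buffer, an error path returns without releasing it, and the next caller that needs the same buffer can never get the lock. This is not the XFS code. The names Buffer, scan_leaky, scan_fixed and can_lock are hypothetical and exist only for illustration, and a timeout is used so the demonstration returns instead of hanging. In the backtraces above the task waits in xfs_buf_lock via down(), which has no timeout, so the wait never ends.

```python
import threading


class Buffer:
    """Toy stand-in for a lockable metadata buffer (hypothetical, not XFS's xfs_buf)."""

    def __init__(self):
        # A semaphore with one permit behaves like an exclusive buffer lock.
        self.lock = threading.Semaphore(1)


def scan_leaky(buf, fail):
    """Takes the buffer lock but forgets to release it on the error path."""
    buf.lock.acquire()
    if fail:
        return -1  # bug: early return leaves the lock held
    buf.lock.release()
    return 0


def scan_fixed(buf, fail):
    """Same logic, but the lock is released on every exit path."""
    buf.lock.acquire()
    try:
        if fail:
            return -1
        return 0
    finally:
        buf.lock.release()


def can_lock(buf, timeout=0.1):
    """Report whether the buffer lock can still be taken within the timeout."""
    got = buf.lock.acquire(timeout=timeout)
    if got:
        buf.lock.release()
    return got


if __name__ == "__main__":
    leaky, fixed = Buffer(), Buffer()
    scan_leaky(leaky, fail=True)
    scan_fixed(fixed, fail=True)
    print("leaky buffer lockable again:", can_lock(leaky))
    print("fixed buffer lockable again:", can_lock(fixed))
```

After a failed scan_leaky call the buffer can no longer be locked, while after a failed scan_fixed call it can. That is the general shape of a fix for a leaked lock on an error path.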
Hi again, I just tried 3.17.3 and the exact same xfsdump attempt that was failing managed to complete without problems. As such I'm marking this as RESOLVED. Thanks!