Bug 190751

Summary: reference leak in percpu_ref_tryget and percpu_ref_tryget_live
Product: IO/Storage Reporter: Douglas Miller (dougmill)
Component: Block LayerAssignee: Jens Axboe (axboe)
Status: RESOLVED CODE_FIX    
Severity: high CC: gpiccoli, mniyer
Priority: P1    
Hardware: PPC-64   
OS: Linux   
Kernel Version: 4.9 Subsystem:
Regression: No Bisected commit-id:

Description Douglas Miller 2016-12-20 13:54:31 UTC
This was originally observed on Ubuntu 14.04.1 and then also seen on RHEL 7.3 as well as the upstream kernel. All running ppc64le. Uncertain whether this is limited to PPC64 or not.

In this particular case, unpartitioned NVMe drives were being tested, resulting in some I/O to the partition sectors which triggered systemd-udevd to re-read the partition table (under heavy I/O load), which involves the call to blk_mq_freeze_queue_wait which hangs. Eventually, the rest of the system hangs.

While the systemd-udevd scenario can be easily worked around by adding partition tables to NVMe drives, the concern is that other uses for blk_mq_freeze_queue_wait will also hang.

Preliminary investigation suggested a problem (leak) with reference counts on q_usage_cnt.

Here is the stack trace for systemd-udevd:

[ 8758.785424] INFO: task systemd-udevd:1773 blocked for more than 120 seconds.
[ 8758.785720] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 8758.785838] systemd-udevd   D 00003fff87e6f7a0     0  1773      1 0x00040000
[ 8758.786077] Call Trace:
[ 8758.786137] [c000003f3f447790] [c000000000002794] hardware_interrupt_common+0x114/0x180 (unreliab
le)
[ 8758.786253] --- Exception: c000000000b4d880 at __switch_to+0x200/0x350
[ 8758.786253]     LR =           (null)
[ 8758.786358] [c000003f3f447960] [c0000000000158d0] __switch_to+0x200/0x350 (unreliable)
[ 8758.786500] [c000003f3f4479c0] [c0000000007adaf4] __schedule+0x414/0x9e0
[ 8758.786618] [c000003f3f447a90] [c0000000003b4c94] blk_mq_freeze_queue_wait+0x64/0xd0
[ 8758.786791] [c000003f3f447af0] [d000000033d81964] nvme_revalidate_disk+0xd4/0x3a0 [nvme]
[ 8758.786911] [c000003f3f447b90] [c0000000003c22d8] rescan_partitions+0x98/0x390
[ 8758.787018] [c000003f3f447c60] [c0000000003bb6ec] __blkdev_reread_part+0x9c/0xd0
[ 8758.787109] [c000003f3f447c90] [c0000000003bb758] blkdev_reread_part+0x38/0x70
[ 8758.787197] [c000003f3f447cc0] [c0000000003bc274] blkdev_ioctl+0x3b4/0xb80
[ 8758.787279] [c000003f3f447d20] [c0000000002cbc90] block_ioctl+0x70/0x90
[ 8758.787444] [c000003f3f447d40] [c000000000296af8] do_vfs_ioctl+0x458/0x740
[ 8758.787529] [c000003f3f447dd0] [c000000000296ea4] SyS_ioctl+0xc4/0xe0
[ 8758.787628] [c000003f3f447e30] [c00000000000a17c] system_call+0x38/0xb4

Unclear if the bug is in block MQ, or nvme, or elsewhere.
Comment 1 Douglas Miller 2016-12-20 14:06:20 UTC
The above stack trace is from RHEL 7.3.
Comment 2 Douglas Miller 2017-01-26 14:05:28 UTC
This can also be reproduce by simply reading a (multi-queue) block device while off/onlining CPUs. For example, the following script reproduces it on a PPC64 system with an NVMe drive:

#!/bin/bash

if [[ ${#} -eq 0 ]]; then
        ${0} breaker &
        while true; do
                dd if=/dev/nvme0n1 bs=1024k of=/dev/null
        done
elif [[ ${1} == "breaker" ]]; then
        while true; do
                ppc64_cpu --smt=off
                sleep 5
                ppc64_cpu --smt=on
                sleep 5
        done
fi

The problem usually reproduces in a few minutes, although sometimes it has taken a few hours.

I suspect the problem is related to the changes in commit 71f79fb3179e69b0c1448a2101a866d871c66e7f. If I revert blk_mq_timeout_work() to use blk_queue_enter(q, true) the hang is no longer seen, however the original deadlock that Gabriel fixed will return, so we cannot simply revert the patch.
Comment 3 Douglas Miller 2017-01-26 14:45:27 UTC
I am seeing this hang on 4.10.0-rc5.
Comment 4 Douglas Miller 2017-01-26 22:34:57 UTC
I might see the bug here. percpu_ref_tryget() calls atomic_long_inc_not_zero() in the case we're seeing, which would change the count from 8000000000000000 to 8000000000000001 and return 8000000000000000. But, percpu_ref_tryget() captures the return value as an int, not long, and returns it as a true/false indicating whether or not the increment took place. But, because it stripped off the high bits, it returns "0" indicating the increment did not take place when in fact the increment *did* take place because the true counter value was 8000000000000000 and not 00000000. This looks like a bug in percpu_ref_tryget*() routines.
Comment 5 Douglas Miller 2017-01-27 21:01:52 UTC
Patch submitted https://lkml.org/lkml/2017/1/27/810
Comment 7 Douglas Miller 2017-02-01 13:19:04 UTC
Commit is now in kernel/git/torvalds/linux.git, closing.