Bug 212305
| Summary: | blktests block/009 fails on linux-next next-20210304, failure rate 1/669 | | |
|---|---|---|---|
| Product: | IO/Storage | Reporter: | Luis Chamberlain (mcgrof) |
| Component: | Block Layer | Assignee: | Jens Axboe (axboe) |
| Status: | RESOLVED CODE_FIX | | |
| Severity: | normal | CC: | wagi |
| Priority: | P1 | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Kernel Version: | 5.12.0-rc1-next-20210304 | Subsystem: | |
| Regression: | No | Bisected commit-id: | |
Description

**Luis Chamberlain** 2021-03-16 17:01:40 UTC

(The original report, comment #0, is quoted within the follow-up comments below.)
**Luis Chamberlain:**

(In reply to Luis Chamberlain from comment #0)
> But since we just want to busy loop test block/009:
>
> ssh blktests-baseline

Sorry, this should be:

```
ssh kdevops-blktests-block
```

**Luis Chamberlain:**

Hit a second failure at a failure rate of 1/614, that is, a failure after 614 tests. The average failure rate is now 1/641 ((669 + 614) / 2 ≈ 641).

**Jens Axboe:**

Luis, can you just post this on linux-block? Nobody really looks here, it seems... I'd also make sure that the error still persists in a current tree first, instead of one that's two weeks old.

**Luis Chamberlain:**

Hit a third failure, now at 1/61; the average failure rate is now 1/448 ((669 + 614 + 61) / 3 = 448).

**Luis Chamberlain:**

(In reply to Jens Axboe from comment #3)
> Luis, can you just post this on linux-block? Nobody really looks here, it
> seems... I'd also make sure that error still persists in a current tree
> first, instead of one that's 2 weeks old.

I was going to, but figured I'd first jot down as much as I could about the issue. Will do so now.

**Luis Chamberlain:**

(In reply to Luis Chamberlain from comment #0)
> While using kdevops [0] to test the block layer with blktests to create a
> baseline, I found that the test block/009 can fail on linux-next tag
> next-20210304 after running the test 669 times in a loop.
>
> To reproduce, you can get blktests set up with kdevops:
>
> make menuconfig  # configure and enable blktests, only enable the block group, and select linux-next
> make bringup     # bring up KVM guests to test
> make blktests    # build and install blktests

I forgot to provide instructions on how to get linux-next onto the guest once configured:

```
make linux
```

**Luis Chamberlain:**

Confirmed on linux-next as of today, tag next-20210316, as well, at a failure rate of 1/1008:

```
root@kdevops-blktests-block /usr/local/blktests # uname -r
5.12.0-rc3-next-20210316

block/009 (check page-cache coherency after BLKDISCARD)     [failed]
    runtime  0.960s  ...  1.011s
    --- tests/block/009.out	2021-02-22 16:58:00.235209631 +0000
    +++ /usr/local/blktests/results/nodev/block/009.out.bad	2021-03-16 18:38:31.792430532 +0000
    @@ -1,6 +1,10 @@
     Running block/009
     0000000 0000 0000 0000 0000 0000 0000 0000 0000
     *
    +003e000 aaaa aaaa aaaa aaaa aaaa aaaa aaaa aaaa
    +*
    +0040000 0000 0000 0000 0000 0000 0000 0000 0000
    +*
    ...
    (Run 'diff -u tests/block/009.out /usr/local/blktests/results/nodev/block/009.out.bad' to see the entire diff)
FAIL at 1008

root@kdevops-blktests-block /usr/local/blktests # cat /usr/local/blktests/results/nodev/block/009.out.bad
Running block/009
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
003e000 aaaa aaaa aaaa aaaa aaaa aaaa aaaa aaaa
*
0040000 0000 0000 0000 0000 0000 0000 0000 0000
*
2000000
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
2000000
Test complete
```
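The 1/N failure rates quoted throughout this thread, and the "FAIL at 1008" marker in the output above, come from rerunning block/009 until it fails and counting the runs. kdevops drives that loop itself; the sketch below is a minimal hand-rolled equivalent, assuming blktests is installed under /usr/local/blktests as in the session above, and is illustrative rather than the actual kdevops tooling:

```sh
#!/bin/bash
# Illustrative busy loop, not kdevops code: rerun block/009 until it
# fails, then report how many runs it took (the 1/N figure).
cd /usr/local/blktests || exit 1

i=0
while true; do
	i=$((i + 1))
	# blktests' ./check exits non-zero when a test in the run fails.
	if ! ./check block/009; then
		echo "FAIL at $i"
		break
	fi
done
```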
**Luis Chamberlain:**

Once again hit a failure on linux-next next-20210316, with a failure rate of 1/862.

**Luis Chamberlain:**

Third failure on linux-next next-20210316, at a failure rate of 1/255.

**Luis Chamberlain:**

It also fails even if CONFIG_SCSI_SCAN_ASYNC is disabled:

```
root@kdevops-blktests-block /usr/local/blktests # cat /sys/module/scsi_mod/parameters/scan
sync
```

Failure rate: 1/1185.

**Luis Chamberlain:**

In case this is a scsi_debug issue, I tried the following:

```diff
diff --git a/common/scsi_debug b/common/scsi_debug
index b48cdc9..ecdbcc6 100644
--- a/common/scsi_debug
+++ b/common/scsi_debug
@@ -8,13 +8,42 @@ _have_scsi_debug() {
 	_have_modules scsi_debug
 }
 
+# As per korg#212337 [0] we must do more work in userspace to settle
+# scsi_debug devices a bit more carefully.
+
+# [0] https://bugzilla.kernel.org/show_bug.cgi?id=212337
+_settle_scsi_debug_device() {
+	SCSI_DEBUG_MAX_WAIT=10
+	SCSI_DEBUG_COUNT_WAIT_LOOP=0
+	while true ; do
+		if [[ -b $1 ]]; then
+			SCSI_DEBUG_LSOF_COUNT=$(lsof $1 | wc -l)
+			if [[ $SCSI_DEBUG_LSOF_COUNT -ne 0 ]]; then
+				sleep 1;
+			else
+				break
+			fi
+		else
+			# Let device come up
+			sleep 1
+
+			let SCSI_DEBUG_COUNT_WAIT_LOOP=$SCSI_DEBUG_COUNT_WAIT_LOOP+1
+			if [[ $SCSI_DEBUG_COUNT_WAIT_LOOP -ge $SCSI_DEBUG_MAX_WAIT ]]; then
+				break
+			fi
+		fi
+	done
+}
+
 _init_scsi_debug() {
 	if ! modprobe -r scsi_debug || ! modprobe scsi_debug "$@"; then
 		return 1
 	fi
-	udevadm settle
+	# Allow dependencies to load
+	sleep 1
+
 	local host_sysfs host target_sysfs target
 
 	SCSI_DEBUG_HOSTS=()
 	SCSI_DEBUG_TARGETS=()
@@ -43,6 +72,10 @@ _init_scsi_debug() {
 		return 1
 	fi
 
+	for i in $SCSI_DEBUG_DEVICES ; do
+		_settle_scsi_debug_device /dev/$i
+	done
+
 	return 0
 }
```

But I still managed to reproduce a failure, with a failure rate of 1/1705 on linux-next tag next-20210316.

**Luis Chamberlain:**

Given that scsi_debug arguably still leaves initialization in a questionable state, I decided to try a bigger sledgehammer to disprove that this issue can be blamed on scsi_debug. I modified the above patch for blktests' common/scsi_debug so that, where it says "Allow dependencies to load" and sleeps for just 1 second, it now sleeps for 5 seconds. I kept the rest, including the new _settle_scsi_debug_device(). Together this should be *sufficient*, IMHO, to really settle the simple scsi_debug device created by block/009. And I was still able to reproduce a failure, this time at a failure rate of 1/5276:

```
block/009 (check page-cache coherency after BLKDISCARD)     [failed]
    runtime  6.060s  ...  7.471s
    --- tests/block/009.out	2021-02-22 16:58:00.235209631 +0000
    +++ /usr/local/blktests/results/nodev/block/009.out.bad	2021-03-19 04:56:01.123720333 +0000
    @@ -1,6 +1,10 @@
     Running block/009
     0000000 0000 0000 0000 0000 0000 0000 0000 0000
     *
    +002f000 aaaa aaaa aaaa aaaa aaaa aaaa aaaa aaaa
    +*
    +0040000 0000 0000 0000 0000 0000 0000 0000 0000
    +*
    ...
    (Run 'diff -u tests/block/009.out /usr/local/blktests/results/nodev/block/009.out.bad' to see the entire diff)

root@kdevops-blktests-block /usr/local/blktests # cat /usr/local/blktests/results/nodev/block/009.out.bad
Running block/009
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
002f000 aaaa aaaa aaaa aaaa aaaa aaaa aaaa aaaa
*
0040000 0000 0000 0000 0000 0000 0000 0000 0000
*
2000000
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
2000000
Test complete
```

**Luis Chamberlain:**

Well, this issue is no longer present upstream on linux-next. Some older kernels are affected. Also, to make things more confusing, vanilla v5.12 does not have the issue either.

**Luis Chamberlain:**

At this point I've run the test in a loop on linux-next next-20211115 over 181,000 times, with and without my newest patient-modprobe patch for blktests [0]. The same applies to vanilla v5.12; I tested that 130,000 times. I even tried reverting these two patches, and vanilla v5.12 still worked without issue:

    block: return -EBUSY when there are open partitions in blkdev_reread_part
    block: reopen the device in blkdev_reread_part

Distributions will have to do some homework to fix this on their end for kernels in between :)

[0] https://lkml.kernel.org/r/20211116172926.587062-1-mcgrof@kernel.org
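For context on what the test exercises: block/009 checks that the page cache stays coherent with the device after a BLKDISCARD, which is why the failing runs above show leftover "aaaa" pattern lines where zeroes were expected. Below is a rough standalone sketch of that kind of check, not the test's actual source; the device path /dev/sdX, the 32 MiB size, and the 0xaa fill pattern are assumptions for illustration, based on the hexdump output quoted above:

```sh
#!/bin/bash
# Hypothetical coherency check after BLKDISCARD. /dev/sdX stands in for
# a discard-capable test device (e.g. a scsi_debug disk) on which
# discarded blocks read back as zeroes. Destroys the device's contents.
dev=/dev/sdX

# Fill the device with a 0xaa pattern (octal 0252) and read it back so
# the pattern is resident in the page cache.
tr '\0' '\252' < /dev/zero | dd of="$dev" bs=1M count=32 iflag=fullblock
sync
dd if="$dev" of=/dev/null bs=1M

# Discard the entire device.
blkdiscard "$dev"

# A buffered read (through the page cache) and a direct read (straight
# from the device) should now agree. Stray "aaaa" lines in the buffered
# dump are exactly the incoherency seen in the failing output above.
hexdump "$dev" > /tmp/buffered.hex
dd if="$dev" bs=1M iflag=direct | hexdump > /tmp/direct.hex
diff /tmp/buffered.hex /tmp/direct.hex
```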