Bug 214015

Summary: module refcnt being 0 after testing fstests generic/108 prevents module removal
Product: Other
Reporter: Luis Chamberlain (mcgrof)
Component: Modules
Assignee: Luis Chamberlain (mcgrof)
Status: ASSIGNED
Severity: normal
Priority: P5
Hardware: All
OS: Linux
Kernel Version:
Subsystem:
Regression: No
Bisected commit-id:
Attachments: v1 mod-refcnt-race.sh, busy-open-block-device-sleep.c

Description Luis Chamberlain 2021-08-10 02:31:10 UTC
Created attachment 298247
v1 mod-refcnt-race.sh

While testing with fstests generic/108 and trying to address many low-hanging-fruit module removal races with scsi_debug through korg#212337 [0], I noticed what could be perceived as a generic kernel issue: userspace sees a refcnt of 0 and is still unable to remove the module.

Since this issue is really difficult to reproduce, I ended up rewriting a test case to demonstrate it using any filesystem; the script picks one at random: xfs, ext4 or btrfs. The attached script mod-refcnt-race.sh can now be used to reproduce the issue easily.

I'll follow up soon on this bug entry with my findings and resolutions.

Three use cases:

1) The refcnt not being 0 when you actually expect it to be, which calls for a patient module removal option.

mod-refcnt-race.sh --modprobe-verbose --use-patience --max-loops 1


2) The refcnt being 0 as observed in userspace, yet a subsequent request to remove the module fails, after using pvremove.

mod-refcnt-race.sh --modprobe-verbose --use-patience

A failure rate of about 1 in 16 test runs can be observed.

3) Same race as 2), but without creating a filesystem or mounting it.

mod-refcnt-race.sh --modprobe-verbose --use-patience --skip-mount
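
The first two use cases can be sketched in shell as follows. This is only a minimal sketch under stated assumptions. It is neither the attached script nor the kmod patches. The function names wait_refcnt_zero and patient_remove, the SYSFS_MODULE_ROOT variable, and the retry counts are hypothetical. It assumes the module's reference count is readable at /sys/module/<name>/refcnt, and it takes the removal command as arguments, so a refcnt of 0 followed by a failed removal (use case 2) is retried rather than treated as final.

```shell
# Hypothetical helpers; names and defaults are illustrative only.
# Written for bash or any shell that supports "local".

# Root of the per-module sysfs tree. Overridable so the helpers can be
# exercised against a fake directory.
SYSFS_MODULE_ROOT="${SYSFS_MODULE_ROOT:-/sys/module}"

# wait_refcnt_zero MODULE [TRIES] [DELAY]
# Poll MODULE's refcnt until it reads 0. Returns 0 when it reads 0 or when
# there is no refcnt file to wait on, 1 if it is still nonzero after TRIES polls.
wait_refcnt_zero() {
    local module="$1" tries="${2:-10}" delay="${3:-1}"
    local file="$SYSFS_MODULE_ROOT/$module/refcnt"
    local i=0 refcnt

    while [ "$i" -lt "$tries" ]; do
        # No refcnt file: nothing to wait for.
        [ -e "$file" ] || return 0
        refcnt="$(cat "$file" 2>/dev/null)" || return 0
        [ "$refcnt" = "0" ] && return 0
        sleep "$delay"
        i=$((i + 1))
    done
    return 1
}

# patient_remove MODULE TRIES DELAY CMD [ARGS...]
# Wait for the refcnt to reach 0, then run the removal command. Seeing 0
# does not guarantee that removal succeeds, so a failed removal is retried.
patient_remove() {
    local module="$1" tries="$2" delay="$3"
    shift 3
    local i=0

    while [ "$i" -lt "$tries" ]; do
        if wait_refcnt_zero "$module" "$tries" "$delay" && "$@"; then
            return 0
        fi
        sleep "$delay"
        i=$((i + 1))
    done
    return 1
}
```

A possible invocation, run as root, would be: patient_remove scsi_debug 5 1 modprobe -r scsi_debug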

[0] https://bugzilla.kernel.org/show_bug.cgi?id=212337
Comment 1 Luis Chamberlain 2021-08-10 02:32:36 UTC
I have patches for kmod which address these issues. I'll post them soon; in the meantime I am using this bug to track the issue and the script that helps reproduce it.
Comment 2 Luis Chamberlain 2021-08-10 04:38:07 UTC
Created attachment 298249
busy-open-block-device-sleep.c

The attached program can be used with the script to induce the race: the race on the refcnt is forced, and then a few module removal attempts should fail after the refcnt is 0.
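
As a rough illustration of how such a race could be forced from a shell, the sketch below holds a file descriptor open on a path for a fixed time and then closes it. It is not the attached C program, whose contents are not reproduced here. The function name hold_open and the /dev/sdX placeholder are hypothetical, and it is an assumption that an open descriptor on a block device keeps the owning driver's refcnt elevated until the descriptor is closed.

```shell
# hold_open PATH SECONDS
# Hypothetical helper: open PATH read-only on file descriptor 3, sleep for
# SECONDS, then release it. Returns nonzero if PATH cannot be opened.
hold_open() {
    local path="$1" seconds="$2"

    # Run in a subshell so the descriptor is closed when the subshell exits.
    (
        exec 3<"$path" || exit 1
        sleep "$seconds"
    )
}
```

Running hold_open /dev/sdX 5 in the background while attempting module removal from another shell would be one way to try this, with /dev/sdX replaced by the device under test.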
Comment 3 Luis Chamberlain 2021-08-10 05:18:00 UTC
v2 patches posted at:

https://lkml.kernel.org/r/20210810051602.3067384-1-mcgrof@kernel.org