Bug 65281 - "sysfs group not found for kobject" when removing SCSI device
Summary: "sysfs group not found for kobject" when removing SCSI device
Status: RESOLVED PATCH_ALREADY_AVAILABLE
Alias: None
Product: File System
Classification: Unclassified
Component: SysFS
Hardware: All Linux
Importance: P1 normal
Assignee: Greg Kroah-Hartman
URL: http://lkml.kernel.org/r/1384866598-1...
Keywords:
Depends on:
Blocks:
 
Reported: 2013-11-20 22:42 UTC by Bjorn Helgaas
Modified: 2014-03-12 11:26 UTC

See Also:
Kernel Version: 3.12
Tree: Mainline
Regression: No


Attachments
qemu setup (3.23 KB, text/plain)
2013-11-20 23:08 UTC, Bjorn Helgaas
dmesg from qemu (96.16 KB, text/plain)
2013-11-20 23:11 UTC, Bjorn Helgaas
Debug dmesg output from Acer Aspire S5 with a Thunderbolt unplug/replug cycle (210.36 KB, text/plain)
2013-11-23 20:46 UTC, Rafael J. Wysocki
PCI: Move device_del() from pci_stop_dev() to pci_destroy_dev() (727 bytes, patch)
2013-11-23 22:29 UTC, Rafael J. Wysocki

Description Bjorn Helgaas 2013-11-20 22:42:17 UTC
Mika Westerberg <mika.westerberg@linux.intel.com> reported the following issue (and included a patch to fix it; see the URL).  I'm just opening this bugzilla as a place for my notes about the details.


Commit bcdde7e221a8 (sysfs: make __sysfs_remove_dir() recursive) changed
the behavior so that directory removals will be done recursively. This
means that the sysfs group might already be removed if its parent directory
has been removed.

The current code outputs warnings similar to the following log snippet when it
detects that there is no group for the given kobject:

 WARNING: CPU: 0 PID: 4 at fs/sysfs/group.c:214 sysfs_remove_group+0xc6/0xd0()
 sysfs group ffffffff81c6f1e0 not found for kobject 'host7'
 Modules linked in:
 CPU: 0 PID: 4 Comm: kworker/0:0 Not tainted 3.12.0+ #13
 Hardware name:                  /D33217CK, BIOS GKPPT10H.86A.0042.2013.0422.1439 04/22/2013
 Workqueue: kacpi_hotplug acpi_hotplug_work_fn
  0000000000000009 ffff8801002459b0 ffffffff817daab1 ffff8801002459f8
  ffff8801002459e8 ffffffff810436b8 0000000000000000 ffffffff81c6f1e0
  ffff88006d440358 ffff88006d440188 ffff88006e8b4c28 ffff880100245a48
 Call Trace:
  [<ffffffff817daab1>] dump_stack+0x45/0x56
  [<ffffffff810436b8>] warn_slowpath_common+0x78/0xa0
  [<ffffffff81043727>] warn_slowpath_fmt+0x47/0x50
  [<ffffffff811ad319>] ? sysfs_get_dirent_ns+0x49/0x70
  [<ffffffff811ae526>] sysfs_remove_group+0xc6/0xd0
  [<ffffffff81432f7e>] dpm_sysfs_remove+0x3e/0x50
  [<ffffffff8142a0d0>] device_del+0x40/0x1b0
  [<ffffffff8142a24d>] device_unregister+0xd/0x20
  [<ffffffff8144131a>] scsi_remove_host+0xba/0x110
  [<ffffffff8145f526>] ata_host_detach+0xc6/0x100
  [<ffffffff8145f578>] ata_pci_remove_one+0x18/0x20
  [<ffffffff812e8f48>] pci_device_remove+0x28/0x60
  [<ffffffff8142d854>] __device_release_driver+0x64/0xd0
  [<ffffffff8142d8de>] device_release_driver+0x1e/0x30
  [<ffffffff8142d257>] bus_remove_device+0xf7/0x140
  [<ffffffff8142a1b1>] device_del+0x121/0x1b0
  [<ffffffff812e43d4>] pci_stop_bus_device+0x94/0xa0
  [<ffffffff812e437b>] pci_stop_bus_device+0x3b/0xa0
  [<ffffffff812e437b>] pci_stop_bus_device+0x3b/0xa0
  [<ffffffff812e44dd>] pci_stop_and_remove_bus_device+0xd/0x20
  [<ffffffff812fc743>] trim_stale_devices+0x73/0xe0
  [<ffffffff812fc78b>] trim_stale_devices+0xbb/0xe0
  [<ffffffff812fc78b>] trim_stale_devices+0xbb/0xe0
  [<ffffffff812fcb6e>] acpiphp_check_bridge+0x7e/0xd0
  [<ffffffff812fd90d>] hotplug_event+0xcd/0x160
  [<ffffffff812fd9c5>] hotplug_event_work+0x25/0x60
  [<ffffffff81316749>] acpi_hotplug_work_fn+0x17/0x22
  [<ffffffff8105cf3a>] process_one_work+0x17a/0x430
  [<ffffffff8105db29>] worker_thread+0x119/0x390
  [<ffffffff8105da10>] ? manage_workers.isra.25+0x2a0/0x2a0
  [<ffffffff81063a5d>] kthread+0xcd/0xf0
  [<ffffffff81063990>] ? kthread_create_on_node+0x180/0x180
  [<ffffffff817eb33c>] ret_from_fork+0x7c/0xb0
  [<ffffffff81063990>] ? kthread_create_on_node+0x180/0x180

On this particular machine I see ~16 of these messages during Thunderbolt
hot-unplug.
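
The effect described in the commit message above can be illustrated with a toy userspace model. This is only a sketch, not kernel code: struct node, remove_recursive(), remove_group() and remove_child_group() are hypothetical names invented for illustration. The model assumes nothing beyond what the text states, namely that removing a directory also removes everything beneath it, so a later attempt to remove a child's "power" group finds nothing.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Toy stand-in for a sysfs directory entry; not the kernel's data structure. */
struct node {
    const char *name;
    int live;               /* 0 once the entry has been removed */
    struct node *child;     /* first entry inside this directory */
    struct node *sibling;   /* next entry in the same directory */
};

/* Initialise caller-provided storage and link it under parent (NULL for a root). */
void node_init(struct node *n, const char *name, struct node *parent)
{
    assert(n != NULL && name != NULL);
    n->name = name;
    n->live = 1;
    n->child = NULL;
    n->sibling = parent ? parent->child : NULL;
    if (parent)
        parent->child = n;
}

/* Recursive removal: the entry and everything below it go away. */
void remove_recursive(struct node *n)
{
    struct node *c;

    n->live = 0;
    for (c = n->child; c; c = c->sibling)
        remove_recursive(c);
}

/* Remove a named group from dir; returns 0 on success, -1 where the kernel would warn. */
int remove_group(struct node *dir, const char *group)
{
    struct node *c;

    for (c = dir->child; c; c = c->sibling) {
        if (c->live && strcmp(c->name, group) == 0) {
            remove_recursive(c);
            return 0;
        }
    }
    fprintf(stderr, "sysfs group '%s' not found for kobject '%s'\n",
            group, dir->name);
    return -1;
}

/*
 * Build parent/child directories, each with a "power" group. Optionally remove
 * the parent directory first, then try to remove the child's "power" group.
 */
int remove_child_group(int parent_removed_first)
{
    struct node parent, parent_power, child, child_power;

    node_init(&parent, "parent", NULL);
    node_init(&parent_power, "power", &parent);
    node_init(&child, "child", &parent);
    node_init(&child_power, "power", &child);

    if (parent_removed_first)
        remove_recursive(&parent);
    return remove_group(&child, "power");
}
```

In this model, remove_child_group(1) returns -1 and prints a message resembling the warning, while remove_child_group(0) returns 0 because the child's group is still present when it is removed.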
Comment 1 Bjorn Helgaas 2013-11-20 23:08:02 UTC
Created attachment 115341 [details]
qemu setup

I reproduced this problem on qemu using the attached setup by removing the AHCI controller with

  echo 1 > /sys/bus/pci/devices/0000:00:1f.2/remove
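
For a scripted reproduction, the same write can be issued from C. This is a minimal sketch: pci_remove_path() and write_one() are hypothetical helper names, the device address is the one from the command above, and actually triggering the removal requires root on a system where that device exists.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Build "/sys/bus/pci/devices/<bdf>/remove" in buf.
 * Returns 0 on success, -1 if bdf contains a '/' or buf is too small.
 */
int pci_remove_path(char *buf, size_t len, const char *bdf)
{
    int n;

    assert(buf != NULL && bdf != NULL);
    if (strchr(bdf, '/') != NULL)   /* refuse anything that is not a plain device name */
        return -1;
    n = snprintf(buf, len, "/sys/bus/pci/devices/%s/remove", bdf);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Write "1" to the given attribute file; returns 0 on success, a negative errno value on failure. */
int write_one(const char *path)
{
    int fd, err = 0;
    ssize_t w;

    fd = open(path, O_WRONLY);
    if (fd < 0)
        return -errno;
    w = write(fd, "1", 1);
    if (w < 0)
        err = -errno;
    else if (w != 1)
        err = -EIO;
    close(fd);
    return err;
}
```

A caller would build the path with pci_remove_path(buf, sizeof(buf), "0000:00:1f.2") and pass buf to write_one(); a nonzero return from either step means the removal was not requested.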
Comment 2 Bjorn Helgaas 2013-11-20 23:11:12 UTC
Created attachment 115351 [details]
dmesg from qemu

Attaching the dmesg log (including the "sysfs group not found" warnings) from qemu.  The callgraph where the warnings come from is below.

scsi_remove_host(shost)
  scsi_forget_host(shost)
    __scsi_remove_device(sdev)
      bsg_unregister_queue(sdev->request_queue)
        device_unregister(bcd->class_dev)
          device_del
            dpm_sysfs_remove
              sysfs_remove_group
                "sysfs group ffffffff81e70720 ('power') not found for kobject '0:0:0:0'"
      device_unregister(&sdev->sdev_dev)
        device_del
          dpm_sysfs_remove
            sysfs_remove_group
              "sysfs group ffffffff81e70720 ('power') not found for kobject '0:0:0:0'"
          class_intf->remove_dev        # .remove_dev = sg_remove
            sg_remove
              device_destroy(sg_sysfs_class, MKDEV(SCSI_GENERIC_MAJOR, sdp->index))
                device_unregister
                  device_del
                    dpm_sysfs_remove
                      sysfs_remove_group
                        "sysfs group ffffffff81e70720 ('power') not found for kobject 'sg0'"
      device_del(&sdev->sdev_gendev)
        dpm_sysfs_remove
          sysfs_remove_group
            "sysfs group ffffffff81e70720 ('power') not found for kobject '0:0:0:0'"
        bus_remove_device
          device_release_driver
            __device_release_driver
              sd_remove
                device_del(&sdkp->dev)
                  dpm_sysfs_remove
                    sysfs_remove_group
                      "sysfs group ffffffff81e70720 ('power') not found for kobject '0:0:0:0'"
                del_gendisk(&sdkp->disk)
                  delete_partition
                    device_del
                      dpm_sysfs_remove
                        sysfs_remove_group
                          "sysfs group ffffffff81e70720 ('power') not found for kobject 'sda5'"
                      device_remove_attrs
                        device_remove_groups
                          sysfs_remove_groups
                            "sysfs group ffffffff81e3ba60 ('trace') not found for kobject 'sda5'"
                  blk_unregister_queue(disk)
                    blk_trace_remove_sysfs
                      sysfs_remove_group
                        "sysfs group ffffffff81e3ba60 ('trace') not found for kobject 'sda'"
                  device_del(disk_to_dev(disk))
                    dpm_sysfs_remove
                      sysfs_remove_group
                        "sysfs group ffffffff81e70720 ('power') not found for kobject 'sda'"
  device_unregister(&shost->shost_dev)
    device_del
      dpm_sysfs_remove
        sysfs_remove_group
          "sysfs group ffffffff81e70720 ('power') not found for kobject 'host0'"
  device_del(&shost->shost_gendev)
    dpm_sysfs_remove
      sysfs_remove_group
        "sysfs group ffffffff81e70720 ('power') not found for kobject 'host0'"
Comment 3 Rafael J. Wysocki 2013-11-22 22:57:09 UTC
I wonder what the output is if you printk() pos->s_name just above the sysfs_remove_one(&acxt, pos) call in __sysfs_remove_dir()?
Comment 4 Rafael J. Wysocki 2013-11-23 02:59:17 UTC
So what happens here is that we get device_del() for 0000:00:1f.2 first.  That removes everything up to the point where it calls bus_remove_device(), which stops the driver and triggers device_del() for the ata_device on dev1.0 and then for dev1.0 itself.

Next, it does device_del() for ata_link on link1 (which is under 0000:00:1f.2/ata1/) and for link1 itself.

Then, it does device_del() for ata_port on ata1 and for ata1 itself (that is, 0000:00:1f.2/ata1/).

That descends into the host0 subdirectory and removes it recursively before trying to unregister host0.  That removes host0/target0:0:0/0:0:0:0/power among other things.

So when it finally goes to delete bsg on 0:0:0:0, it finds that there's no "power" group below the bsg's child also called 0:0:0:0 (whoever designed that subsystem had a sick sense of humor) - because that group has been removed already.

So in this particular case there is an ordering problem, because 0000:00:1f.2/ata1/host0/ should have been deleted before 0000:00:1f.2/ata1/.
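
The ordering argument can be checked with a small userspace model. This is only a sketch under stated assumptions: directories are plain path strings taken from this comment, tree_del() is a hypothetical stand-in for device_del() that removes a directory together with everything below it, and a deletion that finds its directory already gone stands in for the missing "power" group warning. Sorting the deletions deepest-first is just one simple way to guarantee that children go before parents; it is not meant to describe how any actual fix works.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MAX_DIRS 16

/* Toy model: the sysfs directories that still exist, kept as path strings. */
struct tree {
    const char *path[MAX_DIRS];
    int n;
};

/* True if path is dir itself or lies somewhere below it. */
static int is_under(const char *path, const char *dir)
{
    size_t len = strlen(dir);

    return strncmp(path, dir, len) == 0 &&
           (path[len] == '\0' || path[len] == '/');
}

/*
 * Hypothetical stand-in for device_del(): removes dir and everything below it.
 * Returns 0 if dir still existed, -1 if it was already gone (the "warning" case).
 */
int tree_del(struct tree *t, const char *dir)
{
    int i, kept = 0, found = 0;

    for (i = 0; i < t->n; i++) {
        if (strcmp(t->path[i], dir) == 0)
            found = 1;
        if (!is_under(t->path[i], dir))
            t->path[kept++] = t->path[i];
    }
    t->n = kept;
    return found ? 0 : -1;
}

/* Nesting depth of a path, counted as the number of '/' separators. */
static int depth(const char *p)
{
    int d = 0;

    for (; *p; p++)
        if (*p == '/')
            d++;
    return d;
}

/* qsort() comparator: deeper paths sort first, so children precede parents. */
static int deeper_first(const void *a, const void *b)
{
    const char *pa = *(const char *const *)a;
    const char *pb = *(const char *const *)b;

    return depth(pb) - depth(pa);
}

/* Delete the example directories parent-first or children-first; return the number of misses. */
int count_misses(int children_first)
{
    const char *order[] = {
        "0000:00:1f.2/ata1",
        "0000:00:1f.2/ata1/host0",
        "0000:00:1f.2/ata1/host0/target0:0:0",
        "0000:00:1f.2/ata1/host0/target0:0:0/0:0:0:0",
    };
    int n = (int)(sizeof(order) / sizeof(order[0]));
    struct tree t = { { NULL }, 0 };
    int i, misses = 0;

    assert(n <= MAX_DIRS);
    for (i = 0; i < n; i++)         /* every directory exists at the start */
        t.path[t.n++] = order[i];
    if (children_first)
        qsort(order, (size_t)n, sizeof(order[0]), deeper_first);
    for (i = 0; i < n; i++)
        if (tree_del(&t, order[i]) < 0)
            misses++;
    return misses;
}
```

With the parent deleted first, count_misses(0) reports a miss for each of the three directories below ata1; with the children deleted first, count_misses(1) reports none.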

I'm not sure about the Thunderbolt case, though, will look into it tomorrow.
Comment 5 Rafael J. Wysocki 2013-11-23 03:01:57 UTC
In any case, I don't see a clean way to fix the above, and Mika's patch seems to be the simplest viable workaround.
Comment 6 Rafael J. Wysocki 2013-11-23 20:46:56 UTC
Created attachment 115751 [details]
Debug dmesg output from Acer Aspire S5 with a Thunderbolt unplug/replug cycle

I've reproduced the issue on an Acer Aspire S5 w/ Thunderbolt and attached is a dmesg output containing a Thunderbolt unplug/replug cycle.

The kernel is 3.13-rc1 with a few additional patches on top, including this one:

Index: linux-pm/drivers/base/core.c
===================================================================
--- linux-pm.orig/drivers/base/core.c
+++ linux-pm/drivers/base/core.c
@@ -1188,6 +1188,8 @@ void device_del(struct device *dev)
 	struct device *parent = dev->parent;
 	struct class_interface *class_intf;
 
+	dev_err(dev, "%s\n", __func__);
+
 	/* Notify clients of device removal.  This call must come
 	 * before dpm_sysfs_remove().
 	 */
Index: linux-pm/fs/sysfs/dir.c
===================================================================
--- linux-pm.orig/fs/sysfs/dir.c
+++ linux-pm/fs/sysfs/dir.c
@@ -875,8 +875,10 @@ static void __sysfs_remove(struct sysfs_
 	do {
 		pos = next;
 		next = sysfs_next_descendant_post(pos, sd);
-		if (pos)
+		if (pos) {
 			sysfs_remove_one(acxt, pos);
+			pr_err("%s: %s\n", __func__, pos->s_name);
+		}
 	} while (next);
 }
Comment 7 Rafael J. Wysocki 2013-11-23 21:39:09 UTC
To my eyes, for PCI the problem is that pci_stop_dev() does a device_del(), which removes the device's sysfs directories recursively.  That includes the "power" group of the bus device, which is then removed by pci_remove_bus().
Comment 8 Rafael J. Wysocki 2013-11-23 22:29:54 UTC
Created attachment 115761 [details]
PCI: Move device_del() from pci_stop_dev() to pci_destroy_dev()

So this patch fixes the issue for me without Mika's patch.
Comment 9 Rafael J. Wysocki 2013-11-24 00:38:32 UTC
However, I don't have SATA devices down my Thunderbolt link, so the patch from comment #8 is not sufficient to fix the trace from #description.
Comment 10 Rafael J. Wysocki 2013-11-25 12:09:08 UTC
The following patches are sufficient to make all of the warnings go away without Mika's patch:

https://patchwork.kernel.org/patch/3226081/
https://patchwork.kernel.org/patch/3229651/

(the first one is analogous to the one in comment #8).
Comment 11 Vlad 2014-03-12 06:27:44 UTC
(In reply to Rafael J. Wysocki from comment #10)
> The following patches are sufficient to make all of the warnings go away
> without Mika's patch:
> 
> https://patchwork.kernel.org/patch/3226081/
> https://patchwork.kernel.org/patch/3229651/
> 
> (the first one is analogous to the one in comment #8).

I can still reproduce this bug with a vanilla 3.13.6 kernel:
------------[ cut here ]------------
WARNING: CPU: 2 PID: 5345 at fs/sysfs/group.c:214 device_del+0x3b/0x1b0()
sysfs group ffffffff81a62480 not found for kobject 'target7:0:0'
Modules linked in: kvm_intel kvm wmi
CPU: 2 PID: 5345 Comm: eject Not tainted 3.13.6 #1
Hardware name: ASUS All Series/H87M-PRO, BIOS 0502 04/08/2013
 0000000000000009 ffffffff817b8404 ffff880336b6dd58 ffffffff81062f7d
 ffff8802fc6c5c00 ffff880336b6dda8 ffff8800b8044900 ffff8802e758b188
 ffff8800b8044800 ffffffff81062fe7 ffffffff819538e0 0000000000000028
Call Trace:
 [<ffffffff817b8404>] ? dump_stack+0x49/0x6a
 [<ffffffff81062f7d>] ? warn_slowpath_common+0x6d/0x90
 [<ffffffff81062fe7>] ? warn_slowpath_fmt+0x47/0x50
 [<ffffffff815008bb>] ? device_del+0x3b/0x1b0
 [<ffffffff8151c8ad>] ? scsi_target_reap_usercontext+0x1d/0x30
 [<ffffffff81077d57>] ? execute_in_process_context+0x57/0x60
 [<ffffffff8151f86c>] ? scsi_device_dev_release_usercontext+0x16c/0x1b0
 [<ffffffff81077d57>] ? execute_in_process_context+0x57/0x60
 [<ffffffff81500098>] ? device_release+0x28/0x90
 [<ffffffff813c49b3>] ? kobject_cleanup+0x33/0x70
 [<ffffffff81522af6>] ? scsi_disk_put+0x26/0x40
 [<ffffffff8115b24d>] ? __blkdev_put+0x14d/0x190
 [<ffffffff8115bc5c>] ? blkdev_close+0x1c/0x20
 [<ffffffff8112a600>] ? __fput+0xb0/0x1f0
 [<ffffffff8107bbdf>] ? task_work_run+0x8f/0xd0
 [<ffffffff81002901>] ? do_notify_resume+0x61/0x90
 [<ffffffff8107bac5>] ? task_work_add+0x45/0x60
 [<ffffffff817bf6ea>] ? int_signal+0x12/0x17
---[ end trace e0d8e994af6f4ede ]---
Comment 12 Greg Kroah-Hartman 2014-03-12 06:41:52 UTC
On Wed, Mar 12, 2014 at 06:27:44AM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:

That makes sense as the patch is not yet merged into the tree :(
Comment 13 Vlad 2014-03-12 11:26:05 UTC
(In reply to Greg Kroah-Hartman from comment #12)
> On Wed, Mar 12, 2014 at 06:27:44AM +0000,
> bugzilla-daemon@bugzilla.kernel.org wrote:
> 
> That makes sense as the patch is not yet merged into the tree :(

If you're talking only about these two:
https://patchwork.kernel.org/patch/3226081/
https://patchwork.kernel.org/patch/3229651/
they have both been merged (I've checked the sources myself).
