Bug 65281

Summary: "sysfs group not found for kobject" when removing SCSI device
Product: File System Reporter: Bjorn Helgaas (bjorn)
Component: SysFSAssignee: Greg Kroah-Hartman (greg)
Status: RESOLVED PATCH_ALREADY_AVAILABLE    
Severity: normal CC: jernej.jakob, rjw, vovan
Priority: P1    
Hardware: All   
OS: Linux   
URL: http://lkml.kernel.org/r/1384866598-19716-1-git-send-email-mika.westerberg@linux.intel.com
Kernel Version: 3.12 Subsystem:
Regression: No Bisected commit-id:
Attachments: qemu setup
dmesg from qemu
Debug dmesg output from Acer Aspire S5 with a Thunderbolt unplug/replug cycle
PCI: Move device_del() from pci_stop_dev() to pci_destroy_dev()

Description Bjorn Helgaas 2013-11-20 22:42:17 UTC
Mika Westerberg <mika.westerberg@linux.intel.com> reported the following issue (and included a patch to fix it; see the URL).  I'm just opening this bugzilla as a place for my notes about the details.


Commit bcdde7e221a8 (sysfs: make __sysfs_remove_dir() recursive) changed
the behavior so that directory removals will be done recursively. This
means that the sysfs group might already be removed if its parent directory
has been removed.

The current code outputs warnings similar to following log snippet when it
detects that there is no group for the given kobject:

 WARNING: CPU: 0 PID: 4 at fs/sysfs/group.c:214 sysfs_remove_group+0xc6/0xd0()
 sysfs group ffffffff81c6f1e0 not found for kobject 'host7'
 Modules linked in:
 CPU: 0 PID: 4 Comm: kworker/0:0 Not tainted 3.12.0+ #13
 Hardware name:                  /D33217CK, BIOS GKPPT10H.86A.0042.2013.0422.1439 04/22/2013
 Workqueue: kacpi_hotplug acpi_hotplug_work_fn
  0000000000000009 ffff8801002459b0 ffffffff817daab1 ffff8801002459f8
  ffff8801002459e8 ffffffff810436b8 0000000000000000 ffffffff81c6f1e0
  ffff88006d440358 ffff88006d440188 ffff88006e8b4c28 ffff880100245a48
 Call Trace:
  [<ffffffff817daab1>] dump_stack+0x45/0x56
  [<ffffffff810436b8>] warn_slowpath_common+0x78/0xa0
  [<ffffffff81043727>] warn_slowpath_fmt+0x47/0x50
  [<ffffffff811ad319>] ? sysfs_get_dirent_ns+0x49/0x70
  [<ffffffff811ae526>] sysfs_remove_group+0xc6/0xd0
  [<ffffffff81432f7e>] dpm_sysfs_remove+0x3e/0x50
  [<ffffffff8142a0d0>] device_del+0x40/0x1b0
  [<ffffffff8142a24d>] device_unregister+0xd/0x20
  [<ffffffff8144131a>] scsi_remove_host+0xba/0x110
  [<ffffffff8145f526>] ata_host_detach+0xc6/0x100
  [<ffffffff8145f578>] ata_pci_remove_one+0x18/0x20
  [<ffffffff812e8f48>] pci_device_remove+0x28/0x60
  [<ffffffff8142d854>] __device_release_driver+0x64/0xd0
  [<ffffffff8142d8de>] device_release_driver+0x1e/0x30
  [<ffffffff8142d257>] bus_remove_device+0xf7/0x140
  [<ffffffff8142a1b1>] device_del+0x121/0x1b0
  [<ffffffff812e43d4>] pci_stop_bus_device+0x94/0xa0
  [<ffffffff812e437b>] pci_stop_bus_device+0x3b/0xa0
  [<ffffffff812e437b>] pci_stop_bus_device+0x3b/0xa0
  [<ffffffff812e44dd>] pci_stop_and_remove_bus_device+0xd/0x20
  [<ffffffff812fc743>] trim_stale_devices+0x73/0xe0
  [<ffffffff812fc78b>] trim_stale_devices+0xbb/0xe0
  [<ffffffff812fc78b>] trim_stale_devices+0xbb/0xe0
  [<ffffffff812fcb6e>] acpiphp_check_bridge+0x7e/0xd0
  [<ffffffff812fd90d>] hotplug_event+0xcd/0x160
  [<ffffffff812fd9c5>] hotplug_event_work+0x25/0x60
  [<ffffffff81316749>] acpi_hotplug_work_fn+0x17/0x22
  [<ffffffff8105cf3a>] process_one_work+0x17a/0x430
  [<ffffffff8105db29>] worker_thread+0x119/0x390
  [<ffffffff8105da10>] ? manage_workers.isra.25+0x2a0/0x2a0
  [<ffffffff81063a5d>] kthread+0xcd/0xf0
  [<ffffffff81063990>] ? kthread_create_on_node+0x180/0x180
  [<ffffffff817eb33c>] ret_from_fork+0x7c/0xb0
  [<ffffffff81063990>] ? kthread_create_on_node+0x180/0x180

On this particular machine I see ~16 of these message during Thunderbolt
hot-unplug.
Comment 1 Bjorn Helgaas 2013-11-20 23:08:02 UTC
Created attachment 115341 [details]
qemu setup

I reproduced this problem on qemu using the attached setup by removing the AHCI controller with

  echo 1 > /sys/bus/pci/devices/0000:00:1f.2/remove
Comment 2 Bjorn Helgaas 2013-11-20 23:11:12 UTC
Created attachment 115351 [details]
dmesg from qemu

Attaching the dmesg log (including the "sysfs group not found" warnings) from qemu.  The callgraph where the warnings come from is below.

scsi_remove_host(shost)
  scsi_forget_host(shost)
    __scsi_remove_device(sdev)
      bsg_unregister_queue(sdev->request_queue)
        device_unregister(bcd->class_dev)
          device_del
            dpm_sysfs_remove
              sysfs_remove_group
                "sysfs group ffffffff81e70720 ('power') not found for kobject '0:0:0:0'"
      device_unregister(&sdev->sdev_dev)
        device_del
          dpm_sysfs_remove
            sysfs_remove_group
              "sysfs group ffffffff81e70720 ('power') not found for kobject '0:0:0:0'"
          class_intf->remove_dev        # .remove_dev = sg_remove
            sg_remove
              device_destroy(sg_sysfs_class, MKDEV(SCSI_GENERIC_MAJOR, sdp->index))
                device_unregister
                  device_del
                    dpm_sysfs_remove
                      sysfs_remove_group
                        "sysfs group ffffffff81e70720 ('power') not found for kobject 'sg0'"
      device_del(&sdev->sdev_gendev)
        dpm_sysfs_remove
          sysfs_remove_group
            "sysfs group ffffffff81e70720 ('power') not found for kobject '0:0:0:0'"
        bus_remove_device
          device_release_driver
            __device_release_driver
              sd_remove
                device_del(&sdkp->dev)
                  dpm_sysfs_remove
                    sysfs_remove_group
                      "sysfs group ffffffff81e70720 ('power') not found for kobject '0:0:0:0'"
                del_gendisk(&sdkp->disk)
                  delete_partition
                    device_del
                      dpm_sysfs_remove
                        sysfs_remove_group
                          "sysfs group ffffffff81e70720 ('power') not found for kobject 'sda5'"
                      device_remove_attrs
                        device_remove_groups
                          sysfs_remove_groups
                            "sysfs group ffffffff81e3ba60 ('trace') not found for kobject 'sda5'"
                  blk_unregister_queue(disk)
                    blk_trace_remove_sysfs
                      sysfs_remove_group
                        "sysfs group ffffffff81e3ba60 ('trace') not found for kobject 'sda'"
                  device_del(disk_to_dev(disk))
                    dpm_sysfs_remove
                      sysfs_remove_group
                        "sysfs group ffffffff81e70720 ('power') not found for kobject 'sda'"
  device_unregister(&shost->shost_dev)
    device_del
      dpm_sysfs_remove
        sysfs_remove_group
          "sysfs group ffffffff81e70720 ('power') not found for kobject 'host0'"
  device_del(&shost->shost_gendev)
    dpm_sysfs_remove
      sysfs_remove_group
        "sysfs group ffffffff81e70720 ('power') not found for kobject 'host0'"
Comment 3 Rafael J. Wysocki 2013-11-22 22:57:09 UTC
I wonder what's the output if you printk() pos->s_name above the sysfs_remove_one(&acxt, pos) in __sysfs_remove_dir()?
Comment 4 Rafael J. Wysocki 2013-11-23 02:59:17 UTC
So what happens here is we get device_del() for 0000:00:1f.2 first and that removes all stuff up to when it calls bus_remove_device() which stops the driver and triggers device_del() for ata_device on dev1.0 and then for dev1.0 itself.

Next, it does device_del() for ata_link on link1 (which is under 0000:00:1f.2/ata1/) and for link1 itself.

Then, it does device_del() for ata_port on ata1 and for ata1 itself (that is, 0000:00:1f.2/ata1/).

That descends into the host0 subdirectory and removes it recursively before trying to unregister host0.  That removes host0/target0:0:0/0:0:0:0/power among other things.

So when it finally goes to delete bsg on 0:0:0:0, it finds that there's no "power" group below the bsg's child also called 0:0:0:0 (whoever designed that subsystem had a sick sense of humor) - because that group has been removed already.

So in this particular case there is an ordering problem, because 0000:00:1f.2/ata1/host0/ should have been deleted before 0000:00:1f.2/ata1/.

I'm not sure about the Thunderbolt case, though, will look into it tomorrow.
Comment 5 Rafael J. Wysocki 2013-11-23 03:01:57 UTC
In any case I don't see a clean way to fix the above and the Mika's patch seems to be the simplest viable workaround.
Comment 6 Rafael J. Wysocki 2013-11-23 20:46:56 UTC
Created attachment 115751 [details]
Debug dmesg output from Acer Aspire S5 with a Thunderbolt unplug/replug cycle

I've reproduced the issue on an Acer Aspire S5 w/ Thunderbolt and attached is a dmesg output containing a Thunderbolt unplug/replug cycle.

The kernel 3.13-rc1 with a few additional patches on top including this one:

Index: linux-pm/drivers/base/core.c
===================================================================
--- linux-pm.orig/drivers/base/core.c
+++ linux-pm/drivers/base/core.c
@@ -1188,6 +1188,8 @@ void device_del(struct device *dev)
 	struct device *parent = dev->parent;
 	struct class_interface *class_intf;
 
+	dev_err(dev, "%s\n", __func__);
+
 	/* Notify clients of device removal.  This call must come
 	 * before dpm_sysfs_remove().
 	 */
Index: linux-pm/fs/sysfs/dir.c
===================================================================
--- linux-pm.orig/fs/sysfs/dir.c
+++ linux-pm/fs/sysfs/dir.c
@@ -875,8 +875,10 @@ static void __sysfs_remove(struct sysfs_
 	do {
 		pos = next;
 		next = sysfs_next_descendant_post(pos, sd);
-		if (pos)
+		if (pos) {
 			sysfs_remove_one(acxt, pos);
+			pr_err("%s: %s\n", __func__, pos->s_name);
+		}
 	} while (next);
 }
Comment 7 Rafael J. Wysocki 2013-11-23 21:39:09 UTC
To my eyes for PCI the problem is that pci_stop_dev() does a device_del() which removes the device's sysfs directories recursively.  That includes the "power" group of the bus device which is then removed by pci_remove_bus().
Comment 8 Rafael J. Wysocki 2013-11-23 22:29:54 UTC
Created attachment 115761 [details]
PCI: Move device_del() from pci_stop_dev() to pci_destroy_dev()

So this patch fixes the issue for me without the Mika's patch.
Comment 9 Rafael J. Wysocki 2013-11-24 00:38:32 UTC
However, I don't have SATA devices down my Thunderbolt link, so the patch from comment #8 is not sufficient to fix the trace from #description.
Comment 10 Rafael J. Wysocki 2013-11-25 12:09:08 UTC
The following patches are sufficient to make all of the warnings go away without the Mika's patch:

https://patchwork.kernel.org/patch/3226081/
https://patchwork.kernel.org/patch/3229651/

(the first one is analogous to the one in comment #8).
Comment 11 Vlad 2014-03-12 06:27:44 UTC
(In reply to Rafael J. Wysocki from comment #10)
> The following patches are sufficient to make all of the warnings go away
> without the Mika's patch:
> 
> https://patchwork.kernel.org/patch/3226081/
> https://patchwork.kernel.org/patch/3229651/
> 
> (the first one is analogous to the one in comment #8).

I still can reproduce this bug with vanilla 3.13.6 kernel:
------------[ cut here ]------------
WARNING: CPU: 2 PID: 5345 at fs/sysfs/group.c:214 device_del+0x3b/0x1b0()
sysfs group ffffffff81a62480 not found for kobject 'target7:0:0'
Modules linked in: kvm_intel kvm wmi
CPU: 2 PID: 5345 Comm: eject Not tainted 3.13.6 #1
Hardware name: ASUS All Series/H87M-PRO, BIOS 0502 04/08/2013
 0000000000000009 ffffffff817b8404 ffff880336b6dd58 ffffffff81062f7d
 ffff8802fc6c5c00 ffff880336b6dda8 ffff8800b8044900 ffff8802e758b188
 ffff8800b8044800 ffffffff81062fe7 ffffffff819538e0 0000000000000028
Call Trace:
 [<ffffffff817b8404>] ? dump_stack+0x49/0x6a
 [<ffffffff81062f7d>] ? warn_slowpath_common+0x6d/0x90
 [<ffffffff81062fe7>] ? warn_slowpath_fmt+0x47/0x50
 [<ffffffff815008bb>] ? device_del+0x3b/0x1b0
 [<ffffffff8151c8ad>] ? scsi_target_reap_usercontext+0x1d/0x30
 [<ffffffff81077d57>] ? execute_in_process_context+0x57/0x60
 [<ffffffff8151f86c>] ? scsi_device_dev_release_usercontext+0x16c/0x1b0
 [<ffffffff81077d57>] ? execute_in_process_context+0x57/0x60
 [<ffffffff81500098>] ? device_release+0x28/0x90
 [<ffffffff813c49b3>] ? kobject_cleanup+0x33/0x70
 [<ffffffff81522af6>] ? scsi_disk_put+0x26/0x40
 [<ffffffff8115b24d>] ? __blkdev_put+0x14d/0x190
 [<ffffffff8115bc5c>] ? blkdev_close+0x1c/0x20
 [<ffffffff8112a600>] ? __fput+0xb0/0x1f0
 [<ffffffff8107bbdf>] ? task_work_run+0x8f/0xd0
 [<ffffffff81002901>] ? do_notify_resume+0x61/0x90
 [<ffffffff8107bac5>] ? task_work_add+0x45/0x60
 [<ffffffff817bf6ea>] ? int_signal+0x12/0x17
---[ end trace e0d8e994af6f4ede ]---
Comment 12 Greg Kroah-Hartman 2014-03-12 06:41:52 UTC
On Wed, Mar 12, 2014 at 06:27:44AM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:

That makes sense as the patch is not yet merged into the tree :(
Comment 13 Vlad 2014-03-12 11:26:05 UTC
(In reply to Greg Kroah-Hartman from comment #12)
> On Wed, Mar 12, 2014 at 06:27:44AM +0000,
> bugzilla-daemon@bugzilla.kernel.org wrote:
> 
> That makes sense as the patch is not yet merged into the tree :(

If you're talking only about these two:
https://patchwork.kernel.org/patch/3226081/
https://patchwork.kernel.org/patch/3229651/
they both have been merged. (I've checked sources myself).
Comment 14 Jernej Jakob 2021-01-01 21:21:47 UTC
I think I ran into this on 5.4.72:

Jan  1 22:02:10 gentoo kernel: mptbase: ioc0: LogInfo(0x30030501): Originator={IOP}, Code={Invalid Page}, SubCode(0x0501) cb_idx mptbase_reply
Jan  1 22:02:10 gentoo kernel: ------------[ cut here ]------------
Jan  1 22:02:10 gentoo kernel: sysfs group 'power' not found for kobject 'target1:0:2'
Jan  1 22:02:10 gentoo kernel: WARNING: CPU: 1 PID: 27540 at fs/sysfs/group.c:278 sysfs_remove_group+0x74/0x80
Jan  1 22:02:10 gentoo kernel: Modules linked in: mptsas mptscsih mptbase scsi_transport_sas fuse xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_filter iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c ip_tables bpfilter br_netfilter bridge overlay raid1 md_mod tun tcp_diag udp_diag inet_diag ipv6 crc_ccitt cfg80211 rfkill 8021q garp mrp stp llc radeon snd_hda_codec_idt snd_hda_codec_hdmi snd_hda_codec_generic ledtrig_audio gpio_ich iTCO_wdt snd_hda_intel iTCO_vendor_support kvm_intel i2c_algo_bit snd_intel_nhlt ttm ppdev snd_hda_codec kvm drm_kms_helper firewire_ohci snd_hda_core lpc_ich snd_hwdep pcspkr serio_raw i2c_i801 firewire_core irqbypass mfd_core crc_itu_t ftdi_sio snd_pcm drm
ata_generic joydev backlight snd_timer coretemp usbserial pata_acpi parport_pc i2c_core snd parport mei_me
soundcore mac_hid mei acpi_cpufreq ext4 mbcache jbd2 serpent_sse2_x86_64 serpent_generic dm_crypt dm_mod sd_mod xhci_pci xhci_hcd ahci libahci e1000e sata_sil24
Jan  1 22:02:10 gentoo kernel:  pata_marvell uhci_hcd
Jan  1 22:02:10 gentoo kernel: CPU: 1 PID: 27540 Comm: kworker/1:0 Not tainted 5.4.72-gentoo-x86_64 #1
Jan  1 22:02:10 gentoo kernel: Hardware name:  /DP965LT, BIOS MQ96510J.86A.1761.2009.0326.0001 03/26/2009
Jan  1 22:02:10 gentoo kernel: Workqueue: mpt/0 mptsas_firmware_event_work [mptsas]
Jan  1 22:02:10 gentoo kernel: RIP: 0010:sysfs_remove_group+0x74/0x80
Jan  1 22:02:10 gentoo kernel: Code: ff 5b 48 89 ef 5d 41 5c e9 79 be ff ff 48 89 ef e8 01 ba ff ff eb cc 49 8b 14 24 48 8b 33 48 c7 c7 f0 2d 0f b1 e8 5a 38 d2 ff <0f> 0b 5b 5d 41 5c c3 0f 1f 44 00 00 66 66 66 66 90 48 85 f6 74 31
Jan  1 22:02:10 gentoo kernel: RSP: 0018:ffffab98c4133bd8 EFLAGS: 00010286
Jan  1 22:02:10 gentoo kernel: RAX: 0000000000000000 RBX: ffffffffb0edafa0 RCX: 0000000000001e8a
Jan  1 22:02:10 gentoo kernel: RDX: 0000000000000001 RSI: 0000000000000082 RDI: 0000000000000247
Jan  1 22:02:10 gentoo kernel: RBP: 0000000000000000 R08: 0000000000001e8a R09: 0000000000000001
Jan  1 22:02:10 gentoo kernel: R10: 0000000000000000 R11: 0000000000000001 R12: ffff93331e859c28
Jan  1 22:02:10 gentoo kernel: R13: ffff933421471010 R14: ffff933421470ff0 R15: ffff933421471000
Jan  1 22:02:10 gentoo kernel: FS:  0000000000000000(0000) GS:ffff933423a80000(0000) knlGS:0000000000000000
Jan  1 22:02:10 gentoo kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan  1 22:02:10 gentoo kernel: CR2: 00007fe8019a4008 CR3: 000000021d77a000 CR4: 00000000000006e0
Jan  1 22:02:10 gentoo kernel: Call Trace:
Jan  1 22:02:10 gentoo kernel:  device_del+0x81/0x370
Jan  1 22:02:10 gentoo kernel:  ? attribute_container_device_trigger+0xb4/0xf0
Jan  1 22:02:10 gentoo kernel:  scsi_target_reap+0x56/0x70
Jan  1 22:02:10 gentoo kernel:  scsi_remove_target+0x1a0/0x1c0
Jan  1 22:02:10 gentoo kernel:  sas_rphy_remove+0x52/0x60 [scsi_transport_sas]
Jan  1 22:02:10 gentoo kernel:  sas_port_delete+0x26/0x150 [scsi_transport_sas]
Jan  1 22:02:10 gentoo kernel:  mptsas_del_end_device.cold+0x139/0x1b4 [mptsas]
Jan  1 22:02:10 gentoo kernel:  mptsas_hotplug_work+0x467/0x750 [mptsas]