Bug 217914
Summary: | scsi_eh_1 process high cpu after upgrading to 6.5 | ||
---|---|---|---|
Product: | IO/Storage | Reporter: | Christian Zangl (laktak) |
Component: | SCSI | Assignee: | linux-scsi (linux-scsi) |
Status: | RESOLVED CODE_FIX | ||
Severity: | high | CC: | kernel, loberman, pedretti.fabio |
Priority: | P3 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | Subsystem: | ||
Regression: | Yes | Bisected commit-id: | 624885209f31eb9985bf51abe204ecbffe2fdeea |
Description
Christian Zangl
2023-09-15 19:33:01 UTC
On 9/15/23 12:33, bugzilla-daemon@kernel.org wrote: > The users loqs and leonshaw helped to narrow it down to this commit: > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=624885209f31eb9985bf51abe204ecbffe2fdeea Damien, can you please take a look? Thanks, Bart. On Fri, 2023-09-15 at 13:42 -0700, Bart Van Assche wrote:
> On 9/15/23 12:33, bugzilla-daemon@kernel.org wrote:
> > The users loqs and leonshaw helped to narrow it down to this
> > commit:
> >
> >
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=624885209f31eb9985bf51abe204ecbffe2fdeea
>
> Damien, can you please take a look?
>
> Thanks,
>
> Bart.
>
I had a quick look at this and its not making sense. The only calls I
see are in the scan when the devices is added and in the re-scan.
It should not be consuming the scsi_eh thread unless some type of udev
events keeps happening.
Would be good to get some
cat /proc/<PID>/stack of the scsi_eh threads if they are constantly
consuming CPUY
I will try reproduce and try figure out what is going on here.
Thanks
Laurence
Not reproducible generically for me [root@penguin8 ~]# uname -a Linux penguin8 6.5.0+ #2 SMP PREEMPT_DYNAMIC [root@penguin8 ~]# lsscsi [0:0:0:0] disk ATA Samsung SSD 850 3B6Q /dev/sdb [1:0:0:0] disk ATA Samsung SSD 850 3B6Q /dev/sda USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND root 1649 0.0 0.0 0 0 ? S 16:58 0:00 [scsi_eh_0] root 1651 0.0 0.0 0 0 ? S 16:58 0:00 [scsi_eh_1] root 1653 0.0 0.0 0 0 ? S 16:58 0:00 [scsi_eh_2] root 1655 0.0 0.0 0 0 ? S 16:58 0:00 [scsi_eh_3] root 1668 0.0 0.0 0 0 ? S 16:58 0:00 [scsi_eh_4] root 1670 0.0 0.0 0 0 ? S 16:58 0:00 [scsi_eh_5] root 1672 0.0 0.0 0 0 ? S 16:58 0:00 [scsi_eh_6] root 1674 0.0 0.0 0 0 ? S 16:58 0:00 [scsi_eh_7] root 1866 0.0 0.0 0 0 ? S 16:58 0:00 [scsi_eh_8] root 1887 0.0 0.0 0 0 ? S 16:58 0:00 [scsi_eh_9] root 1649 0.0 0.0 0 0 ? S 16:58 0:00 [scsi_eh_0] root 1651 0.0 0.0 0 0 ? S 16:58 0:00 [scsi_eh_1] root 1653 0.0 0.0 0 0 ? S 16:58 0:00 [scsi_eh_2] root 1655 0.0 0.0 0 0 ? S 16:58 0:00 [scsi_eh_3] root 1668 0.0 0.0 0 0 ? S 16:58 0:00 [scsi_eh_4] root 1670 0.0 0.0 0 0 ? S 16:58 0:00 [scsi_eh_5] root 1672 0.0 0.0 0 0 ? S 16:58 0:00 [scsi_eh_6] root 1674 0.0 0.0 0 0 ? S 16:58 0:00 [scsi_eh_7] root 1866 0.0 0.0 0 0 ? S 16:58 0:00 [scsi_eh_8] root 1887 0.0 0.0 0 0 ? S 16:58 0:00 [scsi_eh_9] I Have no CDROm so I think its the virtual cdrom. In VMware the CDROM will continuously get probed and log errors due to no media and every time that happens it will call the cdl stuff. I will bring up a Virtual guest now, will take time as I will have to build upstream kernels. On Fri, Sep 15, 2023 at 01:42:18PM -0700, Bart Van Assche wrote: > On 9/15/23 12:33, bugzilla-daemon@kernel.org wrote: > > The users loqs and leonshaw helped to narrow it down to this commit: > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=624885209f31eb9985bf51abe204ecbffe2fdeea > > Damien, can you please take a look? > Hello Bart, It seems like: https://lore.kernel.org/linux-scsi/20230915022034.678121-1-dlemoal@kernel.org/ Solves the problem. From a quick look at the logs with extra log leves enabled: https://pastebin.com/f2LQ8kQD it appears that the MAINTENANCE_IN / MI_REPORT_SUPPORTED_OPERATION_CODES command with a non-zero service action issued by scsi_cdl_check() fails, and will be added to SCSI EH over and over. Kind regards, Niklas On 9/16/23 07:01, Niklas Cassel wrote: > On Fri, Sep 15, 2023 at 01:42:18PM -0700, Bart Van Assche wrote: >> On 9/15/23 12:33, bugzilla-daemon@kernel.org wrote: >>> The users loqs and leonshaw helped to narrow it down to this commit: >>> >>> >>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=624885209f31eb9985bf51abe204ecbffe2fdeea >> >> Damien, can you please take a look? >> > > Hello Bart, > > It seems like: > > https://lore.kernel.org/linux-scsi/20230915022034.678121-1-dlemoal@kernel.org/ > > Solves the problem. > > From a quick look at the logs with extra log leves enabled: > https://pastebin.com/f2LQ8kQD > it appears that the MAINTENANCE_IN / MI_REPORT_SUPPORTED_OPERATION_CODES > command with a non-zero service action issued by scsi_cdl_check() fails, > and will be added to SCSI EH over and over. The failure is due to the drive not liking this command. My patch avoids sending that command, thus solves the issue with drives that choke on it. However, the constant retry sound to me like a different bug... We should not retry that command at all I think. Or maybe limit it to 3 retries. > > > Kind regards, > Niklas On 9/16/23 07:01, Niklas Cassel wrote: > On Fri, Sep 15, 2023 at 01:42:18PM -0700, Bart Van Assche wrote: >> On 9/15/23 12:33, bugzilla-daemon@kernel.org wrote: >>> The users loqs and leonshaw helped to narrow it down to this commit: >>> >>> >>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=624885209f31eb9985bf51abe204ecbffe2fdeea >> >> Damien, can you please take a look? >> > > Hello Bart, > > It seems like: > > https://lore.kernel.org/linux-scsi/20230915022034.678121-1-dlemoal@kernel.org/ > > Solves the problem. > > From a quick look at the logs with extra log leves enabled: > https://pastebin.com/f2LQ8kQD > it appears that the MAINTENANCE_IN / MI_REPORT_SUPPORTED_OPERATION_CODES > command with a non-zero service action issued by scsi_cdl_check() fails, > and will be added to SCSI EH over and over. Looks like the vmware emulated scsi cdrom (sr) does not like this command... While SPC would allow cdroms to support CDL, I do not think we will ever see that. So we could restrict CDL probe to block devices only. That still does not explain why the constant retry. The MAINTENANCE_IN / MI_REPORT_SUPPORTED_OPERATION_CODES failing is expected in most cases so it should silently move on with cdl probe returning false. My patch is still needed as some drives seem to hang on that command. > > > Kind regards, > Niklas Noticed the same here when upgrading from 6.1.0-13-amd64 to 6.5.0-0.deb12.1-amd64 (both Debian kernels) earlier this month: ============================================================================ $ sar -f /var/log/sysstat/sa20231108 CPU %user %nice %system %iowait %steal %idle [...] 18:30:03 all 1.03 0.00 0.45 0.04 0.00 98.47 18:40:01 all 1.07 0.00 0.52 0.05 0.00 98.36 18:50:01 all 1.07 0.00 0.53 0.04 0.00 98.37 19:00:01 all 1.35 0.00 0.69 0.08 0.00 97.88 19:10:04 all 1.09 0.00 0.52 0.07 0.00 98.31 19:20:02 all 1.14 0.00 0.51 0.05 0.00 98.30 19:30:04 all 1.62 0.00 0.65 0.08 0.00 97.65 Average: all 1.06 0.00 0.50 0.06 0.00 98.38 19:32:27 LINUX RESTART (2 CPU) 19:40:03 CPU %user %nice %system %iowait %steal %idle 19:50:00 all 2.27 0.00 3.23 57.40 0.00 37.11 20:00:02 all 1.29 0.00 2.70 59.27 0.00 36.75 20:10:03 all 1.48 0.00 2.93 58.38 0.00 37.21 20:20:03 all 1.40 0.00 2.94 58.93 0.00 36.73 20:30:02 all 1.39 0.00 2.87 59.99 0.00 35.74 20:40:03 all 1.48 0.00 3.44 59.83 0.00 35.26 20:50:00 all 1.29 0.00 2.88 60.84 0.00 34.98 21:00:03 all 1.31 0.00 2.63 59.81 0.00 36.25 21:10:03 all 1.33 0.00 2.72 59.85 0.00 36.09 21:20:01 all 1.31 0.00 2.82 59.28 0.00 36.59 21:30:01 all 1.39 0.00 2.92 60.51 0.00 35.18 21:40:01 all 1.34 0.00 3.04 60.04 0.00 35.57 21:50:03 all 1.29 0.00 2.51 59.79 0.00 36.41 22:00:03 all 1.36 0.00 3.23 59.81 0.00 35.59 22:10:03 all 1.37 0.00 2.56 59.13 0.00 36.93 22:20:03 all 1.36 0.00 2.88 58.46 0.00 37.29 22:30:03 all 1.31 0.00 2.65 59.07 0.00 36.97 22:40:00 all 1.32 0.00 2.72 59.61 0.00 36.35 22:50:01 all 1.32 0.00 2.72 59.35 0.00 36.61 23:00:03 all 1.29 0.00 2.68 59.30 0.00 36.72 23:10:03 all 1.35 0.00 2.62 60.11 0.00 35.91 23:20:02 all 1.29 0.00 2.91 59.55 0.00 36.25 23:30:03 all 1.32 0.00 2.72 58.37 0.00 37.59 23:40:01 all 1.34 0.00 2.97 57.74 0.00 37.95 23:50:00 all 1.33 0.00 2.54 59.90 0.00 36.24 Average: all 1.38 0.00 2.83 59.37 0.00 36.41 $ last -n 3 reboot reboot system boot 6.5.0-0.deb12.1- Wed Nov 8 19:32 still running reboot system boot 6.1.0-13-amd64 Mon Oct 9 23:14 - 19:32 (29+21:17) reboot system boot 6.1.0-13-amd64 Mon Oct 9 22:39 - 23:14 (00:35) ============================================================================ And top only shows a single [scsi_eh_2] thread using ~50% of CPU time, sometimes there's an events thread too. PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 336 root 20 0 0.0m 0.0m 0.0m D 50.0 0.0 53,37 [scsi_eh_2] 3794126 root 20 0 0.0m 0.0m 0.0m I 25.7 0.0 0:06.27 [kworker/0:0-events] This is a Debian/amd64 VM running on a VMware ESX host. There's a virtual CDROM drive, but nothing is attached here and I'm not using it, all: ============================================================================ $ lsblk -d --scsi NAME HCTL TYPE VENDOR MODEL REV SERIAL TRAN sda 0:0:0:0 disk VMware Virtual disk 2.0 sr0 2:0:0:0 rom NECVMWar VMware Virtual IDE CDROM Drive 1.00 10000000000000000001 ata $ dmesg -t | grep -i scsi Block layer SCSI generic (bsg) driver version 0.4 loaded (major 247) SCSI subsystem initialized VMware PVSCSI driver - version 1.0.7.0-k vmw_pvscsi: using 64bit dma vmw_pvscsi: max_id: 65 vmw_pvscsi: setting ring_pages to 32 vmw_pvscsi: enabling reqCallThreshold vmw_pvscsi: driver-based request coalescing enabled vmw_pvscsi: using MSI-X scsi host0: VMware PVSCSI storage adapter rev 2, req/cmp/msg rings: 32/32/1 pages, cmd_per_lun=254 vmw_pvscsi 0000:03:00.0: VMware PVSCSI rev 2 host #0 scsi 0:0:0:0: Direct-Access VMware Virtual disk 2.0 PQ: 0 ANSI: 6 sd 0:0:0:0: [sda] Attached SCSI disk sd 0:0:0:0: Attached scsi generic sg0 type 0 scsi host1: ata_piix scsi host2: ata_piix scsi 2:0:0:0: CD-ROM NECVMWar VMware IDE CDR10 1.00 PQ: 0 ANSI: 5 scsi 2:0:0:0: Attached scsi generic sg1 type 5 sr 2:0:0:0: [sr0] scsi3-mmc drive: 1x/1x writer dvd-ram cd/rw xa/form2 cdda tray sr 2:0:0:0: Attached scsi CD-ROM sr0 ============================================================================ Will that "scsi: Do no try to probe for CDL on old drives" patch land in mainline or is this still under discussion? On Mon, Nov 13, 2023 at 03:30:57AM +0000, bugzilla-daemon@kernel.org wrote: > https://bugzilla.kernel.org/show_bug.cgi?id=217914 > > Christian Kujau (kernel@nerdbynature.de) changed: > > What |Removed |Added > ---------------------------------------------------------------------------- > CC| |kernel@nerdbynature.de > > --- Comment #7 from Christian Kujau (kernel@nerdbynature.de) --- > Noticed the same here when upgrading from 6.1.0-13-amd64 to > 6.5.0-0.deb12.1-amd64 (both Debian kernels) earlier this month: > > ============================================================================ > $ sar -f /var/log/sysstat/sa20231108 > CPU %user %nice %system %iowait %steal > %idle > [...] > 18:30:03 all 1.03 0.00 0.45 0.04 0.00 > 98.47 > 18:40:01 all 1.07 0.00 0.52 0.05 0.00 > 98.36 > 18:50:01 all 1.07 0.00 0.53 0.04 0.00 > 98.37 > 19:00:01 all 1.35 0.00 0.69 0.08 0.00 > 97.88 > 19:10:04 all 1.09 0.00 0.52 0.07 0.00 > 98.31 > 19:20:02 all 1.14 0.00 0.51 0.05 0.00 > 98.30 > 19:30:04 all 1.62 0.00 0.65 0.08 0.00 > 97.65 > Average: all 1.06 0.00 0.50 0.06 0.00 > 98.38 > > 19:32:27 LINUX RESTART (2 CPU) > > 19:40:03 CPU %user %nice %system %iowait %steal > %idle > 19:50:00 all 2.27 0.00 3.23 57.40 0.00 > 37.11 > 20:00:02 all 1.29 0.00 2.70 59.27 0.00 > 36.75 > 20:10:03 all 1.48 0.00 2.93 58.38 0.00 > 37.21 > 20:20:03 all 1.40 0.00 2.94 58.93 0.00 > 36.73 > 20:30:02 all 1.39 0.00 2.87 59.99 0.00 > 35.74 > 20:40:03 all 1.48 0.00 3.44 59.83 0.00 > 35.26 > 20:50:00 all 1.29 0.00 2.88 60.84 0.00 > 34.98 > 21:00:03 all 1.31 0.00 2.63 59.81 0.00 > 36.25 > 21:10:03 all 1.33 0.00 2.72 59.85 0.00 > 36.09 > 21:20:01 all 1.31 0.00 2.82 59.28 0.00 > 36.59 > 21:30:01 all 1.39 0.00 2.92 60.51 0.00 > 35.18 > 21:40:01 all 1.34 0.00 3.04 60.04 0.00 > 35.57 > 21:50:03 all 1.29 0.00 2.51 59.79 0.00 > 36.41 > 22:00:03 all 1.36 0.00 3.23 59.81 0.00 > 35.59 > 22:10:03 all 1.37 0.00 2.56 59.13 0.00 > 36.93 > 22:20:03 all 1.36 0.00 2.88 58.46 0.00 > 37.29 > 22:30:03 all 1.31 0.00 2.65 59.07 0.00 > 36.97 > 22:40:00 all 1.32 0.00 2.72 59.61 0.00 > 36.35 > 22:50:01 all 1.32 0.00 2.72 59.35 0.00 > 36.61 > 23:00:03 all 1.29 0.00 2.68 59.30 0.00 > 36.72 > 23:10:03 all 1.35 0.00 2.62 60.11 0.00 > 35.91 > 23:20:02 all 1.29 0.00 2.91 59.55 0.00 > 36.25 > 23:30:03 all 1.32 0.00 2.72 58.37 0.00 > 37.59 > 23:40:01 all 1.34 0.00 2.97 57.74 0.00 > 37.95 > 23:50:00 all 1.33 0.00 2.54 59.90 0.00 > 36.24 > Average: all 1.38 0.00 2.83 59.37 0.00 > 36.41 > > $ last -n 3 reboot > reboot system boot 6.5.0-0.deb12.1- Wed Nov 8 19:32 still running > reboot system boot 6.1.0-13-amd64 Mon Oct 9 23:14 - 19:32 (29+21:17) > reboot system boot 6.1.0-13-amd64 Mon Oct 9 22:39 - 23:14 (00:35) > ============================================================================ > > And top only shows a single [scsi_eh_2] thread using ~50% of CPU time, > sometimes there's an events thread too. > > > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND > 336 root 20 0 0.0m 0.0m 0.0m D 50.0 0.0 53,37 > [scsi_eh_2] > 3794126 root 20 0 0.0m 0.0m 0.0m I 25.7 0.0 0:06.27 > [kworker/0:0-events] > > This is a Debian/amd64 VM running on a VMware ESX host. There's a virtual > CDROM > drive, but nothing is attached here and I'm not using it, all: > > > ============================================================================ > $ lsblk -d --scsi > NAME HCTL TYPE VENDOR MODEL REV SERIAL > TRAN > sda 0:0:0:0 disk VMware Virtual disk 2.0 > sr0 2:0:0:0 rom NECVMWar VMware Virtual IDE CDROM Drive 1.00 > 10000000000000000001 ata > > > $ dmesg -t | grep -i scsi > Block layer SCSI generic (bsg) driver version 0.4 loaded (major 247) > SCSI subsystem initialized > VMware PVSCSI driver - version 1.0.7.0-k > vmw_pvscsi: using 64bit dma > vmw_pvscsi: max_id: 65 > vmw_pvscsi: setting ring_pages to 32 > vmw_pvscsi: enabling reqCallThreshold > vmw_pvscsi: driver-based request coalescing enabled > vmw_pvscsi: using MSI-X > scsi host0: VMware PVSCSI storage adapter rev 2, req/cmp/msg rings: 32/32/1 > pages, cmd_per_lun=254 > vmw_pvscsi 0000:03:00.0: VMware PVSCSI rev 2 host #0 > scsi 0:0:0:0: Direct-Access VMware Virtual disk 2.0 PQ: 0 ANSI: 6 > sd 0:0:0:0: [sda] Attached SCSI disk > sd 0:0:0:0: Attached scsi generic sg0 type 0 > scsi host1: ata_piix > scsi host2: ata_piix > scsi 2:0:0:0: CD-ROM NECVMWar VMware IDE CDR10 1.00 PQ: 0 ANSI: 5 > scsi 2:0:0:0: Attached scsi generic sg1 type 5 > sr 2:0:0:0: [sr0] scsi3-mmc drive: 1x/1x writer dvd-ram cd/rw xa/form2 cdda > tray > sr 2:0:0:0: Attached scsi CD-ROM sr0 > ============================================================================ > > Will that "scsi: Do no try to probe for CDL on old drives" patch land in > mainline or is this still under discussion? It has been in mainline for a while: 2132df16f53b ("scsi: core: ata: Do no try to probe for CDL on old drives") $ git tag --contains 2132df16f53b v6.6 v6.6-rc4 v6.6-rc5 v6.6-rc6 v6.6-rc7 v6.6.1 v6.7-rc1 Kind regards, Niklas Oh, great! I was only searching for the subject line, not the actual changes. So, I just have to wait for Debian (Backports) to move to v6.6 then :-) Thanks! This Bugzilla entry can be closed then, I assume. "scsi: core: ata: Do no try to probe for CDL on old drives" -> it's also on 6.5.6. This should be closed. |