Bug 25392 - scsi_eh_11 CPU loop
Summary: scsi_eh_11 CPU loop
Status: CLOSED UNREPRODUCIBLE
Alias: None
Product: IO/Storage
Classification: Unclassified
Component: Serial ATA (show other bugs)
Hardware: All Linux
: P1 normal
Assignee: Jeff Garzik
URL:
Keywords:
: 25602 (view as bug list)
Depends on:
Blocks: 21782
  Show dependency tree
 
Reported: 2010-12-21 10:24 UTC by Martin Steigerwald
Modified: 2011-01-14 08:51 UTC (History)
4 users (show)

See Also:
Kernel Version: 2.6.37-rc6-tp42-00009-gb3444d1
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
script to poll for stacktraces (354 bytes, text/plain)
2011-01-13 21:25 UTC, Martin Steigerwald
Details

Description Martin Steigerwald 2010-12-21 10:24:06 UTC
Subject    : 2.6.37-rc5: scsi_eh_11 CPU loop   
Submitter  : "Zhang, Yanmin" <yanmin_zhang@linux.intel.com>
Date       : 2010-08-18 6:18
Message-ID : 1282112318.21202.8.camel@ymzhang.sh.intel.com
References : http://marc.info/?t=129283967100004&r=1&w=2
References : http://lkml.org/lkml/2010/12/20/52

This entry is being used for tracking a regression from 2.6.36(?). Please don't
close it until the problem is fixed in the mainline.

top - 10:49:07 up 3 days, 14:24,  8 users,  load average: 2.31, 2.62, 2.28
Tasks: 198 total,   2 running, 194 sleeping,   0 stopped,   2 zombie
Cpu(s):  6.8%us, 93.2%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  
0.0%st
Mem:   2073660k total,  1941152k used,   132508k free,   153876k buffers
Swap:  4000180k total,   243452k used,  3756728k free,   676612k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                  \
  735 root      20   0     0    0    0 R 85.2  0.0 137:28.94 scsi_eh_11

I don't see anything in dmesg. Everything appears to work as normal, 
except for the slowness. Which got a bit better upon renicing scsi_eh_11 
to 19 (not knowing whether its really safe, but it works for now).

Interrupts appear to be within usual range as well:

martin@shambhala:~> vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ----
cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id 
wa
 2  0 243052  88528 154520 677228   10   28   171   374  416  195 36 20 38  
6
 3  0 243052  88512 154520 677228    0    0     0     0 1471  992 45 55  0  
0
 2  0 243052  88324 154520 677228    0    0     0     0 1181  985 49 51  0  
0
 3  0 243052  88932 154520 677228    0    0     0     0 1339 1672 46 54  0  
0
 2  0 243052  88900 154520 677228    0    0     0    44 1222 1327 50 50  0  
0
 3  0 243052  88900 154520 677228    0    0     0     0 1216 1409 48 52  0  
0

Kernel in use:

martin@shambhala:~> cat /proc/version
Linux version 2.6.37-rc6-tp42-00009-gb3444d1 (martin@shambhala) (gcc 
version 4.4.5 (Debian 4.4.5-8) ) #6 PREEMPT Thu Dec 16 19:25:32 CET 2010

I never saw this with 2.6.36 or 2.7.37-rc3.

I am using a Delock PCMCIA eSATA adapter with Silicon Image controller 
from time to time. Yesterday evening for example. Maybe it has to do with 
removing it again. Well it got scsi11 for one of its port:

Dec 19 21:56:24 shambhala kernel: pcmcia_socket pcmcia_socket0: pccard: 
CardBus card inserted into slot 0
Dec 19 21:56:24 shambhala kernel: pci 0000:03:00.0: [1095:3512] type 0 
class 0x000180
Dec 19 21:56:24 shambhala kernel: pci 0000:03:00.0: reg 10: [io  
0x0000-0x0007]
Dec 19 21:56:24 shambhala kernel: pci 0000:03:00.0: reg 14: [io  
0x0000-0x0003]
Dec 19 21:56:24 shambhala kernel: pci 0000:03:00.0: reg 18: [io  
0x0000-0x0007]
Dec 19 21:56:24 shambhala kernel: pci 0000:03:00.0: reg 1c: [io  
0x0000-0x0003]
Dec 19 21:56:24 shambhala kernel: pci 0000:03:00.0: reg 20: [io  
0x0000-0x000f]
Dec 19 21:56:24 shambhala kernel: pci 0000:03:00.0: reg 24: [mem 
0x00000000-0x000001ff]
Dec 19 21:56:24 shambhala kernel: pci 0000:03:00.0: reg 30: [mem 
0x00000000-0x0007ffff pref]
Dec 19 21:56:24 shambhala kernel: pci 0000:03:00.0: supports D1 D2
Dec 19 21:56:24 shambhala kernel: pci 0000:03:00.0: BAR 6: assigned [mem 
0xeff80000-0xefffffff pref]
Dec 19 21:56:24 shambhala kernel: pci 0000:03:00.0: BAR 5: assigned [mem 
0xcffffe00-0xcfffffff]
Dec 19 21:56:24 shambhala kernel: pci 0000:03:00.0: BAR 5: set to [mem 
0xcffffe00-0xcfffffff] (PCI address [0xcffffe00-0xcfffffff])
Dec 19 21:56:24 shambhala kernel: pci 0000:03:00.0: BAR 4: assigned [io  
0x8cf0-0x8cff]
Dec 19 21:56:24 shambhala kernel: pci 0000:03:00.0: BAR 4: set to [io  
0x8cf0-0x8cff] (PCI address [0x8cf0-0x8cff])
Dec 19 21:56:24 shambhala kernel: pci 0000:03:00.0: BAR 0: assigned [io  
0x8ce8-0x8cef]
Dec 19 21:56:24 shambhala kernel: pci 0000:03:00.0: BAR 0: set to [io  
0x8ce8-0x8cef] (PCI address [0x8ce8-0x8cef])
Dec 19 21:56:24 shambhala kernel: pci 0000:03:00.0: BAR 2: assigned [io  
0x8ce0-0x8ce7]
Dec 19 21:56:24 shambhala kernel: pci 0000:03:00.0: BAR 2: set to [io  
0x8ce0-0x8ce7] (PCI address [0x8ce0-0x8ce7])
Dec 19 21:56:24 shambhala kernel: pci 0000:03:00.0: BAR 1: assigned [io  
0x8cdc-0x8cdf]
Dec 19 21:56:24 shambhala kernel: pci 0000:03:00.0: BAR 1: set to [io  
0x8cdc-0x8cdf] (PCI address [0x8cdc-0x8cdf])
Dec 19 21:56:24 shambhala kernel: pci 0000:03:00.0: BAR 3: assigned [io  
0x8cd8-0x8cdb]
Dec 19 21:56:24 shambhala kernel: pci 0000:03:00.0: BAR 3: set to [io  
0x8cd8-0x8cdb] (PCI address [0x8cd8-0x8cdb])
Dec 19 21:56:24 shambhala kernel: sata_sil 0000:03:00.0: enabling device 
(0000 -> 0003)
Dec 19 21:56:24 shambhala kernel: sata_sil 0000:03:00.0: PCI INT A -> 
Link[LNKA] -> GSI 11 (level, low) -> IRQ 11
Dec 19 21:56:24 shambhala kernel: sata_sil 0000:03:00.0: Applying R_ERR on 
DMA activate FIS errata fix
Dec 19 21:56:24 shambhala kernel: sata_sil 0000:03:00.0: setting latency 
timer to 64
Dec 19 21:56:24 shambhala kernel: scsi11 : sata_sil
Dec 19 21:56:24 shambhala kernel: scsi12 : sata_sil
Dec 19 21:56:24 shambhala kernel: ata9: SATA max UDMA/100 mmio 
m512@0xcffffe00 tf 0xcffffe80 irq 11
Dec 19 21:56:24 shambhala kernel: ata10: SATA max UDMA/100 mmio 
m512@0xcffffe00 tf 0xcffffec0 irq 11
Dec 19 21:56:24 shambhala kernel: ata9: SATA link down (SStatus 0 SControl 
310)
Dec 19 21:56:24 shambhala kernel: ata10: SATA link down (SStatus 0 
SControl 310)
Dec 19 21:56:27 shambhala kernel: ata9: hard resetting link
Dec 19 21:56:30 shambhala kernel: ata9: SATA link up 1.5 Gbps (SStatus 113 
SControl 310)
Dec 19 21:56:30 shambhala kernel: ata9.00: ATA-8: Hitachi HTS545050B9A300, 
PB4OC60G, max UDMA/133
Dec 19 21:56:30 shambhala kernel: ata9.00: 976773168 sectors, multi 0: 
LBA48 NCQ (depth 0/32)
Dec 19 21:56:30 shambhala kernel: ata9.00: configured for UDMA/100
Dec 19 21:56:30 shambhala kernel: ata9: EH complete
Dec 19 21:56:30 shambhala kernel: scsi 11:0:0:0: Direct-Access     ATA      
Hitachi HTS54505 PB4O PQ: 0 ANSI: 5
Dec 19 21:56:30 shambhala kernel: sd 11:0:0:0: Attached scsi generic sg2 
type 0
Dec 19 21:56:30 shambhala kernel: sd 11:0:0:0: [sdb] 976773168 512-byte 
logical blocks: (500 GB/465 GiB)
Dec 19 21:56:30 shambhala kernel: sd 11:0:0:0: [sdb] Write Protect is off
Dec 19 21:56:30 shambhala kernel: sd 11:0:0:0: [sdb] Mode Sense: 00 3a 00 
00
Dec 19 21:56:30 shambhala kernel: sd 11:0:0:0: [sdb] Write cache: enabled, 
read cache: enabled, doesn't support DPO
Dec 19 21:56:30 shambhala kernel: sdb: sdb1 sdb3
Dec 19 21:56:30 shambhala kernel: sd 11:0:0:0: [sdb] Attached SCSI disk

Removal appeared to be fine:

Dec 19 23:10:34 shambhala kernel: ata9.00: disabled
Dec 19 23:10:34 shambhala kernel: sd 11:0:0:0: [sdb] Synchronizing SCSI 
cache
Dec 19 23:10:34 shambhala kernel: sd 11:0:0:0: [sdb]  Result: 
hostbyte=0x04 driverbyte=0x00
Dec 19 23:10:34 shambhala kernel: sd 11:0:0:0: [sdb] Stopping disk
Dec 19 23:10:34 shambhala kernel: sd 11:0:0:0: [sdb] START_STOP FAILED
Dec 19 23:10:34 shambhala kernel: sd 11:0:0:0: [sdb]  Result: 
hostbyte=0x04 driverbyte=0x00
Dec 19 23:10:34 shambhala kernel: sata_sil 0000:03:00.0: PCI INT A 
disabled

Any hints on what to try to get more information? I keep the ThinkPad T42 
running for a while to wait for some hints.

Thanks,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7
Comment 1 Martin Steigerwald 2010-12-21 10:25:47 UTC
Answer by Tejun:

Can you please apply the following patch, trigger the problem and
attach the kernel log?

diff --git a/drivers/ata/libata-eh.c b/drivers/ata/libata-eh.c
index 5e59050..c748b5a 100644
--- a/drivers/ata/libata-eh.c
+++ b/drivers/ata/libata-eh.c
@@ -888,12 +888,19 @@ void ata_eh_fastdrain_timerfn(unsigned long arg)
  */
 static void ata_eh_set_pending(struct ata_port *ap, int fastdrain)
 {
+       static int xxx;
        int cnt;

        /* already scheduled? */
        if (ap->pflags & ATA_PFLAG_EH_PENDING)
                return;

+       if (!(ap->pflags & ATA_PFLAG_LOADING) && xxx < 16) {
+               ata_port_printk(ap, KERN_WARNING, "XXX ata_eh_set_pending()\n");
+               dump_stack();
+               xxx++;
+       }
+
        ap->pflags |= ATA_PFLAG_EH_PENDING;

        if (!fastdrain)
Comment 2 Martin Steigerwald 2010-12-21 10:31:46 UTC
I will do so with my next kernel compile - likely rc7 if it comes out before Christmas.

As I do not know what exactly triggers it, it can take a while till I experience it again. I hope to be able to trigger it by inserting and removing the PCMCIA eSATA adapter lots of times. As the bug happened, I forgot to vgchange -an the LVM on the external drive. That might be or might not be related to triggering the issue. I will try this as well.

Might make sense to switch this bugzilla report to mail. Or use the bugzilla. Or just use the bugzilla for tracking and continue using mail. I'll do whatever suits you best (but prefer mail).
Comment 3 Rafael J. Wysocki 2010-12-30 10:43:46 UTC
*** Bug 25602 has been marked as a duplicate of this bug. ***
Comment 4 Martin Steigerwald 2010-12-30 11:09:08 UTC
I reproduced this issue via:

- Insert eSATA disk
- vgscan
- vgchange -ay
- Remove eSATA disk
- vgscan

Without the vgscan after removing the disk it didn't show up.

But I do not get any hints in dmesg:

shambhala:~> dmesg | tail -52
pcmcia_socket pcmcia_socket0: pccard: CardBus card inserted into slot 0
pci 0000:03:00.0: [1095:3512] type 0 class 0x000180
pci 0000:03:00.0: reg 10: [io  0x0000-0x0007]
pci 0000:03:00.0: reg 14: [io  0x0000-0x0003]
pci 0000:03:00.0: reg 18: [io  0x0000-0x0007]
pci 0000:03:00.0: reg 1c: [io  0x0000-0x0003]
pci 0000:03:00.0: reg 20: [io  0x0000-0x000f]
pci 0000:03:00.0: reg 24: [mem 0x00000000-0x000001ff]
pci 0000:03:00.0: reg 30: [mem 0x00000000-0x0007ffff pref]
pci 0000:03:00.0: supports D1 D2
pci 0000:03:00.0: BAR 6: assigned [mem 0xe8000000-0xe807ffff pref]
pci 0000:03:00.0: BAR 5: assigned [mem 0xc4000000-0xc40001ff]
pci 0000:03:00.0: BAR 5: set to [mem 0xc4000000-0xc40001ff] (PCI address [0xc4000000-0xc40001ff])
pci 0000:03:00.0: BAR 4: assigned [io  0x4000-0x400f]
pci 0000:03:00.0: BAR 4: set to [io  0x4000-0x400f] (PCI address [0x4000-0x400f])
pci 0000:03:00.0: BAR 0: assigned [io  0x4010-0x4017]
pci 0000:03:00.0: BAR 0: set to [io  0x4010-0x4017] (PCI address [0x4010-0x4017])
pci 0000:03:00.0: BAR 2: assigned [io  0x4018-0x401f]
pci 0000:03:00.0: BAR 2: set to [io  0x4018-0x401f] (PCI address [0x4018-0x401f])
pci 0000:03:00.0: BAR 1: assigned [io  0x4020-0x4023]
pci 0000:03:00.0: BAR 1: set to [io  0x4020-0x4023] (PCI address [0x4020-0x4023])
pci 0000:03:00.0: BAR 3: assigned [io  0x4024-0x4027]
pci 0000:03:00.0: BAR 3: set to [io  0x4024-0x4027] (PCI address [0x4024-0x4027])
sata_sil 0000:03:00.0: enabling device (0000 -> 0003)
sata_sil 0000:03:00.0: PCI INT A -> Link[LNKA] -> GSI 11 (level, low) -> IRQ 11
sata_sil 0000:03:00.0: Applying R_ERR on DMA activate FIS errata fix
sata_sil 0000:03:00.0: setting latency timer to 64
scsi4 : sata_sil
scsi5 : sata_sil
ata5: SATA max UDMA/100 mmio m512@0xc4000000 tf 0xc4000080 irq 11
ata6: SATA max UDMA/100 mmio m512@0xc4000000 tf 0xc40000c0 irq 11
ata5: SATA link down (SStatus 0 SControl 310)
ata6: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
ata6.00: ATA-8: Hitachi HTS545050B9A300, PB4OC60G, max UDMA/133
ata6.00: 976773168 sectors, multi 0: LBA48 NCQ (depth 0/32)
ata6.00: configured for UDMA/100
scsi 5:0:0:0: Direct-Access     ATA      Hitachi HTS54505 PB4O PQ: 0 ANSI: 5
sd 5:0:0:0: Attached scsi generic sg2 type 0
sd 5:0:0:0: [sdc] 976773168 512-byte logical blocks: (500 GB/465 GiB)
sd 5:0:0:0: [sdc] Write Protect is off
sd 5:0:0:0: [sdc] Mode Sense: 00 3a 00 00
sd 5:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
 sdc: sdc1 sdc3
sd 5:0:0:0: [sdc] Attached SCSI disk
pcmcia_socket pcmcia_socket0: pccard: card ejected from slot 0
ata6.00: disabled
sd 5:0:0:0: [sdc] Synchronizing SCSI cache
sd 5:0:0:0: [sdc]  Result: hostbyte=0x04 driverbyte=0x00
sd 5:0:0:0: [sdc] Stopping disk
sd 5:0:0:0: [sdc] START_STOP FAILED
sd 5:0:0:0: [sdc]  Result: hostbyte=0x04 driverbyte=0x00
sata_sil 0000:03:00.0: PCI INT A disabled

Even though I have the debug patch applied:

diff --git a/drivers/ata/libata-eh.c b/drivers/ata/libata-eh.c
index 5e59050..c748b5a 100644
--- a/drivers/ata/libata-eh.c
+++ b/drivers/ata/libata-eh.c
@@ -888,12 +888,19 @@ void ata_eh_fastdrain_timerfn(unsigned long arg)
  */
 static void ata_eh_set_pending(struct ata_port *ap, int fastdrain)
 {
+       static int xxx;
        int cnt;
 
        /* already scheduled? */
        if (ap->pflags & ATA_PFLAG_EH_PENDING)
                return;
 
+       if (!(ap->pflags & ATA_PFLAG_LOADING) && xxx < 16) {
+               ata_port_printk(ap, KERN_WARNING, "XXX ata_eh_set_pending()\n");
+               dump_stack();
+               xxx++;
+       }
+
        ap->pflags |= ATA_PFLAG_EH_PENDING;
 
        if (!fastdrain)

This is with:

martin@shambhala:~> cat /proc/version 
Linux version 2.6.37-rc7-tp42-ata-eh-dbg-dirty (martin@shambhala) (gcc version 4.4.5 (Debian 4.4.5-8) ) #1 PREEMPT Wed Dec 22 11:41:20 CET 2010

Thats just 2.6.37-rc7 + the debug patch.
Comment 5 Tejun Heo 2010-12-31 12:13:22 UTC
Hmmm... weird.  While the scsi_eh is looping, can you please do "cat /proc/PID_OF_THE_TASK/stack" several times and see where it's spending its cpu cycles?

Thanks.
Comment 6 Martin Steigerwald 2011-01-13 21:25:50 UTC
Created attachment 43452 [details]
script to poll for stacktraces

Script to poll for stacktraces.
Comment 7 Martin Steigerwald 2011-01-13 21:27:43 UTC
Was not able to reproduce this with 2.6.37-rc8 - probably issue has been fixed in the meanwhile. Thus closing as fixed for now. Thanks.

Anyway I posted a script to poll for stacktraces, should anyone should have this issue again.
Comment 8 Florian Mickler 2011-01-14 08:51:58 UTC
I changed the resolution to UNREPRODUCIBLE, which is (nowadays) the default resolution if no specific commit is known that fixes it.

Note You need to log in before you can comment on or make changes to this bug.