Bug 11832

Summary: "irq 18: nobody cared" on Toshiba Satellite A100
Product: ACPI Reporter: Rafael J. Wysocki (rjw)
Component: Config-InterruptsAssignee: Len Brown (lenb)
Status: CLOSED UNREPRODUCIBLE    
Severity: normal CC: acpi-bugzilla, m.v.b, philipp+kernel-bugs, sassmann, shaohua.li, venki, yakui.zhao
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.18, 2.6.24, 2.6.27 Subsystem:
Regression: No Bisected commit-id:
Attachments: spurious-counter.patch - should help to reproduce issue quicker
acpidump output
Output of "lspci -vv -xxx -nn"
Contents of /proc/interrupts on Debian's 2.6.24-based kernel
dmesg from Debian's 2.6.24 based kernel
The output of "git bisect log"
Output of "git bisect log"
dmesg output for 2.6.29.1 with 99 as the spurious interrupt threshold
Output of lspci -vxxx
Output of lsusb -v
The patch that makes the kernel print out the name of the handler which "worked" for the spurious interrupt.

Description Rafael J. Wysocki 2008-10-25 06:19:43 UTC
Subject    : Regression in 2.6.27: "irq 18: nobody cared" on Toshiba Satellite A100
Submitter  : "M. Vefa Bicakci" <bicave@superonline.com>
Date       : 2008-10-19 14:06
References : http://marc.info/?l=linux-kernel&m=122442552100406&w=4
Handled-By : Stefan Assmann <sassmann@suse.de>

This entry is being used for tracking a regression from 2.6.26.  Please don't
close it until the problem is fixed in the mainline.
Comment 1 ykzhao 2008-10-26 22:07:22 UTC
Will you please use git-bisect to identify the commit by which the regression is caused by?
   Will you please attach the output of acpidump, lspci -vxxx, /proc/interrupts?
   thanks.
Comment 2 Stefan Assmann 2008-10-27 03:24:34 UTC
taken from initial discussion on lkml:

contents of /proc/interrupts:
--- 8< ---
           CPU0       CPU1       
  0:   45249492      60399   IO-APIC-edge      timer
  1:      25451          0   IO-APIC-edge      i8042
  8:          1          0   IO-APIC-edge      rtc0
  9:      36514          0   IO-APIC-fasteoi   acpi
 12:    1147983       2103   IO-APIC-edge      i8042
 14:     170245          0   IO-APIC-edge      ata_piix
 15:     558085        819   IO-APIC-edge      ata_piix
 16:        508          0   IO-APIC-fasteoi   uhci_hcd:usb5, i915@pci:0000:00:02.0
 17:       1353          0   IO-APIC-fasteoi   firewire_ohci
 18:     300158          1   IO-APIC-fasteoi   uhci_hcd:usb4, tifm_7xx1, yenta
 19:          0          0   IO-APIC-fasteoi   uhci_hcd:usb3
 20:      26606          2   IO-APIC-fasteoi   eth0
 22:    3206279          1   IO-APIC-fasteoi   HDA Intel
 23:          3          0   IO-APIC-fasteoi   uhci_hcd:usb1, ehci_hcd:usb2
220:    2105545          0   PCI-MSI-edge      iwl3945
NMI:          0          0   Non-maskable interrupts
LOC:    5971997   27874747   Local timer interrupts
RES:     938710    1791498   Rescheduling interrupts
CAL:     138135     180813   function call interrupts
TLB:      48455      64413   TLB shootdowns
TRM:          0          0   Thermal event interrupts
SPU:          0          0   Spurious interrupts
ERR:          0
MIS:          0
--- >8 ---
Comment 3 Stefan Assmann 2008-10-27 04:44:09 UTC
Created attachment 18471 [details]
spurious-counter.patch - should help to reproduce issue quicker

This patches should trigger spurious IRQs more often as it reduces the number of unhandled IRQs that trigger a nobody cared event from 99900 to 999.
Comment 4 Shaohua 2008-10-28 23:16:23 UTC
please provide required info as comment 1 suggested, so we can work on it
Comment 5 Rafael J. Wysocki 2008-11-02 13:14:53 UTC
References : http://marc.info/?l=linux-kernel&m=122565894720909&w=4
Comment 6 Rafael J. Wysocki 2008-11-02 13:16:14 UTC
Apparently, not a regression from 2.6.26.
Comment 7 Len Brown 2008-11-11 23:58:04 UTC
http://marc.info/?l=linux-kernel&m=122624419830761&w=4

> I have also tried to boot with "noapic" option using the kernels
> from Etch (Debian 4.0), unmodified: one based on 2.6.18 and one
> based on 2.6.24. Again, I got similar "nobody cared" messages.

clearing regression flag.
Comment 8 M. Vefa Bicakci 2008-11-12 15:07:17 UTC
Hello,

I am sorry; I didn't even know that this bug-report ticket existed until
I read Len Brown's e-mail which is dated today.

Currently, I am testing Debian's 2.6.24 based kernel *without* the "noapic"
option. The uptime is currently almost three days, and I haven't yet seen
the "nobody cared" message. Considering that I saw the "nobody cared" message
with 2.6.25.19 in 41 hours, it is probable that 2.6.24 doesn't have this
problem.

However, there is no way of knowing how much I should have to wait until
I can say that a certain version of the kernel does not have this problem.
This makes bisecting very hard and time-consuming. Any suggestions are
welcome.

I am going to attach the output of "acpidump" and "lspci -vv -xxx -nn"
in a moment.

Thanks to everyone!

Regards,

M. Vefa Bicakci
Comment 9 M. Vefa Bicakci 2008-11-12 15:08:31 UTC
Created attachment 18833 [details]
acpidump output
Comment 10 M. Vefa Bicakci 2008-11-12 15:09:22 UTC
Created attachment 18834 [details]
Output of "lspci -vv -xxx -nn"
Comment 11 M. Vefa Bicakci 2008-11-12 15:10:28 UTC
Created attachment 18835 [details]
Contents of /proc/interrupts on Debian's 2.6.24-based kernel
Comment 12 Len Brown 2008-11-12 22:08:06 UTC
please also attach the output from dmesg -s64000
Comment 13 M. Vefa Bicakci 2008-11-13 16:39:48 UTC
Created attachment 18856 [details]
dmesg from Debian's 2.6.24 based kernel

Please note two things:

1) This dmesg is from Debian's 2.6.24 based kernel for Etch, with a modified
version of the patch from Stefan Assmann. (I used "99" instead of "999" for
the spurious interrupt count.) Furthermore, the uptime has been about 96 hours
and I haven't been able to reproduce the "nobody cared" message with this
kernel.

2) I just noticed that there is a file named "/etc/modprobe.d/ieee1394" on
my system which blacklists the "ohci1394" module. This means that the firewire
drivers were not loaded in my tests with kernels whose versions were lower than
2.6.25.x, including Debian's 2.6.24 based kernel that I am testing right now.
Comment 14 Ph. Marek 2008-11-21 10:43:32 UTC
JFI: I think I got the same problem, with debian kernel 2.6.26-10:
  2.6.26-1-amd64 #1 SMP Sat Nov 8 18:25:23 UTC 2008 x86_64 GNU/Linux

It's again an USB interrupt, and I got it when I tried to get my Canon 350D (in "normal" mode, as opposed to "PTP") to run:
- Plugged the camera in, wasn't detected
- Turned camera off, and on again, again nothing on "lsusb"
- Set udev logging to "debug", camera off and on, and there's the report.

I looked in my logfiles (where the first entry is from May 6), and it's the first occurrence of "nobody cared"; as I mentioned, I used the camera in PTP previously. Maybe that makes a difference.

I hope that helps a bit; I can try to find some time for testing, if it helps.


Call Trace:
 <IRQ>  [<ffffffffa020c11a>] :snd_hda_intel:azx_interrupt+0x20/0xf9
 [<ffffffff8026c78f>] __report_bad_irq+0x30/0x72
 [<ffffffff8026c9ce>] note_interrupt+0x1fd/0x23b
 [<ffffffff8026d257>] handle_fasteoi_irq+0xa5/0xc8
 [<ffffffff8020f5e4>] do_IRQ+0x6d/0xd9
 [<ffffffff8020b0a3>] default_idle+0x0/0x49
 [<ffffffff8020c46d>] ret_from_intr+0x0/0x19
 <EOI>  [<ffffffff8021a817>] lapic_next_event+0x0/0x13
 [<ffffffff8021eb54>] native_safe_halt+0x2/0x3
 [<ffffffff8021eb54>] native_safe_halt+0x2/0x3
 [<ffffffff8020b0cd>] default_idle+0x2a/0x49
 [<ffffffff8020ac79>] cpu_idle+0x89/0xb3

handlers:
[<ffffffff8038eaf3>] (usb_hcd_irq+0x0/0x78)
[<ffffffffa020c0fa>] (azx_interrupt+0x0/0xf9 [snd_hda_intel])
Disabling IRQ #22
Comment 15 Ph. Marek 2008-11-21 10:47:57 UTC
Update: Now STR doesn't work anymore, and I get more reports.

PM: Some devices failed to suspend
Restarting tasks ... done.
PM: Syncing filesystems ... done.
Freezing user space processes ... (elapsed 0.04 seconds) done.
Freezing remaining freezable tasks ... (elapsed 0.00 seconds) done.
Suspending console(s)
sd 3:0:0:0: [sdb] Synchronizing SCSI cache
sd 3:0:0:0: [sdb] Stopping disk
sd 0:0:0:0: [sda] Synchronizing SCSI cache
sd 0:0:0:0: [sda] Stopping disk
ACPI handle has no context!
serial 00:0a: disabled
parport_pc 00:06: disabled
ACPI handle has no context!
ACPI: PCI interrupt for device 0000:00:10.1 disabled
ACPI: PCI interrupt for device 0000:00:0f.0 disabled
ACPI: PCI interrupt for device 0000:00:0e.0 disabled
usb_hcd_pci_suspend(): ehci_pci_suspend+0x0/0x99 [ehci_hcd] returns -22
pci_device_suspend(): usb_hcd_pci_suspend+0x0/0x176 returns -22
suspend_device(): pci_device_suspend+0x0/0x57 returns -22
Could not suspend device 0000:00:0b.1: error -22
PM: Writing back config space on device 0000:00:0e.0 at offset f (was 1030100, writing 103010b)
...
irq 22: nobody cared (try booting with the "irqpoll" option)
Pid: 0, comm: swapper Tainted: P          2.6.26-1-amd64 #1

Call Trace:
 <IRQ>  [<ffffffffa020c11a>] :snd_hda_intel:azx_interrupt+0x20/0xf9
 [<ffffffff8026c78f>] __report_bad_irq+0x30/0x72
 [<ffffffff8026c9ce>] note_interrupt+0x1fd/0x23b
 [<ffffffff8026d257>] handle_fasteoi_irq+0xa5/0xc8
 [<ffffffff8020f5e4>] do_IRQ+0x6d/0xd9
 [<ffffffff8020b0a3>] default_idle+0x0/0x49
 [<ffffffff8020c46d>] ret_from_intr+0x0/0x19
 <EOI>  [<ffffffff8022f2e8>] finish_task_switch+0x2a/0xba
 [<ffffffff8022f2e8>] finish_task_switch+0x2a/0xba
 [<ffffffff8021eb54>] native_safe_halt+0x2/0x3
 [<ffffffff8021eb54>] native_safe_halt+0x2/0x3
 [<ffffffff8020b0cd>] default_idle+0x2a/0x49
 [<ffffffff8020ac79>] cpu_idle+0x89/0xb3

handlers:
[<ffffffff8038eaf3>] (usb_hcd_irq+0x0/0x78)
[<ffffffffa020c0fa>] (azx_interrupt+0x0/0xf9 [snd_hda_intel])
Disabling IRQ #22
ata2: SATA link down (SStatus 0 SControl 300)
...
hdc: UDMA/66 mode selected
PM: Some devices failed to suspend
Restarting tasks ... done.
hda_intel: azx_get_response timeout, switching to polling mode: last cmd=0x001f0500
Comment 16 Shaohua 2008-11-25 19:08:47 UTC
So do any alsa guys look at the issue?
Comment 17 ykzhao 2008-12-03 23:03:38 UTC
Hi, Pmarek
    From the log it seems that the issue on your box is different with that on Toshiba Satellite A100.
    Will you please open a new bug and attach the output of acpidump, lspci -vxxx, dmesg.
    Thanks.
Comment 18 ykzhao 2008-12-03 23:10:31 UTC
Will you please double check whether the following issue still exists on the kernel of 2.6.27.2?
    >irq 18: nobody cared (try booting with the "irqpoll" option)
    If exists, please add the boot option of "nousb" and see whether it still exists?
    Thanks.
    
    
    
Comment 19 M. Vefa Bicakci 2008-12-05 18:20:46 UTC
Dear ykzhao,

I have confirmed that this problem exists in 2.6.25.19, 2.6.26.7
and 2.6.27.{1,2,4}.

I tried to blacklist all USB related modules on 2.6.27.x and have
confirmed that this problem exists when USB modules are blacklisted too.

(I have already mentioned the above on LKML.)

I would like to clarify one thing. When I boot using "noapic",
the version of the kernel does not matter; I always get a "nobody
cared" message during boot.

However, when I boot with*out* "noapic", that is, normally, then
I only get "nobody cared" messages after the computer has been on
for a while - and only for certain kernel versions. Please read on.

=== Important update ===

Furthermore, I have found out that this problem was introduced by
a change between 2.6.24 and 2.6.25. I have found this out by using
2.6.24 for quite some time and noticing that I did not get any
"nobody cared" messages, even though I had reduced the check number
for the spurious IRQ counter to "99".

I have partially bisected the commits between 2.6.24 and 2.6.25 and
have made some progress. I have waited 48 to 70 hours using the kernel
built by each iteration of the bisection. If I got a "nobody cared"
message, I used "git bisect bad". Otherwise, "git bisect good".

Unfortunately, I don't have the time these days to continue bisecting.
(I am kind of busy.) I will have the time about one-two weeks later.

I am going to attach the current output of "git bisect log" in a moment.
Please note that the bisection isn't finished yet. As I said, I will be
able to continue bisecting in a few weeks.

Regards,

M. Vefa Bicakci
Comment 20 M. Vefa Bicakci 2008-12-05 18:22:19 UTC
Created attachment 19165 [details]
The output of "git bisect log"

Please note that the bisection is not complete yet.
Comment 21 M. Vefa Bicakci 2008-12-21 23:33:18 UTC
Hello,

Wondering which commit could be the cause of this problem, I ran
"git bisect visualize". After noticing that nothing really relevant
showed up in the list of commits, I realized that waiting for 48-60
hours might not be enough to detect whether a certain version of the
kernel has this problem or not.

Furthermore, I have tested 2c6b8c030cfca334c3d700ee504036c585c4c6a3
again, and I have found out that it is actually "bad", not "good".

This means that my bisection results are not meaningful.

I have restarted the bisection and this time I will wait longer
before deciding whether a given version of the kernel has the
problem or not.

I would really appreciate it if a kernel developer could help me
to debug this problem.

Regards,

M. Vefa Bicakci
Comment 22 Zhang Rui 2008-12-28 17:57:13 UTC
hi, vefa,

thanks for your effort, is that any test result?
Comment 23 M. Vefa Bicakci 2009-01-01 23:26:24 UTC
Dear Zhang Rui,

I have started to bisect again, but this time with a 3-day waiting
period before declaring a kernel "good". I am currently on the 4th
iteration. However, I am still not sure whether a 3-day waiting
period is long enough. (As you might remember from my previous posts,
a 2-day waiting period was not enough.)

I will attach the output of "git bisect log" in a moment.

I would really appreciate it if someone could ask me to test certain
git commits.

Regards,

M. Vefa Bicakci
Comment 24 M. Vefa Bicakci 2009-01-01 23:28:43 UTC
Created attachment 19601 [details]
Output of "git bisect log"

Here is my bisection progress.
Comment 25 M. Vefa Bicakci 2009-02-14 17:59:45 UTC
Hello,

While bisecting (between 2.6.24 and 2.6.25), I noticed that the commits
between the so called good and bad commits were very unrelated to the
problem I am having. Finally, I tried a vanilla 2.6.24 kernel. I am sad
to report that vanilla 2.6.24 has the same problem.

However, all is not lost. I have built Debian's 2.6.24 based kernel with
a patch which reduces the spurious IRQ counter limit to 10, and have run
it for 18 days. I am happy to report that during that time I haven't had
any "nobody cared" messages.

So I decided to bisect the differences between Debian's 2.6.24 and the
vanilla 2.6.24. I have created a git repository based on Debian's 2.6.24,
and then reverted Debian's changes to 2.6.24 patch by patch and ran
"git commit" between each step. This gave me a git repository which can
be used to bisect the differences. (Unfortunately, git cannot bisect
from future to past, which is the reason I have come up with this
complicated approach.)

Just to rule out the possibility that this might be a configuration related
issue, I am trying a vanilla 2.6.24 kernel with Debian's configuration.
If this test shows that the configuration does not trigger this bug, then
I will start the bisection procedure using the git repository which I have
documented above.

Regards,

M. Vefa Bicakci
Comment 26 ykzhao 2009-03-04 21:48:22 UTC
HI, Vefa
    How about the git-bisect result? 
    From the comment #25 it seems that there is no such issue on the 2.6.24 Debian's kernel. But the problem exists on the vanilla kernel(2.6.24). Right?
    If so, had better use the git-bisect to identify the bad commit.
    Thanks.
    
    
Comment 27 Zhang Rui 2009-03-23 02:12:29 UTC
ping Vefa.
Comment 28 M. Vefa Bicakci 2009-03-30 01:31:13 UTC
Hello,

I apologize for not responding earlier; I wanted to gather more data before
posting a message.

I took 2.6.24's source and reduced its spurious interrupt limit to 10
in "kernel/irq/spurious.c". I called this 2.6.24-suspect. Testing this kernel
was not helpful because after a certain amount of uptime and with a certain
pattern of network activity, iwl3945 started to print error messages which
are appended to this message. After a few "rmmod iwl3945" and "modprobe
iwl3945" cycles, I got crashes. This occurred at least three times.

In addition to this, I got the following "nobody cared" message one time
while testing 2.6.24-suspect:

=== 8< ===
irq 17: nobody cared (try booting with the "irqpoll" option)
Pid: 29141, comm: Xorg Not tainted 2.6.24-suspect #1
 [<c0157ee6>] __report_bad_irq+0x36/0x75
 [<c0158139>] note_interrupt+0x214/0x273
 [<c01577e9>] handle_IRQ_event+0x23/0x51
 [<c01586af>] handle_fasteoi_irq+0x85/0xa5
 [<c010672f>] do_IRQ+0x55/0x6e
 [<c0184a4f>] sys_ioctl+0x45/0x5e
 [<c010497f>] common_interrupt+0x23/0x28
 =======================
handlers:
[<f8882d7a>] (usb_hcd_irq+0x0/0x56 [usbcore])
[<f8ddccfd>] (i915_driver_irq_handler+0x0/0x18e [i915])
Disabling IRQ #17
=== >8 ===

However, this message occurred when I was trying CompizFusion - I was
switching it on and off. To be honest, I don't know if this "nobody cared"
message was printed because of the same problem as the one I had when
I opened this bug report, or because 10 is a really low limit.

The longest uptime I was able to achieve with 2.6.24-suspect before
I got crashes was 4 days. Because wireless is essential for me, I didn't continue my tests with 2.6.24-suspect. 

I then used Sidux's 2.6.28.7 based kernel, unmodified. After an uptime of
15 days without any problems, I decided to test vanilla 2.6.28.7 with a
low spurious interrupt limit. This time I used 99 as the spurious
interrupt limit. I called this kernel 2.6.28.7-spurious. I am still
running this kernel, and my current uptime is 15 days.

So, in other words, I haven't been able to reproduce this problem with
2.6.28.7. (Yet?) However, I am going to try 2.6.24.7 and 2.6.29 with 99
as the spurious interrupt limit, so please do not close this bug report
yet.

Any suggestions on which of these two versions I should test next,
how much time I should wait before deciding that I cannot reproduce
this problem, and what would be a good spurious interrupt limit (instead
of 10 or 99) ?

There is one more important thing which happened: Before testing
2.6.24-suspect, I upgraded my laptop's BIOS. I can downgrade the
BIOS if it becomes necessary to confirm the role of the BIOS version
in this problem.

Once again, sorry for my late response.

Thank you,

M. Vefa Bicakci

########################################################################
Here's the message iwl3945 prints:

=== 8< ===
Error sending REPLY_SCAN_ABORT_CMD: time out after 500ms.
Error sending REPLY_RXON: time out after 500ms.
=== >8 ===
Comment 29 ykzhao 2009-03-30 05:19:40 UTC
Please try the 2.6.29 kernel with the 99 as the spurious interrupt threshold and then see whether the problem can be reproduced. And IMO it is unnecessary to downgrade the BIOS. 
   If the problem can't be reproduced, IMO this bug can be closed. 
   Thanks.
Comment 30 M. Vefa Bicakci 2009-04-04 13:12:53 UTC
Hello,

Unfortunately, I have bad news. I took vanilla 2.6.29 and modified it so
that its spurious interrupt threshold is 99. After an uptime of 4 days
and 4.5 hours with this kernel, at 3:30am, when the computer was mostly
idle, I got the "nobody cared" message below:

=== 8< ===
irq 18: nobody cared (try booting with the "irqpoll" option)
Pid: 0, comm: swapper Not tainted 2.6.29-spurious #1
Call Trace:
 [<c0175494>] __report_bad_irq+0x24/0x90
 [<c017566d>] note_interrupt+0x16d/0x1b0
 [<c0174590>] handle_IRQ_event+0x30/0x60
 [<c0175d2b>] handle_fasteoi_irq+0xab/0xd0
 [<c0106273>] do_IRQ+0x43/0x90
 [<c0104427>] common_interrupt+0x27/0x2c
 [<f80770c8>] acpi_idle_enter_bm+0x2c8/0x351 [processor]
 [<c02e387f>] cpuidle_idle_call+0x6f/0xc0
 [<c0102ce6>] cpu_idle+0x66/0xa0
handlers:
[<f80ba970>] (usb_hcd_irq+0x0/0x80 [usbcore])
[<f81bd600>] (tifm_7xx1_isr+0x0/0x140 [tifm_7xx1])
[<f80eaad0>] (yenta_interrupt+0x0/0xf0 [yenta_socket])
[<f808ad60>] (sdhci_irq+0x0/0x560 [sdhci])
Disabling IRQ #18
=== >8 ===

My laptop has four USB ports, two on its right side and two on its back.
I have a USB keyboard attached to one of the ports on the back and a
USB mouse attached to one of the ports on the right side. This "nobody
cared" message only affected the rear ports. If I connect my USB keyboard
to a port on the right side, then it works properly. I don't know if this
problem is specific to the ports on the real panel or is because of my USB
keyboard.

If needed, I can reboot and have my USB keyboard (and mouse) attached to
one of the ports on the right side and see whether I get a "nobody cared"
message or not.

What I don't understand is how I managed to get with 2.6.28.7-spurious
a 15 day uptime without any "nobody cared" messages.

Should I go back to testing 2.6.24.7 or 2.6.28.7 ? I would appreciate
any suggestion to debug this problem.

I will not reboot until I get a reply from you, in case the current
state of the computer can provide some information.

Thank you,

M. Vefa Bicakci

---

For the record, the following is what my /proc/interrupts looks like:

=== 8< ===
           CPU0       CPU1
  0:  248683635       8493   IO-APIC-edge      timer
  1:      95240         46   IO-APIC-edge      i8042
  8:         15          0   IO-APIC-edge      rtc0
  9:     105898      25576   IO-APIC-fasteoi   acpi
 12:    1105620          0   IO-APIC-edge      i8042
 14:     809096          8   IO-APIC-edge      ata_piix
 15:    1900817         84   IO-APIC-edge      ata_piix
 16:        596         48   IO-APIC-fasteoi   uhci_hcd:usb5, i915@pci:0000:00:02.0
 17:       3501          0   IO-APIC-fasteoi   firewire_ohci
 18:     455207       8036   IO-APIC-fasteoi   uhci_hcd:usb4, tifm_7xx1, yenta, mmc0
 19:     441166      20272   IO-APIC-fasteoi   uhci_hcd:usb3
 20:      90489         12   IO-APIC-fasteoi   eth0
 22:    2727170          0   IO-APIC-fasteoi   HDA Intel
 23:         22          2   IO-APIC-fasteoi   ehci_hcd:usb1, uhci_hcd:usb2
 27:    4655721        586   PCI-MSI-edge      iwl3945
NMI:          0          0   Non-maskable interrupts
LOC:   10409292   99475736   Local timer interrupts
RES:    2619999    4462391   Rescheduling interrupts
CAL:        186        592   Function call interrupts
TLB:      19465      26565   TLB shootdowns
TRM:          0          0   Thermal event interrupts
SPU:          0          0   Spurious interrupts
ERR:          0
MIS:          0
=== >8 ===
Comment 31 ykzhao 2009-04-08 08:00:25 UTC
Hi
    Sorry for the late response.
    Will you please attach the output of dmesg, lspci -vxxx, lsusb -l?
    
    From the test result it seems that the issue can be reproduced if the spurious interrupt threshold is changed to 99. Of course it is not easy to reproduce. >After 4 days.
    
     How about if we don't change the spurious interrupt threshold?
     Thanks.
Comment 32 M. Vefa Bicakci 2009-04-08 15:16:51 UTC
Hello,

I will attach the output of "dmesg", "lspci -vxxx" and "lsusb -v" in a moment.

I am using Sidux, and the lsusb I have on my system does not have a "-l"
option. So, instead I will attach the output of "lsusb -v".

My system is currently in the "nobody cared" state. Because of that, 
I have attached my USB keyboard to one of the ports on the right side.
Normally I attach it to one of the ports on the back side.

I have a custom patch applied; all it does is to print the name of
the ISR which "worked" for the spurious interrupt. I will attach this
patch in a moment too.

Finally, I would like to note that I used to get the "nobody cared"
messages even without changing the spurious interrupt threshold. This
was the original reason of my bug report. I started to decrease the
threshold on Stefan Assmann's advice so that the problem could be
reproduced quickly.

However, I will try your suggestion: I just compiled a vanilla 2.6.29.1
without modifying the threshold, and I will test this kernel with the
keyboard attached to one of the ports on the back of the computer.
Please note that it might take a long time to reproduce the problem.

Thank you for your help.

M. Vefa Bicakci
Comment 33 M. Vefa Bicakci 2009-04-08 15:19:18 UTC
Created attachment 20885 [details]
dmesg output for 2.6.29.1 with 99 as the spurious interrupt threshold
Comment 34 M. Vefa Bicakci 2009-04-08 15:21:37 UTC
Created attachment 20886 [details]
Output of lspci -vxxx
Comment 35 M. Vefa Bicakci 2009-04-08 15:22:35 UTC
Created attachment 20887 [details]
Output of lsusb -v
Comment 36 M. Vefa Bicakci 2009-04-08 15:24:01 UTC
Created attachment 20888 [details]
The patch that makes the kernel print out the name of the handler which "worked" for the spurious interrupt.
Comment 37 M. Vefa Bicakci 2009-04-08 15:25:37 UTC
The patch I attached a few seconds ago also changes the spurious
interrupt threshold to 10. However, the kernel I am currently using
has a threshold of 99, not 10.

As I said above, I will test 2.6.29.1 without the threshold modification.
Comment 38 M. Vefa Bicakci 2009-05-11 15:01:21 UTC
Hello,

Unfortunately, I have bad news. I had to restart my computer
a few times after an uptime of 6 or 7 days because of X crashes
which are unrelated to this bug.

After those reboots, my longest uptime was 20 days, which lasted
from April 16th to May 6th or 7th. (The reason of the reboot was
a hard lock-up caused by the SD card reader, and even Alt-SysRq-B
didn't work.)

During that 20 day-long uptime, no "nobody cared" messages were
emitted by the kernel. However, because the spurious interrupt
threshold is at its default value of 99900, I am guessing that
it might take even longer than 20 days to get a "nobody cared"
message.

The only thing I can think of doing is trying again with 2.6.29.3,
this time being even more careful so that a reboot does not become
necessary.

I would appreciate any advice.

Regards,

M. Vefa Bicakci
Comment 39 Shaohua 2009-07-01 03:17:22 UTC
does irqpoll boot option work here?
It disappears in latest kernel, maybe we should just close the track, so please double check with latest kernel.
Comment 40 ykzhao 2009-07-06 02:01:02 UTC
From the log in comment #38 it seems that this issue can't be reproduced on the upstream kernel. 

How about close this bug?
Comment 41 M. Vefa Bicakci 2009-08-04 16:35:37 UTC
Hello,

I am currently running 2.6.30.1 without any modifications, and the
uptime is about 30 days, during which I haven't seen a nobody cared
message.

I still don't know the cause of this problem, and because of this
I can't tell for sure whether the same problem will occur again.
Maybe we could close the bug report for now, and I can reopen it
if I ever get another nobody cared message. (Hopefully I won't!)

Regards,

M. Vefa Bicakci