Bug 11086

Summary: 2.6.26 (sata_nv = irq 21 problem after halting)
Product: IO/Storage Reporter: Dan (dan76)
Component: Serial ATA Assignee: Tejun Heo (htejun)
Status: REJECTED INSUFFICIENT_DATA    
Severity: normal CC: rjw, trenn
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.26 Subsystem:
Regression: Yes Bisected commit-id:
Bug Depends on:    
Bug Blocks: 10492    
Attachments: nv-hardreset.patch

Description Dan 2008-07-14 12:31:53 UTC
Latest working kernel version: 2.6.25
Earliest failing kernel version: 2.6.26
Distribution: Linux from scratch
Hardware Environment: athlon64 x2 5600+, Asus M2N-E
Software Environment:
Problem Description:

	Something changed in sata_nv's NCQ support. When I halt my
system (Athlon64, Asus M2N-E), I get the following (I had to copy it by
hand, since the system is halted, so it's just an excerpt):

IRQ 21: nobody cared (try booting with the "irqpoll" option)
Pid 0, comm: swapper not tainted 2.6.26 #2

Call trace:
........... nv_swncq_interrupt
........... __report_bad_irq
........... note_interrupt
........... handle_fasteoi_irq
........... do_irq
........... ret_from_intr

Handles:

[<f88888888803a5b90>] (nv_swncq_interrupt+0x0/0xd0)

Disabling IRQ #21

	This only happens after a halt.

	If I boot with irqpoll, this error doesn't happen.

	Thanks.

Steps to reproduce:

Just shut down the system (halt).
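
Note on the "nobody cared" message: the trace above goes through note_interrupt and __report_bad_irq, which is the kernel's spurious-interrupt accounting. The Python sketch below is only a simplified model of that accounting, not kernel code. The window size (100000 interrupts) and the unhandled limit (99900) are assumptions based on my reading of kernel/irq/spurious.c, and IrqState and simulate are hypothetical names used for illustration.

```python
from dataclasses import dataclass

# Assumed thresholds (my reading of kernel/irq/spurious.c, not verified here).
WINDOW = 100_000           # interrupts per accounting window
UNHANDLED_LIMIT = 99_900   # more unhandled than this in one window disables the IRQ


@dataclass
class IrqState:
    """Hypothetical stand-in for the per-IRQ counters kept by the kernel."""
    irq_count: int = 0
    irqs_unhandled: int = 0
    disabled: bool = False


def note_interrupt(state: IrqState, handled: bool) -> None:
    """Account for one interrupt; disable the line if almost none were handled."""
    if state.disabled:
        return
    if not handled:
        state.irqs_unhandled += 1
    state.irq_count += 1
    if state.irq_count < WINDOW:
        return
    # End of a window: decide, then reset both counters.
    state.irq_count = 0
    if state.irqs_unhandled > UNHANDLED_LIMIT:
        state.disabled = True   # this is where "Disabling IRQ #N" would be printed
    state.irqs_unhandled = 0


def simulate(n: int, handled_every=None) -> IrqState:
    """Fire n interrupts; every handled_every-th one is claimed by a handler."""
    state = IrqState()
    for i in range(n):
        handled = handled_every is not None and i % handled_every == 0
        note_interrupt(state, handled)
    return state
```

Under these assumptions, a line that fires 100000 times with no handler claiming any of them ends up disabled, which matches the "Disabling IRQ #21" line in the report. A line where even half of the interrupts are claimed stays enabled.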
Comment 1 Anonymous Emailer 2008-07-14 12:58:24 UTC
Reply-To: akpm@linux-foundation.org


(switched to email.  Please respond via emailed reply-to-all, not via the
bugzilla web interface).

On Mon, 14 Jul 2008 12:31:54 -0700 (PDT)
bugme-daemon@bugzilla.kernel.org wrote:

> http://bugzilla.kernel.org/show_bug.cgi?id=11086
> 
>            Summary: 2.6.26 (sata_nv = irq 21 problem after halting)
>            Product: IO/Storage
>            Version: 2.5
>      KernelVersion: 2.6.26
>           Platform: All
>         OS/Version: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: normal
>           Priority: P1
>          Component: Serial ATA
>         AssignedTo: jgarzik@pobox.com
>         ReportedBy: fragabr@gmail.com
> 
> 
> Latest working kernel version: 2.6.25
> Earliest failing kernel version: 2.6.26
> Distribution: Linux from scratch
> Hardware Environment: athlon64 x2 5600+, Asus M2N-E
> Software Environment:
> Problem Description:
> 
>         Something changed in sata_nv's NCQ support. When I halt my
> system (Athlon64, Asus M2N-E), I get the following (I had to copy it by
> hand, since the system is halted, so it's just an excerpt):
> 
> IRQ 21: nobody cared (try booting with the "irqpoll" option)
> Pid 0, comm: swapper not tainted 2.6.26 #2
> 
> Call trace:
> ........... nv_swncq_interrupt
> ........... __report_bad_irq
> ........... note_interrupt
> ........... handle_fasteoi_irq
> ........... do_irq
> ........... ret_from_intr
> 
> Handles:
> 
> [<f88888888803a5b90>] (nv_swncq_interrupt+0x0/0xd0)
> 
> Disabling IRQ #21
> 
>         This only happens after a halt.
> 
>         If I boot with irqpoll, this error doesn't happen.
> 
>         Thanks.
> 
> Steps to reproduce:
> 
> Just shut down the system (halt).
> 

It's a 2.6.26 regression.  It happens at halt-time, so it won't be the
usual ACPI stuff.

I think this is the second report of ata going splat in this manner at
shutdown time.  Did we get the order of something wrong?
Comment 2 Tejun Heo 2008-07-18 06:17:02 UTC
Hmmm... libata doesn't do much on shutdown.  sd spins down disks and that's about it.  All the rest keeps running as usual until the power is cut, so there isn't much order which can go wrong.  That said, I think it's the third report somewhat related to libata and shutdown, so it smells fishy.

Daniel, does the machine power off after that?  Or does the problem prevent the machine from powering off?
Comment 3 Dan 2008-07-18 10:58:09 UTC
Hi Tejun, the problem prevents the machine from powering off (it stays on after the problem). If you need more information or tests, just ask. Thanks.
Comment 4 Dan 2008-10-01 20:35:58 UTC
Hi again! Now testing with 2.6.27-rc8 on the same machine, I get these lines (I can't see everything, since I only have 60 console text lines):

08c: 00000000 00000000 00000000 // 00000000 00000000 00000000 // 00000000 00000000 00000000 // 00000000 00000000 00000000
090: 00000000 00000000 00000000 // 00000000 00000000 00000000 // 00000000 00000000 00000000 // 00000000 00000000 00000000
094: 00000000 00000000 00000000 // 00000000 00000000 00000000 // 00000000 00000000 00000000 // 00000000 00000000 00000000
098:00000000 00000000 00000000 // 00000000 00000000 00000000 // 00000000 00000000 00000000 // 00000000 00000000 00000000
09c:00000000 00000000 00000000 // 00000000 00000000 00000000 // 00000000 00000000 00000000 // 00000000 00000000 00000000
0a0:00000000 00000000 00000000 // 00000000 00000000 00000000 // 00000000 00000000 00000000 // 00000000 00000000 00000000
0a4:00000000 00000000 00000000 // 00000000 00000000 00000000 // 00000000 00000000 00000000 // 00000000 00000000 00000000
0a8:00000000 00000000 00000000 // 00000000 00000000 00000000 // 00000000 00000000 00000000 // 00000000 00000000 00000000
0ac:00000000 00000000 00000000 // 00000000 00000000 00000000 // 00000000 00000000 00000000 // 00000000 00000000 00000000
0b0:00000000 00000000 00000000 // 00000000 00000000 00000000 // 00000000 00000000 00000000 // 00000000 00000000 00000000
0b4:00000000 00000000 00000000 // 00000000 00000000 00000000 // 00000000 00000000 00000000 // 00000000 00000000 00000000
0b8:00000000 00000000 00000000 // 00000000 00000000 00000000 // 00000000 00000000 00000000 // 00000000 00000000 00000000
0bc:00000000 00000000 00000000 // 00000000 00000000 00000000 // 00000000 00000000 00000000 // 00000000 00000000 00000000
0c0:00000000 00000000 00000000 // 00000000 00000000 00000000 // 00000000 00000000 00000000 // 00000000 00000000 00000000
0c4:00000000 00000000 00000000 // 00000000 00000000 00000000 // 00000000 00000000 00000000 // 00000000 00000000 00000000
0c8:00000000 00000000 00000000 // 00000000 00000000 00000000 // 00000000 00000000 00000000 // 00000000 00000000 00000000
0cc:00000000 00000000 00000000 // 00000000 00000000 00000000 // 00000000 00000000 00000000 // 00000000 00000000 00000000
0d0:00000000 00000000 00000000 // 00000000 00000000 00000000 // 00000000 00000000 00000000 // 00000000 00000000 00000000
0d4:00000000 00000000 00000000 // 00000000 00000000 00000000 // 00000000 00000000 00000000 // 00000000 00000000 00000000
0d8:00000000 00000000 00000000 // 00000000 00000000 00000000 // 00000000 00000000 00000000 // 00000000 00000000 00000000
0dc:00000000 00000000 00000000 // 00000000 00000000 00000000 // 00000000 00000000 00000000 // 00000000 00000000 00000000
0e0:00000000 00000000 00000000 // 00000000 00000000 00000000 // 00000000 00000000 00000000 // 00000000 00000000 00000000
0e4:00000000 00000000 00000000 // 00000000 00000000 00000000 // 00000000 00000000 00000000 // 00000000 00000000 00000000
0e8:00000000 00000000 00000000 // 00000000 00000000 00000000 // 00000000 00000000 00000000 // 00000000 00000000 00000000
0ec:00000000 00000000 00000000 // 00000000 00000000 00000000 // 00000000 00000000 00000000 // 00000000 00000000 00000000
0f0:00000000 00000000 00000000 // 00000000 00000000 00000000 // 00000000 00000000 00000000 // 00000000 00000000 00000000
0f4:00000000 00000000 00000000 // 00000000 00000000 00000000 // 00000000 00000000 00000000 // 00000000 00000000 00000000
0f8:00000000 00000000 00000000 // 00000000 00000000 00000000 // 00000000 00000000 00000000 // 00000000 00000000 00000000
0fc:00000000 00000000 00000000 // 00000000 00000000 00000000 // 00000000 00000000 00000000 // 00000000 00000000 00000000

And the machine stays powered on, of course (otherwise I couldn't read this ;).
Comment 5 Tejun Heo 2008-10-04 17:52:52 UTC
Created attachment 18169 [details]
nv-hardreset.patch

Can you please test the attached patch on top of 2.6.27-rc8?  Thanks.
Comment 6 Dan 2008-10-04 22:06:44 UTC
Of course! I applied your patch, but the result was the same as before. The screen is full of "00000000"... I can't see whether there's anything before this (between "System halted" and the zeroes). Maybe if there is a way to prevent these zeroes from being displayed, it would help.

Thanks.
Comment 7 Tejun Heo 2008-10-04 22:15:44 UTC
Cc'ing Thomas and Rafael.  The machine doesn't shut down and dumps strange messages.  Any ideas?
Comment 8 Tejun Heo 2008-10-04 22:16:10 UTC
Does turning off swncq make any difference?
Comment 9 Dan 2008-10-05 00:15:31 UTC
Tejun, no. Passing "sata_nv.swncq=0" to kernel does not help.
Comment 10 Rafael J. Wysocki 2008-10-05 03:42:55 UTC
Please attach /proc/interrupts from the working system.
Comment 11 Dan 2008-10-05 08:46:58 UTC
Here is /proc/interrupts, but please note that the messages differ between 2.6.26 and 2.6.27-rc8 (above). The swncq error message happened in 2.6.26, while 2.6.27-rc8 gives lots of "00000000":

           CPU0       CPU1       
  0:         24          2   IO-APIC-edge      timer
  1:          0        115   IO-APIC-edge      i8042
  7:          1          0   IO-APIC-edge    
  9:          0          0   IO-APIC-fasteoi   acpi
 14:          1         51   IO-APIC-edge      pata_amd
 15:          0          0   IO-APIC-edge      pata_amd
 16:        112      16417   IO-APIC-fasteoi   nvidia
 18:          0          0   IO-APIC-fasteoi   cx88[0]
 20:         11       2131   IO-APIC-fasteoi   ohci_hcd:usb1
 21:          0          0   IO-APIC-fasteoi   sata_nv
 22:          0        730   IO-APIC-fasteoi   sata_nv, HDA Intel
 23:        215      45342   IO-APIC-fasteoi   sata_nv, ehci_hcd:usb2
318:        259      29194   PCI-MSI-edge      eth0
NMI:          0          0   Non-maskable interrupts
LOC:     118067     117814   Local timer interrupts
RES:      11173       4175   Rescheduling interrupts
CAL:       3111       2107   function call interrupts
TLB:        243        151   TLB shootdowns
TRM:          0          0   Thermal event interrupts
THR:          0          0   Threshold APIC interrupts
SPU:          0          0   Spurious interrupts
ERR:          1

If you need any tests or more information, feel free to ask.
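
Note on reading the table above: the Python sketch below parses a few of the rows from this comment and lists which lines sata_nv sits on and how many interrupts each has seen. It is only an illustration. parse_interrupts and sata_nv_totals are hypothetical helper names, and the parser assumes the two-CPU layout shown here (counts first, then the chip name, then a comma-separated device list).

```python
# A few rows copied from the /proc/interrupts output in this comment.
SAMPLE = """\
           CPU0       CPU1
  7:          1          0   IO-APIC-edge
 21:          0          0   IO-APIC-fasteoi   sata_nv
 22:          0        730   IO-APIC-fasteoi   sata_nv, HDA Intel
 23:        215      45342   IO-APIC-fasteoi   sata_nv, ehci_hcd:usb2
NMI:          0          0   Non-maskable interrupts
ERR:          1
"""


def parse_interrupts(text):
    """Return {irq: (per_cpu_counts, chip, devices)} for numbered IRQ lines."""
    lines = [line for line in text.splitlines() if line.strip()]
    ncpu = len(lines[0].split())          # header row: one column per CPU
    table = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        label = label.strip()
        if not label.isdigit():
            continue                      # skip NMI, LOC, ERR and similar rows
        fields = rest.split()
        counts = [int(f) for f in fields[:ncpu]]
        chip = fields[ncpu] if len(fields) > ncpu else ""
        names = " ".join(fields[ncpu + 1:])
        devices = [d.strip() for d in names.split(",") if d.strip()]
        table[int(label)] = (counts, chip, devices)
    return table


def sata_nv_totals(table):
    """Total interrupt count for every line that sata_nv is registered on."""
    return {irq: sum(counts)
            for irq, (counts, _chip, devices) in table.items()
            if "sata_nv" in devices}


if __name__ == "__main__":
    for irq, total in sorted(sata_nv_totals(parse_interrupts(SAMPLE)).items()):
        print(irq, total)
```

On this sample, IRQ 21 is the only line that sata_nv does not share with another device, and its counters are zero on both CPUs in the running system.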
Comment 12 Dan 2008-10-27 13:38:28 UTC
I tested with 2.6.28-rc2 and the strange messages after halt are gone. But the machine still doesn't power off. I only see "System halted" and the machine stays powered on... Is this a kernel issue or something related to the BIOS? Thanks.
Comment 13 Dan 2008-10-28 08:46:50 UTC
I answered too early. Yesterday I got a debug message (trace), but as my console had only 25 lines, I couldn't see the top line of the kernel trace.
Comment 14 Dan 2008-10-28 22:54:44 UTC
Hi Tejun. Now, with 2.6.28-rc2 (I managed to switch to 60 lines of console and could get the trace) I get the following (I had to copy more or less by hand):

WARNING: at net/sched/sch_generic.c:226  dev_watchdog...
Call Trace:

IRQ .... warn_slowpath
         dequeue_task
         try_to_wake_up
         __next_cpu
         find_busiest_group
         getnstimeofday
         strlcpy
         dev_watchdog
         cascade
         ...
         ...
         apic_timer_interrupt
         default_idle
         c1e_idle
         cpu_idle

***

Does it help in some way? If you need it complete, just ask. Thank you again!
Comment 15 Tejun Heo 2008-11-09 23:00:37 UTC
That's from the network interface watchdog.  A packet was scheduled to go out but couldn't.  I can't determine whether it's related to the shutdown problem or not.  If turning off swncq doesn't make any difference, the changes in libata may not be the cause of the problem.  I don't have any other ideas here.  The only sure way forward is to bisect and find out exactly which commit broke power off.

Thanks.
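
Note on bisecting: the idea is a binary search between a known-good and a known-bad kernel, testing one build per step. The Python sketch below shows that search on a simplified, strictly linear history. Real git bisect also handles merges, so treat this as an illustration of the cost rather than of git's actual algorithm. The commit names, the count of 10000 commits, and the position of the bad commit are all made up for the example.

```python
def bisect_first_bad(commits, is_bad):
    """Find the first bad commit in a list ordered oldest to newest.

    Assumes commits[0] is known good and commits[-1] is known bad.
    Returns (first_bad_commit, number_of_builds_tested).
    """
    good, bad = 0, len(commits) - 1
    steps = 0
    while bad - good > 1:
        mid = (good + bad) // 2
        steps += 1                      # one kernel build + halt test per step
        if is_bad(commits[mid]):
            bad = mid                   # breakage is at mid or earlier
        else:
            good = mid                  # breakage is after mid
    return commits[bad], steps


# Hypothetical history: c0 stands for the good release, c10000 for the bad one.
HISTORY = [f"c{i}" for i in range(10001)]
FIRST_BAD = 6789                        # made-up position of the culprit


def halt_fails(commit):
    """Stand-in for 'build this kernel, boot it, halt, and see if it powers off'."""
    return int(commit[1:]) >= FIRST_BAD


if __name__ == "__main__":
    culprit, builds = bisect_first_bad(HISTORY, halt_fails)
    print(culprit, builds)
```

With a linear range of this size, the search needs at most 14 build-and-halt cycles, since each test halves the remaining range.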
Comment 16 Alan 2009-03-23 09:52:03 UTC
No bisect done so closing
Comment 17 Dan 2009-03-23 10:54:47 UTC
I didn't bisect because it's too much work and, since the problem is harmless, I'm too lazy to do it. Since I want to replace my computer, I'll avoid anything based on nvidia sata... AHCI is the way to go, much better.
Comment 18 Dan 2009-03-24 01:28:53 UTC
(In reply to comment #16)
> No bisect done so closing
> 

Well, never mind, it seems to be fixed in 2.6.29. No more strange messages after halt.