Bug 8650 - disk errors due to SATA timeout
Summary: disk errors due to SATA timeout
Status: CLOSED CODE_FIX
Alias: None
Product: IO/Storage
Classification: Unclassified
Component: Serial ATA
Hardware: All
OS: Linux
Importance: P1 normal
Assignee: Tejun Heo
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2007-06-19 04:08 UTC by Vasily Averin
Modified: 2008-09-24 06:51 UTC
CC List: 5 users

See Also:
Kernel Version: 2.6.22-rc4
Subsystem:
Regression: ---
Bisected commit-id:


Attachments
dmesg output (23.31 KB, text/plain)
2007-06-19 04:10 UTC, Vasily Averin
/proc/interrupts output (759 bytes, text/plain)
2007-06-19 04:11 UTC, Vasily Averin
lspci -vvvxxx output (29.60 KB, text/plain)
2007-06-19 04:13 UTC, Vasily Averin
dont-enable-pci-irq-for-legacy-host.patch (1.86 KB, patch)
2007-06-20 00:12 UTC, Tejun Heo
2.6.22-rc4+tejun.patch: serial console logs (33.23 KB, text/plain)
2007-06-20 23:49 UTC, Vasily Averin
/proc/interrupts output (772 bytes, text/plain)
2007-06-20 23:51 UTC, Vasily Averin
dmesg (30.19 KB, text/plain)
2007-06-25 10:22 UTC, Miguel Marte
interrupts (932 bytes, text/plain)
2007-06-25 10:25 UTC, Miguel Marte
output of lspci (2.09 KB, text/plain)
2007-06-25 10:28 UTC, Miguel Marte
2.6.22-rc7: serial console log with sysrq (905.69 KB, text/plain)
2007-07-07 00:26 UTC, Vasily Averin
2.6.22-rc7: sata timeouts issue again (90.98 KB, text/plain)
2007-07-08 06:13 UTC, Vasily Averin
Debugging patch for JBD lockup (584 bytes, patch)
2007-07-16 08:23 UTC, Jan Kara
2.6.22: JBD lockup, serial console logs (490.52 KB, text/plain)
2007-07-17 05:43 UTC, Vasily Averin
serial console log (2.6.23rc7 + thgl patch+ "acpi=noirq") (54.23 KB, text/plain)
2007-11-05 22:57 UTC, Vasily Averin

Description Vasily Averin 2007-06-19 04:08:50 UTC
VIA VT6420 SATA RAID Controller on MSI motherboard, RHEL5 distribution, x86_64 kernel.

On boot the hardware initializes properly and everything works fine for some time, but then a timeout is detected on the SATA disk.

This was originally found on RHEL5 kernels, where it led to a device disconnect.

Reproduced on the 2.6.22-rc4 kernel; in this case EH works well and the disk comes back alive,
however due to ext3 errors the filesystem is remounted read-only.
Comment 1 Vasily Averin 2007-06-19 04:10:26 UTC
Created attachment 11793 [details]
dmesg output
Comment 2 Vasily Averin 2007-06-19 04:11:41 UTC
Created attachment 11794 [details]
/proc/interrupts output
Comment 3 Vasily Averin 2007-06-19 04:13:05 UTC
Created attachment 11795 [details]
lspci -vvvxxx output
Comment 4 Andrew Morton 2007-06-19 07:39:27 UTC
Are you able to identify an earlier kernel version which did not have this problem?
Comment 5 Tejun Heo 2007-06-20 00:12:49 UTC
Created attachment 11808 [details]
dont-enable-pci-irq-for-legacy-host.patch

Can you please try the attached patch and report the result?
Comment 6 Vasily Averin 2007-06-20 00:26:00 UTC
(In reply to comment #4)
> Are you able to identify an earlier kernel version which did not have this
> problem?

I've reproduced this issue on RHEL5 and RHEL5-based kernels.
Jan Evert van Grootheest has reported a very similar issue on the 2.6.18-4-xen-amd64 kernel:
http://www.mail-archive.com/linux-ide@vger.kernel.org/msg06011.html

However I can say nothing about older kernels.
Comment 7 Vasily Averin 2007-06-20 00:30:17 UTC
(In reply to comment #5)
> Created an attachment (id=11808) [details]
> dont-enable-pci-irq-for-legacy-host.patch
> 
> Can you please try the attached patch and report the result?

Thank you, Tejun

However, I would note that I have no simple reproducer for this issue and my experiments may take a lot of time.
Comment 8 Tejun Heo 2007-06-20 00:35:55 UTC
Okay, setting status to NEEDINFO.  Please report back when you can confirm the result either way.  Thanks.
Comment 9 Vasily Averin 2007-06-20 01:12:49 UTC
(In reply to comment #0)
> Reproduced on the 2.6.22-rc4 kernel; in this case EH works well and the disk
> comes back alive

BTW, I was wrong and disk access returns IO errors :(
Comment 10 Tejun Heo 2007-06-20 01:16:00 UTC
Yeah, the disk was gone in the log.  EH gave up after several tries.
Comment 11 Vasily Averin 2007-06-20 23:49:23 UTC
Created attachment 11838 [details]
2.6.22-rc4+tejun.patch: serial console logs

Tejun,

I've reproduced this issue again on the patched kernel (i.e. 2.6.22-rc4 + your patch). Serial console logs are attached.

Yesterday I started the following tests:
ltp
volano
jbb2000
unixbench
aim7
smtp

I would note that during the aim7 test I got several "hda: lost interrupt" messages, but the node still worked well.

Today I restarted the tests again -- and it led to a SATA timeout.
Comment 12 Vasily Averin 2007-06-20 23:51:54 UTC
Created attachment 11839 [details]
/proc/interrupts output
Comment 13 Vasily Averin 2007-06-20 23:54:39 UTC
Sorry, I cannot get the lspci output -- the binary is not in the disk cache :(
Comment 14 Miguel Marte 2007-06-25 10:22:20 UTC
Created attachment 11868 [details]
dmesg

Same problem with sata_via; it didn't happen until I added a 500 GB drive to my system, which made me reorder most of my HDs.  Originally I didn't have this problem, and I didn't move the drive on ata4, which is the one with the problem.  I had 4 HDs: 3 WD on the sata_promise and 1 Seagate 320 GB 7200.10 on port 1 of the sata_via.  I removed the drive on the IDE port of the Promise controller and moved one of the 250 GB WDs from the Promise controller to the second port of the sata_via.  I added the 500 GB WD to the Promise.  I never moved the Seagate, which is the one with the problem.  I'm attaching the lspci and interrupts output.
Comment 15 Miguel Marte 2007-06-25 10:25:54 UTC
Created attachment 11869 [details]
interrupts

Here are the interrupts.
Comment 16 Miguel Marte 2007-06-25 10:28:19 UTC
Created attachment 11870 [details]
output of lspci

Here is the lspci.  If you want the boot log, just ask for it.
Comment 17 Tejun Heo 2007-07-02 00:03:53 UTC
Vasily, please try to use pata_via instead of via82cxxx.

Miguel, please post a full dmesg including boot messages.  Also, please try pata_via with the posted patch applied on top of 2.6.22-rc7.  Also, you can use LABELs to mount filesystems to avoid chaos when drives get moved around.  If you use a LABEL-aware initrd, you don't even have to change the root device (grub might have some problem tho).
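
For illustration, a minimal sketch of label-based mounting; the label name, device node and mount point below are made up, adjust to taste:

  e2label /dev/sdb1 data1              # tag the ext3 filesystem with a label
  # /etc/fstab then refers to the label instead of the device node:
  LABEL=data1  /mnt/data1  ext3  defaults  1 2

With entries like this the mount keeps working even if the drive shows up as a different /dev/sdX after reshuffling.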
Comment 18 Miguel Marte 2007-07-05 14:06:40 UTC
Good news: it seems that the problem is caused by interference and not the driver.  I first tried rc7 but that didn't solve anything, nor did the patch.  I thought that if this were a huge problem, more people would have noticed.  I removed all the cables and reorganized them because I noticed that the power lines were too close to the data ones.  After I did that, the problem hasn't appeared again.  I have been running 3 days straight without any errors.  Vasily should recheck his cables or possibly replace them if they're bad.
Comment 19 Tejun Heo 2007-07-05 18:19:39 UTC
Thanks.  Actually, it seems more than half of these kinds of reports are due to hardware issues.  Power-related problems are the most common; cabling is pretty common too.  Also, some controllers and drives have problems talking to each other at 3Gbps.  If you have *any* electrical issues, the SATA link will be the first one to suffer because it runs really fast (which also implies a narrower voltage swing) and is pretty long.
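
As a side note, the negotiated link speed shows up in the boot log, so it is easy to check whether a port runs at 3Gbps; the port number below is just an example:

  dmesg | grep -i 'SATA link up'
  # e.g.:  ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)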
Comment 20 Vasily Averin 2007-07-06 00:38:05 UTC
(In reply to comment #18)
> Vasily should recheck his cables or possibly replace it if its bad.

We have already replaced the cables; in fact it was our first action. Unfortunately it did not help.
Comment 21 Miguel Marte 2007-07-06 08:18:46 UTC
I didn't replace the cables either, just moved them away from the power cables which were crossing them.  Look for something that would interfere with the connection.  And as Tejun said, the SATA links are sensitive to power fluctuations, so a bad PSU can cause this too.
Comment 22 Vasily Averin 2007-07-06 23:47:51 UTC
(In reply to comment #21)
> I didn't replace the cables either, just moved them away from the power
> cables which were crossing them.  Look for something that would interfere
> with the connection.  And as Tejun said, the SATA links are sensitive to
> power fluctuations, so a bad PSU can cause this too.

OK, I understand you now, thank you for the explanation.
I personally will try that on my test node, but it is not an option in the general case, especially for people who have their server in a datacenter. :(

Tejun, is there perhaps some way to limit the speed of the SATA link?
Comment 23 Vasily Averin 2007-07-07 00:24:10 UTC
I've switched to pata_via instead of via82cxxx:
Good news:
- the SATA disconnects have gone away; the SATA controller does not report any problems now.

Bad news:
- I see timeouts on the DVD attached to the PATA controller (please see the console logs in my next attachment)

- the node works well for some time, but then IO stalls and the node hangs

I've reproduced it:
- on the 2.6.22-rc4 kernel + Tejun's patch (changes: ide was disabled and pata_via enabled, see the config sketch below). I have serial console logs and can attach them if required
- on the original 2.6.22-rc7 kernel
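
For reference, a rough sketch of the configuration change described above; exact option names depend on the rest of the .config, and this assumes either the VIA IDE driver alone or the whole legacy IDE layer is turned off:

  # CONFIG_BLK_DEV_VIA82CXXX is not set   (or: # CONFIG_IDE is not set)
  CONFIG_ATA=y
  CONFIG_SATA_VIA=y
  CONFIG_PATA_VIA=y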
Comment 24 Vasily Averin 2007-07-07 00:26:27 UTC
Created attachment 11972 [details]
2.6.22-rc7: serial console log with sysrq
Comment 25 Vasily Averin 2007-07-07 07:35:16 UTC
As far as I understand, the node hang is not related to libata; IMHO it is an ext3/jbd issue. I believe it makes sense to file a new bug for it.
Comment 26 Vasily Averin 2007-07-08 06:13:54 UTC
Created attachment 11973 [details]
2.6.22-rc7: sata timeouts issue again

I've reproduced the SATA timeouts issue again on a plain 2.6.22-rc7 kernel (with pata_via).
I've created a crash dump and hope it can help us.
Comment 27 Tejun Heo 2007-07-10 09:09:09 UTC
Thanks for testing.

1. In the lockup in comment 24, everything seems to be stuck in ext3.  It's probable that SCSI and libata are to blame.  I need some help to decode what's going on inside ext3/jbd tho.  cc'ing Jan Kara for help.

2. The PATA and SATA controllers share a PCI IRQ line.  The PATA controller also seems to be hardwired to 14 or 14/15 depending on the controller mode.  The drivers/ide drivers use IRQ auto-detection and detect 14 for the PATA part, while libata honors pdev->irq and uses 20.  The end result is the same tho.  One of the two hosts loses the ability to assert its IRQ and everything falls down.  It's definitely related to IRQ routing and is really peculiar.  Well, I wouldn't expect anything less from the vendor. :-) Does "acpi=noirq" make any difference?
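
For anyone trying this: "acpi=noirq" goes on the kernel command line, e.g. in a GRUB stanza; the kernel version, root device and console settings below are only placeholders:

  title 2.6.22-rc7 (acpi=noirq)
      root (hd0,0)
      kernel /vmlinuz-2.6.22-rc7 ro root=/dev/sda2 console=ttyS0,115200 acpi=noirq
      initrd /initrd-2.6.22-rc7.img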
Comment 28 Jan Kara 2007-07-13 12:11:09 UTC
It seems kjournald is stuck inside journal_commit_transaction() and everybody else is waiting for a transaction to commit or waiting in journal_force_transaction() (which is suspicious). It could be that we missed some IO completion and are waiting for it forever, but it could also be something different... Questions for Vasily: Is the number of threads normal, or do you spawn new threads when the current threads become unavailable? Also, could you please run gdb on the vmlinux of the kernel which produced the messages in comment 24 and send me the output of disass journal_commit_transaction (I need to find out which part of the function corresponds to the offset 0x181). Thanks.
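
A minimal sketch of how to get that disassembly, assuming a vmlinux built with debug info (the path is hypothetical); if jbd is built as a module, objdump on the object file gives equivalent output, with relocation entries:

  gdb /usr/src/linux-2.6.22-rc7/vmlinux
  (gdb) disassemble journal_commit_transaction

  # or, for a modular jbd:
  objdump -dr fs/jbd/commit.o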
Comment 29 Jan Kara 2007-07-13 12:33:00 UTC
Hmm, looking more into the traces I'd guess that commit is waiting for all updates of the running transaction to be finished. But there are 7 processes (2322, 2345, 2362, 2379, 2385, 2410, 2437) which are waiting in journal_stop (called from journal_force_commit). This is actually strange as it seems these processes are never woken up from schedule_timeout_uninterruptible(1) - they are in the same state in backtraces taken several minutes apart...
Comment 30 Vasily Averin 2007-07-14 00:30:55 UTC
(In reply to comment #28)
> Questions for Vasily: Is the number of threads normal, or do you spawn new
> threads when the current threads become unavailable? Also, could you please
> run gdb on the vmlinux of the kernel which produced the messages in comment 24
> and send me the output of disass journal_commit_transaction (I need to find
> out which part of the function corresponds to the offset 0x181). Thanks.

Jan,
I'm not sure that I understand your first question correctly. I just executed the AIM7 test on the node and waited until it hung. Then I tried to execute something in an old ssh session (without success) and then pressed the SysRq keys.
journal_commit_transaction+0x181 points to the next instruction after "call schedule":

0000000000002296 <journal_commit_transaction>:
    2296:       41 57                   push   %r15
    2298:       41 56                   push   %r14
    229a:       49 89 fe                mov    %rdi,%r14
    229d:       41 55                   push   %r13
...
    23ee:       f3 a5                   rep movsl %ds:(%rsi),%es:(%rdi)
    23f0:       48 89 de                mov    %rbx,%rsi
    23f3:       48 89 ef                mov    %rbp,%rdi
    23f6:       e8 00 00 00 00          callq  23fb <journal_commit_transaction+0x165>
                        23f7: R_X86_64_PC32     prepare_to_wait+0xfffffffffffffffc
    23fb:       41 83 7d 74 00          cmpl   $0x0,0x74(%r13)
    2400:       74 27                   je     2429 <journal_commit_transaction+0x193>
    2402:       41 c7 45 70 01 00 00    movl   $0x1,0x70(%r13)
    2409:       00
    240a:       41 c7 46 24 01 00 00    movl   $0x1,0x24(%r14)
    2411:       00
    2412:       e8 00 00 00 00          callq  2417 <journal_commit_transaction+0x181>
                        2413: R_X86_64_PC32     schedule+0xfffffffffffffffc
    2417:       49 8d 7e 24             lea    0x24(%r14),%rdi  <<<<<<<< HERE
    241b:       e8 00 00 00 00          callq  2420 <journal_commit_transaction+0x18a>
                        241c: R_X86_64_PC32     _spin_lock+0xfffffffffffffffc
    2420:       49 8d 7d 70             lea    0x70(%r13),%rdi
    2424:       e8 00 00 00 00          callq  2429 <journal_commit_transaction+0x193>
                        2425: R_X86_64_PC32     _spin_lock+0xfffffffffffffffc
    2429:       48 89 de                mov    %rbx,%rsi
    242c:       48 89 ef                mov    %rbp,%rdi
    242f:       e8 00 00 00 00          callq  2434 <journal_commit_transaction+0x19e>
                        2430: R_X86_64_PC32     finish_wait+0xfffffffffffffffc
    2434:       e9 5a ff ff ff          jmpq   2393 <journal_commit_transaction+0xfd>
    2439:       41 c7 45 70 01 00 00    movl   $0x1,0x70(%r13)

(In reply to comment #29)
> Hmm, looking more into the traces I'd guess that commit is waiting for all
> updates of the running transaction to be finished. But there are 7 processes
> (2322, 2345, 2362, 2379, 2385, 2410, 2437) which are waiting in journal_stop
> (called from journal_force_commit). This is actually strange as it seems
> these processes are never woken up from schedule_timeout_uninterruptible(1) -
> they are in the same state in backtraces taken several minutes apart...

What do you think about process 2411? It has an unusual call trace:
ext3_file_write
 generic_file_buffered_write
   ext3_prepare_write (ret = -ENOSPC, retry)
   ext3_should_retry_alloc
    journal_force_commit_nested
     log_wait_commit
       wait_event(journal->j_wait_done_commit,
                  !tid_gt(tid, journal->j_commit_sequence));

I would note that after the node rebooted I found that the root partition was full.
Comment 31 Jan Kara 2007-07-16 08:23:10 UTC
Thanks for that disassembly. It confirms that we are waiting on t_updates (i.e. waiting until all processes writing to the current transaction finish, which never happens due to some processes waiting in schedule_timeout_uninterruptible(1)).

I've also noticed process 2411 but it seems to be harmless. It just correctly waits for a transaction to be committed without holding any handle. The 7 processes above are the only ones which seem to hold a handle. Could you try reproducing the problem with the attached patch? That would confirm my suspicion that there's some strange bug in kernel timers...
Comment 32 Jan Kara 2007-07-16 08:23:47 UTC
Created attachment 12053 [details]
Debugging patch for JBD lockup
Comment 33 Vasily Averin 2007-07-16 21:55:26 UTC
BTW, could somebody please comment on the endless resets/timeouts on the PATA-attached DVD-ROM drive? IMHO it is abnormal.

Jul 16 17:45:13 ts28 
sr 2:0:0:0: ioctl_internal_command return code = 8000002
   : Sense Key : Aborted Command [current] [descriptor]
   : Add. Sense: No additional sense information
Jul 16 17:45:45 ts28 
ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
ata3.00: cmd a0/00:00:00:00:20/00:00:00:00:00/a0 tag 0 cdb 0x0 data 0
         res 40/00:03:00:00:20/00:00:00:00:00/a0 Emask 0x4 (timeout)
ata3: soft resetting port
ata3.00: configured for PIO0
ata3: EH complete
Jul 16 17:46:16 ts28 
ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
ata3.00: cmd a0/00:00:00:00:20/00:00:00:00:00/a0 tag 0 cdb 0x0 data 0
         res 40/00:03:00:00:20/00:00:00:00:00/a0 Emask 0x4 (timeout)
ata3: soft resetting port
ata3.00: configured for PIO0
Comment 34 Jan Kara 2007-07-16 21:56:55 UTC
This is an automated reply. I am away on vacation until August 13 and thus will not be able to reply to your email.
Comment 35 Vasily Averin 2007-07-17 05:35:36 UTC
I've reproduced the ext3/jbd lockup again, now on the 2.6.22 kernel.

As before, the "multitask" processes are waiting in schedule_timeout_uninterruptible().

I've prepared a crash dump and analyzed it using the crash utility (a rough invocation sketch follows the listing below).
The "ps -l" command (-l displays the tasks' last_run/timestamp values) shows that none of the "multitask" processes have been scheduled for at least several hours, since the last EH message:
Jul 12 09:00:16 ts28 Linux version 2.6.22 (vvs@vvs.work.ve) <<< node boot
Jul 17 09:01:03 ts28 ata3: EH complete <<< last EH message
Jul 17 12:49:38 ts28 SysRq : HELP : loglevel0-8 reBoot  <<< SysRq keypresses
Jul 17 13:28:01 ts28 SysRq : Trigger a crashdump  <<<< Crashdump creation


 crash> ps -l
[447400288986915]  PID: 1391   TASK: ffff810037db9100  CPU: 1   COMMAND: "klogd"
<<<< Crashdump timestamp
[447400288237424]  PID: 0      TASK: ffffffff80543a80  CPU: 0   COMMAND: "swapper"
[447400288237424]  PID: 2680   TASK: ffff8100df7cf040  CPU: 0   COMMAND: "hald-addon-keyb"
[447400288232031]  PID: 10     TASK: ffff8100df7c3890  CPU: 1   COMMAND: "events/1"
[447400285705077]  PID: 0      TASK: ffff810037d76850  CPU: 1   COMMAND: "swapper"
[445298275932286]  PID: 1388   TASK: ffff81008a305040  CPU: 1   COMMAND: "syslogd"
<<<< SysRq keys timestamp
[445020120551131]  PID: 82     TASK: ffff8100df7c8040  CPU: 0   COMMAND: "kblockd/0"
[445020104277966]  PID: 2735   TASK: ffff8100df719080  CPU: 1   COMMAND: "mingetty"
<<<< last EH message timestamp
[431415855283835]  PID: 13426  TASK: ffff8100649970c0  CPU: 0   COMMAND: "multitask"
[431415855268119]  PID: 13449  TASK: ffff810014639790  CPU: 0   COMMAND: "multitask"
[431415855258044]  PID: 13466  TASK: ffff81007dfb4100  CPU: 1   COMMAND: "multitask"
[431415855250139]  PID: 13411  TASK: ffff810034397100  CPU: 0   COMMAND: "multitask"
[431415855232188]  PID: 13422  TASK: ffff810090fac040  CPU: 0   COMMAND: "multitask"
[431415855226922]  PID: 13459  TASK: ffff810014b7a750  CPU: 1   COMMAND: "multitask"
[431415855213345]  PID: 13434  TASK: ffff8100dce76040  CPU: 0   COMMAND: "multitask"
[431415855200411]  PID: 13444  TASK: ffff8100d004f140  CPU: 1   COMMAND: "multitask"
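
For reference, the inspection above was done with the crash utility, roughly along these lines (the vmcore path is hypothetical):

  crash vmlinux /var/crash/vmcore
  crash> ps -l          # tasks sorted by last_run/timestamp
  crash> bt 13426       # backtrace of one of the stuck "multitask" tasks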
Comment 36 Vasily Averin 2007-07-17 05:43:41 UTC
Created attachment 12062 [details]
2.6.22: JBD lockup, serial console logs

I have the crash dump and am able to check any other ideas.
Comment 37 Tejun Heo 2007-07-17 20:08:34 UTC
(In reply to comment #33)
> BTW could somebody please comment endless resets/timeouts on PATA-attached
> DVDrom drive? IMHO it is abnormal.

Have you tried "acpi=noirq" yet?
Comment 38 Vasily Averin 2007-07-18 05:01:29 UTC
(In reply to comment #37)
> Have you tried "acpi=noirq" yet?

Yes, last week I tested my kernels with "acpi=noirq" and the SATA timeout issue has not reproduced since.
I've only gotten JBD lockups.
Comment 39 Vasily Averin 2007-07-27 01:22:26 UTC
(In reply to comment #29)
> Hmm, looking more into the traces I'd guess that commit is waiting for all
> updates of the running transaction to be finished. But there are 7 processes
> (2322, 2345, 2362, 2379, 2385, 2410, 2437) which are waiting in journal_stop
> (called from journal_force_commit). This is actually strange as it seems
> these processes are never woken up from schedule_timeout_uninterruptible(1) -
> they are in the same state in backtraces taken several minutes apart...

It seems I've found confirmation that it was a timer-related issue:

Jul  7 09:51:28 ts28 SysRq : Show Pending Timers
Jul  7 09:51:28 ts28 now at 122082292361305 nsecs

Jul  7 09:54:49 ts28 SysRq : Show Pending Timers
Jul  7 09:54:49 ts28 now at 122081350371493 nsecs

The timer on the node is not working;
moreover, the second timestamp is older than the first.

BTW, when I reproduced this issue the acpi_pm clocksource was in use.
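
The clocksource in use can be checked (and overridden at boot) like this on recent 2.6 kernels; treat the exact paths as an assumption for this particular setup:

  cat /sys/devices/system/clocksource/clocksource0/current_clocksource
  cat /sys/devices/system/clocksource/clocksource0/available_clocksource
  # a different one can be selected on the kernel command line, e.g.: clocksource=hpet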
Comment 40 Jan Kara 2007-07-27 01:23:48 UTC
This is an automated reply. I am away on vacation until August 13 and thus will not be able to reply to your email.
Comment 41 Jan Kara 2007-08-21 02:34:08 UTC
Vasily, do you still have the problem? If so, I could find somebody more knowledgeable about timers to have a look...
Comment 42 Vasily Averin 2007-08-21 02:54:04 UTC
I've disabled the irqbalance service on the node, and after this change the node works without any trouble.
I believe this is not a fix but only a workaround.
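
On a RHEL5-style system the workaround presumably amounts to something like:

  service irqbalance stop          # stop the running daemon
  chkconfig irqbalance off         # keep it from starting on the next boot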
Comment 43 Jan Kara 2007-08-28 04:02:40 UTC
Hmm, I've just seen a thread on LKML about problems with a dying network card caused by IRQ problems. That might be related. The thread is:
Re: 2.6.20->2.6.21 - networking dies after random time

Could you maybe try the patch from http://lkml.org/lkml/2007/8/6/11? Or it's probably going to be fixed in 2.6.23...
Comment 44 Jan Kara 2007-10-02 03:10:08 UTC
Any news here? How about the latest 2.6.23-rc?
Comment 45 Vasily Averin 2007-10-02 03:33:12 UTC
I've reproduced the issue on stock 2.6.23-rc7:
timer interrupts (i.e. irq0) were stalled, but all other interrupts worked well.

Then I applied the patch from http://lkml.org/lkml/2007/8/6/11 -- and it looks like it helps; I have not been able to reproduce the issue for several days now.
Comment 46 Vasily Averin 2007-10-02 10:31:55 UTC
The patch does not help; the test node hangs again. :(
Comment 47 Jan Kara 2007-10-04 02:50:36 UTC
Hmm, OK. I'll try to bring this to the attention of someone who understands timers and hardware better :). Thomas, do you have an idea what can cause these hangs or how to debug them? It seems timers suddenly stop working.
Comment 48 Vasily Averin 2007-11-05 22:54:44 UTC
Following Thomas Gleixner's recommendation I've applied the patch that prevents IRQ balancing for irq0:
http://lkml.org/lkml/2007/10/9/25

While testing I got the following results:
1) the irq0 issue has gone away; the timer interrupt works well now
2) unfortunately the SATA timeout issue has returned; I've reproduced it both with and without the "acpi=noirq" option on the 2.6.23-rc7 kernel
Comment 49 Vasily Averin 2007-11-05 22:57:46 UTC
Created attachment 13410 [details]
serial console log (2.6.23rc7 + thgl patch+ "acpi=noirq")
Comment 50 Tejun Heo 2007-11-12 05:10:11 UTC
Vasily, can you please try 2.6.24-rc2?
Comment 51 Vasily Averin 2007-11-21 22:52:26 UTC
2.6.24-rc2 kernel with both ide and sata enabled,
without "acpi=noirq",
and with the irqbalance service enabled.

After 10 days of testing the node still works without any trouble.
I do not understand why, but it looks like the issue is resolved now.

One strange thing is that /proc/interrupts now shows surprisingly few timer interrupts:
           CPU0       CPU1
  0:        288          1   IO-APIC-edge      timer
  1:         20          9   IO-APIC-edge      i8042
  3:          1          1   IO-APIC-edge
  4:        136        636   IO-APIC-edge
  6:          4          1   IO-APIC-edge      floppy
  7:          0          0   IO-APIC-edge      parport0
  8:          0          1   IO-APIC-edge      rtc
  9:          0          1   IO-APIC-fasteoi   acpi
 12:        126          9   IO-APIC-edge      i8042
 14:    6211791    1275547   IO-APIC-edge      ide0
 16:     384157  270595968   IO-APIC-fasteoi   sata_via
 17:          0          0   IO-APIC-fasteoi   uhci_hcd:usb1, uhci_hcd:usb2, uhci_hcd:usb3, ehci_hcd:usb4
 18:    9663580          1   IO-APIC-fasteoi   eth0
NMI:          0          0   Non-maskable interrupts
LOC:  833318712  833181920   Local timer interrupts
RES:   47631156   35226096   Rescheduling interrupts
CAL:        391        799   function call interrupts
TLB:    1982793    1829733   TLB shootdowns
TRM:          0          0   Thermal event interrupts
SPU:          0          0   Spurious interrupts
ERR:          0
MIS:          0
