Bug 14817 - When is system under load, then freeze/HD fail
Summary: When is system under load, then freeze/HD fail
Alias: None
Product: Other
Classification: Unclassified
Component: Other
Hardware: All Linux
Importance: P1 blocking
Assignee: other_other
Depends on:
Blocks: 14230
Reported: 2009-12-15 11:12 UTC by David Heidelberg (okias)
Modified: 2010-10-07 19:46 UTC
CC List: 11 users

See Also:
Kernel Version: 2.6.33-rc2
Tree: Mainline
Regression: Yes

dmesg (52.20 KB, text/plain)
2009-12-15 11:12 UTC, David Heidelberg (okias)
photo_of_screen.jpg (499.08 KB, image/jpeg)
2009-12-19 20:01 UTC, David Heidelberg (okias)
photo_of_screen2.jpg (678.10 KB, image/jpeg)
2009-12-20 10:39 UTC, David Heidelberg (okias)
picture_random_driver_damage.jpg (690.31 KB, image/jpeg)
2009-12-21 12:45 UTC, David Heidelberg (okias)
first_part (526.71 KB, image/jpeg)
2010-01-05 15:28 UTC, David Heidelberg (okias)
second_part.jpg (495.99 KB, image/jpeg)
2010-01-05 15:30 UTC, David Heidelberg (okias)

Description David Heidelberg (okias) 2009-12-15 11:12:32 UTC
Created attachment 24194 [details]

It works fine with 2.6.31. Now, when I use several Qt/KDE applications (Qt 4.6, KDE 4.3.80), I get a slowdown of about 10 seconds (or more), including the cursor, and if I keep working, the system hangs or disconnects my HDD. Please look at the dmesg, especially the part with hpet; after that, WiFi stopped working (WiFi had always worked well before).

Comment 1 Tejun Heo 2009-12-17 00:08:51 UTC
ATA part looks very suspicious.  It's an optiarc drive attached to ahci.

 ata5: illegal qc_active transition (00000001->00000003)
 ata5.00: exception Emask 0x2 SAct 0x0 SErr 0x0 action 0x6 frozen
 sr 4:0:0:0: [sr0] CDB: cdb[0]=0x46: 46 01 00 00 00 00 00 01 64 00
 ata5.00: cmd a0/00:00:00:64:01/00:00:00:00:00/a0 tag 0 pio 16740 in
          res 50/00:03:00:08:00/00:00:00:00:00/a0 Emask 0x2 (HSM violation)
 ata5.00: status: { DRDY }
 ata5: hard resetting link
 ata5: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
 ata5.00: configured for UDMA/100
 ata5: EH complete

libata never issues more than one non-NCQ command concurrently, and the log accordingly shows only one in-flight command.  The above can only happen if somehow bit 1 is magically set in the PORT_CMD_ISSUE register of the controller.  Can you please turn on CONFIG_PRINTK_TIME and attach the resulting log?  Also, is the drive accessible after the above failure?
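
The transition reported in the log, 00000001->00000003, differs only in bit 1, which is why a second active tag looks so suspicious here. As a small illustration (a sketch only, not libata code; the helper name changed_tags is made up for this example), the bits that differ between two such masks can be listed like this:

```shell
# List the bit positions that differ between two hex masks
# (given without the 0x prefix, as they appear in the log).
changed_tags() {
    old=$(( 0x$1 ))
    new=$(( 0x$2 ))
    diff=$(( old ^ new ))   # bits set in exactly one of the two masks
    bit=0
    out=""
    while [ "$diff" -ne 0 ]; do
        if [ $(( diff & 1 )) -eq 1 ]; then
            out="$out$bit "
        fi
        diff=$(( diff >> 1 ))
        bit=$(( bit + 1 ))
    done
    printf '%s\n' "${out% }"
}

changed_tags 00000001 00000003
```

For the masks from the log, the only position reported is bit 1.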
Comment 2 David Heidelberg (okias) 2009-12-17 19:50:01 UTC
Now I hit this issue again. It looked like the fs was read-only for a few seconds after the problems started (because I tried to write dmesg), and then it completely disappeared (ls, dmesg, cd did not work). Again, just before the issue, I got a message that I was disconnected from WiFi.

It happened (again) after running konqueror, so it might be somehow connected to WiFi (because once it happened when I ran Kopete).

But maybe it's because, as I noticed, my DVD-RW has some problem with burning CDs: when I tried to burn one, my whole laptop started to vibrate and the CD rotation wasn't right. But this problem did not appear every time I tried to burn. Maybe the DVD-RW is magically damaged.

I'm afraid I won't be able to get more logs, because it's very random and it doesn't allow me to save logs... I'll add the printk time feature and hope I can copy the data from the log onto paper...
Comment 3 Tejun Heo 2009-12-18 07:30:08 UTC
If you have a second machine, the best way would be setting up netconsole or serial console.
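
For reference, netconsole takes its settings as one module parameter of the form src-port@src-ip/dev,tgt-port@tgt-ip/tgt-mac. Below is a minimal sketch that assembles that string. The helper name netconsole_param and every address, port, and interface name are placeholders for illustration, not values from this report:

```shell
# Build a netconsole module parameter from its six parts:
#   $1 source port    $2 source IP    $3 network device
#   $4 target port    $5 target IP    $6 target MAC address
netconsole_param() {
    printf 'netconsole=%s@%s/%s,%s@%s/%s\n' "$1" "$2" "$3" "$4" "$5" "$6"
}

# Placeholder values only; substitute the real addresses of both machines.
netconsole_param 6665 192.168.1.10 eth0 6666 192.168.1.20 00:11:22:33:44:55
```

The resulting string would be passed to "modprobe netconsole" on the failing machine, with the second machine listening for UDP packets on the target port.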
Comment 4 David Heidelberg (okias) 2009-12-19 20:01:04 UTC
Created attachment 24232 [details]

The only thing I can use is ssh over WiFi (HTC Dream). I only have a picture of my screen with dmesg (not complete ;-( ). Next time I'll try to get more info.
Comment 5 David Heidelberg (okias) 2009-12-19 20:43:36 UTC
Now I came across this comment - http://lkml.org/lkml/2009/12/19/82 and I think it's possible that I have the same problem with 2.6.32 with the ahci driver. Now testing with pci=nomsi; I hope it helps.
Comment 6 David Heidelberg (okias) 2009-12-20 10:39:44 UTC
Created attachment 24236 [details]

I'm not sure if it is the same issue, but the behavior is similar:
I tried running 40 instances of konqueror and got this. Of course dmesg wasn't saved. This configuration is with pci=nomsi.
Comment 7 Tejun Heo 2009-12-21 04:40:15 UTC
Unfortunately, from the log, I can't tell what went wrong.  The ATA part which ended up on the last screen doesn't say anything about why and what went wrong.  Can you plug in a USB stick and save dmesg output there?  i.e. keep "while true; do dmesg -c >> /mnt/USB_STICK/dmesg.out; sleep 1; done" running.  Your log will most likely be on the USB stick after the disk checks out.
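
A possible variant of the loop above flushes each chunk to the stick right after writing it, so less is lost if the machine dies shortly afterwards. This is only a sketch; the helper name save_chunk is made up for illustration, and it assumes the stick is already mounted:

```shell
# Append whatever arrives on stdin to the file named by $1,
# then ask the kernel to flush dirty data to the device.
save_chunk() {
    cat >> "$1" && sync
}
```

It would be used as "while true; do dmesg -c | save_chunk /mnt/USB_STICK/dmesg.out; sleep 1; done", run as root. If the USB stack itself fails, as it may on a badly broken system, this will not help either.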
Comment 8 David Heidelberg (okias) 2009-12-21 12:45:33 UTC
Created attachment 24244 [details]

So, I tried (even with the sync option), and it looks like there is no connection to the SATA subsystem, because it reports a problem with USB. It looks like something inside the kernel starts damaging structures that are in use, etc. Maybe it's the WiFi driver, but it fails after that problem, so it probably has something damaged too.

pci=nomsi of course doesn't affect this issue.
Comment 9 David Heidelberg (okias) 2009-12-21 12:47:52 UTC
And I forgot - I'm mostly able to trigger this state by running a lot of konquerors, but when compiling glibc/KDE 4.4 or anything else, it's stable. Maybe it's because konqueror runs a lot of processes and the kernel can't handle it (my guess).
Comment 10 Tejun Heo 2009-12-23 09:10:11 UTC
Without kernel log, it's really difficult to tell what's going on.  Any chance you can extract log off that machine (serial/netconsole, usb stick or separate disk, ssh session which pipes kernel messages off the machine, whatever...)?
Comment 11 David Heidelberg (okias) 2009-12-23 14:26:02 UTC
serial/netconsole is impossible, the USB stick failed (the same way my HDD was disconnected, with a lot of errors), a separate disk would probably end the same way, and ssh is probably the right solution - I'll try it when I get near some PC.
Comment 12 David Heidelberg (okias) 2009-12-25 16:52:38 UTC
So, 2.6.33-rc2 solved my problem (I tried running xx konquerors + kopete and it worked).

The problem probably remains on 2.6.32.
Comment 13 Tejun Heo 2009-12-26 01:27:42 UTC
If 2.6.33-rc2 fixes your problem I don't think libata would have much to do with it.  AFAICT, there's no libata change which could cause that type of behavior difference.  It looks like something more fundamental is broken.
Comment 14 David Heidelberg (okias) 2009-12-27 20:32:06 UTC
So I'm afraid I hit this problem (exactly this problem) again with 2.6.33-rc2 :-( Next time I'll try to take a screenshot more quickly.
Comment 15 Rafael J. Wysocki 2009-12-30 21:01:06 UTC
On Wednesday 30 December 2009, okias wrote:
> Yes, it's still valid - even on 2.6.33-rc2
> 2009/12/29 Rafael J. Wysocki <rjw@sisk.pl>:
> > This message has been generated automatically as a part of a report
> > of regressions introduced between 2.6.31 and 2.6.32.
> >
> > The following bug entry is on the current list of known regressions
> > introduced between 2.6.31 and 2.6.32.  Please verify if it still should
> > be listed and let me know (either way).
> >
> >
> > Bug-Entry       : http://bugzilla.kernel.org/show_bug.cgi?id=14817
> > Subject         : When is system under load, then freeze/HD fail
> > Submitter       : okias <d.okias@gmail.com>
> > Date            : 2009-12-15 11:12 (15 days old)
Comment 16 Tejun Heo 2010-01-01 01:27:11 UTC
I don't think this problem is specific to or caused by ata.  Something at a lower level seems broken.  The only possibly related change would be the enabling of FPDMA AA but I don't think it's related.  Just in case, does the kernel parameter "libata.force=noncq" make any difference?

I think it would be best to try bisection and find out which commit between 31 and 32 caused the problem, but the problem can't be reproduced reliably, right?
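
When testing a boot parameter such as "libata.force=noncq", it helps to confirm that the running kernel was actually booted with it before drawing conclusions. A minimal sketch follows; the helper name has_param is made up for illustration:

```shell
# Succeed if the exact parameter $1 appears as a whole word
# in the kernel command line string $2.
has_param() {
    case " $2 " in
        *" $1 "*) return 0 ;;
        *)        return 1 ;;
    esac
}
```

On the test machine this could be called as: has_param libata.force=noncq "$(cat /proc/cmdline)" && echo "NCQ disabled for this boot"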
Comment 17 David Heidelberg (okias) 2010-01-01 14:00:29 UTC
It's not related to ATA at all. Even when I ran "while true; do dmesg -c >> /mnt/USB_STICK/dmesg.out; sleep 1; done", it wrote errors related to the USB stack (and the log wasn't written to the USB stick), so something much more low-level is broken. Another problem is that it is really hard to reproduce on 2.6.33-rc2 (I haven't hit the problem for a few days).
Comment 18 David Heidelberg (okias) 2010-01-05 15:28:32 UTC
Created attachment 24446 [details]

Please look at it - 2.6.33-rc2-git
Comment 19 David Heidelberg (okias) 2010-01-05 15:30:31 UTC
Created attachment 24447 [details]
Comment 20 Tejun Heo 2010-01-11 02:20:25 UTC
cc'ing HPET people and Andrew.  The system is failing in weird ways after triggering a WARN in HPET path.  Any ideas?
Comment 21 Andrew Morton 2010-01-11 02:30:24 UTC
cc x86 people, johnstul.

That hpet warning is quite common and the frequency seems to have increased around 2.6.32.  I don't think I've seen an explanation of what would cause it to come out, nor why it might have become more common.  Having it first come out after 9000 seconds uptime seems a bit alarming.

I doubt if an HPET glitch would cause ata to fall over though.  Perhaps something more upstream has failed (interrupts?, bus config?) and that then caused both hpet and ata to fail.
Comment 22 Florian Mickler 2010-10-07 19:40:09 UTC
Can you still reproduce this with current mainline kernels?
Comment 23 David Heidelberg (okias) 2010-10-07 19:46:38 UTC
It seems to be ok.
