Bug 9115

Summary: (sata_via) system freeze in random time
Product: IO/Storage Reporter: Dave (davemilter)
Component: Serial ATA Assignee: Jeff Garzik (jgarzik)
Status: CLOSED OBSOLETE    
Severity: normal CC: alan, htejun, sys
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.22-gentoo-r5 Subsystem:
Regression: No Bisected commit-id:
Attachments: lspci -vv
dmesg output
full dmesg log

Description Dave 2007-10-03 01:30:29 UTC
Most recent kernel where this bug did not occur:
Distribution:
Hardware Environment:
Software Environment:
Problem Description:

From time to time I get this message in the log, and after that the system freezes.
I cannot even reboot it, because the disk does not respond:

ata1.00: qc timeout (... 0x27)
ata_hpa_resize: hpa_sector(0) is smaller than sectors (488397168) 
failed to set xfermode (err_mask=0x40)
limiting speed to UDMA/133:PIO5, failed to recover, soft resetting port

After that, the timeout message appears again.

If I reboot, everything returns to normal for some time.

Steps to reproduce:
Comment 1 Dave 2007-10-03 01:31:21 UTC
Created attachment 13027 [details]
lspci -vv
Comment 2 Dave 2007-10-03 05:00:13 UTC
Created attachment 13030 [details]
dmesg output
Comment 3 Tejun Heo 2007-10-03 18:30:26 UTC
1. Which kernel version are you using?
2. Please post full kernel boot log
3. Please post dmesg including the error message (if you can't save it due to IO error, taking a photo with digital camera is fine too)
Comment 4 Dave 2007-10-04 01:18:48 UTC
>1. Which kernel version are you using?

Look at the "Kernel Version" attribute of the bug.

>2. Please post full kernel boot log

Look at comment #2; it is the dmesg output after system boot.

>3. Please post dmesg including the error message (if you can't save it due to
>IO error, taking a photo with digital camera is fine too)

There is no photo yet, and there won't be one until the next time the bug is triggered.
Comment 5 Tejun Heo 2007-10-04 01:23:24 UTC
The kernel log is truncated.  Dunno whether gentoo does it but some distros save full kernel log under /var/log/boot.msg.  If the file doesn't exist, "dmesg -s 1000000" might help.  Also, please use vanilla kernel.
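The fallback suggested in this comment (use the saved boot log if the distro keeps one, otherwise dump the kernel ring buffer with a large buffer size) could be sketched in Python roughly as follows. The function name `full_kernel_log` and the injectable `run` parameter are illustrative assumptions, not part of any tool mentioned in this report.

```python
import subprocess
from pathlib import Path


def full_kernel_log(boot_msg=Path("/var/log/boot.msg"), run=subprocess.run):
    """Return the kernel boot log as text.

    Prefers the saved boot log file (present only on some distros) and
    otherwise falls back to "dmesg -s 1000000", as suggested in comment 5.
    """
    boot_msg = Path(boot_msg)
    if boot_msg.is_file():
        # errors="replace" keeps going even if the log has odd bytes in it
        return boot_msg.read_text(errors="replace")
    # -s asks dmesg to use a larger buffer when reading the ring buffer
    result = run(["dmesg", "-s", "1000000"],
                 capture_output=True, text=True, check=True)
    return result.stdout
```

Writing the returned text to a file, for example `Path("dmesg-full.txt").write_text(full_kernel_log())`, gives something that can be attached to the bug. Reading the ring buffer may require root on some systems.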
Comment 6 Jeff Garzik 2007-11-02 09:04:21 UTC
Any re-test results with vanilla kernel?
Comment 7 Dave 2007-11-13 05:05:41 UTC
I updated to 2.6.22-gentoo-r8. gentoo-sources is actually almost vanilla with backported bugfixes. Or do you mean I should try 2.6.23?

The kernel still freezes with ATA-related messages, but
I cannot get any new info: I have no digital camera around,
and sometimes everything freezes so that I cannot even switch to console 12, where syslog prints messages. The magic SysRq keys do not help either.
Comment 8 Tejun Heo 2007-11-13 17:55:13 UTC
Please post full kernel boot log.  That should be possible, right?
Comment 9 Dave 2007-11-14 22:39:48 UTC
Created attachment 13556 [details]
full dmesg log 

>Please post full kernel boot log.  That should be possible, right?

Actually, I don't see what is missing in the dmesg log that I posted.
Here is the result of "dmesg -s 1000000".
Comment 10 Dave 2007-12-03 08:57:25 UTC
Here is the latest log. The machine froze three times today:

19:36:03 res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
19:36:03 ata1: soft resetting port
19:36:33 ata1.00: qc timeout (cmd 0x27)
19:36:33 ata1.00: ata_hpa_resize 1: hpa sectors(0) is smaller than sectors (488397168)
19:36:33 ata1.00: failed to set xfermode (err_mask=0x40)
19:36:33 ata1.00: failed to recover some devices, retrying in 5 secs
--------------
... after several retry attempts the messages are the same; only the time changes
--------------
19:37:08 ata1.00: limiting speed to UDMA/133:PIO3
19:37:43 ata1.00 disabled
19:37:44 ata1: EH complete
19:37:44 sd 0:0:0:0: [sda] Result: hostbyte=0x04 driverbyte=0x00
19:37:44 end_request: I/O error, dev sda, sector 470024647
19:37:44 Buffer I/O error on device sda11, logical 11891468
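The `Emask` and `err_mask` values in the log above are bit masks. A minimal Python sketch for decoding them is given below; it assumes the bit layout of the `AC_ERR_*` flags in the kernel's include/linux/libata.h, which is reproduced here from memory and should be checked against the kernel tree in use. The helper names are illustrative.

```python
import re

# Assumed bit meanings, following the AC_ERR_* flags in
# include/linux/libata.h (verify against your kernel source).
AC_ERR_BITS = {
    0x001: "device error",
    0x002: "HSM violation",
    0x004: "timeout",
    0x008: "media error",
    0x010: "ATA bus error",
    0x020: "host bus error",
    0x040: "system error",
    0x080: "invalid argument",
    0x100: "unknown error",
}

# Matches both "Emask 0x4" and "err_mask=0x40" as they appear in the log.
MASK_RE = re.compile(r"(?:Emask|err_mask)[ =](0x[0-9a-fA-F]+)")


def decode_err_mask(mask):
    """Return the names of all known error bits set in mask."""
    return [name for bit, name in AC_ERR_BITS.items() if mask & bit]


def explain(line):
    """Return (mask, [names]) for a log line, or None if it has no mask."""
    match = MASK_RE.search(line)
    if match is None:
        return None
    mask = int(match.group(1), 16)
    return mask, decode_err_mask(mask)
```

Under that assumption, `Emask 0x4` decodes to a timeout, which agrees with the "(timeout)" text the kernel prints on the same line, and `err_mask=0x40` decodes to the system error bit.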
Comment 11 Tejun Heo 2007-12-03 17:29:04 UTC
Dave, how often do those problems occur?  Is it the first time since your last report?
Comment 12 Dave 2007-12-03 22:42:59 UTC
>Dave, how often do those problems occur? 

Once every two or three days.

>Is it the first time since your last
>report?

No, but often I cannot even see any logs:
the machine just hangs, and I only see the activity monitor showing
high disk activity.
Comment 13 Tejun Heo 2007-12-05 22:31:06 UTC
I'm sorry but I don't really know what can be done here.  Transmission errors on SATA are much more frequent than on PATA.  SATA is the first thing that will complain if you have unstable power or some sort of interference.  This is expected and libata is equipped well to deal with such errors.

Sadly, early VIA SATA controllers behave really badly when a transmission error occurs on the wire.  As you've been seeing, it eats the machine alive (the controller hangs while holding the PCI bus) and even when that doesn't happen the port which suffered the transmission error usually goes completely bonkers.  In both cases, there isn't much the operating system can do.

Probably the best way to deal with the problem is to find out why those transmission errors are occurring in the first place and fix it, or to buy a cheap add-on SATA controller which behaves when errors occur.
Comment 14 kiev 2008-05-25 16:08:09 UTC
For me it showed up about once every half hour. As a result of this
problem I lost a MySQL database: MySQL InnoDB does not start ("Assertion error"),
and even "innodb_force_recovery = 4" did not help. The backup was a week old,
so the whole department lost several days of work. Management is simply
in shock, and I am about to be fired (((

This problem has existed for a whole year already:
-----------
I'm stumped trying to track down the below intermittent problem.....
I've confirmed this problem on 2.6.19, 2.6.20 and 2.6.21.
-----------
http://lkml.org/lkml/2007/6/14/154
http://kerneltrap.org/mailarchive/linux-kernel/2007/6/14/103765
http://kerneltrap.org/node/16175

"System hang from time to time" http://bugzilla.kernel.org/show_bug.cgi?id=8300
"sata hotplug removal of drive freezes all 2.6.21 kernels"
http://bugzilla.kernel.org/show_bug.cgi?id=8421
"(sata_via) system freeze in random time"
http://bugzilla.kernel.org/show_bug.cgi?id=9115
"kernel freezes with on clockevent warning"
http://bugzilla.kernel.org/show_bug.cgi?id=9834
"[pata_ali] Unspecified hang on Acer laptop"
http://bugzilla.kernel.org/show_bug.cgi?id=9898
"System freezes after I/O on pata_jmicron device"
http://bugzilla.kernel.org/show_bug.cgi?id=10296

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/217920
https://bugs.launchpad.net/ubuntu/+bug/164183
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/229747
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/159521
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/187146
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/221437
https://bugs.launchpad.net/ubuntu/+bug/226600
Comment 15 Tejun Heo 2008-05-25 20:01:10 UTC
Kiev, I'm sorry about your incident but the problems you linked above are many different ones, including an incomplete exception handling problem (2.6.21), IRQ misrouting, transmission errors (ICRCs) probably due to cable problems, and libata low-level driver problems.

For your problem specifically, I can't really recommend early VIA SATA chips for any mission-critical application.  The chip's behavior on error conditions just doesn't leave much room for recovery.

I don't have much experience w/ mysql but any transactional db should be able to recover from a power cut or machine crash at any point in time.  If your data is important, you'll need to use at least above-average hardware components w/ redundancy and software which can cope with problems.
Comment 16 kiev 2008-05-29 15:16:45 UTC
The problem is that this error occurs not only for me; it occurs for many, many users.

Personally, to track down this error I did the following:
1 - replaced the HDD with another one
2 - replaced the SATA connector with another one
3 - switched the system from x86 to x86_64
4 - connected the HDD to a different SATA port
The problem persisted!
5 - replaced the power supply
6 - replaced the Biostar motherboard with an Intel server board
The problem persisted!!!
7 - tried kernel boot options in different combinations: noapic acpi=off /
combined_mode=libata / irqpoll

And the problem persisted!!

Only after replacing the SATA HDD with an IDE one did the problem disappear :(

As for MySQL InnoDB being supposed to recover: certainly it should, but I was unlucky. It simply stumbled while reading the corrupted file and dumped core (Assertion failure). Even innodb_force_recovery = 4 did not help (((
Comment 17 Tejun Heo 2008-05-29 20:07:46 UTC
Yes, a lot of ATA-related problems occur to many users and there are many different causes for those problems.  The only way we can proceed is by tracking down each problem case and fixing it or working around it.

I just can't tell what's going on from what you reported.  If you want me or any other developer to dive into the problem, please post...

1. "lspci -nn" on the problematic machine.
2. Kernel boot log (/var/log/boot.msg on some distros)
3. The result of dmesg after the failure.  If you can't store the dmesg result after the problem has occurred because either the disk went down or the system froze, please set up a netconsole or serial console and capture the failing kernel log.  If a net or serial console is not an option, a digital camera can be the last resort.
4. "hdparm -I" results of your hard drives.
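For the items in this list that can be gathered on a running system ("lspci -nn", dmesg, and "hdparm -I" per drive), a small Python helper along these lines might save some typing. It is only a sketch: the function names, output file names, and the `run` parameter are made up for illustration, and it does not cover the boot log file or the netconsole/serial console setup.

```python
import subprocess
from pathlib import Path


def diagnostic_commands(drives):
    """Map output file names to the commands requested in comment 17."""
    commands = {
        "lspci.txt": ["lspci", "-nn"],
        "dmesg.txt": ["dmesg"],
    }
    for dev in drives:
        # e.g. /dev/sda -> hdparm-sda.txt
        commands["hdparm-" + Path(dev).name + ".txt"] = ["hdparm", "-I", dev]
    return commands


def collect(outdir, drives, run=subprocess.run):
    """Run each command and save stdout plus stderr under outdir."""
    outdir = Path(outdir)
    outdir.mkdir(parents=True, exist_ok=True)
    for filename, argv in diagnostic_commands(drives).items():
        try:
            result = run(argv, capture_output=True, text=True)
            text = result.stdout + result.stderr
        except FileNotFoundError as exc:
            # The tool is not installed; record that instead of aborting.
            text = "could not run " + argv[0] + ": " + str(exc) + "\n"
        (outdir / filename).write_text(text)
```

A call such as `collect("ata-report", ["/dev/sda"])` would leave three text files in the `ata-report` directory. "hdparm -I" normally needs root, so the script would have to be run accordingly.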
Comment 18 kiev 2008-05-30 01:02:30 UTC
hank you, finally someone has noticed our problems

On both configurations (different HDD, motherboard, SATA link, power supply) I had identical problems.

On the old motherboard the details are:

http://litopys.net/tmp/dmesg.txt
http://litopys.net/tmp/lshw.txt
http://litopys.net/tmp/lspci.txt
The error: http://litopys.net/tmp/syslog.txt

On the new Intel motherboard the details are:

http://litopys.net/tmp/1/bootstrap.log
http://litopys.net/tmp/1/hdparm-intel.txt
http://litopys.net/tmp/1/lspci-intel.txt
http://litopys.net/tmp/1/lshw-intel.txt
http://litopys.net/tmp/1/dmesg.txt

The error after copying a 2 GB file: http://litopys.net/tmp/1/syslog-intel.txt

Thank you!
Comment 19 kiev 2008-05-30 01:03:32 UTC
Thank You
Comment 20 kiev 2008-05-30 01:12:16 UTC
That was a typo: "hank" -> "thank". Excuse me.
Unfortunately, it is impossible to edit a report on bugzilla.kernel.org.
Comment 21 Tejun Heo 2008-05-30 03:14:39 UTC
The controller is reporting a host bus error, indicating that it's having problems transferring data to or from memory.  It seems like you have bad RAM modules there.  You used the same memory sticks on the old and new machines, right?  It also explains the data corruption and mysql failure, as bad RAM means random data corruption.

Can you please run several batches of memtest86 and if possible try different memory sticks?
Comment 22 kiev 2008-05-30 05:36:55 UTC
Yes, I used the same memory modules. However, this bug disappeared completely when I replaced the SATA hard drive with an IDE one; with the IDE hard drive the bug was absent.

Likewise, the server has never hung or shown anything that looks like a memory problem. It runs around the clock under very heavy load, with no problems when using the IDE hard drive.

As soon as I mount the SATA HDD and try to copy to it, these problems appear.
Comment 23 kiev 2008-06-03 16:27:06 UTC
In the new kernel 2.6.24-18 the problem is still present (((
Comment 24 Tejun Heo 2008-06-08 22:22:00 UTC
Eh... strange.  ICH7 is probably the most well-tested controller and this is the first time I've seen host bus errors or data corruption.  I doubt upgrading the kernel would resolve the problem.

Can you please try to rule out other possible causes by trying out a different hard drive and memory sticks?  I agree that some of the symptoms point differently but I can't think of anything else at the moment.  :-(

Thanks.
Comment 25 kiev 2008-06-21 00:52:16 UTC
kernel 2.6.24-19-generic #1 SMP Wed Jun 18 14:15:37 UTC 2008 x86_64 GNU/Linux - the problem is still present (((
HEEELPPP!!!!!!
Comment 26 Tejun Heo 2008-06-21 01:24:28 UTC
Have you tried different hardware combination?
Comment 27 kiev 2008-06-21 01:45:07 UTC
> Have you tried different hardware combination?
yes - http://bugzilla.kernel.org/show_bug.cgi?id=9115#c18
Comment 28 Tejun Heo 2008-06-21 01:54:55 UTC
I meant the suggestions from #24.  IOW, have you tried different memory configurations?  Try leaving in only a single stick, or putting the sticks in different slots, and see whether that makes any difference.
Comment 29 kiev 2008-06-21 02:23:33 UTC
Instead of swapping memory, I simply replaced the SATA hard drive with an IDE one, and the bug has not appeared for some months.
Comment 30 Tejun Heo 2008-06-21 02:50:10 UTC
Yes, I'm aware of that.  As I wrote before, I'm pretty much out of ideas so I was suggesting swapping memory sticks as the last resort.