Bug 9115
Summary: | (sata_via) system freeze in random time | | |
---|---|---|---|
Product: | IO/Storage | Reporter: | Dave (davemilter) |
Component: | Serial ATA | Assignee: | Jeff Garzik (jgarzik) |
Status: | CLOSED OBSOLETE | | |
Severity: | normal | CC: | alan, htejun, sys |
Priority: | P1 | | |
Hardware: | All | | |
OS: | Linux | | |
Kernel Version: | 2.6.22-gentoo-r5 | Subsystem: | |
Regression: | No | Bisected commit-id: | |
Attachments: | lspci -vv; dmesg output; full dmesg log | | |
Description
Dave
2007-10-03 01:30:29 UTC
Created attachment 13027
lspci -vv
Created attachment 13030
dmesg output
1. Which kernel version are you using?
2. Please post full kernel boot log
3. Please post dmesg including the error message (if you can't save it due to IO error, taking a photo with digital camera is fine too)

>1. Which kernel version are you using?
Look at the "Version" attribute of the bug.
>2. Please post full kernel boot log
Look at comment #2; it is the dmesg output after system boot.
>3. Please post dmesg including the error message (if you can't save it due to
>IO error, taking a photo with digital camera is fine too)
Until now there is no photo; I will have to wait until the next time the bug is triggered.

The kernel log is truncated. Dunno whether gentoo does it but some distros save full kernel log under /var/log/boot.msg. If the file doesn't exist, "dmesg -s 1000000" might help. Also, please use vanilla kernel.

Any re-test results with vanilla kernel?

I updated to 2.6.22-gentoo-r8. Actually gentoo-sources is almost vanilla with backported bugfixes, or do you mean I should try 2.6.23? The kernel still freezes with ata-related messages, but I cannot get any new info: I have no digital camera around, and sometimes everything freezes and I cannot even switch to console 12, where syslog prints messages. Magic keys do not help either.

Please post full kernel boot log. That should be possible, right?

Created attachment 13556
full dmesg log
>Please post full kernel boot log. That should be possible, right?
Actually I don't see what is missing in the dmesg log that I posted; here is the result of "dmesg -s 1000000".

Here is the latest log; the machine froze three times today:
19:36:03 res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
19:36:03 ata1: soft resetting port
19:36:33 ata1.00: qc timeout (cmd 0x27)
19:36:33 ata1.00: ata_hpa_resize 1: hpa sectors(0) is smaller than sectors (488397168)
19:36:33 ata1.00: failed to set xfermode (err_mask=0x40)
19:36:33 ata1.00: failed to recover some devices retrying 5 secs
--------------
...
after several retry attempts the messages are the same; only the time changes
--------------
19:37:08 ata1.00: limiting speed to UDMA/133:PIO3
19:37:43 ata1.00 disabled
19:37:44 ata1: EH complete
19:37:44 sd 0:0:0:0: [sda] Result: hostbyte=0x04 driverbyte=0x00
19:37:44 end_request: I/O error; dev sda, sector 470024647
19:37:44 Buffer I/O error on device sda11, logical 11891468

Dave, how often do those problems occur? Is it the first time since your last report?

>Dave, how often do those problems occur?
Once every two or three days.
>Is it the first time since your last
>report?
No, but often I cannot even see any logs; the machine just hangs, and I only see that the activity monitor shows high disk activity.

I'm sorry but I don't really know what can be done here. Transmission errors on SATA are much more frequent than on PATA. SATA is the first thing that will complain if you have unstable power or some sort of interference. This is expected and libata is well equipped to deal with such errors. Sadly, early VIA SATA controllers behave really badly when a transmission error occurs on the wire. As you've been seeing, it eats the machine alive (the controller hangs while holding the PCI bus) and even when that doesn't happen the port which suffered the transmission error usually goes completely bonkers. In both cases, there isn't much the operating system can do. Probably the best way to deal with the problem is to find out why those transmission errors are occurring in the first place and fix that, or to buy a cheap add-on SATA controller which behaves when an error occurs.
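When a freeze leaves behind a long syslog, it can help to reduce the libata messages to a short tally before posting them. The following is only a minimal sketch in Python: the sample lines are adapted from the excerpt quoted in this report, while the category names, the `summarize` function, and the regular expressions are hypothetical and cover only the messages seen here, not libata's full message set.

```python
import re
from collections import Counter

# Lines adapted from the log excerpt quoted in this report.
SAMPLE = [
    "19:36:03 res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)",
    "19:36:03 ata1: soft resetting port",
    "19:36:33 ata1.00: qc timeout (cmd 0x27)",
    "19:36:33 ata1.00: failed to set xfermode (err_mask=0x40)",
    "19:37:43 ata1.00 disabled",
    "19:37:44 ata1: EH complete",
    "19:37:44 end_request: I/O error; dev sda, sector 470024647",
    "19:37:44 Buffer I/O error on device sda11, logical 11891468",
]

# Hypothetical categories, chosen only from the messages in the excerpt.
PATTERNS = {
    "timeout": re.compile(r"timeout"),
    "reset": re.compile(r"resetting port"),
    "xfermode_failure": re.compile(r"failed to set xfermode"),
    "disabled": re.compile(r"disabled"),
    "io_error": re.compile(r"I/O error"),
}

# Matches "ata1", "ata1:" and "ata1.00:"; captures the port number.
PORT_RE = re.compile(r"\bata(\d+)(?:\.\d+)?\b")


def summarize(lines):
    """Count message categories and the number of lines naming each ATA port."""
    events = Counter()
    ports = Counter()
    for line in lines:
        match = PORT_RE.search(line)
        if match:
            ports["ata" + match.group(1)] += 1
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                events[name] += 1
    return events, ports


events, ports = summarize(SAMPLE)
print(dict(events), dict(ports))
```

A tally like this does not replace the full log the developers ask for; it only makes it easier to see which port the errors cluster on.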
For me it showed up once every half hour; however, as a result of this problem I lost a mysql database - mysql innodb does not start ("Assertion error") and even "innodb_force_recovery = 4" did not help. The backup was a week old, so the whole department lost a few days of work, the management is simply in shock, and I am likely to lose my job (((

This problem has existed for a whole year already:
-----------
I'm stumped trying to track down the below intermittent problem.....
I've confirmed this problem on 2.6.19, 2.6.20 and 2.6.21.
-----------
http://lkml.org/lkml/2007/6/14/154
http://kerneltrap.org/mailarchive/linux-kernel/2007/6/14/103765
http://kerneltrap.org/node/16175
"System hang from time to time" http://bugzilla.kernel.org/show_bug.cgi?id=8300
"sata hotplug removal of drive freezes all 2.6.21 kernels" http://bugzilla.kernel.org/show_bug.cgi?id=8421
"(sata_via) system freeze in random time" http://bugzilla.kernel.org/show_bug.cgi?id=9115
"kernel freezes with on clockevent warning" http://bugzilla.kernel.org/show_bug.cgi?id=9834
"[pata_ali] Unspecified hang on Acer laptop" http://bugzilla.kernel.org/show_bug.cgi?id=9898
"System freezes after I/O on pata_jmicron device" http://bugzilla.kernel.org/show_bug.cgi?id=10296
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/217920
https://bugs.launchpad.net/ubuntu/+bug/164183
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/229747
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/159521
https://bugs.launchpad.net/ubuntu/+bug/164183
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/187146
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/221437
https://bugs.launchpad.net/ubuntu/+bug/226600

Kiev, I'm sorry about your incident but the problems you linked above are a lot of different ones, including an incomplete exception handling problem (2.6.21), IRQ misrouting, transmission errors (ICRCs) probably due to a cable problem, and libata low level driver problems.
For your problem specifically, I can't really recommend early VIA SATA chips for any mission critical application. The chip's behavior on error conditions just doesn't leave much room for recovery. I don't have much experience w/ mysql but any transactional db should be able to recover from a power cut or machine crash at any point in time. If your data is important, you'll need to use at least above average hardware components w/ redundancy and software which can cope with problems.

The problem is that this error occurs not only for me; it occurs for many, many users. Personally, to track down this error I did the following:
1 - replaced the hdd drive with another one
2 - replaced the SATA connector with another one
3 - switched the system from x86 to x86_64
4 - connected the hdd drive through another SATA port
the problem persisted!
5 - replaced the power supply
6 - replaced the Biostar system board with an Intel server board
the problem persisted!!!
7 - tried kernel boot options in different combinations: noapic acpi=off / combined_mode=libata / irqpoll
and the problem persisted!!
Only after replacing the SATA hdd drive with an IDE one did the problem disappear :(
As for mysql innodb being supposed to recover: certainly it should, but I was unlucky... it simply stopped while reading the corrupted file and dumped core (Assertion failure). I was simply unlucky; even innodb_force_recovery = 4 did not help (((

Yes, a lot of ATA related problems occur to many users and there are many different causes for those problems. The only way we can proceed is by tracking down each problem case and fixing it or working around it. I just can't tell what's going on from what you reported. If you want me or any other developer to dive into the problem, please post...
1. "lspci -nn" on the problematic machine.
2. Kernel boot log (/var/log/boot.msg on some distros)
3. The result of dmesg after the failure.
If you can't store the dmesg result after the problem has occurred because either the disk went down or the system froze, please set up a netconsole or serial console and capture the failing kernel log. If a net or serial console is not an option, digital cameras can be the last resort.
4. "hdparm -I" results of your harddrives.

hank you, finally someone has noticed our problems.
On both configurations (different hdd drive, motherboard, sata link, power supply) I had quite identical problems.
The parameters of the old motherboard are:
http://litopys.net/tmp/dmesg.txt
http://litopys.net/tmp/lshw.txt
http://litopys.net/tmp/lspci.txt
The error is:
http://litopys.net/tmp/syslog.txt
The parameters of the new Intel motherboard are:
http://litopys.net/tmp/1/bootstrap.log
http://litopys.net/tmp/1/hdparm-intel.txt
http://litopys.net/tmp/1/lspci-intel.txt
http://litopys.net/tmp/1/lshw-intel.txt
http://litopys.net/tmp/1/dmesg.txt
The error after copying a 2Gb file is:
http://litopys.net/tmp/1/syslog-intel.txt
Thank you!

"Thank you" above is a misprint: "hank" -> "thank". Excuse me; unfortunately it is impossible to edit a report on bugzilla.kernel.org.

The controller is reporting a host bus error, indicating that it's having a problem transferring data to or from the memory. It seems like you have bad ram modules there. You used the same memory sticks on the old and new machines, right? It also explains the data corruption and mysql failure, as bad ram means random data corruption. Can you please run several batches of memtest86 and if possible try different memory sticks?

Yes, I used the same memory modules; however, this bug fully disappeared when I replaced the SATA hard drive with an IDE one - with the IDE hard drive this bug was absent.
Likewise, the server has never hung and has shown nothing that looks like a memory problem - and the server works around the clock under very heavy load, with no problems when using the IDE hard drive. As soon as I mount the SATA hdd drive and try to copy to it, these problems appear.

With the new kernel 2.6.24-18 the problem is still present (((

Eh... strange. ICH7 is probably the most well tested controller and this is the first time I see host bus errors or data corruption. I doubt upgrading the kernel would resolve the problem. Can you please try to rule out other possible causes by trying out a different harddrive and memory sticks? I agree that some of the symptoms point differently but I can't think of anything else at the moment. :-( Thanks.

kernel 2.6.24-19-generic #1 SMP Wed Jun 18 14:15:37 UTC 2008 x86_64 GNU/Linux - the problem is still present ((( HEEELPPP!!!!!!

Have you tried a different hardware combination?

> Have you tried different hardware combination?
yes - http://bugzilla.kernel.org/show_bug.cgi?id=9115#c18

I meant the suggestions from #24. IOW, have you tried different memory configurations? Leave in only a single stick, or put the sticks in different slots, and see whether that makes any difference.

Instead of the memory I simply replaced the SATA hard drive with an IDE one, and this bug has not shown up for some months.

Yes, I'm aware of that. As I wrote before, I'm pretty much out of ideas so I was suggesting swapping memory sticks as the last resort.
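The diagnostics requested earlier in this thread ("lspci -nn", the dmesg output, and "hdparm -I" for each drive) can be gathered in one pass. The Python sketch below is one possible way to do that and is not an official tool: the function names, the output file names, and the `ata-report` directory are hypothetical, and it assumes the three commands are installed and that the script runs with enough privileges for hdparm.

```python
import subprocess
from pathlib import Path


def diagnostic_commands(drives):
    """Map an output file name to the command whose output it should hold."""
    commands = {
        "lspci-nn.txt": ["lspci", "-nn"],
        # Large buffer size, as suggested in this thread for truncated logs.
        "dmesg.txt": ["dmesg", "-s", "1000000"],
    }
    for device in drives:
        # e.g. /dev/sda -> hdparm-sda.txt
        commands["hdparm-" + Path(device).name + ".txt"] = ["hdparm", "-I", device]
    return commands


def collect(drives, outdir="ata-report"):
    """Run each command and save its output; return exit codes per file."""
    out = Path(outdir)
    out.mkdir(parents=True, exist_ok=True)
    results = {}
    for filename, command in diagnostic_commands(drives).items():
        try:
            proc = subprocess.run(
                command, capture_output=True, text=True, timeout=60
            )
            text = proc.stdout + proc.stderr
            results[filename] = proc.returncode
        except (OSError, subprocess.TimeoutExpired) as exc:
            # Missing tool, permission problem, or a hung command:
            # record the failure instead of aborting the whole run.
            text = "could not run " + " ".join(command) + ": " + str(exc) + "\n"
            results[filename] = None
        (out / filename).write_text(text)
    return results
```

Calling `collect(["/dev/sda"])` would write one text file per command into the output directory. This only helps while the machine is still responsive; for a hard freeze, the netconsole or serial console suggested above remains the way to capture the failing log.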