Bug 34302
Summary: | Errors from kjournald | | |
---|---|---|---|
Product: | File System | Reporter: | Tom Moore (tmoore) |
Component: | ext3 | Assignee: | fs_ext3 (fs_ext3) |
Status: | RESOLVED OBSOLETE | | |
Severity: | normal | CC: | alan |
Priority: | P1 | | |
Hardware: | All | | |
OS: | Linux | | |
Kernel Version: | linux-2.6-2.6.32/debian | Subsystem: | |
Regression: | No | Bisected commit-id: | |
Description
Tom Moore
2011-05-03 13:30:44 UTC
You've given us a bunch of stack dumps, but not the explanation of why the kernel dumped them. There should have been a few more informative messages before the stack dumps in your syslog. I'm going to guess it was a soft lockup warning message, perhaps?

- Ted

Yes, it appears that there is additional information in the syslog that is not in the messages file (sorry about that).

May 3 01:02:25 fawkes kernel: [307200.700498] INFO: task kjournald:6416 blocked for more than 120 seconds.
May 3 01:02:25 fawkes kernel: [307200.700506] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

The machine was doing a full backup of about 1 TB of data at the time using backuppc, so that might be the reason that the journal could not keep up. It is possible that something else in the crontab also kicked off at around that time. I am a bit confused by the messages, however. The backup source is two SATA 3 (6 Gb/s) disks in a RAID-1 configuration; the backup target is one SATA 2 (3 Gb/s) disk.

The final message of concern was

May 3 01:40:09 fawkes kernel: [309464.781078] ata8.00: exception Emask 0x0 SAct 0x2 SErr 0x0 action 0x6 frozen
May 3 01:40:09 fawkes kernel: [309464.781091] ata8.00: failed command: READ FPDMA QUEUED
May 3 01:40:09 fawkes kernel: [309464.781107] ata8.00: cmd 60/08:08:47:0a:ab/00:00:45:00:00/40 tag 1 ncq 4096 in
May 3 01:40:09 fawkes kernel: [309464.781111] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
May 3 01:40:09 fawkes kernel: [309464.781118] ata8.00: status: { DRDY }
May 3 01:40:09 fawkes kernel: [309464.781131] ata8: hard resetting link
May 3 01:40:10 fawkes kernel: [309465.112103] ata8: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
May 3 01:40:10 fawkes kernel: [309465.119531] ata8.00: configured for UDMA/133
May 3 01:40:10 fawkes kernel: [309465.119543] ata8.00: device reported invalid CHS sector 0
May 3 01:40:10 fawkes kernel: [309465.119559] ata8: EH complete

I am not 100% sure which disk ata8 corresponds to.
I tried looking through the boot log to figure out the device assignments, but it was a bit confusing. It appears that ata8.00 is one of the source RAID disks. If so, I wonder why it got reset?

On 2011-05-03, at 10:27 AM, bugzilla-daemon@bugzilla.kernel.org wrote:

> The final message of concern was
>
> May 3 01:40:09 fawkes kernel: [309464.781078] ata8.00: exception Emask 0x0 SAct 0x2 SErr 0x0 action 0x6 frozen
> May 3 01:40:09 fawkes kernel: [309464.781091] ata8.00: failed command: READ FPDMA QUEUED
> May 3 01:40:09 fawkes kernel: [309464.781107] ata8.00: cmd 60/08:08:47:0a:ab/00:00:45:00:00/40 tag 1 ncq 4096 in
> May 3 01:40:09 fawkes kernel: [309464.781111] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> May 3 01:40:09 fawkes kernel: [309464.781118] ata8.00: status: { DRDY }
> May 3 01:40:09 fawkes kernel: [309464.781131] ata8: hard resetting link
> May 3 01:40:10 fawkes kernel: [309465.112103] ata8: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
> May 3 01:40:10 fawkes kernel: [309465.119531] ata8.00: configured for UDMA/133
> May 3 01:40:10 fawkes kernel: [309465.119543] ata8.00: device reported invalid CHS sector 0
> May 3 01:40:10 fawkes kernel: [309465.119559] ata8: EH complete

This looks like a problem with your disk.

> I am not 100% sure which disk ata8 corresponds to. I tried looking through the boot log to figure out the device assignments, but it was a bit confusing, It appears that ata8.00 is one of the source raid disks. If so, I wonder why it got reset?

One possibility is if the source disks share the SATA controller with the target they may both be blocked at the same time.

Cheers, Andreas

> This looks like a problem with your disk.

So this is a disk hardware problem...

> One possibility is if the source disks share the SATA controller with the target they may both be blocked at the same time.

... or is it a driver problem?
I am not sure what you mean by both being blocked at the same time, and I don't know what I should do about it. I have two SATA controllers, and I put one of the source RAID disks on each controller. I thought that this would improve reliability and perhaps performance. This means that the backup target disk shares a controller with one of the source disks. It appears that the drive that reset is on the second controller with the backup target disk (how do I confirm which controller the devices ata8.00 and ata9.00 belong to?).

I would appreciate any suggestions on how to triage this problem as being either
1) hardware that should be replaced (easy), so that this bug report can be closed; or
2) driver related, in which case it needs to be watched for in case it happens again.

TIA
Tom
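
On the question of which controller ata8.00 and ata9.00 belong to, one possible approach is to read sysfs. The sketch below is not taken from this thread and rests on two assumptions: that each libata port is exposed as /sys/class/scsi_host/hostX with a unique_id attribute holding the N of ataN, and that the resolved sysfs path of that host contains the PCI address of its controller. The function names are illustrative only, and the layout may differ on older kernels such as the 2.6.32 reported here.

```python
import os
import re

# Matches a PCI address such as 0000:00:1f.2 (domain:bus:device.function).
PCI_ADDR = re.compile(r"[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]")


def pci_address_from_path(resolved_path):
    """Return the last PCI address found in a resolved sysfs path, or None.

    The last match is used so that a controller behind a PCI bridge is
    reported by its own address rather than the bridge's.
    """
    matches = PCI_ADDR.findall(resolved_path)
    return matches[-1] if matches else None


def ata_ports_by_controller(sysfs_root="/sys"):
    """Map ATA port numbers (the N in ataN) to PCI controller addresses.

    Assumption: unique_id under each scsi_host entry holds the ataN number.
    Other SCSI hosts (for example USB storage) may also be listed here;
    entries with a unique_id of 0 or less are skipped.
    """
    class_dir = os.path.join(sysfs_root, "class", "scsi_host")
    ports = {}
    if not os.path.isdir(class_dir):
        return ports
    for host in sorted(os.listdir(class_dir)):
        host_dir = os.path.join(class_dir, host)
        try:
            with open(os.path.join(host_dir, "unique_id")) as f:
                port = int(f.read().strip())
        except (OSError, ValueError):
            continue
        if port <= 0:
            continue
        addr = pci_address_from_path(os.path.realpath(host_dir))
        if addr is not None:
            ports[port] = addr
    return ports


if __name__ == "__main__":
    for port, addr in sorted(ata_ports_by_controller().items()):
        print("ata%d -> PCI %s" % (port, addr))
```

Ports that print the same PCI address sit on the same controller; the address can then be compared against the output of lspci to identify the controller itself.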