Bug 10296
Summary: | (test patch)pata_jmicron: DRDY data drain needed: System freezes after I/O on pata_jmicron device | ||
---|---|---|---|
Product: | IO/Storage | Reporter: | jniklast |
Component: | Serial ATA | Assignee: | Alan (alan) |
Status: | CLOSED OBSOLETE | ||
Severity: | normal | CC: | bunk, chris, lkmlist |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 2.6.24.3 | Subsystem: | |
Regression: | Yes | Bisected commit-id: | |
Attachments: |
dmesg of the (bug-free) 2.6.23 kernel
dmesg of the 2.6.24.3 kernel
lspci -vvxxx output
errorlog of crashing pata_jmicron module
2.6.27-gentoo-r4 x86_64 dmesg
Messages log, showing the problem
Drain patch for test
messages with the patch; problem still occurring |
Description
jniklast
2008-03-21 02:58:56 UTC
Please provide a dmesg of both kernels and an lspci -vvxxx.

PATA so taking ownership

Dunno whether I should just paste them here because they obviously are quite long, so I used nopaste:
2.6.23 (working) kernel dmesg: http://www.nopaste.org/p/aWoFAFsSt
2.6.24.3 (not working) kernel dmesg: http://www.nopaste.org/p/aUeuofcbN
lspci -vvxxx: http://www.nopaste.org/p/aqf99DPeG

dmesg and /var/log/messages show nothing when it freezes, btw.

Also, so far the bug has only occurred on a FAT32 partition. I tried to reproduce it with an ext3 filesystem on the same hard disk, but instead of the whole computer freezing, there were lots of read errors, the pata_jmicron module simply crashed, and all filesystems were remounted read-only. This is what /var/log/messages had about that incident: http://www.nopaste.org/p/aYhzHRsDB

I hope that information helps. I don't know how to get debug output from the pata_jmicron module, as modinfo shows no parms. So if there is a way I could provide more useful data, please tell me. And thanks in advance!

(In reply to comment #3)
> Dunno whether I should just paste them here because they obviously are quite
> long, so I used nopaste:
>...
"Create a New Attachment" is the correct way to attach bigger files to a bug.

Created attachment 15370 [details]
dmesg of the (bug-free) 2.6.23 kernel
Created attachment 15371 [details]
dmesg of the 2.6.24.3 kernel
Created attachment 15372 [details]
lspci -vvxxx output
Created attachment 15373 [details]
errorlog of crashing pata_jmicron module
[ 88.540267] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 169.12 Thu Feb 14 17:51:09 PST 2008

Sorry - this binary module means we can't do the debugging, as only Nvidia have all the source code.

From a scan of the bug report, your drive goes away and stops responding. There are no obvious changes between 2.6.23 and 2.6.24 that would cause that to have changed, so I am curious:
- Does the bug occur without the Nvidia binary modules being loaded on that boot?
- If you go back to 2.6.23, does the problem eventually happen on that?

The bug definitely doesn't occur in 2.6.23. And it also happens without the nvidia binary module loaded. But it seems that without X it doesn't always occur, or at least not as quickly as with X. After a bit of I/O from the disk in question, though, it did happen without the nvidia module. This time it seemed that only one of my two cores was affected: the I/O wait load was only at 50% most of the time and only occasionally rose to 100%. The effects were the same, though: no disk I/O was possible anymore, and any process that tried it stalled completely. Processes that didn't need any disk I/O ran well (though you couldn't create another shell), like top, which continuously updated itself. If you run X, though, it stalls after a few seconds, probably because it does some disk I/O.

Thank you for looking at this bug, although I couldn't find anyone with the same problem yet.

Does irqpoll help?

No, I use it already. I remember that a few kernel versions ago it didn't work at all without it, but I haven't tried without it since.

I have the same hardware, but can't reproduce this bug. Using 2.6.24.2, though I have the driver compiled into the kernel. Just moved 1.5 GB from and to the drive attached to the first port of the JMicron controller.

kernel (hd0,0)/vmlinuz root=/dev/sda3 video=nvidiafb:1280x1024-32@85,mtrr,ywrap

*-storage
  description: SATA controller
  product: JMicron 20360/20363 AHCI Controller
  vendor: JMicron Technologies, Inc.
  physical id: 0
  bus info: pci@0000:02:00.0
  version: 02
  width: 32 bits
  clock: 33MHz
  capabilities: storage pm pciexpress ahci_1.0 bus_master cap_list
  configuration: driver=ahci latency=0
*-ide
  description: IDE interface
  product: JMicron 20360/20363 AHCI Controller
  vendor: JMicron Technologies, Inc.
  physical id: 0.1
  bus info: pci@0000:02:00.1
  version: 02
  width: 32 bits
  clock: 33MHz
  capabilities: ide pm bus_master cap_list
  configuration: driver=pata_jmicron latency=0

[ 6011.757221] scsi 1:0:0:0: Direct-Access ATA External Disk 0 RGL1 PQ: 0 ANSI: 5
[ 6011.757322] sd 1:0:0:0: [sdb] 312581808 512-byte hardware sectors (160042 MB)
[ 6011.757337] sd 1:0:0:0: [sdb] Write Protect is off
[ 6011.757345] sd 1:0:0:0: [sdb] Mode Sense: 00 3a 00 00
[ 6011.757371] sd 1:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[ 6011.757434] sd 1:0:0:0: [sdb] 312581808 512-byte hardware sectors (160042 MB)
[ 6011.757445] sd 1:0:0:0: [sdb] Write Protect is off
[ 6011.757447] sd 1:0:0:0: [sdb] Mode Sense: 00 3a 00 00
[ 6011.757486] sd 1:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[ 6011.757489] sdb: sdb1
[ 6011.757720] sd 1:0:0:0: [sdb] Attached SCSI disk
[ 6011.757760] sd 1:0:0:0: Attached scsi generic sg1 type 0
[ 6016.929322] kjournald starting. Commit interval 5 seconds
[ 6016.929333] EXT3-fs warning: maximal mount count reached, running e2fsck is recommended
[ 6016.948524] EXT3 FS on sdb1, internal journal
[ 6016.948535] EXT3-fs: mounted filesystem with ordered data mode.
Nevertheless, there is something wrong with the jmicron: http://bugzilla.kernel.org/show_bug.cgi?id=9010

regards
Bjoern

For me it showed up about once every half hour. However, as a result of this problem I lost a MySQL database: InnoDB would not start ("Assertion error"), and even "innodb_force_recovery = 4" did not help. The backup was a week old, so a whole department lost several days of work. Management is simply in shock, and I am about to lose my job (((

This problem has been around for a whole year already:
-----------
I'm stumped trying to track down the below intermittent problem.....
I've confirmed this problem on 2.6.19, 2.6.20 and 2.6.21.
-----------
http://lkml.org/lkml/2007/6/14/154
http://kerneltrap.org/mailarchive/linux-kernel/2007/6/14/103765
http://kerneltrap.org/node/16175

"System hang from time to time"
http://bugzilla.kernel.org/show_bug.cgi?id=8300

"sata hotplug removal of drive freezes all 2.6.21 kernels"
http://bugzilla.kernel.org/show_bug.cgi?id=8421

"(sata_via) system freeze in random time"
http://bugzilla.kernel.org/show_bug.cgi?id=9115

"kernel freezes with on clockevent warning"
http://bugzilla.kernel.org/show_bug.cgi?id=9834

"[pata_ali] Unspecified hang on Acer laptop"
http://bugzilla.kernel.org/show_bug.cgi?id=9898

"System freezes after I/O on pata_jmicron device"
http://bugzilla.kernel.org/show_bug.cgi?id=10296

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/217920
https://bugs.launchpad.net/ubuntu/+bug/164183
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/229747
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/159521
https://bugs.launchpad.net/ubuntu/+bug/164183
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/187146
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/221437
https://bugs.launchpad.net/ubuntu/+bug/226600

I have observed the same bug....
Here is the message I see when I have the controller enabled:

Jul 2 12:30:21 [kernel] ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
Jul 2 12:30:21 [kernel] ata7.00: cmd a0/00:00:00:00:00/00:00:00:00:00/a0 tag 0
Jul 2 12:30:21 [kernel] cdb 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Jul 2 12:30:21 [kernel] res 40/00:02:00:08:00/00:00:00:00:00/b0 Emask 0x4 (timeout)
Jul 2 12:30:21 [kernel] ata7.00: status: { DRDY }
Jul 2 12:30:21 [kernel] ata7: soft resetting link
Jul 2 12:30:21 [kernel] ata7.00: configured for PIO0
Jul 2 12:30:21 [kernel] ata7.01: configured for UDMA/25
Jul 2 12:30:21 [kernel] ata7: EH complete
Jul 2 12:30:51 [kernel] ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
Jul 2 12:30:51 [kernel] ata7.00: cmd a0/00:00:00:00:00/00:00:00:00:00/a0 tag 0
Jul 2 12:30:51 [kernel] cdb 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Jul 2 12:30:51 [kernel] res 40/00:02:00:08:00/00:00:00:00:00/b0 Emask 0x4 (timeout)
Jul 2 12:30:51 [kernel] ata7.00: status: { DRDY }
Jul 2 12:30:51 [kernel] ata7: soft resetting link
Jul 2 12:30:52 [kernel] ata7.00: configured for PIO0
Jul 2 12:30:52 [kernel] ata7.01: configured for UDMA/25
Jul 2 12:30:52 [kernel] ata7: EH complete

The drives are two DVD-RW drives. They work fine in Windows. They work fine on a board without this controller. My system becomes INCREDIBLY slow and unresponsive, and I need to SSH into it to reboot it remotely because the GUI does not respond and I cannot kill X. I have noticed that the JMicron controller and my NVidia video card share an IRQ. With the controller disabled, the computer is rock-stable.

#14 is a random list of unrelated reports that have nothing to do with this one - plus it works on other boxes.

#15 Christopher: that one is a nice clear dump - it would be useful to get the top part of the log when it happens, and also a dmesg so I can see what drives etc. are present. Looks like the drives need the DRDY drain patches that Mark Lord was working on.
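For anyone triaging a longer log of the same shape, the repeating pattern in the excerpt above (exception, timeout result, DRDY status) can be counted mechanically. The following is only a rough sketch: the regular expressions and the `summarize` helper are illustrative assumptions fitted to the lines quoted in this report, not part of libata or any existing tool, and other kernels may format these messages differently.

```python
import re
from collections import Counter

# Lines copied from the excerpt quoted in this report.
SAMPLE = """\
Jul 2 12:30:21 [kernel] ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
Jul 2 12:30:21 [kernel] ata7.00: cmd a0/00:00:00:00:00/00:00:00:00:00/a0 tag 0
Jul 2 12:30:21 [kernel] cdb 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Jul 2 12:30:21 [kernel] res 40/00:02:00:08:00/00:00:00:00:00/b0 Emask 0x4 (timeout)
Jul 2 12:30:21 [kernel] ata7.00: status: { DRDY }
Jul 2 12:30:51 [kernel] ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
Jul 2 12:30:51 [kernel] ata7.00: cmd a0/00:00:00:00:00/00:00:00:00:00/a0 tag 0
Jul 2 12:30:51 [kernel] cdb 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Jul 2 12:30:51 [kernel] res 40/00:02:00:08:00/00:00:00:00:00/b0 Emask 0x4 (timeout)
Jul 2 12:30:51 [kernel] ata7.00: status: { DRDY }
"""

# Hypothetical patterns, written only against the message shapes shown above.
EXCEPTION_RE = re.compile(r"(ata\d+\.\d+): exception Emask")
RESULT_RE = re.compile(r"\bres \S+ Emask 0x[0-9a-f]+ \(([^)]+)\)")
STATUS_RE = re.compile(r"(ata\d+\.\d+): status: \{ ([A-Z ]+) \}")


def summarize(lines):
    """Count exceptions, result reasons and status flags per ATA device."""
    exceptions, reasons, flags = Counter(), Counter(), Counter()
    current = None  # device named by the most recent exception line
    for line in lines:
        m = EXCEPTION_RE.search(line)
        if m:
            current = m.group(1)
            exceptions[current] += 1
            continue
        m = RESULT_RE.search(line)
        if m and current is not None:
            # The "res" line carries no device name, so attribute it to
            # the device of the preceding exception line.
            reasons[(current, m.group(1))] += 1
            continue
        m = STATUS_RE.search(line)
        if m:
            for flag in m.group(2).split():
                flags[(m.group(1), flag)] += 1
    return exceptions, reasons, flags


if __name__ == "__main__":
    exc, why, status = summarize(SAMPLE.splitlines())
    print(dict(exc), dict(why), dict(status))
```

Run against a full messages log (for example by passing `open(path)` instead of the sample), this would show at a glance whether every exception on a device is a timeout with only DRDY set, which is the situation being discussed here.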
Alan: I only included an excerpt, because of the size of the log. I can definitely include a dmesg if it would help......

Created attachment 18974 [details]
2.6.27-gentoo-r4 x86_64 dmesg
This is my dmesg.
Created attachment 18999 [details]
Messages log, showing the problem
This is the COMPLETE messages log, which shows the negotiated IDE speed dropping from ATA66 down to PIO0.
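A speed drop like this can be traced in a messages log by following the "configured for" lines per device. The sketch below is an illustration only: `mode_history` and its regular expression are assumptions based on the two "configured for" message shapes quoted earlier in this report, and the sample contains just those quoted lines, not the attached log.

```python
import re

# "configured for" lines copied from the excerpt quoted earlier in this report.
SAMPLE = """\
Jul 2 12:30:21 [kernel] ata7.00: configured for PIO0
Jul 2 12:30:21 [kernel] ata7.01: configured for UDMA/25
Jul 2 12:30:52 [kernel] ata7.00: configured for PIO0
Jul 2 12:30:52 [kernel] ata7.01: configured for UDMA/25
"""

# Hypothetical pattern, fitted only to the lines shown above.
MODE_RE = re.compile(r"(ata\d+\.\d+): configured for (\S+)")


def mode_history(lines):
    """Map each device to the ordered list of modes it was configured for,
    collapsing consecutive repeats so only actual changes remain."""
    history = {}
    for line in lines:
        m = MODE_RE.search(line)
        if not m:
            continue
        device, mode = m.groups()
        modes = history.setdefault(device, [])
        if not modes or modes[-1] != mode:
            modes.append(mode)
    return history


if __name__ == "__main__":
    for device, modes in sorted(mode_history(SAMPLE.splitlines()).items()):
        print(device, " -> ".join(modes))
```

On the short excerpt each device has a single entry, because the mode no longer changes there. On a complete log, a device that was stepped down would presumably show several entries in sequence.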
It does look like the drain patch will fix that one.

Created attachment 19000 [details]
Drain patch for test
Created attachment 19017 [details]
messages with the patch; problem still occurring
With the patch, the problem still seems to be occurring. I have attached the newer copy of my messages log.
Not much else can be done with this bug - closing as obsolete