Bug 10296

Summary: (test patch)pata_jmicron: DRDY data drain needed: System freezes after I/O on pata_jmicron device
Product: IO/Storage
Reporter: jniklast
Component: Serial ATA
Assignee: Alan (alan)
Status: CLOSED OBSOLETE
Severity: normal
CC: bunk, chris, lkmlist
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 2.6.24.3
Subsystem:
Regression: Yes
Bisected commit-id:
Attachments: dmesg of the (bug-free) 2.6.23 kernel
dmesg of the 2.6.24.3 kernel
lspci -vvxxx output
errorlog of crashing pata_jmicron module
2.6.27-gentoo-r4 x86_64 dmesg
Messages log, showing the problem
Drain patch for test
messages with the patch; problem still occurring

Description jniklast 2008-03-21 02:58:56 UTC
Latest working kernel version: 2.6.23.17
Earliest failing kernel version: 2.6.24.0
Distribution: Gentoo
Hardware Environment: JMicron 20360/20363 AHCI Controller (rev 02) on Asus P5W DH Deluxe Mainboard (Intel 975X + ICH7R chipset)
Software Environment: gcc 4.1.2, binutils 2.18, glibc 2.6.1-r0
Problem Description:
I have pata_jmicron built as a module and use an IDE hard disk on the JMicron controller. Whenever I do some I/O on that hard disk, the I/O wait load climbs within a few seconds until it is at or close to 100%, which renders the system unusable. Magic SysRq sometimes makes a graceful reboot possible.
Steps to reproduce:
1. Do some I/O on an IDE hard disk which uses pata_jmicron
Comment 1 Alan 2008-03-21 04:23:04 UTC
Please provide a dmesg of both kernels and an lspci -vvxxx.
Comment 2 Alan 2008-03-21 04:44:07 UTC
PATA so taking ownership
Comment 3 jniklast 2008-03-21 06:15:25 UTC
Dunno whether I should just paste them here because they obviously are quite long, so I used nopaste:
2.6.23 (working) kernel dmesg: http://www.nopaste.org/p/aWoFAFsSt
2.6.24.3 (not working) kernel dmesg: http://www.nopaste.org/p/aUeuofcbN
lspci -vvxxx: http://www.nopaste.org/p/aqf99DPeG

dmesg and /var/log/messages show nothing when it freezes btw
Also, the bug has so far only occurred on a FAT32 partition. I tried to reproduce it with an ext3 filesystem on the same hard disk, but instead of freezing the whole computer, there were lots of read errors, the pata_jmicron module simply crashed, and all filesystems were remounted read-only. This is what /var/log/messages had about that incident: http://www.nopaste.org/p/aYhzHRsDB

I hope that information helps. I don't know how to get debug output from the pata_jmicron module, as modinfo shows no parms. So if there is a way I could provide more useful data, please tell me.

And thanks in advance!
Comment 4 Adrian Bunk 2008-03-21 06:18:57 UTC
(In reply to comment #3)
> Dunno whether I should just paste them here because they obviously are quite
> long, so I used nopaste:
>...

"Create a New Attachment" is the correct way to attach bigger files to a bug.
Comment 5 jniklast 2008-03-21 07:06:30 UTC
Created attachment 15370 [details]
dmesg of the (bug-free) 2.6.23 kernel
Comment 6 jniklast 2008-03-21 07:07:14 UTC
Created attachment 15371 [details]
dmesg of the 2.6.24.3 kernel
Comment 7 jniklast 2008-03-21 07:07:44 UTC
Created attachment 15372 [details]
lspci -vvxxx output
Comment 8 jniklast 2008-03-21 07:09:29 UTC
Created attachment 15373 [details]
errorlog of crashing pata_jmicron module
Comment 9 Alan 2008-03-21 07:51:31 UTC
[   88.540267] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  169.12  Thu Feb 14 17:51:09 PST 2008

Sorry - this binary module means we can't do the debugging as only Nvidia have all the source code.

From a scan of the bug report, your drive goes away and stops responding. There are no obvious changes between 2.6.23 and 2.6.24 that would cause that to have changed, so I am curious:

- Does the bug occur without the Nvidia binary modules being loaded on that boot?
- If you go back to 2.6.23, does the problem eventually happen on that?
Comment 10 jniklast 2008-03-22 01:56:56 UTC
The bug definitely doesn't occur in 2.6.23.
It also happens without the nvidia binary module loaded. Without X it doesn't always occur, or at least not as quickly as with X, but after some I/O on the disk in question it did happen without the nvidia module. This time, though, it seemed that only one of my two cores was affected: the I/O wait load was at only 50% most of the time and rose to 100% only occasionally. The effects were the same, though: no disk I/O was possible anymore, and any process that attempted it stalled completely. Processes that didn't need any disk I/O kept running well (although you couldn't start another shell), for example top, which kept updating itself. If you run X, it stalls after a few seconds, probably because it does some disk I/O.
Thank you for looking at this bug, although I couldn't find anyone with the same problem yet.
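
The 50% versus 100% observation above can be checked per core. The following is a minimal sketch, assuming the standard /proc/stat layout in which the fifth value on each "cpu" line is the iowait tick counter; the function names and the sample numbers are made up for illustration and do not come from the attached logs.

```python
import time


def parse_cpu_lines(stat_text):
    """Return {cpu_name: (iowait_ticks, total_ticks)} from /proc/stat content."""
    result = {}
    for line in stat_text.splitlines():
        fields = line.split()
        if not fields or not fields[0].startswith("cpu"):
            continue
        ticks = [int(v) for v in fields[1:]]
        # Field order: user nice system idle iowait irq softirq steal ...
        # Only the first eight are summed; guest time is already part of user.
        result[fields[0]] = (ticks[4], sum(ticks[:8]))
    return result


def iowait_percent(before, after):
    """Percentage of elapsed ticks spent in iowait between two samples."""
    pct = {}
    for cpu, (io1, total1) in after.items():
        io0, total0 = before.get(cpu, (0, 0))
        elapsed = total1 - total0
        pct[cpu] = 100.0 * (io1 - io0) / elapsed if elapsed > 0 else 0.0
    return pct


def sample_live(interval=1.0):
    """Take two live samples from /proc/stat (Linux only)."""
    with open("/proc/stat") as f:
        before = parse_cpu_lines(f.read())
    time.sleep(interval)
    with open("/proc/stat") as f:
        after = parse_cpu_lines(f.read())
    return iowait_percent(before, after)


# Invented samples: cpu0 spends the whole interval in iowait, cpu1 is idle.
BEFORE = """cpu  1000 0 500 8000 500 0 0 0
cpu0 500 0 250 4000 250 0 0 0
cpu1 500 0 250 4000 250 0 0 0
"""
AFTER = """cpu  1000 0 500 8100 600 0 0 0
cpu0 500 0 250 4000 350 0 0 0
cpu1 500 0 250 4100 250 0 0 0
"""

pct = iowait_percent(parse_cpu_lines(BEFORE), parse_cpu_lines(AFTER))
for name in sorted(pct):
    print(name, pct[name])
```

With these invented samples the aggregate "cpu" row reads 50% while cpu0 alone is at 100%, which is the pattern described in this comment: one core fully blocked on I/O shows up as half the total on a two-core machine.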
Comment 11 Tejun Heo 2008-03-23 03:07:26 UTC
Does irqpoll help?
Comment 12 jniklast 2008-03-23 03:37:54 UTC
No, I already use it. I remember that a few kernel versions ago things didn't work at all without it, but I haven't tried booting without it since.
Comment 13 Bjoern Olausson 2008-03-24 02:02:32 UTC
I have the same hardware, but can't reproduce this bug.
Using 2.6.24.2. Though I have the driver compiled into the kernel.

Just moved 1.5 GB from and to the drive attached to the first port of the JMicron controller.

kernel (hd0,0)/vmlinuz root=/dev/sda3 video=nvidiafb:1280x1024-32@85,mtrr,ywrap


           *-storage
                description: SATA controller
                product: JMicron 20360/20363 AHCI Controller
                vendor: JMicron Technologies, Inc.
                physical id: 0
                bus info: pci@0000:02:00.0
                version: 02
                width: 32 bits
                clock: 33MHz
                capabilities: storage pm pciexpress ahci_1.0 bus_master cap_list
                configuration: driver=ahci latency=0
           *-ide
                description: IDE interface
                product: JMicron 20360/20363 AHCI Controller
                vendor: JMicron Technologies, Inc.
                physical id: 0.1
                bus info: pci@0000:02:00.1
                version: 02
                width: 32 bits
                clock: 33MHz
                capabilities: ide pm bus_master cap_list
                configuration: driver=pata_jmicron latency=0


[ 6011.757221] scsi 1:0:0:0: Direct-Access     ATA      External Disk 0  RGL1 PQ: 0 ANSI: 5
[ 6011.757322] sd 1:0:0:0: [sdb] 312581808 512-byte hardware sectors (160042 MB)
[ 6011.757337] sd 1:0:0:0: [sdb] Write Protect is off
[ 6011.757345] sd 1:0:0:0: [sdb] Mode Sense: 00 3a 00 00
[ 6011.757371] sd 1:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[ 6011.757434] sd 1:0:0:0: [sdb] 312581808 512-byte hardware sectors (160042 MB)
[ 6011.757445] sd 1:0:0:0: [sdb] Write Protect is off
[ 6011.757447] sd 1:0:0:0: [sdb] Mode Sense: 00 3a 00 00
[ 6011.757486] sd 1:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[ 6011.757489]  sdb: sdb1
[ 6011.757720] sd 1:0:0:0: [sdb] Attached SCSI disk
[ 6011.757760] sd 1:0:0:0: Attached scsi generic sg1 type 0
[ 6016.929322] kjournald starting.  Commit interval 5 seconds
[ 6016.929333] EXT3-fs warning: maximal mount count reached, running e2fsck is recommended
[ 6016.948524] EXT3 FS on sdb1, internal journal
[ 6016.948535] EXT3-fs: mounted filesystem with ordered data mode.


Nevertheless, there is something wrong with the JMicron:
http://bugzilla.kernel.org/show_bug.cgi?id=9010

regards
Bjoern
Comment 14 kiev 2008-05-25 16:10:13 UTC
For me it showed up about once every half hour; however, as a result of this
problem I lost a MySQL database. MySQL InnoDB does not start ("Assertion error"),
and even "innodb_force_recovery = 4" did not help. The backup was a week old,
so a whole department lost several days of work. The management is simply
in shock, and I am going to be fired (((

This problem has existed for a whole year already:
-----------
I'm stumped trying to track down the below intermittent problem.....
I've confirmed this problem on 2.6.19, 2.6.20 and 2.6.21.
-----------
http://lkml.org/lkml/2007/6/14/154
http://kerneltrap.org/mailarchive/linux-kernel/2007/6/14/103765
http://kerneltrap.org/node/16175

"System hang from time to time" http://bugzilla.kernel.org/show_bug.cgi?id=8300
"sata hotplug removal of drive freezes all 2.6.21 kernels"
http://bugzilla.kernel.org/show_bug.cgi?id=8421
"(sata_via) system freeze in random time"
http://bugzilla.kernel.org/show_bug.cgi?id=9115
"kernel freezes with on clockevent warning"
http://bugzilla.kernel.org/show_bug.cgi?id=9834
"[pata_ali] Unspecified hang on Acer laptop"
http://bugzilla.kernel.org/show_bug.cgi?id=9898
"System freezes after I/O on pata_jmicron device"
http://bugzilla.kernel.org/show_bug.cgi?id=10296

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/217920
https://bugs.launchpad.net/ubuntu/+bug/164183
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/229747
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/159521
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/187146
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/221437
https://bugs.launchpad.net/ubuntu/+bug/226600
Comment 15 Christopher Adlam 2008-09-25 18:24:11 UTC
I have observed the same bug....

Here is the message I see when I have the controller enabled:

Jul  2 12:30:21 [kernel] ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
Jul  2 12:30:21 [kernel] ata7.00: cmd a0/00:00:00:00:00/00:00:00:00:00/a0 tag 0
Jul  2 12:30:21 [kernel]          cdb 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
Jul  2 12:30:21 [kernel]          res 40/00:02:00:08:00/00:00:00:00:00/b0 Emask 0x4 (timeout)
Jul  2 12:30:21 [kernel] ata7.00: status: { DRDY }
Jul  2 12:30:21 [kernel] ata7: soft resetting link
Jul  2 12:30:21 [kernel] ata7.00: configured for PIO0
Jul  2 12:30:21 [kernel] ata7.01: configured for UDMA/25
Jul  2 12:30:21 [kernel] ata7: EH complete
Jul  2 12:30:51 [kernel] ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
Jul  2 12:30:51 [kernel] ata7.00: cmd a0/00:00:00:00:00/00:00:00:00:00/a0 tag 0
Jul  2 12:30:51 [kernel]          cdb 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
Jul  2 12:30:51 [kernel]          res 40/00:02:00:08:00/00:00:00:00:00/b0 Emask 0x4 (timeout)
Jul  2 12:30:51 [kernel] ata7.00: status: { DRDY }
Jul  2 12:30:51 [kernel] ata7: soft resetting link
Jul  2 12:30:52 [kernel] ata7.00: configured for PIO0
Jul  2 12:30:52 [kernel] ata7.01: configured for UDMA/25
Jul  2 12:30:52 [kernel] ata7: EH complete
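
For readers unfamiliar with the "res" lines in the excerpt above, here is a minimal sketch of how they can be decoded. It assumes that the first byte of the "res" field is the ATA status register (which matches the "status: { DRDY }" line that follows it in the log) and the second byte is the error register; the bit names follow the classic ATA status register layout. The function names are illustrative only.

```python
import re

# Classic ATA status register bits, most significant first.
STATUS_BITS = [
    (0x80, "BSY"), (0x40, "DRDY"), (0x20, "DF"), (0x10, "DSC"),
    (0x08, "DRQ"), (0x04, "CORR"), (0x02, "IDX"), (0x01, "ERR"),
]

# Matches e.g. "res 40/00:02:00:08:00/00:00:00:00:00/b0 Emask 0x4 (timeout)"
RES_RE = re.compile(
    r"res ([0-9a-f]{2})/([0-9a-f]{2}):\S+ Emask (0x[0-9a-f]+) \(([^)]*)\)"
)


def decode_status(status):
    """Return the names of the bits set in an ATA status byte."""
    return [name for mask, name in STATUS_BITS if status & mask]


def parse_res_line(line):
    """Decode one 'res' log line, or return None if the line is not one."""
    m = RES_RE.search(line)
    if m is None:
        return None
    status, error, emask, reason = m.groups()
    return {
        "status_flags": decode_status(int(status, 16)),
        "error": int(error, 16),
        "emask": int(emask, 16),
        "reason": reason,
    }


# Line copied from the excerpt in this comment.
LINE = ("Jul  2 12:30:21 [kernel]          "
        "res 40/00:02:00:08:00/00:00:00:00:00/b0 Emask 0x4 (timeout)")
decoded = parse_res_line(LINE)
print(decoded)
```

Read this way, the device reports DRDY with no error bits set, yet the command still timed out.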


The drives are two DVD-RW drives. They work fine in Windows. They work fine on a board without this controller. My system becomes INCREDIBLY slow and unresponsive, and I need to SSH into it to reboot it remotely because the GUI does not respond and I cannot kill X. 

I have noticed that the JMicron controller and my NVidia video card share an IRQ. 

With the controller disabled, the computer is rock-stable. 
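
The shared-IRQ observation above can be confirmed from /proc/interrupts. The sketch below assumes the usual layout (a header row naming the CPUs, then one row per IRQ with per-CPU counts, the interrupt chip type, and a comma-separated list of handlers); the layout differs slightly between kernel versions, so treat the parsing as an approximation. The sample text, including the IRQ numbers and counts, is invented for illustration and is not taken from the reporter's machine.

```python
def shared_irqs(interrupts_text):
    """Return {irq_number: [handler, ...]} for IRQs with more than one handler."""
    lines = interrupts_text.splitlines()
    if not lines:
        return {}
    ncpu = len(lines[0].split())  # header row: CPU0 CPU1 ...
    shared = {}
    for line in lines[1:]:
        label, sep, rest = line.partition(":")
        label = label.strip()
        if not sep or not label.isdigit():
            continue  # skip summary rows such as NMI, LOC, ERR
        fields = rest.split(None, ncpu)
        if len(fields) <= ncpu:
            continue  # no chip/handler text after the counters
        parts = [p.strip() for p in fields[ncpu].split(",")]
        if len(parts) < 2:
            continue  # only one handler registered on this IRQ
        # The first part still carries the chip type; keep its last token.
        first_handler = parts[0].split()[-1]
        shared[int(label)] = [first_handler] + parts[1:]
    return shared


# Invented sample, NOT from the reporter's system.
SAMPLE = """           CPU0       CPU1
  0:        125          0   IO-APIC-edge      timer
 16:      30412      29876   IO-APIC-fasteoi   pata_jmicron, nvidia
 19:       4521       4498   IO-APIC-fasteoi   ahci
NMI:          0          0   Non-maskable interrupts
"""

print(shared_irqs(SAMPLE))
```

On a live system the same function could be fed the contents of /proc/interrupts; a row listing both the storage driver and the graphics driver would confirm the sharing described here.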
Comment 16 Alan 2008-09-26 01:43:30 UTC
#14 is a random list of unrelated reports that have nothing to do with this report - plus it works on other boxes

#15 Christopher; that one is a nice clear dump - would be useful to get the top part of the log when it happens and also a dmesg so I can see what drives etc are present. Looks like the drives need the DRDY drain patches that Mark Lord was working on.
Comment 17 Christopher Adlam 2008-11-22 14:19:14 UTC
Alan:

I only included an excerpt, because of the size of the log.

I can definitely include a dmesg if it would help......
Comment 18 Christopher Adlam 2008-11-22 14:58:10 UTC
Created attachment 18974 [details]
2.6.27-gentoo-r4 x86_64 dmesg

This is my dmesg.
Comment 19 Christopher Adlam 2008-11-24 02:06:27 UTC
Created attachment 18999 [details]
Messages log, showing the problem

This is the COMPLETE messages log, which shows the negotiated IDE speed degrading from ATA66 down to PIO0.
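
A speed step-down like the one described here can be pulled out of a messages log by collecting the "configured for" lines per device. This is a minimal sketch; the attached log is not reproduced in this thread, so the sample below reuses only the lines quoted in comment 15, and the function name is illustrative.

```python
import re

# Matches e.g. "ata7.00: configured for PIO0"
CONFIG_RE = re.compile(r"(ata\d+\.\d+): configured for (\S+)")


def speed_history(log_text):
    """Return {device: [mode, ...]} with consecutive repeats collapsed."""
    history = {}
    for line in log_text.splitlines():
        m = CONFIG_RE.search(line)
        if m is None:
            continue
        device, mode = m.groups()
        modes = history.setdefault(device, [])
        if not modes or modes[-1] != mode:
            modes.append(mode)  # record only actual changes of mode
    return history


# Lines copied from the excerpt in comment 15.
SAMPLE = """Jul  2 12:30:21 [kernel] ata7: soft resetting link
Jul  2 12:30:21 [kernel] ata7.00: configured for PIO0
Jul  2 12:30:21 [kernel] ata7.01: configured for UDMA/25
Jul  2 12:30:21 [kernel] ata7: EH complete
Jul  2 12:30:52 [kernel] ata7.00: configured for PIO0
Jul  2 12:30:52 [kernel] ata7.01: configured for UDMA/25
"""

print(speed_history(SAMPLE))
```

Run over the complete log, each device's list would contain one entry per change of transfer mode, so a drop from a faster mode to PIO0 appears as a sequence ending in PIO0.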
Comment 20 Alan 2008-11-24 02:29:44 UTC
It does look like the drain patch will fix that one
Comment 21 Alan 2008-11-24 02:32:05 UTC
Created attachment 19000 [details]
Drain patch for test
Comment 22 Christopher Adlam 2008-11-25 14:01:12 UTC
Created attachment 19017 [details]
messages with the patch; problem still occurring

With the patch, the problem still seems to be occurring. I have attached the newer copy of my messages log.
Comment 23 Alan 2012-05-21 14:56:16 UTC
Not much else can be done with this bug - closing as obsolete