Bug 8627
Summary: | randomly a stuttering in the music (caused by HSM violation?) | ||
---|---|---|---|
Product: | IO/Storage | Reporter: | Bjoern Olausson (lkmlist) |
Component: | Serial ATA | Assignee: | Tejun Heo (htejun) |
Status: | CLOSED CODE_FIX | ||
Severity: | normal | CC: | albertcc, htejun, zackki13597 |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 2.6.21.5 | Subsystem: | |
Regression: | --- | Bisected commit-id: | |
Attachments: |
Patch for not stopping DMA if the device is busy
Patch for not stopping DMA if the device is busy
NCQ-blacklist.patch
Patch to limit ATAPI DMA to R/W only for Plextor PX-130A"
libata-core.c_bug_8627.diff
proper libata-core.c_bug_8627.diff
my libata.h |
Description
Bjoern Olausson
2007-06-14 02:50:31 UTC
This is a minor problem found when working on bug 8259.

More description of the problem:
The problem is related to two Plextor drives connected to the PATA port of a JMicron controller. The port is driven by pata_jmicron. Sometimes the drive has an HSM violation, which causes noticeable jitter during music playback. Bjoern has collected a detailed libata trace.

It seems the HSM violation only happens on the slave drive (px-130a) when doing the 0x4a (GET_EVENT_STATUS_NOTIFICATION) command. The HSM violation seems to be caused by the drive interrupting while the ATAPI DMA is still ongoing.

Currently Bjoern is helping to narrow down the problem by removing the medium from the master drive (px-708a) and placing it in the slave drive (px-130a). Hopefully we can find out whether this is a problem of the JMicron controller or of the Plextor px-130a drive...

======================
(transaction with HSM violation)
Jun 8 10:53:09 freax ata_scsi_dump_cdb: CDB (9:0,1,0) 4a 01 00 00 10 00 00 00 08
Jun 8 10:53:09 freax ata_scsi_translate: ENTER
Jun 8 10:53:09 freax ata_sg_setup: ENTER, ata9
Jun 8 10:53:09 freax ata_sg_setup: 1 sg elements mapped
Jun 8 10:53:09 freax ata_fill_sg: PRD[0] = (0x54B8A000, 0x8)
Jun 8 10:53:09 freax ata9: ata_dev_select: ENTER, device 1, wait 1
Jun 8 10:53:09 freax ata_tf_load: feat 0x1 nsect 0x0 lba 0x0 0x0 0x0
Jun 8 10:53:09 freax ata_tf_load: device 0xB0
Jun 8 10:53:09 freax ata_exec_command: ata9: cmd 0xA0
Jun 8 10:53:09 freax ata_scsi_translate: EXIT
Jun 8 10:53:09 freax ata_host_intr: ata9: protocol 7 task_state 1
Jun 8 10:53:09 freax ahci_interrupt: ENTER
Jun 8 10:53:09 freax ata_hsm_move: ata9: protocol 7 task_state 1 (dev_stat 0x58)
Jun 8 10:53:09 freax atapi_send_cdb: send cdb
Jun 8 10:53:09 freax ahci_interrupt: ENTER
Jun 8 10:53:09 freax ata_host_intr: ata9: protocol 7 task_state 3
Jun 8 10:53:09 freax ata_host_intr: ata9: host_stat 0x5
Jun 8 10:53:09 freax ahci_interrupt: ENTER
Jun 8 10:53:09 freax ahci_interrupt: ENTER
Jun 8 10:53:09 freax ata_host_intr: ata9: protocol 7 task_state 3
Jun 8 10:53:09 freax ata_host_intr: ata9: host_stat 0x4
Jun 8 10:53:09 freax ata_hsm_move: ata9: protocol 7 task_state 3 (dev_stat 0x0)
Jun 8 10:53:09 freax ata_hsm_move: ata9: protocol 7 task_state 4 (dev_stat 0x0)
Jun 8 10:53:09 freax ata_scsi_timed_out: ENTER
Jun 8 10:53:09 freax ata_scsi_timed_out: EXIT, ret=0
Jun 8 10:53:09 freax ata_scsi_error: ENTER
Jun 8 10:53:09 freax ata_port_flush_task: ENTER
Jun 8 10:53:09 freax ahci_interrupt: ENTER
Jun 8 10:53:09 freax ata_port_flush_task: flush #1
Jun 8 10:53:09 freax ata9: ata_port_flush_task: flush #2
Jun 8 10:53:09 freax ata9: ata_port_flush_task: EXIT
Jun 8 10:53:09 freax ata_eh_autopsy: ENTER
Jun 8 10:53:09 freax ata_eh_autopsy: EXIT
Jun 8 10:53:09 freax ata9.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2
Jun 8 10:53:09 freax ata9.01: (BMDMA stat 0x4)
Jun 8 10:53:09 freax ata9.01: cmd a0/01:00:00:00:00/00:00:00:00:00/b0 tag 0 cdb 0x4a data 8 in
Jun 8 10:53:09 freax res 00/00:00:00:00:00/00:00:00:00:00/00 Emask 0x2 (HSM violation)
Jun 8 10:53:09 freax ata_eh_recover: ENTER
Jun 8 10:53:09 freax ata_eh_prep_resume: ENTER
Jun 8 10:53:09 freax ata_eh_prep_resume: EXIT
Jun 8 10:53:09 freax __ata_port_freeze: ata9 port frozen
====================
(normal transaction)
Jun 8 10:53:12 freax ata_scsi_dump_cdb: CDB (9:0,1,0) 4a 01 00 00 10 00 00 00 08
Jun 8 10:53:12 freax ata_scsi_translate: ENTER
Jun 8 10:53:12 freax ata_sg_setup: ENTER, ata9
Jun 8 10:53:12 freax ata_sg_setup: 1 sg elements mapped
Jun 8 10:53:12 freax ata_fill_sg: PRD[0] = (0x5430C000, 0x8)
Jun 8 10:53:12 freax ata9: ata_dev_select: ENTER, device 1, wait 1
Jun 8 10:53:12 freax ata_tf_load: feat 0x1 nsect 0x0 lba 0x0 0x0 0x0
Jun 8 10:53:12 freax ata_tf_load: device 0xB0
Jun 8 10:53:12 freax ata_exec_command: ata9: cmd 0xA0
Jun 8 10:53:12 freax ata_scsi_translate: EXIT
Jun 8 10:53:12 freax ata_hsm_move: ata9: protocol 7 task_state 1 (dev_stat 0x58)
Jun 8 10:53:12 freax atapi_send_cdb: send cdb
Jun 8 10:53:12 freax ata_host_intr: ata9: protocol 7 task_state 3
Jun 8 10:53:12 freax ata_host_intr: ata9: host_stat 0x1
Jun 8 10:53:12 freax ahci_interrupt: ENTER
Jun 8 10:53:12 freax ahci_interrupt: ENTER
Jun 8 10:53:12 freax ata_host_intr: ata9: protocol 7 task_state 3
Jun 8 10:53:12 freax ata_host_intr: ata9: host_stat 0x1
Jun 8 10:53:12 freax ata_host_intr: ata9: protocol 7 task_state 3
Jun 8 10:53:12 freax ata_host_intr: ata9: host_stat 0x4
Jun 8 10:53:12 freax ata_hsm_move: ata9: protocol 7 task_state 3 (dev_stat 0x50)
Jun 8 10:53:12 freax ata_hsm_move: ata9: dev 1 command complete, drv_stat 0x50
Jun 8 10:53:12 freax ata_sg_clean: unmapping 1 sg elements
Jun 8 10:53:12 freax atapi_qc_complete: ENTER, err_mask 0x0

This is a libata issue.

Hi Bjoern,
After removing the medium from the px-708a and placing a medium into the px-130a, does the stuttering still happen?

I have a logfile with debug enabled, recorded while removing and adding media to the devices. I echoed comments to the logfile. Use grep to find them:
grep -n "<----" messages_-_2007-06-14_12.29.59_-_with_debug.log
http://olausson.de/temp/messages_-_2007-06-14_12.29.59_-_with_debug.log.bz2
size packed: 1872533 bytes
size unpacked: 437679017 bytes
md5sum packed: 0473581df658100a7a479619af206fd1
md5sum unpacked: d8f701c6a94cbed6b39a8fbf52451306
There was stuttering, but with debugging enabled it was hard to find. I tried to echo a "<---- Stuttering occured above ---->" as soon as I noticed a stuttering. So now I'll try without debug... for me that output is easier to grep and to pin to the stuttering.
Right now I am running without any of these devices attached and compiling a kernel without debugging. No stuttering so far.
But maybe you as a pro will find some interesting stuff in that log.
>After removing the medium from px-708a and placing medium into px-130a, does
>the stuttering still happen?
I'll answer this once I have booted the kernel without debugging.
Thanks for your help
Bjoern

Now here's the log without the two drives.
grep -n "<----" messages_-_2007-06-14_13.07.23_-_with_debub_without_drives.log
to see comments.
The PX-760A is not attached to the JMicron controller. IMHO it is attached to the Intel ICH7 PATA port.
http://olausson.de/temp/messages_-_2007-06-14_13.07.23_-_with_debub_without_drives.log.bz2
size packed: 922667 bytes
size unpacked: 244104001 bytes
md5sum packed: 1e867109840acdb914b1fff091197e66
md5sum unpacked: 8cf8a2672556cc7e45e4c8cc4fb2b0e6
Now running the kernel without debugging and waiting for something to happen (first without any media in the drives, then I'll insert media into the 130A and see what happens. After this I'll insert media into the 708A).
regards
Bjoern

Things are getting more and more weird. Every time I compile the kernel with debugging and then recompile it without debugging, the noticeable occurrences of the bug vanish.
The only thing I noticed was while burning a DVD ISO image:
<---- Starting to burn an other DVD iso with PX760A ---->
Jun 14 14:27:36 freax cdrom: This disc doesn't have any tracks I recognize!
Jun 14 14:30:01 freax cron[19764]: (root) CMD (test -x /usr/sbin/run-crons && /usr/sbin/run-crons )
Jun 14 14:35:54 freax wpa_cli: interface ath0 DISCONNECTED
Jun 14 14:35:54 freax wpa_cli: interface ath0 CONNECTED
Jun 14 14:35:59 freax ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
Jun 14 14:35:59 freax ata7.00: cmd a0/01:00:00:00:00/00:00:00:00:00/a0 tag 0 cdb 0xad data 4 in
Jun 14 14:35:59 freax res 40/00:03:00:00:00/00:00:00:00:00/a0 Emask 0x4 (timeout)
Jun 14 14:36:06 freax ata7: port is slow to respond, please be patient (Status 0xd8)
Jun 14 14:36:11 freax ata7: soft resetting port
Jun 14 14:36:12 freax ata7.00: configured for UDMA/66
Jun 14 14:36:12 freax ata7: EH complete
<---- DVD burning is finished ---->
That's it. No more stuttering... But I'll continue to listen to music and watch dmesg for HSM violations...
It's crazy
regards
Bjoern

I should mention that currently there is no difference between having a medium loaded or not, independent of the drive.
Okay, speak of the devil... it occurs:
ata9.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2
ata9.01: (BMDMA stat 0x4)
ata9.01: cmd a0/01:00:00:00:00/00:00:00:00:00/b0 tag 0 cdb 0x4a data 8 in
res 00/00:00:00:00:00/00:00:00:00:00/00 Emask 0x2 (HSM violation)
ata9: soft resetting port
ata9.00: configured for UDMA/33
ata9.01: configured for UDMA/100
ata9: EH complete
while having a medium in the PX130A:
/dev/sr2 on /media/VirtualBox-WinXP type udf (ro,noexec,nosuid,nodev,uid=1001,gid=100,umask=000,iocharset=utf8)
but I didn't notice a stuttering... maybe I was too focused on learning enzyme kinetics...
regards
Bjoern

I removed the px-708A from the bus, rebooted, and we'll see what happens.
One thing I noticed in dmesg on line 343:
ATA: abnormal status 0x7F on port 0x0000000000010177
Anything to be concerned about?
Full dmesg output here:
http://olausson.de/temp/dmsg-without_px708A
Waiting for an HSM violation now ;-)
Thanks
Bjoern

Now I was waiting for a long time for an HSM violation. None occurred. So I decided to insert a DVD in the px-130A. After some time an HSM violation occurred. But it no longer targets the DVD drive... If I am not wrong, it targets my second hard drive (ata1.00). But at the end of the error message it tells me something about sda... confusing.
root@freax $ cat /sys/bus/scsi/devices/1\:0\:0\:0/model
External Disk 0
This disk is a WDC WD740ADFD-00 connected to an EZ-Raid chip (not in RAID mode), which is IMHO bridged to one port of the Intel ICH7. I guess you'd better have a look at the Asus P5W-DH Deluxe specs...
Another one is connected directly to the Intel controller (my root, boot and swap are on this one). The other disk is sent to sleep with "hdparm -s 1 -S 120" on boot.
root@freax $ cat /sys/bus/scsi/devices/0\:0\:0\:0/model
WDC WD740ADFD-00
Jun 14 16:36:59 freax UDF-fs: Partition marked readonly; forcing readonly mount
Jun 14 16:36:59 freax UDF-fs INFO UDF 0.9.8.1 (2004/29/09) Mounting volume 'Road_to_Guantanamo.TNPG', timestamp 2007/05/06 1
Jun 14 16:40:01 freax cron[6802]: (root) CMD (test -x /usr/sbin/run-crons && /usr/sbin/run-crons )
Jun 14 16:50:01 freax cron[10944]: (root) CMD (test -x /usr/sbin/run-crons && /usr/sbin/run-crons )
Jun 14 16:50:21 freax ata1.00: exception Emask 0x2 SAct 0xf800009 SErr 0x0 action 0x2 frozen
Jun 14 16:50:21 freax ata1.00: (spurious completions during NCQ issue=0x0 SAct=0xf800009 FIS=004040a1:00400000)
Jun 14 16:50:21 freax ata1.00: cmd 61/08:00:80:20:57/00:00:08:00:00/40 tag 0 cdb 0x0 data 4096 out
Jun 14 16:50:21 freax res 40/00:00:80:20:57/00:00:08:00:00/40 Emask 0x2 (HSM violation)
Jun 14 16:50:21 freax ata1.00: cmd 61/30:18:80:60:1f/00:00:00:00:00/40 tag 3 cdb 0x0 data 24576 out
Jun 14 16:50:21 freax res 40/00:00:80:20:57/00:00:08:00:00/40 Emask 0x2 (HSM violation)
Jun 14 16:50:21 freax ata1.00: cmd 61/08:b8:10:65:53/00:00:08:00:00/40 tag 23 cdb 0x0 data 4096 out
Jun 14 16:50:21 freax res 40/00:00:80:20:57/00:00:08:00:00/40 Emask 0x2 (HSM violation)
Jun 14 16:50:21 freax ata1.00: cmd 61/10:c0:38:65:53/00:00:08:00:00/40 tag 24 cdb 0x0 data 8192 out
Jun 14 16:50:21 freax res 40/00:00:80:20:57/00:00:08:00:00/40 Emask 0x2 (HSM violation)
Jun 14 16:50:21 freax ata1.00: cmd 61/18:c8:60:e3:53/00:00:08:00:00/40 tag 25 cdb 0x0 data 12288 out
Jun 14 16:50:21 freax res 40/00:00:80:20:57/00:00:08:00:00/40 Emask 0x2 (HSM violation)
Jun 14 16:50:21 freax ata1.00: cmd 61/08:d0:68:23:56/00:00:08:00:00/40 tag 26 cdb 0x0 data 4096 out
Jun 14 16:50:21 freax res 40/00:00:80:20:57/00:00:08:00:00/40 Emask 0x2 (HSM violation)
Jun 14 16:50:21 freax ata1.00: cmd 61/08:d8:40:12:57/00:00:08:00:00/40 tag 27 cdb 0x0 data 4096 out
Jun 14 16:50:21 freax res 40/00:00:80:20:57/00:00:08:00:00/40 Emask 0x2 (HSM violation)
Jun 14 16:50:22 freax ata1: soft resetting port
Jun 14 16:50:22 freax ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
Jun 14 16:50:22 freax ata1.00: configured for UDMA/133
Jun 14 16:50:22 freax ata1: EH complete
Jun 14 16:50:22 freax SCSI device sda: 145226112 512-byte hdwr sectors (74356 MB)
Jun 14 16:50:22 freax sda: Write Protect is off
Jun 14 16:50:22 freax sda: Mode Sense: 00 3a 00 00
Jun 14 16:50:22 freax SCSI device sda: write cache: enabled, read cache: enabled, doesn't support DPO or FUA
regards
blubbi

Here is another HSM violation (still with a medium in the px-130A):
ata1.00: exception Emask 0x2 SAct 0x6 SErr 0x0 action 0x2 frozen
ata1.00: (spurious completions during NCQ issue=0x0 SAct=0x6 FIS=004040a1:00000001)
ata1.00: cmd 61/08:08:d0:60:e3/00:00:07:00:00/40 tag 1 cdb 0x0 data 4096 out
res 40/00:10:e8:64:ff/00:00:07:00:00/40 Emask 0x2 (HSM violation)
ata1.00: cmd 61/08:10:e8:64:ff/00:00:07:00:00/40 tag 2 cdb 0x0 data 4096 out
res 40/00:10:e8:64:ff/00:00:07:00:00/40 Emask 0x2 (HSM violation)
ata1: soft resetting port
ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata1.00: configured for UDMA/133
ata1: EH complete
SCSI device sda: 145226112 512-byte hdwr sectors (74356 MB)
sda: Write Protect is off
sda: Mode Sense: 00 3a 00 00
SCSI device sda: write cache: enabled, read cache: enabled, doesn't support DPO or FUA

Sorry, the last violation above was with a medium in the PX-760A (not attached to the JMicron).
Now I'll remove the medium and wait for another violation.

A summary of the IDE/SATA ports on the machine:
- ata1 to ata4: ICH7R in AHCI mode (1f.2, irq 1275)
WD740A drive is connected to ata1 and ata2 is bridged to the EZ-Raid chip.
- ata5 and ata6: JMicron JMB363? (irq 17)
No drives connected
- ata7 and ata8: Intel ICH7 in legacy port address (irq 14/15)
Plextor PX-760a is connected to ata7
- ata9: JMicron IDE (irq 16)
Plextor px-708a as master and px-130a as slave.

For the following HSM violation:
ata1.00: cmd 61/08:10:e8:64:ff/00:00:07:00:00/40 tag 2 cdb 0x0 data 4096 out
res 40/00:10:e8:64:ff/00:00:07:00:00/40 Emask 0x2 (HSM violation)
It is related to the WD drive. cmd 61 is FPDMA_WRITE. Maybe Tejun knows better about it...

Yeah, that's because of a faulty NCQ implementation in the WD740ADFD. Please post the result of 'hdparm -I /dev/sda'. I'll add it to the blacklist. This is a separate problem from the ATAPI HSM violation tho.

> <---- Starting to burn an other DVD iso with PX760A ---->
> Jun 14 14:27:36 freax cdrom: This disc doesn't have any tracks I recognize!
> Jun 14 14:30:01 freax cron[19764]: (root) CMD (test -x /usr/sbin/run-crons && /usr/sbin/run-crons )
> Jun 14 14:35:54 freax wpa_cli: interface ath0 DISCONNECTED
> Jun 14 14:35:54 freax wpa_cli: interface ath0 CONNECTED
> Jun 14 14:35:59 freax ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
> Jun 14 14:35:59 freax ata7.00: cmd a0/01:00:00:00:00/00:00:00:00:00/a0 tag 0 cdb 0xad data 4 in
> Jun 14 14:35:59 freax res 40/00:03:00:00:00/00:00:00:00:00/a0 Emask 0x4 (timeout)
> Jun 14 14:36:06 freax ata7: port is slow to respond, please be patient (Status 0xd8)
> Jun 14 14:36:11 freax ata7: soft resetting port
> Jun 14 14:36:12 freax ata7.00: configured for UDMA/66
> Jun 14 14:36:12 freax ata7: EH complete
> <---- DVD burning is finished ---->
==========================
cdb 0xad is READ_DVD_STRUCTURE. I guess we can ignore this timeout during DVD burning for the moment.

> Okay, when talking about the beast... it occures:
> ata9.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2
> ata9.01: (BMDMA stat 0x4)
> ata9.01: cmd a0/01:00:00:00:00/00:00:00:00:00/b0 tag 0 cdb 0x4a data 8 in
> res 00/00:00:00:00:00/00:00:00:00:00/00 Emask 0x2 (HSM violation)
> ata9: soft resetting port
> ata9.00: configured for UDMA/33
> ata9.01: configured for UDMA/100
> ata9: EH complete
> while having a media in PX130A
> /dev/sr2 on /media/VirtualBox-WinXP type udf (ro,noexec,nosuid,nodev,uid=1001,gid=100,umask=000,iocharset=utf8)
> but I didn't notice a stuttering... maybe I was to focused on learning Enzymekinetics...
=================================
Again this HSM violation only occurs with the px-130a drive and cdb 4a (GET_EVENT_STATUS_NOTIFICATION). I guess this is a problem of the px-130a when doing ATAPI DMA for this specific command. Maybe we can limit this drive to ATAPI_DMA_RW_ONLY...

But GET_EVENT_STATUS_NOTIFICATION mostly works ok. The HSM violation is rare and EH recovered from it nicely. So I am wondering if limiting to ATAPI_DMA_RW_ONLY is necessary...

For the NCQ HSM violation, maybe Tejun has a better idea.

>Yeah, that's because faulty NCQ implementation in the WD740ADFD. Please post
>the result of 'hdparm -I /dev/sda'. I'll add it to blacklist. This is a
>separate problem from the ATAPI HSM violation tho.
hdparm -I /dev/sda
/dev/sda:
ATA device, with non-removable media
Model Number: WDC WD740ADFD-00NLR1
Serial Number: WD-WMANS1333464
Firmware Revision: 20.07P20
Standards:
Used: ATA/ATAPI-7 published, ANSI INCITS 397-2005
Supported: 7 6 5 4
Configuration:
Logical max current
cylinders 16383 16383
heads 16 16
sectors/track 63 63
--
CHS current addressable sectors: 16514064
LBA user addressable sectors: 145226112
LBA48 user addressable sectors: 145226112
device size with M = 1024*1024: 70911 MBytes
device size with M = 1000*1000: 74355 MBytes (74 GB)
Capabilities:
LBA, IORDY(can be disabled)
Queue depth: 32
Standby timer values: spec'd by Standard, with device specific minimum
R/W multiple sector transfer: Max = 16 Current = 16
Recommended acoustic management value: 128, current value: 254
DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6
Cycle time: min=120ns recommended=120ns
PIO: pio0 pio1 pio2 pio3 pio4
Cycle time: no flow control=120ns IORDY flow control=120ns
Commands/features:
Enabled Supported:
* SMART feature set
Security Mode feature set
* Power Management feature set
* Write cache
* Look-ahead
* Host Protected Area feature set
* WRITE_BUFFER command
* READ_BUFFER command
* NOP cmd
* DOWNLOAD_MICROCODE
Power-Up In Standby feature set
* SET_FEATURES required to spinup after power up
SET_MAX security extension
* Automatic Acoustic Management feature set
* 48-bit Address feature set
* Device Configuration Overlay feature set
* Mandatory FLUSH_CACHE
* FLUSH_CACHE_EXT
* SMART error logging
* SMART self-test
* General Purpose Logging feature set
* SATA-I signaling speed (1.5Gb/s)
* Native Command Queueing (NCQ)
* Host-initiated interface power management
* Phy event counters
DMA Setup Auto-Activate optimization
* Software settings preservation
* SMART Command Transport (SCT) feature set
* SCT Long Sector Access (AC1)
* SCT LBA Segment Access (AC2)
* SCT Error Recovery Control (AC3)
* SCT Features Control (AC4)
* SCT Data Tables (AC5)
unknown 206[12]
Security:
Master password revision code = 65534
supported
not enabled
not locked
frozen
not expired: security count
not supported: enhanced erase
Checksum: correct
Is this a big drawback for the drive's speed?
Shouldn't WD be notified about the bad implementation so they can release a new firmware for the drive (if possible)?
regards
Bjoern
Anything more I can do to help you?
Thanks a lot
Bjoern

Created attachment 11760 [details]
Patch for not stopping DMA if the device is busy
After checking the trace, maybe we should not stop DMA if the device is still busy.
Hi Bjoern,
Could you please try the attached patch and see if the "ata9.01 HSM violation" still occurs, thanks.
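For readers following the trace in the description: `host_stat` is the bus-master DMA (BMDMA) status register and `dev_stat` is the ATA status register. Below is a minimal sketch of how those values decode, assuming the standard bit assignments (BMDMA: 0x01 active, 0x02 error, 0x04 interrupt; ATA status: 0x80 BSY, 0x40 DRDY, 0x08 DRQ, 0x01 ERR). The helper names are hypothetical and this is not the attached patch, only an aid for reading the log.

```c
#include <stdbool.h>
#include <stdint.h>

/* Standard BMDMA status register bits. */
enum { BMDMA_ACTIVE = 0x01, BMDMA_ERR = 0x02, BMDMA_INTR = 0x04 };

/* Standard ATA status register bits. */
enum { ST_ERR = 0x01, ST_DRQ = 0x08, ST_DRDY = 0x40, ST_BSY = 0x80 };

/* Hypothetical helper: an interrupt is flagged while the DMA engine
 * still reports the transfer as active. */
static bool bmdma_intr_while_active(uint8_t host_stat)
{
    return (host_stat & BMDMA_INTR) && (host_stat & BMDMA_ACTIVE);
}

/* Hypothetical helper: device is ready and idle
 * (DRDY set, BSY and DRQ clear). */
static bool dev_idle_ready(uint8_t dev_stat)
{
    return (dev_stat & (ST_BSY | ST_DRQ | ST_DRDY)) == ST_DRDY;
}
```

With the values from the description, host_stat 0x5 in the failing transaction is the only sample with both the interrupt and active bits set; the normal transaction shows 0x1, 0x1, then 0x4. The failing transaction then reads dev_stat 0x0, while the normal one completes with 0x50.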
Created attachment 11761 [details]
Patch for not stopping DMA if the device is busy
Hi Bjoern,
Sorry, please ignore my previous patch and use this instead.
Could you please try the attached revised patch and see if the "ata9.01 HSM violation" still occurs, thanks.
Just booted the revised patch... I'll redo the first patch and try the new one.
Thanks
Bjoern

So far no "ata9.01 HSM violation" has occurred.
Seems as if your 02_jmicron_irq.diff patch did it.
So all these errors result from buggy software implementations in the drives (WD and Plextor)? And you guys now have to work around them, am I right?
Or did I get something wrong?
Thanks for the help
Bjoern

By the way, is there a way to upgrade the firmware of the hard drives?
Thanks
Bjoern

Still no "ata9.01 HSM violation", but:
ata1.00: exception Emask 0x2 SAct 0x1fe00 SErr 0x0 action 0x2 frozen
ata1.00: (spurious completions during NCQ issue=0x0 SAct=0x1fe00 FIS=004040a1:00040000)
ata1.00: cmd 61/18:48:d0:4e:6d/00:00:05:00:00/40 tag 9 cdb 0x0 data 12288 out
res 40/00:90:28:a6:6c/00:00:05:00:00/40 Emask 0x2 (HSM violation)
ata1.00: cmd 61/10:50:f0:4e:6d/00:00:05:00:00/40 tag 10 cdb 0x0 data 8192 out
res 40/00:90:28:a6:6c/00:00:05:00:00/40 Emask 0x2 (HSM violation)
ata1.00: cmd 61/08:58:48:9c:6d/00:00:05:00:00/40 tag 11 cdb 0x0 data 4096 out
res 40/00:90:28:a6:6c/00:00:05:00:00/40 Emask 0x2 (HSM violation)
ata1.00: cmd 61/08:60:b0:9c:6d/00:00:05:00:00/40 tag 12 cdb 0x0 data 4096 out
res 40/00:90:28:a6:6c/00:00:05:00:00/40 Emask 0x2 (HSM violation)
ata1.00: cmd 61/28:68:90:9d:6d/00:00:05:00:00/40 tag 13 cdb 0x0 data 20480 out
res 40/00:90:28:a6:6c/00:00:05:00:00/40 Emask 0x2 (HSM violation)
ata1.00: cmd 61/08:70:50:a1:6d/00:00:05:00:00/40 tag 14 cdb 0x0 data 4096 out
res 40/00:90:28:a6:6c/00:00:05:00:00/40 Emask 0x2 (HSM violation)
ata1.00: cmd 61/08:78:a8:a1:6d/00:00:05:00:00/40 tag 15 cdb 0x0 data 4096 out
res 40/00:90:28:a6:6c/00:00:05:00:00/40 Emask 0x2 (HSM violation)
ata1.00: cmd 61/08:80:b0:a1:6d/00:00:05:00:00/40 tag 16 cdb 0x0 data 4096 out
res 40/00:90:28:a6:6c/00:00:05:00:00/40 Emask 0x2 (HSM violation)
ata1: soft resetting port
ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata1.00: configured for UDMA/133
ata1: EH complete
SCSI device sda: 145226112 512-byte hdwr sectors (74356 MB)
sda: Write Protect is off
sda: Mode Sense: 00 3a 00 00
SCSI device sda: write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Is there a way to turn off NCQ?
regards
blubbi

Created attachment 11774 [details]
NCQ-blacklist.patch
NCQ will be turned off automatically after a few such incidents and here's a patch to blacklist NCQ for the drive. I'll submit the patch upstream soon. Thanks.
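As a reading aid for the "spurious completions during NCQ" reports quoted in this thread: SAct is a 32-bit mask with one bit per outstanding NCQ tag, and for the NCQ commands (cmd 61) the tag is also carried in bits 7:3 of the second byte of the `cmd` line. A minimal sketch of decoding both follows; the helper names are hypothetical and not libata functions.

```c
#include <stdint.h>

/* Hypothetical helper: collect the tag numbers whose bits are set in an
 * SActive mask. Writes them in ascending order and returns the count. */
static int sact_to_tags(uint32_t sact, int tags[32])
{
    int n = 0;
    for (int tag = 0; tag < 32; tag++)
        if (sact & (UINT32_C(1) << tag))
            tags[n++] = tag;
    return n;
}

/* Hypothetical helper: NCQ tag stored in bits 7:3 of the sector count byte. */
static int ncq_tag_from_nsect(uint8_t nsect)
{
    return nsect >> 3;
}
```

For example, SAct 0xf800009 from the first report decodes to tags 0, 3 and 23 through 27, which are the seven commands listed there, and the byte 0xb8 in `cmd 61/08:b8:...` gives tag 23.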
That did the trick. No more "spurious completions during NCQ" issues in dmesg.
Thanks a lot.
Do you know if the underlying cause will be fixed? Because disabling it is just a workaround IMHO.
regards
blubbi

> So far no "ata9.01 HSM violation" has occurred.
Hi Bjoern,
Before submitting the workaround patch for the JMicron, I would like to make sure whether the problem is specific to JMicron or might affect other controllers. Could you please disconnect the px-130a from the JMicron IDE port and reconnect it to the Intel IDE port (that is, connect the px-130a to the same port as PX-760a).
Please see if the px-130a causes any "HSM violation" with the Intel port, both with/without medium in the px-130a drive. Thanks.
I'll test it and post my results.
Should I test it with the patched kernel or unpatched?

> Should I test it with the patched kernel or unpatched?
Both are ok, but unpatched kernel preferred.
Regarding spurious completion: The firmware is faulty and violates the NCQ protocol, so there's nothing much more to do from the driver side than not using it. You can scream at the vendor for a firmware upgrade tho. :-)

Tejun Heo:
If you tell me what evidence I should throw at WD, I'll do it.
Do you think this thread is evidence enough?
regards
blubbi

Albert Lee:
Here is the requested information.
I disconnected the PX-760a and connected the px-130a to its port.
So now here is the HSM violation again. This was done with the patched kernel.
ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
ata7.00: (BMDMA stat 0x24)
ata7.00: cmd a0/01:00:00:00:00/00:00:00:00:00/a0 tag 0 cdb 0x4a data 8 in
res 7f/7f:7f:7f:7f:7f/00:00:00:00:00/7f Emask 0x2 (HSM violation)
ata7: soft resetting port
ATA: abnormal status 0x7F on port 0x00000000000101f7
ata7.00: configured for UDMA/33
ata7: EH complete
ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
ata7.00: (BMDMA stat 0x24)
ata7.00: cmd a0/01:00:00:00:00/00:00:00:00:00/a0 tag 0 cdb 0x4a data 8 in
res 7f/7f:7f:7f:7f:7f/00:00:00:00:00/7f Emask 0x2 (HSM violation)
ata7: soft resetting port
ATA: abnormal status 0x7F on port 0x00000000000101f7
ata7.00: configured for UDMA/33
ata7: EH complete

(In reply to comment #32)
> If you tell me what evidence I should throw at WD, I'll do it.
> Do you think this thread is evidence enough?
I think quoting the error message and telling them that their drives are blacklisted for NCQ should do the trick.

Created attachment 11810 [details]
Patch to limit ATAPI DMA to R/W only for Plextor PX-130A"
Hi Bjoern,
Hmm, the problem also happens with the Intel controller. Maybe we should blacklist the Plextor drive instead of workaround from the controller side.
Could you please keep the px-130a drive connected to the Intel ICH7 port and try the attached "Limit ATAPI DMA to R/W only for Plextor PX-130A" patch, preferably with ATA_DEBUG/ATA_VERBOSE_DEBUG turned off.
Please check if the px-130a ever causes any "HSM violation" with the new patch. Thanks.
Albert Lee:
Okay, I'll check it.

Tejun Heo:
Request to WD is out. Let's see what happens.
Thanks
Bjoern

Created attachment 11816 [details]
libata-core.c_bug_8627.diff
Hmm, I tried to combine the two patches "Limit ATAPI DMA to R/W only for Plextor PX-130A" and "NCQ-blacklist.patch"
but I get an error during compilation:
CC drivers/ata/libata-core.o
drivers/ata/libata-core.c: In function 'ata_dev_configure':
drivers/ata/libata-core.c:1788: warning: comparison of distinct pointer types lacks a cast
drivers/ata/libata-core.c: At top level:
drivers/ata/libata-core.c:3390: error: expected '}' before 'ata_device_blacklist'
make[2]: *** [drivers/ata/libata-core.o] Error 1
make[1]: *** [drivers/ata] Error 2
make: *** [drivers] Error 2
So here's the diff I have created to patch the latest 2.6.21.5 kernel.
Where's the problem? Where's my mistake?
--- libata-core.c_org 2007-06-20 12:56:45.000000000 +0200
+++ libata-core.c_patched 2007-06-20 13:37:31.000000000 +0200
@@ -3362,6 +3362,7 @@ static const struct ata_blacklist_entry
/* Weird ATAPI devices */
{ "TORiSAN DVD-ROM DRD-N216", NULL, ATA_HORKAGE_MAX_SEC_128 |
ATA_HORKAGE_DMA_RW_ONLY },
+ { "PLEXTOR DVD-ROM PX-130A", NULL, ATA_HORKAGE_DMA_RW_ONLY },
/* Devices we expect to fail diagnostics */
@@ -3379,11 +3380,14 @@ static const struct ata_blacklist_entry
{ "HTS541060G9SA00", "MB3OC60D", ATA_HORKAGE_NONCQ, },
{ "HTS541080G9SA00", "MB4OC60D", ATA_HORKAGE_NONCQ, },
{ "HTS541010G9SA00", "MBZOC60D", ATA_HORKAGE_NONCQ, },
-
+ /* Drives which do spurious command completion */
+ { "HTS541612J9SA00", "SBDIC7JP", ATA_HORKAGE_NONCQ, },
+ { "WDC WD740ADFD-00NLR1", NULL, ATA_HORKAGE_NONCQ, },
+
/* Devices with NCQ limits */
/* End Marker */
- { }
+ { }ata_device_blacklist
};
unsigned long ata_device_blacklisted(const struct ata_device *dev)
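As an aside on how a table like the one patched above is consumed: each entry pairs a model string with an optional firmware revision, and a NULL revision matches any firmware. The following is a minimal user-space sketch of such a lookup, assuming that matching rule; the struct, the flag names and `lookup_horkage` are hypothetical stand-ins, not the kernel's own definitions, and the flag values simply mirror the bit positions in the libata.h excerpt quoted in this thread.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for ATA_HORKAGE_NONCQ and ATA_HORKAGE_DMA_RW_ONLY. */
enum { HORKAGE_NONCQ = 1 << 2, HORKAGE_DMA_RW_ONLY = 1 << 4 };

struct blacklist_entry {
    const char *model;
    const char *rev;        /* NULL matches any firmware revision */
    unsigned long horkage;
};

static const struct blacklist_entry blacklist[] = {
    { "PLEXTOR DVD-ROM PX-130A", NULL,       HORKAGE_DMA_RW_ONLY },
    { "HTS541612J9SA00",         "SBDIC7JP", HORKAGE_NONCQ },
    { "WDC WD740ADFD-00NLR1",    NULL,       HORKAGE_NONCQ },
    { NULL, NULL, 0 }       /* end marker */
};

/* Return the quirk flags for a model/revision pair, or 0 if not listed. */
static unsigned long lookup_horkage(const char *model, const char *rev)
{
    for (const struct blacklist_entry *e = blacklist; e->model; e++) {
        if (strcmp(e->model, model) != 0)
            continue;
        if (e->rev == NULL || strcmp(e->rev, rev) == 0)
            return e->horkage;
    }
    return 0;
}
```

Under this sketch, the WD740ADFD-00NLR1 entry with a NULL revision disables NCQ for every firmware of that model, while the HTS541612J9SA00 entry applies only to firmware SBDIC7JP.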
Created attachment 11817 [details]
proper libata-core.c_bug_8627.diff
Sorry, found the mistake:
+ { }ata_device_blacklist
attached the corrected version
Regards
blubbi
But still I get:
drivers/ata/libata-core.c: In function 'ata_dev_configure':
drivers/ata/libata-core.c:1788: warning: comparison of distinct pointer types lacks a cast
Something to worry about?

Albert Lee:
Okay, no more "HSM violation" with your latest patch, neither on the Intel nor on the JMicron controller.
Thanks
Bjoern

Hi Bjoern,
Tejun has a new patch for ATAPI DMA. Could you please drop my previous patches and check whether the px-130a ever times out with Tejun's new patch applied? (Please test with the px-130a connected to the ICH7 and to the JMicron.)
The new patch to test: https://bugzilla.novell.com/attachment.cgi?id=147389
(The original bug: https://bugzilla.novell.com/show_bug.cgi?id=229260)
Thanks for your help/patience.

No problem. But I guess I'll still have to apply the NCQ blacklist patch.
I have to thank you for your help.
Best regards
Bjoern

I can't apply the last part of the patch:
diff --git a/include/linux/libata.h b/include/linux/libata.h
index 745c4f9..e9659ff 100644
--- a/include/linux/libata.h
+++ libata.h
@@ -298,7 +298,6 @@ enum {
ATA_HORKAGE_NODMA = (1 << 1), /* DMA problems */
ATA_HORKAGE_NONCQ = (1 << 2), /* Don't use NCQ */
ATA_HORKAGE_MAX_SEC_128 = (1 << 3), /* Limit max sects to 128 */
- ATA_HORKAGE_DMA_RW_ONLY = (1 << 4), /* ATAPI DMA for RW only */
};
enum hsm_task_states {
because these lines do not appear in the vanilla 2.6.21.5 sources.
regards
blubbi

Though the line
ATA_HORKAGE_DMA_RW_ONLY = (1 << 4), /* ATAPI DMA for RW only */
is just removed, I am wondering if I need to apply the patch. But do I need the other three lines above?
thanks
Bjoern

Yes, Tejun's patch is against 2.6.22-rc5. If convenient, please test it with 2.6.22-rc5; otherwise maybe apply the following part of the patch manually.
/**
 * ata_check_atapi_dma - Check whether ATAPI DMA can be supported
 * @qc: Metadata associated with taskfile to check
@@ -4124,33 +4120,19 @@ static void ata_fill_sg(struct ata_queued_cmd *qc)
 int ata_check_atapi_dma(struct ata_queued_cmd *qc)
 {
     struct ata_port *ap = qc->ap;
-    int rc = 0; /* Assume ATAPI DMA is OK by default */
+
+    /* Don't allow DMA if it isn't multiple of 16 bytes. Quite a
+     * few ATAPI devices choke on such DMA requests.
+     */
+    if (unlikely(qc->nbytes & 15))
+        return 1;

     if (ap->ops->check_atapi_dma)
-        rc = ap->ops->check_atapi_dma(qc);
+        return ap->ops->check_atapi_dma(qc);

-    return rc;
+    return 0;
 }

+
 /**
  * ata_qc_prep - Prepare taskfile for submission
  * @qc: Metadata associated with taskfile to be prepared
diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
index c228df2..4ddf00c 100644
--- a/drivers/ata/libata-scsi.c
+++ b/drivers/ata/libata-scsi.c
@@ -2384,11 +2384,6 @@ static unsigned int atapi_xlat(struct ata_queued_cmd *qc)
     int using_pio = (dev->flags & ATA_DFLAG_PIO);
     int nodata = (scmd->sc_data_direction == DMA_NONE);

-    if (!using_pio)
-        /* Check whether ATAPI DMA is safe */
-        if (ata_check_atapi_dma(qc))
-            using_pio = 1;
-
     memset(qc->cdb, 0, dev->cdb_len);
     memcpy(qc->cdb, scmd->cmnd, scmd->cmd_len);

@@ -2401,19 +2396,22 @@ static unsigned int atapi_xlat(struct ata_queued_cmd *qc)
     }

     qc->tf.command = ATA_CMD_PACKET;
+    qc->nbytes = scmd->request_bufflen;
+
+    /* check whether ATAPI DMA is safe */
+    if (!using_pio && ata_check_atapi_dma(qc))
+        using_pio = 1;

-    /* no data, or PIO data xfer */
     if (using_pio || nodata) {
+        /* no data, or PIO data xfer */
         if (nodata)
             qc->tf.protocol = ATA_PROT_ATAPI_NODATA;
         else
             qc->tf.protocol = ATA_PROT_ATAPI;
         qc->tf.lbam = (8 * 1024) & 0xff;
         qc->tf.lbah = (8 * 1024) >> 8;
-    }
-
-    /* DMA data xfer */
-    else {
+    } else {
+        /* DMA data xfer */
         qc->tf.protocol = ATA_PROT_ATAPI_DMA;
         qc->tf.feature |= ATAPI_PKT_DMA;

@@ -2422,8 +2420,6 @@ static unsigned int atapi_xlat(struct ata_queued_cmd *qc)
         qc->tf.feature |= ATAPI_DMADIR;
     }

-    qc->nbytes = scmd->request_bufflen;
-
     return 0;
 }

Created attachment 11845 [details]
my libata.h
Still no go using 2.6.22-rc5
the libata.h file seems to differ completely from the one Tejun is using:
libata.h.rej
***************
*** 298,304 ****
ATA_HORKAGE_NODMA = (1 << 1), /* DMA problems */
ATA_HORKAGE_NONCQ = (1 << 2), /* Don't use NCQ */
ATA_HORKAGE_MAX_SEC_128 = (1 << 3), /* Limit max sects to 128 */
- ATA_HORKAGE_DMA_RW_ONLY = (1 << 4), /* ATAPI DMA for RW only */
};
enum hsm_task_states {
--- 298,303 ----
ATA_HORKAGE_NODMA = (1 << 1), /* DMA problems */
ATA_HORKAGE_NONCQ = (1 << 2), /* Don't use NCQ */
ATA_HORKAGE_MAX_SEC_128 = (1 << 3), /* Limit max sects to 128 */
};
enum hsm_task_states {
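For context on what the patch under test changes: the part of Tejun's patch quoted earlier in this thread refuses ATAPI DMA whenever the transfer length is not a multiple of 16 bytes (`qc->nbytes & 15`), so such commands fall back to PIO. A minimal sketch of that size test in isolation follows; `atapi_dma_len_ok` is a hypothetical name, not a libata function.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper mirroring the size test in the quoted hunk:
 * true when the request length is a multiple of 16 bytes. */
static bool atapi_dma_len_ok(size_t nbytes)
{
    return (nbytes & 15) == 0;
}
```

Assuming the check is applied as quoted, the failing GET_EVENT_STATUS_NOTIFICATION requests in this report ("data 8") and the READ_DVD_STRUCTURE request ("data 4") would not use DMA, while a 4096-byte transfer still would.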
STOP... my fault... I tried to patch the wrong libata.h in /driver/ata/
Sorry
Bjoern

Okay, applied against 2.6.21.5 too.
root@freax $ patch -p0 < bug-229260_update-ata_check_atapi_dma.patch
patching file ../linux-2.6.21.5/drivers/ata/libata-core.c
Hunk #1 succeeded at 1787 with fuzz 2 (offset -259 lines).
Hunk #2 succeeded at 3356 with fuzz 1 (offset -420 lines).
Hunk #3 succeeded at 3670 (offset -432 lines).
Hunk #4 succeeded at 3688 (offset -432 lines).
patching file ../linux-2.6.21.5/drivers/ata/libata-scsi.c
Hunk #1 succeeded at 2445 (offset 61 lines).
Hunk #2 succeeded at 2457 (offset 61 lines).
Hunk #3 succeeded at 2481 (offset 61 lines).
patching file ../linux-2.6.21.5/include/linux/libata.h
Hunk #1 succeeded at 312 (offset 14 lines).
In 2.6.22 madwifi-ng is not working for me. So I'd rather test with 2.6.21 until the WLAN drivers are included in the kernel and I don't have to hassle around with madwifi-ng.
regards
Bjoern

So far no errors on the JMicron controller. Now I'll test the Intel controller.
regards
Bjoern

Hi Bjoern,
There is one way to upgrade the firmware of hard drives, provided that the manufacturer releases a newer firmware, which happens ONLY if the original firmware is f**ked up! The ATA spec itself describes a firmware update mechanism with the DOWNLOAD MICROCODE command. The microcode is not in the serial flash ROM; there is just a small boot code. The firmware that affects you is actually on the magnetic media of the disk, in an area inaccessible to end users (part of the maintenance tracks?). Updating it is like overwriting a file on your disk. What is worse, ATA reports that the firmware can be up to 33,553,920 bytes, which is 31.99 MB. This is a huge space to fix any error, if the controller logic is not completely broken and if the controller processor has the processing power to meet a slightly increased overhead. But if the disk becomes faster and more responsive, then you will not buy their latest and greatest product. Well, they do not want that...
Note here that the ATA spec states that if a device receives a firmware modification, all error log data shall be discarded and the device error count for the life of the device shall be reset to zero. So take a copy of those before updating, if an update is provided, as errors sometimes occur at a specific LBA and it is in your interest to keep this LBA in mind.

In the past IBM, Seagate and WD have given out firmware for some drives. The most famous were the IBM Deathstar firmware updates. So far WD does not supply such a firmware for your drive... Their current firmware updates are hidden on their site and given out through their knowledge base articles, where the firmware and its proprietary programmer are in the form of DOS executables... Firmware updates have been given for some WD drives, such as the first WD Scorpios which had problems with power management, for some 2500KS SATA disks with the wrong buffer size (8MB instead of 16MB!!) and of course for the YS series regarding RAID issues (WD has given out a firmware for RAID problems in the past too, as the huge recovery time made many PATA drives fall out of RAID arrays). As you can understand, they give updates only if there is a major issue that will force WD to replace its drives. WD has gone from bad to worse over the last couple of years as they have started making frequent firmware errors!

For my part, I want to express my sympathy for your crappy hardware! I had both a bad Plextor DVD and a bad Raptor... Both were returned though. The Plextor replacement DVD showed the same problems as the original after a while and it went in the garbage. I also had a WD360GD-00FNA0 and I still have a WD360GD-00FLC0 (bought a couple of years after the first). After buying them, I saw that the first one was limited to UDMA 5 transfers in Linux, as I got "applying bridge limits" because of the Marvell 88i8030-TBC SATA bridge chip. Also it could not accept SATA latch cables, as the SATA data connector does not have the required rails.
So SATA latch cables cannot fit the connector and only normal cables can be installed... In contrast, the second one (FLC0) has UDMA 6 with the updated Marvell 88i8030-TBC1 bridge and has the rails for latch cables. Both Raptors were bought for SATA I 150MB/s support, but the first only supported the legacy PATA UDMA 5 at 100MB/s. Raptors were supposed to kill SCSI, but with UDMA 5 for the first 360GD and no NCQ for the 740ADFD the only thing they killed was our pockets.

Luckily this time, in contrast with the DVD, I was able to do the last act of dignity remaining: I killed the first Raptor on warranty and got a Seagate replacement! To be honest I got two SATA II 160GB WDs, but one of them worked in UDMA 5! Yes, a 2007 native (no bridge) SATA II drive (WD1600JS-60NCB1) only gave UDMA 5. So it was replaced again with a Seagate...

Despite the replacement I got so angry, because WD fooled me for some time (UDMA 6 $$$ price for a UDMA 5 product), that I have named the WD360GD-00FNA0 and WD740ADFD-00NLR1 "Traptor", because they were built to trap computer users. I was seriously thinking of buying a 740ADFD disk in the past, but for some reason I did not. So no WD again for me! I guess that is what you should do if they do not give a firmware update that solves this problem. WD should have already released a firmware for such an expensive drive with four features (three now!): 10000RPM, 16MB buffer, NCQ and RAFF. If they give a fix (I would not count on it though), then remember to blacklist only WDC WD740ADFD-00NLR1 with firmware 20.07P20.

Finally, your hard disk serial number is useless to anyone (no one needs the full S/N as proof that you actually own the drive; the firmware number is already enough). Next time do not submit personal data such as a full S/N or full MAC addresses in bug reports; instead do it like this: WD-WMANS1******.

Thanks for this reply. I appreciate your tips and your explanations. So far WD did not answer my request... but I'll continue to nag them.
Thanks a lot.

best regards
Bjoern

> Now I'll test the intel controller.
Hi Bjoern,
Any news with the intel controller?
Sorry, I forgot to post the results: none. Everything works great... Intel and JMicron. No more HSM violations in dmesg, just the one after burning a DVD, see comment #16 ( http://bugzilla.kernel.org/show_bug.cgi?id=8627#c16 ). Sorry for the late answer.

Regards & thanks for your help!
Bjoern

Hi Tejun,

Since both your NCQ-blacklist patch and your patch limiting ATAPI DMA to multiples of 16 bytes have been accepted, maybe we can close this bug...

I would agree ;-) Thanks for the help, guys!

regards
Bjoern Olausson

Thanks a lot for driving this, Albert. Closing.