Bug 8244

Summary: AOpen cdrom 56X/AKH qc timeout under libata
Product: IO/Storage Reporter: Yarema (yupadmin)
Component: Serial ATAAssignee: Albert Lee (albertcc)
Status: CLOSED CODE_FIX    
Severity: normal CC: alan, htejun, list, patmans, rohandhruva
Priority: P2    
Hardware: i386   
OS: Linux   
Kernel Version: 2.6.20.3 Subsystem:
Regression: --- Bisected commit-id:
Attachments: dmesg from 20.03.2007
New dmesg output with the same 2.6.20.3 kernel with applied Albert's patches
otput from "cdrecord dev=/dev/hdh -inq -vvVV"
output from "cdrecord dev=/dev/hda -inq -vvVV"
dmesg 21.03.2007 18.38 (before experiment)
messages 21.03.2007 18.38 (before experiment)
messages 21.03.2007 18.53 (after experiment)
dmesg 21.03.2007 18.55 (after experiment)
dmesg scsi_id removed
Debug patch to filter out INQUIRY with EVPD = 1
dmesg from kernel with scsi patch
Debug patch that filters both 0x12 and 0x46
dmesg after boot second scsi patch
Patch to do REQUEST SENSE by polling PIO
dmesg from the proposed Albert's patch
Debug patch to show actual CDB
Debug patch to show CDB parameters
Debug patch to show CDB parameters
dmesg from filtering patch
one more dmesg with the latest patch
Proposed patch to simulate REQUEST SENSE
dmesg from proposed simulation REQUEST SENSE
dmesg from proposed simulation REQUEST SENSE
Patch to remove tf_read() from atapi_eh_request_sense().
dmesg
dmesg, cdrom attached to ide port
Proposed patch to clear task file before REQUEST SENSE
dmesg: cdrom and dvd are on on cable attached to onboard ide port
dmasg - cdrom attached to Promise port, DVD - to onboard ide pord
Complete dmesg output
lspci -vvv output
cdrecord dev=/dev/sr0 -inq -vvVV output

Description Yarema 2007-03-21 01:50:25 UTC
Most recent kernel where this bug did *NOT* occur:
Distribution:
Hardware Environment:
Software Environment:
Problem Description:

Steps to reproduce:
Comment 1 Yarema 2007-03-21 01:52:43 UTC
Created attachment 10881 [details]
dmesg from 20.03.2007
Comment 2 Yarema 2007-03-21 01:55:38 UTC
Created attachment 10882 [details]
New dmesg output with the same 2.6.20.3 kernel with applied Albert's patches
Comment 3 Albert Lee 2007-03-21 04:19:04 UTC
Hi Yarema,

Thanks a lot. But the part before
  [   39.188609] ata4: protocol 5 task_state 1
  [   39.188614] ata4: host_stat 0x0
is truncated...

Could you please
1. reboot the system into the kernel with the IDE driver 
  (the kernel that has all drives working)
2. Run "cdrecord -scanbus", check which one is the AOpen 56X/AKH drive.
3. Run something like "cdrecord dev=3,0,0 -inq -vvVV" 
   (Please change dev=3,0,0 to the real value returned by "cdrecord -scanbus" 
for the AOpen drive.)

4. Please attach the output of "cdrecord dev=?,0,0 -inq -vvVV" to the bugzilla.

Thanks.
Comment 4 Yarema 2007-03-21 04:56:49 UTC
cdrecord -scanbus give me no output, but here is the output from "cdrecord
--devices" :

Beginning native device scan. This may take a while if devices are busy...
Error trying to open /dev/hde exclusively (Device or resource busy)... retrying
in 1 second.
Error trying to open /dev/hde exclusively (Device or resource busy)... retrying
in 1 second.
Error trying to open /dev/hde exclusively (Device or resource busy)... retrying
in 1 second.
Error trying to open /dev/hde exclusively (Device or resource busy)... retrying
in 1 second.
Error trying to open /dev/hde exclusively (Device or resource busy)... giving up.
Error trying to open /dev/hdf exclusively (Device or resource busy)... retrying
in 1 second.
Error trying to open /dev/hdf exclusively (Device or resource busy)... retrying
in 1 second.
Error trying to open /dev/hdf exclusively (Device or resource busy)... retrying
in 1 second.
Error trying to open /dev/hdf exclusively (Device or resource busy)... retrying
in 1 second.
Error trying to open /dev/hdf exclusively (Device or resource busy)... giving up.
wodim: Warning: controller returns wrong size for CD capabilities page.
wodim: Overview of accessible drives (2 found) :
----------------------------------------------------------------------
0    dev='/dev/hda'   rwrw-- :  'HP'  'DVD Writer 200j'
1    dev='/dev/hdh'   rwrw-- :  'E-IDE'  'CD-ROM 56X/AKH'
----------------------------------------------------------------------


Output from "cdrecord dev=/dev/hdh -inq -vvVV" is in attachment.

Comment 5 Yarema 2007-03-21 05:00:05 UTC
Created attachment 10887 [details]
otput from "cdrecord dev=/dev/hdh -inq -vvVV"
Comment 6 Albert Lee 2007-03-21 05:12:59 UTC
The INQUIRY looks good. The drive said it's 36 bytes. I don't know why the 
SCSI subsystem ever sent INQUIRY of 254 bytes to the AOpen drive...
=========
Executing 'inquiry' command on Bus 3 Target 1, Lun 0 timeout 40s
CDB:  12 00 00 00 24 00
cmd finished after 0.000s timeout 40s
Got 36 (0x24), expecting 36 (0x24) bytes of data.
Received Data:  05 80 00 01 1F 00 00 00 45 2D 49 44 45 20 20 20 43 44 2D 52 4F 
4D 20 35 36 58 2F 41 4B 48 20 20 41 38 30 20
Inquiry Data   : 05 80 00 01 1F 00 00 00 45 2D 49 44 45 20 20 20 43 44 2D 52 
4F 4D 20 35 36 58 2F 41 4B 48 20 20 41 38 30 20
Inquiry Data   : ........E-IDE   CD-ROM 56X/AKH  A80


Comment 7 Yarema 2007-03-21 05:15:09 UTC
Created attachment 10888 [details]
output from "cdrecord dev=/dev/hda -inq -vvVV"
Comment 8 Albert Lee 2007-03-21 05:21:15 UTC
For the HP drive, the INQUIRY looks good, too.
Notice that the HP driver has longer INQUIRY (74 bytes) than the AOpen drive 
(32 bytes)
===========
Executing 'inquiry' command on Bus 0 Target 0, Lun 0 timeout 40s
CDB:  12 00 00 00 24 00
cmd finished after 0.001s timeout 40s
Got 36 (0x24), expecting 36 (0x24) bytes of data.
Received Data:  05 80 00 32 45 00 00 00 48 50 20 20 20 20 20 20 44 56 44 20 57 
72 69 74 65 72 20 32 30 30 6A 20 31 2E 33 36
Inquiry Data   : 05 80 00 32 45 00 00 00 48 50 20 20 20 20 20 20 44 56 44 20 
57 72 69 74 65 72 20 32 30 30 6A 20 31 2E 33 36

Executing 'inquiry' command on Bus 0 Target 0, Lun 0 timeout 40s
CDB:  12 00 00 00 4A 00
cmd finished after 0.001s timeout 40s
Got 74 (0x4A), expecting 74 (0x4A) bytes of data.
Received Data:  05 80 00 32 45 00 00 00 48 50 20 20 20 20 20 20 44 56 44 20 57 
72 69 74 65 72 20 32 30 30 6A 20 31 2E 33 36 30 32 30 37 32 34 00 00 00 00 43 
4E 32 42 36 33 32 33 37 4A FF FF 52 49 43 44 41 33 33 31 31 30 34 38 34 20 00 
00
Inquiry Data   : 05 80 00 32 45 00 00 00 48 50 20 20 20 20 20 20 44 56 44 20 
57 72 69 74 65 72 20 32 30 30 6A 20 31 2E 33 36 30 32 30 37 32 34 00 00 00 00 
43 4E 32 42 36 33 32 33 37 4A FF FF 52 49 43 44 41 33 33 31 31 30 34 38 34 20 
00 00
Inquiry Data   : ...2E...HP      DVD Writer 200j 
1.36020724....CN2B63237J..RICDA33110484 ..
Comment 9 Albert Lee 2007-03-21 05:30:40 UTC
So, the AOpen 56X/AKH looks fine. The question now is, why the SCSI midlayer 
send INQUIRY of "0xfe" length to the AOpen drive, like below?
=========
[   99.289088] CDB (4:0,1,0) 12 01 00 00 fe 00 00 00 00
[   99.289144] ata4: protocol 5 task_state 4 (dev_stat 0x58)
[   99.289592] ata4: protocol 5 task_state 1
[   99.289596] ata4: host_stat 0x4
[   99.289601] ata4: protocol 5 task_state 1 (dev_stat 0x51)
[   99.289604] ata4: protocol 5 task_state 2 (dev_stat 0x51)
[   99.289607] ata4: protocol 5 task_state 3 (dev_stat 0x51)
[   99.289684] ata4: protocol 5 task_state 4 (dev_stat 0x58)
[  102.023103] ata4: protocol 5 task_state 1
[  102.023114] ata4: host_stat 0x0
[  104.367468] ata4: protocol 5 task_state 1
[  104.367479] ata4: host_stat 0x0
[  104.368073] ata4: protocol 5 task_state 1
[  104.368077] ata4: host_stat 0x0
[  104.368681] ata4: protocol 5 task_state 1
[  104.368685] ata4: host_stat 0x0
Comment 10 Yarema 2007-03-21 10:48:34 UTC
Albert, here are my output of dmesg and messages before and after experiment
with "rmmod ata_piix" and "modprobe ata_piix" (see below) 
Comment 11 Yarema 2007-03-21 10:51:08 UTC
Created attachment 10900 [details]
dmesg 21.03.2007 18.38 (before experiment)
Comment 12 Yarema 2007-03-21 10:52:27 UTC
Created attachment 10901 [details]
messages 21.03.2007 18.38 (before experiment)
Comment 13 Yarema 2007-03-21 10:54:10 UTC
Created attachment 10902 [details]
messages 21.03.2007 18.53 (after experiment)
Comment 14 Yarema 2007-03-21 10:56:23 UTC
Created attachment 10903 [details]
dmesg 21.03.2007 18.55 (after experiment)
Comment 15 Albert Lee 2007-03-21 22:34:37 UTC
From the trace, SCSI mid-layer only issues one good INQUIRY at the beginning. 
I've also checked the scsi_scan.c source code, it also looks good.

So, it looks like the bad INQUIRY command
  "CDB (4:0,1,0) 12 01 00 00 fe 00 00 00 00"
is coming from the user space, not the SCSI mid-layer.

Two problems together caused this bug:
1. Ubuntu Linux issues an incorrect INQUIRY command to the drive.
   (Other distros seem to have the INQUIRY correct.)
2. The incorrect INQUIRY happens to cause the AOpen drive frozen.
   (The HP drive is immune from the incorrect INQUIRY command.)


Comment 16 Albert Lee 2007-03-21 22:44:05 UTC
We have two possible solutions here:
1. Fix Ubuntu, such that the incorrect INQUIRY is fixed.
2. Patch kernel, such that the AOpen drives are blacklisted.
   Each INQUIRY is inspected for the blacklisted drives.
   If the INQUIRY looks wrong, the INQUIRY is rejected.

I guess 1. is the preferred solution.
Will ask Jeff and others advice.
Comment 17 Yarema 2007-03-22 05:21:36 UTC
Thank you very much!!! I posted this bug to kubuntu forum and sent it also to
ubuntu-developers list, we'll see what developers say about it. 
Comment 18 Albert Lee 2007-03-22 19:24:25 UTC
(Adding Alan to cc list.)

Hi Yarema,

It seems the INQUIRY comes from the "scsi_id" utility.
Could you please attach "/etc/scsi_id.config" to bugzilla, thanks.
Comment 19 Yarema 2007-03-23 01:49:21 UTC
Hm, but I don't have it, this file I have on my ubuntu box in
/usr/share/doc/udev/examples/scsi_id.config, so my box doesn't use it. 
Comment 20 Albert Lee 2007-03-23 01:58:40 UTC
Hmm, Could you please try
1. mv /sbin/scsi_id /sbin/scsi_id.ori
2. rmmod ata_piix; modprobe ata_piix

and see if the "qc time out" problem persists?
Comment 21 Yarema 2007-03-23 02:13:40 UTC
Well, in ubuntu it is /lib/udev/scsi_id, I changed it to /lib/udev/scsi_id.orig,
and after "rmmod ata_piix; modprobe ata_piix" I've got this:

[  501.777352] ata1.00: disabled
[  501.840307] ata_piix 0000:00:1f.1: version 2.10ac1
[  501.840772] PCI: Setting latency timer of device 0000:00:1f.1 to 64
[  501.841124] ata5: PATA max UDMA/100 cmd 0x000101f0 ctl 0x000103f6 bmdma
0x0001f000 irq 14
[  501.860461] ata6: PATA max UDMA/100 cmd 0x00010170 ctl 0x00010376 bmdma
0x0001f008 irq 15
[  501.860926] scsi4 : ata_piix
[  502.177256] ata5.00: ATAPI, max UDMA/33
[  502.341037] ata5.00: configured for UDMA/33
[  502.341165] scsi5 : ata_piix
[  502.498977] scsi 4:0:0:0: CD-ROM            HP       DVD Writer 200j  1.36
PQ: 0 ANSI: 5
[  502.503937] sr0: scsi3-mmc drive: 32x/32x writer cd/rw xa/form2 cdda tray
[  502.504499] sr 4:0:0:0: Attached scsi CD-ROM sr0
[  502.505038] sr 4:0:0:0: Attached scsi generic sg0 type 5


Note, there nothing about Aopen at all! Aopen cdrom attached to Promise adapter
as it was before in the previous tests.
Comment 22 Albert Lee 2007-03-23 02:24:40 UTC
> Note, there nothing about Aopen at all! Aopen cdrom attached to
> Promise adapter as it was before in the previous tests.

Ah, sorry I forgot the AOpen is still connected to Promise.
But we cannot rmmod pata_pdc2027x since it's where the root hard drive 
connected to.

So, may need a reboot when convient.
Comment 23 Yarema 2007-03-23 02:45:20 UTC
Created attachment 10921 [details]
dmesg scsi_id removed

I rebooted comp with scsi_id removed. It looks like nothing changed. I attached
dmesg with output. Kernel is without your patches. BTW, standard scsi_id.config
in ubuntu has only this:

#
# scsi_id configuration
#
# lower or upper case has no affect on the left side. Quotes (") are
# required for spaces in values. Model is the same as the SCSI
# INQUIRY product identification field. Per the SCSI INQUIRY, the vendor
# is limited to 8 bytes, model to 16 bytes.
#
# The first matching line found is used. Short matches match longer ones,
# if you do not want such a match space fill the extra bytes. If no model
# is specified, only the vendor string need match.
#
# options=<any scsi_id command line options>
# vendor=string[,model=string],options=<per-device scsi_id options>

# some libata drives require vpd page 0x80
vendor="ATA",options=-p 0x80


Yarema
Comment 24 Albert Lee 2007-03-23 03:12:39 UTC
Ok, maybe the device id probe is done by hald on your system, not scsi_id?
(So, we can mv scsi_id.ori back to scsi_id now.)

I've no idea about how to prevent hald from sending INQUIRY...
I will try to make a debug patch to filter out the INQUIRY with EVPD first for 
your test.
Comment 25 Yarema 2007-03-23 03:20:46 UTC
Hi Albert,

I think you are right. Every time I do modprobe ata_piix I see (ps xuaw) that
hald polling /dev/scd1 (cdrom drive): 

107       4868  0.0  0.1   2096   900 ?        S    10:37   0:00
hald-addon-storage: polling /dev/scd1

After that I can use /dev/scd1.

Yarema
Comment 26 Albert Lee 2007-03-23 03:27:08 UTC
Created attachment 10922 [details]
Debug patch to filter out INQUIRY with EVPD = 1

Hi Yarema,
Please try this patch and check if it makes the "qc timeout" go away.
Comment 27 Yarema 2007-03-23 07:15:07 UTC
Created attachment 10923 [details]
dmesg from kernel with scsi patch

It is hard to believe, but it works! What did that patch? I attached dmesg. One
thing I noticed that start k3b takes long time because of hald polling. 

ps xuaw | grep hald

107	  4330	0.6  1.6  10412  8712 ?        Ss   15:04   0:03 /usr/sbin/hald

root	  4331	0.0  0.1   2876  1016 ?        S    15:04   0:00 hald-runner
107	  4342	0.0  0.1   2100   892 ?        S    15:04   0:00
hald-addon-keyboard: listening on /dev/input/event1
107	  4370	0.0  0.1   2104   900 ?        S    15:04   0:00
hald-addon-storage: polling /dev/scd0
107	  4372	0.0  0.1   2104   900 ?        D    15:04   0:00
hald-addon-storage: polling /dev/scd1

Thanks,

Yarema
Comment 28 Yarema 2007-03-23 07:38:12 UTC
Life is not so easy. This is happened when I tried to mount cdrom:

[  361.792873] ata4.01: qc timeout (cmd 0xa0)
[  361.792890] ata4.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
[  361.792902] ata4.01: cmd a0/00:00:00:00:20/00:00:00:00:00/b0 tag 0 cdb 0x46
data 32 in
[  361.792904]          res 51/51:03:00:0c:00/00:00:00:00:00/b0 Emask 0x5 (timeout)
[  368.782440] ata4: port is slow to respond, please be patient (Status 0xd8)
[  391.764019] ata4: port failed to respond (30 secs, Status 0xd8)
[  391.764033] ata4: soft resetting port
[  391.931178] ATA: abnormal status 0x7F on port 0xE08815DF
[  391.942565] ATA: abnormal status 0x7F on port 0xE08815DF
[  392.103558] ata4.01: revalidation failed (errno=-2)
[  392.103571] ata4: failed to recover some devices, retrying in 5 secs
[  397.100065] ata4: soft resetting port
[  397.267219] ATA: abnormal status 0x7F on port 0xE08815DF
[  397.278623] ATA: abnormal status 0x7F on port 0xE08815DF
[  397.603616] ata4.01: configured for MWDMA2
[  397.603654] ata4: EH complete
[  427.558506] ata4.01: qc timeout (cmd 0xa0)
[  427.558522] ata4.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
[  427.558534] ata4.01: cmd a0/00:00:00:00:20/00:00:00:00:00/b0 tag 0 cdb 0x46
data 32 in
[  427.558536]          res 51/51:03:00:00:02/00:00:00:00:00/b0 Emask 0x5 (timeout)
[  434.548060] ata4: port is slow to respond, please be patient (Status 0xd8)
[  457.529657] ata4: port failed to respond (30 secs, Status 0xd8)
[  457.529670] ata4: soft resetting port
[  457.696812] ATA: abnormal status 0x7F on port 0xE08815DF
[  457.708194] ATA: abnormal status 0x7F on port 0xE08815DF
[  457.869183] ata4.01: revalidation failed (errno=-2)
[  457.869195] ata4: failed to recover some devices, retrying in 5 secs
[  462.865694] ata4: soft resetting port
[  463.032849] ATA: abnormal status 0x7F on port 0xE08815DF
[  463.044229] ATA: abnormal status 0x7F on port 0xE08815DF
[  463.369250] ata4.01: configured for MWDMA2
[  463.369288] ata4: EH complete

Cdrom cannot mounted.

Yarema
Comment 29 Albert Lee 2007-03-23 09:41:08 UTC
Created attachment 10924 [details]
Debug patch that filters both 0x12 and 0x46

So, 0x46 also times out? Too bad.

Please check if this new debug patch helps. Thanks.
Comment 30 Yarema 2007-03-23 14:59:00 UTC
Created attachment 10925 [details]
dmesg after boot second scsi patch

Boot was OK, dmesg is attached. I put cd into cdrom drive, dmesg otput is here:


[  294.820052] ata4.01: qc timeout (cmd 0xa0)
[  294.820070] ata4.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen

[  294.820082] ata4.01: cmd a0/00:00:00:00:20/00:00:00:00:00/b0 tag 0 cdb 0x51
data 32 in
[  294.820084]		res 51/51:03:00:00:08/00:00:00:00:00/b0 Emask 0x5
(timeout)
[  301.809610] ata4: port is slow to respond, please be patient (Status 0xd8)
[  324.791245] ata4: port failed to respond (30 secs, Status 0xd8)
[  324.791260] ata4: soft resetting port
[  324.958391] ATA: abnormal status 0x7F on port 0xE08815DF
[  324.969852] ATA: abnormal status 0x7F on port 0xE08815DF
[  325.130746] ata4.01: revalidation failed (errno=-2)
[  325.130757] ata4: failed to recover some devices, retrying in 5 secs
[  330.127244] ata4: soft resetting port
[  330.294428] ATA: abnormal status 0x7F on port 0xE08815DF
[  330.305872] ATA: abnormal status 0x7F on port 0xE08815DF
[  330.630839] ata4.01: configured for MWDMA2
[  330.630894] ata4: EH completeBTW, what does that mean?

Immediately after that I tried to mount cdrom:

[  528.047522] ISO 9660 Extensions: Microsoft Joliet Level 3
[  528.048443] ISOFS: changing to secondary root

Cdrom is mounted! And I even can read data on it. But hald still polling:
hald-addon-storage: polling /dev/scd1. I tried also k3b, k3b tried to get info
about cd drives by means of hald:

[  681.633498] ata4.01: qc timeout (cmd 0xa0)
[  681.633516] ata4.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen

[  681.633528] ata4.01: cmd a0/00:00:00:00:20/00:00:00:00:00/b0 tag 0 cdb 0x5a
data 8 in
[  681.633530]		res 51/51:03:00:24:00/00:00:00:00:00/b0 Emask 0x5
(timeout)
[  688.623052] ata4: port is slow to respond, please be patient (Status 0xd8)
[  711.604630] ata4: port failed to respond (30 secs, Status 0xd8)
[  711.604643] ata4: soft resetting port
[  711.771791] ATA: abnormal status 0x7F on port 0xE08815DF
[  711.783200] ATA: abnormal status 0x7F on port 0xE08815DF
[  711.944183] ata4.01: revalidation failed (errno=-2)
[  711.944196] ata4: failed to recover some devices, retrying in 5 secs
[  716.940703] ata4: soft resetting port
[  717.107826] ATA: abnormal status 0x7F on port 0xE08815DF
[  717.119220] ATA: abnormal status 0x7F on port 0xE08815DF
[  717.444226] ata4.01: configured for MWDMA2
[  717.444275] ata4: EH complete
[  747.427071] ata4.01: qc timeout (cmd 0xa0)
[  747.427089] ata4.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen

[  747.427100] ata4.01: cmd a0/00:00:00:00:20/00:00:00:00:00/b0 tag 0 cdb 0x5a
data 8 in
[  747.427102]		res 51/51:03:00:1c:00/00:00:00:00:00/b0 Emask 0x5
(timeout)
[  754.416641] ata4: port is slow to respond, please be patient (Status 0xd8)
[  777.398223] ata4: port failed to respond (30 secs, Status 0xd8)
[  777.398237] ata4: soft resetting port
[  777.565383] ATA: abnormal status 0x7F on port 0xE08815DF
[  777.576766] ATA: abnormal status 0x7F on port 0xE08815DF
[  777.737749] ata4.01: revalidation failed (errno=-2)
[  777.737761] ata4: failed to recover some devices, retrying in 5 secs
[  782.734265] ata4: soft resetting port
[  782.901419] ATA: abnormal status 0x7F on port 0xE08815DF
[  782.912808] ATA: abnormal status 0x7F on port 0xE08815DF
[  783.229866] ata4.01: configured for MWDMA2
[  783.229917] ata4: EH complete
[  834.237242] ata4.01: qc timeout (cmd 0xa0)
[  834.237261] ata4.01: limiting speed to MWDMA1
[  834.237266] ata4.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen

[  834.237277] ata4.01: cmd a0/00:00:00:00:20/00:00:00:00:00/b0 tag 0 cdb 0x51
data 2 in
[  834.237279]		res 51/51:03:00:08:00/00:00:00:00:00/b0 Emask 0x5
(timeout)
[  841.226788] ata4: port is slow to respond, please be patient (Status 0xd8)
[  864.208379] ata4: port failed to respond (30 secs, Status 0xd8)
[  864.208392] ata4: soft resetting port
[  864.375538] ATA: abnormal status 0x7F on port 0xE08815DF
[  864.386916] ATA: abnormal status 0x7F on port 0xE08815DF
[  864.547909] ata4.01: failed to IDENTIFY (I/O error, err_mask=0x1)
[  864.547918] ata4.01: revalidation failed (errno=-5)
[  864.547928] ata4: failed to recover some devices, retrying in 5 secs
[  869.544433] ata4: soft resetting port
[  869.711572] ATA: abnormal status 0x7F on port 0xE08815DF
[  869.722951] ATA: abnormal status 0x7F on port 0xE08815DF
[  870.047997] ata4.01: configured for MWDMA1
[  870.048043] ata4: EH complete

Albert, I think you are very close to solution.
I'm ready to try another patches.

Regards,
Yarema
Comment 31 Albert Lee 2007-03-23 18:10:12 UTC
Sigh, this drive is really broken. Updating the broken firmware maybe the best 
thing to do. I will ask AOpen if any firmware update available.

> cmd a0/00:00:00:00:20/00:00:00:00:00/b0 tag 0 cdb 0x5a 
...
> cmd a0/00:00:00:00:20/00:00:00:00:00/b0 tag 0 cdb 0x51
> res 51/51:03:00:08:00/00:00:00:00:00/b0 Emask 0x5(timeout)

So, it is not 0x12, 0x46, 0x5a and 0x51 that causes trouble.
The dev_stat is 0x51; it seems the device returns "check condition"
and timeout upon "Request Sense"...

I will try to do a debug patch to blacklist 0x12, 0x46, 0x5a and 0x51 in the 
atapi request sense code path.
Comment 32 Albert Lee 2007-03-26 00:20:57 UTC
Created attachment 10946 [details]
Patch to do REQUEST SENSE by polling PIO

From previous trace, it seems the "qc timeout" happens during REQUEST SENSE.
Maybe the drive just lost irq? 

Hi Yarema,
Could you please try this patch alone (without previous INQUIRY/EVPD filter
patches).

Thanks.
Comment 33 Yarema 2007-03-26 05:29:52 UTC
Created attachment 10947 [details]
dmesg from the proposed Albert's patch

Hard to say... Same shit.
Comment 34 Albert Lee 2007-03-26 19:59:56 UTC
> Hard to say... Same shit.

So polling REQUEST SENSE doesn't help here. I was hoping this simple fix can 
save us from command filtering in the hot path... too bad.

Ok, let's go back to command filtering...



Comment 35 Albert Lee 2007-03-26 20:34:39 UTC
Created attachment 10957 [details]
Debug patch to show actual CDB

Hi Yarema,

Could you please apply this debug patch and collect the "CDB" output.
(This can help us to filter the command more precisely if we know what
parameter inside the CDB caused the problem.)
Comment 36 Tejun Heo 2007-03-26 20:44:06 UTC
Albert, where are you planning to put the filtering?  In SCSI or libata?  Do you
think we can fix this by modifying userspace?  Kernel itself doesn't have
problem detecting and using the device and I'm not sure whether kernel should be
filtering direct CDBs from userland.

Thanks.
Comment 37 Albert Lee 2007-03-26 21:41:18 UTC
Hi Tejun,

Currently just trying to narrow down the problematic command/parameter pairs.
No plan for adding the filtering to the kernel.

BTW, I've contacted AOpen, the support engineer (DarrenHsieh@aopen.com.tw) said 
this CD-956E is no longer in production and low chance AOpen would put in R&D 
resource into it. :(
Comment 38 Tejun Heo 2007-03-26 22:05:57 UTC
I see.  Thanks a lot for all the effort.  Yeah, I don't expect AOpen (or any
other vendor for that matter) to fix firmware of discontinued products.
Comment 39 Tejun Heo 2007-03-26 22:06:54 UTC
CC'd scsi_id author Patrick Mansfield.  Hello.
Comment 40 Albert Lee 2007-03-26 22:57:00 UTC
Summary of the current problem/status:

1. The hald/scsi_id issues a valid INQUIRY with EVPD = 1, page 0x00
2. The AOpen 56X/AKH drive returns "check condition" (status 0x51)
3. Libata EH issues "REQUEST SENSE" to the AOpen drive, but the AOpen drive 
times out.
4. Besides INQUIRY, the AOpen drive also has problem on
  - GET_CONFIGURATION (0x46) => Maybe from kernel space? not from hald/scsi_id?
  - READ_DISC_INFO (0x51)
  - MODE_SENSE_10 (0x5a) and
  - INQUIRY with EVPD=1
  The time out seems also occurs on REQUEST SENSE, not on the commands itself.
  We're waiting for Yarema's CDB debug data for the actual parameters that 
causes the timeout.
5. Polling "REQUEST SENSE" doesn't help. => the timeout is not caused by lost 
irq.
Comment 41 Yarema 2007-03-27 02:09:33 UTC
To which kernel version I have to apply this patch? Mine is 2.6.20.3 downloaded
from kernel.org. Patch is succeeded, but:
Hunk #1 succeeded at 2384 (offset -44 lines).
Hunk #2 succeeded at 2416 (offset -44 lines).
Hunk #3 succeeded at 2477 (offset -44 lines).

Next, I can't compile modules:

  CC [M]  drivers/ata/libata-core.o
  CC [M]  drivers/ata/libata-scsi.o
drivers/ata/libata-scsi.c: In function 
Comment 42 Albert Lee 2007-03-27 02:20:06 UTC
Created attachment 10958 [details]
Debug patch to show CDB parameters

Hi Yurema,

Sorry, I mixed-up the 2.6.21-rc one.
Please use this one for 2.6.20.3 instead, thanks.
Comment 43 Albert Lee 2007-03-27 02:29:47 UTC
Created attachment 10959 [details]
Debug patch to show CDB parameters

Please try this one. Sorry I messed it up.
Comment 44 Yarema 2007-03-27 04:57:48 UTC
Created attachment 10960 [details]
dmesg from filtering patch

Here is truncated (don't know why) dmesg, but with the useful info I believe.

Yarema
Comment 45 Yarema 2007-03-27 08:30:03 UTC
Created attachment 10966 [details]
one more dmesg with the latest patch
Comment 46 Patrick Mansfield 2007-03-27 10:44:36 UTC
I have not been following scsi_id changes for a while, it is part of udev, post
to linux-hotplug-devel@lists.sourceforge.net if you have any patches, changes or
questions.

There are so many dmesg attached, I don't know which one to look at.

I'm also not up to date on scsi/ata work. Is this a SATA bridge to a cdrom?
Those are problematic, we probably don't want to use scsi_id at all for IDE
CDROM drives attached that way - the output is likely meaningless when it does work.

I would try scsi_id from the command line, and see if it works or not and then
update scsi_id.config or udev rules. It might be not be possible to properly
black list just your device and not affect others, but that does not matter much
since no one cares about id's from CDROM.

Looks like you have vendor E-IDE product CD-ROM 56X/AKH.

To debug easier, run scsi_id from the command line, and maybe try hdparm -w
/dev/sr1 to reset it on hang. I don't have any systems SCSI/SATA CDROM, I'm not
sure of the sysfs path for sr (argument to the '-s').

Like:

scsi_id -g -s /block/sr1

And try this:

scsi_id -g -p 0x80 -s /block/sr1

And then black list out completely (spaces and names might be wrong ...) in
scsi_id.config:

vendor="E-IDE",options="-b"

Use page 0x80 in scsi_id.config:

vendor="E-IDE",options="-p 0x80"

A udev rule to ignore cdrom on SATA/ATA bridge or such might be possible.

Hope that helps ...
Comment 47 Yarema 2007-03-27 14:02:31 UTC
To Patrick Mansfield:

Well, everything is OK, but how can I do that, if I have sr1 in
/dev/.udev/failed/block@sr1 only?
Anyway thanks for help.

Yarema
Comment 48 Patrick Mansfield 2007-03-27 14:17:55 UTC
> Well, everything is OK, but how can I do that, if I have sr1 in
> /dev/.udev/failed/block@sr1 only?

What do you mean by "everything is OK"?

Did you run scsi_id per the other comment?

I don't know why udev puts stuff in the .failed directory, you probably need to
look at (or post) the log messages not just dmesg. I see lots of dmesg output,
but no log files.

Last I knew, even if there is no /dev scsi_id should still work OK (when run as
root), now it normally uses udev tmp dev $tempnod
Comment 49 Yarema 2007-03-27 15:16:38 UTC
Listen, I need argument as /dev/sr1, but I don't have it at all!
Device is recognized, but udev or something else put it into the
/dev/.udev/failed/block@sr1 and that's it. Consequently, how can I run commands
mentioned by you if I don't have this device? And for sure I looked up into logs
to find where my devices are. I can post here my logs if you want. Just tell my
what you need.

Regards,

Yarema
Comment 50 Patrick Mansfield 2007-03-27 15:27:19 UTC
> Listen, I need argument as /dev/sr1, but I don't have it at all!

You did not answer the question: Did you run scsi_id per the other comment?

Do you have /sys/block entries, like /sys/block/sr1?

For example, on my laptop with a USB storage device attached:

[root ~]# ls -ld /sys/block/sda/
drwxr-xr-x 6 root root 0 Mar 27 15:25 /sys/block/sda/

[root ~]# scsi_id -g -p 0x80 -s /block/sda
SGeneric STORAGE DEVICE
[root ~]# ls -l /dev/sda
brw-r----- 1 root disk 8, 0 Mar 27 15:19 /dev/sda
[root ~]# scsi_id -g -p 0x80 -s /block/sda
SGeneric STORAGE DEVICE
[root ~]# rm /dev/sda
[root ~]# scsi_id -g -p 0x80 -s /block/sda
SGeneric STORAGE DEVICE

> Device is recognized, but udev or something else put it into the
> /dev/.udev/failed/block@sr1 and that's it. Consequently, how can I run commands
> mentioned by you if I don't have this device? 

AFAIUI, that means udev did not create a /dev entry, not that the kernel
driver(s) failed to detect the device.

> And for sure I looked up into logs
> to find where my devices are. I can post here my logs if you want. Just tell my
> what you need.
Comment 51 Yarema 2007-03-28 01:19:11 UTC
See:

ls -l /sys/block/sr1/
total 0
-r--r--r-- 1 root root 4096 2007-03-28 12:03 dev
lrwxrwxrwx 1 root root    0 2007-03-28 12:03 device ->
../../devices/pci0000:00/0000:00:1e.0/0000:02:01.0/host3/target3:0:1/3:0:1:0
drwxr-xr-x 2 root root    0 2007-03-28 12:03 holders
drwxr-xr-x 3 root root    0 2007-03-28 12:03 queue
-r--r--r-- 1 root root 4096 2007-03-28 12:03 range
-r--r--r-- 1 root root 4096 2007-03-28 12:03 removable
-r--r--r-- 1 root root 4096 2007-03-28 12:03 size
drwxr-xr-x 2 root root    0 2007-03-28 12:03 slaves
-r--r--r-- 1 root root 4096 2007-03-28 12:03 stat
lrwxrwxrwx 1 root root    0 2007-03-28 12:03 subsystem -> ../../block
--w------- 1 root root 4096 2007-03-28 12:03 uevent

And I don't have sr1 in /dev/ directory. Do you think that it is udev problem?

BTW, what linux do you have?

Yarema
Comment 52 Albert Lee 2007-03-28 02:02:24 UTC
Dear all,

AOpen kindly send me a 952E/AKH for test today. And I can reproduce Yarema's 
problem locally on my box:

ata4.00: qc timeout (cmd 0xa0)
ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
ata4.00: cmd a0/00:00:00:00:20/00:00:00:00:00/a0 tag 0 cdb 0x46 data 32 in
         res 51/51:03:00:0c:00/00:00:00:00:00/a0 Emask 0x5 (timeout)
ata4: port is slow to respond, please be patient (Status 0xd8)
ata4: port failed to respond (30 secs, Status 0xd8)
ata4: soft resetting port

I will test more with this drive to collect more info about the failing SCSI 
commands, etc.
Comment 53 Yarema 2007-03-28 02:10:41 UTC
Good news!!! Now I calmed down - my cdrom is not broken. Thank you very much,
Albert!

Yarema 
Comment 54 Albert Lee 2007-03-28 05:52:25 UTC
Created attachment 10972 [details]
Proposed patch to simulate REQUEST SENSE 

This patch simulate REQUEST SENSE for the AOpen drive, command 0x12, 0x46, 0x51
and 0x5a.

Hi Yarema,

Could you please try this patch (against clean 2.6.20.3).
Please try k3b, etc. and see if any "qc timeout" in dmesg.
We need to check whether any other commands (besides 0x12, 0x46, 0x51 and 0x5a)
require such simulated request sense. Thanks.
Comment 55 Yarema 2007-03-28 14:21:03 UTC
Created attachment 10976 [details]
dmesg from proposed simulation REQUEST SENSE

It looks like everything is OK... I tried k3d and monitored dmesg and syslogs,
and I haven't seen any "qc timeout". I attached here just in case my dmesg for
more precisely review by specialists. If everything is OK can we expect that it
would be implemented in the next kernel tree?

Alan, thanks. 

Yarema
Comment 56 Yarema 2007-03-28 14:21:29 UTC
Created attachment 10977 [details]
dmesg from proposed simulation REQUEST SENSE

It looks like everything is OK... I tried k3d and monitored dmesg and syslogs,
and I haven't seen any "qc timeout". I attached here just in case my dmesg for
more precisely review by specialists. If everything is OK can we expect that it
would be implemented in the next kernel tree?

Albert, thanks. 

Yarema
Comment 57 Tejun Heo 2007-03-28 19:46:40 UTC
Oh, man.  That's just a weird patch.  Well, I guess we'll need it to get the
AOpen device to work on linux.  Do you happen to know how the ide layer fares
with it?

Thanks.  :-)
Comment 58 Albert Lee 2007-03-28 20:23:44 UTC
> Do you happen to know how the ide layer fares with it?
No idea. Maybe just lucky or something?
Now that I have the drive at hand, I will reboot to IDE and check it.
Comment 59 Albert Lee 2007-03-29 03:48:04 UTC
> I will reboot to IDE and check it.

With the AOpen drive connected to the Promise adapter, IDE got "irq 10: nobody 
cared...".

However, with the AOpen drive connected to the onboard ICH4, IDE works 
flawlessly with the AOpen drive. Even REQUEST SENSE works well on IDE:
   CDB 46 00 00 00 00 00 00 00 20
   CDB 03 00 00 00 12 00 00 00 00

It turns out in libata, the "ap->ops->tf_read(ap, &tf)" in 
atapi_eh_request_sense() leaves some garbage in the taskfile and the AOpen 
drive doesn't like it. Patch to follow.
Comment 60 Albert Lee 2007-03-29 03:53:49 UTC
Created attachment 10989 [details]
Patch to remove tf_read() from atapi_eh_request_sense().

Hi Yarema,

Could you please also try this patch (against clean 2.6.20.3) on both Promise
and onboard ICHx.
Please check if anything like "irq xx: nobody cared" with this patch:
Thanks.

======================
(e.g.)
irq 10: nobody cared (try booting with the "irqpoll" option)
 [<c013d714>] __report_bad_irq+0x24/0x7f
 [<c013d990>] note_interrupt+0x221/0x24c
 [<c027a649>] usb_hcd_irq+0x26/0x4f
 [<c013cb7c>] handle_IRQ_event+0x25/0x4a
 [<c013e2f5>] handle_level_irq+0xe7/0xea
 [<c013e20e>] handle_level_irq+0x0/0xea
 [<c0104675>] do_IRQ+0x87/0xf2
 [<c0103007>] common_interrupt+0x23/0x28
 [<c011d20c>] __do_softirq+0x37/0x96
 [<c010459b>] do_softirq+0x60/0xb3
 [<c013e20e>] handle_level_irq+0x0/0xea
 [<c011d0ee>] irq_exit+0x38/0x3a
 [<c010467c>] do_IRQ+0x8e/0xf2
 [<c031c153>] schedule+0x2dc/0x62e
 [<c0103007>] common_interrupt+0x23/0x28
 [<c01019ef>] default_idle+0x0/0x39
 [<c0101a19>] default_idle+0x2a/0x39
 [<c0101a6c>] cpu_idle+0x44/0x70
 [<c03e76d8>] start_kernel+0x28d/0x315
 [<c03e71e8>] unknown_bootoption+0x0/0x263
 =======================
handlers:
[<e0846b9d>] (ata_interrupt+0x0/0x253 [libata])
[<c027a623>] (usb_hcd_irq+0x0/0x4f)
[<c027a623>] (usb_hcd_irq+0x0/0x4f)
Disabling IRQ #10
Comment 61 Tejun Heo 2007-03-29 03:54:18 UTC
Yeah, reading TF before doing REQUEST_SENSE and writing it back have always been
a bit suspicious.  I never understood why it was done that way but kept the code
there anyway.  So, removing ->tf_read() call fixes the problem?
Comment 62 Albert Lee 2007-03-29 03:57:01 UTC
> removing ->tf_read() call fixes the problem?

Yes. No more qc timeout on REQUEST_SENSE. But we still have the "irq 10: nobody 
cared ..." problem. :(  I guess that is another problem. At least the qc time 
is gone.
Comment 63 Yarema 2007-03-29 14:05:43 UTC
Created attachment 10992 [details]
dmesg 

I attached here my dmesg for analyzing. I didn't find there "nobody cared" for
irq, maybe my drive is specific:-)

Yarema
Comment 64 Yarema 2007-03-29 14:36:33 UTC
Created attachment 10993 [details]
dmesg, cdrom attached to ide port

Here cdrom attached to onboard ide port. Still no "irq nobody cared".
Comment 65 Albert Lee 2007-03-30 10:19:20 UTC
> I didn't find there "nobody cared" for irq, maybe my drive is specific:-)

dmesg looks good. The "irq nobody cared" maybe due to faulty device, etc. on my 
box.
Comment 66 Albert Lee 2007-03-30 10:29:45 UTC
Created attachment 10999 [details]
Proposed patch to clear task file before REQUEST SENSE

It seems the purpose of ->tf_read() is reading the error register to initialize
sense_buf. So, ->tf_read() is actually needed.

Attached the revised patch.

Hi Yarema,
Could you please try this patch (against clean 2.6.20.3) on both Promise
and onboard ICHx. Hopefully this is the last one. :)
Comment 67 Yarema 2007-03-31 00:35:11 UTC
Created attachment 11005 [details]
dmesg: cdrom and dvd are on on cable attached to onboard ide port
Comment 68 Yarema 2007-03-31 00:43:31 UTC
Created attachment 11006 [details]
dmasg - cdrom attached to Promise port, DVD - to onboard ide pord

Looks like everything is all right.


Yarema
Comment 69 Soner Tari 2007-03-31 03:55:06 UTC
Hi, I've already submitted a bug report to Ubuntu, but I'm wondering if the
problem I'm experiencing is related with this thread. Could somebody check the
following link and comment please? Thanks.

https://launchpad.net/ubuntu/+bug/99018
Comment 70 Albert Lee 2007-04-01 20:48:48 UTC
Hi Yarema,

Thanks. The dmesgs look good. I'm closing this bug now.

(BTW, it seems your AOpen drive is configured to MWDMA2:
[   31.867388] ata2.01: ATAPI, max MWDMA2
[   32.031146] ata2.01: configured for MWDMA2

Maybe you could try turn on UDMA/33 for the AOpen drive by
placing the jumper to the left most position which is labeled "Turbo".)
Comment 71 Albert Lee 2007-04-01 22:42:18 UTC
> Hi, I've already submitted a bug report to Ubuntu, but I'm wondering
>  if the problem I'm experiencing is related with this thread.
> Could somebody check the following link and comment please? Thanks.
>
> https://launchpad.net/ubuntu/+bug/99018

Hi Sonar,

Your problem looks different. You have some trailing data with your drive:
[ 4295.256000] ata2.00: 12 bytes trailing data
But maybe it is worth a try.

Could you please try the following patch agaist 2.6.21-rcX
(http://marc.info/?l=linux-ide&m=117548454130046&w=2)
and see if it fixes your problem. If not, please open a new bug and attach the 
full dmesg to that bug, thanks.
Comment 72 Yarema 2007-04-02 10:03:30 UTC
Albert and all,

Thank you very much!

Yarema
Comment 73 Soner Tari 2007-04-02 17:28:27 UTC
Hi Albert, thank you very much, that patch fixed my problem too! I'm not sure if
this fixes the problem completely, but so far it seems perfect.

To be exact, I installed the source package for kernel 2.6.20.3-ubuntu1, and
manually applied the patch.

Hopefully, this patch will be in the new kernels from now on, so I won't have to
build a custom kernel again myself.

Thanks again,
Comment 74 Rohan B. Dhruva 2007-04-08 03:35:45 UTC
Created attachment 11098 [details]
Complete dmesg output

This is my complete dmesg output. The part on the very bottom, after the
"gspca" message shows the dmesg during which the hang occurs.
Comment 75 Rohan B. Dhruva 2007-04-08 03:39:44 UTC
Created attachment 11101 [details]
lspci -vvv output
Comment 76 Rohan B. Dhruva 2007-04-08 03:45:00 UTC
Created attachment 11103 [details]
cdrecord dev=/dev/sr0 -inq -vvVV output
Comment 77 Rohan B. Dhruva 2007-04-08 03:49:59 UTC
I am using Arch Linux, with kernel 2.6.20.6 with the patch - 
http://marc.info/?l=linux-ide&m=117548454130046&w=2 - but it does not solve 
the problem. I have an Acer TravelMate 3260 laptop. 

Symptoms are - periodically the PC hangs for about 30 seconds, and then all is 
back to normal. The complete dmesg output is attached, and the dmesg output 
during the hang are all the messages after the last "gspca" message. I am also 
attaching the lspci -vvv output. The CDROM is a TSSTCorp CD/DVDW TS-L632D. 
Also attached is the "cdrecord dev=/dev/sr0 -inq -vvVV" output.

dmesg output - http://bugzilla.kernel.org/attachment.cgi?id=11098&action=view
lspci output - http://bugzilla.kernel.org/attachment.cgi?id=11101&action=view
cdrecord output - 
http://bugzilla.kernel.org/attachment.cgi?id=11103&action=view

I would have opened a new bug, but the dmesg output is EXACTLY the same as 
output attached in the bug. Please tell me if any more info is needed, or if I 
should open a new bug. 
Comment 78 Albert Lee 2007-04-08 20:49:13 UTC
> I am using Arch Linux, with kernel 2.6.20.6 with the patch - 
> http://marc.info/?l=linux-ide&m=117548454130046&w=2 - but it does not solve 
> the problem.

Hmm, looks like another problem since the patch doesn't fix it.

> I would have opened a new bug, but the dmesg output is EXACTLY the same as 
> output attached in the bug. Please tell me if any more info is needed, or
> if I should open a new bug.

The timeout could be caused by something like device not responding or lost 
irq, etc. Please open a new bug and attach the full dmesg, etc. to that bug, 
thanks.
Comment 79 Rohan B. Dhruva 2007-04-09 12:23:10 UTC
Albert, here is the new bug that I have filed -
http://bugzilla.kernel.org/show_bug.cgi?id=8316

Thanks ! :)