Bug 201151 - illegal qc_active transition prevents optical drive detection
Summary: illegal qc_active transition prevents optical drive detection
Status: RESOLVED CODE_FIX
Alias: None
Product: IO/Storage
Classification: Unclassified
Component: Serial ATA (show other bugs)
Hardware: Intel Linux
: P1 normal
Assignee: Tejun Heo
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-09-16 14:08 UTC by tones111
Modified: 2019-08-25 20:15 UTC (History)
1 user (show)

See Also:
Kernel Version: 4.18.*
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
Ignore ATA_TAG_INTERNAL and 0 both being set (565 bytes, patch)
2018-09-18 16:16 UTC, Jens Axboe
Details | Diff
Use hw tag for ap->qc_active mask (419 bytes, patch)
2018-09-18 16:35 UTC, Jens Axboe
Details | Diff
Ignore 0/32 both being set (565 bytes, patch)
2018-09-18 16:52 UTC, Jens Axboe
Details | Diff
Mask swap (828 bytes, patch)
2018-09-18 18:55 UTC, Jens Axboe
Details | Diff
dmesg output after applying patch "Ignore 0/32 both being set" (67.70 KB, text/plain)
2018-09-19 02:16 UTC, tones111
Details
lspci output (27.46 KB, text/plain)
2018-09-19 02:17 UTC, tones111
Details
Mask swap (899 bytes, patch)
2018-09-19 14:47 UTC, Jens Axboe
Details | Diff
Mask swap v2 (916 bytes, patch)
2018-09-19 17:40 UTC, Jens Axboe
Details | Diff

Description tones111 2018-09-16 14:08:49 UTC
Running any 4.18.* kernel I see the following messages early in the boot sequence and my system is unable to detect (no /dev/sr0 created) my optical drive.

journalctl -b excerpt:
Sep 15 19:57:40 mobile kernel: ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
Sep 15 19:57:40 mobile kernel: ata2: illegal qc_active transition (100000000->00000001)
Sep 15 19:57:40 mobile kernel: ata2.00: ATAPI: MATSHITADVD-RAM UJ8C2 S, 1.00, max UDMA/133
Sep 15 19:57:40 mobile kernel: ata2.00: failed to set xfermode (err_mask=0x40)
Sep 15 19:57:40 mobile kernel: usb 4-2: New USB device found, idVendor=04f2, idProduct=b40a, bcdDevice=69.64
Sep 15 19:57:40 mobile kernel: usb 4-2: New USB device strings: Mfr=3, Product=1, SerialNumber=2
Sep 15 19:57:40 mobile kernel: usb 4-2: Product: USB2.0 HD UVC WebCam
Sep 15 19:57:40 mobile kernel: usb 4-2: Manufacturer: Chicony Electronics Co.,Ltd.
Sep 15 19:57:40 mobile kernel: usb 4-2: SerialNumber: 0x0001
Sep 15 19:57:40 mobile kernel: ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
Sep 15 19:57:40 mobile kernel: ata2: illegal qc_active transition (100000000->00000001)
Sep 15 19:57:40 mobile kernel: ata2.00: failed to set xfermode (err_mask=0x40)
Sep 15 19:57:40 mobile kernel: ata2.00: limiting speed to UDMA/133:PIO3
Sep 15 19:57:40 mobile kernel: ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
Sep 15 19:57:40 mobile kernel: ata2: illegal qc_active transition (100000000->00000001)
Sep 15 19:57:40 mobile kernel: ata2.00: configured for UDMA/133
Sep 15 19:57:40 mobile kernel: ata2.00: TEST_UNIT_READY failed (err_mask=0x40)
Sep 15 19:57:40 mobile kernel: ata2.00: disabled
Sep 15 19:57:40 mobile kernel: ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300)


Booting a 4.17.* kernel does not exhibit this behavior and my optical drive is detected and works fine.

journalctl -b excerpt from a working 4.17.13-arch1-1-ARCH kernel:
Sep 16 08:40:28 mobile kernel: ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
Sep 16 08:40:28 mobile kernel: ata1.00: ATA-8: HGST HTS545050A7E680, GG2OAF10, max UDMA/133
Sep 16 08:40:28 mobile kernel: ata1.00: 976773168 sectors, multi 16: LBA48 NCQ (depth 31/32), AA
Sep 16 08:40:28 mobile kernel: ata1.00: configured for UDMA/133
Sep 16 08:40:28 mobile kernel: scsi 0:0:0:0: Direct-Access     ATA      HGST HTS545050A7 AF10 PQ: 0 ANSI: 5
Sep 16 08:40:28 mobile kernel: ata2.00: ATAPI: MATSHITADVD-RAM UJ8C2 S, 1.00, max UDMA/133
Sep 16 08:40:28 mobile kernel: ata2.00: configured for UDMA/133
Sep 16 08:40:28 mobile kernel: scsi 1:0:0:0: CD-ROM            MATSHITA DVD-RAM UJ8C2 S  1.00 PQ: 0 ANSI: 5
Sep 16 08:40:28 mobile kernel: sd 0:0:0:0: [sda] 976773168 512-byte logical blocks: (500 GB/466 GiB)
Sep 16 08:40:28 mobile kernel: sd 0:0:0:0: [sda] 4096-byte physical blocks
Sep 16 08:40:28 mobile kernel: sd 0:0:0:0: [sda] Write Protect is off
Sep 16 08:40:28 mobile kernel: sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
Sep 16 08:40:28 mobile kernel: sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Sep 16 08:40:28 mobile kernel: tsc: Refined TSC clocksource calibration: 1497.192 MHz
Sep 16 08:40:28 mobile kernel: clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x1594c60bad9, max_idle_ns: 440795256279 ns
Sep 16 08:40:28 mobile kernel: clocksource: Switched to clocksource tsc
Sep 16 08:40:28 mobile kernel: sr 1:0:0:0: [sr0] scsi3-mmc drive: 24x/24x writer dvd-ram cd/rw xa/form2 cdda tray
Sep 16 08:40:28 mobile kernel: cdrom: Uniform CD-ROM driver Revision: 3.20
Sep 16 08:40:28 mobile kernel: sr 1:0:0:0: Attached scsi CD-ROM sr0


A possibly interesting wrinkle is that I can sometimes boot a 4.18 kernel and early-exit the bootloader countdown, see fewer illegal transition messages, and have a /dev/sr0 once booted.  However, after the system wakes from sleep the optical drive is no longer functional.
Comment 1 tones111 2018-09-18 01:26:45 UTC
The regression exists in v4.19-rc3.

I've managed to bisect the problem to...

commit 28361c403683c2b00d4f5e76045f3ccd299bf99d
Author: Jens Axboe <axboe@kernel.dk>
Date:   Fri May 11 12:51:09 2018 -0600

    libata: add extra internal command
Comment 2 Jens Axboe 2018-09-18 16:16:11 UTC
Hmm, that's a little odd. The port active mask shows tag 32 active, which is hw tag 0. For some reason, tag 0 is set on the ahci side. Can you try the patch I'm about to attach?
Comment 3 Jens Axboe 2018-09-18 16:16:41 UTC
Created attachment 278633 [details]
Ignore ATA_TAG_INTERNAL and 0 both being set
Comment 4 Jens Axboe 2018-09-18 16:29:49 UTC
Can you also provide a full dmesg?
Comment 5 Jens Axboe 2018-09-18 16:35:04 UTC
Actually, I think we should just mark the active mask with the real hw tag instead. This should work for both the normal case, as well as yours where we end up seeing the internally mapped tag show up in the done mask.
Comment 6 Jens Axboe 2018-09-18 16:35:30 UTC
Created attachment 278635 [details]
Use hw tag for ap->qc_active mask
Comment 7 Jens Axboe 2018-09-18 16:52:03 UTC
Back to square one, please just try the original patch, patch #2 can't possibly work... I'll add it again, looks like there's no way to mark it not-obsolete.
Comment 8 Jens Axboe 2018-09-18 16:52:53 UTC
Created attachment 278637 [details]
Ignore 0/32 both being set
Comment 9 Jens Axboe 2018-09-18 18:55:11 UTC
Looking at the code, from the various callers of ata_qc_complete_multiple(), only sil24 can actually trigger this. Since you're going to be testing various patches anyway, I'm going to toss another one into the mix for you to test...
Comment 10 Jens Axboe 2018-09-18 18:55:37 UTC
Created attachment 278641 [details]
Mask swap
Comment 11 tones111 2018-09-19 02:16:14 UTC
Created attachment 278649 [details]
dmesg output after applying patch "Ignore 0/32 both being set"
Comment 12 tones111 2018-09-19 02:17:59 UTC
Created attachment 278651 [details]
lspci output
Comment 13 tones111 2018-09-19 02:22:23 UTC
I tried both patches, in isolation, against 4.19-rc3.

The "Ignore 0/32 both being set" patch resolves the issue for me.

The "Mask swap" patch did not help.

I've attached the dmesg output from the boot with the good patch as well as some lspci output in case that helps.

thanks for the fast response!
Comment 14 tones111 2018-09-19 03:42:31 UTC
After adding some debug printouts, it appears tag 0 is getting set and passed via qc_active during ata_cq_complete_multiple in libahci.c

static void ahci_handle_port_interrupt(...)
{
...
	if (pp->fbs_enabled) {
		if (ap->qc_active) {
			qc_active = readl(port_mmio + PORT_SCR_ACT);
			qc_active |= readl(port_mmio + PORT_CMD_ISSUE);
		}
	} else {
		/* pp->active_link is valid iff any command is in flight */
		if (ap->qc_active && pp->active_link->sactive)
			qc_active = readl(port_mmio + PORT_SCR_ACT);
		else
			qc_active = readl(port_mmio + PORT_CMD_ISSUE); // <<<---- HERE
	}


	rc = ata_qc_complete_multiple(ap, qc_active);
...

I'm not familiar with this code, so I'm not sure if that would be expected or not.
Comment 15 Jens Axboe 2018-09-19 14:47:20 UTC
I think the best/safest is just to mask the hw invalid tag32 and set tag0, if tag32 is set. Can you confirm this one works too?
Comment 16 Jens Axboe 2018-09-19 14:47:43 UTC
Created attachment 278671 [details]
Mask swap
Comment 17 Jens Axboe 2018-09-19 17:40:45 UTC
Created attachment 278673 [details]
Mask swap v2
Comment 18 Jens Axboe 2018-09-19 17:42:04 UTC
Added a newer version, sorry for all the various versions... But I believe this one should work, as it just mirrors the right tag between the internal and hardware mask. I'm curious if it works for your case though, as we don't expect bit0 to be set in the hardware mask. It should be cleared, which means that command is done. It'll work for you if it is indeed just a spurious interrupt. If it doesn't work, then we might need to add some more debugging to figure out wtf is going wrong here.
Comment 19 tones111 2018-09-20 01:59:08 UTC
The first "mask swap" patch breaks communication with my hard drive, so it drops out to a prompt.

The second patch ("mask swap v2") is working for me.

Thanks again
Comment 20 Jens Axboe 2018-09-20 02:41:34 UTC
Ok great, that makes me feel better. Can I add your tested-by to the patch?


> On Sep 19, 2018, at 7:59 PM, bugzilla-daemon@bugzilla.kernel.org wrote:
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=201151
> 
> --- Comment #19 from tones111@hotmail.com ---
> The first "mask swap" patch breaks communication with my hard drive, so it
> drops out to a prompt.
> 
> The second patch ("mask swap v2") is working for me.
> 
> Thanks again
> 
> -- 
> You are receiving this mail because:
> You are on the CC list for the bug.

Note You need to log in before you can comment on or make changes to this bug.