Bug 59581

Summary: Sony VAIO VPCZ23A4R: BluRay writer on pata_marvell works unreliably
Product: IO/Storage Reporter: Alexander E. Patrakov (patrakov)
Component: Serial ATAAssignee: Jeff Garzik (jgarzik)
Status: RESOLVED INSUFFICIENT_DATA    
Severity: normal CC: mcgrof, szg00000, tj
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 3.10-rc4 Subsystem:
Regression: No Bisected commit-id:
Attachments: assume-atapi.patch
Dmesg with your patch
assume-atapi-2.patch
Dmesg after the second patch
Dump of resource5 in the hotplug case
Dump of resource5 in the coldplug case
init / kthread: add module_long_probe_init() and module_long_probe_exit()
pata_marvell: use module_long_probe_init()

Description Alexander E. Patrakov 2013-06-11 04:54:19 UTC
I have a Sony VAIO VPCZ23A4R laptop. It has a dock station that, among other PCIe devices, contains a Marvell 88SE6121 SATA II Controller [11ab:6121] with a BluRay writer attached.

Note that the laptop is also affected by bug 59501 and bug 56531, they may interfere with some of the testcases.

If I boot the laptop with the dock station connected, it takes a long time because of qc timeout, link being slow to respond, and SRST errors. See messages about ata8.00 in https://bugzilla.kernel.org/attachment.cgi?id=104331 . After waiting through these errors, the BluRay drive works. I want it to work without having to wait through the errors and without the slow boot.

If I connect the dock station to the laptop with kernel 3.10-rc4 patched with https://bugzilla.kernel.org/attachment.cgi?id=104341 , or with unpatched kernel 3.9, the kernel does not find the drive, and instead says:

[   38.065673] pata_marvell 0000:1a:00.0: no available native port
[   38.065769] pata_acpi 0000:1a:00.0: no available native port

See https://bugzilla.kernel.org/attachment.cgi?id=104131 or https://bugzilla.kernel.org/attachment.cgi?id=104321 for the examples. I want it to find the drive.

An interesting point: if I connect the dock station with the unpatched kernel 3.10-rc4, that doesn't rescan the PCI bus automatically on docking due to bug 59501, and then rescan the PCI bus manually (i.e. introduce some manual delay between the appearance of the device and the kernel noticing it, or maybe skip some black magic done by the firmware), the drive is detected without any delays and just works. See https://bugzilla.kernel.org/attachment.cgi?id=104321 for an example dmesg.

So I guess that the above means that the kernel must insert some delay or somehow undo what the firmware does with the controller in order for the BluRay drive to work as it should on my laptop. Could you please help me debug this further?
Comment 1 Tejun Heo 2013-06-11 08:16:17 UTC
Created attachment 104351 [details]
assume-atapi.patch

Hmmm.... as for the hotplug problem, it looks like the controller is being handed for probing before its PCI resources are properly set up. pata_marvell tries to probe it but its PCI regions are not mapped (yet, obviously) so it can't do anything about it and gives up. I don't know why that is but it probably should be fixed from PCI side. pata_marvell is just a consumer on that front.

As for SRST failure during boot, can you please try the attached patch?

Thanks.
Comment 2 Alexander E. Patrakov 2013-06-11 08:32:54 UTC
Created attachment 104361 [details]
Dmesg with your patch

The patch does not help.

I assume this is because it says XXX about ata7, while the problematic drive is on ata8.
Comment 3 Alexander E. Patrakov 2013-06-11 08:36:07 UTC
BTW, do you have any explanation why there are no SRST errors in the hotplug-then-rescan case?
Comment 4 Tejun Heo 2013-06-12 06:42:57 UTC
Created attachment 104481 [details]
assume-atapi-2.patch

Oops, can you please try this one instead? As for why SRST behaves differently, I have no idea. The controller sure is in a different state. The BIOS hasn't had the chance to init / probe the device when hotplugged. As for why that leads to the behavior difference, I don't know.
Comment 5 Alexander E. Patrakov 2013-06-12 09:59:09 UTC
Created attachment 104501 [details]
Dmesg after the second patch

Still no luck. Note that I have applied not only your patch, but also all patches from this thread: https://lkml.org/lkml/2013/6/11/231 , sorry if they interfere.
Comment 6 Tejun Heo 2013-06-12 16:10:39 UTC
I don't have any idea left. pata_marvell has always been somewhat odd and I personally don't have a lot of experience with it. I'll try to ping someone who has more experience with it.

Thanks.
Comment 7 Tejun Heo 2013-06-12 16:13:03 UTC
Alan was the only one familiar with the controller and he's gone now. :(
Comment 8 Alexander E. Patrakov 2013-06-12 17:48:53 UTC
The hotplug case got fixed by the patch from bug 56531 comment 7.
Comment 9 Alexander E. Patrakov 2013-06-13 08:28:13 UTC
Tejun, I also forgot to say that not only your second patch does not help in the non-hotplug case, it actually makes the kernel never detect the BluRay drive when booting with the dock station. Originally, it was detected with a large timeout.

I think that we are lucky that we can look at the controller both in the "correct" and in the "wedged" states. Could you please make a debugging patch that at least dumps all marvell chip registers, so that we can compare them?
Comment 10 Alexander E. Patrakov 2013-06-14 04:10:34 UTC
I can confirm that the hotplug part of this bug was a duplicate of bug 56531. So, now we only have to deal with the SRST issue in the non-hotplug case.
Comment 11 Alexander E. Patrakov 2013-06-16 15:33:59 UTC
I have some new information about this bug.

If I boot with modprobe.blacklist=pata_marvell ahci.marvell_enable=1, then the controller does not work even when hotplugged.

[   33.170677] ata7.00: qc timeout (cmd 0xa1)
[   33.170712] ata7.00: failed to IDENTIFY (I/O error, err_mask=0x4)
[   33.475948] ata7: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
[   33.475972] ata7.00: link online but device misclassified
[   33.475975] ata7: link online but 1 devices misclassified, retrying
[   33.475979] ata7: reset failed (errno=-11), retrying in 10 secs
[   43.488473] ata7: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
[   43.488493] ata7.00: link online but device misclassified
[   43.488497] ata7: link online but 1 devices misclassified, retrying
[   43.488501] ata7: reset failed (errno=-11), retrying in 10 secs
[   53.500823] ata7: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
[   53.500841] ata7.00: link online but device misclassified
[   53.500844] ata7: link online but 1 devices misclassified, retrying
[   53.500848] ata7: reset failed (errno=-11), retrying in 35 secs
[   88.238841] ata7: limiting SATA link speed to 1.5 Gbps
[   88.544303] ata7: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
[   88.544322] ata7.00: link online but device misclassified
[   88.544326] ata7: link online but 1 devices misclassified, device detection might fail
[   88.849607] ata8: SATA link down (SStatus 0 SControl 300)

Note that both in dmesg with pata_marvell in non-hotplug case and with ahci in both cases, the first failure is "qc timeout" / "failed to IDENTIFY".

So, my completely uneducated guess is:

1) There is some b0rked AHCI mode which the card may be put into either by the kernel or by the firmware

2) The firmware puts the controller AHCI mode on boot, but not on hotplug, and a reset is needed to get it out of that mode, so that it starts accepting legacy commands as sent by pata_marvell.

Does this make any sense? Is there any way to verify this guess?


Unrelated note: ahci says that the link is ata7, while pata_marvell says ata8.
Comment 12 Alexander E. Patrakov 2013-06-16 16:14:54 UTC
Well, that thought appears to be false. I have used pcimem from git://github.com/billfarrow/pcimem.git to read the word at offset 4 in resource 5 of the card. It is equal to 0, even before loading pata_marvell, so this is not an AHCI-related thing.

Still, I would like to see a debugging patch or a list of safe-to-access registers that would allow me to find out the differences in the card state in the hot-plugged and cold-plugged cases.
Comment 13 Tejun Heo 2013-06-17 18:52:52 UTC
I don't have any more intimate knowledge of the controller or access to the document. It's likely that the controllers has some global reset control somewhere and performing global reset on probing will get things working. Hmmm...
Comment 14 Alexander E. Patrakov 2013-06-18 16:29:44 UTC
Created attachment 105241 [details]
Dump of resource5 in the hotplug case
Comment 15 Alexander E. Patrakov 2013-06-18 16:32:47 UTC
Created attachment 105251 [details]
Dump of resource5 in the coldplug case

I have attached two dumps of resource5, one in the coldplug case and one in the hotplug case. Before I start poking different values into the registers, I want you to look at the difference. Reason: at least some of the registers are the standard ones. Unlike me, you understand generic IDE/SATA controllers, and thus can recognize the fact that some standard, understandable-to-you register differs if it is the case.
Comment 16 Alexander E. Patrakov 2013-06-18 16:35:18 UTC
Note: both dumps are with pata_marvell blacklisted and thus with no driver attached to the controller.
Comment 17 Alexander E. Patrakov 2014-08-10 06:18:58 UTC
As of now, booting with the docking station connected still results in a long wait (which is what this bug is about) after which the BluRay drive works. Hot-plugging the docking station results in a functional BluRay drive immediately.
Comment 18 Luis Chamberlain 2014-08-27 22:15:58 UTC
Created attachment 148541 [details]
init / kthread: add module_long_probe_init() and module_long_probe_exit()

Based on discussion we will split init / probe and allow probe to be run asynchronously. Those changes will happen eventually but in the meantime you can use the two patches I will supply. Attached is the first one which is generic.
Comment 19 Luis Chamberlain 2014-08-27 22:16:55 UTC
Created attachment 148551 [details]
pata_marvell: use module_long_probe_init()

This is the second driver specific patch. Once Greg adds asynch probe support we'll just go with that solution instead upstream.
Comment 20 Tejun Heo 2014-08-28 12:27:10 UTC
Luis, please don't apply the patch. This isn't anything specific to pata_marvell. Error conditions during probing can stretch out the time taken over 30s. It's a rare but still expected behavior. Adding this patch to pata_marvell is unlikely to have any meaningful gain. Let's please fix it from driver core.

Thanks.
Comment 21 Luis Chamberlain 2014-08-28 17:35:32 UTC
Tejun agreed -- its what I noted on comment #19 -- the right solution is for us to get async probe on the driver core and let drivers enable that. I supplied the patches though to enable users to use the driver at least for now, its not my goal to upstream these.
Comment 22 Alexander E. Patrakov 2018-03-25 13:44:35 UTC
The laptop is dead, closing any bugs reproducible only on it.