Bug 15771

Summary: Marvell 6145 on Jetway Daughterboard fails to detect any disks
Product: IO/Storage
Reporter: Dan Alderman (dan)
Component: Serial ATA
Assignee: Jeff Garzik (jgarzik)
Status: RESOLVED OBSOLETE
Severity: normal
CC: alan, ben, cebbert, daniel.hornung, jrickman, mlord, tj
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 2.6.32.10-90.fc12.x86_64
Subsystem:
Regression: No
Bisected commit-id:

Description Dan Alderman 2010-04-12 18:07:51 UTC
Hi.

I started investigating this problem with help from Ben Hutchings, in his capacity as a member of the Debian project.  I added information to the Debian bug tracker, bug ID 509923 (http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=509923).

I have tried Fedora 12 with the same results so I think it's a Mainline issue, but I have filed this report against the kernel version for the errors I have to hand.

The Atom board I'm using is this (Atom 330):

http://www.jetwaycomputer.com/NC92.html

and the Marvell Daughter Board is the ADPE4S on this page (scroll down a little for the PCIe 4*SATA board).

http://www.jetwaycomputer.com/Daughter_Board.html

The board uses the Marvell 88SE6145 chipset with only the SATA ports connected.

The issue, as far as I can tell, is that pata_marvell fails to detect the disks attached to the SATA ports, and forcing ahci with marvell_enable=1 also seems to fail.

pata_marvell gives me lots of:

pata_marvell 0000:03:00.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
pata_marvell 0000:03:00.0: setting latency timer to 64
scsi2 : pata_marvell
scsi3 : pata_marvell
ata3: PATA max UDMA/100 cmd 0xef00 ctl 0xee00 bmdma 0xeb00 irq 16
ata4: PATA max UDMA/133 cmd 0xed00 ctl 0xec00 bmdma 0xeb08 irq 16
BAR5:00:04 01:7F 02:22 03:C8 04:02 05:00 06:00 07:80 08:00 09:00 0A:00
0B:00 0C:1F 0D:00 0E:00 0F:00
ata4.01: qc timeout (cmd 0xec)
ata4.01: failed to IDENTIFY (I/O error, err_mask=0x4)
ata4: link is slow to respond, please be patient (ready=0)
ata4: device not ready (errno=-16), forcing hardreset

With the ahci driver I get lots of errors like this:

ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
ata4.00: failed command: IDENTIFY DEVICE
ata3.00: failed command: IDENTIFY DEVICE
ata3.00: cmd ec/00:01:00:00:00/00:00:00:00:00/00 tag 0 pio 512 in
         res 40/00:00:1f:00:00/00:00:00:00:00/e0 Emask 0x4 (timeout)
ata3.00: status: { DRDY }
ata4.00: cmd ec/00:01:00:00:00/00:00:00:00:00/00 tag 0 pio 512 in
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata3: hard resetting link
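
For readers decoding messages like the ones above, here is a minimal sketch. It assumes the bit assignments of libata's AC_ERR_* flags (include/linux/libata.h) and the usual ATA status register bits; `decode` and `parse_res_line` are hypothetical helper names, not part of any kernel tool.

```python
import re

# Bit assignments assumed from libata's AC_ERR_* flags (include/linux/libata.h).
AC_ERR_FLAGS = [
    (1 << 0, "AC_ERR_DEV"),
    (1 << 1, "AC_ERR_HSM"),
    (1 << 2, "AC_ERR_TIMEOUT"),
    (1 << 3, "AC_ERR_MEDIA"),
    (1 << 4, "AC_ERR_ATA_BUS"),
    (1 << 5, "AC_ERR_HOST_BUS"),
    (1 << 6, "AC_ERR_SYSTEM"),
    (1 << 7, "AC_ERR_INVALID"),
    (1 << 8, "AC_ERR_OTHER"),
    (1 << 9, "AC_ERR_NODEV_HINT"),
    (1 << 10, "AC_ERR_NCQ"),
]

# ATA status register bits (only the commonly reported ones are listed).
ATA_STATUS_FLAGS = [
    (0x80, "BSY"),
    (0x40, "DRDY"),
    (0x20, "DF"),
    (0x08, "DRQ"),
    (0x01, "ERR"),
]

# Matches the "res" line libata prints; the first byte is the status register.
RES_LINE = re.compile(r"res ([0-9a-f]{2})/.*Emask (0x[0-9a-f]+)")


def decode(value, flags):
    """Return the names of all flags that are set in value."""
    return [name for bit, name in flags if value & bit]


def parse_res_line(line):
    """Return (status, emask) from a libata 'res' line, or None if it does not match."""
    match = RES_LINE.search(line)
    if match is None:
        return None
    return int(match.group(1), 16), int(match.group(2), 16)
```

Under those assumptions, both the `err_mask=0x4` from pata_marvell and the `Emask 0x4` from ahci decode to a command timeout, and a status byte of `40` means the drive reports ready (DRDY) with no error bit set. In other words, the IDENTIFY command was issued but never completed.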

All the disks are detected fine from within the mini raid BIOS on the daughterboard and the motherboard bios sees the disks, showing them as SCSI.  The disk set I have already contains an mdraid5 with all my data, so I don't want to overwrite it, I just want to see each disk individually.

I'm very happy to provide any more information and do more testing to try and get this fixed.

Thank you.

D.
Comment 1 Ben Hutchings 2010-04-12 18:41:07 UTC
I've split the Debian bug report into:

http://bugs.debian.org/509923 - pata_marvell fails to handle some 88SE6145 boards
http://bugs.debian.org/577563 - ahci fails to handle some 88SE6145 boards

since ahci does seem to work for most people.
Comment 2 Tejun Heo 2010-04-13 21:48:53 UTC
Does irqpoll help?  Mark, do you have experience with these controllers?
Comment 3 Mark Lord 2010-04-13 22:14:49 UTC
These are totally different controllers from the sata_mv ones.

But.. I do have a box here that has a 6145 chip in it.

So if we could find out what the end-user's drive setup is, then perhaps I could reproduce it here and investigate some.  So, I need to know:  what type (SATA/PATA) drives are connected to which ports of the controller, and which kernel module is being used to manage them (ahci, or pata_marvell?).

Thx.
Comment 4 Dan Alderman 2010-04-13 22:37:18 UTC
Hi,

To the JCN92 motherboard I have a Seagate 250GB 2.5" 5400rpm drive attached to SATA1 and a SATA DVD drive attached to SATA2.

On the Marvell Daughter Board (ADPE4S http://www.jetwaycomputer.com/Daughter_Board.html) I have 4 WD Green 1TB drives attached.

I think I initially had some issues with the SATA cables in the Chenbro ES34069 chassis I am using as I was getting detection failures in the Marvell bios.  Having replaced them they now detect reliably.

I installed CentOS 5.4 x64 and compiled up the vendor driver from here:

http://jetwaycomputer.com/download/Drivers/ADPE4S/Marvell_M88SE6145_Linux.zip

and I was able to detect and see the existing mdraid5 set on my WD disks, mounted and used without issue.

Originally I had only tried using modern 64bit distros (Fedora 12, Debian 5.04) and had no luck.  I have just tried Fedora 12 i386 cd installer and seemed to get better results, but drive 4 on the controller timed out.

I am now trying the latest kernel (2.6.32.11-99.fc12.i686.PAE) but appending ahci.enable_marvell=1 seems to result in no detection of the disks attached to the controller at all and lots of the same errors like I was seeing before.
Comment 5 Dan Alderman 2010-04-13 23:43:17 UTC
OK, I had made a mistake and allowed anaconda to use ata_generic so ahci wasn't being used.  Remaking the initrd via the rescue cd having forced marvell_enable=1 worked.

So, now I believe I am in the position that the 32 bit kernel works for this board, after some user intervention, but the 64 bit one doesn't appear to work at all.  At least, I was unsuccessful with the distributions I tried.
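
The mix-up described here (ata_generic claiming the device instead of ahci) can be checked after boot by looking at which driver sysfs reports for the controller. The following is a minimal sketch that assumes the standard `/sys/bus/pci/devices/<address>/driver` symlink and the PCI address `0000:03:00.0` seen in the pata_marvell log in the description; `bound_driver` is a hypothetical helper name.

```python
import os


def bound_driver(pci_addr, sysfs_root="/sys/bus/pci/devices"):
    """Return the name of the driver bound to a PCI device, or None if unbound.

    sysfs exposes the binding as a 'driver' symlink whose target's last
    path component is the driver name (for example 'ahci').
    """
    link = os.path.join(sysfs_root, pci_addr, "driver")
    if not os.path.islink(link):
        return None
    return os.path.basename(os.readlink(link))
```

Calling `bound_driver("0000:03:00.0")` on the affected machine would show whether ahci, pata_marvell or ata_generic actually claimed the controller. The address may differ on other systems.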
Comment 6 Dan Alderman 2010-04-14 13:25:47 UTC
After a little more investigation I have some more info.

It would appear that the detect errors/timeouts are being caused by the PAE enabled kernel in Fedora and all the 64bit kernels I have tried (>2.6.26).

The board I have is Atom 330 based with 2GB RAM.
Comment 7 Dan Alderman 2010-04-14 14:37:44 UTC
I have just tried using the PAE kernel in F12 (2.6.32.11-99.fc12.i686.PAE) with ahci.marvell_enable=1 and mem=nopentium and the controller detects and works and the system boots fine.

grub entry:

title Fedora (2.6.32.11-99.fc12.i686.PAE)
	root (hd0,0)
	kernel /vmlinuz-2.6.32.11-99.fc12.i686.PAE ro root=/dev/mapper/vg_chenbro-lv_root  LANG=en_US.UTF-8 SYSFONT=latarcyrheb-sun16 KEYBOARDTYPE=pc KEYTABLE=uk ahci.marvell_enable=1 mem=nopentium
	initrd /initramfs-2.6.32.11-99.fc12.i686.PAE.img
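
Because the parameter name is easy to mistype, a small check of the kernel command line can help. This is a minimal sketch that assumes the option is spelled `ahci.marvell_enable` as in the grub entry above; `parse_cmdline` and `ahci_marvell_enabled` are hypothetical helper names. It only checks the spelling on the command line and does not prove that the driver honoured the option.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into a dict of parameter -> value.

    Bare flags such as 'ro' map to None.
    """
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def ahci_marvell_enabled(cmdline):
    """Return True if ahci.marvell_enable=1 appears on the command line."""
    return parse_cmdline(cmdline).get("ahci.marvell_enable") == "1"
```

On a running system the string would typically come from `/proc/cmdline`. A transposed spelling such as the `ahci.enable_marvell=1` mentioned in comment 4 is reported as not set.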
Comment 8 Chuck Ebbert 2010-04-14 22:25:53 UTC
This is just a wild guess, but can you set up your initrd so it updates the CPU microcode before loading the ahci driver?
Comment 9 Dan Alderman 2010-04-15 00:18:41 UTC
I currently have no idea how to do that.  Is there a guide somewhere that can tell me how?
Comment 10 Dan Alderman 2010-04-16 00:18:26 UTC
I've been playing around a little more with the setup.  I'm not sure if this information helps with this particular bug, but it may prove useful to someone, perhaps, maybe...

I tried smartctl on one of the WD Green drives attached to the 6145 controller.  When the drive is not in an active mdraid array and smartctl is left to probe for the drives it causes a reset of the 6145 controller and the re-detect fails in the same way the pata_marvell does if the ahci driver isn't forced to take over at boot time.

After a long timeout, 2 of the 4 drives showed as detected in the kernel log. I soft rebooted but quickly discovered that I had to completely remove the power from the system for at least 10 seconds before the kernel would re-detect the controller correctly at boot time.  I am currently running 2.6.32.11-99.fc12.i686.PAE with ahci.marvell_enable=1 mem=nopentium on the boot command line.

If smartctl is run with -i -d sat to force SATA mode on a drive which is not part of an active array, it is detected and smartctl produces a report which looks similar to what one would expect.  I don't have it to hand and I want to let my array finish rebuilding before I try anything else of this nature, but it looked similar to the report produced by the 2.5" SATA drive attached to the ICH7 on the same motherboard, which was definitely correct.

=== START OF INFORMATION SECTION ===
Device Model:     ST9250315AS
Serial Number:    6VC2HFVF
Firmware Version: 0001SDM1
User Capacity:    250,059,350,016 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Fri Apr 16 01:08:47 2010 BST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

If I run smartctl -d sat on the same drive when it's part of an active mdraid5 array (which was syncing the parity disk at the time), it works fine once, but on the second run the controller freezes and causes a re-detect.

Apr 15 12:58:07 chenbro kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Apr 15 12:58:07 chenbro kernel: ata1.00: failed command: IDENTIFY DEVICE
Apr 15 12:58:07 chenbro kernel: ata1.00: cmd ec/00:01:00:00:00/00:00:00:00:00/00 tag 0 pio 512 in
Apr 15 12:58:07 chenbro kernel:         res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Apr 15 12:58:07 chenbro kernel: ata1.00: status: { DRDY }
Apr 15 12:58:07 chenbro kernel: ata1: hard resetting link
Apr 15 12:58:07 chenbro kernel: ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Apr 15 12:58:07 chenbro kernel: ata1.00: configured for UDMA/133
Apr 15 12:58:07 chenbro kernel: ata1: EH complete
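
The `SStatus 123 SControl 300` values in the link-up line are printed in hexadecimal and can be split into their fields. This is a minimal sketch that assumes the DET/SPD/IPM layout defined by the Serial ATA specification; `split_fields` and `describe_sstatus` are hypothetical helper names.

```python
# SStatus field values, assumed from the Serial ATA specification.
DET = {
    0: "no device",
    1: "device present, no PHY communication",
    3: "device present, PHY communication established",
    4: "PHY offline",
}
SPD = {0: "no negotiated speed", 1: "1.5 Gbps", 2: "3.0 Gbps", 3: "6.0 Gbps"}
IPM = {0: "not present", 1: "active", 2: "partial", 6: "slumber"}


def split_fields(hex_text):
    """Split an SStatus or SControl value (hex text, as logged) into raw fields."""
    value = int(hex_text, 16)
    return {
        "det": value & 0xF,         # bits 3:0
        "spd": (value >> 4) & 0xF,  # bits 7:4
        "ipm": (value >> 8) & 0xF,  # bits 11:8
    }


def describe_sstatus(hex_text):
    """Translate the raw SStatus fields into short descriptions."""
    fields = split_fields(hex_text)
    return {
        "det": DET.get(fields["det"], "reserved"),
        "spd": SPD.get(fields["spd"], "reserved"),
        "ipm": IPM.get(fields["ipm"], "reserved"),
    }
```

Read this way, `123` says a device is present with the PHY up at 3.0 Gbps in the active power state, which agrees with the "SATA link up 3.0 Gbps" text. `300` in SControl splits into DET=0, SPD=0 and IPM=3, that is, no speed limit requested and transitions to Partial and Slumber disallowed. So the link itself recovers after the reset, even though the preceding IDENTIFY timed out.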

Smartctl then reports back that there is SMART information available, but also shows errors.  The machine then had to be powered off for 10s and then rebooted to get access back to the 2 missing drives.

The controller appeared to recover OK so I re-assembled the md0 and let it start the resync again.  731 minutes to go...
Comment 11 jrickman 2011-03-12 11:53:39 UTC
I have this same motherboard and daughterboard. I have tested this problem with Fedora Core 13 (all released i386 PAE kernel versions) and Fedora Core 14 (all released i386 PAE kernel versions). I am using an IDE drive, attached to the motherboard port, for the "system".

I have added the "ahci.marvell_enable=1" and "mem=nopentium" values to my "grub.conf" file, but it did not help.

I cannot get this controller to recognize all 4 attached drives via Fedora Core Linux. Sometimes 1 attached drive is seen in "blkid". Sometimes 2 attached drives are seen in "blkid". I can't ever remember seeing 3 or 4 of the attached drives.

I can see all drives in the 88SE6145 BIOS during system boot. I can create a RAID array using that BIOS. Once the system is booted, the above issues occur: not all attached drives are seen. That makes building a RAID array useless for testing so I just try for a "bunch of disks".

For the record, I removed the Linux IDE hard disk and installed a different IDE drive so I could test this system under Windows XP with the vendor drivers. There were no issues at all. All ports on the 88SE6145 controller saw attached hard disks under Windows.

Over in the FreeBSD forums there has been discussion of this controller. Have a look at this thread: http://forums.freebsd.org/showthread.php?t=20412
Comment 12 Mark Lord 2011-03-12 16:22:32 UTC
From that FreeBSD thread, here is what they did to "fix" it:

> Disable NCQ for multiport Marvell 88SX61XX SATA controllers. Simultaneous
> active I/O to several disks (copying large file on ZFS) causes timeout after
> just a few seconds of run. Single port 88SX6111 seems like not affected.
>
> Skip reading transferred bytes count for these controllers. It works for
> 88SX6111, but 88SX6145 always returns zero there. Haven't tested others,
> but better to be safe.
Comment 13 Ben Hutchings 2011-03-12 20:16:52 UTC
(In reply to comment #12)
> From that FreeBSD thread, here is what they did to "fix" it:
> 
> > Disable NCQ for multiport Marvell 88SX61XX SATA controllers. Simultaneous
> > active I/O to several disks (copying large file on ZFS) causes timeout after
> > just a few seconds of run. Single port 88SX6111 seems like not affected.
> >
> > Skip reading transferred bytes count for these controllers. It works for
> > 88SX6111, but 88SX6145 always returns zero there. Haven't tested others,
> > but better to be safe.

I think that's fixing a different problem. We already disable NCQ for these controllers, and we never read the byte count from the command structure (ahci_cmd_list::bytecount on FreeBSD, ahci_cmd_hdr::status on Linux).
Comment 14 jrickman 2011-03-13 06:01:35 UTC
My point in comparing Fedora Core 14 to FreeBSD is/was this: I get the impression from the FreeBSD forums that developers for FreeBSD may have this controller/chip family working and there could be something useful in their work.

For the record, I have tested Fedora Core 12, 13, and 14 extensively on this system; Fedora Core is my "primary OS of choice" at home for servers while I use RHEL at work.

I have no background with FreeBSD and Debian, but I tried FreeBSD about 6 months ago just to see what would happen. I got farther with it than Fedora. In FreeBSD 8.1 "public release" (no patches) I could see all attached drives on the 88SE6145 controller in "cfdisk" most of the time, but I had problems with read/write of files to the disks; my Fedora Core experience with this controller cannot even match that.

This is a "home" system and it is recorded in "smolts":
http://www.smolts.org/client/show/pub_2a4d8e91-4787-4ae5-8d9f-ea04b2146633

According to Comment #4 in this bug, this controller works under CentOS 5.4 with the vendor drivers. Based on "G*" searches for this issue, some change(s) happened in the kernel for the ATA drivers (sometime after 2.6.18...I think...would have to check). There were several interesting archived email exchanges on the matter.

It sounds like CentOS 5.x (RHEL5...2.6.18) did not adopt those ATA changes, which is why this controller works in that OS version with 3rd party drivers. Distributions like Fedora Core did adopt those ATA changes and have issues with this controller. I wonder if CentOS 6.x (RHEL6...2.6.32) will adopt those ATA changes? [nudge -> RH]

QED...or am I misinterpreting/misunderstanding something here?
Comment 15 jrickman 2011-03-16 03:38:57 UTC
Further test results...

Compiled Marvell's Linux drivers for the 88SE6145 controller on CentOS 5.5 latest published kernel based on 2.6.18. No compile issues. Blacklisted "pata_marvell" driver. Recompiled "initrd" to ensure "mv61xx" driver was included. Reboot. All drives attached to Marvell controller are visible. Build a RAID-5 array on 4 drives on this controller. No issues. Read and write files to new RAID-5 array. No issues at all.

Swapped in a new blank IDE drive. Loaded RHEL6 (Scientific Linux 6.0). Updated system to latest published RPMs. Kernel is based on 2.6.32. Loaded RPMs needed to compile kernel modules. Attempted to compile Marvell's Linux driver. Compile fails. Attempted to use published workarounds to "activate" AHCI portion of "pata_marvell" code ("ahci.marvell_enable=1"). Did not work. Got all sorts of timeout issues when 88SE6145 controller attempts to talk to drives attached to it. When I run "blkid", errors scroll across Console at very fast rate.

Swapped in an IDE drive previously used in this machine. It has Fedora Core 14 with all current updates. Loaded RPMs needed to compile kernel modules. Attempted to compile Marvell's Linux driver. Compile fails.

[XYZ@fatman source]# make
make ARCH=i386  CC=cc LD=ld CROSS_COMPILE= V= -C /lib/modules/2.6.35.11-83.fc14.i686.PAE/build M=`pwd` modules
make[1]: Entering directory `/usr/src/kernels/2.6.35.11-83.fc14.i686.PAE'
  CC [M]  /XYZ/Marvell/1.0.0.9/source/linux/linux_sense.o
/XYZ/Marvell/1.0.0.9/source/linux/linux_sense.c: In function ‘HBA_Translate_Req_Status_To_OS_Status’:
/XYZ/Marvell/1.0.0.9/source/linux/linux_sense.c:23:10: error: ‘struct scsi_cmnd’ has no member named ‘use_sg’
/XYZ/Marvell/1.0.0.9/source/linux/linux_sense.c:34:40: error: ‘struct scatterlist’ has no member named ‘page’
/XYZ/Marvell/1.0.0.9/source/linux/linux_sense.c:50:11: error: ‘struct scsi_cmnd’ has no member named ‘use_sg’
/XYZ/Marvell/1.0.0.9/source/linux/linux_sense.c:53:36: error: ‘struct scsi_cmnd’ has no member named ‘request_buffer’
/XYZ/Marvell/1.0.0.9/source/linux/linux_sense.c:54:14: error: ‘struct scsi_cmnd’ has no member named ‘use_sg’
/XYZ/Marvell/1.0.0.9/source/linux/linux_sense.c:60:11: error: ‘struct scsi_cmnd’ has no member named ‘request_bufflen’
/XYZ/Marvell/1.0.0.9/source/linux/linux_sense.c:96:34: error: ‘SUGGEST_ABORT’ undeclared (first use in this function)
/XYZ/Marvell/1.0.0.9/source/linux/linux_sense.c:96:34: note: each undeclared identifier is reported only once for each function it appears in
make[2]: *** [/XYZ/Marvell/1.0.0.9/source/linux/linux_sense.o] Error 1
make[1]: *** [_module_/XYZ/Marvell/1.0.0.9/source] Error 2
make[1]: Leaving directory `/usr/src/kernels/2.6.35.11-83.fc14.i686.PAE'
make: *** [all] Error 2

"mv61xx" kernel module compile on RHEL6 exhibits the same error message.

Research shows the last precompiled module for Fedora and this controller, as provided by ASUS, is FC10.

So, what is "broken" here?

The controller? No. It is proven here to work under XP and RHEL5.x

The vendor? Possible. They have not updated their driver code to compile on kernel versions more recent than 2.6.18.

The appropriate Linux kernel modules? Yes. Some change(s) after 2.6.18 introduced issues that break this driver and also appear to break support for this controller. The efficacy of the provided workarounds seems "hit and miss". This appears to be documented in archived email threads and forums across multiple distributions dating back to 2008 or even late 2007.

"G*" searches show this "broken pata_marvell" kernel module impacts post-2.6.18 releases of Ubuntu, Debian, Mandriva, and Fedora. I suspect Suse variants may also be impacted. I have proven RHEL6 is also impacted.

It would be nice if the party, or parties, responsible for this mess would step forward and fix it. I've chased this problem for almost 2 years now via various bug reporting mechanisms and gotten no relief whatsoever.
Comment 16 Ben Hutchings 2011-03-16 04:12:06 UTC
(In reply to comment #15)
> The vendor? Possible. They have not updated their driver code to compile on
> kernel versions more recent than 2.6.18.

Yes, that's totally pathetic.

> It would be nice if the party, or parties, responsible for this mess would step
> forward and fix it. I've chased this problem for almost 2 years now via various
> bug reporting mechanisms and gotten no relief whatsoever.

Who's responsible? Seems like Marvell is most responsible - for shipping non-standard hardware without working drivers or documentation. You have no right to expect anyone else to work on drivers for you.

However, if there is a version where the in-kernel driver worked, you can use
'git bisect' to find out where it broke. If you can point that out, then one of the ATA driver maintainers may be able to come up with a possible fix for you.
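
When bisecting, each candidate kernel has to be booted and judged good or bad. One way to keep that verdict consistent is to classify a saved kernel log mechanically. This is a minimal sketch and not an established tool: `classify` is a hypothetical helper, and treating any IDENTIFY failure as "bad" is an assumption based on the messages quoted in this report. The return values mirror the exit codes `git bisect run` understands (0 good, 1 bad, 125 skip).

```python
import re

GOOD, BAD, SKIP = 0, 1, 125  # exit codes understood by `git bisect run`

# Failure messages taken from the pata_marvell and ahci logs in this report.
IDENTIFY_FAILURE = re.compile(r"failed to IDENTIFY|failed command: IDENTIFY DEVICE")
# Any libata port or device message, e.g. "ata4:" or "ata4.01:".
ATA_LINE = re.compile(r"\bata\d+(\.\d+)?:")


def classify(log_text):
    """Classify one boot's kernel log for a bisection step."""
    lines = log_text.splitlines()
    if not any(ATA_LINE.search(line) for line in lines):
        return SKIP  # no libata messages at all, so this boot says nothing
    if any(IDENTIFY_FAILURE.search(line) for line in lines):
        return BAD
    return GOOD
```

The tester would save the output of `dmesg` after each boot, run it through `classify`, and then mark the commit with `git bisect good`, `git bisect bad` or `git bisect skip` accordingly. A stricter check, such as requiring all four drives to reach "configured for UDMA", may be needed if failures are intermittent.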