Bug 15771
Summary: | Marvell 6145 on Jetway Daughterboard fails to detect any disks | ||
---|---|---|---|
Product: | IO/Storage | Reporter: | Dan Alderman (dan) |
Component: | Serial ATA | Assignee: | Jeff Garzik (jgarzik) |
Status: | RESOLVED OBSOLETE | ||
Severity: | normal | CC: | alan, ben, cebbert, daniel.hornung, jrickman, mlord, tj |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 2.6.32.10-90.fc12.x86_64 | Subsystem: | |
Regression: | No | Bisected commit-id: |
Description
Dan Alderman
2010-04-12 18:07:51 UTC
I've split the Debian bug report into:
http://bugs.debian.org/509923 - pata_marvell fails to handle some 88SE6145 boards
http://bugs.debian.org/577563 - ahci fails to handle some 88SE6145 boards
since ahci does seem to work for most people.

Does irqpoll help?

Mark, do you have experience with these controllers?

These are totally different controllers from the sata_mv ones. But.. I do have a box here that has a 6145 chip in it. So if we could find out what the end-user's drive setup is, then perhaps I could reproduce it here and investigate some. So, I need to know: what type (SATA/PATA) drives are connected to which ports of the controller, and which kernel module is being used to manage them (ahci, or pata_marvell?). Thx.

Hi, To the JCN92 motherboard I have a seagate 250GB 2.5" 5400rpm drive attached to SATA1 and a SATA DVD Drive attached to SATA2. On the Marvell Daughter Board (ADPE4S http://www.jetwaycomputer.com/Daughter_Board.html) I have 4 WD Green 1TB drives attached. I think I initially had some issues with the SATA cables in the Chenbro ES34069 chassis I am using as I was getting detection failures in the Marvell bios. Having replaced them they now detect reliably.

I installed CentOS 5.4 x64 and compiled up the vendor driver from here: http://jetwaycomputer.com/download/Drivers/ADPE4S/Marvell_M88SE6145_Linux.zip and I was able to detect and see the existing mdraid5 set on my WD disks, mounted and used without issue. Originally I had only tried using modern 64bit distros (Fedora 12, Debian 5.04) and had no luck.

I have just tried Fedora 12 i386 cd installer and seemed to get better results, but drive 4 on the controller timed out. I am now trying the latest kernel (2.6.32.11-99.fc12.i686.PAE) but appending ahci.enable_marvell=1 seems to result in no detection of the disks attached to the controller at all and lots of the same errors like I was seeing before.

OK, I had made a mistake and allowed anaconda to use ata_generic so ahci wasn't being used.
Remaking the initrd via the rescue cd having forced marvell_enable=1 worked. So, now I believe I am in the position that the 32 bit kernel works for this board, after some user intervention, but the 64 bit one doesn't appear to work at all. At least, I was unsuccessful with the distributions I tried.

After a little more investigation I have some more info. It would appear that the detect errors/timeouts are being caused by the PAE enabled kernel in Fedora and all the 64bit kernels I have tried (>2.6.26). The board I have is Atom 330 based with 2GB RAM. I have just tried using the PAE kernel in F12 (2.6.32.11-99.fc12.i686.PAE) with ahci.marvell_enable=1 and mem=nopentium and the controller detects and works and the system boots fine.

grub entry:
title Fedora (2.6.32.11-99.fc12.i686.PAE)
root (hd0,0)
kernel /vmlinuz-2.6.32.11-99.fc12.i686.PAE ro root=/dev/mapper/vg_chenbro-lv_root LANG=en_US.UTF-8 SYSFONT=latarcyrheb-sun16 KEYBOARDTYPE=pc KEYTABLE=uk ahci.marvell_enable=1 mem=nopentium
initrd /initramfs-2.6.32.11-99.fc12.i686.PAE.img

This is just a wild guess, but can you set up your initrd so it updates the CPU microcode before loading the ahci driver?

I currently have no idea how to do that. Is there a guide somewhere that can tell me how?

I've been playing around a little more with the setup. I'm not sure if this information helps with this particular bug, but it may prove useful to someone, perhaps, maybe...

I tried smartctl on one of the WD Green drives attached to the 6145 controller. When the drive is not in an active mdraid array and smartctl is left to probe for the drives it causes a reset of the 6145 controller and the re-detect fails in the same way the pata_marvell does if the ahci driver isn't forced to take over at boot time. After a long timeout, 2 of the 4 drives showed as detected in the kernel log.
I soft rebooted but quickly discovered that I had to completely remove the power from the system for at least 10 seconds before the kernel would re-detect the controller correctly at boot time. I am currently running 2.6.32.11-99.fc12.i686.PAE with ahci.marvell_enable=1 mem=nopentium on the boot command line.

If smartctl is run with -i -d sat to force SATA mode on a drive which is not part of an active array, it is detected and smartctl produces a report which looks similar to what one would expect. I don't have it to hand and I want to let my array finish rebuilding before I try anything else of this nature, but it looked similar to the report produced by the 2.5" SATA drive attached to the ICH7 on the same motherboard, which was definitely correct.

=== START OF INFORMATION SECTION ===
Device Model: ST9250315AS
Serial Number: 6VC2HFVF
Firmware Version: 0001SDM1
User Capacity: 250,059,350,016 bytes
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 8
ATA Standard is: ATA-8-ACS revision 4
Local Time is: Fri Apr 16 01:08:47 2010 BST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

If I run smartctl -d sat on the same drive when it's part of an active mdraid5 array (which was syncing the parity disk at the time) it works fine once, but on the second run the controller freezes and causes a re-detect.
Apr 15 12:58:07 chenbro kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Apr 15 12:58:07 chenbro kernel: ata1.00: failed command: IDENTIFY DEVICE
Apr 15 12:58:07 chenbro kernel: ata1.00: cmd ec/00:01:00:00:00/00:00:00:00:00/00 tag 0 pio 512 in
Apr 15 12:58:07 chenbro kernel: res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Apr 15 12:58:07 chenbro kernel: ata1.00: status: { DRDY }
Apr 15 12:58:07 chenbro kernel: ata1: hard resetting link
Apr 15 12:58:07 chenbro kernel: ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Apr 15 12:58:07 chenbro kernel: ata1.00: configured for UDMA/133
Apr 15 12:58:07 chenbro kernel: ata1: EH complete

Smartctl then reports back that there is SMART information available, but also shows errors. The machine then had to be powered off for 10s and then rebooted to get access back to the 2 missing drives. The controller appeared to recover OK so I re-assembled the md0 and let it start the resync again. 731 minutes to go...

I have this same motherboard and daughterboard. I have tested this problem with Fedora Core 13 (all released i386 PAE kernel versions) and Fedora Core 14 (all released i386 PAE kernel versions). I am using an IDE drive, attached to the motherboard port, for the "system". I have added the "ahci.marvell_enable=1" and "mem=nopentium" values to my "grub.conf" file, but it did not help.

I cannot get this controller to recognize all 4 attached drives via Fedora Core Linux. Sometimes 1 attached drive is seen in "blkid". Sometimes 2 attached drives are seen in "blkid". I can't ever remember seeing 3 or 4 of the attached drives.

I can see all drives in the 88SE6145 BIOS during system boot. I can create a RAID array using that BIOS. Once the system is booted, the above issues occur: not all attached drives are seen. That makes building a RAID array useless for testing so I just try for a "bunch of disks".
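Since the parameter is spelled two different ways in this thread (ahci.enable_marvell=1 and ahci.marvell_enable=1), it may be worth checking that the spelling later reported to work actually reached the running kernel. A minimal sketch, assuming a POSIX shell. The function name has_marvell_enable is made up for illustration, and it only inspects a command-line string, not module options baked into the initrd:

```shell
# Hypothetical helper (not from this report): succeed only if the given
# kernel command line contains the exact token ahci.marvell_enable=1.
has_marvell_enable() {
    # Pad with spaces so the token must match as a whole word.
    case " $1 " in
        *" ahci.marvell_enable=1 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Typical use on a running system:
#   has_marvell_enable "$(cat /proc/cmdline)" && echo "parameter present"
# Afterwards, lspci -k shows which kernel driver is bound to the controller.
```

A present parameter does not prove that ahci claimed the device, which is why the lspci -k check is suggested as a second step.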
For the record, I removed the Linux IDE hard disk and installed a different IDE drive so I could test this system under Windows XP with the vendor drivers. There were no issues at all. All ports on the 88SE6145 controller saw attached hard disks under Windows.

Over in the FreeBSD forums there has been discussion of this controller. Have a look at this thread: http://forums.freebsd.org/showthread.php?t=20412

From that FreeBSD thread, here is what they did to "fix" it:
> Disable NCQ for multiport Marvell 88SX61XX SATA controllers. Simultaneous
> active I/O to several disks (copying large file on ZFS) causes timeout after
> just a few seconds of run. Single port 88SX6111 seems like not affected.
>
> Skip reading transferred bytes count for these controllers. It works for
> 88SX6111, but 88SX6145 always returns zero there. Haven't tested others,
> but better to be safe.

(In reply to comment #12)
> From that FreeBSD thread, here is what they did to "fix" it:
>
> > Disable NCQ for multiport Marvell 88SX61XX SATA controllers. Simultaneous
> > active I/O to several disks (copying large file on ZFS) causes timeout after
> > just a few seconds of run. Single port 88SX6111 seems like not affected.
> >
> > Skip reading transferred bytes count for these controllers. It works for
> > 88SX6111, but 88SX6145 always returns zero there. Haven't tested others,
> > but better to be safe.

I think that's fixing a different problem. We already disable NCQ for these controllers, and we never read the byte count from the command structure (ahci_cmd_list::bytecount on FreeBSD, ahci_cmd_hdr::status on Linux).

My point in comparing Fedora Core 14 to FreeBSD is/was this: I get the impression from the FreeBSD forums that developers for FreeBSD may have this controller/chip family working and there could be something useful in their work.
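On the NCQ point, one way to see what was negotiated for a given disk is the queue depth the kernel exposes in sysfs; a depth of 1 generally means NCQ is not being used for that device. A rough sketch, assuming the usual /sys/block/<dev>/device/queue_depth layout. ncq_state is a hypothetical helper name, and the path is passed in so the logic can be tried against any file:

```shell
# Hypothetical helper: print "ncq-off" when the queue depth file holds 1
# (or less), "ncq-possible" otherwise. Fails if the file cannot be read.
ncq_state() {
    depth=$(cat "$1") || return 1
    if [ "$depth" -le 1 ]; then
        echo "ncq-off"
    else
        echo "ncq-possible"
    fi
}

# Example on a live system (the device name is an assumption):
#   ncq_state /sys/block/sda/device/queue_depth
```

A depth above 1 only shows that queuing is allowed, not that the controller handles it correctly.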
For the record, I have tested Fedora Core 12, 13, and 14 extensively on this system; Fedora Core is my "primary OS of choice" at home for servers while I use RHEL at work. I have no background with FreeBSD and Debian, but I tried FreeBSD about 6 months ago just to see what would happen. I got farther with it than Fedora. In FreeBSD 8.1 "public release" (no patches) I could see all attached drives on the 88SE6145 controller in "cfdisk" most of the time, but I had problems with read/write of files to the disks; my Fedora Core experience with this controller cannot even match that.

This is a "home" system and it is recorded in "smolts": http://www.smolts.org/client/show/pub_2a4d8e91-4787-4ae5-8d9f-ea04b2146633

According to Comment #4 in this bug, this controller works under CentOS 5.4 with the vendor drivers. Based on "G*" searches for this issue, some change(s) happened in the kernel for the ATA drivers (sometime after 2.6.18...I think...would have to check). There were several interesting archived email exchanges on the matter. It sounds like CentOS 5.x (RHEL5...2.6.18) did not adopt those ATA changes, which is why this controller works in that OS version with 3rd party drivers. Distributions like Fedora Core did adopt those ATA changes and have issues with this controller. I wonder if CentOS 6.x (RHEL6...2.6.32) will adopt those ATA changes? [nudge -> RH]

QED...or am I misinterpreting/misunderstanding something here?

Further test results...

Compiled Marvell's Linux drivers for the 88SE6145 controller on CentOS 5.5 latest published kernel based on 2.6.18. No compile issues. Blacklisted "pata_marvell" driver. Recompiled "initrd" to ensure "mv61xx" driver was included. Reboot. All drives attached to Marvell controller are visible. Build a RAID-5 array on 4 drives on this controller. No issues. Read and write files to new RAID-5 array. No issues at all.

Swapped in a new blank IDE drive. Loaded RHEL6 (Scientific Linux 6.0). Updated system to latest published RPMs.
Kernel is based on 2.6.32. Loaded RPMs needed to compile kernel modules. Attempted to compile Marvell's Linux driver. Compile fails. Attempted to use published workarounds to "activate" AHCI portion of "pata_marvell" code ("ahci.marvell_enable=1"). Did not work. Got all sorts of timeout issues when 88SE6145 controller attempts to talk to drives attached to it. When I run "blkid", errors scroll across Console at very fast rate.

Swapped in an IDE drive previously used in this machine. It has Fedora Core 14 with all current updates. Loaded RPMs needed to compile kernel modules. Attempted to compile Marvell's Linux driver. Compile fails.

[XYZ@fatman source]# make
make ARCH=i386 CC=cc LD=ld CROSS_COMPILE= V= -C /lib/modules/2.6.35.11-83.fc14.i686.PAE/build M=`pwd` modules
make[1]: Entering directory `/usr/src/kernels/2.6.35.11-83.fc14.i686.PAE'
CC [M] /XYZ/Marvell/1.0.0.9/source/linux/linux_sense.o
/XYZ/Marvell/1.0.0.9/source/linux/linux_sense.c: In function ‘HBA_Translate_Req_Status_To_OS_Status’:
/XYZ/Marvell/1.0.0.9/source/linux/linux_sense.c:23:10: error: ‘struct scsi_cmnd’ has no member named ‘use_sg’
/XYZ/Marvell/1.0.0.9/source/linux/linux_sense.c:34:40: error: ‘struct scatterlist’ has no member named ‘page’
/XYZ/Marvell/1.0.0.9/source/linux/linux_sense.c:50:11: error: ‘struct scsi_cmnd’ has no member named ‘use_sg’
/XYZ/Marvell/1.0.0.9/source/linux/linux_sense.c:53:36: error: ‘struct scsi_cmnd’ has no member named ‘request_buffer’
/XYZ/Marvell/1.0.0.9/source/linux/linux_sense.c:54:14: error: ‘struct scsi_cmnd’ has no member named ‘use_sg’
/XYZ/Marvell/1.0.0.9/source/linux/linux_sense.c:60:11: error: ‘struct scsi_cmnd’ has no member named ‘request_bufflen’
/XYZ/Marvell/1.0.0.9/source/linux/linux_sense.c:96:34: error: ‘SUGGEST_ABORT’ undeclared (first use in this function)
/XYZ/Marvell/1.0.0.9/source/linux/linux_sense.c:96:34: note: each undeclared identifier is reported only once for each function it appears in
make[2]: *** [/XYZ/Marvell/1.0.0.9/source/linux/linux_sense.o] Error 1
make[1]: *** [_module_/XYZ/Marvell/1.0.0.9/source] Error 2
make[1]: Leaving directory `/usr/src/kernels/2.6.35.11-83.fc14.i686.PAE'
make: *** [all] Error 2

"mv61xx" kernel module compile on RHEL6 exhibits the same error message.

Research shows the last precompiled module for Fedora and this controller, as provided by ASUS, is FC10.

So, what is "broken" here?

The controller? No. It is proven here to work under XP and RHEL5.x.

The vendor? Possible. They have not updated their driver code to compile on kernel versions more recent than 2.6.18.

The appropriate Linux kernel modules? Yes. Some change(s) after 2.6.18 introduced issues that break this driver and also appear to break support for this controller. The efficacy of the provided workarounds seems "hit and miss". This appears to be documented in archived email threads and forums across multiple distributions dating back to 2008 or even late 2007. "G*" searches show this "broken pata_marvell" kernel module impacts post-2.6.18 releases of Ubuntu, Debian, Mandriva, and Fedora. I suspect Suse variants may also be impacted. I have proven RHEL6 is also impacted.

It would be nice if the party, or parties, responsible for this mess would step forward and fix it. I've chased this problem for almost 2 years now via various bug reporting mechanisms and gotten no relief whatsoever.

(In reply to comment #15)
> The vendor? Possible. They have not updated their driver code to compile on
> kernel versions more recent than 2.6.18.

Yes, that's totally pathetic.

> It would be nice if the party, or parties, responsible for this mess would step
> forward and fix it. I've chased this problem for almost 2 years now via various
> bug reporting mechanisms and gotten no relief whatsoever.

Who's responsible? Seems like Marvell is most responsible - for shipping non-standard hardware without working drivers or documentation.
You have no right to expect anyone else to work on drivers for you. However, if there is a version where the in-kernel driver worked, you can use 'git bisect' to find out where it broke. If you can point that out, then one of the ATA driver maintainers may be able to come up with a possible fix for you.
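If someone does attempt the bisection suggested above, each test kernel needs a consistent good/bad verdict after boot. The following is only a sketch under assumptions: classify_boot is a hypothetical helper, and it treats any libata command timeout of the kind quoted earlier in this report ("Emask 0x4 (timeout)") as a bad boot. Whether a known-good in-kernel version exists at all is not established in this thread.

```shell
# Hypothetical helper: read a kernel log on stdin and print "bad" if it
# contains a libata command timeout, "good" otherwise.
classify_boot() {
    if grep -q 'Emask 0x4 (timeout)'; then
        echo bad
    else
        echo good
    fi
}

# Manual loop, one iteration per test kernel (versions left for the tester):
#   git bisect start
#   git bisect bad  <first known-broken version>
#   git bisect good <last known-working version>
#   ...build, install and boot the kernel that git checked out, then:
#   git bisect "$(dmesg | classify_boot)"
```

A boot where drives are silently missing without a logged timeout would be misreported as good, so comparing the detected drive count by hand is still advisable.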