Most recent kernel where this bug did not occur: unknown but also happens with 2.6.15 Distribution: gentoo Hardware Environment: SMP x86_64 Dual Core AMD Opteron Processor 265 4Gb RAM per CPU HBA card: LSI20320RB (SN:P055445105) ioc0: 53C1030: Capabilities={Initiator,Target} scsi2 : ioc0: LSI53C1030, FwRev=01032700h, Ports=1, MaxQ=255, IRQ=18 ./cfg1030 GETCONFIG 1 Read configuration has been initiated for controller 1 ------------------------------------------------------------------------ Controller information ------------------------------------------------------------------------ Controller type : LSI53C1020/1030 BIOS version : 5.07.03.00 Firmware version : 1.03.39.00 SCSI channel description : 1 parallel SCSI wide Initiator IDs (Channel/SCSI ID) : 1/8 Maximum physical devices : 15 Concurrent commands supported : 255 ------------------------------------------------------------------------ Logical drive information ------------------------------------------------------------------------ ------------------------------------------------------------------------ Physical device information ------------------------------------------------------------------------ Channel #1 Initiator at SCSI ID 8 Target on SCSI ID 0 Device is a Hard disk SCSI ID : 0 State : Ready (RDY) Size (in MB)/(in sectors) : 1430037/-1366251520 Device ID : IFT A16U-G2421 347D Attached Storage: Vendor: IFT Model: A16U-G2421 Rev: 347D Jun 22 08:13:13 db2 kernel: Type: Direct-Access ANSI SCSI revision: 03 lspci => 00:06.0 PCI bridge: Advanced Micro Devices [AMD] AMD-8111 PCI (rev 07) 00:07.0 ISA bridge: Advanced Micro Devices [AMD] AMD-8111 LPC (rev 05) 00:07.1 IDE interface: Advanced Micro Devices [AMD] AMD-8111 IDE (rev 03) 00:07.2 SMBus: Advanced Micro Devices [AMD] AMD-8111 SMBus 2.0 (rev 02) 00:07.3 Bridge: Advanced Micro Devices [AMD] AMD-8111 ACPI (rev 05) 00:0a.0 PCI bridge: Advanced Micro Devices [AMD] AMD-8131 PCI-X Bridge (rev 12) 00:0a.1 PIC: Advanced Micro Devices [AMD] AMD-8131 PCI-X 
IOAPIC (rev 01) 00:0b.0 PCI bridge: Advanced Micro Devices [AMD] AMD-8131 PCI-X Bridge (rev 12) 00:0b.1 PIC: Advanced Micro Devices [AMD] AMD-8131 PCI-X IOAPIC (rev 01) 00:18.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration 00:18.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map 00:18.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller 00:18.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control 00:19.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration 00:19.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map 00:19.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller 00:19.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control 01:01.0 SCSI storage controller: LSI Logic / Symbios Logic 53c1030 PCI-X Fusion-MPT Dual Ultra320 SCSI (rev 08) 01:05.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5704 Gigabit Ethernet (rev 10) 01:05.1 Ethernet controller: Broadcom Corporation NetXtreme BCM5704 Gigabit Ethernet (rev 10) 02:03.0 SCSI storage controller: Adaptec AIC-7902B U320 (rev 10) 02:03.1 SCSI storage controller: Adaptec AIC-7902B U320 (rev 10) 03:00.0 USB Controller: Advanced Micro Devices [AMD] AMD-8111 USB (rev 0b) 03:00.1 USB Controller: Advanced Micro Devices [AMD] AMD-8111 USB (rev 0b) 03:04.0 VGA compatible controller: ATI Technologies Inc Rage XL (rev 27) Software Environment: Portage 2.1 (default-linux/amd64/2005.1, gcc-3.4.5, glibc-2.3.6-r3, 2.6.17.1-1 x86_64) ================================================================= System uname: 2.6.17.1-1 x86_64 Dual Core AMD Opteron(tm) Processor 265 Gentoo Base System version 1.6.14 distcc 2.18.3 x86_64-pc-linux-gnu (protocols 1 and 2) (default port 3632) [disabled] dev-lang/python: 2.3.5-r2, 2.4.2 dev-python/pycrypto: 
2.0.1-r5 dev-util/ccache: [Not Present] dev-util/confcache: [Not Present] sys-apps/sandbox: 1.2.17 sys-devel/autoconf: 2.13, 2.59-r7 sys-devel/automake: 1.4_p6, 1.5, 1.6.3, 1.7.9-r1, 1.8.5-r3, 1.9.6-r1 sys-devel/binutils: 2.16.1-r2 sys-devel/gcc-config: 1.3.13-r2 sys-devel/libtool: 1.5.22 virtual/os-headers: 2.6.11-r2 ACCEPT_KEYWORDS="amd64" AUTOCLEAN="yes" CBUILD="x86_64-pc-linux-gnu" CFLAGS="-g -pipe -O2 -march=k8 -msse -msse2 -mmmx -m3dnow" CHOST="x86_64-pc-linux-gnu" CONFIG_PROTECT="/etc" CONFIG_PROTECT_MASK="/etc/env.d /etc/gconf /etc/revdep-rebuild /etc/terminfo" CXXFLAGS="-g -pipe -O2 -march=k8 -msse -msse2 -mmmx -m3dnow" DISTDIR="/usr/portage/distfiles" FEATURES="autoconfig distlocks metadata-transfer nostrip sandbox sfperms strict userpriv usersandbox" GENTOO_MIRRORS="http://distfiles.gentoo.org http://distro.ibiblio.org/pub/linux/distributions/gentoo" MAKEOPTS="-j4" PKGDIR="/usr/portage/packages" PORTAGE_RSYNC_OPTS="--recursive --links --safe-links --perms --times --compress --force --whole-file --delete --delete-after --stats --timeout=180 --exclude='/distfiles' --exclude='/local' --exclude='/packages'" PORTAGE_TMPDIR="/var/tmp" PORTDIR="/usr/portage" PORTDIR_OVERLAY="/usr/local/portage" SYNC="rsync://gentoo-rsync.d.local/gentoo-portage" USE="amd64 avi berkdb bitmap-fonts cli crypt cups dba dri eds emboss encode foomaticdb fortran gif gstreamer gtk2 imlib ipv6 isdnlog jpeg lzw lzw-tiff mp3 mpeg ncurses nls nptl nptlonly opengl pam pcre pdflib perl png pppd python quicktime readline reflection sdl session spell spl ssl syslog tcpd tiff truetype-fonts type1-fonts usb xml xml2 xorg xv zlib elibc_glibc kernel_linux userland_GNU" Unset: CTARGET, EMERGE_DEFAULT_OPTS, INSTALL_MASK, LANG, LC_ALL, LDFLAGS, LINGUAS, PORTAGE_RSYNC_EXTRA_OPTS Problem Description: When copying files from /tmp to an external storage device, CPU load increases, but disk IO stops after a while. Any attempt to look at the external storage i.e. ls, df, cd etc. 
results in a lockup of the terminal in use. kill -9 of the locked process has no effect. Reboot fails, requiring a power off and on. This has occurred using xfs, jfs, ext2, ext3 and reiserfs. Interestingly, using cfg1030 (90p4932) to look at the HBA configuration still works. While looking on the mailing lists, I noticed a similar problem with cp from NFS (this is not a copy from or to NFS...) http://www.ussg.iu.edu/hypermail/linux/kernel/0404.3/1026.html where Andrew Morton requested some debugging. I have followed the steps on the above page in the hope it helps. I will attach it once submitted. Steps to reproduce: cp /tmp/my.MYD /mnt/data/
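A side note on the cfg1030 output in this report: the negative "in sectors" value looks like a 32-bit sign overflow in the tool's printout rather than a real device property. The sketch below is only an illustration of that reading; the helper name `as_unsigned32` is made up here, and the 512-byte sector size is an assumption.

```python
def as_unsigned32(value: int) -> int:
    """Reinterpret a signed 32-bit integer as an unsigned one."""
    return value & 0xFFFFFFFF


# "in sectors" value as printed by cfg1030 in the report
reported_sectors = -1366251520

sectors = as_unsigned32(reported_sectors)

# Assuming 512-byte sectors, convert to MiB (1024 * 1024 bytes)
size_mib = sectors * 512 // (1024 * 1024)

print(sectors)
print(size_mib)
```

Read this way, the sector count becomes 2928715776 and the size works out to 1430037, which is the "Size (in MB)" figure cfg1030 prints on the same line.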
Created attachment 8392 [details] debug output debug output as mentioned in original bug report
It sounds like the block layer (more likely the driver) has lost an IO request. When it happens please do: echo p > /proc/sysrq-trigger dmesg -s 1000000 > foo and attach `foo' to this report (including the kernel bootup messages) Thanks.
There have been a bunch of problems reported with the current mptscsi driver and external storage. nStors don't work with the new driver at all. On the redhat bugzilla they say that vmware external storage doesn't work either. https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=188487 https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=190760 https://bugzilla.novell.com/show_bug.cgi?id=173330 It's buggy in sles9sp3 as well but it works in sles9sp2. I've been trying to debug this problem as well http://marc.theaimsgroup.com/?l=linux-scsi&m=114868539100023&w=2 What does your /proc/scsi/scsi look like? Someone said that with the nStor if you don't export the wahoo controller as a LUN then it works OK and gets great performance but someone else said it stopped printing error messages but the performance sucked. I haven't had time to test that myself yet.
Created attachment 8401 [details] sysrq-trigger output Andrew, as requested
Dan.... db2 ~ # cat /proc/scsi/scsi Attached devices: Host: scsi0 Channel: 00 Id: 00 Lun: 00 Vendor: FUJITSU Model: MAT3073NC Rev: 0108 Type: Direct-Access ANSI SCSI revision: 03 Host: scsi0 Channel: 00 Id: 01 Lun: 00 Vendor: FUJITSU Model: MAT3073NC Rev: 0108 Type: Direct-Access ANSI SCSI revision: 03 Host: scsi0 Channel: 00 Id: 06 Lun: 00 Vendor: SUPER Model: GEM318 Rev: 0 Type: Processor ANSI SCSI revision: 02 Host: scsi2 Channel: 00 Id: 00 Lun: 00 Vendor: IFT Model: A16U-G2421 Rev: 347D Type: Direct-Access ANSI SCSI revision: 03
Created attachment 8402 [details] startup dmesg Andrew, startup dmesg attached
On Fri, 23 Jun 2006 07:43:24 -0700 bugme-daemon@bugzilla.kernel.org wrote: > sysrq-trigger output Well yes, but there's no useful info there. Looks like the log buffer overflowed. You _should_ have a bunch of process stack backtraces. Can we prevent all that mpt driver gunk from coming out so it doesn't fill the log buffer? You might need to do `dmesg -n 8' to get the sysrq-trigger output to generate the needed info. You can run echo p > /proc/sysrq-trigger any time. I suggest you get that working right first, before starting testing. Thanks.
I will recompile the mpt driver without the DEBUG in it. Give me 30 minutes and I should have an output for you.
It would be an interesting test if you could try the 2.6.17 kernel. That way we could rule out any domain validation issues, as the newer driver runs on the SPI transport layer and uses generic DV, thanks to James Bottomley.
Created attachment 8404 [details] dmesg This dmesg does not have all the mpt debug in it
Eric, I will give 2.6.17 a go when I have the output from 2.6.17.1.
Dan's issue is a multi-LUN issue. I doubt you're experiencing the same issue. The MPT_DEBUG and MPT_DEBUG_MSG_FRAME options are too verbose. Can you enable the following in the Makefile: CFLAGS_mptscsih.o += -DMPT_DEBUG_REPLY EXTRA_CFLAGS += -DMPT_DEBUG_FAIL CFLAGS_mptbase.o += -DMPT_DEBUG_RESET CFLAGS_mptscsih.o += -DMPT_DEBUG_TM
Also, please enable display of the sense data: # sysctl -w dev.scsi.logging_level=0x1000
Created attachment 8405 [details] sysrq-trigger output hopefully this sysrq-trigger is better. It was created using the following echo p > /proc/sysrq-trigger dmesg -s 1000000 > foo
I have recompiled the kernel with CFLAGS_mptscsih.o += -DMPT_DEBUG_REPLY EXTRA_CFLAGS += -DMPT_DEBUG_FAIL CFLAGS_mptbase.o += -DMPT_DEBUG_RESET CFLAGS_mptscsih.o += -DMPT_DEBUG_TM and am just remote power-cycling at the moment. I will then set sysctl -w dev.scsi.logging_level=0x1000 as well. Eric, do you want another echo p > /proc/sysrq-trigger dmesg -s 1000000 > foo or did you want something else?
Created attachment 8406 [details] sysrq-trigger output with mptdebug This sysrq-trigger was generated with the debug flags as requested.
Created attachment 8407 [details] sysrq-trigger + mptdebug +sysctl This is a echo p > /proc/sysrq-trigger dmesg -s 1000000 > /tmp/foo with the following set sysctl -w dev.scsi.logging_level=0x1000 and in the fusion Makefile EXTRA_CFLAGS += -DMPT_DEBUG_FAIL CFLAGS_mptbase.o += -DMPT_DEBUG_RESET CFLAGS_mptscsih.o += -DMPT_DEBUG_TM CFLAGS_mptscsih.o += -DMPT_DEBUG_REPLY
Created attachment 8408 [details] IOC pre_reset routed to SCSI host driver! I saw this go past in a "tail -f /var/log/messages" and thought it might be useful. It is late here so I will continue this in the morning. If anyone wants any other debugging, please let me know.
On Fri, 23 Jun 2006 15:24:55 -0700 bugme-daemon@bugzilla.kernel.org wrote: > This dmesg does not have all the mpt debug in it Yes, but it doesn't have what we want in it either. oops, my fault. We should be using `t', not `p'. Sit down at a Linux box and do dmesg -n 8 echo t > /proc/sysrq-trigger dmesg -s 1000000 and you'll get lots of stuff like Call Trace: <ffffffff8040982e>{schedule_timeout+30} <ffffffff80335cba>{tty_poll+95} <ffffffff8028eb63>{do_select+1027} <ffffffff8028ef9a>{__pollwait+0} <ffffffff802299f9>{default_wake_function+0} <ffffffff802299f9>{default_wake_function+0} <ffffffff802299f9>{default_wake_function+0} <ffffffff8022ab07>{__wake_up+56} <ffffffff803a6b73>{sock_def_readable+63} <ffffffff804044c9>{unix_stream_sendmsg+589} <ffffffff803a2719>{do_sock_write+196} <ffffffff80229609>{activate_task+75} <ffffffff803a2d7f>{sock_aio_write+79} <ffffffff8028ee33>{sys_select+621} <ffffffff802439de>{autoremove_wake_function+0} <ffffffff80402618>{unix_ioctl+208} <ffffffff803a302a>{sock_ioctl+466} <ffffffff8028db4d>{do_ioctl+33} <ffffffff802092b6>{system_call+126} zsh S ffff810101f37f18 0 10108 10107 10123 (NOTLB) ffff810101f37f18 ffff810016fff8c8 ffffffff8040c2c0 0000000000000008 ffff81009fc762a8 ffff81009fc760c0 ffffffff8047ddc0 000116ffc49f8821 0000000000007a42 ffff810000000000 Call Trace: <ffffffff8040c2c0>{do_page_fault+1173} <ffffffff8029228d>{dput+61} <ffffffff802092b6>{system_call+126} <ffffffff80209185>{sys_rt_sigsuspend+199} <ffffffff8023ceaa>{sys_rt_sigprocmask+191} <ffffffff802095c3>{ptregscall_common+103} zsh R running task 0 10123 10108 (NOTLB) and that's what we want to see.
Created attachment 8409 [details] t sysrq-trigger I used the following to create this attachment. dmesg -n 8 echo t > /proc/sysrq-trigger dmesg -s 1000000 > /tmp/foo
Looking through the output, "mk2" is a bash script I wrote to do the dmesg -n 8 echo t > /proc/sysrq-trigger dmesg -s 1000000 > /tmp/foo "cpme" is a bash script to cp a series of files from /tmp to varying directories on the external storage. The current command it is running is cp /tmp/docset.MYD /mnt/sde1/mysql-4.0.24_me/var/me/docset.MYD and it has been doing so for at least the last eight hours. ls -l /tmp/docset.MYD -rw-r--r-- 1 root root 13808376 Apr 27 01:43 /tmp/docset.MYD At present I cannot do an "ls" on the external drive as it will lock up and require a power reset. If any more debugging is needed before the power reset, let me know. Otherwise I will gladly reset the power, unless someone has a way of "unblocking" everything. I am willing to give anything a try. From experience, a shutdown or reboot will just get blocked.
I know everyone is busy, but has anyone had a chance to look at the output I posted? What looks to be the cause of the problem? Many thanks
Yes, it looks like everything is stuck waiting for I/O completion. Probably because some request went to the driver and it got lost, or the completion interrupt was mishandled, etc. So yeah - your sysrq trace confirms that it's a driver issue. Our hopes rest with Eric ;)
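For anyone reading a large `echo t > /proc/sysrq-trigger` dump, the tasks of interest here are the ones in state D (uninterruptible sleep, typically waiting on I/O). The following is a rough sketch for picking them out of a saved dmesg file. The regular expression assumes the task header layout seen in this thread (command, state, PC or "running task", free stack, PID, parent PID, ...); that field order is an assumption based on the samples quoted here, and `blocked_tasks` is a hypothetical helper name.

```python
import re

# Assumed layout of a sysrq-t task header on this kernel:
#   comm  state  (pc | "running task")  free-stack  pid  parent ...
TASK_RE = re.compile(
    r"^(?P<comm>\S+)\s+(?P<state>[RSDTZX])\s+"
    r"(?:[0-9a-f]{8,16}|running task)\s+\d+\s+(?P<pid>\d+)\b"
)


def blocked_tasks(dump: str):
    """Return (comm, pid) for every task whose state column is D."""
    found = []
    for line in dump.splitlines():
        match = TASK_RE.match(line.strip())
        if match and match.group("state") == "D":
            found.append((match.group("comm"), int(match.group("pid"))))
    return found


# Task header lines taken from the traces quoted in this thread
sample = """\
zsh S ffff810101f37f18 0 10108 10107 10123 (NOTLB)
events/3 D ffff81010813fa88 0 17 1 18 16 (L-TLB)
zsh R running task 0 10123 10108 (NOTLB)
"""

print(blocked_tasks(sample))
```

On a real dump one would read the file written by `dmesg -s 1000000 > /tmp/foo` and pass its contents to `blocked_tasks`; the call traces that follow each D-state header are then the ones worth reading first.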
Sorry, I've been busy this week trying to get SAS wide port support out. According to both the sysrq trace and the previous dmesg, this is what has happened to target = 0, lun = 1:
(1) Firmware returns DID_NO_CONNECT, meaning the device has been removed
(2) Firmware returns SAM_STAT_BUSY, meaning the device is busy
(3) Firmware returns SAM_STAT_CHECK_CONDITION - I don't see the sense data, did you do "sysctl -w dev.scsi.logging_level=0x1000"?
(4) Task Aborts sent from above, Firmware says it succeeds
(5) Firmware returns SAM_STAT_BUSY, meaning the device is busy
(6) Task Aborts sent from above, Firmware says it succeeds
(7) Device continues returning BUSY, and upper layers send task aborts
(8) Upper layers give up, then issue Host Reset from above
(9) Domain Validation request from spi transport
(10) Transport sends inquiry to driver as part of Domain Validation, however the command doesn't complete, and we are sitting on a mutex. This is copied from the trace:
events/3 D ffff81010813fa88 0 17 1 18 16 (L-TLB)
ffff81010813fa88 ffff810103fe0848 0000000000000096 0000000000000096 00000000000000c9 0000000000000092 0000000000000292 ffff810104456750 0000000000000234 ffff81020459d770
Call Trace: <ffffffff8057ce9e>{wait_for_completion+158} <ffffffff80228170>{default_wake_function+0} <ffffffff80228170>{default_wake_function+0} <ffffffff803d08c7>{blk_execute_rq_nowait+151} <ffffffff803d09c0>{blk_execute_rq+208} <ffffffff803cfe16>{__freed_request+54} <ffffffff8049a19f>{scsi_execute+239} <ffffffff804a0dd4>{spi_execute+132} <ffffffff80254b70>{mempool_free_slab+0} <ffffffff804a2348>{spi_dv_device_compare_inquiry+120} <ffffffff804a2719>{spi_dv_device+265} <ffffffff803dd692>{kobject_get+18} <ffffffff8045e197>{get_device+23} <ffffffff804dd5f0>{mptspi_dv_renegotiate_work+0} <ffffffff804dc978>{mptspi_dv_device+184} <ffffffff804dd628>{mptspi_dv_renegotiate_work+56} <ffffffff8023f000>{run_workqueue+176} <ffffffff8023f19a>{worker_thread+330} <ffffffff80228170>{default_wake_function+0} <ffffffff80228170>{default_wake_function+0} <ffffffff8023f050>{worker_thread+0} <ffffffff80242769>{kthread+217} <ffffffff8020ac96>{child_rip+8} <ffffffff80219280>{flat_send_IPI_mask+0} <ffffffff80242690>{kthread+0} <ffffffff8020ac8e>{child_rip+0}
(12) I can't tell whether scsi_execute() in the mid layer is properly handling a device returning SAM_STAT_BUSY or SAM_STAT_CHECK_CONDITION, or it's possible the reply never came from firmware.
Questions -
(1) Are you sure you have proper cabling and termination?
(2) Can you try negotiating at a slower speed? This can be done by going into the BIOS configuration utility, or from /sys/class/spi_transport, by going into the proper target and modifying the period.
(3) What was the negotiation? It would be in dmesg or /var/log/boot.msg.
(4) Can you disable the kobject debug messages? They are too verbose.
Eric Moore
I have extracted the initialization of the device from dmesg. Eric, it looks like 320MB/s. I have removed the kobj debug, and will submit a new dmesg. Fusion MPT base driver 3.03.09 Copyright (c) 1999-2005 LSI Logic Corporation Fusion MPT SPI Host driver 3.03.09 GSI 18 sharing vector 0xB9 and IRQ 18 ACPI: PCI Interrupt 0000:01:01.0[A] -> GSI 28 (level, low) -> IRQ 18 mptbase: Initiating ioc0 bringup ioc0: 53C1030: Capabilities={Initiator,Target} scsi2 : ioc0: LSI53C1030, FwRev=01032700h, Ports=1, MaxQ=255, IRQ=18 Vendor: IFT Model: A16U-G2421 Rev: 347D Type: Direct-Access ANSI SCSI revision: 03 target2:0:0: Beginning Domain Validation target2:0:0: Ending Domain Validation target2:0:0: FAST-160 WIDE SCSI 320.0 MB/s DT IU QAS PCOMP (6.25 ns, offset 127) SCSI device sdc: 2928715776 512-byte hdwr sectors (1499502 MB)
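As a cross-check between the two firmware strings in this thread: the driver prints FwRev=01032700h, while cfg1030 in the original report prints firmware version 1.03.39.00. The sketch below shows that the two agree if the hex word is read as four bytes printed in decimal. This byte-per-field reading is inferred from these two values only, not taken from LSI documentation, and `fwrev_to_dotted` is a name invented for the example.

```python
def fwrev_to_dotted(fwrev: int) -> str:
    """Split a 32-bit FwRev word into four bytes and print them in decimal."""
    parts = [(fwrev >> shift) & 0xFF for shift in (24, 16, 8, 0)]
    return "{}.{:02d}.{:02d}.{:02d}".format(*parts)


# Value from the mptbase line: FwRev=01032700h
print(fwrev_to_dotted(0x01032700))
```

The third byte, 27h, is 39 in decimal, so both tools appear to be reporting the same firmware.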
Created attachment 8468 [details] dmesg new dmesg with kobj switched off.
Thank you. (1) Your device appears to be at LUN=0, however in the log the faulty device is LUN=1. Do you know why? (2) Do you have BIOS, so you can go into the ^C utility and try changing your device to a slower speed? Or try going into /sys/class/spi_transport, changing the period to the minimal period, and disabling QAS? Eric
Eric, I hope these answers help. (1) Are you sure you have proper cabling and termination? I have used three different cables, and by that I also mean different makes. I get the same with all of them. The A16U-G2421 has a "gui" in which the termination type can be changed. It has always been enabled. (2) Can you try negotiating at a slower speed? This can be done by going into the BIOS configuration utility, or from /sys/class/spi_transport, by going into the proper target and modifying the period. I have modified the Eonstor to 80MHz, 160MB/s. (3) What was the negotiation? It would be in dmesg or /var/log/boot.msg. From /var/log/dmesg it looks like 160MHz, 320MB/s. (4) Can you disable the kobject debug messages? They are too verbose. I have done this and rebooted with the new kernel (a -2 instead of -1). Elsewhere in your post you mention: I don't see the sense data, did you do "sysctl -w dev.scsi.logging_level=0x1000" I set this in sysctl.conf when you originally asked. I have just checked: db2 ~ # sysctl dev.scsi.logging_level dev.scsi.logging_level = 4096 db2 ~ # sysctl -w dev.scsi.logging_level=0x1000 dev.scsi.logging_level = 0x1000 db2 ~ # sysctl dev.scsi.logging_level dev.scsi.logging_level = 4096 db2 ~ # I presume this means it was originally set? I am not overly confident about modifying /sys/class/spi_transport/target2:0:0/period so I have changed the settings on the Eonstor and rebooted. It is now set to target2:0:0: FAST-80 WIDE SCSI 160.0 MB/s DT (12.5 ns, offset 127) I will give this a try and report back.
Thank you. (1) Your device appears to be at LUN=0, however in the log the faulty device is LUN=1. Do you know why? Yes, I tried LUN=0 and LUN=1; both have the same effect. It is now back at LUN=1. (2) Do you have BIOS, so you can go into the ^C utility and try changing your device to a slower speed? Or try going into /sys/class/spi_transport, changing the period to the minimal period, and disabling QAS? I would need to be in the office to do that, which unfortunately means Monday. It is now early Saturday morning in the UK. I am running the copy at 80MHz, and will come back to you with the results.
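On the question of whether the setting was already in place: the value read back before and after the `sysctl -w` is the same number, since 0x1000 in hex is 4096 in decimal. The sketch below confirms that and also tries to decode which logging facility the value enables. The shift table is written from memory of the kernel's scsi_logging.h (three bits per facility) and should be treated as an assumption to verify against the actual header; `decode_logging_level` is a hypothetical helper.

```python
# Facility bit offsets as recalled from the kernel's scsi_logging.h.
# Treat this table as an assumption, not an authoritative copy.
SCSI_LOG_SHIFTS = {
    "error": 0,
    "timeout": 3,
    "scan": 6,
    "mlqueue": 9,
    "mlcomplete": 12,
    "llqueue": 15,
    "llcomplete": 18,
    "hlqueue": 21,
    "hlcomplete": 24,
    "ioctl": 27,
}


def decode_logging_level(level: int) -> dict:
    """Return the non-zero 3-bit level for each logging facility."""
    decoded = {}
    for name, shift in SCSI_LOG_SHIFTS.items():
        value = (level >> shift) & 0x7
        if value:
            decoded[name] = value
    return decoded


requested = int("0x1000", 16)   # value passed to sysctl -w
read_back = 4096                # value sysctl printed back

print(requested == read_back)
print(decode_logging_level(read_back))
```

If the table is right, 0x1000 sets the mid-layer completion facility to level 1, which would fit the request to see sense data on completed commands.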
the sysctl did work because I have just rebooted and seen this in the kern.log when I mounted the external partition. Jun 30 23:59:26 db2 kernel: Reply ha=0 id=0 lun=1: Jun 30 23:59:26 db2 kernel: IOCStatus=0045h SCSIState=00h SCSIStatus=00h Jun 30 23:59:26 db2 kernel: resid=40 bufflen=64 xfer_cnt=24 Jun 30 23:59:26 db2 kernel: sc->underflow={report ERR if < 00h bytes xfer'd} Jun 30 23:59:26 db2 kernel: ActBytesXferd=18h Jun 30 23:59:26 db2 kernel: sc->result is 00000000h Jun 30 23:59:26 db2 kernel: Reply ha=0 id=0 lun=1: Jun 30 23:59:26 db2 kernel: IOCStatus=0045h SCSIState=00h SCSIStatus=00h Jun 30 23:59:26 db2 kernel: resid=40 bufflen=64 xfer_cnt=24 Jun 30 23:59:26 db2 kernel: sc->underflow={report ERR if < 00h bytes xfer'd} Jun 30 23:59:26 db2 kernel: ActBytesXferd=18h Jun 30 23:59:26 db2 kernel: sc->result is 00000000h Jun 30 23:59:26 db2 kernel: Reply ha=0 id=0 lun=1: Jun 30 23:59:26 db2 kernel: IOCStatus=0000h SCSIState=01h SCSIStatus=02h Jun 30 23:59:26 db2 kernel: resid=4 bufflen=4 xfer_cnt=0 Jun 30 23:59:26 db2 kernel: sc->result is 00000002h Jun 30 23:59:26 db2 kernel: sd 2:0:0:1: done SUCCESS 2 sd 2:0:0:1: Jun 30 23:59:26 db2 kernel: command: Log Sense: 4d 00 40 00 00 00 00 00 04 00 Jun 30 23:59:26 db2 kernel: sdc: Current: sense key: Illegal Request Jun 30 23:59:26 db2 kernel: Additional sense: Invalid command operation code Jun 30 23:59:26 db2 kernel: Reply ha=0 id=0 lun=1: Jun 30 23:59:26 db2 kernel: IOCStatus=0000h SCSIState=01h SCSIStatus=02h Jun 30 23:59:26 db2 kernel: resid=4 bufflen=4 xfer_cnt=0 Jun 30 23:59:26 db2 kernel: sc->result is 00000002h Jun 30 23:59:26 db2 kernel: sd 2:0:0:1: done SUCCESS 2 sd 2:0:0:1: Jun 30 23:59:26 db2 kernel: command: Log Sense: 4d 00 50 00 00 00 00 00 04 00 Jun 30 23:59:26 db2 kernel: sdc: Current: sense key: Illegal Request Jun 30 23:59:26 db2 kernel: Additional sense: Invalid command operation code Jun 30 23:59:59 db2 kernel: kjournald starting. 
Commit interval 5 seconds Jun 30 23:59:59 db2 kernel: EXT3 FS on sdc2, internal journal Jun 30 23:59:59 db2 kernel: EXT3-fs: mounted filesystem with ordered data mode. Still testing the copy.....
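The Reply blocks in the log above can be sanity-checked with a little arithmetic: the residual should equal the buffer length minus the transfer count, and ActBytesXferd (hex) should match xfer_cnt (decimal). The sketch below does that for the two distinct blocks in the log. The field names come straight from the log text; `parse_reply` is a made-up helper. As far as I recall from the MPI headers, IOCStatus 0045h is a SCSI data underrun, which would fit a 64-byte request returning 24 bytes, but treat that name as unverified.

```python
import re


def parse_reply(text: str) -> dict:
    """Pull the numeric fields out of one mptscsih Reply debug block."""
    fields = {}
    for key in ("resid", "bufflen", "xfer_cnt"):
        fields[key] = int(re.search(key + r"=(\d+)", text).group(1))
    for key in ("IOCStatus", "SCSIStatus"):
        hex_digits = re.search(key + r"=([0-9A-Fa-f]+)h", text).group(1)
        fields[key] = int(hex_digits, 16)
    return fields


underrun = parse_reply(
    "IOCStatus=0045h SCSIState=00h SCSIStatus=00h "
    "resid=40 bufflen=64 xfer_cnt=24 ActBytesXferd=18h"
)
check_condition = parse_reply(
    "IOCStatus=0000h SCSIState=01h SCSIStatus=02h "
    "resid=4 bufflen=4 xfer_cnt=0"
)

# resid should be what is left of the buffer after the transfer
print(underrun["bufflen"] - underrun["xfer_cnt"] == underrun["resid"])
# ActBytesXferd=18h is the same count as xfer_cnt=24
print(0x18 == underrun["xfer_cnt"])
# SCSI status 02h is CHECK CONDITION
print(check_condition["SCSIStatus"] == 0x02)
```

The second block is the Log Sense command being rejected with CHECK CONDITION (Illegal Request), which the log itself decodes; that looks like the array not supporting the command rather than a transport error.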
The only other reference to sense was in /var/log/debug Jun 30 23:59:18 db2 kernel: sdc: Mode Sense: 9b 00 00 08 Jun 30 23:59:18 db2 kernel: sdc: Mode Sense: 9b 00 00 08
Created attachment 8470 [details] kernel .config I have carried out the test cp over 200 times, probably closer to 300 by the time I finished faffing around modifying the copy shell script, and so far no errors have appeared in any of the log files. There are no debug messages in the log files either, which worries me. Have I done something wrong, or is it fixed? Total data copied is in excess of 1152G. Attached is my kernel .config. Eric, based on my .config, would you expect anything in the log files if all was working well?
Marshall, in the 2.6.17 kernel, support for the SPI transport layer was added to the mptspi driver. What this amounted to was that domain validation was moved out of the driver and into the transport layer. This handles setting up proper negotiation automatically, by running a series of tests to ensure the devices are running at the proper speed to match the cabling and termination of your devices. The traces you sent before indicated that your device was not there, and busy, which an improper speed could account for. I'm interested to know whether you can change the speed of your device, go back to the failing conditions, and then lower the speed until you're no longer receiving errors.
Eric, Do you mean change the speed on the external device, or on the server? On the external device (an Eonstor A16U-G2421) I have a dropdown of 160MHz,80MHz,40,33.... currently, I have 80 selected. I presume to fine tune it as you requested, I would need to do it on the server. Unfortunately, I do not know how to do that, however, if you tell me, I will give it a try. Thank you, Spencer
I'm talking about your end device that is attached to the 53c1030 scsi controller, not the server speed. Pls note that you can change the speed by going into the bios configuration utility for the controller. This is done during system boot up. At the point when you see the 1030 controller detected by bios, you select ^C, then enter the utility. In the utility, select devices. You should see all the devices, and the corresponding speed. FAST160 is Ultra 320, and FAST80 is Ultra 160. I suggest you go back to Ultra 320 speeds to verify that you can replicate the issue, then go back to Ultra 160 or slower to see when it goes away.
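To relate the FAST-160 / FAST-80 names to the period values shown in dmesg: the transfer rate in mega-transfers per second is 1000 divided by the period in nanoseconds, and a wide (16-bit) bus moves two bytes per transfer. The sketch below reproduces the two negotiated settings seen in this thread; `spi_rate` is a name chosen for the example.

```python
def spi_rate(period_ns: float, wide: bool = True):
    """Return (mega-transfers per second, MB/s) for a parallel SCSI period."""
    mega_transfers = 1000.0 / period_ns
    bytes_per_transfer = 2 if wide else 1   # wide bus is 16 bits
    return mega_transfers, mega_transfers * bytes_per_transfer


# 6.25 ns: the original negotiation (FAST-160, Ultra 320)
print(spi_rate(6.25))
# 12.5 ns: after lowering the speed on the array (FAST-80, Ultra 160)
print(spi_rate(12.5))
```

These give 160 MT/s at 320 MB/s and 80 MT/s at 160 MB/s, matching the "FAST-160 WIDE SCSI 320.0 MB/s" and "FAST-80 WIDE SCSI 160.0 MB/s" lines in the logs.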
Are you people interested in continuing work on this bug? I have a system that is affected by this. I also have an open ticket with the vendor, but to me it doesn't really look like a hw issue.
This still occurs in 2.6.21
After all it may at least partly be a hardware issue. I sent the controllers from my 2 IFT DAS boxes back for RMA. They resoldered the terminators. I only got one controller back so far and it seems to work just fine. Funny thing is they could only reproduce the error with jfs :) Bottom line: I really don't know if all this applies to this bug but it may be a clue for someone who knows more than I do.
Spencer, Any update on this? Is the problem still present in current 2.6.23+?
I will check 2.6.23 when I am in the office on Tuesday and get back to you.
Any update on this problem? Were you able to test with a recent kernel?