Bug 6739 - High CPU load, no disk io when cp, reboot fails, requires poweroff to recover
Status: REJECTED INSUFFICIENT_DATA
Alias: None
Product: File System
Classification: Unclassified
Component: Other
Hardware: i386 Linux
Importance: P2 blocking
Assignee: fs_other
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2006-06-23 01:38 UTC by GDS.Marshall
Modified: 2008-09-22 16:42 UTC
CC List: 4 users

See Also:
Kernel Version: 2.6.17.1
Subsystem:
Regression: ---
Bisected commit-id:


Attachments
debug output (244.99 KB, text/plain), 2006-06-23 01:40 UTC, GDS.Marshall
sysrq-trigger output (244.92 KB, text/plain), 2006-06-23 07:42 UTC, GDS.Marshall
startup dmesg (240.46 KB, text/plain), 2006-06-23 07:56 UTC, GDS.Marshall
dmesg (211.97 KB, text/plain), 2006-06-23 15:23 UTC, GDS.Marshall
sysrq-trigger output (237.21 KB, text/plain), 2006-06-23 16:09 UTC, GDS.Marshall
sysrq-trigger output with mptdebug (236.87 KB, text/plain), 2006-06-23 16:24 UTC, GDS.Marshall
sysrq-trigger + mptdebug +sysctl (236.87 KB, text/plain), 2006-06-23 16:33 UTC, GDS.Marshall
IOC pre_reset routed to SCSI host driver! (239.58 KB, text/plain), 2006-06-23 16:44 UTC, GDS.Marshall
t sysrq-trigger (245.86 KB, text/plain), 2006-06-24 01:06 UTC, GDS.Marshall
dmesg (24.75 KB, text/plain), 2006-06-30 15:39 UTC, GDS.Marshall
kernel .config (24.04 KB, text/plain), 2006-07-01 07:35 UTC, GDS.Marshall

Description GDS.Marshall 2006-06-23 01:38:40 UTC
Most recent kernel where this bug did not occur: unknown but also happens with
2.6.15
Distribution: gentoo
Hardware Environment:
SMP
x86_64 Dual Core AMD Opteron Processor 265
4Gb RAM per CPU

HBA card: LSI20320RB (SN:P055445105)
ioc0: 53C1030: Capabilities={Initiator,Target}
scsi2 : ioc0: LSI53C1030, FwRev=01032700h, Ports=1, MaxQ=255, IRQ=18

./cfg1030 GETCONFIG 1
Read configuration has been initiated for controller 1
------------------------------------------------------------------------
Controller information
------------------------------------------------------------------------
  Controller type                         : LSI53C1020/1030
  BIOS version                            : 5.07.03.00
  Firmware version                        : 1.03.39.00
  SCSI channel description                : 1 parallel SCSI wide
  Initiator IDs (Channel/SCSI ID)         : 1/8
  Maximum physical devices                : 15
  Concurrent commands supported           : 255
------------------------------------------------------------------------
Logical drive information
------------------------------------------------------------------------
------------------------------------------------------------------------
Physical device information
------------------------------------------------------------------------
Channel #1
  Initiator at SCSI ID 8
  Target on SCSI ID 0
    Device is a Hard disk
    SCSI ID                               : 0
    State                                 : Ready (RDY)
    Size (in MB)/(in sectors)             : 1430037/-1366251520
    Device ID                             : IFT     A16U-G2421      347D


Attached Storage:
Vendor: IFT       Model: A16U-G2421        Rev: 347D
Jun 22 08:13:13 db2 kernel:   Type:   Direct-Access                      ANSI
SCSI revision: 03

lspci =>
00:06.0 PCI bridge: Advanced Micro Devices [AMD] AMD-8111 PCI (rev 07)
00:07.0 ISA bridge: Advanced Micro Devices [AMD] AMD-8111 LPC (rev 05)
00:07.1 IDE interface: Advanced Micro Devices [AMD] AMD-8111 IDE (rev 03)
00:07.2 SMBus: Advanced Micro Devices [AMD] AMD-8111 SMBus 2.0 (rev 02)
00:07.3 Bridge: Advanced Micro Devices [AMD] AMD-8111 ACPI (rev 05)
00:0a.0 PCI bridge: Advanced Micro Devices [AMD] AMD-8131 PCI-X Bridge (rev 12)
00:0a.1 PIC: Advanced Micro Devices [AMD] AMD-8131 PCI-X IOAPIC (rev 01)
00:0b.0 PCI bridge: Advanced Micro Devices [AMD] AMD-8131 PCI-X Bridge (rev 12)
00:0b.1 PIC: Advanced Micro Devices [AMD] AMD-8131 PCI-X IOAPIC (rev 01)
00:18.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron]
HyperTransport Technology Configuration
00:18.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
00:18.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM
Controller
00:18.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron]
Miscellaneous Control
00:19.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron]
HyperTransport Technology Configuration
00:19.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
00:19.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM
Controller
00:19.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron]
Miscellaneous Control
01:01.0 SCSI storage controller: LSI Logic / Symbios Logic 53c1030 PCI-X
Fusion-MPT Dual Ultra320 SCSI (rev 08)
01:05.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5704 Gigabit
Ethernet (rev 10)
01:05.1 Ethernet controller: Broadcom Corporation NetXtreme BCM5704 Gigabit
Ethernet (rev 10)
02:03.0 SCSI storage controller: Adaptec AIC-7902B U320 (rev 10)
02:03.1 SCSI storage controller: Adaptec AIC-7902B U320 (rev 10)
03:00.0 USB Controller: Advanced Micro Devices [AMD] AMD-8111 USB (rev 0b)
03:00.1 USB Controller: Advanced Micro Devices [AMD] AMD-8111 USB (rev 0b)
03:04.0 VGA compatible controller: ATI Technologies Inc Rage XL (rev 27)


Software Environment:
Portage 2.1 (default-linux/amd64/2005.1, gcc-3.4.5, glibc-2.3.6-r3, 2.6.17.1-1
x86_64)
=================================================================
System uname: 2.6.17.1-1 x86_64 Dual Core AMD Opteron(tm) Processor 265
Gentoo Base System version 1.6.14
distcc 2.18.3 x86_64-pc-linux-gnu (protocols 1 and 2) (default port 3632) [disabled]
dev-lang/python:     2.3.5-r2, 2.4.2
dev-python/pycrypto: 2.0.1-r5
dev-util/ccache:     [Not Present]
dev-util/confcache:  [Not Present]
sys-apps/sandbox:    1.2.17
sys-devel/autoconf:  2.13, 2.59-r7
sys-devel/automake:  1.4_p6, 1.5, 1.6.3, 1.7.9-r1, 1.8.5-r3, 1.9.6-r1
sys-devel/binutils:  2.16.1-r2
sys-devel/gcc-config: 1.3.13-r2
sys-devel/libtool:   1.5.22
virtual/os-headers:  2.6.11-r2
ACCEPT_KEYWORDS="amd64"
AUTOCLEAN="yes"
CBUILD="x86_64-pc-linux-gnu"
CFLAGS="-g -pipe -O2 -march=k8 -msse -msse2 -mmmx -m3dnow"
CHOST="x86_64-pc-linux-gnu"
CONFIG_PROTECT="/etc"
CONFIG_PROTECT_MASK="/etc/env.d /etc/gconf /etc/revdep-rebuild /etc/terminfo"
CXXFLAGS="-g -pipe -O2 -march=k8 -msse -msse2 -mmmx -m3dnow"
DISTDIR="/usr/portage/distfiles"
FEATURES="autoconfig distlocks metadata-transfer nostrip sandbox sfperms strict
userpriv usersandbox"
GENTOO_MIRRORS="http://distfiles.gentoo.org
http://distro.ibiblio.org/pub/linux/distributions/gentoo"
MAKEOPTS="-j4"
PKGDIR="/usr/portage/packages"
PORTAGE_RSYNC_OPTS="--recursive --links --safe-links --perms --times --compress
--force --whole-file --delete --delete-after --stats --timeout=180
--exclude='/distfiles' --exclude='/local' --exclude='/packages'"
PORTAGE_TMPDIR="/var/tmp"
PORTDIR="/usr/portage"
PORTDIR_OVERLAY="/usr/local/portage"
SYNC="rsync://gentoo-rsync.d.local/gentoo-portage"
USE="amd64 avi berkdb bitmap-fonts cli crypt cups dba dri eds emboss encode
foomaticdb fortran gif gstreamer gtk2 imlib ipv6 isdnlog jpeg lzw lzw-tiff mp3
mpeg ncurses nls nptl nptlonly opengl pam pcre pdflib perl png pppd python
quicktime readline reflection sdl session spell spl ssl syslog tcpd tiff
truetype-fonts type1-fonts usb xml xml2 xorg xv zlib elibc_glibc kernel_linux
userland_GNU"
Unset:  CTARGET, EMERGE_DEFAULT_OPTS, INSTALL_MASK, LANG, LC_ALL, LDFLAGS,
LINGUAS, PORTAGE_RSYNC_EXTRA_OPTS

Problem Description:
When copying files from /tmp to an external storage device, CPU load increases,
but disk IO stops after a while.  Any attempt to look at the external storage
i.e. ls, df, cd etc. results in a lockup of the terminal in use. kill -9 of the
locked process has no effect.  Reboot fails, requiring a power off and on.  This
has occurred using xfs, jfs, ext2, ext3 and reiserfs.  Interestingly, using
cfg1030 (90p4932) to look at the HBA configuration still works.

While looking on the mailing lists, I noticed a similar problem with cp from NFS
(this is not a copy from or to NFS...)
http://www.ussg.iu.edu/hypermail/linux/kernel/0404.3/1026.html
Where Andrew Morton requested some debugging.  I have followed the steps on the
above page in the hope it helps.  I will attach it once submitted.


Steps to reproduce:
cp /tmp/my.MYD /mnt/data/
Comment 1 GDS.Marshall 2006-06-23 01:40:08 UTC
Created attachment 8392
debug output

debug output as mentioned in original bug report
Comment 2 Andrew Morton 2006-06-23 01:49:18 UTC
It sounds like the block layer (more likely the driver) has lost
an IO request.

When it happens please do:

echo p > /proc/sysrq-trigger
dmesg -s 1000000 > foo

and attach `foo' to this report (including the kernel
bootup messages)

Thanks.
Comment 3 Dan Carpenter 2006-06-23 03:13:44 UTC
There have been a bunch of problems reported with the current mptscsi driver and
external storage.  nStors don't work with the new driver at all.  On the redhat
bugzilla they say that vmware external storage doesn't work either.

https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=188487
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=190760
https://bugzilla.novell.com/show_bug.cgi?id=173330
It's buggy in sles9sp3 as well but it works in sles9sp2.

I've been trying to debug this problem as well
http://marc.theaimsgroup.com/?l=linux-scsi&m=114868539100023&w=2

What does your /proc/scsi/scsi look like?  Someone said that with the nStor if
you don't export the wahoo controller as a LUN then it works OK and gets great
performance but someone else said it stopped printing error messages but the
performance sucked.  I haven't had time to test that myself yet.
Comment 4 GDS.Marshall 2006-06-23 07:42:21 UTC
Created attachment 8401
sysrq-trigger output

Andrew, as requested
Comment 5 GDS.Marshall 2006-06-23 07:43:28 UTC
Dan....

db2 ~ # cat /proc/scsi/scsi 
Attached devices:
Host: scsi0 Channel: 00 Id: 00 Lun: 00
  Vendor: FUJITSU  Model: MAT3073NC        Rev: 0108
  Type:   Direct-Access                    ANSI SCSI revision: 03
Host: scsi0 Channel: 00 Id: 01 Lun: 00
  Vendor: FUJITSU  Model: MAT3073NC        Rev: 0108
  Type:   Direct-Access                    ANSI SCSI revision: 03
Host: scsi0 Channel: 00 Id: 06 Lun: 00
  Vendor: SUPER    Model: GEM318           Rev: 0   
  Type:   Processor                        ANSI SCSI revision: 02
Host: scsi2 Channel: 00 Id: 00 Lun: 00
  Vendor: IFT      Model: A16U-G2421       Rev: 347D
  Type:   Direct-Access                    ANSI SCSI revision: 03
Comment 6 GDS.Marshall 2006-06-23 07:56:19 UTC
Created attachment 8402
startup dmesg

Andrew, startup dmesg attached
Comment 7 Andrew Morton 2006-06-23 13:32:41 UTC
On Fri, 23 Jun 2006 07:43:24 -0700
bugme-daemon@bugzilla.kernel.org wrote:

> sysrq-trigger output

Well yes, but there's no useful info there.  Looks like the log buffer
overflowed.   You _should_ have a bunch of process stack backtraces.

Can we prevent all that mpt driver gunk from coming out so it doesn't
fill the log buffer?

You might need to do `dmesg -n 8' to get the sysrq-trigger output
to generate the needed info.  You can run

	echo p > /proc/sysrq-trigger

any time.  I suggest you get that working right first, before starting
testing.

Thanks.

Comment 8 GDS.Marshall 2006-06-23 14:41:06 UTC
I will recompile the mpt driver without the DEBUG in it.  Give me 30 minutes and
I should have an output for you.
Comment 9 Eric Moore 2006-06-23 15:09:17 UTC
It would be an interesting test if you could try the 2.6.17 kernel.  That way we 
could rule out any domain validation issues, as the newer driver is running 
with the SPI transport layer, using generic dv, thanks to James Bottomley.
Comment 10 GDS.Marshall 2006-06-23 15:23:50 UTC
Created attachment 8404
dmesg

This dmesg does not have all the mpt debug in it
Comment 11 GDS.Marshall 2006-06-23 15:32:40 UTC
Eric, I will give 2.6.17 a go when I have the output from 2.6.17.1.
Comment 12 Eric Moore 2006-06-23 15:38:49 UTC
Dan's issue is a multi-lun issue.  I doubt you're experiencing the same issue.

The MPT_DEBUG and MPT_DEBUG_MSG_FRAME are too verbose.

Can you enable the following in the Makefile:

CFLAGS_mptscsih.o += -DMPT_DEBUG_REPLY
EXTRA_CFLAGS += -DMPT_DEBUG_FAIL
CFLAGS_mptbase.o += -DMPT_DEBUG_RESET
CFLAGS_mptscsih.o += -DMPT_DEBUG_TM


Comment 13 Eric Moore 2006-06-23 15:40:46 UTC
Also pls enable displaying the sense data

# sysctl -w dev.scsi.logging_level=0x1000
Comment 14 GDS.Marshall 2006-06-23 16:09:01 UTC
Created attachment 8405
sysrq-trigger output

hopefully this sysrq-trigger is better.  It was created using the following
echo p > /proc/sysrq-trigger
dmesg -s 1000000 > foo
Comment 15 GDS.Marshall 2006-06-23 16:17:22 UTC
I have recompiled the kernel with

CFLAGS_mptscsih.o += -DMPT_DEBUG_REPLY
EXTRA_CFLAGS += -DMPT_DEBUG_FAIL
CFLAGS_mptbase.o += -DMPT_DEBUG_RESET
CFLAGS_mptscsih.o += -DMPT_DEBUG_TM

just remote powercycling at the moment.

I will then
sysctl -w dev.scsi.logging_level=0x1000

Eric, do you want another 
echo p > /proc/sysrq-trigger
dmesg -s 1000000 > foo

or did you want something else?
Comment 16 GDS.Marshall 2006-06-23 16:24:46 UTC
Created attachment 8406
sysrq-trigger output with mptdebug

This sysrq-trigger was generated with the debug flags as requested.
Comment 17 GDS.Marshall 2006-06-23 16:33:25 UTC
Created attachment 8407
sysrq-trigger + mptdebug +sysctl

This is a 
echo p > /proc/sysrq-trigger
dmesg -s 1000000 > /tmp/foo

with the following set
sysctl -w dev.scsi.logging_level=0x1000

and in the fusion Makefile
EXTRA_CFLAGS += -DMPT_DEBUG_FAIL
CFLAGS_mptbase.o += -DMPT_DEBUG_RESET
CFLAGS_mptscsih.o += -DMPT_DEBUG_TM
CFLAGS_mptscsih.o += -DMPT_DEBUG_REPLY
Comment 18 GDS.Marshall 2006-06-23 16:44:59 UTC
Created attachment 8408
IOC pre_reset routed to SCSI host driver!

I saw this go past in a "tail -f /var/log/messages" and thought it might be
useful.

It is late here so I will continue this in the morning, if anyone wants any
other debugging, please let me know.
Comment 19 Andrew Morton 2006-06-23 16:54:59 UTC
On Fri, 23 Jun 2006 15:24:55 -0700
bugme-daemon@bugzilla.kernel.org wrote:

> This dmesg does not have all the mpt debug in it

Yes, but it doesn't have what we want in it either.

oops, my fault.  We should be using `t', not `p'.

Sit down at a Linux box and do

	dmesg -n 8
	echo t > /proc/sysrq-trigger
	dmesg -s 1000000

and you'll get lots of stuff like



Call Trace: <ffffffff8040982e>{schedule_timeout+30}
       <ffffffff80335cba>{tty_poll+95} <ffffffff8028eb63>{do_select+1027}
       <ffffffff8028ef9a>{__pollwait+0} <ffffffff802299f9>{default_wake_function+0}
       <ffffffff802299f9>{default_wake_function+0} <ffffffff802299f9>{default_wake_function+0}
       <ffffffff8022ab07>{__wake_up+56} <ffffffff803a6b73>{sock_def_readable+63}
       <ffffffff804044c9>{unix_stream_sendmsg+589} <ffffffff803a2719>{do_sock_write+196}
       <ffffffff80229609>{activate_task+75} <ffffffff803a2d7f>{sock_aio_write+79}
       <ffffffff8028ee33>{sys_select+621} <ffffffff802439de>{autoremove_wake_function+0}
       <ffffffff80402618>{unix_ioctl+208} <ffffffff803a302a>{sock_ioctl+466}
       <ffffffff8028db4d>{do_ioctl+33} <ffffffff802092b6>{system_call+126}
zsh           S ffff810101f37f18     0 10108  10107 10123               (NOTLB)
ffff810101f37f18 ffff810016fff8c8 ffffffff8040c2c0 0000000000000008 
       ffff81009fc762a8 ffff81009fc760c0 ffffffff8047ddc0 000116ffc49f8821 
       0000000000007a42 ffff810000000000 
Call Trace: <ffffffff8040c2c0>{do_page_fault+1173} <ffffffff8029228d>{dput+61}
       <ffffffff802092b6>{system_call+126} <ffffffff80209185>{sys_rt_sigsuspend+199}
       <ffffffff8023ceaa>{sys_rt_sigprocmask+191} <ffffffff802095c3>{ptregscall_common+103}
zsh           R  running task       0 10123  10108                     (NOTLB)

and that's what we want to see.

Comment 20 GDS.Marshall 2006-06-24 01:06:27 UTC
Created attachment 8409
t sysrq-trigger 

I used the following to create this attachment.

dmesg -n 8
echo t > /proc/sysrq-trigger
dmesg -s 1000000 > /tmp/foo
Comment 21 GDS.Marshall 2006-06-24 01:18:13 UTC
looking through the output, "mk2" is a bash script I wrote to do the
dmesg -n 8
echo t > /proc/sysrq-trigger
dmesg -s 1000000 > /tmp/foo

"cpme" is a bash script to cp a series of files from tmp to varying directories
on the external storage.  The current cmd it is running is
cp /tmp/docset.MYD /mnt/sde1/mysql-4.0.24_me/var/me/docset.MYD
and has been doing it for at least the last eight hours.
ls -l /tmp/docset.MYD 
-rw-r--r-- 1 root root 13808376 Apr 27 01:43 /tmp/docset.MYD
At present I can not do an "ls" on the external drive as it will lock up, and
require a power reset.

If any more debugging is needed before the power reset let me know.  Otherwise I
will gladly reset the power.  Unless someone has a way of "unblocking"
everything.  I am willing to give anything a try.  From experience, a shutdown
or reboot will just get blocked.
Comment 22 GDS.Marshall 2006-06-27 17:15:47 UTC
I know everyone is busy, but has anyone had a chance to look at the output I
posted?  What looks to be the cause of the problem?

Many thanks
Comment 23 Andrew Morton 2006-06-27 17:29:06 UTC
Yes, it looks like everything is stuck waiting for I/O completion.  Probably
because some request went to the driver and it got lost, or the completion
interrupt was mishandled, etc.

So yeah - your sysrq trace confirms that it's a driver issue.  Our
hopes rest with Eric ;)
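
Tasks stuck waiting like this appear in the `echo t > /proc/sysrq-trigger` dump with state `D` (uninterruptible sleep) in the second column. A minimal sketch for pulling them out of a saved dump follows. The function name `list_blocked_tasks` is hypothetical, and the column positions (task name, state, address, a zero column, then PID) are assumed from the 2.6.17-era traces quoted in this report; task names containing spaces would confuse this simple filter.

```shell
# list_blocked_tasks FILE
# Hypothetical helper: print "name pid" for each task header line in a
# saved sysrq-t dump whose state column is D (uninterruptible sleep).
# The column layout is an assumption taken from the traces in this report.
list_blocked_tasks() {
    awk '$2 == "D" && $3 ~ /^[0-9a-f]+$/ { print $1, $5 }' "$1"
}
```

It would be run as `list_blocked_tasks /tmp/foo` on a saved dump. For the `events/3` header line quoted in comment 24 it prints `events/3 17`.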
Comment 24 Eric Moore 2006-06-30 10:20:49 UTC
Sorry, I've been busy this week trying to get sas wide port support out.

According to both the sysrq trace and the previous dmesg, this is what has
happened to target = 0, lun = 1:

(1) Firmware returns DID_NO_CONNECT, meaning the device has been removed
(2) Firmware returns SAM_STAT_BUSY, meaning the device is busy
(3) Firmware returns SAM_STAT_CHECK_CONDITION - I don't see the sense data, 
did you do "sysctl -w dev.scsi.logging_level=0x1000"?
(4) Task Aborts sent from above, Firmware says it succeeds
(5) Firmware returns SAM_STAT_BUSY, meaning the device is busy
(6) Task Aborts sent from above, Firmware says it succeeds
(7) Device continues returning BUSY, and upper layers send task aborts
(8) Upper layers give up, then issue a Host Reset from above
(9) Domain Validation request from spi transport
(10) Transport sends inquiry to driver as part of Domain Validation; however, 
the command doesn't complete, and we are sitting on a mutex.  This is copied 
from the trace below:

events/3      D ffff81010813fa88     0    17      1            18    16 (L-TLB)
ffff81010813fa88 ffff810103fe0848 0000000000000096 0000000000000096 
       00000000000000c9 0000000000000092 0000000000000292 ffff810104456750 
       0000000000000234 ffff81020459d770 
Call Trace: <ffffffff8057ce9e>{wait_for_completion+158}
       <ffffffff80228170>{default_wake_function+0} <ffffffff80228170>{default_wake_function+0}
       <ffffffff803d08c7>{blk_execute_rq_nowait+151} <ffffffff803d09c0>{blk_execute_rq+208}
       <ffffffff803cfe16>{__freed_request+54} <ffffffff8049a19f>{scsi_execute+239}
       <ffffffff804a0dd4>{spi_execute+132} <ffffffff80254b70>{mempool_free_slab+0}
       <ffffffff804a2348>{spi_dv_device_compare_inquiry+120}
       <ffffffff804a2719>{spi_dv_device+265} <ffffffff803dd692>{kobject_get+18}
       <ffffffff8045e197>{get_device+23} <ffffffff804dd5f0>{mptspi_dv_renegotiate_work+0}
       <ffffffff804dc978>{mptspi_dv_device+184} <ffffffff804dd628>{mptspi_dv_renegotiate_work+56}
       <ffffffff8023f000>{run_workqueue+176} <ffffffff8023f19a>{worker_thread+330}
       <ffffffff80228170>{default_wake_function+0} <ffffffff80228170>{default_wake_function+0}
       <ffffffff8023f050>{worker_thread+0} <ffffffff80242769>{kthread+217}
       <ffffffff8020ac96>{child_rip+8} <ffffffff80219280>{flat_send_IPI_mask+0}
       <ffffffff80242690>{kthread+0} <ffffffff8020ac8e>{child_rip+0}
(11) I can't tell if scsi_execute() in the mid layer is properly handling a 
device returning SAM_STAT_BUSY or SAM_STAT_CHECK_CONDITION, or it's possible 
the reply never came from firmware

Questions -

(1) Are you sure you have proper cabling and termination?
(2) Can you try negotiating at a slower speed?  This can be done by going into
the BIOS configuration utility, or from /sys/class/spi_transport, by
going into the proper target and modifying the period.
(3) What was the negotiation?  It would be in dmesg, or /var/log/boot.msg. 
(4) Can you disable the kobject debug messages?  They're too verbose.

Eric Moore

Comment 25 GDS.Marshall 2006-06-30 15:37:38 UTC
I have extracted the initialization of the device from dmesg.  Eric, it looks
like 320MB/s.  I have removed kobj debug, and will submit a new dmesg.

Fusion MPT base driver 3.03.09
Copyright (c) 1999-2005 LSI Logic Corporation
Fusion MPT SPI Host driver 3.03.09
GSI 18 sharing vector 0xB9 and IRQ 18
ACPI: PCI Interrupt 0000:01:01.0[A] -> GSI 28 (level, low) -> IRQ 18
mptbase: Initiating ioc0 bringup
ioc0: 53C1030: Capabilities={Initiator,Target}
scsi2 : ioc0: LSI53C1030, FwRev=01032700h, Ports=1, MaxQ=255, IRQ=18
  Vendor: IFT       Model: A16U-G2421        Rev: 347D
  Type:   Direct-Access                      ANSI SCSI revision: 03
 target2:0:0: Beginning Domain Validation
 target2:0:0: Ending Domain Validation
 target2:0:0: FAST-160 WIDE SCSI 320.0 MB/s DT IU QAS PCOMP (6.25 ns, offset 127)
SCSI device sdc: 2928715776 512-byte hdwr sectors (1499502 MB)
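
The sector count in the last line also accounts for the odd negative value in the cfg1030 output in the description (`1430037/-1366251520`): it is what 2928715776 looks like when held in a signed 32-bit integer, which suggests a display overflow in the tool rather than a problem with the array. A quick check, assuming a shell with 64-bit arithmetic (bash, or dash on a 64-bit system):

```shell
sectors=2928715776                       # from the dmesg line above

# Reinterpret as signed 32-bit: values >= 2^31 wrap by subtracting 2^32.
wrapped=$(( sectors - 4294967296 ))
echo "$wrapped"                          # -1366251520, as cfg1030 printed

# 512-byte sectors -> MiB (2048 sectors per MiB) and -> decimal MB.
echo $(( sectors / 2048 ))               # 1430037, the cfg1030 "Size (in MB)"
echo $(( sectors * 512 / 1000000 ))      # 1499502, the kernel's figure
```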
Comment 26 GDS.Marshall 2006-06-30 15:39:26 UTC
Created attachment 8468
dmesg

new dmesg with kobj switched off.
Comment 27 Eric Moore 2006-06-30 15:50:16 UTC
Thank you.

(1) Your device appears to be at LUN=0, however in the log the faulty device 
is at LUN=1, do you know why?

(2) Do you have BIOS, so you can go into the ^C utility and try changing your 
device to a slower speed?  Or try going into /sys/class/spi_transport and 
changing the period to the minimal period, and disabling qas?

Eric 
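
A minimal sketch of the sysfs route, for anyone who prefers not to go through the BIOS utility. It assumes the SPI transport class exposes `period` and `qas` attributes under the target directory; the attribute names and the accepted value formats are assumptions to verify on the running kernel, and the helper name `spi_slow_down` is hypothetical. Writing to sysfs needs root.

```shell
# spi_slow_down DIR PERIOD_NS
# Hypothetical helper: write a longer transfer period (in ns) into a
# target's spi_transport directory and turn QAS off if that attribute
# exists. Prints the values read back afterwards.
# The attribute names "period" and "qas" are assumptions.
spi_slow_down() {
    dir=$1
    period_ns=$2
    if [ ! -w "$dir/period" ]; then
        echo "no writable period attribute in $dir" >&2
        return 1
    fi
    echo "$period_ns" > "$dir/period"
    if [ -w "$dir/qas" ]; then
        echo 0 > "$dir/qas"
    fi
    echo "period: $(cat "$dir/period")"
    if [ -r "$dir/qas" ]; then
        echo "qas: $(cat "$dir/qas")"
    fi
}
```

For a target directory such as /sys/class/spi_transport/target2:0:0, passing 12.5 as the period would request FAST-80 timing. The new settings may only take effect at the next negotiation with the device.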
Comment 28 GDS.Marshall 2006-06-30 16:46:42 UTC
Eric,  Hope these answers help.

(1) Are you sure you have proper cabling and termination?
I have used three different cables, and by that, I also mean makes.  I get the
same with all of them.

The A16U-G2421 has a "gui" in which the termination type can be changed.  It has
always been enabled.

(2) Can you try negotiating at a slower speed?  This can be done by going into
the BIOS configuration utility, or from /sys/class/spi_transport, by
going into the proper target and modifying the period.
I have modified the Eonstor to 80 MHz, 160 MB/s

(3) What was the negotiation?  It would be in dmesg, or /var/log/boot.msg. 
From the /var/log/dmesg it looks like 160 MHz, 320 MB/s.

(4) Can you disable the kobject debug messages?  They're too verbose.
I have done this and rebooted with the new kernel (a -2 instead of -1)

You mention elsewhere in your post,
I don't see the sense data, did you do "sysctl -w dev.scsi.logging_level=0x1000"?

I set this in sysctl.conf when you originally asked.  I have just checked
db2 ~ # sysctl dev.scsi.logging_level
dev.scsi.logging_level = 4096
db2 ~ # sysctl -w dev.scsi.logging_level=0x1000
dev.scsi.logging_level = 0x1000
db2 ~ # sysctl dev.scsi.logging_level
dev.scsi.logging_level = 4096
db2 ~ # 
I presume this means it was originally set?
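
The two values are the same number: sysctl accepts the hexadecimal form on input and reports the setting in decimal. A quick check in the shell follows. The second line additionally assumes that the kernel packs the logging level as one 3-bit field per facility, with the field at bit 12 being the one this setting raises; that is an assumption about the kernel headers, not something stated in this report.

```shell
printf '%d\n' 0x1000             # 4096
echo $(( (0x1000 >> 12) & 7 ))   # 1: level 1 in the field at bit 12 (assumed layout)
```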

I am not overly confident about modifying
/sys/class/spi_transport/target2:0:0/period
so I have changed the settings on the Eonstor and rebooted, now it is set to
target2:0:0: FAST-80 WIDE SCSI 160.0 MB/s DT (12.5 ns, offset 127)

I will give this a try and report back.
Comment 29 GDS.Marshall 2006-06-30 16:52:59 UTC
Thank you.

(1) Your device appears to be at LUN=0, however in the log the faulty device 
is at LUN=1, do you know why?
Yes, I tried LUN=0 and LUN=1; both have the same effect.  It is now back at LUN=1

(2) Do you have BIOS, so you can go into the ^C utility and try changing your 
device to a slower speed?  Or try going into /sys/class/spi_transport and 
changing the period to the minimal period, and disabling qas?
I would need to be in the office to do that, which unfortunately means Monday. 
It is now early Saturday morning in the UK.

I am running the copy at 80 MHz, and will come back to you with the results.
Comment 30 GDS.Marshall 2006-06-30 17:03:40 UTC
the sysctl did work because I have just rebooted and seen this in the kern.log
when I mounted the external partition.
Jun 30 23:59:26 db2 kernel: Reply ha=0 id=0 lun=1:
Jun 30 23:59:26 db2 kernel: IOCStatus=0045h SCSIState=00h SCSIStatus=00h
Jun 30 23:59:26 db2 kernel: resid=40 bufflen=64 xfer_cnt=24
Jun 30 23:59:26 db2 kernel:   sc->underflow={report ERR if < 00h bytes xfer'd}
Jun 30 23:59:26 db2 kernel:   ActBytesXferd=18h
Jun 30 23:59:26 db2 kernel:   sc->result is 00000000h
Jun 30 23:59:26 db2 kernel: Reply ha=0 id=0 lun=1:
Jun 30 23:59:26 db2 kernel: IOCStatus=0045h SCSIState=00h SCSIStatus=00h
Jun 30 23:59:26 db2 kernel: resid=40 bufflen=64 xfer_cnt=24
Jun 30 23:59:26 db2 kernel:   sc->underflow={report ERR if < 00h bytes xfer'd}
Jun 30 23:59:26 db2 kernel:   ActBytesXferd=18h
Jun 30 23:59:26 db2 kernel:   sc->result is 00000000h
Jun 30 23:59:26 db2 kernel: Reply ha=0 id=0 lun=1:
Jun 30 23:59:26 db2 kernel: IOCStatus=0000h SCSIState=01h SCSIStatus=02h
Jun 30 23:59:26 db2 kernel: resid=4 bufflen=4 xfer_cnt=0
Jun 30 23:59:26 db2 kernel:   sc->result is 00000002h
Jun 30 23:59:26 db2 kernel: sd 2:0:0:1: done SUCCESS        2 sd 2:0:0:1: 
Jun 30 23:59:26 db2 kernel:         command: Log Sense: 4d 00 40 00 00 00 00 00 
04 00
Jun 30 23:59:26 db2 kernel: sdc: Current: sense key: Illegal Request
Jun 30 23:59:26 db2 kernel:     Additional sense: Invalid command operation code
Jun 30 23:59:26 db2 kernel: Reply ha=0 id=0 lun=1:
Jun 30 23:59:26 db2 kernel: IOCStatus=0000h SCSIState=01h SCSIStatus=02h
Jun 30 23:59:26 db2 kernel: resid=4 bufflen=4 xfer_cnt=0
Jun 30 23:59:26 db2 kernel:   sc->result is 00000002h
Jun 30 23:59:26 db2 kernel: sd 2:0:0:1: done SUCCESS        2 sd 2:0:0:1: 
Jun 30 23:59:26 db2 kernel:         command: Log Sense: 4d 00 50 00 00 00 00 00 
04 00
Jun 30 23:59:26 db2 kernel: sdc: Current: sense key: Illegal Request
Jun 30 23:59:26 db2 kernel:     Additional sense: Invalid command operation code
Jun 30 23:59:59 db2 kernel: kjournald starting.  Commit interval 5 seconds
Jun 30 23:59:59 db2 kernel: EXT3 FS on sdc2, internal journal
Jun 30 23:59:59 db2 kernel: EXT3-fs: mounted filesystem with ordered data mode.


Still testing the copy.....
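
The reply lines above can be cross-checked by hand: `xfer_cnt` should equal `bufflen - resid`, `ActBytesXferd` is the same count in hex, and `SCSIStatus` is the SAM status byte (00h GOOD, 02h CHECK CONDITION, 08h BUSY). A small sketch that does this follows; the function names are hypothetical, and only the three status codes relevant to this report are decoded.

```shell
# sam_status HEX: name a SAM status byte. Only the codes seen in this
# report are covered; anything else is printed as unknown.
sam_status() {
    case $1 in
        00) echo "GOOD" ;;
        02) echo "CHECK CONDITION" ;;
        08) echo "BUSY" ;;
        *)  echo "unknown (${1}h)" ;;
    esac
}

# xfer_count BUFFLEN RESID: bytes actually transferred.
xfer_count() {
    echo $(( $1 - $2 ))
}

xfer_count 64 40          # 24, matching xfer_cnt=24
printf '%d\n' 0x18        # 24, matching ActBytesXferd=18h
sam_status 02             # CHECK CONDITION, as in the Log Sense replies
```

Read this way, the first two replies are short transfers with good status (24 of 64 bytes), and the CHECK CONDITION replies are the array rejecting Log Sense with Illegal Request, which by itself only says the array does not implement that command.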
Comment 31 GDS.Marshall 2006-07-01 00:55:00 UTC
The only other reference to sense was in /var/log/debug
Jun 30 23:59:18 db2 kernel: sdc: Mode Sense: 9b 00 00 08
Jun 30 23:59:18 db2 kernel: sdc: Mode Sense: 9b 00 00 08
Comment 32 GDS.Marshall 2006-07-01 07:35:28 UTC
Created attachment 8470
kernel .config

I have carried out the test cp over 200 times, probably closer to 300 when I
finished faffing around modifying the copy shell script, and so far no errors
have appeared in any of the log files, no debug messages in the log file
either, which worries me.  Have I done something wrong? or is it fixed?  Total
data copied is in excess of 1152G.

Attached is my kernel .config

Eric, based on my .config, would you expect anything in the log files if all
was working well?
Comment 33 Eric Moore 2006-07-05 09:40:14 UTC
Marshall, in the 2.6.17 kernel, support for the spi transport layer was added 
to the mptspi driver.  What this amounted to was that domain validation was 
moved out of the driver and into the transport layer.  This handles setting up 
proper negotiation automatically, by running a series of tests to ensure the 
devices are running at the proper speed, to match the cabling and 
termination of your devices.  The traces you sent before indicated that your 
device was not there, and busy, which improper speed could account for.  I'm 
interested to know if you can change the speed of your device, go back to the 
failing conditions, and then lower the speed until you're no longer receiving 
errors.
Comment 34 GDS.Marshall 2006-07-06 05:25:36 UTC
Eric,

Do you mean change the speed on the external device, or on the server?

On the external device (an Eonstor A16U-G2421) I have a dropdown of
160MHz,80MHz,40,33.... currently, I have 80 selected.

I presume to fine tune it as you requested, I would need to do it on the server.
 Unfortunately, I do not know how to do that, however, if you tell me, I will
give it a try.

Thank you,

Spencer


Comment 35 Eric Moore 2006-07-06 08:18:00 UTC
I'm talking about your end device that is attached to the 53c1030 scsi 
controller, not the server speed.

Pls note that you can change the speed by going into the bios configuration 
utility for the controller.  This is done during system boot up. At the point 
when you see the 1030 controller detected by bios, you select ^C, then enter 
the utility.  In the utility, select devices. You should see all the devices, 
and the corresponding speed.

FAST160 is Ultra 320, and FAST80 is Ultra 160.  I suggest you go back to Ultra 
320 speeds to verify that you can replicate the issue, then go back to Ultra 
160 or slower to see when it goes away.
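
The naming follows from the transfer period reported in dmesg: FAST-n is the number of transfers per microsecond, and a wide bus moves two bytes per transfer. A rough sketch of the arithmetic follows, using integer picoseconds to stay within shell arithmetic; the function name `spi_rate` is hypothetical and a 16-bit (WIDE) bus is assumed.

```shell
# spi_rate PERIOD_PS: print the transfer rate (FAST-n) and the wide-bus
# throughput in MB/s for a transfer period given in picoseconds.
spi_rate() {
    mts=$(( 1000000 / $1 ))                 # 1e6 ps per microsecond
    echo "FAST-$mts: $(( mts * 2 )) MB/s"   # 2 bytes per transfer (wide bus)
}

spi_rate 6250     # FAST-160: 320 MB/s  (Ultra 320, 6.25 ns)
spi_rate 12500    # FAST-80: 160 MB/s   (Ultra 160, 12.5 ns)
spi_rate 25000    # FAST-40: 80 MB/s
```

The first two results match the "FAST-160 WIDE SCSI 320.0 MB/s" and "FAST-80 WIDE SCSI 160.0 MB/s" lines reported earlier for target2:0:0.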
Comment 36 Gabor K. Horvath 2007-07-02 01:15:11 UTC
Are you people interested in continuing work on this bug?
I have a system that is affected by this. I also have an open ticket with the vendor, but to me it doesn't really look like a hw issue.
Comment 37 GDS.Marshall 2007-07-06 15:42:40 UTC
This still occurs in 2.6.21
Comment 38 Gabor K. Horvath 2007-10-04 00:51:14 UTC
After all it may at least partly be a hardware issue.
I sent the controllers from my 2 IFT DAS boxes back for RMA.
They resoldered the terminators. I only got one controller back so far and it seems to work just fine.

Funny thing is they could only reproduce the error with jfs :)

Bottom line: I really don't know if all this applies to this bug but it may be a clue for someone who knows more than I do.
Comment 39 Natalie Protasevich 2007-10-22 23:09:00 UTC
Spencer,
Any update on this? Is the problem still present in current 2.6.23+?
Comment 40 GDS.Marshall 2007-10-26 14:37:07 UTC
I will check 2.6.23 when I am in the office on Tuesday and get back to you.
Comment 41 Natalie Protasevich 2008-03-25 20:20:10 UTC
Any update on this problem? Were you able to test with a recent kernel?
