Bug 2493 - hard freeze on heavy load with external firewire hard drive
Summary: hard freeze on heavy load with external firewire hard drive
Status: REJECTED INSUFFICIENT_DATA
Alias: None
Product: Drivers
Classification: Unclassified
Component: IEEE1394
Hardware: i386 Linux
Importance: P2 high
Assignee: Stefan Richter
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2004-04-11 04:58 UTC by thomas
Modified: 2006-01-08 04:12 UTC
CC List: 1 user

See Also:
Kernel Version: 2.6.4 and 2.6.5
Subsystem:
Regression: ---
Bisected commit-id:


Attachments

Description thomas 2004-04-11 04:58:17 UTC
Distribution:
Gentoo

Hardware Environment:
Toshiba Satellite 3000-512 laptop
Pentium III 1 GHz
512 MB RAM
Maxtor 5000DV 160 GB external FireWire drive
Texas Instruments TSB43AB22/A IEEE-1394a-2000 Controller

Software Environment:
Vanilla Linux kernels 2.6.4 and 2.6.5
All drives formatted with reiserfs

Problem Description:
The system freezes completely (unresponsive to mouse, keyboard and network events)
when heavy load is put on an external FireWire drive (reiserfs).

Steps to reproduce:
Put a heavy load on the drive (for example, emerging large packages on Gentoo or
editing large multimedia files).
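
For anyone trying to reproduce this without emerge or a media editor, the following is a minimal sketch of a sustained write load. It is an illustration only, not what the reporter ran: `write_load` is a hypothetical helper, and the directory you pass is assumed to be the mount point of the external FireWire drive.

```python
import os
import tempfile


def write_load(directory, total_bytes, chunk_size=1 << 20):
    """Write total_bytes of random data to a scratch file in directory.

    The data is flushed and fsync'ed so that it actually reaches the
    device instead of sitting in the page cache. Returns the file path.
    """
    fd, path = tempfile.mkstemp(dir=directory, prefix="load-", suffix=".bin")
    written = 0
    with os.fdopen(fd, "wb") as f:
        while written < total_bytes:
            n = min(chunk_size, total_bytes - written)
            f.write(os.urandom(n))
            written += n
        f.flush()
        os.fsync(f.fileno())
    return path
```

Calling `write_load` in a loop with a few hundred megabytes per call, and deleting the returned file afterwards, should keep the drive busy. Whether this is enough to trigger the freeze is an assumption; the real workloads named above mix reads and writes.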
Comment 1 thomas 2004-04-11 05:42:53 UTC
Also, this is not the first time the following has happened: the drive's node
changes while it is mounted. Maybe the two problems are related. I hope the
following dmesg output helps:

ieee1394: Error parsing configrom for node 0-00:1023
ieee1394: Node changed: 0-00:1023 -> 0-01:1023
ieee1394: Node added: ID:BUS[0-00:1023]  GUID[0010b92000c66716]
scsi1 : SCSI emulation for IEEE-1394 SBP-2 Devices
ieee1394: sbp2: Logged into SBP-2 device
ieee1394: sbp2: Node 0-00:1023: Max speed [S400] - Max payload [2048]
  Vendor: Maxtor    Model: 5000DV            Rev: 0100
  Type:   Direct-Access                      ANSI SCSI revision: 06
SCSI device sda: 320171008 512-byte hdwr sectors (163928 MB)
sda: asking for cache data failed
sda: assuming drive cache: write through
 /dev/scsi/host1/bus0/target1/lun0: p1 p2 p3
Attached scsi disk sda at scsi1, channel 0, id 1, lun 0
Attached scsi generic sg1 at scsi1, channel 0, id 1, lun 0,  type 0
found reiserfs format "3.6" with standard journal
Reiserfs journal params: device sda1, size 8192, journal first block 18, max
trans len 1024, max batch 900, max commit age 30, max trans age 30
reiserfs: checking transaction log (sda1) for (sda1)
reiserfs: replayed 22 transactions in 1 seconds
Using r5 hash to sort names
ieee1394: Node changed: 0-01:1023 -> 0-00:1023
ieee1394: Node suspended: ID:BUS[0-00:1023]  GUID[0010b92000c66716]
ieee1394: sbp2: Logged out of SBP-2 device
scsi1 (1:0): rejecting I/O to dead device
Buffer I/O error on device sda1, logical block 6514
lost page write due to I/O error on sda1
Buffer I/O error on device sda1, logical block 6515
lost page write due to I/O error on sda1
Buffer I/O error on device sda1, logical block 6516
lost page write due to I/O error on sda1
Buffer I/O error on device sda1, logical block 6517
lost page write due to I/O error on sda1
Buffer I/O error on device sda1, logical block 6518
lost page write due to I/O error on sda1
Buffer I/O error on device sda1, logical block 6519
lost page write due to I/O error on sda1
Buffer I/O error on device sda1, logical block 6520
lost page write due to I/O error on sda1
Buffer I/O error on device sda1, logical block 6521
lost page write due to I/O error on sda1
Buffer I/O error on device sda1, logical block 6522
lost page write due to I/O error on sda1
Buffer I/O error on device sda1, logical block 6523
lost page write due to I/O error on sda1
journal-601, buffer write failed
------------[ cut here ]------------
kernel BUG at fs/reiserfs/prints.c:339!
invalid operand: 0000 [#1]
PREEMPT
CPU:    0
EIP:    0060:[<c01a7f28>]    Tainted: PF
EFLAGS: 00010282   (2.6.5)
EIP is at reiserfs_panic+0x38/0x70
eax: 00000024   ebx: d679a000   ecx: 00000001   edx: c04a10f8
esi: e2e3e5e4   edi: 00000000   ebp: c17f1cd4   esp: c17f1cc4
ds: 007b   es: 007b   ss: 0068
Process knodemgrd_0 (pid: 12, threadinfo=c17f0000 task=dfc0b160)
Stack: c043ca6f c057c6a0 e2e3e5e4 d679a000 c17f1d18 c01b34de d679a000 c0448b60
       00001000 c17f1d18 d58ab000 dec15ce0 0000197c 0000000e 0000000c 00000000
       0000000b caa5e8b0 d679a000 00000001 00000004 c17f1d80 c01b7abc d679a000
Call Trace:
 [<c01b34de>] flush_commit_list+0x2ee/0x470
 [<c01b7abc>] do_journal_end+0x60c/0xc50
 [<c01b6d9d>] flush_old_commits+0x13d/0x1a0
 [<c01a4bb1>] reiserfs_write_super+0x81/0x90
 [<c015a3c7>] fsync_super+0xb7/0xc0
 [<c015a3f5>] fsync_bdev+0x25/0x60
 [<c0172948>] __invalidate_device+0x68/0x70
 [<c0160a70>] bdev_set+0x0/0x20
 [<c02bf43f>] invalidate_partition+0x3f/0x55
 [<c018dc2c>] del_gendisk+0x2c/0xf0
 [<c02f3fe0>] sd_remove+0x20/0x40
 [<c02b66f6>] device_release_driver+0x66/0x70
 [<c02b6870>] bus_remove_device+0x70/0xc0
 [<c02b566f>] device_del+0x6f/0xb0
 [<c02f13fe>] scsi_remove_device+0x5e/0xb0
 [<c02f0834>] scsi_forget_host+0x44/0x90
 [<c02ea56a>] scsi_remove_host+0x2a/0x60
 [<c030d7e9>] sbp2_remove_device+0x1b9/0x1e0
 [<c030cf85>] sbp2_remove+0x25/0x30
 [<c02b66f6>] device_release_driver+0x66/0x70
 [<c030343f>] nodemgr_suspend_ne+0x10f/0x140
 [<c0303710>] nodemgr_probe_ne+0x60/0x90
 [<c03037a6>] nodemgr_node_probe+0x66/0xb0
 [<c0303b61>] nodemgr_host_thread+0x171/0x1a0
 [<c03039f0>] nodemgr_host_thread+0x0/0x1a0
 [<c01072c5>] kernel_thread_helper+0x5/0x10

Code: 0f 0b 53 01 58 0e 44 c0 c7 04 24 e0 75 44 c0 b8 71 0a 44 c0
 initializing firmware upload
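
The call trace above is printed innermost frame first. Read bottom-up, it shows the knodemgrd thread removing the SBP-2 device, the SCSI disk removal syncing the block device, and reiserfs panicking when the journal flush fails. A minimal sketch for reducing such a trace to function names in call order follows. `parse_call_trace` is a hypothetical helper, not a kernel tool, and it assumes the 2.6-era i386 frame format seen here. Traces of that era can contain stale stack entries, so the result is a hint rather than an exact call chain.

```python
import re

# Matches frames such as " [<c02f3fe0>] sd_remove+0x20/0x40".
FRAME_RE = re.compile(r"\[<([0-9a-f]{8})>\]\s+(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+")


def parse_call_trace(log_text):
    """Return function names from the first call trace, outermost caller first."""
    names = []
    in_trace = False
    for line in log_text.splitlines():
        if line.strip() == "Call Trace:":
            in_trace = True
            continue
        if in_trace:
            match = FRAME_RE.search(line)
            if match is None:
                break  # first non-frame line ends the trace
            names.append(match.group(2))
    names.reverse()  # the kernel prints the innermost frame first
    return names
```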
Comment 2 Roger 2004-04-11 21:06:14 UTC
I'm seeing this problem currently here too with 2.6.4 & 2.6.5 using an external
ieee1394 hard drive via a "FireWire (IEEE 1394): Texas Instruments PCI4451
IEEE-1394 Controller".

I'm trying to use rsync to back up my OS, and running it with nice -n 19 has not helped.

I am working on my laptop remotely and cannot connect an external box to log
the dump to syslog. I can only say that I'm not seeing any "Buffer I/O error"
messages, just the PREEMPT line, and nothing mentioning reiserfs either (although
I am using reiserfs on the external HDD).
Comment 3 Stefan Neupert 2004-06-12 07:41:34 UTC
And here too with

Kernel (from kernel.org) 2.6.4 and gcc version 3.3.3 (Debian 20040401)
WDIGTL   Model: WDCFireWireHD-Ox Rev: 2.70 (120 GB ext. HDD)
Debian 'unstable'

when using rsync or unison-gtk the box freezes up completely after a few minutes.


Comment 4 F 2004-11-29 04:57:21 UTC
The problem persists with the 2.6.8 and 2.6.9 kernels from Debian. I have tried
an old 2.6.3 kernel and it works without any problem.
Comment 5 Adrian Bunk 2005-07-05 10:59:58 UTC
Is this problem still present in a vanilla 2.6.12.2 ftp.kernel.org kernel
without any external modules loaded since the last boot?
Comment 6 Stefan Richter 2005-07-24 04:24:21 UTC
To everyone who commented here: Is this problem still present with 2.6.12.x?
Does one of the hints given at http://www.linux1394.org/faq.php#sbp2abort help?

Although the problems reported by the commenters seem to be the same, they may
actually have different causes. In case of Thomas' report, the first sign of
trouble is:
> ieee1394: Node changed: 0-01:1023 -> 0-00:1023
> ieee1394: Node suspended: ID:BUS[0-00:1023]  GUID[0010b92000c66716]
> ieee1394: sbp2: Logged out of SBP-2 device
This means that the FireWire controller told the drivers that the disk was
disconnected. I assume you did not trip over the cable, so there may be a
hardware problem like an overheated physical interface chip in the disk
enclosure or in the PC. If so, then we can of course not fix it in software. But
we have to fix the occurrence of DoS situations at least.

Do the other commenters see the "node changed"/"suspended"/"logged out" messages
too?
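
To make that check easier, here is a minimal sketch that scans saved kernel log text for the three messages quoted above. `find_disconnect_events` and the event labels are hypothetical names chosen for illustration. The patterns are taken from the dmesg output in this report and may not match other kernel versions.

```python
import re

# Message formats copied from the dmesg output in Comment 1.
PATTERNS = {
    "node_changed": re.compile(r"ieee1394: Node changed: (\S+) -> (\S+)"),
    "node_suspended": re.compile(r"ieee1394: Node suspended: ID:BUS\[(\S+?)\]"),
    "sbp2_logout": re.compile(r"ieee1394: sbp2: Logged out of SBP-2 device"),
}


def find_disconnect_events(log_text):
    """Return (line_number, kind, line) for every matching log line, in order."""
    events = []
    for lineno, line in enumerate(log_text.splitlines(), start=1):
        for kind, pattern in PATTERNS.items():
            if pattern.search(line):
                events.append((lineno, kind, line.strip()))
    return events
```

Running it on the text saved from `dmesg` after a freeze (or on a syslog excerpt) and posting the matching lines would be enough to answer the question.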
Comment 7 Stefan Richter 2005-10-01 01:36:42 UTC
2.6.14-rc3 contains two sbp2 fixes that may be related to the reports here.
Please test.
Comment 8 Stefan Richter 2005-10-29 04:58:27 UTC
Please test Linux 2.6.14.
