1827 – USB storage-related system hang on AMD64

Bug 1827 - USB storage-related system hang on AMD64

Summary: USB storage-related system hang on AMD64

Status:	CLOSED PATCH_ALREADY_AVAILABLE

Alias:	None

Product:	SCSI Drivers
Classification:	Unclassified
Component:	Other (show other bugs)
Hardware:	i386 Linux

Importance:	P2 high
Assignee:	Mike Anderson

URL:
Keywords:

Depends on:
Blocks:

Reported:	2004-01-10 04:17 UTC by Rafael J. Wysocki
Modified:	2005-07-05 11:57 UTC (History)
CC List:	3 users (show)

See Also:
Kernel Version:	2.6.1
Subsystem:
Regression:	---
Bisected commit-id:

Attachments
A snippet from the kernel log with some oops (1.71 KB, text/plain) 2004-01-11 09:47 UTC, Rafael J. Wysocki	Details
What happened after rmmod sg (19.10 KB, text/plain) 2004-01-12 14:29 UTC, Rafael J. Wysocki	Details
Kernel log from 2.6.1-mm4 (from a serial console) (3.83 KB, text/plain) 2004-01-17 15:07 UTC, Rafael J. Wysocki	Details
Output from strace hwscan (12.40 KB, text/plain) 2004-01-21 14:46 UTC, Rafael J. Wysocki	Details
Kernel log from 2.6.2-rc2 (2.78 KB, text/plain) 2004-01-26 13:35 UTC, Rafael J. Wysocki	Details
Kernel log from 2.6.2-rc2-mm1 (3.29 KB, text/plain) 2004-01-29 12:55 UTC, Rafael J. Wysocki	Details
Add an attachment (proposed patch, testcase, etc.)

Description Rafael J. Wysocki 2004-01-10 04:17:31 UTC

Distribution: SuSE 9.0 64-bit (x86_64) 
Hardware Environment: 2 x AMD Opteron 240, Tyan Thunder K8W, 1 GB RAM (4 x 256, two 
nodes, dual-channel), Adaptec AHA19160 + SCSI CD-RW (Toshiba), LSI 53C1010 64-bit SCSI 
+ 2 x HDD (IBM), NEC-based USB 2.0 adapter (Manhattan), GeForce FX5200 (LeadTek), 
SATA HDD (Seagate), IDE DVD (Liteon) 
Software Environment: Out-of-the box SuSE 9.0 (downloadable version) + Linux 2.6.1 
Problem Description: The system hangs solid after a USB storage device has been 
disconnected 
 
Hint1: This occurs on 2.6.1-rc2-bk1 and 2.6.1-rc3 as well.  I'm almost sure it occurs on 
2.6.1-rc1, but I had no time to confirm it 
 
Hint2: The NEC-based USB 2.0 is involved, which is on a PCI connected to the bus via the 
AMD-8111 chip, so it may be a PCI bug 
 
Hint3: Here's a snippet from the kernel log: 
 
Jan  9 23:52:50 chimera kernel: hub 4-0:1.0: new USB device on port 3, assigned address 2 
Jan  9 23:52:56 chimera kernel: Initializing USB Mass Storage driver... 
Jan  9 23:52:56 chimera kernel: scsi2 : SCSI emulation for USB Mass Storage devices 
Jan  9 23:52:56 chimera kernel:   Vendor:           Model: TS256MJFLASHA     Rev: 1.00 
Jan  9 23:52:56 chimera kernel:   Type:   Direct-Access                      ANSI SCSI revision: 02 
Jan  9 23:52:56 chimera kernel: SCSI device sdc: 506400 512-byte hdwr sectors (259 MB) 
Jan  9 23:52:56 chimera kernel: sdc: assuming Write Enabled 
Jan  9 23:52:56 chimera kernel: sdc: assuming drive cache: write through 
Jan  9 23:52:57 chimera kernel:  sdc: sdc1 
Jan  9 23:52:57 chimera kernel: Attached scsi removable disk sdc at scsi2, channel 0, id 0, 
lun 0 
Jan  9 23:52:57 chimera kernel: Attached scsi generic sg3 at scsi2, channel 0, id 0, lun 0,  
type 0 
Jan  9 23:52:57 chimera kernel: WARNING: USB Mass Storage data integrity not assured 
Jan  9 23:52:57 chimera kernel: USB Mass Storage device found at 2 
Jan  9 23:52:57 chimera kernel: drivers/usb/core/usb.c: registered new driver usb-storage 
Jan  9 23:52:57 chimera kernel: USB Mass Storage support registered. 
Jan  9 23:52:57 chimera kernel: Device not ready.  Make sure there is a disc in the drive. 
Jan  9 23:53:55 chimera kernel: usb 4-3: USB disconnect, address 2 
Jan  9 23:54:01 chimera kernel: Unable to handle kernel NULL pointer dereference at 
0000000000000008 RIP:  
Jan  9 23:54:01 chimera kernel: <ffffffff801848dd>{chrdev_open+285}PML4 39400067 PGD 
392bc067 PMD 0 
 
The hang occurs now. 
 
Note: It was confirmed only with the help of a USB pendrive.  I've not tried any other USB 
devices yet. 
 
Steps to reproduce: Put a USB pendrive into a USB port, wait 10 seconds and pull it out.

Comment 1 Andi Kleen 2004-01-10 23:47:34 UTC

I would contact the USB people. It's probably a bug in their code.

Connecting a serial console and getting a more complete oops log may also help.

Comment 2 Rafael J. Wysocki 2004-01-11 09:47:02 UTC

Created attachment 1832 [details]
A snippet from the kernel log with some oops

More verbose USB debug info included.  It starts at the time I'have pulled out
the device from the USB port.  Magic SysRq did not work after the oops.

Comment 3 Rafael J. Wysocki 2004-01-11 09:51:56 UTC

I have some more observations: 
 
(1) USB printer does not trigger the error condition 
(2) An unsupported USB storage device (ie such that does not get a SCSI device 
attached to it) does not cause the hang 
(3) The problem does not seem to occur on i386

Comment 4 James Bottomley 2004-01-11 11:43:40 UTC

It seems very strange that a scsi block device would trigger a problem
in character device open, unless its somehow related to the opening of 
the character sg node by some of the hotplug scripts

Could you try to reproduce this without sg (i.e. probably compile sg out of the
kernel if it's compiled in, or rename the module if it's modular)

Comment 5 Matthew Dharm 2004-01-11 12:14:17 UTC

It's difficult to see how usb-storage is the problem here.  The logs indicate
that the usb-storage disconnect sequence completed without errors, and the OOPS
is from and entirely different piece of code.

Considering that SCSI's /dev/sg* nodes are character devices, and just about the
only character devices that could be involved, I agree with James' suggestion to
try this without sg.  If that's the problem, then this should be punted over to
the SCSI folks.

Comment 6 Rafael J. Wysocki 2004-01-12 14:29:38 UTC

Created attachment 1846 [details]
What happened after rmmod sg

I did what you suggested: I compiled sg out of the kernel and it doesn't hang
now, so definitely sg is to blame.

But first I'd unloaded sg (rmmod sg) and renamed sg to something else.	Then I
plugged the USB thing in and the kernel hung immediately.  The story's in the
attachement (it started after I'd plugged the pendrive)

Comment 7 Matthew Dharm 2004-01-13 10:41:33 UTC

This is pretty clearly a SCSI problem with sg.  Re-assign to owner of SCSI.

Comment 8 Rafael J. Wysocki 2004-01-17 15:07:49 UTC

Created attachment 1882 [details]
Kernel log from 2.6.1-mm4 (from a serial console)

I have some more details.  The log is from 2.6.1-mm4 (everything is included),
and I am able to get something like this from 2.6.1-bk4:

Loglevel set to 9
usb 4-1: USB disconnect, address 2
Unable to handle kernel NULL pointer dereference at 0000000000000008 RIP:
<ffffffff801848cd>{chrdev_open+285}PML4 3d802067 PGD 3eb5c067 PMD 0
Oops: 0002 [1]
CPU 1
Pid: 1977, comm: hwscan Not tainted
RIP: 0010:[<ffffffff801848cd>] <ffffffff801848cd>{chrdev_open+285}
RSP: 0018:000001003eba3ea8  EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 000001001b162ca0
RDX: 000001003efcd3f8 RSI: 000001003eba3eb4 RDI: 0000000001500003
RBP: 000001001b162c40 R08: 000001003eba3dd8 R09: 0000000000000000
R10: 000001003d022bf0 R11: 0000000000000048 R12: 000001003efcd200
R13: 000001001b162c40 R14: 000001003d022b80 R15: 0000000000000000
FS:  00000000005044a0(0000) GS:ffffffff8049ca40(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000008 CR3: 000000003ff9b000 CR4: 00000000000006a0
Process hwscan (pid: 1977, stackpage=1003fe974c0)
Stack: 0000000000000212 00000000801ee2fa 000001003d022b80 000001003efcd200
       000001001ffcedc0 000001001e6f0280 0000000000000001 ffffffff8017955b
       0000000000008800 000001001e260000
Call Trace:<ffffffff8017955b>{dentry_open+315} <ffffffff8017940e>{filp_open+62}
       <ffffffff801799fc>{sys_open+76} <ffffffff8010efa0>{system_call+124}


Code: 48 89 50 08 49 89 55 60 45 31 ed 48 89 4a 08 eb 29 66 90 48
RIP <ffffffff801848cd>{chrdev_open+285} RSP <000001003eba3ea8>
CR2: 0000000000000008
 <6>note: hwscan[1977] exited with preempt_count 1
bad: scheduling while atomic!

Call Trace:<ffffffff80132d9e>{schedule+94} <ffffffff80167672>{unmap_vmas+498}
       <ffffffff8016ce21>{exit_mmap+321} <ffffffff80136413>{mmput+131}
       <ffffffff8013bbd4>{do_exit+580} <ffffffff8011ee6f>{do_page_fault+1087}
       <ffffffff8010f95d>{error_exit+0} <ffffffff801848cd>{chrdev_open+285}
       <ffffffff80184856>{chrdev_open+166} <ffffffff8017955b>{dentry_open+315}
       <ffffffff8017940e>{filp_open+62} <ffffffff801799fc>{sys_open+76}
       <ffffffff8010efa0>{system_call+124}

Well, I think there's more to it than just a SCSI bug ...

Comment 9 Rafael J. Wysocki 2004-01-20 13:43:46 UTC

I investigated this a bit and that's what I found. 
 
The error condition is triggered by something called hwscan (it comes from the 
package hwinfo-7.30-45 and is a SuSE stuff).  It is called as /usr/sbin/hwscan 
--usb and/or /usr/sbin/hwscan --usb --fast from /etc/hotplug/usb.agent.  
Apparently, the crash does not occur any more after I've removed the x bit 
from the access mask of /usr/sbin/hwscan (sigh). 
 
I don't know what exactly it does, because SuSE guys did not bother to provide 
any documentation for it, but I think that it tampers with the kernel on a low 
level and can make it crash if this is done in a wrong way.  So, it may be a 
good idea to ask SuSE about it. 
 
FYI, even if hwscan is executable, the crash does not occur if sg is not 
loaded.  If hwscan is enabled and sg is loaded, the -bk5 and -mm5 crash too.

Comment 10 Andi Kleen 2004-01-20 14:54:01 UTC

You're just hiding the bug then. 

I would suggest to strace hwscan and find out which file access causes the
crash.

Comment 11 Rafael J. Wysocki 2004-01-21 14:46:40 UTC

Created attachment 1925 [details]
Output from strace hwscan

Well, I know I'm hiding the bug by disabling the hwscan.  It was only a hint.
:-)

Attached is an output from strace /usr/sbin/hwscan --usb.  Actually there are
two of them: the second one is mixed with the kernel debug output dumped to the
same console (in the order of appearance).  To me it looks like the first
open("/dev/sg0", ...) triggers the crash, but you are more experienced.

Comment 12 Rafael J. Wysocki 2004-01-21 14:50:34 UTC

Hint: sg0 is attached to a CD recorder, sg1, sg2 - SCSI HDDs, and sg3 - a USB 
pendrive

Comment 13 Rafael J. Wysocki 2004-01-22 13:47:03 UTC

I forgot to write that the strace /usr/sbin/hwscan was done _after_ I had 
removed the pendrive from the USB port (yes, I know it's quite important), and 
something tells me that sg3 shouldn't have been there at that time. 
 
So, I think that the bottom line is this: hwscan tries to open /dev/sg3 which 
is attached to a device that is no longer present in the system and this 
causes the crash (apparently, I was wrong in the previous post). 
 
But, IMHO, this means that the bug is not on the SCSI part, because the SCSI 
layer has no idea that the device was disconnected until the USB layer makes 
it detach the sg*, and the USB layer does no such thing.  Of course, I can be 
wrong as before, but please comment.

Comment 14 Rafael J. Wysocki 2004-01-26 13:35:39 UTC

Created attachment 1954 [details]
Kernel log from 2.6.2-rc2

I wrote the following piece of code:

#include <fcntl.h>

int main( int argc, char **argv ) {

  int  fd;

  if( argc < 2 ) {
    printf( "Usage: %s device\n", argv[0] );
    return  0;
  }
  fd =	open( argv[1], O_RDONLY|O_NONBLOCK );
  close( fd );

  return  0;

}

as a proof of concept.	I tested the kernel with the help of it.

First, I plugged the USB pendrive into a port.	Then, I pulled it out.
Next, I ran the above program as &quot;hwtest /dev/sg3&quot; and the kernel
crashed.

Attached is a log from a serial console (out of 2.6.2-rc2).

Note: all kernels above and including 2.6.2-rc1 are affected.

Comment 15 Rafael J. Wysocki 2004-01-29 12:55:21 UTC

Created attachment 1963 [details]
Kernel log from 2.6.2-rc2-mm1

I tested this on the 2.6.2-rc2-mm1 and the behavior was slightly different than
before, so I decided to post the log (hwtest is the program listed in the
previous post run by root as "hwtest /dev/sg3").

Comment 16 Rafael J. Wysocki 2004-02-06 05:17:17 UTC

It seems that the problem has gone in the 2.6.2-bk1 and 2.6.2-mm1, although it is present in 
the 2.6.2 (I haven't tested the 2.6.2-bk2 yet).

Comment 17 Rafael J. Wysocki 2005-01-15 06:01:02 UTC

The issue does not seem to be present in recent kernels.  I'm marking it as 
closed.

Note You need to log in before you can comment on or make changes to this bug.