Bug 1827
Summary: | USB storage-related system hang on AMD64 | ||
---|---|---|---|
Product: | SCSI Drivers | Reporter: | Rafael J. Wysocki (rjwysocki) |
Component: | Other | Assignee: | Mike Anderson (andmike) |
Status: | CLOSED PATCH_ALREADY_AVAILABLE | ||
Severity: | high | CC: | andi-bz, greg, jejb |
Priority: | P2 | ||
Hardware: | i386 | ||
OS: | Linux | ||
Kernel Version: | 2.6.1 | Subsystem: | |
Regression: | --- | Bisected commit-id: | |
Attachments: |
A snippet from the kernel log with some oops
What happened after rmmod sg Kernel log from 2.6.1-mm4 (from a serial console) Output from strace hwscan Kernel log from 2.6.2-rc2 Kernel log from 2.6.2-rc2-mm1 |
Description
Rafael J. Wysocki
2004-01-10 04:17:31 UTC
I would contact the USB people. It's probably a bug in their code. Connecting a serial console and getting a more complete oops log may also help. Created attachment 1832 [details]
A snippet from the kernel log with some oops
More verbose USB debug info included. It starts at the time I'have pulled out
the device from the USB port. Magic SysRq did not work after the oops.
I have some more observations: (1) USB printer does not trigger the error condition (2) An unsupported USB storage device (ie such that does not get a SCSI device attached to it) does not cause the hang (3) The problem does not seem to occur on i386 It seems very strange that a scsi block device would trigger a problem in character device open, unless its somehow related to the opening of the character sg node by some of the hotplug scripts Could you try to reproduce this without sg (i.e. probably compile sg out of the kernel if it's compiled in, or rename the module if it's modular) It's difficult to see how usb-storage is the problem here. The logs indicate that the usb-storage disconnect sequence completed without errors, and the OOPS is from and entirely different piece of code. Considering that SCSI's /dev/sg* nodes are character devices, and just about the only character devices that could be involved, I agree with James' suggestion to try this without sg. If that's the problem, then this should be punted over to the SCSI folks. Created attachment 1846 [details]
What happened after rmmod sg
I did what you suggested: I compiled sg out of the kernel and it doesn't hang
now, so definitely sg is to blame.
But first I'd unloaded sg (rmmod sg) and renamed sg to something else. Then I
plugged the USB thing in and the kernel hung immediately. The story's in the
attachement (it started after I'd plugged the pendrive)
This is pretty clearly a SCSI problem with sg. Re-assign to owner of SCSI. Created attachment 1882 [details]
Kernel log from 2.6.1-mm4 (from a serial console)
I have some more details. The log is from 2.6.1-mm4 (everything is included),
and I am able to get something like this from 2.6.1-bk4:
Loglevel set to 9
usb 4-1: USB disconnect, address 2
Unable to handle kernel NULL pointer dereference at 0000000000000008 RIP:
<ffffffff801848cd>{chrdev_open+285}PML4 3d802067 PGD 3eb5c067 PMD 0
Oops: 0002 [1]
CPU 1
Pid: 1977, comm: hwscan Not tainted
RIP: 0010:[<ffffffff801848cd>] <ffffffff801848cd>{chrdev_open+285}
RSP: 0018:000001003eba3ea8 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 000001001b162ca0
RDX: 000001003efcd3f8 RSI: 000001003eba3eb4 RDI: 0000000001500003
RBP: 000001001b162c40 R08: 000001003eba3dd8 R09: 0000000000000000
R10: 000001003d022bf0 R11: 0000000000000048 R12: 000001003efcd200
R13: 000001001b162c40 R14: 000001003d022b80 R15: 0000000000000000
FS: 00000000005044a0(0000) GS:ffffffff8049ca40(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000008 CR3: 000000003ff9b000 CR4: 00000000000006a0
Process hwscan (pid: 1977, stackpage=1003fe974c0)
Stack: 0000000000000212 00000000801ee2fa 000001003d022b80 000001003efcd200
000001001ffcedc0 000001001e6f0280 0000000000000001 ffffffff8017955b
0000000000008800 000001001e260000
Call Trace:<ffffffff8017955b>{dentry_open+315} <ffffffff8017940e>{filp_open+62}
<ffffffff801799fc>{sys_open+76} <ffffffff8010efa0>{system_call+124}
Code: 48 89 50 08 49 89 55 60 45 31 ed 48 89 4a 08 eb 29 66 90 48
RIP <ffffffff801848cd>{chrdev_open+285} RSP <000001003eba3ea8>
CR2: 0000000000000008
<6>note: hwscan[1977] exited with preempt_count 1
bad: scheduling while atomic!
Call Trace:<ffffffff80132d9e>{schedule+94} <ffffffff80167672>{unmap_vmas+498}
<ffffffff8016ce21>{exit_mmap+321} <ffffffff80136413>{mmput+131}
<ffffffff8013bbd4>{do_exit+580} <ffffffff8011ee6f>{do_page_fault+1087}
<ffffffff8010f95d>{error_exit+0} <ffffffff801848cd>{chrdev_open+285}
<ffffffff80184856>{chrdev_open+166} <ffffffff8017955b>{dentry_open+315}
<ffffffff8017940e>{filp_open+62} <ffffffff801799fc>{sys_open+76}
<ffffffff8010efa0>{system_call+124}
Well, I think there's more to it than just a SCSI bug ...
I investigated this a bit and that's what I found. The error condition is triggered by something called hwscan (it comes from the package hwinfo-7.30-45 and is a SuSE stuff). It is called as /usr/sbin/hwscan --usb and/or /usr/sbin/hwscan --usb --fast from /etc/hotplug/usb.agent. Apparently, the crash does not occur any more after I've removed the x bit from the access mask of /usr/sbin/hwscan (sigh). I don't know what exactly it does, because SuSE guys did not bother to provide any documentation for it, but I think that it tampers with the kernel on a low level and can make it crash if this is done in a wrong way. So, it may be a good idea to ask SuSE about it. FYI, even if hwscan is executable, the crash does not occur if sg is not loaded. If hwscan is enabled and sg is loaded, the -bk5 and -mm5 crash too. You're just hiding the bug then. I would suggest to strace hwscan and find out which file access causes the crash. Created attachment 1925 [details]
Output from strace hwscan
Well, I know I'm hiding the bug by disabling the hwscan. It was only a hint.
:-)
Attached is an output from strace /usr/sbin/hwscan --usb. Actually there are
two of them: the second one is mixed with the kernel debug output dumped to the
same console (in the order of appearance). To me it looks like the first
open("/dev/sg0", ...) triggers the crash, but you are more experienced.
Hint: sg0 is attached to a CD recorder, sg1, sg2 - SCSI HDDs, and sg3 - a USB pendrive I forgot to write that the strace /usr/sbin/hwscan was done _after_ I had removed the pendrive from the USB port (yes, I know it's quite important), and something tells me that sg3 shouldn't have been there at that time. So, I think that the bottom line is this: hwscan tries to open /dev/sg3 which is attached to a device that is no longer present in the system and this causes the crash (apparently, I was wrong in the previous post). But, IMHO, this means that the bug is not on the SCSI part, because the SCSI layer has no idea that the device was disconnected until the USB layer makes it detach the sg*, and the USB layer does no such thing. Of course, I can be wrong as before, but please comment. Created attachment 1954 [details]
Kernel log from 2.6.2-rc2
I wrote the following piece of code:
#include <fcntl.h>
int main( int argc, char **argv ) {
int fd;
if( argc < 2 ) {
printf( "Usage: %s device\n", argv[0] );
return 0;
}
fd = open( argv[1], O_RDONLY|O_NONBLOCK );
close( fd );
return 0;
}
as a proof of concept. I tested the kernel with the help of it.
First, I plugged the USB pendrive into a port. Then, I pulled it out.
Next, I ran the above program as "hwtest /dev/sg3" and the kernel
crashed.
Attached is a log from a serial console (out of 2.6.2-rc2).
Note: all kernels above and including 2.6.2-rc1 are affected.
Created attachment 1963 [details]
Kernel log from 2.6.2-rc2-mm1
I tested this on the 2.6.2-rc2-mm1 and the behavior was slightly different than
before, so I decided to post the log (hwtest is the program listed in the
previous post run by root as "hwtest /dev/sg3").
It seems that the problem has gone in the 2.6.2-bk1 and 2.6.2-mm1, although it is present in the 2.6.2 (I haven't tested the 2.6.2-bk2 yet). The issue does not seem to be present in recent kernels. I'm marking it as closed. |