Distribution: SuSE 9.0 64-bit (x86_64) Hardware Environment: 2 x AMD Opteron 240, Tyan Thunder K8W, 1 GB RAM (4 x 256, two nodes, dual-channel), Adaptec AHA19160 + SCSI CD-RW (Toshiba), LSI 53C1010 64-bit SCSI + 2 x HDD (IBM), NEC-based USB 2.0 adapter (Manhattan), GeForce FX5200 (LeadTek), SATA HDD (Seagate), IDE DVD (Liteon) Software Environment: Out-of-the box SuSE 9.0 (downloadable version) + Linux 2.6.1 Problem Description: The system hangs solid after a USB storage device has been disconnected Hint1: This occurs on 2.6.1-rc2-bk1 and 2.6.1-rc3 as well. I'm almost sure it occurs on 2.6.1-rc1, but I had no time to confirm it Hint2: The NEC-based USB 2.0 is involved, which is on a PCI connected to the bus via the AMD-8111 chip, so it may be a PCI bug Hint3: Here's a snippet from the kernel log: Jan 9 23:52:50 chimera kernel: hub 4-0:1.0: new USB device on port 3, assigned address 2 Jan 9 23:52:56 chimera kernel: Initializing USB Mass Storage driver... Jan 9 23:52:56 chimera kernel: scsi2 : SCSI emulation for USB Mass Storage devices Jan 9 23:52:56 chimera kernel: Vendor: Model: TS256MJFLASHA Rev: 1.00 Jan 9 23:52:56 chimera kernel: Type: Direct-Access ANSI SCSI revision: 02 Jan 9 23:52:56 chimera kernel: SCSI device sdc: 506400 512-byte hdwr sectors (259 MB) Jan 9 23:52:56 chimera kernel: sdc: assuming Write Enabled Jan 9 23:52:56 chimera kernel: sdc: assuming drive cache: write through Jan 9 23:52:57 chimera kernel: sdc: sdc1 Jan 9 23:52:57 chimera kernel: Attached scsi removable disk sdc at scsi2, channel 0, id 0, lun 0 Jan 9 23:52:57 chimera kernel: Attached scsi generic sg3 at scsi2, channel 0, id 0, lun 0, type 0 Jan 9 23:52:57 chimera kernel: WARNING: USB Mass Storage data integrity not assured Jan 9 23:52:57 chimera kernel: USB Mass Storage device found at 2 Jan 9 23:52:57 chimera kernel: drivers/usb/core/usb.c: registered new driver usb-storage Jan 9 23:52:57 chimera kernel: USB Mass Storage support registered. Jan 9 23:52:57 chimera kernel: Device not ready. Make sure there is a disc in the drive. Jan 9 23:53:55 chimera kernel: usb 4-3: USB disconnect, address 2 Jan 9 23:54:01 chimera kernel: Unable to handle kernel NULL pointer dereference at 0000000000000008 RIP: Jan 9 23:54:01 chimera kernel: <ffffffff801848dd>{chrdev_open+285}PML4 39400067 PGD 392bc067 PMD 0 The hang occurs now. Note: It was confirmed only with the help of a USB pendrive. I've not tried any other USB devices yet. Steps to reproduce: Put a USB pendrive into a USB port, wait 10 seconds and pull it out.
I would contact the USB people. It's probably a bug in their code. Connecting a serial console and getting a more complete oops log may also help.
Created attachment 1832 [details] A snippet from the kernel log with some oops More verbose USB debug info included. It starts at the time I'have pulled out the device from the USB port. Magic SysRq did not work after the oops.
I have some more observations: (1) USB printer does not trigger the error condition (2) An unsupported USB storage device (ie such that does not get a SCSI device attached to it) does not cause the hang (3) The problem does not seem to occur on i386
It seems very strange that a scsi block device would trigger a problem in character device open, unless its somehow related to the opening of the character sg node by some of the hotplug scripts Could you try to reproduce this without sg (i.e. probably compile sg out of the kernel if it's compiled in, or rename the module if it's modular)
It's difficult to see how usb-storage is the problem here. The logs indicate that the usb-storage disconnect sequence completed without errors, and the OOPS is from and entirely different piece of code. Considering that SCSI's /dev/sg* nodes are character devices, and just about the only character devices that could be involved, I agree with James' suggestion to try this without sg. If that's the problem, then this should be punted over to the SCSI folks.
Created attachment 1846 [details] What happened after rmmod sg I did what you suggested: I compiled sg out of the kernel and it doesn't hang now, so definitely sg is to blame. But first I'd unloaded sg (rmmod sg) and renamed sg to something else. Then I plugged the USB thing in and the kernel hung immediately. The story's in the attachement (it started after I'd plugged the pendrive)
This is pretty clearly a SCSI problem with sg. Re-assign to owner of SCSI.
Created attachment 1882 [details] Kernel log from 2.6.1-mm4 (from a serial console) I have some more details. The log is from 2.6.1-mm4 (everything is included), and I am able to get something like this from 2.6.1-bk4: Loglevel set to 9 usb 4-1: USB disconnect, address 2 Unable to handle kernel NULL pointer dereference at 0000000000000008 RIP: <ffffffff801848cd>{chrdev_open+285}PML4 3d802067 PGD 3eb5c067 PMD 0 Oops: 0002 [1] CPU 1 Pid: 1977, comm: hwscan Not tainted RIP: 0010:[<ffffffff801848cd>] <ffffffff801848cd>{chrdev_open+285} RSP: 0018:000001003eba3ea8 EFLAGS: 00010246 RAX: 0000000000000000 RBX: 0000000000000000 RCX: 000001001b162ca0 RDX: 000001003efcd3f8 RSI: 000001003eba3eb4 RDI: 0000000001500003 RBP: 000001001b162c40 R08: 000001003eba3dd8 R09: 0000000000000000 R10: 000001003d022bf0 R11: 0000000000000048 R12: 000001003efcd200 R13: 000001001b162c40 R14: 000001003d022b80 R15: 0000000000000000 FS: 00000000005044a0(0000) GS:ffffffff8049ca40(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000000000000008 CR3: 000000003ff9b000 CR4: 00000000000006a0 Process hwscan (pid: 1977, stackpage=1003fe974c0) Stack: 0000000000000212 00000000801ee2fa 000001003d022b80 000001003efcd200 000001001ffcedc0 000001001e6f0280 0000000000000001 ffffffff8017955b 0000000000008800 000001001e260000 Call Trace:<ffffffff8017955b>{dentry_open+315} <ffffffff8017940e>{filp_open+62} <ffffffff801799fc>{sys_open+76} <ffffffff8010efa0>{system_call+124} Code: 48 89 50 08 49 89 55 60 45 31 ed 48 89 4a 08 eb 29 66 90 48 RIP <ffffffff801848cd>{chrdev_open+285} RSP <000001003eba3ea8> CR2: 0000000000000008 <6>note: hwscan[1977] exited with preempt_count 1 bad: scheduling while atomic! Call Trace:<ffffffff80132d9e>{schedule+94} <ffffffff80167672>{unmap_vmas+498} <ffffffff8016ce21>{exit_mmap+321} <ffffffff80136413>{mmput+131} <ffffffff8013bbd4>{do_exit+580} <ffffffff8011ee6f>{do_page_fault+1087} <ffffffff8010f95d>{error_exit+0} <ffffffff801848cd>{chrdev_open+285} <ffffffff80184856>{chrdev_open+166} <ffffffff8017955b>{dentry_open+315} <ffffffff8017940e>{filp_open+62} <ffffffff801799fc>{sys_open+76} <ffffffff8010efa0>{system_call+124} Well, I think there's more to it than just a SCSI bug ...
I investigated this a bit and that's what I found. The error condition is triggered by something called hwscan (it comes from the package hwinfo-7.30-45 and is a SuSE stuff). It is called as /usr/sbin/hwscan --usb and/or /usr/sbin/hwscan --usb --fast from /etc/hotplug/usb.agent. Apparently, the crash does not occur any more after I've removed the x bit from the access mask of /usr/sbin/hwscan (sigh). I don't know what exactly it does, because SuSE guys did not bother to provide any documentation for it, but I think that it tampers with the kernel on a low level and can make it crash if this is done in a wrong way. So, it may be a good idea to ask SuSE about it. FYI, even if hwscan is executable, the crash does not occur if sg is not loaded. If hwscan is enabled and sg is loaded, the -bk5 and -mm5 crash too.
You're just hiding the bug then. I would suggest to strace hwscan and find out which file access causes the crash.
Created attachment 1925 [details] Output from strace hwscan Well, I know I'm hiding the bug by disabling the hwscan. It was only a hint. :-) Attached is an output from strace /usr/sbin/hwscan --usb. Actually there are two of them: the second one is mixed with the kernel debug output dumped to the same console (in the order of appearance). To me it looks like the first open("/dev/sg0", ...) triggers the crash, but you are more experienced.
Hint: sg0 is attached to a CD recorder, sg1, sg2 - SCSI HDDs, and sg3 - a USB pendrive
I forgot to write that the strace /usr/sbin/hwscan was done _after_ I had removed the pendrive from the USB port (yes, I know it's quite important), and something tells me that sg3 shouldn't have been there at that time. So, I think that the bottom line is this: hwscan tries to open /dev/sg3 which is attached to a device that is no longer present in the system and this causes the crash (apparently, I was wrong in the previous post). But, IMHO, this means that the bug is not on the SCSI part, because the SCSI layer has no idea that the device was disconnected until the USB layer makes it detach the sg*, and the USB layer does no such thing. Of course, I can be wrong as before, but please comment.
Created attachment 1954 [details] Kernel log from 2.6.2-rc2 I wrote the following piece of code: #include <fcntl.h> int main( int argc, char **argv ) { int fd; if( argc < 2 ) { printf( "Usage: %s device\n", argv[0] ); return 0; } fd = open( argv[1], O_RDONLY|O_NONBLOCK ); close( fd ); return 0; } as a proof of concept. I tested the kernel with the help of it. First, I plugged the USB pendrive into a port. Then, I pulled it out. Next, I ran the above program as "hwtest /dev/sg3" and the kernel crashed. Attached is a log from a serial console (out of 2.6.2-rc2). Note: all kernels above and including 2.6.2-rc1 are affected.
Created attachment 1963 [details] Kernel log from 2.6.2-rc2-mm1 I tested this on the 2.6.2-rc2-mm1 and the behavior was slightly different than before, so I decided to post the log (hwtest is the program listed in the previous post run by root as "hwtest /dev/sg3").
It seems that the problem has gone in the 2.6.2-bk1 and 2.6.2-mm1, although it is present in the 2.6.2 (I haven't tested the 2.6.2-bk2 yet).
The issue does not seem to be present in recent kernels. I'm marking it as closed.