Bug 25832 - kernel crashes when a mounted ext3/4 file system is physically removed
Summary: kernel crashes when a mounted ext3/4 file system is physically removed
Status: RESOLVED CODE_FIX
Alias: None
Product: File System
Classification: Unclassified
Component: ext4 (show other bugs)
Hardware: All Linux
: P1 normal
Assignee: fs_ext4@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2010-12-29 11:47 UTC by rocko
Modified: 2012-07-02 13:24 UTC (History)
7 users (show)

See Also:
Kernel Version: 2.6.37-rc8
Tree: Mainline
Regression: Yes


Attachments
kernel log during crash (130.20 KB, text/plain)
2010-12-29 11:47 UTC, rocko
Details
inode kernel bug dump (5.17 KB, text/plain)
2011-02-05 08:31 UTC, rocko
Details
kernel paging request bug dump (2.10 KB, text/plain)
2011-02-05 20:53 UTC, rocko
Details
log of latest kernel panic (1.28 KB, text/plain)
2011-02-06 03:44 UTC, rocko
Details
diff between 2.6.37.1 and 2.6.38-rc5 configs (76.12 KB, text/plain)
2011-02-21 04:01 UTC, rocko
Details
syslog from VM which is hanging for 10-15 seconds at a time (17.29 KB, text/plain)
2011-02-21 04:08 UTC, rocko
Details
crash log from 2.6.36-rc1 (838.69 KB, image/jpeg)
2011-03-13 02:26 UTC, rocko
Details
oops on USB unbind - unable to handle kernel paging request (4.76 KB, text/plain)
2011-04-24 01:35 UTC, rocko
Details
oops log for null pointer dereference sync + unbind (4.12 KB, text/plain)
2011-04-25 00:36 UTC, rocko
Details
another null pointer dereference (5.03 KB, text/plain)
2011-04-25 00:37 UTC, rocko
Details
and another null pointer dereference (5.14 KB, text/plain)
2011-04-25 00:39 UTC, rocko
Details
null pointer deference hit while running without a desktop (8.28 KB, text/plain)
2011-04-26 01:22 UTC, rocko
Details
oops log for null pointer dereference 2.6.39-rc5 (5.29 KB, text/plain)
2011-05-03 02:19 UTC, rocko
Details
usbonoff-with-mount.sh - script to reproduce the crash (857 bytes, text/plain)
2011-09-01 01:30 UTC, rocko
Details

Description rocko 2010-12-29 11:47:58 UTC
Created attachment 41852 [details]
kernel log during crash

With the following procedure on Ubuntu 10.10 with 2.6.37-rc8:

1) While Gnome is running, attach USB devices (I typically have 4 or 5 drives connected via a 7-port USB hub).

2) Suspend to RAM.

3) Remove the USB hub.

4) Resume.

The kernel (usually but not always) crashes upon resume. I get the classic 'kernel oops' flashing wireless LAN/bluetooth lights and everything is dead - I can't ssh into the machine from outside and I can't get a tty terminal with CTRL-ALT-Fx.

The crash is apparently random in timing. Sometimes it crashes before Gnome's 'screen locked - enter password' dialog appears; sometimes it crashes while the password is ready for entry (today it crashed while I was typing the password and I also had the speaker blipping away as well as the flashing lights because the Ubuntu attention 'drum-beat' was playing when it crashed). The latest that a crash has occurred is after I had a working Gnome desktop back, perhaps a few seconds after resuming started.

On one occasion the kernel locked up when I removed the USB hub cable while Gnome was running (before suspend), but I have not been able to reproduce this.

The problem in debugging this is that I get absolutely nothing in the logs regarding the crash - no kernel OOPS message and nothing that looks out of the ordinary. I've also tried sending log output using netconsole to another PC, but it loses the network link upon suspend.

I've also tried catching the device crash information in the RTC but it just told me that the device responsible was the PC itself.

I didn't get this crash with 2.6.36. For a while I thought the problem was the autogroup v4 patch I applied to 2.6.36-rc7, as running with a generic kernel seemed less likely to cause the crash, but when I tested 2.6.36-rc8-generic just now, it crashed.

I've attached a sample log of when 2.6.37-rc7 with the autogroup patch crashed last week (it definitely still occurs in 2.6.37-rc8). In the log the crash occurred around 13:19:58 and reboot at 13:21:33.

All the external drives are ext4 if that makes a difference. I've tried reproducing it with (a) a single ntfs drive or (b) a single ext4 USB key attached on suspend but without success.

What other information might be useful?
Comment 1 rocko 2011-01-04 01:03:00 UTC
fwiw, if I manually unmount the drives before suspending, then suspend and remove the cable, and then resume, it doesn't crash.
Comment 2 rocko 2011-01-06 02:07:30 UTC
I have done five successful suspend/remove cable/resume cycles now with 2.6.37-generic. Was there a commit after rc8 that might have fixed this?
Comment 3 rocko 2011-01-06 19:06:23 UTC
I managed to reproduce the kernel freeze in 2.6.37 by just removing the USB cable with the drives still mounted and attached. The only information that shows in the log is:

Jan  7 02:59:05 pegasus-maverick kernel: [53521.290147] usb 5-2: USB disconnect, address 2
Jan  7 02:59:08 pegasus-maverick kernel: [53524.165484] usb 2-1: USB disconnect, address 19
Jan  7 02:59:08 pegasus-maverick kernel: [53524.165494] usb 2-1.1: USB disconnect, address 20
Jan  7 02:59:08 pegasus-maverick kernel: [53524.180483] JBD2: I/O error detected when updating journal superblock for sdb1-8.
Jan  7 02:59:08 pegasus-maverick kernel: [53524.184337] usb 2-1.2: USB disconnect, address 22
Jan  7 02:59:08 pegasus-maverick kernel: [53524.262399] usb 2-1.5: USB disconnect, address 24
Jan  7 02:59:08 pegasus-maverick kernel: [53524.291001] usb 2-1.6: USB disconnect, address 23
Jan  7 02:59:08 pegasus-maverick kernel: [53524.340092] JBD2: I/O error detected when updating journal superblock for sdd1-8.
Jan  7 02:59:08 pegasus-maverick kernel: [53524.360120] JBD2: I/O error detected when updating journal superblock for sdf1-8.
Jan  7 03:00:45 pegasus-maverick kernel: imklog 4.2.0, log source = /proc/kmsg started.
Comment 4 rocko 2011-02-03 02:34:19 UTC
This is still present in 2.6.37, but I haven't seen it yet in 2.6.38.
Comment 5 Alan Stern 2011-02-03 15:05:39 UTC
This looks like a filesystem or block-layer problem, not a USB problem.  The bug report should be reassigned.
Comment 6 Theodore Tso 2011-02-03 18:22:24 UTC
I've looked at the kernel log, and the kernel messages displayed, and unfortunately there's no stack trace that would help narrow down what might be going on.

Did you have CONFIG_DETECT_SOFTLOCKUP enabled?   If not, I would suggest enabling it, and see if you can get a stack trace.   Another useful thing to do is to try getting some information via sysrq-l and sysrq-d.  Using sysrq-p multiple times to see where the PC is on the various machines can be useful to detect livelocks.  Sysrq-t can also be useful, but it's a huge amount of information.

In general, for most of the sysrq dumps, you'll really going to need to use a serial console attached to a second computer to record all of this output.   So if someone would like to help reproduce this problem.  Even if the problem is gone in 2.6.38, it would be useful to see what is going the problem in 2.6.37 so that perhaps the patches could be backported into the stable release if the changes aren't too invasive.  So if someone can help us gather this information, it would be really helpful.

-- Ted
Comment 7 rocko 2011-02-04 06:27:51 UTC
Thanks for the feedback. I didn't have CONFIG_DETECT_SOFTLOCKUP enabled, so I recompiled the kernel with it on. I managed to reproduce the problem but unfortunately still didn't get anything useful in the logs. The syslog shows immediately after the wlan reconnected:

Feb  4 14:13:02 pegasus-maverick anacron[6272]: Anacron 2.3 started on 2011-02-04
Feb  4 14:13:02 pegasus-maverick anacron[6272]: Normal exit (0 jobs run)
<hard reset was required here>
Feb  4 14:14:30 pegasus-maverick kernel: imklog 4.2.0, log source = /proc/kmsg started.


Might the number of USB devices be relevant? I used to have external 4 drives connected and now only have 3: I managed three suspend/resume cycles without getting the freeze, but after I attached a fourth drive (NTFS btw, whereas the others are ext4), the next suspend/resume cycle froze the machine.


Am I right in thinking sysrq requires me to connect over the network? The problem with that is that I can only reproduce this reliably after a suspend/resume cycle, and of course the network connection is lost during suspend and the machine typically locks up so quickly that I don't have time to establish a new connection.
Comment 8 rocko 2011-02-04 06:30:52 UTC
And of course after I read your comment again I saw the last paragraph about using a serial console. Unfortunately, I don't have a serial port on this PC.
Comment 9 Alan Stern 2011-02-04 15:31:49 UTC
In the absence of anything better you can boot with the "no_console_suspend" option, do

   echo 8 >/proc/sys/kernel/printk

and then initiate the suspend from a VT console.  You may be able to glean some useful information from what shows up on the screen.
Comment 10 rocko 2011-02-05 08:31:16 UTC
Created attachment 46432 [details]
inode kernel bug dump

I tried suspending/resuming from a VT console in 2.6.38-rc3 and discovered a kernel bug - I've attached it in case it's related to this. But the system didn't freeze up in 2.6.38.
Comment 11 rocko 2011-02-05 08:52:58 UTC
I've reproduced the original problem in 2.6.37 suspending/resuming from the console. The log is quite short and looks cut off:

BUG: unable to handle kernel paging request at ffffffff8291cb60
IP: [<ffffffff8105165d>] task_rq_lock+0x4d/0xa0
PGD 1a05067 PUD 1a09063 PMD 0
Oops: 0000 [#1] SMP
last sysfs file: /sys/devices/virtual/hwmon/hwmon0/temp1_input
CPU 1
Modules linked in: nls_utf8 udf ses enclosure usb_storage hidp binfmt_misc frcomm sco bnep 12cap vboxnetadp vboxnetflt vboxdrv parport_pc ppdev_dm_crypt btrfs zlib_deflate crc32c libcrc32c snd_hda_codec_idt snd_hda_intel nvidia(P) snd_hda_codec snd_hwdep snd_pcm_oss snd_mixer_oss snd_pcm snd_seq_dummy arc4 snd_seq_oss snd_seq_midi snd_rawmidi iwlagn iwlcore snd_seq_midi_event_

Hopefully I've typed it correctly.
Comment 12 Theodore Tso 2011-02-05 19:12:45 UTC
Rocko,

can you reproduce the problem reliably at this point?  And if so, can you give us more details about how you reproduced, and how someone else might be able to replicate the problem reliably?

Is FUSE always involved, as it was apparently in comment #10?   Can you reproduce the problem without FUSE?

You're right, the oops message does look cut off.

-- Ted
Comment 13 rocko 2011-02-05 19:56:52 UTC
I can reproduce the freezing problem fairly reliably - it doesn't always occur, but seems likely to occur over 3-5 suspend/resume cycles. 

I also have seen it occur on another PC (an older Dell 32 bit laptop, fwiw) using the same external drive setup, ie with four USB drives attached via a 7-in-1 D-Link USB hub. Two are 1TB ext4 drives, one is a 500GB ext4 drive, and the other drive was either a 1TB ext4 drive or a 500GB ntfs drive. I did try reproducing the freeze with just a single drive attached but wasn't successful, so perhaps the number of drives is relevant.

Both PCs that I have seen the freeze on were running Ubuntu 10.10. 

To see the (truncated) debug log I suspended from a console using "/etc/init.d/sleep.sh force".


The fact that FUSE was involved in comment #10 puzzles me, given that that bug dump starts off saying it's an EXT4-fs error on sdb1 - isn't ext4 purely a kernel driver? I assume FUSE would have been loaded because of the ntfs-3g driver loaded for sdf1? It's quite possible that comment #10 was a different bug entirely, because the system didn't freeze. If it is related, it might be relevant that the freezing problem also occurs with only external ext4 drives attached. FUSE might still have been loaded though at that time the PC used to have an internal ntfs partition. Does the reference to gnome-panel reading directory lblock 0 suggest gnome-panel was trying to read a configuration file? Those would all be on my ext4 home partition, which is on the internal drive.
Comment 14 rocko 2011-02-05 19:58:34 UTC
And of course I meant to say that I suspended from the console with "/etc/acpi/sleep.sh force". It's quite early over here.
Comment 15 rocko 2011-02-05 20:53:52 UTC
Created attachment 46472 [details]
kernel paging request bug dump

I reproduced the freezing bug (ie kernel paging request bug) in 2.6.38-rc3 this time. The bug dump is longer but still looks truncated during the call trace. 

FUSE was not loaded this time (I only had ext3/4 drives).
Comment 16 Theodore Tso 2011-02-05 23:10:20 UTC
Hmm, can you tell me how your hard drives are arranged?  Based on the comment in #15, the bug was while the kernel was trying to wake up firefox.  There's nothing in the (truncated) strace trace that looks like a file system would be involved, but if there was file system involvement, presumably it would be related to firefox writing to its dot files in your home directory.  Is your home directory located on a USB drive?

Here's another thought.  What happens if you try a series of suspend/resumes where you pull all of your hard drives, but you don't pull the USB hub.  Does that change the pattern of the crashes/freezes?

Also, is the system largely quiet before the suspend --- were there any processes downloading torrents, doing compiles, or otherwise using a large amount of CPU or disk bandwidth before you suspended your system?

Hmm.... this looks like a real puzzler.
Comment 17 rocko 2011-02-06 02:53:57 UTC
The home directory is on an internal partition and thus never removed. The system was pretty quiet before the suspend - all I had done after rebooting was run Firefox to look at this bug report, and I didn't see any substantial disk activity from any automated tasks like updatedb.

I tried a few cycles removing only the drives and not the hub but haven't re-encountered the problem in 2.6.38-rc3 yet. This did reveal that I actually have five external drives connected, not four. (There's an extra 1TB.) I'll try later with 2.6.37 since it does seem to be easier to reproduce there.

On a few occasions the PC has frozen when I simply removed the USB cable during normal operation (not during a resume). This happened once on the other 32 bit PC as well. It is unfortunately much harder to reproduce. Could this indicate perhaps that the paging request bug is a symptom of an earlier memory corruption caused when the file system attempts to access a mounted but removed partition before realising it has gone?
Comment 18 rocko 2011-02-06 03:44:22 UTC
Created attachment 46552 [details]
log of latest kernel panic

The attached shows some more information from a 2.6.37 kernel panic during resume. The panic followed what I assume was the paging request error (the full log scrolled off the top of the screen, but it looks similar to the last one). I didn't copy down all the hex data but I might be able to figure them out from some (slightly blurry) photos of the screen if they are important.

Note: this time I had removed the drives from the hub individually rather than removing the hub. So it didn't make any difference doing it that way around.
Comment 19 Theodore Tso 2011-02-06 04:26:25 UTC
Hmm.... what if you only remove one drive while suspended?   Can you try removing each of the drives as separate experiments, and see if you can narrow it down to a single drive that seems to cause the problem when it is removed?

And then can you see how that drive is being used (if it is being used at all)?

If the system is mostly quiet, then presumably it would die immediately on resume, but when the system actually tried to access the file system.  So if we can narrow it down to a single file system, and then figure out which processes had files opened on that file system, maybe that would give us a clue.

You said, "paging request error" --- were you running programs off of any of the external drives?   What programs if any were accessing the drive?

I'll note that I've tried some simple experiments with removing a quiescent USB drive from my system while it was suspended, and I haven't been able to reproduce the problem.   I have often forgotten to unmount a hot-swapping AHCI attached hot-swappable SATA drive in my Ultrabay slot which is idle while suspending, and it's never caused a hang on resume.  It gets a new sdN drive letter on resume, so all of the file system mounts are invalidated, but I've never gotten a crash when that happens --- and I've been using 2.6.37 on my laptop almost ever since it has been released.
Comment 20 rocko 2011-02-06 06:42:59 UTC
It's a *lot* harder to repeat it with just one drive attached. Over 20 suspend/resume cycles I only managed to repeat the freeze once with a single drive attached (with the 500GB drive attached in this case).

I doubt very much that a particular program is accessing the external partitions. In my testing I just allow Gnome to mount the partitions and once it has settled I run the sleep command. A "lsof|grep media" after the partitions are mounted doesn't show anything accessing the external file systems.

Is it relevant that when I start the suspend cycle but before suspend occurs there are some kernel messages saying that the external partitions are being remounted? I would have thought that they would be unmounted without a remount before suspend.
Comment 21 rocko 2011-02-06 09:00:29 UTC
Could this possibly be a caching/stale pointer problem? 

In comment #10, FUSE was somehow involved in an ext4-fs operation after the fuse/ntfs partition was unmounted (and presumably its kernel FS memory freed). Wouldn't a fuse/ext4 combination only happen if the kernel incorrectly tried to access the freed ntfs/fuse memory data when subsequently trying to read/write the ext4 file system (or am I incorrect when I assume that FUSE would never be involved in ext4-fs operations?).

Two of the other crashes happened when firefox and gnome-panel tried to access the home partition and failed - but the home partition _was_ present and therefore should have always been accessible. Could the kernel have got confused and been accessing access freed memory used for an absent external partition instead of the memory being used to cache data for the home partition?

This might also explain why the chances of the kernel panic happening are much higher if there are multiple removed drives involved, ie because there is more chance of freed memory incorrectly being accessed.
Comment 22 Theodore Tso 2011-02-07 01:20:01 UTC
Memory doesn't get freed just because a device disappears.   The file system is still shown as mounted after the system resumes.  Attempts to access the mounted file system will result in errors, but the data structures don't get magically freed until you explicit umount the failed file system.

It's more likely that the kernel is stuck in some loop trying to access the failed file system, and looping, but in that case, it would be caused by a specific process trying to access the file system after the system resumed.  Say, if you were executing a program that was located on the now-failed file system, or if a file from the now-failed file system was mmap'ed into memory, and for some reason the kernel was looping forever instead of returning an error to the program and/or killing the program.

This is why I asked you if you could use the various sysrq commands to try to figure out what the kernel was doing after it locked up.  In answer to your previous message, no, sysrq doesn't require access over the network.  It requires access to the console.  If you have a VT console, sysrq-p can be triggered by holding down the alt, sysrq and p keys; sysrq-l can be triggered by alt-sysrq-l, etc.   If you have a serial console, you can send a break followed by an l to trigger a sysrq-l, and so on.
Comment 23 rocko 2011-02-07 03:11:05 UTC
Thanks for the info on how to use the sysrq functions. Those keys work nicely normally. But unfortunately they don't work once it freezes with this bug - the freeze is total, which I guess also explains the truncated stack traces.

I wasn't suggesting that memory was freed just because the device disappeared - I was wondering if it is possible that the kernel could umount several missing devices and free the memory pertaining to them (all good so far), but get pointers to the freed memory mixed up, possibly through caching of the freed memory pointer.
Comment 24 Theodore Tso 2011-02-07 03:32:44 UTC
If the freeze is total, such that sysrq isn't functioning, then that implies that we're stuck in some kind of interrupt service routine lockup, which might imply that it's not an ext4 specific bug.

Hmm.... can you try replicating with some other file system, like vfat or ext3?
Comment 25 Theodore Tso 2011-02-07 03:33:09 UTC
(i.e., but rather something which is device driver specific)
Comment 26 rocko 2011-02-07 04:24:49 UTC
The kernel panic messages (when kmsg gets that far) do say that a fatal exception has happened in an interrupt. It seems odd to me though that kmsg_dump locks up a seemingly random time later while reporting the panic.

I tried again with two vfat USB drives, one ext3 external drive, and an ntfs external drive attached via the hub. I reproduced the freeze, this time caused by a null pointer reference error rather than a unhandled page fault error. Again it was in task_rq_lock. The stack trace stopped very abruptly this time. 

I'm not positive that this lets ext4 completely off the hook because of the crash that was caused when firefox tried to access the internal ext4 home partition. Presumably I'd have to try and reproduce the problem with all the internal partitions reformatted to ext3 to rule out ext4 completely.

But perhaps ext4's memory was corrupted by the umount operations? Is there common umount kernel code shared by all file systems?
Comment 27 Theodore Tso 2011-02-07 15:35:59 UTC
There is shared umount code paths that are used by all file systems in fs/super.c, in the VFS layer, of course.

But again, unless you or some userspace program has explicitly requested a umount, a umount does not happen just because a disk drive has disappeared.  So if you are locking up right after a resume, there wouldn't have been any _time_ for you or some userspace program to request an umount.   And certainly the standard desktop userspace programs (e.g., the GNOME desktop) do not automagically look for kernel I/O error messages caused by a disk disappearing, and then umount the file system automagically.
Comment 28 rocko 2011-02-07 23:49:14 UTC
Actually, the lockup doesn't take place _immediately_ after resume - there is a random interval of up to several seconds. A couple of times I even managed to type my password in for the screensaver and get back to the desktop before the freeze. So there is certainly time for a userspace process to try and access a missing drive. 

And there is a process that umount drives when it detects they are missing (I think it is udev?). They disappear from my Gnome desktop when I pull out the drive or when I resume after removing the drives.

These (sample) messages after resume show it happening:

Feb  7 10:04:04 pegasus-maverick ntfs-3g[2745]: Unmounting /dev/sdf1 (My Passport)
Feb  7 10:04:04 pegasus-maverick kernel: [  128.908709] JBD: I/O error detected when updating journal superblock for sde1.
Feb  7 10:04:04 pegasus-maverick kernel: [  128.934868] scsi 9:0:0:1: rejecting I/O to dead device
Feb  7 10:04:04 pegasus-maverick kernel: [  128.949322] EXT3-fs (sde1): I/O error while writing superblock


The first message implies that a umount happens for at least missing ntfs drives. The bug in comment #10 happened during a sys_umount call, which was triggered by Gnome automatically in response to the missing drive.


The most confusing thing is how the sytem might have crashed when firefox was trying to access a valid mounted ext4 partition. That's why I'm thinking it's caused by a memory corruption or invalid pointer.
Comment 29 rocko 2011-02-19 12:36:35 UTC
2.6.38-rc5 seems particularly prone to this bug - I've crashed it three times today without even suspending. The first time was when I accidentally power off an attached USB drive; the second was when I removed an ext4 USB key; and the third time was when I removed the USB key again while trying to get a log output (it took about 10 goes doing this to get the crash). In all cases the stored syslog only showed the USB disconnect, but in the last case I got a partial bug dump on the tty console:

BUG: scheduling while atomic: kworker
RIP: 0010 [...] in tick_nohz_restart_sched_tick+0x55/0x180
...
Call trace:
cpu_idle+0xd7/0xf0
start_secondary+0x1bc/0x1c3

So might this be a scheduling problem rather than a file system problem?

In this last case I must have removed then reattached the USB key in the same instance, because its light kept flashing ie indicating that something was trying to read or write to it.

I notice that when a device is removed the kernel immediately re-mounts it in read-only mode before presumably dismounting it again. So if you are trying to copy to a device that is removed the first error that Gnome might report is that the device is read-only, and when you press 'skip', the second error is that the device has disappeared. Is this normal? It seems odd.

Does anyone have any more suggestions on how to debug this? It is really annoying when the simple act of removing a USB file system completely crashes the kernel (and trashes all your unsaved data).
Comment 30 Alan Stern 2011-02-19 15:55:19 UTC
I don't know about tracking this down, but for testing purposes you can reproduce these conditions without actually unplugging and replugging the device.  All you have to do is go into the /sys/bus/usb/drivers/usb_storage directory.  In there, you'll see a file representing each usb-storage device in your system, with names like "2-1.2:1.0".

To imitate an unplug, write the corresponding file name to the "unbind" file.  To imitate a replug, write the file name to the "bind" file.  For example:

   echo -n 2-1.2:1.0 >unbind
   echo -n 2-1.2:1.0 >bind

This can easily be automated in a shell-script loop, which should speed up your testing.
Comment 31 rocko 2011-02-20 00:16:33 UTC
Thanks, that helped a lot! Incidentally, if I don't sleep between unbind and bind operations in the script, the kernel becomes unresponsive (ie to everything including ssh sessions, ctrl-c/d/z and trying to switch to another tty) and just keeps scrolling the bind/unbind messages up the screen, even long after I remove the USB device, so presumably bind/unbind would have to fail at this point.

Anyway, I managed to reproduce the freeze quite quickly with just a single USB key. The stack trace is slightly longer this time:

<this scrolled off the top of the screen>
wake_up_process
wakeup_timer_fn
run_timer_softirq
? wakeup_timer_fn
__do_softirq
call_softirq
irq_exit
do_IRQ
ret_from_intr
? arch_local_irq_enable
? sched_clock_idle_wakeup_event
acpi_idle_enter_bm
cpuidle_idle_call
cpu_idle
rest_init
start_kernel
x86_64_start_reservvations
x86_64_start_kernel
...
RIP task_rq_lock
--- end trace ---
Kernel panic - not syncing: fatal exception in interrupt
Pid: 0, comm: swapper
Call trace:
IRQ ? panic
? kmsg_dump
? oops_end
? no_context
? __bad_area_nosemaphore
? bad_area_nosemaphore
? enqueue_task
? resched_task
? try_to_wake_up
? native_sched_clock
? native_sched_clock [yes, twice]
? page_fault
? task_rq_lock
? try_to_wake_up
? wake_up_process
? wake_up_process [twice]
? wakeup_timer
? run_timer_softirq
? wakeup_timer_fn
? __do_softirq
? call_softirq
? irq_exit
? do_IRQ
? ret_from_intr
? arch_local_irq_enable
? sched_clock_idle_wakeup_event
? acpi_idle_enter_bm
? cpuidle_idle_call
? cpu_idle
? rest_init
? x86_64_start_reservvations
? x86_64_start_kernel

But does that help any?
Comment 32 rocko 2011-02-21 04:01:54 UTC
Created attachment 48502 [details]
diff between 2.6.37.1 and 2.6.38-rc5 configs

I think this might have had something to do with the config I was building the kernel with.

I tried the Ubuntu weekly build of 2.6.38-rc5 (which appears to have been compiled with gcc-4.2) and couldn't repeat the freeze. I then recompiled 2.6.38-rc5 using gcc-4.5.1 but with the Ubuntu config and haven't been able to reproduce the freeze yet with this combination, so it seems likely a config issue instead of a compiler issue.

The attached diff should show the 'good' config for 2.6.38-rc5 (ie copied from the Ubuntu weekly build) against the config for the build I did against 2.6.38-rc4, which is very similar to the config I was using with 2.6.38-rc5 that kept crashing.
Comment 33 rocko 2011-02-21 04:08:10 UTC
Created attachment 48512 [details]
syslog from VM which is hanging for 10-15 seconds at a time

In case it is useful... I also tried reproducing the freeze running Ubuntu 11.04 (running 2.6.38-rc4) in VirtualBox and repeatedly assigning/unassigning the test USB key to it (which used to have a good chance of freezing the host, but not now). I didn't get a total freeze, but I did start getting freezes for maybe 10-15 seconds. The USB access light started flashing permanently so I seem to have recreated the situation where some process has crashed trying to access the drive, or perhaps just crashed the drive itself, but the kernel doesn't completely crash, just completely pause temporarily. The key still shows up in an ls as attached to 1-1:1.0, but I can't bind/unbind it anymore - it gives write errors (no such device) and "device offlined" errors.
Comment 34 rocko 2011-02-21 08:59:14 UTC
OK, spoke too soon - my host with 2.6.38-rc5 compiled against the new config just froze up completely after removing the external USB drives and then resuming. Are there any known issues building the kernel with gcc 4.5.1?
Comment 35 Alan Stern 2011-02-21 16:48:34 UTC
If you generate plug and unplug events too rapidly, you will overrun the USB hub driver.  If you run the test while using a VT console instead of X11, and if you enable CONFIG_USB_DEBUG and do "echo 8 >/proc/sys/kernel/printk" before starting, you'll be able to see what's going on.
Comment 36 rocko 2011-02-22 07:03:33 UTC
Thanks for the info - I assume that's to see what has is happening if I can make the USB light flash permanently? Or might it help in figuring out what causes the full kernel crash?

Btw, this kernel freeze bug is still present in 2.6.38-rc6 - I have reproduced the  unable to handle paging request crash shortly after resuming.
Comment 37 rocko 2011-03-03 15:23:14 UTC
The bug is still present in 2.6.38-rc7 (it happened the first time I resumed after removing the external drives).
Comment 38 rocko 2011-03-06 10:25:46 UTC
The syslog shows that, for whatever reason, sometimes processes do try and access files on the drives after they are removed. Messages to this effect typically appear if the kernel _doesn't_ crash.

Is it possible that there might be a sync/locking problem here that leads to the crash? eg could the crash occur if a process tries to access the removed drive while the kernel is in the process of unmounting it? Perhaps this would explain why the crash is more likely to happen on resume (ie when many processes are waking up and presumably more likely to try and access files) and when multiple drives are removed (ie when the kernel is likely to be more busy trying to unmount the missing drives).
Comment 39 Alan Stern 2011-03-06 15:59:05 UTC
It is expected that processes will try to access files on a drive after it has been removed.  After all, they have no way of knowing that the drive is gone.

I'm not convinced that the timing with respect to your resume is relevant.  After all, you mentioned that the crashes are subject to random delays, sometimes happening several seconds after the system resumes.

Also, the kernel does not automatically unmount missing drives.  Unmounting is a separate operation from removal.  Unmounting takes place only when the user (or a program) does the equivalent of running the umount command.

You did say that this problem doesn't arise in 2.6.36.  Your best chance of tracking it down might be to try doing a git bisection between 2.6.36 and 2.6.37.  However, for this to work, you have to be able to detect reliably whether or not a particular kernel has the bug.  The test you used in comment #31 might do the job.  If you can narrow this down to an individual kernel change, that would be a big help.
Comment 40 rocko 2011-03-12 06:03:21 UTC
I tried kernel 2.6.36, and it also crashed the first time. I must just not have noticed it before (or maybe the compiler *is* important). The log was along the lines of:

Pid: 0, comm: swapper Not tainted 2.6.36-git
...
Call Trace:
? sched_clock_local
tick_nohz_stop_sched_tick
cpu_idle
rest_init
start_kernel
x86_64_start_reservations
x86_64_start_kernel

RIP get_next_timer_interrupt

Kernel panic: not syncing: Attempted to kill the idle task!
Pid: 0, comm: swapper Tainted: G   D  2.6.36-git

Call trace:
panic
do_exit
? kmsg_dump
oops_end
no_context
__bad_area_nosemaphore
bad_area_nosemaphore
do_page_fault
page_fault
? get_next_timer_interrupt
? get_next_timer_interrupt
? sched_clock_local
tick_nohz_stop_sched_tick
cpu_idle
rest_init
...
Comment 41 Rafael J. Wysocki 2011-03-12 12:12:31 UTC
Dropping from the list of post-2.6.36 regressions.
Comment 42 Alan Stern 2011-03-12 15:51:28 UTC
Can you install an older compiler and build a kernel with it?
Comment 43 rocko 2011-03-13 00:41:00 UTC
I tried both 2.6.38-rc8 built with gcc 4.4.5 and also the Ubuntu weekly build of 2.6.38-rc8, which says it is built with gcc 4.2.3, and both crashed.

I didn't manage to crash the Ubuntu 2.6.35-28 kernel, so it might be a regression between 2.6.35 and 36.
Comment 44 rocko 2011-03-13 02:26:21 UTC
Created attachment 50702 [details]
crash log from 2.6.36-rc1

I got a magnificent crash log from 2.6.36-rc1 on a crash during resume - unfortunately about five pages worth of log scrolled off the console and I only got the last one (see attached jpg). There was nothing in the syslog saved to disk.

I'm not sure exactly what caused the crash, but there are some file system and page fault calls in the stack trace (as well as some wireless calls), so perhaps it is related to this bug. This kernel was quite flakey, though - it crashed on two other occasions when I hadn't even removed any USB drives.
Comment 45 Alan Stern 2011-03-13 02:45:04 UTC
That "invalid_op" entry in the stack trace strongly suggests that memory is getting corrupted, and I don't mean as a result of a software bug.  Have you run a memory hardware test recently?
Comment 46 rocko 2011-03-13 03:01:46 UTC
It's always a possibility, but I don't think this is a RAM problem. I have run memory tests fairly recently to try and rule it out. One of the BIOS RAM tests is very extensive: I didn't run it to completion because it would have taken many hours to complete, but did let it go for 20 minutes or so. It tested quite a few patterns and didn't find any failures. I also occasionally make it run the 'quick' memory check upon boot (about 5 minutes worth) and it never finds problems.

There are other indicators that this probably isn't a hardware problem:

* I have seen this bug occur on another PC that was running 2.6.37 at the time - it crashed when I unplugged the external USB hub without unmounting the drives first.

* I don't get crashes during normal operation - only after removing USB drives. And only if they were mounted prior to removing them.

* Just now I have been testing Ubuntu's 2.6.35-28-generic kernel more extensively and haven't been able to crash it (around 20 suspend/resume cycles without incident).

* I did manage to crash a VM running Ubuntu 11.04 with kernel 2.6.38-rc8 by using VirtualBox's USB menu in the status bar to repeatedly attach and remove an external USB drive. It took about ten goes before it hung. In the past, the host has usually crashed during this test, but this time I was running 2.6.35 on the host and only the VM crashed.
Comment 47 Alan Stern 2011-03-13 21:17:00 UTC
Okay, if you believe this, and if you can reliably detect the bug (which is often a difficult thing to be sure of), then you can try bisecting the kernel changes between 2.6.35 and 2.6.36-rc1.

The thing is, the test really does have to be reliable.  If you make a mistake (tell git that a kernel doesn't have the bug when it actually does) then you're very unlikely to track down the true cause.  Still, it's worth a try -- especially since nothing else seems to help!

The test you used in comment #31 would be a good approach.
Comment 48 rocko 2011-03-14 04:10:06 UTC
Unfortunately there are a couple of problems with bisecting this, quite apart from the time it takes to build and test each kernel...

The first is that it is unsurprisingly easy to crash an rc1 kernel, which makes it even harder to reliably detect the bug. For instance, I managed to crash the first bisect of 35-36.rc1 with the suspend/resume test eventually (it seems much more robust than 2.6.38 with respect to this bug) but it isn't clear that the problem occurred as a result of the USB removal - the stack trace looks like it includes memory allocations and socket receives. Why, the first time I booted the 36rc1 kernel, it hung completely at the login screen with no human input whatsoever, clearly as a result of a different bug. I'm not even positive that the log I posted from the 36rc1 kernel crash above is related to this bug, as it looks different from the 2.6.37/38 logs.

The second problem is that I can't necessarily compile all the bisects! For instance commit f6cec0ae58c17522a7bc4e2f39dae19f199ab534 (the second bisect I tried) fails with this error:

drivers/staging/comedi/drivers/das08_cs.c: In function ‘das08_pcmcia_config_loop’:
drivers/staging/comedi/drivers/das08_cs.c:225:8: error: ‘struct pcmcia_device’ has no member named ‘io

Is there a way to make the kernel handle a bug like this more gracefully? It seems that there are many great debug tools for extracting information about process states, but they are all useless here because the crash is so catastrophic.
Comment 49 Alan Stern 2011-03-14 14:18:48 UTC
It is possible to tell git that you can't test a certain kernel; it will then choose a different bisection point for testing.

There is no way to make the kernel handle a catastrophic bug gracefully.  There are ways to capture a complete kernel log when a bug does occur, by using a serial console or a network console.  There's also the kernel's internal debugger (kgdb).  Besides, information about process states probably won't help.
Comment 50 rocko 2011-03-24 23:06:05 UTC
For a time I thought I had tracked the problem down to a single 500 GB ext4 hard drive, because it became possible to able to reliably and quickly reproduce the crash in my VM by simply simulating USB plug/unplug - it was crashing on the third or fourth simulated unplug. Then after I ran fsck on the drive it was much harder to reproduce the crash (I didn't manage to reproduce it in ten separate attempts, each with hundreds of plugs/unplugs). 

I removed the drive completely from the system and had no crashes for over a week, until this morning when 2.6.38.1 crashed upon resume (of course after I had removed the drives during suspend).

Is it possible something in the ext4 drivers would try to access a removed drive that is has some sort of misconfiguration (ie one that is fixed by an fsck) and cause the crash? ie Seeing as how an fsck made the problem apparently 'go away' for a while? It would explain why more drives increase the chances of a crash.
Comment 51 rocko 2011-03-31 00:44:24 UTC
I've had a report from another user who is seeing the same problem with 2.6.36-generic, which means it isn't confined to my three PCs.

Perhaps this means we'll be seeing more people experiencing it once Ubuntu 11.04 comes out using kernel 2.6.38...
Comment 52 rocko 2011-03-31 00:49:40 UTC
Some quite relevant information is that the other user has reported the external drive he unplugs during suspend mode is an ext3 drive, not an ext4 drive.
Comment 53 Theodore Tso 2011-04-01 15:09:03 UTC
Folks who are experiencing this problem might want to try 2.6.38.2.   There was a fix that was committed to mainline and backported to the stable kernels that may fix this problem:

commit 95f28604a65b1c40b6c6cd95e58439cd7ded3add
Author: Jens Axboe <jaxboe@fusionio.com>
Date:   Thu Mar 17 11:13:12 2011 +0100

    fs: assign sb->s_bdi to default_backing_dev_info if the bdi is going away
    
    We don't have proper reference counting for this yet, so we run into
    cases where the device is pulled and we OOPS on flushing the fs data.
    This happens even though the dirty inodes have already been
    migrated to the default_backing_dev_info.
    
    Reported-by: Torsten Hilbrich <torsten.hilbrich@secunet.com>
    Tested-by: Torsten Hilbrich <torsten.hilbrich@secunet.com>
    Cc: stable@kernel.org
    Signed-off-by: Jens Axboe <jaxboe@fusionio.com>

Sorry for not responding to this bug sooner, but I've been crazy busy in the last couple of weeks; troubleshooting and discussion was taking place on LKML, and I was pretty sure this wasn't an ext4 specific issue.
Comment 54 rocko 2011-04-01 23:57:59 UTC
That patch certainly sounds like it's on the right track for fixing this bug, but, alas, I just reproduced the oops in 2.6.38.2 on my fifth attempt. Perhaps there are still other issues along this line?
Comment 55 rocko 2011-04-04 02:32:00 UTC
The bug is still present in 2.6.39-rc1.

Is it possible that the memory corruption is caused by changes in the locking code? Wasn't 2.6.36 where the BKL was removed from the usb subsystem?
Comment 56 Alan Stern 2011-04-04 14:11:52 UTC
The USB stack never really used the BKL for very much, and the mass-storage pathways didn't use it at all.  Besides, the problem you're seeing doesn't involve USB directly.  I'm pretty confident that it would occur with any hot-unpluggable drive.
Comment 57 rocko 2011-04-05 15:17:02 UTC
Ok. I only mentioned it because by chance I saw https://lkml.org/lkml/2011/4/3/70, which mentioned a patch to usbhid/hiddev.c at http://git.kernel.org/?p=linux/kernel/git/jikos/hid.git;a=commitdiff;h=9c9e54a8df0be48aa359744f412377cc55c3b7d2. The patch says "This obviously lead to memory corruptions at device disconnect time", which of course looks a lot like the issue I'm getting, so I thought perhaps there might have been other problems introduced when the BKL was removed.
Comment 58 rocko 2011-04-06 23:00:52 UTC
I reproduced the problem removing/inserting an SD card in the SD card reader (SD Host controller: Ricoh Co Ltd R5C822 SD/SDIO/MMC/MS/MSPro Host Adapter (rev 22), but only if it has an ext4 file system on it. I couldn't make it happen at all with a vfat file system on it - I tried 30 times or so with the vfat card, but it happened on just the third removal of the ext4 file system.

So it does seem to be an ext3/4 issue rather than a USB-related issue.
Comment 59 rocko 2011-04-22 06:00:52 UTC
This is still an issue in both 2.6.38.3 and 2.6.39-rc4. Fairly frequently, complete and catastrophic failure of the kernel occurs when I remove an ext3 or ext4 USB drive.
Comment 60 Theodore Tso 2011-04-22 10:13:05 UTC
One of the problems which has made this hard to debug is (a) the kernel oops/panic messages are inconsistent (they're not all the same, and some of them don't have anything file system related at all), and (b) mant of them are incomplete.

Note that we need more than just the stack traces, too.  We also need the beginning of the oops message, complete with the IP/RIP information. 

At least one of these systems looks like it's using some kind of network console?   Can you set up a simple serial console, and then do some experiments where you yank out the USB drive, and tell us whether it happens reliably 100% of the time?   50% of the time?   10% of the time?    If you run sync first and the system is idle, does that make it more or less likely to happen?   If you then start writing to the now-disappeared file system with a single command, does it crash right away?  Does it crash 30 seconds later?

I've done a simple experiment where I've mounted a USB stick, written to it, typed sync, but then without typing umount, yanked the stick out, and then tried writing to the file system.  It didn't crash for me.   So some explicit instructions of what you can do that causes reliable crashes would be very useful, and then if you can set up a serial console so we can get complete and reliably crash logs, that would also be very useful.
Comment 61 rocko 2011-04-22 10:37:37 UTC
I can reliably reproduce it with the following, running in Ubuntu (11.04 at the moment, but 10.10 works as well):

1. I insert a USB key formatted with ext4.

2. I run this simple script, passing the id that shows up in /sys/bus/usb/drivers/usb-storage for the USB key as the argument, eg 2-2:1.0:

#!/bin/bash
SLEEP=3
function list {
  if [ -e "$1" ]; then
    echo $1
  fi
}

cd /sys/bus/usb/drivers/usb-storage
if [ "$1" == "" ] || [ "$2" != "" ]; then
  echo "Usage: $0 usb device"
  echo "With one of these: "
  for file in 1* ; do list $file; done
  for file in 2* ; do list $file; done
  exit 1
fi

echo 8 > /proc/sys/kernel/printk
count=1
while true; do
  echo -n $1 >unbind
  echo $count
  sleep $SLEEP
  echo -n $1 >bind
  sleep $SLEEP
  ((count=count+1))
done



I'm making it sleep 3 seconds between unbind/bind events because that gives Ubuntu plenty of time to mount the drive after the bind.

When it oopses, it typically oopses within a few seconds of the unbind event (I'd say in less than 10 seconds and often immediately). For me it normally will oops inside 10 bind/unbinds, but 2.6.39-rc4 took more like 20 to 30 goes when I tried it this morning).

I'll try again sending the output to another PC, but I'm pretty sure last time I tried that the ssh session crashed before it could dump the stack trace, just as the tty console usually crashes before it finishes dumping.

When I get time I'll try some variations as well like making it sync and sleep before the unbind.
Comment 62 rocko 2011-04-22 11:58:14 UTC
So I added in a "sync && sleep 3" before the unbind statement and it crashed on the very first unbind - before it even got to echo $count. And rebooting took five minutes while all the inbuilt drives fsck'd themselves.
Comment 63 rocko 2011-04-22 13:41:45 UTC
2.6.39-rc4 is either _much_ harder to crash, or my script isn't as reliable at crashing the kernel as I thought (until now I've mostly used the suspend/resume method with multiple drives attached). I've now done over 200 bind/unbind cycles of this external ext4 USB key without a crash. But I certainly did crash it once earlier today.

An observation from earlier that might be relevant here: a couple of weeks ago one of my drives got itself into a state that made it crash the kernel almost every time I unplugged it, but after I did an fsck on it it became significantly less likely to cause the crash. And after my last reboot there was a lot of fsck'ing going on, probably including the external drive.
Comment 64 Theodore Tso 2011-04-22 15:00:15 UTC
I'm going to guess that your script depends on your desktop trying to access (and possibly write to) your USB stick during the time that you are running causing the unbind to happen?

A more useful test case would be one that works even if no desktop is running (i.e., you're logged in via SSH, or the VT console, or the serial console), and the script contains all of the commands which are accessing the USB storage device.  Otherwise, it might be dependent on what desktop you are running, and someone who is using fvwm or KDE (for example) if you happened to create the test case while using the GNOME desktop (and then you have to answer the question of which version of the GNOME desktop, and what desktop pacakges you might have installed, etc.)

I very much doubt it has to do with when the file system was fsck'ed.  The real question is what specific I/O pattern happened to be going on at the time when the USB stick was yanked out.  And that might explain why I don't see it, because normally I'm not crazy enough to yank out a device while it's actively been accessed.  (And I don't like desktops that initiate a lot of I/O behind my back.... since that generally means it's doing this at times when I might not like it, such as when I'm running on battery and am trying to conserve power....)
Comment 65 rocko 2011-04-23 00:32:32 UTC
Yes, the crash only happens when I'm running a desktop (gdm in this case), partly because this is what handles the auto-mounting of the USB drives. I suppose we *could* tell people to just not use a desktop at all :)

I don't think users deliberately yank out mounted USB drives. I think the most likely real-world scenarios that trigger this crash are (1) suspend, remove drive, resume [ie how I first noticed this], (2) remove the wrong drive by accident, (3) a power failure makes the drive suddenly go offline [I've seen that, too].

Anyway, my test crash case using the script above that was working so reliably for this ext4 USB key is no longer crashing the kernel in either 2.6.38.3 or 2.6.39-rc4 (I've done over a thousand bind/unbind cycles for each now). My guess is that the suspend/resume test results in a higher likelihood of I/O (especially if multiple drives are involved) and therefore triggering the bug.

I'm also still curious whether it's possible for the ext3/4 drive to somehow get its format into a state that causes this to happen, given that yesterday it crashed so reliably but not today. If so, couldn't it be possible that such a state could be corrected by a fsck and therefore reduce the chances of this mystery I/O pattern happening?
Comment 66 rocko 2011-04-23 04:12:30 UTC
FYI, I reproduced it on the second attempt at suspend/resume with 2.6.38.4. I was slightly hopeful it might be fixed as some of the patches seemed to be addressing this kind of situation.

I had a thought about the consistency of the logs: I think the ones related to this bug might be consistent, it's just that I've posted logs that might be due to different bugs entirely. The ntfs/fuse one turned out to be another bug, for instance. It's possible the logs from 2.6.36-rc1 were caused by a number of of other bugs, rc1 being in all likelihood the least robust of any kernel release.
Comment 67 Theodore Tso 2011-04-23 19:31:45 UTC
Just because *I* don't like desktops that initiate I/O at random times when I don't request doesn't mean that other users shouldn't use it.  It's just that if we're talking about making a reliable test case, it's much better if it doesn't depend on random I/O initiated by a desktop.  The test case should do whatever I/O is needed, so that it is complete reproducible, even by people who don't necessarily use the same desktop as you.

Thinking about this some more, *very* recently (as in the most recent merge window) there have been some hanges recently to avoid deadlock in ext3/4 on when freezing and unfreezing file systems for snapshots, and that code path is also used on suspend/resume.  Those changes came in way after 2.6.36-rc1, so yes, if they are also causing some issues with 2.6.39-rc2+ systems, it's very likely that there are different bugs involved.   Which is why I insist on getting full and accurate OOPS logs, so we can see if they are different crashes that happen to have apparently the same symptoms caused by the same event (i.e. USB keys getting rudely yanked out of the system).

By the way, for years and years and years USB disks just didn't work at all across suspend/reusmes.  Which is why I have scripts into my suspend/resume framework to automatically unmount removeable disks at suspend-time....
Comment 68 rocko 2011-04-24 01:35:15 UTC
Created attachment 55282 [details]
oops on USB unbind - unable to handle kernel paging request

> Just because *I* don't like desktops that initiate I/O at random times
> when I don't request doesn't mean that other users shouldn't use it.

Honest, I wasn't having a go, I meant it tongue-in-cheek!

I have managed to reproduce the crash on a VM and log the output via netconsole. An important thing to note is that it made no difference when my script was set to call sync just prior to the unbind. (In fact, it crashed on the very first unbind when I did this.)

The VM was doing very little: I booted into the desktop, ran gnome-terminal, ran the modprobe command to load netconsole, and then ran the unbind/rebind script. The first crash happened on the fourth unbind.

I've attached the resulting log, which is for an 'unable to handle kernel paging request' - hopefully it's sufficiently complete, but it doesn't look much longer than some of the ones I captured manually earlier. Note that I wasn't doing suspend/resume, just running the unbind/rebind script, and this one is without sync being called.
Comment 69 rocko 2011-04-25 00:36:39 UTC
Created attachment 55312 [details]
oops log for null pointer dereference sync + unbind
Comment 70 rocko 2011-04-25 00:37:51 UTC
Created attachment 55322 [details]
another null pointer dereference
Comment 71 rocko 2011-04-25 00:39:14 UTC
Created attachment 55332 [details]
and another null pointer dereference

Do these logs have enough information to continue?
Comment 72 Theodore Tso 2011-04-25 20:28:11 UTC
The thing is, the given the stack trace and the fact that it's caused by a null pointer in next_interrupt, makes me highly dubious this has anything to do with ext4.   You could put in a printk in fs/ext4/super.c:ext4_put_super() and make sure it's not trigger, since that's the only place where we mess with timers at all --- but del_timer() properly disable interrupts before mucking with the pointer, so I'm not convinced it was caused by ext4.  (Also, the ext4 timer is only present if the file system has reported any errors, and it only fires once every 24 hours, so it's highly unlikely it would be on a timer bucket that would the next timer interrupt would trip against right afterwards.  So I very much doubt it's caused by the ext4 error reporting timer.)

And if you're seeing this on ext3, which doesn't use a timer at all, then it's definitely not the fault of the file system layer, but probably something in the usb block device driver...

In all of the stack traces, there's an scsi disk attach going on right before the crash:

[ 1255.355192] scsi4 : usb-storage 1-1:1.0
[ 1256.387085] scsi 4:0:0:0: Direct-Access     JetFlash TS2GJF110        0.00 PQ: 0 ANSI: 2
[ 1256.409758] sd 4:0:0:0: Attached scsi generic sg2 type 0
[ 1256.425575] sd 4:0:0:0: [sdb] 4063232 512-byte logical blocks: (2.08 GB/1.93 GiB)
[ 1256.434955] sd 4:0:0:0: [sdb] Write Protect is off
[ 1256.435172] sd 4:0:0:0: [sdb] Mode Sense: 00 00 00 00
[ 1256.447520] sd 4:0:0:0: [sdb] Asking for cache data failed
[ 1256.448031] sd 4:0:0:0: [sdb] Assuming drive cache: write through
[ 1256.484174] sd 4:0:0:0: [sdb] Asking for cache data failed
[ 1256.484174] sd 4:0:0:0: [sdb] Assuming drive cache: write through
[ 1256.493409]  sdb: sdb1
[ 1256.540082] sd 4:0:0:0: [sdb] Asking for cache data failed
[ 1256.540083] sd 4:0:0:0: [sdb] Assuming drive cache: write through
[ 1256.540083] sd 4:0:0:0: [sdb] Attached SCSI removable disk
[ 1257.566980] usb-storage 1-1:1.0: Quirks match for vid 0457 pid 0150: 80
[ 1257.566983] scsi5 : usb-storage 1-1:1.0
[ 1258.400641] BUG: unable to handle kernel NULL pointer dereference at   (null)
[ 1258.400818] IP: [<c105ccf8>] __next_timer_interrupt+0xa8/0x160

Was sdb the device that had been just yanked out?  Or was this some other SCSI device?
Comment 73 rocko 2011-04-26 00:27:38 UTC
Yes, sdb is in this case the external USB key that has just been inserted (or bound by the script).
Comment 74 Theodore Tso 2011-04-26 00:44:42 UTC
So have you confirmed that if you're not running a desktop, the system doesn't crash?

Part of the problem is I have absolutely *no* idea what the desktop was doing at the time of the crash.   The stack trace is completely useless.....
Comment 75 rocko 2011-04-26 01:22:52 UTC
Created attachment 55522 [details]
null pointer deference hit while running without a desktop

I just now managed to crash the kernel *without* a desktop running. Running the VM in recovery mode (ie console with no desktop), I tried two ways:

a) I modified the script to mount the device after binding it. With no other modifications, the kernel did not crash with around 50 bind/mount/unbind attempts (which is not conclusive but seems a reasonable number of tests to try). Note that with this setup, the device kept getting a new drive letter on each bind, ie /dev/sdb, /dev/sdc, /dev/sdd, etc, whereas with a desktop running it is assigned each time to /dev/sdb.

b) I modified the script to umount the device immediately after the subsequent unbind, ie the process is bind, mount on /tmp/usb, unbind, umount /tmp/usb. It crashed with the null pointer dereference first time (log attached just in case).

So the umount might be the key to the issue. I should think desktop's auto-mounting code would also be trying to umount devices once it realises they're no longer there.
Comment 76 Theodore Tso 2011-04-26 03:28:39 UTC
So for this latest result, which file systems have you been using?  ext3? ext4? ntfs? vfat?
Comment 77 rocko 2011-04-26 04:02:41 UTC
This is with ext4 on both the root file system and the external USB drive.

AFAIK, I have only ever reproduced the crash on ext4, but I haven't done much experimentation with ext3 as I don't use it. Another user told me he had the same issue - oops on resume after removing drive - and he was using an ext3 drive. I did try reproducing the crash with vfat and ntfs but couldn't make the kernel crash.
Comment 78 Alan Stern 2011-04-26 18:15:00 UTC
I can report triggering a similar problem -- once.  I attached a USB drive with an ext4 filesystem, mounted it, read a file from it, unbound usb-storage, and then unmounted it.  No desktop was running at the time.  About a second later I got a nasty crash -- an unending stream of log messages scrolling up the console, with no way to stop it other than powering off the machine.

So then I rebuilt the test kernel to add netconsole support, and I have not been able to trigger the problem since.

It wouldn't be at all suprising to find out that this bug lies not in the filesystem code but somewhere lower, such as the block layer or even the SCSI core.  It seems to have a large random component as well as a delayed impact.  Rocko is able to trigger it a lot more reproducibly.
Comment 79 rocko 2011-05-03 02:19:42 UTC
Created attachment 56272 [details]
oops log for null pointer dereference 2.6.39-rc5

Still present in 2.6.39-rc5 (log attached).

Yes, I can readily reproduce this bug in a VM with my current ext4 USB test key - the attached log is from a fresh install of Ubuntu 11.04 amd64 in VirtualBox 4.0.6, with the kernel upgraded to 2.6.39-rc5.

Fwiw, 2.6.39-rc5 behaved a bit differently from 2.6.38: the system reported issues with eg /dev/sdb1 and I started seeing multiple mountpoints in the desktop that I couldn't unmount (/media/disk, /media/disk_, /media/disk__).
Comment 80 rocko 2011-05-04 07:36:48 UTC
The null pointer dereference still happens in 2.6.39-rc6.
Comment 81 rocko 2011-05-10 23:27:56 UTC
Still present in 2.6.39-rc7.
Comment 82 rocko 2011-05-26 06:44:12 UTC
I just hit this again in 2.6.39. I was trying to delete a partition from an external drive, something went wrong and for some reason the kernel dumped all the other attached USB drives, whereupon the entire machine crashed with an "unable to handle paging" then "scheduling while atomic" oops. It's a really annoying bug.


This is what was recorded in the system log (it actually got written this time):

May 26 14:28:02 hercules kernel: [128626.171297] xhci_hcd 0000:04:00.0: WARN: Stalled endpoint
May 26 14:29:36 hercules kernel: [128720.267229] xhci_hcd 0000:04:00.0: WARN: Stalled endpoint
May 26 14:29:36 hercules kernel: [128720.353640] xhci_hcd 0000:04:00.0: WARN: Stalled endpoint
May 26 14:29:36 hercules kernel: [128720.380617] xhci_hcd 0000:04:00.0: WARN: Stalled endpoint
May 26 14:29:41 hercules kernel: [128725.194501] xhci_hcd 0000:04:00.0: WARN: Stalled endpoint
May 26 14:29:41 hercules kernel: [128725.222284] xhci_hcd 0000:04:00.0: WARN: Stalled endpoint
May 26 14:29:46 hercules kernel: [128729.766858] xhci_hcd 0000:04:00.0: WARN: Stalled endpoint
May 26 14:29:46 hercules kernel: [128729.801933] xhci_hcd 0000:04:00.0: WARN: Stalled endpoint
May 26 14:29:53 hercules kernel: [128736.817629] xhci_hcd 0000:04:00.0: WARN: Stalled endpoint
May 26 14:29:53 hercules kernel: [128736.817863] xhci_hcd 0000:04:00.0: ERROR no room on ep ring
May 26 14:29:53 hercules kernel: [128736.817873] xhci_hcd 0000:04:00.0: ERR: No room for command on command ring
May 26 14:29:53 hercules kernel: [128736.817882] xhci_hcd 0000:04:00.0: FIXME allocate a new ring segment
May 26 14:30:00 hercules kernel: [128743.863985] xhci_hcd 0000:04:00.0: ERROR no room on ep ring
May 26 14:30:00 hercules kernel: [128743.863998] xhci_hcd 0000:04:00.0: ERR: No room for command on command ring
May 26 14:30:05 hercules kernel: [128748.872230] xhci_hcd 0000:04:00.0: xHCI host not responding to stop endpoint command.
May 26 14:30:05 hercules kernel: [128748.872240] xhci_hcd 0000:04:00.0: Assuming host is dying, halting host.
May 26 14:30:05 hercules kernel: [128748.878855] xhci_hcd 0000:04:00.0: HC died; cleaning up
May 26 14:30:05 hercules kernel: [128748.878960] usb 3-1: USB disconnect, device number 12
May 26 14:30:05 hercules kernel: [128748.878971] usb 3-1.1: USB disconnect, device number 14
May 26 14:30:05 hercules kernel: [128748.879684] sd 18:0:0:0: Device offlined - not ready after error recovery
May 26 14:30:05 hercules kernel: [128748.879717] sd 18:0:0:0: rejecting I/O to offline device
May 26 14:30:05 hercules kernel: [128748.879786] sd 18:0:0:0: [sdf] Unhandled error code
May 26 14:30:05 hercules kernel: [128748.879793] sd 18:0:0:0: [sdf]  Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
May 26 14:30:05 hercules kernel: [128748.879804] sd 18:0:0:0: [sdf] CDB: Read(10): 28 00 00 00 00 00 00 00 08 00
May 26 14:30:05 hercules kernel: [128748.879830] end_request: I/O error, dev sdf, sector 0
May 26 14:30:05 hercules kernel: [128748.879840] Buffer I/O error on device sdf, logical block 0
May 26 14:30:05 hercules kernel: [128748.879915] sd 18:0:0:0: rejecting I/O to offline device
May 26 14:30:05 hercules kernel: [128748.879978] sd 18:0:0:0: rejecting I/O to offline device
May 26 14:30:05 hercules kernel: [128748.963133] usb 3-1.5: USB disconnect, device number 15
May 26 14:30:05 hercules kernel: [128749.116280] JBD2: I/O error detected when updating journal superblock for sdd1-8.
May 26 14:30:05 hercules kernel: [128749.222460] usb 3-1.6: USB disconnect, device number 13
May 26 14:30:05 hercules kernel: [128749.240211] JBD2: I/O error detected when updating journal superblock for sdb1-8.
May 26 14:30:05 hercules kernel: [128749.262545] usb 3-1.7: USB disconnect, device number 16
May 26 14:30:05 hercules kernel: [128749.269556] JBD2: I/O error detected when updating journal superblock for sde1-8.
May 26 14:30:05 hercules kernel: [128749.363281] usb 3-2: USB disconnect, device number 17
May 26 14:30:10 hercules kernel: [128754.270529] BUG: unable to handle kernel paging request at 00000001638ca0f8


The "scheduling while atomic" message was shown on the screen, with a short stack trace ("cpuidle_idle_call / cpu_idle / start_secondary").
Comment 83 Alan Stern 2011-05-26 14:27:43 UTC
This is a completely different issue from the main problem in this bug report.  You should report it separately -- perhaps by posting it on the linux-usb mailing list.
Comment 84 Bryce Nesbitt 2011-07-13 07:52:10 UTC
The bug bit me today, after I pulled out a mounted ext3 usb drive.  A total Kernel hang.  Video still displayed.  No ping.  No mouse.  No keyboard.  Unfortunately I forgot about Magic SysRq (http://en.wikipedia.org/wiki/Magic_SysRq_key ) and did not probe.  Here's all I got:

# vi /var/log/syslog
Jul 13 00:31:43 ubuntu kernel: [181916.094245] JBD2: I/O error detected when updating journal superblock for sde1-8.
Jul 13 00:40:49 ubuntu kernel: imklog 4.6.4, log source = /proc/kmsg started.

# uname -a
Linux ubuntu 2.6.38-8-generic #42-Ubuntu SMP Mon Apr 11 03:31:24 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux
Comment 85 rocko 2011-08-31 05:00:32 UTC
This bug is still an issue in 3.0.4 and 3.1-rc4. Sadly, this is making linux look rather unreliable for real-world use. :(
Comment 86 rocko 2011-08-31 05:07:27 UTC
@Alan: re your comment #78, how did you compile in netconsole support? I have the following set in .config for my builds, which is the default in Ubuntu's kernel:

CONFIG_NETCONSOLE=m
CONFIG_NETCONSOLE_DYNAMIC=y
Comment 87 Alan Stern 2011-08-31 14:35:39 UTC
I think all I had was  CONFIG_NETCONSOLE=y.

As for fixing the problem...  To be honest, you shouldn't expect it to get fixed until somebody can identify what's causing it.  Since you seem to be one of the very few people experiencing it regularly, the situation doesn't look good until you can provide more information.

The best course of action is to narrow down the variables as much as possible.  That means not running any extraneous programs (i.e., don't run a graphical desktop, and indeed, don't run X at all).  It also means coming up with a very repeatable scenario to trigger the problem.  Something like what you described in comment #75 would be good.

Speaking of which, you mentioned in that comment that on each loop through the test, the driver letter would go up by one.  That should not have happened!  It's another indication of something strange.  Can you verify -- does this still happen with 3.1-rc4?
Comment 88 rocko 2011-08-31 23:43:20 UTC
Running my desktop-less usb-on-off-with-mount.sh test script in 3.1-rc4, the drive stayed at /dev/sdb1 each time. So that seems to be fixed.

The kernel eventually crashed on a umount (not straightaway like last time). load_balance appears in the limited stack trace that I can see in the VM.
Comment 89 rocko 2011-09-01 01:30:13 UTC
Created attachment 71072 [details]
usbonoff-with-mount.sh - script to reproduce the crash

For the record, it is very easy to reproduce the bug with the attached script, which repeatedly mounts, forces an eject, then umounts the USB device until the kernel eventually crashes. The procedure is:

1. Create a VirtualBox VM running Ubuntu 11.04 64-bit with the kernel being tested and boot the VM in recovery mode, ie so no desktop is running.

2. Use the VirtualBox USB facility (right click on the USB icon in the status bar) to attach an ext3/4 drive to the VM.

3. Run the attached script, specifying the target USB device as the argument (the script lists possible devices if you don't supply an argument). There's a commented-out option to redirect output to output to an external machine via netconsole as well, ie to capture the crash log, which you can't fully see in the VM window.
Comment 90 rocko 2011-09-04 03:53:16 UTC
BTW, CONFIG_NETCONSOLE=y doesn't make any difference for me (it still crashes the kernel), although I didn't really expect it to since I had it compiled in as a module and was specifically loading it prior to running the test that crashes the kernel.

So I have a reproducible test case (which, even better, works in VirtualBox VM), but the error information generated in the crash log isn't sufficient for tracking down this problem. Where do I go from here?
Comment 91 Alan Stern 2011-09-04 13:55:27 UTC
Good news: I have been able to reproduce the same sort of crash regularly, using a variant of your script.  That will make debugging a lot easier, even though it will still be difficult.

Incidentally, is CONFIG_EXT4_USE_FOR_EXT23 set in your test kernel configuration?
Comment 92 rocko 2011-09-04 14:00:29 UTC
Hey, that is good news!

I can't find that setting at all in my config. I only have these for CONFIG_EXT4_:

CONFIG_EXT4_FS=y
CONFIG_EXT4_FS_XATTR=y
CONFIG_EXT4_FS_POSIX_ACL=y
CONFIG_EXT4_FS_SECURITY=y
# CONFIG_EXT4_DEBUG is not set

What is it supposed to do?
Comment 93 Alan Stern 2011-09-05 17:44:30 UTC
It causes the ext4 driver to used for ext2 and ext3 filesystems, instead of using the ext2 and ext3 drivers.

Note You need to log in before you can comment on or make changes to this bug.