Bug 12885

Summary: kernel BUG at fs/jbd/transaction.c:1376!
Product: File System
Reporter: Andreas Hauser (andy-kernelbugzilla)
Component: ext3
Assignee: fs_ext3 (fs_ext3)
Status: REJECTED INVALID
Severity: normal
CC: jens
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 2.6.28
Subsystem:
Regression: ---
Bisected commit-id:
Attachments: Archlinux kernel package diff

Description Andreas Hauser 2009-03-16 12:42:48 UTC
Latest working kernel version: 2.6.28
Earliest failing kernel version: 2.6.28
Distribution: Archlinux
Hardware Environment: x86_64
Software Environment:
Problem Description:
------------[ cut here ]------------
kernel BUG at fs/jbd/transaction.c:1376!
invalid opcode: 0000 [#1] PREEMPT SMP
last sysfs file: /sys/devices/platform/coretemp.1/temp1_input
CPU 1
Modules linked in: xt_tcpudp iptable_mangle xt_MARK ip_tables x_tables cpufreq_ondemand ext3 jbd ext2 lrw joydev usbhid hid uhci_hcd snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss ohci1394 i2c_i801 i2c_core ieee1394 ehci_hcd skge snd_hda_intel snd_pcm snd_timer snd_page_alloc snd_hwdep snd soundcore pcspkr usbcore intel_agp iTCO_wdt iTCO_vendor_support sg battery ac tun fuse sky2 evdev acpi_cpufreq freq_table coretemp button fan thermal processor rtc_cmos rtc_core rtc_lib ext4 mbcache jbd2 crc16 aes_x86_64 aes_generic xts gf128mul dm_crypt dm_mod sd_mod sr_mod cdrom pata_acpi ata_generic ahci ata_piix pata_marvell libata scsi_mod
Pid: 11065, comm: scp Not tainted 2.6.28-ARCH #1
RIP: 0010:[<ffffffffa039119e>]  [<ffffffffa039119e>] journal_stop+0x1ee/0x200 [jbd]
RSP: 0018:ffff88009121bd48  EFLAGS: 00010282
RAX: ffff8800cf9c2520 RBX: 0000000000000000 RCX: 0000000000000034
RDX: ffff88020d93eb88 RSI: ffff88022f7fb5d0 RDI: ffff88022f7fb5d0
RBP: ffff880216f6e3c0 R08: 0400000000000000 R09: 0000000000000000
R10: ffff88022e848000 R11: ffffffff80321f00 R12: ffff88022f7fb5d0
R13: ffff88022ea8a000 R14: 00000000000081f8 R15: ffff88009121bdbc
FS:  00002abe4691f810(0000) GS:ffff88022f802900(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 000000000192c428 CR3: 00000000cfb9f000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process scp (pid: 11065, threadinfo ffff88009121a000, task ffff8800cf9c2520)
Stack:
 ffffffff80488675 0000000000000000 ffff88022ea89800 ffffffffa03b77f6
 ffff88020d93eb88 00000000000081f8 ffff88009121bdbc ffffffffa03b12dd
 ffff88021a7c97d8 ffff88022f7fb5d0 0000000000000000 ffffffffa03abaef
Call Trace:
 [<ffffffff80488675>] ? _spin_unlock+0x5/0x30
 [<ffffffffa03b12dd>] ? __ext3_journal_stop+0x2d/0x60 [ext3]
 [<ffffffffa03abaef>] ? ext3_create+0x3f/0x130 [ext3]
 [<ffffffff802bc1a7>] ? vfs_create+0xf7/0x140
 [<ffffffff802bf17c>] ? do_filp_open+0x85c/0x980
 [<ffffffff802572a0>] ? autoremove_wake_function+0x0/0x30
 [<ffffffff802b5b73>] ? vfs_stat_fd+0x23/0x60
 [<ffffffff80488675>] ? _spin_unlock+0x5/0x30
 [<ffffffff802c8b42>] ? alloc_fd+0x122/0x150
 [<ffffffff802af640>] ? do_sys_open+0x80/0x110
 [<ffffffff8020c50a>] ? system_call_fastpath+0x16/0x1b
Code: e8 58 3b ea df 41 8b 75 28 85 f6 0f 84 37 ff ff ff 49 8d 7d 78 31 c9 ba 01 00 00 00 be 03 00 00 00 e8 37 3b ea df e9 1d ff ff ff <0f> 0b eb fe 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 41 57 41
RIP  [<ffffffffa039119e>] journal_stop+0x1ee/0x200 [jbd]
 RSP <ffff88009121bd48>


Steps to reproduce:

Don't know.
Comment 1 Andreas Hauser 2009-03-17 02:30:39 UTC
Another one:
------------[ cut here ]------------
kernel BUG at fs/jbd/transaction.c:1376!
invalid opcode: 0000 [#1] PREEMPT SMP 
last sysfs file: /sys/devices/platform/coretemp.1/temp1_input
CPU 0 
Modules linked in: xt_tcpudp iptable_mangle xt_MARK ip_tables x_tables cpufreq_ondemand ext3 jbd ext2 lrw joydev snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device usbhid hid snd_pcm_oss snd_mixer_oss ohci1394 uhci_hcd snd_hda_intel snd_pcm snd_timer snd_page_alloc snd_hwdep ieee1394 skge snd pcspkr ehci_hcd soundcore i2c_i801 i2c_core usbcore iTCO_wdt intel_agp iTCO_vendor_support sg battery ac tun fuse sky2 evdev acpi_cpufreq freq_table coretemp button fan thermal processor rtc_cmos rtc_core rtc_lib ext4 mbcache jbd2 crc16 aes_x86_64 aes_generic xts gf128mul dm_crypt dm_mod sr_mod cdrom sd_mod pata_acpi ata_generic ahci ata_piix pata_marvell libata scsi_mod
Pid: 6706, comm: unrar Not tainted 2.6.28-ARCH #1
RIP: 0010:[<ffffffffa039119e>]  [<ffffffffa039119e>] journal_stop+0x1ee/0x200 [jbd]
RSP: 0000:ffff8800cfac5de8  EFLAGS: 00010286
RAX: ffff8801b33a2b50 RBX: 0000000000000000 RCX: 0000000000000034
RDX: ffff880160894770 RSI: ffff880229b61390 RDI: ffff880229b61390
RBP: ffff8801a01e0180 R08: 0400000000000000 R09: 0000000000000000
R10: ffff88022e758000 R11: ffffffff80321f00 R12: ffff880229b61390
R13: ffff88022554ec00 R14: 00000000000041fd R15: ffff880141756ea0
FS:  00002af651d5bb20(0000) GS:ffffffff8065e000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007f6c80ef9000 CR3: 00000000cfb7c000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process unrar (pid: 6706, threadinfo ffff8800cfac4000, task ffff8801b33a2b50)
Stack:
 ffff8800432cd310 0000000000000000 ffff88022554c800 ffffffffa03b77c8
 ffff8800432cd310 00000000000041fd ffff880141756ea0 ffffffffa03b12dd
 ffff8801608bab88 ffff880229b61390 ffff880160894770 ffffffffa03aba2e
Call Trace:
 [<ffffffffa03b12dd>] ? __ext3_journal_stop+0x2d/0x60 [ext3]
 [<ffffffffa03aba2e>] ? ext3_mkdir+0x24e/0x2d0 [ext3]
 [<ffffffff802bbc67>] ? vfs_mkdir+0xe7/0x130
 [<ffffffff80488675>] ? _spin_unlock+0x5/0x30
 [<ffffffff802be27c>] ? sys_mkdirat+0x11c/0x130
 [<ffffffff80488675>] ? _spin_unlock+0x5/0x30
 [<ffffffff802c8b42>] ? alloc_fd+0x122/0x150
 [<ffffffff8020c50a>] ? system_call_fastpath+0x16/0x1b
Code: e8 58 3b ea df 41 8b 75 28 85 f6 0f 84 37 ff ff ff 49 8d 7d 78 31 c9 ba 01 00 00 00 be 03 00 00 00 e8 37 3b ea df e9 1d ff ff ff <0f> 0b eb fe 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 41 57 41 
RIP  [<ffffffffa039119e>] journal_stop+0x1ee/0x200 [jbd]
 RSP <ffff8800cfac5de8>
---[ end trace 0b2b3734ee2df4f1 ]---
Comment 2 Theodore Tso 2009-03-17 14:49:32 UTC
The BUG is caused by the following in journal_stop() in transaction.c:

	J_ASSERT(journal_current_handle() == handle);

Somehow the journal handle got corrupted, or we screwed up in the refcount of open handles and allowed a journal transaction to commit even though ext3_mkdir() and ext3_create() were in the middle of doing something with a handle.  This is one of those should-never-happen situations.

What was the system doing at the time of the crash?  And we have to wonder why you are seeing it but apparently not others.  Is this a completely unmodified, stock 2.6.28 kernel?   Have there been any patches applied?
Comment 3 Andreas Hauser 2009-03-18 07:46:40 UTC
Stock Archlinux kernel, which uses this patch set afaik:
ftp://ftp.archlinux.org/other/kernel26/patch-2.6.28.7-2-ARCH.bz2
Or the version before. The ext3 is on a luks encrypted hdd.

Afair I tried to scp to the fs and the scp just hung even after the disk spun up.
That was when I noticed.

Is there a way to recover?

Is there a way to differentiate between hardware problem and software bug?
Comment 4 Theodore Tso 2009-03-18 07:57:41 UTC
Oh, this looks pretty likely to be a software bug.  It just seems to be one of those "should never happen; why isn't anyone else seeing it?" things.  Maybe it has something to do with the luks encryption layer?  I don't know.

If you could do some experiments to see how easily you can reproduce it, and maybe some hints about how we could reproduce it on one of our machines, that would be really appreciated.   If you're willing to work with us on some debugging patches, that would be really great as well.

Once you can reproduce it, then we can try to twiddle various variables, such as whether it goes away if you remove the luks layer.  We can also try introducing some patches that might log some in-flight information, etc.  So trying to see if we can get an easy reproduction case is the first thing that we really need to do.   If you have the time to help us out, it will really help us track it down.  (If not, we'll understand, of course.)

Thanks in advance...
Comment 5 Theodore Tso 2009-03-18 07:58:38 UTC
Oh, once it happens, probably the only way to recover is to reboot.   
Comment 6 Andreas Hauser 2009-03-18 09:34:49 UTC
I'm quite willing to nail this one down.

Do you mean reproduction on another ext3 or on the same?
On the same it wouldn't be a problem, it e.g. happens with a cp -rp when trying to back that data up.
And while most fs on my system are ext4 now, I've used ext3 over luks for quite
some time now, without any problems.
Comment 7 Theodore Tso 2009-03-18 10:41:42 UTC
It's good to know that you can reproduce it easily on that particular ext3 file system.   One interesting question is whether you can reproduce it on some other ext3 file system (since then we might be able to reproduce it on our end); although I haven't played with LUKS at all, and I don't know how much effort it would take for me to set it up on my system.

How are you backing up the data?   tar?  Amanda?   Some other backup system?
Comment 8 Theodore Tso 2009-03-18 10:45:07 UTC
One interesting thing you could try, if you can go to 2.6.29-rc8, is you could try mounting your ext3 filesystem using ext4, and see if the problem goes away or not.   If the problem sticks around, then that's actually very interesting, since so much has changed between ext3 and ext4.  If it goes away, that could be a solution for you, but it also could be that it's just harder to reproduce on ext4.

This smells like a bug in jbd layer, or perhaps some assumption which is getting violated by LUKS; since if it was as simple as just doing a backup of a tree while another process was copying data into it, ext3 is such a commonly used file system that I would have thought such a problem would have been discovered by now.  So if we have something like this which can be easily reproduced, I really want to try to get it chased down.

Thanks,
Comment 9 Theodore Tso 2009-03-18 10:51:13 UTC
Trying to mount ext4 over LUKS on 2.6.28 should work as well.  (Sorry, the reason why I said going to 2.6.29-rc was because I confused myself for a moment and thought this was an ext2 problem, and 2.6.29-rc1 and later will allow you to mount an ext2 filesystem using the ext4 filesystem code.)   But simply a quick try to see if it will reproduce with using ext4 to mount your ext3 filesystem over LUKS would also be worth a very quick test.   Again, if the problem still sticks around, that's the much more interesting data point; if you can't reproduce it, that's good to know, but it doesn't help us chase down the ext3/jbd bug....
Comment 10 Andreas Hauser 2009-03-18 11:10:49 UTC
(In reply to comment #7)
> It's good to know that you can reproduce it easily on that particular ext3
> file system.   One interesting question is whether you can reproduce it on
> some other ext3 file system (since then we might be able to reproduce it on
> our end); although I haven't played with LUKS at all, and I don't know how
> much effort it would take for me to set it up on my system.

That should be rather easy, follow:
http://wiki.archlinux.org/index.php/LUKS

Basically:

$ cryptsetup -c aes-xts-plain -y -s 512 luksFormat /dev/sda3
$ cryptsetup luksOpen /dev/sda3 test
$ mkfs.ext3 /dev/mapper/test
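
A minimal sketch of the same three steps wrapped in a dry-run helper, for trying them against a scratch device first. The names `run` and `make_luks_ext3`, the `DRY_RUN` variable, and the final mount step are assumptions added for illustration; the cryptsetup and mkfs.ext3 invocations are the ones listed above. With `DRY_RUN` unset it only prints the commands.

```shell
# Dry-run wrapper (hypothetical): prints each command unless DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# make_luks_ext3 DEV NAME MNT  (hypothetical helper)
# WARNING: with DRY_RUN=0, luksFormat destroys all data on DEV.
make_luks_ext3() {
    dev=$1
    name=$2
    mnt=$3
    run cryptsetup -c aes-xts-plain -y -s 512 luksFormat "$dev"
    run cryptsetup luksOpen "$dev" "$name"
    run mkfs.ext3 "/dev/mapper/$name"
    # Assumed extra step: mount explicitly as ext3 so the jbd code path is used.
    run mount -t ext3 "/dev/mapper/$name" "$mnt"
}

# Example (prints four commands, runs nothing):
#   make_luks_ext3 /dev/sda3 test /mnt/test
```

Passing `-t ext3` explicitly matters for this report, since the same filesystem behaves differently when mounted through the ext4 code.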

> How are you backing up the data?   tar?  Amanda?   Some other backup system?

I just tried to cp -rp the data away. There was no write to the fs at the time I think. 
Comment 11 Andreas Hauser 2009-03-18 11:14:13 UTC
(In reply to comment #8)
> One interesting thing you could try, if you can go to 2.6.29-rc8, is you
> could try mounting your ext3 filesystem using ext4, and see if the problem
> goes away or not.   If the problem sticks around, then that's actually very
> interesting, since so much has changed between ext3 and ext4.  If it goes
> away, that could be a solution for you, but it also could be that it's just
> harder to reproduce on ext4.

Might we not lose the bug that way, as in it'd be gone for ever?
Can I do that read-only?

> This smells like a bug in jbd layer, or perhaps some assumption which is
> getting violated by LUKS; since if it was as simple as just doing a backup
> of a tree while another process was copying data into it, ext3 is such a
> commonly used file system that I would have thought such a problem would
> have been discovered by now.  So if we have something like this which can be
> easily reproduced, I really want to try to get it chased down.

It's as easy as *reading* the fs, afaics. I'll test this some more.
Comment 12 Andreas Hauser 2009-03-18 11:50:01 UTC
OK. The BUG above says there was an unrar that caused it. So that probably happened before my cp -rp. Most of the files are readable after the bug happens. But processes hang when accessing the problematic directory or file. Is this to be expected after the bug happens but before a reboot?
Comment 13 Theodore Tso 2009-03-18 12:11:00 UTC
Oh, sorry, I misunderstood what you were saying.  All you need to do is "cp -rp" the directory and you get the system hang?   Uh, that's very interesting.   How big is the file system in question?   I thought you were saying you were doing a backup while some other process was doing a cp -rp *into* the directory.

I agree, if that's all it takes, then it must be highly sensitive to the filesystem state, and we want to be careful to preserve it.   In fact, before you do anything else, you might want to save the filesystem image using e2image:

        e2image -r /dev/sda1 - | bzip2 > sda1.e2i.bz2

Given that this is an encrypted filesystem, I can imagine that you probably won't be willing to send this to me, even though this omits all of the data blocks, and only keeps the metadata blocks (although this does include the directory blocks and hence the file names).

However, you can take this raw image file, and dump it on a raw disk, and see if you can replicate the problem on a disk partition.    Something else you could do is to send me a "scrambled" e2image:

        e2image -rs /dev/sda1 - | bzip2 > sda1.e2i.bz2

This randomizes the directory names, although it means that I would have to turn off the dir_index flags before I could try using it.  Still, it might be enough to replicate the problem on my end.

Final question --- have you tried running e2fsck -n on the filesystem; do you know if the filesystem has been reported as self-consistent by e2fsck?

Thanks,
Comment 14 Theodore Tso 2009-03-18 12:13:19 UTC
In answer to your question in comment #12, yes, it's expected that processes accessing the problematic directory might hang until you reboot, since when the process died after the BUG, it probably left some locks locked, and so processes would end up waiting forever for the locks to get unlocked (which they won't since the process that held them died after the OOPS message).

But if this was caused by the unrar, then this might be harder to replicate....

In any case, if you haven't rebooted yet, I would try rebooting, and then running e2fsck on the filesystem to make sure it is consistent.  Then the next trick is to see what is needed to replicate the BUG/oops message.
Comment 15 Andreas Hauser 2009-03-18 14:12:35 UTC
A dd of the hdd is about 370GB. Using this dd image and loop mounting it
I can reproduce a hang (no fsck), but this time nothing in dmesg.

I'm probably able to share it, but only to a limited group. It's got nothing really confidential on it. I just encrypt all my disks so I can throw them away without a second thought, especially if they break (or send them for repair).
Comment 16 Andreas Hauser 2009-03-18 14:50:53 UTC
So after rebooting and manual fsck, a simple touch is triggering it,
or something similar:

$ fsck /dev/mapper/media
fsck 1.41.4 (27-Jan-2009)
e2fsck 1.41.4 (27-Jan-2009)
/dev/mapper/media: recovering journal
/dev/mapper/media has been mounted 157 times without being checked, check forced.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/mapper/media: 21438/24420352 files (30.1% non-contiguous), 89299732/97677469 blocks


$ touch x
[1]    5794 segmentation fault  touch x

dmesg:
EXT3 FS on dm-1, internal journal
EXT3-fs: mounted filesystem with ordered data mode.
------------[ cut here ]------------
kernel BUG at fs/jbd/transaction.c:1376!
invalid opcode: 0000 [#1] PREEMPT SMP
last sysfs file: /sys/devices/platform/coretemp.1/temp1_input
CPU 1
Modules linked in: ext3 jbd xt_tcpudp iptable_mangle xt_MARK ip_tables x_tables cpufreq_ondemand ext2 lrw joydev usbhid hid uhci_hcd snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss ohci1394 ieee1394 pcspkr skge ehci_hcd i2c_i801 i2c_core snd_hda_intel snd_pcm snd_timer snd_page_alloc snd_hwdep snd soundcore usbcore intel_agp iTCO_wdt iTCO_vendor_support sg battery ac tun fuse sky2 evdev acpi_cpufreq freq_table coretemp button fan thermal processor rtc_cmos rtc_core rtc_lib ext4 mbcache jbd2 crc16 aes_x86_64 aes_generic xts gf128mul dm_crypt dm_mod sr_mod cdrom sd_mod pata_acpi ata_generic ahci ata_piix pata_marvell libata scsi_mod
Pid: 5794, comm: touch Not tainted 2.6.28-ARCH #1
RIP: 0010:[<ffffffffa053019e>]  [<ffffffffa053019e>] journal_stop+0x1ee/0x200 [jbd]
RSP: 0018:ffff88021acfbd48  EFLAGS: 00010282
RAX: ffff8802190e18c0 RBX: 0000000000000000 RCX: 0000000000000034
RDX: ffff8801b799fbf0 RSI: ffff88022f516168 RDI: ffff88022f516168
RBP: ffff8801ac5f6780 R08: 0400000000000000 R09: 0000000000000000
R10: ffff88020b544000 R11: ffffffff80321f00 R12: ffff88022f516168
R13: ffff88020b401800 R14: 00000000000081b4 R15: ffff88021acfbdbc
FS:  00002ab790209000(0000) GS:ffff88022f802900(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000406da0 CR3: 0000000228bf2000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process touch (pid: 5794, threadinfo ffff88021acfa000, task ffff8802190e18c0)
Stack:
 ffffffff80488675 0000000000000000 ffff88020b401000 ffffffffa05567f6
 ffff8801b799fbf0 00000000000081b4 ffff88021acfbdbc ffffffffa05502dd
 ffff8801ce594e00 ffff88022f516168 0000000000000000 ffffffffa054aaef
Call Trace:
 [<ffffffff80488675>] ? _spin_unlock+0x5/0x30
 [<ffffffffa05502dd>] ? __ext3_journal_stop+0x2d/0x60 [ext3]
 [<ffffffffa054aaef>] ? ext3_create+0x3f/0x130 [ext3]
 [<ffffffffa054ae7b>] ? ext3_lookup+0xcb/0x100 [ext3]
 [<ffffffff802bc1a7>] ? vfs_create+0xf7/0x140
 [<ffffffff802bf17c>] ? do_filp_open+0x85c/0x980
 [<ffffffff80488675>] ? _spin_unlock+0x5/0x30
 [<ffffffff802c8b42>] ? alloc_fd+0x122/0x150
 [<ffffffff802af640>] ? do_sys_open+0x80/0x110
 [<ffffffff8020c50a>] ? system_call_fastpath+0x16/0x1b
Code: e8 58 4b d0 df 41 8b 75 28 85 f6 0f 84 37 ff ff ff 49 8d 7d 78 31 c9 ba 01 00 00 00 be 03 00 00 00 e8 37 4b d0 df e9 1d ff ff ff <0f> 0b eb fe 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 41 57 41
RIP  [<ffffffffa053019e>] journal_stop+0x1ee/0x200 [jbd]
 RSP <ffff88021acfbd48>
---[ end trace 778583b30cdd0759 ]---
Comment 17 Andreas Hauser 2009-03-18 15:04:57 UTC
Mounting as ext4, touch works, mounting again as ext3 hits the bug again.
Comment 18 Andreas Hauser 2009-03-19 02:37:35 UTC
Booting a new Archlinux kernel 2.6.28.8, I got this on booting:

Adding 1999992k swap on /a.swap.  Priority:-1 extents:462 across:8115000k
JBD: barrier-based sync failed on dm-0:8 - disabling barriers
sky2 eth0: enabling interface
sky2 eth0: Link is up at 1000 Mbps, full duplex, flow control both
general protection fault: 0000 [#1] PREEMPT SMP
last sysfs file: /sys/block/dm-2/size
CPU 0
Modules linked in: ext2 lrw joydev usbhid hid uhci_hcd snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss ohci1394 ieee1394 ehci_hcd snd_hda_intel snd_pcm snd_timer snd_page_alloc snd_hwdep snd soundcore skge i2c_i801 i2c_core usbcore intel_agp pcspkr iTCO_wdt iTCO_vendor_support sg battery ac tun fuse sky2 evdev acpi_cpufreq freq_table coretemp button fan thermal processor rtc_cmos rtc_core rtc_lib ext4 mbcache jbd2 crc16 aes_x86_64 aes_generic xts gf128mul dm_crypt dm_mod sd_mod sr_mod cdrom pata_acpi ata_generic ahci ata_piix pata_marvell libata scsi_mod
Pid: 4955, comm: hald Not tainted 2.6.28-ARCH #1
RIP: 0010:[<ffffffffa01620ab>]  [<ffffffffa01620ab>] acpi_processor_info_seq_show+0x10/0x69 [processor]
RSP: 0018:ffff88022e571e48  EFLAGS: 00010206
RAX: ffff88022ee1f760 RBX: ffff88022e513380 RCX: 2222222222222222
RDX: 0038004000000000 RSI: 0000000000000001 RDI: ffff88022e513380
RBP: 0000000000000001 R08: 00000000ffffffff R09: 0000000000000000
R10: 0000000000000022 R11: ffff88022e513380 R12: 0000000000000001
R13: ffff88022e571e98 R14: 0000000000000000 R15: 0000000000000400
FS:  00007fadef4276f0(0000) GS:ffffffff8065e000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000435660 CR3: 000000022cbf1000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process hald (pid: 4955, threadinfo ffff88022e570000, task ffff88022e61dcd0)
Stack:
 ffff88022e8fa900 ffff88022e571f50 0000000000000400 ffffffff802cd9c6
 00000007fadef2c2 ffff88022e571f50 0000000000655860 ffff88022e8fa900
 ffff88022e5133b0 0000000000000000 0000000000000000 ffff88022eb28480
Call Trace:
 [<ffffffff802cd9c6>] ? seq_read+0xd6/0x360
 [<ffffffff802cd8f0>] ? seq_read+0x0/0x360
 [<ffffffff802fec51>] ? proc_reg_read+0x81/0xd0
 [<ffffffff802b23f8>] ? vfs_read+0xc8/0x180
 [<ffffffff802b25b3>] ? sys_read+0x53/0xa0
 [<ffffffff8020c50a>] ? system_call_fastpath+0x16/0x1b
Code: c3 48 8b 47 e8 48 89 f1 48 c7 c6 9b 20 16 a0 48 89 cf 48 8b 50 60 e9 e5 bc 16 e0 48 83 ec 18 48 8b 57 70 49 89 fb 48 85 d2 74 52 <8a> 42 1c 49 c7 c0 ce 6a 16 a0 48 c7 c6 d1 6a 16 a0 4d 89 c2 4c
RIP  [<ffffffffa01620ab>] acpi_processor_info_seq_show+0x10/0x69 [processor]
 RSP <ffff88022e571e48>
---[ end trace 7a143c8d515d5620 ]---
Comment 19 Andreas Hauser 2009-03-19 14:41:23 UTC
The problem was introduced between Archlinux kernel version 2.6.28.7-1 and 2.6.28.7-2.
I'll upload a diff of the package. The patch used by Archlinux has not changed between those revisions.
The only thing I could see that might be related to the bug I'm seeing is a change in CONFIG_NLS_UTF8, which is now built as a module instead of into the kernel.
Comment 20 Andreas Hauser 2009-03-19 14:43:06 UTC
Created attachment 20601
Archlinux kernel package diff

A diff between the kernel source packages that introduced the bug and the one before.
Comment 21 Eric Sandeen 2009-03-19 14:51:22 UTC
Can you replicate this on stock upstream 2.6.28.7, with the archlinux configs?

-Eric
Comment 22 Andreas Hauser 2009-03-19 14:54:01 UTC
Maybe a related bug:
http://bugs.archlinux.org/task/13762?project=1
Comment 23 Andreas Hauser 2009-03-20 00:23:57 UTC
Just commenting out the Archlinux patch results in ext3 not being able to load anymore:
ext3: Unknown symbol __grab_cache_page
Comment 24 Theodore Tso 2009-03-20 06:26:48 UTC
Andreas, if you can send me the compressed e2image file (it should compress well, since all of the data blocks have been taken out) via some kind of private download URL, I'd really appreciate it.   At this point that's probably going to be the fastest way to track down what is happening.

The possibly related bug you pointed at in Comment #22 is a reiserfs bug, which uses totally unrelated journalling machinery to ext3, so I very much doubt it is related.
Comment 25 Andreas Hauser 2009-03-21 00:12:08 UTC
Turns out the grub root was wrong and so an old kernel got loaded,
leading to a mismatch between kernel and modules.

If you still consider this a bug, I'll submit the e2image.
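
A minimal sketch of one way to spot this kind of mismatch, assuming the kernel image's description (as printed by `file`) embeds the same build stamp that `uname -v` reports for the running kernel. Comparing release strings alone would not have helped here, because every oops in this report shows the same "2.6.28-ARCH" string. The helper name `stamp_matches_image` and the image path in the usage comment are assumptions for illustration.

```shell
# stamp_matches_image is a hypothetical helper, not part of any existing tool.
# $1: build stamp of the running kernel, e.g. "$(uname -v)"
# $2: description of the kernel image the bootloader is expected to load,
#     e.g. "$(file -b /boot/vmlinuz26)" (the path is an assumption)
# Returns 0 if the image description contains the running build stamp.
stamp_matches_image() {
    # An empty stamp would match anything, so treat it as a failure.
    [ -n "$1" ] || return 1
    case "$2" in
        *"$1"*) return 0 ;;
        *)      return 1 ;;
    esac
}

# Typical use on a live system (left commented out so the sketch has no side effects):
#   if stamp_matches_image "$(uname -v)" "$(file -b /boot/vmlinuz26)"; then
#       echo "running kernel matches the installed image"
#   else
#       echo "MISMATCH: check the bootloader root and kernel entries"
#   fi
```

If the two disagree, the bootloader is probably loading a different image than the one the installed modules were built for, which is the situation described in comment 25.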