Bug 13793 - 2.6.31 regression - boot crash in rcu_process_callbacks - DELL XPS M1330
Summary: 2.6.31 regression - boot crash in rcu_process_callbacks - DELL XPS M1330
Status: CLOSED CODE_FIX
Alias: None
Product: File System
Classification: Unclassified
Component: VFS
Hardware: All Linux
Importance: P1 normal
Assignee: Len Brown
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2009-07-17 14:49 UTC by Christophe Dumez
Modified: 2012-06-13 14:10 UTC

See Also:
Kernel Version: 2.6.31-rc3+ (2009-07-17)
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
Photo of the screen after the crash (477.77 KB, image/jpeg)
2009-07-17 14:49 UTC, Christophe Dumez
lspci -nnvv output (15.08 KB, text/plain)
2009-07-17 14:50 UTC, Christophe Dumez
acpidump (112.60 KB, text/plain)
2009-07-20 01:37 UTC, Christophe Dumez
lspci -vxxx (13.39 KB, text/plain)
2009-07-20 01:38 UTC, Christophe Dumez
dmesg output after I/O errors (49.27 KB, text/plain)
2009-07-21 04:53 UTC, Christophe Dumez
kern.log after I/O errors (681.09 KB, text/plain)
2009-07-21 04:54 UTC, Christophe Dumez
Input / Output Errors (Photo) (435.11 KB, image/jpeg)
2009-07-21 04:56 UTC, Christophe Dumez
Photo of the crash with pre-rc1 (9937ac0cc087b03d6d73f46a5d6b38c43626e60e) (539.53 KB, image/jpeg)
2009-07-21 08:50 UTC, Christophe Dumez
mount output (1.00 KB, text/plain)
2009-07-22 16:04 UTC, Christophe Dumez
Do not release acl when returning (1.40 KB, patch)
2009-07-23 14:46 UTC, Stefan Bader

Description Christophe Dumez 2009-07-17 14:49:46 UTC
Created attachment 22393 [details]
Photo of the screen after the crash

Kernel 2.6.31 does not boot on my laptop (a DELL XPS M1330). I have tried rc1, rc2, rc3 and today's daily build but none of those will boot on my laptop.

I get the following call trace with today's daily build:
rcu_do_batch+0x27/0x90
__rcu_process_callbacks+0xc8/0x100
tick_handle_oneshot_broadcast+0xdd/0x100
rcu_process_callbacks+0x20/0x40
timer_interrupt+0x21/0x70
handle_IRQ_event+0x56/0x120
do_softirq+0x3c/0x40
irq_exit+0x5c/0x70
do_IRQ+0x4f/0xc0
common_interrupt+0x29/0x30
sys_getresuid+0x3b/0x70
acpi_idle_enter_bm+0x19a/0x1c9
cpuidle_idle_call+0x6f/0xc0
cpu_idle+0x42/0x80
start_secondary+0xae/0xd0
Comment 1 Christophe Dumez 2009-07-17 14:50:39 UTC
Created attachment 22394 [details]
lspci -nnvv output
Comment 2 ykzhao 2009-07-20 00:36:45 UTC
Hi, Christophe
    Could you please confirm whether the machine boots normally with an older kernel, for example 2.6.29 or 2.6.30?
    If it does, could you please use git-bisect to identify the bad commit that causes the regression?
    Could you also attach the output of acpidump and lspci -vxxx?
    Thanks.
Comment 3 Christophe Dumez 2009-07-20 01:32:23 UTC
Last kernel I booted was 2.6.28 and it was fine. I will get my hands on a 2.6.30 and test.
Comment 4 Christophe Dumez 2009-07-20 01:37:44 UTC
Created attachment 22403 [details]
acpidump
Comment 5 Christophe Dumez 2009-07-20 01:38:50 UTC
Created attachment 22404 [details]
lspci -vxxx
Comment 6 Christophe Dumez 2009-07-20 01:49:11 UTC
I have just booted on a 2.6.30 kernel and it works fine. Problems were introduced in v2.6.31.

I'm not using git but premade packages from http://kernel.ubuntu.com/~kernel-ppa/mainline

It is difficult to see when this problem started because I have been affected by the following bug since I switched to rc1:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/392709
Comment 7 Christophe Dumez 2009-07-20 13:52:04 UTC
I pulled the kernel tree up to commit 4075ea8c54a7506844a69f674990241e7766357b from git. This commit landed after rc1 and supposedly fixes booting on the XPS M1330.

Well, it does not boot on my XPS M1330.
Comment 8 Len Brown 2009-07-20 14:15:24 UTC
The only recent commit I see that mentions this box is
412af97838828bc6d035a1902c8974f944663da6
"ACPI: video: prevent NULL deref in acpi_get_pci_dev()"
but that went upstream immediately after 2.6.31-rc1,
so you already have it.

unclear if this regression is related to ACPI.
Does 2.6.31-rc boot before the ACPI changes went in at:
0c26d7cc31cd81a82be3b9d7687217d49fe9c47e
(note that it will fail after that, due to the
 NULL in acpi_get_pci_dev() mentioned above --
 so you can apply that patch manually to bisect forward...)

any difference with "maxcpus=1"?
any difference with "idle=poll"?
how about with "acpi=off"?
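When a test boot gets as far as a shell, it is worth confirming that the parameter really reached the kernel, since a typo in the boot loader entry silently tests nothing. A minimal sketch, assuming /proc is mounted; the helper name has_param is made up for illustration:

```shell
#!/bin/sh
# has_param CMDLINE PARAM: succeed if PARAM appears as a whole word in CMDLINE.
# (Hypothetical helper, not part of any standard tool.)
has_param() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Report which of the three suggested test parameters are active on this boot.
cmdline=$(cat /proc/cmdline 2>/dev/null)
for p in maxcpus=1 idle=poll acpi=off; do
    if has_param "$cmdline" "$p"; then
        echo "$p: set"
    else
        echo "$p: not set"
    fi
done
```

The whole-word match matters: a plain substring test would report "maxcpus=1" as set when the command line actually says "maxcpus=12".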
Comment 9 Christophe Dumez 2009-07-20 15:23:53 UTC
None of these three boot parameters changed anything. So I guess this means ACPI is not the origin of the problem?
Comment 10 Christophe Dumez 2009-07-21 01:58:07 UTC
I compiled the kernel from just before commit 0c26d7cc31cd81a82be3b9d7687217d49fe9c47e (ACPI changes), and manually applied the NULL check patch for acpi_get_pci_dev().

Sadly, it still does not boot: it crashes earlier in the boot process and the call trace looks different. This second problem was probably fixed later (because I did not experience it with later kernels). Could someone tell me which patch I should apply for this?

Call trace:
shmem_acl_init
shmem_mknod
vfs_mknod
sys_mknodat
handle_mm_fault
do_page_fault
sys_mknod
syscall_call
Comment 11 Christophe Dumez 2009-07-21 02:15:19 UTC
The second problem I'm experiencing seems to be this one:
http://lkml.indiana.edu/hypermail/linux/kernel/0906.3/00506.html

Apparently, it was fixed on June 24th (just before rc1) by commit c6223048259006759237d826219f0fa4f312fb47. I will apply this patch too and retest.
Comment 12 Christophe Dumez 2009-07-21 04:52:39 UTC
I compiled the kernel tree up to commit 0c26d7cc31cd81a82be3b9d7687217d49fe9c47e (excluded) with the following cherry-picked patches:
412af97838828bc6d035a1902c8974f944663da6 : ACPI NULL reference check
c6223048259006759237d826219f0fa4f312fb47 : JFS ACL race condition fix
d5bb68adda7cc179e8efadeaa3a283cb470f13a6 : Another JFS ACL race condition fix

The result is :
- I don't get a crash but I get a lot of Input/Output errors and X is not launched. Boot process is just stuck at some point.

iirc, I got the same result with rc3 when passing the parameter "idle=poll" (as advised in a previous comment).

I don't know whether this means that the crash did not occur or simply that I cannot see it. In any case, it does not boot.

Since I managed to get a prompt this time (despite the I/O Errors), I will post kern.log and dmesg.
Comment 13 Christophe Dumez 2009-07-21 04:53:18 UTC
Created attachment 22421 [details]
dmesg output after I/O errors
Comment 14 Christophe Dumez 2009-07-21 04:54:02 UTC
Created attachment 22422 [details]
kern.log after I/O errors
Comment 15 Christophe Dumez 2009-07-21 04:56:00 UTC
Created attachment 22423 [details]
Input / Output Errors (Photo)
Comment 16 Christophe Dumez 2009-07-21 08:48:32 UTC
Ok, still using kernel tree up to 0c26d7cc31cd81a82be3b9d7687217d49fe9c47e commit
(excluded) with the following cherry picked patches:
412af97838828bc6d035a1902c8974f944663da6 : ACPI NULL reference check
c6223048259006759237d826219f0fa4f312fb47 : JFS ACL race condition fix
d5bb68adda7cc179e8efadeaa3a283cb470f13a6 : Another JFS ACL race condition fix

but with the same config file as I used for rc3, I can see the call trace this time. Therefore, the crash was not caused by ACPI changes. Although the call trace is not exactly the same, it looks similar: I will post it anyway.

The problem was introduced between the v2.6.30 release and commit
0c26d7cc31cd81a82be3b9d7687217d49fe9c47e (ACPI changes).
Comment 17 Christophe Dumez 2009-07-21 08:50:14 UTC
Created attachment 22426 [details]
Photo of the crash with pre-rc1 (9937ac0cc087b03d6d73f46a5d6b38c43626e60e)
Comment 18 Christophe Dumez 2009-07-21 08:53:41 UTC
When I said I used the kernel tree up to commit 0c26d7cc31cd81a82be3b9d7687217d49fe9c47e
(excluded), I actually used the tree up to 9937ac0cc087b03d6d73f46a5d6b38c43626e60e (included). I hope this is OK; I'm really not used to git and I did not know exactly how to do this (so I chose a commit which happened slightly earlier in time, according to the rc1 changelog).
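For anyone else unsure how to get "the tree up to commit X (excluded)" with a few fixes on top, here is a rough sketch. It only prints the git commands (a dry run) so they can be reviewed first. Assumptions: it is run from a clone of Linus's tree, the branch name pre-acpi-test is made up, and plan_tree is a hypothetical helper. Note that for a merge commit, "X^" selects only its first parent.

```shell
#!/bin/sh
# plan_tree BASE FIX...: print the git commands that check out the first
# parent of BASE on a new branch and cherry-pick each FIX on top of it.
# (Hypothetical helper; the branch name is an arbitrary choice.)
plan_tree() {
    base=$1
    shift
    # "$base^" names the first parent of $base, i.e. the tree just before it.
    echo "git checkout -b pre-acpi-test $base^"
    for fix in "$@"; do
        echo "git cherry-pick $fix"
    done
}

# The ACPI merge as the excluded base, plus the three fixes listed in comment 12.
plan_tree 0c26d7cc31cd81a82be3b9d7687217d49fe9c47e \
    412af97838828bc6d035a1902c8974f944663da6 \
    c6223048259006759237d826219f0fa4f312fb47 \
    d5bb68adda7cc179e8efadeaa3a283cb470f13a6
```

Picking a commit by its date in a changelog is unreliable because history is not linear; naming the parent explicitly avoids that guesswork.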
Comment 19 Christophe Dumez 2009-07-21 15:06:33 UTC
I'm using git-bisect but it takes a lot of time. I'm providing my current results, hoping it will help:

$ git bisect log
git bisect start
# good: [07a2039b8eb0af4ff464efd3dfd95de5c02648c6] Linux 2.6.30
git bisect good 07a2039b8eb0af4ff464efd3dfd95de5c02648c6
# bad: [9937ac0cc087b03d6d73f46a5d6b38c43626e60e] MAINTAINERS: Change mailing list info for CRIS
git bisect bad 9937ac0cc087b03d6d73f46a5d6b38c43626e60e
# good: [e7c5a4f292e0d1f4ba9a3a94b2c8e8b71e35b25a] powerpc/5121: make clock debug output more readable
git bisect good e7c5a4f292e0d1f4ba9a3a94b2c8e8b71e35b25a
# good: [0dd5198672dd2bbeb933862e1fc82162e0b636be] Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-2.6
git bisect good 0dd5198672dd2bbeb933862e1fc82162e0b636be
Comment 20 Christophe Dumez 2009-07-21 15:09:28 UTC
I'm currently using git-bisect to pinpoint the bad commit but it takes a lot of time...

I'm providing my current results, hoping it will help:
$ git bisect log
git bisect start
# good: [07a2039b8eb0af4ff464efd3dfd95de5c02648c6] Linux 2.6.30
git bisect good 07a2039b8eb0af4ff464efd3dfd95de5c02648c6
# bad: [9937ac0cc087b03d6d73f46a5d6b38c43626e60e] MAINTAINERS: Change mailing list info for CRIS
git bisect bad 9937ac0cc087b03d6d73f46a5d6b38c43626e60e
# good: [e7c5a4f292e0d1f4ba9a3a94b2c8e8b71e35b25a] powerpc/5121: make clock debug output more readable
git bisect good e7c5a4f292e0d1f4ba9a3a94b2c8e8b71e35b25a
# good: [0dd5198672dd2bbeb933862e1fc82162e0b636be] Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-2.6
git bisect good 0dd5198672dd2bbeb933862e1fc82162e0b636be
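On the time this takes: each bisect step roughly halves the remaining range, so N candidate commits need about log2(N) build-and-boot rounds. A small sketch of that estimate follows; bisect_steps is a made-up helper, not a git command, the figure 10000 is only an illustrative input, and git's own "roughly N steps" message may differ slightly on non-linear history.

```shell
#!/bin/sh
# bisect_steps N: print the smallest k such that 2^k >= N, i.e. roughly how
# many kernels must be built and booted to bisect a range of N commits.
# (Hypothetical helper for illustration only.)
bisect_steps() {
    n=$1
    k=0
    span=1
    while [ "$span" -lt "$n" ]; do
        span=$((span * 2))
        k=$((k + 1))
    done
    echo "$k"
}

# Illustrative input: a range of 10000 commits.
bisect_steps 10000
```

Inside a kernel clone, the real size of the range can be fed in with something like bisect_steps "$(git rev-list --count 07a2039b8eb0af4ff464efd3dfd95de5c02648c6..9937ac0cc087b03d6d73f46a5d6b38c43626e60e)".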
Comment 21 Christophe Dumez 2009-07-22 03:04:03 UTC
I'm making some progress:

$ git bisect log
git bisect start
# good: [07a2039b8eb0af4ff464efd3dfd95de5c02648c6] Linux 2.6.30
git bisect good 07a2039b8eb0af4ff464efd3dfd95de5c02648c6
# bad: [9937ac0cc087b03d6d73f46a5d6b38c43626e60e] MAINTAINERS: Change mailing list info for CRIS
git bisect bad 9937ac0cc087b03d6d73f46a5d6b38c43626e60e
# good: [e7c5a4f292e0d1f4ba9a3a94b2c8e8b71e35b25a] powerpc/5121: make clock debug output more readable
git bisect good e7c5a4f292e0d1f4ba9a3a94b2c8e8b71e35b25a
# good: [0dd5198672dd2bbeb933862e1fc82162e0b636be] Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-2.6
git bisect good 0dd5198672dd2bbeb933862e1fc82162e0b636be
# good: [9b901ee0cb007eb4e2ee056e5b1c5c2837d53bdb] [WATCHDOG] wdt_pci.c: remove #ifdef CONFIG_WDT_501_PCI
git bisect good 9b901ee0cb007eb4e2ee056e5b1c5c2837d53bdb
# good: [7e0338c0de18c50f09aea1fbef45110cf7d64a3c] Merge branch 'for-2.6.31' of git://fieldses.org/git/linux-nfsd
git bisect good 7e0338c0de18c50f09aea1fbef45110cf7d64a3c
Comment 22 Christophe Dumez 2009-07-22 07:40:28 UTC
Progressing:

$ git bisect log
git bisect start
# good: [07a2039b8eb0af4ff464efd3dfd95de5c02648c6] Linux 2.6.30
git bisect good 07a2039b8eb0af4ff464efd3dfd95de5c02648c6
# bad: [9937ac0cc087b03d6d73f46a5d6b38c43626e60e] MAINTAINERS: Change mailing list info for CRIS
git bisect bad 9937ac0cc087b03d6d73f46a5d6b38c43626e60e
# good: [e7c5a4f292e0d1f4ba9a3a94b2c8e8b71e35b25a] powerpc/5121: make clock debug output more readable
git bisect good e7c5a4f292e0d1f4ba9a3a94b2c8e8b71e35b25a
# good: [0dd5198672dd2bbeb933862e1fc82162e0b636be] Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-2.6
git bisect good 0dd5198672dd2bbeb933862e1fc82162e0b636be
# good: [9b901ee0cb007eb4e2ee056e5b1c5c2837d53bdb] [WATCHDOG] wdt_pci.c: remove #ifdef CONFIG_WDT_501_PCI
git bisect good 9b901ee0cb007eb4e2ee056e5b1c5c2837d53bdb
# good: [7e0338c0de18c50f09aea1fbef45110cf7d64a3c] Merge branch 'for-2.6.31' of git://fieldses.org/git/linux-nfsd
git bisect good 7e0338c0de18c50f09aea1fbef45110cf7d64a3c
# good: [eebf8d86acf0db974dfaad8e8285f4e12ca488e2] V4L/DVB (12131): BUGFIX: An incorrect Carrier Recovery Loop optimization table was being
git bisect good eebf8d86acf0db974dfaad8e8285f4e12ca488e2
# good: [a10b32db34898d0db58a58ef76a70c374931bbff] kgdb: kgdboc console poll hooks for serial_txx9 uart
git bisect good a10b32db34898d0db58a58ef76a70c374931bbff
Comment 23 Christophe Dumez 2009-07-22 12:26:54 UTC
When I said I used kernel tree up to 0c26d7cc31cd81a82be3b9d7687217d49fe9c47e
commit (excluded): I actually used tree up to
9937ac0cc087b03d6d73f46a5d6b38c43626e60e (included).

Apparently, this was a mistake: 9937ac0cc087b03d6d73f46a5d6b38c43626e60e occurred *after* 0c26d7cc31cd81a82be3b9d7687217d49fe9c47e according to git log.

I will test commit 936940a9c7e3d99b25859bf1ff140d8c2480183a (Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6) right now. This one occurred just before ACPI changes.

I have a feeling the ACPI commit (0c26d7cc31cd81a82be3b9d7687217d49fe9c47e) is the faulty one after all. I will confirm this in a few hours.
Comment 24 Christophe Dumez 2009-07-22 13:56:32 UTC
Apparently, 

[936940a9c7e3d99b25859bf1ff140d8c2480183a] Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6

is crashing already. So the ACPI commit does not seem to be the problem.

I'll try to continue bisecting, but the bug seems to be between:

[6122af3743a48dddae19810626dd7c9c8e6c1df8] asus_acpi: Deprecate in favor of asus-laptop

and

[936940a9c7e3d99b25859bf1ff140d8c2480183a] Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6
Comment 25 Christophe Dumez 2009-07-22 15:46:05 UTC
Thanks to git-bisect, I identified the following commit as the problem:
commit 936940a9c7e3d99b25859bf1ff140d8c2480183a
Merge: 09ce42d 1cbd20d
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date:   Wed Jun 24 10:03:12 2009 -0700

    Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6
    
    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (23 commits)
      switch xfs to generic acl caching helpers
      helpers for acl caching + switch to those
      switch shmem to inode->i_acl
      switch reiserfs to inode->i_acl
      switch reiserfs to usual conventions for caching ACLs
      reiserfs: minimal fix for ACL caching
      switch nilfs2 to inode->i_acl
      switch btrfs to inode->i_acl
      switch jffs2 to inode->i_acl
      switch jfs to inode->i_acl
      switch ext4 to inode->i_acl
      switch ext3 to inode->i_acl
      switch ext2 to inode->i_acl
      add caching of ACLs in struct inode
      fs: Add new pre-allocation ioctls to vfs for compatibility with legacy xfs ioctls
      cleanup __writeback_single_inode
      ... and the same for vfsmount id/mount group id
      Make allocation of anon devices cheaper
      update Documentation/filesystems/Locking
      devpts: remove module-related code
      ...

Note that I'm using jfs so it could be related.
Comment 26 Christophe Dumez 2009-07-22 16:00:56 UTC
Since ACPI does not seem to be the problem, changing component to FileSystem.
Comment 27 Christophe Dumez 2009-07-22 16:04:56 UTC
Created attachment 22444 [details]
mount output
Comment 28 Christophe Dumez 2009-07-23 01:59:24 UTC
I have narrowed it down a bit more, still using git-bisect.

The problem occurred after :
[6582a0e6f6bc7bf64817b9e1a424782855292ab0] switch ext3 to inode->i_acl

and of course before:
[936940a9c7e3d99b25859bf1ff140d8c2480183a] Merge branch 'for-linus' of
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6
Comment 29 Christophe Dumez 2009-07-23 05:35:46 UTC
still using git-bisect.

The problem occurred after :
[290c263bf83cd78e53b1aa3b42165f588163f2be] switch jffs2 to inode->i_acl

and of course before:
[936940a9c7e3d99b25859bf1ff140d8c2480183a] Merge branch 'for-linus' of
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6
Comment 30 Christophe Dumez 2009-07-23 07:15:53 UTC
still using git-bisect.

The problem occurred after :
[281eede0328c84a8f20e0e85b807d5b51c3de4f2] switch reiserfs to inode->i_acl

and of course before:
[936940a9c7e3d99b25859bf1ff140d8c2480183a] Merge branch 'for-linus' of
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6
Comment 31 Christophe Dumez 2009-07-23 13:52:58 UTC
I have just confirmed that the patch proposed by Stefan Bader on this bug report works:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/396780/comments/33
Comment 32 Stefan Bader 2009-07-23 14:46:53 UTC
Created attachment 22475 [details]
Do not release acl when returning
Comment 33 Stefan Bader 2009-07-23 14:47:50 UTC
Christophe tried the patch above and it solved the crashes he was experiencing.
