Bug 32982

Summary: Kernel locks up a few minutes after boot
Product: IO/Storage Reporter: Bart Van Assche (bvanassche)
Component: Block LayerAssignee: drivers_video-dri
Status: CLOSED CODE_FIX    
Severity: normal CC: akpm, alexdeucher, florian, maciej.rutecki, rjw
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.29-rc2+ Subsystem:
Regression: Yes Bisected commit-id:
Bug Depends on:    
Bug Blocks: 32012    

Description Bart Van Assche 2011-04-10 19:55:13 UTC
A few seconds to a few minutes after the X server has started, the system locks up. Sometimes only one or two processes lock up before the system locks up entirely, sometimes the entire system locks up all at once - pressing caps lock does no longer trigger the keyboard caps lock LED. I haven't seen this behavior ever before, not even with any rc kernel.

The same kernel with the same config runs fine in a virtual machine. Could this be the radeon driver that is causing trouble ?

$ lspci|grep VGA
02:00.0 VGA compatible controller: ATI Technologies Inc RV610 [Radeon HD 2400 XT]

kernel version:
$ git show HEAD|head -n 4
commit 94c8a984ae2adbd9a9626fb42e0f2faf3e36e86f
Merge: f9fa0bc 37adb89
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date:   Fri Apr 8 11:47:35 2011 -0700

GRUB entry:
title openSUSE 11.4 - 2.6.39-rc2+-debug
    root (hd0,0)
    kernel /boot/vmlinuz-2.6.39-rc2-debug+ root=/dev/md0 resume=/dev/md1 splash=silent showopts vga=0x31a nomodesetting
    initrd /boot/initrd-2.6.39-rc2-debug+
Comment 1 Alex Deucher 2011-04-11 23:14:27 UTC
Is this a regression?  Did previous kernels work ok?  If so, can you bisect?
Comment 2 Alex Deucher 2011-04-11 23:15:20 UTC
Does it work ok if you remove:
vga=0x31a nomodesetting
from your grub entry?
Comment 3 Bart Van Assche 2011-04-12 18:34:48 UTC
(In reply to comment #1)
> Is this a regression?  Did previous kernels work ok?  If so, can you bisect?

Yes, it's a regression - 2.6.38 and 2.6.38.2 run perfectly on the same system.

I haven't had much luck with the bisect though - halfway bisecting I encountered a commit that made my system unbootable because reassembling the RAID1 /boot partition failed. The log I have so far is:

$ git bisect start 'drivers/video'
# good: [521cb40b0c44418a4fd36dc633f575813d59a43d] Linux 2.6.38
git bisect good 521cb40b0c44418a4fd36dc633f575813d59a43d
# bad: [94c8a984ae2adbd9a9626fb42e0f2faf3e36e86f] Merge branch 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6
git bisect bad 94c8a984ae2adbd9a9626fb42e0f2faf3e36e86f
# good: [da49252fb0392d8196833ef3da92e48fb371f8d7] Merge branch 'for-paul' of git://gitorious.org/linux-omap-dss2/linux
git bisect good da49252fb0392d8196833ef3da92e48fb371f8d7
# good: [bf5f0019046d596d613caf74722ba4994e153899] video, sm501: add I/O functions for use on powerpc
git bisect good bf5f0019046d596d613caf74722ba4994e153899
# good: [6b794743b2c5e21825d35b5d5dd57d6fcc388198] unicore32 framebuffer fix: get videomemory by __get_free_pages() and make it floatable
git bisect good 6b794743b2c5e21825d35b5d5dd57d6fcc388198

unknown because of boot failure: 21cd72e7cb424f1686855602ec0fdc6e5830f249
Comment 4 Alex Deucher 2011-04-12 18:40:28 UTC
'git bisect skip' to skip problematic commits.
Comment 5 Alex Deucher 2011-04-12 18:42:33 UTC
Also, the drm is not in drivers/video, it's in drivers/gpu
Comment 6 Bart Van Assche 2011-04-13 18:49:13 UTC
Although I'm still busy bisecting, I'd like to report that I got the following hung task report with head b73a21fc66fee35b41db755abebfacba48b2fc76 (had already seen something similar before with 2.6.39-rc2):

INFO: task kjournald:918 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kjournald       D ffff880131b9ddb8     0   918      2 0x00000000
 ffff880131b9dd20 0000000000000046 ffff880131b9dca0 ffffffff8108cd6d
 0000000000000282 ffff880131b9dfd8 ffff880137729f40 ffff880131b9dfd8
 ffff880131b9c000 ffff880131b9c000 ffff880131b9c000 ffff880131b9dfd8
Call Trace:
 [<ffffffff8108cd6d>] ? trace_hardirqs_on_caller+0x14d/0x190
 [<ffffffff81048419>] ? sub_preempt_count+0xa9/0xe0
 [<ffffffffa00c615e>] journal_commit_transaction+0x13e/0x1590 [jbd]
 [<ffffffff813e2535>] ? _raw_spin_unlock_irqrestore+0x65/0x80
 [<ffffffff81048419>] ? sub_preempt_count+0xa9/0xe0
 [<ffffffff81075280>] ? wake_up_bit+0x40/0x40
 [<ffffffff8106258a>] ? del_timer_sync+0x8a/0xc0
 [<ffffffff81062500>] ? try_to_del_timer_sync+0x110/0x110
 [<ffffffffa00ca1a1>] kjournald+0xf1/0x250 [jbd]
 [<ffffffff81075280>] ? wake_up_bit+0x40/0x40
 [<ffffffffa00ca0b0>] ? commit_timeout+0x10/0x10 [jbd]
 [<ffffffff81074c66>] kthread+0x96/0xa0
 [<ffffffff813e4294>] kernel_thread_helper+0x4/0x10
 [<ffffffff8103cebb>] ? finish_task_switch+0x7b/0xe0
 [<ffffffff813e24ab>] ? _raw_spin_unlock_irq+0x3b/0x60
 [<ffffffff813e2944>] ? retint_restore_args+0xe/0xe
 [<ffffffff81074bd0>] ? __init_kthread_worker+0x70/0x70
 [<ffffffff813e4290>] ? gs_change+0xb/0xb
no locks held by kjournald/918.
INFO: task klauncher:5744 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
klauncher       D 00000001000297b4     0  5744   5743 0x00000000
 ffff88011dd73938 0000000000000046 ffff880100000000 ffffffff8108cbef
 ffffffff813e2535 ffff88011dd73fd8 ffff8801382e1f40 ffff88011dd73fd8
 ffff88011dd72000 ffff88011dd72000 ffff88011dd72000 ffff88011dd73fd8
Call Trace:
 [<ffffffff8108cbef>] ? mark_held_locks+0x6f/0xa0
 [<ffffffff813e2535>] ? _raw_spin_unlock_irqrestore+0x65/0x80
 [<ffffffff8115b830>] ? __wait_on_buffer+0x30/0x30
 [<ffffffff813df0e9>] io_schedule+0x59/0x80
 [<ffffffff8115b83e>] sleep_on_buffer+0xe/0x20
 [<ffffffff813df89a>] __wait_on_bit_lock+0x5a/0xc0
 [<ffffffff8115b830>] ? __wait_on_buffer+0x30/0x30
 [<ffffffff813df978>] out_of_line_wait_on_bit_lock+0x78/0x90
 [<ffffffff810752d0>] ? autoremove_wake_function+0x50/0x50
 [<ffffffff8115b886>] __lock_buffer+0x36/0x40
 [<ffffffffa00c56ad>] do_get_write_access+0x64d/0x660 [jbd]
 [<ffffffff81048419>] ? sub_preempt_count+0xa9/0xe0
 [<ffffffffa00c33e0>] ? start_this_handle+0x370/0x470 [jbd]
 [<ffffffffa00cb594>] ? journal_add_journal_head+0xf4/0x220 [jbd]
 [<ffffffffa00c58f1>] journal_get_write_access+0x31/0x50 [jbd]
 [<ffffffffa00efe6d>] __ext3_journal_get_write_access+0x2d/0x60 [ext3]
 [<ffffffffa00e2423>] ext3_reserve_inode_write+0x83/0xb0 [ext3]
 [<ffffffffa00e2494>] ext3_mark_inode_dirty+0x44/0x70 [ext3]
 [<ffffffffa00e4ffe>] ext3_dirty_inode+0x5e/0xa0 [ext3]
 [<ffffffff8115412f>] __mark_inode_dirty+0x3f/0x250
 [<ffffffff8114627c>] file_update_time+0xec/0x170
 [<ffffffff813dfecd>] ? mutex_lock_nested+0x27d/0x3a0
 [<ffffffff810e4138>] __generic_file_aio_write+0x1f8/0x440
 [<ffffffff810e43f5>] generic_file_aio_write+0x75/0xf0
 [<ffffffff8112ca6a>] do_sync_write+0xda/0x120
 [<ffffffff8110a2d7>] ? remove_vma+0x77/0x90
 [<ffffffff8108cdbd>] ? trace_hardirqs_on+0xd/0x10
 [<ffffffff8110a2d7>] ? remove_vma+0x77/0x90
 [<ffffffff8112d186>] vfs_write+0xc6/0x170
 [<ffffffff8112d481>] sys_write+0x51/0x90
 [<ffffffff813e31eb>] system_call_fastpath+0x16/0x1b
2 locks held by klauncher/5744:
 #0:  (&sb->s_type->i_mutex_key#11){+.+.+.}, at: [<ffffffff810e43d9>] generic_file_aio_write+0x59/0xf0
 #1:  (jbd_handle){+.+...}, at: [<ffffffffa00c33e0>] start_this_handle+0x370/0x470 [jbd]
INFO: task okular:4180 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
okular          D 000000010002a251     0  4180   5743 0x00000000
 ffff880041d13aa8 0000000000000046 ffff880000000000 ffffffff8108cd6d
 0000000000000282 ffff880041d13fd8 ffff880037a59f40 ffff880041d13fd8
 ffff880041d12000 ffff880041d12000 ffff880041d12000 ffff880041d13fd8
Call Trace:
 [<ffffffff8108cd6d>] ? trace_hardirqs_on_caller+0x14d/0x190
 [<ffffffffa00c32b4>] start_this_handle+0x244/0x470 [jbd]
 [<ffffffff8109a633>] ? is_module_address+0x33/0x60
 [<ffffffff81075280>] ? wake_up_bit+0x40/0x40
 [<ffffffffa00c370b>] journal_start+0xdb/0x120 [jbd]
 [<ffffffffa00eb436>] ext3_journal_start_sb+0x36/0x70 [ext3]
 [<ffffffffa00e2663>] ext3_setattr+0x1a3/0x210 [ext3]
 [<ffffffff81148556>] notify_change+0x116/0x360
 [<ffffffff8112b803>] do_truncate+0x63/0x90
 [<ffffffff81048419>] ? sub_preempt_count+0xa9/0xe0
 [<ffffffff8113abac>] do_last+0x42c/0x820
 [<ffffffff8113c3b0>] path_openat+0xd0/0x410
 [<ffffffff81102ad3>] ? might_fault+0x53/0xb0
 [<ffffffff8113c76f>] do_filp_open+0x7f/0xa0
 [<ffffffff81048419>] ? sub_preempt_count+0xa9/0xe0
 [<ffffffff813e2585>] ? _raw_spin_unlock+0x35/0x60
 [<ffffffff81149834>] ? alloc_fd+0xf4/0x150
 [<ffffffff8112c651>] do_sys_open+0x101/0x1e0
 [<ffffffff8112c750>] sys_open+0x20/0x30
 [<ffffffff813e31eb>] system_call_fastpath+0x16/0x1b
2 locks held by okular/4180:
 #0:  (&sb->s_type->i_mutex_key#11){+.+.+.}, at: [<ffffffff8112b7f7>] do_truncate+0x57/0x90
 #1:  (&sb->s_type->i_alloc_sem_key#4){+.+...}, at: [<ffffffff811486e0>] notify_change+0x2a0/0x360
Comment 7 Bart Van Assche 2011-04-14 15:57:40 UTC
Still occurs with 85f2e689a5c8fb6ed8fdbee00109e7f6e5fefcb6 (2.6.29-rc3+). Note: I'm still trying to find the offending commit via bisecting.
Comment 8 Bart Van Assche 2011-04-14 19:24:34 UTC
The result of the bisect process - not sure it's useful:

# git bisect skip
There are only 'skip'ped commits left to test.
The first bad commit could be any of:
62e0ff1ef2d8ea0814487f73a7de431396a1e914
1fcf0069f4715f6f811466db68a547a348b4d5a9
94e948e6e43cd34e0e2ca496d5e90e4ff0d884f9
53f358a81e10e798f44af896ffacaedd7ac0269b
e9c5db0b8dce1bcdc99ad26e718230810d6b5cff
b73a21fc66fee35b41db755abebfacba48b2fc76
We cannot bisect more!

git bisect start
# skip: [1fcf0069f4715f6f811466db68a547a348b4d5a9] Merge branch 'common/fbdev' of master.kernel.org:/pub/scm/linux/kernel/git/lethal/sh-2.6
git bisect skip 1fcf0069f4715f6f811466db68a547a348b4d5a9
# skip: [e9c5db0b8dce1bcdc99ad26e718230810d6b5cff] efifb: support AMD Radeon HD 6490
git bisect skip e9c5db0b8dce1bcdc99ad26e718230810d6b5cff
# skip: [21cd72e7cb424f1686855602ec0fdc6e5830f249] savagefb: Set up I2C based on chip family instead of card id
git bisect skip 21cd72e7cb424f1686855602ec0fdc6e5830f249
# skip: [47c87d930f3db4fc3a30505075e07f5597e2e953] fb: Reduce priority of resource conflict message
git bisect skip 47c87d930f3db4fc3a30505075e07f5597e2e953
# good: [899631c7916b231ba6509c90dbc33221e9194029] Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input
git bisect good 899631c7916b231ba6509c90dbc33221e9194029
# bad: [b73a21fc66fee35b41db755abebfacba48b2fc76] video: s3c-fb: fix checkpatch errors and warning
git bisect bad b73a21fc66fee35b41db755abebfacba48b2fc76
# skip: [6c5103890057b1bb781b26b7aae38d33e4c517d8] Merge branch 'for-2.6.39/core' of git://git.kernel.dk/linux-2.6-block
git bisect skip 6c5103890057b1bb781b26b7aae38d33e4c517d8
# good: [247f99386100d1d1c369ba98120d2edebf5426fc] fbdev: sh_mobile_lcdcfb: fix module lock acquisition
git bisect good 247f99386100d1d1c369ba98120d2edebf5426fc
# skip: [53f358a81e10e798f44af896ffacaedd7ac0269b] Merge branch 'viafb-next' of git://github.com/schandinat/linux-2.6
git bisect skip 53f358a81e10e798f44af896ffacaedd7ac0269b
# good: [3f086fe93f734ba76f2e130777687f81e0cbb318] viafb: initialize margins correct
git bisect good 3f086fe93f734ba76f2e130777687f81e0cbb318
# skip: [62e0ff1ef2d8ea0814487f73a7de431396a1e914] fbcon: Remove unused 'display *p' variable from fb_flashcursor()
git bisect skip 62e0ff1ef2d8ea0814487f73a7de431396a1e914
# skip: [94e948e6e43cd34e0e2ca496d5e90e4ff0d884f9] s3fb: fix Virge/GX2
git bisect skip 94e948e6e43cd34e0e2ca496d5e90e4ff0d884f9
Comment 9 Rafael J. Wysocki 2011-04-17 18:20:39 UTC
On Sunday, April 17, 2011, Linus Torvalds wrote:
> Is this machine running a RAID5 setup or something like that?
> 
> There is a known interaction with the new block layer plugging code
> and MD. The "hung task" report in that bugzilla looks very much like
> that issue. And you do have "root=/dev/md0", so clearly there's some
> md thing going on.
> 
> And bisecting might not work all that well for it, because I suspect
> it ends up being very much a matter of IO patterns how it triggers.
> 
> Neil supposedly has a patch for it, but I haven't seen it yet. Neil, Jens?
> 
>                                    Linus
Comment 10 Bart Van Assche 2011-05-01 09:35:17 UTC
Was fixed in 2.6.39-rc4.