Created attachment 28361 [details] full dmesg output Kernel configuration: 2.6.36-rc3 Athlon 64 X2 3GHz in 64bits mode Slackware64 13.1 maybe an ext4 problem At boot, with a kernel compiled with support for locks, kmemcheck and so, I got this from time to time (amoung several other problems reported from kmemcheck and sometimes a total freeze at boot): kjournald starting. Commit interval 5 seconds EXT3-fs (sda1): using internal journal EXT3-fs (sda1): mounted filesystem with ordered data mode ADDRCONF(NETDEV_UP): eth0: link is not ready uli526x: eth0 NIC Link is Up 100 Mbps Full duplex ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready ------------[ cut here ]------------ WARNING: at kernel/watchdog.c:233 watchdog_overflow_callback+0xe3/0x110() Hardware name: K8 Combo-Z Watchdog detected hard LOCKUP on cpu 0 Modules linked in: snd_intel8x0 snd_cmipci snd_opl3_lib snd_hwdep snd_mpu401_uart snd_ac97_codec snd_ra wmidi ac97_bus Pid: 1923, comm: update-mime-dat Not tainted 2.6.36-rc3 #6 Call Trace: <NMI> [<ffffffff81042cda>] warn_slowpath_common+0x7a/0xb0 [<ffffffff81042db1>] warn_slowpath_fmt+0x41/0x50 [<ffffffff81086d37>] ? watchdog_overflow_callback+0x7/0x110 [<ffffffff81086e13>] watchdog_overflow_callback+0xe3/0x110 [<ffffffff81096281>] ? __perf_event_overflow+0x141/0x210 [<ffffffff8109133a>] ? perf_event_update_userpage+0x10a/0x1b0 [<ffffffff81091230>] ? perf_event_update_userpage+0x0/0x1b0 [<ffffffff8101247d>] ? x86_perf_event_set_period+0xdd/0x160 [<ffffffff810964f4>] ? perf_event_overflow+0x14/0x20 [<ffffffff8101414f>] ? x86_pmu_handle_irq+0x10f/0x140 [<ffffffff81010d8b>] ? perf_event_nmi_handler+0x4b/0x60 [<ffffffff810660d0>] ? notifier_call_chain+0x50/0xa0 [<ffffffff810666a5>] ? __atomic_notifier_call_chain+0x95/0xd0 [<ffffffff81066610>] ? __atomic_notifier_call_chain+0x0/0xd0 [<ffffffff8108b255>] ? rcu_nmi_exit+0x15/0x60 [<ffffffff8100424b>] ? do_nmi+0x9b/0x2b0 [<ffffffff815981ca>] ? nmi+0x1a/0x2c [<ffffffff8118fc46>] ? jbd2_journal_get_write_access+0x36/0x50 [<ffffffff81597de0>] ? debug+0x0/0x40 <<EOE>> <#DB> [<ffffffff81086d42>] ? watchdog_overflow_callback+0x12/0x110 <<EOE>> ---[ end trace 1408e87e8e109015 ]--- BUG: sleeping function called from invalid context at fs/ext4/inode.c:5833 in_atomic(): 1, irqs_disabled(): 0, pid: 1923, name: update-mime-dat 2 locks held by update-mime-dat/1923: #0: (&sb->s_type->i_mutex_key#3){+.+.+.}, at: [<ffffffff810db511>] do_last+0x111/0x690 #1: (jbd2_handle){+.+...}, at: [<ffffffff811904f0>] start_this_handle+0x470/0x510 Pid: 1923, comm: update-mime-dat Tainted: G W 2.6.36-rc3 #6 Call Trace: [<ffffffff810721f3>] ? __debug_show_held_locks+0x13/0x30 [<ffffffff81036b85>] __might_sleep+0x105/0x130 [<ffffffff8115c137>] ext4_mark_inode_dirty+0x47/0x1d0 [<ffffffff8116643b>] ext4_add_nondir+0x6b/0x80 [<ffffffff81166bf6>] ext4_create+0xd6/0x130 [<ffffffff810da459>] vfs_create+0x89/0xc0 [<ffffffff810db9fa>] ? do_last+0x5fa/0x690 [<ffffffff810db935>] do_last+0x535/0x690 [<ffffffff810dd97d>] do_filp_open+0x21d/0x660 [<ffffffff81596c98>] ? trace_hardirqs_off_thunk+0x3a/0x3c [<ffffffff810393c5>] ? sub_preempt_count+0x95/0xd0 [<ffffffff815977d0>] ? _raw_spin_unlock+0x30/0x60 [<ffffffff810e9132>] ? alloc_fd+0x122/0x1d0 [<ffffffff810cd7e0>] do_sys_open+0x60/0x120 [<ffffffff81596c59>] ? trace_hardirqs_on_thunk+0x3a/0x3f [<ffffffff810cd8cb>] sys_open+0x1b/0x20 [<ffffffff810023ab>] system_call_fastpath+0x16/0x1b note: update-mime-dat[1923] exited with preempt_count 1 BUG: scheduling while atomic: update-mime-dat/1923/0x10000002 no locks held by update-mime-dat/1923. Modules linked in: snd_intel8x0 snd_cmipci snd_opl3_lib snd_hwdep snd_mpu401_uart snd_ac97_codec snd_rawmidi ac97_bus Pid: 1923, comm: update-mime-dat Tainted: G W 2.6.36-rc3 #6 Call Trace: [<ffffffff810721f3>] ? __debug_show_held_locks+0x13/0x30 [<ffffffff81039257>] __schedule_bug+0x77/0x80 [<ffffffff815943f9>] schedule+0x6e9/0xa30 [<ffffffff8103d153>] __cond_resched+0x13/0x30 [<ffffffff8159482b>] _cond_resched+0x2b/0x40 [<ffffffff810b1a59>] unmap_vmas+0x7c9/0x940 [<ffffffff810b6c2d>] exit_mmap+0xdd/0x1a0 [<ffffffff81040677>] mmput+0x57/0xe0 [<ffffffff81045250>] exit_mm+0x100/0x130 [<ffffffff810477ad>] do_exit+0x64d/0x720 [<ffffffff810478cf>] do_group_exit+0x4f/0xc0 [<ffffffff8105448a>] get_signal_to_deliver+0x31a/0x510 [<ffffffff81040090>] ? kick_process+0x50/0xa0 [<ffffffff8159778d>] ? _raw_spin_unlock_irqrestore+0x5d/0x70 [<ffffffff810017c0>] do_signal+0x70/0x790 [<ffffffff8159776d>] ? _raw_spin_unlock_irqrestore+0x3d/0x70 [<ffffffff8100dd71>] ? syscall_trace_leave+0xd1/0xe0 [<ffffffff81001f2f>] do_notify_resume+0x4f/0x60 [<ffffffff81596c59>] ? trace_hardirqs_on_thunk+0x3a/0x3f [<ffffffff8100264b>] int_signal+0x12/0x17 BUG: scheduling while atomic: update-mime-dat/1923/0x10000002 no locks held by update-mime-dat/1923. Modules linked in: snd_intel8x0 snd_cmipci snd_opl3_lib snd_hwdep snd_mpu401_uart snd_ac97_codec snd_rawmidi ac97_bus Pid: 1923, comm: update-mime-dat Tainted: G W 2.6.36-rc3 #6 Call Trace: ... etc full dmesg appended as attachement.
Update : still present in 2.6.36-rc4. However, the call stack is a little different now. As the computer crashes very soon, I've taken a photo, see attachment.
Created attachment 29872 [details] console output image screenshot of warning (can't get it as the computer crashes after that).
On Tuesday, September 14, 2010, Christian Casteyde wrote: > Yes, still present, as well as my other bug that says that the system freezes > short after the boot. > > Le dimanche 12 septembre 2010 20:14:29, vous avez écrit : > > This message has been generated automatically as a part of a summary report > > of recent regressions. > > > > The following bug entry is on the current list of known regressions > > from 2.6.35. Please verify if it still should be listed and let the > > tracking team know (either way). > > > > > > Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=17361 > > Subject : Watchdog detected hard LOCKUP in > jbd2_journal_get_write_access > > Submitter : Christian Casteyde <casteyde.christian@free.fr> > > Date : 2010-08-29 19:59 (15 days old)
Update : Still present in 2.6.36-rc5 With the same log as shown in comment #2 Some more info are given after that ("recursive fault fixed, but rebbot is necessary"). If I wait a little more, I get another pile of messages. More precisely, the problem occurs indeed after dhcp gets an IP address, and before (or when) sshd is started. Between the two, I guess iptables rules are defined in my network scripts.
Well, I managed to boot in single mode, and tried to execute commands one by one to see what triggers the problems. Indeed, it's really random, sometime it's while ldconfig is called by the init scripts, sometime it's when starting udev... The callstacks are quite different also, sometime it's kmemcheck, sometime preempt count, sometime NULL dereference... So I'm posting the dmesg output I saved each time I got an error. Now at least they are complete, so the root cause may be easier to find.
Created attachment 30982 [details] warning got while booting kmemcheck fatal error while in sys_readlink
Created attachment 30992 [details] warning got while starting udev now using smp_processor_id() in preemptible while in watchdog_overflow_callback
Created attachment 31002 [details] quite the same, plus another warning after
Created attachment 31012 [details] finally, the NULL dereference I'm not sure the second and next warnings are not due to the first one, and a bug in kmemcheck itself. I'm totally unable to tell what cause it, and I gave up bisecting because it crashes here and there, and I cannot say what crash is due to which bug, if ever there is only one bug.
Created attachment 31032 [details] .config file I append the .config file, just to tell what are the kernel debug options I use. I'm running in 64bit and build from vanilla source, with gcc 4.4.4. I don't think this is hardware related, since I get the same on my laptop. More some racy problems in udev/ldconfig/fc-cache I'm still thinking it's a vfs problem or kmemcheck itself bug.
*** Bug 17371 has been marked as a duplicate of this bug. ***
*** Bug 16616 has been marked as a duplicate of this bug. ***
On Monday, September 27, 2010, Christian Casteyde wrote: > Still present in rc5-git7. > > > Le dimanche 26 septembre 2010 22:04:13, vous avez écrit : > > This message has been generated automatically as a part of a summary report > > of recent regressions. > > > > The following bug entry is on the current list of known regressions > > from 2.6.35. Please verify if it still should be listed and let the > > tracking team know (either way). > > > > > > Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=17361 > > Subject : Watchdog detected hard LOCKUP in > jbd2_journal_get_write_access > > Submitter : Christian Casteyde <casteyde.christian@free.fr> > > Date : 2010-08-29 19:59 (29 days old)
People may not be paying attention to this due to the subject line. Except for the initial bug report, none of the other stack traces have anything to do with ext4/jbd2. And in the initial ext4 trace, we see the complaint that we're calling might_sleep() in ext4_mark_inode_dirty(), in a code path where we are manifestly not taking any spinlocks. And in fact we don't see any spinlocks being taken at the point where the complaint is mode in ext4_mark_inode_dirty(). Yet preempt_count > 1. It looks to me like some unrelated piece of code is bumping preempt_count, and not decrementing it. Maybe in some code which is called from an interrupt handler, in some device driver? That might explain why you're getting failures all over the kernel. It may be worth closing this report, and opening several new ones, one for each failure, and make it clear this is not an ext4-related problem, since the subject line and component assigned for this bug is highly misleading. Including your kernel config would also be useful when you do that.
I've explored some of the ideas you mentionned. First, I disabled kmemcheck, and indeed I've got no warning at all. So this is really a kmemcheck + something else (buggy driver or perf counters as mentionned by Vegard in his previous mail). I could check the reverse: kmemcheck but without perf counters, however I cannot unselect the perf option on my config (it is forced by something else I've not found). kmemcheck used to work (I don't know if perf counters were available in previous kernels however). Second, I have most of my drivers built in, however I use sound cards as modules in order to control the load order. I've booted after erasing the /lib/module driver, but with kmemcheck enabled, and it was far better. I got only one kmemecheck error: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready ERROR: kmemcheck: Fatal error Pid: 1914, comm: gtk-update-icon Not tainted 2.6.36-rc6 #13 K8 Combo-Z/K8 Combo-Z RIP: 0010:[<ffffffff81010b90>] [<ffffffff81010b90>] x86_perf_event_update+0x0/0x80 RSP: 0018:ffff880002607d50 EFLAGS: 00010086 RAX: 0000000000000000 RBX: 0000000000000000 RCX: 00000000ffffffff RDX: ffff88000260b5c0 RSI: 0000000000000400 RDI: ffff88003f808800 RBP: ffff880002607df8 R08: 0000000000000000 R09: 0000000000000004 R10: 0000000000000001 R11: 0000000000000038 R12: ffff88000260b7c8 R13: 0000000000000000 R14: ffff88000260b7c0 R15: ffff88003f808800 FS: 00007fdb70d5e700(0000) GS:ffff880002600000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffff88003f836420 CR3: 000000003da52000 CR4: 00000000000006f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff4ff0 DR7: 0000000000000400 [<ffffffff8102bc34>] kmemcheck_error_save_bug+0xb4/0xe0 [<ffffffff8102c658>] kmemcheck_access+0xe8/0x4a0 [<ffffffff8102ca82>] kmemcheck_fault+0x72/0x80 [<ffffffff81026f89>] do_page_fault+0x359/0x440 [<ffffffff8159711f>] page_fault+0x1f/0x30 [<ffffffff81011528>] perf_event_nmi_handler+0x78/0x140 [<ffffffff810665d0>] notifier_call_chain+0x50/0xa0 [<ffffffff81066b81>] __atomic_notifier_call_chain+0x71/0xd0 [<ffffffff81066bf1>] atomic_notifier_call_chain+0x11/0x20 [<ffffffff81066c2e>] notify_die+0x2e/0x30 [<ffffffff81004343>] do_nmi+0x193/0x2b0 [<ffffffff8159740a>] nmi+0x1a/0x2c [<ffffffff8102c5c7>] kmemcheck_access+0x57/0x4a0 [<ffffffff8102ca82>] kmemcheck_fault+0x72/0x80 [<ffffffff81026f89>] do_page_fault+0x359/0x440 [<ffffffff8159711f>] page_fault+0x1f/0x30 [<ffffffff810dab55>] do_lookup+0x55/0x170 [<ffffffff810dcb10>] link_path_walk+0x130/0xbd0 [<ffffffff810dd720>] path_walk+0x60/0xe0 [<ffffffff810dd7f3>] do_path_lookup+0x53/0x60 [<ffffffff810de433>] user_path_at+0x53/0xa0 [<ffffffff810d4c17>] vfs_fstatat+0x37/0x70 [<ffffffff810d4c86>] vfs_stat+0x16/0x20 [<ffffffff810d4e5f>] sys_newstat+0x1f/0x50 [<ffffffff810023ab>] system_call_fastpath+0x16/0x1b [<ffffffffffffffff>] 0xffffffffffffffff The good news is all other errors are gone. Here is the lsmod output without kmemcheck, just to tell the modules that could cause more grief: Module Size Used by snd_intel8x0 29098 1 snd_cmipci 31770 4 snd_opl3_lib 9332 1 snd_cmipci snd_hwdep 6314 1 snd_opl3_lib snd_mpu401_uart 6360 1 snd_cmipci snd_rawmidi 20342 1 snd_mpu401_uart snd_ac97_codec 117688 1 snd_intel8x0 ac97_bus 1290 1 snd_ac97_codec The .config file is already attached to this bug report (please note I've also some other unused drivers configured, for hardware I do not use for now).
I also append .config file for my laptop, since it also has the problem as my main computer. I've tried to git bisect the problem on this laptop: 0f477dd0851bdcee82923da66a7fc4a44cb1bc3d is the first bad commit However, I don't know if this is the same bug as described here, as I said I cannot be sure.
Created attachment 33052 [details] git bisect log result
Created attachment 33062 [details] .config file for the laptop Common hardware I'm aware is: Athlon 64 in 64bits, soundcard intel8x0
0f477dd0851bdcee82923da66a7fc4a44cb1bc3d is a merge commit, so it surely didn't introduce the problem by itself, although it's quite probable that one of the commits in this merge did that. Adding Peter to the CC list.
Hello My bisection was obviously false, sorry. It took me some time, but I've finally understood why my bisection failed, and what was happening. The problem is that with kmemcheck, the kernel boots so slowly (except if there is a warning at boot in swapper, as it occured in several revision pre rc1), that the hard and soft lockup detection mechanism was triggering (often in udev or hal or ldconfig or other slow commands). Moreover, it triggered sometimes at a place the kernel cannot cope with it, and crashes. Before bisecting, I took a fresh 2.6.35 .config file and reactivated the debugging options, but I fogot the lockup detector. So I couldn't see the errors anymore. Nevertheless, I checked other warnings that used to occur before -rc1, and that were fixed later. As I was looking for a range between 2.6.35 and rc1, I got false results on already fixed bugs. Finally, I do not have any problem anymore with 2.6.36-rc8. Except that the lockup timeout is either too low for my computers, or crashes the kernel sometimes whent it triggers. But that's not critical. I'm closing the bug.