Bug 17361

Summary: Random kmemcheck errors and kernel freeze on 2.6.36-rc*
Product: File System Reporter: Christian Casteyde (casteyde.christian)
Component: ext4Assignee: fs_ext4 (fs_ext4)
Status: CLOSED INVALID    
Severity: normal CC: akpm, hpa, maciej.rutecki, rjw, tytso
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.36-rc3 Subsystem:
Regression: Yes Bisected commit-id:
Bug Depends on:    
Bug Blocks: 16444    
Attachments: full dmesg output
console output image
warning got while booting
warning got while starting udev
quite the same, plus another warning after
finally, the NULL dereference
.config file
git bisect log result
.config file for the laptop

Description Christian Casteyde 2010-08-29 19:59:12 UTC
Created attachment 28361 [details]
full dmesg output

Kernel configuration: 2.6.36-rc3
Athlon 64 X2 3GHz in 64bits mode
Slackware64 13.1

maybe an ext4 problem

At boot, with a kernel compiled with support for locks, kmemcheck and so, I got this from time to time (amoung several other problems reported from kmemcheck and sometimes a total freeze at boot):


kjournald starting.  Commit interval 5 seconds
EXT3-fs (sda1): using internal journal
EXT3-fs (sda1): mounted filesystem with ordered data mode
ADDRCONF(NETDEV_UP): eth0: link is not ready
uli526x: eth0 NIC Link is Up 100 Mbps Full duplex
ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
------------[ cut here ]------------
WARNING: at kernel/watchdog.c:233 watchdog_overflow_callback+0xe3/0x110()
Hardware name: K8 Combo-Z
Watchdog detected hard LOCKUP on cpu 0
Modules linked in: snd_intel8x0 snd_cmipci snd_opl3_lib snd_hwdep snd_mpu401_uart snd_ac97_codec snd_ra
wmidi ac97_bus
Pid: 1923, comm: update-mime-dat Not tainted 2.6.36-rc3 #6
Call Trace:
 <NMI>  [<ffffffff81042cda>] warn_slowpath_common+0x7a/0xb0
 [<ffffffff81042db1>] warn_slowpath_fmt+0x41/0x50
 [<ffffffff81086d37>] ? watchdog_overflow_callback+0x7/0x110
 [<ffffffff81086e13>] watchdog_overflow_callback+0xe3/0x110
 [<ffffffff81096281>] ? __perf_event_overflow+0x141/0x210
 [<ffffffff8109133a>] ? perf_event_update_userpage+0x10a/0x1b0
 [<ffffffff81091230>] ? perf_event_update_userpage+0x0/0x1b0
 [<ffffffff8101247d>] ? x86_perf_event_set_period+0xdd/0x160
 [<ffffffff810964f4>] ? perf_event_overflow+0x14/0x20
 [<ffffffff8101414f>] ? x86_pmu_handle_irq+0x10f/0x140
 [<ffffffff81010d8b>] ? perf_event_nmi_handler+0x4b/0x60
 [<ffffffff810660d0>] ? notifier_call_chain+0x50/0xa0
 [<ffffffff810666a5>] ? __atomic_notifier_call_chain+0x95/0xd0
 [<ffffffff81066610>] ? __atomic_notifier_call_chain+0x0/0xd0
 [<ffffffff8108b255>] ? rcu_nmi_exit+0x15/0x60
 [<ffffffff8100424b>] ? do_nmi+0x9b/0x2b0
 [<ffffffff815981ca>] ? nmi+0x1a/0x2c
 [<ffffffff8118fc46>] ? jbd2_journal_get_write_access+0x36/0x50
 [<ffffffff81597de0>] ? debug+0x0/0x40
 <<EOE>>  <#DB>  [<ffffffff81086d42>] ? watchdog_overflow_callback+0x12/0x110
 <<EOE>> 
---[ end trace 1408e87e8e109015 ]---
BUG: sleeping function called from invalid context at fs/ext4/inode.c:5833
in_atomic(): 1, irqs_disabled(): 0, pid: 1923, name: update-mime-dat
2 locks held by update-mime-dat/1923:
 #0:  (&sb->s_type->i_mutex_key#3){+.+.+.}, at: [<ffffffff810db511>] do_last+0x111/0x690
 #1:  (jbd2_handle){+.+...}, at: [<ffffffff811904f0>] start_this_handle+0x470/0x510
Pid: 1923, comm: update-mime-dat Tainted: G        W   2.6.36-rc3 #6
Call Trace:
 [<ffffffff810721f3>] ? __debug_show_held_locks+0x13/0x30
 [<ffffffff81036b85>] __might_sleep+0x105/0x130
 [<ffffffff8115c137>] ext4_mark_inode_dirty+0x47/0x1d0
 [<ffffffff8116643b>] ext4_add_nondir+0x6b/0x80
 [<ffffffff81166bf6>] ext4_create+0xd6/0x130
 [<ffffffff810da459>] vfs_create+0x89/0xc0
 [<ffffffff810db9fa>] ? do_last+0x5fa/0x690
 [<ffffffff810db935>] do_last+0x535/0x690
 [<ffffffff810dd97d>] do_filp_open+0x21d/0x660
 [<ffffffff81596c98>] ? trace_hardirqs_off_thunk+0x3a/0x3c
 [<ffffffff810393c5>] ? sub_preempt_count+0x95/0xd0
 [<ffffffff815977d0>] ? _raw_spin_unlock+0x30/0x60
 [<ffffffff810e9132>] ? alloc_fd+0x122/0x1d0
 [<ffffffff810cd7e0>] do_sys_open+0x60/0x120
 [<ffffffff81596c59>] ? trace_hardirqs_on_thunk+0x3a/0x3f
 [<ffffffff810cd8cb>] sys_open+0x1b/0x20
 [<ffffffff810023ab>] system_call_fastpath+0x16/0x1b
note: update-mime-dat[1923] exited with preempt_count 1
BUG: scheduling while atomic: update-mime-dat/1923/0x10000002
no locks held by update-mime-dat/1923.
Modules linked in: snd_intel8x0 snd_cmipci snd_opl3_lib snd_hwdep snd_mpu401_uart snd_ac97_codec snd_rawmidi ac97_bus
Pid: 1923, comm: update-mime-dat Tainted: G        W   2.6.36-rc3 #6
Call Trace:
 [<ffffffff810721f3>] ? __debug_show_held_locks+0x13/0x30
 [<ffffffff81039257>] __schedule_bug+0x77/0x80
 [<ffffffff815943f9>] schedule+0x6e9/0xa30
 [<ffffffff8103d153>] __cond_resched+0x13/0x30
 [<ffffffff8159482b>] _cond_resched+0x2b/0x40
 [<ffffffff810b1a59>] unmap_vmas+0x7c9/0x940
 [<ffffffff810b6c2d>] exit_mmap+0xdd/0x1a0
 [<ffffffff81040677>] mmput+0x57/0xe0
 [<ffffffff81045250>] exit_mm+0x100/0x130
 [<ffffffff810477ad>] do_exit+0x64d/0x720
 [<ffffffff810478cf>] do_group_exit+0x4f/0xc0
 [<ffffffff8105448a>] get_signal_to_deliver+0x31a/0x510
 [<ffffffff81040090>] ? kick_process+0x50/0xa0
 [<ffffffff8159778d>] ? _raw_spin_unlock_irqrestore+0x5d/0x70
 [<ffffffff810017c0>] do_signal+0x70/0x790
 [<ffffffff8159776d>] ? _raw_spin_unlock_irqrestore+0x3d/0x70
 [<ffffffff8100dd71>] ? syscall_trace_leave+0xd1/0xe0
 [<ffffffff81001f2f>] do_notify_resume+0x4f/0x60
 [<ffffffff81596c59>] ? trace_hardirqs_on_thunk+0x3a/0x3f
 [<ffffffff8100264b>] int_signal+0x12/0x17
BUG: scheduling while atomic: update-mime-dat/1923/0x10000002
no locks held by update-mime-dat/1923.
Modules linked in: snd_intel8x0 snd_cmipci snd_opl3_lib snd_hwdep snd_mpu401_uart snd_ac97_codec snd_rawmidi ac97_bus
Pid: 1923, comm: update-mime-dat Tainted: G        W   2.6.36-rc3 #6
Call Trace:
...

etc
full dmesg appended as attachement.
Comment 1 Christian Casteyde 2010-09-14 05:48:04 UTC
Update :
still present in 2.6.36-rc4.
However, the call stack is a little different now. As the computer crashes very soon, I've taken a photo, see attachment.
Comment 2 Christian Casteyde 2010-09-14 05:50:55 UTC
Created attachment 29872 [details]
console output image

screenshot of warning (can't get it as the computer crashes after that).
Comment 3 Rafael J. Wysocki 2010-09-14 18:08:31 UTC
On Tuesday, September 14, 2010, Christian Casteyde wrote:
> Yes, still present, as well as my other bug that says that the system freezes 
> short after the boot.
> 
> Le dimanche 12 septembre 2010 20:14:29, vous avez écrit :
> > This message has been generated automatically as a part of a summary report
> > of recent regressions.
> > 
> > The following bug entry is on the current list of known regressions
> > from 2.6.35.  Please verify if it still should be listed and let the
> > tracking team know (either way).
> > 
> > 
> > Bug-Entry   : http://bugzilla.kernel.org/show_bug.cgi?id=17361
> > Subject             : Watchdog detected hard LOCKUP in 
> jbd2_journal_get_write_access
> > Submitter   : Christian Casteyde <casteyde.christian@free.fr>
> > Date                : 2010-08-29 19:59 (15 days old)
Comment 4 Christian Casteyde 2010-09-21 17:19:32 UTC
Update : Still present in 2.6.36-rc5
With the same log as shown in comment #2
Some more info are given after that ("recursive fault fixed, but rebbot is necessary").
If I wait a little more, I get another pile of messages.

More precisely, the problem occurs indeed after dhcp gets an IP address, and before (or when) sshd is started. Between the two, I guess iptables rules are defined in my network scripts.
Comment 5 Christian Casteyde 2010-09-22 17:47:58 UTC
Well, I managed to boot in single mode, and tried to execute commands one by one to see what triggers the problems.
Indeed, it's really random, sometime it's while ldconfig is called by the init scripts, sometime it's when starting udev...
The callstacks are quite different also, sometime it's kmemcheck, sometime preempt count, sometime NULL dereference...

So I'm posting the dmesg output I saved each time I got an error. Now at least they are complete, so the root cause may be easier to find.
Comment 6 Christian Casteyde 2010-09-22 17:51:41 UTC
Created attachment 30982 [details]
warning got while booting

kmemcheck fatal error while in sys_readlink
Comment 7 Christian Casteyde 2010-09-22 17:53:31 UTC
Created attachment 30992 [details]
warning got while starting udev

now using smp_processor_id() in preemptible while in watchdog_overflow_callback
Comment 8 Christian Casteyde 2010-09-22 17:55:59 UTC
Created attachment 31002 [details]
quite the same, plus another warning after
Comment 9 Christian Casteyde 2010-09-22 17:58:50 UTC
Created attachment 31012 [details]
finally, the NULL dereference

I'm not sure the second and next warnings are not due to the first one, and a bug in kmemcheck itself.
I'm totally unable to tell what cause it, and I gave up bisecting because it crashes here and there, and I cannot say what crash is due to which bug, if ever there is only one bug.
Comment 10 Christian Casteyde 2010-09-22 18:01:49 UTC
Created attachment 31032 [details]
.config file

I append the .config file, just to tell what are the kernel debug options I use.
I'm running in 64bit and build from vanilla source, with gcc 4.4.4.

I don't think this is hardware related, since I get the same on my laptop. More some racy problems in udev/ldconfig/fc-cache

I'm still thinking it's a vfs problem or kmemcheck itself bug.
Comment 11 Christian Casteyde 2010-09-22 18:11:39 UTC
*** Bug 17371 has been marked as a duplicate of this bug. ***
Comment 12 Christian Casteyde 2010-09-22 18:13:08 UTC
*** Bug 16616 has been marked as a duplicate of this bug. ***
Comment 13 Rafael J. Wysocki 2010-09-27 19:26:48 UTC
On Monday, September 27, 2010, Christian Casteyde wrote:
> Still present in rc5-git7.
> 
> 
> Le dimanche 26 septembre 2010 22:04:13, vous avez écrit :
> > This message has been generated automatically as a part of a summary report
> > of recent regressions.
> > 
> > The following bug entry is on the current list of known regressions
> > from 2.6.35.  Please verify if it still should be listed and let the
> > tracking team know (either way).
> > 
> > 
> > Bug-Entry   : http://bugzilla.kernel.org/show_bug.cgi?id=17361
> > Subject             : Watchdog detected hard LOCKUP in 
> jbd2_journal_get_write_access
> > Submitter   : Christian Casteyde <casteyde.christian@free.fr>
> > Date                : 2010-08-29 19:59 (29 days old)
Comment 14 Theodore Tso 2010-10-02 16:42:45 UTC
People may not be paying attention to this due to the subject line.

Except for the initial bug report, none of the other stack traces have anything to do with ext4/jbd2.    And in the initial ext4 trace, we see the complaint that we're calling might_sleep() in ext4_mark_inode_dirty(), in a code path where we are manifestly not taking any spinlocks.    And in fact we don't see any spinlocks being taken at the point where the complaint is mode in ext4_mark_inode_dirty().   Yet preempt_count > 1.

It looks to me like some unrelated piece of code is bumping preempt_count, and not decrementing it.  Maybe in some code which is called from an interrupt handler, in some device driver?   That might explain why you're getting failures all over the kernel.

It may be worth closing this report, and opening several new ones, one for each failure, and make it clear this is not an ext4-related problem, since the subject line and component assigned for this bug is highly misleading.

Including your kernel config would also be useful when you do that.
Comment 15 Christian Casteyde 2010-10-04 18:48:29 UTC
I've explored some of the ideas you mentionned.

First, I disabled kmemcheck, and indeed I've got no warning at all. So this is really a kmemcheck + something else (buggy driver or perf counters as mentionned by Vegard in his previous mail).

I could check the reverse: kmemcheck but without perf counters, however I cannot unselect the perf option on my config (it is forced by something else I've not found). kmemcheck used to work (I don't know if perf counters were available in previous kernels however).

Second, I have most of my drivers built in, however I use sound cards as modules in order to control the load order.

I've booted after erasing the /lib/module driver, but with kmemcheck enabled, and it was far better. I got only one kmemecheck error:

ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
ERROR: kmemcheck: Fatal error

Pid: 1914, comm: gtk-update-icon Not tainted 2.6.36-rc6 #13 K8 Combo-Z/K8 Combo-Z
RIP: 0010:[<ffffffff81010b90>]  [<ffffffff81010b90>] x86_perf_event_update+0x0/0x80
RSP: 0018:ffff880002607d50  EFLAGS: 00010086
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 00000000ffffffff
RDX: ffff88000260b5c0 RSI: 0000000000000400 RDI: ffff88003f808800
RBP: ffff880002607df8 R08: 0000000000000000 R09: 0000000000000004
R10: 0000000000000001 R11: 0000000000000038 R12: ffff88000260b7c8
R13: 0000000000000000 R14: ffff88000260b7c0 R15: ffff88003f808800
FS:  00007fdb70d5e700(0000) GS:ffff880002600000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffff88003f836420 CR3: 000000003da52000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff4ff0 DR7: 0000000000000400
 [<ffffffff8102bc34>] kmemcheck_error_save_bug+0xb4/0xe0
 [<ffffffff8102c658>] kmemcheck_access+0xe8/0x4a0
 [<ffffffff8102ca82>] kmemcheck_fault+0x72/0x80
 [<ffffffff81026f89>] do_page_fault+0x359/0x440
 [<ffffffff8159711f>] page_fault+0x1f/0x30
 [<ffffffff81011528>] perf_event_nmi_handler+0x78/0x140
 [<ffffffff810665d0>] notifier_call_chain+0x50/0xa0
 [<ffffffff81066b81>] __atomic_notifier_call_chain+0x71/0xd0
 [<ffffffff81066bf1>] atomic_notifier_call_chain+0x11/0x20
 [<ffffffff81066c2e>] notify_die+0x2e/0x30
 [<ffffffff81004343>] do_nmi+0x193/0x2b0
 [<ffffffff8159740a>] nmi+0x1a/0x2c
 [<ffffffff8102c5c7>] kmemcheck_access+0x57/0x4a0
 [<ffffffff8102ca82>] kmemcheck_fault+0x72/0x80
 [<ffffffff81026f89>] do_page_fault+0x359/0x440
 [<ffffffff8159711f>] page_fault+0x1f/0x30
 [<ffffffff810dab55>] do_lookup+0x55/0x170
 [<ffffffff810dcb10>] link_path_walk+0x130/0xbd0
 [<ffffffff810dd720>] path_walk+0x60/0xe0
 [<ffffffff810dd7f3>] do_path_lookup+0x53/0x60
 [<ffffffff810de433>] user_path_at+0x53/0xa0
 [<ffffffff810d4c17>] vfs_fstatat+0x37/0x70
 [<ffffffff810d4c86>] vfs_stat+0x16/0x20
 [<ffffffff810d4e5f>] sys_newstat+0x1f/0x50
 [<ffffffff810023ab>] system_call_fastpath+0x16/0x1b
 [<ffffffffffffffff>] 0xffffffffffffffff

The good news is all other errors are gone. Here is the lsmod output without kmemcheck, just to tell the modules that could cause more grief:
Module                  Size  Used by
snd_intel8x0           29098  1 
snd_cmipci             31770  4 
snd_opl3_lib            9332  1 snd_cmipci
snd_hwdep               6314  1 snd_opl3_lib
snd_mpu401_uart         6360  1 snd_cmipci
snd_rawmidi            20342  1 snd_mpu401_uart
snd_ac97_codec        117688  1 snd_intel8x0
ac97_bus                1290  1 snd_ac97_codec

The .config file is already attached to this bug report (please note I've also some other unused drivers configured, for hardware I do not use for now).
Comment 16 Christian Casteyde 2010-10-10 19:19:29 UTC
I also append .config file for my laptop, since it also has the problem as my main computer.
I've tried to git bisect the problem on this laptop:

0f477dd0851bdcee82923da66a7fc4a44cb1bc3d is the first bad commit

However, I don't know if this is the same bug as described here, as I said I cannot be sure.
Comment 17 Christian Casteyde 2010-10-10 19:20:45 UTC
Created attachment 33052 [details]
git bisect log result
Comment 18 Christian Casteyde 2010-10-10 19:22:28 UTC
Created attachment 33062 [details]
.config file for the laptop

Common hardware I'm aware is:
Athlon 64 in 64bits,
soundcard intel8x0
Comment 19 Rafael J. Wysocki 2010-10-10 19:38:36 UTC
0f477dd0851bdcee82923da66a7fc4a44cb1bc3d is a merge commit, so it surely didn't
introduce the problem by itself, although it's quite probable that one of the
commits in this merge did that.

Adding Peter to the CC list.
Comment 20 Christian Casteyde 2010-10-20 20:10:37 UTC
Hello
My bisection was obviously false, sorry.

It took me some time, but I've finally understood why my bisection failed, and what was happening.

The problem is that with kmemcheck, the kernel boots so slowly (except if there is a warning at boot in swapper, as it occured in several revision pre rc1), that the hard and soft lockup detection mechanism was triggering (often in udev or hal or ldconfig or other slow commands). Moreover, it triggered sometimes at a place the kernel cannot cope with it, and crashes.

Before bisecting, I took a fresh 2.6.35 .config file and reactivated the debugging options, but I fogot the lockup detector. So I couldn't see the errors anymore. Nevertheless, I checked other warnings that used to occur before -rc1, and that were fixed later. As I was looking for a range between 2.6.35 and rc1, I got false results on already fixed bugs.

Finally, I do not have any problem anymore with 2.6.36-rc8. Except that the lockup timeout is either too low for my computers, or crashes the kernel sometimes whent it triggers. But that's not critical.

I'm closing the bug.