Bug 9180 (kblockd)

Summary: Null pointer dereference in workqueue processing in kblockd process on switching I/O scheduler
Product: Process Management Reporter: Nick (gentuu)
Component: OtherAssignee: Jens Axboe (axboe)
Status: CLOSED PATCH_ALREADY_AVAILABLE    
Severity: normal CC: akpm, oleg, randy.dunlap, wbrana
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: vanilla 2.6.23.x up to 2.6.24-rc1-git13 Subsystem:
Regression: Yes Bisected commit-id:
Attachments: debug patch to detect corrupted work_structs
perhaps a fix for the first trace
(may be) fix IOPRIO_CLASS_IDLE accounting

Description Nick 2007-10-18 03:58:20 UTC
Most recent kernel where this bug did not occur: 2.6.22.x
Distribution: Gentoo
Hardware Environment: i386 and x86_64
Software Environment: GNU
Problem Description:

An oops occurs on switching I/O scheduler during active fork()ing
and disk usage.

Here is x86_64 ARCH dmesg part:

[  620.257633] ------------[ cut here ]------------
[  620.257641] kernel BUG at kernel/workqueue.c:258!
[  620.257643] invalid opcode: 0000 [1] SMP
[  620.257645] CPU 0
[  620.257647] Modules linked in: tcp_westwood ipt_REJECT xt_state iptable_filter ipt_REDIRECT ipt_owner xt_tcpudp xt_multiport iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack ip_tables x_tables sdhci mmc_core sr_mod cdrom
[  620.257662] Pid: 77, comm: kblockd/0 Not tainted 2.6.23 #19
[  620.257664] RIP: 0010:[<ffffffff8024a4f6>]  [<ffffffff8024a4f6>] run_workqueue+0x116/0x170
[  620.257672] RSP: 0018:ffff81003f4a3ea0  EFLAGS: 00010282
[  620.257673] RAX: ffff81002fe67880 RBX: 0000000000000000 RCX: ffff81002fe67880
[  620.257676] RDX: ffff81002fe67880 RSI: ffff81003f4a3ed0 RDI: ffff81002fe67878
[  620.257677] RBP: ffffffff8031bd10 R08: ffff81003f4a2000 R09: ffff81003e31cdc0
[  620.257679] R10: 0000000000000000 R11: 0000000000000000 R12: ffff81003f675b40
[  620.257681] R13: ffff81003f675b48 R14: 0000000000000000 R15: 0000000000000000
[  620.257684] FS:  0000000000000000(0000) GS:ffffffff806a4000(0000) knlGS:0000000000000000
[  620.257686] CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
[  620.257688] CR2: 00002b68aabb41de CR3: 000000002eb0a000 CR4: 00000000000006e0
[  620.257690] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  620.257692] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  620.257694] Process kblockd/0 (pid: 77, threadinfo ffff81003f4a2000, task ffff81003f6b8000)
[  620.257696] Stack:  ffffffff8024b040 ffff81003f675b58 ffff81003f675b40 ffffffff8024b040
[  620.257700]  ffff81003f675b48 ffffffff8024b0e3 0000000000000000 ffff81003f6b8000
[  620.257703]  ffffffff8024e900 ffff81003f4a3ee8 ffff81003f4a3ee8 00000000fffffffc
[  620.257706] Call Trace:
[  620.257709]  [<ffffffff8024b040>] worker_thread+0x0/0x110
[  620.257712]  [<ffffffff8024b040>] worker_thread+0x0/0x110
[  620.257714]  [<ffffffff8024b0e3>] worker_thread+0xa3/0x110
[  620.257718]  [<ffffffff8024e900>] autoremove_wake_function+0x0/0x30
[  620.257721]  [<ffffffff8024b040>] worker_thread+0x0/0x110
[  620.257723]  [<ffffffff8024b040>] worker_thread+0x0/0x110
[  620.257726]  [<ffffffff8024e53b>] kthread+0x4b/0x80
[  620.257729]  [<ffffffff8020cab8>] child_rip+0xa/0x12
[  620.257732]  [<ffffffff8024e4f0>] kthread+0x0/0x80
[  620.257734]  [<ffffffff8020caae>] child_rip+0x0/0x12
[  620.257736]
[  620.257737]
[  620.257737] Code: 0f 0b eb fe 66 0f 1f 44 00 00 65 48 8b 34 25 00 00 00 00 8b
[  620.257746] RIP  [<ffffffff8024a4f6>] run_workqueue+0x116/0x170
[  620.257749]  RSP <ffff81003f4a3ea0>

the i386 oops looks like this one (can provide later).

Please note the problem is always occurs in "kblockd/X" process (X = 0 ~~ SMP CPUs)


Steps to reproduce:
Need 2.6.23.1 with CFQ and deadline I/O schedulers loaded.

and these 3 simultaneous tasks trigger the problem:

1. something like this:
# while :; do
>        echo deadline > /sys/block/sda/queue/scheduler
>        echo cfq > /sys/block/sda/queue/scheduler
>done

2. a fork()/clone() torture (possible, you can apply any other way - should not matter).
I'm running a simple perl script:
------------CUT----------
#!/usr/bin/perl

$SIG{CHLD}=IGNORE;
while (1) {
        $r=fork;
        exit if (defined($r) && $r == 0);
}
------------/CUT----------
it's not a fork bomb. It's forking in 1 thread. Children are just exiting.

3. some high disk usage. Like this:
# while :; do tar xjf ~/linux-2.6.23.1.tar.bz2; rm -rf linux-2.6.23.1; done


It could take some time to occur, like 10-30 minutes.

Almost always system is fully hanging when this oops occurs.
Comment 1 Nick 2007-10-18 16:44:08 UTC
Looks like we have an additional failure place on this bug
at block/cfq-iosched.c:1147
It's on i386 arch.

Unfortunately, I can't recreate the problem at this place of code.
But pretty sure that it's the same problem.

Here is dmesg:

Oct 18 10:14:59 beta [107202.940998] ------------[ cut here ]------------
Oct 18 10:14:59 beta [107202.941193] kernel BUG at /usr/src/linux-2.6.23/block/cfq-iosched.c:1147!
Oct 18 10:14:59 beta [107202.941487] invalid opcode: 0000 [#1] SMP
Oct 18 10:14:59 beta [107202.941926] Modules linked in: ohci_hcd usbcore floppy sr_mod cdrom
Oct 18 10:14:59 beta [107202.942480] CPU:    0
Oct 18 10:14:59 beta [107202.942481] EIP:    0060:[<c02d7dfb>]    Not tainted VLI
Oct 18 10:14:59 beta [107202.942482] EFLAGS: 00010086   (2.6.23 #1)
Oct 18 10:14:59 beta [107202.943015] EIP is at cfq_put_queue+0x4b/0x80
Oct 18 10:14:59 beta [107202.943195] eax: eb10c210   ebx: eb10c210   ecx: eb10c1b8   edx: eb10c1b8
Oct 18 10:14:59 beta [107202.943406] esi: c85a5300   edi: c22b9000   ebp: eb057f18   esp: eb057ee0
Oct 18 10:14:59 beta [107202.943615] ds: 007b   es: 007b   fs: 00d8  gs: 0033  ss: 0068
Oct 18 10:14:59 beta [107202.943813] Process bash (pid: 19625, ti=eb056000 task=d4c80bc0 task.ti=eb056000)
Oct 18 10:14:59 beta [107202.944030] Stack: 00000001 c85a5300 c02d7fb9 dcff3cc0 dcff3d0c ca939360 c02cbbf3 c22b9000
Oct 18 10:14:59 beta [107202.944850]        dcff3cc0 c02cbfda 00000044 00000001 00000009 ca939900 64616564 656e696c
Oct 18 10:14:59 beta [107202.945668]        00000000 00000000 c22b90d0 c22b93a8 c22b9000 c0871b3c c02cf64c eafa9000
Oct 18 10:14:59 beta [107202.946487] Call Trace:
Oct 18 10:14:59 beta [107202.946774]  [<c02d7fb9>] cfq_exit_queue+0x89/0xb0
Oct 18 10:14:59 beta [107202.947018]  [<c02cbbf3>] elevator_exit+0x23/0x50
Oct 18 10:14:59 beta [107202.947263]  [<c02cbfda>] elv_iosched_store+0x16a/0x1a0
Oct 18 10:14:59 beta [107202.947513]  [<c02cf64c>] queue_attr_store+0x4c/0x80
Oct 18 10:14:59 beta [107202.947759]  [<c01a8bda>] sysfs_write_file+0xaa/0x120
Oct 18 10:14:59 beta [107202.948008]  [<c0168586>] vfs_write+0xa6/0x160
Oct 18 10:14:59 beta [107202.948250]  [<c01a8b30>] sysfs_write_file+0x0/0x120
Oct 18 10:14:59 beta [107202.948496]  [<c0168cc1>] sys_write+0x41/0x70
Oct 18 10:14:59 beta [107202.948736]  [<c0102b1e>] sysenter_past_esp+0x5f/0x85
Oct 18 10:14:59 beta [107202.948985]  =======================
Oct 18 10:14:59 beta [107202.949155] Code: 01 00 85 c0 75 29 8b 43 2c 8b 53 28 01 d0 85 c0 75 21 f6 43 50 01 75 1f 3b 5e 44 74 1e 89 da a1 10 f2 cf c0 5b 5e e9 d5 c6 e8 ff f> 0b eb fe 0f 0b eb fe 0f 0b eb fe 0f 0b eb fe 31 c9 89 da 89
Oct 18 10:14:59 beta [107202.953398] EIP: [<c02d7dfb>] cfq_put_queue+0x4b/0x80 SS:ESP 0068:eb057ee0
Comment 2 Nick 2007-10-18 16:47:48 UTC
Changed severity to high as the second bug occured without any unusual actions. Just a running system.
Comment 3 Oleg Nesterov 2007-10-19 05:33:00 UTC
Created attachment 13212 [details]
debug patch to detect corrupted work_structs

> [  620.257641] kernel BUG at kernel/workqueue.c:258!
> ...
> Steps to reproduce:

Could you try to reproduce with this patch? Hopefully it can detect
the bad work. Not sure this is really useful, but still.
Comment 4 Nick 2007-10-19 05:56:01 UTC
> Could you try to reproduce with this patch?
Unfortunately, no.
The second oops occured on a system without any abnormal activities.

It was running fine almost a week before the crash and is running fine now.
several other systems on the same vmlinuz binary are running ok.
So, I'm afraid it's just a _very_ unlikely failure to be reproduced.
Comment 5 Nick 2007-10-19 06:51:38 UTC
> Could you try to reproduce with this patch?

Sorry, just misread your post...

The patch is for the first recreatable problem - so, will try it and post here
my results.
Comment 6 Oleg Nesterov 2007-10-20 05:13:26 UTC
After a brief look, I suspect we have 2 bugs in block/ related to workqueus.

1. cfq_exit_queue() forgets to cancel cfq_data->unplug_work, I think this
  _might_ explain the first trace.

2. blk_sync_queue() doesn't cancel request_queue->unplug_work, this is not
   related.
Comment 7 Oleg Nesterov 2007-10-20 05:20:06 UTC
Created attachment 13215 [details]
perhaps a fix for the first trace

Nick, any chance you can check if this patch makes any difference?
Would be nice to know Jens's opinion, of course :)
Comment 8 Nick 2007-10-21 09:24:48 UTC
Good news!

Thanks to Oleg for the patch (id=13215). It really does fix the problem :)

My test system was running 2 hours with the torture tasks without any problems.


And I have 2 additional notices about the initial problem:

- It's SMP-only problem. I wasn't able to reproduce it on an UP i386 machine.

- The problem doesn't appear (or possible appears much less likely) once RAM is
big enough to store in cache whole FS activity without forced cache submitting.
I wasn't able to reproduce the problem on a 4Gb RAM system unpacking/removing the 2.6.23 package (is it too small for now?? ;)))
2Gb RAM system effectively shows the problem though.


So, I'd like to see the patch submitted in .23 branch :)
Comment 9 Nick 2007-11-04 01:00:05 UTC
Just found now a way to recreate the second dump problem.
Pretty easy.

There are several notices on this:
- First notice: the problem appears only if a disk-active task is running in IDLE I/O class.
- Second notice: it is _LIKELY_ the problem came with this patch:
"cfq: async queue allocation per priority"
its URL:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=c2dea2d1fdbce86942dba0a968c523d8b7858bb5
- Third notice: the problem is gone (just can't recreate it in the described way) once the patch above ^^^ is rolled back


So, to recreate you are running 2 tasks:
1. something like this (the sleep()s _ARE_ important!):
  # while :; do
             echo cfq > /sys/block/sda/queue/scheduler;
             sleep 1;
             echo deadline > /sys/block/sda/queue/scheduler;
             sleep 1;
     done


2. something IDLE-I/O-niced (!!important) disk-active
(the system cache should be small enough to force disk flushing)
   # ionice -c3 -p $$
   # while :; do cp -a linux-2.6.23 linux-2.6.23_1; rm -rf linux-2.6.23_1; done

and here, we need some magick ;)
Just start to interrupt the task (with ctrl+c) and run it again in 2-3 seconds.
The second posted above BUG appears after just several runnings/stops....



Trying to recreate the problem on the latest 2.6.24-rc1-git13. Will update later.
Comment 10 Nick 2007-11-04 01:26:18 UTC
Just easily recreated the problem in the latest 2.6.24-rc1-git13

[  364.789660] ------------[ cut here ]------------
[  364.789677] kernel BUG at block/cfq-iosched.c:1147!
[  364.789683] invalid opcode: 0000 [#1] PREEMPT
[  364.789693] Modules linked in: nfs iptable_raw ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack iptable_mangle xt_TCPMSS xt_tcpudp iptable_filter ip_tables x_tables
[  364.789756]
[  364.789762] Pid: 5900, comm: bash Not tainted (2.6.24-rc1-git13 #1)
[  364.789768] EIP: 0060:[<c021fd4f>] EFLAGS: 00010046 CPU: 0
[  364.789779] EIP is at cfq_put_queue+0x48/0x74
[  364.789784] EAX: 00000000 EBX: c5d73128 ECX: c18d93c0 EDX: c10bae60
[  364.789790] ESI: c2ceb380 EDI: c2706200 EBP: c5d21f1c ESP: c5d21ee8
[  364.789796]  DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 0068
[  364.789802] Process bash (pid: 5900, ti=c5d20000 task=c2c9f350 task.ti=c5d20000)
[  364.789810] Stack: c2ceb380 00000001 c02208b1 c28bd600 c28bd62c c0216dbf c28bd600 c2766ad0
[  364.789854]        c021714c 00000044 b7f1c000 00000009 c28bd9c0 64616564 656e696c 00000000
[  364.789893]        00000000 c2766bb0 c2766e50 c2766ad0 c042bcf0 c0219eb3 c5d31000 c042bc68
[  364.789926] Call Trace:
[  364.789937]  [<c02208b1>] cfq_exit_queue+0x79/0xab
[  364.789947]  [<c0216dbf>] elevator_exit+0x1b/0x32
[  364.789957]  [<c021714c>] elv_iosched_store+0x177/0x1c6
[  364.789968]  [<c0219eb3>] queue_attr_store+0x40/0x61
[  364.789978]  [<c019645d>] sysfs_write_file+0xa8/0xe9
[  364.789990]  [<c01963b5>] sysfs_write_file+0x0/0xe9
[  364.789999]  [<c015adc2>] vfs_write+0x87/0x140
[  364.790011]  [<c015b33d>] sys_write+0x41/0x6a
[  364.790022]  [<c01027ee>] sysenter_past_esp+0x5f/0x85
[  364.790033]  =======================
[  364.790037] Code: 06 5e 00 00 85 c0 75 27 8b 43 28 03 43 2c 85 c0 75 21 f6 43 50 01 75 1f 39 5e 44 74 1e 89 da a1 a8 91 48 c0 5b 5e e9 73 75 f3 ff <0f> 0b eb fe 0f 0b eb fe 0f 0b eb fe 0f 0b eb fe 31 c9 89 da 89
[  364.790357] EIP: [<c021fd4f>] cfq_put_queue+0x48/0x74 SS:ESP 0068:c5d21ee8
[  364.790377] note: bash[5900] exited with preempt_count 1
Comment 11 Oleg Nesterov 2007-11-04 10:34:02 UTC
Created attachment 13393 [details]
(may be) fix IOPRIO_CLASS_IDLE accounting

On 11/04, bugme-daemon@bugzilla.kernel.org wrote:
>
> ------- Comment #9 from gentuu@gmail.com  2007-11-04 01:00 -------
>
> - First notice: the problem appears only if a disk-active task is running in
> IDLE I/O class.

Great work, Nickolay.

Looks like cfq_put_async_queues() is completely wrong wrt IOPRIO_CLASS_IDLE. Could you try the patch?
Comment 12 Nick 2007-11-05 04:43:39 UTC
Yes, the patch does fix the bug!
Thank you Oleg.

I saw that Jens already accepted the patch.
So, closing this bug now :)

Congrats!
Comment 13 Adrian Bunk 2007-12-27 13:30:13 UTC
*** Bug 9633 has been marked as a duplicate of this bug. ***