Bug 33932 - Oops in cfq_unlink_blkio_group on cgroup rmdir
Summary: Oops in cfq_unlink_blkio_group on cgroup rmdir
Status: RESOLVED OBSOLETE
Alias: None
Product: IO/Storage
Classification: Unclassified
Component: Block Layer
Hardware: All
OS: Linux
Importance: P1 normal
Assignee: Jens Axboe
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2011-04-25 10:58 UTC by Cyril B.
Modified: 2012-08-20 15:55 UTC (History)
3 users

See Also:
Kernel Version: 2.6.38.2
Subsystem:
Regression: No
Bisected commit-id:


Attachments
Kernel log of the crash (9.21 KB, text/plain), 2011-07-25 14:35 UTC, Israel G. Lugo
/proc/lockdep (97.26 KB, text/plain), 2011-07-25 14:45 UTC, Israel G. Lugo
GDB session inside the bug (27.00 KB, text/plain), 2011-07-27 13:04 UTC, Israel G. Lugo
Contents of *cfqd and *(cfqd->queue) (18.23 KB, text/plain), 2011-08-03 00:58 UTC, Israel G. Lugo
debug cfq patch v1 (2.37 KB, patch), 2011-08-03 20:15 UTC, Vivek Goyal
Kernel messages, from boot to the crash (9.11 KB, text/plain), 2011-08-06 00:43 UTC, Israel G. Lugo
GDB inside the BUG (21.93 KB, text/plain), 2011-08-06 00:45 UTC, Israel G. Lugo
kernel log of the quick crash (4.35 KB, text/plain), 2011-08-06 01:13 UTC, Israel G. Lugo
kmemleak after the quick crash (4.58 KB, text/plain), 2011-08-06 01:13 UTC, Israel G. Lugo

Description Cyril B. 2011-04-25 10:58:40 UTC
I have a Python script that removes all empty cgroups every day, one at a time. It had run fine for 3 weeks, but yesterday I got this oops:

[1921483.650066] BUG: unable to handle kernel NULL pointer dereference at           (null)
[1921483.650215] IP: [<ffffffff8149a5a6>] _raw_spin_lock_irqsave+0x16/0x30
[1921483.650270] PGD 1c8b7067 PUD 659a9067 PMD 0 
[1921483.650321] Oops: 0002 [#1] SMP 
[1921483.650367] last sysfs file: /sys/module/drbd/parameters/cn_idx
[1921483.650417] CPU 5 
[1921483.650422] Modules linked in: nf_conntrack_ftp xt_owner xt_NFLOG nfnetlink_log nfnetlink nf_conntrack_ipv6 nf_defrag_ipv6 xt_recent nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack xt_multiport drbd lru_cache ehci_hcd usbcore netconsole e1000e [last unloaded: nf_conntrack_ftp]
[1921483.650693] 
[1921483.650728] Pid: 1945882, comm: python Not tainted 2.6.38.2 #1 Supermicro H8SCM/H8SCM
[1921483.650827] RIP: 0010:[<ffffffff8149a5a6>]  [<ffffffff8149a5a6>] _raw_spin_lock_irqsave+0x16/0x30
[1921483.650915] RSP: 0018:ffff8800a38bde10  EFLAGS: 00010046
[1921483.650962] RAX: 0000000000000246 RBX: ffff88010b420030 RCX: 0000000000000246
[1921483.651041] RDX: 0000000000000100 RSI: ffff88010b420030 RDI: 0000000000000000
[1921483.651120] RBP: ffff88010b420030 R08: 0000000000000009 R09: ffff880020ca4558
[1921483.651199] R10: ffff880020ca4438 R11: ffff880020ca44f8 R12: ffff88040de34c00
[1921483.651278] R13: ffff88040de34c00 R14: ffff8803beec0d80 R15: 0000000000d18530
[1921483.651357] FS:  00007f7fda5ff6e0(0000) GS:ffff8800dfd40000(0000) knlGS:00000000f731f6d0
[1921483.651439] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[1921483.651489] CR2: 0000000000000000 CR3: 000000007e32e000 CR4: 00000000000006e0
[1921483.651568] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[1921483.651647] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[1921483.651726] Process python (pid: 1945882, threadinfo ffff8800a38bc000, task ffff88001abe3340)
[1921483.651809] Stack:
[1921483.651845]  ffffffff81249a69 ffffffff81864ba0 ffff88010b420030 ffff8803beec0da4
[1921483.651934]  ffffffff81246aae ffffffff81863a00 ffff88022637e400 ffff8800020b6d00
[1921483.652022]  ffff88003b1c6b9c ffff88040fb31d1c ffffffff81088679 ffff88003b1c6b40
[1921483.652111] Call Trace:
[1921483.652152]  [<ffffffff81249a69>] ? cfq_unlink_blkio_group+0x19/0x90
[1921483.652205]  [<ffffffff81246aae>] ? blkiocg_destroy+0x9e/0x140
[1921483.652256]  [<ffffffff81088679>] ? cgroup_diput+0x69/0x100
[1921483.652306]  [<ffffffff810fb9c9>] ? dput+0x1d9/0x320
[1921483.652354]  [<ffffffff810f3f90>] ? do_rmdir+0x100/0x130
[1921483.652406]  [<ffffffff810c7d2a>] ? sys_munmap+0x5a/0x80
[1921483.652455]  [<ffffffff8100ae92>] ? system_call_fastpath+0x16/0x1b
[1921483.652505] Code: e0 74 06 f3 90 8a 07 eb f6 c3 66 66 2e 0f 1f 84 00 00 00 00 00 9c 58 66 66 90 66 90 48 89 c1 fa 66 66 90 66 66 90 ba 00 01 00 00 <f0> 66 0f c1 17 38 f2 74 06 f3 90 8a 17 eb f6 48 89 c8 c3 0f 1f 
[1921483.652813] RIP  [<ffffffff8149a5a6>] _raw_spin_lock_irqsave+0x16/0x30
[1921483.652866]  RSP <ffff8800a38bde10>
[1921483.652907] CR2: 0000000000000000
[1921483.653321] ---[ end trace d9216500967642d6 ]---
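
For context, the cleanup described above amounts to something like the following shell sketch (purely illustrative; the actual script is written in Python, and the /dev/cgroup mount point is an assumption):

    #!/bin/sh
    # Illustrative sketch of a periodic job that removes empty cgroups,
    # one at a time. Assumes the hierarchy is mounted at /dev/cgroup.
    CGROOT=/dev/cgroup

    # Walk depth-first so child cgroups are removed before their parents.
    find "$CGROOT" -mindepth 1 -depth -type d | while read -r cg; do
        # A cgroup is empty when its tasks file lists no PIDs; rmdir on a
        # non-empty cgroup just fails with EBUSY, so this is safe either way.
        if [ -z "$(cat "$cg/tasks" 2>/dev/null)" ]; then
            rmdir "$cg" 2>/dev/null
        fi
    done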
Comment 1 Israel G. Lugo 2011-07-11 14:19:48 UTC
Same here. I have a PAM script that creates a cgroup for each UID upon login, and then I set up a simple shell script as a release agent to do the rmdir when each cgroup becomes empty.

This was working fine for several months with 2.6.36, but has been breaking every 3 or 4 days since I upgraded to 2.6.38. It seems to be some kind of regression.

Jul  9 18:17:53 hostname kernel: BUG: unable to handle kernel NULL pointer dereference at            (nil)
Jul  9 18:17:53 hostname kernel: IP: [<ffffffff813ca8a4>] _raw_spin_lock_irqsave+0xc/0x1d
Jul  9 18:17:53 hostname kernel: PGD f7107066 
Jul  9 18:17:53 hostname kernel: Oops: 0002 [#1] SMP 
Jul  9 18:17:53 hostname kernel: last sysfs file: /sys/devices/virtual/vc/vcsa12/uevent
Jul  9 18:17:53 hostname kernel: CPU 1 
Jul  9 18:17:53 hostname kernel: Pid: 32005, comm: rmdir Not tainted 2.6.38-hardened-r6 #2 HP ProLiant DL145 G1/PCBA Serenade-3 V0100219-300
Jul  9 18:17:53 hostname kernel: RIP: 0010:[<ffffffff813ca8a4>]  [<ffffffff813ca8a4>] _raw_spin_lock_irqsave+0xc/0x1d
Jul  9 18:17:53 hostname kernel: RSP: 0018:ffff8800c1e8dd78  EFLAGS: 00010096
Jul  9 18:17:53 hostname kernel: RAX: 0000000000000296 RBX: ffff88007ca0ec00 RCX: 0000000000000002
Jul  9 18:17:53 hostname kernel: RDX: 0000000000000100 RSI: ffff88007c477030 RDI: 0000000000000000
Jul  9 18:17:53 hostname kernel: RBP: ffff8800c1e8dd78 R08: 0000000000000006 R09: 0000000000000000
Jul  9 18:17:53 hostname kernel: R10: 0000000000000000 R11: fffffffffffffffe R12: ffff88007c477030
Jul  9 18:17:53 hostname kernel: R13: ffffffff8162f350 R14: ffff88007c477030 R15: ffff88007ca0ec00
Jul  9 18:17:53 hostname kernel: FS:  0000032b0f344700(0000) GS:ffff8800fbc00000(0000) knlGS:0000000000000000
Jul  9 18:17:53 hostname kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Jul  9 18:17:53 hostname kernel: CR2: 0000000000000000 CR3: 00000000013dc000 CR4: 00000000000006f0
Jul  9 18:17:53 hostname kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jul  9 18:17:53 hostname kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Jul  9 18:17:53 hostname kernel: Process rmdir (pid: 32005, threadinfo ffff8800c1cbbc80, task ffff8800c1cbb8d0)
Jul  9 18:17:53 hostname kernel: Stack:
Jul  9 18:17:53 hostname kernel: ffff8800c1e8dda8 ffffffff811e8481 ffff8800c1e8dd98 ffff88007b206540
Jul  9 18:17:53 hostname kernel: ffff88007b206564 ffffffff8162f350 ffff8800c1e8dde8 ffffffff811e638e
Jul  9 18:17:53 hostname kernel: ffff8800c1e8dde8 ffff88007b0f4800 ffffffff8162e3a0 ffff880066c284c0
Jul  9 18:17:53 hostname kernel: Call Trace:
Jul  9 18:17:53 hostname kernel: [<ffffffff811e8481>] cfq_unlink_blkio_group+0x22/0x8f
Jul  9 18:17:53 hostname kernel: [<ffffffff811e638e>] blkiocg_destroy+0xa9/0x127
Jul  9 18:17:53 hostname kernel: [<ffffffff81068412>] cgroup_diput+0x5b/0xe5
Jul  9 18:17:53 hostname kernel: [<ffffffff810d23ec>] d_kill+0xdb/0xfd
Jul  9 18:17:53 hostname kernel: [<ffffffff810d3d49>] dput+0x13d/0x14d
Jul  9 18:17:53 hostname kernel: [<ffffffff810cc0ab>] do_rmdir+0x141/0x17d
Jul  9 18:17:53 hostname kernel: [<ffffffff810cc11e>] sys_rmdir+0x11/0x13
Jul  9 18:17:53 hostname kernel: [<ffffffff81002619>] system_call_fastpath+0x16/0x1b
Jul  9 18:17:53 hostname kernel: [<ffffffff810025b7>] ? system_call_after_swapgs+0x17/0x63
Jul  9 18:17:53 hostname kernel: Code: c0 0f b6 c0 c3 55 b8 00 01 00 00 48 89 e5 f0 66 0f c1 07 38 e0 74 06 f3 90 8a 07 eb f6 c9 c3 55 48 89 e5 9c 58 fa ba 00 01 00 00 <f0> 66 0f c1 17 38 f2 74 06 f3 90 8a 17 eb f6 c9 c3 55 48 89 e5 
Jul  9 18:17:53 hostname kernel: RIP  [<ffffffff813ca8a4>] _raw_spin_lock_irqsave+0xc/0x1d
Jul  9 18:17:53 hostname kernel: RSP <ffff8800c1e8dd78>
Jul  9 18:17:53 hostname kernel: CR2: 0000000000000000


# uname -a
Linux nexus 2.6.38-hardened-r6 #2 SMP Fri Jun 24 18:00:21 WEST 2011 x86_64 AMD Opteron(tm) Processor 248 AuthenticAMD GNU/Linux

This is a Gentoo system, running a kernel with Grsec and PAX patches.
Comment 2 Cyril B. 2011-07-11 14:25:10 UTC
Israel: the following patch may resolve it:

http://groups.google.com/group/linux.kernel/tree/browse_frm/thread/ea60bf48c75a3423/321a7a800cd15a3f?rnum=1&_done=%2Fgroup%2Flinux.kernel%2Fbrowse_frm%2Fthread%2Fea60bf48c75a3423%2Fcf65ecc931364ebf%3Fpli%3D1%26#doc_f45f2ef773229dce

I haven't had the chance to test it myself. I think it's included in Linux 3.0.
Comment 3 Israel G. Lugo 2011-07-12 11:14:59 UTC
Hi Cyril, thank you for the link and the quick response.

I tried to follow a few of the mentioned patches, and I am wondering whether you perhaps meant to link the patch right after the one you did: http://groups.google.com/group/linux.kernel/tree/browse_frm/thread/ea60bf48c75a3423/321a7a800cd15a3f?rnum=1&_done=%2Fgroup%2Flinux.kernel%2Fbrowse_frm%2Fthread%2Fea60bf48c75a3423%2Fcf65ecc931364ebf%3Fpli%3D1%26#doc_cf65ecc931364ebf (message #9 in the thread, "cfq-iosched: Fix a possible race with cfq cgroup removal code" by Vivek Goyal)

From what I understood, patch #8 apparently makes structural changes to the way the cgroup stats are accessed, making them lockless using RCU. Patch #9 seems to address a race condition in the removal of a cgroup; if I understood correctly, a race where the cgroup is in the process of being freed while other drivers try to access it (because it's still in the cgroup list).

I am not entirely sure whether to apply patch #8, #9 or both. I'm also not sure whether they can or should be applied on their own, or together with the other changes in the series (I did not follow the entire changelist). Since this is a production server and downtime is frowned upon, I would like to understand things a bit better first.

It would be very nice if a kernel dev could shed some light on the issue.
Comment 4 Israel G. Lugo 2011-07-12 11:19:26 UTC
Additionally, I would suggest tagging this bug as a regression. I did not encounter this problem with 2.6.36 for about 6 months of heavy use (automatically creating/removing cgroups per UID with each login/logout, many hundreds of times per day), but have been encountering it every 2 or 3 days after the upgrade to 2.6.38.
Comment 5 Cyril B. 2011-07-12 11:23:56 UTC
Indeed, my link was wrong; the patch I was actually referencing is #9. I haven't tried applying it to 2.6.39; I'll wait for 3.0 and hope it's fixed there.
Comment 6 Vivek Goyal 2011-07-12 13:21:24 UTC
Can you guys please apply the patch mentioned in the links above and see if it fixes the issue?

The following is the upstream commit:

commit 56edf7d75db5b14d628b46623c414ffbeed68d7f
Author: Vivek Goyal <vgoyal@redhat.com>
Date:   Thu May 19 15:38:22 2011 -0400

    cfq-iosched: Fix a possible race with cfq cgroup removal code


This takes care of a possible race with group removal while the request queue is exiting. So for you to hit this race, there must have been some kind of device removal or request queue destruction activity going on.

If this patch does not fix the issue, we shall have to dive deeper.
Comment 7 Israel G. Lugo 2011-07-12 15:00:08 UTC
(In reply to comment #6)
> Can you guys please apply patch mentioned in above links and see if it fixes
> the issue. 

Thank you for your quick reply, Vivek. I will apply the patch tomorrow and leave it running to see if the problem goes away.

> 
> This takes care of a possible race with group removal while request queue is
> exiting. So for you to hit this race, there must have been some kind of
> device
> removal, or request queue destruction activity must be going on.
> 

Interesting. In my case at least, there had been no physical changes made to the devices. In particular, no disks have been hotplugged or hotremoved, and no filesystems were being unmounted at the time of failure. The blade was just sitting in its cabinet doing its job as a general multi-user shell server.

I expect that there should only be normal I/O at the time, e.g. cron and syslog, regular users logging in and out to do development work and some (quite frequent) Nagios probes executing over SSH (causing further logins and logouts).

The cgroup rmdir is executed by a script which is registered as a release_agent. Could this trigger the race somehow? Could something be in the process of cleaning up the request queue because the cgroup just became empty, and at the same time my script gets called to do the rmdir?
Comment 8 Israel G. Lugo 2011-07-12 15:08:29 UTC
(In reply to comment #7)
> 
> I expect that there should only be normal I/O at the time, e.g. cron and
> syslog, regular users logging in and out to do development work and some
> (quite
> frequent) Nagios probes executing over SSH (causing further logins and
> logouts).
> 

The logouts are relevant because that's basically when my script gets called to remove the cgroup. I've got a PAM hook creating /dev/cgroup/$UID when someone logs in and placing the session leader in that cgroup; then when the cgroup becomes empty (i.e. the user logs out) my script gets called to rmdir the respective cgroup.
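
For reference, the cgroup (v1) release-agent wiring looks roughly like this; the agent path below is illustrative, not my actual one:

    # Register the release agent once on the mounted hierarchy.
    echo /usr/local/sbin/cgroup-release-agent > /dev/cgroup/release_agent
    # New cgroups inherit notify_on_release from their parent, so enable it
    # at the root. The kernel then runs the agent with the path of the
    # newly-empty cgroup (relative to the mount point, e.g. "/1000") as $1.
    echo 1 > /dev/cgroup/notify_on_release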
Comment 9 Vivek Goyal 2011-07-12 15:14:53 UTC
Automatic cgroup removal with the help of a script should be fine. Removal also triggers cleanup of the blkio_group objects hanging off the request queue, so a blkio_group object needs to be cleaned up either when the request queue (device) is going away or when the cgroup is going away.

If the device is not going away, then the request queue and its associated queue lock should still be there and we should not be seeing a NULL pointer dereference. There was only one race condition I could see, which the above patch should fix.

So if the above patch does not fix it, then we are probably facing some other issue which will need more investigation. Do give the patch a try.
Comment 10 Israel G. Lugo 2011-07-13 12:50:30 UTC
I will apply the patch today and let it run to see if things break again.

I will also try to reproduce this problem on a test virtual machine, using a similar kernel config and automated SSH logins, so that we can debug more easily if necessary.

The box currently experiencing the problem is a production server, and when it breaks no one can log in via SSH (since any process attempting to access the cgroup filesystem hangs in an uninterruptible wait). I can't even reboot it without physically doing a SysRq+E on the console (SIGTERM all processes). Naturally, I would like to minimize the amount of breakage.

I will send you my kernel config off-list, in case I'm unable to reproduce the problem in a VM, so that you can suggest useful debug options for me to enable, or other steps I can take to gather information. If I have a debug kernel already prepared, I can boot it the next time the machine breaks (assuming the patch doesn't fix things).
Comment 11 Israel G. Lugo 2011-07-15 14:08:39 UTC
I managed to reproduce the problem on the test VM. Unfortunately work has kept me from being able to try the cfq-iosched patch to see if it solves the problem, but at least now that I can reproduce the problem on a non-production box I will have more freedom to troubleshoot this.

I will be away during the weekend but will report on the patch next week. Thank you for your help.
Comment 12 Israel G. Lugo 2011-07-25 14:35:41 UTC
Created attachment 66542 [details]
Kernel log of the crash

I apologize for the delay in my response.

The cfq-iosched patch did not solve the problem; it took about 10 hours to trigger (longer than without the patch), but it still happened.

I tried to reproduce the problem on a simpler VM with a simpler test case, but haven't been able to do so. For now, I am using the real-world PAM scripts and a couple of automated SSH loops.

I am attaching the output of the kernel log containing the BUG and other warnings. This kernel had been patched with the cfq-iosched patch and was built with CGROUP_DEBUG=y, DEBUG_BLK_CGROUP=y, DEBUG_INFO=y, LOCKDEP=y, PROVE_LOCKING=y, and most other lock-related debugging options activated as well.

I will try further to reproduce the problem in simpler, more controlled conditions. Any debugging suggestions would be greatly appreciated.
Comment 13 Israel G. Lugo 2011-07-25 14:45:27 UTC
Created attachment 66552 [details]
/proc/lockdep

Attaching the contents of /proc/lockdep, in case it helps.

In the previously attached kern.log, note that the BUG manifests itself in a different way with the debug options enabled: the error message shown is "BUG: sleeping function called from invalid context at kernel/rwsem.c:21"

I'm seeing an "invalid opcode" exception, perhaps the result of following an invalid reference to a lock that had already been freed somewhere?
Comment 14 Israel G. Lugo 2011-07-27 13:04:25 UTC
Created attachment 66872 [details]
GDB session inside the bug

I've caught the bug with GDB. I am in the process of analyzing it now, but a little feedback from someone more familiar with the CFQ implementation would help.

Highlights from the GDB session so far:
* cfq_unlink_blkio_group() is calling spin_lock_irqsave with a NULL pointer (cfqd->queue->queue_lock == NULL)

* cfqd->queue itself seems to be garbage, since e.g. cfqd->queue->make_request_fn in fact has SPINLOCK_MAGIC (0xdead4ead). For reference:
(gdb) print *(raw_spinlock_t *)&((struct cfq_data *)key)->queue->make_request_fn
$84 = {
  raw_lock = {
    slock = 60652
  }, 
  magic = 3735899821,   /* SPINLOCK_MAGIC */
  owner_cpu = 4294967295, 
  owner = 0xffffffffffffffff, 
  dep_map = {
    key = 0xffffffff8214b4b1, 
    class_cache = {0x0, 0x0}, 
    name = 0xffffffff815114b5 "&(&blkg->stats_lock)->rlock"
  }
}

* cfqd itself seems to be garbage as well, since e.g. cfqd->active_queue again has SPINLOCK_MAGIC:

(gdb) print *(raw_spinlock_t *)&((struct cfq_data *)key)->active_queue
$87 = {
  raw_lock = {
    slock = 0
  }, 
  magic = 3735899821, 
  owner_cpu = 4294967295, 
  owner = 0xffffffffffffffff, 
  dep_map = {
    key = 0xffffffff818e51cc, 
    class_cache = {0x0, 0x0}, 
    name = 0xffffffff814f6b3d "&(&lock->wait_lock)->rlock"
  }
}

Either that or the stack frame has been corrupted by lower frames, but at least the observed value of cfqd->queue->queue_lock == 0 is consistent.

This suggests a race on "key", perhaps an access not protected by RCU lock somewhere?
Comment 15 Vivek Goyal 2011-08-01 18:31:13 UTC
(In reply to comment #14)
> Created an attachment (id=66872) [details]
> GDB session inside the bug
> 
> I've caught the bug with gdb. Am in the process of analyzing it now, but a
> little feedback from someone more familiarized with the CFQ implementation
> would help.
> 
> Highlights from the GDB session so far:
> * cfq_unlink_blkio_group() is calling spin_lock_irqsave with a NULL pointer
> (cfqd->queue->queue_lock == NULL)
> 
> * cfqd->queue itself seems to be garbage, since e.g.
> cfqd->queue->make_request_fn in fact has SPINLOCK_MAGIC (0xdead4ead). For
> reference:
> (gdb) print *(raw_spinlock_t *)&((struct cfq_data
> *)key)->queue->make_request_fn

I don't understand this. Why are you printing the contents of make_request_fn? I thought you wanted the contents of cfqd->queue, so you should do something like "print *(cfqd->queue)".

> $84 = {
>   raw_lock = {
>     slock = 60652
>   }, 
>   magic = 3735899821,   /* SPINLOCK_MAGIC */
>   owner_cpu = 4294967295, 
>   owner = 0xffffffffffffffff, 
>   dep_map = {
>     key = 0xffffffff8214b4b1, 
>     class_cache = {0x0, 0x0}, 
>     name = 0xffffffff815114b5 "&(&blkg->stats_lock)->rlock"
>   }
> }
> 
> * cfqd itself seems to be garbage as well, since e.g. cfq->active_queue again
> has SPINLOCK_MAGIC:

cfqd->active_queue is a pointer to the cfq_queue which could possibly be active. Why are you typecasting active_queue to spinlock and printing it?

> 
> (gdb) print *(raw_spinlock_t *)&((struct cfq_data *)key)->active_queue
> $87 = {
>   raw_lock = {
>     slock = 0
>   }, 
>   magic = 3735899821, 
>   owner_cpu = 4294967295, 
>   owner = 0xffffffffffffffff, 
>   dep_map = {
>     key = 0xffffffff818e51cc, 
>     class_cache = {0x0, 0x0}, 
>     name = 0xffffffff814f6b3d "&(&lock->wait_lock)->rlock"
>   }
> }
> 
> Either that or the stack frame has been corrupted by lower frames, but at
> least
> the observed value of cfqd->queue->queue_lock == 0 is consistent.
> 
> This suggests a race on "key", perhaps an access not protected by RCU lock
> somewhere?

It definitely sounds like a race. Looking at the code I have found one bug, but that bug should not impact this issue; it just enforces additional RCU grace periods when it shouldn't, and I am surprised nobody has complained about slow boot times yet.

- So you are also mounting the blkio controller somewhere using your PAM script?
- What kernel version are you using? Can you try the latest upstream kernel? There were a few RCU fixes that went into 3.0.
- What's the device driver in your virtual machine? It could also be a driver issue, where the driver cleaned up its device and lock before calling blk_cleanup_queue().
Comment 16 Israel G. Lugo 2011-08-01 22:37:42 UTC
(In reply to comment #15)
> 
> I don't understand this. Why are you printing the contents of
> make_request_function? I thought you wanted the contents of cfqd->queue.
> so you should do something like "print *(cfqd->queue)"

If you look at the GDB output which I've attached, it should be clearer. I am printing the contents of make_request_fn as an example, to show that it points to something which is very likely *not* a function. That is, cfqd itself seems to be corrupt. A function should not start with 0xdead4ead -- that doesn't even make sense; on x86 it would be a fisubr instruction (floating-point subtract) with a nonsensical address.


> 
> cfqd->active_queue is a pointer to the cfq_queue which could possibly be
> active. Why are you typecasting active_queue to spinlock and printing it?
> 

To show that it, too, has SPINLOCK_MAGIC. I should have been clearer before; allow me to explain.

In the GDB session file attached above, I did a print of cfqd and noticed what seemed like a lot of junk values. For example, cfqd->active_queue == 0xdead4ead00000000. This is obviously a very strange address. It also happens to contain the value of SPINLOCK_MAGIC, which include/linux/spinlock_types.h places inside raw_spinlock_t when CONFIG_DEBUG_SPINLOCK is enabled (which it is in my case).

Just like make_request_fn above, active_queue seems to be the start of a spinlock (it doesn't point to a spinlock, it *is* a spinlock). The value of cfq_unlink_blkio_group's "key" seems to be wrong, since the contents of cfqd don't look anything like a valid struct cfq_data.

I cast &cfqd->active_queue to a pointer-to-spinlock to take a look at the actual spinlock which seems to be there. Looking at its .magic and .name members, it does seem to be a real spinlock (SPINLOCK_MAGIC == 0xdead4ead == 3735899821 and name == "&(&lock->wait_lock)->rlock").


> 
> - So you are also mounting blkio controller somewhere using your pam script?

I am mounting the blkio controller only once at boot, using the following inside an init script:

mount -t cgroup -o cpu,blkio cgroup /dev/cgroup/cpuio

I never unmount or remount it.


> - What's the kernel version you are using. Can you use the latest upstream
> kernel. there were few rcu fixes that went into 3.0

I am using a 2.6.38 kernel, which my other work servers use. I haven't been able to reproduce the problem reliably at home using a simplified VM: that's the only reason why I haven't tried a more recent kernel yet. Things have been somewhat hectic at work but I will try again as soon as I have the time and let you know.


> - What's the device driver in your virtual machine. Looks like it could be a
> driver issue also which cleaned up its device and lock before calling
> blk_cleanup_queue().

This test virtual machine is using the SATA AHCI driver. However, the problem which I originally reported was happening on a physical server using the AMD_IDE driver (legacy ATA support). So the issue occurs with at least those two drivers.

Again, I will try to reproduce this in a simpler manner. Right now I am using a test VM at work with a bunch of userland software, and I have to do around 3,000 automated SSH logins to trigger the bug through the PAM script. At home I haven't been able to reproduce this yet; it might be because my home PC is too fast, so the VM's CPU never reaches 100% and the race is harder to trigger. I'll try running the VM on a slower host.

The virtual machine on which I ran the GDB session above is still frozen inside the GDB session (inside the bug). I can print more information or set breakpoints etc, on request.
Comment 17 Israel G. Lugo 2011-08-01 22:48:51 UTC
Just to give some info on the PAM script that creates the cgroup on login: it parses some configuration files, logs some information to syslog, and then does the equivalent of:

mkdir /dev/cgroup/cpuio/$UID
echo $users_cpu_share > /dev/cgroup/cpuio/$UID/cpu.shares
echo $pid_of_session_leader > /dev/cgroup/cpuio/$UID/tasks

The automated cleanup (release agent) again parses a configuration file, logs some stuff to syslog, then does an rmdir of the cgroup.

There are no mounts or umounts in any of the scripts.
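
Stripped down, the release agent amounts to something like this (the real script also parses a config file and logs to syslog, omitted here; the exact guard against removing the hierarchy root is illustrative):

    #!/bin/sh
    # The kernel invokes the release agent with the path of the now-empty
    # cgroup, relative to the mount point, as $1 (e.g. "/1000").
    CGROOT=/dev/cgroup/cpuio

    # Never touch the root of the hierarchy itself.
    [ -n "$1" ] && [ "$1" != "/" ] && rmdir "$CGROOT$1"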
Comment 18 Vivek Goyal 2011-08-02 14:22:31 UTC
(In reply to comment #16)
> The virtual machine on which I ran the GDB session above is still frozen
> inside
> the GDB session (inside the bug). I can print more information or set
> breakpoints etc, on request.


Can you please print the values of *cfqd and *(cfqd->queue)? You seem to be suggesting that cfqd itself has been freed and we are accessing a freed object. Looking at the code, it beats me how that can happen.

Do you have some kind of slab debug enabled?
Comment 19 Vivek Goyal 2011-08-02 17:05:11 UTC
Looking at the latest CFQ code, cfqd is freed only in cfq_exit_queue(), and before freeing cfqd we wait for an RCU grace period if there are some groups which have not been cleaned up yet.

So the only way to invalidate the cfqd object is to call cfq_exit_queue(), and that happens either when the device is going away or when the IO scheduler is being changed. That would clean up any blkio_group objects. If any objects are left because of a race with the cgroup removal path, then cfq_exit_queue() waits for one RCU grace period. So key (cfqd) should be valid.

Similarly, if cfqd has not been freed then the cfqd->queue object should also be valid. The only remaining question is about cfqd->queue->queue_lock, as that can be provided by the driver and freed earlier.

So I think it is first important to establish what exactly is corrupted. Per your investigation it looks like cfqd is corrupted, but looking at the code that does not make any sense.
Comment 20 Israel G. Lugo 2011-08-03 00:58:50 UTC
Created attachment 67412 [details]
Contents of *cfqd and *(cfqd->queue)
Comment 21 Israel G. Lugo 2011-08-03 01:07:27 UTC
(In reply to comment #18)
> 
> Can you please print the values of *cfqd and *(cfqd->queue). You seem to be
> suggesting that cfqd itself has been freed and we are accessing a freed
> object.
> Looking at the code, it beats me how that can happen.

Sure thing. I have created an attachment containing the values of *cfqd and *(cfqd->queue), in GDB's pretty-print format. Please let me know of anything else that might be useful.


> 
> Do you have some kind of slab debug enabled?

Indeed I do. I have CONFIG_SLUB_DEBUG enabled, DEBUG_OBJECTS, DEBUG_OBJECTS_FREE, and several other memory-related debug options as well. I had some kmemleak output for example, but right now the kernel is stopped inside GDB so I can't access the file.

I can let the kernel resume normal operation and copy any useful debug files for you, if you'd like. It might take a day to re-trigger the bug if we want to trap it inside GDB again, though.

I'm not very familiar with kernel debugging, and I have never done any slub debugging for example. But I am a C programmer and well accustomed to GDB. I'll be glad to help in any way I can.
Comment 22 Vivek Goyal 2011-08-03 18:19:37 UTC
(In reply to comment #21)
> > Can you please print the values of *cfqd and *(cfqd->queue). You seem to be
> > suggesting that cfqd itself has been freed and we are accessing a freed
> object.
> > Looking at the code, it beats me how that can happen.
> 
> Sure thing. I have created an attachment containing the values of *cfqd and
> *(cfqd->queue), in GDB's pretty-print format. Please let me know of anything
> else that might be useful.

Thanks. Looks like cfqd itself got freed. 

  grp_service_tree = {
    rb = {
      rb_node = 0x0
    },
    left = 0x0,
    count = 0,
    total_weight = 0,
    min_vdisktime = 0
  },
  root_group = {
    rb_node = {
      rb_parent_color = 0,
      rb_right = 0xffff88000d12dc00,
      rb_left = 0xffff88000e5ad838
    },

Here grp_service_tree is empty, meaning there are no groups on it. But the root group has non-NULL rb_right and rb_left pointers, which shouldn't be the case.

Also, the following are zero rather than the default values.

  cfq_slice_idle = 0,
  cfq_group_idle = 0,

I am assuming that you have not changed these values.

So if cfqd is freed and the key pointer is no longer valid, the only way to explain it is that somehow we did not call synchronize_rcu() in cfq_exit_queue(). I will have another close look at the code and see if I spot something. Otherwise we can put some printks in the exit-queue path and see what's happening.

Are you changing any IO schedulers or removing any block devices, etc.?
Comment 23 Vivek Goyal 2011-08-03 18:44:10 UTC
(In reply to comment #21)
> > 
> > Do you have some kind of slab debug enabled?
> 
> Indeed I do. I have CONFIG_SLUB_DEBUG enabled, DEBUG_OBJECTS,
> DEBUG_OBJECTS_FREE, and several other memory-related debugs as well. I had
> some
> output on /proc/kmemleak for example, but right now the kernel is stopped
> inside gdb so I can't access the file.
> 
> I can let the kernel resume normal operation and copy any useful debug files
> for you, if you'd like. It might take a day to re-trigger the bug if we want
> to
> trap it inside GDB again, though.
> 

Is it possible to capture a dump and open it with "crash"? It might be useful to look at the same data structures later.
Comment 24 Vivek Goyal 2011-08-03 20:15:11 UTC
Created attachment 67442 [details]
debug cfq patch v1

Debug CFQ patch for the unlink oops
Comment 25 Vivek Goyal 2011-08-03 20:16:49 UTC
Israel, can you please apply the attached debug patch, run your tests again, and try to reproduce the situation? Make sure the printk() output is appearing on your serial console, and once the crash happens, send me the last 100-200 lines of debug output.

In fact, even before the crash I would be interested in seeing 100 lines of debug output, just to see what's happening in your workload.
Comment 26 Israel G. Lugo 2011-08-06 00:37:25 UTC
I applied the patch and managed to reproduce the problem; it took around 24 hours to trigger.

Unfortunately, since I had GDB attached, the kernel halted at the BUG() call and I could not capture the final debug printk. I did not have the kernel console redirected to a serial port... I do have the syslog output from /proc/kmsg, containing the messages leading up to the crash.

I will post the kernel messages as attachments, plus a gdb print of *cfqd and *(cfqd->queue), as well as the *blkiop which was being processed inside blkiocg_destroy at the time.
Comment 27 Israel G. Lugo 2011-08-06 00:43:12 UTC
Created attachment 67692 [details]
Kernel messages, from boot to the crash

This doesn't include the final printk before the BUG. However, I got the same information from GDB. The crash happened inside this debug check, inside cfq_unlink_blkio_group:

    if (!cfqd->nr_blkcg_linked_grps) {
            printk("Either key=%p is bogus or key already got freed\n",
                    key);
            BUG();
    }

The value of "key" was 0xffff88000e5b1800, as can be seen in the attached file.
Comment 28 Israel G. Lugo 2011-08-06 00:45:25 UTC
Created attachment 67702 [details]
GDB inside the BUG

This contains:
- value of cfq_unlink_blkio_group's "key"
- contents of *cfqd
- contents of *(cfqd->queue)
- contents of blkiocg_destroy's *blkiop
Comment 29 Israel G. Lugo 2011-08-06 00:58:32 UTC
I was running two parallel loops on a client, doing SSH logins and logouts concurrently. I did verify that the cgroups were being removed, with an "ls" of the root cgroup.

Please note that although only forty-two "Will delete group /test" messages appear in the kernel logs, many rmdirs (around 6,000) were performed. The rmdirs were done by the release agent script, which gets called when a cgroup becomes empty.
Comment 30 Israel G. Lugo 2011-08-06 01:09:48 UTC
I just got lucky! I rebooted the VM, did an SSH login (or two) as root, and got a crash. GDB wasn't running, so I got the printks. I'll attach the output here.

Incidentally, I have saved both the first GDB session from last week and the one mentioned above, as virtual machine snapshots. I can go back to them and print further data structures whenever required.
Comment 31 Israel G. Lugo 2011-08-06 01:13:07 UTC
Created attachment 67712 [details]
kernel log of the quick crash

Note that the freed "key" value is, again, key=ffff88000e5b1800. Curious.

I believe the most relevant lines are these:
Aug  6 02:26:47 kerneltest kernel: debug cfq: Added group /root dev=8:0 plid=1
Aug  6 02:26:47 kerneltest kernel: debug cfq: Added group /root dev=8:0 plid=0
Aug  6 02:28:56 kerneltest kernel: debug cfq: Will delete group /root. blkg_dev=8:0, key=ffff88000e59b000, plid=0
Aug  6 02:28:56 kerneltest kernel: debug cfq: Will delete group /root. blkg_dev=8:0, key=ffff88000e5b1800, plid=1
Aug  6 02:28:56 kerneltest kernel: Either key=ffff88000e5b1800 is bogus or key already got freed

I will post the contents of /sys/kernel/debug/kmemleak in case that helps.
Comment 32 Israel G. Lugo 2011-08-06 01:13:46 UTC
Created attachment 67722 [details]
kmemleak after the quick crash
Comment 33 Vivek Goyal 2011-08-08 19:31:27 UTC
(In reply to comment #31)
> Created an attachment (id=67712) [details]
> kernel log of the quick crash
> 
> Note that the freed "key" value is, again, key=ffff88000e5b1800. Curious.
> 
> I believe the most relevant lines are these:
> Aug  6 02:26:47 kerneltest kernel: debug cfq: Added group /root dev=8:0
> plid=1
> Aug  6 02:26:47 kerneltest kernel: debug cfq: Added group /root dev=8:0
> plid=0
> Aug  6 02:28:56 kerneltest kernel: debug cfq: Will delete group /root.
> blkg_dev=8:0, key=ffff88000e59b000, plid=0
> Aug  6 02:28:56 kerneltest kernel: debug cfq: Will delete group /root.
> blkg_dev=8:0, key=ffff88000e5b1800, plid=1
> Aug  6 02:28:56 kerneltest kernel: Either key=ffff88000e5b1800 is bogus or
> key
> already got freed
> 
> I will post the contents of /sys/kernel/debug/kmemleak in case that helps.

There is something strange. The blkio throttle policy registers a struct throtl_data pointer as its "key", and that is policy id 1. So the key for that policy should be ffff88000e5b1800.

But CFQ is getting a callback with the above pointer, which is supposed to point to throtl_data, not cfq_data. That is just plain wrong.

I already have code in place to avoid this situation:

                list_for_each_entry(blkiop, &blkio_list, list) {
                        if (blkiop->plid != blkg->plid)
                                continue;
                        blkiop->ops.blkio_unlink_group_fn(key, blkg);
                }

Here, if the blkg and blkiop policy ids don't match, we don't go ahead with the unlink callback. That means a blkg belonging to the throttling policy should never invoke the unlink callback in CFQ.

I think I will enhance my debug patch to produce more output and ask you to re-run it so we can get more information.
Comment 34 Vivek Goyal 2011-08-08 20:46:06 UTC
Created attachment 68142 [details]
unlink crash debug patch v2

A patch to debug cgroup removal crash
Comment 35 Vivek Goyal 2011-08-08 20:49:27 UTC
Israel, can you please test this V2 of the debug patch? It is the same as the first patch, except more verbose.

So it does look like we are not dealing with a device removal race but with something else.

Also, try disabling the throttling functionality on your system and see if you can still reproduce it:

CONFIG_BLK_DEV_THROTTLING=n
Comment 36 Israel G. Lugo 2011-09-01 11:50:36 UTC
Just wanted to let you know that I haven't forgotten about this. I was away on vacation for a few weeks and have to tend to several things at work now, but I will pick this up again soon.

Thank you again for all your help.
