Most recent kernel where this bug did not occur: not tested Distribution: GNU Hardware Environment: i386 Software Environment: GNU Problem Description: "quotaon" process hangs for several minutes on a loaded system to complete. Sometimes it takes 10 minutes... The process hangs in near-to-deadlock function fs/dquot.c:add_dquot_ref() for so long. The problem is in the function's weak algorithm. Steps to reproduce: Quota modification with edquota can bring quotaon process in the first top position for 2-3 seconds even on a low loaded system. Not fully sure how to reproduce problem to hang quotaon for several minutes (just big disk load?) The problem algorithm (at fs/dquot.c:add_dquot_ref()) theoretically may loop infinitely. It is a deadlock possible!! It goes through super block's inodes list and inits some inodes. But after the init call - it starts the loop over!! Is that necessary? In the worst _static_ (none else updates the FS) case (when all the inodes require init) the loop can be run (N^2)/2 times !! But just anything updates every time an inode in the start of the list - the loop will always start over and over... It looks like the logic is broken... It's much better to create a loop against running whole list and continue until a loop updates 0 inodes. In this case the worst static situation takes just 2 loops: first one updates all the inodes, and the next one - just see that it's nothing to do. Obviously, any FS updates during such updated logic can't delay function finish for so long as it's now. The attached patch implements the new algorithm.
Created attachment 13905 [details] the described patch greatly decreases the problem
I'm afraid the patch is not right. Suppose that inode is list_del_init()'ed when we drop inode_lock. In that case list_for_each_entry() goes to the endless loop when we take the lock again. Also, I am not sure this could be qualified as BUG. Perhaps it is better to discuss this problem on lkml?
Jan, you have a livelock...
Oleg is right. My patch has a race. But in the same time we can't just lock whole inner loop as sb->dq_op->initialize(inode, type) may block. So, yes, current implementation is correct, but it hsa the initially described problem. And an addon to the problem description. The problem appears not on editquota. It uses quotactl(Q_SETQUOTA...) which is ok. The problem occurs on enabling quotas. quotactl(Q_QUOTAON...). Yes, for most ppl quota is enabled once at the system start and the problem doesn't touch them. But I have to disable/enable the quotas many time during runtime - so, the problem stay but I'm changing its severity... It's not so important in fact.
Also providing the problem stack showed when lock debugging is on. BUG: soft lockup - CPU#2 stuck for 11s! [quotaon:14005] Pid: 14005, comm: quotaon EIP: 0060:[<c01a5f74>] CPU: 2 EIP is at vfs_quota_on_inode+0x294/0x420 EFLAGS: 00000246 Not tainted (2.6.23.9 #1) EAX: 00000000 EBX: cb5e80fc ECX: 00000000 EDX: eb5a5bcc ESI: f6f274d4 EDI: f6f98e5c EBP: f6f27400 DS: 007b ES: 007b FS: 00d8 CR0: 8005003b CR2: b7e88000 CR3: 10e5c000 CR4: 000006d0 DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000 DR6: ffff0ff0 DR7: 00000400 [<c01a7320>] vfs_quota_on+0x70/0x80 [<c0183972>] dput+0xa2/0x130 [<c01eed5e>] ext3_quota_on+0x8e/0xc0 [<c02f0e9e>] strncpy_from_user+0x3e/0x70 [<c017a683>] getname+0xb3/0xe0 [<c01a989a>] sys_quotactl+0x61a/0x6e0 [<c06e5f34>] _spin_unlock+0x14/0x20 [<c0172d0e>] __fput+0x12e/0x190 [<c0187d13>] mntput_no_expire+0x13/0x70 [<c016fe29>] filp_close+0x49/0x80 [<c0102c1e>] syscall_call+0x7/0xb it's just the same on .24 series.
Just finished rewriting the patch over. The main idea of it - is using a separate list to keep inodes to initialize. So, here is the scenario: - we are locking the super block's inodes list - looping on the list, iget()ing inodes for initialization - and adding them to a separate list - unlocking the super block's inodes list then - running through the new list, initing inodes from it and put()ing them This makes execution time depending only on inodes count, not any FS activity. The patch is attached.
Created attachment 13919 [details] Fix for the livelock the problem fixing patch which is using a separate list to store inodes to init. It prevents heavy dependence on the FS activity and makes quota enabling time depending just on the inodes number on a super block.
Comment on attachment 13905 [details] the described patch It has a race
On 12/09, bugme-daemon@bugzilla.kernel.org wrote: > http://bugzilla.kernel.org/show_bug.cgi?id=9520 > > > ------- Comment #7 from gentuu@gmail.com 2007-12-09 01:50 ------- > Created an attachment (id=13919) > --> (http://bugzilla.kernel.org/attachment.cgi?id=13919&action=view) > Fix for the livelock > > the problem fixing patch which is using a separate list to store inodes to > init. > It prevents heavy dependence on the FS activity and makes quota enabling time > depending just on the inodes number on a super block. I can't really comment the filesystem code, just a couple general nits. - I don't really think you need kmem_cache_create+kmem_cache_alloc. Perhaps kmalloc() it is better. - You can't do __GFP_WAIT allocation under spin_lock(), you need GFP_ATOMIC. - Ugh. this "endless" loop if allocation fails... - When the first inode is found, the code sets l_head = inode, then we are doing list_for_each_entry(..., l_head, ...). This is not exactly right, note that list_for_each(head) does not count the first list_head == head. IOW, list_for_each(pos, head) does nothing if there is no other list_head's on list. - restart_initing: is not needed, we have list_for_each_entry_safe(), it allows to do list_del(&l_entry->i_list) inside the loop. - I don't understand why do we neeed "if (!l_head) printk(KERN_ERR ...)". - another printk(KERN_ERR) inside the second loop looks bogus. If you want to add the debugging checks, I'd suggest to use WARN_ON() instead. Oleg.
> > > - I don't really think you need kmem_cache_create+kmem_cache_alloc. > Perhaps > kmalloc() it is better. Holy true. I'll consider replacing caches. - You can't do __GFP_WAIT allocation under spin_lock(), you need GFP_ATOMIC. OMG... My fault. Will fix. - Ugh. this "endless" loop if allocation fails... I'll appreciate _any_ suggestions... should we try several times and fail on permanent allocation failure then? - When the first inode is found, the code sets l_head = inode, then we are > doing list_for_each_entry(..., l_head, ...). This is not exactly right, > note that list_for_each(head) does not count the first list_head == > head. Yep, I know. But this doesn't really matter in the situation :) IOW, list_for_each(pos, head) does nothing if there is no other > list_head's > on list. That's why I'm explicitly freeing the last cache object. - restart_initing: is not needed, we have list_for_each_entry_safe(), it > allows > to do list_del(&l_entry->i_list) inside the loop. Heh.... I should be looking better next time... > - I don't understand why do we neeed "if (!l_head) printk(KERN_ERR ...)". Right... we should allow no inodes initializing to enable quotas - another printk(KERN_ERR) inside the second loop looks bogus. If you want > to add the debugging checks, I'd suggest to use WARN_ON() instead. Ok, considering this too Oleg, these are just perfect comments! :) Great thanks! Will update the patch soon. <div><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"><br></blockquote><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"> - I don't really think you need kmem_cache_create+kmem_cache_alloc. Perhaps<br> kmalloc() it is better.</blockquote><div><br>Holy true. I'll consider replacing caches.<br> </div><br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"> - You can't do __GFP_WAIT allocation under spin_lock(), you need GFP_ATOMIC.</blockquote><div><br>OMG... My fault. Will fix.<br></div><br><br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"> - Ugh. this "endless" loop if allocation fails...</blockquote><div><br>I'll appreciate _any_ suggestions... should we try several times and fail on permanent allocation failure then?<br> </div><br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"> - When the first inode is found, the code sets l_head = inode, then we are<br> doing list_for_each_entry(..., l_head, ...). This is not exactly right,<br> note that list_for_each(head) does not count the first list_head == head. </blockquote><div><br>Yep, I know. But this doesn't really matter in the situation :) <br></div><br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"> IOW, list_for_each(pos, head) does nothing if there is no other list_head's<br> on list.</blockquote><div><br>That's why I'm explicitly freeing the last cache object.<br> <br></div><br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"> - restart_initing: is not needed, we have list_for_each_entry_safe(), it allows<br> to do list_del(&l_entry->i_list) inside the loop.</blockquote><div><br>Heh.... I should be looking better next time...<br><br> </div><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">- I don't understand why do we neeed "if (!l_head) printk(KERN_ERR ...)". </blockquote><div><br>Right... we should allow no inodes initializing to enable quotas<br><br></div><br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"> - another printk(KERN_ERR) inside the second loop looks bogus. If you want<br> to add the debugging checks, I'd suggest to use WARN_ON() instead.</blockquote><div><br>Ok, considering this too<br><br><br>Oleg, these are just perfect comments! :) <br>Great thanks!<br><br>Will update the patch soon.<br></div></div>
On 12/09, bugme-daemon@bugzilla.kernel.org wrote: > > ------- Comment #10 from gentuu@gmail.com 2007-12-09 08:25 ------- > > - Ugh. this "endless" loop if allocation fails... > > I'll appreciate _any_ suggestions... should we try several times and fail > on permanent allocation failure then? Heh. You need a more knowledgeable person in that area :) We have them on CC. My opinion, we should break the loop, initialize() inodes we already found, and repeat again. Or return an error. Offtopic to this patch, but we don't check the return value of ->initialize(), can't it fail? > Yep, I know. But this doesn't really matter in the situation :) > > IOW, list_for_each(pos, head) does nothing if there is no other > > list_head's > > on list. > > > That's why I'm explicitly freeing the last cache object. Yep. But we don't do ->initialize() + dput() for that inode. Oleg.
> Heh. You need a more knowledgeable person in that area :) We have them on CC. You are right. Let's wait for their touch > Yep. But we don't do ->initialize() + dput() for that inode. Ok... marking the second patch as broken...
Comment on attachment 13919 [details] Fix for the livelock Just broken
A few comments: ->initialize() can fail but something is really wrong if it does (filesystem corruption, IO error, etc) and we cannot do much about it... I read in Comment #1 that this livelock happens after you edited quota with edquota(8) - that is quite strange because edquota(8) never calls quotaon (at least in the tree I maintain). What version of quota tools do you have? Not that we shouldn't do something about the livelock but running quotaon on a living filesystem isn't a good idea for other reasons - it means quota was off before and therefore numbers of used blocks and inodes is probably going to be wrong...
Jan, about edquota in the first comment. I already commented later that not edqouta causes the problem, but quotaon. Sure normal quota using should avoid turning it off/on more than once during uptime. But I have to do so and the livelock comes up. What do you think about the second patch idea? How about using a separate list to store the inodes for initializing?
Created attachment 13961 [details] Fixing patch with Oleg's notes implemented Reworked the patch according to Oleg's advisory. The function now returns errors in failure case. And the caller has to handle them properly. This is still in TODO. Jan, what would you suggest to do in calling vfs_quota_on_inode() function to do to handle the fatal errors (NOMEM/INVAL) got from our add_dquot_ref()? Thanks
Jan, Oleg, any comments about the updated patch ?
I don't like the list building you do much. And I think we can avoid it - while we hold a reference to the inode we are sure it cannot be removed from s_inodes list. We also don't have to care about newly added inodes as they will have quotas already initialized properly. So the only thing we have to solve is how to safely continue the search. I'm working on writing and testing the patch now...
Great Jan! Waiting for your patch
Created attachment 13972 [details] Patch implementing faster inode scan Attached patch should make inode scan in add_dquot_ref() linear time. The patch seems to work for me.
Nick, can you try the patch? Thanks.
Jan, perfect solution! :)) Trying now!
Jan, it works perfectly :) Great thanks! PS Just don't forget to remove this string from the file top ;) "add_dquot_ref() restarts after blocking"
Comment on attachment 13961 [details] Fixing patch with Oleg's notes implemented Jan's patch does the same without additional list ;)
To comment #23: That line is just in a changelog so it makes no sence to change it... Eventually we could rip the whole changelog and rely just on git logs but that's a different matter.
Ok, I see Jan Will we see your fix in 2.6.24 branch? :)
There could be some subtle race I've missed so I prefer the patch goes to 2.6.25-rc1 so that we have more time to see a bug report if there's some problem... So for 2.6.24 you'll have to use a standalone patch, I hope it's not a big problem. Thanks for testing.
Ok, I see It's not a problem at all to use the patch as standalone. Thanks again Jan!