Bug 9520 (quotactl_hangs)

Summary: Quotas modification on a loaded system takes up to several minutes to complete
Product: File System Reporter: Nick (gentuu)
Component: OtherAssignee: Jan Kara (jack)
Status: RESOLVED PATCH_ALREADY_AVAILABLE    
Severity: normal CC: akpm, jack, oleg
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.23.x and current up to 2.6.24-rc4-git5 Subsystem:
Regression: --- Bisected commit-id:
Attachments: the described patch
Fix for the livelock
Fixing patch with Oleg's notes implemented
Patch implementing faster inode scan

Description Nick 2007-12-07 10:26:18 UTC
Most recent kernel where this bug did not occur: not tested
Distribution: GNU
Hardware Environment: i386
Software Environment: GNU
Problem Description:
"quotaon" process hangs for several minutes on a loaded system to complete.
Sometimes it takes 10 minutes...
The process hangs in near-to-deadlock function fs/dquot.c:add_dquot_ref()
for so long.
The problem is in the function's weak algorithm.

Steps to reproduce:
Quota modification with edquota can bring quotaon process in the first top position for 2-3 seconds even on a low loaded system.

Not fully sure how to reproduce  problem to hang quotaon for several minutes
(just big disk load?)

The problem algorithm (at fs/dquot.c:add_dquot_ref()) theoretically may
loop infinitely. It is a deadlock possible!!

It goes through super block's inodes list and inits some inodes. But after the init call - it starts the loop over!! Is that necessary?
In the worst _static_ (none else updates the FS) case (when all the inodes require init) the loop can be run (N^2)/2 times !!
But just anything updates every time an inode in the start of the list - the loop will always start over and over...
It looks like the logic is broken...

It's much better to create a loop against running whole list
and continue until a loop updates 0 inodes.
In this case the worst static situation takes just 2 loops:
first one updates all the inodes, and the next one - just see that it's nothing to do.
Obviously, any FS updates during such updated logic can't delay function finish
for so long as it's now.

The attached patch implements the new algorithm.
Comment 1 Nick 2007-12-07 10:28:58 UTC
Created attachment 13905 [details]
the described patch

greatly decreases the problem
Comment 2 Oleg Nesterov 2007-12-07 12:26:18 UTC
I'm afraid the patch is not right.

Suppose that inode is list_del_init()'ed when we drop inode_lock.
In that case list_for_each_entry() goes to the endless loop when
we take the lock again.

Also, I am not sure this could be qualified as BUG. Perhaps it is
better to discuss this problem on lkml?
Comment 3 Andrew Morton 2007-12-07 14:36:40 UTC
Jan, you have a livelock...
Comment 4 Nick 2007-12-08 01:57:49 UTC
Oleg is right.

My patch has a race. But in the same time we can't just lock whole inner loop
as sb->dq_op->initialize(inode, type) may block.
So, yes, current implementation is correct, but it hsa the initially described problem.

And an addon to the problem description.
The problem appears not on editquota. It uses quotactl(Q_SETQUOTA...) which is ok.
The problem occurs on enabling quotas. quotactl(Q_QUOTAON...).

Yes, for most ppl quota is enabled once at the system start and the problem doesn't touch them.
But I have to disable/enable the quotas many time during runtime - so, the problem stay but I'm changing its severity... It's not so important in fact.
Comment 5 Nick 2007-12-08 02:07:57 UTC
Also providing the problem stack showed when lock debugging is on.


BUG: soft lockup - CPU#2 stuck for 11s! [quotaon:14005]
Pid: 14005, comm:              quotaon
EIP: 0060:[<c01a5f74>] CPU: 2
EIP is at vfs_quota_on_inode+0x294/0x420
EFLAGS: 00000246    Not tainted  (2.6.23.9 #1)
EAX: 00000000 EBX: cb5e80fc ECX: 00000000 EDX: eb5a5bcc
ESI: f6f274d4 EDI: f6f98e5c EBP: f6f27400 DS: 007b ES: 007b FS: 00d8
CR0: 8005003b CR2: b7e88000 CR3: 10e5c000 CR4: 000006d0
DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
DR6: ffff0ff0 DR7: 00000400
[<c01a7320>] vfs_quota_on+0x70/0x80
[<c0183972>] dput+0xa2/0x130
[<c01eed5e>] ext3_quota_on+0x8e/0xc0
[<c02f0e9e>] strncpy_from_user+0x3e/0x70
[<c017a683>] getname+0xb3/0xe0
[<c01a989a>] sys_quotactl+0x61a/0x6e0
[<c06e5f34>] _spin_unlock+0x14/0x20
[<c0172d0e>] __fput+0x12e/0x190
[<c0187d13>] mntput_no_expire+0x13/0x70
[<c016fe29>] filp_close+0x49/0x80
[<c0102c1e>] syscall_call+0x7/0xb

it's just the same on .24 series.
Comment 6 Nick 2007-12-09 00:23:13 UTC
Just finished rewriting the patch over.
The main idea of it - is using a separate list to keep
inodes to initialize.

So, here is the scenario:
- we are locking the super block's inodes list
- looping on the list, iget()ing inodes for initialization
- and adding them to a separate list
- unlocking the super block's inodes list
then
- running through the new list, initing inodes from it and put()ing them

This makes execution time depending only on inodes count, not any FS activity.

The patch is attached.
Comment 7 Nick 2007-12-09 01:50:31 UTC
Created attachment 13919 [details]
Fix for the livelock

the problem fixing patch which is using a separate list to store inodes to init.
It prevents heavy dependence on the FS activity and makes quota enabling time depending just on the inodes number on a super block.
Comment 8 Nick 2007-12-09 01:51:37 UTC
Comment on attachment 13905 [details]
the described patch

It has a race
Comment 9 Oleg Nesterov 2007-12-09 07:40:30 UTC
On 12/09, bugme-daemon@bugzilla.kernel.org wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=9520
> 
> 
> ------- Comment #7 from gentuu@gmail.com  2007-12-09 01:50 -------
> Created an attachment (id=13919)
>  --> (http://bugzilla.kernel.org/attachment.cgi?id=13919&action=view)
> Fix for the livelock
> 
> the problem fixing patch which is using a separate list to store inodes to
> init.
> It prevents heavy dependence on the FS activity and makes quota enabling time
> depending just on the inodes number on a super block.

I can't really comment the filesystem code, just a couple general nits.

- I don't really think you need kmem_cache_create+kmem_cache_alloc. Perhaps
  kmalloc() it is better.

- You can't do __GFP_WAIT allocation under spin_lock(), you need GFP_ATOMIC.

- Ugh. this "endless" loop if allocation fails...

- When the first inode is found, the code sets l_head = inode, then we are
  doing list_for_each_entry(..., l_head, ...). This is not exactly right,
  note that list_for_each(head) does not count the first list_head == head.
  IOW, list_for_each(pos, head) does nothing if there is no other list_head's
  on list.

- restart_initing: is not needed, we have list_for_each_entry_safe(), it allows
  to do list_del(&l_entry->i_list) inside the loop.

- I don't understand why do we neeed "if (!l_head) printk(KERN_ERR ...)".

- another printk(KERN_ERR) inside the second loop looks bogus. If you want
  to add the debugging checks, I'd suggest to use WARN_ON() instead.

Oleg.
Comment 10 Nick 2007-12-09 08:25:22 UTC
>
>
> - I don't really think you need kmem_cache_create+kmem_cache_alloc.
> Perhaps
>   kmalloc() it is better.


Holy true. I'll consider replacing caches.


- You can't do __GFP_WAIT allocation under spin_lock(), you need GFP_ATOMIC.


OMG...  My fault. Will fix.


- Ugh. this "endless" loop if allocation fails...


I'll appreciate _any_ suggestions...  should we try several times and fail
on permanent allocation failure then?


- When the first inode is found, the code sets l_head = inode, then we are
>   doing list_for_each_entry(..., l_head, ...). This is not exactly right,
>   note that list_for_each(head) does not count the first list_head ==
> head.


Yep, I know. But this doesn't really matter in the situation :)

  IOW, list_for_each(pos, head) does nothing if there is no other
> list_head's
>   on list.


That's why I'm explicitly freeing the last cache object.


- restart_initing: is not needed, we have list_for_each_entry_safe(), it
> allows
>   to do list_del(&l_entry->i_list) inside the loop.


Heh....   I should be looking better next time...



> - I don't understand why do we neeed "if (!l_head) printk(KERN_ERR ...)".


Right...  we should allow no inodes initializing to enable quotas


- another printk(KERN_ERR) inside the second loop looks bogus. If you want
>   to add the debugging checks, I'd suggest to use WARN_ON() instead.


Ok, considering this too


Oleg, these are just perfect comments! :)
Great thanks!

Will update the patch soon.
<div><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"><br></blockquote><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
- I don&#39;t really think you need kmem_cache_create+kmem_cache_alloc. Perhaps<br>&nbsp;&nbsp;kmalloc() it is better.</blockquote><div><br>Holy true. I&#39;ll consider replacing caches.<br>&nbsp;</div><br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
- You can&#39;t do __GFP_WAIT allocation under spin_lock(), you need GFP_ATOMIC.</blockquote><div><br>OMG...&nbsp; My fault. Will fix.<br></div><br><br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
- Ugh. this &quot;endless&quot; loop if allocation fails...</blockquote><div><br>I&#39;ll appreciate _any_ suggestions...&nbsp; should we try several times and fail on permanent allocation failure then?<br>&nbsp;</div><br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
- When the first inode is found, the code sets l_head = inode, then we are<br>&nbsp;&nbsp;doing list_for_each_entry(..., l_head, ...). This is not exactly right,<br>&nbsp;&nbsp;note that list_for_each(head) does not count the first list_head == head.
</blockquote><div><br>Yep, I know. But this doesn&#39;t really matter in the situation :) <br></div><br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
&nbsp;&nbsp;IOW, list_for_each(pos, head) does nothing if there is no other list_head&#39;s<br>&nbsp;&nbsp;on list.</blockquote><div><br>That&#39;s why I&#39;m explicitly freeing the last cache object.<br>&nbsp;<br></div><br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
- restart_initing: is not needed, we have list_for_each_entry_safe(), it allows<br>&nbsp;&nbsp;to do list_del(&amp;l_entry-&gt;i_list) inside the loop.</blockquote><div><br>Heh....&nbsp;&nbsp; I should be looking better next time...<br><br>&nbsp;
</div><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">- I don&#39;t understand why do we neeed &quot;if (!l_head) printk(KERN_ERR ...)&quot;.
</blockquote><div><br>Right...&nbsp; we should allow no inodes initializing to enable quotas<br><br></div><br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
- another printk(KERN_ERR) inside the second loop looks bogus. If you want<br>&nbsp;&nbsp;to add the debugging checks, I&#39;d suggest to use WARN_ON() instead.</blockquote><div><br>Ok, considering this too<br><br><br>Oleg, these are just perfect comments! :) 
<br>Great thanks!<br><br>Will update the patch soon.<br></div></div>
Comment 11 Oleg Nesterov 2007-12-09 08:58:50 UTC
On 12/09, bugme-daemon@bugzilla.kernel.org wrote:
>
> ------- Comment #10 from gentuu@gmail.com  2007-12-09 08:25 -------
>
> - Ugh. this "endless" loop if allocation fails...
>
> I'll appreciate _any_ suggestions...  should we try several times and fail
> on permanent allocation failure then?

Heh. You need a more knowledgeable person in that area :) We have them on CC.
My opinion, we should break the loop, initialize() inodes we already found,
and repeat again.

Or return an error. Offtopic to this patch, but we don't check the return
value of ->initialize(), can't it fail?

> Yep, I know. But this doesn't really matter in the situation :)
>
>   IOW, list_for_each(pos, head) does nothing if there is no other
> > list_head's
> >   on list.
>
>
> That's why I'm explicitly freeing the last cache object.

Yep. But we don't do ->initialize() + dput() for that inode.

Oleg.
Comment 12 Nick 2007-12-09 09:10:00 UTC
> Heh. You need a more knowledgeable person in that area :) We have them on CC.
You are right. Let's wait for their touch

> Yep. But we don't do ->initialize() + dput() for that inode.
Ok...  marking the second patch as broken...
Comment 13 Nick 2007-12-09 09:11:38 UTC
Comment on attachment 13919 [details]
Fix for the livelock

Just broken
Comment 14 Jan Kara 2007-12-10 05:16:35 UTC
A few comments: ->initialize() can fail but something is really wrong if it does (filesystem corruption, IO error, etc) and we cannot do much about it...
I read in Comment #1 that this livelock happens after you edited quota with edquota(8) - that is quite strange because edquota(8) never calls quotaon (at least in the tree I maintain). What version of quota tools do you have? Not that we shouldn't do something about the livelock but running quotaon on a living filesystem isn't a good idea for other reasons - it means quota was off before and therefore numbers of used blocks and inodes is probably going to be wrong...
Comment 15 Nick 2007-12-10 05:40:11 UTC
Jan,

about edquota in the first comment. I already commented later that not edqouta
causes the problem, but quotaon.

Sure normal quota using should avoid turning it off/on more than once during uptime. But I have to do so and the livelock comes up.

What do you think about the second patch idea? How about using a separate list to store the inodes for initializing?
Comment 16 Nick 2007-12-10 18:27:29 UTC
Created attachment 13961 [details]
Fixing patch with Oleg's notes implemented

Reworked the patch according to Oleg's advisory.

The function now returns errors in failure case.
And the caller has to handle them properly.
This is still in TODO.

Jan, what would you suggest to do in calling vfs_quota_on_inode() function
to do to handle the fatal errors (NOMEM/INVAL) got from our add_dquot_ref()?

Thanks
Comment 17 Nick 2007-12-11 02:36:33 UTC
Jan, Oleg,

any comments about the updated patch ?
Comment 18 Jan Kara 2007-12-11 02:43:50 UTC
I don't like the list building you do much. And I think we can avoid it - while we hold a reference to the inode we are sure it cannot be removed from s_inodes list. We also don't have to care about newly added inodes as they will have quotas already initialized properly. So the only thing we have to solve is how to safely continue the search. I'm working on writing and testing the patch now...
Comment 19 Nick 2007-12-11 02:47:47 UTC
Great Jan!

Waiting for your patch
Comment 20 Jan Kara 2007-12-11 04:08:28 UTC
Created attachment 13972 [details]
Patch implementing faster inode scan

Attached patch should make inode scan in add_dquot_ref() linear time. The patch seems to work for me.
Comment 21 Jan Kara 2007-12-11 04:09:08 UTC
Nick, can you try the patch? Thanks.
Comment 22 Nick 2007-12-11 04:33:15 UTC
Jan,

perfect solution! :))

Trying now!
Comment 23 Nick 2007-12-11 05:03:26 UTC
Jan, it works perfectly :)

Great thanks!

PS Just don't forget to remove this string from the file top ;)
"add_dquot_ref() restarts after blocking"
Comment 24 Nick 2007-12-11 05:04:26 UTC
Comment on attachment 13961 [details]
Fixing patch with Oleg's notes implemented

Jan's patch does the same without additional list ;)
Comment 25 Jan Kara 2007-12-11 05:34:40 UTC
To comment #23: That line is just in a changelog so it makes no sence to change it... Eventually we could rip the whole changelog and rely just on git logs but that's a different matter.
Comment 26 Nick 2007-12-11 05:37:29 UTC
Ok, I see Jan

Will we see your fix in 2.6.24 branch? :)
Comment 27 Jan Kara 2007-12-11 05:47:11 UTC
There could be some subtle race I've missed so I prefer the patch goes to 2.6.25-rc1 so that we have more time to see a bug report if there's some problem... So for 2.6.24 you'll have to use a standalone patch, I hope it's not a big problem. Thanks for testing.
Comment 28 Nick 2007-12-11 05:50:41 UTC
Ok, I see

It's not a problem at all to use the patch as standalone.

Thanks again Jan!