Bug 180841

Summary: Process accounting sometimes does not append records for terminated processes
Product: Process Management
Reporter: Dmitry Romanov (rdimanos)
Component: Other
Assignee: process_other
Status: NEW
Severity: normal
CC: martin.steigerwald, rdimanos
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 3.3-rc1 and later
Subsystem:
Regression: Yes
Bisected commit-id:
Attachments: Test case for illustration problem with process accounting
image50bd4d.PNG

Description Dmitry Romanov 2016-10-24 19:23:21 UTC
Created attachment 242561
Test case for illustration problem with process accounting

It seems I have found a situation in which process accounting does not append
records for terminated processes.


How to reproduce (kernel versions 3.17-rc1 - 4.9-rc1):

Create an empty file for accounting, call the acct() system call with this
file, sleep for at least one jiffy, then create a new process and let it exit.
From then on, records for terminated processes are not appended to the
accounting file, and this state persists until process accounting is restarted.
Note that the acct() system call in this procedure returns successfully (zero).
To reproduce the problem it is important that no process exits during the tick
in which accounting is switched on with acct() (the current jiffies value must
increment before any process exits). On my system such an exit happens very
rarely, so the problem reproduces almost always.

I wrote a program, test.c, which implements the steps described above. The
program then checks the size of the accounting file. If the size remains zero,
the problem has occurred.
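
For readers without the attachment, the steps above can be sketched roughly as
follows. This is not the attached test.c; the function name
acct_file_stays_empty, its return convention and the 100 ms sleep are
assumptions made for illustration. It needs the privilege to call acct()
(normally root), and it switches process accounting off when it finishes, so
it should not be run on a machine where accounting is in use.

```c
#define _DEFAULT_SOURCE            /* exposes acct() and usleep() on glibc */
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper, not the attached test.c.
 * Returns  1 if the accounting file stayed empty (problem observed),
 *          0 if a record was appended,
 *         -1 if a step failed (for example, acct() not permitted). */
int acct_file_stays_empty(const char *path)
{
    struct stat st;
    int result = -1;
    pid_t pid;

    /* Step 1: create an empty accounting file. */
    int fd = open(path, O_CREAT | O_TRUNC | O_WRONLY, 0600);
    if (fd < 0)
        return -1;
    close(fd);

    /* Step 2: switch process accounting on for this file. */
    if (acct(path) != 0) {
        unlink(path);
        return -1;
    }

    /* Step 3: sleep well over one jiffy (assumed long enough for any HZ). */
    usleep(100 * 1000);

    /* Step 4: create a process, let it exit, and reap it. */
    pid = fork();
    if (pid == 0)
        _exit(0);
    if (pid > 0 && waitpid(pid, NULL, 0) == pid && stat(path, &st) == 0)
        result = (st.st_size == 0);

    /* Clean up: switch accounting off again and remove the file. */
    acct(NULL);
    unlink(path);
    return result;
}
```

Calling acct_file_stays_empty("/tmp/acct_sketch.acct") (a made-up path) as
root and getting 1 would correspond to the failure described above.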


How it was found (and possible cause):

I was investigating a bug in the program atop 1.26-2 (a monitor for system
resources and process activity):
https://bugs.launchpad.net/ubuntu/+source/atop/+bug/1022865
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=778598

This bug sometimes reproduces on my system. Atop sometimes crashes with SIGFPE
at system start or at midnight (when atop is restarted by cron).
After some debugging I found that this atop crash happens whenever the
process accounting problem described above occurs.

After studying kernel/acct.c, it seems that I have found the source of the
problem. I think it is in the function check_free_space(), which is used to
check the free space on the filesystem that holds the accounting file.
The lines (kernel versions 3.17-rc1 - 4.9-rc1)

	if (time_is_before_jiffies(acct->needcheck))
		goto out;

are meant to test whether it is time to check free space.
The variable acct->needcheck holds the next time at which free space should be
checked. If the condition is true, it is treated as not yet time to check, and
the function branches to its end. If the condition is false, free space is
checked, the state of process accounting (acct->active) is changed accordingly,
and the next check time is written to acct->needcheck. But
time_is_before_jiffies(acct->needcheck) is true when acct->needcheck is already
in the past, that is, exactly when the check is due. So is
"time_is_before_jiffies" really the right function to use here!?

Whenever process accounting is switched on with acct(), the variable
acct->needcheck is set to the current jiffies and acct->active is set to zero
(disabled). If jiffies does not increment between switching accounting on and
the first process exit, the branch with "goto" is not taken, free space is
checked, and if enough free space is present, accounting is activated
(acct->active = 1). If at least one jiffy has passed between switching
accounting on and the first process exit, jiffies is greater than
acct->needcheck, the branch with "goto" is taken, and acct->active remains zero.
From this moment on, the current jiffies is always greater than
acct->needcheck, acct->active always equals 0, and records for terminated
processes are not appended to the accounting file.
This behaviour is observed in kernel versions 3.17-rc1 - 4.9-rc1.
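
The reasoning above can be checked with a small userspace model. This is only
a sketch: the comparison helpers imitate the kernel's wrap-safe jiffies macros,
while struct acct_model, check_free_space_model(), first_exit_is_recorded()
and the HZ and ACCT_TIMEOUT values are assumptions made for illustration, and
the real free-space thresholds are reduced to a single flag.

```c
#include <stdbool.h>

#define HZ            250UL   /* assumed tick rate */
#define ACCT_TIMEOUT   30UL   /* assumed recheck interval in seconds */

/* Userspace imitation of time_is_before_jiffies(): t lies in the past. */
static bool time_is_before(unsigned long t, unsigned long now)
{
    return (long)(t - now) < 0;
}

/* Userspace imitation of time_is_after_jiffies(): t lies in the future. */
static bool time_is_after(unsigned long t, unsigned long now)
{
    return (long)(now - t) < 0;
}

/* Hypothetical stand-in for the two fields discussed in the report. */
struct acct_model {
    unsigned long needcheck;
    bool active;
};

/* Simplified model of check_free_space() in 3.17-rc1 - 4.9-rc1.
 * patched == false uses the original condition, true the proposed one. */
static bool check_free_space_model(struct acct_model *a, unsigned long now,
                                   bool patched, bool space_ok)
{
    bool skip = patched ? time_is_after(a->needcheck, now)
                        : time_is_before(a->needcheck, now);
    if (!skip) {
        a->active = space_ok;                    /* suspend/resume collapsed */
        a->needcheck = now + ACCT_TIMEOUT * HZ;  /* schedule the next check */
    }
    return a->active;
}

/* acct() is called at jiffies == start; the first process exits
 * delay jiffies later, with plenty of free space on the filesystem. */
bool first_exit_is_recorded(unsigned long delay, bool patched)
{
    unsigned long start = 1000;
    struct acct_model a = { .needcheck = start, .active = false };

    return check_free_space_model(&a, start + delay, patched, true);
}
```

In this model, first_exit_is_recorded(0, false) is true and
first_exit_is_recorded(1, false) is false, which matches the behaviour
described above, while with the reversed condition both calls return true.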

So I suppose that this problem may be solved in versions 3.17-rc1 - 4.9-rc1 by
the following patch:

----------
diff --git a/kernel/acct.c b/kernel/acct.c
index 74963d1..37f1dc6 100644
--- a/kernel/acct.c
+++ b/kernel/acct.c
@@ -99,7 +99,7 @@ static int check_free_space(struct bsd_acct_struct *acct)
 {
        struct kstatfs sbuf;
 
-       if (time_is_before_jiffies(acct->needcheck))
+       if (time_is_after_jiffies(acct->needcheck))
                goto out;
 
        /* May block */
----------


In kernel versions 3.3-rc1 - 3.16:

In kernel versions 3.3-rc1 - 3.16 the activation of process accounting is
implemented differently, so a delay between the acct(filename) call and process
termination does not cause the problem, and the program test.c does not detect
it. But using time_is_before_jiffies seems to be just as wrong there.
Another problem arises if, while process accounting is running, the current
jiffies becomes greater than acct->needcheck (for example, if the interval
between two consecutive process terminations is longer than ACCT_TIMEOUT
seconds). Then the lines:

	if (!file || time_is_before_jiffies(acct->needcheck))
		goto out;

will always branch with "goto", and acct->needcheck will not change. So free
space will never be checked again until accounting is restarted, which is not
good.
Note that this problem is also present in versions 3.17-rc1 - 4.9-rc1.
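
A small userspace model can illustrate how the free-space check stalls after a
long gap between process exits. It is a sketch only: count_space_checks(), the
HZ and ACCT_TIMEOUT values, and the starting state (accounting enabled at
jiffies 0 with the first check scheduled ACCT_TIMEOUT seconds later) are
assumptions for illustration, not code taken from the kernel.

```c
#include <stdbool.h>
#include <stddef.h>

#define HZ            250UL   /* assumed tick rate */
#define ACCT_TIMEOUT   30UL   /* assumed recheck interval in seconds */

/* Wrap-safe comparisons in the style of the kernel's jiffies helpers. */
static bool is_before(unsigned long t, unsigned long now)
{
    return (long)(t - now) < 0;   /* t lies in the past */
}

static bool is_after(unsigned long t, unsigned long now)
{
    return (long)(now - t) < 0;   /* t lies in the future */
}

/* Count how many of the given process exits (jiffies values in ascending
 * order) lead to a free-space check. patched == false models the original
 * condition, patched == true the condition from the proposed patch. */
size_t count_space_checks(const unsigned long *exit_times, size_t n,
                          bool patched)
{
    unsigned long needcheck = ACCT_TIMEOUT * HZ;  /* assumed initial state */
    size_t checks = 0;

    for (size_t i = 0; i < n; i++) {
        unsigned long now = exit_times[i];
        bool skip = patched ? is_after(needcheck, now)
                            : is_before(needcheck, now);
        if (skip)
            continue;                             /* the "goto out" branch */
        checks++;
        needcheck = now + ACCT_TIMEOUT * HZ;      /* schedule the next check */
    }
    return checks;
}
```

With these assumed values the first check is due at jiffies 7500. For exits at
8000, 16000 and 24000 the model with the original condition performs no check
at all, while the patched condition checks on every one of them. For exits
that come in quick succession before 7500, the original condition checks on
every exit instead of waiting for the timeout.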

Therefore I suppose that this problem may be solved for kernel versions
3.3-rc1 - 3.16 by the following patch:

----------
diff --git a/kernel/acct.c b/kernel/acct.c
index 808a86f..591bdcd 100644
--- a/kernel/acct.c
+++ b/kernel/acct.c
@@ -107,7 +107,7 @@ static int check_free_space(struct bsd_acct_struct *acct, struct file *file)
 
        spin_lock(&acct_lock);
        res = acct->active;
-       if (!file || time_is_before_jiffies(acct->needcheck))
+       if (!file || time_is_after_jiffies(acct->needcheck))
                goto out;
        spin_unlock(&acct_lock);
----------


In kernels before 3.3-rc1 a different method (a timer) is used to decide when
to check free space. So I did not find this problem in those versions.


Sorry if it is all my mistake.
And sorry for my bad English.

Dmitry
Comment 1 Martin Steigerwald 2016-12-19 10:49:44 UTC
According to the following comment in the Debian bug report, this issue may have been solved in the meantime:

atop sometimes fails with a floating point exception or a trap exception
Re: Bug#778598: atop: SIGFPE
https://bugs.debian.org/778598#49

Dmitry, can you confirm this?
Comment 2 Dmitry Romanov 2016-12-24 19:43:12 UTC
Hello Martin

I tested atop version 2.2.3-1~exp1, mentioned in the comment
https://bugs.debian.org/778598#49
(I built this version from source on my Ubuntu system and launched it directly
as the superuser, without installing it system-wide).

I launched atop many times and checked for the presence and size of the
accounting file (/tmp/atop.d/atop.acct).
I no longer experienced the atop crash with SIGFPE. But sometimes the
accounting file was absent after atop was launched.

When the kernel process accounting problem described in my bug report above
occurs, it seems (judging from the source) that this version of atop switches
off process accounting and removes the accounting file without any message to
the user.
Comment 3 Martin Steigerwald 2016-12-24 19:49:06 UTC
Created attachment 248481
image50bd4d.PNG

Dear sender,

I am on vacation until 6 January 2017 inclusive. If you have an urgent problem and have a service contract, you can reach us around the clock at the 0700 service number that you received from us.

For all other questions that cannot wait, contact our support team at support@teamix.de or call 0911 / 30 999 - 0. Please also put me on CC. I will handle mail addressed to me when I am back. I do not have it forwarded automatically.

Merry Christmas and a happy new year,
Comment 4 Dmitry Romanov 2018-01-13 18:32:58 UTC
Fixed by the following commit in stable:

commit ae04ca35247af576999da5ef726d1a03fc65de09
Author: Oleg Nesterov <oleg@redhat.com>
Date:   Thu Jan 4 16:17:49 2018 -0800

    kernel/acct.c: fix the acct->needcheck check in check_free_space()