Created attachment 242561 [details]
Test case illustrating the problem with process accounting

It seems I have found a situation in which process accounting does not append records for terminated processes.

How to reproduce (kernel versions 3.17-rc1 - 4.9-rc1):

Create an empty file for accounting, call the acct() system call with this file, sleep for at least one jiffy, then create a new process and let it exit. From then on, records for terminated processes are not appended to the accounting file, and this state persists until process accounting is restarted. Note that the acct() system call in this procedure returns successfully, with zero.

To reproduce the problem it is important that no process exits during the tick in which accounting is switched on with acct() (the current jiffies value must increment before any process exits). On my system such an exit happens very rarely, so the problem reproduces almost always.

I wrote a program, test.c, which implements the steps described above. The program then tests the size of the accounting file. If the size remains zero, the problem seems to be present.

How it was found (and the possible cause):

I was investigating a bug in the program atop 1.26-2 (a monitor for system resources and process activity):
https://bugs.launchpad.net/ubuntu/+source/atop/+bug/1022865
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=778598

This bug reproduces on my system from time to time. atop sometimes crashes with SIGFPE at system start or at midnight (when atop is restarted by cron). After some debugging I found that the atop crash happens whenever the described problem with process accounting happens.

After studying kernel/acct.c, I think I have found the source of the problem. I think it is in the function check_free_space(), which is used to check the free space on the filesystem holding the accounting file. The lines (kernel versions 3.17-rc1 - 4.9-rc1)

	if (time_is_before_jiffies(acct->needcheck))
		goto out;

are used to test whether it is time to check free space. The variable acct->needcheck holds the next time at which free space should be checked.
If the condition is true, then it is not yet time to check free space, and execution branches to the end of the function. If the condition is false, free space is checked, the state of process accounting (acct->active) is changed accordingly, and the next time to check free space is written to acct->needcheck. But is time_is_before_jiffies() really the right function to use for that?

Whenever process accounting is switched on with acct(), acct->needcheck is set to the current jiffies and acct->active is set to zero (disabled).

If jiffies did not increment between switching process accounting on and the first process exit, the "goto" branch is not taken, free space is checked, and if enough free space is present, accounting is activated (acct->active = 1).

If at least one jiffy has passed between switching process accounting on and the first process exit, jiffies is greater than acct->needcheck, the "goto" branch is taken, and acct->active remains zero. From this moment on, the current jiffies is always greater than acct->needcheck, acct->active is always 0, and records for terminated processes are not appended to the accounting file.

This behaviour is observed in kernel versions 3.17-rc1 - 4.9-rc1. So I suppose that the problem can be solved in versions 3.17-rc1 - 4.9-rc1 by the following patch:

----------
diff --git a/kernel/acct.c b/kernel/acct.c
index 74963d1..37f1dc6 100644
--- a/kernel/acct.c
+++ b/kernel/acct.c
@@ -99,7 +99,7 @@ static int check_free_space(struct bsd_acct_struct *acct)
 {
 	struct kstatfs sbuf;
 
-	if (time_is_before_jiffies(acct->needcheck))
+	if (time_is_after_jiffies(acct->needcheck))
 		goto out;
 
 	/* May block */
----------

In kernel versions 3.3-rc1 - 3.16:

In kernel versions 3.3-rc1 - 3.16 the activation of process accounting is implemented differently, so a delay between the acct(filename) call and process termination does not cause the problem, and the program test.c does not detect it. But it seems that the use of time_is_before_jiffies() is equally wrong there.
Another problem arises if, while process accounting is running, the current jiffies becomes greater than acct->needcheck (for example, if the interval between two consecutive process terminations is greater than ACCT_TIMEOUT seconds). Then the lines

	if (!file || time_is_before_jiffies(acct->needcheck))
		goto out;

always take the "goto" branch and acct->needcheck is never updated. So free space is not checked any more until accounting is restarted. That is not good. Note that this problem is also present in versions 3.17-rc1 - 4.9-rc1.

Therefore I suppose that this problem can be solved for kernel versions 3.3-rc1 - 3.16 by the following patch:

----------
diff --git a/kernel/acct.c b/kernel/acct.c
index 808a86f..591bdcd 100644
--- a/kernel/acct.c
+++ b/kernel/acct.c
@@ -107,7 +107,7 @@ static int check_free_space(struct bsd_acct_struct *acct, struct file *file)
 
 	spin_lock(&acct_lock);
 	res = acct->active;
-	if (!file || time_is_before_jiffies(acct->needcheck))
+	if (!file || time_is_after_jiffies(acct->needcheck))
 		goto out;
 	spin_unlock(&acct_lock);
 
----------

Kernels before 3.3 use another method to decide when to check free space (a timer), so I did not find this problem in those versions.

Sorry if this is all my mistake. And sorry for my bad English.

Dmitry
According to the following comment in the Debian bug report, this issue may have been solved in the meantime:

atop sometimes fails with a floating point exception or a trap exception
Re: Bug#778598: atop: SIGFPE
https://bugs.debian.org/778598#49

Dmitry, can you confirm this?
Hello Martin,

I tested atop version 2.2.3-1~exp1, mentioned in the comment https://bugs.debian.org/778598#49 (I built this version from source on my Ubuntu system and launched it directly as the superuser, without installing it into the system). I launched atop many times and checked the presence and size of the accounting file (/tmp/atop.d/atop.acct).

I no longer experienced an atop crash with SIGFPE. But sometimes the accounting file was absent after launching atop. When the kernel problem with process accounting happens (as described in my bug report above), it seems (judging by the source) that this version of atop switches off process accounting and removes the accounting file without any message to the user.
Created attachment 248481 [details]
image50bd4d.PNG

Dear sender,

I am on vacation until 6 January 2017 inclusive. If you have an urgent problem and a service contract, you can reach us around the clock at the 0700 service number you received from us. For all other questions that cannot wait, contact our support team at support@teamix.de or call 0911 / 30 999 - 0. Please also put me on CC. I will handle mail addressed to me when I am back; it is not forwarded automatically.

Merry Christmas and a happy new year,

Martin Steigerwald
Trainer, teamix GmbH
Fixed by the following commit in stable:

commit ae04ca35247af576999da5ef726d1a03fc65de09
Author: Oleg Nesterov <oleg@redhat.com>
Date:   Thu Jan 4 16:17:49 2018 -0800

    kernel/acct.c: fix the acct->needcheck check in check_free_space()