Bug 219194

Summary: PSI full-stall duration reports nonsensically large value when reading pressure file too frequently
Product: Other
Reporter: Brandon Duffany (brandon)
Component: Other
Assignee: other_other
Status: NEW
Severity: normal
CC: brandon, hannes
Priority: P3
Hardware: AMD
OS: Linux
Kernel Version:
Subsystem:
Regression: No
Bisected commit-id:
Attachments: Minimal reproducible example

Description Brandon Duffany 2024-08-26 00:49:34 UTC
Created attachment 306775 [details]
Minimal reproducible example

The cumulative full-stall duration ("total" field) reported in /sys/fs/cgroup/.../cpu.pressure is occasionally nonsensically large - a value much longer than the total execution duration of the only process that was run in the cgroup.

Interestingly, the nonsensical value that it reports is often exactly 4.294967 seconds. This looks suspiciously like UINT32_MAX (4294967295) nanoseconds, which made me think that this is a bug in the PSI code rather than something I was doing wrong.

I have seen other values, too, but this is the one it most commonly reports. At one point, I was frequently observing 8.58... which looks suspiciously like (UINT32_MAX * 2).

Another interesting thing about this bug is that I have only been able to reproduce it when frequently reading from the cpu.pressure file. It reproduces reliably when reading the cpu.pressure file every microsecond. If I instead just let the processes in the cgroup run to completion, *then* read the cpu.pressure file just once at the end (after the cgroup no longer has any processes), the reported total stall duration seems fine.

(In production, we poll this file every 50ms, so we are seeing this issue a bit more rarely, but often enough that it's problematic for us.)
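
For reference, the polling is essentially the following (a minimal sketch with a made-up cgroup path, not our production code):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	/* Hypothetical path; substitute the cgroup being monitored. */
	const char *path = "/sys/fs/cgroup/test/cpu.pressure";

	for (;;) {
		char buf[256];
		FILE *f = fopen(path, "r");
		if (!f)
			return 1;
		size_t n = fread(buf, 1, sizeof(buf) - 1, f);
		buf[n] = '\0';
		fclose(f);

		/* The file contains lines like:
		 *   full avg10=0.00 avg60=0.00 avg300=0.00 total=<microseconds>
		 * and we track the cumulative "total" of the "full" line. */
		const char *full = strstr(buf, "full ");
		const char *total = full ? strstr(full, "total=") : NULL;
		if (total)
			printf("cpu full-stall total: %llu us\n",
			       strtoull(total + strlen("total="), NULL, 10));

		usleep(50 * 1000); /* poll every 50ms */
	}
}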

I've attached a minimal reproducible example. Compile it with a C compiler, then run it as root, so that it has the necessary permissions to create cgroups under /sys/fs/cgroup. It's also available in the gist here: https://gist.github.com/bduffany/8e3489e2d11f0b80681b493f55355824

After some number of iterations (sometimes takes only a few, sometimes takes thousands), it prints something like this:

... [lots of output omitted] ...
cpu full-stall 0.000000 / exec 0.044252 (0.00%)
cpu full-stall 0.000000 / exec 0.041969 (0.00%)
cpu full-stall 0.000000 / exec 0.040588 (0.00%)
cpu full-stall 0.000000 / exec 0.038366 (0.00%)
cpu full-stall 0.000000 / exec 0.046104 (0.00%)
cpu full-stall 0.000000 / exec 0.049142 (0.00%)
cpu full-stall 4.294967 / exec 0.043183 (9945.97%)
nonsensical stall duration: full-stall duration is 4.294967 but exec duration is only 0.043183

I'm currently running Linux 6.8.0 - the output of `uname -a` shows:

Linux d 6.8.0-40-generic #40~22.04.3-Ubuntu SMP PREEMPT_DYNAMIC Tue Jul 30 17:30:19 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
Comment 1 Johannes Weiner 2024-08-26 18:19:12 UTC
Hi Brandon,

On Mon, Aug 26, 2024 at 12:49:34AM +0000, bugzilla-daemon@kernel.org wrote:
> Created attachment 306775 [details]
>   --> https://bugzilla.kernel.org/attachment.cgi?id=306775&action=edit
> Minimal reproducible example
> 
> The cumulative full-stall duration ("total" field) reported in
> /sys/fs/cgroup/.../cpu.pressure is occasionally nonsensically large - a
> value much longer than the total execution duration of the only process
> that was run in the cgroup.
> 
> Interestingly, the nonsensical value that it reports is often exactly
> 4.294967 seconds. This looks suspiciously like UINT32_MAX (4294967295)
> nanoseconds, which made me think that this is a bug in the PSI code
> rather than something I was doing wrong.

Thanks for reporting this. Your reproducer was immensely helpful in
tracking this down, much appreciated. With the below patch, I can no
longer trigger it. If at all possible, could you try it on your end?

It's a race condition between the aggregator (reader) and a task changing
state. Specifically, it's an issue with the live state snooping: the
aggregator will add cpu_clock() - state_start to the sample of any
currently active states that have not been recorded yet. This makes
the aggregator get ahead of recorded task state, but we assume this is
okay because, as the state concludes, that delta we snooped ahead will
be incorporated back into the recorded times.

But there is a bug. The state-change path reads cpu_clock() before
acquiring the seq write lock, and then concludes the state with that
earlier timestamp. This means an aggregator can race in between and
snoop ahead to a later clock value than the one at which the state
will conclude. The next aggregation will see a negative delta and
record a UINT_MAX - <race time> sample:

psi_group_change()               get_recent_times()
  now = cpu_clock() (X)

  <a few cycles pass>

                                 read_seqcount_begin()
                                 now = cpu_clock() (X+42)
                                 memcpy(times, groupc->times)
                                 !read_seqcount_retry()

                                 /* Snoops live state up to X+42 */
                                 if (state_mask & (1 << s))
                                   times[s] += now - state_start

                                 delta = times[s] - groupc->times_prev[s]
                                 groupc->times_prev[s] = times[s]

                                 /* times_prev up to X+42 */

  write_seqcount_begin()
  /* concludes state ending at X */
  groupc->times[s] += now - state_start
  groupc->state_mask &= ~(1 << s)
  write_seqcount_end()

                                 read_seqcount_begin()
                                 memcpy(times, groupc->times)
                                 !read_seqcount_retry()

                                 if (state_mask & (1 << s))
                                   ; /* nope, state concluded */

                                 delta = times[s] - groupc->times_prev[s]
                                 /* times_prev ahead, sample is UINT_MAX-42 */
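
To see the wrap-around in isolation, here is a minimal userspace sketch
(not the kernel code; just the same u32 arithmetic with made-up values):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	/* The aggregator snooped live state ahead to X+42 ns, but the
	 * state then concluded at X, so the recorded time ends up 42 ns
	 * behind what the aggregator already accounted for. */
	uint32_t times      = 1000000; /* groupc->times[s], concluded at X */
	uint32_t times_prev = 1000042; /* times_prev[s], snooped up to X+42 */

	uint32_t sample = times - times_prev; /* unsigned wrap-around */

	/* Prints 4294967254: roughly UINT32_MAX nanoseconds, which is the
	 * ~4.294967 seconds showing up in cpu.pressure. */
	printf("sample = %u\n", (unsigned int)sample);
	return 0;
}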

The solution is to move the clock read of the state change into the
locked section. This way, live snooping cannot get ahead of the time
at which the state will conclude:

---

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 020d58967d4e..84dad1511d1e 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -769,12 +769,13 @@ static void record_times(struct psi_group_cpu *groupc, u64 now)
 }
 
 static void psi_group_change(struct psi_group *group, int cpu,
-			     unsigned int clear, unsigned int set, u64 now,
+			     unsigned int clear, unsigned int set,
 			     bool wake_clock)
 {
 	struct psi_group_cpu *groupc;
 	unsigned int t, m;
 	u32 state_mask;
+	u64 now;
 
 	lockdep_assert_rq_held(cpu_rq(cpu));
 	groupc = per_cpu_ptr(group->pcpu, cpu);
@@ -789,6 +790,7 @@ static void psi_group_change(struct psi_group *group, int cpu,
 	 * SOME and FULL time these may have resulted in.
 	 */
 	write_seqcount_begin(&groupc->seq);
+	now = cpu_clock(cpu);
 
 	/*
 	 * Start with TSK_ONCPU, which doesn't have a corresponding
@@ -899,18 +901,15 @@ void psi_task_change(struct task_struct *task, int clear, int set)
 {
 	int cpu = task_cpu(task);
 	struct psi_group *group;
-	u64 now;
 
 	if (!task->pid)
 		return;
 
 	psi_flags_change(task, clear, set);
 
-	now = cpu_clock(cpu);
-
 	group = task_psi_group(task);
 	do {
-		psi_group_change(group, cpu, clear, set, now, true);
+		psi_group_change(group, cpu, clear, set, true);
 	} while ((group = group->parent));
 }
 
@@ -919,7 +918,6 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 {
 	struct psi_group *group, *common = NULL;
 	int cpu = task_cpu(prev);
-	u64 now = cpu_clock(cpu);
 
 	if (next->pid) {
 		psi_flags_change(next, 0, TSK_ONCPU);
@@ -936,7 +934,7 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 				break;
 			}
 
-			psi_group_change(group, cpu, 0, TSK_ONCPU, now, true);
+			psi_group_change(group, cpu, 0, TSK_ONCPU, true);
 		} while ((group = group->parent));
 	}
 
@@ -974,7 +972,7 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 		do {
 			if (group == common)
 				break;
-			psi_group_change(group, cpu, clear, set, now, wake_clock);
+			psi_group_change(group, cpu, clear, set, wake_clock);
 		} while ((group = group->parent));
 
 		/*
@@ -986,7 +984,7 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 		if ((prev->psi_flags ^ next->psi_flags) & ~TSK_ONCPU) {
 			clear &= ~TSK_ONCPU;
 			for (; group; group = group->parent)
-				psi_group_change(group, cpu, clear, set, now, wake_clock);
+				psi_group_change(group, cpu, clear, set, wake_clock);
 		}
 	}
 }
@@ -997,8 +995,8 @@ void psi_account_irqtime(struct rq *rq, struct task_struct *curr, struct task_st
 	int cpu = task_cpu(curr);
 	struct psi_group *group;
 	struct psi_group_cpu *groupc;
-	u64 now, irq;
 	s64 delta;
+	u64 irq;
 
 	if (static_branch_likely(&psi_disabled))
 		return;
@@ -1011,7 +1009,6 @@ void psi_account_irqtime(struct rq *rq, struct task_struct *curr, struct task_st
 	if (prev && task_psi_group(prev) == group)
 		return;
 
-	now = cpu_clock(cpu);
 	irq = irq_time_read(cpu);
 	delta = (s64)(irq - rq->psi_irq_time);
 	if (delta < 0)
@@ -1019,12 +1016,15 @@ void psi_account_irqtime(struct rq *rq, struct task_struct *curr, struct task_st
 	rq->psi_irq_time = irq;
 
 	do {
+		u64 now;
+
 		if (!group->enabled)
 			continue;
 
 		groupc = per_cpu_ptr(group->pcpu, cpu);
 
 		write_seqcount_begin(&groupc->seq);
+		now = cpu_clock(cpu);
 
 		record_times(groupc, now);
 		groupc->times[PSI_IRQ_FULL] += delta;
@@ -1223,11 +1223,9 @@ void psi_cgroup_restart(struct psi_group *group)
 	for_each_possible_cpu(cpu) {
 		struct rq *rq = cpu_rq(cpu);
 		struct rq_flags rf;
-		u64 now;
 
 		rq_lock_irq(rq, &rf);
-		now = cpu_clock(cpu);
-		psi_group_change(group, cpu, 0, 0, now, true);
+		psi_group_change(group, cpu, 0, 0, true);
 		rq_unlock_irq(rq, &rf);
 	}
 }
Comment 2 Brandon Duffany 2024-08-27 18:52:04 UTC
Hi Johannes - thank you for looking into this so quickly!

For testing kernel changes, it's easiest for me to use kernel v5.15 and run it under a Firecracker microVM, so I had to adjust the patch slightly to remove the IRQ PSI changes.

I checked that the issue still reproduces on v5.15 before applying the patch, and that after applying the patch, the issue no longer reproduces.

Thank you!