Bug 205427
| Summary: | Qdisc running seqcount may cause performance issues | | |
|---|---|---|---|
| Product: | Networking | Reporter: | Wen Yang (wenyang) |
| Component: | IPV4 | Assignee: | Stephen Hemminger (stephen) |
| Status: | NEW | | |
| Severity: | normal | CC: | code, davem, eric.dumazet, wenyang, xiyou.wangcong |
| Priority: | P1 | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Kernel Version: | 4.9 | Subsystem: | |
| Regression: | No | Bisected commit-id: | |
Description
Wen Yang
2019-11-05 05:27:57 UTC
Commit edb09eb17ed8 ("net: sched: do not acquire qdisc spinlock in qdisc/class stats dump") may require further optimization.

```diff
 void
-__gnet_stats_copy_basic(struct gnet_stats_basic_packed *bstats,
+__gnet_stats_copy_basic(const seqcount_t *running,
+			struct gnet_stats_basic_packed *bstats,
 			struct gnet_stats_basic_cpu __percpu *cpu,
 			struct gnet_stats_basic_packed *b)
 {
+	unsigned int seq;
+
 	if (cpu) {
 		__gnet_stats_copy_basic_cpu(bstats, cpu);
-	} else {
+		return;
+	}
+	do {
+		if (running)
+			seq = read_seqcount_begin(running);
 		bstats->bytes = b->bytes;
 		bstats->packets = b->packets;
-	}
+	} while (running && read_seqcount_retry(running, seq));
 }
```

Wen Yang:

(In reply to Wen Yang from comment #1)
> edb09eb17ed8: "net: sched: do not acquire qdisc spinlock in qdisc/class
> stats dump" may require further optimization.
> [patch quoted above]

Just to copy the bstats counters, the read is repeated in a loop here. It may not be necessary.

Eric Dumazet (comment #3):

What kind of CPUs are used on this platform?

The 'running' sequence should eventually be 'released' by the netperf. It looks like CPU 80 never sees the low-order bit of the sequence being 0.

Your kernel warning is tainted; did anything happen before this soft lockup?

Also, 4.9 is too old for upstream. Please try to reproduce this on a more recent kernel release, ideally the latest.

Wen Yang:

(In reply to Eric Dumazet from comment #3)
> What kind of CPUs are used on this platform?
>
> The 'running' sequence should eventually be 'released' by the netperf.
>
> It looks like CPU 80 never sees the low-order bit of the sequence being 0.

Thank you very much for your comments. By analyzing several vmcores, we found that the lowest bit of Qdisc->running is indeed 1:

```
crash> Qdisc ffff887da5990c00
struct Qdisc {
  enqueue = 0xffffffff81651890 <pfifo_fast_enqueue>,
  dequeue = 0xffffffff81651770 <pfifo_fast_dequeue>,
  flags = 372,
  limit = 0,
  ops = 0xffffffff81d86040 <pfifo_fast_ops>,
  stab = 0x0,
  hash = {
    next = 0x0,
    pprev = 0xffff887da5991028
  },
  handle = 0,
  parent = 1,
  u32_node = 0x0,
  dev_queue = 0xffff887e24580000,
  rate_est = {
    bps = 0,
    pps = 0
  },
  cpu_bstats = 0x60818043e8d0,
  cpu_qstats = 0x60818043e930,
  gso_skb = {
    next = 0xffff887da5990c80,
    prev = 0xffff887da5990c80,
    qlen = 0,
    lock = { { rlock = { raw_lock = { { val = { counter = 0 }, { locked = 0 '\000', pending = 0 '\000' }, { locked_pending = 0, tail = 0 } } } } } }
  },
  q = {
    head = 0x0,
    tail = 0x0,
    qlen = 0,
    lock = { { rlock = { raw_lock = { { val = { counter = 0 }, { locked = 0 '\000', pending = 0 '\000' }, { locked_pending = 0, tail = 0 } } } } } }
  },
  bstats = {
    bytes = 0,
    packets = 0
  },
  running = {
    sequence = 4624263
  },
  qstats = {
    qlen = 0,
    backlog = 0,
    drops = 0,
    requeues = 0,
    overlimits = 0
  },
  state = 0,
  next_sched = 0xffff887cba53e000,
  skb_bad_txq = {
    next = 0xffff887da5990ce8,
    prev = 0xffff887da5990ce8,
    qlen = 0,
    lock = { { rlock = { raw_lock = { { val = { counter = 0 }, { locked = 0 '\000', pending = 0 '\000' }, { locked_pending = 0, tail = 0 } } } } } }
  },
  padded = 0,
  refcnt = {
    counter = 1
  },
  busylock = { { rlock = { raw_lock = { { val = { counter = 0 }, { locked = 0 '\000', pending = 0 '\000' }, { locked_pending = 0, tail = 0 } } } } } }
}
crash>
```

It may be that the packet traffic at that moment was too heavy, or there may be a bug in a third-party kernel module, causing this soft lockup. We may be able to avoid this problem by replacing read_seqcount_begin() with raw_read_seqcount_latch() here.
```diff
diff --git a/net/core/gen_stats.c b/net/core/gen_stats.c
index 1d653fbfcf52..8dad81e4be26 100644
--- a/net/core/gen_stats.c
+++ b/net/core/gen_stats.c
@@ -150,7 +150,7 @@ __gnet_stats_copy_basic(const seqcount_t *running,
 	}
 	do {
 		if (running)
-			seq = read_seqcount_begin(running);
+			seq = raw_read_seqcount_latch(running);
 		bstats->bytes = b->bytes;
 		bstats->packets = b->packets;
 	} while (running && read_seqcount_retry(running, seq));
```

I would greatly appreciate it if you could kindly give me some suggestions.

Eric Dumazet (comment #6):

I wonder if there is a bug then... about one Qdisc->running change being lost and the low-order bit always having the wrong value.

raw_read_seqcount_latch() cannot be used in the Qdisc context.

raw_read_seqcount_latch() is only used by code that keeps a dual set of data structures, as in kernel/time/timekeeping.c.

Qdisc->running seems to be stuck on your host, right?

What qdisc are you using?

If this is pfifo_fast, could you try another one, like fq?

Wen Yang:

(In reply to Eric Dumazet from comment #6)
> I wonder if there is a bug then... about one Qdisc->running change being
> lost and the low-order bit always having the wrong value.
>
> raw_read_seqcount_latch() cannot be used in the Qdisc context.
>
> raw_read_seqcount_latch() is only used by code that keeps a dual set
> of data structures, as in kernel/time/timekeeping.c.
>
> Qdisc->running seems to be stuck on your host, right?
>
> What qdisc are you using?
>
> If this is pfifo_fast, could you try another one, like fq?

Yes, Qdisc->running is stuck.

I read the code over and over and found that there might be a problem in this place:

```c
3159 static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
3160                                  struct net_device *dev,
3161                                  struct netdev_queue *txq)
3162 {
     ...
3198         } else if ((q->flags & TCQ_F_CAN_BYPASS) && !qdisc_qlen(q) &&
3199                    qdisc_run_begin(q)) {

             /*** The statements between these && conditions are not
              *** necessarily executed sequentially. If qdisc_run_begin(q)
              *** is executed first and the other conditions are not
              *** satisfied, the lowest bit of q->running will always be 1.
              ***/
     ...
3206                 qdisc_bstats_update(q, skb);
     ...
3215
3216                 qdisc_run_end(q);
3217                 rc = NET_XMIT_SUCCESS;
3218         }
```

This patch may fix it; please help me check it. Thanks.

```diff
diff --git a/net/core/dev.c b/net/core/dev.c
index 20c7a67..0ee8161 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3602,27 +3602,27 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
 	spinlock_t *root_lock = qdisc_lock(q);
 	struct sk_buff *to_free = NULL;
 	bool contended;
-	int rc;
+	int rc = NET_XMIT_SUCCESS;
 
 	qdisc_calculate_pkt_len(skb, q);
 
 	if (q->flags & TCQ_F_NOLOCK) {
-		if ((q->flags & TCQ_F_CAN_BYPASS) && q->empty &&
-		    qdisc_run_begin(q)) {
-			if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED,
-					      &q->state))) {
-				__qdisc_drop(skb, &to_free);
-				rc = NET_XMIT_DROP;
-				goto end_run;
-			}
-			qdisc_bstats_cpu_update(q, skb);
+		if ((q->flags & TCQ_F_CAN_BYPASS) && q->empty) {
+			if (qdisc_run_begin(q)) {
+				if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED,
+						      &q->state))) {
+					__qdisc_drop(skb, &to_free);
+					rc = NET_XMIT_DROP;
+					goto end_run;
+				}
+				qdisc_bstats_cpu_update(q, skb);
 
-			rc = NET_XMIT_SUCCESS;
-			if (sch_direct_xmit(skb, q, dev, txq, NULL, true))
-				__qdisc_run(q);
+				if (sch_direct_xmit(skb, q, dev, txq, NULL, true))
+					__qdisc_run(q);
 
 end_run:
-			qdisc_run_end(q);
+				qdisc_run_end(q);
+			}
 		} else {
 			rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
 			qdisc_run(q);
@@ -3647,26 +3647,27 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
 	if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
 		__qdisc_drop(skb, &to_free);
 		rc = NET_XMIT_DROP;
-	} else if ((q->flags & TCQ_F_CAN_BYPASS) && !qdisc_qlen(q) &&
-		   qdisc_run_begin(q)) {
-		/*
-		 * This is a work-conserving queue; there are no old skbs
-		 * waiting to be sent out; and the qdisc is not running -
-		 * xmit the skb directly.
-		 */
+	} else if ((q->flags & TCQ_F_CAN_BYPASS) && !qdisc_qlen(q)) {
+		if (qdisc_run_begin(q)) {
+			/*
+			 * This is a work-conserving queue; there are no old skbs
+			 * waiting to be sent out; and the qdisc is not running -
+			 * xmit the skb directly.
+			 */
-		qdisc_bstats_update(q, skb);
+			qdisc_bstats_update(q, skb);
 
-		if (sch_direct_xmit(skb, q, dev, txq, root_lock, true)) {
-			if (unlikely(contended)) {
-				spin_unlock(&q->busylock);
-				contended = false;
+			if (sch_direct_xmit(skb, q, dev, txq, root_lock, true)) {
+				if (unlikely(contended)) {
+					spin_unlock(&q->busylock);
+					contended = false;
+				}
+				__qdisc_run(q);
 			}
-			__qdisc_run(q);
-		}
-		qdisc_run_end(q);
-		rc = NET_XMIT_SUCCESS;
+			qdisc_run_end(q);
+			rc = NET_XMIT_SUCCESS;
+		}
 	} else {
 		rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
 		if (qdisc_run_begin(q)) {
```

Eric Dumazet (comment #9):

First of all, you have not tried a recent kernel.

Hint: the problem might have been solved already.

Wen Yang:

(In reply to Eric Dumazet from comment #9)
> First of all, you have not tried a recent kernel.
>
> Hint: the problem might have been solved already.

Thanks for your comments. We have analyzed the qdisc code and the vmcore again.

1. The flags of the stuck Qdisc:

```
crash> Qdisc.flags -x ffff887da5990c00
  flags = 0x174
```

Corresponding code:

```c
#define TCQ_F_CAN_BYPASS	4
...
#define TCQ_F_NOLOCK		0x100	/* qdisc does not require locking */
```

This is a NOLOCK qdisc.

2. The stats of the stuck Qdisc:

```
q = {
  head = 0x0,
  tail = 0x0,
  qlen = 0x0,
  ...
},
...
bstats = {
  bytes = 0x0,
  packets = 0x0
},
...
qstats = {
  qlen = 0x0,
  backlog = 0x0,
  drops = 0x0,
  requeues = 0x0,
  overlimits = 0x0
},
```

bstats, qstats and q.qlen are all 0, so we may speculate that qdisc_reset() has been executed.

3. This patch is very likely to fix this problem; we will test it:
d518d2ed8640 ("net/sched: fix race between deactivation and dequeue for NOLOCK qdisc")

--
Regards,

Eric Dumazet:

I asked you if you tried another qdisc than pfifo_fast.
You are looking at the stats of the qdisc, but pfifo_fast has per-cpu stats, which are not seen in the 'struct Qdisc' storage.