Bug 6986 - e1000 doesn't update trafic counters frequently enough
Summary: e1000 doesn't update trafic counters frequently enough
Status: CLOSED CODE_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: Network (show other bugs)
Hardware: i386 Linux
: P2 low
Assignee: Auke Kok
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2006-08-11 02:27 UTC by Olivier Cortes
Modified: 2007-12-04 09:42 UTC (History)
5 users (show)

See Also:
Kernel Version: 2.6.15
Subsystem:
Regression: ---
Bisected commit-id:


Attachments
patch updated for 2.6.18-rc4 (428 bytes, patch)
2006-08-11 03:00 UTC, Tomasz Torcz
Details | Diff
Proposed patch for 2.6.21-rc5 (949 bytes, patch)
2007-04-11 08:47 UTC, Olaf Kirch
Details | Diff

Description Olivier Cortes 2006-08-11 02:27:00 UTC
Most recent kernel where this bug did not occur: none (this is still occuring in
2.6.18-rc4)
Distribution: Ubuntu
Hardware Environment: IBM T40p
Software Environment: Ubuntu Dapper up-to-date, 686 kernel
Problem Description: see http://lkml.org/lkml/2003/12/20/30

Steps to reproduce: just use any graphing applet and set the update interval
smaller than 2 seconds (netspeed has 1 second per default, I use it at 500ms)

see Ubuntu bug :
https://launchpad.net/distros/ubuntu/+source/linux-source-2.6.15/+bug/55989
Comment 1 Tomasz Torcz 2006-08-11 03:00:45 UTC
Created attachment 8757 [details]
patch updated for 2.6.18-rc4
Comment 2 Auke Kok 2006-08-11 14:22:00 UTC
This isn't a bug, that timer interval was chosen for a good reason too.

If people are inclined to choose a different timer interval, perhaps one of the
following should be more attractive:

1) all drivers use a standard timer interval for this
2) the timer interval should be changeable by the user (using a module parameter
for instance, or ethtool).

Since the interval has been 2 seconds for a very long time, I would rather leave
that as the default. It is more than enough for most purposes, and running it
more often should not generally be needed at all (it just doesn't make much
sense for normal people - they do not watch their network traffic every second).

Cheers,

Auke
Comment 3 Tomasz Torcz 2006-08-12 08:11:38 UTC
If you look at Ubuntu bug you will see that people indeed watch their traffic
every second.
Comment 4 Auke Kok 2006-09-01 17:00:43 UTC
promiscuous mode provides more than sufficient abilities to monitor network at
<2s intervals. Normal mode really does not need this interval to be changed. 

Jeff, please mark as "WILL_NOT_FIX" or "INVALID"
Comment 5 Jean Delvare 2006-10-10 05:07:49 UTC
Auke, I have to object.

First of all, the current code doesn't only break network monitoring intervals <
2 seconds. It basically breaks any interval which isn't exactly a multiple of 2
seconds. Just try to set the interval to 3 or 5 seconds and you'll see a
saw-tooth diagram at constant transfer speed. Of course, the longer the
interval, the less visible the problem will be, but it's still there unless the
interval is an even number of seconds. 5 seconds is a perfectly reasonable
interval for a regular user, so it shouldn't break.

Secondly, there are other drivers which update most of their statistics every 2
seconds, and I agree there's nothing wrong with that. For example the e100
driver does exactly that. The difference is that in the e100 driver, the most
important values (rx/tx_bytes and rx/tx_packets) are _not_ updated that way.
Instead they are computed by the driver in real time, just as the 3c59x and
8139too drivers do, to name only the ones I used on a regular basis.

So this is a real bug, and it needs to be fixed. It is reported here, it was
reported on LKML twice, it was reported on Ubuntu and Suse. Users appear to
care. I agree that we do not want to change the timer from 2 * HZ to HZ. Instead
we want to handle the main statistics in real time.

Jeff, please do not close this bug. Instead, feel free to hand it over to me,
and I'll see what I can do. My new laptop (Thinkpad T60p) is affected, so I want
this bug to be fixed.
Comment 6 Auke Kok 2006-10-10 08:52:15 UTC
-- "My new laptop (Thinkpad T60p) is affected, so I want this bug to be fixed."

That is completely irrelevant. Please don't take this personal, I don't.

-- "So this is a real bug, and it needs to be fixed."

I disagree that this is a bug: It is a feature request. I don't see how it is
broken, so lets put the right label on this. See comment #2.

network statistics are always transient numbers where your information lags
behind what is actually happening. I fail to see how reducing the counter read
interval to 1 second will solve your "problem" at all. Next week you'll try
something like gkrellm which updates 20 times per second and we're back to
square one.
Comment 7 Jean Delvare 2006-10-11 04:01:35 UTC
Auke, the fact that I am affected _is_ relevant in an open source development
model. Bugs which affect developers directly are more likely do be investigated
and fixed quickly, because the same person has the will and the skills to do it.
This is all I meant here: I am affected by the problem, and I am willing to put
time and energy to get it solved. I'm sorry if you understood it differently.

As for the bug vs. feature request labelling, granted, I didn't find any formal
specification stating how frequently a Linux ethernet driver must update its
stat counters, so it's not a bug per se. But the behavior of the e1000 driver
differs from the behavior of many popular ethernet drivers (e100, 8139too,
3c59x...) in that respect, so the least I can say is that it is misleading for
the user. As such, it looks to me like something we want to change.

I have a patch ready, which I will submit to the netdev mailing list for review
and comments later today.
Comment 8 Jesse Brandeburg 2006-12-18 15:24:10 UTC
Hi Jean, did you ever submit the patch?  In order for us to ACK this patch
however I would like to get some warm fuzzy feelings by running some performance
benchmarks to make sure the CPU utilization doesn't increase due to these "real
time" changes.

As it stands now, the 7.3.20 (and 7.3.15) drivers are counting bytes rx and tx
per interrupt anyway, so storing those in a hot cache line shouldn't be too big
a deal.  (this byte counting is used for dynamic interrupt moderation through
the ITR register)

Comment 9 Jean Delvare 2006-12-20 09:08:22 UTC
Jesse, I did submit the patch to netdev two months ago:
http://marc.theaimsgroup.com/?t=116056673400001&r=1&w=2
But there were some minor differences between the hardware and software
counters, which I wanted to investigate. There was also an objection about TSO
packets possibly being not handled properly. Unfortunately I didn't have the
time to work on this patch since then.

Given the recent changes that were done to the e1000 driver, my patch probably
needs some more work. Feel tree to integrate it into the new driver if you can,
as I probably won't have the time to work on it myself before another month or so.
Comment 10 Jean Delvare 2007-04-10 00:54:16 UTC
Sorry but it seems I just don't have the time to work on this :(
Comment 11 Olaf Kirch 2007-04-11 08:45:25 UTC
What's wrong with having e1000_get_stats call e1000_update_stats? That way
you always get current stats, but without incurring any overhead for people
who don't use this. Other drivers seem to do the same.

Comment 12 Olaf Kirch 2007-04-11 08:47:18 UTC
Created attachment 11125 [details]
Proposed patch for 2.6.21-rc5

This patch makes e1000_get_stats call e1000_update_stats
Comment 13 Jesse Brandeburg 2007-04-11 08:59:46 UTC
we actually had this suggested patch in e1000 before, but we took it out because
there were (probably X) applications under some distros that were calling our
get_stats every 100ms or so, causing a huge amount of CPU utilization due to all
the register reads.  This also hurt performance due to all the PIO.  I don't
think it is the right way to fix this issue.
Comment 14 Olaf Kirch 2007-04-11 23:40:36 UTC
Well, I would call this an application bug. Apps shouldn't poll that often
if it hurts. And there are a number of NICs that do retrieve the most current
counters in get_stats, rather than updating them asynchronously.

The effects synch updates could be mitigated in two ways

 -	in get_stats, call e1000_update_stats if the most recent
	update was more than 0.5 seconds ago

 -	in get_stats, update the byte/packet counters only, but let
	the other - less critical - counters lag.

Of course, there's the alternative solution to have a callback that updates
the NIC's statistics, which gets called periodically, and convert all NICs
to that method. If you think this is worth the effort, I can certainly put
together a patch proposal and submit that to netdev.
Comment 15 Natalie Protasevich 2007-07-19 20:03:00 UTC
Was there a resolution with this issue, any update?
Thanks.
Comment 16 Auke Kok 2007-07-20 09:17:58 UTC
This is on the TODO list to improve. We will add partial real-time counters for packets/bytes as requested.
Comment 17 Auke Kok 2007-12-04 09:42:20 UTC
Fix merged in jgarzik/netdev-2.6 #upstream. building on the ITR code we can reuse accounting and allow real-time stats for packets/bytes to be updated. ethtool stats still show the (slow updated) hardware packet/byte counters.

http://git.kernel.org/?p=linux/kernel/git/jgarzik/netdev-2.6.git;a=commitdiff;h=af1710a493c5cb668c38d4008340443039809790;hp=94e3b7161fdbbc79221fda7a8bbff822fec77245

Note You need to log in before you can comment on or make changes to this bug.