Most recent kernel where this bug did not occur: none (this is still occuring in 2.6.18-rc4) Distribution: Ubuntu Hardware Environment: IBM T40p Software Environment: Ubuntu Dapper up-to-date, 686 kernel Problem Description: see http://lkml.org/lkml/2003/12/20/30 Steps to reproduce: just use any graphing applet and set the update interval smaller than 2 seconds (netspeed has 1 second per default, I use it at 500ms) see Ubuntu bug : https://launchpad.net/distros/ubuntu/+source/linux-source-2.6.15/+bug/55989
Created attachment 8757 [details] patch updated for 2.6.18-rc4
This isn't a bug, that timer interval was chosen for a good reason too. If people are inclined to choose a different timer interval, perhaps one of the following should be more attractive: 1) all drivers use a standard timer interval for this 2) the timer interval should be changeable by the user (using a module parameter for instance, or ethtool). Since the interval has been 2 seconds for a very long time, I would rather leave that as the default. It is more than enough for most purposes, and running it more often should not generally be needed at all (it just doesn't make much sense for normal people - they do not watch their network traffic every second). Cheers, Auke
If you look at Ubuntu bug you will see that people indeed watch their traffic every second.
promiscuous mode provides more than sufficient abilities to monitor network at <2s intervals. Normal mode really does not need this interval to be changed. Jeff, please mark as "WILL_NOT_FIX" or "INVALID"
Auke, I have to object. First of all, the current code doesn't only break network monitoring intervals < 2 seconds. It basically breaks any interval which isn't exactly a multiple of 2 seconds. Just try to set the interval to 3 or 5 seconds and you'll see a saw-tooth diagram at constant transfer speed. Of course, the longer the interval, the less visible the problem will be, but it's still there unless the interval is an even number of seconds. 5 seconds is a perfectly reasonable interval for a regular user, so it shouldn't break. Secondly, there are other drivers which update most of their statistics every 2 seconds, and I agree there's nothing wrong with that. For example the e100 driver does exactly that. The difference is that in the e100 driver, the most important values (rx/tx_bytes and rx/tx_packets) are _not_ updated that way. Instead they are computed by the driver in real time, just as the 3c59x and 8139too drivers do, to name only the ones I used on a regular basis. So this is a real bug, and it needs to be fixed. It is reported here, it was reported on LKML twice, it was reported on Ubuntu and Suse. Users appear to care. I agree that we do not want to change the timer from 2 * HZ to HZ. Instead we want to handle the main statistics in real time. Jeff, please do not close this bug. Instead, feel free to hand it over to me, and I'll see what I can do. My new laptop (Thinkpad T60p) is affected, so I want this bug to be fixed.
-- "My new laptop (Thinkpad T60p) is affected, so I want this bug to be fixed." That is completely irrelevant. Please don't take this personal, I don't. -- "So this is a real bug, and it needs to be fixed." I disagree that this is a bug: It is a feature request. I don't see how it is broken, so lets put the right label on this. See comment #2. network statistics are always transient numbers where your information lags behind what is actually happening. I fail to see how reducing the counter read interval to 1 second will solve your "problem" at all. Next week you'll try something like gkrellm which updates 20 times per second and we're back to square one.
Auke, the fact that I am affected _is_ relevant in an open source development model. Bugs which affect developers directly are more likely do be investigated and fixed quickly, because the same person has the will and the skills to do it. This is all I meant here: I am affected by the problem, and I am willing to put time and energy to get it solved. I'm sorry if you understood it differently. As for the bug vs. feature request labelling, granted, I didn't find any formal specification stating how frequently a Linux ethernet driver must update its stat counters, so it's not a bug per se. But the behavior of the e1000 driver differs from the behavior of many popular ethernet drivers (e100, 8139too, 3c59x...) in that respect, so the least I can say is that it is misleading for the user. As such, it looks to me like something we want to change. I have a patch ready, which I will submit to the netdev mailing list for review and comments later today.
Hi Jean, did you ever submit the patch? In order for us to ACK this patch however I would like to get some warm fuzzy feelings by running some performance benchmarks to make sure the CPU utilization doesn't increase due to these "real time" changes. As it stands now, the 7.3.20 (and 7.3.15) drivers are counting bytes rx and tx per interrupt anyway, so storing those in a hot cache line shouldn't be too big a deal. (this byte counting is used for dynamic interrupt moderation through the ITR register)
Jesse, I did submit the patch to netdev two months ago: http://marc.theaimsgroup.com/?t=116056673400001&r=1&w=2 But there were some minor differences between the hardware and software counters, which I wanted to investigate. There was also an objection about TSO packets possibly being not handled properly. Unfortunately I didn't have the time to work on this patch since then. Given the recent changes that were done to the e1000 driver, my patch probably needs some more work. Feel tree to integrate it into the new driver if you can, as I probably won't have the time to work on it myself before another month or so.
Sorry but it seems I just don't have the time to work on this :(
What's wrong with having e1000_get_stats call e1000_update_stats? That way you always get current stats, but without incurring any overhead for people who don't use this. Other drivers seem to do the same.
Created attachment 11125 [details] Proposed patch for 2.6.21-rc5 This patch makes e1000_get_stats call e1000_update_stats
we actually had this suggested patch in e1000 before, but we took it out because there were (probably X) applications under some distros that were calling our get_stats every 100ms or so, causing a huge amount of CPU utilization due to all the register reads. This also hurt performance due to all the PIO. I don't think it is the right way to fix this issue.
Well, I would call this an application bug. Apps shouldn't poll that often if it hurts. And there are a number of NICs that do retrieve the most current counters in get_stats, rather than updating them asynchronously. The effects synch updates could be mitigated in two ways - in get_stats, call e1000_update_stats if the most recent update was more than 0.5 seconds ago - in get_stats, update the byte/packet counters only, but let the other - less critical - counters lag. Of course, there's the alternative solution to have a callback that updates the NIC's statistics, which gets called periodically, and convert all NICs to that method. If you think this is worth the effort, I can certainly put together a patch proposal and submit that to netdev.
Was there a resolution with this issue, any update? Thanks.
This is on the TODO list to improve. We will add partial real-time counters for packets/bytes as requested.
Fix merged in jgarzik/netdev-2.6 #upstream. building on the ITR code we can reuse accounting and allow real-time stats for packets/bytes to be updated. ethtool stats still show the (slow updated) hardware packet/byte counters. http://git.kernel.org/?p=linux/kernel/git/jgarzik/netdev-2.6.git;a=commitdiff;h=af1710a493c5cb668c38d4008340443039809790;hp=94e3b7161fdbbc79221fda7a8bbff822fec77245