Bug 11112
Summary: | On turning on UIE interrupt on rtc, next read/select returns immediately and not on update boundary | ||
---|---|---|---|
Product: | Timers | Reporter: | W Unruh (unruh) |
Component: | Realtime Clock | Assignee: | David Brownell (dbrownell) |
Status: | RESOLVED CODE_FIX | ||
Severity: | normal | ||
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 2.6.24 | Subsystem: | |
Regression: | --- | Bisected commit-id: | |
Attachments: |
updated test program
rtc-at91rm9200 patch rtc-cmos patch |
Description
W Unruh
2008-07-17 15:25:15 UTC
Created attachment 16871 [details]
updated test program
Curious. I cleaned the code up a bit (including Lindent) and fixed the bug in usec calculation (negative!) ... observed the same thing, using rtc-cmos (but haven't yet tried other RTCs).
I noticed that the HPET glue code has even worse trouble. Sometimes the first *two* returns seem bogus. Once I saw the first return be too fast, but the others showed right when expected. And then there are test runs like this one:
Read using select(2) on /dev/rtc:
1 0.000671 0290
2 0.915634 0190
3 1.890956 0190
4 3.369143 0190
5 3.910837 0190
I know the HPET glue is flakey (doesn't synch to hardware at all), but that's not even vaguely good.
Yes, that usec code was certainly wrong. Sorry. It just happened to work on one of my machines each time I used it. On machines with just rtc (ie older machines without hpet) using Mandriva 7.0 (kernel 2.6.17), the returns seemed to be fine. On the more recent machines which have an hpet, I get that weird behaviour of the first one being an immediate return. I did not ever see your case where one of the later returns (4 in your case on the hpet machines) was way off. On Thu, 17 Jul 2008, bugme-daemon@bugzilla.kernel.org wrote: > http://bugzilla.kernel.org/show_bug.cgi?id=11112 > > > > > > ------- Comment #1 from dbrownell@users.sourceforge.net 2008-07-17 17:13 > ------- > Created an attachment (id=16871) > --> (http://bugzilla.kernel.org/attachment.cgi?id=16871&action=view) > updated test program > > Curious. I cleaned the code up a bit (including Lindent) and fixed the bug > in > usec calculation (negative!) ... observed the same thing, using rtc-cmos (but > haven't yet tried other RTCs). > > I noticed that the HPET glue code has even worse trouble. Sometimes the > first > *two* returns seem bogus. Once I saw the first return be too fast, but the > others showed right when expected. And then there are test runs like this > one: > > Read using select(2) on /dev/rtc: > 1 0.000671 0290 > 2 0.915634 0190 > 3 1.890956 0190 > 4 3.369143 0190 > 5 3.910837 0190 > > I know the HPET glue is flakey (doesn't synch to hardware at all), but that's > not even vaguely good. > > > Reply-To: david-b@pacbell.net On Thursday 17 July 2008, bugme-daemon@bugzilla.kernel.org wrote: > On machines with just rtc (ie older machines without hpet) using Mandriva > 7.0 (kernel 2.6.17), the returns seemed to be fine. That's also stuff not using the new RTC framework. > On the more recent machines which have an hpet, I get that weird > behaviour of the first one being an immediate return. There are multiple variables here. The HPET goofage is IMO a very separate bug from the "immediate return" thing, which seems to come from some issue the new framework code. I am not sure. The immediate return was already there under Mandriva 2007 (kernel 2.6.17) on machines using hpet. I submitted a bug fix to chrony to get around it when I found it about 1.5 yr ago-- should have submitted it to the kernel as well I guess, but I had no test case then and did not think about it. Ie, the immediate return was already there on 2.6.17 if the machine had an hpet emulation for rtc. On Thu, 17 Jul 2008, bugme-daemon@bugzilla.kernel.org wrote: > http://bugzilla.kernel.org/show_bug.cgi?id=11112 > > > > > > ------- Comment #3 from anonymous@kernel-bugs.osdl.org 2008-07-17 23:25 > ------- > Reply-To: david-b@pacbell.net > > On Thursday 17 July 2008, bugme-daemon@bugzilla.kernel.org wrote: >> On machines with just rtc (ie older machines without hpet) using Mandriva >> 7.0 (kernel 2.6.17), the returns seemed to be fine. > > That's also stuff not using the new RTC framework. > > >> On the more recent machines which have an hpet, I get that weird >> behaviour of the first one being an immediate return. > > There are multiple variables here. The HPET goofage is IMO a > very separate bug from the "immediate return" thing, which > seems to come from some issue the new framework code. > > > OK, I am now very confused. I am not sure if this is the place, but looking at Mandriva 2008.1, the original did NOT have the the HPET_EMULATE_RTC confugured at all, but it did load the rtc-class module. HOwever, that led to no end of problems on many machines. On one of mine, all reads of rtc after the first after switching on UIE would block. On other peoples, the dates returned by the rtc were wildly out, and would change without warning. Ie, the rtc_class thing simply did not work for many people (It seemed to depend on the exact hardware). Mandriva have now released a kernel with HPET_EMULATE_RTC switched on again, and this seems to solved the problems on those hardwares( except the "first read or select on UIE returns immediately and not on the update" problem). Reading the config options on RTC_DRV_CMOS it seems to say that you need a separate device as well as rtc-cmos, or is this driver sufficient to get at the various classic devices. Ie, what does "Your system will need to define the platform device used by this driver" mean? On Thu, 17 Jul 2008, bugme-daemon@bugzilla.kernel.org wrote: > http://bugzilla.kernel.org/show_bug.cgi?id=11112 > > > > > > ------- Comment #3 from anonymous@kernel-bugs.osdl.org 2008-07-17 23:25 > ------- > Reply-To: david-b@pacbell.net > > On Thursday 17 July 2008, bugme-daemon@bugzilla.kernel.org wrote: >> On machines with just rtc (ie older machines without hpet) using Mandriva >> 7.0 (kernel 2.6.17), the returns seemed to be fine. > > That's also stuff not using the new RTC framework. > > >> On the more recent machines which have an hpet, I get that weird >> behaviour of the first one being an immediate return. > > There are multiple variables here. The HPET goofage is IMO a > very separate bug from the "immediate return" thing, which > seems to come from some issue the new framework code. > > > Reply-To: david-b@pacbell.net Using rtc-at91rm9200: Read using select(2) on /dev/rtc: 1 0.000506 0190 2 0.216502 0190 3 1.216509 0190 4 2.216460 0190 5 3.216454 0190 Makes me wonder if maybe the drivers should be scrubbing some old IRQ status ... the framework code didn't look to have any obvious bugs, but bugs like that could lurk for a long time. Just looking at the code in rtd-dev.c. It looks like the rtc emulate uie is always used if the RTC_INTF_DEV_UIE_EMUL is set even if the rtc supports uie interrupts just fine. Or am I misreading the code? On Fri, 18 Jul 2008, bugme-daemon@bugzilla.kernel.org wrote: > http://bugzilla.kernel.org/show_bug.cgi?id=11112 > > > > > > ------- Comment #6 from anonymous@kernel-bugs.osdl.org 2008-07-18 13:15 > ------- > Reply-To: david-b@pacbell.net > > Using rtc-at91rm9200: > > Read using select(2) on /dev/rtc: > 1 0.000506 0190 > 2 0.216502 0190 > 3 1.216509 0190 > 4 2.216460 0190 > 5 3.216454 0190 > > Makes me wonder if maybe the drivers should be scrubbing some > old IRQ status ... the framework code didn't look to have any > obvious bugs, but bugs like that could lurk for a long time. > > > Yep, the problem goes away if old IRQ status gets scrubbed. Patches ready for rm9200, x86/no-hpet, x86-hpet. Created attachment 16882 [details]
rtc-at91rm9200 patch
This shows the canonical fix for drivers with this bug: scrub old IRQ status before enabling new event IRQs.
Created attachment 16883 [details] rtc-cmos patch This resolves the problems for me: - extra IRQ without HPET - extra IRQ with HPET - flakey irq reports with HPET I think the last of those came from the patch which needs to be applied first: http://marc.info/?l=linux-kernel&m=121631884829116&w=2 Reply-To: david-b@pacbell.net On Friday 18 July 2008, bugme-daemon@bugzilla.kernel.org wrote: > Reading the config options on RTC_DRV_CMOS it seems to say that you need a > separate device as well as rtc-cmos, or is this driver sufficient to get at > the various classic devices. Ie, what does "Your system will need to define > the platform device used by this driver" mean? It means that it follows the driver model ... if there's a PNP device defined, or a platform device defined, this driver binds to that device. But unlike the legacy driver, it will NOT go and poke at the hardware in the hope that it's guessing right. That's probably the essence of what makes the legacy driver a legacy driver: it completely disregards the driver model. Here is another example. Read using select(2) on /dev/rtc: 1 1.181705 190 2 2.201611 190 3 3.671093 190 4 4.181635 190 5 5.671149 190 This is as bad as David's example. Note that this does include the special hpet patch as well. I have been running a whole series of the tests in a row, and get a few similar to the above one. Note also that the first interrupt occured 1.2 sec after interrupts were switched on (It should have been less than 1 sec.) Ie, at least the hpet rtc-cmos still has trouble. On Fri, 18 Jul 2008, bugme-daemon@bugzilla.kernel.org wrote: > http://bugzilla.kernel.org/show_bug.cgi?id=11112 > > > > > > ------- Comment #11 from anonymous@kernel-bugs.osdl.org 2008-07-18 21:19 > ------- > Reply-To: david-b@pacbell.net > > On Friday 18 July 2008, bugme-daemon@bugzilla.kernel.org wrote: >> Reading the config options on RTC_DRV_CMOS it seems to say that you need a >> separate device as well as rtc-cmos, or is this driver sufficient to get at >> the various classic devices. Ie, what does "Your system will need to define >> the platform device used by this driver" mean? > > It means that it follows the driver model ... if there's a PNP device > defined, or a platform device defined, this driver binds to that device. > > But unlike the legacy driver, it will NOT go and poke at the hardware > in the hope that it's guessing right. That's probably the essence of > what makes the legacy driver a legacy driver: it completely disregards > the driver model. > > > There are still problems. Although there is no longer spurious first interrupt, the problem David found in his comment 1 is still there occasionally. Here are an couple of examples: 1 1.181705 190 2 2.201611 190 3 3.671093 190 4 4.181635 190 5 5.671149 190 Note that the first interrupt occurs 1.2 sec after switchon (it should be less than 1) and the interrupt times are way off. And another 1 0.854310 0190 2 2.374260 0190 3 2.894221 0190 4 3.964241 0190 5 5.284241 0190 These were generated by running rtc-uie in a loop with variable sleep times between the calls to rtc-uie. Mandriva cooker kernel 2.6.26 with all of David's patches installed. The patches do resolve the spurious first interrupt problem that started all this. (Note I do not know if my emails with some of this will go through, but if they do, ignore them in favour of this comment posted from the web) I observed about 10% of a mid-sized sample (40 runs) as having these bizarre symptoms. That's only with HPET ... most runs, and all without HPET, give nice regular one second intervals. There are a number of other issues with the HPET interposed IRQs ... like not actually being in phase with the hardware update IRQs, and requiring way too many IRQs (64 Hz!). One more bug seems unsurprising, though I'd like to know just what's causing this ...
Yes, that is also what I found, except on my machine it was (usually) a
lot less than 10%. Not sure what triggers it. I tried a run of 50 while the
machine was compiling the kernel, and that did not seem to make a
difference, so it does not seem to be interrupt collisions ( and many
milliseconds on a Core2 duo 2.4GHz machine sounds way way too long to be
interrupt collisions.)
Do I open another bug report on this?
> There are a number of other issues with the HPET interposed IRQs ... like not
> actually being in phase with the hardware update IRQs, and requiring way too
> many IRQs (64 Hz!). One more bug seems unsurprising, though I'd like to know
> just what's causing this ...
What do you mean not being in phase with the hardware update irqs? I
thought it was that the HPET routine simply captured the rtc generated
irq, and that you had to go through the hpet.c interrupt handler to get the
rtc IRQ. But is the hardware actually behaving differently when the HPET is
activated? Is there not some way of simply writing hpet.c so that hpet does
not intercept the rtc irq, or does it happen in hardware?
>
>
>
Reply-To: david-b@pacbell.net > Do I open another bug report on this? Different bug == different bugreport. :) > > There are a number of other issues with the HPET interposed IRQs ... like > not > > actually being in phase with the hardware update IRQs, and requiring way > too > > many IRQs (64 Hz!). One more bug seems unsurprising, though I'd like to > know > > just what's causing this ... > > What do you mean not being in phase with the hardware update irqs? I > thought it was that the HPET routine simply captured the rtc generated > irq, and that you had to go through the hpet.c interrupt handler to get the > rtc IRQ. But is the hardware actually behaving differently when the HPET is > activated? Is there not some way of simply writing hpet.c so that hpet does > not intercept the rtc irq, or does it happen in hardware? See the HPET spec. Basically when the HPET is put into "legacy replacement" mode -- what I call "broken" mode -- the RTC and the i8254 timer interrupts are completely disconnected, with the first two HPET comparator IRQs spliced in where they would normally appear. On Wed, 23 Jul 2008, bugme-daemon@bugzilla.kernel.org wrote: > Reply-To: david-b@pacbell.net > >> Do I open another bug report on this? > > Different bug == different bugreport. :) Will do. It is even worse than either of us expected. Here is a run where I reset the counter by -5 every time an "error (difference is bigger than 1ms froma second. The time of the interrupt changes discontinuously to a new long term run The * indicate "errors". Note that at the 18 sec mark the interrupt time has suddenly shifted by over 9ms from what would appear to be the previous "correct" interrupt time. Ie, before that the update seemed to occur at .652 sec and afterwards at .562 sec. Ie, the phase offset between the HPET and the RTC seems to have shifted by 9ms, which is completely nuts. 1 0.782692 0190 -3 * 1.652687 0190 -2 2.652684 0190 -1 3.652682 0190 0 4.652685 0190 1 5.652683 0190 2 6.652683 0190 -2 * 8.152767 0190 -1 8.652708 0190 0 9.652697 0190 1 10.652695 0190 2 11.652697 0190 3 12.652641 0190 -1 * 14.152733 0190 0 14.652642 0190 1 15.652642 0190 2 16.652792 0190 3 17.652640 0190 -1 * 18.562086 0190 0 19.562064 0190 1 20.562039 0190 2 21.562015 0190 > > >>> There are a number of other issues with the HPET interposed IRQs ... like >>> not >>> actually being in phase with the hardware update IRQs, and requiring way >>> too >>> many IRQs (64 Hz!). One more bug seems unsurprising, though I'd like to >>> know >>> just what's causing this ... >> >> What do you mean not being in phase with the hardware update irqs? I >> thought it was that the HPET routine simply captured the rtc generated >> irq, and that you had to go through the hpet.c interrupt handler to get the >> rtc IRQ. But is the hardware actually behaving differently when the HPET is >> activated? Is there not some way of simply writing hpet.c so that hpet does >> not intercept the rtc irq, or does it happen in hardware? > > See the HPET spec. Basically when the HPET is put into "legacy replacement" > mode -- what I call "broken" mode -- the RTC and the i8254 timer interrupts > are completely disconnected, with the first two HPET comparator IRQs spliced > in where they would normally appear. > > > |