f



Annoying sync loss every 20 minutes on Poweredge 2600, RH7.3

Now this is really annoying me now. I simply cannot find the reason, why this is
happening.

Is there any hope to turn this off?

Regards,

Edin

-- 
Sicherheit gibt es nicht umsonst - zu keiner Zeit und nirgendwo!
(Das BSI IT-Sicherheitshandbuch)

0
Edin
7/3/2003 10:10:01 AM
comp.protocols.time.ntp 4895 articles. 2 followers. Post Follow

12 Replies
522 Views

Similar Articles

[PageSpeed] 30

>Now this is really annoying me now. I simply cannot find the reason, why this is
>happening.
>
>Is there any hope to turn this off?

This should be in the FAQ.  Have you looked there?

One common cause is a software bug in the kernel timekeeping area.
It adds too much/little to the time on each interrupt.

Or your crystal could be way off.

Or you could be missing interrupts.  Does it happen when the system
is idle, or only when it's busy?

Laptops may have troubles with power-saving modes.  I'm not familiar
with this area.


What is your "drift"?  It's usually in /etc/ntp.drift, or
something like that.  (Look in your config file.)  Or try the
ntptime utility where it's called "frequency".

If the drift is off by more than 500 ppm, ntpd has a sanity
check/limit and won't correct enough to keep the clock accurate.
So it drifts, and after a while it drifts far enough that ntpd
takes a big step, prints that message, and starts over.

-- 
The suespammers.org mail server is located in California.  So are all my
other mailboxes.  Please do not send unsolicited bulk e-mail or unsolicited
commercial e-mail to my suespammers.org address or any of my other addresses.
These are my opinions, not necessarily my employer's.  I hate spam.

0
hmurray
7/3/2003 6:26:52 PM
Hi Hal and thanks for the reply,


Hal Murray wrote:
>>Now this is really annoying me now. I simply cannot find the reason, why this is
>>happening.
>>
>>Is there any hope to turn this off?
> 
> 
> This should be in the FAQ.  Have you looked there?

Yes I have, but found nothing about it. I'll look again.

> 
> One common cause is a software bug in the kernel timekeeping area.
> It adds too much/little to the time on each interrupt.

I use kernel 2.4.20. I have 8 other systems with the same configuration
working perfectly fine.

> 
> Or your crystal could be way off.
:(

> 
> Or you could be missing interrupts.  Does it happen when the system
> is idle, or only when it's busy?

Every ~22 minutes:

Jul  3 19:58:02 rufus ntpd[1285]: synchronisation lost
Jul  3 20:21:41 rufus ntpd[1285]: time reset 0.395221 s
Jul  3 20:21:41 rufus ntpd[1285]: synchronisation lost
Jul  3 20:43:10 rufus ntpd[1285]: time reset 1.205013 s
Jul  3 20:43:10 rufus ntpd[1285]: synchronisation lost
Jul  3 21:04:46 rufus ntpd[1285]: time reset 0.467360 s
Jul  3 21:04:46 rufus ntpd[1285]: synchronisation lost
Jul  3 21:26:26 rufus ntpd[1285]: time reset 1.306115 s
Jul  3 21:26:26 rufus ntpd[1285]: synchronisation lost
Jul  3 21:48:08 rufus ntpd[1285]: time reset 0.556156 s
Jul  3 21:48:08 rufus ntpd[1285]: synchronisation lost
Jul  3 22:09:50 rufus ntpd[1285]: time reset 1.317970 s
Jul  3 22:09:50 rufus ntpd[1285]: synchronisation lost
Jul  3 22:32:29 rufus ntpd[1285]: time reset 0.568680 s
Jul  3 22:32:29 rufus ntpd[1285]: synchronisation lost
Jul  3 22:55:20 rufus ntpd[1285]: time reset 1.325117 s
Jul  3 22:55:20 rufus ntpd[1285]: synchronisation lost

> 
> Laptops may have troubles with power-saving modes.  I'm not familiar
> with this area.

It's not a laptop but a big server.
> 
> 
> What is your "drift"?  It's usually in /etc/ntp.drift, or
> something like that.  (Look in your config file.)  Or try the
> ntptime utility where it's called "frequency".

Jul  3 18:27:07 rufus ntpd[1285]: ntpd 4.1.1b@1.829 Fri Jun 13 16:00:49 
CEST 2003 (1)
Jul  3 18:27:07 rufus ntpd[1285]: precision = 17 usec
Jul  3 18:27:07 rufus ntpd[1285]: kernel time discipline status 0040
Jul  3 18:27:07 rufus ntpd[1285]: frequency initialized 88.736 from 
/etc/ntpd.drift
Jul  3 18:30:19 rufus ntpd[1285]: time reset -0.175969 s
Jul  3 18:30:19 rufus ntpd[1285]: kernel time discipline status change 41
Jul  3 18:30:19 rufus ntpd[1285]: synchronisation lost

> 
> If the drift is off by more than 500 ppm, ntpd has a sanity
> check/limit and won't correct enough to keep the clock accurate.
> So it drifts, and after a while it drifts far enough that ntpd
> takes a big step, prints that message, and starts over.
> 

Regards,

Edin

0
Edin
7/3/2003 9:49:36 PM
In article <be28f0$e5mc$1@ID-139806.news.dfncis.de>,
Edin Dizdarevic  <edin.dizdarevic@interActive-Systems.de> wrote:
>It's not a laptop but a big server.

SMP?  I've seen the existence of two cycle clocks make NTP unhappy, but
nothing as bad as what you are seeing.

Have you considered other things like heavy network load, lost interrupts, ...
Do the other machines have identical hardware, or just something close?
0
mcintosh
7/4/2003 12:02:06 AM
Allen McIntosh wrote:
> In article <be28f0$e5mc$1@ID-139806.news.dfncis.de>,
> Edin Dizdarevic  <edin.dizdarevic@interActive-Systems.de> wrote:
> 
>>It's not a laptop but a big server.
> 
> 
> SMP?  I've seen the existence of two cycle clocks make NTP unhappy, but
> nothing as bad as what you are seeing.

Dual board but single Xeon 1.8G CPU

> 
> Have you considered other things like heavy network load, lost interrupts, ...
> Do the other machines have identical hardware, or just something close?

This machine has nothing to do. Lost interrupts... could be. It's an
SCSI sysstem with only one IDE CD-ROM drive. I disabled APIC, disabled
DMA for CD-ROM, updated SCSI drivers, kernel 2.4.20...nothing helped.

Other machines are "simple" Pentiums 4...

Regards,

Edin

0
Edin
7/4/2003 7:04:05 AM
-----BEGIN PGP SIGNED MESSAGE-----

In article <be28f0$e5mc$1@ID-139806.news.dfncis.de>,
edin.dizdarevic@interActive-Systems.de says...
>

[snip]

>> What is your "drift"?  It's usually in /etc/ntp.drift, or
>> something like that.  (Look in your config file.)  Or try the
>> ntptime utility where it's called "frequency".
>
>Jul  3 18:27:07 rufus ntpd[1285]: ntpd 4.1.1b@1.829 Fri Jun 13 16:00:49 
>CEST 2003 (1)

I have a client with a similar poweredge system running rh7.2, but we run
ntpq 4.1.71-a Thu Jun 13 11:32:54 PDT 2002 (1) built from source. The last
sync lost message was about a month ago.



-----BEGIN PGP SIGNATURE-----
Version: 4.5

iQCVAgUBPwhw19ZjPoeWO7BhAQGfewP/eACoEYjIPpw/3m2mZ2fgm9dE8A3v8TBZ
sUubhOXlMQhM0oB8g91YGht+aZ5XrJOHeFCKjSNkYrfnt3WkLmb9O+hxvMbBAXjP
FLmy2wuKuL5QQcT/A7lr+rKvwYn9oiYya8aoo+ChQiGDLJAATyTki0tkydpgunPg
N5W26bLlTJw=
=WCPk
-----END PGP SIGNATURE-----

0
carl
7/6/2003 6:56:25 PM
Carl Byington wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> 
> In article <be28f0$e5mc$1@ID-139806.news.dfncis.de>,
> edin.dizdarevic@interActive-Systems.de says...
> 
> 
> [snip]
> 
> 
>>>What is your "drift"?  It's usually in /etc/ntp.drift, or
>>>something like that.  (Look in your config file.)  Or try the
>>>ntptime utility where it's called "frequency".
>>
>>Jul  3 18:27:07 rufus ntpd[1285]: ntpd 4.1.1b@1.829 Fri Jun 13 16:00:49 
>>CEST 2003 (1)
> 
> 
> I have a client with a similar poweredge system running rh7.2, but we run
> ntpq 4.1.71-a Thu Jun 13 11:32:54 PDT 2002 (1) built from source. The last
> sync lost message was about a month ago.

How similar and which kernel version do you use?

Greetings,

Edin


> 
> 
> 
> -----BEGIN PGP SIGNATURE-----
> Version: 4.5
> 
> iQCVAgUBPwhw19ZjPoeWO7BhAQGfewP/eACoEYjIPpw/3m2mZ2fgm9dE8A3v8TBZ
> sUubhOXlMQhM0oB8g91YGht+aZ5XrJOHeFCKjSNkYrfnt3WkLmb9O+hxvMbBAXjP
> FLmy2wuKuL5QQcT/A7lr+rKvwYn9oiYya8aoo+ChQiGDLJAATyTki0tkydpgunPg
> N5W26bLlTJw=
> =WCPk
> -----END PGP SIGNATURE-----
> 

0
Edin
7/6/2003 10:03:41 PM
On Mon, 07 Jul 2003 16:27:15 +0000, Hal Murray Combobulated:

>>How would one go about diagnosing lost interrupts?
> 
> Write a loop that:
>    reads the cycle counter and the system time waits/sleeps a while
>    reads them again
>    does the arithmetic to compare the differences.
> 
> You have to calibrate the speed of your CPU.
> 
> If you want something more than a hack, you need to do something like
> read the cycle counter both before and after getting the system time,
> and try again if the difference is large, say because the scheduler gave
> the CPU to another task or you got a page fault or ...
> 
> You can obviously use the reads from the end of one loop as the readings
> for the beginning of the next loop.

Uhhhg........ You're way over my head here, Hal! I'm not a programmer but
I can follow code somewhat. You wouldn't have a link to some documentation
or examples of what you just explained would you?

-- 
Ciao,
CRH 8^)>

0
CRH
7/8/2003 3:17:56 AM
-----BEGIN PGP SIGNED MESSAGE-----

In article <bea6d1$2t592$1@ID-139806.news.dfncis.de>,
edin.dizdarevic@interActive-Systems.de says...
>


>> I have a client with a similar poweredge system running rh7.2, but we
run
>> ntpq 4.1.71-a Thu Jun 13 11:32:54 PDT 2002 (1) built from source. The
last
>> sync lost message was about a month ago.
>
>How similar and which kernel version do you use?


One system is stock RH7.2
  2.4.7-10smp #1 SMP Thu Sep 6 17:09:31 EDT 2001 i686 unknown
on a dual processor poweredge of some flavor. That system is about 18
months old at this point, and I don't have the exact model number here.

That one is running just fine with ntp 4.1.71-a.



Another system is stock RH8.0
  2.4.18-14 #1 Wed Sep 4 13:35:50 EDT 2002 i686 i686 i386 GNU/Linux
on a single processor poweredge of some flavor. That system is about 6
months old, and I don't have the exact model number here either.

This was running ntp 4.1.1.a from the stock RH8.0 distribution, but I just
noticed that it had your same problem with excessive lost sync. So I
compiled ntp from source on that machine, and it is now running 4.1.71-a.
But it still has the same problem.

Jul  7 18:12:45 mail3 ntpd[12685]: time reset 1.470749 s
Jul  7 18:12:45 mail3 ntpd[12685]: synchronisation lost
Jul  7 18:32:05 mail3 ntpd[12685]: time reset 1.031199 s
Jul  7 18:32:05 mail3 ntpd[12685]: synchronisation lost
Jul  7 18:55:51 mail3 ntpd[12685]: time reset 1.140142 s
Jul  7 18:55:51 mail3 ntpd[12685]: synchronisation lost
Jul  7 19:28:15 mail3 ntpd[12685]: time reset 0.529740 s
Jul  7 19:28:15 mail3 ntpd[12685]: synchronisation lost
Jul  7 20:15:51 mail3 ntpd[12685]: time reset 0.561480 s
Jul  7 20:15:51 mail3 ntpd[12685]: synchronisation lost


-----BEGIN PGP SIGNATURE-----
Version: 4.5

iQCVAgUBPwpBttZjPoeWO7BhAQG15wP9FBpEHmF/fFVKfy4Qo6dZvtBXaLshT5Qn
wTp+5Q8GKUUAgyqZx/ADnjf6+Kzc6aIv9cr+MxwUx8E8UAv58nBmq4ZYkGQxj8M5
ehkDKW+wwHbzLXnSyW+r9V8j60AfJDlKLHBbKT/TYoPIb8qI4sIMBbpdwMnhJ69l
oitWHOvIWHY=
=xFVM
-----END PGP SIGNATURE-----

0
carl
7/8/2003 3:59:55 AM
>Uhhhg........ You're way over my head here, Hal! I'm not a programmer but
>I can follow code somewhat. You wouldn't have a link to some documentation
>or examples of what you just explained would you?

I don't have any clean code.  It's the sort of thing that
gets hacked up to chase some particular problem.

You can try the program at:
   http://www.megapathdsl.net/~hmurray/temp/cpu
(Compiled on RH 7.2)

On my system, it prints this:

      Cycles     MicroSec       MHz    uSec to gettimeofday
                                        first    second

  2821021234      4994934   564.776     2.243     0.930
  2824212558      5000583   564.777     3.070     0.930
  2824204896      5000568   564.777     2.943     0.930
  2824212945      5000581   564.777     3.205     0.930

It's computing the CPU frequency from the cycle counter.

This is a 566 MHz machine, but the spread-spectrum hacks
on the CPU clock reduce that a bit.

It may not work right on an SMP system.

-- 
The suespammers.org mail server is located in California.  So are all my
other mailboxes.  Please do not send unsolicited bulk e-mail or unsolicited
commercial e-mail to my suespammers.org address or any of my other addresses.
These are my opinions, not necessarily my employer's.  I hate spam.

0
hmurray
7/8/2003 8:25:54 AM

Hal Murray wrote:
>>Uhhhg........ You're way over my head here, Hal! I'm not a programmer but
>>I can follow code somewhat. You wouldn't have a link to some documentation
>>or examples of what you just explained would you?
> 
> 
> I don't have any clean code.  It's the sort of thing that
> gets hacked up to chase some particular problem.
> 
> You can try the program at:
>    http://www.megapathdsl.net/~hmurray/temp/cpu
> (Compiled on RH 7.2)
> 
> On my system, it prints this:
> 
>       Cycles     MicroSec       MHz    uSec to gettimeofday
>                                         first    second
> 
>   2821021234      4994934   564.776     2.243     0.930
>   2824212558      5000583   564.777     3.070     0.930
>   2824204896      5000568   564.777     2.943     0.930
>   2824212945      5000581   564.777     3.205     0.930
> 
> It's computing the CPU frequency from the cycle counter.
> 
> This is a 566 MHz machine, but the spread-spectrum hacks
> on the CPU clock reduce that a bit.
> 
> It may not work right on an SMP system.
> 


But what does it all mean?


      Cycles     MicroSec       MHz    uSec to gettimeofday
                                        first    second

  8981007560      5005743  1794.141     1.157     0.798
  8988970204      5010181  1794.141     1.113     0.798
  8988971524      5010182  1794.141     1.454     0.798
  8988968944      5010180  1794.141     1.228     0.798
  8988971748      5010182  1794.141     1.503     0.798
  8988970560      5010179  1794.142     1.271     0.798
  8988973648      5010183  1794.141     1.135     0.798
  8988965460      5010178  1794.141     1.026     0.798
  8988969384      5010180  1794.141     0.898     0.805
  8988971760      5010181  1794.141     0.903     0.798
  8988968960      5010179  1794.141     0.916     0.798
  8988970564      5010180  1794.141     0.910     0.798
  8988970432      5010180  1794.141     1.026     0.798
  8988970604      5010181  1794.141     1.032     0.798
  8988969732      5010178  1794.142     0.903     0.798
  8988970172      5010180  1794.141     0.910     0.798
  8988976384      5010182  1794.142     2.127     0.798
  8988967324      5010179  1794.141     1.300     0.798
  8988968224      5010177  1794.142     0.916     1.699
  8988972180      5010180  1794.142     1.066     0.798
  8988969140      5010179  1794.141     0.901     0.805
  8988970104      5010179  1794.141     1.101     0.798
  8988970304      5010178  1794.142     0.914     0.798
  8988970104      5010179  1794.141     1.014     0.798
  8988969612      5010178  1794.142     0.916     0.798
  8988972016      5010180  1794.141     1.026     0.798
  8988973296      6210114  1447.473     1.467     0.989 <<<<<<< What's this?
  8988982348      5010185  1794.142     1.028     0.798
  8988957572      5010172  1794.141     1.063     0.798
  8988968676      5010177  1794.142     1.039     0.798
  8988969552      5010178  1794.142     1.037     0.798
  8988970416      5010179  1794.142     0.903     0.798
  8988972852      5010179  1794.142     1.034     0.798
  8988967596      5010176  1794.142     0.907     0.798
  8988973560      5010180  1794.142     1.126     0.805
  8988968020      5010177  1794.142     0.898     0.805
  8988970004      5010991  1793.851     0.981     0.805
  8988970260      5011419  1793.698     1.033     0.798
  8988970132      5011286  1793.745     1.061     0.798
  8988970488      5009279  1794.464     0.905     0.805
  8988971156      5009393  1794.423     1.126     0.798
  8988971480      5009499  1794.385     1.070     0.798
  8988969384      5009574  1794.358     1.030     0.805
  8988968584      5009637  1794.335     0.905     0.805
  8988971276      5009725  1794.304     1.030     0.805
  8988971068      5009626  1794.340     1.034     0.798
  8988970832      5009363  1794.434     0.912     0.798
  8988970420      5009396  1794.422     1.005     0.798
  8988970004      5009705  1794.311     1.043     0.805
  8988969440      5010015  1794.200     1.164     0.798
  8988968804      5010071  1794.180     0.916     0.798
  8988971656      5009922  1794.234     1.030     0.805
  8988971620      5009552  1794.366     1.023     0.798
  8988969760      5011050  1793.830     1.015     0.798
  8988970204      5011026  1793.838     1.342     0.798


Regards,

Edin

-- 
Sicherheit gibt es nicht umsonst - zu keiner Zeit und nirgendwo!
(Das BSI IT-Sicherheitshandbuch)

0
Edin
7/8/2003 9:21:56 AM
>But what does it all mean?

It goes into a loop, sleeping for 5 seconds...

    First column is number of cycles (during that sleep)
    Second column is microseconds.
    Third column is computed cycle counter frequency.


>      Cycles     MicroSec       MHz    uSec to gettimeofday
>                                        first    second
>
>  8981007560      5005743  1794.141     1.157     0.798
>  8988970204      5010181  1794.141     1.113     0.798
>  8988971524      5010182  1794.141     1.454     0.798
>  8988968944      5010180  1794.141     1.228     0.798

....

Looks normal.  You have a 1.8 GHz system.


>  8988969612      5010178  1794.142     0.916     0.798
>  8988972016      5010180  1794.141     1.026     0.798
>  8988973296      6210114  1447.473     1.467     0.989 <<<<<<< What's this?
>  8988982348      5010185  1794.142     1.028     0.798

That's why this program is a hack.  You have to poke around
and/or hack the code to print more info when strange things
happen.

Did your time get reset by 1.2 seconds at that point?

-- 
The suespammers.org mail server is located in California.  So are all my
other mailboxes.  Please do not send unsolicited bulk e-mail or unsolicited
commercial e-mail to my suespammers.org address or any of my other addresses.
These are my opinions, not necessarily my employer's.  I hate spam.

0
hmurray
7/8/2003 4:39:31 PM

Hal Murray wrote:
>>But what does it all mean?
> 
> 
> It goes into a loop, sleeping for 5 seconds...
> 
>     First column is number of cycles (during that sleep)
>     Second column is microseconds.
>     Third column is computed cycle counter frequency.
> 
> 
> 
>>     Cycles     MicroSec       MHz    uSec to gettimeofday
>>                                       first    second
>>
>> 8981007560      5005743  1794.141     1.157     0.798
>> 8988970204      5010181  1794.141     1.113     0.798
>> 8988971524      5010182  1794.141     1.454     0.798
>> 8988968944      5010180  1794.141     1.228     0.798
> 
> 
> ...
> 
> Looks normal.  You have a 1.8 GHz system.

;)

> 
> 
> 
>> 8988969612      5010178  1794.142     0.916     0.798
>> 8988972016      5010180  1794.141     1.026     0.798
>> 8988973296      6210114  1447.473     1.467     0.989 <<<<<<< What's this?
>> 8988982348      5010185  1794.142     1.028     0.798
> 
> 
> That's why this program is a hack.  You have to poke around
> and/or hack the code to print more info when strange things
> happen.
> 
> Did your time get reset by 1.2 seconds at that point?
> 

How do you get that value?


-- 
Sicherheit gibt es nicht umsonst - zu keiner Zeit und nirgendwo!
(Das BSI IT-Sicherheitshandbuch)

0
Edin
7/8/2003 4:48:30 PM
Reply: