Clock has stopped (time/date looping over 5 seconds), things are broken - what to check to debug? - Kernel


  1. Clock has stopped (time/date looping over 5 seconds), things are broken - what to check to debug?

    What I have so far is that the clock cycles between 10:01:03 and
    10:01:07 (when it gets to :07 it goes back to :03). Running rdate -s
    changes the time, but the clock still loops:

    16:12:38 to 16:12:43 (resets back to :38).

    Doing this:
    while true ; do date; usleep 1000000; done
    Fri Apr 4 16:12:39 CDT 2008
    Fri Apr 4 16:12:40 CDT 2008
    Fri Apr 4 16:12:41 CDT 2008
    Fri Apr 4 16:12:42 CDT 2008
    Fri Apr 4 16:12:43 CDT 2008

    It stops at :43, ^C is required, and you can then restart it with repeatable
    results.
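A minimal sketch of a way to catch the backward step automatically rather than by eye. It polls `date +%s` without sleeping, since sleeps are reported to hang on the affected machine. The function names `went_backwards` and `watch_clock` and the sample-count parameter are hypothetical, not part of any existing tool.

```shell
# Succeeds (exit status 0) when the second epoch value is earlier than the first.
went_backwards() {
    [ "$2" -lt "$1" ]
}

# Poll the wall clock $1 times without sleeping and report the first
# backward step that is seen. $1 is an illustrative sample count.
watch_clock() {
    n=$1
    prev=$(date +%s)
    i=0
    while [ "$i" -lt "$n" ]; do
        cur=$(date +%s)
        if went_backwards "$prev" "$cur"; then
            echo "clock stepped back: $prev -> $cur"
            return 1
        fi
        prev=$cur
        i=$((i + 1))
    done
    echo "no backward step in $n samples"
}
```

Something like `watch_clock 20000` would need to run for longer than the 5 second loop to have a chance of seeing the reset; the sample count is a guess and depends on how fast the machine can fork `date`.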

    This is F7, kernel 2.6.23.15-80.fc7.

    dmesg/messages contain nothing abnormal.

    This machine has done it several times, at a frequency of maybe once every
    couple of weeks. I believe it also did this with 2.6.22.9-91.fc7, so it
    has been doing this for a while. It used to work with some older kernel (I
    don't know which).

    Given what the clock is doing, things that sleep at the wrong time hang forever,
    and a number of other things fail to work.

    vmstat 1 results in a single line being printed out, and then a floating point
    exception.

    "shutdown -r now" fails to complete, power cycle is required to get the machine
    back up.

    I don't believe any hardware failure that I can think of would cause the clock
    to do what mine is doing.

    Ideas?

    Roger

    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  2. Re: Clock has stopped (time/date looping over 5 seconds), things are broken - what to check to debug?

    Hi Roger,

    Does this sound familiar:

    http://lkml.org/lkml/2008/3/14/178


    We've been chasing this for quite a while. Our PIC gets in a bad state
    where it thinks the CPU is in the ISR, and so won't give another int. We
    haven't much of an idea of how we get in that state other than that
    HZ=1000 makes it happen faster and HZ=100 causes it less often.

    I think that if you look at jiffies you will see it is not incrementing.
    The 4 second loop seems to be in the conversion from jiffies to wall
    time.
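One way to check whether timer interrupts have stopped arriving is to sample the timer line in /proc/interrupts twice and compare the counts. This is a sketch under assumptions: it assumes the legacy timer shows up as IRQ 0 there (on some configurations the ticks are counted elsewhere, for example under a local APIC timer line), and `timer_irq_count` is a hypothetical helper name.

```shell
# Print the total IRQ 0 count, summed over all CPU columns.
# $1 is the file to read; it defaults to /proc/interrupts.
timer_irq_count() {
    awk '$1 == "0:" {
        n = 0
        # Per-CPU counts are the leading numeric fields after the IRQ number.
        for (i = 2; i <= NF && $i ~ /^[0-9]+$/; i++) n += $i
        print n
        exit
    }' "${1:-/proc/interrupts}"
}
```

Calling `timer_irq_count` twice by hand, a few seconds apart, and seeing the same number both times would be consistent with the timer interrupt no longer being delivered.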


    It _appears_ that there is a race in the kernel that can be triggered by
    any number of hardware issues. There's another thread by Gregory Stark
    with the same symptoms - he thinks his was fixed by replacing a bad
    DIMM.

    Note that we first saw this on 2.6.16, and Gregory found it on 2.6.5.
    We've seen systems run for a couple of months before seeing this, so
    it's a bear to debug.

    How often is this happening for you? How repeatable?

    What hardware are you running on?


    Joel.






  3. Re: Clock has stopped (time/date looping over 5 seconds), things are broken - what to check to debug?

    Joel K. Greene wrote:
    > Hi Roger,
    >
    > Does this sound familiar:
    >
    > http://lkml.org/lkml/2008/3/14/178


    That sounds like it matches what I have.

    >
    >
    > We've been chasing this for quite a while. Our PIC gets in a bad state
    > where it thinks the CPU is in the ISR, and so won't give another int. We
    > haven't much of an idea of how we get in that state other than that
    > HZ=1000 makes it happen faster and HZ=100 causes it less often.


    I do have HZ=1000 set. Pavel mentions setting it to 4000 to make it happen
    faster, so I will try that: I am rebuilding 2.4.24.4 with HZ=4000 in the
    .config file, and will verify after it is up that it is running at 4000.
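A small sketch of one way to confirm the configured tick rate after the rebuilt kernel is up. It assumes the distribution installs the build configuration as /boot/config-$(uname -r), which may not hold for a hand-built kernel; `config_hz` is a hypothetical helper name.

```shell
# Print the CONFIG_HZ value from a kernel config file.
# $1 is the config file; the default path is an assumption about
# where the installed kernel's config lives.
config_hz() {
    sed -n 's/^CONFIG_HZ=//p' "${1:-/boot/config-$(uname -r)}"
}
```

For a hand-built kernel, pointing it at the build tree's `.config` (for example `config_hz .config`) checks what was compiled, though not what is actually booted.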

    My machine does have a fair amount of cpu usage (transcoding video), and has a
    fair amount of interrupt handling (5 disks, and 3 TV recording cards).

    >
    > I think that if you look at jiffies you will see it is not incrementing.
    > The 4 second loop seems to be in the conversion from jiffies to wall
    > time.


    I did check the counter in /proc/timer_list under (now at) and it was looping too.
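The check described above can be sketched as follows: read the "now at" value from /proc/timer_list twice and see whether it advanced. The helper names `timer_list_now` and `is_advancing` are hypothetical, and reading /proc/timer_list may require root on some systems.

```shell
# Print the nanosecond value from the "now at <N> nsecs" line.
# $1 is the file to read; it defaults to /proc/timer_list.
timer_list_now() {
    awk '$1 == "now" && $2 == "at" { print $3; exit }' "${1:-/proc/timer_list}"
}

# Succeeds when the second sample is strictly later than the first.
is_advancing() {
    [ "$2" -gt "$1" ]
}
```

Taking two samples with `a=$(timer_list_now)` and `b=$(timer_list_now)` and then running `is_advancing "$a" "$b"` should succeed on a healthy machine; on a machine in the looping state the second value can come back equal or smaller.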

    >
    >
    > It _appears_ that there is a race in the kernel that can be triggered by
    > any number of hardware issues. There's another thread by Gregory Stark
    > with the same symptoms - he thinks his was fixed by replacing a bad
    > DIMM.


    I don't think I have bad HW, I will run a test job for a few hours that checks
    its results and make sure that the proper answers are coming back, and it is not
    crashing.

    I do have a couple of disks (on a SIL controller) that every so often appear to
    give funny errors, but recover and continue on.

    >
    > Note that we first saw this on 2.6.16, and Gregory found it on 2.6.5.
    > We've seen systems run for a couple of months before seeing this, so
    > it's a bear to debug.
    >
    > How often is this happening for you? How repeatable?


    Every 14-30 days. I don't know if it always happens; I don't have exact
    enough data, but I don't think the machine has made it past 30 days in the
    last 6 months. If I go back far enough, though, I believe it was stable
    before I added a couple of TV recording cards (PVR150, HD5500) and a disk
    controller (SIL) to it.

    >
    > What hardware are you running on?


    AMD-754 Sempron64 processor.

    ASUS K8V-SE Deluxe MB (VT8385/VT8387 chipset), so very different HW than the
    Serverworks-P3 that you have.







