Solaris 5.10 cpu hike --- _flockget --- possible deadlock ? - Unix

This is a discussion on Solaris 5.10 cpu hike --- _flockget --- possible deadlock ? - Unix ; Hi, [sorry for repeating message again] before few times on solaris 10, one process started cosuming cpu towards 100%. I have taken pstack and gcore. And by viewing stacks of threads, its suprised to me ! 128 threads are active ...

+ Reply to Thread
Results 1 to 10 of 10

Thread: Solaris 5.10 cpu hike --- _flockget --- possible deadlock ?

  1. Solaris 5.10 cpu hike --- _flockget --- possible deadlock ?

    Hi,

    [sorry for repeating message again]

    before few times on solaris 10, one process started cosuming cpu
    towards 100%.
    I have taken pstack and gcore. And by viewing stacks of threads, its
    suprised to me !

    128 threads are active

    ------
    ff140090 lwp_park (0, 0, 0)
    ff12409c _flockget (ff16c3b8, c9610, 44274, ff117594, ff16c308,
    c9610) + 94
    ff118110 fprintf (c9640, a7ae8, f7afa558, 501b0, ff118460, ff168284)
    + 50
    00059644 logstr

    many different threads (approx 100) executing/stcuk in above.

    And below two threads, one in fopen and one in write, its get locked.

    ----
    ff140090 lwp_park (0, 0, 0)
    ff12409c _flockget (ff16c3b8, c9610, 44274, ff12bb8c, ff16c308,
    c9610) + 94
    ff121478 _endopen (c98b0, a7b00, c9640, 0, 109, 0) + 12c
    ff124fac fopen (c98b0, a7b00, a7800, 3c788, 0, c9640) + 1c
    000598c8 open_logfile (c98b0, c9800, c9c00, ff052400, ff16cbc0, 1) +
    1e4
    0003da30 sig_thr (0, feefc000, 0, 0, 3d98c, 0) + a4
    ff13fff0 _lwp_start (0, 0, 0, 0, 0, 0)
    -------
    ff1218dc _flsbuf (32, c9640, ee, ff16c27c, fffffffb, ffffffff) + dc
    ff1263d4 _fwrite_unlocked (4, 1, 23, c9640, 23, febfa55d) + 414
    ff114478 _dowrite (febfa558, 23, c9640, febfa458, fe, 4) + 54
    ff116f0c _ndoprnt (a7aea, febfa520, febfa57b, febfa558, febfa458,
    ff150181) + 28bc
    ff1181a0 fprintf (c9640, a7ae8, febfa558, 501b0, ff118460, ff168284)
    + e0
    00059644 logstr (a3668, febfa8d0, 877b, dec88, ff322f58, d) + 1c4
    ....
    .... some function calls here.
    ff13fff0 _lwp_start (0, 0, 0, 0, 0, 0)
    --------

    I have tried much to debug, but no clue. [Only one doubt is i am
    using alarm() to get one thread periodically do some stuff, and in
    that above code is executed. Am i doing something stupid ? or any
    hidden Os/library bug is kidding me ?

    Tnx in advance,
    -Raxit

  2. Re: Solaris 5.10 cpu hike --- _flockget --- possible deadlock?

    Raxit@mykavita.com wrote:
    > [...]
    > ff140090 lwp_park (0, 0, 0)
    > ff12409c _flockget (ff16c3b8, c9610, 44274, ff117594, ff16c308,
    > c9610) + 94
    > ff118110 fprintf (c9640, a7ae8, f7afa558, 501b0, ff118460, ff168284)
    > + 50
    > 00059644 logstr
    >
    > many different threads (approx 100) executing/stcuk in above.


    It looks like the hundred threads are all trying to write
    a log message at the same time, and are patiently waiting for
    their turn to use fprintf on the same FILE* stream. (Verify:
    Check whether the first fprintf argument is the same in all
    the stacks.) If that's the case, it may be that the CPU spike
    is associated with whatever produces all that log activity.
    Have you looked at the log messages?

    --
    Eric Sosman
    esosman@ieee-dot-org.invalid

  3. Re: Solaris 5.10 cpu hike --- _flockget --- possible deadlock ?

    On Jul 4, 11:45 am, "Ra...@mykavita.com"
    wrote:
    > Hi,
    >
    > [sorry for repeating message again]
    >
    > before few times on solaris 10, one process started cosuming cpu
    > towards 100%.
    > I have taken pstack and gcore. And by viewing stacks of threads, its
    > suprised to me !
    >
    > 128 threads are active

    [ snip ]
    > I have tried much to debug, but no clue. [Only one doubt is i am
    > using alarm() to get one thread periodically do some stuff, and in
    > that above code is executed. Am i doing something stupid ? or any
    > hidden Os/library bug is kidding me ?


    You are calling fprintf() in a signal handler in a multithreaded
    process.
    That is the problem. Standard i/o is not async-signal safe.
    Don't do that.

    Roger Faulkner
    Sun Microsystems


  4. Re: Solaris 5.10 cpu hike --- _flockget --- possible deadlock ?

    On Jul 5, 5:47*am, roger.faulk...@sun.com wrote:
    > On Jul 4, 11:45 am, "Ra...@mykavita.com"
    > wrote:
    >
    > > Hi,

    >
    > > [sorry for repeating message again]

    >
    > > before few times on solaris 10, one process started cosuming cpu
    > > towards 100%.
    > > I have taken pstack and gcore. And by viewing stacks of threads, its
    > > suprised to me !

    >
    > > 128 threads are active

    > [ snip ]
    > > I have tried much to debug, but no clue. *[Only one doubt is i am
    > > using alarm() to get one thread periodically do some stuff, and in
    > > that above code is executed. Am i doing something stupid ? or any
    > > hidden Os/library bug is kidding me ?


    >
    > You are calling fprintf() in a signal handler in a multithreaded
    > process.
    > That is the problem. *Standard i/o is not async-signal safe.
    > Don't do that.
    >
    > Roger Faulkner
    > Sun Microsystems



    >If you use a signal ALARM handler and have him doing anything with
    >stdio, you may have broken stdio (you can't use most stdio functions
    >from signal handlers)


    >Casper


    Thanks Eric, Roger and Casper,

    Definitely we do have broken code !!! [was ignoring this imp point]

    what we are doing is when our server is starting, we are allocating
    one separate thread to handle signal (including SIGALARM)

    this thread is periodically getting called (because of alarm)... and
    try to populate some stuff from some file into memory.
    And Additionally using logfunctions (to print debug message). This log
    functions are using Global fd (so that all thread can write the log to
    same logfile) (and we do have log-rotation, i.e. after some time/
    filesize log will be rotated -- (fopen) in stack) [All this are using
    fopen,fprintf etc functions]

    So instead of using Alarm, we need to use something else ? to
    peridically do the stuff. ?

    may be
    1. instead of alarm+other signal, use separate thread to do periodic
    task and Signal.
    2. Instead of Alarm, use sleep/delay etc function and here i can use
    All fopen,fprintf etc stdio
    3. in Signal handler avoid using stdio ! [Now SIGALARM will not come
    here. as i am fully trying to avoid SIGALARM]

    Any other thoughts !?

    -Raxit









  5. Re: Solaris 5.10 cpu hike --- _flockget --- possible deadlock ?

    "Raxit@mykavita.com" writes:

    [...]

    > what we are doing is when our server is starting, we are allocating
    > one separate thread to handle signal (including SIGALARM)
    >
    > this thread is periodically getting called (because of alarm)... and
    > try to populate some stuff from some file into memory.
    > And Additionally using logfunctions (to print debug message). This log
    > functions are using Global fd (so that all thread can write the log to
    > same logfile) (and we do have log-rotation, i.e. after some time/
    > filesize log will be rotated -- (fopen) in stack) [All this are using
    > fopen,fprintf etc functions]
    >
    > So instead of using Alarm, we need to use something else ? to
    > peridically do the stuff. ?


    The easy way to get this sorted out would be to block all signals in
    all threads and use sigwait in the signal handling thread to dequeue
    pending signals synchronously.

  6. Re: Solaris 5.10 cpu hike --- _flockget --- possible deadlock ?

    On Jul 7, 1:59*pm, Rainer Weikusat wrote:
    > "Ra...@mykavita.com" writes:
    >
    > [...]
    >
    > > what we are doing is when our server is starting, we are allocating
    > > one separate thread to handle signal (including SIGALARM)

    >
    > > this thread is periodically getting called (because of alarm)... and
    > > try to populate some stuff from some file into memory.
    > > And Additionally using logfunctions (to print debug message). This log
    > > functions are using Global fd (so that all thread can write the log to
    > > same logfile) (and we do have log-rotation, i.e. after some time/
    > > filesize log will be rotated -- (fopen) in stack) [All this are using
    > > fopen,fprintf etc functions]

    >
    > > So instead of using Alarm, we need to use something else ? to
    > > peridically do the stuff. ?

    >
    > The easy way to get this sorted out would be to block all signals in
    > all threads and use sigwait in the signal handling thread to dequeue
    > pending signals synchronously.


    The code is like below we use sigwait (which dumps the pstack of
    original post!)

    void * Thread_which_Handle_Signals ()
    while (1)
    {
    alarm (time); <---Say for Ex. 60 second

    /* Wait for the signal */
    sigwait (&glb_sigact, &signal_alr);

    if (DEBUG)
    Print using Stdio/fprintf

    if (SIGALRM == signal_alr)
    {
    Open Some file, load some value to memory. <---this
    uses stdio, Is this ok ?
    Write some log <----this uses stdio, Is this ok ?
    }
    if (SIGTERM == signal_alr)
    {
    Do some closing activity !
    Write some log. etc using stdio. <---Is this ok ?
    }

    }


    -Raxit

  7. Re: Solaris 5.10 cpu hike --- _flockget --- possible deadlock ?

    "Raxit@mykavita.com" writes:
    > On Jul 7, 1:59*pm, Rainer Weikusat wrote:
    >> "Ra...@mykavita.com" writes:
    >>
    >> [...]
    >>
    >> > what we are doing is when our server is starting, we are allocating
    >> > one separate thread to handle signal (including SIGALARM)

    >>
    >> > this thread is periodically getting called (because of alarm)... and
    >> > try to populate some stuff from some file into memory.
    >> > And Additionally using logfunctions (to print debug message). This log
    >> > functions are using Global fd (so that all thread can write the log to
    >> > same logfile) (and we do have log-rotation, i.e. after some time/
    >> > filesize log will be rotated -- (fopen) in stack) [All this are using
    >> > fopen,fprintf etc functions]

    >>
    >> > So instead of using Alarm, we need to use something else ? to
    >> > peridically do the stuff. ?

    >>
    >> The easy way to get this sorted out would be to block all signals in
    >> all threads and use sigwait in the signal handling thread to dequeue
    >> pending signals synchronously.

    >
    > The code is like below we use sigwait (which dumps the pstack of
    > original post!)


    There shouldn't be any problems caused by 'signal handler misuse',
    then.

  8. Re: Solaris 5.10 cpu hike --- _flockget --- possible deadlock ?

    On Jul 7, 4:04*pm, Rainer Weikusat wrote:
    > "Ra...@mykavita.com" writes:
    > > On Jul 7, 1:59*pm, Rainer Weikusat wrote:
    > >> "Ra...@mykavita.com" writes:

    >
    > >> [...]

    >
    > >> > what we are doing is when our server is starting, we are allocating
    > >> > one separate thread to handle signal (including SIGALARM)

    >
    > >> > this thread is periodically getting called (because of alarm)... and
    > >> > try to populate some stuff from some file into memory.
    > >> > And Additionally using logfunctions (to print debug message). This log
    > >> > functions are using Global fd (so that all thread can write the log to
    > >> > same logfile) (and we do have log-rotation, i.e. after some time/
    > >> > filesize log will be rotated -- (fopen) in stack) [All this are using
    > >> > fopen,fprintf etc functions]

    >
    > >> > So instead of using Alarm, we need to use something else ? to
    > >> > peridically do the stuff. ?

    >
    > >> The easy way to get this sorted out would be to block all signals in
    > >> all threads and use sigwait in the signal handling thread to dequeue
    > >> pending signals synchronously.

    >
    > > The code is like below we use sigwait (which dumps the pstack of
    > > original post!)

    >
    > There shouldn't be any problems caused by 'signal handler misuse',
    > then.

    Ummmm...


    fprintf (c9640 <---- 126 threads trying to write to same fd !!!
    (from pstack)
    Total 128 thread: 1 main Thread + 1 signal handle thread + 126
    above !!!

    trying to get log messages but some problem with gdb/dbx

    bash$gdb
    warning: .dynamic section for "/lib/libnsl.so.1" is not at the
    expected address (wrong library or version mismatch?)
    warning: .dynamic section for "/lib/libsocket.so.1" is not at the
    expected address (wrong library or version mismatch?)
    warning: .dynamic section for "/lib/libelf.so.1" is not at the
    expected address (wrong library or version mismatch?)
    warning: .dynamic section for "/lib/libdl.so.1" is not at the expected
    address (wrong library or version mismatch?)
    warning: .dynamic section for "/lib/libc.so.1" is not at the expected
    address (wrong library or version mismatch?)
    warning: .dynamic section for "/lib/libmd5.so.1" is not at the
    expected address
    warning: difference appears to be caused by prelink, adjusting
    expectations
    warning: .dynamic section for "/lib/nss_files.so.1" is not at the
    expected address (wrong library or version mismatch?)
    Error while mapping shared library sections:
    /usr1/sms/obj/libauth4C.so: No such file or directory.
    Error while mapping shared library sections:
    /usr1/sms/obj/libgenpasswd.so: No such file or directory.

    (dbx) threads
    dbx: MT support is disabled
    (dbx) (dbx) lwps
    l@1 running in dead_and_buried()
    l@2 running in tdb_sync_obj_deregister()
    l@3 running in tdb_sync_obj_deregister()
    l@56931 running in tdb_sync_obj_deregister()
    l@56933 running in tdb_sync_obj_deregister()
    l@56932 running in tdb_sync_obj_deregister()
    l@56934 running in tdb_sync_obj_deregister()
    l@56930 running in globit()
    l@56935 running in tdb_sync_obj_deregister()
    ....
    ...


    Cheers,
    -Raxit

  9. Re: Solaris 5.10 cpu hike --- _flockget --- possible deadlock ?


    On Jul 5, 12:17*am, Eric Sosman wrote:
    > Ra...@mykavita.com wrote:
    > > [...]
    > > *ff140090 lwp_park (0, 0, 0)
    > > *ff12409c _flockget (ff16c3b8, c9610, 44274, ff117594, ff16c308,
    > > c9610) + 94
    > > *ff118110 fprintf *(c9640, a7ae8, f7afa558, 501b0, ff118460, ff168284)
    > > + 50
    > > *00059644 logstr

    >
    > > many different threads (approx 100) executing/stcuk in above.

    >
    > * * *It looks like the hundred threads are all trying to write
    > a log message at the same time, and are patiently waiting for
    > their turn to use fprintf on the same FILE* stream. *(Verify:
    > Check whether the first fprintf argument is the same in all


    Yes. Its same for 120 threads,

    > the stacks.) *If that's the case, it may be that the CPU spike
    > is associated with whatever produces all that log activity.
    > Have you looked at the log messages?

    Its normal log messages that we have like [pid:Threadid]
    [Date:Time]Logmessage (total approx 50char)

    >
    > --
    > Eric Sosman
    > esos...@ieee-dot-org.invalid



  10. Re: Solaris 5.10 cpu hike --- _flockget --- possible deadlock ?

    Raxit@mykavita.com wrote:
    > On Jul 5, 12:17 am, Eric Sosman wrote:
    >> Ra...@mykavita.com wrote:
    >>> [...]
    >>> ff140090 lwp_park (0, 0, 0)
    >>> ff12409c _flockget (ff16c3b8, c9610, 44274, ff117594, ff16c308,
    >>> c9610) + 94
    >>> ff118110 fprintf (c9640, a7ae8, f7afa558, 501b0, ff118460, ff168284)
    >>> + 50
    >>> 00059644 logstr
    >>> many different threads (approx 100) executing/stcuk in above.

    >> It looks like the hundred threads are all trying to write
    >> a log message at the same time, and are patiently waiting for
    >> their turn to use fprintf on the same FILE* stream. (Verify:
    >> Check whether the first fprintf argument is the same in all

    >
    > Yes. Its same for 120 threads,
    >
    >> the stacks.) If that's the case, it may be that the CPU spike
    >> is associated with whatever produces all that log activity.
    >> Have you looked at the log messages?

    > Its normal log messages that we have like [pid:Threadid]
    > [Date:Time]Logmessage (total approx 50char)


    Other responders have already pointed out something I'd
    overlooked: You're calling fprintf() from an asynchronous
    signal handler. That's a programming error, because fprintf()
    is not async-signal-safe. The cure is "Don't Do That."

    --
    Eric Sosman
    esosman@ieee-dot-org.invalid

+ Reply to Thread