Re: threads do not get cpa - SGI

  1. Re: threads do not get cpa

    Joe,
    thanks for the suggestion!
    I've added the following lines to my code:

    if ((r = pthread_setconcurrency(nthreads)))
        fprintf(stderr, "Could not set concurrency: %d!\n", r);

    The results are a little erratic: sometimes I get the expected
    behavior, i.e. activity_in_top = nthreads*100, and sometimes I get less
    than the expected activity. All this is on an SGI Octane with IRIX
    6.5.24m (16 CPUs), no other applications running. I am a little bit
    puzzled now; could that be a strange effect of the IRIX scheduler?

    I also noticed another oddity: When I run my program with more threads
    it's slower than with just a single one, e.g.:

    nthreads   cpu-time   wall-clock time
    1          30         30
    2          140        70
    3          452        154

    Do you have further suggestions?

    Markus


  2. Re: threads do not get cpa

    azb123@planet.nl wrote:
    > Joe,
    > thanks for the suggestion!
    > I've added the following lines to my code:
    >
    > if ((r=pthread_setconcurrency(nthreads)))
    > fprintf(stderr,"Could not set concurrency: %d!\n",r);
    >
    > The results are a little bit erratic, sometimes I get the expected
    > behavior, i.e. activity_in_top = nthreads*100, and sometimes I get less
    > than the expected activity. All this is on a SGI Octane with IRIX
    > 6.5.24m (16cpus), no other applications running. I am little bit
    > puzzled now, could that be a strange effect of the IRIX scheduler?
    >
    > I also noticed another oddity: When I run my program with more threads
    > it's slower than with just a single one, e.g.:
    >
    > nthreads cpu-time wall-clock time
    > 1 30 30
    > 2 140 70
    > 3 452 154
    >
    > Do you have further suggestions?
    >

    Could be lock contention from either the locks in your program or ones
    in a routine you are calling. Also if you were running Linux I'd
    suspect Linux's brain damaged signaling but you're using IRIX which
    I don't know if it has the same problem.


    --
    Joe Seigh

    When you get lemons, you make lemonade.
    When you get hardware, you make software.

  3. Re: threads do not get cpa

    In article <1123428099.078544.129690@g49g2000cwa.googlegroups.com>,
    wrote:

    % I also noticed another oddity: When I run my program with more threads
    % it's slower than with just a single one, e.g.:

    Perhaps you spend more time contending for stderr and CurrentQuery than
    you spend performing calculations. Try saving up the output in a
    (pre-allocated) buffer and spitting it out in one go at thread exit
    time, and pre-allocating the queries (i.e., have each thread step through
    the list in steps of nthreads, rather than contending for a mutex after
    each query).
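
A minimal sketch of the two suggestions above, under stated assumptions: the original program is not shown in the thread, so `struct worker`, `worker_main`, `run_strided`, the 4096-byte buffer size, and the "sum the query values" workload are all hypothetical stand-ins. Each thread walks the query list in steps of nthreads, so no mutex is needed to hand out work, and each thread writes its log with a single call when it finishes.

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define MAX_THREADS 16

/* Hypothetical per-thread state; none of these names come from the
   original program. */
struct worker {
    int id;                 /* thread index, 0..nthreads-1 */
    int nthreads;           /* stride through the query list */
    int nqueries;           /* total number of queries */
    const int *queries;
    long sum;               /* stand-in for the real per-query result */
    char *log;              /* pre-allocated private output buffer */
    size_t log_cap, log_len;
};

static void *worker_main(void *arg)
{
    struct worker *w = arg;

    /* Thread i handles queries i, i+nthreads, i+2*nthreads, ... */
    for (int q = w->id; q < w->nqueries; q += w->nthreads) {
        w->sum += w->queries[q];
        if (w->log_len < w->log_cap) {
            size_t room = w->log_cap - w->log_len - 1;
            int n = snprintf(w->log + w->log_len, w->log_cap - w->log_len,
                             "thread %d: query %d\n", w->id, q);
            /* On truncation snprintf reports the untruncated length. */
            if (n > 0)
                w->log_len += (size_t)n < room ? (size_t)n : room;
        }
    }
    /* One write per thread instead of one per query. */
    fwrite(w->log, 1, w->log_len, stderr);
    return NULL;
}

/* Runs nthreads workers over queries[0..nqueries-1].
   Returns the combined sum, or -1 on error. */
long run_strided(const int *queries, int nqueries, int nthreads)
{
    pthread_t tid[MAX_THREADS];
    struct worker w[MAX_THREADS];
    long total = 0;
    int started = 0, failed = 0;

    if (nthreads < 1 || nthreads > MAX_THREADS)
        return -1;
    for (int i = 0; i < nthreads; i++) {
        w[i] = (struct worker){ .id = i, .nthreads = nthreads,
                                .nqueries = nqueries, .queries = queries,
                                .log_cap = 4096 };
        w[i].log = malloc(w[i].log_cap);
        if (!w[i].log ||
            pthread_create(&tid[i], NULL, worker_main, &w[i]) != 0) {
            free(w[i].log);
            failed = 1;
            break;
        }
        started++;
    }
    for (int i = 0; i < started; i++) {
        pthread_join(tid[i], NULL);
        total += w[i].sum;
        free(w[i].log);
    }
    return failed ? -1 : total;
}
```

Calling `run_strided(queries, nqueries, 3)` splits the list across three threads. The static split only balances well if the queries cost roughly the same; if they vary a lot, a shared work counter may still be the better trade-off.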
    --

    Patrick TJ McPhee
    North York Canada
    ptjm@interlog.com

  4. Re: threads do not get cpa

    In article <3lmrhmF133a4fU2@uni-berlin.de>,
    Patrick TJ McPhee wrote:
    >In article <1123428099.078544.129690@g49g2000cwa.googlegroups.com>,
    > wrote:
    >
    >% I also noticed another oddity: When I run my program with more threads
    >% it's slower than with just a single one, e.g.:
    >
    >Perhaps you spend more time contending for stderr and CurrentQuery than
    >you spend performing calculations. Try saving up the output in a
    >(pre-allocated) buffer and spitting it out in one go at thread exit
    >time, and pre-allocating the queries (i.e., have each thread step through
    >the list in steps of nthreads, rather than contending for a mutex after
    >each query.


    Also note that some implementations of drand48() require a mutex to
    protect the internal state of the random number generator. Others
    have the generator per-thread, but there's not any easy way to
    tell the difference.

    You might want to do your own implementation of drand48() that
    takes a context pointer, and have separate contexts for each
    thread. Or just come up with some other kind of work to do that
    is less likely to have an internal mutex.
    --
    Steve Watt KD6GGD PP-ASEL-IA ICBM: 121W 56' 57.8" / 37N 20' 14.9"
    Internet: steve @ Watt.COM Whois: SW32
    Free time? There's no such thing. It just comes in varying prices...

  5. Re: threads do not get cpa

    > Also note that some implementations of drand48() require a mutex to
    > protect the internal state of the random number generator. Others
    > have the generator per-thread, but there's not any easy way to
    > tell the difference.


    Thanks, this was exactly the problem! I replaced drand48() with erand48(),
    which takes a pointer to a buffer for the internal state as a function
    argument, and now I get the desired speed-up. It seems that the
    internal state was mutex-protected, which resulted in a contention
    problem.
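
A minimal sketch of the erand48() approach described above, assuming each thread owns its generator state. `struct rng_job`, `draw_main`, `run_draws`, and the seeding scheme are hypothetical; the original program is not shown in the thread. erand48() takes the 48-bit state as an `unsigned short[3]` argument, so threads never touch a shared generator.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 600   /* ask for the erand48() declaration */
#endif
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

#define MAX_JOBS 16

/* Hypothetical per-thread job; names are illustrative only. */
struct rng_job {
    unsigned short xsubi[3];  /* private erand48() state */
    long ndraws;              /* how many numbers to draw */
    double sum;               /* result: sum of the draws */
    int in_range;             /* 1 if every draw was in [0.0, 1.0) */
};

static void *draw_main(void *arg)
{
    struct rng_job *job = arg;

    job->sum = 0.0;
    job->in_range = 1;
    for (long i = 0; i < job->ndraws; i++) {
        /* The state lives in this thread's job, not in libc. */
        double x = erand48(job->xsubi);
        if (x < 0.0 || x >= 1.0)
            job->in_range = 0;
        job->sum += x;
    }
    return NULL;
}

/* Seeds each job from its index and runs one thread per job.
   Returns 0 on success, -1 on error. */
int run_draws(struct rng_job *jobs, int njobs)
{
    pthread_t tid[MAX_JOBS];
    int started = 0;

    if (njobs < 1 || njobs > MAX_JOBS)
        return -1;
    for (; started < njobs; started++) {
        /* Arbitrary, illustrative seeds: distinct per thread. */
        jobs[started].xsubi[0] = 0x330E;
        jobs[started].xsubi[1] = (unsigned short)(started + 1);
        jobs[started].xsubi[2] = 0x1234;
        if (pthread_create(&tid[started], NULL, draw_main,
                           &jobs[started]) != 0)
            break;
    }
    for (int i = 0; i < started; i++)
        pthread_join(tid[i], NULL);
    return started == njobs ? 0 : -1;
}
```

Because the state is explicit, a run with the same seeds reproduces the same sequence per thread regardless of how the threads are scheduled.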

    However, the original question still remains: the number of processors
    the program runs on seems unpredictable (at least to me), even after I
    have added the pthread_setconcurrency() call. When I create 3 threads,
    the program sometimes gets 3 processors, and sometimes just 1 (the SGI
    is still mostly empty).

    Is there anything known about the way the number of processors to be
    used is determined under IRIX 6.5.24m?

    Markus


  6. Re: threads do not get cpa

    > Could be lock contention from either the locks in your program or ones
    > in a routine you are calling. Also if you were running Linux I'd
    > suspect Linux's brain damaged signaling but you're using IRIX which
    > I don't know if it has the same problem.


    Why are they brain damaged? Where can I read about it? Just asking.

    JD

  7. Re: threads do not get cpa

    Salut Markus,

    > However, the original question still remains: the number of processors
    > the program runs on seems unpredictable (at least to me), even after I
    > have added the pthread_setconcurrency() call. When I create 3 threads,
    > the program sometimes gets 3 processors, and sometimes just 1 (the SGI
    > is still mostly empty).
    >
    > Is there anything known about the way the number of processors to be
    > used is determined under IRIX 6.5.24m?


    I don't have access to an SGI machine, so I can't verify this in
    practice... However, a manual about IRIX
    programming says:


    You can specify an initial thread scheduling scope by calling
    pthread_attr_setscope() and passing one of the scope constants
    (PTHREAD_SCOPE_SYSTEM or PTHREAD_SCOPE_PROCESS) in the pthread_attr_t
    object. By default, process scope is selected and scheduling is
    performed by the thread runtime, but thread scheduling by the kernel is
    provided with the system scope attribute. System scope threads run at
    real-time policy and priority and may be created only by privileged
    users.


    You may need to select system contention scope (PTHREAD_SCOPE_SYSTEM)
    to obtain the desired behavior.
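
A minimal sketch of requesting system contention scope, with a fallback to default attributes when the request is refused. `scope_probe` and `run_with_system_scope` are hypothetical names, and this has not been tried on IRIX; where exactly an unprivileged request fails (in pthread_attr_setscope() or in pthread_create()) is an assumption, so the sketch checks both.

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>

/* Trivial thread body: records that the thread actually ran. */
static void *scope_probe(void *arg)
{
    *(int *)arg = 1;
    return NULL;
}

/* Tries to run a thread with PTHREAD_SCOPE_SYSTEM.
   Returns 0 if it ran with system scope, 1 if it ran with default
   attributes after the request was refused, -1 on failure. */
int run_with_system_scope(int *ran)
{
    pthread_attr_t attr;
    pthread_t tid;
    int rc, fallback = 0;

    *ran = 0;
    if (pthread_attr_init(&attr) != 0)
        return -1;

    rc = pthread_attr_setscope(&attr, PTHREAD_SCOPE_SYSTEM);
    if (rc == 0)
        rc = pthread_create(&tid, &attr, scope_probe, ran);
    if (rc != 0) {
        /* pthread functions return the error number directly. */
        fprintf(stderr, "system scope unavailable (%s), using default\n",
                strerror(rc));
        fallback = 1;
        rc = pthread_create(&tid, NULL, scope_probe, ran);
    }
    pthread_attr_destroy(&attr);
    if (rc != 0)
        return -1;

    pthread_join(tid, NULL);
    return fallback;
}
```

A return value of 1 tells the caller that the thread ran, but only with the default scope, which is the case an unprivileged user would presumably hit on a system that restricts system scope.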


    With regards,
    Loic.


  8. Re: threads do not get cpa

    Jedrzej Dudkiewicz wrote:
    >>Could be lock contention from either the locks in your program or ones
    >>in a routine you are calling. Also if you were running Linux I'd
    >>suspect Linux's brain damaged signaling but you're using IRIX which
    >>I don't know if it has the same problem.

    >
    >
    > Why are they brain damaged? Where can I read about it? Just asking.
    >

    pthread_cond_signal|broadcast preempts the calling thread. That can
    slow things down considerably. For example, a simple producer/consumer
    file copy gets slowed down by 2x or 3x or so. There was a kernel patch
    floating around to "fix" this. Or you could use fastcv from
    http://sourceforge.net/projects/atomic-ptr-plus/
    which is what I used. Or you could use a sem_trywait/sem_wait and
    do a sched_yield if you had to wait using sem_wait.

    In practice this means that doing the pthread_cond_signal after releasing
    the mutex, instead of while holding the mutex, has some performance
    benefits on Linux since the signaled thread will attempt to get the mutex
    before the signaler has a chance to release it if it is holding the lock while
    signaling. Both forms of signaling are correct but the former seems to
    upset some Posix purists for some reason.
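
A minimal sketch of the "signal after releasing the mutex" form described above, using a single-slot mailbox. `struct mailbox`, `mailbox_put`, `mailbox_get`, and `mailbox_roundtrip` are hypothetical names for illustration. The consumer re-checks the predicate in a loop, so it stays correct whether the signal arrives before or after it starts waiting.

```c
#include <assert.h>
#include <pthread.h>

/* Single-slot mailbox; illustrative only. */
struct mailbox {
    pthread_mutex_t lock;
    pthread_cond_t nonempty;
    int has_item;
    int item;
};

void mailbox_put(struct mailbox *mb, int value)
{
    pthread_mutex_lock(&mb->lock);
    mb->item = value;
    mb->has_item = 1;
    pthread_mutex_unlock(&mb->lock);
    /* Signal only after the unlock, so the woken thread does not
       immediately block on a mutex the signaler still holds. */
    pthread_cond_signal(&mb->nonempty);
}

int mailbox_get(struct mailbox *mb)
{
    int value;

    pthread_mutex_lock(&mb->lock);
    while (!mb->has_item)              /* guards against early/spurious wakeups */
        pthread_cond_wait(&mb->nonempty, &mb->lock);
    mb->has_item = 0;
    value = mb->item;
    pthread_mutex_unlock(&mb->lock);
    return value;
}

struct consumer_ctx {
    struct mailbox *mb;
    int received;
};

static void *consumer_main(void *arg)
{
    struct consumer_ctx *ctx = arg;
    ctx->received = mailbox_get(ctx->mb);
    return NULL;
}

/* Sends one value through the mailbox to a consumer thread.
   Returns 0 and stores the received value in *out, or -1 on error. */
int mailbox_roundtrip(int value, int *out)
{
    struct mailbox mb = { .has_item = 0, .item = 0 };
    struct consumer_ctx ctx = { &mb, 0 };
    pthread_t tid;
    int rc = -1;

    if (pthread_mutex_init(&mb.lock, NULL) != 0)
        return -1;
    if (pthread_cond_init(&mb.nonempty, NULL) != 0) {
        pthread_mutex_destroy(&mb.lock);
        return -1;
    }
    if (pthread_create(&tid, NULL, consumer_main, &ctx) == 0) {
        mailbox_put(&mb, value);
        pthread_join(tid, NULL);
        *out = ctx.received;
        rc = 0;
    }
    pthread_cond_destroy(&mb.nonempty);
    pthread_mutex_destroy(&mb.lock);
    return rc;
}
```

Moving the pthread_cond_signal() call above the unlock gives the other form mentioned in the post; the predicate loop in `mailbox_get` is what keeps both forms correct.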

    --
    Joe Seigh

    When you get lemons, you make lemonade.
    When you get hardware, you make software.

  9. Re: threads do not get cpa

    loic-dev@gmx.net wrote:
    > Salut Markus,
    >
    > > However, the original question still remains: the number of processors
    > > the program runs on seems unpredictable (at least to me), even after I
    > > have added the pthread_setconcurrency() call. When I create 3 threads,
    > > the program sometimes gets 3 processors, and sometimes just 1 (the SGI
    > > is still mostly empty).
    > >
    > > Is there anything known about the way the number of processors to be
    > > used is determined under IRIX 6.5.24m?

    >
    > I don't have access to an SGI machine, so I can't verify this in
    > practice... However, a manual about IRIX
    > programming says:
    >
    >
    > You can specify an initial thread scheduling scope by calling
    > pthread_attr_setscope() and passing one of the scope constants
    > (PTHREAD_SCOPE_SYSTEM or PTHREAD_SCOPE_PROCESS) in the pthread_attr_t
    > object. By default, process scope is selected and scheduling is
    > performed by the thread runtime, but thread scheduling by the kernel is
    > provided with the system scope attribute. System scope threads run at
    > real-time policy and priority and may be created only by privileged
    > users.
    >

    >
    > You may need to select system contention scope (PTHREAD_SCOPE_SYSTEM)
    > to obtain the desired behavior.


    Hi Loic,

    yes, PTHREAD_SCOPE_SYSTEM sounds like it would be the solution.
    However, under IRIX, threads with scheduling scope PTHREAD_SCOPE_SYSTEM
    must be run as root or with scheduling management
    capabilities, since with this scope you can have (under IRIX) threads
    with a priority higher than system daemons. Therefore, this scheduling
    scope is not available to me.

    There is another, non-portable, scheduling scope available under IRIX
    (PTHREAD_SCOPE_BOUND_NP), but that shows the same behavior, i.e. the
    cpu-usage seems to be erratic and unpredictable (at least for me).

    Markus


  10. Re: threads do not get cpa

    > >>in a routine you are calling. Also if you were running Linux I'd
    > >>suspect Linux's brain damaged signaling

    > >
    > > Why are they brain damaged? Where can I read about it? Just asking.
    > >

    > pthread_cond_signal|broadcast preempts the calling thread. That can
    > slow things down considerably. For example, a simple producer/consumer
    > file copy gets slowed down by 2x or 3x or so.


    Are there many more things like this in Linux scheduler?

    I've seen this in one of your posts:

    On a single processor Linux system, it's hard to get as dramatic numbers due
    to the amount of scheduler artifacts that exist when trying this kind of
    stuff on a single processor Linux system. [...] and the
    mutex and rwlock versions hang with writer starvation.

    Is this related? I'm 99% sure it is, but only 99%.

    > There was a kernel patch
    > floating around to "fix" this. Or you could use fastcv from
    > http://sourceforge.net/projects/atomic-ptr-plus/
    > which is what I used. Or you could use a sem_trywait/sem_wait and
    > do a sched_yield if you had to wait using sem_wait.


    I've already seen it. Now I spend my evenings with the gcc manual in one hand
    (in one window) and the Intel documentation in the other, trying to understand
    every piece of it.

    > In practice this means that doing the pthread_cond_signal after releasing
    > the mutex, instead of while holding the mutex, has some performance
    > benefits on Linux since the signaled thread will attempt to get the mutex
    > before the signaler has a chance to release it if it is holding the lock
    > while signaling. Both forms of signaling are correct but the former seems to
    > upset some Posix purists for some reason.


    Stored in long-term memory.

    Thanks for the answer.

    JD


  11. Re: threads do not get cpa

    Jedrzej Dudkiewicz wrote:
    >>>>in a routine you are calling. Also if you were running Linux I'd
    >>>>suspect Linux's brain damaged signaling
    >>>
    >>>Why are they brain damaged? Where can I read about it? Just asking.
    >>>

    >>
    >>pthread_cond_signal|broadcast preempts the calling thread. That can
    >>slow things down considerably. For example, a simple producer/consumer
    >>file copy gets slowed down by 2x or 3x or so.

    >
    >
    > Are there many more things like this in Linux scheduler?
    >
    > I've seen this in one of your posts:
    >
    > On a single processor Linux system, it's hard to get as dramatic numbers due
    > to the amount of scheduler artifacts that exist when trying this kind of
    > stuff on a single processor Linux system. [...] and the
    > mutex and rwlock versions hang with writer starvation.
    >
    > Is this related? 99% that yes, but only 99%
    >
    >

    No. Usually when measuring scalability, it's nice to have a 4 way, 8 way,
    16 way system to test on. On a single processor, scheduler artifacts tend
    to get in the way and obscure your measurements. The starvation problem
    is due to SCHED_OTHER where the locks and wait sets don't have FIFO
    service order or some other mechanism which will guarantee forward progress.
    It's a scalability problem but its effect isn't so much in overall performance
    as it may affect just some of the threads. If the affected threads are all
    of the writer threads you will probably notice it. If the affected threads
    are just some of the reader threads you may not notice it right away depending
    on what the reader threads are doing, e.g. if forward progress can be made
    as long as some of the reader threads run.
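
One way to picture "FIFO service order" is a ticket lock, sketched below on top of a mutex and condition variable. This is an illustration of the idea, not something from the thread or from any particular library; `struct ticket_lock` and the functions around it are hypothetical names. Waiters are served in the order they took a ticket, so no waiter can be overtaken indefinitely once it holds a ticket. The small internal mutex is itself not FIFO, but it is held only briefly.

```c
#include <assert.h>
#include <pthread.h>

#define MAX_THREADS 16

/* Illustrative FIFO lock: serve tickets strictly in issue order. */
struct ticket_lock {
    pthread_mutex_t m;
    pthread_cond_t turn;
    unsigned long next_ticket;   /* next ticket to hand out */
    unsigned long now_serving;   /* ticket allowed to proceed */
};

int ticket_lock_init(struct ticket_lock *l)
{
    l->next_ticket = 0;
    l->now_serving = 0;
    if (pthread_mutex_init(&l->m, NULL) != 0)
        return -1;
    if (pthread_cond_init(&l->turn, NULL) != 0) {
        pthread_mutex_destroy(&l->m);
        return -1;
    }
    return 0;
}

void ticket_lock_acquire(struct ticket_lock *l)
{
    pthread_mutex_lock(&l->m);
    unsigned long mine = l->next_ticket++;
    while (mine != l->now_serving)
        pthread_cond_wait(&l->turn, &l->m);
    pthread_mutex_unlock(&l->m);
}

void ticket_lock_release(struct ticket_lock *l)
{
    pthread_mutex_lock(&l->m);
    l->now_serving++;
    /* Broadcast: only the waiter holding the next ticket proceeds. */
    pthread_cond_broadcast(&l->turn);
    pthread_mutex_unlock(&l->m);
}

struct counter_job {
    struct ticket_lock *lock;
    long *counter;
    int rounds;
};

static void *counter_main(void *arg)
{
    struct counter_job *j = arg;

    for (int i = 0; i < j->rounds; i++) {
        ticket_lock_acquire(j->lock);
        ++*j->counter;               /* protected by the ticket lock */
        ticket_lock_release(j->lock);
    }
    return NULL;
}

/* Demo: nthreads threads each add `rounds` to a shared counter.
   Returns the final count, or -1 on error. */
long count_with_ticket_lock(int nthreads, int rounds)
{
    pthread_t tid[MAX_THREADS];
    struct ticket_lock lock;
    long counter = 0;
    int started = 0;

    if (nthreads < 1 || nthreads > MAX_THREADS ||
        ticket_lock_init(&lock) != 0)
        return -1;
    struct counter_job job = { &lock, &counter, rounds };
    while (started < nthreads &&
           pthread_create(&tid[started], NULL, counter_main, &job) == 0)
        started++;
    for (int i = 0; i < started; i++)
        pthread_join(tid[i], NULL);
    pthread_cond_destroy(&lock.turn);
    pthread_mutex_destroy(&lock.m);
    return started == nthreads ? counter : -1;
}
```

The broadcast wakes every waiter on each release, so this trades throughput for fairness; it is meant to show the ordering guarantee, not to be a fast lock.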

    --
    Joe Seigh

    When you get lemons, you make lemonade.
    When you get hardware, you make software.
