Re: threads do not get cpa - SGI
-
Re: threads do not get cpa
Joe,
thanks for the suggestion!
I've added the following lines to my code:
    if ((r = pthread_setconcurrency(nthreads)))
        fprintf(stderr, "Could not set concurrency: %d!\n", r);
The results are a little bit erratic, sometimes I get the expected
behavior, i.e. activity_in_top = nthreads*100, and sometimes I get less
than the expected activity. All this is on an SGI Octane with IRIX
6.5.24m (16 CPUs), no other applications running. I am a little bit
puzzled now: could that be a strange effect of the IRIX scheduler?
I also noticed another oddity: When I run my program with more threads
it's slower than with just a single one, e.g.:
nthreads   cpu-time   wall-clock time
1          30         30
2          140        70
3          452        154
Do you have further suggestions?
Markus
-
Re: threads do not get cpa
azb123@planet.nl wrote:
> Joe,
> thanks for the suggestion!
> I've added the following lines to my code:
>
> if ((r=pthread_setconcurrency(nthreads)))
> fprintf(stderr,"Could not set concurrency: %d!\n",r);
>
> The results are a little bit erratic, sometimes I get the expected
> behavior, i.e. activity_in_top = nthreads*100, and sometimes I get less
> than the expected activity. All this is on an SGI Octane with IRIX
> 6.5.24m (16 CPUs), no other applications running. I am a little bit
> puzzled now: could that be a strange effect of the IRIX scheduler?
>
> I also noticed another oddity: When I run my program with more threads
> it's slower than with just a single one, e.g.:
>
> nthreads   cpu-time   wall-clock time
> 1          30         30
> 2          140        70
> 3          452        154
>
> Do you have further suggestions?
>
Could be lock contention, either on the locks in your program or on ones
in a routine you are calling. Also, if you were running Linux I'd
suspect Linux's brain-damaged signaling, but you're using IRIX and
I don't know whether it has the same problem.
--
Joe Seigh
When you get lemons, you make lemonade.
When you get hardware, you make software.
-
Re: threads do not get cpa
In article <1123428099.078544.129690@g49g2000cwa.googlegroups.com>,
wrote:
% I also noticed another oddity: When I run my program with more threads
% it's slower than with just a single one, e.g.:
Perhaps you spend more time contending for stderr and CurrentQuery than
you spend performing calculations. Try saving up the output in a
(pre-allocated) buffer and spitting it out in one go at thread exit
time, and pre-allocating the queries (i.e., have each thread step through
the list in steps of nthreads, rather than contending for a mutex after
each query).
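
The two suggestions above can be sketched roughly as follows. This is only an illustration, not the original program: the names (struct worker, run_worker, run_all, NQUERIES) are made up, the "query" is a placeholder, and each thread simply walks indices id, id + nthreads, id + 2*nthreads, ... while formatting its output into its own pre-allocated buffer, which it writes with a single call just before it exits.

```c
#include <pthread.h>
#include <stdio.h>

#define NQUERIES   12   /* hypothetical number of work items */
#define MAXTHREADS 8    /* hypothetical upper bound on threads */

/* Hypothetical per-thread record: each thread owns its slice and its output. */
struct worker {
    int id;            /* thread index, 0..nthreads-1 */
    int nthreads;      /* stride through the query list */
    char out[1024];    /* pre-allocated output buffer */
    size_t used;       /* bytes of out[] filled so far */
    int handled;       /* number of queries this thread processed */
};

static void *run_worker(void *arg)
{
    struct worker *w = arg;

    /* Thread i takes queries i, i+nthreads, i+2*nthreads, ... so no
     * mutex is needed to decide which query comes next. */
    for (int q = w->id; q < NQUERIES; q += w->nthreads) {
        size_t room = sizeof w->out - w->used;
        int n = snprintf(w->out + w->used, room,
                         "thread %d: query %d\n", w->id, q);
        if (n > 0 && (size_t)n < room)
            w->used += (size_t)n;
        w->handled++;
    }
    /* One write at thread exit instead of one per query. */
    fputs(w->out, stderr);
    return NULL;
}

/* Starts nthreads workers and returns the total number of queries
 * handled, or -1 if the arguments are bad or a thread could not start. */
static int run_all(int nthreads)
{
    pthread_t tid[MAXTHREADS];
    struct worker w[MAXTHREADS];
    int started, total = 0;

    if (nthreads < 1 || nthreads > MAXTHREADS)
        return -1;
    for (started = 0; started < nthreads; started++) {
        w[started] = (struct worker){ .id = started, .nthreads = nthreads };
        if (pthread_create(&tid[started], NULL, run_worker, &w[started]) != 0)
            break;
    }
    for (int i = 0; i < started; i++) {
        pthread_join(tid[i], NULL);
        total += w[i].handled;
    }
    return started == nthreads ? total : -1;
}
```

A static stride like this only balances well when the queries cost roughly the same; if they vary a lot, handing out larger chunks under a mutex may be the better trade-off.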
--
Patrick TJ McPhee
North York Canada
ptjm@interlog.com
-
Re: threads do not get cpa
In article <3lmrhmF133a4fU2@uni-berlin.de>,
Patrick TJ McPhee wrote:
>In article <1123428099.078544.129690@g49g2000cwa.googlegroups.com>,
> wrote:
>
>% I also noticed another oddity: When I run my program with more threads
>% it's slower than with just a single one, e.g.:
>
>Perhaps you spend more time contending for stderr and CurrentQuery than
>you spend performing calculations. Try saving up the output in a
>(pre-allocated) buffer and spitting it out in one go at thread exit
>time, and pre-allocating the queries (i.e., have each thread step through
>the list in steps of nthreads, rather than contending for a mutex after
>each query).
Also note that some implementations of drand48() require a mutex to
protect the internal state of the random number generator. Others
have the generator per-thread, but there's not any easy way to
tell the difference.
You might want to do your own implementation of drand48() that
takes a context pointer, and have separate contexts for each
thread. Or just come up with some other kind of work to do that
is less likely to have an internal mutex.
--
Steve Watt KD6GGD PP-ASEL-IA ICBM: 121W 56' 57.8" / 37N 20' 14.9"
Internet: steve @ Watt.COM Whois: SW32
Free time? There's no such thing. It just comes in varying prices...
-
Re: threads do not get cpa
> Also note that some implementations of drand48() require a mutex to
> protect the internal state of the random number generator. Others
> have the generator per-thread, but there's not any easy way to
> tell the difference.
Thanks, this was exactly the problem! I replaced drand48() by erand48(),
which takes a pointer to a buffer for the internal state as a function
argument, and now I get the desired speed-up. It seems that the
internal state was mutex-protected, and that resulted in a contention
problem.
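
For readers who hit the same thing, here is a minimal sketch of the erand48() approach, not the actual program from this thread. The names (struct rng_job, draw, mean_of_draws) and the seed values are invented for illustration; the only point is that each thread passes its own three-element state array, so no generator state is shared between threads.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 600   /* asks the C library to declare erand48() */
#endif
#include <pthread.h>
#include <stdlib.h>

#define NTHREADS 3
#define NDRAWS   100000L

/* Hypothetical per-thread job: private generator state plus a result. */
struct rng_job {
    unsigned short state[3];   /* erand48() state, owned by one thread */
    double sum;                /* sum of the values this thread drew */
};

static void *draw(void *arg)
{
    struct rng_job *job = arg;
    double sum = 0.0;

    /* erand48() updates only the caller-supplied state array. */
    for (long i = 0; i < NDRAWS; i++)
        sum += erand48(job->state);
    job->sum = sum;
    return NULL;
}

/* Returns the mean of all draws, or -1.0 if a thread could not start. */
static double mean_of_draws(void)
{
    pthread_t tid[NTHREADS];
    struct rng_job job[NTHREADS];
    double total = 0.0;
    int started;

    for (started = 0; started < NTHREADS; started++) {
        /* Arbitrary illustrative seeds; each thread gets a different one. */
        job[started].state[0] = 0x330E;
        job[started].state[1] = (unsigned short)(started + 1);
        job[started].state[2] = 0;
        job[started].sum = 0.0;
        if (pthread_create(&tid[started], NULL, draw, &job[started]) != 0)
            break;
    }
    for (int i = 0; i < started; i++) {
        pthread_join(tid[i], NULL);
        total += job[i].sum;
    }
    if (started != NTHREADS)
        return -1.0;
    return total / ((double)NTHREADS * NDRAWS);
}
```

Since erand48() returns values in [0.0, 1.0), the mean should come out close to 0.5. How the per-thread streams are seeded matters for statistical quality; the seeds here are placeholders.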
However, the original question still remains: the number of processors
the program runs on seems unpredictable (at least to me), even after I
have added the pthread_setconcurrency() call. When I create 3 threads,
the program sometimes gets 3 processors, and sometimes just 1 (the SGI
is still mostly empty).
Is there anything known about the way the number of processors to be
used is determined under IRIX 6.5.24m?
Markus
-
Re: threads do not get cpa
> Could be lock contention from either the locks in your program or ones
> in a routine you are calling. Also if you were running Linux I'd
> suspect Linux's brain damaged signaling but you're using IRIX which
> I don't know if it has the same problem.
Why are they brain damaged? Where can I read about it? Just asking.
JD
-
Re: threads do not get cpa
Salut Markus,
> However, the original question still remains: the number of processors
> the program runs on seems unpredictable (at least to me), even after I
> have added the pthread_setconcurrency() call. When I create 3 threads,
> the program sometimes gets 3 processors, and sometimes just 1 (the SGI
> is still mostly empty).
>
> Is there anything known about the way the number of processors to be
> used is determined under IRIX 6.5.24m?
I don't have access to an SGI machine, so I can't verify this in
practice... However, according to a manual about IRIX programming:
You can specify an initial thread scheduling scope by calling
pthread_attr_setscope() and passing one of the scope constants
(PTHREAD_SCOPE_SYSTEM or PTHREAD_SCOPE_PROCESS) in the pthread_attr_t
object. By default, process scope is selected and scheduling is
performed by the thread runtime, but thread scheduling by the kernel is
provided with the system scope attribute. System scope threads run at
real-time policy and priority and may be created only by privileged
users.
You may need to select system contention scope (PTHREAD_SCOPE_SYSTEM)
to obtain the desired behavior.
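
A rough sketch of how that could look, untested on IRIX: the helper name create_with_system_scope and the fallback policy are my own assumptions, not something from the manual. It requests PTHREAD_SCOPE_SYSTEM and, if either the attribute call or pthread_create() refuses (for example because the user lacks the privilege), it retries with default attributes.

```c
#include <pthread.h>

/* Example thread body used for illustration: sets the int it is given. */
static void *mark_done(void *arg)
{
    *(int *)arg = 1;
    return NULL;
}

/* Hypothetical helper: tries to create a system-scope thread and falls
 * back to default attributes if system scope is refused.
 * Returns 0 on success or an error number; *got_system is set to 1 if
 * the thread was created with PTHREAD_SCOPE_SYSTEM, else 0. */
static int create_with_system_scope(pthread_t *tid,
                                    void *(*fn)(void *), void *arg,
                                    int *got_system)
{
    pthread_attr_t attr;
    int r = pthread_attr_init(&attr);

    *got_system = 0;
    if (r != 0)
        return r;

    r = pthread_attr_setscope(&attr, PTHREAD_SCOPE_SYSTEM);
    if (r == 0)
        r = pthread_create(tid, &attr, fn, arg);
    pthread_attr_destroy(&attr);

    if (r == 0) {
        *got_system = 1;
        return 0;
    }
    /* System scope was not available: use the default scope instead. */
    return pthread_create(tid, NULL, fn, arg);
}
```

Checking got_system afterwards tells you whether the request was honored, which is worth logging when comparing CPU usage between runs.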
With regards,
Loic.
-
Re: threads do not get cpa
Jedrzej Dudkiewicz wrote:
>>Could be lock contention from either the locks in your program or ones
>>in a routine you are calling. Also if you were running Linux I'd
>>suspect Linux's brain damaged signaling but you're using IRIX which
>>I don't know if it has the same problem.
>
>
> Why are they brain damaged? Where can I read about it? Just asking.
>
pthread_cond_signal|broadcast preempts the calling thread. That can
slow things down considerably. For example, a simple producer/consumer
file copy gets slowed down by 2x or 3x or so. There was a kernel patch
floating around to "fix" this. Or you could use fastcv from
http://sourceforge.net/projects/atomic-ptr-plus/
which is what I used. Or you could use a sem_trywait/sem_wait and
do a sched_yield if you had to wait using sem_wait.
In practice this means that doing the pthread_cond_signal after releasing
the mutex, instead of while holding the mutex, has some performance
benefits on Linux since the signaled thread will attempt to get the mutex
before the signaler has a chance to release it if it is holding the lock while
signaling. Both forms of signaling are correct but the former seems to
upset some Posix purists for some reason.
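
To make the "signal after releasing the mutex" form concrete, here is a minimal single-slot sketch. The struct and function names (mailbox, mailbox_put, mailbox_get, producer) are invented for illustration; it handles one item at a time and is not the file-copy program mentioned above.

```c
#include <pthread.h>

/* Hypothetical one-slot mailbox shared by a producer and a consumer. */
struct mailbox {
    pthread_mutex_t lock;
    pthread_cond_t ready;
    int has_item;   /* predicate protected by lock */
    int item;
};

static void mailbox_init(struct mailbox *m)
{
    pthread_mutex_init(&m->lock, NULL);
    pthread_cond_init(&m->ready, NULL);
    m->has_item = 0;
    m->item = 0;
}

static void mailbox_put(struct mailbox *m, int value)
{
    pthread_mutex_lock(&m->lock);
    m->item = value;
    m->has_item = 1;
    pthread_mutex_unlock(&m->lock);
    /* Signal after releasing the mutex, so a consumer that wakes up at
     * once does not block on a lock the producer still holds. */
    pthread_cond_signal(&m->ready);
}

static int mailbox_get(struct mailbox *m)
{
    int value;

    pthread_mutex_lock(&m->lock);
    /* The predicate is re-checked under the lock, so a signal sent
     * before the wait started is not lost. */
    while (!m->has_item)
        pthread_cond_wait(&m->ready, &m->lock);
    value = m->item;
    m->has_item = 0;
    pthread_mutex_unlock(&m->lock);
    return value;
}

/* Example producer thread: posts a single value. */
static void *producer(void *arg)
{
    mailbox_put(arg, 42);
    return NULL;
}
```

The other form differs only in calling pthread_cond_signal() before pthread_mutex_unlock(); both are correct as long as the predicate is changed while the mutex is held.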
--
Joe Seigh
When you get lemons, you make lemonade.
When you get hardware, you make software.
-
Re: threads do not get cpa
loic-dev@gmx.net wrote:
> Salut Markus,
>
> > However, the original question still remains: the number of processors
> > the program runs on seems unpredictable (at least to me), even after I
> > have added the pthread_setconcurrency() call. When I create 3 threads,
> > the program sometimes gets 3 processors, and sometimes just 1 (the SGI
> > is still mostly empty).
> >
> > Is there anything known about the way the number of processors to be
> > used is determined under IRIX 6.5.24m?
>
> I don't have access to an SGI machine, so I can't verify this in
> practice... However, according to a manual about IRIX programming:
>
>
> You can specify an initial thread scheduling scope by calling
> pthread_attr_setscope() and passing one of the scope constants
> (PTHREAD_SCOPE_SYSTEM or PTHREAD_SCOPE_PROCESS) in the pthread_attr_t
> object. By default, process scope is selected and scheduling is
> performed by the thread runtime, but thread scheduling by the kernel is
> provided with the system scope attribute. System scope threads run at
> real-time policy and priority and may be created only by privileged
> users.
>
>
> You may need to select system contention scope (PTHREAD_SCOPE_SYSTEM)
> to obtain the desired behavior.
Hi Loic,
yes, PTHREAD_SCOPE_SYSTEM sounds like it would be the solution.
However, under IRIX, threads with scheduling scope PTHREAD_SCOPE_SYSTEM
have to be run as root or with scheduling management capabilities,
since with this scope you can have (under IRIX) threads with a priority
higher than system daemons. Therefore, this scheduling scope is not
available to me.
There is another, non-portable, scheduling scope available under IRIX
(PTHREAD_SCOPE_BOUND_NP), but that shows the same behavior, i.e. the
cpu-usage seems to be erratic and unpredictable (at least for me).
Markus
-
Re: threads do not get cpa
> >>in a routine you are calling. Also if you were running Linux I'd
> >>suspect Linux's brain damaged signaling
> >
> > Why are they brain damaged? Where can I read about it? Just asking.
> >
> pthread_cond_signal|broadcast preempts the calling thread. That can
> slow things down considerably. For example, a simple producer/consumer
> file copy gets slowed down by 2x or 3x or so.
Are there many more things like this in the Linux scheduler?
I've seen this in one of your posts:
On a single processor Linux system, it hard to get as dramatic numbers due
to the amount of scheduler artifacts that exist when trying this kind of
stuff on a single processor Linux system. [...] and the
mutex and rwlock versions hang with writer starvation.
Is this related? I'm 99% sure it is, but only 99%.
> There was a kernel patch
> floating around to "fix" this. Or you could use fastcv from
> http://sourceforge.net/projects/atomic-ptr-plus/
> which is what I used. Or you could use a sem_trywait/sem_wait and
> do a sched_yield if you had to wait using sem_wait.
I've already seen it. Now I spend my evenings with the gcc manual in one
hand (in one window) and the Intel documentation in the other, trying to
understand every piece of it.
> In practice this means that doing the pthread_cond_signal after releasing
> the mutex, instead of while holding the mutex, has some performance
> benefits on Linux since the signaled thread will attempt to get the mutex
> before the signaler has a chance to release it if it is holding the lock
> while signaling. Both forms of signaling are correct but the former seems to
> upset some Posix purists for some reason.
Stored in long-term memory.
Thanks for the answer.
JD
-
Re: threads do not get cpa
Jedrzej Dudkiewicz wrote:
>>>>in a routine you are calling. Also if you were running Linux I'd
>>>>suspect Linux's brain damaged signaling
>>>
>>>Why are they brain damaged? Where can I read about it? Just asking.
>>>
>>
>>pthread_cond_signal|broadcast preempts the calling thread. That can
>>slow things down considerably. For example, a simple producer/consumer
>>file copy gets slowed down by 2x or 3x or so.
>
>
> Are there many more things like this in Linux scheduler?
>
> I've seen this in one of your posts:
>
> On a single processor Linux system, it hard to get as dramatic numbers due
> to the amount of scheduler artifacts that exist when trying this kind of
> stuff on a single processor Linux system. [...] and the
> mutex and rwlock versions hang with writer starvation.
>
> Is this related? 99% that yes, but only 99% 
>
>
No. Usually when measuring scalability, it's nice to have a 4 way, 8 way,
16 way system to test on. On a single processor, scheduler artifacts tend
to get in the way and obscure your measurements. The starvation problem
is due to SCHED_OTHER where the locks and wait sets don't have FIFO
service order or some other mechanism which will guarantee forward progress.
It's a scalability problem but its effect isn't so much in overall performance
as it may affect just some of the threads. If the affected threads are all
of the writer threads you will probably notice it. If the affected threads
are just some of the reader threads you may not notice it right away depending
on what the reader threads are doing, e.g. if forward progress can be made
as long as some of the reader threads run.
--
Joe Seigh
When you get lemons, you make lemonade.
When you get hardware, you make software.