Hello

I've found a strange behaviour I can't understand: under certain
circumstances a multithreaded process can become unresponsive and
eat all the CPU, bypassing even the limits set via "ulimit".

Normally, the process should take a few seconds; I don't know
when or how it reaches this strange state. (I haven't been able to
reproduce it, but I can see lots of similar user processes on
this system.)


* system:

shiva:~/ uname -srR
IRIX64 6.5 6.5.22m
shiva:~/ hinv
6 200 MHZ IP19 Processors
CPU: MIPS R4400 Processor Chip Revision: 6.0
FPU: MIPS R4000 Floating Point Coprocessor Revision: 0.0
Main memory size: 512 Mbytes, 2-way interleaved


* top

   PID   PGRP USERNAME PRI  SIZE   RES STATE   TIME WCPU%  CPU% COMMAND
506433 506433 XXXXXXXX  20 3040K 1472K sleep 785:41  87.2 87.09 matrice
508368 508368 XXXXXXXX  20 3056K 1472K sleep 785:58  83.6 84.94 matrice
508383 508383 XXXXXXXX  20 3056K 1472K sleep 647:25  69.5 70.05 matrice
508125 508125 XXXXXXXX  20 3056K 1472K sleep 488:54  52.3 52.00 matrice
508223 508223 XXXXXXXX  20 2416K 1344K sleep 321:58  34.9 35.00 matrice
508333 508333 XXXXXXXX  20 2416K 1344K sleep 322:24  34.7 34.91 matrice
507816 507816 XXXXXXXX  20 2416K 1344K sleep 323:37  34.6 34.58 matrice

* ps


XXXXXXXX 506433 1 0 18:16:34 ? 786:22 ./matrices
XXXXXXXX 507816 1 0 18:03:21 ? 323:54 ./matrices
XXXXXXXX 508125 1 0 17:56:37 ? 489:20 ./matrices
XXXXXXXX 508223 1 0 18:09:10 ? 322:15 ./matrices
XXXXXXXX 508333 1 0 18:08:23 ? 322:41 ./matrices
XXXXXXXX 508368 1 0 18:11:41 ? 786:38 ./matrices
XXXXXXXX 508383 1 0 17:58:15 ? 647:59 ./matrices

* par -sSSi -l -p 507816

continuous output:

4562mS[ 0] matrices(507816): nanosleep({sec=0, nsec=1000}, 0)
4562mS[ 4] matrices(507816): nanosleep({sec=0, nsec=1000}, 0)
4565mS[ 0] matrices(507816): END-nanosleep() OK


* source code "matrices.c"


uses pthread_* to create sysconf(_SC_NPROC_ONLN) threads,
using pthread_mutex_* and the typical stuff.

(I could post full source code - 190 lines - if necessary)
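
For readers without the source, the structure described above might look
roughly like the following. This is only a sketch, not the real matrices.c:
the worker body, run_workers() and the shared counter are made up for
illustration, and the fallback to _SC_NPROCESSORS_ONLN is an assumption so
that it also builds on systems other than IRIX.

```c
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

/* _SC_NPROC_ONLN is the IRIX name; most other systems spell it
 * _SC_NPROCESSORS_ONLN, so fall back to that when it is missing. */
#ifndef _SC_NPROC_ONLN
#define _SC_NPROC_ONLN _SC_NPROCESSORS_ONLN
#endif

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static long shared_total;          /* hypothetical shared result */

/* Hypothetical worker: adds its own value to the shared total. */
static void *worker(void *arg)
{
    long value = *(long *)arg;

    pthread_mutex_lock(&lock);
    shared_total += value;
    pthread_mutex_unlock(&lock);
    return NULL;
}

/* Start one thread per online CPU, wait for all of them, and return
 * the shared total, or -1 if memory or a thread could not be obtained. */
long run_workers(void)
{
    long ncpu = sysconf(_SC_NPROC_ONLN);
    long started = 0, i, result;
    pthread_t *tid;
    long *val;

    if (ncpu < 1)
        ncpu = 1;
    tid = malloc(ncpu * sizeof *tid);
    val = malloc(ncpu * sizeof *val);
    if (tid == NULL || val == NULL) {
        free(tid);
        free(val);
        return -1;
    }

    shared_total = 0;
    for (i = 0; i < ncpu; i++) {
        val[i] = i + 1;            /* thread i contributes i + 1 */
        if (pthread_create(&tid[i], NULL, worker, &val[i]) != 0)
            break;
        started++;
    }
    for (i = 0; i < started; i++)
        pthread_join(tid[i], NULL);   /* wait for each started thread */

    result = (started == ncpu) ? shared_total : -1;
    free(tid);
    free(val);
    return result;
}
```

With N online CPUs, run_workers() should return N*(N+1)/2 when every
thread has taken the mutex exactly once.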

* kill

it doesn't use any signal() call to catch signals.

process doesn't get killed by "kill -TERM".

kill -HUP does nothing

I have to use kill -9


* ulimit

the strangest thing about this is that the process is executed under
these limits:

time(seconds) 3600
file(blocks) unlimited
data(kbytes) 100000
stack(kbytes) 65536
memory(kbytes) 505808
coredump(blocks) unlimited
nofiles(descriptors) 200
vmemory(kbytes) 170000
concurrency(threads) 12


so, I don't understand how it can reach these cpu times:


506433 506433 XXXXXXXX 20 3040K 1472K sleep 785:41 87.2 87.09 matrice

(I have tested it, and the time limit is working:

./foo
Exceeded CPU time limit(coredump)
)
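
The same comparison can be made from inside a process with getrlimit()
and getrusage(). This is a minimal sketch using the standard calls only;
the helper names cpu_soft_limit() and cpu_seconds_used() are made up, and
whether the accounting on IRIX covers every thread of a pthread program
is an open question here, not something this code proves.

```c
#include <sys/time.h>
#include <sys/resource.h>

/* Returns the soft CPU time limit in seconds, -1 if it is unlimited,
 * -2 on error. */
long cpu_soft_limit(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_CPU, &rl) != 0)
        return -2;
    if (rl.rlim_cur == RLIM_INFINITY)
        return -1;
    return (long)rl.rlim_cur;
}

/* Returns the CPU seconds (user + system, whole seconds only) charged
 * to this process so far, or -1 on error. */
long cpu_seconds_used(void)
{
    struct rusage ru;

    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;
    return (long)(ru.ru_utime.tv_sec + ru.ru_stime.tv_sec);
}
```

If a process logged both values periodically, a cpu_seconds_used() that
stays small while top and ps report hundreds of minutes would suggest the
limit is being checked against a different counter than the one shown.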




--
PGP and other useless info at \
http://webdiis.unizar.es/~spd/ \
finger://daphne.cps.unizar.es/spd \ Timeo Danaos et dona ferentes
ftp://ivo.cps.unizar.es/pub/ \ (Virgilio)