-
Pressure on malloc
I've got a problem on AIX 5.3 using C++ RunTime 9.0.0.1, which is
returning ENOMEM
with a 2GB heap. The core file is 300MB, so there is no way we are
even close to
2GB. I tried to write a small test case to expose this failure,
and was surprised
because it crashed every AIX malloc algorithm I know of.
1. Is there something wrong with my testcase?
The whole point is to put all the wood behind one arrow and aim it
at malloc/free.
2. Does this testcase fail in other environments?
"Not so much" in AIX 5.1 and 5.2, is my experience.
Perhaps the "lock free" malloc algorithms are responsible?
The big idea behind the testcase is to run N threads very hard,
and cache a total of nMem previous allocations, and then start
freeing them in LIFO fashion, and allocate a new (random sized)
block in its place.
The code runs in 32-bit or 64-bit mode; I tend to push 64-bit.
Run for 30 seconds with 256 threads:
new64 30 256
Try other algorithms:
MALLOCMULTIHEAP=considersize,heaps:4 new64 30 256
MALLOCMULTIHEAP=considersize,heaps:16 new64 30 256
MALLOCTYPE=buckets new64 30 256
MALLOCTYPE=watson new64 30 256
---- Makefile ---
CCaix = xlC_r
default: clean new64
new: new.cpp
$(CCaix) -qlanglvl=newexcp -qkeepparm -bmaxdata:0x80000000 -O -
o new new.cpp
./new 4 4
new64: new.cpp
$(CCaix) -qlanglvl=newexcp -q64 -qkeepparm -bmaxdata:
0x80000000 -O -o new64 new.c
pp
./new64 4 4
--- new.cpp ---
// xlC_r -q64 -O -o new new.cpp
// [url]http://publib.boulder.ibm.com/infocenter/pseries/v6r1/index.jsp?topic=/com.ibm.aix.basetechref/doc/basetrf1/fetch_and_add.htm[/url]
#define nMEM 16384
// #define nMEM 4096
#include <new>
#include <iostream>
#include <cstdlib>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <errno.h>
#include <sys/time.h>
#include <sys/atomic_op.h>
int bDie=FALSE, bStart=FALSE, gLoops=0;
volatile int nRun=0;
atomic_l nNew=0;
char* gNew[nMEM];
using namespace std;
void no_storage() {
std::cerr << "operator new failed: errno=" << errno << " " <<
sys_errlist[errno] << endl;
throw std::bad_alloc(); // abort(); //std::exit(1);
}
void * ThreadNew(int* arg) {
long n; size_t rSize; unsigned int Seed = 7654321;
fetch_and_add((atomic_p)&nRun, 1); while (!bStart) { usleep(0); }
try {
while (!bDie) {
(*arg)++;
n = (fetch_and_addlp((atomic_l)&nNew, 1)) % nMEM;
delete gNew[n];
rSize = 0x0fffffff & rand_r(&Seed); // 0:268,435,455
gNew[n] = new char[rSize];
memset(gNew[n], 1, rSize);
}
fetch_and_add((atomic_p)&nRun, -1);
}
catch (const std::bad_alloc&) {
std::cout << "caught bad_alloc with size " << rSize << " bytes
#Allocations=" << nNew << std::endl;
fetch_and_add((atomic_p)&nRun, -1);
}
return 0;
}
extern "C" int main (int argc, char** argv) {
int i, status, perThread[4096]; pthread_t thread[1024];
pthread_attr_t attr;
if (argc != 3) { printf("usage: %s seconds threads\n", argv[0]);
return 0; }
int nSeconds = atol(argv[1]);
int nThreads = atol(argv[2]);
void *eBrk, *sBrk = sbrk(0); long tBrk;
if (nThreads < 1 || nSeconds < 1) { printf("invalid parameters\n");
return 0; }
pthread_attr_init(&attr);
pthread_attr_setdetachstate(&attr,PTHREAD_CREATE_DETACHED);
std::set_new_handler(&no_storage);
memset(&gNew, 0, sizeof(gNew));
//--------------------------------------------------------------------------------
for (i = 0; i < nThreads; i++) {
perThread[i] = 0;
status = pthread_create(&thread[i], &attr, (void*(*)(void*))
ThreadNew, (void*)&perThread[i]);
if (status != 0) { printf("can't start thread\n"); return 0; }
}
while (nRun < nThreads) sched_yield(); // Wait 'til
they start
bStart = TRUE; usleep(nSeconds*1000000);
bDie = TRUE; while (nRun != 0) sched_yield(); // Wait 'til
they end
for (i=0; i<nThreads; i++) gLoops+=perThread[i];
eBrk = sbrk(0); tBrk = ( (long)eBrk) - ( (long)sBrk);
printf("%llX End\n%llX Start\n %8llX Total -- %d Allocations\n",
eBrk, sBrk, tBrk, nNew);
printf(" %10d perThread Total %10d Allocations \n", gLoops, nNew);
printf(" %10d perSecond\n", gLoops / (nThreads * nSeconds));
for (i=0; i<nThreads; i++) { printf(" %10d\n", perThread[i]); }
printf("\n");
//--------------------------------------------------------------------------------
return 0;
}
--- The algorithm ---
1. atomicly increment an index pointer into the "gNew[]" array.
n = (fetch_and_addlp((atomic_l)&nNew, 1)) % nMEM;
The number wraps around nMEM.
2. Free the memory address at gNew[n] and allocate a new block.
delete gNew[n];
rSize = 0x0fffffff & rand_r(&Seed); // 0:268,435,455
gNew[n] = new char[rSize];
memset(gNew[n], 1, rSize);
Note: delete NULL is ok.
3. When the number of allocations (nNew) exceeds nMEM,
then start freeing previous allocations before allocating new
ones.
4. memset() is just there to ensure the heap is filled with something.
The size of the core file is a hint about how much memory was
allocated.
--- Notes ---
The application I'm testing is C++, which has lots of new/delete code.
This program puts pressure on new/delete to perform according to
specifications.
The idea is to hang onto memory that is allocated (for awhile).
If you new/delete memory, you won't use much of the heap,
nor exercise the malloc subsystem very much.
The test hangs onto nMem allocations, then deletes them in LIFO
fashion.
The output from the program (that doesn't crash) is:
$ new64 30 256
12FEAC560 End Heap address at end of pgm
110031A40 Start Heap address at start of pgm
1FE7AB20 Total Size of the Heap (in hex)
5538791 perThread Total 5538791 Allocations Number of
allocations
721 perSecond A measure of how efficient the
algorithm is for a CPU
"Code Worriers"
"memset" in the test is unnecessary.
If you think it might contribute to heap corruption, then comment it
out.
(malloc/free will still crash)
The return value from "fetch_and_addLP()" is the original value.
This is the index into gNew[] array. The testcase starts with "0',
and the first value returned is also "0", and continues until we wrap
at nMem.
I am concerned mostly about AIX 5.3 on very fast processors
(Power5, Power5++, Power6). But other machines also fail in
malloc if they are given enough "pressure".
The program prints out the number of new/delete combinations for
each thread. It is an indication of "schedule fairness".
If the numbers are evenly distributed, it means threads are not
locking up because of mutex operations within malloc.
-
Re: Pressure on malloc
One possibility:
In ThreadNew(), the value of n (at any point in time) is not guaranteed
to be unique. There exists a possibility for 2 (or more) threads to
have the same value of n at some point time (because of the modulo
operator). This introduces the potential for stale data (in gNew) or
stepping over updates from other threads.
I was able to make the testcase fail on my modest Power3 box with
new64 30 56
Then I increased nNew to 65536 and the run was successful. With
new64 30 256
I needed to increase nNew to 131072 for a successful run.
Larger values of nNew (and hence more space in gNew) alleviate the
potential conflict until more pressure occurs. This implies that (as
improbable as it may seem) n is managing to hit the same value in
2 threads.
-
Re: Pressure on malloc
The calculated value of "n" at any point in time is not random, it is
an atomic
incremented value that points to a slot in retained memory.
Note: the program is not designed to be "safe" in any respect to a
valid application.
It's only purpose is to put pressure on malloc/free.
The storage of every new allocation will be stored into gNew[n].
If two threads were to somehow get the same slot, it's a giant "so
what".
- Memory storage is atomic on PowerPC.
- If one memory address was "lost" because two threads used the same
"n",
it only means that memory would be leaked. Nothing else.
- There is no way that this program should ever corrupt malloc.
The locking is performed entirely within the AIX algorithms ...
nothing a
user application can do to corrupt that, other than overwriting
storage
areas and destroying adjacent malloc control blocks.
That is not happening in this test case.
I also increase nNew values to large numbers, without any noticeable
effect
on the stability of the test case. It doesn't always fail ...
especially on slower Power3
boxes. Throw the execution into a shell do while loop, and it
eventually will crash.
It has crashed every machine/AIX version I have access to ... AIX 5.1
thru AIX 5.3.
It has crashed on every C++ Runtime version, v6 thru v9
The increased nNew just means we build a larger inventory of new
requests that
we preserve. The pressure continues to be placed on malloc.
I have observed that it does always take some free activity to trigger
the crash.
If you are varying the number of retained nNew requests, you might
also
vary the amount of time the case is allowed to run.
At 131072, run:
new64 120 256
new64 240 256
Remove the memset from the test. I have never experienced two threads
simultaneously
getting the same "n" slot, but I've discussed this with AIX Tech
Support:
Thread 1 gets 100 bytes; stores then clears that memory
Thread 2 gets 200 bytes; stores then clears that memory
In the unfortunate situation where Thread 2 stores, then Thread 1 ...
it would be theoretically possible for Thread 2 to attempt to
initialize
200 bytes into a block that is only 100 bytes, and thus over-write
malloc controls.
"Blue Moon" plays in my head when I try to contemplate this, but I
just remove
memset to make the possibility disappear from discussion.
On Sep 25, 12:18 pm, "Gary R. Hook" <gh...@no.spammers.net> wrote:[color=blue]
> One possibility:
>
> In ThreadNew(), the value of n (at any point in time) is not guaranteed
> to be unique. There exists a possibility for 2 (or more) threads to
> have the same value of n at some point time (because of the modulo
> operator). This introduces the potential for stale data (in gNew) or
> stepping over updates from other threads.
>
> I was able to make the testcase fail on my modest Power3 box with
> new64 30 56
> Then I increased nNew to 65536 and the run was successful. With
> new64 30 256
> I needed to increase nNew to 131072 for a successful run.
>
> Larger values of nNew (and hence more space in gNew) alleviate the
> potential conflict until more pressure occurs. This implies that (as
> improbable as it may seem) n is managing to hit the same value in
> 2 threads.[/color]
-
Re: Pressure on malloc
One other thing to keep in mind ... this failure is likely the cause
of corruption opportunities
presented by malloc interlocking schemes.
When you run 256 threads, make sure that the number of allocations
recorded
in each thread are approximately equal. AIX support ran the test on a
Power5 machine,
1.6 MHz speed ... but they ran in a virtual LPAR, which allocated
small percentages
of real cpu usage to the application.
As a result, for 256 threads, 115 of them had ZERO allocations.
They never got dispatched, and for an application like ours that
requires it to
be responsive to user requests in a thread, this would be a disaster.
The testcase waits until it has all the threads running, then starts
them
simultaneously. One can use the number of malloc requests per second
as a measure of that algorithm's effectiveness in delivering mostly
application
cpu cycles ... and not interlocking on internal mechanisms the
application
writer has little control over.
This test was written to discover the best algorithms to advise our
customers
to use on a particular AIX configuration.
If you aren't getting even distributions of new requests, it lowers
the probability
you will ever expose a race condition.
On Sep 25, 12:18 pm, "Gary R. Hook" <gh...@no.spammers.net> wrote:
[color=blue]
> In ThreadNew(), the value of n (at any point in time) is not guaranteed to be unique.[/color]
-
Re: Pressure on malloc
I think the objection Gary raised about "n" not being unique, can
probably
occur more often than I expected. So I've modified the program to
add this protection guard against two threads operating on the same
slot.
char * pMem = gNew[n];
if (pMem == -1) continue;
if (!compare_and_swaplp(&gNew[n], pMem, -1)) continue;
delete pMem;
The programs sets -1 into the slot when it is working,
which gets replaced when a new allocation is obtained.
The problem is that malloc can block on internal mutex while other
threads are making progress ... especially with MULTIHEAP algorithms.
The code block in the thread now looks like this ... untested, because
I don't have access to an AIX machine just now.
while (!bDie) {
(*arg)++;
n = (fetch_and_addlp((atomic_l)&nNew, 1)) % nMEM;
char * pMem = gNew[n];
if (pMem == -1) continue;
if (!compare_and_swaplp(&gNew[n], pMem, -1)) continue;
delete pMem;
rSize = 0x0fffffff & rand_r(&Seed); // 0:268,435,455
gNew[n] = new char[rSize];
memset(gNew[n], 1, rSize);
}