Pressure on malloc - AIX

Thread: Pressure on malloc

  1. Pressure on malloc

    I've got a problem on AIX 5.3 using C++ RunTime 9.0.0.1, which is returning
    ENOMEM with a 2GB heap. The core file is 300MB, so there is no way we are
    even close to 2GB. I tried to write a small test case to expose this
    failure, and was surprised because it crashed every AIX malloc algorithm
    I know of.

    1. Is there something wrong with my testcase?
    The whole point is to put all the wood behind one arrow and aim it
    at malloc/free.
    2. Does this testcase fail in other environments?
    "Not so much" in AIX 5.1 and 5.2, is my experience.
    Perhaps the "lock free" malloc algorithms are responsible?

    The big idea behind the testcase is to run N threads very hard, cache a
    total of nMEM previous allocations, and then start freeing them in LIFO
    fashion, allocating a new (random-sized) block in each slot's place.

    The code runs in 32-bit or 64-bit mode; I tend to push 64-bit.
    Run for 30 seconds with 256 threads:
    new64 30 256
    Try other algorithms:
    MALLOCMULTIHEAP=considersize,heaps:4 new64 30 256
    MALLOCMULTIHEAP=considersize,heaps:16 new64 30 256
    MALLOCTYPE=buckets new64 30 256
    MALLOCTYPE=watson new64 30 256
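    (A hypothetical convenience, not in the posted source: have the program
    print which allocator settings it sees, so each run is self-documenting.)

    // Hypothetical addition: report the malloc tuning environment variables.
    #include <stdio.h>
    #include <stdlib.h>

    static void report_malloc_settings(void) {
        const char *mt = getenv("MALLOCTYPE");
        const char *mh = getenv("MALLOCMULTIHEAP");
        printf("MALLOCTYPE=%s MALLOCMULTIHEAP=%s\n",
               mt ? mt : "(default)", mh ? mh : "(unset)");
    }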


    ---- Makefile ----
    CCaix = xlC_r

    default: clean new64

    new: new.cpp
        $(CCaix) -qlanglvl=newexcp -qkeepparm -bmaxdata:0x80000000 -O -o new new.cpp
        ./new 4 4

    new64: new.cpp
        $(CCaix) -qlanglvl=newexcp -q64 -qkeepparm -bmaxdata:0x80000000 -O -o new64 new.cpp
        ./new64 4 4

    --- new.cpp ---
    // xlC_r -q64 -O -o new new.cpp
    // http://publib.boulder.ibm.com/infoce...ch_and_add.htm
    #define nMEM 16384
    // #define nMEM 4096
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <errno.h>
    #include <unistd.h>
    #include <pthread.h>
    #include <sched.h>
    #include <iostream>
    #include <new>
    #include <sys/atomic_op.h>

    #ifndef FALSE
    #define FALSE 0
    #define TRUE  1
    #endif

    volatile int bDie=FALSE, bStart=FALSE;
    int gLoops=0;
    volatile int nRun=0;
    long nNew=0;                  // atomically incremented with fetch_and_addlp()
    char* gNew[nMEM];
    using namespace std;
    void no_storage() {
        std::cerr << "operator new failed: errno=" << errno << " "
                  << strerror(errno) << std::endl;
        throw std::bad_alloc(); // abort(); //std::exit(1);
    }

    void * ThreadNew(int* arg) {
        long n; size_t rSize = 0; unsigned int Seed = 7654321;
        fetch_and_add((atomic_p)&nRun, 1);
        while (!bStart) { usleep(0); }              // wait for the starting gun
        try {
            while (!bDie) {
                (*arg)++;
                n = (fetch_and_addlp((atomic_l)&nNew, 1)) % nMEM;
                delete [] gNew[n];                  // deleting NULL is ok
                rSize = 0x0fffffff & rand_r(&Seed); // 0:268,435,455
                gNew[n] = new char[rSize];
                memset(gNew[n], 1, rSize);
            }
            fetch_and_add((atomic_p)&nRun, -1);
        }
        catch (const std::bad_alloc&) {
            std::cout << "caught bad_alloc with size " << rSize
                      << " bytes #Allocations=" << nNew << std::endl;
            fetch_and_add((atomic_p)&nRun, -1);
        }
        return 0;
    }

    extern "C" int main (int argc, char** argv) {
    int i, status, perThread[4096]; pthread_t thread[1024];
    pthread_attr_t attr;
    if (argc != 3) { printf("usage: %s seconds threads\n", argv[0]);
    return 0; }
    int nSeconds = atol(argv[1]);
    int nThreads = atol(argv[2]);
    void *eBrk, *sBrk = sbrk(0); long tBrk;
    if (nThreads < 1 || nSeconds < 1) { printf("invalid parameters\n");
    return 0; }
    pthread_attr_init(&attr);
    pthread_attr_setdetachstate(&attr,PTHREAD_CREATE_DETACHED);
    std::set_new_handler(&no_storage);
    memset(&gNew, 0, sizeof(gNew));
    //--------------------------------------------------------------------------------
    for (i = 0; i < nThreads; i++) {
    perThread[i] = 0;
    status = pthread_create(&thread[i], &attr, (void*(*)(void*))
    ThreadNew, (void*)&perThread[i]);
    if (status != 0) { printf("can't start thread\n"); return 0; }
    }
    while (nRun < nThreads) sched_yield(); // Wait 'til
    they start
    bStart = TRUE; usleep(nSeconds*1000000);
    bDie = TRUE; while (nRun != 0) sched_yield(); // Wait 'til
    they end

    for (i=0; i eBrk = sbrk(0); tBrk = ( (long)eBrk) - ( (long)sBrk);
    printf("%llX End\n%llX Start\n %8llX Total -- %d Allocations\n",
    eBrk, sBrk, tBrk, nNew);
    printf(" %10d perThread Total %10d Allocations \n", gLoops, nNew);
    printf(" %10d perSecond\n", gLoops / (nThreads * nSeconds));
    for (i=0; i printf("\n");
    //--------------------------------------------------------------------------------
    return 0;
    }

    --- The algorithm ---
    1. Atomically increment an index into the "gNew[]" array.
    n = (fetch_and_addlp((atomic_l)&nNew, 1)) % nMEM;
    The index wraps around at nMEM.

    2. Free the memory address at gNew[n] and allocate a new block.
    delete [] gNew[n];
    rSize = 0x0fffffff & rand_r(&Seed); // 0:268,435,455
    gNew[n] = new char[rSize];
    memset(gNew[n], 1, rSize);
    Note: deleting a NULL pointer is ok.

    3. When the number of allocations (nNew) exceeds nMEM,
    then start freeing previous allocations before allocating new
    ones.

    4. memset() is just there to ensure the heap is filled with something.
    The size of the core file is a hint about how much memory was
    allocated.

    --- Notes ---

    The application I'm testing is C++, which has lots of new/delete code.
    This program puts pressure on new/delete to perform according to
    specifications.

    The idea is to hang onto memory that is allocated (for a while).
    If you delete memory right after allocating it, you won't use much of the
    heap, nor exercise the malloc subsystem very much.

    The test hangs onto nMEM allocations, then deletes them in LIFO fashion.

    The output from the program (when it doesn't crash) is:
    $ new64 30 256
    12FEAC560 End                 Heap address at end of pgm
    110031A40 Start               Heap address at start of pgm
     1FE7AB20 Total               Size of the heap (in hex)
      5538791 perThread Total 5538791 Allocations    Number of allocations
          721 perSecond           A measure of how efficient the algorithm is for a CPU

    "Code Worriers"
    "memset" in the test is unnecessary.
    If you think it might contribute to heap corruption, then comment it
    out.
    (malloc/free will still crash)

    The return value from fetch_and_addlp() is the original value. This is the
    index into the gNew[] array. The counter starts at "0", so the first value
    returned is also "0", and it continues until we wrap at nMEM.
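    For illustration, a minimal standalone sketch of that behavior (this is
    not part of the testcase; it assumes an AIX box with <sys/atomic_op.h> and
    the same nMEM constant):

    // Sketch: fetch_and_addlp() returns the value *before* the increment,
    // which is why the first slot index is 0 and indexes wrap at nMEM.
    #include <stdio.h>
    #include <sys/atomic_op.h>

    #define nMEM 16384

    int main() {
        long counter = 0;
        long first  = fetch_and_addlp((atomic_l)&counter, 1); // returns 0; counter becomes 1
        long second = fetch_and_addlp((atomic_l)&counter, 1); // returns 1; counter becomes 2
        printf("first=%ld second=%ld slot=%ld\n", first, second, second % nMEM);
        return 0;
    }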

    I am concerned mostly about AIX 5.3 on very fast processors
    (Power5, Power5+, Power6). But other machines also fail in malloc
    if they are given enough "pressure".

    The program prints out the number of new/delete combinations for
    each thread. It is an indication of "schedule fairness".
    If the numbers are evenly distributed, it means threads are not
    locking up because of mutex operations within malloc.
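    If you want a quick way to judge that distribution, a hypothetical helper
    along these lines (not part of the posted test) could be called after the
    run completes, using the same perThread[] counters:

    // Hypothetical post-processing helper: summarizes per-thread allocation
    // counts so uneven scheduling (or starved threads) is easy to spot.
    #include <stdio.h>

    void report_fairness(const int *perThread, int nThreads) {
        int lo = perThread[0], hi = perThread[0], zeros = 0; long total = 0;
        for (int i = 0; i < nThreads; i++) {
            if (perThread[i] < lo) lo = perThread[i];
            if (perThread[i] > hi) hi = perThread[i];
            if (perThread[i] == 0) zeros++;
            total += perThread[i];
        }
        printf("threads=%d total=%ld min=%d max=%d zero=%d avg=%ld\n",
               nThreads, total, lo, hi, zeros, total / nThreads);
    }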


  2. Re: Pressure on malloc

    One possibility:

    In ThreadNew(), the value of n (at any point in time) is not guaranteed
    to be unique. There exists a possibility for 2 (or more) threads to
    have the same value of n at some point in time (because of the modulo
    operator). This introduces the potential for stale data (in gNew) or
    stepping over updates from other threads.

    I was able to make the testcase fail on my modest Power3 box with
    new64 30 56
    Then I increased nMEM to 65536 and the run was successful. With
    new64 30 256
    I needed to increase nMEM to 131072 for a successful run.

    Larger values of nMEM (and hence more space in gNew) alleviate the
    potential conflict until more pressure occurs. This implies that (as
    improbable as it may seem) n is managing to hit the same value in
    2 threads.
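    Sketched as a timeline against the posted loop (the slot number and the
    exact interleaving are hypothetical), the collision would look like this:

    // Hypothetical interleaving, both threads having drawn the same slot n:
    //   Thread A: n = fetch_and_addlp(...) % nMEM;  // say n == 42
    //   Thread B: n = fetch_and_addlp(...) % nMEM;  // also 42, nMEM allocations
    //                                               // later, while A is stalled
    //   Thread A: delete [] gNew[42];               // frees pointer P
    //   Thread B: delete [] gNew[42];               // gNew[42] still holds P:
    //                                               //   the same block is freed twice
    //   Thread A: gNew[42] = new char[...];         // then one of the two new
    //   Thread B: gNew[42] = new char[...];         //   pointers is lost (leaked)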

  3. Re: Pressure on malloc

    The calculated value of "n" at any point in time is not random; it is an
    atomically incremented value that points to a slot of retained memory.
    Note: the program is not designed to be "safe" in the way a valid
    application would be. Its only purpose is to put pressure on malloc/free.

    The address of every new allocation is stored into gNew[n].
    If two threads were to somehow get the same slot, it's a giant "so what".
    - Pointer stores are atomic on PowerPC.
    - If one memory address were "lost" because two threads used the same "n",
      it would only mean that memory is leaked. Nothing else.
    - There is no way that this program should ever corrupt malloc.
      The locking is performed entirely within the AIX algorithms ... there is
      nothing a user application can do to corrupt that, other than overwriting
      storage areas and destroying adjacent malloc control blocks.
      That is not happening in this test case.

    I have also increased nMEM to large values, without any noticeable effect
    on the stability of the test case. It doesn't always fail ... especially
    on slower Power3 boxes. Throw the execution into a shell while loop, and
    it eventually will crash.

    It has crashed on every machine/AIX version I have access to ... AIX 5.1
    thru AIX 5.3. It has crashed on every C++ Runtime version, v6 thru v9.

    An increased nMEM just means we build a larger inventory of new requests
    that we preserve. The pressure continues to be placed on malloc.
    I have observed that it does always take some free activity to trigger
    the crash.

    If you are varying the number of retained allocations (nMEM), you might
    also vary the amount of time the case is allowed to run.

    At 131072, run:
    new64 120 256
    new64 240 256

    Remove the memset from the test. I have never experienced two threads
    simultaneously getting the same "n" slot, but I've discussed this with
    AIX Tech Support:

    Thread 1 gets 100 bytes; stores the pointer, then clears that memory.
    Thread 2 gets 200 bytes; stores the pointer, then clears that memory.

    In the unfortunate situation where Thread 2 stores, then Thread 1 ...
    it would be theoretically possible for Thread 2 to attempt to initialize
    200 bytes into a block that is only 100 bytes, and thus over-write
    malloc controls.

    "Blue Moon" plays in my head when I try to contemplate this, but I just
    remove memset to make the possibility disappear from the discussion.
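    Laid out against the posted loop (using the same 100- and 200-byte sizes
    purely for illustration), the interleaving being described is:

    // Hypothetical interleaving on a shared slot n:
    //   Thread 2: gNew[n] = new char[200];   // stores its 200-byte block
    //   Thread 1: gNew[n] = new char[100];   // overwrites with its 100-byte block
    //   Thread 2: memset(gNew[n], 1, 200);   // re-reads gNew[n], which now points
    //                                        // at the 100-byte block, so 200 bytes
    //                                        // are written and adjacent malloc
    //                                        // control data can be damaged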



    On Sep 25, 12:18 pm, "Gary R. Hook" wrote:

    > In ThreadNew(), the value of n (at any point in time) is not guaranteed to be unique.



  4. Re: Pressure on malloc

    One other thing to keep in mind ... this failure is likely the result of
    corruption opportunities presented by malloc interlocking schemes.

    When you run 256 threads, make sure that the numbers of allocations
    recorded in each thread are approximately equal. AIX support ran the test
    on a Power5 machine at 1.6 GHz ... but they ran in a virtual LPAR, which
    allocated only a small percentage of real CPU to the application.

    As a result, for 256 threads, 115 of them had ZERO allocations.
    They never got dispatched, and for an application like ours, which must
    be responsive to user requests in a thread, this would be a disaster.

    The testcase waits until it has all the threads running, then starts them
    simultaneously. One can use the number of malloc requests per second as a
    measure of an algorithm's effectiveness in delivering mostly application
    cpu cycles ... and not interlocking on internal mechanisms the application
    writer has little control over.

    This test was written to discover the best algorithms to advise our
    customers to use on a particular AIX configuration.

    If you aren't getting even distributions of new requests, it lowers the
    probability that you will ever expose a race condition.

    On Sep 25, 12:18 pm, "Gary R. Hook" wrote:

    > In ThreadNew(), the value of n (at any point in time) is not guaranteed to be unique.



  5. Re: Pressure on malloc

    I think the objection Gary raised about "n" not being unique can probably
    occur more often than I expected. So I've modified the program to add a
    guard against two threads operating on the same slot.


    char * pMem = gNew[n];
    if (pMem == (char *)-1) continue;                 // another thread owns this slot
    long oldVal = (long)pMem;
    if (!compare_and_swaplp((atomic_l)&gNew[n], &oldVal, -1L)) continue;
    delete [] pMem;

    The program sets -1 into the slot while it is working on it; the marker is
    replaced when a new allocation is obtained.

    The problem is that malloc can block on an internal mutex while other
    threads are making progress ... especially with MULTIHEAP algorithms.

    The code block in the thread now looks like this ... untested, because
    I don't have access to an AIX machine just now.

    while (!bDie) {
        (*arg)++;
        n = (fetch_and_addlp((atomic_l)&nNew, 1)) % nMEM;
        char * pMem = gNew[n];
        if (pMem == (char *)-1) continue;             // slot is being worked on
        long oldVal = (long)pMem;
        if (!compare_and_swaplp((atomic_l)&gNew[n], &oldVal, -1L)) continue;
        delete [] pMem;
        rSize = 0x0fffffff & rand_r(&Seed);           // 0:268,435,455
        gNew[n] = new char[rSize];
        memset(gNew[n], 1, rSize);
    }
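    For anyone wanting to lift the guard out of the loop, here is a
    self-contained sketch of the same slot-claim idea (the helper name is
    hypothetical; it assumes AIX's compare_and_swaplp(), which takes a pointer
    to the expected old value and updates it on failure):

    // Hypothetical helper expressing the slot-claim protocol used above.
    #include <sys/atomic_op.h>

    #define SLOT_BUSY ((char*)-1L)

    // Returns 1 and hands back the old pointer if this thread won the slot;
    // returns 0 if another thread owns it or won the swap first.
    int claim_slot(char** slot, char** out_old) {
        char* pMem = *slot;
        if (pMem == SLOT_BUSY) return 0;              // slot is being worked on
        long expected = (long)pMem;
        if (!compare_and_swaplp((atomic_l)slot, &expected, -1L))
            return 0;                                 // lost the race; pick another n
        *out_old = pMem;                              // may be NULL; delete [] NULL is ok
        return 1;
    }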

