Find source of massive syscalls/context switches - Aix

This is a discussion on Find source of massive syscalls/context switches - Aix ; Hello, System: AIX 5.3, DB2 v. 8 (latest patchlevel), MQ 5.3, Java 1.4. The system runs a somewhat large, custom made java application which performs way below expectations and requirements. The system is generally idle: Lots of idle CPU, no ...

+ Reply to Thread
Results 1 to 11 of 11

Thread: Find source of massive syscalls/context switches

  1. Find source of massive syscalls/context switches

    Hello,

    System: AIX 5.3, DB2 v. 8 (latest patchlevel), MQ 5.3, Java 1.4.

    The system runs a somewhat large, custom made java application which
    performs way below expectations and requirements.

    The system is generally idle: Lots of idle CPU, no Pgspin/Pgspout
    activity, no massive I/O. However, the system has a very high syscall and
    context switch level when inspected with topas.

    Now, I'm very interested in identifying where all the syscalls and
    context switches are coming from. Is there a way to track the source of
    syscalls and/or I/Os?

    --
    Regards,
    Troels Arvin
    http://troels.arvin.dk/

  2. Re: Find source of massive syscalls/context switches

    On Dec 13, 6:16 pm, Troels Arvin wrote:
    > Hello,
    >
    > System: AIX 5.3, DB2 v. 8 (latest patchlevel), MQ 5.3, Java 1.4.
    >
    > The system runs a somewhat large, custom made java application which
    > performs way below expectations and requirements.
    >
    > The system is generally idle: Lots of idle CPU, no Pgspin/Pgspout
    > activity, no massive I/O. However, the system has a very high syscall and
    > context switch level when inspected with topas.
    >
    > Now, I'm very interested in identifying where all the syscalls and
    > context switches are coming from. Is there a way to track the source of
    > syscalls ...


    $ trace -a; sleep 15; trcstop
    $ curt -i /var/adm/ras/trcfile

    > ... and/or I/Os?

    $ filemon -v -o fmon.out -O all ; sleep 15 ; trcstop

    Just simple examples. See man pages for details. Also take a look at
    the network for an unusual amount of small packages.

    hth
    Hajo

  3. Re: Find source of massive syscalls/context switches

    On Dec 13, 12:16 pm, Troels Arvin wrote:
    > Hello,
    >
    > System: AIX 5.3, DB2 v. 8 (latest patchlevel), MQ 5.3, Java 1.4.
    >
    > The system runs a somewhat large, custom made java application which
    > performs way below expectations and requirements.
    >
    > The system is generally idle: Lots of idle CPU, no Pgspin/Pgspout
    > activity, no massive I/O. However, the system has a very high syscall and
    > context switch level when inspected with topas.
    >
    > Now, I'm very interested in identifying where all the syscalls and
    > context switches are coming from. Is there a way to track the source of
    > syscalls and/or I/Os?
    >
    > --
    > Regards,
    > Troels Arvin http://troels.arvin.dk/


    I realize it's not an answer to your question but I can't restrain
    myself. What were the expectations of this application's performance
    based on? Maybe you just have a great big, ugly application that's
    running no faster than it could or should do.

    FWIW

    Jim Lane

  4. Re: Find source of massive syscalls/context switches

    On Thu, 13 Dec 2007 10:32:53 -0800, Jim.Lane wrote:
    > I realize it's not an answer to your question but I can't restrain
    > myself. What were the expectations of this application's performance
    > based on?


    The output of the program is a number of reports, measured in "reports/
    minute". There is noone who is able to state exactly how many reports/
    minute the system should really be able to produce, because it would be
    very hard to calculate. However, it strikes everyone involved that the
    system is mostly idle (CPU-wise), has no paging-problems, and isn't even
    close to being I/O bound (juding from iostat/topas). And still, the
    number of generated reports per minute is depressingly slow. I'm not
    involved in the coding of the system, but I'm very much interested in
    finding bottlenecks.

    Have a look at this:
    http://troels.arvin.dk/aix/lots_of_c...s-20071213.txt

    Note the Cswitch and Syscall numbers. These exceed the numbers on our
    most busy SAS server running at full speed.

    --
    Regards,
    Troels Arvin
    http://troels.arvin.dk/

  5. Re: Find source of massive syscalls/context switches

    On Thu, 13 Dec 2007 09:52:03 -0800, Hajo Ehlers wrote:
    >> ... and/or I/Os?

    > $ filemon -v -o fmon.out -O all ; sleep 15 ; trcstop


    Thanks - I'll remember this for some other time. I'm sorry: I should have
    written "context switches" - How can sources of context switches be
    identified? - Via "trace", also?

    --
    Regards,
    Troels Arvin
    http://troels.arvin.dk/

  6. Re: Find source of massive syscalls/context switches

    On 13 Dez., 20:11, Troels Arvin wrote:
    > On Thu, 13 Dec 2007 09:52:03 -0800, Hajo Ehlers wrote:
    > >> ... and/or I/Os?

    > > $ filemon -v -o fmon.out -O all ; sleep 15 ; trcstop

    >
    > Thanks - I'll remember this for some other time. I'm sorry: I should have
    > written "context switches" - How can sources of context switches be
    > identified? - Via "trace", also?

    Currently i am at home but please consult the man page for curt.
    Also vmstat gives some clue what kind of content switches you have -
    maybe even the man page for topas gives a clue.

    The interessting thing about your topas out is the network part
    regarding lo0
    > lo0 4708.7 24.1K 24.1K 2354.3 2354.3


    As you can see you have an average of about 100 byte / package and the
    amount of package in and out are the same. Also the total amount of
    network packages on lo0 of 48k is close to the context switch value
    from topas. So in a worst case senario your application is starting
    for each package/communication process a thread.
    So i would check the communication process within your program and/or
    use iptrace to check how the current communication process is
    working.Maybe something is going totaly wrong there.

    hth
    Hajo

  7. Re: Find source of massive syscalls/context switches

    On Dec 14, 6:16 am, Troels Arvin wrote:
    > Hello,
    >
    > System: AIX 5.3, DB2 v. 8 (latest patchlevel), MQ 5.3, Java 1.4.
    >
    > The system runs a somewhat large, custom made java application which
    > performs way below expectations and requirements.
    >
    > The system is generally idle: Lots of idle CPU, no Pgspin/Pgspout
    > activity, no massive I/O. However, the system has a very high syscall and
    > context switch level when inspected with topas.
    >
    > Now, I'm very interested in identifying where all the syscalls and
    > context switches are coming from. Is there a way to track the source of
    > syscalls and/or I/Os?
    >
    > --
    > Regards,
    > Troels Arvin http://troels.arvin.dk/


    maybe you can show me what this looks like on your system ?

    DB/2 seems to create a kazillion File Descriptors ...?
    21 LPARwthpidevapp:./ # lsof | wc
    1100149 10090271 152207090


    Database Connection Information

    Database server = DB2/6000 8.1.2


    DB/2 level
    DB21085I Instance "dids01in" uses "32" bits and DB2 code release
    "SQL08012"
    with level identifier "02030106".
    Informational tokens are "DB2 v8.1.1.16", "s030508", "U486566", and
    FixPak "2".
    Product is installed at "/usr/opt/db2_08_01".

  8. Re: Find source of massive syscalls/context switches

    Troels Arvin schrieb:

    >
    > Have a look at this:
    > http://troels.arvin.dk/aix/lots_of_c...s-20071213.txt
    >
    > Note the Cswitch and Syscall numbers. These exceed the numbers on our
    > most busy SAS server running at full speed.
    >

    You may get ftp://ftp.software.ibm.com/aix/tools/perftools/perfpmr
    I would first look at the tprof output (KERNEL section) to see which
    syscalls are taking up most of the kernel time.
    Did you note that
    Name PID CPU% PgSp Owner
    java 112328 13.6 309.0 someuser
    db2jd 119262 7.7 12.1 db2inst1
    java 111642 1.0 46.1 someuser
    db2jd 107652 0.1 3.6 db2inst1
    topas 94952 0.1 2.6 foo
    amqzlaa0 35382 0.0 5.1 someuser
    and kernel/user are 28.1/23.4. I see just 22.5% CPU on the processes.



  9. Re: Find source of massive syscalls/context switches

    Troels Arvin writes:

    > On Thu, 13 Dec 2007 10:32:53 -0800, Jim.Lane wrote:
    >> I realize it's not an answer to your question but I can't restrain
    >> myself. What were the expectations of this application's performance
    >> based on?

    >
    > The output of the program is a number of reports, measured in "reports/
    > minute". There is noone who is able to state exactly how many reports/
    > minute the system should really be able to produce, because it would be
    > very hard to calculate. However, it strikes everyone involved that the
    > system is mostly idle (CPU-wise), has no paging-problems, and isn't even
    > close to being I/O bound (juding from iostat/topas). And still, the
    > number of generated reports per minute is depressingly slow. I'm not
    > involved in the coding of the system, but I'm very much interested in
    > finding bottlenecks.
    >
    > Have a look at this:
    > http://troels.arvin.dk/aix/lots_of_c...s-20071213.txt
    >
    > Note the Cswitch and Syscall numbers. These exceed the numbers on our
    > most busy SAS server running at full speed.
    >


    Well, your application is pretty obviously making a "metric-buttload"
    of system calls :-)

    My first question would be "is there any 'custom-written' networking
    code in this app ?".

    I've seen behavior that looks a lot like this in apps that are written
    with wrappers for TCP/IP functions that end up (once you drill down
    thru 4 layers of object oriented obfuscation) forcing the transmission
    of an IP packet for each character in the data stream.

    Your loopback interface is shipping a non-trivial amount of data, and
    you're _averaging_ less than 100 bytes per packet. That sounds
    suspicious. ( as in, someone thought disabling nagling on a socket was
    a bright idea ).

    You're calling a (set of) fairly expensive system calls ... you're
    spending more time in kernel space than you are in user land.

    Don't know whether you can run your app under strace, which is easier
    for the non-performance professional to interpret, generally.

    Having said that, the output of trace is guaranteed to be "the truth,
    the whole truth, and nothing but the truth".

    But it takes a while to crawl thru the output. :-)

    You don't want/need (a.k.a. can't stand) 15 seconds worth. 1 second's
    worth of this behavior is probably going to make the culprit obvious.

    You know that, if you have the source to the app, you can instrument
    the app, and have the app hooks interveaved in the trace output with
    the system hooks.

    Looks closer at topas screen ... userid 'foo'

    In group 'bar', I presume ?

    --
    #include /* I don't speak for IBM ... */
    /* Heck, I don't even speak for myself */
    /* Don't believe me ? Ask my wife :-) */
    Richard D. Latham lathamr@us.ibm.com

  10. Re: Find source of massive syscalls/context switches

    On Thu, 13 Dec 2007 17:16:32 +0000 (UTC), Troels Arvin
    wrote:

    >Hello,
    >
    >System: AIX 5.3, DB2 v. 8 (latest patchlevel), MQ 5.3, Java 1.4.
    >


    Some other posts noted high loopback traffic with small packets in
    addition to the high syscall rate that started us down this rat hole.

    Since I just spent far too much time debugging a similar problem
    (albeit with Oracle as the DB and DataStage as the application), take
    a look at whether the Java application is making "single row requests"
    to DB2 rather than fetching an array of results per request to
    process?

    The would account for the observations (high syscalls, small lo0
    packets and overall sluggish performance with not-to-high CPU
    activity.

    -- David

  11. Re: Find source of massive syscalls/context switches

    Hello,

    On Thu, 13 Dec 2007 17:16:32 +0000, I wrote:
    [...]
    > Now, I'm very interested in identifying where all the syscalls and
    > context switches are coming from.

    [...]

    Thanks for the many good answers in this thread -- they have been most
    useful. I sniffed on the loopback network trafic, and there was indeed a
    massive amount of very small messages; it seems like the messages are
    primarily (almost exclusively) between java and db2jd.

    It's still not entirely clear how to fix this; and it's still not
    entirely clear that this is the only cause of the absurd amount of
    syscalls/context switches. - But it has given us some good starting
    points.

    --
    Regards,
    Troels Arvin
    http://troels.arvin.dk/

+ Reply to Thread