help analyzing slow system (with sar/vmstat/u386mon/sarcheck data) - SCO

  1. help analyzing slow system (with sar/vmstat/u386mon/sarcheck data)

    System configuration: SCO OpenServer 5.0.6, with about 170 ttys logged in via telnet,
    two 2 GHz CPUs, 4 GB of memory.

    (1) output of sar -A:
    SCO_SV zjyw-38 3.2v5.0.6 i80386 03/31/2008

    09:01:04 %usr %sys %wio %idle (-u)
             bread/s lread/s %rcache bwrit/s lwrit/s %wcache pread/s pwrit/s (-b)
             device %busy avque r+w/s blks/s avwait avserv (-d)
             c_hits cmisses (hit %) (-n)
             rawch/s canch/s outch/s rcvin/s xmtin/s mdmin/s (-y)
             scall/s sread/s swrit/s fork/s exec/s rchar/s wchar/s (-c)
             swpin/s bswin/s swpot/s bswot/s pswch/s (-w)
             iget/s namei/s dirbk/s (-a)
             runq-sz %runocc swpq-sz %swpocc (-q)
             proc-sz ov inod-sz ov file-sz ov lock-sz (-v)
             msg/s sema/s (-m)
             vflt/s pflt/s pgfil/s rclm/s (-p)
             freemem freeswp availrmem availsmem (-r)
             cpybuf/s slpcpybuf/s (-B)
             dptch/s idler/s swidle/s (-R)
             ovsiohw/s ovsiodma/s ovclist/s (-g)
             mpbuf/s ompb/s mphbuf/s omphbuf/s pbuf/s spbuf/s dmabuf/s sdmabuf/s (-h)

    Average 6 20 2 72 (-u)
    Average 6 193153 100 68 1071 94 0 0 (-b)
    Average Sdsk-0 100.00 1.00 22.86 148.17 0.00 57.19 (-d)
    Average 453343 8647 (98%) (-n)
    Average 25 1 5553 0 0 0 (-y)
    Average 241413 175778 7502 3.37 3.48 1171261 72597 (-c)
    Average 0.00 0.0 0.00 0.0 951 (-w)
    Average 7614 990 1768 (-a)
    Average 2.1 100 (-q)
    Average 0.00 0.00 (-m)
    Average 76.83 158.98 0.05 0.00 (-p)
    Average 611232 1048576 799826 513961 (-r)
    Average 0.00 0.00 (-B)
    Average 2707.10 376.22 45.88 (-R)
    Average 0.00 0.00 0.00 (-g)
    Average 0.04 0.00 16.64 0.00 0.00 0.00 0.00 0.00 (-h)

    (2) output of sar:
    # sar -r 1 10

    SCO_SV zjyw-38 3.2v5.0.6 i80386 03/31/2008

    08:46:41 freemem freeswp availrmem availsmem (-r)
    08:46:42 680115 1048576 802064 648186
    08:46:43 680048 1048576 802062 648143
    08:46:44 679997 1048576 802062 648143

    # sar -w 1 10

    SCO_SV zjyw-38 3.2v5.0.6 i80386 04/02/2008

    11:48:43 swpin/s bswin/s swpot/s bswot/s pswch/s (-w)
    11:48:44 0.00 0.0 0.00 0.0 1383
    11:48:45 0.00 0.0 0.00 0.0 1321
    11:48:46 0.00 0.0 0.00 0.0 1417
    11:48:47 0.00 0.0 0.00 0.0 1247

    sar -p:
    08:29:21 vflt/s pflt/s pgfil/s rclm/s (-p)
    08:29:22 541.18 1567.65 0.00 0.00
    08:29:23 197.06 109.80 0.00 0.00
    08:29:24 44.55 41.58 0.00 0.00
    08:29:25 50.98 134.31 0.00 0.00
    08:29:26 85.15 388.12 0.00 0.00
    08:29:27 111.76 358.82 0.00 0.00
    08:29:28 534.31 726.47 0.00 0.00
    08:29:29 216.67 131.37 0.00 0.00
    08:29:30 290.10 550.50 0.00 0.00
    08:29:31 244.12 138.24 0.00 0.00
    08:29:32 33.98 113.59 0.00 0.00
    08:29:33 103.96 279.21 0.00 0.00

    (3) output of vmstat:
    PROCS PAGING SYSTEM CPU
    r b w frs dmd sw cch fil pft frp pos pif pis rso rsi sy cs us su id
    1 743 0 1048576 382 0 1222 0 783 0 0 0 0 0 0 170730 547 11 14 75
    1 743 0 1048576 0 0 0 0 65 0 0 0 0 0 0 123108 583 2 11 87
    4 737 0 1048576 13 0 0 0 132 0 0 0 0 0 0 275090 695 8 29 63
    3 738 0 1048576 491 0 1080 0 439 0 0 0 0 0 0 358404 800 5 35 60
    3 738 0 1048576 13 0 0 0 41 0 0 0 0 0 0 512184 741 12 37 51
    3 739 0 1048576 76 0 571 0 269 0 0 0 0 0 0 208337 755 7 25 68
    3 740 0 1048576 9 0 198 0 117 0 0 0 0 0 0 283185 662 10 18 72
    4 739 0 1048576 10 0 2 0 49 0 0 0 0 0 0 248484 684 10 15 75
    2 737 0 1048576 203 0 3 0 125 0 0 0 0 0 0 277137 615 8 27 65
    2 737 0 1048576 28 0 2 0 2 0 0 0 0 0 0 378153 616 8 32 60
    2 739 0 1048576 644 0 4015 0 1149 0 0 0 0 0 0 92687 882 6 17 77
    4 739 0 1048576 244 0 1222 0 672 0 0 0 0 0 0 152569 814 10 13 77
    1 742 0 1048576 465 0 3632 0 1191 0 0 0 0 0 0 242385 902 11 28 61
    2 743 0 1048576 407 0 1572 0 957 0 0 0 0 0 0 157772 531 9 15 76

    (4) u386mon's output:

    u386mon 2.74/SCO 3.2 - zjyw-38                              15:18:49   wht@n4hgf
    ---- CPU --- tot usr ker brk ---------------------------------------------------
     2 Sec Avg %  30   8  22   0  uuuukkkkkkkkkkk
    10 Sec Avg %  32   6  26   0  uuukkkkkkkkkkkkk
    20 Sec Avg %  30   5  25   0  uukkkkkkkkkkkk
    ---- Wait -- tot  io pio swp -- (% of real time) -------------------------------
     2 Sec Avg %  11  11   0   0  iiiii
    10 Sec Avg %   9   9   0   0  iiii
    20 Sec Avg %   6   6   0   0  iii
    ---- Sysinfo/Minfo --- (last 2031 msec activity) -------------------------------
    bread         2  readch 51780167  pswitch   1555  vfault   381  unmodsw 0
    bwrite       54  writch   171667  syscall 190057  demand   381  unmodfl 0
    lread    388468  rawch        94  sysread 175629  pfault   289  psoutok 0
    lwrite     6884  canch         6  syswrit   3620  cw       189  psinfai 0
    phread        0  outch      7420  sysfork      7  steal    100  psinok  0
    phwrite       0  msg           0  sysexec      7  frdpgs     0  rsout   0
    swapin        0  sema          0                  vfpg       0  rsin    0
    swapout       0  maxmem -1080464k runque      0   sfpg       0  pages on
    bswapin       0  frmem  -1688772k runocc      0   vspg       0  swap    0
    bswapout      0  mem used    20%  swpque      0   sspg       0  cache 992
    iget      14417  nswap   524288k  swpocc      0   pnpfault   0  file    0
    namei      1795  frswp   524288k  wrtfault    0
    dirblk     3423  swp used     0%



    ---- Sysinfo/Minfo --- (last 2041 msec activity) -------------------------------
    bread         0  readch 89470697  pswitch   2339  vfault   208  unmodsw 0
    bwrite        0  writch    30924  syscall 293135  demand   206  unmodfl 0
    lread    531488  rawch        84  sysread 223028  pfault   238  psoutok 0
    lwrite      338  canch         1  syswrit    708  cw        84  psinfai 0
    phread        0  outch     10410  sysfork      3  steal    154  psinok  0
    phwrite       0  msg           0  sysexec      4  frdpgs     0  rsout   0
    swapin        0  sema          0                  vfpg       0  rsin    0
    swapout       0  maxmem -1080464k runque      1   sfpg       0  pages on
    bswapin       0  frmem  -1680708k runocc      1   vspg       0  swap    0
    bswapout      0  mem used    20%  swpque      0   sspg       0  cache 455
    iget       2685  nswap   524288k  swpocc      0   pnpfault   0  file    0
    namei       775  frswp   524288k  wrtfault    0

    (5) part of output of sarcheck:
    The following indication(s) of a memory shortage were seen: The reclaim rate was at least
    one quarter of the page fault rate in only 0.0 percent of the samples. This statistic can
    be used to confirm the presence of an occasional memory-poor condition.

    The average swap out transfer request rate was 1768.3 per second, which is an indication
    of a memory-poor condition.

    The amount of freeswp did not change during the monitoring period, indicating that the
    system has plenty of memory installed.

    The average number of free pages usually did not stray far above the value of GPGSHI.
    This indicates that vhand, the page stealing daemon, was usually active and the memory
    poor condition seen on this system has resulted in increased CPU overhead as well as
    additional disk activity.

    Both GPGSHI and GPGSLO were set to high values, relative to the amount of memory present.
    Since paging was seen and these parameters are set in a way that increases the activity
    of the page stealing vhand daemon, consider lowering the values of GPGSHI and GPGSLO.
    The difference between GPGSLO and GPGSHI is large. This may create a CPU bottleneck while
    a large amount of dirty pages are being written to disk.

    ***********
    My questions are:

    (1) sarcheck's output: "The following indication(s) of a memory shortage were seen: The
    reclaim rate was at least one quarter of the page fault rate in only 0.0 percent of the
    samples. This statistic can be used to confirm the presence of an occasional memory-poor
    condition."
    --> What does this statement mean?

    (2) sarcheck's output: "The average swap out transfer request rate was 1768.3 per second,
    which is an indication of a memory-poor condition."
    --> How is the number 1768.3 calculated? According to the sar and vmstat output there
    seems to be no swapping at all, so why does sarcheck say "The average swap out transfer
    request rate was 1768.3 per second" and conclude that there is a memory-poor condition?

    (3) sarcheck's output: "The average number of free pages usually did not stray far above
    the value of GPGSHI."
    --> GPGSHI's value is 6000, and according to the output of sar -r, freemem (680115) is
    significantly higher than the value of GPGSHI. Why is sarcheck's conclusion the opposite?

    (4) Is the output of sar -p normal? Are vflt and pflt too large?

    (5) Is the output of vmstat normal? Are sy and cs too large?

    (6) In the u386mon output, steal is not zero. Why? The system's freemem never falls below
    GPGSLO.

    Sorry for so many questions; I appreciate anyone's advice and help.
    Best regards to all.

  2. Re: help analyzing slow system (with sar/vmstat/u386mon/sarcheck data)

    yannanqi@126.com wrote:

    > system configuration: sco 5.0.6, with about 170 ttys loggedin by
    > telnet, two 2G cpu, 4G memory


    A buncha other stuff, in ugly format, not worth trying to edit for
    quoting.

    Your sar output looks reasonable for a system as described. So do the
    other utils (modulo a few display bugs in u386mon). The system isn't
    swapping at all and has loads more memory than it needs. sarcheck looks
    like it isn't prepared to deal with some details of the sar outputs --
    is it the latest sarcheck for OSR506?

    The big thing missing in all that output is your description of what's
    wrong. It looks like a system that has a lot of work to do and is doing
    it without complaint. Plus some spurious nonsense from sarcheck. If
    the whole problem is the advice from sarcheck, ignore it (ask them for
    advice, though...)

    The one possibly questionable stat is that the disk is 100% busy. But
    you posted a snapshot, we can't tell if that was a momentary burst or
    continuous. If it's continuous, the system might benefit from a faster
    disk subsystem (faster drive, faster HBA, maybe an external RAID of the
    sort that's intended to speed things up rather than or in addition to
    giving redundancy -- RAID 0 or RAID 10). Although it's 100% busy, the
    delay stats didn't look bad, so I'm not sure if this relates to your
    issue.
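
    A longer sar -d run (rather than one-second snapshots) would show whether those 100% busy
    readings are sustained or only momentary bursts; something along these lines, adjusting
    the interval and count to taste:

    # sar -d 5 120 | tee /tmp/sar-d.out     (5-second samples for 10 minutes, saved for later)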

    If there's an actual performance problem, why don't you describe it
    instead of posting a morass of details that don't seem to show much
    wrong?

    In your other message about NBUF:

    > On OSR506 platform with 4G memory, the mtune shows:NBUF
    > 0 24 450000,that means the maximum value of NBUF is
    > 450000,but if I give 1000000 to NBUF,when system starts,it give the
    > following message:
    >
    > kernel: Hz = 100, i/o bufs = 467116k (high bufs = 466092k)CONFIG:
    > Buffer allocation was reduced (NBUF reduced to 467116)
    >
    > (1)That means NBUF gets a value of 467116, where does this number come
    > from?


    I would guess that 450000 was someone's back-of-napkin calculation of the most buffers
    that could be guaranteed to be accommodated within the constraints of other kernel
    structures. When you demand 1000000 buffers, you cause the kernel to do a live calculation
    of the same constraints, only now it has more specific information about certain
    structures whose sizes are system-specific. Some of the constraints on your system aren't
    quite at the theoretical limits, so it can squeeze in a few more buffers.

    You should expect that by demanding the absolute maximum buffers, you may be invisibly
    squeezing down the size of other kernel structures. This could potentially hurt
    performance or stability. (I'm not saying that it _does_ hurt, I don't really know.) You
    can also reasonably expect that SCO _tested_ with 450000 buffers but not with 467116. I
    doubt the 3.8% increase in buffers is making so much difference in performance that it's
    worth running in an untested configuration.

    > ps:
    > (2) If NBUF has a value other than zero, Is it ok to let NHBUF=0? Can
    > NHBUF self-tune according to NBUF when NBUF is not set to zero?


    It should auto-tune. You can observe runtime values of these by doing:

    # crash
    > v | grep buf

    v_buf: 450000
    v_hbuf: 524288

    If you boot with different forced NBUF (v_buf) values, you should see
    v_hbuf (NHBUF) float to different values. It's always a power of 2 so
    you'll have to make sharp changes to NBUF to see NHBUF change.

    > (3)When does MAXBUF have effect, when NBUF is zero or NBUF is not zero?


    MAXBUF is an obsolete parameter, no longer edited by configure(ADM), no
    longer meaningful to the kernel.

    >Bela<


  3. Re: help analyzing slow system (with sar/vmstat/u386mon/sarcheck data)

    Bela, I can't express my feelings in words. Only one thing to say: you're great -- many,
    many thanks! Your clear explanation set me free.

    But the SCO system really does have a performance problem: the telnet users' working
    interface is very slow, and the items in the drop-down list fields appear slowly, one by
    one.

    (1) sar -b: The %rcache and %wcache seem to be normal.
    SCO_SV zjyw-38 3.2v5.0.6 i80386 04/03/2008

    14:25:10 bread/s lread/s %rcache bwrit/s lwrit/s %wcache pread/s pwrit/s (-b)
    14:25:11 18 36849 100 388 5282 93 0 0
    14:25:12 3 158264 100 33 316 89 0 0
    14:25:13 0 59686 100 0 226 100 0 0
    14:25:14 0 79142 100 0 159 100 0 0
    14:25:15 0 164502 100 0 50 100 0 0
    14:25:16 0 169043 100 0 181 100 0 0
    14:25:17 0 61087 100 0 7 100 0 0
    14:25:18 0 15037 100 0 16 100 0 0
    14:25:19 0 230439 100 0 102 100 0 0
    14:25:20 0 55642 100 0 61 100 0 0
    14:25:21 0 37027 100 0 12 100 0 0
    14:25:22 1 127536 100 0 43 100 0 0
    14:25:23 0 133101 100 18 122 85 0 0
    14:25:24 0 17444 100 1 2 60 0 0
    14:25:25 0 0 0 0 0 100 0 0
    14:25:26 0 7142 100 1 12 92 0 0
    14:25:27 0 3 100 4 4 11 0 0
    14:25:28 0 146721 100 0 79 100 0 0
    14:25:29 0 8179 100 0 37 100 0 0
    14:25:30 0 175348 100 0 37 100 0 0
    14:25:31 0 98968 100 0 81 100 0 0
    14:25:32 0 67449 100 0 26 100 0 0
    14:25:33 0 66537 100 0 27 100 0 0
    14:25:34 0 19567 100 0 8 100 0 0
    14:25:35 0 99711 100 0 31 100 0 0
    14:25:36 0 45507 100 0 86 100 0 0
    14:25:37 0 98409 100 0 34 100 0 0
    14:25:39 0 85748 100 10 80 88 0 0
    14:25:40 130 156812 100 6 5129 100 0 0
    14:25:41 0 14653 100 421 143 0 0 0
    14:25:42 0 431218 100 0 284 100 0 0
    14:25:43 0 26278 100 0 81 100 0 0
    14:25:44 0 77340 100 0 116 100 0 0
    14:25:45 0 18695 100 0 18 100 0 0
    14:25:46 0 21389 100 0 20 100 0 0
    14:25:47 0 149728 100 11 68 84 0 0
    14:25:48 0 1027 100 0 56 100 0 0

    (2) sar -d:
    14:24:48 Sdsk-0 4.95 1.00 7.92 15.84 0.00 6.25
    14:24:49
    14:24:50
    14:24:51
    14:24:52
    14:24:53
    14:24:54 Sdsk-0 1.98 1.00 4.95 9.90 0.00 4.00
    14:24:55
    14:24:56
    14:24:57 Sdsk-0 10.00 1.00 13.00 26.00 0.00 7.69
    14:24:58 Sdsk-0 100.00 1.00 53.92 178.43 0.00 21.82
    14:24:59 Sdsk-0 100.00 1.00 240.59 1976.24 0.00 44.07
    14:25:00 Sdsk-0 0.99 1.00 0.99 1.98 0.00 10.00
    14:25:01
    14:25:02
    14:25:03
    14:25:04
    14:25:05
    14:25:06
    14:25:07
    14:25:08
    14:25:09 Sdsk-0 0.99 1.00 0.99 9.90 0.00 10.00
    14:25:10
    14:25:11 Sdsk-0 100.00 1.00 134.65 857.43 0.00 76.25
    14:25:12 Sdsk-0 21.78 1.00 6.93 27.72 0.00 31.43
    14:25:13
    14:25:14
    14:25:15
    14:25:16
    14:25:17
    14:25:18
    14:25:19
    14:25:20
    14:25:21 Sdsk-0 1.00 1.00 1.00 2.00 0.00 10.00
    14:25:22
    14:25:23 Sdsk-0 17.65 1.00 18.63 37.25 0.00 9.47

    (3) vmstat:
    Thu Mar 27 16:23:31 CST 2008
    # vmstat 1 100

    PROCS PAGING SYSTEM CPU
    r b w frs dmd sw cch fil pft frp pos pif pis rso rsi sy cs us su id

    2 921 0 1048576 8 0 0 0 2 0 0 0 0 0 0 1128585 54570 28 27 45
    2 921 0 1048576 62 0 0 0 0 0 0 0 0 0 0 1174108 57445 20 32 48
    2 918 0 1048576 5 0 0 0 14 0 0 0 0 0 0 1149028 52759 35 36 29
    2 920 0 1048576 26 0 425 0 104 0 0 0 0 0 0 1180505 55345 24 33 43
    2 920 0 1048576 0 0 0 0 0 0 0 0 0 0 0 1204870 57131 29 26 45
    5 917 0 1048576 28 0 0 0 1 0 0 0 0 0 0 1154900 53069 30 30 40
    2 922 0 1048576 31 0 425 0 104 0 0 0 0 0 0 1232803 55362 27 35 38
    3 921 0 1048576 30 0 0 0 0 0 0 0 0 0 0 1153011 47926 27 37 36
    3 925 0 1048576 111 0 822 0 165 0 0 0 0 0 0 1165042 55808 34 39 27
    2 926 0 1048576 24 0 85 0 18 0 0 0 0 0 0 1203487 46580 32 43 25
    4 925 0 1048576 6 0 85 0 15 0 0 0 0 0 0 1215516 55999 23 35 42

    *************************
    My thoughts:
    (1) From vmstat's output, the "sy" (system calls) and "cs" (context switches) values are
    very high. Are these values normal? (The SCO system is only a telnet endpoint server, and
    the telnet users don't do much work -- only querying & charging; the database server is on
    a separate machine.)
    (2) From vmstat's output, does the "cch" value affect the system's performance?

  4. Re: help analyzing slow system (with sar/vmstat/u386mon/sarcheck data)

    On Wed, Apr 02, 2008, yannanqi@126.com wrote:
    >Bela,I can't express my heart by words.Only one word:you're
    >great,great thanks! You freeed me through clear explanation.
    >
    >But the sco system really encounters performance problem: the telnet
    >users' working interface is very slow,the items of the dropdown list
    >fields slowly appears one by one.


    Does the system exhibit this type of performance on the console?
    If it doesn't, the problem is most likely network related.

    If it is a network problem, it could be a bad NIC, network
    switch, hub, or even another machine on the LAN with the same IP
    address as the server. DNS problems usually show up with long
    initial connection times as the system attempts to resolve the
    host name of the connecting IP.

    I have seen major problems with NICs which show high numbers of
    errors on incoming and outgoing packets. On Linux systems the
    ifconfig command shows the error history, but SCO's doesn't, at
    least not on the OSR 5.0.6a systems we have here.

    Bill
    --
    INTERNET: bill@celestial.com Bill Campbell; Celestial Software LLC
    URL: http://www.celestial.com/ PO Box 820; 6641 E. Mercer Way
    FAX: (206) 232-9186 Mercer Island, WA 98040-0820; (206) 236-1676

    That rifle on the wall of the labourer's cottage or working class flat is
    the symbol of democracy. It is our job to see that it stays there.
    --GEORGE ORWELL

  5. Re: help analyzing slow system (with sar/vmstat/u386mon/sarcheck data)

    yannanqi@126.com wrote:

    > But the sco system really encounters performance problem: the telnet
    > users' working interface is very slow,the items of the dropdown list
    > fields slowly appears one by one.
    >
    > (1)sar -b: The %rcache and %wcache seem to be normal.


    They're actually exceptionally high (since you have so much buffer
    cache). Shouldn't be causing a performance problem.

    > (2) sar -d:
    > 14:24:48 Sdsk-0 4.95 1.00 7.92 15.84 0.00 6.25
    > 14:24:49
    > 14:24:50
    > 14:24:51
    > 14:24:52
    > 14:24:53
    > 14:24:54 Sdsk-0 1.98 1.00 4.95 9.90 0.00 4.00
    > 14:24:55
    > 14:24:56
    > 14:24:57 Sdsk-0 10.00 1.00 13.00 26.00 0.00 7.69
    > 14:24:58 Sdsk-0 100.00 1.00 53.92 178.43 0.00 21.82
    > 14:24:59 Sdsk-0 100.00 1.00 240.59 1976.24 0.00 44.07
    > 14:25:00 Sdsk-0 0.99 1.00 0.99 1.98 0.00 10.00
    > 14:25:01
    > 14:25:02


    Alternating 100% busy and idle, hmmm.

    How full are the filesystems? HTFS on OSR506 (and OSR507 without at
    least MP3 or so) was extremely inefficient at allocating space on
    nearly-full filesystems. On large filesystems (100GiB would be large
    enough), this inefficiency was costly in both CPU and disk I/O terms.

    But your buffer cache stats suggest this is not the problem.

    Much more likely: you've got dirty buffer cache storms. You've given
    the system 450MB of buffer cache. A process that was writing very
    quickly to an already allocated file could dirty tens of megabytes in a
    few seconds. Those blocks would stay in cache until bdflush was run,
    then they would all try to write to disk at the same time, busying out
    the disk for a long time.
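
    To put rough numbers on that (OSR5 buffers are 1K each, as the "i/o bufs = 467116k" boot
    message in your other thread shows): 450000 buffers is roughly 440 MB of cache, and
    flushing, say, 60 MB of dirty buffers at an assumed 30 MB/s would keep the disk 100% busy
    for about 2 seconds in one burst. Both the 60 MB and the 30 MB/s are illustrative numbers,
    not measurements of your system:

    # echo "450000 / 1024" | bc -l          (cache size in MB, with 1K buffers)
    # echo "60 / 30" | bc -l                (seconds to flush 60 MB at an assumed 30 MB/s)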

    To mitigate this, change BDFLUSHR to 1 (run bdflush as often as
    possible, once a second) and NAUTOUP to 2 (flush buffers that are no
    more than 2 seconds old). This costs a bit of extra CPU, but your
    system has plenty to spare.

    I seem to remember that one of the OSR507 patches also improved some
    buffer cache handling. With your 506, the system might actually run
    _faster_ with a much smaller buffer cache. You should test it with a
    sharp reduction, e.g. NBUF=50000; revert back to 450000 if it doesn't
    help.
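
    For reference, the usual OSR5 way to make these changes is configure(ADM) followed by a
    kernel relink and a reboot; roughly as follows, but check the prompts on your own system
    before relinking:

    # cd /etc/conf/cf.d
    # ./configure          (set BDFLUSHR=1, NAUTOUP=2, and NBUF for the test run)
    # ./link_unix          (relink; answer y to keep the new kernel as the default)
    # reboot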

    Because of the buffer cache & filesystem space allocation improvements,
    this system would probably be a lot happier under OSR507 + MP5. (Or it
    might make no difference... can't really tell without trying.)

    > (3)vmstat:
    > Thu Mar 27 16:23:31 CST 2008
    > # vmstat 1 100
    >
    > PROCS PAGING SYSTEM CPU
    > r b w frs dmd sw cch fil pft frp pos pif pis rso rsi sy cs us su id
    >
    > 2 921 0 1048576 8 0 0 0 2 0 0 0 0 0 0 1128585 54570 28 27 45
    > 2 921 0 1048576 62 0 0 0 0 0 0 0 0 0 0 1174108 57445 20 32 48
    > 2 918 0 1048576 5 0 0 0 14 0 0 0 0 0 0 1149028 52759 35 36 29
    > 2 920 0 1048576 26 0 425 0 104 0 0 0 0 0 0 1180505 55345 24 33 43
    > 2 920 0 1048576 0 0 0 0 0 0 0 0 0 0 0 1204870 57131 29 26 45
    > 5 917 0 1048576 28 0 0 0 1 0 0 0 0 0 0 1154900 53069 30 30 40
    > 2 922 0 1048576 31 0 425 0 104 0 0 0 0 0 0 1232803 55362 27 35 38
    > 3 921 0 1048576 30 0 0 0 0 0 0 0 0 0 0 1153011 47926 27 37 36
    > 3 925 0 1048576 111 0 822 0 165 0 0 0 0 0 0 1165042 55808 34 39 27
    > 2 926 0 1048576 24 0 85 0 18 0 0 0 0 0 0 1203487 46580 32 43 25
    > 4 925 0 1048576 6 0 85 0 15 0 0 0 0 0 0 1215516 55999 23 35 42
    >
    > *************************
    > My thoughts:
    > (1)By vmstat's output,the "sy:system calls" and "cs:context switch"
    > are very high. Are these value normal? (The sco system is only an
    > endpoint-telnet-server, and the telnet users don't have much business,
    > only querying & charging--the database server is on third part)
    > (2)By vmstat's output,does the "cch" effect the system's performance?


    Syscalls/sec does seem high, but since the CPU is still 40% idle, that's
    not the problem. Context switches/sec is in line with syscalls.

    I think cch "pages in cache" refers to pages that have been marked for
    possible purging by the virtual memory sweeper, then were demonstrated
    (by a page fault) to still be in use. This is part of the normal
    functioning of the virtual memory system and the page rate looks
    reasonable, maybe even a bit low (not a worry).

    >Bela<


  6. Re: help analyzing slow system (with sar/vmstat/u386mon/sarcheck data)

    Bill Campbell wrote:

    > I have seen major problems with NICs which show high numbers of
    > errors on incoming and outgoing packets. On Linux systems the
    > ifconfig command shows the error history, but SCO's doesn't, at
    > least not on the OSR 5.0.6a systems we have here.


    netstat -i; ndstat -l

    >Bela<


  7. Re: help analyzing slow system (with sar/vmstat/u386mon/sarcheck data)

    Sorry for my absence these last few days; I've been on a national holiday. Your rich
    knowledge and patience overwhelm me... Thanks again.

    (1)Bela Lubkin wrote:
    > Bill Campbell wrote:
    >
    >> Does the system exhibit this type of performance on the console?
    >> If it doesn't, the problem is most likely network related.


    > > I have seen major problems with NICs which show high numbers of
    > > errors on incoming and outgoing packets. On Linux systems the
    > > ifconfig command shows the error history, but SCO's doesn't, at
    > > least not on the OSR 5.0.6a systems we have here.

    >
    > netstat -i; ndstat -l
    >
    > >Bela<


    Because the application needs an operator ID and password to log in, I can't test it on
    the console. But I'll try it later and report the result back to you. The NIC should be
    OK, because the old application works fine. The following is the output of "netstat &
    ndstat":

    # netstat -i
    Name  Mtu  Network  Address   Ipkts    Ierrs  Opkts    Oerrs  Coll
    net1  1500 142.70   zjyw-38   8460437  0      7113058  0      469448
    lo0   8232 loopback localhost 2467255  0      2467255  0      0
    atl0* 8232 none     none      No Statistics Available

    # ndstat
    Device MAC address in use Factory MAC Address
    ------ ------------------ -------------------
    /dev/net1 00:15:60:a5:ac:80 00:15:60:a5:ac:80

    Multicast address table
    -----------------------
    01:00:5e:00:00:01

    FRAMES
    Unicast Multicast Broadcast Error Octets Queue Length
    ---------- --------- --------- ------ ----------- ------------
    In: 7943453 0 517623 0 613619190 0
    Out: 7113607 0 1 0 500415540 0

    # ndstat -l
    Device MAC address in use Factory MAC Address
    ------ ------------------ -------------------
    /dev/net1 00:15:60:a5:ac:80 00:15:60:a5:ac:80

    Multicast address table
    -----------------------
    01:00:5e:00:00:01

    FRAMES
    Unicast Multicast Broadcast Error Octets Queue Length
    ---------- --------- --------- ------ ----------- ------------
    In: 7943990 0 517715 0 613689653 0
    Out: 7114102 0 1 0 500466929 0

    DLPI Module Info: 2 SAPs open, 18 SAPs maximum
    5281 frames received destined for an unbound SAP

    MAC Driver Info: Media_type: Ethernet
    Min_SDU: 14, Max_SDU: 1514, Address length: 6
    Interface speed: 10 Mbits/sec

    DLPI Restarts Info: Last queue size: 0
    Last send time: 6080505
    Restart in progress: 0
    Number of restarts: 0

    Interface Version: MDI 100

    ETHERNET SPECIFIC STATISTICS

    Collision Table - The number of frames successfully transmitted,
    but involved in at least one collision:

    Frames Frames
    ------- -------
    1 collision 229269 9 collisions 125
    2 collisions 55519 10 collisions 19
    3 collisions 15181 11 collisions 2
    4 collisions 9432 12 collisions 0
    5 collisions 6231 13 collisions 0
    6 collisions 1748 14 collisions 0
    7 collisions 248 15 collisions 0
    8 collisions 151 16 collisions 0


    Bad Alignment                0  Number of frames received that were not an
                                    integral number of octets

    FCS Errors                   0  Number of frames received that did not pass
                                    the Frame Check Sequence

    SQE Test Errors              0  Number of Signal Quality Error Test signals
                                    that were detected by the adapter

    Deferred Transmissions  118929  Number of frames delayed on the first
                                    transmission attempt because the media was
                                    busy

    Late Collisions              0  Number of times a collision was detected
                                    later than 512 bits into the transmitted
                                    frame

    Excessive Collisions         0  Number of frames dropped on transmission
                                    because of excessive collisions

    Internal MAC Transmit        0  Number of frames dropped on transmission
    Errors                          because of errors not covered above

    Carrier Sense Errors         0  Number of times that the carrier sense
                                    condition was lost when attempting to send
                                    a frame that was deferred for an excessive
                                    amount of time

    Frame Too Long               0  Number of frames dropped on reception
                                    because they were larger than the maximum
                                    Ethernet frame size

    Internal MAC Receive         0  Number of frames dropped on reception
    Errors                          because of errors not covered above

    Spurious Interrupts          0  Number of times the adapter interrupted the
                                    system for an unknown reason

    No STREAMS Buffers           0  Number of frames dropped on reception
                                    because no STREAMS buffers were available

    Underruns/Overruns           0  Number of times the transfer of data to or
                                    from the frame buffer did not complete
                                    successfully

    Device Timeouts              0  Number of times the adapter failed to
                                    respond to a request from the driver
    #

    (2) The filesystems are mostly free, so this shouldn't be the problem:
    # dfspace
    /        : Disk space:  7434.21 MB of  8927.00 MB available (83.28%).
    /stand   : Disk space:     2.41 MB of    14.99 MB available (16.12%).
    /serv    : Disk space: 27516.99 MB of 29998.61 MB available (91.73%).
    /servbak : Disk space: 28590.46 MB of 29999.01 MB available (95.30%).

    (3)
    > Much more likely: you've got dirty buffer cache storms. You've given
    > the system 450MB of buffer cache. A process that was writing very
    > quickly to an already allocated file could dirty tens of megabytes in a
    > few seconds. Those blocks would stay in cache until bdflush was run,
    > then they would all try to write to disk at the same time, busying out
    > the disk for a long time.
    >
    > To mitigate this, change BDFLUSHR to 1 (run bdflush as often as
    > possible, once a second) and NAUTOUP to 2 (flush buffers that are no
    > more than 2 seconds old). This costs a bit of extra CPU, but your
    > system has plenty to spare.


    This shouldn't be the issue, because the application generates very few writes -- about
    20 MB per day in total.

    (4)
    > I seem to remember that one of the OSR507 patches also improved some
    > buffer cache handling. With your 506, the system might actually run
    > _faster_ with a much smaller buffer cache. You should test it with a
    > sharp reduction, e.g. NBUF=50000; revert back to 450000 if it doesn't
    > help.


    I'll test it and report the results back as soon as possible.

    (5)
    > Because of the buffer cache & filesystem space allocation improvements,
    > this system would probably be a lot happier under OSR507 + MP5. (Or it
    > might make no difference... can't really tell without trying.)


    There's nothing I can do about that, because the software supplier says their application
    only supports OSR506 and OSR505.

  8. Re: help analyzing slow system (with sar/vmstat/u386mon/sarcheck data)

    yannanqi@126.com wrote:

    > The NIC should be ok, because the old application
    > works fine. The following is the result of "netstat & ndstat":
    >
    > # netstat -i
    > Name Mtu Network Address Ipkts Ierrs Opkts Oerrs Coll
    > net1 1500 142.70 zjyw-38 8460437 0 7113058 0 469448
    > lo0 8232 loopback localhost 2467255 0 2467255 0 0
    > atl0* 8232 none none No Statistics Available


    > # ndstat
    > MAC Driver Info: Media_type: Ethernet
    > Min_SDU: 14, Max_SDU: 1514, Address length: 6
    > Interface speed: 10 Mbits/sec


    > Frames Frames
    > ------- -------
    > 1 collision 229269 9 collisions 125
    > 2 collisions 55519 10 collisions 19
    > 3 collisions 15181 11 collisions 2
    > 4 collisions 9432 12 collisions 0
    > 5 collisions 6231 13 collisions 0
    > 6 collisions 1748 14 collisions 0
    > 7 collisions 248 15 collisions 0
    > 8 collisions 151 16 collisions 0


    > Deferred Transmissions 118929 Number of frames delayed on the
    > first transmission attempt because
    > the media was busy


    6.5% collisions on output seems pretty high.
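
    That figure is just Coll divided by Opkts from your netstat -i counters, e.g.:

    # echo "scale=1; 100 * 469448 / 7113058" | bc      (output collision rate, in percent)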

    For comparison, this system I'm looking at has sent 33 million packets,
    experiencing 0 collisions and 133 deferred transmissions. Of course
    it's the big fish on a pretty quiet LAN, and it's probably on a switch.

    High collisions can be a sign of: very busy network; bad cables;
    incorrect autodetection of duplex. You should put this system on a
    100Mbps or 1Gbps network, preferably on a switch, and make sure it is
    set for or autodetecting the right duplex setting.
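
    A quick way to see whether the collisions are still climbing while users are complaining
    is to sample the interface counters for a minute or so, e.g.:

    # for i in 1 2 3 4 5 6
    > do
    >   netstat -i | grep net1
    >   sleep 10
    > done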

    You said the problem was telnet users having slow response. Interactive
    use exchanges one or more packets for every character typed by the user.
    With 6.5% collisions, every sentence they type is going to experience
    several collisions and the resulting back-off algorithm. I can imagine
    this causing the entire problem.

    > (4)
    > > I seem to remember that one of the OSR507 patches also improved some
    > > buffer cache handling. With your 506, the system might actually run
    > > _faster_ with a much smaller buffer cache. You should test it with a
    > > sharp reduction, e.g. NBUF=50000; revert back to 450000 if it doesn't
    > > help.

    >
    > I'll test it and feedback the result as soon as possible.


    Ok. Even with the net issues, I'm still suspicious about the 100% busy
    disk readings. Your buffer cache ratios are very high, disk shouldn't
    need to be busy. Is it a very old & slow disk? Swap in a fast disk.
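
    If you want a crude sequential-read figure for the drive, something like this on an
    otherwise quiet system gives a rough number; /dev/rhd00 is the usual whole-disk raw
    device for the first disk on OSR5, so adjust it if your layout differs. Note this only
    reads from the disk, it never writes:

    # time dd if=/dev/rhd00 of=/dev/null bs=64k count=1000      (reads about 64 MB)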

    > (5)
    > > Because of the buffer cache & filesystem space allocation improvements,
    > > this system would probably be a lot happier under OSR507 + MP5. (Or it
    > > might make no difference... can't really tell without trying.)

    >
    > I have no method... Because the software supplier says their
    > application only support OSR506 and OSR505.


    What they mean is they can't be bothered to test with anything newer.
    It would probably be fine, backwards compatibility was/is SCO's core
    competency...

    >Bela<


  9. Re: help analyzing slow system (with sar/vmstat/u386mon/sarcheck data)

    It could be the network, or the disk, or both! Having Bela in here is truly awesome; I'm
    not in Bela's league when it comes to the inner workings of SCO, but I have some
    practical, generic advice for you to consider.

    The buffer cache flush daemon, bdflush, flushes regularly; when it runs, it writes the
    dirty part of your (huge) buffer cache to disk. This could be responsible for the surges
    in disk I/O that you see.

    Making your buffer cache smaller, flushing more frequently, or a combination of both
    could help smooth the big write tsunamis into smaller waves, but looking at the
    underlying disk and/or RAID architecture is also important. If the activity generating
    the I/O is very bursty and infrequent, a bigger cache can help compensate for a slow
    disk; but if the activity is frequent or continuous, then you really need a faster disk.
    The frequency of the I/O bursts and the timing of bdflush also matter, but faster disks
    always help.

    You also have a lot of collisions on your network -- you need to deal with that too. It
    could be any of a range of issues, so work through a process of elimination, looking at
    things like the following:

    a. Switch configuration (assuming you have managed switches with some layer-3
    capabilities) -- I have found that turning IGMP snooping ON helps manage broadcast
    traffic.
    b. Check the event logs on the switch, looking at error counts by port, and follow the
    trail to track down the source of the noise where the counts are highest.
    c. Beef up the server-to-switch connection -- make sure the data pipes are fat where data
    converges! Upgrade the server NIC to gigabit, plug it into a gigabit port on your switch,
    and make sure the backbone linking your LAN segments together has fat pipes too.
    d. Look at implementing some QoS for your telnet traffic if all of the above are fine.



  10. Re: help analyzing slow system (with sar/vmstat/u386mon/sarcheck data)

    Thanks to Bela and James for the warm-hearted and constructive advice; I'll work through
    the suggestions one by one.

    About the disk's performance: the server is an HP DL380 G4 with RAID 1, BDFLUSHR=30 and
    NAUTOUP=10. From my inspection, the 100% busy readings don't seem to be caused by
    bdflush, but I'm not sure. It may really be a hardware bottleneck.

    In any case, I'll take your advice to heart, try some changes, and then report the
    results back to you, but this may take some days because it is a production environment.

    Thanks to Bela, James and Bill again. Best regards to you all.
