Diminishing bandwidth performance with multiple quad core X5355s - Hardware



  1. Diminishing bandwidth performance with multiple quad core X5355s

    Hello, I'm considering getting a second X5355 in my new machine. I've
    heard bandwidth gets bad above 4 cores with the Woodcrests.

    We do large calculations on multi-gig datasets held in 16 gigs of RAM,
    so our performance is bandwidth limited.

    How much faster would two quad core X5355 chips be compared to one?

    Thanks.


  2. Re: Diminishing bandwidth performance with multiple quad core X5355s

    On 4 May 2007 12:27:36 -0700, CharlesBlackstone wrote:

    >Hello, I'm considering getting a second X5355 in my new machine. I've
    >heard bandwidth gets bad above 4 cores with the Woodcrests.
    >
    >We do large calculations on multi-gig datasets held in 16 gigs of RAM,
    >so our performance is bandwidth limited.
    >
    >How much faster would two quad core X5355 chips be compared to one?
    >
    >Thanks.



    There are too many variables involved to answer your
    question with any reasonable degree of accuracy. The
    obvious answer is that if you can use a 2nd system, you have
    that bandwidth increase.

    If you can find someone doing same calcs, same app (?) on
    same/similar config who has done this upgrade, only then
    would you approach some data that might be extrapolated to
    your situation. Maybe more info about these jobs would
    help. Maybe not.

  3. Re: Diminishing bandwidth performance with multiple quad core X5355s

    On May 4, 1:59 pm, kony wrote:
    > On 4 May 2007 12:27:36 -0700, CharlesBlackstone wrote:
    > >Hello, I'm considering getting a second X5355 in my new machine. I've
    > >heard bandwidth gets bad above 4 cores with the Woodcrests.
    > >
    > >We do large calculations on multi-gig datasets held in 16 gigs of RAM,
    > >so our performance is bandwidth limited.
    > >
    > >How much faster would two quad core X5355 chips be compared to one?
    > >
    > >Thanks.
    >
    > There are too many variables involved to answer your
    > question with any reasonable degree of accuracy. The
    > obvious answer is that if you can use a 2nd system, you have
    > that bandwidth increase.
    >
    > If you can find someone doing same calcs, same app (?) on
    > same/similar config who has done this upgrade, only then
    > would you approach some data that might be extrapolated to
    > your situation. Maybe more info about these jobs would
    > help. Maybe not.

    I think a lot of people are aware that an Opteron system has fewer
    bandwidth restrictions with a lot of processors, but that Woodcrests
    don't have as good a memory controller and fall behind Opterons after
    4 cores or so. I'm asking how severe this is. Heavy number crunching of
    huge data sets in RAM is a bandwidth-intensive operation. So, I'm
    asking how badly Woodcrests are impacted above 4 cores, for example, 8
    cores vs 4 cores, on bandwidth performance. I didn't think this was
    that vague. Is there anything else I can tell you that will make the
    question less difficult to answer?

    I need to crunch a chunk of data 8 gigs large. It can't be split into
    two chunks and the crunching is not a parallel operation, so having
    two computers is not helpful.



  4. Re: Diminishing bandwidth performance with multiple quad core X5355s

    CharlesBlackstone wrote:

    > Hello, I'm considering getting a second X5355 in my new machine. I've
    > heard bandwidth gets bad above 4 cores with the Woodcrests.
    >
    > We do large calculations on multi-gig datasets held in 16 gigs of RAM,
    > so our performance is bandwidth limited.
    >
    > How much faster would two quad core X5355 chips be compared to one?


    You mean you are populating only a single slot on a dual slot board? You
    should use both slots to get the most bandwidth out of the system.

    --
    Mvh./Regards, Niels Jørgen Kruse, Vanløse, Denmark

  5. Re: Diminishing bandwidth performance with multiple quad core X5355s

    CharlesBlackstone wrote:
    > On May 4, 1:59 pm, kony wrote:
    >> On 4 May 2007 12:27:36 -0700, CharlesBlackstone
    >>
    >> wrote:
    >>> Hello, I'm considering getting a second X5355 in my new machine. I've
    >>> heard bandwidth gets bad above 4 cores with the Woodcrests.
    >>> We do large calculations on multi-gig datasets held in 16 gigs of RAM,
    >>> so our performance is bandwidth limited.
    >>> How much faster would two quad core X5355 chips be compared to one?
    >>> Thanks.

    >> There are too many variables involved to answer your
    >> question with any reasonable degree of accuracy. The
    >> obvious answer is that if you can use a 2nd system, you have
    >> that bandwidth increase.
    >>
    >> If you can find someone doing same calcs, same app (?) on
    >> same/similar config who has done this upgrade, only then
    >> would you approach some data that might be extrapolated to
    >> your situation. Maybe more info about these jobs would
    >> help. Maybe not.

    >
    > I think a lot of people are aware that an Opteron system has fewer
    > bandwidth restrictions with a lot of processors, but that Woodcrests
    > don't have as good a memory controller and fall behind Opterons after
    > 4 cores or so. I'm asking how severe this is. Heavy number crunching of
    > huge data sets in RAM is a bandwidth-intensive operation. So, I'm
    > asking how badly Woodcrests are impacted above 4 cores, for example, 8
    > cores vs 4 cores, on bandwidth performance. I didn't think this was
    > that vague. Is there anything else I can tell you that will make the
    > question less difficult to answer?
    >
    > I need to crunch a chunk of data 8 gigs large. It can't be split into
    > two chunks and the crunching is not a parallel operation, so having
    > two computers is not helpful.
    >


    There was a time when the answer to these questions was easy to see
    on spec.org. But the last time I looked there, I wasn't sure I was
    even reading the results right. Perhaps you can find evidence for your
    choking hypothesis there.

    http://www.spec.org/cpu2006/results/

    There are sites that have run consumer-oriented benchmarks on such
    platforms, but the danger there is that the application may not be
    using the available resources properly or well. The site below does
    not do a good job of listing the hardware particulars. (I believe it
    is possible the computer in question uses two 3GHz X5365s, which are
    not listed on processorfinder.intel.com.) This benchmark is not
    directly applicable, because it compares dual dual-cores to
    dual quad-cores, whereas you want to compare one quad-core to two
    quad-cores. Still, I think you can see that the speedup is not
    linear, at least for the quality of applications and testing
    techniques they are using.

    http://www.barefeats.com/octopro1.html

    For the $1200 or so you are going to spend finding out, I think
    you'll get some benefit. But it won't be a linear speedup. And
    if you test your existing platform and setup, with 1, 2, or 4
    cores enabled, I think you may already be able to show what
    kind of impact your particular memory access pattern is
    having on the platform. If you are seeing pretty close to
    linear speedup right now, then chances are you'll get some
    benefit from an extra 4 cores. (Enough to justify the $1200.)
    If, on the other hand, the box is already collapsing under the
    access pattern (say purely random access, some kind of cache
    busting pattern), you may already have evidence that the extra
    4 cores would be wasted.
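
    As a rough sketch of the check described above: time the same job
    with 1, 2, and 4 cores enabled, then compare the measured speedup
    against the ideal linear one. The function name and the timings
    below are made up for illustration; substitute your own wall-clock
    measurements.

```python
def scaling_report(timings):
    """Return (cores, speedup, efficiency) tuples.

    timings maps a core count to the wall-clock seconds the same job
    took with that many cores enabled. Speedup is measured relative to
    the smallest core count present; efficiency is speedup divided by
    the ideal (linear) speedup.
    """
    base_cores = min(timings)
    base_time = timings[base_cores]
    report = []
    for cores in sorted(timings):
        speedup = base_time / timings[cores]
        ideal = cores / base_cores
        report.append((cores, speedup, speedup / ideal))
    return report


if __name__ == "__main__":
    # Hypothetical timings in seconds, NOT measurements from a real X5355.
    example = {1: 100.0, 2: 55.0, 4: 40.0}
    for cores, speedup, efficiency in scaling_report(example):
        print(f"{cores} cores: speedup {speedup:.2f}x, "
              f"efficiency {efficiency:.0%}")
```

    If efficiency is already dropping sharply between 2 and 4 cores, that
    is a hint the memory system is the limit and 8 cores will likely
    help even less.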

    (Followup arbitrarily set, because my news server won't let me
    post without it.)

    Paul

  6. Re: Diminishing bandwidth performance with multiple quad core X5355s

    On May 4, 12:27 pm, CharlesBlackstone wrote:
    > Hello, I'm considering getting a second X5355 in my new machine. I've
    > heard bandwidth gets bad above 4 cores with the Woodcrests.
    >
    > We do large calculations on multi-gig datasets held in 16 gigs of RAM,
    > so our performance is bandwidth limited.
    >
    > How much faster would two quad core X5355 chips be compared to one?
    >

    See if running
    ../sys_basher -mbandwidth

    on an unloaded and a loaded system sheds any light on your question.
    If not, vary the memory sizes used by sys_basher.


  7. Re: Diminishing bandwidth performance with multiple quad core X5355s

    On May 5, 9:44 am, CharlesBlackstone wrote:

    > I think a lot of people are aware that an Opteron system has fewer
    > bandwidth restrictions with a lot of processors, but that Woodcrests
    > don't have as good a memory controller and fall behind Opterons after
    > 4 cores or so. I'm asking how severe this is. Heavy number crunching of
    > huge data sets in RAM is a bandwidth-intensive operation. So, I'm
    > asking how badly Woodcrests are impacted above 4 cores, for example, 8
    > cores vs 4 cores, on bandwidth performance. I didn't think this was
    > that vague. Is there anything else I can tell you that will make the
    > question less difficult to answer?



    Your question is difficult to answer because you'd first need to know
    (at least approximately) the ratio of FLOPs to memory accesses, and
    the pattern of those accesses. It all boils down to that. If your
    program can keep the CPU busy for long stretches without needing to
    access the memory bus, then it will definitely benefit from more
    CPUs/cores. If, on the other hand, your program needs to go to main
    RAM very frequently (i.e. loads/stores that miss the cache), then you
    will have contention on the memory bus and your performance per CPU
    will degrade.
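
    To make the ratio argument concrete, here is a deliberately
    simplified sketch. It assumes all cores share one fixed memory
    bandwidth and that throughput is capped by whichever is lower:
    compute rate, or bandwidth times FLOPs per byte. The function names
    and all numbers are illustrative assumptions, not figures for any
    real Xeon or Opteron.

```python
def attainable_gflops(peak_gflops, bandwidth_gb_s, flops_per_byte):
    """Throughput cap: the lower of the compute and memory limits."""
    return min(peak_gflops, bandwidth_gb_s * flops_per_byte)


def per_core_scaling(core_counts, gflops_per_core, bandwidth_gb_s,
                     flops_per_byte):
    """Return (cores, total GFLOPS, GFLOPS per core) for each core count.

    Assumes the memory bandwidth is shared and does not grow with the
    number of cores, which is the pessimistic shared-bus case.
    """
    rows = []
    for n in core_counts:
        total = attainable_gflops(n * gflops_per_core, bandwidth_gb_s,
                                  flops_per_byte)
        rows.append((n, total, total / n))
    return rows


if __name__ == "__main__":
    # Hypothetical machine: 10 GFLOPS per core, 20 GB/s shared bandwidth.
    for intensity in (0.25, 4.0):  # FLOPs performed per byte moved
        print(f"{intensity} FLOPs/byte:")
        for n, total, per_core in per_core_scaling([1, 2, 4, 8], 10, 20,
                                                   intensity):
            print(f"  {n} cores: {total:.1f} GFLOPS total, "
                  f"{per_core:.2f} per core")
```

    With few FLOPs per byte the total stays pinned at the memory limit no
    matter how many cores you add; with many FLOPs per byte the total
    keeps growing with the core count.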

    You ask "how badly" your app will degrade. The way to model and
    predict that is to use the hardware performance counters (OProfile
    under Linux, cputrack on Solaris, etc.), which give you an idea of
    the rate of instructions versus everything else (loads/stores to RAM,
    retired FLOPs, cache misses, TLB misses, etc.). But of course the
    best way is to measure your program on the real thing.

    I wanted to post this even if it's a bit late in the thread because
    right now I have exactly this kind of problem. We're trying to figure
    out whether a dual quad-core Xeon will be better (cost/benefit-wise)
    than a 4-way dual-core Opteron, for *our* program.

    SPEC CPU2006 can give you some pretty good insight on this: go to
    the advanced query option and list all available results, but filter
    by "number of total cores" equal to 8. Go straight to the int_rate
    and fp_rate figures, and you'll be able to see how 4-way dual-core
    Opterons compare to dual quad-core Xeons. That holds at least for the
    SPEC CPU2006 suite, whose programs have quite big working sets,
    although they may not be as RAM-bottlenecked as your particular
    program.

    As you say, Opterons definitely have a much better memory system.
    But then a 4-way mobo is WAY more expensive than a dual-socket one...

    And btw, if you want to benchmark just memory bandwidth/latency
    performance, STREAM (http://www.cs.virginia.edu/stream/)
    is the way to go.
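
    For a quick sanity check before building STREAM, a crude
    single-threaded copy test can be sketched as below. This is not
    STREAM and will not reproduce its numbers: it only times a bulk
    buffer copy, loosely in the spirit of STREAM's Copy kernel, and the
    function name and default sizes are my own assumptions. Use a buffer
    much larger than the caches, and run several copies at once if you
    want to see contention.

```python
import time


def copy_bandwidth_gb_s(size_bytes=64 * 1024 * 1024, repeats=5):
    """Estimate copy bandwidth in GB/s using a bulk bytearray copy.

    Takes the best of several runs so that first-touch page faults and
    other one-off costs do not dominate the result.
    """
    src = bytearray(size_bytes)
    dst = bytearray(size_bytes)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        dst[:] = src  # one bulk copy of the whole buffer
        best = min(best, time.perf_counter() - start)
    # Each pass reads size_bytes and writes size_bytes.
    return 2 * size_bytes / best / 1e9


if __name__ == "__main__":
    print(f"approx. copy bandwidth: {copy_bandwidth_gb_s():.2f} GB/s")
```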

    Cheers,

    JL

