Background

The recently announced IBM POWER7 systems occupy the top of some SPEC benchmark lists. Since the CPU chip is at the center of the announcement, it is worth examining the POWER7 results on the SPECfp_rate2006 benchmarks, which exercise the processor, compiler, and memory hierarchy.

Scaling POWER7 from 2 to 4 chips is not impressive

As of 23-Feb-2010, IBM's best published 2-chip result and best 4-chip result for SPECfp_rate2006 are, respectively, 586 and 851. The scaling from 2 chips to 4 chips is less than 1.5x (851/586=1.452). SPEC also includes a less aggressively tuned metric, "SPECfp_rate_base2006", and for base the scaling is similar (776/531=1.461).
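The arithmetic above can be reproduced with a few lines of Python. This is only a sketch: the scores are the ones quoted in the paragraph, while the function names and the "fraction of ideal" figure (observed ratio divided by a perfect 2x) are labels introduced here for illustration, not SPEC terminology.

```python
# Published scores quoted in the text (as of 23-Feb-2010).
PEAK = {"2-chip": 586, "4-chip": 851}   # SPECfp_rate2006
BASE = {"2-chip": 531, "4-chip": 776}   # SPECfp_rate_base2006


def scaling(two_chip: float, four_chip: float) -> float:
    """Ratio of the 4-chip score to the 2-chip score."""
    return four_chip / two_chip


def fraction_of_ideal(ratio: float, ideal: float = 2.0) -> float:
    """Observed scaling as a fraction of perfect scaling (2x for 2 -> 4 chips)."""
    return ratio / ideal


for name, scores in (("peak", PEAK), ("base", BASE)):
    r = scaling(scores["2-chip"], scores["4-chip"])
    print(f"{name}: {r:.3f}x scaling, {fraction_of_ideal(r):.1%} of ideal 2x")
```

In both cases the 4-chip system delivers a bit under three quarters of what perfect doubling would give.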

As can be seen in Table 1, although the 4-chip system has twice as many cores, its clock speed is slightly lower (3550 MHz vs. 3860 MHz). In all other dimensions listed in Table 1, it would seem to provide twice the capability of the 2-chip system:

Table 1: Best Published 2-chip vs. Best Published 4-chip POWER7 systems

                          2-chip sys   4-chip sys
Cores                     16           32
Chips                     2            4
Threads                   64           128
Copies run (in base)      64           128
MHz                       3860         3550
L1 D cache, per core      32 KB        32 KB
L1 I cache, per core      32 KB        32 KB
L2 cache, per core        256 KB       256 KB
L3 cache, per core        4 MB         4 MB
Memory (GB)               128          256
SPECfp_rate_base2006      531          776
SPECfp_rate2006           586          851

SPECfp is a trademark of the Standard Performance Evaluation Corporation, SPEC. For more information on SPEC, see www.spec.org. Performance results above are from IBM's website as of 23-Feb-2010. Additional detail in the table above is from full disclosures provided to SPEC per rule 4.6. The 2-chip system is the "IBM Power 780 (3.86 GHz, 16 core)"; the 4-chip system is the "IBM Power 750 Express (3.55 GHz, 32 core)".
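One might ask whether the lower clock speed alone explains the shortfall. A rough check, assuming (naively, and purely for illustration) that throughput would otherwise scale with chip count times clock frequency, suggests that it does not. The function name and the model itself are assumptions made here, not something taken from the SPEC disclosures.

```python
def clock_adjusted_ideal(chip_ratio: float, mhz_small: float, mhz_large: float) -> float:
    """Expected speedup if throughput scaled with chips x clock (naive assumption)."""
    return chip_ratio * mhz_large / mhz_small


# Values from Table 1.
ideal = clock_adjusted_ideal(chip_ratio=2, mhz_small=3860, mhz_large=3550)
observed_base = 776 / 531
observed_peak = 851 / 586

print(f"clock-adjusted ideal: {ideal:.3f}x")
print(f"base: {observed_base:.3f}x observed, {observed_base / ideal:.1%} of clock-adjusted ideal")
print(f"peak: {observed_peak:.3f}x observed, {observed_peak / ideal:.1%} of clock-adjusted ideal")
```

Even after allowing for the slower clock, the observed scaling falls well short of this simple expectation, so something other than MHz must be involved.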


Scaling is not uniform

The CPU2006 floating point suite contains 17 individual benchmarks. Do they show uniform scaling?

In fact, scaling is not uniform; when twice as many copies are run on twice as many chips, some of the programs scale well, while others stall out.

The detailed graph shows that seven of the tested programs (shown in red) scale relatively poorly. The poorly-scaling tests are, according to SPEC, drawn from fluid dynamics, speech recognition, physics, linear programming, and electromagnetics applications.

Explanation: memory system

The computations performed in these benchmarks exercise more than just the chip and its caches. They are memory intensive, and will not scale well unless memory bandwidth is also scaled.

Sidebar: Additional technical detail is available in Figure 5 of an article that shows system activity via performance counters. In the graph above, seven benchmarks are shown in red. Six of those seven benchmarks appear in Figure 5 of the article with more than 10 L2 cache misses per 1000 instructions, as can be seen in the 5th column from the right of Figure 5. Ten misses per 1000 instructions is a lot of misses, because main memory is typically at least one hundred cycles away from the CPU chip. For example, IBM says that the 4.7 GHz POWER6 is 450 cycles distant from memory, as was explored in a previous blog post.
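To see why ten misses per 1000 instructions matters, here is a back-of-the-envelope sketch. It assumes, pessimistically, that every miss goes all the way to memory and pays the full latency with no overlap; real processors hide some of this through prefetching, out-of-order execution, and (on POWER7) hits in the L3. The function name is made up for this example, and the 450-cycle figure is the POWER6 number quoted above, not a POWER7 measurement.

```python
def stall_cycles_per_instruction(misses_per_1000_instr: float,
                                 miss_penalty_cycles: float) -> float:
    """Average stall cycles added per instruction.

    Assumes every miss pays the full penalty and nothing overlaps,
    so this is an upper bound rather than a prediction.
    """
    return misses_per_1000_instr * miss_penalty_cycles / 1000


for penalty in (100, 450):
    extra = stall_cycles_per_instruction(10, penalty)
    print(f"{penalty}-cycle penalty: up to {extra:.1f} extra cycles per instruction")
```

Under these assumptions, a 100-cycle penalty adds up to one full cycle to every instruction on average, and a 450-cycle penalty adds up to 4.5.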

For the one remaining benchmark, 482.sphinx3, a plausible (but untested) hypothesis is that it fits reasonably well in the 8 MB L2 of the system tested in the performance counter article, but does not fit well in the 4 MB per core L3 on the POWER7.


Differences between the IBM 750 and the IBM 780

If the memory system is so important to the benchmarks that scale poorly, would that match with what we know of the 2-chip vs. 4-chip system? It would seem so. As noted in Table 1, the current best-published 2-chip result is on the Power 780, whereas the best 4-chip result is on the Power 750. The 750 uses one 4-RU box to hold up to 4 chips, whereas the 780 places only 2 chips into each 4-RU enclosure. Presumably, the extra space in the 780 is used to take better advantage of the POWER7 memory system, perhaps by using more of its memory controllers / channels.

About Pricing

As of 23-Feb-2010, the "IBM Web price" at IBM's Power 750 Express server browse and buy page for a 4-chip, 3.3 GHz 750 with 128 GB was $174,192.

The 4-chip system in Table 1 used 256 GB, which would add $34,080 to the cost, for a total of $208,272.

The memory upgrade cost is calculated as the difference between 16x 8 GB IBM DIMMs and 16x 16 GB DIMMs, using the pricing that is linked from IBM's Power 750 accessories and upgrades page, specifically:
($6390/(2x16GB) * 256 GB) - ($2130/(2x8GB) * 128GB) = $34080
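The same calculation as a short script, for anyone who wants to plug in different prices. The prices and capacities are those quoted above; reading "$6390/(2x16GB)" as a price per pair of DIMMs is this sketch's interpretation, and the function and variable names are invented for illustration.

```python
def memory_cost(total_gb: int, dimm_gb: int, price_per_pair: int) -> int:
    """Cost in USD of reaching total_gb using DIMMs sold in pairs."""
    pairs = total_gb // (2 * dimm_gb)
    return pairs * price_per_pair


BASE_SYSTEM_PRICE = 174_192  # 4-chip, 3.3 GHz Power 750 with 128 GB

cost_256gb = memory_cost(total_gb=256, dimm_gb=16, price_per_pair=6390)
cost_128gb = memory_cost(total_gb=128, dimm_gb=8, price_per_pair=2130)
upgrade = cost_256gb - cost_128gb
total = BASE_SYSTEM_PRICE + upgrade

print(f"256 GB of 16 GB DIMMs: ${cost_256gb:,}")
print(f"128 GB of  8 GB DIMMs: ${cost_128gb:,}")
print(f"Upgrade cost:          ${upgrade:,}")
print(f"Total system price:    ${total:,}")
```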

But wait, now how much would you pay?

The $208,272 price would still be for a 3.3 GHz system; but the tested system in Table 1 was 3.55 GHz, so would presumably cost noticeably more. As of 23-Feb-2010, the above referenced IBM web pages simply say "Call for price" on the 3.55 GHz model.

Finally, if you wanted a 4-chip system that scaled well for all of the SPECfp_rate2006 benchmarks when compared to the 2-chip 780, you should presumably build a 4-chip 780 rather than a 4-chip 750 - and, one presumes, the 780 will again cost noticeably more than the 750.

Disclaimer: this blogger is going solely by pricing found on the IBM web site as of 23-Feb-2010. I do not claim to be an expert in IBM pricing. The presumptions of the previous paragraphs are, IMHO, reasonable; but are unproven.

Bottom line: caveat emptor

IBM has some strong results, but if you want scaling, you have to pay attention to whether your application is hungry for memory bandwidth; and, if so, you need to pay careful attention to which model you are looking at. Try not to be confused by the different benchmarks that exercise different capabilities of the different configurations.


