Significance of Results

Results are presented for the Weather Research and Forecasting (WRF) code running on twelve Sun Blade X6275 server modules, housed in the Sun Blade 6048 chassis, using the 2.5 km CONUS benchmark dataset.

  • The Sun Blade X6275 cluster achieved 373 GFLOP/s on the CONUS 2.5-KM dataset.

  • The results demonstrate a 91% speedup efficiency, or 11x speedup, from 1 to 12 blades.

  • The current results were run with turbo mode on.
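
The speedup and efficiency figures above can be reproduced from the reported simulation-speedup values for 1 and 12 blades (1.24 and 13.58, turbo on). The sketch below is illustrative only; the function name `scaling` is a hypothetical helper, not part of WRF or the benchmark kit.

```python
# Simulation-speedup values reported for 1 and 12 blades (turbo on).
PERF_1_BLADE = 1.24
PERF_12_BLADES = 13.58
BLADES = 12


def scaling(perf_n, perf_1, n):
    """Return (speedup vs. 1 blade, efficiency as a fraction of ideal)."""
    speedup = perf_n / perf_1
    return speedup, speedup / n


speedup, efficiency = scaling(PERF_12_BLADES, PERF_1_BLADE, BLADES)
print(f"speedup: {speedup:.1f}x, efficiency: {efficiency:.0%}")
# prints "speedup: 11.0x, efficiency: 91%"
```
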
Performance Landscape

Performance is expressed in terms of "simulation speedup", which is the ratio of the simulated time per iteration to the average wall-clock time required to compute it. A larger number implies better performance. The current results were run with turbo mode on.
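
As a rough illustration of how the two reported metrics relate, the sketch below converts a simulation speedup into a computation rate, using the 15 s time step and the ~412 GFLOP per-step cost quoted for this dataset. The function names are hypothetical helpers chosen for this example.

```python
SIM_STEP_S = 15.0        # simulated seconds per iteration (from the input job deck)
GFLOP_PER_STEP = 412.0   # approximate computation cost of one time step


def simulation_speedup(avg_wall_s):
    """Simulated seconds per iteration divided by average wall-clock seconds."""
    return SIM_STEP_S / avg_wall_s


def gflops(speedup):
    """Computation rate in GFLOP/s implied by a given simulation speedup."""
    wall_s_per_step = SIM_STEP_S / speedup
    return GFLOP_PER_STEP / wall_s_per_step


print(round(gflops(13.58), 1))  # 373.0 (12 blades, turbo on)
print(round(gflops(1.24), 1))   # 34.1  (1 blade)
```
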

WRF (Weather Research and Forecasting): CONUS 2.5-KM Dataset

# Blade  # Node  # Proc  # Core  Performance            Computation Rate       Speedup / Efficiency      Turbo On
                                 (Simulation Speedup)   (GFLOP/sec)            (vs. 1 blade)             Relative Perf
                                 Turbo On   Turbo Off   Turbo On   Turbo Off   Turbo On     Turbo Off
12       24      48      192     13.58      12.93       373.0      355.1       11.0 / 91%   10.4 / 87%   +6%
8        16      32      128     9.27       -           -          -           7.5 / 93%    -            -
6        12      24      96      7.03       6.60        193.1      181.3       5.7 / 94%    5.3 / 89%    +7%
4        8       16      64      4.74       -           -          -           3.8 / 96%    -            -
2        4       8       32      2.44       -           -          -           2.0 / 98%    -            -
1        2       4       16      1.24       1.24        34.1       34.1        1.0 / 100%   1.0 / 100%   +0%

A dash marks a value that is not given.

Results and Configuration Summary

Hardware Configuration:

  • Sun Blade 6048 Modular System
    • 12 x Sun Blade X6275 Server Modules, each with
      • 4 x 2.93 GHz Intel QC X5570 processors
      • 24 GB (6 x 4GB)
      • QDR InfiniBand
      • HT disabled in BIOS
      • Turbo mode enabled in BIOS
Software Configuration:

  • OS: SUSE Linux Enterprise Server 10 SP 2
  • Compiler: PGI 7.2-5
  • MPI Library: Scali MPI v5.6.4
  • Benchmark: WRF
  • Support Library: netCDF 3.6.3
Benchmark Description

The Weather Research and Forecasting (WRF) Model is a next-generation mesoscale numerical weather prediction system designed to serve both operational forecasting and atmospheric research needs. WRF is designed to be a flexible, state-of-the-art atmospheric simulation system that is portable and efficient on available parallel computing platforms. It features multiple dynamical cores, a 3-dimensional variational (3DVAR) data assimilation system, and a software architecture allowing for computational parallelism and system extensibility.

Dataset used:

  • Single domain, large size 2.5KM Continental US (CONUS-2.5K)

    • 1501x1201x35 cell volume
    • 6hr, 2.5km resolution dataset from June 4, 2005
    • The benchmark is the final 3hr of simulation (hrs 3-6), starting from a provided restart file; it may also be run for the full 6hrs from a cold start, although that result is seldom reported
    • Iterations are output at every 15 sec of simulation time, and the computation cost of each time step is ~412 GFLOP
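
To put the dataset figures in perspective, the following sketch works out the grid size, the number of time steps in the 3hr benchmark window, and the approximate total work. It assumes one time step per 15 sec of simulated time and the ~412 GFLOP per-step cost quoted above; the wall-clock estimate uses the 12-blade simulation speedup of 13.58 and is only as accurate as those inputs.

```python
NX, NY, NZ = 1501, 1201, 35   # cell volume of the CONUS 2.5 km domain
SIM_STEP_S = 15               # simulated seconds per time step
GFLOP_PER_STEP = 412          # approximate cost per time step
BENCH_HOURS = 3               # benchmark covers simulation hours 3-6

cells = NX * NY * NZ
steps = BENCH_HOURS * 3600 // SIM_STEP_S
total_gflop = steps * GFLOP_PER_STEP

# Estimated wall-clock time for the 3hr window at a given simulation speedup.
speedup_12_blades = 13.58
wall_s = BENCH_HOURS * 3600 / speedup_12_blades

print(f"{cells:,} cells, {steps} steps, ~{total_gflop:,} GFLOP total")
print(f"~{wall_s / 60:.1f} min wall clock at {speedup_12_blades}x")
```
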
Key Points and Best Practices

  • Processes were bound to processors in round-robin fashion.
  • Model simulation time is 15 seconds per iteration, as defined in the input job deck. An achieved speedup of 2.67X means that each model iteration of 15s of simulation time was executed in 5.6s of real wall-clock time (on average).
  • Computational requirements are 412 GFLOP per simulation time step, as measured empirically and documented on the UCAR web site for this data model.
  • The model was run as a single MPI job.
  • The benchmark was built and run as a pure-MPI variant. With larger process counts, building and running WRF as a hybrid OpenMP/MPI variant may be more efficient.
  • Input and output (netCDF format) datasets can be very large for some WRF data models, and run times will generally benefit from a scalable filesystem. Performance with very large datasets (>5GB) can benefit from enabling WRF quilting of I/O across designated processors/servers. The master thread (or rank-0) performs most of the I/O (unless quilting specifies otherwise), with all processes potentially generating some I/O.
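
The exact binding scheme is not spelled out above, so the following is only one plausible reading of "round-robin" binding: consecutive local ranks on a node alternate between sockets before moving to the next core. The node shape (2 sockets, 4 cores each) follows from the blade, node, processor, and core counts in the results table; the function `round_robin_binding` is a hypothetical helper, not an option of Scali MPI or WRF.

```python
def round_robin_binding(n_ranks, n_sockets, cores_per_socket):
    """Map each local rank to (socket, core), cycling across sockets first.

    Assumption: this is an illustrative scheme, not the documented behavior
    of any particular MPI library.
    """
    if n_ranks > n_sockets * cores_per_socket:
        raise ValueError("more ranks than available cores")
    return {rank: (rank % n_sockets, rank // n_sockets) for rank in range(n_ranks)}


# One node of a Sun Blade X6275: 2 sockets x 4 cores, one MPI rank per core.
binding = round_robin_binding(n_ranks=8, n_sockets=2, cores_per_socket=4)
for rank, (socket, core) in binding.items():
    print(f"local rank {rank} -> socket {socket}, core {core}")
```

With this mapping, ranks 0 and 1 land on different sockets, so a job that uses fewer ranks than cores still spreads across both processors.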
Disclosure Statement

WRF, CONUS-2.5K, see, results as of 9/21/2009.