http://www.beyond3d.com/content/articles/77

NVIDIA Tesla: GPU computing gets its own brand
written by Tim for Professional Workstation

A Brief History of CUDA

When NVIDIA's G80 launched in November 2006, there was a brief mention
of a new toolkit that would greatly simplify GPU computing
development. It was called CUDA (for Compute Unified Device
Architecture), and we knew at the time that it was a C derivative that
would run on the GPU without using any 3D API as an intermediary. We
also knew that the
lead architect for CUDA was Ian Buck, a student of the legendary Pat
Hanrahan at Stanford and one of the authors of the original BrookGPU
paper. Considering its pedigree, we were very excited to see what he
could do with G80. While we waited to get our hands on CUDA, we saw
three major advantages to NVIDIA's approach. First, by bypassing 3D
APIs, there's no concern about future drivers breaking an application,
a problem that has plagued GPU computing applications in the past;
consider Folding@Home's initial
release on R580 and the continued absence of G80 support as an
example. Second, it makes GPU computing more accessible by allowing
developers to write their applications in a potentially more familiar
manner, as opposed to shoehorning their application to fit within a 3D
API's paradigm. Finally, it allows developers to access portions of
the chip that they wouldn't be able to use directly in a 3D API.

In February, NVIDIA released a beta of CUDA to the public. Our ideas
about the advantages of the CUDA approach were confirmed, especially
with the exposure of the parallel data cache to reduce DRAM accesses
and accelerate algorithms that would have previously been limited by
memory bandwidth or latency. Of course, it wasn't perfect; among other
limitations, it supported only single-precision (32-bit) floating
point, which is not suitable for many applications. Still, the beta
showed enormous
promise, and it was obvious that GPU computing would rapidly become a
major part of NVIDIA's business.

Today, NVIDIA launches its third brand of GPU products, Tesla, for GPU
computing.



The Tesla Lineup

At the moment, NVIDIA Tesla is primarily focused on the highest of the
high-end, namely the oil, gas, and computational finance industries.
That's important to keep in mind because the Tesla introduction has
answered another question we had when we first looked at CUDA: would
it be limited to professional cards, even though the consumer GeForce
cards would be capable of using CUDA? The answer is a resounding no.
CUDA will be available across all product lines, although eventually
there will be some features specific to GPU computing that are only
available through the Tesla brand. Instead, Tesla, like Quadro, will
be positioned as a total solution. Workstations and software will be
qualified to work with Tesla, with the same types of support as
Quadro.

The basic unit of the current Tesla line, the Tesla C870, should be
very familiar to anyone who's seen the GeForce 8800. It's essentially
an 8800 GTX--a 575MHz core clock and 128 SPs at 1.35GHz--with 1.5GiB
of GDDR3 RAM. Of course, it's not quite an 8800 GTX--there are no
display outputs at all on the card, even though it has a new version
of the NVIO chip.




NVIDIA states that the C870 has a peak performance of 518 GFlops.
Careful readers might already realize the implications of this number:
the missing MUL is apparently now available in CUDA, increasing
theoretical peak performance by 50% over what was previously stated
for the 8800 GTX in CUDA. However, the conditions necessary for the
MUL in the SFU to be used are unknown, and we don't know whether or
not the SFU MUL will ever be available in 3D applications. Still, the
difference between a 50x and a 100x speedup is a lot less important
than the difference between 1x and 50x, so we aren't too concerned
about the missing MUL. The C870 has an MSRP of $1299 and should be
available in August.
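
The 518 GFlops figure can be reproduced with a little arithmetic. The
sketch below is only a back-of-the-envelope check, assuming that a MAD
counts as two floating-point operations per clock per SP and that the
extra MUL adds one more; the function and variable names are ours, not
NVIDIA's.

```python
# Back-of-the-envelope peak throughput for the Tesla C870.
# Assumption: MAD = 2 flops per clock per SP, extra MUL = 1 more flop.

def peak_gflops(stream_processors, shader_clock_ghz, flops_per_clock):
    """Theoretical peak in GFlops = SPs x shader clock (GHz) x flops per clock."""
    return stream_processors * shader_clock_ghz * flops_per_clock

SPS = 128          # stream processors quoted for the C870
CLOCK_GHZ = 1.35   # shader clock quoted for the C870

mad_only = peak_gflops(SPS, CLOCK_GHZ, 2)      # MAD alone
mad_plus_mul = peak_gflops(SPS, CLOCK_GHZ, 3)  # MAD plus the "missing" MUL

print(f"MAD only:  {mad_only:.1f} GFlops")
print(f"MAD + MUL: {mad_plus_mul:.1f} GFlops")
print(f"Increase:  {mad_plus_mul / mad_only - 1:.0%}")
```

Under those assumptions the MAD-only figure comes to about 346 GFlops
and the MAD-plus-MUL figure to about 518 GFlops, which matches the 50%
jump described above.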

The second product in the current Tesla line is the NVIDIA D870,
called the "Deskside Supercomputer." It's very similar to the Quadro
Plex; it's two Tesla C870 cards in an external unit. Like the Quadro
Plex, the D870 will connect to a host computer via an external PCIe 8x
or 16x connection.




Because it's simply two C870 cards in a more convenient form factor,
the peak theoretical performance of the D870 is 1.036 TFlops. Keep in
mind that CUDA doesn't use any multi-chip interface like SLI. Instead,
one thread on the CPU controls one CUDA device. So, in the case of the
D870, there are two CUDA devices, and two CPU threads will be used to
control them. As a result, if the data set can be spread across the
two devices, there's a linear increase in speed. There's not any
overhead from SLI or anything other than PCIe bandwidth, so the D870
really will be about twice as fast as the C870. The D870 has an MSRP
of $7500 and, like the C870, is scheduled for availability in August.
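
The one-thread-per-device model is easy to picture in code. What
follows is a simulation only, written in Python with no GPU involved:
fake_device_kernel is a hypothetical stand-in for the work one CUDA
device would do, and run_on_devices is our own helper, not part of any
NVIDIA API. It shows the pattern described above: split the data set,
hand each piece to its own host thread, and join the results.

```python
# Simulation only: no GPU is touched here. Each "device" is a plain
# Python function standing in for a kernel launch on one CUDA device.
import threading

def fake_device_kernel(device_id, chunk, results):
    """Hypothetical stand-in for one device's work: square each value."""
    results[device_id] = [x * x for x in chunk]

def run_on_devices(data, num_devices):
    """Split data into contiguous chunks, one host thread per device."""
    chunk_size = (len(data) + num_devices - 1) // num_devices
    results = [None] * num_devices
    threads = []
    for dev in range(num_devices):
        chunk = data[dev * chunk_size:(dev + 1) * chunk_size]
        t = threading.Thread(target=fake_device_kernel,
                             args=(dev, chunk, results))
        threads.append(t)
        t.start()
    for t in threads:
        t.join()
    # Stitch the per-device results back together in device order.
    return [y for part in results for y in part]

data = list(range(10))
out = run_on_devices(data, num_devices=2)  # two devices, as in the D870
print(out)
```

Because the two halves never need to talk to each other, nothing in
this pattern adds overhead beyond moving the data to and from each
device, which is why the scaling can be close to linear when the data
set partitions cleanly.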

Finally, there's the utter beast of the family: the Tesla GPU Server.
With availability targeted for Q4, it costs $12,000, and it's twice as
fast as the D870 with four G80 cards. The really terrifying thing
here, however, is the form-factor. It fits four G80s into a 1U rack.




Again, the peak performance is double that of the D870: 2.072 TFlops.
The GPU Server will consume about 550W of power on average (with a
peak of 800W) and, like the D870, will be connected to a host machine
via an external PCI Express 8x or 16x interface. A host machine will
be able to control a single GPU Server. Still, for markets that can
leverage this kind of computing power, the GPU Server has to be
unbelievably attractive, as it would fit into the existing server
ecosystem without any special considerations.
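
For a rough sense of how the three products compare, the list prices
and peak figures quoted above can be turned into a cost per peak
GFlop. This is only a sketch using the article's own numbers; it
ignores the host machine, power, and the gap between theoretical peak
and real application performance, and the helper name is ours.

```python
# Prices and peak figures as quoted in the article.
products = {
    "Tesla C870":       {"price_usd": 1299,  "peak_gflops": 518},
    "Tesla D870":       {"price_usd": 7500,  "peak_gflops": 1036},
    "Tesla GPU Server": {"price_usd": 12000, "peak_gflops": 2072},
}

def usd_per_gflop(price_usd, peak_gflops):
    """List price divided by theoretical peak throughput."""
    return price_usd / peak_gflops

for name, p in products.items():
    print(f"{name}: ${usd_per_gflop(**p):.2f} per peak GFlop")
```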

But, just what are those markets, and why are they able to use GPU
computing to such effect? NVIDIA has been working closely with some
developers since CUDA first became available, and we've seen some of
the fruits of their labor.


Tesla isn't going to cause a major paradigm shift in every market
overnight. It will, however, cause those shifts in individual
industries as software is written to take advantage of CUDA. We had the
chance to speak to a number of groups who have been working with CUDA
over the past year to see what they've been able to do with it, and
the results are impressive.

Headwave

One of the major markets that NVIDIA is targeting with Tesla is the
oil and gas industry, and it's easy to see why. Most oil fields that
have not yet been discovered are offshore, and as a result, the need
for data processing capabilities has grown at an astronomical rate.
Seismic data is being acquired at much higher resolutions due to the
expense of attempting to tap an undersea oil field--about $150
million--along with new types of data, such as electromagnetics, that
weren't even considered a decade ago.


These processed data sets are often in the range of 50 gigabytes in
size, down from between half a terabyte and 2.5 terabytes before
processing. Headwave is entering the seismic data processing market
with a product that uses CUDA to great effect. Traditional algorithms
run on a workstation can process a data set at 10-30 MB/s. For a
terabyte data set, that's over three weeks of processing time at
least. With massive data sets and an inherently parallelizable
algorithm, Headwave is able to achieve speeds of 1-2 GB/s. That same
terabyte data set can now be processed in less than a day on a single
G80.

Evolved Machines


Most people have heard of neural networks in artificial intelligence,
but many people don't realize that artificial neural networks as
described in the literature have absolutely nothing to do with
biological neurons. Artificial neural networks essentially abstract
away all of the physical aspects of the neuron in exchange for a
greatly simplified model that seems to have reached its limits in
terms of usefulness. Obviously, those physical aspects of the neuron
are essential for more general purpose computation.

Dr. Paul Rhodes and his team at Evolved Machines are creating neural
arrays, which are complete simulations of neural circuits. Because
they take into account all of the physical characteristics of the
neuron, they are vastly more complicated than the neural networks of
old.
A single neuron contains 2000 differential equations, each of which is
updated 100,000 times per second. Each update takes 20 Flops. So, for
a single neuron, we're already at 4 GFlops. Dr. Rhodes estimates that
a neural circuit capable of performing sensory perception would be
composed of 1000 to 2500 neurons, which takes us to between 4 and 10
TFlops. Until very recently, this was the type of application that
would take a Top 500 supercomputing cluster, but it's suddenly
possible with a rack or two.
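
The arithmetic behind those figures is simple enough to check. The
sketch below uses only the numbers quoted above; the last step, which
counts Tesla GPU Servers at their theoretical peak of 2.072 TFlops, is
our own illustration and assumes a utilization that real code will not
reach.

```python
import math

EQUATIONS_PER_NEURON = 2000     # differential equations per neuron
UPDATES_PER_SECOND = 100_000    # updates of each equation per second
FLOPS_PER_UPDATE = 20           # floating-point operations per update

def neuron_flops():
    """Floating-point operations per second for one simulated neuron."""
    return EQUATIONS_PER_NEURON * UPDATES_PER_SECOND * FLOPS_PER_UPDATE

def circuit_tflops(neurons):
    """Sustained TFlops needed for a circuit of the given size."""
    return neurons * neuron_flops() / 1e12

SERVER_PEAK_TFLOPS = 2.072      # Tesla GPU Server, theoretical peak

print(f"One neuron: {neuron_flops() / 1e9:.0f} GFlops")
for n in (1000, 2500):
    need = circuit_tflops(n)
    servers = math.ceil(need / SERVER_PEAK_TFLOPS)
    print(f"{n} neurons: {need:.0f} TFlops, "
          f"at least {servers} GPU Servers at peak")
```

Even at peak rates, the smaller circuit needs two GPU Servers and the
larger one five, so the practical count will be higher.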

Sensory perception is exactly what Evolved Machines is trying to
achieve. They're building neural circuits that are capable of visual
as well as olfactory recognition--yes, that's right, computers that
smell. Already, they've found that their application is about 65 times
faster on a single G80 than it is on a current x86 chip.

Theoretical and Computational Biophysics Group at UIUC


Evolved Machines weren't the only group demonstrating simulations of
molecular biology. John Stone, a developer with the Theoretical and
Computational Biophysics Group at the University of Illinois at Urbana-
Champaign, was on hand to share his experiences with CUDA development.
Stone has added CUDA support to the Nanoscale Molecular Dynamics
package (NAMD). As with Headwave and Evolved Machines, parts of NAMD
were sped up by orders of magnitude thanks to CUDA.




Stone has documented much of his experience with CUDA in a lecture
given to the ECE 498AL class at UIUC (more on that class later), and
it's definitely worth a read for anyone considering using CUDA. Stone
focuses on the algorithm that places ions in a simulated virus, and he
points out a number of potential bottlenecks that are applicable to
any GPU computing project. The number of arithmetic operations must be
high enough to effectively hide memory latency, data structures must
be modified to prevent branching, different memory regions must be
used effectively, register usage has to be kept as low as possible...
basically, it's all the rules that we've come to expect when writing
3D code on a GPU.

Of course, this isn't completely straightforward. Stone provides four
implementations of the Coulombic potential kernel. The naive version
that doesn't make any attempt to hide memory latency achieves 90
GFlops. The final version, which follows all the guidelines above and
maximizes use of the parallel data cache, reaches 235 GFlops. Of
course, the resulting code is significantly more daunting than the
naive version, although once you've got a decent handle on CUDA, the code
itself is really not that bad. However, it's very clear that the
algorithm was carefully designed, implemented, profiled, and
reimplemented several times, which is where the difficulty with CUDA
arises.
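
To make the problem concrete, here is what a direct Coulomb summation
looks like in its plainest form. This is not Stone's code: it is a
minimal CPU reference in Python, with our own function name, a
hypothetical two-atom input, and the physical constant and units left
out. Every grid point is computed independently of the others, which
is what makes the problem a natural fit for assigning one GPU thread
per point; the tuning described above is about how the atom data is
fed to those threads.

```python
import math

def coulomb_potential_slice(atoms, nx, ny, spacing, z=0.0):
    """Naive direct summation of the potential on an nx-by-ny grid slice.

    atoms is a list of (x, y, z, charge) tuples. The Coulomb constant
    and units are omitted; this is an illustration, not a physics code.
    """
    grid = [[0.0] * nx for _ in range(ny)]
    for j in range(ny):
        y = j * spacing
        for i in range(nx):
            x = i * spacing
            total = 0.0
            # Every atom contributes charge / distance to this grid point.
            for ax, ay, az, q in atoms:
                dx, dy, dz = x - ax, y - ay, z - az
                total += q / math.sqrt(dx * dx + dy * dy + dz * dz)
            grid[j][i] = total
    return grid

# Hypothetical input: two opposite charges placed off the z=0 plane,
# so no grid point coincides with an atom.
atoms = [(0.5, 0.5, 1.0, 1.0), (1.5, 0.5, 1.0, -1.0)]
grid = coulomb_potential_slice(atoms, nx=3, ny=2, spacing=1.0)
for row in grid:
    print(["%.4f" % v for v in row])
```

The work grows with the number of grid points times the number of
atoms, and the inner loop is almost pure arithmetic, so how quickly
atom data reaches the arithmetic units decides how close a kernel gets
to peak.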


The Need for Education

What people are going to discover, though, is that CUDA is hard.
Writing the code isn't hard--CUDA really is just C with a few added
keywords--but designing algorithms to really utilize the architecture
can be fantastically difficult. One concern that NVIDIA has is that
students in computer science won't get enough training with parallel
algorithms and massively parallel architectures to be able to make the
best use of CUDA. This certainly isn't unjustified. A year ago, if we
were to mention a massively parallel architecture, we'd be talking
about a supercomputing cluster. Now, some of the same difficulties in
designing software for a cluster apply to every G8x chip.

To try to improve the situation, David Kirk, chief scientist at
NVIDIA, taught a class at UIUC on data-parallel programming using
CUDA, the previously mentioned ECE 498AL. All of the materials for the
class, including lecture slides, audio recordings of all lectures, and
assignments, are freely available. We've gone through all of the
materials to get a better understanding of CUDA ourselves, and we
highly recommend it to anyone interested in data-parallel programming
or CUDA. NVIDIA, and Kirk in particular, are hoping that ECE 498AL can
be used as a template for classes in data-parallel programming at
other universities.

Of course, many will claim that NVIDIA is pushing classes in data-
parallel programming as a way to push CUDA, and there's certainly some
element of truth to that. The problem with that view, though, is the
lack of other widely available massively parallel machines. As Kirk
told us, it's not possible to be entirely platform agnostic when
teaching any low-level language like CUDA. Teaching C usually involves
some explanation of what's happening on an x86, and teaching data-
parallel programming using CUDA isn't too different. Most importantly,
though, students need to be exposed to these types of architectures as
early as possible. Massively parallel architectures won't be some fad
that goes away in five years, and any exposure, even if it's centered
around CUDA, is better than leaving students totally clueless once
other vendors introduce similar products.

The Future

First, it's important to keep in mind Tesla's position in the
marketplace. CUDA on consumer products isn't going away. Developing
with CUDA is perfectly possible on GeForce cards, and we expect that
the importance of CUDA in mainstream applications as well as gaming
will quickly grow with the number of G8x chips. Tesla is instead
focused on
the users of high-end GPU computing products that will be certified
for use with specific Tesla products.


However, that does not mean that there won't be Tesla-specific
features. One of the problems preventing the adoption of GPU computing
in some markets is the need for double precision. As we look to G92,
which has been stated to be close to 1 TFlop for single precision
processing as well as being capable of double precision processing, we
can say that double precision on G92 will be limited to the Tesla
line. While this will surely disappoint some, the need for double
precision processing on a consumer GPU is questionable at best, and
NVIDIA sees this as an excellent way to differentiate the two product
lines. We expect that there will be other Tesla-specific features,
such as ECC RAM, that simply don't make sense in the consumer market.

As far as CUDA goes, 1.0 should launch next week, and we'll have in-
depth coverage of just what that means for developers. Among the new
features are asynchronous kernel calls (freeing up the CPU while the
GPU runs), improved FFT and BLAS libraries, and a Matlab library to
offload whatever processing it can to the GPU. Most importantly,
though, 64-bit versions of CUDA will be available, correcting a major
complaint with the earlier betas. In addition, the separation between
CUDA drivers and normal device drivers will soon end, making CUDA
realistically available to end users. Finally, the specification for
PTX, the intermediate assembly language used by CUDA, will be opened,
allowing other languages to gain the same access to the chip as CUDA
has. However, it also means that backends for different architectures
could be developed, potentially changing the GPU computing game
considerably. Keep an eye out for that.

All in all, 2007 looks to be the year when GPU computing starts to
make inroads into the market. With Tesla, NVIDIA is making a serious
bet on the future of the company (and certainly looking at waging war
with the CPU guys). Considering the disruptive effects that GPU
computing could have on some markets, it's a very exciting time for
the industry, and with GPU computing capabilities stealthily
introduced to more and more computers, we're definitely looking
forward to seeing just what happens.

Also, be sure to check out our interviews with David Kirk, NVIDIA
Chief Scientist, and Andy Keane, General Manager of the GPU Computing
group, regarding Tesla, the future of GPU computing, and better
incorporating parallel programming into education.



Interviews:

Dave Kirk
http://www.beyond3d.com/content/interviews/40/

Andy Keane
http://www.beyond3d.com/content/interviews/41/