REAL GP-GPU PROFILING

A TIGER IN A CAGE? General-purpose computation on Graphics Processing Units (GP-GPUs) possesses extreme computational power, at least potentially. But a GP-GPU programmer will often end up in an endless, unproductive optimization phase.

Now, for the first time, you have the opportunity to do real profiling of GP-GPU code: not through special hardware registers or other non-intuitive mechanisms, but in a novel, non-intrusive way. You do not have to pre- or re-program your code; you merely run it through 'lab4241's GP-GPU performance analysis pipeline.

Currently, 'lab4241' offers beta-stage GP-GPU profiling, meaning that only a limited subset of the NVIDIA GP-GPU software stack is supported; see the requirements section for details.

GPU PROFILING AT A GLANCE: a demo profile dump from the matrix multiplication example.

Here is a screenshot showing how the GPU profiler analyzes memory usage, in terms of the number of reads and writes to each memory type (global, shared, etc.), for a specific test case: shared-memory optimized matrix multiplication, C=A*B, with A and B being square matrices of a modest size, N=32:

Memory read/write usage for matrix multiplication, C=A*B, N=32, shared-memory optimization.
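For reference, the kind of kernel being profiled here is a standard shared-memory tiled matrix multiplication. The sketch below is an illustrative reconstruction, not the actual demo code; the kernel name matmul_shared, the tile size, and the bounds handling are assumptions. It shows where the global and shared memory reads and writes counted in the profile come from.

    // Illustrative sketch (not lab4241 code): shared-memory tiled matrix
    // multiplication, C = A * B, for square N x N matrices, as in the demo
    // case above (N = 32).
    #include <cuda_runtime.h>

    #define TILE 32  // tile width; with N = 32 the whole matrix fits in one tile

    __global__ void matmul_shared(const float *A, const float *B, float *C, int N)
    {
        // Per-block tiles staged in shared memory; these account for the
        // shared-memory reads and writes seen in the profile.
        __shared__ float As[TILE][TILE];
        __shared__ float Bs[TILE][TILE];

        int row = blockIdx.y * TILE + threadIdx.y;
        int col = blockIdx.x * TILE + threadIdx.x;
        float acc = 0.0f;

        // March over the tiles of A and B that contribute to C[row][col].
        for (int t = 0; t < (N + TILE - 1) / TILE; ++t) {
            // Each thread loads one element of A and one of B from global memory.
            As[threadIdx.y][threadIdx.x] =
                (row < N && t * TILE + threadIdx.x < N)
                    ? A[row * N + t * TILE + threadIdx.x] : 0.0f;
            Bs[threadIdx.y][threadIdx.x] =
                (col < N && t * TILE + threadIdx.y < N)
                    ? B[(t * TILE + threadIdx.y) * N + col] : 0.0f;
            __syncthreads();

            // Partial dot product, read entirely from shared memory.
            for (int k = 0; k < TILE; ++k)
                acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();
        }

        if (row < N && col < N)
            C[row * N + col] = acc;  // one global write per output element
    }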

The profiler can also plot the temporal access pattern for each GPU memory type, here again shown for the shared-memory matrix multiplication demo case:

Temporal memory usage for matrix multiplication.

A trace of calls to the CUDA kernel driver for this case looks like:

CUDA driver calls.

showing, as expected, that a 32*32 matrix multiplication takes up very little GPU kernel time, a little more than 20 µs, compared to the other functions. A full demonstration of the matrix multiplication case can be found in the demo section.
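Such a kernel time can also be cross-checked by hand with CUDA events, independently of the driver-call trace. The host-side sketch below is illustrative only: it reuses the hypothetical matmul_shared kernel from the sketch above and omits data initialization and error checking.

    // Illustrative sketch: timing a single kernel launch with CUDA events.
    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        const int N = 32;
        size_t bytes = N * N * sizeof(float);
        float *A, *B, *C;
        cudaMalloc(&A, bytes);
        cudaMalloc(&B, bytes);
        cudaMalloc(&C, bytes);

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        dim3 block(32, 32);
        dim3 grid((N + 31) / 32, (N + 31) / 32);

        cudaEventRecord(start);
        matmul_shared<<<grid, block>>>(A, B, C, N);  // kernel from the sketch above
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);      // elapsed time in milliseconds
        printf("kernel time: %.1f us\n", ms * 1000.0f);

        cudaFree(A); cudaFree(B); cudaFree(C);
        return 0;
    }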

Go to the background section for a general introduction to GPU profiling, or to the download section to download the beta-stage software.