CDA 6938 Multi-core/Many-core Architecture and Programming Homework

ST: CDA 6938 Multi-core/many-core architectures and programming

Assignments

Homework #0 (No need to turn in)

Write a multithreaded program using the CUDA programming model and the emulator that prints multiple “Hello world!” messages.

Homework #1 Nvidia CUDA

Send your source code along with a brief explanation and the performance results in a text file to zhou@eecs.ucf.edu

a. Write a multithreaded program using the CUDA programming model to compute matrix multiplication. One thread computes one element of the product matrix. (Due date: 2/1/08)
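When debugging a kernel for part (a), it can help to have a plain CPU version to compare the GPU output against. The following is a minimal sketch, not a required part of the submission; the function name `matmul_reference`, the square n x n shape, and the row-major layout are assumptions made for illustration. Each (row, col) iteration does the work that one GPU thread would do.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU reference (hypothetical helper): C = A * B for square n x n matrices
// stored in row-major order. Each (row, col) iteration corresponds to the
// work of one GPU thread computing one element of the product.
std::vector<float> matmul_reference(const std::vector<float>& a,
                                    const std::vector<float>& b,
                                    std::size_t n) {
    assert(a.size() == n * n && b.size() == n * n);
    std::vector<float> c(n * n, 0.0f);
    for (std::size_t row = 0; row < n; ++row) {
        for (std::size_t col = 0; col < n; ++col) {
            float sum = 0.0f;
            for (std::size_t k = 0; k < n; ++k) {
                sum += a[row * n + k] * b[k * n + col];
            }
            c[row * n + col] = sum;
        }
    }
    return c;
}
```

Because the GPU may accumulate in a different order, comparing against this reference with a small tolerance is usually safer than testing for exact equality.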

b. Write a multithreaded program using the CUDA programming model to compute matrix multiplication using the tiling algorithm. One thread computes one element of the product matrix (Due date: 2/1/08)
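The loop structure behind the tiling algorithm in part (b) can be sketched on the CPU as follows. This is an illustration under assumptions, not the expected CUDA solution: the name `matmul_tiled`, the `tile` parameter, and the row-major square layout are hypothetical. On the GPU, the tile of each input would typically be staged in on-chip memory by a thread block; here the blocking only shows which elements are reused together.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Tiled (blocked) matrix multiplication on the CPU, row-major, n x n.
// The three outer loops walk over tiles; the three inner loops multiply
// one tile of A by one tile of B and accumulate into one tile of C.
std::vector<float> matmul_tiled(const std::vector<float>& a,
                                const std::vector<float>& b,
                                std::size_t n, std::size_t tile) {
    assert(tile > 0 && a.size() == n * n && b.size() == n * n);
    std::vector<float> c(n * n, 0.0f);
    for (std::size_t i0 = 0; i0 < n; i0 += tile) {
        const std::size_t i_end = std::min(i0 + tile, n);
        for (std::size_t j0 = 0; j0 < n; j0 += tile) {
            const std::size_t j_end = std::min(j0 + tile, n);
            for (std::size_t k0 = 0; k0 < n; k0 += tile) {
                const std::size_t k_end = std::min(k0 + tile, n);
                for (std::size_t i = i0; i < i_end; ++i) {
                    for (std::size_t j = j0; j < j_end; ++j) {
                        float sum = c[i * n + j];  // partial sum so far
                        for (std::size_t k = k0; k < k_end; ++k) {
                            sum += a[i * n + k] * b[k * n + j];
                        }
                        c[i * n + j] = sum;
                    }
                }
            }
        }
    }
    return c;
}
```

The `std::min` bounds let the sketch handle matrix sizes that are not a multiple of the tile size; a GPU version would need an equivalent boundary check or padded inputs.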

c. Write a multithreaded program using the CUDA programming model to compute matrix multiplication using the unroll and jam algorithm. One thread computes several elements of the product matrix (Due date: 2/8/08)
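As a rough CPU illustration of unroll and jam for part (c): the column loop is unrolled by two and the two copies of the inner k loop are fused ("jammed") into one, so each loaded element of A is used for two output elements. The name `matmul_unroll_jam` and the unroll factor of 2 are assumptions chosen for brevity; a GPU thread computing several elements would follow the same pattern.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Unroll-and-jam sketch: each pass of the column loop produces two adjacent
// output elements, sharing the load of a[row][k] between them.
std::vector<float> matmul_unroll_jam(const std::vector<float>& a,
                                     const std::vector<float>& b,
                                     std::size_t n) {
    assert(a.size() == n * n && b.size() == n * n);
    std::vector<float> c(n * n, 0.0f);
    for (std::size_t row = 0; row < n; ++row) {
        std::size_t col = 0;
        // Main loop: two columns at a time.
        for (; col + 1 < n; col += 2) {
            float sum0 = 0.0f;
            float sum1 = 0.0f;
            for (std::size_t k = 0; k < n; ++k) {
                const float a_val = a[row * n + k];  // loaded once, used twice
                sum0 += a_val * b[k * n + col];
                sum1 += a_val * b[k * n + col + 1];
            }
            c[row * n + col] = sum0;
            c[row * n + col + 1] = sum1;
        }
        // Remainder loop: last column when n is odd.
        for (; col < n; ++col) {
            float sum = 0.0f;
            for (std::size_t k = 0; k < n; ++k) {
                sum += a[row * n + k] * b[k * n + col];
            }
            c[row * n + col] = sum;
        }
    }
    return c;
}
```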

d. (bonus) Write a multithreaded program using the CUDA programming model to further improve the performance of matrix multiplication. Example ideas include combining tiling with unroll and jam, using texture memory/cache, etc. In your report, include your achieved throughput (the calculation does not need to include the time for data transfer between device and host). (Due date: 2/8/08)
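One common way to express the achieved throughput is in GFLOPS, counting one multiply and one add per inner-loop step, which gives 2 * n^3 floating-point operations for an n x n product. The helper below is a small sketch of that arithmetic; the name `matmul_gflops` is hypothetical, and `seconds` is assumed to be the measured kernel time without host-device transfers.

```cpp
#include <cassert>
#include <cstddef>

// Throughput of an n x n matrix multiplication in GFLOPS, assuming the
// conventional count of 2 * n^3 floating-point operations.
double matmul_gflops(std::size_t n, double seconds) {
    assert(seconds > 0.0);
    const double size = static_cast<double>(n);
    const double flops = 2.0 * size * size * size;
    return flops / seconds / 1e9;
}
```

For example, a 1000 x 1000 product that takes one second corresponds to 2 GFLOPS under this count.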

Homework #2 AMD/ATI Brook+

Send your source code along with a brief explanation and the performance results in a text file to zhou@eecs.ucf.edu

a. Write a 2D convolution function using the streaming computing model and Brook+. Debug your program using the CPU-based emulator. (Due date: 2/15/08)

Brief explanation of 2D convolution (or image convolution): given a matrix a[M, N] and a matrix h[J, K], the convolution c is defined as follows:

$$c[m, n] = \sum_{j=0}^{J-1} \sum_{k=0}^{K-1} h[j, k] \, a[m-j, n-k]$$

For simplicity, you may assume out-of-bound elements of a (e.g., a[-1,-1]) are zero.
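A CPU reference for the convolution can be useful when checking Brook+ output in the emulator. The sketch below assumes the standard definition c[m, n] = sum over j, k of h[j, k] * a[m - j, n - k] with zero-based indices, treats out-of-bound elements of a as zero, and produces an output of the same M x N size as a. The name `convolve2d`, the row-major layout, and the output size are assumptions, not requirements stated by the assignment.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// 2D convolution reference: a is M x N, h is J x K, both row-major.
// Terms whose index a[m-j, n-k] would be negative are skipped, which is
// equivalent to treating those elements of a as zero.
std::vector<float> convolve2d(const std::vector<float>& a,
                              std::size_t M, std::size_t N,
                              const std::vector<float>& h,
                              std::size_t J, std::size_t K) {
    assert(a.size() == M * N && h.size() == J * K);
    std::vector<float> c(M * N, 0.0f);
    for (std::size_t m = 0; m < M; ++m) {
        for (std::size_t n = 0; n < N; ++n) {
            float sum = 0.0f;
            for (std::size_t j = 0; j < J && j <= m; ++j) {
                for (std::size_t k = 0; k < K && k <= n; ++k) {
                    sum += h[j * K + k] * a[(m - j) * N + (n - k)];
                }
            }
            c[m * N + n] = sum;
        }
    }
    return c;
}
```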

b. Estimate the theoretical performance limit that you may achieve based on your kernel function for part (a) and analyze the performance bottleneck of your kernel function. Test your program on the ATI HD 3870 graphics processor to measure the actual performance. (Due date: 2/22/08)
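One simple way to frame the theoretical limit is to take the smaller of two bounds: the peak arithmetic rate of the processor, and the memory bandwidth multiplied by the number of floating-point operations the kernel performs per byte it moves. The sketch below only encodes that comparison; the name `attainable_gflops` and its parameters are hypothetical, and the actual peak and bandwidth figures have to come from the hardware documentation. Other limits (for example, texture fetch rate) may matter as well and are not modeled here.

```cpp
#include <algorithm>
#include <cassert>

// Upper bound on kernel throughput under a two-limit model:
//   compute bound: peak_gflops
//   memory bound:  bandwidth_gb_per_s * flops_per_byte
// The kernel cannot exceed the smaller of the two.
double attainable_gflops(double peak_gflops,
                         double bandwidth_gb_per_s,
                         double flops_per_byte) {
    assert(peak_gflops > 0.0 && bandwidth_gb_per_s > 0.0 &&
           flops_per_byte > 0.0);
    return std::min(peak_gflops, bandwidth_gb_per_s * flops_per_byte);
}
```

If the memory-bound term is the smaller one, the kernel is bandwidth limited, and reusing loaded data across more output elements raises the bound.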

c. Improve the performance by computing multiple elements of c in one kernel function. Repeat your performance analysis and report how much speedup (compared to computing one element in the kernel function) you get on the ATI HD 3870 graphics processor. (Due date: 2/29/08)
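The idea of producing several output elements per kernel invocation can be illustrated on the CPU by computing two horizontally adjacent outputs in the same pass, so each filter coefficient is loaded once and used twice. This is only a sketch under assumptions: the name `convolve2d_x2`, the factor of two, the row-major layout, the M x N output size, and the definition c[m, n] = sum of h[j, k] * a[m - j, n - k] with zero for out-of-bound elements are all illustrative choices.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// 2D convolution computing two adjacent outputs (columns n and n+1) per
// pass of the column loop. a is M x N, h is J x K, both row-major.
std::vector<float> convolve2d_x2(const std::vector<float>& a,
                                 std::size_t M, std::size_t N,
                                 const std::vector<float>& h,
                                 std::size_t J, std::size_t K) {
    assert(a.size() == M * N && h.size() == J * K);
    std::vector<float> c(M * N, 0.0f);
    for (std::size_t m = 0; m < M; ++m) {
        std::size_t n = 0;
        for (; n + 1 < N; n += 2) {
            float sum0 = 0.0f;  // accumulates c[m, n]
            float sum1 = 0.0f;  // accumulates c[m, n + 1]
            for (std::size_t j = 0; j < J && j <= m; ++j) {
                const float* row = a.data() + (m - j) * N;
                for (std::size_t k = 0; k < K; ++k) {
                    const float w = h[j * K + k];  // loaded once, used twice
                    if (k <= n) sum0 += w * row[n - k];
                    if (k <= n + 1) sum1 += w * row[n + 1 - k];
                }
            }
            c[m * N + n] = sum0;
            c[m * N + n + 1] = sum1;
        }
        // Remainder: last column when N is odd.
        for (; n < N; ++n) {
            float sum = 0.0f;
            for (std::size_t j = 0; j < J && j <= m; ++j) {
                for (std::size_t k = 0; k < K && k <= n; ++k) {
                    sum += h[j * K + k] * a[(m - j) * N + (n - k)];
                }
            }
            c[m * N + n] = sum;
        }
    }
    return c;
}
```

The reported speedup is then the execution time of the one-element version divided by the execution time of the multi-element version on the same input.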

Homework #3 Cross-platform Performance Comparison (Due date: 03/14/08)

Send your source code along with a brief explanation and the performance results in a text file to zhou@eecs.ucf.edu

a. Write an optimized version of single-precision floating-point matrix multiplication for both the Nvidia GTX8800 GPU (using CUDA) and the ATI HD 3870 GPU (using Brook+). Use random numbers to initialize both matrices. Report the following results: (1) The number of lines of code in the kernel function(s). (2) The execution time (including data transfer time between CPU and GPU in both directions) for matrix sizes 256 x 256, 512 x 512, 1024 x 1024, 2048 x 2048, and 4096 x 4096 on both the GTX8800 and the HD 3870. (3) The execution time of double-precision floating-point matrix multiplication for the same sizes on the HD 3870. [Note: To get consistent results, please use the experimental setup provided in the lab instead of your own graphics cards. To reduce run-to-run noise in the execution time, run each experiment at least 10 times and report the average execution time.]

b. Write an optimized version of single-precision floating-point 2D convolution for both the Nvidia GTX8800 GPU (using CUDA) and the ATI HD 3870 GPU (using Brook+). Use random numbers to initialize both matrices. Report the following results: (1) The number of lines of code in the kernel function(s). (2) The execution time (including data transfer time between CPU and GPU in both directions) for matrix sizes 256 x 256, 512 x 512, 1024 x 1024, 2048 x 2048, and 4096 x 4096; the convolution kernel size is 5 x 5. (3) The execution time of double-precision floating-point 2D convolution for the same sizes on the HD 3870. [Note: To get consistent results, please use the experimental setup provided in the lab instead of your own graphics cards. To reduce run-to-run noise in the execution time, run each experiment at least 10 times and report the average execution time.]
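The measurement procedure described in the notes above (random initialization, at least 10 runs, average time) could be organized on the host side roughly as follows. This is a sketch under assumptions: `random_matrix` and `average_seconds` are hypothetical helper names, the value range of 0 to 1 and the fixed seed are arbitrary choices, and `work` stands for whatever callable performs the transfer to the GPU, the kernel, and the transfer back. If the GPU work is launched asynchronously, the callable would need to wait for it to finish before returning, otherwise the timer stops too early.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <random>
#include <vector>

// Fill an n x n row-major matrix with pseudo-random values between 0 and 1.
std::vector<float> random_matrix(std::size_t n, unsigned seed) {
    std::mt19937 gen(seed);
    std::uniform_real_distribution<float> dist(0.0f, 1.0f);
    std::vector<float> m(n * n);
    for (float& x : m) {
        x = dist(gen);
    }
    return m;
}

// Run `work` the given number of times and return the mean wall-clock
// time per run in seconds.
double average_seconds(const std::function<void()>& work, int runs = 10) {
    assert(runs > 0);
    double total = 0.0;
    for (int i = 0; i < runs; ++i) {
        const auto start = std::chrono::steady_clock::now();
        work();
        const auto stop = std::chrono::steady_clock::now();
        total += std::chrono::duration<double>(stop - start).count();
    }
    return total / runs;
}
```

A fixed seed keeps the inputs identical across platforms, which makes the reported times easier to compare.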

Homework #4 Cell Programming (Due date: 4/20/08)

Send your source code along with a brief explanation and the performance results in a text file to zhou@eecs.ucf.edu

a. Write a single-precision floating-point matrix multiplication program for Cell processors. Use random numbers to initialize both matrices. Report the following results: (1) The number of lines of code in the SPE program(s). (2) The execution time for matrix sizes 256 x 256, 512 x 512, 1024 x 1024, 2048 x 2048, and 4096 x 4096.

b. Write a single-precision floating-point 2D convolution program for Cell processors. Use random numbers to initialize both matrices. Report the following results: (1) The number of lines of code in the SPE program(s). (2) The execution time for matrix sizes 256 x 256, 512 x 512, 1024 x 1024, 2048 x 2048, and 4096 x 4096; the convolution kernel size is 5 x 5.