ST: CDA 6938
Multi-core/many-core architectures and programming
Assignments
Homework #0 (No need to turn in)
Write a multithreaded program using the CUDA programming model and the emulator that prints “Hello world!” from multiple threads.
Homework #1 Nvidia CUDA
Send your source code along with a brief explanation and the performance results in a text file to zhou@eecs.ucf.edu
a. Write a multithreaded program using the CUDA programming model to compute matrix multiplication. One thread computes one element of the product matrix. (Due date: 2/1/08)
b. Write a multithreaded program using the CUDA programming model to compute matrix multiplication using the tiling algorithm. One thread computes one element of the product matrix. (Due date: 2/1/08)
c. Write a multithreaded program using the CUDA programming model to compute matrix multiplication using the unroll and jam algorithm. One thread computes several elements of the product matrix (Due date: 2/8/08)
d. (bonus) Write a multithreaded program using the CUDA programming model to further improve the performance of matrix multiplication. Example ideas include combining tiling with unroll and jam, using texture memory/cache, etc. In your report, include your achieved throughput; when calculating it, you do not need to include the data transfer latency between device and host. (Due date: 2/8/08)
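For checking GPU results, a CPU-side reference can help. The following is only a minimal sketch in plain C of the loop structures named in parts a to c, not a CUDA solution. It assumes square n x n matrices in row-major order. The function names, the tile size of 16, and the unroll factor of 2 are illustrative choices, not requirements of the assignment. The throughput helper uses the usual count of 2 * n^3 floating-point operations (n^3 multiplies and n^3 adds).

```c
#include <stddef.h>

enum { TILE = 16 };  /* hypothetical tile size, chosen for illustration */

static size_t min_size(size_t x, size_t y) { return x < y ? x : y; }

/* Part a structure: each (row, col) iteration does the work that a single
   GPU thread would do for one element of the product matrix. */
void matmul_naive(const float *a, const float *b, float *c, size_t n)
{
    for (size_t row = 0; row < n; ++row)
        for (size_t col = 0; col < n; ++col) {
            float sum = 0.0f;
            for (size_t k = 0; k < n; ++k)
                sum += a[row * n + k] * b[k * n + col];
            c[row * n + col] = sum;
        }
}

/* Part b structure: all three loops are split into TILE-sized blocks so a
   block of A and B is reused while it is close to the processor (cache on
   a CPU, shared memory on a GPU). */
void matmul_tiled(const float *a, const float *b, float *c, size_t n)
{
    for (size_t i = 0; i < n * n; ++i)
        c[i] = 0.0f;
    for (size_t i0 = 0; i0 < n; i0 += TILE)
        for (size_t j0 = 0; j0 < n; j0 += TILE)
            for (size_t k0 = 0; k0 < n; k0 += TILE) {
                size_t i1 = min_size(i0 + TILE, n);
                size_t j1 = min_size(j0 + TILE, n);
                size_t k1 = min_size(k0 + TILE, n);
                for (size_t i = i0; i < i1; ++i)
                    for (size_t j = j0; j < j1; ++j) {
                        float sum = c[i * n + j];  /* partial sum so far */
                        for (size_t k = k0; k < k1; ++k)
                            sum += a[i * n + k] * b[k * n + j];
                        c[i * n + j] = sum;
                    }
            }
}

/* Part c structure: the column loop is unrolled by 2 and the two copies of
   the k loop are fused ("jammed"), so each loaded element of A feeds two
   output elements. */
void matmul_unroll_jam(const float *a, const float *b, float *c, size_t n)
{
    for (size_t row = 0; row < n; ++row) {
        size_t col = 0;
        for (; col + 1 < n; col += 2) {
            float s0 = 0.0f, s1 = 0.0f;
            for (size_t k = 0; k < n; ++k) {
                float av = a[row * n + k];
                s0 += av * b[k * n + col];
                s1 += av * b[k * n + col + 1];
            }
            c[row * n + col] = s0;
            c[row * n + col + 1] = s1;
        }
        for (; col < n; ++col) {  /* leftover column when n is odd */
            float sum = 0.0f;
            for (size_t k = 0; k < n; ++k)
                sum += a[row * n + k] * b[k * n + col];
            c[row * n + col] = sum;
        }
    }
}

/* Throughput in GFLOP/s for an n x n product that took `seconds`. */
double matmul_gflops(size_t n, double seconds)
{
    double flops = 2.0 * (double)n * (double)n * (double)n;
    return flops / seconds / 1e9;
}
```

Comparing a kernel's output against matmul_naive on small inputs is one way to catch indexing mistakes before measuring performance. For floating-point inputs the comparison should allow a small tolerance, since the variants add terms in different orders.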
Homework #2 AMD/ATI Brook+
Send your source code along with a brief explanation and the performance results in a text file to zhou@eecs.ucf.edu
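Brief explanation of 2D convolution (or image convolution): given a matrix a[M, N] and a matrix h[J, K], the convolution y is defined in the standard discrete form as follows:
y[m, n] = \sum_{j=0}^{J-1} \sum_{k=0}^{K-1} h[j, k] \, a[m - j, n - k]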
For simplicity, you may assume out-of-bound elements of a (e.g., a[-1,-1]) are zero.
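As a hedged illustration, here is a minimal CPU reference in plain C for the standard discrete 2D convolution, y[m, n] = sum over j, k of h[j, k] * a[m - j, n - k], with out-of-bound elements of a treated as zero. The function name conv2d_ref, the row-major layout, and the choice to produce an output of the same M x N size as a are assumptions made for this sketch. They are not stated in the assignment.

```c
#include <stddef.h>

/* Reference 2D convolution.
   a: M x N input, h: J x K kernel, y: M x N output, all row-major.
   Terms where m - j or n - k would be negative are skipped, which is
   equivalent to treating out-of-bound elements of a as zero. */
void conv2d_ref(const float *a, size_t M, size_t N,
                const float *h, size_t J, size_t K, float *y)
{
    for (size_t m = 0; m < M; ++m)
        for (size_t n = 0; n < N; ++n) {
            float sum = 0.0f;
            /* j <= m and k <= n keep the unsigned indices m - j and
               n - k from wrapping around below zero. */
            for (size_t j = 0; j < J && j <= m; ++j)
                for (size_t k = 0; k < K && k <= n; ++k)
                    sum += h[j * K + k] * a[(m - j) * N + (n - k)];
            y[m * N + n] = sum;
        }
}
```

A reference like this is slow for large inputs, but it is useful for validating a GPU kernel on small matrices.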
Homework #3 Cross-platform Performance Comparison (Due date: 03/14/08)
Send your source code along with a brief explanation and the performance results in a text file to zhou@eecs.ucf.edu
a. Write an optimized version of single-precision floating-point matrix multiplication for both the Nvidia GTX8800 GPU (using CUDA) and the ATI HD 3870 GPU (using Brook+). Use random numbers to initialize both matrices. Report the following results: (1) The number of lines of code in the kernel function(s). (2) The execution time (including data transfer time between CPU and GPU in both directions) for matrix sizes 256 x 256, 512 x 512, 1024 x 1024, 2048 x 2048, and 4096 x 4096 on both the GTX8800 and the HD 3870. (3) The execution time of double-precision floating-point matrix multiplication for the same sizes on the HD 3870. [Note: to get consistent results, please use the experimental setup provided in the lab instead of your own graphics cards. To reduce run-time noise, run each experiment at least 10 times and report the average execution time.]
b. Write an optimized version of single-precision floating-point 2D convolution for both the Nvidia GTX8800 GPU (using CUDA) and the ATI HD 3870 GPU (using Brook+). Use random numbers to initialize both matrices. Report the following results: (1) The number of lines of code in the kernel function(s). (2) The execution time (including data transfer time between CPU and GPU in both directions) for matrix sizes 256 x 256, 512 x 512, 1024 x 1024, 2048 x 2048, and 4096 x 4096, with a 5 x 5 convolution kernel. (3) The execution time of double-precision floating-point 2D convolution for the same sizes on the HD 3870. [Note: to get consistent results, please use the experimental setup provided in the lab instead of your own graphics cards. To reduce run-time noise, run each experiment at least 10 times and report the average execution time.]
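The measurement procedure in the notes above (random initialization, at least 10 runs, report the average) could be organized on the host side roughly as follows. This is a sketch under assumptions: fill_random, average_seconds, and example_experiment are hypothetical names, and the timer is the C11 timespec_get wall clock rather than any vendor timing API. In a real measurement, the timed callback would perform the copy to the GPU, the kernel launch, the copy back, and any synchronization needed so that the work has actually finished before the clock is read.

```c
#include <stdlib.h>
#include <time.h>

/* Fill buf with pseudo-random floats in [0, 1], reproducible per seed. */
void fill_random(float *buf, size_t count, unsigned seed)
{
    srand(seed);
    for (size_t i = 0; i < count; ++i)
        buf[i] = (float)rand() / (float)RAND_MAX;
}

/* Callback type for one complete experiment run. */
typedef void (*experiment_fn)(void *ctx);

/* Current wall-clock time in seconds (C11 timespec_get). */
static double now_seconds(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Run fn(ctx) `runs` times and return the mean time per run in seconds. */
double average_seconds(experiment_fn fn, void *ctx, int runs)
{
    double total = 0.0;
    for (int r = 0; r < runs; ++r) {
        double start = now_seconds();
        fn(ctx);
        total += now_seconds() - start;
    }
    return total / runs;
}

/* Placeholder workload that only counts how often it was called.
   Replace the body with transfer-in, compute, and transfer-out. */
void example_experiment(void *ctx)
{
    int *calls = (int *)ctx;
    ++*calls;
}
```

Calling average_seconds(example_experiment, &calls, 10) with an int counter runs the placeholder ten times and returns the mean duration. Substituting a real experiment function gives the averaged figure the note asks for.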
Homework #4 Cell Programming (Due date: 4/20/08)
Send your source code along with a brief explanation and the performance results in a text file to zhou@eecs.ucf.edu
a. Write a single-precision floating-point matrix multiplication program for Cell processors. Use random numbers to initialize both matrices. Report the following results: (1) The number of lines of code in the SPE program(s). (2) The execution time for matrix sizes 256 x 256, 512 x 512, 1024 x 1024, 2048 x 2048, and 4096 x 4096.
b. Write a single-precision floating-point 2D convolution program for Cell processors. Use random numbers to initialize both matrices. Report the following results: (1) The number of lines of code in the SPE program(s). (2) The execution time for matrix sizes 256 x 256, 512 x 512, 1024 x 1024, 2048 x 2048, and 4096 x 4096, with a 5 x 5 convolution kernel.