CUDA Kernel Optimization Project
The CUDA Kernel Optimization Project implements custom CUDA kernels for the ReLU activation function, demonstrating GPU programming and performance optimization techniques. Developed on an NVIDIA Tesla T4 GPU with PyTorch 2.8.0 and CUDA 12.5, the project reaches 75% memory bandwidth efficiency while matching the performance of PyTorch's highly optimized native implementation. It bridges high-level deep learning frameworks and low-level GPU programming, offering insight into parallel computing, memory hierarchy optimization, and kernel profiling. Through the systematic implementation of both naive and vectorized kernels, it shows how modern ML frameworks achieve their performance characteristics and where custom optimization can add value.
Overview
CUDA Kernel Optimization explores the design, implementation, and tuning of ReLU activation as custom CUDA kernels:
- Develop naive and vectorized CUDA kernels for the ReLU function
- Integrate the kernels into PyTorch as a C++ extension compiled with load_inline
- Profile the kernels and compare speed and memory metrics against PyTorch's native implementation
- Apply float4 vectorization for improved memory bandwidth
- Achieve competitive performance with comprehensive correctness validation
The project reveals how low-level memory and thread optimizations impact practical deep learning workloads.
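A minimal sketch of what the naive kernel described above could look like, assuming a one-thread-per-element mapping with 256-thread blocks; the identifiers (`relu_naive_kernel`, `relu_naive`) are illustrative, not the project's actual names:

```cuda
#include <cuda_runtime.h>

// One thread per element: each thread reads one float, applies
// max(x, 0), and writes the result. Adjacent threads touch adjacent
// addresses, so global-memory accesses coalesce naturally.
__global__ void relu_naive_kernel(const float* __restrict__ in,
                                  float* __restrict__ out,
                                  int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        out[idx] = fmaxf(in[idx], 0.0f);
    }
}

// Host-side launch: 256 threads per block, enough blocks to cover n.
void relu_naive(const float* d_in, float* d_out, int n) {
    const int threads = 256;
    const int blocks  = (n + threads - 1) / threads;   // ceil(n / threads)
    relu_naive_kernel<<<blocks, threads>>>(d_in, d_out, n);
}
```

Because consecutive threads index consecutive elements, loads and stores coalesce into full cache-line transactions, which is what keeps even the naive version close to the memory-bandwidth ceiling.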
Objectives & Scope
| Primary Objective | Description |
|---|---|
| Custom CUDA Kernel Development | Implement ReLU activation as native CUDA kernel with manual thread and memory management |
| PyTorch Integration | Seamlessly integrate CUDA code using C++ extensions and load_inline compilation |
| Performance Profiling | Utilize PyTorch Profiler to measure kernel execution time and GPU utilization |
| Optimization Implementation | Apply vectorization using float4 for improved memory bandwidth |
| Comparative Analysis | Benchmark against PyTorch native operations |

| Scope | Status |
|---|---|
| Naive and optimized CUDA kernel implementations | In-Scope |
| Memory bandwidth analysis and optimization | In-Scope |
| PyTorch C++/CUDA extension compilation | In-Scope |
| Performance profiling with PyTorch Profiler | In-Scope |
| Correctness validation across multiple tensor sizes | In-Scope |
| Multi-GPU implementations | Out-of-Scope |
| Backward pass with gradient computation | Out-of-Scope |
| Production deployment considerations | Out-of-Scope |
| Advanced optimization (Triton, TensorRT) | Out-of-Scope |
Implementation Examples & Visualizations
![CUDA Thread Organization]()
![Memory Access Patterns]()
- Thread Organization: Grid of 39,063 blocks × 256 threads for 10M elements
- Memory Pattern: Naive sequential vs. vectorized float4 access
- Performance: Optimized kernel achieves 120.8 GB/s (1.02x speedup)
- Validation: Perfect correctness match with PyTorch native operations
Performance Comparison (10M elements, Tesla T4):
Naive CUDA: 0.3365ms (119.1 GB/s) | Optimized CUDA: 0.3311ms (120.8 GB/s) ⭐ | PyTorch Native: 0.3385ms (118.3 GB/s)
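The grid size and throughput figures quoted above follow directly from the element count and the measured times; a small host-side sketch of that arithmetic, assuming the reported throughput counts one direction of traffic (4 bytes per element):

```cuda
#include <cstdio>

// Re-deriving the headline numbers from the measured kernel time,
// assuming throughput is computed as elements * 4 bytes / elapsed time.
int main() {
    const long long n      = 10000000;                     // 10M floats
    const int threads      = 256;
    const long long blocks = (n + threads - 1) / threads;  // = 39,063
    const double t_opt_ms  = 0.3311;                        // optimized kernel
    const double gbps      = n * 4.0 / (t_opt_ms * 1e-3) / 1e9;  // ≈ 120.8
    printf("grid: %lld blocks x %d threads, throughput: %.1f GB/s\n",
           blocks, threads, gbps);
    return 0;
}
```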
System Architecture / Design
| Layer | Components | Role |
|---|---|---|
| Framework | PyTorch 2.8.0 | High-level tensor ops, kernel integration, profiling |
| Device Runtime | NVIDIA Tesla T4 (CUDA 12.5) | GPU execution, memory management, profiler data collection |
| Kernel Implementation | Custom CUDA (naive/vectorized) | Elementwise ReLU with/without float4 optimization |
| Integration Bridge | C++/CUDA Extension (PYBIND11) | load_inline compilation, tensor marshalling |
| Hardware | 2560 CUDA cores, 16GB GDDR6 | Parallel execution, 320 GB/s memory bandwidth |
Call path: Python Layer → C++ Extension Interface → CUDA Kernel Layer → GPU Hardware (Tesla T4)
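As an illustration of the integration bridge, here is a sketch of the kind of C++/CUDA source a PyTorch extension (for example, one compiled through load_inline) builds into a Python-importable module; the function names, the checks, and the split between cpp_sources and cuda_sources are assumptions rather than the project's actual extension code:

```cuda
#include <torch/extension.h>
#include <cuda_runtime.h>

__global__ void relu_kernel(const float* __restrict__ in,
                            float* __restrict__ out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = fmaxf(in[i], 0.0f);
}

// Wrapper visible from Python. Tensors are passed by reference
// (zero-copy): only raw device pointers cross the language boundary.
torch::Tensor relu_forward(torch::Tensor input) {
    TORCH_CHECK(input.is_cuda(), "input must be a CUDA tensor");
    TORCH_CHECK(input.scalar_type() == torch::kFloat32, "float32 only");
    auto in  = input.contiguous();
    auto out = torch::empty_like(in);
    const int n       = static_cast<int>(in.numel());
    const int threads = 256;
    const int blocks  = (n + threads - 1) / threads;
    relu_kernel<<<blocks, threads>>>(in.data_ptr<float>(),
                                     out.data_ptr<float>(), n);
    return out;
}

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
    m.def("relu_forward", &relu_forward, "Custom CUDA ReLU (forward only)");
}
```

Once compiled, the module is callable from Python like any other PyTorch operator, with the output tensor allocated on the same device as the input.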
Technology Stack & Justification
| Technology | Version | Purpose | Justification |
|---|---|---|---|
| PyTorch | 2.8.0 | ML Framework | Industry-standard with excellent CUDA integration |
| CUDA | 12.5 | GPU Programming | Native NVIDIA platform for maximum control |
| Python | 3.12 | Scripting | Rapid prototyping and PyTorch compatibility |
| Google Colab | - | Environment | Free Tesla T4 GPU access |
| NVCC | 12.5.82 | Compiler | Official NVIDIA compiler with optimizations |
Methodology / Implementation Details
| Implementation | Key Features | Optimization Strategy | Performance Impact |
|---|---|---|---|
| Naive CUDA Kernel | 256 threads/block, sequential access | Coalesced memory, minimal branching | 119.1 GB/s, matches PyTorch baseline |
| Vectorized Kernel | float4 loads, 4x reduction in transactions | Memory alignment, cache efficiency | 120.8 GB/s, 1.02x speedup |
| PyTorch Integration | C++ extension, load_inline JIT | Zero-copy tensor marshalling | Seamless performance profiling |
| Benchmarking | Warmup runs, synchronization | Cold-start elimination, accurate timing | Reliable performance measurements |
- Thread configuration optimized for Tesla T4 architecture (256 threads/block)
- Memory access patterns designed for coalescing and cache efficiency
- Comprehensive correctness validation across multiple tensor sizes
- Profiling reveals memory-bound characteristics (arithmetic intensity of 0.125 FLOPs/byte, i.e. one max operation per 4 bytes read and 4 bytes written)
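A sketch of the float4 vectorization described in the table above, with a scalar tail so sizes that are not multiples of four are still handled correctly; the kernel names and the tail strategy are assumptions for illustration:

```cuda
#include <cuda_runtime.h>

// Vectorized ReLU: each thread loads one float4 (16 bytes), so the
// kernel issues roughly 4x fewer load/store instructions and keeps
// every transaction aligned to a 16-byte boundary.
__global__ void relu_float4_kernel(const float4* __restrict__ in,
                                   float4* __restrict__ out,
                                   int n_vec) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n_vec) {
        float4 v = in[i];
        v.x = fmaxf(v.x, 0.0f);
        v.y = fmaxf(v.y, 0.0f);
        v.z = fmaxf(v.z, 0.0f);
        v.w = fmaxf(v.w, 0.0f);
        out[i] = v;
    }
}

// Scalar cleanup for the last (n % 4) elements.
__global__ void relu_tail_kernel(const float* __restrict__ in,
                                 float* __restrict__ out,
                                 int start, int n) {
    int i = start + blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = fmaxf(in[i], 0.0f);
}

void relu_vectorized(const float* d_in, float* d_out, int n) {
    const int threads = 256;
    const int n_vec   = n / 4;                 // complete float4 groups
    if (n_vec > 0) {
        const int blocks = (n_vec + threads - 1) / threads;
        relu_float4_kernel<<<blocks, threads>>>(
            reinterpret_cast<const float4*>(d_in),
            reinterpret_cast<float4*>(d_out), n_vec);
    }
    const int tail = n - n_vec * 4;            // 0..3 leftover elements
    if (tail > 0) {
        relu_tail_kernel<<<1, tail>>>(d_in, d_out, n_vec * 4, n);
    }
}
```

Because ReLU does a single comparison per element, the workload stays memory-bound; the gain from float4 comes from wider, aligned memory transactions rather than from any reduction in arithmetic.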
Benchmark Results & Analysis
| Size | Implementation | Time (ms) | Throughput (GB/s) | Speedup |
|---|---|---|---|---|
| 100K | Naive CUDA | 0.0111 | 36.0 | 1.01x |
| 100K | PyTorch | 0.0112 | 35.7 | 1.00x |
| 1M | Naive CUDA | 0.0380 | 105.3 | 0.96x |
| 1M | PyTorch | 0.0366 | 109.3 | 1.00x |
| 10M | Naive CUDA | 0.3365 | 119.1 | 1.01x |
| 10M | Optimized CUDA | 0.3311 | 120.8 | 1.02x ⭐ |
| 10M | PyTorch | 0.3385 | 118.3 | 1.00x |

| Key Findings | Result | Impact |
|---|---|---|
| Performance Parity | ✅ Match PyTorch within 1-2% | Custom kernels competitive |
| Optimization Gain | ✅ 1.02x speedup with vectorization | float4 provides measurable improvement |
| Memory Bandwidth | ✅ 120.8 GB/s (75.5% of peak) | Excellent bandwidth utilization |
| Scaling Behavior | ✅ Performance improves with size | Amortizes launch overhead |
The benchmarks show that the vectorized kernel is competitive with PyTorch's native implementation, and profiling confirms efficient GPU utilization and a memory-bound workload. Note that the throughput column counts one direction of traffic (4 bytes per element); counting the read and the write together, the optimized kernel moves roughly 241.6 GB/s, which is consistent with the 75.5%-of-peak figure against the T4's 320 GB/s.
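The warmup-plus-synchronization timing pattern mentioned above can be illustrated with a standalone CUDA-events harness; this is a sketch using the naive kernel (repeated here so the example compiles on its own), not the PyTorch Profiler pipeline the project actually used, and the iteration counts and input data are arbitrary assumptions:

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

// One-thread-per-element ReLU, repeated from the earlier sketch so
// this harness is self-contained.
__global__ void relu_kernel(const float* __restrict__ in,
                            float* __restrict__ out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = fmaxf(in[i], 0.0f);
}

static void launch_relu(const float* in, float* out, int n) {
    relu_kernel<<<(n + 255) / 256, 256>>>(in, out, n);
}

// Average per-launch time in milliseconds, measured with CUDA events
// after warmup runs so one-time costs are excluded.
static float time_relu_ms(const float* d_in, float* d_out, int n, int iters) {
    for (int i = 0; i < 10; ++i) launch_relu(d_in, d_out, n);  // warmup
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i) launch_relu(d_in, d_out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);            // wait for every launch to finish

    float total_ms = 0.0f;
    cudaEventElapsedTime(&total_ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return total_ms / iters;
}

int main() {
    const int n = 10000000;                // 10M float elements
    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in,  n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));

    // Ramp centered at zero so roughly half the elements get clamped.
    std::vector<float> h(n);
    for (int i = 0; i < n; ++i) h[i] = static_cast<float>(i - n / 2);
    cudaMemcpy(d_in, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    const float ms = time_relu_ms(d_in, d_out, n, 100);
    printf("avg kernel time: %.4f ms (%.1f GB/s read)\n",
           ms, n * 4.0 / (ms * 1e-3) / 1e9);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

Averaging over many synchronized launches removes cold-start effects such as context creation and JIT compilation, which would otherwise dominate a sub-millisecond kernel.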
Conclusion
The CUDA Kernel Optimization Project demonstrates advanced GPU programming through custom kernels that match the performance of PyTorch's native operations. The vectorized kernel delivers a 1.02x speedup while sustaining 75.5% memory bandwidth efficiency on the Tesla T4. Key achievements include custom CUDA kernels on par with PyTorch's native implementation, a measurable gain from float4 vectorization, profiling that confirms the workload's memory-bound character, and thorough correctness validation across tensor sizes. By bridging high-level ML frameworks with low-level GPU programming, the project builds skills in GPU architecture, memory optimization, and profiling that carry over to performance-critical AI systems, ML infrastructure engineering, and AI compiler development.