Memory

Cache

  • fast memory close to, or integrated with, the processor
  • transparent to the programmer
  • When code accesses a variable, the cache controller checks whether the contents of that address are already in the cache
    • If yes -> cache hit
    • If no -> cache miss -> the processor accesses the address in main memory and reads a whole line/block
      • as the line is read by the processor, it is also stored in the cache memory

-> Good for speeding up transfers into the core
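
As a rough illustration of why whole-line fills matter, the host-side sketch below (sizes and names are my own, purely illustrative) sums the same matrix twice: the row-major loop walks consecutive addresses and reuses every element of each fetched cache line, while the column-major loop jumps a full row per access and tends to miss far more often.

```cpp
// Illustrative only: both loops compute the same sum over an N x N matrix,
// but the row-major loop uses every element of each fetched cache line,
// while the column-major loop is N elements apart between accesses and
// typically evicts a line before its neighbours are used.
#include <cstdio>
#include <vector>

int main() {
    const int N = 4096;  // assumed size, large enough to exceed the caches
    std::vector<float> a(static_cast<size_t>(N) * N, 1.0f);  // row-major storage

    double sumRow = 0.0, sumCol = 0.0;

    // Cache-friendly: innermost index walks consecutive addresses.
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            sumRow += a[static_cast<size_t>(i) * N + j];

    // Cache-unfriendly: each access is a full row away from the previous one.
    for (int j = 0; j < N; ++j)
        for (int i = 0; i < N; ++i)
            sumCol += a[static_cast<size_t>(i) * N + j];

    std::printf("sums: %.0f %.0f\n", sumRow, sumCol);
    return 0;
}
```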

Paging

  • An address space is divided into pages (typically 4 KB; 2 MB large pages are now also common)

  • Physical memory is divided into frames of the same size

  • When code or data on a page that is not yet resident is accessed, the entire page is fetched into main memory from disk (demand paging)

  • Due to relative addressing, the page can be placed in any frame in memory

  • Thus only the pages the code is actually using need to be in main memory

  • Efficient use of main memory

    • Support for multiprocessing

Summary: GPU Memory and Memory Management

Introduction to GPU Memory Architecture

  • GPU memory architecture is designed to maximize parallel processing efficiency. It includes multiple memory types with varying latencies and bandwidths to optimize performance for different tasks.

Key Memory Types in GPUs:

  1. Global Memory:

    • Global memory is the largest memory space on the GPU, accessible by all threads but with higher latency compared to other memory types.
    • It is used for storing large datasets and is essential for applications requiring significant data storage.
  2. Shared Memory:

    • Shared memory is a smaller, faster memory space accessible by all threads within a block.
    • It allows for efficient data sharing and communication between threads, significantly reducing the need for accessing slower global memory.
    • Shared memory is particularly useful for implementing data parallel algorithms, where multiple threads need to access the same data.
  3. Registers:

    • Registers are the fastest memory on the GPU, used to store variables for individual threads.
    • They offer low-latency access and are crucial for the execution of compute-intensive tasks.
    • However, the number of registers per thread is limited, and excessive use can lead to register spilling into slower memory.
  4. Constant Memory:

    • Constant memory is read-only memory that is cached and optimized for reading by multiple threads.
    • It is ideal for storing constant values that do not change during kernel execution, such as coefficients in mathematical computations.
  5. Texture Memory:

    • Texture memory is a specialized memory used for texture mapping in graphics rendering.
    • It is optimized for spatial locality and offers efficient data retrieval for 2D and 3D textures.
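
A minimal CUDA sketch of how these memory spaces are typically declared is given below; the kernel name, block size and coefficient array are illustrative assumptions, not taken from the notes above.

```cuda
#include <cuda_runtime.h>

// Read-only coefficients in constant memory: cached and efficient when all
// threads of a warp read the same element.
__constant__ float c_coeff[4];

__global__ void scale(const float* in, float* out, int n) {
    // Shared memory: one tile per thread block, visible to all of its threads.
    __shared__ float tile[256];

    // Ordinary local variables such as idx and x normally live in registers.
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;

    float x = in[idx];           // read from global memory
    tile[threadIdx.x] = x;       // stage in shared memory (illustrative here)
    __syncthreads();

    out[idx] = tile[threadIdx.x] * c_coeff[0];   // write back to global memory
}

int main() {
    const int n = 1 << 20;
    float *d_in = nullptr, *d_out = nullptr;

    // Global memory: allocated on the device, accessible by every thread.
    cudaMalloc(&d_in,  n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemset(d_in, 0, n * sizeof(float));

    float coeff[4] = {2.0f, 0.0f, 0.0f, 0.0f};
    cudaMemcpyToSymbol(c_coeff, coeff, sizeof(coeff));   // fill constant memory

    scale<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    cudaDeviceSynchronize();

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```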

Memory Access Patterns:

  • Efficient memory access patterns are critical for achieving high performance on GPUs. Poor access patterns can lead to memory bottlenecks, reducing the overall throughput.
  • Coalesced Memory Access:
    • Threads in a warp (a group of 32 threads) should access consecutive memory addresses to achieve coalesced memory access.
    • Coalesced access minimizes the number of memory transactions, reducing latency and increasing bandwidth utilization.
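
To make the contrast concrete, here is a sketch (kernel names, sizes and the stride of 32 are illustrative assumptions) of a coalesced copy versus a strided copy; per warp, the strided version generally needs far more memory transactions.

```cuda
#include <cuda_runtime.h>

// Coalesced: consecutive threads of a warp touch consecutive addresses, so
// the warp's 32 accesses combine into a small number of memory transactions.
__global__ void copyCoalesced(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: neighbouring threads are `stride` elements apart, so each thread
// falls into a different memory segment and the warp needs many transactions.
__global__ void copyStrided(const float* in, float* out, int n, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) out[i] = in[i];
}

int main() {
    const int n = 1 << 22;
    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in,  n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemset(d_in, 0, n * sizeof(float));

    copyCoalesced<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    copyStrided  <<<(n + 255) / 256, 256>>>(d_in, d_out, n, 32);
    cudaDeviceSynchronize();

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```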

Memory Hierarchies and Data Locality:

  • GPUs utilize a memory hierarchy to manage data locality and access speeds:
    • Registers > Shared Memory > Constant/Texture Memory > Global Memory.
  • Proper use of memory hierarchies can significantly enhance performance by ensuring that frequently accessed data resides in faster memory.

Memory Management Techniques:

  1. Memory Allocation:

    • Memory allocation on the GPU is managed through APIs such as CUDA and OpenCL.
    • Developers allocate memory using functions like cudaMalloc for CUDA or clCreateBuffer for OpenCL.
    • It is crucial to free allocated memory after use with cudaFree or clReleaseMemObject to prevent memory leaks (see the combined sketch after this list).
  2. Data Transfer:

    • Data transfer between the host (CPU) and the device (GPU) is a critical aspect of GPU programming.
    • Transfers are managed using functions like cudaMemcpy or clEnqueueWriteBuffer.
    • Minimizing data transfer and maximizing data reuse on the GPU can reduce transfer overhead and improve performance.
  3. Shared Memory Utilization:

    • Shared memory can be explicitly managed within kernels to store intermediate results or frequently accessed data.
    • Synchronization primitives like __syncthreads() ensure that all threads in a block have a consistent view of shared memory.
  4. Memory Alignment:

    • Proper memory alignment is essential to avoid inefficient memory access patterns.
    • Aligning data structures to match memory boundaries can lead to more efficient memory transactions.
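
The sketch below ties items 1-3 together using the CUDA runtime API: cudaMalloc/cudaMemcpy/cudaFree for allocation and transfer, and a kernel that stages data in shared memory and synchronizes with __syncthreads(). The block-sum reduction and the sizes are illustrative choices, not a prescribed implementation.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

// Each block stages its slice of the input in shared memory, then reduces it
// to a single partial sum. __syncthreads() separates the load, each reduction
// step, and the final write so every thread sees a consistent tile.
__global__ void blockSum(const float* in, float* partial, int n) {
    __shared__ float tile[256];

    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (gid < n) ? in[gid] : 0.0f;   // load from global memory
    __syncthreads();

    // Tree reduction entirely in shared memory.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) tile[threadIdx.x] += tile[threadIdx.x + s];
        __syncthreads();
    }

    if (threadIdx.x == 0) partial[blockIdx.x] = tile[0];   // one write per block
}

int main() {
    const int n = 1 << 20;
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    std::vector<float> h_in(n, 1.0f);        // host data: all ones, expected sum = n
    std::vector<float> h_partial(blocks);

    float *d_in = nullptr, *d_partial = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));              // allocate device (global) memory
    cudaMalloc(&d_partial, blocks * sizeof(float));

    cudaMemcpy(d_in, h_in.data(), n * sizeof(float),   // host -> device transfer
               cudaMemcpyHostToDevice);

    blockSum<<<blocks, threads>>>(d_in, d_partial, n);

    cudaMemcpy(h_partial.data(), d_partial,            // device -> host transfer
               blocks * sizeof(float), cudaMemcpyDeviceToHost);

    double total = 0.0;
    for (float p : h_partial) total += p;
    std::printf("sum = %.0f (expected %d)\n", total, n);

    cudaFree(d_in);                                    // free to avoid leaks
    cudaFree(d_partial);
    return 0;
}
```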

Optimization Strategies:

  1. Avoiding Bank Conflicts:

    • Shared memory is divided into banks; when multiple threads of a warp access different addresses that fall into the same bank, the accesses are serialized (a bank conflict).
    • To avoid bank conflicts, spread the accesses of a warp across different banks, for example by padding shared-memory arrays (see the padded-tile sketch after this list).
  2. Using Constant Memory:

    • Utilize constant memory for read-only data that is accessed frequently by multiple threads.
    • This reduces global memory access and leverages the caching mechanism of constant memory.
  3. Prefetching Data:

    • Prefetching involves loading data into shared memory before it is needed by threads, reducing the wait time for memory access.
    • This technique can be implemented using double-buffering strategies, where one buffer is being processed while the other is being loaded with new data (see the double-buffering skeleton after this list).
  4. Leveraging Texture Memory:

    • For applications involving spatial data, such as image processing, use texture memory to take advantage of its spatial locality optimizations.
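
For item 1, a common remedy is to pad shared-memory tiles so that column-wise accesses spread across banks. The transpose sketch below (tile size and names are illustrative assumptions) declares the tile as [TILE][TILE + 1]; without the extra column, the column reads in the second phase would all map to one bank.

```cuda
#include <cuda_runtime.h>

#define TILE 32

// Transpose via a shared-memory tile. The "+1" padding shifts each row of the
// tile to a different bank, so reading the tile column-wise does not serialize.
__global__ void transposeTile(const float* in, float* out, int width, int height) {
    __shared__ float tile[TILE][TILE + 1];   // padded to avoid bank conflicts

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];   // coalesced load
    __syncthreads();

    // Write the transposed tile; block indices are swapped so the store is coalesced too.
    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < height && y < width)
        out[y * height + x] = tile[threadIdx.x][threadIdx.y];
}

int main() {
    const int width = 1024, height = 1024;
    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in,  width * height * sizeof(float));
    cudaMalloc(&d_out, width * height * sizeof(float));
    cudaMemset(d_in, 0, width * height * sizeof(float));

    dim3 block(TILE, TILE);
    dim3 grid(width / TILE, height / TILE);
    transposeTile<<<grid, block>>>(d_in, d_out, width, height);
    cudaDeviceSynchronize();

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```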
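
For item 3, the skeleton below sketches the double-buffering idea: while a block works on the tile held in one shared-memory buffer, the loads for the next tile are issued into the other. The kernel name, tile size and the trivial "compute" step are placeholders for illustration only.

```cuda
#include <cuda_runtime.h>

#define TILE 256

// Illustrative double-buffering skeleton: at iteration t the block computes on
// buf[cur] (tile t) while prefetching tile t+1 into buf[1 - cur], so waiting on
// global memory overlaps (partially) with computation.
__global__ void doubleBufferedSum(const float* in, float* out, int numTiles) {
    __shared__ float buf[2][TILE];

    int cur = 0;
    buf[cur][threadIdx.x] = in[threadIdx.x];   // preload tile 0
    __syncthreads();

    float acc = 0.0f;
    for (int t = 0; t < numTiles; ++t) {
        int nxt = 1 - cur;
        if (t + 1 < numTiles)                  // prefetch the next tile
            buf[nxt][threadIdx.x] = in[(t + 1) * TILE + threadIdx.x];

        acc += buf[cur][threadIdx.x];          // "compute" on the current tile
        __syncthreads();                       // next tile loaded, current tile done
        cur = nxt;
    }

    out[threadIdx.x] = acc;                    // per-thread partial result
}

int main() {
    const int numTiles = 64;
    const int n = numTiles * TILE;
    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in,  n * sizeof(float));
    cudaMalloc(&d_out, TILE * sizeof(float));
    cudaMemset(d_in, 0, n * sizeof(float));

    doubleBufferedSum<<<1, TILE>>>(d_in, d_out, numTiles);
    cudaDeviceSynchronize();

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```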

Conclusion:

  • Understanding and effectively utilizing the various types of GPU memory is crucial for maximizing the performance of parallel applications.
  • Proper memory management, efficient access patterns, and optimization strategies can significantly reduce memory bottlenecks and enhance computational throughput.
  • By leveraging the full potential of the GPU's memory hierarchy, developers can achieve substantial performance gains in their applications.