Thread model
Estimated time to read: 2 minutes
Logical
Thread
- Smallest unit of instruction execution.
- Executes kernel code once.
Warp
- A thread doesn't exist on its own; it exists in a group of 32 threads called a warp.
- All threads inside a warp execute the same instruction.
- Warp divergence happens when threads in a warp take different branches: the warp runs each branch path serially, stalling some threads while others run, which costs time and efficiency.
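A minimal sketch of a kernel that triggers warp divergence (the kernel name and branch bodies are made up for illustration). Branching on the lane index splits every warp into two serialized paths:

```cuda
__global__ void divergent(int *out) {
    int tid = threadIdx.x;
    // Even and odd lanes of the same warp take different branches,
    // so the hardware executes the two paths one after the other.
    if (tid % 2 == 0) {
        out[tid] = tid * 2;   // even lanes run while odd lanes are masked off
    } else {
        out[tid] = tid + 1;   // odd lanes run while even lanes are masked off
    }
}
```

Branching on something uniform per warp (e.g. `blockIdx.x`) avoids this, because every thread in the warp takes the same path.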
Block
- A 1D, 2D, or 3D group of threads.
- Threads in a block share shared memory with each other.
- We can use threadIdx (.x, .y, .z) to identify a thread inside a block and blockDim (.x, .y, .z) to find the block dimensions.
Grid
- A 1D, 2D, or 3D group of blocks that executes together.
- Blocks cannot interact with each other, and they can be scheduled on any available SM.
- We can use blockIdx (.x, .y, .z) to identify a block inside a grid and gridDim (.x, .y, .z) to find the grid dimensions.
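Putting the block and grid variables together, the standard pattern for computing a thread's global index (kernel name is hypothetical):

```cuda
__global__ void addOne(float *data, int n) {
    // Which block we are in, times the block size,
    // plus our position inside the block.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)            // guard: the last block may be partially full
        data[i] += 1.0f;
}

// Launch: enough blocks of 256 threads to cover n elements, e.g.
// addOne<<<(n + 255) / 256, 256>>>(d_data, n);
```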

Physical
CUDA cores
- Individual processing unit.
- Each core has a compute unit and registers.
Streaming multiprocessor
- Each SM has a group of CUDA cores, a unified data cache (shared memory + L1 cache), and a register file.
- There are multiple SMs per graphics processing cluster (GPC).
- The SM schedules warps onto CUDA cores.
GPU
- It has a group of GPCs.
- It has a memory controller and GPU DRAM.
Memory
Registers
- Each thread has its own private registers that we can influence through code.
- Each SM has a limited register file, so excess values spill to local memory in global memory.
Local memory
- Memory private to a thread but stored in global memory.
- Slow compared to registers or shared memory.
L1 cache
- Automatically caches global memory reads and writes.
- It is per SM and sits between shared memory/registers and global memory.
- You can't directly control it without changing compiler or caching flags.
- It is used to speed up repeated accesses.
Shared memory
- Memory shared by threads in a block.
- Can be declared using __shared__; it can only be accessed by threads in the same block.
- Very fast (like L1 cache).
- We can control its size, usage pattern, and synchronization.
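A small sketch of those three controls: a fixed-size __shared__ array, a staged usage pattern (load, then read), and __syncthreads() for synchronization. The kernel reverses each block's slice of the data and assumes it is launched with exactly 256 threads per block:

```cuda
__global__ void reverseBlock(float *data) {
    __shared__ float tile[256];        // size we chose: one slot per thread
    int t = threadIdx.x;
    int base = blockIdx.x * blockDim.x;

    tile[t] = data[base + t];          // stage the block's slice into shared memory
    __syncthreads();                   // wait until every thread has written its slot

    data[base + t] = tile[blockDim.x - 1 - t];  // read back in reversed order
}
```

Without the __syncthreads(), a thread could read a slot of `tile` before the owning thread has written it.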
Global memory
- DRAM visible to all threads in all blocks; allocated with cudaMalloc and accessed through kernel pointers.
- Slow. Coalesce accesses to improve speed.
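A sketch of the cudaMalloc flow plus a coalesced access pattern (names like `d_x`/`h_x` and `scale` are placeholders). Consecutive threads touch consecutive addresses, so a warp's 32 loads merge into a few wide memory transactions:

```cuda
__global__ void scale(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        x[i] *= 2.0f;   // coalesced: thread i reads/writes element i
}

// Host side:
// float *d_x;
// cudaMalloc(&d_x, n * sizeof(float));                            // global memory
// cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);
// scale<<<(n + 255) / 256, 256>>>(d_x, n);
```

A strided pattern like `x[i * 32]` would instead scatter the warp's accesses across many transactions and run far slower.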
Constant memory
- Can be declared with __constant__.
- Read-only memory on the GPU, cached per SM.
- Very fast if all threads in a warp read the same value.
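A sketch of the broadcast case: every thread in a warp reads the same `coeffs[j]`, so the constant cache serves it in one go (the polynomial-evaluation kernel is made up for illustration):

```cuda
__constant__ float coeffs[4];   // read-only on the device, cached per SM

__global__ void poly(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = x[i];
        // All threads read the same coeffs[j] at each step -> broadcast hit.
        x[i] = coeffs[0] + v * (coeffs[1] + v * (coeffs[2] + v * coeffs[3]));
    }
}

// Host side: fill constant memory before launching, e.g.
// cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(coeffs));
```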

Operations
Functions
__device__
- Runs on the GPU. Called from the GPU; any return type.
__host__
- Runs on the CPU. Called from the CPU; any return type.
__global__
- Runs on the GPU. Called from the CPU; return type must be void.
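The three qualifiers side by side in one sketch (function names are hypothetical):

```cuda
__device__ float square(float x) {     // GPU-only helper, callable from kernels
    return x * x;
}

__host__ float squareOnCpu(float x) {  // ordinary CPU function
    return x * x;
}

__global__ void squareAll(float *data, int n) {  // kernel: launched by the CPU, runs on the GPU
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = square(data[i]);     // calling the __device__ helper from GPU code
}
```

A function can also be marked `__host__ __device__` to compile one body for both sides.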