Thread model
Estimated time to read: 2 minutes
Logical
Thread
- Smallest unit of instruction execution.
- Executes kernel code once.
Warp
- A thread doesn't exist on its own; it exists in a group of 32 threads called a warp.
- All threads inside a warp execute the same instruction.
- Warp divergence happens when threads in a warp take different branches: the warp runs each branch path serially, stalling some threads while others run, which costs time and efficiency.
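A minimal sketch of a kernel that triggers warp divergence (the kernel name and branch bodies are made up for illustration). Branching on the lane index splits every warp into two serialized paths:

```cuda
__global__ void divergent(int *out) {
    int tid = threadIdx.x;
    // Even and odd lanes of the same warp take different branches,
    // so the hardware executes the two paths one after the other.
    if (tid % 2 == 0) {
        out[tid] = tid * 2;   // even lanes run while odd lanes are masked off
    } else {
        out[tid] = tid + 1;   // odd lanes run while even lanes are masked off
    }
}
```

Branching on something uniform per warp (e.g. `blockIdx.x`) avoids this, because every thread in the warp takes the same path.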
Block
- A 1D, 2D, or 3D group of threads.
- Threads in a block share shared memory with each other.
- We can use threadIdx (.x, .y, .z) to identify a thread inside a block and blockDim (.x, .y, .z) to find the block dimensions.
Grid
- A 1D, 2D, or 3D group of blocks that executes together.
- Blocks cannot interact with each other, and they can be scheduled on any available SM.
- We can use blockIdx (.x, .y, .z) to identify a block inside a grid and gridDim (.x, .y, .z) to find the grid dimensions.
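Putting the block and grid variables together, the standard pattern for computing a thread's global index (kernel name is hypothetical):

```cuda
__global__ void addOne(float *data, int n) {
    // Which block we are in, times the block size,
    // plus our position inside the block.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)            // guard: the last block may be partially full
        data[i] += 1.0f;
}

// Launch: enough blocks of 256 threads to cover n elements, e.g.
// addOne<<<(n + 255) / 256, 256>>>(d_data, n);
```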

Physical
CUDA cores
- Individual processing unit.
- Each core has a compute unit and registers.
Streaming multiprocessor
- Each SM has a group of CUDA cores, a unified data cache (shared memory + L1 cache), and a register file.
- There are multiple SMs per graphics processing cluster (GPC).
- The SM schedules warps onto CUDA cores.
GPU
- It has a group of GPCs.
- It has a memory controller and GPU DRAM.
Memory
Registers
- Each thread has its own private registers that we can influence through code.
- Each SM has a limited register file, so excess values spill to local memory in global memory.
Local memory
- Memory private to a thread but stored in global memory.
- Slow compared to registers or shared memory.
L1 cache
- Automatically caches global memory reads and writes.
- It is per SM and sits between shared memory/registers and global memory.
- You can't directly control it without changing compiler or caching flags.
- It is used to speed up repeated accesses.
Shared memory
- Memory shared by threads in a block.
- Can be declared using __shared__; it can only be accessed by threads in the same block.
- Very fast (like L1 cache).
- We can control its size, usage pattern, and synchronization.
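A small sketch of those three controls: a fixed-size __shared__ array, a staged usage pattern (load, then read), and __syncthreads() for synchronization. The kernel reverses each block's slice of the data and assumes it is launched with exactly 256 threads per block:

```cuda
__global__ void reverseBlock(float *data) {
    __shared__ float tile[256];        // size we chose: one slot per thread
    int t = threadIdx.x;
    int base = blockIdx.x * blockDim.x;

    tile[t] = data[base + t];          // stage the block's slice into shared memory
    __syncthreads();                   // wait until every thread has written its slot

    data[base + t] = tile[blockDim.x - 1 - t];  // read back in reversed order
}
```

Without the __syncthreads(), a thread could read a slot of `tile` before the owning thread has written it.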
Global memory
- DRAM visible to all threads in all blocks; allocated with cudaMalloc and accessed through kernel pointers.
- Slow. Coalesce accesses to improve speed.
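A sketch of the cudaMalloc flow plus a coalesced access pattern (names like `d_x`/`h_x` and `scale` are placeholders). Consecutive threads touch consecutive addresses, so a warp's 32 loads merge into a few wide memory transactions:

```cuda
__global__ void scale(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        x[i] *= 2.0f;   // coalesced: thread i reads/writes element i
}

// Host side:
// float *d_x;
// cudaMalloc(&d_x, n * sizeof(float));                            // global memory
// cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);
// scale<<<(n + 255) / 256, 256>>>(d_x, n);
```

A strided pattern like `x[i * 32]` would instead scatter the warp's accesses across many transactions and run far slower.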
Constant memory
- Can be declared with __constant__.
- Read-only memory on the GPU, cached per SM.
- Very fast if all threads in a warp read the same value.
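A sketch of the broadcast case: every thread in a warp reads the same `coeffs[j]`, so the constant cache serves it in one go (the polynomial-evaluation kernel is made up for illustration):

```cuda
__constant__ float coeffs[4];   // read-only on the device, cached per SM

__global__ void poly(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = x[i];
        // All threads read the same coeffs[j] at each step -> broadcast hit.
        x[i] = coeffs[0] + v * (coeffs[1] + v * (coeffs[2] + v * coeffs[3]));
    }
}

// Host side: fill constant memory before launching, e.g.
// cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(coeffs));
```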

Operations
Functions
__device__
- Runs on the GPU. Called from the GPU; any return type.
__host__
- Runs on the CPU. Called from the CPU; any return type.
__global__
- Runs on the GPU. Called from the CPU; return type must be void.
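The three qualifiers side by side in one sketch (function names are hypothetical):

```cuda
__device__ float square(float x) {     // GPU-only helper, callable from kernels
    return x * x;
}

__host__ float squareOnCpu(float x) {  // ordinary CPU function
    return x * x;
}

__global__ void squareAll(float *data, int n) {  // kernel: launched by the CPU, runs on the GPU
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = square(data[i]);     // calling the __device__ helper from GPU code
}
```

A function can also be marked `__host__ __device__` to compile one body for both sides.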