🔍 The grouped GEMM interface lets you execute a list of independent matrix multiplications in a single kernel launch, which cuts per-launch overhead and can improve GPU utilization.
Note: The exact API entry point can vary by CUDA version. In recent versions (CUDA 11+), specific grouped APIs are exposed to handle the array of descriptors efficiently.
Unlike standard batched GEMMs, each operation in a group can have unique dimensions.
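To make the "unique dimensions" point concrete, here is a minimal CPU-side sketch in plain C++ of what a grouped GEMM computes. It is not the cuBLAS or cuBLASLt API: the `GemmProblem` struct and the `grouped_gemm_reference` function are hypothetical names chosen for illustration, and the sketch assumes row-major, densely packed float matrices. A GPU implementation would do the same work inside one kernel launch instead of a host loop.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical descriptor for one problem in the group:
//   C = alpha * A * B + beta * C
// Assumption of this sketch: row-major, densely packed matrices (no padding).
struct GemmProblem {
    std::size_t m, n, k;   // A is m x k, B is k x n, C is m x n
    float alpha, beta;
    const float* A;
    const float* B;
    float* C;
};

// CPU reference for the grouped semantics: every problem carries its own
// m, n, k, so the shapes may differ from one entry to the next.
void grouped_gemm_reference(const std::vector<GemmProblem>& group) {
    for (const GemmProblem& p : group) {
        for (std::size_t i = 0; i < p.m; ++i) {
            for (std::size_t j = 0; j < p.n; ++j) {
                float acc = 0.0f;
                for (std::size_t l = 0; l < p.k; ++l) {
                    acc += p.A[i * p.k + l] * p.B[l * p.n + j];
                }
                float& c = p.C[i * p.n + j];
                c = p.alpha * acc + p.beta * c;
            }
        }
    }
}
```

A uniform batched GEMM would need a single (m, n, k) shared by every entry; here a 2x3 by 3x2 product and a 1x1 by 1x4 product can sit in the same group. A reference like this is also handy for checking GPU results on small inputs.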
💡 Use a cublasLtMatmulPreference_t (set CUBLASLT_MATMUL_PREF_MAX_WORKSPACE_BYTES via cublasLtMatmulPreferenceSetAttribute) to cap the workspace, then call cublasLtMatmulAlgoGetHeuristic to select an algorithm; the grouped version reuses plans across problems for maximum speed.