101 - Cuda - Nvidia-smi

In our daily work, the heavy computations lean on the GPU far more than the CPU, so how do we monitor its usage?

In this doc, we will cover the main points and the supporting details behind them; for the complete picture, please refer to the primary source of knowledge, the official documentation for nvidia-smi.

What is NVSMI?

NVSMI, also known as nvidia-smi, is a command-line utility provided by NVIDIA for monitoring and managing its GPUs.

Where to get it?

It is installed along with the CUDA toolkit.

How does it look?

Let's run it in our terminal; the output should look something like this:

[Screenshot: running nvidia-smi on my local machine]

From the output, we can read the following GPU attributes (a query sketch follows the list):

  • The NVIDIA driver version.

  • The highest CUDA version supported by the installed driver.

  • The NVIDIA GPU model.

  • The processes currently using the GPU (if any).

  • The memory usage.
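If we want these header attributes in a machine-readable form instead of parsing the banner, nvidia-smi can emit them as CSV. A minimal sketch; the exact field names supported by your driver are listed by `nvidia-smi --help-query-gpu`:

```bash
# Query a few of the banner attributes as CSV instead of the full table view.
# driver_version, name, memory.total and memory.used are standard query fields.
nvidia-smi --query-gpu=driver_version,name,memory.total,memory.used --format=csv
```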

Output Description

The generated output consists of two tables:

  1. The GPU state table.

  2. The processes table.

GPU’s State Table

Please refer to the docs to get all the info, options, and capabilities of that simple command, nvidia-smi.
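If you just want the state table to refresh in place while a job runs, nvidia-smi can redraw itself on an interval. A small sketch; `-l`/`--loop` is a standard nvidia-smi option, and `watch` is a common alternative on Linux:

```bash
# Redraw the full nvidia-smi state table every 2 seconds (Ctrl+C to stop).
nvidia-smi -l 2

# Equivalent idea using watch, if you prefer it:
watch -n 2 nvidia-smi
```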

Processes Table

These are the processes using the GPUs; this table lists all the processes holding a compute or graphics context on the device.

  • Compute processes are reported on all fully supported products.

  • Reporting of graphics processes is limited to the supported products starting with the Kepler architecture.

For each entry, there are a few fields we should be aware of (a query sketch follows the table):

| Parameter | Description | Value |
| --- | --- | --- |
| GPU | Represents the NVML index of the device. | |
| PID | Represents the process ID corresponding to the active compute or graphics context. | |
| Type | Process type, which could be compute, graphics, or both. | 'C': Compute, 'G': Graphics, 'C+G': Compute and Graphics |
| Process name | Represents the process name for the compute or graphics process. | |
| GPU Memory Usage | Amount of memory used on the device by the context. | |
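The same per-process information can be pulled out as CSV. A minimal sketch, assuming a compute workload is currently running; the field names here are from memory, so confirm them with `nvidia-smi --help-query-compute-apps` on your driver:

```bash
# List compute processes with their PID, executable name and GPU memory usage.
# Graphics-only processes will not show up in this particular query.
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
```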

Driver models

Microsoft and NVIDIA offer two driver models for Windows:

  1. The Windows Display Driver Model (WDDM): On workstations and laptops, this is usually the default mode. This driver mode allows shared usage of the NVIDIA GPU for display output and GPGPU computing.

  2. Tesla Compute Cluster (TCC): This driver mode uses the NVIDIA GPU for GPGPU computing exclusively.

| Feature | WDDM (Windows Display Driver Model) | TCC (Tesla Compute Cluster) |
| --- | --- | --- |
| Purpose | Graphics rendering and display management on Windows. | High-performance computing (HPC) and GPU compute tasks without display functions. |
| Operating Mode | Primarily for graphics and display. | Dedicated compute mode without display support. |
| GPU Compatibility | Consumer-grade NVIDIA GPUs (GeForce series, etc.). | Typically available on certain NVIDIA Tesla and Quadro GPUs. |
| Display Output | Supports display functions. | Does not support display output in TCC mode. |
| Use Cases | General-purpose graphics rendering, gaming, multimedia, etc. | High-performance computing, scientific simulations, data processing, parallel computing tasks. |
| Driver Type | WDDM drivers are used. | TCC drivers are used in TCC mode. |
| System Integration | Integrated with the Windows operating system. | Typically used in server environments for compute-intensive tasks. |
| GPU Management | Provides a balance between graphics and compute tasks. | Optimized for compute performance with minimal emphasis on graphics. |
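On Windows, nvidia-smi can also report and, on supported GPUs, switch the driver model. A rough sketch, assuming the `-dm` (driver model) switch and the `driver_model.*` query fields behave as described by `nvidia-smi --help` on your system; switching requires administrator rights and typically a reboot:

```bash
# Show the current and pending driver model for each GPU (Windows only).
nvidia-smi --query-gpu=index,name,driver_model.current,driver_model.pending --format=csv

# Request TCC mode for GPU 0 (0 = WDDM, 1 = TCC); takes effect after a reboot.
nvidia-smi -i 0 -dm 1
```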

Control Your GPUs

A great starting point for the nvidia-smi command line is the tutorial by Eliot Eshelman, Microway-Nvidia-smi-Control_Your_GPUs.

Let's write down the commands that look most important, for our reference (a small logging sketch follows the table):

| Objective | Details | Command |
| --- | --- | --- |
| Query GPU devices | Get all available NVIDIA devices. | `nvidia-smi -L` |
| Query GPU device details | Get details about each GPU. | `nvidia-smi --query-gpu=index,name,uuid,serial --format=csv` |
| Monitor overall GPU usage with 1-second update intervals | | `nvidia-smi dmon` |
| Monitor per-process GPU usage with 1-second update intervals | | `nvidia-smi pmon` |
| Query GPU performance | Review the current state of each GPU and any reasons for clock slowdowns, using the PERFORMANCE flag. | `nvidia-smi -q -d PERFORMANCE` |
| Query GPU clocks | Review the current GPU clock speed, default clock speed, and maximum possible clock speed. | `nvidia-smi -q -d CLOCK` |
| Query GPU supported clocks | | `nvidia-smi -q -d SUPPORTED_CLOCKS` |
| Print all GPU details | List all available data on a particular GPU; specify the ID of the card with `-i`. | `nvidia-smi -i 0 -q` |
| Print specific GPU details | Excerpt of selected data sections for a particular GPU. | `nvidia-smi -i 0 -q -d MEMORY,UTILIZATION,POWER,CLOCK,COMPUTE` |
| Query NVIDIA Link "NVLink" status | Get info about the NVLink status. | `nvidia-smi nvlink --status` |
| Query NVIDIA Link "NVLink" capabilities | Get info about the NVLink capabilities. | `nvidia-smi nvlink --capabilities` |
| Inquire about the system/GPU topology | | `nvidia-smi topo --matrix` |
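These queries combine nicely with nvidia-smi's loop mode for lightweight logging. A minimal sketch; `gpu_usage.csv` is just a hypothetical output path, and `-l 1` repeats the query every second:

```bash
# Write a timestamped utilization/memory sample for every GPU, once per second,
# to gpu_usage.csv. Stop with Ctrl+C; the CSV loads easily into pandas or a spreadsheet.
nvidia-smi --query-gpu=timestamp,index,utilization.gpu,utilization.memory,memory.used,memory.total \
           --format=csv -l 1 > gpu_usage.csv
```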

Note

  • A bandwidth note from "Making machine learning models faster" (horace-brr_intro), found here:

    This cost of moving stuff to and from our compute units is what's called the "memory bandwidth" cost. As an aside, your GPU's DRAM is what shows up in nvidia-smi, and is the primary quantity responsible for your lovely "CUDA Out of Memory" errors.


References

  • NVIDIA docs for nvidia-smi can be found here.

  • A quick guide from Medium: Explained Output of Nvidia-smi Utility.

  • Steps to help control your GPU monitoring from Microway, nvidia-smi: Control Your GPUs, can be found here.

  • NVIDIA Management Library (NVML) docs can be found here.

  • Python bindings to the NVIDIA Management Library can be found here.