101 - Cuda - Nvidia-smi

In our daily work, the heavy computations lean on the GPU far more than the CPU, so how do we monitor its usage?

In this doc, we will cover the main points and the supporting details behind them; for the complete picture, please refer to the primary source of knowledge, the official documentation for nvidia-smi.

What is NVSMI?

NVSMI, also known as nvidia-smi, is a command-line utility provided by NVIDIA for monitoring and managing its GPUs.

Where to get it?

It is installed along with the CUDA toolkit.

How does it look?

Let's run it in our terminal; the output should look something like this:

[Screenshot: running nvidia-smi on my local machine]

From the output, we can read the following GPU attributes (a query sketch follows the list):

  • The NVIDIA driver version.

  • The highest CUDA version supported by the installed driver.

  • The NVIDIA GPU model.

  • The processes currently using the GPU (if any).

  • The memory usage.
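If we want these header attributes in a machine-readable form instead of parsing the banner, nvidia-smi can emit them as CSV. A minimal sketch; the exact field names supported by your driver are listed by `nvidia-smi --help-query-gpu`:

```bash
# Query a few of the banner attributes as CSV instead of the full table view.
# driver_version, name, memory.total and memory.used are standard query fields.
nvidia-smi --query-gpu=driver_version,name,memory.total,memory.used --format=csv
```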

Output Description

The generated output consists of two tables:

  1. The GPU state table.

  2. The processes table.

GPU’s State Table

Please refer to the docs to get all the info, options, and capabilities of that simple command, nvidia-smi.
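If you just want the state table to refresh in place while a job runs, nvidia-smi can redraw itself on an interval. A small sketch; `-l`/`--loop` is a standard nvidia-smi option, and `watch` is a common alternative on Linux:

```bash
# Redraw the full nvidia-smi state table every 2 seconds (Ctrl+C to stop).
nvidia-smi -l 2

# Equivalent idea using watch, if you prefer it:
watch -n 2 nvidia-smi
```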

Processes Table

These are the processes using the GPUs; this table lists all the processes holding a compute or graphics context on the device.

  • Compute processes are reported on all fully supported products.

  • Reporting of graphics processes is limited to the supported products starting with the Kepler architecture.

For each entry, there are a few fields we should be aware of (a query sketch follows the table):

| Parameter | Description | Value |
| --- | --- | --- |
| GPU | Represents the NVML index of the device. | |
| PID | Represents the process ID corresponding to the active compute or graphics context. | |
| Type | Process type, which could be compute, graphics, or both. | 'C': Compute, 'G': Graphics, 'C+G': Compute and Graphics |
| Process name | Represents the process name for the compute or graphics process. | |
| GPU Memory Usage | Amount of memory used on the device by the context. | |
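The same per-process information can be pulled out as CSV. A minimal sketch, assuming a compute workload is currently running; the field names here are from memory, so confirm them with `nvidia-smi --help-query-compute-apps` on your driver:

```bash
# List compute processes with their PID, executable name and GPU memory usage.
# Graphics-only processes will not show up in this particular query.
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
```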

Driver models

Microsoft and NVIDIA offer two driver models for Windows:

  1. The Windows Display Driver Model (WDDM): On workstations and laptops, this is usually the default mode. This driver mode allows shared usage of the NVIDIA GPU for display output and GPGPU computing.

  2. Tesla Compute Cluster (TCC): This driver mode uses the NVIDIA GPU for GPGPU computing exclusively.

| Feature | WDDM (Windows Display Driver Model) | TCC (Tesla Compute Cluster) |
| --- | --- | --- |
| Purpose | Graphics rendering and display management on Windows. | High-performance computing (HPC) and GPU compute tasks without display functions. |
| Operating Mode | Primarily for graphics and display. | Dedicated compute mode without display support. |
| GPU Compatibility | Consumer-grade NVIDIA GPUs (GeForce series, etc.). | Typically available on certain NVIDIA Tesla and Quadro GPUs. |
| Display Output | Supports display functions. | Does not support display output in TCC mode. |
| Use Cases | General-purpose graphics rendering, gaming, multimedia, etc. | High-performance computing, scientific simulations, data processing, parallel computing tasks. |
| Driver Type | WDDM drivers are used. | TCC drivers are used in TCC mode. |
| System Integration | Integrated with the Windows operating system. | Typically used in server environments for compute-intensive tasks. |
| GPU Management | Provides a balance between graphics and compute tasks. | Optimized for compute performance with minimal emphasis on graphics. |
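On Windows, nvidia-smi can also report and, on supported GPUs, switch the driver model. A rough sketch, assuming the `-dm` (driver model) switch and the `driver_model.*` query fields behave as described by `nvidia-smi --help` on your system; switching requires administrator rights and typically a reboot:

```bash
# Show the current and pending driver model for each GPU (Windows only).
nvidia-smi --query-gpu=index,name,driver_model.current,driver_model.pending --format=csv

# Request TCC mode for GPU 0 (0 = WDDM, 1 = TCC); takes effect after a reboot.
nvidia-smi -i 0 -dm 1
```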

Control Your GPUs

A great starting point for the nvidia-smi command line is the tutorial by Eliot Eshelman, Microway-Nvidia-smi-Control_Your_GPUs.

Let's write down the commands that look most important, for our reference (a small logging sketch follows the table):

| Objective | Details | Command |
| --- | --- | --- |
| Query GPU devices | Get all available NVIDIA devices. | `nvidia-smi -L` |
| Query GPU device details | Get details about each GPU. | `nvidia-smi --query-gpu=index,name,uuid,serial --format=csv` |
| Monitor overall GPU usage with 1-second update intervals | | `nvidia-smi dmon` |
| Monitor per-process GPU usage with 1-second update intervals | | `nvidia-smi pmon` |
| Query GPU performance | Review the current state of each GPU and any reasons for clock slowdowns, using the PERFORMANCE flag. | `nvidia-smi -q -d PERFORMANCE` |
| Query GPU clocks | Review the current GPU clock speed, default clock speed, and maximum possible clock speed. | `nvidia-smi -q -d CLOCK` |
| Query GPU supported clocks | | `nvidia-smi -q -d SUPPORTED_CLOCKS` |
| Print all GPU details | List all available data on a particular GPU; specify the ID of the card with `-i`. | `nvidia-smi -i 0 -q` |
| Print specific GPU details | Excerpt of selected data sections for a particular GPU. | `nvidia-smi -i 0 -q -d MEMORY,UTILIZATION,POWER,CLOCK,COMPUTE` |
| Query NVIDIA Link "NVLink" status | Get info about the NVLink status. | `nvidia-smi nvlink --status` |
| Query NVIDIA Link "NVLink" capabilities | Get info about the NVLink capabilities. | `nvidia-smi nvlink --capabilities` |
| Inquire about the system/GPU topology | | `nvidia-smi topo --matrix` |
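These queries combine nicely with nvidia-smi's loop mode for lightweight logging. A minimal sketch; `gpu_usage.csv` is just a hypothetical output path, and `-l 1` repeats the query every second:

```bash
# Write a timestamped utilization/memory sample for every GPU, once per second,
# to gpu_usage.csv. Stop with Ctrl+C; the CSV loads easily into pandas or a spreadsheet.
nvidia-smi --query-gpu=timestamp,index,utilization.gpu,utilization.memory,memory.used,memory.total \
           --format=csv -l 1 > gpu_usage.csv
```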

Note

  • A bandwidth note from "Making machine learning models faster" (horace-brr_intro), found here:

    This cost of moving stuff to and from our compute units is what's called the "memory bandwidth" cost. As an aside, your GPU's DRAM is what shows up in nvidia-smi, and is the primary quantity responsible for your lovely "CUDA Out of Memory" errors.


References

  • NVIDIA docs for nvidia-smi can be found here.

  • A quick guide from Medium: Explained Output of Nvidia-smi Utility.

  • Steps to help control your GPU monitoring from Microway, nvidia-smi: Control Your GPUs, can be found here.

  • NVIDIA Management Library (NVML) docs can be found here.

  • Python bindings to the NVIDIA Management Library can be found here.