
Retrieving the temperature of an NVIDIA GPU from a GTX 1080 on Linux, without nouveau or nvidia_drm

Posted at — Mar 3, 2023

I have a number of NVIDIA GPUs (~25,000) passed through to VMs, and I need to monitor their temperature from the hypervisor (qemu). There's no nvidia-smi available and no nouveau or nvidia_drm driver loaded; only vfio-pci and qemu are in play here.
 
So, I dug into it to retrieve their temperatures from the hypervisor!
 
All tests will be conducted on a GTX 1080 (GP104). This technique works on the Pascal and Ampere families. The same logic could possibly be applied to other families; however, I’m unable to test it.

How do we start?

The envytools website provides a starting point for research and offers some leads, including the ranges of MMIO (memory-mapped I/O) registers as well as the description of the PCI BARs (Base Address Registers). In particular, this includes the PTHERM register range.
The other lead is the kernel itself, in the hope that the nouveau driver can retrieve the GPU temperature (spoiler: yes). It's just a matter of digging into the code to understand what is happening. In any case, a more specific idea emerges: reading from the GPU's memory, somewhere.

Let’s get some reading

Snippet of code to retrieve the GPU temperature:
source: linux/drivers/gpu/drm/nouveau/nvkm/subdev/therm/gp100.c

static int
gp100_temp_get(struct nvkm_therm *therm)
{
        struct nvkm_device *device = therm->subdev.device;
        struct nvkm_subdev *subdev = &therm->subdev;
        u32 tsensor = nvkm_rd32(device, 0x020460);
        u32 inttemp = (tsensor & 0x0001fff8);

        /* device SHADOWed */
        if (tsensor & 0x40000000)
                nvkm_trace(subdev, "reading temperature from SHADOWed sensor\n");

        /* device valid */
        if (tsensor & 0x20000000)
                return (inttemp >> 8);
        else
                return -ENODEV;
} 

source: linux/drivers/gpu/drm/nouveau/include/nvkm/core/device.h

// Structure reduced for readability
struct nvkm_device {
    [...]
    struct device *dev;
    const char *name;
    [...]
    void __iomem *pri;
    [...]
    struct list_head subdev;
    [...]
};
[...]

#define nvkm_rd32(d,a) ioread32_native((d)->pri + (a))

The sensor value is retrieved using the nvkm_rd32 macro, which reads a 32-bit register at the offset device->pri + 0x020460.
Masks are then applied to the raw value to extract just the temperature. All of this needs to be kept in mind.
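
To make the decoding concrete, here is a minimal standalone sketch of the same mask-and-shift logic, applied to a sample raw value (0x20001ca0 is the reading we will get from the qemu monitor later in this post):

/* Worked example of the gp100_temp_get() decoding, applied to a
 * sample raw PTHERM value (0x20001ca0, seen later in this post). */
#include <stdint.h>
#include <stdio.h>

int main(void) {
  uint32_t tsensor = 0x20001ca0;
  uint32_t inttemp = tsensor & 0x0001fff8; /* temperature bits */

  if (tsensor & 0x40000000)                /* SHADOW bit */
    printf("shadowed sensor\n");
  if (tsensor & 0x20000000)                /* "valid" bit */
    printf("temp %u\n", inttemp >> 8);     /* prints: temp 28 */
  return 0;
}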
 
source: linux/drivers/gpu/drm/nouveau/nvkm/engine/device/base.c

    [...]
    mmio_base = device->func->resource_addr(device, 0);
    mmio_size = device->func->resource_size(device, 0);
    [...]
    device->pri = ioremap(mmio_base, mmio_size);    

ioremap maps the MMIO range into virtual memory and returns a pointer to the beginning of this space, corresponding to the start of the GPU's physical register space at the requested address. Hence the use of ioread32, since directly dereferencing these addresses is not recommended.
 
What do mmio_base and mmio_size correspond to? According to the envytools documentation, the MMIO register space lives in BAR 0 and is 16M in size. It is a 32-bit, non-prefetchable memory region.
 
So, mmio_base corresponds to the starting address of BAR 0, and mmio_size is 16M. This can also be verified using lspci.
 

# lspci -d 10de: -s .0 -vv
02:00.0 VGA compatible controller: NVIDIA Corporation GP104 [GeForce GTX 1080] (rev a1) (prog-if 00 [VGA controller])
Subsystem: ZOTAC International (MCO) Ltd. GP104 [GeForce GTX 1080]
Control: I/O+ Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Interrupt: pin A routed to IRQ 26
NUMA node: 0
IOMMU group: 45
//   BAR 0         mmio_base                                mmio_size
//     v               v                                       v
Region 0: Memory at c6000000 (32-bit, non-prefetchable) [size=16M]
Region 1: Memory at 27fe0000000 (64-bit, prefetchable) [size=256M]
Region 3: Memory at 27ff0000000 (64-bit, prefetchable) [size=32M]
Region 5: I/O ports at 6000 [size=128]
Expansion ROM at c7000000 [disabled] [size=512K]
[...]
Kernel driver in use: vfio-pci
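
As an extra cross-check, the same information can be read from sysfs; here is a minimal sketch (the path assumes the device sits at 0000:02:00.0, as above):

/* Sketch: read BAR 0's start/end/flags from sysfs.
 * Assumes the GPU is at PCI address 0000:02:00.0. */
#include <stdio.h>

int main(void) {
  unsigned long long start, end, flags;
  FILE *f = fopen("/sys/bus/pci/devices/0000:02:00.0/resource", "r");
  if (!f)
    return 1;
  /* The first line of "resource" describes region 0 (BAR 0). */
  if (fscanf(f, "%llx %llx %llx", &start, &end, &flags) == 3)
    printf("BAR 0: %#llx-%#llx (size %lluM)\n",
           start, end, (end - start + 1) >> 20);
  fclose(f);
  return 0;
}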

In practice

Now we need to write a small program that accesses my GPU via vfio, reads the sensor value at the offset retrieved earlier, and then applies the masks to extract the GPU temperature.

The code is quite basic and comes from the kernel documentation about vfio. In this example, I want to access the device 02:00.0, which lives in IOMMU group 45 (as seen in the lspci output above).

#include <linux/vfio.h>
#include <sys/ioctl.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <stdint.h>
#include <stdio.h>
#include <errno.h>

int main() {
  int container, group, device;
  struct vfio_group_status group_status =
    { .argsz = sizeof(group_status) };
  struct vfio_iommu_type1_info iommu_info = { .argsz = sizeof(iommu_info) };
  struct vfio_iommu_type1_dma_map dma_map = { .argsz = sizeof(dma_map) };
  struct vfio_device_info device_info = { .argsz = sizeof(device_info) };

  /* Create a new container */
  container = open("/dev/vfio/vfio", O_RDWR);

  if (ioctl(container, VFIO_GET_API_VERSION) != VFIO_API_VERSION)
    return -1; /* Unknown API version */

  if (!ioctl(container, VFIO_CHECK_EXTENSION, VFIO_TYPE1_IOMMU))
    return -1; /* Doesn't support the IOMMU driver we want */

  /* Open the group */
  group = open("/dev/vfio/45", O_RDONLY);
  if (group < 0)
    return -1;

  /* Test that the group is viable and available */
  ioctl(group, VFIO_GROUP_GET_STATUS, &group_status);

  if (!(group_status.flags & VFIO_GROUP_FLAGS_VIABLE))
    return -1; /* Group is not viable (ie, not all devices bound for vfio) */

  /* Add the group to the container */
  ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);

  /* Enable the IOMMU model we want */
  ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU);

  /* Get additional IOMMU info */
  ioctl(container, VFIO_IOMMU_GET_INFO, &iommu_info);

  /* Allocate some space and set up a DMA mapping */
  dma_map.vaddr = (uintptr_t)mmap(0, 1024 * 1024, PROT_READ | PROT_WRITE,
                                  MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
  dma_map.size = 1024 * 1024;
  dma_map.iova = 0; /* 1MB starting at 0x0 from device view */
  dma_map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;

  ioctl(container, VFIO_IOMMU_MAP_DMA, &dma_map);

  /* Get a file descriptor for the device */
  device = ioctl(group, VFIO_GROUP_GET_DEVICE_FD, "0000:02:00.0");

  /* Test and setup the device */
  ioctl(device, VFIO_DEVICE_GET_INFO, &device_info);

  /* Working only on BAR 0 */
  struct vfio_region_info regs = {
    .argsz = sizeof(struct vfio_region_info),
    .index = 0
  };

  ioctl(device, VFIO_DEVICE_GET_REGION_INFO, &regs);

  /* Map BAR 0 read-only; regs.offset is 0 for region 0 */
  uint8_t *ptr = mmap(0, regs.size, PROT_READ, MAP_SHARED, device, regs.offset);

  /* Stolen from you know where ;) */
  uint32_t tsensor = *(volatile uint32_t *)(ptr + 0x020460);
  uint32_t inttemp = (tsensor & 0x0001fff8);

  if (tsensor & 0x40000000)
    printf("shadowed sensor\n");

  if (tsensor & 0x20000000)
    printf("temp %u\n", inttemp >> 8);

  /* Gratuitous device reset and go... */
  ioctl(device, VFIO_DEVICE_RESET);
  munmap(ptr, regs.size);

  return 0;
}
# gcc vfio.c && ./a.out
temp 28

Here we can see that the GPU is at 28°C (they are water-cooled).
Unfortunately, one last problem remains: if the GPU is already in use, for example by qemu, the open() call on the IOMMU group will fail with EBUSY because the group is already held.
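
Here is a minimal sketch of that failure mode (assuming the same group number, 45, as above; while the VM is running, open() reports "Device or resource busy"):

/* Sketch: trying to open an IOMMU group already held by qemu.
 * Assumes the same group number (45) as in the example above. */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>

int main(void) {
  int group = open("/dev/vfio/45", O_RDWR);
  if (group < 0) {
    /* While the VM runs: "open: Device or resource busy" */
    printf("open: %s\n", strerror(errno));
    return 1;
  }
  return 0;
}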
This time there is another solution: directly accessing guest physical memory from the QEMU monitor, using the xp command.
 
I start a VM that uses this GPU; then, using the BAR0 base address retrieved with info pci, we read the temperature register by adding the offset we found earlier:

(qemu) info pci
[...]
  Bus  2, device   0, function 0:
    VGA controller: PCI device 10de:1b80
      PCI subsystem 19da:1425
      IRQ 0, pin A
      BAR0: 32 bit memory at 0xc1000000 [0xc1ffffff].
      BAR1: 64 bit prefetchable memory at 0x1000000000 [0x100fffffff].
      BAR3: 64 bit prefetchable memory at 0x1010000000 [0x1011ffffff].
      BAR5: I/O at 0xb000 [0xb07f].
      BAR6: 32 bit memory at 0xffffffffffffffff [0x0007fffe].
[..]
# We read a 32-bit register at position 0xc1020460
(qemu) xp /1w (0xc1000000 + 0x020460)
00000000c1020460: 0x20001ca0
# We apply the shifts retrieved from the code in nouveau
In [1]: (0x20001ca0 & 0x0001fff8) >> 8
Out[1]: 28 <-- the temperature

Conclusion

With all this, it is possible to monitor the temperature of these GPUs very simply from the host using these two techniques, provided they don't interfere with the VM's startup: vfio when the GPU is free, the QEMU monitor once the VM is running.
 
However, it's not over yet: on some models, there are still other sensors to retrieve…
It was a pretty fun little project; maybe I'll try it on other NVIDIA GPU families if I can get my hands on them, or even on some AMD GPUs.