Nvidia GPUs are critical for organizations running training and inference. As organizations adopt large language models internally, they have a choice between using third-party products like OpenAI and hosting their own models, running inference on their own GPUs.
An integration via Elastic Agent would allow large K8s deployments (as well as Linux and Windows farms) to monitor GPU utilization and power consumption, and to report on GPU throttling and errors.
We should make this available via an integration, likely by scraping a Prometheus metrics endpoint.
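To make the Prometheus angle concrete, here is a minimal sketch of what such an integration would consume. It assumes dcgm-exporter is running on its default port (9400) and exposes standard DCGM field names; the port, URL, and the exact set of metric names are assumptions, not something this issue specifies.

```python
# Sketch: scrape a dcgm-exporter Prometheus endpoint and pull out a few
# GPU metrics. Port 9400 and the DCGM field names below are assumed
# defaults; adjust for your deployment.
from urllib.request import urlopen

DCGM_URL = "http://localhost:9400/metrics"  # assumed dcgm-exporter default

WANTED = {
    "DCGM_FI_DEV_GPU_UTIL",     # GPU utilization (%)
    "DCGM_FI_DEV_POWER_USAGE",  # power draw (W)
    "DCGM_FI_DEV_GPU_TEMP",     # GPU temperature (C)
    "DCGM_FI_DEV_XID_ERRORS",   # last XID error code reported
}

def scrape(url: str = DCGM_URL) -> list[tuple[str, str, float]]:
    """Return (metric, labels, value) tuples for the metrics we care about."""
    samples = []
    with urlopen(url) as resp:
        for raw in resp.read().decode("utf-8").splitlines():
            if not raw or raw.startswith("#"):  # skip HELP/TYPE comments
                continue
            # Prometheus text exposition: name{labels} value [timestamp]
            if "{" in raw:
                name, rest = raw.split("{", 1)
                labels, _, tail = rest.partition("} ")
                value = float(tail.split()[0])
            else:
                parts = raw.split()
                name, labels, value = parts[0], "", float(parts[1])
            if name in WANTED:
                samples.append((name, labels, value))
    return samples

if __name__ == "__main__":
    for name, labels, value in scrape():
        print(f"{name} [{labels}] = {value}")
```

An actual integration would of course use Agent's existing Prometheus collection rather than hand-rolled parsing; the sketch just shows that the data is already in an easily consumable shape.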
Elastic has previously published blog posts on monitoring Nvidia GPU utilization, temperature, power consumption, etc. via Nvidia's DCGM exporter (which exposes a Prometheus metrics endpoint), and there was an earlier eBay project called NvidiaGPUBeat that monitored Nvidia GPUs via the nvidia-smi command line.
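For comparison, the nvidia-smi approach that NvidiaGPUBeat took boils down to shelling out and parsing CSV. A minimal sketch, assuming nvidia-smi is on the PATH; the query fields are standard `--query-gpu` properties:

```python
# Sketch of the nvidia-smi polling approach: query a few per-GPU fields
# as CSV and parse them into dicts, one per GPU.
import subprocess

FIELDS = ["index", "utilization.gpu", "power.draw", "temperature.gpu"]

def query_gpus() -> list[dict[str, str]]:
    """Return one dict per GPU with the requested nvidia-smi fields."""
    out = subprocess.run(
        ["nvidia-smi",
         f"--query-gpu={','.join(FIELDS)}",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    # Each line is one GPU, fields separated by ", "
    return [dict(zip(FIELDS, line.split(", ")))
            for line in out.strip().splitlines()]

if __name__ == "__main__":
    for gpu in query_gpus():
        print(gpu)
```

The DCGM/Prometheus route is probably the better fit for an Agent integration since it avoids per-host binary dependencies and works the same way on K8s DaemonSets and bare-metal farms.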