
[Nvidia GPU] New Integration for Nvidia GPU Monitoring #11930

Open
strawgate opened this issue Nov 30, 2024 · 0 comments · May be fixed by #12581
Labels
New Integration Issue or pull request for creating a new integration package.

Comments

@strawgate
Contributor

Elastic has previously published blog posts on monitoring Nvidia GPU utilization, temperature, power consumption, and so on via Nvidia's DCGM exporter, which exposes a Prometheus metrics endpoint. There was also an earlier eBay project, NvidiaGPUBeat, that monitored Nvidia GPUs via the nvidia-smi command-line tool.

Nvidia GPUs are critical for organizations running training and inference. As organizations adopt large language models internally, they can either use third-party products like OpenAI or host their own models and run inference on their own GPUs.

An integration via Elastic Agent would allow large K8s deployments (as well as Linux and Windows farms) to monitor GPU utilization and power consumption, and to report on GPU throttling and errors.

We should make this available via an integration, likely built on the Prometheus metrics endpoint.
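
For illustration only, here is a rough Python sketch of the kind of data such an integration would consume from a Prometheus-style endpoint. It parses a small hand-written sample in the Prometheus text exposition format. The metric names follow dcgm-exporter's naming, but the label set, hostname, and values are made up, and `parse_metrics` is a hypothetical helper, not part of any Elastic or Nvidia tooling.

```python
import re

# Hand-written sample in Prometheus text exposition format.
# Metric names follow dcgm-exporter naming; labels and values are invented.
SAMPLE = """\
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",Hostname="node-a"} 87
DCGM_FI_DEV_GPU_TEMP{gpu="0",Hostname="node-a"} 64
DCGM_FI_DEV_POWER_USAGE{gpu="0",Hostname="node-a"} 212.5
"""

# metric_name{label="value",...} value [timestamp]
LINE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples (hypothetical helper)."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and HELP/TYPE comment lines.
        if not line or line.startswith("#"):
            continue
        match = LINE_RE.match(line)
        if match is None:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


if __name__ == "__main__":
    for name, labels, value in parse_metrics(SAMPLE):
        print(f"gpu={labels.get('gpu')} {name}={value}")
```

In a real deployment the text would come from the exporter's metrics endpoint and Elastic Agent's existing Prometheus collection would presumably do the scraping and parsing; the sketch only shows the shape of the per-GPU samples involved.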

@strawgate strawgate added the New Integration Issue or pull request for creating a new integration package. label Nov 30, 2024
@strawgate strawgate linked a pull request Feb 4, 2025 that will close this issue
5 tasks