[NVIDIA GPU] Introduce Monitoring Integration #12581

Open
wants to merge 4,947 commits into
base: main


strawgate
Contributor

@strawgate strawgate commented Feb 4, 2025

Proposed commit message

Introduce NVIDIA GPU Monitoring Integration

Checklist

  • I have reviewed tips for building integrations and this pull request is aligned with them.
  • I have verified that all data streams collect metrics or logs.
  • I have added an entry to my package's changelog.yml file.
  • I have verified that Kibana version constraints are current according to guidelines.
  • I have verified that any added dashboard complies with Kibana's Dashboard good practices

Author's Checklist

How to test this PR locally

Deploy the NVIDIA DCGM exporter on a device with an NVIDIA GPU to get a Prometheus metrics endpoint that you can provide to the integration.

If you have Docker, this only requires:

docker run -d --gpus all --rm -p 9400:9400 nvcr.io/nvidia/k8s/dcgm-exporter:3.3.9-3.6.1-ubuntu22.04
curl localhost:9400/metrics

Configure the integration to point at the metrics endpoint of the host that has the GPU and runs the container, for example http://nvidiahost:9400/metrics

Some metrics are not enabled by default in the container; enabling all metrics requires some extra steps.
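
To check which metrics the exporter is currently exposing, the text returned by the metrics endpoint can be parsed. The following is a minimal sketch, not part of the integration: `parse_metrics` and the sample payload are illustrative assumptions, and the parser ignores edge cases of the Prometheus text format such as escaped quotes inside label values.

```python
import re

# Hypothetical sample payload in the Prometheus text exposition format.
# Values and labels are made up for illustration.
SAMPLE = """\
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",Hostname="nvidiahost"} 37
DCGM_FI_DEV_GPU_TEMP{gpu="0",Hostname="nvidiahost"} 54
"""

# metric_name{label="value",...} value
LINE_RE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="([^"]*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples for each sample line."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and HELP/TYPE comment lines.
        if not line or line.startswith("#"):
            continue
        match = LINE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


if __name__ == "__main__":
    for name in sorted({name for name, _, _ in parse_metrics(SAMPLE)}):
        print(name)
```

To run this against a live exporter instead of the sample, the payload could be fetched with `urllib.request.urlopen("http://localhost:9400/metrics").read().decode()` and passed to `parse_metrics`.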

Related issues

Fixes #11930

Screenshots

WIP:

[Two dashboard screenshots]

mrodm and others added 30 commits December 12, 2024 11:54
…tic#12074)

Enable the creation of issues for flaky tests in the daily builds triggered
using 9.0.0 as stack release.
…astic#12072)

* add SQS calls and S3 permissions in docs

* bump package version

* fix pr id

* add SQS GetQueueAttributes
sort permissions
Credential construction by the v3.21 alpine results in system test failures
with the error:

private key should be a PEM or plain PKCS1 or PKCS8; parse error: asn1:
structure error: tags don't match (16 vs {class:0 tag:13 length:45
isCompound:true}) {optional:false explicit:false application:false
private:false defaultValue:<nil> tag:<nil> stringType:0 timeType:0 set:false
omitEmpty:false} pkcs1PrivateKey @2

Pin alpine to v3.20 until the root of the issue is identified and fixed.
…n. (elastic#12092)

Qualys can send empty XML response body with 200 success status.
Handle this case as valid.
…2071)

* Fix broken links

* Update packages/google_workspace/_dev/build/docs/README.md

Co-authored-by: Krishna Chaitanya Reddy Burri <[email protected]>

* Fix tychon link

* Fix Lumos link

* Fix wiz link

* Remove link to vulnerability data stream

* Update wiz changelog and manifest

* Update bbot changelog and manifest

* Update cisco_duo changelog and manifest

* Update ti_cybersixgill changelog and manifest

* Update google_workspace changelog and manifest

* Update lumos changelog and manifest

* Update tychon changelog and manifest

* Update thycotic_ss changelog and manifest

* Update authentik changelog and manifest

* update google workspace readme

---------

Co-authored-by: Krishna Chaitanya Reddy Burri <[email protected]>
The source.ip field is never set, so this is redundant.
* Fix broken links

* Remove the link from the Application insights integration

* Update nats link as per shmsr suggestion

* Add link on Jolokia parameters

* Update citrix references for adc and waf

* Add more specific links for adc and waf
…elastic#12103)

* Added support for configurable retry options which was introduced in 8.16
… pipeline (elastic#12028)

* fix optional chaining in the replica_status data stream pipeline
…ic#12107)

Made with ❤️️ by updatecli

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
…tic#12033)

Include a new dynamic field for user_agent.version in pipeline tests
in order to accept versions values with a trailing dot.
* Initial draft of the o365_metrics package with the `outlook_activity` data stream.
Add docs about retrieving ISAC feeds for Custom Threat Intelligence integration
…ic#12082)

No dynamic mapping was being generated for
tines.audit_log.inputs.inputs.options.*, and this package uses the
tines.audit_log.inputs.inputs.options field directly, without having any
mapping for it or its sub-properties.

The workaround ensures that there is a mapping for
tines.audit_log.inputs.inputs.* that serves for
tines.audit_log.inputs.inputs.options as well as for its subobjects.

The configured dynamic mapping was not being generated due to some issue in
Fleet that we are investigating.

We detected this issue while refactoring field mappings tests in
elastic-package, more about this in elastic/elastic-package#2214[1].

[1]elastic/elastic-package#2214 (comment)

Co-authored-by: Dan Kortschak <[email protected]>
…ase (elastic#12079)

* bump CSPM templates URLs to use v8.17.0

* bump Asset Inv. templates URLs to use v8.17.0

* update versions (remove previews)

* fix YAML
Change property connection_string to be a secret like in the other integrations.
* Fix broken links

* Update changelog and manifest
…ic#12128)

Made with ❤️️ by updatecli

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
…0.108.0 (elastic#12131)

Bumps [github.com/elastic/elastic-package](https://github.com/elastic/elastic-package) from 0.107.2 to 0.108.0.
- [Release notes](https://github.com/elastic/elastic-package/releases)
- [Changelog](https://github.com/elastic/elastic-package/blob/main/.goreleaser.yml)
- [Commits](elastic/elastic-package@v0.107.2...v0.108.0)

---
updated-dependencies:
- dependency-name: github.com/elastic/elastic-package
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: Mario Rodriguez Molins <[email protected]>
Changes added:
- Add a limit parameter, that can be used to control the size of responses from TAXII servers (see https://docs.oasis-open.org/cti/taxii/v2.1/os/taxii-v2.1-os.html#_Toc31107517)
- To avoid fetching duplicate indicators every interval, now the response header X-Taxii-Date-Added-Last is stored in the cursor and used to populate the added_after parameter every iteration (see https://docs.oasis-open.org/cti/taxii/v2.1/os/taxii-v2.1-os.html#_Toc31107519)
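
The cursor behaviour described in the two points above can be sketched roughly as follows. This is an illustration only, not the integration's actual implementation: `poll_once`, `fetch_page`, and the in-memory fake server are hypothetical names, and only the `limit` and `added_after` parameters and the `X-Taxii-Date-Added-Last` header come from the commit message.

```python
def poll_once(fetch_page, cursor, limit):
    """Run one polling interval and advance the cursor.

    fetch_page is a hypothetical callable taking a dict of query
    parameters and returning (response_headers, objects).
    """
    params = {"limit": limit}
    # Reuse the timestamp stored in a previous interval, if any.
    if cursor.get("added_after"):
        params["added_after"] = cursor["added_after"]
    headers, objects = fetch_page(params)
    # Remember where this response ended so the next interval
    # does not fetch the same indicators again.
    last = headers.get("X-Taxii-Date-Added-Last")
    if last:
        cursor["added_after"] = last
    return objects


def make_fake_server(records):
    """Build an in-memory stand-in for a TAXII server (illustration only).

    records is a list of (date_added, object_id) sorted by date_added,
    with timestamps in one fixed ISO 8601 format so that string
    comparison matches chronological order.
    """
    def fetch_page(params):
        after = params.get("added_after", "")
        page = [r for r in records if r[0] > after][: params["limit"]]
        headers = {"X-Taxii-Date-Added-Last": page[-1][0]} if page else {}
        return headers, [obj for _, obj in page]
    return fetch_page
```

Calling `poll_once` repeatedly with the same `cursor` dict returns each record once, in pages of at most `limit` objects.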
* Update link

* Update changelog and manifest
elastic#11920)

This is enabled per data stream to allow tuning of behaviour.
…nt" tag to documents with event.kind set to "pipeline_error" (elastic#12108)

This manually replays the changes in elastic#12046.
taylor-swanson and others added 15 commits January 31, 2025 15:14
…ONN_TIMEDOUT (elastic#12556)

- Handle additional parsing cases for SSLVPN HTTPREQUEST and TCPCONN_TIMEDOUT events
Bump golang.org/x/net from 0.23.0 to 0.33.0 for mock service in
/packages/websocket/_dev/deploy/docker/websocket-mock-service.


Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Add handling of Check Point firewall session logs in accordance with the ECS
structure.

Session logs aggregate multiple connection logs from the same network
activity into a single event. The aggregation creates the following fields:

- creation_time: UNIX timestamp of the first connection in the session.
- last_hit_time: UNIX timestamp of the last recorded connection in the session.
- duration: Duration (in seconds) of the session.
- aggregated_log_count: Number of connection logs aggregated into the session.
- connection_count: Number of connections recorded in the session.
- update_count: Number of times the session was updated.

This commit will:

1. Interpret creation_time and last_hit_time as dates, storing them in the ECS
fields event.start and event.end, respectively.
2. Convert duration to nanoseconds, as per the ECS event.duration specification,
and store it in the event.duration field.
3. Ensure checkpoint.aggregated_log_count, checkpoint.connection_count, and
checkpoint.update_count are mapped to numeric types.

Note that `checkpoint.aggregated_log_count`, `checkpoint.connection_count`, and
`checkpoint.update_count` which were previously mapped dynamically as keyword
data types are now statically mapped as integer data types.

Closes elastic#11894
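
The three conversions listed in the commit message above can be illustrated with a short sketch. The real change lives in the package's ingest pipeline; the Python below only mirrors the arithmetic. `convert_session` is a hypothetical name, and the sketch assumes the raw values arrive as strings holding whole seconds.

```python
from datetime import datetime, timezone

NANOS_PER_SECOND = 1_000_000_000


def convert_session(fields):
    """Map raw session-log fields to the ECS-style fields named above."""

    def to_iso(unix_seconds):
        # Interpret a UNIX timestamp (seconds) as a UTC date.
        return datetime.fromtimestamp(int(unix_seconds), tz=timezone.utc).isoformat()

    event = {
        "event.start": to_iso(fields["creation_time"]),
        "event.end": to_iso(fields["last_hit_time"]),
        # ECS event.duration is expressed in nanoseconds.
        "event.duration": int(fields["duration"]) * NANOS_PER_SECOND,
    }
    # Counters are stored as numbers rather than keyword strings.
    for name in ("aggregated_log_count", "connection_count", "update_count"):
        event["checkpoint." + name] = int(fields[name])
    return event
```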
The current data flow for the fields changed here is NetworkMessageId[1] →
m365_defender.event.network.message_id → email.message_id and
InternetMessageId[1] → m365_defender.event.internet_message_id → email.local_id,
but the definition of email.message_id is that it represents the RFC5322
Message-ID[2], corresponding to the Defender InternetMessageId value, and
email.local_id[3] is the non-persistent identifier, reasonably corresponding to
the Defender NetworkMessageId value.

Also add m365_defender.event.internet_message_id to final remove processor.

[1]https://learn.microsoft.com/en-us/defender-xdr/advanced-hunting-emailevents-table#:~:text=NetworkMessageId,sending%20email%20system
[2]https://www.elastic.co/guide/en/ecs/current/ecs-email.html#field-email-message-id
[3]https://www.elastic.co/guide/en/ecs/current/ecs-email.html#field-email-local-id
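
Read together with the ECS definitions cited above, the commit message implies that the two identifiers should be swapped relative to the current flow. A minimal sketch of that corrected mapping, assuming a nested dict shaped like the `m365_defender.event` object (`map_email_ids` is a hypothetical helper, not code from the package):

```python
def map_email_ids(event):
    """Map Defender message identifiers to ECS email fields.

    event is assumed to look like the m365_defender.event object.
    """
    out = {}
    # InternetMessageId is the RFC 5322 Message-ID.
    if "internet_message_id" in event:
        out["email.message_id"] = event["internet_message_id"]
    # NetworkMessageId is the non-persistent, system-local identifier.
    network = event.get("network", {})
    if "message_id" in network:
        out["email.local_id"] = network["message_id"]
    return out
```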
Fix system tests for Custom TI and Tychon for 9.0
* update drives data stream

* update managed_volumes data stream

* update monitoring_jobs data stream

* update mssql_databases data stream

* update physical_hosts data stream

* update virtualmachines data stream

* update docs

* remove httpjson from manifest

* add changelog entry

* format

* update docs

* improve docs

* rename first to pageSize

* improve resource_timeout description

* remove count metric from managed_volumes

* make cluster and sla domain base fields

* improve pageSize description

* improve changelog

* change virtual machines data stream name

* update sample events and pipelines

* build docs

* run format

* fix virtual machines tag

* fix virtual machines sample event

* build docs
…c#12543)

sampling.tail.storage_limit is 0 by default in 9.0. See elastic/apm-server#15467.
As UI validation requires a unit (e.g. GB), set the apm integration's default storage limit to 0GB, which carries the same meaning.
…cs mappings (elastic#12568)

[elastic_agent] Add missing apm-server tail sampling monitoring metrics mappings

Tail-based sampling monitoring metrics were missed in the bugfix in elastic#10414
This commit updates the Kubernetes Container Logs documentation to
better explain that an input is always generated for every container.

It also fixes a broken link.
@strawgate strawgate requested a review from a team as a code owner February 4, 2025 04:13
@elasticmachine

elasticmachine commented Feb 4, 2025

💔 Build Failed


Quality Gate passed

Issues
0 New issues
0 Fixed issues
0 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
No data about Duplication

See analysis details on SonarQube

Development

Successfully merging this pull request may close these issues.

[Nvidia GPU] New Integration for Nvidia GPU Monitoring