Ai/add mlx nic temp #786

Closed
D-G-Dimitrov wants to merge 3 commits into rbonghi:master from D-G-Dimitrov:ai/add-mlx-nic-temp

D-G-Dimitrov commented Jan 6, 2026

Summary by Sourcery

Add Mellanox NIC temperature monitoring support using MLNX_OFED utilities and integrate these readings into the existing temperature service.

New Features:

  • Detect Mellanox NICs via lspci and read their temperatures using the MLNX_OFED mget_temp utility.
  • Expose Mellanox NIC temperatures through the TemperatureService alongside existing board and hwmon sensors.

Enhancements:

  • Extend temperature reading logic to handle both file-based sensors and precomputed numeric millidegree values in a unified way.
  • Normalize Mellanox temperature values to Celsius in the status output and apply default max/crit thresholds for consistent alerting and color coding.

Documentation:

  • Add dedicated documentation describing Mellanox NIC temperature support, configuration, troubleshooting steps, and a high-level summary of related changes.

Tests:

  • Introduce standalone scripts to test Mellanox temperature detection and verify correct conversion, thresholds, and integration with the temperature service.

sourcery-ai bot commented Jan 6, 2026

Reviewer's Guide

Adds Mellanox NIC temperature support to jtop by introducing MLNX_OFED-based temperature discovery and reading, extending the temperature core to handle numeric millidegree values alongside sysfs paths, wiring Mellanox sensors into TemperatureService, and providing documentation and verification/test scripts for the new behavior.

Class diagram for updated temperature core with Mellanox support

classDiagram
    class TemperatureService {
        - dict _temperature
        + TemperatureService()
        + dict get_status()
    }

    class TemperatureUtils {
        + dict read_temperature(dict data)
        + dict get_hwmon_thermal_system(str root_dir)
        + dict get_mellanox_temperature()
    }

    class ExternalTools {
        <<utility>>
        + mget_temp
        + lspci
    }

    TemperatureService --> TemperatureUtils : uses
    TemperatureUtils --> ExternalTools : calls

    %% Details of behaviors
    class read_temperature_details {
        + handles_path_values
        + handles_numeric_millidegree_values
        + returns_celsius_values
    }

    class get_mellanox_temperature_details {
        + detect_mlnx_ofed_via_mget_temp
        + discover_devices_with_lspci
        + read_temp_with_mget_temp
        + sudo_fallback_on_failure
        + store_temp_in_millidegrees
        + build_mlx_sensor_keys
    }

    TemperatureUtils .. read_temperature_details
    TemperatureUtils .. get_mellanox_temperature_details

    class get_status_details {
        + handle_path_based_sensors
        + handle_numeric_mlx_sensors
        + convert_mlx_millidegrees_to_celsius
        + set_default_max_crit_for_mlx
        + mark_sensor_online_offline
    }

    TemperatureService .. get_status_details

Flow diagram for Mellanox NIC temperature detection and reading

flowchart TD
    A_start["Start get_mellanox_temperature"] --> B_check_mget["Run which mget_temp"]
    B_check_mget -->|mget_temp not found or error| Z_end_empty["Return empty temperature dict"]
    B_check_mget -->|mget_temp found| C_lspci["Run lspci -d 15b3: -D to list Mellanox devices"]

    C_lspci -->|no devices or error| Z_end_empty
    C_lspci -->|devices found| D_for_each["For each lspci device line"]

    D_for_each --> E_parse_line["Parse bus_addr and device_name"]
    E_parse_line --> F_filter["Is ConnectX or MT device?"]
    F_filter -->|no| D_for_each
    F_filter -->|yes| G_try_mget["Run mget_temp -d bus_addr (no sudo)"]

    G_try_mget -->|timeout or exception| D_for_each
    G_try_mget -->|returncode != 0| H_try_sudo["Run sudo mget_temp -d bus_addr"]
    G_try_mget -->|returncode == 0 and stdout| I_parse_temp["Parse temperature from stdout"]

    H_try_sudo -->|timeout or exception| D_for_each
    H_try_sudo -->|returncode != 0| D_for_each
    H_try_sudo -->|returncode == 0 and stdout| I_parse_temp

    I_parse_temp -->|parse error| D_for_each
    I_parse_temp -->|success| J_store["Store temperature[mlx_bus_addr] = { temp: celsius * 1000 }"]
    J_store --> D_for_each

    D_for_each -->|no more lines| K_return["Return Mellanox temperature dict"]

    K_return --> Z_end["End get_mellanox_temperature"]
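
The device-discovery steps in the flow above can be sketched as a small parser over `lspci -D` output. This is only an illustration: the helper name `parse_mellanox_lspci` is hypothetical, the exact filter used in the PR is not shown on this page (the diagram only says "ConnectX or MT"), and the sample lines are illustrative rather than captured output.

```python
def parse_mellanox_lspci(output):
    """Return (bus_addr, device_name) pairs for ConnectX/MT devices.

    `output` is the text printed by `lspci -d 15b3: -D`. With -D, every
    line starts with the full domain:bus:device.function address.
    """
    devices = []
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split the address from the rest of the description
        bus_addr, _, device_name = line.partition(' ')
        # Assumed filter: a plain substring match on the description
        if 'ConnectX' in device_name or 'MT' in device_name:
            devices.append((bus_addr, device_name))
    return devices


# Illustrative sample in the usual lspci layout (not real captured output)
sample = (
    "0000:03:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]\n"
    "0000:03:00.1 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]\n"
)
for bus_addr, device_name in parse_mellanox_lspci(sample):
    print(bus_addr, '->', device_name)
```

Keeping the parsing separate from the `subprocess` call makes it possible to exercise the filter without a Mellanox NIC present.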

Flow diagram for TemperatureService initialization and status with Mellanox sensors

flowchart TD
    A_init["TemperatureService.__init__"] --> B_init_dict["Initialize _temperature as empty dict"]
    B_init_dict --> C_virtual["Load virtual thermal zones"]
    C_virtual --> D_hwmon["Load hwmon thermal sensors via get_hwmon_thermal_system"]
    D_hwmon --> E_update_hwmon["Update _temperature with hwmon sensors"]
    E_update_hwmon --> F_mlx["Call get_mellanox_temperature"]
    F_mlx --> G_update_mlx["Update _temperature with Mellanox sensors"]
    G_update_mlx --> H_check_empty["If _temperature is empty, log warning"]
    H_check_empty --> I_sort["Sort sensors"]
    I_sort --> J_init_done["Initialization complete"]

    J_init_done --> K_get_status["TemperatureService.get_status"]
    K_get_status --> L_iter["For each sensor in _temperature"]

    L_iter --> M_is_mlx["sensor.get(temp) is numeric?"]
    M_is_mlx -->|yes Mellanox| N_convert_mlx["Convert millidegrees to Celsius"]
    N_convert_mlx --> O_defaults_mlx["Set max=84, crit=100"]
    O_defaults_mlx --> P_online_mlx["Set online = temp != TEMPERATURE_OFFLINE"]
    P_online_mlx --> Q_add_status_mlx["Add Mellanox sensor to status dict"]

    M_is_mlx -->|no path-based| R_read_path["Call read_temperature for sensor"]
    R_read_path --> S_online_path["Set online = temp != TEMPERATURE_OFFLINE"]
    S_online_path --> T_add_status_path["Add path-based sensor to status dict"]

    Q_add_status_mlx --> U_next["Next sensor"]
    T_add_status_path --> U_next
    U_next -->|more sensors| L_iter
    U_next -->|no more sensors| V_return_status["Return full status dict"]

File-Level Changes

Change: Extend temperature reading pipeline to support numeric millidegree values (e.g., Mellanox) in addition to sysfs file paths and to normalize all values to Celsius for consumers.
Details:
  • Update read_temperature() to accept mappings whose values may be either numeric millidegree readings or sysfs file paths, converting both to Celsius floats.
  • Adjust TemperatureService.get_status() to detect sensors whose temp entry is already a numeric millidegree value, convert to Celsius, and inject default max/crit thresholds, while keeping the existing path-based flow unchanged.
  • Ensure TEMPERATURE_OFFLINE handling and online flag computation remain consistent when using the new numeric sensor branch.
Files: jtop/core/temperature.py

Change: Introduce Mellanox NIC temperature discovery and collection using MLNX_OFED’s mget_temp and integrate discovered sensors into the existing temperature service.
Details:
  • Add get_mellanox_temperature() which checks for mget_temp, enumerates Mellanox PCI devices via lspci, runs mget_temp (preferring non-sudo, falling back to sudo) per device with timeouts and logging, and returns a mapping of synthetic mlx_* sensor IDs to millidegree temp values.
  • Call get_mellanox_temperature() from TemperatureService.__init__() and merge the resulting Mellanox sensors into the main temperature map alongside hwmon/thermal sensors.
  • Define a naming convention for Mellanox sensors based on PCI bus address (mlx_ followed by the bus address, with ':' and '.' replaced by '_') to keep keys stable and debuggable.
Files: jtop/core/temperature.py

Change: Document the Mellanox temperature issue, its fix, and the resulting behavior, and provide small helper scripts for manual verification and ad-hoc testing.
Details:
  • Add MELLANOX_FIX_SUMMARY.md describing the original regression, root cause, fix details (conversion to Celsius, default thresholds, sudo handling), and manual verification steps.
  • Add MELLANOX_TEMP_README.md documenting overall Mellanox NIC temperature support, usage patterns with and without MLNX_OFED, troubleshooting tips, and permission recommendations.
  • Add CHANGES_SUMMARY.md summarizing the implementation, data flow, sensor naming, error handling, compatibility, and testing strategy for Mellanox support.
  • Introduce verify_mellanox_fix.py to instantiate TemperatureService, inject a synthetic Mellanox-like millidegree reading, and assert conversion and thresholds via printed diagnostics.
  • Introduce test_mellanox_temp.py to call get_mellanox_temperature() directly and print detected Mellanox temperatures or explain expected reasons for absence.
Files: MELLANOX_FIX_SUMMARY.md, MELLANOX_TEMP_README.md, CHANGES_SUMMARY.md, verify_mellanox_fix.py, test_mellanox_temp.py
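
As a rough illustration of the numeric branch and sensor naming described in these changes, the sketch below normalizes a `temp` entry that is either a millidegree number or a sysfs path. It is not the PR's code: the helper names are hypothetical, the value of `TEMPERATURE_OFFLINE` is an assumption (the real constant is defined in jtop), and the 84/100 thresholds are the defaults named in the status flow diagram.

```python
TEMPERATURE_OFFLINE = -256  # assumed sentinel value; the real constant lives in jtop
MLX_DEFAULT_MAX = 84        # default thresholds named in the status flow diagram
MLX_DEFAULT_CRIT = 100


def mlx_sensor_key(bus_addr):
    """Build a stable sensor key from a PCI bus address."""
    return "mlx_" + bus_addr.replace(':', '_').replace('.', '_')


def read_value(value):
    """Convert a millidegree reading (number or sysfs path) to Celsius."""
    if isinstance(value, (int, float)):
        # Already a numeric millidegree reading, e.g. from mget_temp
        return value / 1000.0
    try:
        with open(value, 'r') as f:
            return float(f.read().strip()) / 1000.0
    except (OSError, ValueError):
        # Unreadable or malformed sensor file: report the sensor as offline
        return TEMPERATURE_OFFLINE


def mlx_status(sensor):
    """Build a status entry for a numeric (Mellanox-style) sensor."""
    temp = read_value(sensor['temp'])
    return {
        'temp': temp,
        'max': MLX_DEFAULT_MAX,
        'crit': MLX_DEFAULT_CRIT,
        'online': temp != TEMPERATURE_OFFLINE,
    }


key = mlx_sensor_key("0000:03:00.0")
status = {key: mlx_status({'temp': 45.0 * 1000.0})}
print(status)
```

Routing both kinds of value through one conversion function keeps the offline check identical for path-based and numeric sensors.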

sourcery-ai bot left a comment

Hey - I've found 4 issues, and left some high level feedback:

  • The new Mellanox integration relies on shelling out to which and to sudo mget_temp from within library code; consider using shutil.which for detection and avoiding sudo inside the service (e.g., document required permissions or group membership instead), so the temperature service remains usable in non-interactive and constrained environments.
  • Several new top-level helper files (verify_mellanox_fix.py, test_mellanox_temp.py, MELLANOX_FIX_SUMMARY.md, MELLANOX_TEMP_README.md, CHANGES_SUMMARY.md) are added; it may be cleaner to either move these under a dedicated tools/ or docs/ directory or keep only the minimal runtime-necessary pieces in the repo root to reduce clutter.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- The new Mellanox integration relies on shelling out to `which` and to `sudo mget_temp` from within library code; consider using `shutil.which` for detection and avoiding `sudo` inside the service (e.g., document required permissions or group membership instead), so the temperature service remains usable in non-interactive and constrained environments.
- Several new top-level helper files (`verify_mellanox_fix.py`, `test_mellanox_temp.py`, `MELLANOX_FIX_SUMMARY.md`, `MELLANOX_TEMP_README.md`, `CHANGES_SUMMARY.md`) are added; it may be cleaner to either move these under a dedicated `tools/` or `docs/` directory or keep only the minimal runtime-necessary pieces in the repo root to reduce clutter.

## Individual Comments

### Comment 1
<location> `jtop/core/temperature.py:151-160` </location>
<code_context>
+                                    temp_result = None
+                                    try:
+                                        # Try without sudo first
+                                        temp_result = subprocess.run(
+                                            ['mget_temp', '-d', bus_addr],
+                                            capture_output=True,
+                                            text=True,
+                                            timeout=2
+                                        )
+                                        if temp_result.returncode != 0:
+                                            # If failed without sudo, try with sudo
+                                            temp_result = subprocess.run(
+                                                ['sudo', 'mget_temp', '-d', bus_addr],
+                                                capture_output=True,
+                                                text=True,
</code_context>

<issue_to_address>
**🚨 issue (security):** Invoking `sudo mget_temp` from a library context is risky and may hang on password prompts.

This can hang in non-interactive contexts if passwordless sudo isn’t configured, and it silently escalates privileges from a read-only-looking API. Please either remove the sudo fallback and document that `mget_temp` must be directly runnable, or make sudo use explicitly opt-in via configuration and disabled by default.
</issue_to_address>
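
One way to act on this comment, sketched under assumptions: detect the tool with `shutil.which` and make sudo strictly opt-in, using `sudo -n` so that a missing passwordless rule fails immediately instead of prompting. The function names and the `allow_sudo` parameter are hypothetical and not part of the PR.

```python
import shutil
import subprocess


def build_mget_temp_command(bus_addr, allow_sudo=False):
    """Return the argv for reading one device, or None if mget_temp is not on PATH."""
    mget_temp = shutil.which('mget_temp')
    if mget_temp is None:
        return None
    if allow_sudo:
        # -n makes sudo fail right away instead of waiting on a password prompt
        return ['sudo', '-n', mget_temp, '-d', bus_addr]
    return [mget_temp, '-d', bus_addr]


def read_mget_temp(bus_addr, allow_sudo=False, timeout=2):
    """Run mget_temp once; return its stripped stdout, or None on any failure."""
    command = build_mget_temp_command(bus_addr, allow_sudo)
    if command is None:
        return None
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout
        )
    except (subprocess.TimeoutExpired, OSError):
        return None
    if result.returncode != 0:
        return None
    return result.stdout.strip() or None
```

With `allow_sudo` defaulting to `False`, the read path never escalates privileges unless the caller asks for it, which matches the "opt-in via configuration and disabled by default" option above.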

### Comment 2
<location> `jtop/core/temperature.py:134` </location>
<code_context>
+            # Find all Mellanox devices
+            try:
+                # Get list of Mellanox devices
+                devices_result = subprocess.run(['lspci', '-d', '15b3:', '-D'], capture_output=True, text=True)
+                if devices_result.returncode == 0 and devices_result.stdout.strip():
+                    device_lines = devices_result.stdout.strip().split('\n')
</code_context>

<issue_to_address>
**suggestion (performance):** Running `lspci` on every service initialization may be expensive and lacks a timeout.

On some systems this subprocess can be relatively heavy, and without a timeout a hung `lspci` would block service startup. Consider adding a timeout to the call and/or caching the discovery so it only runs periodically instead of on every `TemperatureService` construction.

Suggested implementation:

```python
                # Get list of Mellanox devices
                try:
                    devices_result = subprocess.run(
                        ['lspci', '-d', '15b3:', '-D'],
                        capture_output=True,
                        text=True,
                        timeout=5,
                    )
                except subprocess.TimeoutExpired:
                    logger.warning("Timeout while running lspci to detect Mellanox devices")
                    devices_result = None

                if devices_result and devices_result.returncode == 0 and devices_result.stdout.strip():

```

To fully address the suggestion about avoiding running lspci on every service initialization, consider:
1. Introducing a module-level cache (e.g., _MELLANOX_DEVICES_CACHE and _MELLANOX_DEVICES_CACHE_TS) and a TTL.
2. Wrapping the lspci execution in a helper function (e.g., _get_mellanox_devices()) that returns cached results if they are still fresh, and only re-runs lspci when the cache has expired.
</issue_to_address>

### Comment 3
<location> `jtop/core/temperature.py:173-182` </location>
<code_context>
+                                        temp_value = temp_result.stdout.strip()
</code_context>

<issue_to_address>
**suggestion (bug_risk):** Parsing `mget_temp` output as a bare float may be brittle if the CLI adds labels or units.

This assumes `mget_temp` always returns a plain numeric string. If the CLI ever adds labels, units, or extra lines, `float(temp_result.stdout.strip())` will fail. Parsing just the first numeric token (e.g., via `split()` or a small regex) or restricting to the first line would make this more resilient to format changes.

Suggested implementation:

```python
                                    if temp_result.returncode == 0 and temp_result.stdout.strip():
                                        raw_output = temp_result.stdout.strip()
                                        # Use only the first line and extract the first numeric token to be resilient to format changes
                                        first_line = raw_output.splitlines()[0]
                                        match = re.search(r'([-+]?(?:\d+(?:\.\d*)?|\.\d+))', first_line)
                                        if not match:
                                            logger.warning(f"Could not find numeric temperature in mget_temp output for {bus_addr}: {first_line!r}")
                                            continue
                                        temp_value_str = match.group(1)
                                        try:
                                            temp_celsius = float(temp_value_str)
                                            # Create a virtual temperature file path for compatibility
                                            sensor_key = f"mlx_{bus_addr.replace(':', '_').replace('.', '_')}"
                                            temperature[sensor_key] = {
                                                'temp': temp_celsius * 1000.0  # Store in millidegrees for consistency
                                            }
                                            logger.info(f"Found Mellanox NIC temperature: {device_name} = {temp_celsius}°C")
                                        except ValueError:
                                            logger.warning(f"Could not parse temperature from mget_temp for {bus_addr}: {temp_value_str!r}")

```

1. Add `import re` near the top of `jtop/core/temperature.py` alongside the other imports, for example:

<<<<<<< SEARCH
import logging
=======
import logging
import re
>>>>>>> REPLACE

Adjust the SEARCH line (`import logging`) to match an existing import in your file if it differs.
</issue_to_address>

### Comment 4
<location> `CHANGES_SUMMARY.md:17` </location>
<code_context>
+**Key Features:**
+- Detects MLNX_OFED installation by checking for `mget_temp`
+- Uses `lspci -d 15b3:` to find Mellanox devices
+- Calls `sudo mget_temp -d <device>` for each NIC
+- Stores temperatures in millidegrees for consistency
+- Comprehensive error handling and logging
</code_context>

<issue_to_address>
**issue:** Align this description of `mget_temp` invocation with the documented sudo fallback behavior.

This line still describes always calling `sudo mget_temp -d <device>`, but the implementation now runs `mget_temp` first and only falls back to sudo on failure. Please update this bullet to reflect the new fallback behavior so it matches the actual code and the rest of the documentation.
</issue_to_address>
